Building High-Quality AI Agents — A Comprehensive, Actionable Field Guide (Part 2)
A synthesis of hard-won lessons from Claude Code, OpenHands, SWE-agent, GoClaw, Nanobot, PicoClaw, and the emerging discipline of harness engineering. This is the guide we wish existed when we started building agents.
The goal: an AI agent that is fast, scalable, capable, reliable, efficient, and secure — not by accident, but by design.
How to read this guide
- Read top-to-bottom for the full mental model. Each section builds on the previous one.
- Skim the boxes if you only want the takeaways — every section ends with an Actionable rules box.
- Jump to Part 14 — The Build-Your-Own Roadmap if you already know the theory and want a sequenced plan.
- Bookmark Part 15 — Anti-Patterns for design reviews.
Table of Contents
- Part 0 — The Core Equation
- Part 1 — Mental Model: What an AI Agent Actually Is
- Part 2 — The Agent Loop (the Kernel)
- Part 3 — Tools: The Agent's Hands
- Part 4 — Context Engineering
- Part 5 — Memory (Long-Term Knowledge)
- Part 6 — Concurrency & Multi-Agent Patterns
- Part 7 — Reliability: Error Recovery, Stuck Detection, Autosubmit
- Part 8 — Security: Defense-in-Depth
- Part 9 — Multi-Tenancy from Day One
- Part 10 — Performance & Efficiency
- Part 11 — Provider Abstraction & Resilience
- Part 12 — Channels & Integration Surface
- Part 13 — Observability & Evaluation
- Part 14 — The Build-Your-Own Roadmap
- Part 15 — Anti-Patterns to Avoid
- Part 16 — Closing: The Harness Mindset
Parts 0–12 are covered in Part 1 of this guide: https://viblo.asia/p/building-high-quality-ai-agents-a-comprehensive-actionable-field-guide-part-1-ymJXDQ9rJkq
Part 13 — Observability & Evaluation
13.1 Trace everything
Three span types: `agent`, `llm_call`, `tool_call`. Wrap every LLM call in a span. Wrap every tool call in a span. The trace tree then mirrors the shape of the run.
| Detail | Value |
|---|---|
| Batch size | 100 spans |
| On batch failure | retry individually |
| Verbose mode | full input/output truncated at 50 KB |
| Span exporters | OpenTelemetry compatible |
13.2 Cost tracking from step 1
Every API response runs through a cost accumulator:
- Per-model usage in bootstrap state.
- Reports to OpenTelemetry.
- Recursively processes nested model calls (sub-agents, recall queries).
- Persists to project config on process exit.
- Restores on next session if persisted session ID matches.
Histograms use reservoir sampling (Algorithm R) with 1,024 entries to compute p50/p95/p99. Averages hide tail latency, and tail latency is what users feel.
Even in v0, instrument cost and latency. You cannot decide what to optimize by feel.
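The reservoir sampler is small enough to write from scratch. A sketch of Algorithm R with the 1,024-entry reservoir mentioned above (struct and method names are my own):

```go
package main

import (
	"fmt"
	"math/rand"
	"sort"
)

// Reservoir keeps a uniform random sample of up to k observations
// (Algorithm R), so percentiles stay cheap at any request volume.
type Reservoir struct {
	k    int
	n    int // total observations seen so far
	data []float64
	rng  *rand.Rand
}

func NewReservoir(k int, seed int64) *Reservoir {
	return &Reservoir{k: k, rng: rand.New(rand.NewSource(seed))}
}

func (r *Reservoir) Observe(v float64) {
	r.n++
	if len(r.data) < r.k {
		r.data = append(r.data, v) // fill phase: keep everything
		return
	}
	// Replacement phase: keep v with probability k/n — Algorithm R.
	if j := r.rng.Intn(r.n); j < r.k {
		r.data[j] = v
	}
}

// Percentile returns the p-th percentile (0–100) of the current sample.
func (r *Reservoir) Percentile(p float64) float64 {
	if len(r.data) == 0 {
		return 0
	}
	s := append([]float64(nil), r.data...)
	sort.Float64s(s)
	return s[int(p/100*float64(len(s)-1))]
}

func main() {
	res := NewReservoir(1024, 1)
	for i := 1; i <= 10000; i++ {
		res.Observe(float64(i)) // synthetic latencies, 1..10000 ms
	}
	fmt.Printf("p50=%.0f p95=%.0f p99=%.0f\n",
		res.Percentile(50), res.Percentile(95), res.Percentile(99))
}
```

With 1,024 slots the sample is uniform over the whole stream, so p99 reflects the true tail rather than just the last few minutes.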
13.3 Replayable trajectories
Every `step()` writes a `.traj` JSON file containing history, model output, observations, and costs. SWE-agent's `run-replay` re-executes any old run. The append-only event log is the source of truth.
Worth it just for debugging. When the agent does something weird at minute 47, you can rewind to any event and try a different model or prompt.
13.4 Eval taxonomies
Three layers of evaluation:
| Eval | What it measures |
|---|---|
| Single-step | Does one tool call work correctly? |
| Full-run | Does the complete task get solved? |
| Multi-turn | Does the agent handle evolving goals? |
13.5 Trace grading
Grade agent traces directly — especially helpful for multi-step tasks where the final output alone doesn't reveal process quality. Use a separate LLM as a judge with a clear rubric.
13.6 Skill-level evals
Measure whether a specific skill actually helps using:
- Bounded tasks — reproducible inputs.
- Deterministic verifiers — automated pass/fail.
- No-skill baseline — does the skill move the needle?
- Trace review — human spot-check of the failures.
13.7 Infrastructure noise
Runtime configuration can move coding benchmark scores by more than many leaderboard gaps.
Infrastructure choices may matter more than model intelligence. The same model with a better harness — better tools, better verification — lands a higher score.
13.8 Activity log for every admin action
Every admin write to global tables (settings, permissions, tool config) appends to an audit log: `{ tenant_id, actor_id, action, target, timestamp, ip }`. Cheap to write, invaluable when "who changed X?" comes up.
Actionable rules
- Spans on every LLM and tool call. Trace tree mirrors the run.
- Cost + reservoir-sampled latency from day one.
- Append-only event log = replayable trajectories.
- Eval at three layers: single-step, full-run, multi-turn. Trace-grade.
- Skill-level evals with no-skill baselines. If it doesn't move the needle, drop it.
- Audit log for every admin action.
Part 14 — The Build-Your-Own Roadmap
A pragmatic order to implement everything above. Each step compiles and runs on its own.
Milestone 0 — Foundation (1–2 days)
- Pick the language: Go for small/portable; Python for ML/research/speed.
- Pick the DB: PostgreSQL + pgvector if you ever want vector search.
- Skeleton: `cmd/`, `internal/`, `pkg/`, `migrations/`, `docs/`, `Makefile`, `docker-compose.yml`.
- Define the `Provider` interface (4 methods).
- Implement one provider — start with OpenAI-compatible (it covers Groq, DeepSeek, and Together for free).
- `cmd/serve` loads config, makes one HTTP request, prints the response.
Milestone 1 — Minimum Viable Agent Loop (1 week)
- Define the `Tool` interface: `name`, `description`, `schema`, `execute(ctx, args)`.
- Implement 3 tools: `read_file`, `write_file`, `list_files` — workspace-scoped, with a `resolvePath()` traversal guard.
- Build the loop: `for i := 0; i < 20; i++ { think; if no tools, break; act; observe }`.
- Persist sessions: a `SessionStore` interface + an in-memory implementation.
- Emit events via callback. Three only: `run.started`, `tool.call`, `run.completed`.
- HTTP endpoint `/v1/chat/completions` (OpenAI-compatible). One agent. No streaming yet.
You now have an LLM that can read and write files in a workspace.
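The traversal guard is the piece most worth getting right early. A minimal sketch — the function name matches the milestone's `resolvePath()`, the error message is illustrative:

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// resolvePath joins a model-supplied relative path onto the workspace
// root and rejects anything that escapes it. Every file tool should
// share this one guard rather than rolling its own checks.
func resolvePath(workspace, rel string) (string, error) {
	p := filepath.Join(workspace, rel) // Join also cleans ".." and "."
	root := filepath.Clean(workspace)
	if p != root && !strings.HasPrefix(p, root+string(filepath.Separator)) {
		return "", fmt.Errorf("path escapes workspace: %q", rel)
	}
	return p, nil
}

func main() {
	ok, _ := resolvePath("/ws", "notes/a.txt")
	fmt.Println(ok) // /ws/notes/a.txt
	_, err := resolvePath("/ws", "../../etc/passwd")
	fmt.Println(err != nil) // true — traversal blocked
}
```

The prefix check runs on the cleaned path, so `a/../../x` is caught just like a bare `../x`; a naive check on the raw string would miss it.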
Milestone 2 — System Prompt Architecture (3–4 days)
- Bootstrap files: `agent_context_files` (agent-level) + `user_context_files` (per-user). Six known files: SOUL, IDENTITY, AGENTS, TOOLS, BOOTSTRAP, USER.
- `ContextFileInterceptor` — when a tool reads or writes a known name, route it to the DB instead of disk.
- System prompt builder — assemble from sections. Persona early, persona reminder late.
- Two modes: `PromptFull` and `PromptMinimal`.
- Per-user file seeding on first chat.
Milestone 3 — Multi-Tenancy from the Start (3–4 days)
- `tenants` and `api_keys` tables. UUID v7 PKs.
- `tenant_id NOT NULL` on every table that holds tenant data.
- `WithTenantID(ctx)` / `TenantIDFromContext(ctx)` helpers.
- Resolve API key → SHA-256 lookup → set the tenant on ctx at the gateway.
- Update every store query to add `WHERE tenant_id = $N`. Audit the diff.
- Master tenant for legacy/single-user data; master scope guard for global writes.
Milestone 4 — Pipeline Refactor (1 week)
Once your loop has more than 3 conditional branches, split it:
- Define the `Stage` interface, `StageResult` enum, and `RunState` struct.
- Implement `ContextStage`, `ThinkStage`, `ToolStage`, `ObserveStage`, `CheckpointStage`, `FinalizeStage`. Add `PruneStage` later.
- `Pipeline.Run` orchestrates: setup → iteration loop → finalize.
- Feature flag (`pipeline_enabled`) so V2 (monolithic) and V3 (pipeline) coexist during migration.
Milestone 5 — Memory & Search (1–2 weeks)
- `memory_documents` + `memory_chunks` tables. `tsvector` (FTS) + `vector(1536)` columns.
- `MemoryInterceptor` — auto-chunks + embeds on `.md` writes inside `memory/*`.
- Hybrid search: `0.7 * vector + 0.3 * fts`. Per-user 1.2× boost. Dedup.
- `memory_search` and `memory_get` tools.
- Later: `episodic_summaries` + an `EpisodicWorker` subscribed to `run.completed`.
- Later: `kg_entities` + `kg_relations` with temporal validity for L2.
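The scoring and merge step after the two index queries is plain application code. A sketch of the `0.7 * vector + 0.3 * fts` blend with the 1.2× per-user boost and dedup (the `Hit` struct is illustrative; scores are assumed pre-normalized to [0, 1]):

```go
package main

import (
	"fmt"
	"sort"
)

// Hit carries the two retrieval signals for one chunk.
type Hit struct {
	DocID   string
	Vec     float64 // cosine similarity from the pgvector query
	FTS     float64 // normalized rank from the tsvector query
	PerUser bool    // chunk lives in the requesting user's files
}

// hybridScore implements 0.7*vector + 0.3*fts with the 1.2x boost.
func hybridScore(h Hit) float64 {
	s := 0.7*h.Vec + 0.3*h.FTS
	if h.PerUser {
		s *= 1.2
	}
	return s
}

// mergeHits dedups by DocID (keeping the best-scoring copy) and
// sorts best-first — the step after running both index queries.
func mergeHits(hits []Hit) []Hit {
	best := map[string]Hit{}
	for _, h := range hits {
		if prev, ok := best[h.DocID]; !ok || hybridScore(h) > hybridScore(prev) {
			best[h.DocID] = h
		}
	}
	out := make([]Hit, 0, len(best))
	for _, h := range best {
		out = append(out, h)
	}
	sort.Slice(out, func(i, j int) bool {
		return hybridScore(out[i]) > hybridScore(out[j])
	})
	return out
}

func main() {
	hits := []Hit{
		{DocID: "a", Vec: 0.9, FTS: 0.1},
		{DocID: "a", Vec: 0.9, FTS: 0.1}, // duplicate from the FTS query
		{DocID: "b", Vec: 0.5, FTS: 1.0, PerUser: true},
	}
	for _, h := range mergeHits(hits) {
		fmt.Printf("%s %.2f\n", h.DocID, hybridScore(h))
	}
}
```

Note how the boost can promote a weaker vector match: "b" wins here on the strength of its FTS rank plus the per-user multiplier.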
Milestone 6 — Tool Registry Hardening (1 week)
- Funnel every tool call through `Registry.ExecuteWithContext`.
- Token-bucket rate limiting per session key (defaults: 60/min, burst 5).
- Credential scrubber — start with 5–10 high-value patterns.
- Policy engine: profiles (`full` / `coding` / `messaging` / `minimal`), groups, allow/deny lists.
- Shell deny groups (start with `destructive_ops`, `reverse_shell`, `dangerous_paths`, `package_install`).
- Capability metadata on every tool.
Milestone 7 — Channels (per channel, ~2 days each)
- Define the `Channel` interface: `Listen(ctx, onMessage)`, `Send(ctx, OutboundMessage) error`.
- Telegram first (simplest, long-polling).
- `channel_instances` table with `tenant_id` baked in.
- Outbound dispatcher routes by `channel_instance_id`.
- Pairing flow: 8-char code, 60-min TTL.
- Then: Discord, Slack, WhatsApp, Feishu, Zalo.
๐ Milestone 8 โ Observability (3โ4 days)
tracesandspanstables. Three span types.- Wrap every LLM call and tool call in a span.
BatchCreateSpansin batches of 100; on failure, retry individually.- Verbose mode (
TRACE_VERBOSE=1) for full input/output, truncated at 50 KB. - Optional: OpenTelemetry exporter.
Milestone 9 — Resilience (3–4 days)
- Wrap providers with retry middleware.
- Per-model cooldown.
- Failover chain.
- Mid-loop compaction at 75%; post-run at 50 messages or 75%.
- Per-session `TryLock` for the compaction goroutine.
- Stuck detector (5 patterns, semantic comparison).
- Autosubmit on every fatal error path.
Milestone 10 — Multi-Agent (1–2 weeks)
- `subagent` table. Limits: depth 1, max 5 children, max 8 concurrent.
- `spawn` tool (async return), `delegate` tool (sync with timeout).
- `agent_links` table for delegation eligibility.
- When ready: `teams`, `agent_team_members`, `team_tasks`, `team_messages`.
- Atomic task claim: `UPDATE … WHERE status = 'pending' AND owner_agent_id IS NULL`.
Milestone 11 — Production Hardening (ongoing)
- Add the remaining 4 security layers (input guard, output sanitizer, isolation).
- AES-256-GCM encryption for all at-rest secrets. `aes-gcm:` prefix convention.
- API keys: 16 random bytes, SHA-256 hash, constant-time compare.
- Activity log for every admin action.
- Hourly snapshot aggregations.
- Per-tenant config UI.
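The API-key recipe from the bullet above fits in two functions. A sketch using Go's standard library (function names are illustrative):

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
)

// newAPIKey returns the plaintext key (show it once, then forget it)
// and the SHA-256 hash, which is the only thing the DB ever stores.
func newAPIKey() (plaintext, hash string, err error) {
	raw := make([]byte, 16) // 16 random bytes, per the milestone
	if _, err = rand.Read(raw); err != nil {
		return "", "", err
	}
	plaintext = hex.EncodeToString(raw)
	sum := sha256.Sum256([]byte(plaintext))
	return plaintext, hex.EncodeToString(sum[:]), nil
}

// verifyAPIKey hashes the presented key and compares in constant
// time, so response timing leaks nothing about the stored hash.
func verifyAPIKey(presented, storedHash string) bool {
	sum := sha256.Sum256([]byte(presented))
	return subtle.ConstantTimeCompare(
		[]byte(hex.EncodeToString(sum[:])),
		[]byte(storedHash),
	) == 1
}

func main() {
	key, hash, _ := newAPIKey()
	fmt.Println(verifyAPIKey(key, hash))     // true
	fmt.Println(verifyAPIKey("wrong", hash)) // false
}
```

Because only the hash is stored, the lookup at the gateway is a plain indexed `WHERE key_hash = $1` — no decryption, and a leaked `api_keys` table yields nothing usable.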
Milestone 12 — Optional Surface Area
- Knowledge Vault with wikilinks (`[[target]]`).
- MCP bridge (stdio + SSE + streamable-http transports, per-agent + per-user grants).
- Custom shell tools (DB-stored, hot-reloaded).
- Cron jobs.
- Browser automation (headless Chrome).
Save for last (don't build until Milestone 12)
- Fork agents (cache-driven sub-agents)
- Swarm teams
- Remote tasks across machines
- KAIROS continuous-mode logs
- Auto-mode permission classifier
- Renderer optimization (cell-diffing, BSU/ESU)
- Bitmap search index for huge filesystems
Part 15 — Anti-Patterns to Avoid
Each row is a trap that has burned multiple production teams.
Loop / control flow
| Anti-pattern | Why it's a trap | What to do instead |
|---|---|---|
| Callbacks or event emitters for the agent loop | You'll reinvent backpressure poorly | `async function*` (or channels) |
| A single `error` terminal state | Loses the information about why | Encode 10+ specific reasons in a discriminated union |
| Stop-hooks on error responses | Creates error → hook blocks → retry → error infinite loops | Skip them on errors |
| Forgetting to pair `tool_use` with `tool_result` on abort | The API rejects the next message | Drain queued tools with synthetic results on every cancel path |
| Trusting the model's tool-call format | Models hallucinate `<tool_call>` XML and `[Tool Call: ...]` text | A 7-step output sanitizer strips them all |
| One giant `runLoop()` function | 2k-line functions become untestable | 8-stage pipeline; each stage isolated |
Tools
| Anti-pattern | Why it's a trap | What to do instead |
|---|---|---|
| Constructor literal instead of factory | Defaults will be unsafe | Always go through `buildTool()` |
| Per-tool-type concurrency safety | Bash is sometimes safe, sometimes not | Pass the parsed input to a per-invocation safety check |
| Concatenating built-ins and MCP tools, then sorting flat | The cache breakpoint dies | Sort within each partition, then concat |
| Returning huge raw output | Context blows up | Cap with `maxResultSizeChars`; persist to disk + return a preview |
| Using the SDK's `BetaMessageStream` | O(n²) JSON re-parsing | Read raw stream events |
| Bypassing the tool registry "just for this one call" | Loses scrubbing, rate limits, RBAC | Every tool call goes through the registry, no exceptions |
| Reusing the human shell (`cat`, `grep -rn`) | Bad agent tools — too much output, no error story | Build agent-shaped commands with bounded output |
| Free-form `sed -i` edits | Frequent syntactic collapses | Line-range edit with lint + auto-rollback |
Permissions
| Anti-pattern | Why it's a trap | What to do instead |
|---|---|---|
| Scattering `if mode === ...` checks throughout tool code | Untestable; drifts | Centralize in modes + a resolution chain |
| Trusting a partial bash parse | Bypassable | If `parseForSecurity()` fails, treat the command as unsafe |
| Sub-agent default = `default` mode | Needs a UI to prompt; background agents have none | Default to `bubble` (sync) or `dontAsk` (async) |
Caching / API
| Anti-pattern | Why it's a trap | What to do instead |
|---|---|---|
| Runtime conditionals in the static prompt prefix | Each one doubles the cache key space | Move them below the dynamic boundary |
| Mid-session feature toggles that change request headers | Bust the cache | Use sticky latches |
| Reserving 64K output tokens by default | Over-reserves 8–16× | Cap at 8K; escalate on demand |
| Regenerating the system prompt for fork children | Feature flags or the session date may have moved | Pass the parent's bytes |
| Filtering tools per child agent in fork mode | Different array → different cache key | `useExactTools: true` and runtime guards |
Memory
| Anti-pattern | Why it's a trap | What to do instead |
|---|---|---|
| Storing what `git log` can answer | Useless duplication that goes stale | Derivability test: if git/code can answer it, don't memorize it |
| Embedding-only retrieval | Misses negation ("do NOT mock") | LLM recall over a manifest, hybrid with FTS |
| Hard expiration | Stale memories are still data | Annotate with age; let the model decide |
| Letting MEMORY.md grow past 200 lines | It gets truncated silently | Treat the index as a budget |
Multi-agent
| Anti-pattern | Why it's a trap | What to do instead |
|---|---|---|
| Coordinators with the full tool set | They'll do the work themselves | Restrict to `Agent`, `SendMessage`, `TaskStop` |
| Workers asked to "based on the research, implement X" | They re-derive context, miss specifics, hallucinate paths | Synthesis is the coordinator's job; give exact paths/lines |
| Mid-tool-execution message delivery | Race conditions | Queue at tool-round boundaries |
| Unbounded teammate state | 36.8 GB across 292 agents was a real incident | Cap message history |
| General-purpose agents that can spawn `Agent` | Exponential fan-out | Block recursive spawning at the schema level |
Multi-tenancy
| Anti-pattern | Why it's a trap | What to do instead |
|---|---|---|
| Single-tenant first, "we'll add it later" | Migration is brutal — every query, test, cache key | `tenant_id NOT NULL` on day one |
| Trusting a client-supplied `tenant_id` header | Spoofable; cross-tenant leakage | Resolve the tenant from the API key at the gateway |
Bootstrap / hooks
| Anti-pattern | Why it's a trap | What to do instead |
|---|---|---|
| Loading the world for `--version` | Slow startup | Fast-path dispatch first |
| Hook config that updates live mid-session | Lets a malicious repo redefine permissions after the trust dialog | Snapshot at startup; update only via an explicit user channel |
| Treating MCP skills like local skills | They are content-only | Never execute their inline shell commands |
Provider / API
| Anti-pattern | Why it's a trap | What to do instead |
|---|---|---|
| Hard-coding one LLM provider | You'll need 5 within a year | `Provider` interface + adapters |
| Storing secrets unencrypted because "it's the same DB" | Database dumps leak; an insider widens the blast radius | AES-256-GCM with the `aes-gcm:` prefix |
| `time.Sleep` between LLM retries | Wastes time and cost; thundering herd | Exponential backoff with jitter; honor `Retry-After` |
| Distributed lock for "claim this task" | Adds Redis/ZooKeeper; race conditions are still possible | Atomic SQL `UPDATE` with `WHERE status = 'pending'` |
| Loading the full agent config on every request | Slow; chatty | Router cache with TTL + pub/sub invalidation |
| Synchronous summarization on the request path | The user waits 10+ seconds | Synchronous flush, asynchronous summarize |
| Letting the agent self-modify its prompts unguarded | One bad cycle and quality craters | Suggestion engine + admin approval + `rollback_on_drop_pct` |
Part 16 — Closing: The Harness Mindset
Three closing observations distilled from every source.
1. Push complexity to the boundaries
Permission resolution, protocol translation, state reconciliation, tool I/O — these are the messy edges. Concentrate the mess there. Keep the loop, the tool composition, the memory recall, and the streaming logic clean and exhaustively typed.
2. The agent is a function from event history to next event, run in a loop
Everything else is a hook into that one loop:
- "Function" → the stateless `Agent`.
- "Event history" → the append-only `EventLog`.
- "Next event" → an `Action`, executed by the `Workspace`, producing an `Observation`.
- "Run in a loop" → the `Conversation`, until `Finish` or stuck.
There is no big design. There is one tight kernel and a lot of small components hanging off it.
3. Iterate on failures
The single most important cultural practice from the harness-engineering discipline:
Anytime an agent makes a mistake, engineer a solution so it never makes that mistake again.
Ship first. Add configuration reactively. Throw away what doesn't help. Distribute battle-tested configurations. Treat technical debt as a high-interest loan.
After many production incidents the pattern is the same:
- "GPT-6 will fix it" — almost always wrong.
- "It's a configuration problem" — almost always right.
The fix is in your harness — context management, tool selection, verification loops, handoff artifacts, prompt reinforcement zones, hook ordering, error ladders.
The shortest possible recipe
If you only build six things well, you have a great agent:
- An async-generator loop with typed terminal states and a continue-state ladder for recovery.
- A self-describing tool registry with per-invocation safety, the 14-step pipeline, and bounded output.
- A 4-layer context compression pipeline that preserves the prompt cache architecture.
- File-based memory with an always-loaded index + an LLM recall side-query.
- Defense-in-depth security with five independent layers.
- Multi-tenancy on day one — `tenant_id NOT NULL` everywhere.
Build those, and you've shipped a real agent. The rest of this guide is layering and polish.
Appendix — Source Map
| Source | Lessons learned |
|---|---|
| Claude Code (from-source guide) | Async-generator loop, prompt cache as architecture, fork agents, file-based memory, hooks, 4-layer compression |
| OpenHands | CodeAct (code as universal action), append-only event log, Workspace abstraction, Skills/microagents, stuck detection, risk-aware confirmation |
| SWE-agent | The Agent-Computer Interface thesis, line-bounded edit + lint + rollback, autosubmit on error, cost budget |
| GoClaw | Multi-tenancy from day one, 8-stage pipeline, 3-tier memory, 5-layer security, channel adapters, provider resilience stack |
| Nanobot | Bus-based decoupling, per-session lock + pending queue, files + git for memory, progressive skill loading |
| PicoClaw | Lean Go runtime, capability-based polymorphism, JSONL persistence with sidecar metadata, 64-shard mutex, cheap-first routing, JSON-RPC stdio hooks |
| Harness Engineering | Agent = Model + Harness; feedforward + feedback control; sensors/guides; sub-agents as context firewalls; iterate on failures |
"It's not a model problem. It's a configuration problem." — every team, after enough incidents.
If you found this helpful, let me know with an upvote or a comment — and if you think this post could help someone, feel free to share it. Thank you very much!
All Rights Reserved