
The Architecture Nobody Else Can Design

Multi-agent systems, orchestration, failure recovery at scale. When to use them, when to avoid them, and how to build the ones that actually work.

March 27, 2026 | 14 min read

The gap between senior and staff

You’ve shipped AI features. You can build a RAG pipeline, write evals, and put a model behind a production harness. So can an increasing number of senior engineers. The question is: what can you do that they can’t?

The answer is systems-level AI architecture — designing how multiple components (agents, tools, data stores, verification layers) coordinate to solve problems no single prompt can handle. Sometimes that’s a multi-agent system. Sometimes it’s a well-designed single agent with strong tooling. The staff-level skill isn’t reaching for the complex pattern. It’s knowing when complexity is warranted and when it isn’t.

When NOT to use multi-agent

This section should come first, because the strongest signal of architectural maturity is restraint.

Don’t use multi-agent when a single agent with tools will do. A support agent that looks up orders, checks policies, and generates responses doesn’t need three agents. It needs one agent with three tools. Multi-agent adds latency (agent-to-agent communication), cost (every agent invocation is a separate LLM call), and failure surface (more agents = more things that can break). If the task decomposition is static and the steps are well-defined, a single agent with structured tool access is simpler, cheaper, and more reliable.

Don’t use multi-agent when a deterministic pipeline will do. If every document goes through the same five steps — extract, classify, enrich, validate, route — that’s a DAG, not an agent system. The steps don’t require planning or decision-making. Build it as a pipeline with LLM calls at specific steps. Pipelines are predictable, debuggable, and don’t hallucinate their own task plans.
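A minimal sketch of that five-step document pipeline as a fixed function chain — the step functions and field names are hypothetical stand-ins (in practice, extract and classify would wrap LLM or OCR calls), but the point is structural: the order is hard-coded, so nothing can hallucinate a task plan.

```python
# Hypothetical five-step document pipeline: a fixed chain of steps,
# some of which would wrap LLM calls in a real system.

def extract(doc: str) -> dict:
    # Stand-in for an OCR/LLM extraction call.
    return {"text": doc, "fields": {}}

def classify(record: dict) -> dict:
    record["category"] = "invoice" if "invoice" in record["text"].lower() else "other"
    return record

def enrich(record: dict) -> dict:
    record["fields"]["source"] = "upload"   # e.g. join in metadata
    return record

def validate(record: dict) -> dict:
    record["valid"] = "category" in record
    return record

def route(record: dict) -> dict:
    record["queue"] = "billing" if record["category"] == "invoice" else "general"
    return record

PIPELINE = [extract, classify, enrich, validate, route]

def run(doc: str) -> dict:
    record = doc
    for step in PIPELINE:
        record = step(record)   # fixed order — no planning, no agent decisions
    return record
```

Because the topology is static, every failure is attributable to a specific step, and the whole thing can be tested like ordinary code.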

Don’t use multi-agent when error tolerance is low and verification is hard. Each additional agent is a source of probabilistic error. In a 5-agent pipeline where each agent is 95% correct, the system-level accuracy is 0.95^5 ≈ 77%. If your use case requires 99%+ accuracy, multi-agent makes it harder to achieve, not easier — unless you design explicit verification layers (which add cost and latency).
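The compounding arithmetic is worth internalizing — accuracy multiplies across sequential probabilistic stages:

```python
# System accuracy across independent sequential stages is the product
# of per-stage accuracies.
def system_accuracy(per_stage: float, n_stages: int) -> float:
    return per_stage ** n_stages

print(round(system_accuracy(0.95, 5), 2))   # 0.77 — the 77% above
# To hold 99% end-to-end over 5 stages, each stage needs ~99.8%:
print(round(0.99 ** (1 / 5), 4))            # 0.998
```

Run it in reverse and the conclusion is stark: a 99% end-to-end target leaves each of five stages a budget of roughly two errors per thousand.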

Don’t use multi-agent when you can’t observe it. If you don’t have tracing, per-agent eval, and cost attribution, a multi-agent system is a black box that sometimes produces wrong answers and you can’t figure out why. Build the observability layer before you build the second agent.

Use multi-agent when: The task requires dynamic planning (the next step depends on what was discovered in the previous step), specialized reasoning (different steps require fundamentally different expertise that can’t fit in one context window), or independent validation (a separate agent verifies another agent’s work against ground truth, not against the first agent’s reasoning).

Agent architecture patterns

Three patterns dominate production multi-agent systems. Each has distinct tradeoffs — and specific failure modes.

Orchestrator-worker. A central orchestrator decomposes the task and delegates to specialized workers. Most common pattern — the orchestrator is the tech lead, workers are specialists.

Failure mode: The orchestrator is a single point of failure. If the plan is wrong, everything downstream is wrong. And orchestrator hallucination is systemic failure — the orchestrator confidently generates a plan that sounds reasonable but misses a critical step, and every worker executes perfectly on a flawed foundation.

Safeguards: Validate the plan before execution begins — check it against a task schema, verify all required steps are present, confirm dependencies are ordered correctly. Set bounded execution: maximum steps, maximum token budget, maximum wall-clock time. Build a fallback path for when the orchestrator can’t produce a valid plan (escalate to human, fall back to a fixed workflow, or reject the task).
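The plan-validation and bounded-execution safeguards can be sketched as follows. The step names, schema, and budget numbers are illustrative, not a specific framework's API — the shape is what matters: check the plan against a required-steps set and a dependency ordering before any worker runs.

```python
# Illustrative pre-execution plan validation. REQUIRED_STEPS and
# DEPENDENCIES stand in for a real task schema.
from dataclasses import dataclass

REQUIRED_STEPS = {"extract", "cross_reference", "fraud_screen", "calculate", "draft"}
DEPENDENCIES = {
    "calculate": {"extract", "cross_reference"},
    "draft": {"calculate"},
}

@dataclass
class Budget:
    # Bounded execution: hard caps checked during the run.
    max_steps: int = 20
    max_tokens: int = 200_000
    max_seconds: float = 300.0

def validate_plan(plan: list[str]) -> list[str]:
    """Return a list of schema violations; empty means the plan is valid."""
    errors = []
    missing = REQUIRED_STEPS - set(plan)
    if missing:
        errors.append(f"missing required steps: {sorted(missing)}")
    seen: set[str] = set()
    for step in plan:
        unmet = DEPENDENCIES.get(step, set()) - seen
        if unmet:
            errors.append(f"{step} scheduled before {sorted(unmet)}")
        seen.add(step)
    return errors
```

A non-empty error list triggers the fallback path: escalate to a human, fall back to a fixed workflow, or reject the task — never execute a plan that failed validation.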

Pipeline (DAG). Agents in a directed acyclic graph, each agent’s output feeding the next. Document processing uses this: extract, classify, enrich, validate, route.

Tradeoff: Rigid topology. Works for stable workflows, poor fit for dynamic decomposition. But rigidity is a feature when you need predictability.

Consensus / debate. Multiple agents independently process the same input; a judge selects the best result.

Tradeoff: Expensive but justified when error costs exceed computation costs — medical coding, legal review, financial analysis.

Critical limitation: Agreement does not guarantee correctness. If all agents share the same model family, they share the same biases. Three instances of Claude will make the same systematic error on the same input. True consensus requires model diversity (Claude + GPT-4o + Llama) or methodological diversity (one agent reasons step-by-step, another uses analogical reasoning, a third checks against a database). Correlated errors are the failure mode that consensus is supposed to prevent — and the one most implementations accidentally reintroduce.
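A minimal consensus sketch, assuming each agent is a callable that returns an answer. The judge here is a majority vote with an escalation path on no-agreement; real systems often use a separate judge model instead. The diversity requirement lives in how you populate the agent list — spanning model families or methodologies — not in the voting logic itself.

```python
# Sketch of consensus over independent agents. Each "agent" is any
# callable str -> str; in practice each would wrap a different model
# family or reasoning method.
from collections import Counter
from typing import Callable

def consensus(agents: list[Callable[[str], str]], task: str) -> str:
    answers = [agent(task) for agent in agents]   # independent runs
    best, freq = Counter(answers).most_common(1)[0]
    if freq == 1:
        # No two agents agree: escalate rather than guess.
        raise ValueError(f"no agreement: {answers}")
    return best
```

Note what this does not protect against: if all agents share a bias, they agree on the same wrong answer and the vote passes — which is exactly the correlated-error failure described above.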

The authority layer

In any system where multiple agents produce information, you need to answer: whose output is authoritative?

Agents propose. Systems of record verify. An extraction agent says the policy limit is $500,000. Before any downstream agent uses that number, verify it against the policy database. The agent’s output is a hypothesis. The database is the truth. This adds a verification call per critical fact — typically 50-200ms of latency and one database query. It prevents the class of failures where a hallucinated fact propagates through every agent in the pipeline.

No agent trusts another agent’s output as ground truth. If agent B needs to validate agent A’s work, agent B checks against the source data — not against agent A’s reasoning. Otherwise you get cascading confirmation: A hallucinates a fact, B validates against A’s output and confirms, C sees consensus between A and B and proceeds. Three agents, zero independent verification.

State mutations require system-level authorization. If an agent wants to modify a database record, send an email, or process a payment, that action goes through a verification layer that checks: is this action authorized for this task? Does the underlying data support this action? Is the action reversible? The agent requests. The system decides. Never let an agent execute irreversible actions without programmatic verification.
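The "agents propose, systems of record verify" rule reduces to a small amount of code. This sketch uses an in-memory dict as a stand-in for the policy database; the key move is that the agent's extracted value is treated as a hypothesis and the record store's value always wins.

```python
# Illustrative authority-layer check: an extracted fact is verified
# against the system of record before any downstream agent uses it.
POLICY_DB = {"POL-1001": {"limit": 500_000}}   # stand-in for the real database

class VerificationError(Exception):
    pass

def verify_policy_limit(policy_id: str, proposed_limit: int) -> int:
    record = POLICY_DB.get(policy_id)
    if record is None:
        raise VerificationError(f"unknown policy {policy_id}")
    if record["limit"] != proposed_limit:
        # The agent's number is a hypothesis; the database is the truth.
        raise VerificationError(
            f"agent proposed {proposed_limit}, record says {record['limit']}")
    return record["limit"]
```

Downstream agents consume the return value of the verifier, never the raw agent output — so a hallucinated figure dies at this boundary instead of propagating.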

Context window engineering

Every agent has a finite context window — a hard limit on how much it can reason over at once. Managing what goes into that window is as important as managing memory in embedded systems.

Context budgeting. Allocate each agent’s context window deliberately: system prompt (fixed), retrieved context (variable), conversation history (growing), output space (reserved). Most models perform best when utilization stays below 60-70% of the maximum window size. Fill it to capacity and reasoning quality degrades well before the hard token limit.
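Concretely, a budget can be a table of per-slot caps plus an overall ceiling. The numbers below are assumptions for illustration (a 200k-token window, a ~65% utilization ceiling per the guidance above), not model-specific figures.

```python
# Illustrative per-agent context budget. WINDOW and all slot caps are
# assumed numbers, not any model's documented limits.
WINDOW = 200_000
CEILING = int(WINDOW * 0.65)           # stay below ~65% utilization

BUDGET = {
    "system_prompt": 3_000,            # fixed
    "retrieved_context": 80_000,       # variable; trim retrieval to fit
    "history": 30_000,                 # growing; summarize beyond this
    "output_reserve": 8_000,           # reserved for the response
}

def fits(usage: dict[str, int]) -> bool:
    """Enforce both per-slot caps and the overall utilization ceiling."""
    return all(usage.get(k, 0) <= cap for k, cap in BUDGET.items()) \
        and sum(usage.values()) <= CEILING
```

The check runs before each invocation; a failing slot triggers its mitigation (trim retrieval, summarize history) rather than silently filling the window.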

Context compression. Don’t pass raw documents between agents — pass structured summaries. An extraction agent should output a JSON object, not a 5,000-word narrative. This is data serialization for AI systems.

Context isolation. Each agent gets only the context it needs. A fraud detection agent doesn’t need conversation history. A drafting agent doesn’t need raw policy documents — it needs extracted terms. Unnecessary context wastes tokens and introduces noise.

One system, end to end

The problem: An insurance company processes complex claims. Each claim involves document extraction, policy cross-referencing, fraud screening, coverage calculation, and a determination letter.

Why multi-agent: The steps require different expertise (extraction vs. fraud detection vs. legal drafting), different tools (OCR vs. fraud database vs. document templates), and different context (the extraction agent needs the raw document; the drafting agent needs extracted facts, not the raw document). A single agent would need all tools and all context simultaneously — exceeding practical context limits and degrading quality.

Architecture: Orchestrator-worker with verification.

  • Orchestrator: Receives claim, generates execution plan (extract → cross-reference → fraud screen → calculate → draft). Plan validated against the required-steps schema before execution.
  • Extraction agent: OCR + structured extraction. Outputs JSON with claim amounts, dates, policy numbers, involved parties. Verified against the policy database (does this policy number exist? does the claimant match?).
  • Fraud agent: Checks extracted data against fraud indicators. Independent of extraction — reads from the database, not from the extraction agent’s output.
  • Coverage agent: Calculates coverage based on policy terms (from the database) and claim details (from verified extraction). Outputs determination with cited policy sections.
  • Drafting agent: Generates the determination letter from structured inputs (determination, cited sections, claim details). Output validated for required sections and tone compliance.

Where it fails and what catches it:

  • Extraction agent misreads a date → coverage agent calculates wrong window → Caught by: verification against the policy database’s coverage dates, which are authoritative.
  • Orchestrator omits the fraud screening step → claim approved without fraud check → Caught by: plan validation against required-steps schema, which rejects plans missing mandatory steps.
  • Drafting agent hallucinates a policy provision → letter cites nonexistent clause → Caught by: output validation that checks all cited section numbers against the policy document index.
  • Token budget exceeded mid-execution → system hangs → Caught by: per-task budget with hard cutoff and graceful degradation (route to human with partial results).
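The last failure mode — budget exhaustion with graceful degradation — is worth sketching, since "hang" is the default behavior if you don't build the cutoff. Names and the step representation are illustrative; the pattern is a per-task meter with a hard ceiling and a partial-results escape hatch.

```python
# Sketch of a per-task token budget with hard cutoff. On exhaustion,
# execution stops and partial results route to a human queue instead
# of hanging. Step tuples are (name, token_cost, work_fn) — illustrative.

class BudgetExceeded(Exception):
    pass

class TaskBudget:
    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.used = 0

    def charge(self, tokens: int) -> None:
        self.used += tokens
        if self.used > self.max_tokens:
            raise BudgetExceeded(f"{self.used} > {self.max_tokens}")

def run_with_degradation(steps, budget: TaskBudget):
    results = []
    for name, cost, fn in steps:
        try:
            budget.charge(cost)
        except BudgetExceeded:
            # Graceful degradation: hand off whatever completed.
            return {"status": "escalated_to_human", "partial": results}
        results.append((name, fn()))
    return {"status": "complete", "results": results}
```

The crucial property is that exhaustion produces a defined terminal state with attributable partial output, not an indefinite stall.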

MCP: the tool interface layer

Model Context Protocol (MCP) is Anthropic’s open protocol for connecting agents to external tools and data sources — databases, APIs, file systems. It matters for multi-agent architecture because:

Tool descriptions drive selection. Each MCP tool has a description the agent reads to decide when to use it. Ambiguous descriptions cause tool selection errors — the agent picks the wrong tool and gets wrong data. Write tool descriptions the way you write API documentation: what it does, what inputs it expects, what it returns, when to use it instead of alternatives.
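Here is what "write it like API documentation" looks like in the name/description/input-schema shape MCP tools use. The tool itself and the sibling `search_orders` it disambiguates against are hypothetical:

```python
# Hypothetical tool description in MCP's name / description /
# inputSchema shape. Note the explicit "use this instead of that"
# guidance — that line is what prevents tool-selection errors.
LOOKUP_ORDER = {
    "name": "lookup_order",
    "description": (
        "Fetch a single order by its order ID. Returns status, line items, "
        "and shipping info. Use this when the user references a specific "
        "order ID; use search_orders when you only have a customer name "
        "or date range."
    ),
    "inputSchema": {
        "type": "object",
        "properties": {
            "order_id": {
                "type": "string",
                "description": "Order identifier, e.g. 'ORD-12345'.",
            },
        },
        "required": ["order_id"],
    },
}
```

The description answers all four questions from the text: what it does, what it expects, what it returns, and when to prefer an alternative.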

Scoping limits blast radius. Each agent gets access to only the MCP tools it needs. The extraction agent doesn’t get the payment processing tool. The fraud agent doesn’t get the customer communication tool. This is least-privilege applied to AI systems.

When NOT to use MCP: If your tools are simple function calls within a single application, direct function calling (tool_use) is simpler. MCP adds value when tools are external services, shared across agents, or need to be discovered dynamically. Don’t add a protocol layer for three internal functions.

Observability

You can’t operate what you can’t observe. Multi-agent systems need observability that goes beyond traditional APM.

Trace everything. Every agent invocation produces a trace: input context (or hash), output, model, tokens consumed, latency, parent task ID. LangSmith, Braintrust, and Arize Phoenix provide agent-aware tracing. In-house, OpenTelemetry with custom spans works.
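Vendor aside, the trace record itself is just the fields listed above, captured per invocation. A minimal sketch (field names are illustrative; in a real system these would be emitted as spans to your tracing backend):

```python
# Minimal per-invocation trace record, independent of any tracing vendor.
import hashlib
import uuid
from dataclasses import dataclass, field

@dataclass
class AgentTrace:
    agent: str
    model: str
    parent_task_id: str
    input_hash: str            # hash, not raw context, when inputs are sensitive
    output: str = ""
    tokens: int = 0
    latency_ms: float = 0.0
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)

def start_trace(agent: str, model: str, task_id: str, context: str) -> AgentTrace:
    digest = hashlib.sha256(context.encode()).hexdigest()[:16]
    return AgentTrace(agent, model, task_id, digest)
```

The `parent_task_id` is what makes multi-agent tracing different from ordinary APM: it lets you reassemble one customer request from a fan-out of agent invocations.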

Cost attribution. Attribute token costs to individual agents, tasks, and customers. This is the AI equivalent of cloud cost allocation — equally important for optimization decisions.

Per-agent quality. Track eval scores per agent, not just per system. If system accuracy drops, you need to know whether the extraction agent degraded or the validation agent did. Run per-agent eval suites independently.

The threshold that matters: Define one metric with an action. Example: if claim processing accuracy (verified against human adjuster decisions on a 5% sample) drops below 92%, route new claims to human-only processing until the root cause is identified. Measurement without a decision framework is monitoring theater.
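The metric-with-an-action pattern fits in a few lines. The 92% threshold mirrors the example above; the hysteresis margin is an added assumption to keep routing from flapping right at the boundary:

```python
# Sketch of one metric with an automated action: sampled accuracy
# below the threshold flips routing to human-only. The 2-point
# hysteresis margin is an illustrative assumption.
THRESHOLD = 0.92
RECOVERY_MARGIN = 0.02

def routing_mode(sampled_accuracy: float, current_mode: str) -> str:
    """Return 'auto' or 'human_only' given accuracy on the audit sample."""
    if sampled_accuracy < THRESHOLD:
        return "human_only"
    if current_mode == "human_only" and sampled_accuracy < THRESHOLD + RECOVERY_MARGIN:
        return "human_only"   # don't flap back at the boundary
    return "auto"
```

The point is that the check's output is a routing decision, not a dashboard color — the system acts without waiting for someone to notice the graph.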

The 60-day milestone

Days 1-10: Design a multi-agent system for a real workflow — or analyze an existing workflow and determine whether multi-agent is warranted. If a single agent with tools or a deterministic pipeline solves it, document why and build that instead. The staff-level decision is choosing the right architecture, not the most complex one.

Days 11-30: Build the orchestration layer (if multi-agent is warranted). Implement plan validation, context management, inter-agent communication, authority verification against systems of record, and token budgeting from day one.

Days 31-45: Implement the agents. Each gets its own system prompt, scoped tools, and eval suite. Test independently before integration. Test against the authority layer — does verification catch hallucinated facts?

Days 46-55: Failure injection. Kill agents mid-execution. Feed malformed input. Exhaust token budgets. Test orchestrator plan validation with incomplete plans. Verify graceful degradation at every failure point.

Days 56-60: Ship with observability. Full tracing, cost attribution, per-agent quality monitoring, and at least one metric with a defined threshold and automated action.

At the end of 60 days, you’ve designed and shipped a system with the right level of complexity for the problem — whether that’s a multi-agent system with orchestration and verification, or a well-designed single agent that didn’t need the overhead. That’s the staff-level judgment.