Harness Design — Competence
What an interviewer or hiring manager expects you to know.
Core Knowledge
- What a harness is. The engineering scaffolding around LLM calls that turns a demo into a product: retry logic, output parsing, error handling, configuration management, rate limiting, timeout handling, cost tracking, and logging. The harness is everything between “call the API” and “deliver a reliable response to the user.” This is the unsexy infrastructure that separates a Jupyter notebook from a production system. A well-designed harness makes the LLM call the simplest part of the system.
- SDK landscape. Anthropic SDK (Python/TypeScript — direct API access, streaming, tool use, prompt caching), OpenAI SDK (Python/Node — the most widely used, function calling, structured outputs), LangChain (Python/JS — abstraction layer with chains, agents, RAG integrations, LCEL for composability), LlamaIndex (Python/TS — data-focused, strong RAG abstractions, query engines), Vercel AI SDK (TypeScript — streaming-first, React integration, provider-agnostic), Instructor (Python — structured output extraction using Pydantic models, wraps any provider SDK), Marvin (Python — AI functions as Python decorators). Know when to use the provider SDK directly (simple use cases, maximum control) vs. a framework (complex orchestration, rapid prototyping).
- Retry and error handling. LLM APIs fail: rate limits (429), server errors (500/503), timeouts, malformed responses, and content policy blocks. Design patterns: exponential backoff with jitter (retry after 1s, 2s, 4s + random jitter to avoid thundering herd), idempotency (retrying a request shouldn’t produce duplicate side effects), circuit breaker (stop retrying after N failures, switch to fallback), timeout budget (total time allowed for retries — don’t retry forever). LiteLLM handles most of this automatically; for direct SDK use, Tenacity (Python) or p-retry (Node) are the standard retry libraries.
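The backoff-and-retry pattern above can be sketched in a few lines of stdlib Python (no Tenacity). The function names and the retryable-error set are illustrative; real code would match on the provider SDK's specific exception types:

```python
import random
import time

def backoff_delays(retries=3, base=1.0, cap=30.0):
    """Full-jitter exponential backoff: uniform(0, min(cap, base * 2**i))."""
    return [random.uniform(0, min(cap, base * 2 ** i)) for i in range(retries)]

def call_with_retry(fn, retries=3, retryable=(TimeoutError,), sleep=time.sleep):
    """Call fn(); on a retryable error, back off and try again.
    Non-retryable errors (e.g. content-policy blocks) propagate immediately."""
    for delay in backoff_delays(retries) + [None]:
        try:
            return fn()
        except retryable:
            if delay is None:  # retry budget exhausted
                raise
            sleep(delay)
```

Injecting `sleep` keeps the wrapper testable without real waits; a circuit breaker would sit one level above this, counting failures across calls.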
- Output parsing and validation. LLM outputs are strings — the harness converts them to structured data. Patterns: JSON mode (OpenAI/Anthropic — model constrained to valid JSON), tool use / function calling (most reliable — model fills a defined schema), regex extraction (fragile but simple for known patterns), Pydantic validation (Instructor library — define a Pydantic model, get validated output), Guardrails AI validators (schema + semantic validation). The golden rule: never trust raw LLM output in application code. Always validate and handle malformed responses gracefully.
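The “never trust raw output” rule, sketched with the stdlib only (a real harness would more likely use Pydantic or Instructor); the `Invoice` schema here is a hypothetical example:

```python
import json
from dataclasses import dataclass

@dataclass
class Invoice:  # hypothetical response schema
    vendor: str
    total_cents: int

def parse_invoice(raw: str) -> Invoice:
    """Convert a raw model string into validated structured data.
    Fails loudly on non-JSON or schema-mismatched output."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as e:
        raise ValueError(f"model returned non-JSON output: {e}") from e
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    vendor, total = data.get("vendor"), data.get("total_cents")
    if not isinstance(vendor, str) or not isinstance(total, int) or isinstance(total, bool):
        raise ValueError(f"schema mismatch: {data!r}")
    return Invoice(vendor=vendor, total_cents=total)
```

The caller then handles `ValueError` with a retry or a flag-for-review path, rather than letting a malformed string reach application logic.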
- Configuration management. Production harnesses separate configuration from code: model selection (configurable, not hardcoded), temperature and sampling parameters, system prompts (version-controlled, loaded from config files or database), feature flags (enable/disable LLM features without deployment), and per-environment settings (dev uses Haiku for cost, production uses Sonnet). CLAUDE.md files, .env files, and feature flag systems (LaunchDarkly, Flagsmith) are the standard tools.
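A minimal sketch of environment-driven model selection; the variable names and model IDs are illustrative, not any SDK's convention:

```python
import os

# Hypothetical per-environment defaults (dev favors cost, prod favors quality).
MODEL_BY_ENV = {
    "dev": "claude-3-5-haiku-latest",
    "prod": "claude-sonnet-4-5",
}

def resolve_model(env=None):
    """Model comes from config, never hardcoded at the call site.
    Precedence: explicit env-var override > per-environment default."""
    override = os.environ.get("LLM_MODEL")
    if override:
        return override
    env = env or os.environ.get("APP_ENV", "dev")
    return MODEL_BY_ENV.get(env, MODEL_BY_ENV["dev"])
```

The same pattern extends to temperature, max tokens, and prompt file paths; a feature-flag service replaces the dict lookup when flags need to change without a deploy.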
- Streaming. Server-Sent Events (SSE) for real-time token delivery to the user. Both Anthropic and OpenAI SDKs support streaming natively. The harness must handle: partial JSON in streaming (can’t parse until complete), guardrail checking on partial output (check accumulated text periodically), timeout on stalled streams, and graceful abort (user cancels mid-response). Vercel AI SDK provides React hooks for streaming UX. For server-side: stream processing with async iterators.
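The periodic-guardrail idea can be sketched with a synchronous generator (a production harness would use async iterators); `check` stands in for whatever guardrail predicate the application supplies, returning True when the accumulated text is safe:

```python
def stream_with_guardrail(chunks, check, every=5):
    """Forward streamed chunks while running a guardrail check on the
    accumulated text every `every` chunks, plus a final scan on completion.
    Aborts the stream if the check trips."""
    buf = []
    for i, chunk in enumerate(chunks, 1):
        buf.append(chunk)
        if i % every == 0 and not check("".join(buf)):
            raise RuntimeError("guardrail tripped mid-stream")
        yield chunk
    if not check("".join(buf)):
        raise RuntimeError("guardrail tripped on final output")
```

Checking the accumulated text rather than each chunk matters: a policy violation can span a chunk boundary that no single chunk reveals.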
Expected Practical Skills
- Build a production-ready LLM call wrapper. Implement: retry with exponential backoff, timeout handling, structured output parsing (tool use or Pydantic), error classification (retryable vs. non-retryable), cost tracking per call, latency logging, and fallback to a secondary model on failure. This is the foundation every other skill builds on.
- Implement streaming with guardrails. Set up SSE streaming to a web client with periodic output checking (toxicity scan every 100 tokens, PII scan on completion, abort if guardrail triggers mid-stream).
- Design a configuration system for prompts. Version-controlled system prompts loaded from files (not hardcoded strings), environment-specific model selection, and feature flags for enabling/disabling LLM features.
- Instrument for observability. Add structured logging (request ID, model, tokens, cost, latency, error type) to every LLM call. Integrate with LangFuse or Helicone for trace visualization. This connects to Skill 16 (observability).
- Handle graceful degradation. When the LLM is unavailable or too slow: return cached responses, fall back to a simpler model, serve a static response with “AI is temporarily unavailable” message, or queue the request for later processing.
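The observability skill above, as a minimal stdlib sketch: one structured JSON log line per call carrying request ID, model, latency, and error type. Token and cost fields are omitted here because they come from the provider's response object:

```python
import json
import time
import uuid

def log_call(model, fn, sink=print):
    """Run an LLM call fn() and emit one structured log record for it,
    whether it succeeds or raises."""
    record = {"request_id": uuid.uuid4().hex[:8], "model": model}
    start = time.perf_counter()
    try:
        result = fn()
        record["status"] = "ok"
        return result
    except Exception as e:
        record["status"] = "error"
        record["error_type"] = type(e).__name__
        raise
    finally:
        # finally runs on both paths, so every call gets exactly one record
        record["latency_ms"] = round((time.perf_counter() - start) * 1000, 1)
        sink(json.dumps(record))
```

Pointing `sink` at a LangFuse or Helicone exporter instead of stdout turns the same record into a trace span.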
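The graceful-degradation skill can be sketched as an ordered chain of callables (primary model, cheaper model, cache, static response), tried until one succeeds:

```python
def with_fallbacks(callers):
    """Degradation chain: try each caller in order and return the first
    success; if every step fails, re-raise the last error."""
    if not callers:
        raise ValueError("need at least one caller")
    last_error = None
    for call in callers:
        try:
            return call()
        except Exception as e:
            last_error = e
    raise last_error
```

The final entry in the chain should be one that cannot fail, such as a static “AI is temporarily unavailable” response, so users never see a raw exception.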
Interview-Ready Explanations
- “Walk me through how you’d build the infrastructure around LLM calls for a production application.” Start with the call wrapper: Anthropic/OpenAI SDK with retry logic (exponential backoff, 3 retries max, circuit breaker after 5 failures in 60s). Add structured output via tool use (define Pydantic schemas for every response type). Implement streaming for user-facing endpoints. Configure model selection per environment (Haiku for dev, Sonnet for production). Add observability (LangFuse traces on every call: input, output, tokens, cost, latency). Set up fallback chain (Claude → GPT-4o → cached response). Version-control all prompts alongside code.
- “How do you handle LLM API failures in production?” Classify errors: rate limit (backoff and retry), server error (retry with backoff, switch provider after 3 failures), timeout (retry once with shorter timeout, then fallback), content policy block (don’t retry — log and return a safe response), malformed output (retry once, then parse what you can and flag for review). Monitor error rates — a sudden spike in 429s means you need to request higher rate limits or add load shedding.
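That classification policy can be sketched as a small dispatch function. The status codes and strategy labels are illustrative (content-policy blocks, for instance, surface differently across providers, often as a 400-class error or a special stop reason):

```python
RETRYABLE_SERVER_ERRORS = (500, 502, 503)

def error_action(status_code, attempt):
    """Map an HTTP status from an LLM API call, plus the current attempt
    count, to a handling strategy. Thresholds mirror the policy above."""
    if status_code == 429:
        return "backoff_and_retry"
    if status_code in RETRYABLE_SERVER_ERRORS:
        return "retry" if attempt < 3 else "switch_provider"
    if status_code == 408:  # timeout
        return "retry_once_shorter_timeout" if attempt == 0 else "fallback"
    if status_code == 400:  # includes content-policy blocks on some providers
        return "no_retry_log_and_safe_response"
    return "raise"
```

Keeping the policy in one pure function makes it unit-testable and easy to tune when monitoring shows a new failure pattern.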
- “What’s the difference between using LangChain vs. building with the SDK directly?” SDK direct: full control, minimal abstraction, best for simple use cases or when you need to understand exactly what’s happening. LangChain: rapid prototyping, rich ecosystem of integrations (vector stores, tools, retrievers), LCEL for composable chains, but adds abstraction that can obscure debugging. Rule: start with the SDK for your first feature. Switch to LangChain if you need complex orchestration, multiple providers, or RAG. Never use LangChain if you can’t explain what it’s doing under the hood.
Related
- Prompting — prompts run inside the harness
- Orchestration — harness is the foundation orchestration builds on
- Model Routing — routing logic lives in the harness