Structured Output Design — Competence
What an interviewer or hiring manager expects you to know.
Core Knowledge
- Why structured output matters. LLMs produce strings. Applications need data — JSON objects, database records, API payloads, form fields. The bridge between “the model said something” and “the application can use it” is structured output design. Without it, every LLM integration is fragile: regex parsing, string splitting, and prayer. With it, LLM outputs are typed, validated, and safe to feed into downstream systems.
- Structured output mechanisms. Tool use / function calling (the modern standard — define a JSON Schema and the model fills it; Anthropic tool use, OpenAI function calling, Google function calling). JSON mode (OpenAI `response_format: { type: "json_object" }` — guarantees valid JSON but not schema conformance). Structured outputs with strict mode (OpenAI `response_format: { type: "json_schema", json_schema: {...} }` — guarantees schema conformance via constrained decoding). XML-delimited output (Claude excels at this — define output sections with XML tags, parse with standard XML libraries). Constrained generation (Outlines, Guidance, LMQL — force the model to generate only tokens that satisfy a grammar; works with open-source models).
- Schema design for LLM output. Pydantic models (Python) or Zod schemas (TypeScript) define the expected structure. Design principles: flat over nested (deeply nested schemas increase parsing failure rates), required over optional (the model should always fill every field — use nullable types instead of optional fields when a field may be empty), enum types for constrained choices (force the model to select from a predefined list rather than generating free text), and descriptive field names and descriptions (the model reads the schema to understand what each field means — `customer_sentiment: Literal["positive", "negative", "neutral"]` is better than `s: int`).
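A minimal sketch of these schema-design principles as a Pydantic model (the `CustomerInfo` class and its field names are illustrative, not from any library):

```python
# Illustrates: flat structure, nullable-but-required fields, Literal for
# enum choices, and a description on every field (the model reads these).
from typing import Literal, Optional

from pydantic import BaseModel, Field


class CustomerInfo(BaseModel):
    name: Optional[str] = Field(
        description="Customer's full name; null if not stated in the input"
    )
    email: Optional[str] = Field(
        description="Customer's email address; null if not found"
    )
    sentiment: Literal["positive", "negative", "neutral"] = Field(
        description="Overall sentiment of the customer's message"
    )


# The JSON Schema sent to the model carries the enums and descriptions.
schema = CustomerInfo.model_json_schema()
```

Note that every field is required (no defaults) but the string fields are nullable — the model must explicitly emit `null` rather than being allowed to omit the field.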
- Instructor library. The most important structured output tool. Wraps any LLM SDK (Anthropic, OpenAI, Cohere, Mistral) and returns validated Pydantic objects instead of raw strings. Handles automatic retry on validation failure (the model gets the validation error and tries again), streaming with partial validation, and nested Pydantic models. Usage: `client.chat.completions.create(response_model=MyPydanticModel, ...)`. Available for Python (instructor) and TypeScript (instructor-js). Used at Notion, Stripe, and many other production LLM applications. Marvin takes a similar approach using function decorators.
- Guardrails AI for semantic validation. Beyond structural validation (is it valid JSON?), Guardrails AI validates semantic content: is the email a valid email format? Is the sentiment actually reflected in the text? Is the generated SQL safe from injection? The `Guard` object wraps an LLM call with validators from Guardrails Hub (50+ community validators for common checks). Complements Instructor — Instructor handles structure, Guardrails AI handles meaning.
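A sketch of the Instructor pattern, assuming the `instructor` and `openai` packages and an `OPENAI_API_KEY`; the model name, the `SupportTicket` schema, and the `extract_ticket` helper are illustrative:

```python
from typing import Literal

from pydantic import BaseModel, Field


class SupportTicket(BaseModel):
    subject: str = Field(description="One-line summary of the issue")
    priority: Literal["low", "medium", "high"] = Field(
        description="Urgency implied by the email; use 'low' if unclear"
    )


def extract_ticket(email_body: str) -> SupportTicket:
    # Imports deferred so the schema above is usable without the SDKs installed.
    import instructor
    from openai import OpenAI

    # from_openai() patches the client so create() accepts response_model
    # and returns a validated SupportTicket instead of a raw string.
    client = instructor.from_openai(OpenAI())
    return client.chat.completions.create(
        model="gpt-4o",
        response_model=SupportTicket,
        max_retries=3,  # on validation failure, Instructor re-prompts with the error
        messages=[
            {"role": "user", "content": f"Extract a ticket from this email:\n{email_body}"}
        ],
    )
```

The caller receives a typed `SupportTicket` or an exception after the retries are exhausted — no string parsing anywhere.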
Tool use as the preferred pattern. Anthropic’s tool use and OpenAI’s function calling are the most reliable structured output mechanisms because they use the model’s fine-tuned tool-calling capability (not just hoping the model follows format instructions). The model has been specifically trained to fill JSON schemas when presented as tool definitions. Reliability: tool use produces valid schema-conformant output >99% of the time for well-designed schemas, vs. ~90-95% for prompt-based JSON generation.
Expected Practical Skills
- Define Pydantic models for LLM output. Given a product requirement (“extract customer info from an email”), design a Pydantic model with appropriate types, validators, field descriptions, and examples. Use `Field(description="...")` for every field — the model reads these.
- Implement Instructor for reliable extraction. Set up Instructor with the Anthropic or OpenAI SDK. Define response models. Handle validation retries (Instructor retries automatically, but configure max_retries and handle persistent failures).
- Design tool definitions for Claude/GPT. Create tool schemas that guide the model to produce the right output. Include: clear tool names, detailed descriptions, parameter descriptions with examples, and enum constraints for categorical fields.
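As an example of these guidelines, a tool definition shaped for Anthropic's Messages API `tools` parameter (the tool name and parameters are hypothetical):

```python
# Clear name, detailed description, per-parameter descriptions with an
# example, and an enum constraint for the categorical field.
classify_ticket_tool = {
    "name": "record_ticket_classification",
    "description": (
        "Record the classification of a customer support ticket. "
        "Call exactly once with the classification of the provided ticket text."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "category": {
                "type": "string",
                "enum": ["billing", "bug", "feature_request", "other"],
                "description": "Ticket category; use 'other' only when none of the others fit",
            },
            "summary": {
                "type": "string",
                "description": "One-sentence summary, e.g. 'User cannot reset their password'",
            },
        },
        "required": ["category", "summary"],
    },
}
```

The same schema works for OpenAI function calling with minor renaming (`input_schema` becomes `parameters`).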
- Handle partial and streaming structured output. Implement streaming with Instructor’s `Partial[MyModel]` — the user sees output building incrementally while validation runs on the complete response. Handle the UX of partial data (show skeleton → fill fields as they arrive).
- Validate beyond structure. Add semantic validators: regex patterns for emails/phones, range checks for numerical fields, cross-field consistency (if `status == "completed"` then `completion_date` must be non-null), and domain-specific rules (valid US state codes, valid currency codes).
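The cross-field consistency rule above can be sketched as a Pydantic `model_validator` (the `TaskRecord` model is illustrative):

```python
from datetime import date
from typing import Literal, Optional

from pydantic import BaseModel, Field, model_validator


class TaskRecord(BaseModel):
    status: Literal["pending", "completed"] = Field(description="Current task status")
    completion_date: Optional[date] = Field(
        description="Completion date; null unless status is 'completed'"
    )

    @model_validator(mode="after")
    def completed_requires_date(self):
        # Cross-field rule: a completed task must carry a completion date.
        if self.status == "completed" and self.completion_date is None:
            raise ValueError("completion_date must be non-null when status is 'completed'")
        return self
```

Under Instructor, the `ValueError` raised here is fed back to the model on retry, so the check doubles as a self-correction prompt.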
Interview-Ready Explanations
- “Walk me through how you’d design structured output for a data extraction pipeline.” Define the target schema as a Pydantic model — every field with a type, description, and validator. Use Instructor with Claude or GPT-4o for extraction. Implement tool use (the most reliable mechanism). Add semantic validation: cross-field consistency, format checks, and domain-specific rules. Handle failures: Instructor retries with the validation error in the prompt (self-healing). Test on 100+ examples, measuring extraction accuracy per field, schema conformance rate, and retry rate. For production, add LangFuse tracing to monitor extraction quality over time.
- “How do you handle cases where the model can’t fill a required field?” Design the schema for graceful incomplete extraction. Use nullable types (`field: str | None`) with `Field(description="Null if not found in the input")` rather than optional fields. This forces the model to explicitly acknowledge missing data rather than hallucinating a value. Add a confidence field (`field_confidence: Literal["high", "low", "not_found"]`) alongside each extracted value. Downstream systems use the confidence to decide whether to accept, flag for review, or reject.
- “What are the failure modes of structured output?” Schema violation (model returns invalid JSON — mitigate with tool use / constrained generation, which makes this near-impossible). Field hallucination (model fills a required field with fabricated data — mitigate with confidence fields and semantic validation). Type coercion errors (model returns “five” instead of 5 — mitigate with Pydantic validators and explicit type instructions). Streaming corruption (partial response is invalid — mitigate with Instructor’s partial validation). Over-extraction (model extracts data from the wrong part of the input — mitigate with clear input boundaries and source attribution fields).
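A sketch of the nullable-plus-confidence pattern from the answer above (the `ExtractedPhone` model and the `route` helper are illustrative):

```python
from typing import Literal, Optional

from pydantic import BaseModel, Field


class ExtractedPhone(BaseModel):
    phone: Optional[str] = Field(
        description="Customer phone number; null if not found in the input"
    )
    phone_confidence: Literal["high", "low", "not_found"] = Field(
        description="Use 'not_found' (with phone=null) rather than guessing"
    )


def route(result: ExtractedPhone) -> str:
    # Downstream decision driven by the confidence field, not the value alone.
    if result.phone_confidence == "high":
        return "accept"
    if result.phone_confidence == "low":
        return "flag_for_review"
    return "reject"
```

The point is that "missing" and "uncertain" become explicit, typed states the application can branch on, instead of silently absent keys.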
Related
- Harness Design — structured output is the key harness component
- Prompting — schema design is a form of prompt engineering
- Eval Frameworks — eval extraction accuracy per field