
Context Window Engineering

Managing what goes in and stays out: summarization, chunking, retrieval design at scale.

Context Window Engineering — Competence

What an interviewer or hiring manager expects you to know.

Core Knowledge

  • Context window sizes and economics. Claude Opus/Sonnet: 200K tokens (~150K words, ~500 pages). Gemini 2.5 Pro: 1M tokens. GPT-4o: 128K tokens. Llama 3.1 405B: 128K tokens. Bigger isn’t free — cost scales linearly with input tokens (Claude Opus: $15/MTok input). Filling a 200K window on every call costs ≈$3 per request in input tokens alone. Context window engineering is about managing what goes in and what stays out to maximize quality while controlling cost and latency.
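The arithmetic above is worth keeping as a helper rather than redoing by hand; a minimal sketch using the list prices cited in the text:

```python
def input_cost_usd(input_tokens: int, price_per_mtok: float) -> float:
    """Cost of the input side of one request at a $/million-token rate."""
    return input_tokens / 1_000_000 * price_per_mtok

# Filling a 200K window at Claude Opus's $15/MTok input rate:
per_call = input_cost_usd(200_000, 15.0)   # -> 3.0
# At 1,000 calls/day that is $3,000/day in input tokens alone:
per_day = 1_000 * per_call                 # -> 3000.0
```

This is why the rest of this section is about keeping the window small, not just filling it.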

  • The “lost in the middle” problem. Research (Liu et al., 2023 “Lost in the Middle”) showed that LLMs attend most strongly to the beginning and end of their context, with degraded attention to the middle. Practical impact: information placed at positions 3-7 of 10 retrieved documents gets less attention than positions 1-2 and 9-10. Mitigation: put the most important information first and last, use explicit markers (“CRITICAL: the following is the most relevant context”), and keep total context smaller (5 high-quality chunks > 20 mediocre chunks).
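The reordering mitigation can be done mechanically: given chunks ranked most-relevant-first, alternate them onto the front and back of the context so the weakest chunks land in the middle. A minimal sketch (the function name is illustrative):

```python
def reorder_for_attention(chunks_by_relevance: list) -> list:
    """Place the most relevant chunks at the start and end of the context,
    pushing the least relevant toward the middle ('lost in the middle'
    mitigation). Input must be sorted most-relevant-first."""
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

# Relevance ranks 1 (best) .. 6 (worst):
print(reorder_for_attention([1, 2, 3, 4, 5, 6]))  # [1, 3, 5, 6, 4, 2]
```

Rank 1 ends up first, rank 2 last, and ranks 5–6 in the middle, matching the attention pattern described above.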

  • Token counting and budget management. tiktoken (OpenAI’s tokenizer — works for GPT models), Anthropic’s token counting API, and approximations (1 token ≈ 4 characters in English, ~0.75 words). For production systems: count tokens before sending the request (catch budget overflows before they error), allocate budgets per section (system prompt: 2K tokens, context: 8K, conversation history: 4K, output reservation: 2K), and truncate gracefully when limits are approached (drop the oldest conversation turns, reduce retrieved context, summarize rather than include full text).
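A budget check can be sketched with the ~4-characters-per-token heuristic mentioned above; in production you would swap the estimator for tiktoken (GPT models) or Anthropic's token counting API to get exact counts:

```python
# Rough token estimator using the ~4 chars/token English heuristic.
def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def over_budget(sections: dict[str, str], caps: dict[str, int]) -> list[str]:
    """Return names of sections exceeding their per-section token cap,
    so the caller can truncate before the API call errors."""
    return [name for name, text in sections.items()
            if estimate_tokens(text) > caps[name]]

flagged = over_budget(
    {"system": "x" * 9000, "context": "y" * 30000},
    {"system": 2000, "context": 8000},
)
print(flagged)  # ['system'] — ~2250 tokens against a 2000 cap
```

The per-section caps mirror the example allocation in the text (system: 2K, context: 8K, and so on).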

  • Context assembly strategies. Static context (always included — system prompt, tool definitions, fixed instructions). Dynamic context (varies per request — retrieved documents, conversation history, user-specific data). Ephemeral context (single-use — current query, intermediate reasoning). The harness (Skill 2) assembles these at runtime. Design: prioritize by information value (most relevant first), enforce per-section token budgets, and have a degradation policy for each section (what to cut first when the total exceeds the budget).
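One way to encode the degradation-policy idea is to attach a one-shot fallback to each section and degrade lowest-priority sections first. This is a hypothetical sketch, not any particular harness's API:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Section:
    name: str
    text: str
    priority: int                                    # lower = more important
    degrade: Optional[Callable[[str], str]] = None   # e.g. truncate/summarize

def assemble(sections: list[Section], budget_tokens: int) -> str:
    est = lambda s: len(s) // 4                      # ~4 chars/token heuristic
    ordered = sorted(sections, key=lambda s: s.priority)
    while sum(est(s.text) for s in ordered) > budget_tokens:
        # Degrade the lowest-priority section that still has a fallback.
        for s in reversed(ordered):
            if s.degrade is not None:
                s.text = s.degrade(s.text)
                s.degrade = None                     # each fallback fires once
                break
        else:
            break                                    # nothing left to degrade
    return "\n\n".join(s.text for s in ordered)
```

Static sections (system prompt, tool definitions) get low priority numbers and no `degrade`; dynamic sections (retrieved documents, history) carry the fallbacks.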

  • Conversation memory management. Multi-turn conversations accumulate context. Strategies: full history (include all turns — simple but grows unbounded), sliding window (keep last N turns — loses early context), summarization (periodically summarize conversation history into a compact representation — LangChain ConversationSummaryMemory), and hybrid (keep last 5 turns verbatim + summary of earlier turns). Claude Code uses automatic context compression as conversations approach limits. The choice depends on: how much early context matters for current queries, cost sensitivity, and whether the application needs to reference specific earlier statements.
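The hybrid strategy (recent turns verbatim + summary of older turns) fits in a small class. In this sketch `summarize` is a stub standing in for an LLM call ("summarize these turns, preserving decisions and open questions"):

```python
def summarize(turns: list[str], previous_summary: str) -> str:
    # Stub: a real system would call the model here.
    new = "; ".join(t[:40] for t in turns)
    return (previous_summary + " | " + new) if previous_summary else new

class HybridMemory:
    def __init__(self, keep_verbatim: int = 5):
        self.keep = keep_verbatim
        self.turns: list[str] = []
        self.summary = ""

    def add(self, turn: str) -> None:
        self.turns.append(turn)
        if len(self.turns) > self.keep:
            # Fold overflow turns into the running summary.
            overflow = self.turns[: -self.keep]
            self.summary = summarize(overflow, self.summary)
            self.turns = self.turns[-self.keep:]

    def context(self) -> str:
        parts = []
        if self.summary:
            parts.append("Earlier conversation (summary): " + self.summary)
        parts.extend(self.turns)
        return "\n".join(parts)
```

Note the ordering: the summary goes first and verbatim recent turns last, matching the attention pattern discussed above.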

  • Prompt caching. Anthropic’s prompt caching (cache long system prompts/context blocks, pay 90% less on cached input tokens for subsequent requests). The cached portion must be a prefix — you can’t cache arbitrary sections. Design: put stable content (system prompt, tool definitions, reference documents) at the beginning, variable content (user query, conversation history) at the end. Cache TTL: 5 minutes (Anthropic) — effective for repeated calls within a session. At high volume, prompt caching can reduce costs 50-80%.
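The prefix constraint shapes how you build the request: stable blocks first, with a `cache_control` marker on the last stable block, variable content after. A sketch of the request shape, following the Anthropic Messages API (treat the exact field names and the placeholder model id as assumptions to verify against current docs):

```python
# Stable content that is byte-identical across requests -> eligible for caching.
LONG_SYSTEM_PROMPT = "You are a support agent. <several thousand tokens> ..."
REFERENCE_DOCS = "<large, rarely-changing reference material> ..."

def build_request(user_query: str) -> dict:
    return {
        "model": "claude-sonnet-placeholder",   # placeholder model id
        "max_tokens": 1024,
        "system": [
            {"type": "text", "text": LONG_SYSTEM_PROMPT},
            {"type": "text", "text": REFERENCE_DOCS,
             # Marks the end of the cached prefix: everything up to and
             # including this block is cached for subsequent requests.
             "cache_control": {"type": "ephemeral"}},
        ],
        # Variable suffix: changes every request, never cached.
        "messages": [{"role": "user", "content": user_query}],
    }
```

Any edit to the system prompt or reference docs changes the prefix bytes and forces a fresh cache write, so keep those blocks stable within a session.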

Expected Practical Skills

  • Implement token budget management. Build a context assembler that: counts tokens per section, enforces per-section caps, truncates gracefully (summarize long documents, drop old conversation turns, reduce retrieved chunks), and logs when truncation occurs (so you can monitor information loss).
  • Design a conversation memory system. Implement sliding window + summarization: keep recent turns verbatim, summarize older turns, and store summaries in a compact format. Test: does the model still answer questions about early conversation accurately after summarization?
  • Optimize context for RAG. Given a RAG system returning 10 chunks of 500 tokens each (5K total), determine the optimal number of chunks to include. Test: quality at 3, 5, 7, 10 chunks. Plot quality vs. context size. Usually: quality peaks at 3-5 chunks and degrades with more (noise overwhelms signal).
  • Set up prompt caching. Configure Anthropic prompt caching for a system with a long system prompt (>1K tokens). Measure: cache hit rate, cost savings, latency improvement. Optimize cache boundaries (what goes in the cached prefix vs. the variable suffix).
  • Handle context overflow gracefully. When the assembled context exceeds the window: implement a priority queue of context sections, drop lowest-priority sections first, log what was dropped, and verify that the remaining context still produces acceptable output quality.
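The overflow-handling exercise above (priority queue, drop lowest first, log the drops) can be sketched as follows; the function name and tuple layout are illustrative:

```python
import logging

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("context")

def fit_to_window(sections: list[tuple[int, str, str]], budget: int):
    """sections: (priority, name, text); lower priority number = keep longer.
    Drops whole low-priority sections until the total fits, logging each drop
    so information loss stays observable. Returns (kept_texts, dropped_names)."""
    est = lambda text: len(text) // 4          # ~4 chars/token heuristic
    kept = sorted(sections)                    # most important first
    dropped = []
    while kept and sum(est(t) for _, _, t in kept) > budget:
        prio, name, text = kept.pop()          # least important is last
        dropped.append(name)
        log.warning("dropped context section %r (~%d tokens)", name, est(text))
    return [t for _, _, t in kept], dropped
```

Dropping whole sections (rather than silently truncating mid-section) makes the "verify remaining context still produces acceptable output" step testable: you know exactly what was removed.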

Interview-Ready Explanations

  • “Walk me through how you’d manage context for a system processing large documents.” First decision: does the full document fit in the context window? If yes (<200K tokens for Claude), use long context directly — simpler and more reliable. If no, use RAG (Skill 7) — chunk, embed, retrieve relevant sections. For medium-sized documents (50-150K tokens): consider the cost — including 100K tokens per request at $15/MTok is $1.50 per call. If call volume is low, long context is fine. If high, RAG saves money. Always: put the document or chunks first, then instructions, then the query — respect the attention pattern.

  • “How do you handle multi-turn conversations that grow beyond the context window?” Sliding window + summarization. Keep the last 5-10 turns verbatim (most relevant to current query). Summarize earlier turns into a structured summary (key topics discussed, decisions made, open questions). Put the summary at the start of the context, recent turns at the end. Test with eval: can the system still answer questions about early conversation content? If summarization loses critical information, increase the window or use RAG over conversation history.

  • “What are the failure modes of context window management?” Information loss from truncation (critical context gets cut — mitigate with priority-based truncation; never truncate instructions). Lost in the middle (important information in middle positions gets ignored — mitigate with reordering). Context poisoning (irrelevant or contradictory information in the context degrades output quality — mitigate with relevance filtering and source quality checks). Token budget miscalculation (estimated tokens differ from actual — mitigate by counting with tiktoken or the provider’s token counting API, and add a safety margin). Cache invalidation (an application-level response cache serves stale output after the underlying context has changed — mitigate with cache keys that include a content hash of the assembled context; provider-side prompt caches key on the exact prefix content, so changed content simply misses the cache).
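The content-hash mitigation for the last failure mode is a few lines: hash everything that influences the response into the cache key, so any change to the assembled context invalidates the entry. A minimal sketch:

```python
import hashlib

def cache_key(model: str, system_prompt: str, context: str, query: str) -> str:
    """Cache key for an application-level response cache: any change to any
    input produces a different key, so stale entries can never be served."""
    # \x1f (unit separator) prevents field-boundary collisions like
    # ("ab", "c") vs ("a", "bc").
    payload = "\x1f".join([model, system_prompt, context, query])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

k1 = cache_key("m", "sys", "doc v1", "q")
k2 = cache_key("m", "sys", "doc v2", "q")   # context changed -> different key
```

Identical inputs always produce the same key, so cache hits are safe by construction rather than by TTL tuning.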