God-Tier Prompting — Competence
What an interviewer or hiring manager expects you to know.
Core Knowledge
- System prompt architecture. The system prompt is the control surface for LLM behavior. Know the anatomy: role/persona definition, task instructions, output format specification, constraints/guardrails, few-shot examples, and context injection points. Anthropic's Claude prompting guidance recommends: put the most important instructions first and last (primacy/recency bias), use XML tags to delimit sections (`<instructions>`, `<context>`, `<examples>`), and keep instructions positive ("do X" rather than "don't do Y"). OpenAI's system prompt patterns differ: they use markdown headers and rely more on the model's fine-tuned default behavior. Know both styles.
- Chain-of-thought and reasoning elicitation. Explicit "think step by step" prompting improves accuracy by roughly 10-30% on reasoning tasks (Wei et al., 2022). Extended thinking ("thinking" blocks, Anthropic's approach) gives the model a scratchpad for complex reasoning before it produces output. Know when to use CoT (math, logic, multi-step analysis) and when it is wasted tokens (simple extraction, classification). Zero-shot CoT ("think step by step") vs. few-shot CoT (provide worked examples): few-shot is more reliable but costs more tokens.
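The zero-shot vs. few-shot CoT distinction can be sketched as message construction. This is a minimal illustration, not any provider's official API; the message dict shape follows the common chat-completions convention, and the wording of the reasoning trigger is illustrative.

```python
def zero_shot_cot(question: str) -> list[dict]:
    """Zero-shot CoT: append a reasoning trigger so the model thinks before answering."""
    prompt = f"{question}\n\nThink step by step, then give the final answer on the last line."
    return [{"role": "user", "content": prompt}]

def few_shot_cot(question: str, worked_examples: list[tuple[str, str]]) -> list[dict]:
    """Few-shot CoT: interleave worked (question, reasoning + answer) pairs before the real question."""
    messages = []
    for q, reasoning in worked_examples:
        messages.append({"role": "user", "content": q})
        messages.append({"role": "assistant", "content": reasoning})
    messages.append({"role": "user", "content": question})
    return messages

example = [(
    "A train travels 60 miles in 1.5 hours. What is its speed?",
    "Distance = 60 miles, time = 1.5 hours. Speed = 60 / 1.5 = 40 mph.\nAnswer: 40 mph",
)]
msgs = few_shot_cot("A cyclist covers 45 miles in 3 hours. What is her speed?", example)
```

The worked example demonstrates both the reasoning style and the answer format, which is why few-shot CoT is more reliable than the bare trigger phrase.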
- Few-shot example design. The most underrated prompting technique. 3-5 well-chosen examples can outperform paragraphs of instruction. Design principles: cover the typical case, at least one edge case, and at least one negative example (what NOT to do). Format examples identically to the desired output. Use diverse examples; don't cluster on one type. For structured output (JSON, tables), examples are more effective than schema descriptions alone.
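Those design principles can be made concrete with a small builder. This is a hypothetical sketch: every example renders in the same XML shape, and the set covers a typical case, an edge case, and a labeled negative example. The field names and the invoice-extraction task are illustrative, not from any library.

```python
EXAMPLES = [
    {"input": "Invoice #1042, due 2024-03-01, total $250",
     "output": '{"invoice_id": "1042", "due": "2024-03-01", "total_usd": 250.0}',
     "label": "typical"},
    {"input": "Total $99, no invoice number given",
     "output": '{"invoice_id": null, "due": null, "total_usd": 99.0}',
     "label": "edge: missing fields stay null"},
    {"input": "Invoice #77, total $12",
     "output": "The invoice ID is 77 and it costs $12.",
     "label": "negative: prose instead of JSON, do NOT do this"},
]

def render_examples(examples: list[dict]) -> str:
    """Render every example in an identical XML shape for the prompt."""
    blocks = []
    for ex in examples:
        blocks.append(
            f"<example label={ex['label']!r}>\n"
            f"<input>{ex['input']}</input>\n"
            f"<output>{ex['output']}</output>\n"
            f"</example>"
        )
    return "\n\n".join(blocks)

few_shot_block = render_examples(EXAMPLES)
```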
- Prompt decomposition. Complex tasks fail as monolithic prompts. Decompose into stages: first classify the input, then process based on the classification, then format the output. This is the bridge between prompting and orchestration (Skill 3). LangChain's LCEL, LlamaIndex's query pipelines, and Claude Code's multi-turn approach all implement decomposition patterns. Know when to decompose (complex, multi-step tasks with branching logic) and when to keep it monolithic (simple tasks where decomposition adds latency without improving quality).
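The classify-then-process-then-format staging can be sketched as three narrow calls. The `llm()` function below is a stub standing in for a real provider SDK call; its canned logic exists only so the pipeline runs end to end, and the label set and prompts are illustrative.

```python
def llm(prompt: str) -> str:
    # Stub for a provider call: classifies by keyword, otherwise echoes the payload.
    if prompt.startswith("Classify"):
        return "complaint" if "refund" in prompt.lower() else "question"
    return prompt.split("\n", 1)[1]

def handle(user_input: str) -> str:
    # Stage 1: cheap classification with a constrained label set.
    label = llm(f"Classify as 'question' or 'complaint':\n{user_input}").strip()
    # Stage 2: branch to a stage-specific prompt instead of one monolithic prompt.
    if label == "complaint":
        draft = llm(f"Draft an empathetic reply to this complaint:\n{user_input}")
    else:
        draft = llm(f"Answer concisely:\n{user_input}")
    # Stage 3: enforce output constraints in a final, narrow pass.
    return llm(f"Rewrite in plain text under 100 words:\n{draft}")

reply = handle("I want a refund for my broken order")
```

Each stage gets a short, single-purpose prompt, which is easier to eval and debug than one prompt trying to do all three jobs.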
- Output format control. Structured output is the bridge between the LLM and application code. The options: JSON mode (OpenAI's `response_format: json_object`; Anthropic tool use for structured output), XML tags (Claude excels at XML-delimited responses), function calling / tool use (the modern pattern: define a schema, the model fills it), and constrained generation (Outlines, Guidance for guaranteed schema conformance with open-source models). Know that tool use / function calling is the most reliable structured-output pattern: it uses the model's fine-tuned tool-use capability rather than hoping the model follows format instructions.
- Prompt engineering tools. Anthropic Console (prompt playground, system prompt testing), OpenAI Playground (model comparison, parameter tuning), Humanloop (prompt versioning, A/B testing, evals), PromptLayer (prompt registry, version history, analytics), Promptfoo (CLI-based prompt testing, assertion-driven), LangSmith (prompt hub, version tracking, evaluation), and CLAUDE.md files (Claude Code's project-level prompt configuration). Know that prompt management is becoming a software engineering discipline: version control, testing, review, and deployment for prompts.
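The tool-use pattern for structured output comes down to defining a schema the model must fill, then validating the arguments it returns. The tool dict below follows Anthropic's `input_schema` shape; the tool name and fields are illustrative, and the validator is a deliberately cheap stand-in for a real JSON Schema library.

```python
record_invoice_tool = {
    "name": "record_invoice",
    "description": "Record structured fields extracted from an invoice.",
    "input_schema": {
        "type": "object",
        "properties": {
            "invoice_id": {"type": "string"},
            "total_usd": {"type": "number"},
        },
        "required": ["invoice_id", "total_usd"],
    },
}

def validate_tool_input(tool: dict, tool_input: dict) -> bool:
    """Check required keys and rough types before trusting the model's tool call."""
    schema = tool["input_schema"]
    type_map = {"string": str, "number": (int, float)}
    for key in schema["required"]:
        if key not in tool_input:
            return False
        expected = type_map[schema["properties"][key]["type"]]
        if not isinstance(tool_input[key], expected):
            return False
    return True
```

Validation matters even with tool use: the model fills the schema reliably, but application code should still fail closed on a malformed call.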
Expected Practical Skills
- Write a production-quality system prompt. Given a product requirement, design a system prompt with: clear role definition, task boundaries (what the system should and shouldn’t do), output format specification, edge case handling, and tone/style guidance. The prompt should be testable against an eval suite.
- Design few-shot examples for a new task. Select examples that cover the input distribution, format them consistently, include positive and negative cases, and validate that adding examples improves output quality vs. zero-shot.
- Iterate prompts with eval-driven feedback. Run an eval (Skill 9), identify the worst-performing categories, diagnose whether the issue is instruction ambiguity, missing examples, or task complexity beyond the model’s capability, and revise. Repeat until eval thresholds are met. This is the core development loop.
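The "identify the worst-performing categories" step of that loop can be sketched in a few lines. The result shape, a list of (category, passed) pairs from an eval run, is an assumption for illustration.

```python
from collections import defaultdict

def worst_categories(results: list[tuple[str, bool]], k: int = 1) -> list[str]:
    """Return the k categories with the lowest pass rate from an eval run."""
    totals: dict = defaultdict(int)
    passes: dict = defaultdict(int)
    for category, passed in results:
        totals[category] += 1
        passes[category] += int(passed)
    rates = {c: passes[c] / totals[c] for c in totals}
    return sorted(rates, key=rates.get)[:k]

run = [("extraction", True), ("extraction", True),
       ("summarization", False), ("summarization", True)]
# summarization (50% pass rate) surfaces before extraction (100%)
```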
- Adapt prompts across models. The same task may need different prompts for Claude vs. GPT-4o vs. Llama. Claude responds well to XML-delimited structure and detailed persona. GPT-4o responds well to markdown and concise instructions. Llama needs more explicit formatting constraints. Maintain per-model prompt variants when routing across providers (Skill 14).
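Per-model variants are easiest to maintain as a small registry keyed by model family, selected at routing time. This is a hypothetical sketch; the template strings compress the style differences described above into one toy summarization task.

```python
# One task, three provider-specific prompt styles (illustrative templates).
PROMPTS = {
    "claude": ("<instructions>\nSummarize the document below in 3 bullets.\n</instructions>\n"
               "<document>\n{doc}\n</document>"),                      # XML-delimited
    "gpt-4o": "## Task\nSummarize the document in 3 bullets.\n\n## Document\n{doc}",  # markdown
    "llama":  ("Summarize in EXACTLY 3 bullet points, each starting with '- '.\n\n"
               "Document:\n{doc}"),                                    # explicit format constraints
}

def prompt_for(model_family: str, doc: str) -> str:
    """Pick the variant for the routed model family and fill in the document."""
    return PROMPTS[model_family].format(doc=doc)
```

Keeping the variants in one place means an eval suite can run the same task against every family and catch a variant that drifts.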
- Debug prompt failures. When output quality is poor: read the model’s full output (including any reasoning traces), identify where it went wrong, test whether the instruction was ambiguous or the task exceeds the model’s capability, and fix. Common causes: conflicting instructions, insufficient examples, context window overflow pushing instructions out, and overly complex single-turn requests that should be decomposed.
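The context-overflow failure mode can be guarded against before the call. The sketch below uses a crude 4-characters-per-token heuristic rather than a real tokenizer, and the window and reserve sizes are illustrative defaults.

```python
def fits_in_context(system_prompt: str, context: str,
                    max_tokens: int = 128_000, reserve_output: int = 4_000) -> bool:
    """Rough check that instructions + context leave room for the output."""
    approx_tokens = (len(system_prompt) + len(context)) // 4  # crude heuristic
    return approx_tokens + reserve_output <= max_tokens

def truncate_context(system_prompt: str, chunks: list[str],
                     max_tokens: int = 128_000, reserve_output: int = 4_000) -> list[str]:
    """Drop the oldest chunks first so instructions and recent context survive."""
    kept: list[str] = []
    for chunk in reversed(chunks):  # newest chunk first
        candidate = "\n".join([chunk] + kept)
        if fits_in_context(system_prompt, candidate, max_tokens, reserve_output):
            kept.insert(0, chunk)
    return kept
```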
Interview-Ready Explanations
- “Walk me through how you’d design a system prompt for a complex production application.” Start with requirements: what should the system do, what should it never do, what format should output take? Structure the prompt in sections (XML tags for Claude, markdown headers for GPT): persona → task instructions → constraints → output format → examples. Include 3-5 few-shot examples covering typical + edge cases. Add guardrail instructions (“If the user asks about X, respond with Y”). Test against an eval suite of 50+ examples. Iterate: identify failures, revise, re-test. Version-control the prompt alongside the code.
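That section ordering can be sketched as a small assembler using XML tags (Claude-style). The section names are illustrative; the point is that the prompt is built from named, testable parts rather than one hand-edited blob.

```python
def build_system_prompt(persona: str, instructions: str, constraints: str,
                        output_format: str, examples: list[str]) -> str:
    """Assemble persona -> instructions -> constraints -> format -> examples as XML sections."""
    sections = [
        ("persona", persona),
        ("instructions", instructions),
        ("constraints", constraints),
        ("output_format", output_format),
        ("examples", "\n".join(examples)),
    ]
    return "\n".join(f"<{tag}>\n{body}\n</{tag}>" for tag, body in sections)

prompt = build_system_prompt(
    persona="You are a support agent for a billing product.",
    instructions="Answer billing questions using only the provided account context.",
    constraints="Never quote refund amounts. If asked about security, direct the user to the trust page.",
    output_format="Respond as JSON with keys 'answer' and 'confidence'.",
    examples=["<example>\n<input>...</input>\n<output>...</output>\n</example>"],
)
```

Because the prompt is built in code, each section can be version-controlled and swapped independently when an eval run points at a specific failure.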
- “How do you evaluate whether a prompt is working?” Never rely on manual testing (“looks good to me”). Build an eval dataset covering expected input categories (Skill 9). Run the prompt against the dataset. Score on multiple dimensions (accuracy, format compliance, tone, safety). Compare against a baseline (previous prompt version or a simpler approach). Use statistical significance testing for small improvements. Track eval scores over time, since prompts degrade as the application context changes.
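The multi-dimension scoring and baseline comparison can be sketched as follows. The dimension names and the crude scorers (a prefix check for JSON, a substring check for accuracy) are illustrative stand-ins for real graders.

```python
def score_output(output: str, expected_answer: str, max_len: int) -> dict[str, float]:
    """Score one output on several dimensions, each normalized to 0-1."""
    return {
        "format": 1.0 if output.strip().startswith("{") else 0.0,   # crude JSON check
        "accuracy": 1.0 if expected_answer in output else 0.0,      # crude substring check
        "length": 1.0 if len(output) <= max_len else 0.0,
    }

def compare_to_baseline(candidate: list[dict], baseline: list[dict]) -> dict[str, float]:
    """Per-dimension mean delta; positive means the new prompt version improved."""
    mean = lambda rows, dim: sum(r[dim] for r in rows) / len(rows)
    return {dim: mean(candidate, dim) - mean(baseline, dim) for dim in candidate[0]}

candidate = [score_output('{"answer": 42}', '"answer": 42', max_len=40)]
baseline = [{"format": 0.0, "accuracy": 1.0, "length": 1.0}]
delta = compare_to_baseline(candidate, baseline)
```

Reporting per-dimension deltas rather than one aggregate number shows *where* a revision helped, which feeds directly back into the iteration loop.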
- “What are the failure modes of prompt engineering?” Instruction drift (the model ignores instructions buried in the middle of a long prompt; put critical instructions first and last). Conflicting instructions (two parts of the prompt contradict each other). Over-specification (so many rules that the model becomes overly cautious and refuses reasonable requests). Context overflow (too much context pushes instructions out of the window). Example overfitting (the model copies the format of examples too literally and fails on novel inputs). Prompt injection (malicious users override system instructions; see Skill 15).
Related
- Eval Frameworks — eval-driven prompt development is the core workflow
- Model Routing — per-model prompt adaptation
- Guardrails & Safety — prompt injection defense