CapabilityAtlas
Quality & Measurement Fundamentals

Eval Frameworks

Evaluation suites that measure what matters: dataset curation, metrics, statistical significance.

Rigorous Eval Frameworks — Competence

What an interviewer or hiring manager expects you to know.

Core Knowledge

  • The eval platform landscape. Braintrust (eval + logging, strong on dataset management and scoring, used by Notion and Vercel), Promptfoo (open-source CLI for prompt/model comparison, YAML-based test definitions, CI/CD integration), DeepEval (Python testing framework with pytest integration, 14+ built-in metrics including hallucination and answer relevancy), Ragas (open-source, specialized in RAG evaluation: faithfulness, answer relevancy, context precision/recall), LangSmith (LangChain’s platform — tracing, datasets, evaluation runs, annotation queues), Arize Phoenix (open-source observability + eval, embeddings analysis, trace-level evaluation), Humanloop (prompt management + eval, human annotation workflows). Know the trade-offs: Promptfoo for CI-driven testing, Braintrust for dataset-centric eval, Ragas for RAG-specific metrics, DeepEval for pytest-native teams.

  • Core metrics and when to use them. Factual accuracy (does the output match known ground truth — use for Q&A, data extraction), faithfulness/groundedness (is the output supported by provided context — critical for RAG), relevancy (does the output address the question asked), coherence (is the output well-structured and readable), toxicity (harmful content — use content safety classifiers), latency (time-to-first-token and total response time), cost (tokens consumed × model pricing). Know that no single metric captures “quality” — you always need a multi-dimensional rubric.
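    The multi-dimensional-rubric idea can be sketched in plain Python. Everything here is hypothetical naming (`Scorecard`, `cost_usd`, `report`); the dimensions mirror the list above, and cost follows the tokens × pricing formula:

    ```python
    from dataclasses import dataclass

    # Hypothetical per-output scorecard; dimensions follow the list above.
    @dataclass
    class Scorecard:
        accuracy: float      # 1.0 if the output matches ground truth, else 0.0
        faithfulness: float  # fraction of claims supported by the provided context
        relevancy: float     # does the output address the question asked (0-1)
        latency_s: float     # total response time in seconds
        tokens: int          # tokens consumed

    def cost_usd(tokens: int, price_per_1k: float) -> float:
        """Cost = tokens consumed x model pricing (here, per 1K tokens)."""
        return tokens / 1000 * price_per_1k

    def report(cards: list[Scorecard], price_per_1k: float) -> dict:
        """Aggregate per-dimension means -- no single 'quality' number."""
        n = len(cards)
        return {
            "accuracy": sum(c.accuracy for c in cards) / n,
            "faithfulness": sum(c.faithfulness for c in cards) / n,
            "relevancy": sum(c.relevancy for c in cards) / n,
            "p50_latency_s": sorted(c.latency_s for c in cards)[n // 2],
            "total_cost_usd": sum(cost_usd(c.tokens, price_per_1k) for c in cards),
        }
    ```

    Reporting each dimension separately (rather than a weighted composite) keeps regressions visible: a prompt change that raises accuracy but tanks faithfulness shows up immediately.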

  • Dataset curation. An eval is only as good as its test set. Build datasets that cover: typical cases (80% of traffic), edge cases (unusual inputs, ambiguous queries), adversarial cases (attempts to break the system), failure-mode-specific cases (hallucination triggers, context overflow scenarios). Label with ground truth where possible. Minimum viable test set: 50-100 examples for development iteration, 500+ for statistical confidence. Avoid test set contamination — never let eval examples leak into prompts, few-shot examples, or fine-tuning data.
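    A minimal sketch of the stratified-sampling and contamination checks above, assuming each example is a dict with hypothetical `input`, `expected`, and `kind` fields (`kind` being one of typical/edge/adversarial):

    ```python
    import random

    def build_test_set(pool, n, mix=None, seed=0):
        """Sample a test set that mirrors the target mix of case kinds
        (default: 80% typical, 10% edge, 10% adversarial)."""
        mix = mix or {"typical": 0.8, "edge": 0.1, "adversarial": 0.1}
        rng = random.Random(seed)  # fixed seed so the eval set is reproducible
        out = []
        for kind, frac in mix.items():
            candidates = [ex for ex in pool if ex["kind"] == kind]
            out.extend(rng.sample(candidates, min(round(n * frac), len(candidates))))
        return out

    def contamination(test_set, prompt_examples):
        """Inputs appearing both in the eval set and in few-shot/prompt
        examples -- this set should always be empty."""
        return {ex["input"] for ex in test_set} & {ex["input"] for ex in prompt_examples}
    ```

    Running `contamination()` as an assertion in CI is a cheap way to enforce the no-leakage rule every time either the prompt or the dataset changes.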

  • Statistical rigor. LLM outputs are non-deterministic — running the same eval twice gives different results. Use multiple runs (3-5 minimum) and report confidence intervals. For A/B comparisons (prompt A vs. prompt B), use paired statistical tests (McNemar’s test for binary outcomes, Wilcoxon signed-rank for scores). Know that a 2% improvement on 50 examples is noise; the same improvement on 500 examples with p<0.05 is signal. Beware Goodhart’s law: when an eval metric becomes the optimization target, it stops being a good metric. Regularly audit whether your metrics still correlate with actual user satisfaction.
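    Both ideas fit in a few lines of stdlib Python. This is a sketch, not a stats library: `mean_ci` uses a normal approximation over repeated runs, and `mcnemar_exact` is the exact binomial form of McNemar's test for the paired binary A/B comparison described above:

    ```python
    import math
    import statistics

    def mean_ci(scores, z=1.96):
        """Approximate 95% confidence interval over repeated eval runs."""
        m = statistics.mean(scores)
        se = statistics.stdev(scores) / math.sqrt(len(scores))
        return m, (m - z * se, m + z * se)

    def mcnemar_exact(b, c):
        """Exact McNemar test for a paired prompt-A-vs-prompt-B comparison.
        b = examples only prompt A got right, c = examples only prompt B got
        right; examples both got right (or wrong) carry no signal."""
        n = b + c
        if n == 0:
            return 1.0
        k = min(b, c)
        tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
        return min(1.0, 2 * tail)  # two-sided p-value
    ```

    Note that only the discordant pairs (b and c) enter the test, which is why a "2% improvement" on a small set so often turns out to rest on a handful of flipped examples.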

  • Eval types. Offline evals (run against a static dataset, batch mode — for development and regression testing), online evals (monitor production traffic in real-time — for drift detection and quality monitoring), human evals (subject matter experts rate outputs — gold standard but expensive and slow), LLM-as-judge (another model rates outputs — scalable but has systematic biases, covered in Skill 10). Know when each type is appropriate and how they complement each other.

Expected Practical Skills

  • Build an eval pipeline from scratch. Define metrics for a specific use case, curate a test dataset (50+ examples with ground truth), implement scoring (programmatic for extractable metrics, LLM-as-judge for subjective quality), run evals, produce a report with scores and confidence intervals. Use Promptfoo or Braintrust for the infrastructure.
  • Run a prompt comparison. Given two prompt variants, run both against the same dataset, score with the same rubric, and determine which is better with statistical confidence. Report per-metric breakdowns, not just aggregate scores.
  • Design eval metrics for a new use case. Given a product requirement (“our chatbot should answer customer questions accurately and helpfully”), decompose into measurable dimensions (accuracy, helpfulness, safety, brand-voice adherence) and define scoring criteria for each.
  • Integrate evals into CI/CD. Set up Promptfoo or DeepEval to run on every prompt/code change. Define pass/fail thresholds. Block merges that degrade quality below threshold. This is the “unit tests for LLMs” pattern.
  • Curate and maintain eval datasets. Build datasets from production traffic (sample, anonymize, label), synthetic generation (use LLMs to generate diverse test cases), and manual creation (domain experts write edge cases). Version datasets and track changes.
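The "unit tests for LLMs" pattern above reduces to a small harness. A minimal sketch, with `generate` and `score` as stand-ins for your application and rubric (the threshold value is illustrative):

```python
def run_eval(dataset, generate, score):
    """Run every example through the system and score against ground truth.
    `generate` and `score` are stand-ins for your app and your rubric."""
    results = [score(generate(ex["input"]), ex["expected"]) for ex in dataset]
    return sum(results) / len(results)

def ci_gate(accuracy, threshold=0.85):
    """Pass/fail gate for CI: a nonzero exit code blocks the merge."""
    return 0 if accuracy >= threshold else 1
```

In a real pipeline you would call `sys.exit(ci_gate(run_eval(...)))` from the CI job, which is exactly how Promptfoo- or DeepEval-based setups signal a failing threshold to the build system.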

Interview-Ready Explanations

  • “Walk me through how you’d design an eval framework for a new LLM application.” Start with the use case requirements — what does “good” mean for this product? Decompose into 3-5 measurable dimensions. For each dimension, define: metric type (programmatic vs. LLM-as-judge vs. human), scoring scale (binary, 1-5, continuous), and examples of what each score level looks like. Build an initial dataset of 50-100 examples covering typical, edge, and adversarial cases. Run baseline eval. Iterate: change prompt/model/retrieval → eval → compare. Graduate to CI integration when the framework is stable. Add online monitoring for production drift.

  • “How do you know your eval metrics actually measure what matters?” Calibrate against human judgment. Take a sample of 50-100 outputs, have humans rate them, and measure correlation between your automated metrics and human scores. If correlation is <0.7, the metric is unreliable. Also: track user-facing signals (thumbs up/down, task completion rate, support tickets) and check that metric improvements translate to user satisfaction improvements. If your accuracy metric improves 10% but user satisfaction doesn’t budge, the metric is measuring the wrong thing.
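    The calibration step above is just a correlation check. A sketch using the stdlib's Pearson correlation (`statistics.correlation`, Python 3.10+); `calibrate` and the 0.7 floor mirror the rule of thumb in the answer:

    ```python
    from statistics import correlation

    def calibrate(auto_scores, human_scores, floor=0.7):
        """Correlate an automated metric with human ratings on the same
        outputs; below the floor, treat the metric as unreliable."""
        r = correlation(auto_scores, human_scores)
        return r, r >= floor
    ```

    For ordinal 1-5 ratings, a rank correlation (Spearman) is often the better choice than Pearson, but the calibration workflow is identical.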

  • “What are the failure modes of LLM evaluation?” Overfitting to eval set (model/prompt optimized for test cases but fails on real traffic — mitigate with held-out test sets and production monitoring). Goodhart’s law (metric gaming — e.g., optimizing for “includes a citation” leads to outputs that cite irrelevant sources). Eval contamination (test examples leak into training/prompting). Single-metric tunnel vision (optimizing accuracy while coherence degrades). LLM-as-judge bias (systematic preferences for longer outputs, specific formats, or the judging model’s own style).