Quality & Measurement Fundamentals

Regression Detection

Detecting quality degradation from model updates, prompt changes, or provider switches.

Regression Detection — Competence

What an interviewer or hiring manager expects you to know.

Core Knowledge

  • What causes LLM regressions. Three distinct sources: (1) Your changes — prompt edits, retrieval pipeline modifications, system prompt updates, dependency upgrades. (2) Provider changes — OpenAI silently updates GPT-4, Anthropic releases a new Claude version, model behavior shifts without your code changing. (3) Data drift — the distribution of production inputs shifts over time (seasonal topics, new user segments, trending queries). Each source requires a different detection strategy. Provider-side regressions are the most insidious because nothing in your codebase changed.

  • Golden datasets as regression anchors. A golden dataset is a curated, labeled set of examples (typically 100 to 500) that represents your application’s critical capabilities. Each example has an input, an expected output (or acceptable-output criteria), and scores on the quality dimensions that matter. Before any change deploys, run the golden dataset and compare scores against the baseline. A statistically significant decline on any dimension blocks the deploy. Tools: Promptfoo (YAML-defined assertions, CI integration), Braintrust (dataset management + scoring), DeepEval (pytest-native assertions).
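
The baseline-comparison step can be sketched in a few lines. This is a minimal illustration, not any tool's actual schema or API; `GoldenExample`, the dimension names, and the 0.02 tolerance are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical minimal golden-dataset entry; real tools (Promptfoo,
# Braintrust, DeepEval) define their own richer schemas.
@dataclass
class GoldenExample:
    input: str
    expected: str                                   # expected output or acceptance criterion
    dimensions: dict = field(default_factory=dict)  # e.g. {"accuracy": 1.0}

def compare_to_baseline(baseline: dict, current: dict, tolerance: float = 0.02) -> list:
    """Return the quality dimensions that declined beyond the tolerance.

    `baseline` and `current` map dimension name -> mean score over the
    golden dataset. A non-empty return value would block the deploy.
    """
    return [
        dim for dim, base_score in baseline.items()
        if base_score - current.get(dim, 0.0) > tolerance
    ]

regressed = compare_to_baseline(
    {"accuracy": 0.94, "faithfulness": 0.90},
    {"accuracy": 0.89, "faithfulness": 0.91},
)
print(regressed)  # accuracy dropped by 0.05, beyond the 0.02 tolerance
```

The tolerance exists because per-run scores are noisy; a stricter version replaces it with a significance test over per-example results.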

  • Monitoring tools for production drift. Arize AI (real-time model monitoring, drift detection, embedding visualization — one of the most mature ML monitoring platforms), WhyLabs/LangKit (statistical profiling of LLM inputs/outputs, drift alerts, integrates with any pipeline), Evidently AI (open-source ML monitoring, data drift and model quality dashboards, now supports LLM-specific metrics), LangFuse (trace-level monitoring with scoring, production eval sampling), Datadog LLM Monitoring (integrates with existing APM infrastructure, useful for teams already on Datadog). Know that monitoring is necessary but not sufficient — you also need regression test suites that run proactively.
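
One common drift statistic these monitoring tools compute is the Population Stability Index over a binned input feature (prompt length, topic cluster, etc.). A self-contained sketch, with the conventional rule-of-thumb thresholds (tune for your own traffic):

```python
import math

def psi(baseline_counts, current_counts, eps=1e-6):
    """Population Stability Index between two binned distributions.

    Rule of thumb: PSI < 0.1 stable, 0.1-0.25 moderate drift,
    > 0.25 significant drift. Thresholds should be tuned per feature.
    """
    b_total = sum(baseline_counts)
    c_total = sum(current_counts)
    score = 0.0
    for b, c in zip(baseline_counts, current_counts):
        b_pct = max(b / b_total, eps)  # clamp to avoid log(0) on empty bins
        c_pct = max(c / c_total, eps)
        score += (c_pct - b_pct) * math.log(c_pct / b_pct)
    return score

# Binned prompt-length distribution: baseline week vs. current week
# (illustrative counts -- the shift toward longer prompts is deliberate).
baseline = [120, 300, 420, 130, 30]
current  = [60, 180, 380, 250, 130]
print(round(psi(baseline, current), 3))  # well above 0.25: significant drift
```

A PSI alert tells you the inputs changed, not that quality dropped, which is why the text pairs monitoring with proactive regression suites.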

  • CI/CD patterns for LLM quality. The “eval gate” pattern: every PR that changes a prompt, retrieval config, or model version triggers an eval run against the golden dataset, with pass/fail thresholds per metric. Promptfoo integrates with GitHub Actions, GitLab CI, and Jenkins. DeepEval runs as pytest. The key constraint: eval runs must be fast enough that they do not block the development loop — target under 5 minutes for the CI suite (use a subset of the golden dataset if the full set is too slow).
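
The gate itself reduces to a per-metric floor check, which Promptfoo expresses as YAML assertions and DeepEval as pytest asserts. A tool-agnostic sketch; the threshold values are the illustrative ones used later in this page:

```python
# Per-metric floors (illustrative values). In CI, a False result from
# eval_gate exits nonzero and blocks the merge.
THRESHOLDS = {"accuracy": 0.92, "faithfulness": 0.88}

def eval_gate(metrics, thresholds=THRESHOLDS):
    """Return (passed, failures) for one eval run's aggregate metrics."""
    failures = [
        f"{name}: {metrics.get(name, 0.0):.3f} < floor {floor:.2f}"
        for name, floor in thresholds.items()
        if metrics.get(name, 0.0) < floor
    ]
    return (not failures, failures)

passed, failures = eval_gate({"accuracy": 0.95, "faithfulness": 0.85})
print(passed, failures)  # faithfulness is below its floor, so the gate fails
```

Missing metrics count as 0.0 here, so an eval run that silently drops a metric also fails the gate rather than passing by omission.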

  • Regression test suite design. Cover three tiers: (1) Smoke tests — 10-20 critical examples that must always pass (exact match or high-threshold scoring). Run on every commit. (2) Core suite — 50-100 examples covering main capabilities. Run on PRs. (3) Full regression — 200-500 examples including edge cases. Run nightly or pre-release. Each tier has different latency budgets and failure thresholds. Version the suite alongside the code — when capabilities change, update the tests.
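
The three tiers above map naturally onto CI triggers. A small lookup makes the design concrete; tier sizes and trigger names are illustrative, not prescribed:

```python
# Hypothetical mapping of the three tiers to CI events, matching the
# tiered design described above (sizes are illustrative).
TIERS = {
    "smoke": {"size": 20,  "triggers": {"commit", "pr", "nightly"}},
    "core":  {"size": 100, "triggers": {"pr", "nightly"}},
    "full":  {"size": 500, "triggers": {"nightly"}},
}

def suites_for(trigger: str) -> list:
    """Which tiers run for a given CI event, cheapest first."""
    return [name for name, cfg in TIERS.items() if trigger in cfg["triggers"]]

print(suites_for("commit"))   # only the smoke tests
print(suites_for("nightly"))  # all three tiers
```

Versioning `TIERS` alongside the prompts it tests keeps the suite and the code in lockstep, as the bullet recommends.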

Expected Practical Skills

  • Build a golden dataset. Sample production traffic, select representative examples across use case categories, add ground truth labels (human annotation or verified outputs), include known edge cases and failure modes. Version-control the dataset. Update quarterly as the application evolves.
  • Set up an eval gate in CI. Configure Promptfoo or DeepEval to run on every PR that modifies prompt-related files. Define per-metric thresholds (e.g., accuracy ≥ 0.92, faithfulness ≥ 0.88). Block merges that fail. Produce a diff report: “accuracy dropped from 0.94 to 0.89 on these 7 examples.”
  • Detect provider-side regressions. Schedule nightly eval runs against a fixed golden dataset, even when you haven’t changed anything. If scores drop, the provider changed something. Alert the team. Have a rollback plan (pin to a previous model version if the API supports it, or switch providers).
  • Attribute regression to root cause. When quality drops, determine: was it a prompt change (check git history), a retrieval change (check index freshness, relevance scores), a model change (compare model version in traces), or data drift (compare production input distribution to training/eval distribution)?
  • Set up production monitoring. Instrument with LangFuse or Arize. Sample 1-10% of production traffic for automated scoring. Define alert thresholds for quality metrics. Set up dashboards showing quality trends over time.
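
The diff report mentioned above ("accuracy dropped from 0.94 to 0.89 on these 7 examples") just needs per-example scores from two runs keyed by a stable example id. A minimal sketch, assuming each run is a dict of example id to metric scores:

```python
def diff_report(baseline, candidate, metric, min_drop=0.05):
    """Per-example regression diff between two eval runs.

    `baseline` and `candidate` map example id -> {metric: score}.
    Returns the ids whose score dropped by at least `min_drop`,
    plus the aggregate before/after means.
    """
    def mean(run):
        return sum(scores[metric] for scores in run.values()) / len(run)

    dropped = [
        ex_id for ex_id, scores in baseline.items()
        if scores[metric] - candidate.get(ex_id, {}).get(metric, 0.0) >= min_drop
    ]
    return {"metric": metric, "before": mean(baseline),
            "after": mean(candidate), "regressed_examples": dropped}

baseline  = {"ex1": {"accuracy": 1.0}, "ex2": {"accuracy": 1.0}, "ex3": {"accuracy": 0.8}}
candidate = {"ex1": {"accuracy": 1.0}, "ex2": {"accuracy": 0.6}, "ex3": {"accuracy": 0.8}}
print(diff_report(baseline, candidate, "accuracy"))  # only ex2 regressed
```

Listing the regressed example ids (not just the aggregate delta) is what makes triage fast: you rerun exactly those inputs by hand.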

Interview-Ready Explanations

  • “Walk me through how you’d detect and handle a quality regression in a production LLM system.” Three layers: (1) Pre-deployment — eval gate in CI runs golden dataset on every change, blocks deploys that degrade quality. (2) Post-deployment — canary deployment with side-by-side comparison (new version on 5% of traffic, monitor metrics, promote or rollback). (3) Continuous — nightly eval runs detect provider-side changes and data drift. When a regression is detected: alert → triage (which metric, which examples, how severe) → attribute (prompt change? model change? data drift?) → remediate (rollback, fix, or update baseline if the change is intentional).
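
The canary step in layer (2) comes down to a promote-or-rollback decision over side-by-side metrics. A sketch of that decision rule, with a hypothetical 0.02 allowed drop:

```python
def canary_decision(control, canary, max_drop=0.02):
    """Promote or roll back a canary from side-by-side quality metrics.

    `control` and `canary` map metric -> mean score over the same traffic
    window. Any metric dropping by more than `max_drop` (an illustrative
    budget) triggers a rollback.
    """
    for metric, control_score in control.items():
        if control_score - canary.get(metric, 0.0) > max_drop:
            return f"rollback: {metric} regressed"
    return "promote"

print(canary_decision({"accuracy": 0.94}, {"accuracy": 0.90}))   # clear regression
print(canary_decision({"accuracy": 0.94}, {"accuracy": 0.935}))  # within budget
```

In practice the comparison should also require enough canary traffic for the means to be meaningful; a 5% slice needs time to accumulate samples.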

  • “How do you handle model provider updates that break your application?” Pin model versions when the API supports it (Anthropic model IDs, OpenAI model snapshots). Run nightly golden dataset evals even when you haven’t changed anything — this catches provider-side drift. When detected: quantify the regression (which metrics, by how much), test alternative model versions or providers, communicate to stakeholders (“model provider updated, quality dropped 5% on faithfulness, here’s our remediation plan”). For critical applications, maintain multi-provider fallback (if Claude degrades, route to GPT-4o).
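
The multi-provider fallback can be as simple as a preference-ordered list gated on each provider's latest golden-dataset score. Everything here is a placeholder — the provider names, the 0.9 floor, and the health dict a real router would populate from nightly eval runs:

```python
# Preference order: pinned primary model first, alternates after
# (hypothetical names; real code would wrap the actual SDK clients).
PROVIDERS = ["claude-pinned", "gpt-4o", "local-fallback"]

def pick_provider(health, min_score=0.9):
    """Return the first provider whose latest golden-dataset score clears
    the quality floor, falling through in preference order."""
    for name in PROVIDERS:
        if health.get(name, 0.0) >= min_score:
            return name
    raise RuntimeError("no provider meets the quality floor")

# Nightly evals show the primary degraded: route around it.
print(pick_provider({"claude-pinned": 0.86, "gpt-4o": 0.93}))
```

Feeding this from the same nightly eval runs that detect provider drift means detection and remediation share one source of truth.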

  • “What’s the difference between regression detection and general eval?” Eval measures absolute quality (“how good is this?”). Regression detection measures relative quality change (“is this worse than before?”). Regression detection requires: a stable baseline to compare against, a consistent dataset that doesn’t change between runs, and statistical methods to distinguish real degradation from noise. An eval can pass (absolute quality is good) while a regression is detected (quality dropped from excellent to good).
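
One standard way to "distinguish real degradation from noise" on a pass/fail golden dataset is a two-proportion z-test over pass rates. A self-contained sketch (the 470/440-of-500 numbers are illustrative):

```python
import math

def two_proportion_z(pass_base, n_base, pass_new, n_new):
    """z-statistic for a drop in pass rate between two eval runs.

    Uses the pooled-proportion standard error; z > 1.645 is significant
    at the 5% level for a one-sided test (did quality *drop*?).
    """
    p1, p2 = pass_base / n_base, pass_new / n_new
    pooled = (pass_base + pass_new) / (n_base + n_new)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_base + 1 / n_new))
    return (p1 - p2) / se

# 470/500 passing before vs. 440/500 after: a 6-point drop on n=500
# is far outside noise, so this flags a real regression.
print(round(two_proportion_z(470, 500, 440, 500), 2))
```

The same drop measured on 20 examples would not reach significance, which is the practical argument for keeping the full regression tier in the hundreds of examples.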