CapabilityAtlas
Quality & Measurement Fundamentals

LLM-as-Judge

Automated evaluation with LLM evaluators: rubric design, calibration, detecting evaluator bias.

LLM-as-Judge Design — Competence

What an interviewer or hiring manager expects you to know.

Core Knowledge

  • What LLM-as-judge is and why it matters. Using one LLM to evaluate the output of another LLM (or the same LLM). It’s the scalable alternative to human evaluation — you can score 10,000 outputs in minutes instead of weeks. The foundational paper is “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena” (Zheng et al., 2023), which showed GPT-4 achieved >80% agreement with human judges on MT-Bench. Every major eval platform now supports it: Braintrust (custom scoring prompts), Promptfoo (LLM-graded assertions), DeepEval (built-in LLM eval metrics), Ragas (faithfulness and relevancy via LLM judge).

  • Rubric design. The judge is only as good as its rubric. A rubric specifies: what to evaluate (dimension), how to score (scale — binary, 1-5, 1-10), what each score level means (anchor descriptions with examples), and what evidence to cite. Example rubric for “helpfulness”: 5 = “Directly answers the question with actionable, specific information”; 3 = “Partially addresses the question but missing key details”; 1 = “Does not address the question or provides irrelevant information.” Without explicit anchor descriptions, LLM judges default to giving everything 4/5 (positivity bias).
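A rubric like the “helpfulness” example above translates directly into a judging prompt. A minimal sketch, assuming the anchor wording from the bullet and an illustrative JSON output format (the template structure and function name are not from any particular platform):

```python
# Sketch of a rubric-based judging prompt using the "helpfulness" rubric
# above. The template layout and JSON schema are illustrative assumptions.

HELPFULNESS_RUBRIC = """\
Dimension: helpfulness
Scale: 1-5
5: Directly answers the question with actionable, specific information.
3: Partially addresses the question but missing key details.
1: Does not address the question or provides irrelevant information.
"""

JUDGE_TEMPLATE = """\
You are an evaluator. Score the response below against this rubric.
Length should not affect scoring.

{rubric}
Question: {question}
Response: {response}

Reply with JSON only: {{"score": <1-5>, "reasoning": "<cite evidence>"}}
"""

def build_judge_prompt(question: str, response: str) -> str:
    """Render the judging prompt for one (question, response) pair."""
    return JUDGE_TEMPLATE.format(
        rubric=HELPFULNESS_RUBRIC, question=question, response=response
    )
```

Note the explicit anchor descriptions for 5, 3, and 1, plus the “length should not affect scoring” line — both bias mitigations described in this section baked directly into the prompt.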

  • Known biases in LLM judges. Positivity bias (scores skew high — mitigate with calibration examples showing low scores). Verbosity bias (longer outputs score higher regardless of quality — mitigate by explicitly instructing “length should not affect scoring”). Self-enhancement bias (models prefer outputs that match their own style — mitigate by using a different model family as judge than the one being evaluated). Position bias (in pairwise comparison, preference for the first or second option — mitigate by running both orderings and averaging). Format bias (preference for bullet points, markdown, structured output over plain text).

  • Judging paradigms. Pointwise (score a single output on a rubric — simplest, most common), pairwise (compare two outputs and pick the better one — more reliable for ranking, used by Chatbot Arena / LMSYS), reference-based (score output against a gold standard answer — useful when ground truth exists), and multi-turn (evaluate across a conversation, not just one response — hardest). Pairwise is more reliable than pointwise for detecting small quality differences because relative comparison is easier than absolute scoring.

  • Calibration against humans. The judge is useful only if it agrees with human judgment. Calibrate by: taking 50-100 outputs, having both humans and the LLM judge score them on the same rubric, measuring agreement (Cohen’s kappa for categorical, Spearman/Pearson correlation for continuous scores). Target: >0.7 agreement. If lower, the rubric needs revision or the task is too subjective for LLM judging. Re-calibrate whenever you change the rubric, the judge model, or the application domain.
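Both agreement metrics are small enough to compute without a stats library. A stdlib-only sketch (Cohen’s kappa for categorical labels, and the classic Spearman rank formula, which assumes no tied scores — for ties, use a library implementation):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two categorical ratings of the same items.
    Assumes at least two distinct labels (else expected agreement is 1)."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (observed - expected) / (1 - expected)

def spearman(a, b):
    """Spearman rank correlation via 1 - 6*sum(d^2)/(n(n^2-1)).
    Valid only when neither list contains tied values."""
    def ranks(xs):
        order = sorted(range(len(xs)), key=lambda i: xs[i])
        r = [0.0] * len(xs)
        for rank, i in enumerate(order):
            r[i] = rank + 1
        return r
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n * n - 1))
```

Feed both functions the human scores and the judge scores on the calibration set; below the 0.7 target, revisit the rubric before trusting the judge.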

Expected Practical Skills

  • Build an LLM-as-judge scoring pipeline. Define a rubric for a specific use case, implement the judging prompt (system prompt with rubric + examples + output to judge), parse the judge’s response (extract score + reasoning), run at scale across a dataset, validate against human scores on a calibration set. Use Braintrust or Promptfoo for infrastructure.
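The parsing step is where hand-rolled pipelines most often break, because judges sometimes wrap their JSON in prose. A defensive sketch (the expected {"score", "reasoning"} schema is an assumption carried over from the rubric discussion, not a standard):

```python
import json
import re

def parse_judge_response(raw: str) -> dict:
    """Extract {"score": ..., "reasoning": ...} from a judge reply.
    Tolerates surrounding prose by grabbing the first JSON-looking object;
    raises ValueError on missing or out-of-range scores so bad judgments
    fail loudly instead of silently polluting aggregate metrics."""
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match is None:
        raise ValueError(f"no JSON object in judge reply: {raw[:80]!r}")
    obj = json.loads(match.group(0))
    score = int(obj["score"])
    if not 1 <= score <= 5:
        raise ValueError(f"score out of range: {score}")
    return {"score": score, "reasoning": obj.get("reasoning", "")}
```

Platforms like Braintrust and Promptfoo handle this parsing for you; the sketch shows what they are doing under the hood and why structured output instructions matter.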
  • Design rubrics that minimize bias. Include explicit “low score” examples in the rubric (combats positivity bias). Add “ignore length” instructions (combats verbosity bias). Use structured output for scores (JSON with “score” and “reasoning” fields — prevents the judge from hedging).
  • Run pairwise comparisons. Given two prompt variants, have the judge compare outputs pairwise on 100+ examples. Run both orderings (A-B and B-A) to control for position bias. Report win/loss/tie rates with confidence intervals.
  • Validate judge reliability. Compute inter-rater agreement between the LLM judge and human annotators. Report Cohen’s kappa or correlation. Flag dimensions where the judge is unreliable and supplement with human eval for those.
  • Cost-optimize judging. Use a cheaper model (Sonnet, GPT-4o-mini) for development iteration and the strong model (Opus, GPT-4o) for final validation. Cache judge results for identical inputs. Batch judging calls to reduce API overhead.

Interview-Ready Explanations

  • “Walk me through how you’d design an LLM-as-judge system.” Start with the evaluation dimensions (what aspects of quality matter for this use case). For each dimension, write a rubric with 3-5 score levels and concrete anchor descriptions. Implement as a structured judging prompt that outputs JSON (score + reasoning). Calibrate against 50+ human-judged examples — target >0.7 agreement. Integrate into the eval pipeline: judge scores become the primary quality signal for development iteration. Monitor judge consistency over time (judge model updates can change scoring behavior).

  • “What are the limitations of LLM-as-judge?” Known biases (positivity, verbosity, self-enhancement, position). Inability to verify factual claims against external reality (the judge can assess coherence but not ground truth). Difficulty with highly subjective dimensions (humor, creativity, cultural appropriateness). Cost at scale (judging 10K outputs with Opus costs $50-200). Sensitivity to rubric wording (small changes in the rubric prompt can shift scores significantly). Recursive trust problem: you need to evaluate the evaluator, which requires human judgment you’re trying to replace.

  • “When would you NOT use LLM-as-judge?” When factual accuracy matters and ground truth exists (use programmatic exact-match or F1 instead). When the domain requires deep expertise the judge model doesn’t have (medical, legal — use domain experts). When the stakes are too high for automated evaluation (consequential decisions). When budget doesn’t allow validation against human judges (an uncalibrated LLM judge can be worse than no judge).