LLM Observability — Competence
What an interviewer or hiring manager expects you to know.
Core Knowledge
- What LLM observability covers. Tracing (following a request through every step of a multi-step pipeline — input, each LLM call, tool invocations, retrieval, output), metrics (latency per step, token counts, cost per request, error rates, quality scores), logging (structured records of every LLM interaction for debugging and audit), and alerting (notifications when metrics exceed thresholds — latency spike, cost anomaly, quality degradation, error rate increase). Traditional APM (Datadog, New Relic) covers infrastructure; LLM observability covers the AI-specific layer on top.
- The observability platform landscape. LangFuse (open-source, self-hostable, full tracing + cost tracking + evaluation scoring + prompt management; the leading OSS option). LangSmith (LangChain’s platform — tracing, datasets, evaluation, monitoring; tightly integrated with LangChain/LangGraph). Arize Phoenix (open-source, OpenTelemetry-based, embedding drift detection, trace-level evaluation). Helicone (proxy-based, request-level logging + cost tracking + caching; lightweight). Braintrust (eval-first with production logging; strong on scoring). Weights & Biases Weave (experiment tracking + LLM tracing; good for teams already on W&B). Datadog LLM Monitoring (integrates with existing Datadog APM — best for teams already on Datadog). OpenLLMetry/Traceloop (OpenTelemetry-based auto-instrumentation for LLM calls — send to any OTEL-compatible backend). Portkey (AI gateway with built-in observability). Patronus AI (evaluation-focused monitoring with hallucination detection).
- Trace anatomy for LLM systems. A trace captures one end-to-end request: a top-level span (user request → final response) containing child spans (retrieval call, LLM call, tool invocation, guardrail check). Each span records: start/end time, input, output, model used, token count (input + output), cost, and metadata (user ID, session ID, feature flag). For agent systems, traces are hierarchical: agent decision → tool call → sub-agent call → tool call. LangFuse and LangSmith both render these as visual trace trees.
- Cost attribution. Track LLM cost at multiple levels: per request (this API call cost $0.03), per feature (the search feature costs $450/month), per user/customer (customer X costs $12/month in LLM usage), per model (60% of spend is Sonnet, 30% Opus, 10% Haiku). Implementation: tag every LLM call with metadata (feature, user_id, team) via LangFuse metadata or LiteLLM tags. Aggregate in dashboards. This feeds into cost estimation (Skill 13) and model routing (Skill 14) decisions.
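The tag-then-aggregate pattern is a few lines of code. A sketch with hypothetical per-million-token prices (real prices vary by model and date); the point is that every call carries feature/user tags so any grouping falls out of the same log:

```python
from collections import defaultdict

# Hypothetical (input, output) prices per 1M tokens; check current pricing.
PRICES = {"sonnet": (3.00, 15.00), "haiku": (0.80, 4.00)}

def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    in_price, out_price = PRICES[model]
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

# Each logged call is tagged with its feature and user at call time.
calls = [
    {"model": "sonnet", "in": 2000, "out": 500, "feature": "search", "user": "a"},
    {"model": "haiku",  "in": 800,  "out": 200, "feature": "autocomplete", "user": "a"},
    {"model": "sonnet", "in": 1500, "out": 400, "feature": "search", "user": "b"},
]

def attribute(calls: list, key: str) -> dict:
    """Sum cost grouped by any tag: 'feature', 'user', or 'model'."""
    totals = defaultdict(float)
    for c in calls:
        totals[c[key]] += call_cost(c["model"], c["in"], c["out"])
    return dict(totals)

print(attribute(calls, "feature"))  # per-feature spend
print(attribute(calls, "model"))    # per-model spend
```

Per-request, per-feature, per-user, and per-model views are all the same aggregation over different tag keys, which is why tagging at call time matters.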
- Quality monitoring in production. Beyond latency and error rates: score a sample of production outputs using LLM-as-judge (Skill 10) or programmatic checks. Track quality scores over time. Alert when scores degrade. This is the online eval component from Skill 11 — observability is the infrastructure that makes it possible. LangFuse supports attaching scores to traces; Arize Phoenix supports evaluation as part of the monitoring pipeline.
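The sample-score-alert loop can be sketched in plain Python. The `judge_score` stub below stands in for a real LLM-as-judge call, and the names and thresholds are illustrative:

```python
import random

SAMPLE_RATE = 0.05  # score 5% of production traffic

def judge_score(question: str, answer: str) -> float:
    """Stand-in for an LLM-as-judge call (Skill 10); returns 0.0-1.0.
    A real implementation calls a model with a grading rubric."""
    return 1.0 if answer else 0.0  # placeholder heuristic

def maybe_score(trace_id: str, question: str, answer: str,
                scores: list, rng=random.random) -> None:
    """Sample a fraction of traffic, score it, attach the score to the trace."""
    if rng() < SAMPLE_RATE:
        scores.append({"trace_id": trace_id,
                       "score": judge_score(question, answer)})

def should_alert(scores: list, threshold: float = 0.8, window: int = 50) -> bool:
    """Fire when the rolling mean over the last `window` scores degrades."""
    recent = [s["score"] for s in scores[-window:]]
    return bool(recent) and sum(recent) / len(recent) < threshold
```

In a platform like LangFuse the `scores` list becomes scores attached to traces, and `should_alert` becomes a dashboard threshold; the control flow is the same.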
Expected Practical Skills
- Instrument an LLM application with LangFuse. Add trace creation to every LLM call. Capture: input, output, model, tokens, cost, latency, metadata (user, feature, session). Verify traces appear in the LangFuse UI. Set up cost dashboards.
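LangFuse's SDK provides decorators and client objects for this; as a dependency-free sketch of what such instrumentation captures per call (the wrapper, field names, and stub LLM function here are all illustrative, not the LangFuse API):

```python
import functools
import time

TRACES = []  # in a real setup these records go to the observability backend

def traced(feature: str):
    """Wrap an LLM call to capture input, output, tokens, latency, metadata."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(prompt: str, **meta):
            start = time.perf_counter()
            out = fn(prompt, **meta)
            TRACES.append({
                "name": fn.__name__,
                "feature": feature,
                "input": prompt,
                "output": out["text"],
                "model": out["model"],
                "tokens": out["input_tokens"] + out["output_tokens"],
                "latency_s": time.perf_counter() - start,
                "metadata": meta,
            })
            return out
        return inner
    return wrap

@traced(feature="search")
def call_llm(prompt: str, **meta):
    # stub standing in for a real model API call
    return {"text": "answer", "model": "sonnet",
            "input_tokens": 10, "output_tokens": 5}

call_llm("What is our refund policy?", user_id="u-42")
print(TRACES[0]["feature"], TRACES[0]["tokens"])  # → search 15
```

The key habit is that every call site passes metadata (user, feature, session) so that the dashboards and cost-attribution views described above have something to group by.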
- Build a monitoring dashboard. Visualize: request volume over time, p50/p95/p99 latency, cost per day/week/month (broken down by model and feature), error rate by type (timeout, rate limit, content policy, parsing failure), and quality score trends (from online eval).
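The latency percentiles on such a dashboard are worth being able to compute by hand. A nearest-rank sketch over a synthetic latency sample (the data is made up for illustration):

```python
import math

def percentile(values: list, p: float):
    """Nearest-rank percentile: smallest value with >= p% of samples <= it."""
    ordered = sorted(values)
    k = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[k]

latencies_ms = list(range(100, 2100, 20))  # 100 synthetic samples, 100-2080 ms
for p in (50, 95, 99):
    print(f"p{p} = {percentile(latencies_ms, p)} ms")
```

p95/p99 matter more than the mean here because LLM latency is long-tailed: a handful of slow retries or long generations dominates user experience without moving the average much.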
- Set up alerting. Configure alerts for: latency p95 exceeds 5s (performance degradation), error rate exceeds 5% (system health), daily cost exceeds budget (cost control), quality score drops below threshold (quality regression — Skill 11). Route alerts to Slack/PagerDuty.
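The four thresholds above reduce to a simple check over aggregated metrics. A sketch (metric names and the budget default are illustrative; routing to Slack/PagerDuty replaces the returned list in production):

```python
def check_alerts(metrics: dict, budget_usd: float = 100.0) -> list:
    """Evaluate the four alert conditions; return messages for any breaches."""
    alerts = []
    if metrics["latency_p95_s"] > 5:
        alerts.append("latency: p95 above 5s")
    if metrics["error_rate"] > 0.05:
        alerts.append("errors: rate above 5%")
    if metrics["daily_cost_usd"] > budget_usd:
        alerts.append("cost: daily spend over budget")
    if metrics["quality_score"] < 0.8:
        alerts.append("quality: score below threshold")
    return alerts  # a real system routes these to Slack/PagerDuty

print(check_alerts({"latency_p95_s": 6.2, "error_rate": 0.02,
                    "daily_cost_usd": 140.0, "quality_score": 0.9}))
```

Running this on the example metrics flags the latency and cost breaches while staying quiet on errors and quality, which is the behavior you want: one alert per distinct condition, not one alarm for "something is wrong."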
- Debug a production issue using traces. Given a user report (“the AI gave a wrong answer”), find the trace, examine each span (was the retrieval relevant? was the prompt correct? was the model output reasonable?), identify the failure point, and determine the root cause.
- Implement OpenTelemetry for LLM calls. Use OpenLLMetry or manual OTEL instrumentation to send LLM traces to any OTEL-compatible backend (Grafana Tempo, Jaeger, Honeycomb). This decouples instrumentation from the observability platform.
Interview-Ready Explanations
- “Walk me through how you’d set up observability for a production LLM application.” Three layers: (1) Tracing — instrument every LLM call with LangFuse or OpenTelemetry. Capture input, output, model, tokens, cost, latency, and metadata per call. For multi-step pipelines, nest spans hierarchically. (2) Metrics & dashboards — aggregate traces into dashboards: request volume, latency percentiles, cost attribution by feature/model/user, error rates by type. (3) Alerting — define thresholds for latency, cost, errors, and quality scores. Alert on-call when thresholds are breached. Add online quality scoring: sample 5% of production traffic, score with LLM-as-judge, track trends, alert on degradation.
- “How do you debug a quality issue reported by a user?” Start with the trace. Find the request by timestamp/user ID. Walk the trace tree: was the retrieval step relevant (check retrieved chunks against the query)? Was the prompt correct (check the assembled prompt for completeness)? Was the model output reasonable given the context (read the full output)? Was the output parsing correct (check for data loss in parsing)? Did guardrails modify the output (check guardrail scores)? The trace gives you the full execution path — the failure is in one of these spans.
- “What’s the difference between LLM observability and traditional APM?” Traditional APM (Datadog, New Relic) monitors infrastructure: CPU, memory, HTTP status codes, request latency. LLM observability monitors the AI layer: token counts, model selection, prompt content, output quality, retrieval relevance, guardrail decisions. You need both — APM catches “the server is down,” LLM observability catches “the model is giving wrong answers.” They’re complementary: APM for infrastructure health, LLM observability for AI quality.
Related
- Regression Detection — observability is the infrastructure for regression monitoring
- Cost Estimation — observability provides the cost data
- Model Routing — observability tracks per-model performance