Integration & Operations Fundamentals

Model Routing

Multi-model architectures that route by task type, cost, latency, or quality, with graceful degradation when providers fail.

Model Routing & Fallback Design — Competence

What an interviewer or hiring manager expects you to know.

Core Knowledge

  • The model landscape and pricing tiers. Know the current provider lineup and where each sits on the cost-quality-latency spectrum. Anthropic: Opus 4 (≈$15/$75 per MTok, highest reasoning), Sonnet 4 (≈$3/$15, best cost-quality balance), Haiku (≈$0.25/$1.25, fastest/cheapest). OpenAI: o3 (≈$10/$40, strong reasoning), GPT-4o (≈$2.50/$10, general-purpose), GPT-4o-mini (≈$0.15/$0.60, cheap). Google: Gemini 2.5 Pro (≈$1.25/$10, long context), Gemini 2.5 Flash (≈$0.15/$0.60, fast/cheap). Open-source: Llama 3.1 (70B/405B, self-hosted or via providers like Together AI, Fireworks, Groq), Mistral Large, DeepSeek V3. Know that pricing drops 30-50% annually while capabilities improve.
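
The cost arithmetic behind these tiers is worth internalizing. A minimal sketch, using the illustrative per-MTok prices from the list above (real prices change frequently; check provider pages):

```python
# Token-cost arithmetic for comparing model tiers. Prices are
# illustrative (USD per million tokens) and change frequently.
PRICES = {  # model -> (input price, output price) per MTok
    "opus-4":      (15.00, 75.00),
    "sonnet-4":    (3.00, 15.00),
    "haiku":       (0.25, 1.25),
    "gpt-4o":      (2.50, 10.00),
    "gpt-4o-mini": (0.15, 0.60),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for a single request."""
    in_price, out_price = PRICES[model]
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

# A typical 2k-input / 500-output request: Opus 4 vs. Haiku is roughly
# a 60x cost difference at these prices.
```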

  • Routing strategies. Task-based routing (classify the task type, route to appropriate model — e.g., simple extraction → Haiku, complex reasoning → Opus). Quality-threshold routing (attempt with cheap model first, escalate to expensive model if quality score is below threshold). Cost-budget routing (route to cheapest model that meets a minimum quality bar for this task type). Latency-based routing (real-time chat → fast model, batch processing → slower but higher quality). Cascade routing (try cheap → evaluate → retry with expensive if needed — trade latency for cost).
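
Task-based routing can be as simple as a keyword classifier in front of a model map. A minimal sketch (the categories, keywords, and model assignments are illustrative, not a recommendation):

```python
# Minimal task-based router: classify the request, then route to the
# cheapest model assumed adequate for that category.
ROUTE = {
    "extraction": "haiku",      # simple structured extraction
    "generation": "gpt-4o",     # moderate drafting / generation
    "reasoning":  "opus-4",     # complex multi-step reasoning
}

def classify(prompt: str) -> str:
    """Toy keyword classifier; production systems often use a cheap LLM call."""
    p = prompt.lower()
    if any(k in p for k in ("extract", "parse", "list the")):
        return "extraction"
    if any(k in p for k in ("prove", "plan", "debug", "why")):
        return "reasoning"
    return "generation"

def route(prompt: str) -> str:
    return ROUTE[classify(prompt)]
```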

  • AI gateway tools. LiteLLM (open-source proxy that normalizes 100+ provider APIs into one interface, handles routing, fallback, rate limiting, cost tracking — the most widely adopted OSS gateway). Portkey (commercial AI gateway with routing, caching, fallback, observability, guardrails — $0 for small scale, enterprise pricing at scale). Martian (model router that automatically selects the cheapest model meeting a quality threshold — ML-based routing, not rule-based). Unify AI (multi-provider routing for cost/latency optimization with quality benchmarking). Helicone (proxy with cost tracking and caching, lighter than LiteLLM). OpenRouter (unified API for 100+ models, pay-per-token with markup). Know LiteLLM cold — it’s the default for most teams.
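
As a concrete reference point, a LiteLLM proxy is driven by a YAML config along these lines. This is an illustrative fragment following the conventions of LiteLLM's `config.yaml` (the model IDs and environment variable names are placeholders; check the LiteLLM docs for current field names):

```yaml
# Illustrative LiteLLM proxy config: two logical models with a
# primary -> secondary fallback chain across providers.
model_list:
  - model_name: primary
    litellm_params:
      model: anthropic/claude-sonnet-4
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: secondary
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY

router_settings:
  fallbacks:
    - primary: ["secondary"]
  num_retries: 2
  timeout: 30   # seconds
```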

  • Fallback patterns. Provider failover (primary → secondary → tertiary when a provider is down or rate-limited). Model degradation (Opus → Sonnet → Haiku as a quality/cost fallback chain). Timeout fallback (if primary doesn’t respond within 5s, route to secondary). Rate-limit fallback (when you hit rate limits on one provider, overflow to another). Circuit breaker (after N failures in M seconds, stop trying the failing provider for a cooldown period, route all traffic to backup).
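
The circuit-breaker and failover patterns above compose naturally. A minimal sketch, with made-up thresholds matching the "N failures in M seconds" description (the `clock` parameter exists only to make the sketch testable):

```python
import time

# Per-provider circuit breaker: after `max_failures` errors inside
# `window` seconds, skip the provider for `cooldown` seconds.
class CircuitBreaker:
    def __init__(self, max_failures=3, window=30.0, cooldown=60.0,
                 clock=time.monotonic):
        self.max_failures, self.window, self.cooldown = max_failures, window, cooldown
        self.failures = []        # timestamps of recent failures
        self.tripped_at = None
        self.clock = clock

    def record_failure(self):
        now = self.clock()
        self.failures = [t for t in self.failures if now - t < self.window]
        self.failures.append(now)
        if len(self.failures) >= self.max_failures:
            self.tripped_at = now

    def allow(self) -> bool:
        if self.tripped_at is None:
            return True
        if self.clock() - self.tripped_at >= self.cooldown:
            self.tripped_at, self.failures = None, []  # cooldown over
            return True
        return False

def call_with_fallback(providers, breakers, request):
    """Try providers in order, skipping any whose breaker is tripped."""
    for name, call in providers:
        if not breakers[name].allow():
            continue
        try:
            return call(request)
        except Exception:
            breakers[name].record_failure()
    raise RuntimeError("all providers failed or tripped")
```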

  • Caching for cost reduction. Semantic caching (cache responses for semantically similar queries, not just exact matches — GPTCache, Portkey cache). Exact-match caching (same input → same output, simple but effective for repeated queries). Prompt caching (Anthropic’s prompt caching for reusing long system prompts — reduces input token costs by up to 90% on cached portions). KV-cache reuse (for self-hosted models, reuse computed key-value caches across requests with shared prefixes). Know that caching is the highest-ROI cost optimization for most systems — 30-60% cost reduction is typical for applications with query repetition.
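
Exact-match caching is a few lines of code, which is why it is the usual starting point. A sketch with hit-rate tracking (a semantic cache would replace the hash key with an embedding lookup above a similarity threshold):

```python
import hashlib
import json

# Exact-match cache: key is a hash of (model, messages), so identical
# requests skip the API call entirely.
class ExactCache:
    def __init__(self):
        self.store = {}
        self.hits = self.misses = 0

    def key(self, model, messages) -> str:
        raw = json.dumps({"model": model, "messages": messages}, sort_keys=True)
        return hashlib.sha256(raw.encode()).hexdigest()

    def get_or_call(self, model, messages, call):
        k = self.key(model, messages)
        if k in self.store:
            self.hits += 1
            return self.store[k]
        self.misses += 1
        self.store[k] = call(model, messages)
        return self.store[k]

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```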

Expected Practical Skills

  • Set up a multi-model gateway. Configure LiteLLM to proxy requests across Claude, GPT-4o, and an open-source model. Define fallback chains. Set up rate limiting. Track per-model cost and latency. This should take <1 hour.
  • Implement task-based routing. Build a classifier (can be as simple as keyword matching, or an LLM call) that categorizes incoming requests and routes to the appropriate model. Measure cost savings vs. single-model baseline.
  • Design a fallback chain. Configure primary → secondary → tertiary provider routing with health checks, timeout thresholds, and circuit breakers. Test failover by simulating provider outages.
  • Implement semantic caching. Set up GPTCache or Portkey’s cache to cache similar queries. Measure cache hit rate and cost savings. Tune the similarity threshold (too low = loose matches that serve wrong cached answers, too high = almost no cache hits and no savings).
  • Build a cost dashboard. Track per-model, per-feature, per-user cost. Identify which features/users drive the most LLM spend. Use LiteLLM’s built-in cost tracking or LangFuse.
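
The aggregation behind a cost dashboard is a straightforward roll-up of request logs. A sketch assuming hypothetical log fields (`model`, `feature`, `cost_usd`); in practice these come from LiteLLM's or LangFuse's logging:

```python
from collections import defaultdict

# Roll raw per-request cost logs up into per-model and per-feature spend.
def cost_breakdown(logs):
    by_model = defaultdict(float)
    by_feature = defaultdict(float)
    for entry in logs:
        by_model[entry["model"]] += entry["cost_usd"]
        by_feature[entry["feature"]] += entry["cost_usd"]
    return dict(by_model), dict(by_feature)

logs = [
    {"model": "haiku",  "feature": "search", "cost_usd": 0.002},
    {"model": "opus-4", "feature": "report", "cost_usd": 0.310},
    {"model": "haiku",  "feature": "search", "cost_usd": 0.003},
]
by_model, by_feature = cost_breakdown(logs)
# haiku total is ~0.005; the "report" feature dominates spend here.
```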

Interview-Ready Explanations

  • “Walk me through how you’d design a multi-model architecture for a production application.” Start with task analysis: what types of requests does the system handle? Classify by complexity (simple extraction, moderate generation, complex reasoning). Map each category to the cheapest model that meets quality requirements — test this empirically with eval suites (Skill 9). Set up LiteLLM as the gateway. Configure fallback chains for each provider. Add semantic caching for repeated patterns. Monitor with per-model quality tracking (Skill 11). Budget-aware routing: when monthly spend hits 80% of budget, shift more traffic to cheaper models.

  • “How do you decide which model to use for a given task?” Empirical, not theoretical. Build an eval dataset for each task type (50+ examples). Run each candidate model. Score on quality, latency, and cost. Pick the cheapest model that clears the quality threshold. The threshold varies by task: customer-facing output needs higher quality than internal analytics. Re-evaluate monthly because model capabilities and pricing change constantly.
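
The selection rule described above ("cheapest model that clears the quality threshold") reduces to a one-liner once eval results exist. A sketch with made-up scores and costs:

```python
# Pick the cheapest model whose eval quality clears the threshold.
def pick_model(results, quality_threshold):
    """results: dicts with 'model', 'quality' (0-1), 'cost_per_1k_usd'."""
    qualified = [r for r in results if r["quality"] >= quality_threshold]
    if not qualified:
        raise ValueError("no model clears the bar; revisit task or threshold")
    return min(qualified, key=lambda r: r["cost_per_1k_usd"])["model"]

# Hypothetical eval results for one task type:
results = [
    {"model": "haiku",  "quality": 0.78, "cost_per_1k_usd": 0.02},
    {"model": "gpt-4o", "quality": 0.91, "cost_per_1k_usd": 0.15},
    {"model": "opus-4", "quality": 0.95, "cost_per_1k_usd": 0.90},
]
# At threshold 0.90 this picks gpt-4o; at 0.75 it picks haiku.
```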

  • “What happens when your primary model provider goes down?” Layered defense: (1) Circuit breaker detects failures (3 errors in 30 seconds → trip). (2) Traffic routes to secondary provider (different model family — if Claude is down, route to GPT-4o, not another Claude endpoint). (3) If secondary also fails, degrade gracefully (cached responses, simplified outputs, or queue for later processing). (4) Alert the team. (5) Health checks ping the primary every 30s; auto-recover when it’s back. Key design principle: fallback providers should be from different companies to avoid correlated outages.