RAG System Design — Competence
What an interviewer or hiring manager expects you to know.
Core Knowledge
- The retrieval pipeline. Query → embed → vector search → rerank → inject into prompt → generate. Know each stage and what can go wrong at each. The pipeline looks simple but the quality is determined by decisions at every stage: which embedding model, which vector DB, how to chunk documents, whether to rerank, how many results to include, and how to format context in the prompt. This is the first whiteboard question in any RAG interview.
- Embedding models. OpenAI text-embedding-3-small ($0.02/MTok, 1536 dims, good general-purpose) and text-embedding-3-large ($0.13/MTok, 3072 dims, better for specialized domains). Cohere embed-v3 (multilingual, 1024 dims, strong on cross-lingual retrieval). Voyage AI voyage-large-2-instruct ($0.12/MTok, instruction-tuned, excels on domain-specific retrieval when given retrieval instructions). Open-source: BGE-large-en-v1.5 (BAAI, top-tier open model), E5-large-v2 (Microsoft), Nomic embed-text-v1.5 (Matryoshka embeddings — variable dimensionality for cost/quality trade-off). Key trade-off: API models are simpler but create vendor dependency; open-source models require GPU hosting but enable customization and fine-tuning.
- Vector databases. Pinecone (fully managed, serverless option, strongest enterprise features, $0.33/hr per pod — expensive at scale). Weaviate (open-source + cloud, hybrid search built-in, modular vectorizer). Qdrant (open-source, Rust-based, fast, good filtering, self-host or cloud). Chroma (open-source, Python-native, embedded mode for development, lightweight). Milvus/Zilliz (open-source, designed for billion-scale, complex but powerful). pgvector (PostgreSQL extension — add vector search to existing Postgres, ideal when you don’t want another database, limited scale ~1M vectors before performance degrades). Know when to use which: pgvector for startups with <1M docs and existing Postgres, Pinecone for managed simplicity, Qdrant/Weaviate for self-hosted scale, Milvus for billion-scale.
- Chunking strategies. Fixed-size (split every N tokens — simple, fast, loses semantic boundaries). Recursive/character-based (split on paragraphs, then sentences, then characters — preserves structure better). Semantic chunking (use embedding similarity to find natural break points — expensive but highest quality). Document-aware (respect headings, sections, tables — critical for structured documents like grants, legal contracts, technical docs). Parent-child chunking (embed small chunks for precision, retrieve the parent section for context). Typical chunk sizes: 256-512 tokens for precise retrieval, 1024-2048 for broader context. The chunking decision has more impact on retrieval quality than the embedding model choice.
- Hybrid search. Combine vector similarity (semantic understanding) with BM25/keyword search (exact term matching). Vector search misses exact terms and acronyms; keyword search misses semantic similarity. Hybrid catches both. Weaviate has built-in hybrid search. For others, run both searches and combine with Reciprocal Rank Fusion (RRF). Typical weight: 70% vector, 30% keyword, tuned per use case.
- Reranking. After initial retrieval (top 20-50 candidates), a cross-encoder reranker scores each candidate against the query with much higher accuracy than vector similarity alone. Cohere Rerank ($1/1K searches), cross-encoder models from Hugging Face (free, self-hosted), ColBERT (late interaction model — faster than cross-encoders with similar quality). Reranking typically improves precision@5 by 10-25%. Trade-off: adds 100-300ms latency and API cost. Worth it for every production system; skip only for latency-critical applications.
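At its core, the "vector search" stage every database above implements is nearest-neighbor search over embeddings; the DBs make it fast at scale with approximate indexes (HNSW, IVF). A brute-force sketch of the underlying operation, in plain Python for illustration (`index` and `top_k` are hypothetical names, not any library's API):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query: list[float], index: dict[str, list[float]], k: int = 3) -> list[str]:
    """Exhaustive scan over all stored vectors -- what an ANN index approximates."""
    ranked = sorted(index, key=lambda doc_id: cosine(query, index[doc_id]), reverse=True)
    return ranked[:k]
```

Exhaustive scan is fine up to tens of thousands of vectors; beyond that, the approximate indexes trade a small amount of recall for orders-of-magnitude speedups.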
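The simplest chunking strategy from the list above can be written out directly. A minimal fixed-size chunker with overlap, using whitespace-split words as a stand-in for tokens (a real system would count with the embedding model's tokenizer):

```python
def chunk_fixed(text: str, size: int = 256, overlap: int = 32) -> list[str]:
    """Split text into chunks of ~`size` tokens, with `overlap` tokens
    shared between neighbours so sentences cut at a boundary survive
    intact in at least one chunk (word count approximates tokens here)."""
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break
    return chunks
```

The overlap parameter is the knob worth mentioning in an interview: without it, a fact straddling a chunk boundary is unretrievable by either chunk.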
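The RRF fusion step for hybrid search is small enough to sketch. This version uses the k=60 smoothing constant from the original RRF formulation and applies the 70/30 weighting mentioned above as per-list multipliers (function name and weighting scheme are illustrative, not a specific library's API):

```python
def rrf_fuse(vector_hits: list[str], keyword_hits: list[str],
             k: int = 60, w_vec: float = 0.7, w_kw: float = 0.3) -> list[str]:
    """Reciprocal Rank Fusion: score(doc) = sum over lists of weight / (k + rank).
    Documents appearing in both lists accumulate score from each."""
    scores: dict[str, float] = {}
    for weight, hits in ((w_vec, vector_hits), (w_kw, keyword_hits)):
        for rank, doc_id in enumerate(hits, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF works on ranks rather than raw scores, it sidesteps the problem that BM25 scores and cosine similarities live on incomparable scales.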
Expected Practical Skills
- Build a RAG pipeline end-to-end. Ingest documents (PDF, HTML, markdown), chunk them (choose strategy based on document structure), embed with a chosen model, store in a vector DB, implement retrieval with hybrid search, add reranking, format context for the LLM prompt, generate and return the response. Use LangChain, LlamaIndex, or Haystack — or build from scratch for simpler systems.
- Evaluate retrieval quality. Measure precision@k (what fraction of retrieved docs are relevant?), recall@k (what fraction of relevant docs are retrieved?), MRR (Mean Reciprocal Rank — where does the first relevant result appear?), and NDCG (Normalized Discounted Cumulative Gain). Use Ragas for RAG-specific eval: faithfulness (is the answer grounded in context?), answer relevancy, context precision/recall.
- Handle common failure modes. Wrong chunks retrieved (improve chunking, add metadata filtering, reranking). Context window overflow (too many retrieved docs — add relevance thresholds, summarize context). Stale data (index not updated — implement refresh schedules, track index freshness). “Lost in the middle” (LLMs attend less to middle of context — place most relevant chunks first and last).
- Know when RAG is the wrong tool. RAG is wrong when: the knowledge fits in the context window (just include it directly — simpler and more reliable), the task requires reasoning over the entire corpus (RAG retrieves fragments, not structure), the data changes faster than you can re-index, or the answers require cross-document synthesis that retrieval can’t capture.
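The retrieval metrics above are simple enough to implement directly against a golden dataset (function names are illustrative; Ragas and other eval libraries ship their own equivalents):

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved docs that are relevant."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant docs that appear in the top-k."""
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)

def mrr(ranked_lists: list[list[str]], relevant_sets: list[set[str]]) -> float:
    """Mean Reciprocal Rank over a batch of queries: average of
    1/rank-of-first-relevant-result, 0 if no relevant result appears."""
    total = 0.0
    for retrieved, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)
```

Precision@k answers "is what I show the LLM mostly relevant?"; recall@k answers "did I find everything?"; MRR rewards putting the first relevant hit near the top, which matters most when the prompt only includes a handful of chunks.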
Interview-Ready Explanations
- “Walk me through how you’d build a RAG system for [enterprise knowledge base].” Start with document analysis (format, structure, volume, update frequency). Choose chunking strategy based on document structure (semantic chunking for unstructured docs, document-aware for structured). Select embedding model (text-embedding-3-small for general, Voyage for domain-specific). Choose vector DB based on scale and operational requirements (pgvector for <1M docs with existing Postgres, Qdrant for self-hosted scale). Implement hybrid search (vector + BM25). Add Cohere Rerank. Format context with source attribution. Evaluate with Ragas. Monitor retrieval quality in production.
- “How do you evaluate whether your retrieval is working?” Two levels: retrieval quality (precision/recall/MRR — are you getting the right documents?) and end-to-end quality (does the final answer actually help the user?). Use Ragas for automated eval: faithfulness checks if the answer is grounded in retrieved context, answer relevancy checks if the answer addresses the question. Build a golden dataset of 100+ question/answer/source-document triples. Run weekly and track trends.
- “What are the failure modes and how do you mitigate them?” (1) Retrieval misses — relevant doc exists but isn’t retrieved. Fix: hybrid search, better chunking, metadata filtering. (2) Wrong context — irrelevant docs retrieved, LLM generates plausible but unsupported answer. Fix: reranking, relevance threshold, faithfulness checking. (3) Context overload — too many docs, LLM ignores some. Fix: fewer, better-ranked results. (4) Stale index — documents updated but embeddings aren’t. Fix: incremental indexing, freshness tracking. (5) The “lost in the middle” problem — relevant info in position 3-4 of 5 retrieved docs gets ignored. Fix: put most relevant first and last.
Related
- Eval Frameworks — RAG quality requires Ragas + retrieval metrics
- Guardrails & Safety — PII scrubbing in retrieved content
- Cost Estimation — RAG cost modeling (embedding + storage + query)