Guardrails & Safety Architecture — Competence
What an interviewer or hiring manager expects you to know.
Core Knowledge
-
The guardrails pipeline. Production LLM systems use a layered architecture: input guardrails (PII detection, prompt injection scanning, topic filtering) → context assembly (RAG retrieval + system prompt) → LLM inference → output guardrails (toxicity check, hallucination detection, PII scrubbing, structured output validation) → audit logging. Every layer adds latency (500ms-2s total overhead is typical). Know this pipeline cold — it’s the first whiteboard question.
-
Guardrails frameworks. NVIDIA NeMo Guardrails (programmable rails in Colang DSL, most mature OSS option), Guardrails AI/guardrails-py (Python validators with 50+ community guards on Guardrails Hub), LLM Guard by Protect AI (input/output scanners, plug-and-play, Apache 2.0), AWS Bedrock Guardrails (native content filters + PII + denied topics). Know when to use a framework vs. building custom — frameworks handle common cases; custom logic handles domain-specific policies.
-
PII detection. Microsoft Presidio (OSS, 20+ entity types, pluggable NER backends via spaCy/HuggingFace/Stanza, anonymization strategies: redact/replace/hash/mask/encrypt). Private AI (commercial, 50+ entities, 49 languages, SOC 2 certified). Cloud-native: Google Cloud DLP (150+ infoTypes), AWS Comprehend PII, Azure AI Language PII. Know that PII detection before model input is mandatory in regulated industries (HIPAA, financial services) — not optional.
-
Content safety APIs. OpenAI Moderation API (free, multi-category scores for hate/self-harm/sexual/violence), Azure AI Content Safety (severity scores 0-6, plus prompt shield for injection detection and groundedness checking), Meta Llama Guard 3 (open-source safety classifier, runs locally, configurable taxonomy), Google Perspective API (toxicity/insult/threat scoring). Know that no single classifier catches everything — ensemble approaches combining 2-3 tools significantly reduce gaps.
-
Attack vectors. Direct prompt injection (user overrides system prompt), indirect prompt injection (malicious instructions in retrieved documents — white-on-white text in resumes, hidden instructions in emails), data exfiltration (markdown image tags encoding secrets in URLs), jailbreaking (DAN, Base64 encoding, role-play escalation, multi-turn gradual escalation), model DoS (inputs that maximize token consumption). The OWASP LLM Top 10 (v1.1, 2024) is the standard reference taxonomy.
-
Regulatory landscape. EU AI Act (risk classification: unacceptable/high/limited/minimal; high-risk systems need conformity assessments, transparency, human oversight; penalties up to EUR 35M or 7% revenue). NIST AI RMF 1.0 (Govern/Map/Measure/Manage) with NIST AI 600-1 GenAI-specific profile. Colorado AI Act (SB 24-205, effective Feb 2026 — first US state comprehensive AI law). Know that enterprise buyers now require NIST AI RMF alignment as a procurement gate.
Expected Practical Skills
- Implement a guardrails pipeline. Wire up input scanning (Presidio for PII, Lakera Guard or LLM Guard for prompt injection) → LLM call → output validation (Llama Guard for safety, Guardrails AI for schema validation) → audit log (LangFuse traces). Get it running end-to-end with <500ms added latency.
- Configure content policies. Translate business rules (“never give medical advice,” “always include a disclaimer for financial topics”) into guardrail configurations: system prompt instructions + classifier rules + output validation + tiered response (low severity: append disclaimer, medium: rephrase, high: block + canned message, critical: block + alert).
- Run a basic red-team exercise. Using NVIDIA Garak or manual testing, systematically probe an LLM system against OWASP LLM Top 10 categories. Document findings with severity ratings and remediation steps.
- Set up audit logging. Implement immutable trace logging (LangFuse or LangSmith): every request captures input, output, guardrail results (which fired, pass/fail, scores), latency, tokens, model version, policy version. Enterprise buyers require 1+ year retention.
- Write a compliance mapping. Map your guardrails architecture to NIST AI RMF categories, producing a document that shows which controls address which risks. This is what procurement teams review.
Interview-Ready Explanations
-
“Walk me through how you’d design a guardrails architecture for a customer-facing LLM product.” Layered defense: input validation (PII scrubbing with Presidio, injection detection with Lakera, topic deny-list), robust system prompt with instruction hierarchy, output checking (Llama Guard for safety, groundedness check against source docs, PII re-scan for model-generated PII), audit trail with LangFuse. Tiered response: not everything is block/allow — low-severity violations get disclaimers, medium get rephrasing, high get blocked. Monitoring: real-time dashboards on guardrail fire rates, false positive tracking, weekly calibration.
-
“How do you defend against prompt injection?” Defense in depth — no single layer is sufficient. Input classifiers (Lakera Guard claims <100ms, trained on 1M+ injection examples) catch known patterns. Instruction hierarchy (system prompt > user input, as Anthropic implements) limits override capability. Output scanning catches injections that bypassed input filters. Indirect injection is harder — sanitize retrieved content, separate data plane from control plane, scan for injection patterns in RAG sources. Continuous red-teaming because new injection techniques emerge weekly.
-
“What are the failure modes of guardrails themselves?” Over-blocking (legitimate medical/legal/educational content flagged — erodes user trust), latency tax (full pipeline adds 500ms-2s, unacceptable for real-time chat without parallelization), false sense of security (classifiers are 85-95% accurate at best, novel attacks bypass them), multi-turn bypass (single-turn classifiers miss attacks spread across 10+ messages), inconsistent policy application (input and output filters with different sensitivity levels create gaps).
Related
- Compliance & Governance — guardrails implement compliance requirements
- Model Routing — guardrails integrate into the AI gateway
- Eval Frameworks — red-teaming requires eval infrastructure