Guardrails & Safety Architecture — Competence

What an interviewer or hiring manager expects you to know.

Core Knowledge

The guardrails pipeline. Production LLM systems use a layered architecture: input guardrails (PII detection, prompt injection scanning, topic filtering) → context assembly (RAG retrieval + system prompt) → LLM inference → output guardrails (toxicity check, hallucination detection, PII scrubbing, structured output validation) → audit logging. Every layer adds latency (500ms-2s total overhead is typical). Know this pipeline cold — it’s the first whiteboard question.
Guardrails frameworks. NVIDIA NeMo Guardrails (programmable rails in Colang DSL, most mature OSS option), Guardrails AI/guardrails-py (Python validators with 50+ community guards on Guardrails Hub), LLM Guard by Protect AI (input/output scanners, plug-and-play, Apache 2.0), AWS Bedrock Guardrails (native content filters + PII + denied topics). Know when to use a framework vs. building custom — frameworks handle common cases; custom logic handles domain-specific policies.
PII detection. Microsoft Presidio (OSS, 20+ entity types, pluggable NER backends via spaCy/HuggingFace/Stanza, anonymization strategies: redact/replace/hash/mask/encrypt). Private AI (commercial, 50+ entities, 49 languages, SOC 2 certified). Cloud-native: Google Cloud DLP (150+ infoTypes), AWS Comprehend PII, Azure AI Language PII. Know that PII detection before model input is mandatory in regulated industries (HIPAA, financial services) — not optional.
Content safety APIs. OpenAI Moderation API (free, multi-category scores for hate/self-harm/sexual/violence), Azure AI Content Safety (severity scores 0-6, plus prompt shield for injection detection and groundedness checking), Meta Llama Guard 3 (open-source safety classifier, runs locally, configurable taxonomy), Google Perspective API (toxicity/insult/threat scoring). Know that no single classifier catches everything — ensemble approaches combining 2-3 tools significantly reduce gaps.
Attack vectors. Direct prompt injection (user overrides system prompt), indirect prompt injection (malicious instructions in retrieved documents — white-on-white text in resumes, hidden instructions in emails), data exfiltration (markdown image tags encoding secrets in URLs), jailbreaking (DAN, Base64 encoding, role-play escalation, multi-turn gradual escalation), model DoS (inputs that maximize token consumption). The OWASP LLM Top 10 (v1.1, 2024) is the standard reference taxonomy.
Regulatory landscape. EU AI Act (risk classification: unacceptable/high/limited/minimal; high-risk systems need conformity assessments, transparency, human oversight; penalties up to EUR 35M or 7% revenue). NIST AI RMF 1.0 (Govern/Map/Measure/Manage) with NIST AI 600-1 GenAI-specific profile. Colorado AI Act (SB 24-205, effective Feb 2026 — first US state comprehensive AI law). Know that enterprise buyers now require NIST AI RMF alignment as a procurement gate.

Expected Practical Skills

Implement a guardrails pipeline. Wire up input scanning (Presidio for PII, Lakera Guard or LLM Guard for prompt injection) → LLM call → output validation (Llama Guard for safety, Guardrails AI for schema validation) → audit log (LangFuse traces). Get it running end-to-end with <500ms added latency.
Configure content policies. Translate business rules (“never give medical advice,” “always include a disclaimer for financial topics”) into guardrail configurations: system prompt instructions + classifier rules + output validation + tiered response (low severity: append disclaimer, medium: rephrase, high: block + canned message, critical: block + alert).
Run a basic red-team exercise. Using NVIDIA Garak or manual testing, systematically probe an LLM system against OWASP LLM Top 10 categories. Document findings with severity ratings and remediation steps.
Set up audit logging. Implement immutable trace logging (LangFuse or LangSmith): every request captures input, output, guardrail results (which fired, pass/fail, scores), latency, tokens, model version, policy version. Enterprise buyers require 1+ year retention.
Write a compliance mapping. Map your guardrails architecture to NIST AI RMF categories, producing a document that shows which controls address which risks. This is what procurement teams review.

Interview-Ready Explanations

“Walk me through how you’d design a guardrails architecture for a customer-facing LLM product.” Layered defense: input validation (PII scrubbing with Presidio, injection detection with Lakera, topic deny-list), robust system prompt with instruction hierarchy, output checking (Llama Guard for safety, groundedness check against source docs, PII re-scan for model-generated PII), audit trail with LangFuse. Tiered response: not everything is block/allow — low-severity violations get disclaimers, medium get rephrasing, high get blocked. Monitoring: real-time dashboards on guardrail fire rates, false positive tracking, weekly calibration.
“How do you defend against prompt injection?” Defense in depth — no single layer is sufficient. Input classifiers (Lakera Guard claims <100ms, trained on 1M+ injection examples) catch known patterns. Instruction hierarchy (system prompt > user input, as Anthropic implements) limits override capability. Output scanning catches injections that bypassed input filters. Indirect injection is harder — sanitize retrieved content, separate data plane from control plane, scan for injection patterns in RAG sources. Continuous red-teaming because new injection techniques emerge weekly.
“What are the failure modes of guardrails themselves?” Over-blocking (legitimate medical/legal/educational content flagged — erodes user trust), latency tax (full pipeline adds 500ms-2s, unacceptable for real-time chat without parallelization), false sense of security (classifiers are 85-95% accurate at best, novel attacks bypass them), multi-turn bypass (single-turn classifiers miss attacks spread across 10+ messages), inconsistent policy application (input and output filters with different sensitivity levels create gaps).

Compliance & Governance — guardrails implement compliance requirements
Model Routing — guardrails integrate into the AI gateway
Eval Frameworks — red-teaming requires eval infrastructure

Guardrails & Safety

Guardrails & Safety Architecture — Competence

Core Knowledge

Expected Practical Skills

Interview-Ready Explanations

Related