CapabilityAtlas CapabilityAtlas
Sign In
search
Integration & Operations Fundamentals

Guardrails & Safety

Input/output filtering, PII detection, content policy enforcement, audit logging.

Guardrails & Safety Architecture — Competence

What an interviewer or hiring manager expects you to know.

Core Knowledge

  • The guardrails pipeline. Production LLM systems use a layered architecture: input guardrails (PII detection, prompt injection scanning, topic filtering) → context assembly (RAG retrieval + system prompt) → LLM inference → output guardrails (toxicity check, hallucination detection, PII scrubbing, structured output validation) → audit logging. Every layer adds latency (500ms-2s total overhead is typical). Know this pipeline cold — it’s the first whiteboard question.

  • Guardrails frameworks. NVIDIA NeMo Guardrails (programmable rails in Colang DSL, most mature OSS option), Guardrails AI/guardrails-py (Python validators with 50+ community guards on Guardrails Hub), LLM Guard by Protect AI (input/output scanners, plug-and-play, Apache 2.0), AWS Bedrock Guardrails (native content filters + PII + denied topics). Know when to use a framework vs. building custom — frameworks handle common cases; custom logic handles domain-specific policies.

  • PII detection. Microsoft Presidio (OSS, 20+ entity types, pluggable NER backends via spaCy/HuggingFace/Stanza, anonymization strategies: redact/replace/hash/mask/encrypt). Private AI (commercial, 50+ entities, 49 languages, SOC 2 certified). Cloud-native: Google Cloud DLP (150+ infoTypes), AWS Comprehend PII, Azure AI Language PII. Know that PII detection before model input is mandatory in regulated industries (HIPAA, financial services) — not optional.

  • Content safety APIs. OpenAI Moderation API (free, multi-category scores for hate/self-harm/sexual/violence), Azure AI Content Safety (severity scores 0-6, plus prompt shield for injection detection and groundedness checking), Meta Llama Guard 3 (open-source safety classifier, runs locally, configurable taxonomy), Google Perspective API (toxicity/insult/threat scoring). Know that no single classifier catches everything — ensemble approaches combining 2-3 tools significantly reduce gaps.

  • Attack vectors. Direct prompt injection (user overrides system prompt), indirect prompt injection (malicious instructions in retrieved documents — white-on-white text in resumes, hidden instructions in emails), data exfiltration (markdown image tags encoding secrets in URLs), jailbreaking (DAN, Base64 encoding, role-play escalation, multi-turn gradual escalation), model DoS (inputs that maximize token consumption). The OWASP LLM Top 10 (v1.1, 2024) is the standard reference taxonomy.

  • Regulatory landscape. EU AI Act (risk classification: unacceptable/high/limited/minimal; high-risk systems need conformity assessments, transparency, human oversight; penalties up to EUR 35M or 7% revenue). NIST AI RMF 1.0 (Govern/Map/Measure/Manage) with NIST AI 600-1 GenAI-specific profile. Colorado AI Act (SB 24-205, effective Feb 2026 — first US state comprehensive AI law). Know that enterprise buyers now require NIST AI RMF alignment as a procurement gate.

Expected Practical Skills

  • Implement a guardrails pipeline. Wire up input scanning (Presidio for PII, Lakera Guard or LLM Guard for prompt injection) → LLM call → output validation (Llama Guard for safety, Guardrails AI for schema validation) → audit log (LangFuse traces). Get it running end-to-end with <500ms added latency.
  • Configure content policies. Translate business rules (“never give medical advice,” “always include a disclaimer for financial topics”) into guardrail configurations: system prompt instructions + classifier rules + output validation + tiered response (low severity: append disclaimer, medium: rephrase, high: block + canned message, critical: block + alert).
  • Run a basic red-team exercise. Using NVIDIA Garak or manual testing, systematically probe an LLM system against OWASP LLM Top 10 categories. Document findings with severity ratings and remediation steps.
  • Set up audit logging. Implement immutable trace logging (LangFuse or LangSmith): every request captures input, output, guardrail results (which fired, pass/fail, scores), latency, tokens, model version, policy version. Enterprise buyers require 1+ year retention.
  • Write a compliance mapping. Map your guardrails architecture to NIST AI RMF categories, producing a document that shows which controls address which risks. This is what procurement teams review.

Interview-Ready Explanations

  • “Walk me through how you’d design a guardrails architecture for a customer-facing LLM product.” Layered defense: input validation (PII scrubbing with Presidio, injection detection with Lakera, topic deny-list), robust system prompt with instruction hierarchy, output checking (Llama Guard for safety, groundedness check against source docs, PII re-scan for model-generated PII), audit trail with LangFuse. Tiered response: not everything is block/allow — low-severity violations get disclaimers, medium get rephrasing, high get blocked. Monitoring: real-time dashboards on guardrail fire rates, false positive tracking, weekly calibration.

  • “How do you defend against prompt injection?” Defense in depth — no single layer is sufficient. Input classifiers (Lakera Guard claims <100ms, trained on 1M+ injection examples) catch known patterns. Instruction hierarchy (system prompt > user input, as Anthropic implements) limits override capability. Output scanning catches injections that bypassed input filters. Indirect injection is harder — sanitize retrieved content, separate data plane from control plane, scan for injection patterns in RAG sources. Continuous red-teaming because new injection techniques emerge weekly.

  • “What are the failure modes of guardrails themselves?” Over-blocking (legitimate medical/legal/educational content flagged — erodes user trust), latency tax (full pipeline adds 500ms-2s, unacceptable for real-time chat without parallelization), false sense of security (classifiers are 85-95% accurate at best, novel attacks bypass them), multi-turn bypass (single-turn classifiers miss attacks spread across 10+ messages), inconsistent policy application (input and output filters with different sensitivity levels create gaps).