Red-Teaming / Adversarial Testing — Competence
What an interviewer or hiring manager expects you to know.
Core Knowledge
-
The OWASP LLM Top 10 (v1.1, 2024). The standard attack taxonomy: LLM01 Prompt Injection (direct and indirect), LLM02 Insecure Output Handling (XSS via LLM output), LLM03 Training Data Poisoning, LLM04 Model Denial of Service, LLM05 Supply Chain Vulnerabilities, LLM06 Sensitive Information Disclosure (PII leakage, training data extraction), LLM07 Insecure Plugin Design (tool use vulnerabilities), LLM08 Excessive Agency (agent takes unauthorized actions), LLM09 Overreliance (users trust incorrect output), LLM10 Model Theft. Know the top 5 cold — they’re the foundation of every red-team exercise.
-
Prompt injection attack classes. Direct injection (user overrides system prompt: “Ignore all previous instructions…”). Indirect injection (malicious instructions in retrieved documents, emails, web pages — the model processes them as instructions). Jailbreaking (bypass safety training: DAN prompts, role-play escalation, Base64 encoding, language switching, multi-turn gradual escalation). Payload injection (embed executable content in LLM output: markdown image tags for data exfiltration, SQL injection via LLM-generated queries, XSS via LLM-generated HTML). Know the difference: prompt injection targets the system, jailbreaking targets the model’s safety training.
-
Red-teaming tools. NVIDIA Garak (automated LLM vulnerability scanner — probes for prompt injection, jailbreaks, data leakage, toxicity with 1000s of attack variants; the “nmap for LLMs”). Promptfoo red-team mode (
promptfoo redteam— auto-generates adversarial inputs based on your system’s purpose and tests them). Microsoft PyRIT (Python Risk Identification Toolkit — red-teaming framework for multi-turn attacks, orchestrated adversarial conversations). HarmBench (benchmark for evaluating attack/defense pairs). Adversarial Robustness Toolbox (IBM — adversarial ML attacks/defenses). PromptInject (academic tool for testing prompt injection defenses). -
Attack methodology. Reconnaissance (understand the system: what model, what tools, what guardrails, what the system prompt says), automated scanning (run Garak/Promptfoo red-team for broad coverage of known attack patterns), manual probing (creative attacks that automated tools miss: context-specific social engineering, domain-specific exploits, multi-turn strategies), finding discovery documentation (severity classification, reproduction steps, impact assessment), and regression test creation (every finding becomes an automated test that runs on every change).
-
Severity classification for AI vulnerabilities. Critical: PII/data exfiltration, unauthorized actions with real-world consequences (agent sends email, deletes data), safety-relevant harmful output (medical advice, weapons instructions). High: system prompt extraction, persistent jailbreak that bypasses all guardrails, consistent generation of harmful content. Medium: intermittent guardrail bypass, information disclosure about system architecture, output manipulation that could mislead users. Low: cosmetic policy violations, theoretical attacks that require unrealistic preconditions.
Expected Practical Skills
- Run an automated red-team scan. Configure and run Garak or Promptfoo redteam against an LLM application. Interpret results: which attacks succeeded, what’s the success rate per category, which findings are real vulnerabilities vs. false positives.
- Conduct manual prompt injection testing. Systematically test: direct injection (override attempts), indirect injection (malicious content in RAG sources), output manipulation (force specific output format/content), tool abuse (trick the model into calling tools with unintended arguments), and multi-turn escalation (gradually build context to bypass safety).
- Write a red-team report. Document: findings with severity, reproduction steps, affected components, recommended mitigations, and timeline. Follow standard security reporting format — this is a deliverable enterprise buyers and regulators expect.
- Convert findings to regression tests. Every successful attack becomes an automated test case in the guardrails eval suite (Skill 9). The test verifies that the mitigation works and that future changes don’t reintroduce the vulnerability.
- Test guardrail effectiveness. Given a guardrails stack (Skill 15), probe for gaps: what gets through input filters? What gets through output filters? What multi-turn attacks bypass session-level defenses? Measure the guardrail bypass rate per attack category.
Interview-Ready Explanations
-
“Walk me through how you’d red-team an LLM application.” Start with scope: what system, what threats are in-scope (data exfiltration, harmful content, unauthorized actions), what’s the risk context (customer-facing, internal, regulated)? Phase 1: automated scanning — Garak for broad vulnerability coverage, Promptfoo red-team for application-specific attacks. Phase 2: manual testing — focus on the OWASP LLM Top 10, especially prompt injection (direct + indirect), sensitive information disclosure, and excessive agency. Phase 3: domain-specific probing — for healthcare, try to get medical diagnoses; for finance, try to get investment advice; for compliance, try to get definitive legal opinions. Document everything. Convert findings to regression tests. Re-test after mitigations.
-
“How do you test for indirect prompt injection?” Inject adversarial content into the data sources the LLM processes: add hidden instructions to documents in the RAG index, embed malicious text in email threads the system summarizes, include adversarial CSS/invisible text in web pages the system analyzes. Test: does the model follow the injected instructions? Does it override system prompt boundaries? Does it exfiltrate data to attacker-controlled URLs? The hardest attack to defend against because it lives in the data, not the user input.
-
“What’s the difference between red-teaming and general adversarial testing?” Red-teaming is goal-oriented: an adversary with specific objectives (extract PII, bypass safety, cause unauthorized actions) using creative, adaptive strategies. Adversarial testing is systematic: enumerate known attack patterns, test each one, measure success rates. Both are needed: adversarial testing catches known vulnerabilities at scale (automated), red-teaming finds novel vulnerabilities that automated tools miss (manual, creative). Think of it as automated security scanning vs. penetration testing — both necessary, different strengths.
Related
- Guardrails & Safety — red-teaming tests guardrail effectiveness
- Eval Frameworks — red-team findings become eval test cases
- Failure Mode Reasoning — adversarial failures are a failure mode category