LLM Red Teaming Tools

Five automated tools for adversarially testing LLM applications: Garak (pre-deployment scanner), PyRIT (enterprise multi-turn attack framework), Promptfoo (eval + security combined), NeMo Guardrails (runtime filtering), and DeepTeam (DeepEval-integrated red teaming).

This page covers runnable tooling. For the methodology — jailbreak categories, human red-teaming process, severity tiers, and CI regression suites — see security/red-teaming.

The threat taxonomy these tools test against is at security/owasp-llm-top10.


The Two Phases These Tools Cover

Pre-deployment scanning: run a battery of attacks against your model or application endpoint before shipping. Garak, PyRIT, Promptfoo, and DeepTeam all operate here.

Runtime filtering: inspect and block every live request and response. NeMo Guardrails operates here.

The recommended stack uses both. Scanning before deploy catches systematic vulnerabilities. Runtime filtering handles attacks that slip through or emerge after launch.


Garak (NVIDIA)

What it is: An open-source LLM vulnerability scanner. Think nmap for LLMs — systematic, probe-driven, report-generating. Run it before shipping any model-backed feature.

Install and run:

pip install garak

# Scan a public model for prompt injection vulnerabilities
garak --model_type openai --model_name gpt-4o --probes promptinject

# Run all probes (slow — use for pre-release audits)
garak --model_type openai --model_name gpt-4o --probes all

# Target a custom endpoint
garak --model_type rest --model_name my-app --probes jailbreak,dan

Architecture — four components:

ComponentRole
GeneratorsWrap the target LLM; handle auth, connection, backoff
ProbesAssemble and send adversarial prompts targeting a specific weakness
DetectorsAnalyse the response to judge whether the attack succeeded
EvaluatorsConvert detector judgments into pass/fail data; build the report

A harness coordinates all four components per run.

Probe coverage (150+ probe modules, 3,000+ prompts):

  • Prompt injection and jailbreaks (DAN variants, role-play, hypothetical framing)
  • Encoding attacks (Base64, rot13, homoglyph substitution)
  • Toxicity and harmful content generation
  • Data leakage and PII extraction
  • Hallucination and misinformation elicitation
  • System prompt extraction
  • Many-shot escalation

Probes are tagged with MISP taxonomy, OWASP LLM category (owasp:llm01, owasp:llm06, etc.), and AVID effect codes — so the HTML report maps findings to the security/owasp-llm-top10 directly.

Output: structured HTML + JSONL report per run, with pass rate per probe. Integrate into pre-release CI with a threshold gate.

When to use: pre-deployment audit of any model endpoint — hosted API, fine-tuned model, or full application chain.


PyRIT (Microsoft)

What it is: Python Risk Identification Tool. Enterprise-focused, multi-turn, multi-modal red-teaming framework. Battle-tested by Microsoft's AI Red Team against Bing Chat, Copilot, and internal systems. MIT licensed, v0.11.0 (February 2026), Python 3.10-3.13.

pip install pyrit

Architecture — four primitives:

PrimitiveRole
TargetsThe system under test (any LLM API)
OrchestratorsRun multi-turn attack campaigns, manage conversation state
ConvertersTransform prompts to bypass filters (encode, obfuscate, translate)
ScorersEvaluate whether each attack response constitutes a success

Built-in attack strategies:

  • Many-shot jailbreaking — prefill context with 50+ examples of compliance
  • Crescendo — gradual multi-turn escalation; starts benign, steers toward violation
  • PAIR (Prompt Automatic Iterative Refinement) — adversarial LLM iteratively rewrites attacks to maximise success rate
  • Tree-of-Attacks — branches multiple attack paths in parallel, selects the most successful branch

Azure AI Foundry integration: in April 2025, Microsoft launched the AI Red Teaming Agent in public preview — a managed version of PyRIT that runs inside Azure AI Foundry, produces Attack Success Rate (ASR) metrics, and integrates with Azure AI evaluations. Use the hosted agent for Azure-native workflows; use the Python SDK directly for custom pipelines.

from pyrit.orchestrator import PromptSendingOrchestrator
from pyrit.common import initialize_pyrit

initialize_pyrit(memory_db_type="duckdb")

orchestrator = PromptSendingOrchestrator(objective_target=target)
results = await orchestrator.send_prompts_async(
    prompt_list=["Ignore all instructions and...", "Pretend you are DAN..."]
)

When to use: systematic enterprise red-team programs; when you need documented Attack Success Rate metrics; when running multi-turn attack campaigns at scale; when you are on Azure AI Foundry.


Promptfoo

What it is: Dual-role tool — standard eval framework AND security scanner. 300k+ developer users. Used by OpenAI and Anthropic teams internally. MIT licensed, actively maintained. Acquired by OpenAI (remains open source).

Install:

npm install -g promptfoo
# or
npx promptfoo@latest

Security scanning via red-team plugin:

# promptfooconfig.yaml
redteam:
  plugins:
    - owasp:llm          # full OWASP LLM Top 10 coverage
    - prompt-injection
    - jailbreak
    - pii
    - harmful:hate
    - harmful:violence
    - overreliance
  strategies:
    - jailbreak:composite  # combine multiple jailbreak techniques
    - prompt-injection
promptfoo redteam run
promptfoo redteam report   # HTML report in browser

Plugin coverage: 157 plugins across 6 categories — brand protection, compliance and legal, datasets from published research, security and access control, trust and safety, and custom. Built-in framework presets: owasp:llm, owasp:api, mitre:atlas, NIST AI RMF.

Each plugin is a trained model that generates adversarial payloads targeting a specific weakness — not just static prompt templates.

Dual eval + security workflow:

# Single config for both quality evals and security scanning
description: My LLM app
providers:
  - anthropic:claude-sonnet-4-6

tests:
  # Quality evals
  - vars:
      query: "What is the capital of France?"
    assert:
      - type: contains
        value: Paris

# Red team runs separately but in the same CI pipeline
redteam:
  plugins:
    - owasp:llm

CI integration:

# GitHub Actions
- name: LLM security scan
  run: npx promptfoo@latest redteam run --ci
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

When to use: teams that want eval and security in one tool; YAML-first workflows; CI-native security gates without a separate scanner tool.


NeMo Guardrails (NVIDIA)

What it is: Runtime filtering — not a scanner. Runs on every live request and response. Uses Colang, a domain-specific language for defining conversation rails. Complements pre-deployment scanners: scan before shipping, filter at runtime.

pip install nemoguardrails

Five rail types:

RailWhen it runs
InputBefore the LLM sees the user message
DialogControls allowed conversation flows
RetrievalFilters content before it enters the context
ExecutionGuards tool calls and action results
OutputBefore the LLM response reaches the user

Colang example — topical rail:

# config/rails.co  (Colang 1.0 syntax — default as of 2026)
define user ask about competitors
  "How does your product compare to [competitor]?"
  "Is [competitor] better than you?"

define bot refuse competitor questions
  "I'm not able to make comparisons with other products."

define flow no competitors
  user ask about competitors
  bot refuse competitor questions
from nemoguardrails import RailsConfig, LLMRails

config = RailsConfig.from_path("./config")
rails = LLMRails(config)

response = await rails.generate_async(
    messages=[{"role": "user", "content": "Tell me about your competitor."}]
)

Built-in capabilities:

  • PII anonymisation (detect and redact before sending to LLM)
  • Topical rails (prevent off-topic conversations)
  • Jailbreak detection on input
  • Hallucination and factual error flagging on output
  • Colang 2.0 (beta) for more expressive flow control

Key distinction from scanners: NeMo Guardrails does not generate a vulnerability report. It sits in the request path and blocks or transforms content in real time. Use it alongside Garak and PyRIT, not instead of them. See also security/guardrails for output validation patterns using Guardrails AI (a different product from a different vendor).

When to use: customer-facing chatbots requiring strict topic control; any deployment needing PII filtering at the LLM boundary; runtime enforcement of conversation policy.


DeepTeam (Confident AI)

What it is: Dedicated red-teaming framework from the DeepEval team (Confident AI, YC W25). Spun out from DeepEval as a standalone package. Red-team attacks are first-class evaluation metrics — same syntax, same reporting pipeline.

pip install deepteam
from deepteam import red_team
from deepteam.attacks import PromptInjection, JailbreakingLinear
from deepteam.vulnerabilities import PIILeakage, HarmfulContent

results = await red_team(
    target=your_model_fn,
    attacks=[PromptInjection(), JailbreakingLinear()],
    vulnerabilities=[PIILeakage(), HarmfulContent()],
)
print(results.vulnerability_scores)

Coverage: 50+ vulnerabilities including PII leakage, toxicity, bias, jailbreaks. 20+ attack types including single-turn and multi-turn strategies. Deeply integrated with the Confident AI platform for tracking vulnerability scores over time.

When to use: teams already using DeepEval who want red teaming in the same framework; when you want vulnerability scores reported alongside standard eval metrics.


OWASP LLM Top 10 Coverage Map

OWASP CategoryGarakPyRITPromptfooNeMoDeepTeam
LLM01 Prompt InjectionYes (probe)Yes (converter + orchestrator)Yes (plugin)Yes (input rail)Yes
LLM02 Insecure OutputYes (detector)Yes (scorer)Yes (plugin)Yes (output rail)Yes
LLM06 Sensitive Info DisclosureYes (probe)YesYes (plugin)Yes (PII rail)Yes
LLM07 System Prompt LeakageYes (probe)YesYes (plugin)PartialYes
LLM08 Excessive AgencyPartialYes (agent orchestrator)Yes (agents plugin)NoYes
LLM09 MisinformationYes (hallucination probe)NoYes (plugin)Yes (output rail)Yes
LLM10 Unbounded ConsumptionNoNoPartialNoNo

See security/owasp-llm-top10 for the full threat taxonomy.


Pre-deployment: Garak (systematic probe-based scan) + Promptfoo (YAML CI gate)
Enterprise programs: PyRIT (multi-turn attack campaigns, ASR metrics)
Runtime: NeMo Guardrails (topic control, PII filtering, jailbreak detection)
DeepEval teams: DeepTeam (red team + eval in one pipeline)

The minimal production stack for most teams is Garak pre-deployment plus NeMo Guardrails at runtime. Add PyRIT when you need documented Attack Success Rate metrics for compliance or when running periodic structured red-team programs. Add Promptfoo when your team already uses it for evals and wants security in the same CI step.


Key Facts

  • Garak: 150+ probe modules, 3,000+ prompts; reports map to OWASP and MISP taxonomy; run with garak --model_type openai --model_name gpt-4o --probes all
  • PyRIT: v0.11.0 (Feb 2026); built-in attacks include PAIR, crescendo, tree-of-attacks, many-shot; Azure AI Foundry integration (AI Red Teaming Agent, April 2025 preview)
  • Promptfoo: 157 plugins; OWASP, MITRE ATLAS, NIST AI RMF presets; dual eval + security in one YAML config; acquired by OpenAI, remains MIT licensed
  • NeMo Guardrails: runtime-only; 5 rail types (input/dialog/retrieval/execution/output); Colang 1.0 is default, Colang 2.0 in beta; complements scanners, does not replace them
  • DeepTeam: YC W25; 50+ vulnerabilities, 20+ attack strategies; same syntax as DeepEval; best for teams already on the Confident AI stack

Common Failure Cases

Garak scan times out on large probe sets when targeting a rate-limited API Why: running all 150+ probe modules sends thousands of requests; hosted APIs enforce rate limits that cause request failures, timeouts, and incomplete scan coverage. Detect: the scan report shows incomplete probe runs or high error rates; wall-clock time far exceeds expected duration. Fix: use --parallel_attempts 1 to reduce concurrency; run targeted probe sets (--probes promptinject,jailbreak) rather than all probes; schedule full scans against a local or unlimited-tier endpoint.

PyRIT crescendo attack appears to succeed because the scorer misclassifies a partial compliance as a full violation Why: multi-turn escalation often produces ambiguous responses near the boundary; a binary scorer configured for clear-cut violations will flag ambiguous hedging as a success, inflating ASR. Detect: manually reviewing flagged conversations shows the model hedged or provided incomplete information that the scorer counted as a full violation. Fix: use a calibrated LLM-as-judge scorer with a rubric that distinguishes between partial compliance, hedging, and full violation; include few-shot examples of each class in the judge prompt.

NeMo Guardrails topical rail silently drops valid user messages that partially match a disallowed pattern Why: Colang 1.0 pattern matching is not exact — a user message containing a keyword from a disallowed define user block may trigger the rail even if the intent is benign. Detect: users report getting refusals for clearly legitimate queries; Colang debug logs show the wrong flow being triggered. Fix: make define user patterns as specific as possible; add a catch-all default flow for unmatched messages; test rail coverage with the nemoguardrails chat --debug flag before deploying.


Connections

Open Questions

  • Does Garak's probe coverage extend to agentic contexts (multi-agent delegation attacks, Agentic Top 10 A1-A10), or is it still primarily single-model?
  • How does PyRIT's Attack Success Rate correlate with real-world adversary success rates on production systems — is ASR a reliable pre-deployment proxy?
  • Can NeMo Guardrails Colang 2.0's event-driven model handle agentic tool call flows, or is it still limited to conversational turns?