Guardrails and Output Validation
Runtime enforcement of LLM output contracts — ensuring models return valid structure, safe content, and correct format before that output reaches your users or downstream systems.
Distinct from security/prompt-injection (blocking bad inputs) and security/owasp-llm-top10 (the threat taxonomy). This page is about validating and constraining what comes out.
The Problem
LLMs are probabilistic. Even with clear instructions they can:
- Return malformed JSON
- Include off-topic or harmful content
- Hallucinate fields or values
- Ignore output format instructions under certain prompts
Production systems need a validation layer between the LLM response and the application.
Pattern 1: instructor (Schema Enforcement)
The lightest-weight solution. Wraps the Anthropic/OpenAI client to enforce a Pydantic schema with automatic retry on validation failure. See python/instructor for full coverage.
```python
import instructor
import anthropic
from pydantic import BaseModel, field_validator

# Patch the Anthropic client so calls accept response_model / max_retries
client = instructor.from_anthropic(anthropic.Anthropic())

class ReviewDecision(BaseModel):
    approved: bool
    reason: str
    confidence: float

    @field_validator("confidence")
    @classmethod
    def confidence_in_range(cls, v):
        if not 0.0 <= v <= 1.0:
            raise ValueError("confidence must be between 0 and 1")
        return v

decision = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=512,
    messages=[{"role": "user", "content": "Review this PR: ..."}],
    response_model=ReviewDecision,
    max_retries=3,
)
```

Best for: Schema enforcement and type validation. Simple and composable.
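On success, `decision` is already a validated `ReviewDecision` instance rather than raw text, so downstream code can use typed fields directly:

```python
# decision has passed schema and range validation by this point
print(decision.approved, decision.reason, decision.confidence)
```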
Pattern 2: Guardrails AI
A framework for building "guardrails": named validators that run before (input) and/or after (output) every LLM call. Validators are composable and reusable.
```bash
pip install guardrails-ai
```

```python
import anthropic
from guardrails import Guard
from guardrails.hub import ValidLength, DetectPII, ToxicLanguage

guard = Guard().use_many(
    ValidLength(min=10, max=500, on_fail="reask"),
    DetectPII(pii_entities=["EMAIL_ADDRESS", "PHONE_NUMBER"], on_fail="fix"),
    ToxicLanguage(threshold=0.5, on_fail="filter"),
)

outcome = guard(
    anthropic.Anthropic().messages.create,
    prompt="Summarise this customer complaint: {complaint}",
    prompt_params={"complaint": complaint_text},
    model="claude-sonnet-4-6",
    max_tokens=512,
)
print(outcome.validated_output)  # cleaned output
```

On-fail actions
| Action | Behaviour |
|---|---|
| `"reask"` | Send the output and error back to the LLM for correction |
| `"fix"` | Apply an automated fix (e.g., truncate length, redact PII) |
| `"filter"` | Remove the offending content |
| `"exception"` | Raise a `ValidationError` |
| `"noop"` | Log but pass through |
Validator hub
Guardrails AI has a hub of pre-built validators:
```bash
guardrails hub install hub://guardrails/detect_pii
guardrails hub install hub://guardrails/toxic_language
guardrails hub install hub://guardrails/valid_length
guardrails hub install hub://guardrails/regex_match
```

Best for: Multi-rule validation pipelines, PII redaction, content safety at scale.
Pattern 3: NVIDIA NeMo Guardrails
Conversation-level guardrails using a domain-specific language (Colang) to define allowed/disallowed conversational flows. Primarily for chatbot safety.
```bash
pip install nemoguardrails
```

```
# config/rails.co
define user ask politics
  "What do you think about the election?"
  "Which party should I vote for?"

define bot refuse to answer politics
  "I'm not able to discuss political topics."

define flow politics
  user ask politics
  bot refuse to answer politics
```

```python
from nemoguardrails import RailsConfig, LLMRails

config = RailsConfig.from_path("./config")
rails = LLMRails(config)

response = rails.generate(
    messages=[{"role": "user", "content": "What about the election?"}]
)
```

Best for: Topic control, conversation flow enforcement, customer-facing chatbots with strict scope limits.
Pattern 4: Manual Validation Layer
For simple cases, write your own:
```python
import json
import re

from anthropic import Anthropic

client = Anthropic()

def extract_json(text: str) -> dict:
    # Find a JSON block in the model output
    match = re.search(r'\{.*\}', text, re.DOTALL)
    if not match:
        raise ValueError("No JSON found in response")
    return json.loads(match.group())

def call_with_validation(prompt: str, max_retries: int = 3) -> dict:
    for attempt in range(max_retries):
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        try:
            return extract_json(response.content[0].text)
        except (ValueError, json.JSONDecodeError) as e:
            if attempt == max_retries - 1:
                raise
            # Feed the error back so the model can correct its format
            prompt += f"\n\nYour previous response had an error: {e}. Please try again."
    raise RuntimeError("Max retries exceeded")
```

Choosing an Approach
| Scenario | Best approach |
|---|---|
| Structured output with Pydantic schema | instructor |
| Multi-rule validation pipeline | Guardrails AI |
| Chatbot topic/flow control | NeMo Guardrails |
| Simple JSON extraction | Manual regex + retry |
| PII redaction at scale | Guardrails AI with DetectPII |
| Agent tool return validation | instructor or manual |
Input vs Output Guardrails
This page focuses on output validation. Input validation (blocking malicious prompts) is covered in:
- security/prompt-injection — injection attack patterns and defences
- security/owasp-llm-top10 — LLM01 (prompt injection), LLM02 (insecure output handling)
A production system needs both: validate inputs before the LLM sees them, validate outputs before your application uses them.
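A minimal sketch of that two-sided boundary (the gate functions here are hypothetical stand-ins, not any specific library's API):

```python
def validate_input(text: str) -> str:
    # Hypothetical input gate: reject an obvious injection marker
    # before the text ever reaches the model.
    if "ignore previous instructions" in text.lower():
        raise ValueError("Possible prompt injection")
    return text

def validate_output(text: str) -> str:
    # Hypothetical output gate: enforce a simple length contract
    # before the response reaches the application.
    if len(text) > 500:
        raise ValueError("Output exceeds length contract")
    return text

def guarded_call(llm, user_input: str) -> str:
    # Validate on both sides of the model call.
    return validate_output(llm(validate_input(user_input)))
```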
Common Failure Cases
instructor `max_retries` exceeded because the model's explanation wraps the JSON in markdown fences
Why: some models return JSON inside a markdown code fence (```` ```json ... ``` ````) by default; instructor's parser tries to parse the full response text as JSON, fails, and retries; after `max_retries` attempts the error propagates.
Detect: `InstructorRetryException` on every call for a specific model; the raw response content shows valid JSON wrapped in markdown.
Fix: set `mode=instructor.Mode.JSON` or `mode=instructor.Mode.MD_JSON` to match the model's output style; or add a pre-parse step that strips code fences before the Pydantic validator runs, as sketched below.
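A minimal fence-stripping sketch for the pre-parse fix (plain Python, independent of any instructor mode):

```python
import re

def strip_code_fences(text: str) -> str:
    # If the payload is wrapped in a ```json ... ``` (or bare ``` ... ```)
    # fence, return only the fenced body; otherwise pass text through.
    match = re.search(r"```(?:json)?\s*(.*?)\s*```", text, re.DOTALL)
    return match.group(1) if match else text
```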
Guardrails AI `on_fail="reask"` loops indefinitely when the model cannot satisfy a validator
Why: `reask` sends the failed output back to the model with the error message; if the validator checks for something the model genuinely cannot produce (e.g., detecting PII in a response that must contain a name), the model fails every reask attempt.
Detect: API calls multiply by `max_retries`; the final response is still a validation failure; cost per request spikes.
Fix: use `on_fail="exception"` or `on_fail="filter"` for validators the model structurally cannot satisfy; reserve `reask` for format/schema validators where restatement actually helps. A fail-fast sketch follows.
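A sketch of the fail-fast variant, reusing the `DetectPII` hub validator from Pattern 2 (the exact exception class varies across guardrails versions, so this catches broadly):

```python
from guardrails import Guard
from guardrails.hub import DetectPII

# PII is not something the model can remove by restating its answer,
# so fail fast rather than burning reask retries.
guard = Guard().use(
    DetectPII(pii_entities=["EMAIL_ADDRESS"], on_fail="exception"),
)

try:
    guard.validate("Contact me at jane@example.com")
except Exception as exc:  # guardrails raises a validation-error subclass
    print(f"Blocked by guardrail: {exc}")
```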
NeMo Guardrails Colang flow silently drops user messages that don't match any defined pattern
Why: NeMo routes user messages through defined conversational flows; if a user message matches no defined user pattern, NeMo returns an empty response or the default fallback without explanation.
Detect: users receive empty responses or generic fallbacks for valid queries that are slightly out-of-scope; Colang logs show "no matching flow" for those messages.
Fix: add a catch-all pattern (`define user ask anything` plus a default flow, sketched below) to handle unrecognised messages; or set `allow_unhandled_events: true` in the NeMo config to pass unmatched messages through to the underlying LLM.
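A catch-all sketch in the same Colang style as Pattern 3 (the example utterances and bot message are placeholders to adapt to your domain):

```
# config/rails.co - fallback for out-of-scope messages
define user ask anything
  "Can you help me with something else?"
  "I have a different question"

define bot offer help within scope
  "I can help with questions about this service. What would you like to know?"

define flow default
  user ask anything
  bot offer help within scope
```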
Manual regex-based JSON extraction fails when the model returns multiple JSON objects in one response
Why: `re.search(r'\{.*\}', text, re.DOTALL)` matches from the first `{` to the last `}`; if the model includes both a scratchpad JSON and the final JSON, the regex captures everything between them, producing invalid JSON.
Detect: `json.JSONDecodeError` on responses where the model was verbose; the extracted string contains two separate JSON objects concatenated.
Fix: scan for balanced top-level objects and select the last one (sketched below); a flat regex such as `re.findall(r'\{[^{}]*\}', text)` only works when the objects contain no nested braces. Alternatively, instruct the model to return its JSON inside a specific tag, e.g. `<result>{...}</result>`, and parse between the tags.
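A sketch of the brace-balanced scan (note it ignores braces inside JSON string values, which is usually fine for model output but not fully general):

```python
import json

def last_json_object(text: str) -> dict:
    # Track brace depth to find each balanced top-level {...} span,
    # then parse only the last one (the model's final answer).
    spans, depth, start = [], 0, None
    for i, ch in enumerate(text):
        if ch == "{":
            if depth == 0:
                start = i
            depth += 1
        elif ch == "}" and depth > 0:
            depth -= 1
            if depth == 0:
                spans.append(text[start : i + 1])
    if not spans:
        raise ValueError("No JSON object found in response")
    return json.loads(spans[-1])
```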
Connections
- python/instructor — schema enforcement via Pydantic (Pattern 1 above, in depth)
- security/prompt-injection — input-side defence
- security/owasp-llm-top10 — LLM02 (insecure output handling) and LLM06 (sensitive info disclosure)
- evals/methodology — guardrails can be modelled as automated evals running on every call
- agents/practical-agent-design — agent output validation as a safety layer
Open Questions
- What attack vectors does this defence fail to address?
- How does the threat model change when this is deployed in an agentic context?