Extract structured data reliably with Pydantic and Claude
Write a data extraction pipeline that takes unstructured text (job postings, invoice snippets, or product descriptions) and returns validated Pydantic objects every time. You will define the schema, write a system prompt that enforces JSON output, and build a retry loop that re-prompts when validation fails.
Why this matters
Unstructured-to-structured extraction is one of the highest-ROI use cases for LLMs in production. The failure mode is not Claude refusing; it is Claude returning almost-valid JSON that breaks your downstream system. Pydantic validation plus a retry loop is the pattern that makes this production-safe.
Before you start
- Python with anthropic and pydantic v2 installed
- Basic Pydantic knowledge (defining a BaseModel with typed fields)
- Anthropic API key
- A set of 20 unstructured text samples (job postings or invoices from any public dataset)
Step-by-step guide
- 1
Define your Pydantic model
Choose a domain (job postings work well). Define a model with 5-8 fields including required strings, optional fields, and at least one list field. Run JobPosting.model_json_schema() and inspect the output; this is what you will paste into your system prompt to tell Claude exactly what shape to produce.
```python
from pydantic import BaseModel
from typing import Optional
import json

class JobPosting(BaseModel):
    title: str
    company: str
    location: str
    salary_min: Optional[int] = None
    salary_max: Optional[int] = None
    skills: list[str]
    seniority: Optional[str] = None  # "junior" | "mid" | "senior"
    remote: bool

# Inspect the schema — this is what you paste into the system prompt
schema = json.dumps(JobPosting.model_json_schema(), indent=2)
print(schema)
```
- 2
Write the extraction system prompt
Instruct Claude to return only valid JSON matching the schema. Paste the model_json_schema() output directly into the prompt. Add one rule: if a field is not present in the text, return null for optional fields and never hallucinate values.
```python
import anthropic, json
from pydantic import ValidationError

client = anthropic.Anthropic()

SYSTEM = f"""Extract job posting information and return ONLY valid JSON matching this schema.
If a field is absent from the text, use null for optional fields. Never invent values.

Schema:
{schema}"""

def extract(text: str) -> JobPosting:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        system=SYSTEM,
        messages=[
            {"role": "user", "content": text},
            {"role": "assistant", "content": "{"},  # prefill to force JSON output
        ],
    )
    raw = "{" + response.content[0].text
    return JobPosting.model_validate(json.loads(raw))
```
- 3
Extract from the first 5 samples
Call Claude for each sample, parse the response text as JSON, and pass it to JobPosting.model_validate(). Print each validated object. Expect at least one Pydantic validation error on your first pass; that is normal and the point of the exercise.
```python
samples = [
    "Senior Python Engineer at Acme Corp, London. £70-90k. Skills: Python, FastAPI, PostgreSQL. Hybrid.",
    "Junior Frontend Dev, remote. React, TypeScript required. No salary listed.",
    # ... add 3 more
]

for i, text in enumerate(samples):
    try:
        job = extract(text)
        print(f"[{i}] OK: {job.model_dump()}")
    except (ValidationError, json.JSONDecodeError) as e:
        print(f"[{i}] FAIL: {e}")
```
- 4
Build a retry loop
Wrap the call in a loop: try to validate; if ValidationError, append the error message to the conversation as a user turn and ask Claude to fix it. Cap retries at 3. Track how many inputs needed a retry and how many failed permanently.
```python
def extract_with_retry(text: str, max_retries: int = 3) -> tuple[JobPosting | None, int]:
    messages = [
        {"role": "user", "content": text},
        {"role": "assistant", "content": "{"},  # prefill to force JSON output
    ]
    for attempt in range(max_retries):
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=512,
            system=SYSTEM,
            messages=messages,
        )
        raw = "{" + response.content[0].text
        try:
            return JobPosting.model_validate(json.loads(raw)), attempt
        except (ValidationError, json.JSONDecodeError) as e:
            # Replace the prefill turn with Claude's full (invalid) output, then
            # ask Claude to fix the specific error. Appending a second assistant
            # turn after the prefill would give two consecutive assistant
            # messages, which the API rejects: roles must alternate.
            messages[-1] = {"role": "assistant", "content": raw}
            messages.append({"role": "user", "content": f"Invalid JSON: {e}. Return corrected JSON only."})
            messages.append({"role": "assistant", "content": "{"})
    return None, max_retries
```
- 5
Measure and tune
Run against all 20 samples. Record: success rate on first attempt, success rate after retry, permanent failures. If the failure rate exceeds 10%, read the failed samples; the issue is almost always ambiguity in the schema description or missing null handling in the prompt.
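One way to tally these metrics, assuming the (result, attempts) return shape of extract_with_retry() from Step 4 (the measure() helper and its extract_fn parameter are illustrative names, not part of any library):

```python
from collections import Counter
from typing import Callable

def measure(samples: list[str], extract_fn: Callable) -> Counter:
    """Tally first-attempt successes, retried successes, and permanent failures.

    extract_fn must return (result_or_None, attempts), matching
    extract_with_retry() from Step 4.
    """
    stats = Counter()
    for text in samples:
        result, attempts = extract_fn(text)
        if result is None:
            stats["failed"] += 1          # exhausted all retries
        elif attempts == 0:
            stats["first_attempt"] += 1   # validated on the first call
        else:
            stats["after_retry"] += 1     # needed at least one retry
    return stats
```

Call it as measure(samples, extract_with_retry) and flag tuning work when stats["failed"] / len(samples) exceeds 0.10.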
- 6
Measure the assistant prefill
The extract() and extract_with_retry() calls above already prefill the assistant turn with { to force JSON output. To see what the prefill buys you, remove it (along with the "{" + prefix on the response text), rerun all 20 samples, and compare first-attempt success rates. Prefill eliminates most cases where Claude adds a preamble before the JSON and breaks json.loads().
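When running the prefill-free variant for this comparison, a defensive parser keeps preamble-wrapped responses from counting as hard failures. A minimal sketch (parse_json_loosely is a hypothetical helper, not a substitute for prefill):

```python
import json

def parse_json_loosely(raw: str) -> dict:
    """Best-effort JSON recovery: if a strict parse fails, slice from the
    first '{' to the last '}' and try again. Useful for measuring how often
    prefill-free responses wrap the JSON in a preamble."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        start, end = raw.find("{"), raw.rfind("}")
        if start == -1 or end <= start:
            raise  # no JSON object to recover
        return json.loads(raw[start:end + 1])
```

Counting how often the fallback branch fires gives you a direct measure of the preamble problem that prefill is meant to eliminate.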