Beginner · AI Engineer

Extract structured data reliably with Pydantic and Claude

Write a data extraction pipeline that takes unstructured text (job postings, invoice snippets, or product descriptions) and returns validated Pydantic objects every time. You will define the schema, write a system prompt that enforces JSON output, and build a retry loop that re-prompts when validation fails.

Why this matters

Unstructured-to-structured extraction is one of the highest-ROI use cases for LLMs in production. The failure mode is not Claude refusing; it is Claude returning almost-valid JSON that breaks your downstream system. Pydantic validation plus a retry loop is the pattern that makes this production-safe.
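To make "almost-valid" concrete, here is a minimal sketch using a pared-down hypothetical schema (not the full model defined below): the output parses as JSON, but two fields have the wrong shape, and only Pydantic catches it.

```python
import json
from typing import Optional
from pydantic import BaseModel, ValidationError

class Job(BaseModel):  # pared-down illustration, not the exercise schema
    title: str
    skills: list[str]
    salary_min: Optional[int] = None

# Parses fine as JSON — but skills is a comma-joined string instead of
# a list, and salary_min is a formatted string instead of an int.
raw = '{"title": "Data Engineer", "skills": "Python, SQL", "salary_min": "70k"}'

try:
    Job.model_validate(json.loads(raw))
except ValidationError as e:
    print(len(e.errors()))  # → 2
```

json.loads() alone would have let this payload straight through to your downstream system.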

Step-by-step guide

  1. Define your Pydantic model

    Choose a domain (job postings work well). Define a model with 5-8 fields including required strings, optional fields, and at least one list field. Run model.model_json_schema() and inspect the output; this is what you will paste into your system prompt to tell Claude exactly what shape to produce.

    from pydantic import BaseModel
    from typing import Optional
    import json
    
    class JobPosting(BaseModel):
        title: str
        company: str
        location: str
        salary_min: Optional[int] = None
        salary_max: Optional[int] = None
        skills: list[str]
        seniority: Optional[str] = None  # "junior" | "mid" | "senior"
        remote: bool
    
    # Inspect the schema — this is what you paste into the system prompt
    schema = json.dumps(JobPosting.model_json_schema(), indent=2)
    print(schema)
  2. Write the extraction system prompt

    Instruct Claude to return only valid JSON matching the schema. Paste the model_json_schema() output directly into the prompt. Add one rule: if a field is absent from the text, return null for optional fields and never hallucinate values.

    import anthropic, json
    from pydantic import ValidationError
    
    client = anthropic.Anthropic()
    
    SYSTEM = f"""Extract job posting information and return ONLY valid JSON matching this schema.
    If a field is absent from the text, use null for optional fields.
    Never invent values.
    
    Schema:
    {schema}"""
    
    def extract(text: str) -> JobPosting:
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=512,
            system=SYSTEM,
            messages=[
                {"role": "user", "content": text},
                {"role": "assistant", "content": "{"},  # prefill to force JSON output
            ],
        )
        raw = "{" + response.content[0].text
        return JobPosting.model_validate(json.loads(raw))
  3. Extract from the first 5 samples

    Call Claude for each sample, parse the response text as JSON, and pass it to model.model_validate(). Print each validated object. Expect at least one Pydantic validation error on your first pass; that is normal and the point of the exercise.

    samples = [
        "Senior Python Engineer at Acme Corp, London. £70-90k. Skills: Python, FastAPI, PostgreSQL. Hybrid.",
        "Junior Frontend Dev, remote. React, TypeScript required. No salary listed.",
        # ... add 3 more
    ]
    
    for i, text in enumerate(samples):
        try:
            job = extract(text)
            print(f"[{i}] OK: {job.model_dump()}")
        except (ValidationError, json.JSONDecodeError) as e:
            print(f"[{i}] FAIL: {e}")
  4. Build a retry loop

    Wrap the call in a loop: try to validate; if ValidationError, append the error message to the conversation as a user turn and ask Claude to fix it. Cap retries at 3. Track how many inputs needed a retry and how many failed permanently.

    def extract_with_retry(text: str, max_retries: int = 3) -> tuple[JobPosting | None, int]:
        messages = [
            {"role": "user", "content": text},
            {"role": "assistant", "content": "{"},
        ]
        for attempt in range(max_retries):
            response = client.messages.create(
                model="claude-sonnet-4-6",
                max_tokens=512,
                system=SYSTEM,
                messages=messages,
            )
            raw = "{" + response.content[0].text
            try:
                return JobPosting.model_validate(json.loads(raw)), attempt
            except (ValidationError, json.JSONDecodeError) as e:
                # Fold Claude's full output back into the prefill turn — the
                # API rejects two consecutive assistant messages.
                messages[-1] = {"role": "assistant", "content": raw}
                messages.append({"role": "user", "content": f"Invalid JSON: {e}. Return corrected JSON only."})
                messages.append({"role": "assistant", "content": "{"})  # re-prefill for the next attempt
        return None, max_retries
  5. Measure and tune

    Run against your full sample set (aim for about 20). Record: success rate on the first attempt, success rate after retry, and permanent failures. If the failure rate exceeds 10%, read the failed samples; the cause is almost always ambiguity in the schema description or missing null handling in the prompt.
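One way to tally these numbers, shown as a self-contained sketch (the summarize helper and the stubbed results are illustrative; in practice you would feed it the output of extract_with_retry from step 4):

```python
from collections import Counter

def summarize(results):
    """Tally (object, attempts) pairs as returned by extract_with_retry."""
    stats = Counter()
    for obj, attempts in results:
        if obj is None:
            stats["failed"] += 1      # exhausted all retries
        elif attempts == 0:
            stats["first_try"] += 1   # valid on the first attempt
        else:
            stats["retried"] += 1     # valid after one or more retries
    return stats

# Stubbed results standing in for [extract_with_retry(t) for t in samples]
stats = summarize([("job_a", 0), ("job_b", 2), (None, 3)])
print(dict(stats))  # → {'first_try': 1, 'retried': 1, 'failed': 1}
```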

  6. Verify the assistant prefill

    The extract() call in step 2 already prefills the assistant turn with { to force JSON output. To measure its effect, temporarily remove the prefill (and the matching "{" + prepend) and compare first-attempt success rates. Prefill eliminates most cases where Claude adds a preamble before the JSON and breaks json.loads().
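When running a comparison without prefill, a cheap guard against preambles is to slice out the outermost braces before parsing. A sketch (strip_to_json is an illustrative helper, not part of the exercise):

```python
import json

def strip_to_json(text: str) -> str:
    """Cut everything before the first '{' and after the last '}' —
    a crude guard against preambles and markdown code fences."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end == -1:
        raise json.JSONDecodeError("no JSON object found", text, 0)
    return text[start:end + 1]

raw = 'Here is the JSON you asked for:\n```json\n{"title": "Dev", "remote": true}\n```'
print(json.loads(strip_to_json(raw)))  # → {'title': 'Dev', 'remote': True}
```

With the prefill in place this guard should rarely fire, which is exactly the before/after difference you are measuring.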
