Beginner · AI Engineer

Extract structured data reliably with Pydantic and Claude

Write a data extraction pipeline that takes unstructured text (job postings, invoice snippets, or product descriptions) and returns validated Pydantic objects every time. You will define the schema, write a system prompt that enforces JSON output, and build a retry loop that re-prompts when validation fails.

Why this matters

Unstructured-to-structured extraction is one of the highest-ROI use cases for LLMs in production. The failure mode is not Claude refusing; it is Claude returning almost-valid JSON that breaks your downstream system. Pydantic validation plus a retry loop is the pattern that makes this production-safe.
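To make "almost-valid" concrete, here is a minimal sketch using a pared-down hypothetical schema (not the full model defined below): the output parses as JSON, but two fields have the wrong shape, and only Pydantic catches it.

```python
import json
from typing import Optional
from pydantic import BaseModel, ValidationError

class Job(BaseModel):  # pared-down illustration, not the exercise schema
    title: str
    skills: list[str]
    salary_min: Optional[int] = None

# Parses fine as JSON — but skills is a comma-joined string instead of
# a list, and salary_min is a formatted string instead of an int.
raw = '{"title": "Data Engineer", "skills": "Python, SQL", "salary_min": "70k"}'

try:
    Job.model_validate(json.loads(raw))
except ValidationError as e:
    print(len(e.errors()))  # → 2
```

json.loads() alone would have let this payload straight through to your downstream system.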

Step-by-step guide

  1. Define your Pydantic model

    Choose a domain (job postings work well). Define a model with 5-8 fields including required strings, optional fields, and at least one list field. Run model.model_json_schema() and inspect the output; this is what you will paste into your system prompt to tell Claude exactly what shape to produce.

    from pydantic import BaseModel
    from typing import Optional
    import json
    
    class JobPosting(BaseModel):
        title: str
        company: str
        location: str
        salary_min: Optional[int] = None
        salary_max: Optional[int] = None
        skills: list[str]
        seniority: Optional[str] = None  # "junior" | "mid" | "senior"
        remote: bool
    
    # Inspect the schema — this is what you paste into the system prompt
    schema = json.dumps(JobPosting.model_json_schema(), indent=2)
    print(schema)
  2. Write the extraction system prompt

    Instruct Claude to return only valid JSON matching the schema. Paste the model_json_schema() output directly into the prompt. Add one rule: if a field is absent from the text, return null for optional fields and never hallucinate values.

    import anthropic, json
    from pydantic import ValidationError
    
    client = anthropic.Anthropic()
    
    SYSTEM = f"""Extract job posting information and return ONLY valid JSON matching this schema.
    If a field is absent from the text, use null for optional fields.
    Never invent values.
    
    Schema:
    {schema}"""
    
    def extract(text: str) -> JobPosting:
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=512,
            system=SYSTEM,
            messages=[
                {"role": "user", "content": text},
                {"role": "assistant", "content": "{"},  # prefill to force JSON output
            ],
        )
        raw = "{" + response.content[0].text
        return JobPosting.model_validate(json.loads(raw))
  3. Extract from the first 5 samples

    Call Claude for each sample, parse the response text as JSON, and pass it to model.model_validate(). Print each validated object. Expect at least one Pydantic validation error on your first pass; that is normal and the point of the exercise.

    samples = [
        "Senior Python Engineer at Acme Corp, London. £70-90k. Skills: Python, FastAPI, PostgreSQL. Hybrid.",
        "Junior Frontend Dev, remote. React, TypeScript required. No salary listed.",
        # ... add 3 more
    ]
    
    for i, text in enumerate(samples):
        try:
            job = extract(text)
            print(f"[{i}] OK: {job.model_dump()}")
        except (ValidationError, json.JSONDecodeError) as e:
            print(f"[{i}] FAIL: {e}")
  4. Build a retry loop

    Wrap the call in a loop: try to validate; if ValidationError, append the error message to the conversation as a user turn and ask Claude to fix it. Cap retries at 3. Track how many inputs needed a retry and how many failed permanently.

    def extract_with_retry(text: str, max_retries: int = 3) -> tuple[JobPosting | None, int]:
        messages = [
            {"role": "user", "content": text},
            {"role": "assistant", "content": "{"},
        ]
        for attempt in range(max_retries):
            response = client.messages.create(
                model="claude-sonnet-4-6",
                max_tokens=512,
                system=SYSTEM,
                messages=messages,
            )
            raw = "{" + response.content[0].text
            try:
                return JobPosting.model_validate(json.loads(raw)), attempt
            except (ValidationError, json.JSONDecodeError) as e:
                # Fold Claude's full output back into the prefill turn — the
                # API rejects two consecutive assistant messages.
                messages[-1] = {"role": "assistant", "content": raw}
                messages.append({"role": "user", "content": f"Invalid JSON: {e}. Return corrected JSON only."})
                messages.append({"role": "assistant", "content": "{"})  # re-prefill for the next attempt
        return None, max_retries
  5. Measure and tune

    Run against your full sample set (aim for about 20). Record: success rate on the first attempt, success rate after retry, and permanent failures. If the failure rate exceeds 10%, read the failed samples; the cause is almost always ambiguity in the schema description or missing null handling in the prompt.
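One way to tally these numbers, shown as a self-contained sketch (the summarize helper and the stubbed results are illustrative; in practice you would feed it the output of extract_with_retry from step 4):

```python
from collections import Counter

def summarize(results):
    """Tally (object, attempts) pairs as returned by extract_with_retry."""
    stats = Counter()
    for obj, attempts in results:
        if obj is None:
            stats["failed"] += 1      # exhausted all retries
        elif attempts == 0:
            stats["first_try"] += 1   # valid on the first attempt
        else:
            stats["retried"] += 1     # valid after one or more retries
    return stats

# Stubbed results standing in for [extract_with_retry(t) for t in samples]
stats = summarize([("job_a", 0), ("job_b", 2), (None, 3)])
print(dict(stats))  # → {'first_try': 1, 'retried': 1, 'failed': 1}
```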

  6. Verify the assistant prefill

    The extract() call in step 2 already prefills the assistant turn with { to force JSON output. To measure its effect, temporarily remove the prefill (and the matching "{" + prepend) and compare first-attempt success rates. Prefill eliminates most cases where Claude adds a preamble before the JSON and breaks json.loads().
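When running a comparison without prefill, a cheap guard against preambles is to slice out the outermost braces before parsing. A sketch (strip_to_json is an illustrative helper, not part of the exercise):

```python
import json

def strip_to_json(text: str) -> str:
    """Cut everything before the first '{' and after the last '}' —
    a crude guard against preambles and markdown code fences."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end == -1:
        raise json.JSONDecodeError("no JSON object found", text, 0)
    return text[start:end + 1]

raw = 'Here is the JSON you asked for:\n```json\n{"title": "Dev", "remote": true}\n```'
print(json.loads(strip_to_json(raw)))  # → {'title': 'Dev', 'remote': True}
```

With the prefill in place this guard should rarely fire, which is exactly the before/after difference you are measuring.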
