IntermediateAI Engineer

Build a document understanding pipeline with Claude vision

Build a pipeline that converts a PDF document to page images, sends each page to Claude as a base64-encoded image, extracts structured information using a Pydantic schema, and evaluates extraction accuracy against a hand-labelled ground truth set of 5-10 pages.

Why this matters

Claude vision is best-in-class for document understanding; it outperforms dedicated OCR tools on complex layouts, handwriting, and tables. The pattern (PDF to images, images to Claude, output to schema) is reused across invoice processing, medical record extraction, and any workflow where data lives in scanned documents rather than databases.

Before you start

Python with anthropic, pydantic, and pdf2image or pymupdf installed
Anthropic API key
A PDF document with extractable information; invoices, receipts, or a simple form work well
A ground truth: manually label 5-10 pages with the expected extracted values

Step-by-step guide

1
Convert PDF pages to images
Use pdf2image.convert_from_path or PyMuPDF to render each PDF page as a PNG at 150-200 DPI. Save to a temporary directory. Print the pixel dimensions; Claude works best when images are between 100px and 8000px on the long edge.
2
Encode images as base64
Read each PNG, base64-encode it, and construct the image content block: {type: image, source: {type: base64, media_type: image/png, data: <encoded>}}. This is the exact shape the API expects. Verify the structure matches the Anthropic docs before making any API calls.
3
Send a single page and verify
Build a message with the image block followed by a text block asking what is on the page. Print the response. Verify Claude can read the document correctly before adding the schema. If the description is wrong, check image resolution and contrast.
4
Add schema extraction
Define a Pydantic model for the document type (for a receipt: vendor, date, line_items as a list, total, currency). Update the prompt to return JSON matching the schema. Validate the response. Iterate on the prompt until the first page extracts cleanly.
5
Process all pages and aggregate
Loop over all pages, extract, and collect results. For multi-page documents, decide: does each page stand alone, or do you need to merge page results into a single document-level object? Implement the merge if needed.
6
Evaluate accuracy against ground truth
For each labelled field in your ground truth, compare the extracted value to the expected value. Report per-field accuracy. Identify which fields Claude gets wrong most often; these are usually amounts with ambiguous formatting or fields where the document layout varies between pages.

Relevant Axiom pages

Vision models Anthropic API RAG pipeline overview

What to do next

Back to Practice Lab

Why this matters

Before you start

Step-by-step guide

Convert PDF pages to images

Encode images as base64

Send a single page and verify

Add schema extraction

Process all pages and aggregate

Evaluate accuracy against ground truth

Relevant Axiom pages

What to do next