Build a document understanding pipeline with Claude vision
Build a pipeline that converts a PDF document to page images, sends each page to Claude as a base64-encoded image, extracts structured information using a Pydantic schema, and evaluates extraction accuracy against a hand-labelled ground truth set of 5-10 pages.
Why this matters
Claude vision is best-in-class for document understanding; it outperforms dedicated OCR tools on complex layouts, handwriting, and tables. The pattern (PDF to images, images to Claude, output to schema) is reused across invoice processing, medical record extraction, and any workflow where data lives in scanned documents rather than databases.
Before you start
- Python with anthropic, pydantic, and pdf2image or pymupdf installed
- Anthropic API key
- A PDF document with extractable information; invoices, receipts, or a simple form work well
- A ground truth: manually label 5-10 pages with the expected extracted values
Step-by-step guide
- 1
Convert PDF pages to images
Use pdf2image.convert_from_path or PyMuPDF to render each PDF page as a PNG at 150-200 DPI. Save to a temporary directory. Print the pixel dimensions; Claude works best when images are between 100px and 8000px on the long edge.
- 2
Encode images as base64
Read each PNG, base64-encode it, and construct the image content block: {type: image, source: {type: base64, media_type: image/png, data: <encoded>}}. This is the exact shape the API expects. Verify the structure matches the Anthropic docs before making any API calls.
- 3
Send a single page and verify
Build a message with the image block followed by a text block asking what is on the page. Print the response. Verify Claude can read the document correctly before adding the schema. If the description is wrong, check image resolution and contrast.
- 4
Add schema extraction
Define a Pydantic model for the document type (for a receipt: vendor, date, line_items as a list, total, currency). Update the prompt to return JSON matching the schema. Validate the response. Iterate on the prompt until the first page extracts cleanly.
- 5
Process all pages and aggregate
Loop over all pages, extract, and collect results. For multi-page documents, decide: does each page stand alone, or do you need to merge page results into a single document-level object? Implement the merge if needed.
- 6
Evaluate accuracy against ground truth
For each labelled field in your ground truth, compare the extracted value to the expected value. Report per-field accuracy. Identify which fields Claude gets wrong most often; these are usually amounts with ambiguous formatting or fields where the document layout varies between pages.