Vision and Multimodal AI

VLM architecture (ViT encoder → projection → LLM), Claude's document-understanding strength, multimodal RAG with ColPali, and image generation model comparison.

Models that process multiple modalities (text, images, audio, video) in a unified architecture. As of 2026, vision is default on all frontier models. "Multimodal" no longer means experimental.


The Vision Pipeline

How VLMs process images:

Image
  ↓ [Vision encoder — ViT or similar]
Image patch embeddings (e.g. 1,024 tokens for a 1024×1024 image)
  ↓ [Projection layer]
Visual tokens in language model space
  ↓ [LLM with interleaved text + visual tokens]
Text output

The vision encoder (typically a ViT, Vision Transformer) divides the image into patches (16×16 or 32×32 pixels), embeds each as a vector, and feeds these into the language model alongside text tokens. The LLM treats visual tokens like text tokens.
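The patch arithmetic above is simple to verify (a toy calculation with a fixed square patch size, not any specific model's tokenizer):

```python
def num_patch_tokens(width: int, height: int, patch_size: int = 32) -> int:
    """Number of patch embeddings a ViT produces for an image,
    assuming non-overlapping square patches."""
    return (width // patch_size) * (height // patch_size)

# A 1024x1024 image with 32x32 patches yields the ~1,024 visual tokens
# quoted in the pipeline diagram; 16x16 patches would yield 4x as many.
print(num_patch_tokens(1024, 1024))  # → 1024
```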


Frontier VLMs (April 2026)

| Model | Vision strength | Notes |
|---|---|---|
| Claude Opus / Sonnet | Best-in-class document understanding | PDF parsing, table extraction, chart reading |
| GPT-4V / GPT-4o | Strong general vision | Real-time audio + video in GPT-4o |
| Gemini 1.5 Pro / Ultra | Native multimodal from training | Best video understanding; 1M context |
| LLaVA-1.6 / Idefics3 | Open-source | Good for self-hosted vision tasks |
| Qwen-VL | Strong for documents + Chinese | Best open OCR |

[Source: general knowledge as of 2026-04-29. Claude 3.5 Sonnet was best-in-class for document understanding at mid-2024 release; as of 2026, Gemini 2.5 Pro and Qwen3-VL-235B are competitive across multimodal benchmarks. Rankings shift frequently — check current leaderboards before model selection.]


Claude Vision

Claude's vision capability is strongest for document understanding: reading PDFs, extracting tables, interpreting charts and technical diagrams.

Sending Images via the API

import anthropic, base64

with open("diagram.png", "rb") as f:
    image_data = base64.standard_b64encode(f.read()).decode("utf-8")

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": image_data,
                }
            },
            {"type": "text", "text": "Describe the architecture shown in this diagram."}
        ]
    }]
)

URL instead of base64:

"source": {"type": "url", "url": "https://example.com/image.png"}

Use URL for publicly accessible images. Use base64 for local files, private images, or when you need offline reproducibility.
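A small helper can encapsulate that choice (a sketch; the dict shapes mirror the API examples above, and the media-type guess via `mimetypes` is an illustrative default):

```python
import base64
import mimetypes

def image_block(path_or_url: str) -> dict:
    """Build a Claude `image` content block: URL source for http(s)
    inputs, base64 source for local files."""
    if path_or_url.startswith(("http://", "https://")):
        return {"type": "image", "source": {"type": "url", "url": path_or_url}}
    media_type = mimetypes.guess_type(path_or_url)[0] or "image/png"
    with open(path_or_url, "rb") as f:
        data = base64.standard_b64encode(f.read()).decode("utf-8")
    return {
        "type": "image",
        "source": {"type": "base64", "media_type": media_type, "data": data},
    }
```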

Supported formats: JPEG, PNG, GIF, WebP.

Vision Limitations

  • Weak handwriting recognition (OCR on cursive or stylised text is unreliable)
  • Counting objects > ~20 is unreliable
  • Spatial reasoning ("is X to the left of Y") degrades for complex scenes
  • Maximum image size varies by model — resize images > 2MB before sending

Document Processing

The primary production use case for Claude vision: processing PDFs, financial statements, legal documents, and technical drawings.

PDF workflow:

with open("contract.pdf", "rb") as f:
    pdf_data = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=4096,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "document",
                "source": {"type": "base64", "media_type": "application/pdf", "data": pdf_data}
            },
            {"type": "text", "text": "Extract all payment terms and conditions."}
        ]
    }]
)

For very long PDFs (50+ pages): use the Files API to upload once and reference by ID, rather than base64-encoding on every call.

Table extraction: Claude handles tables in PDFs better than most OCR tools. Request output as JSON or markdown table for structured downstream processing.
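When you request markdown-table output, a few lines of parsing turn the model's reply into structured rows (a sketch; assumes the response contains a well-formed pipe table):

```python
def parse_markdown_table(text: str) -> list[dict]:
    """Parse a simple markdown pipe table into a list of row dicts,
    skipping the |---|---| separator row."""
    lines = [l.strip() for l in text.strip().splitlines() if l.strip().startswith("|")]
    rows = [[c.strip() for c in l.strip("|").split("|")] for l in lines]
    header = rows[0]
    body = [r for r in rows[1:] if not set("".join(r)) <= set("-: ")]
    return [dict(zip(header, r)) for r in body]
```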


Multimodal RAG

Combining vision with retrieval. Two approaches:

Text-centric: Extract text + structure from documents (using vision), embed the extracted text, retrieve textually. Works well for most document Q&A.

True multimodal RAG (ColPali): Embed document page images directly using a VLM-based embedder and retrieve by visual similarity. Better for heavily visual documents (charts, diagrams, slide decks), but slower and more expensive. The ColPali model (2024) introduced this approach.
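ColPali scores a query against a page by late interaction: each query-token embedding takes its maximum similarity over the page's patch embeddings, and those maxima are summed (a minimal pure-Python sketch with toy vectors; a real system gets the embeddings from a trained VLM embedder):

```python
def maxsim_score(query_embs, page_embs):
    """Late-interaction (MaxSim) score: for each query-token vector,
    take the max dot product over all page-patch vectors, then sum."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return sum(max(dot(q, p) for p in page_embs) for q in query_embs)

# Toy example: a two-token query against two candidate pages.
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[0.9, 0.1], [0.2, 0.8]]   # matches both query tokens well
page_b = [[0.1, 0.1], [0.2, 0.1]]   # matches neither
```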


Audio

| Task | Best tool | Notes |
|---|---|---|
| Speech-to-text | Whisper (OpenAI) | Open-source, runs locally, multilingual |
| Text-to-speech | ElevenLabs, OpenAI TTS | High quality voices |
| Real-time audio | GPT-4o Realtime API | Ultra-low latency voice; not Claude-native |

Claude does not natively process audio input. For voice-to-Claude pipelines: Whisper → text → Claude → TTS.
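The pipeline itself is plain glue code; a sketch with the three stages injected as callables (all names are illustrative — in practice, wire in Whisper, a Messages API call, and a TTS client):

```python
from typing import Callable

def voice_pipeline(
    audio: bytes,
    transcribe: Callable[[bytes], str],   # e.g. a Whisper wrapper
    ask_claude: Callable[[str], str],     # e.g. a client.messages.create wrapper
    synthesize: Callable[[str], bytes],   # e.g. an ElevenLabs / OpenAI TTS wrapper
) -> bytes:
    """Voice in → voice out: speech-to-text, LLM, text-to-speech."""
    text = transcribe(audio)
    reply = ask_claude(text)
    return synthesize(reply)
```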


Image Generation

Not Claude's domain. Generation is a separate model family.

| Model | Strength | Notes |
|---|---|---|
| DALL-E 3 | Best text adherence | Via OpenAI API; integrates with ChatGPT |
| Flux.1 | Best open-source quality | Replaces Stable Diffusion in most workflows |
| Stable Diffusion 3 | Open, flexible | ControlNet, LoRA fine-tuning |
| Midjourney v6 | Aesthetic quality | No API; Discord only |
| Ideogram 2 | Text in images | Best for designs with embedded text |

Key Facts

  • VLM image → tokens: a 1024×1024 image → ~1,024 patch embeddings fed to the LLM alongside text
  • Claude supported image formats: JPEG, PNG, GIF, WebP; resize images >2MB before sending
  • Claude vision weakness: handwriting OCR, counting objects >20, complex spatial reasoning
  • PDF via Files API: upload once, reference by ID; avoids re-encoding base64 on every call
  • True multimodal RAG (ColPali, 2024): embeds document images directly; better for chart/diagram-heavy docs
  • Claude does not natively process audio input — pipeline is Whisper→text→Claude→TTS
  • Flux.1 has largely replaced Stable Diffusion as the open-source image generation default

Common Failure Cases

Image sent as "type": "url" is not accessible by Claude because the URL requires authentication or is behind a VPN
Why: Claude fetches URL-referenced images from its own servers; any URL that requires cookies, auth headers, or is not publicly reachable returns a 403 or 404 when Claude attempts to fetch it; the API returns a successful response but the model cannot see the image.
Detect: the model responds as if no image was provided ("I don't see any image in your message"); switching to base64 encoding of the same image resolves the issue.
Fix: use base64 encoding for any non-public image; reserve URL references for truly public, unauthenticated URLs (e.g., public S3 objects, CDN assets without signed URLs).

base64.standard_b64encode(f.read()).decode("utf-8") encodes a large image that exceeds Claude's per-image token limit, causing a 400 error
Why: Claude converts images to patch embeddings; a very high-resolution image (e.g., a 6000×4000 photo) generates thousands of tokens; combined with the text prompt, this may exceed the model's input token limit.
Detect: the API returns 400 Request too large or the token usage in the response shows unexpectedly high input token counts for a simple image query.
Fix: resize images before encoding — target 1024×1024 or smaller for most tasks; use PIL.Image.thumbnail((1024, 1024)) to resize in-place without distortion; for document PDFs, use the document block type rather than sending each page as an image.
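The resize step is just aspect-ratio arithmetic; a dependency-free sketch of the target-size computation that `PIL.Image.thumbnail` performs (apply the result with PIL or any image library):

```python
def fit_within(width: int, height: int, max_side: int = 1024) -> tuple[int, int]:
    """Scale (width, height) down so neither side exceeds max_side,
    preserving aspect ratio. Never upscales."""
    scale = min(max_side / width, max_side / height, 1.0)
    return (round(width * scale), round(height * scale))

# A 6000x4000 photo shrinks to 1024x683; a small image is left alone.
print(fit_within(6000, 4000))  # → (1024, 683)
```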

Counting objects in an image returns systematically wrong results because the image contains more than ~15-20 instances
Why: Claude's spatial reasoning degrades for dense counting tasks; it reliably counts 5-10 items but undercounts or approximates when items exceed ~15-20, especially if they are small or overlapping.
Detect: asking "how many X are in this image?" returns an answer that differs from the ground truth by more than 10%; the error rate increases with the number of items.
Fix: for counting tasks, use a dedicated object detection model (YOLO, Detectron2) rather than a VLM; or ask Claude to count by segmenting the image into quadrants and summing the partial counts.
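The quadrant workaround divides the image into four crop regions, counts each separately, and sums the results; the box arithmetic is a few lines (a sketch; boxes are (left, top, right, bottom) tuples as PIL's `crop` expects):

```python
def quadrant_boxes(width: int, height: int) -> list[tuple[int, int, int, int]]:
    """Four (left, top, right, bottom) crop boxes that tile the image."""
    mx, my = width // 2, height // 2
    return [
        (0, 0, mx, my),           # top-left
        (mx, 0, width, my),       # top-right
        (0, my, mx, height),      # bottom-left
        (mx, my, width, height),  # bottom-right
    ]
```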

Connections

Open Questions

  • When will Claude gain native audio input, and will it use a ViT-style audio encoder or a different architecture?
  • Does ColPali's visual embedding approach scale to very long documents (100+ pages) without prohibitive cost?
  • How does Claude's document understanding compare to specialised document AI tools (AWS Textract, Azure Document Intelligence) for structured extraction tasks?