Reasoning Model Patterns
A production decision framework for when to use reasoning models and extended thinking — covering task fit, budget_tokens selection, cross-provider comparison, and cost/latency tradeoffs.
[Source: Anthropic API docs / platform.claude.com, WebSearch, 2026-05-03]
What Is a Reasoning Model?
A reasoning model allocates a dedicated chain-of-thought phase before producing its final answer. This internal scratchpad lets the model self-verify intermediate steps, backtrack, and try alternative approaches — behaviours that standard autoregressive generation cannot do mid-token.
The implementation varies by provider:
- Claude (Anthropic): a `thinking` block in the API response, streamed separately from the text block
- OpenAI o-series: reasoning tokens consumed internally; partially surfaced via `reasoning_effort`
- Gemini 2.5 Pro / Flash: a `thinkingConfig` parameter with a token budget
- DeepSeek R1: reasoning traces emitted inside `<think>` tags before the final answer
In all cases, the model pays a token tax (latency + cost) upfront in exchange for higher accuracy on tasks where reasoning depth matters.
When Thinking Helps
Use a reasoning model when the task has these properties:
Multi-step math, logic, and formal proofs. The model can check each step against the next instead of committing to a plausible-sounding trajectory. AIME 2024 pass@1 on R1 improved from 15.6% to 77.9% during GRPO training solely because RL rewarded correct chains.
Algorithm design and complex debugging. Reasoning allows the model to mentally trace execution paths, catch off-by-one errors, and evaluate alternative implementations before writing. Claude Opus 4.5 scores 80.9% on SWE-bench Verified — substantially ahead of its non-thinking baseline — and o3 hits 69.1% on the same benchmark.
Tasks requiring self-verification. When correctness matters and the output can be internally cross-checked (e.g., "does this proof follow from premise A?"), extended thinking lets the model act as its own adversarial reviewer before answering.
High latency tolerance with high accuracy requirement. If the user can wait 10–60 seconds and a wrong answer has meaningful cost (e.g., an architectural decision, a medical triage question, a security audit), the latency tradeoff is worth it. The key test: would you pay a human expert to spend more time thinking? If yes, use a reasoning model.
When Thinking Hurts or Wastes Money
Simple factual retrieval. Looking up a library version, a date, a definition, or a well-established procedure does not benefit from extended reasoning. The answer is in the model's weights. Extended thinking adds 10+ seconds of latency and 3–5x the token cost for zero accuracy improvement.
Creative writing. Thinking tends to over-plan, producing prose that feels mechanical and overly structured. Standard temperature-driven generation with a well-crafted system prompt outperforms extended thinking on creative tasks.
Short-answer classification. Sentiment analysis, intent classification, format validation, routing decisions — these are pattern-matching tasks. A reasoning model burning 5,000 tokens to decide "positive or negative" is money incinerated.
High-volume chained pipelines. A pipeline that applies extended thinking to 1 million records at 30 seconds per call takes approximately one year to complete sequentially. At scale, reasoning model latency compounds from an engineering inconvenience into a product constraint. Worse: if your pipeline chains multiple LLM calls (e.g., extract → classify → summarise → rank), enabling thinking at each step compounds the cost and latency overhead across every stage of the chain.
Sub-second UX requirements. Reasoning model time-to-first-token is typically 2–10 seconds on standard API endpoints. If users expect instant feedback, standard models with streaming are the only viable path.
Cross-Provider Model Comparison
Claude Extended Thinking (Anthropic)
Models: claude-opus-4-6 and earlier accept thinking: {type: "enabled", budget_tokens: N}; this explicit form is deprecated on 4.7+. Newer models use adaptive thinking.
Adaptive thinking (current approach): thinking: {type: "adaptive"} — Claude evaluates each request and decides whether to produce a thinking block and how long to spend on it. Manual budget_tokens is deprecated on claude-opus-4-7 and later; passing it returns a 400 error.
Legacy explicit budget: Available on claude-opus-4-6 and claude-sonnet-4-6. Range: 1,024 to 32,000+ tokens. Billed as output tokens at output token price.
```python
import anthropic

client = anthropic.Anthropic()

# Legacy explicit budget (opus-4-6 / sonnet-4-6 only)
response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 10000},
    messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
)
# response.content[0] is a ThinkingBlock, response.content[1] is a TextBlock
```
```python
# Adaptive (opus-4-7+)
response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=16000,
    thinking={"type": "adaptive"},
    messages=[{"role": "user", "content": "Design a distributed rate limiter."}],
)
```
Thinking blocks are streamed separately. In the streaming response, type: "thinking" events arrive before type: "text" events. See apis/anthropic-api for the streaming implementation.
Benchmarks (April 2026): Claude Opus 4.5: 80.9% SWE-bench Verified, outperforming o3 (69.1%) and Gemini 3 Pro (76.2%) on software engineering tasks.
OpenAI o3 / o3-mini / o4-mini
Released April 2026. Reasoning tokens are consumed internally; the model does not expose the full chain-of-thought by default.
```python
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="o3",
    reasoning_effort="high",  # "low" | "medium" | "high"
    messages=[{"role": "user", "content": "Solve this AIME problem..."}],
)
# response.usage.completion_tokens_details.reasoning_tokens gives the reasoning token count
```
Effort levels map approximately to:
| Effort | Reasoning tokens (approx) | Latency | Use case |
|---|---|---|---|
| low | 1,000–5,000 | 2–5s | Light reasoning, routing decisions |
| medium | 5,000–20,000 | 5–20s | Default — most complex tasks |
| high | 20,000–50,000+ | 20–120s | Frontier math, hard proofs |
Pricing: o3 at $10/$40 per M input/output tokens (reasoning tokens billed as output). o3-mini at $1.10/$4.40 — the cost-efficient reasoning tier. No temperature parameter on o-series models.
Benchmarks: o3: 88.9% AIME 2026, 83.3% GPQA Diamond, 69.1% SWE-bench Verified, 2706 Elo competitive programming.
Gemini 2.5 Pro / 2.5 Flash (Google)
Both Gemini 2.5 Pro and 2.5 Flash support a thinking mode controlled by thinkingConfig.
```python
from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment
response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents="Solve this integral step by step...",
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(thinking_budget=10000)
    ),
)
```
Gemini 2.5 Flash thinking is competitively priced ($0.15/$0.60 per M input/output) and is the recommended tier for cost-sensitive reasoning tasks where Claude Opus pricing is prohibitive. Gemini 2.5 Pro ($1.25/$10) targets frontier reasoning.
Context: 1M token context window on both models — the largest available commercially. Thinking tokens add to this budget, so extremely long-context + heavy-thinking combinations can hit limits.
DeepSeek R1
Open-weights reasoning model. Does not use a budget_tokens API parameter — reasoning depth is controlled implicitly by the model's training.
```python
from openai import OpenAI  # DeepSeek exposes an OpenAI-compatible API

client = OpenAI(
    api_key="your-deepseek-key",
    base_url="https://api.deepseek.com",
)
response = client.chat.completions.create(
    model="deepseek-reasoner",  # R1 model name
    max_tokens=8000,  # must cover <think>...</think> + final answer
    messages=[{"role": "user", "content": "Prove the AM-GM inequality."}],
)
# response.choices[0].message.reasoning_content → the chain-of-thought
# response.choices[0].message.content → the final answer
```
Critical production note: R1 emits the full reasoning trace inside <think> tags. These count against max_tokens. Set max_tokens = (expected reasoning length) + (expected answer length). A common footgun: setting max_tokens=512 for the answer without accounting for the 3,000-token thinking trace, producing truncated or empty final answers.
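A minimal guard against that footgun, assuming the OpenAI-compatible client configured above (the retry-and-double policy is an illustrative choice, not a DeepSeek recommendation):

```python
def r1_call(client, prompt, max_tokens=8000):
    """Call deepseek-reasoner and retry once with a larger budget if truncated."""
    response = client.chat.completions.create(
        model="deepseek-reasoner",
        max_tokens=max_tokens,
        messages=[{"role": "user", "content": prompt}],
    )
    choice = response.choices[0]
    if choice.finish_reason == "length":
        # The <think> trace consumed the budget; give the call more room and retry once.
        response = client.chat.completions.create(
            model="deepseek-reasoner",
            max_tokens=max_tokens * 2,
            messages=[{"role": "user", "content": prompt}],
        )
        choice = response.choices[0]
    return choice.message.reasoning_content, choice.message.content
```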
Pricing: $0.55/$2.19 per M input/output tokens — 96% cheaper than o1 at launch. Distilled variants (1.5B–70B) are available for local deployment via llama.cpp / vLLM. See llms/deepseek-r1 for the full treatment.
budget_tokens Selection Guide (Claude Legacy API)
For claude-opus-4-6 and claude-sonnet-4-6 with explicit budget_tokens:
| Task type | Recommended budget | Rationale |
|---|---|---|
| Simple structured output (JSON extraction, classification) | 1,024–2,048 | Overhead only; keep minimal |
| Moderate reasoning (code review, logical deduction) | 5,000–8,000 | Default starting point |
| Complex multi-step reasoning (debugging subtle bugs, system design) | 10,000–16,000 | Covers most hard tasks |
| Hard math, formal proofs, AIME-level problems | 16,000–32,000 | Needs space to explore and backtrack |
| Frontier research / most complex agent decisions | 32,000+ | Use batch API; avoid on streaming UX |
Tuning protocol:
- Start at 5,000 tokens for any new task type.
- Run your eval suite. If outputs are wrong or incomplete, double the budget.
- Repeat until accuracy plateaus. Most tasks plateau below 16,000 tokens.
- For budgets above 32,000, switch to the Batch API to avoid network timeout issues.
The budget is a target, not a hard cap — actual consumption varies. At the 32K+ range, the Batch API is recommended.
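A sketch of that protocol as a loop, assuming a hypothetical `eval_suite(budget)` helper that runs your task set at a given budget and returns accuracy:

```python
def tune_budget(eval_suite, start=5_000, ceiling=32_000, plateau_delta=0.01):
    """Double budget_tokens until accuracy stops improving meaningfully.

    eval_suite(budget) -> accuracy in [0, 1]; plateau_delta is an illustrative threshold.
    """
    budget = start
    best_accuracy = eval_suite(budget)
    while budget * 2 <= ceiling:
        accuracy = eval_suite(budget * 2)
        if accuracy - best_accuracy < plateau_delta:
            break  # accuracy has plateaued; keep the smaller budget
        budget, best_accuracy = budget * 2, accuracy
    return budget, best_accuracy
```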
Cost and Latency Tradeoffs
Token cost multiplier. Thinking tokens on Claude are billed as output tokens. At Claude Sonnet 4.6 ($15/M output), 10,000 thinking tokens cost $0.15 per call — before the actual answer. At 10,000 calls/day, that is $1,500 per day, roughly $45,000/month, in thinking overhead alone.
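The same arithmetic as a small helper, with all figures supplied by you rather than hard-coded provider prices:

```python
def thinking_overhead(calls_per_day, thinking_tokens_per_call, output_price_per_mtok):
    """Estimate daily and (30-day) monthly spend on thinking tokens alone."""
    per_call = thinking_tokens_per_call / 1_000_000 * output_price_per_mtok
    per_day = per_call * calls_per_day
    return per_day, per_day * 30

# Figures from the text: 10,000 thinking tokens at $15/M output, 10,000 calls/day
per_day, per_month = thinking_overhead(10_000, 10_000, 15.0)
# per_day == 1500.0, per_month == 45000.0
```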
Thinking tokens vs standard output cost comparison:
| Provider | Thinking token cost | Standard output cost | Multiplier |
|---|---|---|---|
| Claude Sonnet 4.6 | $15/M (same as output) | $15/M | 1x nominal, but additive |
| o3 | $40/M (output) | $40/M | High absolute cost |
| o3-mini | $4.40/M | $4.40/M | Cheapest reasoning tier |
| Gemini 2.5 Flash | $0.60/M | $0.60/M | Most cost-efficient |
| DeepSeek R1 | $2.19/M | $2.19/M | Open-weights option available |
Latency. Time-to-first-token for reasoning models is 2–10 seconds at low budgets, 10–60+ seconds at high budgets. Standard models stream first tokens in under 500ms. This gap is the primary reason reasoning models cannot replace standard models in interactive UX without a deliberate "thinking..." state.
Streaming thinking blocks to reduce perceived latency. Stream the thinking block to the UI as it arrives. This converts a 30-second blank wait into a 30-second "thinking..." animation — substantially better UX. Anthropic's streaming API sends thinking events before text events; consume and display them.
```python
with client.messages.stream(
    model="claude-opus-4-6",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 10000},
    messages=[...],
) as stream:
    for event in stream:
        if event.type == "content_block_start":
            if event.content_block.type == "thinking":
                print("[thinking...]")
        elif event.type == "content_block_delta":
            if hasattr(event.delta, "thinking"):
                print(event.delta.thinking, end="", flush=True)
            elif hasattr(event.delta, "text"):
                print(event.delta.text, end="", flush=True)
```
Prompt caching to offset cost. On repeat calls with the same system prompt, mark the system prompt with cache_control to cache it. The cached prefix costs 0.1x on re-read. This partially offsets thinking token cost for multi-turn sessions. See synthesis/cost-optimisation and apis/anthropic-api for caching implementation.
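A minimal sketch of that caching pattern with the same Anthropic client as above; `LONG_SYSTEM_PROMPT` is a placeholder for your reusable instructions:

```python
response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 10000},
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,  # reusable instructions / reference material
            "cache_control": {"type": "ephemeral"},  # cache the prefix up to this block
        }
    ],
    messages=[{"role": "user", "content": "Review this design doc..."}],
)
# response.usage.cache_read_input_tokens shows how much of the prefix came from cache
```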
Production Decision Framework
Is the task verifiable / does it have a correct answer?
├── No (creative writing, subjective summary, conversational)
│ └── Use standard model. Reasoning adds mechanical quality, not creativity.
│
└── Yes → Does accuracy matter more than latency?
├── No (latency < 1s required, or volume > 100k calls/day)
│ └── Use standard model (Claude Sonnet/Haiku, GPT-4o, Gemini Flash).
│ If pipeline chains 3+ LLM calls: standard only.
│
└── Yes → What is the task complexity?
├── Simple (classification, retrieval, extraction, routing)
│ └── Use standard model. Thinking budget 1,024 if needed as sanity check.
│
├── Moderate (code review, SQL generation, logical deduction)
│ └── Use reasoning model with low/medium effort.
│ Claude: adaptive thinking or 5,000 budget_tokens
│ OpenAI: o3-mini effort=medium
│ Google: Gemini 2.5 Flash thinking_budget=5000
│
└── Hard (algorithm design, formal proof, frontier coding, security audit)
└── Use frontier reasoning model.
Claude: Opus 4.7 adaptive or Opus 4.6 budget_tokens=16000–32000
OpenAI: o3 effort=high
Google: Gemini 2.5 Pro thinking_budget=16000+
Cost-sensitive: DeepSeek R1 via API or local distilled model
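One way to encode that framework as a router, sketched below with the Claude model IDs this note assumes; the tiers mirror the tree above, and the defaults are illustrative rather than prescriptive.

```python
def choose_model(verifiable: bool, latency_sensitive: bool, complexity: str) -> dict:
    """Route a task to a model tier per the decision tree above (illustrative defaults)."""
    if not verifiable or latency_sensitive:
        return {"model": "claude-sonnet-4-6", "thinking": None}  # standard model, no thinking
    if complexity == "simple":
        return {"model": "claude-sonnet-4-6", "thinking": None}
    if complexity == "moderate":
        return {"model": "claude-sonnet-4-6",
                "thinking": {"type": "enabled", "budget_tokens": 5_000}}
    # "hard": frontier reasoning tier
    return {"model": "claude-opus-4-6",
            "thinking": {"type": "enabled", "budget_tokens": 16_000}}
```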
Cost-quality tier ordering (May 2026):
| Tier | Model | Best for | Cost (output/M) |
|---|---|---|---|
| Cheapest reasoning | o3-mini (medium) or Gemini 2.5 Flash thinking | Moderate tasks at scale | $4.40 / $0.60 |
| Balanced | Claude Sonnet 4.6 adaptive | Production default; near-Opus quality | $15 |
| Frontier (cost-efficient) | DeepSeek R1 API | Hard tasks, cost-sensitive | $2.19 |
| Frontier (quality) | Claude Opus 4.7 / o3 | Hardest tasks, accuracy first | $25 / $40 |
| Self-hosted | DeepSeek R1-Distill-70B | No API dependency, GPU available | Hardware only |
Common Production Mistakes
Enabling thinking on every call in a pipeline. Thinking is multiplicative. A 5-step pipeline with 10,000-token budgets per step burns 50,000 thinking tokens per request — at Claude Sonnet pricing, that is $0.75/pipeline call before any actual output. Reserve reasoning for the steps where it genuinely matters (the decision node), not the mechanical steps (format, extract, route).
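A sketch of that selectivity with made-up step names: only the decision node carries a thinking budget, the mechanical steps run without one.

```python
# Hypothetical 4-step pipeline: thinking only where the decision is actually made.
PIPELINE_STEPS = [
    {"name": "extract",  "thinking": None},                                         # mechanical
    {"name": "classify", "thinking": None},                                         # mechanical
    {"name": "decide",   "thinking": {"type": "enabled", "budget_tokens": 8_000}},  # decision node
    {"name": "format",   "thinking": None},                                         # mechanical
]
```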
Not accounting for thinking tokens in max_tokens. For DeepSeek R1, thinking traces appear inside <think> and count against max_tokens. For Claude legacy API, the max_tokens budget covers thinking + output — set it high enough. A common failure: max_tokens=512 with budget_tokens=5000 means the model runs out of tokens before producing the answer.
Using reasoning models for latency benchmarks. If you benchmark a reasoning model in your integration test, its latency profile (10–60s) will skew the p95/p99 numbers in ways that mask normal call latency. Keep reasoning model calls in a separate metric namespace.
Treating adaptive thinking as free. Adaptive thinking on Claude Opus 4.7 means Claude decides when to think. On a complex enough prompt, it will always think — and bill accordingly. Monitor token usage by day during initial rollout.
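A minimal way to watch for that during rollout, assuming the Anthropic Python client: log billed output tokens (which include thinking) per call and aggregate by day.

```python
import collections
import datetime

daily_output_tokens = collections.Counter()  # day -> billed output tokens (includes thinking)

def tracked_call(client, **kwargs):
    """Wrap client.messages.create and accumulate per-day output token usage."""
    response = client.messages.create(**kwargs)
    day = datetime.date.today().isoformat()
    daily_output_tokens[day] += response.usage.output_tokens
    return response
```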
Connections
- llms/claude — Claude model family; extended thinking availability per model tier
- llms/deepseek-r1 — DeepSeek R1 GRPO training, `<think>` tags, `max_tokens` footgun, distilled variants
- llms/model-families — o3, Gemini 2.5, Claude 4.x in competitive context
- apis/anthropic-api — `thinking` parameter, streaming implementation, prompt caching
- apis/openai-api — o3/o3-mini `reasoning_effort` parameter, pricing
- apis/google-ai — Gemini 2.5 Pro/Flash `thinkingConfig`, context limits
- prompting/techniques — do not combine explicit CoT prompting with extended thinking
- evals/benchmarks — AIME, SWE-bench, GPQA Diamond — the benchmarks used to compare reasoning models
- synthesis/cost-optimisation — prompt caching, model routing, and the 60-90% cost reduction playbook
- fine-tuning/dpo-grpo — GRPO training objective that produced DeepSeek R1's reasoning capability