LLM Decision Guide

Opinionated decision tables for every major AI engineering choice — model (Sonnet 4.6 default), embedding (Cohere 65.2 MTEB), vector store (pgvector if on Postgres), agent framework (LangGraph for stateful Python), observability (Langfuse self-hosted), fine-tuning (Axolotl), and the prompting → RAG → fine-tune → agents escalation order.

Which model, which approach, and which architecture to use, for every major decision an AI engineer faces. The answers here are opinionated defaults; adjust them to your constraints.


Which Model Should I Use?

Proprietary Models

Default choice for most production work: Claude Sonnet 4.6

  • 79.6% SWE-bench, $3/$15 per M tokens
  • 1M context, extended thinking available
  • Best instruction following as of April 2026

When you need maximum quality: Claude Opus 4.7

  • Hardest coding problems, complex multi-step reasoning, research synthesis
  • 5x more expensive than Sonnet — justify the cost with the task complexity

High-volume, cost-sensitive: Claude Haiku 4.5 or GPT-4o-mini

  • Classification, routing, extraction, summarisation
  • 3-10x cheaper than Sonnet-tier

Hard reasoning (math, logic, proofs): Claude Opus 4.7 or o3

  • o3 for pure reasoning (math/logic); Claude Opus for reasoning + coding

OpenAI ecosystem locked in: GPT-4o as the workhorse, o3 for hard problems

Open Source / Self-Hosted

Best quality, self-hosted: Llama 3.1 70B or DeepSeek V3
Best reasoning, self-hosted: DeepSeek R1 Distill 32B or QwQ-32B
Smallest capable model: Phi-4 14B
Best code generation: Qwen 2.5-Coder 32B


Prompting vs RAG vs Fine-Tuning vs Agents?

Does it work with a good prompt and few-shot examples?
  YES → Ship it. Prompting is free.
  NO  ↓

Is the problem about missing recent/external knowledge?
  YES → Add RAG. Build retrieval pipeline first.
  NO  ↓

Is the problem about inconsistent behaviour / wrong format / wrong tone?
  YES → Fine-tune. LoRA on 500-2K examples.
  NO  ↓

Is the problem about multi-step task execution requiring decisions + tool use?
  YES → Build an agent (ReAct + tools; see the sketch below).
  NO  ↓

If none of these fit, you have a harder problem: escalate to a better model or rethink the task decomposition.
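
Illustrating the final rung: a minimal sketch of a bare ReAct-style tool loop over the Anthropic Messages API. The model id is a placeholder and `run_tool` is a hypothetical dispatcher; swap in your own tools.

```python
import anthropic

client = anthropic.Anthropic()

tools = [{
    "name": "search_docs",
    "description": "Search internal documentation.",
    "input_schema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}]

def run_tool(name: str, args: dict) -> str:
    return "(tool result)"  # stub: dispatch to your real tools here

messages = [{"role": "user", "content": "Find our refund policy and summarise it."}]
while True:
    resp = client.messages.create(
        model="claude-sonnet-4-6",  # placeholder model id
        max_tokens=1024,
        tools=tools,
        messages=messages,
    )
    if resp.stop_reason != "tool_use":
        break  # model produced a final answer instead of a tool call
    # Echo the assistant turn back, then answer each tool call with a result.
    messages.append({"role": "assistant", "content": resp.content})
    messages.append({"role": "user", "content": [
        {"type": "tool_result", "tool_use_id": b.id,
         "content": run_tool(b.name, b.input)}
        for b in resp.content if b.type == "tool_use"
    ]})

print(resp.content[0].text)
```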

Which Embedding Model?

Need                      Model                     Dim    MTEB
Best quality (managed)    Cohere embed-v4           1024   65.2
OpenAI ecosystem          text-embedding-3-large    3072   64.6
Best open-source          BGE-M3                    1024   63.0
Cheap + fast              text-embedding-3-small    1536   62.3
On-prem multilingual      BGE-M3                    1024   63.0

Add Cohere Rerank on top of any of these for +10-25% retrieval quality.
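
A hedged sketch of the embed-then-rerank pattern with the Cohere Python SDK; the exact model ids (`embed-v4.0`, `rerank-v3.5`) vary by release, so check Cohere's current docs. Assumes `CO_API_KEY` is set in the environment.

```python
import cohere

co = cohere.Client()  # reads CO_API_KEY from the environment

docs = ["Refunds are issued within 14 days.",
        "Shipping takes 3-5 business days."]
query = "How long do refunds take?"

# Dense embeddings for first-stage retrieval (store these in your vector DB).
emb = co.embed(texts=docs, model="embed-v4.0", input_type="search_document")

# Rerank the candidates the vector search returned.
reranked = co.rerank(query=query, documents=docs, model="rerank-v3.5", top_n=1)
best = docs[reranked.results[0].index]
```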


Which Vector Store?

Situation                              Choice
Already on Postgres                    pgvector
Need hybrid (BM25 + dense) natively    Weaviate or Qdrant
Managed, no ops team                   Pinecone Serverless
Production Rust performance            Qdrant
Agent session memory (small)           Redis (simple)
Everything in one store                Qdrant (sparse + dense + payload filtering)
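
If you take the pgvector default, retrieval is just a SQL query. A minimal sketch with psycopg and the `pgvector` Python package; the table name, dimensions, and zero-vector query are illustrative.

```python
import numpy as np
import psycopg
from pgvector.psycopg import register_vector

conn = psycopg.connect("dbname=app", autocommit=True)
conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
register_vector(conn)  # teaches psycopg to adapt numpy arrays to `vector`

conn.execute(
    "CREATE TABLE IF NOT EXISTS chunks "
    "(id bigserial PRIMARY KEY, body text, embedding vector(1024))"
)

query_embedding = np.zeros(1024)  # stub: use your embedding model's output
rows = conn.execute(
    "SELECT body FROM chunks ORDER BY embedding <=> %s LIMIT 5",  # <=> is cosine distance
    (query_embedding,),
).fetchall()
```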

Which Agent Framework?

Need                                           Framework
Complex stateful workflows, production Python  LangGraph
Simple ReAct loop, quick prototype             LangChain LCEL or bare API loop
Java Spring Boot                               Spring AI or LangChain4j
Java standalone                                LangChain4j
OpenAI-first, lightweight                      OpenAI Assistants API
No framework, full control                     Bare API calls in a while loop

For agentic RAG: LangGraph with a retrieval node. For simple Q&A RAG: LlamaIndex or LangChain.
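
A minimal agentic-RAG sketch in LangGraph: one retrieval node feeding one generation node. `search_store` and `call_llm` are hypothetical stubs standing in for your vector store and model client.

```python
from typing import TypedDict
from langgraph.graph import StateGraph, END

class RAGState(TypedDict):
    question: str
    docs: list[str]
    answer: str

def search_store(question: str) -> list[str]:
    return ["(retrieved chunk)"]  # stub: replace with a vector-store query

def call_llm(question: str, docs: list[str]) -> str:
    return "(model answer)"       # stub: replace with a model API call

def retrieve(state: RAGState) -> dict:
    return {"docs": search_store(state["question"])}

def generate(state: RAGState) -> dict:
    return {"answer": call_llm(state["question"], state["docs"])}

builder = StateGraph(RAGState)
builder.add_node("retrieve", retrieve)
builder.add_node("generate", generate)
builder.set_entry_point("retrieve")
builder.add_edge("retrieve", "generate")
builder.add_edge("generate", END)
graph = builder.compile()

print(graph.invoke({"question": "What changed in the Q3 contract?"})["answer"])
```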


Which Observability Platform?

Situation                               Choice
Open source, self-host                  Langfuse (MIT, Docker Compose)
LangChain-heavy                         LangSmith
OpenAI-heavy                            LangSmith or Langfuse
Unified ML + LLM                        Arize Phoenix
Enterprise, existing Datadog/Grafana    OTel → existing stack
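
With Langfuse, per-call tracing can be as small as one decorator. A hedged sketch using the v2-era `langfuse.decorators` import (newer SDK releases expose `observe` from the top-level package); assumes the `LANGFUSE_*` keys are set in the environment.

```python
from langfuse.decorators import observe

def call_model(question: str) -> str:
    return "(model answer)"  # stub: replace with your LLM client call

@observe()  # records inputs, outputs, timing, and nesting as a trace
def answer_question(question: str) -> str:
    return call_model(question)
```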

Which Inference Serving?

Need                             Choice
Production API, max throughput   vLLM
Local dev, any OS                llama.cpp
Mac local                        llama.cpp or Ollama
Managed, no ops                  Together AI, Fireworks, Modal
OpenAI-compatible, serverless    Together / Fireworks
Self-hosted, enterprise          vLLM on Kubernetes
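
A minimal vLLM offline-batch sketch; the model id is illustrative, and `tensor_parallel_size` should match your GPU count.

```python
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-Coder-32B-Instruct", tensor_parallel_size=4)
params = SamplingParams(temperature=0.2, max_tokens=256)

outputs = llm.generate(["Write a SQL query that lists overdue invoices."], params)
print(outputs[0].outputs[0].text)
```

For a production API, `vllm serve <model>` exposes an OpenAI-compatible endpoint instead of the offline engine.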

Which Fine-Tuning Framework?

Need                              Framework
Most objectives, easiest config   Axolotl (YAML-driven)
Fastest single-GPU                Unsloth (2-4x faster)
DPO/GRPO specifically             TRL
Maximum control, PyTorch native   PEFT + Trainer

Start with Axolotl. Switch to Unsloth if speed is the bottleneck.
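
For the PEFT + TRL route, a LoRA SFT run on the 500-2K-example scale mentioned earlier can be this small. A hedged sketch assuming a recent TRL release; the model id, data file, and hyperparameters are placeholders.

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Expects records TRL's SFT format understands, e.g. {"text": "..."} lines.
train = load_dataset("json", data_files="train.jsonl", split="train")

trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B-Instruct",
    train_dataset=train,
    peft_config=LoraConfig(r=16, lora_alpha=32, target_modules="all-linear"),
    args=SFTConfig(output_dir="lora-out", num_train_epochs=3),
)
trainer.train()
```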


How Much Context to Use?

Task needs specific information from a known source?
  → Retrieve it (RAG). Don't stuff the full corpus into context.

Task needs to reason over a full long document (e.g. contract review)?
  → Use long context (100K+). Claude/Gemini are good at this.

Task is ongoing chat with history?
  → Sliding window last 10 turns + summary of older turns.

Repeating the same large prefix (system prompt, docs) across many calls?
  → Use prompt caching. Saves 90% on cached tokens.
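
A hedged sketch of prompt caching with the Anthropic SDK: mark the large static prefix with `cache_control` so repeat calls read it from cache. The model id is a placeholder and `LARGE_STATIC_DOCS` stands in for the ~800K-token static prefix discussed below.

```python
import anthropic

client = anthropic.Anthropic()
LARGE_STATIC_DOCS = "..."  # the same reference docs sent on every call

resp = client.messages.create(
    model="claude-sonnet-4-6",  # placeholder model id
    max_tokens=1024,
    system=[
        {"type": "text", "text": "You answer strictly from the provided docs."},
        {"type": "text", "text": LARGE_STATIC_DOCS,
         "cache_control": {"type": "ephemeral"}},  # cache breakpoint
    ],
    messages=[{"role": "user", "content": "What is the refund window?"}],
)
```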

What Does 1M Context Actually Cost?

At Claude Sonnet 4.6 ($3/M input):

  • 1M tokens = $3 per call
  • 100 calls/day at 1M context = $300/day = ~$9,000/month

Prompt caching changes this dramatically. If 800K of those tokens are static (same docs every call):

  • First call (cache write): 800K × 1.25× + 200K × 1× = 1.2M effective tokens = $3.60
  • Subsequent calls (cache hits): 800K × 0.1× + 200K × 1× = 280K effective tokens = $0.84
  • vs uncached: $3.00 per call, so caching pays for itself from the second call onward

Cache aggressively. It's the single highest-leverage cost optimisation.


Security Checklist for LLM Features

Before shipping any LLM feature:

  • Input validation — length limits, injection-suspicious pattern detection
  • Output validation — structured outputs parsed/validated, not eval'd
  • Tool permissions — principle of least privilege for every tool
  • Context isolation — no user's data leaks into another user's context
  • Rate limiting — per-user token budgets
  • System prompt hardening — explicit "do not reveal system prompt" instruction
  • Red team — run 50+ adversarial prompts before launch
  • Logging — full input/output trace for every call (for incident response)
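
For the output-validation item, parse model output into a schema rather than trusting it. A minimal Pydantic sketch; the `Ticket` fields are illustrative.

```python
from pydantic import BaseModel, ValidationError

class Ticket(BaseModel):
    category: str
    priority: int

def parse_ticket(raw: str) -> Ticket | None:
    try:
        return Ticket.model_validate_json(raw)  # strict parse, never eval
    except ValidationError:
        return None  # reject and retry or route to a fallback
```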

Key Facts

  • Default proprietary model: Claude Sonnet 4.6 (79.6% SWE-bench, $3/$15 per M, 1M context)
  • Maximum quality: Claude Opus 4.7 — ~5x Sonnet cost; justify with task complexity
  • High-volume cost-sensitive: Claude Haiku 4.5 or GPT-4o-mini — 3-10x cheaper than Sonnet-tier
  • Best open-source quality: Llama 3.1 70B or DeepSeek V3; best open-source reasoning: DeepSeek R1 Distill 32B
  • Escalation order: prompting → RAG → fine-tuning → agents; try each before moving to the next
  • Embedding ranking: Cohere embed-v4 (65.2) > OpenAI 3-large (64.6) > BGE-M3 (63.0)
  • Add Cohere Rerank on top of any embedding model for +10-25% retrieval quality
  • Vector store default: pgvector if already on Postgres; Qdrant for production Rust performance
  • Agent framework default: LangGraph for complex stateful Python workflows
  • Fine-tuning framework default: Axolotl (YAML-driven, widest objective coverage); switch to Unsloth if speed bottlenecks
  • 1M context at Sonnet uncached: $3/call; with 800K cached tokens: $3.60 on the cache-writing first call, then ~$0.84/call on cache hits

Open Questions

  • Does the "Sonnet as default" recommendation hold as Opus and Haiku pricing evolve, or will the tiers shift?
  • Is the LangGraph recommendation still appropriate for teams not already invested in the LangChain ecosystem?
  • At what scale does "bare API calls in a while loop" stop being sufficient and framework adoption become necessary?