Inference Serving
Production LLM inference is memory-bandwidth-bound, not compute-bound — vLLM solves this with paged attention (2-4x throughput over naive serving) and continuous batching; llama.cpp handles quantised local inference.
Running LLMs in production. The key challenge: transformers are memory-bandwidth-bound at inference time, and the KV cache grows linearly with sequence length and batch size. Production serving requires careful memory management to maximise throughput.
The Bottleneck: Memory, Not Compute
At inference time (after training), the GPU is not compute-limited. It's memory-bandwidth-limited. Moving weights from HBM (GPU memory) to CUDA cores is the bottleneck. Making the GPU do more computation per memory fetch (batch processing) is the key to throughput.
Single-request serving: the GPU is ~10% utilised because it's waiting on memory fetches. Wasteful. Batched serving: serving N requests simultaneously reuses the same weight fetches to do N times the work. Batching is everything for throughput.
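A back-of-envelope calculation makes the bandwidth argument concrete. The numbers below (70B bf16 weights, ~3.35 TB/s HBM bandwidth on an H100 SXM) are illustrative assumptions, not benchmarks:

```python
# Back-of-envelope: decode speed when bound by weight reads, not FLOPs.
# Illustrative numbers only, not measured results.

model_params = 70e9        # a Llama-3-70B class model
bytes_per_param = 2        # bf16 weights
hbm_bandwidth = 3.35e12    # H100 SXM HBM3, ~3.35 TB/s

bytes_per_token = model_params * bytes_per_param  # every weight read once per token
t_per_token = bytes_per_token / hbm_bandwidth     # seconds, ignoring KV cache reads

print(f"batch=1:  ~{1 / t_per_token:.0f} tokens/sec ceiling")   # ~24 tok/s
# With batch=32 the same weight reads serve 32 sequences, so the aggregate
# token throughput ceiling scales ~32x, until compute or KV cache reads
# become the new limit.
print(f"batch=32: ~{32 / t_per_token:.0f} tokens/sec ceiling")
```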
vLLM
The standard open-source inference serving framework. Written in Python + CUDA.
Key innovations:
Paged Attention
The KV cache in standard transformers requires contiguous pre-allocated memory. This causes fragmentation and wastes 60–80% of KV cache memory in naive serving systems.
vLLM uses a paged virtual memory scheme (like OS virtual memory) for the KV cache:
- Divide KV cache into fixed-size "pages" (blocks)
- Allocate pages dynamically as sequences grow
- Share pages across requests that have common prefixes (prefix caching)
Result: 2–4x higher throughput than naive serving, near-zero wasted memory.
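A minimal sketch of the bookkeeping this implies, assuming a toy block size and ignoring prefix sharing and eviction; it is not vLLM's actual allocator:

```python
# Toy block-table allocator illustrating the idea behind a paged KV cache.
# Not vLLM internals; block size and structures are illustrative.

BLOCK_SIZE = 16  # tokens per KV block

class KVBlockAllocator:
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))           # physical block ids
        self.block_tables: dict[str, list[int]] = {}  # request id -> block ids

    def append_token(self, request_id: str, position: int) -> tuple[int, int]:
        """Return (physical_block, offset) where this token's KV entry lives."""
        table = self.block_tables.setdefault(request_id, [])
        block_idx, offset = divmod(position, BLOCK_SIZE)
        if block_idx == len(table):           # sequence grew past its last block
            table.append(self.free.pop())     # allocate on demand, no contiguity needed
        return table[block_idx], offset

    def release(self, request_id: str):
        self.free.extend(self.block_tables.pop(request_id, []))

alloc = KVBlockAllocator(num_blocks=1024)
for pos in range(40):                 # a 40-token sequence uses 3 blocks, not a
    alloc.append_token("req-1", pos)  # contiguous max-length pre-allocation
print(alloc.block_tables["req-1"])
```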
Continuous Batching
Standard (static) batching waits for a full batch and for every sequence in it to finish before admitting new work. Continuous batching processes requests as they arrive and retires them as they finish: new requests join the in-flight batch at each decode iteration.
Combined with paged attention: optimal GPU utilisation.
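A toy scheduler loop illustrating iteration-level scheduling; all names and structures here are hypothetical stand-ins, not vLLM internals:

```python
# Toy iteration-level scheduler: requests join and leave the batch per step.
# Real engines also manage KV blocks, preemption and prefill scheduling.
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt: str
    max_new_tokens: int
    generated: list[str] = field(default_factory=list)

waiting: deque[Request] = deque()
running: list[Request] = []
MAX_BATCH = 32

def decode_step(batch: list[Request]) -> list[str]:
    # Stand-in for one forward pass that produces one token per sequence.
    return ["tok" for _ in batch]

def step():
    # 1. Admit new requests into the in-flight batch (no waiting for a "full" batch).
    while waiting and len(running) < MAX_BATCH:
        running.append(waiting.popleft())
    if not running:
        return
    # 2. One decode iteration for everything currently running.
    for req, tok in zip(running, decode_step(running)):
        req.generated.append(tok)
    # 3. Retire finished sequences immediately; their slots free up next step.
    running[:] = [r for r in running if len(r.generated) < r.max_new_tokens]
```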
Tensor Parallelism
Distribute a single model across multiple GPUs by splitting weight matrices along the tensor dimension. Required for 70B+ models on standard GPU hardware.
```bash
vllm serve meta-llama/Meta-Llama-3-70B-Instruct \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.95
```
OpenAI-compatible API: vLLM exposes an endpoint compatible with the OpenAI API format — drop-in replacement for any OpenAI API client.
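For example, the standard openai Python client can be pointed at a local vLLM server; the base URL below assumes vLLM's default port and path, so adjust it to your deployment:

```python
# Pointing the standard OpenAI client at a local vLLM server.
# Assumes vLLM's default endpoint (http://localhost:8000/v1).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

resp = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-70B-Instruct",  # must match the served model name
    messages=[{"role": "user", "content": "Summarise paged attention in one sentence."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```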
llama.cpp
CPU and consumer GPU inference. The standard for running quantised models locally.
GGUF format — llama.cpp's quantised model format. Supports Q4_0, Q4_K_M, Q5_K_M, Q8_0, fp16, and others. The most widely used format for local inference.
```bash
./llama-cli -m llama-3-8b-q4_k_m.gguf -p "Hello, who are you?" -n 256
```
Typical performance (M3 Max, 16-core GPU):
- Llama 3 8B Q4_K_M: ~50 tokens/sec
- Llama 3 70B Q4_K_M: ~10 tokens/sec (requires enough unified memory)
Python binding (llama-cpp-python):
```python
from llama_cpp import Llama

# Load a GGUF model; n_gpu_layers=-1 offloads all layers to the GPU (Metal/CUDA)
llm = Llama(model_path="llama-3-8b-q4_k_m.gguf", n_gpu_layers=-1)
output = llm("Hello world", max_tokens=50)
```
TensorRT-LLM
NVIDIA's optimised inference library. Maximum throughput on NVIDIA GPUs.
- Custom CUDA kernels optimised for transformer attention
- In-flight batching
- FP8 / INT8 / INT4 quantisation with calibration
- Model parallelism
Best for production deployments on owned/leased NVIDIA hardware. Higher setup cost than vLLM but 20–40% better throughput.
Triton Inference Server
NVIDIA's model serving platform. Wraps TensorRT-LLM (and other backends) with:
- gRPC + REST API
- Dynamic batching
- Model ensemble support (chain models together)
- Prometheus metrics
Enterprise-grade, complex to configure. Use vLLM unless you need the enterprise features.
Quantisation for Inference
See math/transformer-math for the numbers. Practical guide:
| Use case | Recommendation |
|---|---|
| Local, CPU | GGUF Q4_K_M (best quality/size balance) |
| Local, consumer GPU | GGUF Q5_K_M or fp16 if VRAM allows |
| Cloud serving, quality-first | bf16 or fp8 (H100) |
| Cloud serving, cost-first | int4/GPTQ with calibration |
| Development / API | Don't quantise — use API |
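A rough way to sanity-check these choices is weight memory per quantisation level. The effective bits-per-weight figures below are approximations (GGUF K-quants mix bit widths), and KV cache plus runtime overhead come on top:

```python
# Rough weight-memory estimates per quantisation level (weights only).
# Bits-per-weight values are approximate effective rates, not exact.

def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for name, bits in [("fp16/bf16", 16), ("Q8_0", 8.5), ("Q5_K_M", 5.7), ("Q4_K_M", 4.8)]:
    print(f"Llama 3 8B  {name:9s} ~{weight_gb(8, bits):5.1f} GB")
    print(f"Llama 3 70B {name:9s} ~{weight_gb(70, bits):5.1f} GB")
```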
Speculative Decoding
Technique to accelerate autoregressive generation without quality loss:
- A small "draft" model generates N tokens quickly
- The large "target" model verifies all N tokens in a single forward pass
- Accept all tokens the target model agrees with; reject and regenerate from the first mismatch
Result: 2–3x throughput improvement when the draft model frequently agrees with the target. Draft and target must share the same tokeniser.
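The accept/verify loop, sketched with stub draft and target models for illustration. Real implementations verify all draft positions in one batched forward pass and use rejection sampling so the output distribution matches the target model exactly; this greedy version only shows the control flow:

```python
# Toy speculative decoding loop with greedy acceptance. The stub "models"
# stand in for real draft/target networks.

def draft_next(ctx: list[str]) -> str:   # small, fast model (stub)
    return "the" if len(ctx) % 3 else "cat"

def target_next(ctx: list[str]) -> str:  # large, accurate model (stub)
    return "the" if len(ctx) % 3 else "sat"

def speculative_step(ctx: list[str], n_draft: int = 4) -> list[str]:
    # 1. Draft model proposes n_draft tokens autoregressively (cheap).
    proposals = []
    for _ in range(n_draft):
        proposals.append(draft_next(ctx + proposals))
    # 2. Target model checks the proposals (in practice: one batched forward
    #    pass over all positions; simulated token by token here for clarity).
    accepted = []
    for tok in proposals:
        if target_next(ctx + accepted) == tok:
            accepted.append(tok)  # agreement: keep the draft token for free
        else:
            accepted.append(target_next(ctx + accepted))  # first mismatch: take the
            break                                         # target's token and stop
    return accepted

ctx = ["a"]
ctx += speculative_step(ctx)
print(ctx)
```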
Managed Inference Options
| Provider | Best for | Models |
|---|---|---|
| Anthropic API | Claude models (only option) | Claude 4.x family |
| OpenAI API | GPT family | GPT-4o, o3, etc |
| Together AI | Open models, fast | Llama, Mixtral, Qwen |
| Fireworks AI | Low latency, function calling | Llama, Firefunction |
| Replicate | Diverse models, pay-per-use | 1,000+ models |
| Modal | Serverless GPU, custom models | Any model |
| RunPod | Reserved GPU, cheap | Any model |
Key Facts
- Single-request serving: GPU is ~10% utilised; batching is the primary throughput lever
- vLLM paged attention: 2-4x throughput over naive serving, near-zero wasted KV cache memory
- vLLM exposes OpenAI-compatible API — drop-in replacement for any OpenAI client
- llama.cpp GGUF Q4_K_M on M3 Max: Llama 3 8B at ~50 tok/s, 70B at ~10 tok/s
- Speculative decoding: 2-3x throughput improvement; draft and target must share same tokeniser
- TensorRT-LLM: 20-40% better throughput than vLLM; higher setup cost
- Local inference recommendation: GGUF Q4_K_M for CPU; Q5_K_M or fp16 for consumer GPU
Common Failure Cases
vLLM OOMs at startup before serving any requests
Why: gpu_memory_utilization=0.90 (the default) gives vLLM a budget of 90% of total VRAM for weights, activations, and the pre-allocated KV cache; if the weights and activation workspace already exceed that budget, or leave no room for the KV cache, startup fails.
Detect: CUDA out of memory in vLLM startup logs before any requests are received.
Fix: check model size vs GPU VRAM first; lower gpu_memory_utilization (e.g. to 0.80 or 0.75) if other processes share the GPU; reduce max_model_len to shrink the KV cache pre-allocation; or use tensor_parallel_size to split the model across multiple GPUs.
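As a sketch of where these knobs live in vLLM's offline Python API (values are illustrative; defaults vary by vLLM version):

```python
# Illustrative settings for the OOM-at-startup case, via vLLM's offline LLM API.
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    tensor_parallel_size=4,       # split weights across 4 GPUs
    gpu_memory_utilization=0.80,  # leave headroom if other processes share the GPU
    max_model_len=8192,           # caps per-sequence KV cache, shrinking the pre-allocation
)
```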
llama.cpp Q4_K_M GGUF produces noticeably worse quality than BF16 on coding tasks
Why: 4-bit quantisation discards weight precision; coding and instruction-following tasks are sensitive to the resulting shifts in token probabilities.
Detect: pass@1 accuracy on HumanEval drops >5% vs the BF16 model; output contains more hallucinated API calls.
Fix: use Q5_K_M or Q6_K for coding-focused deployments where quality matters; accept the 30% larger model size.
vLLM speculative decoding draft model produces high rejection rate, slowing throughput
Why: if the draft and target models are too different in capability or the target temperature is high, the acceptance rate drops below 50%, making speculative decoding slower than without it.
Detect: tokens/second with speculative decoding enabled is lower than without; acceptance_rate metric in vLLM < 0.6.
Fix: use a draft model that is a smaller version of the same model family; disable speculative decoding for high-temperature creative generation.
Continuous batching stalls when one request has a very long output
Why: a very long generation holds a growing share of the KV cache blocks; when free blocks run out, new requests sit in the queue (and running sequences may be preempted), so short requests end up waiting behind the long one.
Detect: tail latency increases when one user is generating a very long response; other short requests are queued behind it.
Fix: set max_model_len to cap KV cache per sequence; use streaming and timeout long generations at the application layer.
TensorRT-LLM engine built for one GPU type fails on another
Why: TRT engines are compiled for a specific GPU architecture; an engine compiled for A100 cannot run on H100 (different compute capability, different optimisations).
Detect: RuntimeError: Engine was compiled for CUDA compute capability 8.0 but current device is 9.0.
Fix: rebuild the engine for each GPU type; maintain separate engine binaries per GPU architecture in CI.
Connections
- infra/vector-stores — vector stores frequently collocated with inference serving in RAG systems
- infra/huggingface — model hub provides the checkpoints that vLLM and llama.cpp load
- infra/gpu-hardware — GPU selection determines which serving approach is viable
- infra/ai-gateway — API gateway layer that sits in front of inference serving for routing, rate limiting, and cost control
- infra/inference-platforms — serverless and managed inference platforms (Replicate, Modal, Hugging Face Endpoints, AWS SageMaker)
- cloud/serverless-patterns — serverless inference as an alternative to always-on GPU servers
- math/transformer-math — KV cache memory calculations underlie paged attention design
- fine-tuning/lora-qlora — quantisation affects inference serving choices
Open Questions
- How does vLLM prefix caching compare to Anthropic prompt caching for repeated system prompts?
- When does TensorRT-LLM's throughput advantage justify the higher setup complexity over vLLM?
- What is the realistic speculative decoding acceptance rate for general-purpose assistants vs coding tasks?
Related reading