Inference Serving
Production LLM inference is memory-bandwidth-bound, not compute-bound — vLLM solves this with paged attention (2-4x throughput over naive serving) and continuous batching; llama.cpp handles quantised local inference.
Running LLMs in production. The key challenge: transformers are memory-bandwidth-bound at inference time, and the KV cache grows linearly with sequence length and batch size. Production serving requires careful memory management to maximise throughput.
The Bottleneck: Memory, Not Compute
At inference time (after training), the GPU is not compute-limited. It's memory-bandwidth-limited. Moving weights from HBM (GPU memory) to CUDA cores is the bottleneck. Making the GPU do more computation per memory fetch (batch processing) is the key to throughput.
Single-request serving: the GPU is ~10% utilised because it's waiting on memory fetches. Wasteful. Batched serving: serving N requests simultaneously reuses the same weight fetches to do N times the work. Batching is everything for throughput.
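A back-of-envelope calculation makes the bandwidth argument concrete. The numbers below (70B bf16 weights, ~3.35 TB/s HBM bandwidth on an H100 SXM) are illustrative assumptions, not benchmarks:

```python
# Back-of-envelope: decode speed when bound by weight reads, not FLOPs.
# Illustrative numbers only, not measured results.

model_params = 70e9        # a Llama-3-70B class model
bytes_per_param = 2        # bf16 weights
hbm_bandwidth = 3.35e12    # H100 SXM HBM3, ~3.35 TB/s

bytes_per_token = model_params * bytes_per_param  # every weight read once per token
t_per_token = bytes_per_token / hbm_bandwidth     # seconds, ignoring KV cache reads

print(f"batch=1:  ~{1 / t_per_token:.0f} tokens/sec ceiling")   # ~24 tok/s
# With batch=32 the same weight reads serve 32 sequences, so the aggregate
# token throughput ceiling scales ~32x, until compute or KV cache reads
# become the new limit.
print(f"batch=32: ~{32 / t_per_token:.0f} tokens/sec ceiling")
```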
vLLM
The standard open-source inference serving framework. Written in Python + CUDA.
Key innovations:
Paged Attention
The KV cache in standard transformers requires contiguous pre-allocated memory. This causes fragmentation and wastes 60–80% of KV cache memory in naive serving systems.
vLLM uses a paged virtual memory scheme (like OS virtual memory) for the KV cache:
- Divide KV cache into fixed-size "pages" (blocks)
- Allocate pages dynamically as sequences grow
- Share pages across requests that have common prefixes (prefix caching)
Result: 2–4x higher throughput than naive serving, near-zero wasted memory.
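A minimal sketch of the bookkeeping this implies, assuming a toy block size and ignoring prefix sharing and eviction; it is not vLLM's actual allocator:

```python
# Toy block-table allocator illustrating the idea behind a paged KV cache.
# Not vLLM internals; block size and structures are illustrative.

BLOCK_SIZE = 16  # tokens per KV block

class KVBlockAllocator:
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))           # physical block ids
        self.block_tables: dict[str, list[int]] = {}  # request id -> block ids

    def append_token(self, request_id: str, position: int) -> tuple[int, int]:
        """Return (physical_block, offset) where this token's KV entry lives."""
        table = self.block_tables.setdefault(request_id, [])
        block_idx, offset = divmod(position, BLOCK_SIZE)
        if block_idx == len(table):           # sequence grew past its last block
            table.append(self.free.pop())     # allocate on demand, no contiguity needed
        return table[block_idx], offset

    def release(self, request_id: str):
        self.free.extend(self.block_tables.pop(request_id, []))

alloc = KVBlockAllocator(num_blocks=1024)
for pos in range(40):                 # a 40-token sequence uses 3 blocks, not a
    alloc.append_token("req-1", pos)  # contiguous max-length pre-allocation
print(alloc.block_tables["req-1"])
```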
Continuous Batching
Standard (static) batching waits for a full batch and for every sequence in it to finish before admitting new work. Continuous batching processes requests as they arrive and retires them as they finish: new requests join the in-flight batch at each decode iteration.
Combined with paged attention: optimal GPU utilisation.
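A toy scheduler loop illustrating iteration-level scheduling; all names and structures here are hypothetical stand-ins, not vLLM internals:

```python
# Toy iteration-level scheduler: requests join and leave the batch per step.
# Real engines also manage KV blocks, preemption and prefill scheduling.
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt: str
    max_new_tokens: int
    generated: list[str] = field(default_factory=list)

waiting: deque[Request] = deque()
running: list[Request] = []
MAX_BATCH = 32

def decode_step(batch: list[Request]) -> list[str]:
    # Stand-in for one forward pass that produces one token per sequence.
    return ["tok" for _ in batch]

def step():
    # 1. Admit new requests into the in-flight batch (no waiting for a "full" batch).
    while waiting and len(running) < MAX_BATCH:
        running.append(waiting.popleft())
    if not running:
        return
    # 2. One decode iteration for everything currently running.
    for req, tok in zip(running, decode_step(running)):
        req.generated.append(tok)
    # 3. Retire finished sequences immediately; their slots free up next step.
    running[:] = [r for r in running if len(r.generated) < r.max_new_tokens]
```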
Tensor Parallelism
Distribute a single model across multiple GPUs by splitting weight matrices along the tensor dimension. Required for 70B+ models on standard GPU hardware.
```bash
vllm serve meta-llama/Meta-Llama-3-70B-Instruct \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.95
```
OpenAI-compatible API: vLLM exposes an endpoint compatible with the OpenAI API format — drop-in replacement for any OpenAI API client.
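For example, the standard openai Python client can be pointed at a local vLLM server; the base URL below assumes vLLM's default port and path, so adjust it to your deployment:

```python
# Pointing the standard OpenAI client at a local vLLM server.
# Assumes vLLM's default endpoint (http://localhost:8000/v1).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

resp = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-70B-Instruct",  # must match the served model name
    messages=[{"role": "user", "content": "Summarise paged attention in one sentence."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```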
llama.cpp
CPU and consumer GPU inference. The standard for running quantised models locally.
GGUF format — llama.cpp's quantised model format. Supports Q4_0, Q4_K_M, Q5_K_M, Q8_0, fp16, and others. The most widely used format for local inference.
```bash
./llama-cli -m llama-3-8b-q4_k_m.gguf -p "Hello, who are you?" -n 256
```
Typical performance (M3 Max, 16-core GPU):
- Llama 3 8B Q4_K_M: ~50 tokens/sec
- Llama 3 70B Q4_K_M: ~10 tokens/sec (requires enough unified memory)
Python binding (llama-cpp-python):
```python
from llama_cpp import Llama

# Load a GGUF model; n_gpu_layers=-1 offloads all layers to the GPU (Metal/CUDA)
llm = Llama(model_path="llama-3-8b-q4_k_m.gguf", n_gpu_layers=-1)
output = llm("Hello world", max_tokens=50)
```
TensorRT-LLM
NVIDIA's optimised inference library. Maximum throughput on NVIDIA GPUs.
- Custom CUDA kernels optimised for transformer attention
- In-flight batching
- FP8 / INT8 / INT4 quantisation with calibration
- Model parallelism
Best for production deployments on owned/leased NVIDIA hardware. Higher setup cost than vLLM but 20–40% better throughput.
Triton Inference Server
NVIDIA's model serving platform. Wraps TensorRT-LLM (and other backends) with:
- gRPC + REST API
- Dynamic batching
- Model ensemble support (chain models together)
- Prometheus metrics
Enterprise-grade, complex to configure. Use vLLM unless you need the enterprise features.
Quantisation for Inference
See math/transformer-math for the numbers. Practical guide:
| Use case | Recommendation |
|---|---|
| Local, CPU | GGUF Q4_K_M (best quality/size balance) |
| Local, consumer GPU | GGUF Q5_K_M or fp16 if VRAM allows |
| Cloud serving, quality-first | bf16 or fp8 (H100) |
| Cloud serving, cost-first | int4/GPTQ with calibration |
| Development / API | Don't quantise — use API |
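A rough way to sanity-check these choices is weight memory per quantisation level. The effective bits-per-weight figures below are approximations (GGUF K-quants mix bit widths), and KV cache plus runtime overhead come on top:

```python
# Rough weight-memory estimates per quantisation level (weights only).
# Bits-per-weight values are approximate effective rates, not exact.

def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for name, bits in [("fp16/bf16", 16), ("Q8_0", 8.5), ("Q5_K_M", 5.7), ("Q4_K_M", 4.8)]:
    print(f"Llama 3 8B  {name:9s} ~{weight_gb(8, bits):5.1f} GB")
    print(f"Llama 3 70B {name:9s} ~{weight_gb(70, bits):5.1f} GB")
```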
Speculative Decoding
Technique to accelerate autoregressive generation without quality loss:
- A small "draft" model generates N tokens quickly
- The large "target" model verifies all N tokens in a single forward pass
- Accept all tokens the target model agrees with; reject and regenerate from the first mismatch
Result: 2–3x throughput improvement when the draft model frequently agrees with the target. Draft and target must share the same tokeniser.
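The accept/verify loop, sketched with stub draft and target models for illustration. Real implementations verify all draft positions in one batched forward pass and use rejection sampling so the output distribution matches the target model exactly; this greedy version only shows the control flow:

```python
# Toy speculative decoding loop with greedy acceptance. The stub "models"
# stand in for real draft/target networks.

def draft_next(ctx: list[str]) -> str:   # small, fast model (stub)
    return "the" if len(ctx) % 3 else "cat"

def target_next(ctx: list[str]) -> str:  # large, accurate model (stub)
    return "the" if len(ctx) % 3 else "sat"

def speculative_step(ctx: list[str], n_draft: int = 4) -> list[str]:
    # 1. Draft model proposes n_draft tokens autoregressively (cheap).
    proposals = []
    for _ in range(n_draft):
        proposals.append(draft_next(ctx + proposals))
    # 2. Target model checks the proposals (in practice: one batched forward
    #    pass over all positions; simulated token by token here for clarity).
    accepted = []
    for tok in proposals:
        if target_next(ctx + accepted) == tok:
            accepted.append(tok)  # agreement: keep the draft token for free
        else:
            accepted.append(target_next(ctx + accepted))  # first mismatch: take the
            break                                         # target's token and stop
    return accepted

ctx = ["a"]
ctx += speculative_step(ctx)
print(ctx)
```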
Managed Inference Options
| Provider | Best for | Models |
|---|---|---|
| Anthropic API | Claude models (only option) | Claude 4.x family |
| OpenAI API | GPT family | GPT-4o, o3, etc |
| Together AI | Open models, fast | Llama, Mixtral, Qwen |
| Fireworks AI | Low latency, function calling | Llama, Firefunction |
| Replicate | Diverse models, pay-per-use | 1,000+ models |
| Modal | Serverless GPU, custom models | Any model |
| RunPod | Reserved GPU, cheap | Any model |
Key Facts
- Single-request serving: GPU is ~10% utilised; batching is the primary throughput lever
- vLLM paged attention: 2-4x throughput over naive serving, near-zero wasted KV cache memory
- vLLM exposes OpenAI-compatible API — drop-in replacement for any OpenAI client
- llama.cpp GGUF Q4_K_M on M3 Max: Llama 3 8B at ~50 tok/s, 70B at ~10 tok/s
- Speculative decoding: 2-3x throughput improvement; draft and target must share same tokeniser
- TensorRT-LLM: 20-40% better throughput than vLLM; higher setup cost
- Local inference recommendation: GGUF Q4_K_M for CPU; Q5_K_M or fp16 for consumer GPU
Common Failure Cases
vLLM OOMs at startup before serving any requests
Why: gpu_memory_utilization=0.90 (the default) gives vLLM a budget of 90% of total VRAM for weights, activations, and the pre-allocated KV cache; if the weights and activation workspace already exceed that budget, or leave no room for the KV cache, startup fails.
Detect: CUDA out of memory in vLLM startup logs before any requests are received.
Fix: check model size vs GPU VRAM first; lower gpu_memory_utilization (e.g. to 0.80 or 0.75) if other processes share the GPU; reduce max_model_len to shrink the KV cache pre-allocation; or use tensor_parallel_size to split the model across multiple GPUs.
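As a sketch of where these knobs live in vLLM's offline Python API (values are illustrative; defaults vary by vLLM version):

```python
# Illustrative settings for the OOM-at-startup case, via vLLM's offline LLM API.
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    tensor_parallel_size=4,       # split weights across 4 GPUs
    gpu_memory_utilization=0.80,  # leave headroom if other processes share the GPU
    max_model_len=8192,           # caps per-sequence KV cache, shrinking the pre-allocation
)
```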
llama.cpp Q4_K_M GGUF produces noticeably worse quality than BF16 on coding tasks
Why: 4-bit quantisation discards weight precision; coding and instruction-following tasks are sensitive to the resulting shifts in token probabilities.
Detect: pass@1 accuracy on HumanEval drops >5% vs the BF16 model; output contains more hallucinated API calls.
Fix: use Q5_K_M or Q6_K for coding-focused deployments where quality matters; accept the 30% larger model size.
vLLM speculative decoding draft model produces high rejection rate, slowing throughput
Why: if the draft and target models are too different in capability or the target temperature is high, the acceptance rate drops below 50%, making speculative decoding slower than without it.
Detect: tokens/second with speculative decoding enabled is lower than without; acceptance_rate metric in vLLM < 0.6.
Fix: use a draft model that is a smaller version of the same model family; disable speculative decoding for high-temperature creative generation.
Continuous batching stalls when one request has a very long output
Why: a very long generation holds a growing share of the KV cache blocks; when free blocks run out, new requests sit in the queue (and running sequences may be preempted), so short requests end up waiting behind the long one.
Detect: tail latency increases when one user is generating a very long response; other short requests are queued behind it.
Fix: set max_model_len to cap KV cache per sequence; use streaming and timeout long generations at the application layer.
TensorRT-LLM engine built for one GPU type fails on another
Why: TRT engines are compiled for a specific GPU architecture; an engine compiled for A100 cannot run on H100 (different compute capability, different optimisations).
Detect: RuntimeError: Engine was compiled for CUDA compute capability 8.0 but current device is 9.0.
Fix: rebuild the engine for each GPU type; maintain separate engine binaries per GPU architecture in CI.
Connections
- infra/vector-stores — vector stores frequently collocated with inference serving in RAG systems
- infra/huggingface — model hub provides the checkpoints that vLLM and llama.cpp load
- infra/gpu-hardware — GPU selection determines which serving approach is viable
- infra/ai-gateway — API gateway layer that sits in front of inference serving for routing, rate limiting, and cost control
- infra/inference-platforms — serverless and managed inference platforms (Replicate, Modal, Hugging Face Endpoints, AWS SageMaker)
- cloud/serverless-patterns — serverless inference as an alternative to always-on GPU servers
- math/transformer-math — KV cache memory calculations underlie paged attention design
- fine-tuning/lora-qlora — quantisation affects inference serving choices
Open Questions
- How does vLLM prefix caching compare to Anthropic prompt caching for repeated system prompts?
- When does TensorRT-LLM's throughput advantage justify the higher setup complexity over vLLM?
- What is the realistic speculative decoding acceptance rate for general-purpose assistants vs coding tasks?
Related reading