Inference Serving

Production LLM inference is memory-bandwidth-bound, not compute-bound — vLLM solves this with paged attention (2-4x throughput over naive serving) and continuous batching; llama.cpp handles quantised local inference.

Running LLMs in production. The key challenge: transformers are memory-bandwidth-bound at inference time, and the KV cache grows linearly with sequence length (and with batch size). Production serving requires careful memory management to maximise throughput.


The Bottleneck: Memory, Not Compute

At inference time (after training), the GPU is not compute-limited. It's memory-bandwidth-limited. Moving weights from HBM (GPU memory) to CUDA cores is the bottleneck. Making the GPU do more computation per memory fetch (batch processing) is the key to throughput.

Single-request serving: The GPU is ~10% utilised because it's waiting for memory fetches. Wasteful. Batched serving: Serving N requests simultaneously uses the same memory fetches to do N times the work. Batching is everything for throughput.
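A back-of-the-envelope calculation shows why (figures below are rough assumptions for an H100-class GPU, not measurements): single-stream decoding must stream every weight byte once per generated token, so the ceiling is bandwidth divided by model size.

# Rough bandwidth-bound ceiling for single-stream decoding.
# Assumed figures: 70B-parameter model in bf16, ~3.35 TB/s HBM bandwidth (H100 SXM).
params = 70e9
bytes_per_param = 2                                # bf16
hbm_bandwidth = 3.35e12                            # bytes/sec, approximate
bytes_read_per_token = params * bytes_per_param    # every weight streamed once per token
ceiling = hbm_bandwidth / bytes_read_per_token
print(f"~{ceiling:.0f} tokens/sec per sequence")   # ~24 tok/s; batching reuses each
                                                   # weight fetch across many sequences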


vLLM

The standard open-source inference serving framework. Written in Python + CUDA.

Key innovations:

Paged Attention

The KV cache in standard serving systems requires contiguous pre-allocated memory per sequence. This causes fragmentation and over-reservation that waste 60–80% of the memory set aside for the KV cache.

vLLM uses a paged virtual memory scheme (like OS virtual memory) for the KV cache:

  • Divide KV cache into fixed-size "pages" (blocks)
  • Allocate pages dynamically as sequences grow
  • Share pages across requests that have common prefixes (prefix caching)

Result: 2–4x higher throughput than naive serving, near-zero wasted memory.
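A toy sketch of the block-table idea (the 16-token block size matches vLLM's default, but the data structures here are illustrative, not vLLM's):

class PagedKVCache:
    """Toy block-table allocator: logical token positions map to fixed-size physical blocks."""
    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # request id -> list of physical block ids
        self.lengths = {}        # request id -> number of tokens cached

    def append_token(self, req_id):
        length = self.lengths.get(req_id, 0)
        table = self.block_tables.setdefault(req_id, [])
        if length % self.block_size == 0:          # last block full (or no block yet)
            table.append(self.free_blocks.pop())   # allocate one more block on demand
        self.lengths[req_id] = length + 1

    def release(self, req_id):
        # A finished request returns its blocks to the pool; because every block
        # has the same fixed size, there is no fragmentation.
        self.free_blocks.extend(self.block_tables.pop(req_id, []))
        self.lengths.pop(req_id, None)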

Continuous Batching

Standard batching waits for a full batch before processing. Continuous batching processes requests as they arrive and retires them when done. New requests join the in-flight batch dynamically.

Combined with paged attention: optimal GPU utilisation.
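A minimal scheduler loop illustrating the difference (the request objects and model_step function are hypothetical stand-ins, not vLLM internals):

import collections

def serve(waiting: collections.deque, model_step, max_batch=32):
    # `waiting` holds incoming requests; each request object is assumed to
    # expose a `finished` flag set once its generation completes.
    active = []
    while waiting or active:
        # Admit new requests into the in-flight batch every step,
        # instead of waiting for the whole batch to finish.
        while waiting and len(active) < max_batch:
            active.append(waiting.popleft())
        model_step(active)   # one decode step for every active sequence
        # Retire finished sequences immediately so their slots free up.
        active = [seq for seq in active if not seq.finished]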

Tensor Parallelism

Distribute a single model across multiple GPUs by splitting each weight matrix along one of its dimensions, so every GPU holds a shard and computes a partial result. Required for 70B+ models on standard hardware: a 70B model is ~140 GB of weights in bf16, more than fits on a single 80 GB GPU.

vllm serve meta-llama/Meta-Llama-3-70B-Instruct \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.95

OpenAI-compatible API: vLLM exposes an endpoint compatible with the OpenAI API format — drop-in replacement for any OpenAI API client.
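For example, assuming a vLLM server started with the command above and listening on the default port 8000:

# Point the standard OpenAI client at a local vLLM server.
# api_key can be any placeholder unless the server was started with --api-key.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    messages=[{"role": "user", "content": "Summarise paged attention in one sentence."}],
)
print(response.choices[0].message.content)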


llama.cpp

CPU and consumer GPU inference. The standard for running quantised models locally.

GGUF format — llama.cpp's quantised model format. Supports Q4_0, Q4_K_M, Q5_K_M, Q8_0, fp16, and others. The most widely used format for local inference.

./llama-cli -m llama-3-8b-q4_k_m.gguf -p "Hello, who are you?" -n 256

Typical performance (M3 Max, 16-core CPU):

  • Llama 3 8B Q4_K_M: ~50 tokens/sec
  • Llama 3 70B Q4_K_M: ~10 tokens/sec (requires enough unified memory)

Python binding (llama-cpp-python):

from llama_cpp import Llama

# n_gpu_layers=-1 offloads all layers to the GPU (Metal or CUDA build); use 0 for CPU-only.
llm = Llama(model_path="llama-3-8b-q4_k_m.gguf", n_gpu_layers=-1)
output = llm("Hello world", max_tokens=50)
print(output["choices"][0]["text"])
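Chat-format usage with the same object (llama-cpp-python applies the model's chat template to the messages):

# Continuing from the Llama object above: OpenAI-style chat messages.
messages = [{"role": "user", "content": "Who are you?"}]
reply = llm.create_chat_completion(messages=messages, max_tokens=50)
print(reply["choices"][0]["message"]["content"])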

TensorRT-LLM

NVIDIA's optimised inference library. Maximum throughput on NVIDIA GPUs.

  • Custom CUDA kernels optimised for transformer attention
  • In-flight batching
  • FP8 / INT8 / INT4 quantisation with calibration
  • Model parallelism

Best for production deployments on owned/leased NVIDIA hardware. Higher setup cost than vLLM but 20–40% better throughput.


Triton Inference Server

NVIDIA's model serving platform. Wraps TensorRT-LLM (and other backends) with:

  • gRPC + REST API
  • Dynamic batching
  • Model ensemble support (chain models together)
  • Prometheus metrics

Enterprise-grade, complex to configure. Use vLLM unless you need the enterprise features.


Quantisation for Inference

See math/transformer-math for the numbers. Practical guide:

  Use case                       Recommendation
  Local, CPU                     GGUF Q4_K_M (best quality/size balance)
  Local, consumer GPU            GGUF Q5_K_M or fp16 if VRAM allows
  Cloud serving, quality-first   bf16 or fp8 (H100)
  Cloud serving, cost-first      int4/GPTQ with calibration
  Development / API              Don't quantise — use API

Speculative Decoding

Technique to accelerate autoregressive generation without quality loss:

  1. A small "draft" model generates N tokens quickly
  2. The large "target" model verifies all N tokens in a single forward pass
  3. Accept all tokens the target model agrees with; reject and regenerate from the first mismatch

Result: 2–3x throughput improvement when the draft model frequently agrees with the target. Draft and target must share the same tokeniser.
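A sketch of the greedy accept/reject loop (draft_model and target_model are hypothetical next-token functions; production implementations verify the whole draft in one batched forward pass and use a rejection-sampling rule that preserves the target distribution when sampling):

# Greedy speculative decoding, one step. draft_model / target_model are
# hypothetical callables: token list in, argmax next token out.
def speculative_step(prefix, draft_model, target_model, n_draft=4):
    # 1. Draft model proposes n_draft tokens cheaply, one at a time.
    draft = []
    for _ in range(n_draft):
        draft.append(draft_model(prefix + draft))
    # 2. Target model checks each drafted position (a real system does this
    #    in a single forward pass over the whole draft).
    accepted = []
    for token in draft:
        if target_model(prefix + accepted) == token:
            accepted.append(token)                             # agreement: token is "free"
        else:
            accepted.append(target_model(prefix + accepted))   # first mismatch: take the
            break                                              # target's token and stop
    return prefix + accepted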


Managed Inference Options

  Provider        Best for                         Models
  Anthropic API   Claude models (only option)      Claude 4.x family
  OpenAI API      GPT family                       GPT-4o, o3, etc.
  Together AI     Open models, fast                Llama, Mixtral, Qwen
  Fireworks AI    Low latency, function calling    Llama, Firefunction
  Replicate       Diverse models, pay-per-use      1,000+ models
  Modal           Serverless GPU, custom models    Any model
  RunPod          Reserved GPU, cheap              Any model

Key Facts

  • Single-request serving: GPU is ~10% utilised; batching is the primary throughput lever
  • vLLM paged attention: 2-4x throughput over naive serving, near-zero wasted KV cache memory
  • vLLM exposes OpenAI-compatible API — drop-in replacement for any OpenAI client
  • llama.cpp GGUF Q4_K_M on M3 Max: Llama 3 8B at ~50 tok/s, 70B at ~10 tok/s
  • Speculative decoding: 2-3x throughput improvement; draft and target must share same tokeniser
  • TensorRT-LLM: 20-40% better throughput than vLLM; higher setup cost
  • Local inference recommendation: GGUF Q4_K_M for CPU; Q5_K_M or fp16 for consumer GPU

Common Failure Cases

vLLM OOMs at startup before serving any requests
Why: gpu_memory_utilization=0.90 (the default) budgets 90% of VRAM for model weights plus the KV cache; if the weights and activation workspace don't fit within that budget, or other processes already hold part of the VRAM, startup fails.
Detect: CUDA out of memory in vLLM startup logs before any requests are received.
Fix: lower gpu_memory_utilization to 0.80 or 0.75; check model size vs GPU VRAM; or use tensor_parallel_size to split across multiple GPUs.
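For example, with the offline Python API (the same options exist as flags on vllm serve); model name and sizes are illustrative:

# Tighter memory budget: leave headroom outside vLLM's pool and cap per-sequence
# KV cache. The same knobs exist as --gpu-memory-utilization / --max-model-len /
# --tensor-parallel-size on `vllm serve`.
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    tensor_parallel_size=4,
    gpu_memory_utilization=0.80,
    max_model_len=8192,
)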

llama.cpp Q4_K_M GGUF produces noticeably worse quality than BF16 on coding tasks
Why: 4-bit quantisation loses weight precision, and coding and instruction-following tasks are more sensitive to that loss than open-ended chat.
Detect: pass@1 accuracy on HumanEval drops >5% vs the BF16 model; output contains more hallucinated API calls.
Fix: use Q5_K_M or Q6_K for coding-focused deployments where quality matters; accept the 30% larger model size.

vLLM speculative decoding draft model produces high rejection rate, slowing throughput
Why: if the draft and target models are too different in capability or the target temperature is high, the acceptance rate drops below 50%, making speculative decoding slower than without it.
Detect: tokens/second with speculative decoding enabled is lower than without; acceptance_rate metric in vLLM < 0.6.
Fix: use a draft model that is a smaller version of the same model family; disable speculative decoding for high-temperature creative generation.

Continuous batching stalls when one request has a very long output
Why: a sequence generating thousands of tokens holds its KV cache blocks for its entire lifetime; as free blocks run low, newly arriving requests queue (and other sequences may be preempted) behind it.
Detect: tail latency increases when one user is generating a very long response; other short requests are queued behind it.
Fix: set max_model_len to cap KV cache per sequence; use streaming and timeout long generations at the application layer.

TensorRT-LLM engine built for one GPU type fails on another
Why: TRT engines are compiled for a specific GPU architecture; an engine compiled for A100 cannot run on H100 (different compute capability, different optimisations).
Detect: RuntimeError: Engine was compiled for CUDA compute capability 8.0 but current device is 9.0.
Fix: rebuild the engine for each GPU type; maintain separate engine binaries per GPU architecture in CI.

Open Questions

  • How does vLLM prefix caching compare to Anthropic prompt caching for repeated system prompts?
  • When does TensorRT-LLM's throughput advantage justify the higher setup complexity over vLLM?
  • What is the realistic speculative decoding acceptance rate for general-purpose assistants vs coding tasks?