GPU Hardware for LLMs

GPU selection guide for LLM inference and training — VRAM is the binding constraint (2 bytes per parameter in BF16), with H100 at ~3x A100 throughput for inference and RTX 4090 as the consumer sweet spot for fine-tuning.

The practical guide to GPU selection for inference and training. VRAM is the binding constraint. A model that doesn't fit in VRAM can't run.


VRAM Requirements by Model Size

Rule of thumb: 2 bytes per parameter for FP16/BF16.

| Model size | BF16 VRAM | INT8 VRAM | INT4 VRAM |
|------------|-----------|-----------|-----------|
| 7B         | 14 GB     | 7 GB      | 4 GB      |
| 13B        | 26 GB     | 13 GB     | 7 GB      |
| 33B        | 66 GB     | 33 GB     | 17 GB     |
| 70B        | 140 GB    | 70 GB     | 35 GB     |
| 405B       | 810 GB    | 405 GB    | 203 GB    |

Add ~20% overhead for KV cache + activations during inference.
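
The rule of thumb is easy to encode as a sanity check; a minimal sketch (the flat 20% overhead factor is the estimate above, not a measured value):

def estimate_vram_gb(params_billion: float, bytes_per_param: float, overhead: float = 0.20) -> float:
    """Weights-only estimate plus a flat allowance for KV cache and activations."""
    weights_gb = params_billion * bytes_per_param  # billions of params x bytes/param ~= GB
    return weights_gb * (1 + overhead)

# 70B at BF16 (2 bytes), INT8 (1 byte) and INT4 (0.5 bytes)
for bytes_per_param in (2, 1, 0.5):
    print(f"{bytes_per_param} bytes/param: {estimate_vram_gb(70, bytes_per_param):.0f} GB")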

For full fine-tuning with AdamW in mixed precision, budget roughly 12 bytes per parameter: BF16 weights (2) + BF16 gradients (2) + two FP32 Adam moments (4 + 4). A 7B model therefore needs ~84 GB before activations. LoRA reduces this dramatically: only the adapter parameters need gradients and optimizer states.
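
A rough budgeting sketch under those assumptions (the 0.02B adapter size is illustrative; actual LoRA adapter size depends on rank and target modules):

def full_finetune_gb(params_billion: float) -> float:
    # BF16 weights (2) + BF16 gradients (2) + FP32 Adam m and v (4 + 4) = 12 bytes/param
    return params_billion * 12

def lora_finetune_gb(params_billion: float, adapter_params_billion: float) -> float:
    # Frozen BF16 base weights, full training state only for the adapter
    return params_billion * 2 + adapter_params_billion * 12

print(full_finetune_gb(7))        # ~84 GB
print(lora_finetune_gb(7, 0.02))  # ~14.2 GB: the frozen base weights dominate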


GPU Comparison

NVIDIA Data Centre (Cloud/Server)

| GPU | VRAM | Memory BW | TDP | Best for |
|-----|------|-----------|-----|----------|
| H200 SXM | 141 GB HBM3e | 4.8 TB/s | 700W | Frontier training + inference |
| H100 SXM | 80 GB HBM3 | 3.35 TB/s | 700W | Large model training |
| H100 PCIe | 80 GB HBM2e | 2 TB/s | 350W | Inference |
| A100 80GB | 80 GB HBM2e | 2 TB/s | 400W | Training, widely available |
| A100 40GB | 40 GB HBM2 | 1.6 TB/s | 400W | Common in cloud |
| L40S | 48 GB GDDR6 | 864 GB/s | 350W | Inference, ~30% cheaper than A100 |

H100 delivers roughly 3x the throughput of A100 for transformer inference, driven by FP8 support (Transformer Engine), higher memory bandwidth, and faster NVLink for multi-GPU serving.
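
FP8 can be enabled directly in vLLM on Hopper GPUs; a minimal sketch, assuming a vLLM build with FP8 quantisation support:

from vllm import LLM

# Dynamic FP8 weight quantisation (Hopper-class GPUs only)
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    quantization="fp8",
    tensor_parallel_size=2,
)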

Consumer / Prosumer

| GPU | VRAM | Memory BW | Cost | Best for |
|-----|------|-----------|------|----------|
| RTX 4090 | 24 GB GDDR6X | 1008 GB/s | ~$1,600 | QLoRA fine-tuning 7-13B, fast inference |
| RTX 4080 | 16 GB GDDR6X | 717 GB/s | ~$800 | Inference 7B, QLoRA 7B |
| RTX 3090 | 24 GB GDDR6X | 936 GB/s | ~$700 used | Good value for 7B-13B inference |
| RTX 4060 Ti 16GB | 16 GB GDDR6 | 288 GB/s | ~$450 | Inference 7B INT4 |

Apple Silicon:

| Chip | Unified Memory | Memory BW | Best for |
|------|----------------|-----------|----------|
| M3 Max 128GB | 128 GB | 400 GB/s | Full 70B INT4 inference locally |
| M3 Pro 36GB | 36 GB | 150 GB/s | 13-30B inference |
| M3 24GB | 24 GB | 100 GB/s | 7-13B |
| M4 Max 128GB | 128 GB | 546 GB/s | Best local inference hardware (2025) |

Apple Silicon is compelling for inference: unified memory means up to 128 GB is available to the GPU at reasonable bandwidth, with no PCIe transfer bottleneck.
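
One way to use that memory locally is MLX; a minimal sketch with mlx-lm (the 4-bit model repo name is illustrative, and the generate signature can vary slightly across mlx-lm versions):

from mlx_lm import load, generate

# Weights live in unified memory and run on the Metal GPU; no device placement needed
model, tokenizer = load("mlx-community/Meta-Llama-3-70B-Instruct-4bit")
print(generate(model, tokenizer, prompt="Explain unified memory in one sentence.", max_tokens=64))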


Cloud GPU Pricing (April 2026)

| Provider | GPU | $/hr (on-demand) | Notes |
|----------|-----|------------------|-------|
| Lambda Labs | A100 80GB | ~$1.99 | Best value for training |
| Lambda Labs | H100 SXM | ~$3.29 | |
| RunPod | A100 80GB | ~$2.29 | Spot can be 50% cheaper |
| RunPod | H100 SXM | ~$3.99 | |
| Vast.ai | RTX 4090 | ~$0.40-0.80 | Cheap, less reliable |
| Modal | A100 | ~$3.50 | Serverless, scales to zero |
| AWS | 8× A100 40GB (p4d.24xlarge) | ~$32 | Expensive but reliable |
| GCP | 8× H100 80GB (A3 Mega) | ~$43 | Frontier training |

Spot/interruptible pricing is typically 50-70% cheaper. Use for training (with checkpointing) or batch inference.
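
To make spot instances safe for training, checkpoint frequently and resume after preemption; a sketch with the Hugging Face Trainer (step counts and paths are illustrative):

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="checkpoints",   # sync this directory to durable storage (mounted volume, S3, etc.)
    save_steps=500,             # checkpoint often enough that a preemption loses little work
    save_total_limit=2,         # keep only the latest checkpoints to bound disk usage
)

# After a restart, resume from the most recent checkpoint in output_dir:
# trainer.train(resume_from_checkpoint=True)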


Fitting Models in Memory

Multi-GPU with tensor parallelism

# vLLM: shard one model across multiple GPUs with tensor parallelism
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    tensor_parallel_size=4,  # shard across 4 GPUs (4× A100 40GB = 160GB)
)

Quantisation to fit on smaller GPU

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# INT4 quantisation — 70B fits in ~35GB (2× RTX 4090)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-70B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
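
After loading, you can confirm the quantised footprint reported by the model (weights only, not KV cache):

print(f"{model.get_memory_footprint() / 1e9:.1f} GB")  # roughly 35-40 GB for 70B in 4-bit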

llama.cpp on CPU with GPU offload

from llama_cpp import Llama

# Offload 35 layers to GPU, rest on CPU
llm = Llama(
    model_path="llama-3-70b.Q4_K_M.gguf",
    n_gpu_layers=35,
    n_ctx=4096,
)

Use when the model doesn't fully fit in VRAM. Slower than full GPU inference but faster than CPU-only.
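
Generation then works the same as fully-offloaded inference; a minimal usage sketch reusing the llm object above:

out = llm("Q: What limits tokens/second with partial GPU offload? A:", max_tokens=64)
print(out["choices"][0]["text"])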


Choosing Hardware: Decision Guide

Just want to run 7B locally for development: → RTX 4060 Ti 16GB (cheapest) or M3 Mac (best dev experience)

Fine-tune 7B with LoRA: → RTX 4090 24GB or single A100 40GB

Fine-tune 13-70B with QLoRA: → A100 80GB (single) or 2-4× RTX 4090

Run 70B inference locally: → M3 Max 128GB or 2× A100 40GB

Production inference API (<100ms): → H100 or A100, use vLLM

Training from scratch / large-scale fine-tuning: → Multi-node H100 cluster; use Lambda or GCP

Cheapest possible experiments: → Google Colab Pro ($10/month, T4 16GB), Kaggle (free T4/P100 weekly quota)


VRAM Monitoring

import torch

# VRAM usage as seen by PyTorch
print(torch.cuda.memory_allocated() / 1e9, "GB allocated")
print(torch.cuda.memory_reserved() / 1e9, "GB reserved")
print(torch.cuda.get_device_properties(0).total_memory / 1e9, "GB total")

# Query nvidia-smi from Python (the ! prefix only works in notebooks)
import subprocess
print(subprocess.run(
    ["nvidia-smi", "--query-gpu=name,memory.used,memory.free,memory.total", "--format=csv"],
    capture_output=True, text=True,
).stdout)

Key Facts

  • VRAM rule of thumb: 2 bytes per parameter in BF16 (7B = 14GB, 70B = 140GB)
  • Add 20% overhead for KV cache and activations during inference
  • Full fine-tuning with AdamW: roughly 12 bytes per parameter once gradients and optimizer states are included (7B ≈ 84 GB)
  • H100 delivers ~3x A100 throughput for transformer inference (FP8 support + faster NVLink)
  • Lambda Labs A100 80GB: ~$1.99/hr; H100 SXM: ~$3.29/hr (best value for training)
  • Apple M4 Max 128GB: 546 GB/s memory bandwidth; best local inference hardware (2025)
  • Spot/interruptible pricing: 50-70% cheaper — use with checkpointing for training
  • Google Colab Pro: $10/month, T4 16GB — cheapest option for experiments

Common Failure Cases

Model loads on VRAM paper spec but OOMs during inference due to KV cache
Why: VRAM estimates are for weights only; KV cache grows with sequence length and batch size, adding 20-40% overhead.
Detect: the model loads successfully but the first generation fails with CUDA OOM; nvidia-smi shows memory near capacity before inference starts.
Fix: account for 20% KV cache overhead when sizing GPU memory; reduce max_new_tokens or use streaming with smaller batch sizes.
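
To size the KV cache explicitly rather than relying on the flat 20% estimate, a sketch (the layer and head counts are Llama-3-70B's published GQA configuration; check your model's config.json):

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                seq_len: int, batch: int, bytes_per_elem: int = 2) -> float:
    # Two tensors (K and V) per layer, each of shape [batch, n_kv_heads, seq_len, head_dim]
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem / 1e9

# Llama-3-70B: 80 layers, 8 KV heads (GQA), head_dim 128; 8k context, batch 1, FP16 cache
print(kv_cache_gb(80, 8, 128, 8192, 1))  # ~2.7 GB per sequence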

Multi-GPU setup with device_map="auto" is slower than single GPU
Why: PCIe interconnect bandwidth (~16 GB/s) is far slower than NVLink (~600 GB/s); on PCIe-connected GPUs, inter-device tensor transfers dominate latency.
Detect: tokens/second on 2× PCIe GPU is lower than 1× GPU of the same type; nvidia-smi topo shows PHB (PCIe Host Bridge) connections.
Fix: use NVLink-connected GPUs (SXM form factor) for multi-GPU inference; or use quantisation to fit on a single GPU instead.
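
A quick check from Python for whether two GPUs can reach each other directly (NVLink or PCIe peer-to-peer) instead of staging transfers through host memory:

import torch

if torch.cuda.device_count() >= 2:
    # False typically corresponds to PHB/NODE topology in nvidia-smi topo -m
    print("P2P 0<->1:", torch.cuda.can_device_access_peer(0, 1))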

int4 quantisation causes severe quality degradation on instruction-following tasks
Why: aggressive INT4 quantisation without good calibration loses precision on outlier weights; naively quantised 4-bit models are noticeably worse at following instructions and output formats.
Detect: instruction-following accuracy drops >10% vs BF16 on your benchmark; the model ignores format requirements.
Fix: use Q4_K_M (GGUF) or GPTQ with calibrated quantisation rather than naive INT4; or use 5-bit quantisation as a compromise.

Cloud GPU spot instance preemption loses training progress
Why: spot/interruptible instances are reclaimed without warning when demand increases.
Detect: training job terminates with SpotInstanceInterruption or equivalent; no checkpoint was saved recently.
Fix: checkpoint every 10-30 minutes with save_steps; enable training job restart from the latest checkpoint; use deepspeed ZeRO with checkpoint support.

Apple Silicon model loads but runs at 10% of expected speed due to CPU fallback
Why: certain custom ops (e.g., some GGUF quantisation types) fall back to CPU on Apple Silicon; the GPU runs but CPU is the bottleneck.
Detect: GPU utilisation is 10-30% in Activity Monitor despite the model "running on GPU"; tokens/second is far below the expected rate.
Fix: use n_gpu_layers=-1 in llama.cpp to maximise GPU offload; check that the GGUF quantisation type (Q4_K_M, Q5_K_M) is supported natively by Metal.

Open Questions

  • When does H200 become widely available on cloud rental platforms vs H100?
  • How does Apple Silicon M4 Max compare to a single A100 for vLLM-style continuous batching inference?
  • What is the practical VRAM headroom needed above the minimum for stable long-context inference?