DeepSpeed ZeRO

Zero Redundancy Optimizer. Microsoft's distributed training system that partitions model training state across GPUs to eliminate memory redundancy. Enables training models with billions of parameters on commodity hardware.

Part of the DeepSpeed library. Integrates directly with HuggingFace Trainer, Axolotl, and PyTorch Lightning.


The Problem

In standard data-parallel training, every GPU holds a complete copy of:

  • Model parameters (weights)
  • Optimizer states (for Adam: momentum, variance, master weights — 3× the parameter count in fp32)
  • Gradients

For a 7B model trained with Adam in mixed precision, that's ~112GB of training state replicated on every GPU. A single H100 has 80GB: the full state doesn't fit, even before counting activations.
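The arithmetic behind that ~112GB figure, as a back-of-envelope sketch (assuming bf16 training with Adam keeping fp32 master weights, momentum, and variance):

```python
# 2 (bf16 weights) + 2 (bf16 grads) + 4 + 4 + 4 (fp32 master/momentum/variance)
# = 16 bytes per parameter, all replicated on every GPU in plain data parallel.
PARAMS = 7e9  # 7B model

bytes_per_param = {
    "bf16 weights": 2,
    "bf16 gradients": 2,
    "fp32 master weights": 4,
    "fp32 Adam momentum": 4,
    "fp32 Adam variance": 4,
}

total_gb = PARAMS * sum(bytes_per_param.values()) / 1e9
print(f"{total_gb:.0f} GB per GPU")  # 112 GB -- exceeds an 80GB H100
```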


ZeRO Stages

ZeRO eliminates redundancy by partitioning these components across GPUs. Each stage adds more partitioning:

Stage 1 — Optimizer State Partitioning

Each GPU holds all parameters and gradients, but only its shard of the optimizer states.

GPU 0: params[all], grads[all], opt_states[0:N/4]
GPU 1: params[all], grads[all], opt_states[N/4:N/2]
...
  • Memory reduction: ~4× vs baseline (optimizer states are the largest component)
  • Communication overhead: minimal (all-gather optimizer states at update step)

Stage 2 — Gradient Partitioning

Each GPU holds all parameters but only its shard of gradients and optimizer states.

GPU 0: params[all], grads[0:N/4], opt_states[0:N/4]
GPU 1: params[all], grads[N/4:N/2], opt_states[N/4:N/2]
...
  • Memory reduction: ~8× vs baseline
  • Communication: reduce-scatter gradients during backward pass

Stage 3 — Parameter Partitioning

All three components are partitioned. Each GPU only owns a shard of the model.

GPU 0: params[0:N/4], grads[0:N/4], opt_states[0:N/4]
GPU 1: params[N/4:N/2], grads[N/4:N/2], opt_states[N/4:N/2]
...
  • Memory reduction: linear with GPU count (8 GPUs = 8× reduction)
  • Communication: all-gather parameters during forward and backward passes
  • Can train 200B+ parameter models with model parallelism
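The per-stage footprint follows directly from these partitioning schemes. A sketch of the formulas from the ZeRO paper, assuming mixed precision (2-byte parameters and gradients) and Adam (K = 12 bytes/param of fp32 optimizer state); activations are excluded:

```python
def zero_memory_per_gpu(num_params, num_gpus, stage, k=12):
    """Approximate model-state bytes per GPU for each ZeRO stage.

    Assumes mixed precision (2-byte params/grads) and Adam
    (k = 12 bytes/param of fp32 optimizer state).
    """
    p, g, o = 2 * num_params, 2 * num_params, k * num_params
    if stage == 0:          # baseline data parallel: everything replicated
        return p + g + o
    if stage == 1:          # optimizer states sharded
        return p + g + o / num_gpus
    if stage == 2:          # gradients + optimizer states sharded
        return p + (g + o) / num_gpus
    if stage == 3:          # everything sharded
        return (p + g + o) / num_gpus
    raise ValueError(stage)

for s in range(4):
    gb = zero_memory_per_gpu(7e9, 8, s) / 1e9
    print(f"Stage {s}: {gb:.1f} GB/GPU")
# Stage 0: 112.0, Stage 1: 38.5, Stage 2: 26.2, Stage 3: 14.0
```

Note the unsharded terms dominate: Stage 1 and 2 savings plateau as GPU count grows, while Stage 3 keeps scaling linearly.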

ZeRO-Infinity (offloading)

Extends Stage 3 by offloading parameters, gradients, and optimizer states to CPU RAM or NVMe storage. Enables training when GPU memory alone is insufficient.
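A minimal Stage 3 fragment with NVMe offload might look like the sketch below; the nvme_path is a placeholder for a node-local directory, and the key names follow DeepSpeed's zero_optimization config schema:

```json
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "nvme",
      "nvme_path": "/local_nvme"
    },
    "offload_param": {
      "device": "nvme",
      "nvme_path": "/local_nvme"
    }
  }
}
```

Set "device": "cpu" instead to offload to CPU RAM rather than NVMe.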


Memory Comparison (7B Model, 8 GPUs)

Approach                   Memory per GPU (model states)
Baseline data parallel     ~112GB (doesn't fit on an 80GB H100)
ZeRO Stage 1               ~38GB
ZeRO Stage 2               ~26GB
ZeRO Stage 3               ~14GB

Usage with HuggingFace

# ds_config.json
{
  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "allgather_bucket_size": 2e8,
    "reduce_scatter": true,
    "reduce_bucket_size": 2e8,
    "overlap_comm": true
  },
  "bf16": {"enabled": true},
  "train_batch_size": 256,
  "train_micro_batch_size_per_gpu": 4
}

(train_batch_size must equal micro batch × gradient accumulation × GPU count: 4 × 8 × 8 = 256 on an 8-GPU node.)

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./output",
    deepspeed="ds_config.json",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
)

With Axolotl:

deepspeed: deepspeed_configs/zero2.json

ZeRO vs FSDP (PyTorch)

PyTorch's Fully Sharded Data Parallel (FSDP) is the native alternative to ZeRO Stage 3.

Feature                   DeepSpeed ZeRO                     PyTorch FSDP
Ecosystem                 DeepSpeed library                  PyTorch native
Stages                    1, 2, 3, Infinity                  Equivalent to ZeRO 3
CPU/NVMe offload          ZeRO-Infinity                      CPU offload only
HuggingFace support       Full                               Full
Configuration             JSON config                        Python API
Notable strength          Communication efficiency at scale  Simpler to debug

Rule of thumb: FSDP for standard multi-GPU fine-tuning; DeepSpeed ZeRO-3 or ZeRO-Infinity for very large models or when CPU offload is needed.


When to Use Each Stage

Scenario                       Recommended stage
7B model, 8× A100 80GB         Stage 1 or 2 (fits with optimizer-state partitioning alone)
13B model, 4× A100 80GB        Stage 2 or 3
70B model, 8× H100 80GB        Stage 3
70B+ model, limited GPU RAM    Stage 3 + ZeRO-Infinity

Start with Stage 2. Go to Stage 3 only if Stage 2 OOMs. Stage 3's all-gather communication adds overhead.


Common Failure Cases

Stage 3 training is slower than Stage 2 despite lower memory usage
Why: Stage 3 all-gathers parameters during both forward and backward passes; on nodes with low NVLink bandwidth (e.g., PCIe-connected multi-GPU), the communication overhead exceeds the memory savings benefit.
Detect: tokens/second drops >20% when moving from Stage 2 to Stage 3; nvidia-smi shows low GPU utilization while ranks wait on communication.
Fix: start with Stage 2; only move to Stage 3 when Stage 2 genuinely OOMs; use NVLink-connected nodes for Stage 3.

ds_config.json train_batch_size doesn't match per_device * accumulation * gpus
Why: DeepSpeed requires train_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus; a mismatch raises a validation error at startup.
Detect: AssertionError: ... train_batch_size on training job start.
Fix: calculate and set train_batch_size explicitly in ds_config.json to match the product of the three values.
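A pre-flight check for this invariant can be sketched as a hypothetical helper (not part of DeepSpeed) that mirrors the startup assertion:

```python
def validate_ds_batch(train_batch_size, micro_batch_per_gpu, grad_accum, num_gpus):
    """Check DeepSpeed's batch-size invariant before launching a job."""
    expected = micro_batch_per_gpu * grad_accum * num_gpus
    if train_batch_size != expected:
        raise ValueError(
            f"train_batch_size={train_batch_size} but "
            f"{micro_batch_per_gpu} x {grad_accum} x {num_gpus} = {expected}"
        )
    return expected

validate_ds_batch(256, 4, 8, 8)  # ok: 4 * 8 * 8 = 256
```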

ZeRO-Infinity NVMe offloading requires nvme_path to exist on all nodes
Why: NVMe offload writes optimizer states to a local path; if the path doesn't exist or has insufficient space, training crashes.
Detect: FileNotFoundError or IOError: No space left on device in training logs; only fails on some nodes.
Fix: create the offload directory on all nodes before training; check free space with df -h; ensure the path is consistent across the cluster.
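A pre-flight check along these lines (a hypothetical helper; run it on every node via your launcher) catches both failure modes before the job starts:

```python
import os
import shutil

def check_offload_path(path, required_gb):
    """Verify the NVMe/CPU offload directory exists and has headroom."""
    if not os.path.isdir(path):
        raise FileNotFoundError(f"offload path missing on this node: {path}")
    free_gb = shutil.disk_usage(path).free / 1e9
    if free_gb < required_gb:
        raise OSError(f"only {free_gb:.0f} GB free at {path}, "
                      f"need ~{required_gb} GB for offloaded states")
    return free_gb

# Rough sizing: a 7B model offloading fp32 optimizer states needs
# about 12 bytes/param = ~84 GB of local storage.
```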

Gradient clipping disabled in ds_config.json causes loss explosion
Why: the default DeepSpeed config doesn't enable gradient clipping; without it, large gradients destabilise training.
Detect: training loss spikes then diverges (NaN/Inf); grad_norm metric exceeds 100 in W&B.
Fix: add "gradient_clipping": 1.0 to the DeepSpeed config; this is standard practice and rarely omitted intentionally.
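The addition is a single top-level key in the DeepSpeed config; 1.0 is the value commonly used for LLM fine-tuning:

```json
{
  "gradient_clipping": 1.0,
  "zero_optimization": {"stage": 2}
}
```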

FSDP migration from ZeRO requires rewriting the training script structure
Why: ZeRO is configured externally via JSON; FSDP requires explicit wrapping policies in the training code; they're not interchangeable via a config swap.
Detect: switching from deepspeed to fsdp in TrainingArguments raises errors or produces unexpectedly different sharding behavior.
Fix: treat them as separate code paths; use ZeRO for DeepSpeed-only stacks and FSDP for pure PyTorch; don't expect a drop-in switch.

Open Questions

  • What operational burden does this introduce that the documentation does not cover?
  • At what scale does this infrastructure pattern stop being cost-effective?