Experiment Tracking

Logging and comparing ML training runs. Distinct from production LLM observability (see observability/platforms): experiment tracking is for the training phase — comparing hyperparameter runs, catching overfitting, reproducing results.

The two dominant tools: Weights & Biases (W&B), the industry standard with the richer UX, and MLflow, open-source and self-hostable.

Neptune deprecation notice: Neptune SaaS was shut down in March 2026 following its acquisition by OpenAI. Any teams still using Neptune must migrate. Migration paths: MLflow (open-source, self-hosted, drop-in for most use cases), W&B (managed, best UX), Aim (open-source alternative). See the Migration section below.


Why It Matters

Without experiment tracking you cannot:

  • Compare two fine-tuning runs with different learning rates
  • Know which checkpoint produced the best eval loss
  • Reproduce a result from six weeks ago
  • Share training curves with a colleague

Weights & Biases

Setup

pip install wandb
wandb login   # paste API key from wandb.ai

Basic run logging

import wandb

wandb.init(
    project="my-llm-finetune",
    name="lora-lr1e-4",
    config={
        "model": "mistral-7b",
        "learning_rate": 1e-4,
        "lora_rank": 16,
        "epochs": 3,
    }
)

for step, batch in enumerate(train_loader):
    loss = train_step(batch)
    wandb.log({"train/loss": loss, "step": step})

wandb.finish()

Key features

| Feature | What it does |
| --- | --- |
| wandb.log() | Log any scalar, image, table, or histogram |
| wandb.config | Track hyperparameters for each run |
| wandb.save() | Upload model checkpoints as artefacts |
| Sweeps | Hyperparameter search (grid, random, Bayesian) |
| Tables | Log eval datasets with per-row predictions |
| Artefacts | Version datasets, models, and code |
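
The Tables and Artefacts rows map to a couple of calls. A minimal sketch (the column names, the eval_results iterable, and the checkpoint path are illustrative):

import wandb

# Eval table with per-row predictions
table = wandb.Table(columns=["prompt", "prediction", "reference"])
for row in eval_results:   # placeholder: your eval outputs
    table.add_data(row["prompt"], row["prediction"], row["reference"])
wandb.log({"eval/predictions": table})

# Version the final checkpoint as an artefact
artifact = wandb.Artifact("mistral-ft", type="model")
artifact.add_dir("./output/checkpoint-final")
wandb.log_artifact(artifact)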

Integration with Axolotl (fine-tuning)

In config.yaml:

wandb_project: my-finetune
wandb_name: experiment-1
wandb_log_model: checkpoint   # upload checkpoints as W&B artefacts

Axolotl's HuggingFace Trainer will call the W&B callback automatically.

Integration with TRL

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./output",
    report_to="wandb",
    run_name="dpo-mistral-v1",
    logging_steps=10,
)

Hyperparameter sweeps

sweep_config = {
    "method": "bayes",
    "metric": {"name": "eval/loss", "goal": "minimize"},
    "parameters": {
        "learning_rate": {"min": 1e-5, "max": 1e-3},
        "lora_rank": {"values": [8, 16, 32]},
    }
}

sweep_id = wandb.sweep(sweep_config, project="my-finetune")
wandb.agent(sweep_id, function=train, count=20)
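
The train function handed to wandb.agent is whatever runs one trial; each trial's sampled values arrive through wandb.config once wandb.init() is called inside it. A minimal sketch (build_model and train_one_run are placeholder helpers):

def train():
    with wandb.init() as run:
        lr = wandb.config.learning_rate
        rank = wandb.config.lora_rank
        model = build_model(lora_rank=rank)        # placeholder
        eval_loss = train_one_run(model, lr=lr)    # placeholder
        wandb.log({"eval/loss": eval_loss})        # must match the sweep metric name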

W&B Weave (LLM tracing and evaluation)

Weave is a separate W&B product for LLM-specific observability, distinct from W&B core, which targets training runs. It shares the W&B dashboard but is its own product surface.

What Weave covers:

  • LLM call tracing — logs input, output, latency, and cost for every LLM call; supports nested traces for multi-step chains and agents
  • Prompt versioning — tracks prompt text alongside the model and parameters used, so you can diff prompts across experiments
  • Evaluation datasets — attach a dataset to an experiment, run the model over every row, and compare results across runs
  • Human annotation — annotation queues for labelling model outputs; feeds back into evals
  • Experiment comparison UI — side-by-side comparison of prompt versions, model outputs, and scores

Quick setup:

import weave
from anthropic import Anthropic

weave.init("my-llm-project")   # creates a W&B project with Weave enabled

client = Anthropic()

@weave.op()
def call_model(prompt: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text

Weave automatically wraps the call, captures inputs/outputs, and logs latency and token usage. No manual wandb.log() calls required for tracing.

Relationship to W&B core: training metrics (loss curves, checkpoints) still go through wandb.log() / wandb.init(). Weave sits on top for the inference/evaluation side. Both appear in the same W&B project.


Neptune Deprecation and Migration

Neptune SaaS shut down in March 2026 after acquisition by OpenAI. If migrating off Neptune:

| Destination | When to choose |
| --- | --- |
| MLflow | Self-hosted requirement; regulated environment; already in Databricks |
| W&B | Fastest migration; best UX parity; fine-tuning-heavy workflow |
| Aim | Open-source alternative with a local UI; lighter-weight than MLflow |

All three log params, metrics, and artefacts through calls close to Neptune's run["metrics/loss"].log(value); a rough equivalence is sketched below. The HuggingFace Trainer report_to argument accepts "mlflow" or "wandb" directly.
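
The metric-logging equivalence, as a sketch (loss and step stand in for whatever your training loop produces, and each call assumes an active run in that tool):

import mlflow
import wandb

loss, step = 0.42, 100   # placeholder values

# Neptune (legacy API being migrated away from)
# run["metrics/loss"].log(loss)

# MLflow equivalent (inside an active mlflow run)
mlflow.log_metric("loss", loss, step=step)

# W&B equivalent (after wandb.init)
wandb.log({"loss": loss}, step=step)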


MLflow

Open-source, self-hostable. Better for orgs that can't send training data to a third-party cloud.

Setup

pip install mlflow
mlflow ui   # starts local tracking server at http://localhost:5000
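
By default the Python client writes to a local ./mlruns directory; to log to the server started above instead, point the client at it (URI matches the mlflow ui default):

import mlflow

mlflow.set_tracking_uri("http://localhost:5000")
# equivalently, set the MLFLOW_TRACKING_URI environment variable before launching training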

Basic logging

import mlflow

mlflow.set_experiment("llm-fine-tuning")

with mlflow.start_run(run_name="lora-lr1e-4"):
    mlflow.log_params({
        "model": "mistral-7b",
        "learning_rate": 1e-4,
        "lora_rank": 16,
    })

    for step, batch in enumerate(train_loader):
        loss = train_step(batch)
        mlflow.log_metric("train_loss", loss, step=step)

    mlflow.log_artifact("./output/checkpoint-final")

Model Registry

MLflow has a model registry for promoting models through staging → production. Useful when multiple fine-tuned variants are in flight.

mlflow.pytorch.log_model(model, "model", registered_model_name="mistral-ft-v1")

client = mlflow.tracking.MlflowClient()
client.transition_model_version_stage(
    name="mistral-ft-v1",
    version=1,
    stage="Production"
)
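
Downstream serving or eval code can then pull whatever is currently in Production via the stage URI (model name matches the example above; newer MLflow versions prefer aliases over stages, but the stage URI works with the API shown here):

import mlflow

model = mlflow.pyfunc.load_model("models:/mistral-ft-v1/Production")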

Integration with HuggingFace Trainer

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./output",
    report_to="mlflow",
    run_name="sft-v1",
)
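
The MLflow callback also picks up configuration from the environment. A sketch of the usual knobs, assuming the standard MLflow variables and the Trainer callback's HF_MLFLOW_LOG_ARTIFACTS flag (verify the names against your installed transformers/mlflow versions):

import os

# Set these before constructing the Trainer so the callback sees them
os.environ["MLFLOW_TRACKING_URI"] = "http://localhost:5000"
os.environ["MLFLOW_EXPERIMENT_NAME"] = "llm-fine-tuning"
os.environ["HF_MLFLOW_LOG_ARTIFACTS"] = "1"   # also upload checkpoints as MLflow artifacts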

W&B vs MLflow

W&BMLflow
HostingSaaS (self-host option)Self-host or Databricks
UXRicher — sweeps, tables, artefact lineageFunctional — experiments, runs, registry
CostFree tier (100GB storage); paid for teamsOpen-source (infra cost only)
EcosystemDeep HuggingFace/Axolotl/TRL integrationBroad ML ecosystem (sklearn, PyTorch, XGBoost)
Model registryVia artefactsNative model registry
Best forActive research, fast iterationEnterprise, regulated environments

Default recommendation: W&B for fine-tuning experiments. MLflow if you need self-hosted control or are already in a Databricks environment.


What to Always Log

Minimum for reproducibility:

config = {
    # Model
    "base_model": "mistralai/Mistral-7B-v0.1",
    "peft_method": "lora",
    "lora_rank": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.1,
    # Training
    "learning_rate": 2e-4,
    "batch_size": 4,
    "gradient_accumulation_steps": 4,
    "max_steps": 1000,
    "warmup_ratio": 0.03,
    "lr_scheduler": "cosine",
    # Data
    "dataset": "my-dataset-v3",
    "max_seq_length": 2048,
    "num_samples": 10000,
}

Metrics to log every N steps: train/loss, eval/loss, learning_rate, grad_norm.
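
Two fields worth appending to that config dict so the run is tied to an exact code state, sketched with subprocess (assumes training runs inside a git checkout):

import subprocess

config["git_commit"] = subprocess.check_output(
    ["git", "rev-parse", "HEAD"], text=True
).strip()
# non-zero exit from `git diff --quiet` means uncommitted changes
config["git_dirty"] = subprocess.run(["git", "diff", "--quiet"]).returncode != 0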


Common Failure Cases

W&B run starts but logs nothing because WANDB_API_KEY is not set in the training container
Why: the API key is set locally but not passed to the Docker container or CI job environment.
Detect: W&B silently creates an offline run; metrics appear in local wandb/ directory but not on wandb.ai.
Fix: pass WANDB_API_KEY as an environment variable to the container; set it as a GitHub Actions secret for CI training jobs.
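
A cheap guard is to fail fast at the top of the training script instead of letting W&B fall back to an offline run (sketch; adapt to however your jobs receive secrets):

import os

if not os.environ.get("WANDB_API_KEY"):
    raise RuntimeError(
        "WANDB_API_KEY is not set in this environment; "
        "pass it into the container or CI job before training"
    )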

MLflow log_artifact fails silently when the path doesn't exist
Why: log_artifact("./output/checkpoint") quietly succeeds without validating the path; if training failed before saving the checkpoint, nothing is uploaded.
Detect: MLflow run shows no artifacts; checkpoint directory is empty or absent.
Fix: verify the artifact path exists before calling log_artifact; add an assertion or raise an explicit error if the checkpoint is missing.
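
The fix is an existence check before the upload (path matches the earlier MLflow example):

import os
import mlflow

ckpt = "./output/checkpoint-final"
if not os.path.exists(ckpt):
    raise FileNotFoundError(f"checkpoint missing at {ckpt}; refusing to log an incomplete run")
mlflow.log_artifact(ckpt)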

Two concurrent training jobs corrupt each other's W&B run
Why: wandb.init() called without name or id generates a random run ID; if two jobs share the same project and one resumes the wrong run, metrics are interleaved.
Detect: W&B run shows step numbers that are non-monotonic or jump unexpectedly.
Fix: set a deterministic name tied to the experiment config hash; use resume="allow" only when intentionally resuming a specific run.
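
One way to get a deterministic name and id, sketched by hashing the config dict (field names are illustrative):

import hashlib
import json
import wandb

config = {"model": "mistral-7b", "learning_rate": 1e-4, "lora_rank": 16}
run_id = "lora-" + hashlib.sha256(
    json.dumps(config, sort_keys=True).encode()
).hexdigest()[:8]

wandb.init(
    project="my-llm-finetune",
    name=run_id,
    id=run_id,            # fixed id so a resume targets this exact run
    resume="allow",       # only resumes if the id already exists
    config=config,
)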

Hyperparameter sweep logs incorrect eval/loss because evaluation runs less frequently
Why: evaluation_strategy="epoch" logs eval loss once per epoch, but logging_steps=10 logs train loss every 10 steps; the sweep's metric eval/loss has fewer data points and a slower update rate.
Detect: sweep convergence looks artificially slow; eval loss curve is flat between epochs.
Fix: use evaluation_strategy="steps" with eval_steps matching logging_steps for sweep runs; ensures the sweep metric is updated regularly.
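
In Trainer terms, that fix looks roughly like this (evaluation_strategy is the argument name used above; recent transformers releases rename it eval_strategy):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./output",
    report_to="wandb",
    evaluation_strategy="steps",
    eval_steps=10,        # evaluate on the same cadence as train logging
    logging_steps=10,
)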

MLflow model registry promotion silently fails under concurrent promotion requests
Why: two CI jobs promoting different versions to "Production" at the same time may leave the registry in an inconsistent state.
Detect: get_latest_versions(stage="Production") returns two models; or the "Production" model is the wrong version.
Fix: serialize promotion through a deployment job that checks current production before promoting; add a mutex or use compare-and-swap via the MLflow REST API.

Connections

Open Questions

  • What operational burden does this introduce that the documentation does not cover?
  • At what scale does this infrastructure pattern stop being cost-effective?