Datasets

The HuggingFace datasets library is the standard way to load, stream, and push training data. Key datasets for AI engineering: instruction-following (Alpaca, OpenHermes), preference pairs (Anthropic HH-RLHF), code (The Stack, CodeContests), and synthetic data generated by stronger models.

Training data for LLMs and AI systems. The HuggingFace Hub is the central registry, hosting over 150,000 public datasets, and the datasets library provides a unified API to load, stream, filter, and push them.


HuggingFace datasets Library

from datasets import load_dataset

# Load from Hub — downloads and caches locally
ds = load_dataset("tatsu-lab/alpaca")
# DatasetDict({'train': Dataset({features: ['instruction', 'input', 'output', 'text'], num_rows: 52002})})

# Stream large datasets without downloading everything
ds = load_dataset("bigcode/the-stack", data_files="data/python/*.parquet", streaming=True)
for example in ds["train"]:
    print(example["content"][:200])
    break

# Filter, map, select (applied to the Alpaca dataset reloaded here, since ds now points at The Stack stream)
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any tokenizer works for this example
alpaca = load_dataset("tatsu-lab/alpaca")
clean = alpaca["train"].filter(lambda x: len(x["text"]) > 100)
tokenised = clean.map(lambda x: tokenizer(x["text"], truncation=True, max_length=512), batched=True)

# Push your own dataset to the Hub
from datasets import Dataset
my_data = Dataset.from_list([{"prompt": "...", "response": "..."} for _ in range(1000)])
my_data.push_to_hub("your-username/my-dataset", private=True)

Instruction-Following Datasets

Used for SFT (supervised fine-tuning) to teach the model to follow instructions.

Dataset | Size | Notes
Alpaca | 52K | GPT-3.5 generated from 175 seed tasks. First major open instruction set. Quality is uneven.
OpenHermes 2.5 | 1M | High-quality synthetic from GPT-4. Widely used for open-source SFT.
Dolly 15K | 15K | Human-written by Databricks employees. No GPT data — fully open licence.
FLAN collection | 15M+ | Google's instruction-tuning collection across hundreds of NLP tasks.
Orca 2 | 800K | Microsoft — reasoning traces from GPT-4, designed for smaller models.
# OpenHermes — most popular SFT dataset for open models
ds = load_dataset("teknium/OpenHermes-2.5")
# Format: {"conversations": [{"from": "human", "value": "..."}, {"from": "gpt", "value": "..."}]}

Preference Datasets (RLHF / DPO)

(prompt, chosen, rejected) triples for preference optimisation.

Dataset | Size | Notes
Anthropic HH-RLHF | 170K | Human preference labels for helpfulness and harmlessness. Gold standard.
OpenAI WebGPT | 20K | Human preferences on web-assisted answers.
UltraFeedback | 64K prompts | AI feedback from GPT-4 across 4 models. Used to train Zephyr.
Nectar | 183K | 7-way rankings from GPT-4. Good for reward model training.
# Anthropic HH-RLHF
ds = load_dataset("Anthropic/hh-rlhf")
# {"chosen": "Human: ... Assistant: ...", "rejected": "Human: ... Assistant: ..."}

# Format for the TRL DPO trainer: split the shared prompt from the two responses
def format_for_dpo(example):
    # HH-RLHF stores full transcripts; the prompt is everything up to the final "Assistant:" turn
    prompt, sep, chosen_response = example["chosen"].rpartition("\n\nAssistant:")
    _, _, rejected_response = example["rejected"].rpartition("\n\nAssistant:")
    return {
        "prompt": prompt + sep,
        "chosen": chosen_response,
        "rejected": rejected_response,
    }
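
Once mapped, the triples feed straight into TRL's DPOTrainer. A minimal sketch, assuming a recent TRL release (older versions take tokenizer= instead of processing_class=); the small base model here is an arbitrary illustrative choice:

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

raw = load_dataset("Anthropic/hh-rlhf", split="train")
dpo_ds = raw.map(format_for_dpo, remove_columns=raw.column_names)

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="dpo-hh", beta=0.1, per_device_train_batch_size=2),
    train_dataset=dpo_ds,
    processing_class=tokenizer,
)
trainer.train()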

Code Datasets

Dataset | Size | Notes
The Stack v2 | 3TB+ | 619 programming languages, deduplicated. HuggingFace + BigCode.
CodeContests | 13K | Competitive programming problems + solutions. DeepMind.
MBPP | 374 | Mostly Basic Python Programming — eval + training split.
HumanEval | 164 | OpenAI's function completion benchmark.
SWE-bench | 2.3K | Real GitHub issues + test suites. Among the hardest coding benchmarks.
# Stream The Stack for Python only
ds = load_dataset(
    "bigcode/the-stack-v2",
    data_dir="data/python",
    streaming=True,
    split="train",
)
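
Streaming datasets are lazy iterators, so shuffling and subsetting happen on the fly. A small sketch of the usual pattern (buffer size and sample count are arbitrary here):

# Take a bounded, shuffled sample without materialising the full dataset
sample = ds.shuffle(seed=42, buffer_size=10_000).take(5_000)
count = sum(1 for _ in sample)  # iterating pulls only what is needed to fill the buffer and yield 5,000 rows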

Synthetic Data Generation

Generating training data with a stronger model is now the dominant approach for instruction and preference data.

import json

import anthropic
from datasets import Dataset

client = anthropic.Anthropic()

def generate_instruction_pair(topic: str) -> dict:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        system="Generate a realistic instruction-response pair about the given topic. Return JSON with 'instruction' and 'response' keys.",
        messages=[{"role": "user", "content": f"Topic: {topic}"}]
    )
    return json.loads(response.content[0].text)

# Generate ~1,000 pairs and push to the Hub
topics = ["Python async programming", "RAG pipelines", "LLM evaluation"] * 333
pairs = [generate_instruction_pair(t) for t in topics]
Dataset.from_list(pairs).push_to_hub("your-username/synthetic-instructions")

Quality filtering — synthetic data needs cleaning:

# Deduplication with MinHash + LSH: keep the first example from each near-duplicate group
from datasketch import MinHash, MinHashLSH

def minhash(text: str, num_perm=128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for word in text.lower().split():
        m.update(word.encode("utf8"))
    return m

lsh = MinHashLSH(threshold=0.8, num_perm=128)
deduped_rows = []
for i, example in enumerate(ds):
    mh = minhash(example["response"])
    if not lsh.query(mh):
        lsh.insert(str(i), mh)
        deduped_rows.append(example)
ds = Dataset.from_list(deduped_rows)

# Length filtering
filtered = ds.filter(lambda x: 50 < len(x["response"].split()) < 500)

# Quality scoring with a reward model (reward_model stands in for whatever scoring model you have loaded)
filtered = filtered.map(lambda x: {"quality": reward_model.score(x["instruction"], x["response"])})
filtered = filtered.filter(lambda x: x["quality"] > 0.7)

Data Quality Over Quantity

The LIMA paper (2023) showed 1,000 high-quality examples can outperform 50,000 low-quality ones for instruction tuning. Quality signals:

  • Diversity — broad coverage of tasks and domains
  • Correctness — responses are accurate and complete
  • Format consistency — responses follow a consistent style
  • Difficulty balance — mix of easy and challenging examples
# Check dataset statistics before training
import numpy as np

lengths = [len(x["response"].split()) for x in ds["train"]]
print(f"Mean length: {np.mean(lengths):.0f}")
print(f"Median length: {np.median(lengths):.0f}")
print(f"P95 length: {np.percentile(lengths, 95):.0f}")

# Check for duplicates
responses = [x["response"] for x in ds["train"]]
unique_ratio = len(set(responses)) / len(responses)
print(f"Unique ratio: {unique_ratio:.2%}")

Key Facts

  • datasets library caches downloads at ~/.cache/huggingface/datasets — use streaming=True for >10GB datasets
  • OpenHermes 2.5 is the go-to SFT dataset for open models; Anthropic HH-RLHF for preference training
  • LIMA (2023): 1K curated examples beat 50K noisy ones — quality over quantity
  • The Stack v2 (3TB+) is the largest open code dataset; use data_dir to select a single language
  • push_to_hub requires huggingface-cli login or HF_TOKEN environment variable
  • Synthetic data at scale: GPT-4 or Claude Sonnet as the generator, reward model or LLM judge for filtering (a judge sketch follows below)
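
A minimal LLM-judge sketch for that last point; the rubric, threshold, and model choice are illustrative, and the dataset is assumed to have instruction/response columns:

import anthropic

client = anthropic.Anthropic()

def judge(instruction: str, response: str) -> int:
    # Ask the judge for a single 1-5 score; the reply format is constrained by the system prompt
    msg = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=8,
        system="Score the response to the instruction from 1 (poor) to 5 (excellent) for correctness and completeness. Reply with the number only.",
        messages=[{"role": "user", "content": f"Instruction: {instruction}\n\nResponse: {response}"}],
    )
    return int(msg.content[0].text.strip())

judged = ds.filter(lambda x: judge(x["instruction"], x["response"]) >= 4)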

Common Failure Cases

load_dataset with streaming=True silently returns only the first shard of a multi-shard dataset when data_files is a glob pattern
Why: in streaming mode, some dataset loaders resolve glob patterns lazily and may only resolve the first matching shard if the pattern is not fully expanded before iteration; the result is an iterator that yields only a fraction of the expected data with no error.
Detect: iterating the stream yields far fewer examples than the dataset documentation claims; next(iter(ds)) works but the stream ends after the first shard.
Fix: list shard files explicitly in data_files rather than using a glob; or use non-streaming mode with load_dataset(..., split="train") and verify len(ds) matches the expected count.
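
One way to expand the glob up front, assuming huggingface_hub's HfFileSystem (the repo and pattern mirror the streaming example above):

from datasets import load_dataset
from huggingface_hub import HfFileSystem

prefix = "datasets/bigcode/the-stack/"
# Resolve every shard path explicitly before handing the list to load_dataset
shards = [p[len(prefix):] for p in HfFileSystem().glob(prefix + "data/python/*.parquet")]
assert len(shards) > 1, "expected multiple shards"
ds = load_dataset("bigcode/the-stack", data_files=shards, streaming=True, split="train")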

push_to_hub uploads a dataset that looks correct locally but has all string columns truncated to 512 characters because the features schema was auto-inferred
Why: when features is not explicitly specified, push_to_hub auto-infers column types from the first batch; if the first batch has responses under 512 characters, the schema infers Value("string") with a length hint, silently truncating longer values during serialisation.
Detect: downloading the dataset from the Hub shows response values truncated mid-sentence; comparing the Hub version to the local version reveals the truncation.
Fix: explicitly define Features({"instruction": Value("string"), "response": Value("string")}) when creating the dataset; Value("string") in HuggingFace datasets has no length limit when schema is explicit.
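
A sketch of the explicit-schema fix, reusing the push_to_hub example from earlier:

from datasets import Dataset, Features, Value

features = Features({"instruction": Value("string"), "response": Value("string")})
rows = [{"instruction": "...", "response": "..."} for _ in range(1000)]
ds = Dataset.from_list(rows, features=features)  # schema is declared, nothing is inferred
ds.push_to_hub("your-username/my-dataset", private=True)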

MinHash deduplication removes semantically distinct examples that share high word overlap because the hash is computed on unigrams only
Why: MinHash with unigrams is sensitive to shared boilerplate vocabulary (e.g., "The capital of X is Y" across many geography questions); two semantically different instructions with high common word overlap are incorrectly classified as duplicates and one is removed.
Detect: after deduplication, the dataset loses coverage of specific domains (e.g., all geography questions removed); inspecting removed pairs shows they are semantically distinct despite high ROUGE overlap.
Fix: use 2-gram or 3-gram MinHash shingles to capture phrase-level similarity; or use a combination of MinHash for near-duplicates and an embedding similarity threshold to catch only true duplicates.
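
A sketch of the 3-gram shingle variant (the shingle width is a tunable choice):

from datasketch import MinHash

def minhash_ngrams(text: str, n: int = 3, num_perm: int = 128) -> MinHash:
    # Hash word n-grams so matches require shared phrases, not just shared vocabulary
    words = text.lower().split()
    m = MinHash(num_perm=num_perm)
    for i in range(max(len(words) - n + 1, 1)):
        m.update(" ".join(words[i : i + n]).encode("utf8"))
    return m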

Connections

Open Questions

  • What data quality issues does this approach fail to detect?
  • When does this pipeline design become a bottleneck at production scale?