Datasets

The HuggingFace datasets library is the standard way to load, stream, and push training data. Key datasets for AI engineering: instruction-following (Alpaca, OpenHermes), preference pairs (Anthropic HH-RLHF), code (The Stack, CodeContests), and synthetic data generated by stronger models.

Training data for LLMs and AI systems. The HuggingFace Hub is the central registry, hosting over 150,000 public datasets, and the datasets library provides a unified API to load, stream, filter, and push them.


HuggingFace datasets Library

from datasets import load_dataset

# Load from Hub — downloads and caches locally
ds = load_dataset("tatsu-lab/alpaca")
# DatasetDict({'train': Dataset({features: ['instruction', 'input', 'output', 'text'], num_rows: 52002})})

# Stream large datasets without downloading everything
ds = load_dataset("bigcode/the-stack", data_files="data/python/*.parquet", streaming=True)
for example in ds["train"]:
    print(example["content"][:200])
    break

# Filter, map, select (applied to the Alpaca dataset reloaded here, since ds now points at The Stack stream)
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any tokenizer works for this example
alpaca = load_dataset("tatsu-lab/alpaca")
clean = alpaca["train"].filter(lambda x: len(x["text"]) > 100)
tokenised = clean.map(lambda x: tokenizer(x["text"], truncation=True, max_length=512), batched=True)

# Push your own dataset to the Hub
from datasets import Dataset
my_data = Dataset.from_list([{"prompt": "...", "response": "..."} for _ in range(1000)])
my_data.push_to_hub("your-username/my-dataset", private=True)

Instruction-Following Datasets

Used for SFT (supervised fine-tuning) to teach the model to follow instructions.

Dataset | Size | Notes
Alpaca | 52K | GPT-3.5 generated from 175 seed tasks. First major open instruction set. Quality is uneven.
OpenHermes 2.5 | 1M | High-quality synthetic from GPT-4. Widely used for open-source SFT.
Dolly 15K | 15K | Human-written by Databricks employees. No GPT data — fully open licence.
FLAN collection | 15M+ | Google's instruction-tuning collection across hundreds of NLP tasks.
Orca 2 | 800K | Microsoft — reasoning traces from GPT-4, designed for smaller models.
# OpenHermes — most popular SFT dataset for open models
ds = load_dataset("teknium/OpenHermes-2.5")
# Format: {"conversations": [{"from": "human", "value": "..."}, {"from": "gpt", "value": "..."}]}

Preference Datasets (RLHF / DPO)

(prompt, chosen, rejected) triples for preference optimisation.

Dataset | Size | Notes
Anthropic HH-RLHF | 170K | Human preference labels for helpfulness and harmlessness. Gold standard.
OpenAI WebGPT | 20K | Human preferences on web-assisted answers.
UltraFeedback | 64K prompts | AI feedback from GPT-4 across 4 models. Used to train Zephyr.
Nectar | 183K | 7-way rankings from GPT-4. Good for reward model training.
# Anthropic HH-RLHF
ds = load_dataset("Anthropic/hh-rlhf")
# {"chosen": "Human: ... Assistant: ...", "rejected": "Human: ... Assistant: ..."}

# Format for the TRL DPO trainer: split the shared prompt from the two responses
def format_for_dpo(example):
    # HH-RLHF stores full transcripts; the prompt is everything up to the final "Assistant:" turn
    prompt, sep, chosen_response = example["chosen"].rpartition("\n\nAssistant:")
    _, _, rejected_response = example["rejected"].rpartition("\n\nAssistant:")
    return {
        "prompt": prompt + sep,
        "chosen": chosen_response,
        "rejected": rejected_response,
    }
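
Once mapped, the triples feed straight into TRL's DPOTrainer. A minimal sketch, assuming a recent TRL release (older versions take tokenizer= instead of processing_class=); the small base model here is an arbitrary illustrative choice:

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

raw = load_dataset("Anthropic/hh-rlhf", split="train")
dpo_ds = raw.map(format_for_dpo, remove_columns=raw.column_names)

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="dpo-hh", beta=0.1, per_device_train_batch_size=2),
    train_dataset=dpo_ds,
    processing_class=tokenizer,
)
trainer.train()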

Code Datasets

Dataset | Size | Notes
The Stack v2 | 3TB+ | 619 programming languages, deduplicated. HuggingFace + BigCode.
CodeContests | 13K | Competitive programming problems + solutions. DeepMind.
MBPP | 374 | Mostly Basic Python Programming — eval + training split.
HumanEval | 164 | OpenAI's function completion benchmark.
SWE-bench | 2.3K | Real GitHub issues + test suites. Among the hardest coding benchmarks.
# Stream The Stack for Python only
ds = load_dataset(
    "bigcode/the-stack-v2",
    data_dir="data/python",
    streaming=True,
    split="train",
)
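
Streaming datasets are lazy iterators, so shuffling and subsetting happen on the fly. A small sketch of the usual pattern (buffer size and sample count are arbitrary here):

# Take a bounded, shuffled sample without materialising the full dataset
sample = ds.shuffle(seed=42, buffer_size=10_000).take(5_000)
count = sum(1 for _ in sample)  # iterating pulls only what is needed to fill the buffer and yield 5,000 rows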

Synthetic Data Generation

Generating training data with a stronger model is now the dominant approach for instruction and preference data.

import json

import anthropic
from datasets import Dataset

client = anthropic.Anthropic()

def generate_instruction_pair(topic: str) -> dict:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        system="Generate a realistic instruction-response pair about the given topic. Return JSON with 'instruction' and 'response' keys.",
        messages=[{"role": "user", "content": f"Topic: {topic}"}]
    )
    return json.loads(response.content[0].text)

# Generate ~1,000 pairs and push to the Hub
topics = ["Python async programming", "RAG pipelines", "LLM evaluation"] * 333
pairs = [generate_instruction_pair(t) for t in topics]
Dataset.from_list(pairs).push_to_hub("your-username/synthetic-instructions")

Quality filtering — synthetic data needs cleaning:

# Deduplication with MinHash + LSH: keep the first example from each near-duplicate group
from datasketch import MinHash, MinHashLSH

def minhash(text: str, num_perm=128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for word in text.lower().split():
        m.update(word.encode("utf8"))
    return m

lsh = MinHashLSH(threshold=0.8, num_perm=128)
deduped_rows = []
for i, example in enumerate(ds):
    mh = minhash(example["response"])
    if not lsh.query(mh):
        lsh.insert(str(i), mh)
        deduped_rows.append(example)
ds = Dataset.from_list(deduped_rows)

# Length filtering
filtered = ds.filter(lambda x: 50 < len(x["response"].split()) < 500)

# Quality scoring with a reward model (reward_model stands in for whatever scoring model you have loaded)
filtered = filtered.map(lambda x: {"quality": reward_model.score(x["instruction"], x["response"])})
filtered = filtered.filter(lambda x: x["quality"] > 0.7)

Data Quality Over Quantity

The LIMA paper (2023) showed 1,000 high-quality examples can outperform 50,000 low-quality ones for instruction tuning. Quality signals:

  • Diversity — broad coverage of tasks and domains
  • Correctness — responses are accurate and complete
  • Format consistency — responses follow a consistent style
  • Difficulty balance — mix of easy and challenging examples
# Check dataset statistics before training
import numpy as np

lengths = [len(x["response"].split()) for x in ds["train"]]
print(f"Mean length: {np.mean(lengths):.0f}")
print(f"Median length: {np.median(lengths):.0f}")
print(f"P95 length: {np.percentile(lengths, 95):.0f}")

# Check for duplicates
responses = [x["response"] for x in ds["train"]]
unique_ratio = len(set(responses)) / len(responses)
print(f"Unique ratio: {unique_ratio:.2%}")

Key Facts

  • datasets library caches downloads at ~/.cache/huggingface/datasets — use streaming=True for >10GB datasets
  • OpenHermes 2.5 is the go-to SFT dataset for open models; Anthropic HH-RLHF for preference training
  • LIMA (2023): 1K curated examples beat 50K noisy ones — quality over quantity
  • The Stack v2 (3TB+) is the largest open code dataset; use data_dir to select a single language
  • push_to_hub requires huggingface-cli login or HF_TOKEN environment variable
  • Synthetic data at scale: GPT-4 or Claude Sonnet as the generator, reward model or LLM judge for filtering (a judge sketch follows below)
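
A minimal LLM-judge sketch for that last point; the rubric, threshold, and model choice are illustrative, and the dataset is assumed to have instruction/response columns:

import anthropic

client = anthropic.Anthropic()

def judge(instruction: str, response: str) -> int:
    # Ask the judge for a single 1-5 score; the reply format is constrained by the system prompt
    msg = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=8,
        system="Score the response to the instruction from 1 (poor) to 5 (excellent) for correctness and completeness. Reply with the number only.",
        messages=[{"role": "user", "content": f"Instruction: {instruction}\n\nResponse: {response}"}],
    )
    return int(msg.content[0].text.strip())

judged = ds.filter(lambda x: judge(x["instruction"], x["response"]) >= 4)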

Common Failure Cases

load_dataset with streaming=True silently returns only the first shard of a multi-shard dataset when data_files is a glob pattern
Why: in streaming mode, some dataset loaders resolve glob patterns lazily and may only resolve the first matching shard if the pattern is not fully expanded before iteration; the result is an iterator that yields only a fraction of the expected data with no error.
Detect: iterating the stream yields far fewer examples than the dataset documentation claims; next(iter(ds)) works but the stream ends after the first shard.
Fix: list shard files explicitly in data_files rather than using a glob; or use non-streaming mode with load_dataset(..., split="train") and verify len(ds) matches the expected count.
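
One way to expand the glob up front, assuming huggingface_hub's HfFileSystem (the repo and pattern mirror the streaming example above):

from datasets import load_dataset
from huggingface_hub import HfFileSystem

prefix = "datasets/bigcode/the-stack/"
# Resolve every shard path explicitly before handing the list to load_dataset
shards = [p[len(prefix):] for p in HfFileSystem().glob(prefix + "data/python/*.parquet")]
assert len(shards) > 1, "expected multiple shards"
ds = load_dataset("bigcode/the-stack", data_files=shards, streaming=True, split="train")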

push_to_hub uploads a dataset that looks correct locally but has all string columns truncated to 512 characters because the features schema was auto-inferred
Why: when features is not explicitly specified, push_to_hub auto-infers column types from the first batch; if the first batch has responses under 512 characters, the schema infers Value("string") with a length hint, silently truncating longer values during serialisation.
Detect: downloading the dataset from the Hub shows response values truncated mid-sentence; comparing the Hub version to the local version reveals the truncation.
Fix: explicitly define Features({"instruction": Value("string"), "response": Value("string")}) when creating the dataset; Value("string") in HuggingFace datasets has no length limit when schema is explicit.
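
A sketch of the explicit-schema fix, reusing the push_to_hub example from earlier:

from datasets import Dataset, Features, Value

features = Features({"instruction": Value("string"), "response": Value("string")})
rows = [{"instruction": "...", "response": "..."} for _ in range(1000)]
ds = Dataset.from_list(rows, features=features)  # schema is declared, nothing is inferred
ds.push_to_hub("your-username/my-dataset", private=True)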

MinHash deduplication removes semantically distinct examples that share high word overlap because the hash is computed on unigrams only
Why: MinHash with unigrams is sensitive to shared boilerplate vocabulary (e.g., "The capital of X is Y" across many geography questions); two semantically different instructions with high common word overlap are incorrectly classified as duplicates and one is removed.
Detect: after deduplication, the dataset loses coverage of specific domains (e.g., all geography questions removed); inspecting removed pairs shows they are semantically distinct despite high ROUGE overlap.
Fix: use 2-gram or 3-gram MinHash shingles to capture phrase-level similarity; or use a combination of MinHash for near-duplicates and an embedding similarity threshold to catch only true duplicates.
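
A sketch of the 3-gram shingle variant (the shingle width is a tunable choice):

from datasketch import MinHash

def minhash_ngrams(text: str, n: int = 3, num_perm: int = 128) -> MinHash:
    # Hash word n-grams so matches require shared phrases, not just shared vocabulary
    words = text.lower().split()
    m = MinHash(num_perm=num_perm)
    for i in range(max(len(words) - n + 1, 1)):
        m.update(" ".join(words[i : i + n]).encode("utf8"))
    return m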

Connections

Open Questions

  • What data quality issues does this approach fail to detect?
  • When does this pipeline design become a bottleneck at production scale?