Annotation Tooling
Label Studio (general-purpose, with strong pairwise-comparison templates for RLHF) and Argilla (purpose-built for LLM preference data) are the two open-source defaults for building RLHF and fine-tuning datasets. Human annotation typically costs 5-10x more per sample than the compute needed to generate an equivalent synthetic one, which is why synthetic data is so attractive.
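A minimal sketch of setting up a pairwise preference task, assuming Argilla's 2.x Python SDK (the server URL, API key, dataset name, and example records are placeholders):

```python
import argilla as rg

# Connect to a running Argilla server (URL and key are placeholders).
client = rg.Argilla(api_url="http://localhost:6900", api_key="argilla.apikey")

# Define the pairwise task: one prompt, two candidate responses, one choice.
settings = rg.Settings(
    fields=[
        rg.TextField(name="prompt"),
        rg.TextField(name="response_a"),
        rg.TextField(name="response_b"),
    ],
    questions=[
        rg.LabelQuestion(name="preference", labels=["A", "B", "tie"]),
    ],
)

dataset = rg.Dataset(name="rlhf-pairwise-demo", settings=settings)
dataset.create()

# Queue records for annotators; their answers become chosen/rejected pairs.
dataset.records.log([
    {
        "prompt": "Explain RLHF in one paragraph.",
        "response_a": "RLHF fine-tunes a model against a learned reward model...",
        "response_b": "RLHF is when you train with rewards...",
    },
])
```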
Data Pipelines for AI
Data pipelines for AI (dbt, Airflow, Prefect, DVC) differ from traditional ETL because data quality bugs silently degrade model quality, making validation checkpoints and eval-as-a-pipeline-stage mandatory.
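As an illustration of a validation checkpoint and eval-as-a-stage, here is a minimal sketch using Prefect's `@task`/`@flow` API; the record schema, the 1% failure threshold, and the helper names are hypothetical:

```python
from prefect import flow, task

@task
def extract() -> list[dict]:
    # Placeholder for pulling raw records from a source system.
    return [{"text": "example", "label": "positive"}]

@task
def validate(records: list[dict]) -> list[dict]:
    # Validation checkpoint: fail loudly instead of letting bad data
    # silently degrade the downstream model.
    bad = [r for r in records if not r.get("text") or r.get("label") is None]
    if len(bad) > 0.01 * max(len(records), 1):  # hypothetical 1% threshold
        raise ValueError(f"{len(bad)} invalid records; aborting pipeline")
    return records

@task
def train_and_eval(records: list[dict]) -> None:
    # Placeholder: fine-tune, then run the eval suite as its own stage.
    print(f"training on {len(records)} validated records")

@flow
def training_data_pipeline() -> None:
    records = extract()
    clean = validate(records)
    train_and_eval(clean)

if __name__ == "__main__":
    training_data_pipeline()
```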
Datasets
The HuggingFace datasets library is the standard way to load, stream, and push training data. Key datasets for AI engineering: instruction-following (Alpaca, OpenHermes), preference pairs (Anthropic HH-RLHF), code (The Stack, CodeContests), and synthetic data generated by stronger models.
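A minimal sketch of the three core operations (load, stream, push); the dataset IDs come from the entry above, the target repo is a placeholder, and The Stack is gated, so Hub access must be granted first:

```python
from datasets import load_dataset

# Load an instruction-following dataset into memory.
alpaca = load_dataset("tatsu-lab/alpaca", split="train")

# Stream a large code corpus without downloading all of it
# (gated dataset: requires accepting its terms on the Hub).
stack = load_dataset("bigcode/the-stack", split="train", streaming=True)
first_file = next(iter(stack))

# Filter, then push a derived dataset to the Hub (placeholder repo ID).
short = alpaca.filter(lambda ex: len(ex["output"]) < 512)
short.push_to_hub("your-username/alpaca-short")
```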
distilabel
distilabel is Argilla's framework for synthetic data pipelines — generating preference pairs, instruction datasets, and AI feedback at scale. Pipelines are Python code; steps are composable. Powers several SOTA open-source fine-tuning datasets.
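A minimal sketch of a composable two-step pipeline, assuming distilabel 1.x (class names follow its docs; the model choice and seed instruction are placeholders):

```python
from distilabel.llms import OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import TextGeneration

with Pipeline(name="synthetic-instructions") as pipeline:
    # Seed instructions to expand into responses (placeholder data).
    load = LoadDataFromDicts(
        data=[{"instruction": "Summarize what a feature store does."}]
    )
    # Generate one response per instruction with a teacher model.
    generate = TextGeneration(llm=OpenAILLM(model="gpt-4o-mini"))
    # Steps compose via the >> operator.
    load >> generate

if __name__ == "__main__":
    distiset = pipeline.run()  # returns a Distiset of generated rows
```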
Feature Stores
A feature store is a central repository for ML features that eliminates training-serving skew by guaranteeing the same feature computation is used during both model training and online inference.
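A minimal sketch of that guarantee, using Feast as a concrete example (Feast is not named above; the feature view, entity, and repo path are placeholders): the same feature list is read point-in-time-correctly for training and from the online store for serving.

```python
import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")  # placeholder feature repo

FEATURES = ["user_stats:avg_session_minutes", "user_stats:purchases_30d"]

# Training: point-in-time-correct historical values for labeled entities.
entity_df = pd.DataFrame({
    "user_id": [1001, 1002],
    "event_timestamp": pd.to_datetime(["2024-01-01", "2024-01-02"]),
})
training_df = store.get_historical_features(
    entity_df=entity_df, features=FEATURES
).to_df()

# Serving: the same feature definitions, read from the online store.
online_features = store.get_online_features(
    features=FEATURES, entity_rows=[{"user_id": 1001}]
).to_dict()
```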
Model Cards
A model card is a standardized document published alongside an ML model that records intended use, performance across subgroups, limitations, and ethical considerations. The practice originates with Mitchell et al. 2018, is the expected norm for HuggingFace Hub uploads, and maps onto the EU AI Act's documentation requirements.
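A minimal sketch of authoring a card programmatically with huggingface_hub's `ModelCard`/`ModelCardData` (the metadata values, section text, and repo ID are placeholders):

```python
from huggingface_hub import ModelCard, ModelCardData

# YAML front matter (machine-readable metadata) plus markdown sections.
card_data = ModelCardData(language="en", license="apache-2.0")
card = ModelCard(f"""---
{card_data.to_yaml()}
---

# my-model

## Intended use
In-scope and out-of-scope uses.

## Evaluation
Performance across relevant subgroups, with known limitations.
""")

card.save("README.md")  # or: card.push_to_hub("your-username/my-model")
```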
RLHF Datasets and Preference Data
RLHF/DPO training requires chosen/rejected preference pairs; the quality of the preference dataset directly determines alignment quality, and a noisy or inconsistent dataset can be worse than none at all.
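A minimal sketch of the expected data shape, assuming the `prompt`/`chosen`/`rejected` column layout that TRL's DPOTrainer consumes (the example pair is illustrative):

```python
from datasets import Dataset

# Each record: one prompt, the preferred response, and the rejected one.
pairs = [
    {
        "prompt": "Explain gradient checkpointing in one sentence.",
        "chosen": (
            "Gradient checkpointing trades compute for memory by "
            "recomputing activations during the backward pass."
        ),
        "rejected": "It checkpoints gradients so training runs faster.",
    },
]
preference_ds = Dataset.from_list(pairs)
```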
Synthetic Data Generation
LLM-generated synthetic data enables thousands of domain-specific training examples per hour for pennies, but it requires quality filtering to discard the 10-30% of generations that are low quality, and it must guard against model collapse when models are trained on successive generations of their own output.
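A minimal sketch of the generate-then-filter loop; `generate_example` is a hypothetical stand-in for any LLM client, and the length bounds and exact-match dedup are deliberately simple placeholder heuristics:

```python
import hashlib

def generate_example(topic: str, i: int) -> str:
    # Hypothetical stand-in for an LLM API call; swap in a real client.
    return f"Synthetic example {i} about {topic}: " + "details " * 20

def quality_filter(texts: list[str]) -> list[str]:
    """Drop the low-quality tail: too short, too long, or duplicated."""
    seen: set[str] = set()
    kept: list[str] = []
    for t in texts:
        if not 50 <= len(t) <= 4000:  # hypothetical length bounds
            continue
        digest = hashlib.sha256(t.strip().lower().encode()).hexdigest()
        if digest in seen:  # exact dedup; near-dedup would be stronger
            continue
        seen.add(digest)
        kept.append(t)
    return kept

raw = [generate_example("feature stores", i) for i in range(100)]
clean = quality_filter(raw)  # expect a nontrivial fraction to be dropped
```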