Annotation Tooling

Label Studio (general-purpose, strong RLHF pairwise templates) and Argilla (purpose-built for LLM preference data) are the two open-source defaults for building RLHF and fine-tuning datasets. RLHF annotation costs 5-10x more per sample than compute — this is why synthetic data is so attractive.


annotation label-studio argilla rlhf human-feedback preference-data fine-tuning data-collection

Human annotation is the bottleneck of alignment. Models trained with RLHF need pairwise preference data (chosen/rejected response pairs), and models trained with SFT need demonstration data (human-written ideal responses). Both require tooling to present tasks to annotators and collect structured output.

Cost Reality

RLHF annotation costs 5-10x more per sample than compute. In one cited example, 600 high-quality RLHF annotations cost ~$60,000, roughly 167x the compute expense for the same training run. [Source: taskmonk.ai, 2026] [unverified]
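
The quoted figures can be sanity-checked with simple arithmetic. This sketch takes the (unverified) numbers at face value; the per-annotation cost and the implied compute cost are derived here, not reported by the source.

```python
# Figures quoted in the text (unverified); everything else is derived from them.
total_annotation_cost = 60_000  # USD for the annotation batch
num_annotations = 600
cost_ratio = 167                # annotation cost / compute cost, as quoted

cost_per_annotation = total_annotation_cost / num_annotations
implied_compute_cost = total_annotation_cost / cost_ratio

print(f"Cost per annotation: ${cost_per_annotation:,.2f}")
print(f"Implied compute cost for the run: ${implied_compute_cost:,.2f}")
```

Note that the 167x figure is a whole-run ratio while the 5-10x figure is stated per sample, so the two are not directly comparable.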

This is the primary reason synthetic data generation (see data/distilabel, data/synthetic-data) is so attractive: it replaces expensive human annotation with model-generated preference pairs, at the cost of some alignment quality.

The Three Stages of LLM Training Data

Each stage of RLHF requires different annotation work:

Stage	Task for Annotators	Output
SFT (Supervised Fine-Tuning)	Write ideal responses to prompts	Instruction-following pairs
Reward Model Training	Rate/rank pairs of model responses	Chosen/rejected preference pairs
RL Prompt Collection	Curate diverse prompts for RL training	Prompt set
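
As a rough illustration of the three outputs in the table, the record shapes might look like the sketch below. The `chosen`/`rejected` names follow the DPO convention used later on this page; the class names and the `response` field are assumptions made for this example.

```python
from typing import TypedDict


class SFTExample(TypedDict):
    """Stage 1: a prompt with a human-written ideal response."""
    prompt: str
    response: str


class PreferencePair(TypedDict):
    """Stage 2: two model responses to one prompt, ranked by an annotator."""
    prompt: str
    chosen: str
    rejected: str


class RLPrompt(TypedDict):
    """Stage 3: a curated prompt with no response attached."""
    prompt: str


# Placeholder strings stand in for real annotator or model output.
sft = SFTExample(prompt="Explain gradient descent", response="<human-written answer>")
pair = PreferencePair(
    prompt="Explain gradient descent",
    chosen="<preferred model response>",
    rejected="<other model response>",
)
rl_prompt = RLPrompt(prompt="Explain gradient descent")
```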

Label Studio

General-purpose, open-source annotation platform. Covers image, audio, text, and video labelling. Most relevant for LLM work: pairwise preference collection (human preference for RLHF) and instruction annotation (SFT data).

Setup

pip install label-studio
label-studio start
# UI at http://localhost:8080

Pairwise Preference Template (RLHF)

Label Studio ships a pairwise human preference template. Annotators see two model responses side by side and select the preferred one:

import label_studio_sdk as ls

# Uses the SDK's Client interface (the pre-1.0 API style)
client = ls.Client(url="http://localhost:8080", api_key="YOUR_API_KEY")
project = client.start_project(
    title="RLHF Preference Collection",
    # <Pairwise> is the control tag; the two <Text> objects it compares are its
    # siblings, listed in toName. The first one is "left", the second "right".
    label_config="""
    <View>
      <Header value="Choose the better response:"/>
      <Text name="prompt" value="$prompt"/>
      <Pairwise name="pref" toName="response_a,response_b"/>
      <Text name="response_a" value="$response_a"/>
      <Text name="response_b" value="$response_b"/>
    </View>
    """
)
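
Before importing tasks, it can help to check that every `$variable` the labeling config references is present in each task's `data` dict. The helper below is a minimal sketch using only the standard library; `config_variables` and `missing_keys` are names invented for this example, not part of the Label Studio SDK, and the config shown assumes the `Pairwise` control tag.

```python
import re
import xml.etree.ElementTree as ET


def config_variables(label_config: str) -> set[str]:
    """Return the $variable names referenced by `value` attributes in a config."""
    root = ET.fromstring(label_config)  # raises ParseError on malformed XML
    names = set()
    for element in root.iter():
        match = re.fullmatch(r"\$(\w+)", element.get("value", ""))
        if match:
            names.add(match.group(1))
    return names


def missing_keys(label_config: str, task_data: dict) -> set[str]:
    """Variables the config needs that the task's data dict does not provide."""
    return config_variables(label_config) - set(task_data)


config = """
<View>
  <Header value="Choose the better response:"/>
  <Text name="prompt" value="$prompt"/>
  <Pairwise name="pref" toName="response_a,response_b"/>
  <Text name="response_a" value="$response_a"/>
  <Text name="response_b" value="$response_b"/>
</View>
"""
task_data = {"prompt": "Explain gradient descent", "response_a": "..."}
print(missing_keys(config, task_data))  # {'response_b'}
```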

Import and Export

# Import tasks (prompt + two responses for comparison).
# generate_response_pairs is a placeholder for your own sampling code; it is
# assumed here to yield (prompt, response_1, response_2) tuples.
tasks = [
    {
        "data": {
            "prompt": prompt,
            "response_a": model_response_1,
            "response_b": model_response_2
        }
    }
    for prompt, model_response_1, model_response_2 in generate_response_pairs(prompts)
]
project.import_tasks(tasks)

# Export completed annotations as chosen/rejected pairs for DPO
exported_tasks = project.export_tasks(export_type="JSON")
preference_pairs = []
for t in exported_tasks:
    # Skip tasks with no annotation or an empty result (submitted without a choice)
    if not t["annotations"] or not t["annotations"][0]["result"]:
        continue
    selected = t["annotations"][0]["result"][0]["value"]["selected"]  # "left" or "right"
    a, b = t["data"]["response_a"], t["data"]["response_b"]
    preference_pairs.append({
        "prompt": t["data"]["prompt"],
        "chosen": a if selected == "left" else b,
        "rejected": b if selected == "left" else a,
    })

ML Backend for Pre-annotation

Label Studio supports plugging in a model to suggest labels before humans review:

# label_studio_ml backend: auto-suggests responses for annotators to accept/edit.
# Assumes an SFT-style labeling config with <Text name="prompt"> and
# <TextArea name="response" toName="prompt">, not the pairwise config above.
from label_studio_ml.model import LabelStudioMLBase
import anthropic

class ClaudePreannotator(LabelStudioMLBase):
    def predict(self, tasks, **kwargs):
        client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
        predictions = []
        for task in tasks:
            response = client.messages.create(
                model="claude-haiku-4-5-20251001",
                max_tokens=512,
                messages=[{"role": "user", "content": task["data"]["prompt"]}]
            )
            predictions.append({
                "result": [{
                    "from_name": "response",
                    "to_name": "prompt",
                    "type": "textarea",
                    # TextArea values are lists of strings
                    "value": {"text": [response.content[0].text]},
                }],
                "score": 0.8  # placeholder confidence, not a model-derived score
            })
        return predictions

Argilla

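Open-source data curation platform purpose-built for LLMs. Built by the team behind data/distilabel. The examples below use FeedbackDataset, the LLM annotation dataset type from the Argilla 1.x Python SDK. Argilla 2.x replaces it with rg.Dataset configured through rg.Settings, so pin the SDK version if you follow the examples as written.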

Setup

pip install "argilla<2"  # the FeedbackDataset examples below use the 1.x SDK
# Self-host
docker run -d --name argilla -p 6900:6900 argilla/argilla-quickstart:latest

Or deploy on Hugging Face Spaces (free tier available for small teams).

Creating a Preference Collection Dataset

import argilla as rg

# Connect
rg.init(api_url="http://localhost:6900", api_key="admin.apikey")

# Define the dataset schema
dataset = rg.FeedbackDataset(
    fields=[
        rg.TextField(name="prompt", title="User Prompt"),
        rg.TextField(name="response_a", title="Response A"),
        rg.TextField(name="response_b", title="Response B"),
    ],
    questions=[
        rg.RatingQuestion(
            name="preference",
            title="Which response is better?",
            values=[1, 2],   # 1=A, 2=B
            required=True
        ),
        rg.TextQuestion(
            name="reason",
            title="Why did you prefer this response?",
            required=False
        )
    ],
    guidelines="Rate responses on helpfulness, accuracy, and safety. "
               "Prefer concise, correct, non-harmful answers."
)

# Add records (prompt + two responses).
# response_triples is assumed to be an iterable of (prompt, response_a, response_b) tuples.
records = [
    rg.FeedbackRecord(fields={
        "prompt": prompt,
        "response_a": response_a,
        "response_b": response_b,
    })
    for prompt, response_a, response_b in response_triples
]
dataset.add_records(records)

# Push to Argilla server for annotation
dataset.push_to_argilla(name="rlhf-preferences-v1", workspace="default")

Export to DPO Training Format

# Pull completed annotations
dataset = rg.FeedbackDataset.from_argilla("rlhf-preferences-v1", workspace="default")

# Convert to DPO chosen/rejected format.
# Only the first response per record is used; aggregate the responses first
# if several annotators label each record.
dpo_data = []
for record in dataset.records:
    if not record.responses:
        continue
    values = record.responses[0].values
    if not values or "preference" not in values:
        continue  # discarded or incomplete response
    preference = values["preference"].value  # 1=A, 2=B
    dpo_data.append({
        "prompt": record.fields["prompt"],
        "chosen": record.fields["response_a"] if preference == 1 else record.fields["response_b"],
        "rejected": record.fields["response_b"] if preference == 1 else record.fields["response_a"],
    })

# Push to HuggingFace Hub
from datasets import Dataset
Dataset.from_list(dpo_data).push_to_hub("my-org/rlhf-preferences")
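
When several annotators label the same record, taking only the first response discards information. One simple option is a majority vote that drops ties, sketched below on plain Python values rather than Argilla objects. The function names are invented for this example, and majority voting is an assumption here, not something Argilla prescribes.

```python
from collections import Counter
from typing import Optional


def majority_preference(votes: list[int]) -> Optional[int]:
    """Return the winning label (1 = A, 2 = B), or None on a tie or no votes."""
    ranked = Counter(votes).most_common()
    if not ranked:
        return None
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie: no clear preference
    return ranked[0][0]


def to_dpo(prompt: str, response_a: str, response_b: str,
           votes: list[int]) -> Optional[dict]:
    """Build one chosen/rejected row, or None if annotators did not agree."""
    winner = majority_preference(votes)
    if winner is None:
        return None
    chosen, rejected = (response_a, response_b) if winner == 1 else (response_b, response_a)
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


# Three annotators: two prefer B, one prefers A.
row = to_dpo("Explain gradient descent", "answer A", "answer B", votes=[2, 2, 1])
print(row)
```

Records that return None are candidates for re-annotation or for a guideline review rather than for training.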

Argilla + distilabel Integration

Argilla and distilabel are designed to work together: distilabel generates synthetic preference pairs, Argilla lets humans review and curate them.

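# Newer distilabel releases expose the LLM classes under distilabel.models instead
from distilabel.llms import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromHub
from distilabel.steps.tasks import UltraFeedback

# Generate synthetic preferences → push to Argilla for human review.
# Sketch only: UltraFeedback rates existing generations, so a real pipeline
# normally has a generation step between loading and rating.
with Pipeline(name="synthetic-to-argilla") as pipeline:
    load = LoadDataFromHub(repo_id="HuggingFaceH4/instruction-dataset")
    rate = UltraFeedback(
        llm=InferenceEndpointsLLM(model_id="meta-llama/Meta-Llama-3-8B-Instruct")
    )
    # An Argilla upload step (e.g. distilabel's PreferenceToArgilla) would be chained last
    load >> rate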

Tool Comparison

	Label Studio	Argilla
Best for	General annotation tasks; any modality	LLM preference/feedback data specifically
UI	Task-focused annotation UI	LLM-optimised review UI
RLHF support	Templates; custom config required	First-class FeedbackDataset
distilabel integration	Manual export	Native integration
Self-host	Docker, pip	Docker, HuggingFace Spaces
License	Apache 2.0	Apache 2.0
Commercial	Label Studio Enterprise	HuggingFace managed

Data Quality Signals

Key quality metrics to monitor:

Inter-annotator agreement — Cohen's kappa > 0.6 is acceptable; > 0.8 is good
Annotation consistency — same annotator rates similar items similarly (track per-annotator variance)
Task clarity — vague rubrics produce noisy data; write explicit guidelines before data collection begins
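
For reference, Cohen's kappa compares observed agreement with the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below computes it for two annotators using only the standard library; the label lists are made-up example data. Cohen's kappa is defined for exactly two annotators, so larger pools need pairwise averaging or a multi-rater statistic such as Fleiss' kappa.

```python
from collections import Counter


def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Cohen's kappa for two annotators who labelled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items where both chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up preferences from two annotators over the same ten pairs.
annotator_1 = ["A", "A", "B", "B", "A", "B", "A", "A", "B", "B"]
annotator_2 = ["A", "A", "B", "A", "A", "B", "A", "B", "B", "B"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")  # 8/10 observed vs 0.5 by chance gives about 0.60
```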

Annotation guidelines should specify: what counts as "better" for your task (helpfulness? safety? factual accuracy?), examples of edge cases, and how to handle ties.

When to Use Synthetic Data Instead

Human annotation is expensive. Consider data/synthetic-data and data/distilabel first:

Situation	Approach
Need 100k+ preference pairs	Synthetic (LLM-as-annotator via UltraFeedback)
Domain-specific safety data	Human annotation (nuance matters)
Style/format preferences	Synthetic (clear rubric, LLM can judge)
Medical/legal accuracy	Human annotation (errors are high-stakes)
Limited budget	Synthetic generation + small human spot-check
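
For the limited-budget row, the human spot-check can be as simple as a reproducible random sample of the synthetic pairs. The sketch below assumes a 5% review fraction and a fixed seed; both values and the function name are illustrative choices, not recommendations from the source.

```python
import random


def spot_check_sample(pairs: list[dict], fraction: float = 0.05, seed: int = 0) -> list[dict]:
    """Pick a reproducible random subset of synthetic pairs for human review."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if not pairs:
        return []
    k = max(1, round(len(pairs) * fraction))  # always review at least one pair
    return random.Random(seed).sample(pairs, k)


# Stand-in for 200 synthetic preference pairs.
synthetic_pairs = [{"id": i, "prompt": f"prompt {i}"} for i in range(200)]
review_batch = spot_check_sample(synthetic_pairs)
print(len(review_batch))  # 10
```

Fixing the seed means the same subset can be re-drawn later, for example to send the identical batch to a second reviewer.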

Key Facts

RLHF annotation costs 5-10x more per sample than compute; 600 pairs can cost ~$60,000 [unverified]
Argilla FeedbackDataset is the 1.x SDK's dataset type for LLM feedback and supersedes the older TextClassification/TokenClassification datasets; Argilla 2.x replaces it with rg.Dataset and rg.Settings
Label Studio ML Backend enables Claude/GPT pre-annotation — humans review, not write from scratch
distilabel + Argilla pipeline: generate synthetically, human-review selectively
Inter-annotator agreement of Cohen's kappa > 0.6 is the minimum acceptable level for training data quality

Common Failure Cases

Label Studio pairwise export contains annotations with no selection (neither "left" nor "right") because annotators skipped the question
Why: annotators can submit a task without selecting a preference; the export JSON contains an empty result array for those annotations, which causes an IndexError when the processing script accesses t["annotations"][0]["result"][0].
Detect: the preference pair extraction script throws IndexError: list index out of range on some rows; the raw export shows tasks with "annotations": [{"result": []}].
Fix: filter out tasks with empty results before processing: if t["annotations"] and t["annotations"][0]["result"]; instruct annotators to select a response before submitting, and enforce it in the template if the control tag you use supports a required attribute.
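
The filter described in the fix can be sketched as a standalone function that also counts what it skips, so a high skip rate becomes visible. It operates on the exported JSON as plain dicts; the function name and the sample export are made up for this example.

```python
def extract_preference_pairs(exported_tasks: list[dict]) -> tuple[list[dict], int]:
    """Convert exported tasks to chosen/rejected pairs; also count skipped tasks."""
    pairs, skipped = [], 0
    for task in exported_tasks:
        annotations = task.get("annotations") or []
        results = annotations[0].get("result") if annotations else None
        selected = results[0].get("value", {}).get("selected") if results else None
        if selected not in ("left", "right"):
            skipped += 1  # unannotated, empty result, or unexpected value
            continue
        a, b = task["data"]["response_a"], task["data"]["response_b"]
        chosen, rejected = (a, b) if selected == "left" else (b, a)
        pairs.append({"prompt": task["data"]["prompt"], "chosen": chosen, "rejected": rejected})
    return pairs, skipped


data = {"prompt": "Explain gradient descent", "response_a": "A", "response_b": "B"}
sample_export = [
    {"data": data, "annotations": [{"result": [{"value": {"selected": "right"}}]}]},
    {"data": data, "annotations": [{"result": []}]},  # submitted without a choice
    {"data": data, "annotations": []},                # never annotated
]
pairs, skipped = extract_preference_pairs(sample_export)
print(len(pairs), skipped)  # 1 2
```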

Argilla FeedbackDataset.push_to_argilla() fails silently after timeout, leaving a partially uploaded dataset
Why: large datasets (10,000+ records) can hit Argilla's HTTP timeout during upload; the Python call returns without error if the timeout is caught internally, but the Argilla server only received a fraction of the records.
Detect: rg.FeedbackDataset.from_argilla(name).records returns fewer records than the uploaded list; checking the Argilla UI shows the dataset exists but with a lower row count.
Fix: upload in batches using dataset.add_records(batch) in chunks of 500–1000; add a post-upload assertion that the record count re-fetched from the server, len(rg.FeedbackDataset.from_argilla(name).records), equals len(source_records).
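
The batching-plus-verification pattern from the fix is independent of the client library. In the sketch below, `upload` and `count_remote` are hypothetical callables standing in for whatever your Argilla client uses to add records and to re-fetch the server-side count; an in-memory list plays the server so the example runs on its own.

```python
from collections.abc import Callable, Iterator, Sequence


def batched(items: Sequence, size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def upload_in_batches(records: Sequence, upload: Callable[[Sequence], None],
                      count_remote: Callable[[], int], size: int = 500) -> None:
    """Upload in chunks, then verify the server-side count matches the source."""
    for batch in batched(records, size):
        upload(batch)
    remote = count_remote()
    if remote != len(records):
        raise RuntimeError(f"Uploaded {len(records)} records but server reports {remote}")


# In-memory stand-in for the annotation server.
server: list = []
source_records = list(range(1200))
upload_in_batches(source_records, upload=server.extend, count_remote=lambda: len(server))
print(len(server))  # 1200
```

Raising instead of asserting keeps the check active even when Python runs with assertions disabled.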

Inter-annotator agreement is measured only at the end of the project and reveals κ < 0.5, making the entire dataset unusable
Why: agreement is computed retrospectively on completed annotations; if the annotation guidelines were ambiguous, all annotators may have interpreted the task differently from the start, producing inconsistent labels throughout the dataset.
Detect: Cohen's kappa across annotators is below 0.5 after all annotations are complete; reviewing disagreements shows systematic differences in how annotators interpret "helpfulness" rather than random noise.
Fix: run an inter-annotator agreement calibration round on 50–100 overlap tasks before the main annotation begins; review disagreements, update guidelines, and retrain annotators before starting the full project.
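
One way to set up the calibration round is to carve out a shared overlap set that every annotator labels, and only distribute the remaining tasks once agreement on that set is acceptable. The sketch below handles the split; the function name, the round-robin assignment, and the default of 50 overlap tasks are assumptions for this example.

```python
import random


def plan_calibration(task_ids: list, annotators: list[str],
                     overlap: int = 50, seed: int = 0) -> tuple[list, dict[str, list]]:
    """Split tasks into a shared calibration set and per-annotator main queues."""
    if not annotators:
        raise ValueError("at least one annotator is required")
    if overlap > len(task_ids):
        raise ValueError("overlap cannot exceed the number of tasks")
    shuffled = task_ids[:]
    random.Random(seed).shuffle(shuffled)
    calibration, remainder = shuffled[:overlap], shuffled[overlap:]
    # Every annotator labels the full calibration set so agreement can be measured.
    # The remaining tasks are dealt out round-robin for the main phase.
    queues: dict[str, list] = {name: [] for name in annotators}
    for index, task_id in enumerate(remainder):
        queues[annotators[index % len(annotators)]].append(task_id)
    return calibration, queues


calibration, queues = plan_calibration(list(range(1000)), ["ann_1", "ann_2", "ann_3"])
print(len(calibration), {name: len(q) for name, q in queues.items()})
```

The main queues would only be released after kappa on the calibration set clears the project's threshold and the guidelines have been updated from the disagreements.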

Connections

data/distilabel — Argilla's synthetic data pipeline; works alongside Argilla for hybrid human+synthetic annotation
data/synthetic-data — LLM-generated preference pairs to supplement human annotation
data/rlhf-datasets — HH-RLHF, UltraFeedback — pre-built datasets before building your own
fine-tuning/dpo-grpo — DPO/GRPO training uses chosen/rejected pairs produced by annotation tooling
fine-tuning/decision-framework — when annotation cost is justified vs synthetic data

Open Questions

How does LLM-as-annotator quality compare to human annotation for safety-sensitive domains?
Is Argilla v2.x API now stable enough to recommend over v1 for new projects?
At what scale does commercial annotation (Scale AI, Surge) become cheaper than running an in-house Argilla deployment?