Data Engineering Hub

Hub page for data engineering — pipelines, transformation, orchestration (Airflow, Prefect, dbt), storage patterns, and the specific requirements that AI workloads add to traditional data infrastructure.

Data engineering for AI differs from traditional data engineering in one critical way: data quality bugs silently degrade model quality with no visible error. A bad SQL join in a BI pipeline produces a wrong number; the same bug in a training pipeline produces a miscalibrated model that looks fine until you evaluate it.


The AI Data Stack

Sources (logs, databases, APIs, user feedback)
        ↓
Ingestion (Kafka, Fivetran, Airbyte, custom scrapers)
        ↓
Storage (data lake, warehouse, feature store)
        ↓
Transformation (dbt, Spark, DuckDB)
        ↓
Validation (Great Expectations, dbt tests, custom evals)
        ↓
Serving (vector stores, feature stores, S3/GCS for training data)
        ↓
Consumption (training jobs, RAG pipelines, online inference)

Pipeline Orchestration

Airflow

Mature, widely deployed. DAG-based. Best when you have a large operations team comfortable with Python DAG authorship and need enterprise monitoring.

from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

# extract_new_documents, generate_embeddings, and upsert_to_vector_store are
# project-defined callables imported elsewhere in the DAG file
with DAG("embedding_refresh", schedule="@daily", start_date=datetime(2026, 1, 1)) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_new_documents)
    embed = PythonOperator(task_id="embed", python_callable=generate_embeddings)
    load = PythonOperator(task_id="load", python_callable=upsert_to_vector_store)

    # linear dependency chain: extract, then embed, then load
    extract >> embed >> load

Prefect

Pythonic, cloud-native. Easier local development than Airflow. Better for teams that want flow-as-code without a full Airflow deployment.

from prefect import flow, task

@task
def extract() -> list[str]: ...  # fetch newly added documents from the source

@task
def embed(texts: list[str]) -> list[list[float]]: ...  # call the embedding model

@task
def upsert_to_store(vectors: list[list[float]]) -> None: ...  # write vectors to the vector store

@flow
def embedding_pipeline():
    texts = extract()
    vectors = embed(texts)
    upsert_to_store(vectors)

dbt

SQL-first transformation layer. Not a scheduler itself — pairs with Airflow/Prefect/dbt Cloud for orchestration. The standard tool for warehouse-layer transformation.
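
A minimal Airflow 2-style sketch of that pairing: the dbt CLI runs inside BashOperators so the orchestrator owns scheduling, retries, and alerting while dbt owns the SQL. The project path /opt/dbt and the model name marts.training_features are hypothetical, and dbt is assumed to be installed on the worker.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# /opt/dbt and marts.training_features are placeholders for a real dbt project and model
with DAG("dbt_training_features", schedule="@daily", start_date=datetime(2026, 1, 1)) as dag:
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command="cd /opt/dbt && dbt run --select marts.training_features",
    )
    dbt_test = BashOperator(
        task_id="dbt_test",
        bash_command="cd /opt/dbt && dbt test --select marts.training_features",
    )

    dbt_run >> dbt_test  # build the model first, then run its schema and data tests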


Data Storage for AI

Layer | Tool | Use
--- | --- | ---
Raw storage | S3, GCS | Training data lake, model artefacts
Structured data | PostgreSQL, Snowflake, BigQuery | Features, labels, metadata
Vector store | pgvector, Qdrant, Pinecone | Embedding index for RAG
Feature store | Feast, Tecton, SageMaker FS | Online/offline feature serving
Cache | Redis | Low-latency feature serving

Data Quality for AI

Standard data quality checks are necessary but insufficient. Add AI-specific validation:

  1. Embedding quality checks — cosine similarity distribution of new embeddings vs baseline; a sudden distribution shift signals upstream text quality degradation (see the drift-check sketch after this list)
  2. Label consistency — for RLHF datasets, inter-annotator agreement should stay above a threshold (Cohen's κ > 0.6)
  3. Deduplication — near-duplicate training examples bias models toward overrepresented patterns; run MinHash deduplication before training (see the MinHash sketch after this list)
  4. PII detection — run regex + model-based PII checks before any data enters a training pipeline
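
A sketch of the drift check, assuming embeddings arrive as NumPy arrays: compare the distribution of cosine similarity to the baseline centroid for the new batch vs the baseline batch with a two-sample KS test. The function name, the centroid approach, and the 0.01 alpha are illustrative choices, not a fixed recipe.

import numpy as np
from scipy.stats import ks_2samp

def embeddings_drifted(new: np.ndarray, baseline: np.ndarray, alpha: float = 0.01) -> bool:
    # Compare the distribution of cosine similarity to the baseline centroid for
    # new vs baseline embeddings; a significant difference suggests the upstream
    # text (or the embedding model) has changed.
    centroid = baseline.mean(axis=0)
    centroid /= np.linalg.norm(centroid)

    def cos_to_centroid(x: np.ndarray) -> np.ndarray:
        x = x / np.linalg.norm(x, axis=1, keepdims=True)
        return x @ centroid

    _, p_value = ks_2samp(cos_to_centroid(new), cos_to_centroid(baseline))
    return p_value < alpha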
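
And a sketch of MinHash deduplication using the datasketch library (one reasonable choice among several); whitespace tokens stand in for proper n-gram shingles, and the 0.9 Jaccard threshold is illustrative.

from datasketch import MinHash, MinHashLSH

def near_duplicate_indices(texts: list[str], threshold: float = 0.9) -> set[int]:
    # Index every document's MinHash in an LSH structure, then flag the later
    # member of each near-duplicate pair for removal before training.
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    signatures = []
    for i, text in enumerate(texts):
        m = MinHash(num_perm=128)
        for token in text.lower().split():  # crude shingling; real pipelines use n-grams
            m.update(token.encode("utf8"))
        lsh.insert(str(i), m)
        signatures.append(m)

    flagged = set()
    for i, m in enumerate(signatures):
        for key in lsh.query(m):
            j = int(key)
            if j != i:
                flagged.add(max(i, j))  # keep the earlier occurrence, drop the later one
    return flagged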

Data Versioning

# DVC — Git for data
dvc init
dvc add data/embeddings/corpus_v3.parquet
git add data/embeddings/corpus_v3.parquet.dvc .gitignore
git commit -m "Add corpus v3 embeddings"
dvc push  # pushes to S3/GCS remote


Connections

  • data/pipelines — full treatment of Airflow, Prefect, and dbt orchestration patterns
  • data/feature-stores — feature stores (Feast, Tecton) are the serving layer bridging training pipelines and online inference
  • data/synthetic-data — LLM-generated training data; shares ingestion and deduplication pipeline patterns with real data
  • data/rlhf-datasets — preference data collection; same pipeline infrastructure as standard annotation workflows
  • infra/vector-stores — downstream consumer of embedding pipelines built on this stack
  • sql/sql-for-ai — SQL patterns used at the transformation layer (dbt, DuckDB)
  • synthesis/cost-optimisation — data pipeline efficiency directly affects training and inference costs

Open Questions

  • At what data volume does DuckDB become a bottleneck, and when does it make sense to migrate transformation workloads from DuckDB to Spark?
  • Is MinHash deduplication sufficient for RLHF datasets where near-duplicates with different labels are specifically harmful, or are more precise semantic deduplication approaches needed?
  • How should data versioning (DVC) be integrated with model versioning (MLflow) to produce a single reproducible experiment record that ties dataset version to model artefact?