Feature Stores

A feature store is a central repository for ML features that eliminates training-serving skew by guaranteeing the same feature computation is used during both model training and online inference.


Key Facts

  • Training-serving skew is the core problem: if training uses one SQL query and serving uses another, model performance silently degrades in production
  • Two storage tiers: offline store (historical, batch, used for training dataset construction) and online store (current values only, low-latency, used at inference time)
  • Point-in-time correctness prevents data leakage — when building a training dataset, features are looked up as of the label timestamp, not "now"
  • Online stores target sub-10ms feature lookup; offline stores trade latency for completeness and historical depth
  • Feast is the leading open-source option (Linux Foundation); Tecton is the leading managed enterprise option; Hopsworks leads on governance and regulated industries
  • For LLM applications, pre-computed embeddings and user preference vectors stored in feature stores let recommendation and retrieval systems meet <200ms end-to-end latency budgets
  • Feature stores are overkill for small teams or single-model systems; the complexity pays off when multiple models share the same features

Detail

The problem feature stores solve

Without a feature store, every team that needs "user's average purchase value over the last 30 days" writes its own version — one for training, one for the API, one for the dashboard. They diverge. The training pipeline uses a LEFT JOIN; the serving code uses an INNER JOIN. The model is trained on one distribution and evaluated on another. This is training-serving skew, and it causes model degradation that is extremely hard to debug because nothing crashes — predictions just get worse.

A feature store solves this by making feature logic the single source of truth. Features are defined once (as a transformation + storage spec), computed on a schedule, and read from the same store for both training data construction and online inference.
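
A minimal sketch of what "defined once" looks like in Feast (the entity, source, and feature names are illustrative, and the API shown follows recent Feast versions, so exact signatures may vary):

```python
from datetime import timedelta

from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32, Int64

# One entity, one source, one feature view -- the single source of truth
# read by both training-dataset construction and online serving.
user = Entity(name="user", join_keys=["user_id"])

purchase_stats_source = FileSource(
    path="data/user_purchase_stats.parquet",  # illustrative path
    timestamp_field="event_timestamp",
)

user_purchase_stats = FeatureView(
    name="user_purchase_stats",
    entities=[user],
    ttl=timedelta(days=2),
    schema=[
        Field(name="avg_purchase_value_30d", dtype=Float32),
        Field(name="txn_count_1h", dtype=Int64),
    ],
    source=purchase_stats_source,
)
```

Training pipelines and the serving path both reference `user_purchase_stats:avg_purchase_value_30d` by name, so there is no second SQL query to drift.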

Point-in-time correctness

When constructing a training dataset from historical labels, you cannot use feature values that were computed after the label timestamp — that is data leakage. A feature store's offline retrieval performs a point-in-time join: for each (entity, timestamp) in your label set, it looks up the feature value that was current at that timestamp, not the latest value.

Example: a fraud label for a transaction at 14:03:22 should be joined with the user's "number of transactions in last hour" as of 14:03:22, not as of whenever you ran the training job.

This is the most operationally difficult part of building a feature store from scratch — it requires storing feature history with timestamps and doing an efficient as-of join across potentially billions of rows.
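
A sketch of what offline retrieval looks like with Feast, continuing the illustrative definitions above; the entity dataframe carries the label timestamps that drive the as-of join:

```python
import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")  # assumes a configured Feast repo

# Label set: one row per (entity, label timestamp).
entity_df = pd.DataFrame({
    "user_id": [1001, 1002],
    "event_timestamp": pd.to_datetime(
        ["2024-05-01 14:03:22", "2024-05-01 09:17:05"], utc=True
    ),
})

# Point-in-time join: each feature value is resolved as of the row's
# event_timestamp, never from the future, so labels cannot leak.
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "user_purchase_stats:avg_purchase_value_30d",
        "user_purchase_stats:txn_count_1h",
    ],
).to_df()
```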

Online vs offline store

| Property | Offline store | Online store |
| --- | --- | --- |
| Storage backend | S3, GCS, Delta Lake, BigQuery | Redis, DynamoDB, Cassandra, Bigtable |
| Read latency | Seconds to minutes (batch) | 1–10 ms |
| Data volume | Full history | Latest value per entity only |
| Primary use | Training dataset construction | Real-time inference |
| Write pattern | Batch materialization jobs | Continuous or scheduled materialization from offline |

The materialization job copies data from the offline store to the online store on a schedule (hourly, daily). Online feature freshness is therefore bounded by that schedule, which is an operational tradeoff to plan around.
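
In Feast terms, the flow looks roughly like this (illustrative names, continuing the sketch above; the CLI equivalents are `feast materialize` and `feast materialize-incremental`):

```python
from datetime import datetime, timezone

from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Scheduled job: copy the latest feature values from the offline store
# into the online store. Run hourly/daily; this schedule bounds freshness.
store.materialize_incremental(end_date=datetime.now(timezone.utc))

# Serving path: a single low-latency key lookup against the online store.
online = store.get_online_features(
    features=[
        "user_purchase_stats:avg_purchase_value_30d",
        "user_purchase_stats:txn_count_1h",
    ],
    entity_rows=[{"user_id": 1001}],
).to_dict()
```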

Tool comparison: Feast vs Tecton vs Hopsworks

Feast (open source, Linux Foundation)

  • Most flexible: bring your own storage backends (S3 + Redis, BigQuery + Firestore, etc.)
  • Feature definitions in Python, registered to a central registry
  • Requires engineering capacity to operate and integrate
  • Best for: teams that want no vendor lock-in and already have infrastructure opinions

Tecton (managed SaaS, founded by the team behind Uber's Michelangelo platform)

  • Opinionated, end-to-end managed platform
  • Strong real-time streaming support (features computed from Kafka, not just batch)
  • Enterprise pricing; targets business-critical real-time ML
  • Best for: enterprises that need production SLAs and can pay for a managed service

Hopsworks (open source + managed, on-prem or cloud)

  • Tightest integration between feature store, model registry, and training pipelines
  • Strong data lineage, governance, and metadata management
  • Default choice for regulated industries (healthcare, finance, manufacturing)
  • Best for: organizations requiring on-premises deployment or regulatory compliance

All three support point-in-time correct training dataset generation and an online/offline split. The differences lie in the operational model, the depth of streaming support, and the governance tooling.

LLM-specific use cases

Feature stores are seeing renewed relevance in LLM applications in two ways:

1. Pre-computed embeddings as features. For retrieval-augmented systems or recommendation layers sitting in front of an LLM, embedding a document or user profile at query time is too slow. Pre-computing embeddings on a schedule and storing them in the online store (or a vector index backed by the feature store) brings retrieval latency into the <10ms range. The feature store manages freshness — when a document is updated, the embedding is recomputed and re-materialized (see the lookup sketch after this list).

2. User preference and behavior vectors. User-LLM personalization patterns encode user history as a dense vector (a "user embedding") computed from behavioral signals. These vectors are expensive to compute at query time. Storing them as features — recomputed on a rolling schedule, served from the online store — lets LLM inference pipelines fetch rich user context in milliseconds via a single key lookup.
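
A minimal sketch of the lookup side for both cases, assuming hypothetical feature views user_profile (a behavior-derived user vector) and doc_embeddings (pre-computed document embeddings) have already been materialized to the online store:

```python
from feast import FeatureStore

store = FeatureStore(repo_path=".")


def fetch_llm_context(user_id: int, doc_ids: list[str]) -> dict:
    """Fetch pre-computed vectors for an LLM pipeline via online lookups,
    instead of recomputing embeddings at query time."""
    # Hypothetical feature view: user_profile holds the user embedding.
    user_resp = store.get_online_features(
        features=["user_profile:preference_vector"],
        entity_rows=[{"user_id": user_id}],
    ).to_dict()

    # Hypothetical feature view: doc_embeddings holds document embeddings.
    doc_resp = store.get_online_features(
        features=["doc_embeddings:embedding"],
        entity_rows=[{"doc_id": d} for d in doc_ids],
    ).to_dict()

    return {
        "user_vector": user_resp["preference_vector"][0],
        "doc_vectors": doc_resp["embedding"],
    }
```

Both lookups are plain entity-keyed reads, which is what keeps them inside a tight end-to-end latency budget; similarity search over a corpus still belongs in a vector store, as the note below points out.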

Note: for pure embedding retrieval (ANN search over a corpus), a vector store (Qdrant, pgvector, Pinecone) is more natural than a feature store. The feature store pattern applies when the embedding or vector is an input feature to a model, not the retrieval index itself.

When a feature store is worth adding

Add a feature store when:

  • Multiple models share the same derived features (avoids redundant computation and inconsistency)
  • Online inference has strict latency requirements (<50ms) and features require non-trivial computation
  • You need auditable, reproducible training datasets from historical feature values
  • The team is large enough that feature definitions will otherwise drift between data scientists and engineers

Skip it when:

  • You have a single model and a small data team
  • Features are simple column selects from a database — a view or dbt model is sufficient
  • The overhead of operating a dual-store system exceeds the engineering time lost to the problem it solves
  • A lightweight SQL implementation (see sql/sql-for-ai feature store pattern) handles the use case

Connections

  • data/data-engineering-hub — first mention of feature stores in the vault; positioned in the AI data stack between storage and serving layers
  • sql/sql-for-ai — the feature store pattern section shows a minimal PostgreSQL implementation using feature_values + TTL checks, useful when a full feature store is overkill
  • rag/embeddings — pre-computed embeddings stored in a feature store are the primary LLM-era use case; freshness and materialization schedules apply directly
  • infra/vector-stores — alternative and complement: vector stores handle ANN search over embedding corpora; feature stores handle entity-keyed feature lookup; the two are often used together

Open Questions

  • What is the practical freshness floor for online store materialization in real-time recommendation use cases — is hourly good enough, or do streaming feature stores (Tecton's approach) become necessary?
  • How do feature stores interact with LLM prompt caching — is there a pattern where the feature store keys into a prompt cache by user/context ID?
  • Does the Hopsworks vector store offering (added 2024) meaningfully compete with dedicated vector stores for RAG, or is it primarily for feature-adjacent embedding use cases?
  • What is the migration path from a lightweight SQL feature store (as in sql-for-ai) to a full Feast deployment — is there a natural breaking point (data volume, team size)?

Sources