Design a star schema
Design a star schema for e-commerce analytics: a fact_orders table at the order-line grain, with dim_customer, dim_product, and dim_date dimension tables. Write the CREATE TABLE statements with appropriate indexes, define the grain explicitly, and explain your choice of surrogate keys versus natural keys.
Why this matters
The star schema is the most widely deployed data warehousing pattern for a reason: it makes analytical queries fast and readable. Understanding grain, surrogate keys, and slowly changing dimensions is the foundation that makes everything in an analytics stack work correctly; dbt models, BI tool queries, and executive dashboards all depend on getting this right.
Before you start
- Understanding of what a primary key and foreign key are
- A SQL database (Postgres or DuckDB both work for this exercise)
- Basic understanding of what an analytical query looks like (GROUP BY, aggregate functions)
- No prior data warehousing experience required
Step-by-step guide
- 1
Define the grain
Write one sentence defining the grain of the fact table: 'Each row in fact_orders represents one line item from one order placed by one customer for one product on one date'. The grain determines which columns belong in the fact table and which belong in dimensions. Write this sentence before touching SQL.
- 2
Design the dimension tables
Write CREATE TABLE statements for dim_customer (customer_sk surrogate key, customer_id natural key, name, email, country, valid_from, valid_to for SCD2), dim_product (product_sk, product_id, name, category, unit_price), and dim_date (date_sk integer YYYYMMDD, date, year, quarter, month, day_of_week, is_weekend).
- 3
Design the fact table
Write CREATE TABLE for fact_orders (order_line_sk surrogate key, order_id, customer_sk FK, product_sk FK, date_sk FK, quantity, unit_price, discount_amount, net_revenue). Net_revenue is a derived metric worth pre-computing; explain why you chose to store it rather than calculating it at query time.
- 4
Add indexes
Add indexes on every foreign key column in fact_orders (customer_sk, product_sk, date_sk). Add a composite index on (date_sk, customer_sk) for the most common query pattern: 'revenue by customer for a date range'. Explain why you would not add more indexes on a fact table that receives millions of inserts per day.
- 5
Write three analytical queries
Write: monthly revenue by product category, top 10 customers by revenue in the last 90 days, and revenue for new customers (first purchase in the last 30 days) vs returning customers. All three should run without subqueries; if they require subqueries, your schema has a structural issue.