BeginnerAnalytics Engineer

Analyse a large CSV with DuckDB

Use DuckDB to query a 500MB CSV of transaction data without loading it into memory. Answer three specific questions: the top 10 categories by revenue, the month with the highest average order value, and the busiest hour of day by transaction count. Export the results to a smaller CSV for further analysis.

Why this matters

DuckDB changes what is possible with local data analysis. A 500MB CSV that would take 10 seconds to load into pandas queries in under a second with DuckDB, with no memory pressure and full SQL support. Understanding when to use DuckDB versus a full data warehouse versus pandas is a practical skill that makes you dramatically faster at exploratory data analysis.

Before you start

Step-by-step guide

  1. 1

    Query the CSV directly without loading it

    Run: SELECT COUNT(*) FROM read_csv_auto('your_file.csv'). DuckDB reads the file in streaming fashion; you do not need to import it first. Check the row count and print the first 5 rows to understand the schema. Note how long this takes compared to loading the same file in pandas.

  2. 2

    Find the top 10 categories by revenue

    Write a query grouping by the category column, summing the revenue or amount column, ordering descending, and limiting to 10. If the category column has NULL values, decide whether to include them; NULLIF and COALESCE are your tools. Print the result and verify the math on the top entry manually.

  3. 3

    Find the month with highest average order value

    Extract the month from the timestamp column using strftime('%Y-%m', timestamp_col). Group by month, compute AVG(order_value), order descending, limit 1. Compare December to other months; seasonality almost always shows up in this query on real e-commerce data.

  4. 4

    Find the busiest hour of day

    Extract the hour using EXTRACT(hour FROM timestamp_col). Group by hour, count transactions, order by count descending. Plot the results in your head (or in a quick matplotlib chart); the distribution is rarely uniform and the shape tells you something about the business.

  5. 5

    Export results and compare DuckDB to pandas

    Use COPY (your_query) TO 'results.csv' WITH (FORMAT CSV, HEADER) to export results. Then load the same CSV in pandas and run equivalent operations. Measure the time difference. Note which operations DuckDB is faster at and which pandas handles more conveniently.

Relevant Axiom pages

What to do next

Back to Practice Lab