Profile a dataset for quality issues with DuckDB
Take a raw CSV dataset (at least 50,000 rows) and produce a complete data quality profile: null rates per column, duplicate row counts, outlier detection using the interquartile range (IQR) for numeric columns, cardinality for categorical columns, and a summary report that flags any column needing remediation.
Why this matters
Data quality problems are silent; a model trained on a key feature with 20% null values produces subtly wrong results that never surface as errors. Profiling a dataset before using it is the data engineering equivalent of reading the code before running it. The output is also a prerequisite for any SLA-based data contract: you cannot promise data is complete if you have never measured it.
Before you start
- Python with duckdb installed (pip install duckdb)
- A CSV dataset with at least 50,000 rows and a mix of numeric and categorical columns (the NYC Taxi trip data works well; the Kaggle Titanic dataset is far smaller than 50,000 rows but is fine for a first dry run)
- Basic SQL knowledge
Step-by-step guide
1. Load and inspect with DuckDB
Use duckdb.sql("SELECT * FROM read_csv_auto('data.csv') LIMIT 5").df() to preview. Run SELECT COUNT(*) to verify the row count, then run DESCRIBE on the table to see the inferred column types. A numeric column inferred as VARCHAR is a common sign of data quality issues. A sketch of this step follows below.
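A minimal sketch of the inspection pass, assuming the file is named data.csv (adjust the path to your dataset):

```python
import duckdb

# Preview the first few rows to sanity-check parsing.
print(duckdb.sql("SELECT * FROM read_csv_auto('data.csv') LIMIT 5").df())

# Verify the row count against what the source system claims.
print(duckdb.sql("SELECT COUNT(*) AS n_rows FROM read_csv_auto('data.csv')").df())

# Inspect the inferred column types; a numeric field showing up as VARCHAR
# usually means dirty values (e.g. 'N/A', thousands separators) in the column.
print(duckdb.sql("DESCRIBE SELECT * FROM read_csv_auto('data.csv')").df())
```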
2. Measure null rates
Write a DuckDB query that computes (COUNT(*) - COUNT(col)) / COUNT(*) for every column and sort the results in descending order. Any column above 5% nulls gets flagged: either the data is genuinely missing or the ETL process has a gap. See the sketch below.
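One way to compute the rate for every column without typing each name by hand is to read the schema first and build the expressions in Python. A minimal sketch, assuming the data has been loaded into a table called data:

```python
import duckdb

con = duckdb.connect()
con.sql("CREATE OR REPLACE TABLE data AS SELECT * FROM read_csv_auto('data.csv')")

# DESCRIBE returns one row per column; the first field is the column name.
cols = [row[0] for row in con.sql("DESCRIBE data").fetchall()]

# One null-rate expression per column, evaluated in a single table scan.
exprs = ", ".join(
    f'ROUND(100.0 * (COUNT(*) - COUNT("{c}")) / COUNT(*), 2) AS "{c}"' for c in cols
)
null_rates = con.sql(f"SELECT {exprs} FROM data").df().T  # one row per column
null_rates.columns = ["null_pct"]
null_rates = null_rates.sort_values("null_pct", ascending=False)
print(null_rates)

# Columns above the 5% threshold need investigation.
print(null_rates[null_rates["null_pct"] > 5])
```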
3. Find duplicate rows
Run SELECT COUNT(*) FROM (SELECT DISTINCT * FROM data) and compare it to the total row count. Then identify the natural key (e.g., user_id + timestamp) and check for duplicates on that subset, as sketched below. Duplicate rows in event data are usually a pipeline bug, not noise.
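A sketch of both checks; user_id and event_ts are placeholder names for the natural key, so substitute whatever uniquely identifies a record in your dataset:

```python
import duckdb

con = duckdb.connect()
con.sql("CREATE OR REPLACE TABLE data AS SELECT * FROM read_csv_auto('data.csv')")

total    = con.sql("SELECT COUNT(*) FROM data").fetchone()[0]
distinct = con.sql("SELECT COUNT(*) FROM (SELECT DISTINCT * FROM data)").fetchone()[0]
print(f"exact duplicate rows: {total - distinct}")

# Duplicates on the natural key, most-repeated first.
print(con.sql("""
    SELECT user_id, event_ts, COUNT(*) AS n
    FROM data
    GROUP BY user_id, event_ts
    HAVING COUNT(*) > 1
    ORDER BY n DESC
    LIMIT 10
""").df())
```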
4. Detect numeric outliers
For each numeric column, compute Q1, Q3, and the IQR using PERCENTILE_CONT. Flag rows where value < Q1 - 1.5*IQR or value > Q3 + 1.5*IQR, and print the outlier count and percentage per column (see the sketch below). Then decide: are these data errors (a negative age) or genuine extreme values (a very large purchase)?
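A sketch of the IQR pass that discovers numeric columns from the inferred schema; the list of type prefixes is an assumption about what your schema contains, so extend it if needed:

```python
import duckdb

con = duckdb.connect()
con.sql("CREATE OR REPLACE TABLE data AS SELECT * FROM read_csv_auto('data.csv')")

numeric_prefixes = ("TINYINT", "SMALLINT", "INTEGER", "BIGINT", "HUGEINT",
                    "FLOAT", "DOUBLE", "DECIMAL")
numeric_cols = [
    row[0] for row in con.sql("DESCRIBE data").fetchall()
    if row[1].startswith(numeric_prefixes)
]

for col in numeric_cols:
    outliers, total = con.sql(f"""
        WITH bounds AS (
            SELECT
                PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY "{col}") AS q1,
                PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY "{col}") AS q3
            FROM data
        )
        SELECT
            COUNT(*) FILTER (
                WHERE "{col}" < q1 - 1.5 * (q3 - q1)
                   OR "{col}" > q3 + 1.5 * (q3 - q1)
            ) AS outlier_count,
            COUNT(*) AS total_count
        FROM data, bounds
    """).fetchone()
    print(f"{col}: {outliers} outliers ({100.0 * outliers / total:.2f}%)")
```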
5. Profile categorical cardinality
For each categorical column, compute COUNT(DISTINCT value) and the top 10 most frequent values with their frequencies, as sketched below. Low-cardinality columns (under 20 unique values) are good candidates for enums. High-cardinality columns where 95% of values appear only once are a sign of free-text fields being misused as structured data.
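A sketch that treats every VARCHAR column as categorical, which is a simplifying assumption; adjust the column list if you have numeric codes that are really categories:

```python
import duckdb

con = duckdb.connect()
con.sql("CREATE OR REPLACE TABLE data AS SELECT * FROM read_csv_auto('data.csv')")

cat_cols = [
    row[0] for row in con.sql("DESCRIBE data").fetchall() if row[1] == "VARCHAR"
]

for col in cat_cols:
    n_distinct = con.sql(f'SELECT COUNT(DISTINCT "{col}") FROM data').fetchone()[0]
    top10 = con.sql(f"""
        SELECT "{col}" AS value, COUNT(*) AS freq
        FROM data
        GROUP BY 1
        ORDER BY freq DESC
        LIMIT 10
    """).df()
    print(f"\n{col}: {n_distinct} distinct values")
    print(top10)
```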
6. Write the quality report
Produce a data quality report: a table with column, type, null_rate, duplicate_flag, outlier_count, and cardinality, plus a summary section with an overall data quality score. Save it as CSV or markdown; this is what you attach to a pull request when the data lands in the warehouse. A sketch follows below.
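A minimal sketch of assembling the report; it covers type, null rate, and cardinality, and leaves hooks for the duplicate and outlier numbers computed in the earlier steps:

```python
import duckdb
import pandas as pd

con = duckdb.connect()
con.sql("CREATE OR REPLACE TABLE data AS SELECT * FROM read_csv_auto('data.csv')")

rows = []
for name, dtype, *_ in con.sql("DESCRIBE data").fetchall():
    null_rate = con.sql(
        f'SELECT 100.0 * (COUNT(*) - COUNT("{name}")) / COUNT(*) FROM data'
    ).fetchone()[0]
    cardinality = con.sql(f'SELECT COUNT(DISTINCT "{name}") FROM data').fetchone()[0]
    rows.append({
        "column": name,
        "type": dtype,
        "null_rate_pct": round(null_rate, 2),
        "cardinality": cardinality,
        # Merge in duplicate_flag and outlier_count from steps 3 and 4 here.
        "needs_remediation": null_rate > 5,  # threshold from step 2
    })

report = pd.DataFrame(rows)
report.to_csv("data_quality_report.csv", index=False)
# to_markdown() needs the optional 'tabulate' package; paste the output into the PR.
print(report.to_markdown(index=False))
```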