Profile a dataset for quality issues with DuckDB
Take a raw CSV dataset (at least 50,000 rows) and produce a complete data quality profile: null rates per column, duplicate row counts, outlier detection using the interquartile range (IQR) for numeric columns, cardinality for categorical columns, and a summary report that flags any column needing remediation.
Why this matters
Data quality problems are silent; a model trained on a key feature with 20% null values produces subtly wrong results that never surface as errors. Profiling a dataset before using it is the data engineering equivalent of reading the code before running it. The output is also a prerequisite for any SLA-based data contract: you cannot promise data is complete if you have never measured it.
Before you start
- Python with duckdb installed (pip install duckdb)
- A CSV dataset with at least 50,000 rows and a mix of numeric and categorical columns (the NYC Taxi trip data works well; the Kaggle Titanic dataset is far smaller than 50,000 rows but is fine for a first dry run)
- Basic SQL knowledge
Step-by-step guide
1. Load and inspect with DuckDB
Use duckdb.sql("SELECT * FROM read_csv_auto('data.csv') LIMIT 5").df() to preview. Run SELECT COUNT(*) to verify the row count, then run DESCRIBE on the table to see the inferred column types. A numeric column inferred as VARCHAR is a common sign of data quality issues. A sketch of this step follows below.
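A minimal sketch of the inspection pass, assuming the file is named data.csv (adjust the path to your dataset):

```python
import duckdb

# Preview the first few rows to sanity-check parsing.
print(duckdb.sql("SELECT * FROM read_csv_auto('data.csv') LIMIT 5").df())

# Verify the row count against what the source system claims.
print(duckdb.sql("SELECT COUNT(*) AS n_rows FROM read_csv_auto('data.csv')").df())

# Inspect the inferred column types; a numeric field showing up as VARCHAR
# usually means dirty values (e.g. 'N/A', thousands separators) in the column.
print(duckdb.sql("DESCRIBE SELECT * FROM read_csv_auto('data.csv')").df())
```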
2. Measure null rates
Write a DuckDB query that computes (COUNT(*) - COUNT(col)) / COUNT(*) for every column and sort the results in descending order. Any column above 5% nulls gets flagged: either the data is genuinely missing or the ETL process has a gap. See the sketch below.
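One way to compute the rate for every column without typing each name by hand is to read the schema first and build the expressions in Python. A minimal sketch, assuming the data has been loaded into a table called data:

```python
import duckdb

con = duckdb.connect()
con.sql("CREATE OR REPLACE TABLE data AS SELECT * FROM read_csv_auto('data.csv')")

# DESCRIBE returns one row per column; the first field is the column name.
cols = [row[0] for row in con.sql("DESCRIBE data").fetchall()]

# One null-rate expression per column, evaluated in a single table scan.
exprs = ", ".join(
    f'ROUND(100.0 * (COUNT(*) - COUNT("{c}")) / COUNT(*), 2) AS "{c}"' for c in cols
)
null_rates = con.sql(f"SELECT {exprs} FROM data").df().T  # one row per column
null_rates.columns = ["null_pct"]
null_rates = null_rates.sort_values("null_pct", ascending=False)
print(null_rates)

# Columns above the 5% threshold need investigation.
print(null_rates[null_rates["null_pct"] > 5])
```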
3. Find duplicate rows
Run SELECT COUNT(*) FROM (SELECT DISTINCT * FROM data) and compare it to the total row count. Then identify the natural key (e.g., user_id + timestamp) and check for duplicates on that subset, as sketched below. Duplicate rows in event data are usually a pipeline bug, not noise.
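A sketch of both checks; user_id and event_ts are placeholder names for the natural key, so substitute whatever uniquely identifies a record in your dataset:

```python
import duckdb

con = duckdb.connect()
con.sql("CREATE OR REPLACE TABLE data AS SELECT * FROM read_csv_auto('data.csv')")

total    = con.sql("SELECT COUNT(*) FROM data").fetchone()[0]
distinct = con.sql("SELECT COUNT(*) FROM (SELECT DISTINCT * FROM data)").fetchone()[0]
print(f"exact duplicate rows: {total - distinct}")

# Duplicates on the natural key, most-repeated first.
print(con.sql("""
    SELECT user_id, event_ts, COUNT(*) AS n
    FROM data
    GROUP BY user_id, event_ts
    HAVING COUNT(*) > 1
    ORDER BY n DESC
    LIMIT 10
""").df())
```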
4. Detect numeric outliers
For each numeric column, compute Q1, Q3, and the IQR using PERCENTILE_CONT. Flag rows where value < Q1 - 1.5*IQR or value > Q3 + 1.5*IQR, and print the outlier count and percentage per column (see the sketch below). Then decide: are these data errors (a negative age) or genuine extreme values (a very large purchase)?
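A sketch of the IQR pass that discovers numeric columns from the inferred schema; the list of type prefixes is an assumption about what your schema contains, so extend it if needed:

```python
import duckdb

con = duckdb.connect()
con.sql("CREATE OR REPLACE TABLE data AS SELECT * FROM read_csv_auto('data.csv')")

numeric_prefixes = ("TINYINT", "SMALLINT", "INTEGER", "BIGINT", "HUGEINT",
                    "FLOAT", "DOUBLE", "DECIMAL")
numeric_cols = [
    row[0] for row in con.sql("DESCRIBE data").fetchall()
    if row[1].startswith(numeric_prefixes)
]

for col in numeric_cols:
    outliers, total = con.sql(f"""
        WITH bounds AS (
            SELECT
                PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY "{col}") AS q1,
                PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY "{col}") AS q3
            FROM data
        )
        SELECT
            COUNT(*) FILTER (
                WHERE "{col}" < q1 - 1.5 * (q3 - q1)
                   OR "{col}" > q3 + 1.5 * (q3 - q1)
            ) AS outlier_count,
            COUNT(*) AS total_count
        FROM data, bounds
    """).fetchone()
    print(f"{col}: {outliers} outliers ({100.0 * outliers / total:.2f}%)")
```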
5. Profile categorical cardinality
For each categorical column, compute COUNT(DISTINCT value) and the top 10 most frequent values with their frequencies, as sketched below. Low-cardinality columns (under 20 unique values) are good candidates for enums. High-cardinality columns where 95% of values appear only once are a sign of free-text fields being misused as structured data.
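A sketch that treats every VARCHAR column as categorical, which is a simplifying assumption; adjust the column list if you have numeric codes that are really categories:

```python
import duckdb

con = duckdb.connect()
con.sql("CREATE OR REPLACE TABLE data AS SELECT * FROM read_csv_auto('data.csv')")

cat_cols = [
    row[0] for row in con.sql("DESCRIBE data").fetchall() if row[1] == "VARCHAR"
]

for col in cat_cols:
    n_distinct = con.sql(f'SELECT COUNT(DISTINCT "{col}") FROM data').fetchone()[0]
    top10 = con.sql(f"""
        SELECT "{col}" AS value, COUNT(*) AS freq
        FROM data
        GROUP BY 1
        ORDER BY freq DESC
        LIMIT 10
    """).df()
    print(f"\n{col}: {n_distinct} distinct values")
    print(top10)
```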
6. Write the quality report
Produce a data quality report: a table with column, type, null_rate, duplicate_flag, outlier_count, and cardinality, plus a summary section with an overall data quality score. Save it as CSV or markdown; this is what you attach to a pull request when the data lands in the warehouse. A sketch follows below.
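A minimal sketch of assembling the report; it covers type, null rate, and cardinality, and leaves hooks for the duplicate and outlier numbers computed in the earlier steps:

```python
import duckdb
import pandas as pd

con = duckdb.connect()
con.sql("CREATE OR REPLACE TABLE data AS SELECT * FROM read_csv_auto('data.csv')")

rows = []
for name, dtype, *_ in con.sql("DESCRIBE data").fetchall():
    null_rate = con.sql(
        f'SELECT 100.0 * (COUNT(*) - COUNT("{name}")) / COUNT(*) FROM data'
    ).fetchone()[0]
    cardinality = con.sql(f'SELECT COUNT(DISTINCT "{name}") FROM data').fetchone()[0]
    rows.append({
        "column": name,
        "type": dtype,
        "null_rate_pct": round(null_rate, 2),
        "cardinality": cardinality,
        # Merge in duplicate_flag and outlier_count from steps 3 and 4 here.
        "needs_remediation": null_rate > 5,  # threshold from step 2
    })

report = pd.DataFrame(rows)
report.to_csv("data_quality_report.csv", index=False)
# to_markdown() needs the optional 'tabulate' package; paste the output into the PR.
print(report.to_markdown(index=False))
```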