Beginner · Analytics Engineer

Profile a dataset for quality issues with DuckDB

Take a raw CSV dataset (at least 50,000 rows) and produce a complete data quality profile: null rates per column, duplicate row counts, outlier detection using IQR for numeric columns, cardinality for categorical columns, and a summary report that flags any column needing remediation.

Why this matters

Data quality problems are silent; a model trained on a key feature that is 20% null produces subtly wrong results that never surface as errors. Profiling a dataset before using it is the data engineering equivalent of reading the code before running it. The output is also a prerequisite for any SLA-based data contract: you cannot promise data is complete if you have never measured it.

Step-by-step guide

  1. Load and inspect with DuckDB

    Use duckdb.sql("SELECT * FROM read_csv_auto('data.csv') LIMIT 5").df() to preview. Run SELECT COUNT(*) to verify row count. Run DESCRIBE on the table to see inferred column types. A numeric column inferred as VARCHAR is a common sign of data quality issues.

  2. Measure null rates

    Write a DuckDB query that computes (COUNT(*) - COUNT(col)) / COUNT(*) for every column. Sort descending. Any column above 5% null is flagged; either the data is genuinely missing or the ETL process has a gap.

  3. Find duplicate rows

    Run SELECT COUNT(*) FROM (SELECT DISTINCT * FROM data) and compare to the total. Then identify the natural key (e.g., user_id + timestamp) and check for duplicates on that subset. Duplicate rows in event data are usually a pipeline bug, not noise.

  4. Detect numeric outliers

    For each numeric column, compute Q1, Q3, and IQR using PERCENTILE_CONT. Flag rows where value < Q1 - 1.5*IQR or value > Q3 + 1.5*IQR. Print the outlier count and percentage per column. Decide: are these data errors (negative age) or genuine extreme values (very large purchase)?

  5. Profile categorical cardinality

    For each categorical column, compute COUNT(DISTINCT value) and the top 10 most frequent values with frequencies. Low-cardinality columns (under 20 unique values) are good candidates for enums. High-cardinality columns where 95% of values appear only once are a sign of free-text fields being misused as structured data.

  6. Write the quality report

    Produce a data quality report: a table with column, type, null_rate, duplicate_flag, outlier_count, cardinality, and a summary section with overall data quality score. Save as CSV or markdown; this is what you attach to a pull request when the data lands in the warehouse.
