Smoke and Sanity Testing

Two fast verification techniques used at different stages of delivery. They are often confused: both are narrow in scope, but they serve different purposes.

Smoke vs Sanity

            Smoke Testing                               Sanity Testing
Purpose     Verify basic system health after a build    Verify a specific fix or feature works
Scope       Surface-level, breadth over depth           Narrow, depth over breadth
When        After every build/deploy                    After a bug fix or targeted change
Automated   Yes (CI gate)                               Often manual
If fails    Block further testing                       Reject the build and return for rework
Coverage    Critical paths only (~5-10 tests)           The fixed area only

Smoke Test — What to Cover

Smoke tests verify the application hasn't fallen over, not that it's correct.

Must cover:
  ✓ Service is responding (health endpoint returns 200)
  ✓ Database is connected (DB health check passes)
  ✓ Authentication flow works (can log in)
  ✓ Critical happy paths work (can browse products, can add to cart)
  ✓ Key third-party integrations responding (payment ping, email ping)

Should NOT cover:
  ✗ Edge cases
  ✗ Permissions for every role
  ✗ Full regression across all features

Smoke Test Suite (Python + httpx)

# tests/smoke/test_smoke.py
import os

import pytest
import httpx

BASE_URL = os.environ.get("APP_URL", "https://staging.myapp.com")

@pytest.fixture(scope="session")
def client():
    with httpx.Client(base_url=BASE_URL, timeout=30) as c:
        yield c

def test_health_endpoint_ok(client):
    response = client.get("/health")
    assert response.status_code == 200
    assert response.json()["status"] == "healthy"

def test_database_healthy(client):
    response = client.get("/health/db")
    assert response.status_code == 200

def test_homepage_loads(client):
    response = client.get("/")
    assert response.status_code == 200
    assert len(response.text) > 100

def test_login_api_responds(client):
    response = client.post("/api/auth/token", json={
        "email": os.environ["SMOKE_USER_EMAIL"],
        "password": os.environ["SMOKE_USER_PASSWORD"],
    })
    assert response.status_code == 200
    assert "access_token" in response.json()

def get_token(client):
    # Log in with the smoke user and return a bearer token for
    # authenticated smoke checks.
    response = client.post("/api/auth/token", json={
        "email": os.environ["SMOKE_USER_EMAIL"],
        "password": os.environ["SMOKE_USER_PASSWORD"],
    })
    response.raise_for_status()
    return response.json()["access_token"]

def test_products_api_returns_data(client):
    token = get_token(client)
    response = client.get("/api/products", headers={"Authorization": f"Bearer {token}"})
    assert response.status_code == 200
    assert len(response.json()["data"]) > 0

Smoke Tests in CI Pipeline

# .github/workflows/ci.yaml
jobs:
  smoke:
    needs: [deploy-staging]         # runs after staging deploy
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v4
    - run: pip install pytest pytest-timeout httpx
    - name: Run smoke tests
      env:
        APP_URL: ${{ vars.STAGING_URL }}
        SMOKE_USER_EMAIL: ${{ secrets.SMOKE_USER_EMAIL }}
        SMOKE_USER_PASSWORD: ${{ secrets.SMOKE_USER_PASSWORD }}
      run: pytest tests/smoke/ -v --timeout=60 --tb=short

  integration:                      # smoke gates this
    needs: [smoke]
    ...

Sanity Testing — Examples

Bug fix: "Order total wasn't including VAT"
Sanity test: place an order and verify the total = subtotal + VAT (exact calculation)
             verify the invoice shows the VAT breakdown
             verify the VAT rate changes correctly for different product categories

Feature: "Added promo code support"
Sanity test: apply a valid promo code → correct discount applied
             apply an expired promo code → rejected with clear error
             apply a promo code with usage limit reached → rejected

Don't run the full regression during sanity — just the affected area.
If sanity fails, return the build. Don't proceed to regression.
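
A minimal sketch of the VAT sanity check above, as a pytest test. The /api/orders endpoint, the response fields, and the 20% standard rate are all assumptions for illustration — substitute the real API from the bug ticket:

# tests/sanity/test_vat_fix.py — sketch; endpoint path, payload shape, and the
# 20% rate are assumptions, not the real API
import os
import httpx

BASE_URL = os.environ.get("APP_URL", "https://staging.myapp.com")

def test_order_total_includes_vat():
    with httpx.Client(base_url=BASE_URL, timeout=30) as client:
        order = client.post("/api/orders", json={
            "items": [{"sku": "STD-001", "qty": 2, "unit_price": 10.00}],
        }).json()
        vat = round(order["subtotal"] * 0.20, 2)   # assumed standard rate
        assert order["vat"] == vat
        assert order["total"] == order["subtotal"] + vat

Narrow on purpose: one scenario per acceptance criterion in the ticket, nothing else.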

Production Smoke (Synthetic Monitoring)

# AWS Lambda — synthetic canary (runs every 5 min against production)
import urllib.request

def handler(event, context):
    checks = [
        ("Health endpoint", "https://api.myapp.com/health"),
        ("Products API", "https://api.myapp.com/api/products"),
        ("Homepage", "https://www.myapp.com/"),
    ]

    failures = []
    for name, url in checks:
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                if resp.status != 200:
                    failures.append(f"{name}: HTTP {resp.status}")
        except Exception as e:
            failures.append(f"{name}: {e}")

    if failures:
        raise Exception(f"Smoke failures: {', '.join(failures)}")

    return {"statusCode": 200, "body": f"All {len(checks)} checks passed"}
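
The every-5-minutes cadence needs a schedule attached to the function. A CloudFormation sketch, assuming the Lambda above is declared as a resource named SmokeCanary (the resource names are assumptions):

# CloudFormation resources — sketch; SmokeCanary is an assumed resource name
CanarySchedule:
  Type: AWS::Events::Rule
  Properties:
    ScheduleExpression: rate(5 minutes)
    Targets:
      - Arn: !GetAtt SmokeCanary.Arn
        Id: smoke-canary

CanaryInvokePermission:
  Type: AWS::Lambda::Permission
  Properties:
    FunctionName: !Ref SmokeCanary
    Action: lambda:InvokeFunction
    Principal: events.amazonaws.com
    SourceArn: !GetAtt CanarySchedule.Arn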

Common Failure Cases

Smoke tests depend on shared test credentials that expire or get deleted
Why:    SMOKE_USER_EMAIL and SMOKE_USER_PASSWORD are created manually once and
        never rotated, so when the account is cleaned up or the password expires,
        the entire smoke suite fails in a way that looks like an application regression.
Detect: the smoke suite suddenly fails on login after a period of stability, and
        the failure is reproducible only in CI, not locally with fresh credentials.
Fix:    create the smoke test credentials via a setup step in the CI pipeline itself,
        using a factory or a seed command (sketched below), and inject them as
        environment variables rather than storing long-lived credentials in secrets.
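
A sketch of that setup step for the GitHub Actions pipeline above — seed_smoke_user is a hypothetical management command, standing in for whatever user-creation path the app actually exposes:

# CI step — sketch; `manage.py seed_smoke_user` is hypothetical
- name: Create ephemeral smoke user
  run: |
    SMOKE_USER_EMAIL="smoke+${GITHUB_RUN_ID}@myapp.com"
    SMOKE_USER_PASSWORD="$(openssl rand -base64 24)"
    echo "::add-mask::$SMOKE_USER_PASSWORD"
    python manage.py seed_smoke_user --email "$SMOKE_USER_EMAIL" --password "$SMOKE_USER_PASSWORD"
    echo "SMOKE_USER_EMAIL=$SMOKE_USER_EMAIL" >> "$GITHUB_ENV"
    echo "SMOKE_USER_PASSWORD=$SMOKE_USER_PASSWORD" >> "$GITHUB_ENV"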

The smoke suite grows into a mini-regression suite
Why:    engineers add tests to the smoke suite because it's the fastest thing that
        runs, until it covers 80 tests and takes 20 minutes.
Detect: the smoke suite duration exceeds 5 minutes, or it tests anything beyond the
        five must-cover categories (health, DB, auth, critical paths, third-party pings).
Fix:    enforce a maximum of 10 smoke tests with a CI check on the count (sketched
        below); anything beyond basic health checks belongs in Tier 2.
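
One way to enforce the cap as an extra CI step — a sketch that counts collected test ids from pytest's --collect-only output:

# CI step — fail the job if the smoke suite grows past 10 tests
- name: Enforce smoke suite size
  run: |
    count=$(pytest tests/smoke/ --collect-only -q | grep -c '::' || true)
    if [ "$count" -gt 10 ]; then
      echo "Smoke suite has $count tests (max 10); move the extras out of smoke"
      exit 1
    fi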

Synthetic monitoring canary runs against a static URL that is redirected in production
Why:    the canary Lambda was written when the API was at /api/products, but a later
        deployment moved it to /v2/api/products; the canary still returns 200 because
        the old path redirects rather than 404ing.
Detect: the canary reports healthy while real users are hitting the new endpoint path;
        compare the URLs the canary tests against the OpenAPI spec on every deploy.
Fix:    derive synthetic monitoring URLs from the deployed OpenAPI spec or a service
        discovery endpoint rather than hardcoding them (see the sketch below); add an
        assertion that the response body is non-empty and contains expected schema
        keys, not just a 200 status code.
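
A sketch of deriving the canary's URL list from the deployed spec, assuming the service publishes it at /openapi.json (the FastAPI default; the path is an assumption):

# sketch — assumes an OpenAPI spec is served at /openapi.json
import json
import urllib.request

BASE = "https://api.myapp.com"

def canary_urls():
    with urllib.request.urlopen(f"{BASE}/openapi.json", timeout=10) as resp:
        spec = json.loads(resp.read())
    # Check only parameter-free GET paths; {id}-style routes need real data.
    return [
        (path, f"{BASE}{path}")
        for path, methods in spec["paths"].items()
        if "get" in methods and "{" not in path
    ]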

Sanity test re-runs the full regression after a bug fix
Why:    when a fix is delivered, the team defaults to running the complete regression
        suite "just to be safe," eliminating the time-saving point of sanity testing.
Detect: cycle time from "fix delivered" to "fix accepted" is 4+ hours because a full
        regression runs every time.
Fix:    define the sanity scope explicitly in the bug ticket (which component, which
        test scenarios), and run only those plus a smoke check; document the decision
        to skip full regression in the ticket.

Connections

qa-hub · qa/regression-testing · qa/test-strategy · qa/qa-in-devops · qa/qa-metrics · cloud/cloud-monitoring

Open Questions

  • What testing scenarios does this technique systematically miss?
  • How does this approach need to change when delivery cadence moves to continuous deployment?