API Performance Testing

Measuring, baselining, and regression-testing API latency and throughput.

Metrics to Capture

Latency percentiles:
  p50 (median): typical experience
  p90:          most users' experience
  p95:          near-worst case
  p99:          worst case for 1 in 100 requests
  p99.9 (p999): tail latency — important for high-volume services

  Never optimise for average — outliers hurt user experience.
  Never report average alone — always report p95 or p99 (see the percentile sketch at the end of this section).

Throughput:
  RPS (requests per second): how many requests the system handles
  Saturation point: RPS at which latency starts to degrade

Error rate:
  % of requests returning 4xx or 5xx
  Target: < 0.1% under normal load, < 1% under peak

Resource utilisation under load:
  CPU: target < 70% at peak (headroom for spikes)
  Memory: track for leaks (linearly increasing = memory leak)
  DB connections: pool exhaustion causes latency spikes
  Event loop lag (Node.js): > 10ms indicates synchronous work or blocking I/O on the loop
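
To make the percentile guidance above concrete, here is a small standalone sketch (the latency numbers are made up) showing how a healthy-looking mean can hide tail latency:

# percentile_demo.py — hypothetical samples illustrating mean vs tail divergence
import statistics

latencies_ms = [20] * 98 + [900, 1200]   # 98 fast requests plus 2 slow outliers

cuts = statistics.quantiles(latencies_ms, n=100)   # 99 cut points; cuts[k-1] is roughly the k-th percentile
print(f"mean={statistics.mean(latencies_ms):.0f}ms")                        # ~41ms, looks healthy
print(f"p50={cuts[49]:.0f}ms  p95={cuts[94]:.0f}ms  p99={cuts[98]:.0f}ms")  # p99 ~1200ms exposes the outliers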

k6 API Performance Test

// api-performance.js
import http from "k6/http";
import { check, sleep } from "k6";
import { Rate, Trend, Counter } from "k6/metrics";

const errorRate = new Rate("error_rate");
const orderLatency = new Trend("order_creation_ms", true);   // true = time metric, reported as durations
const ordersCreated = new Counter("orders_created");

export const options = {
    stages: [
        { duration: "1m", target: 10 },     // ramp up
        { duration: "5m", target: 50 },     // steady state
        { duration: "2m", target: 100 },    // stress
        { duration: "1m", target: 0 },      // ramp down
    ],
    thresholds: {
        "http_req_duration{api:list_orders}": ["p(95)<200"],   // tagged threshold
        "http_req_duration{api:create_order}": ["p(95)<500"],
        "error_rate": ["rate<0.01"],         // < 1% errors
        "http_req_failed": ["rate<0.01"],
    },
};

const BASE_URL = __ENV.BASE_URL || "http://localhost:8000";
const AUTH_TOKEN = __ENV.AUTH_TOKEN;

function createOrder() {
    const start = Date.now();
    const payload = JSON.stringify({
        product_id: "prod_abc123",
        quantity: 1,
        user_id: `user_${Math.floor(Math.random() * 1000)}`,
    });

    const response = http.post(`${BASE_URL}/api/orders`, payload, {
        headers: {
            "Content-Type": "application/json",
            "Authorization": `Bearer ${AUTH_TOKEN}`,
        },
        tags: { api: "create_order" },
    });

    const latency = Date.now() - start;
    orderLatency.add(latency);
    errorRate.add(response.status >= 400);

    const ok = check(response, {
        "status is 201": (r) => r.status === 201,
        "has order id": (r) => r.json("id") !== undefined,
        "latency < 1s": () => latency < 1000,
    });

    if (ok) ordersCreated.add(1);
    return response;
}

function listOrders() {
    const response = http.get(`${BASE_URL}/api/orders`, {
        headers: { "Authorization": `Bearer ${AUTH_TOKEN}` },
        tags: { api: "list_orders" },   // matches the tagged threshold above
    });
    errorRate.add(response.status >= 400);
    check(response, { "status is 200": (r) => r.status === 200 });
    return response;
}

export default function () {
    listOrders();
    createOrder();
    sleep(1);
}
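
Assuming the script is saved as api-performance.js, a typical invocation looks like this (the BASE_URL and AUTH_TOKEN values are placeholders):

k6 run -e BASE_URL=https://staging.example.com -e AUTH_TOKEN=$TOKEN api-performance.js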

Python Baseline Benchmark Script

# benchmark.py — capture baseline metrics for regression detection
import asyncio, statistics, time, json
from pathlib import Path
import httpx

API_BASE = "http://localhost:8000"
ITERATIONS = 200
CONCURRENCY = 10

async def measure_endpoint(
    method: str, path: str, payload: dict | None = None, iterations: int = ITERATIONS
) -> dict:
    latencies_ms = []
    errors = 0

    semaphore = asyncio.Semaphore(CONCURRENCY)

    async def single_request(client: httpx.AsyncClient) -> None:
        nonlocal errors
        async with semaphore:
            start = time.perf_counter()
            try:
                if method == "GET":
                    r = await client.get(f"{API_BASE}{path}", timeout=10)
                else:
                    r = await client.post(f"{API_BASE}{path}", json=payload, timeout=10)
                if r.status_code >= 400:
                    errors += 1
            except Exception:
                errors += 1
                return
            latencies_ms.append((time.perf_counter() - start) * 1000)

    # Share one client across requests so per-request connection setup
    # does not dominate the measured latency.
    async with httpx.AsyncClient() as client:
        await asyncio.gather(*[single_request(client) for _ in range(iterations)])

    if not latencies_ms:
        return {"error": "all requests failed"}

    sorted_lat = sorted(latencies_ms)
    n = len(sorted_lat)
    return {
        "endpoint": f"{method} {path}",
        "iterations": iterations,
        "error_rate": errors / iterations,
        "p50_ms": sorted_lat[int(n * 0.50)],
        "p90_ms": sorted_lat[int(n * 0.90)],
        "p95_ms": sorted_lat[int(n * 0.95)],
        "p99_ms": sorted_lat[int(n * 0.99)],
        "mean_ms": statistics.mean(latencies_ms),
        "stddev_ms": statistics.stdev(latencies_ms) if n > 1 else 0,
    }

async def main() -> None:
    results = {
        "list_orders": await measure_endpoint("GET", "/api/orders"),
        "get_order": await measure_endpoint("GET", "/api/orders/ord_test123"),
        "create_order": await measure_endpoint("POST", "/api/orders",
                                               {"product_id": "prod_123", "quantity": 1}),
    }

    Path("baseline.json").write_text(json.dumps(results, indent=2))
    print(json.dumps(results, indent=2))

if __name__ == "__main__":   # guard so measure_endpoint can be imported without re-running the benchmark
    asyncio.run(main())
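
Run this against a warmed service (or deliberately cold; see the failure cases below), commit the resulting benchmarks/baseline.json, and regenerate it only when a latency change is intentional:

python benchmark.py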

Regression Detection in CI

# test_performance_regression.py
import json, pytest
from pathlib import Path

BASELINE_PATH = Path("benchmarks/baseline.json")
THRESHOLD_MULTIPLIER = 1.25   # allow up to 25% regression

@pytest.fixture(scope="session")
def baseline() -> dict:
    if not BASELINE_PATH.exists():
        pytest.skip("No baseline file — run benchmark.py first")
    return json.loads(BASELINE_PATH.read_text())

@pytest.fixture(scope="session")
def current(benchmark_results) -> dict:
    return benchmark_results   # injected from conftest

@pytest.mark.parametrize("endpoint", ["list_orders", "get_order", "create_order"])
@pytest.mark.parametrize("percentile", ["p95_ms", "p99_ms"])
def test_no_latency_regression(
    baseline: dict, current: dict, endpoint: str, percentile: str
) -> None:
    if endpoint not in baseline:
        pytest.skip(f"No baseline for {endpoint}")

    baseline_val = baseline[endpoint][percentile]
    current_val = current[endpoint][percentile]
    threshold = baseline_val * THRESHOLD_MULTIPLIER

    assert current_val <= threshold, (
        f"{endpoint} {percentile}: {current_val:.1f}ms > {threshold:.1f}ms "
        f"(baseline: {baseline_val:.1f}ms, regression: {((current_val/baseline_val)-1)*100:.1f}%)"
    )

def test_error_rates_acceptable(baseline: dict, current: dict) -> None:
    for endpoint, stats in current.items():
        assert stats["error_rate"] < 0.01, (
            f"{endpoint} error rate {stats['error_rate']:.2%} exceeds 1% threshold"
        )
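
The benchmark_results fixture referenced above is assumed to live in conftest.py; a minimal sketch, reusing measure_endpoint from benchmark.py, might look like this:

# conftest.py — hypothetical session fixture that measures the current build
import asyncio
import pytest

from benchmark import measure_endpoint

@pytest.fixture(scope="session")
def benchmark_results() -> dict:
    async def run() -> dict:
        return {
            "list_orders": await measure_endpoint("GET", "/api/orders"),
            "get_order": await measure_endpoint("GET", "/api/orders/ord_test123"),
            "create_order": await measure_endpoint(
                "POST", "/api/orders", {"product_id": "prod_123", "quantity": 1}
            ),
        }

    return asyncio.run(run())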

SLO Validation Test

# slo_check.py — validates measured percentiles against per-endpoint SLOs
import asyncio

from benchmark import measure_endpoint   # reuses the benchmark harness above

SLOS = {
    "GET /api/orders": {"p95_ms": 200, "p99_ms": 500},
    "POST /api/orders": {"p95_ms": 500, "p99_ms": 1000},
    "GET /api/products": {"p95_ms": 100, "p99_ms": 300},
}

async def validate_slos() -> list[str]:
    violations = []
    for endpoint, targets in SLOS.items():
        method, path = endpoint.split(" ", 1)
        results = await measure_endpoint(method, path)
        for metric, target in targets.items():
            actual = results.get(metric, float("inf"))
            if actual > target:
                violations.append(
                    f"{endpoint} {metric}: {actual:.0f}ms > SLO {target}ms"
                )
    return violations
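
A minimal way to wire this into CI (assuming the module above is saved alongside benchmark.py) is to exit non-zero when any SLO is violated:

if __name__ == "__main__":
    found = asyncio.run(validate_slos())
    for violation in found:
        print(f"SLO VIOLATION: {violation}")
    raise SystemExit(1 if found else 0)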

Common Failure Cases

Baseline captures warm-cache performance, so regression tests always pass
  Why: the baseline benchmark script is run after the service has already processed requests and warmed its caches, making the baseline artificially fast.
  Detect: the baseline p95 is suspiciously low relative to production p95 values; the first cold run in CI is measurably slower.
  Fix: flush caches explicitly before capturing a baseline, or record both warm and cold baselines and gate on the appropriate one per environment.

k6 threshold passes but real user experience is poor because average latency was used
  Why: the k6 default summary shows avg prominently; teams set thresholds on avg rather than p95 or p99, which masks tail latency.
  Detect: the threshold passes, but p99 in the same run exceeds your SLO by 3-5x.
  Fix: only set thresholds on p(95) or p(99), never on avg, as shown in the k6 options.thresholds config above.

Performance regression test flags a false positive due to CI runner variance
  Why: shared CI runners have variable CPU and memory availability; a 25% threshold multiplier is too tight for noisy infrastructure.
  Detect: the regression test fails intermittently on the same commit across successive runs with no code change.
  Fix: raise the multiplier to 1.5 for CI and run the tight regression benchmark (1.25) only on a dedicated performance runner or as a scheduled job, not on every PR (see the sketch below).
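
One lightweight way to implement the per-environment multiplier is to read it from the environment in test_performance_regression.py (the variable name here is hypothetical, set by your CI pipeline):

# Loosen the regression threshold on shared CI runners, keep it tight elsewhere.
import os

THRESHOLD_MULTIPLIER = float(os.environ.get("PERF_THRESHOLD_MULTIPLIER", "1.5"))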

Load test saturates the test database instead of the application under test
  Why: k6 sends 100 concurrent requests that each require a DB write; the database connection pool exhausts at 20 connections and the app starts returning 503s.
  Detect: error rate spikes while CPU on the app server is below 30%; database connection wait time is high in pg_stat_activity.
  Fix: either configure the load test to use a pre-seeded read-only dataset for read-heavy endpoints, or provision a database with connection pooling matching production (e.g. PgBouncer) before running write-heavy load tests.

Connections

tqa-hub · technical-qa/load-testing-advanced · technical-qa/performance-testing · qa/performance-testing-qa · qa/continuous-testing · cs-fundamentals/observability-se

Open Questions

  • What is the most common failure mode when implementing this at scale?
  • How does this testing approach need to adapt for distributed or microservice architectures?