Performance Testing (QA Perspective)
QA's role in performance: defining NFR acceptance criteria, running performance tests as quality gates, and communicating results to stakeholders. Distinct from engineering-level load testing tooling.
Performance NFRs as Acceptance Criteria
Every feature with a user-facing response should have performance ACs:
Pattern:
```
GIVEN {scenario}
WHEN {action}
THEN {response within X ms at Yth percentile under Z concurrent users}
```
Examples:
```
GIVEN the product catalogue has 10,000 products
WHEN a user searches by category
THEN results are returned within 500ms at p95 under 100 concurrent users

GIVEN the checkout page is loaded
WHEN a user submits payment
THEN the confirmation is displayed within 3 seconds at p99 under 50 concurrent users

GIVEN the reports dashboard is accessed
WHEN a user generates a monthly report
THEN the report renders within 10 seconds at p95 (async is acceptable)
```
Performance AC failure = same severity as functional AC failure.
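One way to make an AC like the first example executable is a thread-pooled pytest check. A minimal sketch, assuming the staging search endpoint used in the baseline script below and treating `ThreadPoolExecutor` as a rough stand-in for concurrent users (a dedicated load tool is more faithful):

```python
# Sketch: the catalogue-search AC above as an executable check.
# REQUESTS_PER_USER and the thread-pool approximation of "100 concurrent
# users" are assumptions, not part of the AC itself.
from concurrent.futures import ThreadPoolExecutor
import time

import httpx

SEARCH_URL = "https://staging.myapp.com/api/search?q=widget"
CONCURRENT_USERS = 100
REQUESTS_PER_USER = 5
P95_THRESHOLD_MS = 500


def timed_get(url: str) -> float:
    start = time.perf_counter()
    try:
        httpx.get(url, timeout=10.0)
    except httpx.HTTPError:
        pass  # errored requests still count toward latency in this sketch
    return (time.perf_counter() - start) * 1000


def test_search_meets_p95_ac():
    urls = [SEARCH_URL] * (CONCURRENT_USERS * REQUESTS_PER_USER)
    with ThreadPoolExecutor(max_workers=CONCURRENT_USERS) as pool:
        latencies = sorted(pool.map(timed_get, urls))
    p95 = latencies[int(len(latencies) * 0.95)]
    assert p95 <= P95_THRESHOLD_MS, f"p95 {p95:.0f}ms exceeds {P95_THRESHOLD_MS}ms"
```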
Establishing a Baseline
```python
# tools/baseline.py — run before making changes; compare after
from datetime import date
import json
import statistics
import time

import httpx


def measure_endpoint(url: str, n: int = 100, headers: dict | None = None) -> dict:
    """Hit one endpoint n times; return latency percentiles and error rate."""
    latencies = []
    errors = 0
    for _ in range(n):
        start = time.perf_counter()
        try:
            resp = httpx.get(url, headers=headers, timeout=10.0)
            if resp.status_code >= 500:
                errors += 1
        except httpx.HTTPError:
            errors += 1
        latencies.append((time.perf_counter() - start) * 1000)
    latencies.sort()
    return {
        "url": url,
        "n": n,
        "p50": round(statistics.median(latencies), 1),
        "p95": round(latencies[int(n * 0.95)], 1),
        "p99": round(latencies[int(n * 0.99)], 1),
        "max": round(max(latencies), 1),
        "error_rate": round(errors / n, 4),
    }


# Run and save baseline before a release
if __name__ == "__main__":
    endpoints = [
        "https://staging.myapp.com/api/products",
        "https://staging.myapp.com/api/products/p-001",
        "https://staging.myapp.com/api/search?q=widget",
    ]
    baseline = [measure_endpoint(url) for url in endpoints]
    with open(f"baselines/baseline-{date.today()}.json", "w") as f:
        json.dump(baseline, f, indent=2)
```
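The "compare after" half of that comment isn't shown above. A minimal sketch of one way to diff two baseline files, assuming the JSON layout produced by `measure_endpoint` (the script name is hypothetical):

```python
# tools/compare_baselines.py — hypothetical companion to baseline.py:
# prints the p95 delta per endpoint between two baseline JSON files.
# Usage: python compare_baselines.py baseline-old.json baseline-new.json
import json
import sys


def load(path: str) -> dict:
    with open(path) as f:
        return {entry["url"]: entry for entry in json.load(f)}


if __name__ == "__main__":
    before, after = load(sys.argv[1]), load(sys.argv[2])
    for url, base in before.items():
        cur = after.get(url)
        if cur is None:
            print(f"{url}: missing from new baseline")
            continue
        delta = (cur["p95"] / base["p95"] - 1) * 100
        print(f"{url}: p95 {base['p95']}ms -> {cur['p95']}ms ({delta:+.0f}%)")
```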
Performance Regression Check
```python
# tests/performance/test_no_regression.py
import json
import time

import httpx
import pytest

REGRESSION_THRESHOLD = 1.20  # 20% slower than baseline = fail


@pytest.fixture(scope="module")
def baseline():
    with open("baselines/baseline-current.json") as f:
        return {e["url"]: e for e in json.load(f)}


def measure(url: str, n: int = 50) -> dict:
    latencies = []
    for _ in range(n):
        start = time.perf_counter()
        httpx.get(url, timeout=10.0)
        latencies.append((time.perf_counter() - start) * 1000)
    latencies.sort()
    return {"p95": latencies[int(n * 0.95)], "p99": latencies[int(n * 0.99)]}


@pytest.mark.parametrize("endpoint", [
    "https://staging.myapp.com/api/products",
    "https://staging.myapp.com/api/products/p-001",
])
def test_no_performance_regression(endpoint, baseline):
    current = measure(endpoint)
    base = baseline[endpoint]
    assert current["p95"] <= base["p95"] * REGRESSION_THRESHOLD, (
        f"p95 regression on {endpoint}: "
        f"current {current['p95']:.0f}ms vs baseline {base['p95']:.0f}ms "
        f"({(current['p95'] / base['p95'] - 1) * 100:.0f}% slower)"
    )
```
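Note the filename mismatch between the two scripts: `baseline.py` writes a dated file, while the fixture reads `baselines/baseline-current.json`. One plausible convention (an assumption, not something stated above) is to copy or symlink the most recent dated baseline to `baseline-current.json` once it is accepted as the reference, so the regression test always compares against a deliberately chosen baseline rather than whatever ran last.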
Performance Test Types for QA Sign-Off

**Load test** (required for every release):
- Purpose: verify the system meets its SLO at expected peak load
- Pass criteria: p95 within SLO, error rate < 1%
- Duration: 15-30 minutes

**Stress test** (required for capacity planning):
- Purpose: find the breaking point and validate auto-scaling triggers
- Pass criteria: documented capacity ceiling; graceful degradation at the ceiling
- When: quarterly or before major traffic events

**Soak test** (required for major features; see the sketch after this list):
- Purpose: catch memory leaks and connection exhaustion
- Pass criteria: performance stable for 2+ hours at 50% of peak load
- When: new background jobs, new caching layers, new connection pools

**Volume test** (required for data-heavy features):
- Purpose: verify queries don't degrade with production-sized datasets
- Pass criteria: p95 within SLO at 10x current data volume
- When: new DB queries, new reports, new analytics features
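A soak gate can reuse `measure_endpoint` from `tools/baseline.py`. A minimal sketch, assuming a two-hour window sampled every five minutes; the script name, drift limit, and sampling interval are illustrative, not prescribed:

```python
# tools/soak_check.py — hypothetical soak gate: sample p95 periodically
# and fail if the final sample has drifted well above the first, a crude
# signal of leaks or connection exhaustion.
import time

from baseline import measure_endpoint  # assumes tools/ is on PYTHONPATH

URL = "https://staging.myapp.com/api/products"
DURATION_S = 2 * 60 * 60  # 2 hours, per the soak pass criteria above
INTERVAL_S = 5 * 60       # sample every 5 minutes (illustrative)
MAX_DRIFT = 1.25          # fail on >25% p95 drift (illustrative)

if __name__ == "__main__":
    samples = []
    deadline = time.monotonic() + DURATION_S
    while time.monotonic() < deadline:
        samples.append(measure_endpoint(URL, n=50)["p95"])
        time.sleep(INTERVAL_S)
    drift = samples[-1] / samples[0]
    print(f"p95 samples (ms): {samples}")
    assert drift <= MAX_DRIFT, f"p95 drifted {(drift - 1) * 100:.0f}% over the soak"
```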
Reporting Performance Results to Stakeholders
## Performance Test Report — Release 1.4.2
**Test scope:** Checkout flow, product search, order history
**Environment:** Staging (same spec as prod)
**Date:** 2026-05-01
### Results vs SLOs
| Endpoint | SLO (p95) | Result (p95) | Status |
|---|---|---|---|
| GET /api/products | 500ms | 312ms | ✅ PASS |
| POST /api/checkout | 3000ms | 2104ms | ✅ PASS |
| GET /api/orders | 1000ms | 887ms | ✅ PASS |
| GET /api/search | 800ms | 1243ms | ❌ FAIL |
### Key Finding
Search endpoint fails p95 SLO under 80+ concurrent users. Root cause: missing
index on `products.search_vector`. Fix estimated: 2 hours.
### Recommendation
Block release on search fix. All other endpoints pass SLOs.
Performance Budget in CI
```yaml
# .github/workflows/performance.yaml
- name: Performance regression check
  run: pytest tests/performance/ -v --timeout=120 -m performance
  env:
    BASELINE_FILE: baselines/baseline-current.json
    APP_URL: ${{ vars.STAGING_URL }}

- name: Lighthouse CI budget
  run: |
    npx @lhci/cli autorun
    # .lighthouserc.yaml defines budgets per metric per page
```
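The `-m performance` filter in the first step assumes the tests carry that marker, which the regression test above does not yet set. A minimal sketch of the wiring (the marker also needs registering in `pytest.ini` or equivalent to avoid warnings):

```python
# Add near the top of tests/performance/test_no_regression.py so that
# `pytest -m performance` selects the module's tests.
import pytest

pytestmark = pytest.mark.performance
```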
Common Failure Cases

**Baseline captured from an unrepresentative environment (dev laptop, cold cache)**
- Why: the baseline script runs against a local server with no warm cache and no concurrent users; every subsequent measurement in staging looks like a regression even when performance has not changed.
- Detect: every release, the p95 for GET /api/products shows a "50% regression" while production metrics are unchanged.
- Fix: always capture baselines against the staging environment with the same warmup pass (10 requests discarded) and the same data volume as production; document the exact conditions in the baseline JSON metadata. A warmup sketch follows below.
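That warmup pass could be folded into `measure_endpoint`. A minimal sketch; the wrapper name and the idea of recording the warmup count in the result are assumptions (the 10-request discard comes from the fix above):

```python
# Hypothetical warmup wrapper around measure_endpoint from tools/baseline.py:
# fire and discard a few requests so cold caches and connection pools
# don't skew the baseline, then record the conditions in the metadata.
import httpx

from baseline import measure_endpoint  # assumes tools/ is on PYTHONPATH


def measure_with_warmup(url: str, warmup: int = 10, n: int = 100) -> dict:
    for _ in range(warmup):
        try:
            httpx.get(url, timeout=10.0)  # response discarded on purpose
        except httpx.HTTPError:
            pass
    result = measure_endpoint(url, n=n)
    result["warmup"] = warmup  # document the capture conditions
    return result
```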
**Performance ACs written in prose, not as measurable thresholds**
- Why: the story AC says "the page should load quickly" rather than specifying a percentile, latency threshold, and concurrent user count; the test cannot fail because there is no numeric threshold.
- Detect: test_no_performance_regression cannot be written because baseline[endpoint] has no p95 entry.
- Fix: require all performance ACs to follow the Given/When/Then format with an explicit percentile, latency threshold, and concurrent user count before the story enters the sprint.
**Regression threshold set too loosely — real regressions pass**
- Why: REGRESSION_THRESHOLD = 1.20 (20% allowed regression) masks a 15% slowdown on a 300ms endpoint (a 45ms real-world impact) release after release, until the accumulated drift violates the SLO.
- Detect: the test passes, but the stakeholder report shows the p95 for POST /api/checkout climbed from 1.8s to 2.1s over three releases.
- Fix: set the threshold relative to the SLO headroom, not a fixed percentage: if the SLO is 3s and the baseline is 2.1s, the allowed regression is at most (3.0 - 2.1) / 2.1 ≈ 43%, but prefer a hard ceiling of 10% to catch gradual degradation early. A helper for this arithmetic follows below.
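The headroom arithmetic from that fix as a small helper; a sketch, assuming SLO and baseline p95 are in the same units (the function name and the default cap mirror the text, nothing else is prescribed):

```python
# Hypothetical helper: derive the allowed regression fraction from SLO
# headroom, capped at a hard ceiling. REGRESSION_THRESHOLD in the test
# above would then be 1 + allowed_regression(slo_ms, baseline_p95_ms).
def allowed_regression(slo_ms: float, baseline_p95_ms: float, cap: float = 0.10) -> float:
    headroom = (slo_ms - baseline_p95_ms) / baseline_p95_ms
    return min(max(headroom, 0.0), cap)


# Worked example from the text: SLO 3000ms, baseline 2100ms.
# headroom = (3000 - 2100) / 2100 ≈ 0.43 -> capped to 0.10
assert allowed_regression(3000, 2100) == 0.10
```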
**Soak and volume tests only run manually before major releases — leaks slip through between releases**
- Why: the CI pipeline only includes the load test; soak and volume tests are run manually once per quarter, so memory leaks introduced in intermediate releases go undetected.
- Detect: heap memory grows by 20MB/hour in production Grafana charts; no automated test would have caught it.
- Fix: add the soak test as a nightly scheduled CI job (not a pre-release gate) so it catches regressions within 24 hours of introduction.
Connections
qa-hub · qa/non-functional-testing · qa/test-automation-strategy · technical-qa/load-testing-advanced · qa/qa-metrics · cloud/observability-stack
Open Questions
- What testing scenarios does this technique systematically miss?
- How does this approach need to change when delivery cadence moves to continuous deployment?