SLOs and SLAs as QA Governance
How QA teams derive, apply, and govern SLOs and error budgets as objective release gates — covering SLI/SLO/SLA vocabulary, error budget policies, go/no-go decision frameworks, and tooling with Prometheus and Datadog.
Service Level Objectives and Agreements are not just operational contracts between SRE teams and product owners. Used correctly, they become the most objective quality gates in your release pipeline — ones that cannot be gamed by a passing test suite, inflated coverage numbers, or optimistic sprint demos. This page covers how QA teams can derive, apply, and govern SLOs as first-class release signals.
The Vocabulary: SLI, SLO, SLA
These three terms are frequently conflated. Each plays a distinct role.
SLI — Service Level Indicator
An SLI is a quantitative measurement of a specific aspect of service behaviour. It is a ratio or rate computed from raw telemetry.
SLI = (good events / total events) over a time window
QA examples:
- `successful HTTP responses (2xx + 3xx) / total HTTP responses` — availability SLI
- `requests with latency < 300 ms / total requests` — latency SLI
- `search queries returning results in < 1 s / total search queries` — user-journey latency SLI
- `payment transactions completing without error / total payment attempts` — critical-path error SLI
- `data exports completing within 60 s / total export requests` — throughput SLI
An SLI must be measurable continuously from production telemetry. If you cannot instrument it, you cannot govern with it.
SLO — Service Level Objective
An SLO is the target value or range for an SLI. It is an internal commitment — agreed between engineering, product, and QA — but not formally contractual with customers.
SLO = SLI >= target, measured over a rolling window
QA examples:
- availability SLI >= 99.9% over a rolling 30-day window
- p99 latency SLI >= 95% of requests served in < 500 ms over a rolling 7-day window
- payment error SLI >= 99.95% success rate over a rolling 28-day window
The SLO target is the threshold against which QA gates releases. If you cannot meet the SLO in a staging environment under realistic load, you cannot ship.
SLA — Service Level Agreement
An SLA is a contractual commitment to customers, typically with financial remedies for breach (credits, refunds, termination rights). SLAs are almost always looser than internal SLOs — deliberately, so that internal violations catch problems before customers are impacted.
Concrete relationship:
- Internal SLO: availability >= 99.9%
- External SLA: availability >= 99.5% (or 10% credit applies)
The gap between SLO and SLA is the operational buffer. QA governs to the SLO. Legal defends to the SLA.
Consultant framing: When a client asks "are we meeting our SLAs?", the correct answer structure is: "Your SLA target is X. Your internal SLO is Y. Current performance is Z. You are [meeting / at risk / breaching] your SLO, which gives you [N days / N%] of headroom before the SLA is at risk."
Error Budgets
The error budget is the most powerful concept in SLO-based governance. It converts an abstract percentage into a concrete, time-bounded allowance for failure.
Calculation
Error budget = (1 - SLO target) * time window
Example:
SLO = 99.9% availability over 30 days
Error budget = 0.001 * 30 * 24 * 60 minutes
= 43.2 minutes of allowed downtime per 30-day window
Budget consumption
As incidents, deployments, and degraded periods occur, they consume the error budget. Tracking budget consumption over time gives you a real-time signal of system health that a green test suite cannot provide.
Budget remaining = total budget - consumed budget
Budget burn rate = (budget consumed in last N hours) / (expected budget consumption in N hours)
A burn rate > 1.0 means you are consuming the budget faster than its intended rate. At burn rate 1.0, you will exactly exhaust the budget at the end of the window. At burn rate 2.0, you will exhaust it halfway through.
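For concreteness, here is a minimal Python sketch of both formulas (the helper names are illustrative; the figures match the 99.9%/30-day example above):

def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Total allowed bad minutes in the window: (1 - target) * window length."""
    return (1 - slo_target) * window_days * 24 * 60

def burn_rate(budget_consumed: float, hours_elapsed: float, window_days: int) -> float:
    """Fraction of budget consumed vs what a 1.0x burn would have consumed by now."""
    expected_consumption = hours_elapsed / (window_days * 24)
    return budget_consumed / expected_consumption

print(error_budget_minutes(0.999, 30))  # 43.2 minutes of allowed downtime
print(burn_rate(0.10, 36, 30))          # 2.0: 10% of budget gone in 5% of the window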
Error budget as a release hold signal
This is the core QA governance mechanism:
Policy: If error budget remaining falls below a threshold, no new releases deploy until it recovers.
Typical thresholds:
- Budget consumed > 50% with > 50% of the window remaining: amber — releases require explicit sign-off
- Budget consumed > 75% with > 25% of the window remaining: red — releases blocked, only hotfixes
- Budget fully exhausted: freeze — no changes except incident remediation
Concrete example: You have a 30-day window. On day 10, you have already consumed 40% of your error budget. You have 20 more days but only 60% of the budget left. At that burn rate you will exhaust the budget by approximately day 25, five days before the window closes. Raw consumption has not yet crossed the 50% amber line, but the projection shows exhaustion inside the window, so this triggers amber — the QA lead raises a go/no-go flag before the next sprint release.
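The policy lends itself to a mechanical check. A minimal Python sketch of the thresholds above, treating a projection that exhausts the budget inside the window as an amber trigger (the function name and state labels are illustrative):

def budget_policy_state(budget_consumed: float, days_elapsed: float,
                        window_days: float = 30) -> str:
    """Map error budget state to the release-hold levels defined above."""
    if budget_consumed >= 1.0:
        return "FREEZE"  # budget exhausted: incident remediation only
    window_remaining = 1 - days_elapsed / window_days
    if budget_consumed > 0.75 and window_remaining > 0.25:
        return "RED"     # releases blocked, hotfixes only
    # projected day the budget hits 100% at the window-average burn rate
    projected_exhaustion = (days_elapsed / budget_consumed
                            if budget_consumed > 0 else float("inf"))
    if (budget_consumed > 0.50 and window_remaining > 0.50) \
            or projected_exhaustion < window_days:
        return "AMBER"   # releases require explicit sign-off
    return "GREEN"

print(budget_policy_state(0.40, 10))  # AMBER: projection exhausts budget at day 25

Day 10 with 40% consumed lands on amber via the projection branch, matching the concrete example above.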
The error budget policy should be a written artefact, reviewed by QA, SRE, and product at the start of each engagement. Without a written policy, the threshold becomes a negotiation at the point of crisis, which always favours shipping.
Burn rate alerts
Modern SRE practice (from the Google SRE Workbook, Chapter 5) recommends multi-window burn rate alerts rather than single threshold alerts:
| Burn rate | Time to exhaustion (30-day window) | Alert window | Severity |
|---|---|---|---|
| 14.4x | ~2 days | 1h + 5min | Page immediately |
| 6x | ~5 days | 6h + 30min | Page within 30 min |
| 3x | ~10 days | 1d + 6h | Ticket |
| 1x | 30 days | 3d | No alert |
QA should be subscribed to the ticket-level alerts (3x burn rate). Page-level alerts indicate an active incident, not a release governance signal.
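Each time-to-exhaustion figure is just the budget window divided by the burn rate. A minimal Python check (assuming the 30-day window used throughout this page):

window_days = 30  # budget window assumed throughout this page

# time to exhaustion = window length / burn rate
for rate in (14.4, 6, 3, 1):
    days = window_days / rate
    print(f"{rate:>5}x burn -> exhaustion in {days:4.1f} days ({days * 24:.0f} hours)")
# 14.4x -> 2.1 days, 6x -> 5.0 days, 3x -> 10.0 days, 1x -> 30.0 days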
Defining Meaningful SLOs
Most SLOs fail not because teams ignore them but because they define the wrong thing, or define the right thing at the wrong granularity.
The four standard SLO dimensions for web/API services
1. Availability
- SLI: `(successful responses) / (total requests)`
- Target range: 99.0% to 99.99% depending on criticality
- Common mistake: measuring availability from a synthetic ping rather than real user traffic
- QA test link: uptime checks in performance test harness, chaos injection to validate recovery
2. Latency (p99)
- SLI: `(requests completing within threshold) / (total requests)` (see the sketch after this list)
- Prefer p99 over p50 for SLO targets — p50 hides tail latency that affects the most frustrated users
- Target range: highly domain-specific; e-commerce checkout < 1 s p99, internal admin tool < 3 s p99
- QA test link: NFR thresholds in load test scripts derived directly from the SLO target
3. Error rate
- SLI: `(non-error responses) / (total responses)`
- Usually expressed as success rate: 99.5%, 99.9%, 99.99%
- Separate SLOs for different error classes: 5xx errors (server fault) vs 4xx errors (client/data fault)
- QA test link: error injection tests, third-party dependency mocking, timeout and retry validation
4. Throughput / saturation
- SLI: `(requests processed within N seconds) / (total requests enqueued)` for async systems
- Useful for batch processing, message queues, export pipelines
- QA test link: sustained load tests, queue depth monitoring during soak tests
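As a concrete illustration of the latency dimension, an SLI is nothing more than a good/total ratio over raw telemetry. A minimal Python sketch (the threshold and sample durations are invented):

# latency SLI: requests completing within threshold / total requests
durations_ms = [120, 250, 180, 900, 210, 95, 310, 150, 205, 275]  # sample latencies
threshold_ms = 300

good = sum(1 for d in durations_ms if d < threshold_ms)
latency_sli = good / len(durations_ms)
print(f"latency SLI: {latency_sli:.2%} under {threshold_ms} ms")  # 80.00%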
Writing SLOs for new services
When a new service enters the test strategy phase, QA should drive SLO definition before a single test case is written. The template:
Service: [name]
SLI type: [availability | latency | error rate | throughput]
Measurement: [exact metric definition — numerator and denominator]
Target: [percentage]
Window: [rolling 7/28/30 days]
Measurement source: [Prometheus metric name | Datadog monitor ID | CloudWatch metric]
Error budget: [calculated from target and window]
Budget policy: [what happens at 50%, 75%, 100% consumed]
Owner: [who is alerted]
Review cadence: [monthly | per sprint | per release]
Filling in this template forces the conversation about what "good" means before implementation begins. Teams that skip this step end up with SLOs defined retrospectively from current performance, which is a form of performance theatre.
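To make the template concrete, here is a hypothetical filled-in entry for a checkout service (every value below, including the metric and rule names, is illustrative):

Service: checkout-api
SLI type: availability
Measurement: non-5xx responses / total responses (from http_requests_total{job="checkout"})
Target: 99.9%
Window: rolling 30 days
Measurement source: Prometheus metric http_requests_total, recording rule job:http_requests_success_rate:5m
Error budget: 43.2 minutes per window
Budget policy: amber at 50% consumed, red at 75%, freeze at 100%
Owner: checkout squad on-call, with QA lead on ticket-level burn alerts
Review cadence: monthly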
SLOs as NFR Acceptance Criteria
Non-functional requirements in a test strategy should be expressed as SLO compliance, not arbitrary figures.
Before (vague NFR):
"The system should respond quickly under load."
After (SLO-derived NFR):
"The checkout API must maintain a p99 latency SLI of >= 95% (requests completing in < 800 ms) under a sustained load of 500 concurrent users for 30 minutes. This is derived from the service SLO target. A performance test run that breaches this threshold is a test failure, not a warning."
Load test configuration derived from SLOs
// Example: k6 load test with SLO-derived thresholds
export const options = {
  thresholds: {
    // SLO target: 99.9% availability
    http_req_failed: ['rate<0.001'],
    // SLO targets: p99 latency < 500 ms and p95 < 300 ms. k6 requires both
    // expressions under one key; a duplicate key would silently drop the first.
    http_req_duration: ['p(99)<500', 'p(95)<300'],
  },
};

The test fails if any threshold is breached. This is not a performance benchmark to be interpreted — it is a binary gate. If the thresholds come from the SLO document, any stakeholder can understand why the gate exists without reading test code.
Staging vs production SLO targets
A common mistake is relaxing performance test thresholds below the production SLO to account for staging environment limitations. This defeats the purpose. Options:
- Scale the load — if staging has 25% of production capacity, run 25% of production load and hold to the same latency target (sketched after this list)
- Match the environment — invest in a production-parity staging environment (preferred; costly)
- Accept the gap explicitly — document that staging tests are indicative only and deploy with monitoring alerting for SLO breach
Option 3 is valid but must be a documented decision, not an implicit assumption.
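Under the first option, the scaling arithmetic is simple enough to pin down in the test harness. A minimal Python sketch (the capacity fraction and production peak are illustrative):

# Scale the load to staging capacity; hold latency targets at the production SLO
production_peak_rps = 2000        # assumed measured production peak
staging_capacity_fraction = 0.25  # assumed: staging has 25% of production capacity

staging_target_rps = production_peak_rps * staging_capacity_fraction
# the load scales down; the latency gate does not
print(f"staging load: {staging_target_rps:.0f} rps, latency gate unchanged: p99 < 500 ms")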
Go/No-Go Using Error Budget State
The release checklist addition
Every release checklist should include an error budget check. The question is not "did the tests pass?" but "is the system in a state where absorbing another change is responsible?"
Proposed checklist item:
[ ] Error budget check
- Current budget remaining: ___%
- Budget consumed this window: ___%
- Days remaining in window: ___
- Projected exhaustion date (at current burn rate): ___
- Policy threshold: amber at 50% consumed with >50% window remaining
- Result: GREEN / AMBER / RED
- If AMBER or RED: sign-off required from [QA Lead + Engineering Lead]
Scenario examples
Scenario 1 — Clear go: SLO window: 30 days. Current day: 15. Budget consumed: 20%. Projected exhaustion: day 75 (extrapolated). Result: GREEN.
Scenario 2 — Amber hold: SLO window: 30 days. Current day: 12. Budget consumed: 45%. Last incident consumed 30% in 2 days. Projected exhaustion: day 18. Result: AMBER. The release can proceed only with explicit sign-off acknowledging that another incident of similar magnitude will exhaust the budget before the window closes, breaching the SLA buffer.
Scenario 3 — Hard hold: SLO window: 30 days. Current day: 20. Budget consumed: 80%. 10 days remaining. Budget will exhaust at current rate before the window closes. Result: RED. No feature releases until budget recovers (incident resolved, window resets, or SLO target adjusted by exception process).
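Run through the illustrative budget_policy_state sketch from the error-budget section, the three scenarios reproduce these results. Note that scenario 2's day-18 projection extrapolates the recent incident's faster burn, while a window-average projection lands around day 27; both fall inside the window, so the state is amber either way:

print(budget_policy_state(0.20, 15))  # GREEN: projection at day 75, outside the window
print(budget_policy_state(0.45, 12))  # AMBER: window-average projection ~day 27
print(budget_policy_state(0.80, 20))  # RED: >75% consumed with >25% of window left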
SLA vs SLO: Consultant Positioning
When working with clients, the distinction matters for who you are talking to.
| Dimension | SLA | SLO |
|---|---|---|
| Audience | Legal, commercial, customers | Engineering, QA, product |
| Enforceability | Contractual, financial remedies | Internal commitment, cultural |
| Target | Looser (customer-facing floor) | Tighter (operational ceiling) |
| Governance | Account management, legal review | QA governance, SRE |
| Breach consequence | Credits, termination clauses | Internal escalation, freeze |
| Review cadence | Annually (contract renewal) | Monthly or per sprint |
Common client misunderstanding: "We've never breached our SLA so we're fine." This statement conflates SLA performance with service quality. An SLO can be breached dozens of times per year without ever triggering SLA remedies, because the SLA target is set lower. QA governance tracks SLO breach, not SLA breach. SLA breach means the internal SLO governance failed.
SLO Review Cadence as a QA Governance Ritual
SLOs decay. The target that was meaningful when the service launched becomes irrelevant as usage patterns shift, infrastructure changes, or the product evolves. QA should own a recurring SLO review ritual.
Monthly SLO review agenda (45 minutes)
- Budget consumption report (5 min) — how much budget was consumed, by what events
- SLI trend review (10 min) — is performance trending toward or away from target?
- Incident retrospective link (10 min) — which incidents consumed the most budget, what changed
- Target relevance check (10 min) — are the SLI definitions still measuring what matters to users?
- Threshold adjustment proposals (10 min) — should any targets be tightened, relaxed, or retired?
When to tighten a target
- Current performance consistently exceeds the SLO by a large margin (e.g., SLO is 99.5%, actual is 99.95%) — the target gives no signal
- A new contractual commitment has been made that requires tighter targets
- User research shows the latency threshold is higher than what users perceive as acceptable
When to relax a target
- Budget consistently exhausted by known, accepted infrastructure events (planned maintenance)
- Service was over-engineered relative to actual user impact
- Explicit business decision to reduce investment in a non-critical path
Targets should never be relaxed to avoid accountability. If a target is being missed and the response is to lower the target rather than fix the system, that is a governance failure, not an SLO review.
Common Mistakes
SLOs too tight
Setting a 99.99% availability target for an internal batch processing service. The target consumes error budget on routine maintenance windows, trains teams to ignore alerts, and creates pressure to lower targets without fixing anything. Reserve four-nines targets for genuinely critical, customer-facing, revenue-generating paths.
SLOs unmeasured
The SLO exists in a spreadsheet. No Prometheus rule computes the SLI. No dashboard shows budget consumption. The SLO is reviewed retrospectively by looking at incident logs rather than continuously from telemetry. This is documentation theatre. An unmeasured SLO provides no signal and no release gate.
SLOs set by ops without business input
Operations teams set SLOs based on what is technically achievable, not what users actually need. A checkout latency target of p99 < 2,000 ms might be easily achievable but unacceptable to users who abandon carts after 1,000 ms. SLOs must be calibrated to user impact, not infrastructure comfort.
Single-window SLOs
Using a 30-day rolling window without shorter sub-window alerts means a catastrophic 6-hour outage barely moves the 30-day SLI (6 of 720 hours is less than a one-percentage-point dip) even though it can consume most or all of the error budget, and nothing fires while it is happening. Short-window burn rate alerts (1-hour, 6-hour) catch fast-moving incidents before the monthly SLO is at risk.
SLO reviewed only at the SLA renewal
SLOs reviewed once a year at the contract renewal are not operational governance — they are audit theatre. By the time you review, you have already made hundreds of release decisions without the signal.
Not separating read and write path SLOs
A service that is 99.9% available overall can have a write path that fails 5% of the time if reads dominate the traffic mix: writes at 2% of requests failing 5% of the time contribute only 0.1% of overall errors. Define SLOs per critical user journey, not per service.
Tooling
Prometheus recording rules
Prometheus is the standard for computing SLIs from raw metrics. Recording rules pre-aggregate the ratio at scrape time, making SLO dashboards fast and consistent.
# recording rule: availability SLI (5-minute window)
- record: job:http_requests_success_rate:5m
  expr: |
    sum(rate(http_requests_total{status=~"2..|3.."}[5m])) by (job)
    /
    sum(rate(http_requests_total[5m])) by (job)

# recording rule: availability SLI (1-hour window, consumed by the burn-rate rule below)
- record: job:http_requests_success_rate:1h
  expr: |
    sum(rate(http_requests_total{status=~"2..|3.."}[1h])) by (job)
    /
    sum(rate(http_requests_total[1h])) by (job)

# recording rule: error budget burn rate (1-hour window vs 30-day budget)
# repeat this pattern for the 5m, 30m, and 6h windows referenced by the alerts below
- record: job:error_budget_burn_rate:1h
  expr: |
    (1 - job:http_requests_success_rate:1h)
    /
    (1 - 0.999)  # SLO target: 99.9%

Alerting on burn rate:
- alert: ErrorBudgetFastBurn
  expr: |
    job:error_budget_burn_rate:1h > 14.4
    and
    job:error_budget_burn_rate:5m > 14.4
  for: 2m
  labels:
    severity: page
  annotations:
    summary: "Error budget burning at 14.4x — exhaustion in ~2 days"

- alert: ErrorBudgetSlowBurn
  expr: |
    job:error_budget_burn_rate:6h > 6
    and
    job:error_budget_burn_rate:30m > 6
  for: 15m
  labels:
    severity: ticket
  annotations:
    summary: "Error budget burning at 6x — escalate to QA for release hold assessment"

Datadog SLO dashboards
Datadog has native SLO tracking objects (separate from monitors). Configure them as:
- Metric-based SLO — computed from a ratio of two Datadog metrics. Supports 7-day, 30-day, 90-day rolling windows simultaneously.
- Monitor-based SLO — derived from the uptime of an existing Datadog monitor. Simpler but less flexible.
Metric-based SLOs are preferred for QA governance because they can be scoped to specific user journeys (using tags) and allow budget consumption to be calculated precisely.
Datadog error budget widget (in dashboards): shows remaining budget as a percentage and a time-series of consumption rate. Include this on every release dashboard so the release coordinator can see budget state before approving a deployment.
Google SRE Workbook
The Google SRE Workbook (2018, available free at sre.google/workbook) defines the canonical methodology for SLO design. Chapters 2 (Implementing SLOs), 5 (Alerting on SLOs), and 13 (Data Processing Pipelines) are directly applicable to QA governance. The multi-window burn rate alert model in Chapter 5 is the reference implementation for the Prometheus alerting rules above.
Pyrra
Pyrra (open-source, CNCF sandbox) generates Prometheus recording rules and Grafana dashboards from SLO definitions written as Kubernetes CRDs or plain YAML. Useful for standardising SLO configuration across a multi-service platform rather than writing recording rules by hand.
# Pyrra SLO definition example
apiVersion: pyrra.dev/v1alpha1
kind: ServiceLevelObjective
metadata:
  name: checkout-availability
spec:
  description: Checkout API availability SLO
  target: "99.9"
  window: 4w
  indicator:
    ratio:
      errors:
        metric: http_requests_total{job="checkout",code=~"5.."}
      total:
        metric: http_requests_total{job="checkout"}

Sloth
Sloth (open-source) takes a similar approach — define SLOs in YAML, generate Prometheus rules. More mature than Pyrra at the time of writing for standalone (non-Kubernetes) environments.
Connections
- qa/performance-testing-qa — load test thresholds derived directly from SLO targets; SLO-derived NFRs turn performance tests into binary gates
- qa/production-monitoring-qa — observing SLI behaviour post-release; the runtime layer where error budget is consumed
- qa/qa-metrics — SLO compliance surfaces as a top-level QA reporting metric alongside escape rate and coverage
- qa/non-functional-testing — NFR taxonomy; SLO-derived acceptance criteria replace vague NFR wording
- qa/continuous-testing — automated SLI checks as CI/CD pipeline gates; shift-right quality signal
- qa/test-strategy — where SLO definitions are captured and tied to release governance policy
Open Questions
- How should QA handle SLO targets for services that have never been measured — should a provisional target be set from industry benchmarks or held at 0% until a measurement window closes?
- When staging environment capacity is a known fraction of production, is there a principled formula for scaling load-test SLO thresholds rather than accepting the coverage gap?
- Who owns the error budget policy in organisations where SRE and QA operate as separate functions with no formal handoff process?
Cross-References
- qa/performance-testing-qa — load test configuration deriving thresholds from SLO targets
- qa/production-monitoring-qa — observing SLI behaviour in production after release
- qa/qa-metrics — SLO compliance as a QA reporting metric
- qa/test-strategy — where SLO definitions are captured during test strategy creation
- qa/continuous-testing — automated SLI checks as pipeline gates
- qa/non-functional-testing — NFR taxonomy including SLO-derived acceptance criteria
- observability/langfuse — observability tooling for LLM pipelines (separate SLI model)
- qa/qa-hub — QA knowledge index