QA in DevOps

How quality practices integrate with DevOps pipelines. DevOps QA is not a team — it's quality gates, automated checks, and feedback loops embedded into the delivery pipeline.

How quality practices integrate with DevOps pipelines. DevOps QA is not a team. It's quality gates, automated checks, and feedback loops embedded into the delivery pipeline.


The DevOps Quality Pipeline

Developer
  │
  ├── Pre-commit hooks (linting, formatting, secret scanning)
  │
Pull Request
  ├── Unit tests (< 2 min)
  ├── SAST (Semgrep, Bandit)
  ├── Dependency scan (Trivy, safety)
  ├── Type check (mypy, tsc)
  ├── PR review (peer + QA review checklist)
  │
Merge to main
  ├── Integration tests (< 10 min)
  ├── E2E smoke tests (< 5 min)
  ├── Build Docker image + push
  │
Deploy to Staging
  ├── Smoke tests against staging
  ├── Contract tests (Pact can-i-deploy)
  ├── Performance baseline
  │
Deploy to Production
  ├── Canary (5% traffic) + metric analysis
  ├── Progressive rollout (20% → 50% → 100%)
  ├── Production smoke (critical paths only)
  └── Synthetic monitoring (continuous)

Quality Gates

Quality gates block promotion unless criteria are met. Implement as GitHub Actions steps with explicit failure conditions.

# .github/workflows/quality-gate.yaml
jobs:
  quality-gate:
    runs-on: ubuntu-latest
    steps:
    - name: Unit tests with coverage
      run: |
        pytest --cov=src --cov-fail-under=80 --cov-report=xml

    - name: Coverage gate
      uses: codecov/codecov-action@v4
      with:
        fail_ci_if_error: true
        threshold: 80

    - name: Type check
      run: mypy src/ --strict

    - name: Lint
      run: ruff check src/ && ruff format --check src/

    - name: SAST scan
      run: bandit -r src/ -ll --exit-zero

    - name: Integration tests
      run: pytest tests/integration/ --timeout=120

    - name: E2E smoke
      run: playwright test tests/smoke/ --reporter=github

Pre-Commit Hooks

# .pre-commit-config.yaml
repos:
- repo: https://github.com/pre-commit/pre-commit-hooks
  rev: v4.6.0
  hooks:
  - id: trailing-whitespace
  - id: end-of-file-fixer
  - id: check-yaml
  - id: detect-private-key
  - id: check-added-large-files

- repo: https://github.com/astral-sh/ruff-pre-commit
  rev: v0.4.0
  hooks:
  - id: ruff
    args: [--fix]
  - id: ruff-format

- repo: https://github.com/Yelp/detect-secrets
  rev: v1.4.0
  hooks:
  - id: detect-secrets
    args: ['--baseline', '.secrets.baseline']

Contract Testing in the Pipeline (Pact)

# Consumer service — publish pacts on PR
- name: Run consumer tests and publish pacts
  run: |
    pytest tests/contract/
    pact-broker publish ./pacts \
      --broker-base-url ${{ secrets.PACT_BROKER_URL }} \
      --consumer-app-version ${{ github.sha }} \
      --branch ${{ github.ref_name }}

# Provider service — verify pacts on PR
- name: Verify provider pacts
  run: |
    PACT_BROKER_URL=${{ secrets.PACT_BROKER_URL }} \
    PACT_CONSUMER_VERSION=${{ github.sha }} \
    pytest tests/provider/

# Both services — can-i-deploy before production
- name: Can I Deploy?
  run: |
    pact-broker can-i-deploy \
      --pacticipant myapp-api \
      --version ${{ github.sha }} \
      --to-environment production \
      --broker-base-url ${{ secrets.PACT_BROKER_URL }}

Synthetic Monitoring

# Run against production continuously (not just at deploy time)
# CloudWatch Synthetics canary (Python)
from aws_synthetics.selenium import synthetics_webdriver as webdriver

def handler(event, context):
    browser = webdriver.Chrome()
    browser.get("https://myapp.com")

    title = browser.title
    assert "MyApp" in title, f"Expected 'MyApp' in title, got: {title}"

    browser.find_element("id", "search-input").send_keys("test product")
    browser.find_element("id", "search-submit").click()

    results = browser.find_elements("class name", "product-card")
    assert len(results) > 0, "Search returned no results"

Schedule: every 5 minutes on critical paths (login, checkout, API health). Alert via SNS → PagerDuty/Slack.


Observability as Quality Signal

# Production error rate as a quality gate during canary
# Argo Rollouts analysis template using Prometheus
{
    "metrics": [{
        "name": "error-rate",
        "successCondition": "result[0] < 0.01",   # < 1% error rate
        "provider": {
            "prometheus": {
                "query": "sum(rate(http_requests_total{status=~'5..'}[5m])) / sum(rate(http_requests_total[5m]))"
            }
        }
    }]
}

QA's DevOps Responsibilities

  • Own the test pipeline configuration (GitHub Actions, CircleCI)
  • Maintain the test environment (staging infrastructure, seed data)
  • Track quality metrics per sprint (escape rate, automation coverage, flaky test rate)
  • Triage flaky tests — quarantine within 24h of first flake
  • Champion quality gates — resist pressure to bypass them under time pressure

Common Failure Cases

Quality gate bypassed under release pressure by adding continue-on-error: true Why: a team member adds continue-on-error: true to the coverage or test step to unblock a deployment; the gate exists in the YAML but no longer blocks anything. Detect: grep the workflow files for continue-on-error on any quality gate step; presence is a finding. Fix: remove continue-on-error from all gate steps; if a test is genuinely blocking release for an unrelated reason, quarantine the specific test with a pytest.mark.skip and a linked ticket — never disable the gate itself.

Flaky E2E smoke tests silently pass 80% of the time — regressions slip through Why: a smoke test that is flaky 20% of the time will appear green on most PRs; developers learn to re-run the pipeline rather than investigate failures, and a real regression gets re-run past. Detect: look at the last 20 CI runs for each smoke test; any test with more than one unexplained failure in 20 runs is flaky. Fix: quarantine flaky tests within 24 hours (move to a separate flaky marker, exclude from the smoke gate, file a ticket); a flaky gate is worse than no gate because it trains the team to distrust failures.

Canary rollout has no automated rollback condition — errors go to 100% rollout Why: the Argo Rollouts analysis template exists but successCondition is misconfigured (result[0] < 0.1 when errors are a ratio, not a percentage); the canary promotes automatically even at 5% error rate. Detect: review the analysis template in staging by injecting errors and confirming the canary pauses before promotion; if it promotes through a 5% error rate, the condition is wrong. Fix: unit-test the PromQL expression in Prometheus directly before wiring it into the rollout; confirm sum(rate(...)) produces a ratio in [0,1] and adjust the threshold accordingly.

Pre-commit hooks not installed by all developers — secret scanning is bypassed Why: detect-secrets is in .pre-commit-config.yaml but pre-commit install is only run by developers who read the README; CI does not re-run the same checks, so secrets committed without hooks installed are not caught until a manual audit. Detect: run git log --all --full-history -- .secrets.baseline and check whether the baseline is diverging from the installed hook version; also check whether CI has a dedicated secret-scanning step independent of pre-commit. Fix: add a CI step that runs detect-secrets scan independently of pre-commit, so the server-side check catches anything that bypassed the client-side hook.

Connections

qa-hub · qa/agile-qa · qa/regression-testing · qa/test-environments · cloud/github-actions · technical-qa/contract-testing · cloud/argo-rollouts

Open Questions

  • What testing scenarios does this technique systematically miss?
  • How does this approach need to change when delivery cadence moves to continuous deployment?