Test Estimation and Capacity Planning

Estimation is the skill of turning scope uncertainty into a defensible commitment. Done well, it protects the team from over-promising and the client from surprise. Done badly, it destroys trust in both directions.


Why QA Estimation Is Hard

Root causes of poor estimates:
  1. Scope ambiguity at estimate time — requirements are incomplete, so assumptions fill the gap
  2. Ignoring activities that aren't "test execution" — writing, review, rework, regression all take time
  3. Single-point estimates — quoting one number implies false precision
  4. Ignoring the new-client ramp-up — domain unfamiliarity kills velocity for the first 2–4 weeks
  5. No feedback loop — estimates are never compared to actuals, so the same mistakes compound
  6. Optimism bias — estimators assume best-case throughput, no blockers, zero defects returned
  7. Anchoring — the first number mentioned in a room becomes the target, not the evidence

The solution is not "be more accurate" — it is to estimate systematically, communicate ranges, and track actuals so each project makes you better at the next one.


Estimation Techniques

Three-Point PERT

The most rigorous technique for any estimate where uncertainty is real.

For each task:
  O = optimistic estimate (everything goes right)
  M = most likely estimate (realistic, typical conditions)
  P = pessimistic estimate (blockers, rework, complexity)

PERT Estimate = (O + 4M + P) / 6
Std Deviation = (P - O) / 6
Variance      = ((P - O) / 6)^2

Worked example — writing test cases for a checkout module:

  Optimistic:   3 days  (requirements are complete, domain is familiar)
  Most likely:  5 days  (1–2 clarification sessions, some ambiguous flows)
  Pessimistic:  9 days  (incomplete AC, multiple reviews, late requirement changes)

  PERT = (3 + 4×5 + 9) / 6 = (3 + 20 + 9) / 6 = 32 / 6 = 5.3 days
  Std Dev = (9 - 3) / 6 = 1.0 day

  Report to client: 4.5–6.5 days (PERT ± 1 std dev = 4.3–6.3, rounded to the nearest half-day)

Adding variance across independent tasks gives you project-level confidence:

  Total variance = sum of individual variances
  Total std dev  = sqrt(total variance)
  Project range  = sum of PERT estimates ± 1 or 2 std devs

  Example — four tasks with PERT estimates [5.3, 3.2, 7.1, 4.0] days
  and std devs [1.0, 0.7, 1.5, 0.8]:
    Total estimate = 19.6 days
    Total variance = 1.0 + 0.49 + 2.25 + 0.64 = 4.38
    Total std dev  = sqrt(4.38) = 2.1 days
    ~95% confidence range (±2 std devs) = 19.6 ± 2×2.1 = 15.4 to 23.8 days → report as 3–5 weeks
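
A minimal Python sketch of the same arithmetic (the task triples and function names are illustrative, chosen to roughly match the example above):

  import math

  def pert(o, m, p):
      """Return (PERT estimate, standard deviation) for one task."""
      return (o + 4 * m + p) / 6, (p - o) / 6

  # Four tasks as (optimistic, most likely, pessimistic) triples in days.
  tasks = [(3, 5, 9), (2, 3, 5), (4, 7, 11), (3, 4, 6)]

  results = [pert(*t) for t in tasks]
  total = sum(e for e, _ in results)
  total_std = math.sqrt(sum(s ** 2 for _, s in results))

  # +/- 2 std devs gives roughly a 95% range under a normal approximation.
  print(f"{total:.1f} days, range {total - 2 * total_std:.1f}"
        f" to {total + 2 * total_std:.1f}")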

Use PERT when you have time to decompose thoroughly — test plans, project kick-offs, formal proposals.


T-Shirt Sizing

Fast, relative sizing for agile refinement or early scoping conversations.

XS: < 2 hours        — single form, no edge cases, no data setup
S:  half to 1 day    — a single CRUD feature, 5–10 test cases
M:  2–3 days         — a functional area, integration points, some edge cases
L:  1 week           — a complex module, multiple integrations, performance angle
XL: 2+ weeks         — full subsystem, multiple environments, cross-team dependency

T-shirt sizes only communicate order of magnitude. They are useful for:

  • Backlog grooming when stories aren't fully defined
  • Stakeholder conversations before detailed scoping
  • Rapid triage of incoming requests

Convert to day-ranges before committing to a sprint or delivery date. An M is 2–3 days — that is the number you schedule against, not "M".
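
A trivial sketch of that conversion (the mapping is the size table above expressed in days; XL is deliberately left open-ended):

  # Day-range equivalents for the sizes above (XS as a fraction of a day).
  TSHIRT_DAYS = {"XS": (0.25, 0.25), "S": (0.5, 1), "M": (2, 3),
                 "L": (5, 5), "XL": (10, None)}   # XL: decompose before scheduling

  def to_day_range(size):
      """Return the (low, high) day range to schedule against."""
      return TSHIRT_DAYS[size.upper()]

  low, high = to_day_range("M")   # (2, 3): schedule 2-3 days, not "M"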


Analogy-Based Estimation

Compare the new work to something you have already measured.

Pattern:
  1. Find a reference task from a past project with known actuals
  2. List the ways the new task is similar and different
  3. Apply adjustment factors for each difference
  4. Sanity-check with a second reference task

Example:
  Reference: Last project's user profile module — 48 test cases, took 6 days to write
  New work:  Payment module — similar complexity, but:
    - 2× more regulatory constraints (+50%)
    - Team has existing domain knowledge (−20%)
    - Requirements are more complete (−10%)

  Adjusted estimate = 6 × 1.5 × 0.8 × 0.9 = 6.5 days

Analogy estimation degrades when the reference project is old (processes changed) or when the new work is genuinely novel. Always state which reference project you used — it makes the assumption visible and challengeable.
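
The adjustment arithmetic as a short sketch (factor labels and values are taken from the example above; nothing here is a standard formula):

  def analogy_estimate(reference_days, adjustments):
      """Scale a measured reference actual by one factor per stated difference."""
      estimate = reference_days
      for _reason, factor in adjustments:
          estimate *= factor
      return estimate

  # Payment module, based on the 6-day user-profile reference above.
  estimate = analogy_estimate(6.0, [
      ("2x regulatory constraints",  1.5),
      ("existing domain knowledge",  0.8),
      ("more complete requirements", 0.9),
  ])
  print(round(estimate, 1))   # 6.5 days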


Use-Case Points (UCP)

A formal technique for estimating from use cases or user stories, common in larger fixed-price engagements.

Steps:
  1. Classify each actor (simple, average, complex) and assign weights
     Simple (1): external system, batch job
     Average (2): operator with limited interaction
     Complex (3): user with GUI and complex flows

  2. Classify each use case by transaction count
     Simple (5):   1–3 transactions
     Average (10): 4–7 transactions
     Complex (15): 8+ transactions

  3. Unadjusted Use Case Points (UUCP) = sum of weighted actors + weighted use cases

  4. Apply Technical Complexity Factor (TCF) and Environmental Factor (EF)
     — standard tables from the Karner model

  5. Hours = UUCP × TCF × EF × productivity factor (typically 20–28 hours/UCP)

UCP is best suited to waterfall or fixed-price contracts where a formal audit trail is needed. It is too heavy for sprint-level estimation in agile teams.
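
A sketch of the UUCP-to-hours arithmetic using the weights listed above (the tcf and ef values below are placeholders, not derived from the standard factor tables):

  ACTOR_WEIGHTS = {"simple": 1, "average": 2, "complex": 3}
  USE_CASE_WEIGHTS = {"simple": 5, "average": 10, "complex": 15}

  def ucp_hours(actors, use_cases, tcf, ef, hours_per_ucp=24):
      """Effort in hours from classified actors and use cases (Karner-style)."""
      uucp = (sum(ACTOR_WEIGHTS[a] for a in actors)
              + sum(USE_CASE_WEIGHTS[u] for u in use_cases))
      return uucp * tcf * ef * hours_per_ucp

  # tcf and ef are placeholders; derive them from the standard factor tables.
  hours = ucp_hours(actors=["complex", "complex", "simple"],
                    use_cases=["simple", "average", "average", "complex"],
                    tcf=1.05, ef=0.9)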


Estimating Each Activity Separately

The single biggest estimation mistake is treating "testing" as one block. Break it into components:

Component                   Typical % of total QA effort
----------------------------------------------------------
Requirements review         5–10%
Test case / script writing  20–30%
Test environment setup      5–10%
Test data preparation       5–10%
Test execution              25–35%
Defect logging              5–8%
Defect retesting            10–15%
Regression testing          10–20%
Reporting and handover      3–5%
Automation build            varies — see below

Worked breakdown — medium-sized agile feature (say 20 story points, roughly sized at 8 test days before itemising):

Requirements review:   0.5 days
Test case writing:     2.0 days  (1 day per ~15 test cases, 30 cases estimated)
Environment setup:     0.5 days
Test data prep:        0.5 days
Test execution:        2.5 days  (30 cases × 20 min avg = 10h ≈ 1.25 days, ×2 for exploratory)
Defect logging:        0.5 days
Defect retesting:      1.0 day   (assume 30% defect rate on 30 cases = 9 defects × 1h rework)
Regression:            1.0 day   (top 20 regression cases from related areas)
Reporting:             0.5 days
-----------
Total:                 9.0 days

Presenting the breakdown matters: clients can challenge "9 days testing" but rarely challenge a line-itemised breakdown. Each line also becomes a scope handle — if the client wants to cut scope, you can remove test case writing for a low-risk module and call out the accepted risk.
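
A small sketch that totals a line-itemised breakdown and shows each line's share, which can be sanity-checked against the typical-percentage table (figures are the worked example's):

  # Per-activity estimates in days, from the worked breakdown above.
  breakdown = {
      "Requirements review": 0.5, "Test case writing": 2.0,
      "Environment setup":   0.5, "Test data prep":    0.5,
      "Test execution":      2.5, "Defect logging":    0.5,
      "Defect retesting":    1.0, "Regression":        1.0,
      "Reporting":           0.5,
  }
  total = sum(breakdown.values())                      # 9.0 days
  for activity, days in breakdown.items():
      print(f"{activity:<22} {days:.1f} days  ({days / total:.0%})")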


Estimating Automation Build Time

Automation estimate components:

  Per test case:
    Script writing:       20–60 min (simple flow) to 2–4 hours (complex, dynamic UI)
    Review and refactor:  20% of writing time
    Framework maintenance: amortised across the suite — ~5% per test

  Framework setup (one-time):
    Page Object Model scaffold:   2–5 days
    CI/CD integration:            1–3 days
    Test data and fixtures:       1–2 days

  Rule of thumb:
    New automation suite from scratch: 3–4× manual execution time to build the same coverage
    Mature automation framework: 1–2× manual execution time to add new scripts

Worked example — 50 automated regression cases, mature framework:

  Script writing:    50 × 45 min avg = 37.5 hours = ~5 days
  Review:            5 × 0.2 = 1 day
  Framework updates: 0.5 day
  CI integration:    already done — 0
  Total:             6.5 days

  Payback calculation:
    Manual regression suite: 50 × 5 min = 4.2 hours per run
    At 2 runs/sprint: 8.4 hours/sprint saved
    6.5 days × 8h = 52 hours invested → payback at sprint 7
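
The payback arithmetic as a sketch (all inputs are the illustrative figures above):

  import math

  def payback_sprint(build_hours, manual_hours_per_run, runs_per_sprint):
      """First sprint by which the automation build effort has been repaid."""
      saved_per_sprint = manual_hours_per_run * runs_per_sprint
      return math.ceil(build_hours / saved_per_sprint)

  build_hours = 6.5 * 8          # 6.5 days of build effort
  manual_run = 50 * 5 / 60       # 50 cases at ~5 minutes each
  print(payback_sprint(build_hours, manual_run, runs_per_sprint=2))   # 7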

Estimation in Agile

Story Points for QA Tasks

QA tasks belong in the sprint backlog with explicit story point estimates — not as an afterthought attached to developer stories.

QA tasks to estimate explicitly:
  - Test case writing for new stories
  - Automation scripting
  - Exploratory testing sessions (time-boxed — e.g. 2 × 1-hour charters)
  - Regression run
  - Defect retesting (budget as a % of story complexity)
  - Test data setup

Point mapping example (calibrate against your team's reference story):
  1 point: review AC, run 3–5 existing tests, log any defects found
  2 points: write 5–8 test cases, execute them, retest any defects
  3 points: 8–15 test cases, some exploratory, coordinate with dev on edge cases
  5 points: full feature with integrations, 15–25 cases, regression impact
  8 points: complex feature, cross-team dependencies, automation update required

A QA engineer on a typical two-week sprint, with 20–25% of their time going to ceremonies, delivers roughly 24–28 story points. Never allocate more than 80% of that to planned work — the remaining 20% absorbs defect retesting and unplanned blockers.

Velocity Tracking

Track QA velocity as a rolling average:

Sprint  Points planned  Points completed  Velocity
1       22              18                18  (ramp-up sprint, lower)
2       22              21                19.5 avg
3       24              23                20.7 avg
4       24              24                21.5 avg
5       26              22                21.6 avg  (production incident disrupted sprint 5)

Use 3-sprint rolling average for planning: (23+24+22)/3 = 23 points

Never plan to a single engineer's peak velocity. Plan to the team's rolling average, and surface when a sprint is overloaded before the sprint starts, not at the retrospective.
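
A sketch of the rolling-average calculation used for planning (sprint data is the table above):

  completed = [18, 21, 23, 24, 22]   # points completed, sprints 1-5 above

  def rolling_velocity(history, window=3):
      """Average of the most recent `window` sprints: the planning number."""
      recent = history[-window:]
      return sum(recent) / len(recent)

  print(rolling_velocity(completed))   # (23 + 24 + 22) / 3 = 23.0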


Common Estimation Biases

Optimism bias
  Looks like: "This should only take a day" — no blockers, no rework assumed
  Counter:    Ask what the realistic worst case is; weight PERT accordingly

Anchoring
  Looks like: A PM says "can we do it in two weeks?" and the estimate gravitates toward two weeks
  Counter:    Estimate independently before any number is mentioned in the room

Planning fallacy
  Looks like: Underestimating time for tasks you do yourself; overestimating for others
  Counter:    Use actuals from past projects, not gut feel

Scope creep blindness
  Looks like: Estimating the happy path; not the edge cases, negative tests, and non-functional tests
  Counter:    Use a test type checklist — tick off functional, non-functional, security, accessibility before finalising

Student syndrome
  Looks like: Estimating accurately but starting late, leaving no buffer
  Counter:    Fix with sprint commitment at grooming, not at stand-up

Parkinson's Law
  Looks like: Work expands to fill available time
  Counter:    Time-box execution phases; record actual duration, not planned duration

Estimating for Multiple Concurrent Test Streams

When one QA engineer — or a small team — is running multiple streams simultaneously (e.g. regression for release A, feature testing for sprint B, UAT support for stream C), the naive approach is to add estimates. This under-estimates because context switching carries a real cost.

Context-Switch Tax

  Number of concurrent streams  Effective productivity
  1                             100%
  2                             80%  (20% lost to switching)
  3                             60%
  4+                            40%  or less

  Rule: for each additional concurrent stream beyond the first, reduce daily throughput by 20%

Worked example — one QA engineer, three streams:

  Naive estimate: 5 days each = 15 days total
  
  Effective daily capacity = 8h × 60% = 4.8h productive per day
  
  Adjusted total effort = 15 days × 8h = 120 hours of work
  Calendar days at 4.8h/day = 120 / 4.8 = 25 days
  
  Report to client: each 5-day stream becomes roughly 8 calendar days (≈25 calendar days overall), not 5
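
The same adjustment as a sketch, applying the 20%-per-extra-stream rule above (all figures illustrative):

  def effective_hours_per_day(streams, hours_per_day=8.0, tax=0.2, floor=0.4):
      """Productive hours per day after the context-switch tax."""
      productivity = max(1.0 - tax * (streams - 1), floor)
      return hours_per_day * productivity

  total_work_hours = 3 * 5 * 8                       # three 5-day streams
  per_day = effective_hours_per_day(streams=3)       # 8h x 60% = 4.8h
  print(total_work_hours / per_day)                  # 25.0 calendar days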

Mitigation strategies:

  • Batch stream A tasks for mornings, stream B tasks for afternoons — minimise micro-switching
  • Time-box each stream per day and make it visible on a task board
  • Escalate stream conflicts to the project manager before accepting additional scope
  • If all three streams have critical deadlines, flag the resource conflict explicitly — "all three cannot be done by Friday; which two do you prioritise?"

Building a Test Capacity Model

A capacity model converts headcount and calendar time into realistic delivery commitments.

Base Formula

Available test days = headcount × working days × utilisation rate

Where utilisation rate accounts for:
  - Meetings, stand-ups, retrospectives:  ~10–15%
  - Admin, reporting, onboarding:         ~5–10%
  - Unplanned work and interruptions:     ~10%
  - Ramp-up (new client — see below):     variable

Typical target utilisation: 70–75% of nominal capacity

Worked example — 3 QA engineers, 4-week sprint, new client:

  Nominal days: 3 engineers × 20 working days = 60 person-days

  Deductions:
    Sprint ceremonies (planning, review, retro, 2× stand-up/day):
      ~1.5h/day × 20 days × 3 = 90h = 11.25 days
    Admin and reporting:
      0.5h/day × 20 × 3 = 30h = 3.75 days
    Ramp-up (week 1 of new client — see below):
      1 engineer × 5 days × 50% productivity loss = 2.5 days
    Unplanned:
      5% buffer = 0.05 × 60 = 3 days

  Available test days = 60 - 11.25 - 3.75 - 2.5 - 3 = 39.5 person-days
  Utilisation = 39.5 / 60 = 66%

  Plan sprint work to 39 person-days. Do not commit to 60.
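
A sketch of the capacity calculation (deduction values are the worked example's and purely illustrative):

  def available_test_days(headcount, working_days, deductions):
      """Nominal person-days minus named deductions, plus resulting utilisation."""
      nominal = headcount * working_days
      available = nominal - sum(deductions.values())
      return available, available / nominal

  days, utilisation = available_test_days(3, 20, {
      "ceremonies": 11.25, "admin": 3.75, "ramp_up": 2.5, "unplanned": 3.0,
  })
  print(f"{days} person-days, {utilisation:.0%} utilisation")   # 39.5, 66%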

Capacity Planning Table

Build this as a simple spreadsheet at project kick-off:

Engineer    Role           Weeks 1-2   Weeks 3-4   Weeks 5-6
-------     ----           ---------   ---------   ---------
Alice       Senior QA      6 days*     8 days      8 days      *ramp-up week 1
Bob         QA Engineer    7 days*     8 days      8 days      *ramp-up week 1
Charlie     Automation QA  4 days      6 days      8 days      **automation setup weeks 1-2

Available:  17 days        22 days     24 days

This makes it immediately visible that week 1-2 has constrained capacity — and that committing to a full test execution phase in week 2 is unrealistic.


Accounting for New-Client Ramp-Up

Ramp-up is consistently underestimated on new engagements. Domain unfamiliarity, environment access issues, and process learning all reduce effective throughput.

Phase                Duration    Productivity impact
-----                --------    --------------------
Environment access   Day 1–3     Effectively 0 test execution possible until environments are provisioned
Domain learning      Week 1–2    50–60% of normal throughput on complex business logic
Process alignment    Week 1–3    QA process, ticketing conventions, stakeholder contacts
Tooling familiarity  Week 1–2    Test management tool, CI/CD pipeline, browser stacks

Composite ramp-up model:
  Week 1:  40–50% effective capacity
  Week 2:  60–70% effective capacity
  Week 3:  80–90% effective capacity
  Week 4+: 95–100% effective capacity

Practical implication: do not commit to a full sprint test execution delivery in week 1. Commit to: environment access, domain onboarding, draft test plan, and initial test case writing. Set this expectation explicitly at kick-off.
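
A small sketch applying the weekly multipliers (taken here as mid-points of the model above) to nominal capacity:

  # Mid-points of the weekly ramp-up model; week 4 onwards treated as 100%.
  RAMP_UP = {1: 0.45, 2: 0.65, 3: 0.85}

  def effective_days(week, nominal_days):
      """Nominal person-days scaled by the ramp-up factor for that week."""
      return nominal_days * RAMP_UP.get(week, 1.0)

  for week in range(1, 5):
      print(week, effective_days(week, nominal_days=5))   # 2.25, 3.25, 4.25, 5.0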


Tracking Estimation Accuracy

Estimates only improve if you measure them against actuals.

Estimation Log Format

Maintain a simple log per project:

Task                    Estimated (days)  Actual (days)  Variance  Notes
Checkout test writing   5.3               6.5            +1.2      Late AC changes, 3 extra flows added
Checkout test execution 2.5               2.0            -0.5      Fewer defects than expected
Defect retesting        1.0               1.5            +0.5      Dev rework took longer, retests delayed
Regression run          1.0               1.2            +0.2      Two new failures found, investigated
-------
Sprint total            9.8               11.2           +1.4      14% over

Metrics to Track

Estimation accuracy = (1 - |actual - estimate| / estimate) × 100%
Target: > 80% of tasks within ±20% of estimate

Bias indicator = (sum of signed variances, actual - estimate) / number of tasks
  Positive = systematically under-estimating (optimism bias)
  Negative = systematically over-estimating (padding)
  Target: close to zero — no systematic directional error
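
A sketch computing both metrics from an estimation log (field layout is illustrative):

  # (task, estimated days, actual days): the sprint log above.
  log = [("Checkout test writing", 5.3, 6.5), ("Checkout test execution", 2.5, 2.0),
         ("Defect retesting", 1.0, 1.5), ("Regression run", 1.0, 1.2)]

  def accuracy_pct(estimate, actual):
      """100% means the actual matched the estimate exactly."""
      return (1 - abs(actual - estimate) / estimate) * 100

  within_20 = sum(accuracy_pct(e, a) >= 80 for _, e, a in log) / len(log)
  bias = sum(a - e for _, e, a in log) / len(log)
  print(f"{within_20:.0%} of tasks within +/-20%; bias {bias:+.2f} days/task")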

Review the log at each project retrospective. One project of data is noise. Three projects of data shows patterns.


Presenting Estimates to Clients

Never present a single-point estimate as a commitment. Single points imply precision you do not have, and they become targets that erode buffer silently.

The Range Model

Presentation format:
  "Based on the scope as currently understood, we estimate [low] to [high] days.
   The mid-point [mid] is our planning baseline.
   This range reflects [specific uncertainty — e.g. incomplete AC on module X,
   pending environment access, unknown defect volume]."

Example:
  Low:  16 days  (requirements stable, no major defects found, existing test data usable)
  Mid:  21 days  (our planning baseline — some defect retesting, minor scope additions)
  High: 27 days  (significant rework cycles, environment delays, scope additions in sprint 3)

What to Present Alongside the Range

  • Assumptions list: every assumption baked into the estimate
  • Exclusions: what is explicitly out of scope
  • Risks: what would push the estimate to the high end
  • Triggers for re-estimation: "if X happens, we will need to revisit this estimate"

This converts the estimate from a promise into a working agreement — both parties understand what the number is contingent on.

Confidence Levels

For fixed-price or milestone-based contracts, attach explicit confidence levels:

"We are 90% confident the work will complete within 27 days.
 We are 50% confident it will complete within 21 days."

Use PERT standard deviations to derive these numbers, not gut feel.
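
A sketch deriving those two figures from a PERT mean and standard deviation, assuming the aggregate estimate is approximately normal (the 4.7-day std dev is illustrative):

  from statistics import NormalDist

  def completion_bound(mean_days, std_dev_days, confidence):
      """Days within which the work completes at the given confidence,
      assuming the aggregate estimate is roughly normally distributed."""
      return NormalDist(mean_days, std_dev_days).inv_cdf(confidence)

  # A 21-day PERT baseline with a combined std dev of ~4.7 days.
  print(round(completion_bound(21, 4.7, 0.50)))   # 21: the planning baseline
  print(round(completion_bound(21, 4.7, 0.90)))   # 27: the commitment figure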


Handling Scope Creep

Scope creep is the most common cause of estimate overrun. It arrives in three forms:

Type 1 — Explicit addition: "Can we add testing of the new payments gateway?"
Type 2 — Scope inflation: "Testing" turns out to include performance, security, and
           accessibility when the estimate only covered functional
Type 3 — Requirement drift: The AC for a story changes mid-sprint, invalidating test cases

Response Playbook

Type 1 — Explicit addition:

1. Quantify the addition immediately: "That's roughly X days additional effort."
2. Offer three options:
   a. Add it, extend the timeline
   b. Add it, remove something else of equivalent size (negotiate out-of-scope)
   c. Defer it to the next sprint/phase
3. Document the decision in writing. Never absorb scope additions silently.

Type 2 — Scope inflation:

1. Return to the original estimate document and read out what was included
2. Identify the gap explicitly: "Performance testing was not in scope — here is what that adds"
3. Treat as Type 1 from this point

Type 3 — Requirement drift:

1. Log the change, note which test cases are invalidated
2. Estimate rework: test case rewrites + re-execution of affected cases
3. Flag to the project manager before absorbing the rework into the sprint
4. If drift is frequent, surface it as a process issue, not just a one-off delay

Re-Baselining When Requirements Change

If requirements change significantly mid-project — a module redesign, a new integration, a pivot in scope — re-baseline the estimate rather than patching it.

Re-baseline process:
  1. Freeze the current estimate: record what was delivered against the original baseline
  2. Define the new scope: what changed and what is now required
  3. Re-estimate from scratch for the changed scope using the same technique as the original
  4. Issue a revised estimate document with:
     - Change description
     - Original estimate vs new estimate for the affected scope
     - Impact on project total
     - Revised timeline
  5. Get sign-off before proceeding

Do not silently absorb a major scope change and then miss the original deadline. The earlier a re-baseline is issued, the more options the client has.

Worked example:

  Original estimate: 42 days for three modules (A, B, C)
  Change event: module C replaced by a new module D with 2× the complexity

  Module C estimate (original):        8 days — now void
  Module D estimate (new, using PERT):
    O=7, M=14, P=22 → PERT = 14.2 days, Std Dev = 2.5 days

  Delta: +6.2 days on the mid estimate
  Revised total: 42 - 8 + 14.2 = 48.2 days → report as 46–51 days (±1 std dev, rounded)

  Present to client: "The replacement of module C with module D adds 6–8 days
  to the QA engagement. Revised delivery window is [date range]."

Connections

  • qa/test-planning — scope definition and exit criteria that set the boundaries estimation works within
  • qa/risk-based-testing — risk prioritisation determines which test activities receive the largest effort allocations
  • qa/agile-qa — sprint velocity tracking and story point calibration ground the PERT and T-shirt models in real team data
  • qa/test-automation-strategy — automation build-vs-maintain trade-off directly feeds the estimation breakdown for automation effort
  • qa/qa-leadership — client-facing estimate presentation and expectation-setting are core senior QA consultant skills
  • qa/uat-governance — UAT phase estimation sits within the same capacity model; named tester availability is a hard input

Open Questions

  • When a client's requirements are incomplete at estimate time, at what point does the uncertainty range become wide enough that quoting a range is commercially misleading and a fixed discovery phase is the right answer instead?
  • How should context-switch tax be communicated to a client who is simultaneously requesting estimates for three parallel workstreams without accepting that serial sequencing would be faster overall?
  • Is there an industry-accepted standard for new-client ramp-up productivity curves, or is the 40/60/80/95% weekly model purely empirical and team-dependent?