Async Python Latency Benchmarking
Benchmark p50/p95/p99 with 1000+ samples per scenario. Use py-spy for sampling profiling without code changes. Histograms over averages — averages hide the tail. Profile representative traffic at steady state.
Directly relevant to mcpindex Weekend 2. Latency baselines for STDIO vs HTTP transport comparison.
Key Facts
- p50 = typical user experience; p95 = where slow users live; p99 = architectural bottlenecks and outliers
- Minimum 1000 samples per scenario for reliable percentile estimates — fewer samples make p99 unstable
- Averages hide disasters: one 10s request among 100 at 10ms pulls the mean from 10ms to ~110ms; p99 shows the 10s clearly (see the sketch after this list)
- py-spy attaches to a running Python process without code changes — no import overhead, no measurement bias
- Cold starts skew results: warm up the client connection before recording measurements
- Profile at steady state, not during startup — connection pooling, JIT, and caching effects are meaningful
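A quick sketch of the averages-hide-disasters fact above, using synthetic numbers (99 requests at 10ms plus one at 10s) rather than real measurements:
import statistics

# 99 fast requests and a single 10s outlier, all in milliseconds
samples = sorted([10.0] * 99 + [10_000.0])
n = len(samples)
mean_ms = statistics.mean(samples)   # 109.9: the single outlier dominates the mean
p50 = samples[int(n * 0.50)]         # 10.0: typical experience is unchanged
p99 = samples[int(n * 0.99)]         # 10000.0: the disaster is clearly visible
print(f"mean={mean_ms:.1f}ms  p50={p50:.1f}ms  p99={p99:.1f}ms")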
Percentile Reference
| Percentile | What it tells you | When it regresses, check |
|---|---|---|
| p50 (median) | Typical request experience | General performance regression |
| p95 | Slow user experience; 5% worse than this | Outlier workloads, connection contention |
| p99 | Architectural bottlenecks; 1% worse than this | GIL contention, blocking I/O in async code, GC pauses |
| p99.9 | Worst of the worst; meaningful only at high traffic | Resource leaks, timeouts, retry storms |
For MCP transport comparison (STDIO vs HTTP), p50 shows the baseline difference; p95 shows how each degrades under load.
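The benchmark snippets below compute percentiles by sorting and indexing (latencies[int(n * p)]), a nearest-rank approximation. Here is a small helper sketch showing that approach alongside the stdlib's interpolated statistics.quantiles so the two can be cross-checked; the helper names are illustrative:
import statistics

def percentiles_nearest_rank(latencies_ms: list[float]) -> dict[str, float]:
    """Nearest-rank p50/p95/p99, matching the indexing used in the benchmarks below."""
    data = sorted(latencies_ms)
    n = len(data)
    return {"p50": data[int(n * 0.50)], "p95": data[int(n * 0.95)], "p99": data[int(n * 0.99)]}

def percentiles_interpolated(latencies_ms: list[float]) -> dict[str, float]:
    """Interpolated percentiles via the stdlib; values differ slightly from nearest-rank."""
    cuts = statistics.quantiles(latencies_ms, n=100)  # 99 cut points between p1 and p99
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}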
Measuring Latency in Async Python
Basic Pattern with time.perf_counter
import asyncio
import time
import statistics
import httpx
async def measure_endpoint(
url: str,
n_samples: int = 1000,
) -> dict:
latencies = []
async with httpx.AsyncClient() as client:
# Warm-up: 10 requests not recorded
for _ in range(10):
await client.get(url)
# Measurement phase
for _ in range(n_samples):
start = time.perf_counter()
await client.get(url)
elapsed_ms = (time.perf_counter() - start) * 1000
latencies.append(elapsed_ms)
latencies.sort()
n = len(latencies)
return {
"p50": latencies[int(n * 0.50)],
"p95": latencies[int(n * 0.95)],
"p99": latencies[int(n * 0.99)],
"min": latencies[0],
"max": latencies[-1],
"mean": statistics.mean(latencies),
"stdev": statistics.stdev(latencies),
    }
Concurrent Load Pattern
For realistic load that exercises connection pooling:
import asyncio
import time
import httpx
async def concurrent_benchmark(
url: str,
n_samples: int = 1000,
concurrency: int = 10,
) -> dict:
latencies = []
semaphore = asyncio.Semaphore(concurrency)
async def single_request(client: httpx.AsyncClient):
async with semaphore:
start = time.perf_counter()
await client.get(url)
return (time.perf_counter() - start) * 1000
async with httpx.AsyncClient(limits=httpx.Limits(max_connections=concurrency)) as client:
# Warm-up
await asyncio.gather(*[single_request(client) for _ in range(concurrency)])
# Benchmark
tasks = [single_request(client) for _ in range(n_samples)]
latencies = await asyncio.gather(*tasks)
latencies = sorted(latencies)
n = len(latencies)
return {
"p50": latencies[int(n * 0.50)],
"p95": latencies[int(n * 0.95)],
"p99": latencies[int(n * 0.99)],
"samples": n,
"concurrency": concurrency,
    }
MCP Transport Comparison Pattern
For mcpindex, comparing STDIO vs HTTP transport latency:
import asyncio
import json
import time
async def benchmark_stdio_transport(
server_command: list[str],
tool_name: str,
n_samples: int = 200,
) -> dict:
    """Benchmark MCP STDIO transport: wall time per tool call, including process spawn."""
latencies = []
    for i in range(n_samples + 10):  # first 10 iterations are warm-up
start = time.perf_counter()
proc = await asyncio.create_subprocess_exec(
*server_command,
stdin=asyncio.subprocess.PIPE,
stdout=asyncio.subprocess.PIPE,
)
# Send tool call JSON-RPC request
request = json.dumps({
"jsonrpc": "2.0",
"id": 1,
"method": "tools/call",
"params": {"name": tool_name, "arguments": {}},
}) + "\n"
stdout, _ = await proc.communicate(input=request.encode())
elapsed_ms = (time.perf_counter() - start) * 1000
        if i >= 10:  # skip warm-up
latencies.append(elapsed_ms)
latencies.sort()
n = len(latencies)
    return {"p50": latencies[int(n * 0.50)], "p95": latencies[int(n * 0.95)], "p99": latencies[int(n * 0.99)]}
Profiling with py-spy
py-spy is a sampling profiler that attaches to a running Python process without code changes. Essential for identifying where latency comes from once you have a baseline.
Installation
pip install py-spy
# or: uv add py-spy
Attaching to a running process
# Find the PID of your Python process
ps aux | grep python
# Attach and record a flame graph (30 seconds)
py-spy record -o profile.svg --pid 12345 --duration 30
# Or dump the current stack trace
py-spy dump --pid 12345
The flame graph (profile.svg) shows where time is actually spent. Wide bars = hot code paths.
Running directly with py-spy
py-spy record -o profile.svg -- python -m mcpindex scan --target http://localhost:8000
Reading the flame graph
- Bottom = outermost call (main/event loop)
- Top = innermost call (where time is actually spent)
- Width = proportion of total time spent in this call
Look for:
- Unexpectedly wide bars outside your code (e.g., json parsing, ssl handshake)
- asyncio._run_once or select() being wide = I/O wait (often correct, but check)
- GC frames (gc.collect) being wide = memory pressure
Statistical Validity
For reliable percentile estimates:
| Samples | p50 stability | p95 stability | p99 stability |
|---|---|---|---|
| 100 | Good | Poor | Very poor |
| 500 | Good | OK | Poor |
| 1000 | Good | Good | OK |
| 5000 | Good | Good | Good |
Minimum recommendation: 1000 samples for p95; 5000 for a p99 that you want to track over time.
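One way to check whether the sample count is large enough is to repeat the benchmark and look at how far p99 moves between runs. A sketch assuming the measure_endpoint helper defined above; the URL and the 1.5x rule of thumb are illustrative:
import asyncio

async def check_percentile_stability(url: str, runs: int = 5, n_samples: int = 1000) -> None:
    """Repeat the benchmark and report the spread of each percentile across runs."""
    results = [await measure_endpoint(url, n_samples=n_samples) for _ in range(runs)]
    for key in ("p50", "p95", "p99"):
        values = [r[key] for r in results]
        spread = max(values) / min(values)  # rule of thumb: >1.5x spread on p99 means too few samples
        print(f"{key}: {min(values):.1f}-{max(values):.1f} ms ({spread:.2f}x spread)")

# asyncio.run(check_percentile_stability("http://localhost:8000/health"))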
Avoiding Measurement Bias
# BAD: cold start included
for i in range(1000):
start = time.perf_counter()
response = await client.get(url)
    latencies.append((time.perf_counter() - start) * 1000)
# GOOD: warm-up excluded from measurement
async with httpx.AsyncClient() as client:
# Warm up connection pool and any server-side caches
for _ in range(10):
await client.get(url)
# Now measure
for _ in range(1000):
start = time.perf_counter()
await client.get(url)
        latencies.append((time.perf_counter() - start) * 1000)
Reporting Format
Standard output for latency baselines:
Transport: HTTP (streamable-HTTP, 2025-03-26)
Server: test-mcp-server v1.2.0
Operation: tools/list
Samples: 1000 | Concurrency: 1
p50: 12.3 ms
p95: 45.1 ms
p99: 112.8 ms
min: 8.1 ms
max: 890.2 ms
mean: 15.4 ms ± 22.1 ms (stdev)
---
Transport: STDIO
Samples: 200 | Concurrency: 1 (inherently serial)
p50: 85.2 ms (6.9x HTTP p50)
p95: 143.6 ms
p99: 201.4 ms
Record baselines at a fixed server version and commit to enable regression tracking across mcpindex releases.
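A minimal sketch of that workflow, assuming results is the dict returned by the benchmark functions above; the baseline path and the 20% regression threshold are illustrative choices, not mcpindex conventions:
import json
from pathlib import Path

BASELINE_PATH = Path("benchmarks/baseline_http_tools_list.json")  # committed alongside the code

def save_baseline(results: dict) -> None:
    """Write the current percentiles as the committed baseline."""
    BASELINE_PATH.parent.mkdir(parents=True, exist_ok=True)
    BASELINE_PATH.write_text(json.dumps(results, indent=2))

def check_regression(results: dict, threshold: float = 1.20) -> list[str]:
    """Return a message for every percentile that exceeds the baseline by more than the threshold."""
    baseline = json.loads(BASELINE_PATH.read_text())
    return [
        f"{key}: {results[key]:.1f} ms vs baseline {baseline[key]:.1f} ms"
        for key in ("p50", "p95", "p99")
        if results[key] > baseline[key] * threshold
    ]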
[Source: py-spy documentation; Python performance optimization best practices, 2025]
Common Failure Cases
p99 estimates are unstable because fewer than 1000 samples were collected
Why: p99 requires that 1% of samples (10 out of 1000) fall above the reported value; with only 100 samples, p99 is estimated from just one sample and varies wildly between runs — a single outlier can double the reported value.
Detect: re-running the benchmark with the same code shows p99 varying by 2-5x between runs; p50 and p95 are stable but p99 is not.
Fix: collect at least 1000 samples for p95 and 5000 for a p99 that you want to track as a regression metric over time.
Warm-up phase is too short, and connection setup latency contaminates the early measurements
Why: the first HTTP request (or first few STDIO invocations) includes TCP handshake, TLS negotiation, and server-side warm-up; these are 5-10x slower than steady-state and inflate the tail of the distribution.
Detect: the first 10 measurements in the latencies list are significantly higher than the rest; removing them shifts p95/p99 (and the mean) noticeably.
Fix: increase the warm-up phase to at least 10 requests (or more for connection-pooled services); for STDIO transport, warm-up should include at least one full JSON-RPC exchange.
Concurrency semaphore set to 1 during a concurrent benchmark, producing serial measurements
Why: asyncio.Semaphore(concurrency) with concurrency=1 allows only one request at a time; the benchmark appears to run concurrently but is actually serial, making the results identical to the sequential benchmark.
Detect: the concurrent benchmark (n=1000, concurrency=10) takes 10x longer than expected; throughput is the same as the serial baseline.
Fix: verify the semaphore limit matches the intended concurrency; track in-flight requests to confirm parallelism (see the sketch below); reserve asyncio.gather() without a semaphore for unbounded-burst tests, since it does not hold concurrency fixed.
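A small sketch of the confirm-parallelism step from the fix above: count how many requests are in flight at once and report the peak; the function and variable names are illustrative:
import asyncio
import httpx

async def measure_peak_concurrency(url: str, n_samples: int = 100, concurrency: int = 10) -> int:
    """Return the highest number of simultaneously in-flight requests observed."""
    semaphore = asyncio.Semaphore(concurrency)
    in_flight = 0
    peak = 0

    async def bounded_get(client: httpx.AsyncClient) -> None:
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                await client.get(url)
            finally:
                in_flight -= 1

    async with httpx.AsyncClient() as client:
        await asyncio.gather(*[bounded_get(client) for _ in range(n_samples)])
    return peak  # should be close to `concurrency`; a peak of 1 means the benchmark ran serially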
py-spy profile shows most time in asyncio._run_once suggesting I/O wait, but the actual bottleneck is synchronous code
Why: asyncio._run_once being wide in the flame graph is normal; it means the event loop spends most wall-clock time waiting for I/O. A synchronous CPU-bound step that only runs on a small fraction of requests contributes little total time, so it shows up as a narrow bar even though it stalls the event loop and drives p99; flame graphs aggregate time, they do not show tail latency.
Detect: p99 latency is high but py-spy only shows asyncio._run_once as wide; adding time.perf_counter() breakpoints inside the async function reveals a specific code section is slow.
Fix: identify synchronous blocking operations (JSON parsing of large payloads, regex on large strings) and move them to a thread pool using asyncio.to_thread().
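A sketch of the asyncio.to_thread pattern named in that fix; the payload-size cutoff is an arbitrary illustration, and since json.loads holds the GIL while parsing, re-run the benchmark to confirm the offload actually improves p99:
import asyncio
import json

LARGE_PAYLOAD_BYTES = 256 * 1024  # arbitrary cutoff for illustration

async def parse_response_body(body: bytes) -> dict:
    """Parse JSON in a worker thread when the payload is large enough to cause a visible stall."""
    if len(body) < LARGE_PAYLOAD_BYTES:
        return json.loads(body)  # small payloads: parsing inline is cheaper than a thread hop
    return await asyncio.to_thread(json.loads, body)  # large payloads: keep the parse off the event loop thread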
Connections
- protocols/mcp-http-transport — the HTTP transport being benchmarked
- protocols/mcp — STDIO transport as the comparison baseline
- python/ecosystem — httpx, asyncio, structlog for the benchmark harness
- observability/tracing — Langfuse/OTel for production latency (complement to benchmarking)
Open Questions
- At what concurrency level does STDIO transport become worse than HTTP due to process spawn overhead?
- Does mcpindex's scan parallelism interact badly with STDIO's inherently serial nature?
Related reading