Intermediate · AI Engineer

Implement and measure prompt caching

Take an existing system with a long system prompt (1000+ tokens) and add Anthropic prompt caching. You will add cache_control breakpoints, make five repeated API calls, read the usage fields to confirm cache hits, and calculate the actual cost and latency reduction compared to uncached calls.

Why this matters

Prompt caching is one of the few optimisations that reduces both cost and latency simultaneously. On a system where the system prompt is 2000 tokens and you make 1000 calls a day, caching pays for itself immediately. More importantly, the discipline of reading usage fields from every API response is a habit that catches runaway costs before they become a problem.
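
To make "pays for itself" concrete, here is a rough back-of-envelope sketch using the pricing assumed in step 4 (cache writes billed at a 25% premium over base input, reads at 10% of base; verify current rates before relying on these numbers). At 1000 calls a day, a call arrives roughly every 90 seconds, well inside the 5-minute cache TTL, so the cache stays warm all day:

    # Rough daily cost for a 2000-token system prompt at 1000 calls/day
    # (illustrative numbers; assumes the cache never goes cold)
    daily_tokens = 2000 * 1000                       # 2.0M input tokens/day
    uncached = daily_tokens / 1e6 * 3.00             # base input: $3.00/M
    cached = (2000 / 1e6 * 3.75                      # one cache write at 1.25x base
              + (daily_tokens - 2000) / 1e6 * 0.30)  # 999 reads at 0.1x base
    print(f"${uncached:.2f}/day uncached vs ${cached:.2f}/day cached")  # ~$6.00 vs ~$0.61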

Before you start

You will need Python with the anthropic SDK installed (pip install anthropic) and an Anthropic API key exported as ANTHROPIC_API_KEY, which the client picks up automatically.

Step-by-step guide

  1. Baseline: measure uncached cost

    Make five identical API calls to Claude with your long system prompt. After each call, print the usage object: input_tokens, output_tokens, cache_creation_input_tokens, cache_read_input_tokens. All five should show zero cache hits. Record the total input tokens across all five calls.

    import anthropic, time
    
    client = anthropic.Anthropic()
    
    LONG_SYSTEM = "You are a helpful assistant. " + ("Context: " * 600)  # ~1200 tokens, above the 1024-token caching minimum
    
    def call_api(use_cache: bool = False) -> dict:
        system_content = [{"type": "text", "text": LONG_SYSTEM}]
        if use_cache:
            system_content[0]["cache_control"] = {"type": "ephemeral"}
    
        start = time.time()
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=64,
            system=system_content,
            messages=[{"role": "user", "content": "Say hello briefly."}],
        )
        latency = time.time() - start
        u = response.usage
        return {
            "latency": latency,
            "input_tokens": u.input_tokens,
            "cache_creation": u.cache_creation_input_tokens,
            "cache_read": u.cache_read_input_tokens,
        }
    
    print("=== Baseline (no cache) ===")
    for i in range(5):
        r = call_api(use_cache=False)
        print(f"Call {i+1}: {r}")
  2. Add cache_control breakpoints

    Add a cache_control: {type: 'ephemeral'} marker to the last content block of your system prompt. This tells the API to cache the entire prefix up to and including that block. The cached portion must be at least 1024 tokens on Sonnet and Opus models (2048 on Haiku); if your prompt is shorter, pad it with relevant context.

    # cache_control is added to the content block, not the message
    system_with_cache = [
        {
            "type": "text",
            "text": LONG_SYSTEM,
            "cache_control": {"type": "ephemeral"},  # cache everything up to here
        }
    ]
    
    # If your system prompt has multiple sections, put cache_control
    # on the LAST section you want cached. Everything before it is cached.
    # Minimum 1024 tokens must be in the cached portion.
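
    For instance, here is a minimal sketch of a two-block system prompt (the block contents are illustrative): only the prefix up to and including the marked block is cached, so dynamic per-request text should come after the breakpoint.

    # Hypothetical two-block layout: big stable context first, volatile text last
    STABLE_CONTEXT = "Reference documentation... " * 300  # large, unchanging
    per_request_note = "Today's date is 2025-06-01."      # changes every call

    system_multi = [
        {
            "type": "text",
            "text": STABLE_CONTEXT,
            "cache_control": {"type": "ephemeral"},  # prefix up to here is cached
        },
        {
            "type": "text",
            "text": per_request_note,  # after the breakpoint: processed fresh each call
        },
    ]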
  3. Make five calls and read usage

    Repeat the five calls. The first call creates the cache (cache_creation_input_tokens > 0). Calls two through five should show cache_read_input_tokens matching your system prompt token count and cache_creation_input_tokens at zero. Print each usage object; do not assume it is working.

    print("\n=== With cache ===")
    for i in range(5):
        r = call_api(use_cache=True)
        if r["cache_read"] > 0:
            hit = "HIT"
        elif r["cache_creation"] > 0:
            hit = "MISS (creating)"
        else:
            hit = "NO CACHE (prompt below minimum?)"
        print(
            f"Call {i+1}: {hit} | "
            f"latency={r['latency']:.2f}s | "
            f"created={r['cache_creation']} | "
            f"read={r['cache_read']}"
        )
    
    # Expected output:
    # Call 1: MISS (creating) | latency=1.8s | created=1240 | read=0
    # Call 2: HIT             | latency=1.1s | created=0    | read=1240
    # Call 3: HIT             | latency=1.0s | created=0    | read=1240
  4. Calculate actual savings

    Cache reads cost 10% of the base input token price, and cache writes for the 5-minute TTL are billed at 25% above base. Calculate: creation_tokens * 1.25 * full_price + read_tokens * 0.1 * full_price, then compare to the uncached baseline (total_tokens * full_price). Also measure wall-clock latency for each call; cached calls are typically 20-40% faster on long prompts.

    # Sonnet 4.6 pricing (per million tokens, as of mid-2025 -- verify current rates)
    INPUT_PRICE_PER_M = 3.00                            # $3.00 per 1M input tokens
    CACHE_WRITE_PRICE_PER_M = INPUT_PRICE_PER_M * 1.25  # 5-min cache writes cost 25% extra
    CACHE_READ_PRICE_PER_M = INPUT_PRICE_PER_M * 0.10   # cache reads cost 10% of base
    
    # Example numbers from 5 calls, 1240 cached tokens each
    uncached_total = 1240 * 5  # 6200 input tokens
    uncached_cost = (uncached_total / 1_000_000) * INPUT_PRICE_PER_M
    
    # With cache: 1 write (billed at the 1.25x premium) + 4 reads
    cached_cost = (
        (1240 / 1_000_000) * CACHE_WRITE_PRICE_PER_M +   # first call writes the cache
        (1240 * 4 / 1_000_000) * CACHE_READ_PRICE_PER_M  # remaining 4 calls read it
    )
    
    print(f"Uncached cost: {uncached_cost:.6f}")
    print(f"Cached cost:   {cached_cost:.6f}")
    print(f"Savings:       {(1 - cached_cost / uncached_cost):.0%}")
  5. Test cache expiry

    Wait at least six minutes without making any calls, then repeat one, as sketched below. The 5-minute TTL on the ephemeral cache should have expired, so the call should show cache_creation_input_tokens again rather than cache_read_input_tokens. Note that every cache read refreshes the TTL, so the cache only expires after five minutes of inactivity. This teaches you when the default ephemeral cache is enough and when to consider a longer-lived strategy.
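
    A minimal sketch, reusing call_api from step one (the six-minute sleep is a safety margin over the five-minute TTL):

    print("\n=== Cache expiry test ===")
    r = call_api(use_cache=True)  # warm (or create) the cache
    print(f"Before wait: created={r['cache_creation']} read={r['cache_read']}")
    
    time.sleep(6 * 60)  # stay idle past the 5-minute TTL
    
    r = call_api(use_cache=True)
    status = "expired (re-created)" if r["cache_creation"] > 0 else "still warm"
    print(f"After wait:  created={r['cache_creation']} read={r['cache_read']} -> {status}")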
