Intermediate · AI Engineer

Implement and measure prompt caching

Take an existing system with a long system prompt (1000+ tokens) and add Anthropic prompt caching. You will add cache_control breakpoints, make five repeated API calls, read the usage fields to confirm cache hits, and calculate the actual cost and latency reduction compared to uncached calls.

Why this matters

Prompt caching is one of the few optimisations that reduces both cost and latency simultaneously. On a system where the system prompt is 2000 tokens and you make 1000 calls a day, caching pays for itself immediately. More importantly, the discipline of reading usage fields from every API response is a habit that catches runaway costs before they become a problem.
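
To make "pays for itself" concrete, here is a rough back-of-envelope sketch using the pricing assumed in step 4 (cache writes billed at a 25% premium over base input, reads at 10% of base; verify current rates before relying on these numbers). At 1000 calls a day, a call arrives roughly every 90 seconds, well inside the 5-minute cache TTL, so the cache stays warm all day:

    # Rough daily cost for a 2000-token system prompt at 1000 calls/day
    # (illustrative numbers; assumes the cache never goes cold)
    daily_tokens = 2000 * 1000                       # 2.0M input tokens/day
    uncached = daily_tokens / 1e6 * 3.00             # base input: $3.00/M
    cached = (2000 / 1e6 * 3.75                      # one cache write at 1.25x base
              + (daily_tokens - 2000) / 1e6 * 0.30)  # 999 reads at 0.1x base
    print(f"${uncached:.2f}/day uncached vs ${cached:.2f}/day cached")  # ~$6.00 vs ~$0.61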

Before you start

You will need Python with the anthropic SDK installed (pip install anthropic) and an Anthropic API key exported as ANTHROPIC_API_KEY, which the client picks up automatically.

Step-by-step guide

  1. Baseline: measure uncached cost

    Make five identical API calls to Claude with your long system prompt. After each call, print the usage object: input_tokens, output_tokens, cache_creation_input_tokens, cache_read_input_tokens. All five should show zero cache hits. Record the total input tokens across all five calls.

    import anthropic, time
    
    client = anthropic.Anthropic()
    
    LONG_SYSTEM = "You are a helpful assistant. " + ("Context: " * 600)  # ~1200 tokens, above the 1024-token caching minimum
    
    def call_api(use_cache: bool = False) -> dict:
        system_content = [{"type": "text", "text": LONG_SYSTEM}]
        if use_cache:
            system_content[0]["cache_control"] = {"type": "ephemeral"}
    
        start = time.time()
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=64,
            system=system_content,
            messages=[{"role": "user", "content": "Say hello briefly."}],
        )
        latency = time.time() - start
        u = response.usage
        return {
            "latency": latency,
            "input_tokens": u.input_tokens,
            "cache_creation": u.cache_creation_input_tokens,
            "cache_read": u.cache_read_input_tokens,
        }
    
    print("=== Baseline (no cache) ===")
    for i in range(5):
        r = call_api(use_cache=False)
        print(f"Call {i+1}: {r}")
  2. Add cache_control breakpoints

    Add a cache_control: {type: 'ephemeral'} marker to the last content block of your system prompt. This tells the API to cache the entire prefix up to and including that block. The cached portion must be at least 1024 tokens on Sonnet and Opus models (2048 on Haiku); if your prompt is shorter, pad it with relevant context.

    # cache_control is added to the content block, not the message
    system_with_cache = [
        {
            "type": "text",
            "text": LONG_SYSTEM,
            "cache_control": {"type": "ephemeral"},  # cache everything up to here
        }
    ]
    
    # If your system prompt has multiple sections, put cache_control
    # on the LAST section you want cached. Everything before it is cached.
    # Minimum 1024 tokens must be in the cached portion.
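
    For instance, here is a minimal sketch of a two-block system prompt (the block contents are illustrative): only the prefix up to and including the marked block is cached, so dynamic per-request text should come after the breakpoint.

    # Hypothetical two-block layout: big stable context first, volatile text last
    STABLE_CONTEXT = "Reference documentation... " * 300  # large, unchanging
    per_request_note = "Today's date is 2025-06-01."      # changes every call

    system_multi = [
        {
            "type": "text",
            "text": STABLE_CONTEXT,
            "cache_control": {"type": "ephemeral"},  # prefix up to here is cached
        },
        {
            "type": "text",
            "text": per_request_note,  # after the breakpoint: processed fresh each call
        },
    ]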
  3. Make five calls and read usage

    Repeat the five calls. The first call creates the cache (cache_creation_input_tokens > 0). Calls two through five should show cache_read_input_tokens matching your system prompt token count and cache_creation_input_tokens at zero. Print each usage object; do not assume it is working.

    print("\n=== With cache ===")
    for i in range(5):
        r = call_api(use_cache=True)
        if r["cache_read"] > 0:
            hit = "HIT"
        elif r["cache_creation"] > 0:
            hit = "MISS (creating)"
        else:
            hit = "NO CACHE (prompt below minimum?)"
        print(
            f"Call {i+1}: {hit} | "
            f"latency={r['latency']:.2f}s | "
            f"created={r['cache_creation']} | "
            f"read={r['cache_read']}"
        )
    
    # Expected output:
    # Call 1: MISS (creating) | latency=1.8s | created=1240 | read=0
    # Call 2: HIT             | latency=1.1s | created=0    | read=1240
    # Call 3: HIT             | latency=1.0s | created=0    | read=1240
  4. Calculate actual savings

    Cache reads cost 10% of the base input token price, and cache writes for the 5-minute TTL are billed at 25% above base. Calculate: creation_tokens * 1.25 * full_price + read_tokens * 0.1 * full_price, then compare to the uncached baseline (total_tokens * full_price). Also measure wall-clock latency for each call; cached calls are typically 20-40% faster on long prompts.

    # Sonnet 4.6 pricing (per million tokens, as of mid-2025 -- verify current rates)
    INPUT_PRICE_PER_M = 3.00                            # $3.00 per 1M input tokens
    CACHE_WRITE_PRICE_PER_M = INPUT_PRICE_PER_M * 1.25  # 5-min cache writes cost 25% extra
    CACHE_READ_PRICE_PER_M = INPUT_PRICE_PER_M * 0.10   # cache reads cost 10% of base
    
    # Example numbers from 5 calls, 1240 cached tokens each
    uncached_total = 1240 * 5  # 6200 input tokens
    uncached_cost = (uncached_total / 1_000_000) * INPUT_PRICE_PER_M
    
    # With cache: 1 write (billed at the 1.25x premium) + 4 reads
    cached_cost = (
        (1240 / 1_000_000) * CACHE_WRITE_PRICE_PER_M +   # first call writes the cache
        (1240 * 4 / 1_000_000) * CACHE_READ_PRICE_PER_M  # remaining 4 calls read it
    )
    
    print(f"Uncached cost: {uncached_cost:.6f}")
    print(f"Cached cost:   {cached_cost:.6f}")
    print(f"Savings:       {(1 - cached_cost / uncached_cost):.0%}")
  5. Test cache expiry

    Wait at least six minutes without making any calls, then repeat one, as sketched below. The 5-minute TTL on the ephemeral cache should have expired, so the call should show cache_creation_input_tokens again rather than cache_read_input_tokens. Note that every cache read refreshes the TTL, so the cache only expires after five minutes of inactivity. This teaches you when the default ephemeral cache is enough and when to consider a longer-lived strategy.
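
    A minimal sketch, reusing call_api from step one (the six-minute sleep is a safety margin over the five-minute TTL):

    print("\n=== Cache expiry test ===")
    r = call_api(use_cache=True)  # warm (or create) the cache
    print(f"Before wait: created={r['cache_creation']} read={r['cache_read']}")
    
    time.sleep(6 * 60)  # stay idle past the 5-minute TTL
    
    r = call_api(use_cache=True)
    status = "expired (re-created)" if r["cache_creation"] > 0 else "still warm"
    print(f"After wait:  created={r['cache_creation']} read={r['cache_read']} -> {status}")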
