Implement and measure prompt caching
Take an existing system with a long system prompt (1024+ tokens) and add Anthropic prompt caching. You will add cache_control breakpoints, make five repeated API calls, read the usage fields to confirm cache hits, and calculate the actual cost and latency reduction compared to uncached calls.
Why this matters
Prompt caching is one of the few optimisations that reduces both cost and latency simultaneously. On a system where the system prompt is 2000 tokens and you make 1000 calls a day, the only overhead is a 25% write premium on the first call of each cache window; every read after that costs a tenth of the base input price, so caching pays for itself almost immediately. More importantly, the discipline of reading usage fields from every API response is a habit that catches runaway costs before they become a problem.
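A quick back-of-envelope check of that claim, assuming Sonnet-class pricing of $3.00 per million input tokens (the same figure used in step 4), a 25% premium for cache writes, a 90% discount for cache reads, and traffic steady enough that calls stay inside the 5-minute cache window:

# Rough daily cost of the system prompt alone: 2000 tokens, 1000 calls/day.
PROMPT_TOKENS = 2000
CALLS_PER_DAY = 1000
BASE = 3.00 / 1_000_000  # dollars per input token (assumed Sonnet-class pricing)

uncached = PROMPT_TOKENS * CALLS_PER_DAY * BASE                  # ~$6.00/day
cached = (PROMPT_TOKENS * BASE * 1.25                            # one cache write
          + PROMPT_TOKENS * (CALLS_PER_DAY - 1) * BASE * 0.10)   # 999 cache reads
print(f"System-prompt cost/day: ${uncached:.2f} uncached vs ${cached:.2f} cached")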
Before you start
- Anthropic API access with a key that has access to Claude Sonnet or Opus
- A system prompt of at least 1024 tokens (required minimum for caching to activate; see the token-count sketch after this list if you are unsure of your prompt's size)
- Python with the anthropic SDK installed
- Basic understanding of the Messages API request/response structure
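If you are not sure your system prompt clears the 1024-token minimum, the SDK's token-counting endpoint can tell you before you start. A minimal sketch; the prompt and model name simply mirror the ones used in the steps below:

import anthropic

client = anthropic.Anthropic()
LONG_SYSTEM = "You are a helpful assistant. " + ("Context: " * 600)

count = client.messages.count_tokens(
    model="claude-sonnet-4-6",
    system=[{"type": "text", "text": LONG_SYSTEM}],
    messages=[{"role": "user", "content": "Say hello briefly."}],
)
# The count includes the short user message too; you want it comfortably above 1024.
print(f"Prompt size: {count.input_tokens} tokens")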
Step-by-step guide
- 1
Baseline: measure uncached cost
Make five identical API calls to Claude with your long system prompt. After each call, print the usage object: input_tokens, output_tokens, cache_creation_input_tokens, cache_read_input_tokens. All five should show zero cache hits. Record the total input tokens across all five calls.
import anthropic, time

client = anthropic.Anthropic()

LONG_SYSTEM = "You are a helpful assistant. " + ("Context: " * 600)  # ~1200 tokens

def call_api(use_cache: bool = False) -> dict:
    system_content = [{"type": "text", "text": LONG_SYSTEM}]
    if use_cache:
        system_content[0]["cache_control"] = {"type": "ephemeral"}
    start = time.time()
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=64,
        system=system_content,
        messages=[{"role": "user", "content": "Say hello briefly."}],
    )
    latency = time.time() - start
    u = response.usage
    return {
        "latency": latency,
        "input_tokens": u.input_tokens,
        "cache_creation": u.cache_creation_input_tokens,
        "cache_read": u.cache_read_input_tokens,
    }

print("=== Baseline (no cache) ===")
for i in range(5):
    r = call_api(use_cache=False)
    print(f"Call {i+1}: {r}")
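Step 1 also asks you to record the total input tokens across all five calls; one way to do that is to swap the print loop above for a version that keeps running totals (a small sketch reusing the same call_api helper):

baseline_tokens = 0
baseline_latency = 0.0
for i in range(5):
    r = call_api(use_cache=False)
    baseline_tokens += r["input_tokens"]
    baseline_latency += r["latency"]
    print(f"Call {i+1}: {r}")
print(f"Baseline totals: {baseline_tokens} input tokens, {baseline_latency:.2f}s wall clock")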
- 2
Add cache_control breakpoints
Add a cache_control: {type: 'ephemeral'} marker to the last content block of your system prompt. This tells Anthropic to cache everything up to that point. The cached portion must be at least 1024 tokens; if your prompt is shorter, pad it with relevant context.
# cache_control is added to the content block, not the message
system_with_cache = [
    {
        "type": "text",
        "text": LONG_SYSTEM,
        "cache_control": {"type": "ephemeral"},  # cache everything up to here
    }
]
# If your system prompt has multiple sections, put cache_control
# on the LAST section you want cached. Everything before it is cached.
# Minimum 1024 tokens must be in the cached portion.
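If your system prompt is built from several content blocks, for example short instructions followed by a large reference document, only the prefix up to and including the block that carries cache_control is cached. A sketch of that layout; REFERENCE_DOC is a hypothetical placeholder for your own long context:

REFERENCE_DOC = "..."  # hypothetical: your own 1000+ token reference material

system_multi_block = [
    {"type": "text", "text": "You are a support assistant."},
    {
        "type": "text",
        "text": REFERENCE_DOC,
        "cache_control": {"type": "ephemeral"},  # everything up to and including this block is cached
    },
]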
- 3
Make five calls and read usage
Repeat the five calls. The first call creates the cache (cache_creation_input_tokens > 0). Calls two through five should show cache_read_input_tokens matching your system prompt token count and cache_creation_input_tokens at zero. Print each usage object; do not assume it is working.
print("\n=== With cache ===") for i in range(5): r = call_api(use_cache=True) hit = "MISS (creating)" if r["cache_creation"] > 0 else "HIT" print( f"Call {i+1}: {hit} | " f"latency={r['latency']:.2f}s | " f"created={r['cache_creation']} | " f"read={r['cache_read']}" ) # Expected output: # Call 1: MISS (creating) | latency=1.8s | created=1240 | read=0 # Call 2: HIT | latency=1.1s | created=0 | read=1240 # Call 3: HIT | latency=1.0s | created=0 | read=1240 - 4
- 4
Calculate actual savings
Cache reads cost 10% of the base input token price; the write that creates the cache costs 25% more than base. Calculate the cached cost as cache_creation_tokens * 1.25 * full_price + cache_read_tokens * 0.1 * full_price and compare it to the uncached baseline (all five calls at full price). Also measure wall-clock latency for each call; cached calls are typically 20-40% faster on long prompts.
# Sonnet 4.6 pricing (per million tokens, as of mid-2025)
INPUT_PRICE_PER_M = 3.00                            # $3.00 per 1M input tokens
CACHE_WRITE_PRICE_PER_M = INPUT_PRICE_PER_M * 1.25  # 5-minute cache writes cost 25% more than base
CACHE_READ_PRICE_PER_M = INPUT_PRICE_PER_M * 0.10   # cache reads cost 10% of base

# Example numbers from 5 calls, 1240 cached tokens each
uncached_total = 1240 * 5  # 6200 input tokens
uncached_cost = (uncached_total / 1_000_000) * INPUT_PRICE_PER_M

# With cache: 1 creation + 4 reads
cached_cost = (
    (1240 / 1_000_000) * CACHE_WRITE_PRICE_PER_M       # first call creates the cache
    + (1240 * 4 / 1_000_000) * CACHE_READ_PRICE_PER_M  # remaining 4 calls read the cache
)

print(f"Uncached cost: {uncached_cost:.6f}")
print(f"Cached cost: {cached_cost:.6f}")
print(f"Savings: {(1 - cached_cost / uncached_cost):.0%}")
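Rather than hard-coding token counts, you can compute both totals straight from the usage fields you already collect. A sketch assuming the pricing constants above and the call_api helper from step 1 (if you run this right after step 3 the cache may still be warm, so all five cached calls can show up as reads):

def run_cost(runs: list[dict]) -> float:
    """Prompt-side dollar cost for a list of call_api results."""
    cost = 0.0
    for r in runs:
        cost += (r["input_tokens"] / 1_000_000) * INPUT_PRICE_PER_M
        cost += (r["cache_creation"] / 1_000_000) * CACHE_WRITE_PRICE_PER_M
        cost += (r["cache_read"] / 1_000_000) * CACHE_READ_PRICE_PER_M
    return cost

baseline_runs = [call_api(use_cache=False) for _ in range(5)]
cached_runs = [call_api(use_cache=True) for _ in range(5)]
print(f"Measured savings: {1 - run_cost(cached_runs) / run_cost(baseline_runs):.0%}")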
- 5
Test cache expiry
Wait 6 minutes and repeat a call. The 5-minute TTL for ephemeral caches should have expired, so the call should show cache_creation_input_tokens again rather than cache_read_input_tokens. This shows you how long a cache actually survives between calls, and when the default ephemeral cache is enough versus when you need a longer-lived caching strategy.
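Step 5 has no listing of its own; a minimal sketch reusing call_api from step 1. The six-minute sleep is deliberate: the five-minute TTL is refreshed each time the cached prefix is read, so you need a gap longer than five minutes with no calls at all:

print("\n=== Expiry test ===")
r = call_api(use_cache=True)   # warm (or create) the cache
print(f"Before wait: created={r['cache_creation']} read={r['cache_read']}")

time.sleep(6 * 60)             # exceed the 5-minute ephemeral TTL

r = call_api(use_cache=True)
status = "expired, re-created" if r["cache_creation"] > 0 else "still cached"
print(f"After wait: created={r['cache_creation']} read={r['cache_read']} -> {status}")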