Claude Prompt Caching: 89% Bill Reduction in One Setting
Anthropic bills cache writes at 1.25x the input price and cache reads at only 0.1x. For a 50K-token system prompt reused across 100 messages, a session drops from $75.00 to $8.36.
Detailed Explanation
The Math
Anthropic's prompt cache pricing for Claude Opus 4:
- First write (cache miss): 1.25x input price → $18.75/1M
- Subsequent reads (cache hit, 5-min TTL): 0.1x input price → $1.50/1M
- Regular input: $15/1M
Take a chat assistant with a 50,000-token system prompt that handles 100 user messages per session:
| Strategy | Per-session cost | Notes |
|---|---|---|
| No cache | 100 × 50K × $15/1M = $75.00 | Full price every turn |
| With cache | 1 × 50K × $18.75/1M + 99 × 50K × $1.50/1M = $8.36 | 89% reduction |
For a B2B product with 1,000 active sessions per day, that is $75,000/day vs. $8,360/day: a daily delta of $66,640, or roughly $2M per month.
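To sanity-check the arithmetic, here is a minimal Python sketch of the session math above. The prices and token counts come straight from the table; treat them as assumptions and substitute your own.

```python
# Per-1M-token prices from the table above (Claude Opus 4, USD).
BASE_INPUT = 15.00
CACHE_WRITE = BASE_INPUT * 1.25  # $18.75/1M, first write (cache miss)
CACHE_READ = BASE_INPUT * 0.10   # $1.50/1M, subsequent reads (cache hits)

def session_cost(prompt_tokens: int, turns: int, cached: bool) -> float:
    """USD cost of re-sending one stable prompt across `turns` messages."""
    millions = prompt_tokens / 1_000_000
    if not cached:
        return turns * millions * BASE_INPUT
    # One cache write on the first turn, cache reads on every later turn.
    return millions * CACHE_WRITE + (turns - 1) * millions * CACHE_READ

no_cache = session_cost(50_000, 100, cached=False)
with_cache = session_cost(50_000, 100, cached=True)
print(f"no cache:   ${no_cache:.2f}")    # $75.00
print(f"with cache: ${with_cache:.2f}")  # $8.36
print(f"reduction:  {1 - with_cache / no_cache:.0%}")  # 89%
```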
TTL trade-offs
The default TTL is 5 minutes. Anthropic also offers a 1-hour TTL, billed at 2x the base input price on writes (versus 1.25x for the 5-minute tier); a break-even sketch follows the two lists below. Use 1-hour TTL when:
- Sessions span coffee breaks (most chat products).
- You're rate-limiting users to one message per minute.
- The system prompt rarely changes.
Use 5-minute TTL when:
- Users send messages in rapid bursts then disappear.
- The system prompt is per-tenant and tenants rotate.
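Whether the 1-hour tier pays for itself depends on how often the 5-minute cache would expire between messages. Below is a rough break-even sketch using the multipliers above; the gap pattern is a made-up assumption, so feed in your real traffic. Note that Anthropic refreshes the TTL on every cache hit, so only the gap since the previous message matters.

```python
def cached_prompt_cost(gaps_minutes: list[float], ttl_minutes: float,
                       write_mult: float, read_mult: float = 0.10) -> float:
    """Relative cost of one stable prompt over a session, in multiples of the
    base input price. A gap longer than the TTL expires the cache (the TTL
    refreshes on each hit), forcing a fresh cache write."""
    cost = write_mult  # the first message always writes the cache
    for gap in gaps_minutes:
        cost += write_mult if gap > ttl_minutes else read_mult
    return cost

# A user who replies roughly every 8 minutes, ten messages total.
gaps = [8.0] * 9
print(cached_prompt_cost(gaps, ttl_minutes=5, write_mult=1.25))  # 12.5x: every turn re-writes
print(cached_prompt_cost(gaps, ttl_minutes=60, write_mult=2.0))  # 2.9x: one write, nine reads
```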
Cache write granularity
Anthropic caches at the cache_control boundary you mark in the API. Place the marker after stable content (system prompt, retrieved documents) and before per-turn content (user message, conversation history). Markers are limited to 4 per request — use them on the largest stable blocks first.
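In the Python SDK, the marker is a single cache_control field on the last stable block. A minimal sketch, with an illustrative model id and placeholder prompt; the 1-hour ttl variant may require a beta header depending on your SDK version:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

STABLE_SYSTEM_PROMPT = "..."  # your large, rarely-changing 50K-token block

response = client.messages.create(
    model="claude-opus-4-20250514",  # illustrative model id
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": STABLE_SYSTEM_PROMPT,
            # The marker goes AFTER stable content: everything up to and
            # including this block is cached. For the 1-hour tier, use
            # {"type": "ephemeral", "ttl": "1h"} instead.
            "cache_control": {"type": "ephemeral"},  # 5-minute TTL
        }
    ],
    messages=[
        {"role": "user", "content": "How do I rotate my API key?"},  # per-turn
    ],
)

# The usage object reports how the cache behaved on this call.
print(response.usage.cache_creation_input_tokens)  # tokens written this call
print(response.usage.cache_read_input_tokens)      # tokens served from cache
```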
Real-world hit rates
Production observations (the sketch after this list converts hit rate into expected savings):
- Customer support bots: 80-95% hit rate (system prompt rarely changes).
- RAG with retrieval: 30-50% (retrieved chunks vary per query).
- Code agents: 60-75% (project context stable across multiple file edits).
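The hit rate determines how much of the headline 89% you actually capture. A simple expected-value sketch, under the simplifying assumption that each request either fully hits or fully misses the cache:

```python
def effective_multiplier(hit_rate: float, read_mult: float, write_mult: float) -> float:
    """Expected per-token cost of the cached prefix, as a multiple of base input price."""
    return hit_rate * read_mult + (1 - hit_rate) * write_mult

# Anthropic multipliers: 0.1x reads, 1.25x writes.
for label, rate in [("support bot", 0.90), ("code agent", 0.70), ("RAG", 0.40)]:
    m = effective_multiplier(rate, read_mult=0.10, write_mult=1.25)
    print(f"{label}: {m:.2f}x base price")
# Roughly: support bot 0.22x, code agent 0.45x, RAG 0.79x.
# Low hit rates erode the win fast: at 40%, you save only about 21%.
```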
OpenAI comparison
OpenAI's prompt caching is automatic (no cache_control markers; prompt prefixes over 1,024 tokens are cached) and bills cache reads at 0.5x. That is less aggressive than Anthropic's 0.1x but still significant, and there is no write premium.
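Plugging OpenAI's multipliers into the same effective_multiplier sketch from the previous section shows the gap at a high hit rate:

```python
# OpenAI: 0.5x reads, no write premium (1.0x).
print(f"{effective_multiplier(0.90, read_mult=0.50, write_mult=1.00):.2f}x")  # 0.55x
# vs. roughly 0.22x for Anthropic at the same 90% hit rate.
```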
Use Case
Apply this when the system prompt or retrieved context is large and stable across turns: customer support, technical documentation Q&A, code agents, and any persistent persona or character application.
Related Topics
- Long-Context Costs: What 128K Tokens Actually Cost Per Call (Caching & long context)
- Agent Loops: Why a 'Simple' Task Costs 50K Tokens (Caching & long context)
- RAG Pipeline Cost: Embedding + Retrieval + Generation (Workload patterns)
- Monthly Budget Estimation: Build a 30-Day Forecast in 5 Minutes (Operational)
- Cost Optimization Strategies: 10 Techniques to Cut Your LLM Bill (Operational)