Claude Prompt Caching: 89% Bill Reduction in One Setting
Anthropic bills cache writes at 1.25x the input price and cache reads at only 0.1x. For a 50K-token system prompt reused across 100 messages, a session drops from $75.00 to $8.36.
Detailed Explanation
The Math
Anthropic's prompt cache pricing for Claude Opus 4:
- First write (cache miss): 1.25x input price → $18.75/1M
- Subsequent reads (cache hit, 5-min TTL): 0.1x input price → $1.50/1M
- Regular input: $15/1M
Take a chat assistant with a 50,000-token system prompt that handles 100 user messages per session:
| Strategy | Per-session cost | Notes |
|---|---|---|
| No cache | 100 × 50K × $15/1M = $75.00 | Full price every turn |
| With cache | 1 × 50K × $18.75/1M + 99 × 50K × $1.50/1M = $8.36 | 89% reduction |
For a B2B product with 1,000 active sessions per day, that is $75,000/day vs. $8,360/day: a daily delta of $66,640, or roughly $2M per month.
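To sanity-check the arithmetic, here is a minimal Python sketch of the session math above. The prices and token counts come straight from the table; treat them as assumptions and substitute your own.

```python
# Per-1M-token prices from the table above (Claude Opus 4, USD).
BASE_INPUT = 15.00
CACHE_WRITE = BASE_INPUT * 1.25  # $18.75/1M, first write (cache miss)
CACHE_READ = BASE_INPUT * 0.10   # $1.50/1M, subsequent reads (cache hits)

def session_cost(prompt_tokens: int, turns: int, cached: bool) -> float:
    """USD cost of re-sending one stable prompt across `turns` messages."""
    millions = prompt_tokens / 1_000_000
    if not cached:
        return turns * millions * BASE_INPUT
    # One cache write on the first turn, cache reads on every later turn.
    return millions * CACHE_WRITE + (turns - 1) * millions * CACHE_READ

no_cache = session_cost(50_000, 100, cached=False)
with_cache = session_cost(50_000, 100, cached=True)
print(f"no cache:   ${no_cache:.2f}")    # $75.00
print(f"with cache: ${with_cache:.2f}")  # $8.36
print(f"reduction:  {1 - with_cache / no_cache:.0%}")  # 89%
```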
TTL trade-offs
The default TTL is 5 minutes. Anthropic also offers a 1-hour TTL, billed at 2x the base input price on writes (versus 1.25x for the 5-minute tier); a break-even sketch follows the two lists below. Use 1-hour TTL when:
- Sessions span coffee breaks (most chat products).
- You're rate-limiting users to one message per minute.
- The system prompt rarely changes.
Use 5-minute TTL when:
- Users send messages in rapid bursts then disappear.
- The system prompt is per-tenant and tenants rotate.
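Whether the 1-hour tier pays for itself depends on how often the 5-minute cache would expire between messages. Below is a rough break-even sketch using the multipliers above; the gap pattern is a made-up assumption, so feed in your real traffic. Note that Anthropic refreshes the TTL on every cache hit, so only the gap since the previous message matters.

```python
def cached_prompt_cost(gaps_minutes: list[float], ttl_minutes: float,
                       write_mult: float, read_mult: float = 0.10) -> float:
    """Relative cost of one stable prompt over a session, in multiples of the
    base input price. A gap longer than the TTL expires the cache (the TTL
    refreshes on each hit), forcing a fresh cache write."""
    cost = write_mult  # the first message always writes the cache
    for gap in gaps_minutes:
        cost += write_mult if gap > ttl_minutes else read_mult
    return cost

# A user who replies roughly every 8 minutes, ten messages total.
gaps = [8.0] * 9
print(cached_prompt_cost(gaps, ttl_minutes=5, write_mult=1.25))  # 12.5x: every turn re-writes
print(cached_prompt_cost(gaps, ttl_minutes=60, write_mult=2.0))  # 2.9x: one write, nine reads
```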
Cache write granularity
Anthropic caches at the cache_control boundary you mark in the API. Place the marker after stable content (system prompt, retrieved documents) and before per-turn content (user message, conversation history). Markers are limited to 4 per request — use them on the largest stable blocks first.
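In the Python SDK, the marker is a single cache_control field on the last stable block. A minimal sketch, with an illustrative model id and placeholder prompt; the 1-hour ttl variant may require a beta header depending on your SDK version:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

STABLE_SYSTEM_PROMPT = "..."  # your large, rarely-changing 50K-token block

response = client.messages.create(
    model="claude-opus-4-20250514",  # illustrative model id
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": STABLE_SYSTEM_PROMPT,
            # The marker goes AFTER stable content: everything up to and
            # including this block is cached. For the 1-hour tier, use
            # {"type": "ephemeral", "ttl": "1h"} instead.
            "cache_control": {"type": "ephemeral"},  # 5-minute TTL
        }
    ],
    messages=[
        {"role": "user", "content": "How do I rotate my API key?"},  # per-turn
    ],
)

# The usage object reports how the cache behaved on this call.
print(response.usage.cache_creation_input_tokens)  # tokens written this call
print(response.usage.cache_read_input_tokens)      # tokens served from cache
```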
Real-world hit rates
Production observations (the sketch after this list converts hit rate into expected savings):
- Customer support bots: 80-95% hit rate (system prompt rarely changes).
- RAG with retrieval: 30-50% (retrieved chunks vary per query).
- Code agents: 60-75% (project context stable across multiple file edits).
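The hit rate determines how much of the headline 89% you actually capture. A simple expected-value sketch, under the simplifying assumption that each request either fully hits or fully misses the cache:

```python
def effective_multiplier(hit_rate: float, read_mult: float, write_mult: float) -> float:
    """Expected per-token cost of the cached prefix, as a multiple of base input price."""
    return hit_rate * read_mult + (1 - hit_rate) * write_mult

# Anthropic multipliers: 0.1x reads, 1.25x writes.
for label, rate in [("support bot", 0.90), ("code agent", 0.70), ("RAG", 0.40)]:
    m = effective_multiplier(rate, read_mult=0.10, write_mult=1.25)
    print(f"{label}: {m:.2f}x base price")
# Roughly: support bot 0.22x, code agent 0.45x, RAG 0.79x.
# Low hit rates erode the win fast: at 40%, you save only about 21%.
```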
OpenAI comparison
OpenAI's prompt caching is automatic (no cache_control markers; prompt prefixes over 1,024 tokens are cached) and bills cache reads at 0.5x. That is less aggressive than Anthropic's 0.1x but still significant, and there is no write premium.
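Plugging OpenAI's multipliers into the same effective_multiplier sketch from the previous section shows the gap at a high hit rate:

```python
# OpenAI: 0.5x reads, no write premium (1.0x).
print(f"{effective_multiplier(0.90, read_mult=0.50, write_mult=1.00):.2f}x")  # 0.55x
# vs. roughly 0.22x for Anthropic at the same 90% hit rate.
```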
Use Case
Apply this when the system prompt or retrieved context is large and stable across turns: customer support, technical documentation Q&A, code agents, and any persistent persona or character application.
Related Topics
- Long-Context Costs: What 128K Tokens Actually Cost Per Call (Caching & long context)
- Agent Loops: Why a 'Simple' Task Costs 50K Tokens (Caching & long context)
- RAG Pipeline Cost: Embedding + Retrieval + Generation (Workload patterns)
- Monthly Budget Estimation: Build a 30-Day Forecast in 5 Minutes (Operational)
- Cost Optimization Strategies: 10 Techniques to Cut Your LLM Bill (Operational)