Cost Optimization Strategies: 10 Techniques to Cut Your LLM Bill
From prompt caching (40-90% off your Anthropic bill) to model routing (5-10x cheaper for easy queries), these ten techniques have all paid for themselves in a real production system.
Detailed Explanation
The Ten Levers, Ranked by Typical Impact
1. Prompt caching (Anthropic / OpenAI) — 40-90% reduction
For any workload with stable system prompts or retrieved context. Set up cache_control markers and watch the bill drop. See Claude Prompt Caching.
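A minimal sketch with the Anthropic Python SDK; the model name and prompt contents are placeholders. The stable prefix is marked cacheable, so subsequent calls that reuse it verbatim are billed at the much cheaper cache-read rate.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

STABLE_SYSTEM_PROMPT = "You are a support agent for Acme. <several thousand tokens of policy text>"

response = client.messages.create(
    model="claude-3-5-sonnet-latest",            # placeholder; any caching-capable Claude model
    max_tokens=512,
    system=[
        {
            "type": "text",
            "text": STABLE_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},  # cache everything up to and including this block
        }
    ],
    messages=[{"role": "user", "content": "How do I reset my password?"}],
)
print(response.content[0].text)
```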
2. Model routing — 5-10x reduction on easy queries
Use a cheap classifier (GPT-4o mini, ~$0.0001 per call) to route each request (sketch after the list):
- Easy / FAQ → GPT-4o mini
- Hard / multi-step → GPT-4o or Claude Opus
Real production data shows 70-85% of queries qualify as "easy". Effective average cost drops by 5-7x.
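A minimal sketch of such a router using the OpenAI Python SDK; the model names, routing prompt, and thresholds are assumptions to adjust for your own stack.

```python
from openai import OpenAI

client = OpenAI()

ROUTER_PROMPT = (
    "Classify the user request as 'easy' (FAQ, lookup, rephrasing) or "
    "'hard' (multi-step reasoning, code, analysis). Answer with one word."
)

def answer(user_message: str) -> str:
    # Cheap classifier call decides which model handles the real request.
    verdict = client.chat.completions.create(
        model="gpt-4o-mini",
        max_tokens=3,
        messages=[
            {"role": "system", "content": ROUTER_PROMPT},
            {"role": "user", "content": user_message},
        ],
    ).choices[0].message.content.strip().lower()

    model = "gpt-4o-mini" if verdict.startswith("easy") else "gpt-4o"
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": user_message}],
    )
    return reply.choices[0].message.content
```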
3. Batch API for non-realtime — 50% reduction
Move enrichment, classification, and async pipelines to OpenAI / Anthropic batch endpoints. See Batch Processing.
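A minimal sketch of the OpenAI flavour: write requests to a JSONL file, upload it, and create a batch with a 24-hour completion window. File names, custom_id values, and the classification prompt are placeholders.

```python
import json
from openai import OpenAI

client = OpenAI()

# One JSONL line per request; custom_id lets you join results back to your records.
rows = [
    {
        "custom_id": f"doc-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",
            "messages": [{"role": "user", "content": f"Classify this ticket: {text}"}],
        },
    }
    for i, text in enumerate(["Refund request ...", "Login problem ..."])
]
with open("batch_input.jsonl", "w") as f:
    f.write("\n".join(json.dumps(r) for r in rows))

batch_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",  # results within 24 hours, billed at the discounted batch rate
)
print(batch.id, batch.status)
```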
4. Output token reduction — 30-60% on output cost
- Add "Be concise. Limit to 3 sentences." to system prompt.
- Use structured output (JSON schema) to avoid verbose narrative.
- Set max_tokens to a realistic ceiling (the API enforces it as a hard cap on output).
Output costs are typically 4-5x input costs per token, so cutting output tokens has outsized impact.
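A minimal sketch combining the three levers in one OpenAI call; the ceiling, JSON shape, and model are illustrative, not prescriptive.

```python
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    max_tokens=150,                            # hard ceiling on billable output tokens
    response_format={"type": "json_object"},   # structured output instead of narrative
    messages=[
        {
            "role": "system",
            "content": (
                "Be concise. Limit to 3 sentences. Respond as JSON: "
                '{"answer": "...", "confidence": "low|medium|high"}'
            ),
        },
        {"role": "user", "content": "Why does my container restart in a loop?"},
    ],
)
print(resp.choices[0].message.content)
```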
5. Context trimming — 20-50% on input cost
- Drop conversation history older than N turns.
- Summarize old conversation to a 200-token note instead of keeping full transcript.
- For RAG, lower top-k from 10 → 5 and re-evaluate quality.
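A minimal sketch of the first two bullets. The keep_last value and the summarize hook are assumptions; any cheap model call that returns a ~200-token note will do.

```python
def trim_history(messages, keep_last=6, summarize=None):
    """Keep the last N turns verbatim; collapse everything older into one short note."""
    system, rest = messages[0], messages[1:]       # assumes messages[0] is the system prompt
    if len(rest) <= keep_last:
        return messages
    old, recent = rest[:-keep_last], rest[-keep_last:]
    note = summarize(old) if summarize else "Earlier turns omitted to save tokens."
    return [
        system,
        {"role": "system", "content": f"Summary of earlier conversation: {note}"},
    ] + recent
```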
6. Switch from GPT-4o → GPT-4o mini for selected workloads — 17x reduction
Many "I need a frontier model" claims don't survive A/B testing. Run GPT-4o mini for a week on a 10% traffic slice and check user-facing metrics.
7. Streaming + early termination — 10-30% reduction
If your app can stop generation at a specific token (e.g., "I'm done.", a closing brace, an end marker), stream the output and close the stream as soon as the marker appears. Cuts output cost on cases where the model would have rambled past the useful answer.
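A minimal sketch with the OpenAI Python SDK, assuming your prompt asks the model to emit an end marker and that the SDK's stream object exposes close() to drop the connection.

```python
from openai import OpenAI

client = OpenAI()
END_MARKER = "</answer>"   # assumed marker your prompt instructs the model to emit when done

stream = client.chat.completions.create(
    model="gpt-4o-mini",
    stream=True,
    messages=[{"role": "user", "content": "Answer briefly, then write </answer> and stop."}],
)
buffer = ""
for chunk in stream:
    buffer += chunk.choices[0].delta.content or ""
    if END_MARKER in buffer:
        stream.close()     # stop the generation; you only pay for tokens produced so far
        break
print(buffer.split(END_MARKER)[0].strip())
```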
8. Embedding + retrieval instead of long context — 50-80% reduction
A 100K-token document in every prompt is expensive. Embed it once, retrieve only the relevant 3,000 tokens per query.
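A minimal sketch of the swap, assuming OpenAI embeddings and simple cosine similarity; a vector database fills the same role at larger scale.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts):
    out = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in out.data])

# One-time cost: split the large document into chunks and embed each chunk.
chunks = ["<chunk 1, ~500 tokens>", "<chunk 2, ~500 tokens>", "<chunk 3, ~500 tokens>"]
chunk_vecs = embed(chunks)

def retrieve(question: str, k: int = 5) -> list[str]:
    """Per query: embed the question and return only the k most similar chunks."""
    q = embed([question])[0]
    sims = chunk_vecs @ q / (np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q))
    return [chunks[i] for i in np.argsort(sims)[::-1][:k]]
```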
9. Provider arbitrage — 0-50% reduction
When pricing changes (and it changes often), test the cheapest equivalent model. GPT-4o mini for short tasks, Gemini Flash for vision, DeepSeek-V3 for code. Refresh quarterly.
10. Negotiate with sales — 10-30% reduction at scale
Above $50K/month committed spend, both OpenAI and Anthropic will negotiate. Volume discounts, batch SLAs, dedicated capacity. Often worth a quarterly call.
Stack the savings
These compose multiplicatively. A well-optimized RAG pipeline applying #1, #2, #3, and #4 together can run at 5-10% of the naïve implementation's cost, a 90-95% reduction without touching feature scope.
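The arithmetic, with assumed per-lever savings, each applied to whatever cost remains after the previous one:

```python
# Illustrative figures only; plug in your own measured reductions.
caching, routing, batching, shorter_output = 0.60, 0.50, 0.50, 0.30
remaining = (1 - caching) * (1 - routing) * (1 - batching) * (1 - shorter_output)
print(f"{remaining:.0%} of the naive cost remains")   # -> 7% of the naive cost remains
```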
Use Case
Apply these when the LLM bill becomes a board-level discussion, when planning a Series B investor deck (margin matters), or when a single feature's COGS exceeds its ARR.
Try It — Prompt Token Cost Calculator
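If the interactive calculator isn't handy, the same estimate is a few lines of Python. The prices below are USD per 1M tokens as of writing; treat them as placeholders and substitute your provider's current rate card.

```python
PRICES = {"gpt-4o-mini": (0.15, 0.60), "gpt-4o": (2.50, 10.00)}  # (input, output) USD per 1M tokens

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    in_rate, out_rate = PRICES[model]
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

# Example: a 3,000-token prompt with a 500-token answer.
print(f"${request_cost('gpt-4o-mini', 3_000, 500):.5f} per call")   # -> $0.00075 per call
print(f"${request_cost('gpt-4o', 3_000, 500):.5f} per call")        # -> $0.01250 per call
```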
Related Topics
- Claude Prompt Caching: 80% Bill Reduction in One Setting (Caching & long context)
- Batch Processing: 50% Off via OpenAI / Anthropic Batch APIs (Operational)
- Monthly Budget Estimation: Build a 30-Day Forecast in 5 Minutes (Operational)
- RAG Pipeline Cost: Embedding + Retrieval + Generation (Workload patterns)
- Long-Context Costs: What 128K Tokens Actually Cost Per Call (Caching & long context)