RAG Pipeline Cost: Embedding + Retrieval + Generation
A typical RAG query costs ~$0.000001 in embedding + ~$0.0125 in generation ≈ $0.0126 total. The expensive step is generation, not retrieval — design accordingly.
Detailed Explanation
The Three-Step Cost Breakdown
Per-query cost of a typical RAG pipeline:
- Query embedding — embed the user question (~50 tokens) → 50/1M × $0.02 = $0.000001
- Vector search — vector DB cost (Pinecone, Qdrant, etc.), not LLM cost → ~$0.00005 depending on plan
- LLM generation — pack retrieved chunks (~3,000 tokens) + answer (~500 tokens) on GPT-4o:
- Input: 3,000/1M × $2.50 = $0.0075
- Output: 500/1M × $10 = $0.005
- Total: $0.0125
Per-query total: ~$0.0126, dominated by the generation step (99% of cost).
Where money actually leaks
Three patterns inflate the bill significantly:
Top-k too large — retrieving k=20 chunks instead of k=5 multiplies the input portion of the bill by 4x. Run an evaluation: does k=5 actually beat k=20 on your benchmark? Often it does, because the LLM gets confused by irrelevant context.
No context compression — passing 3,000 tokens of retrieved context per query is normal; 15,000 tokens (because each chunk is huge) is not. Cap chunk size at 500 tokens and overlap at 50.
Re-embedding on every deployment — embedding the corpus is a one-time cost. Don't re-embed it in your CI/CD pipeline; set up incremental updates so only changed chunks are re-embedded.
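The incremental-update idea in the last bullet can be sketched with content hashing: keep a manifest of chunk hashes and only send changed chunks to the embedding API. This is a hypothetical sketch — `chunks_to_embed`, the manifest filename, and the chunk-id scheme are all illustrative, not a real library API.

```python
# Hypothetical incremental-embedding helper: returns only the chunk IDs whose
# text changed since the last run, so unchanged chunks are never re-embedded.
import hashlib
import json

def chunks_to_embed(chunks, manifest_path="embed_manifest.json"):
    """chunks: {chunk_id: text}. Persists hashes in a JSON manifest."""
    try:
        with open(manifest_path) as f:
            seen = json.load(f)
    except FileNotFoundError:
        seen = {}
    todo = []
    for chunk_id, text in chunks.items():
        digest = hashlib.sha256(text.encode()).hexdigest()
        if seen.get(chunk_id) != digest:
            todo.append(chunk_id)   # new or modified chunk
            seen[chunk_id] = digest
    with open(manifest_path, "w") as f:
        json.dump(seen, f)
    return todo
```

On the first run everything is embedded once; subsequent deployments embed only the diff, which is what keeps corpus embedding a one-time cost in practice.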
Caching the system prompt
Most RAG products have a constant system prompt + few-shot examples (~3,000 tokens) followed by per-query retrieved chunks (~3,000 tokens) + question (~50 tokens). The first 3,000 tokens are perfect for prompt caching:
- Without cache: 6,050 input tokens × $2.50/1M = $0.0151
- With cache (assuming a hit): 3,000 cached tokens × $1.25/1M + 3,050 fresh tokens × $2.50/1M = $0.0038 + $0.0076 ≈ $0.0114
A 25% reduction with one configuration change.
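The caching arithmetic above, as a two-line function. The $1.25/1M cached-input rate is the figure assumed in this article; confirm your provider's current discount before relying on it.

```python
# Input-token cost with and without prompt caching (GPT-4o rates from the text).
FRESH_PRICE = 2.50 / 1_000_000   # normal input, $ per token
CACHED_PRICE = 1.25 / 1_000_000  # cached input, $ per token (assumed rate)

def input_cost(cached_tokens, fresh_tokens, cache_hit=True):
    if cache_hit:
        return cached_tokens * CACHED_PRICE + fresh_tokens * FRESH_PRICE
    return (cached_tokens + fresh_tokens) * FRESH_PRICE

no_cache = input_cost(3_000, 3_050, cache_hit=False)  # 0.015125
with_cache = input_cost(3_000, 3_050)                 # 0.011375
print(f"savings: {1 - with_cache / no_cache:.0%}")    # savings: 25%
```

The savings scale with the cached fraction of the prompt: the bigger your constant system prompt relative to the per-query chunks, the closer you get to the 50% cached-input discount.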
Choosing the generation model
For most RAG flows, GPT-4o mini ($0.15 / $0.60 per 1M tokens) comes close to GPT-4o quality on retrieved-context generation. Drop down a tier and re-evaluate — the savings are roughly 17x on both input and output.
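A quick per-query comparison of the two tiers on the same workload (3,000 input + 500 output tokens), using the prices quoted in this article:

```python
# Generation cost per query for two model tiers (prices per 1M tokens).
def gen_cost(in_price, out_price, in_tokens=3_000, out_tokens=500):
    return (in_tokens * in_price + out_tokens * out_price) / 1_000_000

gpt4o = gen_cost(2.50, 10.00)  # $0.0125 per query
mini = gen_cost(0.15, 0.60)    # $0.00075 per query
print(round(gpt4o / mini, 1))  # 16.7
```

At ~16.7x cheaper per query, the mini tier pays for a lot of evaluation runs before it costs you anything in quality.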
Use Case
Use when designing or tuning a retrieval-augmented application: customer support bot with knowledge base, code documentation Q&A, internal wiki search, contract analysis.
Related Topics
- Embedding Costs: text-embedding-3-small vs Cohere vs Voyage (Model comparison)
- Claude Prompt Caching: 80% Bill Reduction in One Setting (Caching & long context)
- Long-Context Costs: What 128K Tokens Actually Cost Per Call (Caching & long context)
- Agent Loops: Why a 'Simple' Task Costs 50K Tokens (Caching & long context)
- Monthly Budget Estimation: Build a 30-Day Forecast in 5 Minutes (Operational)