RAG Pipeline Cost: Embedding + Retrieval + Generation

A typical RAG query costs about $0.000001 in embedding + $0.00005 in vector search + $0.0125 in generation ≈ $0.0126. The expensive step is generation, not retrieval — design accordingly.

Detailed Explanation

The Three-Step Cost Breakdown

Per-query cost of a typical RAG pipeline:

  1. Query embedding — embed the user question (~50 tokens) → 50/1M × $0.02 = $0.000001
  2. Vector search — vector DB cost (Pinecone, Qdrant, etc.), not LLM cost → ~$0.00005 depending on plan
  3. LLM generation — pack retrieved chunks (~3,000 tokens) + answer (~500 tokens) on GPT-4o:
    • Input: 3,000/1M × $2.50 = $0.0075
    • Output: 500/1M × $10 = $0.005
    • Total: $0.0125

Per-query total: ~$0.0126, dominated by the generation step (99% of cost).
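The three-step breakdown above can be sketched as a quick calculation. Prices ($/1M tokens) and token counts are the ones assumed in the text: $0.02 embedding, $2.50 GPT-4o input, $10 output, and a flat ~$0.00005 vector-search charge.

```python
EMBED_PRICE = 0.02    # $/1M tokens, embedding model
INPUT_PRICE = 2.50    # $/1M tokens, GPT-4o input
OUTPUT_PRICE = 10.00  # $/1M tokens, GPT-4o output

def rag_query_cost(query_tokens=50, context_tokens=3_000,
                   answer_tokens=500, vector_search=0.00005):
    """Return (embedding, search, generation, total) cost in dollars."""
    embed = query_tokens / 1e6 * EMBED_PRICE
    gen = (context_tokens / 1e6 * INPUT_PRICE
           + answer_tokens / 1e6 * OUTPUT_PRICE)
    return embed, vector_search, gen, embed + vector_search + gen

embed, search, gen, total = rag_query_cost()
print(f"embedding ${embed:.6f}, search ${search:.5f}, "
      f"generation ${gen:.4f}, total ${total:.4f}")
```

Running it confirms generation is over 99% of the per-query total.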

Where money actually leaks

Three patterns inflate the bill significantly:

  • Top-k too large — retrieving k=20 chunks instead of k=5 multiplies the input portion of the bill by 4x. Run an evaluation: does k=5 actually beat k=20 on your benchmark? Often it does, because the LLM gets confused by irrelevant context.

  • No context compression — passing 3,000 tokens of retrieved context per query is normal; 15,000 tokens (because each chunk is huge) is not. Cap chunk size at 500 tokens and overlap at 50.

  • Re-embedding on every deployment — embedding the corpus is a one-time cost. Don't include it in your CI/CD pipeline; set up incremental updates that re-embed only new or changed documents.
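The top-k and chunk-size levers above compound multiplicatively on the input bill. A minimal sketch, using the GPT-4o input price from the breakdown (k values and chunk sizes are illustrative):

```python
INPUT_PRICE = 2.50  # $/1M tokens, GPT-4o input

def context_cost(k, chunk_tokens):
    """Per-query input cost for k retrieved chunks of chunk_tokens each."""
    return k * chunk_tokens / 1e6 * INPUT_PRICE

lean = context_cost(k=5, chunk_tokens=500)      # capped chunks, small k
bloated = context_cost(k=20, chunk_tokens=500)  # same chunks, 4x the k
print(f"k=5: ${lean:.5f}/query, k=20: ${bloated:.5f}/query "
      f"({bloated / lean:.0f}x)")
```

Going from k=5 to k=20 quadruples the retrieved-context cost; letting chunks balloon from 500 to 2,500 tokens would multiply it again.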

Caching the system prompt

Most RAG products have a constant system prompt + few-shot examples (~3,000 tokens) followed by per-query retrieved chunks (~3,000 tokens) + question (~50 tokens). The first 3,000 tokens are perfect for prompt caching:

  • Without cache: 6,050 input tokens × $2.50/1M = $0.0151
  • With cache (assuming a hit): 3,000 cached-read × $1.25/1M + 3,050 fresh × $2.50/1M ≈ $0.0114

A 25% reduction with one configuration change.
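The cache arithmetic above, as a sketch. The $1.25/1M cached-read rate is the assumed half-price GPT-4o cached-input tier; the fresh rate is the standard $2.50/1M input price.

```python
INPUT_PRICE = 2.50   # $/1M tokens, fresh input
CACHED_PRICE = 1.25  # $/1M tokens, cached-prefix read

def query_cost(cached_tokens, fresh_tokens, cache_hit):
    """Input cost for one query, with or without a prompt-cache hit."""
    if cache_hit:
        return (cached_tokens / 1e6 * CACHED_PRICE
                + fresh_tokens / 1e6 * INPUT_PRICE)
    return (cached_tokens + fresh_tokens) / 1e6 * INPUT_PRICE

miss = query_cost(3_000, 3_050, cache_hit=False)  # ≈ $0.0151
hit = query_cost(3_000, 3_050, cache_hit=True)    # ≈ $0.0114
print(f"miss ${miss:.4f}, hit ${hit:.4f}, saved {1 - hit / miss:.0%}")
```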

Choosing the generation model

For most RAG flows, GPT-4o mini ($0.15 / $0.60 per 1M tokens) matches GPT-4o quality on retrieved-context generation. Drop down a tier and re-evaluate — the per-token savings are roughly 17x.
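The tier comparison on the same 3,000-in / 500-out query from the breakdown, using the prices quoted in this section:

```python
def gen_cost(inp, out, in_price, out_price):
    """Generation cost in dollars for inp input / out output tokens."""
    return inp / 1e6 * in_price + out / 1e6 * out_price

gpt4o = gen_cost(3_000, 500, 2.50, 10.00)  # ≈ $0.0125
mini = gen_cost(3_000, 500, 0.15, 0.60)    # ≈ $0.00075
print(f"GPT-4o ${gpt4o:.4f}, mini ${mini:.5f}, ratio {gpt4o / mini:.1f}x")
```

The ratio works out to ~16.7x, which is where the "17x" figure comes from.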

Use Case

Use when designing or tuning a retrieval-augmented application: customer support bot with knowledge base, code documentation Q&A, internal wiki search, contract analysis.
