Summarization Cost: Map-Reduce vs Single-Call vs Streaming
A 100,000-token document summarized to 1,000 tokens costs $0.26 as a single GPT-4o call, or about $0.04 with map-reduce (GPT-4o mini for the map step, GPT-4o for the reduce). The right architecture matters.
Detailed Explanation
Three Architectures
Cost of summarizing a 100K-token document to a 1K-token summary under three architectures (GPT-4o: $2.50/$10 per 1M input/output tokens; GPT-4o mini: $0.15/$0.60):
1. Single-call (entire doc → summary)
- Input: 100K × $2.50/1M = $0.250
- Output: 1K × $10/1M = $0.010
- Total: $0.260
Constraint: the document must fit in the model's context window. GPT-4o handles 100K comfortably (128K limit), as does Claude (200K limit). Larger documents must be chunked.
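The single-call arithmetic above can be sketched as a small helper (prices hard-coded from the GPT-4o figures in this section; the function name is illustrative):

```python
# Single-call summarization cost, using the GPT-4o prices quoted above:
# $2.50 per 1M input tokens, $10 per 1M output tokens.
def single_call_cost(input_tokens: int, output_tokens: int,
                     in_price_per_m: float = 2.50,
                     out_price_per_m: float = 10.0) -> float:
    """Return the API cost in dollars for one summarization call."""
    return (input_tokens * in_price_per_m
            + output_tokens * out_price_per_m) / 1_000_000

print(single_call_cost(100_000, 1_000))  # 0.26
```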
2. Map-reduce (chunk → per-chunk summary → final summary)
Split into 20 chunks of 5K tokens each, summarize each to 200 tokens, then summarize the 20 summaries:
- Map step: 20 × (5K input + 200 output) on GPT-4o mini:
- Input: 100K × $0.15/1M = $0.015
- Output: 4K × $0.60/1M = $0.0024
- Reduce step: 1 × (4K input + 1K output) on GPT-4o:
- Input: 4K × $2.50/1M = $0.010
- Output: 1K × $10/1M = $0.010
- Total: $0.037 (~7x cheaper than the single-call's $0.26).
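The map-reduce arithmetic can be reproduced in a few lines (GPT-4o mini for the map step, GPT-4o for the reduce, prices per 1M tokens as quoted in this section; names and defaults are illustrative):

```python
# Prices per 1M tokens, as quoted above.
MINI_IN, MINI_OUT = 0.15, 0.60    # GPT-4o mini (map step)
GPT4O_IN, GPT4O_OUT = 2.50, 10.0  # GPT-4o (reduce step)

def map_reduce_cost(doc_tokens: int = 100_000, chunk: int = 5_000,
                    chunk_summary: int = 200, final_summary: int = 1_000) -> float:
    """Cost in dollars: summarize each chunk on mini, then reduce on GPT-4o."""
    n_chunks = doc_tokens // chunk
    # Map: the whole document goes in, one short summary per chunk comes out.
    map_cost = (doc_tokens * MINI_IN + n_chunks * chunk_summary * MINI_OUT) / 1e6
    # Reduce: the concatenated chunk summaries go in, the final summary comes out.
    reduce_in = n_chunks * chunk_summary
    reduce_cost = (reduce_in * GPT4O_IN + final_summary * GPT4O_OUT) / 1e6
    return map_cost + reduce_cost

print(round(map_reduce_cost(), 4))  # 0.0374
```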
3. Streaming with rolling window
Process the document sequentially, maintaining a 2,000-token rolling summary that gets updated with each new 5,000-token chunk:
- 20 chunks × (2K rolling + 5K new + 2K updated rolling) on GPT-4o mini:
- Input: 20 × 7K × $0.15/1M = $0.021
- Output: 20 × 2K × $0.60/1M = $0.024
- Total: $0.045
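The rolling-window arithmetic works out the same way: each step feeds the current 2K-token rolling summary plus a new 5K chunk into GPT-4o mini and emits an updated 2K summary. A minimal sketch (names and defaults are illustrative):

```python
def streaming_cost(doc_tokens: int = 100_000, chunk: int = 5_000,
                   rolling: int = 2_000,
                   in_price: float = 0.15, out_price: float = 0.60) -> float:
    """Cost in dollars for rolling-window summarization on GPT-4o mini."""
    n_chunks = doc_tokens // chunk
    # Each call reads the rolling summary plus one new chunk...
    input_tokens = n_chunks * (rolling + chunk)
    # ...and writes a fresh rolling summary.
    output_tokens = n_chunks * rolling
    return (input_tokens * in_price + output_tokens * out_price) / 1e6

print(round(streaming_cost(), 4))  # 0.045
```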
Quality trade-offs
- Single-call: best quality, captures cross-document patterns the model can see at once.
- Map-reduce: ~85% of single-call quality, parallelizable (do all 20 map calls concurrently).
- Streaming: ~75% of single-call quality, useful when documents arrive incrementally.
Decision matrix
| Document size | Best architecture | Notes |
|---|---|---|
| < 50K tokens | Single-call | Cost is a rounding error |
| 50K-500K | Map-reduce | Best cost/quality ratio |
| > 500K | Streaming | Avoid context limits |
| Live streams | Streaming | Process as it arrives |
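The decision matrix above can be expressed as a tiny routing function (thresholds taken from the table; the function name is illustrative):

```python
def pick_architecture(doc_tokens: int, live_stream: bool = False) -> str:
    """Map document size to the summarization architecture from the table."""
    if live_stream:
        return "streaming"    # process as it arrives
    if doc_tokens < 50_000:
        return "single-call"  # cost is a rounding error
    if doc_tokens <= 500_000:
        return "map-reduce"   # best cost/quality ratio
    return "streaming"        # avoid context limits

print(pick_architecture(100_000))  # map-reduce
```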
Use Case
Use when designing a summarization feature: document analysis, meeting transcript summary, multi-document research synthesis, news aggregation, customer feedback distillation.
Try It — Prompt Token Cost Calculator
Related Topics
Long-Context Costs: What 128K Tokens Actually Cost Per Call
Caching & long context
RAG Pipeline Cost: Embedding + Retrieval + Generation
Workload patterns
Translation Task Cost: GPT-4o vs DeepL vs Google Translate
Workload patterns
Batch Processing: 50% Off via OpenAI / Anthropic Batch APIs
Operational
Cost Optimization Strategies: 10 Techniques to Cut Your LLM Bill
Operational