Vision Costs: GPT-4o, Claude, Gemini Image Pricing Compared
GPT-4o charges per image based on resolution and detail mode: a 1024x1024 image in high-detail mode is ~765 tokens. Claude charges ~1,600 tokens for a typical image. Gemini counts ~258 tokens per image, and Gemini 2.5 Flash is the cheapest option in dollar terms.
Detailed Explanation
Image Tokenization Across Providers
Different providers tokenize images very differently. A typical 1024x1024 photograph:
| Provider | Tokens per image (1024x1024) | Cost per 100 images (input) |
|---|---|---|
| GPT-4o (high detail) | ~765 | $0.19 |
| GPT-4o (low detail) | 85 (fixed) | $0.02 |
| Claude Opus 4.7 (auto) | ~1,600 | $2.40 |
| Claude Sonnet 4.6 (auto) | ~1,600 | $0.48 |
| Gemini 2.5 Pro | ~258 | $0.032 |
| Gemini 2.5 Flash | ~258 | $0.0077 |
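The cost column is tokens per image × 100 × the input price per token. A minimal sketch of that arithmetic follows. The per-million-token prices are assumptions back-solved from the table's own figures rather than quoted from any price list, and `cost_per_100_images` is a hypothetical helper name.

```python
# Input prices in USD per 1M tokens.
# ASSUMPTION: back-solved from the table (cost / tokens), not an official price list.
PRICE_PER_MTOK = {
    "GPT-4o": 2.50,
    "Claude Opus 4.7": 15.00,
    "Claude Sonnet 4.6": 3.00,
    "Gemini 2.5 Pro": 1.25,
    "Gemini 2.5 Flash": 0.30,
}


def cost_per_100_images(tokens_per_image: int, model: str) -> float:
    """Input cost in USD for 100 images at the given per-image token count."""
    return 100 * tokens_per_image * PRICE_PER_MTOK[model] / 1_000_000


if __name__ == "__main__":
    rows = [
        ("GPT-4o", 765),
        ("GPT-4o", 85),
        ("Claude Opus 4.7", 1600),
        ("Claude Sonnet 4.6", 1600),
        ("Gemini 2.5 Pro", 258),
        ("Gemini 2.5 Flash", 258),
    ]
    for model, tokens in rows:
        print(f"{model:18s} {tokens:5d} tokens  ${cost_per_100_images(tokens, model):.4f}")
```

Swapping in current list prices and your own measured token counts turns this into a quick sanity check before committing to a provider.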
How GPT-4o tokenizes images
GPT-4o uses a tile-based scheme:
- Low detail: fixed 85 tokens regardless of resolution.
- High detail: 85 base tokens + 170 per 512x512 tile. A 1024x1024 image needs 4 tiles → 85 + 4×170 = 765 tokens.
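Larger inputs do not keep adding tiles. Per OpenAI's published sizing rules, a high-detail image is first scaled to fit within 2048x2048 and then scaled down so its shortest side is 768 pixels. A 4096x4096 image is therefore reduced to 768x768 and billed as 4 tiles: 85 + 4×170 = 765 tokens, about $0.002 of input on GPT-4o. Elongated images cost more: a 2048x1024 image becomes 1536x768, which is 6 tiles and 1,105 tokens. Use the smallest resolution and detail level that lets the model accomplish the task; switching to low detail is what actually cuts the token count.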
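The tile scheme can be sketched as a small function. This assumes the resizing steps OpenAI documents for high-detail images (fit within 2048x2048, then scale down so the shortest side is 768 pixels) and assumes small images are never upscaled. `gpt4o_image_tokens` is a hypothetical helper, not part of any SDK.

```python
import math

BASE_TOKENS = 85   # flat cost, also the entire cost in low-detail mode
TILE_TOKENS = 170  # cost per 512x512 tile in high-detail mode


def gpt4o_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate GPT-4o input tokens for one image."""
    if detail == "low":
        return BASE_TOKENS

    # Step 1: scale down to fit within a 2048x2048 square (never upscale).
    scale = min(1.0, 2048 / max(width, height))
    w, h = round(width * scale), round(height * scale)

    # Step 2: scale down so the shortest side is at most 768 px.
    # ASSUMPTION: images already smaller than this are left as-is.
    scale = min(1.0, 768 / min(w, h))
    w, h = round(w * scale), round(h * scale)

    # Step 3: count the 512x512 tiles needed to cover the result.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return BASE_TOKENS + TILE_TOKENS * tiles


if __name__ == "__main__":
    print(gpt4o_image_tokens(1024, 1024))         # 765
    print(gpt4o_image_tokens(1024, 1024, "low"))  # 85
```

The estimate is only as good as the assumed resizing rules, so compare it against the usage figures the API returns for a few real requests.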
How Claude tokenizes images
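Claude resizes any image to fit within 1568x1568 and then tokenizes at roughly (width × height) / 750 tokens. A 1024x1024 image ≈ 1,398 tokens. A wide 1920x1080 image is first downscaled to 1568x882, giving ≈ 1,844 tokens rather than the ≈ 2,765 its original dimensions would suggest. The ~1,600 figure in the table above is a round number for a typical resized image, not the exact result for 1024x1024.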
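A minimal sketch of that estimate, assuming the only resizing step is an aspect-preserving downscale so the long edge is at most 1568 pixels. `claude_image_tokens` is a hypothetical helper, and the result is an approximation, not a billing figure.

```python
MAX_EDGE = 1568          # long-edge limit after resizing
PIXELS_PER_TOKEN = 750   # approximate pixels covered by one token


def claude_image_tokens(width: int, height: int) -> int:
    """Approximate Claude input tokens for one image: (w * h) / 750 after resizing."""
    # Downscale (never upscale) so the long edge fits within MAX_EDGE.
    scale = min(1.0, MAX_EDGE / max(width, height))
    w, h = round(width * scale), round(height * scale)
    return round(w * h / PIXELS_PER_TOKEN)


if __name__ == "__main__":
    print(claude_image_tokens(1024, 1024))  # 1398
```

Because the cost scales with pixel area, halving both dimensions before upload cuts the estimate to roughly a quarter.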
How Gemini tokenizes images
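Gemini uses a fixed 258 tokens per image for images up to 384×384 (after resizing). Larger images are split into tiles of 768×768, each tile costing 258 tokens. Under that rule a 1024x1024 image can span more than one tile, so treat ~258 as the per-tile cost and check the token count the API reports for your actual image sizes.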
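The rule above can be sketched as follows. The text does not spell out how larger images are cropped and scaled into tiles, so this sketch assumes a plain grid of ceil(width / 768) × ceil(height / 768) tiles; that tiling formula and the helper name `gemini_image_tokens` are assumptions for illustration only.

```python
import math

TOKENS_PER_UNIT = 258  # cost of one small image or one tile
SMALL_EDGE = 384       # images up to 384x384 are billed as a single unit
TILE_EDGE = 768        # larger images are split into 768x768 tiles


def gemini_image_tokens(width: int, height: int) -> int:
    """Rough Gemini input-token estimate for one image."""
    if width <= SMALL_EDGE and height <= SMALL_EDGE:
        return TOKENS_PER_UNIT
    # ASSUMPTION: simple grid tiling; the provider's real crop/scale step may differ.
    tiles = math.ceil(width / TILE_EDGE) * math.ceil(height / TILE_EDGE)
    return TOKENS_PER_UNIT * tiles


if __name__ == "__main__":
    print(gemini_image_tokens(300, 300))    # 258
    print(gemini_image_tokens(1536, 768))   # 516
```

Treat the output as an order-of-magnitude estimate and confirm with the token counts the API reports.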
Practical guidance
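- Receipt OCR / document parsing: Try GPT-4o low-detail (85 tokens flat) first. Low-detail mode works from a heavily downsampled copy of the image, so large printed text usually survives but fine print may not. Test on your own documents and switch to high detail if fields are misread.
- Image classification: Gemini Flash is dramatically cheaper. Confirm quality on a benchmark first.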
- Detailed visual reasoning (charts, diagrams, screenshots): GPT-4o high-detail or Claude. The extra tokens buy actual capability.
- Bulk image labeling: Gemini Flash by a wide margin. At ~$0.00008 per image you can label 1M images for $80.
The output side
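Vision pricing discussions usually focus on input. The output (the model's text response describing the image) is billed at the regular text output rate. A 200-token caption on GPT-4o is $0.002, which is about the same as the input cost of one high-detail 1024x1024 image (~$0.0019) and roughly ten times the cost of a low-detail image (~$0.0002). In bulk pipelines, keep responses short so output does not dominate the bill.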
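To see how the two sides of a request compare, here is a minimal sketch that prices one image plus its caption. The GPT-4o rates are assumptions implied by the figures in this article ($2.50 per 1M input tokens, $10 per 1M output tokens), and `request_cost` is a hypothetical helper.

```python
# ASSUMPTION: GPT-4o rates implied by this article's figures, in USD per 1M tokens.
INPUT_USD_PER_MTOK = 2.50
OUTPUT_USD_PER_MTOK = 10.00


def request_cost(image_tokens: int, output_tokens: int) -> dict:
    """Split the cost of one vision request into input and output parts (USD)."""
    input_cost = image_tokens * INPUT_USD_PER_MTOK / 1_000_000
    output_cost = output_tokens * OUTPUT_USD_PER_MTOK / 1_000_000
    return {
        "input": input_cost,
        "output": output_cost,
        "total": input_cost + output_cost,
    }


if __name__ == "__main__":
    for label, image_tokens in [("high detail", 765), ("low detail", 85)]:
        cost = request_cost(image_tokens, output_tokens=200)
        print(f"{label}: input ${cost['input']:.5f}, output ${cost['output']:.5f}")
```

Text prompt tokens sent alongside the image are left out here; add them to `image_tokens` for a fuller estimate.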
Use Case
Use this comparison when designing image-input features such as receipt OCR, product image tagging, screenshot debugging, accessibility alt-text generation, and content moderation.