OpenAI API Rate Limits and Token Budgeting
Understand OpenAI API rate limits for GPT-4, GPT-3.5, and embedding models. Learn about tokens per minute, requests per minute, and tier-based limits.
Detailed Explanation
OpenAI API Rate Limits
OpenAI uses a dual rate limit system: limits are enforced on both requests per minute (RPM) and tokens per minute (TPM) simultaneously, and you must stay within both. Some tiers also carry a requests per day (RPD) cap.
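When either limit is exceeded, the API responds with an HTTP 429 error, and a common client-side response is exponential backoff with jitter. A minimal stdlib sketch, where `RateLimitError` is a stand-in for whatever 429 exception your client library raises:

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for the HTTP 429 error your client library raises."""

def with_backoff(call, max_retries=5, base_delay=0.5, sleep=time.sleep):
    """Retry `call` on rate-limit errors, doubling the wait each attempt."""
    for attempt in range(max_retries):
        try:
            return call()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # budget exhausted; surface the error
            # Exponential backoff with a little jitter to avoid thundering herds
            sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))
```

The `sleep` parameter is injectable so the retry logic can be tested without real delays.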
Rate Limits by Tier (GPT-4o)
| Tier | RPM | TPM | RPD |
|---|---|---|---|
| Free | 500 | 30,000 | 500 |
| Tier 1 | 500 | 30,000 | 10,000 |
| Tier 2 | 5,000 | 450,000 | — |
| Tier 3 | 5,000 | 800,000 | — |
| Tier 4 | 10,000 | 2,000,000 | — |
| Tier 5 | 10,000 | 10,000,000 | — |
Token Budgeting
Unlike most APIs, OpenAI's limits are primarily token-based. A single request might use anywhere from 100 to 100,000 tokens depending on the prompt and response length.
`Effective RPM = min(RPM limit, TPM limit / avg_tokens_per_request)`
For example, at Tier 1 with GPT-4o:
- RPM limit: 500
- TPM limit: 30,000
- If average request uses 1,000 tokens: effective RPM = min(500, 30) = 30 RPM
- If average request uses 100 tokens: effective RPM = min(500, 300) = 300 RPM
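The calculation above can be sketched in a few lines, using the Tier 1 GPT-4o numbers from the table:

```python
def effective_rpm(rpm_limit: int, tpm_limit: int, avg_tokens_per_request: int) -> int:
    """Throughput is capped by whichever limit binds first."""
    return min(rpm_limit, tpm_limit // avg_tokens_per_request)

# Tier 1 GPT-4o limits
print(effective_rpm(500, 30_000, 1_000))  # 30  (TPM-bound)
print(effective_rpm(500, 30_000, 100))    # 300 (still TPM-bound)
```

Note that with 1,000-token requests the RPM limit of 500 is irrelevant: the token budget makes it impossible to reach.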
Optimization Strategies
- Batch requests where possible to reduce per-request overhead
- Limit max_tokens in your requests to prevent runaway token usage
- Use streaming to start processing responses before completion
- Implement token counting client-side before sending requests (use the tiktoken library)
- Queue and throttle requests based on estimated token costs
Use Case
You are building a customer support chatbot using GPT-4o at Tier 2. Each customer interaction averages 2,000 tokens (prompt + response). You need to calculate how many concurrent chat sessions you can support and what happens during peak load.
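One way to work through this use case, under the assumption (not stated above) that each session sends one 2,000-token request every 30 seconds, i.e. two requests per minute per session:

```python
def max_concurrent_sessions(rpm_limit: int, tpm_limit: int,
                            tokens_per_request: int,
                            requests_per_min_per_session: int) -> int:
    """Concurrent sessions supportable under both RPM and TPM limits."""
    tokens_per_min_per_session = tokens_per_request * requests_per_min_per_session
    by_tpm = tpm_limit // tokens_per_min_per_session   # token-bound ceiling
    by_rpm = rpm_limit // requests_per_min_per_session  # request-bound ceiling
    return min(by_tpm, by_rpm)

# Tier 2 GPT-4o limits from the table, 2,000 tokens per interaction
print(max_concurrent_sessions(5_000, 450_000, 2_000, 2))  # 112 (TPM-bound)
```

Under these assumptions the TPM limit binds long before the RPM limit (112 sessions versus a request-bound ceiling of 2,500), so during peak load the excess sessions must be queued or degraded to a smaller model, and the queue drains at the TPM-bound rate.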