OpenAI API Rate Limits and Token Budgeting

Understand OpenAI API rate limits for GPT-4, GPT-3.5, and embedding models. Learn about tokens per minute, requests per minute, and tier-based limits.

Detailed Explanation

OpenAI API Rate Limits

OpenAI uses a dual rate limit system: limits are enforced on both requests per minute (RPM) and tokens per minute (TPM) simultaneously. You must stay within both limits.

Rate Limits by Tier (GPT-4o)

Tier      RPM      TPM          RPD
Free      500      30,000       500
Tier 1    500      30,000       10,000
Tier 2    5,000    450,000      n/a
Tier 3    5,000    800,000      n/a
Tier 4    10,000   2,000,000    n/a
Tier 5    10,000   10,000,000   n/a

RPM = requests per minute, TPM = tokens per minute, RPD = requests per day. No RPD value is listed for Tiers 2-5.
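If you plan capacity in code, these limits can be kept in a small lookup table. The sketch below simply transcribes the GPT-4o numbers above; the TIER_LIMITS name and structure are illustrative, not part of the OpenAI SDK.

```python
# GPT-4o rate limits per usage tier, transcribed from the table above.
# None means no value is listed for that tier. Names are illustrative only.
TIER_LIMITS = {
    "free":   {"rpm": 500,    "tpm": 30_000,     "rpd": 500},
    "tier_1": {"rpm": 500,    "tpm": 30_000,     "rpd": 10_000},
    "tier_2": {"rpm": 5_000,  "tpm": 450_000,    "rpd": None},
    "tier_3": {"rpm": 5_000,  "tpm": 800_000,    "rpd": None},
    "tier_4": {"rpm": 10_000, "tpm": 2_000_000,  "rpd": None},
    "tier_5": {"rpm": 10_000, "tpm": 10_000_000, "rpd": None},
}

limits = TIER_LIMITS["tier_2"]
print(limits["rpm"], limits["tpm"])  # 5000 450000
```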

Token Budgeting

Unlike most APIs, OpenAI's limits are primarily token-based rather than request-based. A single request can consume anywhere from roughly 100 to 100,000 tokens depending on prompt and response length.

Effective RPM = min(RPM limit, TPM limit / avg_tokens_per_request)

For example, at Tier 1 with GPT-4o:

  • RPM limit: 500
  • TPM limit: 30,000
  • If average request uses 1,000 tokens: effective RPM = min(500, 30) = 30 RPM
  • If average request uses 100 tokens: effective RPM = min(500, 300) = 300 RPM
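In code, the same calculation might look like the following sketch; the function name is illustrative, and the numbers are the Tier 1 GPT-4o limits from the table above.

```python
# Minimal sketch: effective request throughput when both RPM and TPM limits apply.
def effective_rpm(rpm_limit: int, tpm_limit: int, avg_tokens_per_request: int) -> float:
    """Return the binding requests-per-minute rate given both limits."""
    return min(rpm_limit, tpm_limit / avg_tokens_per_request)

# Tier 1 GPT-4o limits from the table above.
print(effective_rpm(500, 30_000, 1_000))  # 30.0  -> TPM is the bottleneck
print(effective_rpm(500, 30_000, 100))    # 300.0 -> still TPM-bound, but closer to RPM
```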

Optimization Strategies

  1. Batch requests where possible to reduce per-request overhead
  2. Limit max_tokens in your requests to prevent runaway token usage
  3. Use streaming to start processing responses before completion
  4. Implement token counting client-side before sending requests (for example, with the tiktoken library)
  5. Queue and throttle requests based on estimated token costs (strategies 4 and 5 are sketched after this list)
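A minimal sketch of strategies 4 and 5, assuming the Tier 1 GPT-4o limits from the table above. The count_tokens helper and TokenBucketThrottle class are illustrative names, not part of the OpenAI SDK, and resolving an encoding for gpt-4o requires a reasonably recent tiktoken release.

```python
import time
import tiktoken

def count_tokens(text: str, model: str = "gpt-4o") -> int:
    """Count tokens locally so a request's cost can be estimated before sending."""
    encoding = tiktoken.encoding_for_model(model)  # needs a tiktoken version that knows gpt-4o
    return len(encoding.encode(text))

class TokenBucketThrottle:
    """Tracks tokens spent in the current one-minute window and sleeps when the budget is gone."""

    def __init__(self, tpm_limit: int = 30_000):
        self.tpm_limit = tpm_limit
        self.window_start = time.monotonic()
        self.tokens_used = 0

    def acquire(self, estimated_tokens: int) -> None:
        now = time.monotonic()
        # Reset the budget at the start of each new one-minute window.
        if now - self.window_start >= 60:
            self.window_start = now
            self.tokens_used = 0
        # If this request would exceed the TPM budget, wait for the window to roll over.
        if self.tokens_used + estimated_tokens > self.tpm_limit:
            time.sleep(60 - (now - self.window_start))
            self.window_start = time.monotonic()
            self.tokens_used = 0
        self.tokens_used += estimated_tokens

throttle = TokenBucketThrottle(tpm_limit=30_000)
prompt = "Summarize our refund policy for a customer."
# Reserve headroom for the response by adding the max_tokens you plan to request.
estimated = count_tokens(prompt) + 500
throttle.acquire(estimated)
# ... send the request with the OpenAI client here ...
```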

Use Case

You are building a customer support chatbot using GPT-4o at Tier 2. Each customer interaction averages 2,000 tokens (prompt + response). You need to calculate how many concurrent chat sessions you can support and what happens during peak load.
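One possible back-of-the-envelope answer, using the Tier 2 numbers from the table above and the stated 2,000-token average. The assumption that each active session sends about one message per minute is illustrative; substitute your own traffic profile.

```python
# Rough capacity estimate for the support chatbot use case (Tier 2, GPT-4o).
# Limits come from the tier table above; the messages-per-minute figure is an assumption.
RPM_LIMIT = 5_000
TPM_LIMIT = 450_000
AVG_TOKENS_PER_REQUEST = 2_000        # prompt + response, per the use case
MESSAGES_PER_SESSION_PER_MINUTE = 1   # assumed chat pace; adjust for your traffic

effective_rpm = min(RPM_LIMIT, TPM_LIMIT // AVG_TOKENS_PER_REQUEST)
concurrent_sessions = effective_rpm // MESSAGES_PER_SESSION_PER_MINUTE

print(f"Effective requests per minute: {effective_rpm}")         # 225
print(f"Supported concurrent sessions: {concurrent_sessions}")   # 225

# During peak load, requests beyond this rate will start returning HTTP 429 errors;
# queue the overflow and retry with backoff, or move to a higher tier.
```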