Rate Limiting
Definition
Rate limiting tracks request counts per client (identified by API key, IP address, or user ID) within rolling or fixed time windows. When a client exceeds their allocated rate, the gateway returns a 429 Too Many Requests response. Algorithms include token bucket (allows short bursts), leaky bucket (smooths request flow), and sliding window counters. For AI APIs, rate limits are often expressed in requests per minute AND tokens per minute, since token consumption determines actual compute cost.
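Of the algorithms above, the token bucket is the easiest to see in a few lines. The sketch below is a minimal in-memory illustration; the class and parameter names are ours, not from any particular gateway:

```python
import time

class TokenBucket:
    """Allows bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(self, capacity, rate):
        self.capacity = capacity       # maximum burst size
        self.rate = rate               # tokens added per second
        self.tokens = capacity         # bucket starts full
        self.last = time.monotonic()

    def allow(self, cost=1):
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

bucket = TokenBucket(capacity=5, rate=1.0)
results = [bucket.allow() for _ in range(6)]
# A burst of 6 back-to-back requests: the first 5 drain the bucket,
# the 6th arrives before any meaningful refill and is rejected.
```

A leaky bucket is the same idea viewed from the other side: instead of letting bursts spend saved-up tokens, it drains requests at a constant rate, smoothing the flow the paragraph above describes.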
Why It Matters
Rate limiting is critical for AI APIs where each request consumes significant compute resources. Without rate limiting, a single misbehaving client can monopolize GPU capacity, degrading service for all other users. Rate limits enable predictable cost management, protect against prompt injection attacks that generate extremely long outputs, and allow tiered pricing models where higher-paying customers get higher throughput. Rate limiting also acts as a first line of defense against automated abuse and credential stuffing.
How It Works
Rate limit state is stored in a fast key-value store like Redis, shared across all gateway instances for distributed enforcement. Each request increments a counter keyed to the client identifier and time window. If the counter exceeds the limit, the request is rejected with a 429 response including Retry-After headers. Adaptive rate limiting can dynamically adjust limits based on backend health metrics, reducing limits during high load to protect model servers.
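The counter-per-window scheme can be sketched as follows. In production the counts would live in Redis (an `INCR` on the key plus an `EXPIRE` of one window), shared across gateway instances; here a plain dict stands in for that store, and all names and limits are illustrative:

```python
import time

WINDOW = 60     # fixed window length, seconds
LIMIT = 20      # requests allowed per client per window
counters = {}   # stands in for Redis: (client_id, window_start) -> count

def check_rate_limit(client_id, now=None):
    """Return (allowed, headers). The counter key combines client and window."""
    now = time.time() if now is None else now
    window_start = int(now // WINDOW) * WINDOW
    key = (client_id, window_start)
    counters[key] = counters.get(key, 0) + 1   # Redis: INCR key; EXPIRE key WINDOW
    if counters[key] > LIMIT:
        # Tell the client when the current window rolls over.
        retry_after = int(window_start + WINDOW - now) + 1
        return False, {"status": 429, "Retry-After": str(retry_after)}
    return True, {"status": 200}

# 25 requests from one client inside a single window: 20 pass, 5 get 429s.
results = [check_rate_limit("client-a", now=1000.0)[0] for _ in range(25)]
```

A fixed window like this is simple but allows up to 2x the limit across a window boundary; sliding-window counters trade a little extra bookkeeping for smoother enforcement.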
Rate Limiting by Tier (RPM = Requests/min)
[Table: per-tier rate limits for the Free, Pro, Business, and Enterprise plans]
Real-World Example
An LLM API provider offers three tiers: free (20 requests/min, 40,000 tokens/min), pro (200 requests/min, 400,000 tokens/min), and enterprise (2,000 requests/min, 4,000,000 tokens/min). When a free-tier script attempts to send 100 requests per minute, the gateway rejects 80 of them with 429 errors while the 20 within the limit succeed. This prevents the free-tier user from disrupting paid customers during peak hours.
Common Mistakes
- ✕ Rate limiting only on request count, ignoring token count — a single request generating 50,000 tokens is far more expensive than 100 short requests
- ✕ Not providing clear Retry-After headers in 429 responses, causing clients to retry immediately and amplify load
- ✕ Setting rate limits that are too conservative, throttling legitimate high-value customers and damaging trust
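The second mistake has a client-side counterpart: on a 429, a well-behaved client honors Retry-After when it is present and otherwise falls back to exponential backoff with jitter, rather than retrying immediately. A minimal sketch, assuming a hypothetical `send_request` callable that returns a status code, headers, and a body:

```python
import random
import time

def call_with_backoff(send_request, max_attempts=5):
    """Retry on 429, honoring the server's Retry-After hint when present."""
    for attempt in range(max_attempts):
        status, headers, body = send_request()
        if status != 429:
            return status, body
        # Prefer the server's hint; otherwise back off exponentially with jitter
        # so a herd of rejected clients does not retry in lockstep.
        delay = float(headers.get("Retry-After", 2 ** attempt + random.random()))
        time.sleep(delay)
    return status, body
```

This is why the Retry-After header matters: without it, every client invents its own retry schedule, and most invent "immediately".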
Related Terms
API Gateway
An API gateway is a managed entry point that sits in front of AI model serving endpoints, handling authentication, rate limiting, request routing, load balancing, and monitoring for all incoming API traffic.
API Security
API security for AI systems encompasses authentication, authorization, input validation, output filtering, and monitoring controls that protect model APIs from unauthorized access, prompt injection, data extraction, and abuse.
Model Serving
Model serving is the infrastructure that hosts trained ML models and exposes them as APIs, handling prediction requests in production with the latency, throughput, and reliability requirements of real applications.
AI Cost Optimization
AI cost optimization encompasses techniques to reduce the compute, storage, and API expenses of AI systems—through model selection, caching, batching, quantization, and architecture decisions—making AI economically sustainable at scale.
Inference Server
An inference server is specialized software that hosts ML models and handles prediction requests with optimized batching, hardware utilization, and concurrency—outperforming generic web frameworks for AI workloads.