Rate Limiting
Definition
Rate limiting tracks request counts per client (identified by API key, IP address, or user ID) within rolling or fixed time windows. When a client exceeds their allocated rate, the gateway returns a 429 Too Many Requests response. Algorithms include token bucket (allows short bursts), leaky bucket (smooths request flow), and sliding window counters. For AI APIs, rate limits are often expressed in requests per minute AND tokens per minute, since token consumption determines actual compute cost.
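Of the algorithms above, the token bucket is the easiest to see in a few lines. The sketch below is a minimal in-memory illustration; the class and parameter names are ours, not from any particular gateway:

```python
import time

class TokenBucket:
    """Allows bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(self, capacity, rate):
        self.capacity = capacity       # maximum burst size
        self.rate = rate               # tokens added per second
        self.tokens = capacity         # bucket starts full
        self.last = time.monotonic()

    def allow(self, cost=1):
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

bucket = TokenBucket(capacity=5, rate=1.0)
results = [bucket.allow() for _ in range(6)]
# A burst of 6 back-to-back requests: the first 5 drain the bucket,
# the 6th arrives before any meaningful refill and is rejected.
```

A leaky bucket is the same idea viewed from the other side: instead of letting bursts spend saved-up tokens, it drains requests at a constant rate, smoothing the flow the paragraph above describes.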
Why It Matters
Rate limiting is critical for AI APIs where each request consumes significant compute resources. Without rate limiting, a single misbehaving client can monopolize GPU capacity, degrading service for all other users. Rate limits enable predictable cost management, protect against prompt injection attacks that generate extremely long outputs, and allow tiered pricing models where higher-paying customers get higher throughput. Rate limiting also acts as a first line of defense against automated abuse and credential stuffing.
How It Works
Rate limit state is stored in a fast key-value store like Redis, shared across all gateway instances for distributed enforcement. Each request increments a counter keyed to the client identifier and time window. If the counter exceeds the limit, the request is rejected with a 429 response including Retry-After headers. Adaptive rate limiting can dynamically adjust limits based on backend health metrics, reducing limits during high load to protect model servers.
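The counter-per-window scheme can be sketched as follows. In production the counts would live in Redis (an `INCR` on the key plus an `EXPIRE` of one window), shared across gateway instances; here a plain dict stands in for that store, and all names and limits are illustrative:

```python
import time

WINDOW = 60     # fixed window length, seconds
LIMIT = 20      # requests allowed per client per window
counters = {}   # stands in for Redis: (client_id, window_start) -> count

def check_rate_limit(client_id, now=None):
    """Return (allowed, headers). The counter key combines client and window."""
    now = time.time() if now is None else now
    window_start = int(now // WINDOW) * WINDOW
    key = (client_id, window_start)
    counters[key] = counters.get(key, 0) + 1   # Redis: INCR key; EXPIRE key WINDOW
    if counters[key] > LIMIT:
        # Tell the client when the current window rolls over.
        retry_after = int(window_start + WINDOW - now) + 1
        return False, {"status": 429, "Retry-After": str(retry_after)}
    return True, {"status": 200}

# 25 requests from one client inside a single window: 20 pass, 5 get 429s.
results = [check_rate_limit("client-a", now=1000.0)[0] for _ in range(25)]
```

A fixed window like this is simple but allows up to 2x the limit across a window boundary; sliding-window counters trade a little extra bookkeeping for smoother enforcement.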
Rate Limiting by Tier (RPM = Requests/min)
[Table: per-tier rate limits for the Free, Pro, Business, and Enterprise plans]
Real-World Example
An LLM API provider offers three tiers: free (20 requests/min, 40,000 tokens/min), pro (200 requests/min, 400,000 tokens/min), and enterprise (2,000 requests/min, 4,000,000 tokens/min). When a free-tier script attempts to send 100 requests per minute, the gateway rejects 80 of them with 429 errors while the 20 within the limit succeed. This prevents the free-tier user from disrupting paid customers during peak hours.
Common Mistakes
- ✕ Rate limiting only on request count, ignoring token count — a single request generating 50,000 tokens is far more expensive than 100 short requests
- ✕ Not providing clear Retry-After headers in 429 responses, causing clients to retry immediately and amplify load
- ✕ Setting rate limits that are too conservative, throttling legitimate high-value customers and damaging trust
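The second mistake has a client-side counterpart: on a 429, a well-behaved client honors Retry-After when it is present and otherwise falls back to exponential backoff with jitter, rather than retrying immediately. A minimal sketch, assuming a hypothetical `send_request` callable that returns a status code, headers, and a body:

```python
import random
import time

def call_with_backoff(send_request, max_attempts=5):
    """Retry on 429, honoring the server's Retry-After hint when present."""
    for attempt in range(max_attempts):
        status, headers, body = send_request()
        if status != 429:
            return status, body
        # Prefer the server's hint; otherwise back off exponentially with jitter
        # so a herd of rejected clients does not retry in lockstep.
        delay = float(headers.get("Retry-After", 2 ** attempt + random.random()))
        time.sleep(delay)
    return status, body
```

This is why the Retry-After header matters: without it, every client invents its own retry schedule, and most invent "immediately".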
Related Terms
API Gateway
An API gateway is a managed entry point that sits in front of AI model serving endpoints, handling authentication, rate limiting, request routing, load balancing, and monitoring for all incoming API traffic.
API Security
API security for AI systems encompasses authentication, authorization, input validation, output filtering, and monitoring controls that protect model APIs from unauthorized access, prompt injection, data extraction, and abuse.
Model Serving
Model serving is the infrastructure that hosts trained ML models and exposes them as APIs, handling prediction requests in production with the latency, throughput, and reliability requirements of real applications.
AI Cost Optimization
AI cost optimization encompasses techniques to reduce the compute, storage, and API expenses of AI systems—through model selection, caching, batching, quantization, and architecture decisions—making AI economically sustainable at scale.
Inference Server
An inference server is specialized software that hosts ML models and handles prediction requests with optimized batching, hardware utilization, and concurrency—outperforming generic web frameworks for AI workloads.