Inference Throughput
Definition
Throughput and latency are complementary but distinct metrics. Throughput measures how many requests a system can handle per unit time at steady state; latency measures how long each individual request takes. In LLM serving, throughput is often measured in output tokens per second across all concurrent requests. High-throughput systems use continuous batching — dynamically grouping arriving requests into shared computation batches to maximize GPU utilization — and efficient KV cache management to serve many users simultaneously from a single GPU.
Why It Matters
Throughput determines the cost per request for a given AI system. If a GPU can process 100 tokens per second total and each response is 200 tokens, the system handles 0.5 requests per second. Doubling throughput halves cost per request. For high-volume applications like content moderation (millions of items per day) or embedding generation, throughput is the primary infrastructure metric. Throughput also sets the hard ceiling on concurrent user capacity: a system that cannot scale throughput cannot serve more users, no matter how fast individual requests complete.
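The arithmetic above can be checked directly. This is a back-of-envelope sketch; the GPU hourly rate is a hypothetical figure for illustration, not a quoted price:

```python
# Cost-per-request model from the numbers in the text.
GPU_COST_PER_HOUR = 2.00           # assumed $/hour for one GPU (hypothetical)
tokens_per_second = 100            # total output tokens/s across all requests
tokens_per_response = 200

requests_per_second = tokens_per_second / tokens_per_response   # 0.5 req/s
requests_per_hour = requests_per_second * 3600
cost_per_request = GPU_COST_PER_HOUR / requests_per_hour

# Doubling throughput halves cost per request:
cost_if_doubled = GPU_COST_PER_HOUR / (requests_per_hour * 2)

print(f"{requests_per_second} req/s, ${cost_per_request:.4f}/request, "
      f"${cost_if_doubled:.4f}/request at 2x throughput")
```

The same structure works for any serving stack: measure aggregate tokens per second under load, divide by mean response length, and divide hardware cost by the result.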
How It Works
Throughput optimization focuses on maximizing GPU utilization. Continuous batching groups requests arriving at different times into shared computation passes, keeping the GPU work pipeline full. Larger batch sizes increase throughput but may increase individual request latency — the throughput-latency tradeoff is tuned based on the application's SLA. Specialized inference runtimes (vLLM, TensorRT-LLM) implement PagedAttention and other techniques to significantly increase token throughput compared to naive inference implementations.
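The key idea of continuous batching — admitting new requests into the running batch at every decode step, instead of waiting for the whole batch to finish — can be sketched as a toy simulation. This illustrates per-step admission only; it is not vLLM's actual scheduler:

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    rid: int
    tokens_left: int   # output tokens still to generate

def continuous_batching(requests, max_batch=4):
    """Toy simulation: each decode step produces one token per running
    request; finished requests leave and waiting ones join immediately,
    so batch slots are never idle while work is queued."""
    waiting = deque(requests)
    running, completed, steps = [], [], 0
    while waiting or running:
        # Per-step admission is the defining feature of continuous batching.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        steps += 1
        for r in running:
            r.tokens_left -= 1
        completed.extend(r.rid for r in running if r.tokens_left == 0)
        running = [r for r in running if r.tokens_left > 0]
    return steps, completed

reqs = [Request(i, n) for i, n in enumerate([3, 5, 2, 4, 6])]
steps, order = continuous_batching(reqs, max_batch=2)
print(steps, order)  # 20 total tokens in 11 steps with batch slots of 2
```

With static batching, the batch of requests 0 and 1 would run until the longest (5 tokens) finished before admitting anyone else; per-step admission backfills the slot freed by request 0 two steps earlier.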
[Figure: Inference throughput (requests/sec) compared across four configurations — single A100 without batching, single A100 with batch size 8, two A100s with batch size 8, and four A100s with batch size 16.]
Real-World Example
A company deploys a text classification model that needs to process 1 million documents per day for content moderation. Measuring their serving system, they find throughput of 200 requests per second, which translates to 17.3 million requests per day — well above their requirement. This means they can serve their workload with a single server, significantly reducing infrastructure costs compared to their initial estimate of 5 servers.
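The sizing check in this example is a one-liner worth automating for any capacity plan:

```python
# Does the measured throughput cover the daily workload?
measured_rps = 200                  # requests/second from load testing
daily_requirement = 1_000_000       # documents to moderate per day

daily_capacity = measured_rps * 86_400              # seconds in a day
servers_needed = -(-daily_requirement // daily_capacity)  # ceiling division

print(f"capacity: {daily_capacity / 1e6:.2f}M requests/day, "
      f"servers needed: {servers_needed}")
```

At 200 req/s a single server delivers 17.28 million requests per day, so one server covers the 1M/day requirement with headroom — matching the conclusion above.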
Common Mistakes
- ✕ Measuring throughput with short, identical requests that don't represent the real workload distribution — highly variable request lengths dramatically affect real-world throughput
- ✕ Maximizing throughput at the expense of latency SLAs — batching many requests together improves throughput but increases per-request wait time
- ✕ Not testing throughput under sustained load — systems that perform well in short burst tests may degrade once GPU memory or thermal limits are reached over time
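The first and third mistakes can be avoided with a benchmark that runs for a sustained window and draws request lengths from a realistic distribution. A minimal sketch, where `serve` is a hypothetical stand-in you would replace with your real serving client:

```python
import random
import time

def serve(n_tokens: int) -> None:
    """Hypothetical stand-in for a real model call; cost scales with
    output length, as decode time does in an LLM server."""
    time.sleep(n_tokens * 0.0001)

def benchmark(duration_s=5.0, mean_tokens=200):
    """Measure steady-state throughput over a sustained window with
    variable request lengths (exponential spread), not identical bursts."""
    lengths, start = [], time.monotonic()
    while time.monotonic() - start < duration_s:
        n = max(1, int(random.expovariate(1 / mean_tokens)))
        serve(n)
        lengths.append(n)
    elapsed = time.monotonic() - start
    return {
        "requests_per_s": len(lengths) / elapsed,
        "tokens_per_s": sum(lengths) / elapsed,
        "p95_len": sorted(lengths)[int(0.95 * len(lengths))],
    }
```

Run it long enough to hit steady state (minutes, not seconds, for a real GPU server) and report tokens/s alongside requests/s, since the two diverge sharply once request lengths vary.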
Related Terms
Inference Latency
Inference latency is the time between submitting an input to a deployed AI model and receiving the complete output — typically measured in milliseconds for classification models and seconds for large language models — directly impacting user experience and system design.
Model Serving
Model serving is the infrastructure that hosts trained ML models and exposes them as APIs, handling prediction requests in production with the latency, throughput, and reliability requirements of real applications.
Inference Server
An inference server is specialized software that hosts ML models and handles prediction requests with optimized batching, hardware utilization, and concurrency — outperforming generic web frameworks for AI workloads.
Batch Inference
Batch inference is the processing of large groups of input data through a machine learning model in a single scheduled job, rather than in real time, enabling high throughput at lower cost for use cases that do not require immediate responses.
AI Cost Optimization
AI cost optimization encompasses techniques to reduce the compute, storage, and API expenses of AI systems—through model selection, caching, batching, quantization, and architecture decisions—making AI economically sustainable at scale.