Inference Throughput
Definition
Throughput and latency are complementary but distinct metrics. Throughput measures how many requests a system can handle per unit time at steady state; latency measures how long each individual request takes. In LLM serving, throughput is often measured in output tokens per second across all concurrent requests. High-throughput systems use continuous batching — dynamically grouping arriving requests into shared computation batches to maximize GPU utilization — and efficient KV cache management to serve many users simultaneously from a single GPU.
Why It Matters
Throughput determines the cost per request for a given AI system. If a GPU can process 100 tokens per second total and each response is 200 tokens, the system handles 0.5 requests per second. Doubling throughput halves cost per request. For high-volume applications like content moderation (millions of items per day) or embedding generation, throughput is the primary infrastructure metric. Throughput also sets the hard ceiling on concurrent user capacity: a system that cannot scale throughput cannot serve more users, no matter how fast individual requests complete.
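The arithmetic above can be checked directly. This is a back-of-envelope sketch; the GPU hourly rate is a hypothetical figure for illustration, not a quoted price:

```python
# Cost-per-request model from the numbers in the text.
GPU_COST_PER_HOUR = 2.00           # assumed $/hour for one GPU (hypothetical)
tokens_per_second = 100            # total output tokens/s across all requests
tokens_per_response = 200

requests_per_second = tokens_per_second / tokens_per_response   # 0.5 req/s
requests_per_hour = requests_per_second * 3600
cost_per_request = GPU_COST_PER_HOUR / requests_per_hour

# Doubling throughput halves cost per request:
cost_if_doubled = GPU_COST_PER_HOUR / (requests_per_hour * 2)

print(f"{requests_per_second} req/s, ${cost_per_request:.4f}/request, "
      f"${cost_if_doubled:.4f}/request at 2x throughput")
```

The same structure works for any serving stack: measure aggregate tokens per second under load, divide by mean response length, and divide hardware cost by the result.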
How It Works
Throughput optimization focuses on maximizing GPU utilization. Continuous batching groups requests arriving at different times into shared computation passes, keeping the GPU work pipeline full. Larger batch sizes increase throughput but may increase individual request latency — the throughput-latency tradeoff is tuned based on the application's SLA. Specialized inference runtimes (vLLM, TensorRT-LLM) implement PagedAttention and other techniques to significantly increase token throughput compared to naive inference implementations.
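The key idea of continuous batching — admitting new requests into the running batch at every decode step, instead of waiting for the whole batch to finish — can be sketched as a toy simulation. This illustrates per-step admission only; it is not vLLM's actual scheduler:

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    rid: int
    tokens_left: int   # output tokens still to generate

def continuous_batching(requests, max_batch=4):
    """Toy simulation: each decode step produces one token per running
    request; finished requests leave and waiting ones join immediately,
    so batch slots are never idle while work is queued."""
    waiting = deque(requests)
    running, completed, steps = [], [], 0
    while waiting or running:
        # Per-step admission is the defining feature of continuous batching.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        steps += 1
        for r in running:
            r.tokens_left -= 1
        completed.extend(r.rid for r in running if r.tokens_left == 0)
        running = [r for r in running if r.tokens_left > 0]
    return steps, completed

reqs = [Request(i, n) for i, n in enumerate([3, 5, 2, 4, 6])]
steps, order = continuous_batching(reqs, max_batch=2)
print(steps, order)  # 20 total tokens in 11 steps with batch slots of 2
```

With static batching, the batch of requests 0 and 1 would run until the longest (5 tokens) finished before admitting anyone else; per-step admission backfills the slot freed by request 0 two steps earlier.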
[Figure: Inference throughput (requests/sec) compared across four configurations — single A100 without batching, single A100 with batch size 8, two A100s with batch size 8, and four A100s with batch size 16.]
Real-World Example
A company deploys a text classification model that needs to process 1 million documents per day for content moderation. Measuring their serving system, they find throughput of 200 requests per second, which translates to 17.3 million requests per day — well above their requirement. This means they can serve their workload with a single server, significantly reducing infrastructure costs compared to their initial estimate of 5 servers.
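The sizing check in this example is a one-liner worth automating for any capacity plan:

```python
# Does the measured throughput cover the daily workload?
measured_rps = 200                  # requests/second from load testing
daily_requirement = 1_000_000       # documents to moderate per day

daily_capacity = measured_rps * 86_400              # seconds in a day
servers_needed = -(-daily_requirement // daily_capacity)  # ceiling division

print(f"capacity: {daily_capacity / 1e6:.2f}M requests/day, "
      f"servers needed: {servers_needed}")
```

At 200 req/s a single server delivers 17.28 million requests per day, so one server covers the 1M/day requirement with headroom — matching the conclusion above.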
Common Mistakes
- ✕ Measuring throughput with short, identical requests that don't represent the real workload distribution — highly variable request lengths dramatically affect real-world throughput
- ✕ Maximizing throughput at the expense of latency SLAs — batching many requests together improves throughput but increases per-request wait time
- ✕ Not testing throughput under sustained load — systems that perform well in short burst tests may degrade once GPU memory or thermal limits are reached over time
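The first and third mistakes can be avoided with a benchmark that runs for a sustained window and draws request lengths from a realistic distribution. A minimal sketch, where `serve` is a hypothetical stand-in you would replace with your real serving client:

```python
import random
import time

def serve(n_tokens: int) -> None:
    """Hypothetical stand-in for a real model call; cost scales with
    output length, as decode time does in an LLM server."""
    time.sleep(n_tokens * 0.0001)

def benchmark(duration_s=5.0, mean_tokens=200):
    """Measure steady-state throughput over a sustained window with
    variable request lengths (exponential spread), not identical bursts."""
    lengths, start = [], time.monotonic()
    while time.monotonic() - start < duration_s:
        n = max(1, int(random.expovariate(1 / mean_tokens)))
        serve(n)
        lengths.append(n)
    elapsed = time.monotonic() - start
    return {
        "requests_per_s": len(lengths) / elapsed,
        "tokens_per_s": sum(lengths) / elapsed,
        "p95_len": sorted(lengths)[int(0.95 * len(lengths))],
    }
```

Run it long enough to hit steady state (minutes, not seconds, for a real GPU server) and report tokens/s alongside requests/s, since the two diverge sharply once request lengths vary.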
Related Terms
Inference Latency
Inference latency is the time between submitting an input to a deployed AI model and receiving the complete output — typically measured in milliseconds for classification models and seconds for large language models — directly impacting user experience and system design.
Model Serving
Model serving is the infrastructure that hosts trained ML models and exposes them as APIs, handling prediction requests in production with the latency, throughput, and reliability requirements of real applications.
Inference Server
An inference server is specialized software that hosts ML models and handles prediction requests with optimized batching, hardware utilization, and concurrency — outperforming generic web frameworks for AI workloads.
Batch Inference
Batch inference is the processing of large groups of input data through a machine learning model in a single scheduled job, rather than in real time, enabling high throughput at lower cost for use cases that do not require immediate responses.
AI Cost Optimization
AI cost optimization encompasses techniques to reduce the compute, storage, and API expenses of AI systems—through model selection, caching, batching, quantization, and architecture decisions—making AI economically sustainable at scale.