Inference Latency
Definition
Inference latency has several components: network round-trip time (client to server and back), input preprocessing time, model forward pass computation, output postprocessing, and queue wait time under load. For LLMs, latency is often split into Time to First Token (TTFT) — how long until generation starts — and Time Per Output Token (TPOT) — the average time to generate each subsequent token, the inverse of tokens-per-second generation speed. Latency scales with model size, sequence length, and hardware capacity. Optimization techniques include quantization, KV cache optimization, speculative decoding, and batching strategies.
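TTFT and TPOT can both be derived from timestamps around a streaming response. The sketch below (plain Python, with a hypothetical `fake_stream` generator standing in for a real LLM streaming API) measures TTFT as the delay before the first token arrives and TPOT as the average gap between the remaining tokens:

```python
import time

def measure_streaming_latency(token_stream):
    """Measure TTFT and TPOT for an iterable that yields tokens.

    Returns (ttft_seconds, tpot_seconds): time to first token, and
    average time per output token after the first.
    """
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in token_stream:
        now = time.perf_counter()
        if ttft is None:
            ttft = now - start  # first token has arrived
        count += 1
    total = time.perf_counter() - start
    # TPOT averages generation time over tokens after the first.
    tpot = (total - ttft) / (count - 1) if count > 1 else 0.0
    return ttft, tpot

def fake_stream(n_tokens=5, first_delay=0.05, per_token=0.01):
    # Hypothetical stand-in for a real streaming client; the sleeps
    # simulate prefill (first_delay) and decode (per_token) time.
    time.sleep(first_delay)
    yield "<tok>"
    for _ in range(n_tokens - 1):
        time.sleep(per_token)
        yield "<tok>"

ttft, tpot = measure_streaming_latency(fake_stream())
```

The same two numbers let you predict total response time for any output length: roughly TTFT + (tokens − 1) × TPOT.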
Why It Matters
Inference latency directly determines the viability of AI features in interactive applications. Research shows user satisfaction degrades significantly when AI response times exceed 3-4 seconds for conversational applications. High latency in customer support chatbots leads to user abandonment and ticket escalation. Latency also affects business economics: lower latency often requires higher-cost hardware or smaller models, creating a direct tradeoff between user experience and compute cost. Setting explicit latency SLAs before choosing models and infrastructure prevents misaligned expectations.
How It Works
Latency benchmarking measures p50, p95, and p99 response times under representative load. p99 latency captures worst-case user experience — if p99 is 8 seconds, 1% of users wait 8+ seconds even when average latency is 1 second. Optimization proceeds by profiling latency breakdowns: is the bottleneck in preprocessing, queue wait, model computation, or output processing? GPU profiling tools, distributed traces, and load testing under realistic concurrency reveal the dominant latency source.
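Percentiles are computed directly from recorded per-request latencies. A minimal sketch with the standard library, using a synthetic workload (most requests near 100 ms plus a slow tail) to show how p99 exposes the tail that p50 hides:

```python
import random
import statistics

def latency_percentiles(samples_ms):
    """Return p50/p95/p99 from a list of latency samples (milliseconds)."""
    qs = statistics.quantiles(samples_ms, n=100, method="inclusive")
    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile.
    return {"p50": qs[49], "p95": qs[94], "p99": qs[98]}

random.seed(0)
# Synthetic workload: 90% of requests cluster near 100 ms,
# 10% hit a slow path between 500 ms and 2 s.
samples = [random.gauss(100, 15) for _ in range(900)] + \
          [random.uniform(500, 2000) for _ in range(100)]

stats = latency_percentiles(samples)
```

Here the median looks healthy (~100 ms) while p99 sits above 500 ms, which is exactly the situation where optimizing only the average misleads.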
Figure: inference latency breakdown (200 ms total) across network I/O, tokenization, model forward pass, detokenization, and post-processing.
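Producing a breakdown like this just means timing each pipeline stage independently. A minimal sketch with a context manager (the stage names and sleeps are illustrative stand-ins for real work):

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timed(stage):
    """Accumulate wall-clock time spent in one named pipeline stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = timings.get(stage, 0.0) + time.perf_counter() - start

def handle_request(text):
    # Hypothetical request pipeline; sleeps simulate each stage's cost.
    with timed("tokenization"):
        tokens = text.split()
        time.sleep(0.002)
    with timed("forward_pass"):
        time.sleep(0.020)  # typically the dominant stage
    with timed("detokenization"):
        time.sleep(0.001)
    return tokens

handle_request("profile each stage separately")
bottleneck = max(timings, key=timings.get)
```

In a real service these timings would be emitted as distributed-trace spans or histogram metrics rather than a dict, but the principle is the same: measure per stage, then optimize the largest slice first.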
Real-World Example
A team building a real-time product recommendation AI measures a p99 latency of 850ms using a large BERT model. After profiling, 600ms is in model inference and 250ms in a slow database lookup for product metadata. They apply INT8 quantization to reduce model inference to 220ms and add a Redis cache for product metadata lookups. The resulting p99 drops to 280ms — well within their 500ms SLA — enabling a real-time inline recommendation feature.
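The caching half of that fix can be sketched with a read-through cache in front of the slow lookup. This uses an in-process TTL cache purely for illustration; the team in the example used Redis so that all serving replicas share one cache:

```python
import time

class TTLCache:
    """Minimal in-process cache with expiry, standing in for Redis."""

    def __init__(self, ttl_seconds=60.0):
        self.ttl = ttl_seconds
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires = entry
        if time.monotonic() > expires:
            del self._store[key]  # expired: evict and miss
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, time.monotonic() + self.ttl)

def fetch_metadata(product_id, cache, db_lookup):
    """Read-through: check the cache first, fall back to the slow DB call."""
    cached = cache.get(product_id)
    if cached is not None:
        return cached
    value = db_lookup(product_id)  # the slow 250 ms path from the example
    cache.set(product_id, value)
    return value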
Common Mistakes
- ✕Benchmarking latency with a single request rather than under concurrent load — multi-user concurrency reveals batching overhead and queue wait times that single-request tests miss
- ✕Optimizing average (p50) latency while ignoring tail (p99) latency — users experiencing p99 latency have the worst experiences and highest churn rates
- ✕Conflating model inference speed with end-to-end system latency — network, preprocessing, and database calls often dominate total response time
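The first mistake above is easy to avoid: benchmark with concurrent clients, not a loop of sequential requests. A minimal load-test sketch using a thread pool, where `call_model` is a hypothetical stand-in for a real HTTP client call:

```python
import time
import statistics
from concurrent.futures import ThreadPoolExecutor

def call_model(_):
    """Stand-in for one inference request; replace with a real client call."""
    start = time.perf_counter()
    time.sleep(0.01)  # simulated server-side work
    return time.perf_counter() - start

def load_test(n_requests=200, concurrency=20):
    """Fire n_requests at the given concurrency; report p50/p99 in ms."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = [t * 1000 for t in pool.map(call_model, range(n_requests))]
    qs = statistics.quantiles(latencies, n=100, method="inclusive")
    return {"p50": qs[49], "p99": qs[98]}

single = call_model(None) * 1000  # single-request baseline, in ms
under_load = load_test()
```

Comparing `single` against `under_load` at increasing concurrency levels reveals queueing and batching effects that a single-request benchmark cannot show.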
Related Terms
Online Inference
Online inference (also called real-time inference) is the processing of individual or small groups of model inputs immediately upon arrival, returning results within milliseconds to seconds to support interactive applications like chatbots, search, and recommendations.
Model Serving
Model serving is the infrastructure that hosts trained ML models and exposes them as APIs, handling prediction requests in production with the latency, throughput, and reliability requirements of real applications.
Inference Server
An inference server is specialized software that hosts ML models and handles prediction requests with optimized batching, hardware utilization, and concurrency—outperforming generic web frameworks for AI workloads.
Load Balancing
Load balancing is the distribution of incoming AI inference requests across multiple model serving instances to maximize throughput, minimize latency, prevent any single server from becoming a bottleneck, and maintain high availability.
Inference Throughput
Inference throughput is the rate at which an AI model serving system processes requests — measured in requests per second (RPS) or tokens per second — representing the maximum capacity of the system under sustained load.