Inference Latency
Definition
Inference latency has several components: network round-trip time (client to server and back), input preprocessing time, model forward pass computation, output postprocessing, and queue wait time under load. For LLMs, latency is often split into Time to First Token (TTFT) — how long until generation starts — and Time Per Output Token (TPOT) — the average time to generate each subsequent token, the inverse of tokens-per-second generation speed. Latency scales with model size, sequence length, and hardware capacity. Optimization techniques include quantization, KV cache optimization, speculative decoding, and batching strategies.
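TTFT and TPOT can both be derived from timestamps around a streaming response. The sketch below (plain Python, with a hypothetical `fake_stream` generator standing in for a real LLM streaming API) measures TTFT as the delay before the first token arrives and TPOT as the average gap between the remaining tokens:

```python
import time

def measure_streaming_latency(token_stream):
    """Measure TTFT and TPOT for an iterable that yields tokens.

    Returns (ttft_seconds, tpot_seconds): time to first token, and
    average time per output token after the first.
    """
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in token_stream:
        now = time.perf_counter()
        if ttft is None:
            ttft = now - start  # first token has arrived
        count += 1
    total = time.perf_counter() - start
    # TPOT averages generation time over tokens after the first.
    tpot = (total - ttft) / (count - 1) if count > 1 else 0.0
    return ttft, tpot

def fake_stream(n_tokens=5, first_delay=0.05, per_token=0.01):
    # Hypothetical stand-in for a real streaming client; the sleeps
    # simulate prefill (first_delay) and decode (per_token) time.
    time.sleep(first_delay)
    yield "<tok>"
    for _ in range(n_tokens - 1):
        time.sleep(per_token)
        yield "<tok>"

ttft, tpot = measure_streaming_latency(fake_stream())
```

The same two numbers let you predict total response time for any output length: roughly TTFT + (tokens − 1) × TPOT.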
Why It Matters
Inference latency directly determines the viability of AI features in interactive applications. Research shows user satisfaction degrades significantly when AI response times exceed 3-4 seconds for conversational applications. High latency in customer support chatbots leads to user abandonment and ticket escalation. Latency also affects business economics: lower latency often requires higher-cost hardware or smaller models, creating a direct tradeoff between user experience and compute cost. Setting explicit latency SLAs before choosing models and infrastructure prevents misaligned expectations.
How It Works
Latency benchmarking measures p50, p95, and p99 response times under representative load. p99 latency captures worst-case user experience — if p99 is 8 seconds, 1% of users wait 8+ seconds even when average latency is 1 second. Optimization proceeds by profiling latency breakdowns: is the bottleneck in preprocessing, queue wait, model computation, or output processing? GPU profiling tools, distributed traces, and load testing under realistic concurrency reveal the dominant latency source.
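Percentiles are computed directly from recorded per-request latencies. A minimal sketch with the standard library, using a synthetic workload (most requests near 100 ms plus a slow tail) to show how p99 exposes the tail that p50 hides:

```python
import random
import statistics

def latency_percentiles(samples_ms):
    """Return p50/p95/p99 from a list of latency samples (milliseconds)."""
    qs = statistics.quantiles(samples_ms, n=100, method="inclusive")
    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile.
    return {"p50": qs[49], "p95": qs[94], "p99": qs[98]}

random.seed(0)
# Synthetic workload: 90% of requests cluster near 100 ms,
# 10% hit a slow path between 500 ms and 2 s.
samples = [random.gauss(100, 15) for _ in range(900)] + \
          [random.uniform(500, 2000) for _ in range(100)]

stats = latency_percentiles(samples)
```

Here the median looks healthy (~100 ms) while p99 sits above 500 ms, which is exactly the situation where optimizing only the average misleads.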
Figure: inference latency breakdown (200 ms total) across network I/O, tokenization, model forward pass, detokenization, and post-processing.
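Producing a breakdown like this just means timing each pipeline stage independently. A minimal sketch with a context manager (the stage names and sleeps are illustrative stand-ins for real work):

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timed(stage):
    """Accumulate wall-clock time spent in one named pipeline stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = timings.get(stage, 0.0) + time.perf_counter() - start

def handle_request(text):
    # Hypothetical request pipeline; sleeps simulate each stage's cost.
    with timed("tokenization"):
        tokens = text.split()
        time.sleep(0.002)
    with timed("forward_pass"):
        time.sleep(0.020)  # typically the dominant stage
    with timed("detokenization"):
        time.sleep(0.001)
    return tokens

handle_request("profile each stage separately")
bottleneck = max(timings, key=timings.get)
```

In a real service these timings would be emitted as distributed-trace spans or histogram metrics rather than a dict, but the principle is the same: measure per stage, then optimize the largest slice first.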
Real-World Example
A team building a real-time product recommendation AI measures a p99 latency of 850ms using a large BERT model. After profiling, 600ms is in model inference and 250ms in a slow database lookup for product metadata. They apply INT8 quantization to reduce model inference to 220ms and add a Redis cache for product metadata lookups. The resulting p99 drops to 280ms — well within their 500ms SLA — enabling a real-time inline recommendation feature.
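The caching half of that fix can be sketched with a read-through cache in front of the slow lookup. This uses an in-process TTL cache purely for illustration; the team in the example used Redis so that all serving replicas share one cache:

```python
import time

class TTLCache:
    """Minimal in-process cache with expiry, standing in for Redis."""

    def __init__(self, ttl_seconds=60.0):
        self.ttl = ttl_seconds
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires = entry
        if time.monotonic() > expires:
            del self._store[key]  # expired: evict and miss
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, time.monotonic() + self.ttl)

def fetch_metadata(product_id, cache, db_lookup):
    """Read-through: check the cache first, fall back to the slow DB call."""
    cached = cache.get(product_id)
    if cached is not None:
        return cached
    value = db_lookup(product_id)  # the slow 250 ms path from the example
    cache.set(product_id, value)
    return value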
Common Mistakes
- ✕Benchmarking latency with a single request rather than under concurrent load — multi-user concurrency reveals batching overhead and queue wait times that single-request tests miss
- ✕Optimizing average (p50) latency while ignoring tail (p99) latency — users experiencing p99 latency have the worst experiences and highest churn rates
- ✕Conflating model inference speed with end-to-end system latency — network, preprocessing, and database calls often dominate total response time
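The first mistake above is easy to avoid: benchmark with concurrent clients, not a loop of sequential requests. A minimal load-test sketch using a thread pool, where `call_model` is a hypothetical stand-in for a real HTTP client call:

```python
import time
import statistics
from concurrent.futures import ThreadPoolExecutor

def call_model(_):
    """Stand-in for one inference request; replace with a real client call."""
    start = time.perf_counter()
    time.sleep(0.01)  # simulated server-side work
    return time.perf_counter() - start

def load_test(n_requests=200, concurrency=20):
    """Fire n_requests at the given concurrency; report p50/p99 in ms."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = [t * 1000 for t in pool.map(call_model, range(n_requests))]
    qs = statistics.quantiles(latencies, n=100, method="inclusive")
    return {"p50": qs[49], "p99": qs[98]}

single = call_model(None) * 1000  # single-request baseline, in ms
under_load = load_test()
```

Comparing `single` against `under_load` at increasing concurrency levels reveals queueing and batching effects that a single-request benchmark cannot show.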
Related Terms
Online Inference
Online inference (also called real-time inference) is the processing of individual or small groups of model inputs immediately upon arrival, returning results within milliseconds to seconds to support interactive applications like chatbots, search, and recommendations.
Model Serving
Model serving is the infrastructure that hosts trained ML models and exposes them as APIs, handling prediction requests in production with the latency, throughput, and reliability requirements of real applications.
Inference Server
An inference server is specialized software that hosts ML models and handles prediction requests with optimized batching, hardware utilization, and concurrency—outperforming generic web frameworks for AI workloads.
Load Balancing
Load balancing is the distribution of incoming AI inference requests across multiple model serving instances to maximize throughput, minimize latency, prevent any single server from becoming a bottleneck, and maintain high availability.
Inference Throughput
Inference throughput is the rate at which an AI model serving system processes requests — measured in requests per second (RPS) or tokens per second — representing the maximum capacity of the system under sustained load.