Online Inference
Definition
Online inference serves latency-sensitive applications where users expect immediate responses. A model serving stack for online inference maintains always-on infrastructure: GPU servers running model instances, load balancers distributing requests, caches for repeated queries, and autoscalers that provision additional capacity during traffic spikes. The tradeoff versus batch inference is cost: resources must be provisioned to handle peak demand even during off-peak hours when utilization is low.
Why It Matters
Online inference enables the interactive AI experiences that users expect — chatbots responding in seconds, search results appearing as users type, fraud detection happening at checkout time. The latency requirement shapes the entire model selection and serving architecture: models must be small enough or optimized enough (quantization, caching) to meet response time targets. Customer support chatbots specifically require online inference since users expect responses within 1-3 seconds of sending a message.
How It Works
Online inference systems are designed around latency targets. Model selection, quantization level, hardware choice (GPU type, batch size), and caching strategy are all tuned to meet p99 latency SLAs. Streaming inference — where the model outputs tokens as they are generated rather than waiting for the complete response — reduces perceived latency for LLM applications. Autoscaling policies provision additional servers when queue depth or latency metrics breach thresholds, maintaining response times during traffic surges.
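The two mechanisms above, token streaming and threshold-based autoscaling, can be sketched in a few lines. This is a minimal illustration, not a production implementation: `stream_tokens` and `should_scale_up` are hypothetical names, the model call is stubbed with fixed tokens, and the thresholds are illustrative.

```python
from typing import Iterator

def stream_tokens(prompt: str) -> Iterator[str]:
    """Yield tokens as they are produced instead of waiting for the
    full response. A real server would wrap its decoding loop here;
    the fixed tuple below is a stand-in for model output."""
    for token in ("Hello", ",", " world", "!"):
        yield token

def should_scale_up(queue_depth: int, p99_latency_ms: float,
                    max_queue: int = 50,
                    latency_sla_ms: float = 500.0) -> bool:
    """Provision more servers when queue depth or tail latency
    breaches a threshold (illustrative values, not real SLAs)."""
    return queue_depth > max_queue or p99_latency_ms > latency_sla_ms
```

A client consuming `stream_tokens` renders each token as it arrives, so the user sees output long before the full response is complete, which is exactly the perceived-latency win described above.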
[Diagram: Online (Real-Time) Inference. Flow: User Request (HTTP POST /v1/chat) → Inference Server (GPU model forward pass) → Response (< 500ms SLA). Example metrics: throughput ~500 req/s, P95 latency 220ms, availability 99.9%.]
Real-World Example
A customer support chatbot requires online inference with a 2-second p99 latency SLA. The team quantizes their LLM to 4-bit precision, deploys it on two A10G GPU instances behind a load balancer, implements prompt caching for repeated system prompts, and configures token streaming so users see the response appearing as it generates. The resulting system handles 200 concurrent users within the SLA and auto-scales to 800 concurrent users during peak hours.
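The prompt caching in this example can be sketched as a response cache keyed by a prompt hash. Note this is a simplification: production LLM servers typically cache the attention KV state for shared prompt prefixes rather than whole responses. `PromptCache` and its methods are hypothetical names.

```python
import hashlib
from typing import Callable, Dict

class PromptCache:
    """Cache responses keyed by a prompt hash, so repeated prompts
    (e.g. a fixed system prompt) skip the GPU forward pass."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}
        self.hits = 0

    def get_or_compute(self, prompt: str,
                       compute: Callable[[str], str]) -> str:
        key = hashlib.sha256(prompt.encode()).hexdigest()
        if key in self._store:
            self.hits += 1
            return self._store[key]
        result = compute(prompt)  # the expensive model call
        self._store[key] = result
        return result
```

For a chatbot where every conversation starts from the same system prompt, even this naive full-response cache eliminates redundant work for identical inputs; prefix-level KV caching extends the idea to prompts that merely share a common beginning.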
Common Mistakes
- ✕ Designing online inference without latency SLAs — building an interactive product without understanding the required response time leads to architecture choices that fail in production
- ✕ Not implementing request queuing — direct connections from clients to model servers without a queue cause request drops when brief spikes exceed server capacity
- ✕ Ignoring cold start latency — newly provisioned servers take time to load model weights, causing high latency for the first requests served after a scale-up event
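The second mistake, missing request queuing, can be addressed with a bounded queue in front of the model servers: brief spikes wait instead of being dropped at the socket, and sustained overload is rejected explicitly. A minimal sketch, with the class name and depth limit chosen for illustration:

```python
from collections import deque
from typing import Optional

class BoundedRequestQueue:
    """Buffer incoming requests ahead of the model servers. Short
    bursts are absorbed; beyond max_depth, requests are rejected
    explicitly (e.g. with HTTP 503) rather than silently dropped."""

    def __init__(self, max_depth: int = 100) -> None:
        self.max_depth = max_depth
        self._q: deque = deque()

    def enqueue(self, request: str) -> bool:
        if len(self._q) >= self.max_depth:
            return False  # shed load explicitly
        self._q.append(request)
        return True

    def dequeue(self) -> Optional[str]:
        return self._q.popleft() if self._q else None
```

The queue depth here is also a natural autoscaling signal: when it stays near `max_depth`, it is time to provision more capacity.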
Related Terms
Batch Inference
Batch inference is the processing of large groups of input data through a machine learning model in a single scheduled job, rather than in real time, enabling high throughput at lower cost for use cases that do not require immediate responses.
Inference Latency
Inference latency is the time between submitting an input to a deployed AI model and receiving the complete output — typically measured in milliseconds for classification models and seconds for large language models — directly impacting user experience and system design.
Model Serving
Model serving is the infrastructure that hosts trained ML models and exposes them as APIs, handling prediction requests in production with the latency, throughput, and reliability requirements of real applications.
Inference Server
An inference server is specialized software that hosts ML models and handles prediction requests with optimized batching, hardware utilization, and concurrency — outperforming generic web frameworks for AI workloads.
Load Balancing
Load balancing is the distribution of incoming AI inference requests across multiple model serving instances to maximize throughput, minimize latency, prevent any single server from becoming a bottleneck, and maintain high availability.