Online Inference
Definition
Online inference serves latency-sensitive applications where users expect immediate responses. A model serving stack for online inference maintains always-on infrastructure: GPU servers running model instances, load balancers distributing requests, caches for repeated queries, and autoscalers that provision additional capacity during traffic spikes. The tradeoff versus batch inference is cost: resources must be provisioned to handle peak demand even during off-peak hours when utilization is low.
Why It Matters
Online inference enables the interactive AI experiences that users expect — chatbots responding in seconds, search results appearing as users type, fraud detection happening at checkout time. The latency requirement shapes the entire model selection and serving architecture: models must be small enough or optimized enough (quantization, caching) to meet response time targets. Customer support chatbots specifically require online inference since users expect responses within 1-3 seconds of sending a message.
How It Works
Online inference systems are designed around latency targets. Model selection, quantization level, hardware choice (GPU type, batch size), and caching strategy are all tuned to meet p99 latency SLAs. Streaming inference — where the model outputs tokens as they are generated rather than waiting for the complete response — reduces perceived latency for LLM applications. Autoscaling policies provision additional servers when queue depth or latency metrics breach thresholds, maintaining response times during traffic surges.
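The two mechanisms above, token streaming and threshold-based autoscaling, can be sketched in a few lines. This is a minimal illustration, not a production implementation: `stream_tokens` and `should_scale_up` are hypothetical names, the model call is stubbed with fixed tokens, and the thresholds are illustrative.

```python
from typing import Iterator

def stream_tokens(prompt: str) -> Iterator[str]:
    """Yield tokens as they are produced instead of waiting for the
    full response. A real server would wrap its decoding loop here;
    the fixed tuple below is a stand-in for model output."""
    for token in ("Hello", ",", " world", "!"):
        yield token

def should_scale_up(queue_depth: int, p99_latency_ms: float,
                    max_queue: int = 50,
                    latency_sla_ms: float = 500.0) -> bool:
    """Provision more servers when queue depth or tail latency
    breaches a threshold (illustrative values, not real SLAs)."""
    return queue_depth > max_queue or p99_latency_ms > latency_sla_ms
```

A client consuming `stream_tokens` renders each token as it arrives, so the user sees output long before the full response is complete, which is exactly the perceived-latency win described above.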
[Diagram: Online (Real-Time) Inference. Flow: User Request (HTTP POST /v1/chat) → Inference Server (GPU model forward pass) → Response (< 500ms SLA). Example metrics: throughput ~500 req/s, P95 latency 220ms, availability 99.9%.]
Real-World Example
A customer support chatbot requires online inference with a 2-second p99 latency SLA. The team quantizes their LLM to 4-bit precision, deploys it on two A10G GPU instances behind a load balancer, implements prompt caching for repeated system prompts, and configures token streaming so users see the response appearing as it generates. The resulting system handles 200 concurrent users within the SLA and auto-scales to 800 concurrent users during peak hours.
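The prompt caching in this example can be sketched as a response cache keyed by a prompt hash. Note this is a simplification: production LLM servers typically cache the attention KV state for shared prompt prefixes rather than whole responses. `PromptCache` and its methods are hypothetical names.

```python
import hashlib
from typing import Callable, Dict

class PromptCache:
    """Cache responses keyed by a prompt hash, so repeated prompts
    (e.g. a fixed system prompt) skip the GPU forward pass."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}
        self.hits = 0

    def get_or_compute(self, prompt: str,
                       compute: Callable[[str], str]) -> str:
        key = hashlib.sha256(prompt.encode()).hexdigest()
        if key in self._store:
            self.hits += 1
            return self._store[key]
        result = compute(prompt)  # the expensive model call
        self._store[key] = result
        return result
```

For a chatbot where every conversation starts from the same system prompt, even this naive full-response cache eliminates redundant work for identical inputs; prefix-level KV caching extends the idea to prompts that merely share a common beginning.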
Common Mistakes
- ✕ Designing online inference without latency SLAs — building an interactive product without understanding the required response time leads to architecture choices that fail in production
- ✕ Not implementing request queuing — direct connections from clients to model servers without a queue cause request drops when brief spikes exceed server capacity
- ✕ Ignoring cold start latency — newly provisioned servers take time to load model weights, causing high latency for the first requests served after a scale-up event
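The second mistake, missing request queuing, can be addressed with a bounded queue in front of the model servers: brief spikes wait instead of being dropped at the socket, and sustained overload is rejected explicitly. A minimal sketch, with the class name and depth limit chosen for illustration:

```python
from collections import deque
from typing import Optional

class BoundedRequestQueue:
    """Buffer incoming requests ahead of the model servers. Short
    bursts are absorbed; beyond max_depth, requests are rejected
    explicitly (e.g. with HTTP 503) rather than silently dropped."""

    def __init__(self, max_depth: int = 100) -> None:
        self.max_depth = max_depth
        self._q: deque = deque()

    def enqueue(self, request: str) -> bool:
        if len(self._q) >= self.max_depth:
            return False  # shed load explicitly
        self._q.append(request)
        return True

    def dequeue(self) -> Optional[str]:
        return self._q.popleft() if self._q else None
```

The queue depth here is also a natural autoscaling signal: when it stays near `max_depth`, it is time to provision more capacity.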
Related Terms
Batch Inference
Batch inference is the processing of large groups of input data through a machine learning model in a single scheduled job, rather than in real time, enabling high throughput at lower cost for use cases that do not require immediate responses.
Inference Latency
Inference latency is the time between submitting an input to a deployed AI model and receiving the complete output — typically measured in milliseconds for classification models and seconds for large language models — directly impacting user experience and system design.
Model Serving
Model serving is the infrastructure that hosts trained ML models and exposes them as APIs, handling prediction requests in production with the latency, throughput, and reliability requirements of real applications.
Inference Server
An inference server is specialized software that hosts ML models and handles prediction requests with optimized batching, hardware utilization, and concurrency — outperforming generic web frameworks for AI workloads.
Load Balancing
Load balancing is the distribution of incoming AI inference requests across multiple model serving instances to maximize throughput, minimize latency, prevent any single server from becoming a bottleneck, and maintain high availability.