Model Serving
Definition
Model serving encompasses the systems, software, and infrastructure that make trained models available for inference in production. A serving system receives prediction requests (via REST, gRPC, or message queue), preprocesses inputs, runs the model forward pass, post-processes outputs, and returns predictions—all within latency SLA requirements. Serving frameworks include TorchServe, TensorFlow Serving, Triton Inference Server, and Ray Serve. For LLMs specifically, serving systems like vLLM, TGI (Text Generation Inference), and Ollama handle the unique requirements of autoregressive generation: KV cache management, continuous batching, and streaming responses.
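The request path a serving framework handles can be illustrated with a minimal sketch. The model here is a stand-in, and names like `preprocess`, `postprocess`, and `DummyModel` are illustrative, not from any particular framework:

```python
import json

class DummyModel:
    """Stand-in for a trained model: scores text by its length."""
    def forward(self, features):
        return [len(f) / 100.0 for f in features]

def preprocess(raw_request: str) -> list:
    # Validate and transform raw API input into model-ready features.
    payload = json.loads(raw_request)
    return [payload["text"].strip().lower()]

def postprocess(outputs: list) -> str:
    # Decode model outputs into the API response format.
    return json.dumps({"score": round(outputs[0], 4)})

def handle_request(model, raw_request: str) -> str:
    features = preprocess(raw_request)   # input validation + preprocessing
    outputs = model.forward(features)    # model execution (forward pass)
    return postprocess(outputs)          # output postprocessing
```

A real serving framework wraps this same pipeline behind a REST or gRPC endpoint and adds routing, batching, and caching around it.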
Why It Matters
Model serving infrastructure determines whether an AI model can actually function as a product component. A model that achieves excellent accuracy in offline evaluation is worthless if it cannot respond within 200ms at peak load, handle 10x traffic spikes without failures, or serve 1,000 concurrent users reliably. Serving infrastructure decisions—hardware choice (CPU vs GPU), batching strategy, caching, auto-scaling—directly determine latency, throughput, cost per request, and availability. For LLM applications, serving efficiency can reduce inference costs by 3-10x through techniques like continuous batching.
How It Works
A model serving system handles the full inference path: (1) request routing—load balancer distributes requests across model replicas; (2) input validation and preprocessing—transform raw API input to model-ready tensors; (3) batching—group concurrent requests for GPU efficiency; (4) model execution—GPU forward pass; (5) output post-processing—decode model outputs to API response format; (6) caching—return cached predictions for repeated inputs. Advanced LLM serving uses continuous batching (dynamically grouping requests mid-generation) and speculative decoding (using a small draft model to accelerate the large model) for dramatically higher throughput.
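Step 3 above can be sketched as a simple dynamic batcher that drains whatever requests are waiting, up to a size cap, so a single forward pass serves all of them. This is a simplification of the continuous batching done by systems like vLLM, and the class and parameter names are illustrative:

```python
from collections import deque

class DynamicBatcher:
    """Group pending requests into batches for a single GPU forward pass."""
    def __init__(self, max_batch_size: int = 8):
        self.max_batch_size = max_batch_size
        self.pending = deque()

    def submit(self, request_id: str, payload: str):
        # Requests arrive concurrently and queue until the next batch forms.
        self.pending.append((request_id, payload))

    def next_batch(self):
        # Drain up to max_batch_size waiting requests; one forward pass
        # then serves all of them, amortizing per-request GPU overhead.
        batch = []
        while self.pending and len(batch) < self.max_batch_size:
            batch.append(self.pending.popleft())
        return batch
```

Continuous batching goes further than this sketch: it admits new requests into a batch between generation steps, rather than waiting for the whole batch to finish.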
Model Serving Patterns
- REST API: Synchronous request/response; used for single queries and chatbots
- gRPC: Binary protocol, low latency; used for internal microservices
- Streaming SSE: Server-sent events, token-by-token; used for real-time text generation
- Batch Endpoint: Async, high-throughput jobs; used for bulk document processing
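The streaming pattern can be illustrated with a generator that emits each token as a server-sent-event frame as soon as it is produced. The word split below is a toy stand-in for autoregressive decoding, not a real LLM tokenizer:

```python
def sse_token_stream(text: str):
    """Yield each generated token as a server-sent event (SSE) frame."""
    for token in text.split():          # toy stand-in for autoregressive decoding
        yield f"data: {token}\n\n"      # SSE frame: a 'data:' line plus a blank line
    yield "data: [DONE]\n\n"            # sentinel telling clients the stream has ended
```

Clients render tokens as they arrive, so perceived latency is the time to the first token rather than the full generation time.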
Real-World Example
A legal tech company deployed a contract analysis LLM using a naive single-request serving setup. At peak load (9 AM, when lawyers start reviewing overnight contracts), average latency spiked to 45 seconds per contract and 30% of requests timed out. After migrating to vLLM with continuous batching, GPU utilization increased from 35% to 92%, average latency dropped to 8 seconds, and the system handled 10x peak load with no timeouts—on the same GPU hardware. The continuous batching optimization eliminated the latency spikes by keeping the GPU saturated rather than idling between requests.
Common Mistakes
- ✕Conflating model training infrastructure with serving infrastructure—they have different optimization targets and often require different hardware
- ✕Not testing serving infrastructure under realistic load before launch—latency and throughput look fine with one request; they collapse under concurrent load
- ✕Over-provisioning serving infrastructure to avoid performance issues—auto-scaling with proper load testing is more cost-effective than constant over-provisioning
Related Terms
MLOps
MLOps (Machine Learning Operations) applies DevOps principles to ML systems—combining engineering practices for model development, deployment, monitoring, and retraining into a disciplined operational lifecycle.
Inference Server
An inference server is specialized software that hosts ML models and handles prediction requests with optimized batching, hardware utilization, and concurrency—outperforming generic web frameworks for AI workloads.
Model Deployment
Model deployment is the process of moving a trained ML model from development into a production environment where it can serve real users—encompassing packaging, testing, infrastructure provisioning, and release management.
Batch Inference
Batch inference is the processing of large groups of input data through a machine learning model in a single scheduled job, rather than in real time, enabling high throughput at lower cost for use cases that do not require immediate responses.
Online Inference
Online inference (also called real-time inference) is the processing of individual or small groups of model inputs immediately upon arrival, returning results within milliseconds to seconds to support interactive applications like chatbots, search, and recommendations.