Model Serving
Definition
Model serving encompasses the systems, software, and infrastructure that make trained models available for inference in production. A serving system receives prediction requests (via REST, gRPC, or message queue), preprocesses inputs, runs the model forward pass, post-processes outputs, and returns predictions—all within latency SLA requirements. Serving frameworks include TorchServe, TensorFlow Serving, Triton Inference Server, and Ray Serve. For LLMs specifically, serving systems like vLLM, TGI (Text Generation Inference), and Ollama handle the unique requirements of autoregressive generation: KV cache management, continuous batching, and streaming responses.
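The request path a serving framework handles can be illustrated with a minimal sketch. The model here is a stand-in, and names like `preprocess`, `postprocess`, and `DummyModel` are illustrative, not from any particular framework:

```python
import json

class DummyModel:
    """Stand-in for a trained model: scores text by its length."""
    def forward(self, features):
        return [len(f) / 100.0 for f in features]

def preprocess(raw_request: str) -> list:
    # Validate and transform raw API input into model-ready features.
    payload = json.loads(raw_request)
    return [payload["text"].strip().lower()]

def postprocess(outputs: list) -> str:
    # Decode model outputs into the API response format.
    return json.dumps({"score": round(outputs[0], 4)})

def handle_request(model, raw_request: str) -> str:
    features = preprocess(raw_request)   # input validation + preprocessing
    outputs = model.forward(features)    # model execution (forward pass)
    return postprocess(outputs)          # output postprocessing
```

A real serving framework wraps this same pipeline behind a REST or gRPC endpoint and adds routing, batching, and caching around it.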
Why It Matters
Model serving infrastructure determines whether an AI model can actually function as a product component. A model that achieves excellent accuracy in offline evaluation is worthless if it cannot respond within 200ms at peak load, handle 10x traffic spikes without failures, or serve 1,000 concurrent users reliably. Serving infrastructure decisions—hardware choice (CPU vs GPU), batching strategy, caching, auto-scaling—directly determine latency, throughput, cost per request, and availability. For LLM applications, serving efficiency can reduce inference costs by 3-10x through techniques like continuous batching.
How It Works
A model serving system handles the full inference path: (1) request routing—load balancer distributes requests across model replicas; (2) input validation and preprocessing—transform raw API input to model-ready tensors; (3) batching—group concurrent requests for GPU efficiency; (4) model execution—GPU forward pass; (5) output post-processing—decode model outputs to API response format; (6) caching—return cached predictions for repeated inputs. Advanced LLM serving uses continuous batching (dynamically grouping requests mid-generation) and speculative decoding (using a small draft model to accelerate the large model) for dramatically higher throughput.
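Step 3 above can be sketched as a simple dynamic batcher that drains whatever requests are waiting, up to a size cap, so a single forward pass serves all of them. This is a simplification of the continuous batching done by systems like vLLM, and the class and parameter names are illustrative:

```python
from collections import deque

class DynamicBatcher:
    """Group pending requests into batches for a single GPU forward pass."""
    def __init__(self, max_batch_size: int = 8):
        self.max_batch_size = max_batch_size
        self.pending = deque()

    def submit(self, request_id: str, payload: str):
        # Requests arrive concurrently and queue until the next batch forms.
        self.pending.append((request_id, payload))

    def next_batch(self):
        # Drain up to max_batch_size waiting requests; one forward pass
        # then serves all of them, amortizing per-request GPU overhead.
        batch = []
        while self.pending and len(batch) < self.max_batch_size:
            batch.append(self.pending.popleft())
        return batch
```

Continuous batching goes further than this sketch: it admits new requests into a batch between generation steps, rather than waiting for the whole batch to finish.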
Model Serving Patterns
- REST API: Synchronous request/response; used for single queries and chatbots
- gRPC: Binary protocol, low latency; used for internal microservices
- Streaming SSE: Server-sent events, token-by-token; used for real-time text generation
- Batch Endpoint: Async, high-throughput jobs; used for bulk document processing
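The streaming pattern can be illustrated with a generator that emits each token as a server-sent-event frame as soon as it is produced. The word split below is a toy stand-in for autoregressive decoding, not a real LLM tokenizer:

```python
def sse_token_stream(text: str):
    """Yield each generated token as a server-sent event (SSE) frame."""
    for token in text.split():          # toy stand-in for autoregressive decoding
        yield f"data: {token}\n\n"      # SSE frame: a 'data:' line plus a blank line
    yield "data: [DONE]\n\n"            # sentinel telling clients the stream has ended
```

Clients render tokens as they arrive, so perceived latency is the time to the first token rather than the full generation time.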
Real-World Example
A legal tech company deployed a contract analysis LLM using a naive single-request serving setup. At peak load (9 AM, when lawyers start reviewing overnight contracts), average latency spiked to 45 seconds per contract and 30% of requests timed out. After migrating to vLLM with continuous batching, GPU utilization increased from 35% to 92%, average latency dropped to 8 seconds, and the system handled 10x peak load with no timeouts—on the same GPU hardware. The continuous batching optimization eliminated the latency spikes by keeping the GPU saturated rather than idling between requests.
Common Mistakes
- ✕Conflating model training infrastructure with serving infrastructure—they have different optimization targets and often require different hardware
- ✕Not testing serving infrastructure under realistic load before launch—latency and throughput look fine with one request; they collapse under concurrent load
- ✕Over-provisioning serving infrastructure to avoid performance issues—auto-scaling with proper load testing is more cost-effective than constant over-provisioning
Related Terms
MLOps
MLOps (Machine Learning Operations) applies DevOps principles to ML systems—combining engineering practices for model development, deployment, monitoring, and retraining into a disciplined operational lifecycle.
Inference Server
An inference server is specialized software that hosts ML models and handles prediction requests with optimized batching, hardware utilization, and concurrency—outperforming generic web frameworks for AI workloads.
Model Deployment
Model deployment is the process of moving a trained ML model from development into a production environment where it can serve real users—encompassing packaging, testing, infrastructure provisioning, and release management.
Batch Inference
Batch inference is the processing of large groups of input data through a machine learning model in a single scheduled job, rather than in real time, enabling high throughput at lower cost for use cases that do not require immediate responses.
Online Inference
Online inference (also called real-time inference) is the processing of individual or small groups of model inputs immediately upon arrival, returning results within milliseconds to seconds to support interactive applications like chatbots, search, and recommendations.