Model Serving

Definition

Model serving encompasses the systems, software, and infrastructure that make trained models available for inference in production. A serving system receives prediction requests (via REST, gRPC, or message queue), preprocesses inputs, runs the model forward pass, post-processes outputs, and returns predictions—all within latency SLA requirements. Serving frameworks include TorchServe, TensorFlow Serving, Triton Inference Server, and Ray Serve. For LLMs specifically, serving systems like vLLM, TGI (Text Generation Inference), and Ollama handle the unique requirements of autoregressive generation: KV cache management, continuous batching, and streaming responses.

Why It Matters

Model serving infrastructure determines whether an AI model can actually function as a product component. A model that achieves excellent accuracy in offline evaluation is worthless if it cannot respond within 200ms at peak load, handle 10x traffic spikes without failures, or serve 1,000 concurrent users reliably. Serving infrastructure decisions—hardware choice (CPU vs GPU), batching strategy, caching, auto-scaling—directly determine latency, throughput, cost per request, and availability. For LLM applications, serving efficiency can reduce inference costs by 3-10x through techniques like continuous batching.

How It Works

A model serving system handles the full inference path: (1) request routing—load balancer distributes requests across model replicas; (2) input validation and preprocessing—transform raw API input to model-ready tensors; (3) batching—group concurrent requests for GPU efficiency; (4) model execution—GPU forward pass; (5) output post-processing—decode model outputs to API response format; (6) caching—return cached predictions for repeated inputs. Advanced LLM serving uses continuous batching (dynamically grouping requests mid-generation) and speculative decoding (using a small draft model to accelerate the large model) for dramatically higher throughput.
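The serving path above (steps 2 through 6) can be sketched in miniature. This is an illustrative, framework-free example, not a real serving stack: `fake_model`, `Server`, and the scoring logic are all stand-ins invented for the sketch.

```python
import hashlib
import json

def fake_model(batch):
    # Placeholder "forward pass": scores each preprocessed input by length.
    return [len(x) % 10 for x in batch]

class Server:
    def __init__(self, max_batch=8):
        self.max_batch = max_batch
        self.cache = {}  # step 6: prediction cache keyed by input hash

    def _key(self, raw):
        return hashlib.sha256(json.dumps(raw, sort_keys=True).encode()).hexdigest()

    def predict(self, requests):
        results = [None] * len(requests)
        pending = []
        for i, raw in enumerate(requests):
            # Step 2: input validation and preprocessing.
            if not isinstance(raw, str):
                raise ValueError("expected string input")
            key = self._key(raw)
            if key in self.cache:
                results[i] = self.cache[key]  # step 6: cache hit, skip the model
            else:
                pending.append((i, key, raw.strip().lower()))
        # Step 3: group uncached requests into batches for efficiency.
        for start in range(0, len(pending), self.max_batch):
            chunk = pending[start:start + self.max_batch]
            # Step 4: one model execution for the whole batch.
            outputs = fake_model([x for _, _, x in chunk])
            for (i, key, _), out in zip(chunk, outputs):
                # Step 5: post-process raw output into the API response format.
                resp = {"score": out}
                self.cache[key] = resp
                results[i] = resp
        return results
```

In a production system, step 1 (request routing) sits in front of many such replicas, and the batching is asynchronous rather than per-call, but the ordering of concerns is the same.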

Model Serving Patterns

  Pattern         | Typical Use Case           | Characteristics
  REST API        | Single queries, chatbots   | Synchronous request/response
  gRPC            | Internal microservices     | Binary protocol, low latency
  Streaming SSE   | Real-time text generation  | Server-sent events, token-by-token
  Batch Endpoint  | Bulk document processing   | Async, high-throughput jobs
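The Streaming SSE pattern is worth seeing concretely: each generated token is framed as a server-sent event so the client can render text as it arrives. This is a minimal sketch; `generate_tokens` is a placeholder for real autoregressive decoding, and the `[DONE]` sentinel is a convention used by OpenAI-style APIs, not part of the SSE standard itself.

```python
def generate_tokens(prompt):
    # Stand-in for autoregressive decoding, which would yield one
    # token at a time as the model produces it.
    for tok in ("The", " answer", " is", " 42."):
        yield tok

def sse_stream(prompt):
    # Wrap each token in the SSE wire format: "data: <payload>\n\n".
    for tok in generate_tokens(prompt):
        yield f"data: {tok}\n\n"
    yield "data: [DONE]\n\n"  # end-of-stream sentinel (OpenAI-style convention)
```

A web framework would write each yielded chunk to the response with `Content-Type: text/event-stream`, giving the client token-by-token delivery instead of one large response at the end.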

Real-World Example

A legal tech company deployed a contract analysis LLM using a naive single-request serving setup. At peak load (9 AM, when lawyers start reviewing overnight contracts), average latency spiked to 45 seconds per contract and 30% of requests timed out. After migrating to vLLM with continuous batching, GPU utilization increased from 35% to 92%, average latency dropped to 8 seconds, and the system handled 10x peak load with no timeouts—on the same GPU hardware. The continuous batching optimization eliminated the latency spikes by keeping the GPU saturated rather than idling between requests.
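The intuition behind that improvement can be shown with a toy step-count model. This simulation is illustrative only (it is not how vLLM is implemented, and the numbers are made up): sequential serving decodes one request at a time, while continuous batching decodes one token per step for every active request and refills a slot the moment a request finishes.

```python
def sequential_steps(lengths):
    # One request at a time: total decode steps is just the sum
    # of per-request generation lengths.
    return sum(lengths)

def continuous_batch_steps(lengths, max_batch):
    # Each step decodes one token for every active request; a finished
    # request frees its batch slot immediately for the next waiting one.
    waiting = sorted(lengths, reverse=True)
    active, steps = [], 0
    while waiting or active:
        while waiting and len(active) < max_batch:
            active.append(waiting.pop())
        steps += 1
        active = [r - 1 for r in active if r > 1]
    return steps
```

With eight requests of 100 tokens each and a batch size of eight, the sequential server needs 800 decode steps while the continuously batched server needs 100: the same hardware does 8x the work per unit time because the GPU is never idle between requests.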

Common Mistakes

  • Conflating model training infrastructure with serving infrastructure—they have different optimization targets and often require different hardware
  • Not testing serving infrastructure under realistic load before launch—latency and throughput look fine with one request; they collapse under concurrent load
  • Over-provisioning serving infrastructure to avoid performance issues—auto-scaling with proper load testing is more cost-effective than constant over-provisioning
