LLM Inference
Definition
Inference is the deployment phase of the LLM lifecycle: the process of using a trained model to produce outputs from new inputs. Unlike training, which updates model weights, inference runs the fixed weights forward to generate predictions. For LLMs, inference is autoregressive: the model generates one token at a time, each token requires a full forward pass through all model layers, and each generated token is appended to the input for the next step. Generating a 200-token response therefore requires 200 sequential forward passes, making LLM inference inherently sequential and latency-sensitive. Inference infrastructure includes the serving hardware (GPUs), the serving framework (vLLM, TGI, TensorRT-LLM), and the scaling layer (load balancing, batching).
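The autoregressive loop described above can be sketched in a few lines. This is a minimal illustration, not a real model: `next_token` is a toy placeholder standing in for a full forward pass, and the token values are arbitrary integers.

```python
def next_token(tokens):
    # Toy stand-in for one full forward pass; a real LLM would run
    # every transformer layer over the whole sequence to predict here.
    return tokens[-1] + 1

def generate(prompt_tokens, n_new):
    tokens = list(prompt_tokens)
    for _ in range(n_new):                 # one sequential forward pass per token
        tokens.append(next_token(tokens))  # output becomes input for the next step
    return tokens

print(generate([1, 2, 3], 5))  # 5 new tokens -> 5 sequential passes
```

The key point the sketch captures: the loop cannot be parallelized, because each call to `next_token` needs the token produced by the previous call.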
Why It Matters
Inference is where LLM capability meets production reality. A model that achieves great benchmark scores but requires 5 seconds to generate a response creates a poor user experience. Inference optimization—batching, quantization, KV caching, speculative decoding—directly impacts the cost per query and response latency that users experience. For 99helpers, inference performance determines whether the chatbot feels instantaneous or sluggish. Understanding inference helps product teams make informed tradeoffs: using streaming to show tokens as they generate (improving perceived latency), choosing the right model size for the required quality/speed balance, and architecting async versus synchronous response flows.
How It Works
LLM inference proceeds in two phases: prefill (processing the entire prompt in parallel—fast because transformer attention is parallelizable over the input) and decode (generating output tokens one by one—slow because each new token depends on all previous ones). The prefill phase processes thousands of prompt tokens in milliseconds; the decode phase generates tokens at a rate of 10-100 tokens/second depending on model size and hardware. Key metrics: time-to-first-token (TTFT, latency from request to first output token), tokens per second (throughput), and cost per 1K tokens. Modern serving frameworks like vLLM use continuous batching and PagedAttention to maximize GPU utilization by intelligently sharing memory across concurrent requests.
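The metrics above compose into simple back-of-envelope formulas. The function and parameter names below are illustrative, not from any particular serving framework.

```python
def latency_ms(ttft_ms, output_tokens, decode_tok_per_s):
    # Total latency = prefill time (which determines TTFT) plus the time
    # to decode every output token sequentially.
    return ttft_ms + output_tokens / decode_tok_per_s * 1000

def cost_per_1k_tokens(gpu_cost_per_hour, throughput_tok_per_s):
    # Serving cost per token scales inversely with sustained throughput.
    tokens_per_hour = throughput_tok_per_s * 3600
    return gpu_cost_per_hour / tokens_per_hour * 1000

print(round(latency_ms(80, 150, 45)))  # 3413 (ms) for a 150-token response
```

Note the asymmetry these formulas encode: TTFT is dominated by prefill (one parallel pass over the prompt), while total latency is dominated by decode (one pass per output token).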
LLM Inference — Prefill + Autoregressive Decode Pipeline

Pipeline stages (150-token response, Llama-3-8B on an A10G GPU):
- Tokenize: 2 ms
- Prefill (prompt): 80 ms
- Decode tokens 1 → 150: ~22 ms each

Key numbers:
- Time-to-first-token (TTFT): 80 ms (the prefill duration)
- Decode speed: 45 tok/s on an A10G GPU
- Total latency: ~3.4 s for a 150-token response

Streaming: send tokens to the UI as they are decoded, so the user sees the first token in 80 ms. Total generation time is unchanged, but perceived latency drops dramatically.
Real-World Example
A chatbot built on the 99helpers platform runs self-hosted Llama-3-8B on an A10G GPU for customer queries. Average response length: 150 tokens. Prefill time: 80 ms. Decode rate: 45 tokens/second. Total latency: 80 ms + (150/45)*1000 ms = 80 ms + 3,333 ms ≈ 3.4 seconds. This feels slow. After deploying vLLM with continuous batching and enabling streaming (sending tokens to the UI as they are generated), the user sees the first token in 80 ms and reads the response as it streams at 45 tokens/second; the perceived experience improves dramatically even though total generation time is unchanged.
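The streaming pattern from the example can be sketched with a Python generator. `stream_tokens` is a hypothetical helper that simulates decode timing with `time.sleep`; a real deployment would receive tokens from the serving framework's streaming API instead.

```python
import time

def stream_tokens(token_source, decode_tok_per_s=45):
    # Yield each token as soon as it is "decoded" instead of waiting for
    # the full response, so the UI can render output incrementally.
    for tok in token_source:
        time.sleep(1 / decode_tok_per_s)  # stand-in for one decode step
        yield tok

# The consumer (e.g. a chat UI) prints tokens as they arrive.
for piece in stream_tokens(["Stream", "ing", " demo"]):
    print(piece, end="", flush=True)
print()
```

Nothing about generation gets faster here; the win is that the first token reaches the user after one decode step rather than after all of them.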
Common Mistakes
- ✕Conflating inference latency with generation quality—a faster inference setup doesn't change model quality; it only affects how quickly quality outputs are delivered.
- ✕Running inference on CPU for production workloads—CPU inference is 10-100x slower than GPU for most LLMs; GPU is required for acceptable latency.
- ✕Ignoring batching for high-throughput deployments—serving one request at a time wastes GPU utilization; batching multiple requests dramatically improves throughput.
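The batching point in the last bullet can be quantified with a simplified model. This sketch ignores memory-bandwidth limits and scheduling overhead, and the step times are illustrative.

```python
def aggregate_throughput(batch_size, decode_step_ms):
    # Each decode step advances every request in the batch by one token,
    # so aggregate tokens/s grows roughly with batch size until the GPU
    # saturates (in practice, steps slow down slightly as batches grow).
    steps_per_second = 1000 / decode_step_ms
    return batch_size * steps_per_second

print(aggregate_throughput(1, 22))   # single request: ~45 tok/s
print(aggregate_throughput(16, 25))  # batched: slightly slower steps, far more tokens/s
```

This is why continuous batching in frameworks like vLLM matters: it keeps the batch full by admitting new requests as old ones finish, instead of waiting for a whole batch to complete.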
Related Terms
LLM API
An LLM API is a cloud service interface that provides programmatic access to large language models, allowing developers to send prompts and receive completions without managing model infrastructure.
KV Cache
The KV cache stores the key and value attention tensors computed during the prefill phase, allowing subsequent token generation to reuse these computations rather than recomputing them for every new token.
Speculative Decoding
Speculative decoding uses a small 'draft' model to generate multiple candidate tokens quickly, then verifies them in parallel with the large target model, achieving 2-3x inference speedup without changing output quality.
Model Quantization
Model quantization reduces the numerical precision of LLM weights from 32-bit or 16-bit floats to 8-bit or 4-bit integers, dramatically reducing memory requirements and inference costs with minimal quality loss.
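The memory savings from quantization follow directly from the bit width. A rough sketch, counting only the weights (activations, KV cache, and framework overhead add more):

```python
def weight_memory_gb(params_billions, bits_per_weight):
    # Weight footprint = parameter count x bits per weight / 8 bits per byte.
    return params_billions * bits_per_weight / 8

print(weight_memory_gb(8, 16))  # 16-bit 8B model: 16.0 GB
print(weight_memory_gb(8, 4))   # 4-bit quantized:  4.0 GB
```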
GPU Inference
GPU inference is the use of graphics processing units to run LLM predictions, leveraging their massive parallel compute capabilities to achieve the high throughput and low latency required for production AI applications.
Ready to build your AI chatbot?
Put these concepts into practice with 99helpers — no code required.
Start free trial →