LLM Inference
Definition
Inference is the deployment phase of the LLM lifecycle: the process of using a trained model to produce outputs from new inputs. Unlike training, which updates model weights, inference runs the fixed weights forward to generate predictions. For LLMs, inference is autoregressive: the model generates one token at a time, each token requires a full forward pass through all model layers, and each generated token is appended to the input for the next step. Generating a 200-token response therefore requires 200 sequential forward passes, making LLM inference inherently sequential and latency-sensitive. Inference infrastructure includes the serving hardware (GPUs), the serving framework (vLLM, TGI, TensorRT-LLM), and the scaling layer (load balancing, batching).
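The autoregressive loop described above can be sketched in a few lines. This is a minimal illustration, not a real model: `next_token` is a toy placeholder standing in for a full forward pass, and the token values are arbitrary integers.

```python
def next_token(tokens):
    # Toy stand-in for one full forward pass; a real LLM would run
    # every transformer layer over the whole sequence to predict here.
    return tokens[-1] + 1

def generate(prompt_tokens, n_new):
    tokens = list(prompt_tokens)
    for _ in range(n_new):                 # one sequential forward pass per token
        tokens.append(next_token(tokens))  # output becomes input for the next step
    return tokens

print(generate([1, 2, 3], 5))  # 5 new tokens -> 5 sequential passes
```

The key point the sketch captures: the loop cannot be parallelized, because each call to `next_token` needs the token produced by the previous call.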
Why It Matters
Inference is where LLM capability meets production reality. A model that achieves great benchmark scores but requires 5 seconds to generate a response creates a poor user experience. Inference optimization—batching, quantization, KV caching, speculative decoding—directly impacts the cost per query and response latency that users experience. For 99helpers, inference performance determines whether the chatbot feels instantaneous or sluggish. Understanding inference helps product teams make informed tradeoffs: using streaming to show tokens as they generate (improving perceived latency), choosing the right model size for the required quality/speed balance, and architecting async versus synchronous response flows.
How It Works
LLM inference proceeds in two phases: prefill (processing the entire prompt in parallel—fast because transformer attention is parallelizable over the input) and decode (generating output tokens one by one—slow because each new token depends on all previous ones). The prefill phase processes thousands of prompt tokens in milliseconds; the decode phase generates tokens at a rate of 10-100 tokens/second depending on model size and hardware. Key metrics: time-to-first-token (TTFT, latency from request to first output token), tokens per second (throughput), and cost per 1K tokens. Modern serving frameworks like vLLM use continuous batching and PagedAttention to maximize GPU utilization by intelligently sharing memory across concurrent requests.
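The metrics above compose into simple back-of-envelope formulas. The function and parameter names below are illustrative, not from any particular serving framework.

```python
def latency_ms(ttft_ms, output_tokens, decode_tok_per_s):
    # Total latency = prefill time (which determines TTFT) plus the time
    # to decode every output token sequentially.
    return ttft_ms + output_tokens / decode_tok_per_s * 1000

def cost_per_1k_tokens(gpu_cost_per_hour, throughput_tok_per_s):
    # Serving cost per token scales inversely with sustained throughput.
    tokens_per_hour = throughput_tok_per_s * 3600
    return gpu_cost_per_hour / tokens_per_hour * 1000

print(round(latency_ms(80, 150, 45)))  # 3413 (ms) for a 150-token response
```

Note the asymmetry these formulas encode: TTFT is dominated by prefill (one parallel pass over the prompt), while total latency is dominated by decode (one pass per output token).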
LLM Inference — Prefill + Autoregressive Decode Pipeline

Pipeline stages (150-token response, Llama-3-8B on an A10G GPU):
- Tokenize: 2 ms
- Prefill (prompt): 80 ms
- Decode tokens 1 → 150: ~22 ms each

Key numbers:
- Time-to-first-token (TTFT): 80 ms (the prefill duration)
- Decode speed: 45 tok/s on an A10G GPU
- Total latency: ~3.4 s for a 150-token response

Streaming: send tokens to the UI as they are decoded, so the user sees the first token in 80 ms. Total generation time is unchanged, but perceived latency drops dramatically.
Real-World Example
A chatbot built on the 99helpers platform runs self-hosted Llama-3-8B on an A10G GPU for customer queries. Average response length: 150 tokens. Prefill time: 80 ms. Decode rate: 45 tokens/second. Total latency: 80 ms + (150/45)*1000 ms = 80 ms + 3,333 ms ≈ 3.4 seconds. This feels slow. After deploying vLLM with continuous batching and enabling streaming (sending tokens to the UI as they are generated), the user sees the first token in 80 ms and reads the response as it streams at 45 tokens/second; the perceived experience improves dramatically even though total generation time is unchanged.
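The streaming pattern from the example can be sketched with a Python generator. `stream_tokens` is a hypothetical helper that simulates decode timing with `time.sleep`; a real deployment would receive tokens from the serving framework's streaming API instead.

```python
import time

def stream_tokens(token_source, decode_tok_per_s=45):
    # Yield each token as soon as it is "decoded" instead of waiting for
    # the full response, so the UI can render output incrementally.
    for tok in token_source:
        time.sleep(1 / decode_tok_per_s)  # stand-in for one decode step
        yield tok

# The consumer (e.g. a chat UI) prints tokens as they arrive.
for piece in stream_tokens(["Stream", "ing", " demo"]):
    print(piece, end="", flush=True)
print()
```

Nothing about generation gets faster here; the win is that the first token reaches the user after one decode step rather than after all of them.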
Common Mistakes
- ✕Conflating inference latency with generation quality—a faster inference setup doesn't change model quality; it only affects how quickly quality outputs are delivered.
- ✕Running inference on CPU for production workloads—CPU inference is 10-100x slower than GPU for most LLMs; GPU is required for acceptable latency.
- ✕Ignoring batching for high-throughput deployments—serving one request at a time wastes GPU utilization; batching multiple requests dramatically improves throughput.
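The batching point in the last bullet can be quantified with a simplified model. This sketch ignores memory-bandwidth limits and scheduling overhead, and the step times are illustrative.

```python
def aggregate_throughput(batch_size, decode_step_ms):
    # Each decode step advances every request in the batch by one token,
    # so aggregate tokens/s grows roughly with batch size until the GPU
    # saturates (in practice, steps slow down slightly as batches grow).
    steps_per_second = 1000 / decode_step_ms
    return batch_size * steps_per_second

print(aggregate_throughput(1, 22))   # single request: ~45 tok/s
print(aggregate_throughput(16, 25))  # batched: slightly slower steps, far more tokens/s
```

This is why continuous batching in frameworks like vLLM matters: it keeps the batch full by admitting new requests as old ones finish, instead of waiting for a whole batch to complete.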
Related Terms
LLM API
An LLM API is a cloud service interface that provides programmatic access to large language models, allowing developers to send prompts and receive completions without managing model infrastructure.
KV Cache
The KV cache stores the key and value attention tensors computed during the prefill phase, allowing subsequent token generation to reuse these computations rather than recomputing them for every new token.
Speculative Decoding
Speculative decoding uses a small 'draft' model to generate multiple candidate tokens quickly, then verifies them in parallel with the large target model, achieving 2-3x inference speedup without changing output quality.
Model Quantization
Model quantization reduces the numerical precision of LLM weights from 32-bit or 16-bit floats to 8-bit or 4-bit integers, dramatically reducing memory requirements and inference costs with minimal quality loss.
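The memory savings from quantization follow directly from the bit width. A rough sketch, counting only the weights (activations, KV cache, and framework overhead add more):

```python
def weight_memory_gb(params_billions, bits_per_weight):
    # Weight footprint = parameter count x bits per weight / 8 bits per byte.
    return params_billions * bits_per_weight / 8

print(weight_memory_gb(8, 16))  # 16-bit 8B model: 16.0 GB
print(weight_memory_gb(8, 4))   # 4-bit quantized:  4.0 GB
```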
GPU Inference
GPU inference is the use of graphics processing units to run LLM predictions, leveraging their massive parallel compute capabilities to achieve the high throughput and low latency required for production AI applications.
Ready to build your AI chatbot?
Put these concepts into practice with 99helpers — no code required.
Start free trial →