Large Language Models (LLMs)

GPU Inference

Definition

GPU inference refers to running LLM forward passes on GPUs—specialized hardware with thousands of cores optimized for the matrix multiplications that dominate transformer computation. A GPU like the NVIDIA A100 (80GB) performs ~312 TFLOPS of fp16 operations versus a high-end CPU's ~1 TFLOPS—a 300x compute advantage for LLM workloads. This translates directly to token generation speed: CPU inference of a 7B model generates ~1-5 tokens/second; GPU inference generates 50-150 tokens/second on a single A100. Memory bandwidth is equally critical—LLM inference is memory-bandwidth-bound, not compute-bound, for small batch sizes. NVIDIA A100s (80GB, $2-3/hr cloud), H100s (80GB, $5-8/hr), and consumer GPUs (RTX 4090, 24GB) are the primary hardware for LLM inference.
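The bandwidth-bound nature of small-batch decoding mentioned above can be sketched with a simple roofline estimate: each generated token must stream the full model weights from GPU memory, so tokens/second is capped at bandwidth divided by weight size. This is a rough upper-bound sketch using illustrative spec-sheet numbers, not measured benchmarks:

```python
# Rough decode-speed ceiling for small batches, where inference is
# memory-bandwidth-bound: every generated token streams all model
# weights from memory, so tokens/s <= bandwidth / weight bytes.

def decode_tokens_per_sec(params_billion: float, bytes_per_param: float,
                          bandwidth_tb_s: float) -> float:
    weight_bytes = params_billion * 1e9 * bytes_per_param
    return bandwidth_tb_s * 1e12 / weight_bytes

# 7B model in fp16 (2 bytes/param) on an A100 (~2 TB/s)
print(round(decode_tokens_per_sec(7, 2, 2.0)))   # ~143 tok/s ceiling
# Same model on a CPU with ~0.1 TB/s memory bandwidth (assumed)
print(round(decode_tokens_per_sec(7, 2, 0.1)))   # ~7 tok/s ceiling
```

Real serving overhead puts observed speeds below this ceiling, which is consistent with the 50-150 tok/s GPU and 1-5 tok/s CPU figures above.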

Why It Matters

GPU infrastructure is the primary cost driver for self-hosted LLM deployments. Understanding GPU requirements—VRAM capacity (must fit model weights + KV cache), memory bandwidth (determines tokens/second), and compute (matters for large batch sizes)—enables informed hardware selection. For 99helpers customers evaluating self-hosted versus API deployment, GPU inference costs are the key variable: a single A10G GPU instance at $0.75/hour can serve approximately 500K tokens/hour, making self-hosting economical at scale versus API pricing for high-volume use cases. Cloud GPU providers include NVIDIA (DGX Cloud), AWS (P4/P5 instances), Google Cloud (A2/A3 instances with A100/H100 GPUs), and GPU-specialized providers (Lambda Labs, CoreWeave, RunPod).
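The self-hosting versus API comparison above reduces to simple arithmetic: hourly GPU cost divided by tokens served per hour gives a cost per million tokens to weigh against API rates. A minimal sketch using the A10G figures from the text; the API price and monthly volume are assumed placeholders for illustration:

```python
# Break-even sketch: self-hosted cost per million tokens on one A10G
# ($0.75/hr, ~500K tok/hr, per the text) vs an assumed API rate.

def cost_per_million(hourly_usd: float, tokens_per_hour: float) -> float:
    return hourly_usd / tokens_per_hour * 1_000_000

self_hosted = cost_per_million(0.75, 500_000)
print(f"self-hosted: ${self_hosted:.2f}/M tokens")  # $1.50/M tokens

api_price_per_million = 4.00        # assumed API rate for comparison
monthly_tokens = 2_000_000_000      # assumed 2B tokens/month workload
savings = (api_price_per_million - self_hosted) * monthly_tokens / 1e6
print(f"monthly savings at 2B tokens: ${savings:,.0f}")
```

Note this assumes high sustained utilization; an idle self-hosted GPU still bills hourly, while API pricing only charges for tokens used.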

How It Works

GPU memory requirements for inference: model weights (float16) + KV cache. For a 7B model: 14GB weights + ~2-8GB KV cache depending on context length and batch size. For a 70B model: 140GB weights (requires 2x A100 80GB). Throughput depends on memory bandwidth: H100's 3.35 TB/s bandwidth supports roughly 1.7x more tokens/second than A100's ~2 TB/s for the same model size. Serving frameworks vLLM and TGI (Text Generation Inference) maximize GPU utilization via continuous batching—grouping multiple concurrent requests into batches processed together on the GPU—achieving 2-10x throughput improvement over naive single-request serving. H100 NVLink clusters enable tensor parallelism for models too large for a single GPU.
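The weights-plus-KV-cache budget above can be estimated directly. A minimal sketch assuming a Llama-2-7B-style architecture (32 layers, 32 KV heads, head dimension 128, fp16); GQA models store far fewer KV heads, so adjust accordingly:

```python
# VRAM estimate: fp16 weights + KV cache. Architecture constants below
# are assumed (Llama-2-7B-like); 2x accounts for both K and V tensors.

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                seq_len: int, batch: int, bytes_per: int = 2) -> float:
    return 2 * layers * kv_heads * head_dim * bytes_per * seq_len * batch / 1e9

def weights_gb(params_billion: float, bytes_per: int = 2) -> float:
    return params_billion * bytes_per  # (1e9 params * bytes) / 1e9

w = weights_gb(7)                                        # 14.0 GB
kv = kv_cache_gb(32, 32, 128, seq_len=4096, batch=4)     # ~8.6 GB
print(f"weights {w:.1f} GB + KV cache {kv:.1f} GB")
```

At 4K context and batch size 4 the KV cache alone approaches the high end of the ~2-8GB range quoted above, which is why long contexts and large batches can exhaust VRAM even when the weights fit comfortably.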

GPU Inference — Request Queue to Response

[Diagram: incoming requests (Req #1 at 0ms through Req #4, queued) flow into a three-GPU cluster and stream back to the client as response tokens.]

  • GPU 0: A100 80GB, 1,800 tok/s, 87% utilization
  • GPU 1: A100 80GB, 1,750 tok/s, 82% utilization
  • GPU 2: H100 80GB, 2,400 tok/s, 71% utilization
  • Latency (TTFT): 48ms time to first token
  • Throughput: 5,950 tok/s across 3 GPUs
  • Batch size: 32 concurrent requests
  • GPU utilization: 80% average across the cluster

Real-World Example

A 99helpers team benchmarks GPU options for self-hosting Claude 3.5 Haiku-equivalent quality (targeting Llama-3-70B). Options: (1) 2x A100 80GB on Lambda Labs ($2.20/hr): 140GB weights fit, 25 t/s, $0.000024/token at 100% utilization; (2) 1x H100 80GB ($4.00/hr): requires quantization to 4-bit (35GB), 60 t/s, $0.000019/token; (3) 4x A10G 24GB ($3.00/hr): model sharded across 4 GPUs, 15 t/s, $0.000056/token. The H100 with 4-bit quantization offers the best tokens-per-dollar despite higher hourly cost, due to its superior memory bandwidth driving higher throughput.
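The benchmark's cost-per-token figures follow from hourly price divided by tokens generated per hour. A sketch reproducing the arithmetic with the numbers quoted in the example, assuming 100% utilization as stated:

```python
# Cost-per-token comparison from the benchmark: $/token = hourly cost
# divided by (tokens/sec * 3600), at 100% utilization.

options = {
    "2x A100 80GB":          (2.20, 25),  # ($/hr, tokens/sec)
    "1x H100 80GB (4-bit)":  (4.00, 60),
    "4x A10G 24GB":          (3.00, 15),
}

for name, (hourly, tps) in options.items():
    per_token = hourly / (tps * 3600)
    print(f"{name}: ${per_token:.7f}/token")
```

Running this reproduces the quoted figures (~$0.000024, ~$0.000019, ~$0.000056 per token), confirming the H100's tokens-per-dollar advantage despite its higher hourly rate.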

Common Mistakes

  • Selecting GPU tier based only on VRAM without checking memory bandwidth—two GPUs with the same VRAM can have 2x different inference throughput due to bandwidth differences.
  • Running one request at a time on expensive GPUs—serving multiple concurrent requests with continuous batching is essential for cost-effective GPU utilization.
  • Ignoring PCIe vs NVLink for multi-GPU setups—NVLink enables much faster inter-GPU communication for tensor parallelism; PCIe-connected GPUs have much lower cross-GPU bandwidth.
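The second mistake above can be quantified with a toy model: on a bandwidth-bound GPU, a decode step streams the weights once regardless of batch size, so batched requests share that fixed cost. The constants below are illustrative assumptions, not measurements:

```python
# Toy model of continuous batching payoff: weight streaming is a fixed
# per-step cost shared by the batch; per-request overhead is assumed small.

WEIGHT_READ_S = 0.007   # ~time to stream 14GB fp16 weights at 2 TB/s
PER_REQ_S = 0.0002      # assumed per-request compute/KV-cache overhead

def batch_throughput(batch: int) -> float:
    step_time = WEIGHT_READ_S + PER_REQ_S * batch
    return batch / step_time  # aggregate tokens/sec across the batch

print(round(batch_throughput(1)))    # ~139 tok/s
print(round(batch_throughput(32)))   # ~2,388 tok/s, roughly 17x more
```

This is why serving one request at a time leaves most of an expensive GPU's throughput on the table: aggregate tokens/second scales nearly linearly with batch size until compute or KV-cache memory becomes the bottleneck.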

Ready to build your AI chatbot?

Put these concepts into practice with 99helpers — no code required.
