GPU Inference
Definition
GPU inference refers to running LLM forward passes on GPUs—specialized hardware with thousands of cores optimized for the matrix multiplications that dominate transformer computation. A GPU like the NVIDIA A100 (80GB) delivers ~312 TFLOPS of fp16 compute versus a high-end CPU's ~1 TFLOPS—roughly a 300x advantage—and, just as importantly, ~2 TB/s of memory bandwidth versus a few hundred GB/s for server DRAM. The result is dramatically faster token generation: CPU inference of a 7B model yields ~1-5 tokens/second, while a single A100 yields 50-150 tokens/second. Memory bandwidth is the critical figure because LLM inference at small batch sizes is memory-bandwidth-bound, not compute-bound. NVIDIA A100s (80GB, $2-3/hr cloud), H100s (80GB, $5-8/hr), and consumer GPUs (RTX 4090, 24GB) are the primary hardware for LLM inference.
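The bandwidth-bound claim can be sanity-checked with back-of-envelope arithmetic: at batch size 1, generating each token requires streaming every model weight through GPU memory once, so bandwidth divided by model size gives a hard ceiling on decode speed. A minimal sketch using the figures above (real throughput lands below the ceiling due to kernel and sampling overhead):

```python
def max_tokens_per_sec(bandwidth_bytes_per_s: float, model_bytes: float) -> float:
    """Upper bound on single-stream decode speed: every weight is read
    once per generated token, so tokens/s <= bandwidth / model size."""
    return bandwidth_bytes_per_s / model_bytes

# 7B model in fp16 (~14 GB) on an A100 (~2 TB/s memory bandwidth)
ceiling = max_tokens_per_sec(2.0e12, 14e9)
print(f"{ceiling:.0f} tokens/s ceiling")  # ~143, consistent with 50-150 t/s observed
```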
Why It Matters
GPU infrastructure is the primary cost driver for self-hosted LLM deployments. Understanding GPU requirements—VRAM capacity (must fit model weights + KV cache), memory bandwidth (determines tokens/second), and compute (matters for large batch sizes)—enables informed hardware selection. For 99helpers customers evaluating self-hosted versus API deployment, GPU inference costs are the key variable: a single A10G GPU instance at $0.75/hour can serve approximately 500K tokens/hour, making self-hosting economical at scale versus API pricing for high-volume use cases. Cloud GPU providers include NVIDIA (DGX Cloud), AWS (P4/P5 instances), Google Cloud (A2/A3 instances with A100/H100 GPUs), and GPU-specialized providers (Lambda Labs, CoreWeave, RunPod).
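The self-host-versus-API comparison reduces to cost per token at a given utilization. A sketch using the A10G figures above; the $5.00-per-million API price is a hypothetical placeholder, not a quoted rate—substitute your provider's actual pricing:

```python
def cost_per_million_tokens(hourly_rate: float, tokens_per_hour: float) -> float:
    """GPU cost per 1M tokens at a given sustained throughput."""
    return hourly_rate / tokens_per_hour * 1_000_000

# A10G at $0.75/hr serving ~500K tokens/hr (figures from the text)
self_hosted = cost_per_million_tokens(0.75, 500_000)  # $1.50 per 1M tokens
api_price = 5.00  # hypothetical API price per 1M tokens -- replace with real pricing

# Break-even: minimum tokens/hr at which the GPU beats the API rate
break_even = 0.75 / (api_price / 1_000_000)
print(f"self-hosted ${self_hosted:.2f}/M vs API ${api_price:.2f}/M; "
      f"break-even at {break_even:,.0f} tokens/hr")
```

Below the break-even volume the GPU sits partly idle and the API is cheaper; above it, self-hosting wins.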
How It Works
GPU memory requirements for inference: model weights (float16) + KV cache. For a 7B model: 14GB weights + ~2-8GB KV cache depending on context length and batch size. For a 70B model: 140GB weights (requires 2x A100 80GB). Throughput scales with memory bandwidth: the H100's 3.35 TB/s supports roughly 1.7x more tokens/second than the A100's 2 TB/s for the same model size. Serving frameworks vLLM and TGI (Text Generation Inference) maximize GPU utilization via continuous batching—grouping multiple concurrent requests into batches processed together on the GPU—achieving 2-10x throughput improvement over naive single-request serving. H100 NVLink clusters enable tensor parallelism for models too large for a single GPU.
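The VRAM arithmetic above can be sketched as a small estimator. The architecture constants below (32 layers, 32 KV heads, head dimension 128) are illustrative values for a Llama-2-7B-like model, not pulled from the text:

```python
def vram_estimate_gb(params_billions: float, n_layers: int, n_kv_heads: int,
                     head_dim: int, seq_len: int, batch: int,
                     weight_bytes: int = 2, kv_bytes: int = 2):
    """Return (weights_gb, kv_cache_gb). The KV cache holds one key and
    one value vector per layer per token: 2 * layers * kv_heads * head_dim * bytes."""
    weights_gb = params_billions * 1e9 * weight_bytes / 1e9
    kv_per_token = 2 * n_layers * n_kv_heads * head_dim * kv_bytes
    kv_gb = kv_per_token * seq_len * batch / 1e9
    return weights_gb, kv_gb

# 7B model in fp16, 4096-token context, batch size 1
w, kv = vram_estimate_gb(7, 32, 32, 128, seq_len=4096, batch=1)
print(f"weights {w:.0f} GB + KV cache {kv:.1f} GB")  # 14 GB + ~2.1 GB
```

Raising the batch size to 4 at the same context length pushes the KV cache to ~8.6GB, which is where the "~2-8GB" range in the text comes from.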
[Diagram: GPU Inference — Request Queue to Response]
Real-World Example
A 99helpers team benchmarks GPU options for self-hosting Claude 3.5 Haiku-equivalent quality (targeting Llama-3-70B). Options: (1) 2x A100 80GB on Lambda Labs ($2.20/hr): 140GB of fp16 weights fit, 25 t/s, $0.000024/token at 100% utilization; (2) 1x H100 80GB ($4.00/hr): requires 4-bit quantization (~35GB), 60 t/s, $0.000019/token; (3) 4x A10G 24GB ($3.00/hr): 96GB of total VRAM forces 8-bit quantization (~70GB), with the model sharded across all four GPUs, 15 t/s, $0.000056/token. The H100 with 4-bit quantization offers the best tokens-per-dollar despite the higher hourly rate, because its superior memory bandwidth drives higher throughput.
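The per-token figures in the benchmark follow directly from hourly rate divided by throughput. A quick reproduction of the three options (rates and throughputs from the text, assuming 100% utilization):

```python
options = {
    "2x A100 80GB":   (2.20, 25),  # ($/hr, tokens/s)
    "1x H100 (4-bit)": (4.00, 60),
    "4x A10G 24GB":   (3.00, 15),
}

costs = {}
for name, (rate, tps) in options.items():
    costs[name] = rate / (tps * 3600)  # $/token at full utilization
    print(f"{name}: ${costs[name]:.6f}/token")
# Reproduces the $0.000024 / $0.000019 / $0.000056 figures in the benchmark
```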
Common Mistakes
- ✕Selecting GPU tier based only on VRAM without checking memory bandwidth—two GPUs with the same VRAM can have 2x different inference throughput due to bandwidth differences.
- ✕Running one request at a time on expensive GPUs—serving multiple concurrent requests with continuous batching is essential for cost-effective GPU utilization.
- ✕Ignoring PCIe vs NVLink for multi-GPU setups—NVLink enables much faster inter-GPU communication for tensor parallelism; PCIe-connected GPUs have much lower cross-GPU bandwidth.
Related Terms
LLM Inference
LLM inference is the process of running a trained model to generate a response for a given input, encompassing the forward pass computation, token generation, and the infrastructure required to serve predictions at scale.
Model Quantization
Model quantization reduces the numerical precision of LLM weights from 32-bit or 16-bit floats to 8-bit or 4-bit integers, dramatically reducing memory requirements and inference costs with minimal quality loss.
KV Cache
The KV cache stores the key and value attention tensors computed during the prefill phase, allowing subsequent token generation to reuse these computations rather than recomputing them for every new token.
Speculative Decoding
Speculative decoding uses a small 'draft' model to generate multiple candidate tokens quickly, then verifies them in parallel with the large target model, achieving 2-3x inference speedup without changing output quality.
Open-Source LLM
An open-source LLM is a language model with publicly available weights that anyone can download, run locally, fine-tune, and deploy without per-query licensing fees, enabling private deployment and customization.
Ready to build your AI chatbot?
Put these concepts into practice with 99helpers — no code required.
Start free trial →