Model Quantization
Definition
Model quantization compresses LLM weights by representing them with fewer bits. A 70B-parameter model stored in float32 requires ~280GB; in float16 or bfloat16, ~140GB; in int8, ~70GB; in 4-bit (int4), ~35GB. Lower-bit quantization enables running larger models on smaller hardware: a 70B model in int4 fits on a single 40GB GPU that could not load the float16 version. Modern techniques such as GPTQ (post-training quantization using calibration data), AWQ (activation-aware weight quantization), and the GGUF format (used by llama.cpp) achieve int4 quantization with typically 2-5% quality degradation relative to float16 on standard benchmarks, a highly favorable cost-quality tradeoff.
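The sizes above follow directly from bytes-per-parameter arithmetic. A minimal sketch (using decimal gigabytes, 1 GB = 1e9 bytes; these are weight storage only, excluding activations and KV cache):

```python
# Approximate weight-storage size for a model at different precisions.
BYTES_PER_PARAM = {
    "float32": 4.0,
    "float16": 2.0,   # bfloat16 is also 2 bytes per parameter
    "int8": 1.0,
    "int4": 0.5,      # two 4-bit weights packed per byte
}

def weights_gb(num_params: float, dtype: str) -> float:
    """Weight memory in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

for dtype in BYTES_PER_PARAM:
    print(f"70B in {dtype}: ~{weights_gb(70e9, dtype):.0f} GB")
# float32 ~280 GB, float16 ~140 GB, int8 ~70 GB, int4 ~35 GB
```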
Why It Matters
Model quantization is the primary technique for making large LLMs economically feasible to self-host. A team that cannot afford 8 A100 80GB GPUs to run a 70B model in float16 can run the same model in int4 on a single A100. For 99helpers customers building cost-sensitive AI products, self-hosted quantized models can reduce per-query inference costs by 4-8x compared to frontier API pricing, while providing data privacy and latency advantages from local deployment. Quantization also accelerates inference—integer arithmetic is faster than floating-point on many hardware backends, improving throughput.
How It Works
Post-training quantization (PTQ) converts a trained model's weights without retraining. GPTQ uses a small calibration dataset to minimize quantization error layer by layer: for each weight matrix, it finds an integer representation that minimizes the difference in output activations. AWQ identifies the 1% of weights that are most important to accuracy (via activation magnitude analysis) and keeps those in higher precision while quantizing the rest to int4. GGUF format (used by llama.cpp) stores quantized models in a single file with mixed precision—different layer types use different bit widths. Quantization-aware training (QAT) is an alternative that fine-tunes the model to be robust to quantization, achieving better quality than PTQ at lower bit widths.
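Production methods like GPTQ minimize layer output error against calibration data, but the core quantize/dequantize step they build on can be illustrated with simple symmetric absmax round-to-nearest quantization. This is a toy per-tensor int8 sketch, not the GPTQ or AWQ algorithm:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor absmax quantization to int8."""
    scale = np.abs(w).max() / 127.0  # map [-max|w|, max|w|] onto [-127, 127]
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from integer codes."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(128, 128)).astype(np.float32)  # weight-like values
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print("max abs error:", np.abs(w - w_hat).max())  # rounding error is bounded by scale / 2
print("storage per weight: 4 bytes -> 1 byte")
```

Real quantizers improve on this by using per-channel or per-group scales and, as with GPTQ, by adjusting remaining weights to compensate for each rounding decision.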
Quantization Precision Levels (70B model)
- float32: 4 bytes per parameter, ~280GB
- float16/bfloat16: 2 bytes per parameter, ~140GB
- int8: 1 byte per parameter, ~70GB
- int4: 0.5 bytes per parameter, ~35GB
Real-World Example
A 99helpers developer wants to self-host Llama-3-70B for data privacy reasons. The float16 model requires 140GB VRAM—approximately $25,000/month in cloud GPU costs. Using GPTQ 4-bit quantization (via the AutoGPTQ library), they compress the model to ~35GB, fitting on a single A100 80GB GPU ($2.50/hour = ~$1,800/month). Benchmark testing shows the quantized model scores within 3% of the full-precision version on their support QA evaluation set. The 90% cost reduction makes self-hosting economically viable.
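The cost figures in this example are straightforward hourly-rate arithmetic (the rates are illustrative; actual cloud GPU pricing varies by provider and commitment):

```python
HOURS_PER_MONTH = 24 * 30  # ~720 hours, the approximation used in the example

def monthly_cost(hourly_rate: float, num_gpus: int = 1) -> float:
    """Monthly cost of running GPUs continuously at a given hourly rate."""
    return hourly_rate * num_gpus * HOURS_PER_MONTH

int4_cost = monthly_cost(2.50)  # single A100 80GB at $2.50/hour
fp16_cost = 25_000.0            # float16 deployment cost quoted above
savings = 1 - int4_cost / fp16_cost

print(f"int4: ${int4_cost:,.0f}/month")      # $1,800/month
print(f"savings vs float16: {savings:.0%}")  # roughly the 90% reduction cited above
```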
Common Mistakes
- ✕ Applying int4 quantization without evaluating quality on your specific task—quantization error varies by model and domain; always benchmark before deployment.
- ✕ Confusing model quantization with vector quantization—model quantization compresses model weights; vector quantization compresses stored embeddings in a vector database.
- ✕ Ignoring hardware compatibility—some quantization formats (GPTQ, AWQ) are only supported on CUDA GPUs; others (GGUF) support CPU inference.
Related Terms
LoRA (Low-Rank Adaptation)
LoRA is a parameter-efficient fine-tuning technique that injects small trainable low-rank matrices into LLM layers, updating less than 1% of parameters while achieving quality comparable to full fine-tuning.
Parameter-Efficient Fine-Tuning (PEFT)
PEFT encompasses techniques like LoRA, prefix tuning, and adapters that fine-tune only a small fraction of LLM parameters, achieving comparable quality to full fine-tuning at dramatically reduced compute and memory cost.
Model Distillation
Model distillation trains a smaller 'student' model to mimic a larger 'teacher' model's outputs, producing a compact model that approximates the teacher's capabilities at a fraction of the compute cost.
Model Compression
Model compression reduces LLM size through techniques like pruning (removing unimportant weights), quantization (reducing weight precision), and distillation (training smaller models), enabling deployment on resource-constrained hardware.
LLM Inference
LLM inference is the process of running a trained model to generate a response for a given input, encompassing the forward pass computation, token generation, and the infrastructure required to serve predictions at scale.