Model Quantization
Definition
Model quantization compresses LLM weights by representing them with fewer bits. A 70B-parameter model stored in float32 requires ~280GB; in float16 or bfloat16, ~140GB; in int8, ~70GB; in 4-bit (int4), ~35GB. Lower-bit quantization enables running larger models on smaller hardware: a 70B model in int4 fits on a single 40GB GPU that could not load the float16 version. Modern techniques such as GPTQ (post-training quantization using calibration data), AWQ (activation-aware weight quantization), and the GGUF format (used by llama.cpp) achieve int4 quantization with typically 2-5% quality degradation relative to float16 on standard benchmarks, a highly favorable cost-quality tradeoff.
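The sizes above follow directly from bytes-per-parameter arithmetic. A minimal sketch (using decimal gigabytes, 1 GB = 1e9 bytes; these are weight storage only, excluding activations and KV cache):

```python
# Approximate weight-storage size for a model at different precisions.
BYTES_PER_PARAM = {
    "float32": 4.0,
    "float16": 2.0,   # bfloat16 is also 2 bytes per parameter
    "int8": 1.0,
    "int4": 0.5,      # two 4-bit weights packed per byte
}

def weights_gb(num_params: float, dtype: str) -> float:
    """Weight memory in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

for dtype in BYTES_PER_PARAM:
    print(f"70B in {dtype}: ~{weights_gb(70e9, dtype):.0f} GB")
# float32 ~280 GB, float16 ~140 GB, int8 ~70 GB, int4 ~35 GB
```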
Why It Matters
Model quantization is the primary technique for making large LLMs economically feasible to self-host. A team that cannot afford 8 A100 80GB GPUs to run a 70B model in float16 can run the same model in int4 on a single A100. For 99helpers customers building cost-sensitive AI products, self-hosted quantized models can reduce per-query inference costs by 4-8x compared to frontier API pricing, while providing data privacy and latency advantages from local deployment. Quantization also accelerates inference—integer arithmetic is faster than floating-point on many hardware backends, improving throughput.
How It Works
Post-training quantization (PTQ) converts a trained model's weights without retraining. GPTQ uses a small calibration dataset to minimize quantization error layer by layer: for each weight matrix, it finds an integer representation that minimizes the difference in output activations. AWQ identifies the 1% of weights that are most important to accuracy (via activation magnitude analysis) and keeps those in higher precision while quantizing the rest to int4. GGUF format (used by llama.cpp) stores quantized models in a single file with mixed precision—different layer types use different bit widths. Quantization-aware training (QAT) is an alternative that fine-tunes the model to be robust to quantization, achieving better quality than PTQ at lower bit widths.
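Production methods like GPTQ minimize layer output error against calibration data, but the core quantize/dequantize step they build on can be illustrated with simple symmetric absmax round-to-nearest quantization. This is a toy per-tensor int8 sketch, not the GPTQ or AWQ algorithm:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor absmax quantization to int8."""
    scale = np.abs(w).max() / 127.0  # map [-max|w|, max|w|] onto [-127, 127]
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from integer codes."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(128, 128)).astype(np.float32)  # weight-like values
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print("max abs error:", np.abs(w - w_hat).max())  # rounding error is bounded by scale / 2
print("storage per weight: 4 bytes -> 1 byte")
```

Real quantizers improve on this by using per-channel or per-group scales and, as with GPTQ, by adjusting remaining weights to compensate for each rounding decision.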
Quantization Precision Levels (70B model)
- float32: 4 bytes per parameter, ~280GB
- float16/bfloat16: 2 bytes per parameter, ~140GB
- int8: 1 byte per parameter, ~70GB
- int4: 0.5 bytes per parameter, ~35GB
Real-World Example
A 99helpers developer wants to self-host Llama-3-70B for data privacy reasons. The float16 model requires 140GB VRAM—approximately $25,000/month in cloud GPU costs. Using GPTQ 4-bit quantization (via the AutoGPTQ library), they compress the model to ~35GB, fitting on a single A100 80GB GPU ($2.50/hour = ~$1,800/month). Benchmark testing shows the quantized model scores within 3% of the full-precision version on their support QA evaluation set. The 90% cost reduction makes self-hosting economically viable.
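The cost figures in this example are straightforward hourly-rate arithmetic (the rates are illustrative; actual cloud GPU pricing varies by provider and commitment):

```python
HOURS_PER_MONTH = 24 * 30  # ~720 hours, the approximation used in the example

def monthly_cost(hourly_rate: float, num_gpus: int = 1) -> float:
    """Monthly cost of running GPUs continuously at a given hourly rate."""
    return hourly_rate * num_gpus * HOURS_PER_MONTH

int4_cost = monthly_cost(2.50)  # single A100 80GB at $2.50/hour
fp16_cost = 25_000.0            # float16 deployment cost quoted above
savings = 1 - int4_cost / fp16_cost

print(f"int4: ${int4_cost:,.0f}/month")      # $1,800/month
print(f"savings vs float16: {savings:.0%}")  # roughly the 90% reduction cited above
```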
Common Mistakes
- ✕ Applying int4 quantization without evaluating quality on your specific task—quantization error varies by model and domain; always benchmark before deployment.
- ✕ Confusing model quantization with vector quantization—model quantization compresses model weights; vector quantization compresses stored embeddings in a vector database.
- ✕ Ignoring hardware compatibility—some quantization formats (GPTQ, AWQ) are only supported on CUDA GPUs; others (GGUF) support CPU inference.
Related Terms
LoRA (Low-Rank Adaptation)
LoRA is a parameter-efficient fine-tuning technique that injects small trainable low-rank matrices into LLM layers, updating less than 1% of parameters while achieving quality comparable to full fine-tuning.
Parameter-Efficient Fine-Tuning (PEFT)
PEFT encompasses techniques like LoRA, prefix tuning, and adapters that fine-tune only a small fraction of LLM parameters, achieving comparable quality to full fine-tuning at dramatically reduced compute and memory cost.
Model Distillation
Model distillation trains a smaller 'student' model to mimic a larger 'teacher' model's outputs, producing a compact model that approximates the teacher's capabilities at a fraction of the compute cost.
Model Compression
Model compression reduces LLM size through techniques like pruning (removing unimportant weights), quantization (reducing weight precision), and distillation (training smaller models), enabling deployment on resource-constrained hardware.
LLM Inference
LLM inference is the process of running a trained model to generate a response for a given input, encompassing the forward pass computation, token generation, and the infrastructure required to serve predictions at scale.