Model Compression
Definition
Model compression is the umbrella term for all techniques that reduce the memory footprint, computational cost, and inference latency of LLMs while preserving as much quality as possible. The primary techniques are: (1) quantization—representing weights in lower-precision formats (float16 → int8 → int4); (2) pruning—removing weights or entire neurons/attention heads that contribute minimally to outputs; (3) knowledge distillation—training a smaller student model to mimic a larger teacher; (4) low-rank factorization—decomposing weight matrices into products of smaller matrices; (5) weight sharing—using the same weights for multiple model components. In practice, quantization and distillation are the most widely used, as pruning LLMs without quality degradation remains challenging.
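To make the quantization idea concrete, here is a minimal sketch of symmetric int8 quantization of a weight matrix. This is an illustration of the general principle, not the GPTQ algorithm; the matrix size and per-tensor scaling are simplifying assumptions (production quantizers use per-channel or per-group scales).

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    # One scale for the whole tensor maps the largest weight to +/-127.
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    # Recover approximate float weights; the rounding error is bounded
    # by half a quantization step.
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=(4096, 4096)).astype(np.float32)
q, scale = quantize_int8(w)

print(w.nbytes // 2**20, "MB ->", q.nbytes // 2**20, "MB")  # prints: 64 MB -> 16 MB
```

Going from float32 to int8 is a 4x memory reduction (float16 to int8 is 2x); int4 halves it again, which is where the 4x figure for GPTQ 4-bit over float16 comes from.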
Why It Matters
Model compression is key to making LLMs economically viable for resource-constrained deployments. Not every application can afford to run a 70B model on four A100s. Compressed models—smaller through distillation, lower precision through quantization, or both—enable LLM deployment on edge devices, smaller cloud instances, and CPU-only servers. For 99helpers customers serving high volumes with tight latency budgets, compressed models can provide 4-8x cost reductions and 2-4x latency improvements at acceptable quality levels. As quantization and distillation methods mature, the quality compromise continues to shrink.
How It Works
A typical compression workflow starts by establishing a reference quality benchmark for the full-precision model. Apply quantization first (easiest, most impactful): GPTQ 4-bit typically preserves 95-98% of quality with a 4x memory reduction. If quality targets are met, stop. If not, or if further compression is needed, combine it with pruning: structured pruning removes entire attention heads or FFN intermediate dimensions, and 10-20% of weights can often be pruned with <2% quality impact. Measure quality after each compression step, since the effects of multiple techniques compound. Benchmark on representative domain queries, not just general benchmarks, as compression effects can be domain-specific.
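The structured-pruning step can be sketched as follows: drop the 20% of FFN intermediate dimensions whose weight columns have the smallest L2 norm. Real LLM pruning methods score importance on calibration data; this magnitude-based proxy and the layer dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_ff = 512, 2048
w_up = rng.normal(0, 0.02, size=(d_model, d_ff))    # projects into the FFN
w_down = rng.normal(0, 0.02, size=(d_ff, d_model))  # projects back out

keep_frac = 0.8
norms = np.linalg.norm(w_up, axis=0)                # crude importance score per dim
keep = np.sort(np.argsort(norms)[-int(d_ff * keep_frac):])

# Removing an intermediate dimension means deleting the matching column
# of w_up and row of w_down together, so shapes stay consistent.
w_up_p, w_down_p = w_up[:, keep], w_down[keep, :]
print(w_up.shape, "->", w_up_p.shape)               # prints: (512, 2048) -> (512, 1638)
```

Because whole dimensions are removed, the pruned layer is a smaller dense layer and needs no sparse-kernel support, which is why structured pruning translates directly into speedups.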
Real-World Example
A 99helpers team needs to deploy an LLM on a customer's edge server (32GB RAM, no GPU). The target model (Llama-3-8B) requires 16GB in float16—too much alongside the OS and other services. After the compression pipeline: GPTQ 4-bit quantization brings the weights to 4GB; combined with llama.cpp's GGUF format and CPU inference optimizations, the model runs in 6GB of total memory including the KV cache. Inference speed is 8 tokens/second on CPU (versus 80 tokens/second on GPU)—adequate for an edge deployment where responses can stream progressively. The team achieves 93% of the original model's quality on their domain benchmark.
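The memory figures in this example follow from back-of-envelope arithmetic: weight memory is roughly parameter count times bytes per weight (the KV cache and runtime overhead come on top of this).

```python
# Approximate weight memory for an 8B-parameter model at different
# precisions, using 1 GB = 1e9 bytes.
params = 8e9
for name, bytes_per_weight in [("float16", 2), ("int8", 1), ("int4", 0.5)]:
    gb = params * bytes_per_weight / 1e9
    print(f"{name}: {gb:.1f} GB")  # prints 16.0, 8.0, and 4.0 GB
```

This is why 4-bit quantization turns a model that cannot fit next to the OS in 32GB of RAM into one that fits comfortably.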
Common Mistakes
- ✕ Applying compression without establishing a quality baseline—always benchmark the uncompressed model first so you can measure quality degradation precisely.
- ✕ Using a single general benchmark to evaluate compression quality—compression effects are dataset-specific; always evaluate on your domain's representative queries.
- ✕ Ignoring interactions between compression techniques—combining multiple methods can compound quality loss beyond what each technique produces alone.
Related Terms
Model Quantization
Model quantization reduces the numerical precision of LLM weights from 32-bit or 16-bit floats to 8-bit or 4-bit integers, dramatically reducing memory requirements and inference costs with minimal quality loss.
Model Distillation
Model distillation trains a smaller 'student' model to mimic a larger 'teacher' model's outputs, producing a compact model that approximates the teacher's capabilities at a fraction of the compute cost.
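The core of distillation training is a loss that pushes the student's output distribution toward the teacher's softened one. A minimal sketch for a single token position, assuming made-up logits and a temperature of 2 (the temperature-scaled KL divergence from Hinton et al.):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

def distill_loss(teacher_logits, student_logits, T=2.0):
    p = softmax(teacher_logits / T)  # soft targets from the teacher
    q = softmax(student_logits / T)  # student's softened distribution
    # KL(p || q), scaled by T^2 so gradients keep a consistent magnitude.
    return float(T * T * np.sum(p * np.log(p / q)))

teacher = np.array([4.0, 1.0, 0.5, -2.0])  # illustrative logits
student = np.array([3.5, 1.5, 0.0, -1.0])
loss = distill_loss(teacher, student)
```

The temperature spreads probability mass onto non-top tokens, so the student also learns which wrong answers the teacher considers plausible—information a hard label discards.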
LoRA (Low-Rank Adaptation)
LoRA is a parameter-efficient fine-tuning technique that injects small trainable low-rank matrices into LLM layers, updating less than 1% of parameters while achieving quality comparable to full fine-tuning.
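The "<1% of parameters" claim is easy to verify with a sketch: instead of updating a full d x d weight matrix, LoRA trains two low-rank factors B (d x r) and A (r x d), and the effective weight becomes W + B @ A. The dimensions below are illustrative.

```python
import numpy as np

d, r = 4096, 8                       # hidden size and LoRA rank (assumed)
W = np.zeros((d, d))                 # stands in for the frozen pretrained weight
A = np.random.randn(r, d) * 0.01     # trainable
B = np.zeros((d, r))                 # trainable; zero init means no change at start

full_params = d * d
lora_params = d * r + r * d
print(f"trainable fraction: {lora_params / full_params:.4%}")  # prints: trainable fraction: 0.3906%
```

Initializing B to zero makes B @ A vanish at the start of training, so fine-tuning begins exactly at the pretrained model and drifts away from it gradually.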
QLoRA
QLoRA (Quantized Low-Rank Adaptation) combines 4-bit model quantization with LoRA fine-tuning, enabling fine-tuning of large LLMs on consumer-grade hardware by dramatically reducing memory requirements.
LLM Inference
LLM inference is the process of running a trained model to generate a response for a given input, encompassing the forward pass computation, token generation, and the infrastructure required to serve predictions at scale.