Model Compression
Definition
Model compression is the umbrella term for all techniques that reduce the memory footprint, computational cost, and inference latency of LLMs while preserving as much quality as possible. The primary techniques are: (1) quantization—representing weights in lower-precision formats (float16 → int8 → int4); (2) pruning—removing weights or entire neurons/attention heads that contribute minimally to outputs; (3) knowledge distillation—training a smaller student model to mimic a larger teacher; (4) low-rank factorization—decomposing weight matrices into products of smaller matrices; (5) weight sharing—using the same weights for multiple model components. In practice, quantization and distillation are the most widely used, as pruning LLMs without quality degradation remains challenging.
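To make the quantization idea concrete, here is a minimal sketch of symmetric int8 quantization of a weight matrix. This is an illustration of the general principle, not the GPTQ algorithm; the matrix size and per-tensor scaling are simplifying assumptions (production quantizers use per-channel or per-group scales).

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    # One scale for the whole tensor maps the largest weight to +/-127.
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    # Recover approximate float weights; the rounding error is bounded
    # by half a quantization step.
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=(4096, 4096)).astype(np.float32)
q, scale = quantize_int8(w)

print(w.nbytes // 2**20, "MB ->", q.nbytes // 2**20, "MB")  # prints: 64 MB -> 16 MB
```

Going from float32 to int8 is a 4x memory reduction (float16 to int8 is 2x); int4 halves it again, which is where the 4x figure for GPTQ 4-bit over float16 comes from.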
Why It Matters
Model compression is key to making LLMs economically viable for resource-constrained deployments. Not every application can afford to run a 70B model on four A100s. Compressed models—smaller through distillation, lower precision through quantization, or both—enable LLM deployment on edge devices, smaller cloud instances, and CPU-only servers. For 99helpers customers serving high volumes with tight latency budgets, compressed models can provide 4-8x cost reductions and 2-4x latency improvements at acceptable quality levels. As quantization and distillation methods mature, the quality compromise continues to shrink.
How It Works
A typical compression workflow starts by establishing a reference quality benchmark for the full-precision model. Apply quantization first (easiest, most impactful): GPTQ 4-bit typically preserves 95-98% of quality with a 4x memory reduction. If quality targets are met, stop. If not, or if further compression is needed, combine it with pruning: structured pruning removes entire attention heads or FFN intermediate dimensions, and 10-20% of weights can often be pruned with <2% quality impact. Measure quality after each compression step, since the effects of multiple techniques compound. Benchmark on representative domain queries, not just general benchmarks, as compression effects can be domain-specific.
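The structured-pruning step can be sketched as follows: drop the 20% of FFN intermediate dimensions whose weight columns have the smallest L2 norm. Real LLM pruning methods score importance on calibration data; this magnitude-based proxy and the layer dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_ff = 512, 2048
w_up = rng.normal(0, 0.02, size=(d_model, d_ff))    # projects into the FFN
w_down = rng.normal(0, 0.02, size=(d_ff, d_model))  # projects back out

keep_frac = 0.8
norms = np.linalg.norm(w_up, axis=0)                # crude importance score per dim
keep = np.sort(np.argsort(norms)[-int(d_ff * keep_frac):])

# Removing an intermediate dimension means deleting the matching column
# of w_up and row of w_down together, so shapes stay consistent.
w_up_p, w_down_p = w_up[:, keep], w_down[keep, :]
print(w_up.shape, "->", w_up_p.shape)               # prints: (512, 2048) -> (512, 1638)
```

Because whole dimensions are removed, the pruned layer is a smaller dense layer and needs no sparse-kernel support, which is why structured pruning translates directly into speedups.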
Real-World Example
A 99helpers team needs to deploy an LLM on a customer's edge server (32GB RAM, no GPU). The target model (Llama-3-8B) requires 16GB in float16—too much alongside the OS and other services. After the compression pipeline: GPTQ 4-bit quantization brings the weights to 4GB; combined with llama.cpp's GGUF format and CPU inference optimizations, the model runs in 6GB of total memory including the KV cache. Inference speed is 8 tokens/second on CPU (versus 80 tokens/second on GPU)—adequate for an edge deployment where responses can stream progressively. The team achieves 93% of the original model's quality on their domain benchmark.
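The memory figures in this example follow from back-of-envelope arithmetic: weight memory is roughly parameter count times bytes per weight (the KV cache and runtime overhead come on top of this).

```python
# Approximate weight memory for an 8B-parameter model at different
# precisions, using 1 GB = 1e9 bytes.
params = 8e9
for name, bytes_per_weight in [("float16", 2), ("int8", 1), ("int4", 0.5)]:
    gb = params * bytes_per_weight / 1e9
    print(f"{name}: {gb:.1f} GB")  # prints 16.0, 8.0, and 4.0 GB
```

This is why 4-bit quantization turns a model that cannot fit next to the OS in 32GB of RAM into one that fits comfortably.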
Common Mistakes
- ✕ Applying compression without establishing a quality baseline—always benchmark the uncompressed model first so you can measure quality degradation precisely.
- ✕ Using a single general benchmark to evaluate compression quality—compression effects are dataset-specific; always evaluate on your domain's representative queries.
- ✕ Ignoring interactions between compression techniques—combining multiple methods can compound quality loss beyond what each technique produces alone.
Related Terms
Model Quantization
Model quantization reduces the numerical precision of LLM weights from 32-bit or 16-bit floats to 8-bit or 4-bit integers, dramatically reducing memory requirements and inference costs with minimal quality loss.
Model Distillation
Model distillation trains a smaller 'student' model to mimic a larger 'teacher' model's outputs, producing a compact model that approximates the teacher's capabilities at a fraction of the compute cost.
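The core of distillation training is a loss that pushes the student's output distribution toward the teacher's softened one. A minimal sketch for a single token position, assuming made-up logits and a temperature of 2 (the temperature-scaled KL divergence from Hinton et al.):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

def distill_loss(teacher_logits, student_logits, T=2.0):
    p = softmax(teacher_logits / T)  # soft targets from the teacher
    q = softmax(student_logits / T)  # student's softened distribution
    # KL(p || q), scaled by T^2 so gradients keep a consistent magnitude.
    return float(T * T * np.sum(p * np.log(p / q)))

teacher = np.array([4.0, 1.0, 0.5, -2.0])  # illustrative logits
student = np.array([3.5, 1.5, 0.0, -1.0])
loss = distill_loss(teacher, student)
```

The temperature spreads probability mass onto non-top tokens, so the student also learns which wrong answers the teacher considers plausible—information a hard label discards.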
LoRA (Low-Rank Adaptation)
LoRA is a parameter-efficient fine-tuning technique that injects small trainable low-rank matrices into LLM layers, updating less than 1% of parameters while achieving quality comparable to full fine-tuning.
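The "<1% of parameters" claim is easy to verify with a sketch: instead of updating a full d x d weight matrix, LoRA trains two low-rank factors B (d x r) and A (r x d), and the effective weight becomes W + B @ A. The dimensions below are illustrative.

```python
import numpy as np

d, r = 4096, 8                       # hidden size and LoRA rank (assumed)
W = np.zeros((d, d))                 # stands in for the frozen pretrained weight
A = np.random.randn(r, d) * 0.01     # trainable
B = np.zeros((d, r))                 # trainable; zero init means no change at start

full_params = d * d
lora_params = d * r + r * d
print(f"trainable fraction: {lora_params / full_params:.4%}")  # prints: trainable fraction: 0.3906%
```

Initializing B to zero makes B @ A vanish at the start of training, so fine-tuning begins exactly at the pretrained model and drifts away from it gradually.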
QLoRA
QLoRA (Quantized Low-Rank Adaptation) combines 4-bit model quantization with LoRA fine-tuning, enabling fine-tuning of large LLMs on consumer-grade hardware by dramatically reducing memory requirements.
LLM Inference
LLM inference is the process of running a trained model to generate a response for a given input, encompassing the forward pass computation, token generation, and the infrastructure required to serve predictions at scale.