Knowledge Distillation
Definition
Knowledge distillation is a model compression technique in which a smaller 'student' model is trained not on hard labels (0 or 1) but on the soft probability outputs ('soft labels') of a larger, pre-trained 'teacher' model. The teacher's probability distribution over all classes contains richer information than hard labels: it encodes the teacher's uncertainty and the relative similarity between classes. By training the student to match these distributions (minimizing the KL divergence between teacher and student outputs), the student inherits the teacher's learned generalizations rather than just its final answers. DistilBERT (roughly 40% smaller than BERT-base while retaining about 97% of its performance) and TinyBERT were produced through distillation, and distillation is also commonly used to train compact architectures such as the MobileNet family.
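The core idea of matching soft distributions can be shown in a few lines. This is a minimal numeric sketch, not a full training setup: the 3-class logit values are hypothetical, and the temperature value is illustrative.

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature T > 1 softens the distribution, exposing the
    # teacher's relative class similarities rather than just its argmax.
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()                      # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def kl_divergence(p, q):
    # KL(p || q): how far the student's distribution q is from the teacher's p.
    return float(np.sum(p * np.log(p / q)))

# Hypothetical logits for one example (3 classes).
teacher_logits = [5.0, 2.0, 0.5]
student_logits = [4.0, 1.0, 0.2]

p_teacher = softmax(teacher_logits, T=4.0)   # soft labels from the teacher
q_student = softmax(student_logits, T=4.0)
loss = kl_divergence(p_teacher, q_student)   # what the student minimizes
```

Note that at T=4 the teacher's distribution spreads mass across all three classes; a hard label would collapse it to a single 1, discarding the similarity information the student is meant to learn.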
Why It Matters
Knowledge distillation is one of the primary paths to making powerful large models deployable in production at scale. A 7B-parameter LLM produces excellent outputs but requires significant GPU memory and compute per inference; a 1B-parameter distilled model can often reach 90-95% of its quality at roughly 15% of the inference cost. For edge AI deployment, where models must run on consumer hardware, distillation is often the only path to acceptable quality within memory and compute budgets. Distillation also improves inference speed roughly in proportion to the parameter reduction: a 7x smaller dense model typically infers close to 7x faster, directly reducing latency.
How It Works
Distillation training: (1) train or obtain the teacher model; (2) generate soft labels by running the teacher on the training dataset to obtain a probability distribution for each example, typically with a softmax temperature T > 1 to soften the distributions and expose inter-class similarities; (3) train the student on a combined loss: a fraction on hard labels (ground truth) plus a fraction on the KL divergence from the teacher's soft labels, commonly weighted around 0.9 soft + 0.1 hard; (4) optionally include intermediate-layer matching (feature-based distillation), where the student is also trained to match the teacher's intermediate representations; (5) evaluate the student on held-out data against the teacher's performance. The training data for distillation can be the original training set or a synthetic dataset generated by the teacher.
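The combined loss from step (3) can be sketched as follows. This is a simplified single-example version with hypothetical logits; the T² rescaling of the soft term follows the common convention of keeping gradient magnitudes comparable across temperatures, and alpha=0.9 mirrors the 0.9 soft + 0.1 hard weighting above.

```python
import numpy as np

def softmax(logits, T=1.0):
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()                      # numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, hard_label,
                      T=4.0, alpha=0.9):
    # Soft term: KL divergence between teacher and student at temperature T.
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    soft = float(np.sum(p * np.log(p / q))) * T * T   # T^2 gradient rescaling
    # Hard term: cross-entropy against the ground-truth label at T = 1.
    hard = -float(np.log(softmax(student_logits, 1.0)[hard_label]))
    # Weighted combination: mostly soft labels, a little ground truth.
    return alpha * soft + (1 - alpha) * hard

# Hypothetical logits; ground-truth class is index 0.
loss = distillation_loss([4.0, 1.0, 0.2], [5.0, 2.0, 0.5], hard_label=0)
```

In a real training loop this loss would be computed per batch and backpropagated through the student only; the teacher's weights stay frozen.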
[Diagram: Knowledge Distillation] Teacher model (e.g. GPT-4 / 70B): large, slow, expensive → soft labels (probability distributions) → Student model (e.g. DistilBERT / 7B): small, fast, cheap. The student achieves ~95% of the teacher's accuracy at 1/10th the compute cost.
Real-World Example
A startup needed to deploy a medical imaging classifier on tablet devices used in rural clinics without reliable internet connectivity. Their best ResNet-152 model achieved 94% accuracy but required 250MB memory and was too slow for real-time tablet inference. They distilled this teacher model into a MobileNetV3 student using soft label training on 50,000 radiographs. The 22MB student model achieved 91% accuracy—3 points below the teacher—at 8x faster inference speed and 11x smaller memory footprint. The 3-point accuracy tradeoff was accepted: 91% accuracy on a tablet in a remote clinic is dramatically better than 0% accuracy due to no connectivity for cloud inference.
Common Mistakes
- ✕ Expecting the student to match teacher performance exactly: distillation introduces a small but usually unavoidable performance gap; validate that the gap is acceptable for the use case
- ✕ Training the student from scratch without teacher soft labels: simply training a small model on hard labels usually produces worse results than distillation
- ✕ Distilling without diverse training data: the student learns from the teacher's outputs, so the training data must cover the full input distribution
Related Terms
Transfer Learning
Transfer learning leverages knowledge from a model trained on one task or dataset to accelerate and improve learning on a related task—dramatically reducing the labeled data and compute required to build high-performing domain-specific models.
Model Pruning
Model pruning reduces neural network size and inference speed by removing low-importance weights, neurons, or layers—enabling deployment of high-quality models with reduced memory footprint and faster inference.
Edge AI
Edge AI runs AI models directly on local devices—smartphones, IoT sensors, cameras—rather than sending data to the cloud, enabling real-time inference without internet connectivity, reduced latency, and enhanced privacy.
AI Cost Optimization
AI cost optimization encompasses techniques to reduce the compute, storage, and API expenses of AI systems—through model selection, caching, batching, quantization, and architecture decisions—making AI economically sustainable at scale.
Model Serving
Model serving is the infrastructure that hosts trained ML models and exposes them as APIs, handling prediction requests in production with the latency, throughput, and reliability requirements of real applications.