
Knowledge Distillation

Definition

Knowledge distillation is a model compression technique in which a smaller 'student' model is trained not on hard labels (0 or 1) but on the soft probability outputs ('soft labels') of a larger, pre-trained 'teacher' model. The teacher's probability distribution over all classes carries richer information than hard labels: it encodes the teacher's uncertainty and the relative similarity between classes. By training the student to match these distributions (minimizing the KL divergence between teacher and student outputs), the student inherits the teacher's learned generalizations rather than just its final answers. DistilBERT (roughly 40% smaller than BERT-base while retaining about 97% of its performance), TinyBERT, and compact vision models in the MobileNet family that are trained as distillation students are well-known results of this approach.
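The matching objective is usually written with a temperature that softens both distributions. A standard formulation (the Hinton-style soft-target loss; the temperature T and mixing weight alpha below are conventional hyperparameters, not values taken from this article) is:

L = \alpha \, T^2 \, \mathrm{KL}\big(\mathrm{softmax}(z_t / T) \,\|\, \mathrm{softmax}(z_s / T)\big) + (1 - \alpha) \, \mathrm{CE}\big(y, \mathrm{softmax}(z_s)\big)

where z_t and z_s are the teacher and student logits, y is the ground-truth label, and the T^2 factor keeps the soft-target gradients on a scale comparable to the hard-label term as the temperature grows.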

Why It Matters

Knowledge distillation is one of the primary paths to making powerful large models deployable in production at scale. A 7B-parameter LLM produces excellent outputs but requires significant GPU memory and compute per inference; a 1B-parameter distilled model can often reach 90-95% of its quality at roughly 15% of the inference cost. For edge AI deployment, where models must run on consumer hardware, distillation is often the only path to acceptable quality within memory and compute budgets. Distillation also speeds up inference roughly in proportion to the parameter reduction: a model several times smaller typically serves requests several times faster, directly reducing latency.

How It Works

Distillation training: (1) train or obtain the teacher model; (2) generate soft labels by running the teacher on the training dataset to obtain a probability distribution for each example; (3) train the student on a combined loss: a weighted sum of cross-entropy against the hard labels (ground truth) and KL divergence from the teacher's soft labels, typically weighted around 0.9 soft + 0.1 hard; (4) optionally include intermediate layer matching (feature-based distillation), where the student is also trained to match the teacher's intermediate representations in addition to its final outputs; (5) evaluate the student on held-out data against the teacher's performance. Training data for distillation can be the original training set or a synthetic dataset generated by the teacher.
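A minimal sketch of step (3) in PyTorch is shown below. The distillation_loss and train_step helpers, the temperature of 2.0, and the 0.9/0.1 weighting are illustrative assumptions for this article, not code from any particular framework.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_labels,
                      temperature=2.0, soft_weight=0.9):
    # Soft targets: teacher and student distributions softened by the temperature.
    # F.kl_div expects log-probabilities as input and probabilities as target.
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    soft_loss = F.kl_div(soft_student, soft_teacher,
                         reduction="batchmean") * temperature ** 2

    # Hard targets: ordinary cross-entropy against the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, hard_labels)

    return soft_weight * soft_loss + (1.0 - soft_weight) * hard_loss

def train_step(teacher, student, optimizer, inputs, hard_labels):
    # The teacher runs in eval mode with no gradients; only the student is updated.
    teacher.eval()
    with torch.no_grad():
        teacher_logits = teacher(inputs)
    student_logits = student(inputs)
    loss = distillation_loss(student_logits, teacher_logits, hard_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

In practice, step (2) can also be done once up front: the teacher's logits are precomputed over the whole dataset and cached, so the expensive teacher does not need to be re-run on every epoch.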

[Diagram: Knowledge Distillation. A large teacher model (labeled GPT-4 / 70B: large, slow, expensive) passes soft labels (probability distributions) to a small student model (labeled DistilBERT / 7B: small, fast, cheap); the student achieves ~95% of teacher accuracy at 1/10th the compute cost.]

Real-World Example

A startup needed to deploy a medical imaging classifier on tablet devices used in rural clinics without reliable internet connectivity. Their best ResNet-152 model achieved 94% accuracy but required 250MB memory and was too slow for real-time tablet inference. They distilled this teacher model into a MobileNetV3 student using soft label training on 50,000 radiographs. The 22MB student model achieved 91% accuracy—3 points below the teacher—at 8x faster inference speed and 11x smaller memory footprint. The 3-point accuracy tradeoff was accepted: 91% accuracy on a tablet in a remote clinic is dramatically better than 0% accuracy due to no connectivity for cloud inference.

Common Mistakes

  • Expecting the student to match teacher performance exactly—distillation introduces a small but unavoidable performance gap; validate that the gap is acceptable for the use case
  • Training the student from scratch without teacher soft labels—simply training a small model on hard labels usually produces worse results than distillation
  • Distilling without diverse training data—the student learns from the teacher's outputs, so the training data must cover the full input distribution
