Model Distillation
Definition
Knowledge distillation transfers capability from a large, expensive 'teacher' model to a smaller, cheaper 'student' model. During distillation, the student is trained not just on hard labels (correct answers) but on the teacher's soft probability distributions (called 'soft targets' or 'dark knowledge'). These distributions carry a richer learning signal than hard labels: they encode the teacher's confidence and the relative similarities between classes that the teacher has learned. For LLMs, distillation often takes the form of 'data distillation'—generating high-quality responses from a frontier teacher model, then training a smaller student model on these generated (prompt, response) pairs. Stanford's Alpaca and many fine-tuned Llama variants use this approach.
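The soft-target idea above can be sketched as a loss function: soften both models' logits with a temperature, then blend a KL term against the teacher's distribution with the usual hard-label cross-entropy. This is a minimal NumPy sketch of Hinton-style distillation; the temperature and alpha values are illustrative defaults, not values from the source.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; higher T flattens the distribution,
    exposing the 'dark knowledge' in near-zero classes."""
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(teacher_logits, student_logits, hard_label,
                      temperature=2.0, alpha=0.5):
    """Blend of a soft-target KL term and hard-label cross-entropy.
    The T^2 factor rescales soft-target gradients, as in Hinton et al.
    (alpha and temperature here are illustrative assumptions)."""
    p_t = softmax(teacher_logits, temperature)   # teacher's soft targets
    p_s = softmax(student_logits, temperature)
    soft = np.sum(p_t * (np.log(p_t) - np.log(p_s)))   # KL(teacher || student)
    hard = -np.log(softmax(student_logits)[hard_label])  # cross-entropy
    return alpha * temperature**2 * soft + (1 - alpha) * hard
```

When the student's logits match the teacher's, the KL term vanishes and only the hard-label term remains, which is why soft targets act as a regularizing extra signal rather than a replacement for labels.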
Why It Matters
Model distillation is how the AI community makes frontier model capabilities accessible at lower cost. A team cannot afford to deploy GPT-4-level quality at $0.01/1K tokens for millions of daily queries, but they can potentially distill a smaller model on GPT-4-generated data that achieves 85-90% of GPT-4 quality at 1/20th the inference cost. For 99helpers customers with high query volumes, distilled models can make AI-powered support economically sustainable at scale. OpenAI's newer smaller models (GPT-4o-mini) are likely distilled from larger counterparts, which would explain why their quality exceeds what their size alone suggests.
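The cost gap above is easy to make concrete with back-of-envelope arithmetic. The token count per query below is an assumption for illustration; the $0.01/1K-token price and 1/20th ratio come from the text.

```python
# Illustrative assumptions: ~500 tokens per query, $0.01/1K tokens for the
# frontier teacher, and a distilled student at 1/20th the inference cost.
TOKENS_PER_QUERY = 500
TEACHER_PER_1K_TOKENS = 0.01

def daily_cost(queries_per_day, price_per_1k_tokens):
    """Daily spend in dollars for a given per-token price."""
    return queries_per_day * TOKENS_PER_QUERY / 1000 * price_per_1k_tokens

teacher_daily = daily_cost(1_000_000, TEACHER_PER_1K_TOKENS)       # $5,000/day
student_daily = daily_cost(1_000_000, TEACHER_PER_1K_TOKENS / 20)  # $250/day
```

At a million queries a day, the difference is thousands of dollars daily, which is what makes the 10-15% quality trade-off worth considering.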
How It Works
LLM distillation workflow: (1) select a target task or domain; (2) curate a set of representative input prompts; (3) generate high-quality responses from the teacher model (GPT-4, Claude 3.5 Sonnet); (4) fine-tune a smaller student model (Llama-3-8B, Mistral-7B) on these (prompt, teacher_response) pairs using supervised fine-tuning; (5) evaluate the student on held-out test cases, targeting >85% of teacher quality; (6) iterate on prompt curation and fine-tuning until quality targets are met. More sophisticated distillation uses the teacher's token-level probability distributions rather than just the final generated text, providing richer learning signal.
[Diagram: Knowledge Distillation Pipeline, showing how the student learns and a teacher vs. student comparison]
Real-World Example
A 99helpers team builds a support chatbot initially using Claude 3.5 Sonnet (quality: excellent, cost: $0.003/query). At 50,000 queries/day, this costs $150/day, or $54,750/year. They distill to a Llama-3-8B student: generating 20,000 (support query, Claude response) pairs, fine-tuning with LoRA, and evaluating against their benchmark. The distilled student achieves 88% of Claude's quality score at $0.00015/query (self-hosted), a 20x cost reduction. For 85%+ of queries (straightforward factual questions), the distilled model performs on par with Claude; only complex edge cases are routed to Claude as a fallback.
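The fallback routing in this example can be sketched as a simple confidence gate. The confidence signal and threshold below are assumptions (the example does not specify how edge cases are detected; mean token log-probability or a small classifier are common choices).

```python
def route(query, student_answer, confidence, teacher_fn, threshold=0.8):
    """Serve the distilled student's answer when it is confident; otherwise
    escalate to the teacher model. `confidence` and `threshold` are
    illustrative -- pick a signal and cutoff from your own evaluation."""
    if confidence >= threshold:
        return ("student", student_answer)
    return ("teacher", teacher_fn(query))  # fallback to e.g. Claude
```

Because the teacher handles only the sub-15% of hard queries, blended cost stays close to the student's per-query price.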
Common Mistakes
- ✕ Distilling without the teacher provider's permission: generating training data from closed-model APIs for commercial use may violate the terms of service.
- ✕ Using a student model far smaller than the task requires: an 8B-parameter student cannot fully replicate a 70B teacher's complex reasoning, regardless of distillation quality.
- ✕ Evaluating only on training-distribution queries: distilled models often overfit to the distillation data distribution and may underperform on out-of-distribution inputs.
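The last mistake above suggests a simple guardrail: score the student on both an in-distribution and an out-of-distribution test set and flag a large gap. The scoring function and gap threshold are assumptions; use whatever metric your benchmark already defines (exact match, LLM-judge, etc.).

```python
def ood_regression_check(score_fn, in_dist, out_dist, max_gap=0.10):
    """Compare mean quality on in-distribution vs. out-of-distribution
    examples. `score_fn` maps one example to a quality score in [0, 1];
    `max_gap` is an illustrative tolerance, not a standard value."""
    mean = lambda xs: sum(xs) / len(xs)
    in_score = mean([score_fn(x) for x in in_dist])
    ood_score = mean([score_fn(x) for x in out_dist])
    return {"in_dist": in_score, "ood": ood_score,
            "regressed": (in_score - ood_score) > max_gap}
```

Running this after each distillation iteration catches students that memorized the distillation set rather than the task.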
Related Terms
Fine-Tuning
Fine-tuning adapts a pre-trained LLM to a specific task or domain by continuing training on a smaller, curated dataset, improving performance on targeted use cases while preserving general language capabilities.
Model Quantization
Model quantization reduces the numerical precision of LLM weights from 32-bit or 16-bit floats to 8-bit or 4-bit integers, dramatically reducing memory requirements and inference costs with minimal quality loss.
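The memory savings from lower precision are simple arithmetic: weight memory scales linearly with bits per weight. A weight-only estimate (ignoring KV cache and activations, an illustrative simplification):

```python
def weight_memory_gb(params_billion, bits_per_weight):
    """Approximate weight-only memory footprint in GB.
    Ignores KV cache and activations -- a deliberate simplification."""
    return params_billion * bits_per_weight / 8

fp16_gb = weight_memory_gb(8, 16)  # an 8B model at fp16: 16 GB
int4_gb = weight_memory_gb(8, 4)   # the same model at int4: 4 GB
```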
Model Compression
Model compression reduces LLM size through techniques like pruning (removing unimportant weights), quantization (reducing weight precision), and distillation (training smaller models), enabling deployment on resource-constrained hardware.
LoRA (Low-Rank Adaptation)
LoRA is a parameter-efficient fine-tuning technique that injects small trainable low-rank matrices into LLM layers, updating less than 1% of parameters while achieving quality comparable to full fine-tuning.
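The "less than 1% of parameters" claim follows from the adapter shapes: a rank-r LoRA adapter on a d_in x d_out weight adds d_in*r + r*d_out trainable parameters. A quick check, with the 4096-dimensional layer size below as an illustrative assumption:

```python
def lora_param_fraction(d_in, d_out, rank):
    """Trainable parameters a rank-r LoRA adapter adds to one dense
    d_in x d_out layer, as a fraction of that layer's weight count:
    (d_in*r + r*d_out) / (d_in*d_out)."""
    return (d_in * rank + rank * d_out) / (d_in * d_out)

frac = lora_param_fraction(4096, 4096, 8)  # ~0.39% for a square 4096 layer
```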
Open-Source LLM
An open-source LLM is a language model with publicly available weights that anyone can download, run locally, fine-tune, and deploy without per-query licensing fees, enabling private deployment and customization.