Large Language Models (LLMs)

Model Distillation

Definition

Knowledge distillation transfers capability from a large, expensive 'teacher' model to a smaller, cheaper 'student' model. During distillation, the student is trained not just on hard labels (correct answers) but on the teacher's soft probability distributions (called 'soft targets' or 'dark knowledge'). These distributions carry a richer learning signal than hard labels: they encode the teacher's confidence and the relative similarities between classes that the teacher has learned. For LLMs, distillation often takes the form of 'data distillation': generating high-quality responses from a frontier teacher model, then training a smaller student model on these generated (prompt, response) pairs. Stanford's Alpaca and many fine-tuned Llama variants use this approach.
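The soft-target idea can be sketched in a few lines of plain Python (illustrative only; real distillation uses a deep-learning framework). Temperature-scaled softmax flattens the teacher's distribution so class similarities become visible, and the training loss blends hard-label cross-entropy with a KL term, following Hinton et al.'s classic formulation. The logits are made-up values chosen to roughly reproduce the dog/wolf/fox soft-label example shown later in this article:

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; higher temperature flattens the distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, hard_label, temperature=2.0, alpha=0.5):
    """Weighted sum of cross-entropy on the hard label and KL divergence
    between temperature-softened teacher and student distributions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL(teacher || student); the T^2 factor follows Hinton et al.'s scaling
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student) if pt > 0)
    hard_ce = -math.log(softmax(student_logits)[hard_label])
    return alpha * hard_ce + (1 - alpha) * (temperature ** 2) * kl

# Teacher logits for classes [dog, wolf, fox]: confident "dog", but wolf > fox
teacher = [4.0, 1.9, 1.2]
print([round(p, 2) for p in softmax(teacher)])                   # peaked, near hard-label
print([round(p, 2) for p in softmax(teacher, temperature=4.0)])  # softened: similarities visible
```

At temperature 1 the teacher looks almost like a hard label; at higher temperature the wolf/fox similarity structure (the 'dark knowledge') becomes visible to the student.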

Why It Matters

Model distillation is how the AI community makes frontier model capabilities accessible at lower cost. A team that cannot afford to serve GPT-4-level quality at $0.01/1K tokens for millions of daily queries can potentially train a smaller model on GPT-4-generated data that achieves 85-90% of GPT-4's quality at roughly 1/20th the inference cost. For 99helpers customers with high query volumes, distilled models can make AI-powered support economically sustainable at scale. OpenAI's smaller models, such as GPT-4o-mini, are widely believed to be distilled from larger counterparts, which would explain why their quality exceeds what their size alone would suggest.
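The cost arithmetic behind this argument can be checked with a small back-of-the-envelope script. All figures are illustrative assumptions taken from the example later in this article, and the one-time distillation budget is a hypothetical placeholder:

```python
def annual_cost(cost_per_query, queries_per_day, days=365):
    """Yearly serving cost for a fixed daily query volume."""
    return cost_per_query * queries_per_day * days

teacher_cost = annual_cost(0.003, 50_000)       # frontier API pricing (illustrative)
student_cost = annual_cost(0.003 / 20, 50_000)  # distilled model at ~1/20th the cost

print(f"teacher: ${teacher_cost:,.0f}/yr, student: ${student_cost:,.0f}/yr")

# Savings must also cover the one-time distillation cost
# (teacher data generation + fine-tuning compute) before they pay off.
distillation_budget = 5_000  # hypothetical one-time cost
payback_days = distillation_budget / ((teacher_cost - student_cost) / 365)
print(f"payback in ~{payback_days:.0f} days")
```

At these assumed rates, the one-time distillation investment pays for itself in roughly a month of serving.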

How It Works

LLM distillation workflow: (1) select a target task or domain; (2) curate a set of representative input prompts; (3) generate high-quality responses from the teacher model (e.g., GPT-4 or Claude 3.5 Sonnet); (4) fine-tune a smaller student model (e.g., Llama-3-8B or Mistral-7B) on these (prompt, teacher_response) pairs using supervised fine-tuning; (5) evaluate the student on held-out test cases, targeting >85% of teacher quality; (6) iterate on prompt curation and fine-tuning until quality targets are met. More sophisticated distillation trains on the teacher's token-level probability distributions rather than just the final generated text, providing a richer learning signal.
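Steps 2-4 can be sketched as a minimal data-generation loop. The `query_teacher` callable stands in for a real teacher API client (a hypothetical interface, not any vendor's actual SDK), and the JSONL chat format shown is one common convention for supervised fine-tuning data:

```python
import json

def generate_distillation_dataset(prompts, query_teacher, out_path="distill_sft.jsonl"):
    """Generate (prompt, teacher_response) pairs and write them as JSONL
    chat records, a format many SFT pipelines accept."""
    with open(out_path, "w") as f:
        for prompt in prompts:
            response = query_teacher(prompt)  # one teacher-model call per prompt
            record = {"messages": [
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": response},
            ]}
            f.write(json.dumps(record) + "\n")
    return out_path

# Stand-in for a real teacher client (GPT-4, Claude 3.5 Sonnet, etc.)
def fake_teacher(prompt):
    return f"Teacher answer to: {prompt}"

path = generate_distillation_dataset(["How do I reset my password?"], fake_teacher)
```

In practice the prompt set would be thousands of curated, deduplicated queries, and responses would be filtered for quality before fine-tuning.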

Knowledge Distillation Pipeline

Teacher Model (large, 70B+, expensive) → Soft Labels (probability distributions over tokens) → Student Model (small, 7B, fast and cheap)

How the student learns

The student trains on both signals combined:

  • Hard labels: dog / cat / bird (a binary right-or-wrong signal)
  • Soft labels ('dark knowledge'): dog 0.85 / wolf 0.10 / fox 0.05 (a richer signal from the teacher, encoding class similarities)

Teacher vs. Student comparison

Metric           Teacher   Student
Parameters       70B       7B
Inference cost   $$$       $
Latency          ~3s       ~0.3s
Quality          100%      85–92%
Memory           140 GB    14 GB
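The memory figures in the table are consistent with storing model weights in fp16 (2 bytes per parameter), as the quick calculation below illustrates. This is an assumption on my part; it also ignores KV cache and activation memory, which add more at inference time:

```python
def fp16_weight_memory_gb(n_params_billion):
    """Approximate memory for model weights alone in fp16 (2 bytes per parameter),
    using decimal GB to match the table's round numbers."""
    bytes_total = n_params_billion * 1e9 * 2
    return bytes_total / 1e9

print(fp16_weight_memory_gb(70))  # teacher
print(fp16_weight_memory_gb(7))   # student
```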

Real-World Example

A 99helpers team builds a support chatbot initially using Claude 3.5 Sonnet (quality: excellent; cost: $0.003/query). At 50,000 queries/day, this costs $150/day, or $54,750/year. They distill to a Llama-3-8B student: generating 20,000 (support query, Claude response) pairs, fine-tuning with LoRA, and evaluating against their benchmark. The distilled student achieves 88% of Claude's quality score at $0.00015/query self-hosted, a 20x cost reduction. For 85%+ of queries (straightforward factual questions), the distilled model performs on par with Claude; only complex edge cases are routed to Claude as a fallback.
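The fallback routing described above can be sketched as a simple confidence-threshold router. This is a hypothetical scheme: `call_teacher` and the 0.8 threshold are illustrative assumptions, since the example doesn't specify how routing decisions are made in practice:

```python
def route_query(query, student_answer, student_confidence, threshold=0.8):
    """Serve the cheap distilled model by default; escalate low-confidence
    answers to the frontier teacher as an expensive fallback."""
    if student_confidence >= threshold:
        return ("student", student_answer)
    return ("teacher", call_teacher(query))

def call_teacher(query):
    # Stand-in for a real frontier-model API call
    return f"[teacher] {query}"

print(route_query("What are your support hours?", "9am-5pm weekdays", 0.95))
print(route_query("Complex multi-account billing dispute", "unsure", 0.40))
```

Real systems often derive the confidence score from the student's token probabilities or a separate lightweight classifier.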

Common Mistakes

  • Distilling without legal permission: generating training data from a closed model's API for commercial use may violate the provider's terms of service.
  • Using a student model far smaller than the task requires—an 8B parameter student cannot fully replicate a 70B teacher's complex reasoning capabilities regardless of distillation quality.
  • Evaluating only on training distribution queries—distilled models often overfit to the distillation data distribution and may underperform on out-of-distribution inputs.
