Active Learning
Definition
Active learning is a machine learning paradigm in which the model participates in selecting which examples to label next, rather than passively receiving randomly sampled labeled data. The core insight is that not all examples are equally informative: labeling an example the model is already confident about provides little new information, while labeling an example the model is highly uncertain about improves its decision boundaries the most. Common query strategies include uncertainty sampling (select examples with the highest prediction uncertainty), query by committee (select examples where an ensemble of models disagrees most), expected model change (select examples that would cause the largest model update), and diversity sampling (select a diverse batch that covers unexplored input space).
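As a concrete illustration of uncertainty sampling, the snippet below scores examples by the entropy of their predicted class probabilities and picks the top-k. This is a minimal sketch with toy probabilities, not a library API; the function names are our own.

```python
import numpy as np

def entropy_scores(probs: np.ndarray) -> np.ndarray:
    """Predictive entropy per example; higher means more uncertain."""
    eps = 1e-12  # avoid log(0)
    return -np.sum(probs * np.log(probs + eps), axis=1)

def select_most_uncertain(probs: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k highest-entropy examples, most uncertain first."""
    return np.argsort(entropy_scores(probs))[-k:][::-1]

# Toy predicted class probabilities for 4 unlabeled examples
probs = np.array([
    [0.98, 0.01, 0.01],  # confident prediction -> low entropy
    [0.34, 0.33, 0.33],  # near-uniform -> high entropy
    [0.70, 0.20, 0.10],
    [0.50, 0.50, 0.00],
])
print(select_most_uncertain(probs, 2))  # → [1 2]: near-uniform row ranks first
```

The near-uniform row wins because entropy peaks when the model splits its probability mass evenly across classes.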
Why It Matters
Active learning directly addresses the cost and time bottleneck of data labeling. In practice, active learning achieves equivalent model performance to random sampling with 40-70% fewer labeled examples on many classification tasks. For expensive annotation (medical imaging requiring radiologist review, legal document annotation requiring attorney time), this reduction translates to substantial cost savings—potentially tens of thousands of dollars on a large labeling project. Active learning also accelerates development cycles: reaching acceptable model performance in 3 weeks with 2,000 actively-selected labels vs. 8 weeks with 5,000 randomly sampled labels enables faster product iteration.
How It Works
A typical pool-based active learning implementation:
1. Train an initial model on a seed set of labeled examples (typically 100-500).
2. Run the model on the full unlabeled pool.
3. Score each unlabeled example with the chosen uncertainty metric (e.g., entropy of the predicted probabilities for multi-class tasks).
4. Present the top-k most uncertain examples to human annotators.
5. Add the newly labeled examples to the training set.
6. Retrain the model.
7. Repeat from step 2.
Batch active learning selects diverse batches rather than querying one example at a time, which reduces the annotation round-trip cost. The process continues until the model reaches target performance or the labeling budget is exhausted.
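The steps above can be sketched end to end with scikit-learn. This is a simplified simulation under stated assumptions: the "oracle" is just the held-out label array, standing in for the human annotator, and the seed size, batch size, and round count are illustrative.

```python
# Minimal pool-based active learning loop (sketch, not production code).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Step 1: seed set + unlabeled pool (y plays the role of the oracle).
labeled = list(rng.choice(len(X), size=100, replace=False))
pool = [i for i in range(len(X)) if i not in set(labeled)]

model = LogisticRegression(max_iter=1000)
for _ in range(5):                        # steps 2-7, repeated for 5 rounds
    model.fit(X[labeled], y[labeled])     # (re)train on current labels
    probs = model.predict_proba(X[pool])
    # Step 3: entropy-based uncertainty score for every pooled example.
    ent = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    # Step 4: query the k most uncertain examples.
    k = 50
    queried = [pool[i] for i in np.argsort(ent)[-k:]]
    # Steps 5-6: "annotate" them and fold them into the training set.
    labeled.extend(queried)
    pool = [i for i in pool if i not in set(queried)]

print(f"labeled: {len(labeled)}, pool: {len(pool)}")  # labeled: 350, pool: 1650
```

In a real system the `labeled.extend(queried)` step is where annotators supply labels; everything else is identical.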
Active Learning Loop
Unlabeled Pool (10,000 examples) → Uncertainty Scoring (model scores each sample) → Select Top-K (100 most uncertain) → Human Annotates (oracle labels selected) → Retrain Model (add labels, update weights) → back to Uncertainty Scoring
Query Strategies
- Uncertainty Sampling: highest-entropy predictions
- Query by Committee: ensemble disagreement
- Diversity Sampling: cover unexplored input space
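Query by committee can be made concrete with vote entropy: score each unlabeled example by how evenly a committee of models splits its votes. The snippet below uses hard-coded toy votes; in practice each row would come from a trained ensemble member. Function and variable names are illustrative.

```python
import numpy as np

def vote_entropy(votes: np.ndarray, n_classes: int) -> np.ndarray:
    """votes: (n_models, n_examples) array of predicted class ids.
    Returns the entropy of the committee's vote split per example."""
    scores = np.zeros(votes.shape[1])
    for c in range(n_classes):
        frac = (votes == c).mean(axis=0)           # vote share for class c
        nz = frac > 0
        scores[nz] -= frac[nz] * np.log(frac[nz])  # entropy contribution
    return scores

# 3-model committee, 4 unlabeled examples, 3 classes
votes = np.array([
    [0, 0, 1, 2],
    [0, 1, 1, 0],
    [0, 2, 1, 1],
])
scores = vote_entropy(votes, n_classes=3)
print(scores.argmax())  # → 1: the example where all three models disagree
```

Examples 0 and 2 get a score of zero (unanimous committee), so they would never be queried under this strategy.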
Label efficiency vs. random sampling: 64% fewer labels for equivalent model performance.
Real-World Example
A medical imaging company used active learning to build a tumor detection model. Starting from a seed set of 200 radiologist-labeled scans, they ran 10 active learning rounds, selecting the 100 most uncertain scans per round for radiologist review. After 1,000 actively-selected labels (10 rounds × 100, on top of the shared seed set), the model reached 91% sensitivity, performance that a random labeling approach achieved only after 2,800 labeled scans. The active learning approach required 1,800 fewer labeled scans; at roughly one radiologist-hour per scan, that is about 1,800 radiologist-hours saved, or $180,000 in annotation cost reduction at $100/hour radiologist time.
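The savings arithmetic in the example can be checked directly (assuming, as the example does, roughly one radiologist-hour per scan):

```python
# Back-of-envelope check of the savings figures in the example above.
active_labels = 10 * 100          # 10 rounds x 100 scans queried
random_labels = 2_800             # labels random sampling needed
saved_scans = random_labels - active_labels
hours_per_scan = 1                # assumed review time per scan
rate_per_hour = 100               # USD, radiologist time
savings = saved_scans * hours_per_scan * rate_per_hour
print(saved_scans, f"${savings:,}")  # → 1800 $180,000
```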
Common Mistakes
- ✕ Using only uncertainty sampling without diversity—high-uncertainty samples can cluster in a narrow region of input space; diversity sampling ensures broad coverage
- ✕ Not retraining the model after each batch—stale model predictions produce suboptimal uncertainty estimates and waste annotation budget on redundant examples
- ✕ Applying active learning when high-quality labeled data is already abundant—active learning benefits are largest when labels are scarce; it adds complexity without benefit when data is plentiful
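One common remedy for the first mistake is a hybrid batch strategy: shortlist the most uncertain examples, cluster them, and take the most uncertain example from each cluster so the batch spans the input space. A minimal sketch using scikit-learn; the function name and shortlist factor are illustrative choices, not a standard recipe.

```python
import numpy as np
from sklearn.cluster import KMeans

def diverse_uncertain_batch(X, entropy, k, shortlist=5):
    """Pick k examples: cluster the k*shortlist most uncertain points
    and take the single most uncertain point from each cluster."""
    cand = np.argsort(entropy)[-k * shortlist:]     # uncertainty shortlist
    labels = KMeans(n_clusters=k, n_init=10,
                    random_state=0).fit_predict(X[cand])
    batch = []
    for c in range(k):
        members = cand[labels == c]
        batch.append(members[np.argmax(entropy[members])])  # per-cluster pick
    return np.array(batch)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))          # toy feature vectors
entropy = rng.uniform(size=500)        # toy uncertainty scores
batch = diverse_uncertain_batch(X, entropy, k=10)
print(len(batch))  # → 10 distinct, spread-out examples
```

Compared with taking the raw top-10 by entropy, the clustered batch trades a little uncertainty for coverage, which is usually a good deal when labels are expensive.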
Related Terms
Data Labeling
Data labeling (annotation) is the process of adding ground truth labels to raw data—images, text, audio—that supervised machine learning models use as training signal to learn the desired task.
Human-in-the-Loop
Human-in-the-loop (HITL) AI keeps humans actively involved in model decisions—reviewing uncertain predictions, correcting errors, and providing ongoing feedback—ensuring AI systems remain accurate, safe, and aligned with human judgment.
Annotation Quality
Annotation quality refers to the accuracy, consistency, and completeness of human-generated labels applied to training data, directly determining how well supervised machine learning models learn to perform their intended tasks.
MLOps
MLOps (Machine Learning Operations) applies DevOps principles to ML systems—combining engineering practices for model development, deployment, monitoring, and retraining into a disciplined operational lifecycle.
Experiment Tracking
Experiment tracking records the parameters, metrics, code versions, and artifacts of every ML training run, enabling reproducibility, systematic comparison of approaches, and traceability from production models back to their training conditions.