What is Active Learning?

    A machine learning approach where the model selectively queries a human annotator to label the most informative examples, maximizing learning efficiency per labeled sample.

    Definition

    Active learning is a training strategy where the model participates in selecting which data points should be labeled next, rather than training on a randomly sampled labeled dataset. The core idea is that not all training examples are equally informative — some examples, when labeled and added to the training set, improve model performance much more than others. By strategically selecting the most informative examples for labeling, active learning can achieve the same model quality with significantly fewer labeled examples, reducing annotation costs.

    In the LLM fine-tuning context, active learning typically works in iterative cycles. The model is first trained on a small seed set of labeled examples. It then scores a pool of unlabeled examples using an uncertainty or informativeness criterion, selects the most informative candidates, and presents them to human annotators for labeling. The newly labeled examples are added to the training set, the model is retrained, and the cycle repeats until a quality target is met or the annotation budget is exhausted.
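
    The sketch below makes this loop concrete. It is a minimal illustration, not a production pipeline: train, score_informativeness, and request_labels are hypothetical stand-ins for your own training, scoring, and annotation steps.

        def active_learning_loop(seed_labeled, unlabeled_pool, budget, batch_size=200):
            # Train an initial model on the labeled seed set.
            labeled = list(seed_labeled)
            model = train(labeled)  # hypothetical training helper
            while budget > 0 and unlabeled_pool:
                # Score every unlabeled example by informativeness
                # (e.g. model uncertainty), most informative first.
                scored = sorted(unlabeled_pool,
                                key=lambda x: score_informativeness(model, x),
                                reverse=True)
                # Select the top-k candidates within the remaining budget.
                k = min(batch_size, budget, len(unlabeled_pool))
                batch = scored[:k]
                # Send the batch to human annotators (hypothetical helper).
                labeled.extend(request_labels(batch))
                for x in batch:
                    unlabeled_pool.remove(x)
                budget -= k
                # Retrain on the expanded labeled set and repeat.
                model = train(labeled)
            return model, labeled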

    Active learning selection strategies include uncertainty sampling (selecting examples where the model is most uncertain), diversity sampling (selecting examples that are maximally different from each other and from the existing training set), expected model change (selecting examples that would cause the largest gradient updates), and committee-based approaches (selecting examples where multiple models disagree). Each strategy has different strengths depending on the task and data distribution.
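
    The two functions below sketch the committee and diversity strategies in NumPy. They are illustrative, and the function names are my own rather than from any particular library.

        import numpy as np

        def vote_entropy(committee_probs):
            # Committee disagreement: each row is one model's predicted class
            # distribution for the same example. Models vote for their top
            # class, and the entropy of the vote distribution measures
            # how strongly the committee disagrees.
            votes = np.argmax(committee_probs, axis=1)
            counts = np.bincount(votes, minlength=committee_probs.shape[1])
            p = counts / counts.sum()
            p = p[p > 0]
            return float(-(p * np.log(p)).sum())

        def farthest_point_selection(embeddings, k):
            # Diversity sampling: greedily pick k examples that are maximally
            # spread out in embedding space (each pick is the point farthest
            # from everything already chosen).
            chosen = [0]
            dists = np.linalg.norm(embeddings - embeddings[0], axis=1)
            for _ in range(k - 1):
                nxt = int(np.argmax(dists))
                chosen.append(nxt)
                dists = np.minimum(dists,
                                   np.linalg.norm(embeddings - embeddings[nxt], axis=1))
            return chosen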

    Why It Matters

    Annotation is the primary cost bottleneck in LLM fine-tuning. High-quality labeled data for specialized domains (medical, legal, financial) can cost $10-50 per example when annotated by domain experts. Active learning can reduce the number of labeled examples needed by 50-80% compared to random sampling, directly translating to proportional cost savings.

    Beyond cost savings, active learning improves data quality by focusing annotation effort on the examples that matter most. Instead of annotating hundreds of easy, redundant examples that the model already handles well, annotators spend their time on challenging edge cases and ambiguous examples that the model needs help with. This produces a training set that is optimally informative, yielding better model performance per dollar of annotation spend.

    How It Works

    The active learning loop has five phases. (1) Initialization: a small seed set (50-200 examples) is labeled and used to train an initial model. (2) Scoring: the model processes a large pool of unlabeled examples and assigns an informativeness score to each. For uncertainty sampling, this is typically the entropy of the model's output distribution or the difference between the top two class probabilities. (3) Selection: the top-k most informative examples are selected for annotation. (4) Annotation: human annotators label the selected examples. (5) Retraining: the model is retrained on the expanded labeled dataset.
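
    As a concrete illustration of the scoring phase, the two functions below compute the entropy and margin criteria just mentioned for a single example's predicted class distribution. The names and score conventions are illustrative.

        import numpy as np

        def entropy_score(probs):
            # Predictive entropy: higher means the output distribution is
            # flatter, i.e. the model is less certain about this example.
            p = np.clip(probs, 1e-12, 1.0)
            return float(-(p * np.log(p)).sum())

        def margin_score(probs):
            # Margin sampling: a small gap between the top two class
            # probabilities means high uncertainty. Negate the gap so that
            # a higher score always means "more informative".
            top_two = np.sort(probs)[-2:]
            return float(top_two[0] - top_two[1])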

    This cycle repeats until convergence — the point where adding more labeled examples does not significantly improve model performance. In practice, active learning often reaches 90% of full-dataset performance using only 20-30% of the labels, with diminishing returns beyond that point.
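
    One simple way to operationalize that stopping point is to track a validation metric after each retraining round and stop once improvements stay below a threshold for several consecutive rounds. The patience and min_delta values below are illustrative defaults, not figures from the text.

        def has_converged(val_scores, patience=3, min_delta=0.002):
            # val_scores: validation accuracy after each retraining round.
            # Stop when each of the last `patience` rounds improved accuracy
            # by less than `min_delta`, i.e. returns have diminished.
            if len(val_scores) <= patience:
                return False
            recent = val_scores[-(patience + 1):]
            gains = [b - a for a, b in zip(recent, recent[1:])]
            return all(g < min_delta for g in gains)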

    Example Use Case

    A legal AI startup needs to fine-tune a contract analysis model but has budget for only 2,000 annotated examples (at $25 each, $50K total). Using active learning, they start with 200 seed examples and run nine active learning cycles of 200 examples each (200 + 9 × 200 = 2,000 total). By strategically selecting contracts with unusual clauses, ambiguous language, and edge cases, they achieve the same accuracy as a randomly sampled dataset of 6,000 examples, which would have cost $150K at the same rate. That saves $100K in annotation costs while building a model that handles difficult contracts better.

    Key Takeaways

    • Active learning strategically selects the most informative examples for human annotation.
    • It can reduce labeling costs by 50-80% compared to random sampling while maintaining model quality.
    • Common selection strategies include uncertainty sampling, diversity sampling, and committee disagreement.
    • The approach works in iterative cycles of scoring, selecting, annotating, and retraining.
    • Active learning produces higher-quality training sets by focusing on challenging, informative examples.

    How Ertas Helps

    Ertas Data Suite supports active learning workflows in its Label stage, helping teams prioritize which examples to annotate based on model uncertainty, maximizing the value of every annotated example before fine-tuning in Ertas Studio.
