Emergent Abilities
Definition
Emergent abilities are qualitative capability jumps that appear in LLMs as they cross certain scale thresholds—measured in parameters, training compute, or training data. The phenomenon was highlighted in the paper 'Emergent Abilities of Large Language Models' (Wei et al., 2022): small models perform near-randomly on tasks like multi-step arithmetic, logical reasoning, and few-shot translation, while larger models suddenly achieve strong performance on the same tasks, with little warning from smaller-scale trends. Examples include chain-of-thought reasoning (reasoning by 'thinking out loud'), arithmetic word problems, and analogy completion. The emergence of these capabilities without explicit training on them suggests that scale enables qualitative leaps in reasoning capacity.
Why It Matters
Emergent abilities explain why teams often need to test their applications with the most capable available models before concluding a use case is impossible. A task that fails completely with a smaller model may work excellently with a frontier model—not because of better prompting, but because the underlying capability only exists at sufficient scale. For 99helpers, this means that complex customer support queries requiring multi-step reasoning, subtle policy interpretation, or nuanced troubleshooting may only be solvable with frontier models, while simpler FAQ-style queries work equally well with smaller, cheaper models. The emergent ability phenomenon also means model capability benchmarks don't always linearly predict application performance.
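The tiered approach described above—frontier models for complex reasoning, cheaper models for FAQ-style queries—can be sketched as a simple router. The model names and the keyword heuristic are illustrative assumptions, not a real classifier or API:

```python
# Hypothetical capability-tiered routing for support queries: simple
# FAQ-style queries go to a cheaper model, queries that appear to need
# multi-step reasoning go to a frontier model. Model names and the
# keyword-based complexity check are illustrative assumptions.

FRONTIER_MODEL = "gpt-4o"        # assumed frontier-tier model name
BUDGET_MODEL = "gpt-3.5-turbo"   # assumed budget-tier model name

# Crude stand-in for a real complexity classifier.
COMPLEX_MARKERS = ("why", "root cause", "troubleshoot", "policy", "compare")

def pick_model(query: str) -> str:
    """Route a query to a model tier based on a crude complexity check."""
    text = query.lower()
    if any(marker in text for marker in COMPLEX_MARKERS):
        return FRONTIER_MODEL
    return BUDGET_MODEL

print(pick_model("How do I reset my password?"))           # budget tier
print(pick_model("Why does my config break after reset?")) # frontier tier
```

In practice the routing signal would come from an intent classifier or a cheap first-pass model rather than keyword matching, but the cost/capability trade-off is the same.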
How It Works
The mechanisms behind emergence are not fully understood, but several explanations have been proposed. One view: complex tasks require multiple subtasks, each of which must meet a threshold level of competence; smaller models fail on all subtasks and thus achieve near-zero overall performance; larger models exceed competence thresholds on all subtasks simultaneously, producing sudden success. Another view: apparent emergence may be an artifact of evaluation metrics—using exact match metrics can mask gradual underlying improvements. Regardless of mechanism, the practical implication is that capability on complex reasoning tasks improves non-linearly with scale, and the specific threshold varies by task.
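The multiplicative-subtask view above can be made concrete with a toy simulation: if a task requires k subtasks to all succeed, overall accuracy is roughly the product of per-subtask accuracies, so smoothly improving subtask competence still produces a flat-then-sharp overall curve. The sigmoid competence function and the scale values are illustrative assumptions, not fitted to any real model family:

```python
# Toy simulation of the multiplicative-subtask explanation of emergence:
# per-subtask competence grows smoothly with (log) scale, but overall
# task accuracy — the product of k independent subtask accuracies —
# stays near zero until every factor is high, then rises sharply.
import math

def subtask_accuracy(scale: float) -> float:
    """Assumed smooth per-subtask competence as a function of log-scale."""
    return 1 / (1 + math.exp(-(scale - 5)))

def task_accuracy(scale: float, k: int = 8) -> float:
    """Overall accuracy when all k subtasks must succeed independently."""
    return subtask_accuracy(scale) ** k

for scale in [2, 4, 6, 8, 10]:
    print(f"scale={scale:2d}: subtask={subtask_accuracy(scale):.2f} "
          f"task={task_accuracy(scale):.3f}")
```

Running this shows subtask competence climbing gradually while task accuracy stays near zero and then jumps within a narrow scale window—the signature shape of an emergent ability.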
Emergent Abilities — Capability vs Model Scale
* Instruction following emerges early with fine-tuning but not reliably in pure base models until larger scales.
* Emergence is characterized by near-zero performance below a threshold, followed by rapid improvement.
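The evaluation-metric explanation can also be illustrated numerically: with a smoothly improving per-token accuracy p, a partial-credit metric (the expected fraction of correct tokens, just p) rises gradually, while exact match on an n-token answer requires every token to be right and so looks emergent. The numbers below are illustrative, not measurements from any real model:

```python
# Sketch of the metric-artifact view: the same smooth per-token
# improvement looks gradual under partial credit but "emergent" under
# exact match, because exact match demands all n tokens be correct.

def exact_match_rate(p: float, n_tokens: int = 10) -> float:
    """Probability that all n tokens of the answer are correct."""
    return p ** n_tokens

for p in [0.3, 0.5, 0.7, 0.9, 0.97]:
    print(f"per-token={p:.2f}  partial-credit={p:.2f}  "
          f"exact-match={exact_match_rate(p):.4f}")
```

Under partial credit the curve is a straight, smooth climb; under exact match it hugs zero until per-token accuracy is very high, which is why the choice of metric can change whether a capability appears emergent at all.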
Real-World Example
A 99helpers team builds a feature that analyzes a customer's conversation history and automatically identifies root cause patterns across multiple sessions. Testing with GPT-3.5-Turbo, the analysis is superficial—it identifies obvious keywords but misses subtle patterns. Testing with GPT-4o, the same prompt produces insightful multi-step causal reasoning: 'These 5 sessions share a common pattern: the user consistently struggles with the configuration step after resetting, suggesting the reset flow may not preserve their settings as expected.' This qualitative reasoning improvement represents an emergent capability that only exists at GPT-4 scale.
Common Mistakes
- ✕ Assuming a task is impossible for LLMs based on testing with a small model—test with frontier models before concluding a reasoning task is beyond LLM capability.
- ✕ Over-interpreting specific emergence thresholds as fundamental laws—exact thresholds vary by task, evaluation method, and training data mixture.
- ✕ Expecting smooth performance improvements with model size—emergent abilities mean performance can stay flat and then jump sharply rather than improving monotonically.
Related Terms
Scaling Laws
Scaling laws describe predictable mathematical relationships between LLM performance and scale—model size, training data, and compute—enabling researchers to forecast model capability improvements before building larger models.
Large Language Model (LLM)
A large language model is a neural network trained on vast amounts of text that learns to predict and generate human-like text, enabling tasks like answering questions, writing, translation, and code generation.
Foundation Model
A foundation model is a large AI model trained on broad, diverse data that can be adapted to a wide range of downstream tasks through fine-tuning or prompting, serving as a base for many applications.
LLM Benchmark
An LLM benchmark is a standardized evaluation dataset and scoring methodology used to compare model capabilities across tasks like reasoning, knowledge, coding, and language understanding.
Reasoning Model
A reasoning model is an LLM that explicitly 'thinks' through problems in an extended internal reasoning process before producing a final answer, trading inference speed for dramatically improved accuracy on complex tasks.