Large Language Models (LLMs)

Scaling Laws

Definition

Scaling laws for neural language models (Kaplan et al., 2020; Hoffmann et al., 2022 'Chinchilla') establish that LLM performance (measured by cross-entropy loss on held-out text) improves as a power law with three factors: model parameters (N), training tokens (D), and training compute (C = 6ND for a single pass). The Chinchilla scaling laws showed that many earlier models were 'over-parameterized and under-trained'—they had too many parameters relative to training data. The optimal parameter-to-token ratio is approximately 20 training tokens per parameter. These laws let AI labs predict what model size and training data volume will achieve a target loss before committing the resources to training, making large-scale AI development more tractable.
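The compute approximation above can be checked numerically. A minimal sketch (the 70B model size is illustrative, chosen to match the Chinchilla example later in this article):

```python
def training_compute_flops(params: float, tokens: float) -> float:
    """Approximate training compute: C = 6 * N * D FLOPs for a single pass."""
    return 6 * params * tokens

def chinchilla_optimal_tokens(params: float) -> float:
    """Chinchilla heuristic: roughly 20 training tokens per parameter."""
    return 20 * params

# Illustrative 70B-parameter model at the Chinchilla-optimal ratio
n = 70e9
d = chinchilla_optimal_tokens(n)   # 1.4e12 tokens (1.4T)
c = training_compute_flops(n, d)   # ~5.9e23 FLOPs
print(f"D = {d:.2e} tokens, C = {c:.2e} FLOPs")
```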

Why It Matters

Scaling laws are foundational for AI product planning. They explain why compute investment produces predictable capability improvements (within a training paradigm), why a smaller model trained on more data can outperform a larger model trained on less data, and why simply making models bigger without more data has diminishing returns. For enterprises evaluating AI vendors, scaling laws help contextualize claims about model improvements: a 2x increase in parameters with the same training data yields modest gains, while a 2x increase in both parameters and training tokens yields substantially larger gains. Understanding scaling laws also helps explain why open-source models often lag behind frontier models: they are typically trained on fewer tokens with less compute, not necessarily with fewer parameters.

How It Works

The Chinchilla optimal scaling prescription: for a given compute budget C, train a model of size N = (C / (6 * 20))^0.5 parameters on D = 20N tokens. Example: a compute budget of ~6×10^23 FLOPs implies N ≈ 70B parameters and D ≈ 1.4T tokens—roughly the Chinchilla model itself. In practice, inference costs also matter: a smaller model requires fewer resources to serve at scale, making it more economical to over-train a smaller model to achieve a target quality level. Meta's Llama models follow this philosophy—Llama-3-8B is trained on ~15T tokens (far more than Chinchilla optimal), producing an inference-efficient model with strong quality.
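The prescription above can be sketched as a small helper that solves N = (C / (6 * r))^0.5 for a given budget, with the 20 tokens-per-parameter ratio as the default:

```python
import math

def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Given C = 6*N*D and D = r*N, solve for N = sqrt(C / (6*r)) and D = r*N."""
    n = math.sqrt(compute_flops / (6 * tokens_per_param))
    d = tokens_per_param * n
    return n, d

n, d = chinchilla_optimal(6e23)
print(f"N ≈ {n/1e9:.0f}B params, D ≈ {d/1e12:.1f}T tokens")
# For C = 6e23 FLOPs: N ≈ 71B, D ≈ 1.4T (roughly the Chinchilla model)
```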

Scaling Laws — Compute vs Model Loss (Power-Law)

[Figure: relative loss versus relative compute budget. Loss falls from 100 at baseline to 72, 52, 37, and 27 at 10×, 100×, 1000×, and 10000× compute, a predictable power law in which each 10× of compute yields roughly a 28% loss reduction. The trend scales with parameters, data, and compute.]

Key insight: Doubling model size, dataset size, or compute each yields predictable, consistent improvements in model quality — allowing researchers to forecast performance before training.

Real-World Example

A CTO evaluating LLM providers asks: 'Why does Provider A's 70B model sometimes outperform Provider B's 175B model?' Scaling laws provide the answer: Provider A trained its 70B model on 2T tokens (above the Chinchilla-optimal 1.4T for 70B), while Provider B trained its 175B model on 1T tokens (far below the Chinchilla-optimal 3.5T for 175B). The smaller but more thoroughly trained model can indeed outperform the larger, under-trained one on many benchmarks—a counter-intuitive result that training efficiency principles explain.
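The two providers' regimes can be compared by tokens-per-parameter ratio, using the (hypothetical) numbers from the example above:

```python
def tokens_per_param(tokens: float, params: float) -> float:
    """Training tokens per model parameter; Chinchilla-optimal is ~20."""
    return tokens / params

provider_a = tokens_per_param(2e12, 70e9)    # ~28.6 tokens/param (above ~20)
provider_b = tokens_per_param(1e12, 175e9)   # ~5.7 tokens/param (well below ~20)
print(f"Provider A: {provider_a:.1f}, Provider B: {provider_b:.1f}")
```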

Common Mistakes

  • Treating scaling laws as guarantees—they describe statistical trends across many training runs; individual model training can deviate from predictions.
  • Ignoring inference costs when applying scaling laws—a model optimal for training compute may not be optimal for total cost of ownership including serving.
  • Applying pre-2022 scaling laws (Kaplan) without accounting for the Chinchilla correction—Chinchilla showed that the optimal token-to-parameter ratio is much higher than earlier estimates.
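The inference-cost point can be made concrete with a back-of-envelope total-FLOPs comparison. A sketch with a purely illustrative lifetime serving volume, using the standard ~2N FLOPs per generated token for inference:

```python
def total_flops(params: float, train_tokens: float, serve_tokens: float) -> float:
    """Training cost (~6*N*D) plus lifetime inference cost (~2*N per served token)."""
    return 6 * params * train_tokens + 2 * params * serve_tokens

serve = 1e13  # assume 10T tokens served over the model's lifetime (illustrative)
big   = total_flops(70e9, 1.4e12, serve)   # Chinchilla-optimal 70B model
small = total_flops(8e9, 15e12, serve)     # over-trained 8B model (Llama-3-style)
print(f"70B: {big:.2e} FLOPs total, 8B: {small:.2e} FLOPs total")
```

At high serving volumes the over-trained small model wins on total compute even though its training run is far past Chinchilla-optimal.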

