Large Language Models (LLMs)

Pre-Training

Definition

Pre-training is the first and most computationally intensive phase of building an LLM. The model—typically a transformer with billions of parameters—is trained on hundreds of billions to trillions of tokens of internet text, books, code, and other data using self-supervised learning: the training objective is simply to predict the next token given all preceding tokens. No human labels are required. Through this process, the model learns grammar, facts, reasoning patterns, and implicit world knowledge encoded in the training distribution. Pre-training a frontier model like GPT-4 or Llama-3 requires thousands of GPUs running for months and costs tens to hundreds of millions of dollars, making it accessible only to well-funded organizations.
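The self-supervised setup can be sketched in a few lines: every position in a token sequence yields a free (context, next-token) training pair, which is why no human labels are needed. A minimal illustration, with words standing in for the subword token IDs a real model uses:

```python
# Toy illustration of the self-supervised pre-training objective:
# each position in a sequence becomes a training example for free.

def next_token_examples(tokens):
    """Turn one sequence into (context, target) pairs for next-token prediction."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

sequence = ["The", "cat", "sat", "on", "the", "mat"]
for context, target in next_token_examples(sequence):
    print(context, "->", target)
```

A six-token sequence yields five training pairs; scaled to trillions of tokens, this is the entire labeling scheme of pre-training.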

Why It Matters

Pre-training is what gives LLMs their breadth. A pre-trained model has seen text from medical literature, legal documents, software documentation, scientific papers, and casual conversation—giving it the foundation to assist in all these domains without domain-specific training. For AI application builders, understanding pre-training explains why LLMs generalize surprisingly well to new tasks (they have already 'read' about nearly everything), why they have knowledge cutoffs (training data has a cutoff date), and why fine-tuning works (you are updating a model that already understands language deeply, not teaching it from scratch).

How It Works

Pre-training uses the causal language modeling (CLM) objective for decoder-only transformers: given the tokens t_1, t_2, ..., t_{n-1}, predict t_n. The loss is the cross-entropy between the predicted and actual next tokens, averaged over all positions. Training data is assembled from diverse sources (Common Crawl web data, books, GitHub, Wikipedia, etc.), deduplicated, filtered for quality, and tokenized. The model's billions of parameters are initialized randomly and updated by stochastic gradient descent (in practice, adaptive variants such as AdamW) over many training steps. Data mixture proportions (how much code, web text, books, etc.) significantly affect model capabilities and are a key research variable. The result is a 'base model' that is good at text completion but not necessarily instruction-following.
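The position-averaged cross-entropy loss described above can be sketched as follows; the vocabulary size and probability values here are made up for illustration:

```python
import math

def clm_loss(probs_per_position, targets):
    """Average cross-entropy between the model's predicted next-token
    distributions and the tokens that actually came next."""
    losses = [-math.log(probs[t]) for probs, t in zip(probs_per_position, targets)]
    return sum(losses) / len(losses)

# Hypothetical 4-token vocabulary; each row is the model's predicted
# distribution over the next token at one position of the sequence.
probs = [
    [0.7, 0.1, 0.1, 0.1],      # model is fairly confident the next token is 0
    [0.25, 0.25, 0.25, 0.25],  # model is maximally uncertain
]
targets = [0, 2]  # the tokens that actually came next
print(clm_loss(probs, targets))
```

The loss shrinks as the model assigns higher probability to the true next tokens; minimizing it over trillions of positions is the whole training signal.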

Pre-Training Pipeline

Training corpus composition:
  • Web crawl (CommonCrawl): ~15T tokens
  • Books & literature: ~2T tokens
  • Code (GitHub): ~1.5T tokens
  • Wikipedia & encyclopedias: ~0.8T tokens
  • Scientific papers: ~0.6T tokens

Pipeline stages:
  1. Massive corpus: trillions of tokens from diverse sources
  2. Tokenization: text → subword tokens (BPE / SentencePiece)
  3. Self-supervised objective: predict the next token given the preceding context
  4. Gradient descent: billions of weight updates over weeks on GPUs
  5. Base model: general-purpose weights, ready for fine-tuning

Pre-training costs $10M–$100M+ in compute. The resulting weights are then either released publicly or used for fine-tuning.
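The tokenization stage is commonly done with byte-pair encoding (BPE), which starts from characters and repeatedly merges the most frequent adjacent symbol pair. A simplified sketch on a toy word-frequency corpus; production tokenizers such as SentencePiece handle raw bytes, symbol boundaries, and ties far more carefully:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across a corpus of space-separated symbols."""
    pairs = Counter()
    for word, freq in words.items():
        syms = word.split()
        for a, b in zip(syms, syms[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of the pair with one merged symbol (simplified)."""
    merged = " ".join(pair)
    return {word.replace(merged, "".join(pair)): freq for word, freq in words.items()}

# Toy word frequencies, with characters as the initial symbols.
corpus = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
for _ in range(3):  # learn 3 merge rules
    pair = most_frequent_pair(corpus)
    corpus = merge_pair(corpus, pair)
    print("merged:", pair)
```

Each learned merge becomes one entry in the tokenizer's vocabulary; frequent fragments like "est" end up as single tokens while rare words stay split into smaller pieces.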

Real-World Example

Llama-3-8B was pre-trained by Meta on 15 trillion tokens of publicly available text. A 99helpers developer downloads the pre-trained base model and observes its behavior: prompted with 'The fastest way to reset your API key is', it completes the sentence in a style that mimics documentation, but the output is unpredictable—sometimes useful, sometimes a continuation that doesn't fit the product. This demonstrates that base models are text completers, not instruction followers. The subsequent fine-tuning step transforms this powerful text completer into an assistant that responds helpfully to questions.
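The 'text completer, not instruction follower' behavior can be mimicked with a deliberately tiny stand-in: a bigram counter built from a documentation-style corpus. The corpus and prompt below are made up, and real base models are transformers over subword tokens, but the completion-style behavior is analogous:

```python
from collections import Counter, defaultdict

# A toy 'base model': bigram counts over a documentation-like corpus.
corpus = ("to reset your api key open the settings page and click regenerate . "
          "to reset your password open the settings page and click reset .").split()

bigrams = defaultdict(Counter)
for a, b in zip(corpus, corpus[1:]):
    bigrams[a][b] += 1

def complete(prompt, n=8):
    """Greedily append the most likely next word, n times."""
    words = prompt.split()
    for _ in range(n):
        nxt = bigrams.get(words[-1])
        if not nxt:
            break
        words.append(nxt.most_common(1)[0][0])
    return " ".join(words)

print(complete("to reset your"))  # continues in the corpus's documentation style
```

Prompted with text that resembles its training data, the model continues in that style; given an actual question, it would likewise just continue the question rather than answer it, which is exactly the gap fine-tuning closes.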

Common Mistakes

  • Attempting to use a base (pre-trained only) model for production chat applications—base models are text completers, not assistants, and produce unreliable chat outputs.
  • Assuming pre-training knowledge is uniformly distributed—topics heavily represented in training data (English web content) are better represented than niche or non-English topics.
  • Confusing pre-training with training from scratch—fine-tuning builds on a pre-trained model's weights; pre-training creates those weights from random initialization.
