Pre-Training
Definition
Pre-training is the first and most computationally intensive phase of building an LLM. The model—typically a transformer with billions of parameters—is trained on hundreds of billions to trillions of tokens of internet text, books, code, and other data using self-supervised learning: the training objective is simply to predict the next token given all preceding tokens. No human labels are required. Through this process, the model learns grammar, facts, reasoning patterns, and implicit world knowledge encoded in the training distribution. Pre-training a frontier model like GPT-4 or Llama-3 requires thousands of GPUs running for months and costs tens to hundreds of millions of dollars, making it accessible only to well-funded organizations.
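The "no human labels" point can be seen in a minimal sketch: given raw text, the training pairs fall out of the text itself, because the label for each position is just the next token (illustrative word-level tokens; real models use subword tokens and ids).

```python
# Hedged sketch: self-supervised next-token prediction needs no labels.
# The "label" for each position is simply the token that follows it.
tokens = ["The", "cat", "sat", "on", "the", "mat"]

# Build (context, target) training pairs from the raw text alone.
pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

for context, target in pairs:
    print(context, "->", target)
```

Every span of text yields one training example per position, which is why a trillion-token corpus provides an enormous supervision signal for free.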
Why It Matters
Pre-training is what gives LLMs their breadth. A pre-trained model has seen text from medical literature, legal documents, software documentation, scientific papers, and casual conversation—giving it the foundation to assist in all these domains without domain-specific training. For AI application builders, understanding pre-training explains why LLMs generalize surprisingly well to new tasks (they have already 'read' about nearly everything), why they have knowledge cutoffs (training data has a cutoff date), and why fine-tuning works (you are updating a model that already understands language deeply, not teaching it from scratch).
How It Works
Pre-training uses the causal language modeling (CLM) objective for decoder-only transformers: given tokens [t₁, t₂, ..., tₙ₋₁], predict tₙ. The loss is cross-entropy between predicted and actual next tokens, averaged over all positions. Training data is assembled from diverse sources (Common Crawl web data, books, GitHub, Wikipedia, etc.), deduplicated, filtered for quality, and tokenized. The model's billions of parameters are initialized randomly and updated by stochastic gradient descent over many training steps. Data mixture proportions—how much code, web text, books, etc.—significantly affect model capabilities and are a key research variable. The result is a 'base model' that is good at text completion but not necessarily instruction-following.
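The cross-entropy loss can be computed by hand for a toy case. This sketch (tiny made-up vocabulary and logits, purely illustrative) averages the negative log-likelihood of the true next token over positions, which is exactly the CLM objective described above:

```python
import math

def clm_loss(logits, targets):
    """Average cross-entropy over positions: -log p(target | preceding tokens).
    `logits` holds one score vector over the vocabulary per position;
    `targets` holds the id of the actual next token at each position."""
    total = 0.0
    for scores, target in zip(logits, targets):
        # Softmax over the vocab, then negative log-likelihood of the true token.
        z = sum(math.exp(s) for s in scores)
        total += -math.log(math.exp(scores[target]) / z)
    return total / len(targets)

# Toy vocabulary of 3 tokens, two positions; the model favors the correct token
# at both positions, so the loss is below the uniform baseline of ln(3).
logits = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]]
targets = [0, 1]
print(clm_loss(logits, targets))
```

Training drives this quantity down across billions of positions; in practice the optimizer is a variant of stochastic gradient descent (e.g. Adam) applied to mini-batches.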
[Diagram: Pre-training pipeline — training corpus composition]
Real-World Example
Llama-3-8B was pre-trained by Meta on 15 trillion tokens of publicly available text. A 99helpers developer downloads the pre-trained base model and observes its behavior: prompted with 'The fastest way to reset your API key is', it completes the sentence in a style that mimics documentation, but the output is unpredictable—sometimes useful, sometimes a continuation that doesn't fit the product. This demonstrates that base models are text completers, not instruction followers. The subsequent fine-tuning step transforms this powerful text completer into an assistant that responds helpfully to questions.
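Why base models behave like text completers can be shown with a deliberately tiny stand-in: a bigram table "pre-trained" by counting next-word frequencies in a toy corpus (all names and data here are illustrative, not Llama-3's actual mechanics). Like a real base model at vastly larger scale, it continues the statistical pattern of the prompt rather than following an instruction:

```python
# Toy stand-in for a base model: a bigram table counted from "training text".
# Real base models do the same thing at scale, which is why their completions
# mimic the style of the prompt instead of answering it.
corpus = "the fastest way to reset your api key is to open settings".split()

# "Pre-train": count next-word frequencies over the corpus.
bigrams = {}
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams.setdefault(prev, {}).setdefault(nxt, 0)
    bigrams[prev][nxt] += 1

def complete(word, steps=4):
    """Greedily extend the prompt with the most frequent next word."""
    out = [word]
    for _ in range(steps):
        choices = bigrams.get(out[-1])
        if not choices:
            break
        out.append(max(choices, key=choices.get))  # greedy next-token choice
    return " ".join(out)

print(complete("reset"))  # continues in the corpus's documentation style
```

The completion is fluent within the training distribution but indifferent to the user's intent; instruction tuning is what closes that gap.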
Common Mistakes
- ✕ Attempting to use a base (pre-trained only) model for production chat applications—base models are text completers, not assistants, and produce unreliable chat outputs.
- ✕ Assuming pre-training knowledge is uniformly distributed—topics heavily represented in training data (English web content) are better represented than niche or non-English topics.
- ✕ Confusing pre-training with training from scratch—fine-tuning builds on a pre-trained model's weights; pre-training creates those weights from random initialization.
Related Terms
Fine-Tuning
Fine-tuning adapts a pre-trained LLM to a specific task or domain by continuing training on a smaller, curated dataset, improving performance on targeted use cases while preserving general language capabilities.
Instruction Tuning
Instruction tuning fine-tunes a pre-trained language model on diverse (instruction, response) pairs, transforming a text-completion model into an assistant that reliably follows human directives.
Foundation Model
A foundation model is a large AI model trained on broad, diverse data that can be adapted to a wide range of downstream tasks through fine-tuning or prompting, serving as a base for many applications.
Large Language Model (LLM)
A large language model is a neural network trained on vast amounts of text that learns to predict and generate human-like text, enabling tasks like answering questions, writing, translation, and code generation.
Base Model
A base model is a pre-trained LLM that has learned language from massive text data but has not yet been instruction-tuned or aligned—capable of text completion but not reliably following instructions or behaving as an assistant.