Corpus
Definition
A corpus (plural: corpora) is a collection of text (or speech) assembled for linguistic analysis or machine learning purposes. Corpora range from small, manually annotated datasets like SQuAD (100,000 QA pairs) to massive web-crawled collections like Common Crawl (petabytes of multilingual text). Corpora can be raw (unannotated), annotated (with linguistic labels), parallel (aligned translations), or specialized (domain-specific text). Key characteristics include size (token count), domain coverage, language(s), time period, and annotation quality. The composition of a pre-training corpus is one of the most consequential decisions in building a language model.
Why It Matters
Corpus quality and composition directly determine NLP model capabilities and biases. A model trained on customer support corpora will understand support-specific language but struggle with medical or legal text. Pre-training corpora bias models toward the demographics, topics, and time periods represented—a corpus heavy with English news from 2015-2020 produces a model with corresponding knowledge gaps. For practitioners, selecting the right corpus for fine-tuning or evaluation is a critical design decision that affects whether a model generalizes to production use cases.
How It Works
Corpus construction involves: (1) source selection (web crawls, books, news, academic papers, domain-specific documents); (2) data acquisition (crawling, licensing, or scraping with permission); (3) deduplication (exact and near-duplicate removal using MinHash or edit distance); (4) quality filtering (removing low-quality, offensive, or irrelevant content using heuristics and classifiers); (5) language identification; and (6) tokenization and normalization for training. Large corpus collections like The Pile, RedPajama, and ROOTS aggregate multiple publicly available text sources for transparent, reproducible language model pre-training.
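The near-duplicate removal step (3) can be sketched with a toy MinHash implementation. This is a minimal illustration in pure Python, not a production pipeline (real systems use libraries like datasketch and LSH banding to avoid pairwise comparisons); the shingle size and signature length here are arbitrary choices.

```python
import hashlib
import re

def shingles(text, n=5):
    """Word n-grams ('shingles') used to compare documents as sets."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def minhash_signature(shingle_set, num_hashes=64):
    """Summarize a document by its minimum hash value under each of
    num_hashes simulated hash functions (a salted MD5 per function)."""
    return [
        min(int(hashlib.md5(f"{i}:{s}".encode()).hexdigest(), 16)
            for s in shingle_set)
        for i in range(num_hashes)
    ]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc1 = "the quick brown fox jumps over the lazy dog near the river bank"
doc2 = "the quick brown fox jumps over the lazy dog near the river bend"
doc3 = "corpora are collections of text assembled for linguistic analysis"

sig1 = minhash_signature(shingles(doc1))
sig2 = minhash_signature(shingles(doc2))
sig3 = minhash_signature(shingles(doc3))

print(estimated_jaccard(sig1, sig2))  # high: near-duplicates
print(estimated_jaccard(sig1, sig3))  # near zero: unrelated documents
```

Documents whose estimated similarity exceeds a threshold (often around 0.8) would be collapsed to a single copy before training.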
Corpus — Composition by Domain
Domain breakdown (% of total tokens; interactive chart in original): ~444B total tokens across 93 languages and 5 unique domains, deduplicated.
Real-World Example
A healthcare AI company builds a domain-specific corpus for pre-training a clinical NLP model. They collect 50 million de-identified clinical notes, 5 million medical journal abstracts, and 2 million radiology reports—totaling 12 billion tokens of clinical text. After deduplication and quality filtering, they pre-train a BERT variant on this corpus. The clinical model achieves 94% accuracy on clinical NER vs. 79% for general BERT, demonstrating how domain-specific corpus composition dramatically improves downstream clinical NLP performance.
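A quick back-of-envelope audit makes composition decisions like this concrete. The document counts and the 12-billion-token total below come from the example above; the per-source average token lengths are illustrative assumptions chosen to match that total, not reported figures.

```python
# Per-source corpus composition audit (averages are assumed for illustration).
sources = {
    # source: (documents, assumed avg tokens per document)
    "clinical_notes":    (50_000_000, 200),
    "journal_abstracts": (5_000_000, 300),
    "radiology_reports": (2_000_000, 250),
}

total = sum(docs * avg for docs, avg in sources.values())
for name, (docs, avg) in sources.items():
    tokens = docs * avg
    print(f"{name:18s} {tokens / 1e9:5.2f}B tokens ({100 * tokens / total:4.1f}%)")
print(f"{'total':18s} {total / 1e9:5.2f}B tokens")
```

Such a table makes it obvious that clinical notes dominate the token budget, which is exactly the kind of imbalance a corpus designer should decide on deliberately rather than discover after pre-training.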
Common Mistakes
- ✕ Assuming large automatically-collected corpora are unbiased—web corpora over-represent English, certain demographics, and content types with heavy online presence
- ✕ Neglecting deduplication before training—duplicate documents cause models to memorize specific text rather than generalizing
- ✕ Using test sets from the same distribution as training data—models overfit to corpus-specific patterns, so reported performance overestimates real-world generalization
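A related check worth automating is train/test contamination: flagging test documents that share long word n-grams with the training corpus. This is a minimal sketch; the 8-gram size is an illustrative choice, and the documents are made-up examples.

```python
import re

def ngrams(text, n=8):
    """Return the set of word n-grams in a document (lowercased)."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

train_docs = [
    "the patient presented with acute chest pain radiating to the left arm",
    "mri of the lumbar spine shows mild degenerative disc disease at l4 l5",
]
test_docs = [
    # Near-verbatim copy of a training document: should be flagged.
    "the patient presented with acute chest pain radiating to the left arm today",
    # Unrelated text: should pass.
    "follow up visit for hypertension medication was adjusted accordingly by team",
]

train_grams = set().union(*(ngrams(d) for d in train_docs))
for doc in test_docs:
    contaminated = bool(ngrams(doc) & train_grams)
    print(contaminated, doc[:40])
```

In practice the same idea is scaled up with hashed n-grams or Bloom filters so the training corpus's n-gram set fits in memory.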
Related Terms
Linguistic Annotation
Linguistic annotation is the process of manually or automatically labeling text with linguistic information—such as POS tags, parse trees, named entities, or coreference chains—creating training data for supervised NLP models.
Word Embeddings
Word embeddings are dense vector representations of words in a continuous numerical space where semantically similar words are positioned close together, enabling machines to understand word meaning through geometry.
BERT
BERT (Bidirectional Encoder Representations from Transformers) is a transformer-based language model pre-trained on massive text corpora that revolutionized NLP by providing rich contextual word representations that dramatically improved nearly every language task.
Natural Language Processing (NLP)
Natural Language Processing (NLP) is the field of AI focused on enabling computers to understand, interpret, and generate human language—powering applications from chatbots and search engines to translation and sentiment analysis.
Text Preprocessing
Text preprocessing is the collection of transformations applied to raw text before NLP model training or inference—including tokenization, normalization, and filtering—determining the quality and consistency of model inputs.