Natural Language Processing (NLP)

Corpus

Definition

A corpus (plural: corpora) is a collection of text (or speech) assembled for linguistic analysis or machine learning purposes. Corpora range from small, manually annotated datasets like SQuAD (100,000 QA pairs) to massive web-crawled collections like Common Crawl (petabytes of multilingual text). Corpora can be raw (unannotated), annotated (with linguistic labels), parallel (aligned translations), or specialized (domain-specific text). Key characteristics include size (token count), domain coverage, language(s), time period, and annotation quality. The composition of a pre-training corpus is one of the most consequential decisions in building a language model.
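The characteristics above (token count, vocabulary, frequency profile) can be computed with a few lines of Python. This is a minimal sketch using whitespace tokenization and two hypothetical documents; real pipelines use a trained tokenizer such as BPE or WordPiece.

```python
from collections import Counter

# Hypothetical two-document corpus; a real corpus would be read from files.
corpus = [
    "The patient presented with acute chest pain.",
    "Treatment was started immediately after admission.",
]

# Approximate tokens via lowercased whitespace splitting.
tokens = [tok for doc in corpus for tok in doc.lower().split()]

total_tokens = len(tokens)              # corpus size in tokens
vocab_size = len(set(tokens))           # number of unique tokens
top_tokens = Counter(tokens).most_common(3)  # most frequent tokens
```

On web-scale corpora the same statistics are computed in a streaming or distributed fashion, but the quantities measured are the same.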

Why It Matters

Corpus quality and composition directly determine NLP model capabilities and biases. A model trained on customer support corpora will understand support-specific language but struggle with medical or legal text. Pre-training corpora bias models toward the demographics, topics, and time periods represented—a corpus heavy with English news from 2015-2020 produces a model with corresponding knowledge gaps. For practitioners, selecting the right corpus for fine-tuning or evaluation is a critical design decision that affects whether a model generalizes to production use cases.

How It Works

Corpus construction involves: (1) source selection (web crawls, books, news, academic papers, domain-specific documents); (2) data acquisition (crawling, licensing, or scraping with permission); (3) deduplication (exact and near-duplicate removal using MinHash or edit distance); (4) quality filtering (removing low-quality, offensive, or irrelevant content using heuristics and classifiers); (5) language identification; and (6) tokenization and normalization for training. Large corpus collections like The Pile, RedPajama, and ROOTS aggregate multiple publicly available text sources for transparent, reproducible language model pre-training.
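Step (3), near-duplicate removal, can be sketched with a toy MinHash implementation. The shingle size, number of hash functions, and example documents below are illustrative assumptions, not values from any specific pipeline; production systems use optimized libraries rather than per-shingle MD5.

```python
import hashlib

def shingles(text, n=3):
    """Character n-gram shingles of a lowercased document."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def minhash_signature(doc, num_hashes=64):
    """One minimum per seeded hash function over the shingle set."""
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingles(doc))
        for seed in range(num_hashes)
    ]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions ~ Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = "The quick brown fox jumps over the lazy dog."
b = "The quick brown fox jumped over the lazy dog."  # near-duplicate of a
c = "Corpus construction involves many filtering steps."

sim_ab = estimated_jaccard(minhash_signature(a), minhash_signature(b))
sim_ac = estimated_jaccard(minhash_signature(a), minhash_signature(c))
```

Documents whose estimated similarity exceeds a chosen threshold (often around 0.8) are treated as duplicates, and only one copy is kept. Because signatures are short fixed-length vectors, they can be bucketed with locality-sensitive hashing to avoid comparing every document pair.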

Corpus — Composition by Domain

Domain breakdown (token counts):

  • Web / Common Crawl: 410B tokens
  • Books & Literature: 26B tokens
  • Wikipedia: 4.1B tokens
  • Academic Papers: 2.3B tokens
  • Code Repositories: 1.8B tokens

Corpus statistics: ~444B total tokens; 93 languages; 5 unique domains; deduplicated.

Common corpus uses: LLM pre-training, NLP benchmarking, vocabulary building, language modeling.

Real-World Example

A healthcare AI company builds a domain-specific corpus for pre-training a clinical NLP model. They collect 50 million de-identified clinical notes, 5 million medical journal abstracts, and 2 million radiology reports—totaling 12 billion tokens of clinical text. After deduplication and quality filtering, they pre-train a BERT variant on this corpus. The clinical model achieves 94% accuracy on clinical NER vs. 79% for general BERT, demonstrating how domain-specific corpus composition dramatically improves downstream clinical NLP performance.

Common Mistakes

  • Assuming large automatically-collected corpora are unbiased—web corpora over-represent English, certain demographics, and content types with heavy online presence
  • Neglecting deduplication before training—duplicate documents cause models to memorize specific text rather than generalizing
  • Using test sets from the same distribution as training data—models overfit to corpus-specific patterns and their performance overestimates real-world generalization
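The second and third mistakes interact: if duplicates survive into the corpus and the test split is drawn from the same pool, copies of a test document can leak into training. A minimal sketch of exact deduplication before splitting, using hypothetical documents:

```python
import hashlib

docs = [
    "Aspirin reduces fever.",
    "aspirin reduces fever.",   # duplicate differing only in case
    "Ibuprofen is an NSAID.",
    "Aspirin reduces fever.",   # exact duplicate
]

def normalize(doc):
    """Lowercase and collapse whitespace before hashing."""
    return " ".join(doc.lower().split())

# Keep only the first occurrence of each normalized document.
seen, deduped = set(), []
for doc in docs:
    h = hashlib.sha256(normalize(doc).encode()).hexdigest()
    if h not in seen:
        seen.add(h)
        deduped.append(doc)

# Split only AFTER deduplication, so no document (or a duplicate of it)
# can appear in both train and test.
train, test = deduped[:-1], deduped[-1:]
```

Exact hashing catches verbatim copies only; near-duplicates (paraphrases, boilerplate variants) require fuzzy methods such as MinHash, and genuinely held-out test sets should come from a different source or time period than the training data.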
