Natural Language Processing (NLP)

BERT

Definition

BERT, introduced by Google in 2018, is a transformer encoder pre-trained with two self-supervised objectives: Masked Language Modeling (MLM, predicting randomly masked tokens) and Next Sentence Prediction (NSP, classifying whether two sentences follow each other). Training on Wikipedia and BooksCorpus with MLM forces BERT to develop deep bidirectional contextual representations—the representation of each word depends on all surrounding words, not just preceding ones. Fine-tuning BERT on small labeled datasets achieved state-of-the-art results on 11 NLP benchmarks simultaneously, establishing the pre-train-then-fine-tune paradigm that dominates NLP.
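The MLM masking scheme can be sketched in plain Python. This is an illustrative toy, not BERT's actual implementation; the 15% selection rate and the 80/10/10 replacement split (mask / random token / keep unchanged) follow the original BERT paper, while the function and variable names here are invented for the example:

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """BERT-style MLM masking: select ~15% of positions; of those,
    replace 80% with [MASK], 10% with a random vocabulary token,
    and leave 10% unchanged. Returns (masked tokens, labels), where
    labels is None at positions that contribute no loss."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)                    # model must predict the original token
            r = rng.random()
            if r < 0.8:
                masked.append("[MASK]")           # 80%: hard mask
            elif r < 0.9:
                masked.append(rng.choice(vocab))  # 10%: random replacement
            else:
                masked.append(tok)                # 10%: kept, but still predicted
        else:
            labels.append(None)                   # unselected: no loss here
            masked.append(tok)
    return masked, labels

tokens = ["the", "model", "reads", "text", "bidirectionally"]
masked, labels = mask_tokens(tokens, vocab=["cat", "runs", "blue"], seed=4)
print(masked, labels)
```

Keeping 10% of selected tokens unchanged forces the model to produce useful representations for every input token, since it cannot tell which positions will be scored.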

Why It Matters

BERT's release democratized high-performance NLP by providing a pre-trained foundation that non-researchers could fine-tune for specific tasks with modest labeled datasets. Before BERT, achieving strong NLP performance required large task-specific datasets and extensive feature engineering. BERT demonstrated that transfer learning from self-supervised pre-training works powerfully for language, paralleling what ImageNet pre-training did for computer vision. Understanding BERT is fundamental to understanding modern NLP, as a large share of high-performance NLP systems build on BERT-family models or their successors.

How It Works

BERT uses a transformer encoder stack (BERT-base has 12 layers, 768 hidden dimensions, 12 attention heads; BERT-large has 24 layers, 1024 dimensions, 16 heads). Tokenization uses WordPiece subword segmentation with a roughly 30,000-token vocabulary. Input sequences include [CLS] (classification) and [SEP] (separator) special tokens. For classification tasks, the [CLS] token representation is fine-tuned with a linear classification head. For sequence labeling (NER), all token representations are used. Fine-tuning updates all model weights on task-specific data, typically for 2-4 epochs on small datasets.
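The [CLS]/[SEP] input packing described above can be sketched as follows. This is a simplified illustration with invented names (a real tokenizer also produces token ids and an attention mask); segment ids 0 and 1 distinguish the two sentences, matching BERT's token-type embeddings:

```python
def build_bert_input(tokens_a, tokens_b=None, max_len=512):
    """Pack one or two token lists into BERT's input layout:
    [CLS] A... [SEP] for single sentences, or
    [CLS] A... [SEP] B... [SEP] for sentence pairs.
    Segment ids are 0 for the first sentence (and its delimiters)
    and 1 for the second."""
    tokens = ["[CLS]"] + list(tokens_a) + ["[SEP]"]
    segment_ids = [0] * len(tokens)
    if tokens_b is not None:
        tokens += list(tokens_b) + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)
    if len(tokens) > max_len:
        raise ValueError("sequence exceeds BERT's 512-token limit")
    return tokens, segment_ids

toks, segs = build_bert_input(["how", "are", "you"], ["i", "am", "fine"])
print(toks)  # ['[CLS]', 'how', 'are', 'you', '[SEP]', 'i', 'am', 'fine', '[SEP]']
print(segs)  # [0, 0, 0, 0, 0, 1, 1, 1, 1]
```

The final-layer hidden state at the [CLS] position is what a classification head consumes during fine-tuning.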

BERT — Bidirectional Transformer Pre-Training

Input tokens (with masking): [CLS] The model [MASK] text bidirectionally [SEP]

Bidirectional self-attention: the [MASK] position attends to its left context and right context simultaneously, letting the model recover the masked word (here, "reads"). This is unlike GPT, which attends left-to-right only.

Pre-training objectives:

  • Masked Language Model (MLM): predict masked tokens using left + right context
  • Next Sentence Prediction (NSP): predict whether sentence B follows sentence A

Output: contextual embeddings. Each token gets a 768-dim (BERT-base) or 1024-dim (BERT-large) vector encoding its meaning in context. These representations are fine-tuned for classification, NER, QA, and more.

Real-World Example

A legal tech company fine-tunes BERT-base on 3,000 labeled contract clauses to classify 18 clause types (limitation of liability, termination, confidentiality, etc.). The fine-tuned BERT model achieves 94% classification accuracy, dramatically outperforming the previous TF-IDF + logistic regression baseline at 76%. Contract review time decreases from 4 hours to 20 minutes per document as BERT automatically locates and labels all clause types. The model processes 100-page contracts in under 30 seconds.

Common Mistakes

  • Fine-tuning with too many epochs on small datasets—small datasets require early stopping to prevent overfitting and catastrophic forgetting of pre-trained knowledge
  • Using base BERT for tasks that require generation—BERT is an encoder-only model and cannot generate text; use T5 or GPT for generation
  • Applying BERT to very long documents without chunking—BERT has a 512-token context limit that must be handled by sliding window or hierarchical approaches
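A sliding-window approach to the 512-token limit can be sketched as below. This is an illustrative helper with invented names; a production version would also reserve positions for [CLS] and [SEP] in each chunk:

```python
def sliding_windows(token_ids, window=512, stride=256):
    """Split a long token sequence into overlapping chunks so each
    fits within BERT's context limit. A stride smaller than the
    window gives overlap, so no span is seen only at a chunk
    boundary."""
    if len(token_ids) <= window:
        return [token_ids]
    chunks = []
    for start in range(0, len(token_ids), stride):
        chunks.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break  # last window already reaches the end
    return chunks

chunks = sliding_windows(list(range(1000)), window=512, stride=256)
print([len(c) for c in chunks])  # [512, 512, 488]
```

Per-chunk predictions are then aggregated, e.g. by averaging classification logits or by taking, for each token, the prediction from the chunk where it sits furthest from a boundary.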
