Transformer
Definition
The transformer architecture, introduced by Vaswani et al. in 2017's 'Attention Is All You Need,' revolutionized natural language processing by replacing sequential processing (RNNs, LSTMs) with parallel attention-based processing. A transformer consists of stacked encoder and/or decoder layers, each containing a multi-head self-attention sublayer (which lets every token attend to every other token in the sequence) and a feed-forward sublayer. Decoder-only transformers (used in GPT, Claude, Llama) are optimized for text generation: each token attends to all preceding tokens via causal (masked) self-attention, predicting the next token autoregressively. The architecture's parallelism enables training on massive datasets using thousands of GPUs simultaneously.
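The self-attention computation inside each layer can be sketched as scaled dot-product attention. Below is a minimal single-head NumPy illustration; for brevity it omits the learned query/key/value projection matrices (W_q, W_k, W_v) that a real transformer applies, and real models run several such heads in parallel:

```python
import numpy as np

def self_attention(x):
    """Single-head self-attention over a sequence of token vectors.

    x: (seq_len, d) token embeddings. This sketch skips the learned
    W_q/W_k/W_v projections and uses x directly as queries, keys,
    and values."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                            # pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ x                                       # each output mixes all tokens

x = np.random.default_rng(0).normal(size=(4, 8))             # 4 tokens, dim 8
out = self_attention(x)
```

Because every token's output is a weighted mix of every other token, all positions can be computed in parallel rather than one at a time as in an RNN.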
Why It Matters
The transformer is the architectural foundation that made modern LLMs possible. Its self-attention mechanism captures relationships between distant words in a single pass—understanding that 'it' in a sentence refers to an entity mentioned paragraphs earlier—something sequential models struggled with. For AI application builders, understanding transformers at a high level helps explain LLM behaviors: why context window limits exist (self-attention scales quadratically with sequence length), why position in the prompt matters (positional encodings give order information), and why attention heads can be specialized for different relationship types (syntax, coreference, etc.).
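The positional encodings mentioned above can be the fixed sinusoids from the original paper; many modern models use learned or rotary position embeddings instead, but the classic sinusoidal scheme is easy to sketch:

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Fixed sinusoidal positional encodings (Vaswani et al., 2017).

    Each position gets a unique pattern of sines and cosines at
    geometrically spaced frequencies; the result is added to the
    token embeddings to inject word order."""
    pos = np.arange(seq_len)[:, None]        # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]     # (1, d_model / 2)
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.empty((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)             # even dims: sine
    pe[:, 1::2] = np.cos(angles)             # odd dims: cosine
    return pe

pe = sinusoidal_positions(seq_len=6, d_model=8)
```

Without some such encoding, self-attention is order-blind: it would treat a prompt as a bag of tokens.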
How It Works
In a decoder-only transformer, processing a prompt proceeds as follows:
1. The input text is tokenized into integer IDs.
2. An embedding layer converts each token ID to a dense vector.
3. Positional encodings are added to preserve word order.
4. The combined representations pass through N identical transformer layers, each computing multi-head self-attention (which tokens attend to which) and applying a feed-forward transformation.
5. The final layer's output is projected to a vocabulary-sized logit vector, and a softmax converts the logits to a probability distribution over all possible next tokens.
The model samples from this distribution to select the next token, appends it to the sequence, and repeats until a stop token is generated.
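The sampling loop at the end of that pipeline can be sketched as a toy. Here `toy_next_token_logits` is a hypothetical stand-in for the stacked transformer layers (it returns random logits so the loop is runnable); the softmax-then-sample-then-append structure is the actual autoregressive procedure:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB_SIZE = 16
STOP_TOKEN = 0

def toy_next_token_logits(token_ids):
    """Stand-in for the stacked transformer layers. A real model would
    run embeddings + positional encodings + N attention/FFN layers over
    token_ids; here random logits keep the sketch self-contained."""
    return rng.normal(size=VOCAB_SIZE)

def generate(prompt_ids, max_new_tokens=10):
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = toy_next_token_logits(ids)              # vocabulary-sized logits
        probs = np.exp(logits - logits.max())
        probs = probs / probs.sum()                      # softmax over the vocabulary
        next_id = int(rng.choice(VOCAB_SIZE, p=probs))   # sample the next token
        ids.append(next_id)
        if next_id == STOP_TOKEN:                        # stop token ends generation
            break
    return ids

out = generate([3, 5, 7])
```

Note that each new token requires a full forward pass over the sequence so far, which is why generation is paid for token by token even though the prompt itself is processed in one parallel pass.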
Transformer Architecture — Encoder-Decoder with Attention + FFN
[Diagram: the encoder (input embeddings + positional encoding feeding an Encoder Block ×N stack) processes the input sequence, e.g. "The cat sat"; cross-attention connects it to the decoder (output embeddings + positional encoding feeding a Decoder Block ×N stack, then Linear → Softmax → Token), which generates the output tokens.]
Decoder-only LLMs (GPT, Claude, Llama) omit the encoder entirely and use masked self-attention so each token can only attend to prior tokens during generation.
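That causal masking can be shown directly: before the softmax, scores for future positions are set to negative infinity, so their attention weights become exactly zero. A minimal single-head sketch (again omitting the learned projections):

```python
import numpy as np

def causal_self_attention(x):
    """Masked (causal) self-attention: token i attends only to tokens 0..i."""
    n, d = x.shape
    scores = x @ x.T / np.sqrt(d)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # positions after i
    scores[future] = -np.inf                            # blocked before softmax
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)               # zeros where masked
    return w, w @ x

x = np.random.default_rng(1).normal(size=(4, 8))        # 4 tokens, dim 8
w, out = causal_self_attention(x)
# The first token can only attend to itself, so w[0] is [1, 0, 0, 0].
```

This is what lets a decoder-only model train on next-token prediction: at every position it sees only the past, matching the situation at generation time.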
Real-World Example
When a 99helpers chatbot prompt contains the sentence 'The webhook should be configured in Settings, then go to the Integrations tab where it lives,' the transformer's self-attention mechanism connects 'it' back to 'webhook' via attention weights—both tokens end up with high mutual attention scores. This coreference resolution happens implicitly through learned attention patterns, enabling the LLM to generate: 'Navigate to Settings > Integrations tab to find the webhook configuration' rather than losing track of the referent.
Common Mistakes
- ✕ Assuming all transformers are the same—encoder-only (BERT), encoder-decoder (T5), and decoder-only (GPT) architectures have different capabilities and use cases.
- ✕ Treating context window limits as artificial restrictions—the quadratic attention complexity means doubling context length quadruples the computation, creating genuine scaling constraints.
- ✕ Confusing the transformer architecture with a specific model—transformer is the architecture; GPT-4, Claude, and Llama are specific models built on this architecture.
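The quadratic cost behind the context-window point above is easy to verify with back-of-the-envelope arithmetic, since the attention score matrix has one entry per pair of tokens:

```python
# Self-attention compares every token with every other token, so the
# score matrix has n * n entries per head per layer.
def attention_pairs(n_tokens):
    return n_tokens * n_tokens

assert attention_pairs(8_000) == 4 * attention_pairs(4_000)    # 2x length -> 4x work
assert attention_pairs(16_000) == 16 * attention_pairs(4_000)  # 4x length -> 16x work
```

Production systems mitigate this with tricks such as KV caching and optimized attention kernels, but the underlying pairwise comparison remains quadratic.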
Related Terms
Large Language Model (LLM)
A large language model is a neural network trained on vast amounts of text that learns to predict and generate human-like text, enabling tasks like answering questions, writing, translation, and code generation.
Attention Mechanism
The attention mechanism allows neural networks to dynamically focus on relevant parts of the input sequence when processing each token, enabling LLMs to capture long-range relationships and contextual meaning.
Tokenization
Tokenization converts raw text into a sequence of tokens—the basic units an LLM processes—using algorithms like byte-pair encoding that split text into subword pieces rather than whole words or individual characters.
Context Length
Context length is the maximum number of tokens an LLM can process in a single request—encompassing the system prompt, conversation history, retrieved documents, and the response—determining how much information the model can consider simultaneously.
Self-Attention
Self-attention is the core operation in transformer models where each token computes a weighted representation of all other tokens in the sequence, enabling every position to directly access information from every other position.
Ready to build your AI chatbot?
Put these concepts into practice with 99helpers — no code required.
Start free trial →