Transformer
Definition
The transformer architecture, introduced by Vaswani et al. in 2017's 'Attention Is All You Need,' revolutionized natural language processing by replacing sequential processing (RNNs, LSTMs) with parallel attention-based processing. A transformer consists of stacked encoder and/or decoder layers, each containing a multi-head self-attention sublayer (which lets every token attend to every other token in the sequence) and a feed-forward sublayer. Decoder-only transformers (used in GPT, Claude, Llama) are optimized for text generation: each token attends to all preceding tokens via causal (masked) self-attention, predicting the next token autoregressively. The architecture's parallelism enables training on massive datasets using thousands of GPUs simultaneously.
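The self-attention computation inside each layer can be sketched as scaled dot-product attention. Below is a minimal single-head NumPy illustration; for brevity it omits the learned query/key/value projection matrices (W_q, W_k, W_v) that a real transformer applies, and real models run several such heads in parallel:

```python
import numpy as np

def self_attention(x):
    """Single-head self-attention over a sequence of token vectors.

    x: (seq_len, d) token embeddings. This sketch skips the learned
    W_q/W_k/W_v projections and uses x directly as queries, keys,
    and values."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                            # pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ x                                       # each output mixes all tokens

x = np.random.default_rng(0).normal(size=(4, 8))             # 4 tokens, dim 8
out = self_attention(x)
```

Because every token's output is a weighted mix of every other token, all positions can be computed in parallel rather than one at a time as in an RNN.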
Why It Matters
The transformer is the architectural foundation that made modern LLMs possible. Its self-attention mechanism captures relationships between distant words in a single pass—understanding that 'it' in a sentence refers to an entity mentioned paragraphs earlier—something sequential models struggled with. For AI application builders, understanding transformers at a high level helps explain LLM behaviors: why context window limits exist (self-attention scales quadratically with sequence length), why position in the prompt matters (positional encodings give order information), and why attention heads can be specialized for different relationship types (syntax, coreference, etc.).
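The positional encodings mentioned above can be the fixed sinusoids from the original paper; many modern models use learned or rotary position embeddings instead, but the classic sinusoidal scheme is easy to sketch:

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Fixed sinusoidal positional encodings (Vaswani et al., 2017).

    Each position gets a unique pattern of sines and cosines at
    geometrically spaced frequencies; the result is added to the
    token embeddings to inject word order."""
    pos = np.arange(seq_len)[:, None]        # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]     # (1, d_model / 2)
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.empty((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)             # even dims: sine
    pe[:, 1::2] = np.cos(angles)             # odd dims: cosine
    return pe

pe = sinusoidal_positions(seq_len=6, d_model=8)
```

Without some such encoding, self-attention is order-blind: it would treat a prompt as a bag of tokens.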
How It Works
In a decoder-only transformer, processing a prompt proceeds as follows:
1. The input text is tokenized into integer IDs.
2. An embedding layer converts each token ID to a dense vector.
3. Positional encodings are added to preserve word order.
4. The combined representations pass through N identical transformer layers, each computing multi-head self-attention (which tokens attend to which) and applying a feed-forward transformation.
5. The final layer's output is projected to a vocabulary-sized logit vector, and a softmax converts the logits to a probability distribution over all possible next tokens.
The model samples from this distribution to select the next token, appends it to the sequence, and repeats until a stop token is generated.
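The sampling loop at the end of that pipeline can be sketched as a toy. Here `toy_next_token_logits` is a hypothetical stand-in for the stacked transformer layers (it returns random logits so the loop is runnable); the softmax-then-sample-then-append structure is the actual autoregressive procedure:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB_SIZE = 16
STOP_TOKEN = 0

def toy_next_token_logits(token_ids):
    """Stand-in for the stacked transformer layers. A real model would
    run embeddings + positional encodings + N attention/FFN layers over
    token_ids; here random logits keep the sketch self-contained."""
    return rng.normal(size=VOCAB_SIZE)

def generate(prompt_ids, max_new_tokens=10):
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = toy_next_token_logits(ids)              # vocabulary-sized logits
        probs = np.exp(logits - logits.max())
        probs = probs / probs.sum()                      # softmax over the vocabulary
        next_id = int(rng.choice(VOCAB_SIZE, p=probs))   # sample the next token
        ids.append(next_id)
        if next_id == STOP_TOKEN:                        # stop token ends generation
            break
    return ids

out = generate([3, 5, 7])
```

Note that each new token requires a full forward pass over the sequence so far, which is why generation is paid for token by token even though the prompt itself is processed in one parallel pass.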
Transformer Architecture — Encoder-Decoder with Attention + FFN
[Diagram: the encoder (input embeddings + positional encoding feeding an Encoder Block ×N stack) processes the input sequence, e.g. "The cat sat"; cross-attention connects it to the decoder (output embeddings + positional encoding feeding a Decoder Block ×N stack, then Linear → Softmax → Token), which generates the output tokens.]
Decoder-only LLMs (GPT, Claude, Llama) omit the encoder entirely and use masked self-attention so each token can only attend to prior tokens during generation.
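That causal masking can be shown directly: before the softmax, scores for future positions are set to negative infinity, so their attention weights become exactly zero. A minimal single-head sketch (again omitting the learned projections):

```python
import numpy as np

def causal_self_attention(x):
    """Masked (causal) self-attention: token i attends only to tokens 0..i."""
    n, d = x.shape
    scores = x @ x.T / np.sqrt(d)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # positions after i
    scores[future] = -np.inf                            # blocked before softmax
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)               # zeros where masked
    return w, w @ x

x = np.random.default_rng(1).normal(size=(4, 8))        # 4 tokens, dim 8
w, out = causal_self_attention(x)
# The first token can only attend to itself, so w[0] is [1, 0, 0, 0].
```

This is what lets a decoder-only model train on next-token prediction: at every position it sees only the past, matching the situation at generation time.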
Real-World Example
When a 99helpers chatbot prompt contains the sentence 'The webhook should be configured in Settings, then go to the Integrations tab where it lives,' the transformer's self-attention mechanism connects 'it' back to 'webhook' via attention weights—both tokens end up with high mutual attention scores. This coreference resolution happens implicitly through learned attention patterns, enabling the LLM to generate: 'Navigate to Settings > Integrations tab to find the webhook configuration' rather than losing track of the referent.
Common Mistakes
- ✕ Assuming all transformers are the same—encoder-only (BERT), encoder-decoder (T5), and decoder-only (GPT) architectures have different capabilities and use cases.
- ✕ Treating context window limits as artificial restrictions—the quadratic attention complexity means doubling context length quadruples the computation, creating genuine scaling constraints.
- ✕ Confusing the transformer architecture with a specific model—transformer is the architecture; GPT-4, Claude, and Llama are specific models built on this architecture.
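The quadratic cost behind the context-window point above is easy to verify with back-of-the-envelope arithmetic, since the attention score matrix has one entry per pair of tokens:

```python
# Self-attention compares every token with every other token, so the
# score matrix has n * n entries per head per layer.
def attention_pairs(n_tokens):
    return n_tokens * n_tokens

assert attention_pairs(8_000) == 4 * attention_pairs(4_000)    # 2x length -> 4x work
assert attention_pairs(16_000) == 16 * attention_pairs(4_000)  # 4x length -> 16x work
```

Production systems mitigate this with tricks such as KV caching and optimized attention kernels, but the underlying pairwise comparison remains quadratic.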
Related Terms
Large Language Model (LLM)
A large language model is a neural network trained on vast amounts of text that learns to predict and generate human-like text, enabling tasks like answering questions, writing, translation, and code generation.
Attention Mechanism
The attention mechanism allows neural networks to dynamically focus on relevant parts of the input sequence when processing each token, enabling LLMs to capture long-range relationships and contextual meaning.
Tokenization
Tokenization converts raw text into a sequence of tokens—the basic units an LLM processes—using algorithms like byte-pair encoding that split text into subword pieces rather than whole words or individual characters.
Context Length
Context length is the maximum number of tokens an LLM can process in a single request—encompassing the system prompt, conversation history, retrieved documents, and the response—determining how much information the model can consider simultaneously.
Self-Attention
Self-attention is the core operation in transformer models where each token computes a weighted representation of all other tokens in the sequence, enabling every position to directly access information from every other position.
Ready to build your AI chatbot?
Put these concepts into practice with 99helpers — no code required.
Start free trial →