Self-Attention
Definition
Self-attention is the mechanism by which each token in a transformer sequence 'looks at' all other tokens and decides how much information to incorporate from each. Unlike recurrent networks that process tokens sequentially (with information passing only forward), self-attention computes all pairwise interactions in parallel: for each token position, a query vector is computed that 'asks' what information is needed, key vectors from all positions answer that query, and value vectors carry the actual information to be mixed in. The similarity between each query and all keys determines how much of each position's value is included. Causal (masked) self-attention—used in decoder-only models like GPT and Llama—allows each token to attend only to itself and to earlier tokens, preserving the autoregressive generation property.
Why It Matters
Self-attention is what makes transformers uniquely powerful for language: the ability to directly model arbitrary long-range dependencies between tokens in a single computation step. Earlier sequential models (RNNs, LSTMs) had to pass information through hundreds of sequential time steps to connect distant tokens, often losing it along the way. Self-attention provides a direct, one-hop connection between any two positions regardless of distance. For AI practitioners, understanding self-attention explains both LLM strengths (excellent at capturing long-range context) and limitations (quadratic compute cost with sequence length, fixed context window, lost-in-the-middle phenomenon).
How It Works
Self-attention computation for one head: (1) project the input X into queries Q = XW_Q, keys K = XW_K, and values V = XW_V; (2) compute attention weights: A = softmax(QK^T / sqrt(d_k)), where dividing by sqrt(d_k) keeps the dot products from growing large and saturating the softmax; (3) compute the output: O = AV. The attention matrix A has shape (sequence_length × sequence_length): entry (i, j) is how much attention token i pays to token j, and each row sums to 1. For causal attention, scores at future positions are set to -infinity before the softmax (so they receive zero attention weight after it). Multi-head attention runs H independent attention heads with different projection matrices, then concatenates their outputs and applies a final output projection: this allows different heads to specialize in different relationship types (syntax, coreference, semantic similarity).
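The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the weight matrices are random, and the shapes (4 tokens, d_model = 8, d_k = 4) are chosen only to keep the example small.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_self_attention(X, W_Q, W_K, W_V):
    """X: (seq_len, d_model). Returns (output, attention_matrix) for one head."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V          # step 1: project into Q, K, V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # step 2: scaled dot-product scores
    # Causal mask: token i may only attend to positions j <= i.
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)   # -inf becomes 0 after softmax
    A = softmax(scores, axis=-1)                 # each row sums to 1
    return A @ V, A                              # step 3: mix values by attention

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 4
X = rng.standard_normal((seq_len, d_model))      # stand-in for token embeddings
W_Q, W_K, W_V = (rng.standard_normal((d_model, d_k)) for _ in range(3))

O, A = causal_self_attention(X, W_Q, W_K, W_V)
print(A.shape)  # (4, 4) — one attention weight per (query, key) pair
```

Note that the first row of A is [1, 0, 0, 0]: the first token can only attend to itself, and the entire upper triangle is zero, which is exactly the causal mask described above.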
[Figure: Self-Attention — Single Head, attention weights for the query token "sat". Each token plays three roles: its query (Q) asks "What should I attend to?", its key (K) answers "What do I offer to attend to?", and its value (V) carries "What information do I carry?". Result: "sat" attends most strongly to "cat" (55%), capturing the subject-verb relationship across the sequence.]
Real-World Example
A 99helpers developer analyzes attention patterns in their chatbot model to understand why it handles certain queries well. Using transformer-lens (a mechanistic interpretability library), they visualize attention patterns for: 'The integration broke after the last update. How do I fix it?' Attention analysis shows: when generating the word 'integration' in the response, the model heavily attends to 'integration' in the query. When generating 'update', it attends to 'last update' in the query. The self-attention mechanism automatically identifies the relevant entities for each part of the response—explaining why the model produces contextually accurate answers.
Common Mistakes
- ✕ Confusing self-attention with cross-attention—self-attention attends within one sequence; cross-attention (in encoder-decoder models) attends from one sequence to another.
- ✕ Assuming all attention heads do the same thing—research shows specialization: some heads track syntax, others coreference, others positional patterns.
- ✕ Treating attention weights as interpretable 'reasons'—attention weights are a computation mechanism, not a transparent explanation of what the model 'decided' to focus on.
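The first mistake above can be made concrete in code: self-attention and cross-attention share the same computation, differing only in where queries versus keys/values come from. In this NumPy sketch the weights are random and the sequence lengths (3 decoder tokens, 5 encoder tokens) are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q_source, kv_source, W_Q, W_K, W_V):
    """Self-attention when q_source is kv_source; cross-attention otherwise."""
    Q = q_source @ W_Q                 # queries come from one sequence...
    K, V = kv_source @ W_K, kv_source @ W_V  # ...keys/values from another (or the same)
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1)
    return A @ V

rng = np.random.default_rng(1)
d_model, d_k = 8, 4
decoder_states = rng.standard_normal((3, d_model))  # e.g. target-side tokens
encoder_states = rng.standard_normal((5, d_model))  # e.g. source-side tokens
W_Q, W_K, W_V = (rng.standard_normal((d_model, d_k)) for _ in range(3))

# Self-attention: decoder tokens attend within their own sequence.
self_out = attention(decoder_states, decoder_states, W_Q, W_K, W_V)
# Cross-attention: decoder tokens attend over the encoder's sequence.
cross_out = attention(decoder_states, encoder_states, W_Q, W_K, W_V)
```

Both calls return one output row per query token (shape (3, d_k)); the difference is that the cross-attention weight matrix is 3 × 5, spanning two different sequences.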
Related Terms
Transformer
The transformer is the neural network architecture underlying all modern LLMs, using self-attention mechanisms to process entire input sequences in parallel and capture long-range dependencies between words.
Attention Mechanism
The attention mechanism allows neural networks to dynamically focus on relevant parts of the input sequence when processing each token, enabling LLMs to capture long-range relationships and contextual meaning.
Multi-Head Attention
Multi-head attention runs multiple independent self-attention operations ('heads') in parallel, allowing the transformer to simultaneously capture different types of relationships between tokens from different representation subspaces.
Context Length
Context length is the maximum number of tokens an LLM can process in a single request—encompassing the system prompt, conversation history, retrieved documents, and the response—determining how much information the model can consider simultaneously.
Large Language Model (LLM)
A large language model is a neural network trained on vast amounts of text that learns to predict and generate human-like text, enabling tasks like answering questions, writing, translation, and code generation.