Self-Attention
Definition
Self-attention is the mechanism by which each token in a transformer sequence 'looks at' all other tokens and decides how much information to incorporate from each. Unlike recurrent networks that process tokens sequentially (with information passing only forward), self-attention computes all pairwise interactions in parallel: for each token position, a query vector is computed that 'asks' what information is needed, key vectors from all positions answer that query, and value vectors carry the actual information to be mixed in. The similarity between each query and all keys determines how much of each position's value is included. Causal (masked) self-attention—used in decoder-only models like GPT and Llama—allows each token to attend only to itself and to earlier tokens, preserving the autoregressive generation property.
Why It Matters
Self-attention is what makes transformers uniquely powerful for language: the ability to directly model arbitrary long-range dependencies between tokens in a single computation step. Earlier sequential models (RNNs, LSTMs) had to pass information through hundreds of sequential time steps to connect distant tokens, often losing it along the way. Self-attention provides a direct, one-hop connection between any two positions regardless of distance. For AI practitioners, understanding self-attention explains both LLM strengths (excellent at capturing long-range context) and limitations (quadratic compute cost with sequence length, fixed context window, lost-in-the-middle phenomenon).
How It Works
Self-attention computation for one head: (1) project the input X into queries Q = XW_Q, keys K = XW_K, and values V = XW_V; (2) compute attention weights: A = softmax(QK^T / sqrt(d_k)), where dividing by sqrt(d_k) keeps the dot products from growing large and saturating the softmax; (3) compute the output: O = AV. The attention matrix A has shape (sequence_length × sequence_length): entry (i, j) is how much attention token i pays to token j, and each row sums to 1. For causal attention, scores at future positions are set to -infinity before the softmax (so they receive zero attention weight after it). Multi-head attention runs H independent attention heads with different projection matrices, then concatenates their outputs and applies a final output projection: this allows different heads to specialize in different relationship types (syntax, coreference, semantic similarity).
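The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the weight matrices are random, and the shapes (4 tokens, d_model = 8, d_k = 4) are chosen only to keep the example small.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_self_attention(X, W_Q, W_K, W_V):
    """X: (seq_len, d_model). Returns (output, attention_matrix) for one head."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V          # step 1: project into Q, K, V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # step 2: scaled dot-product scores
    # Causal mask: token i may only attend to positions j <= i.
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)   # -inf becomes 0 after softmax
    A = softmax(scores, axis=-1)                 # each row sums to 1
    return A @ V, A                              # step 3: mix values by attention

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 4
X = rng.standard_normal((seq_len, d_model))      # stand-in for token embeddings
W_Q, W_K, W_V = (rng.standard_normal((d_model, d_k)) for _ in range(3))

O, A = causal_self_attention(X, W_Q, W_K, W_V)
print(A.shape)  # (4, 4) — one attention weight per (query, key) pair
```

Note that the first row of A is [1, 0, 0, 0]: the first token can only attend to itself, and the entire upper triangle is zero, which is exactly the causal mask described above.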
[Figure: Self-Attention — Single Head, attention weights for the query token "sat". Each token plays three roles: its query (Q) asks "What should I attend to?", its key (K) answers "What do I offer to attend to?", and its value (V) carries "What information do I carry?". Result: "sat" attends most strongly to "cat" (55%), capturing the subject-verb relationship across the sequence.]
Real-World Example
A 99helpers developer analyzes attention patterns in their chatbot model to understand why it handles certain queries well. Using transformer-lens (a mechanistic interpretability library), they visualize attention patterns for: 'The integration broke after the last update. How do I fix it?' Attention analysis shows: when generating the word 'integration' in the response, the model heavily attends to 'integration' in the query. When generating 'update', it attends to 'last update' in the query. The self-attention mechanism automatically identifies the relevant entities for each part of the response—explaining why the model produces contextually accurate answers.
Common Mistakes
- ✕ Confusing self-attention with cross-attention—self-attention attends within one sequence; cross-attention (in encoder-decoder models) attends from one sequence to another.
- ✕ Assuming all attention heads do the same thing—research shows specialization: some heads track syntax, others coreference, others positional patterns.
- ✕ Treating attention weights as interpretable 'reasons'—attention weights are a computation mechanism, not a transparent explanation of what the model 'decided' to focus on.
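The first mistake above can be made concrete in code: self-attention and cross-attention share the same computation, differing only in where queries versus keys/values come from. In this NumPy sketch the weights are random and the sequence lengths (3 decoder tokens, 5 encoder tokens) are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q_source, kv_source, W_Q, W_K, W_V):
    """Self-attention when q_source is kv_source; cross-attention otherwise."""
    Q = q_source @ W_Q                 # queries come from one sequence...
    K, V = kv_source @ W_K, kv_source @ W_V  # ...keys/values from another (or the same)
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1)
    return A @ V

rng = np.random.default_rng(1)
d_model, d_k = 8, 4
decoder_states = rng.standard_normal((3, d_model))  # e.g. target-side tokens
encoder_states = rng.standard_normal((5, d_model))  # e.g. source-side tokens
W_Q, W_K, W_V = (rng.standard_normal((d_model, d_k)) for _ in range(3))

# Self-attention: decoder tokens attend within their own sequence.
self_out = attention(decoder_states, decoder_states, W_Q, W_K, W_V)
# Cross-attention: decoder tokens attend over the encoder's sequence.
cross_out = attention(decoder_states, encoder_states, W_Q, W_K, W_V)
```

Both calls return one output row per query token (shape (3, d_k)); the difference is that the cross-attention weight matrix is 3 × 5, spanning two different sequences.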
Related Terms
Transformer
The transformer is the neural network architecture underlying all modern LLMs, using self-attention mechanisms to process entire input sequences in parallel and capture long-range dependencies between words.
Attention Mechanism
The attention mechanism allows neural networks to dynamically focus on relevant parts of the input sequence when processing each token, enabling LLMs to capture long-range relationships and contextual meaning.
Multi-Head Attention
Multi-head attention runs multiple independent self-attention operations ('heads') in parallel, allowing the transformer to simultaneously capture different types of relationships between tokens from different representation subspaces.
Context Length
Context length is the maximum number of tokens an LLM can process in a single request—encompassing the system prompt, conversation history, retrieved documents, and the response—determining how much information the model can consider simultaneously.
Large Language Model (LLM)
A large language model is a neural network trained on vast amounts of text that learns to predict and generate human-like text, enabling tasks like answering questions, writing, translation, and code generation.