Attention Mechanism
Definition
The attention mechanism computes a weighted sum of value vectors based on the similarity between a query vector and a set of key vectors. In the context of LLMs, for each token position, the model computes: how much should this token 'attend to' every other token in the sequence? Tokens with high query-key similarity receive high attention weights; their value vectors contribute more to the output representation. This allows the model to dynamically route information—when processing a pronoun, attention can weight the likely antecedent nouns highly; when processing a verb, attention can weight its subject and object. Multi-head attention runs multiple attention computations in parallel, each potentially capturing different relationship types.
Why It Matters
The attention mechanism is what gives LLMs their contextual understanding. Unlike recurrent models, which compress all prior context into a fixed-size hidden state, attention allows every token to directly influence every other token's representation. This is why LLMs can understand complex anaphora, follow multi-clause arguments, and maintain coherent themes across long documents. For AI engineers building applications on LLMs, understanding attention helps explain phenomena like the 'lost in the middle' problem (attention tends to be higher for tokens near the beginning and end of long sequences) and why LLMs are better at answering questions about content placed prominently in prompts.
How It Works
Scaled dot-product attention: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) * V. The queries (Q), keys (K), and values (V) are linear projections of the input embeddings. QK^T computes pairwise similarity scores between all query-key pairs. Dividing by sqrt(d_k) (the square root of the key dimension) prevents the dot products from growing too large, which would push the softmax into saturation. The softmax converts each row of scores into a probability distribution over key positions, summing to 1. Multiplying by V produces a weighted combination of value vectors. In self-attention, Q, K, and V all come from the same input sequence; in cross-attention (encoder-decoder), Q comes from the decoder and K, V come from the encoder.
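The formula above can be sketched directly in NumPy. This is a toy example with random matrices and made-up dimensions, not a production implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # pairwise query-key similarities
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_k, d_v = 4, 8, 8              # illustrative sizes only
Q = rng.normal(size=(seq_len, d_k))
K = rng.normal(size=(seq_len, d_k))
V = rng.normal(size=(seq_len, d_v))

out, weights = attention(Q, K, V)
print(out.shape)             # (4, 8): one output vector per query position
print(weights.sum(axis=-1))  # each row of attention weights sums to 1
```

Note that the output has the same sequence length as the queries: each query position gets one weighted combination of value vectors.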
Multi-Head Attention
Multi-head attention splits the queries, keys, and values into several lower-dimensional subspaces ('heads'), runs scaled dot-product attention independently in each, then concatenates the results and applies an output projection. Each head can potentially capture a different relationship type.
[Figure: attention scores (query token → key tokens)]
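A minimal NumPy sketch of multi-head attention, assuming the common convention that d_model is split evenly across heads (weights and dimensions here are arbitrary illustrations):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """Project X, split into n_heads subspaces, attend per head, concat, project."""
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    heads = []
    for h in range(n_heads):
        sl = slice(h * d_head, (h + 1) * d_head)   # this head's subspace
        scores = Q[:, sl] @ K[:, sl].T / np.sqrt(d_head)
        heads.append(softmax(scores) @ V[:, sl])
    # Concatenate head outputs and mix them with the output projection.
    return np.concatenate(heads, axis=-1) @ W_o

rng = np.random.default_rng(1)
seq_len, d_model, n_heads = 5, 16, 4               # toy sizes
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))

out = multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads)
print(out.shape)  # (5, 16): same shape as the input
```

Because each head attends within its own subspace, different heads are free to learn different attention patterns, which is the point of running them in parallel.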
Real-World Example
A 99helpers chatbot is processing: 'My chatbot was working fine yesterday, but now it returns errors on every message.' The attention mechanism helps the LLM connect 'it' to 'chatbot,' understand 'now' as a temporal contrast with 'yesterday,' and weight 'errors' and 'every message' as the key diagnostic information. When generating a helpful response, the decoder's causal attention lets each output token attend to all prior output tokens plus the input prompt, maintaining coherence across the full response. Without attention, the model would lose track of these relationships in longer passages.
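The causal attention mentioned here can be illustrated by masking the same scaled dot-product computation so each position only sees itself and earlier positions. This is a minimal NumPy sketch with toy dimensions; the mask construction is illustrative, not the implementation any particular LLM uses:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention(Q, K, V):
    """Masked self-attention: position i may only attend to positions <= i."""
    seq_len, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    # Strictly upper-triangular mask marks "future" positions; setting their
    # scores to -inf gives them exactly zero weight after the softmax.
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

rng = np.random.default_rng(2)
Q = K = V = rng.normal(size=(4, 8))   # self-attention: same source for Q, K, V
out, w = causal_attention(Q, K, V)
print(np.triu(w, k=1).max())  # 0.0: no attention weight on future tokens
```

During generation this mask is what keeps each new token from "looking ahead" while still letting it attend to the entire prompt and everything generated so far.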
Common Mistakes
- ✕ Thinking attention is the only component of a transformer—feed-forward layers after attention are equally important for storing and transforming knowledge.
- ✕ Assuming uniform attention across the context window—in practice, attention is sparse and position-biased, with early and recent tokens receiving disproportionate weight.
- ✕ Conflating attention weights with 'what the model is thinking'—attention weights are a component of computation, not a direct interpretability window into model reasoning.
Related Terms
Transformer
The transformer is the neural network architecture underlying all modern LLMs, using self-attention mechanisms to process entire input sequences in parallel and capture long-range dependencies between words.
Self-Attention
Self-attention is the core operation in transformer models where each token computes a weighted representation of all other tokens in the sequence, enabling every position to directly access information from every other position.
Multi-Head Attention
Multi-head attention runs multiple independent self-attention operations ('heads') in parallel, allowing the transformer to simultaneously capture different types of relationships between tokens from different representation subspaces.
Context Length
Context length is the maximum number of tokens an LLM can process in a single request—encompassing the system prompt, conversation history, retrieved documents, and the response—determining how much information the model can consider simultaneously.
Large Language Model (LLM)
A large language model is a neural network trained on vast amounts of text that learns to predict and generate human-like text, enabling tasks like answering questions, writing, translation, and code generation.