Multi-Head Attention
Definition
Multi-head attention (MHA) extends single-head self-attention by running H parallel attention computations, each with its own learned projection matrices for queries, keys, and values. Each head operates in a lower-dimensional subspace (d_model/H dimensions) and independently learns to attend to different aspects of the input. The outputs of all H heads are concatenated and projected back to the original dimension. Research on attention-head specialization shows that different heads often learn to track syntactic dependencies, semantic relationships, coreference, positional patterns, and rare linguistic phenomena. Modern LLMs typically use 32-128 attention heads; larger models tend to benefit from more heads, which appear to support more specialized linguistic capabilities.
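The definition above can be sketched directly in NumPy. This is a minimal forward pass (no masking, no biases, no dropout), and all function and variable names here are illustrative rather than from any particular library:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, H):
    """Minimal MHA forward pass over one sequence (no mask, no bias)."""
    seq_len, d_model = X.shape
    d_head = d_model // H
    # Project once, then split the result into H heads of d_head dims each.
    Q = (X @ W_q).reshape(seq_len, H, d_head).transpose(1, 0, 2)  # (H, seq, d_head)
    K = (X @ W_k).reshape(seq_len, H, d_head).transpose(1, 0, 2)
    V = (X @ W_v).reshape(seq_len, H, d_head).transpose(1, 0, 2)
    # Scaled dot-product attention, computed independently per head.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)           # (H, seq, seq)
    weights = softmax(scores, axis=-1)
    heads = weights @ V                                           # (H, seq, d_head)
    # Concatenate the heads and apply the output projection W_O.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o

rng = np.random.default_rng(0)
seq_len, d_model, H = 5, 16, 4
X = rng.normal(size=(seq_len, d_model))
Ws = [rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4)]
out = multi_head_attention(X, *Ws, H=H)
print(out.shape)  # (5, 16): same shape as the input
```

Note how the per-head work happens in d_model/H dimensions, so the total cost of H heads is comparable to one full-width head; the extra expressiveness comes from the independent projections, not from extra width.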
Why It Matters
Multi-head attention enables transformers to simultaneously process multiple types of linguistic information in a single layer. A sentence like 'The team that won the tournament has disbanded' requires simultaneously tracking: subject-verb agreement (team...has), relative clause boundaries (that won), pronoun coreference, and semantic relationships. Single-head attention would need to compromise between these competing objectives; multi-head attention dedicates separate computational resources to each. For AI practitioners, multi-head attention is relevant when analyzing model capabilities and limitations—models with more heads tend to have more robust linguistic understanding, and fine-tuning behavior can sometimes be traced to specific head specializations.
How It Works
MHA architecture: input X (seq_len × d_model) → H parallel projections, each producing Q_i, K_i, V_i of dimension (seq_len × d_model/H) → H independent attention computations → H outputs concatenated (seq_len × d_model) → final projection W_O (d_model × d_model) → output. Grouped Query Attention (GQA), used in Llama-3, reduces memory by sharing key-value projections across multiple query heads: instead of H_q = H_k = H_v heads, it uses H_k = H_v < H_q. This shrinks the KV cache by a factor of H_q/H_k while preserving full query expressiveness, which is critical for memory-efficient long-context inference.
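The GQA memory saving follows directly from the tensor shapes. A small NumPy sketch (head counts chosen for illustration; the 32-query/8-KV ratio happens to match Llama-3-8B, but nothing else here is tied to that model):

```python
import numpy as np

# Illustrative GQA shapes: H_q query heads share H_kv key/value heads.
seq_len, d_head = 10, 64
H_q, H_kv = 32, 8          # 4 query heads per KV head
group = H_q // H_kv

K = np.zeros((H_kv, seq_len, d_head))   # KV cache stores only H_kv heads...
V = np.zeros((H_kv, seq_len, d_head))
Q = np.zeros((H_q, seq_len, d_head))    # ...while queries keep all H_q heads.

# At attention time, each KV head is broadcast to its group of query heads.
K_full = np.repeat(K, group, axis=0)    # (H_q, seq_len, d_head)
V_full = np.repeat(V, group, axis=0)

# Cache size in elements: keys + values, per layer, for this sequence.
mha_cache = 2 * H_q * seq_len * d_head   # standard MHA would cache all heads
gqa_cache = 2 * H_kv * seq_len * d_head
print(mha_cache // gqa_cache)  # 4: the cache shrinks by a factor of H_q / H_kv
```

Since the KV cache often dominates inference memory at long context lengths, this factor translates almost directly into longer supported contexts or larger batch sizes.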
Multi-Head Attention Architecture
Real-World Example
A 99helpers team fine-tunes a model for technical support classification. Analyzing attention patterns before and after fine-tuning with BertViz, they observe that certain attention heads in the pre-trained model track product category mentions (e.g., head 12 of layer 8 jointly attends to 'billing', 'subscription', and 'payment'). After fine-tuning on support conversations, this head's patterns become even more pronounced for their specific product's terminology. This specialization explains why few-shot examples in the prompt that include relevant product terms dramatically improve classification: the already-specialized head efficiently processes the relevant vocabulary.
Common Mistakes
- ✕Reducing the number of attention heads to save memory without understanding quality implications—heads specialize over training; reducing head count can degrade specific linguistic capabilities.
- ✕Confusing multi-head attention with multiple transformer layers—MHA is one operation within a single layer; transformers also stack many such layers.
- ✕Assuming more heads always means better quality—beyond a certain threshold, returns diminish and training becomes more unstable; optimal head count depends on model size and task.
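The second mistake above is worth making concrete: one MHA operation is a sub-layer inside a single transformer layer, and the model stacks many such layers. A minimal sketch with placeholder sub-layers (names and shapes are illustrative; real layers also include normalization):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16

def mha_sublayer(x):
    # Stand-in for one multi-head attention operation (H heads live inside
    # this single call); here just a random linear map for illustration.
    return x @ (rng.normal(size=(d_model, d_model)) * 0.1)

def ffn_sublayer(x):
    # Stand-in for the position-wise feed-forward sub-layer.
    return np.maximum(x @ (rng.normal(size=(d_model, d_model)) * 0.1), 0)

def transformer_layer(x):
    # ONE layer = one MHA sub-layer + one feed-forward sub-layer,
    # each wrapped in a residual connection.
    x = x + mha_sublayer(x)
    x = x + ffn_sublayer(x)
    return x

def transformer(x, num_layers=4):
    # The MODEL stacks many layers; MHA is not the stack itself.
    for _ in range(num_layers):
        x = transformer_layer(x)
    return x

x = rng.normal(size=(5, d_model))
print(transformer(x).shape)  # (5, 16)
```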
Related Terms
Self-Attention
Self-attention is the core operation in transformer models where each token computes a weighted representation of all other tokens in the sequence, enabling every position to directly access information from every other position.
Transformer
The transformer is the neural network architecture underlying all modern LLMs, using self-attention mechanisms to process entire input sequences in parallel and capture long-range dependencies between words.
Attention Mechanism
The attention mechanism allows neural networks to dynamically focus on relevant parts of the input sequence when processing each token, enabling LLMs to capture long-range relationships and contextual meaning.
KV Cache
The KV cache stores the key and value attention tensors computed during the prefill phase, allowing subsequent token generation to reuse these computations rather than recomputing them for every new token.
Large Language Model (LLM)
A large language model is a neural network trained on vast amounts of text that learns to predict and generate human-like text, enabling tasks like answering questions, writing, translation, and code generation.