Large Language Models (LLMs)

Multi-Head Attention

Definition

Multi-head attention (MHA) extends single-head self-attention by running H parallel attention computations, each with its own learned projection matrices for queries, keys, and values. Each 'head' operates in a lower-dimensional subspace (d_model/H dimensions) and independently learns to attend to different aspects of the input. The outputs of all H heads are concatenated and projected back to the original dimension. Research on attention head specialization shows that different heads often learn to track syntactic dependencies, semantic relationships, coreference, positional patterns, and rare linguistic phenomena. Modern LLMs typically use on the order of 32-128 attention heads; larger models tend to benefit from more heads as they develop more specialized linguistic capabilities.
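The projection-split-attend-concatenate pipeline described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the sequence length, model width, and head count below are arbitrary example values, and the weight matrices are random stand-ins for learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, H):
    """X: (seq_len, d_model). Returns (seq_len, d_model)."""
    seq_len, d_model = X.shape
    d_head = d_model // H  # each head works in a d_model/H subspace

    # One big projection per Q/K/V, then split into H heads: (H, seq_len, d_head)
    def split_heads(M):
        return M.reshape(seq_len, H, d_head).transpose(1, 0, 2)

    Qh, Kh, Vh = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)

    # Scaled dot-product attention, computed independently for every head
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)  # (H, seq, seq)
    heads = softmax(scores) @ Vh                            # (H, seq, d_head)

    # Concatenate heads back to (seq_len, d_model), then apply the final W_O
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o

rng = np.random.default_rng(0)
seq_len, d_model, H = 5, 64, 8           # hypothetical toy sizes
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))

out = multi_head_attention(X, W_q, W_k, W_v, W_o, H)
print(out.shape)  # (5, 64)
```

Note that the total compute is roughly the same as one full-width head; the split simply lets each subspace learn its own attention pattern.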

Why It Matters

Multi-head attention enables transformers to simultaneously process multiple types of linguistic information in a single layer. A sentence like 'The team that won the tournament has disbanded' requires simultaneously tracking: subject-verb agreement (team...has), relative clause boundaries (that won), pronoun coreference, and semantic relationships. Single-head attention would need to compromise between these competing objectives; multi-head attention dedicates separate computational resources to each. For AI practitioners, multi-head attention is relevant when analyzing model capabilities and limitations—models with more heads tend to have more robust linguistic understanding, and fine-tuning behavior can sometimes be traced to specific head specializations.

How It Works

MHA architecture: input X (seq_len × d_model) → H parallel projections each producing Q_i, K_i, V_i of dimension (seq_len × d_model/H) → H independent attention computations → H outputs concatenated (seq_len × d_model) → final projection W_O (d_model × d_model) → output. Grouped Query Attention (GQA), used in Llama-3, reduces memory by sharing key-value projections across multiple query heads: instead of H_q = H_k = H_v heads, it sets H_k = H_v < H_q. This shrinks the KV cache by a factor of H_q/H_k while maintaining query expressiveness—critical for memory-efficient long-context inference.

Multi-Head Attention Architecture

[Diagram: input embeddings pass through per-head linear projections to produce Q, K, V. Eight illustrative heads specialize in: syntax/grammar, coreference, proximity, semantics, long-range dependencies, entity type, negation scope, and positional order. All head outputs are concatenated and passed through a final linear projection to form the output.]

MultiHead(Q, K, V) = Concat(head₁, …, head_H)·W_O

Real-World Example

A 99helpers team fine-tunes a model for technical support classification. Analyzing attention patterns before and after fine-tuning with BertViz, they observe that certain attention heads in the pre-trained model track product category mentions—for example, head 12 in layer 8 attends strongly across 'billing', 'subscription', and 'payment'. After fine-tuning on support conversations, this head's patterns become even more pronounced for their specific product's terminology. This specialization explains why few-shot examples in the prompt that include relevant product terms dramatically improve classification—the already-specialized head efficiently processes the relevant vocabulary.

Common Mistakes

  • Reducing the number of attention heads to save memory without understanding quality implications—heads specialize over training; reducing head count can degrade specific linguistic capabilities.
  • Confusing multi-head attention with multiple transformer layers—MHA is one operation within a single layer; transformers also stack many such layers.
  • Assuming more heads always means better quality—beyond a certain threshold, returns diminish and training can become less stable; the optimal head count depends on model size and task.
