Positional Encoding
Definition
Transformers process input as a set of token embeddings simultaneously via self-attention—which is inherently permutation-invariant. Without positional information, the model cannot distinguish 'A causes B' from 'B causes A' since the same tokens appear regardless of order. Positional encodings solve this by adding position-dependent signals to token embeddings before the transformer layers. Original transformers used sinusoidal fixed encodings (Vaswani et al., 2017). Modern LLMs use learned absolute or relative positional encodings. Rotary Position Embedding (RoPE), used by Llama and most modern open-source models, encodes position via rotation of query and key vectors, enabling relative position information to naturally emerge in the attention dot product. ALiBi (Attention with Linear Biases) adds position-dependent biases to attention scores.
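The permutation-invariance claim can be checked directly: with no positional signal, reordering the input tokens merely reorders the attention output, so the model has no way to tell the orderings apart. Below is a minimal numpy sketch (single head, identity projections, all names hypothetical) showing that adding position-dependent vectors to the embeddings breaks this symmetry:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    # Toy single-head attention with identity Q/K/V projections.
    scores = X @ X.T / np.sqrt(X.shape[-1])
    return softmax(scores) @ X

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))     # 4 tokens, 8-dim embeddings
perm = [2, 0, 3, 1]             # reorder the tokens

# Without positions: permuting the input only permutes the output rows.
out = self_attention(X)
out_perm = self_attention(X[perm])
assert np.allclose(out[perm], out_perm)

# With (stand-in) positional encodings added, the symmetry is broken:
pe = rng.normal(size=X.shape)
out_pe = self_attention(X + pe)
out_pe_perm = self_attention(X[perm] + pe)
assert not np.allclose(out_pe[perm], out_pe_perm)
```

The random `pe` matrix here stands in for any fixed-per-position signal; real models use the sinusoidal or rotary schemes described below.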
Why It Matters
Positional encoding determines whether an LLM can distinguish word order—fundamental for language understanding—and how well it generalizes to sequences longer than those it was trained on. RoPE, the dominant modern approach, has a key advantage: because it encodes the relative offset between query and key positions rather than absolute indices, models using RoPE can be extended to longer contexts through techniques like RoPE scaling (YaRN, LongRoPE) without full re-training. This is why Llama models trained with an 8K context can be extended to 128K+ contexts with light fine-tuning. For AI practitioners, the choice of positional encoding affects context-length extrapolation and 'lost in the middle' behavior.
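The simplest RoPE-scaling strategy, position interpolation, can be sketched in a few lines: positions are divided by a scale factor so that positions beyond the trained context map back into the angle range the model saw during training (YaRN and LongRoPE refine this per frequency band). This is an illustrative sketch, not any library's API:

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0, scale=1.0):
    # Per-pair rotation angle: theta_i = pos / base^(2i/dim).
    # Position interpolation divides positions by `scale` so a long
    # sequence reuses the angle range seen during training.
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    return np.outer(np.asarray(positions) / scale, inv_freq)

dim, train_len = 64, 8192
# Extending to 4x the trained context: with scale=4, position 32764
# produces exactly the angles the model learned for position 8191.
angles_plain = rope_angles([train_len - 1], dim)
angles_scaled = rope_angles([4 * (train_len - 1)], dim, scale=4.0)
assert np.allclose(angles_plain, angles_scaled)
```

In practice the scale factor is the ratio of the target context to the training context, and a short fine-tune lets the model adjust to the compressed position resolution.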
How It Works
Sinusoidal PE (original): PE(pos, 2i) = sin(pos/10000^(2i/d)), PE(pos, 2i+1) = cos(pos/10000^(2i/d)). These fixed patterns provide unique position signatures that the model learns to use. RoPE applies a rotation matrix R(pos) to query and key vectors before attention: Q_rotated = R(pos_q)Q, K_rotated = R(pos_k)K. The dot product Q_rotated · K_rotated naturally captures the relative position (pos_q - pos_k) rather than absolute positions, enabling length generalization. RoPE is computationally efficient—it's applied element-wise to pairs of dimensions in Q and K without additional parameters.
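Both schemes above are small enough to verify numerically. The sketch below (hypothetical helper names, numpy only) implements the sinusoidal formula as written, and a 2D-pairwise rotation for RoPE; the final check confirms the key property that the rotated dot product depends only on the offset pos_q - pos_k:

```python
import numpy as np

def sinusoidal_pe(pos, d):
    # PE(pos, 2i) = sin(pos / 10000^(2i/d)); PE(pos, 2i+1) = cos(same angle)
    i = np.arange(d // 2)
    angles = pos / 10000 ** (2 * i / d)
    pe = np.empty(d)
    pe[0::2] = np.sin(angles)
    pe[1::2] = np.cos(angles)
    return pe

def rope_rotate(x, pos, base=10000.0):
    # Rotate consecutive dimension pairs of x by position-dependent angles.
    d = x.shape[-1]
    theta = pos * base ** (-np.arange(0, d, 2) / d)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * np.cos(theta) - x2 * np.sin(theta)
    out[1::2] = x1 * np.sin(theta) + x2 * np.cos(theta)
    return out

rng = np.random.default_rng(1)
q, k = rng.normal(size=8), rng.normal(size=8)

# Same relative offset (3) at different absolute positions gives the
# same attention score: RoPE encodes pos_q - pos_k, not pos_q and pos_k.
s1 = rope_rotate(q, 100) @ rope_rotate(k, 97)
s2 = rope_rotate(q, 13) @ rope_rotate(k, 10)
assert np.isclose(s1, s2)
```

Note that `rope_rotate` adds no parameters and preserves vector norms, which is why it can be applied cheaply to Q and K inside every attention layer.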
[Chart: Positional Encoding — Sinusoidal Values per Token Position]
Real-World Example
A 99helpers developer builds a document-reordering feature where the LLM must identify that step 3 cannot come before step 2 in an installation guide. The model correctly identifies the constraint because positional encoding lets it distinguish 'step 2 must precede step 3' from 'step 3 must precede step 2'—different orderings of the same tokens. When testing very long documents (50K tokens) with a base Llama-3-8B model (trained with an 8K context), instruction-following quality degrades. After switching to a context-extended Llama-3.1-8B (128K context via RoPE scaling), the model handles the long document reliably—the scaled positional encoding enables longer-range position awareness.
Common Mistakes
- ✕ Attempting to extend context length without re-training or RoPE scaling—standard positional encodings generalize poorly beyond the training context length.
- ✕ Confusing absolute and relative positional encoding—absolute PE gives each position a unique signature; relative PE (like RoPE) encodes the distance between positions, which generalizes better to unseen lengths.
- ✕ Ignoring that 'lost in the middle' correlates partly with positional encoding—models attend more strongly to content near position 0 and the end of the context, regardless of semantic importance.
Related Terms
Transformer
The transformer is the neural network architecture underlying all modern LLMs, using self-attention mechanisms to process entire input sequences in parallel and capture long-range dependencies between words.
Self-Attention
Self-attention is the core operation in transformer models where each token computes a weighted representation of all other tokens in the sequence, enabling every position to directly access information from every other position.
Context Length
Context length is the maximum number of tokens an LLM can process in a single request—encompassing the system prompt, conversation history, retrieved documents, and the response—determining how much information the model can consider simultaneously.
Large Language Model (LLM)
A large language model is a neural network trained on vast amounts of text that learns to predict and generate human-like text, enabling tasks like answering questions, writing, translation, and code generation.
Attention Mechanism
The attention mechanism allows neural networks to dynamically focus on relevant parts of the input sequence when processing each token, enabling LLMs to capture long-range relationships and contextual meaning.