Large Language Models (LLMs)

Top-P Sampling (Nucleus Sampling)

Definition

Top-p sampling, also called nucleus sampling, was introduced as an alternative to top-k sampling that adapts to the shape of the probability distribution. Rather than always sampling from the top K tokens, top-p sampling determines the candidate pool dynamically: it sorts tokens by probability, then includes tokens from highest to lowest until their cumulative probability reaches p (e.g., p=0.9). If the distribution is sharp (the model is very confident), the nucleus might include only 5 tokens. If the distribution is flat (the model is uncertain), it might include 1,000 tokens. This avoids both failure modes of fixed top-k: being too permissive on peaked distributions and too restrictive on flat ones.

Why It Matters

Top-p sampling addresses a practical limitation of fixed top-k: in confident, factual contexts, top-k=50 still samples from 50 tokens even when the model assigns 99% probability to one token, introducing unnecessary noise. In creative, ambiguous contexts, top-k=50 may be too restrictive, producing repetitive outputs. Top-p adapts naturally to both situations. For 99helpers chatbots, top-p=0.9 or 0.95 is a robust default that maintains coherent, on-topic responses while avoiding the stilted quality of very low temperature or top-k. Most LLM providers recommend using either top-p or top-k, not both simultaneously.

How It Works

Implementation: after computing the softmax probability distribution over the vocabulary (and optionally applying temperature), sort tokens by probability descending. Accumulate probabilities until the sum exceeds p. Discard all tokens beyond this threshold. Sample from the remaining tokens according to their renormalized probabilities. Example: vocabulary has tokens [A:0.5, B:0.3, C:0.15, D:0.04, E:0.01]. With top-p=0.9, include A (0.5), B (0.8 cumulative), C (0.95 > 0.9 → stop after C). Sample from {A, B, C} with renormalized probabilities {A:0.53, B:0.32, C:0.16}.
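The steps above can be sketched in Python; this is a minimal NumPy illustration using the worked example's probabilities, not any provider's actual implementation:

```python
import numpy as np

def top_p_filter(probs: np.ndarray, p: float = 0.9) -> np.ndarray:
    """Keep the smallest set of highest-probability tokens whose
    cumulative probability reaches p, then renormalize."""
    order = np.argsort(probs)[::-1]        # token indices, highest prob first
    cum = np.cumsum(probs[order])          # running cumulative probability
    # First position where the cumulative sum reaches p; that token is
    # the last one kept, everything after it is discarded.
    cutoff = int(np.searchsorted(cum, p)) + 1
    filtered = np.zeros_like(probs)
    filtered[order[:cutoff]] = probs[order[:cutoff]]
    return filtered / filtered.sum()       # renormalize over the survivors

# Worked example from the text: tokens [A, B, C, D, E]
probs = np.array([0.5, 0.3, 0.15, 0.04, 0.01])
renormed = top_p_filter(probs, p=0.9)
# A, B, C survive with probabilities of roughly 0.53, 0.32, 0.16;
# D and E are zeroed out.
```

A token is then drawn from the filtered distribution, e.g. with `np.random.choice(len(probs), p=renormed)`.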

Top-P Sampling — Cumulative Probability Cutoff at p = 0.9

Prompt: "Today I'm feeling very…" (p = 0.9)

Sorted tokens with cumulative probability:

  Token  | Probability | Cumulative | Status
  happy  |        32%  |       32%  | included
  sunny  |        24%  |       56%  | included
  great  |        18%  |       74%  | included
  clear  |        10%  |       84%  | included
  warm   |         7%  |       91%  | included (crosses the p=0.9 threshold, so it is the last token kept)
  fine   |         4%  |       95%  | excluded
  nice   |         3%  |       98%  | excluded
  okay   |         2%  |      100%  | excluded

Tokens beyond the p=0.9 cutoff are excluded from sampling.

Top-P advantage: adapts dynamically, including more tokens when the distribution is flat and fewer when it is peaked.

vs. Top-K: top-k always selects exactly K tokens regardless of the distribution's shape.
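This contrast can be checked numerically. A small sketch with two hypothetical 16-token distributions (values chosen to be exact in binary floating point):

```python
import numpy as np

def nucleus_size(probs: np.ndarray, p: float = 0.9) -> int:
    """How many tokens top-p keeps: the smallest prefix of the
    probability-sorted vocabulary whose cumulative mass reaches p."""
    cum = np.cumsum(np.sort(probs)[::-1])
    return int(np.searchsorted(cum, p)) + 1

# Peaked: the model is confident (one token carries most of the mass).
peaked = np.array([0.9375] + [0.0625 / 15] * 15)
# Flat: the model is uncertain (uniform over 16 tokens).
flat = np.full(16, 0.0625)

print(nucleus_size(peaked))  # 1: the top token alone reaches the 0.9 mass
print(nucleus_size(flat))    # 15: nearly the whole vocabulary is needed
# A fixed top-k (say k=8) would keep exactly 8 tokens in both cases.
```

The candidate pool shrinks to a single token when the model is confident and widens when it is not, which is exactly the adaptivity fixed top-k lacks.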

Real-World Example

A 99helpers chatbot uses top-p=0.9 and temperature=0.7. When answering "What is your refund policy?", the model assigns high probability to factual tokens and top-p selects only 3-5 candidates, producing consistent, accurate answers. When asked "Can you suggest a creative use case for our chatbot?", the model's uncertainty spreads probability across many possible continuations, and top-p=0.9 includes 40-50 candidates, producing varied, interesting suggestions on each run. The same top-p setting handles both cases appropriately without manual adjustment.

Common Mistakes

  • Setting top-p=1.0 thinking it disables sampling—top-p=1.0 includes all tokens (no filtering), making it equivalent to pure temperature-based sampling.
  • Using both top-p and top-k simultaneously—most recommendations suggest using one or the other; combining them applies both constraints, which can behave unexpectedly.
  • Confusing top-p with the probability of the final answer being correct—p controls sampling diversity, not answer accuracy.
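The first point can be verified directly. A hedged sketch using the worked example's distribution (a toy illustration, not any provider's API):

```python
import numpy as np

def top_p_keep_count(probs: np.ndarray, p: float) -> int:
    """Number of tokens that survive top-p filtering."""
    cum = np.cumsum(np.sort(probs)[::-1])
    # Clamp p to the total mass so floating-point sums slightly below
    # 1.0 still let p=1.0 keep the whole vocabulary.
    return int(np.searchsorted(cum, min(p, cum[-1]))) + 1

probs = np.array([0.5, 0.3, 0.15, 0.04, 0.01])
print(top_p_keep_count(probs, 0.9))  # 3: filtering is active
print(top_p_keep_count(probs, 1.0))  # 5: every token is kept, so this is
#                                         plain temperature-based sampling
```

With p=1.0 no token is ever excluded, so sampling is still fully stochastic; to make output deterministic you would use greedy decoding (temperature approaching 0), not a top-p setting.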
