Beam Search
Definition
Beam search is a heuristic search algorithm that explores multiple generation paths simultaneously. Instead of committing to the single highest-probability token at each step (greedy decoding), beam search maintains a 'beam' of B candidate sequences. At each step, it expands each of the B candidates by every possible next token, scores the resulting B*V candidates (where V is the vocabulary size) by cumulative log-probability, and keeps only the top B for the next step. Once every beam terminates (by emitting an end-of-sequence token or reaching the maximum length), beam search returns the sequence with the highest total log-probability. With a beam width of 1, beam search reduces to greedy decoding. Higher beam widths find higher-probability sequences at the cost of roughly B times the computation.
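The expand-score-prune loop above can be sketched in pure Python with a toy next-token model (the token names and probabilities are made up for illustration, not taken from any real LLM):

```python
import math

def beam_search(next_token_logprobs, beam_width, max_len):
    """Minimal beam search sketch over a toy model.

    next_token_logprobs(seq) returns a dict mapping each candidate
    next token to its log-probability given the sequence so far.
    "<eos>" marks end of sequence.
    """
    beams = [((), 0.0)]  # (token sequence, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == "<eos>":  # finished beams pass through unchanged
                candidates.append((seq, score))
                continue
            for tok, lp in next_token_logprobs(seq).items():
                candidates.append((seq + (tok,), score + lp))
        # keep only the top-B candidates by cumulative log-probability
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(seq[-1] == "<eos>" for seq, _ in beams):
            break
    return beams[0]

# Toy "model": "b" is the best first token, but "a" leads to a
# better overall sequence -- exactly the case where greedy loses.
def toy_model(seq):
    if not seq:
        return {"a": math.log(0.4), "b": math.log(0.6)}
    if seq[-1] == "a":
        return {"<eos>": math.log(0.9), "x": math.log(0.1)}
    if seq[-1] == "b":
        return {"x": math.log(0.55), "<eos>": math.log(0.45)}
    return {"<eos>": 0.0}  # log(1.0)

print(beam_search(toy_model, beam_width=2, max_len=5))  # finds ("a", "<eos>")
print(beam_search(toy_model, beam_width=1, max_len=5))  # greedy: ("b", "x", "<eos>")
```

With beam width 2, the search keeps the 'a' branch alive long enough to discover that it yields the higher-probability complete sequence (0.4 × 0.9 = 0.36 vs. 0.6 × 0.55 × 1.0 = 0.33); with beam width 1 it collapses to greedy decoding and takes the weaker 'b' path.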
Why It Matters
Beam search was the dominant decoding strategy for machine translation and text summarization before sampling-based approaches gained popularity for generative LLMs. It is still preferred for tasks where the 'best' output can be objectively measured by probability: structured text generation, code completion, and constrained generation. However, for conversational AI, beam search produces outputs that are technically highest-probability but often feel generic and repetitive (the 'safe answer' phenomenon). Most modern chat-optimized LLMs instead use sampling (temperature plus top-p/top-k) for more natural, varied responses. For 99helpers, beam search is relevant when generating structured outputs like configuration files or precise text formats.
How It Works
Consider beam search with B=4: at step 1, the model keeps the 4 most probable first tokens as starting candidates. At step 2, each of the 4 candidates expands into one option per vocabulary token (roughly 50,000 for a typical vocabulary); the top-4 two-token sequences by joint probability are kept. This continues until all beams hit the end-of-sequence token. Final selection: the completed sequence with the highest sum of log-probabilities. In practice, length normalization (dividing the cumulative log-probability by the sequence length) prevents beam search from favoring short sequences. Libraries like Hugging Face Transformers enable beam search via the num_beams parameter: model.generate(input_ids, num_beams=4, no_repeat_ngram_size=3).
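A quick worked example of why length normalization matters (the probabilities are invented for illustration; the exponent alpha generalizes plain averaging, with values below 1 penalizing length less aggressively):

```python
import math

def length_normalized_score(logprobs, alpha=1.0):
    """Cumulative log-probability divided by length**alpha.

    alpha=1.0 is plain per-token averaging; smaller values
    soften the length penalty.
    """
    return sum(logprobs) / (len(logprobs) ** alpha)

# A short beam vs. a longer beam whose tokens are individually stronger:
short = [math.log(0.5), math.log(0.5)]                 # 2 tokens, raw score ~ -1.39
long_ = [math.log(0.6), math.log(0.6), math.log(0.6)]  # 3 tokens, raw score ~ -1.53

# On raw cumulative log-probability the short sequence wins
# (every extra token can only lower the score)...
print(sum(short) > sum(long_))                                           # True
# ...but per-token the longer sequence is stronger, so the
# normalized score prefers it.
print(length_normalized_score(long_) > length_normalized_score(short))   # True
```

Without this correction, beam search systematically favors terminating early, since adding any token multiplies in a probability below 1.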
[Diagram: beam search with beam width = 2]
Real-World Example
A 99helpers developer builds a feature that generates structured JSON configuration templates from natural language descriptions. Using sampling (temperature=0.7), the model occasionally generates invalid JSON (mismatched brackets, wrong key names). Switching to constrained beam search with JSON grammar constraints produces valid, well-formed JSON on every run. The beam search explores multiple JSON structure candidates simultaneously and selects the highest-probability valid JSON structure, ensuring output reliability for this structured generation use case.
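True grammar-constrained decoding filters tokens during the search itself, but the select-the-best-valid-candidate idea can be approximated by reranking the returned beams. A simplified sketch (the candidate strings here are hard-coded stand-ins for beam outputs, e.g. from generate(..., num_beams=4, num_return_sequences=4)):

```python
import json

def first_valid_json(candidates):
    """Return the parsed form of the highest-ranked candidate
    that is valid JSON, or None if no candidate parses.

    Assumes `candidates` are already ordered best-first by
    beam score.
    """
    for text in candidates:
        try:
            return json.loads(text)
        except json.JSONDecodeError:
            continue
    return None

# Stand-in beam outputs, best-first by score:
beams = [
    '{"name": "prod", "retries": 3',    # truncated: invalid JSON
    '{"name": "prod", "retries": 3}',   # valid
    '{"name": prod}',                   # unquoted value: invalid
]
print(first_valid_json(beams))  # {'name': 'prod', 'retries': 3}
```

Because beam search returns several high-probability candidates rather than a single sample, there are more chances that at least one of them satisfies the structural constraint.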
Common Mistakes
- ✕ Using beam search for open-ended conversational AI—beam search produces bland, repetitive responses in unconstrained generation because it over-optimizes for probability.
- ✕ Setting beam width very high (B=20+) hoping for much better outputs—quality improvements diminish quickly after B=4-8, while computational cost scales linearly with B.
- ✕ Ignoring beam search's bias against diversity—all beams often converge to similar high-probability prefixes, producing outputs that differ only in final tokens.
Related Terms
Temperature
Temperature is an LLM parameter (0-2) that controls output randomness: low values produce focused, deterministic responses while high values produce more varied, creative outputs.
Top-P Sampling (Nucleus Sampling)
Top-p sampling (nucleus sampling) restricts token generation to the smallest set of tokens whose cumulative probability exceeds p, dynamically adapting the candidate pool size based on the probability distribution.
Top-K Sampling
Top-K sampling restricts token generation to the K most probable next tokens at each step, preventing the model from selecting rare or unlikely tokens while maintaining diversity within the top-K candidates.
Greedy Decoding
Greedy decoding selects the single highest-probability token at each generation step, producing deterministic, locally optimal output without exploring alternative sequences.
Large Language Model (LLM)
A large language model is a neural network trained on vast amounts of text that learns to predict and generate human-like text, enabling tasks like answering questions, writing, translation, and code generation.
Ready to build your AI chatbot?
Put these concepts into practice with 99helpers — no code required.
Start free trial →