Beam Search
Definition
Beam search is a heuristic search algorithm that explores multiple generation paths simultaneously. Instead of committing to the single highest-probability token at each step (greedy decoding), beam search maintains a 'beam' of B candidate sequences. At each step, it expands each of the B candidates by every possible next token, scores the resulting B*V candidates (where V is the vocabulary size) by cumulative log-probability, and keeps only the top B for the next step. Once every beam terminates (by emitting an end-of-sequence token or reaching the maximum length), beam search returns the sequence with the highest total log-probability. With a beam width of 1, beam search reduces to greedy decoding. Higher beam widths find higher-probability sequences at the cost of roughly B times the computation.
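The expand-score-prune loop above can be sketched in pure Python with a toy next-token model (the token names and probabilities are made up for illustration, not taken from any real LLM):

```python
import math

def beam_search(next_token_logprobs, beam_width, max_len):
    """Minimal beam search sketch over a toy model.

    next_token_logprobs(seq) returns a dict mapping each candidate
    next token to its log-probability given the sequence so far.
    "<eos>" marks end of sequence.
    """
    beams = [((), 0.0)]  # (token sequence, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == "<eos>":  # finished beams pass through unchanged
                candidates.append((seq, score))
                continue
            for tok, lp in next_token_logprobs(seq).items():
                candidates.append((seq + (tok,), score + lp))
        # keep only the top-B candidates by cumulative log-probability
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(seq[-1] == "<eos>" for seq, _ in beams):
            break
    return beams[0]

# Toy "model": "b" is the best first token, but "a" leads to a
# better overall sequence -- exactly the case where greedy loses.
def toy_model(seq):
    if not seq:
        return {"a": math.log(0.4), "b": math.log(0.6)}
    if seq[-1] == "a":
        return {"<eos>": math.log(0.9), "x": math.log(0.1)}
    if seq[-1] == "b":
        return {"x": math.log(0.55), "<eos>": math.log(0.45)}
    return {"<eos>": 0.0}  # log(1.0)

print(beam_search(toy_model, beam_width=2, max_len=5))  # finds ("a", "<eos>")
print(beam_search(toy_model, beam_width=1, max_len=5))  # greedy: ("b", "x", "<eos>")
```

With beam width 2, the search keeps the 'a' branch alive long enough to discover that it yields the higher-probability complete sequence (0.4 × 0.9 = 0.36 vs. 0.6 × 0.55 × 1.0 = 0.33); with beam width 1 it collapses to greedy decoding and takes the weaker 'b' path.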
Why It Matters
Beam search was the dominant decoding strategy for machine translation and text summarization before sampling-based approaches gained popularity for generative LLMs. It is still preferred for tasks where the 'best' output can be objectively measured by probability: structured text generation, code completion, and constrained generation. However, for conversational AI, beam search produces outputs that are technically highest-probability but often feel generic and repetitive (the 'safe answer' phenomenon). Most modern chat-optimized LLMs instead use sampling (temperature plus top-p/top-k) for more natural, varied responses. For 99helpers, beam search is relevant when generating structured outputs like configuration files or precise text formats.
How It Works
Consider beam search with B=4: at step 1, the model keeps the 4 most probable first tokens as starting candidates. At step 2, each of the 4 candidates expands into one option per vocabulary token (roughly 50,000 for a typical vocabulary); the top-4 two-token sequences by joint probability are kept. This continues until all beams hit the end-of-sequence token. Final selection: the completed sequence with the highest sum of log-probabilities. In practice, length normalization (dividing the cumulative log-probability by the sequence length) prevents beam search from favoring short sequences. Libraries like Hugging Face Transformers enable beam search via the num_beams parameter: model.generate(input_ids, num_beams=4, no_repeat_ngram_size=3).
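A quick worked example of why length normalization matters (the probabilities are invented for illustration; the exponent alpha generalizes plain averaging, with values below 1 penalizing length less aggressively):

```python
import math

def length_normalized_score(logprobs, alpha=1.0):
    """Cumulative log-probability divided by length**alpha.

    alpha=1.0 is plain per-token averaging; smaller values
    soften the length penalty.
    """
    return sum(logprobs) / (len(logprobs) ** alpha)

# A short beam vs. a longer beam whose tokens are individually stronger:
short = [math.log(0.5), math.log(0.5)]                 # 2 tokens, raw score ~ -1.39
long_ = [math.log(0.6), math.log(0.6), math.log(0.6)]  # 3 tokens, raw score ~ -1.53

# On raw cumulative log-probability the short sequence wins
# (every extra token can only lower the score)...
print(sum(short) > sum(long_))                                           # True
# ...but per-token the longer sequence is stronger, so the
# normalized score prefers it.
print(length_normalized_score(long_) > length_normalized_score(short))   # True
```

Without this correction, beam search systematically favors terminating early, since adding any token multiplies in a probability below 1.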
[Diagram: beam search with beam width = 2]
Real-World Example
A 99helpers developer builds a feature that generates structured JSON configuration templates from natural language descriptions. Using sampling (temperature=0.7), the model occasionally generates invalid JSON (mismatched brackets, wrong key names). Switching to constrained beam search with JSON grammar constraints produces valid, well-formed JSON on every run. The beam search explores multiple JSON structure candidates simultaneously and selects the highest-probability valid JSON structure, ensuring output reliability for this structured generation use case.
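True grammar-constrained decoding filters tokens during the search itself, but the select-the-best-valid-candidate idea can be approximated by reranking the returned beams. A simplified sketch (the candidate strings here are hard-coded stand-ins for beam outputs, e.g. from generate(..., num_beams=4, num_return_sequences=4)):

```python
import json

def first_valid_json(candidates):
    """Return the parsed form of the highest-ranked candidate
    that is valid JSON, or None if no candidate parses.

    Assumes `candidates` are already ordered best-first by
    beam score.
    """
    for text in candidates:
        try:
            return json.loads(text)
        except json.JSONDecodeError:
            continue
    return None

# Stand-in beam outputs, best-first by score:
beams = [
    '{"name": "prod", "retries": 3',    # truncated: invalid JSON
    '{"name": "prod", "retries": 3}',   # valid
    '{"name": prod}',                   # unquoted value: invalid
]
print(first_valid_json(beams))  # {'name': 'prod', 'retries': 3}
```

Because beam search returns several high-probability candidates rather than a single sample, there are more chances that at least one of them satisfies the structural constraint.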
Common Mistakes
- ✕ Using beam search for open-ended conversational AI—beam search produces bland, repetitive responses in unconstrained generation because it over-optimizes for probability.
- ✕ Setting beam width very high (B=20+) hoping for much better outputs—quality improvements diminish quickly after B=4-8, while computational cost scales linearly with B.
- ✕ Ignoring beam search's bias against diversity—all beams often converge to similar high-probability prefixes, producing outputs that differ only in final tokens.
Related Terms
Temperature
Temperature is an LLM parameter (0-2) that controls output randomness: low values produce focused, deterministic responses while high values produce more varied, creative outputs.
Top-P Sampling (Nucleus Sampling)
Top-p sampling (nucleus sampling) restricts token generation to the smallest set of tokens whose cumulative probability exceeds p, dynamically adapting the candidate pool size based on the probability distribution.
Top-K Sampling
Top-K sampling restricts token generation to the K most probable next tokens at each step, preventing the model from selecting rare or unlikely tokens while maintaining diversity within the top-K candidates.
Greedy Decoding
Greedy decoding selects the single highest-probability token at each generation step, producing deterministic, locally optimal output without exploring alternative sequences.
Large Language Model (LLM)
A large language model is a neural network trained on vast amounts of text that learns to predict and generate human-like text, enabling tasks like answering questions, writing, translation, and code generation.
Ready to build your AI chatbot?
Put these concepts into practice with 99helpers — no code required.
Start free trial →