Greedy Decoding
Definition
Greedy decoding is the simplest generation strategy: at every step, the model picks the token with the highest probability from its output distribution and appends it to the sequence. This is equivalent to setting temperature=0 or top-k=1. The name 'greedy' reflects its myopic optimization—it makes the locally optimal choice at each step without considering whether that choice leads to the globally optimal complete sequence. Greedy decoding is fully deterministic (the same input always produces the same output), computationally efficient (no sampling or beam expansion), and often produces high-quality outputs for simple, factual queries where there is one clear best next token at each step.
Why It Matters
Greedy decoding is the implicit baseline for LLM outputs and helps explain certain failure modes. When an LLM gets 'stuck' generating the same phrase repeatedly, it is often caught in a greedy loop: the highest-probability next token is one that reinforces a pattern, which makes the same continuation more likely, creating a cycle. Understanding this explains why repetition penalties and sampling help—they break the deterministic greedy cycle. For 99helpers support chatbots where answer consistency matters, temperature=0 (greedy) is often the right choice; for content generation where variety is valued, sampling outperforms greedy.
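The greedy loop failure mode can be made concrete with a toy sketch. The logits table below is invented for illustration (it is not a real model): token 1 makes token 2 the argmax, and token 2 makes token 1 the argmax, so greedy decoding alternates between them forever.

```python
# Toy greedy repetition loop (hypothetical logits table, not a real model).
# TABLE[last_token] -> logits over a 3-token vocabulary.
TABLE = {
    1: [0.0, 0.1, 2.0],  # after token 1, token 2 has the highest logit
    2: [0.0, 2.0, 0.1],  # after token 2, token 1 has the highest logit
}

def decode_greedy(start, steps):
    ids = [start]
    for _ in range(steps):
        logits = TABLE[ids[-1]]
        # argmax: always take the single highest-logit token
        ids.append(max(range(len(logits)), key=lambda i: logits[i]))
    return ids

print(decode_greedy(1, 6))  # [1, 2, 1, 2, 1, 2, 1] -- stuck in a cycle
```

Sampling (or a repetition penalty that lowers the logits of recently used tokens) breaks this cycle by sometimes selecting a token other than the argmax.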
How It Works
Greedy decoding process: (1) compute logits over the vocabulary from a model forward pass; (2) apply softmax to convert logits to probabilities (optional for greedy, since the argmax over the logits selects the same token as the argmax over the probabilities); (3) select the argmax (the token with the highest probability); (4) append it to the sequence; (5) run the forward pass again on the extended sequence; (6) repeat until a stop token. The computational cost per token is one forward pass through the model, identical to sampling, so greedy is no faster than sampling at the per-token level. What distinguishes greedy is that it eliminates the sampling step and its associated randomness. In code: next_token = logits.argmax(-1) instead of torch.multinomial(probabilities, num_samples=1).
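The six steps above can be sketched as a plain-Python loop. The toy_logits table, vocabulary, and token ids below are made up for illustration; in a real system, logits_fn would be a model forward pass.

```python
# Minimal greedy decoding sketch over a fake 4-token vocabulary.
# VOCAB and the logits table are hypothetical, purely for illustration.
VOCAB = ["<eos>", "hello", "world", "!"]

def toy_logits(ids):
    """Fake 'forward pass': maps the last token id to fixed next-token logits."""
    table = {
        None: [0.0, 3.0, 1.0, 0.5],  # empty prompt -> "hello"
        1:    [0.0, 0.5, 3.0, 1.0],  # "hello" -> "world"
        2:    [0.5, 0.0, 0.0, 3.0],  # "world" -> "!"
        3:    [3.0, 0.0, 0.0, 0.0],  # "!" -> <eos>
    }
    return table[ids[-1] if ids else None]

def greedy_decode(logits_fn, prompt_ids, eos_id=0, max_new_tokens=10):
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = logits_fn(ids)                                   # step 1
        next_id = max(range(len(logits)), key=lambda i: logits[i])  # step 3: argmax
        ids.append(next_id)                                        # step 4
        if next_id == eos_id:                                      # step 6
            break
    return ids

print(greedy_decode(toy_logits, []))  # [1, 2, 3, 0] -> "hello world ! <eos>"
```

Because every step is an argmax, running this twice on the same prompt always yields the same token ids, which is the determinism property described above.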
Greedy Decoding — Argmax Token Selection
Greedy decoding always picks the single highest-probability token. Fast, deterministic, but can miss globally better sequences that require taking a lower-probability first token.
Real-World Example
A 99helpers chatbot is configured with temperature=0 (greedy decoding) for its FAQ answering feature. During testing, a developer notices that when asked 'What does 99helpers do?' the bot responds identically on every run, which is ideal for a use case where consistency builds user trust. However, when the same setting is used for a creative product description generator, each product always gets the same description, and the phrasing feels formulaic across products. The team switches to temperature=0.9 for creative generation while keeping temperature=0 for the structured FAQ, applying greedy decoding and sampling to the use cases each suits.
Common Mistakes
- ✕Thinking greedy decoding is always 'best' because it selects the highest-probability token—locally optimal choices can lead to globally suboptimal sequences.
- ✕Expecting greedy decoding to always produce the same output across different model serving infrastructure—hardware and batching differences can occasionally produce different floating-point results.
- ✕Using greedy decoding for tasks requiring creative variation—greedy decoding by definition produces the least diverse possible output.
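The first mistake can be made concrete with a toy two-step example (the probabilities below are invented for illustration): greedy takes the higher-probability first token, but the sequence through the lower-probability first token is more likely overall.

```python
# Hypothetical two-step search tree: greedy's first pick loses globally.
# P(A)=0.6 then P(x|A)=0.3 -> P(A,x)=0.18
# P(B)=0.4 then P(y|B)=0.9 -> P(B,y)=0.36
step1 = {"A": 0.6, "B": 0.4}
step2 = {"A": {"x": 0.3}, "B": {"y": 0.9}}

greedy_first = max(step1, key=step1.get)  # greedy picks "A" (locally optimal)
greedy_seq_p = step1[greedy_first] * max(step2[greedy_first].values())

# The globally best sequence considers joint probability over both steps
best_seq_p = max(step1[t] * max(step2[t].values()) for t in step1)

print(greedy_first, round(greedy_seq_p, 2), round(best_seq_p, 2))
# greedy picks "A" (joint 0.18), but the "B" path has joint 0.36
```

This is exactly the gap that beam search (see Related Terms) is designed to close: by keeping several candidate sequences alive, it can recover the "B" path that greedy discards at step one.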
Related Terms
Temperature
Temperature is an LLM parameter (0-2) that controls output randomness: low values produce focused, deterministic responses while high values produce more varied, creative outputs.
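How temperature reshapes the distribution can be seen by dividing the logits by t before the softmax (the standard formulation; the logit values below are arbitrary):

```python
import math

def softmax_with_temperature(logits, t):
    """Softmax over logits / t. As t -> 0 the distribution sharpens toward
    the argmax (approaching greedy); large t flattens it toward uniform."""
    scaled = [x / t for x in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
probs_sharp = softmax_with_temperature(logits, 0.1)  # nearly all mass on token 0
probs_flat = softmax_with_temperature(logits, 2.0)   # mass spread across tokens
print(probs_sharp, probs_flat)
```

With a very low temperature the top token's probability approaches 1, which is why temperature=0 is treated as greedy decoding in practice.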
Beam Search
Beam search is a decoding algorithm that maintains multiple candidate sequences (beams) in parallel during generation, selecting the overall most probable complete sequence rather than the locally optimal token at each step.
Top-K Sampling
Top-K sampling restricts token generation to the K most probable next tokens at each step, preventing the model from selecting rare or unlikely tokens while maintaining diversity within the top-K candidates.
Top-P Sampling (Nucleus Sampling)
Top-p sampling (nucleus sampling) restricts token generation to the smallest set of tokens whose cumulative probability exceeds p, dynamically adapting the candidate pool size based on the probability distribution.
Large Language Model (LLM)
A large language model is a neural network trained on vast amounts of text that learns to predict and generate human-like text, enabling tasks like answering questions, writing, translation, and code generation.
Ready to build your AI chatbot?
Put these concepts into practice with 99helpers — no code required.
Start free trial →