Temperature
Definition
Temperature is a sampling parameter that scales the probability distribution over the vocabulary before the model selects its next token. At temperature=0, the model always picks the highest-probability token (greedy decoding), producing fully deterministic output. Between 0 and 1, the distribution is sharpened toward the most likely tokens. At temperature=1, the distribution is unchanged—sampling follows the model's raw probabilities. Above 1, the distribution is flattened (low-probability tokens become relatively more likely), increasing variety and occasionally producing surprising or incoherent outputs. Temperature=0.7-1.0 is a common sweet spot for conversational AI, balancing coherence with naturalness; temperature below 0.3 is preferred for factual, technical, or structured outputs.
Why It Matters
Temperature is the most impactful parameter for controlling chatbot personality and reliability. A customer support chatbot answering factual product questions should use temperature=0 or 0.1—consistent, predictable answers reduce user confusion and support team overhead. A creative writing assistant or brainstorming tool benefits from temperature=1.0-1.5, producing varied and imaginative outputs. For 99helpers customers deploying AI chatbots, temperature tuning is often the first optimization step after initial deployment: reducing temperature for factual support bots reduces hallucination and response variance, while increasing it for engagement tools produces more conversational, natural-feeling exchanges.
How It Works
Mathematically: adjusted_logit[i] = logit[i] / temperature. After dividing all logits by the temperature, softmax converts them to probabilities. When temperature approaches 0, the highest logit dominates exponentially; probabilities concentrate on the top token. When temperature=1, softmax is applied directly to raw logits. When temperature=2, logit differences are halved, flattening the distribution. In practice, temperature=0 is implemented as argmax (take highest logit) to avoid division-by-zero. Temperature interacts with top-p and top-k: in most APIs, temperature is applied first, then top-p or top-k filtering narrows the candidate pool.
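The scaling described above can be sketched in a few lines of NumPy. The logits here are made-up values for three candidate tokens, purely for illustration:

```python
import numpy as np

def sample_probs(logits, temperature):
    """Convert raw logits to a probability distribution at a given temperature."""
    if temperature == 0:
        # Greedy decoding: all probability on the argmax token, avoiding division by zero.
        probs = np.zeros(len(logits))
        probs[np.argmax(logits)] = 1.0
        return probs
    scaled = np.asarray(logits, dtype=float) / temperature  # adjusted_logit[i] = logit[i] / T
    exp = np.exp(scaled - scaled.max())                     # subtract max for numerical stability
    return exp / exp.sum()                                  # softmax

logits = [4.0, 2.0, 1.0]  # made-up logits for three candidate tokens
for t in (0, 0.5, 1.0, 2.0):
    print(t, np.round(sample_probs(logits, t), 3))
```

Running the loop shows the pattern in the prose above: lower temperatures concentrate probability on the top token, higher temperatures flatten the distribution.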
Temperature — Token Probability Distribution
- T = 0: greedy decoding, always picks the top token (deterministic, factual, repetitive)
- T = 1: balanced, samples as trained (default balance of quality and variety)
- T = 2: creative, highly random output (unpredictable, may be incoherent)
Real-World Example
A 99helpers customer service chatbot initially deploys with temperature=0.8. Users report inconsistent answers—the same question about refund policy sometimes receives 3 different responses across sessions, creating customer confusion and support escalations. Lowering temperature to 0.1 makes responses nearly deterministic for the same query: the refund policy question always returns the same accurate, policy-grounded answer. Separately, their marketing team's content generation tool uses temperature=1.2 to produce varied product descriptions from the same feature list, preventing identical-sounding outputs across different customers' generated content.
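A setup like this is often implemented as a per-use-case temperature lookup. The sketch below is illustrative only: the intent labels, values, and function are hypothetical, not a 99helpers API.

```python
# Hypothetical per-use-case temperature routing, mirroring the example above.
TEMPERATURE_BY_INTENT = {
    "policy_question": 0.1,   # factual support: near-deterministic answers
    "small_talk": 0.8,        # conversational engagement
    "marketing_copy": 1.2,    # varied product descriptions
}

def pick_temperature(intent: str) -> float:
    """Return the temperature for a classified intent, with a balanced fallback."""
    return TEMPERATURE_BY_INTENT.get(intent, 0.7)
```

The fallback value matters: unclassified queries get a middle-of-the-road temperature rather than inheriting the creative or the deterministic extreme.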
Common Mistakes
- ✕ Using temperature=1.0 for all use cases without considering whether the application is factual (needs low temp) or creative (benefits from higher temp).
- ✕ Expecting temperature=0 to eliminate all non-determinism—system-level batching and floating-point non-determinism can still produce occasional variation even at temp=0.
- ✕ Combining very high temperature (>1.5) with no other output constraints—the model may generate grammatically correct but nonsensical content.
Related Terms
Top-P Sampling (Nucleus Sampling)
Top-p sampling (nucleus sampling) restricts token generation to the smallest set of tokens whose cumulative probability exceeds p, dynamically adapting the candidate pool size based on the probability distribution.
Top-K Sampling
Top-K sampling restricts token generation to the K most probable next tokens at each step, preventing the model from selecting rare or unlikely tokens while maintaining diversity within the top-K candidates.
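A sketch of how these filters compose with temperature, assuming the common order (temperature first, then top-k, then top-p) noted in "How It Works"; exact behavior varies between APIs:

```python
import numpy as np

def filtered_sample_probs(logits, temperature=1.0, top_k=None, top_p=None):
    """Temperature scaling, then optional top-k and top-p (nucleus) filtering."""
    scaled = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    if top_k is not None:
        # Zero out everything outside the k most probable tokens.
        cutoff = np.sort(probs)[-top_k]
        probs = np.where(probs >= cutoff, probs, 0.0)
    if top_p is not None:
        # Keep the smallest set of tokens whose cumulative probability reaches top_p.
        order = np.argsort(probs)[::-1]
        cum = np.cumsum(probs[order])
        keep = order[: np.searchsorted(cum, top_p) + 1]
        mask = np.zeros_like(probs)
        mask[keep] = 1.0
        probs = probs * mask
    return probs / probs.sum()  # renormalize over the surviving tokens
```

Note that top-k fixes the candidate pool size while top-p adapts it to the shape of the distribution, which is why the two are often offered together.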
Beam Search
Beam search is a decoding algorithm that maintains multiple candidate sequences (beams) in parallel during generation, selecting the overall most probable complete sequence rather than the locally optimal token at each step.
Large Language Model (LLM)
A large language model is a neural network trained on vast amounts of text that learns to predict and generate human-like text, enabling tasks like answering questions, writing, translation, and code generation.
LLM API
An LLM API is a cloud service interface that provides programmatic access to large language models, allowing developers to send prompts and receive completions without managing model infrastructure.
Ready to build your AI chatbot?
Put these concepts into practice with 99helpers — no code required.
Start free trial →