Log Probabilities (Logprobs)
Definition
Logprobs (log probabilities) are the model's internal probability scores for each generated token—specifically the natural logarithm of the probability the model assigns to the chosen token at each generation step. The OpenAI API and compatible APIs can return logprobs alongside the generated text when requested. Logprobs have practical applications: measuring model confidence (very negative logprobs indicate the model was uncertain about that token), detecting uncertain or hallucinated spans in long-form responses (text where the model consistently generates low-probability tokens), implementing custom sampling logic, and computing calibration metrics (whether stated confidence aligns with empirical accuracy).
Why It Matters
Logprobs provide a window into the model's generation uncertainty that is hidden from normal response inspection. A response that reads confidently may contain spans where the model was highly uncertain—the text could have been very different with slightly different sampling. For 99helpers teams building high-stakes AI applications, logprob analysis can identify which parts of responses to flag for human review: if the logprob at a critical fact (a number, a name, a URL) is very low, it's worth verifying. This enables a targeted human-in-the-loop review process that focuses effort on genuinely uncertain outputs rather than reviewing all responses.
How It Works
Requesting logprobs in the OpenAI API: response = openai.chat.completions.create(model='gpt-4o', messages=[...], logprobs=True, top_logprobs=5). For each generated token the response includes: token (the generated string), logprob (the natural log probability of that token), bytes (its UTF-8 byte values), and top_logprobs (the most likely alternative tokens at that position, with their logprobs). Token probability = exp(logprob). Example: a token 'helpful' with logprob = -0.1 has probability ~0.90 (high confidence); a token 'excellent' with logprob = -3.5 has probability ~0.03 (low confidence). A run of low-logprob tokens inside a factual statement is a signal of potential hallucination.
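The exp(logprob) conversion above can be sketched in a few lines of Python. The sample token data is hypothetical, shaped like the per-token entries (choice.logprobs.content) an OpenAI-compatible API returns when logprobs are requested:

```python
import math

# Hypothetical per-token data, shaped like the entries found at
# choice.logprobs.content in a chat completion response.
sample_tokens = [
    {"token": "helpful", "logprob": -0.1},
    {"token": "excellent", "logprob": -3.5},
]

def token_probability(logprob: float) -> float:
    """Convert a natural-log probability back to a plain probability."""
    return math.exp(logprob)

for t in sample_tokens:
    p = token_probability(t["logprob"])
    print(f"{t['token']}: logprob={t['logprob']}, probability={p:.2f}")
```

Running this prints probability 0.90 for 'helpful' and 0.03 for 'excellent', matching the figures above.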
Log Probabilities — Token-Level Confidence

| Token | Logprob | Probability |
|-------|---------|-------------|
| The   | -0.02   | 98%         |
| sky   | -0.31   | 73%         |
| is    | -0.04   | 96%         |
| blue  | -0.71   | 49%         |
| .     | -0.10   | 90%         |

The API can also return the top alternatives at each position (for example, at token 4, 'blue').

Three common uses of logprobs:
- Confidence scoring: low-logprob tokens signal hallucination risk
- Uncertainty detection: aggregate logprobs indicate generation confidence
- Reranking: score candidate answers by total sequence logprob
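The reranking idea (scoring candidate answers by total sequence logprob) can be sketched as follows; the candidate answers and their per-token logprobs below are invented for illustration:

```python
# Hypothetical candidates, each paired with the per-token logprobs an API
# with logprobs enabled would return for that completion.
candidates = {
    "Paris": [-0.05, -0.10],
    "Lyon": [-1.20, -0.90, -0.40],
}

def sequence_logprob(token_logprobs):
    # Sum of token logprobs = log of the product of token probabilities,
    # i.e. the log probability of the whole sequence.
    return sum(token_logprobs)

# Pick the candidate the model found most probable overall.
best = max(candidates, key=lambda c: sequence_logprob(candidates[c]))
```

Note that summing logprobs penalizes longer sequences; length-normalizing (dividing by token count) is a common variant when candidates differ in length.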
Real-World Example
A 99helpers team implements confidence scoring for their AI-generated product descriptions. For each generated sentence, they compute the average logprob across all of its tokens. Sentences with an average logprob above -0.5 (a geometric-mean token probability above ~60%) are flagged as 'high confidence' and published automatically. Sentences with an average logprob below -1.5 are flagged for human review. Running on 1,000 descriptions, this catches 87% of factual errors while flagging only 12% of descriptions for review—a targeted quality gate that concentrates human effort where uncertainty is highest.
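A minimal sketch of this confidence gate; the -0.5 and -1.5 thresholds come from the example above, while the handling of sentences that fall between them is an assumption here:

```python
def average_logprob(token_logprobs):
    """Mean token logprob for a sentence (exp of this is the
    geometric-mean token probability)."""
    return sum(token_logprobs) / len(token_logprobs)

def review_decision(token_logprobs, publish_above=-0.5, review_below=-1.5):
    avg = average_logprob(token_logprobs)
    if avg > publish_above:
        return "publish"        # high confidence: publish automatically
    if avg < review_below:
        return "human_review"   # low confidence: route to a human
    return "spot_check"         # middle band (assumed handling)

print(review_decision([-0.1, -0.2, -0.05]))  # a high-confidence sentence
```

The thresholds should be tuned on your own data; the error-catch and review rates quoted above depend entirely on where they are set.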
Common Mistakes
- ✕ Treating logprobs as a perfect confidence measure—logprobs measure token probability, not factual accuracy; a model can be very confident while generating a hallucinated fact.
- ✕ Requesting logprobs for all production queries unnecessarily—logprobs add response payload size; only request them when needed for confidence analysis.
- ✕ Ignoring that top_logprobs shows alternatives—the most informative use of logprobs is often examining the top alternatives to understand what the model nearly generated instead.
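On the last point, examining top_logprobs alternatives at a single position can be sketched like this (the tokens and logprobs below are hypothetical):

```python
import math

# Hypothetical top_logprobs entries for one position: the chosen token
# plus the runner-up alternatives, each with its logprob.
top_alternatives = [
    {"token": "blue", "logprob": -0.71},
    {"token": "clear", "logprob": -1.20},
    {"token": "gray", "logprob": -2.10},
]

# Convert each alternative's logprob to a probability.
probs = [(a["token"], math.exp(a["logprob"])) for a in top_alternatives]

# A small margin between the top two probabilities means the model
# nearly generated a different token: an uncertainty signal in itself.
margin = probs[0][1] - probs[1][1]
```

Here the chosen token 'blue' has only ~49% probability and a narrow lead over 'clear', so this position would be a good candidate for verification.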
Related Terms
Temperature
Temperature is an LLM parameter (0-2) that controls output randomness: low values produce focused, deterministic responses while high values produce more varied, creative outputs.
LLM Inference
LLM inference is the process of running a trained model to generate a response for a given input, encompassing the forward pass computation, token generation, and the infrastructure required to serve predictions at scale.
LLM API
An LLM API is a cloud service interface that provides programmatic access to large language models, allowing developers to send prompts and receive completions without managing model infrastructure.
Top-P Sampling (Nucleus Sampling)
Top-p sampling (nucleus sampling) restricts token generation to the smallest set of tokens whose cumulative probability exceeds p, dynamically adapting the candidate pool size based on the probability distribution.
Structured Output
Structured output constrains LLM responses to follow a specific format—typically JSON with defined fields—enabling reliable parsing and integration with downstream systems rather than free-form text generation.
Ready to build your AI chatbot?
Put these concepts into practice with 99helpers — no code required.
Start free trial →