Log Probabilities (Logprobs)
Definition
Logprobs (log probabilities) are the model's internal probability scores for each generated token—specifically the natural logarithm of the probability the model assigns to the chosen token at each generation step. The OpenAI API and compatible APIs can return logprobs alongside the generated text when requested. Logprobs have practical applications: measuring model confidence (very negative logprobs indicate the model was uncertain about that token), detecting uncertain or hallucinated spans in long-form responses (text where the model consistently generates low-probability tokens), implementing custom sampling logic, and computing calibration metrics (whether stated confidence aligns with empirical accuracy).
Why It Matters
Logprobs provide a window into the model's generation uncertainty that is hidden from normal response inspection. A response that reads confidently may contain spans where the model was highly uncertain—the text could have been very different with slightly different sampling. For 99helpers teams building high-stakes AI applications, logprob analysis can identify which parts of responses to flag for human review: if the logprob at a critical fact (a number, a name, a URL) is very low, it's worth verifying. This enables a targeted human-in-the-loop review process that focuses effort on genuinely uncertain outputs rather than reviewing all responses.
How It Works
Requesting logprobs in the OpenAI API: response = openai.chat.completions.create(model='gpt-4o', messages=[...], logprobs=True, top_logprobs=5). For each generated token the response includes: token (the generated string), logprob (the natural log probability of that token), bytes (its UTF-8 byte values), and top_logprobs (the most likely alternative tokens at that position, with their logprobs). Token probability = exp(logprob). Example: a token 'helpful' with logprob = -0.1 has probability ~0.90 (high confidence); a token 'excellent' with logprob = -3.5 has probability ~0.03 (low confidence). A run of low-logprob tokens inside a factual statement is a signal of potential hallucination.
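The exp(logprob) conversion above can be sketched in a few lines of Python. The sample token data is hypothetical, shaped like the per-token entries (choice.logprobs.content) an OpenAI-compatible API returns when logprobs are requested:

```python
import math

# Hypothetical per-token data, shaped like the entries found at
# choice.logprobs.content in a chat completion response.
sample_tokens = [
    {"token": "helpful", "logprob": -0.1},
    {"token": "excellent", "logprob": -3.5},
]

def token_probability(logprob: float) -> float:
    """Convert a natural-log probability back to a plain probability."""
    return math.exp(logprob)

for t in sample_tokens:
    p = token_probability(t["logprob"])
    print(f"{t['token']}: logprob={t['logprob']}, probability={p:.2f}")
```

Running this prints probability 0.90 for 'helpful' and 0.03 for 'excellent', matching the figures above.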
Log Probabilities — Token-Level Confidence

| Token | Logprob | Probability |
|-------|---------|-------------|
| The   | -0.02   | 98%         |
| sky   | -0.31   | 73%         |
| is    | -0.04   | 96%         |
| blue  | -0.71   | 49%         |
| .     | -0.10   | 90%         |

The API can also return the top alternatives at each position (for example, at token 4, 'blue').

Three common uses of logprobs:
- Confidence scoring: low-logprob tokens signal hallucination risk
- Uncertainty detection: aggregate logprobs indicate generation confidence
- Reranking: score candidate answers by total sequence logprob
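The reranking idea (scoring candidate answers by total sequence logprob) can be sketched as follows; the candidate answers and their per-token logprobs below are invented for illustration:

```python
# Hypothetical candidates, each paired with the per-token logprobs an API
# with logprobs enabled would return for that completion.
candidates = {
    "Paris": [-0.05, -0.10],
    "Lyon": [-1.20, -0.90, -0.40],
}

def sequence_logprob(token_logprobs):
    # Sum of token logprobs = log of the product of token probabilities,
    # i.e. the log probability of the whole sequence.
    return sum(token_logprobs)

# Pick the candidate the model found most probable overall.
best = max(candidates, key=lambda c: sequence_logprob(candidates[c]))
```

Note that summing logprobs penalizes longer sequences; length-normalizing (dividing by token count) is a common variant when candidates differ in length.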
Real-World Example
A 99helpers team implements confidence scoring for their AI-generated product descriptions. For each generated sentence, they compute the average logprob across all of its tokens. Sentences with an average logprob above -0.5 (a geometric-mean token probability above ~60%) are flagged as 'high confidence' and published automatically. Sentences with an average logprob below -1.5 are flagged for human review. Running on 1,000 descriptions, this catches 87% of factual errors while flagging only 12% of descriptions for review—a targeted quality gate that concentrates human effort where uncertainty is highest.
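A minimal sketch of this confidence gate; the -0.5 and -1.5 thresholds come from the example above, while the handling of sentences that fall between them is an assumption here:

```python
def average_logprob(token_logprobs):
    """Mean token logprob for a sentence (exp of this is the
    geometric-mean token probability)."""
    return sum(token_logprobs) / len(token_logprobs)

def review_decision(token_logprobs, publish_above=-0.5, review_below=-1.5):
    avg = average_logprob(token_logprobs)
    if avg > publish_above:
        return "publish"        # high confidence: publish automatically
    if avg < review_below:
        return "human_review"   # low confidence: route to a human
    return "spot_check"         # middle band (assumed handling)

print(review_decision([-0.1, -0.2, -0.05]))  # a high-confidence sentence
```

The thresholds should be tuned on your own data; the error-catch and review rates quoted above depend entirely on where they are set.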
Common Mistakes
- ✕ Treating logprobs as a perfect confidence measure—logprobs measure token probability, not factual accuracy; a model can be very confident while generating a hallucinated fact.
- ✕ Requesting logprobs for all production queries unnecessarily—logprobs add response payload size; only request them when needed for confidence analysis.
- ✕ Ignoring that top_logprobs shows alternatives—the most informative use of logprobs is often examining the top alternatives to understand what the model nearly generated instead.
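On the last point, examining top_logprobs alternatives at a single position can be sketched like this (the tokens and logprobs below are hypothetical):

```python
import math

# Hypothetical top_logprobs entries for one position: the chosen token
# plus the runner-up alternatives, each with its logprob.
top_alternatives = [
    {"token": "blue", "logprob": -0.71},
    {"token": "clear", "logprob": -1.20},
    {"token": "gray", "logprob": -2.10},
]

# Convert each alternative's logprob to a probability.
probs = [(a["token"], math.exp(a["logprob"])) for a in top_alternatives]

# A small margin between the top two probabilities means the model
# nearly generated a different token: an uncertainty signal in itself.
margin = probs[0][1] - probs[1][1]
```

Here the chosen token 'blue' has only ~49% probability and a narrow lead over 'clear', so this position would be a good candidate for verification.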
Related Terms
Temperature
Temperature is an LLM parameter (0-2) that controls output randomness: low values produce focused, deterministic responses while high values produce more varied, creative outputs.
LLM Inference
LLM inference is the process of running a trained model to generate a response for a given input, encompassing the forward pass computation, token generation, and the infrastructure required to serve predictions at scale.
LLM API
An LLM API is a cloud service interface that provides programmatic access to large language models, allowing developers to send prompts and receive completions without managing model infrastructure.
Top-P Sampling (Nucleus Sampling)
Top-p sampling (nucleus sampling) restricts token generation to the smallest set of tokens whose cumulative probability exceeds p, dynamically adapting the candidate pool size based on the probability distribution.
Structured Output
Structured output constrains LLM responses to follow a specific format—typically JSON with defined fields—enabling reliable parsing and integration with downstream systems rather than free-form text generation.
Ready to build your AI chatbot?
Put these concepts into practice with 99helpers — no code required.
Start free trial →