Max Tokens
Definition
Max tokens (max_tokens in the OpenAI API, max_new_tokens in Hugging Face Transformers, and max_tokens in Anthropic's Messages API; the legacy Text Completions API called it max_tokens_to_sample) sets an upper bound on the number of tokens the LLM generates in a single response. Generation stops when either a stop sequence is encountered or the max token limit is reached. Max tokens directly controls: (1) cost: output tokens cost more than input tokens in most APIs, and capping max_tokens prevents unexpectedly long, expensive responses; (2) latency: longer responses take longer to generate; (3) user experience: responses that hit max_tokens may be cut off mid-sentence if the model hasn't reached a natural stopping point. Setting max_tokens appropriately for the expected response length is part of responsible API usage.
Why It Matters
Max tokens is a critical parameter for production LLM deployments that affects both economics and user experience. Too low: legitimate responses are truncated mid-sentence, confusing users and requiring follow-up questions. Too high: unnecessarily long responses waste tokens and increase latency without adding value. For support chatbots, most responses should be 50-200 tokens; setting max_tokens=100 works for short answers but truncates complex instructions that legitimately need 150+ tokens. The right value depends on the use case: SQL generation may need 500+, tweet generation should cap at 30-40, and conversational support sits at 150-300. Testing with real queries helps establish appropriate limits.
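Those per-use-case ranges can be kept in a small lookup consulted when building each request. The values below are illustrative assumptions drawn from the ranges above, and MAX_TOKENS_BY_USE_CASE / max_tokens_for are hypothetical names:

```python
# Illustrative per-use-case caps; the right numbers come from testing
# with real queries, not from this sketch.
MAX_TOKENS_BY_USE_CASE = {
    "sql_generation": 512,         # SQL may need 500+
    "tweet_generation": 40,        # tweets should cap at 30-40
    "conversational_support": 300, # support sits at 150-300
}

def max_tokens_for(use_case: str, default: int = 256) -> int:
    """Return the configured cap, falling back to a conservative default."""
    return MAX_TOKENS_BY_USE_CASE.get(use_case, default)
```

Centralizing the limits this way makes it easy to tune one use case without touching the others.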
How It Works
Max tokens configuration: for the OpenAI API, set max_tokens in the request: {"model": "gpt-4o", "max_tokens": 512, "messages": [...]}. The model generates until one of three things happens: a stop sequence is encountered, the model emits a natural end-of-response, or max_tokens is reached. When max_tokens is reached, the response carries finish_reason='length' (versus finish_reason='stop' for natural completion). Monitoring finish_reason='length' in production is important: high rates indicate max_tokens is set too low and responses are frequently being cut off. A useful pattern: set max_tokens generously (2-3x the expected response length) and use a post-processing step to truncate overlong responses at a sentence boundary.
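A minimal sketch of that pattern, assuming the OpenAI-style response shape ({"choices": [{"message": {...}, "finish_reason": ...}]}); handle_completion and trim_to_sentence are hypothetical helper names:

```python
def trim_to_sentence(text: str, limit: int) -> str:
    """Trim text exceeding `limit` characters back to the last full sentence."""
    if len(text) <= limit:
        return text
    cut = text[:limit]
    end = max(cut.rfind("."), cut.rfind("!"), cut.rfind("?"))
    return cut[: end + 1] if end != -1 else cut

def handle_completion(resp: dict, desired_chars: int = 400) -> tuple[str, bool]:
    """Extract the response text and flag whether it hit the max_tokens cap.

    finish_reason == "length" means the cap was reached; tracking the rate
    of this flag in production catches a max_tokens that is set too low.
    """
    choice = resp["choices"][0]
    text = trim_to_sentence(choice["message"]["content"], desired_chars)
    return text, choice["finish_reason"] == "length"
```

The generous API-side cap guarantees completeness; the client-side trim keeps the visible answer tidy.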
Max Tokens — Token Budget Within the Context Window

What happens when max_tokens is hit: generation stops mid-stream (e.g. "The solution requires three steps. First, install the dependencies. Second, configure the environment variables. Third, [CUTOFF]") and the API response reports finish_reason: "length", signaling a truncated response.

| Setting | Range | Effect |
| --- | --- | --- |
| Too low | < 100 | Responses truncated mid-sentence |
| Balanced | 256–1024 | Most chat/Q&A scenarios |
| Too high | > context limit | API error, wasteful billing |
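To avoid the "too high" failure mode, the requested cap can be clamped to the remaining context budget before the call. A sketch, assuming a shared prompt-plus-response window; clamp_max_tokens is a hypothetical helper, and prompt_tokens would come from a tokenizer count:

```python
def clamp_max_tokens(requested: int, context_limit: int, prompt_tokens: int) -> int:
    """Clamp max_tokens so that prompt + response fits the context window.

    Raises if the prompt alone already fills the window, since no
    generation budget remains.
    """
    budget = context_limit - prompt_tokens
    if budget <= 0:
        raise ValueError("Prompt already exceeds the context window")
    return min(requested, budget)
```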
Real-World Example
A team building on the 99helpers platform initially deploys its chatbot with max_tokens=200. Monitoring shows finish_reason='length' on 18% of responses: users are seeing cut-off answers. Analyzing the truncated responses, they find these cluster around technical setup instructions that genuinely need more than 200 tokens. They raise max_tokens to 500 for the technical-support query category while keeping it at 150 for FAQ queries. The truncation rate drops from 18% to 2%, with the remainder coming from unusually complex setup scenarios where users are prompted to contact support. Cost increases by 8% due to the longer technical responses, an acceptable trade-off for the quality improvement.
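The 18% figure in this scenario is the kind of metric a simple aggregation over logged finish_reason values yields; truncation_rate is a hypothetical helper:

```python
def truncation_rate(finish_reasons: list[str]) -> float:
    """Fraction of responses cut off at max_tokens (finish_reason == 'length')."""
    if not finish_reasons:
        return 0.0
    return finish_reasons.count("length") / len(finish_reasons)
```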
Common Mistakes
- ✕ Setting max_tokens very high for all queries regardless of expected response length—this inflates costs and can encourage verbose responses when the model fills its allotted tokens.
- ✕ Not monitoring finish_reason='length'—truncated responses are invisible to users except as confusing cut-offs; track this metric to catch misconfigured limits.
- ✕ Conflating max_tokens with response quality—max_tokens is a length cap, not a quality setting; longer is not always better.
Related Terms
Token
A token is the basic unit of text an LLM processes—roughly 4 characters or 3/4 of an English word. LLM APIs charge per token, context windows are measured in tokens, and generation speed is measured in tokens per second.
LLM API
An LLM API is a cloud service interface that provides programmatic access to large language models, allowing developers to send prompts and receive completions without managing model infrastructure.
Context Length
Context length is the maximum number of tokens an LLM can process in a single request—encompassing the system prompt, conversation history, retrieved documents, and the response—determining how much information the model can consider simultaneously.
LLM Inference
LLM inference is the process of running a trained model to generate a response for a given input, encompassing the forward pass computation, token generation, and the infrastructure required to serve predictions at scale.
Temperature
Temperature is an LLM parameter (0-2) that controls output randomness: low values produce focused, deterministic responses while high values produce more varied, creative outputs.
Ready to build your AI chatbot?
Put these concepts into practice with 99helpers — no code required.
Start free trial →