Max Tokens
Definition
Max tokens (max_tokens in the OpenAI API, max_new_tokens in Hugging Face Transformers, and max_tokens in Anthropic's Messages API; the legacy Text Completions API called it max_tokens_to_sample) sets an upper bound on the number of tokens the LLM generates in a single response. Generation stops when either a stop sequence is encountered or the max token limit is reached. Max tokens directly controls: (1) cost: output tokens cost more than input tokens in most APIs, and capping max_tokens prevents unexpectedly long, expensive responses; (2) latency: longer responses take longer to generate; (3) user experience: responses that hit max_tokens may be cut off mid-sentence if the model hasn't reached a natural stopping point. Setting max_tokens appropriately for the expected response length is part of responsible API usage.
Why It Matters
Max tokens is a critical parameter for production LLM deployments that affects both economics and user experience. Too low: legitimate responses are truncated mid-sentence, confusing users and requiring follow-up questions. Too high: unnecessarily long responses waste tokens and increase latency without adding value. For support chatbots, most responses should be 50-200 tokens; setting max_tokens=100 works for short answers but truncates complex instructions that legitimately need 150+ tokens. The right value depends on the use case: SQL generation may need 500+, tweet generation should cap at 30-40, and conversational support sits at 150-300. Testing with real queries helps establish appropriate limits.
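Those per-use-case ranges can be kept in a small lookup consulted when building each request. The values below are illustrative assumptions drawn from the ranges above, and MAX_TOKENS_BY_USE_CASE / max_tokens_for are hypothetical names:

```python
# Illustrative per-use-case caps; the right numbers come from testing
# with real queries, not from this sketch.
MAX_TOKENS_BY_USE_CASE = {
    "sql_generation": 512,         # SQL may need 500+
    "tweet_generation": 40,        # tweets should cap at 30-40
    "conversational_support": 300, # support sits at 150-300
}

def max_tokens_for(use_case: str, default: int = 256) -> int:
    """Return the configured cap, falling back to a conservative default."""
    return MAX_TOKENS_BY_USE_CASE.get(use_case, default)
```

Centralizing the limits this way makes it easy to tune one use case without touching the others.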
How It Works
Max tokens configuration: for the OpenAI API, set max_tokens in the request: {"model": "gpt-4o", "max_tokens": 512, "messages": [...]}. The model generates until one of three things happens: a stop sequence is encountered, the model emits a natural end-of-response, or max_tokens is reached. When max_tokens is reached, the response carries finish_reason='length' (versus finish_reason='stop' for natural completion). Monitoring finish_reason='length' in production is important: high rates indicate max_tokens is set too low and responses are frequently being cut off. A useful pattern: set max_tokens generously (2-3x the expected response length) and use a post-processing step to truncate overlong responses at a sentence boundary.
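A minimal sketch of that pattern, assuming the OpenAI-style response shape ({"choices": [{"message": {...}, "finish_reason": ...}]}); handle_completion and trim_to_sentence are hypothetical helper names:

```python
def trim_to_sentence(text: str, limit: int) -> str:
    """Trim text exceeding `limit` characters back to the last full sentence."""
    if len(text) <= limit:
        return text
    cut = text[:limit]
    end = max(cut.rfind("."), cut.rfind("!"), cut.rfind("?"))
    return cut[: end + 1] if end != -1 else cut

def handle_completion(resp: dict, desired_chars: int = 400) -> tuple[str, bool]:
    """Extract the response text and flag whether it hit the max_tokens cap.

    finish_reason == "length" means the cap was reached; tracking the rate
    of this flag in production catches a max_tokens that is set too low.
    """
    choice = resp["choices"][0]
    text = trim_to_sentence(choice["message"]["content"], desired_chars)
    return text, choice["finish_reason"] == "length"
```

The generous API-side cap guarantees completeness; the client-side trim keeps the visible answer tidy.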
Max Tokens — Token Budget Within the Context Window

What happens when max_tokens is hit: generation stops mid-stream (e.g. "The solution requires three steps. First, install the dependencies. Second, configure the environment variables. Third, [CUTOFF]") and the API response reports finish_reason: "length", signaling a truncated response.

| Setting | Range | Effect |
| --- | --- | --- |
| Too low | < 100 | Responses truncated mid-sentence |
| Balanced | 256–1024 | Most chat/Q&A scenarios |
| Too high | > context limit | API error, wasteful billing |
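To avoid the "too high" failure mode, the requested cap can be clamped to the remaining context budget before the call. A sketch, assuming a shared prompt-plus-response window; clamp_max_tokens is a hypothetical helper, and prompt_tokens would come from a tokenizer count:

```python
def clamp_max_tokens(requested: int, context_limit: int, prompt_tokens: int) -> int:
    """Clamp max_tokens so that prompt + response fits the context window.

    Raises if the prompt alone already fills the window, since no
    generation budget remains.
    """
    budget = context_limit - prompt_tokens
    if budget <= 0:
        raise ValueError("Prompt already exceeds the context window")
    return min(requested, budget)
```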
Real-World Example
A team building on the 99helpers platform initially deploys its chatbot with max_tokens=200. Monitoring shows finish_reason='length' on 18% of responses: users are seeing cut-off answers. Analyzing the truncated responses, they find these cluster around technical setup instructions that genuinely need more than 200 tokens. They raise max_tokens to 500 for the technical-support query category while keeping it at 150 for FAQ queries. The truncation rate drops from 18% to 2%, with the remainder coming from unusually complex setup scenarios where users are prompted to contact support. Cost increases by 8% due to the longer technical responses, an acceptable trade-off for the quality improvement.
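The 18% figure in this scenario is the kind of metric a simple aggregation over logged finish_reason values yields; truncation_rate is a hypothetical helper:

```python
def truncation_rate(finish_reasons: list[str]) -> float:
    """Fraction of responses cut off at max_tokens (finish_reason == 'length')."""
    if not finish_reasons:
        return 0.0
    return finish_reasons.count("length") / len(finish_reasons)
```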
Common Mistakes
- ✕ Setting max_tokens very high for all queries regardless of expected response length—this inflates costs and can encourage verbose responses when the model fills its allotted tokens.
- ✕ Not monitoring finish_reason='length'—truncated responses are invisible to users except as confusing cut-offs; track this metric to catch misconfigured limits.
- ✕ Conflating max_tokens with response quality—max_tokens is a length cap, not a quality setting; longer is not always better.
Related Terms
Token
A token is the basic unit of text an LLM processes—roughly 4 characters or 3/4 of an English word. LLM APIs charge per token, context windows are measured in tokens, and generation speed is measured in tokens per second.
LLM API
An LLM API is a cloud service interface that provides programmatic access to large language models, allowing developers to send prompts and receive completions without managing model infrastructure.
Context Length
Context length is the maximum number of tokens an LLM can process in a single request—encompassing the system prompt, conversation history, retrieved documents, and the response—determining how much information the model can consider simultaneously.
LLM Inference
LLM inference is the process of running a trained model to generate a response for a given input, encompassing the forward pass computation, token generation, and the infrastructure required to serve predictions at scale.
Temperature
Temperature is an LLM parameter (0-2) that controls output randomness: low values produce focused, deterministic responses while high values produce more varied, creative outputs.
Ready to build your AI chatbot?
Put these concepts into practice with 99helpers — no code required.
Start free trial →