Stop Sequence
Definition
Stop sequences are tokens or strings that the LLM API monitors during generation; when the model produces one, generation halts immediately and the stop sequence is excluded from the response. Common stop sequences: '\n' (stop at the first newline for single-line outputs), '###' (a custom separator for structured responses), '<|end|>' (an end-of-turn marker), or domain-specific terminators. Stop sequences work in conjunction with max_tokens: generation ends at whichever condition is met first. Multiple stop sequences can be specified; generation stops when any of them is produced. Stop sequences are particularly useful for: extracting a single value from a prompted completion, preventing the model from continuing past the desired output, and managing turn boundaries in multi-turn conversation templates.
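The truncation behavior described above can be sketched client-side. This is an illustrative helper, not part of any provider SDK: it scans generated text for the earliest stop sequence and cuts there, excluding the stop string itself, mirroring typical API behavior.

```python
def apply_stop_sequences(generated: str, stops: list[str]) -> str:
    """Truncate generated text at the earliest stop sequence.

    The stop sequence itself is excluded from the result, matching
    how LLM APIs typically return the response.
    """
    cut = len(generated)
    for stop in stops:
        idx = generated.find(stop)
        if idx != -1 and idx < cut:
            cut = idx  # keep the earliest match across all stop sequences
    return generated[:cut]

# With a stop sequence, output ends at the semantic boundary:
print(apply_stop_sequences("The answer is 42.\n### Now let me explain...",
                           ["\n###", "END"]))  # -> The answer is 42.
```

If no stop sequence ever appears, the text is returned unchanged, which is why a max_tokens safety limit still matters.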
Why It Matters
Stop sequences provide deterministic control over response length and structure that max_tokens alone cannot achieve. A max_tokens limit stops at an arbitrary point; a stop sequence stops at a semantically meaningful boundary. For JSON extraction tasks, stopping at '\n' or '}' after the relevant value prevents the model from generating unwanted text after the answer. For templated generation where the model fills in blanks, a stop sequence at the template's end marker prevents over-generation. For 99helpers chatbots, stop sequences are less commonly needed (instruction tuning handles natural stopping), but they're invaluable for structured generation tasks like data extraction or template completion.
How It Works
Stop sequence usage in the OpenAI API: response = openai.chat.completions.create(model='gpt-4o', messages=[...], stop=['\n', '###', '---']). The response ends at the first occurrence of any of these strings. Example: extracting a number from a prompt—stop=['\n'] ensures only the number on the first line is returned. For multiple-choice extraction: prompt the model with 'Answer: A, B, C, or D: [question]' and stop=['\n']—the model outputs 'Answer: B' and stops. When using stop sequences with few-shot prompting, use the same delimiter as in your examples so the model learns to stop at the same point.
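The few-shot pattern above can be sketched as request construction. This sketch only builds the request arguments (the actual API call is omitted); the '###' delimiter, the example reviews, and the build_request helper are illustrative choices, while the model name 'gpt-4o' and the stop parameter come from the example above.

```python
# Few-shot prompt where the example delimiter doubles as the stop sequence.
EXAMPLES = [
    ("The food was great!", "positive"),
    ("Shipping took forever.", "negative"),
]

def build_request(text: str) -> dict:
    # Join examples with the same '###' delimiter we will stop on.
    shots = "\n###\n".join(f"Review: {r}\nSentiment: {s}" for r, s in EXAMPLES)
    prompt = f"{shots}\n###\nReview: {text}\nSentiment:"
    return {
        "model": "gpt-4o",
        "messages": [{"role": "user", "content": prompt}],
        "stop": ["\n###"],   # same delimiter as between the examples
        "max_tokens": 5,     # safety limit in case the stop never fires
    }

req = build_request("Battery died after a week.")
print(req["stop"])  # -> ['\n###']
```

Because the model has seen each example end at '\n###', it tends to emit that same delimiter after its answer, and the stop sequence truncates there.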
Stop Sequence — Token Generation & Truncation
Configuration: stop=["\n###", "END", "</answer>"]
With the stop sequence, the token generation stream halts and the returned output is: "The answer is 42." (the stop sequence is excluded from the result).
Without the stop sequence, generation continues until max_tokens: "The answer is 42. ### Now let me explain..."
Real-World Example
A 99helpers developer builds a feature that classifies ticket priority from message text. Zero-shot prompt: 'Priority (low/medium/high): [ticket text]\nPriority: '. Stop sequence: ['\n']. The model generates 'high' and stops at the newline—returning exactly the priority label without additional explanation. Without the stop sequence, the model might generate: 'high\nThis ticket mentions a system outage affecting multiple users, which warrants immediate attention...' The stop sequence ensures a predictable, parseable output format for this classification use case.
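A minimal sketch of the parsing step for this example, assuming stop=['\n'] has already truncated the completion to a single line. The ALLOWED set comes from the prompt's low/medium/high labels; the fallback to 'medium' on unexpected output is an assumption, not 99helpers' documented behavior.

```python
ALLOWED = {"low", "medium", "high"}

def parse_priority(completion: str) -> str:
    """Normalize a single-line priority completion to a known label.

    Falls back to 'medium' on unexpected output (assumed policy).
    """
    label = completion.strip().lower()
    return label if label in ALLOWED else "medium"

print(parse_priority("high"))    # -> high
print(parse_priority(" HIGH "))  # -> high
```

Because the stop sequence guarantees a one-line response, the parser never has to strip trailing explanations.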
Common Mistakes
- ✕Setting stop sequences that appear inside the expected response—if your stop sequence is a common word or character, it will terminate generation prematurely.
- ✕Using stop sequences as the only response length control—always combine with max_tokens as a safety limit in case the stop sequence is never generated.
- ✕Forgetting that stop sequences are exact string matches—'END' and 'end' are different; ensure case matches expected model output.
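The exact-match pitfall in the last point can be demonstrated directly. would_stop is a hypothetical helper standing in for the API's matching behavior, which is a case-sensitive substring check.

```python
def would_stop(generated: str, stop: str) -> bool:
    """Report whether a stop sequence would fire on this text.

    Mirrors the exact, case-sensitive string match that LLM APIs use.
    """
    return stop in generated

print(would_stop("Report finished. END", "END"))  # -> True: exact match fires
print(would_stop("Report finished. end", "END"))  # -> False: case differs, never fires
```

If the stop never fires, generation runs until max_tokens, which is why the two safeguards belong together.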
Related Terms
Max Tokens
Max tokens is an LLM API parameter that limits the maximum number of tokens the model can generate in a single response, controlling response length, cost, and latency.
Structured Output
Structured output constrains LLM responses to follow a specific format—typically JSON with defined fields—enabling reliable parsing and integration with downstream systems rather than free-form text generation.
LLM API
An LLM API is a cloud service interface that provides programmatic access to large language models, allowing developers to send prompts and receive completions without managing model infrastructure.
JSON Mode
JSON mode is an LLM API feature that guarantees the model's output is valid JSON, ensuring reliable programmatic parsing without worrying about prose text surrounding the JSON object.
Temperature
Temperature is an LLM parameter (0-2) that controls output randomness: low values produce focused, deterministic responses while high values produce more varied, creative outputs.
Ready to build your AI chatbot?
Put these concepts into practice with 99helpers — no code required.
Start free trial →