Table of Contents
Stop Sequences in LLMs - Concept and Implementation
Conceptual Overview
How Stop Sequences Work
Implementation in Different Frameworks
OpenAI API
Hugging Face Transformers
Custom Implementations (Manual Decoding)
Best Practices for Using Stop Sequences
Conceptual Overview
Stop sequences are specific strings that signal a language model to stop generating further text once they appear in the output (How to use stop sequences?). In practice, they act as custom end-of-output markers defined by the developer. Stop sequences are important for controlling the length and format of an LLM’s response. By specifying one or more stop strings, you can ensure the model produces a concise, focused output and doesn’t “run on” into irrelevant or undesired text. This is especially useful for:
Structured outputs: e.g. ensuring a JSON or XML response ends at a closing tag, without extra commentary.
Dialogue systems: to stop the model when it’s the next speaker’s turn (for instance, stopping when the model’s answer is done, before it starts imitating the user).
Preventing rambling or cost overruns: The model stops when the content is complete, saving on token usage and avoiding drifting into off-topic content.
In summary, a stop sequence is simply a substring (one or more characters) that, if generated by the model, will terminate the generation. It doesn’t alter the model’s understanding of the prompt or its training data; instead, it’s an external control to guarantee a controlled, concise response (How to use stop sequences?).
How Stop Sequences Work
During text generation, an LLM produces output token by token until a stopping condition is met (such as reaching a max token limit or an end-of-sequence token). Stop sequences introduce an additional stopping condition: if the output text ends with any of the specified sequences, the generation process halts. The mechanism works as follows:
Token Generation and Checking: The model generates tokens normally based on probabilities. After each new token is added, the output string is checked to see if it ends with the stop sequence. For multi-token stop sequences, the model must generate the entire sequence in order for it to match and trigger a stop (How to use stop sequences?). (Partial matches won’t stop the model until the full sequence appears; a short sketch after this list makes this concrete.)
Halting Behavior: Once a stop sequence is encountered at the end of the output, the generation loop breaks, and no further tokens are produced. Typically, the stop string itself is excluded from the final output (the model stops right after producing it, or even omits it), so the returned text ends just before the stop marker (How do I use stop sequences in the OpenAI API? | OpenAI Help Center). This means the user sees the output truncated at the desired point, and the stop token is effectively like a non-visible EOS (end-of-sequence).
Impact on Log Probabilities: Stop sequences do not change the model’s intrinsic token probabilities or the sampling process up to that point. The model isn’t inherently aware that a certain string will cause an external stop (unless it was explicitly trained with that convention). The generation algorithm (greedy, sampling, etc.) proceeds as usual, selecting tokens based on their probabilities. The only difference is that once a stop condition is met, no further tokens are generated. In frameworks like OpenAI’s API, tokens that form the stop sequence are generated internally but then are not included in the final text response. If you request log probabilities, you would receive log probs for tokens up to (but typically not including) the stop sequence.
Impact on Sampling and Decoding: Stop sequences don’t inherently bias which token is chosen next; they intervene after a token (or sequence of tokens) has been output. For example, if using nucleus or top-k sampling, those techniques shape the distribution of the next token, but if the chosen token happens to lead to a stop pattern, the process will cut off at that point. Some implementations achieve stopping by post-checks (after token is added) while others might proactively prevent further tokens once a stop is detected. Either way, the sampling process is unaffected until the moment of stopping. After stopping, no further sampling occurs. (In custom implementations, one could also remove/bias tokens that would continue beyond the stop, effectively forcing the model to end – but normally the simpler approach is just to break out of the generation loop.)
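To make the multi-token matching concrete, here is a minimal sketch (assuming Hugging Face’s GPT-2 tokenizer; the stop string and example text are illustrative). A stop string such as "\nQ:" typically maps to several token IDs, and a simple suffix comparison over the generated IDs implements the check described above:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# A stop string like "\nQ:" is usually split into several BPE tokens
stop_ids = tokenizer.encode("\nQ:", add_special_tokens=False)
print(stop_ids)  # several token IDs, one per piece

def ends_with_stop(generated_ids, stop_ids):
    # Only a full match of the trailing tokens triggers a stop; a partial match does not
    return len(generated_ids) >= len(stop_ids) and generated_ids[-len(stop_ids):] == stop_ids

partial = tokenizer.encode("A: Paris\nQ", add_special_tokens=False)
full = tokenizer.encode("A: Paris\nQ:", add_special_tokens=False)
print(ends_with_stop(partial, stop_ids))  # False – the stop sequence is not complete yet
print(ends_with_stop(full, stop_ids))     # True – generation would halt here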
In essence, a stop sequence functions as a user-defined “hard stop” for the generated text. It does not influence the content until it appears, at which point generation ceases. This ensures the output doesn’t go beyond the desired endpoint (like the end of an answer or a closing tag), much like putting a period at the end of a sentence and telling the model “stop here.” (Foundation model parameters: decoding and stopping criteria | IBM watsonx)
Implementation in Different Frameworks
OpenAI API
OpenAI’s API provides a built-in parameter to specify stop sequences. When using the Completion or ChatCompletion endpoints, you can pass a stop parameter as either a single string or a list of strings (up to 4 sequences) (How to use stop sequences?). The model will cease generation whenever one of those sequences is produced. The stop sequence itself is not included in the returned text, so the result is neatly truncated (How do I use stop sequences in the OpenAI API? | OpenAI Help Center).
How to use (Python example):
import openai
openai.api_key = "YOUR_API_KEY"
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "user", "content": "List three fruits."}
    ],
    max_tokens=50,
    stop=["\n"]  # Stop when a newline is generated (for example)
)
print(response['choices'][0]['message']['content'])
In the above snippet, stop=["\n"] tells the model to stop if it outputs a newline character. You can specify multiple stop sequences; for instance, stop=["\n", "User:"] would stop generation if either a newline or the string "User:" appears. OpenAI allows up to four stop sequences per API call (How do I use stop sequences in the OpenAI API? | OpenAI Help Center). This feature is often used in chatbots to prevent the assistant from beginning to speak as the user, or in structured outputs to cut off after a certain delimiter.
Effect: If the model generates text that includes a stop sequence, the API will truncate the output at that point. For example, if you set a stop sequence "World" and the model’s natural completion is "Hello World", the API would return only "Hello " (without "World") because it stopped when the sequence was reached. This gives you precise control over where the model’s reply ends.
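As a quick way to confirm why a generation ended, you can inspect the response’s finish_reason field. A minimal sketch against the same pre-1.0 openai client used above (field names follow that response format):
choice = response['choices'][0]
if choice['finish_reason'] == 'stop':
    # The model hit a stop sequence (or its natural end-of-text token)
    print("Stopped cleanly:", choice['message']['content'])
elif choice['finish_reason'] == 'length':
    # max_tokens ran out before any stop sequence appeared
    print("Truncated at max_tokens:", choice['message']['content'])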
Hugging Face Transformers
Hugging Face’s Transformers library (for models like GPT-2, GPT-3 derivatives, LLaMA, etc.) does not have a simple stop="..." argument in the basic model.generate() call (passing an unknown stop_sequence parameter will be ignored or cause a warning) (PygmalionAI/pygmalion-6b · how can i set stop_sequence in generate method?). However, there are several ways to enforce stop sequences:
Using eos_token_id: If your stop sequence corresponds to a special end-of-sequence token in the model’s tokenizer, you can set the model’s eos_token_id (end-of-sequence token) to that token’s ID. The generation will stop when this token is generated. For example, if you fine-tuned a model and added a special token <END> with a known ID, you could do model.generate(..., eos_token_id=tokenizer.encode("<END>")[0]). This is a simple solution if you have a single, unique stop token.
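A minimal sketch of this approach, assuming GPT-2, where the newline character happens to be a single token and can therefore stand in for a custom end-of-sequence marker:
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
input_ids = tokenizer("Q: Name one fruit.\nA:", return_tensors="pt").input_ids

# "\n" encodes to a single GPT-2 token, so it can act as an end-of-sequence marker
newline_id = tokenizer.encode("\n", add_special_tokens=False)[0]
output_ids = model.generate(
    input_ids,
    max_new_tokens=30,
    eos_token_id=newline_id,              # stop as soon as a newline is generated
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token; this avoids a warning
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))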
Using StoppingCriteria: For arbitrary strings (which may consist of multiple tokens), you can use the Transformers library’s stopping criteria mechanism. The library allows you to define custom logic to decide when to stop generation. You do this by subclassing transformers.StoppingCriteria and implementing the __call__ method to check the generated tokens. For example, the following custom stopping criteria stops generation when the output contains a certain substring:
from transformers import AutoModelForCausalLM, AutoTokenizer, StoppingCriteria, StoppingCriteriaList
import torch

class StopOnSubsequence(StoppingCriteria):
    def __init__(self, stop_sequences, tokenizer):
        super().__init__()
        # Encode the stop sequences into token ID tensors up front
        self.stop_sequences_ids = [
            torch.tensor(tokenizer.encode(seq, add_special_tokens=False))
            for seq in stop_sequences
        ]

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        # Check whether the end of input_ids matches any of the stop sequences
        for stop_seq_ids in self.stop_sequences_ids:
            if input_ids.shape[1] >= len(stop_seq_ids) and \
               torch.equal(input_ids[0, -len(stop_seq_ids):], stop_seq_ids.to(input_ids.device)):
                return True  # Stop condition met
        return False

# Load a model and tokenizer (e.g., GPT-2)
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
prompt = "Q: What is 1+1?\nA: 2\nQ: What is the capital of France?\nA:"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Define stop sequences to stop at the end of an answer (stop when a new "Q:" appears)
stop_sequences = ["\nQ:"]
stopping_criteria = StoppingCriteriaList([StopOnSubsequence(stop_sequences, tokenizer)])
output_ids = model.generate(
    input_ids,
    max_new_tokens=100,
    stopping_criteria=stopping_criteria,
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token; this avoids a warning
)
output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(output_text)
In this code, StopOnSubsequence will halt generation as soon as the model’s output ends with any of the specified stop strings (here we use "\nQ:" to stop when a new question begins). This approach works by inspecting the latest tokens in input_ids at each step and comparing them to the encoded stop sequence. When a match is found, __call__ returns True, signaling the generation to stop.
Using GenerationConfig.stop_strings: Recent versions of Transformers provide a high-level way to specify stop strings via GenerationConfig. You can create a GenerationConfig with stop_strings=["...</output>"] (for example) and pass that config to model.generate. The generate function will then automatically stop when any of those strings is produced (Generation). For instance:
from transformers import GenerationConfig
config = GenerationConfig.from_pretrained("gpt2")
config.stop_strings = ["</output>"]
output = model.generate(input_ids, generation_config=config, tokenizer=tokenizer)
This avoids writing a custom stopping class; internally, generate applies an equivalent stop-string check. Note that recent versions require the model’s tokenizer to be passed to generate (as shown) so the stop strings can be matched against the generated tokens.
Note: If you use the Transformers pipeline (e.g., pipeline("text-generation")), you might not have an obvious stop parameter either. One approach is to pass a stop_sequence via generate_kwargs or use the underlying model’s generate with a stopping criteria. Another approach is to post-process the output from the pipeline by truncating at the stop string, although that’s less efficient since the model might generate beyond the desired point. The recommended method is using StoppingCriteria as shown above, which ensures the model stops as soon as the sequence appears (saving compute and preventing unwanted text).
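As a sketch of the post-processing fallback mentioned above (the model, prompt, and stop string are illustrative; note that the model may still spend compute generating past the stop point):
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
stop = "\nQ:"

raw = generator("Q: What is the capital of France?\nA:", max_new_tokens=50)[0]["generated_text"]
# Keep only the text before the first stop string (generated_text includes the prompt by default)
trimmed = raw.split(stop, 1)[0]
print(trimmed)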
Custom Implementations (Manual Decoding)
If you are implementing text generation manually (for example, writing your own decoding loop or using an LLM that doesn’t natively support stop sequences), you can enforce stop sequences with simple logic. The idea is to generate tokens one at a time and check after each token if the recently generated text ends with the stop sequence, then break out.
Pseudo-code for manual decoding with a stop sequence (for clarity):
output_tokens = []
for _ in range(max_tokens):
    next_token = sample_next_token(model, context=output_tokens)  # your function to get the next token
    output_tokens.append(next_token)
    # Convert the tokens generated so far to text to check the stop condition
    output_text = tokenizer.decode(output_tokens)
    if output_text.endswith(stop_sequence):
        break  # Stop generating further tokens
# The output_text will include the stop_sequence at the end; you can remove it if needed.
In this loop, sample_next_token represents the process of forwarding the model to get logits and sampling the next token (greedy or random sampling, depending on your decoding strategy). After appending the token, you check the string form. If it ends in (or contains) your stop string, you halt the loop.
A more efficient variant is to check the token IDs without decoding to a string each time. For example, if your stop sequence is tokenized to the ID sequence [x, y, z], you can check if output_tokens[-len([x, y, z]):] == [x, y, z]. This avoids repeatedly decoding the whole output. The principle remains the same: detect the stop pattern in the output and break early.
This manual approach is essentially what the high-level libraries are doing under the hood when you specify a stop condition. It gives you flexibility to implement complex stopping logic (e.g., stopping on multiple possible substrings or on some semantic condition), at the cost of more coding.
Best Practices for Using Stop Sequences
Choosing effective stop sequences and using them correctly is important to get the desired behavior. Here are some best practices and potential pitfalls:
Choose Unique Delimiters: Use stop sequences that are unlikely to appear in normal generated content except where you intend to stop. For example, a sequence like "###" or a special token (like "</output>" or "END") is usually better than a common word. Using a common word or punctuation (e.g., a single period ".") as a stop sequence can cause the model to stop too early (Foundation model parameters: decoding and stopping criteria | IBM watsonx) (stopping after the first sentence, in this case). Only use such a common stop if you truly want to cut off at that point (a one-sentence output). Otherwise, prefer a delimiter that clearly marks the end of the response.
Include Stop Markers in Prompts/Few-Shot Examples: If you expect the model to use a certain end marker, it’s helpful to show that in the prompt (especially for few-shot prompting or fine-tuning). For instance, if you want the model to output an answer and then stop at "</output>", consider providing an example in the prompt where a sample answer ends with </output>. This helps the model learn that it should include that token at the end of its answer. In a few-shot setting, you might separate examples with a specific token or string (like "===" or "\n\n") and use that as a stop sequence; the model, seeing the pattern, will know to end at that delimiter (How to Get Better Outputs from Your Large Language Model | NVIDIA Technical Blog). A sketch of this pattern appears after this list.
Multiple Stop Sequences: Many APIs allow multiple stop sequences. This is useful if there are several possible markers for end-of-output. For example, you might stop on either "User:" (the start of the next user prompt) or on a generic end token like "</end>". The generation will halt on whichever appears first. Keep in mind there may be limits (OpenAI allows up to 4, IBM’s API allows up to 6, etc.) (LLM Stop Sequences Tokens and Params - Prompting - OpenAI Developer Community). If you find yourself needing many different stop strings, you might reconsider your prompt format (a single unique terminator would probably be more reliable).
Avoid Truncating Valid Content: Ensure your stop sequence won’t accidentally appear in legitimate content before the intended end. For example, if your stop sequence is "END" and the model is listing words that might include “bend” or “weekend”, there’s a risk "END" could appear as part of a word. To mitigate this, you could include surrounding characters (like " END\n" with spaces or a newline) or use a rarely occurring sequence. Another strategy is to use the tokenizer’s special tokens if available (many tokenizers have unused tokens you can repurpose as stop signals).
Case Sensitivity and Exact Matching: Stop sequences are usually matched exactly (case-sensitive, including punctuation). Make sure your stop strings are exactly what you intend to catch in the output. For example, "User:" is different from "user:" – a model outputting lowercase “user:” would not trigger a stop if you only specified the capitalized version. Define the stop variations if needed, or normalize the output before checking (in custom implementations).
Testing and Iteration: Always test your prompt with the chosen stop sequences to see how the model behaves. You might need to adjust either the prompt or the stop strings. If the model sometimes doesn’t produce the stop sequence at all (thus hitting max tokens instead), it could be a sign that the prompt didn’t sufficiently indicate where to stop. In such cases, try guiding the model more explicitly (e.g., “... end the answer with </output>”). If the model stops too early, verify it isn’t using the stop token in an unintended way.
Cost and Efficiency: Using stop sequences can help reduce token usage, which is cost-effective for API usage (How to use stop sequences?). By cutting off superfluous text, you avoid paying for unnecessary tokens. However, note that if a model never produces the stop sequence, it will run until the max tokens limit, so always set a reasonable max_tokens (or equivalent) as a fallback. The combination of a stop sequence and a max token limit ensures your generation won’t run away endlessly. Typically, you’d set the max tokens to a value slightly above what you expect to need, and rely on the stop sequence to halt at the right point if it is reached sooner.
Common Pitfalls: One pitfall is assuming the model won’t produce the stop sequence unless it’s truly at the end. The model might generate the stop string as part of content if it’s a common sequence. For example, if you set stop=["The"], the model will stop every time it outputs the word "The" – clearly not useful, as that could occur in normal text. So choose stops that essentially mean “end of output” in your context. Another pitfall is forgetting that the stop sequence itself might be removed from the output (in the OpenAI API, for instance). If you want the output to include a certain terminator and you also use it as a stop, you may need to append it back or account for its absence. Usually, though, we set stop tokens that we don’t want to appear in the final text (they are just signals to cut off).
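As a sketch of the few-shot delimiter pattern mentioned in the list above (assuming a recent Transformers version that supports stop_strings in generate; the model, prompt, and "###" delimiter are illustrative):
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# "###" both separates the few-shot examples and marks the end of an answer,
# so the same string can be reused as the stop sequence.
prompt = (
    "Translate English to French.\n"
    "English: cat\nFrench: chat\n###\n"
    "English: dog\nFrench: chien\n###\n"
    "English: bird\nFrench:"
)
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
output_ids = model.generate(
    input_ids,
    max_new_tokens=20,
    stop_strings=["###"],                 # stop at the delimiter the prompt establishes
    tokenizer=tokenizer,                  # required so generate can match the stop string
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))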
In conclusion, stop sequences are a powerful tool for shaping LLM outputs. They provide an explicit cutoff point in generated text (How to use stop sequences?), which is invaluable for structured applications, multi-turn interactions, and controlling verbosity. By understanding how to implement them in various frameworks and following best practices to choose the right stop markers, you can avoid common issues and ensure your language model’s output ends exactly where you want it to. This leads to cleaner and more reliable responses, without the model trailing off or mixing in unwanted text. Use stop sequences judiciously to make your LLM outputs easier to manage and integrate into your applications.