Table of Contents
What are the different ways you can define stopping criteria in large language models?
Training Stopping Criteria
Inference Stopping Criteria
Implementation in PyTorch
Implementation in TensorFlow
Large Language Models (LLMs) require well-defined stopping criteria both during training and inference. Proper stopping conditions ensure that training doesn’t overrun (wasting resources or overfitting) and that generated text is of appropriate length and quality. Below is a comprehensive breakdown of common stopping criteria in both contexts, followed by implementation strategies in PyTorch and TensorFlow.
Training Stopping Criteria
Loss-Based Stopping: Training can stop once the loss has converged or fallen below a certain threshold. In practice, this means if the training loss is no longer decreasing significantly (i.e. improvement per epoch falls below a small ϵ) the model is considered converged (A Gentle Introduction to Gradient Descent and Its Variants | MLDemystified). This prevents wasting epochs when further improvement is marginal. Some setups even define a target loss value; reaching it triggers an early termination of training.
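As a rough sketch, the loss-convergence check boils down to comparing the mean loss between epochs. The helper below assumes a hypothetical train_one_epoch function that returns the epoch’s average training loss, and the epsilon value is illustrative.

```python
# Minimal sketch: stop when the per-epoch loss improvement falls below epsilon.
# `train_one_epoch` is a hypothetical function standing in for your training loop.
def train_until_converged(train_one_epoch, max_epochs=100, eps=1e-4):
    prev_loss = float("inf")
    for epoch in range(max_epochs):
        loss = train_one_epoch()          # returns average training loss
        if prev_loss - loss < eps:        # improvement smaller than threshold
            print(f"Converged at epoch {epoch}: loss={loss:.4f}")
            break
        prev_loss = loss
```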
Early Stopping with Validation Metrics: A prevalent strategy is to monitor a validation metric (like validation loss) and stop when it ceases to improve (neural networks - Loss convergence in deep learning - Cross Validated). For example, if the validation loss has not decreased for N consecutive evaluations (the patience), training is halted to avoid overfitting. This uses a hold-out set to determine when the model’s generalization performance plateaus. Modern libraries implement this via callbacks that check the metric at each epoch end and stop when no improvement is seen within the patience window.
Gradient-Based Stopping: Monitoring the gradients can signal when to stop. If gradients vanish (the norm falls below a tiny threshold), the model has likely reached a minimum or a plateau and further training won’t change the parameters much (A Gentle Introduction to Gradient Descent and Its Variants | MLDemystified). Conversely, if gradients explode (the norm becomes NaN or inf), it indicates instability or divergence. In practice, one might stop training when the gradient norm stays below a minimum (indicating convergence) or when non-finite gradients are detected (to avoid wasting time on a blown-up run). This criterion is rarely the sole stopper, but it’s useful for diagnosing stalled training progress.
Computational Constraints: Often, training is stopped not by model convergence but by a predefined budget. This could be a maximum number of epochs/steps or a limit on time, FLOPs, or cost. For instance, many large models are trained on a fixed number of tokens based on prior scaling law calculations (e.g. the Chinchilla strategy fixes an optimal training duration given a compute budget) (How Long Should You Train Your Language Model? | Databricks Blog). In practice, one might simply run for a set number of epochs or until a time limit is reached, then stop. This ensures training fits within resource constraints even if the model hasn’t fully converged.
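In a custom loop, such a budget check is just a step counter and a clock. The sketch below assumes a hypothetical training_steps iterable standing in for your data pipeline; the limits are illustrative.

```python
import time

# Minimal sketch: stop on whichever budget runs out first — a fixed number of
# steps or a wall-clock limit. `training_steps` is a hypothetical iterable of
# (step, batch) pairs standing in for a real data-loader loop.
def train_within_budget(training_steps, max_steps=100_000, max_seconds=24 * 3600):
    start = time.time()
    for step, batch in training_steps:
        ...  # forward/backward/optimizer update goes here
        if step >= max_steps or time.time() - start >= max_seconds:
            print(f"Budget reached at step {step}")
            break
```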
Divergence Detection: If training behaves pathologically (e.g. the loss suddenly jumps to a very high value or becomes NaN), an automatic halt is desirable. Callbacks can watch for the loss becoming NaN/inf or exceeding a certain “unreasonable” threshold. For example, Keras provides TerminateOnNaN, which immediately stops training on a NaN loss (TerminateOnNaN). Similarly, frameworks may allow setting a divergence threshold so that if validation loss goes above some limit (indicating the model has diverged and likely won’t recover), training stops early (Early Stopping — PyTorch Lightning 2.5.0.post0 documentation). This spares time and allows fixing the cause (like too high a learning rate) rather than continuing a doomed run.

Token-Level Performance: In language modeling, metrics like perplexity or token accuracy on a validation set can be monitored. Training can be stopped when these token-level performance metrics stabilize (plateau), indicating the model isn’t learning finer details anymore (Why don't we use validation/test sets more in LLM fine-tuning? : r/LocalLLaMA). Recent research even proposes specialized metrics for certain LLM behaviors to decide when to stop. For example, an Instruction Following Score (IFS) was used as an early stopping criterion in instruction-tuning, since models learned to follow instructions early and further training only changed other semantics (2307.03692 Becoming self-instruct: introducing early stopping criteria for minimal instruct tuning). In essence, when the quality of generated tokens or the model’s ability (measured by a specific metric) stops improving, it’s an indicator to stop training.
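As a simple illustration, validation perplexity is just the exponential of the mean token-level cross-entropy, so a plateau check over recent evaluations looks much like a loss-plateau check. The window and threshold below are illustrative.

```python
import math

# Sketch: perplexity from mean per-token cross-entropy, plus a plateau check.
# `val_losses` is assumed to be a list of mean token losses, one per evaluation.
def perplexity(mean_token_loss: float) -> float:
    return math.exp(mean_token_loss)

def has_plateaued(val_losses, window=3, min_delta=1e-3):
    """True if the last `window` evaluations improved by less than min_delta."""
    if len(val_losses) <= window:
        return False
    recent_best = min(val_losses[-window:])
    earlier_best = min(val_losses[:-window])
    return earlier_best - recent_best < min_delta
```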
Learning Rate Plateau: If using learning rate schedules or adaptive reduction, the point at which the learning rate reaches its minimum can signal that training has run its course. For instance, with a reduce-on-plateau scheduler, once it has reduced the learning rate to the min_lr and the model still isn’t improving, continuing training has little benefit. Practically, one might combine this with early stopping: e.g. if the learning rate has decayed to a small value and no improvement is seen in the metric, stop training. This ensures we don’t keep training with a tiny learning rate that yields no progress.
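A minimal sketch of that combined check in PyTorch, assuming the optimizer, the min_lr floor, and a metric_improved flag come from your own training loop:

```python
# Sketch (PyTorch): alongside a reduce-on-plateau scheduler, stop once the
# learning rate has decayed to its floor and the monitored metric is no longer
# improving. All inputs are assumed to come from the surrounding training loop.
def should_stop_on_lr_floor(optimizer, min_lr, metric_improved):
    current_lr = min(group["lr"] for group in optimizer.param_groups)
    return current_lr <= min_lr and not metric_improved
```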
Inference Stopping Criteria
Max Token Length: The generation process is often halted after a predefined number of tokens. This max_length (or max_tokens) cutoff ensures responses don’t run on indefinitely. For example, in the OpenAI API, max_tokens defines the maximum length of the response and the model will cut off at that point (Struggling with max_tokens and getting responses within a given limit, please help! - API - OpenAI Developer Community). This is a hard limit to prevent runaway text generation or excessive output length.

EOS (End-of-Sequence) Token: Most LLMs use a special end-of-sequence token (e.g. </s> or <|endoftext|>) learned during training to indicate completion. During decoding, if the model outputs this EOS token, the generation stops immediately – it's a signal that the model considers the text complete. In implementations, generation will continue until an EOS token is produced or the max length is reached, whichever comes first (generate do not stop after generating eos_token with batch process · Issue #31261 · huggingface/transformers · GitHub). This way, the model can determine its own stopping point when it has finished a sentence or answer.

Log Probability Threshold: Another criterion is to stop when the model’s confidence drops below a certain level. For instance, one might define a threshold on the average log-probability of generated tokens; if the probability of continuing the sequence becomes very low, the generation is ended. (In OpenAI’s Whisper, a --logprob_threshold is used similarly: if the average log probability falls below a set value, the decoding is treated as failed and stops (Stops working after long gap with no speech? · openai whisper · Discussion #29 · GitHub).) In text generation, this could prevent the model from trailing off into extremely uncertain or nonsensical text by cutting off when it’s no longer confident in any next token.

Repetition Avoidance: LLMs can sometimes get stuck in loops, repeating the same phrase or token sequence ad infinitum. A stopping criterion can watch for this behavior – for example, if the model has repeated an n-gram or a sequence of tokens beyond a set number of times, we assume it’s in a loop and halt generation. This is especially useful for smaller or uncapped models that might otherwise generate endless repetitive text (StoppingCriteria for Repetition · Issue #32902 · huggingface/transformers · GitHub). By detecting a repetition pattern (like the same token appearing too many times or an output that starts repeating previous content), the decoder can terminate early to avoid useless output.
Forced Decoding Completion: In some scenarios, we explicitly require a fixed number of output tokens, stopping exactly after that many tokens are generated. This is essentially a strict length cutoff, but here it’s not just a safety limit (like max length) but a deliberate fixed size for the output. For example, if an application expects exactly a 100-token summary, the decoding can be forced to stop after 100 tokens regardless of EOS. Implementations achieve this by setting max_length (or max_new_tokens) to the exact desired count and possibly disabling the EOS token during generation. The idea is to produce a fixed-length output every time – the generation stops once the model has emitted the required number of tokens, even if the sentence isn’t naturally complete.
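One way to approximate this with the Hugging Face generate API is to set min_new_tokens equal to max_new_tokens, so that EOS cannot end the sequence early. This is a sketch; the model name and prompt are placeholders, and exact parameter availability depends on your transformers version.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Sketch: force exactly 100 new tokens by making min_new_tokens == max_new_tokens
# (EOS is suppressed until the minimum length is reached). "gpt2" is illustrative.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Summarize the report:", return_tensors="pt")
outputs = model.generate(
    **inputs,
    min_new_tokens=100,   # do not allow EOS before 100 new tokens
    max_new_tokens=100,   # hard cutoff at 100 new tokens
    do_sample=False,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```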
Implementation in PyTorch
PyTorch (Custom or with Trainer APIs): In pure PyTorch training loops, you can implement stopping criteria manually. For instance, you might track the validation loss each epoch and use a simple Python condition to break out of the loop when it hasn’t improved for several epochs (early stopping). Similarly, you could break if the training loss converges (change < ε) or if torch.isnan(loss) is detected (divergence). When using higher-level APIs, there are built-in solutions as well, described below.
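Before turning to those built-in options, a bare-bones version of the manual pattern above, assuming hypothetical train_one_epoch and evaluate helpers that return mean losses, might look like this:

```python
import math
import torch

# Bare-bones manual early stopping in a plain PyTorch loop. `train_one_epoch`
# and `evaluate` are hypothetical helpers returning mean train/validation loss.
def fit(model, train_one_epoch, evaluate, max_epochs=50, patience=3):
    best_val, epochs_without_improvement = math.inf, 0
    for epoch in range(max_epochs):
        train_loss = train_one_epoch(model)
        if not math.isfinite(train_loss):
            print("Non-finite training loss, stopping (divergence).")
            break
        val_loss = evaluate(model)
        if val_loss < best_val:
            best_val, epochs_without_improvement = val_loss, 0
            torch.save(model.state_dict(), "best.pt")   # keep best weights
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                print(f"No improvement for {patience} epochs, early stopping.")
                break
```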
PyTorch Lightning: Lightning provides a Trainer with an EarlyStopping callback. You specify a metric to monitor (e.g. "val_loss") and a patience. The callback will automatically stop the training loop when the metric hasn’t improved for the given patience (Early Stopping — PyTorch Lightning 2.5.0.post0 documentation). You can also set parameters like stopping_threshold (to stop once the metric reaches a desired value) or divergence_threshold (to stop if the metric goes out of bounds, indicating divergence). To keep the best model weights, pair it with a ModelCheckpoint callback and reload the best checkpoint after training.
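A minimal sketch of the Lightning setup; the module, data loaders, and threshold values are placeholders.

```python
import lightning as L
from lightning.pytorch.callbacks import EarlyStopping

# Sketch: EarlyStopping in PyTorch Lightning. `MyLightningModule`,
# `train_loader`, and `val_loader` stand in for your own module and data.
early_stop = EarlyStopping(
    monitor="val_loss",          # metric logged in validation_step
    patience=5,                  # evaluations with no improvement before stopping
    mode="min",
    stopping_threshold=1.5,      # optional: stop once val_loss reaches this value
    divergence_threshold=10.0,   # optional: stop if val_loss blows past this
)

trainer = L.Trainer(max_epochs=100, callbacks=[early_stop])
# trainer.fit(MyLightningModule(), train_loader, val_loader)
```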
Hugging Face Transformers: The Trainer API in HuggingFace Transformers offers an EarlyStoppingCallback that you can add to the Trainer (e.g. via its callbacks argument). The patience (early_stopping_patience) is set on the callback itself, while related options such as load_best_model_at_end=True and metric_for_best_model are configured via TrainingArguments. The callback monitors your chosen metric and stops when it fails to improve for the specified number of evaluations (Callbacks). It integrates with the Trainer’s evaluation loop, so you get automatic early stopping without manual checks.
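A minimal sketch of that setup; the model and dataset objects are placeholders, and some argument names (e.g. eval_strategy vs. evaluation_strategy) vary across transformers versions.

```python
from transformers import Trainer, TrainingArguments, EarlyStoppingCallback

# Sketch: early stopping with the Hugging Face Trainer. Patience lives on the
# callback; best-model tracking is configured through TrainingArguments.
args = TrainingArguments(
    output_dir="out",
    eval_strategy="epoch",            # "evaluation_strategy" in older versions
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

# trainer = Trainer(
#     model=model,                    # placeholder model
#     args=args,
#     train_dataset=train_ds,         # placeholder datasets
#     eval_dataset=eval_ds,
#     callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
# )
# trainer.train()
```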
Gradient Monitoring: In PyTorch, you can examine gradients during training to catch problems. For example, after loss.backward() you can compute each parameter’s gradient norm (or use torch.nn.utils.clip_grad_norm_, which returns the total norm) to track the overall gradient magnitude. If this norm is zero (or extremely small) for many iterations, it suggests learning has stalled. If it’s inf or NaN, something went wrong numerically. PyTorch doesn’t have a built-in “stop on gradient” callback, but you can enable anomaly detection (torch.autograd.set_detect_anomaly(True)) to raise errors on NaNs, or simply break the loop if not torch.isfinite(grad_norm). In frameworks like Lightning, there is a check_finite option in the EarlyStopping callback to abort if the monitored metric becomes NaN (which indirectly catches exploding loss/gradients).
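A small sketch of such a gradient check, with illustrative thresholds:

```python
import torch

# Sketch: compute the total gradient norm after backward() and flag stalls or
# blow-ups. The stall threshold is illustrative.
def check_gradients(model, stall_threshold=1e-7):
    total_sq = torch.tensor(0.0)
    for p in model.parameters():
        if p.grad is not None:
            total_sq = total_sq + p.grad.detach().norm() ** 2
    grad_norm = total_sq.sqrt()
    if not torch.isfinite(grad_norm):
        return "diverged"          # NaN/inf gradients: stop and investigate
    if grad_norm < stall_threshold:
        return "stalled"           # vanishing gradients: likely converged/stuck
    return "ok"
```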
Stopping Callbacks: Both Lightning and Hugging Face allow custom callbacks. You can write a callback that checks any condition at runtime (like a time limit or a certain accuracy reached) and stops training. In Lightning, a callback can set trainer.should_stop = True; in Hugging Face, the TrainerControl object can signal should_training_stop=True. These mechanisms let you implement any bespoke stopping criterion (for example, stop at a specific epoch number or based on an external signal).
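For instance, a wall-clock limit can be expressed as a Hugging Face TrainerCallback that flips should_training_stop; the sketch below is one possible shape for it.

```python
import time
from transformers import TrainerCallback

# Sketch: a bespoke stopping criterion as a TrainerCallback — here a wall-clock
# limit, but the same pattern works for any condition you can evaluate.
class TimeLimitCallback(TrainerCallback):
    def __init__(self, max_seconds):
        self.max_seconds = max_seconds
        self.start = None

    def on_train_begin(self, args, state, control, **kwargs):
        self.start = time.time()

    def on_step_end(self, args, state, control, **kwargs):
        if time.time() - self.start > self.max_seconds:
            control.should_training_stop = True   # Trainer exits its loop cleanly
        return control
```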
Inference (PyTorch): When using the transformers library for generation, you can pass max_new_tokens or max_length to limit output length, and eos_token_id to define the end token – the generate function will stop when it encounters this token (generate do not stop after generating eos_token with batch process · Issue #31261 · huggingface/transformers · GitHub). For custom stopping, you can subclass StoppingCriteria in HuggingFace. This allows implementing rules like stopping on a custom phrase or based on generated content (e.g. stop if the sequence contains a certain substring or if a repetition is detected). For example, one could implement a StoppingCriteria that monitors the generated tokens and returns True (stop) if it finds the last 5 tokens are the same as the 5 before them (a sign of looping). The HuggingFace generation API will call these criteria every step and halt when any returns True.
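A sketch of such a repetition criterion, checking only the first sequence in the batch and using an illustrative 5-token window:

```python
import torch
from transformers import StoppingCriteria, StoppingCriteriaList

# Sketch: halt generation when the most recent n-gram exactly repeats the
# n-gram before it. Checks only the first sequence in the batch.
class RepetitionStoppingCriteria(StoppingCriteria):
    def __init__(self, ngram_size: int = 5):
        self.n = ngram_size

    def __call__(self, input_ids: torch.LongTensor, scores, **kwargs) -> bool:
        if input_ids.shape[1] < 2 * self.n:
            return False
        last = input_ids[0, -self.n:]
        prev = input_ids[0, -2 * self.n : -self.n]
        return bool(torch.equal(last, prev))

# Usage (model/tokenizer/inputs are placeholders):
# outputs = model.generate(
#     **inputs,
#     max_new_tokens=200,
#     stopping_criteria=StoppingCriteriaList([RepetitionStoppingCriteria(5)]),
# )
```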
Implementation in TensorFlow
In TensorFlow/Keras, the high-level Model.fit API supports callbacks that make implementing stopping criteria straightforward:
EarlyStopping Callback: Keras provides tf.keras.callbacks.EarlyStopping out-of-the-box. You specify monitor (e.g. "val_loss"), patience, and mode ("min" or "max"). During model.fit, after each epoch, this callback checks the monitored metric – if it hasn’t improved in the last patience epochs, it stops training (EarlyStopping). You can also set restore_best_weights=True to revert to the best model observed. This single callback covers loss-based and validation-based stopping conditions (it effectively looks for convergence or no improvement).
TerminateOnNaN: For divergence detection, Keras has a TerminateOnNaN callback. Simply include keras.callbacks.TerminateOnNaN() in your callbacks list, and if at any point the loss becomes NaN, training will immediately halt (TerminateOnNaN). This is useful to catch exploding gradients or other numerical issues.
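A minimal sketch combining the two callbacks; the model and data objects are placeholders.

```python
import tensorflow as tf

# Sketch: validation-based early stopping plus NaN termination in Keras.
callbacks = [
    tf.keras.callbacks.EarlyStopping(
        monitor="val_loss",
        patience=5,
        mode="min",
        restore_best_weights=True,        # roll back to the best epoch on stop
    ),
    tf.keras.callbacks.TerminateOnNaN(),  # abort immediately if loss becomes NaN
]

# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=100, callbacks=callbacks)
```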
ReduceLROnPlateau + EarlyStopping: A common practice is to combine ReduceLROnPlateau (which lowers the learning rate when a metric plateaus) with EarlyStopping. For example, you monitor validation loss: if it doesn’t improve for 2 epochs, reduce the LR; if it doesn’t improve for, say, 5 epochs, stop training. The ReduceLROnPlateau callback has a min_lr parameter – when that floor is reached, it implies no further significant training progress can be made at a lower LR. Often training is stopped shortly after hitting min_lr if no improvement occurs. While Keras doesn’t have a built-in “stop at min_lr”, you can achieve it by checking the optimizer’s learning rate in a custom callback and calling model.stop_training = True when it’s below a threshold.
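One possible sketch of such a callback; the thresholds are illustrative, and it assumes the optimizer uses a plain learning-rate variable (as ReduceLROnPlateau requires) rather than a schedule.

```python
import tensorflow as tf

# Sketch: stop training once the learning rate has decayed to (or below) a floor
# and the monitored metric is still not improving.
class StopAtMinLR(tf.keras.callbacks.Callback):
    def __init__(self, min_lr=1e-6, monitor="val_loss", min_delta=1e-4):
        super().__init__()
        self.min_lr, self.monitor, self.min_delta = min_lr, monitor, min_delta
        self.best = float("inf")

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        current = logs.get(self.monitor, float("inf"))
        improved = self.best - current > self.min_delta
        self.best = min(self.best, current)
        lr = float(self.model.optimizer.learning_rate)  # assumes a plain LR variable
        if lr <= self.min_lr and not improved:
            self.model.stop_training = True
```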
Custom Callbacks: Keras callbacks are very flexible. You can subclass tf.keras.callbacks.Callback and override methods like on_epoch_end or on_batch_end. This allows implementing arbitrary stopping logic. For instance, you could monitor the average token probability in a language model and stop if it drops below some value, or even integrate external evaluation metrics (like BLEU score on a validation set for a translation model) and stop when that metric saturates.
Monitoring Tools: During inference in TF (e.g. using tf.function or generating sequences with a loop), you typically implement stopping manually. If using Keras for generation (e.g. with a TextVectorization layer and a custom loop), you’d check after each token whether it is the end token or whether the length exceeds a limit (a minimal sketch follows after this list). In practice, many use HuggingFace even with TF models for easier text generation. But if using TensorFlow Serving or custom generation code, be sure to include checks in the loop for the EOS token and the max length. You might also use probability thresholds – for example, break out of the loop if the model’s predicted max probability falls below a certain cutoff (though this is uncommon in Keras out-of-the-box).

Integration with TensorBoard: While not a stopping criterion per se, monitoring training in TensorBoard can help you decide when to stop manually. You’d look at the curves of loss or accuracy and could interrupt training if you see divergence or a plateau. In automated settings, however, the above callbacks handle it without manual intervention.
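Picking up the manual TF decoding mentioned above, here is a greedy-decoding sketch with explicit EOS and max-length checks. It assumes a model that maps token ids directly to logits of shape [batch, seq_len, vocab]; a Hugging Face TF model would instead return an object whose .logits field you would use.

```python
import tensorflow as tf

# Sketch: greedy decoding with two stopping checks — a hard max-length cutoff and
# an EOS-token check. `model`, `input_ids`, and `eos_token_id` are placeholders.
def greedy_generate(model, input_ids, eos_token_id, max_new_tokens=100):
    generated = tf.cast(input_ids, tf.int32)                # [1, prompt_len]
    for _ in range(max_new_tokens):                         # hard length cutoff
        logits = model(generated)                           # [1, seq_len, vocab]
        next_id = tf.argmax(logits[:, -1, :], axis=-1, output_type=tf.int32)
        generated = tf.concat([generated, next_id[:, tf.newaxis]], axis=-1)
        if int(next_id[0]) == eos_token_id:                 # EOS check
            break
    return generated
```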
Summary: Both PyTorch and TensorFlow offer robust mechanisms to implement these stopping criteria. Using high-level callbacks and trainer APIs can greatly simplify managing training stops, while low-level access allows custom and fine-grained control when needed. By applying these criteria, practitioners ensure that LLM training runs are efficient and that inference generates outputs that are neither too short nor unnecessarily long, aligning with the latest best practices in the field. (Struggling with max_tokens and getting responses within a given limit, please help! - API - OpenAI Developer Community)