ML Interview Q Series: If a model consistently underfits the training data, how can you spot this issue and what measures can you take to resolve it?
Comprehensive Explanation
High bias is another way of saying that a model is underfitting the data. When a model has high bias, it struggles to capture the underlying relationship between features and the target variable, leading to systematically poor performance on both training and validation sets. Underfitting can occur for numerous reasons, such as an overly simple model, insufficient model capacity, or an overly strong regularization setting.
Core Mathematical Understanding
When discussing high bias and underfitting, the bias-variance decomposition is a powerful conceptual tool. The expected error of a model's predictions can be broken down into three components: bias, variance, and irreducible error. This can be written as follows:

$$\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \mathrm{Bias}\big(\hat{f}(x)\big)^2 + \mathrm{Var}\big(\hat{f}(x)\big) + \sigma^2$$

$\mathrm{Bias}(\hat{f}(x))$ in this expression represents how much the model's predicted values deviate systematically from the true target on average. High bias means the model makes similar kinds of errors regardless of the particular training set, often from oversimplification or failing to learn key patterns.
$\mathrm{Var}(\hat{f}(x))$ reflects how much the model's predictions vary across different training sets. If variance is high, slight changes in the training set may drastically change the model's predictions.
The irreducible error $\sigma^2$ is noise inherent in the data-generating process that no model can eliminate.
When a model is said to have high bias, it indicates that the first component, $\mathrm{Bias}(\hat{f}(x))^2$, is a significant contributor to the overall error. Such a model is not capturing the complexity of the data sufficiently well.
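To make the decomposition concrete, the sketch below (assuming NumPy and scikit-learn, with a synthetic quadratic target) estimates the squared bias and variance of a deliberately simple linear model by refitting it on many freshly drawn training sets; the large squared-bias term is the numerical signature of underfitting.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

def true_fn(x):
    # Quadratic ground truth that a plain linear model cannot represent
    return 1.5 * x ** 2

x_test = np.linspace(-3, 3, 50)
n_rounds = 200
preds = np.zeros((n_rounds, x_test.size))

for i in range(n_rounds):
    # Draw a fresh noisy training set each round and refit the simple model
    x_train = rng.uniform(-3, 3, 100)
    y_train = true_fn(x_train) + rng.normal(0, 1, 100)
    model = LinearRegression().fit(x_train.reshape(-1, 1), y_train)
    preds[i] = model.predict(x_test.reshape(-1, 1))

avg_pred = preds.mean(axis=0)
bias_sq = np.mean((avg_pred - true_fn(x_test)) ** 2)  # squared-bias term, averaged over x
variance = np.mean(preds.var(axis=0))                 # variance term, averaged over x
print("Estimated squared bias:", round(float(bias_sq), 3))
print("Estimated variance:", round(float(variance), 3))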
Identifying a High Bias (Underfitting) Model
Poor Training Performance
If the training error is already large, you know that the model is not fitting the training data well. This stands in contrast to high variance, where the training error is often quite small but the test error is large.
Training and Validation Curves
If, as you increase training iterations or model capacity, your training accuracy remains relatively low and close to your validation accuracy, this indicates the model is stuck with high bias. In a learning curve plot, both training and validation scores converge to a low value, signaling that additional data alone will not help much if the model remains too simplistic.
How to Fix High Bias
Increase Model Complexity
Using a more powerful model—such as adding more layers in a neural network or using a more sophisticated algorithm (e.g., going from a linear model to a polynomial or a more flexible ensemble)—can help reduce underfitting. For deep learning architectures, you can add more hidden layers or increase the number of neurons per layer.
Reduce Regularization Strength
If you are using techniques like L1/L2 regularization or dropout, they might be set too aggressively, penalizing the model's complexity excessively. Lowering the regularization parameter or reducing dropout may allow the model to learn more intricate patterns.
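As a minimal sketch of this effect (assuming scikit-learn; the synthetic data and alpha values are purely illustrative), an overly large L2 penalty in ridge regression shrinks the coefficients so aggressively that the model underfits, while a moderate penalty fits the same data well:

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.5, size=500)  # linear signal plus noise

X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

for alpha in [1000.0, 1.0]:  # heavy vs. moderate L2 penalty
    model = Ridge(alpha=alpha).fit(X_tr, y_tr)
    print(f"alpha={alpha:7.1f}  train R^2={model.score(X_tr, y_tr):.3f}"
          f"  val R^2={model.score(X_val, y_val):.3f}")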
Feature Engineering
Adding relevant features or transforming existing features (e.g., polynomial features, interaction terms, domain-specific transformations) can make an otherwise simple model more expressive. Sometimes the issue is not about the model architecture itself, but rather about the features that the model can exploit.
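A brief sketch of that idea, assuming scikit-learn and a synthetic quadratic target: adding polynomial features lets a plain linear regression capture a relationship it would otherwise underfit.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = 2.0 * X[:, 0] ** 2 + rng.normal(scale=0.5, size=300)  # quadratic target

plain = LinearRegression().fit(X, y)  # raw feature only: underfits
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)

print("Linear features R^2:    ", round(plain.score(X, y), 3))
print("Polynomial features R^2:", round(poly.score(X, y), 3))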
Train Longer or Use Better Optimization
In scenarios involving neural networks, sometimes underfitting can be a result of insufficient training time or suboptimal hyperparameters (learning rate, batch size, etc.). Training the model for more epochs or adjusting optimization parameters can help.
Data Preprocessing Improvements
Poor data preprocessing (such as not scaling features for a model that is sensitive to feature scales) can also lead to underfitting. Ensuring correct data normalization or data augmentation (for image or text data) might address high bias in certain contexts.
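For example (a sketch with scikit-learn on synthetic data; the exaggerated feature scale is contrived for illustration), a scale-sensitive model such as an RBF-kernel SVM can appear to underfit when one feature dominates numerically, and adding a StandardScaler to the pipeline typically restores its ability to fit:

from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=10, n_informative=2,
                           n_redundant=2, shuffle=False, random_state=0)
X[:, -1] *= 1000.0  # a pure-noise feature on a huge scale dominates the RBF kernel

unscaled = SVC().fit(X, y)
scaled = make_pipeline(StandardScaler(), SVC()).fit(X, y)

print("Training accuracy, unscaled:", round(unscaled.score(X, y), 3))
print("Training accuracy, scaled:  ", round(scaled.score(X, y), 3))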
Practical Example in Python
Below is a simplified example showing how one might address high bias in a neural network by increasing model complexity, assuming we use PyTorch:
import torch
import torch.nn as nn
import torch.optim as optim

# Example dataset
X = torch.randn(1000, 5)
y = torch.randint(0, 2, (1000,))

# Example simple model that might lead to underfitting
class SimpleNet(nn.Module):
    def __init__(self):
        super(SimpleNet, self).__init__()
        self.fc = nn.Linear(5, 2)

    def forward(self, x):
        return self.fc(x)

model = SimpleNet()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

for epoch in range(100):
    optimizer.zero_grad()
    outputs = model(X)
    loss = criterion(outputs, y)
    loss.backward()
    optimizer.step()

# Evaluate training performance
predictions = torch.argmax(model(X), dim=1)
train_acc = (predictions == y).float().mean()
print("Training accuracy with simple model:", train_acc.item())

# Possible fix: deeper architecture (increasing complexity)
class DeeperNet(nn.Module):
    def __init__(self):
        super(DeeperNet, self).__init__()
        self.fc1 = nn.Linear(5, 16)
        self.fc2 = nn.Linear(16, 8)
        self.fc3 = nn.Linear(8, 2)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        x = self.fc3(x)
        return x

model_deeper = DeeperNet()
optimizer = optim.Adam(model_deeper.parameters(), lr=0.01)

for epoch in range(100):
    optimizer.zero_grad()
    outputs = model_deeper(X)
    loss = criterion(outputs, y)
    loss.backward()
    optimizer.step()

predictions = torch.argmax(model_deeper(X), dim=1)
train_acc_deeper = (predictions == y).float().mean()
print("Training accuracy with deeper model:", train_acc_deeper.item())
In the snippet above, moving from a single linear layer to a network with additional layers and non-linear activations can help the model fit more complex relationships and reduce high bias.
What if you still face high bias even after adding layers or features?
Sometimes, even after adding more layers or features, the model might continue to underfit if:
The data is not relevant to the target, or there is insufficient signal.
The hyperparameters (like learning rate) remain suboptimal, hindering learning.
The number of training epochs is still too low.
Addressing these requires careful experimentation, data validation, hyperparameter tuning, or even re-examining whether the problem is solvable with the given data.
How can learning curves help in pinpointing the cause of underfitting?
In a learning curve, you plot performance (accuracy or loss) on the training and validation sets as a function of the training set size. If the curves converge to a low level of performance (low accuracy or high error), that strongly indicates high bias, meaning that even with ample data, the model fails to capture important patterns. If you see that adding more data does not improve performance, it reinforces the idea that the model is too simple or not well-optimized.
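A sketch of how such a curve might be produced with scikit-learn's learning_curve utility (the logistic-regression model and synthetic dataset are placeholders for your own estimator and data):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 8), cv=5, scoring="accuracy")

plt.plot(train_sizes, train_scores.mean(axis=1), label="training accuracy")
plt.plot(train_sizes, val_scores.mean(axis=1), label="validation accuracy")
plt.xlabel("Training set size")
plt.ylabel("Accuracy")
plt.legend()
plt.show()

# High-bias signature: both curves plateau close together at a low accuracy,
# and adding more data barely moves them.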
Could high bias coexist with high variance in any scenario?
While the typical notion is that a model with high bias has low variance and vice versa, there are indeed situations where a model might exhibit aspects of both. For instance, if you have an ensemble of models—each individually underfitting but whose predictions vary significantly when combined—you could end up with both high bias and relatively high variance overall. Another scenario can occur if a model is not tuned correctly, resulting in it sometimes overfitting specific sub-patterns but completely missing global patterns (underfitting on the main signal). These are more nuanced situations but highlight why examining both training and validation performance—and the distribution of errors—is crucial.
How can regularization balance be maintained to avoid high bias while preventing overfitting?
If you notice persistent underfitting, reduce regularization (for example, lower the lambda in L2 regularization or decrease the dropout rate).
If you see that training error is much lower than validation error, increase regularization.
Using validation metrics to tune the regularization strength is often done through hyperparameter search methods, such as grid search or Bayesian optimization, ensuring you neither overly penalize nor under-penalize the model’s capacity.
Controlling the extent of regularization is a balancing act: too little might lead to overfitting (high variance), whereas too much leads to underfitting (high bias). The ideal point is typically found through systematic experimentation, guided by performance metrics on a validation set.
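One way to run that search, sketched here with scikit-learn's GridSearchCV over the inverse regularization strength C of a logistic regression (the grid values and synthetic data are illustrative):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# C is the inverse regularization strength: small C -> strong penalty (risk of high bias),
# large C -> weak penalty (risk of high variance).
param_grid = {"C": [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]}
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                      cv=5, scoring="accuracy")
search.fit(X, y)

print("Best C:", search.best_params_["C"])
print("Best cross-validated accuracy:", round(search.best_score_, 3))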
Below are additional follow-up questions
How can underfitting manifest in unsupervised learning tasks?
Underfitting is commonly discussed in the context of supervised tasks, but it can also show up in unsupervised contexts like clustering or dimensionality reduction. When a clustering algorithm underfits, it might assign clusters in a way that fails to capture the true structure of the data. For instance, a k-means model might place centroids in positions that poorly separate meaningful subgroups.
Underfitting in dimensionality reduction algorithms (like principal component analysis) can occur if the chosen number of components is too small relative to the true underlying dimensionality. The model then misses important variance in the data, rendering the transformed representation incomplete or not useful for further tasks.
A potential pitfall is mistaking weak or insufficient features for a model-capacity problem. In unsupervised tasks, features directly drive cluster separations or manifold learning; if you only have sparse or highly noisy features, even a more complex unsupervised method may appear to underfit. Sometimes the data simply contains little inherent structure, so no algorithm can produce distinctly meaningful clusters or reduced dimensions.
A practical way to address underfitting in unsupervised tasks is to try alternative algorithms or re-express features. You can also tune hyperparameters like the number of clusters or the number of components. In clustering, silhouette scores or other internal metrics can help identify if your model is consistently missing structure.
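Two such checks, sketched on synthetic data (the blob dataset, cluster range, and component count are assumptions): a silhouette-score sweep over the number of k-means clusters, and the explained variance retained by a chosen number of PCA components.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=600, centers=4, n_features=8, random_state=0)

# Too few clusters (e.g., k=2) underfits a 4-cluster structure; the sweep makes this visible
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(f"k={k}  silhouette={silhouette_score(X, labels):.3f}")

# For PCA, check how much variance the chosen number of components actually retains
pca = PCA(n_components=2).fit(X)
print("Variance retained by 2 components:", round(pca.explained_variance_ratio_.sum(), 3))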
Could data imbalance lead to a high bias scenario, and how would that be handled?
In classification problems, data imbalance means one class outnumbers the others significantly. If a model predominantly predicts the majority class, it may show artificially acceptable accuracy but very poor performance on minority classes. This pattern can sometimes be interpreted as underfitting the minority data distribution. Essentially, the model’s decision boundary becomes too simplistic, failing to capture nuances for the underrepresented classes.
One subtlety is that metrics like accuracy can mask this issue. You might see decent overall accuracy but poor recall or precision for minority classes. The model, in effect, exhibits high bias with respect to the minority classes.
To address this, data-level approaches like oversampling (e.g., SMOTE) or undersampling can balance the training distribution. Alternatively, algorithm-level strategies, such as adjusting class weights or using focal loss, make the model pay closer attention to minority classes. When rebalancing data, be aware that oversampling might lead to overfitting minority samples if done naively, whereas undersampling might lose valuable majority data.
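A minimal sketch of the class-weighting approach, assuming scikit-learn and a synthetic imbalanced dataset; comparing per-class recall with and without class_weight="balanced" makes the minority-class underfitting visible:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Heavily imbalanced synthetic problem: roughly 95% majority class, 5% minority class
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)

# Compare per-class recall: the unweighted model can look accurate overall
# while largely ignoring the minority class.
print(classification_report(y_te, plain.predict(X_te), digits=3))
print(classification_report(y_te, weighted.predict(X_te), digits=3))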
How do you differentiate between underfitting and limitations imposed by noisy data?
In many real-world datasets, noise can limit the best achievable performance. Underfitting due to high bias means the model has not even learned the core patterns that do exist. However, even an ideal model cannot go below the irreducible error if the data is noisy.
You can examine this by looking at how your model behaves on carefully curated subsets of the data. If you can filter out rows with questionable labels or anomalies (suspected noise) and you see performance drastically improve, it suggests that noise is the limiting factor rather than the model’s bias.
Additionally, plotting learning curves can help. If the model plateaus at a low performance level with both training and validation sets, that might indicate either high bias or noise constraints. By systematically experimenting with more complex models or well-engineered features and observing if there’s a significant boost in performance, you can tease out whether the limitation is inherent noise or if the model just needs more capacity.
In production environments, how can you continuously detect if the model starts to underfit over time?
Models that perform well initially might gradually degrade if the data distribution changes (concept drift). If your model is simplistic, it might not adapt well, causing it to effectively “underfit” the new data patterns.
One strategy is to implement continuous monitoring of performance metrics on fresh data. This can be done through a rolling window evaluation: gather predictions over a defined period, compare against the ground truth, and track key metrics. If these metrics drop persistently below certain thresholds, that’s an indicator the model may be underfitting.
Another subtlety is that retraining on stale data may further compound underfitting if the new data distribution is not well-represented or if the feature engineering no longer captures novel patterns. Careful checks on data drift, including distribution monitoring of features and targets, are essential. If drift is detected, updating your feature engineering steps or adopting online learning approaches can be necessary.
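A rough sketch of such rolling-window monitoring with pandas; the log schema, window length, and alert threshold are hypothetical:

import pandas as pd

# Hypothetical log of timestamped predictions joined with (delayed) ground-truth labels
log = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=1000, freq="h"),
    "prediction": [0, 1] * 500,
    "label": [0, 1] * 500,
})
log["correct"] = (log["prediction"] == log["label"]).astype(int)

# Rolling accuracy over a 7-day window of hourly records
rolling_acc = log.set_index("timestamp")["correct"].rolling("7D").mean()

THRESHOLD = 0.80  # alert threshold chosen for illustration only
alerts = rolling_acc[rolling_acc < THRESHOLD]
if not alerts.empty:
    print(f"Degradation detected starting {alerts.index[0]} "
          f"(rolling accuracy {alerts.iloc[0]:.2f})")
else:
    print("Rolling accuracy above threshold; no underfitting alert.")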
When might underfitting be acceptable or even desirable?
A simpler model that underfits slightly can still be beneficial in scenarios where interpretability and stability are paramount. For mission-critical applications like healthcare, a very complex model might be able to eke out higher accuracy but be opaque and prone to unpredictable errors. A less complex model, while possibly having a small degree of underfitting, could provide consistent performance with fewer unpredictable edge cases.
Regulatory constraints in certain industries may require transparency that more complex models cannot easily provide. Similarly, in extremely high-dimensional problems with limited data, a simpler bias-prone model may generalize more reliably than an over-parametrized model that risks learning spurious patterns.
What if you cannot feasibly make the model more complex, or feature engineering is not an option?
Sometimes there are strict resource constraints. You might be running on edge devices with minimal computational power, where large models cannot be deployed. Or you have no additional domain features to engineer, and your dataset is inherently sparse.
In these cases, you can optimize whatever capacity you do have. If using a linear model, you can fine-tune hyperparameters like regularization strength or learning rate. You could also investigate advanced regularization techniques that might be more parameter-efficient, or experiment with knowledge distillation to transfer from a large teacher model to a constrained student model.
Hyperparameter tuning with systematic search (like grid search or Bayesian optimization) over simpler models can yield small but meaningful performance gains within tight resource constraints. In certain tasks, you may also consider alternative formulations—like one-shot or few-shot learning—to incorporate prior knowledge without explicitly adding more features.
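A compact PyTorch sketch of the knowledge-distillation idea mentioned above (the architectures, temperature, and mixing weight are illustrative assumptions, and the teacher here is only a random stand-in for a model that would be pretrained in practice):

import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Sequential(nn.Linear(5, 64), nn.ReLU(), nn.Linear(64, 2))  # stand-in for a pretrained, larger model
student = nn.Sequential(nn.Linear(5, 8), nn.ReLU(), nn.Linear(8, 2))    # small model that fits the resource budget
optimizer = torch.optim.Adam(student.parameters(), lr=0.01)

X = torch.randn(256, 5)
y = torch.randint(0, 2, (256,))
T, alpha = 2.0, 0.5  # softening temperature and loss-mixing weight (illustrative values)

for epoch in range(50):
    with torch.no_grad():
        teacher_logits = teacher(X)
    student_logits = student(X)

    # Soft-target loss: KL divergence between softened teacher and student distributions
    distill = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                       F.softmax(teacher_logits / T, dim=1),
                       reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, y)  # ordinary supervised loss on the true labels
    loss = alpha * distill + (1 - alpha) * hard

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()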
Does adding more training data always solve underfitting?
Acquiring more data can help, but if the model is fundamentally too simple or the data fails to capture the full complexity of the task, underfitting can persist. You might observe only marginal gains when scaling the dataset if the hypothesis class is insufficiently expressive.
In such scenarios, the shortfall is not due to lack of data but due to the model’s structure, lack of critical features, or overly aggressive regularization. Another hidden pitfall is that new data might not be distributed similarly to the old data; if there is domain shift, then you might still be underfitting the newly introduced patterns.
Careful curation of additional data to ensure it improves coverage of relevant patterns is more critical than simply throwing more examples at an overly simplistic model.