ML Interview Q Series: How are Random Oversampling and Random Undersampling different, and in which scenarios are they typically used?
Comprehensive Explanation
Class imbalance arises when one class (often called the “majority class”) has substantially more samples than another class (the “minority class”). This imbalance can degrade the performance of many standard machine learning models because they might be biased toward the majority class.
Random Oversampling is a technique that replicates existing samples from the minority class until both classes have a more balanced distribution. By contrast, Random Undersampling removes samples from the majority class to match its quantity to the minority class. In simpler terms, Random Oversampling adds data points to the minority class, while Random Undersampling discards data points from the majority class.
Random Oversampling
Random Oversampling aims to boost the presence of the minority class by reintroducing its data points into the training set. The process involves sampling (with replacement) minority-class examples until the minority class attains a desired number of samples.
The primary advantage is that it does not result in any loss of potentially useful information. All data points, including those from the majority class, remain accessible to the model.
One of the biggest drawbacks is the risk of overfitting. Since the same data points from the minority class get repeated, the model may learn idiosyncrasies specific to these duplicated examples, leading to poorer generalization on unseen data.
In practice, more sophisticated variations of oversampling exist, such as SMOTE (Synthetic Minority Over-sampling Technique), which synthesizes new data points by interpolating between existing minority samples, thereby mitigating the overfitting problem that arises from naive repetition.
Random Undersampling
Random Undersampling randomly discards examples from the majority class until the counts of majority and minority samples reach a more comparable ratio.
A major benefit is that it diminishes the memory and computational demands for the subsequent training, because the dataset size decreases.
The main drawback is the possible loss of critical information from the majority class, which could impair the model’s ability to generalize or capture relevant patterns.
Undersampling is usually preferred when the dataset is extremely large and memory or training-time constraints come into play. However, if the minority class is already quite small, matching it through aggressive undersampling means discarding most of the majority class, which can eliminate valuable patterns present in that data.
When Each Method is Appropriate
Random Oversampling is often a first attempt when the minority class is severely underrepresented, especially if one wants to avoid discarding majority-class data. If there is a worry about overfitting on repeated samples, hybrid approaches such as SMOTE or using synthetic data generation might be more appropriate.
Random Undersampling is more practical when one has limited computational resources or if the majority class is extremely large. By discarding some of the majority examples, training becomes faster and more memory-efficient. Nevertheless, one must be cautious not to discard important majority instances that carry significant information about the decision boundary.
In real-world applications, a combination of oversampling the minority class and undersampling the majority class sometimes yields better results than either technique alone.
Example Code in Python
Below is a simplified illustration using the imbalanced-learn library:
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from collections import Counter
import numpy as np
import pandas as pd
# Suppose X and y are your features and labels respectively
X = pd.DataFrame({
'feature1': np.random.randn(1000),
'feature2': np.random.randn(1000)
})
y = np.array([0]*950 + [1]*50) # Imbalanced data
print("Original distribution:", Counter(y))
# Random Oversampling
ros = RandomOverSampler(random_state=42)
X_resampled_over, y_resampled_over = ros.fit_resample(X, y)
print("After Random Oversampling:", Counter(y_resampled_over))
# Random Undersampling
rus = RandomUnderSampler(random_state=42)
X_resampled_under, y_resampled_under = rus.fit_resample(X, y)
print("After Random Undersampling:", Counter(y_resampled_under))
Potential Follow-Up Questions
What are some potential pitfalls with Random Oversampling, and how do we mitigate them?
Random Oversampling can lead to overfitting because the minority class points are repeated, which makes the model “memorize” these samples rather than learn generalized patterns. One way to mitigate this is to employ methods like SMOTE or ADASYN, which create new synthetic points instead of just duplicating existing ones. It’s also possible to regularize the model more or tune hyperparameters to combat overfitting.
How does SMOTE improve upon naive Random Oversampling?
SMOTE (Synthetic Minority Over-sampling Technique) generates new minority samples by interpolating between an existing minority point and one of its nearest minority-class neighbors. By doing so, it makes the minority region of feature space more continuous and reduces the chance of overfitting on duplicated samples. However, SMOTE can also generate out-of-distribution or noisy samples if the minority class is very sparse or overlaps heavily with the majority class.
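As a rough sketch (assuming the imbalanced-learn library; the dataset below is synthetic and purely illustrative), SMOTE can be swapped in for RandomOverSampler almost directly:
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from collections import Counter

# Hypothetical imbalanced dataset (~5% minority class)
X, y = make_classification(n_samples=1000, n_features=4,
                           weights=[0.95, 0.05], random_state=42)

# k_neighbors controls how many minority neighbors are used for interpolation;
# it must be smaller than the number of available minority samples
sm = SMOTE(k_neighbors=5, random_state=42)
X_res, y_res = sm.fit_resample(X, y)
print("After SMOTE:", Counter(y_res))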
Can Random Undersampling ever be combined with other approaches?
Yes. Undersampling is frequently combined with oversampling or with cost-sensitive methods. Combining both oversampling and undersampling can sometimes achieve a better balance between removing redundant majority samples and creating new minority samples. It preserves some of the critical patterns in the majority class while increasing the representation of the minority class to an acceptable degree.
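One hedged sketch of such a combination, assuming imbalanced-learn's combine module, is SMOTETomek, which oversamples with SMOTE and then removes Tomek links (borderline majority samples) in a single resampler:
from imblearn.combine import SMOTETomek
from sklearn.datasets import make_classification
from collections import Counter

# Hypothetical imbalanced dataset (~5% minority class)
X, y = make_classification(n_samples=2000, n_features=4,
                           weights=[0.95, 0.05], random_state=0)

# SMOTE oversamples the minority class, then Tomek-link removal cleans
# ambiguous majority points, combining over- and undersampling in one step
smt = SMOTETomek(random_state=0)
X_res, y_res = smt.fit_resample(X, y)
print("After SMOTETomek:", Counter(y_res))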
Is it better to address class imbalance at the data level or use algorithmic methods?
It depends on the problem constraints and the availability of computational resources. Data-level methods like oversampling or undersampling are straightforward and often effective. However, cost-sensitive learning modifies the training objective to make misclassification of the minority class more “expensive.” If the problem domain allows it and the model supports it (e.g., certain tree-based methods or neural networks with weighted loss functions), cost-sensitive approaches can help the model pay more attention to minority samples without having to modify the original data distribution.
Are there cases where neither Random Oversampling nor Random Undersampling is advisable?
If the dataset is extremely small (for the minority class), naive random methods could risk either excessive overfitting (oversampling) or information loss (undersampling). In such cases, more advanced synthetic sample generation (SMOTE or variational autoencoder-based approaches for data augmentation) or algorithmic methods (such as cost-sensitive learning) might be more appropriate. Additionally, if the imbalance is not too severe, one might rely on well-calibrated models combined with proper metrics (like precision-recall curves) rather than drastically altering data distributions.
Below are additional follow-up questions
How do we measure success in imbalanced classification scenarios?
One of the core challenges in imbalanced classification tasks is that accuracy might be misleading. If the minority class constitutes only a small fraction of the data, predicting everything as the majority class could yield deceptively high accuracy. To truly measure success, we often rely on metrics like Precision, Recall (or Sensitivity), Specificity, F1 score, and area under the Precision-Recall curve.
In particular, Recall for the minority class is paramount in domains where missed detections are costly (for example, medical diagnostics). Precision ensures that when the model does predict the minority class, it is correct frequently enough to be practically useful. F1 score balances precision and recall, and the area under the Precision-Recall curve gives an even more nuanced measure of the trade-off across different probability thresholds.
A subtle pitfall is that focusing purely on one metric (e.g., recall) can lead to unacceptably low precision, flooding the pipeline with too many false positives. Thus, the choice of metric typically stems from domain-specific cost considerations (cost of false positives vs. cost of false negatives).
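A minimal sketch of computing these metrics with scikit-learn (the model and synthetic data below are placeholders) might look like:
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score, average_precision_score

X, y = make_classification(n_samples=5000, n_features=6,
                           weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = clf.predict(X_test)
y_score = clf.predict_proba(X_test)[:, 1]  # probability of the minority class

print("Precision:", precision_score(y_test, y_pred))
print("Recall:   ", recall_score(y_test, y_pred))
print("F1:       ", f1_score(y_test, y_pred))
# Average precision summarizes the Precision-Recall curve across thresholds
print("PR AUC (average precision):", average_precision_score(y_test, y_score))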
Could Random Oversampling or Random Undersampling distort the real-world data distribution and hinder interpretability?
When you oversample the minority class or undersample the majority class, you alter the proportions that naturally occur in the real world. This shift may reduce the interpretability of your model or predictions because the model sees a training distribution that no longer reflects the actual scenario where one class is scarce.
In some applications, such as fraud detection, the frequency of fraudulent transactions in the real world is not close to 50%, so oversampling or undersampling to that level might lead to spurious insights about the frequency of suspicious activity. One way to partially mitigate this is to calibrate the model’s output probabilities after training, or to use cost-sensitive learning that modifies the model’s loss function rather than the data distribution itself, thus preserving the original data proportions while still making the model focus on the minority class.
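One hedged way to obtain calibrated probabilities while still resampling during training is to wrap an imbalanced-learn pipeline in scikit-learn's CalibratedClassifierCV: the sampler runs only when each fold's pipeline is fit, so the held-out folds used for calibration keep the original class proportions. A minimal sketch, with synthetic placeholder data:
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import RandomOverSampler
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=5000, n_features=6,
                           weights=[0.95, 0.05], random_state=0)

# The sampler is applied only inside fit, so the calibration folds
# retain the original (imbalanced) distribution
pipe = Pipeline([
    ("ros", RandomOverSampler(random_state=0)),
    ("clf", LogisticRegression(max_iter=1000)),
])
calibrated = CalibratedClassifierCV(pipe, method="sigmoid", cv=3)
calibrated.fit(X, y)
probs = calibrated.predict_proba(X)[:, 1]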
How do we handle multi-class imbalance when there are more than two classes?
Most discussions about random oversampling and undersampling focus on binary classification. However, you can encounter multi-class imbalances where certain classes appear far more frequently than others. In these scenarios, oversampling can be applied to each class whose count falls below a chosen threshold, and undersampling to classes whose count lies above it.
One complication is deciding how to set target class frequencies. In a multi-class scenario, you might not wish to push all classes to the same count, as some classes might be extremely rare. Instead, you might only bring them closer to a moderate level or rely on domain-based weighting. Another subtlety is ensuring that the synthetic data (for methods like SMOTE) makes sense across multiple minority classes. SMOTE can be applied one class at a time, but that can lead to generating samples that overlap with the domain of other classes if those classes are not well-separated in feature space.
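A sketch of such partial, per-class rebalancing with imbalanced-learn (the class labels, target counts, and dataset below are made up for illustration) passes a dict to sampling_strategy:
from imblearn.over_sampling import RandomOverSampler
from sklearn.datasets import make_classification
from collections import Counter

# Hypothetical 3-class dataset with roughly 850 / 100 / 50 samples per class
X, y = make_classification(n_samples=1000, n_features=6, n_informative=4,
                           n_classes=3, n_clusters_per_class=1,
                           weights=[0.85, 0.10, 0.05], random_state=0)
print("Before:", Counter(y))

# Raise only the two rarer classes to a moderate count instead of full balance
ros = RandomOverSampler(sampling_strategy={1: 300, 2: 300}, random_state=0)
X_res, y_res = ros.fit_resample(X, y)
print("After: ", Counter(y_res))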
Should Random Oversampling or Undersampling be applied to the test or validation sets as well?
Data resampling is almost always performed on the training set only. The test (or validation) set should reflect the real distribution you expect at inference time, so that performance metrics computed on it accurately reflect how the model would perform in the real world. If you artificially modify class distribution in the test set, you lose the ability to measure real-world performance metrics properly.
A subtle scenario arises if you use a performance metric that can be significantly affected by the class distribution, like overall accuracy. In that case, adjusting the test set distribution might artificially inflate or deflate performance estimates. For this reason, it’s best to maintain the original distribution in the final evaluation sets to ensure reliability of the results.
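One way to enforce this in practice, sketched here with imbalanced-learn's Pipeline (which applies samplers only during fit), is to let cross-validation handle the splitting so that every validation fold keeps its original distribution:
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import RandomOverSampler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=5000, n_features=6,
                           weights=[0.95, 0.05], random_state=0)

# The sampler runs only on each training fold; validation folds are untouched
pipe = Pipeline([
    ("ros", RandomOverSampler(random_state=0)),
    ("clf", LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipe, X, y, cv=5, scoring="f1")
print("F1 per fold:", scores)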
How can Random Undersampling remove potentially valuable samples, and what can be done to mitigate that?
Random Undersampling discards examples from the majority class, and there is always the risk of removing important or unique samples that could have been crucial to learning the decision boundary. If certain subpopulations within the majority class appear less frequently, random undersampling can eliminate them, leading to a model that generalizes poorly for those subpopulations.
To mitigate this issue, more intelligent undersampling approaches can be used. For instance, one could cluster the majority class and remove points that appear to be redundant or in high-density regions. Alternatively, one can combine undersampling with a method that ensures rare but important majority examples remain. A balanced approach might also involve weighting or cost-sensitive learning, thereby reducing the reliance on physically discarding any samples from the majority class.
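As one hedged example of a more informed alternative, imbalanced-learn's ClusterCentroids summarizes the majority class with k-means centroids rather than dropping points at random (the dataset below is a synthetic placeholder):
from imblearn.under_sampling import ClusterCentroids
from sklearn.datasets import make_classification
from collections import Counter

X, y = make_classification(n_samples=2000, n_features=4,
                           weights=[0.9, 0.1], random_state=0)

# Majority-class samples are replaced by cluster centroids instead of being
# randomly discarded, preserving more of that class's overall structure
cc = ClusterCentroids(random_state=0)
X_res, y_res = cc.fit_resample(X, y)
print("After ClusterCentroids:", Counter(y_res))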
In extremely large datasets, is Random Oversampling for the minority class practical?
If the majority class is huge but the minority class is extremely small, naive random oversampling of the minority class might require duplicating a tiny subset of examples many times. This can lead to severe overfitting, as the same minority examples become heavily repeated, and it inflates the training set without genuinely adding new information.
A more sophisticated approach is to generate synthetic minority samples using a technique like SMOTE, ADASYN, or advanced augmentation. Alternatively, cost-sensitive methods that adjust the loss function might be more practical, because they place heavier penalties on the misclassification of minority samples without requiring the replication of data. Hybrid methods might also be employed: one could do mild undersampling of the majority class to limit computational costs and simultaneously apply a smart oversampling technique.
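A rough sketch of that hybrid idea with imbalanced-learn (the 0.3 and 0.6 ratios are arbitrary placeholders to tune, and the dataset is synthetic) chains a SMOTE step with mild random undersampling:
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification
from collections import Counter

# Hypothetical heavily imbalanced dataset (~1% minority class)
X, y = make_classification(n_samples=10000, n_features=6,
                           weights=[0.99, 0.01], random_state=0)

# SMOTE raises the minority class to 30% of the majority count,
# then mild undersampling trims the majority until the ratio is roughly 0.6
X_over, y_over = SMOTE(sampling_strategy=0.3, random_state=0).fit_resample(X, y)
X_res, y_res = RandomUnderSampler(sampling_strategy=0.6,
                                  random_state=0).fit_resample(X_over, y_over)
print("After hybrid resampling:", Counter(y_res))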
How can we deal with real-time streaming data that contains class imbalance?
In a streaming context, data arrives in continuous batches, and the distribution of classes may shift over time. If you rely on random oversampling or undersampling offline, it becomes tricky when new data keeps coming. Simply duplicating minority samples on the fly might not be meaningful if the data distribution changes (a phenomenon known as concept drift).
One strategy is to use online cost-sensitive updates in your model. Another approach is to detect concept drift and adjust the sampling strategy adaptively. For instance, you might undersample the majority class in more recent batches if the minority portion is too small, or apply a lightweight synthetic oversampling technique in streaming mode. However, implementing these strategies requires careful engineering to ensure that the continuous learning process is stable and consistent with real-world demands.
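A very rough sketch of an online, cost-aware approach uses scikit-learn's SGDClassifier with partial_fit and per-sample weights; the batch generator below is purely illustrative, and the loss name assumes a recent scikit-learn release:
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.datasets import make_classification

# Incremental logistic-regression-style model
clf = SGDClassifier(loss="log_loss", random_state=0)
classes = np.array([0, 1])
running_counts = np.ones(2)  # smoothed running class counts

for batch in range(10):
    # Placeholder for a real stream: each "batch" is freshly generated here
    X_b, y_b = make_classification(n_samples=200, n_features=6,
                                   weights=[0.95, 0.05], random_state=batch)
    running_counts += np.bincount(y_b, minlength=2)

    # Weight each sample inversely to its class's running frequency,
    # so rare-class errors count more in the online updates
    class_weights = running_counts.sum() / (2.0 * running_counts)
    sample_weight = class_weights[y_b]

    clf.partial_fit(X_b, y_b, classes=classes, sample_weight=sample_weight)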
How do we choose the target ratio between classes when applying Random Oversampling or Random Undersampling?
It is not always optimal to aim for a perfectly balanced 1:1 ratio, especially if the original imbalance is extreme or if real-world costs differ for the two classes. Sometimes, you might only partially rebalance the minority class to a point that improves the model’s recall without creating excessive duplication. The choice of ratio often comes from experiments using validation data, domain-specific cost considerations (e.g., the acceptable rate of false positives vs. false negatives), or computational constraints.
One might systematically try different oversampling ratios and measure performance using appropriate metrics like the F1 score or the area under the Precision-Recall curve. Whichever ratio yields the best validation performance can be selected for the final model. Domain expertise also plays a key role, as certain fields have well-established guidelines on the acceptable trade-offs for misclassifications.
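A hedged sketch of such a sweep (the candidate ratios, model, and synthetic data are arbitrary choices to adapt) iterates over sampling_strategy values and compares validation F1:
from imblearn.over_sampling import RandomOverSampler
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

X, y = make_classification(n_samples=5000, n_features=6,
                           weights=[0.95, 0.05], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, random_state=0)

# sampling_strategy is the desired minority/majority ratio after resampling
for ratio in [0.25, 0.5, 0.75, 1.0]:
    X_res, y_res = RandomOverSampler(sampling_strategy=ratio,
                                     random_state=0).fit_resample(X_train, y_train)
    clf = LogisticRegression(max_iter=1000).fit(X_res, y_res)
    print(f"ratio={ratio}: validation F1 = {f1_score(y_val, clf.predict(X_val)):.3f}")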
How does Random Oversampling differ from simply applying class weights in a machine learning algorithm?
Random Oversampling manipulates the training data itself, replicating existing minority samples. Class weighting, on the other hand, modifies the objective function to penalize the misclassification of minority-class examples more severely. In practice, many algorithms (e.g., logistic regression, SVM, random forests, and neural networks) allow specifying a “class weight” parameter that effectively “upsamples” minority examples in the loss function without replicating them in the training dataset.
A key difference is that with oversampling, your model literally sees repeated examples of the minority class, potentially leading to overfitting on those repeats. With class weighting, the model sees the original data distribution but is guided by the training loss to pay more attention to errors on minority samples. Class weighting can sometimes be more memory-friendly and might preserve the distribution of the input space, but oversampling can be easier to implement and visualize, especially for methods that do not inherently support class weighting.
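For comparison, a minimal class-weighting sketch in scikit-learn (the 1:10 weighting below is an arbitrary illustration) leaves the data untouched and adjusts the loss instead:
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=5000, n_features=6,
                           weights=[0.95, 0.05], random_state=0)

# 'balanced' weights each class inversely to its frequency in y
clf_balanced = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

# Explicit weights let you encode domain-specific misclassification costs
clf_custom = LogisticRegression(class_weight={0: 1, 1: 10}, max_iter=1000).fit(X, y)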
Is there ever a situation where it might be best to avoid any sampling methods altogether?
Yes. If the imbalance is not very severe and the model chosen is robust or specifically designed to handle skewed class distributions (for instance, tree-based ensembles that can handle missing or unbalanced data relatively well), random sampling might not be essential. Some models or pipelines that use appropriate metrics (precision-recall, F1, etc.) and hyperparameter tuning might achieve acceptable performance without explicitly oversampling or undersampling.
Additionally, if you have ample domain knowledge indicating that maintaining the natural distribution is crucial to the interpretability or real-world relevance of the model, artificially changing the distribution could backfire. In such scenarios, you might opt for using the true data distribution, combined with careful metric selection (such as focusing on precision and recall) or cost-sensitive training approaches that avoid altering the dataset itself.