ML Case-study Interview Question: Text Augmentation Strategies for Imbalanced Invoice Error Classification
Case-Study question
A technology team faces an imbalanced dataset problem while trying to flag legal invoice line items for errors. The dataset has over 100,000 rows, with most labeled as correct, and very few labeled as erroneous. The organization wants to build a robust text classifier that accurately captures the minority class (erroneous items). They explore multiple data augmentation approaches to improve recall of the minority class without losing overall accuracy. Outline how you would handle this scenario, including detailed strategies for data augmentation, handling imbalance, and ultimately improving the classification metrics.
Detailed solution
Data Imbalance and the Need for Augmentation
Imbalance occurs when most samples belong to one class, leaving the minority class poorly represented during training. The focus is therefore on increasing the number of samples for rare labels, or on transforming the training set so the model learns minority patterns effectively. Oversampling the minority class helps preserve its representation during the training phase.
Splitting Data Without Leakage
Train-test splitting should happen before augmentation to prevent information leakage: augmented variants of a training sentence must never end up in the test set. The minority class in the test set remains untouched, so improvements can be fairly attributed to the augmentation techniques.
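A minimal sketch of this ordering with scikit-learn (variable names such as texts and labels are illustrative): the stratify argument keeps the class ratio consistent across splits, and only the training portion is ever passed to the augmenters.
from sklearn.model_selection import train_test_split

# Split first, stratifying on the label so both splits keep the original class ratio
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42)

# Augmentation is applied to train_texts only; test_texts stays untouched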
Character-Based Augmentations
Character-level methods simulate real-world noise such as typos or digit-letter confusion. One approach randomly replaces letters with neighbors on a QWERTY keyboard. Another replaces visually similar characters (like 0 replaced by O). These preserve the general context but subtly alter token representations.
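Both ideas are available out of the box in nlpaug's character augmenters; a short sketch with default settings:
import nlpaug.augmenter.char as nac

# Simulate typos by replacing characters with QWERTY-keyboard neighbours
keyboard_aug = nac.KeyboardAug()
# Simulate OCR-style confusion between visually similar characters (0 vs O, 1 vs l)
ocr_aug = nac.OcrAug()

noisy_typo = keyboard_aug.augment("Invoice 2021-0043 for legal document review")
noisy_ocr = ocr_aug.augment("Invoice 2021-0043 for legal document review")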
Word-Based Augmentations
Contextual word embeddings from large language models replace words with synonyms (or near-synonyms) that maintain sentence coherence. Back-translation translates a sentence to another language and back, often producing variants that preserve semantic meaning.
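nlpaug wraps both techniques as well; the sketch below pairs a contextual substitution augmenter with a back-translation augmenter (the translation model names are commonly used public checkpoints, and any compatible pair works):
import nlpaug.augmenter.word as naw

# Replace words with context-appropriate alternatives suggested by BERT
substitute_aug = naw.ContextualWordEmbsAug(model_path='bert-base-uncased',
                                           action="substitute")

# Back-translation: English -> German -> English
back_translation_aug = naw.BackTranslationAug(
    from_model_name='facebook/wmt19-en-de',
    to_model_name='facebook/wmt19-de-en')

paraphrase = back_translation_aug.augment("The invoice line item exceeds the agreed hourly rate.")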
Sentence-Level Augmentations
Contextual sentence generation using transformer-based models like GPT-2 can generate paraphrased or alternative sentences. This can produce diversity but sometimes strays from the original context.
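nlpaug exposes this through its sentence augmenters; a brief sketch, assuming a generic GPT-2 checkpoint is acceptable for the domain:
import nlpaug.augmenter.sentence as nas

# Extend or rewrite each sentence with a generative model such as GPT-2
sentence_aug = nas.ContextualWordEmbsForSentenceAug(model_path='gpt2')
generated = sentence_aug.augment("The billed hours for document review appear inflated.")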
Novel Keyword-to-Sentence Augmentation
Extract keywords from existing sentences, then regenerate new sentences around those keywords using a language model. This reintroduces the same concepts but with potentially new phrasing. It retains important terms while altering structure.
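The exact extraction and generation stack is not spelled out here; the sketch below stands in with KeyBERT for keyword extraction and a GPT-2 text-generation pipeline for regeneration, so the helper keyword_to_sentence is illustrative rather than the original implementation.
from keybert import KeyBERT
from transformers import pipeline

kw_model = KeyBERT()
generator = pipeline('text-generation', model='gpt2')

def keyword_to_sentence(text, top_n=3):
    # 1) Pull the most salient terms out of the original sentence
    keywords = [kw for kw, _ in kw_model.extract_keywords(text, top_n=top_n)]
    # 2) Ask a language model to write a fresh sentence around those terms
    prompt = "Write an invoice line item mentioning: " + ", ".join(keywords) + "."
    output = generator(prompt, max_new_tokens=30, num_return_sequences=1)
    return output[0]['generated_text']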
Classifier and Model Agnosticism
Any chosen classifier can use these augmented samples. One approach uses a transformer-based model (for example, a pretrained language model) to fine-tune on original plus augmented data. The same augmentation methods remain valid for other classifiers, providing flexibility.
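As an illustration of that flexibility, the same extended training set could feed a simple TF-IDF plus logistic regression baseline just as easily as a fine-tuned transformer. The sketch below assumes extended_texts and extended_labels are built as in the augmentation snippet later in this section, and test_texts comes from the stratified split shown earlier:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Any classifier can consume the original-plus-augmented training set;
# class_weight='balanced' compensates for whatever imbalance remains
baseline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000, class_weight='balanced'))

baseline.fit(extended_texts, extended_labels)
test_predictions = baseline.predict(test_texts)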
Metrics and Evaluation
Precision, Recall, Weighted Accuracy, and F-Beta scores measure performance. Balancing these metrics is crucial with imbalanced data.
F1 is the F-Beta score with beta equal to 1: the harmonic mean of Precision and Recall. It helps observe how effectively the model captures minority samples (Recall) while limiting false positives (Precision).
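These metrics are straightforward to compute with scikit-learn. The sketch reuses test_labels and test_predictions from the earlier split and baseline sketches, treats label 1 as the erroneous class, and uses balanced_accuracy_score as a stand-in for weighted accuracy; raising beta above 1 weights recall more heavily when missed errors are costlier than false alarms.
from sklearn.metrics import (precision_score, recall_score,
                             fbeta_score, balanced_accuracy_score)

precision = precision_score(test_labels, test_predictions, pos_label=1)
recall = recall_score(test_labels, test_predictions, pos_label=1)
f1 = fbeta_score(test_labels, test_predictions, beta=1.0, pos_label=1)  # harmonic mean
f2 = fbeta_score(test_labels, test_predictions, beta=2.0, pos_label=1)  # favours recall
weighted_acc = balanced_accuracy_score(test_labels, test_predictions)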
Observations on Performance
Character augmentations may not yield substantial gains for complex tasks. Word-level contextual augmentation often improves minority class recall significantly while maintaining acceptable overall accuracy. The keyword-to-sentence approach can be a second-best method, sometimes surpassing character and sentence-only methods. Results vary depending on dataset specifics and text complexity.
Example of Python Code for Augmentation
import nlpaug.augmenter.word as naw
import nlpaug.augmenter.char as nac

train_texts = ["Sample invoice text...", ...]
train_labels = [...]

# Contextual Word Augmenter: inserts words predicted by a pretrained BERT model
context_aug = naw.ContextualWordEmbsAug(model_path='bert-base-uncased',
                                        action="insert")

augmented_texts = []
for text in train_texts:
    result = context_aug.augment(text)
    # Recent nlpaug versions return a list of augmented strings; keep one variant
    augmented_texts.append(result[0] if isinstance(result, list) else result)

# Combine with the original data; labels are duplicated because each augmented
# sentence keeps the label of its source sentence
extended_texts = train_texts + augmented_texts
extended_labels = train_labels + train_labels
This inserts words in each sentence using a pretrained model. Later, these extended sets feed into the classifier.
Possible follow-up questions and detailed answers
How do you ensure the augmented sentences remain relevant and do not add noise?
Large language models sometimes generate text that drifts from the original context. Validating augmented sentences is important. One approach checks whether crucial keywords remain. Another approach trains a simple validation classifier to ensure augmented data aligns with original class distributions. Manual or semi-automated checks on a sample subset also help verify relevance.
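A minimal keyword-retention check, assuming the crucial terms for each source sentence are already known or have been extracted:
def retains_keywords(original_keywords, augmented_text, min_fraction=0.8):
    # Accept an augmented sentence only if most of the crucial terms survive
    text_lower = augmented_text.lower()
    kept = sum(1 for kw in original_keywords if kw.lower() in text_lower)
    return kept / max(len(original_keywords), 1) >= min_fraction

# Example: discard augmentations that dropped domain-critical terms
keep = retains_keywords(["hourly rate", "paralegal"],
                        "Paralegal hours billed above the agreed hourly rate")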
Why not just undersample the majority class?
Undersampling discards valuable data. Minority classes remain the same, so the classifier might still struggle with limited data. Oversampling or augmenting the minority class keeps most original data and leverages extra examples for improved learning. Undersampling can be combined with augmentations, but discarding data risks losing crucial information.
Are there risks of overfitting when using augmented data?
Repeating the same minority examples in slightly altered forms can cause the model to memorize. Mitigation includes using multiple augmentation techniques and applying them randomly. Proper hyperparameter tuning such as adjusting learning rate or regularization methods also helps reduce overfitting.
How do you pick the best augmentation technique?
Empirically test multiple methods. Train models on each augmented set and compare metrics. Choose the approach that optimizes recall of the minority class while retaining high overall accuracy. Business context might value certain metrics (like recall) more than others, so metric choice depends on real-world impact.
Why use a transformer-based classifier?
Transformer-based models capture contextual nuances in text. Fine-tuning them on augmented data often boosts minority recall by leveraging pretrained language understanding. Traditional machine learning algorithms sometimes struggle with subtle text distinctions. Transformers can handle synonyms, paraphrases, and other variations introduced by augmentation.
How to handle future expansions that include numeric or categorical data?
Each data type might need specialized augmentation. Numeric augmentation can add noise or small perturbations within realistic ranges. Categorical features can be swapped out with valid categories. The same pipeline can incorporate additional transformations, ensuring the model sees a variety of realistic examples without losing data integrity.
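A sketch of what such perturbations could look like for tabular fields attached to each line item; the column names and noise ranges are illustrative, not taken from the original system:
import numpy as np

rng = np.random.default_rng(42)

def augment_row(amount, hours, expense_category, valid_categories):
    # Small multiplicative noise keeps numeric fields within realistic bounds
    new_amount = round(amount * rng.uniform(0.95, 1.05), 2)
    new_hours = max(0.1, hours + rng.normal(0, 0.25))
    # Occasionally swap the categorical field for another valid category
    new_category = rng.choice(valid_categories) if rng.random() < 0.1 else expense_category
    return new_amount, new_hours, new_category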
What if the domain has unique jargon not captured by standard pretrained models?
Domain adaptation might be needed. Fine-tuning a domain-specific language model on raw in-house text helps the model learn specialized terms. Vocabulary expansions or custom tokenizers can incorporate domain jargon effectively. Labeling more domain-specific data and augmenting it ensures the model aligns with the specialized context.
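One common way to do this is continued masked-language-model pretraining on unlabelled in-house text; the sketch below uses Hugging Face components and assumes in_house_texts is a list of raw invoice sentences:
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModelForMaskedLM.from_pretrained('bert-base-uncased')

# Unlabelled domain text is enough for this adaptation step
raw = Dataset.from_dict({"text": in_house_texts})
tokenized = raw.map(lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
                    batched=True, remove_columns=["text"])

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
args = TrainingArguments(output_dir="domain-adapted-bert", num_train_epochs=1,
                         per_device_train_batch_size=16)

Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator).train()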
When do you stop augmenting?
It depends on whether minority class recall plateaus or accuracy starts to degrade. A validation set can track performance. Once metrics stabilize or worsen, further augmentation may add noise. Metrics such as cross-validation scores help decide when to stop. Over-augmentation might inflate false positives or degrade overall performance.
How do you justify the added complexity of augmentation to stakeholders?
Demonstrating measurable gains in minority class detection is key. Show how unbalanced models can fail to catch critical errors. Highlight improvements in metrics like F1, recall, or precision for the minority class. Link these to cost savings or risk reduction. Clear evidence of better performance justifies the extra steps.
How does the novel Keyword-to-Sentence approach differ from simpler sentence augmentation?
It preserves essential terms by extracting them first and then regenerating the sentence. This ensures crucial domain-specific words remain, while reordering or rephrasing adds variation. Regular sentence augmentation might generate content that lacks domain-specific keywords, reducing its usefulness. Keyword-based methods keep these key terms intact.
How to choose hyperparameters for the augmentation process?
Control the number of augmentations per sample, maximum word replacements, or the range of translations for back-translation. Tune these hyperparameters by checking a validation set for improvements in minority recall without hurting precision. Adjust augmentation strength to balance diversity and realism in the generated samples.
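In nlpaug these knobs map to constructor arguments such as aug_p (fraction of words touched), aug_min, and aug_max (bounds on replacements per sentence); the values below are illustrative starting points to tune against the validation set:
import nlpaug.augmenter.word as naw

context_aug = naw.ContextualWordEmbsAug(
    model_path='bert-base-uncased',
    action='substitute',
    aug_p=0.15,   # touch roughly 15% of the words in each sentence
    aug_min=1,    # always change at least one word
    aug_max=5)    # never change more than five

# The number of augmented copies generated per minority sample is another tunable knob
n_copies_per_minority_sample = 3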
How could you incorporate real-time augmentation in production?
Real-time augmentation might not be practical because data labeling is time-consuming. Typically, augmentation happens offline. However, limited on-the-fly augmentation can occur if new minority examples arrive. Periodic retraining pipelines can incorporate newly labeled minority samples plus synthetic data. Continuous monitoring ensures balanced performance.
How to handle extremely small minority classes?
Techniques like SMOTE for text, advanced language model generation, or domain-specific synonyms might be more suitable. Collecting more real data remains the strongest approach. Advanced generative methods can help, but thorough validation is essential since a tiny minority set can easily lead to overfitting or synthetic noise.
How to explain model decisions to regulatory bodies or management?
Attribution methods, such as attention-weight visualization or local interpretable model-agnostic explanations, can reveal how the model uses augmented data. Stakeholders can see which words or phrases drive predictions. Documenting how augmentation improved coverage of critical errors adds transparency. Ensuring reproducibility of augmentation steps also builds trust.
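For a text pipeline like the baseline sketched earlier, a LIME-based explanation could look like this; class_names and the predict_proba function are assumptions about that surrounding code:
from lime.lime_text import LimeTextExplainer

explainer = LimeTextExplainer(class_names=['correct', 'erroneous'])

# The classifier function must map a list of raw texts to class probabilities,
# which the scikit-learn pipeline from the earlier sketch already does
explanation = explainer.explain_instance(
    "Block billed 12 hours for document review and travel",
    baseline.predict_proba,
    num_features=8)

print(explanation.as_list())  # (word, weight) pairs that drive the prediction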