Smart data selection beats brute force: LLKD shows less is more in model training.
LLKD trains smaller models on less than 4% of the training data while beating full-data approaches
https://arxiv.org/abs/2411.08028
🎯 Original Problem:
Labeled data is scarce, which makes training smaller models cost-effectively difficult. Using LLMs to generate pseudo-labels for unlabeled data is promising, but it introduces label noise and demands an efficient way to select high-quality samples.
-----
🔧 Solution in this Paper:
→ LLKD introduces an adaptive sample selection method combining teacher confidence and student uncertainty signals.
→ The teacher model (LLaMA) generates pseudo-labels with confidence scores and stays frozen during training.
→ The student model (RoBERTa) learns from selected samples and generates uncertainty estimates.
→ Two dynamic thresholds adapt to both global training status and class-specific learning progress.
→ A weighting scheme prioritizes samples based on combined teacher confidence and student uncertainty (see the sketch below).
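A minimal PyTorch sketch of the selection-and-weighting step (names are illustrative, and the entropy-based uncertainty is an assumption rather than the paper's exact formulation):

```python
import torch
import torch.nn.functional as F

def select_and_weight(teacher_logits, student_logits, conf_threshold, unc_threshold):
    # Teacher confidence: max softmax probability of the pseudo-label.
    teacher_probs = F.softmax(teacher_logits, dim=-1)
    confidence, pseudo_labels = teacher_probs.max(dim=-1)

    # Student uncertainty: normalized entropy of the student's predictions.
    student_probs = F.softmax(student_logits, dim=-1)
    entropy = -(student_probs * student_probs.clamp_min(1e-12).log()).sum(dim=-1)
    uncertainty = entropy / torch.log(torch.tensor(float(student_probs.size(-1))))

    # Keep samples the teacher labels reliably AND the student still finds hard.
    mask = (confidence >= conf_threshold) & (uncertainty >= unc_threshold)

    # Weight selected samples by the combined signal (higher = more informative).
    weights = confidence * uncertainty * mask.float()
    return pseudo_labels, weights
```

The weights can then scale the per-sample cross-entropy loss, so the student focuses on confidently labeled but still-unlearned examples.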
-----
💡 Key Insights:
→ Higher teacher confidence correlates with better pseudo-label quality
→ Higher student uncertainty indicates challenging samples needing more learning
→ Combining both signals helps select optimal training samples
→ Dynamic thresholds outperform fixed selection ratios
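One way such dynamic thresholds could work, sketched with an exponential moving average (the EMA update rule is an illustrative assumption, not necessarily LLKD's exact schedule):

```python
import torch

class DynamicThreshold:
    """EMA-based threshold tracking global training status and
    per-class learning progress (illustrative sketch)."""

    def __init__(self, num_classes, momentum=0.99, init=0.5):
        self.momentum = momentum
        self.global_t = init
        self.class_t = torch.full((num_classes,), init)

    def update(self, scores, labels):
        m = self.momentum
        # Global threshold follows the batch-mean score.
        self.global_t = m * self.global_t + (1 - m) * scores.mean().item()
        # Class thresholds follow the mean score of each class seen in the batch.
        for c in labels.unique():
            self.class_t[c] = m * self.class_t[c] + (1 - m) * scores[labels == c].mean()

    def threshold(self, labels):
        # Per-sample threshold: the stricter of the global and class terms.
        class_part = self.class_t[labels]
        global_part = torch.full_like(class_part, self.global_t)
        return torch.maximum(global_part, class_part)
```

As training progresses and scores rise, the thresholds tighten automatically, which is what lets dynamic selection beat a fixed keep-ratio.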
-----
📊 Results:
→ Achieves a 6.25% F1 improvement on the PubMed-RCT-20k dataset
→ Uses only 3.7% of the training data while maintaining superior performance
→ Consistently outperforms baseline methods across five different datasets
→ Shows robustness across different teacher models (LLaMA and Gemma)