
"Learning with Less: Knowledge Distillation from Large Language Models via Unlabeled Data"

The podcast on this paper is generated with Google's Illuminate.

Smart data selection beats brute force: LLKD shows less is more in model training.

LLKD trains smaller models on under 4% of the training data while beating full-data approaches.

https://arxiv.org/abs/2411.08028

🎯 Original Problem:

Labeled data is scarce, which makes it hard to train smaller models cost-effectively. Using LLMs to generate pseudo-labels for unlabeled data is a promising workaround, but it introduces noisy labels and demands an efficient way to select high-quality samples.

-----

🔧 Solution in this Paper:

→ LLKD introduces an adaptive sample selection method combining teacher confidence and student uncertainty signals.

→ The teacher model (LLaMA) generates pseudo-labels with confidence scores while remaining fixed during training.

→ The student model (RoBERTa) learns from selected samples and generates uncertainty estimates.

→ Two dynamic thresholds adapt to both global training status and class-specific learning progress.

→ A weighting scheme prioritizes samples based on combined teacher confidence and student uncertainty (see the sketch after this list).

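To make the mechanics concrete, here is a minimal PyTorch sketch of this selection loop. It is not the authors' implementation: the entropy-based uncertainty and the EMA threshold updates are illustrative assumptions standing in for the paper's exact formulas, and `AdaptiveThresholds` is a hypothetical helper name.

```python
import torch
import torch.nn.functional as F

def student_uncertainty(logits: torch.Tensor) -> torch.Tensor:
    """Predictive entropy of the student; higher means a harder sample."""
    probs = F.softmax(logits, dim=-1)
    return -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)

class AdaptiveThresholds:
    """Tracks a global and a per-class confidence threshold.

    The EMA update below is an illustrative assumption; the paper's
    exact adaptation rule may differ.
    """
    def __init__(self, num_classes: int, momentum: float = 0.9):
        self.momentum = momentum
        self.global_thr = torch.tensor(0.0)
        self.class_thr = torch.zeros(num_classes)

    def update(self, teacher_conf: torch.Tensor, pseudo_labels: torch.Tensor):
        m = self.momentum
        # The global threshold tracks the overall training status...
        self.global_thr = m * self.global_thr + (1 - m) * teacher_conf.mean()
        # ...while per-class thresholds track class-specific progress.
        for c in pseudo_labels.unique():
            mask = pseudo_labels == c
            self.class_thr[c] = m * self.class_thr[c] + (1 - m) * teacher_conf[mask].mean()

    def select(self, teacher_conf: torch.Tensor, pseudo_labels: torch.Tensor) -> torch.Tensor:
        # Keep samples whose teacher confidence clears both thresholds.
        return (teacher_conf >= self.global_thr) & (teacher_conf >= self.class_thr[pseudo_labels])
```

In a training loop, `update` would run on each batch of LLM pseudo-labels before `select` decides which samples reach the student.
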
-----

💡 Key Insights:

→ Higher teacher confidence correlates with better pseudo-label quality

→ Higher student uncertainty indicates challenging samples needing more learning

→ Combining both signals helps select the most informative training samples (see the weighting sketch after this list)

→ Dynamic thresholds outperform fixed selection ratios

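One way to picture how the two signals combine is a weighted cross-entropy that multiplies them per sample. The product form and the normalization below are assumptions for illustration, not the paper's exact weighting function.

```python
import torch
import torch.nn.functional as F

def weighted_distillation_loss(student_logits: torch.Tensor,
                               pseudo_labels: torch.Tensor,
                               teacher_conf: torch.Tensor,
                               keep_mask: torch.Tensor) -> torch.Tensor:
    """Cross-entropy on selected pseudo-labeled samples, weighted so that
    confident-teacher / uncertain-student samples contribute most."""
    # Student entropy as the uncertainty signal (detached: weights are
    # treated as constants, not differentiated through).
    probs = F.softmax(student_logits, dim=-1)
    unc = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1).detach()
    # Combine both signals, zero out filtered samples, normalize.
    w = teacher_conf * unc * keep_mask
    w = w / w.sum().clamp_min(1e-12)
    per_sample = F.cross_entropy(student_logits, pseudo_labels, reduction="none")
    return (w * per_sample).sum()
```
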
-----

📊 Results:

→ Achieves a 6.25% F1 improvement on the PubMed-RCT-20k dataset

→ Uses only 3.7% of the training data while maintaining superior performance

→ Consistently outperforms baseline methods across five datasets

→ Shows robustness across different teacher models (LLaMA and Gemma)
