"Clear Minds Think Alike: What Makes LLM Fine-tuning Robust? A Study of Token Perplexity"

The podcast below is generated with Google's Illuminate.

Self-generated training data is not magic—its power lies in reducing token-level unpredictability.

Fine-tuning with LLM-generated data works by eliminating high-perplexity tokens—Selective Token Masking achieves the same effect.

---

Paper - https://arxiv.org/abs/2501.14315

Original Problem:

→ Fine-tuning LLMs often degrades their generalization to out-of-domain tasks.

→ Smaller LLMs face performance saturation, limiting their ability to improve via traditional fine-tuning.

→ The impact of using LLM-generated data for fine-tuning on cross-domain robustness is unclear.

---

Solution in this Paper 🛠️:

→ The paper identifies high-perplexity tokens in training data as a key factor in out-of-domain degradation.

→ It demonstrates that fine-tuning on LLM-generated data amounts to training on lower-perplexity tokens, which enhances robustness across domains.

→ The authors propose Selective Token Masking (STM), a method that masks high-perplexity tokens in ground-truth data, achieving generalization benefits similar to those of LLM-generated training data (a minimal sketch follows after this list).

→ Experiments across multiple architectures (Gemma2-2B, Mistral-7B, Llama3-8B) confirm that STM maintains in-domain accuracy while improving out-of-domain performance.
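
As a rough illustration of STM, the sketch below scores each ground-truth token with the base model and drops high-perplexity tokens from the fine-tuning loss by setting their labels to -100 (the standard ignore index). The model name, the perplexity threshold, and the helper `stm_labels` are assumptions for illustration, not the paper's exact recipe.

```python
# Minimal sketch of Selective Token Masking (STM), assuming a HuggingFace-style
# causal LM. Threshold and masking granularity are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "google/gemma-2-2b"  # one of the architectures studied in the paper
PPL_THRESHOLD = 5.0               # assumed cutoff; tune on held-out data

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

def stm_labels(text: str) -> dict:
    """Return input_ids and labels with high-perplexity tokens masked out (-100)."""
    input_ids = tokenizer(text, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        logits = model(input_ids).logits
    # Per-token negative log-likelihood of each token given its prefix.
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:]
    nll = torch.nn.functional.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        reduction="none",
    ).view_as(shift_labels)
    token_ppl = nll.exp()  # per-token perplexity under the base model
    labels = input_ids.clone()
    labels[:, 1:][token_ppl > PPL_THRESHOLD] = -100  # exclude high-perplexity tokens from the loss
    labels[:, 0] = -100                              # first token has no prefix to score
    return {"input_ids": input_ids, "labels": labels}
```

In a standard Trainer-style fine-tuning loop, these masked labels mean the gradient simply skips the surprising tokens while keeping the rest of the ground-truth answer intact.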

---

Key Insights from this Paper 🔍:

→ LLM-generated data contains fewer high-perplexity tokens, leading to more stable fine-tuning (see the measurement sketch after this list).

→ Simply masking high-perplexity tokens in ground-truth data mimics the benefits of LLM-generated training.

→ STM is computationally efficient and eliminates the need for self-generated training data while preserving generalization.
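
A quick way to probe this claim is to compare per-token perplexity under the base model for a ground-truth answer versus a self-generated one. The model name and the two example answers below are placeholders; only the perplexity computation is the point.

```python
# Sketch for inspecting per-token perplexity under the base model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
lm = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1").eval()

def per_token_perplexity(text: str) -> torch.Tensor:
    ids = tok(text, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        logits = lm(ids).logits[:, :-1, :]
    nll = torch.nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), ids[:, 1:].reshape(-1), reduction="none"
    )
    return nll.exp()  # one perplexity value per predicted token

ground_truth = "Natalia sold 48 clips in April and half as many in May: 48 + 24 = 72 clips."
self_output = "She sold 48 clips, then 48 / 2 = 24 clips, for a total of 72 clips."  # in practice, sampled from the model itself

for name, text in [("ground truth", ground_truth), ("self-output", self_output)]:
    ppl = per_token_perplexity(text)
    print(f"{name}: mean ppl {ppl.mean():.2f}, tokens with ppl > 5: {(ppl > 5).sum().item()}")
```

If the paper's observation holds, the self-generated answer should show both a lower mean perplexity and far fewer tokens above the threshold.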

---

Results 📊:

→ Selective Token Masking (STM) outperforms ground-truth fine-tuning in out-of-domain tasks, achieving 75.3% accuracy on ARC-Challenge vs. 29.5% for ground truth.

→ On GSM8K, STM achieves 55.4% accuracy vs. 19.0% for ground truth in Gemma2-2B.

→ The Self-Output method achieves the lowest training perplexity (1.16), compared to 4.83 for Ground Truth, validating the low-perplexity hypothesis.
