Self-generated training data is not magic: its power lies in reducing token-level unpredictability.
Fine-tuning with LLM-generated data works because such data contains fewer high-perplexity tokens, and Selective Token Masking achieves the same effect directly on ground-truth data.
---
Paper - https://arxiv.org/abs/2501.14315
Original Problem:
→ Fine-tuning LLMs often degrades their generalization to out-of-domain tasks.
→ Smaller LLMs face performance saturation, limiting their ability to improve via traditional fine-tuning.
→ It is unclear how fine-tuning on LLM-generated data affects cross-domain robustness.
---
Solution in this Paper 🛠️:
→ The paper identifies high-perplexity tokens in training data as a key factor in out-of-domain degradation.
→ It demonstrates that fine-tuning on LLM-generated data lowers training perplexity, enhancing robustness across domains.
→ The authors propose Selective Token Masking (STM), which masks high-perplexity tokens in ground-truth data and achieves generalization benefits similar to LLM-generated training data (a code sketch follows this list).
→ Experiments across multiple architectures (Gemma2-2B, Mistral-7B, Llama3-8B) confirm that STM maintains in-domain accuracy while improving out-of-domain performance.
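To make the mechanism concrete, here is a minimal PyTorch sketch of the STM idea, assuming a HuggingFace-style causal LM. The `threshold` value and function names are illustrative assumptions, not the paper's exact settings.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def token_surprisals(model, input_ids):
    """Per-token negative log-likelihood under a frozen reference model."""
    logits = model(input_ids).logits[:, :-1]   # predictions for tokens 1..T
    targets = input_ids[:, 1:]
    # cross_entropy over (batch, vocab, seq) vs. (batch, seq) -> (batch, seq)
    return F.cross_entropy(logits.transpose(1, 2), targets, reduction="none")

def stm_loss(model, ref_model, input_ids, threshold=2.0):
    """Fine-tuning loss that skips high-perplexity tokens.

    Tokens whose surprisal under the frozen reference model exceeds
    `threshold` (an illustrative value) are excluded from the loss,
    mimicking the low-perplexity profile of LLM-generated data.
    """
    surprisal = token_surprisals(ref_model, input_ids)
    targets = input_ids[:, 1:].clone()
    targets[surprisal > threshold] = -100      # -100 is ignored by cross_entropy
    logits = model(input_ids).logits[:, :-1]
    return F.cross_entropy(logits.transpose(1, 2), targets, ignore_index=-100)
```

In this sketch, masking decisions come from a frozen reference model, so the loss never pushes the fine-tuned model toward tokens the base model found highly surprising.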
---
Key Insights from this Paper 🔍:
→ LLM-generated data contains fewer high-perplexity tokens, leading to more stable fine-tuning.
→ Simply masking high-perplexity tokens in ground-truth data mimics the benefits of LLM-generated training.
→ STM is computationally efficient and eliminates the need for self-generated training data while preserving generalization.
---
Results 📊:
→ Selective Token Masking (STM) outperforms ground-truth fine-tuning in out-of-domain tasks, achieving 75.3% accuracy on ARC-Challenge vs. 29.5% for ground truth.
→ On GSM8K, STM achieves 55.4% accuracy vs. 19.0% for ground truth with Gemma2-2B.
→ The Self-Output method achieves the lowest training-data perplexity (1.16) vs. 4.83 for Ground Truth, validating the low-perplexity hypothesis.
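For reference, perplexity figures like these are typically computed as the exponential of the mean per-token negative log-likelihood over the response tokens. A minimal sketch, assuming a HuggingFace-style model and tokenizer (the prompt/response split below is an illustrative assumption):

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def response_perplexity(model, tokenizer, prompt, response):
    """Perplexity of the response tokens, conditioned on the prompt."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids
    logits = model(full_ids).logits[:, :-1]
    targets = full_ids[:, 1:]
    nll = F.cross_entropy(logits.transpose(1, 2), targets, reduction="none")
    # average only over response tokens (positions after the prompt)
    return math.exp(nll[0, prompt_len - 1:].mean().item())
```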