Edit tokens smartly, keep your LLM sane while training on synthetic data.
This paper proposes token-level editing to prevent model collapse when training on synthetic data, ensuring better performance without increasing data size.
-----
https://arxiv.org/abs/2412.14689
🤔 Original Problem:
Future LLMs will inevitably train on mixed synthetic and human data, but synthetic data causes model collapse - degrading performance as synthetic data proportion increases.
-----
🔧 Solution in this Paper:
→ Introduces Token-level Editing (ToEdit), which selectively resamples tokens the model is over-confident about (probability ≥ 0.99)
→ Uses a trained language model as prior distribution to estimate token probabilities
→ Maintains source distribution while improving data quality through targeted token replacement
→ Requires only a single forward pass, making it computationally efficient
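A minimal sketch of the idea, under stated assumptions: the per-token probabilities stand in for one forward pass of a prior language model, and `resample_fn` is a hypothetical stand-in for sampling a replacement token (the paper's actual models and sampling details are not reproduced here):

```python
import random

def toedit(tokens, token_probs, resample_fn, threshold=0.99, rng=None):
    """Sketch of token-level editing: tokens the prior model predicts
    with probability >= threshold are treated as over-confident and
    resampled; all other tokens are kept, preserving the source
    distribution."""
    rng = rng or random.Random(0)
    edited = []
    for tok, p in zip(tokens, token_probs):
        if p >= threshold:
            edited.append(resample_fn(tok, rng))  # replace over-confident token
        else:
            edited.append(tok)                    # keep original token
    return edited

# Toy demo: hand-written probabilities instead of a real prior LM.
tokens = ["the", "cat", "sat", "on", "the", "mat"]
probs = [0.995, 0.40, 0.30, 0.999, 0.60, 0.20]
vocab = ["a", "one", "this"]
resample = lambda tok, rng: rng.choice(vocab)
print(toedit(tokens, probs, resample))
```

Only the two tokens above the 0.99 threshold get replaced; the rest of the sequence passes through untouched, which is why the edited data stays close to the original distribution.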
-----
💡 Key Insights:
→ Non-iterative model collapse occurs when mixing synthetic and human data
→ Synthetic data suffers from coverage collapse - it covers only a small portion of the human data distribution
→ Maintaining distribution coverage is crucial for preventing model collapse
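One illustrative way to see coverage collapse (this is a toy proxy, not the paper's exact metric): measure what fraction of distinct human-text bigrams also appear in the synthetic text.

```python
def bigram_coverage(human_texts, synthetic_texts):
    """Toy coverage proxy: fraction of distinct human bigrams
    that also occur anywhere in the synthetic corpus."""
    def bigrams(texts):
        grams = set()
        for t in texts:
            toks = t.split()
            grams.update(zip(toks, toks[1:]))
        return grams

    human = bigrams(human_texts)
    synth = bigrams(synthetic_texts)
    return len(human & synth) / max(len(human), 1)

# Tiny made-up corpora: the synthetic text reuses a few phrases,
# so it covers only part of the human bigram set.
human = ["the cat sat on the mat", "a dog ran in the park"]
synthetic = ["the cat sat quietly", "the cat sat on a chair"]
print(round(bigram_coverage(human, synthetic), 3))  # → 0.3
```

A low ratio means the synthetic corpus concentrates on a narrow slice of the human distribution, which is exactly the failure mode that drives collapse when such data dominates training.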
-----
📊 Results:
→ In the biomedicine domain: OLMo-1B improved from 36.63 to 40.89
→ LLaMA-3-8B improved from 54.13 to 56.48
→ Consistent improvements across pre-training, continual pre-training, and fine-tuning tasks
------
Are you into AI and LLMs❓ Join my daily AI newsletter. I will send you 7 emails a week analyzing the highest signal AI developments. ↓↓
🎉 https://rohanpaul.substack.com/