
"How to Synthesize Text Data without Model Collapse?"

Generated the podcast below on this paper with Google's Illuminate.

Edit tokens smartly, keep your LLM sane while training on synthetic data.

This paper proposes token-level editing to prevent model collapse when training on synthetic data, improving performance without increasing data size.

-----

https://arxiv.org/abs/2412.14689

🤔 Original Problem:

Future LLMs will inevitably train on a mix of synthetic and human data, but synthetic data causes model collapse: performance degrades as the proportion of synthetic data increases.

-----

🔧 Solution in this Paper:

→ Introduces Token-level Editing (ToEdit), which selectively replaces tokens where the model's confidence is high (probability >= 0.99)

→ Uses a trained language model as a prior distribution to estimate token probabilities

→ Maintains source distribution while improving data quality through targeted token replacement

→ Requires only a single forward pass, making it computationally efficient
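The steps above can be sketched in a few lines of Python. This is a hypothetical toy illustration, not the authors' code: the `PRIOR` lookup table and `RESAMPLE_POOL` stand in for the trained prior language model's per-token probabilities and its alternative-token proposals, which in practice would come from a single forward pass over the sequence.

```python
import random

# Toy stand-in for the prior LM's p(token | context); in the real method
# these probabilities come from one forward pass of a trained model.
PRIOR = {
    "the": 0.995, "cat": 0.40, "sat": 0.85, "on": 0.992, "mat": 0.60,
}
RESAMPLE_POOL = ["a", "this", "that", "one"]  # hypothetical replacement candidates

def toedit(tokens, threshold=0.99, rng=random):
    """Sketch of token-level editing: replace only the tokens the prior
    model is highly confident about (p >= threshold), keeping the rest
    untouched so the source distribution is largely preserved."""
    edited = []
    for tok in tokens:
        p = PRIOR.get(tok, 0.0)
        if p >= threshold:
            # High-confidence token: resample an alternative,
            # nudging the data away from the synthetic mode.
            edited.append(rng.choice(RESAMPLE_POOL))
        else:
            # Low-confidence token: keep it, preserving coverage
            # of the original human data distribution.
            edited.append(tok)
    return edited

print(toedit(["the", "cat", "sat", "on", "mat"]))
```

Because the edit touches only a small, targeted fraction of tokens, the dataset size stays the same while the overconfident (collapse-prone) tokens get diversified.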

-----

💡 Key Insights:

→ Non-iterative model collapse occurs when mixing synthetic and human data

→ Synthetic data suffers from coverage collapse, covering only a small portion of the human data distribution

→ Maintaining distribution coverage is crucial for preventing model collapse

-----

📊 Results:

→ In biomedicine domain: OLMo-1B improved from 36.63 to 40.89

→ LLaMA-3-8B improved from 54.13 to 56.48

→ Consistent improvements across pre-training, continual pre-training, and fine-tuning tasks

------

Are you into AI and LLMs❓ Join my daily AI newsletter. I will send you 7 emails a week analyzing the highest signal AI developments. ↓↓

🎉 https://rohanpaul.substack.com/
