Prepending source URLs to pre-training documents helps LLMs learn faster and better.
MeCo prepends metadata such as source URLs to documents during the bulk of pre-training, then finishes with a short cooldown phase on plain data, matching baseline performance with 33% less training data.
-----
https://arxiv.org/abs/2501.01956
🤔 Original Problem:
→ LLMs process all web content equally, ignoring crucial source context that helps humans understand content quality and intent
→ This makes it hard for models to learn appropriate behaviors for different content types (like distinguishing between factual articles and memes)
-----
🛠️ Solution in this Paper:
→ MeCo (Metadata Conditioning then Cooldown) adds source URLs before each training document
→ First 90% of training uses metadata-augmented data like "URL: en.wikipedia.org \n\n [document]"
→ Final 10% uses standard data without metadata as cooldown phase
→ Loss is calculated only on document tokens, not metadata tokens
→ Cross-document attention is disabled, which also makes training about 25% faster (a sketch of the data preparation follows this list)
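Below is a minimal sketch of how the metadata-conditioning step described above could be implemented. This is not the authors' code: the function name `make_meco_example`, the toy schedule, and the use of a Hugging Face-style tokenizer (gpt2 here only as a stand-in) are illustrative assumptions.

```python
from transformers import AutoTokenizer

def make_meco_example(url: str, document: str, tokenizer, use_metadata: bool = True):
    """Build input_ids and labels for one training document.

    For roughly the first 90% of training, use_metadata=True prepends the source URL.
    For the final ~10% cooldown, use_metadata=False yields plain documents.
    """
    if use_metadata:
        metadata = f"URL: {url}\n\n"
        meta_ids = tokenizer(metadata, add_special_tokens=False)["input_ids"]
    else:
        meta_ids = []

    doc_ids = tokenizer(document, add_special_tokens=False)["input_ids"]
    input_ids = meta_ids + doc_ids

    # Loss is computed only on document tokens: mask metadata positions with -100,
    # the ignore index used by cross-entropy loss in most training frameworks.
    labels = [-100] * len(meta_ids) + doc_ids
    return {"input_ids": input_ids, "labels": labels}

# Example usage with a toy 100-step schedule (last 10% is the cooldown phase).
tokenizer = AutoTokenizer.from_pretrained("gpt2")
total_steps, current_step = 100, 5
example = make_meco_example(
    url="en.wikipedia.org",
    document="The printing press was invented by Johannes Gutenberg.",
    tokenizer=tokenizer,
    use_metadata=current_step < int(0.9 * total_steps),  # no metadata in the final 10%
)
```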
-----
💡 Key Insights:
→ Metadata grouping is what matters: hashed URLs work as well as real ones (see the snippet after this list)
→ Model-generated topics can replace URLs as metadata
→ 10-20% cooldown length is optimal
→ Works better with billion-parameter models
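One way to read the hashed-URL finding: the model only needs a consistent per-source identifier, not a human-readable string. A hypothetical illustration (the helper `hashed_metadata` is not from the paper):

```python
import hashlib

def hashed_metadata(url: str) -> str:
    # Replace the readable URL with an opaque but consistent per-source identifier.
    digest = hashlib.sha256(url.encode()).hexdigest()[:16]
    return f"URL: {digest}\n\n"

print(hashed_metadata("en.wikipedia.org"))  # same hash every time for the same source
```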
-----
📊 Results:
→ Matches baseline performance using 33% less training data
→ Consistent gains across model scales (600M to 8B parameters)
→ Conditioning generation on a wikipedia.org URL at inference reduces toxic generations several-fold
→ 6% improvement on zero-shot commonsense tasks when conditioning on the factquizmaster.com URL (an inference-time sketch follows this list)
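The inference-time results come from prepending a URL in the same format the model saw during training. A rough sketch, assuming a Hugging Face causal LM; the checkpoint path is a placeholder, not a released model:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder path; substitute a MeCo-trained checkpoint.
tokenizer = AutoTokenizer.from_pretrained("path/to/meco-checkpoint")
model = AutoModelForCausalLM.from_pretrained("path/to/meco-checkpoint")

# Steer generation by reusing the training-time metadata format.
prompt = "URL: en.wikipedia.org\n\n" + "The history of the printing press"
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```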
------
Are you into AI and LLMs❓ Join my daily AI newsletter. I will send you 7 emails a week analyzing the highest-signal AI developments. ↓↓
🎉 https://rohanpaul.substack.com/