"Metadata Conditioning Accelerates Language Model Pre-training"

Generated the podcast below on this paper with Google's Illuminate.

Prepending source URLs (website addresses) to documents during training helps LLMs learn faster and better.

MeCo prepends metadata such as source URLs to training documents during most of pre-training, then drops the metadata in a short cooldown phase, achieving the same performance with 33% less data.

-----

https://arxiv.org/abs/2501.01956

🤔 Original Problem:

→ LLMs process all web content equally, ignoring crucial source context that helps humans understand content quality and intent

→ This makes it hard for models to learn appropriate behaviors for different content types (like distinguishing between factual articles and memes)

-----

🛠️ Solution in this Paper:

→ MeCo (Metadata Conditioning then Cooldown) adds source URLs before each training document

→ First 90% of training uses metadata-augmented data like "URL: en.wikipedia.org\n\n[document]"

→ Final 10% uses standard data without metadata as cooldown phase

→ Loss is calculated only on document tokens, not metadata tokens (see the sketch after this list)

→ Cross-document attention is disabled for 25% faster training
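
Here is a minimal sketch of this setup, assuming a Hugging Face tokenizer and the common convention that a label of -100 is ignored by the cross-entropy loss. The function name and exact data pipeline are illustrative; the paper's implementation may differ.

```python
# Minimal sketch of MeCo-style example construction (illustrative, not the
# paper's code). Assumes labels set to -100 are masked out of the loss.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any causal-LM tokenizer

def build_meco_example(url: str, document: str, in_cooldown: bool):
    """Return (input_ids, labels) for one training document."""
    if in_cooldown:
        # Final ~10% of training: plain documents, no metadata prefix.
        ids = tokenizer(document).input_ids
        return ids, list(ids)

    # First ~90% of training: prepend the source URL as metadata.
    prefix_ids = tokenizer(f"URL: {url}\n\n").input_ids
    doc_ids = tokenizer(document).input_ids
    input_ids = prefix_ids + doc_ids

    # Loss is computed only on document tokens; metadata tokens are masked out.
    labels = [-100] * len(prefix_ids) + list(doc_ids)
    return input_ids, labels

ids, labels = build_meco_example(
    url="en.wikipedia.org",
    document="Alan Turing was a British mathematician ...",
    in_cooldown=False,
)
```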

-----

💡 Key Insights:

→ Metadata grouping is what matters - hashed URLs work as well as real ones (see the sketch after this list)

→ Model-generated topics can replace URLs as metadata

→ A cooldown spanning the final 10-20% of training is optimal

→ Works better with billion-parameter models
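
A minimal sketch of the hashed-URL variant mentioned above: since it is the grouping induced by the metadata that matters rather than the literal URL text, an opaque hash can stand in for the URL. The specific hashing scheme here (truncated SHA-256) is an illustrative assumption, not necessarily what the paper used.

```python
# Illustrative hashed-URL metadata: same prefix format, real URL replaced
# by an opaque hash so only the grouping information remains.
import hashlib

def hashed_url_prefix(url: str) -> str:
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    return f"URL: {digest}\n\n"

print(hashed_url_prefix("en.wikipedia.org"))
# prints "URL: <16 hex chars>" followed by a blank line
```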

-----

📊 Results:

→ Matches baseline performance using 33% less training data

→ Consistent gains across model scales (600M to 8B parameters)

→ Conditioning on a wikipedia.org URL at inference time reduces toxic generations several-fold (see the sketch after this list)

→ 6% improvement on zero-shot commonsense tasks when conditioning on a factquizmaster.com URL
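
A minimal sketch of inference-time URL conditioning, assuming a MeCo-trained causal LM loadable with Hugging Face transformers; "path/to/meco-model" is a placeholder, not a released checkpoint, and the prompt format is assumed to mirror the training prefix.

```python
# Illustrative inference-time conditioning: prepend a URL prefix that
# mirrors the training format to steer the model's generation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/meco-model")
model = AutoModelForCausalLM.from_pretrained("path/to/meco-model")

prompt = "Tell me about the history of the Internet."
conditioned = f"URL: en.wikipedia.org\n\n{prompt}"

inputs = tokenizer(conditioned, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```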

------

Are you into AI and LLMs❓ Join my daily AI newsletter. I will send you 7 emails a week analyzing the highest-signal AI developments. ↓↓

🎉 https://rohanpaul.substack.com/
