
"How to Train Long-Context Language Models (Effectively)"

Generated this podcast with Google's Illuminate.

Careful data engineering enables a small 8B-parameter LLM to process documents of up to 512K tokens.

📚 https://arxiv.org/pdf/2410.02660

Solution in this Paper 🛠️:

• Establishes a robust evaluation protocol using diverse long-context tasks

• Optimizes the data mix: 60% long data (30% code repositories, 30% books) and 40% high-quality short data (see the data-mix and schedule sketch after this list)

• Scales training to 40B tokens: 20B at 64K length, then 20B at 512K length

• Uses an increased RoPE frequency base and disables cross-document attention (see the attention-mask sketch after this list)

• Performs supervised fine-tuning with short-context instruction data (UltraChat), as sketched after this list
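
A minimal Python sketch of the data mix and two-stage training schedule from the bullets above. The proportions and token budgets come from this post; the names DATA_MIX, STAGES, and sample_source are illustrative, not the paper's actual training code.

```python
# Illustrative sketch (not the paper's code) of the ProLong-style data mix
# and two-stage token budget described in the bullets above.
import random

# 60% long data, split evenly between code repositories and books,
# plus 40% high-quality short data to preserve general capabilities.
DATA_MIX = {
    "code_repos_long": 0.30,
    "books_long": 0.30,
    "short_high_quality": 0.40,
}

# Two continued-training stages: 20B tokens at 64K context,
# then 20B tokens at 512K context (40B tokens total).
STAGES = [
    {"max_seq_len": 64 * 1024, "token_budget": 20_000_000_000},
    {"max_seq_len": 512 * 1024, "token_budget": 20_000_000_000},
]

def sample_source(mix):
    """Pick a data source according to the mixing weights."""
    sources, weights = zip(*mix.items())
    return random.choices(sources, weights=weights, k=1)[0]

if __name__ == "__main__":
    counts = {name: 0 for name in DATA_MIX}
    for _ in range(10_000):
        counts[sample_source(DATA_MIX)] += 1
    print(counts)  # roughly 3000 / 3000 / 4000 draws
```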
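
A sketch, under stated assumptions, of the other two training details: raising the RoPE frequency base in a Llama-style config and restricting attention to tokens within the same packed document. The rope_theta value is a placeholder rather than the paper's exact setting, and block_diagonal_mask is an illustrative helper, not the paper's implementation.

```python
import torch
from transformers import LlamaConfig

# Placeholder value: a larger RoPE base spreads the rotary frequencies so that
# positions far beyond the original training length remain distinguishable.
config = LlamaConfig(rope_theta=8_000_000)

def block_diagonal_mask(doc_ids: torch.Tensor) -> torch.Tensor:
    """Boolean (seq_len, seq_len) mask, True only where attention is allowed:
    causal attention restricted to tokens from the same packed document."""
    same_doc = doc_ids.unsqueeze(0) == doc_ids.unsqueeze(1)
    causal = torch.tril(torch.ones(len(doc_ids), len(doc_ids), dtype=torch.bool))
    return same_doc & causal

# Example: three documents of lengths 4, 3, and 5 packed into one 12-token sequence.
doc_ids = torch.tensor([0] * 4 + [1] * 3 + [2] * 5)
print(block_diagonal_mask(doc_ids).int())
```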
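
And a sketch of preparing short-context UltraChat instruction data for supervised fine-tuning, assuming the commonly used HuggingFaceH4/ultrachat_200k release on the Hugging Face Hub and a tokenizer that ships a chat template (zephyr-7b-beta is used here only as an ungated example); the paper's own SFT setup may differ.

```python
# Illustrative SFT data preparation, not the paper's pipeline.
from datasets import load_dataset
from transformers import AutoTokenizer

# Assumed dataset id and tokenizer; swap in the UltraChat release and
# base checkpoint you are actually fine-tuning.
dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

def to_training_text(example):
    # Each example carries a "messages" list of {"role", "content"} turns;
    # render it with the chat template into a single training string.
    text = tokenizer.apply_chat_template(example["messages"], tokenize=False)
    return {"text": text}

sft_dataset = dataset.map(to_training_text, remove_columns=dataset.column_names)
print(sft_dataset[0]["text"][:500])
```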

-----

Key Insights from this Paper 💡:

• Evaluating after supervised fine-tuning reveals long-context abilities more reliably than evaluating base models

• Training on longer sequences (512K) improves performance on shorter contexts (64K)

• Short-context instruction data sufficient for strong long-context performance

• Mixing long data with high-quality short data crucial for maintaining overall capabilities

-----

Results 📊:

• ProLong-8B achieves state-of-the-art performance among 10B-scale models at 128K context

• Outperforms Llama-3.1-8B on most long-context tasks while using only 5% as many tokens for long-context training

• Effectively processes up to 512K tokens

• Maintains strong short-context performance
