Careful data engineering enables small 8B-parameter LLMs to process documents of up to 512K tokens.
📚 https://arxiv.org/pdf/2410.02660
Solution in this Paper 🛠️:
• Establishes robust evaluation protocol using diverse long-context tasks
• Optimizes the data mix: 60% long data (30% code, 30% books) and 40% high-quality short data (see the first sketch after this list)
• Scales training to 40B tokens: 20B at 64K length, 20B at 512K length
• Uses an increased RoPE frequency base and disables cross-document attention (second sketch below)
• Performs supervised fine-tuning with short-context instruction data (UltraChat)
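
Below is a minimal Python sketch of the data recipe above, assuming the 30/30/40 mixture weights and the two-stage 64K→512K token budgets stated in the post; the source names, constants, and sampler are illustrative, not the authors' released code.

```python
# Illustrative sketch of a ProLong-style data recipe (weights and budgets from the post;
# everything else is assumed for demonstration).
import random

LONG_DATA_MIX = {
    "code": 0.30,    # 30% code, packed to the full context length
    "books": 0.30,   # 30% books
}
SHORT_DATA_MIX = {
    "short_high_quality": 0.40,  # 40% short, high-quality data to preserve general ability
}

TRAINING_STAGES = [
    {"seq_len": 64 * 1024,  "token_budget": 20_000_000_000},   # stage 1: 20B tokens @ 64K
    {"seq_len": 512 * 1024, "token_budget": 20_000_000_000},   # stage 2: 20B tokens @ 512K
]

def sample_source(rng: random.Random) -> str:
    """Pick a data source according to the 30/30/40 mixture weights."""
    mix = {**LONG_DATA_MIX, **SHORT_DATA_MIX}
    sources, weights = zip(*mix.items())
    return rng.choices(sources, weights=weights, k=1)[0]

if __name__ == "__main__":
    rng = random.Random(0)
    counts = {}
    for _ in range(10_000):
        src = sample_source(rng)
        counts[src] = counts.get(src, 0) + 1
    print(counts)  # roughly 30% / 30% / 40%
```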
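And a second minimal sketch of the modeling tweaks in the third bullet: raising the RoPE frequency base (slower rotation, so positions up to 512K stay distinguishable) and masking attention so packed documents never attend to each other. The rope_theta values and helper names here are assumptions for illustration, not the paper's exact settings.

```python
# Sketch only: RoPE base increase + document-level attention masking.
import torch

def rope_inverse_frequencies(head_dim: int, rope_theta: float) -> torch.Tensor:
    """Standard RoPE inverse frequencies; a larger rope_theta slows the rotation,
    which is what 'increasing the frequency base' refers to."""
    return 1.0 / (rope_theta ** (torch.arange(0, head_dim, 2).float() / head_dim))

def document_attention_mask(doc_ids: torch.Tensor) -> torch.Tensor:
    """Boolean [seq, seq] mask that is True only where attention is allowed:
    causal AND within the same packed document (no cross-document attention)."""
    seq_len = doc_ids.shape[0]
    causal = torch.ones(seq_len, seq_len).tril().bool()
    same_doc = doc_ids.unsqueeze(0) == doc_ids.unsqueeze(1)
    return causal & same_doc

if __name__ == "__main__":
    # Example: raising the base from Llama-3's 500K default to a larger (illustrative) value.
    inv_freq_short = rope_inverse_frequencies(head_dim=128, rope_theta=500_000.0)
    inv_freq_long = rope_inverse_frequencies(head_dim=128, rope_theta=8_000_000.0)
    print(inv_freq_short[-1].item(), inv_freq_long[-1].item())  # low-frequency tail rotates slower

    # Three documents of lengths 3, 2, 3 packed into one 8-token sequence.
    doc_ids = torch.tensor([0, 0, 0, 1, 1, 2, 2, 2])
    print(document_attention_mask(doc_ids).int())
```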
-----
Key Insights from this Paper 💡:
• Evaluating models after supervised fine-tuning reveals their long-context abilities more reliably
• Training on longer sequences (512K) improves performance on shorter contexts (64K)
• Short-context instruction data alone is sufficient for strong long-context performance
• Mixing long data with high-quality short data is crucial for maintaining overall capabilities
-----
Results 📊:
• ProLong-8B achieves state-of-the-art performance among 10B-scale models at 128K context
• Outperforms Llama-3.1-8B on most long-context tasks with only 5% of training data
• Effectively processes up to 512K tokens
• Maintains strong short-context performance