## Original Problem 🎯
Selecting high-quality samples from synthetic long instruction-following data is crucial for effective long context alignment of LLMs. Current approaches either concatenate short samples or indiscriminately increase data volume, leading to suboptimal performance.
-----
https://arxiv.org/abs/2410.15633
## Solution in this Paper 🛠️
• GATEAU framework introduces two components:
  - Homologous Models' Guidance (HMG): compares the perplexity that two homologous models (same architecture, different context windows) assign to the same response, measuring how difficult the response is to generate due to long-range dependencies
  - Contextual Awareness Measurement (CAM): evaluates whether the model's attention focuses on the important segments of the long input, measuring how difficult the long context is to understand
• Selects the most challenging samples, i.e., those richest in long-range dependency relations
• Ranks samples by a weighted sum of HMG and CAM scores (a minimal sketch of the scoring pipeline follows this list)
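Below is a minimal sketch of this scoring pipeline, assuming HuggingFace-style causal LMs. The function names (`response_perplexity`, `hmg_score`, `cam_score`, `gateau_rank`), the relative-perplexity-gap normalization, the attention-mass proxy for CAM, and the equal weighting are illustrative assumptions; the paper's exact formulas and hyperparameters may differ.

```python
# Sketch of GATEAU-style sample scoring; normalizations are assumptions, not
# the paper's precise formulation.
import torch
import torch.nn.functional as F

def response_perplexity(model, tokenizer, instruction, response, device="cuda"):
    """Perplexity of the response tokens conditioned on the instruction.

    Assumes tokenizing the concatenation preserves the instruction's token
    prefix (usually true on natural text; a sketch-level simplification).
    """
    prompt_len = tokenizer(instruction, return_tensors="pt").input_ids.shape[1]
    ids = tokenizer(instruction + response, return_tensors="pt").input_ids.to(device)
    with torch.no_grad():
        logits = model(ids).logits
    # Shift by one: logits at position t predict token t+1.
    nll = F.cross_entropy(logits[0, prompt_len - 1 : -1], ids[0, prompt_len:])
    return torch.exp(nll).item()

def hmg_score(short_model, long_model, tokenizer, instruction, response):
    """Homologous Models' Guidance: relative perplexity gap between two
    homologous models that differ only in context window (assumed form)."""
    ppl_short = response_perplexity(short_model, tokenizer, instruction, response)
    ppl_long = response_perplexity(long_model, tokenizer, instruction, response)
    return (ppl_short - ppl_long) / ppl_short

def cam_score(model, tokenizer, segments, important, response, device="cuda"):
    """Contextual Awareness Measurement, as a simplified proxy: the share of
    response-to-input attention mass landing on segments labeled important.

    `segments` is the long input split into chunks; `important` is a parallel
    list of booleans (how importance labels are obtained is outside this sketch).
    """
    # Token span of each segment in the concatenated input (no special tokens,
    # so the offsets line up with the full sequence tokenized below).
    spans, offset = [], 0
    for seg in segments:
        n = len(tokenizer(seg, add_special_tokens=False).input_ids)
        spans.append((offset, offset + n))
        offset += n
    text = "".join(segments) + response
    ids = tokenizer(text, return_tensors="pt", add_special_tokens=False).input_ids.to(device)
    with torch.no_grad():
        out = model(ids, output_attentions=True)
    # (layers, batch, heads, query, key) -> average over layers and heads.
    attn = torch.stack(out.attentions).mean(dim=(0, 2))[0]
    resp_to_input = attn[offset:, :offset]  # response tokens attending to the input
    total = resp_to_input.sum()
    focused = sum(resp_to_input[:, s:e].sum()
                  for (s, e), imp in zip(spans, important) if imp)
    return (focused / total).item()

def gateau_rank(samples, weight=0.5, keep_ratio=0.1):
    """Rank by a weighted sum of HMG and CAM scores (equal weight assumed) and
    keep the top fraction as the most challenging, dependency-rich samples."""
    scored = sorted(samples,
                    key=lambda s: weight * s["hmg"] + (1 - weight) * s["cam"],
                    reverse=True)
    return scored[: int(keep_ratio * len(scored))]
```

The `keep_ratio=0.1` default mirrors the 10% selection reported in the results; in practice the weight and ratio would be tuned per dataset.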
-----
## Key Insights 💡
• Quality over quantity matters in long context alignment
• Comparing homologous models that differ only in context window effectively measures long-range dependencies
• Attention patterns reveal difficulty in understanding long input contexts
• Selected challenging samples improve both short and long instruction-following capabilities
-----
## Results 📊
• Models trained on GATEAU-selected samples (just 10% of the data) outperform models trained on the full dataset
• LongBench-Chat: 9% improvement over baseline
• MT-Bench: 6.5% improvement in short instruction following
• Needle in Haystack test: Better information retrieval across different positions
• Consistent gains across additional benchmarks, including LongBench