InfoScale and CosScale: Two simple tricks that let LLMs read books instead of paragraphs.
The paper introduces two training-free scaling techniques, InfoScale and CosScale, that markedly improve how LLMs handle sequences far longer than their training length, without requiring any retraining.
https://arxiv.org/abs/2501.08570
Original Problem 🤔:
→ LLMs struggle to process sequences longer than their training length because of attention score dilution: newly added tokens spread the attention distribution thinner and pull focus away from the original context (see the sketch below)
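A minimal numpy sketch of the dilution effect (illustrative only, not taken from the paper): with random queries and keys, the entropy of the softmax attention weights climbs as the context grows, meaning the same query spreads its focus ever thinner.

```python
# Illustrative only: show that softmax attention spreads out as the number
# of keys grows, so attention entropy rises and focus is diluted.
import numpy as np

rng = np.random.default_rng(0)
d = 64                                    # head dimension (arbitrary choice)

def attention_entropy(n_tokens: int) -> float:
    q = rng.normal(size=d)
    K = rng.normal(size=(n_tokens, d))
    logits = K @ q / np.sqrt(d)           # standard scaled dot-product scores
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return float(-(p * np.log(p)).sum())  # Shannon entropy of attention weights

for n in (512, 4096, 32768):
    print(n, round(attention_entropy(n), 2))
# Entropy climbs with n: more tokens, thinner attention.
```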
-----
Solution in this Paper 🔧:
→ InfoScale preserves focus on original tokens by maintaining constant information entropy during extrapolation.
→ CosScale computes attention over query and key vectors constrained to a hypersphere and scales the resulting cosine scores, so angular relationships carry more of the attention signal.
→ The combination of both techniques creates a synergistic effect by constraining attention patterns while preserving information flow.
→ The solution is training-free and can be integrated with existing length extrapolation methods (a minimal sketch of both ideas follows this list).
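A minimal sketch of how the two scalings could be wired together in a single attention call. The log-length form of `info_scale`, the constant `s` in `cos_scale_scores`, and all names here are illustrative assumptions; the paper derives the exact entropy-invariant factors.

```python
# Sketch only: not the paper's exact formulas.
import numpy as np

def info_scale(n_ctx: int, n_train: int) -> float:
    # Assumed log-length temperature that keeps attention entropy roughly
    # constant as the context grows beyond the training length.
    return np.log(n_ctx) / np.log(n_train)

def cos_scale_scores(q: np.ndarray, K: np.ndarray, s: float = 8.0) -> np.ndarray:
    # Project the query and keys onto the unit hypersphere, then scale the
    # cosine similarities by a constant s (hypothetical value).
    qn = q / np.linalg.norm(q)
    Kn = K / np.linalg.norm(K, axis=-1, keepdims=True)
    return s * (Kn @ qn)

def scaled_attention(q, K, V, n_train=4096):
    # Combine both: length-aware temperature on top of hypersphere scores.
    n_ctx = K.shape[0]
    logits = info_scale(n_ctx, n_train) * cos_scale_scores(q, K)
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ V

# Toy usage: a context 4x the assumed 4096-token training length.
rng = np.random.default_rng(0)
q = rng.normal(size=64)
K = rng.normal(size=(16384, 64))
V = rng.normal(size=(16384, 64))
out = scaled_attention(q, K, V)
```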
-----
Key Insights 💡:
→ Attention score dilution, not insufficient training, is the key bottleneck in extending context windows
→ As CosScale increases, attention increasingly behaves like windowed attention (demonstrated in the sketch after this list)
→ Constraining query and key embeddings to a hypersphere makes their angular relationships easier to learn and exploit
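A quick illustration of the windowed-attention insight; random cosine similarities stand in for real model activations, so this shows the mechanism only. Raising the cosine scale sharpens the softmax, collapsing attention mass onto the few highest-similarity tokens, which in a real model tend to be local.

```python
# Illustrative only: larger cosine scales concentrate the softmax, so the
# effective number of attended tokens shrinks toward a narrow "window".
import numpy as np

rng = np.random.default_rng(1)
cos_sims = rng.uniform(-1.0, 1.0, size=8192)   # stand-in cosine similarities

for s in (1.0, 8.0, 32.0, 128.0):
    logits = s * cos_sims
    p = np.exp(logits - logits.max())
    p /= p.sum()
    eff_tokens = np.exp(-(p * np.log(p)).sum())  # exp(entropy) = effective count
    print(f"scale={s:>5.0f}  effective tokens attended ≈ {eff_tokens:,.0f}")
```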
-----
Results 📊:
→ Context window extended to 64x training length
→ Outperformed 7 existing methods including RoPE, ALiBi and windowed attention approaches
→ Combining InfoScale and CosScale improved accuracy 11.4x over the ReRoPE baseline