BFloat16 breaks RoPE, but a shared anchor token fixes it all
AnchorAttention: One token to rule them all in long-context training
This paper reveals that BFloat16 precision breaks down RoPE's relative positional encoding in LLMs during long-context training. The researchers propose AnchorAttention, which keeps a shared anchor token visible across all documents, mitigating the numerical issues and cutting training time by more than 50%.
-----
https://arxiv.org/abs/2411.13476
🔍 Original Problem:
RoPE has become the standard positional encoding in LLMs. However, under the BFloat16 format used for computational efficiency, limited numerical precision causes RoPE to deviate from its intended relative positional encoding, and the deviation worsens as sequences grow longer.
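To make the breakdown tangible, here is a minimal, self-contained sketch (not the paper's code; all function names are illustrative) comparing the relative-position invariance of a toy RoPE implementation in float32 versus BFloat16. Ideally the score between a query at position m and a key at position n depends only on m - n, so shifting both positions should leave it unchanged; the printed drift is typically far larger under BFloat16.

```python
import torch

def rope_angles(dim: int, positions: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    # Standard RoPE frequencies theta_i = base^(-2i/dim), one angle per 2-D pair.
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    return positions.float()[:, None] * inv_freq[None, :]  # (seq, dim/2)

def apply_rope(x: torch.Tensor, angles: torch.Tensor) -> torch.Tensor:
    # Rotate each (x_{2i}, x_{2i+1}) pair by its position-dependent angle,
    # with cos/sin cast to the computation dtype (float32 or bfloat16).
    cos, sin = angles.cos().to(x.dtype), angles.sin().to(x.dtype)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def score(q, k, m, n, dim, dtype):
    # Attention logit between a query at position m and a key at position n.
    qr = apply_rope(q.to(dtype), rope_angles(dim, torch.tensor([m])))
    kr = apply_rope(k.to(dtype), rope_angles(dim, torch.tensor([n])))
    return (qr * kr).sum().float()

torch.manual_seed(0)
dim = 128
q, k = torch.randn(1, dim), torch.randn(1, dim)

# RoPE should satisfy score(m, n) == score(m + s, n + s) for any shift s.
for dtype in (torch.float32, torch.bfloat16):
    drift = (score(q, k, 10, 0, dim, dtype) - score(q, k, 100_010, 100_000, dim, dtype)).abs()
    print(dtype, drift.item())
```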
-----
🛠️ Solution in this Paper:
→ AnchorAttention treats the first token as a shared anchor across all documents in the training context
→ The anchor token keeps a consistent position ID and is visible to every document, while tokens from different documents remain invisible to each other
→ The method cuts unnecessary attention computation and prevents numerical errors from accumulating across the context
→ Implementation requires only minimal modifications to existing training pipelines (a mask sketch follows below)
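A minimal sketch of the masking pattern described above, assuming a packed sequence that begins with one shared anchor token followed by concatenated documents; the helper `build_anchor_mask` and the `doc_ids` encoding are illustrative, not the paper's implementation.

```python
import torch

def build_anchor_mask(doc_ids: torch.Tensor) -> torch.Tensor:
    """doc_ids: (seq_len,) integer document id per token; index 0 is the anchor.
    Returns a boolean (seq_len, seq_len) mask where True means 'may attend'."""
    seq_len = doc_ids.shape[0]
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    same_doc = doc_ids[:, None] == doc_ids[None, :]  # tokens only see their own document
    mask = causal & same_doc                         # causal attention within each document
    mask[:, 0] = True                                # every token also sees the shared anchor
    return mask

# Example: anchor token (id -1) followed by two packed documents of lengths 3 and 2.
doc_ids = torch.tensor([-1, 0, 0, 0, 1, 1])
print(build_anchor_mask(doc_ids).int())
```

Because each token attends only within its own document plus the single anchor column, the mask is far sparser than full attention over the packed context, which is consistent with the reduced attention computation noted above.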
-----
💡 Key Insights:
→ The first token of the sequence contributes most to RoPE's breakdown under BFloat16
→ Numerical errors accumulate as context length increases
→ Resetting position IDs improves long-context performance but limits the range of rotation angles the model learns (the two schemes are contrasted in the sketch below)
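As a rough illustration of that last trade-off (assuming "resetting" means restarting the position index at zero for each packed document; helper names are hypothetical), compare the position IDs, and hence the maximum RoPE rotation angles, produced by the two schemes:

```python
import torch

def continuous_position_ids(doc_lengths):
    # One running index across the packed sequence: rotation angles grow with
    # the full context length, so the model sees large relative distances.
    return torch.arange(sum(doc_lengths))

def reset_position_ids(doc_lengths):
    # Restart at 0 for each document: the largest position (and hence rotation
    # angle) is capped by the longest single document.
    return torch.cat([torch.arange(n) for n in doc_lengths])

doc_lengths = [4, 3, 5]
print(continuous_position_ids(doc_lengths))  # 0..11
print(reset_position_ids(doc_lengths))       # 0..3, 0..2, 0..4
```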
-----
📊 Results:
→ Reduces training time by over 50% compared to standard attention mechanisms
→ Outperforms full attention and intra-document attention on RULER benchmark (8K to 128K lengths)
→ Improves in-context learning while preserving general task capabilities on MMLU and HellaSwag