Fourier and wave mathematics help LLMs read longer texts without getting confused.
FoPE enhances RoPE-based LLMs by treating each dimension as a Fourier Series and zeroing out harmful frequency components, enabling better length generalization without supplementary methods.
-----
https://arxiv.org/abs/2412.17739
🤖 Original Problem:
→ RoPE-based LLMs struggle with length generalization, requiring additional methods to handle longer contexts effectively.
→ Linear layers and activation functions cause spectral damage, while inadequately trained frequency components impair performance.
-----
🔬 Solution in this Paper:
→ FoPE models each dimension as a Fourier series with multiple frequency components, instead of RoPE's single frequency per dimension.
→ It clips inadequately trained frequency components down to zero frequency, so each of them keeps only a zero-frequency term that preserves long-wavelength information.
→ The implementation adds negligible memory and computation overhead compared to RoPE (rough sketch after this list).
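Here is a minimal PyTorch sketch of that idea, based on my reading of the paper rather than the authors' code. The names fope_angles, max_train_len, and num_extra_freqs, plus the small fixed mixing weights, are my own illustrative assumptions.

```python
# Minimal sketch of FoPE, based on my reading of the paper (not the authors' code).
# Hypothetical names/choices: fope_angles, max_train_len, num_extra_freqs,
# and the small random mixing weights are illustrative assumptions.
import math
import torch

def rope_frequencies(dim: int, base: float = 10000.0) -> torch.Tensor:
    # Standard RoPE: one frequency per 2D sub-space of the head dimension.
    return base ** (-torch.arange(0, dim, 2).float() / dim)

def fope_angles(dim: int, max_pos: int, max_train_len: int,
                num_extra_freqs: int = 4, base: float = 10000.0):
    freqs = rope_frequencies(dim, base)                        # (dim/2,)

    # 1) Clip inadequately trained frequencies: components whose period exceeds
    #    the training context never complete a full cycle, so set them to zero
    #    frequency (a constant term that still carries long-wavelength info).
    floor = 2 * math.pi / max_train_len
    freqs = torch.where(freqs < floor, torch.zeros_like(freqs), freqs)

    positions = torch.arange(max_pos).float()                  # (max_pos,)
    main = torch.outer(positions, freqs)                       # (max_pos, dim/2)

    # 2) Fourier-series view: each dimension mixes its main frequency with a few
    #    extra components (tiny fixed random weights here; the paper motivates
    #    these from spectral leakage through linear layers and activations).
    extra = rope_frequencies(2 * num_extra_freqs, base)        # (num_extra_freqs,)
    weights = 0.05 * torch.randn(freqs.shape[0], num_extra_freqs)
    extra_angles = torch.outer(positions, extra)                # (max_pos, num_extra_freqs)

    cos = torch.cos(main) + torch.cos(extra_angles) @ weights.T
    sin = torch.sin(main) + torch.sin(extra_angles) @ weights.T
    return cos, sin   # drop-in for the cos/sin tables of a standard RoPE rotation
```

The cos/sin tables slot into the usual RoPE rotation of queries and keys, which is why the extra cost over plain RoPE stays negligible.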
-----
🧪 Key Insights:
→ RoPE implicitly performs a Non-Uniform Discrete Fourier Transform on hidden states (see the small numerical check after this list)
→ Spectral damage from linear layers and activation functions significantly impacts length generalization
→ Increasing attention head dimensions is more beneficial than adding more heads or layers
→ The decay property in attention scores doesn't significantly impact length generalization
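To make the first insight concrete: under the standard RoPE rotation, the query-key attention logit reduces to a sum of sinusoids of the relative offset m − n, one per 2D sub-space, which is the Fourier-transform reading of RoPE. A tiny numerical check, my own illustration rather than code from the paper:

```python
# Tiny check that RoPE's attention logit is a Fourier series in relative position.
# My own illustration (standard RoPE rotation, toy dimensions), not from the paper.
import torch

def rope_rotate(x: torch.Tensor, pos: float, freqs: torch.Tensor) -> torch.Tensor:
    # Rotate each 2D sub-space (x1, x2) of x by the angle pos * freq.
    x1, x2 = x[0::2], x[1::2]
    ang = pos * freqs
    return torch.cat([x1 * torch.cos(ang) - x2 * torch.sin(ang),
                      x1 * torch.sin(ang) + x2 * torch.cos(ang)])

dim = 8
freqs = 10000.0 ** (-torch.arange(0, dim, 2).float() / dim)
q, k = torch.randn(dim), torch.randn(dim)
m, n = 37.0, 12.0

# Attention logit under RoPE ...
logit = rope_rotate(q, m, freqs) @ rope_rotate(k, n, freqs)

# ... equals a sum of sinusoids of the relative offset (m - n):
q1, q2, k1, k2 = q[0::2], q[1::2], k[0::2], k[1::2]
a = q1 * k1 + q2 * k2            # cosine coefficient per frequency
b = q1 * k2 - q2 * k1            # sine coefficient per frequency
series = (a * torch.cos(freqs * (m - n)) + b * torch.sin(freqs * (m - n))).sum()

print(torch.allclose(logit, series, atol=1e-5))  # expected: True
```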
-----
📊 Results:
→ Maintains more stable perplexity than RoPE and ALiBi as the context window varies
→ Shows consistent accuracy on needle-in-a-haystack retrieval tasks
→ Works effectively across model scales from 60M to 1.2B parameters
-----
Are you into AI and LLMs❓ Join my daily AI newsletter. I will send you 7 emails a week analyzing the highest signal AI developments. ↓↓
🎉 https://rohanpaul.substack.com/