Fourier and wave mathematics help LLMs read longer texts without getting confused.
FoPE enhances RoPE-based LLMs by treating each dimension as a Fourier Series and zeroing out harmful frequency components, enabling better length generalization without supplementary methods.
-----
https://arxiv.org/abs/2412.17739
🤔 Original Problem:
→ RoPE-based LLMs struggle with length generalization, requiring additional methods to handle longer contexts effectively.
→ Linear layers and activation functions cause spectral damage, while inadequately trained frequency components impair performance.
-----
🔬 Solution in this Paper:
→ FoPE models each dimension as a Fourier series with multiple frequency components, instead of RoPE's single-frequency approach.
→ It clips inadequately trained frequency components by zeroing them out, preserving only zero-frequency components for long-wavelength information.
→ The implementation adds negligible memory and computation overhead compared to RoPE.
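
A minimal sketch of these two ideas (a small Fourier series per dimension, plus clipping of under-trained low frequencies) follows. This is an illustrative approximation, not the authors' code: the helper name `fope_cos_sin`, the choice of extra components as multiples of each base frequency, the `floor_freq` threshold, and the random mixing weights (learned in the real model) are all assumptions made for this sketch.

```python
import torch

def fope_cos_sin(seq_len, dim, base=10000.0, num_extra=4, floor_freq=1e-3, seed=0):
    """Hypothetical FoPE-style cos/sin tables, shaped (seq_len, dim // 2)."""
    torch.manual_seed(seed)
    half = dim // 2

    # Standard RoPE base frequencies: theta_j = base^(-2j/dim).
    theta = base ** (-torch.arange(half, dtype=torch.float32) * 2 / dim)

    # Each dimension pair gets extra Fourier components; here they are simple
    # multiples of its base frequency (an illustrative assumption).
    mult = torch.arange(2, num_extra + 2, dtype=torch.float32)          # (num_extra,)
    comps = torch.cat([theta[:, None], theta[:, None] * mult], dim=1)   # (half, 1 + num_extra)

    # Clip inadequately trained low frequencies to exactly zero, so only the
    # zero-frequency (constant) component carries long-wavelength information.
    comps = torch.where(comps < floor_freq, torch.zeros_like(comps), comps)

    # Mixing weights of the Fourier series (learned in the paper; random here
    # purely for illustration).
    weights = torch.softmax(torch.randn(half, num_extra + 1), dim=1)

    pos = torch.arange(seq_len, dtype=torch.float32)
    angles = pos[:, None, None] * comps[None, :, :]        # (seq_len, half, 1 + num_extra)
    cos = (weights * torch.cos(angles)).sum(-1)            # weighted Fourier series
    sin = (weights * torch.sin(angles)).sum(-1)
    return cos, sin

# Example: tables for a 64-dim attention head over a 2048-token context.
cos, sin = fope_cos_sin(seq_len=2048, dim=64)
```

The returned cos/sin tables slot into a standard RoPE-style rotation of query/key pairs, which is why the overhead over plain RoPE stays negligible.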
-----
🧪 Key Insights:
→ RoPE implicitly performs a Non-Uniform Discrete Fourier Transform (NUDFT) on hidden states (see the formula after this list)
→ Spectral damage from linear layers and activation functions significantly impacts length generalization
→ Increasing attention head dimensions is more beneficial than adding more heads or layers
→ The decay property in attention scores doesn't significantly impact length generalization
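
For context on the first insight: in the standard RoPE formulation (following Su et al.; the notation here is mine), the query–key dot product in each 2-D subspace depends on relative position through complex exponentials at the RoPE frequencies:

$$
q_m^\top k_n \;=\; \sum_{j=0}^{d/2-1} \operatorname{Re}\!\left[\, q^{(j)} \, \overline{k^{(j)}} \; e^{\,i\,\theta_j (m-n)} \right],
\qquad \theta_j = b^{-2j/d},
$$

which is a discrete Fourier transform over the relative position m − n evaluated at the non-uniformly spaced frequencies θ_j, i.e. a NUDFT of the hidden states.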
-----
📊 Results:
→ Maintains more stable perplexity than RoPE and ALiBi as the context window grows
→ Shows consistent accuracy in needle-in-a-haystack tasks
→ Works effectively across model scales from 60M to 1.2B parameters
------
Are you into AI and LLMs? Join my daily AI newsletter. I will send you 7 emails a week analyzing the highest-signal AI developments.
👉 https://rohanpaul.substack.com/