PolyCom (Polynomial Composition Activations): Making Transformers smarter by letting them see complex patterns in data.
A new activation function that speaks polynomial but thinks like ReLU.
https://arxiv.org/abs/2411.03884
🎯 Original Problem:
Transformers heavily rely on activation functions like ReLU and GeLU for nonlinearity, but these functions limit the model's ability to capture complex patterns and higher-order interactions in data.
-----
🔧 Solution in this Paper:
→ Introduces PolyCom (Polynomial Composition Activations) with two variants - PolyReLU and PolyNorm
→ PolyReLU composes polynomial functions with ReLU: x → Σ(i=0 to r) a_i · ReLU^i(x)
→ PolyNorm normalizes the polynomial powers before summing: x → Σ(i=0 to r) a_i · norm(x^i)
→ Uses third-order polynomials (r=3) with trainable coefficients initialized to a_i = 1/3 (see the sketch below)
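A minimal PyTorch sketch of the two variants, following the formulas above. This is an illustrative implementation, not the paper's code: the treatment of the constant i=0 term as a learnable bias and the choice of norm (L2 normalization over the hidden dimension) are assumptions.

```python
import torch
import torch.nn as nn


class PolyReLU(nn.Module):
    """PolyReLU: x -> sum_{i=0}^{r} a_i * ReLU(x)**i, with trainable a_i."""

    def __init__(self, order: int = 3):
        super().__init__()
        self.order = order
        # Coefficients initialized to 1/r (here 1/3), as described in the post.
        self.coeffs = nn.Parameter(torch.full((order + 1,), 1.0 / order))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        relu_x = torch.relu(x)
        out = self.coeffs[0] * torch.ones_like(x)  # i = 0 term (constant)
        for i in range(1, self.order + 1):
            out = out + self.coeffs[i] * relu_x.pow(i)
        return out


class PolyNorm(nn.Module):
    """PolyNorm: x -> sum_{i=0}^{r} a_i * norm(x**i).

    'norm' is assumed here to be L2 normalization over the hidden dimension;
    the paper's exact normalization may differ.
    """

    def __init__(self, order: int = 3, eps: float = 1e-6):
        super().__init__()
        self.order = order
        self.eps = eps
        self.coeffs = nn.Parameter(torch.full((order + 1,), 1.0 / order))

    def _norm(self, x: torch.Tensor) -> torch.Tensor:
        return x / (x.norm(dim=-1, keepdim=True) + self.eps)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.coeffs[0] * torch.ones_like(x)  # i = 0 term (constant)
        for i in range(1, self.order + 1):
            out = out + self.coeffs[i] * self._norm(x.pow(i))
        return out
```

Either module can drop in wherever a pointwise activation would sit in the Transformer MLP block.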
-----
💡 Key Insights:
→ PolyReLU networks can exactly represent any ReLU network of the same size (see the check below)
→ Conversely, ReLU networks need more parameters to match PolyReLU's expressivity
→ Achieves the optimal approximation rate in Sobolev spaces with a minimal number of parameters
→ Captures higher-order interactions in the data with negligible computational overhead
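The first insight follows directly from the definition: setting a_1 = 1 and all other coefficients to 0 collapses PolyReLU to plain ReLU, so every ReLU network is a special case of a PolyReLU network. A quick sanity check using the PolyReLU sketch above:

```python
import torch

# With a_1 = 1 and all other coefficients zeroed out,
# PolyReLU reduces to plain ReLU.
act = PolyReLU(order=3)
with torch.no_grad():
    act.coeffs.zero_()
    act.coeffs[1] = 1.0

x = torch.randn(4, 8)
assert torch.allclose(act(x), torch.relu(x))
```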
-----
📊 Results:
→ On 1B dense models: PolyNorm outperformed SwiGLU by an average margin of 1.21% across six downstream tasks
→ Lower training loss (2.17 vs 2.19) and validation perplexity (3.17 vs 3.22) compared to SwiGLU
→ On MoE models: Better performance on 8 out of 9 downstream tasks