
"Polynomial Composition Activations: Unleashing the Dynamics of Large Language Models"

The podcast on this paper was generated with Google's Illuminate.

PolyCom (Polynomial Composition Activations): Making Transformers smarter by letting them see complex patterns in data.

A new activation function that speaks polynomial but thinks like ReLU.

https://arxiv.org/abs/2411.03884

🎯 Original Problem:

Transformers heavily rely on activation functions like ReLU and GeLU for nonlinearity, but these functions limit the model's ability to capture complex patterns and higher-order interactions in data.

-----

🔧 Solution in this Paper:

→ Introduces PolyCom (Polynomial Composition Activations) with two variants - PolyReLU and PolyNorm

→ PolyReLU composes a polynomial with ReLU: x → Σ(i=0 to r) a_i · ReLU(x)^i, where ReLU(x)^i is the i-th power of ReLU(x) (see the sketch after this list)

→ PolyNorm normalizes the polynomial powers: x → Σ(i=0 to r) a_i · Norm(x^i), where x^i is the element-wise i-th power of x

→ Uses third-order polynomials (r = 3) with trainable coefficients initialized to a_i = 1/3
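
Below is a minimal PyTorch-style sketch of the two variants, following the formulas above (element-wise powers, r = 3, coefficients initialized to 1/r). The module names, the RMS-style normalization, and the handling of the i = 0 constant term are assumptions for illustration, not the paper's reference implementation.

```python
import torch
import torch.nn as nn


class PolyReLU(nn.Module):
    """PolyCom variant: x -> sum_{i=0..r} a_i * ReLU(x)**i (element-wise powers)."""

    def __init__(self, r: int = 3):
        super().__init__()
        # trainable coefficients a_0..a_r, initialized to 1/r as stated above
        self.a = nn.Parameter(torch.full((r + 1,), 1.0 / r))
        self.r = r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        relu_x = torch.relu(x)
        return sum(self.a[i] * relu_x**i for i in range(self.r + 1))


class PolyNorm(nn.Module):
    """PolyCom variant: x -> sum_{i=0..r} a_i * norm(x**i) (element-wise powers)."""

    def __init__(self, r: int = 3, eps: float = 1e-6):
        super().__init__()
        self.a = nn.Parameter(torch.full((r + 1,), 1.0 / r))
        self.r = r
        self.eps = eps

    def _norm(self, t: torch.Tensor) -> torch.Tensor:
        # RMS-style normalization over the hidden dimension (an assumption here)
        return t / torch.sqrt(t.pow(2).mean(dim=-1, keepdim=True) + self.eps)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return sum(self.a[i] * self._norm(x**i) for i in range(self.r + 1))
```

In a Transformer feed-forward block, either module would simply replace the usual ReLU/GeLU/SwiGLU activation between the two linear projections.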

-----

💡 Key Insights:

→ PolyReLU networks can exactly represent any ReLU network of the same size (a concrete check follows this list)

→ ReLU networks need more parameters to match PolyReLU's expressivity

→ Achieves the optimal approximation rate in Sobolev spaces with a minimal number of parameters

→ Captures higher-order interactions in the data without extra computational overhead
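
The first insight above is easy to verify concretely: with a_1 = 1 and every other coefficient set to 0, PolyReLU reduces to plain ReLU, so a PolyReLU network can reproduce any ReLU network of the same size weight-for-weight. A quick self-contained numerical check (illustrative, not from the paper):

```python
import torch

# PolyReLU(x) = sum_{i=0..3} a_i * relu(x)**i; choosing a = (0, 1, 0, 0)
# keeps only the linear-in-ReLU term, which is exactly ReLU itself.
a = torch.tensor([0.0, 1.0, 0.0, 0.0])
x = torch.randn(8)
poly_relu = sum(a[i] * torch.relu(x) ** i for i in range(4))
assert torch.allclose(poly_relu, torch.relu(x))
```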

-----

📊 Results:

→ On 1B-parameter dense models: PolyNorm outperformed SwiGLU by an average margin of 1.21% across six tasks

→ Lower training loss (2.17 vs 2.19) and validation perplexity (3.17 vs 3.22) compared to SwiGLU

→ On MoE models: Better performance on 8 out of 9 downstream tasks
