Simple ReLU-based routing with dynamic expert selection replaces complex TopK routing, helping MoE LLMs learn better.
ReMoE introduces a fully differentiable routing mechanism using ReLU to replace TopK routing in Mixture-of-Experts models, enabling better performance and scaling.
-----
https://arxiv.org/abs/2412.14711
🤔 Original Problem:
→ Traditional TopK routing in Mixture-of-Experts (MoE) models is non-differentiable, limiting model performance and scalability. The discrete nature of expert selection creates training inefficiencies and prevents smooth gradient flow.
-----
🔧 Solution in this Paper:
→ ReMoE replaces TopK+Softmax routing with a simple ReLU function, making the entire routing process differentiable (see the sketch after this list).
→ It controls sparsity with adaptive L1 regularization on the ReLU router outputs, refined with a load-balancing term, and the penalty strength adapts over the course of training.
→ The system allows dynamic allocation of experts per token based on complexity, enabling more efficient resource usage.
→ Training naturally progresses through three stages: warm-up (dense), sparsifying, and stable (sparse) phases.
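A minimal PyTorch sketch of the core idea, not the paper's implementation: the class name `ReLURoutedMoE`, the layer sizes, and the expert MLP shape are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReLURoutedMoE(nn.Module):
    """Sketch of an MoE layer where a ReLU gate replaces TopK+Softmax routing."""

    def __init__(self, d_model=512, d_ff=2048, num_experts=8):
        super().__init__()
        # Plain linear router; ReLU (not TopK+Softmax) decides which experts fire.
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):
        # x: (batch, seq, d_model)
        gates = F.relu(self.router(x))  # (batch, seq, num_experts); a zero gate means "expert off"
        # For clarity every expert is evaluated here; an efficient implementation
        # would only compute experts whose gate is nonzero.
        expert_outs = torch.stack([e(x) for e in self.experts], dim=-1)  # (batch, seq, d_model, E)
        y = (expert_outs * gates.unsqueeze(2)).sum(dim=-1)               # gate-weighted mixture
        # An L1 penalty on the gate values pushes most of them to exactly zero,
        # controlling sparsity without any discrete TopK step.
        l1_penalty = gates.abs().mean()
        return y, l1_penalty
```

In training, the total loss would be the language-modeling loss plus a coefficient times `l1_penalty`, with that coefficient adapted toward a target sparsity (sketched after the Key Insights list).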
-----
💡 Key Insights:
→ ReLU routing enables independent expert activation decisions, allowing variable expert allocation per token (see the sketch after this list)
→ The model naturally learns to allocate more experts to complex/rare tokens
→ Domain specialization emerges naturally, with experts becoming specialized for specific content types
→ Full differentiability enables end-to-end training without discrete operations
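To make the variable-allocation point concrete, here is a hedged sketch of how an adaptive L1 coefficient could track a target expert budget by counting strictly positive ReLU gates per token. The function name `adapt_l1_coeff`, the multiplicative update rule, and the constants are illustrative assumptions, not the paper's exact scheme (which also folds in load balancing).

```python
import torch

def adapt_l1_coeff(gates: torch.Tensor, lambda_l1: float,
                   target_active: float = 2.0, step: float = 1.1) -> float:
    """Nudge the L1 coefficient so the average number of active experts per
    token (strictly positive ReLU gates) tracks a target budget."""
    active_per_token = (gates > 0).sum(dim=-1)             # experts each token actually uses
    avg_active = active_per_token.float().mean().item()
    if avg_active > target_active:
        return lambda_l1 * step    # too dense: increase sparsity pressure
    return lambda_l1 / step        # too sparse: relax the penalty
```

Called periodically during training with the `gates` tensor from the routing sketch above, this kind of rule would keep average compute near a TopK-style budget while still letting individual tokens use more or fewer experts.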
-----
📊 Results:
→ Outperforms vanilla TopK-routed MoE across model sizes (182M to 978M parameters)
→ Shows superior scaling with increasing expert count (4 to 128 experts)
→ Maintains the same computational cost as traditional TopK-routed MoE
→ Achieves better downstream-task performance, with an average accuracy improvement of 1.5%
-----
Are you into AI and LLMs❓ Join my daily AI newsletter. I will send you 7 emails a week analyzing the highest signal AI developments. ↓↓
🎉 https://rohanpaul.substack.com/