Multi-Head Mixture-of-Experts (MH-MoE) enables experts to collaborate across multiple representation spaces while maintaining efficiency.
Parallel processing of sub-token representations boosts expert performance without extra computation.
Multi-Head Mixture-of-Experts (MH-MoE) enhances standard Sparse Mixture-of-Experts with a multi-head mechanism that collectively processes information from different expert representation spaces, while keeping FLOPs and parameter counts on par with the baseline.
-----
https://arxiv.org/abs/2411.16205
🤔 Original Problem:
Sparse Mixture-of-Experts (SMoE) models struggle to use expert knowledge effectively: each expert processes its tokens independently, so information is never shared across different representation spaces.
-----
🔧 Solution in this Paper:
→ MH-MoE splits input tokens into multiple sub-tokens using a head layer, allowing parallel processing across different representation spaces.
→ Each sub-token is processed by different experts through a gating mechanism that dynamically routes information.
→ A merge layer combines the processed sub-tokens, integrating features from multiple expert spaces.
→ The implementation maintains FLOPs parity with standard SMoE by adjusting intermediate dimensions and expert counts (see the sketch below).
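
To make the data flow concrete, here is a minimal PyTorch sketch of the head-split → route → merge pipeline described above. The class and parameter names (MHMoE, head_proj, merge_proj, d_expert, etc.) are illustrative assumptions, not the paper's implementation, and the simple per-expert loop stands in for the batched dispatch a real SMoE kernel would use.

```python
import torch
import torch.nn as nn

class MHMoE(nn.Module):
    """Minimal sketch of a Multi-Head MoE layer; names and sizes are illustrative."""

    def __init__(self, d_model=768, n_heads=3, n_experts=8, top_k=2, d_expert=256):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.top_k = n_heads, top_k
        self.d_head = d_model // n_heads
        # Head layer: projects each token, then splits it into n_heads sub-tokens.
        self.head_proj = nn.Linear(d_model, d_model)
        # Router scores every sub-token against every expert.
        self.router = nn.Linear(self.d_head, n_experts)
        # Experts act on sub-tokens; d_expert is kept small so total FLOPs
        # stay comparable to a standard SMoE layer operating on full tokens.
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(self.d_head, d_expert),
                nn.GELU(),
                nn.Linear(d_expert, self.d_head),
            )
            for _ in range(n_experts)
        )
        # Merge layer: recombines processed sub-tokens into one token,
        # mixing features drawn from different expert representation spaces.
        self.merge_proj = nn.Linear(d_model, d_model)

    def forward(self, x):                       # x: (batch, seq, d_model)
        b, s, d = x.shape
        # Split each token into n_heads sub-tokens of size d_head.
        sub = self.head_proj(x).view(b, s * self.n_heads, self.d_head)
        # Route each sub-token to its top-k experts.
        probs = self.router(sub).softmax(dim=-1)
        weights, idx = probs.topk(self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(sub)
        for e, expert in enumerate(self.experts):
            for k in range(self.top_k):
                mask = idx[..., k] == e         # sub-tokens assigned to expert e
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(sub[mask])
        # Merge sub-tokens back into full tokens.
        return self.merge_proj(out.view(b, s, d))


# Usage: input and output shapes match, so the layer drops into a Transformer block.
layer = MHMoE()
y = layer(torch.randn(2, 16, 768))              # -> (2, 16, 768)
```

Because the split and merge projections preserve the model dimension, the layer is a drop-in replacement for a standard SMoE FFN; only the sub-token width and expert inner dimension change to keep compute constant.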
-----
💡 Key Insights:
→ Head layer contributes more significantly to performance gains than merge layer
→ Three-head configuration outperforms two-head setup consistently
→ MH-MoE integrates effectively with 1-bit quantization techniques
-----
📊 Results:
→ Achieved lower perplexity compared to standard SMoE and fine-grained SMoE variants
→ Demonstrated 10.51 perplexity on RedPajama dataset vs 10.90 for baseline SMoE
→ Maintained performance advantages even with 1-bit quantization