Mixture-of-Transformers (MoT) splits a transformer's non-embedding parameters by modality while keeping global self-attention, roughly halving multi-modal training costs.
An architecture that gives each modality its own processing path without giving up cross-modal learning.
https://arxiv.org/abs/2411.04996
🤖 Original Problem:
Training multi-modal LLMs (text, image, speech) demands massive computational resources and poses difficult optimization problems, because the different modalities have conflicting training dynamics.
-----
🔧 Solution in this Paper:
→ Introduced Mixture-of-Transformers (MoT), a sparse architecture that decouples non-embedding parameters by modality
→ Each token is processed with the parameters of its own modality (feed-forward network, attention projection matrices, layer normalization), while global self-attention still runs over the entire mixed-modality sequence
→ Uses simple rule-based routing by modality instead of the learned routing of Mixture-of-Experts (MoE)
→ Preserves the same computational structure and per-token FLOP count as a dense transformer (a minimal sketch follows this list)
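Below is a minimal sketch of one MoT layer in PyTorch. The sizes (d_model, n_heads, n_modalities), the pre-norm residual layout, and the helper names are my assumptions, not the paper's code; the sketch only illustrates the core idea of modality-specific FFN / attention-projection / LayerNorm weights combined with one global self-attention over the mixed-modality sequence.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoTLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, n_modalities=2):
        super().__init__()
        self.n_heads = n_heads
        # One copy of every non-embedding weight per modality (hypothetical sizes).
        self.norm1 = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(n_modalities)])
        self.qkv   = nn.ModuleList([nn.Linear(d_model, 3 * d_model) for _ in range(n_modalities)])
        self.proj  = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(n_modalities)])
        self.norm2 = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(n_modalities)])
        self.ffn   = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_modalities)
        ])

    @staticmethod
    def _by_modality(modules, x, modality):
        # Rule-based routing: positions tagged with modality m use modules[m].
        out = None
        for m, module in enumerate(modules):
            y = module(x)                                   # modality-m weights, all positions
            out = torch.zeros_like(y) if out is None else out
            out = torch.where((modality == m).unsqueeze(-1), y, out)
        return out

    def forward(self, x, modality):
        # x: [B, T, d_model]; modality: [B, T] integer tag per token (e.g. 0=text, 1=image).
        B, T, D = x.shape
        h = self._by_modality(self.norm1, x, modality)
        q, k, v = self._by_modality(self.qkv, h, modality).chunk(3, dim=-1)
        heads = lambda t: t.view(B, T, self.n_heads, D // self.n_heads).transpose(1, 2)
        # Global self-attention: every token attends across modalities.
        a = F.scaled_dot_product_attention(heads(q), heads(k), heads(v), is_causal=True)
        a = a.transpose(1, 2).reshape(B, T, D)
        x = x + self._by_modality(self.proj, a, modality)
        x = x + self._by_modality(self.ffn, self._by_modality(self.norm2, x, modality), modality)
        return x

# Usage: two modalities interleaved in one sequence.
layer = MoTLayer()
tokens = torch.randn(2, 16, 512)
tags = torch.randint(0, 2, (2, 16))        # 0 = text position, 1 = image position
print(layer(tokens, tags).shape)           # torch.Size([2, 16, 512])
```

Note that per-token compute matches a dense layer of the same width: each token still passes through exactly one LayerNorm, one QKV projection, and one FFN of the same size, only the weights differ by modality, which is how MoT stays sparse in parameters without changing FLOPs.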
-----
💡 Key Insights:
→ Different modalities naturally cluster in distinct regions of feature space, suggesting inherent processing differences
→ Rule-based routing by modality outperforms learned routing, since its fixed token-to-parameter assignment gives more stable training dynamics (toy contrast sketched after this list)
→ Global self-attention enables cross-modal learning despite parameter decoupling
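A toy contrast of the two routing schemes (hypothetical code, not from the paper): rule-based routing indexes experts with a fixed modality tag, while MoE-style learned routing depends on a trained gate whose token-to-expert assignments shift as it trains.

```python
import torch
import torch.nn as nn

d = 512
experts = nn.ModuleList([nn.Linear(d, d) for _ in range(2)])   # one expert per modality
gate = nn.Linear(d, 2)                                         # MoE-style learned gate

def rule_based(x, modality):
    # Deterministic: the modality tag fully decides the expert, so the
    # assignment never changes during training.
    return torch.stack([experts[m](t) for t, m in zip(x, modality.tolist())])

def learned(x):
    # MoE-style: the gate's argmax decides the expert; as the gate trains,
    # token-to-expert assignments drift, which can destabilize optimization.
    choice = gate(x).argmax(dim=-1)
    return torch.stack([experts[c](t) for t, c in zip(x, choice.tolist())])

x = torch.randn(8, d)                                  # 8 tokens
tags = torch.tensor([0, 0, 1, 1, 0, 1, 0, 1])
print(rule_based(x, tags).shape, learned(x).shape)     # torch.Size([8, 512]) twice
```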
-----
📊 Results:
→ In the Chameleon 7B setting: matches the dense baseline's performance while using only 55.8% of the FLOPs
→ With speech added as a third modality: achieves comparable performance using 37.2% of the FLOPs
→ In the Transfusion setting: a 760M-parameter MoT outperforms the 1.4B dense baseline on image-generation metrics
→ Cuts wall-clock training time: matches the dense baseline's image quality in 47.2% of the time