Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models

The podcast on this paper is generated with Google's Illuminate.

Mixture-of-Transformers (MoT) splits transformer parameters by modality while keeping global attention, reducing training costs by half.

An architecture that cuts multi-modal training costs in half by giving each modality its own processing path.

https://arxiv.org/abs/2411.04996

🤖 Original Problem:

Training multi-modal LLMs (text, image, speech) demands massive computational resources and poses complex optimization challenges, because the different modalities have conflicting training dynamics.

-----

🔧 Solution in this Paper:

→ Introduced Mixture-of-Transformers (MoT), a sparse architecture that decouples non-embedding parameters by modality

→ MoT applies modality-specific weights (feed-forward networks, attention projection matrices, layer normalization) to each token while maintaining global self-attention over the full mixed-modal sequence

→ Uses simple rule-based routing by modality instead of the learned routing of Mixture-of-Experts (MoE)

→ Preserves the same computational structure and FLOP count as a dense transformer (see the sketch after this list)
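
To make the architecture concrete, here is a minimal PyTorch-style sketch of one MoT block, assuming a per-token modality index (e.g. 0 = text, 1 = image, 2 = speech) supplied by the tokenizer. The class name MoTBlock, the module layout, and the shapes are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoTBlock(nn.Module):
    """Illustrative Mixture-of-Transformers block.

    Non-embedding parameters (attention projections, feed-forward nets,
    layer norms) are duplicated per modality; the attention scores
    themselves are computed globally over the full mixed-modal sequence.
    """

    def __init__(self, d_model: int, n_heads: int, n_modalities: int = 3):
        super().__init__()
        self.d_model, self.n_heads = d_model, n_heads
        # One copy of each non-embedding parameter group per modality.
        self.qkv   = nn.ModuleList([nn.Linear(d_model, 3 * d_model) for _ in range(n_modalities)])
        self.out   = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(n_modalities)])
        self.norm1 = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(n_modalities)])
        self.norm2 = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(n_modalities)])
        self.ffn   = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_modalities)
        ])

    def _route(self, modules, x, modality, out_dim):
        # Rule-based routing: each token is processed by the parameter
        # copy of its own modality -- no learned router, unlike MoE.
        out = x.new_zeros((*x.shape[:-1], out_dim))
        for m, module in enumerate(modules):
            mask = modality == m            # (batch, seq) boolean
            if mask.any():
                out[mask] = module(x[mask])
        return out

    def forward(self, x, modality):
        # x: (batch, seq, d_model); modality: (batch, seq) integer tensor.
        h = self._route(self.norm1, x, modality, self.d_model)
        q, k, v = self._route(self.qkv, h, modality, 3 * self.d_model).chunk(3, dim=-1)

        # Global self-attention: every token attends over the whole
        # mixed-modal sequence, which is what enables cross-modal learning.
        def heads(t):
            b, s, _ = t.shape
            return t.reshape(b, s, self.n_heads, -1).transpose(1, 2)
        attn = F.scaled_dot_product_attention(heads(q), heads(k), heads(v), is_causal=True)
        attn = attn.transpose(1, 2).reshape(x.shape)

        x = x + self._route(self.out, attn, modality, self.d_model)
        h = self._route(self.norm2, x, modality, self.d_model)
        return x + self._route(self.ffn, h, modality, self.d_model)
```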

-----

💡 Key Insights:

→ Different modalities naturally cluster in distinct regions of feature space, suggesting inherent processing differences

→ Rule-based routing by modality outperforms learned routing due to more stable training dynamics

→ Global self-attention over the full mixed-modal sequence enables cross-modal learning despite the parameter decoupling (usage example below)
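
As a hypothetical usage of the MoTBlock sketch above: the modality index comes straight from the tokenizer, so routing is a simple lookup rather than a learned decision, while attention still mixes tokens across modalities.

```python
import torch

# Hypothetical mixed-modal batch: one sequence of 8 tokens, tagged
# per token as 0 = text, 1 = image, 2 = speech.
d_model, n_heads = 512, 8
x = torch.randn(1, 8, d_model)
modality = torch.tensor([[0, 0, 0, 1, 1, 1, 2, 2]])

block = MoTBlock(d_model, n_heads, n_modalities=3)
y = block(x, modality)   # -> shape (1, 8, 512)

# Each token was normalized, projected, and fed through its own
# modality's FFN, but the attention scores were computed over all
# 8 tokens -- that global attention carries the cross-modal signal.
```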

-----

📊 Results:

→ In Chameleon 7B setting: Matches dense baseline using 55.8% of FLOPs

→ With speech added: Achieves comparable performance using 37.2% of FLOPs

→ In Transfusion setting: 760M MoT outperforms 1.4B dense baseline on image metrics

→ Reduces wall-clock training time: matches the dense baseline's image quality in 47.2% of the wall-clock time
