MoH: Multi-Head Attention as Mixture-of-Head Attention
Smart routing system tells Transformer heads when to pay attention, boosting efficiency
In short, this paper gives Transformers selective attention: Mixture-of-Head attention (MoH) picks the right heads for each token
Original Problem 🔍:
Multi-head attention, a core component of Transformer models, contains redundant attention heads, leading to inefficient inference.
Solution in this Paper 🧠:
• Introduces Mixture-of-Head attention (MoH)
• Treats attention heads as experts in Mixture-of-Experts mechanism
• Activates only Top-K heads for each token
• Uses weighted summation instead of standard summation
• Incorporates shared heads for common knowledge
• Implements two-stage routing to dynamically balance the weights of shared and routed heads (a minimal sketch follows this list)
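
A minimal PyTorch sketch of this idea, for intuition only: the layer and router names (`MoHAttention`, `shared_router`, `routed_router`, `stage_gate`) and the exact gating scheme are assumptions based on the bullets above, not the authors' implementation. For clarity it computes every head and zeroes out the inactive ones; an efficient implementation would skip them.

```python
import torch
import torch.nn as nn

class MoHAttention(nn.Module):
    """Illustrative MoH-style attention: shared heads always on, Top-K routed heads per token."""
    def __init__(self, dim, num_heads=8, num_shared_heads=2, top_k=3):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.num_shared = num_shared_heads           # always-activated heads for common knowledge
        self.num_routed = num_heads - num_shared_heads
        self.top_k = top_k                           # routed heads activated per token
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        self.shared_router = nn.Linear(dim, self.num_shared)
        self.routed_router = nn.Linear(dim, self.num_routed)
        self.stage_gate = nn.Linear(dim, 2)          # splits weight between shared and routed groups

    def forward(self, x):
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)         # each: (B, H, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.head_dim ** -0.5
        heads = attn.softmax(dim=-1) @ v             # (B, H, N, head_dim)
        heads = heads.permute(0, 2, 1, 3)            # (B, N, H, head_dim)

        # Two-stage routing: stage 1 splits weight between the shared and routed
        # groups; stage 2 distributes weight within each group per token.
        alpha = self.stage_gate(x).softmax(dim=-1)                   # (B, N, 2)
        shared_w = alpha[..., 0:1] * self.shared_router(x).softmax(dim=-1)

        scores = self.routed_router(x)                               # (B, N, num_routed)
        topk_val, topk_idx = scores.topk(self.top_k, dim=-1)
        routed_w = torch.zeros_like(scores).scatter(
            -1, topk_idx, topk_val.softmax(dim=-1))                  # inactive heads get weight 0
        routed_w = alpha[..., 1:2] * routed_w

        # Weighted summation of head outputs instead of plain concatenation/summation.
        weights = torch.cat([shared_w, routed_w], dim=-1)            # (B, N, H)
        out = (heads * weights.unsqueeze(-1)).reshape(B, N, C)
        return self.proj(out)
```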
Key Insights from this Paper 💡:
• MoH enhances inference efficiency without compromising accuracy
• Weighted summation provides more flexibility in combining head outputs
• MoH does not increase total parameters compared to standard multi-head attention
• Pre-trained multi-head attention models (e.g., LLaMA3-8B) can be continue-tuned into MoH models
Results 📊:
• MoH matches or outperforms standard multi-head attention while activating only 50-90% of the attention heads
• MoH-ViT-B: 84.9% top-1 accuracy on ImageNet using 75% of heads
• MoH-DiT-XL/2: FID of 2.94, outperforming DiT-XL/2 (FID 3.22)
• MoH-LLaMA3-8B: 64.0% average accuracy across 14 benchmarks, 2.4% improvement over LLaMA3-8B
🧠 How MoH differs from standard multi-head attention
MoH consists of multiple attention heads and a router that activates only the Top-K heads for each token, unlike standard multi-head attention where all heads are always active. MoH uses weighted summation of head outputs instead of simple summation.
It also incorporates shared heads that are always activated to capture common knowledge.
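
To see what weighted summation changes relative to the standard form, here is a toy per-token comparison. The numbers and the split into 2 shared plus Top-2 routed heads are made up for illustration; it is not tied to the paper's code.

```python
import torch

# Standard MHA (per token): output = sum_i head_i @ W_o_i        (all heads, implicit weight 1)
# MoH (per token):          output = sum_i g_i * (head_i @ W_o_i)
# where g_i = 0 for routed heads that were not activated, so only the shared
# heads and the Top-K routed heads contribute.

num_heads, head_dim, dim = 8, 8, 64
heads = [torch.randn(head_dim) for _ in range(num_heads)]      # per-head outputs for one token
W_o = [torch.randn(head_dim, dim) for _ in range(num_heads)]   # per-head slices of the output projection

standard = sum(h @ W for h, W in zip(heads, W_o))

g = torch.tensor([0.3, 0.3, 0.0, 0.15, 0.0, 0.0, 0.25, 0.0])   # e.g. 2 shared + Top-2 routed heads active
moh = sum(gi * (h @ W) for gi, h, W in zip(g, heads, W_o))
```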