Mixture of Parrots: Experts improve memorization more than reasoning
Very interesting revelations in this paper. 💡
Mixture-of-Experts (MoE) architectures trade reasoning power for memory efficiency in LLMs
Adding more experts doesn't make LLMs better at reasoning, just better at memorizing
🤔 Original Problem:
The Mixture-of-Experts (MoE) architecture lets LLMs scale their parameter count with minimal computational cost. But the exact performance tradeoffs between MoEs and standard dense transformers are not well understood.
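To make that tradeoff concrete, here is a minimal sketch (PyTorch, not from the paper) of a standard top-k routed MoE layer: total parameters grow with the number of experts, while the parameters activated per token depend only on the router and the top-k selected experts. The dimensions, top-2 routing, and parameter-count comparison are illustrative assumptions, not the paper's training configuration.

```python
# Minimal sketch of a top-k routed MoE layer (illustrative, not the paper's code).
# Total parameters scale with n_experts; active parameters per token scale with top_k only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, n_experts: int, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)   # scores each expert per token
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (tokens, d_model)
        scores = self.router(x)                            # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)     # route each token to its top_k experts
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k:k+1] * expert(x[mask])
        return out

# Assumed toy sizes: the point is the gap between total and active parameter counts.
layer = MoELayer(d_model=512, d_hidden=2048, n_experts=8, top_k=2)
total = sum(p.numel() for p in layer.parameters())
active = sum(p.numel() for p in layer.router.parameters()) + \
         layer.top_k * sum(p.numel() for p in layer.experts[0].parameters())
print(f"total params: {total:,}  active per token: {active:,}")
```

Running the count shows why MoEs are attractive: the layer holds roughly 8 experts' worth of parameters (capacity for memorization) while each token only pays the compute of 2 of them.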
🛠️ Solution in this Paper:
• Analyzed the theoretical limits of MoEs on reasoning tasks using graph problems
• Proved that certain graph problems cannot be solved by any number of experts of a fixed width
• Showed the same tasks are easily solved by slightly wider dense models
• Used communication-complexity lower bounds to prove that a single-layer MoE needs a critical hidden dimension
• Pre-trained a series of MoEs and dense transformers on 65B tokens
• Evaluated them on math and natural language benchmarks
💡 Key Insights:
• MoEs excel at memorization but struggle with reasoning
• Increasing the number of experts helps world-knowledge tasks but not reasoning tasks
• MoEs match dense model performance with fewer active parameters on memorization
• MoEs are not a "free lunch" - benefits depend heavily on task type
• Architectural choices should be guided by specific task requirements
📊 Results:
• On world knowledge tasks: MoEs matched dense performance with fewer active parameters
• On commonsense reasoning: MoEs performed worse than dense models at equal total parameters
• On mathematical reasoning: similar limitations as on commonsense tasks
• Memory efficiency: an MoE with 42M active parameters outperformed a dense model with 10x the parameters
🧪 Theoretical evidence of the limitations of MoEs in reasoning
The researchers proved that certain graph problems cannot be solved by any number of experts of a specific width, while these same tasks can be easily solved by slightly wider dense models.
They used communication-complexity lower bounds to show that a single-layer MoE requires a critical hidden dimension to solve even simple graph connectivity problems.
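For intuition, here is a sketch of how a graph-connectivity probe of this kind might be posed to a model as text. The prompt format, graph sizes, and labeling below are illustrative assumptions, not the paper's exact setup.

```python
# Illustrative graph-connectivity probe (assumed format, not the paper's exact task):
# the model sees an edge list and must decide whether two nodes are connected.
import random

def make_connectivity_example(n_nodes: int = 8, n_edges: int = 10, seed: int = 0):
    rng = random.Random(seed)
    edges = set()
    while len(edges) < n_edges:
        u, v = rng.sample(range(n_nodes), 2)
        edges.add((min(u, v), max(u, v)))
    src, dst = rng.sample(range(n_nodes), 2)

    # Ground-truth label via graph search over the adjacency lists.
    adj = {i: [] for i in range(n_nodes)}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    seen, frontier = {src}, [src]
    while frontier:
        node = frontier.pop()
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    label = dst in seen

    prompt = (
        "Edges: " + ", ".join(f"{u}-{v}" for u, v in sorted(edges))
        + f". Is node {src} connected to node {dst}? Answer yes or no."
    )
    return prompt, label

prompt, label = make_connectivity_example()
print(prompt, "->", "yes" if label else "no")
```

Answering correctly requires combining information across the whole edge list rather than recalling a memorized fact, which is exactly the kind of task where the lower bounds bite.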
🎯 Implications
MoEs are not a "free lunch" solution - their benefits depend heavily on the task type.
They are highly effective for knowledge-intensive tasks but may not be the best choice for reasoning-intensive applications.
Architectural choices should therefore be guided by the specific requirements of the target task.