This paper introduces MoE-CAP, a benchmark for evaluating Mixture-of-Experts (MoE) LLM serving systems along the three-way trade-off between Cost, Accuracy, and Performance (CAP).
https://arxiv.org/abs/2412.07067
🤔 Original Problem:
Existing benchmarks fail to accurately assess MoE systems, which rely on heterogeneous resources and have complex trade-offs between cost, accuracy, and performance.
-----
🔍 Solution in this Paper:
→ MoE-CAP proposes a novel benchmarking method to understand the trade-offs in MoE systems.
→ It introduces sparsity-aware performance metrics, Sparse Memory Bandwidth Utilization (S-MBU) and Sparse Model FLOPS Utilization (S-MFU), which account only for the experts activated per token (a sketch follows this list).
→ The benchmark includes comprehensive cost models accounting for heterogeneous compute and memory resources.
→ It evaluates accuracy on diverse tasks using datasets like MMLU, GSM8k, and Arena-Hard.
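A minimal Python sketch of the intuition behind S-MBU, assuming a top-k routed MoE during token-by-token decoding: only the activated experts' weights are charged as memory traffic, whereas a dense MBU charges every expert. The parameter split, bandwidth, and throughput numbers below are illustrative assumptions, not values or formulas from the paper.

```python
# Sketch: sparse vs. dense memory-bandwidth utilization (illustrative assumptions only).
def weight_bytes_per_token(non_expert_params, params_per_expert, num_experts,
                           top_k, bytes_per_param=2, sparse=True):
    """Approximate weight bytes read from memory to decode one token."""
    active = top_k if sparse else num_experts   # S-MBU charges only the routed experts
    return (non_expert_params + active * params_per_expert) * bytes_per_param

def mbu(non_expert_params, params_per_expert, num_experts, top_k,
        tokens_per_s, peak_bw_gb_s, sparse=True):
    """(Sparse) Memory Bandwidth Utilization = achieved weight traffic / peak bandwidth."""
    traffic = weight_bytes_per_token(non_expert_params, params_per_expert,
                                     num_experts, top_k, sparse=sparse) * tokens_per_s
    return traffic / (peak_bw_gb_s * 1e9)

# Mixtral-8x7B-like setup: 8 experts, top-2 routing, fp16 weights,
# 50 tokens/s decode on a GPU with 2,000 GB/s peak bandwidth (assumed numbers).
non_expert, per_expert = 1.7e9, 5.6e9
print(f"S-MBU:     {mbu(non_expert, per_expert, 8, 2, 50, 2000):.1%}")
print(f"Dense MBU: {mbu(non_expert, per_expert, 8, 2, 50, 2000, sparse=False):.1%}")
```

Under these assumptions the dense estimate exceeds 100% utilization, an impossible figure that signals the overestimation the paper highlights, while the sparse estimate stays in a plausible range.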
-----
💡 Key Insights from this Paper:
→ MoE serving systems can typically deliver only two of the three goals at once: low cost, high accuracy, and high performance.
→ Existing metrics (MBU, MFU) assume every expert is activated, so they overestimate the memory and compute usage of MoE systems (see the worked example after this list).
→ Heterogeneous resources significantly impact the overall cost and performance of MoE systems.
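A rough worked example of the overestimation, based on Mixtral-8x7B's public configuration rather than numbers from the paper: the model holds roughly 47B parameters across 8 experts per layer but routes each token through only 2, so about 13B parameters are actually read per token. A dense metric that charges all experts assumes ~47B, overstating per-token memory traffic by about 3.6x, i.e. roughly 260%.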
-----
📊 Results:
→ S-MBU accounts only for activated experts and matches actual profiled memory usage within 1%.
→ Traditional MBU overestimates memory usage by more than 260%.
→ The benchmark confirms the CAP trade-off, showing that achieving low cost, high accuracy, and high performance simultaneously is infeasible.