UltraMem, proposed in this paper, makes sparse LLMs run up to 6x faster at inference by rethinking how they access memory.
Smarter memory access patterns unlock large inference speedups in big sparse models.
UltraMem introduces a novel memory-based architecture that addresses the inference speed bottleneck of sparse LLMs. It achieves up to 6x faster inference than Mixture of Experts (MoE) at comparable model quality, using ultra-sparse memory layers and efficient memory access patterns.
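To make the core idea concrete, here is a minimal PyTorch sketch of a sparse memory-layer lookup. The names, shapes, and the dense scoring over all keys are illustrative assumptions, not the paper's implementation (UltraMem factorizes retrieval so it never scores every slot directly): each token's query selects only the top-k slots of a large value table, so compute and memory traffic stay small even when the table is huge.

```python
import torch
import torch.nn.functional as F


class SparseMemoryLayer(torch.nn.Module):
    """Toy sparse memory layer: score the slots, fetch only the top-k values."""

    def __init__(self, d_model: int, num_values: int, k: int = 32):
        super().__init__()
        self.k = k
        self.query_proj = torch.nn.Linear(d_model, d_model)
        self.keys = torch.nn.Parameter(torch.randn(num_values, d_model) * 0.02)
        self.values = torch.nn.Embedding(num_values, d_model)  # the large value table

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q = self.query_proj(x)                      # (batch, seq, d_model)
        scores = q @ self.keys.T                    # (batch, seq, num_values)
        top_scores, top_idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(top_scores, dim=-1)     # (batch, seq, k)
        picked = self.values(top_idx)               # only k values fetched per token
        return (weights.unsqueeze(-1) * picked).sum(dim=-2)


x = torch.randn(2, 16, 256)
print(SparseMemoryLayer(d_model=256, num_values=100_000)(x).shape)  # (2, 16, 256)
```

In UltraMem the scoring itself is also factorized (see the TDQKR sketch later in this post), so the full key table never has to be scored densely the way this toy version does.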
-----
https://arxiv.org/abs/2411.12364
🤔 Original Problem:
MoE models slow down sharply at inference because of high memory access costs, even though they activate only a fraction of their parameters per token. An MoE model with 12x the parameters of a dense model can run 2-6x slower at inference, depending on batch size.
-----
🔧 Solution in this Paper:
→ UltraMem distributes many small memory layers across the transformer blocks instead of using one large memory layer
→ It introduces Tucker Decomposed Query-Key Retrieval (TDQKR) to improve retrieval efficiency and reduce computational complexity (see the sketch after this list)
→ Implicit Value Expansion (IVE) enlarges the effective memory size while reducing the number of actual memory accesses
→ Multi-Core Scoring improves quality by assigning multiple scores to each memory value
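A rough sketch of the Tucker-style scoring idea behind TDQKR. The rank, grid size, and the fully materialized score grid here are simplifying assumptions for clarity (the paper approximates the top-k without building the whole grid): row scores and column scores are mixed through a small core matrix to score an n_rows x n_cols grid of memory slots.

```python
import torch

d, rank, n_rows, n_cols, k = 256, 2, 128, 128, 32

q_row = torch.randn(rank, d)       # rank row-query components for one token
q_col = torch.randn(rank, d)       # rank column-query components
row_keys = torch.randn(n_rows, d)
col_keys = torch.randn(n_cols, d)
core = torch.randn(rank, rank)     # small "Tucker core" mixing the components

s_row = q_row @ row_keys.T         # (rank, n_rows) row scores
s_col = q_col @ col_keys.T         # (rank, n_cols) column scores

# Grid score S[i, j] = sum_{p, q} core[p, q] * s_row[p, i] * s_col[q, j]
grid = torch.einsum("pi,pq,qj->ij", s_row, core, s_col)   # (n_rows, n_cols)

top_scores, flat_idx = grid.flatten().topk(k)             # top-k memory slots
rows = torch.div(flat_idx, n_cols, rounding_mode="floor")
cols = flat_idx % n_cols
print(rows[:5].tolist(), cols[:5].tolist())
```

The payoff of the decomposition is that a grid of n_rows * n_cols candidate slots is scored from only rank * (n_rows + n_cols) dot products plus a tiny core matrix, instead of one dot product per slot.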
-----
💡 Key Insights:
→ Memory access volume, more than parameter count, determines inference speed
→ Many smaller memory layers distributed across blocks outperform a single large memory layer (sketched after this list)
→ Tucker decomposition cuts retrieval computation without hurting model quality
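A structural sketch of the "many small memory layers" idea. The module layout and the every-other-block placement are assumptions for illustration, and a plain Linear stands in for the sparse memory layer sketched earlier: the memory acts as an extra residual branch inside selected transformer blocks rather than one giant layer in the middle of the network.

```python
import torch


class Block(torch.nn.Module):
    """Transformer block with an optional small memory branch."""

    def __init__(self, d_model: int, memory=None):
        super().__init__()
        self.attn = torch.nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.ffn = torch.nn.Sequential(
            torch.nn.Linear(d_model, 4 * d_model),
            torch.nn.GELU(),
            torch.nn.Linear(4 * d_model, d_model),
        )
        self.memory = memory  # None, or a small sparse memory layer

    def forward(self, x):
        x = x + self.attn(x, x, x, need_weights=False)[0]
        x = x + self.ffn(x)
        if self.memory is not None:
            x = x + self.memory(x)  # memory output added as a residual branch
        return x


d_model, n_blocks = 256, 8
# A small memory in every second block, instead of one huge central layer.
# (torch.nn.Linear is only a placeholder for the sparse memory sketched above.)
blocks = torch.nn.ModuleList(
    Block(d_model, torch.nn.Linear(d_model, d_model) if i % 2 else None)
    for i in range(n_blocks)
)

x = torch.randn(2, 16, d_model)
for blk in blocks:
    x = blk(x)
print(x.shape)  # torch.Size([2, 16, 256])
```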
-----
📊 Results:
→ 6x faster inference compared to MoE at batch size 64
→ Matches 6.5B dense model performance with only 1.6B parameters
→ Maintains consistent memory access as parameters scale up