MRM (Managed-Retention Memory) rethinks memory for the AI era, boosting efficiency by prioritizing reads over long-term retention.
This paper proposes Managed-Retention Memory (MRM), a new memory class for AI inference optimized for key LLM data structures. MRM trades long-term data retention and write performance for improved read throughput, energy efficiency, and density, addressing the limitations of High Bandwidth Memory (HBM).
-----
https://www.arxiv.org/abs/2501.09605
Original Problem 🤔:
→ High Bandwidth Memory (HBM) dominates AI accelerators but is suboptimal for Large Language Model inference.
→ HBM over-provisions write performance and under-provisions density and read bandwidth.
→ HBM also has high energy-per-bit overhead and high cost.
-----
Solution in this Paper 💡:
→ Managed-Retention Memory (MRM) is proposed as a new memory class.
→ MRM targets key LLM data structures by relaxing long-term retention (to days or hours) and write performance; a rough sizing sketch follows this list.
→ In return, MRM aims for improved read throughput, energy efficiency, and density compared to DRAM and HBM.
→ MRM could leverage technologies like RRAM, MRAM, and PCM. These technologies offer comparable or better read performance and energy efficiency than DRAM, with potential for higher density and lower cost.
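To make the trade-off concrete, here is a hedged back-of-the-envelope sketch (my own illustration, not a result from the paper). It assumes a hypothetical 70B-parameter FP16 model and an arbitrary decode rate, and compares daily read traffic against the write traffic needed to load the weights plus refresh them once per day under a roughly one-day retention window.

```python
# Illustrative back-of-the-envelope estimate (assumptions, not measurements)
# of why relaxed retention and modest write performance can suffice for
# LLM weight storage.

weight_bytes = 70e9 * 2        # assumed 70B-parameter model at FP16 (~140 GB)
tokens_per_second = 50         # assumed sustained decode rate for one replica
seconds_per_day = 24 * 3600

# Decode reads (roughly) every weight once per generated token.
read_bytes_per_day = weight_bytes * tokens_per_second * seconds_per_day

# With retention relaxed to ~1 day, the weights need about one full
# rewrite (refresh) per day; otherwise they are written only at load time.
write_bytes_per_day = weight_bytes

print(f"reads per day:  {read_bytes_per_day / 1e15:.0f} PB")
print(f"writes per day: {write_bytes_per_day / 1e9:.0f} GB")
print(f"read:write ratio ~ {read_bytes_per_day / write_bytes_per_day:,.0f} : 1")
```

Under these assumptions, reads outnumber writes by several million to one, which is why giving up write bandwidth and long-term retention costs little for this workload.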
-----
Key Insights from this Paper 💎:
→ LLM inference has distinct memory access patterns (sequential, read-dominated), unlike general compute workloads; a simplified decode-loop sketch follows this list.
→ Current "non-volatile" memory technologies (like Flash) are unsuitable for LLM inference because of their endurance and performance limitations.
→ Relaxing the retention time requirement of these technologies could make them better suited for LLM inference.
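The read-dominated pattern behind the first insight can be seen in a toy decode loop. The sketch below is an assumed, heavily simplified stand-in (not the paper's code and not real attention math): per generated token, every weight matrix is streamed in full and read-only, the per-layer KV cache is re-read in full, and the only write is a small per-token append.

```python
import numpy as np

# Simplified decode-loop sketch of LLM memory traffic (illustrative only):
# weights are streamed sequentially and never written; the KV cache grows by
# one small vector per token but is re-read in full on every later token.

d_model, n_layers = 1024, 8                    # illustrative (small) dimensions
layers = [np.random.randn(d_model, d_model).astype(np.float16)
          for _ in range(n_layers)]            # stand-ins for weight matrices
kv_cache = [[] for _ in range(n_layers)]       # per-layer cache of past activations

def decode_step(x):
    for i, w in enumerate(layers):
        x = x @ w                              # READ: full weight matrix, sequential
        past = (np.stack(kv_cache[i])
                if kv_cache[i] else np.empty((0, d_model)))
        _ = past @ x                           # READ: the entire KV cache so far
        kv_cache[i].append(x)                  # WRITE: one small per-token append
    return x

x = np.random.randn(d_model).astype(np.float16)
for _ in range(8):                             # generate a handful of tokens
    x = decode_step(x)
```

Under this pattern the weight arrays are never modified after load, which is exactly the kind of data MRM proposes to hold in a read-optimized, retention-relaxed medium.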