
"Managed-Retention Memory: A New Class of Memory for the AI Era"

A podcast on this paper was generated with Google's Illuminate.

MRM (Managed-Retention Memory) rethinks memory for the AI era, boosting efficiency by prioritizing reads over long-term retention.

This paper proposes Managed-Retention Memory (MRM) for AI inference, a new memory class optimized for key large language model (LLM) data structures. MRM trades long-term data retention and write performance for improved read throughput, energy efficiency, and density, addressing the limitations of High Bandwidth Memory.

-----

https://www.arxiv.org/abs/2501.09605

Original Problem 🤔:

→ High Bandwidth Memory (HBM) dominates AI accelerators but is suboptimal for Large Language Model inference.

→ HBM over-provisions write performance and under-provisions density and read bandwidth (see the rough estimate after this list).

→ HBM also has high energy-per-bit overhead and high cost.
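To see why read bandwidth is the binding constraint, the sketch below estimates the decode-phase token rate implied by memory bandwidth alone. It is a back-of-envelope illustration, not an analysis from the paper; the model size and bandwidth figures are assumed example values.

```python
# Back-of-envelope: LLM decode is typically memory-bandwidth bound, because
# generating each token requires streaming (roughly) all model weights once.
# All parameter values below are illustrative assumptions, not paper figures.

def decode_tokens_per_second(model_params_billion: float,
                             bytes_per_param: float,
                             read_bandwidth_gb_s: float) -> float:
    """Upper bound on tokens/s when each token reads all weights once."""
    bytes_per_token = model_params_billion * 1e9 * bytes_per_param
    return read_bandwidth_gb_s * 1e9 / bytes_per_token

# Example: a 70B-parameter model with 8-bit weights on ~3 TB/s of HBM bandwidth.
print(decode_tokens_per_second(70, 1.0, 3000))   # ≈ 43 tokens/s per sequence
```

Under these assumptions, adding write performance does nothing for decode speed; only read bandwidth (and capacity for larger models) moves the bound.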

-----

Solution in this Paper 💡:

→ Managed-Retention Memory (MRM) is proposed as a new memory class.

→ MRM targets key LLM data structures by relaxing long-term retention (to days or hours) and write-performance requirements (a rough refresh-cost sketch follows this list).

→ In return, MRM aims for improved read throughput, energy efficiency, and density compared to DRAM and HBM.

→ MRM could leverage technologies like RRAM, MRAM, and PCM. These technologies can match or exceed DRAM in read performance and energy efficiency, with potential for higher density and lower cost.
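A minimal sketch of why hours-to-days retention is sufficient: inference weights change only when a new model or checkpoint is deployed, so data merely needs a periodic refresh within the retention window. The capacity and retention values below are assumptions for illustration, not figures from the paper.

```python
# Rough refresh-cost estimate for a managed-retention device holding weights.
# All values are illustrative assumptions, not figures from the paper.

def refresh_writes_per_day(retention_hours: float) -> float:
    """Full-capacity rewrites per day needed to keep data alive."""
    return 24.0 / retention_hours

def refresh_bandwidth_gb_s(capacity_gb: float, retention_hours: float) -> float:
    """Average write bandwidth consumed by background refresh."""
    return capacity_gb / (retention_hours * 3600.0)

# Example: 192 GB of weights on a device with a 24-hour retention guarantee.
print(refresh_writes_per_day(24))            # 1 full rewrite per day
print(refresh_bandwidth_gb_s(192, 24))       # ≈ 0.002 GB/s of background writes
```

Even with retention measured in hours rather than years, the write traffic needed to keep weights resident is negligible next to the read traffic of serving tokens.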

-----

Key Insights from this Paper 💎:

→ LLM inference has distinct memory access patterns (sequential, read-dominated), unlike general compute workloads (a rough read/write estimate follows this list).

→ Current "non-volatile" memory technologies (like Flash) are unsuitable for LLM inference because of their endurance and performance limitations.

→ Relaxing the retention time requirement of these technologies could make them better suited for LLM inference.
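To put "read-dominated" in perspective, the sketch below estimates bytes read versus bytes written for one decode step: the accelerator streams all weights and the accumulated KV cache, but writes only one new KV entry. The sizes used are illustrative assumptions, not numbers from the paper.

```python
# Rough estimate of the read/write ratio for one LLM decode step.
# All sizes below are illustrative assumptions, not paper figures.

def read_write_ratio_per_token(weight_bytes: float,
                               kv_cache_bytes: float,
                               kv_entry_bytes: float) -> float:
    """Bytes read divided by bytes written for a single generated token."""
    reads = weight_bytes + kv_cache_bytes   # stream weights + attend over KV cache
    writes = kv_entry_bytes                 # append one token's K/V vectors
    return reads / writes

# Example: 70 GB of weights, 10 GB of accumulated KV cache,
# and ~1 MB of new KV data written per generated token.
print(read_write_ratio_per_token(70e9, 10e9, 1e6))   # ≈ 80,000:1 reads to writes
```

A workload this skewed toward reads is exactly where limited write endurance and slower writes, the usual weaknesses of RRAM, MRAM, and PCM, matter least.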
