"Pie: Pooling CPU Memory for LLM Inference"

The podcast on this paper is generated with Google's Illuminate.

Pie lets your LLM run faster by making CPU memory work as smoothly as GPU memory, with smart juggling of data between CPU and GPU

→ Delivers 1.9x higher throughput and 2x lower latency than vLLM, and cuts GPU memory usage by 1.67x while maintaining performance

Pie is a framework that enables LLM inference to use CPU memory without a performance penalty. It achieves this through performance-transparent swapping and adaptive expansion, exploiting predictable memory access patterns and high-bandwidth CPU-GPU interconnects to sustain performance while reducing GPU memory requirements.

-----

https://arxiv.org/abs/2411.09317

🎯 Original Problem:

LLM inference requires massive GPU memory, much of it for the KV cache. When GPU memory is insufficient, existing approaches that fall back to CPU memory stall on CPU-GPU transfers, causing higher latency and lower throughput.
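
To make the memory pressure concrete, here is a back-of-envelope KV cache sizing; the model dimensions below are illustrative assumptions, not figures from the paper.

```python
# Rough KV cache sizing. All model dimensions are illustrative assumptions.
num_layers   = 80      # hypothetical decoder layers
num_kv_heads = 8       # hypothetical grouped-query KV heads
head_dim     = 128
seq_len      = 8192    # context tokens per request
batch_size   = 32      # concurrent requests
bytes_per_el = 2       # fp16 / bf16

# Factor of 2 for keys and values.
kv_bytes = (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_el)
print(f"KV cache alone: {kv_bytes / 2**30:.0f} GiB")  # 80 GiB for these numbers
```

Even before counting model weights, a workload like this outgrows a single 80 GB GPU, which is why spilling to CPU memory is tempting despite its cost.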

-----

🔧 Solution in this Paper:

→ Pie manages the KV cache layer by layer, using a FIFO queue to order swaps between GPU and CPU memory.

→ Performance-transparent swapping prefetches the data needed by upcoming layers before it is required, so transfers overlap with ongoing computation (see the sketch after this list).

→ The system maintains a mapping table, similar to an OS page table, that tracks where each piece of the KV cache currently resides.

→ Adaptive expansion dynamically adjusts CPU memory allocation based on real-time conditions.

→ It monitors system conditions like interconnect saturation and computation latency to optimize memory usage.
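
Below is a minimal sketch of how layer-granularity, performance-transparent swapping could be wired up, assuming PyTorch with CUDA streams; the class names, block layout, and driver loop are hypothetical illustrations, not the paper's implementation.

```python
import collections

import torch


class KVMapTable:
    """Page-table-like map from (layer, block_id) to the tensor holding that
    KV block and whether it currently lives on the GPU or in CPU memory."""

    def __init__(self):
        self.loc = {}    # (layer, block_id) -> "gpu" | "cpu"
        self.data = {}   # (layer, block_id) -> torch.Tensor


class LayerSwapper:
    """Prefetches the next layer's CPU-resident KV blocks on a side stream so
    copies overlap with the current layer's compute, and queues finished
    layers for write-back to pinned CPU memory in FIFO order."""

    def __init__(self, table: KVMapTable):
        self.table = table
        self.copy_stream = torch.cuda.Stream()
        self.evict_fifo = collections.deque()

    def prefetch(self, layer, block_ids):
        with torch.cuda.stream(self.copy_stream):
            for b in block_ids:
                key = (layer, b)
                if self.table.loc.get(key) == "cpu":
                    # Async host-to-device copy; the CPU tensor must be pinned.
                    self.table.data[key] = self.table.data[key].to(
                        "cuda", non_blocking=True)
                    self.table.loc[key] = "gpu"

    def evict(self, layer, block_ids):
        with torch.cuda.stream(self.copy_stream):
            for b in block_ids:
                key = (layer, b)
                if self.table.loc.get(key) == "gpu":
                    src = self.table.data[key]
                    dst = torch.empty(src.shape, dtype=src.dtype,
                                      device="cpu", pin_memory=True)
                    dst.copy_(src, non_blocking=True)   # async device-to-host copy
                    self.table.data[key] = dst
                    self.table.loc[key] = "cpu"
                    self.evict_fifo.append(key)

    def sync(self):
        # Make the compute stream wait for outstanding copies on the side stream.
        torch.cuda.current_stream().wait_stream(self.copy_stream)


# Hypothetical driver loop for one forward pass:
#   for i, layer in enumerate(layers):
#       swapper.prefetch(i + 1, needed_blocks[i + 1])   # overlaps with compute
#       hidden = layer(hidden)                          # attends over layer i's KV
#       swapper.evict(i - 1, needed_blocks[i - 1])      # free GPU space behind us
#       swapper.sync()                                  # layer i+1's blocks are ready
```

A production system would additionally use CUDA events to guarantee that a prefetched block is resident before attention reads it and that an evicted block is not reused while its copy is still in flight; the sketch only shows the overlap structure.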

-----

💡 Key Insights:

→ A high-bandwidth CPU-GPU interconnect (900 GB/s) makes swapping cheap enough to hide behind computation

→ Predictable memory access patterns at layer granularity allow 100% prefetching efficiency

→ Dynamic memory allocation outperforms a fixed CPU-GPU partition of the cache (a toy control loop is sketched after this list)
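
As an illustration of that dynamic allocation, the toy controller below grows the CPU-resident share of the KV cache while swaps stay hidden behind compute and shrinks it when the interconnect saturates; the thresholds and block counts are assumptions for illustration, not the paper's actual policy.

```python
class AdaptiveExpander:
    """Toy adaptive-expansion controller: offload more KV blocks to CPU while
    swapping stays fully hidden behind compute, and pull blocks back to the
    GPU when the CPU-GPU link or swap latency becomes the bottleneck.
    All thresholds are illustrative assumptions."""

    def __init__(self, cpu_blocks=0, step=256, max_cpu_blocks=65_536):
        self.cpu_blocks = cpu_blocks          # KV blocks currently offloaded to CPU
        self.step = step                      # blocks added/removed per decision
        self.max_cpu_blocks = max_cpu_blocks  # cap set by available CPU memory

    def update(self, transfer_ms, compute_ms, link_util):
        """Call periodically with measured per-layer swap time, per-layer
        compute time, and CPU-GPU interconnect utilization (0.0-1.0)."""
        hidden = transfer_ms < 0.9 * compute_ms           # swap fully overlapped?
        if hidden and link_util < 0.8:
            # Swapping is effectively free: expand the CPU share, freeing GPU memory.
            self.cpu_blocks = min(self.cpu_blocks + self.step, self.max_cpu_blocks)
        elif not hidden or link_util > 0.95:
            # Link saturated or swaps delaying compute: shrink the CPU share.
            self.cpu_blocks = max(self.cpu_blocks - self.step, 0)
        return self.cpu_blocks


expander = AdaptiveExpander()
expander.update(transfer_ms=1.1, compute_ms=1.8, link_util=0.55)  # grows CPU share
expander.update(transfer_ms=2.4, compute_ms=1.8, link_util=0.97)  # shrinks it back
```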

-----

📊 Results:

→ 1.9x higher throughput and 2x lower latency than vLLM

→ Reduces GPU memory usage by 1.67x while maintaining performance

→ Achieves 60x lower latency and 9.4x higher throughput vs FlexGen
