Pie lets your LLM run faster by making CPU memory work as smoothly as GPU memory, smartly juggling data between CPU and GPU
→ Delivers 1.9x higher throughput and 2x lower latency than vLLM, and cuts GPU memory usage by 1.67x while maintaining performance
Pie introduces a framework that lets LLM inference use CPU memory without a performance penalty. It combines performance-transparent swapping with adaptive expansion, exploiting predictable memory-access patterns and high-bandwidth hardware to keep performance intact while reducing GPU memory requirements.
-----
https://arxiv.org/abs/2411.09317
🎯 Original Problem:
LLM inference requires massive amounts of GPU memory. When GPU memory runs out, existing approaches that fall back on CPU memory suffer higher latency and lower throughput.
-----
🔧 Solution in this Paper:
→ Pie manages the KV cache at layer granularity, using a FIFO queue to decide which layers' cache blocks get swapped between GPU and CPU.
→ Performance-transparent swapping prefetches the data needed by upcoming layers before it is required, so computation never stalls (a minimal sketch follows this list).
→ The system maintains a mapping table, similar to a page table, that tracks whether each cache block currently resides in GPU or CPU memory.
→ Adaptive expansion dynamically adjusts how much CPU memory is used based on real-time conditions.
→ It monitors signals such as interconnect saturation and computation latency to tune memory allocation on the fly.
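Here is a minimal Python sketch of what layer-granular prefetching with a FIFO queue and a mapping table could look like. The names (SwapManager, KVBlock, prefetch, etc.) are illustrative placeholders, not Pie's actual API, and a real system would issue asynchronous GPU copies on a side stream rather than flip a location flag.

```python
# Minimal sketch of layer-granular KV-cache swapping with prefetching.
# All class/function names are hypothetical, not Pie's real interface.
from collections import deque
from dataclasses import dataclass

@dataclass
class KVBlock:
    layer: int
    location: str  # "gpu" or "cpu"

class SwapManager:
    def __init__(self, num_layers: int, gpu_budget: int):
        self.num_layers = num_layers
        self.gpu_budget = gpu_budget  # max KV blocks resident on GPU at once
        # Mapping table (page-table-like): layer -> where its KV block lives.
        self.table = {l: KVBlock(l, "cpu") for l in range(num_layers)}
        self.gpu_fifo = deque()       # layers currently resident on GPU, FIFO order

    def _copy_to_gpu(self, layer: int) -> None:
        # Placeholder for an async host-to-device copy on a separate stream.
        self.table[layer].location = "gpu"
        self.gpu_fifo.append(layer)

    def _evict_to_cpu(self, layer: int) -> None:
        # Placeholder for an async device-to-host copy.
        self.table[layer].location = "cpu"

    def prefetch(self, upcoming_layer: int) -> None:
        """Bring the next layer's KV block onto the GPU before it is needed."""
        if self.table[upcoming_layer].location == "cpu":
            while len(self.gpu_fifo) >= self.gpu_budget:
                self._evict_to_cpu(self.gpu_fifo.popleft())  # FIFO eviction
            self._copy_to_gpu(upcoming_layer)

def run_forward_pass(mgr: SwapManager) -> None:
    mgr.prefetch(0)
    for layer in range(mgr.num_layers):
        # Overlap: issue the prefetch for layer+1 while `layer` computes.
        if layer + 1 < mgr.num_layers:
            mgr.prefetch(layer + 1)
        assert mgr.table[layer].location == "gpu"  # resident before compute starts
        # ... attention for `layer` would run here ...

if __name__ == "__main__":
    run_forward_pass(SwapManager(num_layers=32, gpu_budget=4))
    print("all layers computed with their KV blocks resident on GPU")
```

The key point the sketch illustrates: because layers are visited in a fixed order, the prefetch for layer N+1 can always be issued while layer N computes, so the swap traffic stays off the critical path.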
-----
💡 Key Insights:
→ The high-bandwidth GPU-CPU interconnect (900 GB/s) makes swapping cheap enough to hide behind computation (see the back-of-envelope check after this list)
→ Memory access at layer granularity is predictable, allowing 100% prefetching efficiency
→ Dynamically growing and shrinking the CPU allocation performs better than a fixed CPU-GPU partition
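A quick back-of-envelope check shows why a 900 GB/s link makes the swap hideable. The per-layer KV slice size and per-layer compute time below are assumed, illustrative numbers, not figures from the paper.

```python
# Illustrative check: can a per-layer KV slice be prefetched over a
# 900 GB/s CPU-GPU link within one layer's compute time?
bandwidth_gb_s = 900      # interconnect bandwidth from the paper's setting
kv_per_layer_gb = 0.5     # assumed per-layer KV slice for a batch (hypothetical)
layer_compute_ms = 2.0    # assumed per-layer compute time (hypothetical)

transfer_ms = kv_per_layer_gb / bandwidth_gb_s * 1000
print(f"transfer: {transfer_ms:.2f} ms vs compute: {layer_compute_ms:.2f} ms")
# ~0.56 ms of transfer fits comfortably inside 2 ms of compute, so the
# prefetch can be fully overlapped and swapping stays performance-transparent.
```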
-----
📊 Results:
→ 1.9x higher throughput and 2x lower latency than vLLM
→ 1.67x lower GPU memory usage while maintaining performance
→ 60x lower latency and 9.4x higher throughput than FlexGen