Smart KV-cache sharing helps BatchLLM process batched LLM tasks up to twice as fast.
BatchLLM optimizes LLM inference for batch processing by efficiently sharing KV cache memory and intelligently scheduling tokens, outperforming existing systems.
https://arxiv.org/abs/2412.03594
Original Problem 🔍:
→ Current LLM inference engines struggle with throughput-oriented batch processing tasks where different prompts share common prefixes.
→ LRU-based cache systems prematurely evict reusable KV contexts, leading to unnecessary recalculations and memory waste.
→ Token batching in existing systems doesn't effectively mix decoding and prefill operations, resulting in suboptimal GPU utilization.
-----
Solution in this Paper 🛠️:
→ BatchLLM introduces global prefix identification that explicitly manages prefix sharing across large batches.
→ It employs dynamic programming to maximize first-level prefix reuse, collapsing multi-level prefixes into a single level.
→ The system reorders requests by their decoding-to-prompt length ratio to improve scheduling (see the scheduling sketch after this list).
→ Memory-centric token batching replaces traditional request-count thresholds to better utilize GPU resources (see the batching sketch below).
→ Horizontal fusion combines the shared-prefix and non-shared parts of the Attention computation into a single kernel (see the attention-merge sketch below).
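A minimal sketch of the grouping-and-reordering step, assuming a hypothetical `Request` type with `prompt_len` and `expected_decode` fields and a caller-supplied `get_prefix` helper; the paper's exact grouping and ordering criteria may differ:

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Request:
    rid: int
    prompt_len: int       # number of prompt tokens
    expected_decode: int  # estimated decode length

def schedule(requests, get_prefix):
    """Group requests by their shared first-level prefix so each prefix's KV
    context is computed once and can be freed as soon as its group finishes,
    then order the groups by average decoding-to-prompt length ratio."""
    groups = defaultdict(list)
    for r in requests:
        groups[get_prefix(r)].append(r)

    def ratio(group):
        return sum(r.expected_decode / max(r.prompt_len, 1) for r in group) / len(group)

    return [r for group in sorted(groups.values(), key=ratio) for r in group]
```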
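A sketch of memory-centric token batching, reusing the hypothetical `Request` type above; the real scheduler budgets KV-cache memory rather than a raw token count:

```python
def build_token_batch(pending, token_budget=8192):
    """Pack one step's batch until a token (KV-memory) budget is reached,
    mixing prefill chunks with single-token decode steps, instead of capping
    the batch at a fixed number of requests."""
    batch, used = [], 0
    for req in pending:
        # a request still holding prompt tokens contributes a prefill chunk;
        # an already-running request contributes one decode token
        cost = req.prompt_len if req.prompt_len > 0 else 1
        if used + cost > token_budget:
            continue  # a smaller request later in the queue may still fit
        batch.append(req)
        used += cost
    return batch
```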
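The horizontal-fusion idea relies on the fact that attention over a shared prefix and over a request's own suffix can be computed separately and merged exactly through their log-sum-exp terms. BatchLLM performs this inside one fused GPU kernel; the NumPy sketch below only illustrates the underlying math, and all names are illustrative:

```python
import numpy as np

def partial_attention(q, K, V):
    """Attention of one query vector against a single KV segment.
    Returns the segment output and its log-sum-exp for exact merging."""
    scores = q @ K.T / np.sqrt(q.shape[-1])   # (n_keys,)
    lse = np.logaddexp.reduce(scores)         # log-sum-exp over the segment
    return np.exp(scores - lse) @ V, lse

def merged_attention(q, K_prefix, V_prefix, K_suffix, V_suffix):
    """Merge attention over a shared prefix and a per-request suffix; the
    result equals attention over the concatenated keys and values."""
    o_p, lse_p = partial_attention(q, K_prefix, V_prefix)
    o_s, lse_s = partial_attention(q, K_suffix, V_suffix)
    m = max(lse_p, lse_s)
    w_p, w_s = np.exp(lse_p - m), np.exp(lse_s - m)
    return (w_p * o_p + w_s * o_s) / (w_p + w_s)
```

In the fused kernel, the shared-prefix part of the computation is handled for the whole group together and merged with each request's non-shared part, which is what reduces the GPU tail effect noted below.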
-----
Key Insights 💡:
→ Global prefix analysis outperforms LRU caching by 10.7% in KV context reuse
→ Memory-centric batching reduces "valleys" in GPU utilization timeline
→ Grouping requests that share a prefix shortens how long their KV context must stay in memory
→ Horizontal kernel fusion minimizes GPU tail effects
-----
Results 📊:
→ Outperforms vLLM by 1.1× to 2.0× on microbenchmarks
→ Achieves 54.9% token saving ratio vs vLLM's 44.2%
→ Shows 1.3× speedup on snippet generation tasks
→ Demonstrates 1.47× improvement on ranking workloads