Smart request scheduling can cut your LLM inference costs by roughly 30%
INFERMAX, proposed in this paper, helps find the right balance between preempting and queuing LLM requests.
30% latency reduction through optimized preemption strategies
Think database query optimization, but for LLM inference scheduling.
https://arxiv.org/abs/2411.07447
🎯 Original Problem:
LLM inference systems lack comprehensive analysis of scheduler performance across configurations. Current deployment requires expensive GPU testing, while development suffers from trial-and-error approaches without known performance limits.
-----
🔧 Solution in this Paper:
→ INFERMAX introduces an analytical framework using inference cost models to compare different schedulers.
→ It formulates optimal scheduling as a constraint satisfaction problem to establish performance upper bounds.
→ The framework implements unified scheduling algorithms handling both waiting and running request queues.
→ It uses linear cost models that predict batch execution time from token processing work and KV cache access (a minimal sketch of such a cost model follows below).
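
A minimal sketch of what such a linear batch-cost model could look like. The coefficients (alpha, beta, gamma), the Request fields, and the predict_batch_time helper are illustrative assumptions, not the paper's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class Request:
    new_tokens: int      # tokens to process this step (prefill chunk, or 1 for decode)
    cached_tokens: int   # tokens already resident in the KV cache

def predict_batch_time(batch: list[Request],
                       alpha: float = 1e-4,   # cost per processed token (assumed)
                       beta: float = 1e-6,    # cost per KV-cache token accessed (assumed)
                       gamma: float = 2e-3    # fixed per-batch overhead (assumed)
                       ) -> float:
    """Predict one batch step's execution time as a linear function of
    token processing work and KV-cache access, in the spirit of the paper's cost model."""
    processed = sum(r.new_tokens for r in batch)
    kv_access = sum(r.cached_tokens for r in batch)
    return gamma + alpha * processed + beta * kv_access

# Example: a prefill-heavy batch vs. a decode-only batch of 32 requests.
prefill_batch = [Request(new_tokens=512, cached_tokens=0)]
decode_batch = [Request(new_tokens=1, cached_tokens=900) for _ in range(32)]
print(predict_batch_time(prefill_batch), predict_batch_time(decode_batch))
```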
-----
💡 Key Insights from this Paper:
→ Preempting requests can reduce GPU costs by 30% compared to avoiding preemptions
→ This is most effective with short requests and high memory utilization
→ Preempting long requests degrades performance by 30% due to recomputation overhead
→ Cost-based scheduling, in the spirit of database query optimization, proves effective (see the toy preemption rule sketched after this list)
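
A hedged sketch of a cost-based preemption rule reflecting these insights: preempt under memory pressure, prefer short requests, and avoid preempting long ones whose KV cache is expensive to recompute. The threshold, the recomputation estimate, and all names here are assumptions for illustration, not the paper's exact policy:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RunningRequest:
    req_id: str
    cached_tokens: int   # KV-cache tokens that would need recomputation if preempted

def recomputation_cost(req: RunningRequest, alpha: float = 1e-4) -> float:
    # Assumed linear cost to rebuild the KV cache after a preemption.
    return alpha * req.cached_tokens

def choose_victim(running: list[RunningRequest],
                  memory_utilization: float,
                  pressure_threshold: float = 0.9) -> Optional[RunningRequest]:
    """Return a request to preempt, or None. Under high memory utilization,
    pick the request with the lowest recomputation cost (i.e., a short request
    with a small KV cache), mirroring the finding that preempting short
    requests helps while preempting long ones hurts."""
    if memory_utilization < pressure_threshold or not running:
        return None
    return min(running, key=recomputation_cost)

# Usage: at 95% memory utilization, the short request is chosen for preemption.
running = [RunningRequest("long", cached_tokens=4000),
           RunningRequest("short", cached_tokens=200)]
victim = choose_victim(running, memory_utilization=0.95)
print(victim.req_id if victim else "no preemption")
```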
-----
📊 Results:
→ Comprehensive analysis equivalent to 200 GPU hours of testing
→ 30% latency reduction through optimized preemption strategies
→ Predictions within 9% relative error of actual GPU runs