
"The Effect of Scheduling and Preemption on the Efficiency of LLM Inference Serving"

The podcast on this paper is generated with Google's Illuminate.

Smart request scheduling can cut your LLM inference costs by a third

INFERMAX, proposed in this paper, helps find the perfect balance between preempting and queuing LLM requests.

30% latency reduction through optimized preemption strategies

Think database query optimization, but for LLM inference scheduling.

https://arxiv.org/abs/2411.07447

🎯 Original Problem:

LLM inference systems lack a comprehensive analysis of how schedulers perform across configurations. Evaluating a new configuration in deployment requires expensive GPU testing, while scheduler development proceeds by trial and error with no known upper bound on achievable performance.

-----

🔧 Solution in this Paper:

→ INFERMAX introduces an analytical framework using inference cost models to compare different schedulers.

→ It formulates optimal scheduling as a constraint satisfaction problem to establish performance upper bounds.

→ The framework implements unified scheduling algorithms handling both waiting and running request queues.

→ It uses linear cost models to predict batch execution times from token processing and KV cache access (see the sketch after this list).
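
As a rough illustration of that last point, here is a minimal Python sketch of a linear batch cost model. The coefficient names and values are illustrative assumptions, not the paper's fitted parameters.

```python
from dataclasses import dataclass

@dataclass
class BatchStats:
    prefill_tokens: int   # new prompt tokens processed in this step
    decode_tokens: int    # one new token per running decode request
    kv_cache_tokens: int  # cached tokens the batch must read back

def predict_batch_time(stats: BatchStats,
                       c_prefill: float = 1e-4,
                       c_decode: float = 2e-4,
                       c_kv: float = 1e-6,
                       c_fixed: float = 5e-3) -> float:
    """Linear estimate of one batch step's execution time (seconds)
    from token processing and KV-cache access."""
    return (c_prefill * stats.prefill_tokens
            + c_decode * stats.decode_tokens
            + c_kv * stats.kv_cache_tokens
            + c_fixed)

# Example: a batch mixing one 512-token prefill with 8 decoding requests.
print(predict_batch_time(BatchStats(prefill_tokens=512,
                                    decode_tokens=8,
                                    kv_cache_tokens=20_000)))
```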

-----

💡 Key Insights from this Paper:

→ Preempting requests can reduce GPU costs by 30% compared to avoiding preemptions

→ Preemption is most effective for short requests and under high memory utilization

→ Preempting long requests degrades performance by 30% due to recomputation overhead

→ Cost-based scheduling, in the spirit of database query optimization, proves effective (a hypothetical decision rule is sketched below)
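
To make the cost-based idea concrete, here is a hypothetical preemption rule in the same spirit: preempt only under memory pressure, and only when the estimated recomputation cost is smaller than the queuing delay it relieves. All fields, thresholds, and coefficients are assumptions for illustration, not INFERMAX's actual policy.

```python
from dataclasses import dataclass

@dataclass
class Request:
    prompt_tokens: int
    generated_tokens: int  # tokens already produced; lost and redone on preemption

def recompute_cost(req: Request, cost_per_token: float = 1e-4) -> float:
    """Estimated time to redo a preempted request's work from scratch."""
    return cost_per_token * (req.prompt_tokens + req.generated_tokens)

def should_preempt(victim: Request,
                   waiting_queue_delay: float,
                   memory_utilization: float,
                   mem_threshold: float = 0.9) -> bool:
    """Preempt only when KV-cache memory is tight and the recomputation
    penalty is cheaper than the delay imposed on waiting requests."""
    if memory_utilization < mem_threshold:
        return False  # enough memory: keep the request running
    return recompute_cost(victim) < waiting_queue_delay

# A short request under heavy memory pressure passes the cost check;
# a long one with many generated tokens fails it (recomputation too costly).
print(should_preempt(Request(prompt_tokens=200, generated_tokens=30),
                     waiting_queue_delay=0.5, memory_utilization=0.95))
print(should_preempt(Request(prompt_tokens=2000, generated_tokens=4000),
                     waiting_queue_delay=0.5, memory_utilization=0.95))
```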

-----

📊 Results:

→ Comprehensive analysis equivalent to 200 GPU hours of testing

→ 30% latency reduction through optimized preemption strategies

→ Predicted execution times within 9% relative error of actual GPU runs
