"AdaServe: SLO-Customized LLM Serving with Fine-Grained Speculative Decoding"

Playback speed

Share post at current time

0:00

Transcript

Below podcast is generated with Google's Illuminate.

Feb 03, 2025

Greedy token selection isn't just simple; this paper proves it's the smartest way to hit LLM performance targets on a budget.

Greedy algorithm optimally selects tokens to meet request-specific performance targets under a total budget constraint.

-----

Original Problem 🤔:

→ Efficiently allocating a limited token budget across multiple requests is challenging.

→ Each request has a Service Level Objective (SLO) defined by a target path probability.

→ The problem is to maximize the satisfaction of SLOs within the given token budget.

-----

Solution in this Paper 💡:

→ This paper proposes and proves the optimality of a greedy algorithm.

→ The algorithm iteratively selects tokens from token trees associated with each request.

→ It prioritizes tokens with the highest path probability values across all requests.

→ For each request, the algorithm aims to achieve its target path probability (SLO).

→ Token selection continues until either all SLOs are met or the total budget is exhausted.

→ The algorithm first checks for feasibility. If the total minimum tokens needed to meet all SLOs exceeds the budget, it declares "INVALID".

→ Otherwise, the algorithm proceeds with greedy token selection to find an optimal allocation within the budget.

-----

Key Insights from this Paper 🧐:

→ A greedy approach is optimal for token selection in this scenario.

→ Selecting tokens based on highest path probability maximizes the achieved total probability within a budget.

→ The minimal number of tokens required to reach a probability threshold can be found greedily.

→ Greedy selection under a fixed budget maximizes the sum of path probabilities.

-----

Results 📊:

→ Proves that if the algorithm returns "INVALID", no feasible solution exists to meet all SLOs within the budget.

→ Proves that if a feasible solution exists, the algorithm returns an optimal solution.

→ Optimality is shown through two lemmas: Minimality in Threshold Attainment and Maximality Under a Fixed Budget.

Rohan's Bytes