Want faster LLM inference? This paper’s scheduling policy uses a precomputed heatmap to route each request to the tensor engine best suited to its prefill and decode lengths.
This paper introduces a distributed scheduling policy for Large Language Model (LLM) inference. It aims to efficiently distribute requests across tensor engines, considering both request characteristics and system load.
Prefix-Decode awareness meets locality optimization in this new policy.
-----
Paper - https://arxiv.org/abs/2501.14417
Original Problem 🧐:
→ Serving LLM requests in a distributed system requires routing each request to the right engine.
→ Traditional schedulers may not consider per-request characteristics, such as how long the prefill and decode phases will run.
→ The result is underutilized resources and increased latency.
-----
Solution in this Paper 💡:
→ This paper proposes a distributed scheduling policy.
→ The policy, `dist_sched`, first uses a Prefix-Decode aware mechanism (`PD_aware`).
→ `PD_aware` selects Tensor Engines (TEs) based on the request's prefill length and decode length using a precomputed heatmap.
→ This heatmap guides the selection of TEs best suited for different request types.
→ After the PD-aware step, the policy checks for load balance across TEs.
→ If load is balanced, it applies a locality-aware mechanism (`locality_aware`).
→ `locality_aware` selects TEs based on prefix matching with existing requests.
→ If load is not balanced, the policy falls back to a load-aware mechanism (`load_aware`).
→ However, that mechanism’s implementation (`select_tes_least_load`) is commented out in the paper’s algorithm listing.
→ So, as written, the policy does not spell out how load-aware scheduling is actually performed under load imbalance (see the sketch after this list for one plausible reading).
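To make the flow concrete, here is a minimal Python sketch of the policy as described above. The names `dist_sched`, `PD_aware`, `locality_aware`, `load_aware`, and `select_tes_least_load` come from the paper’s algorithm; the data structures, heatmap layout, bucketing, and balance threshold are all illustrative assumptions, not the paper’s implementation.

```python
# A minimal sketch of the scheduling flow, assuming simple data structures.
# `dist_sched`, `PD_aware`, `locality_aware`, `load_aware`, and
# `select_tes_least_load` are names from the paper's algorithm; everything
# else (heatmap layout, bucketing, balance threshold) is assumed here.
from dataclasses import dataclass, field


@dataclass
class Request:
    prompt_tokens: list[int]   # prompt token ids (prefill input)
    prefill_len: int           # prefill length
    decode_len: int            # (estimated) decode length


@dataclass
class TensorEngine:
    te_id: int
    load: int = 0              # e.g. tokens of queued work
    cached_prefixes: list[list[int]] = field(default_factory=list)


def bucket(length: int, size: int = 512, n_buckets: int = 8) -> int:
    """Map a length onto a coarse heatmap bucket (assumed granularity)."""
    return min(length // size, n_buckets - 1)


# Assumed precomputed heatmap: (prefill_bucket, decode_bucket) -> TE ids
# best suited for that request type.
HEATMAP: dict[tuple[int, int], list[int]] = {}


def PD_aware(req: Request, tes: list[TensorEngine]) -> list[TensorEngine]:
    """Step 1: shortlist candidate TEs via the Prefix-Decode heatmap."""
    key = (bucket(req.prefill_len), bucket(req.decode_len))
    ids = set(HEATMAP.get(key, []))
    cands = [te for te in tes if te.te_id in ids]
    return cands or tes        # fall back to all TEs if no heatmap entry


def is_load_balanced(tes: list[TensorEngine], slack: float = 0.2) -> bool:
    """Assumed balance test: max load stays within (1 + slack) of the mean."""
    loads = [te.load for te in tes]
    mean = sum(loads) / len(loads)
    return max(loads) <= (1 + slack) * max(mean, 1)


def shared_prefix_len(a: list[int], b: list[int]) -> int:
    """Length of the common token prefix of two sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def locality_aware(req: Request, tes: list[TensorEngine]) -> TensorEngine:
    """Step 2a (balanced): TE whose cached requests share the longest prefix."""
    def best_match(te: TensorEngine) -> int:
        return max((shared_prefix_len(req.prompt_tokens, p)
                    for p in te.cached_prefixes), default=0)
    return max(tes, key=best_match)


def load_aware(req: Request, tes: list[TensorEngine]) -> TensorEngine:
    """Step 2b (imbalanced): least-loaded TE. The paper's
    `select_tes_least_load` is commented out in its listing; this is
    one plausible reading of the intended fallback."""
    return min(tes, key=lambda te: te.load)


def dist_sched(req: Request, tes: list[TensorEngine]) -> TensorEngine:
    candidates = PD_aware(req, tes)
    if is_load_balanced(candidates):
        te = locality_aware(req, candidates)
    else:
        te = load_aware(req, candidates)
    te.load += req.prefill_len + req.decode_len   # account for assigned work
    te.cached_prefixes.append(req.prompt_tokens)
    return te
```

The staging matters: the heatmap narrows the candidate set cheaply before the locality/load decision is made among those candidates, and the assumed `cands or tes` fallback keeps the scheduler functional when the heatmap has no entry for a request type.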
-----
Key Insights from this Paper 🤔:
→ Prefill and decode lengths are critical request characteristics for LLM inference scheduling.
→ A Prefix-Decode heatmap can effectively guide TE selection for different request types.
→ Locality-aware scheduling, based on prefix matching, can be beneficial when the system is load-balanced (illustrated in the snippet after this list).
→ The policy prioritizes Prefix-Decode awareness and locality, with an intended but not fully detailed load-balancing component.
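As a quick, hypothetical illustration of the locality insight, reusing the sketch above: with two equally loaded TEs, the request goes to the one holding the longest matching cached prefix.

```python
# Illustrative only; relies on the assumed classes/functions sketched above.
tes = [TensorEngine(te_id=0), TensorEngine(te_id=1)]
tes[0].cached_prefixes.append([1, 2, 3, 4])       # TE 0 holds a warm prefix

req = Request(prompt_tokens=[1, 2, 3, 9], prefill_len=4, decode_len=64)
chosen = dist_sched(req, tes)
print(chosen.te_id)  # -> 0: loads are equal, so prefix locality decides
```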
-----
Results 📊:
→ The algorithm description does not report explicit performance metrics or benchmark results.
→ The policy’s effectiveness would therefore depend on the concrete implementation and its evaluation in a real distributed LLM serving system, neither of which is detailed in the algorithm description.