"Static Batching of Irregular Workloads on GPUs: Framework and Application to Efficient MoE Model Inference"

The accompanying podcast on this paper was generated with Google's Illuminate.

The problem is efficiently executing irregular workloads on GPUs, especially for Mixture-of-Experts (MoE) models, where tasks vary in computation and memory-access patterns.

This paper proposes a static batching framework with a novel task mapping mechanism for GPUs, and applies it to optimize MoE model inference through an efficient CUDA kernel.

-----

https://arxiv.org/abs/2501.16103

📌 This static batching framework overcomes dynamic scheduling overhead in grouped GEMM. Pre-calculated `TilePrefix` and warp-based mapping enable efficient irregular workload execution on GPUs.

📌 The MoE kernel achieves near-peak Tensor Core throughput through an optimized CUDA implementation. Token index arrays and expert ordering improve memory access and resource utilization for MoE inference.

📌 The framework's general applicability extends beyond MoE. It provides a template for efficient batching of diverse irregular workloads, crucial for scaling complex AI models on GPUs.

----------

Methods Explored in this Paper 🔧:

→ The paper introduces a static batching framework for irregular workloads on GPUs.

→ This framework uses a compressed task mapping mechanism. It maps thread block indices to task and tile indices via a `TilePrefix` array.

→ A warp-based algorithm decompresses this mapping within the kernel, reducing overhead compared to dynamic scheduling (a sketch of this mapping appears after this list).

→ For MoE models, the framework is extended to handle empty tasks with a two-stage mapping: thread block index → non-empty task index → real task index.

→ Token copy overhead is eliminated using per-expert token index arrays, which avoid redundant data preparation (see the second sketch after this list).

→ The paper also employs expert ordering to improve resource utilization by interleaving compute-bound and memory-bound tasks.

→ Several GEMM optimizations are implemented in the MoE kernel. These include WGMMA instructions, asynchronous copy, two-stage pipelining, and tile swizzle.
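
As a concrete illustration of the task mapping described above, here is a minimal sketch that assumes `TilePrefix` is the exclusive prefix sum of per-task tile counts (num_tasks + 1 entries, with the grand total last). Each thread block recovers its (task, tile) pair with a warp-ballot lookup instead of any dynamic scheduling. The ballot-based search and all names here are illustrative assumptions, not the paper's exact in-kernel algorithm.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each lane of a warp tests one candidate task; a ballot finds the task whose
// tile range [TilePrefix[t], TilePrefix[t+1]) contains this thread block.
__device__ void decode_task_and_tile(const int* tile_prefix, int num_tasks,
                                     int block_id, int* task, int* tile) {
    int lane = threadIdx.x % 32;
    int found = 0;
    for (int base = 0; base < num_tasks; base += 32) {
        int t = base + lane;
        bool hit = (t < num_tasks) && block_id >= tile_prefix[t] &&
                   block_id < tile_prefix[t + 1];
        unsigned mask = __ballot_sync(0xffffffffu, hit);
        if (mask) { found = base + __ffs(mask) - 1; break; }  // lowest hitting lane
    }
    *task = found;
    *tile = block_id - tile_prefix[found];  // tile index within that task
}

__global__ void grouped_gemm_stub(const int* tile_prefix, int num_tasks) {
    int task, tile;
    decode_task_and_tile(tile_prefix, num_tasks, blockIdx.x, &task, &tile);
    // ... a real kernel would now load task `task`, output tile `tile`, and run GEMM ...
    if (threadIdx.x == 0)
        printf("block %d -> task %d, tile %d\n", blockIdx.x, task, tile);
}

int main() {
    // Three irregular tasks with 2, 5, and 1 output tiles respectively.
    int tiles_per_task[3] = {2, 5, 1};
    int h_prefix[4] = {0};
    for (int t = 0; t < 3; ++t) h_prefix[t + 1] = h_prefix[t] + tiles_per_task[t];

    int* d_prefix;
    cudaMalloc(&d_prefix, sizeof(h_prefix));
    cudaMemcpy(d_prefix, h_prefix, sizeof(h_prefix), cudaMemcpyHostToDevice);

    // One thread block per tile across all tasks; no dynamic scheduling needed.
    grouped_gemm_stub<<<h_prefix[3], 32>>>(d_prefix, 3);
    cudaDeviceSynchronize();
    cudaFree(d_prefix);
    return 0;
}
```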

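A second sketch, with assumed names and a deliberately simple ordering heuristic, shows the host-side bookkeeping for the MoE case: per-expert token index arrays (so the kernel gathers token rows instead of copying them), a non-empty-task → real-expert map implementing the two-stage mapping, and an expert ordering that interleaves token-heavy (compute-bound) and token-light (memory-bound) experts. This is a plausible reading of the paper's scheme, not its implementation.

```cuda
// Host-side sketch (plain C++; the arrays would be copied to the GPU and read
// by the MoE kernel above). Names and the ordering heuristic are assumptions.
#include <algorithm>
#include <cstdio>
#include <vector>

struct MoeSchedule {
    std::vector<int> token_index;      // token row ids, grouped by expert (gather indices)
    std::vector<int> token_offset;     // expert e owns token_index[token_offset[e] .. token_offset[e+1])
    std::vector<int> nonempty_to_real; // stage 2 of the mapping: skips experts with zero tokens
    std::vector<int> expert_order;     // execution order interleaving large and small experts
};

MoeSchedule build_schedule(const std::vector<int>& expert_of_token, int num_experts) {
    MoeSchedule s;
    // Bucket token ids by the expert they were routed to; the kernel later
    // gathers rows through these indices, so no token tensor copy is needed.
    std::vector<std::vector<int>> buckets(num_experts);
    for (int t = 0; t < (int)expert_of_token.size(); ++t)
        buckets[expert_of_token[t]].push_back(t);

    s.token_offset.push_back(0);
    for (int e = 0; e < num_experts; ++e) {
        s.token_index.insert(s.token_index.end(), buckets[e].begin(), buckets[e].end());
        s.token_offset.push_back((int)s.token_index.size());
        if (!buckets[e].empty()) s.nonempty_to_real.push_back(e);
    }

    // Illustrative ordering: sort non-empty experts by token count, then pair the
    // largest (compute-bound) with the smallest (memory-bound) to balance the GPU.
    std::vector<int> sorted = s.nonempty_to_real;
    std::sort(sorted.begin(), sorted.end(),
              [&](int a, int b) { return buckets[a].size() > buckets[b].size(); });
    for (int i = 0, j = (int)sorted.size() - 1; i <= j; ++i, --j) {
        s.expert_order.push_back(sorted[i]);
        if (i != j) s.expert_order.push_back(sorted[j]);
    }
    return s;
}

int main() {
    // Eight tokens routed to four experts; expert 1 receives none and is skipped.
    std::vector<int> routing = {0, 2, 0, 3, 2, 2, 0, 3};
    MoeSchedule s = build_schedule(routing, /*num_experts=*/4);
    for (int e : s.expert_order)
        std::printf("expert %d handles %d tokens\n", e,
                    s.token_offset[e + 1] - s.token_offset[e]);
    return 0;
}
```
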
-----

Key Insights 💡:

→ Static batching is effective for improving the performance of irregular workloads on GPUs.

→ The compressed task mapping mechanism reduces overhead and improves data locality.

→ Applying static batching to MoE inference and optimizing the CUDA kernel significantly increases throughput.

→ Expert ordering can further balance resource utilization in MoE inference.

→ Token index arrays effectively eliminate redundant token tensor copying.

-----

Results 📊:

→ On an NVIDIA H20 GPU, the MoE kernel achieves up to 95% of peak Tensor Core throughput in balanced scenarios.

→ On an NVIDIA H800 GPU, the kernel reaches up to 91% of peak throughput in best-case scenarios.

→ Even in worst-case scenarios, the kernel achieves approximately 90% of peak throughput on the H20.
