The problem addressed is the efficient execution of irregular workloads on GPUs, particularly in Mixture-of-Experts (MoE) models, where computation and memory access vary from task to task.
This paper proposes a static batching framework with a novel task-mapping mechanism for GPUs and applies it to MoE inference through an efficient CUDA kernel.
-----
https://arxiv.org/abs/2501.16103
📌 This static batching framework avoids the overhead of dynamic scheduling in grouped GEMM. A pre-computed `TilePrefix` array and warp-based mapping enable efficient execution of irregular workloads on GPUs.
📌 The MoE kernel achieves near-peak Tensor Core throughput via an optimized CUDA implementation. Token index arrays and expert ordering improve memory access and resource utilization for MoE inference.
📌 The framework's general applicability extends beyond MoE. It provides a template for efficient batching of diverse irregular workloads, crucial for scaling complex AI models on GPUs.
----------
Methods Explored in this Paper 🔧:
→ The paper introduces a static batching framework for irregular workloads on GPUs.
→ This framework uses a compressed task-mapping mechanism that maps thread block indices to task and tile indices via a `TilePrefix` array.
→ A warp-based algorithm decompresses this mapping inside the kernel, reducing overhead compared to dynamic scheduling (a minimal lookup sketch follows this list).
→ For MoE models, the framework is extended to handle empty tasks (experts that receive no tokens) with a two-stage mapping: "thread block index to non-empty task index to real task index".
→ Token copy overhead is eliminated by per-expert token index arrays, letting the kernel gather tokens in place instead of preparing redundant contiguous copies (see the gather sketch below).
→ The paper also employs expert ordering, interleaving compute-bound and memory-bound tasks to improve resource utilization (an illustrative host-side ordering sketch is shown below).
→ Several GEMM optimizations are implemented in the MoE kernel, including WGMMA instructions, asynchronous copy, two-stage pipelining, and tile swizzle (a simplified pipelining sketch closes the examples below).
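
To make the mapping concrete, here is a minimal CUDA sketch of a warp-based lookup against a `TilePrefix`-style array. The names (`tile_prefix`, `map_block_to_task`) and the single-warp search are illustrative assumptions, not the paper's implementation; with an inclusive prefix sum over per-task tile counts, empty tasks are skipped implicitly, whereas the paper describes an explicit two-stage mapping through non-empty task indices.

```cuda
#include <cuda_runtime.h>

// tile_prefix[t] = number of output tiles owned by tasks 0..t (inclusive scan),
// precomputed on the host. Each thread block with global index block_id belongs
// to the first task t with block_id < tile_prefix[t]; its local tile index is
// block_id - tile_prefix[t - 1] (or block_id itself for t == 0).
__device__ void map_block_to_task(const int* __restrict__ tile_prefix,
                                  int num_tasks, int block_id,
                                  int* task_idx, int* tile_idx) {
    const int lane = threadIdx.x & 31;
    // Scan the prefix array 32 entries at a time; all lanes of the warp cooperate.
    for (int base = 0; base < num_tasks; base += 32) {
        int t = base + lane;
        bool hit = (t < num_tasks) && (block_id < tile_prefix[t]) &&
                   (t == 0 || block_id >= tile_prefix[t - 1]);
        unsigned mask = __ballot_sync(0xffffffffu, hit);
        if (mask != 0) {                           // exactly one lane matches
            int found = base + (__ffs(mask) - 1);
            *task_idx = found;
            *tile_idx = block_id - (found == 0 ? 0 : tile_prefix[found - 1]);
            return;                                // all lanes exit together
        }
    }
    // Not reached when the grid is sized to the total tile count.
}

// Stub showing where the lookup sits in a grouped-GEMM kernel (block size >= 32).
__global__ void grouped_gemm_stub(const int* tile_prefix, int num_tasks) {
    __shared__ int s_task, s_tile;
    if (threadIdx.x < 32) {                        // one warp decompresses the mapping
        int task, tile;
        map_block_to_task(tile_prefix, num_tasks, blockIdx.x, &task, &tile);
        if (threadIdx.x == 0) { s_task = task; s_tile = tile; }
    }
    __syncthreads();
    // ... the block would now compute output tile s_tile of task s_task ...
}
```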
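
The token-index-array idea can be sketched as a gather during operand staging: instead of copying routed tokens into a contiguous per-expert buffer, the kernel reads rows of the original activation tensor through an index array while filling the A-operand tile. The buffer shapes, names, and the element-wise copy loop are simplifying assumptions.

```cuda
#include <cuda_fp16.h>

constexpr int TILE_M = 64;   // tokens (rows) per output tile, for illustration
constexpr int TILE_K = 64;   // hidden-dimension slice loaded per step

// tokens:      [num_tokens, hidden] original, un-permuted activations
// token_index: indices of the tokens routed to this expert, in routed order
// smem_a:      shared-memory staging buffer for the A operand
__device__ void load_a_tile_gathered(const half* __restrict__ tokens,
                                     const int* __restrict__ token_index,
                                     int rows_in_tile,   // <= TILE_M (last tile may be partial)
                                     int tile_row0,      // first routed row of this tile
                                     int k0,             // first hidden column of this slice
                                     int hidden,
                                     half smem_a[TILE_M][TILE_K]) {
    for (int idx = threadIdx.x; idx < TILE_M * TILE_K; idx += blockDim.x) {
        int r = idx / TILE_K;
        int c = idx % TILE_K;
        half v = __float2half(0.0f);                    // zero-pad rows past the tile end
        if (r < rows_in_tile) {
            int src_row = token_index[tile_row0 + r];   // indirection: no token copy
            v = tokens[(size_t)src_row * hidden + (k0 + c)];
        }
        smem_a[r][c] = v;
    }
}
```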
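
For expert ordering, one plausible host-side heuristic is to sort experts by routed-token count and interleave the heavy and light ends of the list, so that thread blocks resident at the same time stress different resources (compute-bound next to memory-bound). This is an assumption about how such an order could be built, not the paper's exact policy.

```cuda
// Host-side helper (plain C++ in the same CUDA source).
#include <algorithm>
#include <numeric>
#include <vector>

std::vector<int> interleave_experts(const std::vector<int>& tokens_per_expert) {
    const int n = static_cast<int>(tokens_per_expert.size());
    std::vector<int> order(n);
    std::iota(order.begin(), order.end(), 0);
    // Heaviest (most tokens, compute-bound) experts first.
    std::sort(order.begin(), order.end(), [&](int a, int b) {
        return tokens_per_expert[a] > tokens_per_expert[b];
    });
    // Zip the two ends: heavy, light, heavy, light, ...
    std::vector<int> interleaved;
    interleaved.reserve(n);
    for (int lo = 0, hi = n - 1; lo <= hi; ++lo, --hi) {
        interleaved.push_back(order[lo]);
        if (lo != hi) interleaved.push_back(order[hi]);
    }
    return interleaved;  // use this order when building the TilePrefix array
}
```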
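
Finally, a simplified sketch of a two-stage (double-buffered) global-to-shared pipeline built on the CUDA asynchronous-copy primitives. The tile size, buffer layout, and compute placeholder are assumptions; the actual kernel would also issue WGMMA instructions and apply tile swizzling, which are omitted here.

```cuda
#include <cuda_pipeline.h>
#include <cuda_fp16.h>

constexpr int TILE_ELEMS = 64 * 64;   // elements staged per K-step (assumption)

// Copy one tile from global to shared memory asynchronously (gmem assumed
// 16-byte aligned); each thread moves 8 halves (16 bytes) per request.
__device__ void issue_tile_copy(half* dst_shared, const half* src_global) {
    for (int i = threadIdx.x * 8; i < TILE_ELEMS; i += blockDim.x * 8) {
        __pipeline_memcpy_async(dst_shared + i, src_global + i, 16);
    }
    __pipeline_commit();              // close one async-copy batch
}

__global__ void two_stage_mainloop(const half* __restrict__ gmem, int num_k_tiles) {
    __shared__ __align__(16) half stage[2][TILE_ELEMS];

    issue_tile_copy(stage[0], gmem);                       // prefetch tile 0
    for (int k = 0; k < num_k_tiles; ++k) {
        if (k + 1 < num_k_tiles) {
            // Overlap: start fetching tile k+1 into the other buffer.
            issue_tile_copy(stage[(k + 1) & 1],
                            gmem + (size_t)(k + 1) * TILE_ELEMS);
            __pipeline_wait_prior(1);  // tile k has landed; k+1 still in flight
        } else {
            __pipeline_wait_prior(0);  // last tile: drain all outstanding copies
        }
        __syncthreads();
        // ... MMA on stage[k & 1] would be issued here (WGMMA in the paper) ...
        __syncthreads();               // that buffer may be overwritten next iteration
    }
}
```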
-----
Key Insights 💡:
→ Static batching is effective for improving the performance of irregular workloads on GPUs.
→ The compressed task mapping mechanism reduces overhead and improves data locality.
→ Applying static batching to MoE inference and optimizing the CUDA kernel significantly increases throughput.
→ Expert ordering can further balance resource utilization in MoE inference.
→ Token index arrays effectively eliminate redundant token tensor copying.
-----
Results 📊:
→ On an NVIDIA H20 GPU, the MoE kernel achieves up to 95% of peak Tensor Core throughput in balanced scenarios.
→ On an NVIDIA H800 GPU, the kernel reaches up to 91% of peak throughput in best-case scenarios.
→ Even in worst-case scenarios, the kernel achieves approximately 90% of peak throughput on the H20.