The problem addressed is the efficient execution of irregular workloads on GPUs, particularly in Mixture-of-Experts (MoE) models, where computation and memory access vary from task to task.
This paper proposes a static batching framework with a novel task-mapping mechanism for GPUs and applies it to MoE inference through an efficient CUDA kernel.
-----
https://arxiv.org/abs/2501.16103
📌 This static batching framework avoids the overhead of dynamic scheduling in grouped GEMM. A pre-computed `TilePrefix` array and warp-based mapping enable efficient execution of irregular workloads on GPUs.
📌 The MoE kernel achieves near-peak Tensor Core throughput via an optimized CUDA implementation. Token index arrays and expert ordering improve memory access and resource utilization for MoE inference.
📌 The framework's general applicability extends beyond MoE. It provides a template for efficient batching of diverse irregular workloads, crucial for scaling complex AI models on GPUs.
----------
Methods Explored in this Paper 🔧:
→ The paper introduces a static batching framework for irregular workloads on GPUs.
→ This framework uses a compressed task-mapping mechanism that maps thread block indices to task and tile indices via a `TilePrefix` array.
→ A warp-based algorithm decompresses this mapping inside the kernel, reducing overhead compared to dynamic scheduling (a minimal lookup sketch follows this list).
→ For MoE models, the framework is extended to handle empty tasks (experts that receive no tokens) with a two-stage mapping: "thread block index to non-empty task index to real task index".
→ Token copy overhead is eliminated by per-expert token index arrays, letting the kernel gather tokens in place instead of preparing redundant contiguous copies (see the gather sketch below).
→ The paper also employs expert ordering, interleaving compute-bound and memory-bound tasks to improve resource utilization (an illustrative host-side ordering sketch is shown below).
→ Several GEMM optimizations are implemented in the MoE kernel, including WGMMA instructions, asynchronous copy, two-stage pipelining, and tile swizzle (a simplified pipelining sketch closes the examples below).
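
To make the mapping concrete, here is a minimal CUDA sketch of a warp-based lookup against a `TilePrefix`-style array. The names (`tile_prefix`, `map_block_to_task`) and the single-warp search are illustrative assumptions, not the paper's implementation; with an inclusive prefix sum over per-task tile counts, empty tasks are skipped implicitly, whereas the paper describes an explicit two-stage mapping through non-empty task indices.

```cuda
#include <cuda_runtime.h>

// tile_prefix[t] = number of output tiles owned by tasks 0..t (inclusive scan),
// precomputed on the host. Each thread block with global index block_id belongs
// to the first task t with block_id < tile_prefix[t]; its local tile index is
// block_id - tile_prefix[t - 1] (or block_id itself for t == 0).
__device__ void map_block_to_task(const int* __restrict__ tile_prefix,
                                  int num_tasks, int block_id,
                                  int* task_idx, int* tile_idx) {
    const int lane = threadIdx.x & 31;
    // Scan the prefix array 32 entries at a time; all lanes of the warp cooperate.
    for (int base = 0; base < num_tasks; base += 32) {
        int t = base + lane;
        bool hit = (t < num_tasks) && (block_id < tile_prefix[t]) &&
                   (t == 0 || block_id >= tile_prefix[t - 1]);
        unsigned mask = __ballot_sync(0xffffffffu, hit);
        if (mask != 0) {                           // exactly one lane matches
            int found = base + (__ffs(mask) - 1);
            *task_idx = found;
            *tile_idx = block_id - (found == 0 ? 0 : tile_prefix[found - 1]);
            return;                                // all lanes exit together
        }
    }
    // Not reached when the grid is sized to the total tile count.
}

// Stub showing where the lookup sits in a grouped-GEMM kernel (block size >= 32).
__global__ void grouped_gemm_stub(const int* tile_prefix, int num_tasks) {
    __shared__ int s_task, s_tile;
    if (threadIdx.x < 32) {                        // one warp decompresses the mapping
        int task, tile;
        map_block_to_task(tile_prefix, num_tasks, blockIdx.x, &task, &tile);
        if (threadIdx.x == 0) { s_task = task; s_tile = tile; }
    }
    __syncthreads();
    // ... the block would now compute output tile s_tile of task s_task ...
}
```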
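
The token-index-array idea can be sketched as a gather during operand staging: instead of copying routed tokens into a contiguous per-expert buffer, the kernel reads rows of the original activation tensor through an index array while filling the A-operand tile. The buffer shapes, names, and the element-wise copy loop are simplifying assumptions.

```cuda
#include <cuda_fp16.h>

constexpr int TILE_M = 64;   // tokens (rows) per output tile, for illustration
constexpr int TILE_K = 64;   // hidden-dimension slice loaded per step

// tokens:      [num_tokens, hidden] original, un-permuted activations
// token_index: indices of the tokens routed to this expert, in routed order
// smem_a:      shared-memory staging buffer for the A operand
__device__ void load_a_tile_gathered(const half* __restrict__ tokens,
                                     const int* __restrict__ token_index,
                                     int rows_in_tile,   // <= TILE_M (last tile may be partial)
                                     int tile_row0,      // first routed row of this tile
                                     int k0,             // first hidden column of this slice
                                     int hidden,
                                     half smem_a[TILE_M][TILE_K]) {
    for (int idx = threadIdx.x; idx < TILE_M * TILE_K; idx += blockDim.x) {
        int r = idx / TILE_K;
        int c = idx % TILE_K;
        half v = __float2half(0.0f);                    // zero-pad rows past the tile end
        if (r < rows_in_tile) {
            int src_row = token_index[tile_row0 + r];   // indirection: no token copy
            v = tokens[(size_t)src_row * hidden + (k0 + c)];
        }
        smem_a[r][c] = v;
    }
}
```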
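
For expert ordering, one plausible host-side heuristic is to sort experts by routed-token count and interleave the heavy and light ends of the list, so that thread blocks resident at the same time stress different resources (compute-bound next to memory-bound). This is an assumption about how such an order could be built, not the paper's exact policy.

```cuda
// Host-side helper (plain C++ in the same CUDA source).
#include <algorithm>
#include <numeric>
#include <vector>

std::vector<int> interleave_experts(const std::vector<int>& tokens_per_expert) {
    const int n = static_cast<int>(tokens_per_expert.size());
    std::vector<int> order(n);
    std::iota(order.begin(), order.end(), 0);
    // Heaviest (most tokens, compute-bound) experts first.
    std::sort(order.begin(), order.end(), [&](int a, int b) {
        return tokens_per_expert[a] > tokens_per_expert[b];
    });
    // Zip the two ends: heavy, light, heavy, light, ...
    std::vector<int> interleaved;
    interleaved.reserve(n);
    for (int lo = 0, hi = n - 1; lo <= hi; ++lo, --hi) {
        interleaved.push_back(order[lo]);
        if (lo != hi) interleaved.push_back(order[hi]);
    }
    return interleaved;  // use this order when building the TilePrefix array
}
```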
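
Finally, a simplified sketch of a two-stage (double-buffered) global-to-shared pipeline built on the CUDA asynchronous-copy primitives. The tile size, buffer layout, and compute placeholder are assumptions; the actual kernel would also issue WGMMA instructions and apply tile swizzling, which are omitted here.

```cuda
#include <cuda_pipeline.h>
#include <cuda_fp16.h>

constexpr int TILE_ELEMS = 64 * 64;   // elements staged per K-step (assumption)

// Copy one tile from global to shared memory asynchronously (gmem assumed
// 16-byte aligned); each thread moves 8 halves (16 bytes) per request.
__device__ void issue_tile_copy(half* dst_shared, const half* src_global) {
    for (int i = threadIdx.x * 8; i < TILE_ELEMS; i += blockDim.x * 8) {
        __pipeline_memcpy_async(dst_shared + i, src_global + i, 16);
    }
    __pipeline_commit();              // close one async-copy batch
}

__global__ void two_stage_mainloop(const half* __restrict__ gmem, int num_k_tiles) {
    __shared__ __align__(16) half stage[2][TILE_ELEMS];

    issue_tile_copy(stage[0], gmem);                       // prefetch tile 0
    for (int k = 0; k < num_k_tiles; ++k) {
        if (k + 1 < num_k_tiles) {
            // Overlap: start fetching tile k+1 into the other buffer.
            issue_tile_copy(stage[(k + 1) & 1],
                            gmem + (size_t)(k + 1) * TILE_ELEMS);
            __pipeline_wait_prior(1);  // tile k has landed; k+1 still in flight
        } else {
            __pipeline_wait_prior(0);  // last tile: drain all outstanding copies
        }
        __syncthreads();
        // ... MMA on stage[k & 1] would be issued here (WGMMA in the paper) ...
        __syncthreads();               // that buffer may be overwritten next iteration
    }
}
```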
-----
Key Insights 💡:
→ Static batching is effective for improving the performance of irregular workloads on GPUs.
→ The compressed task mapping mechanism reduces overhead and improves data locality.
→ Applying static batching to MoE inference and optimizing the CUDA kernel significantly increases throughput.
→ Expert ordering can further balance resource utilization in MoE inference.
→ Token index arrays effectively eliminate redundant token tensor copying.
-----
Results 📊:
→ On an NVIDIA H20 GPU, the MoE kernel achieves up to 95% of peak Tensor Core throughput in balanced scenarios.
→ On an NVIDIA H800 GPU, the kernel reaches up to 91% of peak throughput in best-case scenarios.
→ Even in worst-case scenarios, the kernel achieves approximately 90% of peak throughput on the H20.