Vocabulary Parallelism solves a key pipeline bottleneck in LLM training by distributing vocabulary-layer computation and memory evenly across GPUs
https://arxiv.org/abs/2411.05288
🎯 Original Problem:
Pipeline parallelism in LLM training faces a critical imbalance issue: the vocabulary layers (input embedding and output projection) sit on the first and last pipeline stages, so those stages carry disproportionately more computation and memory than the rest. This creates pipeline bubbles and memory bottlenecks, and the problem grows worse with large vocabularies.
-----
🔧 Solution in this Paper:
→ Introduces "Vocabulary Parallelism", which partitions the vocabulary layers evenly across all pipeline devices (see the partitioning sketch after this list)
→ Reduces the number of communication barriers in the output layer from 3 to 1 via algorithms that reorder its operations (the baseline's three barriers are sketched below)
→ Integrates seamlessly with existing pipeline schedules through a building-block approach
→ Uses separate CUDA streams to overlap communication with computation
→ Implements memory-efficient vocabulary-size padding for better alignment
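A minimal sketch of the even-partitioning idea: each pipeline device owns an equal, alignment-padded slice of the vocabulary. The helper names and the 128-element alignment multiple are illustrative assumptions, not taken from the paper.

```python
# Illustrative sketch only -- function names and the alignment multiple are assumptions.

def padded_vocab_size(vocab_size: int, num_devices: int, multiple: int = 128) -> int:
    """Pad the vocabulary so it splits evenly across devices and each shard
    stays aligned to a hardware-friendly multiple."""
    per_device = -(-vocab_size // num_devices)            # ceil division
    per_device = -(-per_device // multiple) * multiple    # round shard up to the multiple
    return per_device * num_devices

def vocab_range_for_rank(vocab_size: int, num_devices: int, rank: int) -> tuple[int, int]:
    """Return the [start, end) slice of the padded vocabulary owned by `rank`."""
    shard = padded_vocab_size(vocab_size, num_devices) // num_devices
    return rank * shard, (rank + 1) * shard

if __name__ == "__main__":
    V, P = 128_256, 8                      # e.g. a Llama-3-sized vocabulary over 8 pipeline stages
    print(padded_vocab_size(V, P))         # 129024: divisible by both 8 and 128
    for r in range(P):
        print(r, vocab_range_for_rank(V, P, r))   # equal [start, end) shard per device
```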
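For context on the "3 barriers to 1" claim, below is a sketch of a conventional vocabulary-parallel cross-entropy (Megatron-LM style), written so the three all-reduce barriers are explicit. The paper's reordering that collapses them into a single barrier is not reproduced here; this only shows where the baseline synchronizes.

```python
# Baseline sketch: vocabulary-parallel cross-entropy with its three all-reduce
# barriers made explicit. Not the paper's optimized single-barrier version.
import torch
import torch.distributed as dist

def vocab_parallel_cross_entropy(local_logits, targets, vocab_start, vocab_end, group=None):
    # local_logits: [tokens, local_vocab] shard of the full logits held by this rank.
    # Barrier 1: global max over the vocabulary, for a numerically stable softmax.
    logits_max = local_logits.max(dim=-1).values
    dist.all_reduce(logits_max, op=dist.ReduceOp.MAX, group=group)
    shifted = local_logits - logits_max.unsqueeze(-1)

    # Barrier 2: the target token's logit, contributed only by the rank that owns it.
    in_shard = (targets >= vocab_start) & (targets < vocab_end)
    local_idx = (targets - vocab_start).clamp(0, local_logits.size(-1) - 1)
    target_logits = shifted.gather(-1, local_idx.unsqueeze(-1)).squeeze(-1)
    target_logits = torch.where(in_shard, target_logits, torch.zeros_like(target_logits))
    dist.all_reduce(target_logits, op=dist.ReduceOp.SUM, group=group)

    # Barrier 3: softmax denominator, summed over every vocabulary shard.
    sum_exp = shifted.exp().sum(dim=-1)
    dist.all_reduce(sum_exp, op=dist.ReduceOp.SUM, group=group)

    # Per-token loss: log(sum of exps) minus the (shifted) target logit.
    return sum_exp.log() - target_logits
```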
-----
💡 Key Insights:
→ Vocabulary layer imbalance worsens with larger vocabulary sizes
→ Simply redistributing transformer layers doesn't solve both compute and memory imbalance
→ Communication barriers directly impact activation memory overhead
→ Overlapping communication with computation is crucial for performance (see the stream-overlap sketch after this list)
→ Memory balancing is achievable when combined with the V-Half schedule
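A minimal sketch of the overlap idea using PyTorch CUDA streams; the function and tensor names are illustrative assumptions rather than the paper's implementation, and how much overlap is realized in practice also depends on NCCL's own internal streams.

```python
# Illustrative sketch: launch the vocabulary all-reduce on a side CUDA stream
# so it can overlap with transformer compute on the default stream.
import torch
import torch.distributed as dist

comm_stream = torch.cuda.Stream()

def overlapped_step(vocab_partials, hidden, transformer_block, group=None):
    # Make the side stream wait for work already queued on the default stream.
    comm_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(comm_stream):
        handle = dist.all_reduce(vocab_partials, group=group, async_op=True)

    # Keep the default stream busy with transformer computation in the meantime.
    hidden = transformer_block(hidden)

    # Re-synchronize before anything consumes the reduced vocabulary partials.
    handle.wait()
    torch.cuda.current_stream().wait_stream(comm_stream)
    return vocab_partials, hidden
```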
-----
📊 Results:
→ 5% to 51% improvement in throughput compared to naive approaches
→ Significantly reduced peak memory usage in large vocabulary scenarios
→ 8% performance boost achieved with proper vocabulary size padding
→ Perfect balance in both memory and computation when combined with the V-Half schedule