Ever waited for a slow core while the fast ones sit idle? This paper fixes that for hybrid CPUs.
Hybrid CPUs were wasting their potential.
-----
https://arxiv.org/abs/2411.19542
🎯 Original Problem:
Hybrid CPUs mix fast P-cores and slower E-cores, but traditional parallel methods like OpenMP split LLM inference work into equal chunks per core. Equal chunks finish at unequal times, so the P-cores sit idle at each synchronization barrier waiting for the E-cores to catch up.
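
For intuition, here is a minimal sketch of that baseline behavior (illustrative only, not code from the paper; the matrix-vector kernel and its sizes are assumptions): OpenMP's default static schedule hands every thread an equal slice of the loop, and the implicit barrier at the end of the parallel-for keeps P-core threads waiting for the slowest E-core thread.

```cpp
// Baseline sketch (illustrative, not from the paper): OpenMP's default static
// schedule splits the rows of a matrix-vector product into equal chunks per
// thread. Threads that land on slower E-cores finish last, and the implicit
// barrier at the end of the parallel-for leaves the P-core threads idle
// until they do.
#include <cstddef>
#include <vector>

void matvec_equal_split(const std::vector<float>& W, const std::vector<float>& x,
                        std::vector<float>& y, std::size_t rows, std::size_t cols) {
    #pragma omp parallel for schedule(static)   // one equal chunk per thread
    for (std::ptrdiff_t r = 0; r < static_cast<std::ptrdiff_t>(rows); ++r) {
        float acc = 0.0f;
        for (std::size_t c = 0; c < cols; ++c)
            acc += W[static_cast<std::size_t>(r) * cols + c] * x[c];
        y[static_cast<std::size_t>(r)] = acc;
    }
}
```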
-----
🔧 Solution in this Paper:
→ The paper introduces a dynamic parallel method that tracks each core's performance in real time.
→ A CPU runtime component monitors execution time and maintains performance ratios for each core.
→ A thread scheduler divides tasks proportionally based on core performance ratios.
→ The system continuously updates these ratios during inference to adapt to changing conditions (see the sketch below).
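
A minimal sketch of that idea in C++ (assumptions, not the paper's code: the `CoreStats` struct, the smoothing factor, and treating work as matrix rows are mine):

```cpp
// Minimal sketch of the dynamic split described above (not the paper's code;
// CoreStats, the smoothing factor, and "work = matrix rows" are assumptions).
#include <cstddef>
#include <vector>

struct CoreStats {
    double perf = 1.0;  // estimated relative throughput of this core (work items / second)
};

// Divide `total` work items across cores in proportion to their current
// performance estimates, so faster P-cores receive larger chunks.
std::vector<std::size_t> split_proportional(const std::vector<CoreStats>& cores,
                                            std::size_t total) {
    if (cores.empty()) return {};
    double sum = 0.0;
    for (const auto& c : cores) sum += c.perf;
    std::vector<std::size_t> chunks(cores.size(), 0);
    std::size_t assigned = 0;
    for (std::size_t i = 0; i + 1 < cores.size(); ++i) {
        chunks[i] = static_cast<std::size_t>(total * cores[i].perf / sum);
        assigned += chunks[i];
    }
    chunks.back() = total - assigned;  // remainder keeps the sum exact
    return chunks;
}

// After a parallel region, refresh a core's throughput estimate from the work
// it just finished and the time it took; exponential smoothing keeps the
// ratios stable while still adapting to frequency scaling or background load.
void update_perf(CoreStats& core, std::size_t work_done, double seconds,
                 double alpha = 0.3) {
    if (seconds <= 0.0 || work_done == 0) return;
    const double observed = static_cast<double>(work_done) / seconds;
    core.perf = (1.0 - alpha) * core.perf + alpha * observed;
}
```

In use, the runtime would time each core's chunk of a layer's computation, call update_perf with that measurement, and feed the refreshed estimates back into split_proportional for the next layer, so the split keeps tracking the ratio the cores actually deliver.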
-----
💡 Key Insights:
→ Performance ratios between P-cores and E-cores stabilize at roughly 3-3.5x during inference (worked example below)
→ Different phases (prefill vs decode) require different workload distributions
→ Dynamic adaptation is crucial for optimal hybrid CPU utilization
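
To make the first insight concrete (core counts here are hypothetical, not a specific Intel SKU): with 6 P-cores, 8 E-cores, and a stabilized ratio of 3x, each P-core should receive about 3 / (6×3 + 8) ≈ 11.5% of the rows and each E-core about 1/26 ≈ 3.8%, versus the flat 1/14 ≈ 7.1% an equal split hands every core.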
-----
📊 Results:
→ Achieved 90% memory bandwidth utilization on hybrid Intel CPUs
→ 65% compute performance improvement on Ultra-125H
→ 85% compute performance improvement on Core-12900K
→ 16 tokens/second generation speed