"Puzzle: Distillation-Based NAS for Inference-Optimized LLMs"

The podcast on this paper is generated with Google's Illuminate.

Puzzle turns bulky LLMs into lean machines by optimizing each layer for specific hardware

Neural architecture search meets hardware optimization to make LLMs run faster

The Puzzle framework optimizes LLM inference by transforming a trained model into a hardware-specific variant through neural architecture search and distillation. It achieves a 2.17x speedup while preserving 98.4% accuracy, and needs only 45B training tokens versus the original model's 15T.

-----

https://arxiv.org/abs/2411.19146

🔄 Original Problem:

→ LLMs need massive compute for inference, limiting their practical deployment despite impressive capabilities

→ Current architectures use uniform layers without considering hardware-specific optimization opportunities

-----

🧩 Solution in this Paper:

→ The Puzzle framework introduces blockwise local distillation to train alternative model components in parallel (sketched after this list)

→ It uses mixed-integer programming (MIP) to search for optimal architectures under hardware constraints (see the MIP sketch below)

→ The framework optimizes each layer independently, allowing different configurations for attention and feed-forward networks

→ Training requires only the parent model's weights, not the original training data
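
To make the blockwise local distillation idea concrete, here is a minimal sketch, assuming PyTorch. Names such as `distill_block_variant`, `parent_block`, `variant_block`, and `activation_loader` are hypothetical placeholders, not the paper's actual code.

```python
import torch
import torch.nn.functional as F

def distill_block_variant(parent_block, variant_block, activation_loader,
                          steps=1000, lr=1e-4, device="cuda"):
    """Train one alternative block to mimic its parent block.

    `activation_loader` yields input activations captured from the parent
    model on a small calibration corpus; the target is the parent block's
    own output on those activations, so no original training data and no
    end-to-end backprop through the full model are needed.
    """
    parent_block = parent_block.to(device).eval()
    variant_block = variant_block.to(device).train()
    opt = torch.optim.AdamW(variant_block.parameters(), lr=lr)

    for step, x in zip(range(steps), activation_loader):
        x = x.to(device)
        with torch.no_grad():
            target = parent_block(x)      # teacher output for this block
        pred = variant_block(x)           # candidate "puzzle piece" output
        loss = F.mse_loss(pred, target)   # local reconstruction loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return variant_block

# Because every (block, variant) pair trains against frozen parent activations,
# the pairs are independent and can be distilled in parallel across GPUs.
```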

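The search step can then be posed as a mixed-integer program: pick exactly one variant per layer to maximize quality under hardware budgets. A minimal sketch, assuming the PuLP solver; the per-variant scores, latencies, and memory costs are hypothetical inputs (in the paper they are estimated per block on the target GPU), and variants may differ in attention and feed-forward configuration.

```python
import pulp

def select_architecture(scores, latency, memory, latency_budget, memory_budget):
    """scores[i][j], latency[i][j], memory[i][j]: layer i, variant j."""
    L = len(scores)
    prob = pulp.LpProblem("puzzle_nas", pulp.LpMaximize)
    x = {(i, j): pulp.LpVariable(f"x_{i}_{j}", cat="Binary")
         for i in range(L) for j in range(len(scores[i]))}

    # Objective: maximize the summed per-block quality scores.
    prob += pulp.lpSum(scores[i][j] * x[i, j] for (i, j) in x)

    # Exactly one variant chosen per layer.
    for i in range(L):
        prob += pulp.lpSum(x[i, j] for j in range(len(scores[i]))) == 1

    # Hardware constraints on the assembled model.
    prob += pulp.lpSum(latency[i][j] * x[i, j] for (i, j) in x) <= latency_budget
    prob += pulp.lpSum(memory[i][j] * x[i, j] for (i, j) in x) <= memory_budget

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [next(j for j in range(len(scores[i])) if pulp.value(x[i, j]) > 0.5)
            for i in range(L)]
```
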
-----

💡 Key Insights:

→ Post-training, many LLM computations become redundant during inference

→ Hardware-specific architecture optimization outperforms uniform structures

→ Batch size significantly impacts hardware efficiency during the generation phase (see the sketch after this list)
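
A rough, back-of-the-envelope sketch of why batch size matters in the generation (decode) phase. The numbers are illustrative H100-class figures, not measurements, and KV-cache traffic is ignored.

```python
def decode_bound(params_b=51, batch=1, bytes_per_param=2,
                 peak_tflops=989, mem_bw_tbs=3.35):
    """Is a single decode step memory- or compute-bound at this batch size?"""
    weight_bytes = params_b * 1e9 * bytes_per_param    # weights streamed once per step
    flops = 2 * params_b * 1e9 * batch                 # ~2 FLOPs per weight per sequence
    t_mem = weight_bytes / (mem_bw_tbs * 1e12)         # time to stream the weights
    t_compute = flops / (peak_tflops * 1e12)           # time to do the math
    return "memory-bound" if t_mem > t_compute else "compute-bound"

# At small batch the step is dominated by streaming weights, so a leaner
# architecture pays off directly; at large batch the same weights are reused
# across sequences and compute starts to dominate.
print(decode_bound(batch=1), decode_bound(batch=512))
```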

-----

📊 Results:

→ Nemotron-51B achieves a 2.17x inference speedup on a single H100 GPU

→ It preserves 98.4% of the original model's capabilities

→ Training needs only 45B tokens vs. the original 15T
