"Puzzle: Distillation-Based NAS for Inference-Optimized LLMs"

The podcast on this paper is generated with Google's Illuminate.

Puzzle turns bulky LLMs into lean machines by optimizing each layer for specific hardware

Neural architecture search meets hardware optimization to make LLMs run faster

The Puzzle framework optimizes LLM inference by transforming a trained model into a hardware-specific variant through neural architecture search and distillation. It achieves a 2.17x speedup while preserving 98.4% accuracy, and needs only 45B training tokens versus the original model's 15T.

-----

https://arxiv.org/abs/2411.19146

🔄 Original Problem:

→ LLMs need massive compute for inference, limiting their practical deployment despite impressive capabilities

→ Current architectures use uniform layers without considering hardware-specific optimization opportunities

-----

🧩 Solution in this Paper:

→ The Puzzle framework introduces blockwise local distillation to train alternative model components in parallel (sketched after this list)

→ It uses mixed-integer programming (MIP) to search for optimal architectures under hardware constraints (see the MIP sketch below)

→ The framework optimizes each layer independently, allowing different configurations for attention and feed-forward networks

→ Training requires only the parent model's weights, not the original training data
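
To make the blockwise local distillation idea concrete, here is a minimal sketch, assuming PyTorch. Names such as `distill_block_variant`, `parent_block`, `variant_block`, and `activation_loader` are hypothetical placeholders, not the paper's actual code.

```python
import torch
import torch.nn.functional as F

def distill_block_variant(parent_block, variant_block, activation_loader,
                          steps=1000, lr=1e-4, device="cuda"):
    """Train one alternative block to mimic its parent block.

    `activation_loader` yields input activations captured from the parent
    model on a small calibration corpus; the target is the parent block's
    own output on those activations, so no original training data and no
    end-to-end backprop through the full model are needed.
    """
    parent_block = parent_block.to(device).eval()
    variant_block = variant_block.to(device).train()
    opt = torch.optim.AdamW(variant_block.parameters(), lr=lr)

    for step, x in zip(range(steps), activation_loader):
        x = x.to(device)
        with torch.no_grad():
            target = parent_block(x)      # teacher output for this block
        pred = variant_block(x)           # candidate "puzzle piece" output
        loss = F.mse_loss(pred, target)   # local reconstruction loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return variant_block

# Because every (block, variant) pair trains against frozen parent activations,
# the pairs are independent and can be distilled in parallel across GPUs.
```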

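The search step can then be posed as a mixed-integer program: pick exactly one variant per layer to maximize quality under hardware budgets. A minimal sketch, assuming the PuLP solver; the per-variant scores, latencies, and memory costs are hypothetical inputs (in the paper they are estimated per block on the target GPU), and variants may differ in attention and feed-forward configuration.

```python
import pulp

def select_architecture(scores, latency, memory, latency_budget, memory_budget):
    """scores[i][j], latency[i][j], memory[i][j]: layer i, variant j."""
    L = len(scores)
    prob = pulp.LpProblem("puzzle_nas", pulp.LpMaximize)
    x = {(i, j): pulp.LpVariable(f"x_{i}_{j}", cat="Binary")
         for i in range(L) for j in range(len(scores[i]))}

    # Objective: maximize the summed per-block quality scores.
    prob += pulp.lpSum(scores[i][j] * x[i, j] for (i, j) in x)

    # Exactly one variant chosen per layer.
    for i in range(L):
        prob += pulp.lpSum(x[i, j] for j in range(len(scores[i]))) == 1

    # Hardware constraints on the assembled model.
    prob += pulp.lpSum(latency[i][j] * x[i, j] for (i, j) in x) <= latency_budget
    prob += pulp.lpSum(memory[i][j] * x[i, j] for (i, j) in x) <= memory_budget

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [next(j for j in range(len(scores[i])) if pulp.value(x[i, j]) > 0.5)
            for i in range(L)]
```
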
-----

💡 Key Insights:

→ Post-training, many LLM computations become redundant during inference

→ Hardware-specific architecture optimization outperforms uniform structures

→ Batch size significantly impacts hardware efficiency during the generation phase (see the sketch after this list)
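
A rough, back-of-the-envelope sketch of why batch size matters in the generation (decode) phase. The numbers are illustrative H100-class figures, not measurements, and KV-cache traffic is ignored.

```python
def decode_bound(params_b=51, batch=1, bytes_per_param=2,
                 peak_tflops=989, mem_bw_tbs=3.35):
    """Is a single decode step memory- or compute-bound at this batch size?"""
    weight_bytes = params_b * 1e9 * bytes_per_param    # weights streamed once per step
    flops = 2 * params_b * 1e9 * batch                 # ~2 FLOPs per weight per sequence
    t_mem = weight_bytes / (mem_bw_tbs * 1e12)         # time to stream the weights
    t_compute = flops / (peak_tflops * 1e12)           # time to do the math
    return "memory-bound" if t_mem > t_compute else "compute-bound"

# At small batch the step is dominated by streaming weights, so a leaner
# architecture pays off directly; at large batch the same weights are reused
# across sequences and compute starts to dominate.
print(decode_bound(batch=1), decode_bound(batch=512))
```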

-----

📊 Results:

→ Nemotron-51B achieves a 2.17x inference speedup on a single H100 GPU

→ It preserves 98.4% of the original model's capabilities

→ Training needs only 45B tokens vs. the original 15T
