
LLM Pruning and Distillation in Practice: The Minitron Approach

This podcast was generated with Google's Illuminate.

🚀 A masterpiece of a paper - NVIDIA's Pruning and Distillation

Addresses the industry's need for compact, resource-efficient models that maintain performance.

The work presents practical structured compression techniques for LLMs, combining several pruning approaches with knowledge distillation-based retraining.

📚 https://arxiv.org/pdf/2408.11796

Key Insights from this Paper 💡:

• Combining pruning with distillation effectively compresses LLMs

• Fine-tuning the teacher model on the distillation dataset improves results

• Width pruning outperforms depth pruning for a given parameter budget

• Distillation-based training surpasses conventional methods with fewer tokens

Solution in this Paper 🛠️:

• Two-step compression: pruning followed by knowledge distillation

• Pruning strategies: depth pruning and joint hidden/attention/MLP pruning

• Activation-based importance estimation for pruning

• Logit-only distillation with forward KL divergence loss (a sketch of this loss and the importance scoring appears after this list)

• Teacher correction: fine-tuning the teacher on the distillation dataset before pruning and distillation
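
As a rough illustration of two of the components above, the sketch below (not the authors' code) scores MLP neurons by their mean activation magnitude over a small calibration set and computes a logit-only forward-KL distillation loss. It assumes a PyTorch causal LM with Llama-style layers; names such as `calibration_loader` and `layer.mlp.up_proj` are illustrative assumptions.

```python
# Minimal sketch of activation-based importance estimation and
# logit-only forward-KL distillation, assuming a Hugging Face-style
# causal LM with Llama-style transformer layers.
import torch
import torch.nn.functional as F

@torch.no_grad()
def neuron_importance(model, calibration_loader, layer):
    """Score the MLP intermediate neurons of one transformer layer by mean
    activation magnitude over a calibration set; low-scoring neurons are
    candidates for width pruning."""
    acts = []

    def hook(_module, _inputs, output):
        # Aggregate |activation| over the batch and sequence dimensions,
        # leaving one value per intermediate neuron.
        acts.append(output.abs().mean(dim=(0, 1)))

    handle = layer.mlp.up_proj.register_forward_hook(hook)
    for batch in calibration_loader:   # batch: dict of input tensors
        model(**batch)
    handle.remove()
    return torch.stack(acts).mean(dim=0)

def forward_kl_distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """Logit-only distillation loss: KL(teacher || student) over output
    distributions, i.e. the student is trained to cover the teacher's logits."""
    t_logprobs = F.log_softmax(teacher_logits / temperature, dim=-1)
    s_logprobs = F.log_softmax(student_logits / temperature, dim=-1)
    # forward KL: sum_x p_teacher(x) * (log p_teacher(x) - log p_student(x))
    return F.kl_div(s_logprobs, t_logprobs, log_target=True,
                    reduction="batchmean") * temperature ** 2
```

In the paper's pipeline, importance scores like these drive the joint hidden/attention/MLP (width) pruning step, and the forward-KL loss on logits is the distillation signal used during retraining.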

Results 📊:

• Compressed Llama 3.1 8B to 4B and Mistral NeMo 12B to 8B parameters

• MN-Minitron-8B outperforms Llama 3.1 8B using 40x fewer training tokens

• Llama-3.1-Minitron-4B models match teacher performance with 150x fewer tokens

• Width-pruned variant consistently outperforms depth-pruned for Llama 3.1

• 2.7x and 1.8x inference speedup for depth and width pruned 4B models respectively
