🚀 A masterpiece of a paper - NVIDIA's Pruning and Distillation
Addresses the industry's need for compact, resource-efficient models that maintain performance.
The work presents practical structured compression techniques for LLMs, combining multiple pruning strategies with knowledge-distillation-based retraining.
📚 https://arxiv.org/pdf/2408.11796
Key Insights from this Paper 💡:
• Combining pruning with distillation effectively compresses LLMs
• Fine-tuning the teacher model on the distillation dataset improves results
• Width pruning outperforms depth pruning for a given parameter budget
• Distillation-based retraining surpasses conventional training while using far fewer tokens
Solution in this Paper 🛠️:
• Two-step compression: pruning followed by knowledge distillation
• Pruning strategies: depth pruning and joint hidden/attention/MLP pruning
• Activation-based importance estimation for pruning (see the first sketch after this list)
• Logit-only distillation with forward KL divergence loss (see the second sketch after this list)
• Teacher correction: fine-tuning the teacher on the distillation dataset before pruning and distillation
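To make the activation-based importance idea concrete, here is a minimal sketch for the MLP-width case: mean absolute activations over a small calibration set rank the intermediate neurons, and only the top-k are kept. The toy module, its sizes, the hook placement, and the calibration data are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

# Toy stand-in for one transformer MLP block (names and sizes are illustrative).
mlp = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))

# Accumulate mean absolute activation per intermediate neuron over a
# small calibration set (activation-based importance estimation).
importance = torch.zeros(2048)
num_tokens = 0

def hook(_module, _inputs, out):
    global num_tokens
    # out: (batch, seq, 2048) -> sum |activation| over batch and sequence dims
    importance.add_(out.abs().sum(dim=(0, 1)))
    num_tokens += out.shape[0] * out.shape[1]

handle = mlp[1].register_forward_hook(hook)  # hook after the nonlinearity

calibration = [torch.randn(4, 128, 512) for _ in range(8)]  # toy calibration data
with torch.no_grad():
    for batch in calibration:
        mlp(batch)
handle.remove()
importance /= num_tokens

# Keep the most important neurons for the target width (e.g. prune 2048 -> 1024).
keep = importance.topk(1024).indices.sort().values
pruned_fc1 = nn.Linear(512, 1024)
pruned_fc2 = nn.Linear(1024, 512)
pruned_fc1.weight.data = mlp[0].weight.data[keep].clone()
pruned_fc1.bias.data = mlp[0].bias.data[keep].clone()
pruned_fc2.weight.data = mlp[2].weight.data[:, keep].clone()
pruned_fc2.bias.data = mlp[2].bias.data.clone()
```

The same pattern (aggregate activations on a calibration set, then drop the lowest-scoring units) extends to attention heads and hidden dimensions in the joint width-pruning setup.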
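And a minimal sketch of the logit-only distillation loss, assuming standard PyTorch ops: the student's log-probabilities are pushed toward the teacher's output distribution with a forward KL divergence. The temperature knob is an illustrative assumption, not a detail taken from the paper.

```python
import torch
import torch.nn.functional as F

def logit_distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """Forward KL divergence KL(teacher || student) over the vocabulary,
    averaged over all token positions (logit-only distillation)."""
    vocab = student_logits.size(-1)
    # Soften both distributions with the same (illustrative) temperature.
    t_probs = F.softmax(teacher_logits / temperature, dim=-1).view(-1, vocab)
    s_log_probs = F.log_softmax(student_logits / temperature, dim=-1).view(-1, vocab)
    # F.kl_div(input=log-probs, target=probs) computes KL(target || input);
    # 'batchmean' yields the mean KL per token position after the reshape.
    return F.kl_div(s_log_probs, t_probs, reduction="batchmean") * temperature**2

# Toy usage: (batch, seq_len, vocab) logits from student and teacher.
student_logits = torch.randn(2, 16, 32000, requires_grad=True)
teacher_logits = torch.randn(2, 16, 32000)
loss = logit_distillation_loss(student_logits, teacher_logits)
loss.backward()
```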
Results 📊:
• Compressed Llama 3.1 8B to 4B and Mistral NeMo 12B to 8B parameters
• MN-Minitron-8B outperforms Llama 3.1 8B using 40x fewer training tokens
• Llama-3.1-Minitron-4B models approach teacher performance with 150x fewer training tokens
• Width-pruned variant consistently outperforms the depth-pruned one for Llama 3.1
• 2.7x and 1.8x inference speedup for the depth- and width-pruned 4B models, respectively