LLM-Neo merges knowledge distillation with low-rank adaptation for efficient model compression
Fusing Knowledge Distillation (KD) with LoRA enables memory-efficient model training
https://arxiv.org/abs/2411.06839
🎯 Original Problem:
Knowledge Distillation (KD) transfers a teacher's knowledge but updates every student parameter, making it compute- and memory-hungry, while Low-Rank Adaptation (LoRA) is parameter-efficient but does not leverage a teacher model. Neither technique alone offers both knowledge transfer and parameter efficiency.
-----
🔧 Solution in this Paper:
→ LLM-Neo combines KD with LoRA by adding a low-rank branch to the student model that inherits knowledge from the teacher
→ Trains with a weighted combination of cross-entropy loss on the dataset labels and KL divergence between the student and teacher output distributions (see the sketch below)
→ Both the supervised and distillation signals flow through the low-rank branch, so only its parameters are updated
→ Recommends a relatively large rank (128) with a learning rate around 2e-4 for best performance
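A minimal PyTorch sketch of the combined objective, assuming a tunable KD weight and softmax temperature. The function name, default weighting, and temperature handling are illustrative; the paper's exact loss formulation may differ.

```python
import torch
import torch.nn.functional as F

def neo_style_loss(student_logits, teacher_logits, labels,
                   kd_weight=0.5, temperature=1.0):
    """Weighted sum of supervised cross-entropy and teacher-student KL.

    kd_weight and temperature are illustrative defaults, not values
    reported in the paper.
    """
    vocab = student_logits.size(-1)
    s = student_logits.view(-1, vocab)
    t = teacher_logits.view(-1, vocab)

    # Supervised term: next-token cross-entropy against the dataset labels.
    ce = F.cross_entropy(s, labels.view(-1), ignore_index=-100)

    # Distillation term: KL divergence from the frozen teacher's distribution.
    kld = F.kl_div(
        F.log_softmax(s / temperature, dim=-1),
        F.softmax(t / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    return (1.0 - kd_weight) * ce + kd_weight * kld

# Toy shapes: batch of 2 sequences, 8 tokens, vocab of 32000.
student_logits = torch.randn(2, 8, 32000, requires_grad=True)
teacher_logits = torch.randn(2, 8, 32000)
labels = torch.randint(0, 32000, (2, 8))
loss = neo_style_loss(student_logits, teacher_logits, labels)
loss.backward()  # in the full setup, only the low-rank branch is trainable
```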
-----
💡 Key Insights:
→ Larger rank values consistently yield better results
→ Higher ranks call for lower learning rates to stay stable (see the configuration sketch after this list)
→ Compatible with memory-optimization techniques like ZeRO-1/2
→ Scales effectively with increased training data size
→ Works well with LoRA variants like MoSLoRA
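A hedged configuration sketch using Hugging Face PEFT and Transformers to reflect these settings. The checkpoint id, alpha scaling, target modules, and DeepSpeed config path are placeholders, not values from the paper.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

# Attach the low-rank branch to the student; rank 128 reflects the finding
# that larger ranks help. "student-model-id" is a placeholder checkpoint.
student = AutoModelForCausalLM.from_pretrained("student-model-id")
lora_cfg = LoraConfig(
    r=128,                      # larger rank, as recommended
    lora_alpha=256,             # assumed 2*r scaling; not stated in the post
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed targets
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
student = get_peft_model(student, lora_cfg)

# Lower learning rate to keep the high-rank branch stable; ZeRO-2 via DeepSpeed
# for memory savings. The JSON config path is hypothetical.
args = TrainingArguments(
    output_dir="llm-neo-student",
    learning_rate=2e-4,
    bf16=True,
    deepspeed="ds_zero2_config.json",
)
```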
-----
📊 Results:
→ Achieves a 39.21 average benchmark score on Llama 3.1, 0.87 points higher than LoRA
→ Reduces GPU memory usage by 25% compared to traditional KD
→ Maintains performance while significantly reducing training time relative to standard KD
→ Shows consistent improvement across Llama 2 and Llama 3.1 architectures