"LLM-Neo: Parameter Efficient Knowledge Distillation for Large Language Models"

The podcast on this paper is generated with Google's Illuminate.

LLM-Neo merges knowledge distillation with low-rank adaptation for efficient model compression

A smart fusion of Knowledge Distillation (KD) and LoRA enables memory-efficient model training

https://arxiv.org/abs/2411.06839

🎯 Original Problem:

Knowledge Distillation (KD) and Low-Rank Adaptation (LoRA) are two separate routes to efficient LLM training: KD transfers a teacher's knowledge to a student but updates all student parameters, which is costly in compute and memory, while LoRA is parameter-efficient but does not by itself inherit knowledge from a teacher.

-----

🔧 Solution in this Paper:

→ LLM-Neo combines KD with LoRA by introducing a low-rank branch in the student model to inherit knowledge from the teacher model

→ Uses a weighted combination of the cross-entropy loss on the dataset labels and the KL divergence between student and teacher outputs (see the sketch after this list)

→ Routes both the supervised and distillation signals through the low-rank branch, so only the adapter parameters are updated

→ Uses a relatively large rank (128) with a learning rate of around 2e-4 for best performance
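
For concreteness, here is a minimal sketch of what this combined objective could look like with PyTorch, Hugging Face Transformers, and the PEFT library. The checkpoint names, the mixing weight alpha, and the lora_alpha value are illustrative assumptions, not the paper's released code.

```python
# Minimal sketch of the LLM-Neo objective: a LoRA branch on the student is
# optimized with a weighted sum of supervised cross-entropy and KL divergence
# against a frozen teacher. Checkpoint names and alpha are illustrative.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

teacher = AutoModelForCausalLM.from_pretrained("teacher-checkpoint").eval()
student = AutoModelForCausalLM.from_pretrained("student-checkpoint")

# Attach the low-rank branch; only these adapter weights receive gradients.
student = get_peft_model(
    student, LoraConfig(r=128, lora_alpha=256, task_type="CAUSAL_LM")
)

def neo_loss(batch, alpha=0.5):
    # batch contains input_ids, attention_mask, and labels
    out = student(**batch)
    ce = out.loss                              # cross-entropy on dataset labels
    with torch.no_grad():
        t_logits = teacher(**batch).logits     # teacher predictions, no gradients
    kl = F.kl_div(
        F.log_softmax(out.logits, dim=-1),     # student log-probabilities
        F.softmax(t_logits, dim=-1),           # teacher probabilities
        reduction="batchmean",
    )
    return alpha * ce + (1 - alpha) * kl       # weighted combination of both losses
```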

-----

💡 Key Insights:

→ Larger rank values consistently yield better results

→ Higher ranks require lower learning rates for stability (see the configuration sketch after this list)

→ Shows compatibility with memory optimization techniques like ZeRO-1/2

→ Scales effectively with increased training data size

→ Works well with LoRA variants like MoSLoRA
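
As a concrete illustration of the rank/learning-rate pairing above, the following is a minimal configuration sketch using the PEFT library; the checkpoint name and lora_alpha value are illustrative assumptions, and a LoRA variant such as MoSLoRA would slot in where the plain LoRA adapter is created here.

```python
# Configuration sketch for the pairing noted above: a larger LoRA rank (r=128)
# combined with a modest learning rate (~2e-4). Only the low-rank adapter
# parameters require gradients, so only they are passed to the optimizer.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("student-checkpoint")
model = get_peft_model(
    model, LoraConfig(r=128, lora_alpha=256, task_type="CAUSAL_LM")
)

trainable = [p for p in model.parameters() if p.requires_grad]  # adapter weights only
optimizer = torch.optim.AdamW(trainable, lr=2e-4)               # higher rank -> lower learning rate
```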

-----

📊 Results:

→ Achieves a 39.21 average benchmark score on Llama 3.1, 0.87 higher than LoRA

→ Reduces GPU memory usage by 25% compared to traditional KD

→ Maintains performance while significantly reducing training time

→ Shows consistent improvement across Llama 2 and Llama 3.1 architectures
