Parameter sharing in LoRA experts enables efficient multi-task learning without performance loss.
MoSLD introduces a parameter-sharing mechanism in LoRA for multi-task learning, reducing parameters while maintaining performance across different tasks.
-----
https://arxiv.org/abs/2412.08946
🤔 Original Problem:
LoRA excels at single-task fine-tuning but struggles in multi-task settings because of data conflicts and interference between tasks. Combining LoRA with Mixture-of-Experts (MoE) helps, but it inflates the parameter count and suffers from knowledge forgetting.
-----
🔧 Solution in this Paper:
→ MoSLD shares the upper projection matrix (A) among different experts while keeping the lower projection matrix (B) task-specific.
→ The shared matrix captures general knowledge across tasks, while individual matrices maintain task-specific features.
→ A dropout strategy on matrix A balances parameter updates and reduces overfitting.
→ The router mechanism selects the top-K experts for each input, enabling dynamic task handling (sketched below).
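To make these pieces concrete, here is a minimal PyTorch sketch (not the authors' implementation) of one such layer: a frozen base linear layer, a single shared matrix A with dropout applied to it, expert-specific matrices B, and a top-K router. The class name `MoSLDLayer` and all shapes and hyperparameters are illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of a MoSLD-style layer:
# one shared LoRA matrix A (with dropout applied to it), per-expert
# matrices B, and a top-K router. Names, shapes, and hyperparameters
# are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoSLDLayer(nn.Module):
    def __init__(self, d_in, d_out, rank=8, num_experts=4, top_k=2, a_dropout=0.1):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)    # frozen pretrained weight
        self.base.weight.requires_grad = False
        # Shared matrix A: captures general, cross-task knowledge.
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.a_dropout = nn.Dropout(a_dropout)            # dropout on the shared matrix
        # Expert-specific matrices B: hold task-specific features.
        self.B = nn.Parameter(torch.zeros(num_experts, d_out, rank))
        self.router = nn.Linear(d_in, num_experts, bias=False)
        self.top_k = top_k

    def forward(self, x):                                 # x: (batch, d_in)
        out = self.base(x)
        # Route each input to its top-K experts.
        gate_logits = self.router(x)
        weights, idx = torch.topk(gate_logits, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)              # (batch, top_k)
        # Shared low-rank features, regularized by dropout on A.
        h = F.linear(x, self.a_dropout(self.A))           # (batch, rank)
        for k in range(self.top_k):
            B_k = self.B[idx[:, k]]                       # (batch, d_out, rank)
            expert_out = torch.bmm(B_k, h.unsqueeze(-1)).squeeze(-1)
            out = out + weights[:, k:k + 1] * expert_out
        return out
```

Because A is shared, adding an expert only adds a B matrix, which is where the savings over a standard LoRA-MoE (one A and one B per expert) come from.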
-----
🎯 Key Insights:
→ Parameter sharing in LoRA can effectively balance general and task-specific knowledge (see the rough count after this list)
→ Dropout on shared parameters prevents overfitting and improves information exchange
→ Layer-wise expert allocation improves model efficiency
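As a rough illustration of the first insight, the arithmetic below compares per-layer trainable parameters of a conventional LoRA-MoE (a separate A and B per expert) with a shared-A setup. The dimensions are assumed for illustration, not taken from the paper.

```python
# Back-of-the-envelope parameter count (dimensions are assumed, not from the paper):
# a standard LoRA-MoE keeps a separate (A, B) pair per expert, while the
# shared-A setup keeps one A for all experts.
d_in, d_out, rank, num_experts = 4096, 4096, 8, 4

lora_moe = num_experts * (rank * d_in + d_out * rank)  # per-expert A and B: 262,144
shared_a = rank * d_in + num_experts * d_out * rank    # one shared A:       163,840

print(f"LoRA-MoE trainable params per layer: {lora_moe:,}")
print(f"Shared-A trainable params per layer: {shared_a:,}")
```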
-----
📊 Results:
→ Reduces trainable parameters to 20.6% of full-parameter fine-tuning
→ Outperforms baseline models by 1.56% in mixture settings
→ Shows consistent improvements across model sizes (7B, 13B, 33B)