Merging pre- and post-fine-tuned LLMs creates safer models without any safety-data overhead.
This paper proposes a simple two-step approach that preserves safety in fine-tuned LLMs by merging the weights of the pre- and post-fine-tuned models, maintaining task performance without extra safety data.
-----
https://arxiv.org/abs/2412.19512
🔍 Original Problem:
Fine-tuning LLMs on downstream tasks often degrades their safety alignment, even with benign datasets. Current solutions require additional safety data, which is impractical and resource-intensive.
-----
🛠️ Solution in this Paper:
→ Step 1: Fine-tune the safety-aligned base model on the downstream task using standard supervised fine-tuning
→ Step 2: Merge the weights of the pre-fine-tuning (safety-aligned) and post-fine-tuning models via interpolation
→ The merging ratio λ controls the relative contribution of the base and fine-tuned models
→ Simple linear merging showed strong results compared to more complex methods like SLERP and DARE
→ The optimal merging ratio was found to be around 0.5–0.6, balancing safety and task performance (a minimal sketch of the linear merge follows below)
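The heart of Step 2 is just a convex combination of the two checkpoints' weights. Below is a minimal sketch of that linear merge using PyTorch and Hugging Face Transformers; the model names, output path, and the convention that λ weights the safety-aligned base model are illustrative assumptions on my part, not the paper's exact setup.

```python
import torch
from transformers import AutoModelForCausalLM

# Hypothetical paths / values for illustration only.
BASE_MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"  # safety-aligned model before fine-tuning (assumed)
TUNED_MODEL = "./my-downstream-finetuned-model"     # same model after supervised fine-tuning (placeholder)
LAMBDA = 0.5  # merging ratio; the paper reports ~0.5-0.6 as a good safety/performance balance

base = AutoModelForCausalLM.from_pretrained(BASE_MODEL, torch_dtype=torch.bfloat16)
tuned = AutoModelForCausalLM.from_pretrained(TUNED_MODEL, torch_dtype=torch.bfloat16)

base_sd, tuned_sd = base.state_dict(), tuned.state_dict()
merged_sd = {}
for name, base_param in base_sd.items():
    if base_param.dtype.is_floating_point:
        # Linear interpolation: theta_merged = lambda * theta_base + (1 - lambda) * theta_tuned
        merged_sd[name] = LAMBDA * base_param + (1.0 - LAMBDA) * tuned_sd[name]
    else:
        # Leave non-float buffers (if any) untouched
        merged_sd[name] = base_param

base.load_state_dict(merged_sd)
base.save_pretrained("./merged-safety-preserving-model")
```

Under this (assumed) convention, sweeping λ from 0 (fully fine-tuned) to 1 (fully safety-aligned base) traces out the safety vs. task-performance trade-off the paper describes.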
-----
💡 Key Insights:
→ Safety can be preserved without additional safety data
→ Simple linear merging outperforms complex methods
→ The method works across multiple model sizes and families
→ A trade-off exists between task performance and safety, controlled by the merging ratio λ
-----
📊 Results:
→ Reduced Attack Success Rate by up to 30% while maintaining task performance
→ Effective across 4 downstream tasks: reasoning, medical assistance, code generation, tool usage
→ Successfully tested on Llama-3-8B, Llama-3.1-8B, and Gemma-2B models
-----
Are you into AI and LLMs❓ Join my daily AI newsletter. I will send you 7 emails a week analyzing the highest signal AI developments. ↓↓
🎉 https://rohanpaul.substack.com/