
"Safeguard Fine-Tuned LLMs Through Pre- and Post-Tuning Model Merging"

A podcast on this paper was generated with Google's Illuminate.

Merging pre- and post-fine-tuned LLMs creates safer models without the overhead of extra safety data.

This paper proposes a simple two-step approach that preserves safety in fine-tuned LLMs by merging the weights of the pre- and post-fine-tuned models, maintaining downstream performance without extra safety data.

-----

https://arxiv.org/abs/2412.19512

🔍 Original Problem:

Fine-tuning LLMs on downstream tasks often degrades their safety alignment, even with benign datasets. Current solutions require additional safety data, which is impractical and resource-intensive.

-----

🛠️ Solution in this Paper:

→ Step 1: Fine-tune base model on downstream tasks using standard supervised fine-tuning

→ Step 2: Merge the weights of the safety-aligned pre-fine-tuned model and the post-fine-tuned model via weight interpolation

→ The merging ratio λ controls the relative contribution of the base and fine-tuned models (see the sketch after this list)

→ Linear merging showed strong results compared to complex methods like SLERP and DARE

→ The optimal merging ratio was found to be around 0.5 to 0.6, balancing safety and task performance
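As a concrete illustration of Step 2, here is a minimal sketch of linear weight interpolation between the two checkpoints, assuming Hugging Face Transformers-style PyTorch models. The model paths, the helper function name, and the convention that λ weights the fine-tuned model are illustrative assumptions, not the authors' released code.

```python
import torch
from transformers import AutoModelForCausalLM

def linear_merge(base_model_name, finetuned_model_name, lam=0.5):
    """Linearly interpolate the weights of the pre- and post-fine-tuned models.

    merged = (1 - lam) * base + lam * fine-tuned
    (assumed convention: lam is the weight on the fine-tuned model)
    """
    base = AutoModelForCausalLM.from_pretrained(base_model_name, torch_dtype=torch.float32)
    tuned = AutoModelForCausalLM.from_pretrained(finetuned_model_name, torch_dtype=torch.float32)

    base_state = base.state_dict()
    tuned_state = tuned.state_dict()
    merged_state = {}
    for name, base_param in base_state.items():
        # Interpolate each tensor; both checkpoints must share the same architecture.
        merged_state[name] = (1.0 - lam) * base_param + lam * tuned_state[name]

    # Reuse the base model object to hold the merged weights.
    base.load_state_dict(merged_state)
    return base

# Example with hypothetical paths, at the ~0.5-0.6 ratio the paper reports as a sweet spot:
# merged = linear_merge("meta-llama/Meta-Llama-3-8B-Instruct", "./llama3-8b-medical-sft", lam=0.5)
# merged.save_pretrained("./llama3-8b-medical-merged")
```

In this view, λ is the safety/performance dial the post describes: a higher λ favors downstream task accuracy, a lower λ preserves more of the original safety alignment, and values around 0.5 to 0.6 balance the two.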

-----

💡 Key Insights:

→ Safety can be preserved without additional safety data

→ Simple linear merging outperforms complex methods

→ Method works across multiple model sizes and families

→ Trade-off exists between task performance and safety based on merging ratio

-----

📊 Results:

→ Reduced Attack Success Rate by up to 30% while maintaining task performance

→ Effective across 4 downstream tasks: reasoning, medical assistance, code generation, tool usage

→ Successfully tested on Llama-3-8B, Llama-3.1-8B, and Gemma-2B models

------

Are you into AI and LLMs❓ Join my daily AI newsletter. I will send you 7 emails a week analyzing the highest signal AI developments. ↓↓

🎉 https://rohanpaul.substack.com/
