Turns out AI safety isn't just about muting the bad neurons; it's teamwork!
Safety fine-tuning is more complex than just turning down toxic neurons
DPO reduces toxicity through multiple neuron groups, not just by dampening toxic ones.
https://arxiv.org/abs/2411.06424
🤔 Original Problem:
Safety fine-tuning algorithms like DPO reduce harmful outputs in LLMs, but we don't fully understand how they work internally. Previous research claimed DPO mainly works by dampening toxic neurons, but this explanation seems incomplete.
-----
🔧 Solution in this Paper:
→ Used GPT-2 medium with 355M parameters to track how toxic features change across neurons after DPO training
→ Projected neuron activation changes onto a toxicity probe to measure per-neuron toxicity adjustments
→ Identified four key neuron groups: toxic neurons activated less positively (TP−), anti-toxic neurons activated less negatively (AN−), toxic neurons activated more negatively (TN+), and anti-toxic neurons activated more positively (AP+) (a rough sketch of this projection and grouping follows this list)
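To make the projection and grouping step concrete, here is a minimal sketch, not the paper's code. It assumes you already have a linear toxicity probe vector, the MLP output ("value") vectors per neuron, and mean neuron activations before and after DPO; the variable names and the exact sign conventions for the four groups are my assumptions.

```python
import numpy as np

def toxicity_contribution_change(probe, value_vectors, act_base, act_dpo):
    """Per-neuron change in projected toxicity: Δactivation × (value vector · probe).

    probe: (d_model,) linear toxicity probe direction
    value_vectors: (n_neurons, d_model) MLP output weight rows
    act_base, act_dpo: (n_neurons,) mean activations before / after DPO
    """
    alignment = value_vectors @ probe      # >0: neuron writes in the toxic direction
    delta_act = act_dpo - act_base         # how DPO shifted each neuron's mean activation
    return alignment * delta_act, alignment, delta_act

def group_neurons(alignment, act_base, act_dpo):
    """Split neurons into the four groups; sign conventions here are assumptions."""
    delta = act_dpo - act_base
    toxic, anti = alignment > 0, alignment < 0
    return {
        "TP-": toxic & (delta < 0) & (act_base > 0),  # toxic neurons firing less positively
        "TN+": toxic & (delta < 0) & (act_dpo < 0),   # toxic neurons pushed more negative
        "AN-": anti & (delta > 0) & (act_base < 0),   # anti-toxic neurons less negative
        "AP+": anti & (delta > 0) & (act_dpo > 0),    # anti-toxic neurons more positive
    }
```

All four groups move a neuron's projected contribution in the anti-toxic direction, which is why the toxicity reduction can be decomposed across them.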
-----
💡 Key Insights:
→ Only 31.8% of toxicity reduction comes from dampening toxic neurons
→ 37.3% comes from reducing negative activations in anti-toxic neurons
→ The remaining 30.9% comes from promoting anti-toxicity through the other neuron groups (a toy breakdown of these shares is sketched after this list)
→ DPO creates a balanced trade-off where some neurons increase toxicity while others decrease it
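Continuing the same hypothetical sketch (not the paper's exact attribution method), the per-group shares above could be approximated by attributing the total projected-toxicity drop to each group:

```python
def group_shares(contrib_change, groups):
    """Percent of the overall projected-toxicity reduction attributable to each group."""
    total_drop = contrib_change[contrib_change < 0].sum()  # total reduction along the probe
    return {
        name: 100.0 * contrib_change[mask & (contrib_change < 0)].sum() / total_drop
        for name, mask in groups.items()
    }
```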
-----
📊 Results:
→ Ablating the top toxic neurons alone yields a toxicity score of 0.403 vs DPO's 0.208
→ Patching all four neuron groups together reduces toxicity to 0.114 (a minimal harness for this style of experiment is sketched after this list)
→ Perplexity stays stable at around 21-23, showing model capability is preserved
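Below is a minimal, hypothetical harness for this style of neuron-level intervention on Hugging Face GPT-2 medium. It zeroes a placeholder set of neurons via a forward pre-hook on each MLP's down-projection; the paper's patching experiment instead substitutes activations from the DPO model, and the layer/neuron indices here are illustrative only.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2-medium")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")

# layer -> neuron indices to intervene on (placeholders, not values from the paper)
target_neurons = {18: [123, 456], 19: [789]}

def make_pre_hook(indices):
    def pre_hook(module, inputs):
        hidden = inputs[0].clone()      # post-GELU MLP activations (batch, seq, 4*d_model)
        hidden[..., indices] = 0.0      # ablate the selected neurons (patching would copy DPO values)
        return (hidden,) + inputs[1:]
    return pre_hook

handles = [
    model.transformer.h[layer].mlp.c_proj.register_forward_pre_hook(make_pre_hook(idx))
    for layer, idx in target_neurons.items()
]

# Generate with the intervened model; toxicity and perplexity would then be
# scored on an evaluation set to compare against DPO.
out = model.generate(**tokenizer("You are such a", return_tensors="pt"), max_new_tokens=20)
print(tokenizer.decode(out[0]))

for h in handles:
    h.remove()  # restore the original model
```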