Risk-averse finetuning (RA-RLHF) minimizes toxic content generation in LLMs by optimizing CVaR for safer online discourse.
This paper introduces risk-averse fine-tuning for Large Language Models (LLMs) to mitigate toxic content generation.
-----
https://arxiv.org/abs/2501.06911
Original Problem 😠:
→ LLMs, trained on vast internet data, can generate harmful content; even aligned versions remain susceptible.
→ Current safety methods either need extensive human feedback or compromise overall performance.
-----
Solution in this Paper 💡:
→ This paper proposes Risk-Averse Reinforcement Learning from Human Feedback (RA-RLHF).
→ RA-RLHF optimizes Conditional Value at Risk (CVaR) to minimize toxicity, especially in rare high-stakes events (sketched below).
→ It uses a soft-risk scheduling mechanism and balances exposure to positive and challenging scenarios during training.
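To make the CVaR idea concrete, here is a minimal, hypothetical sketch (not the paper's actual code; names like `alpha` and `episode_returns` are illustrative): instead of maximizing the mean return over a batch of generations, only the worst alpha-fraction of episodes, i.e., the most toxic or lowest-reward generations, is kept for the policy-gradient update.

```python
import torch

def cvar_episode_mask(episode_returns: torch.Tensor, alpha: float) -> torch.Tensor:
    """Select the worst alpha-fraction of episodes in a batch.

    episode_returns: shape (batch,) -- one scalar return per generated sequence.
    alpha: fraction of the batch to keep (alpha = 1.0 recovers standard RLHF).
    """
    batch_size = episode_returns.shape[0]
    k = max(1, int(alpha * batch_size))                     # tail episodes to keep
    threshold = torch.kthvalue(episode_returns, k).values   # empirical VaR_alpha
    return episode_returns <= threshold                     # lower tail of returns

# Usage inside a (simplified) policy-gradient step:
# mask = cvar_episode_mask(returns, alpha=0.2)
# loss = -(advantages[mask] * logprobs[mask]).mean()   # optimize only the risky tail
```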
-----
Key Insights from this Paper 🤔:
→ Optimizing CVaR improves LLM performance in avoiding toxic output while maintaining effectiveness in generative tasks.
→ Soft-risk scheduling and balanced data exposure are crucial for training stable and effective risk-averse policies (see the schedule sketch after this list).
→ RA-RLHF is most effective on the riskiest prompts, where the generation task is hardest (Table 1 in the paper).
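A minimal sketch of the soft-risk schedule, under the assumption (not confirmed against the paper's code) that training starts risk-neutral and only gradually narrows to the target tail fraction; `target_alpha` and `warmup_frac` are illustrative names:

```python
def risk_level(step: int, total_steps: int, target_alpha: float = 0.2,
               warmup_frac: float = 0.25) -> float:
    """Fraction of the batch to keep at a given training step.

    Starts at 1.0 (plain RLHF on all episodes) for the warmup phase,
    then decays linearly to `target_alpha`, so the policy first learns
    the task before specializing on the riskiest prompts.
    """
    warmup_steps = int(warmup_frac * total_steps)
    if step < warmup_steps:
        return 1.0
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return max(target_alpha, 1.0 - progress * (1.0 - target_alpha))
```

Delaying the risk-averse objective this way avoids collapsing the policy early, when the model has not yet seen enough positive examples to balance the challenging ones.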
-----
Results 📈:
→ RA-RLHF outperforms baselines on average reward for the worst-case prompts in IMDB-Gen, Jigsaw-Gen, and RealToxicityPrompts-Gen.
→ RA-RLHF achieves the highest text diversity across datasets.
→ RA-RLHF shows a slight perplexity increase, likely due to its more aggressive adjustments for sentiment modification and toxicity mitigation.