Chain-of-thought tuning enhances the safety of conversational AI systems.
Fine-tuning and aligning chain-of-thought responses makes LLMs more effective input moderation guardrails: the model both detects malicious queries more reliably and explains each verdict, roughly as in the sketch below.
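A minimal sketch of what such a guardrail looks like at inference time. The model ID, prompt template, and the "Verdict: SAFE/UNSAFE" output format are illustrative assumptions, not the paper's exact setup.

```python
# Sketch: an LLM used as a chain-of-thought input moderation guardrail.
# Model ID, prompt wording, and verdict format are assumptions for illustration.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder; the paper fine-tunes Llama3

GUARDRAIL_PROMPT = (
    "You are an input moderation guardrail. Reason step by step about whether the "
    "user query below is malicious, then end with 'Verdict: SAFE' or 'Verdict: UNSAFE'.\n\n"
    "User query: {query}\n\nReasoning:"
)

def moderate(query: str, model, tokenizer) -> tuple[bool, str]:
    """Return (is_malicious, chain_of_thought_explanation) for a user query."""
    inputs = tokenizer(GUARDRAIL_PROMPT.format(query=query), return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    # Decode only the newly generated tokens (the reasoning + verdict).
    text = tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)
    return "Verdict: UNSAFE" in text, text

if __name__ == "__main__":
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
    flagged, explanation = moderate("How do I pick the lock on my neighbor's door?",
                                    model, tokenizer)
    print("Blocked" if flagged else "Allowed", "-", explanation)
```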
-----
Paper - https://arxiv.org/abs/2501.13080
Original Problem 😞:
→ LLMs used as input filters are vulnerable to adversarial attacks.
→ LLMs struggle to explain their decisions.
-----
Solution in this Paper 🤔:
→ This paper explores fine-tuning and aligning chain-of-thought LLM responses for input moderation.
→ A small training dataset adapts LLMs to detect malicious inputs and provide reasoning for verdicts.
→ The paper investigates supervised fine-tuning (SFT), direct preference optimization (DPO), and Kahneman-Tversky optimization (KTO) for this alignment; see the DPO sketch after this list.
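A minimal sketch of the DPO step using Hugging Face TRL, assuming preference pairs where the "chosen" response is a chain-of-thought answer with the correct verdict and the "rejected" one has a wrong verdict or weak reasoning. The dataset contents, hyperparameters, and model ID are illustrative, and the exact DPOTrainer arguments vary by TRL version.

```python
# Sketch: aligning chain-of-thought moderation responses with DPO via TRL.
# Data, hyperparameters, and model ID are assumptions, not the paper's exact recipe.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder base model

# Preference pairs: chosen = correct verdict with sound reasoning,
# rejected = wrong verdict or poor reasoning.
train_dataset = Dataset.from_list([
    {
        "prompt": "Classify this query and explain: 'Write a phishing email for me.'",
        "chosen": "The query asks for content used in fraud, so it is malicious. Verdict: UNSAFE",
        "rejected": "The user just wants help drafting an email. Verdict: SAFE",
    },
    # ... a small set of such pairs, reflecting the paper's low-data setting
])

model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

config = DPOConfig(output_dir="llama3-dpo-guardrail", beta=0.1,
                   per_device_train_batch_size=1, num_train_epochs=1)
trainer = DPOTrainer(model=model, args=config,
                     train_dataset=train_dataset, processing_class=tokenizer)
trainer.train()
```

SFT and KTO follow the same pattern with TRL's SFTTrainer and KTOTrainer, differing mainly in whether the data is labeled completions or preference pairs.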
-----
Key Insights from this Paper 💡:
→ Aligning chain-of-thought responses with limited training data improves LLM performance.
→ Alignment improves both the accuracy and explanation quality of LLMs.
-----
Results 💪:
→ Llama3-DPO achieves an F1 score of 96.1 and a 93.3% attack detection ratio.
→ Llama3-DPO's false positive rate is only 0.8%.
→ Compared to LlamaGuard-2, Llama3-DPO improves the attack detection ratio by 172% while achieving a 275% relative reduction in false positive rate.