0:00
/
0:00
Transcript

"Refining Input Guardrails: Enhancing LLM-as-a-Judge Efficiency Through Chain-of-Thought Fine-Tuning and Alignment"

Below podcast is generated with Google's Illuminate.

Chain-of-thought tuning enhances the safety of conversational AI systems.

Fine-tuning and aligning chain-of-thought responses enhances LLMs acting as input moderation guardrails. This approach improves malicious query detection and provides explanations for verdicts.

-----

Paper - https://arxiv.org/abs/2501.13080

Original Problem: 😞:

→ LLMs used as input filters are vulnerable to adversarial attacks.

→ LLMs struggle to explain their decisions.

-----

Solution in this Paper: 🤔:

→ This paper explores fine-tuning and aligning chain-of-thought LLM responses for input moderation.

→ A small training dataset adapts LLMs to detect malicious inputs and provide reasoning for verdicts.

→ The paper investigates supervised fine-tuning (SFT), direct preference optimization (DPO), and Kahneman-Tversky optimization (KTO).

-----

Key Insights from this Paper: 💡:

→ Aligning chain-of-thought responses with limited training data improves LLM performance.

→ Alignment improves both the accuracy and explanation quality of LLMs.

-----

Results: 💪:

→ Llama3-DPO achieves 96.1 F1 score and 93.3% attack detection ratio.

→ Llama3-DPO's false positive rate is only 0.8%.

→ Compared to LlamaGuard-2, Llama3-DPO's attack detection ratio improves by 172% while reducing the false positive rate by 275%.

Discussion about this video