0:00
/
0:00
Transcript

"Granite Guardian"

The podcast on this paper is generated with Google's Illuminate.

A single model that spots everything from toxic prompts to RAG hallucinations in LLM interactions.

Granite Guardian introduces risk detection models for LLMs that can identify harmful content, jailbreaks, and RAG-specific hallucination risks with state-of-the-art accuracy.

-----

https://arxiv.org/abs/2412.07724

🔍 Original Problem:

LLMs need robust risk detection to prevent misuse and ensure safe operation. Current solutions lack comprehensive coverage across multiple risk dimensions and transparency in development.

-----

🛠️ Solution in this Paper:

→ Granite Guardian offers a family of risk detection models (2B and 8B variants) trained on diverse human annotations and synthetic data.

→ It uses a unified safety instruction template to detect risks in both prompts and responses.

→ The models cover social risks (bias, profanity, violence), security risks (jailbreaks), and RAG-specific hallucination risks.

→ A specialized probability computation method improves risk detection confidence scoring.

-----

💡 Key Insights:

→ Combining human annotations from diverse sources with synthetic data improves model robustness

→ A unified risk detection approach works better than separate models for different risks

→ RAG-specific risks need specialized detection mechanisms

-----

📊 Results:

→ AUC score of 0.871 on harmful content detection

→ AUC score of 0.854 on RAG-hallucination benchmarks

→ Outperforms Llama Guard and ShieldGemma across metrics

Discussion about this video

User's avatar