
RoleBreak: Character Hallucination as a Jailbreak Attack in Role-Playing Systems

This podcast was generated with Google's Illuminate.

This paper identifies the key drivers of character hallucination in LLMs and proposes Narrator Mode as a novel solution.

📚 https://arxiv.org/pdf/2409.16727

Original Problem 🔍:

LLM-based role-playing systems suffer from character hallucination, where models generate responses inconsistent with predefined character roles.

-----

Solution in this Paper 💡:

• Introduces the RoleBreak framework, which identifies query sparsity and role-query conflict as the key drivers of hallucination

• Constructs the RoleBreakEval dataset to evaluate hallucination-mitigation techniques

• Proposes Narrator Mode, a defense strategy that generates supplemental narrative context
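The two-stage idea behind Narrator Mode can be sketched as prompt construction: first an out-of-character "narrator" pass writes supplemental story context that reconciles a potentially conflicting query with the role, then the model answers in character grounded in that context. This is a minimal illustrative sketch, not the paper's implementation; the prompt wording, the example profile, and the `call_llm` stand-in are all assumptions.

```python
# Minimal sketch of the two-stage "Narrator Mode" idea.
# Prompt wording and helper names are illustrative assumptions, not the
# paper's exact implementation; `call_llm` is a hypothetical chat-API stub.

ROLE_PROFILE = (
    "You are Sherlock Holmes, a Victorian-era detective. "
    "You have no knowledge of events after 1914."
)

# Example of a role-query conflict: the query asks about knowledge outside
# the character's world, which invites character hallucination.
ATTACK_QUERY = "Holmes, what do you think of smartphones?"


def build_narrator_prompt(role_profile: str, query: str) -> str:
    """Stage 1: ask the model, as an out-of-character narrator, to write
    supplemental narrative context reconciling the query with the role."""
    return (
        "You are an omniscient narrator. Given the character profile and the "
        "user's query below, write one short paragraph of story context that "
        "lets the character respond in a role-consistent way.\n\n"
        f"Character profile: {role_profile}\n"
        f"User query: {query}"
    )


def build_roleplay_prompt(role_profile: str, narrative: str, query: str) -> str:
    """Stage 2: answer in character, grounded in the generated narrative."""
    return (
        f"{role_profile}\n\n"
        f"Narrative context: {narrative}\n\n"
        f"Stay strictly in character when answering: {query}"
    )


narrator_prompt = build_narrator_prompt(ROLE_PROFILE, ATTACK_QUERY)
# narrative = call_llm(narrator_prompt)                      # hypothetical call
# reply = call_llm(build_roleplay_prompt(ROLE_PROFILE, narrative, ATTACK_QUERY))
```

The design point is that, unlike rejection-based defenses, the narrator pass gives the model an in-world reason to address the query, so it can stay faithful to both the role and the user's intent.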

-----

Key Insights from this Paper 💡:

• Character hallucination viewed as "jailbreak" attack on role-playing systems

• Even enhanced models remain vulnerable to RoleBreak attacks

• Rejection-based strategies have limited generalization for handling hallucinations

• Narrator Mode outperforms traditional approaches in reducing hallucinations and improving coherence

-----

Results 📊:

• Narrator Mode reduces hallucination rate to 0.36 (vs 0.48 for GPT-3.5)

• Improves role fidelity to 0.71 (vs 0.65 for GPT-3.5)

• Enhances query fidelity to 0.89 (vs 0.83 for GPT-3.5)

• Increases story coherence score to 4.21 (vs 4.13 for GPT-3.5) while preserving character consistency
