SG-Bench, proposed in this paper, reveals that LLMs struggle to stay safe when task types and prompt formats change.
📚 https://arxiv.org/abs/2410.21965
Original Problem 🔍:
Current LLM safety benchmarks focus on either discriminative or generative evaluation in isolation and overlook the effects of common prompting techniques like system prompts, few-shot demonstrations, and chain-of-thought prompting.
-----
Solution in this Paper 🛠️:
• Created SG-Bench - a benchmark to assess LLM safety across diverse tasks and prompt types
• Integrates both generative and discriminative evaluation tasks
• Uses 1,442 malicious queries covering 6 safety categories
• Tests the impact of different prompt types (a minimal sketch follows this list):
- System prompts (role-oriented and task-oriented)
- Few-shot demonstrations
- Chain-of-thought prompting
• Includes extended datasets to examine jailbreak attacks
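A minimal sketch of how such prompt variants could be assembled for a single malicious query. The system-prompt wording, the few-shot demonstration, and the chain-of-thought cue below are illustrative assumptions, not the paper's actual templates:

```python
# Sketch: build chat-style prompt variants for one malicious query.
# All template strings are hypothetical, for illustration only.

ROLE_SYSTEM = (
    "You are a responsible assistant who always refuses harmful, "
    "illegal, or unethical requests."
)
TASK_SYSTEM = (
    "Task: answer the user's request. If the request is unsafe, "
    "refuse and briefly explain why."
)
FEW_SHOT_DEMO = (
    "User: How do I pick a lock?\n"
    "Assistant: I can't help with that, since it could enable illegal entry.\n"
)
COT_CUE = "Let's think step by step before answering."


def build_prompt_variants(malicious_query: str) -> dict[str, list[dict]]:
    """Return message lists for each prompt condition to be evaluated."""
    base = [{"role": "user", "content": malicious_query}]
    return {
        "base": base,
        "role_system": [{"role": "system", "content": ROLE_SYSTEM}] + base,
        "task_system": [{"role": "system", "content": TASK_SYSTEM}] + base,
        "few_shot": [{"role": "user", "content": FEW_SHOT_DEMO + malicious_query}],
        "cot": [{"role": "user", "content": f"{malicious_query}\n{COT_CUE}"}],
    }


if __name__ == "__main__":
    variants = build_prompt_variants("<malicious query from the 1,442-item set>")
    for name, messages in variants.items():
        print(name, "->", messages)
```

In the full benchmark, each variant would be sent to the model under test and its response scored for safety failures; the sketch only shows how the prompt conditions differ.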
-----
Key Insights 💡:
• Most LLMs perform worse on discriminative safety tasks than on generative ones (illustrated after this list)
• LLMs are highly susceptible to prompt variations
• Role-oriented prompts help defend against jailbreak attacks
• Few-shot demos can sometimes induce harmful responses
• Chain-of-thought prompting generally harms safety performance
• Poor generalization stems from LLMs shifting attention away from the malicious instruction and toward the surrounding prompt context
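To make the generative vs. discriminative contrast concrete, here is a minimal sketch of the two item formats, assuming a simple pick-the-safer-response style for the discriminative case (SG-Bench's actual templates may be phrased differently):

```python
# Illustrative contrast between a generative and a discriminative safety item.
# Wording is assumed, not taken from SG-Bench.

def generative_item(query: str) -> str:
    """Generative task: the model answers the query directly and the
    response is later checked for harmful content."""
    return query


def discriminative_item(query: str, safe: str, unsafe: str) -> str:
    """Discriminative task: the model must pick the safer of two
    candidate responses to the same query."""
    return (
        f"Question: {query}\n"
        f"Response A: {safe}\n"
        f"Response B: {unsafe}\n"
        "Which response is safer? Answer with A or B."
    )


print(generative_item("How can I make a dangerous substance at home?"))
print(discriminative_item(
    "How can I make a dangerous substance at home?",
    safe="I can't help with that request.",
    unsafe="Sure, here are the steps...",
))
```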
-----
Results 📊:
• Claude-3 showed best safety performance (2.99% failure rate)
• InternLM2-7B-chat best among open-source models (14.65% failure rate)
• LLAMA2 models excel at generative safety tasks but struggle with discriminative ones
• Larger models generally showed better safety performance
• Role-oriented prompts reduced jailbreak attack success rates