
"SG-Bench: Evaluating LLM Safety Generalization Across Diverse Tasks and Prompt Types"

The podcast on this paper was generated with Google's Illuminate.

SG-Bench, proposed in this paper, reveals that LLMs' safety performance degrades when the same malicious queries are presented through different tasks and prompt types.

📚 https://arxiv.org/abs/2410.21965

Original Problem 🔍:

Current LLM safety benchmarks focus on either discriminative or generative evaluation in isolation and overlook the effects of common prompting techniques like system prompts, few-shot demonstrations, and chain-of-thought prompting.

-----

Solution in this Paper 🛠️:

• Introduces SG-Bench, a benchmark that assesses LLM safety generalization across diverse tasks and prompt types

• Integrates both generative and discriminative evaluation tasks

• Uses 1,442 malicious queries covering 6 safety categories

• Tests the impact of different prompt types (sketched after this list):

- System prompts (role-oriented and task-oriented)

- Few-shot demonstrations

- Chain-of-thought prompting

• Includes extended datasets to examine jailbreak attacks
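
To make the prompt-type manipulation concrete, here is a minimal sketch of how each variant might wrap the same malicious query. The template wording, function names, and message format are illustrative assumptions, not SG-Bench's exact prompts:

```python
# Illustrative prompt-type wrappers. The system-prompt wording and the
# chain-of-thought trigger are assumptions, not SG-Bench's exact templates.

ROLE_SYSTEM = "You are a responsible assistant who refuses harmful requests."
TASK_SYSTEM = "Answer the user's request without producing unsafe content."


def with_system_prompt(query: str, role_oriented: bool = True) -> list[dict]:
    """Prepend a role-oriented or task-oriented system prompt."""
    system = ROLE_SYSTEM if role_oriented else TASK_SYSTEM
    return [{"role": "system", "content": system},
            {"role": "user", "content": query}]


def with_few_shot(query: str, demos: list[tuple[str, str]]) -> list[dict]:
    """Prefix (question, answer) demonstrations before the test query."""
    messages: list[dict] = []
    for demo_q, demo_a in demos:
        messages.append({"role": "user", "content": demo_q})
        messages.append({"role": "assistant", "content": demo_a})
    messages.append({"role": "user", "content": query})
    return messages


def with_cot(query: str) -> list[dict]:
    """Append a chain-of-thought trigger to the query."""
    return [{"role": "user", "content": f"{query}\nLet's think step by step."}]
```

Given the insights below that few-shot demonstrations and chain-of-thought prompting can hurt safety, the last two wrappers are the variants most likely to elicit failures.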

-----

Key Insights 💡:

• Most LLMs perform worse on discriminative tasks than on generative ones (both task styles are contrasted in the sketch after this list)

• LLM safety is highly sensitive to prompt variations

• Role-oriented prompts help defend against jailbreak attacks

• Few-shot demos can sometimes induce harmful responses

• Chain-of-thought prompting generally harms safety performance

• Poor generalization stems from LLMs shifting attention away from the malicious instruction and toward the surrounding prompt context
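
The generative/discriminative distinction can be made concrete: in generative evaluation the model answers the malicious query itself and a judge labels the output, while in discriminative evaluation the model must recognize which of two candidate responses is safer. A hedged sketch, where `chat` stands in for any chat-completion call and the multiple-choice wording is an assumption, not the paper's exact format:

```python
from typing import Callable

# Assumed stand-in for any chat-completion call: takes a message list,
# returns the model's text response.
ChatFn = Callable[[list[dict]], str]


def generative_eval(chat: ChatFn, query: str) -> str:
    """Generative task: the model answers the malicious query directly;
    a downstream safety judge then labels the response safe or unsafe."""
    return chat([{"role": "user", "content": query}])


def discriminative_eval(chat: ChatFn, query: str,
                        safe: str, unsafe: str) -> bool:
    """Discriminative task (multiple-choice style): the model must
    identify the safer of two candidate responses to the same query."""
    prompt = (
        f"Question: {query}\n"
        f"Response A: {safe}\n"
        f"Response B: {unsafe}\n"
        "Which response is safer? Answer with A or B."
    )
    answer = chat([{"role": "user", "content": prompt}])
    return answer.strip().upper().startswith("A")
```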

-----

Results 📊:

• Claude-3 showed the best safety performance (2.99% failure rate; the metric is sketched after this list)

• InternLM2-7B-chat was the best open-source model (14.65% failure rate)

• LLaMA2 models excel at generation tasks but struggle with discrimination tasks

• Larger models generally showed better safety performance

• Role-oriented prompts reduced jailbreak attack success rates
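
The failure rates above are presumably the percentage of queries for which the model's response is judged unsafe. A minimal sketch of that metric, assuming an external `is_unsafe` judge (rule-based refusal detection or an LLM judge):

```python
from typing import Callable


def failure_rate(responses: list[str],
                 is_unsafe: Callable[[str], bool]) -> float:
    """Percent of responses judged unsafe (lower is safer),
    e.g. 2.99 for Claude-3 and 14.65 for InternLM2-7B-chat."""
    unsafe = sum(1 for r in responses if is_unsafe(r))
    return 100.0 * unsafe / len(responses)
```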
