This paper reveals that LLMs are surprisingly sensitive to minor prompt variations
📚 https://arxiv.org/abs/2401.03729
Original Problem 🤔:
LLMs are widely used for data labeling, but practitioners make various prompt design choices - from output formats to jailbreaks. No systematic study exists on how these prompt variations affect model predictions and reliability.
-----
Solution in this Paper 🔧:
→ Tested 24 prompt variations across 3 categories (a minimal construction sketch follows this list):
- Output formats (JSON, CSV, XML, etc.)
- Minor perturbations (spaces, greetings, etc.)
- Jailbreaks (AIM, Dev Mode v2, etc.)
→ Evaluated on 11 classification tasks using ChatGPT and Llama2 models
→ Analyzed prediction changes, accuracy impacts, and similarity between variations
→ Used multidimensional scaling to visualize relationships between prompt variations
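The variation setup can be pictured with a minimal sketch: an illustrative base prompt crossed with output-format specifications and minor perturbations. The prompt wording, helper names, and `query_model` below are hypothetical stand-ins, not the paper's own code.

```python
# Minimal sketch (illustrative, not the paper's code) of applying
# output-format specs and minor perturbations to one base prompt.
# `query_model` is a hypothetical stand-in for a ChatGPT / Llama-2 call.

BASE_PROMPT = (
    "Classify the sentiment of this review as positive or negative:\n"
    "{text}\nAnswer:"
)

# Output-format specifications appended to the base prompt
OUTPUT_FORMATS = {
    "plain": "",
    "json": '\nRespond in JSON, e.g. {"label": "positive"}.',
    "csv": "\nRespond as a single CSV line: label",
    "xml": "\nRespond in XML, e.g. <label>positive</label>.",
}

# Minor perturbations applied to the finished prompt string
PERTURBATIONS = {
    "none": lambda p: p,
    "extra_space": lambda p: p + " ",      # single trailing space
    "greeting": lambda p: "Hello! " + p,   # polite greeting prefix
}


def build_variations(text: str) -> dict[str, str]:
    """Cross every output-format spec with every perturbation for one sample."""
    prompts = {}
    for fmt_name, fmt_suffix in OUTPUT_FORMATS.items():
        for pert_name, perturb in PERTURBATIONS.items():
            prompts[f"{fmt_name}+{pert_name}"] = perturb(
                BASE_PROMPT.format(text=text) + fmt_suffix
            )
    return prompts


def collect_predictions(samples, query_model):
    """Run every variation over every sample and record the returned labels."""
    preds: dict[str, list[str]] = {}
    for text in samples:
        for name, prompt in build_variations(text).items():
            preds.setdefault(name, []).append(query_model(prompt))
    return preds
```

Jailbreak prompts would be a third wrapper around the same base prompt; they are omitted from this sketch.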
-----
Key Insights from this Paper 💡:
→ Even tiny changes like adding a space can cause 500+ prediction changes out of 11,000 samples (a minimal way to count such changes is sketched after this list)
→ Larger models (ChatGPT, Llama-2-70B) are more robust to variations than smaller ones
→ No single prompt variation consistently performs best across tasks
→ Jailbreaks cause massive disruptions - over 2500 prediction changes in ChatGPT
→ Output format specifications can significantly impact accuracy
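A rough way to reproduce such counts, reusing the hypothetical `preds` dictionary from the earlier sketch (variation name → predicted labels over the same samples): count pairwise prediction changes, then embed the variations in 2-D with scikit-learn's MDS, loosely mirroring the paper's similarity analysis.

```python
# Minimal sketch (illustrative) of counting pairwise prediction changes and
# embedding prompt variations with multidimensional scaling, reusing the
# `preds` dict from the sketch above: variation name -> predicted labels.

from itertools import combinations

import numpy as np
from sklearn.manifold import MDS


def prediction_changes(a: list[str], b: list[str]) -> int:
    """Number of samples whose predicted label differs between two variations."""
    return sum(x != y for x, y in zip(a, b))


def disagreement_matrix(preds: dict[str, list[str]]):
    """Symmetric matrix of pairwise prediction-change counts."""
    names = sorted(preds)
    dist = np.zeros((len(names), len(names)))
    for i, j in combinations(range(len(names)), 2):
        dist[i, j] = dist[j, i] = prediction_changes(preds[names[i]], preds[names[j]])
    return names, dist


def embed_variations(preds: dict[str, list[str]]) -> dict[str, np.ndarray]:
    """2-D MDS embedding of variations from their pairwise disagreement counts."""
    names, dist = disagreement_matrix(preds)
    coords = MDS(n_components=2, dissimilarity="precomputed",
                 random_state=0).fit_transform(dist)
    return dict(zip(names, coords))
```

Variations that land close together in the embedding change few predictions relative to each other; jailbreaks would be expected to sit far from the rest.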
-----
Results 📊:
→ 10% of predictions change just by specifying an output format
→ ChatGPT's JSON Checkbox option caused more prediction changes than specifying JSON in the prompt
→ Jailbreaks led to 90% invalid responses in ChatGPT
→ Majority voting across variations improved accuracy by 1-2%
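A minimal sketch of that majority-voting step, again assuming the hypothetical `preds` structure from the earlier sketches:

```python
# Minimal sketch (illustrative) of majority voting across prompt variations,
# reusing the `preds` dict: variation name -> predicted labels per sample.

from collections import Counter


def majority_vote(preds: dict[str, list[str]]) -> list[str]:
    """For each sample, keep the most common label across all variations."""
    per_sample = zip(*preds.values())  # one tuple of labels per sample
    return [Counter(labels).most_common(1)[0][0] for labels in per_sample]


def accuracy(predicted: list[str], gold: list[str]) -> float:
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)
```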