ProSA: Assessing and Understanding the Prompt Sensitivity of LLMs
How much LLMs get confused by different ways of asking the same question.
Meet ProSA: The framework that catches LLMs playing favorites with prompts
Original Problem 🔍:
LLMs exhibit prompt sensitivity: rephrasing the same request can change their performance, affecting reliability and user satisfaction. Existing research largely overlooks instance-level variation and subjective evaluations.
Solution in this Paper 🛠️:
• ProSA framework for evaluating prompt sensitivity in LLMs
• Novel PromptSensiScore (PSS) metric for quantifying sensitivity (see the sketch after this list)
• Instance-level analysis across multiple tasks and datasets
• Utilizes decoding confidence to explain underlying mechanisms
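To make the PSS idea concrete, here is a minimal Python sketch of an instance-level sensitivity score. The pairwise-discrepancy formulation and the name `prompt_sensi_score` are illustrative assumptions, not the paper's exact definition: it scores each instance by how much the model's result differs across prompt variants, then averages.

```python
import itertools
from typing import Dict, List

def prompt_sensi_score(scores: Dict[str, List[float]]) -> float:
    """Hypothetical PSS sketch: mean absolute pairwise score gap across
    prompt variants, averaged over instances.

    scores maps a prompt-template id to a list of per-instance scores
    (e.g. 1.0 if the model answered instance i correctly, else 0.0).
    All lists must cover the same instances in the same order.
    """
    templates = list(scores.keys())
    n_instances = len(scores[templates[0]])
    total, count = 0.0, 0
    for i in range(n_instances):
        # Compare every pair of prompt variants on the same instance.
        for a, b in itertools.combinations(templates, 2):
            total += abs(scores[a][i] - scores[b][i])
            count += 1
    return total / count if count else 0.0

# Toy usage: two prompt phrasings evaluated over three instances.
pss = prompt_sensi_score({
    "variant_a": [1.0, 0.0, 1.0],
    "variant_b": [1.0, 1.0, 0.0],
})
print(f"PSS = {pss:.2f}")  # higher => more prompt-sensitive
```

A score of 0 means every prompt variant produced the same outcome on every instance; larger values mean the model's answers depend more on how the question is phrased.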
Key Insights from this Paper 💡:
• Prompt sensitivity varies across datasets and models
• Larger models generally show enhanced robustness
• Few-shot examples alleviate prompt sensitivity
• Subjective evaluations are especially susceptible to prompt sensitivity
• Higher model confidence correlates with increased prompt robustness
Results 📊:
• Llama3-70B-Instruct demonstrates the highest robustness
• The transition from zero-shot to one-shot prompting brings a significant improvement in robustness
• Larger LLMs benefit more from additional few-shot examples
• LLMs are robust when answering straightforward queries but sensitive on complex tasks
• Prompt sensitivity reflects the model's confidence level
🧠 The analysis revealed that prompt sensitivity is essentially a reflection of the model's confidence level:
• Higher confidence in outputs correlates with increased robustness against semantic variations in prompts
• When a model is robust to prompts for a given instance (low PSS), it exhibits the highest decoding confidence
• Conversely, when the model is sensitive to prompts, its decoding confidence decreases
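A minimal sketch of the kind of decoding-confidence measure this refers to, assuming confidence is the mean probability the model assigns to its own generated tokens (token log-probabilities as most inference APIs return them). The function `decoding_confidence` and this exact averaging are illustrative assumptions, not the paper's precise definition.

```python
import math
from typing import List

def decoding_confidence(token_logprobs: List[float]) -> float:
    """Hypothetical confidence sketch: average per-token probability of the
    model's own generated answer (log-probs converted to probs, then averaged).

    token_logprobs: log-probability of each generated token, typically
    returned by an inference API alongside the completion.
    """
    if not token_logprobs:
        return 0.0
    return sum(math.exp(lp) for lp in token_logprobs) / len(token_logprobs)

# Toy usage: a confident answer vs. a hesitant one.
confident = decoding_confidence([-0.05, -0.02, -0.10])  # avg prob ~0.95
hesitant  = decoding_confidence([-1.20, -0.90, -1.60])  # avg prob ~0.30
print(confident, hesitant)  # lower confidence -> expect higher prompt sensitivity
```

Under the paper's finding, instances scored this way with low confidence would be the ones where PSS (and thus sensitivity to prompt phrasing) tends to be high.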