Adding synthetic context makes LLM evaluation more reliable and insightful
Context-aware evaluation reveals hidden biases and capabilities in LLM responses
https://arxiv.org/abs/2411.07237
Original Problem 🤔:
LLM users often issue underspecified queries that omit key context (user identity, intent, criteria for a good response). When such queries are used in benchmarks, evaluators must guess at the missing context, which leads to arbitrary judgments and unreliable results.
-----
Solution in this Paper 🛠️:
→ Introduces "contextualized evaluations" - a protocol that adds synthetic context during LLM response evaluation
→ Context is represented as follow-up question-answer pairs that clarify the original query
→ Uses GPT-4, Claude-3.5-Sonnet and Gemini-1.5-Pro to generate and validate follow-up QA pairs
→ Implements three evaluation settings: standard (no context), implicit context discovery, and adaptive evaluation (a minimal sketch of the prompt construction follows this list)
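The core mechanics are easy to picture. Below is a minimal, hypothetical sketch (not the paper's code) of how context, represented as follow-up question-answer pairs, can be appended to a pairwise judge prompt; the `QAPair` and `build_eval_prompt` names and the prompt wording are assumptions for illustration.

```python
# Minimal sketch of contextualized evaluation -- illustrative only, not the paper's code.
# Context is a set of follow-up question-answer pairs that clarify the original query.
from dataclasses import dataclass

@dataclass
class QAPair:
    question: str  # e.g. "Who is the intended audience?"
    answer: str    # e.g. "A high-school student with no physics background."

def build_eval_prompt(query: str, response_a: str, response_b: str,
                      context: list[QAPair] | None = None) -> str:
    """Build a pairwise judge prompt; omit `context` for the no-context setting."""
    parts = [f"Query: {query}"]
    if context:  # context-aware settings append the clarifying QA pairs
        parts.append("Context about the user and their intent:")
        parts.extend(f"- Q: {qa.question}\n  A: {qa.answer}" for qa in context)
    parts += [
        f"Response A: {response_a}",
        f"Response B: {response_b}",
        "Which response better satisfies the query (and the context, if given)? Answer A, B, or Tie.",
    ]
    return "\n\n".join(parts)

# Example: the same query evaluated without and with synthetic context.
ctx = [QAPair("Who is asking?", "A parent of a seven-year-old."),
       QAPair("What format is preferred?", "A short, simple explanation.")]
prompt_no_ctx = build_eval_prompt("Explain how vaccines work", "...", "...")
prompt_ctx = build_eval_prompt("Explain how vaccines work", "...", "...", context=ctx)
```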
-----
Key Insights from this Paper 💡:
→ 76% of queries in current LLM benchmarks are open-ended and lack proper context
→ Adding context during evaluation increases evaluator agreement by 3-10% absolute
→ Context can flip which model wins in pairwise comparisons (a toy illustration follows this list)
→ Default model responses show bias towards WEIRD (Western, Educated, Industrialized, Rich, Democratic) contexts
→ Models vary significantly in their ability to adapt to different contexts, even when the context is provided directly in the prompt
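To see how a win rate can flip, consider a toy calculation over made-up pairwise judgments; the numbers below are hypothetical and only illustrate the aggregation, not the paper's results.

```python
# Toy win-rate flip: the same model pair judged without vs. with context.
# All judgments below are made up purely to illustrate the aggregation.
def win_rate(judgments: list[str], model: str = "A") -> float:
    """Fraction of pairwise judgments won by `model`; a tie counts as half a win."""
    wins = sum(1.0 if j == model else 0.5 if j == "Tie" else 0.0 for j in judgments)
    return wins / len(judgments)

no_context = ["A", "A", "A", "Tie", "B", "A", "B", "A"]    # hypothetical judgments
with_context = ["B", "B", "A", "B", "B", "Tie", "B", "A"]  # hypothetical judgments

print(f"Model A win rate, no context:   {win_rate(no_context):.2f}")    # 0.69 -> A ahead
print(f"Model A win rate, with context: {win_rate(with_context):.2f}")  # 0.31 -> B ahead
```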
-----
Results 📊:
→ Human validation rates 76% of follow-up questions as important, 90% as realistic, and 80% of answer sets as diverse
→ Context-aware evaluation improves inter-evaluator agreement from 64% to 78% (a toy agreement calculation follows this list)
→ Context reduces evaluators' reliance on surface-level criteria like style by 5-7%
→ Models show varying instruction-following abilities, with performance differences of up to 1.64 points in constraint satisfaction
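For intuition on the agreement numbers, one simple way to measure inter-evaluator agreement is average pairwise agreement across examples. The sketch below uses hypothetical labels and may not match the paper's exact metric.

```python
# Average pairwise agreement: for each example, the fraction of evaluator pairs
# giving the same verdict, averaged over examples. Labels are hypothetical and
# the paper's exact agreement metric may differ.
from itertools import combinations

def pairwise_agreement(labels_per_example: list[list[str]]) -> float:
    scores = []
    for labels in labels_per_example:
        pairs = list(combinations(labels, 2))
        scores.append(sum(a == b for a, b in pairs) / len(pairs))
    return sum(scores) / len(scores)

# Three evaluators judging four query-response comparisons (made-up labels).
no_ctx = [["A", "B", "A"], ["B", "B", "A"], ["A", "Tie", "B"], ["A", "A", "B"]]
with_ctx = [["A", "A", "A"], ["B", "B", "B"], ["A", "A", "B"], ["A", "A", "A"]]
print(f"Agreement without context: {pairwise_agreement(no_ctx):.2f}")   # 0.25
print(f"Agreement with context:    {pairwise_agreement(with_ctx):.2f}") # 0.83
```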