Turns out we were testing LLMs wrong - they're way smarter when they can see all the choices
LLMs perform much better on multiple-choice benchmarks when they see all the answer options together, the way humans do, rather than having each option scored in isolation.
https://arxiv.org/abs/2412.17758
Original Problem 🤔:
→ The standard evaluation for multiple-choice benchmarks scores each answer option in isolation against the question, so the model never gets to compare the alternatives - making LLMs look worse at reasoning than they actually are (a minimal sketch of this scoring setup follows below)
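To make the "in isolation" setup concrete, here is a minimal sketch of the usual likelihood-based scoring: each option is scored as a continuation of the question, independently of the other options, and the highest-scoring one is picked. The model name, prompt concatenation, and the per-token length normalization are illustrative assumptions, not the paper's exact implementation.

```python
# Sketch of "in separation" multiple-choice scoring with a Hugging Face causal LM.
# The model never sees the competing options - only one continuation at a time.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # assumption: stand-in; any causal LM (e.g., a Llama 3 checkpoint) works
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def option_logprob(context: str, option: str, normalize: bool = True) -> float:
    """Sum of token log-probs of `option` given `context`; per-token if `normalize`."""
    prompt_ids = tok(context, return_tensors="pt").input_ids
    full_ids = tok(context + " " + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Log-prob of each token given its prefix (shift logits by one position).
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    token_lp = logprobs.gather(-1, full_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    # Keep only the tokens belonging to the option (boundary is approximate,
    # since tokenization of the concatenation may merge across the join).
    option_lp = token_lp[:, prompt_ids.shape[1] - 1:]
    total = option_lp.sum().item()
    return total / option_lp.numel() if normalize else total

question = "Which property of a mineral can be determined just by looking at it?"
options = ["luster", "mass", "weight", "hardness"]
# Each option is judged on its own; longer options need length normalization to compete.
prediction = max(options, key=lambda o: option_logprob(question, o))
print(prediction)
```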
-----
Solution in this Paper 🔧:
→ Show all answer choices together when evaluating LLMs on multiple-choice tests
→ This matches how humans naturally approach such questions - comparing options side by side
→ The paper identifies that 31% of ARC Challenge questions are "hardly answerable in separation"
→ It also removes the need for the arbitrary length-normalization heuristics that the in-isolation setup requires to compare options of different lengths (see the sketch after this list)
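And here is a hedged sketch of the "all options shown" setup the paper advocates: the model sees every choice at once and only has to commit to a single letter, so no length normalization is needed. The prompt wording is an illustrative assumption; the letter is scored by reusing `option_logprob` from the previous sketch.

```python
# Sketch of "all options shown" evaluation: build one prompt containing every
# choice and compare the log-probs of the candidate answer letters.
LETTERS = ["A", "B", "C", "D"]

def build_mcq_prompt(question: str, options: list[str]) -> str:
    lines = [f"Question: {question}"]
    lines += [f"{letter}. {option}" for letter, option in zip(LETTERS, options)]
    lines.append("Answer:")
    return "\n".join(lines)

def pick_letter(question: str, options: list[str]) -> str:
    prompt = build_mcq_prompt(question, options)
    # No normalization needed: every candidate continuation is exactly one letter.
    scores = {
        letter: option_logprob(prompt, letter, normalize=False)
        for letter in LETTERS[: len(options)]
    }
    return max(scores, key=scores.get)

print(pick_letter(
    "Which property of a mineral can be determined just by looking at it?",
    ["luster", "mass", "weight", "hardness"],
))
```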
-----
Key Insights 💡:
→ ARC Challenge isn't inherently much harder than ARC Easy - the apparent difficulty gap is largely an artifact of the evaluation method
→ OpenBookQA benchmark is essentially solved when using the all-options approach
→ SIQA scores improve by 24% with this evaluation change
→ The perceived reasoning gaps between humans and LLMs shrink dramatically with fair evaluation
-----
Results 📊:
→ Llama 3 70B improves from 64% to 93% on ARC Challenge when seeing all options
→ The accuracy gap between ARC Challenge and ARC Easy shrinks by a factor of six
→ OpenBookQA accuracy jumps from 48% to 89%
→ Current models achieve superhuman performance (96%) on OpenBookQA with all-options evaluation