
"In Case You Missed It: ARC 'Challenge' Is Not That Challenging"

A podcast on this paper, generated with Google's Illuminate, is below.

Turns out we were testing LLMs wrong: they're way smarter when they can see all the choices.

LLMs perform much better on multiple-choice tests when they can see all options together, just like humans do, rather than evaluating each option separately.

https://arxiv.org/abs/2412.17758

Original Problem 🤔:

→ Current evaluation methods for multiple-choice questions make LLMs look worse at reasoning than they actually are by forcing them to evaluate options in isolation
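A minimal sketch of what this "options in isolation" protocol usually looks like, assuming a generic `loglikelihood(prompt, continuation)` helper (hypothetical, standing in for whatever scoring call your LLM library provides):

```python
def answer_in_separation(question: str, choices: list[str], loglikelihood) -> int:
    """Score each candidate answer independently; the model never sees the other options."""
    scores = []
    for choice in choices:
        prompt = f"Question: {question}\nAnswer:"
        # Raw log-probability of this answer as a continuation of the question.
        logp = loglikelihood(prompt, " " + choice)
        # Length normalization: dividing by answer length is one common but
        # somewhat arbitrary way to keep longer answers from being penalized.
        scores.append(logp / len(choice))
    # Pick the option with the highest normalized score.
    return max(range(len(choices)), key=lambda i: scores[i])
```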

-----

Solution in this Paper 🔧:

→ Show all answer choices together when evaluating LLMs on multiple-choice tests (see the sketch after this list)

→ This matches how humans naturally approach such questions - comparing options side by side

→ The paper identifies that 31% of ARC Challenge questions are "hardly answerable in separation"

→ It also eliminates the arbitrary score-normalization issues that arise when comparing answer options of different lengths
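For contrast, a minimal sketch of the all-options setup described above; `generate(prompt)` is a hypothetical helper returning the model's text completion, and the exact prompt wording is an assumption rather than the paper's template:

```python
def answer_with_all_options(question: str, choices: list[str], generate) -> int:
    """Show every labeled choice in a single prompt and ask the model to name one."""
    letters = [chr(ord("A") + i) for i in range(len(choices))]
    options = "\n".join(f"{letter}. {choice}" for letter, choice in zip(letters, choices))
    prompt = (
        f"Question: {question}\n{options}\n"
        "Answer with the letter of the correct option.\nAnswer:"
    )
    reply = generate(prompt).strip()
    # The model compares the options and names one letter, so no per-option
    # likelihoods (and no length-normalization heuristics) are needed.
    return letters.index(reply[0])
```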

-----

Key Insights 💡:

→ ARC Challenge isn't inherently harder than ARC Easy - the difficulty was an artifact of evaluation method

→ OpenBookQA benchmark is essentially solved when using the all-options approach

→ SIQA scores improve by 24% with this evaluation change

→ The perceived reasoning gaps between humans and LLMs shrink dramatically with fair evaluation

-----

Results 📊:

→ Llama 3 70B improves from 64% to 93% on ARC Challenge when seeing all options

→ The accuracy gap between ARC Challenge and ARC Easy shrinks by a factor of 6

→ OpenBookQA accuracy jumps from 48% to 89%

→ Current models achieve superhuman performance (96%) on OpenBookQA with all-options evaluation
