"Are Your LLMs Capable of Stable Reasoning?"

The podcast below was generated from this paper with Google's Illuminate.

LLMs can solve math, but can they do it consistently? G-Pass@k finds out.

New metric reveals the hidden instability in LLMs' mathematical reasoning capabilities.

The paper introduces G-Pass@k, a new metric that measures both accuracy and stability of LLMs' mathematical reasoning across multiple attempts, along with LiveMathBench, a dynamic evaluation benchmark.
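
For intuition, here is a minimal Python sketch of how G-Pass@k can be computed for a single problem, assuming the hypergeometric formulation the paper describes: sample n generations, count the c correct ones, and ask how likely it is that at least a fraction τ of k draws (without replacement) come out correct. The function name and signature are illustrative, not the paper's reference implementation.

```python
from math import ceil, comb

def g_pass_at_k(n: int, c: int, k: int, tau: float) -> float:
    """Estimate G-Pass@k_tau for a single problem.

    n   -- total generations sampled for the problem
    c   -- number of those generations judged correct
    k   -- number of attempts the metric is evaluated at
    tau -- fraction of the k attempts that must be correct
    """
    m = ceil(tau * k)  # minimum correct answers required among the k draws
    # Hypergeometric tail: probability that at least m of k draws
    # (without replacement, from n samples of which c are correct)
    # are correct. math.comb returns 0 when k - j exceeds n - c.
    return sum(
        comb(c, j) * comb(n - c, k - j) for j in range(m, min(c, k) + 1)
    ) / comb(n, k)
```

At τ = 1.0, every one of the k draws must be correct, the strictest stability setting; the benchmark-level score is the mean of this quantity over all problems.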

-----

https://arxiv.org/abs/2412.13147

🤔 Original Problem:

Current evaluation metrics capture only the peak performance of LLMs in mathematical reasoning, missing the stability needed for real-world applications.

-----

🔧 Solution in this Paper:

→ G-Pass@k evaluates model performance under varying correctness thresholds, providing insights into both potential and stability

→ LiveMathBench incorporates contemporary math problems from competitions like CNMO, CCEE, AMC, and WLPMC to minimize data leakage risks

→ The framework requires at least 3×k generations (three times the k being evaluated) for accurate estimation, sampling with temperature 1.0 and top-p 1.0

→ Implementation uses 48 runs per problem and reports both greedy accuracy and G-Pass@k values (a sketch of the benchmark-level computation follows this list)
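
Putting these pieces together, a benchmark-level score can be sketched as below; this reuses g_pass_at_k from the earlier sketch, and k = 16 is an illustrative choice consistent with 48 = 3×k runs, not a value taken from the paper's code.

```python
import statistics

def benchmark_g_pass_at_k(correct_counts, n=48, k=16, tau=1.0):
    """Benchmark-level G-Pass@k: the mean of per-problem scores.

    correct_counts -- for each problem, how many of the n generations
    (sampled at temperature 1.0 and top-p 1.0 in the paper's setup)
    were judged correct.
    """
    return statistics.mean(
        g_pass_at_k(n=n, c=c, k=k, tau=tau) for c in correct_counts
    )

# A model that solves a problem in 30 of 48 samples looks strong on a
# single attempt but scores near zero at tau = 1.0 with k = 16.
print(benchmark_g_pass_at_k([30]))
```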

-----

💡 Key Insights:

→ LLMs show significant instability in reasoning, with performance drops exceeding 50%

→ Increasing model size doesn't necessarily improve stability

→ There's a notable gap between theoretical capabilities and actual stability

→ Data contamination boosts accuracy but reduces stability

-----

📊 Results:

→ The best-performing model, OpenAI o1-mini, achieves 66.5% accuracy on LiveMathBench

→ That performance drops by 36.9% when stability is measured with G-Pass@k

→ Most models show a 60% average performance decline under strict stability requirements
