LLMs can solve math, but can they do it consistently? G-Pass@k finds out.
New metric reveals the hidden instability in LLMs' mathematical reasoning capabilities.
The paper introduces G-Pass@k, a new metric that measures both accuracy and stability of LLMs' mathematical reasoning across multiple attempts, along with LiveMathBench, a dynamic evaluation benchmark.
-----
https://arxiv.org/abs/2412.13147
🤔 Original Problem:
Current evaluation metrics capture only the peak performance of LLMs in mathematical reasoning, overlooking the output stability that real-world applications require.
-----
🔧 Solution in this Paper:
→ G-Pass@k evaluates model performance under varying correctness thresholds τ, capturing both a model's potential and its stability (see the sketch after this list)
→ LiveMathBench incorporates contemporary math problems from competitions like CNMO, CCEE, AMC, and WLPMC to minimize data leakage risks
→ Reliable estimation requires at least 3×k generations per problem (n ≥ 3k), sampled with temperature 1.0 and top-p 1.0
→ The evaluation uses 48 generations per problem and reports both greedy accuracy and G-Pass@k values
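A minimal sketch of how G-Pass@k can be computed, assuming the hypergeometric formulation the paper describes: draw k generations without replacement from the n samples per problem (c of them correct) and require at least ⌈τ·k⌉ to be correct. The helper names and the example counts below are illustrative, not the authors' code.

```python
import math
from math import comb
from statistics import mean

def g_pass_at_k_tau(n: int, c: int, k: int, tau: float) -> float:
    """Estimate G-Pass@k_tau for one problem: the probability that,
    when k generations are drawn without replacement from n total
    generations (c of them correct), at least ceil(tau * k) are
    correct (a hypergeometric tail probability)."""
    threshold = math.ceil(tau * k)          # minimum correct answers required
    total = comb(n, k)                      # all ways to pick k of n samples
    hits = sum(comb(c, j) * comb(n - c, k - j)
               for j in range(threshold, min(c, k) + 1))
    return hits / total

def g_pass_at_k(samples: list[tuple[int, int]], k: int, tau: float) -> float:
    """Average G-Pass@k_tau over a benchmark.
    `samples` holds one (n, c) pair per problem."""
    return mean(g_pass_at_k_tau(n, c, k, tau) for n, c in samples)

# Hypothetical example: 48 sampled generations per problem (temperature 1.0,
# top-p 1.0, matching the paper's setup and satisfying n >= 3k for k = 16),
# evaluated at tau = 1.0, i.e. all 16 drawn generations must be correct.
results = [(48, 40), (48, 12), (48, 48)]
print(g_pass_at_k(results, k=16, tau=1.0))
```

Setting τ = 1.0 imposes the strictest requirement (every one of the k drawn samples must be correct), while lower τ values such as 0.5 only require a fraction of them to be correct; comparing these scores against plain accuracy is what exposes the stability gap.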
-----
💡 Key Insights:
→ LLMs show significant instability in reasoning, with performance drops exceeding 50% across repeated sampling
→ Increasing model size doesn't necessarily improve stability
→ There is a notable gap between a model's peak capability and the performance it can deliver stably
→ Data contamination boosts accuracy but reduces stability
-----
📊 Results:
→ The best-performing model, OpenAI o1-mini, achieves 66.5% accuracy on LiveMathBench
→ Its performance drops by 36.9% when stability is measured
→ Most models show an average performance decline of 60% under strict stability requirements