LLMs can solve math, but can they do it consistently? G-Pass@k finds out.
New metric reveals the hidden instability in LLMs' mathematical reasoning capabilities.
The paper introduces G-Pass@k, a new metric that measures both accuracy and stability of LLMs' mathematical reasoning across multiple attempts, along with LiveMathBench, a dynamic evaluation benchmark.
-----
https://arxiv.org/abs/2412.13147
🤔 Original Problem:
Current evaluation metrics capture only the peak performance of LLMs in mathematical reasoning, overlooking the output stability that real-world applications require.
-----
🔧 Solution in this Paper:
→ G-Pass@k evaluates model performance under varying correctness thresholds τ, capturing both a model's potential and its stability (see the sketch after this list)
→ LiveMathBench incorporates contemporary math problems from competitions like CNMO, CCEE, AMC, and WLPMC to minimize data leakage risks
→ Reliable estimation requires at least 3×k generations per problem (n ≥ 3k), sampled with temperature 1.0 and top-p 1.0
→ The evaluation uses 48 generations per problem and reports both greedy accuracy and G-Pass@k values
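A minimal sketch of how G-Pass@k can be computed, assuming the hypergeometric formulation the paper describes: draw k generations without replacement from the n samples per problem (c of them correct) and require at least ⌈τ·k⌉ to be correct. The helper names and the example counts below are illustrative, not the authors' code.

```python
import math
from math import comb
from statistics import mean

def g_pass_at_k_tau(n: int, c: int, k: int, tau: float) -> float:
    """Estimate G-Pass@k_tau for one problem: the probability that,
    when k generations are drawn without replacement from n total
    generations (c of them correct), at least ceil(tau * k) are
    correct (a hypergeometric tail probability)."""
    threshold = math.ceil(tau * k)          # minimum correct answers required
    total = comb(n, k)                      # all ways to pick k of n samples
    hits = sum(comb(c, j) * comb(n - c, k - j)
               for j in range(threshold, min(c, k) + 1))
    return hits / total

def g_pass_at_k(samples: list[tuple[int, int]], k: int, tau: float) -> float:
    """Average G-Pass@k_tau over a benchmark.
    `samples` holds one (n, c) pair per problem."""
    return mean(g_pass_at_k_tau(n, c, k, tau) for n, c in samples)

# Hypothetical example: 48 sampled generations per problem (temperature 1.0,
# top-p 1.0, matching the paper's setup and satisfying n >= 3k for k = 16),
# evaluated at tau = 1.0, i.e. all 16 drawn generations must be correct.
results = [(48, 40), (48, 12), (48, 48)]
print(g_pass_at_k(results, k=16, tau=1.0))
```

Setting τ = 1.0 imposes the strictest requirement (every one of the k drawn samples must be correct), while lower τ values such as 0.5 only require a fraction of them to be correct; comparing these scores against plain accuracy is what exposes the stability gap.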
-----
💡 Key Insights:
→ LLMs show significant instability in reasoning, with performance drops exceeding 50% across repeated sampling
→ Increasing model size doesn't necessarily improve stability
→ There is a notable gap between a model's peak capability and the performance it can deliver stably
→ Data contamination boosts accuracy but reduces stability
-----
📊 Results:
→ The best-performing model, OpenAI o1-mini, achieves 66.5% accuracy on LiveMathBench
→ Its performance drops by 36.9% when stability is measured
→ Most models show an average performance decline of 60% under strict stability requirements