VisAidMath, the benchmark proposed in this paper, reveals that LLMs struggle to use visual aids in mathematical reasoning: only 3% of correct solutions make effective use of the generated visual aids
📚 https://arxiv.org/abs/2410.22995
🎯 Original Problem:
Current benchmarks mainly focus on text-based mathematical reasoning, neglecting how LLMs and multimodal models utilize visual information during problem-solving. This creates a significant gap in evaluating models' ability to use visual aids effectively.
-----
🔍 Solution in this Paper:
→ Created VisAidMath: A benchmark with 1,200 challenging math problems across plane geometry, solid geometry, analytic geometry, and calculus
→ Each problem contains a Visual Context (C), a Question (Q), Visual Aids (V), and an Answer (A) (see the sketch after this list)
→ Introduced two key tasks:
- CQ2VA: Generate visual aids and answer from context/question
- CQV2A: Use the provided visual aids, together with the context/question, to solve the problem
→ Implemented rigorous data curation pipeline with automated processes and manual annotations
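As a rough illustration of the benchmark format and the two task settings, here is a minimal Python sketch. The field names, prompt wording, and helper functions (`VisAidMathItem`, `cq2va_prompt`, `cqv2a_prompt`) are hypothetical and only mirror the C/Q/V/A structure described above; the paper's actual schema and prompts may differ.

```python
from dataclasses import dataclass

@dataclass
class VisAidMathItem:
    """One benchmark problem: Visual Context (C), Question (Q), Visual Aids (V), Answer (A)."""
    context: str      # C: description of the figure/setting
    question: str     # Q: the math question to solve
    visual_aids: str  # V: golden auxiliary constructions (e.g., added lines, coordinates)
    answer: str       # A: golden final answer

def cq2va_prompt(item: VisAidMathItem) -> str:
    # CQ2VA: the model must generate both the visual aids and the answer
    # from the context and question alone.
    return (
        f"Context: {item.context}\n"
        f"Question: {item.question}\n"
        "First describe the visual aids you would construct, then give the final answer."
    )

def cqv2a_prompt(item: VisAidMathItem) -> str:
    # CQV2A: the golden visual aids are provided; the model only needs to
    # use them to reach the answer.
    return (
        f"Context: {item.context}\n"
        f"Question: {item.question}\n"
        f"Visual aids: {item.visual_aids}\n"
        "Use the given visual aids to solve the problem and give the final answer."
    )
```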
-----
💡 Key Insights:
→ Only 3% of correct solutions actually utilized generated visual aids effectively
→ Most models tend to ignore visual aids, relying on pure arithmetic (41.1%) or hallucination (33.2%)
→ Correct visual aids significantly reduce hallucination in reasoning steps
→ Visual aid errors strongly correlate with incorrect final answers
-----
📊 Results:
→ The best-performing model, GPT-4V, achieved only 45.33% accuracy on visual-aided reasoning
→ Most models performed below the random-choice baseline (24.42%)
→ Generated visual aids showed only 5% n-gram similarity to the golden references (see the sketch after this list)
→ 38.5% of cases involved various forms of hallucination
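To make the n-gram similarity figure concrete, here is a small sketch of one way such an overlap score could be computed, assuming simple whitespace tokenization and a clipped n-gram precision (BLEU-style). The paper's exact metric may differ; this is only an illustrative approximation.

```python
from collections import Counter

def ngrams(tokens: list[str], n: int) -> Counter:
    """Count all contiguous n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_similarity(generated: str, golden: str, n: int = 2) -> float:
    """Fraction of generated n-grams that also appear in the golden reference
    (clipped counts, similar in spirit to BLEU's modified precision)."""
    gen, ref = ngrams(generated.split(), n), ngrams(golden.split(), n)
    if not gen:
        return 0.0
    overlap = sum(min(count, ref[g]) for g, count in gen.items())
    return overlap / sum(gen.values())

# Example: low overlap between a generated construction and a golden one
print(ngram_similarity(
    "draw segment AC and mark its midpoint M",
    "construct the auxiliary line AD perpendicular to BC",
))
```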