
"VisAidMath: Benchmarking Visual-Aided Mathematical Reasoning"

The podcast on this paper is generated with Google's Illuminate.

The VisAidMath benchmark, proposed in this paper, reveals that LLMs struggle to use visual aids in mathematical reasoning: only about 3% of correct solutions actually make effective use of the generated visual aids.

📚 https://arxiv.org/abs/2410.22995

🎯 Original Problem:

Current benchmarks mainly focus on text-based mathematical reasoning, neglecting how LLMs and multimodal models utilize visual information during problem-solving. This creates a significant gap in evaluating models' ability to use visual aids effectively.

-----

🔍 Solution in this Paper:

→ Created VisAidMath: A benchmark with 1,200 challenging math problems across plane geometry, solid geometry, analytic geometry, and calculus

→ Each problem contains: Visual Context (C), Question (Q), Visual Aids (V), and Answer (A) (see the sketch after this list)

→ Introduced two key tasks:

- CQ2VA: Generate visual aids and answer from context/question

- CQV2A: Use provided visual aids with context/question to solve

→ Implemented a rigorous data-curation pipeline combining automated processing with manual annotation
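
To make the C/Q/V/A structure and the two task settings concrete, here is a minimal sketch of how a problem record and the corresponding prompts could look. The class name, field names, and prompt wording are illustrative assumptions, not the paper's actual data schema or prompt templates.

```python
from dataclasses import dataclass

@dataclass
class VisAidMathProblem:
    context: str      # Visual Context (C): description of the given figure
    question: str     # Question (Q)
    visual_aids: str  # Visual Aids (V): golden auxiliary constructions (e.g., added lines)
    answer: str       # Answer (A)

def cq2va_prompt(p: VisAidMathProblem) -> str:
    """CQ2VA: the model must generate both the visual aids and the answer."""
    return (
        f"Context: {p.context}\n"
        f"Question: {p.question}\n"
        "First describe the visual aids (auxiliary constructions) you would draw, "
        "then give the final answer."
    )

def cqv2a_prompt(p: VisAidMathProblem) -> str:
    """CQV2A: the golden visual aids are provided; the model only has to answer."""
    return (
        f"Context: {p.context}\n"
        f"Question: {p.question}\n"
        f"Visual aids: {p.visual_aids}\n"
        "Using the visual aids above, give the final answer."
    )
```

Comparing model performance under the two prompts isolates how much the provided (or self-generated) visual aids actually contribute to solving the problem.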

-----

💡 Key Insights:

→ Only 3% of correct solutions actually utilized generated visual aids effectively

→ Most models tend to ignore visual aids, relying on pure arithmetic (41.1%) or hallucination (33.2%)

→ Correct visual aids significantly reduce hallucination in reasoning steps

→ Visual aid errors strongly correlate with incorrect final answers

-----

📊 Results:

→ The best-performing model, GPT-4V, achieved only 45.33% accuracy on visual-aided reasoning

→ Most models performed below the random-choice baseline (24.42%)

→ Generated visual aids showed only 5% n-gram similarity to the golden references (see the sketch after this list)

→ 38.5% of cases involved various forms of hallucination
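
The 5% n-gram similarity figure indicates very low lexical overlap between the visual aids that models generate and the golden references. Below is a minimal sketch of such a comparison, assuming a simple precision-style n-gram overlap; the paper's exact similarity metric may differ, and the example strings are hypothetical.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_overlap(generated: str, reference: str, n: int = 2) -> float:
    """Fraction of the generated text's n-grams that also appear in the reference.
    A simple precision-style overlap, used here only for illustration."""
    gen = ngrams(generated.lower().split(), n)
    ref = set(ngrams(reference.lower().split(), n))
    if not gen:
        return 0.0
    return sum(g in ref for g in gen) / len(gen)

# Hypothetical example: low overlap between a generated and a golden visual aid
generated_aid = "draw a line from point A parallel to BC"
golden_aid = "construct the perpendicular from vertex A to side BC and mark the foot H"
print(f"{ngram_overlap(generated_aid, golden_aid):.2%}")
```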
