"The Jumping Reasoning Curve? Tracking the Evolution of Reasoning Performance in GPT-[n] and o-[n] Models on Multimodal Puzzles"
The podcast below on this paper was generated with Google's Illuminate.
https://arxiv.org/abs/2502.01081
The paper addresses the challenge of evaluating advanced reasoning in LLMs beyond symbolic patterns, focusing on multimodal scenarios that combine vision and language. Existing benchmarks such as ARC-AGI remain limited to symbolic patterns, prompting the need to assess LLMs on tasks that require visual perception together with abstract or algorithmic reasoning.
This paper proposes to track the evolution of reasoning in GPT-[n] and o-[n] models using multimodal puzzles. These puzzles demand fine-grained visual perception and abstract or algorithmic reasoning, offering a robust evaluation of multimodal reasoning capabilities.
-----
📌 o1 model shows reasoning gains on multimodal puzzles, but its 750x higher inference cost than GPT-4o poses a significant practical deployment challenge.
📌 PuzzleVQA and AlgoPuzzleVQA datasets effectively pinpoint visual perception as a key bottleneck in current LLMs, even in advanced models like o1.
📌 Open-ended evaluations reveal that LLMs struggle with precise visual perception, unlike in multiple-choice settings where answer cues aid performance.
----------
Methods Explored in this Paper 🔧:
→ This paper evaluates GPT-4-Turbo, GPT-4o, and o1 models from OpenAI.
→ It uses two multimodal puzzle datasets: PuzzleVQA for abstract reasoning and AlgoPuzzleVQA for algorithmic problem-solving.
→ Puzzles are presented in both multiple-choice and open-ended question formats to comprehensively assess model performance.
→ Zero-shot chain-of-thought prompting is used for the GPT-[n] models in the multiple-choice setup.
→ GPT-4o is used as a judge for open-ended responses, matching generated answers against the ground truth (see the sketch after this list).
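A minimal sketch of this evaluation pipeline, assuming the OpenAI Python SDK and base64-encoded puzzle images; the prompt wording, the judge instruction, and helper names such as `ask_multiple_choice` are illustrative stand-ins, not the paper's exact templates:

```python
# Sketch of the two evaluation steps described above:
# (1) zero-shot chain-of-thought query on a puzzle image in the multiple-choice setup,
# (2) GPT-4o-as-judge matching of an open-ended answer against the ground truth.
import base64
from openai import OpenAI

client = OpenAI()

def encode_image(path: str) -> str:
    """Read a puzzle image and return it as a base64 data URL."""
    with open(path, "rb") as f:
        return "data:image/png;base64," + base64.b64encode(f.read()).decode()

def ask_multiple_choice(model: str, image_path: str, question: str, options: list[str]) -> str:
    """Zero-shot chain-of-thought query in the multiple-choice setup."""
    prompt = (
        f"{question}\nOptions: {', '.join(options)}\n"
        "Let's think step by step, then state the final answer."
    )
    resp = client.chat.completions.create(
        model=model,  # e.g. "gpt-4o" or "o1"
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": encode_image(image_path)}},
                {"type": "text", "text": prompt},
            ],
        }],
    )
    return resp.choices[0].message.content

def judge_open_ended(prediction: str, ground_truth: str) -> bool:
    """Use GPT-4o to decide whether an open-ended answer matches the ground truth."""
    verdict = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": (
                f"Ground truth: {ground_truth}\nModel answer: {prediction}\n"
                "Do these refer to the same answer? Reply with only 'yes' or 'no'."
            ),
        }],
    )
    return verdict.choices[0].message.content.strip().lower().startswith("yes")
```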
-----
Key Insights 💡:
→ Reasoning performance improves from GPT-4-Turbo to GPT-4o and jumps significantly with o1, but o1 comes at roughly 750x the inference cost of GPT-4o.
→ All models perform better in multiple-choice than open-ended settings, indicating that answer options provide helpful cues.
→ Visual perception is a major bottleneck for all models, especially in accurately interpreting fine-grained visual details like shapes and sizes.
→ o1 shows enhanced numerical reasoning and algorithmic problem-solving but still struggles with complex puzzles and visual perception limitations.
-----
Results 📊:
→ o1 achieves 79.2% average accuracy on PuzzleVQA in multiple-choice, outperforming GPT-4-Turbo (54.2%) and GPT-4o (60.6%).
→ On AlgoPuzzleVQA in multiple-choice, o1 achieves 55.3% average accuracy, compared to GPT-4-Turbo (36.5%) and GPT-4o (43.6%).
→ In the open-ended setting on AlgoPuzzleVQA, o1 accuracy drops to 32.2% from 55.3% in multiple-choice, a decline of 23.1 percentage points.
→ Providing visual perception guidance improves model performance by 22% to 30% across models on PuzzleVQA in the open-ended setting (see the sketch below).
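A hypothetical sketch of how such perception guidance might be injected: a ground-truth description of the visual elements is prepended to the question, so the model only has to reason rather than perceive. The function name and the example instance below are illustrative, not taken from the paper:

```python
# Hypothetical perception-guidance probe: supply the visual details as text
# so that any remaining errors come from reasoning, not perception.
def build_guided_prompt(question: str, perception_caption: str) -> str:
    return (
        "Visual details of the puzzle image: " + perception_caption + "\n"
        + question + "\n"
        "Let's think step by step, then state the final answer."
    )

# Example usage with a made-up PuzzleVQA-style instance:
prompt = build_guided_prompt(
    question="What is the missing number in the last circle?",
    perception_caption="Three circles contain the numbers 2, 4, and ?, arranged left to right.",
)
print(prompt)
```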