"The Jumping Reasoning Curve? Tracking the Evolution of Reasoning Performance in GPT-[n] and o-[n] Models on Multimodal Puzzles"
The podcast below on this paper was generated with Google's Illuminate.
https://arxiv.org/abs/2502.01081
The paper addresses the challenge of evaluating advanced reasoning in LLMs beyond symbolic patterns, focusing on multimodal scenarios that combine vision and language. Existing benchmarks such as ARC-AGI remain limited to symbolic patterns, prompting the need to assess LLMs on tasks that require visual perception together with abstract or algorithmic reasoning.
This paper proposes to track the evolution of reasoning in GPT-[n] and o-[n] models using multimodal puzzles. These puzzles demand fine-grained visual perception and abstract or algorithmic reasoning, offering a robust evaluation of multimodal reasoning capabilities.
-----
📌 o1 model shows reasoning gains on multimodal puzzles, but its 750x higher inference cost than GPT-4o poses a significant practical deployment challenge.
📌 PuzzleVQA and AlgoPuzzleVQA datasets effectively pinpoint visual perception as a key bottleneck in current LLMs, even in advanced models like o1.
📌 Open-ended evaluations reveal that LLMs struggle with precise visual perception, unlike in multiple-choice settings where answer cues aid performance.
----------
Methods Explored in this Paper 🔧:
→ This paper evaluates GPT-4-Turbo, GPT-4o, and o1 models from OpenAI.
→ It uses two multimodal puzzle datasets: PuzzleVQA for abstract reasoning and AlgoPuzzleVQA for algorithmic problem-solving.
→ Puzzles are presented in both multiple-choice and open-ended question formats to comprehensively assess model performance.
→ Zero-shot chain-of-thought prompting is used for the GPT-[n] models in the multiple-choice setup.
→ GPT-4o is used as a judge for open-ended responses, matching generated answers against the ground truth (see the sketch after this list).
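A minimal sketch of this evaluation pipeline, assuming the OpenAI Python SDK and base64-encoded puzzle images; the prompt wording, the judge instruction, and helper names such as `ask_multiple_choice` are illustrative stand-ins, not the paper's exact templates:

```python
# Sketch of the two evaluation steps described above:
# (1) zero-shot chain-of-thought query on a puzzle image in the multiple-choice setup,
# (2) GPT-4o-as-judge matching of an open-ended answer against the ground truth.
import base64
from openai import OpenAI

client = OpenAI()

def encode_image(path: str) -> str:
    """Read a puzzle image and return it as a base64 data URL."""
    with open(path, "rb") as f:
        return "data:image/png;base64," + base64.b64encode(f.read()).decode()

def ask_multiple_choice(model: str, image_path: str, question: str, options: list[str]) -> str:
    """Zero-shot chain-of-thought query in the multiple-choice setup."""
    prompt = (
        f"{question}\nOptions: {', '.join(options)}\n"
        "Let's think step by step, then state the final answer."
    )
    resp = client.chat.completions.create(
        model=model,  # e.g. "gpt-4o" or "o1"
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": encode_image(image_path)}},
                {"type": "text", "text": prompt},
            ],
        }],
    )
    return resp.choices[0].message.content

def judge_open_ended(prediction: str, ground_truth: str) -> bool:
    """Use GPT-4o to decide whether an open-ended answer matches the ground truth."""
    verdict = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": (
                f"Ground truth: {ground_truth}\nModel answer: {prediction}\n"
                "Do these refer to the same answer? Reply with only 'yes' or 'no'."
            ),
        }],
    )
    return verdict.choices[0].message.content.strip().lower().startswith("yes")
```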
-----
Key Insights 💡:
→ Reasoning performance improves from GPT-4-Turbo to GPT-4o and jumps significantly with o1, but o1 comes at roughly 750x the inference cost of GPT-4o.
→ All models perform better in multiple-choice than open-ended settings, indicating that answer options provide helpful cues.
→ Visual perception is a major bottleneck for all models, especially in accurately interpreting fine-grained visual details like shapes and sizes.
→ o1 shows enhanced numerical reasoning and algorithmic problem-solving but still struggles with complex puzzles and visual perception limitations.
-----
Results 📊:
→ o1 achieves 79.2% average accuracy on PuzzleVQA in multiple-choice, outperforming GPT-4-Turbo (54.2%) and GPT-4o (60.6%).
→ On AlgoPuzzleVQA in multiple-choice, o1 achieves 55.3% average accuracy, compared to GPT-4-Turbo (36.5%) and GPT-4o (43.6%).
→ In the open-ended setting on AlgoPuzzleVQA, o1 accuracy drops to 32.2% from 55.3% in multiple-choice, a decline of 23.1 percentage points.
→ Providing visual perception guidance improves model performance by 22% to 30% across models on PuzzleVQA in the open-ended setting (see the sketch below).
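A hypothetical sketch of how such perception guidance might be injected: a ground-truth description of the visual elements is prepended to the question, so the model only has to reason rather than perceive. The function name and the example instance below are illustrative, not taken from the paper:

```python
# Hypothetical perception-guidance probe: supply the visual details as text
# so that any remaining errors come from reasoning, not perception.
def build_guided_prompt(question: str, perception_caption: str) -> str:
    return (
        "Visual details of the puzzle image: " + perception_caption + "\n"
        + question + "\n"
        "Let's think step by step, then state the final answer."
    )

# Example usage with a made-up PuzzleVQA-style instance:
prompt = build_guided_prompt(
    question="What is the missing number in the last circle?",
    perception_caption="Three circles contain the numbers 2, 4, and ?, arranged left to right.",
)
print(prompt)
```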