LMMs struggle with continuous visual feedback; MageBench reveals the gap through interactive testing.
MageBench is a new benchmark that evaluates Large Multimodal Models (LMMs) as agents, testing how well they reason over continuous visual feedback in three lightweight environments: WebUI, Sokoban, and Football.
-----
https://arxiv.org/abs/2412.04531
🤖 Original Problem:
Current LMM benchmarks assess reasoning only through text-based chains of thought, and fail to evaluate how models handle continuous visual feedback, a crucial requirement for real-world agents and robotics applications.
-----
🔍 Solution in this Paper:
→ MageBench introduces three lightweight yet challenging environments, each probing a different agent capability.
→ WebUI tests engineering knowledge by requiring agents to rebuild target webpages from screenshots.
→ Sokoban evaluates spatial intelligence through box-pushing puzzles that demand multi-step planning.
→ Football examines interaction skills by controlling players in dynamic game scenarios.
→ Each environment is run in a Global setting (the agent plans from the initial observation alone) and an Online setting (the agent acts step by step as new screenshots arrive), as sketched below.
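To make the Online setting concrete, here is a minimal sketch of the observe-act loop such an agent runs. The environment interface, the action names, and the `query_lmm` helper are hypothetical stand-ins for illustration, not MageBench's actual API.

```python
# Minimal sketch of an Online-setting agent loop with continuous visual feedback.
# The env interface, action strings, and query_lmm are hypothetical placeholders.
from dataclasses import dataclass, field


@dataclass
class Transcript:
    # Interleaved image-text history that is fed back to the LMM at every step.
    items: list = field(default_factory=list)

    def add_image(self, screenshot: bytes) -> None:
        self.items.append({"type": "image", "data": screenshot})

    def add_text(self, text: str) -> None:
        self.items.append({"type": "text", "data": text})


def run_online_episode(env, query_lmm, max_steps: int = 50) -> float:
    """Act step by step: after every action the new screenshot is appended
    to the context, so the model must revise its plan from visual feedback."""
    transcript = Transcript()
    screenshot = env.reset()                  # initial visual observation
    total_reward = 0.0

    for _ in range(max_steps):
        transcript.add_image(screenshot)
        transcript.add_text("Choose the next action: up, down, left, right.")
        action = query_lmm(transcript.items)  # LMM picks an action from the history
        screenshot, reward, done = env.step(action)
        total_reward += reward
        transcript.add_text(f"Executed: {action}")
        if done:
            break
    return total_reward
```

In the Global setting, by contrast, the model would see only the first screenshot and emit its full action sequence up front, with no chance to correct course.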
-----
💡 Key Insights:
→ Only GPT-4o and Gemini-1.5-Pro outperformed random baselines in the Online settings
→ Current models lack visual imagination and the ability to revise plans based on visual feedback
→ Models struggle to handle long interleaved image-text contexts, which grow with every step (see the rough sketch below)
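A rough back-of-envelope sketch of why the interleaved image-text context becomes a bottleneck in the Online setting: the per-item token figures below are assumptions for illustration, not numbers from the paper.

```python
# Back-of-envelope sketch of context growth in the Online setting.
# Token costs per item are assumed values, not figures from MageBench.
TOKENS_PER_SCREENSHOT = 1000   # assumed cost of one encoded screenshot
TOKENS_PER_TEXT_TURN = 50      # assumed cost of one instruction/action pair


def context_tokens(num_steps: int) -> int:
    """Every step appends one screenshot plus one text turn to the history."""
    return num_steps * (TOKENS_PER_SCREENSHOT + TOKENS_PER_TEXT_TURN)


for steps in (10, 50, 100):
    print(f"{steps} steps -> ~{context_tokens(steps):,} tokens")
# 10 steps -> ~10,500 tokens; 100 steps -> ~105,000 tokens, a length at which
# many models already degrade on interleaved image-text inputs.
```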
-----
📊 Results:
→ In WebUI, Claude achieved 64.11% AES in the Global setting vs. the human baseline of 68.71%
→ In Sokoban-Online, the best model achieved 53.03% reward vs. the human baseline of 96.85%
→ In Football-Online, the highest model score was 21.20% vs. the human baseline of 54.68%