
"MageBench: Bridging Large Multimodal Models to Agents"

The podcast on this paper is generated with Google's Illuminate.

LMMs struggle with continuous visual feedback; MageBench reveals this gap through interactive testing.

MageBench introduces a novel benchmark for evaluating Large Multimodal Models (LMMs) as intelligent agents, focusing on continuous visual feedback and reasoning capabilities through three lightweight environments: WebUI, Sokoban, and Football.

-----

https://arxiv.org/abs/2412.04531

🤖 Original Problem:

Current LMM benchmarks assess reasoning only through text chains, failing to evaluate how models handle continuous visual feedback, a crucial requirement for real-world agents and robotics applications.

-----

🔍 Solution in this Paper:

→ MageBench introduces three lightweight yet challenging environments that test different agent capabilities.

→ WebUI tests engineering knowledge by requiring agents to build target webpages from screenshots.

→ Sokoban evaluates spatial intelligence through box-pushing puzzles requiring planning.

→ Football examines interaction skills by controlling players in dynamic game scenarios.
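
To make the online setting concrete, below is a minimal sketch of the agent-environment loop these environments imply: the agent acts, the environment renders a new frame, and the agent must revise its plan from that visual feedback. All names here (ToyEnv, Observation, choose_action, MAX_STEPS) are hypothetical stand-ins rather than the paper's actual interface; a real run would replace choose_action with an LMM call over the accumulated frames.

```python
# A minimal sketch (not the paper's code) of the online agent-environment loop:
# the agent acts, the environment renders a new frame, and the agent must plan
# again from that visual feedback. ToyEnv, Observation, choose_action, and
# MAX_STEPS are hypothetical stand-ins.

from dataclasses import dataclass

MAX_STEPS = 20  # hypothetical per-episode step budget


@dataclass
class Observation:
    frame: str        # stand-in for a rendered screenshot or game frame
    instruction: str  # task description shown to the agent


class ToyEnv:
    """Toy 1-D box-pushing task standing in for an environment like Sokoban."""

    def __init__(self, goal: int = 5):
        self.goal = goal
        self.pos = 0

    def reset(self) -> Observation:
        self.pos = 0
        return self._obs()

    def step(self, action: str):
        self.pos += 1 if action == "push_right" else -1
        done = self.pos == self.goal
        reward = 1.0 if done else 0.0
        return self._obs(), reward, done

    def _obs(self) -> Observation:
        return Observation(frame=f"box at {self.pos}", instruction=f"push the box to {self.goal}")


def choose_action(history, obs: Observation) -> str:
    """Placeholder for an LMM call: a real agent would receive the interleaved
    history of frames and actions plus the current frame, then return an action."""
    return "push_right"


def run_episode(env: ToyEnv) -> float:
    obs = env.reset()
    history, total_reward = [], 0.0
    for _ in range(MAX_STEPS):
        action = choose_action(history, obs)   # model reasons over visual feedback
        history.append((obs, action))
        obs, reward, done = env.step(action)   # the next frame reflects the action taken
        total_reward += reward
        if done:
            break
    return total_reward


if __name__ == "__main__":
    print("episode reward:", run_episode(ToyEnv()))
```

This loop is where the paper's Online setting differs from the Global setting: instead of committing to one up-front plan, the model has to keep adjusting its actions to each newly rendered frame.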

-----

💡 Key Insights:

→ Only GPT-4o and Gemini-1.5-Pro outperformed random baselines in online settings

→ Current models lack visual imagination and the ability to revise plans based on visual feedback

→ Models struggle to handle long, interleaved image-text contexts (sketched below)
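
A rough illustration of why that matters: in an online episode, every step appends another frame plus the action taken, so the prompt becomes a long alternating image-text sequence. The sketch below builds such a payload using an OpenAI-style chat content layout as an assumption; encode_frame, build_messages, and the placeholder frame bytes are hypothetical, and the paper's actual prompting may differ.

```python
# A sketch of the interleaved image-text context an online agent accumulates.
# The message layout imitates an OpenAI-style chat content list as an assumption;
# encode_frame, build_messages, and the placeholder frame bytes are hypothetical.

import base64


def encode_frame(png_bytes: bytes) -> str:
    """Inline a frame as a data URL (hypothetical helper)."""
    return "data:image/png;base64," + base64.b64encode(png_bytes).decode()


def build_messages(task: str, trajectory: list, current_frame: bytes) -> list:
    """trajectory: list of (frame_bytes, action_taken) pairs from earlier steps."""
    content = [{"type": "text", "text": f"Task: {task}"}]
    for frame, action in trajectory:  # interleave past frames with the actions taken
        content.append({"type": "image_url", "image_url": {"url": encode_frame(frame)}})
        content.append({"type": "text", "text": f"Action taken: {action}"})
    content.append({"type": "image_url", "image_url": {"url": encode_frame(current_frame)}})
    content.append({"type": "text", "text": "Choose the next action."})
    return [{"role": "user", "content": content}]


# After 10 steps the prompt already holds 11 images interleaved with text.
placeholder_frame = b"\x89PNG placeholder"  # not a real image
messages = build_messages(
    "push the box onto the target",
    [(placeholder_frame, "push_right")] * 10,
    placeholder_frame,
)
print(len(messages[0]["content"]), "content blocks")
```

The context grows linearly with the number of steps, so long-horizon interactive tasks stress exactly this interleaved-context capability.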

-----

📊 Results:

→ In WebUI, Claude achieved 64.11% AES in the Global setting vs. 68.71% for humans

→ In Sokoban-Online, the best model achieved a 53.03% reward vs. 96.85% for humans

→ In Football-Online, the highest model score was 21.20% vs. 54.68% for humans
