
"EgoPlan-Bench2: A Benchmark for Multimodal Large Language Model Planning in Real-World Scenarios"

The podcast on this paper is generated with Google's Illuminate.

EgoPlan-Bench2 tests how well AI can plan real-world tasks from first-person videos.

EgoPlan-Bench2 is a comprehensive benchmark for evaluating multimodal LLMs' planning abilities across diverse real-world scenarios. It includes 1,321 multiple-choice questions from 1,113 egocentric videos, spanning 4 major life domains and 24 detailed scenarios.
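
To make the question format concrete, a single benchmark item can be thought of as a task goal, an egocentric video of the progress so far, and four candidate next actions. The field names below are illustrative assumptions, not the paper's released schema:

```python
# Hypothetical structure of one EgoPlan-Bench2 question
# (field names are illustrative, not the official schema).
example_item = {
    "domain": "daily life",          # one of the 4 major life domains
    "scenario": "cooking",           # one of the 24 detailed scenarios
    "task_goal": "make a cup of coffee",
    "video_clip": "egocentric_clip_0042.mp4",   # actions completed so far
    "current_observation": "last frame of the clip",
    "candidate_actions": [
        "pour hot water into the cup",
        "open the fridge",
        "wash the spoon",
        "turn off the stove",
    ],
    "answer_index": 0,               # correct next action given goal and progress
}
```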

-----

https://arxiv.org/abs/2412.04447

🤔 Original Problem:

Current multimodal LLMs excel at comprehension but struggle with real-world planning tasks. Existing benchmarks focus mainly on comprehension rather than planning capabilities.

-----

🛠️ Solution in this Paper:

→ EgoPlan-Bench2 evaluates planning abilities through first-person perspective videos covering daily tasks.

→ Uses a semi-automatic pipeline with GPT-4 to extract task goals and generate multiple-choice questions (a minimal sketch follows this list).

→ Implements rigorous verification through model testing and human annotation.

→ Proposes multimodal Chain-of-Thought prompting to enhance planning capabilities without additional training (see the prompt sketch after the Key Insights).
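
A minimal sketch of the semi-automatic question-generation idea, assuming the OpenAI Python client; the prompt wording, model name, and helper functions are illustrative assumptions, not the paper's exact pipeline:

```python
# Sketch: extract a task goal from egocentric narrations with GPT-4,
# then assemble a multiple-choice planning question from it.
import random

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def extract_task_goal(narrations: list[str]) -> str:
    """Ask GPT-4 to summarize first-person action narrations into a concise task goal."""
    prompt = (
        "These are first-person action narrations from an egocentric video:\n"
        + "\n".join(f"- {n}" for n in narrations)
        + "\nSummarize the overall task goal in one short phrase."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()

def build_question(task_goal: str, correct_next_action: str, distractors: list[str]) -> dict:
    """Assemble one multiple-choice planning question; items are later human-verified."""
    options = [correct_next_action] + distractors
    random.shuffle(options)
    return {
        "question": f"Task goal: {task_goal}. Given the progress in the video, what should be done next?",
        "options": options,
        "answer_index": options.index(correct_next_action),
    }
```

In the paper, generated questions additionally pass through model testing and human annotation before entering the benchmark, which the sketch above leaves out.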

-----

💡 Key Insights:

→ MLLMs struggle with temporal understanding and complex scene interpretation

→ Action sequences are more effective than scene descriptions for task planning

→ Bounding boxes highlighting human-object interactions improve planning accuracy
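
A minimal sketch of how these insights could feed the multimodal Chain-of-Thought prompting mentioned above: describe the completed action sequence in text and highlight hand-object interactions with bounding boxes in the current frame before asking for the next action. It assumes the Pillow library and a generic vision-language chat API; prompt wording and function names are illustrative, not the paper's exact method:

```python
# Sketch: visual hint (bounding boxes) + textual action history as a CoT-style prompt.
from PIL import Image, ImageDraw

def highlight_interactions(frame_path: str, boxes: list[tuple[int, int, int, int]]) -> Image.Image:
    """Draw bounding boxes around hand-object interactions in the current frame."""
    frame = Image.open(frame_path).convert("RGB")
    draw = ImageDraw.Draw(frame)
    for x0, y0, x1, y1 in boxes:
        draw.rectangle([x0, y0, x1, y1], outline="red", width=4)
    return frame

def build_cot_prompt(task_goal: str, completed_actions: list[str], options: list[str]) -> str:
    """Compose a Chain-of-Thought prompt: goal, action history, then the planning question."""
    history = "\n".join(f"{i + 1}. {a}" for i, a in enumerate(completed_actions))
    choices = "\n".join(f"{chr(65 + i)}. {o}" for i, o in enumerate(options))
    return (
        f"Task goal: {task_goal}\n"
        f"Actions completed so far (in order):\n{history}\n"
        "The highlighted boxes in the image mark the hands and the objects being manipulated.\n"
        "Reason step by step about what has been done and what remains, "
        f"then choose the single best next action:\n{choices}"
    )
```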

-----

📊 Results:

→ Most MLLMs perform at random guess level (23-27%)

→ GPT-4V achieves best performance at 32.63%

→ Proposed prompting method improves GPT-4V accuracy by 10.24%
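
For context, scoring is plain multiple-choice accuracy: with four options per question, random guessing lands near 25%, which is why results in the 23-27% band indicate essentially no planning ability. A minimal scoring sketch (function name assumed):

```python
def accuracy(predictions: list[int], ground_truth: list[int]) -> float:
    """Fraction of multiple-choice questions where the predicted option matches the answer."""
    correct = sum(p == g for p, g in zip(predictions, ground_truth))
    return correct / len(ground_truth)
```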
