EgoPlan-Bench2 tests how well AI can plan real-world tasks by watching first-person videos.
EgoPlan-Bench2 is a comprehensive benchmark for evaluating the planning abilities of multimodal LLMs (MLLMs) across diverse real-world scenarios. It comprises 1,321 multiple-choice questions drawn from 1,113 egocentric videos, spanning 4 major life domains and 24 detailed scenarios.
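For concreteness, here is a minimal sketch of what a single benchmark item might look like. The field names and example values are illustrative guesses based on the description above, not the paper's released schema:

```python
# Illustrative sketch of one EgoPlan-Bench2 item.
# Field names are hypothetical -- the released benchmark may differ.
from dataclasses import dataclass

@dataclass
class EgoPlanItem:
    video_id: str        # source egocentric video clip
    task_goal: str       # high-level goal, e.g. "prepare a salad"
    history: str         # actions already completed in the clip
    question: str        # asks for the next appropriate action
    choices: list[str]   # 4 candidate next actions (multiple choice)
    answer: str          # ground-truth next action
    domain: str          # one of 4 major life domains
    scenario: str        # one of 24 detailed scenarios

item = EgoPlanItem(
    video_id="ego_0001",
    task_goal="prepare a salad",
    history="washed the lettuce; sliced the tomatoes",
    question="Given the progress shown, what should you do next?",
    choices=["dress the salad", "wash the lettuce", "boil water", "slice bread"],
    answer="dress the salad",
    domain="daily life",
    scenario="cooking",
)
```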
-----
https://arxiv.org/abs/2412.04447
🤔 Original Problem:
Current multimodal LLMs excel at comprehension but struggle with real-world planning tasks. Existing benchmarks focus mainly on comprehension rather than planning capabilities.
-----
🛠️ Solution in this Paper:
→ EgoPlan-Bench2 evaluates planning abilities through first-person perspective videos covering daily tasks.
→ Uses a semi-automatic pipeline with GPT-4 to extract task goals and generate multiple-choice questions (a pipeline sketch follows this list).
→ Implements rigorous verification through model testing and human annotation.
→ Proposes multimodal Chain-of-Thought (CoT) prompting to enhance planning capabilities without additional training (sketched after the Key Insights list below).
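A minimal sketch of how the GPT-4 question-generation step of such a pipeline could be wired up. The prompt wording, model name, and example narrations are illustrative stand-ins, not the paper's actual implementation; as the bullets note, drafts are then filtered by model testing and verified by human annotators:

```python
# Hypothetical sketch of a GPT-4-based question-generation step.
# The prompt and model choice are illustrative, not the paper's code.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_question(narrations: list[str]) -> dict:
    """Ask GPT-4 to summarize a task goal and draft a multiple-choice
    planning question from a video's step-by-step action narrations."""
    prompt = (
        "You are given step-by-step action narrations from an egocentric video:\n"
        + "\n".join(f"- {n}" for n in narrations)
        + "\n\nReturn JSON with keys: task_goal, question, choices (4 options), answer. "
        "The question should ask for the next appropriate action given the progress so far."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(resp.choices[0].message.content)

# Drafts like this one still go through model testing and human verification.
draft = generate_question(
    ["pick up kettle", "fill kettle with water", "place kettle on stove"]
)
```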
-----
💡 Key Insights:
→ MLLMs struggle with temporal understanding and complex scene interpretation
→ Action sequences are more effective than scene descriptions for task planning
→ Bounding boxes highlighting human-object interactions improve planning accuracy (the sketch after this list combines both insights)
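Putting the last two insights together, here is a hedged sketch of the kind of two-stage multimodal CoT prompting the paper's method suggests: first elicit the action sequence observed so far from frames annotated with hand-object bounding boxes, then answer the planning question conditioned on that sequence. The prompts and the assumption that frames are pre-annotated offline by a detector are mine, and "gpt-4o" stands in for GPT-4V:

```python
# Hedged sketch of two-stage multimodal CoT prompting for planning.
# Assumes frame images were pre-annotated offline with detector-predicted
# hand-object bounding boxes; prompts are illustrative, not the paper's.
import base64
from openai import OpenAI

client = OpenAI()

def encode_frame(path: str) -> str:
    """Base64-encode a frame image for the vision API."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

def plan_next_action(frame_paths: list[str], task_goal: str, choices: list[str]) -> str:
    frames = [
        {"type": "image_url",
         "image_url": {"url": f"data:image/jpeg;base64,{encode_frame(p)}"}}
        for p in frame_paths  # frames carry hand-object boxes drawn in advance
    ]
    # Stage 1 (CoT): extract the action sequence performed so far.
    actions = client.chat.completions.create(
        model="gpt-4o",  # stand-in for GPT-4V
        messages=[{"role": "user", "content": frames + [
            {"type": "text",
             "text": "List, in order, the actions already performed in these frames."}]}],
    ).choices[0].message.content
    # Stage 2: answer the planning question, conditioned on extracted actions
    # rather than on a free-form scene description.
    question = (
        f"Task goal: {task_goal}\nActions completed so far: {actions}\n"
        f"Which action should come next? Options: {choices}\nAnswer with one option."
    )
    return client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": question}],
    ).choices[0].message.content
```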
-----
📊 Results:
→ Most MLLMs perform at random-guess level (23-27% accuracy)
→ GPT-4V achieves the best performance at 32.63%
→ The proposed multimodal CoT prompting improves GPT-4V's accuracy by 10.24%