The paper introduces Video-MMMU, a benchmark for evaluating how well models acquire knowledge from multi-discipline professional videos, assessed across three stages: perception, comprehension, and adaptation of knowledge.
-----
📌 Benchmarks shift from perception to knowledge integration. Most video AI models excel at object detection but fail at real comprehension. Video-MMMU forces models to acquire, process, and apply domain-specific knowledge, exposing deep limitations in current architectures.
📌 Domain-specific gaps expose brittle generalization. Models struggle disproportionately in specialized fields like medicine and engineering. This suggests a lack of true multimodal reasoning, where visual and textual representations fail to merge into actionable understanding.
📌 Scaling vision-language pretraining isn't enough. GPT-4V's low accuracy, even with chain-of-thought prompting, shows that brute-force scaling of vision-language models does not guarantee deeper knowledge acquisition. Structural improvements in video reasoning architectures are required.
-----
https://arxiv.org/abs/2501.13826
Original Problem 😥:
→ Existing benchmarks for video understanding often focus on action recognition or object detection.
→ They lack evaluations for knowledge acquisition from videos, especially in professional domains.
→ Current benchmarks do not comprehensively assess the perception, comprehension, and adaptation of knowledge.
-----
Solution in this Paper 💡:
→ This paper proposes Video-MMMU, a new benchmark to evaluate knowledge acquisition from multi-discipline professional videos.
→ Video-MMMU includes 300 expert-level videos spanning six professional disciplines: Art, Business, Science, Medicine, Humanities, and Engineering.
→ It features 900 human-annotated questions categorized into perception, comprehension, and adaptation stages.
→ Perception questions test whether the model can identify and extract the key information presented in the video.
→ Comprehension questions assess understanding of the concepts and methods the video teaches.
→ Adaptation questions evaluate whether the learned knowledge can be applied to new scenarios beyond the video.
→ The benchmark uses both multiple-choice and open-ended question formats (a minimal evaluation-loop sketch follows this list).
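To make the three-track structure concrete, here is a minimal sketch of what a benchmark item and an evaluation loop over the perception, comprehension, and adaptation tracks could look like. The `VideoMMMUItem` schema, field names, and `model_fn` interface are assumptions for illustration only, not the benchmark's released data format or official evaluation code.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical schema for a single Video-MMMU-style item; the field names
# are illustrative, not the benchmark's actual JSON keys.
@dataclass
class VideoMMMUItem:
    video_path: str            # path to the lecture-style video clip
    discipline: str            # e.g. "Medicine", "Engineering"
    track: str                 # "perception" | "comprehension" | "adaptation"
    question: str              # question text shown to the model
    options: list[str] | None  # multiple-choice options, None for open-ended
    answer: str                # gold answer (option letter or free-form string)

def evaluate(items: list[VideoMMMUItem],
             model_fn: Callable[[str, str], str]) -> dict[str, float]:
    """Score a model separately on each knowledge-acquisition track.

    `model_fn(video_path, prompt)` is a stand-in for any video-capable
    model API; it should return the model's answer as a string.
    """
    correct: dict[str, int] = {}
    total: dict[str, int] = {}
    for item in items:
        prompt = item.question
        if item.options:
            letters = "ABCDEFGH"
            prompt += "\n" + "\n".join(
                f"{letters[i]}. {opt}" for i, opt in enumerate(item.options)
            )
        prediction = model_fn(item.video_path, prompt)
        # Simplified exact-match scoring; real benchmark scoring for
        # open-ended answers is typically more forgiving than this.
        hit = prediction.strip().lower() == item.answer.strip().lower()
        correct[item.track] = correct.get(item.track, 0) + int(hit)
        total[item.track] = total.get(item.track, 0) + 1
    return {track: correct[track] / total[track] for track in total}
```

Scoring per track rather than only overall is what lets the benchmark separate shallow perception from the comprehension and adaptation stages where current models fall behind.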
-----
Key Insights from this Paper 🧐:
→ Current vision-language models show limited capability in acquiring and utilizing knowledge from videos.
→ Models struggle particularly with comprehension and adaptation stages of knowledge acquisition.
→ Performance varies significantly across different professional domains, indicating domain-specific knowledge gaps.
→ Video-MMMU highlights the need for models to move beyond superficial understanding to deeper knowledge integration from video content.
-----
Results 🏆:
→ GPT-4V achieves 37.0% accuracy on the Video-MMMU benchmark.
→ Even with chain-of-thought prompting, GPT-4V reaches only 42.5% accuracy (a sketch of direct vs. chain-of-thought prompting follows this list).
→ Specialized video models such as VideoChat and InternVid-Chat score even lower, at 29.8% and 26.3%, respectively.
→ Human performance on Video-MMMU is significantly higher, at 75.3% accuracy.
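The chain-of-thought number above refers to prompting the model to reason step by step before answering. As a rough, hedged sketch of what that difference looks like in practice (the template wording and the example question below are hypothetical, not taken from the paper):

```python
def build_prompt(question: str, options: list[str], chain_of_thought: bool) -> str:
    """Assemble a direct or chain-of-thought prompt for one multiple-choice question."""
    letters = "ABCDEFGH"
    option_block = "\n".join(
        f"{letters[i]}. {opt}" for i, opt in enumerate(options)
    )
    if chain_of_thought:
        instruction = (
            "Watch the video, reason step by step about what it teaches, "
            "then answer with the letter of the correct option."
        )
    else:
        instruction = "Answer with the letter of the correct option."
    return f"{instruction}\n\nQuestion: {question}\n{option_block}"


if __name__ == "__main__":
    # Hypothetical example question for demonstration only.
    demo = build_prompt(
        question="Which step of the procedure shown in the video prevents contamination?",
        options=["Sterilizing the instruments", "Labeling the samples",
                 "Calibrating the scale", "Recording the time"],
        chain_of_thought=True,
    )
    print(demo)
```

The reported gap between direct and chain-of-thought prompting (37.0% vs. 42.5%) suggests that extra reasoning at inference time helps, but only modestly, which is the basis for the post's claim that architectural improvements, not just better prompting, are needed.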