OVO-Bench reveals the massive gap between how humans and AI process streaming videos.
OVO-Bench introduces a new evaluation framework for testing video LLMs' ability to understand and reason about videos in real time, the way humans process streaming content.
https://arxiv.org/abs/2501.05510
Original Problem 🤔:
→ Current video LLMs excel at analyzing complete videos after the fact but struggle with real-time understanding, such as answering questions during playback or making decisions from partial information.
Solution in this Paper 🛠️:
→ OVO-Bench evaluates three critical aspects of online video understanding: backward memory tracing, real-time perception, and forward response prediction.
→ The benchmark contains 644 diverse videos across 7 domains, with 2,814 precisely timestamped questions.
→ It tests models' ability to recall past events, understand ongoing actions, and decide when they have enough information to answer confidently (a minimal evaluation sketch follows this list).
→ The evaluation uses a hybrid approach combining automated generation and human curation for high-quality annotations.
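To make the setup concrete, here is a minimal sketch of what timestamp-gated streaming evaluation could look like. The class and function names (StreamingQuery, evaluate_streaming, model.answer) are illustrative assumptions, not OVO-Bench's actual API; the key idea from the paper is that each question is answered using only the frames visible up to its timestamp.

```python
# Hypothetical sketch of streaming (online) video QA evaluation.
from dataclasses import dataclass

@dataclass
class StreamingQuery:
    video_id: str
    timestamp: float   # seconds into the video when the question is asked
    task: str          # e.g. backward tracing, real-time perception, forward prediction
    question: str
    answer: str

def evaluate_streaming(model, video_frames, fps, queries):
    """Answer each question using only the frames seen up to its timestamp."""
    correct = 0
    for q in sorted(queries, key=lambda q: q.timestamp):
        visible = video_frames[: int(q.timestamp * fps)]   # no access to future frames
        prediction = model.answer(visible, q.question)      # hypothetical model call
        correct += int(prediction == q.answer)
    return correct / len(queries) if queries else 0.0
```

This contrasts with offline evaluation, where the model would receive the full frame list regardless of when the question is asked.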
Key Insights 💡:
→ Even state-of-the-art video LLMs struggle with online video understanding despite strong offline performance
→ Proprietary models like Gemini 1.5 Pro perform better but still show significant gaps compared to humans
→ Current models lack temporal awareness and often hallucinate when processing streaming video
Results 📊:
→ Top performing model (Gemini 1.5 Pro) achieves only 65.25% overall accuracy
→ Significant performance gap between offline (65.25%) and online models (33.15%)
→ Human performance benchmark: 92.81%