"OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?"

A podcast on this paper was generated with Google's Illuminate.

OVO-Bench reveals the massive gap between how humans and AI process streaming videos.

OVO-Bench introduces a novel evaluation framework for testing video LLMs' ability to understand and reason about videos in real-time, similar to how humans process streaming content.

https://arxiv.org/abs/2501.05510

Original Problem 🤔:

→ Current video LLMs excel at analyzing complete videos after the fact, but struggle with real-time understanding, such as answering questions during video playback or making decisions from partial information.

Solution in this Paper 🛠️:

→ OVO-Bench evaluates three critical modes of online video understanding: backward tracing (recalling past events), real-time visual perception (interpreting what is happening now), and forward active responding (delaying an answer until enough information has arrived).

→ The benchmark contains 644 diverse videos across 7 domains, with 2,814 precisely timestamped questions.

→ It tests models' ability to recall past events, understand ongoing actions, and decide when they have seen enough to answer confidently (a minimal sketch of this streaming protocol follows below).

→ The evaluation uses a hybrid approach combining automated generation and human curation for high-quality annotations.
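To make the streaming setup concrete, here is a minimal Python sketch of how timestamped queries could be scored under online constraints: the model only sees frames up to the query timestamp, and for forward-active-responding tasks it may defer its answer until more of the stream has arrived. All names here (StreamingQuery, evaluate_online, dummy_model) are illustrative assumptions, not the official OVO-Bench evaluation code.

```python
# Hypothetical sketch of streaming-style evaluation in the spirit of OVO-Bench.
# The model never sees future frames; "forward" queries may be deferred.

from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class StreamingQuery:
    video_frames: List[str]   # stand-in for decoded frames (1 frame per second)
    timestamp: int            # second at which the question is issued
    task_mode: str            # "backward", "realtime", or "forward"
    question: str
    answer: str               # ground-truth answer

# A model takes the frames observed so far plus a question;
# returning None means "not enough information yet".
ModelFn = Callable[[List[str], str], Optional[str]]

def evaluate_online(queries: List[StreamingQuery], model: ModelFn) -> float:
    """Accuracy under streaming constraints: no access to future frames."""
    correct = 0
    for q in queries:
        t = q.timestamp
        pred: Optional[str] = None
        while t <= len(q.video_frames):
            visible = q.video_frames[:t]          # only frames seen so far
            pred = model(visible, q.question)
            if q.task_mode != "forward" or pred is not None:
                break                             # backward/realtime: answer now
            t += 1                                # forward: stream on, ask again
        correct += int(pred == q.answer)
    return correct / len(queries)

# Toy usage: a dummy model that answers only after a "goal" frame appears.
def dummy_model(frames: List[str], question: str) -> Optional[str]:
    return "yes" if "goal" in frames else None

queries = [StreamingQuery(
    video_frames=["kickoff", "pass", "goal", "celebration"],
    timestamp=2, task_mode="forward",
    question="Does the team score?", answer="yes")]

print(evaluate_online(queries, dummy_model))  # 1.0
```

In this sketch, a forward-mode model that waits for the decisive frame is rewarded, while a backward- or real-time-mode model must commit at the query timestamp, mirroring the distinction the benchmark draws between recalling, perceiving, and actively deferring.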

Key Insights 💡:

→ Even state-of-the-art video LLMs struggle with online video understanding despite strong offline performance

→ Proprietary models like Gemini 1.5 Pro perform better but still show significant gaps compared to humans

→ Current models lack temporal awareness and often hallucinate when processing streaming video

Results 📊:

→ Top performing model (Gemini 1.5 Pro) achieves only 65.25% overall accuracy

→ Significant performance gap between offline (65.25%) and online models (33.15%)

→ Human performance benchmark: 92.81%
