TVBench is a new benchmark that tests true temporal comprehension in video-language models, exposing the limitations of current models.
https://arxiv.org/abs/2410.07752
Original Problem 🔍:
Existing video-language benchmarks fail to evaluate temporal understanding effectively: many of their questions can be solved from static information or from the text alone.
-----
Solution in this Paper 🛠️:
• Designs temporally challenging answer candidates
• Uses fixed text templates so answers cannot be guessed from grammar alone (see the sketch after this list)
• Creates questions answerable solely from video content
• Eliminates reliance on prior world knowledge
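As a rough illustration of this design (a minimal sketch, not the authors' code), the snippet below builds a multiple-choice item from a fixed question template in which every candidate uses the same words and grammar and only the temporal order of events differs, so neither a single frame nor linguistic cues give the answer away. The `AnnotatedClip` type, the template string, and the item format are hypothetical.

```python
# Sketch: template-based QA with temporally hard distractors (hypothetical format).
from dataclasses import dataclass
from itertools import permutations
import random

@dataclass
class AnnotatedClip:
    video_id: str
    actions: list[str]  # actions in the order they actually occur in the video

TEMPLATE = "In which order does the person perform these actions?"

def build_item(clip: AnnotatedClip, n_options: int = 4) -> dict:
    """Build one multiple-choice item whose distractors are other orderings
    of the same actions: same vocabulary and grammar, different temporal order."""
    correct = " then ".join(clip.actions)
    orderings = {" then ".join(p) for p in permutations(clip.actions)}
    distractors = list(orderings - {correct})
    options = random.sample(distractors, min(n_options - 1, len(distractors))) + [correct]
    random.shuffle(options)
    return {
        "video": clip.video_id,
        "question": TEMPLATE,
        "options": options,
        "answer": options.index(correct),  # index of the correct ordering
    }

# Example: build_item(AnnotatedClip("v1", ["open the door", "sit down", "answer the phone"]))
```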
-----
Key Insights from this Paper 💡:
• Most current video-language models lack strong temporal reasoning
• Open-ended QA evaluation with LLMs is unreliable
• Temporal understanding is crucial for accurate video comprehension
• Existing benchmarks have significant spatial and textual biases
-----
Results 📊:
• Text-only and single-image models perform at random chance on TVBench
• Most state-of-the-art video-language models perform close to random
• Only Tarsier (+20.5%) and Gemini 1.5 Pro (+13.2%) clearly outperform the random baseline
• Shuffling or reversing video frames significantly degrades performance on TVBench (sanity check sketched below)
• TVBench effectively differentiates models with strong temporal understanding
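The shuffle/reverse check amounts to a simple evaluation loop like the sketch below: if a model truly relies on temporal information, its accuracy should drop when the frame order is destroyed. `model.predict` and the item fields are hypothetical placeholders standing in for whatever inference interface is used, not an actual TVBench API.

```python
# Sketch: accuracy under frame-order perturbations (hypothetical model/dataset interface).
import random

def evaluate(model, dataset, frame_order: str = "original") -> float:
    """Multiple-choice accuracy with frames kept, reversed, or shuffled."""
    correct = 0
    for item in dataset:  # item: {"frames", "question", "options", "answer"}
        frames = list(item["frames"])
        if frame_order == "reversed":
            frames = frames[::-1]
        elif frame_order == "shuffled":
            random.shuffle(frames)
        pred = model.predict(frames, item["question"], item["options"])
        correct += int(pred == item["answer"])
    return correct / len(dataset)

# A temporally grounded model should show a clear gap:
# acc_orig = evaluate(model, tvbench)
# acc_shuf = evaluate(model, tvbench, frame_order="shuffled")
# acc_orig >> acc_shuf  -> the model actually uses temporal order
```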