TVBench is a new benchmark that tests true temporal comprehension in video-language models, exposing the limitations of current models.
https://arxiv.org/abs/2410.07752
Original Problem 🔍:
Existing video-language benchmarks fail to evaluate temporal understanding effectively: many of their questions can be solved from static information or from the text alone.
-----
Solution in this Paper 🛠️:
• Designs temporally challenging answer candidates
• Uses fixed text templates so answers cannot be guessed from grammar alone (see the sketch after this list)
• Creates questions answerable solely from video content
• Eliminates reliance on prior world knowledge
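As a rough illustration of this design (a minimal sketch, not the authors' code), the snippet below builds a multiple-choice item from a fixed question template in which every candidate uses the same words and grammar and only the temporal order of events differs, so neither a single frame nor linguistic cues give the answer away. The `AnnotatedClip` type, the template string, and the item format are hypothetical.

```python
# Sketch: template-based QA with temporally hard distractors (hypothetical format).
from dataclasses import dataclass
from itertools import permutations
import random

@dataclass
class AnnotatedClip:
    video_id: str
    actions: list[str]  # actions in the order they actually occur in the video

TEMPLATE = "In which order does the person perform these actions?"

def build_item(clip: AnnotatedClip, n_options: int = 4) -> dict:
    """Build one multiple-choice item whose distractors are other orderings
    of the same actions: same vocabulary and grammar, different temporal order."""
    correct = " then ".join(clip.actions)
    orderings = {" then ".join(p) for p in permutations(clip.actions)}
    distractors = list(orderings - {correct})
    options = random.sample(distractors, min(n_options - 1, len(distractors))) + [correct]
    random.shuffle(options)
    return {
        "video": clip.video_id,
        "question": TEMPLATE,
        "options": options,
        "answer": options.index(correct),  # index of the correct ordering
    }

# Example: build_item(AnnotatedClip("v1", ["open the door", "sit down", "answer the phone"]))
```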
-----
Key Insights from this Paper 💡:
• Most current video-language models lack strong temporal reasoning
• Open-ended QA evaluation with LLMs is unreliable
• Temporal understanding is crucial for accurate video comprehension
• Existing benchmarks have significant spatial and textual biases
-----
Results 📊:
• Text-only and single-image models perform at random chance on TVBench
• Most state-of-the-art video-language models perform close to random
• Only Tarsier (+20.5%) and Gemini 1.5 Pro (+13.2%) clearly outperform the random baseline
• Shuffling or reversing video frames significantly degrades performance on TVBench (sanity check sketched below)
• TVBench effectively differentiates models with strong temporal understanding
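The shuffle/reverse check amounts to a simple evaluation loop like the sketch below: if a model truly relies on temporal information, its accuracy should drop when the frame order is destroyed. `model.predict` and the item fields are hypothetical placeholders standing in for whatever inference interface is used, not an actual TVBench API.

```python
# Sketch: accuracy under frame-order perturbations (hypothetical model/dataset interface).
import random

def evaluate(model, dataset, frame_order: str = "original") -> float:
    """Multiple-choice accuracy with frames kept, reversed, or shuffled."""
    correct = 0
    for item in dataset:  # item: {"frames", "question", "options", "answer"}
        frames = list(item["frames"])
        if frame_order == "reversed":
            frames = frames[::-1]
        elif frame_order == "shuffled":
            random.shuffle(frames)
        pred = model.predict(frames, item["question"], item["options"])
        correct += int(pred == item["answer"])
    return correct / len(dataset)

# A temporally grounded model should show a clear gap:
# acc_orig = evaluate(model, tvbench)
# acc_shuf = evaluate(model, tvbench, frame_order="shuffled")
# acc_orig >> acc_shuf  -> the model actually uses temporal order
```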