A video understanding model that catches every detail without making things up.
Tarsier2 advances video understanding by pre-training on 40M video-text pairs and applying fine-grained temporal alignment, producing detailed video descriptions that outperform GPT-4o and Gemini-1.5-Pro.
-----
https://arxiv.org/abs/2501.07888
🎯 Original Problem:
→ Current video understanding models struggle to capture temporal dynamics and spatial relationships accurately, and they often hallucinate content, falling short of human-level video comprehension.
-----
🔧 Solution in this Paper:
→ Tarsier2 uses a three-stage training approach starting with extensive pre-training on 40M diverse video-text pairs.
→ It implements fine-grained temporal alignment during supervised fine-tuning using 150K annotated video descriptions.
→ The model employs Direct Preference Optimization with automated preference data construction to enhance description quality.
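The DPO stage trains the model to prefer faithful descriptions over flawed ones. Below is a minimal sketch of the standard DPO objective, not the paper's exact implementation; tensor names and the beta value are illustrative:

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective: push the policy to assign higher likelihood
    to the preferred (chosen) description than to the rejected one,
    relative to a frozen reference model.

    All inputs are summed log-probabilities of each description under the
    respective model, shape (batch,).
    """
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    # -log sigmoid(beta * margin), averaged over the batch
    return -F.logsigmoid(beta * (chosen_logratios - rejected_logratios)).mean()
```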
-----
💡 Key Insights:
→ Commentary videos from movies/TV shows provide rich contextual information for better video understanding
→ Fine-grained temporal alignment significantly reduces hallucinations
→ Automated preference data construction using corrupted videos creates effective training pairs
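A hedged sketch of how such preference pairs could be built automatically: describe the intact clip to get the preferred response, then describe a corrupted copy (e.g., with frames dropped) to get the rejected one. `describe_video` and the corruption settings here are hypothetical placeholders, not the paper's exact recipe:

```python
import random

def corrupt_frames(frames, drop_ratio=0.3, seed=None):
    """Corrupt a clip by randomly dropping a fraction of its frames,
    so the resulting description is likely to miss or invent details."""
    rng = random.Random(seed)
    kept = [f for f in frames if rng.random() > drop_ratio]
    return kept or frames[:1]  # keep at least one frame

def build_preference_pair(frames, describe_video):
    """Build one DPO training pair from a single video.

    `describe_video(frames) -> str` is a hypothetical callable wrapping the
    current model; the description of the intact clip is treated as
    'chosen', the description of the corrupted clip as 'rejected'.
    """
    chosen = describe_video(frames)
    rejected = describe_video(corrupt_frames(frames))
    return {"chosen": chosen, "rejected": rejected}
```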
-----
📊 Results:
→ Improves F1 score by 2.8% over GPT-4o and 5.8% over Gemini-1.5-Pro on DREAM-1K
→ Shows +8.6% performance advantage over GPT-4o in human evaluations
→ Sets new state-of-the-art results across 15 public benchmarks