"Tarsier2: Advancing Large Vision-Language Models from Detailed Video Description to Comprehensive Video Understanding"

A podcast on this paper was generated with Google's Illuminate.

A video understanding model that catches every detail without making things up.

Tarsier2 advances video understanding by pre-training on 40M video-text pairs and applying fine-grained temporal alignment, producing detailed video descriptions that outperform GPT-4o and Gemini-1.5-Pro.

-----

https://arxiv.org/abs/2501.07888

🎯 Original Problem:

→ Current video understanding models struggle with temporal dynamics and spatial reasoning, often hallucinate content, and fall short of human-level comprehension.

-----

🔧 Solution in this Paper:

→ Tarsier2 uses a three-stage training approach starting with extensive pre-training on 40M diverse video-text pairs.

→ It applies fine-grained temporal alignment during supervised fine-tuning on 150K annotated video descriptions (an illustrative data sample is sketched after this list).

→ The model employs Direct Preference Optimization (DPO) with automated preference data construction to enhance description quality (a minimal loss sketch also follows the list).
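
The post does not show the SFT data format. A plausible illustrative sample for "fine-grained temporal alignment" is a description whose individual events are tied to the frame spans that ground them; the field names below are assumptions, not the paper's schema.

```python
# Illustrative only: a hypothetical SFT sample where each event in the
# description is aligned to the frame span that supports it, so training can
# penalize sentences not grounded in any frames. Field names are assumptions.
sft_sample = {
    "video": "clip_000123.mp4",
    "description": (
        "A man opens the fridge. He takes out a carton of milk "
        "and pours it into a glass."
    ),
    "events": [
        {"text": "A man opens the fridge.", "frames": [0, 24]},
        {"text": "He takes out a carton of milk.", "frames": [25, 58]},
        {"text": "He pours it into a glass.", "frames": [59, 96]},
    ],
}
```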

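For the third stage, the post only names Direct Preference Optimization. Below is a minimal sketch of the standard DPO objective (not the authors' code), assuming per-sequence log-probabilities of the chosen and rejected descriptions under the trainable policy and a frozen reference model.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor,
             policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor,
             ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO loss over a batch of (chosen, rejected) description pairs.

    Each argument is the summed log-probability of a full video description
    under either the trainable policy or the frozen reference model.
    """
    # How much more likely the policy makes each description vs. the reference
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # Widen the margin between the faithful and the hallucinated description
    margin = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(margin).mean()
```
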
-----

💡 Key Insights:

→ Commentary videos from movies/TV shows provide rich contextual information for better video understanding

→ Fine-grained temporal alignment significantly reduces hallucinations

→ Automated preference data construction using corrupted videos creates effective training pairs (see the sketch after this list)

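How the corrupted-video pairs are produced is not detailed in the post; the sketch below is a hedged illustration where `corrupt_frames` and `model.describe` are hypothetical helpers. The model describes both the original clip and a corrupted copy; the original's description is treated as "chosen" and the corrupted copy's as "rejected", feeding the DPO stage sketched above.

```python
import random

def corrupt_frames(frames, drop_ratio=0.3):
    """Hypothetical corruption: randomly drop a fraction of frames so a
    description generated from the corrupted clip tends to miss or invent
    events relative to the original video."""
    kept = [f for f in frames if random.random() > drop_ratio]
    return kept if kept else frames[:1]  # never return an empty clip

def build_preference_pair(model, frames, prompt="Describe the video in detail."):
    """Build one (chosen, rejected) pair for DPO without human labeling.

    `model.describe` is a hypothetical captioning call; the description of the
    intact clip serves as the preferred answer, and the description of the
    corrupted clip as the dispreferred one.
    """
    chosen = model.describe(frames, prompt)
    rejected = model.describe(corrupt_frames(frames), prompt)
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```
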
-----

📊 Results:

→ Improves F1 score by 2.8% over GPT-4o and 5.8% over Gemini-1.5-Pro on DREAM-1K

→ Shows +8.6% performance advantage over GPT-4o in human evaluations

→ Sets new state-of-the-art results across 15 public benchmarks
