
"LinVT: Empower Your Image-level Large Language Model to Understand Videos"

The podcast on this paper was generated with Google's Illuminate.

LinVT turns your image LLM into a video expert without starting from scratch.

LinVT enables existing image-based LLMs to understand videos efficiently by transforming visual tokens through linear operations while preserving original image comprehension capabilities.

-----

https://arxiv.org/abs/2412.05185

🤔 Original Problem:

→ Current video-capable LLMs require extensive training from scratch, consuming massive computational resources and time

→ Converting image LLMs to handle videos while preserving their original capabilities remains challenging

-----

🔧 Solution in this Paper:

→ LinVT is a plug-and-play module that turns an existing image LLM into a video-capable model.

→ Its Spatio-Temporal Visual Token Refiner scores visual tokens for significance and keeps the most informative ones across frames (sketched after this list).

→ Text-conditioned Token Aggregation then pools the retained video tokens with learnable, scale-specific queries guided by the text prompt.

→ Both stages output weighted averages of the visual tokens, i.e., a linear transformation, so the result stays in the visual embedding space the image LLM was already aligned to.
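
The post doesn't include code, but a minimal PyTorch sketch of the two stages might look like the following. All names (SignificanceScorer, TextConditionedAggregator), the linear scoring head, and the mean-pooled text conditioning are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class SignificanceScorer(nn.Module):
    """Hypothetical stand-in for the Spatio-Temporal Visual Token Refiner's
    selection step: score every visual token, keep the top-k."""
    def __init__(self, dim: int, keep_ratio: float = 0.25):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # assumed linear scoring head
        self.keep_ratio = keep_ratio

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_frames * tokens_per_frame, dim)
        scores = self.score(tokens).squeeze(-1)                # (B, N)
        k = max(1, int(tokens.shape[1] * self.keep_ratio))
        idx = scores.topk(k, dim=1).indices                    # (B, k)
        idx = idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1])
        return tokens.gather(1, idx)                           # (B, k, dim)

class TextConditionedAggregator(nn.Module):
    """Hypothetical text-conditioned aggregation: learnable queries, shifted
    by a pooled text embedding, attend over the retained video tokens. The
    output is a weighted average of visual tokens, so it stays in the visual
    embedding space the image LLM was aligned to."""
    def __init__(self, dim: int, num_queries: int = 32):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.text_proj = nn.Linear(dim, dim)

    def forward(self, vis: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # vis: (B, k, dim); text: (B, T, dim)
        ctx = self.text_proj(text.mean(dim=1, keepdim=True))   # (B, 1, dim)
        q = self.queries.unsqueeze(0) + ctx                    # (B, Q, dim)
        attn = torch.softmax(
            q @ vis.transpose(1, 2) / vis.shape[-1] ** 0.5, dim=-1
        )                                                      # (B, Q, k)
        return attn @ vis                                      # (B, Q, dim)

# Toy usage: 8 frames x 16 patch tokens -> 32 refined tokens -> 32 outputs.
vis = torch.randn(2, 8 * 16, 64)
txt = torch.randn(2, 5, 64)
refined = SignificanceScorer(64)(vis)                          # (2, 32, 64)
video_tokens = TextConditionedAggregator(64)(refined, txt)     # (2, 32, 64)
```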

-----

💡 Key Insights:

→ Linear transformation preserves original image understanding capabilities

→ Multi-scale processing effectively handles videos of varying lengths (a pooling sketch follows this list)

→ Text conditioning improves video token selection relevance
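
As one concrete reading of the multi-scale insight, a plausible scheme (an assumption for illustration, not the paper's exact design) pools frames at several temporal window sizes so short and long videos both yield a bounded token budget:

```python
import torch

def multi_scale_tokens(frames: torch.Tensor, scales=(1, 4, 16)) -> torch.Tensor:
    """Hypothetical multi-scale temporal pooling.
    frames: (B, F, P, D) = batch, frames, patch tokens per frame, dim.
    Returns token sets pooled at each temporal scale, concatenated."""
    B, F, P, D = frames.shape
    outs = []
    for s in scales:
        Fpad = (F + s - 1) // s * s                 # round F up to a multiple of s
        pad = frames.new_zeros(B, Fpad - F, P, D)   # zero-pad the tail
        x = torch.cat([frames, pad], dim=1)
        x = x.view(B, Fpad // s, s, P, D).mean(dim=2)  # average each window
        outs.append(x.flatten(1, 2))                # (B, (Fpad // s) * P, D)
    return torch.cat(outs, dim=1)

# A 16-frame clip with 16 tokens per frame at dim 64:
toks = multi_scale_tokens(torch.randn(2, 16, 16, 64))  # (2, 336, 64)
```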

-----

📊 Results:

→ LinVT achieves state-of-the-art performance across video benchmarks

→ Compatible with 6 different LLMs: Aquila, Blip-3, InternVL2, Mipha, Molmo, Qwen2-VL

→ Requires only video data for training; no additional image data is needed
