LinVT turns your image LLM into a video expert without starting from scratch.
LinVT lets existing image-based LLMs understand videos efficiently: it condenses video tokens through purely linear operations, so the model's original image comprehension is preserved.
-----
https://arxiv.org/abs/2412.05185
🤔 Original Problem:
→ Current video-capable LLMs require extensive training from scratch, consuming massive computational resources and time
→ Converting image LLMs to handle videos while preserving their original capabilities remains challenging
-----
🔧 Solution in this Paper:
→ LinVT is a plug-and-play module that turns an image LLM into a video-capable model.
→ Its Spatio-Temporal Visual Token Refiner scores visual tokens for significance and keeps the most informative ones.
→ Text-conditioned Token Aggregation then pools the refined tokens using scale-specific queries derived from the question.
→ Because every output token is a linear combination of the input visual tokens, the LLM's original visual-language alignment is preserved (see the sketch below).
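Here is a minimal PyTorch sketch of those two stages. All class names, shapes, and hyperparameters below are illustrative assumptions, not the paper's actual code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatioTemporalTokenRefiner(nn.Module):
    """Scores video tokens and keeps the top-k most significant ones (assumed design)."""
    def __init__(self, dim: int, keep_ratio: float = 0.25):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # linear significance-scoring head
        self.keep_ratio = keep_ratio

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, D), where N = frames x patches per frame
        scores = self.score(tokens).squeeze(-1)              # (B, N)
        k = max(1, int(tokens.shape[1] * self.keep_ratio))
        idx = scores.topk(k, dim=1).indices                  # indices of kept tokens
        idx = idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1])
        return torch.gather(tokens, 1, idx)                  # (B, k, D)

class TextConditionedAggregator(nn.Module):
    """Pools refined video tokens into a fixed-size set, conditioned on the text.

    Each output is a convex combination of the input visual tokens, so the
    result stays in the embedding space the image LLM was aligned to.
    """
    def __init__(self, dim: int, num_queries: int = 64):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.to_key = nn.Linear(dim, dim, bias=False)

    def forward(self, video_tokens: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # video_tokens: (B, N, D); text_emb: (B, T, D) question embeddings
        q = self.queries.unsqueeze(0) + text_emb.mean(dim=1, keepdim=True)    # (B, Q, D)
        k = self.to_key(video_tokens)                                         # (B, N, D)
        attn = F.softmax(q @ k.transpose(1, 2) / k.shape[-1] ** 0.5, dim=-1)  # (B, Q, N)
        return attn @ video_tokens  # (B, Q, D): weighted averages of inputs
```

Note the design point: the aggregator's outputs are softmax-weighted averages of the raw visual tokens, which is the linearity the paper relies on to keep the frozen LLM's alignment intact.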
-----
💡 Key Insights:
→ Linear transformation preserves original image understanding capabilities
→ Multi-scale processing effectively handles videos of varying lengths (see the pooling sketch after this list)
→ Text conditioning improves video token selection relevance
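A toy illustration of the multi-scale idea; the pooling scheme, scales, and names are assumptions for exposition, not the paper's exact design:

```python
import torch
import torch.nn.functional as F

def multi_scale_pool(frame_feats: torch.Tensor, scales=(1, 2, 4)) -> torch.Tensor:
    """frame_feats: (B, T, D) per-frame features.

    Averages frames over windows at each temporal scale and concatenates the
    results, so short and long videos both yield a bounded, multi-resolution
    token set.
    """
    pooled = []
    for s in scales:
        s = min(s, frame_feats.shape[1])  # guard against clips shorter than the window
        p = F.avg_pool1d(frame_feats.transpose(1, 2), kernel_size=s, stride=s)
        pooled.append(p.transpose(1, 2))  # back to (B, T//s, D)
    return torch.cat(pooled, dim=1)
```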
-----
📊 Results:
→ LinVT achieves state-of-the-art performance across video benchmarks
→ Compatible with 6 different LLMs: Aquila, Blip-3, InternVL2, Mipha, Molmo, Qwen2-VL
→ Requires only video data for training, no additional image data needed