Real-time video understanding achieved through continuous dialogue.
VideoLLMs now know exactly when to speak during a video stream: no more waiting until the end.
The paper's key contribution is a video-text duet interaction format in which the model can generate responses during video playback rather than waiting for the entire video to finish. This enables real-time comprehension and stronger performance on time-sensitive tasks like temporal grounding and highlight detection.
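To make the duet format concrete, here is a minimal Python sketch of how streamed frames and text turns could interleave in one context. The `Turn` type, the placeholder frame tokens, and the `build_duet_context` helper are illustrative assumptions, not the paper's actual API.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    role: str      # "video", "user", or "assistant"
    content: str   # frame placeholder token or text

def build_duet_context(frames, messages):
    """Interleave streamed frames with text turns inserted at given frame indices.

    `messages` maps a frame index to a (role, text) turn inserted right
    after that frame is "played".
    """
    context = []
    for i, frame in enumerate(frames):
        context.append(Turn("video", f"<frame_{i}>"))   # stand-in for visual tokens
        if i in messages:
            role, text = messages[i]
            context.append(Turn(role, text))            # user query or model reply
    return context

# Example: user asks at frame 3; the model answers mid-stream at frame 10.
ctx = build_duet_context(
    frames=range(12),
    messages={3: ("user", "When does the chef add the salt?"),
              10: ("assistant", "The salt is being added right now.")},
)
```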
-----
https://arxiv.org/abs/2411.17991
🤔 Original Problem:
→ Current VideoLLMs require complete video input before generating responses, making them unsuitable for live streaming scenarios.
→ They struggle with time-sensitive tasks and long video comprehension due to limitations in handling temporal information.
-----
🛠️ Solution in this Paper:
→ The paper introduces MMDuet, which implements the video-text duet format: video playback runs continuously as a stream of frames.
→ Both the user and the model can insert text messages at any point during playback.
→ MMDuet adds two specialized heads on top of the LLM: an informative head that detects when a frame brings significant new information, and a relevance head that scores how relevant the current content is to the user's query (sketched below).
→ The model learns when to respond from MMDuetIT, a training dataset constructed from dense captioning and temporal grounding tasks.
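A hedged PyTorch sketch of the dual-head timing mechanism follows. The class and function names, the hidden size, the sigmoid outputs, and the thresholded sum are assumptions for illustration; the paper's exact head design and decision rule may differ.

```python
import torch
import torch.nn as nn

class ResponseTimingHeads(nn.Module):
    def __init__(self, hidden_size: int = 4096):
        super().__init__()
        # Each head maps the latest frame's hidden state to a probability.
        self.informative_head = nn.Linear(hidden_size, 1)  # "new info in this frame?"
        self.relevance_head = nn.Linear(hidden_size, 1)    # "relevant to the query?"

    def forward(self, frame_hidden: torch.Tensor):
        info = torch.sigmoid(self.informative_head(frame_hidden))
        rel = torch.sigmoid(self.relevance_head(frame_hidden))
        return info.squeeze(-1), rel.squeeze(-1)

def should_respond(info: torch.Tensor, rel: torch.Tensor, threshold: float = 1.0) -> bool:
    # Illustrative decision rule: speak when the combined score is high enough.
    return (info + rel).item() > threshold

# Usage: after encoding each streamed frame, check whether to insert a reply.
heads = ResponseTimingHeads(hidden_size=4096)
h = torch.randn(4096)  # stand-in for the LLM hidden state at the latest frame
info, rel = heads(h)
if should_respond(info, rel):
    print("generate a response at this timestamp")
```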
-----
💡 Key Insights:
→ Real-time response generation is possible without waiting for complete video input
→ Temporal grounding improves when responses are inserted at relevant positions
→ The dual-head architecture enables finer control over response timing (see the streaming loop sketched below)
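Putting these insights together, a minimal sketch of streaming inference: the model decides at every frame whether to speak, rather than once at the end. `encode_frame`, `generate_reply`, and the thresholded score are hypothetical stand-ins for the underlying VideoLLM calls, following the assumed rule from the sketch above.

```python
def stream_responses(video_frames, query, heads, encode_frame, generate_reply,
                     threshold: float = 1.0):
    """Yield (frame_index, reply) pairs as the video plays."""
    for t, frame in enumerate(video_frames):
        h = encode_frame(frame, query)        # fold the new frame into the LLM state
        info, rel = heads(h)                  # informative / relevance scores
        if (info + rel).item() > threshold:   # decide *now*, not at video end
            yield t, generate_reply()         # e.g. a caption or a grounded answer
```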
-----
📊 Results:
→ 76% CIDEr score on YouCook2 dense video captioning
→ 90% mAP on QVHighlights highlight detection
→ 25% R@0.5 on Charades-STA temporal grounding