TPO (Temporal Preference Optimization) enhances video LLMs' ability to connect responses to specific times in videos through preference learning.
The paper improves temporal grounding in video Large Language Models (LLMs) via preference learning. Existing models struggle to accurately tie responses to specific moments in videos, especially long ones. TPO uses pairs of preferred and dis-preferred responses to teach the model better temporal understanding.
-----
Paper - https://arxiv.org/abs/2501.13919
Original Problem 😟:
→ Current video LLMs struggle with temporal grounding in long videos.
→ Existing training methods rely on costly, manually annotated datasets and lack explicit optimization for temporal understanding.
-----
Solution in this Paper 🤔:
→ This paper proposes Temporal Preference Optimization (TPO). TPO is a self-training framework using preference learning to improve temporal grounding.
→ TPO uses two granularities of preference: localized temporal grounding (focusing on specific relevant segments) and comprehensive temporal grounding (capturing long-range dependencies across the full video).
→ A preference dataset is generated by manipulating video inputs. Preferred responses are created from the relevant video segment (localized) or full video (comprehensive).
→ Dis-preferred responses use unrelated segments or a downsampled video.
→ Direct Preference Optimization (DPO) then trains the model to favor the temporally grounded responses (see the sketch below).
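To make the recipe concrete, here is a minimal sketch (not the authors' released code) of the two moving parts: building the preferred vs. dis-preferred video inputs at the localized and comprehensive granularities, and the standard DPO loss that contrasts the resulting response pairs. The function names, the `relevant_span` argument, and the downsampling factor are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective: widen the policy's margin for the temporally
    grounded (preferred) response over the dis-preferred one, measured
    relative to a frozen reference model."""
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()


def build_preference_inputs(frames, relevant_span, mode="localized", downsample=8):
    """Illustrative construction of the manipulated video inputs used to
    generate preferred vs. dis-preferred responses. `frames` is a list of
    decoded frames; `relevant_span` = (start, end) frame indices of the
    query-relevant segment (both are assumptions for this sketch)."""
    start, end = relevant_span
    if mode == "localized":
        preferred_input = frames[start:end]                  # relevant segment
        dispreferred_input = frames[:start] + frames[end:]   # unrelated segments
    else:  # "comprehensive"
        preferred_input = frames                             # full video
        dispreferred_input = frames[::downsample]            # sparsely sampled video
    return preferred_input, dispreferred_input


# Toy usage with dummy sequence log-probabilities (a batch of 2 preference pairs).
policy_chosen = torch.tensor([-12.3, -9.8])
policy_rejected = torch.tensor([-11.1, -10.4])
ref_chosen = torch.tensor([-12.0, -10.0])
ref_rejected = torch.tensor([-10.9, -10.1])
print(dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected))
```

In the paper's pipeline, the model itself generates responses from these manipulated inputs, so the preference data scales without manual annotation.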
-----
Key Insights from this Paper 💡:
→ Leveraging preference learning can significantly enhance temporal understanding in video LLMs.
→ Explicitly modeling temporal preferences at different granularities leads to improved performance in long-form video understanding.
→ Self-training with manipulated video inputs allows for scalable data creation without extensive manual annotation.
-----
Results 😎:
→ LongVA-TPO achieves performance boosts of 2.9%, 3.1%, and 2.5% on LongVideoBench, MLVU, and Video-MME, respectively, compared to the base LongVA model.
→ LLaVA-Video-TPO becomes the leading 7B model on Video-MME.
→ LongVA-TPO shows consistent performance gains with increasing dataset size, from 2k to 10k training samples.