OMCAT: Omni Context Aware Transformer
OMCAT, proposed in this paper, teaches a multimodal model to understand when things happen in videos by linking sounds and visuals together.
Original Problem 🔍:
LLMs struggle with fine-grained, cross-modal temporal understanding in audio-visual tasks.
Solution in this Paper 🛠️:
• OCTAV dataset: Captures event transitions across audio and video
• OMCAT model: Unified audio-visual language model with RoTE (Rotary Time Embeddings)
• Three-stage training pipeline: Feature alignment, instruction tuning, OCTAV-specific training
• Time alignment modules: ITT (Interleaving Time Tokens) and RoTE for temporal grounding (see the RoTE sketch below)
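To make the RoTE idea concrete, here is a minimal sketch assuming a RoPE-style rotation whose angle is driven by each token's absolute timestamp (in seconds) rather than its sequence position. The function name, tensor shapes, and frequency base are illustrative assumptions, not the paper's exact implementation.

```python
import torch

def rotary_time_embedding(x, timestamps, base=10000.0):
    """Rotate feature pairs of x by angles proportional to their timestamps.

    x:          (batch, num_tokens, dim) audio or video token features, dim even
    timestamps: (batch, num_tokens) absolute time of each token in seconds
    """
    half = x.shape[-1] // 2
    # Per-channel inverse frequencies, as in standard RoPE
    inv_freq = 1.0 / (base ** (torch.arange(half, dtype=x.dtype) / half))
    # Angle for each (token, channel) pair: timestamp * frequency
    angles = timestamps.unsqueeze(-1) * inv_freq      # (batch, num_tokens, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    # Standard 2D rotation applied to each feature pair
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Usage: five tokens from a 10-second clip, stamped at their center times
feats = torch.randn(1, 5, 64)
times = torch.tensor([[1.0, 3.0, 5.0, 7.0, 9.0]])
print(rotary_time_embedding(feats, times).shape)  # torch.Size([1, 5, 64])
```

Because the rotation depends on wall-clock time rather than token index, audio and video tokens that occur at the same moment receive the same rotation, which is what allows the model to ground them to each other temporally.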
Key Insights from this Paper 💡:
• Cross-modal temporal understanding is crucial for advanced AI systems
• Synthetic data generation can address limitations in existing datasets
• RoTE provides better performance and efficiency than existing temporal conditioning methods
• Multi-stage training enhances model capabilities across various multimodal tasks
Results 📊:
• OMCAT outperforms SOTA on audio-visual question answering: 90.2% accuracy on the AVQA dataset
• Surpasses GroundingGPT on Charades-STA: 32.3% R@1 (IoU=0.5), 15.9% R@1 (IoU=0.7)
• Excels on OCTAV-ST: 16.9% accuracy on YouCook2, 19.0% on ActivityNet
• Demonstrates superior performance on OCTAV-MT and UnAV-100-MT datasets
The architecture and key components of the OMCAT model
The OMCAT model:
• Uses separate visual and audio encoders to extract features from the video and audio inputs.
• Employs audio-visual adaptor layers to map the extracted features into the text embedding space of the language model.
• Incorporates time alignment between audio and video using either Interleaving Time Tokens (ITT) or Rotary Time Embeddings (RoTE).
• Uses a large language model (a fine-tuned Vicuna 7B) as the core text-generation component (see the structural sketch below).
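For a structural view, here is a minimal sketch of how these pieces could fit together, assuming frozen encoders (stubbed here with random tensors), simple linear adaptors, and an optional time-alignment callable. Module names, dimensions, and the token-ordering strategy are assumptions, not the authors' exact code.

```python
import torch
import torch.nn as nn

class OmcatSketch(nn.Module):
    def __init__(self, vis_dim=1024, aud_dim=768, llm_dim=4096):
        super().__init__()
        # Adaptor layers mapping encoder features into the LLM text-embedding space
        self.visual_adaptor = nn.Linear(vis_dim, llm_dim)
        self.audio_adaptor = nn.Linear(aud_dim, llm_dim)

    def forward(self, vis_feats, aud_feats, text_embeds, time_align=None):
        v = self.visual_adaptor(vis_feats)   # (B, Nv, llm_dim)
        a = self.audio_adaptor(aud_feats)    # (B, Na, llm_dim)
        if time_align is not None:
            # Time alignment step: ITT would insert explicit time tokens, while
            # RoTE rotates features by their timestamps (as in the sketch above).
            v, a = time_align(v), time_align(a)
        # Multimodal tokens are combined with the text prompt embeddings and
        # passed to the (fine-tuned Vicuna 7B) LLM for answer generation.
        return torch.cat([v, a, text_embeds], dim=1)

# Usage with stub features for a single clip
model = OmcatSketch()
vis = torch.randn(1, 8, 1024)    # 8 video tokens from the visual encoder
aud = torch.randn(1, 8, 768)     # 8 audio tokens from the audio encoder
txt = torch.randn(1, 16, 4096)   # embedded text prompt
print(model(vis, aud, txt).shape)  # torch.Size([1, 32, 4096])
```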