OMCAT: Omni Context Aware Transformer
OMCAT, proposed in this paper, teaches a multimodal model to understand when things happen in videos by linking sounds and visuals together.
Original Problem 🔍:
LLMs struggle with fine-grained, cross-modal temporal understanding in audio-visual tasks.
Solution in this Paper 🛠️:
• OCTAV dataset: Captures event transitions across audio and video
• OMCAT model: Unified audio-visual language model with RoTE (Rotary Time Embeddings)
• Three-stage training pipeline: Feature alignment, instruction tuning, OCTAV-specific training
• Time alignment modules: ITT (Interleaving Time Tokens) and RoTE for temporal grounding (see the RoTE sketch below)
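To make the RoTE idea concrete, here is a minimal sketch assuming a RoPE-style rotation whose angle is driven by each token's absolute timestamp (in seconds) rather than its sequence position. The function name, tensor shapes, and frequency base are illustrative assumptions, not the paper's exact implementation.

```python
import torch

def rotary_time_embedding(x, timestamps, base=10000.0):
    """Rotate feature pairs of x by angles proportional to their timestamps.

    x:          (batch, num_tokens, dim) audio or video token features, dim even
    timestamps: (batch, num_tokens) absolute time of each token in seconds
    """
    half = x.shape[-1] // 2
    # Per-channel inverse frequencies, as in standard RoPE
    inv_freq = 1.0 / (base ** (torch.arange(half, dtype=x.dtype) / half))
    # Angle for each (token, channel) pair: timestamp * frequency
    angles = timestamps.unsqueeze(-1) * inv_freq      # (batch, num_tokens, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    # Standard 2D rotation applied to each feature pair
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Usage: five tokens from a 10-second clip, stamped at their center times
feats = torch.randn(1, 5, 64)
times = torch.tensor([[1.0, 3.0, 5.0, 7.0, 9.0]])
print(rotary_time_embedding(feats, times).shape)  # torch.Size([1, 5, 64])
```

Because the rotation depends on wall-clock time rather than token index, audio and video tokens that occur at the same moment receive the same rotation, which is what allows the model to ground them to each other temporally.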
Key Insights from this Paper 💡:
• Cross-modal temporal understanding is crucial for advanced AI systems
• Synthetic data generation can address limitations in existing datasets
• RoTE provides better performance and efficiency than existing temporal conditioning methods
• Multi-stage training enhances model capabilities across various multimodal tasks
Results 📊:
• OMCAT outperforms SOTA on audio-visual question answering: 90.2% accuracy on the AVQA dataset
• Surpasses GroundingGPT on Charades-STA: 32.3% R@1 (IoU=0.5), 15.9% R@1 (IoU=0.7)
• Excels on OCTAV-ST: 16.9% accuracy on YouCook2, 19.0% on ActivityNet
• Demonstrates superior performance on OCTAV-MT and UnAV-100-MT datasets
The architecture and key components of the OMCAT model
The OMCAT model:
• Uses separate visual and audio encoders to extract features from the video and audio inputs.
• Employs audio-visual adaptor layers to map the extracted features into the text embedding space of the language model.
• Incorporates time alignment between audio and video using either Interleaving Time Tokens (ITT) or Rotary Time Embeddings (RoTE).
• Uses a large language model (a fine-tuned Vicuna 7B) as the core text-generation component (see the structural sketch below).
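For a structural view, here is a minimal sketch of how these pieces could fit together, assuming frozen encoders (stubbed here with random tensors), simple linear adaptors, and an optional time-alignment callable. Module names, dimensions, and the token-ordering strategy are assumptions, not the authors' exact code.

```python
import torch
import torch.nn as nn

class OmcatSketch(nn.Module):
    def __init__(self, vis_dim=1024, aud_dim=768, llm_dim=4096):
        super().__init__()
        # Adaptor layers mapping encoder features into the LLM text-embedding space
        self.visual_adaptor = nn.Linear(vis_dim, llm_dim)
        self.audio_adaptor = nn.Linear(aud_dim, llm_dim)

    def forward(self, vis_feats, aud_feats, text_embeds, time_align=None):
        v = self.visual_adaptor(vis_feats)   # (B, Nv, llm_dim)
        a = self.audio_adaptor(aud_feats)    # (B, Na, llm_dim)
        if time_align is not None:
            # Time alignment step: ITT would insert explicit time tokens, while
            # RoTE rotates features by their timestamps (as in the sketch above).
            v, a = time_align(v), time_align(a)
        # Multimodal tokens are combined with the text prompt embeddings and
        # passed to the (fine-tuned Vicuna 7B) LLM for answer generation.
        return torch.cat([v, a, text_embeds], dim=1)

# Usage with stub features for a single clip
model = OmcatSketch()
vis = torch.randn(1, 8, 1024)    # 8 video tokens from the visual encoder
aud = torch.randn(1, 8, 768)     # 8 audio tokens from the audio encoder
txt = torch.randn(1, 16, 4096)   # embedded text prompt
print(model(vis, aud, txt).shape)  # torch.Size([1, 32, 4096])
```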