"Ola: Pushing the Frontiers of Omni-Modal Language Model with Progressive Modality Alignment"
The podcast below on this paper was generated with Google's Illuminate.
https://arxiv.org/abs/2502.04328
The main challenge in building omni-modal models is their performance gap relative to specialized single-modality models. Existing omni-modal models also lack balanced performance across modalities and efficient training strategies.
This paper introduces Ola, an omni-modal language model that uses progressive modality alignment to address these issues. Ola achieves competitive performance across image, video, and audio understanding tasks.
-----
📌 Progressive modality alignment lets Ola manage multi-modal interference effectively. Training starts with text-image pairs, then progressively integrates audio and video. This staged approach avoids catastrophic forgetting and improves overall performance (a minimal training-schedule sketch follows these bullets).
📌 Ola's architecture pairs modality-specific encoders with a shared LLM decoder, enabling efficient feature extraction from diverse inputs. A Local-Global Attention Pooling layer further compresses visual tokens for efficiency.
📌 Generating cross-modal video-audio data is a key innovation. This data bridges the gap between vision and audio. It enhances the model's ability to understand real-world scenarios where modalities are naturally correlated.
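To make the staged approach concrete, here is a minimal, hedged Python sketch of the training schedule. The stage names, dataset labels, and the train_stage() helper are illustrative placeholders, not the authors' code; only the ordering (image-text first, then speech, then video) comes from the paper.

```python
# Illustrative sketch of Ola's progressive modality alignment schedule.
# Stage order follows the paper (image/text -> +speech -> +video);
# dataset names, stage names, and train_stage() are placeholders.

STAGES = [
    {
        "name": "stage1_image_text",
        "data": ["image_caption_pairs", "visual_instruction_data"],
        "goal": "establish core vision-language knowledge on the base LLM",
    },
    {
        "name": "stage2_add_speech",
        "data": ["asr_transcripts", "audio_qa", "image_caption_pairs"],
        "goal": "bridge language and audio via speech, keep image data in the mix",
    },
    {
        "name": "stage3_add_video",
        "data": ["video_qa", "cross_modal_video_audio", "image_caption_pairs", "audio_qa"],
        "goal": "connect vision, audio, and language with video as the bridge",
    },
]


def train_stage(model, stage):
    """Placeholder for one alignment stage: mix the listed datasets and fine-tune."""
    print(f"[{stage['name']}] training on {stage['data']} -> {stage['goal']}")
    # a real training loop (data loading, loss, optimizer steps) would go here
    return model


def progressive_alignment(model):
    # Each stage starts from the previous stage's weights, so earlier
    # vision-language knowledge is preserved while new modalities are added.
    for stage in STAGES:
        model = train_stage(model, stage)
    return model


if __name__ == "__main__":
    progressive_alignment(model="omni_modal_lm_stub")
```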
----------
Methods Explored in this Paper 🔧:
→ Ola employs a progressive modality alignment strategy.
→ It begins training with image and text data. This establishes core vision-language knowledge.
→ Then, speech data is added. Speech bridges language and audio understanding.
→ Finally, video data is incorporated. Video connects all modalities: vision, audio, and language.
→ Ola's architecture includes modality-specific encoders that handle text, image, video, and audio inputs (a minimal forward-path sketch follows this list).
→ A Local-Global Attention Pooling layer fuses local and global visual features, efficiently reducing the visual token length.
→ Dual audio encoders are used: Whisper for speech and BEATs for music, allowing richer audio understanding.
→ Sentence-wise streaming decoding is implemented for real-time text and speech generation (sketched after the architecture example below).
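As referenced in the list above, here is a minimal PyTorch sketch of the forward path: modality-specific encoders feed a shared decoder, dual audio features are concatenated and projected, and visual tokens are compressed by local-global attention pooling. The pooling details (2x2 windows scored against a global mean feature) and all module names are assumptions for illustration, not the released implementation.

```python
# Hedged sketch of an Ola-like forward path. Encoder classes are stand-ins
# for the vision encoder / Whisper / BEATs, not real APIs.

import torch
import torch.nn as nn


class LocalGlobalAttentionPooling(nn.Module):
    """Compress an H x W grid of visual tokens by 2x2, weighting each token
    in a window with a score computed from the token plus a global summary."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)  # sees [local token, global mean]

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C) grid of visual tokens
        B, H, W, C = x.shape
        g = x.mean(dim=(1, 2), keepdim=True).expand(B, H, W, C)  # global context
        w = self.score(torch.cat([x, g], dim=-1))                # (B, H, W, 1)
        # group tokens into non-overlapping 2x2 windows
        x = x.view(B, H // 2, 2, W // 2, 2, C).permute(0, 1, 3, 2, 4, 5).reshape(B, -1, 4, C)
        w = w.view(B, H // 2, 2, W // 2, 2, 1).permute(0, 1, 3, 2, 4, 5).reshape(B, -1, 4, 1)
        w = w.softmax(dim=2)
        return (w * x).sum(dim=2)  # (B, H*W/4, C): 4x fewer visual tokens


class OlaLikeModel(nn.Module):
    def __init__(self, dim: int = 1024):
        super().__init__()
        # Stand-ins for the modality-specific encoders named in the paper.
        self.vision_encoder = nn.Identity()    # e.g. a ViT producing a token grid
        self.speech_encoder = nn.Identity()    # e.g. Whisper encoder features
        self.music_encoder = nn.Identity()     # e.g. BEATs features
        self.visual_pool = LocalGlobalAttentionPooling(dim)
        self.audio_proj = nn.Linear(dim, dim)  # project concatenated audio features
        self.decoder = nn.Identity()           # shared LLM decoder (stub)

    def forward(self, image_grid, speech_feats, music_feats, text_embeds):
        vis = self.visual_pool(self.vision_encoder(image_grid))          # (B, N_v, C)
        aud = self.audio_proj(torch.cat([self.speech_encoder(speech_feats),
                                         self.music_encoder(music_feats)], dim=1))
        tokens = torch.cat([vis, aud, text_embeds], dim=1)               # joint sequence
        return self.decoder(tokens)


if __name__ == "__main__":
    m = OlaLikeModel()
    out = m(torch.randn(1, 16, 16, 1024), torch.randn(1, 50, 1024),
            torch.randn(1, 50, 1024), torch.randn(1, 8, 1024))
    print(out.shape)  # torch.Size([1, 172, 1024]): 64 visual + 100 audio + 8 text tokens
```

The 2x2 weighted pooling cuts the visual token count by 4x before the tokens reach the shared decoder, which is the efficiency gain the pooling layer is meant to provide.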
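Sentence-wise streaming decoding can be pictured as a small loop that flushes each completed sentence to a speech synthesizer while text generation continues, so audio playback starts before the full response is written. The generate_tokens() and synthesize_speech() stubs below are hypothetical placeholders, not Ola's actual interfaces.

```python
# Hedged sketch of sentence-wise streaming decoding: text tokens stream out
# as generated, and each completed sentence is spoken immediately.

SENTENCE_END = {".", "!", "?"}


def generate_tokens(prompt):
    """Stub token generator standing in for the LLM decoder."""
    for tok in "Ola hears the clip . It describes a dog barking .".split():
        yield tok


def synthesize_speech(sentence):
    """Stub TTS call; in the paper this feeds the speech decoder per sentence."""
    print(f"\n[speech out] {sentence}")


def stream_response(prompt):
    buffer = []
    for tok in generate_tokens(prompt):
        buffer.append(tok)
        print(tok, end=" ", flush=True)          # stream text token-by-token
        if tok in SENTENCE_END:                  # sentence boundary reached
            synthesize_speech(" ".join(buffer))  # speak it right away
            buffer = []
    if buffer:                                   # flush any trailing fragment
        synthesize_speech(" ".join(buffer))


if __name__ == "__main__":
    stream_response("What is happening in this video?")
```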
-----
Key Insights 💡:
→ Progressive modality alignment improves omni-modal learning. It breaks down complex training into manageable steps.
→ This strategy keeps the required cross-modal alignment data small by leveraging advances in existing vision-language models.
→ Video data acts as a crucial bridge between audio and vision. It provides comprehensive multi-modal information.
→ Cross-modal video-audio data generation enhances performance. This data captures relationships between video and audio (a hedged data-construction sketch follows this list).
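As referenced above, the cross-modal video-audio data idea can be sketched as pairing a clip's visual caption with its audio transcript and asking a text LLM for questions that need both signals. The prompt wording, the call_llm() helper, and the field names below are assumptions for illustration only, not the paper's pipeline code.

```python
# Hedged sketch of cross-modal video-audio training data construction.

def build_prompt(video_caption: str, audio_transcript: str) -> str:
    return (
        "Visual description of the clip:\n"
        f"{video_caption}\n\n"
        "What is heard in the clip:\n"
        f"{audio_transcript}\n\n"
        "Write a question-answer pair that can only be answered by "
        "combining what is seen with what is heard."
    )


def call_llm(prompt: str) -> str:
    """Placeholder for whatever text LLM generates the QA pairs."""
    return "Q: What is the barking dog doing on screen? A: Chasing a ball in the yard."


def make_cross_modal_sample(video_caption: str, audio_transcript: str) -> dict:
    qa = call_llm(build_prompt(video_caption, audio_transcript))
    return {"caption": video_caption, "transcript": audio_transcript, "qa": qa}


if __name__ == "__main__":
    sample = make_cross_modal_sample(
        "A dog runs across a backyard chasing a ball.",
        "Loud barking, followed by a person laughing.",
    )
    print(sample["qa"])
```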
-----
Results 📊:
→ Ola achieves 84.3% accuracy on MMBench-1.1.
→ Ola achieves 70.8% accuracy on MMStar.
→ Ola achieves 57.0% accuracy on MMMU.
→ Ola achieves 68.4% accuracy on VideoMME.
→ Ola achieves a 3.1 mean word error rate (WER) on LibriSpeech.
→ Ola achieves a 6.41 GPT-eval score on AIR-Bench.