"Ola: Pushing the Frontiers of Omni-Modal Language Model with Progressive Modality Alignment"
The podcast below on this paper was generated with Google's Illuminate.
https://arxiv.org/abs/2502.04328
The main challenge in building omni-modal models is their performance gap relative to specialized single-modality models. Existing omni-modal models also lack balanced performance across modalities and efficient training strategies.
This paper introduces Ola, an omni-modal language model that uses progressive modality alignment to address these issues. Ola achieves competitive performance across image, video, and audio understanding tasks.
-----
📌 Progressive modality alignment lets Ola manage multi-modal interference effectively. Training starts with text-image pairs, then progressively integrates audio and video. This staged approach avoids catastrophic forgetting and improves overall performance (a minimal training-schedule sketch follows these bullets).
📌 Ola's architecture pairs modality-specific encoders with a shared LLM decoder, enabling efficient feature extraction from diverse inputs. A Local-Global Attention Pooling layer further compresses visual tokens for efficiency.
📌 Generating cross-modal video-audio data is a key innovation. This data bridges the gap between vision and audio. It enhances the model's ability to understand real-world scenarios where modalities are naturally correlated.
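To make the staged approach concrete, here is a minimal, hedged Python sketch of the training schedule. The stage names, dataset labels, and the train_stage() helper are illustrative placeholders, not the authors' code; only the ordering (image-text first, then speech, then video) comes from the paper.

```python
# Illustrative sketch of Ola's progressive modality alignment schedule.
# Stage order follows the paper (image/text -> +speech -> +video);
# dataset names, stage names, and train_stage() are placeholders.

STAGES = [
    {
        "name": "stage1_image_text",
        "data": ["image_caption_pairs", "visual_instruction_data"],
        "goal": "establish core vision-language knowledge on the base LLM",
    },
    {
        "name": "stage2_add_speech",
        "data": ["asr_transcripts", "audio_qa", "image_caption_pairs"],
        "goal": "bridge language and audio via speech, keep image data in the mix",
    },
    {
        "name": "stage3_add_video",
        "data": ["video_qa", "cross_modal_video_audio", "image_caption_pairs", "audio_qa"],
        "goal": "connect vision, audio, and language with video as the bridge",
    },
]


def train_stage(model, stage):
    """Placeholder for one alignment stage: mix the listed datasets and fine-tune."""
    print(f"[{stage['name']}] training on {stage['data']} -> {stage['goal']}")
    # a real training loop (data loading, loss, optimizer steps) would go here
    return model


def progressive_alignment(model):
    # Each stage starts from the previous stage's weights, so earlier
    # vision-language knowledge is preserved while new modalities are added.
    for stage in STAGES:
        model = train_stage(model, stage)
    return model


if __name__ == "__main__":
    progressive_alignment(model="omni_modal_lm_stub")
```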
----------
Methods Explored in this Paper 🔧:
→ Ola employs a progressive modality alignment strategy.
→ It begins training with image and text data. This establishes core vision-language knowledge.
→ Then, speech data is added. Speech bridges language and audio understanding.
→ Finally, video data is incorporated. Video connects all modalities: vision, audio, and language.
→ Ola's architecture includes modality-specific encoders that handle text, image, video, and audio inputs (a minimal forward-path sketch follows this list).
→ A Local-Global Attention Pooling layer fuses local and global visual features, efficiently reducing the visual token length.
→ Dual audio encoders are used: Whisper for speech and BEATs for music, allowing richer audio understanding.
→ Sentence-wise streaming decoding is implemented for real-time text and speech generation (sketched after the architecture example below).
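As referenced in the list above, here is a minimal PyTorch sketch of the forward path: modality-specific encoders feed a shared decoder, dual audio features are concatenated and projected, and visual tokens are compressed by local-global attention pooling. The pooling details (2x2 windows scored against a global mean feature) and all module names are assumptions for illustration, not the released implementation.

```python
# Hedged sketch of an Ola-like forward path. Encoder classes are stand-ins
# for the vision encoder / Whisper / BEATs, not real APIs.

import torch
import torch.nn as nn


class LocalGlobalAttentionPooling(nn.Module):
    """Compress an H x W grid of visual tokens by 2x2, weighting each token
    in a window with a score computed from the token plus a global summary."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)  # sees [local token, global mean]

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C) grid of visual tokens
        B, H, W, C = x.shape
        g = x.mean(dim=(1, 2), keepdim=True).expand(B, H, W, C)  # global context
        w = self.score(torch.cat([x, g], dim=-1))                # (B, H, W, 1)
        # group tokens into non-overlapping 2x2 windows
        x = x.view(B, H // 2, 2, W // 2, 2, C).permute(0, 1, 3, 2, 4, 5).reshape(B, -1, 4, C)
        w = w.view(B, H // 2, 2, W // 2, 2, 1).permute(0, 1, 3, 2, 4, 5).reshape(B, -1, 4, 1)
        w = w.softmax(dim=2)
        return (w * x).sum(dim=2)  # (B, H*W/4, C): 4x fewer visual tokens


class OlaLikeModel(nn.Module):
    def __init__(self, dim: int = 1024):
        super().__init__()
        # Stand-ins for the modality-specific encoders named in the paper.
        self.vision_encoder = nn.Identity()    # e.g. a ViT producing a token grid
        self.speech_encoder = nn.Identity()    # e.g. Whisper encoder features
        self.music_encoder = nn.Identity()     # e.g. BEATs features
        self.visual_pool = LocalGlobalAttentionPooling(dim)
        self.audio_proj = nn.Linear(dim, dim)  # project concatenated audio features
        self.decoder = nn.Identity()           # shared LLM decoder (stub)

    def forward(self, image_grid, speech_feats, music_feats, text_embeds):
        vis = self.visual_pool(self.vision_encoder(image_grid))          # (B, N_v, C)
        aud = self.audio_proj(torch.cat([self.speech_encoder(speech_feats),
                                         self.music_encoder(music_feats)], dim=1))
        tokens = torch.cat([vis, aud, text_embeds], dim=1)               # joint sequence
        return self.decoder(tokens)


if __name__ == "__main__":
    m = OlaLikeModel()
    out = m(torch.randn(1, 16, 16, 1024), torch.randn(1, 50, 1024),
            torch.randn(1, 50, 1024), torch.randn(1, 8, 1024))
    print(out.shape)  # torch.Size([1, 172, 1024]): 64 visual + 100 audio + 8 text tokens
```

The 2x2 weighted pooling cuts the visual token count by 4x before the tokens reach the shared decoder, which is the efficiency gain the pooling layer is meant to provide.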
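Sentence-wise streaming decoding can be pictured as a small loop that flushes each completed sentence to a speech synthesizer while text generation continues, so audio playback starts before the full response is written. The generate_tokens() and synthesize_speech() stubs below are hypothetical placeholders, not Ola's actual interfaces.

```python
# Hedged sketch of sentence-wise streaming decoding: text tokens stream out
# as generated, and each completed sentence is spoken immediately.

SENTENCE_END = {".", "!", "?"}


def generate_tokens(prompt):
    """Stub token generator standing in for the LLM decoder."""
    for tok in "Ola hears the clip . It describes a dog barking .".split():
        yield tok


def synthesize_speech(sentence):
    """Stub TTS call; in the paper this feeds the speech decoder per sentence."""
    print(f"\n[speech out] {sentence}")


def stream_response(prompt):
    buffer = []
    for tok in generate_tokens(prompt):
        buffer.append(tok)
        print(tok, end=" ", flush=True)          # stream text token-by-token
        if tok in SENTENCE_END:                  # sentence boundary reached
            synthesize_speech(" ".join(buffer))  # speak it right away
            buffer = []
    if buffer:                                   # flush any trailing fragment
        synthesize_speech(" ".join(buffer))


if __name__ == "__main__":
    stream_response("What is happening in this video?")
```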
-----
Key Insights 💡:
→ Progressive modality alignment improves omni-modal learning. It breaks down complex training into manageable steps.
→ This strategy keeps the required cross-modal alignment data small by leveraging advances in existing vision-language models.
→ Video data acts as a crucial bridge between audio and vision. It provides comprehensive multi-modal information.
→ Cross-modal video-audio data generation enhances performance. This data captures relationships between video and audio (a hedged data-construction sketch follows this list).
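As referenced above, the cross-modal video-audio data idea can be sketched as pairing a clip's visual caption with its audio transcript and asking a text LLM for questions that need both signals. The prompt wording, the call_llm() helper, and the field names below are assumptions for illustration only, not the paper's pipeline code.

```python
# Hedged sketch of cross-modal video-audio training data construction.

def build_prompt(video_caption: str, audio_transcript: str) -> str:
    return (
        "Visual description of the clip:\n"
        f"{video_caption}\n\n"
        "What is heard in the clip:\n"
        f"{audio_transcript}\n\n"
        "Write a question-answer pair that can only be answered by "
        "combining what is seen with what is heard."
    )


def call_llm(prompt: str) -> str:
    """Placeholder for whatever text LLM generates the QA pairs."""
    return "Q: What is the barking dog doing on screen? A: Chasing a ball in the yard."


def make_cross_modal_sample(video_caption: str, audio_transcript: str) -> dict:
    qa = call_llm(build_prompt(video_caption, audio_transcript))
    return {"caption": video_caption, "transcript": audio_transcript, "qa": qa}


if __name__ == "__main__":
    sample = make_cross_modal_sample(
        "A dog runs across a backyard chasing a ball.",
        "Loud barking, followed by a person laughing.",
    )
    print(sample["qa"])
```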
-----
Results 📊:
→ Ola achieves 84.3% accuracy on MMBench-1.1.
→ Ola achieves 70.8% accuracy on MMStar.
→ Ola achieves 57.0% accuracy on MMMU.
→ Ola achieves 68.4% accuracy on VideoMME.
→ Ola achieves a 3.1 mean word error rate (WER) on LibriSpeech.
→ Ola achieves a 6.41 GPT-eval score on AIR-Bench.