"MIO: A Foundation Model on Multimodal Tokens"

The podcast on this paper was generated with Google's Illuminate.

By representing every modality as discrete tokens, MIO keeps input and output representations consistent, enabling understanding and generation across text, images, speech, and video in a single model.

📚 https://arxiv.org/pdf/2409.17692

Original Problem 🔍:

Existing any-to-any multimodal models lack true intermodal understanding and generation capabilities, especially for speech and video modalities. Current approaches often struggle with inconsistent input-output representations or limited multimodal instruction-following abilities.

-----

Solution in this Paper 🛠️:

• Introduces MIO, a foundation model for multimodal tokens

• Uses a DIDO (Discrete-In-Discrete-Out) approach so inputs and outputs share one discrete-token representation (a rough sketch follows this list)

• Employs SEED-Tokenizer for images and SpeechTokenizer for speech

• Implements a three-stage pre-training process:

1. Alignment pre-training

2. Interleaved pre-training

3. Speech-enhanced pre-training

• Conducts comprehensive supervised fine-tuning on 16 tasks
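
To make the DIDO idea concrete, here is a minimal sketch of how text, image, and speech inputs could be flattened into one discrete token stream, with the three pre-training stages expressed as a data-mixture schedule. The wrapper function, boundary tokens, and mixture ratios are illustrative assumptions, not the paper's actual code or numbers.

```python
# Illustrative DIDO (Discrete-In-Discrete-Out) sketch: every modality is
# quantized into discrete codes and flattened into one token stream.
# The function name, boundary tokens, and stage mixture ratios below are
# assumptions for illustration, not MIO's actual implementation.

from typing import List, Optional

IMG_START, IMG_END = "<image>", "</image>"
SPCH_START, SPCH_END = "<speech>", "</speech>"

def to_discrete_sequence(text: str,
                         image_codes: Optional[List[int]] = None,
                         speech_codes: Optional[List[int]] = None) -> List[str]:
    """Flatten text plus quantized image/speech codes (e.g. from a SEED-style
    or SpeechTokenizer-style quantizer) into a single discrete token stream,
    so the LLM both consumes and emits every modality as ordinary tokens."""
    seq = text.split()
    if image_codes:
        seq += [IMG_START] + [f"<img_{c}>" for c in image_codes] + [IMG_END]
    if speech_codes:
        seq += [SPCH_START] + [f"<spch_{c}>" for c in speech_codes] + [SPCH_END]
    return seq

# The three pre-training stages can be viewed as a schedule of data-mixture
# weights; the ratios here are placeholders, not the paper's numbers.
PRETRAIN_STAGES = [
    ("alignment",       {"paired image/speech-text": 0.7, "text": 0.3}),
    ("interleaved",     {"interleaved multimodal": 0.6, "paired": 0.2, "text": 0.2}),
    ("speech-enhanced", {"speech-heavy": 0.5, "interleaved": 0.3, "text": 0.2}),
]

if __name__ == "__main__":
    print(to_discrete_sequence("a photo of a cat", image_codes=[17, 402, 9]))
    for name, mix in PRETRAIN_STAGES:
        print(name, "->", mix)
```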

-----

Key Insights from this Paper 💡:

• Unified multimodal understanding and generation in a single model

• Enables generation of interleaved multimodal sequences (illustrated in the decoding sketch after this list)

• Addresses the challenge of speech token dominance through staged training

• Demonstrates emergent abilities like chain-of-visual-thought reasoning
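
A rough sketch of how interleaved generation could be decoded on the output side: the model emits one flat token stream, and spans delimited by image boundary tokens are routed to the image de-tokenizer while the rest is rendered as text. The boundary tokens and routing logic are assumptions for illustration, not MIO's actual API.

```python
# Hypothetical decoding loop for multimodal interleaved output.
# Boundary tokens and the routing convention are illustrative only.

def split_interleaved(tokens):
    """Split a flat output stream into (modality, chunk) pairs."""
    chunks, mode, buf = [], "text", []
    for tok in tokens:
        if tok == "<image>":
            if buf:
                chunks.append((mode, buf))
                buf = []
            mode = "image"
        elif tok == "</image>":
            chunks.append((mode, buf))
            buf, mode = [], "text"
        else:
            buf.append(tok)
    if buf:
        chunks.append((mode, buf))
    return chunks

stream = ["The", "edited", "scene:", "<image>", "<img_12>", "<img_87>", "</image>", "done."]
for modality, chunk in split_interleaved(stream):
    print(modality, chunk)
# "text" chunks are rendered as text; "image" chunks would be passed to an
# image de-tokenizer (e.g. a SEED-style decoder) to reconstruct pixels.
```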

-----

Results 📊:

• Competitive performance on image understanding tasks (e.g., VQAv2: 65.5, OKVQA: 39.9)

• Strong image generation capabilities (CLIP-I score on MS-COCO: 67.76)

• Outperforms baselines in video understanding (MSVD-QA: 42.6, MSRVTT-QA: 35.5)

• Achieves 4.2 WER in TTS tasks, surpassing AnyGPT (8.5 WER)
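
For context on the TTS metric: word error rate is the word-level edit distance (substitutions + deletions + insertions) between the transcript of the generated speech and the reference text, divided by the reference length. A small self-contained implementation for reference:

```python
# Word error rate = (substitutions + deletions + insertions) / reference length,
# computed here via word-level Levenshtein distance. Multiply by 100 for percent.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sit on mat"))  # 2 errors / 6 words ≈ 0.33
```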
