"MIO: A Foundation Model on Multimodal Tokens"

The podcast on this paper was generated with Google's Illuminate.

By representing every modality as discrete tokens, MIO keeps input and output representations consistent, enabling understanding and generation across text, images, speech, and video in a single model.

📚 https://arxiv.org/pdf/2409.17692

Original Problem 🔍:

Existing any-to-any multimodal models lack true intermodal understanding and generation capabilities, especially for speech and video modalities. Current approaches often struggle with inconsistent input-output representations or limited multimodal instruction-following abilities.

-----

Solution in this Paper 🛠️:

• Introduces MIO, a foundation model for multimodal tokens

• Uses a DIDO (Discrete-In-Discrete-Out) approach so inputs and outputs share one discrete-token representation (a rough sketch follows this list)

• Employs SEED-Tokenizer for images and SpeechTokenizer for speech

• Implements a three-stage pre-training process:

1. Alignment pre-training

2. Interleaved pre-training

3. Speech-enhanced pre-training

• Conducts comprehensive supervised fine-tuning on 16 tasks
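
To make the DIDO idea concrete, here is a minimal sketch of how text, image, and speech inputs could be flattened into one discrete token stream, with the three pre-training stages expressed as a data-mixture schedule. The wrapper function, boundary tokens, and mixture ratios are illustrative assumptions, not the paper's actual code or numbers.

```python
# Illustrative DIDO (Discrete-In-Discrete-Out) sketch: every modality is
# quantized into discrete codes and flattened into one token stream.
# The function name, boundary tokens, and stage mixture ratios below are
# assumptions for illustration, not MIO's actual implementation.

from typing import List, Optional

IMG_START, IMG_END = "<image>", "</image>"
SPCH_START, SPCH_END = "<speech>", "</speech>"

def to_discrete_sequence(text: str,
                         image_codes: Optional[List[int]] = None,
                         speech_codes: Optional[List[int]] = None) -> List[str]:
    """Flatten text plus quantized image/speech codes (e.g. from a SEED-style
    or SpeechTokenizer-style quantizer) into a single discrete token stream,
    so the LLM both consumes and emits every modality as ordinary tokens."""
    seq = text.split()
    if image_codes:
        seq += [IMG_START] + [f"<img_{c}>" for c in image_codes] + [IMG_END]
    if speech_codes:
        seq += [SPCH_START] + [f"<spch_{c}>" for c in speech_codes] + [SPCH_END]
    return seq

# The three pre-training stages can be viewed as a schedule of data-mixture
# weights; the ratios here are placeholders, not the paper's numbers.
PRETRAIN_STAGES = [
    ("alignment",       {"paired image/speech-text": 0.7, "text": 0.3}),
    ("interleaved",     {"interleaved multimodal": 0.6, "paired": 0.2, "text": 0.2}),
    ("speech-enhanced", {"speech-heavy": 0.5, "interleaved": 0.3, "text": 0.2}),
]

if __name__ == "__main__":
    print(to_discrete_sequence("a photo of a cat", image_codes=[17, 402, 9]))
    for name, mix in PRETRAIN_STAGES:
        print(name, "->", mix)
```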

-----

Key Insights from this Paper 💡:

• Unified multimodal understanding and generation in a single model

• Enables generation of interleaved multimodal sequences (illustrated in the decoding sketch after this list)

• Addresses the challenge of speech token dominance through staged training

• Demonstrates emergent abilities like chain-of-visual-thought reasoning
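
A rough sketch of how interleaved generation could be decoded on the output side: the model emits one flat token stream, and spans delimited by image boundary tokens are routed to the image de-tokenizer while the rest is rendered as text. The boundary tokens and routing logic are assumptions for illustration, not MIO's actual API.

```python
# Hypothetical decoding loop for multimodal interleaved output.
# Boundary tokens and the routing convention are illustrative only.

def split_interleaved(tokens):
    """Split a flat output stream into (modality, chunk) pairs."""
    chunks, mode, buf = [], "text", []
    for tok in tokens:
        if tok == "<image>":
            if buf:
                chunks.append((mode, buf))
                buf = []
            mode = "image"
        elif tok == "</image>":
            chunks.append((mode, buf))
            buf, mode = [], "text"
        else:
            buf.append(tok)
    if buf:
        chunks.append((mode, buf))
    return chunks

stream = ["The", "edited", "scene:", "<image>", "<img_12>", "<img_87>", "</image>", "done."]
for modality, chunk in split_interleaved(stream):
    print(modality, chunk)
# "text" chunks are rendered as text; "image" chunks would be passed to an
# image de-tokenizer (e.g. a SEED-style decoder) to reconstruct pixels.
```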

-----

Results 📊:

• Competitive performance on image understanding tasks (e.g., VQAv2: 65.5, OKVQA: 39.9)

• Strong image generation capabilities (CLIP-I score on MS-COCO: 67.76)

• Outperforms baselines in video understanding (MSVD-QA: 42.6, MSRVTT-QA: 35.5)

• Achieves 4.2 WER in TTS tasks, surpassing AnyGPT (8.5 WER)
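
For context on the TTS metric: word error rate is the word-level edit distance (substitutions + deletions + insertions) between the transcript of the generated speech and the reference text, divided by the reference length. A small self-contained implementation for reference:

```python
# Word error rate = (substitutions + deletions + insertions) / reference length,
# computed here via word-level Levenshtein distance. Multiply by 100 for percent.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sit on mat"))  # 2 errors / 6 words ≈ 0.33
```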
