By representing every modality as discrete tokens, MIO keeps its input and output representations consistent, enabling stronger intermodal understanding and generation.
📚 https://arxiv.org/pdf/2409.17692
Original Problem 🔍:
Existing any-to-any multimodal models lack true intermodal understanding and generation capabilities, especially for speech and video modalities. Current approaches often struggle with inconsistent input-output representations or limited multimodal instruction-following abilities.
-----
Solution in this Paper 🛠️:
• Introduces MIO, a foundation model built on multimodal tokens
• Uses a DIDO (Discrete-In-Discrete-Out) design so inputs and outputs share one discrete token space (a minimal sketch follows this list)
• Employs SEED-Tokenizer for images and SpeechTokenizer for speech
• Implements a three-stage pre-training process:
1. Alignment pre-training
2. Interleaved pre-training
3. Speech-enhanced pre-training
• Conducts comprehensive supervised fine-tuning on 16 tasks
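To make the DIDO idea concrete, here is a minimal sketch of flattening text, image, and speech tokens into one discrete sequence for a causal LM and routing generated spans back to their decoders. The tokenizer and model interfaces are hypothetical placeholders, not the actual SEED-Tokenizer, SpeechTokenizer, or MIO APIs, and the boundary tokens (`<img>`, `<spch>`) are illustrative.

```python
# Sketch of the Discrete-In-Discrete-Out (DIDO) idea, NOT the official MIO code.
# The tokenizer/model objects are placeholders standing in for SEED-Tokenizer
# (images), SpeechTokenizer (speech), and a causal language model; only the
# control flow illustrates the approach.
from typing import List


class DiscreteTokenizer:
    """Placeholder: maps raw data to/from a block of discrete token IDs
    appended to the language model's vocabulary."""

    def encode(self, raw) -> List[int]:      # raw image / waveform / text
        raise NotImplementedError

    def decode(self, ids: List[int]):        # back to pixels / audio / text
        raise NotImplementedError


def build_dido_sequence(text_ids: List[int],
                        image_ids: List[int],
                        speech_ids: List[int],
                        specials: dict) -> List[int]:
    """Flatten all modalities into ONE token sequence: because every modality
    lives in the same discrete space, a single next-token-prediction objective
    covers both understanding and generation."""
    return (
        text_ids
        + [specials["<img>"]] + image_ids + [specials["</img>"]]
        + [specials["<spch>"]] + speech_ids + [specials["</spch>"]]
    )


def generate_any_to_any(lm, prompt_ids: List[int], specials: dict,
                        image_tok: DiscreteTokenizer,
                        speech_tok: DiscreteTokenizer,
                        max_new: int = 512):
    """Autoregressively sample tokens, then decode each modality span."""
    out_ids = lm.generate(prompt_ids, max_new_tokens=max_new)  # placeholder API
    outputs, i = [], 0
    while i < len(out_ids):
        if out_ids[i] == specials["<img>"]:
            j = out_ids.index(specials["</img>"], i)
            outputs.append(("image", image_tok.decode(out_ids[i + 1:j])))
            i = j + 1
        elif out_ids[i] == specials["<spch>"]:
            j = out_ids.index(specials["</spch>"], i)
            outputs.append(("speech", speech_tok.decode(out_ids[i + 1:j])))
            i = j + 1
        else:
            outputs.append(("text", out_ids[i]))
            i += 1
    return outputs
```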
-----
Key Insights from this Paper 💡:
• Unified multimodal understanding and generation in a single model
• Enables generation of multimodal interleaved sequences
• Addresses speech token dominance by controlling how much speech data each pre-training stage sees (a schedule sketch follows this list)
• Demonstrates emergent abilities like chain-of-visual-thought reasoning
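A rough sketch of how a staged data mixture can keep long speech token sequences from dominating early training and then raise the speech share in the speech-enhanced stage. The stage names follow the paper; the mixture weights and sampling loop are invented for illustration, not the paper's actual ratios.

```python
# Illustrative staged pre-training schedule using simple weighted sampling.
# Stage names follow the paper; the weights below are made-up placeholders.
import random

STAGES = {
    # Stage 1: align modality tokens with the language backbone;
    # speech is down-weighted so its long token sequences do not dominate.
    "alignment":       {"text": 0.4, "image-text": 0.4, "speech-text": 0.2},
    # Stage 2: interleaved multimodal documents for in-context behaviour.
    "interleaved":     {"text": 0.3, "interleaved-image-text": 0.5, "speech-text": 0.2},
    # Stage 3: raise the speech share once other modalities are stable.
    "speech_enhanced": {"text": 0.3, "interleaved-image-text": 0.3, "speech-text": 0.4},
}


def sample_source(stage: str) -> str:
    """Pick the data source for the next training example by stage weights."""
    sources, probs = zip(*STAGES[stage].items())
    return random.choices(sources, weights=probs, k=1)[0]


if __name__ == "__main__":
    for stage in STAGES:
        counts = {s: 0 for s in STAGES[stage]}
        for _ in range(10_000):
            counts[sample_source(stage)] += 1
        print(stage, counts)
```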
-----
Results 📊:
• Competitive performance on image understanding tasks (e.g., VQAv2: 65.5, OKVQA: 39.9)
• Strong image generation capabilities (CLIP-I score on MS-COCO: 67.76)
• Outperforms baselines in video understanding (MSVDQA: 42.6, MSRVTT-QA: 35.5)
• Achieves a 4.2 word error rate (WER) on TTS, surpassing AnyGPT (8.5 WER); a reference WER implementation follows below
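For reference, WER is the word-level edit distance between an ASR transcript of the synthesized speech and the reference text, divided by the reference length. Below is a generic implementation of the metric, not the paper's evaluation code; scores like 4.2 correspond to this value expressed as a percentage.

```python
# Standard word error rate (WER) via Levenshtein distance over words,
# as commonly used to score TTS output from an ASR transcript.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)


# Example: 1 substitution + 1 deletion over a 6-word reference -> WER ≈ 0.33
print(wer("the cat sat on the mat", "the cat sit on mat"))
```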