Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities
Mini-Omni2 packs vision, speech, and interruption capabilities into a lightweight open-source package.
Original Problem 🔍:
GPT-4o represents a milestone in multi-modal LLMs, but its technical details remain undisclosed. Existing open-source models often focus on specific functionalities, lacking a unified approach for text, vision, and speech capabilities.
Solution in this Paper 🛠️:
• Introduces Mini-Omni2, an open-source multi-modal LLM with vision, speech, and duplex capabilities
• Employs CLIP ViT-B/32 for visual encoding and Whisper-small for audio encoding
• Utilizes Qwen2-0.5B as the base language model
• Implements a three-stage training process for modality expansion and alignment
• Proposes a command-based interruption mechanism for flexible, full-duplex interactions (see the sketch after this list)
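The paper does not spell out the interruption code, but the mechanism can be pictured as a generation loop that keeps listening to the incoming audio stream and halts as soon as a spoken stop command is recognized. The sketch below is a minimal illustration under that assumption; `init_state`, `generate_next_chunk`, `read_latest`, and `detect_interrupt_command` are hypothetical names, not Mini-Omni2's actual API.

```python
# Hypothetical sketch of command-based interruption during streaming generation.
# All model/stream methods here are placeholders, not the paper's real interface.
def duplex_generate(model, prompt, mic_stream, max_chunks=256):
    """Stream response chunks; stop early if a spoken stop command is detected."""
    state = model.init_state(prompt)                        # assumed: set up decoding state
    for _ in range(max_chunks):
        chunk, state = model.generate_next_chunk(state)     # assumed: one streaming decode step
        yield chunk                                         # emit this text/audio chunk
        # While speaking, keep encoding the incoming microphone audio; if a
        # dedicated interrupt command is recognized, halt generation at once.
        incoming = mic_stream.read_latest()                 # assumed: non-blocking read
        if incoming is not None and model.detect_interrupt_command(incoming):
            break
```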
Key Insights from this Paper 💡:
• Demonstrates the feasibility of creating an open-source GPT-4o-like model
• Highlights the importance of modality alignment in multi-modal LLMs
• Showcases the potential of command-based interruption for natural interactions
Results 📊:
• Maintains speech recognition accuracy comparable to the base Whisper model
• Achieves 4.8% WER on LibriSpeech test-clean (vs. 4.4% for Whisper-small)
• Demonstrates capabilities in multi-modal question answering and real-time voice interaction
• Open-sources the model and datasets for future research
🧠 The Mini-Omni2 model architecture consists of:
• A visual encoder using the CLIP ViT-B/32 model
• An audio encoder using the Whisper-small model
• Qwen2-0.5B as the foundational language model
• A multi-layer vocabulary construction for parallel generation of text and audio tokens
• Adapters to align features from the different modalities
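Putting these pieces together, the forward pass can be sketched roughly as below: CLIP and Whisper encoders feed linear adapters that project into Qwen2's hidden space, and extra output heads emit audio tokens in parallel with the text head. The checkpoint names are the public Hugging Face IDs of the listed components; the adapter design, the number of audio heads, and the audio vocabulary size are illustrative assumptions, not the paper's exact configuration.

```python
# A minimal sketch of how Mini-Omni2's components could be wired together.
# Adapter shapes, num_audio_heads, and audio_vocab_size are assumed values.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, CLIPVisionModel, WhisperModel

class MiniOmni2Sketch(nn.Module):
    def __init__(self, num_audio_heads=7, audio_vocab_size=4096):  # assumed sizes
        super().__init__()
        # Encoders for vision and speech.
        self.visual_encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
        self.audio_encoder = WhisperModel.from_pretrained("openai/whisper-small").encoder
        # Qwen2-0.5B as the language backbone.
        self.lm = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B")
        hidden = self.lm.config.hidden_size
        # Linear adapters project each modality into the LM embedding space.
        self.visual_adapter = nn.Linear(self.visual_encoder.config.hidden_size, hidden)
        self.audio_adapter = nn.Linear(self.audio_encoder.config.d_model, hidden)
        # Extra heads emit audio-token streams in parallel with the text head.
        self.audio_heads = nn.ModuleList(
            nn.Linear(hidden, audio_vocab_size) for _ in range(num_audio_heads)
        )

    def forward(self, pixel_values, input_features, text_ids):
        # Encode image and speech, then map both into the LM hidden space.
        vis = self.visual_adapter(self.visual_encoder(pixel_values).last_hidden_state)
        aud = self.audio_adapter(self.audio_encoder(input_features).last_hidden_state)
        txt = self.lm.get_input_embeddings()(text_ids)
        inputs_embeds = torch.cat([vis, aud, txt], dim=1)
        out = self.lm(inputs_embeds=inputs_embeds, output_hidden_states=True)
        hidden_states = out.hidden_states[-1]
        # Parallel decoding: one text-token stream plus several audio-token streams.
        text_logits = out.logits
        audio_logits = [head(hidden_states) for head in self.audio_heads]
        return text_logits, audio_logits
```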