EMOVA unifies vision, speech, and text in an end-to-end model with emotional dialogue capabilities.
📚 https://arxiv.org/pdf/2409.18042
Original Problem 🔍:
Existing multi-modal models lack end-to-end capabilities across vision, speech, and text, especially for emotional spoken dialogue. Open-source models struggle to align these modalities because omni-modal (image-speech-text) training data is scarce.
-----
Solution in this Paper 🛠️:
• EMOVA: End-to-end omni-modal LLM with vision, speech, and text capabilities (see the sketch after this list)
• Continuous vision encoder for fine-grained visual details
• Semantic-acoustic disentangled speech tokenizer for speech understanding
• Speech detokenizer with lightweight style module for emotional speech generation
• Text-centric omni-modal alignment using bi-modal datasets (image-text, speech-text)
• Joint training on multiple bi-modal datasets for mutual enhancement
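A minimal sketch of how these components might compose at inference time. All module names (`VisionEncoder`, `SpeechTokenizer`, `StyleModule`, etc.) and call signatures are illustrative assumptions, not EMOVA's actual API:

```python
import torch.nn as nn

class EmovaPipelineSketch(nn.Module):
    """Hypothetical composition of EMOVA-style components (names are illustrative)."""

    def __init__(self, vision_encoder, speech_tokenizer, llm,
                 speech_detokenizer, style_module):
        super().__init__()
        self.vision_encoder = vision_encoder          # continuous visual embeddings
        self.speech_tokenizer = speech_tokenizer      # waveform -> discrete semantic units
        self.llm = llm                                # text-centric omni-modal LLM
        self.speech_detokenizer = speech_detokenizer  # semantic units + style -> waveform
        self.style_module = style_module              # lightweight emotion/pitch embedding

    def forward(self, image, speech_wave, style_label):
        vis_emb = self.vision_encoder(image)               # fine-grained, continuous features
        speech_units = self.speech_tokenizer(speech_wave)  # semantic (acoustics-disentangled) tokens
        # The LLM consumes visual embeddings and speech units alongside text
        # and emits response speech units end-to-end (no external TTS tool).
        out_units = self.llm(vision=vis_emb, speech=speech_units)
        style_emb = self.style_module(style_label)         # e.g. emotion + pitch controls
        return self.speech_detokenizer(out_units, style_emb)  # emotional speech waveform
```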
-----
Key Insights from this Paper 💡:
• Semantic-acoustic disentanglement in speech processing improves modality alignment
• Omni-modal alignment enhances both vision-language and speech capabilities
• End-to-end approach eliminates need for external text-to-speech tools
• Flexible style controls enable vivid emotional spoken dialogues (see the sketch below)
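To illustrate the last point, a lightweight style module could map discrete emotion and pitch labels to a single conditioning vector for the speech detokenizer. The label sets, dimensions, and class below are assumptions for illustration, not the paper's implementation:

```python
import torch
import torch.nn as nn

class StyleModuleSketch(nn.Module):
    """Hypothetical lightweight style module: maps discrete emotion/pitch labels
    to an embedding that conditions speech synthesis."""

    def __init__(self, n_emotions=4, n_pitches=3, dim=256):
        super().__init__()
        self.emotion_emb = nn.Embedding(n_emotions, dim)
        self.pitch_emb = nn.Embedding(n_pitches, dim)

    def forward(self, emotion_id, pitch_id):
        # Summed embeddings act as one style vector injected into the detokenizer.
        return self.emotion_emb(emotion_id) + self.pitch_emb(pitch_id)

# Usage with assumed label indices (e.g. emotion 1 = "happy", pitch 2 = "high"):
style_vec = StyleModuleSketch()(torch.tensor([1]), torch.tensor([2]))
```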
-----
Results 📊:
• Outperforms GPT-4V and Gemini Pro 1.5 on 10/14 vision-language benchmarks
• Reaches >95% of GPT-4o performance on most benchmarks
• State-of-the-art ASR performance: 4.0 WER on LibriSpeech (vs. 4.5 for Mini-Omni)
• Achieves 89.01% emotion and 98.30% pitch accuracy in style-controlled speech generation