EMOVA : EMPOWERING LANGUAGE MODELS TO SEE, HEAR AND SPEAK WITH VIVID EMOTIONS

The podcast on this paper is generated with Google's Illuminate.

EMOVA unifies vision, speech, and text in an end-to-end model with emotional dialogue capabilities.

📚 https://arxiv.org/pdf/2409.18042

Original Problem 🔍:

Existing multi-modal models lack end-to-end capabilities across vision, speech, and text, especially for emotional spoken dialogue. Open-source models struggle to align all three modalities because omni-modal (image-text-speech) training data is scarce.

-----

Solution in this Paper 🛠️:

• EMOVA: End-to-end omni-modal LLM with vision, speech, and text capabilities (a minimal sketch of how the modules compose follows this list)

• Continuous vision encoder for fine-grained visual details

• Semantic-acoustic disentangled speech tokenizer for speech understanding

• Speech detokenizer with lightweight style module for emotional speech generation

• Text-centric omni-modal alignment using bi-modal datasets (image-text, speech-text)

• Joint training on multiple bi-modal datasets so that vision-language and speech-language capabilities reinforce each other
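
The way these modules compose can be pictured with the short sketch below. It is a hedged illustration only: the class name OmniModalLM, the module signatures, and the feature dimensions (vision_dim=1024, d_model=4096) are assumptions made for readability, not the paper's actual code.

```python
import torch
import torch.nn as nn

class OmniModalLM(nn.Module):
    """Toy composition of EMOVA-style modules; not the paper's implementation."""

    def __init__(self, llm, vision_encoder, speech_tokenizer, speech_detokenizer,
                 vision_dim=1024, d_model=4096):
        super().__init__()
        self.llm = llm                                # decoder-only text backbone
        self.vision_encoder = vision_encoder          # continuous vision encoder
        self.vision_proj = nn.Linear(vision_dim, d_model)  # map visual features into the LLM space
        self.speech_tokenizer = speech_tokenizer      # speech -> discrete semantic units
        self.speech_detokenizer = speech_detokenizer  # semantic units + style -> waveform

    def forward(self, image, speech_wave, text_ids, style):
        # Continuous visual features are projected and fed to the LLM alongside text.
        vis_embeds = self.vision_proj(self.vision_encoder(image))
        # Input speech becomes discrete semantic units shared with the LLM vocabulary.
        speech_units = self.speech_tokenizer(speech_wave)
        # The LLM consumes one mixed sequence and emits text plus speech-unit tokens.
        out_text, out_units = self.llm(vis_embeds, speech_units, text_ids)
        # The lightweight style module conditions the detokenizer (e.g. emotion, pitch),
        # so no external TTS system is needed for expressive speech output.
        waveform = self.speech_detokenizer(out_units, style=style)
        return out_text, waveform
```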

-----

Key Insights from this Paper 💡:

• Semantic-acoustic disentanglement in speech processing improves modality alignment

• Omni-modal alignment enhances both vision-language and speech capabilities

• End-to-end approach eliminates the need for external text-to-speech tools

• Flexible style controls (e.g., emotion and pitch) enable vivid emotional spoken dialogues; a hedged sketch of such a style module follows
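
To make the last two insights concrete, here is a hedged sketch of how a lightweight style module might inject emotion and pitch controls only at the detokenizer, keeping the semantic speech units the LLM sees free of acoustic detail. The names (StyleModule, the label sets, d_style=128) are illustrative assumptions, not EMOVA's actual interface.

```python
import torch
import torch.nn as nn

class StyleModule(nn.Module):
    """Toy style module: maps discrete emotion/pitch labels to a conditioning vector."""

    def __init__(self, emotions=("neutral", "happy", "sad", "angry"),
                 pitches=("normal", "low", "high"), d_style=128):
        super().__init__()
        self.emotion_emb = nn.Embedding(len(emotions), d_style)
        self.pitch_emb = nn.Embedding(len(pitches), d_style)
        self.emotions = {e: i for i, e in enumerate(emotions)}
        self.pitches = {p: i for i, p in enumerate(pitches)}

    def forward(self, emotion: str, pitch: str) -> torch.Tensor:
        e = torch.tensor([self.emotions[emotion]])
        p = torch.tensor([self.pitches[pitch]])
        # Sum the two embeddings into one style vector that conditions the
        # unit-to-waveform detokenizer; acoustic style never enters the LLM.
        return self.emotion_emb(e) + self.pitch_emb(p)

# Usage sketch: the LLM outputs semantic speech units; style is applied only at synthesis.
style = StyleModule()("happy", "high")              # (1, 128) conditioning vector
# waveform = speech_detokenizer(semantic_units, style=style)
```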

-----

Results 📊:

• Outperforms GPT-4V and Gemini Pro 1.5 on 10/14 vision-language benchmarks

• Reaches >95% of GPT-4o performance on most benchmarks

• State-of-the-art ASR performance: 4.0% WER on LibriSpeech (vs. 4.5% for Mini-Omni)

• Achieves 89.01% emotion accuracy and 98.30% pitch accuracy in style-controlled speech generation
