EMOVA : EMPOWERING LANGUAGE MODELS TO SEE, HEAR AND SPEAK WITH VIVID EMOTIONS

The podcast on this paper is generated with Google's Illuminate.

EMOVA unifies vision, speech, and text in an end-to-end model with emotional dialogue capabilities.

📚 https://arxiv.org/pdf/2409.18042

Original Problem 🔍:

Existing multi-modal models lack end-to-end capabilities across vision, speech, and text, especially for emotional spoken dialogue. Open-source models struggle to align all three modalities because omni-modal (image-text-speech) training data is scarce.

-----

Solution in this Paper 🛠️:

• EMOVA: End-to-end omni-modal LLM with vision, speech, and text capabilities (a minimal sketch of how the modules compose follows this list)

• Continuous vision encoder for fine-grained visual details

• Semantic-acoustic disentangled speech tokenizer for speech understanding

• Speech detokenizer with lightweight style module for emotional speech generation

• Text-centric omni-modal alignment using bi-modal datasets (image-text, speech-text)

• Joint training on multiple bi-modal datasets so that vision-language and speech-language capabilities reinforce each other
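
The way these modules compose can be pictured with the short sketch below. It is a hedged illustration only: the class name OmniModalLM, the module signatures, and the feature dimensions (vision_dim=1024, d_model=4096) are assumptions made for readability, not the paper's actual code.

```python
import torch
import torch.nn as nn

class OmniModalLM(nn.Module):
    """Toy composition of EMOVA-style modules; not the paper's implementation."""

    def __init__(self, llm, vision_encoder, speech_tokenizer, speech_detokenizer,
                 vision_dim=1024, d_model=4096):
        super().__init__()
        self.llm = llm                                # decoder-only text backbone
        self.vision_encoder = vision_encoder          # continuous vision encoder
        self.vision_proj = nn.Linear(vision_dim, d_model)  # map visual features into the LLM space
        self.speech_tokenizer = speech_tokenizer      # speech -> discrete semantic units
        self.speech_detokenizer = speech_detokenizer  # semantic units + style -> waveform

    def forward(self, image, speech_wave, text_ids, style):
        # Continuous visual features are projected and fed to the LLM alongside text.
        vis_embeds = self.vision_proj(self.vision_encoder(image))
        # Input speech becomes discrete semantic units shared with the LLM vocabulary.
        speech_units = self.speech_tokenizer(speech_wave)
        # The LLM consumes one mixed sequence and emits text plus speech-unit tokens.
        out_text, out_units = self.llm(vis_embeds, speech_units, text_ids)
        # The lightweight style module conditions the detokenizer (e.g. emotion, pitch),
        # so no external TTS system is needed for expressive speech output.
        waveform = self.speech_detokenizer(out_units, style=style)
        return out_text, waveform
```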

-----

Key Insights from this Paper 💡:

• Semantic-acoustic disentanglement in speech processing improves modality alignment

• Omni-modal alignment enhances both vision-language and speech capabilities

• End-to-end approach eliminates the need for external text-to-speech tools

• Flexible style controls (e.g., emotion and pitch) enable vivid emotional spoken dialogues; a hedged sketch of such a style module follows
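
To make the last two insights concrete, here is a hedged sketch of how a lightweight style module might inject emotion and pitch controls only at the detokenizer, keeping the semantic speech units the LLM sees free of acoustic detail. The names (StyleModule, the label sets, d_style=128) are illustrative assumptions, not EMOVA's actual interface.

```python
import torch
import torch.nn as nn

class StyleModule(nn.Module):
    """Toy style module: maps discrete emotion/pitch labels to a conditioning vector."""

    def __init__(self, emotions=("neutral", "happy", "sad", "angry"),
                 pitches=("normal", "low", "high"), d_style=128):
        super().__init__()
        self.emotion_emb = nn.Embedding(len(emotions), d_style)
        self.pitch_emb = nn.Embedding(len(pitches), d_style)
        self.emotions = {e: i for i, e in enumerate(emotions)}
        self.pitches = {p: i for i, p in enumerate(pitches)}

    def forward(self, emotion: str, pitch: str) -> torch.Tensor:
        e = torch.tensor([self.emotions[emotion]])
        p = torch.tensor([self.pitches[pitch]])
        # Sum the two embeddings into one style vector that conditions the
        # unit-to-waveform detokenizer; acoustic style never enters the LLM.
        return self.emotion_emb(e) + self.pitch_emb(p)

# Usage sketch: the LLM outputs semantic speech units; style is applied only at synthesis.
style = StyleModule()("happy", "high")              # (1, 128) conditioning vector
# waveform = speech_detokenizer(semantic_units, style=style)
```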

-----

Results 📊:

• Outperforms GPT-4V and Gemini Pro 1.5 on 10/14 vision-language benchmarks

• Reaches >95% of GPT-4o performance on most benchmarks

• State-of-the-art ASR performance: 4.0% WER on LibriSpeech (vs. 4.5% for Mini-Omni)

• Achieves 89.01% emotion accuracy and 98.30% pitch accuracy in style-controlled speech generation
