"OpenOmni: Large Language Models Pivot Zero-shot Omnimodal Alignment across Language with Real-time Self-Aware Emotional Speech Synthesis"

A podcast on this paper was generated with Google's Illuminate.

Speech-vision alignment achieved through a language intermediary, not massive datasets.

OpenOmni enables zero-shot omnimodal alignment using language as a pivot, while delivering real-time emotional speech synthesis without requiring extensive tri-modal data.

-----

https://arxiv.org/abs/2501.04561

🤔 Original Problem:

→ Current open-source omnimodal models are held back by scarce high-quality tri-modal data and by the difficulty of real-time emotional speech generation, lagging behind commercial solutions.

-----

🛠️ Solution in this Paper:

→ OpenOmni introduces a two-stage training approach that combines omnimodal alignment with speech generation.

→ The model uses language as a pivot to achieve zero-shot alignment across modalities without requiring extensive tri-modal datasets (see the alignment sketch after this list).

→ A lightweight decoder enables real-time emotional speech generation through direct preference optimization.

→ The system leverages the O2S-300K and EO2S-9K datasets, containing 8000 hours of synthesized bilingual speech.
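
The language-pivot idea can be pictured with a short sketch. This is a minimal illustration under stated assumptions, not the paper's implementation: the projector shapes, the contrastive loss, the encoder dimensions, and the frozen-encoder stand-ins are all assumed for illustration. The point is that each modality is aligned to the LLM's text space separately, so image-speech alignment is never trained directly.

```python
# Minimal sketch of language-pivot alignment (illustrative, not the authors' code).
# Assumption: each non-text modality is aligned to the LLM text embedding space
# independently (image<->text, speech<->text), so image<->speech alignment
# emerges transitively without paired tri-modal data.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityProjector(nn.Module):
    """Maps a frozen modality encoder's features into the LLM text space."""
    def __init__(self, in_dim: int, text_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(in_dim, text_dim), nn.GELU(), nn.Linear(text_dim, text_dim)
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.proj(feats)

def pivot_alignment_loss(modal_emb, text_emb, temperature: float = 0.07):
    """Contrastive loss pulling modality embeddings toward their paired text embeddings."""
    modal_emb = F.normalize(modal_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = modal_emb @ text_emb.t() / temperature
    targets = torch.arange(len(logits), device=logits.device)
    return F.cross_entropy(logits, targets)

# In this sketch, alignment trains on image-text pairs and speech-text pairs
# separately; no image-speech pairs are needed because language is the shared pivot.
text_dim = 4096  # assumed LLM hidden size
image_proj = ModalityProjector(in_dim=1024, text_dim=text_dim)
speech_proj = ModalityProjector(in_dim=768, text_dim=text_dim)

image_feats = torch.randn(8, 1024)   # stand-in for frozen vision encoder features
speech_feats = torch.randn(8, 768)   # stand-in for frozen speech encoder features
text_emb = torch.randn(8, text_dim)  # stand-in for LLM text embeddings

loss = pivot_alignment_loss(image_proj(image_feats), text_emb) \
     + pivot_alignment_loss(speech_proj(speech_feats), text_emb)
loss.backward()
```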

-----

💡 Key Insights:

→ Language can effectively bridge the gap between the vision and speech modalities

→ Direct preference optimization stabilizes emotional speech generation (a loss sketch follows this list)

→ A mixture-of-experts module significantly improves bilingual speech generation
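
As a rough illustration of the preference-optimization step, here is a minimal direct preference optimization (DPO) loss over discrete speech-token sequences. The tiny decoder, the token vocabulary, and the beta value are assumptions made for the sketch; only the idea of preferring an emotionally appropriate rendition over a flat one comes from the summary above.

```python
# Minimal sketch of DPO on speech tokens (illustrative only; the decoder,
# vocabulary size, and beta are assumptions, not the paper's exact setup).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinySpeechDecoder(nn.Module):
    """Stand-in autoregressive decoder over discrete speech tokens."""
    def __init__(self, vocab: int = 1024, dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens):
        return self.head(self.embed(tokens))  # (batch, seq, vocab) next-token logits

def sequence_logprob(model, tokens):
    """Sum of per-token log-probabilities of a token sequence under the decoder."""
    logits = model(tokens[:, :-1])
    logp = F.log_softmax(logits, dim=-1)
    return logp.gather(-1, tokens[:, 1:].unsqueeze(-1)).squeeze(-1).sum(-1)

def dpo_loss(policy, reference, chosen, rejected, beta: float = 0.1):
    """Push the policy to prefer the expressive (chosen) speech over the flat (rejected) one."""
    margin = (sequence_logprob(policy, chosen) - sequence_logprob(reference, chosen).detach()) \
           - (sequence_logprob(policy, rejected) - sequence_logprob(reference, rejected).detach())
    return -F.logsigmoid(beta * margin).mean()

policy, reference = TinySpeechDecoder(), TinySpeechDecoder()
chosen = torch.randint(0, 1024, (4, 50))    # preferred: emotionally appropriate speech tokens
rejected = torch.randint(0, 1024, (4, 50))  # dispreferred: flat or mismatched speech tokens
loss = dpo_loss(policy, reference, chosen, rejected)
loss.backward()
```

In a setup like this, only the small speech decoder is preference-tuned, which fits the claim above that a lightweight decoder keeps generation real-time while DPO stabilizes the emotional quality.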

-----

📊 Results:

→ Outperforms VITA by 4 points on OmniBench, using 7B parameters versus 56B

→ Achieves real-time speech generation with <1 second latency for 30-second outputs

→ Improves emotional speech accuracy by 7.6% through preference optimization