Speech-vision alignment achieved through a language intermediary, not massive datasets.
OpenOmni enables zero-shot omnimodal alignment using language as a pivot, while delivering real-time emotional speech synthesis without requiring extensive tri-modal data.
-----
https://arxiv.org/abs/2501.04561
🤔 Original Problem:
→ Current open-source omnimodal models struggle with the scarcity of high-quality tri-modal data and with real-time emotional speech generation, lagging behind commercial solutions.
-----
🛠️ Solution in this Paper:
→ OpenOmni introduces a two-stage training approach that combines omnimodal alignment with speech generation.
→ The model uses language as a pivot to achieve zero-shot alignment across modalities without requiring extensive tri-modal datasets (see the sketch after this list).
→ A lightweight decoder enables real-time emotional speech generation, trained with direct preference optimization.
→ The system leverages the O2S-300K and EO2S-9K datasets, which contain 8,000 hours of synthesized bilingual speech.
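A minimal PyTorch sketch of the language-pivot idea: vision and speech features are each projected into the same language embedding space using only image-text and speech-text pairs, so no tri-modal image-speech data is needed. Module names, dimensions, and the MSE alignment loss are illustrative assumptions, not the paper's actual architecture or objective.

```python
import torch
import torch.nn as nn

class ModalityProjector(nn.Module):
    """Maps encoder features from one modality into the LLM's embedding space."""
    def __init__(self, enc_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(enc_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.proj(features)

llm_dim = 4096                                                   # hypothetical LLM hidden size
vision_proj = ModalityProjector(enc_dim=1024, llm_dim=llm_dim)   # e.g. ViT features
speech_proj = ModalityProjector(enc_dim=512, llm_dim=llm_dim)    # e.g. audio encoder features

# Vision is aligned to text embeddings of captions on image-text pairs...
image_feats = torch.randn(8, 1024)          # dummy image encoder outputs
caption_embs = torch.randn(8, llm_dim)      # dummy caption embeddings from the LLM
vision_loss = nn.functional.mse_loss(vision_proj(image_feats), caption_embs)

# ...and speech is aligned to the same text space separately on speech-text pairs.
# The two modalities then meet "zero-shot" through the shared language representation.
speech_feats = torch.randn(8, 512)
transcript_embs = torch.randn(8, llm_dim)
speech_loss = nn.functional.mse_loss(speech_proj(speech_feats), transcript_embs)

(vision_loss + speech_loss).backward()
```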
-----
💡 Key Insights:
→ Language can effectively bridge the gap between the vision and speech modalities
→ Direct preference optimization stabilizes emotional speech generation (sketched below)
→ A mixture-of-experts module significantly improves bilingual speech generation
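A short sketch of the standard direct preference optimization loss referenced above, applied to preference pairs of speech outputs (a preferred, emotionally consistent response versus a rejected one). The function signature, dummy log-probabilities, and β value are illustrative assumptions, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen: torch.Tensor,
             policy_logp_rejected: torch.Tensor,
             ref_logp_chosen: torch.Tensor,
             ref_logp_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Push the policy to prefer the chosen (e.g. correct-emotion) speech output
    over the rejected one, relative to a frozen reference model."""
    chosen_ratio = policy_logp_chosen - ref_logp_chosen
    rejected_ratio = policy_logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Dummy sequence log-probabilities for a batch of 4 preference pairs.
loss = dpo_loss(
    policy_logp_chosen=torch.tensor([-10.2, -9.8, -11.0, -10.5]),
    policy_logp_rejected=torch.tensor([-12.1, -11.4, -12.8, -13.0]),
    ref_logp_chosen=torch.tensor([-10.9, -10.1, -11.5, -11.2]),
    ref_logp_rejected=torch.tensor([-11.8, -11.0, -12.2, -12.5]),
)
print(loss)
```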
-----
📊 Results:
→ Outperforms VITA by 4 points on OmniBench while using 7B parameters versus VITA's 56B
→ Achieves real-time speech generation with <1 second latency for 30-second outputs
→ Improves emotional speech accuracy by 7.6% through preference optimization