VITA-1.5 sees, listens, and talks back, all in real time and without extra modules.
VITA-1.5 introduces a three-stage training method that integrates vision and speech into the LLM while maintaining strong performance across modalities, enabling real-time multimodal interaction without separate speech processing modules.
-----
https://arxiv.org/abs/2501.01957
🤔 Original Problem:
Most multimodal LLMs focus only on vision and text, neglecting speech capabilities. Adding speech often degrades vision performance, and current systems require separate speech processing modules that increase latency.
-----
🔧 Solution in this Paper:
→ A three-stage progressive training strategy gradually incorporates different modalities to prevent interference (see the code sketch after this list).
→ Stage 1 focuses on vision-language training through visual adapters and fine-tuning with caption and QA data.
→ Stage 2 introduces audio input processing using speech-transcription pairs and speech QA data.
→ Stage 3 adds end-to-end speech generation without external Text-to-Speech modules.
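The staging amounts to a freeze/unfreeze schedule: each stage trains only the parts being added, so earlier modalities are not disturbed. Below is a minimal, hypothetical sketch of that idea (illustrative module names and stage assignments, not the authors' code):

```python
# Hypothetical sketch of three-stage progressive training (illustrative names only).
import torch
import torch.nn as nn

class MultimodalLLM(nn.Module):
    def __init__(self, d=512):
        super().__init__()
        self.vision_adapter = nn.Linear(d, d)   # projects vision features into the LLM space
        self.audio_adapter  = nn.Linear(d, d)   # projects speech features into the LLM space
        self.llm            = nn.Linear(d, d)   # stand-in for the language model backbone
        self.speech_decoder = nn.Linear(d, d)   # stand-in for end-to-end speech generation

def set_trainable(module, flag):
    for p in module.parameters():
        p.requires_grad = flag

def configure_stage(model, stage):
    # Freeze everything, then unfreeze only what the current stage introduces,
    # so a new modality is added without disturbing the ones trained earlier.
    for m in model.children():
        set_trainable(m, False)
    if stage == 1:        # vision-language: adapter + LLM on caption and QA data
        set_trainable(model.vision_adapter, True)
        set_trainable(model.llm, True)
    elif stage == 2:      # audio input: speech-transcription pairs and speech QA
        set_trainable(model.audio_adapter, True)
    elif stage == 3:      # speech output: end-to-end generation, no external TTS
        set_trainable(model.speech_decoder, True)

model = MultimodalLLM()
for stage in (1, 2, 3):
    configure_stage(model, stage)
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(trainable, lr=1e-4)
    # ... run this stage's training loop on its own data mixture ...
```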
-----
💡 Key Insights:
→ Progressive training prevents modality conflicts while maintaining performance
→ End-to-end speech processing reduces latency
→ Single codebook design simplifies speech decoding
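To see why a single codebook helps: the LLM only has to emit one stream of discrete speech-token IDs, and decoding becomes a single embedding lookup feeding an audio decoder, with no interleaving across multiple codebooks. A rough sketch with illustrative components (not VITA-1.5's actual modules):

```python
# Illustrative single-codebook speech decoding (assumed component names).
import torch
import torch.nn as nn

CODEBOOK_SIZE, D = 1024, 256

codebook = nn.Embedding(CODEBOOK_SIZE, D)      # one table of speech-unit embeddings
decoder  = nn.GRU(D, D, batch_first=True)      # stand-in for the audio decoder front-end

speech_token_ids = torch.randint(0, CODEBOOK_SIZE, (1, 50))   # one token stream from the LLM
units = codebook(speech_token_ids)             # (1, 50, D): a single lookup, no codebook interleaving
features, _ = decoder(units)                   # decoder maps units toward waveform features
```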
-----
📊 Results:
→ Matches leading open-source models in vision tasks, surpassing GPT-4V in some benchmarks
→ Achieves 2.2% Character Error Rate for Mandarin and 3.3% Word Error Rate for English ASR
→ Maintains vision capabilities after adding speech functionality
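For context on those metrics, CER and WER count edit operations (substitutions, insertions, deletions) over characters or words, divided by the reference length. A generic implementation, not the paper's evaluation code:

```python
# Generic CER/WER computation via Levenshtein distance (not the paper's eval harness).
def edit_distance(ref, hyp):
    # Dynamic-programming edit distance between two sequences.
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (r != h))  # substitution (0 cost if equal)
    return dp[-1]

def cer(ref, hyp):
    return edit_distance(list(ref), list(hyp)) / max(len(ref), 1)

def wer(ref, hyp):
    return edit_distance(ref.split(), hyp.split()) / max(len(ref.split()), 1)

print(wer("the cat sat on the mat", "the cat sit on mat"))  # 2 errors / 6 words ≈ 0.33
```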
------
Are you into AI and LLMs❓ Join my daily AI newsletter. I will send you 7 emails a week analyzing the highest signal AI developments. ↓↓
🎉 https://rohanpaul.substack.com/