Speech-to-speech conversations get smoother when LLMs work with continuous tokens
Flow-Omni introduces continuous speech tokens for LLMs, replacing traditional discrete tokens to improve speech-to-speech conversations. The model uses flow matching with pretrained LLMs to generate continuous mel-spectrograms, eliminating the need for complex audio codecs and enhancing robustness across varied speech conditions.
-----
https://arxiv.org/abs/2412.04917
🎯 Original Problem:
→ Current speech LLMs rely on discrete speech tokens, which depend heavily on audio codecs built around residual quantization and require large-scale, diverse training data plus carefully designed quantizers
→ Discrete tokens lose robustness on high-pitched, noisy, or emotionally expressive speech, degrading reconstruction quality
-----
🔧 Solution in this Paper:
→ Flow-Omni directly predicts continuous mel-spectrograms using flow matching and a pretrained LLM
→ The architecture combines a Whisper encoder for speech understanding with a flow matching predictor for mel-spectrogram generation
→ Training jointly optimizes the flow matching loss and the LLM cross-entropy loss (see the training-step sketch after this list)
→ The model enables parallel sequence modeling for seamless speech-text interactions
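Below is a minimal PyTorch sketch of that joint objective: standard next-token cross-entropy on the text stream plus a rectified-flow-style conditional flow matching loss on mel frames. The identifiers (`flow_predictor`, `text_logits_head`) and tensor shapes are illustrative assumptions, not the paper's actual interfaces.

```python
import torch
import torch.nn.functional as F

def joint_training_step(flow_predictor, text_logits_head,
                        hidden_states, mel_target, text_target):
    """One step combining LM cross-entropy with flow matching.

    hidden_states: (B, T, D) LLM outputs conditioning the mel predictor
    mel_target:    (B, T, mel_dim) ground-truth mel-spectrogram frames
    text_target:   (B, T) next-token ids for the text stream
    """
    # --- text stream: standard next-token cross-entropy ---
    logits = text_logits_head(hidden_states)           # (B, T, vocab)
    ce_loss = F.cross_entropy(logits.flatten(0, 1), text_target.flatten())

    # --- speech stream: conditional flow matching on mel frames ---
    # Sample a random time t and interpolate noise -> data linearly;
    # the target velocity along this path is (x1 - x0).
    x1 = mel_target
    x0 = torch.randn_like(x1)
    t = torch.rand(x1.size(0), 1, 1, device=x1.device)  # (B, 1, 1)
    xt = (1 - t) * x0 + t * x1
    v_target = x1 - x0
    v_pred = flow_predictor(xt, t, hidden_states)       # predicted velocity
    fm_loss = F.mse_loss(v_pred, v_target)

    return ce_loss + fm_loss                            # jointly optimized
```

The key point is that both losses share the same LLM hidden states, so text and speech are optimized together rather than through a separate vocoder-token pipeline.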
-----
💡 Key Insights:
→ Continuous tokens eliminate quantization loss and codec dependencies
→ Flow matching enables diverse speech generation through controlled sampling (sketched after this list)
→ Two-stage training (modal alignment + fine-tuning) improves multi-modal capabilities
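At inference time, flow matching generates mel frames by integrating the learned velocity field from noise toward data, and starting from different noise samples yields diverse outputs. Here is a hedged sketch using a fixed-step Euler solver, reusing the hypothetical `flow_predictor` interface from the training sketch; the paper's actual solver and step count may differ.

```python
import torch

@torch.no_grad()
def sample_mel(flow_predictor, hidden_states, mel_dim=80, steps=10):
    """Generate mel frames conditioned on LLM hidden states."""
    B, T, _ = hidden_states.shape
    x = torch.randn(B, T, mel_dim, device=hidden_states.device)  # start from noise
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((B, 1, 1), i * dt, device=x.device)
        v = flow_predictor(x, t, hidden_states)  # velocity at (x, t)
        x = x + dt * v                           # Euler step toward data
    return x  # continuous mel-spectrogram, vocoded to waveform downstream
```

Fewer integration steps trade fidelity for latency, which is the "controlled sampling" knob noted above.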
-----
📊 Results:
→ Flow-Omni achieves a word error rate (WER) of 8.81 versus Mini-Omni's 10.84 (lower is better)
→ Demonstrates improved robustness across varied speech conditions without codec limitations