
"Continuous Speech Tokens Makes LLMs Robust Multi-Modality Learners"

The podcast on this paper is generated with Google's Illuminate.

Speech-to-speech conversations get smoother when LLMs work with continuous tokens

Flow-Omni introduces continuous speech tokens for LLMs, replacing traditional discrete tokens to improve speech-to-speech conversations. The model uses flow matching with pretrained LLMs to generate continuous mel-spectrograms, eliminating the need for complex audio codecs and enhancing robustness across varied speech conditions.

-----

https://arxiv.org/abs/2412.04917

🎯 Original Problem:

→ Current speech LLMs rely on discrete speech tokens produced by audio codecs with residual quantizers, which demand large-scale, diverse training data and carefully designed quantizer stacks

→ Discrete tokens lose robustness under high pitch, noise, and emotional speech, which degrades speech reconstruction quality

-----

🔧 Solution in this Paper:

→ Flow-Omni directly predicts continuous mel-spectrograms using flow matching and a pretrained LLM

→ The architecture combines a Whisper encoder for speech understanding with a flow matching predictor for mel-spectrogram generation

→ Training uses both flow matching loss and LLM cross-entropy loss for joint optimization

→ The model enables parallel sequence modeling for seamless speech-text interactions
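To make the training objective concrete, here is a minimal sketch of the conditional flow-matching loss on mel-spectrogram frames. This is an illustrative stand-in, not the paper's implementation: `predict_velocity` and the linear noise-to-data path are assumptions, and in Flow-Omni this term would be summed with the LLM's cross-entropy loss for joint optimization.

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_loss(predict_velocity, x1, rng):
    """Conditional flow-matching loss on a batch of mel frames.

    x1: target mel-spectrogram frames, shape (batch, mel_dim).
    predict_velocity(x_t, t): velocity estimate (a stand-in for the
    paper's flow matching predictor conditioned on LLM hidden states).
    """
    x0 = rng.standard_normal(x1.shape)       # Gaussian noise sample
    t = rng.uniform(size=(x1.shape[0], 1))   # random time in [0, 1]
    x_t = (1.0 - t) * x0 + t * x1            # linear interpolation path
    target_v = x1 - x0                       # target velocity along the path
    pred_v = predict_velocity(x_t, t)
    return np.mean((pred_v - target_v) ** 2)

# Toy "model": predicts zero velocity everywhere, so the loss is
# just the mean squared magnitude of the target velocities.
zero_model = lambda x_t, t: np.zeros_like(x_t)
x1 = rng.standard_normal((4, 80))            # 4 frames, 80 mel bins
loss = flow_matching_loss(zero_model, x1, rng)
```

In joint training, the total objective would be this regression loss plus the usual next-token cross-entropy on the text/speech token sequence.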

-----

💡 Key Insights:

→ Continuous tokens eliminate quantization loss and codec dependencies

→ Flow matching enables diverse speech generation through controlled sampling

→ Two-stage training (modal alignment + fine-tuning) improves multi-modal capabilities
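At inference time, a flow-matching model generates a mel frame by integrating the learned velocity field from noise to data; varying the starting noise yields diverse outputs. Below is a hedged sketch using fixed-step Euler integration with a hypothetical linear velocity field standing in for the trained predictor.

```python
import numpy as np

def sample_mel(velocity_fn, mel_dim=80, steps=50, seed=0):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 (data)
    with fixed-step Euler; velocity_fn stands in for the trained
    flow matching predictor."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(mel_dim)   # start from Gaussian noise
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)  # Euler step along the flow
    return x

# Toy velocity field pulling samples toward a fixed target spectrum;
# a real model would predict this field from LLM hidden states.
target = np.linspace(-1.0, 1.0, 80)
toy_field = lambda x, t: target - x
mel = sample_mel(toy_field, steps=50)
```

Because sampling starts from random noise, different seeds produce different but plausible mel frames, which is the controlled-diversity property noted above.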

-----

📊 Results:

→ Flow-Omni achieves 8.81 WER compared to Mini-Omni's 10.84 WER

→ Demonstrates improved robustness across varied speech conditions without codec limitations
