GLM-4-Voice introduces an intelligent, human-like end-to-end spoken chatbot that supports real-time voice conversations in Chinese and English. It can adapt vocal nuances like emotion, intonation, speech rate, and dialect based on user instructions.
-----
https://arxiv.org/abs/2412.02612v1
🤔 Original Problem:
Existing spoken chatbots often rely on separate ASR, LLM, and TTS components, leading to high latency and limited expressiveness. Direct speech-to-speech models lack the intelligence of text-based LLMs due to limited speech training data.
-----
💡 Solution in this Paper:
→ GLM-4-Voice uses a 12.5 Hz single-codebook speech tokenizer derived from an ASR model, achieving an ultra-low 175 bps bitrate (see the bitrate check after this list).
→ It employs a flow-matching-based speech decoder for natural speech synthesis.
→ The model is pre-trained on 1 trillion tokens of speech-text data, including synthetic interleaved speech-text corpora.
→ A "streaming thoughts" template alternates between text and speech token generation during fine-tuning.
→ This approach enables low-latency, expressive speech generation while maintaining high intelligence.
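The 175 bps figure falls out directly from the token rate and the codebook size. A quick back-of-the-envelope check in Python, assuming a single codebook with 16,384 entries (the codebook size is an assumption here, not stated in this summary):

```python
import math

# Assumed values: 12.5 speech tokens per second (stated above) and a
# 16,384-entry single codebook (assumption), i.e. 14 bits per token.
token_rate_hz = 12.5
codebook_size = 16_384

bits_per_token = math.log2(codebook_size)      # 14 bits
bitrate_bps = token_rate_hz * bits_per_token   # 12.5 * 14 = 175 bps

print(f"{bits_per_token:.0f} bits/token -> {bitrate_bps:.0f} bps")
```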
-----
🔑 Key Insights from this Paper:
→ Bridging text and speech modalities through large-scale pre-training can enhance spoken chatbot capabilities.
→ A single-codebook speech tokenizer balances semantic preservation and speech quality.
→ Interleaved speech-text data facilitates cross-modal knowledge transfer.
→ The "streaming thoughts" template enables low-latency speech generation without sacrificing quality.
-----
📊 Results:
→ Outperforms baselines on speech language modeling and spoken question answering tasks
→ Achieves 93.6% accuracy on Topic-StoryCloze (S→T) vs 88.6% for Spirit-LM
→ Scores 5.40/10 on General QA ChatGPT evaluation vs 3.50 for Llama-Omni
→ Demonstrates 4.45 UTMOS speech quality score, surpassing baselines