GLM-4-Voice is an intelligent, human-like end-to-end spoken chatbot that supports real-time voice conversations in Chinese and English. It can adapt vocal nuances such as emotion, intonation, speech rate, and dialect according to user instructions.
-----
https://arxiv.org/abs/2412.02612v1
Original Problem:
Existing spoken chatbots often rely on separate ASR, LLM, and TTS components, leading to high latency and limited expressiveness. Direct speech-to-speech models lack the intelligence of text-based LLMs due to limited speech training data.
-----
Solution in this Paper:
→ GLM-4-Voice uses a 12.5 Hz single-codebook speech tokenizer derived from an ASR model, achieving an ultra-low bitrate of 175 bps (a quick bitrate check follows this list).
→ It employs a flow-matching-based speech decoder for natural speech synthesis.
→ The model is pre-trained on 1 trillion tokens of speech-text data, including synthetic interleaved speech-text corpora (see the data-construction sketch after this list).
→ A "streaming thoughts" template alternates between text and speech token generation during fine-tuning.
→ This approach enables low-latency, expressive speech generation while maintaining high intelligence.
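To make the bitrate claim concrete, here is a back-of-the-envelope check. The 16,384-entry codebook size is inferred from the stated 12.5 Hz token rate and 175 bps bitrate, not quoted from the paper:

```python
import math

# Back-of-the-envelope check of the tokenizer bitrate claim.
# Assumption: a single codebook of 16384 entries (2^14), inferred from
# 175 bps / 12.5 tokens-per-second = 14 bits per token.
token_rate_hz = 12.5          # speech tokens emitted per second
codebook_size = 16384         # assumed single-codebook vocabulary size
bits_per_token = math.log2(codebook_size)        # 14 bits
bitrate_bps = token_rate_hz * bits_per_token     # 175 bps

print(f"{bits_per_token:.0f} bits/token -> {bitrate_bps:.0f} bps")
```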
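One plausible way to construct synthetic interleaved speech-text data is to replace random spans of a text corpus with the speech tokens obtained by synthesizing and then tokenizing those spans. The helpers below are hypothetical placeholders; the paper's actual pipeline may differ:

```python
import random
from typing import List, Union

def synthesize_and_tokenize(span: List[str]) -> List[int]:
    """Placeholder: synthesize the span, then run the 12.5 Hz speech tokenizer."""
    return [hash(w) % 16384 for w in span]  # fake speech-token ids for illustration

def interleave(text_tokens: List[str], span_len: int = 8, p_speech: float = 0.3) -> List[Union[str, int]]:
    """Walk the document in fixed-size spans; convert some spans to speech tokens."""
    mixed: List[Union[str, int]] = []
    for i in range(0, len(text_tokens), span_len):
        span = text_tokens[i:i + span_len]
        if random.random() < p_speech:
            mixed.extend(synthesize_and_tokenize(span))  # speech-token span
        else:
            mixed.extend(span)                           # keep the text span
    return mixed

example = interleave("the quick brown fox jumps over the lazy dog".split())
print(example)
```

Training on such mixed sequences exposes the model to text-speech alignments far beyond what paired speech data alone provides.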
-----
Key Insights from this Paper:
→ Bridging text and speech modalities through large-scale pre-training can enhance spoken chatbot capabilities.
→ A single-codebook speech tokenizer balances semantic preservation and speech quality.
→ Interleaved speech-text data facilitates cross-modal knowledge transfer.
→ The "streaming thoughts" template enables low-latency speech generation without sacrificing quality (a sketch of the decoding loop follows this list).
-----
Results:
→ Outperforms baselines on speech language modeling and spoken question answering tasks.
→ Achieves 93.6% accuracy on Topic-StoryCloze (S→T) vs. 88.6% for Spirit-LM.
→ Scores 5.40/10 on the General QA ChatGPT evaluation vs. 3.50 for Llama-Omni.
→ Achieves a UTMOS speech-quality score of 4.45, surpassing baselines.