GLM-4-Voice is an intelligent, human-like end-to-end spoken chatbot that supports real-time voice conversations in Chinese and English. It can adapt vocal nuances such as emotion, intonation, speech rate, and dialect according to user instructions.
-----
https://arxiv.org/abs/2412.02612v1
Original Problem:
Existing spoken chatbots often rely on separate ASR, LLM, and TTS components, leading to high latency and limited expressiveness. Direct speech-to-speech models lack the intelligence of text-based LLMs due to limited speech training data.
-----
Solution in this Paper:
→ GLM-4-Voice uses a 12.5 Hz single-codebook speech tokenizer derived from an ASR model, achieving an ultra-low bitrate of 175 bps (a quick bitrate check follows this list).
→ It employs a flow-matching-based speech decoder for natural speech synthesis.
→ The model is pre-trained on 1 trillion tokens of speech-text data, including synthetic interleaved speech-text corpora (see the data-construction sketch after this list).
→ A "streaming thoughts" template alternates between text and speech token generation during fine-tuning.
→ This approach enables low-latency, expressive speech generation while maintaining high intelligence.
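To make the bitrate claim concrete, here is a back-of-the-envelope check. The 16,384-entry codebook size is inferred from the stated 12.5 Hz token rate and 175 bps bitrate, not quoted from the paper:

```python
import math

# Back-of-the-envelope check of the tokenizer bitrate claim.
# Assumption: a single codebook of 16384 entries (2^14), inferred from
# 175 bps / 12.5 tokens-per-second = 14 bits per token.
token_rate_hz = 12.5          # speech tokens emitted per second
codebook_size = 16384         # assumed single-codebook vocabulary size
bits_per_token = math.log2(codebook_size)        # 14 bits
bitrate_bps = token_rate_hz * bits_per_token     # 175 bps

print(f"{bits_per_token:.0f} bits/token -> {bitrate_bps:.0f} bps")
```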
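One plausible way to construct synthetic interleaved speech-text data is to replace random spans of a text corpus with the speech tokens obtained by synthesizing and then tokenizing those spans. The helpers below are hypothetical placeholders; the paper's actual pipeline may differ:

```python
import random
from typing import List, Union

def synthesize_and_tokenize(span: List[str]) -> List[int]:
    """Placeholder: synthesize the span, then run the 12.5 Hz speech tokenizer."""
    return [hash(w) % 16384 for w in span]  # fake speech-token ids for illustration

def interleave(text_tokens: List[str], span_len: int = 8, p_speech: float = 0.3) -> List[Union[str, int]]:
    """Walk the document in fixed-size spans; convert some spans to speech tokens."""
    mixed: List[Union[str, int]] = []
    for i in range(0, len(text_tokens), span_len):
        span = text_tokens[i:i + span_len]
        if random.random() < p_speech:
            mixed.extend(synthesize_and_tokenize(span))  # speech-token span
        else:
            mixed.extend(span)                           # keep the text span
    return mixed

example = interleave("the quick brown fox jumps over the lazy dog".split())
print(example)
```

Training on such mixed sequences exposes the model to text-speech alignments far beyond what paired speech data alone provides.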
-----
Key Insights from this Paper:
→ Bridging text and speech modalities through large-scale pre-training can enhance spoken chatbot capabilities.
→ A single-codebook speech tokenizer balances semantic preservation and speech quality.
→ Interleaved speech-text data facilitates cross-modal knowledge transfer.
→ The "streaming thoughts" template enables low-latency speech generation without sacrificing quality (a sketch of the decoding loop follows this list).
-----
Results:
→ Outperforms baselines on speech language modeling and spoken question answering tasks.
→ Achieves 93.6% accuracy on Topic-StoryCloze (S→T) vs. 88.6% for Spirit-LM.
→ Scores 5.40/10 on the General QA ChatGPT evaluation vs. 3.50 for Llama-Omni.
→ Achieves a UTMOS speech-quality score of 4.45, surpassing baselines.