"Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis"
The podcast below on this paper was generated with Google's Illuminate.
https://arxiv.org/abs/2502.04128
The paper addresses the challenge of scaling compute for Text-to-Speech (TTS) systems, which typically rely on complex multi-stage pipelines rather than the simple, scalable architecture of text LLMs. It asks how to build a TTS framework as simple and scalable as a text LLM, and investigates the effects of scaling train-time and inference-time compute for speech synthesis.
This paper proposes Llasa, a streamlined TTS framework that pairs a single Transformer with a single-VQ speech codec (X-codec2), directly inspired by the simplicity of text LLMs, to study compute scaling in TTS.
-----
📌 Llasa simplifies Text-to-Speech by adopting a single Transformer, mirroring text LLM design. This contrasts with complex multi-stage TTS systems. The unified architecture enables direct scaling explorations.
📌 X-codec2 is key. It bridges the gap between raw speech and Transformer-based LLMs by turning waveforms into discrete speech tokens. This tokenization enables end-to-end training and inference, much like standard NLP tasks (a minimal sketch follows this list).
📌 The paper demonstrates that scaling compute, both train-time and inference-time, directly enhances TTS quality in Llasa. Inference scaling with verifiers offers a novel way to refine synthesized speech output.
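To make the tokenization idea concrete, here is a minimal Python sketch of how text tokens and X-codec2 speech tokens can be packed into one sequence for a single Transformer. The special-token IDs, the ID offset, and the loss-mask convention are illustrative assumptions, not the paper's actual vocabulary or API.

```python
# Minimal sketch of treating TTS as next-token prediction over one concatenated
# sequence of text tokens and X-codec2 speech tokens. The IDs below are assumed
# for illustration, not the actual Llasa vocabulary.

SPEECH_START_ID = 256_000   # assumed special token marking the start of speech codes
SPEECH_END_ID = 256_001     # assumed special token marking the end of speech codes
SPEECH_OFFSET = 128_000     # assumed offset shifting codec IDs into their own range


def build_tts_sequence(text_ids: list[int], speech_codes: list[int]) -> tuple[list[int], list[int]]:
    """Concatenate text tokens and codec tokens into one LM sequence.

    Returns the token sequence plus a loss mask that supervises only the
    speech span, so the text acts purely as conditioning.
    """
    seq = list(text_ids) + [SPEECH_START_ID] + [c + SPEECH_OFFSET for c in speech_codes] + [SPEECH_END_ID]
    loss_mask = [0] * (len(text_ids) + 1) + [1] * (len(speech_codes) + 1)
    return seq, loss_mask


# Toy usage with made-up IDs: 5 text tokens and 8 codec tokens.
seq, mask = build_tts_sequence([101, 205, 77, 9, 42], [3, 17, 17, 250, 8, 8, 64, 1])
print(len(seq), sum(mask))  # 15 tokens total, 9 supervised positions
```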
----------
Methods Explored in this Paper 🔧:
→ This paper introduces Llasa, a single-stage TTS model built upon the Transformer architecture, similar to text LLMs.
→ Llasa employs X-codec2, a novel speech tokenizer, to convert speech waveforms into discrete tokens, enabling the Transformer to process speech autoregressively.
→ X-codec2 uses separate encoders for semantic and acoustic features, fused through a single vector quantizer, and a Transformer-based decoder for waveform reconstruction without requiring extra information during decoding.
→ The research systematically investigates train-time compute scaling by varying both the model size (1B, 3B, 8B parameters) and training data size (80k, 160k, 250k hours) while keeping other factors constant.
→ For inference-time compute scaling, the paper explores using speech understanding models as verifiers in search strategies like Best-of-N and beam search, categorized as Output Reward Models (ORMs) and Process Reward Models (PRMs).
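As a concrete illustration of the simplest verifier-guided strategy, here is a hedged Python sketch of Best-of-N with an ORM-style verifier; `synthesize_candidate` and `orm_score` are hypothetical stand-ins for the Llasa sampler and a speech-understanding verifier (e.g., negative WER from an ASR model, or speaker similarity), not the paper's actual implementation.

```python
# Minimal sketch of Best-of-N inference-time scaling with an output-level
# verifier (ORM). Both helper functions below are dummy stand-ins.
import random


def synthesize_candidate(text: str, seed: int) -> list[int]:
    """Stand-in: sample one speech-token sequence from the TTS LM."""
    rng = random.Random(seed)
    return [rng.randrange(1024) for _ in range(50)]


def orm_score(text: str, speech_tokens: list[int]) -> float:
    """Stand-in Output Reward Model: higher is better."""
    return -abs(sum(speech_tokens) % 97 - 48) / 48.0  # dummy score in [-1, 0]


def best_of_n(text: str, n: int = 8) -> list[int]:
    """Sample n candidates with different seeds and keep the best-scoring one."""
    candidates = [synthesize_candidate(text, seed) for seed in range(n)]
    return max(candidates, key=lambda c: orm_score(text, c))


best = best_of_n("Hello there, how are you today?", n=16)
```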
-----
Key Insights 💡:
→ Scaling train-time compute, by increasing model size or training data, consistently improves speech naturalness and prosody accuracy in Llasa.
→ Larger models yield significant gains on texts that demand deeper understanding, such as emotion-laden lines and poetry, while more training data helps most on texts requiring broader lexical coverage, such as rare characters and foreign words.
→ Scaling inference-time compute through verifier-guided search strategies can enhance specific aspects of synthesized speech, such as speaker similarity and emotional expressiveness.
→ The partial PRM strategy, which applies PRM scoring to the initial generation steps and ORM scoring afterward, effectively balances speaker similarity and transcription accuracy (see the sketch after this list).
→ Inference-time scaling may offer a computationally efficient alternative to solely scaling model training for achieving high-quality TTS.
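The sketch below illustrates one plausible reading of the partial PRM strategy: PRM-guided pruning of partial hypotheses for the first few chunks, then ORM-based selection over completed utterances. All function names are hypothetical stubs, not the paper's code.

```python
# Minimal sketch of the partial-PRM idea under assumed interfaces:
# a Process Reward Model prunes partial hypotheses early on, after which
# generation continues and an Output Reward Model ranks complete utterances.
import random


def extend_chunk(prefix: list[int]) -> list[int]:
    """Stub: append one more chunk of speech tokens to a partial hypothesis."""
    return prefix + [random.randrange(1024) for _ in range(10)]


def prm_score(text: str, partial: list[int]) -> float:
    """Stub process reward on a partial hypothesis (e.g., speaker similarity so far)."""
    return random.random()


def orm_score(text: str, full: list[int]) -> float:
    """Stub output reward on a complete utterance (e.g., negative WER from ASR)."""
    return random.random()


def partial_prm_search(text: str, beam: int = 4, prm_chunks: int = 3, total_chunks: int = 6) -> list[int]:
    hyps: list[list[int]] = [[]]
    # Phase 1: PRM-guided beam search over the first few chunks.
    for _ in range(prm_chunks):
        expanded = [extend_chunk(h) for h in hyps for _ in range(beam)]
        expanded.sort(key=lambda h: prm_score(text, h), reverse=True)
        hyps = expanded[:beam]
    # Phase 2: finish each surviving hypothesis, then pick the winner with the ORM.
    for _ in range(total_chunks - prm_chunks):
        hyps = [extend_chunk(h) for h in hyps]
    return max(hyps, key=lambda h: orm_score(text, h))


tokens = partial_prm_search("An emotional line read in the prompt speaker's voice.")
```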
-----
Results 📊:
→ X-codec2 achieves state-of-the-art performance at 50 tokens per second, with a Word Error Rate of 2.47 and a Speaker Similarity score of 0.92, outperforming other codecs such as DAC and EnCodec at similar token rates.
→ Llasa-8B-250k achieves a Word Error Rate of 3.12 on the test-hard split of the Seed-TTS-Eval benchmark with chunking, demonstrating competitive performance with state-of-the-art TTS models.
→ With partial PRM and ORM (WER verifier), Llasa-8B-250k achieves a Character Error Rate of 0.47 on test-zh and a Word Error Rate of 1.39 on test-en of Seed-TTS-Eval, indicating significant improvement through inference-time scaling.