"Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis"
The podcast below on this paper was generated with Google's Illuminate.
https://arxiv.org/abs/2502.04128
The paper addresses the challenge of scaling compute for Text-to-Speech (TTS) systems, which typically rely on complex multi-stage pipelines rather than the simple, scalable architecture of text LLMs. It asks how to build a TTS framework as simple and scalable as a text LLM, and investigates the effects of scaling train-time and inference-time compute for speech synthesis.
This paper proposes Llasa, a streamlined TTS framework that pairs a single Transformer with a single-VQ speech codec (X-codec2), directly inspired by the simplicity of text LLMs, to study compute scaling in TTS.
-----
📌 Llasa simplifies Text-to-Speech by adopting a single Transformer, mirroring text LLM design. This contrasts with complex multi-stage TTS systems. The unified architecture enables direct scaling explorations.
📌 X-codec2 is key. It bridges the gap between raw speech and Transformer-based LLMs by turning waveforms into discrete speech tokens. This tokenization enables end-to-end training and inference, much like standard NLP tasks (a minimal sketch follows this list).
📌 The paper demonstrates that scaling compute, both train-time and inference-time, directly enhances TTS quality in Llasa. Inference scaling with verifiers offers a novel way to refine synthesized speech output.
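To make the tokenization idea concrete, here is a minimal Python sketch of how text tokens and X-codec2 speech tokens can be packed into one sequence for a single Transformer. The special-token IDs, the ID offset, and the loss-mask convention are illustrative assumptions, not the paper's actual vocabulary or API.

```python
# Minimal sketch of treating TTS as next-token prediction over one concatenated
# sequence of text tokens and X-codec2 speech tokens. The IDs below are assumed
# for illustration, not the actual Llasa vocabulary.

SPEECH_START_ID = 256_000   # assumed special token marking the start of speech codes
SPEECH_END_ID = 256_001     # assumed special token marking the end of speech codes
SPEECH_OFFSET = 128_000     # assumed offset shifting codec IDs into their own range


def build_tts_sequence(text_ids: list[int], speech_codes: list[int]) -> tuple[list[int], list[int]]:
    """Concatenate text tokens and codec tokens into one LM sequence.

    Returns the token sequence plus a loss mask that supervises only the
    speech span, so the text acts purely as conditioning.
    """
    seq = list(text_ids) + [SPEECH_START_ID] + [c + SPEECH_OFFSET for c in speech_codes] + [SPEECH_END_ID]
    loss_mask = [0] * (len(text_ids) + 1) + [1] * (len(speech_codes) + 1)
    return seq, loss_mask


# Toy usage with made-up IDs: 5 text tokens and 8 codec tokens.
seq, mask = build_tts_sequence([101, 205, 77, 9, 42], [3, 17, 17, 250, 8, 8, 64, 1])
print(len(seq), sum(mask))  # 15 tokens total, 9 supervised positions
```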
----------
Methods Explored in this Paper 🔧:
→ This paper introduces Llasa, a single-stage TTS model built upon the Transformer architecture, similar to text LLMs.
→ Llasa employs X-codec2, a novel speech tokenizer, to convert speech waveforms into discrete tokens, enabling the Transformer to process speech autoregressively.
→ X-codec2 uses separate encoders for semantic and acoustic features, fused through a single vector quantizer, and a Transformer-based decoder for waveform reconstruction without requiring extra information during decoding.
→ The research systematically investigates train-time compute scaling by varying both the model size (1B, 3B, 8B parameters) and training data size (80k, 160k, 250k hours) while keeping other factors constant.
→ For inference-time compute scaling, the paper explores using speech understanding models as verifiers in search strategies like Best-of-N and beam search, categorized as Output Reward Models (ORMs) and Process Reward Models (PRMs).
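As a concrete illustration of the simplest verifier-guided strategy, here is a hedged Python sketch of Best-of-N with an ORM-style verifier; `synthesize_candidate` and `orm_score` are hypothetical stand-ins for the Llasa sampler and a speech-understanding verifier (e.g., negative WER from an ASR model, or speaker similarity), not the paper's actual implementation.

```python
# Minimal sketch of Best-of-N inference-time scaling with an output-level
# verifier (ORM). Both helper functions below are dummy stand-ins.
import random


def synthesize_candidate(text: str, seed: int) -> list[int]:
    """Stand-in: sample one speech-token sequence from the TTS LM."""
    rng = random.Random(seed)
    return [rng.randrange(1024) for _ in range(50)]


def orm_score(text: str, speech_tokens: list[int]) -> float:
    """Stand-in Output Reward Model: higher is better."""
    return -abs(sum(speech_tokens) % 97 - 48) / 48.0  # dummy score in [-1, 0]


def best_of_n(text: str, n: int = 8) -> list[int]:
    """Sample n candidates with different seeds and keep the best-scoring one."""
    candidates = [synthesize_candidate(text, seed) for seed in range(n)]
    return max(candidates, key=lambda c: orm_score(text, c))


best = best_of_n("Hello there, how are you today?", n=16)
```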
-----
Key Insights 💡:
→ Scaling train-time compute, by increasing model size or training data, consistently improves speech naturalness and prosody accuracy in Llasa.
→ Larger models yield significant gains on texts that demand deeper understanding, such as emotion-laden lines and poetry, while more training data helps most on texts requiring broader lexical coverage, such as rare characters and foreign words.
→ Scaling inference-time compute through verifier-guided search strategies can enhance specific aspects of synthesized speech, such as speaker similarity and emotional expressiveness.
→ The partial PRM strategy, which applies PRM scoring to the initial generation steps and ORM scoring afterward, effectively balances speaker similarity and transcription accuracy (see the sketch after this list).
→ Inference-time scaling may offer a computationally efficient alternative to solely scaling model training for achieving high-quality TTS.
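The sketch below illustrates one plausible reading of the partial PRM strategy: PRM-guided pruning of partial hypotheses for the first few chunks, then ORM-based selection over completed utterances. All function names are hypothetical stubs, not the paper's code.

```python
# Minimal sketch of the partial-PRM idea under assumed interfaces:
# a Process Reward Model prunes partial hypotheses early on, after which
# generation continues and an Output Reward Model ranks complete utterances.
import random


def extend_chunk(prefix: list[int]) -> list[int]:
    """Stub: append one more chunk of speech tokens to a partial hypothesis."""
    return prefix + [random.randrange(1024) for _ in range(10)]


def prm_score(text: str, partial: list[int]) -> float:
    """Stub process reward on a partial hypothesis (e.g., speaker similarity so far)."""
    return random.random()


def orm_score(text: str, full: list[int]) -> float:
    """Stub output reward on a complete utterance (e.g., negative WER from ASR)."""
    return random.random()


def partial_prm_search(text: str, beam: int = 4, prm_chunks: int = 3, total_chunks: int = 6) -> list[int]:
    hyps: list[list[int]] = [[]]
    # Phase 1: PRM-guided beam search over the first few chunks.
    for _ in range(prm_chunks):
        expanded = [extend_chunk(h) for h in hyps for _ in range(beam)]
        expanded.sort(key=lambda h: prm_score(text, h), reverse=True)
        hyps = expanded[:beam]
    # Phase 2: finish each surviving hypothesis, then pick the winner with the ORM.
    for _ in range(total_chunks - prm_chunks):
        hyps = [extend_chunk(h) for h in hyps]
    return max(hyps, key=lambda h: orm_score(text, h))


tokens = partial_prm_search("An emotional line read in the prompt speaker's voice.")
```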
-----
Results 📊:
→ X-codec2 achieves state-of-the-art performance at 50 tokens per second, with a Word Error Rate of 2.47 and a Speaker Similarity score of 0.92, outperforming other codecs such as DAC and EnCodec at similar token rates.
→ Llasa-8B-250k achieves a Word Error Rate of 3.12 on the test-hard split of the Seed-TTS-Eval benchmark with chunking, demonstrating competitive performance with state-of-the-art TTS models.
→ With partial PRM and ORM (WER verifier), Llasa-8B-250k achieves a Character Error Rate of 0.47 on test-zh and a Word Error Rate of 1.39 on test-en of Seed-TTS-Eval, indicating significant improvement through inference-time scaling.