
"Emilia: A Large-Scale, Extensive, Multilingual, and Diverse Dataset for Speech Generation"

The podcast below was generated with Google's Illuminate.

Existing speech datasets are primarily based on audiobooks with formal reading styles.

This limits the ability of speech generation models to produce natural, spontaneous, in-the-wild speech. This paper introduces the Emilia-Pipe preprocessing pipeline and the Emilia dataset to overcome this limitation.

-----

Paper - https://arxiv.org/abs/2501.15907

Original Problem 🤔:

→ Current speech generation models rely on audiobook datasets.

→ Audiobook datasets contain formal read-aloud speech.

→ Real-world human speech is spontaneous and diverse.

→ Existing preprocessing pipelines are proprietary and monolingual.

-----

Solution in this Paper 💡:

→ The paper proposes Emilia-Pipe, an open-source preprocessing pipeline.

→ Emilia-Pipe extracts high-quality speech data from in-the-wild sources.

→ The pipeline has six steps: standardization, source separation, speaker diarization, fine-grained segmentation via voice activity detection (VAD), automatic speech recognition (ASR), and filtering (a code sketch of how these stages chain together follows after this list).

→ Emilia-Pipe efficiently processes multilingual speech data.

→ Using Emilia-Pipe, the authors created Emilia, a multilingual speech generation dataset.

→ Emilia contains over 101k hours of speech in six languages.

→ They further expanded it to Emilia-Large, with over 216k hours, making it the largest open-source speech generation dataset to date.
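
To make the six stages concrete, here is a minimal sketch of an Emilia-Pipe-style chain. It is not the authors' implementation: every helper callable (standardize, separate_vocals, diarize, vad_segment, transcribe, score) and the 3.0 quality threshold are hypothetical placeholders standing in for the dedicated tools used at each stage.

```python
# Minimal sketch of an Emilia-Pipe-style preprocessing chain (not the authors' code).
# Every callable passed in is a hypothetical placeholder for the real tool behind
# each stage; the 3.0 quality threshold is illustrative only.
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class Segment:
    audio: bytes    # waveform for one utterance
    speaker: str    # diarization label, e.g. "SPEAKER_00"
    text: str       # ASR transcript
    quality: float  # estimated speech quality (e.g. a DNSMOS-like score)

def emilia_style_pipeline(
    raw_path: str,
    standardize: Callable[[str], bytes],             # 1. convert/resample to a uniform format
    separate_vocals: Callable[[bytes], bytes],       # 2. source separation: strip music/noise
    diarize: Callable[[bytes], Iterable[bytes]],     # 3. speaker diarization: one speaker per turn
    vad_segment: Callable[[bytes], Iterable[bytes]], # 4. VAD: cut turns into short utterances
    transcribe: Callable[[bytes], str],              # 5. automatic speech recognition
    score: Callable[[bytes], float],                 #    quality estimation for filtering
    min_quality: float = 3.0,                        # 6. illustrative filtering threshold
) -> list[Segment]:
    """Run the six stages in order and keep only non-empty, good-quality segments."""
    wav = standardize(raw_path)
    vocals = separate_vocals(wav)
    kept: list[Segment] = []
    for spk_idx, turn in enumerate(diarize(vocals)):
        for utt in vad_segment(turn):
            text = transcribe(utt)
            quality = score(utt)
            if text.strip() and quality >= min_quality:  # filtering: drop empty/low-quality clips
                kept.append(Segment(utt, f"SPEAKER_{spk_idx:02d}", text, quality))
    return kept
```

In practice each placeholder would be backed by a dedicated model (separation, diarization, VAD, ASR, quality scoring), which is what lets the pipeline run unattended over large in-the-wild audio collections.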

-----

Key Insights from this Paper 🔑:

→ In-the-wild speech data is crucial for spontaneous speech generation.

→ An open-source pipeline can effectively process noisy, real-world speech data.

→ Larger datasets improve speech generation model performance.

→ Multilingual datasets enable crosslingual speech generation.

-----

Results 📈:

→ Emilia achieves a DNSMOS score of 3.26, comparable to audiobook datasets.

→ Emilia-Pipe processes 600 hours of raw data in 3.99 hours, i.e., roughly 2.5 hours of audio per minute (see the quick check after this list).

→ Models trained on Emilia outperform those trained on audiobook datasets for spontaneous speech generation.

→ Data scaling experiments show improved performance up to 100k hours of training data.
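
The throughput figure follows directly from the two reported numbers; the snippet below simply reproduces the arithmetic.

```python
# Quick arithmetic check on the reported throughput: 600 hours of raw audio in 3.99 hours.
raw_hours = 600.0
wall_clock_hours = 3.99

speedup = raw_hours / wall_clock_hours                  # ~150x faster than real time
hours_per_minute = raw_hours / (wall_clock_hours * 60)  # audio hours per wall-clock minute

print(f"speed-up over real time: {speedup:.1f}x")                          # ≈ 150.4x
print(f"audio processed per wall-clock minute: {hours_per_minute:.2f} h")  # ≈ 2.51 h
```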
