Existing speech datasets are primarily based on audiobooks with formal reading styles.
This limits the ability of speech generation models to produce natural, in-the-wild speech. This paper introduces Emilia-Pipe and the Emilia dataset to overcome this limitation.
-----
Paper - https://arxiv.org/abs/2501.15907
Original Problem 🤔:
→ Current speech generation models rely on audiobook datasets.
→ Audiobook datasets contain formal read-aloud speech.
→ Real-world human speech is spontaneous and diverse.
→ Existing preprocessing pipelines are proprietary and monolingual.
-----
Solution in this Paper 💡:
→ The paper proposes Emilia-Pipe, an open-source preprocessing pipeline.
→ Emilia-Pipe extracts high-quality speech data from in-the-wild sources.
→ The pipeline has six steps: standardization, source separation, speaker diarization, segmentation by voice activity detection (VAD), automatic speech recognition (ASR), and filtering.
→ Emilia-Pipe efficiently processes multilingual speech data.
→ Using Emilia-Pipe, the authors created Emilia, a multilingual speech generation dataset.
→ Emilia contains over 101k hours of speech in six languages.
→ They further expanded it to Emilia-Large, which has 216k hours and is presented as the largest open-source speech generation dataset.
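
The six steps listed above can be pictured as a chain of stages, each taking a list of audio segments and returning a new list. The sketch below is only an illustration of that structure, not the authors' code: the `Segment` fields, the `placeholder` stages, and the filter thresholds (`min_quality`, `min_dur`, `max_dur`) are all hypothetical names and values chosen for the example. In a real pipeline each placeholder would call an actual separation, diarization, VAD, or ASR model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Segment:
    """One stretch of audio plus the metadata the stages attach to it (hypothetical schema)."""
    source: str
    start: float            # seconds
    end: float              # seconds
    speaker: str = "unknown"
    text: str = ""
    quality: float = 0.0    # stand-in for a DNSMOS-style quality score


Stage = Callable[[List[Segment]], List[Segment]]


def placeholder(name: str) -> Stage:
    """Return a pass-through stage; a real stage would run an actual model here."""
    def stage(segments: List[Segment]) -> List[Segment]:
        return list(segments)
    stage.__name__ = name
    return stage


def make_filter(min_quality: float, min_dur: float, max_dur: float) -> Stage:
    """Keep segments that meet the quality and duration bounds (illustrative thresholds)."""
    def filtering(segments: List[Segment]) -> List[Segment]:
        return [
            s for s in segments
            if s.quality >= min_quality and min_dur <= (s.end - s.start) <= max_dur
        ]
    return filtering


# Stage order follows the six steps named in the summary.
STAGES: List[Stage] = [
    placeholder("standardization"),
    placeholder("source_separation"),
    placeholder("speaker_diarization"),
    placeholder("vad_segmentation"),
    placeholder("speech_recognition"),
    make_filter(min_quality=3.0, min_dur=3.0, max_dur=30.0),  # assumed values
]


def run_pipeline(segments: List[Segment], stages: List[Stage] = STAGES) -> List[Segment]:
    """Apply each stage in order, feeding its output to the next."""
    for stage in stages:
        segments = stage(segments)
    return segments


if __name__ == "__main__":
    demo = [
        Segment("a.wav", 0.0, 5.0, quality=3.4),   # kept
        Segment("a.wav", 5.0, 6.0, quality=3.5),   # too short, dropped
        Segment("b.wav", 0.0, 10.0, quality=2.1),  # low quality, dropped
    ]
    for seg in run_pipeline(demo):
        print(seg)
```

Keeping every stage behind the same function signature means a stage can be swapped or reordered without touching the rest of the chain, which is one plausible reason to structure a multilingual pipeline this way.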
-----
Key Insights from this Paper 🔑:
→ In-the-wild speech data is crucial for spontaneous speech generation.
→ An open-source pipeline can effectively process noisy, real-world speech data.
→ Larger datasets improve speech generation model performance.
→ Multilingual datasets enable crosslingual speech generation.
-----
Results 📈:
→ Emilia achieves a DNSMOS score of 3.26, comparable to audiobook datasets.
→ Emilia-Pipe processes 600 hours of data in 3.99 hours, a rate of about 2.5 hours of data per minute.
→ Models trained on Emilia outperform those trained on audiobook datasets for spontaneous speech generation.
→ Data scaling experiments show improved performance up to 100k hours of training data.
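
As a quick check of the throughput figure in the results above, the arithmetic can be reproduced in a few lines. This uses only the two numbers quoted in the summary (600 hours of data, 3.99 hours of processing time); the function name `throughput` is just a label for this example.

```python
def throughput(hours_of_audio: float, wall_clock_hours: float) -> tuple:
    """Return (audio hours per wall-clock hour, audio hours per wall-clock minute)."""
    per_hour = hours_of_audio / wall_clock_hours
    per_minute = per_hour / 60.0
    return per_hour, per_minute


per_hour, per_minute = throughput(600, 3.99)
print(f"{per_hour:.1f} hours of audio per wall-clock hour")      # 150.4
print(f"{per_minute:.1f} hours of audio per wall-clock minute")  # 2.5
```

In other words, the pipeline runs roughly 150 times faster than real time on the setup reported in the paper.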