DMDSpeech: Distilled Diffusion Model Surpassing The Teacher in Zero-shot Speech Synthesis via Direct Metric Optimization
A 4-step speech model matches the quality of 1000-step models at 13x faster speed
Original Problem 🚨:
Diffusion models for speech synthesis are robust but slow because of their iterative sampling, which also prevents end-to-end optimization against perceptual metrics.
Solution in this Paper 🛠️:
Introduces DMDSpeech, a distilled diffusion model for zero-shot speech synthesis.
Uses Distribution Matching Distillation (DMD) to distill the teacher diffusion model into a four-step student, drastically reducing sampling cost.
Applies a Connectionist Temporal Classification (CTC) loss to reduce word error rate and a Speaker Verification (SV) loss to improve speaker similarity (see the sketch after this list).
Because the student generates speech in only a few steps, these losses can be backpropagated end-to-end through synthesis, aligning the model more directly with human auditory preferences.
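
To make the direct metric optimization concrete, below is a minimal, hypothetical PyTorch sketch of the two auxiliary losses described above: a CTC loss computed with a frozen ASR head and a speaker-similarity loss computed with a frozen speaker-verification encoder. The module names (`asr_head`, `speaker_encoder`), the loss weights, and all tensor shapes are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DirectMetricLoss(nn.Module):
    """Sketch of the direct metric losses: CTC for intelligibility (WER)
    plus a speaker-verification embedding loss for speaker similarity.
    `asr_head` and `speaker_encoder` are assumed frozen, pretrained models."""

    def __init__(self, asr_head: nn.Module, speaker_encoder: nn.Module,
                 ctc_weight: float = 1.0, sv_weight: float = 1.0):
        super().__init__()
        self.asr_head = asr_head                # frozen CTC-based ASR model
        self.speaker_encoder = speaker_encoder  # frozen SV embedding model
        self.ctc = nn.CTCLoss(blank=0, zero_infinity=True)
        self.ctc_weight = ctc_weight
        self.sv_weight = sv_weight

    def forward(self, gen_audio, ref_audio, text_ids, text_lens, frame_lens):
        # CTC loss: log-probs from the frozen ASR head, reshaped to (T, N, C)
        # as required by nn.CTCLoss; frame_lens are the ASR frame counts.
        log_probs = self.asr_head(gen_audio).log_softmax(-1).transpose(0, 1)
        ctc_loss = self.ctc(log_probs, text_ids, frame_lens, text_lens)

        # SV loss: pull the generated speaker embedding toward the prompt's.
        gen_emb = F.normalize(self.speaker_encoder(gen_audio), dim=-1)
        ref_emb = F.normalize(self.speaker_encoder(ref_audio), dim=-1)
        sv_loss = 1.0 - F.cosine_similarity(gen_emb, ref_emb, dim=-1).mean()

        return self.ctc_weight * ctc_loss + self.sv_weight * sv_loss
```

In this sketch, the combined loss would be added to the DMD distillation objective and backpropagated through the four-step student sampler, something that would be impractical with a 1000-step iterative teacher.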
Key Insights from this Paper 💡:
Diffusion models can be distilled for faster and more efficient speech synthesis.
Direct metric optimization significantly enhances perceptual quality.
DMDSpeech achieves state-of-the-art performance in naturalness and speaker similarity.
Results 📊:
DMDSpeech outperforms previous models in naturalness and speaker similarity.
Word Error Rate: 1.94, lower than the 2.19 measured on ground-truth recordings.
Speaker Similarity: 0.69, higher than the 0.67 of ground truth.
Real-Time Factor: 0.07, significantly faster than baseline models.