DMDSpeech: Distilled Diffusion Model Surpassing The Teacher in Zero-shot Speech Synthesis via Direct Metric Optimization
A 4-step speech model matches the quality of 1000-step models at 13x faster speed
Original Problem 🚨:
Diffusion models for speech synthesis are robust but slow because of their iterative sampling, which also prevents end-to-end optimization against perceptual metrics.
Solution in this Paper 🛠️:
Introduces DMDSpeech, a distilled diffusion model for zero-shot speech synthesis.
Uses Distribution Matching Distillation (DMD) to distill the teacher diffusion model into a four-step student, drastically reducing sampling cost.
Applies a Connectionist Temporal Classification (CTC) loss to reduce word error rate and a Speaker Verification (SV) loss to improve speaker similarity (see the sketch after this list).
Because the student generates speech in only a few steps, these losses can be backpropagated end-to-end through synthesis, aligning the model more directly with human auditory preferences.
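
To make the direct metric optimization concrete, below is a minimal, hypothetical PyTorch sketch of the two auxiliary losses described above: a CTC loss computed with a frozen ASR head and a speaker-similarity loss computed with a frozen speaker-verification encoder. The module names (`asr_head`, `speaker_encoder`), the loss weights, and all tensor shapes are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DirectMetricLoss(nn.Module):
    """Sketch of the direct metric losses: CTC for intelligibility (WER)
    plus a speaker-verification embedding loss for speaker similarity.
    `asr_head` and `speaker_encoder` are assumed frozen, pretrained models."""

    def __init__(self, asr_head: nn.Module, speaker_encoder: nn.Module,
                 ctc_weight: float = 1.0, sv_weight: float = 1.0):
        super().__init__()
        self.asr_head = asr_head                # frozen CTC-based ASR model
        self.speaker_encoder = speaker_encoder  # frozen SV embedding model
        self.ctc = nn.CTCLoss(blank=0, zero_infinity=True)
        self.ctc_weight = ctc_weight
        self.sv_weight = sv_weight

    def forward(self, gen_audio, ref_audio, text_ids, text_lens, frame_lens):
        # CTC loss: log-probs from the frozen ASR head, reshaped to (T, N, C)
        # as required by nn.CTCLoss; frame_lens are the ASR frame counts.
        log_probs = self.asr_head(gen_audio).log_softmax(-1).transpose(0, 1)
        ctc_loss = self.ctc(log_probs, text_ids, frame_lens, text_lens)

        # SV loss: pull the generated speaker embedding toward the prompt's.
        gen_emb = F.normalize(self.speaker_encoder(gen_audio), dim=-1)
        ref_emb = F.normalize(self.speaker_encoder(ref_audio), dim=-1)
        sv_loss = 1.0 - F.cosine_similarity(gen_emb, ref_emb, dim=-1).mean()

        return self.ctc_weight * ctc_loss + self.sv_weight * sv_loss
```

In this sketch, the combined loss would be added to the DMD distillation objective and backpropagated through the four-step student sampler, something that would be impractical with a 1000-step iterative teacher.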
Key Insights from this Paper 💡:
Diffusion models can be distilled for faster and more efficient speech synthesis.
Direct metric optimization significantly enhances perceptual quality.
DMDSpeech achieves state-of-the-art performance in naturalness and speaker similarity.
Results 📊:
DMDSpeech outperforms previous models in naturalness and speaker similarity.
Word Error Rate: 1.94, lower than the 2.19 measured on ground-truth recordings.
Speaker Similarity: 0.69, higher than the 0.67 of ground truth.
Real-Time Factor: 0.07, significantly faster than baseline models.