"Improved Training Technique for Latent Consistency Models"
The podcast below on this paper was generated with Google's Illuminate.
https://arxiv.org/abs/2502.01441
Problem: Consistency Training, effective in pixel space, suffers severe performance drops in latent space, which is crucial for scaling generative models to complex tasks like high-resolution image/video generation.
This paper proposes enhancements to latent consistency training to bridge the performance gap with latent diffusion models.
-----
📌 Cauchy loss replaces Pseudo-Huber, effectively managing latent space outliers. This directly stabilizes consistency training, yielding usable one/two-step image generation.
📌 The adaptive scaling-c schedule and Non-scaling LayerNorm are vital design choices. The former dynamically controls loss robustness while the latter improves feature normalization amid latent outliers.
📌 Optimal Transport integration minimizes training variance. This methodological addition enhances consistency training stability and overall sample quality in latent space.
----------
Methods Explored in this Paper 🔧:
→ The paper identifies that latent data contains impulsive outliers, degrading performance of standard Consistency Training methods.
→ To address this, it replaces the Pseudo-Huber loss with the Cauchy loss, which is far less sensitive to extreme outliers (see the loss sketch after this list).
→ Diffusion loss is introduced at early timesteps to regularize the consistency objective during initial training phases.
→ Optimal Transport (OT) coupling is employed to reduce training variance and improve stability (see the coupling sketch after this list).
→ An adaptive scaling-c scheduler dynamically adjusts the robustness of the loss function (included in the loss sketch after this list).
→ Non-scaling LayerNorm is integrated into the model architecture to better capture feature statistics while minimizing outlier influence.
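A minimal PyTorch-style sketch of the two robust losses and a scaling-c schedule. The exact parameterizations and decay schedule used in the paper may differ; the helper names (`pseudo_huber_loss`, `cauchy_loss`, `scaling_c`) and the exponential decay are illustrative assumptions.

```python
# Sketch: Pseudo-Huber vs. Cauchy loss with a hypothetical adaptive scaling-c schedule.
import math
import torch


def pseudo_huber_loss(pred: torch.Tensor, target: torch.Tensor, c: float) -> torch.Tensor:
    """Pseudo-Huber loss: quadratic near zero, linear in the tails."""
    d = pred - target
    return (torch.sqrt(d ** 2 + c ** 2) - c).mean()


def cauchy_loss(pred: torch.Tensor, target: torch.Tensor, c: float) -> torch.Tensor:
    """Cauchy (Lorentzian) loss: logarithmic tails, far less sensitive
    to the impulsive outliers observed in latent data."""
    d = pred - target
    return (0.5 * c ** 2 * torch.log1p((d / c) ** 2)).mean()


def scaling_c(step: int, total_steps: int, c_start: float = 1.0, c_end: float = 1e-3) -> float:
    """Hypothetical adaptive scaling-c schedule: exponentially decay c so the
    loss tolerates large residuals more and more as training proceeds."""
    ratio = step / max(total_steps - 1, 1)
    return c_start * math.exp(ratio * math.log(c_end / c_start))


# Usage: compare the two losses on a latent batch with one injected outlier.
pred = torch.randn(4, 8, 32, 32)
target = pred + 0.01 * torch.randn_like(pred)
target[0, 0, 0, 0] += 50.0  # impulsive outlier, as reported for latent space
c = scaling_c(step=1000, total_steps=100_000)
print(pseudo_huber_loss(pred, target, c).item(), cauchy_loss(pred, target, c).item())
```

Because the Cauchy loss grows only logarithmically with the residual, a single impulsive latent value contributes a bounded gradient instead of dominating the update, which is the stabilizing effect described above.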
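A sketch of mini-batch OT coupling, assuming the common recipe of matching noise to latents via a linear assignment on squared Euclidean cost; the paper's exact solver and cost may differ, and `ot_pair` is an illustrative helper, not the authors' code.

```python
# Sketch: mini-batch Optimal Transport coupling of noise to latent data.
import torch
from scipy.optimize import linear_sum_assignment


def ot_pair(latents: torch.Tensor, noise: torch.Tensor) -> torch.Tensor:
    """Reorder `noise` so that noise[i] is the OT match of latents[i].

    latents, noise: (B, C, H, W) tensors; returns noise permuted along dim 0.
    """
    b = latents.shape[0]
    x = latents.reshape(b, -1)
    z = noise.reshape(b, -1)
    # Pairwise squared distances form the transport cost matrix.
    cost = torch.cdist(x, z, p=2).pow(2).cpu().numpy()
    row_idx, col_idx = linear_sum_assignment(cost)  # exact mini-batch OT plan
    return noise[torch.as_tensor(col_idx, device=noise.device)]


# Usage: couple a batch of latents with noise before building training pairs,
# so each latent is paired with its nearest noise sample and target variance shrinks.
latents = torch.randn(16, 8, 32, 32)
noise = ot_pair(latents, torch.randn_like(latents))
```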
-----
Key Insights 💡:
→ Impulsive outliers in latent data are a primary cause of poor performance in latent Consistency Training.
→ Cauchy loss effectively mitigates the impact of these outliers compared to Pseudo-Huber loss.
→ Combining a diffusion loss at early timesteps with the consistency loss stabilizes the initial phase of training (see the sketch after this list).
→ Optimal Transport (OT) improves training stability by reducing variance.
→ Non-scaling LayerNorm is beneficial for robust feature normalization in latent space (see the sketch after this list).
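A sketch of one way to gate a diffusion (denoising) regularizer to early timesteps alongside the consistency objective; the cutoff `t_cut`, the weight `lam`, and whether the gate also anneals over training iterations are assumptions, not the paper's exact schedule.

```python
# Sketch: consistency loss plus a diffusion regularizer gated to early timesteps.
import torch


def combined_loss(consistency_loss: torch.Tensor,
                  diffusion_loss: torch.Tensor,
                  t: torch.Tensor,
                  t_cut: float = 0.3,
                  lam: float = 1.0) -> torch.Tensor:
    """Add the per-sample diffusion loss only where t is small (early timesteps)."""
    early = (t < t_cut).float()  # 1 where the regularizer is active
    return consistency_loss + lam * (early * diffusion_loss).mean()


# Usage with per-sample diffusion losses and uniformly sampled timesteps.
t = torch.rand(8)
diff = torch.randn(8).abs()   # stand-in per-sample denoising losses
cons = torch.tensor(0.5)      # stand-in scalar consistency loss
print(combined_loss(cons, diff, t).item())
```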
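A sketch reading "Non-scaling LayerNorm" as LayerNorm with the learnable scale removed, so outlier-driven statistics are not re-amplified after normalization; this interpretation and the `NonScalingLayerNorm` class are illustrative and may not match the paper's exact formulation.

```python
# Sketch: LayerNorm without a learnable scale (gamma), keeping only the shift.
import torch
import torch.nn as nn


class NonScalingLayerNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        # Learnable shift only; the learnable scale is deliberately omitted.
        self.bias = nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize over the last (feature) dimension only.
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        return (x - mean) / torch.sqrt(var + self.eps) + self.bias


# Usage: drop-in replacement for nn.LayerNorm(dim) inside the backbone's blocks.
x = torch.randn(2, 256, 512)
print(NonScalingLayerNorm(512)(x).shape)
```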
-----
Results 📊:
→ Achieves FID of 7.27 on CelebA-HQ with 1-NFE sampling, significantly lower than iLCT's 37.15.
→ Reaches FID of 8.87 on LSUN Church and 8.72 on FFHQ datasets with 1-NFE sampling, again outperforming iLCT substantially.
→ Demonstrates improved Recall metric, reaching 0.50 on CelebA-HQ, indicating better sample diversity.