Great open-source model and paper from ByteDance.
Face dubbing made simple - from sound to lips in one smooth move.
Skip the motion middleman: conditioning directly on audio in latent space preserves the subtle expressions that two-stage pipelines lose.
LatentSync introduces end-to-end lip sync with audio-conditioned latent diffusion, avoiding the limits of pixel-space diffusion and the information loss of two-stage generation while maintaining temporal consistency.
-----
https://arxiv.org/abs/2412.09262
🎯 Original Problem:
Existing diffusion-based lip-sync methods either struggle with high-resolution video because they diffuse in pixel space, or lose subtle expressions by going through an intermediate motion representation in two-stage pipelines.
-----
🔧 Solution in this Paper:
→ LatentSync leverages Stable Diffusion's capabilities to directly model audio-visual correlations without intermediate motion representations.
→ It feeds the UNet a 13-channel input (4-channel noise latent + 1-channel mask + 4-channel masked-frame latent + 4-channel reference-frame latent) for frame-by-frame generation, as sketched after this list.
→ Whisper extracts audio features, which are injected into the UNet through cross-attention layers.
→ TREPA (Temporal REPresentation Alignment) enhances temporal consistency using VideoMAEv2 representations.
→ Training uses a two-stage recipe that combines the simple noise-prediction objective with SyncNet loss, LPIPS, and TREPA (see the loss sketch after this list).
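
Here is a minimal sketch of how the 13-channel UNet input could be assembled and audio-conditioned. Tensor shapes, the audio feature dimensions, and the commented `unet(...)` call are illustrative assumptions, not the authors' actual code; only the 4 + 1 + 4 + 4 channel breakdown comes from the paper's description.

```python
import torch

# Illustrative shapes: batch B, Stable Diffusion VAE latents are 4-channel
# at 1/8 of the image resolution.
B, H, W = 2, 64, 64

noise_latent  = torch.randn(B, 4, H, W)  # diffusion noise latent
mask          = torch.zeros(B, 1, H, W)  # 1 where the lower face is masked out
masked_latent = torch.randn(B, 4, H, W)  # VAE latent of the masked input frame
ref_latent    = torch.randn(B, 4, H, W)  # VAE latent of a reference frame (identity/pose)

# 4 + 1 + 4 + 4 = 13 input channels, matching the paper's description.
unet_input = torch.cat([noise_latent, mask, masked_latent, ref_latent], dim=1)
print(unet_input.shape)  # torch.Size([2, 13, 64, 64])

# Audio conditioning: Whisper-extracted features attend to visual tokens via
# the UNet's cross-attention layers (the call below is a placeholder, not a real API).
audio_features = torch.randn(B, 50, 384)  # shape is an assumption
# noise_pred = unet(unet_input, timestep, encoder_hidden_states=audio_features)
```

And a hedged sketch of how the training objectives could be combined. The `videomae_encoder`, `syncnet_loss`, and `lpips_loss` callables, the loss weights, and the exact stage split are placeholders standing in for the paper's components, not their reported configuration.

```python
import torch
import torch.nn.functional as F

def trepa_loss(videomae_encoder, gen_frames, gt_frames):
    """TREPA (sketch): align temporal representations of generated and
    ground-truth clips extracted by a frozen VideoMAEv2 encoder."""
    with torch.no_grad():
        gt_repr = videomae_encoder(gt_frames)
    gen_repr = videomae_encoder(gen_frames)
    return F.mse_loss(gen_repr, gt_repr)

def combined_loss(noise_pred, noise, gen_frames, gt_frames, audio,
                  videomae_encoder, syncnet_loss, lpips_loss,
                  w_sync=0.05, w_lpips=0.1, w_trepa=1.0):
    """Combines the simple noise-prediction objective with the pixel-space
    terms on decoded frames. Weights here are illustrative, not the paper's."""
    l_simple = F.mse_loss(noise_pred, noise)      # standard diffusion loss
    l_sync   = syncnet_loss(gen_frames, audio)    # lip-sync supervision
    l_lpips  = lpips_loss(gen_frames, gt_frames)  # perceptual quality
    l_trepa  = trepa_loss(videomae_encoder, gen_frames, gt_frames)
    return l_simple + w_sync * l_sync + w_lpips * l_lpips + w_trepa * l_trepa
```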
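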
-----
💡 Key Insights:
→ SyncNet converges much better with a large batch size (1024) and the right number of input frames (16)
→ Temporal consistency improves when the model is conditioned on a window of audio around each frame, not just the current one
→ A fixed mask and an affine transformation prevent lip information from leaking into the inputs (sketched below)
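
A small sketch of that last insight: align the face with an affine transform, then apply the same fixed mask to every frame. The function name, the three-landmark fit, and the lower-half mask region are my assumptions for illustration, not the paper's exact preprocessing.

```python
import cv2
import numpy as np

def mask_aligned_face(frame, src_pts, dst_pts, size=256):
    """Warp the face to a canonical pose with an affine transform fitted on
    three landmarks (e.g. both eyes and the nose tip), then apply the SAME
    fixed lower-face mask to every frame so per-frame lip shape cannot leak."""
    M = cv2.getAffineTransform(src_pts.astype(np.float32),
                               dst_pts.astype(np.float32))
    aligned = cv2.warpAffine(frame, M, (size, size))
    mask = np.ones((size, size), dtype=frame.dtype)
    mask[size // 2:, :] = 0  # fixed mask: hide the lower half in every frame
    return aligned * mask[..., None], mask
```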
-----
📊 Results:
→ The trained SyncNet reaches 94% accuracy on the HDTF test set, up from the previous 91%
→ Outperforms state-of-the-art methods in lip-sync accuracy (Sync_conf)
→ Better visual quality metrics (FID, SSIM)
→ Improved temporal consistency (FVD)
------
Are you into AI and LLMs❓ Join me on X/Twitter with 52K+ others, to remain on the bleeding-edge of AI every day.
𝕏/🐦 https://x.com/rohanpaul_ai