"LatentSync: Audio Conditioned Latent Diffusion Models for Lip Sync"

The podcast below on this paper was generated with Google's Illuminate.

Rohan Paul

Jan 13, 2025

Great open-source model and paper from ByteDance.

Face dubbing made simple - from sound to lips in one smooth move.

Skip the motion middleman: direct audio-to-lip sync in latent space drops the intermediate motion representation and preserves subtle expressions.

LatentSync introduces end-to-end lip sync using audio-conditioned latent diffusion, solving pixel space limitations and information loss in two-stage generation while maintaining temporal consistency.

-----

https://arxiv.org/abs/2412.09262

🎯 Original Problem:

Existing diffusion-based lip sync methods either work in pixel space, which limits high-resolution video generation, or use a two-stage approach that loses subtle expressions.

-----

🔧 Solution in this Paper:

→ LatentSync leverages Stable Diffusion's capabilities to directly model audio-visual correlations without intermediate motion representations.

→ It uses a 13-channel input combining noise latent, mask, masked image, and reference image for frame-by-frame generation.

→ The framework integrates Whisper for audio feature extraction through cross-attention layers.

→ TREPA (Temporal REPresentation Alignment) enhances temporal consistency using VideoMAEv2 representations.

→ Training has two stages: the first uses a simple noise-prediction loss, and the second adds SyncNet loss, LPIPS, and TREPA.
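
To make the 13-channel input concrete, here is a minimal sketch of how the four pieces could be stacked along the channel axis. The per-piece channel counts (4 for each latent, 1 for the mask) are an assumption based on the usual Stable Diffusion inpainting layout; the post only gives the total of 13. Nested Python lists stand in for tensors, and the helper names `make_tensor` and `build_unet_input` are hypothetical, not taken from the LatentSync code.

```python
# Assumed channel counts (not stated in the post): 4-channel VAE latents
# and a 1-channel mask, as in Stable Diffusion inpainting.
LATENT_CHANNELS = 4
MASK_CHANNELS = 1


def make_tensor(channels, height, width, value=0.0):
    """Return a C x H x W nested list standing in for a tensor."""
    return [[[value] * width for _ in range(height)] for _ in range(channels)]


def build_unet_input(noise_latent, mask, masked_image_latent, reference_latent):
    """Concatenate the four inputs along the channel axis (axis 0)."""
    parts = [noise_latent, mask, masked_image_latent, reference_latent]
    # All parts must share the same spatial size before stacking.
    sizes = {(len(p[0]), len(p[0][0])) for p in parts}
    if len(sizes) != 1:
        raise ValueError("all inputs must share the same height and width")
    stacked = []
    for part in parts:
        stacked.extend(part)
    return stacked


h, w = 8, 8  # toy latent resolution, chosen only for illustration
unet_input = build_unet_input(
    make_tensor(LATENT_CHANNELS, h, w, 0.5),  # noise latent
    make_tensor(MASK_CHANNELS, h, w, 1.0),    # mask
    make_tensor(LATENT_CHANNELS, h, w, 0.0),  # masked image latent
    make_tensor(LATENT_CHANNELS, h, w, 0.2),  # reference image latent
)
print(len(unet_input), len(unet_input[0]), len(unet_input[0][0]))  # 13 8 8
```

In a real pipeline this would be a single tensor concatenation per frame, with the audio features entering separately through cross-attention rather than through these channels.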

-----

💡 Key Insights:

→ SyncNet convergence improves with larger batch sizes (1024) and optimal frame count (16)

→ Temporal consistency in diffusion models benefits from audio window information

→ Fixed mask and affine transformation prevent information leakage
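
The fixed-mask idea can be illustrated with a small sketch. It assumes faces have already been aligned to a common position (the affine transformation step), so one mask can be reused for every frame. The box coordinates and the helper names `fixed_mask` and `apply_mask` are hypothetical; the post does not give the actual mask shape.

```python
def fixed_mask(height, width, top, bottom, left, right):
    """Binary H x W mask: 1 inside the fixed box (region to regenerate), 0 elsewhere."""
    return [
        [1 if top <= y < bottom and left <= x < right else 0 for x in range(width)]
        for y in range(height)
    ]


def apply_mask(frame, mask):
    """Zero out the masked pixels of a grayscale H x W frame."""
    return [
        [0 if m else pixel for pixel, m in zip(frame_row, mask_row)]
        for frame_row, mask_row in zip(frame, mask)
    ]


# Two toy 4 x 4 "frames" with different content (illustrative values only).
frame_a = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
frame_b = [[16, 15, 14, 13], [12, 11, 10, 9], [8, 7, 6, 5], [4, 3, 2, 1]]

# The same mask is used for every frame: the lower half stands in for the mouth area.
mask = fixed_mask(4, 4, top=2, bottom=4, left=0, right=4)

masked_a = apply_mask(frame_a, mask)
masked_b = apply_mask(frame_b, mask)
print(masked_a[2:], masked_b[2:])
```

Because the mask does not follow the lips, its shape is identical in every frame and carries no hint about the mouth position, which is the leakage the insight above refers to.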

-----

📊 Results:

→ Raised SyncNet accuracy on the HDTF test set to 94%, up from the previous 91%

→ Outperformed state-of-the-art methods in lip-sync accuracy (Sync_conf)

→ Better visual quality metrics (FID, SSIM)

→ Improved temporal consistency (FVD)

-----

Are you into AI and LLMs❓ Join me on X/Twitter with 52K+ others, to remain on the bleeding-edge of AI every day.

𝕏/🐦 https://x.com/rohanpaul_ai
