"Efficient Generative Modeling with Residual Vector Quantization-Based Tokens"

Generated the podcast below on this paper with Google's Illuminate.

ResGen predicts collective token embeddings directly, making high-quality generation faster and more efficient.

Why generate tokens one by one when you can do them all at once? That's ResGen's magic.

ResGen introduces direct prediction of vector embeddings for collective tokens, making high-fidelity generation more efficient while maintaining fast sampling speeds.

-----

https://arxiv.org/abs/2412.10208

🤔 Original Problem:

→ Residual Vector Quantization (RVQ) improves reconstruction fidelity but increases token depth, leading to slower inference in generative models (e.g., a 256-token sequence at RVQ depth 8 becomes 2,048 tokens for naive autoregressive decoding)

→ Existing methods struggle to handle both sequence length and depth efficiently

-----

🔧 Solution in this Paper:

→ ResGen predicts vector embeddings of collective tokens instead of individual ones (see the sketch after this list)

→ Uses progressive token masking starting from highest quantization layers

→ Implements mixture of Gaussians for latent embedding estimation

→ Employs confidence-based sampling with temperature scaling

→ Formulates token masking within a probabilistic framework using discrete diffusion
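The core mechanics can be pictured with a short sketch. This is a minimal illustration, not the paper's implementation: function names, codebook sizes, and the masking schedule below are assumptions, and ResGen itself frames the masking within a discrete-diffusion framework and models the embedding with a mixture of Gaussians rather than a single regression target.

```python
import torch
import torch.nn as nn

def progressive_depth_mask(batch, seq_len, depth, mask_ratio):
    # Mask residual levels starting from the deepest quantization layers:
    # with ratio r, roughly the top r * depth levels are hidden and must be predicted.
    n_masked = max(1, int(round(mask_ratio * depth)))
    mask = torch.zeros(batch, seq_len, depth, dtype=torch.bool)
    mask[:, :, depth - n_masked:] = True
    return mask

def masked_collective_embedding(codes, codebooks, mask):
    # RVQ reconstructs each latent as the sum of codebook embeddings over depth,
    # so the prediction target is the per-position sum over the *masked* levels.
    embs = torch.stack([codebooks[d](codes[:, :, d])
                        for d in range(codes.shape[-1])], dim=2)    # (B, T, D, C)
    return (embs * mask.unsqueeze(-1)).sum(dim=2)                   # (B, T, C)

# Usage: the model sees the unmasked shallow levels and predicts the
# collective embedding of the masked deep levels in a single forward pass.
depth, vocab, dim = 8, 1024, 256
codebooks = nn.ModuleList(nn.Embedding(vocab, dim) for _ in range(depth))
codes = torch.randint(0, vocab, (2, 16, depth))                     # (B=2, T=16, D=8)
mask = progressive_depth_mask(2, 16, depth, mask_ratio=0.5)
target = masked_collective_embedding(codes, codebooks, mask)        # (2, 16, 256)
```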

-----

💡 Key Insights:

→ Direct prediction of vector embeddings decouples sampling complexity from sequence length and depth

→ Progressive masking from higher layers preserves hierarchical information effectively

→ Confidence-based sampling improves generation quality with only a few inference steps (sketched below)
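The confidence-based sampling is in the spirit of iterative masked-token decoding: sample candidates everywhere, commit only the most confident positions, and repeat for a small, fixed number of steps. A generic sketch follows (not ResGen's actual routine); `temperature` and `num_keep` are illustrative parameters.

```python
import torch

def confidence_sampling_step(logits, num_keep, temperature=1.0):
    # Sample a candidate token at every masked position, score each by its
    # temperature-scaled probability, and commit only the num_keep most
    # confident positions this step; the rest stay masked for the next step.
    probs = torch.softmax(logits / temperature, dim=-1)               # (B, T, V)
    B, T, V = probs.shape
    sampled = torch.multinomial(probs.reshape(B * T, V), 1).view(B, T)
    confidence = probs.gather(-1, sampled.unsqueeze(-1)).squeeze(-1)  # (B, T)
    commit = confidence.topk(num_keep, dim=-1).indices                # positions to unmask
    return sampled, commit

# With a fixed step budget (e.g. 16 steps), the number of positions committed
# per step scales with sequence length, so the total step count stays constant.
```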

-----

📊 Results:

→ ImageNet 256x256: FID score of 1.95 with classifier-free guidance

→ Text-to-Speech: Lowest Word Error Rate (1.72) and Character Error Rate (0.46)

→ Maximum latent batch size of 1915, highest among compared models

→ Only 16 inference steps needed, with sampling cost that does not grow as RVQ depth increases

------

Are you into AI and LLMs❓ Join my daily AI newsletter. I will send you 7 emails a week analyzing the highest signal AI developments. ↓↓

🎉 https://rohanpaul.substack.com/
