Training image tokenizers just got 16x faster - and they work better too.
The paper introduces Grouped Spherical Quantization (GSQ), a novel approach that improves image tokenizer training efficiency and quality. GSQ achieves superior reconstruction with fewer training iterations through spherical codebook initialization and lookup regularization.
-----
https://arxiv.org/abs/2412.02632
🤖 Original Problem:
→ Current image tokenizers rely on outdated GAN-based hyperparameters, leading to suboptimal performance.
→ Existing methods lack comprehensive analysis of scaling behaviors and often use biased comparisons.
→ Methods like FSQ and LFQ have rigid bindings between latent dimension and codebook size, limiting scaling flexibility.
-----
🔧 Solution in this Paper:
→ GSQ introduces spherical codebook initialization and lookup regularization to constrain codebook latents to a spherical surface.
→ The method decomposes each latent vector into G groups for efficient compression without compromising reconstruction fidelity.
→ GSQ applies L2 normalization during lookup operations to improve stability and performance.
→ The solution enables independent scaling of latent dimension and codebook size (a minimal sketch of the grouped lookup follows below).
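The bullets above can be pictured as a single quantization step. Below is a minimal, hypothetical PyTorch sketch of that step, splitting a latent of dimension D into G groups and looking up each group against an L2-normalized codebook; the class and argument names are my own and not taken from the paper's released code.

```python
import torch
import torch.nn.functional as F

class GroupedSphericalQuantizer(torch.nn.Module):
    """Illustrative grouped spherical quantizer (not the paper's released code)."""

    def __init__(self, latent_dim: int, num_groups: int, codebook_size: int):
        super().__init__()
        assert latent_dim % num_groups == 0, "latent_dim must split evenly into groups"
        self.num_groups = num_groups
        self.group_dim = latent_dim // num_groups
        # Spherical initialization: codes start on the unit hypersphere.
        codes = torch.randn(num_groups, codebook_size, self.group_dim)
        self.codebook = torch.nn.Parameter(F.normalize(codes, dim=-1))

    def forward(self, z: torch.Tensor):
        # z: (batch, latent_dim) -> split into (batch, num_groups, group_dim)
        b = z.shape[0]
        z = z.view(b, self.num_groups, self.group_dim)
        # Lookup regularization: L2-normalize latents and codes so the
        # nearest-neighbor search happens on the spherical surface.
        z_n = F.normalize(z, dim=-1)
        c_n = F.normalize(self.codebook, dim=-1)
        # Cosine similarity per group: (batch, num_groups, codebook_size)
        sim = torch.einsum("bgd,gkd->bgk", z_n, c_n)
        idx = sim.argmax(dim=-1)  # index of the nearest code in each group
        # Advanced indexing: q[b, g] = c_n[g, idx[b, g]]
        q = c_n[torch.arange(self.num_groups, device=z.device).unsqueeze(0), idx]
        # Straight-through estimator so gradients flow back to the encoder.
        q = z_n + (q - z_n).detach()
        return q.reshape(b, -1), idx


# Example: a 16-dim latent split into 4 groups, each with a 256-entry codebook.
quantizer = GroupedSphericalQuantizer(latent_dim=16, num_groups=4, codebook_size=256)
quantized, indices = quantizer(torch.randn(8, 16))
```

Because each group is quantized independently, the effective codebook size grows with the number of groups while the per-group lookup stays small, which is what decouples latent dimension from codebook size.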
-----
💡 Key Insights:
→ Higher beta values in optimizers lead to better reconstruction performance
→ Learning rate decay negatively impacts model performance
→ DINO-based discriminators consistently outperform other GAN discriminator architectures
→ Larger batch sizes and increased learning rates improve training stability (illustrated in the sketch below)
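As a concrete illustration of how these findings might translate into a training setup, here is a hypothetical PyTorch optimizer configuration; the specific numbers (learning rate, betas, batch size) are illustrative assumptions, not the paper's reported settings.

```python
import torch

# Hypothetical training configuration echoing the findings above; the concrete
# values (learning rate, betas, batch size) are illustrative assumptions,
# not taken from the paper.
tokenizer = torch.nn.Linear(256, 256)  # stand-in for the actual encoder/decoder

optimizer = torch.optim.Adam(
    tokenizer.parameters(),
    lr=1e-4,            # a relatively large learning rate, paired with large batches
    betas=(0.9, 0.99),  # higher beta values, per the reconstruction finding
)

# No learning-rate scheduler: decay reportedly hurts performance, so the
# learning rate stays constant for the whole run.
batch_size = 256        # larger batches reported to improve stability
```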
-----
📊 Results:
→ Achieves 16x down-sampling with a reconstruction FID of 0.50
→ Maintains 100% codebook utilization throughout training
→ Requires only 20 training epochs, compared to 270+ epochs for other methods