Meissonic: Revitalizing Masked Generative Transformers for Efficient High-Resolution Text-to-Image Synthesis


Non-autoregressive Masked Image Modeling (MIM) matches SDXL quality with 3x faster generation on just 8GB of VRAM 🔥

Skip the diffusion dance: MIM decodes a whole image in a handful of parallel steps ✨

Meissonic bridges the gap between MIM and diffusion models for high-quality text-to-image synthesis.

📚 https://arxiv.org/abs/2410.08261

Original Problem 🎯:

Diffusion-based text-to-image models like SDXL are computationally intensive, and their continuous denoising process sits apart from the discrete-token paradigm of language models, which hinders unified language-vision modeling. Non-autoregressive Masked Image Modeling (MIM) offers a token-based, potentially more efficient alternative, but existing MIM models face resolution constraints and a performance gap relative to diffusion.
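
To make the MIM idea concrete, here is a minimal sketch of MaskGIT-style iterative parallel decoding, the sampling scheme this family of models builds on: start from an all-masked token grid and, at each step, commit the most confident predictions while re-masking the rest. The `model` interface, `mask_id`, and the cosine schedule are illustrative assumptions, not Meissonic's exact implementation.

```python
import math
import torch

@torch.no_grad()
def mim_sample(model, text_emb, seq_len, mask_id, steps=16, device="cpu"):
    """MaskGIT-style parallel decoding: begin fully masked, keep the most
    confident token predictions each step, re-mask the rest (illustrative)."""
    tokens = torch.full((1, seq_len), mask_id, dtype=torch.long, device=device)
    for step in range(steps):
        # Cosine schedule: fraction of positions still masked after this step.
        frac_masked = math.cos(math.pi / 2 * (step + 1) / steps)
        n_remask = int(seq_len * frac_masked)

        logits = model(tokens, text_emb)                # (1, seq_len, vocab)
        probs = logits.softmax(dim=-1)
        pred = probs.argmax(dim=-1)                     # greedy choice per position
        conf = probs.gather(-1, pred.unsqueeze(-1)).squeeze(-1)

        decided = tokens != mask_id                     # positions fixed earlier
        conf = conf.masked_fill(decided, float("inf"))  # never re-mask them
        tokens = torch.where(decided, tokens, pred)

        if n_remask > 0:
            # Re-mask the least confident predictions for the next pass.
            idx = conf.topk(n_remask, largest=False).indices
            tokens.scatter_(1, idx, mask_id)
    return tokens
```

A diffusion model would instead run a full denoising pass over continuous latents at every step; here each step is a single transformer forward over discrete tokens, which is where the speedup comes from.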

-----

Solution in this Paper 🔬:

• Meissonic: A 1B parameter MIM-based text-to-image model

• Enhanced transformer architecture: 1:2 ratio of multi-modal to single-modal layers (see the backbone sketch after this list)

• Rotary Position Embedding (RoPE) for high-resolution detail (see the 2D RoPE sketch below)

• Masking rate as a dynamic sampling condition, discretized into 1000 levels (see the conditioning sketch below)

• Feature compression layers for efficient high-resolution generation

• Micro-conditioning on image resolution, crop coordinates, and a human preference score (embedded alongside the masking rate in the sketch below)

• Progressive 4-stage training approach on curated datasets
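
The backbone interleaves multi-modal blocks (joint attention over concatenated text and image tokens) with single-modal blocks (image tokens only) in a 1:2 ratio. A minimal sketch of that layer arrangement, using plain transformer encoder blocks as stand-ins for the paper's actual blocks; the class and its wiring are my assumptions:

```python
import torch
import torch.nn as nn

def block(dim, heads):
    # Placeholder transformer block; the paper's blocks also use RoPE, etc.
    return nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)

class Backbone(nn.Module):
    """One multi-modal block followed by two single-modal blocks, repeated:
    the 1:2 ratio described in the paper (illustrative arrangement)."""
    def __init__(self, dim=1024, heads=16, n_groups=4):
        super().__init__()
        self.multi = nn.ModuleList(block(dim, heads) for _ in range(n_groups))
        self.single = nn.ModuleList(block(dim, heads) for _ in range(2 * n_groups))

    def forward(self, img, txt):
        n_txt = txt.shape[1]
        for i, mm in enumerate(self.multi):
            fused = mm(torch.cat([txt, img], dim=1))  # joint text+image attention
            txt, img = fused[:, :n_txt], fused[:, n_txt:]
            img = self.single[2 * i](img)             # image-only refinement
            img = self.single[2 * i + 1](img)
        return img
```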
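
RoPE encodes position by rotating query/key features, which scales to large token grids better than learned absolute embeddings. Below is a minimal sketch of one common 2D variant, rotating half the channels by row index and half by column index; the exact formulation Meissonic uses may differ, and `rope_1d`/`rope_2d` are names invented here for illustration.

```python
import torch

def rope_1d(x, pos, base=10000.0):
    """Rotary embedding over the last dim of x (..., seq, dim), dim even."""
    half = x.shape[-1] // 2
    freqs = 1.0 / (base ** (torch.arange(half, dtype=torch.float32) / half))
    angles = pos[:, None].float() * freqs[None, :]        # (seq, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

def rope_2d(x, height, width):
    """Rotate half the channels by row index, half by column index
    (one common 2D scheme; head dim assumed divisible by 4)."""
    rows = torch.arange(height).repeat_interleave(width)  # row index per token
    cols = torch.arange(width).repeat(height)             # column index per token
    d = x.shape[-1] // 2
    return torch.cat([rope_1d(x[..., :d], rows), rope_1d(x[..., d:], cols)], dim=-1)

# Usage: queries for a 16x16 token grid, 8 heads, head dim 64
# q = torch.randn(1, 8, 16 * 16, 64); q = rope_2d(q, 16, 16)
```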
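
The model also receives the current masking rate as an explicit condition, discretized into 1000 levels, together with micro-conditions (original resolution, crop coordinates, human preference score). A plausible sketch of how such signals could be embedded and summed into one conditioning vector; the module, the scalar set, and the MLP encoding are assumptions, not the paper's code:

```python
import torch
import torch.nn as nn

class MicroCondition(nn.Module):
    """Embed a discretized masking rate plus scalar micro-conditions into a
    single conditioning vector (layer names and encoding are illustrative)."""
    def __init__(self, dim, mask_levels=1000, n_scalars=5):
        super().__init__()
        # 1000 discrete masking-rate levels, per the paper's description.
        self.mask_emb = nn.Embedding(mask_levels, dim)
        # Assumed scalars: [height, width, crop_top, crop_left, pref_score]
        self.scalar_proj = nn.Sequential(
            nn.Linear(n_scalars, dim), nn.SiLU(), nn.Linear(dim, dim)
        )

    def forward(self, mask_rate, scalars):
        # mask_rate in [0, 1] -> discrete level in {0, ..., 999}
        level = (mask_rate.clamp(0, 1) * 999).round().long()
        return self.mask_emb(level) + self.scalar_proj(scalars)

# Usage sketch: 70% masked, 1024x1024 image, no crop, preference score 6.5
cond = MicroCondition(dim=1024)
c = cond(torch.tensor([0.7]), torch.tensor([[1024., 1024., 0., 0., 6.5]]))
print(c.shape)  # torch.Size([1, 1024])
```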

-----

Key Insights from this Paper 💡:

• Efficient high-resolution (1024x1024) generation possible without external super-resolution

• Balanced use of multi-modal and single-modal transformer layers enhances performance

• Dynamic masking rate condition significantly improves image detail

• Progressive training stages crucial for building competent text-to-image models

-----

Results 📊:

• Highest scores on HPS v2.0 and MPS benchmarks across image categories

• GenEval: 0.54 overall (vs 0.55 SDXL)

• Efficient: 48 H100 GPU days of training; inference runs on GPUs with 8GB of VRAM

• Strong performance in image quality, text alignment, and style diversity