
"Visual Generation Without Guidance"

The podcast below was generated with Google's Illuminate.

This paper introduces Guidance-Free Training (GFT). GFT enables visual generative models to match Classifier-Free Guidance (CFG) performance while halving sampling cost, since guided (two-pass) sampling is no longer needed.

-----

📌 GFT replaces Classifier-Free Guidance by building the effect of guidance directly into training. This eliminates the need for guided sampling, reducing inference cost by 50% while maintaining performance. The key is a linear interpolation controlled by a pseudo-temperature parameter.

📌 GFT reframes guidance as a learnable interpolation: the conditional model is defined implicitly as a mix of the directly trained sampling model and the unconditional model. This avoids redundant sampling while preserving the flexibility of Classifier-Free Guidance. Stopping gradients on the unconditional branch keeps training efficient.

📌 The pseudo-temperature parameter beta acts as an implicit control knob for diversity-fidelity trade-offs. This removes external guidance dependencies, enabling a single model to achieve Classifier-Free Guidance quality with a simpler, cheaper inference pipeline.

-----

https://arxiv.org/abs/2501.15420

Original Problem 😫:

→ Classifier-Free Guidance (CFG) is effective for high-quality image generation.

→ CFG requires both conditional and unconditional models during sampling.

→ This doubles the computational cost during inference.

→ CFG also complicates post-training techniques like distillation and RLHF.
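For concreteness, here is a minimal sketch of why CFG doubles per-step cost. It assumes a hypothetical noise-prediction interface eps_model(x_t, t, c) (not the paper's actual code): every denoising step runs the same network twice, once with the condition and once with a null condition.

```python
def cfg_epsilon(eps_model, x_t, t, cond, null_cond, guidance_scale):
    # Classifier-Free Guidance: two forward passes through the same network
    # at every denoising step, one conditional and one unconditional.
    eps_cond = eps_model(x_t, t, cond)
    eps_uncond = eps_model(x_t, t, null_cond)
    # Extrapolate away from the unconditional prediction toward the condition.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```

The second forward pass is exactly the overhead GFT removes at inference time.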

-----

Solution in this Paper 💡:

→ This paper proposes Guidance-Free Training (GFT).

→ GFT trains a single model for temperature-controlled sampling.

→ GFT parameterizes the conditional model implicitly.

→ It uses a linear interpolation between a sampling model and an unconditional model.

→ The training objective remains the same as CFG's maximum likelihood objective.

→ GFT introduces a pseudo-temperature parameter, beta, as model input.

→ The loss function is: E [ || beta * epsilon_theta^s(x_t|c, beta) + (1-beta) * epsilon_theta^u(x_t) - epsilon ||_2^2 ].

→ During training, beta is randomly sampled from 0 to 1.

→ GFT stops gradients for the unconditional model for efficiency and stability.
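A minimal training-loss sketch of the bullets above, assuming a hypothetical interface eps_model(x, t, c, beta) in which the network also receives the pseudo-temperature beta as an input, and image-shaped tensors [B, C, H, W]. This is an illustration of the stated loss, not the paper's implementation.

```python
import torch

def gft_loss(eps_model, x_t, t, cond, null_cond, noise):
    b = x_t.shape[0]

    # Pseudo-temperature sampled uniformly in [0, 1] during training.
    beta = torch.rand(b, device=x_t.device)
    beta_b = beta.view(-1, 1, 1, 1)  # broadcast over image dimensions

    # Sampling-model branch: conditioned on both c and beta.
    eps_s = eps_model(x_t, t, cond, beta)

    # Unconditional branch: null condition, gradients stopped for efficiency
    # and stability. (Feeding beta = 1 on this branch is an assumption of
    # this sketch, not necessarily the paper's exact choice.)
    with torch.no_grad():
        eps_u = eps_model(x_t, t, null_cond, torch.ones_like(beta))

    # Implicit conditional prediction via linear interpolation, trained with
    # the standard noise-prediction (diffusion) objective.
    eps_c = beta_b * eps_s + (1.0 - beta_b) * eps_u
    return torch.mean((eps_c - noise) ** 2)
```

Because the interpolation defines the conditional model implicitly, the sampling model itself is what gets queried at inference time.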

-----

Key Insights from this Paper 🧠:

→ A single model can achieve CFG-level performance by using a novel training parameterization.

→ Implicit conditional model construction via linear interpolation works effectively.

→ Introducing a pseudo-temperature parameter allows for diversity-fidelity trade-off in a guidance-free manner.

→ Stopping gradients for the unconditional branch during training improves efficiency without compromising performance.

→ GFT simplifies the visual generation pipeline by removing the need for dual model inference at sampling time.
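At sampling time this reduces each denoising step to a single forward pass: beta is just another model input, so no unconditional pass or CFG combination is needed. A sketch using the same hypothetical eps_model(x, t, c, beta) interface as above:

```python
def gft_epsilon(eps_model, x_t, t, cond, beta):
    # One forward pass per denoising step; compare with the two passes in
    # the CFG sketch earlier. In this reading, beta acts roughly like an
    # inverse guidance scale: smaller beta means stronger conditioning.
    return eps_model(x_t, t, cond, beta)
```

This single-pass prediction is where the 50% sampling-cost reduction in the results below comes from.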

-----

Results 📊:

→ GFT achieves a guidance-free FID of 1.99 on DiT-XL, comparable to CFG's 2.11.

→ GFT fine-tuning achieves nearly lossless FID using under 5% of the original pre-training epochs.

→ GFT reduces sampling cost by 50% compared to CFG.

→ GFT training adds only 10-20% computation overhead compared to CFG.
