"TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow Matching and Clap-Ranked Preference Optimization"

A podcast on this paper was generated with Google's Illuminate.

TangoFlux introduces a fast text-to-audio model that generates 30 seconds of 44.1 kHz audio in just 3.7 seconds on a single A40 GPU, using flow matching and CLAP-ranked preference optimization.

-----

https://arxiv.org/abs/2412.21037

🎯 Original Problem:

→ Current text-to-audio models are slow, use proprietary data, and struggle with complex prompts. Unlike LLMs, they lack effective alignment mechanisms due to missing structured rewards or gold-standard answers.

-----

🔧 Solution in this Paper:

→ TangoFlux is a 515M-parameter model that combines flow matching with a transformer architecture.

→ It introduces CLAP-Ranked Preference Optimization (CRPO) that iteratively generates and optimizes preference data.

→ The model employs rectified flows for efficient sampling with fewer steps.

→ A modified loss function combines DPO-FM with flow matching loss for better optimization stability.
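
To make the flow matching and rectified-flow points above more concrete, here is a minimal pure-Python sketch. It is not the paper's implementation: the function names are hypothetical, the convention that t=0 is noise and t=1 is data is an assumption, and the "oracle" velocity field stands in for the trained transformer. It only illustrates why straight-line (rectified) paths can be sampled with very few Euler steps.

```python
from typing import Callable, List

Vector = List[float]


def interpolate(noise: Vector, data: Vector, t: float) -> Vector:
    """Point on the straight line from noise (t=0) to data (t=1)."""
    return [(1.0 - t) * n + t * d for n, d in zip(noise, data)]


def velocity_target(noise: Vector, data: Vector) -> Vector:
    """Regression target for the model: the constant velocity of that line."""
    return [d - n for n, d in zip(noise, data)]


def flow_matching_loss(predicted: Vector, noise: Vector, data: Vector) -> float:
    """Mean squared error between a predicted velocity and the target velocity."""
    target = velocity_target(noise, data)
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(target)


def euler_sample(
    velocity_fn: Callable[[Vector, float], Vector],
    noise: Vector,
    num_steps: int,
) -> Vector:
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-size Euler steps."""
    x = list(noise)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy setup: a single "data" point and a single noise draw (made-up numbers).
data = [1.0, -2.0, 0.5]
noise = [0.3, 0.1, -0.7]


def oracle_velocity(x: Vector, t: float) -> Vector:
    """Hypothetical perfect model: always points straight at the data point."""
    return [(d - xi) / (1.0 - t) for d, xi in zip(data, x)]


# Because the path is a straight line, four Euler steps already land on the data.
sample = euler_sample(oracle_velocity, noise, num_steps=4)
print(sample)
```

With a learned velocity field the paths are only approximately straight, so the step count becomes a trade-off between speed and quality.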

-----

💡 Key Insights:

→ Online data generation is crucial for sustained performance improvement

→ CLAP can serve as an effective reward model for text-to-audio alignment

→ Adding the flow matching loss on the winning (preferred) sample helps prevent optimization instability
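
The CRPO idea and the role of the winning loss can be sketched as follows. This is a hedged illustration rather than the paper's code: `select_preference_pair`, `dpo_fm_loss`, and `crpo_loss` are hypothetical names, the CLAP scores are made-up numbers, and the DPO-style term follows one common formulation in which each "error" is a flow matching error for the policy or a frozen reference model. The paper's exact weighting may differ.

```python
import math
from typing import Callable, Sequence, Tuple


def select_preference_pair(
    candidates: Sequence[str],
    clap_score: Callable[[str], float],
) -> Tuple[str, str]:
    """Rank candidates for one prompt by score; return (winner, loser)."""
    ranked = sorted(candidates, key=clap_score)
    return ranked[-1], ranked[0]


def dpo_fm_loss(
    err_w_policy: float,
    err_l_policy: float,
    err_w_ref: float,
    err_l_ref: float,
    beta: float,
) -> float:
    """-log(sigmoid(-beta * margin)), written as a numerically stable softplus.

    The margin is negative when the policy improves on the winner more than
    on the loser, relative to the reference model.
    """
    margin = (err_w_policy - err_w_ref) - (err_l_policy - err_l_ref)
    z = beta * margin
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def crpo_loss(
    err_w_policy: float,
    err_l_policy: float,
    err_w_ref: float,
    err_l_ref: float,
    beta: float,
) -> float:
    """Preference term plus the plain flow matching error on the winner."""
    preference = dpo_fm_loss(err_w_policy, err_l_policy, err_w_ref, err_l_ref, beta)
    return preference + err_w_policy


# Toy usage: three candidate clips for one prompt with invented CLAP scores.
toy_scores = {"clip_a": 0.20, "clip_b": 0.50, "clip_c": 0.10}
winner, loser = select_preference_pair(list(toy_scores), toy_scores.get)
print(winner, loser)

# Both errors grow, but the loser's grows more: the preference term alone
# still looks "good", while the added winner term penalizes the drift.
print(dpo_fm_loss(0.9, 2.0, 0.3, 0.3, beta=1.0))
print(crpo_loss(0.9, 2.0, 0.3, 0.3, beta=1.0))
```

In this sketch the preference term only looks at the gap between winner and loser, so it can shrink even while the winner's own error rises. The extra winner term is one way to anchor the model against that, which matches the stability insight above.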

-----

📊 Results:

→ Generates 30 seconds of 44.1 kHz audio in 3.7 seconds on a single A40 GPU

→ Lower Fréchet Distance: 75.1 vs next best 89.2

→ Higher CLAP score: 0.480 vs next best 0.447

→ Better Inception Score: 12.2 vs next best 9.9
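
As a quick sanity check on the numbers above, the short script below turns them into a real-time factor and relative changes against the next-best system. The helper names are hypothetical, and the inputs are only the figures quoted in this summary.

```python
def real_time_factor(audio_seconds: float, wall_seconds: float) -> float:
    """Seconds of audio produced per second of compute."""
    return audio_seconds / wall_seconds


def relative_change(ours: float, baseline: float) -> float:
    """Signed change relative to the baseline (negative means lower)."""
    return (ours - baseline) / baseline


speedup = real_time_factor(30.0, 3.7)       # 30 s of audio in 3.7 s
fd_change = relative_change(75.1, 89.2)     # Fréchet Distance, lower is better
clap_change = relative_change(0.480, 0.447) # CLAP score, higher is better
is_change = relative_change(12.2, 9.9)      # Inception Score, higher is better

print(f"real-time factor: {speedup:.1f}x")
print(f"Fréchet Distance change: {fd_change:+.1%}")
print(f"CLAP score change: {clap_change:+.1%}")
print(f"Inception Score change: {is_change:+.1%}")
```

In other words, generation runs roughly eight times faster than real time, with the Fréchet Distance about 16% lower and the Inception Score about 23% higher than the next-best system.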