TangoFlux is a fast text-to-audio model that generates up to 30 seconds of high-quality 44.1 kHz audio in just 3.7 seconds, using flow matching and preference optimization.
-----
https://arxiv.org/abs/2412.21037
🎯 Original Problem:
→ Current text-to-audio models are slow, use proprietary data, and struggle with complex prompts. Unlike LLMs, they lack effective alignment mechanisms due to missing structured rewards or gold-standard answers.
-----
🔧 Solution in this Paper:
→ TangoFlux is a 515M-parameter model that combines flow matching with a transformer architecture.
→ It introduces CLAP-Ranked Preference Optimization (CRPO), which iteratively generates preference data and then optimizes the model on it.
→ The model employs rectified flows for efficient sampling with fewer steps.
→ A modified loss function combines DPO-FM with flow matching loss for better optimization stability.
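
To make the rectified-flow idea concrete, here is a minimal sketch in plain Python. It assumes the common convention that noise sits at t=0 and data at t=1, joined by a straight line; the function names and the toy "oracle" velocity field are illustrative and are not taken from the TangoFlux code. Because the path is straight, the target velocity is constant, which is why a few Euler steps can be enough.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight path; it does not depend on t."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(predicted_v, x0, x1):
    """Mean squared error between a predicted velocity and the target."""
    target = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(predicted_v, target)) / len(target)


def euler_sample(velocity_fn, x0, num_steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    noise, data = [0.5, -1.0], [2.0, 3.0]

    # Hypothetical stand-in for a trained network: the exact velocity
    # field of the straight path that ends at `data`.
    def oracle(x, t):
        return [(b - a) / (1.0 - t) for a, b in zip(x, data)]

    print(euler_sample(oracle, noise, num_steps=4))
```

In a real model the oracle would be replaced by the transformer's velocity prediction, conditioned on the text prompt, and the loss would be averaged over random t and noise samples.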
-----
💡 Key Insights:
→ Online data generation is crucial for sustained performance improvement.
→ CLAP can serve as an effective reward model for text-to-audio alignment.
→ Adding the flow matching loss on the winning sample helps prevent optimization instability.
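
The CLAP-ranking and loss ideas can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the scores are assumed to come from a CLAP model, the pair is formed from the highest- and lowest-scored candidates, and the DPO-style term is written in the usual Diffusion-DPO form over per-sample flow matching losses. All function names and the `beta` value are hypothetical.

```python
import math


def build_preference_pair(candidates, clap_scores):
    """Return (winner, loser): the best- and worst-scored candidates."""
    ranked = sorted(zip(clap_scores, candidates), key=lambda pair: pair[0])
    return ranked[-1][1], ranked[0][1]


def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def dpo_fm_loss(win_policy, win_ref, lose_policy, lose_ref, beta):
    """-log sigmoid(-beta * margin), where margin compares how much the
    policy improves over the reference on the winner versus the loser.
    Inputs are flow matching losses, so lower is better."""
    margin = (win_policy - win_ref) - (lose_policy - lose_ref)
    return softplus(beta * margin)


def crpo_loss(win_policy, win_ref, lose_policy, lose_ref, beta):
    """DPO-style term plus the flow matching loss on the winning sample."""
    return dpo_fm_loss(win_policy, win_ref, lose_policy, lose_ref, beta) + win_policy


if __name__ == "__main__":
    winner, loser = build_preference_pair(["a.wav", "b.wav", "c.wav"], [0.2, 0.5, 0.1])
    print(winner, loser)
    print(crpo_loss(win_policy=0.8, win_ref=1.0, lose_policy=1.2, lose_ref=1.0, beta=2.0))
```

The extra winner term keeps the model's loss on preferred samples from drifting upward while the preference margin is being optimized, which is one way to read the stability insight above.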
-----
📊 Results:
→ Generates 30 seconds of 44.1 kHz audio in 3.7 seconds on a single A40 GPU
→ Lower Fréchet Distance: 75.1 vs next best 89.2
→ Higher CLAP score: 0.480 vs next best 0.447
→ Better Inception Score: 12.2 vs next best 9.9