TangoFlux is a fast text-to-audio model that generates up to 30 seconds of 44.1 kHz audio in about 3.7 seconds on a single A40 GPU, using flow matching and CLAP-ranked preference optimization.
-----
https://arxiv.org/abs/2412.21037
🎯 Original Problem:
→ Current text-to-audio models are slow, use proprietary data, and struggle with complex prompts. Unlike LLMs, they lack effective alignment mechanisms due to missing structured rewards or gold-standard answers.
-----
🔧 Solution in this Paper:
→ TangoFlux is a 515M-parameter model that combines flow matching with a transformer architecture.
→ It introduces CLAP-Ranked Preference Optimization (CRPO), which iteratively generates preference data and optimizes the model on it.
→ The model uses rectified flows, so sampling needs fewer steps.
→ A modified loss adds a flow matching loss term to the DPO-FM objective for better optimization stability.
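
The rectified-flow idea above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes the plain straight-line path between noise and data, and the names `interpolate`, `target_velocity`, `flow_matching_loss`, and `euler_sample` are hypothetical. In a real model, `velocity_fn` would be the trained transformer conditioned on the text prompt; here an oracle stands in for it.

```python
import numpy as np

def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return (1.0 - t) * x0 + t * x1

def target_velocity(x0, x1):
    """Along a straight path the velocity is constant: d/dt interpolate = x1 - x0."""
    return x1 - x0

def flow_matching_loss(pred_velocity, x0, x1):
    """Mean squared error between predicted and target velocity."""
    return float(np.mean((pred_velocity - target_velocity(x0, x1)) ** 2))

def euler_sample(velocity_fn, x0, num_steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)
    return x

# Toy check: an oracle that returns the exact straight-line velocity
# (an assumption standing in for a trained network).
rng = np.random.default_rng(0)
noise = rng.standard_normal(8)
data = rng.standard_normal(8)
oracle = lambda x, t: target_velocity(noise, data)
sample = euler_sample(oracle, noise, num_steps=4)
```

Because the oracle's path is perfectly straight, even a handful of Euler steps lands on the data point, which is the intuition behind sampling with fewer steps. A learned velocity field is only approximately straight, so the step count remains a quality/speed trade-off.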
-----
💡 Key Insights:
→ Online data generation is crucial for sustained performance improvement.
→ CLAP can serve as an effective reward model for text-to-audio alignment.
→ Adding a flow matching loss on the winning (preferred) sample helps prevent optimization instability.
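
A rough sketch of how CLAP-style ranking and the combined loss could fit together. Everything here is an assumption for illustration: the embeddings are plain vectors standing in for CLAP text and audio embeddings, the pair is formed from the highest- and lowest-scoring candidates, and the preference term follows a generic DPO-style form on flow matching errors rather than the paper's exact formula. The function names are hypothetical.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity, used here as a stand-in for a CLAP text-audio score."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def build_preference_pair(text_emb, audio_embs):
    """Rank candidate audios for one prompt; return (winner_idx, loser_idx, scores)."""
    scores = [cosine_similarity(text_emb, e) for e in audio_embs]
    return int(np.argmax(scores)), int(np.argmin(scores)), scores

def crpo_style_loss(err_w, err_l, ref_err_w, ref_err_l, beta=1.0):
    """DPO-style preference term plus the flow matching error on the winner.

    err_*     : flow matching squared errors of the model being trained
    ref_err_* : the same errors under a frozen reference model
    (assumed form, not taken verbatim from the paper)
    """
    margin = (err_w - ref_err_w) - (err_l - ref_err_l)
    # -log(sigmoid(-beta * margin)) written in a numerically stable way
    preference_term = np.logaddexp(0.0, beta * margin)
    # The extra winner term discourages the winner's error from drifting upward
    return float(preference_term + err_w)

# Toy data: three candidate "audio embeddings" for one prompt
text = np.array([1.0, 0.0])
candidates = [np.array([0.0, 1.0]), np.array([1.0, 0.1]), np.array([1.0, 1.0])]
winner, loser, scores = build_preference_pair(text, candidates)
```

The preference term alone only cares about the gap between winner and loser errors, so both errors can grow while the gap still widens. Adding the winner's own flow matching error is one way to read the stability insight above.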
-----
📊 Results:
→ Generates 30 seconds of 44.1 kHz audio in 3.7 seconds on a single A40 GPU
→ Lower Fréchet Distance: 75.1 vs next best 89.2
→ Higher CLAP score: 0.480 vs next best 0.447
→ Better Inception Score: 12.2 vs next best 9.9
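
To put the numbers above in perspective, a small arithmetic sketch computes the real-time factor and the relative gap to the next-best model for each metric. The helper names are hypothetical; the inputs are the figures quoted in the results.

```python
def real_time_factor(audio_seconds, wall_clock_seconds):
    """Seconds of audio produced per second of compute."""
    return audio_seconds / wall_clock_seconds

def relative_change(new, old):
    """Signed fractional change from old to new."""
    return (new - old) / old

speedup = real_time_factor(30.0, 3.7)        # 30 s of audio in 3.7 s
fd_change = relative_change(75.1, 89.2)      # Frechet Distance, lower is better
clap_change = relative_change(0.480, 0.447)  # CLAP score, higher is better
is_change = relative_change(12.2, 9.9)       # Inception Score, higher is better

print(f"real-time factor: {speedup:.1f}x")
print(f"Frechet Distance: {fd_change:+.1%}")
print(f"CLAP score:       {clap_change:+.1%}")
print(f"Inception Score:  {is_change:+.1%}")
```

Generation runs roughly eight times faster than real time, and the metric gaps to the next-best system range from about 7% (CLAP) to about 23% (Inception Score).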