"TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow Matching and Clap-Ranked Preference Optimization"


Podcast on this paper generated with Google's Illuminate.

Rohan Paul

Jan 22, 2025

TangoFlux is a fast text-to-audio model that generates up to 30 seconds of 44.1 kHz audio in about 3.7 seconds on a single A40 GPU, using flow matching and CLAP-ranked preference optimization.

-----

https://arxiv.org/abs/2412.21037

🎯 Original Problem:

→ Current text-to-audio models are slow, often rely on proprietary data, and struggle with complex prompts. Unlike LLMs, they are hard to align because text-to-audio has no structured rewards or gold-standard answers from which to build preference pairs.

-----

🔧 Solution in this Paper:

→ TangoFlux is a 515M-parameter model that combines flow matching with a transformer architecture.

→ It introduces CLAP-Ranked Preference Optimization (CRPO), which iteratively generates preference data and then optimizes the model on it.

→ The model employs rectified flows for efficient sampling with fewer steps.

→ A modified loss function adds the flow matching loss on the winning sample to the DPO-FM preference loss for better optimization stability.
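
To make the rectified-flow idea concrete, here is a minimal pure-Python sketch, not the paper's implementation. It assumes the convention that t=0 is noise and t=1 is data, and every function name here (`interpolate`, `euler_sample`, `oracle_velocity`, and so on) is hypothetical. Because the path between noise and data is a straight line, the target velocity is constant, which is why a few Euler steps can be enough at sampling time.

```python
from typing import Callable, List

Vector = List[float]


def interpolate(noise: Vector, data: Vector, t: float) -> Vector:
    """Point on the straight line from noise (t=0) to data (t=1)."""
    return [(1.0 - t) * n + t * d for n, d in zip(noise, data)]


def target_velocity(noise: Vector, data: Vector) -> Vector:
    """On a straight path the velocity is constant: data - noise."""
    return [d - n for n, d in zip(noise, data)]


def flow_matching_loss(predicted: Vector, noise: Vector, data: Vector) -> float:
    """Mean squared error between a predicted velocity and the target velocity."""
    target = target_velocity(noise, data)
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(target)


def euler_sample(
    velocity_fn: Callable[[Vector, float], Vector],
    noise: Vector,
    num_steps: int,
) -> Vector:
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-size Euler steps."""
    x = list(noise)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def oracle_velocity(data: Vector) -> Callable[[Vector, float], Vector]:
    """Toy stand-in for a trained model: the exact velocity toward one data point."""
    def velocity(x: Vector, t: float) -> Vector:
        return [(d - xi) / (1.0 - t) for d, xi in zip(data, x)]
    return velocity
```

In this toy setting the oracle field is exactly straight, so `euler_sample(oracle_velocity(data), noise, 1)` already lands on `data`. A learned velocity field is only approximately straight, so a real model would still need several steps.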

-----

💡 Key Insights:

→ Online data generation is crucial for sustained performance improvement.

→ CLAP can serve as an effective reward model for text-to-audio alignment.

→ Adding the flow matching loss on the winning sample helps prevent optimization instability.
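
The two ideas above, CLAP as a reward and the extra loss on the winner, can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code. It assumes the CLAP score is the cosine similarity between a text embedding and each candidate audio embedding, that the best- and worst-scoring candidates form the preference pair, and that the combined loss is a DPO-style term plus the winner's flow matching error. All names (`select_preference_pair`, `crpo_style_loss`) and the toy embeddings are hypothetical.

```python
import math
from typing import Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity, used here as a stand-in for a CLAP text-audio score."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def select_preference_pair(
    text_embedding: Sequence[float],
    audio_embeddings: Sequence[Sequence[float]],
) -> Tuple[int, int]:
    """Return (winner_index, loser_index): highest and lowest scoring candidates."""
    scores = [cosine_similarity(text_embedding, e) for e in audio_embeddings]
    indices = range(len(scores))
    winner = max(indices, key=scores.__getitem__)
    loser = min(indices, key=scores.__getitem__)
    return winner, loser


def crpo_style_loss(
    policy_err_win: float,
    policy_err_lose: float,
    ref_err_win: float,
    ref_err_lose: float,
    beta: float,
) -> float:
    """DPO-style preference term plus the winner's flow matching error.

    Each *_err argument is a flow matching (squared velocity) error for the
    current model ("policy") or the frozen reference model on the winning or
    losing sample. This form is an assumption made for illustration.
    """
    # Negative margin means the policy improved on the winner more than on the loser.
    margin = (policy_err_win - ref_err_win) - (policy_err_lose - ref_err_lose)
    # -log(sigmoid(-beta * margin)) written as a softplus.
    preference_term = math.log1p(math.exp(beta * margin))
    # The added winner term keeps the winner's own error from drifting upward.
    return preference_term + policy_err_win
```

The preference term alone only cares about the gap between winner and loser, so it can shrink even while both errors grow. Adding the winner's flow matching error directly penalizes that drift, which matches the stability point in the last insight.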

-----