Audio generation is lighter with Whisper-GPT: Hybrid tokens cut model size by 90% without losing quality
Whisper-GPT combines continuous audio spectrograms with discrete tokens in a hybrid architecture, reducing model size while maintaining high performance for speech and music generation.
-----
https://arxiv.org/abs/2412.11449
## Original Problem 🤔:
→ Current audio generation models built on discrete tokens struggle with long context lengths: high-fidelity audio expands into so many tokens that massive architectures are needed to model it.
→ Pure token-based approaches over ENCODEC must process 600 tokens per second, which makes them computationally expensive (see the back-of-the-envelope sketch below).
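To see how quickly that rate fills a context window, here is a back-of-the-envelope sketch in plain Python; the 8-codebook split is an assumption inferred from the 75 Hz frame rate the paper cites for ENCODEC:

```python
# ENCODEC emits one token per codebook per frame.
# 8 codebooks x 75 Hz frames = the 600 tokens/s cited above (assumed split).
FRAME_RATE_HZ = 75
NUM_CODEBOOKS = 8
TOKENS_PER_SECOND = FRAME_RATE_HZ * NUM_CODEBOOKS  # 600

for seconds in (10, 30, 60):
    print(f"{seconds:>3} s of audio -> {seconds * TOKENS_PER_SECOND:,} tokens of context")
# 10 s -> 6,000 tokens; 60 s -> 36,000 tokens, and attention cost grows
# quadratically with that length.
```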
-----
## Solution in this Paper 🔧:
→ Introduces a hybrid architecture combining mel-spectrograms with ENCODEC acoustic tokens.
→ Uses a modified Whisper architecture in a decoder-only setup, replacing the encoder with a lightweight decoder.
→ Processes 64-channel mel-spectrograms at 75 Hz so that spectrogram frames align one-to-one with ENCODEC tokens.
→ Implements early fusion of the spectrogram and token streams using 32-dimensional vectors (sketched below).
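A minimal PyTorch sketch of such an early-fusion front end, assuming the two 32-dimensional projections are concatenated before entering the decoder (the paper specifies the 32-dimensional streams; the layer names, vocabulary size, and model width here are illustrative):

```python
import torch
import torch.nn as nn

class HybridEarlyFusion(nn.Module):
    """At each 75 Hz step, one 64-channel mel frame and one ENCODEC token
    are each projected to a 32-dimensional vector, fused, and handed to
    the decoder-only language model."""

    def __init__(self, n_mels=64, vocab_size=1024, fused_dim=32, d_model=256):
        super().__init__()
        self.mel_proj = nn.Linear(n_mels, fused_dim)          # continuous stream
        self.tok_embed = nn.Embedding(vocab_size, fused_dim)  # discrete stream
        self.to_model = nn.Linear(2 * fused_dim, d_model)     # into the decoder width

    def forward(self, mel, tokens):
        # mel: [B, T, 64] spectrogram frames; tokens: [B, T] ENCODEC codes
        fused = torch.cat([self.mel_proj(mel), self.tok_embed(tokens)], dim=-1)
        return self.to_model(fused)  # [B, T, d_model]

fusion = HybridEarlyFusion()
out = fusion(torch.randn(2, 75, 64), torch.randint(0, 1024, (2, 75)))  # 1 s of audio
print(out.shape)  # torch.Size([2, 75, 256])
```

Concatenation keeps the two streams separable inside the fused vector; summing the projections would be the other common early-fusion choice.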
-----
## Key Insights 💡:
→ The hybrid representation packs the complete audio information of each time step into a single position, while the discrete tokens keep sampling-based generation possible
→ Eliminates the need for a separate vocoder or Griffin-Lim reconstruction stage, since the predicted ENCODEC tokens decode directly to waveform (see the sketch below)
→ Achieves the performance of ~40M-parameter models with only ~4M parameters
→ Shows stronger gains on complex music signals than on speech
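The vocoder point is easiest to see in code: the discrete half of the representation is ENCODEC codes, which the reference EnCodec decoder turns straight back into a waveform. A hedged sketch against the facebookresearch/encodec package (the file name and bandwidth setting are illustrative assumptions):

```python
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)  # 6 kbps -> 8 codebooks at 75 Hz = 600 tokens/s

wav, sr = torchaudio.load("clip.wav")  # illustrative input file
wav = convert_audio(wav, sr, model.sample_rate, model.channels).unsqueeze(0)

with torch.no_grad():
    frames = model.encode(wav)    # list of (codes [B, n_q, T], scale) frames
    audio = model.decode(frames)  # waveform rebuilt from the tokens alone
print(audio.shape)                # no Griffin-Lim or separate vocoder involved
```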
-----
## Results 📊:
→ Speech: NLL 1.93, Accuracy 35.05%, Perplexity 6.96
→ Music: NLL 2.52, Accuracy 38.47%, Perplexity 12.43
→ Outperforms 10× larger models while using only 4.1M parameters (see the sanity check on perplexity below)
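As a quick sanity check on the numbers above: perplexity is exp(NLL) when NLL is the mean negative log-likelihood in nats, and the reported figures line up (the small speech-side gap is presumably rounding of the reported NLL):

```python
import math

# Perplexity = exp(mean NLL in nats).
for domain, nll, reported in [("Speech", 1.93, 6.96), ("Music", 2.52, 12.43)]:
    print(f"{domain}: exp({nll}) = {math.exp(nll):.2f}  (reported {reported})")
# Speech: exp(1.93) = 6.89   (reported 6.96)
# Music:  exp(2.52) = 12.43  (reported 12.43)
```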