
"Whisper-GPT: A Hybrid Representation Audio Large Language Model"

The podcast below was generated on this paper with Google's Illuminate.

Audio generation is lighter with Whisper-GPT: Hybrid tokens cut model size by 90% without losing quality

Whisper-GPT combines continuous audio spectrograms with discrete tokens in a hybrid architecture, reducing model size while maintaining high performance for speech and music generation.

-----

https://arxiv.org/abs/2412.11449

## Original Problem 🤔:

→ Current audio generation models using discrete tokens struggle with long context lengths, requiring massive architectures to handle high-fidelity audio content.

→ Pure token-based approaches must process roughly 600 tokens per second with ENCODEC, making long-context modeling computationally expensive.
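The 600 tokens/second figure can be sketched with some back-of-the-envelope arithmetic. This assumes a standard ENCODEC configuration (75 Hz frame rate, 8 residual-vector-quantizer codebooks at its 6 kbps setting); the exact codebook count varies with target bandwidth.

```python
# Why discrete-token audio LMs face long contexts (illustrative arithmetic).
# Assumed ENCODEC configuration: 75 frames/s, 8 RVQ codebooks per frame.
frame_rate_hz = 75        # ENCODEC frames per second
codebooks = 8             # residual vector-quantizer levels per frame
tokens_per_second = frame_rate_hz * codebooks
print(tokens_per_second)  # → 600 tokens/s for a pure token-based model

# A hybrid frame-level representation needs only one step per frame,
# so the sequence length shrinks by the number of codebooks:
print(tokens_per_second // frame_rate_hz)  # → 8x shorter context
```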

-----

## Solution in this Paper 🔧:

→ Introduces a hybrid architecture combining mel-spectrograms with ENCODEC acoustic tokens.

→ Uses a modified Whisper architecture in a decoder-only setup, replacing the encoder-decoder design with a lightweight decoder.

→ Processes 64-channel mel-spectrograms at 75 Hz to match ENCODEC token rate.

→ Implements early fusion of spectrogram and token streams using 32-dimensional vectors.
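The early-fusion step above can be sketched at the level of a single 75 Hz frame. This is a minimal NumPy sketch, not the paper's implementation: the 64 mel channels and 32-dimensional fusion vectors come from the text, but the projection/embedding parameters, the vocabulary size, and the choice of concatenation as the fusion operation are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_mels, d_fuse, vocab = 64, 32, 1024   # 64 mel channels, 32-dim fusion (from text); vocab size assumed

# Hypothetical learned parameters (random stand-ins here)
W_mel = rng.standard_normal((n_mels, d_fuse)) * 0.02   # projects one mel frame to 32 dims
E_tok = rng.standard_normal((vocab, d_fuse)) * 0.02    # acoustic-token embedding table

def fuse(mel_frame, token_id):
    """Early fusion of one frame: project the continuous spectrogram
    slice, embed the discrete ENCODEC token, and concatenate both."""
    cont = mel_frame @ W_mel             # (32,) continuous stream
    disc = E_tok[token_id]               # (32,) discrete stream
    return np.concatenate([cont, disc])  # (64,) joint input to the decoder

x = fuse(rng.standard_normal(n_mels), token_id=7)
print(x.shape)  # → (64,)
```

Because both streams are aligned at the same 75 Hz rate, fusion reduces to a per-frame operation with no resampling.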

-----

## Key Insights 💡:

→ Hybrid representation captures the full audio information of each frame in a single token while still allowing discrete sampling

→ Eliminates need for separate vocoder or Griffin-Lim algorithms

→ Achieves the performance of 40M-parameter models with only 4M parameters

→ Shows stronger results for complex music signals compared to speech

-----

## Results 📊:

→ Speech: NLL 1.93, Accuracy 35.05%, Perplexity 6.96

→ Music: NLL 2.52, Accuracy 38.47%, Perplexity 12.43

→ Outperforms 10x larger models while using only 4.1M parameters
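As a quick consistency check on the figures above: perplexity is conventionally the exponential of the per-token negative log-likelihood. The music numbers match to two decimals, and the speech numbers match once rounding of the reported NLL is accounted for.

```python
import math

# Reported results from the section above
speech_nll, speech_ppl = 1.93, 6.96
music_nll, music_ppl = 2.52, 12.43

# Perplexity = exp(NLL)
print(round(math.exp(music_nll), 2))   # → 12.43, matching the reported value
print(round(math.exp(speech_nll), 2))  # → 6.89; the reported 6.96 implies NLL ≈ 1.94 before rounding
```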
