Transformer-based speech codec compresses audio to 400 bits/second while preserving high fidelity
Scaled transformers for ultra-low-bitrate speech compression
This paper introduces a transformer-based speech codec that achieves ultra-low bitrates (400-700 bits per second) while maintaining high audio quality. It uses Finite Scalar Quantization and scales the transformer architecture to nearly 1B parameters, outperforming existing baselines.
-----
https://arxiv.org/abs/2411.19842
🎯 Original Problem:
Traditional neural audio codecs rely on low-parameter architectures with strong inductive biases, which limits their compression capability. Approaches based on Vector Quantization also struggle with poor codebook utilization and with modeling complex token relationships.
-----
🔧 Solution in this Paper:
→ The paper introduces Transformer Audio AutoEncoder (TAAE), a scaled transformer architecture with 950M parameters.
→ It employs a modified Finite Scalar Quantization (FSQ) bottleneck instead of traditional Vector Quantization.
→ The architecture uses stacked transformer blocks with self-attention and feedforward sections, incorporating QK-norm and LayerScale for stability.
→ A novel post-hoc residual decomposition method enables flexible bitrate-quality tradeoffs without retraining.
→ Training combines uniform-noise injection with straight-through gradient estimation in the FSQ bottleneck.
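The FSQ idea above can be sketched in a few lines: each bottleneck dimension is bounded, scaled to a fixed set of levels, and rounded; during training, uniform noise stands in for quantization (with straight-through rounding used in an autodiff framework). This is a minimal illustration, not the paper's implementation; `levels=8` is an assumed hyperparameter.

```python
import math
import random

def fsq_quantize(z, levels=8, training=False):
    """Finite Scalar Quantization of a latent vector (illustrative sketch).

    Each dimension is squashed with tanh, scaled so it spans `levels`
    discrete values, then rounded. `levels` is assumed, not from the paper.
    """
    half = (levels - 1) / 2
    out = []
    for x in z:
        bounded = math.tanh(x) * half  # squash into (-half, half)
        if training:
            # uniform dither in [-0.5, 0.5] as a surrogate for rounding noise
            q = bounded + random.uniform(-0.5, 0.5)
        else:
            # inference: hard rounding; training would pair this with a
            # straight-through (identity) gradient in an autodiff framework
            q = round(bounded)
        out.append(q)
    return out

# a 6-dimensional latent, matching the paper's bottleneck dimensionality
codes = fsq_quantize([0.3, -1.2, 2.5, 0.0, -0.7, 1.1])
```

Because every dimension maps onto a fixed grid, every code is reachable by construction, which is why FSQ sidesteps the codebook-collapse issues of learned VQ codebooks.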
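The stability tricks named in the architecture bullet can also be sketched: QK-norm L2-normalizes queries and keys so attention logits are bounded cosine similarities, and LayerScale multiplies the block output by a small learned per-channel gain before the residual add. A single-head NumPy sketch under assumed shapes and initializations (not the paper's code):

```python
import numpy as np

def qk_norm_attention(x, wq, wk, wv, gamma, eps=1e-6):
    """Single-head self-attention with QK-norm and LayerScale (sketch)."""
    q, k, v = x @ wq, x @ wk, x @ wv
    # QK-norm: unit-normalize queries and keys, bounding each logit to [-1, 1]
    q = q / (np.linalg.norm(q, axis=-1, keepdims=True) + eps)
    k = k / (np.linalg.norm(k, axis=-1, keepdims=True) + eps)
    logits = q @ k.T
    attn = np.exp(logits - logits.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    # LayerScale: small learned per-channel gain, then the residual add
    return x + gamma * (attn @ v)

T, d = 4, 8
rng = np.random.default_rng(0)
x = rng.standard_normal((T, d))
wq, wk, wv = (rng.standard_normal((d, d)) * 0.02 for _ in range(3))
gamma = np.full(d, 1e-4)  # near-zero LayerScale init, ViT-style
y = qk_norm_attention(x, wq, wk, wv, gamma)
```

With `gamma` initialized near zero, each block starts out close to the identity, which is what makes deep stacks of these blocks trainable at scale.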
-----
💡 Key Insights:
→ Scaling transformer architectures for speech coding enables stronger compression
→ FSQ provides better codebook utilization than Vector Quantization
→ Post-hoc residual decomposition allows flexible bitrate adjustments
→ Hybrid training approach improves model stability
-----
📊 Results:
→ Achieves state-of-the-art speech quality at 400-700 bits per second
→ Model contains 950M parameters with 6-dimensional bottleneck
→ Outperforms existing baselines in both objective and subjective tests
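A quick sanity check on how a 6-dimensional FSQ bottleneck lands in the hundreds of bits per second: bits per frame are dims × log2(levels), times the frame rate. The levels-per-dimension and frame rate below are assumed values for illustration only, not figures from the paper.

```python
import math

def fsq_bitrate(dims, levels, frames_per_sec):
    """Bits/second of an FSQ bottleneck: dims * log2(levels) bits per frame."""
    return dims * math.log2(levels) * frames_per_sec

# 6-dim bottleneck (from the paper); 8 levels/dim and 25 frames/s assumed
print(fsq_bitrate(6, 8, 25))  # 6 dims * 3 bits * 25 frames/s = 450 bits/s
```

Even rough settings like these land in the paper's 400-700 bits/second range, which shows how aggressive the compression is compared with typical codecs operating at several kilobits per second.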