Why use transformers when state-space models can do speech recognition better and faster?
Samba-ASR introduces a state-space model architecture that replaces traditional transformer components in speech recognition, achieving superior accuracy with linear computational scaling.
-----
https://arxiv.org/abs/2501.02832
🎯 Original Problem:
Transformer-based Automatic Speech Recognition (ASR) models rely on self-attention, whose compute and memory costs grow quadratically with sequence length, making long audio expensive to process.
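To make the scaling gap concrete, here is an illustrative back-of-envelope comparison, not taken from the paper; the frame rate, model width, and state size are assumed values:

```python
# Rough operation counts: self-attention scales quadratically with sequence
# length T, while a state-space recurrence scales linearly.
def attention_ops(seq_len: int, d_model: int) -> int:
    # QK^T and the attention-weighted sum of V: two T x T x d matrix products
    return 2 * seq_len * seq_len * d_model

def ssm_ops(seq_len: int, d_model: int, d_state: int = 16) -> int:
    # One fixed-size state update per channel per timestep
    return seq_len * d_model * d_state

for seconds in (10, 60, 600):                # audio clip lengths
    frames = seconds * 100                   # ~100 feature frames per second (assumed)
    ratio = attention_ops(frames, 512) / ssm_ops(frames, 512)
    print(f"{seconds:4d}s audio -> attention/SSM op ratio ~ {ratio:,.0f}")
```

The ratio itself grows linearly with audio length, which is why the gap widens on long recordings.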
-----
🔧 Solution in this Paper:
→ Samba-ASR uses the Mamba architecture, built on state-space models (SSMs), as both encoder and decoder
→ The model processes audio through convolutional layers and Mamba blocks, capturing local and global dependencies efficiently
→ A Mamba-cross-connection mechanism in the decoder enables targeted alignment between audio and text features
→ The architecture achieves linear computational complexity through selective state-space dynamics (see the structural sketch below)
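A minimal PyTorch sketch of how those pieces might fit together, under stated assumptions: `MambaBlock` below is a placeholder stub rather than a real selective-scan block (a real model would use an implementation such as the `mamba_ssm` package), and every name and dimension (`SambaASRSketch`, `n_mels`, `d_model`, layer counts) is illustrative, not taken from the paper.

```python
# Structural sketch only; NOT the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MambaBlock(nn.Module):
    """Placeholder for a selective state-space (Mamba) block: linear-time in T."""

    def __init__(self, d_model: int):
        super().__init__()
        # Depthwise conv as a cheap stand-in for the SSM's sequence mixing
        self.mix = nn.Conv1d(d_model, d_model, kernel_size=3, padding=1, groups=d_model)
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x):                                   # x: (B, T, d_model)
        y = self.mix(x.transpose(1, 2)).transpose(1, 2)
        return x + self.proj(F.silu(y))                     # residual connection


class SambaASRSketch(nn.Module):
    def __init__(self, n_mels=80, d_model=512, n_layers=4, vocab_size=5000):
        super().__init__()
        # Convolutional front-end: downsamples audio features before the Mamba encoder
        self.subsample = nn.Sequential(
            nn.Conv1d(n_mels, d_model, kernel_size=3, stride=2, padding=1), nn.GELU(),
            nn.Conv1d(d_model, d_model, kernel_size=3, stride=2, padding=1), nn.GELU(),
        )
        self.encoder = nn.ModuleList(MambaBlock(d_model) for _ in range(n_layers))
        self.embed = nn.Embedding(vocab_size, d_model)
        self.decoder = nn.ModuleList(MambaBlock(d_model) for _ in range(n_layers))
        # Stand-in for the Mamba cross-connection that aligns text with audio features
        self.cross = nn.Linear(2 * d_model, d_model)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, mels, tokens):         # mels: (B, T, n_mels), tokens: (B, U)
        x = self.subsample(mels.transpose(1, 2)).transpose(1, 2)
        for block in self.encoder:
            x = block(x)                     # audio representation, (B, T/4, d_model)
        audio_ctx = x.mean(dim=1, keepdim=True)             # crude pooling placeholder
        y = self.embed(tokens)
        for block in self.decoder:
            y = block(y)
            y = self.cross(torch.cat([y, audio_ctx.expand_as(y)], dim=-1))
        return self.out(y)                   # (B, U, vocab_size) token logits
```

In the actual model the cross-connection is itself Mamba-based rather than this mean-pool-and-concatenate placeholder; the sketch only shows where encoder audio features and decoder text features meet.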
-----
💡 Key Insights:
→ State-space models can effectively replace attention mechanisms in ASR systems
→ Linear scaling with sequence length enables efficient processing of long audio (see the recurrence sketch after this list)
→ Mamba blocks provide better computational efficiency than transformer blocks
→ The model maintains high accuracy while reducing memory usage
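The linear-scaling point comes down to the SSM recurrence: each timestep updates a fixed-size hidden state instead of attending over every previous frame. A toy NumPy sketch of a non-selective, diagonal discrete state-space scan, purely for illustration (Mamba additionally makes the parameters input-dependent, which this version omits):

```python
# Discrete state-space recurrence:  h_t = A * h_{t-1} + B * x_t,  y_t = C^T h_t
# Each step touches only a fixed-size state, so cost and memory grow linearly with T.
import numpy as np

def ssm_scan(x, A, B, C):
    # x: (T, d_in); A, B, C: (d_state,) diagonal per-state parameters
    T = x.shape[0]
    h = np.zeros_like(A)
    y = np.empty(T)
    for t in range(T):
        h = A * h + B * x[t].sum()   # toy scalar input projection for brevity
        y[t] = C @ h                 # readout from the state
    return y

rng = np.random.default_rng(0)
out = ssm_scan(rng.standard_normal((1000, 80)),
               A=np.full(16, 0.9),
               B=rng.standard_normal(16),
               C=rng.standard_normal(16))
print(out.shape)   # (1000,): one left-to-right pass, constant memory per step
```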
-----
📊 Results:
→ Average Word Error Rate (WER): 3.65%
→ LibriSpeech Clean WER: 1.17%
→ GigaSpeech WER: 9.12%
→ SPGISpeech WER: 1.84%
→ Outperforms existing transformer-based ASR models while using fewer computational resources