"Samba-ASR: State-Of-The-Art Speech Recognition Leveraging Structured State-Space Models"

The podcast below on this paper was generated with Google's Illuminate.

Why use transformers when state-space models can do speech recognition better and faster?

Samba-ASR introduces a state-space model architecture that replaces traditional transformer components in speech recognition, achieving superior accuracy with linear computational scaling.

-----

https://arxiv.org/abs/2501.02832

🎯 Original Problem:

Transformer-based Automatic Speech Recognition (ASR) models face quadratic scaling issues with sequence length, making them computationally expensive and memory-intensive for processing long audio sequences.
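To make the contrast concrete, here is a minimal NumPy sketch (illustrative only, not the paper's code; sequence length, dimensions, and parameters are made up) of the two cost profiles: self-attention materializes an L x L score matrix, while a state-space recurrence keeps a fixed-size state.

```python
# Illustrative sketch only (not the paper's code): attention builds an
# L x L score matrix, while a state-space recurrence keeps a fixed-size state.
import numpy as np

L, d = 2000, 64                      # e.g. ~20 s of audio at 100 frames/s
x = np.random.randn(L, d)

# Self-attention: O(L^2 * d) time and O(L^2) memory for the score matrix.
scores = x @ x.T / np.sqrt(d)                    # (L, L) -- 4M entries here
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)
attn_out = weights @ x                           # (L, d)

# State-space recurrence: O(L * d * n) time, O(d * n) memory, independent of L.
n = 16                                           # state size per channel (toy value)
A = np.full((d, n), 0.9)                         # toy, non-selective parameters
B = np.random.randn(d, n) * 0.1
C = np.random.randn(d, n) * 0.1
h = np.zeros((d, n))
ssm_out = np.empty_like(x)
for t in range(L):
    h = A * h + B * x[t][:, None]                # one fixed-size update per frame
    ssm_out[t] = (C * h).sum(axis=-1)
```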

-----

🔧 Solution in this Paper:

→ Samba-ASR uses the Mamba architecture, built on structured state-space models (SSMs), as both encoder and decoder (a layout sketch follows this list)

→ The model processes audio through convolutional layers and Mamba blocks, capturing local and global dependencies efficiently

→ A Mamba-cross-connection mechanism in the decoder enables targeted alignment between audio and text features

→ The architecture achieves linear computational complexity through selective state-space dynamics
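A rough PyTorch sketch of the layout described above, under loud assumptions: `MambaBlockStub` is a simple gated-convolution stand-in for a real selective-SSM block, and the cross-connection is reduced to a pooled-context projection. Module names and sizes are hypothetical, not the paper's implementation.

```python
# Hypothetical sketch of the Samba-ASR layout: conv front-end -> Mamba encoder
# stack -> Mamba decoder stack with a cross-connection to encoder features.
# MambaBlockStub is a stand-in, NOT a real selective-SSM implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MambaBlockStub(nn.Module):
    """Placeholder for a Mamba (selective SSM) block; linear in sequence length."""
    def __init__(self, d_model):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.mix = nn.Conv1d(d_model, d_model, kernel_size=4, padding=3, groups=d_model)
        self.gate = nn.Linear(d_model, d_model)
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x):                        # x: (batch, time, d_model)
        y = self.norm(x)
        y = self.mix(y.transpose(1, 2))[..., : x.size(1)].transpose(1, 2)
        y = self.proj(F.silu(self.gate(x)) * y)  # gated mixing
        return x + y                             # residual connection

class SambaASRSketch(nn.Module):
    def __init__(self, n_mels=80, d_model=512, vocab=5000, n_layers=4):
        super().__init__()
        # Convolutional front-end: downsample the spectrogram in time.
        self.subsample = nn.Sequential(
            nn.Conv1d(n_mels, d_model, kernel_size=3, stride=2, padding=1), nn.GELU(),
            nn.Conv1d(d_model, d_model, kernel_size=3, stride=2, padding=1), nn.GELU(),
        )
        self.encoder = nn.ModuleList(MambaBlockStub(d_model) for _ in range(n_layers))
        self.embed = nn.Embedding(vocab, d_model)
        self.decoder = nn.ModuleList(MambaBlockStub(d_model) for _ in range(n_layers))
        # Stand-in for the paper's Mamba cross-connection: a projection of
        # pooled encoder features added to each decoder step.
        self.cross = nn.Linear(d_model, d_model)
        self.out = nn.Linear(d_model, vocab)

    def forward(self, mels, tokens):             # mels: (B, n_mels, T), tokens: (B, U)
        enc = self.subsample(mels).transpose(1, 2)        # (B, T', d_model)
        for blk in self.encoder:
            enc = blk(enc)
        dec = self.embed(tokens)                          # (B, U, d_model)
        ctx = self.cross(enc.mean(dim=1, keepdim=True))   # crude global audio context
        for blk in self.decoder:
            dec = blk(dec + ctx)
        return self.out(dec)                              # (B, U, vocab) logits

model = SambaASRSketch()
logits = model(torch.randn(2, 80, 320), torch.randint(0, 5000, (2, 12)))
```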

-----

💡 Key Insights:

→ State-space models can effectively replace attention mechanisms in ASR systems (a recurrence sketch follows this list)

→ Linear scaling with sequence length enables efficient processing of long audio

→ Mamba blocks provide better computational efficiency than transformer blocks

→ The model maintains high accuracy while reducing memory usage
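The insight about replacing attention rests on the selective scan at the heart of Mamba. A toy NumPy version of that recurrence (parameter names and shapes are assumptions, not the paper's code) shows why the cost is linear: the B, C, and step-size parameters depend on the current input, and each frame performs one fixed-size state update.

```python
# Toy selective state-space recurrence in the spirit of Mamba (not the
# paper's implementation). Each frame does one O(d * n) state update.
import numpy as np

def selective_ssm_scan(x, w_dt, W_B, W_C, A):
    """x: (L, d). Returns (L, d); runtime is linear in L."""
    L, d = x.shape
    n = A.shape[1]
    h = np.zeros((d, n))                       # hidden state, size independent of L
    y = np.empty_like(x)
    for t in range(L):
        dt = np.log1p(np.exp(x[t] * w_dt))     # input-dependent step size (softplus)
        B_t = x[t] @ W_B                       # input-dependent input projection, (n,)
        C_t = x[t] @ W_C                       # input-dependent output projection, (n,)
        A_bar = np.exp(dt[:, None] * A)        # discretized state transition, (d, n)
        h = A_bar * h + dt[:, None] * (x[t][:, None] * B_t[None, :])
        y[t] = h @ C_t                         # read out the state
    return y

L, d, n = 1000, 32, 16
x = np.random.randn(L, d)
A = -np.exp(np.random.randn(d, n))             # negative -> stable exponential decay
y = selective_ssm_scan(x,
                       np.random.randn(d) * 0.1,
                       np.random.randn(d, n) * 0.1,
                       np.random.randn(d, n) * 0.1,
                       A)
```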

-----

📊 Results:

→ Average Word Error Rate (WER) across benchmarks: 3.65% (WER is defined in the snippet after this list)

→ LibriSpeech Clean WER: 1.17%

→ GigaSpeech WER: 9.12%

→ SPGISpeech WER: 1.84%

→ Outperforms existing transformer-based models while using fewer computational resources
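For reference, WER is the word-level edit distance between hypothesis and reference transcripts, divided by the number of reference words. A minimal implementation, not tied to the paper's evaluation scripts:

```python
# Word Error Rate: (substitutions + deletions + insertions) / reference words,
# computed as a word-level Levenshtein distance.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
    return dp[-1][-1] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1/6 ≈ 0.167
```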
