"Reading to Listen at the Cocktail Party: Multi-Modal Speech Separation"

A podcast on this paper, generated with Google's Illuminate, is below.

VoiceFormer, proposed in this paper, reads lips and text to isolate voices in noisy environments, just like humans do

VoiceFormer introduces a transformer-based architecture that can separate speech in noisy environments using text content, lip movements, or both, without requiring strict synchronization between inputs.

-----

https://arxiv.org/abs/2501.01518

🎯 Original Problem:

Existing speech separation methods either rely heavily on synchronized lip movements or static biometric features, making them vulnerable to visual disruptions and synchronization issues in real-world scenarios.

-----

🔧 Solution in this Paper:

→ VoiceFormer uses a U-Net architecture with a transformer bottleneck to process raw audio waveforms directly

→ The model can condition on text content alone, making it useful for scenarios with prepared speeches or song lyrics

→ It employs positional encoding to handle asynchronous inputs, eliminating the need for manual audio-visual alignment

→ The architecture fuses multiple modalities through cross-attention, allowing flexible combinations of audio, video, and text (a minimal sketch follows this list)
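
As a rough illustration of the fusion described above, here is a minimal PyTorch sketch (not the authors' code): U-Net bottleneck audio features act as queries, lip-video and text features are concatenated as keys/values for cross-attention, and sinusoidal positional encodings stand in for explicit alignment. The class name, dimensions, and layer layout are illustrative assumptions.

```python
# Minimal sketch of a multi-modal transformer bottleneck with cross-attention.
# Assumptions (not from the paper): tensor shapes, d_model=256, layer layout,
# and the class name CrossModalBottleneck are all illustrative.
import math
import torch
import torch.nn as nn


def sinusoidal_pe(length: int, dim: int) -> torch.Tensor:
    """Standard sinusoidal positional encoding of shape (length, dim)."""
    pos = torch.arange(length, dtype=torch.float32).unsqueeze(1)
    div = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32) * (-math.log(10000.0) / dim))
    pe = torch.zeros(length, dim)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe


class CrossModalBottleneck(nn.Module):
    """Audio bottleneck features attend to lip-video and/or text features."""

    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.ReLU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, audio, video=None, text=None):
        # audio: (B, Ta, D) features from the U-Net encoder bottleneck;
        # video: (B, Tv, D) lip features; text: (B, Tt, D) text embeddings.
        audio = audio + sinusoidal_pe(audio.size(1), audio.size(2)).to(audio.device)
        x = self.norm1(audio + self.self_attn(audio, audio, audio)[0])
        # Concatenate whichever conditioning streams are present; positional
        # encodings let attention align them without manual synchronization.
        cond = [m + sinusoidal_pe(m.size(1), m.size(2)).to(m.device)
                for m in (video, text) if m is not None]
        if cond:
            mem = torch.cat(cond, dim=1)
            x = self.norm2(x + self.cross_attn(x, mem, mem)[0])
        return self.norm3(x + self.ffn(x))


# Usage: audio-only, audio+video, audio+text, or all three.
block = CrossModalBottleneck()
a = torch.randn(2, 100, 256)   # 2 clips, 100 audio frames
v = torch.randn(2, 25, 256)    # 25 video frames (1 s at 25 fps)
t = torch.randn(2, 40, 256)    # 40 text tokens
out = block(a, video=v, text=t)  # -> (2, 100, 256)
```

Dropping the `video` or `text` argument mimics audio-only, A+V, or A+T conditioning, which is what makes this kind of fusion flexible.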

-----

💡 Key Insights:

→ Text-only conditioning achieves reasonable speech separation performance (SDR: 13.1 dB)

→ Lip movements provide stronger separation cues than text content

→ The model remains robust even with 200 ms of audio-visual misalignment (see the offset-sweep sketch after this list)

→ Adding text improves resilience against missing or corrupted video information
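
A hedged sketch of how the misalignment robustness noted above could be probed: shift the video stream by whole frames relative to the audio and re-measure separation quality. The 25 fps assumption (so 5 frames = 200 ms), the zero-padding choice, and the generic `model` / `sdr_fn` callables are mine, not interfaces from the paper.

```python
# Hedged sketch of an audio-visual offset sweep; `model` and `sdr_fn` are
# generic placeholders (e.g. the block above plus a separation head and an
# SDR metric), not interfaces from the paper. Assumes 25 fps video features.
import torch


def shift_video(video: torch.Tensor, offset_frames: int) -> torch.Tensor:
    """Delay (positive) or advance (negative) the video stream by whole frames,
    zero-padding the vacated positions so the length is unchanged."""
    shifted = torch.zeros_like(video)
    if offset_frames >= 0:
        shifted[:, offset_frames:] = video[:, :video.size(1) - offset_frames]
    else:
        shifted[:, :offset_frames] = video[:, -offset_frames:]
    return shifted


def misalignment_sweep(model, audio, video, text, clean, sdr_fn, fps: int = 25):
    """Re-run separation at several audio-visual offsets and report SDR."""
    for frames in (0, 1, 3, 5):  # 5 frames = 200 ms at 25 fps
        estimate = model(audio, video=shift_video(video, frames), text=text)
        score = sdr_fn(estimate, clean).mean().item()
        print(f"offset {frames} frames ({1000 * frames // fps} ms): SDR {score:.1f} dB")
```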

-----

📊 Results:

→ Achieves state-of-the-art performance on the LRS2 and LRS3 datasets

→ Outperforms previous methods with an SDR of 14.2 dB using audio, video, and text (A+V+T) vs 10.8 dB for the previous best (SDR itself is sketched after this list)

→ Maintains high performance even with a 5-frame (200 ms) audio-visual offset
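
For context on the numbers above, SDR (signal-to-distortion ratio) is reported in dB, higher being better. Below is a minimal sketch of the plain ratio form of the metric, not the full BSS-Eval decomposition the paper may use; the epsilon and variable names are my own choices.

```python
# Minimal sketch of the SDR metric in its plain ratio form (not the full
# BSS-Eval decomposition); variable names and the epsilon are my own choices.
import torch


def sdr(estimate: torch.Tensor, reference: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Signal-to-distortion ratio in dB for waveforms shaped (B, T); higher is better."""
    noise = estimate - reference
    ratio = reference.pow(2).sum(dim=-1) / (noise.pow(2).sum(dim=-1) + eps)
    return 10 * torch.log10(ratio + eps)


# Example: a lightly corrupted estimate scores around 20 dB; heavier noise scores lower.
reference = torch.randn(1, 16000)
print(sdr(reference + 0.1 * torch.randn_like(reference), reference))  # ~20 dB
```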
