VoiceFormer, proposed in this paper, reads lips and text to isolate voices in noisy environments, much as humans do.
VoiceFormer introduces a transformer-based architecture that can separate speech in noisy environments using text content, lip movements, or both, without requiring strict synchronization between inputs.
-----
https://arxiv.org/abs/2501.01518
🎯 Original Problem:
Existing speech separation methods either rely heavily on synchronized lip movements or static biometric features, making them vulnerable to visual disruptions and synchronization issues in real-world scenarios.
-----
🔧 Solution in this Paper:
→ VoiceFormer uses a U-Net architecture with a transformer bottleneck to process raw audio waveforms directly
→ The model can condition on text content alone, making it useful for scenarios with prepared speeches or song lyrics
→ It employs positional encoding to handle asynchronous inputs, eliminating the need for manual audio-visual alignment
→ The architecture fuses multiple modalities through cross-attention, allowing flexible combination of audio, video, and text (see the sketch after this list)
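
A minimal sketch of the idea in PyTorch, not the authors' implementation: all layer sizes, shapes, and the names `VoiceFormerSketch` and `sinusoidal_positions` are illustrative assumptions. It shows a waveform U-Net whose transformer bottleneck cross-attends to optional text and lip-movement embeddings, with separate positional encodings so the conditioning streams need not be frame-aligned with the audio.

```python
# Hedged sketch (assumed shapes/hyperparameters), not the paper's exact code.
import math
import torch
import torch.nn as nn


def sinusoidal_positions(length: int, dim: int) -> torch.Tensor:
    """Standard sinusoidal positional encoding, shape (length, dim)."""
    pos = torch.arange(length).unsqueeze(1)
    div = torch.exp(torch.arange(0, dim, 2) * (-math.log(10000.0) / dim))
    pe = torch.zeros(length, dim)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe


class VoiceFormerSketch(nn.Module):
    def __init__(self, channels: int = 256, n_layers: int = 4, n_heads: int = 8):
        super().__init__()
        # U-Net style encoder/decoder operating directly on raw waveforms.
        self.encoder = nn.Sequential(
            nn.Conv1d(1, channels // 2, kernel_size=16, stride=8, padding=4), nn.ReLU(),
            nn.Conv1d(channels // 2, channels, kernel_size=16, stride=8, padding=4), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(channels, channels // 2, kernel_size=16, stride=8, padding=4), nn.ReLU(),
            nn.ConvTranspose1d(channels // 2, 1, kernel_size=16, stride=8, padding=4),
        )
        # Transformer bottleneck: self-attention over audio frames plus
        # cross-attention to whatever conditioning tokens are available.
        layer = nn.TransformerDecoderLayer(d_model=channels, nhead=n_heads, batch_first=True)
        self.bottleneck = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.channels = channels

    def forward(self, mixture, text_emb=None, video_emb=None):
        # mixture: (batch, samples); text_emb / video_emb: (batch, len, channels) or None
        h = self.encoder(mixture.unsqueeze(1)).transpose(1, 2)  # (B, T_audio, C)
        h = h + sinusoidal_positions(h.size(1), self.channels).to(h)

        # Any subset of conditioning streams may be present; each gets its own
        # positional encoding, so no frame-level audio-visual alignment is assumed.
        cond = []
        for stream in (text_emb, video_emb):
            if stream is not None:
                cond.append(stream + sinusoidal_positions(stream.size(1), self.channels).to(stream))
        memory = (torch.cat(cond, dim=1) if cond
                  else torch.zeros(h.size(0), 1, self.channels, device=h.device))

        h = self.bottleneck(tgt=h, memory=memory)              # cross-modal fusion
        return self.decoder(h.transpose(1, 2)).squeeze(1)      # separated waveform


# Usage: 1 s of 16 kHz audio, conditioned on hypothetical text and lip embeddings.
model = VoiceFormerSketch()
mix = torch.randn(2, 16000)
text = torch.randn(2, 20, 256)   # e.g. token embeddings of the spoken text
lips = torch.randn(2, 25, 256)   # e.g. lip-movement features at 25 fps
est = model(mix, text_emb=text, video_emb=lips)
print(est.shape)  # torch.Size([2, 16000])
```

Either conditioning input can be dropped at call time, which mirrors the paper's claim that the model works with text only, video only, or both.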
-----
💡 Key Insights:
→ Text-only conditioning achieves reasonable speech separation performance (SDR: 13.1)
→ Lip movements provide stronger separation cues than text content
→ The model remains robust even with 200ms audio-visual misalignment
→ Adding text improves resilience against missing or corrupted video information
-----
📊 Results:
→ Achieves state-of-the-art performance on LRS2 and LRS3 datasets
→ Outperforms previous methods with SDR of 14.2 (A+V+T) vs 10.8 (previous best)
→ Maintains high performance even with 5-frame (200ms) audio-visual offset