A single model with multiple experts handles error correction for different input types
NeKo, proposed in this paper, uses task-specialized experts to fix recognition errors across speech, text, and vision tasks
https://arxiv.org/abs/2411.05945
Original Problem 🤔:
Building a general-purpose post-recognition error corrector that can handle multiple domains (speech, text, vision) while maintaining high performance across all tasks remains challenging. Current solutions require separate models for each domain, leading to parameter inefficiency.
-----
Solution in this Paper 🛠️:
→ NeKo introduces a task-oriented Mixture-of-Experts (MoE) architecture where experts specialize in specific tasks (speech-to-text, language-to-text, vision-to-text)
→ During training, each input token is routed to both its task-specific expert and the top expert selected by a learned gating network
→ During inference, tokens are routed purely on router probabilities, with no task label required, enabling zero-shot generalization
→ The model replaces the standard feed-forward blocks with MoE layers, allowing efficient parameter sharing across tasks (a rough sketch of this routing follows below)
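Below is a minimal sketch of what such a task-oriented MoE layer could look like in PyTorch. This is not the official NeKo implementation; the class name, dimensions, and expert count (TaskOrientedMoE, d_model, num_experts) are illustrative assumptions based on the routing rule described above.

```python
# Minimal sketch of a task-oriented MoE feed-forward layer (not the official
# NeKo code; class and parameter names are illustrative assumptions).
# Training: each token goes to its task-specific expert plus the router's
# top pick. Inference: routing uses router probabilities alone (top-2),
# so no task label is needed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TaskOrientedMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, num_experts=4):
        super().__init__()
        # One feed-forward expert per task family (e.g. ASR, grammar, OCR).
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])
        self.router = nn.Linear(d_model, num_experts)  # gating network

    def forward(self, x, task_id=None):
        # x: (batch, seq, d_model); task_id: index of the task-specific expert,
        # supplied only during training.
        probs = F.softmax(self.router(x), dim=-1)              # (B, S, E)
        if self.training and task_id is not None:
            top1 = probs.argmax(dim=-1)                        # router's top expert
            task = torch.full_like(top1, task_id)              # fixed task expert
            chosen = torch.stack([task, top1], dim=-1)         # (B, S, 2)
        else:
            # Zero-shot inference: rely purely on learned router probabilities.
            chosen = probs.topk(2, dim=-1).indices             # (B, S, 2)
        gates = torch.gather(probs, -1, chosen)
        gates = gates / gates.sum(dim=-1, keepdim=True)        # renormalize weights
        out = torch.zeros_like(x)
        for k in range(chosen.size(-1)):
            for e, expert in enumerate(self.experts):
                mask = chosen[..., k] == e
                if mask.any():
                    out[mask] += gates[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out
```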
-----
Key Insights from this Paper 💡:
→ Task-specific expert assignment during training enables better specialization while maintaining cross-task knowledge sharing
→ MoE architecture provides better parameter efficiency compared to having separate models for each task
→ Zero-shot generalization is achieved by relying on learned routing patterns at inference time, without task labels (see the usage sketch after this list)
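Continuing the illustrative sketch above, the task label is passed only during training and omitted at inference, so routing falls back to the learned gate; the tensor shapes and expert index are assumptions, not values from the paper.

```python
# Usage of the sketch above: task label during training, router-only at inference.
import torch

moe = TaskOrientedMoE()
tokens = torch.randn(2, 16, 512)      # dummy post-recognition hypothesis embeddings

moe.train()
y_train = moe(tokens, task_id=0)      # e.g. 0 = speech-to-text expert (assumed index)

moe.eval()
y_zero_shot = moe(tokens)             # no task id: router-only top-2 routing
```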
-----
Results 📊:
→ 5.0% relative Word Error Rate (WER) reduction on the Open ASR Leaderboard
→ 15.5% to 27.6% relative WER reduction compared to GPT-3.5 and Claude-Opus on the zero-shot Hyporadise benchmark
→ State-of-the-art results in ASR correction while maintaining competitive performance on grammar and OCR correction tasks