"NeKo: Toward Post Recognition Generative Correction Large Language Models with Task-Oriented Experts"

The podcast on this paper is generated with Google's Illuminate.

A single model with multiple experts handles error correction for different input types

NeKo, proposed in this paper, uses specialized experts to fix recognition errors across speech, text and vision tasks

https://arxiv.org/abs/2411.05945

Original Problem 🤔:

Building a general-purpose post-recognition error corrector that can handle multiple domains (speech, text, vision) while maintaining high performance across all tasks remains challenging. Current solutions require separate models for each domain, leading to parameter inefficiency.
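
To make the setup concrete, here is a hypothetical post-recognition correction instance, sketched in Python. The field names and the N-best hypothesis list are illustrative assumptions, not the paper's exact data layout; the point is simply that the corrector consumes a recognizer's (possibly erroneous) outputs and generates a corrected result.

```python
# Hypothetical example of a post-recognition correction instance.
# Field names are illustrative, not taken from the paper.
asr_example = {
    "task": "speech-to-text",              # other tasks: language-to-text, vision-to-text (OCR)
    "hypotheses": [                        # N-best list produced by the upstream recognizer
        "the whether is nice today",
        "the weather is nice to day",
    ],
    "target": "the weather is nice today", # corrected output the model should generate
}
```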

-----

Solution in this Paper 🛠️:

→ NeKo introduces a task-oriented Mixture-of-Experts (MoE) architecture where experts specialize in specific tasks (speech-to-text, language-to-text, vision-to-text)

→ During training, input tokens are routed to both their task-specific expert and the top expert selected by a gating network (see the routing sketch after this list)

→ During inference, tokens are routed purely based on router probabilities without task knowledge, enabling zero-shot generalization

→ The model replaces standard feedforward blocks with MoE layers, allowing efficient parameter sharing across tasks
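
Below is a minimal PyTorch sketch of this routing scheme. Class names, hidden dimensions, and the number of experts are my own assumptions; it only illustrates the contrast described above between training-time dispatch (task-assigned expert plus the top gated expert) and inference-time dispatch (router probabilities alone).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TaskOrientedMoE(nn.Module):
    """Sketch of a task-oriented MoE feed-forward layer.

    Training: each token is sent to its task-assigned expert plus the
    top expert chosen by the gating network.
    Inference: tokens are routed purely by router probabilities
    (top-2 here), with no task labels required.
    """

    def __init__(self, d_model=512, d_ff=2048, num_experts=4):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x, task_expert_id=None):
        # x: (batch, seq, d_model); task_expert_id: (batch,) during training, None at inference
        probs = F.softmax(self.router(x), dim=-1)           # (B, T, E) routing probabilities

        if task_expert_id is not None:
            # Training: task-assigned expert + top gated expert.
            top1 = probs.argmax(dim=-1)                      # (B, T) expert chosen by the gate
            task = task_expert_id[:, None].expand_as(top1)   # (B, T) expert fixed by the task
            chosen = torch.stack([task, top1], dim=-1)       # (B, T, 2)
        else:
            # Inference: router-driven top-2 routing (zero-shot, no task knowledge).
            chosen = probs.topk(2, dim=-1).indices           # (B, T, 2)

        weights = torch.gather(probs, -1, chosen)            # router probs of the chosen experts
        weights = weights / weights.sum(dim=-1, keepdim=True)

        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            for k in range(chosen.size(-1)):
                mask = (chosen[..., k] == e)                 # tokens sent to expert e via slot k
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out

# Toy usage: two sequences from the speech-to-text task (assigned to expert 0).
layer = TaskOrientedMoE()
x = torch.randn(2, 16, 512)
train_out = layer(x, task_expert_id=torch.tensor([0, 0]))   # training-style routing
infer_out = layer(x)                                         # inference-style routing
```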

-----

Key Insights from this Paper 💡:

→ Task-specific expert assignment during training enables better specialization while maintaining cross-task knowledge sharing

→ MoE architecture provides better parameter efficiency compared to having separate models for each task

→ Zero-shot generalization is possible by relying on learned routing patterns during inference

-----

Results 📊:

→ 5.0% relative Word Error Rate (WER) reduction on the Open ASR Leaderboard (see the toy calculation after this list)

→ 15.5% to 27.6% relative WER reduction compared to GPT-3.5 and Claude-Opus on the zero-shot Hyporadise benchmark

→ State-of-the-art results in ASR correction while maintaining competitive performance on grammar and OCR correction tasks
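
For clarity, a "relative WER reduction" compares the corrected WER against the baseline WER. A toy calculation with made-up numbers (not taken from the paper):

```python
# Illustrative arithmetic for relative WER reduction; the WER values are hypothetical.
baseline_wer = 8.0   # WER of the uncorrected recognizer output, in %
corrected_wer = 7.6  # WER after post-recognition correction, in %
relative_reduction = (baseline_wer - corrected_wer) / baseline_wer * 100
print(f"{relative_reduction:.1f}% relative WER reduction")  # -> 5.0%
```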
