
Understanding and Mitigating Language Confusion in LLMs

This podcast was generated from the paper with Google's Illuminate, a specialized tool that creates podcasts exclusively from arXiv papers.

The Language Confusion Benchmark (LCB), proposed in this paper, reveals why LLMs struggle with language consistency and how to fix their linguistic mix-ups.

📚 https://arxiv.org/abs/2406.20052

Original Problem 🎯:

LLMs often fail to consistently generate text in the user's desired language, exhibiting "language confusion": mixing languages at the word, line, or full-response level. This significantly degrades the experience of non-English users.

-----

Solution in this Paper 🛠️:

• Created Language Confusion Benchmark (LCB) covering 15 diverse languages

• Evaluated models on monolingual generation (queries in language L, expecting response in L)

• Tested cross-lingual generation (English instructions to generate in target language)

• Developed metrics: Line-level Pass Rate (LPR), Word-level Pass Rate (WPR), and their harmonic mean, the Language Confusion Pass Rate (LCPR); see the sketch after this list

• Implemented mitigation strategies: few-shot prompting, multilingual instruction tuning, beam search decoding
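
To make the metrics concrete, here is a minimal sketch of how LPR, WPR, and LCPR could be computed. It is an assumption-laden illustration: the off-the-shelf langdetect package stands in for the paper's fastText-based language identifier, and the English word-list check for WPR simplifies the paper's actual procedure.

```python
# Sketch of the LCB pass-rate metrics. Assumptions: langdetect stands in
# for the paper's fastText language identifier, and WPR is simplified to
# an English word-list lookup.
from langdetect import DetectorFactory, detect

DetectorFactory.seed = 0  # make langdetect deterministic across runs


def line_in_language(line: str, target_lang: str) -> bool:
    """True if the detected language of a line matches the target code."""
    try:
        return detect(line) == target_lang
    except Exception:
        return True  # skip lines too short or ambiguous to detect


def lpr(responses: list[str], target_lang: str) -> float:
    """Line-level Pass Rate: share of responses whose every non-empty
    line is detected as the target language."""
    passed = sum(
        all(line_in_language(ln, target_lang)
            for ln in r.splitlines() if ln.strip())
        for r in responses
    )
    return passed / len(responses)


def wpr(responses: list[str], english_vocab: set[str]) -> float:
    """Word-level Pass Rate: share of responses containing no English
    words (simplified here to a supplied English word list)."""
    passed = sum(
        not any(w.lower().strip(".,!?") in english_vocab for w in r.split())
        for r in responses
    )
    return passed / len(responses)


def lcpr(lpr_value: float, wpr_value: float) -> float:
    """Language Confusion Pass Rate: harmonic mean of LPR and WPR."""
    if lpr_value + wpr_value == 0:
        return 0.0
    return 2 * lpr_value * wpr_value / (lpr_value + wpr_value)
```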

-----

Key Insights 💡:

• Base models exhibit less language confusion than their English-centric instruction-tuned variants

• Complex prompts increase language confusion

• High sampling temperatures aggravate the problem

• Multilingual instruction tuning helps reduce confusion

• The position of the language instruction within the prompt impacts model performance (see the prompt sketch below)
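
Two of these points, few-shot prompting as a mitigation and the sensitivity to where the language instruction sits, can be illustrated with a simple prompt builder. This is a hedged sketch: the template wording, the Spanish example pairs, and the build_prompt helper are illustrative assumptions, not the paper's actual prompts.

```python
# Illustrative prompt builder: the wording and the Spanish few-shot
# pairs below are hypothetical, not the paper's templates.

# A couple of in-language demonstrations for few-shot prompting.
FEW_SHOT_EXAMPLES = [
    ("Describe a sunrise.",
     "El sol asoma lentamente por el horizonte, tiñendo el cielo de rojo."),
    ("Name a famous painter.",
     "Un pintor famoso es Diego Velázquez."),
]


def build_prompt(query: str, target_lang: str,
                 instruction_position: str = "end",
                 few_shot: bool = True) -> str:
    """Assemble a prompt, optionally with few-shot examples, placing the
    language instruction at the start or end to probe position effects."""
    instruction = f"Respond only in {target_lang}."
    parts = []
    if instruction_position == "start":
        parts.append(instruction)
    if few_shot:
        parts += [f"Q: {q}\nA: {a}" for q, a in FEW_SHOT_EXAMPLES]
    parts.append(f"Q: {query}")
    if instruction_position == "end":
        parts.append(instruction)
    parts.append("A:")
    return "\n\n".join(parts)


print(build_prompt("What is the capital of Peru?", "Spanish"))
```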

-----

Results 📊:

• Command R and OpenAI models excel in monolingual generation (LPR 98.6-99.3%)

• Llama 2/3 and Mistral models struggle with consistency (LPR 48.3-73.0%)

• Cross-lingual performance: Command R+ Refresh leads with 95.4% LPR

• Few-shot prompting improves base model performance from 1.1% to 95.0% LPR

• Reducing the sampling temperature to 0.3 and using beam search significantly reduce confusion (see the decoding sketch below)
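
As a rough sketch of those decoding-side fixes, here are both settings expressed with the Hugging Face transformers generate API. The model name is a placeholder for any causal LM, not necessarily one of the models evaluated in the paper.

```python
# Decoding-side mitigations sketched with Hugging Face transformers.
# The model name is a placeholder; swap in any causal LM.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# A French query where language confusion would show up in the output.
inputs = tokenizer("Réponds en français : décris Paris.", return_tensors="pt")

# Option 1: keep sampling but lower the temperature to 0.3.
sampled = model.generate(
    **inputs, max_new_tokens=100, do_sample=True, temperature=0.3
)

# Option 2: beam search (deterministic; temperature is ignored when
# do_sample=False).
beamed = model.generate(
    **inputs, max_new_tokens=100, do_sample=False, num_beams=5
)

print(tokenizer.decode(beamed[0], skip_special_tokens=True))
```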
