This paper introduces LCB, a new benchmark that reveals why LLMs struggle with language consistency and how to fix their linguistic mix-ups.
📚 https://arxiv.org/abs/2406.20052
Original Problem 🎯:
LLMs often fail to consistently generate text in the user's desired language, exhibiting "language confusion": mixing languages at the word, line, or full-response level. This significantly degrades the experience of non-English users.
-----
Solution in this Paper 🛠️:
• Created Language Confusion Benchmark (LCB) covering 15 diverse languages
• Evaluated models on monolingual generation (queries in language L, expecting response in L)
• Tested cross-lingual generation (English instructions to generate in target language)
• Developed metrics: Line-level Pass Rate (LPR), Word-level Pass Rate (WPR), and Language Confusion Pass Rate (LCPR), the harmonic mean of the two; a scoring sketch follows this list
• Implemented mitigation strategies: few-shot prompting, multilingual instruction tuning, beam search decoding
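To make the metrics concrete, here is a minimal scoring sketch in Python. It is not the paper's implementation: it assumes the `langdetect` package as a stand-in line-level language identifier and a simple Latin-script regex as the word-level mixing check, both of which may differ from the paper's exact rules.

```python
# Minimal LCB-style scoring sketch (stand-in detector, not the paper's exact rules).
import re
from langdetect import detect  # pip install langdetect

LATIN_WORD = re.compile(r"[A-Za-z]{2,}")

def line_pass(response: str, target: str) -> bool:
    """Line-level pass: every non-empty line is detected as the target language."""
    lines = [ln for ln in response.splitlines() if ln.strip()]
    return bool(lines) and all(detect(ln) == target for ln in lines)

def word_pass(response: str) -> bool:
    """Word-level pass (non-Latin-script targets): no embedded Latin-script words."""
    return LATIN_WORD.search(response) is None

def lcb_scores(responses: list[str], target: str) -> dict:
    n = len(responses)
    lpr = sum(line_pass(r, target) for r in responses) / n
    wpr = sum(word_pass(r) for r in responses) / n
    lcpr = 2 * lpr * wpr / (lpr + wpr) if lpr + wpr else 0.0  # harmonic mean
    return {"LPR": lpr, "WPR": wpr, "LCPR": lcpr}

print(lcb_scores(["こんにちは。\n元気です。", "こんにちは world"], target="ja"))
```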
-----
Key Insights 💡:
• Base models show less language confusion than their English-centric instruction-tuned variants
• Complex prompts increase language confusion
• High sampling temperatures aggravate the problem (decoding sketch after this list)
• Multilingual instruction tuning helps reduce confusion
• Where the language instruction sits within the prompt impacts model performance
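The decoding-side findings map directly onto standard generation settings. A hedged sketch with Hugging Face `transformers`; the checkpoint and parameter values here are illustrative, not the paper's exact setup:

```python
# Decoding-side mitigations: lower temperature and beam search (illustrative values).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # any causal LM; illustrative choice
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Réponds en français : quelle est la capitale du Japon ?"
inputs = tok(prompt, return_tensors="pt")

# Lower sampling temperature: less randomness, less drift into other languages.
out_low_temp = model.generate(**inputs, do_sample=True, temperature=0.3, max_new_tokens=128)

# Beam search: deterministic decoding, reported to reduce confusion further.
out_beams = model.generate(**inputs, do_sample=False, num_beams=5, max_new_tokens=128)

print(tok.decode(out_beams[0], skip_special_tokens=True))
```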
-----
Results 📊:
• Command R and OpenAI models excel in monolingual generation (LPR 98.6-99.3%)
• Llama 2/3 and Mistral models struggle with consistency (LPR 48.3-73.0%)
• Cross-lingual performance: Command R+ Refresh leads with 95.4% LPR
• Few-shot prompting lifts base-model LPR from 1.1% to 95.0% (prompt sketch below)
• Lowering the temperature to 0.3 and using beam search both significantly reduce confusion
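For the few-shot mitigation, the idea is to prepend in-language demonstrations so a base model locks onto the target language before answering. A sketch with hypothetical French demonstrations; the paper's exact few-shot format may differ:

```python
# Few-shot prompt sketch: in-language demonstrations steer a base model's output
# language. The demonstrations below are illustrative, not taken from the paper.
FEW_SHOT_TEMPLATE = """Instruction : Décris le cycle de l'eau en une phrase.
Réponse : L'eau s'évapore, se condense en nuages, puis retombe en pluie.

Instruction : Explique la photosynthèse en une phrase.
Réponse : Les plantes transforment la lumière, l'eau et le CO2 en sucres et en oxygène.

Instruction : {query}
Réponse :"""

def build_prompt(query: str) -> str:
    return FEW_SHOT_TEMPLATE.format(query=query)

print(build_prompt("Quels sont les bienfaits du sommeil ?"))
```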