Multilingual evaluation reimagined: Testing LLMs beyond Western cultural assumptions
Global-MMLU addresses cultural bias in multilingual LLM evaluation by introducing a benchmark spanning 42 languages. The authors find that 28% of MMLU questions require Western-centric cultural knowledge, which skews translated evaluations, and they release separate culturally-sensitive and culturally-agnostic test sets for fairer model assessment.
-----
https://arxiv.org/abs/2412.03304
🌍 Original Problem:
Current multilingual LLM evaluations rely heavily on translated English datasets like MMLU, which contain inherent Western cultural biases. This makes them ineffective for truly global model assessment.
-----
🔧 Solution in this Paper:
→ Created Global-MMLU covering 42 languages with professional and community translations
→ Annotated questions to distinguish between culturally-sensitive and culturally-agnostic knowledge (see the loading/splitting sketch after this list)
→ Engaged 200 professional annotators to verify translation quality and cultural relevance
→ Combined human translations for 14 languages with improved machine translations for others
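A minimal sketch of pulling the two test sets apart, assuming the dataset is published on Hugging Face under an ID like CohereForAI/Global-MMLU with a cultural_sensitivity_label column (the hub ID, language config, and column name are assumptions, not confirmed by the post — check the official release for the exact schema):

```python
# Sketch: split Global-MMLU into culturally-sensitive (CS) and
# culturally-agnostic (CA) subsets.
# ASSUMPTIONS: the hub ID "CohereForAI/Global-MMLU", the language config "hi",
# and the "cultural_sensitivity_label" column are illustrative placeholders.
from datasets import load_dataset

LANG = "hi"  # hypothetical language config (Hindi)
dataset = load_dataset("CohereForAI/Global-MMLU", LANG, split="test")

# Keep questions tagged as needing regional/cultural knowledge vs. those that do not.
culturally_sensitive = dataset.filter(lambda row: row["cultural_sensitivity_label"] == "CS")
culturally_agnostic = dataset.filter(lambda row: row["cultural_sensitivity_label"] == "CA")

print(len(culturally_sensitive), "CS questions |", len(culturally_agnostic), "CA questions")
```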
-----
💡 Key Insights:
→ 28% of MMLU questions require Western cultural knowledge
→ 84.9% of geographic questions focus on North America/Europe
→ Model rankings change significantly between culturally-sensitive vs culturally-agnostic evaluations (see the rank-change sketch after this list)
→ Translation quality impacts performance more in low-resource languages
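To make the rank-change insight concrete, here is a small sketch (not from the paper) that ranks models by accuracy on each subset and averages the per-model rank shift; the model names and scores are made-up placeholders:

```python
# Sketch: quantify how much model rankings move between the culturally-sensitive
# and culturally-agnostic subsets. All accuracies below are hypothetical.
def ranks(scores):
    """Rank models by descending accuracy (1 = best)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {model: i + 1 for i, model in enumerate(ordered)}

acc_sensitive = {"model_a": 0.61, "model_b": 0.58, "model_c": 0.66}  # placeholder scores
acc_agnostic = {"model_a": 0.72, "model_b": 0.74, "model_c": 0.70}   # placeholder scores

r_cs, r_ca = ranks(acc_sensitive), ranks(acc_agnostic)
mean_rank_change = sum(abs(r_cs[m] - r_ca[m]) for m in r_cs) / len(r_cs)
print(f"average rank change across subsets: {mean_rank_change:.2f}")
```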
-----
📊 Results:
→ Model rankings show an average of 3.4 rank changes on culturally-agnostic tests vs 5.7 on culturally-sensitive tests
→ Human-translated data improves performance by 7-12% for high-resource languages
→ Low-resource languages show higher variability, with a standard deviation of 6.78 vs 3.86 for high-resource languages
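For the variability comparison, a minimal sketch of the underlying arithmetic: the standard deviation of one model's accuracy across languages, computed separately for low- and high-resource groups (the language lists and numbers here are placeholders, not the paper's data):

```python
# Sketch: per-group spread of accuracy across languages.
# The groupings and accuracy values are illustrative only.
import statistics

acc_by_language = {                        # hypothetical accuracies (%)
    "en": 71.2, "de": 69.8, "fr": 70.1,    # high-resource
    "am": 52.4, "yo": 48.9, "si": 55.3,    # low-resource
}
high_resource = {"en", "de", "fr"}

high = [a for lang, a in acc_by_language.items() if lang in high_resource]
low = [a for lang, a in acc_by_language.items() if lang not in high_resource]

print("high-resource std dev:", round(statistics.stdev(high), 2))
print("low-resource std dev:", round(statistics.stdev(low), 2))
```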