"Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation"

The podcast on this paper is generated with Google's Illuminate.

Multilingual evaluation reimagined: Testing LLMs beyond Western cultural assumptions

Global-MMLU addresses cultural biases in multilingual LLM evaluation by introducing a comprehensive benchmark spanning 42 languages. The authors find that 28% of MMLU questions require Western-centric cultural knowledge, so evaluations built on translated MMLU are skewed. The paper provides separate culturally-sensitive and culturally-agnostic test sets for fairer model assessment.

-----

https://arxiv.org/abs/2412.03304

🌍 Original Problem:

Current multilingual LLM evaluations rely heavily on translated English datasets like MMLU, which contain inherent Western cultural biases. This makes them ineffective for truly global model assessment.

-----

🔧 Solution in this Paper:

→ Created Global-MMLU covering 42 languages with professional and community translations

→ Annotated questions to distinguish between culturally-sensitive and culturally-agnostic knowledge (see the loading sketch after this list)

→ Engaged 200 professional annotators to verify translation quality and cultural relevance

→ Combined human translations for 14 languages with improved machine translations for others
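
To make the split concrete, here is a minimal sketch of pulling apart the culturally-sensitive and culturally-agnostic subsets for one language. It assumes the released dataset lives on Hugging Face as CohereForAI/Global-MMLU with per-language configs and a cultural-sensitivity annotation column; the column name and label values below are assumptions, not necessarily the paper's exact schema.

```python
# Minimal sketch: splitting Global-MMLU into culturally-sensitive (CS) and
# culturally-agnostic (CA) subsets. Column name and label values are assumptions.
from datasets import load_dataset

# Each language is assumed to be a separate config identified by its ISO code.
ds = load_dataset("CohereForAI/Global-MMLU", "sw", split="test")  # e.g. Swahili

# Assumed annotation field marking whether annotators flagged the question as
# requiring cultural, geographic, or dialect-specific knowledge.
culturally_sensitive = ds.filter(lambda row: row["cultural_sensitivity_label"] == "CS")
culturally_agnostic = ds.filter(lambda row: row["cultural_sensitivity_label"] == "CA")

print(len(culturally_sensitive), "CS questions,", len(culturally_agnostic), "CA questions")
```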

-----

💡 Key Insights:

→ 28% of MMLU questions require Western cultural knowledge

→ 84.9% of geographic questions focus on North America/Europe

→ Model rankings change significantly between culturally-sensitive and culturally-agnostic evaluations (see the sketch after this list)

→ Translation quality impacts performance more in low-resource languages
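
One simple way to quantify "rankings change" is to count how many positions each model moves between two leaderboards and average the shift. The sketch below does that; the model names and scores are made up for illustration and are not results from the paper.

```python
# Sketch: measuring how much a model leaderboard shifts between two test sets.
# Model names and accuracy values are illustrative placeholders.

def rank_shift(scores_a: dict, scores_b: dict) -> float:
    """Average absolute change in rank position between two leaderboards."""
    rank_a = {m: i for i, m in enumerate(sorted(scores_a, key=scores_a.get, reverse=True))}
    rank_b = {m: i for i, m in enumerate(sorted(scores_b, key=scores_b.get, reverse=True))}
    return sum(abs(rank_a[m] - rank_b[m]) for m in scores_a) / len(scores_a)

agnostic = {"model_x": 0.71, "model_y": 0.69, "model_z": 0.62}
sensitive = {"model_x": 0.64, "model_y": 0.66, "model_z": 0.61}

print(rank_shift(agnostic, sensitive))  # larger values = less stable rankings
```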

-----

📊 Results:

→ Models show an average of 3.4 ranking changes on culturally-agnostic tests vs 5.7 on culturally-sensitive tests

→ Human-translated data improves performance by 7-12% for high-resource languages

→ Low-resource languages show higher variability, with a standard deviation of 6.78 vs 3.86 for high-resource languages
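
The variability comparison boils down to a standard deviation over per-language accuracies within each resource tier. A minimal sketch, with placeholder accuracies and an assumed resource-tier grouping:

```python
# Sketch: comparing score variability across language resource tiers.
# Accuracy values and tier assignments below are placeholders, not paper data;
# the paper reports std dev 6.78 (low-resource) vs 3.86 (high-resource).
import statistics

accuracy_by_language = {
    "en": 78.2, "fr": 76.9, "de": 75.4,   # assumed high-resource examples
    "am": 51.3, "yo": 48.7, "si": 55.0,   # assumed low-resource examples
}
high_resource = {"en", "fr", "de"}

high = [v for k, v in accuracy_by_language.items() if k in high_resource]
low = [v for k, v in accuracy_by_language.items() if k not in high_resource]

print("high-resource std:", statistics.pstdev(high))
print("low-resource std:", statistics.pstdev(low))
```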
