0:00
/
0:00
Transcript

"Explicit vs. Implicit: Investigating Social Bias in Large Language Models through Self-Reflection"

Generated below podcast on this paper with Google's Illuminate.

LLMs can hide their biases well, but psychology-inspired tests reveal their true colors

This paper introduces a novel way to measure both explicit and implicit biases in LLMs through a self-reflection framework, revealing critical gaps in bias mitigation.

https://arxiv.org/abs/2501.02295

🤔 Original Problem:

→ Current bias detection methods focus mainly on explicit bias, missing the deeper implicit biases in LLMs

→ No systematic framework exists to compare explicit vs implicit biases in LLMs

-----

🔍 Solution in this Paper:

→ Introduces a two-phase self-reflection framework for bias evaluation

→ Phase 1 measures implicit bias through simulated psychological tests adapting Implicit Association Test (IAT)

→ Phase 2 evaluates explicit bias by having LLMs analyze their own generated content using Self-Report Assessment (SRA)

→ Tests conducted across gender, race, age, disability, and occupation dimensions

-----

💡 Key Insights:

→ LLMs show significant inconsistency between explicit and implicit biases

→ Explicit bias decreases with larger training data and model parameters

→ Implicit bias increases with model scaling, reaching 0.72 at 405B parameters

→ Alignment training (RLHF/DPO) reduces explicit bias but barely affects implicit bias

-----

📊 Results:

→ At 405B parameters, explicit bias nearly vanishes (0.03 score) while implicit bias peaks (0.72)

→ Alignment training after 300 steps reduces explicit bias to near-zero

→ Implicit bias remains stable around 0.2 throughout alignment training

------

Are you into AI and LLMs❓ Join my daily AI newsletter. I will send you 7 emails a week analyzing the highest signal AI developments. ↓↓

🎉 https://rohanpaul.substack.com/

Discussion about this video