LLMs can hide their biases well, but psychology-inspired tests reveal their true colors
This paper introduces a novel way to measure both explicit and implicit biases in LLMs through a self-reflection framework, revealing critical gaps in bias mitigation.
https://arxiv.org/abs/2501.02295
🤔 Original Problem:
→ Current bias detection methods focus mainly on explicit bias, missing the deeper implicit biases in LLMs
→ No systematic framework exists to compare explicit vs implicit biases in LLMs
-----
🔍 Solution in this Paper:
→ Introduces a two-phase self-reflection framework for bias evaluation
→ Phase 1 measures implicit bias through simulated psychological tests adapting Implicit Association Test (IAT)
→ Phase 2 evaluates explicit bias by having LLMs analyze their own generated content using Self-Report Assessment (SRA)
→ Tests conducted across gender, race, age, disability, and occupation dimensions
-----
💡 Key Insights:
→ LLMs show significant inconsistency between explicit and implicit biases
→ Explicit bias decreases with larger training data and model parameters
→ Implicit bias increases with model scaling, reaching 0.72 at 405B parameters
→ Alignment training (RLHF/DPO) reduces explicit bias but barely affects implicit bias
-----
📊 Results:
→ At 405B parameters, explicit bias nearly vanishes (0.03 score) while implicit bias peaks (0.72)
→ Alignment training after 300 steps reduces explicit bias to near-zero
→ Implicit bias remains stable around 0.2 throughout alignment training
------
Are you into AI and LLMs❓ Join my daily AI newsletter. I will send you 7 emails a week analyzing the highest signal AI developments. ↓↓
🎉 https://rohanpaul.substack.com/
Share this post