FairCode introduces a novel benchmark and evaluation metric to assess and quantify social bias in code generation by LLMs across real-world scenarios.
https://arxiv.org/abs/2501.05396
Original Problem 🔍:
→ Current LLMs exhibit social biases in code generation, but existing evaluation methods rely on either malicious prompts or repurposed discriminative model datasets, which are ineffective for aligned models.
Solution in this Paper 🛠️:
→ FairCode evaluates bias through two key tasks: function implementation and test case generation.
→ Function implementation tests if models generate biased code for scenarios like job hiring, college admissions, and medical treatment.
→ Test case generation examines bias in creating test data for health conditions and social characteristics.
→ A new FairScore metric combines refusal rate (how often the model declines to use sensitive attributes) with preference entropy (how evenly its choices are spread across demographic groups); a rough sketch follows below.
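To make the metric concrete, here is a minimal Python sketch of how a FairScore-style number could be computed from labeled model responses. This is an illustration, not the paper's exact formula: the response labels ("refuse", group names), the entropy normalization, and the equal weighting of the two terms are all assumptions.

```python
# Hedged sketch of a FairScore-style metric (NOT the paper's exact definition).
# Assumptions: each response is labeled either "refuse" (model declines to use
# the sensitive attribute) or with the demographic group it preferred; entropy
# is normalized by log(#groups); the final score averages the two components.
import math
from collections import Counter

def fair_score(responses, groups):
    """responses: list of labels, each either 'refuse' or a group name in `groups`."""
    n = len(responses)
    refusal_rate = sum(1 for r in responses if r == "refuse") / n if n else 0.0

    # Preference entropy over non-refusal responses: 1.0 means choices are
    # spread evenly across groups, 0.0 means the model always picks one group.
    picks = Counter(r for r in responses if r != "refuse")
    total = sum(picks.values())
    if total == 0:
        norm_entropy = 1.0  # no explicit preferences expressed
    else:
        probs = [picks.get(g, 0) / total for g in groups]
        entropy = -sum(p * math.log(p) for p in probs if p > 0)
        norm_entropy = entropy / math.log(len(groups))

    # Simple unweighted average -- the paper may aggregate differently.
    return 0.5 * (refusal_rate + norm_entropy)

# Example: 6 of 10 prompts refused, remaining picks skewed toward one group.
print(fair_score(["refuse"] * 6 + ["group_a"] * 3 + ["group_b"],
                 ["group_a", "group_b"]))
```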
Key Insights 💡:
→ Models show little bias on gender and race but substantial bias on age and income attributes
→ Test case generation reveals stronger biases than function implementation
→ QwenCoder achieves best overall performance with high refusal rates
→ Models consistently associate certain traits with specific demographics
Results 📊:
→ QwenCoder achieves highest FairScore: 0.93 for function implementation, 0.90 for test generation
→ Most models show >80% refusal rate for gender/race attributes
→ Scores drop on less commonly studied attributes such as age (0.84) and income (0.68)