FairCode introduces a novel benchmark and evaluation metric to assess and quantify social bias in code generation by LLMs across real-world scenarios.
https://arxiv.org/abs/2501.05396
Original Problem 🔍:
→ Current LLMs exhibit social biases in code generation, but existing evaluation methods rely on either malicious prompts or repurposed discriminative model datasets, which are ineffective for aligned models.
Solution in this Paper 🛠️:
→ FairCode evaluates bias through two key tasks: function implementation and test case generation.
→ Function implementation tests if models generate biased code for scenarios like job hiring, college admissions, and medical treatment.
→ Test case generation examines bias in creating test data for health conditions and social characteristics.
→ A new FairScore metric combines refusal rate (how often the model declines to use sensitive attributes) with preference entropy (how evenly its choices are spread across demographic groups); a rough sketch follows below.
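To make the metric concrete, here is a minimal Python sketch of how a FairScore-style number could be computed from labeled model responses. This is an illustration, not the paper's exact formula: the response labels ("refuse", group names), the entropy normalization, and the equal weighting of the two terms are all assumptions.

```python
# Hedged sketch of a FairScore-style metric (NOT the paper's exact definition).
# Assumptions: each response is labeled either "refuse" (model declines to use
# the sensitive attribute) or with the demographic group it preferred; entropy
# is normalized by log(#groups); the final score averages the two components.
import math
from collections import Counter

def fair_score(responses, groups):
    """responses: list of labels, each either 'refuse' or a group name in `groups`."""
    n = len(responses)
    refusal_rate = sum(1 for r in responses if r == "refuse") / n if n else 0.0

    # Preference entropy over non-refusal responses: 1.0 means choices are
    # spread evenly across groups, 0.0 means the model always picks one group.
    picks = Counter(r for r in responses if r != "refuse")
    total = sum(picks.values())
    if total == 0:
        norm_entropy = 1.0  # no explicit preferences expressed
    else:
        probs = [picks.get(g, 0) / total for g in groups]
        entropy = -sum(p * math.log(p) for p in probs if p > 0)
        norm_entropy = entropy / math.log(len(groups))

    # Simple unweighted average -- the paper may aggregate differently.
    return 0.5 * (refusal_rate + norm_entropy)

# Example: 6 of 10 prompts refused, remaining picks skewed toward one group.
print(fair_score(["refuse"] * 6 + ["group_a"] * 3 + ["group_b"],
                 ["group_a", "group_b"]))
```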
Key Insights 💡:
→ Models show little bias on gender and race but substantial bias on age and income attributes
→ Test case generation reveals stronger biases than function implementation
→ QwenCoder achieves best overall performance with high refusal rates
→ Models consistently associate certain traits with specific demographics
Results 📊:
→ QwenCoder achieves highest FairScore: 0.93 for function implementation, 0.90 for test generation
→ Most models show >80% refusal rate for gender/race attributes
→ Scores drop on less commonly studied attributes such as age (0.84) and income (0.68)