"Preference Leakage: A Contamination Problem in LLM-as-a-judge"
The podcast below on this paper was generated with Google's Illuminate.
https://arxiv.org/abs/2502.01534
The widespread use of LLMs as judges and as synthetic-data generators in model development faces a contamination problem. This paper identifies "preference leakage," in which judge LLMs show bias toward models trained on data produced by related LLMs.
The paper defines and empirically demonstrates preference leakage, investigating how relatedness between the data-generator and judge LLMs leads to evaluation bias.
-----
📌 Preference Leakage Score offers a practical metric to quantify bias in LLM evaluations. Immediately apply PLS to existing benchmarks to audit and adjust results for fairer model comparisons.
📌 Supervised Fine-Tuning amplifies preference leakage. Favor Direct Preference Optimization or In-Context Learning to mitigate bias when training with synthetic data from related LLMs.
📌 Relatedness significantly biases LLM judges. Avoid evaluating models with judges from the same family or lineage as the data generators to ensure evaluation integrity.
----------
Methods Explored in this Paper 🔧:
→ The paper defines preference leakage as a contamination issue arising from the relatedness between LLMs used for synthetic data generation and evaluation.
→ Three types of relatedness are examined: being the same model, having an inheritance relationship, and belonging to the same model family.
→ Experiments were conducted using LLMs like GPT-4o, Gemini-1.5, and LLaMA-3.3 as judges and data generators, and Mistral-7B and Qwen-2.5-14B as student models.
→ Supervised Fine-Tuning (SFT) was used to train student models on synthetic datasets generated by different LLMs.
→ Preference Leakage Score (PLS) was introduced to quantify the bias of judge LLMs toward related student models. PLS is calculated from the win rates of student models as judged by different LLMs (a hedged sketch follows this list).
→ Manual annotation was performed to validate the automated LLM judgements and assess preference leakage.
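To make the metric concrete, here is a minimal sketch of a PLS-style calculation. It assumes (this is an illustrative simplification, not the paper's exact formula) that PLS averages, over each related judge/student pair, the relative increase in a student's win rate when scored by its related judge versus by the other judges; all model names and numbers below are hypothetical.

```python
# Hedged sketch of a Preference Leakage Score (PLS) style metric.
# Assumption (not necessarily the paper's exact formula): PLS averages, over
# each related (judge, student) pair, the relative gain in the student's win
# rate when scored by its related judge versus by the other judges.

from typing import Dict

def preference_leakage_score(win_rates: Dict[str, Dict[str, float]],
                             related: Dict[str, str]) -> float:
    """
    win_rates[judge][student] -> win rate (0..1) of `student` in head-to-head
                                 comparisons as scored by `judge`.
    related[judge] -> the student model trained on that judge's synthetic data.
    """
    gains = []
    for judge, student in related.items():
        own = win_rates[judge][student]                       # judge scoring its own student
        others = [win_rates[j][student] for j in win_rates if j != judge]
        baseline = sum(others) / len(others)                  # other judges' view of the same student
        gains.append((own - baseline) / baseline)             # relative inflation of the win rate
    return sum(gains) / len(gains)

# Toy usage with hypothetical numbers (not taken from the paper):
win_rates = {
    "gpt-4o":     {"student_gpt4o": 0.62, "student_gemini": 0.38},
    "gemini-1.5": {"student_gpt4o": 0.48, "student_gemini": 0.52},
}
related = {"gpt-4o": "student_gpt4o", "gemini-1.5": "student_gemini"}
print(f"PLS ≈ {preference_leakage_score(win_rates, related):.3f}")  # positive => leakage
```

A positive score means judges systematically inflate the win rates of students trained on their own synthetic data, which is the leakage signal the paper reports.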
-----
Key Insights 💡:
→ Preference leakage is a pervasive issue in LLM-as-a-judge scenarios, causing judge LLMs to exhibit bias towards related student models.
→ The severity of preference leakage is linked to the degree of relatedness between data generator and judge LLMs. Closer relatedness leads to higher leakage.
→ Larger student models tend to exhibit more pronounced preference leakage.
→ Preference leakage is more subtle and harder to detect than other known biases in LLM-as-a-judge.
→ Subjective evaluation questions and judgment dimensions are more susceptible to preference leakage.
-----
Results 📊:
→ The preference leakage score is positive for most model pairs, indicating bias. For example, the Mistral-7B students with the GPT-4o & Gemini-1.5 generator/judge pair show a PLS of 23.6%.
→ GPT-4o shows a preference for LLaMA series models, which is inherited by student models, impacting preference leakage scores.
→ Model pairs with similar performance exhibit higher preference leakage scores (e.g., Mistral-GPT-4o vs Mistral-Gemini-1.5 at 23.6%).
→ Supervised Fine-Tuning (SFT) shows a higher preference leakage score (23.6%) compared to Direct Preference Optimization (DPO) (5.2%) and In-Context Learning (ICL) (-2.7%).