
"Training on the Test Model: Contamination in Ranking Distillation"

The podcast on this paper is generated with Google's Illuminate.

Knowledge distillation unintentionally transfers test data contamination to student models

https://arxiv.org/abs/2411.02284

🤔 **Original Problem**

Neural ranking models are computationally expensive. Knowledge distillation helps create smaller, more efficient models from larger ones. But when closed-source commercial models are used as teachers, there is a risk of data contamination if the test collections were part of their training data.
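
For context, a minimal sketch of the two ranking architectures involved: a cross-encoder scores each (query, passage) pair jointly, while a bi-encoder embeds query and passage separately and compares them with a dot product. This uses the sentence-transformers library with placeholder checkpoints, not the paper's ELECTRA/BERT models.

```python
# Illustrative sketch only; checkpoints below are common public models,
# not the ones used in the paper.
from sentence_transformers import CrossEncoder, SentenceTransformer, util

query = "effects of covid-19 on the respiratory system"
passages = [
    "COVID-19 primarily affects the lungs and airways.",
    "The TREC Deep Learning track uses MS MARCO passages.",
]

# Cross-encoder (teacher-style): expensive, scores each pair jointly.
cross = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
ce_scores = cross.predict([(query, p) for p in passages])

# Bi-encoder (student-style): cheap, dot product of independent embeddings.
bi = SentenceTransformer("msmarco-distilbert-base-v4")
q_emb = bi.encode(query, convert_to_tensor=True)
p_emb = bi.encode(passages, convert_to_tensor=True)
be_scores = util.dot_score(q_emb, p_emb)

print(ce_scores, be_scores)
```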

-----

🛠️ **Solution in this Paper**

→ Simulated worst-case contamination by directly adding test data to the teacher model's training data

→ Used an ELECTRA-based cross-encoder as the teacher, with both ELECTRA cross-encoders and BERT bi-encoders as students

→ Evaluated three distillation techniques: marginMSE, KL divergence minimization, and RankNet loss (sketched after this list)

→ Tested on the MS MARCO passage collection with the TREC Deep Learning 2019/2020 and TREC COVID test collections
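
A minimal PyTorch sketch (my reading of these objectives, not the paper's code) of the three distillation losses listed above, operating on teacher and student relevance scores for a query's candidate passages:

```python
import torch
import torch.nn.functional as F

def margin_mse(s_pos, s_neg, t_pos, t_neg):
    # marginMSE: match the student's positive-negative score margin
    # to the teacher's margin.
    return F.mse_loss(s_pos - s_neg, t_pos - t_neg)

def kl_distill(s_scores, t_scores):
    # KL divergence: match the student's score distribution over the
    # candidate passages to the teacher's distribution.
    return F.kl_div(F.log_softmax(s_scores, dim=-1),
                    F.softmax(t_scores, dim=-1),
                    reduction="batchmean")

def ranknet(s_scores, t_scores):
    # RankNet-style order distillation: penalize passage pairs that the
    # student orders differently from the teacher. Simplified: no masking
    # of i == j pairs.
    s_diff = s_scores.unsqueeze(-1) - s_scores.unsqueeze(-2)          # pairwise score gaps
    t_order = (t_scores.unsqueeze(-1) > t_scores.unsqueeze(-2)).float()  # teacher's pairwise order
    return F.binary_cross_entropy_with_logits(s_diff, t_order)
```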

-----

💡 **Key Insights**

→ Even 0.1% test data contamination significantly impacts student model performance

→ RankNet-style order distillation showed the largest effectiveness disparities with contaminated teachers

→ Students distilled from teachers with out-of-distribution contamination can exceed a standard cross-encoder by more than 10 nDCG@10 points

→ Cross-contamination between different datasets yielded unexpected effectiveness improvements

-----

📊 **Results**

→ Cross-architecture distillation with contaminated OOD data exceeded the standard cross-encoder by more than 10 nDCG@10 points

→ OOD contamination improved effectiveness across collections, even outperforming in-distribution contamination
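
For reference, nDCG@10 is the metric behind the ">10 points" figures above. Below is a short sketch of one common (exponential-gain) formulation, which may differ in detail from the evaluation tooling used in the paper:

```python
import math

def dcg_at_k(relevances, k=10):
    # Discounted cumulative gain over the top-k ranked items.
    return sum((2**rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=10):
    # Normalize by the DCG of the ideal (relevance-sorted) ranking.
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Example: graded relevance labels of the top-ranked passages for one query.
print(ndcg_at_k([3, 2, 3, 0, 1, 2], k=10))
```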
