Knowledge distillation unintentionally transfers test data contamination to student models
https://arxiv.org/abs/2411.02284
🤔 **Original Problem**
Neural ranking models are computationally expensive. Knowledge distillation helps create smaller, more efficient models from larger ones. But when closed-source commercial models serve as teachers, there is a risk of test data contamination if the test collections were part of the teacher's training data.
-----
🛠️ **Solution in this Paper**
→ Simulated worst-case contamination by directly injecting test data into the teacher model's training data
→ Used an ELECTRA-based cross-encoder as the teacher, and both ELECTRA cross-encoders and BERT bi-encoders as students
→ Evaluated three distillation techniques: marginMSE, KL divergence minimization, and RankNet loss (sketched below)
→ Tested on the MS MARCO passage collection with the TREC Deep Learning 2019/2020 and TREC COVID datasets
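To make the three distillation objectives concrete, here is a minimal PyTorch sketch of each loss. Function names, tensor shapes, and the pairwise masking in the RankNet variant are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def margin_mse_loss(student_pos, student_neg, teacher_pos, teacher_neg):
    """MarginMSE: match the student's positive-negative score margin
    to the teacher's margin for each (query, pos, neg) triple."""
    return F.mse_loss(student_pos - student_neg, teacher_pos - teacher_neg)

def kl_divergence_loss(student_scores, teacher_scores):
    """KL distillation: treat scores over a query's candidate list
    (shape [batch, num_candidates]) as distributions and minimize
    KL(teacher || student)."""
    return F.kl_div(
        F.log_softmax(student_scores, dim=-1),
        F.softmax(teacher_scores, dim=-1),
        reduction="batchmean",
    )

def ranknet_loss(student_scores, teacher_scores):
    """RankNet-style order distillation: for every candidate pair the
    teacher orders i above j, push the student's score s_i above s_j."""
    s_diff = student_scores.unsqueeze(-1) - student_scores.unsqueeze(-2)
    # 1 where the teacher ranks candidate i above candidate j
    teacher_order = (teacher_scores.unsqueeze(-1)
                     > teacher_scores.unsqueeze(-2)).float()
    pairwise = F.binary_cross_entropy_with_logits(
        s_diff, teacher_order, reduction="none"
    )
    # only count pairs where the teacher expresses a preference
    return (pairwise * teacher_order).sum() / teacher_order.sum().clamp(min=1)
```

In the paper's setup, the contaminated teacher is the one fine-tuned with test data mixed into its training set; the student only ever sees the teacher's scores through losses like these, which is how the contamination transfers.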
-----
💡 **Key Insights**
→ Even 0.1% test data contamination significantly impacts student model performance
→ RankNet-style order distillation showed the largest effectiveness disparities with contaminated teachers
→ Students distilled from a teacher contaminated with out-of-distribution test data can exceed the standard cross-encoder's performance by more than 10 nDCG@10 points
→ Cross-contamination between different datasets yielded unexpected effectiveness improvements
-----
📊 **Results**
→ Cross-architecture distillation with contaminated OOD data exceeded the standard cross-encoder by more than 10 nDCG@10 points
→ OOD contamination improved effectiveness across collections, even outperforming in-distribution contamination