ScalingNote introduces a two-stage method to scale up dense retrieval using LLMs while maintaining fast query response times for real-world applications.
-----
https://arxiv.org/abs/2411.15766
🔍 Original Problem:
→ Real-world dense retrieval systems face a critical trade-off between model performance and query latency. While using LLMs can improve retrieval quality, they significantly slow down online query processing.
→ Current systems focus on negative sampling strategies rather than scaling up model size due to deployment constraints.
-----
🛠️ Solution in this Paper:
→ ScalingNote employs a two-stage training approach to leverage LLMs effectively.
→ Stage 1 trains dual towers initialized from the same LLM, using contrastive learning with mined hard negatives (a sketch follows this list).
→ Stage 2 performs Query-based Knowledge Distillation (QKD), transferring knowledge from the LLM query tower to a smaller, faster BERT-based query tower; the LLM document tower is kept for offline document encoding, where latency is not a constraint (see the distillation sketch below).
→ The system uses residual K-means clustering for document indexing and IVFPQ for efficient approximate retrieval (see the Faiss sketch below).
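A minimal sketch of a Stage 1-style contrastive objective with in-batch and mined hard negatives. The InfoNCE-style loss shape, the `temperature` value, and the tensor names are illustrative assumptions, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(q_emb, pos_emb, hard_neg_emb, temperature=0.05):
    """InfoNCE-style loss: each query is pulled toward its positive
    document and pushed away from in-batch and mined hard negatives.
    Both towers producing these embeddings would be initialized from
    the same LLM, per the paper's Stage 1 setup."""
    q_emb = F.normalize(q_emb, dim=-1)                 # (B, D)
    pos_emb = F.normalize(pos_emb, dim=-1)             # (B, D)
    hard_neg_emb = F.normalize(hard_neg_emb, dim=-1)   # (B*K, D)

    # Candidate pool: in-batch positives (other rows act as negatives
    # for each query) plus the explicitly mined hard negatives.
    candidates = torch.cat([pos_emb, hard_neg_emb], dim=0)  # (B + B*K, D)
    logits = q_emb @ candidates.T / temperature              # (B, B + B*K)
    labels = torch.arange(q_emb.size(0), device=q_emb.device)
    return F.cross_entropy(logits, labels)
```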
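Stage 2's Query-based Knowledge Distillation can be pictured as the small student query tower matching the LLM teacher both directly in embedding space and through the relevance scores it induces over documents, so it stays compatible with the frozen LLM document tower. A hedged sketch; the loss mix and the `tau` and `alpha` hyperparameters are assumptions:

```python
import torch
import torch.nn.functional as F

def qkd_loss(student_q, teacher_q, doc_embs, tau=1.0, alpha=0.5):
    """Query-based knowledge distillation sketch: the BERT student query
    tower learns to reproduce the LLM teacher's query representation.
    `alpha` balances embedding alignment against score-distribution
    matching; both terms here are illustrative choices."""
    # Direct alignment between student and teacher query vectors.
    embed_loss = F.mse_loss(student_q, teacher_q)

    # Match the relevance distribution each query induces over a doc set.
    s_scores = student_q @ doc_embs.T / tau   # (B, M)
    t_scores = teacher_q @ doc_embs.T / tau   # (B, M)
    kd_loss = F.kl_div(F.log_softmax(s_scores, dim=-1),
                       F.softmax(t_scores, dim=-1),
                       reduction="batchmean")
    return alpha * embed_loss + (1 - alpha) * kd_loss
```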
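For the serving side, a minimal ANN sketch using Faiss's stock `IndexIVFPQ`, which combines coarse k-means cells with product quantization over residuals, in the spirit of the indexing pipeline described above (dimensions and index sizes here are illustrative):

```python
import numpy as np
import faiss

d, nlist, m, nbits = 768, 1024, 64, 8   # embedding dim, coarse cells, PQ subvectors, bits each
doc_embs = np.random.rand(100_000, d).astype("float32")  # stand-in for offline LLM doc embeddings

# L2-normalize so L2 distance ranks identically to inner-product similarity.
faiss.normalize_L2(doc_embs)

quantizer = faiss.IndexFlatL2(d)                         # coarse quantizer
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)
index.train(doc_embs)                                    # k-means clustering over the corpus
index.add(doc_embs)

index.nprobe = 16                                        # cells visited per query: recall/latency knob
query = np.random.rand(1, d).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 20)                    # top-20 approximate neighbors
```

At query time, `nprobe` is the main dial for trading recall against latency, which is how such systems sustain high QPS with an LLM-sized document index.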
-----
💡 Key Insights:
→ Scaling laws in dense retrieval show that larger models and more data improve performance predictably, with diminishing returns (a generic power-law form is sketched after this list)
→ Query towers can be effectively distilled while maintaining document tower complexity
→ Title-specific query prediction improves document relevance
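For reference, dense-retrieval scaling laws are typically fit with a Chinchilla-style power law; a generic parameterization, with the caveat that the paper's exact variables and fitted constants may differ. Here L is the contrastive loss, N the model size, D the training data size, and A, B, α, β, E are fitted constants:

```latex
L(N, D) = \left(\frac{A}{N}\right)^{\alpha} + \left(\frac{B}{D}\right)^{\beta} + E
```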
-----
📊 Results:
→ Achieved 83.01% AUC-SAT and 82.14% AUC-REL on a manually annotated test set
→ Reduced the proportion of irrelevant documents in top-20 retrieval results by 1.546%
→ Maintained high throughput (33,810 QPS) with a 4-layer BERT student query tower