"Best Practices for Distilling Large Language Models into BERT for Web Search Ranking"

The podcast on this paper is generated with Google's Illuminate.

DisRanker distills an LLM's web-ranking power into an efficient BERT model, making web search smarter without the LLM's computational burden.

Basically, teaching BERT to rank like an LLM, but 10x faster.

https://arxiv.org/abs/2411.04539

Original Problem 🤔:

LLMs show great potential as zero-shot relevance rankers for web search, but their high computational costs make direct implementation impractical for real-world search engines.

-----

Solution in this Paper 🛠️:

→ DisRanker transfers the LLM's ranking expertise to a smaller BERT model through a three-stage process

→ Stage 1: Domain-specific Continued Pre-Training on clickstream data, where the LLM learns to generate the clicked title and summary from a query

→ Stage 2: Supervised Fine-Tuning of the LLM with a rank loss, using the end-of-sequence token to represent each query-document pair

→ Stage 3: Knowledge Distillation into BERT using a hybrid Point-MSE and Margin-MSE loss (sketched below)
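
Here is a minimal sketch of what the Stage 3 hybrid objective could look like, assuming both teacher (LLM) and student (BERT) emit one relevance score per query-document pair; the tensor names and the mixing weight `alpha` are illustrative, not taken from the paper:

```python
# Hybrid distillation loss: Point-MSE pins the student to the teacher's absolute
# scores, Margin-MSE matches the positive-negative score gap (ranking order).
import torch
import torch.nn.functional as F

def hybrid_distill_loss(student_pos, student_neg,
                        teacher_pos, teacher_neg,
                        alpha: float = 0.5) -> torch.Tensor:
    # Point-MSE: regress student scores onto teacher scores directly.
    point_mse = F.mse_loss(student_pos, teacher_pos) + \
                F.mse_loss(student_neg, teacher_neg)
    # Margin-MSE: match the score margin between the clicked (positive)
    # and non-clicked (negative) document for the same query.
    margin_mse = F.mse_loss(student_pos - student_neg,
                            teacher_pos - teacher_neg)
    return alpha * point_mse + (1.0 - alpha) * margin_mse
```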

-----

Key Insights 💡:

→ LLMs can effectively learn ranking through domain-specific pre-training

→ End-of-sequence token represents the query-document relationship better than the traditional [CLS] token (see the pooling sketch after this list)

→ Hybrid loss function prevents overfitting while maintaining ranking order

→ Student model achieves similar performance with 70x fewer parameters
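
To illustrate the pooling difference, here is a hedged sketch assuming `hidden` is the model's final hidden states and `attention_mask` marks real tokens; the names and the 768-dim scoring head are placeholders, not the paper's exact setup:

```python
import torch

def cls_pooling(hidden: torch.Tensor) -> torch.Tensor:
    # hidden: (batch, seq_len, dim); [CLS] sits at position 0 of a BERT input.
    return hidden[:, 0]

def eos_pooling(hidden: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # Take the hidden state of the last non-padding token (the EOS position);
    # in a causal LLM this is the only token that has attended to the full
    # query-document text.
    last_idx = attention_mask.sum(dim=1) - 1                # (batch,)
    return hidden[torch.arange(hidden.size(0)), last_idx]   # (batch, dim)

# A linear head on top of the pooled vector yields the scalar relevance score.
score_head = torch.nn.Linear(768, 1)  # 768 = placeholder hidden size
```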

-----

Results 📊:

→ Improved PNR (Positive-Negative Ratio, sketched after this list) from 3.514 to 3.643

→ Increased NDCG@5 from 0.8709 to 0.8793

→ Online A/B tests showed +0.47% PageCTR, +0.58% UserCTR, and +1.2% dwell time improvements

→ Latency reduced from ~100ms (LLM) to ~10ms (BERT-6)
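
For reference, PNR is typically computed per query as the ratio of document pairs the model orders in agreement with the human labels to the pairs it inverts; the sketch below follows that common definition (the paper's exact tie handling may differ):

```python
from itertools import combinations

def pnr(labels, scores):
    # labels: human relevance grades; scores: model relevance scores.
    concordant = discordant = 0
    for (y_i, s_i), (y_j, s_j) in combinations(zip(labels, scores), 2):
        if y_i == y_j:
            continue                      # skip pairs with equal labels
        agree = (y_i - y_j) * (s_i - s_j)
        if agree > 0:
            concordant += 1
        elif agree < 0:
            discordant += 1
    return concordant / discordant if discordant else float("inf")

# Example: labels 2 > 1 > 0, model swaps the bottom two documents.
print(pnr(labels=[2, 1, 0], scores=[0.9, 0.2, 0.4]))  # 2 concordant / 1 discordant = 2.0
```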
