
"1-800-SHARED-TASKS at RegNLP: Lexical Reranking of Semantic Retrieval (LeSeR) for Regulatory Question Answering"

The podcast on this paper is generated with Google's Illuminate.

LeSeR (Lexical reranking of Semantic Retrieval), proposed in this paper, combines high-recall semantic search with BM25 lexical reranking to improve retrieval and question-answering accuracy on regulatory documents.

-----

Paper - https://arxiv.org/abs/2412.06009

🤔 Original Problem:

Regulatory documents are complex and ever-changing, making it challenging for organizations to find relevant information and ensure compliance. Traditional search methods often miss important context or struggle with regulatory terminology.

-----

🔧 Solution in this Paper:

→ LeSeR (Lexical reranking of Semantic Retrieval) introduces a two-stage approach that first uses semantic embeddings for high-recall retrieval.

→ The system fine-tunes embedding models using Multiple Negatives Symmetric Ranking (MNSR) loss on query-passage pairs.

→ Retrieved passages are then reranked using BM25 lexical scoring to improve precision.

→ The final system integrates BGE_LeSeR with Qwen2.5 7B for answer generation.
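The second stage above can be sketched in a few lines: take the candidate passages returned by the semantic retriever and rerank them with BM25 lexical scoring. The query, candidate passages, and BM25 parameters (k1, b) below are illustrative, not taken from the paper.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """BM25 score of each candidate document against the query.

    docs_tokens: list of token lists (the semantically retrieved candidates).
    Document frequencies are computed over the candidate pool itself.
    """
    n = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / n
    df = Counter()
    for d in docs_tokens:
        df.update(set(d))  # each term counted once per document
    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        s = 0.0
        for t in query_tokens:
            if t not in tf:
                continue
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1.0)
            s += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            )
        scores.append(s)
    return scores

def lexical_rerank(query, candidates):
    """Rerank semantically retrieved passages by BM25 score, descending."""
    q = query.lower().split()
    docs = [c.lower().split() for c in candidates]
    scores = bm25_scores(q, docs)
    order = sorted(range(len(candidates)), key=lambda i: -scores[i])
    return [candidates[i] for i in order]

# Toy candidates standing in for the top-k passages from the semantic stage.
candidates = [
    "general guidance on corporate governance",
    "anti money laundering reporting obligations for licensed firms",
    "rules on client money segregation",
]
reranked = lexical_rerank("anti money laundering obligations", candidates)
print(reranked[0])
```

In a full pipeline the reranked top passages would then be passed to the generator model as context for answer generation.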

-----

💡 Key Insights:

→ Pure semantic models excel at recall but struggle with ranking precision

→ Lexical reranking significantly improves mean Average Precision

→ Fine-tuning with MNSR loss enhances retrieval performance

→ Hybrid approaches outperform both pure semantic and lexical methods
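The MNSR loss mentioned above can be sketched as an in-batch symmetric ranking objective: each query's paired passage is its positive, the other passages in the batch act as negatives, and the same cross-entropy is applied in the passage-to-query direction. This toy implementation is a sketch only (the embeddings and scale factor are illustrative assumptions); in practice one would use the loss as shipped in frameworks such as Sentence-Transformers.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def softmax_nll(sims, target):
    """Cross-entropy of one row of similarities against the target index."""
    m = max(sims)  # subtract max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in sims))
    return log_z - sims[target]

def mnsr_loss(query_embs, passage_embs, scale=20.0):
    """In-batch Multiple Negatives Symmetric Ranking loss (sketch).

    Pair (query_i, passage_i) is positive; all other in-batch passages
    are negatives for query_i, and symmetrically for passage_i. The
    final loss averages the query->passage and passage->query directions.
    """
    n = len(query_embs)
    sim = [[scale * cosine(q, p) for p in passage_embs] for q in query_embs]
    q2p = sum(softmax_nll(sim[i], i) for i in range(n)) / n
    p2q = sum(softmax_nll([sim[j][i] for j in range(n)], i) for i in range(n)) / n
    return 0.5 * (q2p + p2q)

# Perfectly matched toy pairs should give a near-zero loss.
print(mnsr_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]))
```

Fine-tuning an embedding model against this objective pulls each query toward its gold passage while pushing it away from the other passages in the batch, which is what improves the retrieval stage's ranking quality.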

-----

📊 Results:

→ BGE_LeSeR achieved 0.8201 Recall@10 and 0.6655 mAP@10

→ Qwen2.5 7B integration delivered highest RePASs score of 0.4340

→ System outperformed Mistral, Nemo, and Gemma models across metrics
