
"Finding Needles in Emb(a)dding Haystacks: Legal Document Retrieval via Bagging and SVR Ensembles"

The podcast below on this paper was generated with Google's Illuminate.

SVR ensembles beat traditional legal document search without needing complex deep learning.

This paper introduces a novel legal document retrieval system using Support Vector Regression ensembles and embedding spaces, achieving higher recall without deep learning model training.

-----

https://arxiv.org/abs/2501.05018

🔍 Original Problem:

Legal document retrieval systems often lack transparency and interpretability while requiring extensive deep learning model training.

-----

🛠️ Solution in this Paper:

→ The system treats legal document retrieval as multiple binary needle-in-haystack subtasks.

→ It generates embeddings for document passages and queries using a Longformer model.

→ The approach partitions embedding space into 35 overlapping subsets.

→ Each subset trains a dedicated SVR model with RBF kernel.

→ For each query, it collects the 50 nearest neighbors in the embedding space as candidate matches.

→ The system concatenates query embedding with passage embedding for feature creation.
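
The steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the paper's implementation: the toy random vectors stand in for Longformer embeddings, cosine similarity is an assumed distance measure, and the helper names (`nearest_neighbors`, `pair_features`, `overlapping_subsets`) and the `fraction` parameter are hypothetical.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbors(query, passages, k=50):
    """Return the indices of the k passages most similar to the query."""
    ranked = sorted(
        range(len(passages)),
        key=lambda i: cosine(query, passages[i]),
        reverse=True,
    )
    return ranked[:k]


def pair_features(query, passage):
    """Concatenate the query and passage embeddings into one feature vector."""
    return list(query) + list(passage)


def overlapping_subsets(n_items, n_subsets=35, fraction=0.5, seed=0):
    """Draw n_subsets index subsets independently, so they overlap.

    The sampling scheme and the fraction are assumptions for illustration.
    """
    rng = random.Random(seed)
    size = max(1, int(n_items * fraction))
    return [sorted(rng.sample(range(n_items), size)) for _ in range(n_subsets)]


# Toy data: 200 random 8-dimensional "passage embeddings" and one "query".
rng = random.Random(1)
dim = 8
passages = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(200)]
query = [rng.gauss(0, 1) for _ in range(dim)]

candidates = nearest_neighbors(query, passages, k=50)          # 50 candidate matches
features = [pair_features(query, passages[i]) for i in candidates]
subsets = overlapping_subsets(len(passages))                   # 35 overlapping subsets
```

Each subset would then be used to fit one regressor on such concatenated features, for example with scikit-learn's `SVR(kernel="rbf")`, and the per-model scores combined into a final ranking.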

-----

💡 Key Insights:

→ Text length significantly influences embedding space positioning

→ Chunking passages into equal lengths improves relevant passage placement

→ SVR with RBF kernel effectively captures nonlinear relationships

→ GPU-accelerated training enables efficient processing of large feature spaces
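
The chunking insight can be illustrated with a minimal sketch. It assumes whitespace-separated words as the unit of length and a hypothetical `chunk_size` of 128; the paper's own tokenization and chunk length may differ.

```python
def chunk_passage(text, chunk_size=128):
    """Split a passage into consecutive chunks of chunk_size words.

    Only the final chunk may be shorter than chunk_size.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


# A toy 300-word passage yields chunks of 128, 128 and 44 words.
text = " ".join(f"w{i}" for i in range(300))
chunks = chunk_passage(text, chunk_size=128)
chunk_lengths = [len(c.split()) for c in chunks]
```

Embedding chunks of similar length, rather than whole passages of very different lengths, is one way to limit the effect of text length on where a passage lands in the embedding space.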

-----

📊 Results:

→ Achieved 0.849 recall, surpassing BM25 (0.829) and TF-IDF (0.809) baselines

→ Maintained 0.9987 precision and 0.9199 F1-score

→ Processed a 3,095,383 × 768 embedding matrix efficiently
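
To give a sense of scale for that matrix, the quick calculation below estimates its raw size in memory. The 4 bytes per value (float32) is an assumption; the summary does not state the storage precision.

```python
rows, dims = 3_095_383, 768
bytes_per_value = 4  # assumption: float32 storage

total_values = rows * dims
total_bytes = total_values * bytes_per_value
gib = total_bytes / 2**30

print(f"{total_values:,} values, {total_bytes:,} bytes, about {gib:.2f} GiB")
```

Under that assumption the embeddings alone occupy roughly 9.5 GB before any pairwise features are built.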
