SVR ensembles outperform traditional legal document search baselines without training a deep learning model.
This paper introduces a legal document retrieval system built on Support Vector Regression ensembles and embedding spaces, achieving higher recall than BM25 and TF-IDF baselines without any deep learning model training.
-----
https://arxiv.org/abs/2501.05018
🔍 Original Problem:
Legal document retrieval systems often lack transparency and interpretability while requiring extensive deep learning model training.
-----
🛠️ Solution in this Paper:
→ The system treats legal document retrieval as multiple binary needle-in-haystack subtasks.
→ It generates embeddings for document passages and queries using a Longformer model.
→ The approach partitions embedding space into 35 overlapping subsets.
→ A dedicated SVR model with an RBF kernel is trained on each subset.
→ For queries, it collects 50 nearest neighbors as potential matches.
→ The system concatenates the query embedding with each candidate passage embedding to create the feature vector.
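
The candidate collection and feature construction steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's code: the helper names (`nearest_passages`, `build_features`), the cosine distance metric, and the toy 8-dimensional random vectors are all assumptions made for the example.

```python
import math
import random


def cosine_distance(a, b):
    """Cosine distance between two vectors (the metric is an assumption)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest_passages(query_vec, passage_vecs, k=50):
    """Return the indices of the k passages closest to the query."""
    ranked = sorted(
        range(len(passage_vecs)),
        key=lambda i: cosine_distance(query_vec, passage_vecs[i]),
    )
    return ranked[:k]


def build_features(query_vec, passage_vecs, candidate_ids):
    """Concatenate the query embedding with each candidate passage embedding."""
    return [list(query_vec) + list(passage_vecs[i]) for i in candidate_ids]


# Toy data standing in for real embeddings (hypothetical sizes).
random.seed(0)
DIM = 8
passages = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(200)]
query = [random.gauss(0, 1) for _ in range(DIM)]

candidates = nearest_passages(query, passages, k=50)
features = build_features(query, passages, candidates)
print(len(features), len(features[0]))  # 50 rows, each 2 * DIM wide
```

Each feature row, paired with a 0/1 relevance label, could then be fed to an RBF-kernel regressor such as scikit-learn's `SVR(kernel="rbf")`; which SVR implementation the paper actually uses is not stated in this summary.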
-----
💡 Key Insights:
→ Text length significantly influences where a text lands in the embedding space
→ Chunking passages into equal lengths improves relevant passage placement
→ SVR with RBF kernel effectively captures nonlinear relationships
→ GPU-accelerated training enables efficient processing of large feature spaces
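
As a rough sketch of the chunking idea, the snippet below splits a passage into fixed-size pieces before embedding. The whitespace tokenization, the `chunk_passage` name, and the chunk size of 100 are assumptions for illustration; the summary does not say which unit or length the paper uses.

```python
def chunk_passage(text, chunk_size=100):
    """Split text into chunks of chunk_size whitespace-separated words.

    The final chunk may be shorter when the word count is not a
    multiple of chunk_size.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


# A synthetic 250-word passage yields chunks of 100, 100, and 50 words.
passage = " ".join(f"w{i}" for i in range(250))
chunks = chunk_passage(passage, chunk_size=100)
print([len(c.split()) for c in chunks])
```

Keeping chunks the same length means every embedded unit is comparable in size, which is the property the insight about text length points to.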
-----
📊 Results:
→ Achieved 0.849 recall, surpassing BM25 (0.829) and TF-IDF (0.809) baselines
→ Maintained 0.9987 precision and 0.9199 F1-score
→ Processed a 3,095,383 × 768 embedding matrix efficiently
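
For readers less familiar with these metrics, here is a small sketch of how precision, recall, and F1 are computed for binary relevant/not-relevant predictions. The labels are made up for the example, and the paper may aggregate its scores differently (for instance, by averaging across subtasks).

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for binary labels (1 = relevant)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy labels: 3 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```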