DyVo: Dynamic Vocabularies for Learned Sparse Retrieval with Entities
Dynamic Vocabulary (DyVo), introduced in this paper, injects Wikipedia entities into search models, making them smarter at finding what you seek.
Basically, adding Wikipedia entities to the vocabulary teaches search models to stop butchering names into pieces.
Original Problem 🔍:
Learned Sparse Retrieval (LSR) models use vocabularies from pre-trained transformers, often splitting entities into nonsensical fragments. This reduces retrieval accuracy and limits incorporation of up-to-date world knowledge.
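To make the fragmentation problem concrete, here is a minimal sketch of greedy longest-match word-piece splitting. The vocabulary below is a hypothetical toy one, not taken from any real pre-trained model, and the function is a simplified stand-in for a real tokenizer.

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first split of one lowercase word into word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            # Non-initial pieces carry the conventional "##" continuation prefix.
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        else:
            # No vocabulary entry matched at this position.
            return ["[UNK]"]
        start = end
    return pieces

# Hypothetical toy vocabulary (assumption, for illustration only).
TOY_VOCAB = {"bio", "##ntech", "##nt", "##ech", "vaccine", "the"}

fragments = wordpiece("biontech", TOY_VOCAB)
print(fragments)  # ['bio', '##ntech']
```

Neither fragment means "the company BioNTech" on its own, which is the gap an entity-level vocabulary entry is meant to close.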
Solution in this Paper 🛠️:
• Introduces a Dynamic Vocabulary (DyVo) head that expands the LSR vocabulary with Wikipedia entities
• Leverages entity retrieval techniques and LLMs to generate relevant entity candidates
• Employs entity embeddings and candidate retrieval to identify a small set of entities to score
• Merges the entity representation with the word-piece representation for indexing and retrieval
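The candidate retrieval step above can be sketched as a nearest-neighbour lookup over entity embeddings. This is only one possible candidate generator and an assumption on my part: the entity table, the vectors, and the `top_k_entities` helper are all hypothetical, and the paper also considers entity linkers and LLM-generated candidates.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k_entities(text_vec, entity_table, k=2):
    """Return the k entity names whose embeddings are closest to text_vec."""
    scored = [(cosine(text_vec, vec), name) for name, vec in entity_table.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]

# Hypothetical 3-dimensional entity embeddings (assumption, illustration only).
ENTITY_TABLE = {
    "BioNTech": [0.9, 0.1, 0.0],
    "Pfizer": [0.8, 0.3, 0.1],
    "Eiffel Tower": [0.0, 0.2, 0.9],
}

query_vec = [1.0, 0.2, 0.0]  # hypothetical embedding of the query text
candidates = top_k_entities(query_vec, ENTITY_TABLE, k=2)
print(candidates)  # ['BioNTech', 'Pfizer']
```

The point of this step is that only a handful of candidates get scored, so the model never has to produce weights for millions of entities at once.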
Key Insights from this Paper 💡:
• Incorporating linked entities consistently improves LSR effectiveness across entity-rich datasets
• Retrieval-oriented entity candidates further enhance performance compared to linked entities
• The model performs well with a variety of entity embedding techniques, with Wikipedia2Vec being surprisingly effective
Results 📊:
• DyVo consistently outperforms LSR-w (the word-piece-only LSR baseline) across all metrics and datasets
• nDCG@10 gains range from 1.15 to 3.57 points at the highest regularization setting
• GPT-4-generated entity candidates achieve the highest performance and are competitive with human-annotated entities
• BLINK entity embeddings yield the best overall performance, improving nDCG@10 by 1 point over LaQue
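Since the results are reported in nDCG@10, a small sketch of how that metric is computed may help when reading the numbers. It assumes linear gain and computes the ideal ranking from the judged list itself; evaluation toolkits may differ on both choices. The relevance labels are made up for illustration.

```python
import math

def dcg(relevances, k):
    """Discounted cumulative gain over the top-k positions (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg(ranked_relevances, k=10):
    """DCG of the ranking divided by the DCG of the ideal reordering."""
    ideal = dcg(sorted(ranked_relevances, reverse=True), k)
    return dcg(ranked_relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical relevance labels in ranked order (1 = relevant, 0 = not relevant).
baseline_run = [0, 1, 0, 1]
improved_run = [1, 1, 0, 0]

print(round(ndcg(baseline_run), 4))  # 0.6509
print(ndcg(improved_run))            # 1.0
```

A "point" of nDCG@10 usually means 0.01 on this 0-to-1 scale, so the gains listed above are differences between averaged per-query scores like these.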
👨‍🔧 The 3 key components of the proposed DyVo model:
• A Dynamic Vocabulary head that expands the vocabulary to include millions of Wikipedia entities
• An entity candidate retrieval component that identifies a small set of relevant entities for each query/document
• A method to generate entity weights and merge them with word-piece weights to create joint representations
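The third component can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's exact formulation: it assumes entity weights come from a SPLADE-style max-pooled log(1 + ReLU(·)) over dot products between token hidden states and entity embeddings, and that entities live in the same sparse vector as word pieces under an `ENT:` prefix. All vectors, weights, and function names are hypothetical.

```python
import math

def entity_weight(hidden_states, entity_vec):
    """Max over token positions of the dot product, then log(1 + ReLU(x))."""
    best = max(sum(h * e for h, e in zip(state, entity_vec)) for state in hidden_states)
    return math.log1p(max(best, 0.0))

def dyvo_representation(wordpiece_weights, hidden_states, candidates):
    """Merge word-piece weights with weights for the candidate entities."""
    rep = dict(wordpiece_weights)
    for name, vec in candidates.items():
        w = entity_weight(hidden_states, vec)
        if w > 0:  # keep the representation sparse
            rep["ENT:" + name] = w
    return rep

def score(query_rep, doc_rep):
    """Sparse dot product over the joint word-piece and entity vocabulary."""
    return sum(w * doc_rep.get(term, 0.0) for term, w in query_rep.items())

# Hypothetical token hidden states and entity embeddings (2 dimensions).
hidden_q = [[1.0, 0.0], [0.5, 0.5]]
cands = {"BioNTech": [2.0, 0.0], "Eiffel Tower": [-1.0, -1.0]}

query_rep = dyvo_representation({"vaccine": 1.2}, hidden_q, cands)
doc_rep = {"vaccine": 0.5, "ENT:BioNTech": 2.0, "trial": 0.7}
print(sorted(query_rep))  # ['ENT:BioNTech', 'vaccine']
print(round(score(query_rep, doc_rep), 4))  # 2.7972
```

Because entities are just extra dimensions in the sparse vector, the joint representation can be stored in and searched with an ordinary inverted index.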