DyVo: Dynamic Vocabularies for Learned Sparse Retrieval with Entities
Dynamic Vocabulary (DyVo), introduced in this paper, injects Wikipedia entities into search models, making them smarter at finding what you seek.
Basically, Wikipedia entities teach search models to stop butchering names into pieces.
Original Problem:
Learned Sparse Retrieval (LSR) models use vocabularies from pre-trained transformers, often splitting entities into nonsensical fragments. This reduces retrieval accuracy and limits incorporation of up-to-date world knowledge.
Solution in this Paper:
• Introduces a Dynamic Vocabulary (DyVo) head to expand the LSR vocabulary with Wikipedia entities
• Leverages entity retrieval techniques and LLMs to generate relevant entities
• Employs entity embeddings and candidate retrieval to identify a small set of entities to score
• Merges the entity representation with the word-piece representation for indexing and retrieval
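As a rough illustration of the last two bullets, the sketch below scores a few candidate entities against token hidden states, then merges the resulting entity weights into the word-piece representation. It is a minimal sketch under stated assumptions, not the paper's implementation: the log-saturated ReLU with max pooling is borrowed from SPLADE-style LSR heads, and the function names, the `ENT:` key prefix, the `scale` parameter, and all toy numbers are hypothetical.

```python
import math


def entity_weights(token_states, candidate_embeddings, scale=1.0):
    """Weight each candidate entity by its best match over all token states.

    Assumption: weight = scale * max_i log(1 + ReLU(h_i . e)).
    Entities whose weight is zero are dropped to keep the output sparse.
    """
    weights = {}
    for entity, emb in candidate_embeddings.items():
        best = 0.0
        for h in token_states:
            dot = sum(a * b for a, b in zip(h, emb))
            best = max(best, math.log1p(max(dot, 0.0)))
        if best > 0.0:
            weights[entity] = scale * best
    return weights


def merge(word_piece_weights, ent_weights):
    """Join word-piece and entity weights into one sparse bag of terms."""
    joint = dict(word_piece_weights)
    for entity, w in ent_weights.items():
        joint["ENT:" + entity] = w  # hypothetical prefix to avoid key clashes
    return joint


def score(query_rep, doc_rep):
    """Sparse dot product over the terms shared by query and document."""
    return sum(w * doc_rep[t] for t, w in query_rep.items() if t in doc_rep)


# Toy data: two token hidden states and two candidate entities.
token_states = [[1.0, 0.0], [0.0, 2.0]]
candidates = {"Barack_Obama": [0.0, 1.0], "Obama,_Fukui": [-1.0, 0.0]}

doc_entities = entity_weights(token_states, candidates)
doc_rep = merge({"obama": 1.2, "president": 0.8}, doc_entities)
query_rep = merge({"obama": 1.0}, {"Barack_Obama": 0.5})
print(doc_rep)
print(score(query_rep, doc_rep))
```

Because entities become ordinary terms in the joint representation, the same inverted index can serve both word pieces and entities, which fits the "indexing and retrieval" point above.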
Key Insights from this Paper:
• Incorporating linked entities consistently improves LSR effectiveness across entity-rich datasets
• Retrieval-oriented entity candidates further enhance performance compared to linked entities
• The model performs well with various entity embedding techniques, with Wikipedia2Vec being surprisingly effective
Results:
• DyVo consistently outperforms LSR-w across all metrics and datasets
• Significant gains in nDCG@10: increases of 1.15 to 3.57 points at the highest regularization setting
• GPT-4-generated entities achieve the highest performance, competitive with human-annotated entities
• BLINK entity embeddings yield the best overall performance, improving nDCG@10 by +1 point over LaQue
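For readers less familiar with the metric, here is a small sketch of how nDCG@10 can be computed from graded relevance labels. It assumes the linear-gain variant of DCG (some toolkits use an exponential gain instead), assumes the input list holds every judged document for the query, and assumes that "points" above means nDCG multiplied by 100. The function names and toy labels are illustrative only.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=10):
    """DCG of the given ranking divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance labels in ranked order for two hypothetical systems.
baseline = [0, 1, 0, 2]
improved = [2, 1, 0, 0]

gain_in_points = 100 * (ndcg_at_k(improved) - ndcg_at_k(baseline))
print(ndcg_at_k(baseline), ndcg_at_k(improved), gain_in_points)
```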
The 3 key components of the proposed DyVo model:
• A Dynamic Vocabulary head that expands the vocabulary to include millions of Wikipedia entities
• An entity candidate retrieval component that identifies a small set of relevant entities for each query/document
• A method to generate entity weights and merge them with word piece weights to create joint representations
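The candidate retrieval component can be pictured as a nearest-neighbor lookup over entity embeddings, so that only a few of the millions of entities are ever scored. The sketch below is one possible reading under stated assumptions: it ranks entities by cosine similarity to a single text embedding, while the summary also mentions entity linking and LLM-generated candidates, which are not shown. The function names, the tiny `entity_table`, and its vectors are hypothetical.

```python
import heapq
import math


def cosine(u, v):
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_k_entities(text_embedding, entity_table, k=2):
    """Return the k entity names most similar to the text embedding."""
    scored = ((cosine(text_embedding, emb), name) for name, emb in entity_table.items())
    return [name for _, name in heapq.nlargest(k, scored)]


# Toy entity table; a real one would hold millions of Wikipedia entities.
entity_table = {
    "Barack_Obama": [0.9, 0.1],
    "Michelle_Obama": [0.7, 0.3],
    "Tokyo": [0.0, 1.0],
}

print(top_k_entities([1.0, 0.0], entity_table, k=2))
```

A linear scan like this only works for toy tables; at Wikipedia scale an approximate nearest-neighbor index would be the more realistic choice.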



