Generate questions from contexts first, then match them against user queries for better retrieval
HyQE, proposed in this paper, flips the script: instead of scoring query-context similarity directly, it generates hypothetical queries from contexts and scores query-query similarity
Let contexts tell you what questions they can answer before trying to match them
📚 https://arxiv.org/abs/2410.15262
🎯 Original Problem:
Context ranking in retrieval systems often fails when it relies on simple embedding similarity between queries and contexts. Existing LLM-based reranking solutions face scalability issues and typically require fine-tuning.
-----
🛠️ Solution in this Paper:
→ Introduces HyQE - a framework that uses LLMs to generate hypothetical queries from contexts
→ Ranks contexts by the similarity between the user query and these hypothetical queries (a minimal sketch follows this list)
→ Works offline - generates and stores hypothetical queries beforehand for reuse
→ Requires no LLM fine-tuning and works with both open-source and proprietary LLMs
→ Uses variational inference to preserve causal relationships between queries and contexts
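To make the pipeline concrete, here's a minimal sketch of the idea. The prompt wording, the model choices (gpt-3.5-turbo, all-MiniLM-L6-v2), the lam weight, and the mean pooling over hypothetical queries are illustrative assumptions, not the paper's exact formulation; any LLM and embedding model can be swapped in.

```python
import numpy as np
from openai import OpenAI
from sentence_transformers import SentenceTransformer

client = OpenAI()  # works with GPT-4, GPT-3.5, or an open-source LLM; no fine-tuning
embedder = SentenceTransformer("all-MiniLM-L6-v2")  # any embedding model works

def hypothetical_queries(context: str, n: int = 3) -> list[str]:
    """Ask the LLM which questions this context can answer, one per line."""
    prompt = (f"Write {n} short questions that the following passage answers, "
              f"one per line:\n\n{context}")
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return [q.strip() for q in resp.choices[0].message.content.splitlines() if q.strip()]

def rank_contexts(user_query: str, contexts: list[str], lam: float = 0.5) -> list[str]:
    """Score = base query-context similarity + lam * mean similarity between
    the user query and that context's hypothetical queries (one possible
    combination rule, assumed here for illustration)."""
    q = embedder.encode(user_query, normalize_embeddings=True)
    scores = []
    for ctx in contexts:
        c = embedder.encode(ctx, normalize_embeddings=True)
        h = embedder.encode(hypothetical_queries(ctx), normalize_embeddings=True)
        scores.append(float(q @ c) + lam * float((h @ q).mean()))
    return [contexts[i] for i in np.argsort(scores)[::-1]]
```

Because the hypothetical queries depend only on the contexts, the LLM call can be moved entirely offline, which is where the scalability gain comes from.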
-----
💡 Key Insights:
→ Similarity between queries is more reliable than similarity between queries and contexts
→ Offline query generation makes it more scalable than existing LLM-based methods (see the offline/online split sketched after this list)
→ Hypothetical queries are constrained by context information, reducing hallucination risk
→ Compatible with existing retrieval methods like HyDE for additive improvements
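Reusing `embedder` and `hypothetical_queries` from the sketch above, this hypothetical two-phase split shows why the offline design scales: the LLM runs once per context at index time, and serving only embeds the user query. For brevity the online scorer uses the query-query term alone; the lam-weighted combination from the first sketch can be cached the same way.

```python
import numpy as np

# Offline phase: run the LLM once per context and persist the embeddings.
def build_hyqe_index(contexts: list[str]) -> dict[int, np.ndarray]:
    return {
        i: embedder.encode(hypothetical_queries(ctx), normalize_embeddings=True)
        for i, ctx in enumerate(contexts)
    }  # in practice, store these in a vector DB and reuse across all queries

# Online phase: no LLM call on the hot path, just one query embedding.
def rank_with_index(user_query: str, index: dict[int, np.ndarray]) -> list[int]:
    q = embedder.encode(user_query, normalize_embeddings=True)
    scores = {i: float((h @ q).mean()) for i, h in index.items()}
    return sorted(scores, key=scores.get, reverse=True)  # best context ids first
```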
-----
📊 Results:
→ Improved NDCG@10 scores across multiple benchmarks (DL19, DL20, COVID, NEWS, Touche)
→ Works effectively with different LLMs (GPT-4, GPT-3.5, Mistral) and embedding models
→ Shows consistent performance gains when combined with other retrieval methods