"Domain-specific Question Answering with Hybrid Search"

The podcast on this paper is generated with Google's Illuminate.

Hybrid search combines dense retrieval and keyword matching to supercharge domain-specific QA systems

This paper introduces a hybrid search framework combining dense retrieval with keyword-based methods for domain-specific question answering. The system integrates cosine similarity, BM25 scores, and URL host matching with tunable boost parameters, achieving improved accuracy while maintaining robust contextual grounding.
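The combined relevance score can be read as a simple linear blend of the three signals. A minimal sketch in Python, where the function and argument names are illustrative assumptions; only the boost values reflect the tuning reported in the paper:

```python
# Minimal sketch of the combined relevance score, assuming the three
# signals are already computed per chunk. Names are illustrative; the
# boost values (0.3 and 0.1) are the ones reported in the paper.
def hybrid_score(cosine_sim: float, bm25_score: float, host_match: float,
                 bm25_boost: float = 0.3, host_boost: float = 0.1) -> float:
    # Dense-retrieval similarity is the base signal; keyword relevance and
    # the host-level match are added with smaller, empirically tuned weights.
    return cosine_sim + bm25_boost * bm25_score + host_boost * host_match
```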

-----

https://arxiv.org/abs/2412.03736

🤔 Original Problem:

Domain-specific question answering systems often struggle with accuracy and reliability in enterprise settings, especially when dealing with specialized product documentation and terminology.

-----

🔧 Solution in this Paper:

→ The system uses a multi-phase scoring algorithm that combines three key components: dense retrieval cosine similarity, BM25 keyword matching, and host-based URL scoring

→ Documents are chunked into 1000-character segments with 100-character overlap for granular matching (see the sketch after this list)

→ A linear combination formula weights these components using empirically tuned boost parameters (BM25_boost: 0.3, host_boost: 0.1)

→ The system implements guardrails by comparing generated answers with system prompts to prevent jailbreak attempts
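As a complement to the scoring formula sketched earlier, here is a minimal Python sketch of the chunking and guardrail steps from the bullets above. The helper names, the urlparse-based host check, and the token-overlap jailbreak heuristic with its 0.5 threshold are assumptions for illustration; only the 1000-character chunk size, the 100-character overlap, and the idea of comparing answers against the system prompt come from the paper.

```python
from urllib.parse import urlparse

CHUNK_SIZE = 1000      # characters per chunk (from the paper)
CHUNK_OVERLAP = 100    # character overlap between consecutive chunks (from the paper)

def chunk_document(text: str, size: int = CHUNK_SIZE,
                   overlap: int = CHUNK_OVERLAP) -> list[str]:
    """Split a document into overlapping fixed-size character chunks."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def host_match(source_url: str, preferred_hosts: set[str]) -> float:
    """Return 1.0 if a chunk's source URL comes from an authoritative host."""
    return 1.0 if urlparse(source_url).netloc in preferred_hosts else 0.0

def passes_guardrail(answer: str, system_prompt: str,
                     echo_threshold: float = 0.5) -> bool:
    """Toy guardrail stand-in: reject answers that largely echo the system
    prompt, a common symptom of prompt-leak / jailbreak attempts."""
    prompt_tokens = set(system_prompt.lower().split())
    answer_tokens = set(answer.lower().split())
    if not answer_tokens:
        return False
    echoed = len(prompt_tokens & answer_tokens) / len(answer_tokens)
    return echoed < echo_threshold
```

In this sketch, chunks would be ranked with the hybrid_score combination shown earlier, the top-ranked chunks passed to the LLM as context, and passes_guardrail used to filter the generated answer before it is returned.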

-----

💡 Key Insights:

→ Hybrid retrieval significantly outperforms single-method approaches

→ Smaller chunk sizes (1000 chars) perform better than larger ones

→ Host-based boosting improves retrieval from authoritative sources

→ Guardrails are crucial for production deployment

-----

📊 Results:

→ Hybrid approach achieved an NDCG score of 0.847 vs 0.640 for BM25-only

→ Answer similarity improved from 0.717 (baseline) to 0.780 (hybrid)

→ System maintained a 0.983 groundedness score
