Domain-Specific Retrieval-Augmented Generation Using Vector Stores, Knowledge Graphs, and Tensor Factorization

Knowledge graphs meet vector stores.

Nov 11, 2024

The SMART-SLIC RAG framework reaches 97% accuracy on document-specific domain questions by combining structured and unstructured knowledge.

Original Problem 🔍:

LLMs excel at general NLP tasks but struggle with domain-specific queries: they hallucinate, run into knowledge cut-offs, and cannot attribute their answers to sources.

Solution in this Paper 🛠️:

• The SMART-SLIC framework integrates Retrieval-Augmented Generation (RAG) with a domain-specific Knowledge Graph (KG) and Vector Store (VS)

• Uses nonnegative tensor factorization for dataset creation and dimension reduction

• Implements a ReAct agent for general inquiries and named entity recognition (NER) for document-specific questions

• Incorporates citation mechanisms for information attribution
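
To make the KG-plus-VS idea concrete, here is a minimal sketch, assuming a toy in-memory vector store (bag-of-words cosine similarity) and a knowledge graph held as plain triples. Everything in it is hypothetical and for illustration only: `DOCS`, `TRIPLES`, the placeholder `10.0000/example.*` identifiers, and the function names are not from the paper, and a real system would likely use learned embeddings and a graph database. It shows one plausible way to retrieve unstructured text first, attach structured facts second, and keep the document identifier on every line so the answer can cite it.

```python
import math
import re
from collections import Counter

# Hypothetical toy data; the DOI-like keys are placeholders, not real DOIs.
DOCS = {
    "10.0000/example.1": "nonnegative tensor factorization extracts latent topics from text",
    "10.0000/example.2": "malware detection with deep neural networks",
}
# Knowledge graph stored as (subject, relation, object) triples.
TRIPLES = [
    ("10.0000/example.1", "has_topic", "tensor factorization"),
    ("10.0000/example.1", "published_in", "2023"),
    ("10.0000/example.2", "has_topic", "malware detection"),
]

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def vector_search(query, k=1):
    """Return up to k document ids ranked by similarity, skipping zero scores."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(text))), doi) for doi, text in DOCS.items()]
    scored.sort(reverse=True)
    return [doi for score, doi in scored[:k] if score > 0]

def kg_facts(doi):
    """Return every triple whose subject is the given document id."""
    return [(s, r, o) for s, r, o in TRIPLES if s == doi]

def build_context(query):
    """Merge retrieved text and KG facts, prefixing each line with its source id."""
    lines = []
    for doi in vector_search(query):
        lines.append(f"[{doi}] {DOCS[doi]}")
        for _, rel, obj in kg_facts(doi):
            lines.append(f"[{doi}] {rel}: {obj}")
    return "\n".join(lines)

print(build_context("tensor factorization topics"))
```

The string returned by `build_context` would be placed in the LLM prompt; because each line carries its source identifier, the model can be instructed to cite it in the answer.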

Key Insights from this Paper 💡:

• Domain-specific KG and VS improve LLM accuracy without extensive fine-tuning

• Tensor factorization with automatic model determination enhances topic classification

• Chain-of-thought prompting with LLM agents boosts reasoning capabilities

• Integration of structured (KG) and unstructured (VS) information enhances response quality

Results 📊:

• Document-specific questions: SMART-SLIC achieved 97% accuracy vs 20% for GPT-4 without RAG

• Topic-based questions: SMART-SLIC answered 92% correctly vs 27.77% for GPT-4 without RAG

• SMART-SLIC attempted 100% of questions, while GPT-4 without RAG abstained from 40-64% of questions

• SMART-SLIC provided accurate DOI citations for complex queries, which GPT-4 without RAG couldn't do

📊 The process of dataset creation and dimension reduction in SMART-SLIC:

• Start with core documents selected by subject matter experts

• Expand the dataset using citation and reference networks

• Prune irrelevant documents through human-in-the-loop and automatic methods

• Preprocess and clean the text data

• Create a TF-IDF matrix of the cleaned corpus

• Use nonnegative tensor factorization with automatic model determination to classify document clusters

• Extract latent topics within the corpus
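
The last three steps can be sketched in a few lines of plain Python. Note the assumptions: the paper uses nonnegative tensor factorization with automatic model determination, whereas this sketch swaps in ordinary nonnegative matrix factorization (multiplicative updates) with a hand-picked number of topics `k`. The corpus, function names, and parameters are all hypothetical and only meant to show the shape of the pipeline, not the authors' implementation.

```python
import math
import random
import re
from collections import Counter

def tfidf_matrix(docs):
    """Return (matrix, vocab): one TF-IDF row per document, using smoothed IDF."""
    tokenized = [re.findall(r"[a-z]+", d.lower()) for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    n = len(docs)
    df = Counter(t for doc in tokenized for t in set(doc))
    matrix = []
    for doc in tokenized:
        tf = Counter(doc)
        matrix.append([
            tf[t] / len(doc) * (math.log((1 + n) / (1 + df[t])) + 1.0)
            for t in vocab
        ])
    return matrix, vocab

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    bt = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def nmf(x, k, iters=200, seed=0, eps=1e-9):
    """Factor x (docs x terms) into nonnegative w (docs x k) and h (k x terms)."""
    rng = random.Random(seed)
    n, m = len(x), len(x[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # Update h with w fixed: h *= (w^T x) / (w^T w h)
        wt = transpose(w)
        num, den = matmul(wt, x), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)] for i in range(k)]
        # Update w with h fixed: w *= (x h^T) / (w h h^T)
        ht = transpose(h)
        num, den = matmul(x, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)] for i in range(n)]
    return w, h

def squared_error(x, w, h):
    """Squared Frobenius norm of x - w h."""
    approx = matmul(w, h)
    return sum((a - b) ** 2 for ra, rb in zip(x, approx) for a, b in zip(ra, rb))

# Hypothetical four-document corpus, already "cleaned".
CORPUS = [
    "tensor factorization finds latent topics",
    "latent topics from tensor decomposition",
    "malware detection with neural networks",
    "neural networks classify malware samples",
]

x, vocab = tfidf_matrix(CORPUS)
w, h = nmf(x, k=2)

# Each document is assigned to the topic with the largest weight in its row of w.
clusters = [max(range(2), key=lambda j: row[j]) for row in w]
# The highest-weighted terms in each row of h describe that latent topic.
top_terms = [[vocab[j] for j in sorted(range(len(vocab)), key=lambda j: -row[j])[:3]] for row in h]
print(clusters, top_terms)
```

Here the rows of `w` play the role of document-to-cluster memberships and the rows of `h` the latent topics. Choosing `k` by hand is the part that automatic model determination is meant to replace.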

Rohan's Bytes
