Domain-Specific Retrieval-Augmented Generation Using Vector Stores, Knowledge Graphs, and Tensor Factorization
Knowledge graphs meet vector stores.
The SMART-SLIC RAG framework reaches 97% accuracy on document-specific domain questions by combining structured (knowledge graph) and unstructured (vector store) knowledge.
Original Problem 🔍:
LLMs excel in general NLP tasks but struggle with domain-specific queries, facing issues like hallucinations, knowledge cut-offs, and lack of attribution.
Solution in this Paper 🛠️:
• The SMART-SLIC framework integrates Retrieval-Augmented Generation (RAG) with a domain-specific Knowledge Graph (KG) and Vector Store (VS)
• Uses nonnegative tensor factorization for dataset creation and dimension reduction
• Implements a ReAct agent for general inquiries and NER for document-specific questions
• Incorporates citation mechanisms for information attribution
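To make the KG-plus-VS idea concrete, here is a minimal sketch of how a retrieval step might merge both sources and tag every line with its DOI for attribution. Everything in it is an assumption for illustration: the toy documents, the placeholder DOIs, the relation names `HAS_TOPIC` and `CITES`, and the bag-of-words "embedding" that stands in for a learned encoder. It is not the SMART-SLIC implementation.

```python
import math
from collections import Counter

# Hypothetical toy corpus keyed by placeholder DOIs (not real papers).
DOCS = {
    "10.0000/example.1": "nonnegative matrix factorization finds latent topics in text",
    "10.0000/example.2": "malware families are classified with tensor decomposition",
}

# Hypothetical knowledge graph stored as (subject, relation, object) triples.
TRIPLES = [
    ("10.0000/example.1", "HAS_TOPIC", "topic modeling"),
    ("10.0000/example.1", "CITES", "10.0000/example.2"),
    ("10.0000/example.2", "HAS_TOPIC", "malware analysis"),
]


def embed(text):
    """Bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def vector_search(query, k=1):
    """Return the DOIs of the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(DOCS, key=lambda doi: cosine(q, embed(DOCS[doi])), reverse=True)
    return ranked[:k]


def kg_facts(doi):
    """Return all triples whose subject is the given document."""
    return [t for t in TRIPLES if t[0] == doi]


def build_context(query, k=1):
    """Merge unstructured text and structured facts, each tagged with its DOI."""
    lines = []
    for doi in vector_search(query, k):
        lines.append(f"[{doi}] {DOCS[doi]}")
        for _, relation, obj in kg_facts(doi):
            lines.append(f"[{doi}] {relation}: {obj}")
    return "\n".join(lines)


context = build_context("latent topics in text")
print(context)
```

The resulting context string would be placed in the LLM prompt, so that any claim in the answer can be traced back to the DOI in brackets.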
Key Insights from this Paper 💡:
• Domain-specific KG and VS improve LLM accuracy without extensive fine-tuning
• Tensor factorization with automatic model determination enhances topic classification
• Chain-of-thought prompting with LLM agents boosts reasoning capabilities
• Integration of structured (KG) and unstructured (VS) information enhances response quality
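One way to picture the split between general inquiries (ReAct agent) and document-specific questions (NER) is a small router that checks whether a question names a specific document. The sketch below is an assumption, not the paper's code: a DOI regular expression stands in for a real NER model, and the route names `document_lookup` and `react_agent` are hypothetical.

```python
import re

# Simplified DOI pattern; a stand-in for a trained NER model.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+", re.IGNORECASE)


def extract_dois(question):
    """Return DOI-like strings found in the question, minus trailing punctuation."""
    return [match.rstrip(".,;?)") for match in DOI_PATTERN.findall(question)]


def route(question):
    """Pick a handler: document lookup if a DOI is named, otherwise the agent."""
    dois = extract_dois(question)
    if dois:
        return ("document_lookup", dois)
    return ("react_agent", [])


print(route("Who wrote 10.1234/abcd.5678?"))
print(route("Which topics cover malware detection?"))
```

A production system would likely recognize more entity types (authors, titles, years) than a DOI pattern can capture.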
Results 📊:
• Document-specific questions: SMART-SLIC achieved 97% accuracy vs 20% for GPT-4 without RAG
• Topic-based questions: SMART-SLIC answered 92% correctly vs 27.77% for GPT-4 without RAG
• SMART-SLIC attempted 100% of questions, while GPT-4 without RAG abstained from 40-64% of questions
• SMART-SLIC provided accurate DOI citations for complex queries, which GPT-4 without RAG couldn't do
📊 The process of dataset creation and dimension reduction in SMART-SLIC:
• Starting with core documents selected by subject matter experts
• Expanding the dataset using citation and reference networks
• Pruning irrelevant documents through human-in-the-loop and automatic methods
• Preprocessing and cleaning the text data
• Creating a TF-IDF matrix of the cleaned corpus
• Using nonnegative tensor factorization with automatic model determination to classify document clusters
• Extracting latent topics within the corpus
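The TF-IDF and factorization steps above can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the paper's pipeline: it uses nonnegative matrix factorization with Lee-Seung multiplicative updates as a stand-in for nonnegative tensor factorization, it fixes the number of topics at `k=2` instead of determining it automatically, and the four-document corpus is made up.

```python
import math
import random
from collections import Counter


def tfidf_matrix(docs):
    """Return (vocab, X), where X[i][j] is the TF-IDF weight of term j in doc i."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    n_docs = len(docs)
    X = []
    for toks in tokenized:
        counts = Counter(toks)
        # Smoothed IDF, so terms present in every document keep a positive weight.
        X.append([
            (counts[t] / len(toks)) * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
            for t in vocab
        ])
    return vocab, X


def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def frobenius_error(X, W, H):
    """Frobenius norm of the reconstruction residual X - W @ H."""
    WH = matmul(W, H)
    return math.sqrt(sum((x - y) ** 2 for rx, ry in zip(X, WH) for x, y in zip(rx, ry)))


def nmf(X, k, iters=200, seed=0, eps=1e-9):
    """Factor X ~ W @ H with nonnegative W (docs x k) and H (k x terms)."""
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    W = [[rng.random() + eps for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + eps for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # H <- H * (W^T X) / (W^T W H)
        Wt = transpose(W)
        num, den = matmul(Wt, X), matmul(matmul(Wt, W), H)
        H = [[h * a / (b + eps) for h, a, b in zip(rh, ra, rb)]
             for rh, ra, rb in zip(H, num, den)]
        # W <- W * (X H^T) / (W H H^T)
        Ht = transpose(H)
        num, den = matmul(X, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (b + eps) for w, a, b in zip(rw, ra, rb)]
             for rw, ra, rb in zip(W, num, den)]
    return W, H


# Made-up corpus, already "cleaned" (lowercase, no punctuation).
docs = [
    "malware detection with neural networks",
    "neural networks for malware classification",
    "tensor factorization for topic modeling",
    "topic modeling with tensor decomposition",
]
vocab, X = tfidf_matrix(docs)
W, H = nmf(X, k=2)

# Cluster label per document: the topic with the largest weight in its row of W.
labels = [max(range(2), key=lambda c: row[c]) for row in W]
# Latent topics: the three highest-weighted terms in each row of H.
top_terms = [[vocab[j] for j in sorted(range(len(vocab)), key=lambda j: -row[j])[:3]]
             for row in H]
print(labels)
print(top_terms)
```

Rows of `W` give each document's cluster membership and rows of `H` give each topic's term weights. Automatic model determination would replace the fixed `k=2` with a search over candidate values of `k`.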