Domain-Specific Retrieval-Augmented Generation Using Vector Stores, Knowledge Graphs, and Tensor Factorization
Knowledge graphs meet vector stores.
The SMART-SLIC RAG framework reaches 97% accuracy on document-specific domain questions by combining structured and unstructured knowledge.
Original Problem:
LLMs excel in general NLP tasks but struggle with domain-specific queries, facing issues like hallucinations, knowledge cut-offs, and lack of attribution.
Solution in this Paper:
• The SMART-SLIC framework integrates Retrieval-Augmented Generation (RAG) with a domain-specific Knowledge Graph (KG) and Vector Store (VS)
• Uses nonnegative tensor factorization for dataset creation and dimension reduction
• Implements a ReAct agent for general inquiries and named-entity recognition (NER) for document-specific questions
• Incorporates citation mechanisms for information attribution
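To make the KG-plus-VS idea concrete, here is a minimal sketch of how unstructured passages and structured facts might be merged into one cited context for the LLM. It is not the paper's implementation: the DOIs, texts, triples, and function names (`embed`, `vector_search`, `kg_facts`, `build_context`) are all hypothetical, and a bag-of-words count vector stands in for a real embedding model and vector database.

```python
import math
from collections import Counter

# Hypothetical toy corpus keyed by DOI; the DOIs and texts are made up.
DOCS = {
    "10.0000/example.1": "nonnegative tensor factorization for malware classification",
    "10.0000/example.2": "anomaly detection in network traffic with tensor decomposition",
    "10.0000/example.3": "retrieval augmented generation with knowledge graphs",
}

# Hypothetical knowledge-graph triples: (subject, relation, object).
TRIPLES = [
    ("10.0000/example.1", "HAS_TOPIC", "malware"),
    ("10.0000/example.1", "WRITTEN_BY", "A. Author"),
    ("10.0000/example.2", "HAS_TOPIC", "anomaly detection"),
    ("10.0000/example.3", "HAS_TOPIC", "rag"),
]

def embed(text):
    """Stand-in embedding (assumption): a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def vector_search(query, k=2):
    """Return the DOIs of up to k documents most similar to the query."""
    q = embed(query)
    scored = sorted(((cosine(q, embed(text)), doi) for doi, text in DOCS.items()), reverse=True)
    return [doi for score, doi in scored[:k] if score > 0]

def kg_facts(doi):
    """Return all triples whose subject is the given document."""
    return [(s, r, o) for s, r, o in TRIPLES if s == doi]

def build_context(query, k=2):
    """Merge retrieved passages and KG facts, tagging every line with its source DOI."""
    lines = []
    for doi in vector_search(query, k):
        lines.append(f"[{doi}] {DOCS[doi]}")
        for _, rel, obj in kg_facts(doi):
            lines.append(f"[{doi}] {rel}: {obj}")
    return "\n".join(lines)

if __name__ == "__main__":
    print(build_context("tensor factorization malware"))
```

Because every context line carries its DOI, a prompt can instruct the model to cite the bracketed identifier next to each claim, which is one simple way to get the attribution behaviour described above.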
Key Insights from this Paper:
• Domain-specific KG and VS improve LLM accuracy without extensive fine-tuning
• Tensor factorization with automatic model determination enhances topic classification
• Chain-of-thought prompting with LLM agents boosts reasoning capabilities
• Integration of structured (KG) and unstructured (VS) information enhances response quality
Results:
• Document-specific questions: SMART-SLIC achieved 97% accuracy vs 20% for GPT-4 without RAG
• Topic-based questions: SMART-SLIC answered 92% correctly vs 27.77% for GPT-4 without RAG
• SMART-SLIC attempted 100% of questions, while GPT-4 without RAG abstained from 40-64% of questions
• SMART-SLIC provided accurate DOI citations for complex queries, which GPT-4 without RAG couldn't do
The process of dataset creation and dimension reduction in SMART-SLIC:
• Starting with core documents selected by subject matter experts
• Expanding the dataset using citation and reference networks
• Pruning irrelevant documents through human-in-the-loop and automatic methods
• Preprocessing and cleaning the text data
• Creating a TF-IDF matrix of the cleaned corpus
• Using nonnegative tensor factorization with automatic model determination to classify document clusters
• Extracting latent topics within the corpus
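The last three steps can be sketched in a few lines of plain Python. This is a simplified illustration, not the paper's pipeline: it swaps in nonnegative matrix factorization (Lee-Seung multiplicative updates) for the tensor factorization, and it uses a fixed, hand-picked rank `k` instead of automatic model determination. The toy documents and all function names are hypothetical.

```python
import math
import random
from collections import Counter

def tfidf(docs):
    """Return (vocab, matrix); matrix[i][j] is the TF-IDF weight of term j in doc i."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    df = Counter(t for doc in tokenized for t in set(doc))
    n = len(docs)
    matrix = []
    for doc in tokenized:
        counts = Counter(doc)
        # Term frequency times a smoothed inverse document frequency.
        matrix.append([(counts[t] / len(doc)) * (math.log((1 + n) / (1 + df[t])) + 1)
                       for t in vocab])
    return vocab, matrix

def transpose(m):
    return [list(r) for r in zip(*m)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def nmf(x, k, iters=200, seed=0, eps=1e-9):
    """Factor x (docs x terms) into nonnegative w (docs x k) and h (k x terms)."""
    rng = random.Random(seed)
    n, m = len(x), len(x[0])
    w = [[rng.random() for _ in range(k)] for _ in range(n)]
    h = [[rng.random() for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # Update h: h <- h * (w^T x) / (w^T w h)
        wt = transpose(w)
        num, den = matmul(wt, x), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)] for i in range(k)]
        # Update w: w <- w * (x h^T) / (w h h^T)
        ht = transpose(h)
        num, den = matmul(x, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)] for i in range(n)]
    return w, h

def frob_error(x, w, h):
    """Squared Frobenius norm of the reconstruction residual x - w h."""
    approx = matmul(w, h)
    return sum((a - b) ** 2 for ra, rb in zip(x, approx) for a, b in zip(ra, rb))

def top_terms(h, vocab, n=3):
    """For each latent topic (row of h), list its n highest-weighted terms."""
    return [[vocab[j] for j in sorted(range(len(vocab)), key=lambda j: -row[j])[:n]]
            for row in h]

if __name__ == "__main__":
    docs = [  # made-up toy corpus
        "malware detection malware classification",
        "malware classification with static features",
        "tensor factorization for topic modeling",
        "topic modeling with tensor decomposition",
    ]
    vocab, x = tfidf(docs)
    w, h = nmf(x, k=2)
    clusters = [max(range(2), key=lambda c: row[c]) for row in w]
    print("cluster per document:", clusters)
    print("top terms per topic:", top_terms(h, vocab))
```

Each row of `w` gives a document's weight on every latent topic, so taking the largest entry assigns the document to a cluster, while each row of `h` ranks the vocabulary for one topic. Choosing `k` automatically, as the paper describes, would require an additional model-selection step that this sketch leaves out.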


