This paper introduces a great dataset that exposes how LLMs actually work through multi-hop questions.
GRS-QA maps out LLMs' reasoning processes as explicit graph structures for finer-grained evaluation.
https://arxiv.org/abs/2411.00369
🤖 Original Problem:
Existing multi-hop question answering (QA) datasets lack explicit reasoning structures, making it hard to evaluate how LLMs actually arrive at answers. Current datasets mix questions of varying complexity without proper categorization, preventing fine-grained analysis of LLM reasoning capabilities.
-----
🔧 Solution in this Paper:
GRS-QA introduces explicit reasoning graphs where nodes represent contextual sentences and edges show logical connections. The dataset includes both positive (correct) and negative (perturbed) reasoning graphs, enabling isolated study of structural impacts on reasoning. Each question-answer pair comes with comprehensive metadata about reasoning steps and complexity.
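To make the graph idea concrete, here is a minimal sketch of how one such reasoning graph could be represented, assuming networkx as the graph library and hypothetical field names (this is illustrative, not the paper's actual schema):

```python
# Sketch of a reasoning graph for one QA pair: nodes are supporting
# sentences, directed edges encode the logical dependency between hops.
# Field names here are hypothetical.
import networkx as nx

def build_reasoning_graph(sentences, edges):
    """sentences: {node_id: sentence text}; edges: [(src_id, dst_id), ...]"""
    g = nx.DiGraph()
    for node_id, text in sentences.items():
        g.add_node(node_id, sentence=text)
    g.add_edges_from(edges)
    return g

# Example 2-hop "bridge" question: hop 1 finds the bridge entity,
# hop 2 uses it to reach the answer.
positive_graph = build_reasoning_graph(
    sentences={
        "s1": "Director X was born in City Y.",
        "s2": "City Y is located in Country Z.",
    },
    edges=[("s1", "s2")],  # s1's bridge entity (City Y) feeds into s2
)
```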
-----
🎯 Key Insights:
→ First QA dataset with explicit reasoning graphs showing step-by-step logical pathways
→ Graphs categorized into 4 types: comparison, bridge, compositional, bridge-comparison
→ Negative graphs created by perturbing edges/nodes while keeping the sentence content the same (see the perturbation sketch after this list)
→ New evaluation framework analyzing LLM reasoning across different complexities
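As a rough illustration of the negative-graph idea, the sketch below perturbs a graph's edges while leaving the sentence text untouched; the exact perturbation rules are an assumption, not the paper's specification:

```python
# Create a "negative" reasoning graph by swapping structure only:
# remove one real edge and add one spurious edge (illustrative only).
import random
import networkx as nx

def perturb_edges(graph: nx.DiGraph, seed: int = 0) -> nx.DiGraph:
    rng = random.Random(seed)
    negative = graph.copy()
    edges = list(negative.edges())
    if not edges:
        return negative
    # Drop one genuine logical connection ...
    negative.remove_edge(*rng.choice(edges))
    # ... and insert an edge between two previously unconnected nodes.
    nodes = list(negative.nodes())
    candidates = [
        (u, v) for u in nodes for v in nodes
        if u != v and not negative.has_edge(u, v)
    ]
    if candidates:
        negative.add_edge(*rng.choice(candidates))
    return negative
```

Because only the structure changes, any performance gap between positive and negative graphs isolates the effect of reasoning structure from the effect of content.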
-----
📊 Results:
→ BM25 outperformed DPR and TF-IDF for evidence retrieval, but its performance dropped on more complex questions (a retrieval sketch follows this list)
→ GPT-3.5 achieved the highest performance (F1 scores up to 1.00 for comparison_4_1)
→ LLM performance decreased as question complexity increased
→ Structured reasoning paths improved performance compared to unstructured evidence
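For context on the retrieval result, here is a minimal sketch of BM25 sentence retrieval using the rank_bm25 package; this is an assumed setup for illustration, not necessarily the paper's exact pipeline:

```python
# Rank candidate evidence sentences by BM25 score against the question.
from rank_bm25 import BM25Okapi

def retrieve_evidence(question: str, candidate_sentences: list[str], k: int = 2):
    tokenized_corpus = [s.lower().split() for s in candidate_sentences]
    bm25 = BM25Okapi(tokenized_corpus)
    return bm25.get_top_n(question.lower().split(), candidate_sentences, n=k)

# On multi-hop questions, later-hop sentences often share little lexical
# overlap with the question, which helps explain why retrieval quality
# drops as hop count grows.
retrieved = retrieve_evidence(
    "In which country was the director of Film A born?",
    [
        "Film A was directed by Director X.",
        "Director X was born in City Y.",
        "City Y is located in Country Z.",
        "Film B won an award in 2010.",
    ],
)
print(retrieved)
```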