Current AI models falter when reasoning requires combining multiple pieces of knowledge.
This paper investigates how LLMs, despite their impressive capabilities, struggle with complex reasoning tasks that require integrating external knowledge and non-sequential thinking.
-----
https://arxiv.org/abs/2412.08317
🤔 Original Problem:
LLMs show remarkable reasoning abilities with Chain-of-Thought prompting, but their true capabilities in handling multi-hop reasoning with external knowledge remain unclear.
-----
🔍 Solution in this Paper:
→ The study tests GPT-3.5 on four reasoning benchmarks using Chain-of-Thought prompting variations.
→ Experiments evaluate three key aspects: external knowledge selection/combination, non-sequential reasoning handling, and hop-count generalization.
→ Tests compare model performance with and without external knowledge and analyze how distractors affect reasoning (see the prompt sketch after this list).
→ Counterfactual knowledge tests reveal how models handle knowledge inconsistency.
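A minimal sketch of what such a probe could look like, purely as an illustration rather than the paper's actual setup: building Chain-of-Thought prompts that vary the external knowledge supplied with a multi-hop question. All names here (build_prompt, gold_facts, distractors) and the example question are hypothetical.

```python
# Illustrative sketch (not the paper's code): constructing CoT prompts that vary
# the external knowledge available for a multi-hop question.

def build_prompt(question, facts=None):
    """Assemble a Chain-of-Thought prompt, optionally prefixed with facts."""
    parts = []
    if facts:
        parts.append("Facts:\n" + "\n".join(f"- {f}" for f in facts))
    parts.append(f"Question: {question}")
    parts.append("Let's think step by step.")
    return "\n\n".join(parts)

question = "Which country was the director of Film X born in?"            # hypothetical
gold_facts = ["Film X was directed by Person Y.", "Person Y was born in Country Z."]
distractors = ["Film W was directed by Person Q.", "Person Q was born in Country R."]

prompts = {
    "internal_only": build_prompt(question),                              # no external knowledge
    "gold_context": build_prompt(question, gold_facts),                   # exact supporting facts
    "with_distractors": build_prompt(question, gold_facts + distractors), # noisy context
}
```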
-----
💡 Key Insights:
→ Models struggle significantly when relying solely on internal knowledge, achieving only 40% accuracy on HotpotQA
→ Accuracy drops from 78% on sequential reasoning to 56% when the reasoning structure is non-sequential
→ Adding distractors hurts knowledge selection precision far more than recall (see the sketch after this list)
→ Models fail to maintain consistent performance as hop counts increase
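To make the precision-vs-recall point concrete, here is a small illustrative sketch, assuming (as in HotpotQA-style supporting-fact evaluation) that knowledge selection is scored against a gold set of supporting facts; the fact names are hypothetical:

```python
# Illustrative sketch: precision vs. recall of the facts a model selects for its
# reasoning chain, scored against the gold supporting facts.

def selection_precision_recall(selected, gold):
    selected, gold = set(selected), set(gold)
    hits = len(selected & gold)
    precision = hits / len(selected) if selected else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall

gold = {"fact_1", "fact_2"}
# With distractors in the context, the model tends to cite extra, irrelevant facts:
selected = {"fact_1", "fact_2", "distractor_a", "distractor_b"}
print(selection_precision_recall(selected, gold))  # (0.5, 1.0): recall holds, precision falls
```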
-----
📊 Results:
→ 40% accuracy on HotpotQA with internal knowledge only
→ 35% accuracy on EntailmentBank without external context
→ 22-point accuracy drop on non-sequential vs. sequential reasoning
→ Precision drops faster than recall when distractors increase