"Large Language Models Still Face Challenges in Multi-Hop Reasoning with External Knowledge"

The podcast on this paper is generated with Google's Illuminate.

Current AI models falter when reasoning requires combining multiple pieces of knowledge.

Despite their impressive capabilities, LLMs struggle with complex reasoning tasks that require integrating external knowledge and thinking non-sequentially; this paper investigates where and why they fall short.

-----

https://arxiv.org/abs/2412.08317

🤔 Original Problem:

LLMs show remarkable reasoning abilities with Chain-of-Thought prompting, but their true capabilities in handling multi-hop reasoning with external knowledge remain unclear.

-----

🔍 Solution in this Paper:

→ The study tests GPT-3.5 on four reasoning benchmarks using Chain-of-Thought prompting variations.

→ Experiments evaluate three key aspects: external knowledge selection/combination, non-sequential reasoning handling, and hop-count generalization.

→ Tests compare model performance with/without external knowledge and analyze how distractors impact reasoning.

→ Counterfactual knowledge tests reveal how models handle knowledge inconsistency (a prompt-construction sketch follows this list).

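A minimal sketch of how these prompting conditions can be set up, assuming a HotpotQA-style 2-hop question; the facts, distractors, counterfactual, and prompt template below are illustrative assumptions, not the paper's actual data or prompts:

```python
import random

# Hypothetical 2-hop, HotpotQA-style item; the paper's real prompts and
# data loading are not reproduced here.
QUESTION = "Were Scott Derrickson and Ed Wood of the same nationality?"
GOLD_FACTS = [
    "Scott Derrickson is an American director.",
    "Ed Wood was an American filmmaker.",
]
DISTRACTORS = [
    "Tim Burton directed the 1994 film Ed Wood.",
    "Doctor Strange was released in 2016.",
]
# A fact contradicting the model's parametric knowledge, used to test
# how it handles knowledge inconsistency.
COUNTERFACTUAL = "Scott Derrickson is a British director."

def build_prompt(question, facts=None):
    """Assemble a Chain-of-Thought prompt, optionally prepending
    external knowledge as a bulleted context block."""
    lines = []
    if facts:
        lines.append("Knowledge:")
        lines.extend(f"- {f}" for f in facts)
    lines.append(f"Question: {question}")
    lines.append("Answer: Let's think step by step.")  # CoT trigger
    return "\n".join(lines)

# Internal-knowledge-only condition: no context is provided.
print(build_prompt(QUESTION))

# External-knowledge condition, with distractors shuffled in.
context = random.sample(GOLD_FACTS + DISTRACTORS, k=len(GOLD_FACTS) + 2)
print(build_prompt(QUESTION, facts=context))

# Counterfactual condition: one gold fact is replaced by a contradiction.
print(build_prompt(QUESTION, facts=[COUNTERFACTUAL, GOLD_FACTS[1]]))
```
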
-----

💡 Key Insights:

→ Models struggle significantly when relying solely on internal knowledge, achieving only 40% accuracy on HotpotQA

→ Performance drops from 78% to 56% on non-sequential reasoning tasks

→ Adding distractors hurts knowledge-selection precision far more than recall

→ Models fail to maintain consistent performance as hop counts increase (see the sketch below)

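A minimal sketch of one way hop-count generalization can be probed with synthetic fact chains; the chain format and question template are assumptions for illustration, not the paper's benchmark construction:

```python
def make_hop_chain(k):
    """Build a synthetic k-hop fact chain plus a question that can
    only be answered by composing all k facts in order."""
    entities = [f"E{i}" for i in range(k + 1)]
    facts = [f"{entities[i]} is linked to {entities[i + 1]}."
             for i in range(k)]
    question = f"Starting from {entities[0]}, what is reached after {k} links?"
    return facts, question, entities[-1]  # gold answer is the chain's end

# Probe whether accuracy holds up as k grows past the hop counts seen
# in the in-context demonstrations.
for k in (2, 4, 6):
    facts, question, answer = make_hop_chain(k)
    print(f"{k}-hop gold answer: {answer}")
```
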
-----

📊 Results:

→ 40% accuracy on HotpotQA with internal knowledge only

→ 35% accuracy on EntailmentBank without external context

→ 22-point accuracy drop on non-sequential vs. sequential reasoning

→ Precision drops faster than recall as distractors increase (see the metric sketch below)

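A minimal sketch of the selection metric behind this result, assuming cited facts are scored as set precision/recall against the gold supporting facts (an assumption for illustration; the paper's exact scoring may differ):

```python
def selection_scores(selected, gold):
    """Precision/recall of the facts a model cites in its reasoning
    chain, scored against the gold supporting facts."""
    selected, gold = set(selected), set(gold)
    tp = len(selected & gold)
    precision = tp / len(selected) if selected else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

gold = {"f1", "f2"}
# No distractors: the model cites exactly the gold facts.
print(selection_scores({"f1", "f2"}, gold))              # (1.0, 1.0)
# Distractors present: gold facts are still cited (recall holds),
# but irrelevant facts creep in, so precision falls faster.
print(selection_scores({"f1", "f2", "d1", "d2"}, gold))  # (0.5, 1.0)
```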