
"Procedural Knowledge in Pretraining Drives Reasoning in Large Language Models"

The podcast on this paper is generated with Google's Illuminate.

LLMs don't memorize math - they learn procedures from code and examples.

This paper investigates how LLMs learn reasoning from pretraining data by analyzing which documents influence model outputs. Using influence functions, the authors examine how 7B- and 35B-parameter models draw on pretraining data for mathematical reasoning versus factual retrieval.

-----

https://arxiv.org/abs/2411.12580

🤔 Original Problem:

→ While LLMs show strong reasoning abilities, it's unclear whether they truly learn generalizable strategies or just memorize and retrieve answers from training data.

-----

🔬 Solution in this Paper:

→ The researchers analyzed 5 million pretraining documents (2.5B tokens) to identify which ones influence model outputs for mathematical reasoning tasks versus factual questions.

→ They used EK-FAC (eigenvalue-corrected Kronecker-factored approximate curvature) influence functions to compute how individual documents affect model completions (a simplified sketch of the idea follows this list).

→ They compared document influence patterns between factual retrieval and three mathematical reasoning tasks: arithmetic, slopes, and linear equations.
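
For intuition, here is a minimal sketch of the influence-function recipe that EK-FAC accelerates: a (query, document) pair is scored by the query-loss gradient, an inverse-curvature term, and the document-loss gradient. This is not the paper's implementation; the model, loss, data, and the damped-identity stand-in for the Hessian below are all illustrative assumptions.

```python
# Hedged sketch of the influence-function idea, NOT the paper's EK-FAC code.
# EK-FAC approximates the curvature (Hessian) term; here it is replaced by a
# damped identity, so the score reduces to a scaled gradient dot product.
import torch
import torch.nn as nn

def flat_grad(loss, model):
    """Flatten the gradients of `loss` w.r.t. all model parameters into one vector."""
    grads = torch.autograd.grad(loss, list(model.parameters()))
    return torch.cat([g.reshape(-1) for g in grads])

def influence(model, loss_fn, query_batch, doc_batch, damping=1e-3):
    """Approximate influence of one training document on a query completion:
    grad(query)^T  H^{-1}  grad(doc), with H^{-1} stood in by (1/damping) * I."""
    g_query = flat_grad(loss_fn(model, query_batch), model)
    g_doc = flat_grad(loss_fn(model, doc_batch), model)
    return torch.dot(g_query, g_doc) / damping

# Toy usage: rank synthetic "documents" by influence on a single "query".
if __name__ == "__main__":
    torch.manual_seed(0)
    model = nn.Linear(8, 1)

    def loss_fn(m, batch):
        x, y = batch
        return nn.functional.mse_loss(m(x), y)

    query = (torch.randn(4, 8), torch.randn(4, 1))
    docs = [(torch.randn(4, 8), torch.randn(4, 1)) for _ in range(5)]

    scores = [influence(model, loss_fn, query, d).item() for d in docs]
    ranked = sorted(range(len(docs)), key=lambda i: -scores[i])
    print("documents ranked by (toy) influence:", ranked)
```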

-----

💡 Key Insights:

→ Documents containing procedural knowledge are influential for the same type of reasoning task even when the specific numbers in the query change

→ Models rely less on individual documents for reasoning than for factual retrieval

→ Factual answers often appear in influential documents (55% for 7B, 30% for 35B) while reasoning answers rarely do

→ Code and mathematical procedures are overrepresented in influential documents for reasoning tasks
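
As an illustration of what "procedural knowledge" looks like in practice, the snippet below is the kind of generic slope routine the paper finds overrepresented among influential documents. It is written for this summary, not excerpted from the analyzed corpus.

```python
# Illustrative only: generic procedural content of the sort that shows up in
# influential documents for reasoning queries (here, computing a slope).
def slope(x1, y1, x2, y2):
    """Slope of the line through (x1, y1) and (x2, y2): rise over run."""
    if x2 == x1:
        raise ValueError("Vertical line: slope is undefined.")
    return (y2 - y1) / (x2 - x1)

print(slope(1, 2, 3, 8))  # (8 - 2) / (3 - 1) = 3.0
```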

-----

📊 Results:

→ Found a significant correlation (p < 4e-8) between document influence scores for same-type reasoning tasks (see the sketch after this list)

→ The top 0.01% of documents contained the answer for 55% of factual queries (7B) but only 7.4% of reasoning queries

→ Code implementations appeared among the top 100 most influential documents for 80% of slope-calculation queries
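
The correlation result above can be pictured with a small sketch: given per-document influence scores for two queries of the same reasoning type, correlate them across documents. The score arrays here are synthetic placeholders; the paper computes real influence scores over its 5-million-document sample.

```python
# Hedged sketch of the correlation check, using synthetic influence scores:
# two same-type queries share a "procedural" component plus query-specific noise.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n_docs = 100_000

shared = rng.normal(size=n_docs)                      # documents useful for the procedure
scores_query_a = shared + 0.5 * rng.normal(size=n_docs)
scores_query_b = shared + 0.5 * rng.normal(size=n_docs)

r, p = pearsonr(scores_query_a, scores_query_b)
print(f"Pearson r = {r:.3f}, p = {p:.2e}")
```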
