Why think step by step? Reasoning emerges from the locality of experience
The local patterns in training data are the secret sauce behind LLMs' reasoning abilities, according to this paper.
LLMs crack complex problems by daisy-chaining familiar local patterns from training data
Original Problem:
Chain-of-thought reasoning improves LLM performance on complex tasks, but it's unclear why this approach is effective.
Solution in this Paper:
• Investigates why chain-of-thought reasoning works in LLMs
• Hypothesizes that its effectiveness stems from local structure in the training data
• Proves mathematically that reasoning through intermediate variables reduces bias in simple chain-structured models
• Tests the impact of training data structure experimentally using synthetic data sampled from Bayesian networks
• Evaluates models' ability to estimate conditional probabilities for pairs of variables never observed together in training
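To make the setup concrete, here is a minimal sketch of what "locally structured" synthetic data from a chain-shaped Bayesian network could look like. All names, probabilities, and the window size are illustrative assumptions, not the paper's actual generator:

```python
import random

# Hypothetical sketch: a binary chain X0 -> X1 -> ... -> X(n-1), where each
# variable copies its parent with probability 1 - p_flip. "Local" training
# data keeps only small windows of adjacent variables per sample, so distant
# pairs never co-occur in any training example.
def sample_chain(n, p_flip=0.2):
    """Sample one joint assignment from the binary chain."""
    x = [random.random() < 0.5]                      # uniform root X0
    for _ in range(n - 1):
        prev = x[-1]
        x.append(prev if random.random() > p_flip else not prev)
    return x

def local_observations(n, num_samples, window=2):
    """Training set in which only `window` adjacent variables co-occur."""
    data = []
    for _ in range(num_samples):
        x = sample_chain(n)
        start = random.randrange(n - window + 1)     # pick a local window
        data.append({i: x[i] for i in range(start, start + window)})
    return data
```

A model trained on such observations sees every adjacent dependency, but never a distant pair like (X0, X9) in the same example.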
Key Insights from this Paper:
• Chain-of-thought reasoning is effective when the training data has local structure
• Reasoning improves estimation by chaining local statistical dependencies
• Locally structured training data combined with reasoning enables accurate inferences from limited data
• Direct estimation is inaccurate when the relevant variables are rarely seen together in training
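The "chaining local dependencies" insight can be checked numerically: for a chain X → M → Y, the unseen conditional P(Y | X) is recovered exactly by marginalizing over the intermediate variable M, using only the two locally observable conditionals. The probability values below are illustrative, not from the paper:

```python
# Chain X -> M -> Y over binary variables. Only P(M|X) and P(Y|M) are
# "locally observed"; P(Y|X) is never seen directly, but follows from
#   P(y|x) = sum_m P(y|m) * P(m|x)
p_m_given_x = {0: 0.9, 1: 0.2}   # P(M=1 | X=x), illustrative numbers
p_y_given_m = {0: 0.3, 1: 0.8}   # P(Y=1 | M=m), illustrative numbers

def p_y_given_x(x):
    """P(Y=1 | X=x) via chaining the two local conditionals."""
    return sum(
        p_y_given_m[m] * (p_m_given_x[x] if m == 1 else 1 - p_m_given_x[x])
        for m in (0, 1)
    )
```

With these numbers, p_y_given_x(0) ≈ 0.75 and p_y_given_x(1) ≈ 0.40, so the distant dependency is fully determined by the local ones.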
Results:
• Free generation outperforms direct prediction when training data is locally structured
• Scaffolded and free generation are significantly better than negatively scaffolded generation
• Locally structured training plus reasoning achieves better performance with less data than training on fully observed samples
• Models trained on local data produce d-separating reasoning traces 70% of the time, vs. 34% for a mismatched locality structure
The paper tests the impact of training data structure experimentally by training language models on synthetic data from Bayesian networks with different observation patterns.
Models are evaluated on their ability to estimate conditional probabilities for pairs of variables not observed together in training.
The results show that generating intermediate reasoning steps improves performance only when the training data has local structure matching the dependencies of the underlying model.
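A toy analogue of the free-generation idea: instead of predicting the distant variable directly, sample the intermediate variable step by step and average the resulting traces. The conditional probabilities are illustrative assumptions, and this Monte Carlo sketch stands in for what the trained language model does implicitly:

```python
import random

# Toy analogue of "free generation": estimate the never-observed conditional
# P(Y=1 | X=x) by sampling the intermediate variable M, then Y, one local
# step at a time, and averaging over many sampled reasoning traces.
p_m_given_x = {0: 0.9, 1: 0.2}   # P(M=1 | X=x), illustrative local conditionals
p_y_given_m = {0: 0.3, 1: 0.8}   # P(Y=1 | M=m)

def free_generation_estimate(x, num_traces=100_000, seed=0):
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_traces):
        m = 1 if rng.random() < p_m_given_x[x] else 0   # sample M | X=x
        y = 1 if rng.random() < p_y_given_m[m] else 0   # sample Y | M=m
        hits += y
    return hits / num_traces
```

With enough traces the estimate converges to the exact chained value (0.75 for x=0 under these numbers), even though no training example ever pairs X with Y directly.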