Analyzing LLMs through Monte Carlo Language Trees suggests they perform probabilistic pattern matching rather than formal reasoning.
This paper proposes a novel way to understand LLMs by representing both the training data and the model itself as Monte Carlo Language Trees (the Data-Tree and the GPT-Tree, respectively), enabling quantitative analysis of how LLMs learn and reason.
https://arxiv.org/abs/2501.07641
Methods in this Paper 💡:
→ Represent any language dataset as a Data-Tree: each node is a token, and each edge carries the probability of transitioning to the next token, estimated from conditional frequencies in the dataset (a minimal sketch follows this list).
→ Represent any LLM as a GPT-Tree: feed token prefixes into the LLM, read off the probability distribution over next tokens, expand the most likely candidates, and repeat (see the second sketch below).
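A minimal sketch of a Data-Tree builder, assuming a simple prefix-tree representation. The names (`DataTreeNode`, `build_data_tree`, `transition_probs`) and the `tokenize` callback are illustrative, not the paper's implementation:

```python
from collections import defaultdict

class DataTreeNode:
    """One node per token prefix; edge probabilities come from child counts."""
    def __init__(self):
        self.children = defaultdict(DataTreeNode)  # next token -> child node
        self.count = 0                             # how often this prefix occurs

def build_data_tree(corpus, tokenize, max_depth=8):
    root = DataTreeNode()
    for text in corpus:
        node = root
        node.count += 1
        for tok in tokenize(text)[:max_depth]:
            node = node.children[tok]
            node.count += 1
    return root

def transition_probs(node):
    """P(next token | prefix) = count(prefix + token) / count(prefix + any sibling)."""
    total = sum(child.count for child in node.children.values())
    return {tok: child.count / total for tok, child in node.children.items()}

# Toy corpus with whitespace "tokenization", just to show the conditional counts.
tree = build_data_tree(["the cat sat", "the cat ran", "the dog sat"], str.split)
print(transition_probs(tree.children["the"]))  # {'cat': 0.666..., 'dog': 0.333...}
```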
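And a sketch of GPT-Tree construction under the same caveat: the recursive expansion below uses Hugging Face `transformers` with `gpt2` purely as a runnable stand-in, `depth` and `top_k` are illustrative parameters, and re-tokenizing the concatenated string is an approximation rather than the paper's exact procedure:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def build_gpt_tree(prompt, model, tokenizer, depth=3, top_k=3):
    """Return a nested dict {token: (prob, subtree)} rooted at `prompt`."""
    if depth == 0:
        return {}
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]      # logits for the next token
    probs = torch.softmax(logits, dim=-1)
    top = torch.topk(probs, top_k)             # keep only the top-k branches
    tree = {}
    for p, tok_id in zip(top.values, top.indices):
        tok = tokenizer.decode(int(tok_id))
        # Recurse on the extended prefix to grow the tree one level deeper.
        tree[tok] = (p.item(), build_gpt_tree(prompt + tok, model, tokenizer,
                                              depth - 1, top_k))
    return tree

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
print(build_gpt_tree("The cat", model, tok, depth=2, top_k=2))
```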
-----
Key Insights from this Paper 🤯:
→ Different LLMs trained on the same dataset show high structural similarity in their GPT-Trees.
→ Larger LLMs converge closer to the Data-Tree.
→ Over 87% of GPT output tokens can be recalled by the Data-Tree, which suggests LLMs perform probabilistic pattern matching over their training data rather than formal reasoning (a sketch of one way to compute such a recall follows this list).
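A minimal sketch of such a recall check, assuming the `DataTreeNode` structure from the first sketch; this is one illustrative way to measure how many generated tokens stay inside the Data-Tree, not the paper's exact protocol:

```python
def data_tree_recall(output_tokens, data_tree_root):
    """Fraction of generated tokens found while walking down the Data-Tree."""
    node, hits = data_tree_root, 0
    for tok in output_tokens:
        if tok not in node.children:
            break  # the prefix leaves the Data-Tree; stop counting
        hits += 1
        node = node.children[tok]
    return hits / len(output_tokens) if output_tokens else 0.0
```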
-----
Results 💯:
→ Different GPT models trained on the same dataset (The Pile) produce visibly similar GPT-Tree visualizations.
→ The larger the model, the closer its GPT-Tree converges to the Data-Tree.
→ More than 87% of GPT output tokens can be recalled by the Data-Tree.