Want your AI to count better? Just space things out!
Proper tokenization helps LLMs count better by preventing character grouping
📚 https://arxiv.org/abs/2410.19730
🤖 Original Problem:
Transformers in LLMs lack recurrent connections, limiting them to constant-depth computation. This makes them theoretically incapable of solving tasks whose required reasoning depth grows with input length, such as counting.
-----
🔧 Solution in this Paper:
• Analyzed how Byte Pair Encoding (BPE) tokenization impacts counting ability
• Used delimiters (spaces/commas) to force item-separated tokenization (see the tokenization sketch after this list)
• Implemented supervised Chain of Thought (CoT) with explicit step templates
• Manipulated tokenization through string formatting to improve counter extraction
• Extended reasoning from latent space to text space using natural language sequences
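Here's a minimal sketch of the delimiter idea, using the open-source tiktoken tokenizer rather than the paper's own code (the tokenizer choice and the example strings are my assumptions):

```python
# Sketch: show how inserting delimiters keeps a BPE tokenizer from merging
# the items we want the model to count. Uses tiktoken's cl100k_base
# vocabulary; the paper's exact setup may differ.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def token_strings(text: str) -> list[str]:
    """Decode each BPE token id back to its surface string."""
    return [enc.decode([tok_id]) for tok_id in enc.encode(text)]

raw = "aaaabaaab"          # packed: BPE merges runs of 'a' into multi-char tokens
spaced = " ".join(raw)     # "a a a a b a a a b": roughly one token per item

print(token_strings(raw))     # a few multi-character tokens
print(token_strings(spaced))  # item-separated tokens, easier to count over
```

With item separation, each character the model must count sits in its own token, which is what the paper's tokenization manipulation aims for.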
💡 How does Chain of Thought (CoT) help overcome Transformers' limitations?
CoT extends reasoning from latent space to text space, using natural language sequences to relay intermediate computations in the absence of recurrence. Because each generated token is fed back into the model as input, this restores a form of recurrent processing and makes higher-complexity tasks like counting feasible.
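As a hypothetical illustration of what a supervised CoT step template for counting could look like (the function name and exact wording below are my own, not taken from the paper):

```python
# Hypothetical supervised-CoT data builder: one explicit counting step per
# character, so the running count is carried in generated text rather than
# in the model's latent state. At training time the gold steps supervise
# the CoT; at inference time only the question is given.
def build_counting_cot(string: str, target: str) -> str:
    question = (
        f"How many times does '{target}' appear in '{string}'?\n"
        "Go character by character and keep a running count.\n"
    )
    count = 0
    steps = []
    for i, ch in enumerate(string, start=1):
        if ch == target:
            count += 1
        steps.append(f"Step {i}: char '{ch}' -> count = {count}")
    return question + "\n".join(steps) + f"\nAnswer: {count}"

print(build_counting_cot("abcaba", "a"))
```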
-----
💡 Key Insights:
• BPE tokenization groups multiple characters, causing up to 80% degradation in counting accuracy
• Lower-frequency letters (e.g., z, 0.07% English letter frequency) are counted more accurately than high-frequency ones (e, 12.7%), since rare letters are less likely to be merged into multi-character BPE tokens (see the frequency sketch after this list)
• Proper tokenization combined with CoT can overcome Transformers' theoretical limitations
• Clear token separation improves counting accuracy significantly
• Supervised CoT outperforms unsupervised CoT across all tokenization methods
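To see the frequency effect for yourself, here is a quick sketch (again with tiktoken, not the paper's tokenizer; the specific letters and run length are arbitrary choices):

```python
# Compare how a BPE vocabulary splits runs of a frequent letter vs. a rare one.
# Frequent letters have more learned merges, so their runs tend to collapse
# into fewer, longer tokens, hiding characters the model needs to count.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for letter in ("e", "z"):
    run = letter * 20
    ids = enc.encode(run)
    print(f"'{letter}' * 20 -> {len(ids)} tokens: {[enc.decode([i]) for i in ids]}")
```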
-----
📊 Results:
• Accuracy drops from 96% to 56% as string length increases from [10,20] to [30,40]
• Item-separated tokenization improves performance by 13-40% over pure BPE
• Rare tokens show 6-12% better counting performance than frequent tokens
• Supervised CoT achieves up to 70.8% accuracy on longer sequences