Smart tokenization that helps LLMs better understand complex words: a simple tweak that lets models process longer words more effectively.
LBPE fixes token imbalance by prioritizing longer, semantically richer tokens in LLM training
https://arxiv.org/abs/2411.05504
Original Problem 🎯:
BPE tokenization in LLMs prioritizes short tokens due to their higher frequency, leading to underrepresentation of longer tokens that carry rich semantic information. This imbalance makes it harder for models to learn and process longer, more meaningful tokens effectively.
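To see why rank-based merging sidelines long tokens, here is a toy sketch (the merge rules, pre-split word pieces, and the `bpe_encode` helper are hypothetical illustrations, not from the paper): once a frequent short merge fires, the longer vocabulary token covering the same span can become unreachable.

```python
# Toy illustration: rank-based BPE applies the most frequent (lowest-rank)
# merge first, so short, frequent pieces win and the long vocabulary token
# is never assembled.

merges = {("stand", "ing"): 0,      # very frequent, applied first
          ("out", "stand"): 4,
          ("outstand", "ing"): 9}   # the path to the long token

def bpe_encode(pieces):
    """Standard greedy BPE: repeatedly apply the lowest-rank adjacent merge."""
    pieces = list(pieces)
    while True:
        best = None
        for i in range(len(pieces) - 1):
            rank = merges.get((pieces[i], pieces[i + 1]))
            if rank is not None and (best is None or rank < best[1]):
                best = (i, rank)
        if best is None:
            return pieces
        i, _ = best
        pieces[i:i + 2] = [pieces[i] + pieces[i + 1]]

print(bpe_encode(["out", "stand", "ing"]))
# -> ['out', 'standing']: the long token 'outstanding' is in the vocabulary,
#    but becomes unreachable once the short, frequent merge fires first.
```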
-----
Solution in this Paper 🛠️:
→ LBPE introduces a long-token-first encoding algorithm that prioritizes merging longer tokens during encoding
→ Instead of using vocabulary rank as the merging priority, it uses the reverse rank of token length, so longer tokens are merged first
→ Implements a sliding-window search that matches the longest possible token at each position (see the sketch after this list)
→ Maintains compatibility with existing BPE enhancements like Scaffold-BPE
→ Achieves better time complexity, O(m|T|), than the original BPE's O(|T|²)
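A minimal sketch of the long-token-first encoding idea, assuming a plain set-membership vocabulary and a greedy longest-match sliding window (the function name, toy vocabulary, and single-character fallback are illustrative assumptions, not the paper's implementation):

```python
# Sketch of long-token-first encoding: slide a window over the word and
# greedily take the longest vocabulary match at each position, so long
# tokens are preferred over short, frequent ones.

def lbpe_encode(word: str, vocab: set[str], max_token_len: int = 16) -> list[str]:
    """Greedy longest-match encoding over a vocabulary of surface strings."""
    tokens, i = [], 0
    while i < len(word):
        # shrink the window from the longest candidate down to one character
        for j in range(min(len(word), i + max_token_len), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # unknown single character (byte-level fallback in practice)
            tokens.append(word[i])
            i += 1
    return tokens

vocab = {"out", "stand", "ing", "standing", "outstanding"}
print(lbpe_encode("outstanding", vocab))
# -> ['outstanding']: the same word that rank-based BPE split into
#    ['out', 'standing'] in the toy example above.
```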
-----
Key Insights 💡:
→ Tokens of 7-9 characters: frequency increased by 2.37%
→ Tokens of 10-12 characters: frequency increased by 2.24%
→ Tokens of 13-15 characters: frequency increased by 2.28%
→ Short tokens (1-3 characters): frequency decreased by 0.97%
→ Works effectively even in continual pretraining of existing models
-----
Results 📊:
→ On the 6.7B model: BoolQ accuracy improved from 62.87% to 64.10%
→ HellaSwag accuracy increased from 60.57% to 61.60%
→ Consistent improvements across all model sizes (468M, 1.2B, 6.7B)
→ Higher compression rate achieved across all vocabulary sizes (32K, 64K, 128K)