Converting tokens to characters: A bridge between LLMs and human text interaction
This paper introduces algorithms to convert token-level LLMs to character-level ones, solving the "prompt boundary problem" where models show unwanted sensitivity to characters at prompt boundaries. The solution enables accurate character-level probability computation and conditional text generation while maintaining model performance.
-----
https://arxiv.org/abs/2412.03719
🤔 Original Problem:
→ LLMs operate internally on tokens while users interact through characters, causing issues like the "prompt boundary problem," where adding a trailing space to a prompt can drastically change the output (see the sketch after this list)
→ Current token-based approaches struggle with character-level constraints and precise probability calculations
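To make the boundary problem concrete, here is a minimal sketch (assuming Hugging Face's `transformers` package and its GPT-2 tokenizer; not code from the paper) showing how a trailing space changes the token sequence the model conditions on:

```python
# Minimal illustration of the prompt boundary problem, assuming the
# Hugging Face `transformers` package and its GPT-2 tokenizer.
from transformers import GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")

without_space = tok.encode("Hello, my name")
with_space = tok.encode("Hello, my name ")
print(without_space)  # final token is " name" (leading space included)
print(with_space)     # the trailing space becomes its own token, so the model
                      # can no longer emit continuations like " is" that begin
                      # with a space, skewing the next-token distribution
```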
-----
🔧 Solution in this Paper:
→ Introduces a mathematical framework for converting token-level models to character-level ones through "coverings": the set of token strings that form valid decodings of a character string (illustrated in the toy sketch after this list)
→ Develops both exact and approximate beam search algorithms to efficiently compute character-level probabilities
→ Implements a "token healing" mechanism that finds high-probability tokenizations consistent with the prompt boundary
→ Creates a character-level interface that maintains the efficiency of token-level processing
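As a concrete illustration of the covering idea, the toy sketch below enumerates every tokenization of a string over a hypothetical vocabulary with made-up unigram probabilities and sums them; the paper's algorithms perform the same summation under the LLM's autoregressive token distribution rather than these toy scores:

```python
# Toy sketch of a "covering": all token sequences that decode to a character
# string. Vocabulary and probabilities here are hypothetical; the paper scores
# coverings with the LLM's autoregressive token probabilities instead.
VOCAB = {"un": 0.10, "u": 0.05, "n": 0.05,
         "believ": 0.10, "believable": 0.20, "able": 0.30}

def coverings(s):
    """Yield every token sequence over VOCAB that decodes exactly to `s`."""
    if not s:
        yield []
        return
    for token in VOCAB:
        if s.startswith(token):
            for rest in coverings(s[len(token):]):
                yield [token] + rest

def char_level_prob(s):
    """Character-level probability = sum of probabilities over all coverings."""
    total = 0.0
    for cover in coverings(s):
        p = 1.0
        for token in cover:
            p *= VOCAB[token]
        total += p
    return total

for c in coverings("unbelievable"):
    print(c)                         # 4 distinct tokenizations of the same string
print(char_level_prob("unbelievable"))
```

Exact enumeration like this explodes combinatorially on long strings, which is what motivates the paper's beam-search approximation.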
-----
💡 Key Insights:
→ Token-level models can be accurately converted to character-level ones without performance loss
→ Beam search with small beam sizes (K=8) provides a good approximation (sketched after this list)
→ High-probability tokenizations tend to concentrate around canonical forms
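A hypothetical sketch of the beam-search approximation (reusing the toy VOCAB defined above; not the paper's implementation): instead of enumerating all coverings, keep only the K most probable partial tokenizations at each step:

```python
# Hypothetical beam-search sketch: prune partial tokenizations to the K best
# rather than enumerating every covering. Reuses the toy VOCAB defined above.
import heapq
import math

def beam_coverings(s, vocab, K=8):
    """Return up to K high-probability tokenizations of `s` as (logprob, tokens)."""
    beams = [(0.0, 0, [])]           # (log-probability, chars consumed, tokens)
    finished = []
    while beams:
        candidates = []
        for logp, i, tokens in beams:
            if i == len(s):          # decoded the full string: a complete covering
                finished.append((logp, tokens))
                continue
            for token, p in vocab.items():
                if s.startswith(token, i):
                    candidates.append((logp + math.log(p),
                                       i + len(token),
                                       tokens + [token]))
        beams = heapq.nlargest(K, candidates)   # keep only the K best partials
    return heapq.nlargest(K, finished)

print(beam_coverings("unbelievable", VOCAB, K=8))
```

Because probability mass concentrates on near-canonical tokenizations (the third insight above), pruning to a small beam discards very little mass, which is why K=8 already approximates the exact sum well.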
-----
📊 Results:
→ Achieves 46.3 characters/second processing speed on Llama 3.1 8B
→ Maintains accuracy within 0.00021 excess bits/character
→ Shows an inverse relationship between beam size and approximation error
→ Demonstrates better performance on newer models like Llama compared to GPT-2