Converting tokens to characters: A bridge between LLMs and human text interaction
This paper introduces algorithms to convert token-level LLMs to character-level ones, solving the "prompt boundary problem" where models show unwanted sensitivity to characters at prompt boundaries. The solution enables accurate character-level probability computation and conditional text generation while maintaining model performance.
-----
https://arxiv.org/abs/2412.03719
🤔 Original Problem:
→ LLMs operate internally on tokens while users interact through characters, causing issues like the "prompt boundary problem," where adding a trailing space to a prompt can drastically change the output (see the sketch after this list)
→ Current token-based approaches struggle with character-level constraints and precise probability calculations
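To make the boundary problem concrete, here is a minimal sketch (assuming Hugging Face's `transformers` package and its GPT-2 tokenizer; not code from the paper) showing how a trailing space changes the token sequence the model conditions on:

```python
# Minimal illustration of the prompt boundary problem, assuming the
# Hugging Face `transformers` package and its GPT-2 tokenizer.
from transformers import GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")

without_space = tok.encode("Hello, my name")
with_space = tok.encode("Hello, my name ")
print(without_space)  # final token is " name" (leading space included)
print(with_space)     # the trailing space becomes its own token, so the model
                      # can no longer emit continuations like " is" that begin
                      # with a space, skewing the next-token distribution
```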
-----
🔧 Solution in this Paper:
→ Introduces a mathematical framework for converting token-level models to character-level ones through "coverings": the set of token strings that form valid decodings of a character string (illustrated in the toy sketch after this list)
→ Develops both exact and approximate beam search algorithms to efficiently compute character-level probabilities
→ Implements a "token healing" mechanism that finds high-probability tokenizations consistent with the prompt boundary
→ Creates a character-level interface that maintains the efficiency of token-level processing
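As a concrete illustration of the covering idea, the toy sketch below enumerates every tokenization of a string over a hypothetical vocabulary with made-up unigram probabilities and sums them; the paper's algorithms perform the same summation under the LLM's autoregressive token distribution rather than these toy scores:

```python
# Toy sketch of a "covering": all token sequences that decode to a character
# string. Vocabulary and probabilities here are hypothetical; the paper scores
# coverings with the LLM's autoregressive token probabilities instead.
VOCAB = {"un": 0.10, "u": 0.05, "n": 0.05,
         "believ": 0.10, "believable": 0.20, "able": 0.30}

def coverings(s):
    """Yield every token sequence over VOCAB that decodes exactly to `s`."""
    if not s:
        yield []
        return
    for token in VOCAB:
        if s.startswith(token):
            for rest in coverings(s[len(token):]):
                yield [token] + rest

def char_level_prob(s):
    """Character-level probability = sum of probabilities over all coverings."""
    total = 0.0
    for cover in coverings(s):
        p = 1.0
        for token in cover:
            p *= VOCAB[token]
        total += p
    return total

for c in coverings("unbelievable"):
    print(c)                         # 4 distinct tokenizations of the same string
print(char_level_prob("unbelievable"))
```

Exact enumeration like this explodes combinatorially on long strings, which is what motivates the paper's beam-search approximation.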
-----
💡 Key Insights:
→ Token-level models can be accurately converted to character-level ones without performance loss
→ Beam search with small beam sizes (K=8) provides a good approximation (sketched after this list)
→ High-probability tokenizations tend to concentrate around canonical forms
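A hypothetical sketch of the beam-search approximation (reusing the toy VOCAB defined above; not the paper's implementation): instead of enumerating all coverings, keep only the K most probable partial tokenizations at each step:

```python
# Hypothetical beam-search sketch: prune partial tokenizations to the K best
# rather than enumerating every covering. Reuses the toy VOCAB defined above.
import heapq
import math

def beam_coverings(s, vocab, K=8):
    """Return up to K high-probability tokenizations of `s` as (logprob, tokens)."""
    beams = [(0.0, 0, [])]           # (log-probability, chars consumed, tokens)
    finished = []
    while beams:
        candidates = []
        for logp, i, tokens in beams:
            if i == len(s):          # decoded the full string: a complete covering
                finished.append((logp, tokens))
                continue
            for token, p in vocab.items():
                if s.startswith(token, i):
                    candidates.append((logp + math.log(p),
                                       i + len(token),
                                       tokens + [token]))
        beams = heapq.nlargest(K, candidates)   # keep only the K best partials
    return heapq.nlargest(K, finished)

print(beam_coverings("unbelievable", VOCAB, K=8))
```

Because probability mass concentrates on near-canonical tokenizations (the third insight above), pruning to a small beam discards very little mass, which is why K=8 already approximates the exact sum well.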
-----
📊 Results:
→ Achieves 46.3 characters/second processing speed on Llama 3.1 8B
→ Maintains accuracy within 0.00021 excess bits/character
→ Shows an inverse relationship between beam size and approximation error
→ Demonstrates better performance on newer models like Llama compared to GPT-2