A simple token-prepending trick gives LLMs better sentence embeddings without any extra training.
Token Prepending (TP) improves sentence embeddings from LLMs by letting earlier tokens access complete sentence information through a single prepended placeholder token.
-----
https://arxiv.org/abs/2412.11556
🤔 Original Problem:
Decoder-only LLMs use causal attention, so earlier tokens cannot attend to later ones, and pooling over such representations yields biased sentence embeddings (a tiny illustration follows). Existing workarounds such as repeating the input sentence recover backward information but substantially increase inference cost.
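To make the bias concrete, here is a tiny, generic illustration (not tied to the paper's code) of the causal mask in decoder-only attention: token i can only attend to positions 0..i, so early tokens never see what comes after them.

```python
import torch

seq_len = 5
# Lower-triangular causal mask: row i marks the positions token i may attend to.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
print(causal_mask)
# Row 0 is [True, False, False, False, False]: the first token attends only to
# itself, so its representation carries no information about the rest of the
# sentence, which skews any pooling over early-token states.
```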
-----
🔧 Solution in this Paper:
→ The Token Prepending (TP) technique prepends a special <PST> placeholder token before the input sentence
→ At each layer, TP replaces the <PST> token's embedding with the sentence embedding from the previous layer
→ This lets earlier tokens access complete sentence information through ordinary causal attention
→ The replacement stops after the early layers (typically the 7th or 8th) for the best performance
→ Sentence embeddings are pooled from intermediate-layer outputs rather than the final layer, which gives better semantic representations (a code sketch of the whole procedure follows this list)
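A minimal sketch of the idea in PyTorch / Hugging Face Transformers is below. Everything here is an assumption for illustration: the model name, using the BOS token as a stand-in for <PST>, treating the last-token hidden state as the running sentence embedding, and the layer choices; the paper's actual implementation uses its own prompt template and layer settings.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-hf"  # assumption: any decoder-only LM with a Llama-style layout
TP_LAYERS = 8                            # apply the replacement only in the early layers

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.float16).eval()

def tp_pre_hook(module, args, kwargs):
    # args[0]: hidden states entering this decoder layer, shape (batch, seq_len, hidden).
    # Position 0 is the prepended placeholder (BOS stands in for <PST> here);
    # position -1 holds the running sentence representation from the previous layer.
    hidden_states = args[0].clone()
    hidden_states[:, 0, :] = hidden_states[:, -1, :]
    return (hidden_states,) + args[1:], kwargs

# Hook only the first few decoder layers (this module path is Llama-specific).
handles = [
    layer.register_forward_pre_hook(tp_pre_hook, with_kwargs=True)
    for layer in model.model.layers[:TP_LAYERS]
]

sentence = "Token Prepending needs no extra training."
inputs = tokenizer(sentence, return_tensors="pt")  # tokenizer prepends BOS, our stand-in placeholder

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Pool the last-token state from an intermediate layer (exact layer choice is an assumption).
embedding = outputs.hidden_states[-2][:, -1, :]

for h in handles:
    h.remove()
```

Because the placeholder slot is overwritten with the previous layer's sentence-level state, every later token can attend to it under the ordinary causal mask, which is how earlier tokens indirectly pick up whole-sentence information without any retraining.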
-----
💡 Key Insights:
→ Simple token prepending significantly improves sentence embeddings without any training
→ Early layers are crucial for capturing backward dependencies
→ Final-layer embeddings carry less semantic information than intermediate-layer ones (see the layer-comparison sketch after this list)
→ The method works across different LLM architectures and sizes
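As a quick way to probe the intermediate-vs-final-layer observation, the sketch below compares cosine similarities of embeddings pooled from different layers; the model choice, sentence pair, layer indices, and last-token pooling are all illustrative assumptions, not the paper's evaluation setup.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-hf"  # assumption: any decoder-only LM
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.float16).eval()

def embed(sentence: str, layer: int) -> torch.Tensor:
    """Last-token hidden state at a chosen layer (-1 = final; negative indices count back)."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[layer][0, -1, :]

a = "A man is playing a guitar."
b = "Someone plays an acoustic guitar."
for layer in (-1, -6):  # final layer vs. an earlier, intermediate layer
    sim = F.cosine_similarity(embed(a, layer), embed(b, layer), dim=0).item()
    print(f"layer {layer}: cosine similarity = {sim:.3f}")
```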
-----
📊 Results:
→ Improves PromptEOL by 7.16 points on STS tasks
→ Adds only 4% inference overhead compared to the baseline
→ Reaches an average Spearman correlation of 77.19 on STS tasks
→ Consistently outperforms baselines across 7 transfer-learning tasks