HARP improves LLM performance at inference time by adding extra computation only when the model is uncertain, mimicking how humans pause to think during difficult decisions.
https://arxiv.org/abs/2412.07282v1
🤔 Original Problem:
LLMs spend the same amount of computation on every generated token, regardless of difficulty. This wastes resources on easy tokens while potentially underserving complex ones.
💡 Solution in this Paper:
→ HARP modifies the Transformer's forward pass to detect uncertainty during token generation.
→ When uncertainty is high, it applies dropout to the embeddings to get a different perspective on the input.
→ The model combines the logits from the original and reframed passes for better token prediction (see the sketch after this list).
→ This mimics human cognitive processes of hesitation and reframing when faced with difficult decisions.
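Here is a minimal PyTorch sketch of that modified forward pass, assuming a Hugging Face-style model that accepts `inputs_embeds` and returns `.logits`. The function name `harp_step` and the parameters `entropy_threshold` and `dropout_p` are illustrative, and the entropy metric, threshold, and 50/50 logit averaging are assumptions about the details, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def harp_step(model, embeddings, entropy_threshold=1.0, dropout_p=0.1):
    """One HARP-style decoding step: if the next-token distribution is
    uncertain, rerun the forward pass on dropout-perturbed embeddings
    and merge the two sets of logits. (Sketch; details are assumptions.)"""
    # Standard forward pass on the unmodified input embeddings.
    logits = model(inputs_embeds=embeddings).logits[:, -1, :]

    # Shannon entropy of the next-token distribution as the uncertainty signal.
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-9)).sum(dim=-1)

    if entropy.item() > entropy_threshold:  # assumes batch size 1
        # "Reframe": perturb the embeddings with dropout and rerun the pass.
        reframed = F.dropout(embeddings, p=dropout_p, training=True)
        reframed_logits = model(inputs_embeds=reframed).logits[:, -1, :]
        # Combine the original and reframed perspectives (simple average here).
        logits = (logits + reframed_logits) / 2

    return logits
```

Note that the extra forward pass only runs on uncertain tokens, which is why the average cost stays close to a plain forward pass.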
🔍 Key Insights:
→ Token-level uncertainty effectively identifies when additional computation is needed
→ One additional reframing perspective is optimal; more reframings can hurt performance
→ The method requires no training or model modifications, so it drops into an existing decoding loop (usage example after this list)
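To illustrate the training-free point, here is a hypothetical greedy decoding loop that plugs `harp_step` from the sketch above into generation with an off-the-shelf model; no weights are changed, and KV caching is omitted for simplicity:

```python
import torch

@torch.no_grad()
def generate(model, tokenizer, prompt, max_new_tokens=32):
    """Greedy decoding with harp_step swapped in for the plain forward pass."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        # Recompute input embeddings for the full sequence (no KV cache).
        embeddings = model.get_input_embeddings()(ids)
        logits = harp_step(model, embeddings)
        next_id = logits.argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)
        if next_id.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(ids[0], skip_special_tokens=True)
```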
📊 Results:
→ Performance improvements of up to 5.16% across various tasks
→ 2x faster inference compared to beam search
→ Works across different model sizes (3B-8B parameters)