
"HARP: Hesitation-Aware Reframing in Transformer Inference Pass"

The podcast on this paper is generated with Google's Illuminate.

HARP improves LLM performance by adding extra computation only when needed, mimicking how humans pause to think during difficult decisions.

https://arxiv.org/abs/2412.07282v1

🤔 Original Problem:

LLMs spend the same amount of computation on every generated token, regardless of difficulty. This wastes resources on easy tokens while potentially underserving complex ones.

💡 Solution in this Paper:

→ HARP modifies the Transformer's forward pass to detect uncertainty during token generation.

→ When uncertainty is high, it applies dropout to the input embeddings to produce a "reframed" view of the same input.

→ The model then combines the logits from the original and reframed passes for a better token prediction (see the sketch after this list).

→ This mimics human cognitive processes of hesitation and reframing when faced with difficult decisions.
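
Below is a minimal PyTorch sketch of the idea, not the paper's exact implementation: it assumes uncertainty is measured as Shannon entropy of the next-token distribution, reframing is dropout on the input embeddings, and the two passes are merged by a weighted logit sum. `model`, `harp_step`, and the threshold/weight values are illustrative placeholders.

```python
import torch
import torch.nn.functional as F

def harp_step(model, embeds, entropy_threshold=2.0, drop_p=0.1, alpha=0.5):
    """One decoding step: add a second 'reframed' pass only when uncertain.

    Assumes `model` maps embeddings of shape (B, T, D) to logits (B, T, V).
    """
    logits = model(embeds)[:, -1, :]                # next-token logits, (B, V)
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-9)).sum(dim=-1)  # per-sequence entropy

    if entropy.max() > entropy_threshold:           # hesitation detected
        reframed = F.dropout(embeds, p=drop_p, training=True)  # perturbed input view
        logits2 = model(reframed)[:, -1, :]
        logits = alpha * logits + (1 - alpha) * logits2        # merge both perspectives

    return logits
```

Because the extra pass runs only when entropy is high, most tokens still cost a single forward pass, and the overhead stays bounded at one reframing per step.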

🔍 Key Insights:

→ Token-level uncertainty effectively identifies when additional computation is needed

→ One additional reframing perspective is optimal; more reframings can hurt performance

→ The method requires no training or model modifications
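
As a hypothetical usage example building on `harp_step` from the sketch above: because HARP only wraps the forward pass, it drops into an ordinary decoding loop with no retraining. `embed_fn` here is an illustrative stand-in for the model's embedding layer.

```python
import torch

def greedy_decode(model, embed_fn, input_ids, steps=20):
    """Greedy decoding with HARP: at most one extra reframed pass per step."""
    ids = input_ids
    for _ in range(steps):
        logits = harp_step(model, embed_fn(ids))     # hesitation-aware step
        next_id = logits.argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)      # append the chosen token
    return ids
```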

📊 Results:

→ Performance improvements of up to 5.16% across various tasks

→ Up to 2x faster inference compared to beam search

→ Works across different model sizes (3B-8B parameters)
