
"HARP: Hesitation-Aware Reframing in Transformer Inference Pass"

The podcast on this paper is generated with Google's Illuminate.

HARP improves LLM performance by adding extra computation only when needed, mimicking how humans pause to think during difficult decisions.

https://arxiv.org/abs/2412.07282v1

🤔 Original Problem:

LLMs spend the same amount of computation on every generated token, regardless of difficulty. This wastes resources on easy tokens while potentially underserving complex ones.

💡 Solution in this Paper:

→ HARP modifies the Transformer's forward pass to detect uncertainty during token generation.

→ When uncertainty is high, it applies dropout to the input embeddings to produce a "reframed" view of the same input.

→ The model then combines the logits from the original and reframed passes for a better token prediction (see the sketch after this list).

→ This mimics human cognitive processes of hesitation and reframing when faced with difficult decisions.
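
Below is a minimal PyTorch sketch of the idea, not the paper's exact implementation: it assumes uncertainty is measured as Shannon entropy of the next-token distribution, reframing is dropout on the input embeddings, and the two passes are merged by a weighted logit sum. `model`, `harp_step`, and the threshold/weight values are illustrative placeholders.

```python
import torch
import torch.nn.functional as F

def harp_step(model, embeds, entropy_threshold=2.0, drop_p=0.1, alpha=0.5):
    """One decoding step: add a second 'reframed' pass only when uncertain.

    Assumes `model` maps embeddings of shape (B, T, D) to logits (B, T, V).
    """
    logits = model(embeds)[:, -1, :]                # next-token logits, (B, V)
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-9)).sum(dim=-1)  # per-sequence entropy

    if entropy.max() > entropy_threshold:           # hesitation detected
        reframed = F.dropout(embeds, p=drop_p, training=True)  # perturbed input view
        logits2 = model(reframed)[:, -1, :]
        logits = alpha * logits + (1 - alpha) * logits2        # merge both perspectives

    return logits
```

Because the extra pass runs only when entropy is high, most tokens still cost a single forward pass, and the overhead stays bounded at one reframing per step.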

🔍 Key Insights:

→ Token-level uncertainty effectively identifies when additional computation is needed

→ One additional reframing perspective is optimal; more reframings can hurt performance

→ The method requires no training or model modifications
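
As a hypothetical usage example building on `harp_step` from the sketch above: because HARP only wraps the forward pass, it drops into an ordinary decoding loop with no retraining. `embed_fn` here is an illustrative stand-in for the model's embedding layer.

```python
import torch

def greedy_decode(model, embed_fn, input_ids, steps=20):
    """Greedy decoding with HARP: at most one extra reframed pass per step."""
    ids = input_ids
    for _ in range(steps):
        logits = harp_step(model, embed_fn(ids))     # hesitation-aware step
        next_id = logits.argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)      # append the chosen token
    return ids
```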

📊 Results:

→ Performance improvements of up to 5.16% across various tasks

→ Up to 2x faster inference compared to beam search

→ Works across different model sizes (3B-8B parameters)
