"Efficient Prompt Compression with Evaluator Heads for Long-Context Transformer Inference"

The podcast below is generated with Google's Illuminate.

Prompt compression with evaluator heads for leaner, meaner LLM inference.

This paper introduces EHPC, a training-free prompt compression method that uses evaluator attention heads in LLMs to select the important tokens from a long prompt, cutting API costs and accelerating inference.

-----

https://arxiv.org/abs/2501.12959

Original Problem 🤔:

→ LLMs struggle with long contexts: computational cost grows quickly with prompt length, and performance degrades.

→ Existing prompt compression methods often require retraining or numerous LLM calls, increasing complexity and cost.

-----

Solution in this Paper 💡:

→ EHPC leverages "evaluator heads" within LLMs, which effectively identify crucial tokens in long prompts.

→ These evaluator heads are identified through a pilot experiment on synthetic data.

→ During pre-filling, EHPC runs only the first few layers and uses the evaluator heads there to score token importance.

→ Only the highest-scoring tokens are retained for inference, creating a compressed prompt (a minimal sketch of this selection step follows this list).

→ This compressed prompt can be used with commercial LLMs (Extended Model Inference) or locally deployed LLMs (Native Model Inference).

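The selection step can be sketched in a few lines. This is a minimal illustration, not the paper's code: it assumes we already have per-head attention maps for the prompt (e.g. from a forward pass of a locally deployed model with attention outputs enabled), and the function and variable names (compress_prompt, evaluator_heads) are hypothetical.

```python
import torch

def compress_prompt(attentions, evaluator_heads, keep_ratio=0.3):
    """Score prompt tokens with the evaluator heads and keep the top ones.

    attentions: list of per-layer tensors, each (num_heads, seq_len, seq_len)
    evaluator_heads: (layer_idx, head_idx) pairs found in the pilot experiment
    keep_ratio: fraction of prompt tokens to retain
    """
    seq_len = attentions[0].shape[-1]
    scores = torch.zeros(seq_len)
    for layer_idx, head_idx in evaluator_heads:
        attn = attentions[layer_idx][head_idx]        # (seq_len, seq_len)
        # A token matters if later positions attend to it strongly:
        # sum the attention it receives over all query positions.
        scores += attn.sum(dim=0)
    k = max(1, int(seq_len * keep_ratio))
    kept = torch.topk(scores, k).indices.sort().values  # preserve original order
    return kept                                          # indices of retained tokens
```

The retained indices can then be decoded back into a shorter prompt for a commercial API model, or used to keep only those tokens' entries when continuing inference on the local model.
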
-----

Key Insights from this Paper 🗝️:

→ Certain attention heads in LLMs specialize in evaluating token significance, functioning as “evaluator heads” (a sketch of how such heads might be located follows this list).

→ These evaluator heads are task-agnostic, performing effectively across different downstream tasks.

→ Leveraging the pre-filling stage and evaluator heads reduces prompt compression time.

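One way such heads could be located, sketched under the assumption that the pilot experiment plants a known "needle" span in a long synthetic prompt and ranks heads by how much attention they place on it. The function name and the ranking score are illustrative, not taken from the paper.

```python
import torch

def rank_heads_by_needle_attention(attentions, needle_positions):
    """Return (layer, head) pairs sorted by attention mass on the planted needle.

    attentions: list of per-layer tensors, each (num_heads, seq_len, seq_len)
    needle_positions: indices of the tokens that make up the needle
    """
    ranked = []
    for layer_idx, attn in enumerate(attentions):
        # Attention received by the needle tokens, summed over all query positions.
        mass = attn[:, :, needle_positions].sum(dim=(1, 2))  # (num_heads,)
        for head_idx, m in enumerate(mass.tolist()):
            ranked.append(((layer_idx, head_idx), m))
    ranked.sort(key=lambda x: x[1], reverse=True)
    return [pair for pair, _ in ranked]

# e.g. keep the top-scoring heads from the early layers as evaluator heads:
# evaluator_heads = rank_heads_by_needle_attention(attentions, needle_positions)[:8]
```
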
-----

Results 📈:

→ EHPC achieves state-of-the-art results on the LongBench and ZeroSCROLLS prompt compression benchmarks.

→ On certain long-context question-answering datasets, EHPC outperforms existing methods by up to 40%.

→ EHPC delivers inference acceleration competitive with key-value (KV) cache compression methods.
