Prompt compression with evaluator heads for leaner, meaner LLM inference.
This paper introduces a training-free prompt compression method, EHPC, which leverages evaluator attention heads in LLMs to select important tokens for inference, reducing API costs and accelerating inference.
-----
https://arxiv.org/abs/2501.12959
Original Problem 🤔:
→ LLMs struggle with long contexts due to high computational costs and reduced performance.
→ Existing prompt compression methods often require retraining or numerous LLM calls, increasing complexity and cost.
-----
Solution in this Paper 💡:
→ EHPC leverages "evaluator heads": specific attention heads inside LLMs that are particularly effective at identifying the crucial tokens in a long prompt.
→ These evaluator heads are identified offline through a pilot experiment on synthetic data.
→ During pre-filling, EHPC uses the first few layers with evaluator heads to score token importance.
→ Only the highest-scoring tokens are retained, yielding a compressed prompt (a minimal sketch of this selection step follows this list).
→ This compressed prompt can be used with commercial LLMs (Extended Model Inference) or locally deployed LLMs (Native Model Inference).
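To make the workflow concrete, here is a minimal Python sketch of the token-selection step, assuming a Hugging Face causal LM. The model name, the (layer, head) pairs, the scoring rule (attention received from the last few query positions), and the keep ratio are illustrative placeholders, not the paper's exact configuration.

```python
# Hedged sketch of evaluator-head prompt compression (not the paper's exact recipe).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"   # hypothetical model choice
EVALUATOR_HEADS = [(1, 5), (2, 11)]               # (layer, head) pairs found in a pilot experiment (assumed)
KEEP_RATIO = 0.3                                  # fraction of prompt tokens to keep (assumed)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.bfloat16, attn_implementation="eager"
).eval()

def compress_prompt(prompt: str, keep_ratio: float = KEEP_RATIO) -> str:
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        # Pre-fill once with attention maps exposed. (The paper stops after the
        # first few layers to save compute; running the full model keeps this sketch simple.)
        out = model(**inputs, output_attentions=True)

    seq_len = inputs["input_ids"].shape[1]
    scores = torch.zeros(seq_len)
    for layer, head in EVALUATOR_HEADS:
        attn = out.attentions[layer][0, head]            # (seq_len, seq_len) attention map
        # Importance of each token = attention it receives from the last few
        # query positions of the prompt (an assumed scoring rule).
        scores += attn[-8:, :].sum(dim=0).float()

    k = max(1, int(seq_len * keep_ratio))
    keep = torch.topk(scores, k).indices.sort().values   # keep tokens in original order
    kept_ids = inputs["input_ids"][0, keep]
    return tokenizer.decode(kept_ids, skip_special_tokens=True)

# compressed = compress_prompt(long_document + "\n\nQuestion: ...")
# The compressed prompt can then be sent to a commercial API or a local model.
```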
-----
Key Insights from this Paper 🗝️:
→ Certain attention heads in LLMs specialize in evaluating token significance, functioning as “evaluator heads”.
→ These evaluator heads are task-agnostic: heads found with synthetic data remain effective across different downstream tasks.
→ Scoring tokens during the pre-filling stage, using only the first few layers with evaluator heads, keeps prompt compression time low.
-----
Results 📈:
→ EHPC achieves state-of-the-art results among prompt compression methods on the LongBench and ZeroSCROLLS benchmarks.
→ On certain long-context question-answering datasets, EHPC outperforms existing methods by up to 40%.
→ EHPC is competitive with key-value (KV) cache compression methods in terms of inference acceleration.