Prompt compression with evaluator heads for leaner, meaner LLM inference.
This paper introduces a training-free prompt compression method, EHPC, which leverages evaluator attention heads in LLMs to select important tokens for inference, reducing API costs and accelerating inference.
-----
https://arxiv.org/abs/2501.12959
Original Problem:
→ LLMs struggle with long contexts due to high computational costs and reduced performance.
→ Existing prompt compression methods often require retraining or numerous LLM calls, increasing complexity and cost.
-----
Solution in this Paper:
→ EHPC leverages "evaluator heads" within LLMs, which effectively identify crucial tokens in long prompts.
→ These evaluator heads are identified through a pilot experiment on synthetic data.
→ During pre-filling, EHPC runs only the first few layers containing evaluator heads and uses their attention to score token importance.
→ Only the highest-scoring tokens are retained for inference, yielding a compressed prompt (see the sketch below).
→ The compressed prompt can be passed to commercial LLMs (Extended Model Inference) or to locally deployed LLMs (Native Model Inference).
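A minimal sketch of the token-selection step, assuming the (layer, head) indices of the evaluator heads were already found in the pilot experiment. The model name, head indices, keep ratio, and scoring aggregation below are illustrative placeholders, not the paper's exact implementation:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"  # hypothetical backbone choice
EVALUATOR_HEADS = [(2, 5), (3, 11)]              # hypothetical (layer, head) pairs from the pilot experiment
KEEP_RATIO = 0.3                                 # fraction of prompt tokens to retain

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.bfloat16, attn_implementation="eager"
)
model.eval()

def compress_prompt(prompt: str) -> str:
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        # One pre-filling pass; EHPC only needs attention maps from the early
        # layers hosting evaluator heads (a full forward is used here for simplicity).
        out = model(**inputs, output_attentions=True)

    seq_len = inputs["input_ids"].shape[1]
    scores = torch.zeros(seq_len)
    for layer, head in EVALUATOR_HEADS:
        attn = out.attentions[layer][0, head].float()  # [seq, seq] attention map
        # Score each token by the attention it receives across query positions
        # (one simple aggregation; the paper's exact scoring rule may differ).
        scores += attn.sum(dim=0)

    k = max(1, int(seq_len * KEEP_RATIO))
    keep = torch.topk(scores, k).indices.sort().values  # preserve original token order
    kept_ids = inputs["input_ids"][0, keep]
    return tokenizer.decode(kept_ids, skip_special_tokens=True)

long_prompt = "..."  # the long context to compress
compressed = compress_prompt(long_prompt)  # feed `compressed` to the target LLM
```

In practice, EHPC stops the forward pass after the early evaluator-head layers to keep compression overhead low; the compressed prompt is then sent either to a commercial API (Extended Model Inference) or reused by the same local model (Native Model Inference).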
-----
Key Insights from this Paper:
→ Certain attention heads in LLMs specialize in evaluating token significance, functioning as "evaluator heads".
→ These evaluator heads are task-agnostic, performing effectively across different downstream tasks.
→ Leveraging the pre-filling stage and evaluator heads reduces prompt compression time.
-----
Results:
→ EHPC achieves state-of-the-art results on the LongBench and ZeroSCROLLS prompt compression benchmarks.
→ EHPC improves long-context understanding by up to 40% over existing methods on certain question-answering datasets.
→ EHPC is competitive with key-value cache compression methods in terms of inference acceleration.