Speed up your LLM by letting early layers do the heavy lifting: early transformer layers can already identify the key information needed to answer a query.
GemFilter uses early transformer layers to compress 128K tokens to 100, making LLMs blazing fast.
https://arxiv.org/abs/2409.17422
🤖 Original Problem:
LLMs struggle to process long-context inputs efficiently: the prompt (prefill) computation phase scales poorly with input length, driving up GPU memory usage and latency.
-----
🔧 Solution in this Paper:
GemFilter runs LLMs in two passes (minimal sketch after this list):
→ First pass runs only the early layers up to a chosen filter layer (e.g., the 13th–19th, depending on the model) and uses its attention pattern to identify the relevant tokens
→ Second pass processes only the selected tokens (reducing from 128K to ~100 tokens) through the full model
→ Uses a single index set of selected tokens for the whole second pass, making it more interpretable than existing per-layer or per-head cache eviction methods
→ Maintains positional embeddings effectively by recomputing RoPE for the reduced sequence
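Here is a minimal, hedged sketch of the two-pass idea using Hugging Face Transformers. This is not the authors' code: the model name, filter layer index, and top_k are illustrative, and for simplicity pass 1 runs a full forward with output_attentions=True, whereas the paper stops at the filter layer to get its speedup.

```python
# Sketch of GemFilter-style two-pass inference (illustrative, not the official implementation).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"   # any decoder-only LLM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, attn_implementation="eager"
)

prompt = "<long context> ... <question>"
input_ids = tok(prompt, return_tensors="pt").input_ids

filter_layer = 13   # early "filter" layer (model-dependent, 0-indexed here)
top_k = 100         # number of context tokens to keep

# Pass 1: score every context token by how much the final query token
# attends to it at the filter layer (summed over heads).
with torch.no_grad():
    out = model(input_ids, output_attentions=True)
attn = out.attentions[filter_layer]        # (batch, heads, seq, seq)
scores = attn[0, :, -1, :].sum(dim=0)      # attention from the last token, all heads

# Keep the top-k tokens in their original order so the text stays readable.
keep = torch.topk(scores, k=min(top_k, scores.numel())).indices.sort().values
reduced_ids = input_ids[:, keep]

# Pass 2: generate from the compressed prompt with the full model.
# RoPE is effectively recomputed because the reduced sequence gets fresh
# positions 0..k-1.
with torch.no_grad():
    answer = model.generate(reduced_ids, max_new_tokens=64)
print(tok.decode(answer[0], skip_special_tokens=True))
```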
-----
💡 Key Insights:
→ LLMs can identify relevant tokens in early layers before generating answers
→ Attention in early layers already concentrates on the information required to answer the query
→ Token selection based on attention patterns is highly effective for context compression (toy example after this list)
→ Training-free approach works across different LLM architectures
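A toy, hand-made example (the scores are invented, not taken from a real model) of why the single index set is easy to inspect: after top-k selection you can print exactly which tokens survive.

```python
# Toy illustration of attention-based top-k token selection (not the paper's code).
import torch

tokens = ["The", "pass", "code", "is", "42", ".", "What", "is", "the", "code", "?"]
# Stand-in for the filter layer's attention from the final query token:
scores = torch.tensor([0.01, 0.02, 0.20, 0.01, 0.40, 0.01, 0.05, 0.04, 0.05, 0.15, 0.06])

k = 4
keep = torch.topk(scores, k).indices.sort().values   # keep original token order
print([tokens[i] for i in keep])                      # ['code', '42', 'code', '?']
```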
-----
📊 Results:
→ 2.4x speedup compared to state-of-the-art methods
→ 30% reduction in GPU memory usage
→ 1000x input token reduction (128K to 100 tokens) while maintaining performance
→ Outperforms standard attention and SnapKV on Needle in Haystack benchmark