Attention sharing between layers makes LLMs more memory-efficient without losing capabilities.
It's like compression for LLM memory, but applied only to the less important tokens.
This paper introduces a method that optimizes LLM memory usage by sharing attention scores between layers for less important tokens, maintaining performance while shrinking the KV cache footprint.
https://arxiv.org/abs/2412.02252
🤖 Original Problem:
→ Current LLMs with large context windows face significant memory and computational challenges during inference, especially with the KV cache.
→ Existing solutions that discard tokens risk losing important information needed for text generation.
-----
🔍 Key Insights:
→ Proximal tokens (initial + recent) are more important than distant tokens for attention
→ Attention scores between consecutive layers show strong similarity (see the sketch after this list)
→ Less important tokens can share resources instead of being discarded
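A quick way to sanity-check the second insight, as a minimal sketch rather than the paper's code: collect per-layer attention maps and measure how similar consecutive layers are, for example via cosine similarity. The tensor shapes and the similarity metric here are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def consecutive_layer_similarity(attn_scores):
    """Cosine similarity between flattened attention maps of layers (i, i+1)."""
    sims = []
    for a, b in zip(attn_scores[:-1], attn_scores[1:]):
        # Flatten everything except the batch dim, then average over the batch.
        sims.append(F.cosine_similarity(a.flatten(1), b.flatten(1), dim=-1).mean().item())
    return sims  # high values suggest those layers could share scores for distant tokens

# Toy usage with random "attention maps": [num_layers][batch, heads, q_len, k_len]
layers = [torch.rand(1, 8, 16, 16).softmax(-1) for _ in range(4)]
print(consecutive_layer_similarity(layers))
```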
-----
⚡ Solution in this Paper:
→ The method analyzes attention score similarity between layers and groups similar layers together.
→ It identifies proximal tokens as more important and processes them normally.
→ For distant tokens, it shares attention scores across grouped layers to save memory.
→ Uses a parameter-free gating mechanism to integrate attention outputs from proximal and distant tokens (sketched below).
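To make the sharing-plus-gating idea concrete, here is a minimal PyTorch sketch, not the paper's implementation: distant-token attention logits are reused from an earlier layer in the same group, and a parameter-free gate (derived from each branch's softmax normalizer) merges the proximal and distant outputs. The function name, tensor shapes, and proximal/distant split are illustrative assumptions.

```python
import torch

def attend_with_shared_distant_scores(q, k, v, shared_distant_scores, num_proximal):
    # q: [B, H, Lq, D]; k, v: [B, H, Lk, D]
    # shared_distant_scores: attention logits over distant tokens, reused from an
    # earlier layer in the same group: [B, H, Lq, Lk - num_proximal]
    d = q.size(-1)
    k_prox, v_prox = k[:, :, -num_proximal:], v[:, :, -num_proximal:]
    v_dist = v[:, :, :-num_proximal]

    # Proximal tokens: compute this layer's own attention scores as usual.
    prox_scores = (q @ k_prox.transpose(-2, -1)) / d**0.5

    # Distant tokens: reuse the shared scores instead of recomputing them,
    # so their keys need not be kept in this layer's KV cache.
    # Parameter-free gate: each branch's softmax normalizer sets its weight,
    # equivalent to one softmax over the concatenated scores.
    lse_prox = torch.logsumexp(prox_scores, dim=-1, keepdim=True)
    lse_dist = torch.logsumexp(shared_distant_scores, dim=-1, keepdim=True)
    gate = torch.sigmoid(lse_prox - lse_dist)  # weight on the proximal branch

    out_prox = prox_scores.softmax(dim=-1) @ v_prox
    out_dist = shared_distant_scores.softmax(dim=-1) @ v_dist
    return gate * out_prox + (1.0 - gate) * out_dist
```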
-----
📊 Results:
→ Saves 35% of the KV cache without compromising model performance
→ Achieves a 30% increase in maximum batch size across varying input lengths
→ Reduces computational cost by 25% with only a 5% performance drop