Two LLMs working together crack the long context problem through smart compression
📚 https://arxiv.org/abs/2410.19318
🎯 Original Problem:
Continual pre-training on long-context data works, but it demands substantial computational resources and high data-acquisition costs.
-----
🔧 Solution in this Paper:
→ SharedLLM pairs two short-context LLMs: an upper model (decoder) and a lower model (compressor)
→ The lower model splits the input into chunks and compresses each into multi-grained representations organized as a tree
→ Information is transferred only at the lowest layers of the upper model, avoiding redundant processing
→ A specialized tree-style data structure enables efficient encoding and retrieval of contextual information
→ Query-aware dynamic tree construction expands only the nodes relevant to the current query
→ Position-aware cross-attention integrates the compressed information into the upper model (see the sketch after this list)
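
A minimal sketch of the chunk-to-tree compression and query-aware expansion idea described above. This is not the paper's code: the chunk length, tree depth, pooling-based "compression", and dot-product relevance scoring are all illustrative assumptions standing in for the learned components.

```python
import torch
import torch.nn.functional as F

CHUNK_LEN = 512   # tokens per chunk fed to the lower model (assumed value)
LEVELS = 3        # tree depth: coarser summaries toward the root (assumed value)

def compress_chunk(hidden: torch.Tensor) -> list[torch.Tensor]:
    """Build multi-grained representations of one chunk.
    hidden: (chunk_len, d_model) hidden states from the lower model.
    Returns one tensor per tree level, finest first."""
    grains = []
    states = hidden
    for _ in range(LEVELS):
        grains.append(states)
        # Coarsen by average-pooling adjacent states (stand-in for learned compression).
        states = F.avg_pool1d(states.t().unsqueeze(0), kernel_size=2).squeeze(0).t()
    return grains

def expand_relevant(grains: list[torch.Tensor], query: torch.Tensor, top_k: int = 4) -> torch.Tensor:
    """Query-aware dynamic expansion: keep the coarsest summary everywhere, but swap
    the top-k most query-relevant coarse nodes for their finer-grained children."""
    fine, coarse = grains[0], grains[-1]
    scores = coarse @ query                      # relevance of each coarse node to the query
    keep = scores.topk(min(top_k, coarse.size(0))).indices
    children = fine.size(0) // coarse.size(0)    # fine states summarized by each coarse node
    picked = [fine[i * children:(i + 1) * children] for i in keep.tolist()]
    return torch.cat([coarse] + picked, dim=0)   # mixed-granularity context for the upper model

# Example: compress one 512-token chunk, then expand it against a query vector.
chunk_states = torch.randn(CHUNK_LEN, 1024)
query_vec = torch.randn(1024)
context = expand_relevant(compress_chunk(chunk_states), query_vec)
```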
-----
💡 Key Insights:
→ Two LLMs sharing the same architecture can handle long contexts effectively without extra alignment steps
→ Tree-based compression with varying granularity captures relevant information better than flat compression
→ Query-dependent dynamic tree construction significantly improves efficiency
→ Layer-wise information transfer confined to the bottom layers is sufficient for strong performance (sketched after this list)
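
A rough sketch of what "cross-attention at the bottom layers only" could look like. Module names, dimensions, and the number of bottom layers are assumptions; the causal mask for self-attention is omitted for brevity, and the additive position bias is only a crude stand-in for the paper's position-aware cross-attention.

```python
import torch
import torch.nn as nn

class BottomLayerWithCrossAttn(nn.Module):
    """Upper-model decoder layer that also cross-attends to compressed context states."""
    def __init__(self, d_model: int = 1024, n_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, ctx, ctx_pos_bias=None):
        # x:   (batch, seq, d_model) states of the current window in the upper model
        # ctx: (batch, n_ctx, d_model) compressed chunk states from the lower model
        h, _ = self.self_attn(x, x, x, need_weights=False)
        x = self.norm1(x + h)
        # ctx_pos_bias: optional (seq, n_ctx) additive bias encoding chunk order,
        # approximating position awareness in the cross-attention scores.
        h, _ = self.cross_attn(x, ctx, ctx, attn_mask=ctx_pos_bias, need_weights=False)
        x = self.norm2(x + h)
        return x

# Only the first few ("bottom") layers receive the compressed context;
# the remaining layers stay plain decoder layers.
N_BOTTOM = 2   # assumed; the post only says transfer happens at the lowest layers
bottom_layers = nn.ModuleList(BottomLayerWithCrossAttn() for _ in range(N_BOTTOM))
```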
-----
📊 Results:
→ Runs about 2x faster than streaming approaches and 3x faster than encoder-decoder architectures
→ Outperforms baselines by 3-10% on language modeling tasks
→ Processes sequences up to 128K tokens despite being trained only on 8K-token sequences
→ Delivers superior scores on long-context tasks such as Math.Find (13.58 vs 11.14) and En MC (33.65 vs 31.44)