"Two are better than one: Context window extension with multi-grained self-injection"

The podcast for this paper was generated with Google's Illuminate.

Two LLMs working together crack the long-context problem through smart compression

📚 https://arxiv.org/abs/2410.19318

🎯 Original Problem:

While continual pre-training on long-context data can extend the context window, it demands substantial computational resources and data acquisition costs.

-----

🔧 Solution in this Paper:

→ SharedLLM pairs two short-context LLMs: an upper model (decoder) and a lower model (compressor)

→ The lower model splits the input into chunks and compresses each chunk into multi-grained representations organized in a tree structure

→ Information transfer from the lower to the upper model happens only at the lowest layers, avoiding redundant processing

→ Uses a specialized tree-style data structure for efficient encoding and retrieval of contextual information

→ Implements query-aware dynamic tree construction that expands only query-relevant nodes (see the sketch after this list)

→ Employs position-aware cross-attention to integrate the compressed information into the upper model
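
To make the compression side concrete, here is a minimal Python sketch of the lower model's role under stated assumptions: split a long input into chunks, summarize each, group summaries into coarser tree levels, and expand only query-relevant nodes. `ContextNode`, `embed`, `build_tree`, and `expand` are illustrative names, and the toy bag-of-words vectors stand in for the compressed key-value states a real LLM compressor would produce.

```python
# Illustrative sketch only: toy stand-ins for SharedLLM's chunking,
# multi-grained tree compression, and query-aware expansion.
from dataclasses import dataclass, field
from typing import List
import math

@dataclass
class ContextNode:
    tokens: List[str]                       # input span covered by this node
    summary: List[float]                    # compressed representation of the span
    children: List["ContextNode"] = field(default_factory=list)

def embed(tokens: List[str]) -> List[float]:
    # Toy encoder: hashed bag-of-words, normalized. A real compressor would
    # produce key-value states from a short-context LLM instead.
    vec = [0.0] * 16
    for t in tokens:
        vec[hash(t) % 16] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def build_tree(tokens: List[str], chunk: int = 64, fanout: int = 4) -> ContextNode:
    # Leaves: fixed-size chunks of the long input, each compressed once.
    nodes = [ContextNode(tokens[i:i + chunk], embed(tokens[i:i + chunk]))
             for i in range(0, len(tokens), chunk)]
    # Coarser levels: group children so each level is a coarser-grained summary.
    while len(nodes) > 1:
        groups = [nodes[i:i + fanout] for i in range(0, len(nodes), fanout)]
        nodes = [ContextNode(sum((c.tokens for c in g), []),
                             embed(sum((c.tokens for c in g), [])),
                             children=g) for g in groups]
    return nodes[0]

def relevance(query: List[float], node: ContextNode) -> float:
    return sum(q * s for q, s in zip(query, node.summary))

def expand(root: ContextNode, query: List[float], budget: int = 4) -> List[ContextNode]:
    # Query-aware dynamic expansion: descend best-first into relevant nodes,
    # keeping fine-grained leaves only where the query needs them.
    selected, frontier = [], [root]
    while frontier and len(selected) < budget:
        node = max(frontier, key=lambda n: relevance(query, n))
        frontier.remove(node)
        if node.children:
            frontier.extend(node.children)
        else:
            selected.append(node)
    return selected

# Usage: compress a long document, then keep only chunks relevant to a query.
doc = ("the quick brown fox jumps over the lazy dog " * 100).split()
tree = build_tree(doc, chunk=32)
relevant_chunks = expand(tree, embed("lazy dog".split()), budget=3)
```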

-----

💡 Key Insights:

→ Two LLMs sharing the same architecture can handle long contexts effectively without extra alignment steps

→ Tree-based compression with varying granularity better captures relevant information

→ Query-dependent dynamic tree construction significantly improves efficiency

→ Layer-wise information transfer at only the bottom layers is sufficient for good performance (a minimal sketch follows this list)
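
A small PyTorch sketch of the bottom-layer injection idea (assumed module names and sizes, not the paper's code): only the lowest `k_inject` layers of the upper model cross-attend to the lower model's compressed states, while the remaining layers run as a plain decoder; the paper's position-aware attention and KV-state plumbing are omitted for brevity.

```python
# Illustrative sketch only: compressed-context injection at the bottom layers.
import torch
import torch.nn as nn

class InjectedLayer(nn.Module):
    """Decoder layer that optionally cross-attends to compressed context."""
    def __init__(self, d_model: int, n_heads: int, inject: bool):
        super().__init__()
        self.inject = inject
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True) if inject else None
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # Causal self-attention over the local (short) window.
        causal = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        h, _ = self.self_attn(x, x, x, attn_mask=causal)
        x = self.norm1(x + h)
        # Cross-attention to the compressed states, only in injected layers.
        if self.inject:
            h, _ = self.cross_attn(x, context, context)
            x = self.norm2(x + h)
        return self.norm3(x + self.ffn(x))

class UpperModel(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 4,
                 n_layers: int = 8, k_inject: int = 2):
        super().__init__()
        # Only the lowest k_inject layers receive the compressed context.
        self.layers = nn.ModuleList(
            InjectedLayer(d_model, n_heads, inject=(i < k_inject))
            for i in range(n_layers))

    def forward(self, x: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            x = layer(x, context)
        return x

# Usage: an 8-layer upper model with injection in its bottom 2 layers.
local_states = torch.randn(1, 128, 256)   # hidden states for the local window
compressed = torch.randn(1, 32, 256)      # multi-grained states from the lower model
out = UpperModel()(local_states, compressed)   # -> (1, 128, 256)
```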

-----

📊 Results:

→ Runs about 2x faster than streaming approaches and 3x faster than encoder-decoder architectures

→ Outperforms baselines by 3-10% on language modeling tasks

→ Processes sequences of up to 128K tokens despite being trained only on 8K-token sequences

→ Demonstrates superior performance on long-context tasks like Math.Find (13.58 vs 11.14) and En MC (33.65 vs 31.44)
