Star Attention makes LLMs process long texts faster by splitting work across multiple computers
Think of Star Attention as a team of computers sharing the work to read long documents.
The Star Attention paper introduces a two-phase approach for efficient LLM inference on long sequences. It divides context processing across multiple hosts using a block-sparse attention approximation, cutting computational cost while maintaining accuracy, and scales to sequences of up to 1M tokens with minimal communication overhead.
-----
https://arxiv.org/abs/2411.17116
🤔 Original Problem:
LLM inference on long sequences is slow and memory-hungry because self-attention scales quadratically with sequence length.
-----
🔧 Solution in this Paper:
→ Star Attention splits processing into two phases: context encoding and query processing
→ In phase one, the context is divided into blocks distributed across hosts, with each block prefixed by an "anchor block" (a copy of the first context block)
→ Each host processes its block independently using local attention
→ In phase two, query tokens use global attention to access all cached tokens
→ A designated query host aggregates each host's partial attention results, keeping communication overhead minimal (see the sketch below)
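To make the two phases concrete, here is a minimal single-head NumPy sketch that simulates hosts as list entries. The function names, shapes, and the use of raw embeddings as stand-in keys/values are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def local_attention(q, K, V):
    """Attention of a query over one host's cache; also return the log-sum-exp
    of its scores so the query host can merge per-host results exactly."""
    s = q @ K.T / np.sqrt(q.shape[-1])            # (1, n_keys) scores
    m = s.max(axis=-1, keepdims=True)
    w = np.exp(s - m)
    z = w.sum(axis=-1, keepdims=True)
    return (w / z) @ V, m + np.log(z)             # normalized output, log-sum-exp

d, block = 64, 256
context = np.random.randn(4 * block, d).astype(np.float32)   # toy context embeddings
query = np.random.randn(1, d).astype(np.float32)              # single query token

# Phase 1 (context encoding): split the context into blocks, one per host.
# Every block after the first is prefixed with the anchor block (block 0) for
# its local self-attention; only the block's own K/V stay in the host's cache.
anchor = context[:block]
kv_cache = []
for start in range(0, len(context), block):
    blk = context[start:start + block]
    local_input = blk if start == 0 else np.concatenate([anchor, blk])
    # ... a real host would run blockwise self-attention over `local_input`
    # here and cache only `blk`'s keys/values; this sketch uses the raw block
    # embeddings as stand-in K/V.
    kv_cache.append((blk, blk))

# Phase 2 (query encoding): each host attends the query to its own cache and
# returns (output, log-sum-exp); the query host combines them with a softmax
# over the per-host log-sum-exps, which equals attention over all cached tokens.
outs, lses = zip(*(local_attention(query, K, V) for K, V in kv_cache))
lse = np.concatenate(lses, axis=-1)                          # (1, n_hosts)
host_w = np.exp(lse - lse.max(axis=-1, keepdims=True))
host_w /= host_w.sum(axis=-1, keepdims=True)
global_out = sum(host_w[0, h] * outs[h] for h in range(len(outs)))
```

Note that only the per-host outputs and their log-sum-exp scalars travel to the query host, which is why the communication cost stays small even as the context grows.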
-----
💡 Key Insights:
→ Prefixing each block with the anchor block prevents the distorted attention patterns that arise from purely blockwise context encoding
→ Setting the block size to roughly one-quarter of the sequence length gives the best accuracy-speed trade-off (see the configuration sketch after this list)
→ Larger models achieve greater speedups with Star Attention
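The block-size insight can be read as a simple configuration rule. The helper below is a hypothetical illustration: the function name, the returned fields, and the choice of making the anchor block the same size as a context block are assumptions for the sketch, not values stated in this post.

```python
def star_attention_config(seq_len: int, num_hosts: int) -> dict:
    """Sketch of a block-size heuristic: ~1/4 of the sequence length per block."""
    block_size = seq_len // 4
    return {
        "block_size": block_size,
        "anchor_size": block_size,   # assumed: anchor block sized like a context block
        "blocks_per_host": max(1, (seq_len // block_size) // num_hosts),
    }

# Example: a 128k-token context spread over 4 hosts.
print(star_attention_config(seq_len=128_000, num_hosts=4))
```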
-----
📊 Results:
→ Achieves up to 11x faster inference while maintaining 95-100% accuracy
→ Inference speed scales roughly linearly with the number of hosts
→ Compatible with existing optimization methods like Flash Attention