SIGMA, proposed in this paper, improves LLM inference efficiency, especially in long contexts, by differentially optimizing query, key, and value components in attention.
Paper - https://arxiv.org/abs/2501.13629
Original Problem: 🤔
→ LLMs, while powerful, suffer from slow and memory-intensive inference, especially over long text sequences, due to the quadratic complexity of attention and the growing size of the key-value (KV) cache.
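To make the memory pressure concrete, here is a rough back-of-envelope KV-cache estimate. All model sizes below are illustrative assumptions, not SIGMA's configuration:

```python
# Back-of-envelope KV-cache size for an assumed (illustrative) model in fp16.
n_layers, n_kv_heads, head_dim = 32, 32, 128   # hypothetical model configuration
seq_len, batch, bytes_per_elem = 32_768, 1, 2  # 32k-token context, fp16

# Each token stores one K and one V vector per layer and per KV head.
bytes_per_token = 2 * n_kv_heads * head_dim * bytes_per_elem
cache_gb = n_layers * seq_len * batch * bytes_per_token / 1e9
print(f"KV cache ≈ {cache_gb:.1f} GB")  # ≈ 17.2 GB for a single 32k-token sequence
```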
Solution in this Paper: 💡
→ This paper introduces SIGMA, an efficient LLM specialized for the system domain, built on a novel DiffQKV attention mechanism that optimizes the query (Q), key (K), and value (V) components differentially.
→ Differentially compressed KV: K is compressed more aggressively than V, based on the observation that performance is less sensitive to changes in K.
→ Augmented Q: the Q head dimension is expanded to offset potential performance loss from KV compression.
→ Selective V cache fetching (optional): only the V vectors with the highest attention scores are loaded.
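As a rough illustration, here is a minimal PyTorch sketch of differentially compressed K/V combined with an augmented Q head dimension. All head counts and dimensions (`n_q_heads`, `n_k_heads`, `n_v_heads`, `d_qk`, `d_v`) are assumptions for illustration, not the paper's configuration; selective V fetching is sketched separately further below.

```python
import torch
import torch.nn.functional as F

# Illustrative sizes only (not SIGMA's actual configuration).
d_model   = 2048
n_q_heads = 32    # many query heads
n_k_heads = 2     # K compressed aggressively: very few key heads to cache
n_v_heads = 8     # V compressed less: more value heads than key heads
d_qk      = 96    # "augmented" Q/K head dim, larger than d_model // n_q_heads = 64
d_v       = 64    # value head dim

batch, seq_len = 1, 128
x = torch.randn(batch, seq_len, d_model)

# Separate projections let Q, K, and V use different head counts and head dims.
w_q = torch.nn.Linear(d_model, n_q_heads * d_qk, bias=False)
w_k = torch.nn.Linear(d_model, n_k_heads * d_qk, bias=False)
w_v = torch.nn.Linear(d_model, n_v_heads * d_v,  bias=False)
w_o = torch.nn.Linear(n_q_heads * d_v, d_model,  bias=False)

q = w_q(x).view(batch, seq_len, n_q_heads, d_qk).transpose(1, 2)  # (B, Hq, T, d_qk)
k = w_k(x).view(batch, seq_len, n_k_heads, d_qk).transpose(1, 2)  # (B, Hk, T, d_qk)
v = w_v(x).view(batch, seq_len, n_v_heads, d_v ).transpose(1, 2)  # (B, Hv, T, d_v)

# Only k and v are cached during generation; fewer K heads -> a much smaller K cache.
# At compute time each K/V head is shared across several query heads, as in GQA.
k = k.repeat_interleave(n_q_heads // n_k_heads, dim=1)  # (B, Hq, T, d_qk)
v = v.repeat_interleave(n_q_heads // n_v_heads, dim=1)  # (B, Hq, T, d_v)

scores = (q @ k.transpose(-2, -1)) / d_qk ** 0.5        # (B, Hq, T, T)
causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(causal, float("-inf"))
out = F.softmax(scores, dim=-1) @ v                     # (B, Hq, T, d_v)
out = w_o(out.transpose(1, 2).reshape(batch, seq_len, n_q_heads * d_v))
```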
Key Insights from this Paper: 💡
→ Model performance is more sensitive to changes in V than K. This allows for more aggressive K compression, saving memory and bandwidth.
→ Expanding Q head dimension improves performance with minimal impact on inference speed, offsetting performance drops from KV compression.
→ Selective V fetching further reduces memory usage and transfer times without significant performance loss (see the sketch after this list).
→ DiffQKV attention enhances efficiency, improving inference speed over Grouped Query Attention by up to 33.36% in long contexts.
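Below is a functional sketch of selective V fetching for one single-head decoding step. The real saving comes from only transferring the selected V rows from the cache; here `top_k` is a hypothetical parameter and the selection is simulated with an ordinary gather:

```python
import torch
import torch.nn.functional as F

def selective_v_attention(q, k_cache, v_cache, top_k=128):
    """One decoding step, single head: score against the full K cache,
    then fetch only the top_k highest-scoring V vectors (illustrative)."""
    # q: (d,), k_cache: (T, d), v_cache: (T, d_v)
    scores = (k_cache @ q) / q.shape[-1] ** 0.5   # (T,) attention logits
    top_k = min(top_k, scores.shape[0])
    top_scores, idx = scores.topk(top_k)          # positions worth fetching
    v_sel = v_cache[idx]                          # only these V rows are loaded
    probs = F.softmax(top_scores, dim=-1)         # renormalize over the kept keys
    return probs @ v_sel                          # (d_v,) attention output

# Toy usage with made-up sizes.
T, d, d_v = 4096, 64, 64
out = selective_v_attention(torch.randn(d), torch.randn(T, d), torch.randn(T, d_v))
```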
Results: 💯
→ On AIMICIUS, a system domain benchmark, SIGMA significantly outperforms GPT-4 (up to 52.5% absolute improvement).
→ Achieves comparable performance to state-of-the-art models in general domains.
→ Inference speed improvement up to 33.36% over Grouped Query Attention in long-context scenarios.