Memory problems with LLMs? BitStack lets you choose: small, medium, or large.
Dynamic weight stacking enables LLMs to adapt to available device memory without recompression.
Each decomposition iteration produces a residual block of roughly 1 bit per parameter.
For 7/8B models, BitStack consistently outperforms GPTQ below the 4-bit level and AWQ below the 3-bit level.
📚 https://arxiv.org/abs/2410.23918
🎯 Original Problem:
Current LLM compression methods, such as quantization, require predefined ratios and separate compression processes for each setting, making dynamic memory adjustment impossible.
-----
🔧 Solution in this Paper:
→ BitStack decomposes weight matrices iteratively while considering parameter significance
→ Creates approximately 1-bit per parameter residual blocks in each decomposition iteration
→ Blocks are sorted and stacked in storage based on importance to model performance
→ Enables dynamic loading/offloading between RAM and storage based on available memory
→ Uses activation-aware decomposition that accounts for unequal importance of weights
→ Implements iterative absolute value decomposition, separating the sign matrix from the magnitude at each step (see the sketch after this list)
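To make the decomposition concrete, here is a minimal PyTorch sketch of activation-aware iterative absolute value decomposition. The function names, the rank-1 magnitude approximation, and the fixed iteration count are illustrative assumptions, not the paper's exact implementation.

```python
import torch

def bitstack_decompose(W, act_scale, num_iters=16, rank=1):
    # Hedged sketch: decompose one weight matrix into a stack of residual blocks.
    # Each block stores a sign matrix (~1 bit/param) plus tiny low-rank factors
    # of the magnitude, so every iteration adds roughly 1 bit per parameter.
    blocks = []
    residual = W * act_scale                 # activation-aware scaling (AWQ-style importance)
    for _ in range(num_iters):
        sign = torch.sign(residual)          # 1-bit-per-parameter sign matrix
        U, S, Vh = torch.linalg.svd(residual.abs(), full_matrices=False)
        u = U[:, :rank] * S[:rank]           # low-rank factors of the magnitude
        v = Vh[:rank, :]
        approx = sign * (u @ v)              # this block's contribution to the weight
        blocks.append((sign.to(torch.int8), u, v))
        residual = residual - approx         # the next block fits what is left over
    return blocks

def reconstruct(blocks, act_scale, k):
    # Load only the k most significant blocks and undo the activation scaling.
    W_hat = sum(sign.float() * (u @ v) for sign, u, v in blocks[:k])
    return W_hat / act_scale
```

Loading more blocks recovers more of the original weight; dropping trailing blocks frees memory without recompressing anything.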
-----
💡 Key Insights:
→ Memory constraints are now the primary bottleneck for LLM deployment, not capability
→ Dynamic memory adjustment is crucial for real-world LLM applications
→ Weight decomposition can match quantization performance with proper design
→ Universal sorting of residual blocks across all layers makes the best use of whatever memory is available (a selection sketch follows this list)
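A minimal sketch of how a universal sort could turn into a memory-budgeted load order. The Block fields, the importance scores, and the greedy selection rule are assumptions for illustration; the only constraint kept from the method is that blocks of a given matrix must be loaded in decomposition order.

```python
from dataclasses import dataclass

@dataclass
class Block:
    matrix: str        # which weight matrix this residual block refines
    nbytes: int        # storage cost (~1 bit/param plus small factors)
    importance: float  # e.g. calibrated quality drop when this block is omitted

def select_blocks(stacks, memory_budget_bytes):
    # stacks: {matrix_name: [Block, ...]} in decomposition order; blocks of a
    # matrix must be loaded in order because each refines the previous residual.
    loaded, used = [], 0
    cursor = {name: 0 for name in stacks}
    while True:
        # Among the next unloaded block of every matrix, take the most important one.
        candidates = [(name, stack[cursor[name]])
                      for name, stack in stacks.items()
                      if cursor[name] < len(stack)]
        if not candidates:
            break
        name, block = max(candidates, key=lambda c: c[1].importance)
        if used + block.nbytes > memory_budget_bytes:
            break                            # budget spent; stop loading
        loaded.append(block)
        used += block.nbytes
        cursor[name] += 1
    return loaded
```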
-----
📊 Results:
→ For 7/8B models, outperforms 2-bit baselines by margins of:
- 12.1 (Llama 3.1)
- 22.3 (Llama 2)
- 10.4 (Llama 3)
→ For Llama 3.1 70B, retains 89% of original FP16 performance, beating the best baseline by 41.3 on zero-shot tasks