Memory problems with LLMs? BitStack lets you choose: small, medium, or large.
Dynamic weight stacking enables LLMs to adapt to available device memory without recompression.
Each decomposition iteration produces a residual block of roughly 1 bit per parameter.
For 7/8B models, BitStack consistently outperforms GPTQ below the 4-bit level and AWQ below the 3-bit level.
📚 https://arxiv.org/abs/2410.23918
🎯 Original Problem:
Current LLM compression methods, such as quantization, require predefined ratios and separate compression processes for each setting, making dynamic memory adjustment impossible.
-----
🔧 Solution in this Paper:
→ BitStack decomposes weight matrices iteratively while considering parameter significance
→ Creates approximately 1-bit per parameter residual blocks in each decomposition iteration
→ Blocks are sorted and stacked in storage based on importance to model performance
→ Enables dynamic loading/offloading between RAM and storage based on available memory
→ Uses activation-aware decomposition that accounts for unequal importance of weights
→ Implements iterative absolute value decomposition, separating the sign matrix from the magnitude at each step (see the sketch after this list)
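To make the decomposition concrete, here is a minimal PyTorch sketch of activation-aware iterative absolute value decomposition. The function names, the rank-1 magnitude approximation, and the fixed iteration count are illustrative assumptions, not the paper's exact implementation.

```python
import torch

def bitstack_decompose(W, act_scale, num_iters=16, rank=1):
    # Hedged sketch: decompose one weight matrix into a stack of residual blocks.
    # Each block stores a sign matrix (~1 bit/param) plus tiny low-rank factors
    # of the magnitude, so every iteration adds roughly 1 bit per parameter.
    blocks = []
    residual = W * act_scale                 # activation-aware scaling (AWQ-style importance)
    for _ in range(num_iters):
        sign = torch.sign(residual)          # 1-bit-per-parameter sign matrix
        U, S, Vh = torch.linalg.svd(residual.abs(), full_matrices=False)
        u = U[:, :rank] * S[:rank]           # low-rank factors of the magnitude
        v = Vh[:rank, :]
        approx = sign * (u @ v)              # this block's contribution to the weight
        blocks.append((sign.to(torch.int8), u, v))
        residual = residual - approx         # the next block fits what is left over
    return blocks

def reconstruct(blocks, act_scale, k):
    # Load only the k most significant blocks and undo the activation scaling.
    W_hat = sum(sign.float() * (u @ v) for sign, u, v in blocks[:k])
    return W_hat / act_scale
```

Loading more blocks recovers more of the original weight; dropping trailing blocks frees memory without recompressing anything.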
-----
💡 Key Insights:
→ Memory constraints are now the primary bottleneck for LLM deployment, not capability
→ Dynamic memory adjustment is crucial for real-world LLM applications
→ Weight decomposition can match quantization performance with proper design
→ Universal sorting of residual blocks across all layers makes the best use of whatever memory is available (a selection sketch follows this list)
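A minimal sketch of how a universal sort could turn into a memory-budgeted load order. The Block fields, the importance scores, and the greedy selection rule are assumptions for illustration; the only constraint kept from the method is that blocks of a given matrix must be loaded in decomposition order.

```python
from dataclasses import dataclass

@dataclass
class Block:
    matrix: str        # which weight matrix this residual block refines
    nbytes: int        # storage cost (~1 bit/param plus small factors)
    importance: float  # e.g. calibrated quality drop when this block is omitted

def select_blocks(stacks, memory_budget_bytes):
    # stacks: {matrix_name: [Block, ...]} in decomposition order; blocks of a
    # matrix must be loaded in order because each refines the previous residual.
    loaded, used = [], 0
    cursor = {name: 0 for name in stacks}
    while True:
        # Among the next unloaded block of every matrix, take the most important one.
        candidates = [(name, stack[cursor[name]])
                      for name, stack in stacks.items()
                      if cursor[name] < len(stack)]
        if not candidates:
            break
        name, block = max(candidates, key=lambda c: c[1].importance)
        if used + block.nbytes > memory_budget_bytes:
            break                            # budget spent; stop loading
        loaded.append(block)
        used += block.nbytes
        cursor[name] += 1
    return loaded
```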
-----
📊 Results:
→ For 7/8B models, outperforms 2-bit baselines by margins of:
- 12.1 (Llama 3.1)
- 22.3 (Llama 2)
- 10.4 (Llama 3)
→ For Llama 3.1 70B, retains 89% of original FP16 performance, beating the best baseline by 41.3 on zero-shot tasks