CURing makes LLMs smaller by keeping only the important parts of weight matrices
CURing is a fast matrix-decomposition method that shrinks LLMs by keeping only the important rows and columns of their weight matrices while maintaining performance.
-----
https://arxiv.org/abs/2501.04211
🔧 Technique in this Paper:
→ CURing decomposes each weight matrix into a product of selected columns (C), selected rows (R), and a small linking matrix (U); see the sketch after this list
→ It identifies important weights with the Wanda metric, which combines weight magnitudes with input activation norms
→ The method preserves original matrix characteristics by retaining actual rows and columns
→ Built-in healing: fine-tuning only the small U matrix recovers accuracy without extensive retraining
→ Updates are thereby constrained to the subspaces spanned by the retained C and R matrices
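
Here is a minimal PyTorch sketch of the idea, assuming a captured calibration batch X; the helper names (wanda_scores, cur_decompose) and the exact column/row scoring rule are illustrative assumptions, not the paper's code.

```python
import torch

def wanda_scores(W: torch.Tensor, X: torch.Tensor) -> torch.Tensor:
    # Wanda importance: |W_ij| * ||X_j||_2, where X holds calibration
    # activations of shape (tokens, in_features).
    act_norm = X.norm(dim=0)               # L2 norm of each input feature
    return W.abs() * act_norm              # broadcasts across output rows

def cur_decompose(W: torch.Tensor, X: torch.Tensor, k_cols: int, k_rows: int):
    # Keep the highest-scoring actual columns and rows of W, then solve a
    # least-squares problem for the small linking matrix U.
    S = wanda_scores(W, X)
    col_idx = S.sum(dim=0).topk(k_cols).indices    # important input features
    row_idx = S.sum(dim=1).topk(k_rows).indices    # important output features
    C = W[:, col_idx]                              # (out_features, k_cols)
    R = W[row_idx, :]                              # (k_rows, in_features)
    U = torch.linalg.pinv(C) @ W @ torch.linalg.pinv(R)
    return C, U, R

# Toy check of the mechanics on a random matrix.
W = torch.randn(1024, 1024)
X = torch.randn(512, 1024)                 # stand-in calibration activations
C, U, R = cur_decompose(W, X, k_cols=256, k_rows=256)
print((W - C @ U @ R).norm() / W.norm())   # relative reconstruction error
```

In a real model, each targeted linear layer's weight would be replaced by the C @ U @ R product; for a square n-by-n matrix this saves parameters roughly when the kept rank is below 0.41n, since 2nk + k^2 < n^2 requires k/n < sqrt(2) - 1.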
-----
💡 Key Insights:
→ Layers causing minimal output changes can be effectively compressed
→ Combining activation patterns with weight magnitudes improves selection
→ Constraining updates to the original subspaces prevents catastrophic forgetting (sketched after this list)
→ Knowledge distillation on the C4 dataset provides task-agnostic healing
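
A rough sketch of how this constrained healing could be wired up, assuming the C/U/R factors from the sketch above; CURLinear, heal, and the plain MSE distillation loss are illustrative assumptions, and the paper's actual distillation objective may differ.

```python
import torch
import torch.nn as nn

class CURLinear(nn.Module):
    # Linear layer stored as C @ U @ R; C and R are frozen buffers, so
    # gradient updates can only move U, i.e. they stay inside the
    # subspaces spanned by the selected columns and rows.
    def __init__(self, C: torch.Tensor, U: torch.Tensor, R: torch.Tensor):
        super().__init__()
        self.register_buffer("C", C)       # frozen selected columns
        self.register_buffer("R", R)       # frozen selected rows
        self.U = nn.Parameter(U.clone())   # small trainable linking matrix

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = x @ W^T with W ~= C @ U @ R, applied factor by factor.
        return x @ self.R.T @ self.U.T @ self.C.T

def heal(student: nn.Module, teacher: nn.Module, calib_batches,
         steps: int = 100, lr: float = 1e-4):
    # Match the compressed model's outputs to the original model's on a
    # calibration stream (the paper uses C4); only the U matrices train.
    trainable = [p for p in student.parameters() if p.requires_grad]
    opt = torch.optim.AdamW(trainable, lr=lr)
    loss_fn = nn.MSELoss()
    for _, x in zip(range(steps), calib_batches):
        with torch.no_grad():
            target = teacher(x)            # teacher outputs as soft targets
        loss = loss_fn(student(x), target)
        opt.zero_grad()
        loss.backward()
        opt.step()
```

Because C and R are registered as buffers rather than parameters, the optimizer never sees them: every gradient step moves only U, so the healed weights remain inside the column and row subspaces chosen at compression time.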
-----
📊 Results:
→ Reduces Llama3.1-8B by 9% (to 7.32B parameters) in just 129 seconds
→ 20x faster than prior methods like SliceGPT
→ Maintains or improves performance on C4, WikiText2, BoolQ, and MMLU
→ Healing converges in roughly 100 fine-tuning steps