"CURing Large Models: Compression via CUR Decomposition"

The podcast below on this paper was generated with Google's Illuminate.

CURing makes LLMs smaller by keeping only the important parts of weight matrices

CURing introduces a fast matrix decomposition technique that shrinks LLMs by keeping only the important rows and columns of their weight matrices while maintaining performance.

-----

https://arxiv.org/abs/2501.04211

🔧 Technique in this Paper:

→ CURing decomposes each weight matrix into a product of selected columns (C), selected rows (R), and a small linking matrix (U); see the sketch after this list

→ It identifies important weights with WANDA-style scores that combine weight magnitudes with activation patterns

→ The method preserves original matrix characteristics by retaining actual rows and columns

→ Fine-tuning only the U matrix gives a built-in healing step that avoids extensive retraining

→ Updates are constrained to beneficial subspaces defined by C and R matrices
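
Roughly, one compression step can be sketched as below (a minimal illustration, not the paper's released code): it assumes WANDA-style importance scores |W_ij| · ||X_j||_2 computed from a calibration tensor X, and a pseudoinverse-based linking matrix; `cur_compress` and `keep_ratio` are hypothetical names.

```python
import torch

def cur_compress(W: torch.Tensor, X: torch.Tensor, keep_ratio: float = 0.25):
    """Approximate W (out_dim x in_dim) as C @ U @ R.

    W : weight matrix of a linear layer
    X : calibration activations (n_samples x in_dim) feeding the layer
    """
    # WANDA-style score: weight magnitude scaled by input activation norm.
    score = W.abs() * X.norm(dim=0)              # (out_dim, in_dim)

    # Keep the highest-scoring actual columns and rows of W.
    n_cols = max(1, int(keep_ratio * W.shape[1]))
    n_rows = max(1, int(keep_ratio * W.shape[0]))
    col_idx = score.sum(dim=0).topk(n_cols).indices
    row_idx = score.sum(dim=1).topk(n_rows).indices
    C = W[:, col_idx]                            # out_dim x n_cols
    R = W[row_idx, :]                            # n_rows x in_dim

    # Linking matrix that best reconstructs W: U = C^+ @ W @ R^+.
    U = torch.linalg.pinv(C) @ W @ torch.linalg.pinv(R)
    return C, U, R                               # W ≈ C @ U @ R
```

Note that C, U, and R together hold fewer parameters than W only when the kept fraction is small enough: for a square d×d matrix, 2kd + k² < d² requires k/d below roughly 0.41.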

-----

💡 Key Insights:

→ Layers causing minimal output changes can be compressed effectively (see the first sketch after this list)

→ Combining activation patterns with weight magnitudes improves selection

→ Constraining updates to original subspaces prevents catastrophic forgetting

→ Knowledge distillation on the C4 dataset provides task-agnostic healing (second sketch below)
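
One plausible way to make the first insight operational, sketched below: rank transformer blocks by how much they change the hidden state, and compress the low-change blocks first. Calling a block directly and the `model.layers` / `cached_hiddens` names are assumptions, not the paper's exact procedure.

```python
import torch

@torch.no_grad()
def output_change(block, hidden_in: torch.Tensor) -> float:
    """Relative change a transformer block applies to its hidden states."""
    out = block(hidden_in)
    hidden_out = out[0] if isinstance(out, tuple) else out
    # Near-zero means the block barely alters its input and is a
    # good candidate for CUR compression.
    return ((hidden_out - hidden_in).norm() / hidden_in.norm()).item()

# changes = [output_change(b, h) for b, h in zip(model.layers, cached_hiddens)]
# Compress the blocks with the smallest `changes` first.
```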
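
And a sketch of the healing step: freeze everything except the U matrices, then distill from the original model on C4 for on the order of 100 steps. The ".U" parameter-naming convention, the HF-style `.logits` outputs, and the hyperparameters are assumptions.

```python
import torch
import torch.nn.functional as F

def heal(student, teacher, dataloader, steps=100, lr=1e-4):
    """Fine-tune only the U matrices of the compressed `student` by
    distilling from the uncompressed `teacher` on C4 batches."""
    # Freeze everything except the small linking matrices U
    # (".U" in the parameter name is an assumed naming convention).
    for name, p in student.named_parameters():
        p.requires_grad = ".U" in name
    u_params = [p for p in student.parameters() if p.requires_grad]
    opt = torch.optim.AdamW(u_params, lr=lr)
    teacher.eval()

    for _, batch in zip(range(steps), dataloader):  # batch: dict of tensors
        with torch.no_grad():
            t_logits = teacher(**batch).logits
        s_logits = student(**batch).logits
        # Distillation loss: match the teacher's token distribution.
        loss = F.kl_div(F.log_softmax(s_logits, dim=-1),
                        F.softmax(t_logits, dim=-1),
                        reduction="batchmean")
        loss.backward()
        opt.step()
        opt.zero_grad()
```

Because C and R stay frozen, these updates remain inside the subspaces they define, which is the constraint credited above with preventing catastrophic forgetting.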

-----

📊 Results:

→ Reduces Llama3.1-8B by 9% (to 7.32B parameters) in just 129 seconds

→ 20x faster than prior methods like SliceGPT

→ Maintains or improves performance on C4, WikiText2, BoolQ, and MMLU

→ Quick healing in ~100 fine-tuning steps
