"DeltaLLM: Compress LLMs with Low-Rank Deltas between Shared Weights"
The podcast below on this paper was generated with Google's Illuminate.
https://arxiv.org/abs/2501.18596
The paper tackles the large memory footprint of Large Language Models (LLMs), which hinders their deployment on devices with limited resources, and introduces a compression technique that shrinks LLMs while maintaining performance.
The proposed method, DeltaLLM, is a post-training compression technique: it shares weights between Transformer layers and uses low-rank "delta" matrices to capture the differences between them, reducing model size.
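A minimal PyTorch sketch of this parameterization is below. It is illustrative only, not the authors' code: the `DeltaLinear` name, the rank of 32, and the initialization are assumptions.

```python
import torch
import torch.nn as nn

class DeltaLinear(nn.Module):
    """A linear layer that reuses a frozen, shared weight matrix and adds
    a small trainable low-rank delta to recover layer-specific behavior."""
    def __init__(self, shared_linear: nn.Linear, rank: int = 32):
        super().__init__()
        self.shared = shared_linear
        self.shared.weight.requires_grad_(False)       # shared weight stays frozen
        out_f, in_f = shared_linear.weight.shape
        # Only these low-rank factors are trained.
        self.A = nn.Parameter(torch.zeros(out_f, rank))
        self.B = nn.Parameter(torch.randn(rank, in_f) * 0.01)

    def forward(self, x):
        # Effective weight for this layer: W_shared + A @ B
        weight = self.shared.weight + self.A @ self.B
        return nn.functional.linear(x, weight, self.shared.bias)
```

Since only `A` and `B` are layer-specific, the per-layer parameter cost drops from `out_f × in_f` for a full weight matrix to `rank × (out_f + in_f)` for the delta.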
-----
📌 DeltaLLM achieves efficient LLM compression via weight sharing and low-rank deltas. This method reduces parameters significantly while retaining strong benchmark performance with minimal retraining.
📌 Progressive Module Replacement is crucial in DeltaLLM. It ensures stable and rapid training of delta layers, outperforming standard knowledge distillation for decoder LLM compression.
📌 DeltaLLM reveals that MLP layers are more amenable to compression than attention. Compressing MLP weights provides better parameter reduction with less performance degradation.
----------
Methods Explored in this Paper 🔧:
→ The paper introduces DeltaLLM, a compression technique for LLMs.
→ DeltaLLM restructures LLMs by sharing weights across Transformer blocks.
→ Low-rank "delta" matrices are added to account for differences between shared layers.
→ These delta matrices are trained using knowledge distillation, with the original LLM as the teacher.
→ Progressive Module Replacement (PMR) is used during training. PMR progressively swaps original layers for their compressed counterparts on a schedule to improve convergence (see the training-step sketch after this list).
→ Two training approaches are explored: tuning the delta layers alone, and tuning the delta layers combined with LoRA fine-tuning. The paper focuses on the former, where only the delta layers are trained while the base model weights remain fixed.
→ The method is applied to compress Phi-3.5 and Llama-3.2 models, producing the DeltaPhi and DeltaLlama models.
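Putting the pieces above together, here is a rough PyTorch sketch of one distillation step with Progressive Module Replacement. It is a sketch under assumptions, not the paper's implementation: the linear replacement schedule, the `use_delta` flag on each block, the HuggingFace-style `.logits` output, and the plain KL loss are all illustrative choices.

```python
import torch
import torch.nn.functional as F

def pmr_distill_step(teacher, student, batch, step, total_steps, optimizer):
    """One training step: randomly swap original blocks for their delta-based
    replacements with a probability that ramps up over training, then match
    the frozen teacher's output distribution."""
    p_replace = min(1.0, step / (0.5 * total_steps))  # assumed linear ramp to 1.0
    for block in student.blocks:
        # Each compressed block is active with probability p_replace;
        # otherwise the original block is used for this step.
        block.use_delta = torch.rand(1).item() < p_replace

    with torch.no_grad():
        teacher_logits = teacher(batch["input_ids"]).logits
    student_logits = student(batch["input_ids"]).logits

    # Knowledge distillation: KL divergence against the original model.
    loss = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```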
-----
Key Insights 💡:
→ Weight sharing between Transformer layers, combined with low-rank deltas, effectively compresses LLMs.
→ Training only the low-rank delta matrices is sufficient to recover most of the original model's performance.
→ Progressive Module Replacement accelerates training convergence compared to standard distillation.
→ Compressing the MLP layers within Transformer blocks is more effective than compressing the attention layers for this method (see the rough parameter accounting after this list).
→ DeltaLLM-compressed models outperform models compressed with other techniques such as JointDrop, SliceGPT, ShortGPT, and LaCo at the same parameter reduction.
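A rough back-of-the-envelope calculation makes the MLP-versus-attention point concrete. The hidden size, intermediate size, and rank below are assumed, Phi-3.5-like values chosen for illustration, not numbers taken from the paper:

```python
# Rough per-block parameter counts (assumed dimensions; illustrative only).
hidden, inter, rank = 3072, 8192, 64

full_mlp  = 2 * hidden * inter           # up + down projections: ~50.3M params
delta_mlp = 2 * rank * (hidden + inter)  # low-rank deltas instead: ~1.4M params
full_attn = 4 * hidden * hidden          # q, k, v, o projections (equal sizes assumed): ~37.7M params

print(f"full MLP per block:   {full_mlp / 1e6:.1f}M")
print(f"delta MLP per block:  {delta_mlp / 1e6:.1f}M")
print(f"full attention block: {full_attn / 1e6:.1f}M")
```

Because the MLP projections dominate the per-block parameter count, replacing them with a shared copy plus a small delta removes far more parameters than doing the same to the attention projections.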
-----
Results 📊:
→ DeltaPhi 2.9B achieves zero-shot accuracy similar to SlicedPhi 3.3B despite being 400M parameters smaller.
→ DeltaPhi 3.35B outperforms Llama 3.2B and Qwen 3.2B on the MMLU, WinoGrande, and ARC-Challenge benchmarks.
→ DeltaPhi 3.35B reaches a perplexity of 3.34 on the Alpaca dataset, versus 2.96 for the original Phi 3.5 (lower is better), indicating only modest degradation after compression.
→ The delta layers of the DeltaPhi 2.9B model occupy only 90MB of storage.