"DeltaLLM: Compress LLMs with Low-Rank Deltas between Shared Weights"
The podcast below on this paper was generated with Google's Illuminate.
https://arxiv.org/abs/2501.18596
The paper addresses the large memory footprint of Large Language Models (LLMs), which hinders their deployment on resource-constrained devices, and introduces a compression technique that reduces model size while maintaining performance.
The proposed method, DeltaLLM, is a post-training compression scheme: it shares weights between Transformer layers and uses low-rank "delta" matrices to capture the differences between them, shrinking the overall model.
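As a rough illustration of the idea, here is a minimal PyTorch sketch (my own, not the authors' code; the DeltaLinear class, the shared-weight handling, and the initialization are assumptions): each layer's effective weight is a shared base matrix plus a small per-layer low-rank delta.

```python
import torch
import torch.nn as nn

class DeltaLinear(nn.Module):
    """Linear layer whose effective weight is a shared base matrix plus a per-layer low-rank delta."""
    def __init__(self, shared_weight: nn.Parameter, rank: int):
        super().__init__()
        out_features, in_features = shared_weight.shape
        self.shared_weight = shared_weight  # one tensor reused by every layer in the group (kept frozen)
        # Per-layer trainable delta: W_eff = W_shared + delta_B @ delta_A (LoRA-style init keeps the delta at zero)
        self.delta_A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.delta_B = nn.Parameter(torch.zeros(out_features, rank))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        base = x @ self.shared_weight.T                 # shared part, identical across the group
        delta = (x @ self.delta_A.T) @ self.delta_B.T   # cheap rank-r correction unique to this layer
        return base + delta

# Two adjacent layers share one base weight but keep their own small deltas.
shared = nn.Parameter(torch.randn(64, 64) * 0.02, requires_grad=False)
layer_a, layer_b = DeltaLinear(shared, rank=8), DeltaLinear(shared, rank=8)
```

Only delta_A and delta_B are trainable; the shared matrix stays frozen, so the per-layer storage of a d×d projection drops from roughly d² to d·r parameters.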
-----
📌 DeltaLLM achieves efficient LLM compression via weight sharing and low-rank deltas. The method reduces parameters significantly while retaining strong benchmark performance with minimal retraining.
📌 Progressive Module Replacement is crucial in DeltaLLM. It ensures stable and rapid training of the delta layers, outperforming standard knowledge distillation for decoder-only LLM compression.
📌 DeltaLLM reveals that MLP layers are more amenable to compression than attention layers: compressing MLP weights provides better parameter reduction with less performance degradation.
----------
Methods Explored in this Paper 🔧:
→ The paper introduces DeltaLLM, a compression technique for LLMs.
→ DeltaLLM restructures LLMs by sharing weights across Transformer blocks.
→ Low-rank "delta" matrices are added to account for the differences between the shared layers.
→ The delta matrices are trained with knowledge distillation, using the original LLM as the teacher.
→ Progressive Module Replacement (PMR) is used during training: original layers are progressively replaced with compressed layers on a schedule to improve convergence (see the training sketch after this list).
→ Two training approaches are explored: delta-tuning only, and delta-layer tuning combined with LoRA fine-tuning. The paper focuses on delta-tuning only, where just the delta layers are trained while the base model weights stay frozen.
→ The method is applied to compress Phi-3.5 and Llama-3.2 models, producing the DeltaPHI and DeltaLLAMA models.
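Below is a self-contained sketch of how Progressive Module Replacement can be combined with distillation-style training. The linear schedule, the toy linear "blocks", and the feature-matching loss are simplifications of mine, not the paper's exact recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def replacement_prob(step: int, total_steps: int) -> float:
    """Linear PMR schedule: early steps mostly keep original blocks, later steps use compressed ones."""
    return min(1.0, step / max(1, total_steps))

def forward_with_pmr(x, original_blocks, compressed_blocks, p):
    """At each depth, route through the compressed block with probability p, else the original block."""
    h = x
    for orig, comp in zip(original_blocks, compressed_blocks):
        block = comp if torch.rand(()).item() < p else orig
        h = block(h)
    return h

d, depth, steps = 32, 4, 200
original_blocks = nn.ModuleList(nn.Linear(d, d) for _ in range(depth)).requires_grad_(False)
compressed_blocks = nn.ModuleList(nn.Linear(d, d) for _ in range(depth))  # stand-ins for delta layers
optimizer = torch.optim.AdamW(compressed_blocks.parameters(), lr=1e-3)

for step in range(steps):
    x = torch.randn(16, d)
    with torch.no_grad():  # teacher = the fully original stack
        teacher_out = forward_with_pmr(x, original_blocks, original_blocks, p=0.0)
    p = replacement_prob(step, steps)
    student_out = forward_with_pmr(x, original_blocks, compressed_blocks, p)
    loss = F.mse_loss(student_out, teacher_out)  # simplified distillation target
    loss.backward(); optimizer.step(); optimizer.zero_grad()
```

As the schedule ramps p toward 1, the student gradually becomes the fully compressed model, which the paper reports is faster and more stable than distilling into the compressed stack all at once.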
-----
Key Insights 💡:
→ Weight sharing between Transformer layers, combined with low-rank deltas, effectively compresses LLMs.
→ Training only the low-rank delta matrices is sufficient to recover most of the original model's performance.
→ Progressive Module Replacement accelerates training convergence compared to standard distillation.
→ For this method, compressing the MLP layers within Transformer blocks is more effective than compressing the attention layers (see the parameter-count sketch after this list).
→ DeltaLLM-compressed models outperform other compression techniques such as JointDrop, SliceGPT, ShortGPT, and LaCo at the same parameter reduction.
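To make the MLP-vs-attention point concrete, here is a back-of-the-envelope parameter count. The dimensions are assumptions loosely modeled on a Phi-3-mini-sized model (32 layers, d_model=3072, d_ff=8192, rank-64 deltas, weights shared across pairs of layers), not figures from the paper.

```python
def shared_plus_delta_params(n_layers: int, out_f: int, in_f: int, group_size: int, rank: int) -> int:
    """Parameters when each group of layers shares one full matrix and every layer keeps a rank-r delta."""
    shared = (n_layers // group_size) * out_f * in_f    # one full weight per group
    deltas = n_layers * rank * (out_f + in_f)           # A (r x in) + B (out x r) per layer
    return shared + deltas

L, d_model, d_ff, r, g = 32, 3072, 8192, 64, 2

mlp_orig = L * 2 * d_model * d_ff                       # up- and down-projection per layer
mlp_new = shared_plus_delta_params(L, d_ff, d_model, g, r) + shared_plus_delta_params(L, d_model, d_ff, g, r)

attn_orig = L * 4 * d_model * d_model                   # Q, K, V, O projections per layer
attn_new = 4 * shared_plus_delta_params(L, d_model, d_model, g, r)

print(f"MLP : {mlp_orig/1e9:.2f}B -> {mlp_new/1e9:.2f}B (saves {(mlp_orig-mlp_new)/1e9:.2f}B)")
print(f"Attn: {attn_orig/1e9:.2f}B -> {attn_new/1e9:.2f}B (saves {(attn_orig-attn_new)/1e9:.2f}B)")
```

Under these assumed sizes, sharing the larger MLP matrices removes noticeably more parameters per shared pair than sharing the attention projections; the paper's finding also concerns how much accuracy is lost, which a raw count like this does not capture.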
-----
Results 📊:
→ DeltaPHI 2.9B achieves zero-shot accuracy similar to SlicedPhi 3.3B despite being 400M parameters smaller.
→ DeltaPHI 3.35B outperforms the similarly sized (~3.2B-parameter) Llama and Qwen models on the MMLU, WinoGrande, and ARC-Challenge benchmarks.
→ DeltaPHI 3.35B reaches a perplexity of 3.34 on the Alpaca dataset versus the original Phi 3.5's 2.96, staying close to the uncompressed model despite the parameter reduction.
→ The delta layers of the DeltaPHI 2.9B model occupy only 90MB of storage.
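As a quick sanity check on that storage figure (assuming the deltas are stored in 16-bit precision, which is my assumption rather than a detail given here), 90 MB corresponds to roughly 47M delta parameters:

```python
delta_storage_bytes = 90 * 1024**2   # 90 MB of delta-layer storage
bytes_per_param = 2                  # fp16/bf16 assumption, not stated in the post
print(delta_storage_bytes / bytes_per_param / 1e6)  # ~47 million delta parameters
```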


