"2SSP: A Two-Stage Framework for Structured Pruning of LLMs"

The podcast on this paper was generated with Google's Illuminate.

This paper introduces 2SSP, a two-stage structured pruning framework for LLMs that combines width and depth pruning to reduce model size while preserving performance and keeping pruning time low.

-----

📌 The two-stage structured pruning (2SSP) method balances width and depth sparsity, targeting both efficiency and accuracy. Width pruning preserves the most important neurons, while depth pruning removes redundant Attention submodules, improving inference speed while limiting the loss in perplexity.

📌 Output magnitude-based width pruning in Feed-Forward Networks compresses hidden representations efficiently. This targeted neuron removal minimizes redundant activations, leading to a structured reduction in computational overhead while maintaining essential feature transformations.
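
A minimal PyTorch sketch of this first stage, under assumptions: a LLaMA-style FFN with `gate_proj`/`up_proj`/`down_proj` linear layers and a small batch of calibration hidden states. Neurons are scored by the magnitude of their intermediate output, and the lowest-scoring ones are sliced away so the network stays dense and connected. This is an illustration of the idea, not the authors' code.

```python
import torch

@torch.no_grad()
def prune_ffn_neurons(mlp, calib_hidden, sparsity=0.25):
    """Width-prune a LLaMA-style FFN by output magnitude.

    mlp          : module with .gate_proj, .up_proj, .down_proj (nn.Linear, assumed names)
    calib_hidden : [n_tokens, d_model] hidden states from a calibration set
    sparsity     : fraction of intermediate neurons to remove
    """
    # Intermediate activation of every FFN neuron on the calibration tokens
    acts = torch.nn.functional.silu(mlp.gate_proj(calib_hidden)) * mlp.up_proj(calib_hidden)

    # Score each neuron by the L2 norm of its output across the calibration tokens
    scores = acts.norm(p=2, dim=0)                        # [d_intermediate]
    n_keep = int(scores.numel() * (1.0 - sparsity))
    keep = torch.topk(scores, n_keep).indices.sort().values

    # Slice the projections: rows of gate/up, columns of down, so shapes stay consistent
    mlp.gate_proj.weight.data = mlp.gate_proj.weight.data[keep, :]
    mlp.up_proj.weight.data = mlp.up_proj.weight.data[keep, :]
    mlp.down_proj.weight.data = mlp.down_proj.weight.data[:, keep]
    mlp.gate_proj.out_features = mlp.up_proj.out_features = n_keep
    mlp.down_proj.in_features = n_keep
    return keep
```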

📌 Iterative depth pruning on Attention submodules prioritizes maintaining contextual understanding. By removing less impactful Attention layers, 2SSP reduces memory and compute costs while preserving the core ability of the model to capture dependencies in text.
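
A hedged sketch of this second stage (attribute names like `model.model.layers` and `self_attn.o_proj`, and a HuggingFace-style causal LM that returns `.loss` when given labels, are assumptions): at each step, temporarily bypass each remaining Attention submodule, measure perplexity on a calibration batch, and keep the removal that hurts least. Zeroing the output projection only simulates the removal for scoring; in practice the module would be dropped entirely to realize the speed-up.

```python
import torch

@torch.no_grad()
def perplexity(model, input_ids):
    """Perplexity of a causal LM on one calibration batch (HF-style API assumed)."""
    loss = model(input_ids, labels=input_ids).loss
    return torch.exp(loss).item()

@torch.no_grad()
def prune_attention_iteratively(model, calibration_ids, n_remove):
    """Greedily silence the Attention submodules whose removal least degrades perplexity.

    An Attention module is 'removed' here by zeroing its output projection, so the
    residual stream passes through it unchanged. LLaMA-style attribute names
    (model.model.layers, self_attn.o_proj) are assumptions, not the paper's code.
    """
    layers = model.model.layers
    removed = set()

    for _ in range(n_remove):
        best_idx, best_ppl = None, float("inf")
        for i, block in enumerate(layers):
            if i in removed:
                continue
            saved = block.self_attn.o_proj.weight.data.clone()
            block.self_attn.o_proj.weight.data.zero_()         # temporarily bypass this module
            ppl = perplexity(model, calibration_ids)
            block.self_attn.o_proj.weight.data.copy_(saved)    # restore before the next candidate
            if ppl < best_ppl:
                best_idx, best_ppl = i, ppl
        layers[best_idx].self_attn.o_proj.weight.data.zero_()  # keep the least harmful removal
        removed.add(best_idx)
    return removed
```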

-----

Paper - https://arxiv.org/abs/2501.17771

Original Problem 😮:

→ LLMs are computationally expensive due to their massive size.

→ This high cost hinders efficient inference and deployment.

→ Reducing the computational burden of LLMs without significant performance loss is a critical challenge.

-----

Solution in this Paper 💡:

→ This paper proposes a Two-Stage Structured Pruning framework called 2SSP.

→ The first stage applies width pruning to Feed-Forward Networks within Transformer blocks.

→ It removes entire neurons based on their output magnitude, preserving network connectivity.

→ The second stage employs depth pruning on Attention submodules.

→ It iteratively removes Attention modules that least impact perplexity.

→ 2SSP balances the sparsity budget between FFN neurons and Attention modules so that a single global sparsity target is met (one possible balancing rule is sketched below).
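
One way to read the balancing step, as a back-of-the-envelope rule rather than the paper's exact formulation: fix how many Attention submodules the second stage removes, then set the FFN neuron sparsity so that the total number of pruned parameters meets the global target. The parameter counts in the example are made-up round numbers, not figures from the paper.

```python
def ffn_sparsity_for_target(total_params, ffn_params, attn_module_params,
                            n_attn_removed, target_sparsity):
    """How sparse must the FFNs be once `n_attn_removed` Attention submodules are dropped?

    All arguments are raw parameter counts except the two rates.
    This is an illustrative balancing rule, not necessarily the paper's exact one.
    """
    params_to_remove = target_sparsity * total_params
    removed_by_attention = n_attn_removed * attn_module_params
    remaining = params_to_remove - removed_by_attention
    if remaining < 0:
        raise ValueError("Attention pruning alone already exceeds the target sparsity")
    if remaining > ffn_params:
        raise ValueError("FFNs are too small to absorb the remaining sparsity")
    return remaining / ffn_params


# Example with made-up round numbers: a 7B-parameter model, ~4.5B of which sit in FFNs,
# ~135M parameters per Attention module, 4 Attention modules removed, 25% global sparsity.
ffn_rate = ffn_sparsity_for_target(
    total_params=7_000_000_000,
    ffn_params=4_500_000_000,
    attn_module_params=135_000_000,
    n_attn_removed=4,
    target_sparsity=0.25,
)
print(f"FFN neuron sparsity needed: {ffn_rate:.2%}")  # ≈ 26.89%
```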

-----

Key Insights from this Paper 🧠:

→ Combining width and depth pruning leverages the strengths of both approaches.

→ Width pruning gives fine-grained control over which components are removed, while depth pruning yields larger gains in computation and inference speed.

→ Pruning neurons in FFNs based on output magnitude effectively compresses intermediate representations.

→ Iterative removal of less important Attention submodules further reduces model size.

→ Balancing sparsity across FFN and Attention is crucial for maintaining performance at high sparsity rates.

-----

Results 💪:

→ 2SSP outperforms state-of-the-art methods on language modeling and downstream tasks across various LLMs.

→ Achieves lower perplexity than competing methods on WikiText2, C4, and Fineweb-Edu at 25%, 37.5%, and 50% sparsity.

→ Demonstrates better performance vs. pruning runtime trade-off compared to baselines.

→ Shows inference speed-ups on different GPUs, falling between depth-pruning and width-pruning methods in terms of speed.
