
"EoRA: Training-free Compensation for Compressed LLM with Eigenspace Low-Rank Approximation"

The podcast on this paper is generated with Google's Illuminate.

Your compressed LLM losing accuracy? EoRA (Eigenspace Low-Rank Approximation) knows exactly which weights to fix!

Great paper from @nvidia

EoRA restores compressed-LLM accuracy by prioritizing the most important weight directions in eigenspace.

Eigenspace projection helps compressed LLMs regain their lost accuracy.

📚 https://arxiv.org/abs/2410.21271

🤖 Original Problem:

LLMs face deployment challenges due to their size. Current compression methods either cause accuracy loss or have limited flexibility due to fixed compression formats (like 2:4 sparsity or 4-bit quantization), making it hard to meet diverse efficiency requirements.
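To make the "fixed compression formats" concrete, here is a minimal sketch of 2:4 structured sparsity, one of the formats mentioned above: in every contiguous group of 4 weights, the 2 smallest-magnitude entries are zeroed. The function name `prune_2_4` is hypothetical, chosen for illustration.

```python
import numpy as np

def prune_2_4(w):
    """Illustrative 2:4 structured sparsity: in each group of 4 weights,
    zero out the 2 with the smallest magnitude (hypothetical helper)."""
    w = np.asarray(w, dtype=float).copy()
    groups = w.reshape(-1, 4)                        # view into the copy
    # indices of the two smallest-magnitude entries per group
    idx = np.argsort(np.abs(groups), axis=1)[:, :2]
    np.put_along_axis(groups, idx, 0.0, axis=1)      # zero them in place
    return groups.reshape(w.shape)

print(prune_2_4([0.9, -0.1, 0.4, 0.05, -1.2, 0.3, 0.2, 0.8]))
# → [ 0.9  0.   0.4  0.  -1.2  0.   0.   0.8]
```

Because the 2:4 pattern is fixed by the hardware format, the compression ratio cannot be tuned per deployment target, which is the flexibility problem the paper points out.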

-----

🔧 Solution in this Paper:

→ Introduces EoRA (Eigenspace Low-Rank Approximation) - a training-free method to compensate for compression errors

→ Projects compression errors into the eigenspace of input activations

→ Uses eigenvalues as importance scores to prioritize weight columns for error approximation

→ Optimizes in minutes using minimal calibration data (256 sentences for language tasks)

→ Can integrate with fine-tuning and quantization for further improvements
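The steps above can be sketched in NumPy. This is a simplified reading of the paper's recipe, not its reference implementation; `eora_compensation` is a hypothetical helper name. The idea: eigendecompose the calibration activations' covariance, scale the error's columns by square-rooted eigenvalues so high-energy directions are prioritized, take a truncated SVD, then undo the scaling to get the residual low-rank path B, A.

```python
import numpy as np

def eora_compensation(W, W_comp, X, rank):
    """Sketch of EoRA-style training-free error compensation.

    W      : original weights, shape (d_out, d_in)
    W_comp : compressed weights, shape (d_out, d_in)
    X      : calibration activations, shape (d_in, n_samples)
    rank   : rank budget for the residual path
    Returns B (d_out, rank') and A (rank', d_in) with B @ A ≈ W - W_comp,
    optimized for error measured on the calibration activations.
    """
    dW = W - W_comp                              # compression error
    # Eigendecomposition of the activation covariance X X^T
    eigvals, Q = np.linalg.eigh(X @ X.T)
    S = np.sqrt(np.clip(eigvals, 1e-8, None))    # numerical floor
    # Project the error into the eigenspace; eigenvalues act as
    # importance scores weighting each column.
    dW_proj = dW @ Q * S
    # Plain truncated SVD in the scaled space...
    U, sv, Vt = np.linalg.svd(dW_proj, full_matrices=False)
    B = U[:, :rank] * sv[:rank]
    A = Vt[:rank]
    # ...then undo the scaling so B @ A approximates dW itself
    A = (A / S) @ Q.T
    return B, A
```

At inference the compensated layer computes `W_comp @ x + B @ (A @ x)`: the compressed matmul keeps its efficient format, and the small residual path recovers the prioritized error. No gradients are needed anywhere, which is why this runs in minutes on a small calibration set.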

-----

💡 Key Insights:

→ Reformulates compression as a customized compensation problem using residual low-rank paths

→ Direct relationship between approximation error and compression loss through eigenspace projection

→ More effective use of low-rank representation capacity than naive SVD

→ Training-free optimization without gradient computation
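The "more effective than naive SVD" insight can be checked numerically under simple assumptions (a random error matrix standing in for a real compression error). Plain SVD minimizes the unweighted Frobenius error ||ΔW − BA||, while the eigenspace-weighted variant minimizes the error as seen by the calibration activations, ||(ΔW − BA) X||, so at the same rank it can never do worse on that objective:

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, n, r = 8, 16, 64, 4

dW = rng.standard_normal((d_out, d_in))   # stand-in compression error
X = rng.standard_normal((d_in, n))        # calibration activations

# Naive SVD: best rank-r fit to dW in the plain Frobenius norm.
U, s, Vt = np.linalg.svd(dW, full_matrices=False)
approx_svd = (U[:, :r] * s[:r]) @ Vt[:r]

# Eigenspace-weighted: best rank-r fit measured on the activations.
eigvals, Q = np.linalg.eigh(X @ X.T)
S = np.sqrt(np.clip(eigvals, 1e-8, None))
U2, s2, Vt2 = np.linalg.svd(dW @ Q * S, full_matrices=False)
approx_eora = ((U2[:, :r] * s2[:r]) @ Vt2[:r] / S) @ Q.T

err_svd = np.linalg.norm((dW - approx_svd) @ X)
err_eora = np.linalg.norm((dW - approx_eora) @ X)
print(err_eora <= err_svd)   # weighting never hurts on this objective
```

This is the direct link the paper exploits: the eigenspace projection turns "minimize compression loss on real inputs" into an ordinary truncated-SVD problem, solvable in closed form without gradients.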

-----

📊 Results:

→ On LLaMA3-8B (4-bit quantized, 2:4 pruned): 31.31%/12.88% improvements on ARC-Easy/Challenge

→ 9.69% improvement on MathQA

→ Optimization completed in minutes using only 256 sentences of calibration data

→ Resilient to 3/4-bit quantization with minimal accuracy loss
