Your compressed LLM losing accuracy? EoRA (Eigenspace Low-Rank Approximation) knows exactly which weights to fix!
Great paper from @nvidia
EoRA recovers compressed-LLM accuracy by prioritizing the most important weight directions in eigenspace.
Eigenspace projection helps compressed LLMs regain their lost accuracy.
📚 https://arxiv.org/abs/2410.21271
🤖 Original Problem:
LLMs face deployment challenges due to their size. Current compression methods cause accuracy loss and offer limited flexibility because of fixed compression formats (like 2:4 sparsity or 4-bit quantization), making it hard to meet diverse efficiency requirements.
-----
🔧 Solution in this Paper:
→ Introduces EoRA (Eigenspace Low-Rank Approximation) - a training-free method to compensate for compression errors
→ Projects compression errors into eigenspace of input activations
→ Uses eigenvalues as importance scores to prioritize weight columns for error approximation
→ Optimizes in minutes using minimal calibration data (256 sentences for language tasks)
→ Can integrate with fine-tuning and quantization for further improvements
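The steps above can be sketched in NumPy. Everything below — shapes, the toy quantizer, variable names — is an illustrative assumption, not the paper's implementation:

```python
import numpy as np

d_out, d_in, n_tokens, rank = 64, 32, 512, 8
rng = np.random.default_rng(0)

W = rng.standard_normal((d_out, d_in))     # original weight
W_c = np.round(W * 2) / 2                  # toy "compression": coarse quantization
dW = W - W_c                               # compression error to compensate
X = rng.standard_normal((d_in, n_tokens))  # calibration activations

# Eigendecomposition of the activation covariance X X^T
eigvals, Q = np.linalg.eigh(X @ X.T)
eigvals = np.clip(eigvals, 0.0, None)
S = Q * np.sqrt(eigvals)                   # Q @ diag(sqrt(eigvals))

# Project the error into eigenspace, then truncate with SVD:
# low-rank capacity is spent on high-eigenvalue (important) directions.
U, s, Vt = np.linalg.svd(dW @ S, full_matrices=False)
low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank]

# Map back out of eigenspace to get the residual low-rank compensation path
S_inv_T = (Q * (1.0 / np.sqrt(np.maximum(eigvals, 1e-12)))).T
compensation = low_rank @ S_inv_T

# Baseline: naive rank-r SVD of dW, with no eigenspace weighting
U2, s2, Vt2 = np.linalg.svd(dW, full_matrices=False)
naive = (U2[:, :rank] * s2[:rank]) @ Vt2[:rank]

# Compare output-side error on the calibration activations
err_eora = np.linalg.norm((dW - compensation) @ X)
err_naive = np.linalg.norm((dW - naive) @ X)
print(err_eora < err_naive)
```

Note this is training-free: one eigendecomposition plus one SVD, no gradients — which is why it completes in minutes on a small calibration set.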
-----
💡 Key Insights:
→ Reformulates compression as a customized compensation problem using residual low-rank paths
→ Direct relationship between approximation error and compression loss through eigenspace projection
→ More effective use of low-rank representation capacity than naive SVD
→ Training-free optimization without gradient computation
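The "direct relationship" insight can be checked numerically: the output-side compression loss ‖ΔW·X‖_F equals the norm of the error after its columns are weighted by the square roots of the eigenvalues of XX^T. A toy check (all shapes and values illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
dW = rng.standard_normal((16, 8))   # toy compression error
X = rng.standard_normal((8, 100))   # toy calibration activations

# Eigenspace of the activation covariance
eigvals, Q = np.linalg.eigh(X @ X.T)

# ||dW @ X||_F == ||dW @ Q @ diag(sqrt(eigvals))||_F,
# so minimizing the eigenvalue-weighted error minimizes the true loss.
lhs = np.linalg.norm(dW @ X)
rhs = np.linalg.norm(dW @ Q * np.sqrt(np.clip(eigvals, 0.0, None)))
print(np.isclose(lhs, rhs))  # prints True
```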
-----
📊 Results:
→ On LLaMA3-8B (4-bit quantized, 2:4 pruned): 31.31%/12.88% improvements on ARC-Easy/Challenge
→ 9.69% improvement on MathQA
→ Optimization completed in minutes using only 256 sentences of calibration data
→ Compensation remains resilient even when quantized to 3/4-bit, with minimal accuracy loss