"LeMo: Enabling LEss Token Involvement for MOre Context Fine-tuning"

The podcast below on this paper was generated with Google's Illuminate.

LLMs struggle with long-context tasks because fine-tuning them demands large amounts of memory, most of it consumed by activations. Existing methods don't adequately address this bottleneck.

This paper introduces LeMo (Enabling LEss Token Involvement for MOre Context Fine-tuning), a system that optimizes LLM fine-tuning by exploiting "Contextual Token Sparsity": it minimizes the involvement of redundant tokens, reducing both memory and computation.

-----

https://arxiv.org/abs/2501.09767

πŸ“Œ LeMo directly tackles activation memory, the main bottleneck in long-context fine-tuning, by identifying and computing over only the crucial tokens. This contrasts with methods that reduce parameter updates but leave activations untouched.

πŸ“Œ LeMo's predictor networks need minimal training data to identify redundant tokens with high accuracy (95.13% recall), while adding negligible computational and memory overhead.

πŸ“Œ LeMo's kernel optimizations are key to its practical efficiency: the "permutation-free" strategy cuts data movement, and "segment-based peak cutting" lowers peak memory usage.
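
The paper's fused kernels aren't reproduced here, but the intuition behind a permutation-free selection can be sketched in plain PyTorch (all function names are illustrative, not LeMo's API): rather than gathering the kept tokens into a reordered buffer, a mask suppresses eliminated positions in place, so no permuted copy of the activations is materialized.

```python
import torch

def gather_based(x, keep_idx):
    # Conventional route: physically copy kept tokens into a new,
    # reordered buffer (extra data movement and memory traffic).
    return x[:, keep_idx, :]

def permutation_free(x, keep_mask, weight):
    # Tokens stay in place; the mask zeroes eliminated positions, so no
    # permuted copy of the activations is ever materialized. A fused
    # kernel would additionally skip the masked rows entirely.
    return (x * keep_mask.unsqueeze(-1)) @ weight

x = torch.randn(2, 1024, 64)                     # [batch, seq, dim]
weight = torch.randn(64, 64)
scores = torch.rand(1024)                        # stand-in importance scores
keep_mask = (scores > scores.median()).float()   # keep ~half the tokens

moved = gather_based(x, keep_mask.nonzero().squeeze(-1))  # reordered copy
in_place = permutation_free(x, keep_mask, weight)         # no reordering
```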

----------

Methods Explored in this Paper πŸ”§:

β†’ LeMo uses three core techniques. Token Elimination identifies and removes less informative tokens.
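
A minimal sketch of the elimination step, assuming per-token importance scores are already available (the helper below is hypothetical, not the paper's implementation): keeping only the top-scoring tokens shortens the sequence, so downstream layers store proportionally less activation memory.

```python
import torch

def eliminate_tokens(hidden, scores, keep_ratio=0.5):
    # Keep the highest-scoring tokens (in their original order); the
    # dropped tokens never produce activations in later layers.
    batch, seq, dim = hidden.shape
    k = max(1, int(seq * keep_ratio))
    keep_idx = scores.topk(k, dim=-1).indices.sort(dim=-1).values
    return hidden.gather(1, keep_idx.unsqueeze(-1).expand(-1, -1, dim))

hidden = torch.randn(2, 1024, 64)           # [batch, seq, dim]
scores = torch.rand(2, 1024)                # stand-in importance scores
reduced = eliminate_tokens(hidden, scores)  # -> [2, 512, 64]
```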

β†’ Pattern Prediction uses small predictors to estimate token importance, avoiding full attention calculation.
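
A plausible minimal form of such a predictor, assuming it scores each token's hidden state directly (the architecture and sizes below are illustrative guesses, not the paper's configuration), is a tiny MLP that is far cheaper than computing full attention:

```python
import torch
import torch.nn as nn

class ImportancePredictor(nn.Module):
    # Maps each token's hidden state to a scalar importance score,
    # standing in for importance derived from full attention.
    def __init__(self, dim, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):               # x: [batch, seq, dim]
        return self.net(x).squeeze(-1)  # [batch, seq] scores

predictor = ImportancePredictor(dim=64)
scores = predictor(torch.randn(2, 1024, 64))
```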

β†’ Kernel Optimization speeds up token selection and processing without unnecessary data movement, while segment-based computation keeps peak memory usage in check.
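
A generic illustration of the segment-based idea (the paper's kernels fuse this with token selection; this standalone sketch only shows how chunking caps the memory peak): processing the sequence segment by segment means the large intermediate activation exists for one segment at a time rather than for the whole sequence.

```python
import torch

def segmented_mlp(x, w1, w2, seg_len=256):
    # The large intermediate activation (seg_len x 4*dim here) exists
    # for only one segment at a time, capping peak memory.
    outputs = []
    for start in range(0, x.shape[1], seg_len):
        seg = x[:, start:start + seg_len, :]
        outputs.append(torch.relu(seg @ w1) @ w2)
    return torch.cat(outputs, dim=1)

x = torch.randn(2, 1024, 64)
w1 = torch.randn(64, 256)     # dim -> 4*dim expansion
w2 = torch.randn(256, 64)
y = segmented_mlp(x, w1, w2)  # same result as the unsegmented MLP
```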

-----

Key Insights πŸ’‘:

β†’ Natural language has significant redundancy, especially in long contexts, which this paper exploits.

β†’ Token importance varies across inputs and layers (Contextual Token Sparsity).

β†’ Standard full attention can be approximated by focusing on interactions among a subset of the most informative tokens.
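
To make that concrete, here is a toy comparison, using key norms as a crude stand-in for learned importance scores (not the paper's selection rule): attention restricted to a subset of token positions approximates full attention when attention mass concentrates on those tokens.

```python
import torch
import torch.nn.functional as F

def attention(q, k, v):
    scores = (q @ k.transpose(-2, -1)) / k.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v

q = torch.randn(1, 128, 64)
k = torch.randn(1, 1024, 64)
v = torch.randn(1, 1024, 64)
full = attention(q, k, v)

# Keep the 256 token positions with the largest key norm, a crude
# stand-in for a learned importance score.
idx = k.norm(dim=-1).topk(256, dim=-1).indices.sort(-1).values
k_sub = k.gather(1, idx.unsqueeze(-1).expand(-1, -1, 64))
v_sub = v.gather(1, idx.unsqueeze(-1).expand(-1, -1, 64))
approx = attention(q, k_sub, v_sub)

# Error is small when attention mass concentrates on the kept tokens.
print((full - approx).abs().mean())
```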

-----

Results πŸ“Š:

β†’ LeMo reduces memory consumption by up to 1.93Γ—.

β†’ Achieves up to 1.36Γ— speedups in fine-tuning over state-of-the-art systems.

β†’ Compared with LoRA, LeMo delivers average memory savings of 38.2% and 50.5% at sequence lengths of 4K and 8K, respectively.
