LLMs struggle with long-context tasks because of high memory demands, driven primarily by activations. Existing methods do not adequately address this bottleneck.
This paper introduces LEMO (Enabling LEss Token Involvement for MOre Context Fine-tuning), a system that optimizes long-context LLM fine-tuning by exploiting "Contextual Token Sparsity": it minimizes the number of redundant tokens involved in computation, reducing both memory and compute.
-----
https://arxiv.org/abs/2501.09767
📌 LEMO directly tackles activation memory, the main bottleneck in long-context fine-tuning, by identifying and computing over only the crucial tokens. This contrasts with methods that reduce parameter updates but leave activations untouched.
📌 LEMO uses small predictor networks that need minimal training data to identify redundant tokens with high accuracy (95.13% recall), while adding negligible computational and memory overhead.
📌 LEMO's kernel optimizations are key to its efficiency: a "permutation-free" token-selection strategy and "segment-based peak cutting" significantly reduce data movement and cut peak memory, improving practical performance.
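The predictor idea above can be sketched in a simplified form: a tiny linear scorer rates each token's hidden state, and only the top-scoring fraction is kept for the layer's computation. All names, shapes, and the keep ratio below are illustrative assumptions, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 2048, 128
hidden = rng.standard_normal((T, d)).astype(np.float32)   # token activations
w_pred = rng.standard_normal((d,)).astype(np.float32)     # hypothetical predictor weights

scores = hidden @ w_pred                # one importance score per token, O(T*d)
keep = int(0.3 * T)                     # keep 30% of tokens (illustrative ratio)
kept_idx = np.argsort(scores)[-keep:]   # indices of the "important" tokens
kept = hidden[kept_idx]                 # only these activations are stored/used
print(kept.shape)                       # (614, 128)
```

Because the predictor is a single matrix-vector product, its cost is tiny compared with the attention and MLP computation it helps skip.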
----------
Methods Explored in this Paper 🧠:
✅ LEMO uses three core techniques. Token Elimination identifies and removes less informative tokens.
✅ Pattern Prediction uses small predictors to estimate token importance, avoiding full attention computation.
✅ Kernel Optimization speeds up token selection and processing by avoiding unnecessary data movement; peak memory usage is addressed through segment-based computation.
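The segment-based idea in the last bullet can be illustrated in a simplified CPU-side form: processing rows in segments so that only one segment's intermediate activation is live at a time, capping peak memory. The function name, segment length, and shapes below are illustrative, not the paper's actual kernel.

```python
import numpy as np

def segmented_matmul(x, w, seg_len=1024):
    """Compute x @ w segment by segment, so peak intermediate
    memory is proportional to seg_len rows, not the full sequence."""
    out = np.empty((x.shape[0], w.shape[1]), dtype=x.dtype)
    for start in range(0, x.shape[0], seg_len):
        end = min(start + seg_len, x.shape[0])
        out[start:end] = x[start:end] @ w  # only seg_len rows in flight
    return out

x = np.random.randn(4096, 64).astype(np.float32)
w = np.random.randn(64, 64).astype(np.float32)
assert np.allclose(segmented_matmul(x, w), x @ w, atol=1e-4)
```

The result is bit-for-bit the same computation; only the scheduling changes, which is why this kind of optimization can cut memory peaks without affecting model quality.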
-----
Key Insights 💡:
✅ Natural language has significant redundancy, especially in long contexts, which this paper exploits.
✅ Token importance varies across inputs and layers (Contextual Token Sparsity).
✅ Standard full attention can be approximated by focusing on interactions among a subset of the most informative tokens.
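The last insight can be sketched as follows: score all keys, keep only the top-k most influential key/value tokens, and run softmax attention over that subset. This is a generic top-k attention approximation for illustration, not the paper's exact mechanism.

```python
import numpy as np

def topk_attention(q, k, v, keep=64):
    """Approximate attention using only the `keep` highest-scoring
    key/value tokens (illustrative sketch)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (Tq, Tk) attention logits
    importance = scores.max(axis=0)           # each key's max score over queries
    idx = np.argsort(importance)[-keep:]      # indices of kept key tokens
    s = scores[:, idx]
    s = s - s.max(axis=-1, keepdims=True)     # stable softmax over kept keys
    p = np.exp(s)
    p /= p.sum(axis=-1, keepdims=True)
    return p @ v[idx]

T, d = 512, 32
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((T, d)) for _ in range(3))
out = topk_attention(q, k, v, keep=64)
print(out.shape)  # (512, 32)
```

Because softmax weights decay quickly for low-scoring keys, dropping them changes the output little while shrinking the activations that must be stored for the backward pass.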
-----
Results 📊:
✅ LEMO reduces memory consumption by up to 1.93×.
✅ Achieves up to 1.36× speedups in fine-tuning over state-of-the-art systems.
✅ LEMO achieves average memory savings of 38.2% and 50.5% over LoRA at sequence lengths of 4K and 8K, respectively.