Smart data selection beats brute force: DELIFT proves less data can mean better results
DELIFT cuts LLM fine-tuning data by 70% while keeping the same performance
https://arxiv.org/abs/2411.04425
Original Problem 🎯:
Fine-tuning LLMs requires massive datasets that are often redundant and computationally expensive. Current data selection methods either focus on single stages or use resource-intensive gradient calculations, making them impractical for large-scale applications.
-----
Solution in this Paper 🛠️:
→ DELIFT introduces a pairwise utility metric that measures how useful a data sample is for improving model responses to other samples
→ It uses three submodular functions: Facility Location for instruction tuning, Facility Location Mutual Information for task-specific tuning, and Facility Location Conditional Gain for continual learning
→ The utility metric scores each sample by how much using it as an in-context example improves the model's prediction of another sample's ground-truth response, measured with a length-normalized distance between probability distributions
→ A greedy algorithm then builds the data subset iteratively, at each step adding the point with the maximum marginal gain under the chosen submodular function (both ideas are sketched in code after this list)
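
A minimal sketch of the pairwise utility idea, assuming a Hugging Face causal LM. The function names (`length_normalized_distance`, `pairwise_utility`), the prompt formatting, and the use of per-token negative log-likelihood as a stand-in for the paper's length-normalized distance are illustrative assumptions, not the authors' reference implementation:

```python
import torch
import torch.nn.functional as F

def length_normalized_distance(logits, target_ids):
    """Per-token cross-entropy between the model's predicted distribution and the
    ground-truth response tokens, normalized by response length."""
    log_probs = F.log_softmax(logits, dim=-1)
    token_nll = -log_probs.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)
    return token_nll.mean().item()

@torch.no_grad()
def pairwise_utility(model, tokenizer, sample_i, sample_j, device="cpu"):
    """Utility of sample_j for sample_i: how much conditioning on (x_j, y_j) as an
    in-context example improves the model's prediction of y_i given x_i.
    Positive values mean sample_j is informative for sample_i."""
    def response_distance(prompt, response):
        prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
        response_ids = tokenizer(response, return_tensors="pt").input_ids.to(device)
        input_ids = torch.cat([prompt_ids, response_ids], dim=-1)
        logits = model(input_ids).logits
        # position t predicts token t+1, so slice the logits that predict the response
        response_logits = logits[0, prompt_ids.shape[-1] - 1 : -1]
        return length_normalized_distance(response_logits, response_ids[0])

    x_i, y_i = sample_i
    x_j, y_j = sample_j
    d_without = response_distance(x_i, y_i)
    d_with = response_distance(f"{x_j}\n{y_j}\n\n{x_i}", y_i)
    return d_without - d_with
```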
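
And a sketch of the greedy selection step under the standard facility-location objective f(S) = Σᵢ maxⱼ∈S sim[i, j], assuming a precomputed pairwise utility matrix; the budget, the clipping of negative utilities, and the helper name are assumptions for illustration:

```python
import numpy as np

def greedy_facility_location(sim, budget):
    """Greedily select `budget` indices maximizing the facility-location objective
    f(S) = sum_i max_{j in S} sim[i, j].

    sim: (n, n) non-negative similarity/utility matrix.
    Returns the selected indices in the order they were chosen."""
    n = sim.shape[0]
    selected = []
    coverage = np.zeros(n)  # best coverage of each point by the current subset
    for _ in range(budget):
        # marginal gain of candidate j: sum_i max(0, sim[i, j] - coverage[i])
        gains = np.maximum(sim - coverage[:, None], 0.0).sum(axis=0)
        gains[selected] = -np.inf  # never pick the same point twice
        best = int(np.argmax(gains))
        selected.append(best)
        coverage = np.maximum(coverage, sim[:, best])
    return selected

# Example: keep 30% of the data given a pairwise utility matrix U,
# where U[i, j] is the utility of sample j for sample i.
# subset = greedy_facility_location(np.maximum(U, 0), budget=int(0.3 * len(U)))
```

The mutual-information and conditional-gain variants mentioned above follow the same greedy loop with modified objectives that respectively reward similarity to a target task set or penalize overlap with already-learned data.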
-----
Key Insights 🔍:
→ Data selection can be optimized across all fine-tuning stages simultaneously
→ Model-aware selection outperforms traditional semantic similarity methods
→ Reducing dataset size doesn't necessarily compromise performance
→ The pairwise utility metric effectively captures sample informativeness
-----
Results 📊:
→ Reduced fine-tuning data size by 70% while maintaining performance
→ Achieved 70% reduction in computational time vs gradient-based methods
→ Outperformed existing data selection methods by up to 26% in effectiveness
→ Demonstrated consistent performance across different model scales (3.8B to 72B parameters)