
"DELIFT: Data Efficient Language model Instruction Fine Tuning"

The podcast on this paper is generated with Google's Illuminate.

Smart data selection beats brute force: DELIFT proves less data can mean better results

DELIFT cuts LLM fine-tuning data by 70% while keeping the same performance

https://arxiv.org/abs/2411.04425

Original Problem 🎯:

Fine-tuning LLMs requires massive datasets that are often redundant and computationally expensive to process. Current data selection methods either focus on a single fine-tuning stage or rely on resource-intensive gradient computations, making them impractical at scale.

-----

Solution in this Paper 🛠️:

→ DELIFT introduces a pairwise utility metric that measures how much one data sample improves the model's responses on other samples, quantifying each sample's informational value

→ It uses three submodular functions: Facility Location for instruction tuning, Facility Location Mutual Information for task-specific tuning, and Facility Location Conditional Gain for continual learning

→ The utility metric evaluates the informational value relative to the model's current capabilities using length-normalized distance metrics between probability distributions

→ A greedy algorithm iteratively builds optimal data subsets by selecting points with maximum marginal gain in the chosen submodular function
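The greedy selection step above can be sketched as follows. This is a minimal illustration of facility-location maximization, assuming the pairwise utilities have already been computed into a matrix (the function name and matrix here are illustrative, not the paper's implementation):

```python
import numpy as np

def greedy_facility_location(util: np.ndarray, budget: int) -> list[int]:
    """Greedily pick `budget` indices maximizing the facility-location
    objective f(S) = sum_i max_{j in S} util[i, j].

    util[i, j] is the (precomputed) utility of sample j for sample i.
    """
    n = util.shape[0]
    selected: list[int] = []
    coverage = np.zeros(n)  # best utility each sample gets from the chosen set
    for _ in range(budget):
        # marginal gain of each candidate j: total coverage improvement
        gains = np.maximum(util, coverage[:, None]).sum(axis=0) - coverage.sum()
        gains[selected] = -np.inf  # never re-pick a chosen sample
        best = int(np.argmax(gains))
        selected.append(best)
        coverage = np.maximum(coverage, util[:, best])
    return selected

# Toy example: two near-duplicate samples (0, 1) and one distinct sample (2).
util = np.array([[1.0, 0.9, 0.1],
                 [0.9, 1.0, 0.1],
                 [0.1, 0.1, 1.0]])
subset = greedy_facility_location(util, budget=2)
print(subset)  # → [0, 2]: one representative per cluster, redundancy skipped
```

Because facility location is submodular, this greedy loop carries the classic (1 − 1/e) approximation guarantee, which is what makes selecting a small representative subset tractable. The task-specific (mutual-information) and continual-learning (conditional-gain) variants differ only in the objective being maximized.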

-----

Key Insights 🔍:

→ Data selection can be optimized across all fine-tuning stages simultaneously

→ Model-aware selection outperforms traditional semantic similarity methods

→ Reducing dataset size doesn't necessarily compromise performance

→ The pairwise utility metric effectively captures sample informativeness

-----

Results 📊:

→ Reduced fine-tuning data size by 70% while maintaining performance

→ Achieved 70% reduction in computational time vs gradient-based methods

→ Outperformed existing data selection methods by up to 26%

→ Demonstrated consistent performance across different model scales (3.8B to 72B parameters)
