
Rethinking Data Selection at Scale: Random Selection is Almost All You Need

The podcast on this paper is generated with Google's Illuminate.

Bigger isn't always better, and simple beats complex.

The paper finds that simple random sampling matches complex data selection methods when fine-tuning LLMs on million-scale datasets.

i.e., data diversity beats quality metrics when selecting samples from massive instruction datasets.

📚 https://arxiv.org/abs/2410.09335

Original Problem 🔍:

Existing data selection techniques for supervised fine-tuning (SFT) of LLMs are designed for small-scale datasets and fail to meet real-world SFT demands involving large-scale, diverse data pools.

-----

Solution in this Paper 🧠:

• Replicated self-scoring data selection methods on two million-scale datasets: OpenHermes 2.5 and WildChat-1M

• Evaluated methods across various downstream tasks using Llama3-8B and Qwen2-7B as base models

• Introduced token length as a criterion for data filtering in SFT (see the sketch after this list)

• Analyzed limitations of current approaches on large-scale datasets
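
One way to picture the token length criterion is as a simple rank-and-cut over the instruction pool. The sketch below is an assumed illustration, not the paper's released code: the `instruction`/`response` field names, the tokenizer id, and the `budget` size are all hypothetical choices.

```python
# Minimal sketch of token length-based data selection (illustrative only):
# keep the longest instruction-response pairs from a large SFT pool.
from transformers import AutoTokenizer

def select_by_token_length(samples, tokenizer_name="meta-llama/Meta-Llama-3-8B", budget=50_000):
    """Rank SFT samples by total token length and keep the top `budget`."""
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)

    def total_tokens(sample):
        # `sample` is assumed to be a dict with "instruction" and "response" fields.
        text = sample["instruction"] + "\n" + sample["response"]
        return len(tokenizer.encode(text, add_special_tokens=False))

    # Sort the pool by descending token length and take the longest examples.
    ranked = sorted(samples, key=total_tokens, reverse=True)
    return ranked[:budget]
```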

-----

Key Insights from this Paper 💡:

• Most self-scoring data selection methods don't significantly outperform random selection on large-scale datasets (a minimal baseline sketch follows this list)

• Data diversity is more critical than data quality during SFT phase

• Token length-based filtering yields stable and efficient results for SFT on large-scale datasets

• Token length-based filtering is particularly beneficial for relatively weaker base models, especially when training on long-text data
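
For contrast, the random-selection baseline these methods are compared against can be as small as a seeded draw from the pool. A minimal sketch, assuming the pool is an in-memory list and the subset size is a free parameter:

```python
# Minimal sketch of the random-selection baseline: draw a fixed-size
# subset from the instruction pool with a seeded RNG for reproducibility.
import random

def random_subset(samples, budget=50_000, seed=42):
    """Uniformly sample `budget` examples without replacement."""
    rng = random.Random(seed)
    return rng.sample(samples, k=min(budget, len(samples)))
```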

-----

Results 📊:

• Token length-based selection outperforms other methods on Llama3-8B

• The average score on WildChat (55.51) surpasses fine-tuning on the entire dataset (54.58)

• Data diversity-based methods generally outperform quality-based methods

• Random selection often yields comparable or better results than sophisticated methods
