Bigger isn't always better, and simple beats complex.
The paper finds that simple random sampling matches complex data selection methods when fine-tuning LLMs on million-scale datasets.
In other words, data diversity beats quality metrics when selecting samples from massive instruction datasets.
📚 https://arxiv.org/abs/2410.09335
Original Problem 🔍:
Existing data selection techniques for supervised fine-tuning (SFT) of LLMs are designed for small-scale datasets, failing to meet real-world SFT demands involving large-scale, diverse data pools.
-----
Solution in this Paper 🧠:
• Replicated self-scoring data selection methods on two million-scale datasets: OpenHermes 2.5 and WildChat-1M
• Evaluated methods across various downstream tasks using Llama3-8B and Qwen2-7B as base models
• Introduced token length as a criterion for data filtering in SFT (see the sketch after this list)
• Analyzed limitations of current approaches on large-scale datasets
-----
Key Insights from this Paper 💡:
• Most self-scoring data selection methods don't significantly outperform random selection on large-scale datasets
• Data diversity is more critical than data quality during SFT phase
• Token length-based filtering yields stable and efficient results for SFT on large-scale datasets
• Token length-based filtering is especially beneficial for relatively weaker base models, particularly when training on long-text data
-----
Results 📊:
• Token length-based selection outperforms other methods on Llama3-8B
• Its average score on WildChat (55.51) surpasses fine-tuning on the entire dataset (54.58)
• Data diversity-based methods generally outperform quality-based methods
• Random selection often yields comparable or better results than sophisticated methods