Data selection matters more than data quantity for LLM training
This paper studies data selection strategies for LLM pretraining, comparing n-gram and neural embedding features for aligning training data with a target distribution to improve downstream model performance.
https://arxiv.org/abs/2501.03826v1
Original Problem 🔍:
→ Current LLM training often uses random or heuristic-based data selection, which may not effectively capture desired target distributions.
→ Binary classifiers for high-quality text selection consider examples independently, limiting batch selection capabilities.
-----
Solution in this Paper 🛠️:
→ Introduces Hybrid Importance Resampling (HIR) combining n-gram and neural embedding features.
→ Uses Gaussian Mixture Models to estimate target distribution probabilities from sentence embeddings (see the first sketch below).
→ Integrates discrete n-gram properties with continuous neural features using a weighted hybrid distribution.
→ Computes importance weights as the ratio of the hybrid distributions between the target and raw datasets, then resamples accordingly (see the second sketch below).
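
A minimal sketch of the neural component, assuming a sentence-transformers encoder ("all-MiniLM-L6-v2" is an illustrative choice) and a scikit-learn GMM; the number of mixture components and other defaults are assumptions, not values from the paper.

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.mixture import GaussianMixture

def fit_target_gmm(target_texts, n_components=8, seed=0):
    """Embed target-domain texts and fit a GMM over the embedding space."""
    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder, not from the paper
    emb = encoder.encode(target_texts, convert_to_numpy=True)
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="diag",
                          random_state=seed).fit(emb)
    return encoder, gmm

def neural_log_prob(encoder, gmm, texts):
    """Log-density of each text's embedding under the fitted GMM."""
    emb = encoder.encode(texts, convert_to_numpy=True)
    return gmm.score_samples(emb)  # one log-likelihood per text
```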
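And a sketch of the hybrid weighting and resampling step, under stated assumptions: hashed unigram/bigram counts stand in for the paper's n-gram features, `lam` is the mixing weight between the discrete and neural components, and Gumbel-top-k is used to sample without replacement proportionally to the importance weights. The raw-side GMM log-probabilities are assumed to come from a GMM fit on the raw corpus, analogous to the target-side sketch above; all names and defaults are illustrative.

```python
import zlib
import numpy as np

def hashed_ngram_counts(text, num_buckets=10_000):
    """Hash unigrams and bigrams of a whitespace-tokenized text into buckets."""
    toks = text.lower().split()
    feats = toks + [" ".join(p) for p in zip(toks, toks[1:])]
    counts = np.zeros(num_buckets)
    for f in feats:
        counts[zlib.crc32(f.encode()) % num_buckets] += 1.0
    return counts

def fit_hashed_ngram_lm(texts, num_buckets=10_000, alpha=1.0):
    """Smoothed log-probabilities over hash buckets for a bag-of-ngrams model."""
    totals = np.full(num_buckets, alpha)
    for t in texts:
        totals += hashed_ngram_counts(t, num_buckets)
    return np.log(totals / totals.sum())

def ngram_log_prob(text, log_bucket_p):
    """Log-probability of a text under a fitted bag-of-ngrams model."""
    return hashed_ngram_counts(text, len(log_bucket_p)) @ log_bucket_p

def hybrid_importance_resample(raw_texts, ngram_tgt, ngram_raw,
                               neural_tgt_logp, neural_raw_logp,
                               lam=0.5, k=1000, seed=0):
    """Select k raw examples with probability proportional to the hybrid
    importance weight: the target/raw density ratio, mixed across the
    n-gram and neural components by `lam`."""
    ngram_logratio = np.array([ngram_log_prob(t, ngram_tgt) - ngram_log_prob(t, ngram_raw)
                               for t in raw_texts])
    neural_logratio = neural_tgt_logp - neural_raw_logp  # from the GMM sketch above
    log_w = lam * ngram_logratio + (1.0 - lam) * neural_logratio
    # Gumbel-top-k: sample k indices without replacement, proportional to exp(log_w)
    rng = np.random.default_rng(seed)
    return np.argsort(-(log_w + rng.gumbel(size=len(raw_texts))))[:k]
```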
-----
Key Insights 💡:
→ Neural embeddings capture semantic richness but may not align directly with token-level training objectives
→ N-gram features excel at local context while neural features better represent global patterns
→ DSIR outperforms neural approaches on most tasks, highlighting the value of token-level modeling
-----
Results 📊:
→ DSIR achieved the best performance on 5 of 6 GLUE tasks
→ HIR with neural features improved over random selection
→ Both methods showed consistent gains over the random-selection baseline