Stop guessing data mixtures: portfolio optimization for LLMs is here.
Optimizing the data mixture for LLM pretraining is crucial for model performance and training efficiency. This paper tackles the problem of selecting the best combination of datasets for pretraining, proposing efficient methods that automate the process and outperform existing heuristic and learned approaches.
-----
https://arxiv.org/abs/2501.11747
Original Problem 🤔:
→ LLMs benefit from more high-quality training data.
→ Balancing quality, quantity, and diversity across multiple data sources is complex.
→ Existing data-mixing methods lack a comprehensive comparison across training scales and data constraints.
→ It is unclear whether current methods are robust to epoching effects in data-constrained scenarios.
-----
Solution in this Paper 💡:
→ This paper introduces UtiliMax, a method that optimizes data mixes using utility estimates and portfolio optimization.
→ UtiliMax extends token-based heuristics by incorporating utility estimates from reduced-scale ablations of individual datasets.
→ It balances expected utility against risk, accounting for data diversity and dataset size.
→ The paper also presents Model Estimated Data Utility (MEDU).
→ MEDU uses LLMs to estimate data utility from small samples, greatly reducing computational cost compared to ablation-based methods.
→ MEDU prompts an LLM to describe the target benchmarks, then classify the utility of training data against those descriptions.
→ UtiliMax with MEDU automates data mixing while remaining computationally efficient.
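To make the portfolio-optimization framing concrete, here is a minimal sketch of a mean-variance style mixture solver. It is illustrative only, not the paper's exact formulation: the function name `utilimax_weights`, the diagonal risk model, the risk-aversion knob `lam`, and the per-dataset weight caps (a stand-in for token-budget/epoching limits) are all assumptions. The utility inputs could come from small-scale ablations or from MEDU-style LLM estimates.

```python
import numpy as np
from scipy.optimize import minimize

def utilimax_weights(utilities, risks, weight_caps, lam=1.0):
    """Sketch of portfolio-style data mixing.

    utilities:   per-dataset utility estimates (e.g., from reduced-scale
                 ablations or MEDU-style LLM scoring)
    risks:       per-dataset risk terms, used here as a diagonal covariance
    weight_caps: max sampling weight per dataset (proxy for token budgets,
                 to avoid over-epoching small sources)
    lam:         risk-aversion trade-off between utility and risk
    """
    n = len(utilities)
    cov = np.diag(risks)

    def neg_objective(w):
        # Maximize expected utility minus a quadratic risk penalty
        # (minimize its negation).
        return -(w @ utilities - lam * w @ cov @ w)

    # Weights form a valid sampling distribution: sum to 1, capped per source.
    constraints = [{"type": "eq", "fun": lambda w: w.sum() - 1.0}]
    bounds = [(0.0, cap) for cap in weight_caps]
    w0 = np.full(n, 1.0 / n)  # start from a uniform mixture

    res = minimize(neg_objective, w0, method="SLSQP",
                   bounds=bounds, constraints=constraints)
    return res.x

# Hypothetical example: three data sources with decreasing estimated utility.
weights = utilimax_weights(np.array([0.8, 0.5, 0.3]),
                           np.array([0.1, 0.2, 0.4]),
                           weight_caps=[0.6, 0.6, 0.6])
```

The quadratic penalty discourages concentrating the mixture on a single high-utility source, which echoes the paper's emphasis on preserving diversity and scale.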
-----
Key Insights from this Paper 🧠:
→ Token-count heuristics like UniMax are surprisingly strong baselines for data mixing.
→ Maintaining data diversity and scale is crucial for LLM performance, especially in data-constrained settings.
→ By jointly considering utility, diversity, and scale, UtiliMax improves on UniMax and learned data-mixing approaches.
→ LLMs can be leveraged to estimate data utility efficiently through MEDU.
→ Combining UtiliMax and MEDU yields a Pareto-optimal approach to data mixing.
-----
Results 📈:
→ UtiliMax achieves up to a 10.6x speedup over manual baselines.
→ MEDU reduces computational cost by roughly 200x compared to ablation-based utility estimation.
→ UtiliMax outperforms other baselines in both compute-constrained and data-constrained scenarios, as shown in Figure 1.
→ The UtiliMax and MEDU approaches achieve the best mean ranks across tasks and compute scales, detailed in Table 2.