"MAGA: MAssive Genre-Audience Reformulation to Pretraining Corpus Expansion"
The podcast below on this paper was generated with Google's Illuminate.
https://arxiv.org/abs/2502.04235
The challenge for LLMs is the increasing scarcity of high-quality data needed for pre-training, which hinders further model scaling. This paper introduces MAGA, a method that expands pre-training corpora by reformulating existing text into diverse versions.
MAGA synthesizes new pre-training data from existing documents by adapting each one to different genres and audiences, expanding the dataset efficiently while maintaining data quality.
-----
📌 MAGA conditions reformulation on genre and audience, enabling targeted data augmentation that goes beyond simple rephrasing and produces diverse, contextually relevant pre-training data efficiently.
📌 MAGA's two-stage design is efficient: genre-audience pair generation first, then reformulation. Decomposing the synthesis task this way lets smaller models handle it effectively, reducing computational overhead.
📌 The "Limited Consistency" criterion balances variation against fidelity, which is crucial for synthetic data quality: it avoids model collapse while still enabling meaningful content adaptation for pre-training.
----------
Methods Explored in this Paper 🔧:
→ MAGA, or MAssive Genre-Audience reformulation, is a two-stage process for expanding a pre-training corpus.
→ Stage one generates genre-audience pairs with a small LLM: for each document, five (genre, audience) pairs are created.
→ In stage two, another small LLM reformulates the original document into five new documents, one per genre-audience pair. This expands the corpus while maintaining contextual diversity (see the pipeline sketch after this list).
→ Reformulated text quality is controlled by "Limited Consistency", which balances variation against information preservation: a score measures how well the reformulated text retains the original's information while still permitting stylistic changes.
→ Both stages use task-specific tool models, quantized to W8A8 for inference efficiency (a minimal illustration of the W8A8 idea follows the list).
→ A final cleaning stage filters out high-frequency boilerplate text and documents with low keyword coverage (sketched after the list).
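To make the two-stage flow concrete, here is a minimal Python sketch. The prompts, the `llm_generate` stub, and the `maga_expand` helper are illustrative assumptions; the paper uses fine-tuned, task-specific tool models rather than these generic prompts.

```python
# Sketch of MAGA's two-stage expansion (prompts and helpers are hypothetical).
import json

def llm_generate(prompt: str) -> str:
    """Stub for a call to a small instruction-tuned LLM (in the paper, a
    task-specific W8A8-quantized tool model). Wire up a real client here."""
    raise NotImplementedError

GA_PROMPT = """Read the document and propose 5 distinct (genre, audience)
pairs suited to its content. Return a JSON list:
[{{"genre": "...", "audience": "..."}}, ...]

Document:
{doc}"""

REFORM_PROMPT = """Rewrite the document as a "{genre}" aimed at "{audience}".
Keep the core facts and knowledge; style and framing may change freely.

Document:
{doc}"""

def maga_expand(doc: str) -> list[str]:
    # Stage 1: one small-LLM call yields five genre-audience pairs.
    pairs = json.loads(llm_generate(GA_PROMPT.format(doc=doc)))
    # Stage 2: reformulate the document once per pair -> five new documents.
    return [
        llm_generate(REFORM_PROMPT.format(genre=p["genre"],
                                          audience=p["audience"],
                                          doc=doc))
        for p in pairs[:5]
    ]
```

In a full pipeline, `maga_expand` would run over every document of the 195B-token source corpus, yielding up to five reformulations per document before cleaning.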
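The tool models are served in W8A8, i.e. 8-bit weights and 8-bit activations. As a rough illustration of the idea (not the paper's actual quantization scheme, which would typically use per-channel scales and calibration), a symmetric per-tensor int8 matmul looks like this:

```python
import numpy as np

def quant_int8(x: np.ndarray):
    """Symmetric per-tensor int8 quantization: int8 values plus a scale."""
    scale = max(float(np.abs(x).max()) / 127.0, 1e-12)
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def w8a8_matmul(act: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """W8A8 matmul: int8 weights and activations, int32 accumulation,
    dequantized by the product of the two scales."""
    qa, sa = quant_int8(act)
    qw, sw = quant_int8(weight)
    return (qa.astype(np.int32) @ qw.astype(np.int32)).astype(np.float32) * (sa * sw)
```

Quantizing both operands roughly halves memory traffic versus weight-only INT8 and lets inference use integer matmul kernels, which is why it suits high-throughput synthesis.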
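Below is a hedged sketch of what the cleaning stage could look like; the thresholds and the keyword-coverage heuristic are illustrative assumptions, not the paper's exact filters.

```python
# Illustrative cleaning: drop boilerplate lines and low-coverage reformulations.
from collections import Counter

def strip_boilerplate(docs: list[str], max_line_frac: float = 0.01) -> list[str]:
    """Remove lines that recur in more than max_line_frac of all documents,
    treating them as high-frequency boilerplate injected by the generator."""
    line_counts = Counter(line for doc in docs for line in set(doc.splitlines()))
    cutoff = max_line_frac * len(docs)
    return ["\n".join(l for l in doc.splitlines() if line_counts[l] <= cutoff)
            for doc in docs]

def keyword_coverage(original: str, reformulated: str, top_k: int = 20) -> float:
    """Fraction of the original's most frequent content words that also
    appear in the reformulation -- a crude information-preservation proxy."""
    def top_words(text: str) -> set[str]:
        words = [w.lower() for w in text.split() if len(w) > 3]
        return {w for w, _ in Counter(words).most_common(top_k)}
    orig = top_words(original)
    return len(orig & top_words(reformulated)) / max(len(orig), 1)

def filter_by_coverage(pairs, min_coverage: float = 0.3):
    """Keep only (original, reformulated) pairs above a coverage threshold."""
    return [(o, r) for o, r in pairs if keyword_coverage(o, r) >= min_coverage]
```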
-----
Key Insights 💡:
→ MAGA effectively expands pre-training datasets by 3.9 times while maintaining diversity: from an initial 195 billion token corpus it generates the 770 billion token MAGACorpus (195B × ~3.9 ≈ 770B).
→ Models pre-trained with MAGACorpus show consistent performance improvements across various model sizes (134M to 13B parameters) compared to models trained on the original corpus alone.
→ Synthetic data expansion via MAGA scales more effectively with increasing model size than simply repeating or upsampling existing data.
→ Validation loss may not be a reliable indicator of model collapse when training on synthetic data: increased validation loss with MAGACorpus does not necessarily imply performance degradation and may instead reflect a shift toward learning more generalizable patterns.
-----
Results 📊:
→ MAGA-Mix models outperform baseline models by +0.26, +0.95, and +2.15 average benchmark points for the 134M, 377M, and 1.7B parameter models, respectively.
→ MAGA-Mix achieves substantial gains in TriviaQA (+2.03, +6.99, +15.47) and GSM8K (+0.15, +0.22, +6.06) benchmarks for 134M, 377M, and 1.7B parameter models respectively.
→ In scaling experiments up to 13B parameters, MAGA expansion demonstrates superior scaling characteristics, with performance gains amplifying with increasing model scale (+1.46, +2.67, +3.59, +3.73) compared to upsampling (+0.89, +1.53, +1.23, +1.41).