"MAGA: MAssive Genre-Audience Reformulation to Pretraining Corpus Expansion"
The podcast below on this paper was generated with Google's Illuminate.
https://arxiv.org/abs/2502.04235
The challenge for LLMs is the increasing scarcity of high-quality data needed for pre-training, which hinders further model scaling. This paper introduces MAGA, a method that expands pre-training corpora by reformulating existing text into diverse versions.
MAGA synthesizes new pre-training data from existing documents by adapting each one to different genres and audiences, expanding the dataset efficiently while maintaining data quality.
-----
📌 MAGA conditions reformulation on genre and audience, enabling targeted data augmentation that goes beyond simple rephrasing and produces diverse, contextually relevant pre-training data efficiently.
📌 MAGA's two-stage design is efficient: genre-audience pair generation first, then reformulation. Decomposing the synthesis task this way lets smaller models handle it effectively, reducing computational overhead.
📌 The "Limited Consistency" criterion balances variation against fidelity, which is crucial for synthetic data quality: it avoids model collapse while still enabling meaningful content adaptation for pre-training.
----------
Methods Explored in this Paper 🔧:
→ MAGA, or MAssive Genre-Audience reformulation, is a two-stage process for expanding a pre-training corpus.
→ Stage one generates genre-audience pairs with a small LLM: for each document, five (genre, audience) pairs are created.
→ In stage two, another small LLM reformulates the original document into five new documents, one per genre-audience pair. This expands the corpus while maintaining contextual diversity (see the pipeline sketch after this list).
→ Reformulated text quality is controlled by "Limited Consistency", which balances variation against information preservation: a score measures how well the reformulated text retains the original's information while still permitting stylistic changes.
→ Both stages use task-specific tool models, quantized to W8A8 for inference efficiency (a minimal illustration of the W8A8 idea follows the list).
→ A final cleaning stage filters out high-frequency boilerplate text and documents with low keyword coverage (sketched after the list).
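To make the two-stage flow concrete, here is a minimal Python sketch. The prompts, the `llm_generate` stub, and the `maga_expand` helper are illustrative assumptions; the paper uses fine-tuned, task-specific tool models rather than these generic prompts.

```python
# Sketch of MAGA's two-stage expansion (prompts and helpers are hypothetical).
import json

def llm_generate(prompt: str) -> str:
    """Stub for a call to a small instruction-tuned LLM (in the paper, a
    task-specific W8A8-quantized tool model). Wire up a real client here."""
    raise NotImplementedError

GA_PROMPT = """Read the document and propose 5 distinct (genre, audience)
pairs suited to its content. Return a JSON list:
[{{"genre": "...", "audience": "..."}}, ...]

Document:
{doc}"""

REFORM_PROMPT = """Rewrite the document as a "{genre}" aimed at "{audience}".
Keep the core facts and knowledge; style and framing may change freely.

Document:
{doc}"""

def maga_expand(doc: str) -> list[str]:
    # Stage 1: one small-LLM call yields five genre-audience pairs.
    pairs = json.loads(llm_generate(GA_PROMPT.format(doc=doc)))
    # Stage 2: reformulate the document once per pair -> five new documents.
    return [
        llm_generate(REFORM_PROMPT.format(genre=p["genre"],
                                          audience=p["audience"],
                                          doc=doc))
        for p in pairs[:5]
    ]
```

In a full pipeline, `maga_expand` would run over every document of the 195B-token source corpus, yielding up to five reformulations per document before cleaning.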
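The tool models are served in W8A8, i.e. 8-bit weights and 8-bit activations. As a rough illustration of the idea (not the paper's actual quantization scheme, which would typically use per-channel scales and calibration), a symmetric per-tensor int8 matmul looks like this:

```python
import numpy as np

def quant_int8(x: np.ndarray):
    """Symmetric per-tensor int8 quantization: int8 values plus a scale."""
    scale = max(float(np.abs(x).max()) / 127.0, 1e-12)
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def w8a8_matmul(act: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """W8A8 matmul: int8 weights and activations, int32 accumulation,
    dequantized by the product of the two scales."""
    qa, sa = quant_int8(act)
    qw, sw = quant_int8(weight)
    return (qa.astype(np.int32) @ qw.astype(np.int32)).astype(np.float32) * (sa * sw)
```

Quantizing both operands roughly halves memory traffic versus weight-only INT8 and lets inference use integer matmul kernels, which is why it suits high-throughput synthesis.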
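Below is a hedged sketch of what the cleaning stage could look like; the thresholds and the keyword-coverage heuristic are illustrative assumptions, not the paper's exact filters.

```python
# Illustrative cleaning: drop boilerplate lines and low-coverage reformulations.
from collections import Counter

def strip_boilerplate(docs: list[str], max_line_frac: float = 0.01) -> list[str]:
    """Remove lines that recur in more than max_line_frac of all documents,
    treating them as high-frequency boilerplate injected by the generator."""
    line_counts = Counter(line for doc in docs for line in set(doc.splitlines()))
    cutoff = max_line_frac * len(docs)
    return ["\n".join(l for l in doc.splitlines() if line_counts[l] <= cutoff)
            for doc in docs]

def keyword_coverage(original: str, reformulated: str, top_k: int = 20) -> float:
    """Fraction of the original's most frequent content words that also
    appear in the reformulation -- a crude information-preservation proxy."""
    def top_words(text: str) -> set[str]:
        words = [w.lower() for w in text.split() if len(w) > 3]
        return {w for w, _ in Counter(words).most_common(top_k)}
    orig = top_words(original)
    return len(orig & top_words(reformulated)) / max(len(orig), 1)

def filter_by_coverage(pairs, min_coverage: float = 0.3):
    """Keep only (original, reformulated) pairs above a coverage threshold."""
    return [(o, r) for o, r in pairs if keyword_coverage(o, r) >= min_coverage]
```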
-----
Key Insights 💡:
→ MAGA effectively expands pre-training datasets by 3.9 times while maintaining diversity: from an initial 195 billion token corpus it generates the 770 billion token MAGACorpus (195B × ~3.9 ≈ 770B).
→ Models pre-trained with MAGACorpus show consistent performance improvements across various model sizes (134M to 13B parameters) compared to models trained on the original corpus alone.
→ Synthetic data expansion via MAGA scales more effectively with increasing model size than simply repeating or upsampling existing data.
→ Validation loss may not be a reliable indicator of model collapse when training on synthetic data: increased validation loss with MAGACorpus does not necessarily imply performance degradation and may instead reflect a shift toward learning more generalizable patterns.
-----
Results 📊:
→ MAGA-Mix models outperform baseline models by +0.26, +0.95, and +2.15 average benchmark points for the 134M, 377M, and 1.7B parameter models, respectively.
→ MAGA-Mix achieves substantial gains in TriviaQA (+2.03, +6.99, +15.47) and GSM8K (+0.15, +0.22, +6.06) benchmarks for 134M, 377M, and 1.7B parameter models respectively.
→ In scaling experiments up to 13B parameters, MAGA expansion demonstrates superior scaling characteristics, with performance gains amplifying with increasing model scale (+1.46, +2.67, +3.59, +3.73) compared to upsampling (+0.89, +1.53, +1.23, +1.41).