AITP (Aligning Instruction Tuning with Pre-training) enhances instruction-tuning datasets by incorporating underrepresented pre-training data, improving LLM performance.
-----
https://arxiv.org/abs/2501.09368
Original Problem 🤔:
→ Current instruction-tuning datasets are often narrowly focused and misaligned with the broader distributions of pre-training corpora.
→ This limits LLM generalization and effective use of pre-trained knowledge.
Solution in this Paper 💡:
→ AITP (Aligning Instruction Tuning with Pre-training) bridges this gap.
→ It identifies coverage shortfalls in instruction-tuning datasets by comparing their distribution to that of the pre-training corpus.
→ Underrepresented pre-training data is rewritten into high-quality instruction-response pairs.
→ These new pairs are merged back into the original dataset for fine-tuning, broadening coverage and improving alignment. The pipeline has three steps: generate a difference set via density comparison between the two distributions, rewrite the selected raw text into instruction-response pairs, and integrate those pairs into the tuning set (a sketch of the selection step follows below).
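A minimal sketch of the difference-set selection step, assuming the density comparison is done with kernel density estimates over sentence embeddings. The embedding model, bandwidth, and threshold below are illustrative assumptions, not the paper's exact configuration.

```python
from sklearn.neighbors import KernelDensity
from sentence_transformers import SentenceTransformer

def select_difference_set(pretrain_texts, instruct_texts, threshold=0.0):
    """Return pre-training texts underrepresented in the instruction-tuning set."""
    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # hypothetical embedding model
    pre_emb = encoder.encode(pretrain_texts, normalize_embeddings=True)
    ins_emb = encoder.encode(instruct_texts, normalize_embeddings=True)

    # Fit a density estimate for each distribution in embedding space.
    kde_pre = KernelDensity(bandwidth=0.2).fit(pre_emb)
    kde_ins = KernelDensity(bandwidth=0.2).fit(ins_emb)

    # A sample counts as "underrepresented" when it is far more likely under the
    # pre-training distribution than under the instruction-tuning one.
    gap = kde_pre.score_samples(pre_emb) - kde_ins.score_samples(pre_emb)
    return [text for text, g in zip(pretrain_texts, gap) if g > threshold]
```

The selected texts would then be rewritten into instruction-response pairs (for example, by prompting a strong LLM) before being mixed back into the tuning set.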
Key Insights from this Paper 🔑:
→ Pre-training corpora are a valuable resource for improving instruction tuning.
→ Aligning the distributions of instruction-tuning and pre-training data is crucial for unlocking the full potential of LLMs.
→ Adaptive data selection, controlled rewriting, and balanced integration all contribute to AITP's effectiveness (see the integration sketch below).
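A minimal sketch of balanced integration, assuming the rewritten pairs are capped at a target share of the final mixture so they complement rather than dominate the original instruction data. The 20% ratio is an illustrative assumption, not the paper's reported setting.

```python
import random

def integrate(original_pairs, rewritten_pairs, target_fraction=0.2, seed=0):
    """Mix rewritten instruction-response pairs into the original SFT set."""
    rng = random.Random(seed)
    # Cap the new pairs so they make up at most `target_fraction` of the mixture.
    max_new = int(len(original_pairs) * target_fraction / (1 - target_fraction))
    sampled = rng.sample(rewritten_pairs, min(max_new, len(rewritten_pairs)))
    mixture = original_pairs + sampled
    rng.shuffle(mixture)
    return mixture
```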
Results 💯:
→ AITP demonstrates consistent performance improvements on three fully open LLMs (OLMo, MAP-Neo, Pythia) across eight benchmarks.
→ Average absolute improvements across the eight benchmarks were 3.77 points for OLMo, 1.11 for MAP-Neo, and 0.97 for Pythia.