Overfitting, not underfitting, may be the key to better LLM text generation.
Overfitting an LLM on a small dataset until training loss approaches zero, termed "hyperfitting," counter-intuitively improves the quality of long-form text generated with greedy decoding.
-----
https://arxiv.org/abs/2412.04318
Original Problem 🤔:
→ LLMs, even large ones, often generate repetitive and uninteresting text, especially with greedy decoding.
-----
Solution in this Paper 💡:
→ Fine-tune a pre-trained LLM on a small dataset until training loss is near zero (hyperfitting); a minimal sketch follows this list.
→ Optionally block repetitions from the hyperfitting dataset during generation.
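The recipe itself is simple: keep fine-tuning on a tiny dataset until training loss is essentially zero, then decode greedily. Below is a minimal sketch using Hugging Face Transformers; the checkpoint name, placeholder data, epoch count, and learning rate are illustrative assumptions, not the paper's exact setup.

```python
# Minimal hyperfitting sketch (hyperparameters are assumptions, not the paper's exact setup).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "TinyLlama/TinyLlama_v1.1"   # assumed checkpoint id; any causal LM works for the sketch
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

# Hyperfitting deliberately uses a *small* dataset and memorizes it.
texts = ["...a few thousand short training texts go here..."]  # placeholder data
batches = [tokenizer(t, return_tensors="pt", truncation=True, max_length=256) for t in texts]

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()
for epoch in range(20):                               # many passes over the same small data
    total = 0.0
    for enc in batches:
        ids = enc["input_ids"].to(device)
        out = model(input_ids=ids, labels=ids)        # standard next-token cross-entropy loss
        out.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        total += out.loss.item()
    avg = total / len(batches)
    print(f"epoch {epoch}: train loss {avg:.4f}")
    if avg < 0.01:                                    # stop once training loss is near zero
        break

# Greedy decoding: do_sample=False picks the argmax token at every step.
model.eval()
prompt = tokenizer("Once upon a time", return_tensors="pt").input_ids.to(device)
out_ids = model.generate(prompt, max_new_tokens=200, do_sample=False)
print(tokenizer.decode(out_ids[0], skip_special_tokens=True))
```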
-----
Key Insights from this Paper 😲:
→ Greedy decoding from hyperfitted models often beats both larger models and nucleus sampling from the original model in human preference and output diversity.
→ Hyperfitted models have much sharper prediction distributions, concentrating most probability mass on a single token (see the sketch after this list).
→ The specific training data influences but does not fully determine hyperfitting outcomes.
→ Hyperfitting also improves autoregressive image generation quality and reduces repetition.
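One way to see the "sharper distributions" claim is to compare how much probability the model puts on its top-1 next-token prediction before and after hyperfitting. The helper below is a hypothetical illustration, not the paper's measurement code; `model` and `tokenizer` are assumed to be any loaded Hugging Face causal LM pair.

```python
# Sketch: how peaked is the next-token distribution?
import torch

@torch.no_grad()
def mean_top1_probability(model, tokenizer, text: str) -> float:
    """Average probability assigned to the most likely next token at each position."""
    ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
    logits = model(input_ids=ids).logits          # shape: (1, seq_len, vocab_size)
    probs = torch.softmax(logits, dim=-1)
    return probs.max(dim=-1).values.mean().item()

# Expectation: a hyperfitted model reports a value close to 1.0 on most text,
# while the original model spreads probability over many candidate tokens.
```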
-----
Results 📊:
→ Hyperfitted TinyLlama (1.1B) rises from 4.9% to 34.4% human preference, comparable to Llama 3.1 (70B).
→ Hyperfitted models show a higher average type-token ratio (TTR, indicating less repetition) than the original models: 60+ vs. 17-57 (a TTR sketch follows these results).
→ Despite this, hyperfitted models have far worse perplexity on held-out data, in the range 255-545.
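TTR here is the share of unique tokens among all tokens in a generated continuation, so higher means more lexical diversity and less repetition. A minimal version of the metric, assuming whitespace tokenization rather than the paper's exact tokenizer:

```python
# Sketch: type-token ratio (TTR) on a 0-100 scale; higher = less repetition.
def type_token_ratio(text: str) -> float:
    tokens = text.split()                      # assumed whitespace tokenization
    if not tokens:
        return 0.0
    return 100.0 * len(set(tokens)) / len(tokens)

print(type_token_ratio("the cat sat on the mat the cat sat"))  # repetitive text -> low TTR
```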