Code improves non-code tasks. 🔥
Adding 25% code to LLM pre-training data yields better language models across the board, not just for coding. Code seems to teach LLMs to reason more logically, lifting their general capabilities.
--------
https://arxiv.org/pdf/2408.10914
🔍 Key insights from the paper:
• Optimal code proportion: 25% code in the pre-training mix yields the best overall performance (a toy sketch of such a mixture follows after this list). One caveat: the study doesn't test proportions between 0% and 25%, so 10% or 20% might turn out to be an even better sweet spot. It also leaves out math benchmarks, which could have provided additional insight.
• Linear code performance: Code task performance increases linearly with the proportion of code data, explaining the effectiveness of dedicated code models.
• Impact of high-quality code: Even a small amount (10%) of high-quality synthetic code data significantly boosts performance, with 9% gain in NL reasoning and 44.9% in code benchmarks. This highlights the importance of data quality over quantity.
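To make the 25% figure concrete, here is a minimal Python sketch of assembling a pre-training mixture with a fixed code fraction. The function name, toy document lists, and sampling scheme are illustrative assumptions, not the paper's actual data pipeline.

```python
import random

def build_mixture(code_docs, text_docs, code_fraction=0.25, total_docs=1000, seed=0):
    """Sample a pre-training mixture with a fixed code proportion.

    Hypothetical helper: the paper fixes the code share of the data mix
    (25% performed best overall); the real sampling details differ.
    """
    rng = random.Random(seed)
    n_code = int(total_docs * code_fraction)
    n_text = total_docs - n_code
    mixture = (
        [rng.choice(code_docs) for _ in range(n_code)]
        + [rng.choice(text_docs) for _ in range(n_text)]
    )
    rng.shuffle(mixture)
    return mixture

# Toy usage: 25% code, 75% natural language
code_docs = ["def add(a, b):\n    return a + b"]
text_docs = ["The quick brown fox jumps over the lazy dog."]
batch = build_mixture(code_docs, text_docs, code_fraction=0.25, total_docs=8)
print(sum(doc in code_docs for doc in batch) / len(batch))  # -> 0.25
```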
Results📊:
• Best variant: 8.2% increase in NL reasoning, 4.2% in world knowledge
• 6.6% improvement in generative win-rates
• 12x boost in code performance
• Cooldown with code: 3.6% gain in NL reasoning, 10.1% in world knowledge, 20% in code (see the cooldown sketch below)
• High-quality synthetic code (10% of data): 9% gain in NL reasoning, 44.9% in code benchmarks
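For context on the cooldown result: a cooldown is a final pre-training phase where the learning rate is annealed while the data mix is shifted toward higher-quality sources, here keeping code in the mix. The schedule and mixture weights below are illustrative assumptions, not the paper's exact recipe.

```python
def cooldown_lr(step, cooldown_steps, peak_lr=3e-4, final_lr=3e-5):
    """Linearly anneal the learning rate over the cooldown phase.

    Hypothetical values: peak/final LRs and the linear schedule are
    placeholders for whatever the actual training run used.
    """
    frac = min(step / cooldown_steps, 1.0)
    return peak_lr + frac * (final_lr - peak_lr)

# Assumed cooldown mix: code stays in (or is upweighted within) the data blend
COOLDOWN_MIX = {"high_quality_text": 0.55, "code": 0.20, "web_text": 0.25}

for step in range(0, 1001, 250):
    print(step, round(cooldown_lr(step, cooldown_steps=1000), 6))
```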