
To Code, or Not To Code? Exploring Impact of Code in Pre-training

This podcast was generated with Google's Illuminate.

Code improves non-code tasks. 🔥

Adding 25% code to LLM pre-training data produces better language models across the board, not just for coding. Code appears to teach LLMs to reason more logically, improving their overall capabilities.

--------

https://arxiv.org/pdf/2408.10914

🔍 Key insights from the paper:

• Optimal code proportion: 25% code in the pre-training data yields the best overall performance. However, the study lacks granularity between 0% and 25%; exploring 10% or 20% could reveal a more precise optimum (see the mixture-sampling sketch after this list).

The paper also doesn't evaluate math benchmarks, which could provide additional insight.

• Linear code performance: Code task performance increases linearly with the proportion of code data, explaining the effectiveness of dedicated code models.

• Impact of high-quality code: Even a small amount (10%) of high-quality synthetic code data significantly boosts performance, with 9% gain in NL reasoning and 44.9% in code benchmarks. This highlights the importance of data quality over quantity.
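
The data-mixing idea behind these proportions can be illustrated with a small sketch. This is a minimal, hypothetical example of drawing a pre-training sample at a target code fraction; the function and variable names are my own for illustration and are not from the paper or its codebase.

import random

def sample_mixture(text_docs, code_docs, code_fraction=0.25, n_samples=1_000, seed=0):
    """Draw a pre-training sample where roughly `code_fraction` of documents are code.

    Hypothetical helper for illustration only; the paper studies mixtures
    such as 25% code / 75% text, but this is not its actual pipeline.
    """
    rng = random.Random(seed)
    mixture = []
    for _ in range(n_samples):
        source = code_docs if rng.random() < code_fraction else text_docs
        mixture.append(rng.choice(source))
    return mixture

# Example: the 25% code mixture reported as the best overall trade-off.
text_docs = ["a natural-language document"] * 100
code_docs = ["a source-code file"] * 100
batch = sample_mixture(text_docs, code_docs, code_fraction=0.25)

Sweeping code_fraction over values like 0.10 or 0.20 is exactly the finer-grained experiment suggested in the first bullet above.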

Results📊:

• Best variant: 8.2% increase in NL reasoning, 4.2% in world knowledge

• 6.6% improvement in generative win-rates

• 12x boost in code performance

• Cooldown with code: 3.6% gain in NL reasoning, 10.1% in world knowledge, 20% in code (a cooldown sketch follows these results)

• High-quality synthetic code (10% of data): 9% gain in NL reasoning, 44.9% in code benchmarks
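
For the cooldown result, here is a minimal sketch of what "cooldown with code" could look like: a final training phase with a decaying learning rate whose batches still include code. The schedule, learning rates, and mixture weights below are assumptions for illustration, not the paper's actual configuration.

def cooldown_lr(step, cooldown_steps, peak_lr=3e-4, final_lr=3e-5):
    """Linearly anneal the learning rate over the final cooldown phase.

    Assumed schedule for illustration; the paper's exact cooldown
    hyperparameters are not reproduced here.
    """
    frac = min(step / cooldown_steps, 1.0)
    return peak_lr + frac * (final_lr - peak_lr)

# Illustrative cooldown mixture that keeps code in the final phase
# (weights are assumptions, not the paper's reported values).
cooldown_mixture = {"web_text": 0.55, "code": 0.25, "synthetic_code": 0.20}

for step in range(0, 1_001, 250):
    print(step, round(cooldown_lr(step, cooldown_steps=1_000), 6))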
