"OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models"

The podcast on this paper is generated with Google's Illuminate.

The first open code LLM to release its entire training pipeline and reproducible training datasets

https://arxiv.org/abs/2411.04905

🎯 Original Problem:

Code LLMs lack transparency in training data and protocols, limiting the research community's ability to establish strong baselines and gain deeper insights.

-----

🛠️ Solution in this Paper:

→ Introduces OpenCoder, a fully transparent code LLM with complete training data, processing pipeline, and protocols

→ Implements a sophisticated data processing pipeline called RefineCode, yielding 960B tokens across 607 programming languages

→ Uses aggressive file-level deduplication and language-specific filtering rules (a minimal deduplication sketch follows this list)

→ Employs an annealing phase with high-quality synthetic data, followed by two-stage instruction tuning
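
A minimal sketch of the file-level exact deduplication idea, assuming content hashing over individual source files; the function name and flow are illustrative and not the released RefineCode implementation, which also applies fuzzy deduplication and per-language filtering rules.

```python
import hashlib
from pathlib import Path

def dedup_files_exact(paths):
    """Keep only the first occurrence of each distinct file content.
    Illustrative sketch of file-level exact deduplication."""
    seen = set()
    kept = []
    for path in paths:
        digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
        if digest not in seen:  # duplicate file contents are dropped, regardless of repo
            seen.add(digest)
            kept.append(path)
    return kept

# Usage: unique = dedup_files_exact(["repoA/utils.py", "repoB/utils.py"])
```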

-----

💡 Key Insights:

→ File-level deduplication outperforms the repository-level approach for maintaining data diversity

→ GitHub star-based filtering can reduce data diversity and skew the data distribution

→ In the annealing phase, data quality matters more than quantity

→ Two-stage instruction tuning improves both theoretical and practical coding tasks (sketched below)
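
A rough, framework-agnostic sketch of sequential two-stage instruction tuning; the dataset names, hyperparameters, and the `finetune` callable are hypothetical placeholders, not the paper's exact recipe.

```python
# Stage 1: large, broad synthetic instruction data (theory-oriented).
# Stage 2: smaller, high-quality code-centric instruction data (practical tasks).
STAGES = [
    {"data": "broad_synthetic_instructions.jsonl", "epochs": 1, "lr": 2e-5},
    {"data": "high_quality_code_instructions.jsonl", "epochs": 2, "lr": 1e-5},
]

def run_two_stage_sft(model, finetune):
    """Fine-tune sequentially, so the later, higher-quality stage has the
    final influence on the model's coding behavior."""
    for stage in STAGES:
        model = finetune(model, data=stage["data"],
                         epochs=stage["epochs"], learning_rate=stage["lr"])
    return model
```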

-----

📊 Results:

→ OpenCoder-8B achieves 83.5% pass@1 on the HumanEval benchmark (the pass@k metric is sketched after this list)

→ Surpasses all previous fully open models at the 6B+ parameter scale

→ Training on RefineCode shows superior efficiency compared to training on The Stack v2
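
For reference, pass@1 reports the fraction of problems solved; the standard unbiased pass@k estimator from the Codex paper (Chen et al., 2021) is sketched below as a small helper, not code from OpenCoder.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: n samples per problem, c of them correct,
    k samples allowed. Returns the probability that at least one of k
    randomly chosen samples passes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: with 200 samples and 167 passing, pass@1 = 167/200 = 0.835 (83.5%).
print(pass_at_k(200, 167, 1))
```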
