First open code LLM to reveal its entire training pipeline and reproducible datasets
https://arxiv.org/abs/2411.04905
🎯 Original Problem:
Code LLMs lack transparency in training data and protocols, limiting the research community's ability to establish strong baselines and gain deeper insights.
-----
🛠️ Solution in this Paper:
→ Introduces OpenCoder, a fully transparent code LLM released with its complete training data, data-processing pipeline, and training protocols
→ Implements a data-processing pipeline, RefineCode, that yields 960B tokens across 607 programming languages
→ Uses aggressive file-level deduplication and language-specific filtering rules (sketched below)
→ Employs two-stage instruction tuning with an annealing phase that uses high-quality synthetic data
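
Below is a minimal sketch of what exact file-level deduplication plus one language-specific filter could look like. The function names, the SHA-256 choice, and the long-line threshold are illustrative assumptions, not the paper's released RefineCode implementation, which applies many more rules.

```python
import hashlib
from pathlib import Path

def file_sha256(path: Path) -> str:
    """Hash the raw file bytes so exact duplicates collapse to one key."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def looks_generated(path: Path, max_line_len: int = 1000) -> bool:
    """Illustrative language-specific rule: flag files with very long lines
    (often minified or auto-generated code). Threshold is an assumption."""
    lines = path.read_text(errors="ignore").splitlines() or [""]
    return max(len(line) for line in lines) >= max_line_len

def dedup_files(paths: list[Path]) -> list[Path]:
    """Keep the first copy of each unique file that passes the filter."""
    seen: set[str] = set()
    kept: list[Path] = []
    for path in paths:
        if looks_generated(path):
            continue                      # rejected by a filtering rule
        digest = file_sha256(path)
        if digest in seen:
            continue                      # exact duplicate of a kept file
        seen.add(digest)
        kept.append(path)
    return kept
```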
-----
💡 Key Insights:
→ File-level deduplication outperforms repository-level deduplication at maintaining data diversity
→ Filtering repositories by GitHub stars can reduce data diversity and skew the overall distribution
→ In the annealing phase, data quality matters more than data quantity
→ Two-stage instruction tuning improves both theoretical and practical coding tasks
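
The two-stage idea can be pictured as two sequential passes over different data mixes, with a smaller, higher-quality mix in the second pass. The sketch below assumes a generic `train_fn` callable; the stage names, dataset paths, epoch counts, and learning rates are purely illustrative, not the paper's recipe.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Stage:
    name: str
    dataset: str   # identifier of the instruction-tuning data mix
    epochs: int
    lr: float

# Stage 1: large, broad instruction data for general coding knowledge.
# Stage 2: a smaller, carefully filtered high-quality mix at a lower
# learning rate, echoing the "quality over quantity" insight above.
SCHEDULE = [
    Stage("stage1_broad", "data/instructions_broad.jsonl", epochs=1, lr=2e-5),
    Stage("stage2_high_quality", "data/instructions_hq.jsonl", epochs=2, lr=5e-6),
]

def run_two_stage(model, train_fn: Callable):
    """Apply each stage in order, passing the updated model forward."""
    for stage in SCHEDULE:
        model = train_fn(model, dataset=stage.dataset,
                         epochs=stage.epochs, lr=stage.lr)
    return model
```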
-----
📊 Results:
→ OpenCoder-8B achieves 83.5% pass@1 on the HumanEval benchmark (pass@1 metric sketched below)
→ Surpasses all previous fully open models at the 6B+ parameter scale
→ Pretraining on RefineCode shows better training efficiency than pretraining on The Stack v2
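
For context, pass@1 on HumanEval is the fraction of problems for which a generated solution passes the unit tests. The general unbiased pass@k estimator below comes from the original HumanEval work, not from this paper:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# The benchmark score is the mean over problems. With one sample per
# problem (n=1, k=1) it is simply the fraction of problems solved.
per_problem = [pass_at_k(n=1, c=1, k=1), pass_at_k(n=1, c=0, k=1)]
print(sum(per_problem) / len(per_problem))  # 0.5 in this toy example
```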