First open code LLM to reveal its entire training pipeline and reproducible datasets
https://arxiv.org/abs/2411.04905
🎯 Original Problem:
Code LLMs lack transparency in training data and protocols, limiting the research community's ability to establish strong baselines and gain deeper insights.
-----
🛠️ Solution in this Paper:
→ Introduces OpenCoder, a fully transparent code LLM released with its complete training data, data-processing pipeline, and training protocols
→ Implements a data-processing pipeline, RefineCode, that yields 960B tokens across 607 programming languages
→ Uses aggressive file-level deduplication and language-specific filtering rules (sketched below)
→ Employs two-stage instruction tuning with an annealing phase that uses high-quality synthetic data
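
Below is a minimal sketch of what exact file-level deduplication plus one language-specific filter could look like. The function names, the SHA-256 choice, and the long-line threshold are illustrative assumptions, not the paper's released RefineCode implementation, which applies many more rules.

```python
import hashlib
from pathlib import Path

def file_sha256(path: Path) -> str:
    """Hash the raw file bytes so exact duplicates collapse to one key."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def looks_generated(path: Path, max_line_len: int = 1000) -> bool:
    """Illustrative language-specific rule: flag files with very long lines
    (often minified or auto-generated code). Threshold is an assumption."""
    lines = path.read_text(errors="ignore").splitlines() or [""]
    return max(len(line) for line in lines) >= max_line_len

def dedup_files(paths: list[Path]) -> list[Path]:
    """Keep the first copy of each unique file that passes the filter."""
    seen: set[str] = set()
    kept: list[Path] = []
    for path in paths:
        if looks_generated(path):
            continue                      # rejected by a filtering rule
        digest = file_sha256(path)
        if digest in seen:
            continue                      # exact duplicate of a kept file
        seen.add(digest)
        kept.append(path)
    return kept
```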
-----
💡 Key Insights:
→ File-level deduplication outperforms repository-level deduplication at maintaining data diversity
→ Filtering repositories by GitHub stars can reduce data diversity and skew the overall distribution
→ In the annealing phase, data quality matters more than data quantity
→ Two-stage instruction tuning improves both theoretical and practical coding tasks
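
The two-stage idea can be pictured as two sequential passes over different data mixes, with a smaller, higher-quality mix in the second pass. The sketch below assumes a generic `train_fn` callable; the stage names, dataset paths, epoch counts, and learning rates are purely illustrative, not the paper's recipe.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Stage:
    name: str
    dataset: str   # identifier of the instruction-tuning data mix
    epochs: int
    lr: float

# Stage 1: large, broad instruction data for general coding knowledge.
# Stage 2: a smaller, carefully filtered high-quality mix at a lower
# learning rate, echoing the "quality over quantity" insight above.
SCHEDULE = [
    Stage("stage1_broad", "data/instructions_broad.jsonl", epochs=1, lr=2e-5),
    Stage("stage2_high_quality", "data/instructions_hq.jsonl", epochs=2, lr=5e-6),
]

def run_two_stage(model, train_fn: Callable):
    """Apply each stage in order, passing the updated model forward."""
    for stage in SCHEDULE:
        model = train_fn(model, dataset=stage.dataset,
                         epochs=stage.epochs, lr=stage.lr)
    return model
```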
-----
📊 Results:
→ OpenCoder-8B achieves 83.5% pass@1 on the HumanEval benchmark (pass@1 metric sketched below)
→ Surpasses all previous fully open models at the 6B+ parameter scale
→ Pretraining on RefineCode shows better training efficiency than pretraining on The Stack v2
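
For context, pass@1 on HumanEval is the fraction of problems for which a generated solution passes the unit tests. The general unbiased pass@k estimator below comes from the original HumanEval work, not from this paper:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# The benchmark score is the mean over problems. With one sample per
# problem (n=1, k=1) it is simply the fraction of problems solved.
per_problem = [pass_at_k(n=1, c=1, k=1), pass_at_k(n=1, c=0, k=1)]
print(sum(per_problem) / len(per_problem))  # 0.5 in this toy example
```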