Graph-based Synthetic Data Pipeline (GSDP) turns math knowledge into graphs to generate millions of unique practice problems.
The pipeline generates high-quality math reasoning data by exploring relationships between knowledge points, achieving 255x data expansion at less than 1% of the cost of GPT-4-based synthesis.
-----
https://arxiv.org/abs/2412.08864
🤔 Original Problem:
→ Current synthetic data approaches for mathematical reasoning scale poorly, incur high costs by relying on GPT models, and generate problems too similar to the seed data.
-----
🔧 Solution in this Paper:
→ The Graph-based Synthetic Data Pipeline (GSDP) extracts knowledge points from seed math problems using specialized models.
→ It builds a Knowledge Point Relationships Graph (KPRG) where nodes are knowledge concepts and edges show their connections.
→ The system explores both explicit relationships (directly connected points) and implicit relationships (points connected through multiple hops).
→ Multiple open-source models jointly evaluate and filter the generated problems and solutions.
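The graph step above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: it assumes two knowledge points are linked when they co-occur in the same seed problem, and it treats "implicit" pairs as those reachable within a small hop limit but not directly connected. The function names, the `max_hops` parameter, and the example knowledge points are all hypothetical.

```python
from itertools import combinations


def build_kprg(problem_kps):
    """Build an undirected graph from per-problem knowledge-point lists.

    Assumption: two knowledge points share an edge if they appear
    together in at least one seed problem.
    """
    graph = {}
    for kps in problem_kps:
        for a, b in combinations(sorted(set(kps)), 2):
            graph.setdefault(a, set()).add(b)
            graph.setdefault(b, set()).add(a)
    return graph


def explicit_pairs(graph):
    """Pairs of knowledge points joined by a direct edge."""
    return {(a, b) for a in graph for b in graph[a] if a < b}


def implicit_pairs(graph, max_hops=2):
    """Pairs reachable within max_hops edges but not directly connected."""
    pairs = set()
    for start in graph:
        seen = {start}
        frontier = {start}
        # Breadth-first expansion, one hop per iteration.
        for _ in range(max_hops):
            frontier = {n for f in frontier for n in graph[f]} - seen
            seen |= frontier
        for other in seen - {start} - graph[start]:
            pairs.add(tuple(sorted((start, other))))
    return pairs


if __name__ == "__main__":
    # Hypothetical knowledge points extracted from three seed problems.
    seeds = [
        ["quadratic equations", "factoring"],
        ["factoring", "polynomial division"],
        ["polynomial division", "remainder theorem"],
    ]
    kprg = build_kprg(seeds)
    print("explicit:", sorted(explicit_pairs(kprg)))
    print("implicit:", sorted(implicit_pairs(kprg, max_hops=2)))
```

Each explicit or implicit pair would then be handed to a problem-synthesis model as a prompt ingredient. Raising the hop limit increases the number of implicit pairs, which is consistent with the post's point that implicit relationships drive most of the expansion.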
-----
💡 Key Insights:
→ Implicit knowledge relationships enable much higher data expansion than explicit ones
→ Using open-source models for synthesis and evaluation drastically reduces costs
→ Graph-based approach generates more diverse problems than seed-based methods
-----
📊 Results:
→ Generated 1.91M high-quality math problems from just 7.5K seed problems (255x expansion)
→ Achieved synthesis quality comparable to GPT-4 at less than 1% of the cost
→ GSDP-7B model reached 37.7% accuracy on MATH and 78.4% on GSM8K benchmarks
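As a quick check of the expansion factor, using the rounded counts quoted in this post (1.91M generated problems, 7.5K seed problems):

```latex
\frac{1.91 \times 10^{6}}{7.5 \times 10^{3}} \approx 254.7 \approx 255
```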