
"A Graph-Based Synthetic Data Pipeline for Scaling High-Quality Reasoning Instructions"

The podcast on this paper is generated with Google's Illuminate.

The Graph-based Synthetic Data Pipeline (GSDP) turns math knowledge into graphs to generate nearly two million unique practice problems.

This novel graph-based pipeline generates high-quality math reasoning data by exploring relationships between knowledge points, achieving 255x data expansion at less than 1% of the cost of GPT-4.

-----

https://arxiv.org/abs/2412.08864

🤔 Original Problem:

→ Current synthetic data approaches for mathematical reasoning scale poorly, incur high costs by relying on GPT models, and generate problems that are too similar to the seed data.

-----

🔧 Solution in this Paper:

→ The Graph-based Synthetic Data Pipeline (GSDP) extracts knowledge points from seed math problems using specialized models.

→ It builds a Knowledge Point Relationships Graph (KPRG) where nodes are knowledge concepts and edges show their connections.

→ The system explores both explicit relationships (directly connected points) and implicit relationships (points connected through multiple hops).

→ Multiple open-source models jointly evaluate and filter the generated problems and solutions.
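
To make the explicit versus implicit distinction concrete, here is a minimal sketch of the graph step. It is not the paper's implementation. It assumes that an edge joins two knowledge points whenever they co-occur in the same seed problem, that "implicit" means reachable in two or more hops up to a cutoff, and that knowledge points have already been extracted. The function names (`build_kprg`, `hop_distances`, `relationship_pairs`) and the toy seed data are hypothetical.

```python
from collections import defaultdict, deque
from itertools import combinations


def build_kprg(problem_kps):
    """Build an undirected graph of knowledge points.

    Assumption: two knowledge points share an edge if they co-occur
    in at least one seed problem.
    """
    graph = defaultdict(set)
    for kps in problem_kps:
        for a, b in combinations(sorted(set(kps)), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def hop_distances(graph, start, max_hops):
    """Breadth-first search: map each node within max_hops of start to its hop count."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == max_hops:
            continue  # do not expand beyond the hop cutoff
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def relationship_pairs(graph, max_hops=2):
    """Split knowledge-point pairs into explicit (1 hop) and implicit (2..max_hops hops)."""
    explicit, implicit = set(), set()
    for node in list(graph):
        for other, d in hop_distances(graph, node, max_hops).items():
            if d == 0:
                continue  # skip the start node itself
            pair = tuple(sorted((node, other)))
            (explicit if d == 1 else implicit).add(pair)
    return explicit, implicit


# Toy seed data (hypothetical): each inner list holds one problem's knowledge points.
seed = [
    ["quadratic equations", "factoring"],
    ["factoring", "polynomial division"],
    ["polynomial division", "remainder theorem"],
]
graph = build_kprg(seed)
explicit, implicit = relationship_pairs(graph, max_hops=2)
print(sorted(explicit))
print(sorted(implicit))
```

In this toy chain, the three seed problems yield three explicit pairs, and the two-hop search adds two pairs that never appear together in any seed problem, such as factoring with the remainder theorem. Each pair is a candidate combination for synthesizing a new problem, which is why widening the hop cutoff grows the candidate pool faster than the seed set alone.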

-----

💡 Key Insights:

→ Implicit knowledge relationships enable much higher data expansion than explicit ones

→ Using open-source models for synthesis and evaluation drastically reduces costs

→ Graph-based approach generates more diverse problems than seed-based methods

-----

📊 Results:

→ Generated 1.91M high-quality math problems from just 7.5K seed problems (255x expansion)

→ Achieved synthesis quality comparable to GPT-4 at less than 1% of the cost

→ GSDP-7B model reached 37.7% accuracy on MATH and 78.4% on GSM8K benchmarks
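
As a quick sanity check on the expansion figure, the ratio can be recomputed from the two counts quoted above. Both counts are the rounded values from this summary, so the result is approximate.

```python
# Rounded counts as quoted in the results above.
seed_problems = 7_500
generated_problems = 1_910_000

expansion = generated_problems / seed_problems
print(f"{expansion:.1f}x")  # 254.7x, which rounds to the reported 255x
```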
