LLMs become their own code reviewers by writing verification programs to catch mistakes
ProgCo enables LLMs to verify and fix their own mistakes through self-generated programs, significantly improving accuracy across complex tasks.
-----
https://arxiv.org/abs/2501.01264
🤔 Original Problem:
LLMs struggle to self-correct, especially on complex reasoning tasks. Existing methods often fail to detect errors and can provide misleading feedback, steering revisions in the wrong direction.
-----
💡 Solution in this Paper:
→ ProgCo introduces program-driven verification (ProgVe), in which the LLM generates verification pseudo-programs and then executes them itself to validate its own outputs (a loop in this spirit is sketched after this list)
→ These programs can express complex verification logic beyond simple checklists
→ Program-driven refinement (ProgRe) refines both the response and the verification program in tandem
→ The system uses contrast-based refinement to avoid misleading feedback
→ Integration with Python tools enhances verification capabilities for numerical operations
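
To make the pipeline concrete, here is a minimal sketch of a verify-then-refine loop in the spirit of ProgCo. The function names, prompt wording, and the contrast heuristic are assumptions for illustration, not the paper's exact algorithm; `llm` stands in for any prompt-in, text-out model call.

```python
from typing import Callable

def progco(task: str, llm: Callable[[str], str], max_rounds: int = 3) -> str:
    """Verify-then-refine loop in the spirit of ProgCo (illustrative only)."""
    response = llm(f"Solve the task:\n{task}")
    # ProgVe: the model writes its own verification pseudo-program.
    program = llm(f"Write a step-by-step verification program "
                  f"that checks a candidate solution to:\n{task}")
    for _ in range(max_rounds):
        # The LLM acts as executor of its own pseudo-program, which lets
        # it fold world knowledge into the checks.
        verdict = llm(f"Execute this verification program on the response, "
                      f"show the trace, then output PASS or FAIL.\n"
                      f"Program:\n{program}\n\nResponse:\n{response}")
        if "PASS" in verdict:
            return response
        # ProgRe, response side: revise the answer using the failure trace.
        revised = llm(f"Task:\n{task}\n\nCurrent response:\n{response}\n\n"
                      f"Verification feedback:\n{verdict}\n\nRevise the response.")
        # ProgRe, program side (one plausible contrast heuristic): if the
        # revision did not change the answer, suspect the verifier and
        # refine the program instead of the response.
        if revised.strip() == response.strip():
            program = llm(f"The verification program may itself be wrong.\n"
                          f"Program:\n{program}\n\nTrace:\n{verdict}\n\nFix it.")
        else:
            response = revised
    return response
```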
-----
🎯 Key Insights:
→ Programs express verification logic more effectively than natural-language checklists (see the example after this list)
→ LLMs can act as executors of these programs while folding in their own knowledge
→ Dual refinement of responses and programs improves accuracy
→ Contrast-based refinement helps avoid error propagation
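
As an illustration of the first insight, here is the kind of check a self-generated verification program can express precisely where a natural-language checklist stays vague. The instruction, function name, and checks are invented for this example; in ProgVe the program is pseudo-code executed by the LLM itself rather than real Python.

```python
# Hypothetical IFEval-style instruction: "Answer in under 100 words and
# mention the keyword 'entropy' at least twice."

def verify(response: str) -> dict:
    words = response.split()
    checks = {
        "under_100_words": len(words) < 100,
        "mentions_entropy_twice": response.lower().count("entropy") >= 2,
    }
    checks["pass"] = all(checks.values())  # overall verdict
    return checks

print(verify("Entropy measures disorder. Higher entropy, more microstates."))
# {'under_100_words': True, 'mentions_entropy_twice': True, 'pass': True}
```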
-----
📊 Results:
→ Improved GPT-3.5 by 4.62% on IFEval prompt-level (Pr) and 3.23% on instruction-level (Ins) accuracy
→ Strengthened mathematical reasoning with a 5.84% gain on GSM8K and 5.8% on MATH
→ Consistently outperformed baseline methods across all benchmarks
→ Further improvements when combined with Python executor tools