ReFT: Leveraging reinforcement learning to expand LLMs' mathematical problem-solving capabilities.
Improves over Supervised Fine-Tuning (SFT) on Chain-of-Thought (CoT) data.
📚 https://arxiv.org/pdf/2401.08967
Original Problem 🔍:
SFT on Chain-of-Thought (CoT) annotations for math problem-solving generalizes poorly because the model learns from only a single annotated reasoning path per question.
-----
Solution in this Paper 🧠:
• Reinforced Fine-Tuning (ReFT) approach:
- Warm-up stage with SFT
- Online reinforcement learning using PPO
- Samples multiple reasoning paths
- Rewards derived automatically from ground-truth answers (reward sketch after this list)
• Applies to both natural language and program-based CoTs
• Compatible with majority voting and reward model reranking
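A minimal sketch of how the terminal reward might be computed from a sampled CoT and the ground-truth answer. The "The answer is" extraction pattern and the 0.1 partial-reward value are illustrative assumptions for this sketch, not a verbatim reproduction of the paper's implementation; the PPO update itself is not shown.

```python
import re

def reft_reward(sampled_cot: str, ground_truth: str) -> float:
    """Terminal reward for a sampled chain-of-thought: full reward if the
    extracted final answer matches the ground truth, a small partial reward
    if an answer is extractable but wrong, zero otherwise."""
    # Assumes N-CoT outputs end with "The answer is <value>" (illustrative).
    match = re.search(r"The answer is\s*(-?[\d.,]+)", sampled_cot)
    if match is None:
        return 0.0   # no parsable final answer
    predicted = match.group(1).replace(",", "").rstrip(".")
    if predicted == ground_truth.replace(",", ""):
        return 1.0   # correct final answer
    return 0.1       # parsable but incorrect (assumed partial-reward value)

# Hypothetical use in the online RL stage: sample several CoTs per question,
# score each against the ground-truth answer, then feed rewards to PPO (not shown).
ground_truth_answer = "72"
samples = [
    "48 / 2 = 24 clips in May. 48 + 24 = 72. The answer is 72",
    "48 * 2 = 96. The answer is 96",
]
print([reft_reward(cot, ground_truth_answer) for cot in samples])  # [1.0, 0.1]
```

Because the reward comes directly from the dataset's final answers, no separate reward model has to be trained.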
-----
Key Insights from this Paper 💡:
• ReFT learns from multiple CoT paths, improving generalization
• No need for extra training data or reward models
• Outperforms SFT and self-training baselines
• Effective on small models and various datasets
• A KL penalty against the warm-up (SFT) policy is crucial for policy stability (sketch after this list)
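A minimal sketch of how that KL term can enter the PPO stage, assuming the common RLHF-style shaping where a per-token KL penalty against the frozen warm-up policy is combined with the answer reward on the final token; the beta value and the log-prob-difference approximation of KL are illustrative assumptions.

```python
import torch

def kl_penalized_rewards(logprobs_policy: torch.Tensor,
                         logprobs_ref: torch.Tensor,
                         terminal_reward: float,
                         beta: float = 0.01) -> torch.Tensor:
    """Per-token reward signal for the PPO stage: a KL penalty against the
    frozen warm-up (SFT) policy at every token, plus the ground-truth answer
    reward on the last generated token."""
    # Per-token KL is approximated by the log-prob gap of the sampled tokens.
    kl = logprobs_policy - logprobs_ref            # shape: (seq_len,)
    rewards = -beta * kl                           # discourage drifting from the SFT policy
    rewards[-1] = rewards[-1] + terminal_reward    # answer reward at the final token
    return rewards

# Example with dummy log-probs for a 3-token continuation.
lp_pi  = torch.tensor([-1.2, -0.8, -2.0])
lp_ref = torch.tensor([-1.0, -0.9, -1.5])
print(kl_penalized_rewards(lp_pi, lp_ref, terminal_reward=1.0))
```

The penalty keeps the sampled reasoning paths close to the warm-up policy, which prevents the sparse answer reward from collapsing generation quality.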
-----
Results 📊:
• ReFT outperforms SFT on GSM8K, SVAMP, and MathQA datasets
• 9-point improvement on GSM8K N-CoT with CodeLLAMA
• 3.7-point average improvement on N-CoT across datasets
• 5.9-point average improvement on P-CoT across datasets
• 79.3% accuracy on GSM8K with CodeLLAMA + ReFT + Reranking (P-CoT)
• Surpasses GPT-3.5-turbo performance with only 7B parameters