This paper explores enhancing Large Language Models (LLMs) for complex reasoning tasks by integrating reinforcement learning to automate the generation of high-quality reasoning data and by scaling computation during both training and testing.
-----
https://arxiv.org/abs/2501.09686
Original Problem 🤔:
→ LLMs struggle with complex reasoning tasks.
→ Human annotation for step-by-step reasoning data is expensive and hard to scale.
→ Traditional supervised fine-tuning falls short of fully developing reasoning capabilities.
-----
Solution in this Paper 💡:
→ The paper introduces a method to automate the creation of high-quality reasoning data.
→ It introduces the concept of a "thought": a sequence of tokens representing intermediate reasoning steps.
→ Reinforcement learning trains LLMs to master reasoning through trial-and-error search, generating high-quality reasoning trajectories that expand the training data.
→ Process Reward Models (PRMs) provide step-wise rewards that guide reinforcement learning (see the sketch after this list).
→ Encouraging LLMs to "think" with more tokens during inference boosts reasoning accuracy, a technique called test-time scaling.
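To make the PRM idea concrete, here is a minimal Python sketch of step-wise reward scoring. Every name in it (`Trajectory`, `step_rewards`, `toy_prm`) is an illustrative assumption, not the paper's implementation; a real PRM would be a trained model scoring each reasoning prefix.

```python
# Minimal sketch of process-level rewards with a PRM.
# All names here are illustrative assumptions, not the paper's code.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Trajectory:
    question: str
    steps: List[str]  # intermediate "thoughts", one per reasoning step

def step_rewards(traj: Trajectory,
                 score_step: Callable[[str, List[str]], float]) -> List[float]:
    """Score each prefix of the reasoning chain with a PRM.

    Unlike an outcome reward (one scalar for the final answer), the PRM
    returns a reward per step, so RL credit assignment can target the
    exact step where the chain goes wrong.
    """
    return [score_step(traj.question, traj.steps[: i + 1])
            for i in range(len(traj.steps))]

# Placeholder PRM: rewards steps that end with an explicit conclusion.
# A real PRM would be a trained classifier over (question, step prefix).
def toy_prm(question: str, prefix: List[str]) -> float:
    return 1.0 if prefix[-1].strip().endswith(".") else 0.5

if __name__ == "__main__":
    traj = Trajectory(
        question="What is 12 * 34?",
        steps=["12 * 34 = 12 * 30 + 12 * 4", "= 360 + 48", "= 408."],
    )
    print(step_rewards(traj, toy_prm))  # one reward per reasoning step
```

The design point is the reward shape: one reward per step rather than one per answer, which is what lets process-level supervision localize errors inside a long chain.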
-----
Key Insights from this Paper 🗝️:
→ Reinforcement learning can automate high-quality reasoning data generation, overcoming manual annotation limits.
→ Process-level supervision through PRMs is more effective than outcome-based rewards for complex reasoning.
→ Scaling computation during both training and testing enhances LLM reasoning.
→ Test-time scaling with PRM-guided search can significantly boost performance without any model changes (a minimal sketch follows this list).
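Below is a minimal sketch of one form of PRM-guided test-time scaling, best-of-N sampling, assuming a hypothetical sampler `generate` and scorer `prm_score` (neither comes from the paper): sample several reasoning chains and keep the one the PRM rates highest.

```python
# Minimal sketch of PRM-guided best-of-N sampling at test time.
# `generate` and `prm_score` are stand-ins for a real LLM sampler and a
# trained process reward model; neither is from the paper's code.
import random
from typing import Callable, List

def best_of_n(question: str,
              generate: Callable[[str], List[str]],   # returns reasoning steps
              prm_score: Callable[[str, List[str]], float],
              n: int = 8) -> List[str]:
    """Sample n reasoning chains and keep the one the PRM rates highest.

    No model weights change: accuracy is bought purely with extra
    inference-time compute, which is the test-time scaling idea.
    """
    candidates = [generate(question) for _ in range(n)]
    return max(candidates, key=lambda steps: prm_score(question, steps))

# Toy stand-ins so the sketch runs end to end.
def toy_generate(question: str) -> List[str]:
    k = random.randint(1, 4)
    return [f"step {i + 1} toward answering: {question}" for i in range(k)]

def toy_prm(question: str, steps: List[str]) -> float:
    return float(len(steps))  # toy heuristic: pretend longer chains are better

if __name__ == "__main__":
    best = best_of_n("What is 12 * 34?", toy_generate, toy_prm, n=4)
    print(best)
```

Beam search over partial chains is the other common PRM-guided variant; best-of-N is shown here only because it is the simplest way to trade inference compute for accuracy.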
-----
Results 💯:
→ OpenAI's o1 series achieves an 83.3% success rate in competitive programming.
→ o1 scores at the gold-medal level on the International Mathematical Olympiad.
→ o1 matches PhD-level performance on physics, chemistry, and biology questions.
-----
1ST SET OF HOOKS
Automated data and reinforcement learning combine to make LLMs think better.
Step-by-step rewards and more compute unlock LLM reasoning potential.
Scaling "thought" processes during training and testing enhances LLM reasoning.
Reinforcement learning and process-level supervision are key to better LLM reasoning.
2ND SET OF HOOKS
Making LLMs think harder with less human help.
LLMs get smarter when they learn from their own mistakes.
More thinking time equals better answers for LLMs.
Teach LLMs to think step-by-step, and they'll solve harder problems.