"SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models"

A podcast on this paper was generated with Google's Illuminate.

Self-play with tree-search refinement helps LLMs learn instruction-following.

SPaR introduces a self-play framework that improves LLMs' instruction-following by using tree-search refinement to minimize irrelevant variations in training pairs.

https://arxiv.org/abs/2412.11605

🤖 Original Problem:

→ Current methods for improving instruction-following in LLMs build training pairs from independently sampled responses, which introduces irrelevant variations (wording, content, style) that interfere with learning the key differences that determine success.

-----

🔧 Solution in this Paper:

→ SPaR employs a self-play framework where an LLM plays against itself in two roles: actor and refiner.

→ The actor generates responses while the refiner judges and refines these responses using tree-search strategies.

→ Tree-search refinement systematically explores improvement paths while minimizing unnecessary variations (a minimal sketch follows this list).

→ The framework iteratively trains both roles: Direct Preference Optimization (DPO) for the actor and Rejection-sampling Fine-Tuning (RFT) for the refiner (see the training sketch below).
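
A minimal sketch of what the tree-search refinement loop could look like, assuming a breadth-first strategy. `act`, `judge`, and `refine` are hypothetical stubs for the role-prompted model calls, not the paper's actual API:

```python
from collections import deque

# Hypothetical stubs; in SPaR both roles are the same LLM prompted
# differently (actor vs. refiner). Replace with real model calls.
def act(prompt: str) -> str:
    raise NotImplementedError  # actor: draft a response

def judge(prompt: str, response: str) -> bool:
    raise NotImplementedError  # refiner-as-judge: all constraints met?

def refine(prompt: str, response: str, k: int) -> list[str]:
    raise NotImplementedError  # refiner: k minimally edited fixes

def tree_search_refine(prompt: str, branch: int = 3, budget: int = 20):
    """Expand failed responses with small, targeted edits until one
    passes the judge. Because each child is a minimal edit of its
    parent, the resulting (failed, refined) pair differs mainly in
    the constraint that decides success."""
    root = act(prompt)
    if judge(prompt, root):
        return None                      # already correct: no pair here
    frontier, expanded = deque([root]), 0
    while frontier and expanded < budget:
        node = frontier.popleft()
        for child in refine(prompt, node, branch):
            if judge(prompt, child):
                return root, child       # (rejected, chosen) preference pair
            frontier.append(child)
        expanded += 1
    return None                          # search budget exhausted
```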
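
And a sketch of one training iteration under the same assumptions. `dpo_train` and `sft_train` are hypothetical helpers (in practice, trainers such as TRL's DPOTrainer and SFTTrainer could fill these roles):

```python
# Hypothetical trainer helpers; stand-ins for e.g. TRL's DPOTrainer/SFTTrainer.
def dpo_train(model, preference_pairs): ...
def sft_train(model, examples): ...

def spar_iteration(model, prompts):
    """One self-play iteration: collect pairs via tree-search refinement,
    then update the actor (DPO) and the refiner (RFT) from them."""
    actor_pairs, refiner_data = [], []
    for p in prompts:
        result = tree_search_refine(p)          # sketch above
        if result is None:
            continue                            # correct first try, or budget hit
        rejected, chosen = result
        # Minimal-contrast preference pair for the actor.
        actor_pairs.append({"prompt": p, "chosen": chosen, "rejected": rejected})
        # Rejection sampling for the refiner: keep only judgments/refinements
        # that the search verified as actually fixing the response.
        refiner_data.append({"prompt": p, "bad": rejected, "fixed": chosen})
    actor = dpo_train(model, actor_pairs)
    refiner = sft_train(model, refiner_data)
    return actor, refiner
```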

-----

💡 Key Insights:

→ Minimizing irrelevant variations in training pairs leads to better instruction-following (toy example after this list)

→ Tree-search based refinement outperforms simple sampling approaches

→ Self-play enables continuous improvement without relying on external models
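
An invented toy example of the first insight, for the instruction "Summarize in one sentence, in ALL CAPS": a refinement pair isolates the violated constraint, while an independently sampled pair varies in wording and content too.

```python
# Invented illustration, not data from the paper.
independent_pair = {
    "chosen":   "PARIS IS THE CAPITAL OF FRANCE.",
    "rejected": "The Eiffel Tower, finished in 1889, stands in Paris.",  # content AND casing differ
}
refinement_pair = {
    "chosen":   "PARIS IS THE CAPITAL OF FRANCE.",
    "rejected": "Paris is the capital of France.",  # only the casing constraint differs
}
```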

-----

📊 Results:

→ A LLaMA3-8B model surpassed GPT-4-Turbo on the IFEval benchmark after three SPaR iterations

→ Achieved 81.8% average accuracy on IFEval

→ Enhanced GLM-4-9B and LLaMA3-70B performance while maintaining general capabilities
