Self-play with tree-search refinement helps LLMs learn instruction-following capability.
SPAR is a self-play framework that improves LLM instruction-following by using tree-search refinement to minimize irrelevant variations in training pairs.
https://arxiv.org/abs/2412.11605
🤖 Original Problem:
→ Current methods for improving instruction-following in LLMs build training pairs from independently sampled responses, which introduce irrelevant variations (in content and phrasing) that obscure the key differences determining whether an instruction is followed.
-----
🔧 Solution in this Paper:
→ SPAR employs a self-play framework where an LLM plays against itself in two roles: actor and refiner.
→ The actor generates responses while the refiner judges and refines these responses using tree-search strategies.
→ Tree-search refinement systematically explores improvement paths while minimizing irrelevant variations (a minimal sketch follows this list).
→ The framework iteratively trains both models using Direct Preference Optimization for the actor and Rejection-sampling Fine-Tuning for the refiner.
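Here is a minimal sketch of how one such self-play iteration could be wired up. The `actor`/`refiner` objects and their `generate`, `judge`, `refine`, and `score` methods are hypothetical, and the beam width, depth, and early-stop rule are illustrative assumptions, not the paper's exact search settings.

```python
# Sketch of actor/refiner self-play with a breadth-first tree search over
# refinements. All object methods below are assumed interfaces, not SPAR's API.

def tree_search_refine(prompt, response, refiner, max_depth=3, beam_width=4):
    """Expand candidate refinements level by level, keeping only the
    highest-scoring ones, until one passes the refiner's judgment."""
    frontier = [response]
    for _ in range(max_depth):
        candidates = []
        for node in frontier:
            verdict = refiner.judge(prompt, node)      # critique + pass/fail
            if verdict.passes:
                return node                            # good enough: stop early
            # Propose targeted edits that keep everything else unchanged, so the
            # refined response differs only where the instruction was violated.
            candidates += refiner.refine(prompt, node, verdict.critique,
                                         num_candidates=beam_width)
        # Keep the top-scoring candidates as the next level of the tree.
        frontier = sorted(candidates,
                          key=lambda c: refiner.score(prompt, c),
                          reverse=True)[:beam_width]
    return frontier[0]


def self_play_iteration(prompts, actor, refiner):
    """One iteration: collect (chosen, rejected) pairs where the chosen
    response is a tree-search refinement of the actor's own draft."""
    dpo_pairs, rft_data = [], []
    for prompt in prompts:
        draft = actor.generate(prompt)
        refined = tree_search_refine(prompt, draft, refiner)
        if refined != draft:
            dpo_pairs.append((prompt, refined, draft))   # chosen vs. rejected
            rft_data.append((prompt, draft, refined))    # refinement trajectory
    # The actor would then be updated with DPO on dpo_pairs, and the refiner
    # with rejection-sampling fine-tuning on its successful judgments and
    # refinements (training code omitted).
    return dpo_pairs, rft_data
```

Because the chosen and rejected responses come from the same draft, the preference signal isolates the instruction-relevant edits rather than stylistic noise.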
-----
💡 Key Insights:
→ Minimizing irrelevant variations in training pairs leads to better instruction-following (see the contrast sketched after this list)
→ Tree-search-based refinement outperforms simple sampling approaches
→ Self-play enables continuous improvement without relying on external models
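To make the first insight concrete, here is a small contrast between building a preference pair from two independent samples and building it from a refinement of the same draft. It reuses the hypothetical helpers and the `tree_search_refine` sketch above, so treat it as an illustration of the idea rather than the paper's implementation.

```python
def pair_from_independent_samples(prompt, actor, judge):
    # Baseline: two unrelated drafts can differ in wording, length, and content,
    # so the preference signal mixes instruction-following with style noise.
    a, b = actor.generate(prompt), actor.generate(prompt)
    return (a, b) if judge.score(prompt, a) > judge.score(prompt, b) else (b, a)


def pair_from_refinement(prompt, actor, refiner):
    # Refinement-based: the chosen response is a targeted edit of the rejected
    # draft, so the pair differs mainly in whether the constraints are met.
    draft = actor.generate(prompt)
    refined = tree_search_refine(prompt, draft, refiner)
    return refined, draft  # (chosen, rejected)
```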
-----
📊 Results:
→ A LLaMA3-8B model surpassed GPT-4-Turbo on the IFEval benchmark after three iterations
→ Achieved 81.8% average accuracy on IFEval
→ Enhanced GLM-4-9B and LLaMA3-70B performance while maintaining general capabilities