"Generating Symbolic World Models via Test-time Scaling of LLMs"
The podcast below on this paper was generated with Google's Illuminate.
https://arxiv.org/abs/2502.04728
The challenge: natural language is ambiguous when describing states and transitions, which makes complex planning hard for LLMs. Current LLMs also struggle with the deductive reasoning that formal planning requires.
This paper introduces a test-time scaling method to strengthen LLMs' reasoning in the Planning Domain Definition Language (PDDL). It uses Best-of-N sampling to produce initial solutions and refines them with instance Verbalized Machine Learning (iVML).
-----
📌 This paper presents a practical method to create Planning Domain Definition Language world models using LLMs without extra training. It leverages test-time compute scaling to improve logical reasoning for complex planning tasks.
📌 Combining Best-of-N sampling with instance Verbalized Machine Learning offers a balanced approach. Best-of-N explores diverse solutions, while instance Verbalized Machine Learning iteratively refines them using feedback from LLMs.
📌 By generating Planning Domain Definition Language domains, this method shifts LLMs from direct planners to world model generators. This abstraction enables robust planning by using classical algorithms for search and validation.
----------
Methods Explored in this Paper 🔧:
→ The paper uses a two-stage approach to generate Planning Domain Definition Language world models. Best-of-N sampling runs first: it generates multiple candidate PDDL domains and selects the best ones by log-likelihood.
→ Instance Verbalized Machine Learning then refines these initial solutions, iteratively improving the PDDL domain with feedback from an optimizer LLM that evaluates the code and provides critiques.
→ A learner LLM applies these critiques to update and refine the domain. This iterative loop eliminates logical inconsistencies and errors in the generated PDDL, balancing exploration (Best-of-N) with exploitation (iterative refinement).
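The two stages above can be sketched as a short loop. Everything here is a hypothetical stand-in: `generate_domain`, `score_domain`, `critique_domain`, and `revise_domain` are placeholder functions, not the paper's actual prompts, models, or scoring code.

```python
# Sketch of the two-stage pipeline. All four helpers are hypothetical
# placeholders standing in for real LLM calls.

def generate_domain(description: str, seed: int) -> str:
    # Stand-in for the generator LLM sampling one candidate PDDL domain.
    return f"(define (domain d{seed})) ; for: {description}"

def score_domain(domain: str) -> float:
    # Stand-in for the log-likelihood score used to rank candidates.
    return -float(len(domain))

def critique_domain(domain: str) -> str:
    # Stand-in for the optimizer LLM that critiques the PDDL code.
    return "check action preconditions"

def revise_domain(domain: str, critique: str) -> str:
    # Stand-in for the learner LLM that rewrites the domain per the critique.
    return domain + f" ; revised ({critique})"

def best_of_n_then_ivml(description: str, n: int = 8, steps: int = 3) -> str:
    # Stage 1: Best-of-N -- sample N candidates, keep the highest-scoring one.
    candidates = [generate_domain(description, seed=i) for i in range(n)]
    best = max(candidates, key=score_domain)
    # Stage 2: iVML-style refinement -- iteratively critique and revise.
    for _ in range(steps):
        best = revise_domain(best, critique_domain(best))
    return best
```

The point of the sketch is the control flow: broad sampling picks a strong starting point, then a critique-and-revise loop polishes it.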
-----
Key Insights 💡:
→ Planning Domain Definition Language offers a formal and unambiguous way to represent world models, unlike natural language. PDDL enables precise constraint specification and integration of planning algorithms. Test-time scaling of LLMs can significantly improve their ability to generate high-quality PDDL domains.
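To make the "formal and unambiguous" point concrete, here is a minimal hand-written PDDL domain (an illustration, not taken from the paper) embedded as a Python string, together with a cheap structural check of the kind an automated pipeline could apply to generated output:

```python
# A minimal hand-written PDDL domain (illustrative only): each action
# declares explicit preconditions and effects, so state transitions
# leave no room for natural-language ambiguity.
BLOCKS_DOMAIN = """
(define (domain blocks-mini)
  (:predicates (clear ?x) (on-table ?x) (holding ?x) (handempty))
  (:action pickup
    :parameters (?x)
    :precondition (and (clear ?x) (on-table ?x) (handempty))
    :effect (and (holding ?x)
                 (not (on-table ?x)) (not (clear ?x)) (not (handempty)))))
"""

def balanced_parens(pddl: str) -> bool:
    # Cheap structural check: every '(' must be matched by a ')'.
    depth = 0
    for ch in pddl:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0
```

A real pipeline would go far beyond parenthesis balancing (type checks, planner validation), but even this trivial check is only possible because PDDL is a formal language.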
→ Combining Best-of-N sampling with instance Verbalized Machine Learning is effective. Best-of-N provides good initial solutions by exploring a diverse solution space. Instance Verbalized Machine Learning refines these solutions through iterative feedback and correction.
→ Using PDDL as an abstraction layer improves planning robustness compared to direct LLM-based planning. This approach leverages LLMs for world model generation and classical planners for solution finding, reducing hallucinations.
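One reason this abstraction curbs hallucination is that any plan a solver returns can be checked mechanically against the domain's preconditions and effects. A minimal STRIPS-style plan validator, using a toy hand-coded action table rather than the paper's toolchain, might look like:

```python
# Minimal STRIPS-style plan validator (illustrative sketch). Each action
# maps to (preconditions, add-effects, delete-effects) over ground facts.
ACTIONS = {
    "pickup-a": (
        {"clear a", "on-table a", "handempty"},  # preconditions
        {"holding a"},                            # add effects
        {"clear a", "on-table a", "handempty"},   # delete effects
    ),
    "putdown-a": (
        {"holding a"},
        {"clear a", "on-table a", "handempty"},
        {"holding a"},
    ),
}

def validate_plan(state, plan):
    """Apply each action in order; fail if any precondition is unmet."""
    state = set(state)
    for name in plan:
        pre, add, delete = ACTIONS[name]
        if not pre <= state:
            return False, state
        state = (state - delete) | add
    return True, state
```

Because validation is a set-theoretic computation rather than an LLM judgment, an invalid plan is rejected deterministically; the LLM's only job is producing the world model.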
-----
Results 📊:
→ Achieves 85.2% success rate on NL2Domain task and 71.4% on Prob2Domain task using Qwen2.5-Coder-7B.
→ Significantly outperforms OpenAI's o1-mini on PDDL domain generation: o1-mini achieves only 41.7% and 33.7% success rates on the same two tasks.
→ Instance Verbalized Machine Learning with Best-of-N initialization improves success rates by approximately 19% compared to using only Best-of-N sampling.