
LLMS STILL CAN’T PLAN; CAN LRMS? A PRELIMINARY EVALUATION OF OPENAI’S O1 ON PLANBENCH

This podcast was generated with Google's Illuminate.

This is the first paper analyzing OpenAI's o1-preview model.

And some of the performance numbers are amazing.

Overall Conclusion from the paper - While o1-preview’s performance is a quantum improvement on the benchmark, outpacing the competition, it is still far from saturating it.

📚 https://arxiv.org/pdf/2409.13373

Original Problem 🤔:

LLMs struggle with planning tasks. Progress on PlanBench has been slow. OpenAI's o1 claims to be a Large Reasoning Model (LRM) with improved capabilities.

-----

Solution in this Paper 🔍:

• Evaluates o1 on PlanBench, comparing to LLMs and classical planners

• Tests o1 on longer problems and unsolvable instances

• Analyzes accuracy/efficiency tradeoffs of o1 vs other approaches

• Proposes extensions to PlanBench for more comprehensive LRM evaluation
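For context, PlanBench scores a model by checking each generated plan against the problem's ground truth with an external plan validator (the paper uses VAL). The sketch below is a simplified, hypothetical checker for Blocksworld-style "move" actions, just to illustrate what validating a plan involves; the state encoding, action format, and function names are my own assumptions, not the paper's code.

```python
# Simplified Blocksworld plan checker (illustrative only; PlanBench itself
# relies on the VAL validator and full PDDL action semantics).
# State: dict mapping each block to what it rests on ("table" or a block).
# Action: ("move", block, source, destination).

def apply(state, action):
    """Return the successor state, or None if the action's preconditions fail."""
    _, block, src, dst = action
    on = dict(state)
    # Precondition: block rests on src and is clear (nothing stacked on it).
    if on.get(block) != src or block in on.values():
        return None
    # Precondition: destination is the table or an existing, clear block.
    if dst != "table" and (dst not in on or dst in on.values()):
        return None
    on[block] = dst
    return on

def valid_plan(state, goal, plan):
    """Execute the plan step by step; succeed iff every action is legal
    and the final state satisfies all goal conditions."""
    for action in plan:
        state = apply(state, action)
        if state is None:
            return False
    return all(state[b] == support for b, support in goal.items())

# Example: B sits on A, C on the table; goal is C stacked on B.
init = {"A": "table", "B": "A", "C": "table"}
goal = {"C": "B"}
print(valid_plan(init, goal, [("move", "C", "table", "B")]))   # True
print(valid_plan(init, goal, [("move", "A", "table", "B")]))   # False: A is not clear
```

Mystery Blocksworld, where o1's advantage over plain LLMs is largest, uses exactly the same underlying semantics but with the predicate and action names obfuscated, so the validator logic is unchanged.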

-----

Key Insights from this Paper 💡:

• LLMs struggle with planning tasks, especially on obfuscated versions

• o1 shows significant improvement over LLMs on PlanBench

• o1's performance degrades on longer problems and unsolvable instances

• o1 is much more expensive to run than traditional planners

• Lack of interpretability and guarantees remains an issue with o1

-----

Results 📊:

• o1-preview: 97.8% accuracy on 3-5 block Blocksworld (vs 62.6% for best LLM)

• Mystery Blocksworld: o1 reaches 52.8% (vs ~0% for LLMs)

• 6-20 block problems: o1 performance drops to 23.63%

• Unsolvable instances: only 27% correctly identified

• Cost: o1 ~$42 per 100 instances vs $0.44-$1.80 for LLMs


------

Are you into AI and LLMs❓ Join me on Twitter with 34K others to stay on the bleeding edge every day.

𝕏/🐦 https://x.com/rohanpaul_ai
