
LLMS STILL CAN’T PLAN; CAN LRMS? A PRELIMINARY EVALUATION OF OPENAI’S O1 ON PLANBENCH

This podcast was generated with Google's Illuminate.

This is the first paper analyzing OpenAI's o1-preview model.

And some of the performance numbers are amazing.

Overall Conclusion from the paper - While o1-preview’s performance is a quantum improvement on the benchmark, outpacing the competition, it is still far from saturating it.

📚 https://arxiv.org/pdf/2409.13373

Original Problem 🤔:

LLMs struggle with planning tasks. Progress on PlanBench has been slow. OpenAI's o1 claims to be a Large Reasoning Model (LRM) with improved capabilities.

-----

Solution in this Paper 🔍:

• Evaluates o1 on PlanBench, comparing to LLMs and classical planners

• Tests o1 on longer problems and unsolvable instances

• Analyzes accuracy/efficiency tradeoffs of o1 vs other approaches

• Proposes extensions to PlanBench for more comprehensive LRM evaluation
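For context, PlanBench scores a model by checking each generated plan against the problem's ground truth with an external plan validator (the paper uses VAL). The sketch below is a simplified, hypothetical checker for Blocksworld-style "move" actions, just to illustrate what validating a plan involves; the state encoding, action format, and function names are my own assumptions, not the paper's code.

```python
# Simplified Blocksworld plan checker (illustrative only; PlanBench itself
# relies on the VAL validator and full PDDL action semantics).
# State: dict mapping each block to what it rests on ("table" or a block).
# Action: ("move", block, source, destination).

def apply(state, action):
    """Return the successor state, or None if the action's preconditions fail."""
    _, block, src, dst = action
    on = dict(state)
    # Precondition: block rests on src and is clear (nothing stacked on it).
    if on.get(block) != src or block in on.values():
        return None
    # Precondition: destination is the table or an existing, clear block.
    if dst != "table" and (dst not in on or dst in on.values()):
        return None
    on[block] = dst
    return on

def valid_plan(state, goal, plan):
    """Execute the plan step by step; succeed iff every action is legal
    and the final state satisfies all goal conditions."""
    for action in plan:
        state = apply(state, action)
        if state is None:
            return False
    return all(state[b] == support for b, support in goal.items())

# Example: B sits on A, C on the table; goal is C stacked on B.
init = {"A": "table", "B": "A", "C": "table"}
goal = {"C": "B"}
print(valid_plan(init, goal, [("move", "C", "table", "B")]))   # True
print(valid_plan(init, goal, [("move", "A", "table", "B")]))   # False: A is not clear
```

Mystery Blocksworld, where o1's advantage over plain LLMs is largest, uses exactly the same underlying semantics but with the predicate and action names obfuscated, so the validator logic is unchanged.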

-----

Key Insights from this Paper 💡:

• LLMs struggle with planning tasks, especially on obfuscated versions

• o1 shows significant improvement over LLMs on PlanBench

• o1's performance degrades on longer problems and unsolvable instances

• o1 is much more expensive to run than traditional planners

• Lack of interpretability and guarantees remains an issue with o1

-----

Results 📊:

• o1-preview: 97.8% accuracy on 3-5 block Blocksworld (vs 62.6% for best LLM)

• Mystery Blocksworld: o1 reaches 52.8% (vs ~0% for LLMs)

• 6-20 block problems: o1 performance drops to 23.63%

• Unsolvable instances: only 27% correctly identified

• Cost: o1 ~$42 per 100 instances vs $0.44-$1.80 for LLMs


------

Are you into AI and LLMs❓ Join me on Twitter with 34K others to stay on the bleeding edge every day.

𝕏/🐦 https://x.com/rohanpaul_ai
