The first paper analyzing OpenAI's o1-preview model.
And some of the performance numbers are amazing.
Overall conclusion from the paper: while o1-preview's performance is a quantum improvement on the benchmark, outpacing the competition, it is still far from saturating it.
📚 https://arxiv.org/pdf/2409.13373
Original Problem 🤔:
LLMs struggle with planning tasks, and progress on PlanBench has been slow since its introduction. OpenAI positions o1 as a Large Reasoning Model (LRM) with improved reasoning capabilities.
-----
Solution in this Paper 🔍:
• Evaluates o1 on PlanBench, comparing it to both standard LLMs and classical planners (see the validation sketch after this list)
• Tests o1 on longer problems and unsolvable instances
• Analyzes accuracy/efficiency tradeoffs of o1 vs other approaches
• Proposes extensions to PlanBench for more comprehensive LRM evaluation
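Conceptually, the PlanBench loop is simple: give the model a domain description plus an initial state and goal, then check whether the returned action sequence is executable and actually reaches the goal. The paper relies on the VAL plan validator for that check; below is a minimal hand-rolled sketch of the same idea for Blocksworld (the `apply`/`plan_valid` helpers and the toy instance are illustrative assumptions, not the authors' harness).

```python
# Blocksworld plan checker: an illustrative stand-in for the VAL validator
# used by PlanBench. State = frozenset of ground literals,
# e.g. ("on", "A", "B") or ("handempty",).

def apply(state, action):
    """Return the successor state, or None if the action's preconditions fail."""
    op, *args = action
    s = set(state)
    if op == "pickup":
        (x,) = args
        if {("clear", x), ("ontable", x), ("handempty",)} <= s:
            s -= {("clear", x), ("ontable", x), ("handempty",)}
            s.add(("holding", x))
            return frozenset(s)
    elif op == "putdown":
        (x,) = args
        if ("holding", x) in s:
            s.discard(("holding", x))
            s |= {("clear", x), ("ontable", x), ("handempty",)}
            return frozenset(s)
    elif op == "stack":
        x, y = args
        if {("holding", x), ("clear", y)} <= s:
            s -= {("holding", x), ("clear", y)}
            s |= {("on", x, y), ("clear", x), ("handempty",)}
            return frozenset(s)
    elif op == "unstack":
        x, y = args
        if {("on", x, y), ("clear", x), ("handempty",)} <= s:
            s -= {("on", x, y), ("clear", x), ("handempty",)}
            s |= {("holding", x), ("clear", y)}
            return frozenset(s)
    return None  # unknown action or unmet precondition

def plan_valid(init, goal, plan):
    """A plan counts only if every step is legal and the goal holds at the end."""
    state = frozenset(init)
    for action in plan:
        state = apply(state, action)
        if state is None:
            return False  # illegal step -> plan rejected
    return goal <= state

# Toy instance: A sits on B; the goal is to invert the stack.
init = {("on", "A", "B"), ("ontable", "B"), ("clear", "A"), ("handempty",)}
goal = {("on", "B", "A")}
plan = [("unstack", "A", "B"), ("putdown", "A"),
        ("pickup", "B"), ("stack", "B", "A")]
print(plan_valid(init, goal, plan))  # True
```

The same checker handles the obfuscated "Mystery" variant for free: renaming predicates and actions changes the prompt the model sees, not the underlying transition semantics.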
-----
Key Insights from this Paper 💡:
• LLMs struggle with planning tasks, especially on obfuscated ("Mystery") variants of the domains
• o1 shows significant improvement over LLMs on PlanBench
• o1's performance degrades on longer problems and unsolvable instances
• o1 is much more expensive to run than traditional planners
• Lack of interpretability and guarantees remains an issue with o1
-----
Results 📊:
• o1-preview: 97.8% accuracy on 3-5 block Blocksworld (vs. 62.6% for the best LLM tested)
• Mystery Blocksworld: o1 reaches 52.8% (vs ~0% for LLMs)
• 6-20 block problems: o1's accuracy drops to 23.63%
• Unsolvable instances: only 27% correctly identified
• Cost: o1 ~$42 per 100 instances vs $0.44-$1.80 for LLMs
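To put those cost figures in perspective, a quick back-of-the-envelope calculation (assuming the per-100-instance prices reported above):

```python
# Rough cost ratios from the reported numbers: o1 at ~$42 per 100
# instances vs. $0.44-$1.80 for the LLMs tested.
o1_cost = 42.00
for llm_cost in (0.44, 1.80):
    print(f"o1 is ~{o1_cost / llm_cost:.0f}x the cost of a ${llm_cost:.2f}/100 LLM")
# -> ~95x the cheapest LLM, ~23x the most expensive one
```

And classical planners like Fast Downward solve these instances in fractions of a second at essentially zero cost, which is what makes the accuracy/cost tradeoff sting.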
-----
Are you into AI and LLMs❓ Join me on Twitter, along with 34K others, to stay on the bleeding edge every day.
𝕏/🐦 https://x.com/rohanpaul_ai