"ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning"
→ Current LLMs struggle with complex logical reasoning, especially non-monotonic reasoning.
→ Existing benchmarks do not effectively isolate pure logical reasoning from other factors like domain knowledge.
This paper introduces ZebraLogic, a framework to rigorously evaluate logical reasoning in LLMs.
It uses logic grid puzzles with controllable complexity to reveal limitations in LLM reasoning as complexity scales, even with larger models.
-----
https://arxiv.org/abs/2502.01100
📌 ZebraLogic uses Constraint Satisfaction Problems to create logic puzzles. This method offers a mathematically rigorous and scalable way to evaluate LLMs' pure logical reasoning abilities, unlike benchmarks mixed with world knowledge.
📌 The "curse of complexity" highlights a core limitation: current LLMs struggle with increasing search space size and Z3 conflicts. This suggests that scaling model size alone offers diminishing returns for complex logical tasks.
📌 The finding that o1 models use significantly more hidden reasoning tokens points to a potential compute-intensive approach for enhancing logical reasoning, though its scalability and interpretability remain open questions.
-----
Methods in this Paper 🔧:
→ ZebraLogic generates logic grid puzzles derived from Constraint Satisfaction Problems (CSPs).
→ These puzzles have quantifiable and controllable complexity, measured by search space size and Z3 conflict count (see the encoding sketch after this list).
→ The framework allows for systematic evaluation of LLMs' logical deduction abilities in a controlled environment.
→ ZebraLogic puzzles require pure formal reasoning, minimizing reliance on domain-specific knowledge.
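To make the CSP framing concrete, here is a minimal sketch, assuming the z3-solver Python package, of how a tiny 3-house logic grid puzzle maps onto a CSP. This is not the authors' generator: the attributes and clues are made-up illustrations, and the "conflicts" counter read at the end is the kind of Z3 statistic the paper uses as a complexity metric.

```python
# Minimal sketch, assuming the z3-solver package (pip install z3-solver).
# Not the authors' generator: attributes and clues are illustrative only.
from z3 import Int, Solver, Distinct, sat

N = 3  # houses, indexed 0..2
s = Solver()

# One integer variable per (attribute, value): the house that holds it.
colors = {c: Int(f"color_{c}") for c in ("red", "green", "blue")}
pets   = {p: Int(f"pet_{p}")   for p in ("dog", "cat", "fish")}

for group in (colors, pets):
    for var in group.values():
        s.add(0 <= var, var < N)       # every value sits in some house
    s.add(Distinct(*group.values()))   # no two values share a house

# Illustrative clues: "the red house keeps the dog" and
# "the cat lives immediately to the left of the green house".
s.add(colors["red"] == pets["dog"])
s.add(pets["cat"] + 1 == colors["green"])

if s.check() == sat:
    m = s.model()
    print({name: m[var].as_long() for name, var in {**colors, **pets}.items()})

# The paper's second complexity metric is Z3's conflict count, available
# from the solver statistics after check(); trivial instances may not
# report the counter at all, hence the fallback.
st = s.statistics()
try:
    print("conflicts:", st.get_key_value("conflicts"))
except Exception:
    print("conflicts: 0")
```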
-----
Key Insights 💡:
→ The study reveals a "curse of complexity" for LLM reasoning.
→ LLM performance declines sharply as puzzle complexity increases, whether complexity is measured by search space size or by Z3 conflict count.
→ Scaling model size alone (e.g., Llama-3.1-405B) does not overcome this limitation in complex reasoning.
→ Increasing test-time compute through repeated sampling (Best-of-N) yields only limited gains against the complexity curse (see the sketch after this list).
→ Models like o1 generate significantly more hidden reasoning tokens, which scale with puzzle complexity and contribute to their stronger performance.
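As a point of reference for the Best-of-N finding, here is a hedged sketch of oracle Best-of-N selection: an instance counts as solved if any of the N samples is correct, so the ground-truth grader acts as a perfect verifier. The sampler and grader below are hypothetical stand-ins, not the paper's evaluation code.

```python
# Sketch of Best-of-N with oracle selection; `sample` and `is_correct`
# are hypothetical stand-ins, not the paper's evaluation harness.
from typing import Callable
import random

def best_of_n_oracle(sample: Callable[[], str],
                     is_correct: Callable[[str], bool],
                     n: int = 128) -> bool:
    """Oracle selection: the instance counts as solved if ANY of the n
    sampled completions is correct, an upper bound on what a learned
    verifier or reranker could pick at the same sampling budget."""
    return any(is_correct(sample()) for _ in range(n))

# Toy usage: a stub sampler that is right 5% of the time per draw.
solved = best_of_n_oracle(lambda: "A" if random.random() < 0.05 else "B",
                          lambda ans: ans == "A")
print("solved under oracle Best-of-128:", solved)
```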
-----
Results 📊:
→ o1-full achieves 81.0% overall puzzle accuracy, outperforming other models.
→ DeepSeek-R1 achieves 78.7% overall accuracy, showing strong performance among open-weight models.
→ Most models' performance drops sharply on "Large" (search space ≥ 10^6) and "X-Large" (search space ≥ 10^10) puzzles; Llama-3.1-405B, for example, falls to 1.5% and 0.0% accuracy in those two categories (search space size and buckets are sketched below).
→ GPT-4o with Best-of-N sampling and oracle selection (N=128) reaches 69.1% accuracy, approaching o1-mini performance and suggesting potential benefits from increased test-time compute.
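For context on those size buckets, a short sketch of how search space size can be computed: each attribute in an N-house puzzle assigns its N values as a permutation of the houses, giving (N!)^M configurations for M attributes before any clues are applied. The two thresholds below mirror the "Large" and "X-Large" cutoffs quoted above; the label for smaller puzzles is an assumption, not the paper's taxonomy.

```python
# Sketch: search space size of an N-house, M-attribute zebra puzzle is
# (N!)**M, since each attribute's values form a permutation of the houses.
from math import factorial

def search_space(n_houses: int, n_attributes: int) -> int:
    return factorial(n_houses) ** n_attributes

def bucket(size: int) -> str:
    # Only the >= 1e6 and >= 1e10 cutoffs are quoted above; the label
    # for anything smaller is a placeholder.
    if size >= 10**10:
        return "X-Large"
    if size >= 10**6:
        return "Large"
    return "(below the quoted cutoffs)"

for n, m in [(3, 3), (4, 5), (6, 6)]:
    size = search_space(n, m)
    print(f"{n} houses x {m} attributes: {size:.2e} -> {bucket(size)}")
```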