"ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning"
The podcast below on this paper was generated with Google's Illuminate.
→ Current LLMs struggle with complex logical reasoning, especially non-monotonic reasoning.
→ Existing benchmarks do not effectively isolate pure logical reasoning from other factors like domain knowledge.
This paper introduces ZebraLogic, a framework to rigorously evaluate logical reasoning in LLMs.
It uses logic grid puzzles with controllable complexity to show how LLM reasoning degrades as complexity scales, even for the largest models.
-----
https://arxiv.org/abs/2502.01100
📌 ZebraLogic uses Constraint Satisfaction Problems to create logic puzzles. This method offers a mathematically rigorous and scalable way to evaluate LLMs' pure logical reasoning abilities, unlike benchmarks mixed with world knowledge.
📌 The "curse of complexity" highlights a core limitation: current LLMs struggle as search space size and Z3 conflict count grow. This suggests that scaling model size alone offers diminishing returns for complex logical tasks.
📌 The finding that o1 models use significantly more hidden reasoning tokens points to a potential compute-intensive approach for enhancing logical reasoning, though its scalability and interpretability remain open questions.
-----
Methods in this Paper 🧠:
→ ZebraLogic generates logic grid puzzles derived from Constraint Satisfaction Problems (CSPs).
→ These puzzles have quantifiable and controllable complexity, using metrics like search space size and Z3 conflict count (see the encoding sketch after this list).
→ The framework allows for systematic evaluation of LLMs' logical deduction abilities in a controlled environment.
→ ZebraLogic puzzles require pure formal reasoning, minimizing reliance on domain-specific knowledge.
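To make the CSP framing concrete, here is a minimal sketch (not the authors' released code) that encodes a tiny 3-house, 2-attribute logic grid puzzle with Z3's Python API and reads back the solver's conflict count, one of the paper's two complexity metrics. The clues and variable names are illustrative assumptions.

```python
# A minimal sketch: a 2-attribute, 3-house logic grid puzzle as a Z3 CSP.
from z3 import Int, Distinct, Solver, And, sat

# Each value of each attribute maps to a house position in 1..3.
colors = {c: Int(f"color_{c}") for c in ["red", "green", "blue"]}
pets   = {p: Int(f"pet_{p}")   for p in ["cat", "dog", "fish"]}

s = Solver()
for group in (colors, pets):
    s.add([And(v >= 1, v <= 3) for v in group.values()])
    s.add(Distinct(*group.values()))        # each value in a distinct house

# Hypothetical clues, for illustration only:
s.add(colors["red"] == 1)                   # "The red house is first."
s.add(pets["dog"] == colors["green"])       # "The dog lives in the green house."
s.add(pets["cat"] == colors["red"] + 1)     # "The cat is right of the red house."

assert s.check() == sat
model = s.model()
print({name: model[v] for name, v in colors.items()})

# Z3 conflict count, which the paper uses as a proxy for deductive difficulty.
stats = s.statistics()
conflicts = stats.get_key_value("conflicts") if "conflicts" in stats.keys() else 0
print("conflicts:", conflicts)
```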
-----
Key Insights 💡:
→ The study reveals a "curse of complexity" for LLM reasoning.
→ LLM performance declines dramatically as puzzle complexity increases, measured by search space size and Z3 conflict count.
→ Scaling model size alone (e.g., Llama-3.1-405B) does not overcome this limitation on complex reasoning.
→ Increasing test-time compute through repeated sampling (Best-of-N) yields only limited gains against the complexity curse (see the sketch after this list).
→ Models like o1 generate significantly more hidden reasoning tokens, and this count scales with puzzle complexity, contributing to their stronger performance.
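Below is a minimal sketch of the Best-of-N protocol with oracle selection, assuming hypothetical `generate_candidate` (one sampled LLM solution) and `is_correct` (ground-truth grid check) stand-ins; neither is from the paper's code. The point it illustrates: oracle selection counts a puzzle as solved if any of the N samples is right, so it upper-bounds what any reranker could achieve.

```python
# A minimal sketch of Best-of-N sampling with oracle selection.
import random

def generate_candidate(puzzle: str) -> str:
    # Placeholder for one sampled LLM solution (temperature > 0).
    return random.choice(["grid_a", "grid_b", "grid_c"])

def is_correct(candidate: str, answer: str) -> bool:
    # Oracle check against the known solution grid.
    return candidate == answer

def best_of_n(puzzle: str, answer: str, n: int = 128) -> bool:
    # The puzzle counts as solved if ANY of the n samples matches
    # the ground truth -- an upper bound on reranking strategies.
    return any(is_correct(generate_candidate(puzzle), answer) for _ in range(n))

print(best_of_n("toy puzzle", "grid_b", n=128))
```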
-----
Results 📊:
→ o1-full achieves 81.0% overall puzzle accuracy, outperforming all other models.
→ DeepSeek-R1 achieves 78.7% overall accuracy, the strongest result among open-weight models.
→ Performance of most models drops sharply on "Large" (search space ≥ 10^6) and "X-Large" (search space ≥ 10^10) puzzles; for example, Llama-3.1-405B falls to 1.5% and 0.0% accuracy in those two categories (see the size-bucket sketch after this list).
→ GPT-4o with Best-of-N sampling and oracle selection (N=128) reaches 69.1% accuracy, approaching o1-mini performance, suggesting potential benefits of increased test-time compute.
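For scale, assuming the search space of an N-house, M-attribute puzzle is (N!)^M (each attribute assigns one permutation of N values to the houses), the quick script below shows how puzzle dimensions map onto the size buckets above. The (N!)^M formula is my reading of the paper's setup, stated here as an assumption.

```python
# Back-of-the-envelope sizes under the (N!)^M assumption.
from math import factorial

def search_space(n_houses: int, m_attributes: int) -> int:
    return factorial(n_houses) ** m_attributes

for n, m in [(3, 3), (4, 4), (4, 6), (5, 5)]:
    size = search_space(n, m)
    bucket = ("X-Large" if size >= 10**10 else
              "Large"   if size >= 10**6  else
              "Small/Medium")
    print(f"{n} houses x {m} attributes: {size:.2e} -> {bucket}")
```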


