"ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning"
→ Current LLMs struggle with complex logical reasoning, especially non-monotonic reasoning.
→ Existing benchmarks do not effectively isolate pure logical reasoning from other factors like domain knowledge.
This paper introduces ZebraLogic, a framework to rigorously evaluate logical reasoning in LLMs.
It uses logic grid puzzles with controllable complexity to reveal limitations in LLM reasoning as complexity scales, even with larger models.
-----
https://arxiv.org/abs/2502.01100
📌 ZebraLogic uses Constraint Satisfaction Problems to create logic puzzles. This method offers a mathematically rigorous and scalable way to evaluate LLMs' pure logical reasoning abilities, unlike benchmarks mixed with world knowledge.
📌 The "curse of complexity" highlights a core limitation: current LLMs struggle with increasing search space size and Z3 conflicts. This suggests that scaling model size alone offers diminishing returns for complex logical tasks.
📌 The finding that o1 models use significantly more hidden reasoning tokens points to a potential compute-intensive approach for enhancing logical reasoning, though its scalability and interpretability remain open questions.
-----
Methods in this Paper 🔧:
→ ZebraLogic generates logic grid puzzles derived from Constraint Satisfaction Problems (CSPs).
→ These puzzles have quantifiable and controllable complexity, measured by search space size and Z3 conflict count (see the encoding sketch after this list).
→ The framework allows for systematic evaluation of LLMs' logical deduction abilities in a controlled environment.
→ ZebraLogic puzzles require pure formal reasoning, minimizing reliance on domain-specific knowledge.
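To make the CSP framing concrete, here is a minimal sketch, assuming the z3-solver Python package, of how a tiny 3-house logic grid puzzle maps onto a CSP. This is not the authors' generator: the attributes and clues are made-up illustrations, and the "conflicts" counter read at the end is the kind of Z3 statistic the paper uses as a complexity metric.

```python
# Minimal sketch, assuming the z3-solver package (pip install z3-solver).
# Not the authors' generator: attributes and clues are illustrative only.
from z3 import Int, Solver, Distinct, sat

N = 3  # houses, indexed 0..2
s = Solver()

# One integer variable per (attribute, value): the house that holds it.
colors = {c: Int(f"color_{c}") for c in ("red", "green", "blue")}
pets   = {p: Int(f"pet_{p}")   for p in ("dog", "cat", "fish")}

for group in (colors, pets):
    for var in group.values():
        s.add(0 <= var, var < N)       # every value sits in some house
    s.add(Distinct(*group.values()))   # no two values share a house

# Illustrative clues: "the red house keeps the dog" and
# "the cat lives immediately to the left of the green house".
s.add(colors["red"] == pets["dog"])
s.add(pets["cat"] + 1 == colors["green"])

if s.check() == sat:
    m = s.model()
    print({name: m[var].as_long() for name, var in {**colors, **pets}.items()})

# The paper's second complexity metric is Z3's conflict count, available
# from the solver statistics after check(); trivial instances may not
# report the counter at all, hence the fallback.
st = s.statistics()
try:
    print("conflicts:", st.get_key_value("conflicts"))
except Exception:
    print("conflicts: 0")
```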
-----
Key Insights 💡:
→ The study reveals a "curse of complexity" for LLM reasoning.
→ LLM performance declines sharply as puzzle complexity increases, whether complexity is measured by search space size or by Z3 conflict count.
→ Scaling model size alone (e.g., Llama-3.1-405B) does not overcome this limitation in complex reasoning.
→ Increasing test-time compute through repeated sampling (Best-of-N) yields only limited gains against the complexity curse (see the sketch after this list).
→ Models like o1 generate significantly more hidden reasoning tokens, which scale with puzzle complexity and contribute to their stronger performance.
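As a point of reference for the Best-of-N finding, here is a hedged sketch of oracle Best-of-N selection: an instance counts as solved if any of the N samples is correct, so the ground-truth grader acts as a perfect verifier. The sampler and grader below are hypothetical stand-ins, not the paper's evaluation code.

```python
# Sketch of Best-of-N with oracle selection; `sample` and `is_correct`
# are hypothetical stand-ins, not the paper's evaluation harness.
from typing import Callable
import random

def best_of_n_oracle(sample: Callable[[], str],
                     is_correct: Callable[[str], bool],
                     n: int = 128) -> bool:
    """Oracle selection: the instance counts as solved if ANY of the n
    sampled completions is correct, an upper bound on what a learned
    verifier or reranker could pick at the same sampling budget."""
    return any(is_correct(sample()) for _ in range(n))

# Toy usage: a stub sampler that is right 5% of the time per draw.
solved = best_of_n_oracle(lambda: "A" if random.random() < 0.05 else "B",
                          lambda ans: ans == "A")
print("solved under oracle Best-of-128:", solved)
```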
-----
Results 📊:
→ o1-full achieves 81.0% overall puzzle accuracy, outperforming other models.
→ DeepSeek-R1 achieves 78.7% overall accuracy, showing strong performance among open-weight models.
→ Most models' performance drops sharply on "Large" (search space ≥ 10^6) and "X-Large" (search space ≥ 10^10) puzzles; Llama-3.1-405B, for example, falls to 1.5% and 0.0% accuracy in those two categories (search space size and buckets are sketched below).
→ GPT-4o with Best-of-N sampling and oracle selection (N=128) reaches 69.1% accuracy, approaching o1-mini performance and suggesting potential benefits from increased test-time compute.
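For context on those size buckets, a short sketch of how search space size can be computed: each attribute in an N-house puzzle assigns its N values as a permutation of the houses, giving (N!)^M configurations for M attributes before any clues are applied. The two thresholds below mirror the "Large" and "X-Large" cutoffs quoted above; the label for smaller puzzles is an assumption, not the paper's taxonomy.

```python
# Sketch: search space size of an N-house, M-attribute zebra puzzle is
# (N!)**M, since each attribute's values form a permutation of the houses.
from math import factorial

def search_space(n_houses: int, n_attributes: int) -> int:
    return factorial(n_houses) ** n_attributes

def bucket(size: int) -> str:
    # Only the >= 1e6 and >= 1e10 cutoffs are quoted above; the label
    # for anything smaller is a placeholder.
    if size >= 10**10:
        return "X-Large"
    if size >= 10**6:
        return "Large"
    return "(below the quoted cutoffs)"

for n, m in [(3, 3), (4, 5), (6, 6)]:
    size = search_space(n, m)
    print(f"{n} houses x {m} attributes: {size:.2e} -> {bucket(size)}")
```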