"A Tool for In-depth Analysis of Code Execution Reasoning of LLMs"
The podcast below on this paper was generated with Google's Illuminate.
https://arxiv.org/abs/2501.18482
Current evaluations of Large Language Models (LLMs) in code execution reasoning are limited. They lack in-depth analysis of factors influencing LLM performance beyond simple output prediction.
This paper introduces a tool for detailed analysis of LLM code execution reasoning. It helps explain how code properties affect LLM performance.
-----
📌 The tool offers a crucial diagnostic lens for code LLMs. It moves beyond pass/fail metrics to pinpoint the specific code properties on which LLMs struggle during execution reasoning.
📌 The tool's program analysis is its core strength. Static and dynamic analysis provide granular insights. This reveals LLM vulnerabilities related to program constructs and complexity.
📌 By dissecting execution reasoning failures, the tool enables targeted improvements. It guides dataset creation and fine-tuning strategies. This can enhance LLM code understanding and performance.
----------
Methods Explored in this Paper 🔧:
→ The paper proposes a tool to analyze LLMs' code execution reasoning.
→ The tool uses a Program Analyzer that combines static and dynamic techniques. It extracts program constructs, code complexity, dynamic properties, and variable types.
→ Static analysis identifies constructs like 'If', 'For', 'While', 'Nested Ifs', and 'Nested Loops' using Python's `ast` module, and measures complexity via lines of code and cyclomatic complexity (a static-analysis sketch follows after this list).
→ Dynamic analysis executes programs to determine dynamic properties. Loop lengths and recursion depths are captured using the `trace` library (see the dynamic-analysis sketch below).
→ Variable types are categorized into 'Int', 'Decimal', 'String', 'Binary', 'List', 'Tuple', and 'Object' (see the type-categorization sketch below).
→ The tool also includes a Visualizer. This component presents analysis results through figures and reports, using libraries like Matplotlib (see the plotting sketch below).
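
A minimal sketch of the static-analysis step referenced above, assuming a standard `ast` walk; the function name, category keys, and the simple "decision points + 1" complexity estimate are illustrative, not the tool's actual API:

```python
import ast

def analyze_static(source: str) -> dict:
    """Count control-flow constructs and roughly estimate complexity for one program."""
    tree = ast.parse(source)
    counts = {"If": 0, "For": 0, "While": 0, "Nested If": 0, "Nested Loop": 0}

    def contains(node, types):
        # True if any strict descendant of `node` is an instance of `types`.
        return any(isinstance(child, types) for child in ast.walk(node) if child is not node)

    for node in ast.walk(tree):
        if isinstance(node, ast.If):
            counts["If"] += 1
            if contains(node, ast.If):
                counts["Nested If"] += 1
        elif isinstance(node, (ast.For, ast.While)):
            counts["For" if isinstance(node, ast.For) else "While"] += 1
            if contains(node, (ast.For, ast.While)):
                counts["Nested Loop"] += 1

    # Rough cyclomatic complexity: one plus the number of decision points.
    decision_nodes = (ast.If, ast.For, ast.While, ast.BoolOp, ast.ExceptHandler)
    counts["Cyclomatic complexity"] = 1 + sum(
        isinstance(n, decision_nodes) for n in ast.walk(tree)
    )
    counts["Lines of code"] = len([ln for ln in source.splitlines() if ln.strip()])
    return counts

example = (
    "def f(xs):\n"
    "    for x in xs:\n"
    "        if x > 0:\n"
    "            return x\n"
    "    return -1\n"
)
print(analyze_static(example))
```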
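
For the dynamic side, a sketch of how loop lengths and call depths can be observed at runtime with the standard-library `trace` module and a `sys.settrace` hook; the subject program and helper names are hypothetical, and the paper's exact instrumentation may differ:

```python
import sys
import trace

# Hypothetical subject program: the loop length equals the hit count of its body line.
def subject(n):
    total = 0
    for i in range(n):   # loop header
        total += i       # executions of this line == loop length
    return total

tracer = trace.Trace(count=True, trace=False)
tracer.runfunc(subject, 10)
for (filename, lineno), hits in sorted(tracer.results().counts.items()):
    print(f"{filename}:{lineno} executed {hits} times")

# Recursion depth via a settrace hook (illustrative, single-threaded assumption).
def max_call_depth(func, *args):
    depth = {"cur": 0, "max": 0}
    def hook(frame, event, arg):
        if event == "call":
            depth["cur"] += 1
            depth["max"] = max(depth["max"], depth["cur"])
        elif event == "return":
            depth["cur"] -= 1
        return hook
    sys.settrace(hook)
    try:
        func(*args)
    finally:
        sys.settrace(None)
    return depth["max"]

fact = lambda n: 1 if n <= 1 else n * fact(n - 1)
print("max call depth for fact(5):", max_call_depth(fact, 5))
```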
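
The type categorization could be realized with a simple mapping from Python runtime types to the seven categories listed above; this mapping is an assumption (for instance, 'Binary' is interpreted here as byte data), not the tool's definitive rule set:

```python
from decimal import Decimal

# Assumed mapping from Python runtime types to the post's seven categories.
def categorize(value) -> str:
    if isinstance(value, (bool, int)):
        return "Int"
    if isinstance(value, (float, Decimal)):
        return "Decimal"
    if isinstance(value, str):
        return "String"
    if isinstance(value, (bytes, bytearray)):
        return "Binary"
    if isinstance(value, list):
        return "List"
    if isinstance(value, tuple):
        return "Tuple"
    return "Object"   # dicts, sets, None, user-defined classes, ...

print([categorize(v) for v in [3, 2.5, "hi", b"\x01", [1, 2], (1, 2), {"k": 1}]])
# ['Int', 'Decimal', 'String', 'Binary', 'List', 'Tuple', 'Object']
```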
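
And a small Matplotlib plotting sketch of the kind of per-construct report the Visualizer could produce; the accuracy numbers below are made up purely to illustrate the figure format:

```python
import matplotlib.pyplot as plt

# Illustrative (made-up) per-construct accuracies, only to show the report format;
# real numbers would come from the analysis results.
constructs = ["If", "For", "While", "Nested If", "Nested Loop", "Recursion"]
accuracy = [0.88, 0.74, 0.71, 0.69, 0.62, 0.58]

fig, ax = plt.subplots(figsize=(6, 3))
ax.bar(constructs, accuracy, color="steelblue")
ax.set_ylabel("Execution reasoning accuracy")
ax.set_ylim(0, 1)
ax.set_title("Per-construct accuracy (illustrative data)")
fig.tight_layout()
fig.savefig("per_construct_accuracy.png")
```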
-----
Key Insights 💡:
→ LLMs handle conditional statements better than loops and recursion.
→ Nested constructs, like nested ifs and loops, pose challenges for LLMs.
→ Higher cyclomatic complexity negatively impacts LLM performance in code execution reasoning.
→ Longer loop lengths also negatively affect LLMs' ability to predict outputs.
→ LLMs achieve high accuracy in predicting variable types but struggle with predicting values, especially for complex types.
-----
Results 📊:
→ Evaluated on 1450 Python programs across four benchmarks (Avatar, ClassEval, CRUXEval, HumanEval).
→ GPT-4-Turbo achieved an overall Execution Reasoning Rate of 81.17%.
→ Gemini-1.5-Pro achieved an overall Execution Reasoning Rate of 74.15%.
→ DeepSeek-Coder (Instruct, 33B) showed a rate of 60.23%, outperforming CodeLlama (Instruct, 34B) at 41.93% among the open-source models.