"A Tool for In-depth Analysis of Code Execution Reasoning of LLMs"
The podcast below on this paper was generated with Google's Illuminate.
https://arxiv.org/abs/2501.18482
Current evaluations of Large Language Models (LLMs) on code execution reasoning are limited: they lack in-depth analysis of the factors that influence LLM performance beyond simple output prediction.
This paper introduces a tool for detailed analysis of LLM code execution reasoning, helping to understand how specific code properties impact LLM performance.
-----
📌 The tool offers a crucial diagnostic lens for code LLMs. It moves beyond pass/fail metrics and pinpoints the specific code properties where LLMs struggle in execution reasoning.
📌 The tool's program analysis is its core strength: static and dynamic analyses provide granular insights, revealing LLM vulnerabilities tied to particular program constructs and complexity.
📌 By dissecting execution reasoning failures, the tool enables targeted improvements, guiding dataset creation and fine-tuning strategies that can enhance LLM code understanding and performance.
----------
Methods Explored in this Paper 🧠:
→ The paper proposes a tool to analyze LLMs' code execution reasoning.
→ The tool uses a Program Analyzer that combines static and dynamic techniques to extract program constructs, code complexity, dynamic properties, and variable types.
→ Static analysis identifies constructs such as 'If', 'For', 'While', 'Nested Ifs', and 'Nested Loops' using Python's `ast` module, and measures complexity via lines of code and cyclomatic complexity (a minimal sketch of this step follows the list).
→ Dynamic analysis executes the programs to determine dynamic properties; loop lengths and recursion depths are captured with the `trace` library (see the second sketch below).
→ Variable types are categorized into 'Int', 'Decimal', 'String', 'Binary', 'List', 'Tuple', and 'Object'.
→ The tool also includes a Visualizer that presents the analysis results through figures and reports, using libraries such as Matplotlib (the third sketch below combines the type categories with a simple chart).
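As a concrete illustration of the static-analysis step, the sketch below counts control-flow constructs and estimates cyclomatic complexity with Python's `ast` module. It is a minimal sketch: the class name, category labels, and the decision-point-based complexity estimate are illustrative assumptions, not the paper's actual implementation.

```python
import ast

class ConstructCounter(ast.NodeVisitor):
    """Count control-flow constructs and decision points in a Python program."""

    def __init__(self):
        self.counts = {"If": 0, "For": 0, "While": 0, "Nested If": 0, "Nested Loop": 0}
        self.decision_points = 0   # feeds the cyclomatic-complexity estimate
        self._if_depth = 0
        self._loop_depth = 0

    def visit_If(self, node):
        self.counts["If"] += 1
        self.decision_points += 1
        if self._if_depth > 0:
            self.counts["Nested If"] += 1
        self._if_depth += 1
        self.generic_visit(node)   # keep walking into the body
        self._if_depth -= 1

    def _visit_loop(self, node):
        self.decision_points += 1
        if self._loop_depth > 0:
            self.counts["Nested Loop"] += 1
        self._loop_depth += 1
        self.generic_visit(node)
        self._loop_depth -= 1

    def visit_For(self, node):
        self.counts["For"] += 1
        self._visit_loop(node)

    def visit_While(self, node):
        self.counts["While"] += 1
        self._visit_loop(node)

def analyze_statically(source: str) -> dict:
    counter = ConstructCounter()
    counter.visit(ast.parse(source))
    return {
        **counter.counts,
        "LOC": sum(1 for line in source.splitlines() if line.strip()),
        # Rough McCabe-style estimate: decision points + 1
        "Cyclomatic": counter.decision_points + 1,
    }

print(analyze_statically("for i in range(3):\n    if i % 2:\n        print(i)"))
```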
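For the dynamic-analysis step, loop lengths could plausibly be captured with the standard-library `trace` module, which records how many times each source line executes; the example program below is an illustrative assumption rather than the tool's real code. Recursion depth could be measured analogously, for example by counting nested call events with `sys.settrace`.

```python
import trace

def count_down(n):
    total = 0
    while n > 0:        # loop header
        total += n      # loop body: executes once per iteration
        n -= 1
    return total

# count=1 records how often each line runs; trace=0 suppresses per-line printing.
tracer = trace.Trace(count=1, trace=0)
tracer.runfunc(count_down, 5)

# counts maps (filename, lineno) -> number of executions; the hit count of a
# loop-body line gives the observed loop length (5 here).
counts = tracer.results().counts
for (filename, lineno), hits in sorted(counts.items()):
    print(f"{filename}:{lineno} executed {hits} time(s)")
```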
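The variable-type categories and the Visualizer might be wired together roughly as follows: runtime values are mapped onto the seven categories listed above, and Matplotlib renders a simple per-category chart. The `categorize` helper, the interpretation of 'Binary' as bytes-like values, and the plotted numbers are placeholders for illustration, not results or code from the paper.

```python
import matplotlib.pyplot as plt

def categorize(value) -> str:
    """Map a runtime value onto one of the seven type categories."""
    if isinstance(value, int):         # note: bools are ints in Python and land here too
        return "Int"
    if isinstance(value, float):
        return "Decimal"               # floats treated as 'Decimal' in this sketch
    if isinstance(value, str):
        return "String"
    if isinstance(value, (bytes, bytearray)):
        return "Binary"                # 'Binary' interpreted here as bytes-like values
    if isinstance(value, list):
        return "List"
    if isinstance(value, tuple):
        return "Tuple"
    return "Object"                    # everything else: dicts, sets, user-defined classes, ...

print([categorize(v) for v in (3, 2.5, "hi", b"\x00", [1], (1,), {"k": 1})])

# Placeholder per-category numbers purely to demonstrate the plotting step;
# real values would come from the analyzer's output, not from this sketch.
accuracy = {"Int": 0.9, "Decimal": 0.8, "String": 0.85, "List": 0.6, "Tuple": 0.65, "Object": 0.5}
plt.bar(list(accuracy), list(accuracy.values()))
plt.ylabel("Value-prediction accuracy")
plt.title("Per-type accuracy (placeholder data)")
plt.savefig("per_type_accuracy.png")
```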
-----
Key Insights 💡:
→ LLMs handle conditional statements better than loops and recursion.
→ Nested constructs, such as nested ifs and nested loops, pose challenges for LLMs.
→ Higher cyclomatic complexity negatively impacts LLM performance on code execution reasoning.
→ Longer loop lengths also hurt LLMs' ability to predict outputs.
→ LLMs achieve high accuracy in predicting variable types but struggle to predict values, especially for complex types.
-----
Results 📊:
→ Evaluated on 1,450 Python programs across four benchmarks (Avatar, ClassEval, CRUXEval, HumanEval).
→ GPT-4-Turbo achieved an overall Execution Reasoning Rate of 81.17%.
→ Gemini-1.5-Pro achieved an overall Execution Reasoning Rate of 74.15%.
→ Among open-source models, DeepSeek-Coder (Instruct-33b) reached 60.23%, outperforming CodeLlama (Inst-34b) at 41.93%.


