"A Tool for In-depth Analysis of Code Execution Reasoning of LLMs"
The podcast below on this paper was generated with Google's Illuminate.
https://arxiv.org/abs/2501.18482
Current evaluations of Large Language Models (LLMs) in code execution reasoning are limited. They lack in-depth analysis of factors influencing LLM performance beyond simple output prediction.
This paper introduces a tool for detailed analysis of LLM code execution reasoning. It helps explain how code properties affect LLM performance.
-----
📌 The tool offers a crucial diagnostic lens for code LLMs. It moves beyond pass/fail metrics to pinpoint the specific code properties on which LLMs struggle during execution reasoning.
📌 The tool's program analysis is its core strength. Static and dynamic analysis provide granular insights. This reveals LLM vulnerabilities related to program constructs and complexity.
📌 By dissecting execution reasoning failures, the tool enables targeted improvements. It guides dataset creation and fine-tuning strategies. This can enhance LLM code understanding and performance.
----------
Methods Explored in this Paper 🔧:
→ The paper proposes a tool to analyze LLMs' code execution reasoning.
→ The tool uses a Program Analyzer that combines static and dynamic techniques. It extracts program constructs, code complexity, dynamic properties, and variable types.
→ Static analysis identifies constructs like 'If', 'For', 'While', 'Nested Ifs', and 'Nested Loops' using Python's `ast` module, and measures complexity via lines of code and cyclomatic complexity (a static-analysis sketch follows after this list).
→ Dynamic analysis executes programs to determine dynamic properties. Loop lengths and recursion depths are captured using the `trace` library (see the dynamic-analysis sketch below).
→ Variable types are categorized into 'Int', 'Decimal', 'String', 'Binary', 'List', 'Tuple', and 'Object' (see the type-categorization sketch below).
→ The tool also includes a Visualizer. This component presents analysis results through figures and reports, using libraries like Matplotlib (see the plotting sketch below).
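
A minimal sketch of the static-analysis step referenced above, assuming a standard `ast` walk; the function name, category keys, and the simple "decision points + 1" complexity estimate are illustrative, not the tool's actual API:

```python
import ast

def analyze_static(source: str) -> dict:
    """Count control-flow constructs and roughly estimate complexity for one program."""
    tree = ast.parse(source)
    counts = {"If": 0, "For": 0, "While": 0, "Nested If": 0, "Nested Loop": 0}

    def contains(node, types):
        # True if any strict descendant of `node` is an instance of `types`.
        return any(isinstance(child, types) for child in ast.walk(node) if child is not node)

    for node in ast.walk(tree):
        if isinstance(node, ast.If):
            counts["If"] += 1
            if contains(node, ast.If):
                counts["Nested If"] += 1
        elif isinstance(node, (ast.For, ast.While)):
            counts["For" if isinstance(node, ast.For) else "While"] += 1
            if contains(node, (ast.For, ast.While)):
                counts["Nested Loop"] += 1

    # Rough cyclomatic complexity: one plus the number of decision points.
    decision_nodes = (ast.If, ast.For, ast.While, ast.BoolOp, ast.ExceptHandler)
    counts["Cyclomatic complexity"] = 1 + sum(
        isinstance(n, decision_nodes) for n in ast.walk(tree)
    )
    counts["Lines of code"] = len([ln for ln in source.splitlines() if ln.strip()])
    return counts

example = (
    "def f(xs):\n"
    "    for x in xs:\n"
    "        if x > 0:\n"
    "            return x\n"
    "    return -1\n"
)
print(analyze_static(example))
```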
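
For the dynamic side, a sketch of how loop lengths and call depths can be observed at runtime with the standard-library `trace` module and a `sys.settrace` hook; the subject program and helper names are hypothetical, and the paper's exact instrumentation may differ:

```python
import sys
import trace

# Hypothetical subject program: the loop length equals the hit count of its body line.
def subject(n):
    total = 0
    for i in range(n):   # loop header
        total += i       # executions of this line == loop length
    return total

tracer = trace.Trace(count=True, trace=False)
tracer.runfunc(subject, 10)
for (filename, lineno), hits in sorted(tracer.results().counts.items()):
    print(f"{filename}:{lineno} executed {hits} times")

# Recursion depth via a settrace hook (illustrative, single-threaded assumption).
def max_call_depth(func, *args):
    depth = {"cur": 0, "max": 0}
    def hook(frame, event, arg):
        if event == "call":
            depth["cur"] += 1
            depth["max"] = max(depth["max"], depth["cur"])
        elif event == "return":
            depth["cur"] -= 1
        return hook
    sys.settrace(hook)
    try:
        func(*args)
    finally:
        sys.settrace(None)
    return depth["max"]

fact = lambda n: 1 if n <= 1 else n * fact(n - 1)
print("max call depth for fact(5):", max_call_depth(fact, 5))
```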
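
The type categorization could be realized with a simple mapping from Python runtime types to the seven categories listed above; this mapping is an assumption (for instance, 'Binary' is interpreted here as byte data), not the tool's definitive rule set:

```python
from decimal import Decimal

# Assumed mapping from Python runtime types to the post's seven categories.
def categorize(value) -> str:
    if isinstance(value, (bool, int)):
        return "Int"
    if isinstance(value, (float, Decimal)):
        return "Decimal"
    if isinstance(value, str):
        return "String"
    if isinstance(value, (bytes, bytearray)):
        return "Binary"
    if isinstance(value, list):
        return "List"
    if isinstance(value, tuple):
        return "Tuple"
    return "Object"   # dicts, sets, None, user-defined classes, ...

print([categorize(v) for v in [3, 2.5, "hi", b"\x01", [1, 2], (1, 2), {"k": 1}]])
# ['Int', 'Decimal', 'String', 'Binary', 'List', 'Tuple', 'Object']
```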
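
And a small Matplotlib plotting sketch of the kind of per-construct report the Visualizer could produce; the accuracy numbers below are made up purely to illustrate the figure format:

```python
import matplotlib.pyplot as plt

# Illustrative (made-up) per-construct accuracies, only to show the report format;
# real numbers would come from the analysis results.
constructs = ["If", "For", "While", "Nested If", "Nested Loop", "Recursion"]
accuracy = [0.88, 0.74, 0.71, 0.69, 0.62, 0.58]

fig, ax = plt.subplots(figsize=(6, 3))
ax.bar(constructs, accuracy, color="steelblue")
ax.set_ylabel("Execution reasoning accuracy")
ax.set_ylim(0, 1)
ax.set_title("Per-construct accuracy (illustrative data)")
fig.tight_layout()
fig.savefig("per_construct_accuracy.png")
```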
-----
Key Insights 💡:
→ LLMs handle conditional statements better than loops and recursion.
→ Nested constructs, like nested ifs and loops, pose challenges for LLMs.
→ Higher cyclomatic complexity negatively impacts LLM performance in code execution reasoning.
→ Longer loop lengths also negatively affect LLMs' ability to predict outputs.
→ LLMs achieve high accuracy in predicting variable types but struggle with predicting values, especially for complex types.
-----
Results 📊:
→ Evaluated on 1450 Python programs across four benchmarks (Avatar, ClassEval, CRUXEval, HumanEval).
→ GPT-4-Turbo achieved an overall Execution Reasoning Rate of 81.17%.
→ Gemini-1.5-Pro achieved an overall Execution Reasoning Rate of 74.15%.
→ DeepSeek-Coder (Instruct, 33B) showed a rate of 60.23%, outperforming CodeLlama (Instruct, 34B) at 41.93% among the open-source models.