
"CodeMonkeys: Scaling Test-Time Compute for Software Engineering"

The podcast below was generated with Google's Illuminate.

The paper introduces CodeMonkeys, a system designed to improve Large Language Model (LLM) performance in solving software engineering tasks by efficiently scaling test-time computation.

Amortized context, parallel attempts

CodeMonkeys uses iterative refinement and parallel sampling to address real-world GitHub issues. This approach aims to enhance both solution quality and cost-effectiveness in complex coding challenges.

-----

Paper - https://arxiv.org/abs/2501.14723

Original Problem 🛠️:

→ Current methods for improving LLM coding skills heavily rely on scaling training data and model size.

→ This scaling approach is becoming prohibitively expensive.

→ An alternative is to scale test-time compute, but effective strategies for this are unclear, especially for complex software engineering tasks.

-----

Solution in this Paper 💡:

→ CodeMonkeys system scales test-time compute for solving GitHub issues from SWE-bench.

→ It employs "serial" scaling by allowing models to iteratively refine code edits and testing scripts (see the sketch after this list).

→ Models jointly generate code edits and corresponding test scripts.

→ Execution feedback from tests guides iterative refinement of both edits and tests.

→ "Parallel" scaling is achieved by generating multiple candidate (edit, test) pairs for each issue.

→ This parallel approach amortizes the cost of retrieving codebase context across multiple samples.

→ Codebase context retrieval is simplified by letting an LLM scan and rank all files to identify relevant ones.

→ Selection among candidate edits combines test-based voting and model-based selection.

→ Test voting filters candidates based on test passing rates.

→ Model-based selection then chooses the final edit from the filtered set, potentially with further test generation for refinement.
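
For concreteness, here is a minimal sketch of the generation stage described above. It assumes hypothetical helpers (`llm_rank_files`, `read_file`, `llm_generate`, `apply_edit`, `run_tests`) that stand in for the paper's actual prompts and execution harness; the point is the shape of the loop: context is retrieved once per issue and reused across parallel samples, and each sample iteratively refines an (edit, test) pair using execution feedback.

```python
# Hypothetical sketch of the CodeMonkeys generation stage for a single GitHub issue.
# Helper names (llm_rank_files, read_file, llm_generate, apply_edit, run_tests) are
# placeholders for illustration, not APIs from the paper or its code release.

def generate_candidates(issue, repo_files, n_parallel=10, n_serial=4):
    # Context retrieval happens once per issue and is amortized across all
    # parallel samples: an LLM scans and ranks files, and the top-ranked
    # files become the shared codebase context.
    relevant = llm_rank_files(issue, repo_files)
    context = "\n\n".join(read_file(path) for path in relevant)

    candidates = []
    for _ in range(n_parallel):              # "parallel" scaling: independent attempts
        edit, test, feedback = None, None, ""
        for _ in range(n_serial):            # "serial" scaling: iterative refinement
            # Jointly (re)generate a code edit and a test script, conditioning
            # on execution feedback from the previous iteration.
            edit, test = llm_generate(issue, context, edit, test, feedback)
            feedback = run_tests(apply_edit(repo_files, edit), test).log
        candidates.append((edit, test))
    return candidates
```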

-----

Key Insights from this Paper 🔑:

→ Scaling test-time compute, both serially and in parallel, significantly boosts LLM performance for complex coding tasks.

→ Jointly generating code and tests provides richer feedback, enhancing serial scaling effectiveness.

→ Parallel sampling makes amortizing context retrieval costs feasible, simplifying context handling.

→ Combining test voting with model selection improves the accuracy of choosing correct code edits from multiple candidates (a sketch of this two-stage selection follows this list).
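
To make that last insight concrete, here is a rough sketch of the two-stage selection, under the same placeholder helpers as the earlier sketch plus a hypothetical `llm_pick_best`: candidates are first filtered by how many of the jointly generated tests they pass, then a model chooses among the survivors.

```python
# Hypothetical sketch of two-stage selection over candidate (edit, test) pairs.
# apply_edit, run_tests, and llm_pick_best are placeholders, not names from the paper.

def select_edit(candidates, repo_files, top_k=3):
    # Stage 1: test-based voting. Run every generated test against every
    # candidate edit and keep the edits with the highest pass counts.
    tests = [test for _, test in candidates]
    scored = []
    for edit, _ in candidates:
        patched = apply_edit(repo_files, edit)
        pass_count = sum(run_tests(patched, test).passed for test in tests)
        scored.append((pass_count, edit))
    scored.sort(key=lambda item: item[0], reverse=True)
    finalists = [edit for _, edit in scored[:top_k]]

    # Stage 2: model-based selection. An LLM inspects the finalists
    # (optionally generating further tests to tell them apart) and picks one.
    return llm_pick_best(finalists)
```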

-----

Results 🎯:

→ Achieves 69.8% coverage on SWE-bench Verified, i.e., for 69.8% of issues at least one generated candidate edit is correct.

→ Obtains a final score of 57.4% on SWE-bench Verified after selection (see the quick check after this list).

→ The selection method significantly outperforms randomly choosing among candidates, demonstrating effective candidate selection.

→ "Barrel of Monkeys" ensemble, combining CodeMonkeys with top SWE-bench submissions, reaches 66.2% score.
