The paper introduces CodeMonkeys, a system designed to improve Large Language Model (LLM) performance on software engineering tasks by efficiently scaling test-time compute.
Amortized context, parallel attempts
CodeMonkeys uses iterative refinement and parallel sampling to address real-world GitHub issues. This approach aims to enhance both solution quality and cost-effectiveness in complex coding challenges.
-----
Paper - https://arxiv.org/abs/2501.14723
Original Problem 🛠️:
→ Current methods for improving LLM coding skills heavily rely on scaling training data and model size.
→ This scaling approach is becoming prohibitively expensive.
→ An alternative is to scale test-time compute, but effective strategies for this are unclear, especially for complex software engineering tasks.
-----
Solution in this Paper 💡:
→ The CodeMonkeys system scales test-time compute for solving GitHub issues from SWE-bench.
→ It employs "serial" scaling by letting models iteratively refine their code edits and test scripts.
→ Models jointly generate code edits and corresponding test scripts.
→ Execution feedback from running the tests guides iterative refinement of both edits and tests (see the first sketch after this list).
→ "Parallel" scaling is achieved by generating multiple candidate (edit, test) pairs for each issue.
→ This parallel approach amortizes the cost of retrieving codebase context across multiple samples.
→ Codebase context retrieval is simplified by letting an LLM scan and rank all files to identify relevant ones.
→ Selection among candidate edits combines test-based voting and model-based selection.
→ Test voting filters candidates based on test passing rates.
→ Model-based selection then chooses the final edit from the filtered set, optionally generating additional tests to distinguish between the finalists (see the second sketch after this list).
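Below is a minimal Python sketch of the generation side of this pipeline. It is illustrative only: the helper callables (`rank_files`, `generate`, `run_tests`) and the stopping rule are assumptions, not the paper's implementation, but the structure matches the idea above — context retrieved once per issue, many parallel candidates, each refined serially on execution feedback.

```python
# Sketch of CodeMonkeys-style test-time scaling (illustrative, not the authors'
# code). Context is retrieved once per issue and reused (amortized) across many
# parallel (edit, test) candidates, each refined serially using execution
# feedback from its own generated test.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Candidate:
    edit: str = ""       # proposed code patch
    test: str = ""       # generated test script for the issue
    feedback: str = ""   # output from the last execution of the test

def solve_issue(
    issue: str,
    codebase_files: list[str],
    rank_files: Callable[[str, list[str]], str],                 # LLM ranks files -> context
    generate: Callable[[str, str, Candidate], tuple[str, str]],  # LLM -> (edit, test)
    run_tests: Callable[[str, str], str],                        # apply edit, run test -> log
    num_parallel: int = 10,
    num_serial: int = 8,
) -> list[Candidate]:
    # Context retrieval happens once; its cost is shared by every sample below.
    context = rank_files(issue, codebase_files)

    candidates = []
    for _ in range(num_parallel):            # parallel scaling: independent attempts
        cand = Candidate()
        for _ in range(num_serial):          # serial scaling: iterative refinement
            # Jointly (re)generate an edit and a test, conditioned on prior feedback.
            cand.edit, cand.test = generate(issue, context, cand)
            cand.feedback = run_tests(cand.edit, cand.test)
            if "PASSED" in cand.feedback:    # crude stopping rule for this sketch
                break
        candidates.append(cand)
    return candidates
```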
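And a sketch of the two-stage selection step, reusing `Candidate` and the imports from the sketch above. The voting rule and the model-based selector (`choose_with_llm`) are placeholders for the paper's selection procedure, not its exact logic.

```python
# Sketch of candidate selection: test-based voting narrows the field, then a
# model picks the final edit among the finalists.

def select_edit(
    candidates: list[Candidate],
    run_tests: Callable[[str, str], str],
    choose_with_llm: Callable[[list[Candidate]], Candidate],   # model-based selector
    top_k: int = 3,
) -> str:
    # Test voting: score each edit by how many of the candidate tests it passes.
    def votes(cand: Candidate) -> int:
        return sum(
            "PASSED" in run_tests(cand.edit, other.test) for other in candidates
        )

    # Keep the top-k edits by vote count, then defer to model-based selection,
    # which (per the paper) may also generate extra tests to separate finalists.
    finalists = sorted(candidates, key=votes, reverse=True)[:top_k]
    return choose_with_llm(finalists).edit
```

Voting is cheap here because the tests already exist from the refinement loop; the model-based selector then only has to compare a handful of finalists rather than every sample.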
-----
Key Insights from this Paper 🔑:
→ Scaling test-time compute, both serially and in parallel, significantly boosts LLM performance for complex coding tasks.
→ Jointly generating code and tests provides richer feedback, enhancing serial scaling effectiveness.
→ Parallel sampling makes amortizing context retrieval costs feasible, simplifying context handling.
→ Combining test voting with model selection improves the accuracy of choosing correct code edits from multiple candidates.
-----
Results 🎯:
→ Achieves 69.8% coverage on SWE-bench Verified (the fraction of issues where at least one generated candidate edit is correct).
→ Obtains a final score of 57.4% on SWE-bench Verified after candidate selection.
→ Outperforms random selection significantly, demonstrating effective candidate selection.
→ "Barrel of Monkeys" ensemble, combining CodeMonkeys with top SWE-bench submissions, reaches 66.2% score.