This paper presents a hierarchical approach using local LLMs for summarizing large codebases, specifically tailored for business applications.
-----
https://arxiv.org/abs/2501.07857
Solution in this Paper 💡:
→ A two-step hierarchical approach. First, code is segmented into smaller units using abstract syntax trees (ASTs).
→ Local LLMs summarize these segments using custom prompts tailored to each segment type (functions, variables, etc.).
→ These segment summaries are aggregated into file-level summaries, incorporating domain and problem context.
→ File summaries are then combined to create package-level summaries.
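
A minimal sketch of the pipeline above, assuming Python source and Python's built-in `ast` module for segmentation (the paper's target language and parser may differ). The prompt templates, the function names, and the `llm` callable that wraps a local model are all hypothetical and used for illustration only.

```python
import ast
from typing import Callable, Dict, List

# Hypothetical prompt templates, one per segment type (not the paper's prompts).
PROMPTS = {
    "function": "Summarize this function from a {domain} application:\n{code}",
    "class": "Summarize this class from a {domain} application:\n{code}",
    "variable": "Explain the purpose of this variable in a {domain} application:\n{code}",
}


def segment(source: str) -> List[Dict[str, str]]:
    """Split a file into top-level segments using its AST."""
    segments = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            kind = "function"
        elif isinstance(node, ast.ClassDef):
            kind = "class"
        elif isinstance(node, (ast.Assign, ast.AnnAssign)):
            kind = "variable"
        else:
            continue  # imports and bare expressions are skipped in this sketch
        segments.append({"kind": kind, "code": ast.get_source_segment(source, node)})
    return segments


def summarize_file(source: str, domain: str, llm: Callable[[str], str]) -> str:
    """Summarize each segment with its own prompt, then aggregate per file."""
    parts = [
        llm(PROMPTS[seg["kind"]].format(domain=domain, code=seg["code"]))
        for seg in segment(source)
    ]
    prompt = (
        f"Combine these segment summaries into one file summary "
        f"for a {domain} application:\n" + "\n".join(parts)
    )
    return llm(prompt)


def summarize_package(files: Dict[str, str], domain: str,
                      llm: Callable[[str], str]) -> str:
    """Combine file-level summaries into a package-level summary."""
    file_summaries = [
        f"{name}: {summarize_file(src, domain, llm)}" for name, src in files.items()
    ]
    prompt = (
        f"Combine these file summaries into one package summary "
        f"for a {domain} application:\n" + "\n".join(file_summaries)
    )
    return llm(prompt)
```

Because every segment gets its own call, no function or variable can be silently dropped before aggregation. The trade-off is one model call per segment, plus one per file and one per package.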
-----
Results 😎:
→ Grounding the LLM improved domain relevance (DS) by over 7% in file-level summarization.
→ Direct file-level summarization with LLMs missed approximately 11% of functions and 24% of variables, while the proposed approach covers all segments.
→ Structured prompts with in-context learning improved function summarization accuracy (e.g., completeness by >13%, correctness and cohesiveness by 5%).
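
One rough way to estimate how many functions and variables a summary misses is sketched below. It assumes Python source and treats a segment as covered when its name appears verbatim in the summary. That name-matching rule is an illustrative proxy, not the paper's evaluation method, and the helper names are hypothetical.

```python
import ast
import re
from typing import Dict, Set


def defined_names(source: str) -> Dict[str, Set[str]]:
    """Collect function names and simply-assigned variable names from a file."""
    functions, variables = set(), set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            functions.add(node.name)
        elif isinstance(node, ast.Assign):
            for target in node.targets:
                if isinstance(target, ast.Name):
                    variables.add(target.id)
    return {"functions": functions, "variables": variables}


def missed_fraction(source: str, summary: str) -> Dict[str, float]:
    """Fraction of defined names, per kind, that never appear in the summary."""
    mentioned = set(re.findall(r"[A-Za-z_]\w*", summary))
    result = {}
    for kind, names in defined_names(source).items():
        missed = names - mentioned
        result[kind] = len(missed) / len(names) if names else 0.0
    return result
```

Running `missed_fraction` on a direct file-level summary and on a hierarchical one gives a quick side-by-side coverage comparison, though a summary can describe a segment without naming it, so this proxy may overstate what was missed.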