
"Can Large Language Models Serve as Evaluators for Code Summarization?"

The podcast on this paper is generated with Google's Illuminate.

LLMs can judge code summary quality better than traditional metrics by role-playing as reviewers.

This paper introduces CODERPE, a novel method that uses LLMs to evaluate code summarization quality. It employs role-playing prompts in which LLMs act as code reviewers, authors, and analysts to assess generated summaries along four dimensions: coherence, consistency, fluency, and relevance.

-----

https://arxiv.org/abs/2412.01333

🔍 Original Problem:

Existing reference-based metrics such as BLEU and ROUGE-L align poorly with human judgment when evaluating code summaries. Human evaluation is effective but expensive and hard to scale.
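
To make the mismatch concrete, here is a minimal sketch (not from the paper) that scores a generated summary against a single human reference with sentence-level BLEU and ROUGE-L, using the nltk and rouge-score packages; the example sentences are made up.

```python
# Minimal sketch (not from the paper): scoring a generated code summary
# against one human-written reference with BLEU and ROUGE-L.
# pip install nltk rouge-score
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "Returns the index of the first element matching the given key."
candidate = "Finds where the key first appears in the list and returns its position."

# Sentence-level BLEU with smoothing (short texts otherwise collapse to 0).
bleu = sentence_bleu(
    [reference.split()],
    candidate.split(),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE-L scores the longest common subsequence with the reference.
rouge_l = rouge_scorer.RougeScorer(["rougeL"]).score(reference, candidate)["rougeL"].fmeasure

# Both scores come out low because few n-grams overlap, even though a human
# reader would accept the candidate as a faithful summary -- the misalignment
# with human judgment that motivates the paper.
print(f"BLEU: {bleu:.3f}  ROUGE-L F1: {rouge_l:.3f}")
```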

-----

⚡ Solution in this Paper:

→ CODERPE prompts LLMs to play diverse roles such as code reviewer, author, editor, and analyst (see the prompt sketch after this list).

→ Each role evaluates summaries on specific dimensions: coherence, consistency, fluency and relevance.

→ The system uses chain-of-thought reasoning and in-context learning with demonstration examples.

→ Multiple evaluation turns and rating forms help ensure robust assessments.
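
A minimal sketch of what such a role-playing, reference-free evaluation call could look like, assuming an OpenAI-style chat API. The role description, rating scale, model name, and prompt wording are illustrative assumptions, not the paper's exact CODERPE templates.

```python
# Illustrative role-playing evaluator (assumed OpenAI-style chat API);
# the prompt wording below is a stand-in, not the paper's CODERPE template.
from openai import OpenAI

client = OpenAI()

def rate_summary(code: str, summary: str, role: str, dimension: str) -> str:
    """Ask the LLM, acting in one role, to rate one summary on one dimension."""
    prompt = (
        f"You are a {role} reviewing documentation for the code below.\n\n"
        f"Code:\n{code}\n\n"
        f"Summary:\n{summary}\n\n"
        f"Rate the summary's {dimension} from 1 to 5. "
        "Think step by step, then finish with a line 'Score: <number>'."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

code = "def find(xs, key):\n    return next((i for i, x in enumerate(xs) if x == key), -1)"
summary = "Returns the index of the first element equal to key, or -1 if absent."

# One role/dimension pair; CODERPE varies both across calls.
print(rate_summary(code, summary, role="code reviewer", dimension="relevance"))
```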

-----

💡 Key Insights:

→ LLMs can effectively evaluate code summaries without reference examples

→ Role-playing prompts help LLMs better understand evaluation criteria

→ Using 4 demonstration examples yields optimal performance

→ Multiple evaluation turns improve reliability (see the aggregation sketch after this list)
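
To illustrate the last point, a small sketch that repeats the evaluation call and averages the parsed ratings, reusing the hypothetical rate_summary helper from the earlier sketch; the score-parsing regex and the turn count are assumptions, not the paper's settings.

```python
# Sketch of multi-turn aggregation (turn count and parsing are assumptions).
import re
import statistics

def parse_score(reply: str) -> float | None:
    """Pull the numeric rating out of a reply ending in 'Score: <number>'."""
    match = re.search(r"Score:\s*([0-9]+(?:\.[0-9]+)?)", reply)
    return float(match.group(1)) if match else None

def multi_turn_score(code: str, summary: str, turns: int = 5) -> float:
    """Repeat the evaluation call and average the ratings that parsed cleanly."""
    replies = [rate_summary(code, summary, "code reviewer", "relevance") for _ in range(turns)]
    scores = [s for s in (parse_score(r) for r in replies) if s is not None]
    return statistics.mean(scores)
```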

-----

📊 Results:

→ 81.59% Spearman correlation with human evaluations

→ Outperforms BERTScore by 17.27%

→ ChatGPT achieves ~90% scores across all evaluation dimensions
