LLMs can judge code summary quality better than traditional metrics by role-playing as reviewers.
This paper introduces CODERPE, a novel method that uses LLMs to evaluate code summarization quality. It employs role-playing prompts in which LLMs act as code reviewers, authors, and analysts to assess generated summaries across four dimensions: coherence, consistency, fluency, and relevance.
-----
https://arxiv.org/abs/2412.01333
🔍 Original Problem:
Existing metrics such as BLEU and ROUGE-L align poorly with human judgment when evaluating code summaries. Human evaluation is more reliable, but it is expensive and hard to scale.
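As a rough illustration (not from the paper), here is how an n-gram metric like BLEU can score a semantically equivalent summary low simply because the wording differs; the example strings are hypothetical:

```python
# Minimal sketch: BLEU compares n-gram overlap, so a paraphrased but
# accurate summary gets a low score despite near-identical meaning.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "returns the index of the first matching element".split()
candidate = "finds the position of the first element that matches".split()

score = sentence_bleu([reference], candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")  # low, even though the meaning is preserved
```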
-----
⚡ Solution in this Paper:
→ CODERPE prompts LLMs to play diverse roles such as code reviewer, author, editor, and analyst (see the prompt sketch after this list).
→ Each role evaluates summaries on specific dimensions: coherence, consistency, fluency and relevance.
→ The system uses chain-of-thought reasoning and in-context learning with demonstration examples.
→ Multiple evaluation turns and rating forms help ensure robust assessments.
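A minimal sketch of what a role-playing evaluation prompt could look like, assuming the OpenAI chat API; the prompt wording, model name, and 1-5 scale are illustrative assumptions, not the paper's exact setup:

```python
# Sketch of a role-playing, chain-of-thought evaluation call in the
# spirit of CODERPE (not the paper's exact prompt).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def evaluate_summary(code: str, summary: str, role: str, dimension: str) -> str:
    prompt = (
        f"You are an experienced {role}.\n"
        f"Evaluate the following code summary for {dimension}.\n"
        "Think step by step, then give a rating from 1 (worst) to 5 (best).\n\n"
        f"Code:\n{code}\n\nSummary:\n{summary}\n\n"
        "Reasoning and rating:"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical choice; the paper evaluates ChatGPT
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# One call per (role, dimension) pair, e.g.:
# evaluate_summary(code, summary, role="code reviewer", dimension="coherence")
```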
-----
💡 Key Insights:
→ LLMs can effectively evaluate code summaries without reference summaries (reference-free evaluation)
→ Role-playing prompts help LLMs better understand evaluation criteria
→ Using 4 demonstration examples yields optimal performance
→ Multiple evaluation turns improve reliability (see the aggregation sketch below)
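A minimal sketch, using made-up scores, of averaging ratings over several evaluation turns and measuring agreement with human judgments via Spearman correlation:

```python
# Average per-turn LLM ratings for each summary, then compute Spearman
# correlation against human scores (illustrative numbers only).
from statistics import mean
from scipy.stats import spearmanr

llm_turn_scores = [[4, 5, 4], [2, 3, 2], [5, 5, 4], [3, 3, 3]]  # one list per summary
human_scores = [4.5, 2.0, 5.0, 3.0]

llm_scores = [mean(turns) for turns in llm_turn_scores]
rho, _ = spearmanr(llm_scores, human_scores)
print(f"Spearman correlation: {rho:.2f}")
```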
-----
📊 Results:
→ 81.59% Spearman correlation with human evaluations
→ Outperforms BERTScore by 17.27%
→ ChatGPT achieves ~90% scores across all evaluation dimensions