Making code review evaluation smarter through semantic understanding
The paper presents a semantic-based method for evaluating automatically generated code reviews, moving beyond traditional lexical-similarity metrics.
-----
https://arxiv.org/abs/2501.05176
Original Problem 🤔:
→ Current code review evaluation metrics such as BLEU rely solely on lexical similarity, so they often penalize reviews that are semantically equivalent but worded differently. This leads to inaccurate assessment of automatically generated code reviews (see the sketch below).
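For intuition, here is a minimal sketch (not taken from the paper, using NLTK's `sentence_bleu`) of how a lexical metric scores a reworded but semantically equivalent review near zero:

```python
# Minimal illustration of why lexical metrics penalize paraphrases:
# two reviews with the same intent share almost no n-grams, so BLEU is near zero.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "please extract this logic into a separate helper method".split()
candidate = "consider moving this code out into its own utility function".split()

score = sentence_bleu([reference], candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")  # near zero despite equivalent feedback
```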
-----
Solution in this Paper 🔧:
→ The paper introduces GradedReviews, a benchmark with 5,164 manually scored code reviews.
→ Proposes two semantic-based approaches: embedding-based similarity using deep learning models and LLM-based scoring using ChatGPT.
→ The embedding method converts reviews into vectors and measures their cosine similarity.
→ The LLM approach directly compares generated and reference reviews through prompts (illustrative sketches of both approaches follow below).
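A hedged sketch of the embedding-based approach: encode both reviews and compare the vectors with cosine similarity. The model name below is an assumption for illustration; the paper's exact embedding model may differ.

```python
# Embedding-based scoring sketch: encode generated and reference reviews,
# then compare them with cosine similarity.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model, illustrative only

generated = "consider moving this code out into its own utility function"
reference = "please extract this logic into a separate helper method"

embeddings = model.encode([generated, reference])
similarity = cosine_similarity([embeddings[0]], [embeddings[1]])[0][0]
print(f"semantic similarity: {similarity:.3f}")  # high despite little word overlap
```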
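And a hedged sketch of the LLM-based approach: prompt a chat model to rate how well the generated review matches the reference. The prompt wording and the 1-5 scale here are illustrative assumptions, not the paper's exact prompt.

```python
# LLM-based scoring sketch: ask a ChatGPT-family model to compare the
# generated review against the reference and return a numeric score.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

prompt = (
    "Reference review: please extract this logic into a separate helper method\n"
    "Generated review: consider moving this code out into its own utility function\n"
    "On a scale of 1-5, how well does the generated review convey the same "
    "feedback as the reference? Reply with a single number."
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # assumed model choice, illustrative only
    messages=[{"role": "user", "content": prompt}],
    temperature=0,  # deterministic scoring
)
print(response.choices[0].message.content)
```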
-----
Key Insights 💡:
→ Most current code review generation approaches produce low-quality reviews (90.82% rated as poor)
→ Semantic-based metrics significantly outperform lexical-based metrics
→ LLM-based scoring shows stronger correlation with human evaluation
-----
Results 📊:
→ Improved correlation with human scores from 0.22 (BLEU) to 0.47 (LLM-based)
→ 80.11% accuracy in matching human-assigned scores
→ 7.2% higher accuracy compared to traditional metrics