LLMs as judges: can AI models really grade each other?
The paper investigates whether LLMs can accurately evaluate summaries generated by other LLMs, comparing their judgments against those of human evaluators in organizational settings.
-----
https://arxiv.org/abs/2501.08167
Methods in this Paper 🔍:
→ The study used Anthropic Claude to generate thematic summaries from open-ended survey responses.
→ Amazon's Titan Express and Nova Pro, along with Meta's Llama, served as LLM judges.
→ The research compared LLM-as-judge evaluations with human ratings using Cohen's kappa, Spearman's rho, and Krippendorff's alpha.
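The post doesn't show the evaluation pipeline itself, but a minimal sketch of the human-vs-LLM agreement comparison could look like the snippet below, assuming ratings are collected on a shared ordinal scale. It uses the standard `scikit-learn`, `scipy`, and third-party `krippendorff` packages; the variable names and toy ratings are illustrative, not from the paper.

```python
# Illustrative sketch: comparing LLM-judge ratings to human ratings.
# The rating values below are made up; only the metrics mirror the paper's setup.
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score
import krippendorff

# Hypothetical ordinal ratings (e.g., 1-5 thematic-alignment scores) for the same summaries
human_ratings = np.array([5, 4, 3, 4, 2, 5, 3, 4, 1, 4])
llm_ratings   = np.array([5, 4, 4, 4, 2, 4, 3, 5, 2, 4])

# Cohen's kappa: chance-corrected agreement between the two raters
kappa = cohen_kappa_score(human_ratings, llm_ratings)

# Spearman's rho: rank correlation between the two sets of scores
rho, p_value = spearmanr(human_ratings, llm_ratings)

# Krippendorff's alpha: reliability across raters, treating the scale as ordinal
alpha = krippendorff.alpha(
    reliability_data=np.vstack([human_ratings, llm_ratings]),
    level_of_measurement="ordinal",
)

print(f"Cohen's kappa:        {kappa:.2f}")
print(f"Spearman's rho:       {rho:.2f} (p={p_value:.3f})")
print(f"Krippendorff's alpha: {alpha:.2f}")
```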
-----
Key Insights 💡:
→ LLMs show moderate agreement with human evaluators in assessing thematic alignment
→ Inter-model agreement is generally higher than human-model agreement
→ Humans excel at detecting subtle, context-specific nuances that LLMs might miss
-----
Results 📊:
→ Human vs Claude agreement: 79% (Cohen's kappa: 0.41)
→ Human vs Sonnet 3.5 agreement: 76% (Cohen's kappa: 0.44)
→ Best inter-model agreement: Claude vs Titan Express (91%, kappa: 0.70)
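One way to read these numbers: Cohen's kappa discounts the agreement expected by chance, via kappa = (p_o - p_e) / (1 - p_e). The back-of-the-envelope calculation below (not from the paper) shows that 79% observed agreement paired with kappa ≈ 0.41 implies roughly 64% agreement would be expected by chance alone, which is why the raw percentages look more impressive than the kappa values.

```python
# Back-of-the-envelope: what chance agreement p_e is implied by the reported numbers?
# Cohen's kappa = (p_o - p_e) / (1 - p_e)  =>  p_e = (p_o - kappa) / (1 - kappa)
p_o, kappa = 0.79, 0.41          # human-vs-Claude figures reported above
p_e = (p_o - kappa) / (1 - kappa)
print(f"Implied chance agreement: {p_e:.2f}")  # ~0.64, so much of the raw 79% is chance-level
```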