"Benchmarking LLMs' Judgments with No Gold Standard"

The podcast on this paper is generated with Google's Illuminate.

A new way to score AI text quality that doesn't need perfect answers.

GEM (Generative Estimator for Mutual Information) evaluates LLM outputs without gold standards by measuring mutual information between responses.

https://arxiv.org/abs/2411.07127

Original Problem 🤔:

Evaluating LLM outputs for subjective tasks like peer reviews is challenging without gold standards. Current metrics either need reference answers or are vulnerable to manipulations.

-----

Solution in this Paper 🛠️:

→ GEM (Generative Estimator for Mutual Information) measures mutual information between candidate and reference responses to assess text quality without needing gold standards.

→ GEM uses preprocessing via LLMs to remove superficial aspects like writing style, focusing on semantic content.

→ The GEM-S variant conditions the mutual information on a task synopsis (e.g., the paper abstract), so only semantic content beyond the synopsis is rewarded (see the sketch after this list).

→ GRE-bench (Generating Review Evaluation Benchmark) applies GEM to evaluate LLMs' peer review capabilities using the ICLR 2023 dataset.
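
To make the idea concrete, here is a minimal sketch of a GEM-style score. It is not the paper's implementation: it assumes a small Hugging Face causal LM (gpt2 as a stand-in), omits the LLM-based style-stripping preprocessing, and the prompt format is an illustrative assumption.

```python
# Sketch of a GEM-style score: how much more likely does the reference review
# become once the model has also seen the candidate review?
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # placeholder model, purely for illustration
tok = AutoTokenizer.from_pretrained(MODEL)
lm = AutoModelForCausalLM.from_pretrained(MODEL)
lm.eval()

def log_prob(text: str, context: str) -> float:
    """Sum of log-probabilities the LM assigns to `text` given `context`."""
    ctx_ids = tok(context, return_tensors="pt").input_ids
    txt_ids = tok(text, return_tensors="pt").input_ids
    full_ids = torch.cat([ctx_ids, txt_ids], dim=1)
    with torch.no_grad():
        logits = lm(full_ids).logits                      # [1, T, vocab]
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)  # predicts tokens 1..T-1
    targets = full_ids[0, 1:]
    idx = torch.arange(ctx_ids.shape[1] - 1, targets.shape[0])  # positions of `text` tokens
    return log_probs[idx, targets[idx]].sum().item()

def gem_score(candidate: str, reference: str, synopsis: str = "") -> float:
    """Pointwise mutual-information style score between candidate and reference.
    With a non-empty synopsis (GEM-S), both terms condition on it, so only
    information beyond the synopsis is rewarded."""
    with_candidate = log_prob(reference, context=synopsis + "\n" + candidate + "\n")
    without_candidate = log_prob(reference, context=synopsis + "\n")
    return with_candidate - without_candidate
```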

-----

Key Insights from this Paper 💡:

→ GEM shows strong correlation with human judgments while being more robust against manipulations than the GPT-4 examiner

→ Preprocessing helps filter out "shortcuts" in evaluation by standardizing style

→ Larger models within the same LLM family achieve higher GRE-bench scores

→ While LLMs match humans at information retrieval, humans excel at original critical analysis

-----

Results 📊:

→ GEM and GEM-S are the only metrics showing consistent sensitivity to all semantic degradation tests

→ No score inflation from meaningless text elongation or rephrasing by GPT-4/Llama (a toy check in this spirit follows the list)

→ Achieves correlation with human scores that is competitive with the GPT-4 examiner

→ Shows strong correlation between model size and review quality within LLM families
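
A toy robustness check in the spirit of the paper's manipulation tests, reusing the gem_score sketch above; the reviews, synopsis, and filler text here are invented for illustration.

```python
# A length-inflation check: padding the candidate with filler should not raise
# a GEM-style score the way it can inflate length-sensitive metrics.
candidate = "The method is novel, but the evaluation lacks a strong baseline."
reference = "Interesting idea; the main weakness is the missing baseline comparison."
synopsis = "We propose a new estimator for evaluating LLM-generated reviews."

base = gem_score(candidate, reference, synopsis)
padded = gem_score(
    candidate + " " + "This paper studies an important topic. " * 20,
    reference,
    synopsis,
)

print(f"base={base:.2f}  padded={padded:.2f}  inflated={padded > base}")
```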
