"ArxEval: Evaluating Retrieval and Generation in Language Models for Scientific Literature"

The podcast below on this paper was generated with Google's Illuminate.

This paper introduces a new method, ArxEval, to evaluate the tendency of LLMs to hallucinate when generating responses related to scientific literature.

-----

https://arxiv.org/abs/2501.10483v2

**Original Problem** 🤔:

→ LLMs sometimes generate false information, including non-existent citations and research papers.

→ This "hallucination" is a serious issue in academia and education, which require high factual accuracy.

**Solution in this Paper** 💡:

→ ArxEval uses the ArXiv dataset as a source of scientific articles.

→ It presents two new evaluation tasks: Jumbled Titles and Mixed Titles.

→ In Jumbled Titles, LLMs receive scrambled titles and must generate related information. The similarity of the generated text to the original abstract is measured using cosine similarity, BERTScore, and Semantic Textual Similarity (STS); a scoring sketch follows this list.

→ In Mixed Titles, LLMs receive the combined titles of two different papers and must provide DOIs for two papers related to the mixed title. The validity of the generated DOIs is checked via APIs, and the titles registered to those DOIs are checked against the LLM-generated titles for correctness; a validation sketch also follows this list.
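
For concreteness, here is a minimal sketch of how a Jumbled Titles response could be scored against the paper's original abstract. The embedding and STS model names below are illustrative assumptions; the paper reports the three metrics but does not tie us to these exact checkpoints.

```python
# Sketch: score a Jumbled Titles response against the original abstract.
# Model choices ("all-MiniLM-L6-v2", "cross-encoder/stsb-roberta-base") are
# assumptions for illustration, not the paper's confirmed setup.
from sentence_transformers import SentenceTransformer, CrossEncoder, util
from bert_score import score as bert_score


def score_jumbled_title_response(generated_text: str, original_abstract: str) -> dict:
    # Cosine similarity between sentence embeddings of generation and abstract.
    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    emb = embedder.encode([generated_text, original_abstract], convert_to_tensor=True)
    cosine = util.cos_sim(emb[0], emb[1]).item()

    # BERTScore F1 of the generated text against the abstract.
    _, _, f1 = bert_score([generated_text], [original_abstract], lang="en")
    bertscore_f1 = f1.item()

    # Semantic Textual Similarity via an STS-trained cross-encoder (assumed model).
    sts_model = CrossEncoder("cross-encoder/stsb-roberta-base")
    sts = float(sts_model.predict([(generated_text, original_abstract)])[0])

    return {"cosine": cosine, "bertscore_f1": bertscore_f1, "sts": sts}
```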
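
And a minimal sketch of the Mixed Titles check, assuming the public Crossref REST API for DOI resolution (the paper only states that DOIs are validated "using APIs") and a fuzzy-match threshold for title correctness, both of which are assumptions here.

```python
# Sketch: verify that a model-generated DOI resolves to a real record and that
# the record's registered title matches the title the model claimed for it.
# Crossref as the lookup API and the 0.9 match threshold are assumptions.
import requests
from difflib import SequenceMatcher

CROSSREF_URL = "https://api.crossref.org/works/{doi}"


def check_generated_doi(doi: str, claimed_title: str, threshold: float = 0.9) -> dict:
    resp = requests.get(CROSSREF_URL.format(doi=doi), timeout=10)
    if resp.status_code != 200:
        # DOI does not resolve: counted as an invalid (hallucinated) DOI.
        return {"valid_doi": False, "title_match": False}

    titles = resp.json()["message"].get("title", [])
    registered_title = titles[0] if titles else ""

    # Fuzzy string match between the registered title and the claimed title.
    ratio = SequenceMatcher(None, registered_title.lower(), claimed_title.lower()).ratio()
    return {"valid_doi": True, "title_match": ratio >= threshold}
```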

**Key Insights from this Paper** 🔎:

→ Even large LLMs struggle with factual accuracy in domain-specific tasks like handling scientific literature.

→ Model size does not directly correlate with performance in these tasks. Smaller models sometimes outperform larger ones.

→ Current LLMs significantly struggle with prompt adherence, especially in generating the correct number of DOIs as requested.

**Results** 📊:

→ On Jumbled Titles, Mistral v0.3 (7B) achieved the highest average similarity score (0.585). Orca-2 (13B) performed worst (0.476 average similarity).

→ For Mixed Titles, Mistral v0.3 (7B) generated the highest number of DOIs (425), with 25.65% being valid. Qwen-2.5 (7B) had the highest percentage of valid DOIs (40.70%). No model provided correct titles for any of the valid DOIs it generated.

-----

**1ST SET OF HOOKS**

ArxEval, a new method, tests LLM hallucination in scientific text generation using jumbled and mixed ArXiv titles.

ArxEval assesses the reliability of LLMs in handling scientific literature by challenging them with altered ArXiv titles.

This paper introduces ArxEval, a pipeline to evaluate LLM hallucinations in scientific literature, employing jumbled and mixed titles.

This work evaluates the reliability of LLMs in retrieving and reasoning about scientific articles using ArXiv-based jumbled and mixed title tasks.

**2ND SET OF HOOKS**

Wanna know if LLMs can handle messy science papers? Check out ArxEval!

ArxEval throws curveballs at LLMs with scrambled and mixed-up science paper titles. Can they hit a home run?

LLMs face a pop quiz with ArxEval: unscramble titles and find the DOIs. Do they pass or fail?

ArxEval grades LLMs on their scientific literature homework: jumbled titles and mixed-up DOIs. See who makes the grade!
