"Why Does ChatGPT "Delve" So Much? Exploring the Sources of Lexical Overrepresentation in Large Language Models"

ChatGPT's obsession with "delve"

This paper identifies and analyzes words overused by LLMs in scientific writing.

https://arxiv.org/abs/2412.11385

🛠️ Methods in this Paper:

→ The study analyzed 5.2 billion tokens from 26.7 million PubMed abstracts.

→ A three-step procedure identified focal words whose usage has increased sharply since 2020 (a simplified frequency-ratio sketch follows this list).

→ Researchers compared AI-generated and human-written abstracts to pinpoint overrepresented words (the second sketch below shows one way such overrepresentation can be scored).

→ Potential causes were investigated, including model architecture, training data, and RLHF.
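
For intuition, here is a minimal Python sketch of the kind of frequency analysis the list above describes: compute per-year relative word frequencies in a corpus of abstracts, then flag words whose post-2020 rate far exceeds their earlier baseline. The function names, thresholds, and the simple ratio test are illustrative assumptions, not the paper's actual three-step procedure.

```python
from collections import Counter

def yearly_frequencies(abstracts_by_year):
    """abstracts_by_year: {year: list of tokenized abstracts (each a list of words)}.
    Returns {year: {word: relative frequency}}."""
    freqs = {}
    for year, abstracts in abstracts_by_year.items():
        counts = Counter(tok.lower() for abstract in abstracts for tok in abstract)
        total = sum(counts.values()) or 1
        freqs[year] = {w: c / total for w, c in counts.items()}
    return freqs

def flag_candidate_focal_words(freqs, cutoff_year=2020, min_ratio=5.0):
    """Flag words whose mean relative frequency from cutoff_year onward is at
    least min_ratio times their pre-cutoff mean (both thresholds illustrative)."""
    pre = [f for y, f in freqs.items() if y < cutoff_year]
    post = [f for y, f in freqs.items() if y >= cutoff_year]
    vocab = set().union(*post) if post else set()
    flagged = []
    for w in vocab:
        pre_mean = sum(f.get(w, 0.0) for f in pre) / max(len(pre), 1)
        post_mean = sum(f.get(w, 0.0) for f in post) / max(len(post), 1)
        if pre_mean > 0 and post_mean / pre_mean >= min_ratio:
            flagged.append((w, post_mean / pre_mean))
    return sorted(flagged, key=lambda x: -x[1])
```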

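The second sketch illustrates how overrepresentation might be scored when comparing AI-generated against human-written abstracts, here with a smoothed per-word log-ratio. The statistic and the smoothing constant are assumptions chosen for illustration, not necessarily what the authors used.

```python
import math
from collections import Counter

def log_ratio_overrepresentation(ai_tokens, human_tokens, alpha=0.5):
    """ai_tokens / human_tokens: flat lists of word tokens from each corpus.
    Returns {word: log2 rate ratio}; positive values indicate overuse in AI text.
    alpha is an additive smoothing constant (illustrative choice)."""
    ai_counts, human_counts = Counter(ai_tokens), Counter(human_tokens)
    ai_total, human_total = sum(ai_counts.values()), sum(human_counts.values())
    vocab = set(ai_counts) | set(human_counts)
    scores = {}
    for w in vocab:
        ai_rate = (ai_counts[w] + alpha) / (ai_total + alpha * len(vocab))
        human_rate = (human_counts[w] + alpha) / (human_total + alpha * len(vocab))
        scores[w] = math.log2(ai_rate / human_rate)
    return scores
```
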
-----

💡 Key Insights from this Paper:

→ 21 focal words, including "delve" and "intricate," show an unprecedented increase in scientific abstracts

→ LLMs are becoming major drivers of global language change

→ RLHF may contribute to word overuse through human evaluator biases

→ Lack of transparency in LLM development hinders thorough investigation

-----

📊 Results:

→ A significant spike in focal word usage correlates with LLM adoption in scientific writing

→ No strong evidence found for model architecture or training data causing overuse

→ RLHF emerged as a possible contributor to word overrepresentation

→ Phenomenon persists in current LLM iterations