"Why Does ChatGPT "Delve" So Much? Exploring the Sources of Lexical Overrepresentation in Large Language Models"

ChatGPT's obsession with "delve"

This paper identifies and analyzes words overused by LLMs in scientific writing.

https://arxiv.org/abs/2412.11385

🛠️ Methods in this Paper:

→ The study analyzed 5.2 billion tokens from 26.7 million PubMed abstracts.

→ A three-step procedure identified focal words whose usage has increased sharply since 2020 (a simplified frequency-ratio sketch follows this list).

→ Researchers compared AI-generated and human-written abstracts to pinpoint overrepresented words (the second sketch below shows one way such overrepresentation can be scored).

→ Potential causes were investigated, including model architecture, training data, and RLHF.
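
For intuition, here is a minimal Python sketch of the kind of frequency analysis the list above describes: compute per-year relative word frequencies in a corpus of abstracts, then flag words whose post-2020 rate far exceeds their earlier baseline. The function names, thresholds, and the simple ratio test are illustrative assumptions, not the paper's actual three-step procedure.

```python
from collections import Counter

def yearly_frequencies(abstracts_by_year):
    """abstracts_by_year: {year: list of tokenized abstracts (each a list of words)}.
    Returns {year: {word: relative frequency}}."""
    freqs = {}
    for year, abstracts in abstracts_by_year.items():
        counts = Counter(tok.lower() for abstract in abstracts for tok in abstract)
        total = sum(counts.values()) or 1
        freqs[year] = {w: c / total for w, c in counts.items()}
    return freqs

def flag_candidate_focal_words(freqs, cutoff_year=2020, min_ratio=5.0):
    """Flag words whose mean relative frequency from cutoff_year onward is at
    least min_ratio times their pre-cutoff mean (both thresholds illustrative)."""
    pre = [f for y, f in freqs.items() if y < cutoff_year]
    post = [f for y, f in freqs.items() if y >= cutoff_year]
    vocab = set().union(*post) if post else set()
    flagged = []
    for w in vocab:
        pre_mean = sum(f.get(w, 0.0) for f in pre) / max(len(pre), 1)
        post_mean = sum(f.get(w, 0.0) for f in post) / max(len(post), 1)
        if pre_mean > 0 and post_mean / pre_mean >= min_ratio:
            flagged.append((w, post_mean / pre_mean))
    return sorted(flagged, key=lambda x: -x[1])
```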

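The second sketch illustrates how overrepresentation might be scored when comparing AI-generated against human-written abstracts, here with a smoothed per-word log-ratio. The statistic and the smoothing constant are assumptions chosen for illustration, not necessarily what the authors used.

```python
import math
from collections import Counter

def log_ratio_overrepresentation(ai_tokens, human_tokens, alpha=0.5):
    """ai_tokens / human_tokens: flat lists of word tokens from each corpus.
    Returns {word: log2 rate ratio}; positive values indicate overuse in AI text.
    alpha is an additive smoothing constant (illustrative choice)."""
    ai_counts, human_counts = Counter(ai_tokens), Counter(human_tokens)
    ai_total, human_total = sum(ai_counts.values()), sum(human_counts.values())
    vocab = set(ai_counts) | set(human_counts)
    scores = {}
    for w in vocab:
        ai_rate = (ai_counts[w] + alpha) / (ai_total + alpha * len(vocab))
        human_rate = (human_counts[w] + alpha) / (human_total + alpha * len(vocab))
        scores[w] = math.log2(ai_rate / human_rate)
    return scores
```
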
-----

💡 Key Insights from this Paper:

→ 21 focal words, including "delve" and "intricate," show an unprecedented increase in scientific abstracts

→ LLMs are becoming major drivers of global language change

→ RLHF may contribute to word overuse through human evaluator biases

→ Lack of transparency in LLM development hinders thorough investigation

-----

📊 Results:

→ A significant spike in focal word usage correlates with LLM adoption in scientific writing

→ No strong evidence found for model architecture or training data causing overuse

→ RLHF emerged as a possible contributor to word overrepresentation

→ Phenomenon persists in current LLM iterations