"OpenScholar: Synthesizing Scientific Literature with Retrieval-augmented LMs"

The podcast on this paper is generated with Google's Illuminate.

OpenScholar: Your AI research assistant that actually reads and cites papers correctly

OpenScholar is a specialized retrieval-augmented LLM that synthesizes scientific literature by identifying relevant passages from a datastore of 45 million open-access papers and generating citation-backed responses. It outperforms GPT-4o in correctness and citation accuracy while being far smaller and fully open-source.

-----

https://arxiv.org/abs/2411.14199

🔍 Original Problem:

Scientists struggle to stay informed due to the massive volume of published papers. Existing LLMs often hallucinate citations and rely on outdated pre-training data, making them unreliable for synthesizing scientific literature.

-----

🛠️ Solution in this Paper:

→ OpenScholar combines a specialized datastore of 45 million open-access papers with trained retrievers and an 8B-parameter LM

→ It uses a self-feedback inference loop that iteratively improves responses through further retrieval and refinement (see the sketch after this list)

→ The system generates synthetic training data by having larger models create high-quality examples

→ Citations are verified through a dedicated post-processing step to ensure accuracy

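To make this concrete, here is a minimal, self-contained Python sketch of the inference loop described above: retrieve, draft, self-critique, retrieve more evidence, and verify citations as a final post-processing step. Every function is an illustrative stub under assumed names (`retrieve`, `generate`, `feedback`, `verify_citations`), not the paper's actual code.

```python
# Minimal sketch of an OpenScholar-style inference loop. All functions are
# illustrative stubs, not the paper's implementation.
import re
from dataclasses import dataclass

@dataclass
class Passage:
    paper_id: str
    text: str

def retrieve(query: str, k: int = 5) -> list[Passage]:
    """Stand-in for the trained dense retriever over the 45M-paper datastore."""
    return [Passage(f"paper-{abs(hash(query)) % 100}-{i}", f"passage about {query}")
            for i in range(k)]

def generate(query: str, passages: list[Passage]) -> str:
    """Stand-in for the 8B LM producing a draft with [paper-id] citations."""
    cites = " ".join(f"[{p.paper_id}]" for p in passages)
    return f"Synthesized answer to '{query}' {cites}"

def feedback(draft: str) -> list[str]:
    """Stand-in for the LM critiquing its own draft, e.g. flagging evidence gaps."""
    return [] if draft.count("[") >= 7 else ["find more supporting evidence"]

def verify_citations(draft: str, passages: list[Passage]) -> str:
    """Post-processing: drop any citation marker no retrieved passage supports."""
    valid = {p.paper_id for p in passages}
    return re.sub(r"\[([^\]]+)\]",
                  lambda m: m.group(0) if m.group(1) in valid else "", draft)

def answer(query: str, max_iters: int = 3) -> str:
    passages = retrieve(query)
    draft = generate(query, passages)
    for _ in range(max_iters):
        issues = feedback(draft)
        if not issues:
            break
        for issue in issues:            # each issue triggers extra retrieval
            passages += retrieve(issue, k=2)
        draft = generate(query, passages)  # refine with the expanded evidence
    return verify_citations(draft, passages)

print(answer("How do retrieval-augmented LMs reduce citation hallucination?"))
```
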
-----

💡 Key Insights:

→ Retrieval-augmented architectures can effectively combat hallucination in scientific tasks

→ Self-feedback loops significantly improve response quality and citation accuracy

→ Synthetic data generation enables training smaller, more efficient models that maintain performance (see the sketch below)

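The synthetic-data insight can be sketched in a few lines: a stronger teacher model drafts citation-backed examples, a quality filter discards weak ones, and the survivors become fine-tuning data for the smaller model. `teacher_generate` and `quality_score` are hypothetical stand-ins, not the paper's actual pipeline.

```python
# Illustrative sketch of teacher-generated training data with quality filtering.
import json

def teacher_generate(query: str) -> dict:
    """Stand-in for a larger LM writing a high-quality, cited example."""
    return {"instruction": query,
            "output": f"Answer to '{query}' [paper-1] [paper-2]"}

def quality_score(example: dict) -> float:
    """Stand-in filter, e.g. checking that the answer actually cites sources."""
    return 1.0 if "[paper-" in example["output"] else 0.0

queries = [
    "What retrieval methods work best for scientific question answering?",
    "How is citation accuracy evaluated in generated literature reviews?",
]

dataset = []
for q in queries:
    example = teacher_generate(q)
    if quality_score(example) > 0.5:    # keep only examples passing the filter
        dataset.append(example)

# Write the surviving examples in an instruction-tuning format for the 8B model.
with open("synthetic_train.jsonl", "w") as f:
    for example in dataset:
        f.write(json.dumps(example) + "\n")
```
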
-----

📊 Results:

→ OpenScholar-8B outperforms GPT-4o by 5% in correctness

→ Achieves citation accuracy comparable to human experts

→ Experts preferred OpenScholar responses over human-written ones 51% of the time

→ Combining the OpenScholar pipeline with GPT-4o improves GPT-4o's correctness by 12%
