A framework that checks if LLMs can actually write reliable academic literature reviews
This research evaluates how well LLMs can write literature reviews by testing their ability to generate references, write abstracts, and create comprehensive reviews.
-----
https://arxiv.org/abs/2412.13612
🔍 Original Problem:
Writing a literature review requires reading and citing hundreds of references, making it a complex and time-consuming task. LLMs could potentially automate this work, but their actual capabilities remain unclear.
-----
🛠️ Solution in this Paper:
→ Framework evaluates LLMs across three tasks: reference generation, abstract writing, and literature review writing
→ Dataset contains 1,106 literature reviews from 51 journals spanning 5 disciplines
→ Uses external tools to assess hallucination rates, semantic coverage, and factual consistency
→ Employs Natural Language Inference (NLI) models to verify that generated claims are supported by the source texts (a minimal sketch of this check follows the list)
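
As a rough illustration, the NLI step might look like the sketch below. This is a minimal reconstruction, not the paper's actual pipeline: the choice of `roberta-large-mnli`, the entailment criterion, and the `consistency_score` helper are all assumptions for illustration.

```python
# Minimal sketch of an NLI-based factual-consistency check.
# Assumptions: roberta-large-mnli as the NLI model and a simple
# fraction-of-entailments score; the paper does not commit to either.
from transformers import pipeline

nli = pipeline("text-classification", model="roberta-large-mnli")

def consistency_score(source_texts: list[str], generated_claim: str) -> float:
    """Fraction of source texts that entail the generated claim."""
    entailed = 0
    for premise in source_texts:
        out = nli({"text": premise, "text_pair": generated_claim}, truncation=True)
        # The pipeline may return a dict or a one-element list depending on version
        label = (out[0] if isinstance(out, list) else out)["label"]
        if label == "ENTAILMENT":
            entailed += 1
    return entailed / len(source_texts) if source_texts else 0.0
```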
-----
💡 Key Insights:
→ Reference accuracy improves when citations are generated as part of a full literature review rather than in isolation
→ Performance varies significantly across academic disciplines, with best results in Mathematics
→ Models struggle most with author names and journal abbreviations
→ Claude-3.5 shows 25.21% overlap with human-written citations (a toy overlap computation is sketched below)
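
For intuition, an overlap number like the 25.21% above could be computed with something like this toy sketch. The title normalization and exact-match rule are assumptions; the paper's actual matching procedure may be more sophisticated.

```python
# Toy sketch of citation overlap between model-generated and human reference
# lists. Normalizing titles before exact matching is an assumption for
# illustration; the paper's matching rule is not reproduced here.
import re

def normalize_title(title: str) -> str:
    """Lowercase and collapse punctuation so near-identical titles compare equal."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()

def citation_overlap(generated_titles: list[str], human_titles: list[str]) -> float:
    """Share of human-cited titles that also appear in the model's references."""
    generated = {normalize_title(t) for t in generated_titles}
    human = {normalize_title(t) for t in human_titles}
    return len(generated & human) / len(human) if human else 0.0

# e.g. citation_overlap(["Attention Is All You Need"],
#                       ["Attention is all you need."]) -> 1.0
```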
-----
📊 Results:
→ Claude-3.5-Sonnet outperforms GPT-4, Qwen-72B and Llama-3B across all tasks
→ Highest accuracy in Mathematics; lowest in Chemistry and Technology
→ 58.89% reference accuracy when generating citations within literature reviews