LLMs can generate Boolean queries for systematic reviews, but how reliable are they?
This paper investigates the reproducibility and generalizability of using LLMs to generate Boolean queries for systematic literature reviews. The study sets out to validate and extend recent findings on using ChatGPT for this task, comparing its performance against open-source alternatives and analyzing its limitations.
-----
https://arxiv.org/abs/2411.14914
🤔 Original Problem:
Systematic literature reviews are time-consuming and labor-intensive, particularly the step of crafting effective Boolean queries for comprehensive literature searches. Recent studies have proposed using LLMs such as ChatGPT to automate this step, but the reproducibility and reliability of the reported results remain uncertain.
-----
🔍 Solution in this Paper:
→ The researchers implemented a pipeline that automatically generates Boolean queries with various LLMs, retrieves documents from PubMed, and evaluates the results (a minimal sketch of such a pipeline follows this list).
→ They compared the performance of GPT models (GPT-3.5 and GPT-4) with open-source alternatives like Mistral and Zephyr.
→ The study assessed the reproducibility of results by running experiments with multiple random seeds and analyzing query variability.
→ They conducted a failure analysis to identify the limitations and failure modes of LLM-based Boolean query generation.
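The paper's exact prompts and evaluation code aren't reproduced in this post, but the pipeline's shape is straightforward. Below is a minimal Python sketch, assuming an OpenAI-style chat API and PubMed's public E-utilities esearch endpoint; the prompt wording, model name, temperature, and retmax cap are illustrative assumptions, not the paper's settings.

```python
import requests
from openai import OpenAI  # assumes the official openai>=1.0 Python package

client = OpenAI()  # reads OPENAI_API_KEY from the environment

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def generate_boolean_query(topic: str, model: str = "gpt-4") -> str:
    """Ask an LLM to draft a PubMed Boolean query (prompt wording is illustrative)."""
    response = client.chat.completions.create(
        model=model,
        temperature=1.0,  # non-zero: repeated runs vary, which the paper analyzes
        messages=[{
            "role": "user",
            "content": "Write a single PubMed Boolean query for a systematic "
                       f"review on: {topic}. Return only the query itself.",
        }],
    )
    return response.choices[0].message.content.strip()

def search_pubmed(query: str, retmax: int = 10000) -> list[str]:
    """Retrieve matching PMIDs via NCBI's esearch endpoint."""
    params = {"db": "pubmed", "term": query, "retmax": retmax, "retmode": "json"}
    data = requests.get(ESEARCH_URL, params=params, timeout=30).json()
    return data["esearchresult"]["idlist"]

def evaluate(retrieved: list[str], relevant: set[str]) -> dict[str, float]:
    """Set-based precision/recall/F1 against a dataset's relevance judgments."""
    hits = len(set(retrieved) & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

Rerunning generate_boolean_query on the same topic at non-zero temperature is what surfaces the run-to-run variability the paper reports.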
-----
💡 Key Insights from this Paper:
→ LLM-generated queries outperformed baselines on the CLEF TAR dataset but not on the Seed dataset
→ GPT-4 and GPT-3.5 models generally performed better than open-source alternatives
→ Query generation results varied substantially across runs and across models
→ LLMs struggled to consistently produce machine-actionable, syntactically valid Boolean queries (a validation sketch follows this list)
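Because malformed output is a recurring failure mode, any automated pipeline needs a sanity check before submitting a generated query to PubMed. The paper doesn't publish its checks; the heuristics below (balanced parentheses, no chatty preamble, presence of uppercase Boolean operators) are a hypothetical sketch of that validation step.

```python
import re

def is_machine_actionable(query: str) -> bool:
    """Heuristic check that an LLM-generated Boolean query is safe to submit.
    These rules are illustrative assumptions, not the paper's validator."""
    # Parentheses must balance and never close before they open.
    depth = 0
    for ch in query:
        depth += (ch == "(") - (ch == ")")
        if depth < 0:
            return False
    if depth != 0:
        return False
    # Reject chatty preambles models often prepend ("Sure, here is ...").
    if re.match(r"^\s*(sure|here|certainly)\b", query, re.IGNORECASE):
        return False
    # A usable query should contain at least one uppercase Boolean operator.
    if not re.search(r"\b(AND|OR|NOT)\b", query):
        return False
    return True
```

A query like ("heart failure"[tiab]) AND (telemedicine[tiab]) passes these checks; a reply beginning "Sure, here is your query:" does not.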
-----
📊 Results:
→ GPT-4 achieved a maximum precision of 0.378 on CLEF TAR, well above the 0.096 reported in the original study
→ On the Seed dataset, LLM-generated queries generally underperformed relative to the original study
→ F1-scores for the GPT models on CLEF TAR were markedly higher than those reported in the original study
→ Open-source models like Mistral showed competitive performance but with higher run-to-run variability (a sketch for quantifying this spread follows this list)
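Maximum and mean scores over repeated runs only make sense alongside an estimate of their spread. As a hedged sketch of how that could be quantified, the function below reuses the hypothetical generate_boolean_query, is_machine_actionable, search_pubmed, and evaluate helpers from the earlier sketches.

```python
import statistics

def variability_across_runs(topic: str, relevant: set[str], n_runs: int = 5) -> dict:
    """Regenerate the query n_runs times and summarize the spread in F1.
    Reuses the hypothetical helpers sketched earlier in this post."""
    f1_scores = []
    for _ in range(n_runs):
        query = generate_boolean_query(topic)  # fresh sample each run
        if not is_machine_actionable(query):   # drop malformed output
            continue
        retrieved = search_pubmed(query)
        f1_scores.append(evaluate(retrieved, relevant)["f1"])
    return {
        "runs_kept": len(f1_scores),
        "mean_f1": statistics.mean(f1_scores) if f1_scores else 0.0,
        "stdev_f1": statistics.stdev(f1_scores) if len(f1_scores) > 1 else 0.0,
        "max_f1": max(f1_scores, default=0.0),
    }
```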