"ClinicalBench: Can LLMs Beat Traditional ML Models in Clinical Prediction?"

The podcast on this paper is generated with Google's Illuminate.

LLMs may ace medical exams but fail at real clinical predictions.

Traditional ML still beats fancy LLMs at predicting hospital patient outcomes.

This paper introduces ClinicalBench, a comprehensive benchmark comparing LLMs with traditional ML models on clinical prediction tasks. It evaluates 22 LLMs against 11 traditional ML models across three medical prediction tasks using the MIMIC-III and MIMIC-IV databases.

-----

https://arxiv.org/abs/2411.06469

🏥 Original Problem:

While LLMs excel at medical text processing and medical licensing exams, traditional ML models still dominate clinical prediction tasks. The field has lacked a systematic evaluation of LLMs' capabilities on real clinical predictions.

-----

🔬 Solution in this Paper:

→ ClinicalBench evaluates 14 general-purpose and 8 medical LLMs against 11 traditional ML models.

→ The benchmark tests three prediction tasks: Length-of-Stay, Mortality, and Hospital Readmission.

→ Structured clinical codes are converted to natural-language text so LLMs can process them (see the sketch after this list).

→ The study explores various prompting strategies and fine-tuning approaches.
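Here is a minimal Python sketch of what that code-to-text conversion plus zero-shot prompting might look like. The ICD descriptions, prompt wording, and binary short/long framing are illustrative assumptions, not the paper's exact template:

```python
# Illustrative sketch: map opaque clinical codes to plain-text
# descriptions, then assemble a zero-shot prompt for an LLM.
# The lookup table, prompt wording, and binary short/long framing
# are assumptions for illustration, not the paper's exact setup.

# Hypothetical mini-dictionary; in practice the full ICD code
# descriptions that ship with MIMIC would be used.
ICD_DESCRIPTIONS = {
    "428.0": "congestive heart failure, unspecified",
    "584.9": "acute kidney failure, unspecified",
    "250.00": "type 2 diabetes mellitus without complications",
}

def codes_to_text(codes):
    """Replace ICD codes with natural-language descriptions."""
    return "; ".join(ICD_DESCRIPTIONS.get(c, f"diagnosis code {c}") for c in codes)

def build_prompt(age, gender, codes):
    """Assemble a zero-shot length-of-stay prompt from patient data."""
    return (
        f"Patient: {age}-year-old {gender}.\n"
        f"Diagnoses: {codes_to_text(codes)}.\n"
        "Question: Will this hospital stay be short (7 days or fewer) "
        "or long (more than 7 days)? Answer 'short' or 'long'."
    )

print(build_prompt(67, "male", ["428.0", "584.9"]))
```

Few-shot and chain-of-thought variants reuse the same template, prepending labeled exemplars or a "reason step by step" instruction.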

-----

🔑 Key Insights:

→ Traditional ML models consistently outperform both general and medical LLMs

→ Larger model size doesn't guarantee better clinical predictions

→ Medical-domain LLMs show no significant advantage over general-purpose LLMs

→ Fine-tuning helps but still can't match traditional ML performance

-----

📊 Results:

→ XGBoost achieved a 67.94% F1 score in Length-of-Stay prediction vs. 25.78% for the best LLM

→ Traditional ML models reached a 95.97% AUROC in Mortality prediction vs. 87.25% for LLMs

→ SVM outperformed all LLMs with a 71.74% AUROC in Readmission prediction (see the baseline sketch below)
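For reference, here is a minimal sketch of the kind of traditional baselines being compared: XGBoost scored with macro F1 for Length-of-Stay, and a linear SVM scored with AUROC for Readmission. The random placeholder features stand in for real MIMIC inputs, and the 3-class label scheme is an assumption for illustration:

```python
# Minimal sketch of the traditional-ML baseline pipeline. Random
# features are placeholders for real MIMIC inputs, not patient data;
# the 3-class length-of-stay labels are an illustrative assumption.
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50))            # placeholder feature matrix
y_los = rng.integers(0, 3, size=1000)      # length-of-stay classes
y_readmit = rng.integers(0, 2, size=1000)  # binary readmission labels

# Length-of-Stay: multi-class XGBoost scored with macro F1.
Xtr, Xte, ytr, yte = train_test_split(X, y_los, random_state=0)
xgb = XGBClassifier(n_estimators=200, max_depth=4).fit(Xtr, ytr)
print("LOS macro-F1:", f1_score(yte, xgb.predict(Xte), average="macro"))

# Readmission: linear SVM scored with AUROC via decision scores.
Xtr, Xte, ytr, yte = train_test_split(X, y_readmit, random_state=0)
svm = LinearSVC().fit(Xtr, ytr)
print("Readmission AUROC:", roc_auc_score(yte, svm.decision_function(Xte)))
```

On random features these scores will hover near chance; the point is the shape of the pipeline, not the numbers.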
