LLMs may ace medical exams but fail at real clinical predictions.
Traditional ML still beats fancy LLMs at predicting hospital patient outcomes.
This paper introduces ClinicalBench, a comprehensive benchmark comparing LLMs with traditional ML models on clinical prediction tasks. It evaluates 22 LLMs against 11 traditional ML models across three prediction tasks drawn from the MIMIC-III and MIMIC-IV databases.
-----
https://arxiv.org/abs/2411.06469
🏥 Original Problem:
While LLMs excel at medical text processing and ace medical licensing exams, traditional ML models still dominate clinical prediction tasks. The field has lacked a systematic evaluation of LLMs on real clinical prediction.
-----
🔬 Solution in this Paper:
→ ClinicalBench evaluates 14 general-purpose and 8 medical LLMs against traditional ML models.
→ The benchmark tests three prediction tasks: Length-of-Stay, Mortality, and Hospital Readmission.
→ Clinical codes are converted to natural text for LLM processing (see the sketch after this list).
→ The study explores various prompting strategies and fine-tuning approaches.
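
Here's a minimal sketch of that code-to-text step, assuming a toy ICD-9 lookup table and a hypothetical zero-shot prompt template; the paper's exact serialization and prompts may differ:

```python
# Minimal sketch: serialize clinical codes into natural text for an LLM prompt.
# The ICD-9 lookup table and prompt template are illustrative assumptions,
# not ClinicalBench's exact pipeline.

# Hypothetical mapping from ICD-9 codes to human-readable descriptions.
ICD9_DESCRIPTIONS = {
    "428.0": "congestive heart failure, unspecified",
    "584.9": "acute kidney failure, unspecified",
    "401.9": "essential hypertension, unspecified",
}

def codes_to_text(icd_codes: list[str]) -> str:
    """Render a list of ICD codes as a natural-language diagnosis summary."""
    names = [ICD9_DESCRIPTIONS.get(c, f"unknown diagnosis (ICD {c})") for c in icd_codes]
    return "; ".join(names)

def build_prompt(age: int, gender: str, icd_codes: list[str]) -> str:
    """Assemble a zero-shot prompt for a binary Length-of-Stay question."""
    diagnoses = codes_to_text(icd_codes)
    return (
        f"Patient: {age}-year-old {gender}.\n"
        f"Diagnoses: {diagnoses}.\n"
        "Question: Will this patient's hospital stay exceed one week? "
        "Answer with 'yes' or 'no'."
    )

print(build_prompt(67, "male", ["428.0", "584.9"]))
```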
-----
🔑 Key Insights:
→ Traditional ML models consistently outperform both general and medical LLMs
→ Larger model size doesn't guarantee better clinical predictions
→ Medical-specific LLMs show no significant advantage over general LLMs
→ Fine-tuning helps but still can't match traditional ML performance
-----
📊 Results:
→ XGBoost achieved a 67.94% F1 score on Length-of-Stay prediction vs 25.78% for the best LLM
→ Traditional ML models reached 95.97% AUROC on Mortality prediction vs 87.25% for LLMs
→ SVM outperformed all LLMs with 71.74% AUROC on Readmission prediction (a minimal baseline sketch follows)
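
For contrast, the kind of traditional ML baseline that tops these tasks is only a few lines. This sketch trains XGBoost on synthetic stand-in features rather than MIMIC data, so the metrics it prints are illustrative, not the paper's numbers:

```python
# Minimal sketch of a traditional ML baseline of the kind the paper reports
# (e.g., XGBoost for Length-of-Stay). Features and labels below are synthetic
# stand-ins; the real benchmark uses structured MIMIC-III/IV records.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, roc_auc_score
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 20))  # stand-in patient features
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=2000) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

model = XGBClassifier(n_estimators=200, max_depth=4, eval_metric="logloss")
model.fit(X_tr, y_tr)

# Score with the same metrics the benchmark reports: F1 and AUROC.
proba = model.predict_proba(X_te)[:, 1]
print(f"F1:    {f1_score(y_te, proba > 0.5):.4f}")
print(f"AUROC: {roc_auc_score(y_te, proba):.4f}")
```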