The paper introduces a method for question answering on patient medical records using private, fine-tuned Large Language Models (LLMs).
This approach enhances patient access to and understanding of their health data while maintaining privacy.
-----
Paper - https://arxiv.org/abs/2501.13687
Original Problem 😥:
→ Patients struggle to understand complex Electronic Health Records (EHRs) in FHIR format.
→ Using cloud-based LLMs for question answering poses privacy risks for sensitive health data.
→ Existing solutions fail to balance usability, accuracy, and data privacy in healthcare question-answering systems.
-----
Solution in this Paper 💡:
→ This paper proposes a two-stage approach for semantic question answering over EHRs.
→ Task 1 identifies the FHIR resources relevant to a given medical query, using a fine-tuned LLM as a binary classifier that outputs 1 if a resource is relevant to the query and 0 otherwise.
→ Task 2 focuses on answering the medical query based on the relevant FHIR resources identified in Task 1. This is done using another fine-tuned LLM to generate a natural language answer.
→ The paper fine-tunes smaller, open-source LLMs (Llama-3.1-8B, Mistral-NeMo) for both tasks using synthetic patient data generated by Synthea and refined with GPT-4.
→ QLoRA (Quantized Low-Rank Adaptation) is used for efficient fine-tuning on NVIDIA A100 GPUs.
→ The performance of these fine-tuned, privately hosted LLMs is compared against GPT-4, GPT-4o and Meditron-7B.
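The two-stage pipeline above can be sketched as plain Python. This is an illustrative mock-up, not the paper's code: `classify` and `generate` stand in for calls to the fine-tuned Llama-3.1-8B / Mistral-NeMo models, and the prompt wording and toy FHIR resources are assumptions.

```python
import json

def build_task1_prompt(query: str, resource: dict) -> str:
    """Pair the medical query with one FHIR resource for binary relevance classification."""
    return (
        "Decide whether the following FHIR resource is relevant to the question.\n"
        f"Question: {query}\n"
        f"FHIR resource: {json.dumps(resource)}\n"
        "Answer with 1 (relevant) or 0 (irrelevant):"
    )

def filter_resources(query, resources, classify):
    """Task 1: keep only the resources the classifier labels 1 (relevant)."""
    return [r for r in resources if classify(build_task1_prompt(query, r)) == 1]

def answer_query(query, relevant, generate):
    """Task 2: generate a natural-language answer grounded in the relevant resources."""
    context = "\n".join(json.dumps(r) for r in relevant)
    prompt = f"Using these FHIR resources:\n{context}\nAnswer the question: {query}"
    return generate(prompt)

# Toy demonstration with stubbed model calls (real use would invoke the fine-tuned LLMs):
resources = [
    {"resourceType": "MedicationRequest", "medication": "lisinopril"},
    {"resourceType": "Immunization", "vaccine": "influenza"},
]
classify = lambda prompt: 1 if "MedicationRequest" in prompt else 0
generate = lambda prompt: "You are prescribed lisinopril."

relevant = filter_resources("What medications am I taking?", resources, classify)
print(answer_query("What medications am I taking?", relevant, generate))
```

Running Task 1 per resource keeps the Task 2 context small, so only relevant FHIR records reach the answer-generation model.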
-----
Key Insights from this Paper 🤔:
→ Fine-tuning significantly improves the performance of smaller LLMs for specific medical question answering tasks.
→ Fine-tuned smaller models can outperform larger, general-purpose models like GPT-4 in specialized domains.
→ Dataset size positively impacts model performance, highlighting the importance of training data volume.
→ Sequential fine-tuning can have varying effects on performance depending on the model architecture and task order.
→ LLMs can exhibit self-preference bias when evaluating their own outputs, especially in non-blind evaluations.
-----
Results 🏆:
→ Fine-tuned Llama 3.1 Base achieved 95.52% F1 score on Task 1, outperforming GPT-4's 95% F1 score.
→ Fine-tuned Mistral NeMo Base achieved a METEOR score of 0.5333 on Task 2, while GPT-4 scored 0.375223.
→ Fine-tuned models are approximately 250x smaller than GPT-4, showing efficiency gains.
→ Models fine-tuned on larger datasets (4900 examples) showed a 4.39%-4.55% improvement in METEOR score compared to those fine-tuned on smaller datasets (500 examples) in Task 2.
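Task 1 is scored with F1 over the binary relevance labels. A minimal reference implementation of that standard metric, using toy labels (not the paper's data):

```python
def f1_score(y_true, y_pred):
    """Binary F1 over relevance labels (1 = relevant, 0 = irrelevant)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Illustrative labels: one relevant resource missed (index 3).
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1]
print(f"F1 = {f1_score(y_true, y_pred):.4f}")  # → F1 = 0.8571
```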