
"Question Answering on Patient Medical Records with Private Fine-Tuned LLMs"

The podcast below was generated with Google's Illuminate.

The paper introduces a method for question answering on patient medical records using private, fine-tuned Large Language Models (LLMs).

This approach enhances patient access to and understanding of their health data while maintaining privacy.

-----

Paper - https://arxiv.org/abs/2501.13687

Original Problem 😥:

→ Patients struggle to understand complex Electronic Health Records (EHRs) in FHIR format.

→ Using cloud-based LLMs for question answering poses privacy risks for sensitive health data.

→ Current solutions fail to balance usability, accuracy, and data privacy in healthcare question answering systems.

-----

Solution in this Paper 💡:

→ This paper proposes a two-stage approach for semantic question answering over EHRs.

→ Task 1 involves identifying relevant FHIR resources for a given medical query using a fine-tuned LLM as a binary classifier. This model determines if a FHIR resource is relevant (1) or irrelevant (0) to the query.

→ Task 2 focuses on answering the medical query based on the relevant FHIR resources identified in Task 1. This is done using another fine-tuned LLM to generate a natural language answer.
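The two-stage flow can be sketched in plain Python. Here `classify_relevance` and `generate_answer` are stand-ins for calls to the fine-tuned models; the keyword matching, function names, and sample FHIR resources are illustrative assumptions, not the paper's implementation.

```python
# Sketch of the two-stage QA flow over FHIR resources.
# The real system prompts fine-tuned LLMs; these stubs only
# illustrate the pipeline structure.

# Hypothetical keyword-to-resource-type map standing in for the
# Task 1 classifier's learned behavior.
QUERY_TOPICS = {"vaccine": "Immunization", "condition": "Condition"}

def classify_relevance(query: str, resource: dict) -> int:
    """Task 1 stand-in: return 1 (relevant) or 0 (irrelevant)."""
    for keyword, rtype in QUERY_TOPICS.items():
        if keyword in query.lower() and resource["resourceType"] == rtype:
            return 1
    return 0

def generate_answer(query: str, resources: list) -> str:
    """Task 2 stand-in: the real system generates a natural-language
    answer from the query plus the relevant resources."""
    names = ", ".join(r["code"]["text"] for r in resources)
    return f"Records related to your question: {names}"

def answer_query(query: str, fhir_resources: list) -> str:
    # Stage 1: keep only resources the classifier marks relevant (label 1).
    relevant = [r for r in fhir_resources if classify_relevance(query, r) == 1]
    # Stage 2: answer from the filtered context only.
    return generate_answer(query, relevant)

resources = [
    {"resourceType": "Condition", "code": {"text": "hypertension"}},
    {"resourceType": "Immunization", "code": {"text": "influenza vaccine"}},
]
print(answer_query("What vaccines have I received?", resources))
```

Filtering first keeps the answer-generation prompt small and grounded in only the records that matter for the question.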

→ The paper fine-tunes smaller, open-source LLMs (Llama-3.1-8B, Mistral-NeMo) for both tasks using synthetic patient data generated by Synthea and refined with GPT-4.

→ QLoRA (Quantized Low-Rank Adaptation) is used for efficient fine-tuning on NVIDIA A100 GPUs.
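A QLoRA setup along these lines can be sketched with Hugging Face `transformers` and `peft`. This is a config sketch, not the paper's recipe: the model id, adapter rank, alpha, dropout, and target modules are illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization of the frozen base weights (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",          # illustrative model id
    quantization_config=bnb_config,
    device_map="auto",
)

# Small trainable low-rank adapters on top of the quantized base
# (the "LoRA" part); hyperparameters here are illustrative.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
```

Because only the low-rank adapters receive gradients while the 4-bit base stays frozen, an 8B-parameter model can be fine-tuned on a single A100.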

→ The performance of these fine-tuned, privately hosted LLMs is compared against GPT-4, GPT-4o and Meditron-7B.

-----

Key Insights from this Paper 🤔:

→ Fine-tuning significantly improves the performance of smaller LLMs for specific medical question answering tasks.

→ Fine-tuned smaller models can outperform larger, general-purpose models like GPT-4 in specialized domains.

→ Dataset size positively impacts model performance, highlighting the importance of training data volume.

→ Sequential fine-tuning can have varying effects on performance depending on the model architecture and task order.

→ LLMs can exhibit self-preference bias when evaluating their own outputs, especially in non-blind evaluations.

-----

Results 🏆:

→ Fine-tuned Llama 3.1 Base achieved a 95.52% F1 score on Task 1, outperforming GPT-4's 95%.

→ Fine-tuned Mistral NeMo Base achieved a METEOR score of 0.5333 on Task 2, versus 0.3752 for GPT-4.

→ The fine-tuned models are roughly 250x smaller than GPT-4, a substantial efficiency gain for comparable or better task performance.

→ Models fine-tuned on larger datasets (4900 examples) showed a 4.39%-4.55% improvement in METEOR score compared to those fine-tuned on smaller datasets (500 examples) in Task 2.
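For reference, the Task 1 F1 score is the harmonic mean of precision and recall over the binary relevance labels. A minimal computation in plain Python (the labels below are made up for illustration):

```python
def f1_score(y_true, y_pred):
    """F1 = 2PR / (P + R) for binary labels, where 1 = relevant resource."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Made-up relevance labels: 3 true positives, 1 false negative.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1]
print(f"F1 = {f1_score(y_true, y_pred):.4f}")  # → F1 = 0.8571
```

METEOR, used for Task 2, instead scores generated answers against references with stemming and synonym matching, which suits free-form natural-language answers better than exact-match metrics.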
