This paper reveals inconsistencies in Large Language Model (LLM) answers to health questions posed in different languages, such as English, Chinese, Turkish, and German. Such inconsistencies risk spreading healthcare misinformation.
-----
Paper - https://arxiv.org/abs/2501.14719
Methods in this Paper 💡:
→ The paper introduces a novel prompt-based evaluation workflow.
→ This workflow assesses LLM consistency across languages for health-related questions.
→ It expands the HealthFC dataset by adding Turkish and Chinese translations and disease categories.
→ The workflow parses LLM responses into segments using a defined discourse ontology.
→ The discourse ontology includes Answer Summary, Health Benefits, Clinical Guidelines, Individual Considerations, and Public Health Advice.
→ A consistency-check prompt then compares each parsed English segment with the answer given in another language (see the sketch after this list).
→ Each segment is labeled Consistent, Partially Consistent, Contradictory, or Irrelevant.
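To make the workflow concrete, here is a minimal Python sketch of the evaluation loop: parse the English answer into information units, run a consistency-check prompt per unit against the other-language answer, then aggregate an inconsistency rate. The `llm` callable, the prompt wording, and the JSON output format are illustrative assumptions, not the paper's exact prompts; the ontology and consistency labels follow the list above.

```python
import json
from collections import Counter
from typing import Callable

# Information units from the discourse ontology listed above.
ONTOLOGY = [
    "Answer Summary",
    "Health Benefits",
    "Clinical Guidelines",
    "Individual Considerations",
    "Public Health Advice",
]

# Consistency labels assigned per information unit.
LABELS = ["Consistent", "Partially Consistent", "Contradictory", "Irrelevant"]


def parse_into_segments(llm: Callable[[str], str], english_answer: str) -> dict:
    """Step 1: prompt the model to split the English answer by information unit."""
    prompt = (
        "Segment the following health answer into these information units: "
        + ", ".join(ONTOLOGY)
        + ". Return a JSON object mapping each unit to its text "
        "(use an empty string if the unit is absent).\n\n" + english_answer
    )
    return json.loads(llm(prompt))


def check_consistency(llm: Callable[[str], str], unit: str, segment: str,
                      other_language_answer: str) -> str:
    """Step 2: compare one English segment against the answer in another language."""
    prompt = (
        f"English '{unit}' segment:\n{segment}\n\n"
        f"Answer to the same question in another language:\n{other_language_answer}\n\n"
        "Label their relation as one of: " + ", ".join(LABELS)
        + ". Reply with the label only."
    )
    label = llm(prompt).strip()
    return label if label in LABELS else "Irrelevant"  # fall back on unparseable replies


def evaluate_pair(llm: Callable[[str], str], english_answer: str,
                  other_language_answer: str) -> dict:
    """Run the workflow for one question answered in English and one other language."""
    segments = parse_into_segments(llm, english_answer)
    return {
        unit: check_consistency(llm, unit, text, other_language_answer)
        for unit, text in segments.items()
        if text  # skip information units the English answer does not contain
    }


def inconsistency_rate(labels_per_question: list[dict]) -> float:
    """Step 3: share of segments not labeled Consistent, aggregated over questions."""
    counts = Counter(label for labels in labels_per_question for label in labels.values())
    total = sum(counts.values())
    return 0.0 if total == 0 else 1.0 - counts["Consistent"] / total
```

Passing the model client as a callable keeps the sketch provider-agnostic: any function that maps a prompt string to a completion string can be plugged in.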
-----
Key Insights from this Paper 🧠:
→ LLMs show inconsistencies in health answers when questions are posed in different languages.
→ Inconsistencies are more pronounced in non-English languages than in English.
→ Certain disease categories and information types show higher inconsistency than others.
→ Answer summaries are generally consistent, but other parts like guidelines and advice are not.
→ LLMs tend to provide longer answers in English and German than in Turkish.
-----
Results 📊:
→ Inconsistency rates for Answer Summaries range from 14.55% to 37.56% across languages and models.
→ Inconsistency for Clinical Guidelines and Evidence reaches up to 77.93%.
→ Average inconsistency across all information units ranges from 40.56% to 68.12%.