Pinpointing why LLMs mishandle dates: DateLogicQA breaks temporal bias down into representation-level and logic-level errors
This paper introduces DateLogicQA, a benchmark dataset and evaluation metrics to identify and measure temporal biases in LLMs when processing dates across different formats and contexts.
-----
https://arxiv.org/abs/2412.13377
🤔 Original Problem:
LLMs struggle with accurate temporal reasoning, leading to errors in date-related tasks due to biases in tokenization and semantic interpretation.
-----
🔧 Solution in this Paper:
→ DateLogicQA dataset contains 190 questions covering diverse date formats and temporal contexts
→ Semantic Integrity Metric scores tokenization quality by measuring how well date structures survive the tokenizer (see the sketch after this list)
→ Human evaluation framework categorizes responses based on representation-level and logical-level biases
→ Analysis framework examines embeddings and softmax outputs to quantify temporal biases
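To make the Semantic Integrity Metric concrete, here is a minimal sketch of a tokenization-integrity check, assuming a simple scoring rule: the fraction of date components (year, month, day) that survive as whole tokens. The rule, the `semantic_integrity` helper, and the tiktoken cl100k_base encoding are illustrative assumptions, not the paper's exact formulation.

```python
# Illustrative approximation of a semantic-integrity-style check,
# NOT the paper's exact metric: score how well a tokenizer preserves
# date components (year/month/day) as intact tokens rather than
# splitting them mid-digit.
import re
import tiktoken  # pip install tiktoken

def semantic_integrity(date_str: str, encoding_name: str = "cl100k_base") -> float:
    enc = tiktoken.get_encoding(encoding_name)
    # Decode each token id back to its text piece.
    tokens = [enc.decode([t]) for t in enc.encode(date_str)]
    # Date components are runs of digits or letters, e.g. "2024", "Dec", "18".
    components = re.findall(r"[A-Za-z]+|\d+", date_str)
    # A component counts as "preserved" if it appears whole inside one token.
    preserved = sum(any(comp in tok for tok in tokens) for comp in components)
    return preserved / len(components) if components else 1.0

for fmt in ["2024-12-18", "18/12/2024", "2024, Dec 18", "20241218"]:
    print(f"{fmt!r:16} integrity={semantic_integrity(fmt):.2f}")
```

Under a rule like this, formats whose separators align with token boundaries tend to score higher, which is consistent with the insight below about clear separators.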
-----
💡 Key Insights:
→ Newer models achieve higher semantic integrity scores (around 0.7) alongside near-optimal token counts
→ Standardized date formats with clear separators perform better than complex formats
→ Models show bias towards contemporary dates over historical ones (one way to probe this is sketched after this list)
→ Future-oriented reasoning outperforms historical and present-date tasks
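One direct way to probe the contemporary-date bias above is to inspect a causal LM's next-token softmax over candidate years. The sketch below is a hedged illustration: GPT-2 and the treaty prompt are stand-ins chosen for a runnable demo; the paper's analysis framework examines embeddings and softmax outputs, but its exact models and probes may differ.

```python
# Hedged sketch: probe a causal LM's softmax preference for contemporary
# vs. historical years as a rough signal of representation-level date bias.
# GPT-2 and the prompt are illustrative assumptions, not the paper's setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "The treaty was signed in the year"
ids = tok(prompt, return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits[0, -1]  # logits for the next token
probs = torch.softmax(logits, dim=-1)

for year in [" 1850", " 1950", " 2020"]:
    year_ids = tok(year).input_ids
    # Compare probability mass on the first token of each year string.
    p = probs[year_ids[0]].item()
    print(f"P(first token of {year.strip()!r}) = {p:.4f}")
```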
-----
📊 Results:
→ GPT-4-turbo leads with 63% correct responses and 16% incorrect answers
→ Future date reasoning achieves 50% accuracy vs 44% for historical dates
→ Commonsense questions reach 51% accuracy while numerical reasoning struggles at 37%
→ Models perform best with "YYYY, Mon DD" format (57% correct)
-----
Are you into AI and LLMs❓ Join my daily AI newsletter. I will send you 7 emails a week analyzing the highest signal AI developments. ↓↓
🎉 https://rohanpaul.substack.com/