"Lost in Time: Clock and Calendar Understanding Challenges in Multimodal LLMs"
The podcast below on this paper was generated with Google's Illuminate.
https://arxiv.org/abs/2502.05092
The paper addresses the problem that Multimodal LLMs (MLLMs) struggle to interpret time and date from visual inputs such as clocks and calendars, despite their advances on other visual tasks.
This paper introduces two new datasets, ClockQA and CalendarQA, to specifically evaluate MLLMs' ability to interpret time and date from images. This allows for targeted analysis of visual parsing, numerical reasoning, and temporal inference in MLLMs.
-----
📌 ClockQA and CalendarQA datasets offer targeted benchmarks. They expose MLLMs' weaknesses in visual temporal reasoning. Current vision-language models lack precise spatial understanding for time-related visuals.
📌 The paper's zero-shot evaluation reveals a critical gap. MLLMs struggle with basic time interpretation from images despite strong language and vision capabilities in isolation. This highlights multimodal integration challenges.
📌 Performance metrics like Mean Absolute Error (MAE) in seconds for clocks and category-wise F1 scores for calendars quantify the nuanced failures of MLLMs beyond simple accuracy (a sketch of these metrics follows below).
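A minimal sketch, assuming predicted and ground-truth times are compared as "HH:MM:SS" strings and ignoring 12-hour wrap-around, of how such clock-reading metrics could be computed (illustrative only; the paper's exact error definitions may differ):

```python
from datetime import datetime

def clock_errors(pred: str, gold: str):
    """Compare a predicted and a ground-truth clock reading ("HH:MM:SS").

    Returns exact match, absolute hour/minute errors, and an overall
    absolute error in seconds (assumed metric definitions, for illustration).
    """
    p = datetime.strptime(pred, "%H:%M:%S")
    g = datetime.strptime(gold, "%H:%M:%S")
    exact_match = pred == gold
    hour_err = abs(p.hour - g.hour)
    minute_err = abs(p.minute - g.minute)
    seconds_err = abs(
        (p.hour * 3600 + p.minute * 60 + p.second)
        - (g.hour * 3600 + g.minute * 60 + g.second)
    )
    return exact_match, hour_err, minute_err, seconds_err

# Example: the model reads 2:45:00 when the clock actually shows 3:10:30.
print(clock_errors("02:45:00", "03:10:30"))  # (False, 1, 35, 1530)
```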
----------
Methods Explored in this Paper 🔧:
→ Introduces two datasets: ClockQA and CalendarQA.
→ The ClockQA dataset contains images of analogue clocks in different styles, such as standard, Roman-numeral, and arrow-hand dials. Questions ask for the time shown on the clock, testing visual recognition of clock hands and conversion to a time value.
→ The CalendarQA dataset features yearly calendar images. Questions range from well-known dates, such as Christmas, to calculation-based queries, such as the 100th day of the year. This evaluates visual parsing of calendars and date-based reasoning.
→ Seven MLLMs were evaluated in a zero-shot setting: closed-source models (GPT-4o, GPT-o1, Claude-3.5, Gemini-2.0) and open-source models (Llama 3.2-Vision, Qwen2-VL-7B, MiniCPM-V-2.6).
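A hedged sketch of what a zero-shot ClockQA-style query could look like through the OpenAI API with GPT-4o. The prompt wording, answer format, and file path are illustrative assumptions, not the paper's actual evaluation harness:

```python
import base64
from openai import OpenAI  # requires the `openai` package and an OPENAI_API_KEY

client = OpenAI()

def ask_time(image_path: str) -> str:
    """Zero-shot query: send a clock image and ask for the time it shows.

    The prompt wording here is an illustrative assumption, not the paper's prompt.
    """
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "What time is shown on this clock? Answer as HH:MM:SS."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

# Example (hypothetical file path):
# print(ask_time("clockqa/roman_numeral_0421.png"))
```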
-----
Key Insights 💡:
→ MLLMs struggle with accurately reading analogue clocks from images. Performance is poor across various clock styles, indicating a challenge in visual perception of clock hands and angles.
→ Calendar understanding is better than clock reading, especially for popular dates. However, performance drops significantly for less common dates and for questions requiring date arithmetic (see the date-arithmetic sketch after this list).
→ Closed-source models like GPT-o1 and Claude-3.5 perform better on calendar tasks for well-known dates. This might be due to memorization. Open-source models show near-random performance on complex calendar queries.
→ Temporal reasoning involving visual inputs of time and date remains a significant challenge for current MLLMs.
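For context on the date-arithmetic questions, the underlying computation is trivial in code even though MLLMs struggle to perform it from a calendar image. A minimal sketch, using 2024 as an illustrative year (not necessarily the one used in the paper's calendars):

```python
from datetime import date, timedelta

def nth_day_of_year(year: int, n: int) -> date:
    """Return the n-th day of the given year (1-indexed)."""
    return date(year, 1, 1) + timedelta(days=n - 1)

def weekday_of(d: date) -> str:
    """Return the weekday name of a date, e.g. for Christmas."""
    return d.strftime("%A")

# Example CalendarQA-style ground truths for 2024:
print(nth_day_of_year(2024, 100))              # 2024-04-09
print(weekday_of(nth_day_of_year(2024, 100)))  # Tuesday
print(weekday_of(date(2024, 12, 25)))          # Wednesday
```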
-----
Results 📊:
→ On ClockQA, Gemini-2.0 achieves the highest Exact Match (EM) score of 22.58%. However, overall EM scores are low across all models.
→ For ClockQA, Gemini-2.0 shows the lowest Hour Error and Minute Error, indicating better but still imperfect clock reading.
→ On CalendarQA, GPT-o1 achieves the highest Accuracy of 80%. This highlights stronger date reasoning capabilities compared to other models on calendar-based questions.
→ Open-source models like Llama 3.2-Vision, Qwen2-VL-7B, and MiniCPM-V-2.6 show significantly lower performance than the closed-source models on both ClockQA and CalendarQA.