
"Does Table Source Matter? Benchmarking and Improving Multimodal Scientific Table Understanding and Reasoning"

The podcast below was generated with Google's Illuminate.

Scientific table understanding gets a boost with MMSci, which leverages domain-specific data and dynamic image resolution.

This paper improves multimodal scientific table understanding and reasoning by supporting dynamic input image resolution, addressing the limitations of current LLMs and MLLMs in handling complex numerical reasoning over scientific tables.
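
To make "dynamic input resolution" concrete, here is a minimal sketch of the idea behind models like Qwen2-VL: rather than squashing every table image to one fixed size, the image is resized to the nearest patch-aligned shape within a pixel budget, so wide or dense tables get more visual tokens. The patch size and pixel bounds below are illustrative assumptions, not the paper's exact settings.

```python
import math

def dynamic_resize(height: int, width: int,
                   patch: int = 28,
                   min_pixels: int = 56 * 56,
                   max_pixels: int = 1280 * 28 * 28) -> tuple[int, int]:
    """Pick a patch-aligned (height, width) that roughly preserves aspect
    ratio and keeps the pixel count inside [min_pixels, max_pixels]."""
    # Round each side to the nearest multiple of the vision patch size.
    h = max(patch, round(height / patch) * patch)
    w = max(patch, round(width / patch) * patch)
    if h * w > max_pixels:
        # Scale down so the image fits the visual-token budget.
        s = math.sqrt(h * w / max_pixels)
        h = math.floor(height / s / patch) * patch
        w = math.floor(width / s / patch) * patch
    elif h * w < min_pixels:
        # Scale up so the image yields enough patches to be legible.
        s = math.sqrt(min_pixels / (h * w))
        h = math.ceil(height * s / patch) * patch
        w = math.ceil(width * s / patch) * patch
    return h, w

# A wide, dense table keeps its horizontal detail instead of being
# crushed into a fixed square: (600, 1800) -> (560, 1736), i.e. 20 x 62 patches.
print(dynamic_resize(600, 1800))
```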

-----

https://arxiv.org/abs/2501.13042

Original Problem 🤔:

→ Current LLMs struggle with scientific tables because tables must be converted into sequential text, which loses structural information.

→ Existing MLLMs are limited by fixed input resolutions and weak numerical reasoning, particularly on scientific tables.

-----

Solution in this Paper 💡:

→ Introduces the MMSci framework with three components: MMSci-Pre, MMSci-Ins, and MMSci-Eval.

→ MMSci-Pre (52K scientific table images) enhances table structure recognition.

→ MMSci-Ins (12K instruction-tuning samples) improves numerical reasoning across table question answering (TQA), table fact verification (TFV), and table-to-text generation (T2T); a sample sketch follows this list.

→ MMSci-Eval (3,114 test samples) rigorously assesses numerical reasoning.

→ Implements dynamic input resolution on Qwen2-VL-7B-Instruct and LLaVA-NeXT-7B.
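
The released data schema isn't reproduced here; the following is a hypothetical sketch of what one instruction-tuning sample per task might look like, with all field names, paths, and contents invented for illustration.

```python
# Hypothetical MMSci-Ins-style samples; the field names are illustrative,
# not the paper's actual schema.
samples = [
    {   # TQA: answer a numerical question about a rendered table image
        "image": "tables/chem_0421.png",
        "instruction": "What is the mean yield (%) across the three catalysts?",
        "answer": "72.4",
    },
    {   # TFV: verify a claim against the table
        "image": "tables/bio_1093.png",
        "instruction": "Claim: the p-value in row 3 is below 0.05. True or false?",
        "answer": "true",
    },
    {   # T2T: generate a textual description of the table
        "image": "tables/phys_0007.png",
        "instruction": "Summarize the main trend shown in this table.",
        "answer": "Conductivity rises monotonically with dopant concentration.",
    },
]
```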

-----

Key Insights from this Paper 🔑:

→ Domain-specific table-structure learning from 52K scientific table images outperforms learning from 150K general-domain images.

→ Dynamic input resolution significantly improves performance.

→ Qwen2-VL-7B-Instruct shows superior performance and generalisation compared to LLaVA-NeXT-7B.

-----

Results 📊:

→ Pre-training on 52K scientific table images outperforms pre-training on 150K general-domain table images.

→ Achieves up to 42.10% accuracy on TQA and 73.98% on TFV with MMSci-Pre (202K) + MMSci-Ins.

→ Demonstrates strong generalisation to held-out numerical reasoning datasets (49.96% on TABMWP).
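
For context on how such accuracy figures are typically scored in numerical table reasoning: predicted numbers are usually matched against the gold answer with a small relative tolerance rather than strict string equality. A minimal sketch, assuming a 1% tolerance (the paper's exact scoring rules may differ):

```python
def numeric_match(pred: str, gold: str, rel_tol: float = 1e-2) -> bool:
    """Compare numeric answers with a relative tolerance; fall back to
    case-insensitive exact match for non-numeric answers."""
    try:
        p = float(pred.strip().rstrip("%"))
        g = float(gold.strip().rstrip("%"))
        return abs(p - g) <= rel_tol * max(abs(g), 1e-8)
    except ValueError:
        return pred.strip().lower() == gold.strip().lower()

def accuracy(preds: list[str], golds: list[str]) -> float:
    return sum(numeric_match(p, g) for p, g in zip(preds, golds)) / len(golds)

print(accuracy(["72.39", "true"], ["72.4", "True"]))  # 1.0
```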
