LoRA-Contextualizing Adaptation of Large Multimodal Models for Long Document Understanding

The podcast on this paper was generated with Google's Illuminate.

Smart page retrieval + shared parameters = Better document understanding

Two adapters, one LLM: The key to handling long documents

https://arxiv.org/abs/2411.01106

🤖 Original Problem:

Large multimodal models struggle to understand long, multi-page documents containing complex layouts, tables, charts, and images. Current solutions either rely on inefficient document parsers or try to process all pages at once, leading to performance bottlenecks and memory issues.

-----

🔧 Solution in this Paper:

→ Introduces the LoCAL (LoRA-Contextualizing Adaptation) framework, which uses two LoRA adapters sharing the same LLM backbone (see the adapter sketch after this list)

→ The first adapter handles evidence-page retrieval, using contextualized late-interaction scoring between question and document features

→ The second adapter answers the question based on the retrieved evidence pages

→ Both modules share the backbone's parameters; adapter-based fine-tuning adds only ~2% extra parameters
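
A minimal sketch of the dual-adapter idea using Hugging Face's peft library is below. The backbone, adapter names, and LoRA hyperparameters here are illustrative assumptions, not the paper's exact configuration:

```python
# Minimal sketch: two LoRA adapters sharing one frozen backbone, switched per task.
# gpt2 stands in for the paper's multimodal LLM; hyperparameters are illustrative.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in backbone

lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["c_attn"],  # gpt2's fused attention projection
)

# One backbone, two lightweight adapters: only the LoRA weights differ (~2% of params).
model = get_peft_model(base, lora_cfg, adapter_name="retrieval")
model.add_adapter("qa", lora_cfg)

model.set_adapter("retrieval")  # mode 1: embed question/pages for relevance scoring
# ... compute late-interaction scores, keep the top-k evidence pages ...

model.set_adapter("qa")         # mode 2: answer conditioned on retrieved pages
# ... generate the answer from the question + top-k pages ...
```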

-----

💡 Key Insights:

→ LLMs themselves can serve as effective multimodal retrievers for finding relevant document pages

→ Contextualized late interaction enables fine-grained relevance scoring between question tokens and page features (see the scoring sketch after this list)

→ Parameter sharing through the dual adapters makes the solution memory-efficient

→ The approach eliminates the need for traditional document parsers while maintaining high accuracy
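
To make the late-interaction idea concrete, here is a minimal ColBERT-style MaxSim scoring sketch in PyTorch. The shapes and random embeddings are toy placeholders, and the function name is illustrative, not from the paper's code:

```python
import torch
import torch.nn.functional as F

def late_interaction_score(q_emb: torch.Tensor, p_emb: torch.Tensor) -> torch.Tensor:
    """ColBERT-style late interaction: each question token finds its
    best-matching page token; per-token maxima are summed into one score."""
    q = F.normalize(q_emb, dim=-1)   # [n_q, d] question token embeddings
    p = F.normalize(p_emb, dim=-1)   # [n_p, d] page token embeddings
    sim = q @ p.T                    # [n_q, n_p] pairwise cosine similarities
    return sim.max(dim=-1).values.sum()  # MaxSim over page tokens, summed over question

# Rank pages by relevance to the question (toy embeddings).
question = torch.randn(12, 128)
pages = [torch.randn(256, 128) for _ in range(20)]
scores = torch.stack([late_interaction_score(question, p) for p in pages])
top5 = scores.topk(k=5).indices  # evidence pages passed to the QA adapter
```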

-----

📊 Results:

→ Achieves 98% top-5 retrieval accuracy on the SlideVQA dataset

→ Outperforms retrieval baselines such as CLIP, BM25, and SBERT across multiple benchmarks

→ Rivals Gemini-1.5-Pro performance on MMLongBench-Doc with only 4B parameters

→ Successfully processes documents with hundreds of pages while maintaining efficiency
