
"Extract Information from Hybrid Long Documents Leveraging LLMs: A Framework and Dataset"

The podcast on this paper was generated with Google's Illuminate.

Break long documents into smart pieces, and let LLMs extract what matters.

A framework that enables LLMs to extract information from hybrid documents containing both text and tables by breaking them into manageable segments and using effective retrieval strategies.

-----

https://arxiv.org/abs/2412.20072

Original Problem 🔍:

→ LLMs cannot directly process hybrid long documents (HLDs) due to context window limitations. Traditional truncation causes information loss and fails to capture relationships between text and tables.

-----

Solution in this Paper 🛠:

→ The AIE framework splits HLDs into smaller segments using simplified table serialization.

→ It retrieves relevant segments through embedding-based similarity with keywords.

→ The framework uses the "Refine" strategy to iteratively build document summaries.

→ For precise numerical extraction, it employs specialized prompts with metadata completion.
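
The split, retrieve, and refine steps above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, the bag-of-words `embed` is a toy stand-in for a real embedding model, and `llm` is any callable that maps a prompt string to a response string.

```python
import math
import re
from collections import Counter
from typing import Callable, List


def split_into_segments(text: str, max_words: int = 50) -> List[str]:
    """Split a long document into consecutive segments of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text: str) -> Counter:
    """Toy bag-of-words vector (assumption: a real system calls an embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(segments: List[str], keyword: str, top_k: int = 3) -> List[str]:
    """Return the top_k segments most similar to the keyword."""
    query = embed(keyword)
    ranked = sorted(segments, key=lambda s: cosine(embed(s), query), reverse=True)
    return ranked[:top_k]


def refine_summary(segments: List[str], llm: Callable[[str], str]) -> str:
    """Refine-style summarization: update one running summary, segment by segment."""
    summary = ""
    for segment in segments:
        prompt = (
            f"Existing summary:\n{summary}\n\n"
            f"New segment:\n{segment}\n\n"
            "Update the summary so it also covers the new segment."
        )
        summary = llm(prompt)
    return summary
```

The default `top_k=3` mirrors the segment count reported in the insights below. Unlike a Map-Reduce approach, which summarizes segments independently and then merges the results, the loop in `refine_summary` passes the running summary into every call, so later segments are read in the context of earlier ones.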

-----

Key Insights 💡:

→ Simple table serialization formats outperform complex hierarchical ones for LLM comprehension

→ Retrieving 3 segments gives the best balance between accuracy and noise

→ A single well-designed few-shot example performs better than multiple examples

→ Adding company and time metadata significantly improves extraction accuracy
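
To make the first and last insights concrete, here is a small sketch of what a flat table serialization and a metadata-completed keyword might look like. The delimiter, the function names, and the company name are assumptions for illustration. The summary above does not specify the exact formats the paper uses.

```python
from typing import List


def serialize_table_plain(rows: List[List[str]], sep: str = " | ") -> str:
    """Flatten a table into one delimiter-separated line per row (assumed format)."""
    return "\n".join(sep.join(str(cell).strip() for cell in row) for row in rows)


def complete_keyword(keyword: str, company: str, period: str) -> str:
    """Prefix a bare keyword with company and time metadata before retrieval or prompting."""
    return f"{company} {keyword} {period}"


# Hypothetical data, for illustration only.
table = [["Item", "2022", "2021"], ["Total revenue", "10", "8"]]
serialized = serialize_table_plain(table)
query = complete_keyword("total revenue", "ExampleCorp", "fiscal year 2022")
```

Here `serialized` holds two lines, one per table row, and `query` carries the company and period alongside the keyword, which is the kind of added context the metadata insight refers to.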

-----

Results 📊:

→ AIE achieves 69.93% accuracy vs 15.99% baseline on FINE dataset with GPT-3.5

→ "Refine" strategy outperforms "Map-Reduce" by 11.53% in accuracy

→ Keyword completion with metadata improves accuracy by 30.71%

→ Processing time per sample: 16.36 seconds
