Break long documents into smart pieces, and let LLMs extract what matters.
A framework that enables LLMs to extract information from hybrid documents containing both text and tables by breaking them into manageable segments and using effective retrieval strategies.
-----
https://arxiv.org/abs/2412.20072
Original Problem 🔍:
→ LLMs cannot directly process hybrid long documents (HLDs) due to context window limitations. Traditional truncation causes information loss and fails to capture relationships between text and tables.
-----
Solution in this Paper 🛠:
→ The AIE framework splits HLDs into smaller segments using simplified table serialization.
→ It retrieves relevant segments through embedding-based similarity with keywords.
→ The framework uses the "Refine" strategy to iteratively build document summaries.
→ For precise numerical extraction, it employs specialized prompts with metadata completion.
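
The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function names, the word-count segmenter, and the `llm` callable are hypothetical, and the bag-of-words `embed` is a toy stand-in for a real embedding model.

```python
import math
import re
from collections import Counter
from typing import Callable, List


def serialize_table(rows: List[List[str]]) -> str:
    """Flatten a table into simple 'cell | cell' lines, one row per line."""
    return "\n".join(" | ".join(cell.strip() for cell in row) for row in rows)


def split_into_segments(blocks: List[str], max_words: int = 50) -> List[str]:
    """Greedily pack text/table blocks into segments of roughly max_words words.

    A block is never split, so one oversized block becomes its own segment.
    """
    segments, current, count = [], [], 0
    for block in blocks:
        n = len(block.split())
        if current and count + n > max_words:
            segments.append("\n".join(current))
            current, count = [], 0
        current.append(block)
        count += n
    if current:
        segments.append("\n".join(current))
    return segments


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' (assumption: swap in a real model here)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(segments: List[str], keyword: str, k: int = 3) -> List[str]:
    """Return the k segments most similar to the keyword."""
    query = embed(keyword)
    ranked = sorted(segments, key=lambda s: cosine(embed(s), query), reverse=True)
    return ranked[:k]


def refine(segments: List[str], keyword: str, llm: Callable[[str], str]) -> str:
    """'Refine' loop: feed one segment at a time, carrying the running answer."""
    answer = ""
    for segment in segments:
        prompt = (
            f"Keyword: {keyword}\n"
            f"Current answer: {answer}\n"
            f"New segment:\n{segment}\n"
            "Update the answer using the new segment."
        )
        answer = llm(prompt)
    return answer
```

Because `refine` passes the running answer into every prompt, each call sees what earlier segments contributed, which is what distinguishes it from summarizing segments independently and merging afterwards.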
-----
Key Insights 💡:
→ Simple table serialization formats outperform complex hierarchical ones for LLM comprehension
→ Retrieving 3 segments provides the best balance between accuracy and noise
→ A single well-designed few-shot example performs better than multiple examples
→ Adding company and time metadata significantly improves extraction accuracy
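
To make the last three insights concrete, here is a rough sketch of how a prompt could combine them: the keyword is completed with company and time metadata, only the top 3 retrieved segments are kept, and exactly one few-shot example is included. The prompt wording, the example text, and the helper names are all assumptions for illustration and are not taken from the paper.

```python
from typing import List

# One hand-written few-shot example (hypothetical wording, invented values).
FEW_SHOT_EXAMPLE = (
    "Context: Item | 2021\nNet income | 40\n"
    "Question: net income of Example Inc for fiscal year 2021\n"
    "Answer: 40"
)


def complete_keyword(keyword: str, company: str, period: str) -> str:
    """Attach company and time metadata to a bare keyword."""
    return f"{keyword} of {company} for {period}"


def build_extraction_prompt(
    keyword: str,
    company: str,
    period: str,
    ranked_segments: List[str],
    top_k: int = 3,
) -> str:
    """Build a prompt from the completed keyword and the top_k ranked segments."""
    question = complete_keyword(keyword, company, period)
    context = "\n---\n".join(ranked_segments[:top_k])
    return (
        "Extract the requested value from the context.\n\n"
        f"Example:\n{FEW_SHOT_EXAMPLE}\n\n"
        f"Context:\n{context}\n"
        f"Question: {question}\n"
        "Answer:"
    )
```

Here `ranked_segments` is assumed to be already sorted by relevance, so slicing to `top_k` drops the lower-ranked, noisier segments.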
-----
Results 📊:
→ AIE achieves 69.93% accuracy vs 15.99% for the baseline on the FINE dataset with GPT-3.5
→ "Refine" strategy outperforms "Map-Reduce" by 11.53% in accuracy
→ Keyword completion with metadata improves accuracy by 30.71%
→ Processing time per sample: 16.36 seconds