HtmlRAG uses HTML's native structure to make RAG systems smarter and more accurate
Finds that keeping HTML beats converting to plain text for knowledge retrieval in RAG systems
https://arxiv.org/abs/2411.02959
Original Problem 🤔:
Traditional RAG systems convert HTML to plain text, losing valuable structural and semantic information like headings and table layouts. Raw HTML documents are extremely long (80K+ tokens) and contain lots of noise like CSS/JavaScript, making them unsuitable for direct use in RAG systems.
-----
Solution in this Paper 🛠️:
→ HtmlRAG keeps retrieved documents in HTML instead of converting them to plain text, preserving semantic structure through three key components
→ An HTML Cleaning module removes irrelevant content like CSS/JavaScript and compresses redundant structures while preserving semantic information, reducing documents to 6% of their original size (a minimal sketch of this step follows the list)
→ A Block Tree Construction step builds an optimized block tree on top of the DOM tree, whose granularity can be adjusted to different pruning requirements
→ A Two-Step HTML Pruning stage first uses an embedding model to prune less relevant blocks at a coarse granularity, then applies a generative model for finer-grained pruning of the remaining blocks
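
A minimal sketch of what the cleaning step might look like, using BeautifulSoup as a stand-in for the paper's cleaning module (the helper name clean_html and the exact rules are illustrative assumptions, not the authors' code):

```python
# Illustrative sketch of HTML cleaning: strip CSS/JS and collapse redundant structure.
from bs4 import BeautifulSoup, Comment, Tag

def clean_html(raw_html: str) -> str:
    soup = BeautifulSoup(raw_html, "html.parser")

    # 1. Remove content that carries no semantic information for retrieval.
    for tag in soup(["script", "style", "noscript", "meta", "link"]):
        tag.decompose()
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        comment.extract()

    # 2. Drop presentational attributes (class, style, id, ...) but keep the tags.
    for tag in soup.find_all(True):
        tag.attrs = {}

    # 3. Compress redundant structure: unwrap wrappers whose only content is a
    #    single child tag, then drop tags left with no visible text at all.
    changed = True
    while changed:
        changed = False
        for tag in soup.find_all(True):
            children = [c for c in tag.children if isinstance(c, Tag) or str(c).strip()]
            if len(children) == 1 and isinstance(children[0], Tag):
                tag.unwrap()
                changed = True
    for tag in soup.find_all(True):
        if not tag.get_text(strip=True):
            tag.decompose()

    return str(soup)
```

The point of this step is that only presentation is stripped and redundant wrappers are collapsed; the surviving tags still encode headings, tables, and nesting for the later pruning stages.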
-----
Key Insights 💡:
→ LLMs already have HTML understanding capabilities from pre-training, making HTML a natural format for knowledge representation
→ HTML can effectively represent other document formats (PDF, Word, etc.) with minimal information loss
→ A block-tree structure enables more efficient HTML processing than working directly with DOM trees
→ Two-stage pruning that combines an embedding model with a generative model gives better results than single-stage approaches (a sketch of the coarse, embedding-based pass follows this list)
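
To make the two-stage idea concrete, here is a rough sketch of the coarse, embedding-based pass over a block tree. BeautifulSoup, sentence-transformers, the BGE model name, the block-size threshold, and the top-k cutoff are illustrative assumptions, not the paper's exact implementation:

```python
# Illustrative sketch: build blocks from the cleaned HTML, score them against the
# query with an embedding model, and prune the low-relevance blocks.
from bs4 import BeautifulSoup, Tag
from sentence_transformers import SentenceTransformer, util

def leaf_blocks(node: Tag, max_words: int = 256):
    """Treat a subtree as one block if it is small enough, otherwise recurse."""
    text = node.get_text(" ", strip=True)
    if not text:
        return []
    if len(text.split()) <= max_words or not node.find_all(True):
        return [(node, text)]
    blocks = []
    for child in node.find_all(True, recursive=False):
        blocks.extend(leaf_blocks(child, max_words))
    return blocks or [(node, text)]

def coarse_prune(cleaned_html: str, query: str, keep_top_k: int = 8) -> str:
    soup = BeautifulSoup(cleaned_html, "html.parser")
    blocks = leaf_blocks(soup)
    if not blocks:
        return cleaned_html
    model = SentenceTransformer("BAAI/bge-large-en-v1.5")  # assumed embedder
    scores = util.cos_sim(model.encode([query]),
                          model.encode([t for _, t in blocks]))[0]
    keep = {int(i) for i in scores.argsort(descending=True)[:keep_top_k]}
    for i, (node, _) in enumerate(blocks):
        if i not in keep:
            node.decompose()  # prune low-relevance blocks from the HTML
    return str(soup)
```

A second, finer-grained pass would then hand the surviving HTML to a generative model that scores block paths and prunes further, as described in the paper.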
-----
Results 📊:
→ Tested on 6 QA datasets including ambiguous QA, natural QA, multi-hop QA, and long-form QA
→ Outperformed baseline methods using plain text, with improvements in metrics like Exact Match and Hit@1
→ HTML cleaning reduced document size to 6% of original while preserving key information