"HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems"

The podcast on this paper is generated with Google's Illuminate.

HtmlRAG uses HTML's native structure to make RAG systems smarter and more accurate

The paper finds that HTML beats plain text for modeling retrieved knowledge in RAG systems

https://arxiv.org/abs/2411.02959

Original Problem 🤔:

Traditional RAG systems convert HTML to plain text, losing structural and semantic information such as headings and table layouts. Raw HTML is not a drop-in replacement, though: documents are very long (80K+ tokens) and full of noise such as CSS and JavaScript, which makes them unsuitable for direct use in RAG systems.

-----

Solution in this Paper 🛠️:

→ HtmlRAG preserves HTML structure instead of converting to plain text, keeping semantic information intact through three key components

→ An HTML Cleaning module removes irrelevant content such as CSS and JavaScript and compresses redundant structures while preserving semantic information, reducing document size to 6% of the original

→ Block Tree Construction merges DOM tree nodes into blocks, producing a tree whose granularity can be adjusted for different pruning requirements

→ Two-Step HTML Pruning first uses an embedding model to prune less relevant blocks at a coarse level, then uses a generative model for finer-grained pruning of the remaining blocks
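
To make the cleaning step concrete, here is a minimal sketch using only Python's standard library. It is not the paper's implementation: the names `HtmlCleaner`, `clean_html`, `NOISE_TAGS`, and `VOID_TAGS` are hypothetical, and the rules (drop script/style/noscript content, comments, and attributes; remove elements that end up empty) are an assumed subset of what a cleaning module might do. It does not attempt the compression of redundant nested structures.

```python
import re
from html import escape
from html.parser import HTMLParser

# Assumption: these tags carry no useful content for question answering.
NOISE_TAGS = {"script", "style", "noscript"}
# Void elements have no closing tag; this sketch simply drops them.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class HtmlCleaner(HTMLParser):
    """Keeps tag names and text; drops noise tags, comments, and attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []        # cleaned output fragments, in document order
        self.skip_depth = 0  # > 0 while inside a noise tag

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag not in VOID_TAGS:
            self.out.append(f"<{tag}>")  # attributes are intentionally dropped

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0 and tag not in VOID_TAGS:
            if self.out and self.out[-1] == f"<{tag}>":
                self.out.pop()  # the element was empty, so remove it entirely
            else:
                self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth:
            return
        text = re.sub(r"\s+", " ", data)  # collapse whitespace runs
        if text.strip():
            self.out.append(escape(text, quote=False))

    # Comments are dropped because HTMLParser.handle_comment does nothing by default.


def clean_html(raw: str) -> str:
    cleaner = HtmlCleaner()
    cleaner.feed(raw)
    cleaner.close()
    return "".join(cleaner.out)


raw = (
    '<html><head><style>p{color:red}</style><script>var x = 1;</script></head>'
    '<body><div class="a"><!-- note --><p id="x">Hello <b>world</b></p>'
    '<div><span></span></div></div></body></html>'
)
print(clean_html(raw))
# → <html><body><div><p>Hello <b>world</b></p></div></body></html>
```

Even on this toy input, most of the characters removed are styling, scripting, and attributes rather than text, which is the kind of reduction the cleaning module targets.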

-----

Key Insights 💡:

→ LLMs already have HTML understanding capabilities from pre-training, making HTML a natural format for knowledge representation

→ HTML can effectively represent other document formats (PDF, Word etc.) with minimal information loss

→ A block-tree structure enables more efficient HTML processing than working directly with DOM trees

→ Two-stage pruning combining embedding and generative models provides better results than single-stage approaches
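
The two-stage idea can be sketched as a coarse filter followed by a finer one, each dropping the lowest-scoring blocks until a token budget is met. This is an illustration under stated assumptions, not the paper's algorithm: blocks are a flat list of strings rather than a block tree, a bag-of-words cosine (`bow_cosine`) stands in for a real embedding model, and `fine_scorer` is a hypothetical callable standing in for a generative model's relevance score. All function names and budgets are invented for the example.

```python
import math
import re
from collections import Counter
from typing import Callable, List


def tokens(text: str) -> List[str]:
    # Crude tokenizer used for both scoring and length counting (an assumption).
    return re.findall(r"[a-z0-9]+", text.lower())


def bow_cosine(query: str, block: str) -> float:
    """Stand-in for embedding similarity: cosine over word-count vectors."""
    va, vb = Counter(tokens(query)), Counter(tokens(block))
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def greedy_prune(blocks: List[str], scores: List[float], budget: int) -> List[str]:
    """Drop lowest-scoring blocks until the total token count fits the budget."""
    keep = set(range(len(blocks)))
    total = sum(len(tokens(b)) for b in blocks)
    # Ties are broken by position so the result is deterministic.
    for i in sorted(range(len(blocks)), key=lambda i: (scores[i], i)):
        if total <= budget:
            break
        keep.discard(i)
        total -= len(tokens(blocks[i]))
    return [blocks[i] for i in sorted(keep)]  # preserve document order


def two_step_prune(
    query: str,
    blocks: List[str],
    coarse_budget: int,
    fine_budget: int,
    fine_scorer: Callable[[str, str], float],
) -> List[str]:
    # Step 1: cheap, coarse pruning with the embedding stand-in.
    coarse = greedy_prune(blocks, [bow_cosine(query, b) for b in blocks], coarse_budget)
    # Step 2: finer pruning of the survivors with the (hypothetical) generative scorer.
    return greedy_prune(coarse, [fine_scorer(query, b) for b in coarse], fine_budget)


blocks = [
    "Site navigation home about contact",
    "HtmlRAG prunes HTML blocks with an embedding model first",
    "A generative model then prunes the remaining blocks",
    "Subscribe to our newsletter for updates",
]
query = "How does HtmlRAG prune HTML blocks?"
# bow_cosine is reused as the fine scorer here only to keep the demo self-contained.
print(two_step_prune(query, blocks, coarse_budget=20, fine_budget=10, fine_scorer=bow_cosine))
# → ['HtmlRAG prunes HTML blocks with an embedding model first']
```

The design point is cost: the cheap scorer sees every block, while the more expensive scorer only sees what survives the first pass.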

-----

Results 📊:

→ Tested on 6 QA datasets including ambiguous QA, natural QA, multi-hop QA, and long-form QA

→ Outperformed baseline methods using plain text, with improvements in metrics like Exact Match and Hit@1

→ HTML cleaning reduced document size to 6% of the original while preserving key information