
"Web Archives Metadata Generation with GPT-4o: Challenges and Insights"

The podcast on this paper is generated with Google's Illuminate.

GPT-4o automates web archive metadata generation with a 99.9% cost reduction

https://arxiv.org/abs/2411.05409

🎯 Original Problem:

The National Library Board Singapore faced the challenge of cataloging its web archive of over 98,000 websites. After legislative changes permitted domain-wide crawling, the collection grew to a scale at which manual cataloging became impractical and costly.

-----

🛠️ Solution in this Paper:

→ Developed a pipeline using GPT-4o to automate metadata generation for web archives

→ Implemented three data reduction heuristics to shrink LLM input: About Page Priority, Shortest URL, and Shortest URL with Regex Filtering (first sketch after this list)

→ Used Chain of Thought prompting in two variants: a basic prompt, and a rule-based prompt tailored to different website types (second sketch below)

→ Applied BERTScore and Levenshtein Distance for automated evaluation against existing catalog records (third sketch below)

→ Assessed quality with human catalogers, comparing LLM-generated and human-curated records via McNemar's test (final sketch below)
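
To keep LLM input (and cost) small, each archived site is first reduced to one representative page. A minimal sketch of the three heuristics, assuming a list of crawled URLs per site; the function names, regex, and fallback order are illustrative assumptions, not the authors' exact implementation:

```python
# Minimal sketch of the three data reduction heuristics; names and the
# regex are illustrative, not the paper's exact implementation.
import re

# Assumed pattern for URLs unlikely to describe the site as a whole
# (static assets, feeds, listing pages).
NOISE = re.compile(r"\.(css|js|png|jpe?g|gif|pdf|xml)$|/(tag|feed|page)/", re.I)

def shortest_url(urls):
    # Heuristic 2: the shortest URL is usually the homepage or another
    # top-level page that summarises the site.
    return min(urls, key=len)

def shortest_url_with_regex(urls):
    # Heuristic 3: drop obvious asset/listing URLs first, then fall
    # back to the plain shortest-URL rule.
    kept = [u for u in urls if not NOISE.search(u)]
    return shortest_url(kept or urls)

def about_page_priority(urls):
    # Heuristic 1: prefer an "about" page, which typically describes
    # the organisation behind the site (fallback here is an assumption).
    about = [u for u in urls if re.search(r"about", u, re.I)]
    return min(about, key=len) if about else shortest_url_with_regex(urls)

site = [
    "https://example.sg/assets/site.css",
    "https://example.sg/about-us",
    "https://example.sg/",
]
print(about_page_priority(site))  # -> https://example.sg/about-us
```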
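
The two prompt variants can be pictured as templates; the wording below is invented for illustration and is not the paper's actual prompt text:

```python
# Invented illustration of the two Chain of Thought prompt variants;
# the actual prompt wording in the paper differs.
BASIC_COT = """You are a library cataloger. Read the web page below and
reason step by step: (1) identify who publishes the site, (2) summarise
its purpose and audience, then (3) output a title, a summary, and
subject headings.

Page content:
{page_text}
"""

# The rule-based variant prepends website-type-specific cataloging
# rules (e.g. different conventions for government or corporate sites).
RULE_BASED_COT = """Apply these cataloging rules for {site_type} websites:
{rules}

""" + BASIC_COT

prompt = RULE_BASED_COT.format(
    site_type="government",
    rules="- Use the agency's official name as the title.",
    page_text="<extracted text of the selected page>",
)
```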
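
For the automated evaluation, the metric choices come from the paper, but the specific libraries used in this sketch (bert-score, python-Levenshtein) are my assumptions:

```python
# BERTScore for semantic similarity of generated summaries; Levenshtein
# distance for near-exact fields such as titles.
from bert_score import score          # pip install bert-score
from Levenshtein import distance      # pip install python-Levenshtein

generated = ["A portal for Singapore heritage resources."]
reference = ["Portal collecting resources on Singapore heritage."]

# BERTScore tolerates paraphrase between generated and catalogued text.
P, R, F1 = score(generated, reference, lang="en")
print(f"BERTScore F1: {F1.mean().item():.3f}")

# Normalised edit distance between a generated and a catalogued title.
gen_title, ref_title = "Singapore Heritage Portal", "Singapore heritage portal"
norm = distance(gen_title, ref_title) / max(len(gen_title), len(ref_title))
print(f"Normalised edit distance: {norm:.3f}")
```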
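
For the human assessment, McNemar's test compares paired judgements on the same records; a sketch using statsmodels, with made-up counts:

```python
# Catalogers judge each record's LLM-generated and human-curated
# metadata as acceptable or not; McNemar's test asks whether the
# disagreement between the two is systematic. Counts are illustrative.
from statsmodels.stats.contingency_tables import mcnemar

# Rows: human-curated metadata accepted / rejected;
# columns: LLM-generated metadata accepted / rejected.
table = [[70, 18],
         [ 5,  7]]

result = mcnemar(table, exact=True)
print(f"McNemar p-value: {result.pvalue:.3f}")  # small p => distinguishable
```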

-----

🔍 Key Insights:

→ LLMs should complement rather than replace human catalogers

→ Content accuracy and hallucinations remain significant challenges

→ Data reduction techniques can dramatically cut costs while maintaining quality

→ Rule-based prompting improves metadata generation accuracy

→ Privacy concerns suggest exploring smaller language models

-----

📊 Results:

→ Achieved 99.9% reduction in metadata generation costs

→ Processing time: 4 hours for 112 WARC files

→ 19.6% accuracy issues in LLM-generated metadata vs 6.3% in human-curated metadata

→ McNemar's test showed LLM outputs remained statistically distinguishable from human-generated metadata (p = 0.02)
