GPT-4 automates web archive metadata generation with 99.9% cost reduction
https://arxiv.org/abs/2411.05409
🎯 Original Problem:
The National Library Board of Singapore needed to catalog over 98,000 websites after legislative changes allowed domain-wide crawling; manual cataloging at that scale became impractical and costly.
-----
🛠️ Solution in this Paper:
→ Developed a pipeline using GPT-4 to automate metadata generation for web archives
→ Implemented three data reduction heuristics to cut the content sent to the LLM: About Page Priority, Shortest URL, and Shortest URL with Regex Filtering (sketched after this list)
→ Used Chain of Thought prompting with two variants, basic and rule-based, tailored to different website types (illustrative prompt below)
→ Applied BERTScore and Levenshtein Distance for automated evaluation (see the evaluation sketch below)
→ Validated outputs through human cataloger assessment using McNemar's test (see the test sketch below)
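A minimal sketch of the data reduction heuristics, assuming the input is a list of URLs extracted from one site's WARC crawl; the `NOISE_PATTERN` regex and the `select_representative_url` helper are illustrative, not the paper's exact rules.

```python
import re

# Hypothetical noise patterns; the paper's actual regex filters are not specified here.
NOISE_PATTERN = re.compile(r"\.(css|js|png|jpg|pdf)$|\?|#|/(login|search|tag)s?/", re.IGNORECASE)

def select_representative_url(urls: list[str]) -> str | None:
    """Pick one page per site to send to the LLM, mirroring the paper's heuristics:
    1. About Page Priority: prefer a URL containing 'about'.
    2. Shortest URL with Regex Filtering: drop noisy URLs, then take the shortest.
    3. Shortest URL: fall back to the shortest URL overall."""
    if not urls:
        return None
    about_pages = [u for u in urls if "about" in u.lower()]
    if about_pages:
        return min(about_pages, key=len)
    filtered = [u for u in urls if not NOISE_PATTERN.search(u)]
    return min(filtered or urls, key=len)
```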
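And a sketch of the rule-based Chain of Thought prompting step, using the OpenAI Python SDK; the prompt wording, the `generate_metadata` helper, and the truncation limit are assumptions for illustration, not the paper's exact prompt.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative rule-based Chain of Thought prompt; the paper's exact wording differs.
SYSTEM_PROMPT = (
    "You are a library cataloger. Think step by step: first identify the website owner, "
    "then its purpose, then summarize its content. Follow these rules: for government "
    "sites, name the agency; for personal sites, avoid inferring private details. "
    "Return a title and a one-sentence description."
)

def generate_metadata(page_text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": page_text[:8000]},  # truncate to respect context limits
        ],
    )
    return response.choices[0].message.content
```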
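The automated evaluation can be sketched with the `bert-score` and `python-Levenshtein` packages; the `evaluate` helper and its return keys are illustrative.

```python
# Requires: pip install bert-score python-Levenshtein
from bert_score import score
import Levenshtein

def evaluate(generated: list[str], reference: list[str]) -> dict:
    """Compare LLM-generated descriptions against human-curated ones."""
    # BERTScore F1 captures semantic similarity between generated and reference text.
    _, _, f1 = score(generated, reference, lang="en")
    # Levenshtein distance captures surface-level (character edit) differences.
    edit_distances = [Levenshtein.distance(g, r) for g, r in zip(generated, reference)]
    return {
        "mean_bertscore_f1": f1.mean().item(),
        "mean_levenshtein": sum(edit_distances) / len(edit_distances),
    }
```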
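Finally, a sketch of the McNemar's test on paired cataloger judgments, using statsmodels; the contingency counts below are placeholders, not the paper's data.

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# 2x2 table of paired cataloger judgments (accept/reject) for human-curated vs
# LLM-generated metadata on the same websites. Counts are illustrative only.
table = np.array([
    [70, 5],   # both accepted | human accepted, LLM rejected
    [12, 13],  # human rejected, LLM accepted | both rejected
])

result = mcnemar(table, exact=True)
print(f"McNemar statistic={result.statistic}, p-value={result.pvalue:.3f}")
# A p-value below 0.05 (the paper reports 0.02) indicates cataloger acceptance
# differs significantly between human-curated and LLM-generated metadata.
```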
-----
🔍 Key Insights:
→ LLMs should complement rather than replace human catalogers
→ Content accuracy and hallucinations remain significant challenges
→ Data reduction techniques can dramatically cut costs while maintaining quality
→ Rule-based prompting improves metadata generation accuracy
→ Privacy concerns suggest exploring smaller language models
-----
📊 Results:
→ Achieved 99.9% reduction in metadata generation costs
→ Processing time: 4 hours for 112 WARC files
→ Accuracy issues found in 19.6% of LLM-generated metadata vs 6.3% of human-curated metadata
→ McNemar's test showed LLM outputs remained statistically distinguishable from human-generated metadata (p = 0.02)