A map of AI's training data landscape across modalities.
This paper conducts the first large-scale audit of AI training datasets across text, speech, and video modalities, analyzing data provenance, licensing, and representation.
-----
https://arxiv.org/abs/2412.17847
🔍 Original Problem:
→ Lack of comprehensive analysis of AI training datasets beyond the text modality, especially regarding data sources, licensing restrictions, and geographical representation.
→ No systematic understanding of how training data is sourced and used across different modalities.
-----
🛠️ Solution in this Paper:
→ Conducted manual analysis of nearly 4000 public datasets spanning 1990-2024.
→ Tracked 608 languages, 798 sources, 659 organizations across 67 countries.
→ Analyzed trends in data sourcing, licensing restrictions, and geographical representation.
→ Developed methodology to trace dataset lineage and document restrictions.
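
The summary above does not spell out how the paper traces lineage, so the following is only a minimal sketch of the general idea, under stated assumptions: each dataset or source is a node, each node lists its upstream parents, and a dataset's effective restrictions are the union of its own terms and everything it derives from. The `Node` schema, the `effective_restrictions` helper, and the example names are all hypothetical, not the authors' tooling.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A dataset or a raw source (hypothetical schema, not the paper's)."""
    name: str
    restrictions: set = field(default_factory=set)  # terms stated on this node itself
    parents: list = field(default_factory=list)     # names of upstream nodes


def effective_restrictions(name, nodes, _seen=None):
    """Union of a node's own restrictions and those of everything upstream."""
    seen = set() if _seen is None else _seen
    if name in seen:  # guard against cycles in the lineage graph
        return set()
    seen.add(name)
    node = nodes[name]
    result = set(node.restrictions)
    for parent in node.parents:
        result |= effective_restrictions(parent, nodes, seen)
    return result


# Illustrative lineage: a dataset with no stated restrictions that is
# built on a source whose terms are non-commercial.
nodes = {
    "video_site": Node("video_site", {"non-commercial"}),
    "web_crawl": Node("web_crawl"),
    "raw_clips": Node("raw_clips", set(), ["video_site"]),
    "captioned_clips": Node("captioned_clips", set(), ["raw_clips", "web_crawl"]),
}

stated = nodes["captioned_clips"].restrictions
inherited = effective_restrictions("captioned_clips", nodes)
print("stated:", stated)
print("inherited:", inherited)
```

In this toy graph the dataset's own license looks permissive while its lineage carries a non-commercial term, which is the kind of gap between dataset licenses and source restrictions that an audit like this is meant to surface.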
-----
💡 Key Insights:
→ Web-crawled and social media data have dominated training sets since 2019.
→ Only 33% of datasets carry restrictive licenses themselves, yet 80% of the underlying source content has non-commercial restrictions.
→ Geographical and linguistic diversity has not improved since 2013, even though more languages have been added.
→ North America and Europe account for 93% of text data and over 60% of speech and video content.
-----
📊 Results:
→ Analyzed 3916 datasets totaling 2.1T tokens and 1.9M hours of content
→ Documented 798 unique sources across 83 domains
→ Mapped 659 creator organizations spanning 67 countries
→ Tracked 608 languages across 37 language families