
"Bridging the Data Provenance Gap Across Text, Speech and Video"

The podcast below on this paper was generated with Google's Illuminate.

A map of AI's training data landscape across modalities.

This paper conducts the first large-scale audit of AI training datasets across text, speech, and video modalities, analyzing data provenance, licensing, and representation.

-----

https://arxiv.org/abs/2412.17847

🔍 Original Problem:

→ No comprehensive analysis of AI training datasets exists beyond the text modality, especially regarding data sources, licensing restrictions, and geographical representation.

→ No systematic understanding of how training data is sourced and used across different modalities.

-----

🛠️ Solution in this Paper:

→ Conducted a manual analysis of nearly 4,000 public datasets spanning 1990-2024.

→ Tracked 608 languages, 798 sources, 659 organizations across 67 countries.

→ Analyzed trends in data sourcing, licensing restrictions, and geographical representation.

→ Developed methodology to trace dataset lineage and document restrictions.
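
To make the lineage-tracing idea concrete, here is a minimal sketch of how restrictions might be collected by walking from a dataset back through its parents and sources. It is not the paper's tooling: the `Dataset` record, the `collect_terms` and `is_restricted` helpers, the term labels, and all dataset and source names are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """Hypothetical record for one audited dataset."""
    name: str
    license: str                                            # license stated on the dataset itself
    sources: list[str] = field(default_factory=list)        # where the raw content came from
    derived_from: list[str] = field(default_factory=list)   # parent datasets


def collect_terms(name, datasets, source_terms, _seen=None):
    """Gather every license/terms label reachable through a dataset's lineage."""
    seen = set() if _seen is None else _seen
    if name in seen:            # guard against cycles in the lineage graph
        return set()
    seen.add(name)

    ds = datasets[name]
    terms = {ds.license}
    for src in ds.sources:
        # Sources with no recorded terms are labeled "unspecified".
        terms.add(source_terms.get(src, "unspecified"))
    for parent in ds.derived_from:
        terms |= collect_terms(parent, datasets, source_terms, seen)
    return terms


def is_restricted(name, datasets, source_terms):
    """True if any dataset or source in the lineage carries a non-commercial term."""
    return "non-commercial" in collect_terms(name, datasets, source_terms)


# Illustrative data only: a permissively licensed dataset built on a parent
# whose underlying source has non-commercial terms.
DATASETS = {
    "base_corpus": Dataset("base_corpus", "permissive", sources=["example-video-site"]),
    "instruct_mix": Dataset("instruct_mix", "permissive", derived_from=["base_corpus"]),
}
SOURCE_TERMS = {"example-video-site": "non-commercial"}

print(is_restricted("instruct_mix", DATASETS, SOURCE_TERMS))
```

In this toy case the dataset's own license looks permissive, yet a restriction surfaces once the lineage is followed back to the source, which is the kind of mismatch between dataset licenses and source terms that the post highlights.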

-----

💡 Key Insights:

→ Web-crawled and social media data have dominated training sets since 2019.

→ While only 33% of datasets have restrictive licenses, 80% of source content has non-commercial restrictions.

→ No improvement in geographical/linguistic diversity since 2013, despite more languages being added.

→ North America/Europe account for 93% of text data and more than 60% of speech/video content.

-----

📊 Results:

→ Analyzed 3916 datasets totaling 2.1T tokens and 1.9M hours of content

→ Documented 798 unique sources across 83 domains

→ Mapped 659 creator organizations spanning 67 countries

→ Tracked 608 languages across 37 language families

