
"Toxicity of the Commons: Curating Open-Source Pre-Training Data"

A podcast on this paper was generated with Google's Illuminate.

Making historical training data safe for open-source LLMs without over-filtering

Custom toxicity classifier that understands context in old public domain texts

📚 https://arxiv.org/abs/2410.22587v1

🤖 Original Problem:

Current toxicity filters for LLMs work well for web text but perform poorly on historical public domain texts due to different language norms, OCR errors, and limited data availability.

-----

🛠️ Solution in this Paper:

→ Created ToxicCommons: a custom dataset of texts classified along 5 dimensions of toxicity (racial, gender-based, religious, and ability-based discrimination, plus violence)

→ Built Celadon: a 5-headed DeBERTa-v3-small classifier trained with a custom weighted loss function (see the sketch after this list)

→ Used human annotations as the baseline ground truth, then scaled up with Llama 3.1 to create 2M annotated samples

→ Implemented 4 severity levels (0-3) for each toxicity dimension
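
A minimal sketch of what a Celadon-style multi-head classifier could look like, assuming one linear head per toxicity dimension on top of the pooled DeBERTa-v3-small representation, each head predicting the four severity levels (0-3), with a weighted cross-entropy summed across heads. The class names, dimension labels, and weighting scheme below are illustrative assumptions, not the paper's released code.

```python
# Sketch of a 5-headed toxicity classifier (assumed structure, not the official Celadon code).
import torch
import torch.nn as nn
from transformers import AutoModel

DIMENSIONS = ["race_origin", "gender_sex", "religion", "ability", "violence"]
NUM_LEVELS = 4  # severity levels 0-3

class MultiHeadToxicityClassifier(nn.Module):
    def __init__(self, backbone_name="microsoft/deberta-v3-small"):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(backbone_name)
        hidden = self.backbone.config.hidden_size
        # One classification head per toxicity dimension.
        self.heads = nn.ModuleList(nn.Linear(hidden, NUM_LEVELS) for _ in DIMENSIONS)

    def forward(self, input_ids, attention_mask):
        out = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
        pooled = out.last_hidden_state[:, 0]  # first-token pooling (assumed)
        # Stack per-head logits into shape (batch, 5 dimensions, 4 levels).
        return torch.stack([head(pooled) for head in self.heads], dim=1)

def weighted_loss(logits, labels, class_weights):
    """Weighted cross-entropy summed over the five heads.

    logits: (batch, 5, 4); labels: (batch, 5) with severity levels 0-3;
    class_weights: (4,) tensor up-weighting rarer, more severe classes
    (the exact weighting in the paper may differ).
    """
    loss_fn = nn.CrossEntropyLoss(weight=class_weights)
    return sum(loss_fn(logits[:, i], labels[:, i]) for i in range(len(DIMENSIONS)))
```

At inference time the same model would be run over a passage and each head's argmax read off as that dimension's severity level, which is how the 0-3 scale per dimension can be reported for every sample.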

-----

💡 Key Insights:

→ Historical texts need different toxicity filtering approach than web content

→ OCR errors significantly impact toxicity classification accuracy

→ Public domain data is more limited, requiring a balanced filtering approach (see the filtering sketch after this list)

→ Violence detection shows highest sensitivity across all tested models
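
A minimal sketch of how per-dimension severity levels might drive a balanced filter that avoids discarding scarce public domain text. The thresholds, action names, and the filter_decision helper are illustrative assumptions, not the paper's exact curation pipeline.

```python
# Hypothetical curation policy driven by per-dimension severity levels (0-3).
from typing import Dict

def filter_decision(severities: Dict[str, int]) -> str:
    """Map the worst per-dimension severity to a curation action (assumed thresholds)."""
    worst = max(severities.values())
    if worst == 0:
        return "keep"            # no detected toxicity
    if worst == 1:
        return "keep_with_tag"   # mild: keep but tag for downstream handling
    if worst == 2:
        return "review"          # moderate: route to human review or rewriting
    return "remove"              # severe: drop from the pre-training corpus

# Example: a passage flagged only for mild violence is kept but tagged.
print(filter_decision({"race_origin": 0, "gender_sex": 0, "religion": 0,
                       "ability": 0, "violence": 1}))  # -> "keep_with_tag"
```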

-----

📊 Results:

→ Best performance on violence detection with 74% weighted accuracy

→ Consistent prediction patterns: when the model errs, its second choice tends to be one severity class below the true class

→ Balanced precision-recall trade-off across all 5 dimensions

→ Successfully scales to 2M samples while maintaining annotation quality
