Making historical training data safe for open-source LLMs without over-filtering
Custom toxicity classifier that understands context in old public domain texts
📚 https://arxiv.org/abs/2410.22587v1
🤖 Original Problem:
Current toxicity filters for LLMs work well on web text but perform poorly on historical public domain texts, because of older language norms, OCR errors, and the limited volume of available data.
-----
🛠️ Solution in this Paper:
→ Created ToxicCommons: a custom dataset with texts classified along 5 dimensions (racial, gender-based, religious, and ability-based discrimination, plus violence)
→ Built Celadon: a 5-headed DeBERTa-v3-small classifier with a custom weighted loss function (see the sketch after this list)
→ Used human annotations for baseline ground truth, then scaled up using Llama 3.1 to create 2M annotated samples
→ Implemented 4 severity levels (0-3) for each toxicity dimension
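To make the architecture concrete, here is a minimal sketch of what a 5-headed DeBERTa-v3-small classifier with a weighted loss could look like. This is not the paper's released code: the dimension names, class weights, and helper functions are illustrative assumptions.

```python
# Sketch: one shared DeBERTa-v3-small encoder, one 4-way severity head per dimension.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

DIMENSIONS = ["race_origin", "gender_sex", "religion", "ability", "violence"]  # assumed names
NUM_LEVELS = 4  # severity levels 0-3

class MultiHeadToxicityClassifier(nn.Module):
    def __init__(self, backbone="microsoft/deberta-v3-small"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(backbone)
        hidden = self.encoder.config.hidden_size
        # One linear head per toxicity dimension, each a 4-way severity classifier
        self.heads = nn.ModuleDict({d: nn.Linear(hidden, NUM_LEVELS) for d in DIMENSIONS})

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]  # first-token representation
        return {d: head(cls) for d, head in self.heads.items()}

# Weighted cross-entropy that up-weights the rarer, more severe classes so that
# level-0 (non-toxic) text does not dominate training. Weights are placeholders.
class_weights = torch.tensor([0.2, 1.0, 2.0, 4.0])
criterion = nn.CrossEntropyLoss(weight=class_weights)

def multi_head_loss(logits, labels):
    # Sum the per-dimension losses; logits and labels are dicts keyed by dimension
    return sum(criterion(logits[d], labels[d]) for d in DIMENSIONS)

# Usage: tokenize a passage and read off one severity level per dimension
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-small")
model = MultiHeadToxicityClassifier()
batch = tokenizer(["An example passage from a digitized 19th-century book."],
                  return_tensors="pt", truncation=True)
with torch.no_grad():
    scores = model(batch["input_ids"], batch["attention_mask"])
severity = {d: int(scores[d].argmax(dim=-1)) for d in DIMENSIONS}
```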
-----
💡 Key Insights:
→ Historical texts need a different toxicity-filtering approach than web content
→ OCR errors significantly impact toxicity classification accuracy
→ Public domain data is scarcer, so filtering must be balanced to avoid discarding too much of it (see the sketch after this list)
→ Violence detection shows the highest sensitivity across all tested models
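One way to picture such a balanced filter is a simple policy over the per-dimension severity scores: keep clean passages, route mild hits to review instead of discarding them, and drop only clearly severe text. The thresholds and function below are hypothetical, not the paper's published policy.

```python
# Hypothetical filtering policy over Celadon-style outputs: one severity level
# (0-3) per dimension. Thresholds are illustrative, not the paper's cutoffs.
def filter_decision(severity: dict) -> str:
    worst = max(severity.values())
    if worst == 0:
        return "keep"    # clean passages go straight into the training corpus
    if worst == 1:
        return "review"  # mild or context-dependent hits are kept for review, not auto-dropped
    return "drop"        # severity 2-3 in any dimension excludes the passage

print(filter_decision({"race_origin": 0, "gender_sex": 1, "religion": 0,
                       "ability": 0, "violence": 0}))  # -> "review"
```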
-----
📊 Results:
→ Best performance on violence detection with 74% weighted accuracy
→ Consistent error pattern: when the model misses, its second choice tends to be one severity level below the true class
→ Balanced precision-recall trade-off across all 5 dimensions
→ Successfully scales to 2M samples while maintaining annotation quality