No humans needed: GPT trains BERT to sort news across languages automatically
This paper introduces a teacher-student framework in which a GPT model acts as the teacher, automatically annotating news articles with topic labels that are then used to train smaller BERT-family student models. The approach eliminates manual data annotation while maintaining high accuracy across multiple languages.
-----
https://arxiv.org/abs/2411.19638
🤔 Original Problem:
→ News topic classification requires extensive manually annotated training data, which is especially hard to obtain for non-English languages
→ Using GPT models directly is computationally too expensive to process the millions of news articles published daily
-----
🔧 Solution in this Paper:
→ A GPT model acts as the teacher, automatically labeling news articles with IPTC Media Topic categories
→ The teacher annotates training data in Catalan, Croatian, Greek, and Slovenian
→ A smaller XLM-RoBERTa model is then trained as the student on this GPT-annotated dataset (see the sketch below)
→ The framework requires no human-labeled training data while maintaining high classification accuracy
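A minimal sketch of the teacher-student loop, assuming the OpenAI chat completions API for the teacher and Hugging Face transformers/datasets for the XLM-RoBERTa student. The prompt wording, the shortened topic list, and the training hyperparameters are illustrative placeholders, not the paper's exact setup.

```python
# Teacher-student sketch: GPT labels news articles, then a smaller
# XLM-RoBERTa student is fine-tuned on those GPT-assigned labels.
from openai import OpenAI
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

# Illustrative subset of topic categories (the paper uses the IPTC Media Topic scheme).
TOPICS = ["sport", "weather", "religion and belief", "politics"]

client = OpenAI()

def gpt_label(article_text: str) -> str:
    """Teacher step: ask GPT to pick one topic label for an article."""
    prompt = (
        "Classify the following news article into exactly one of these "
        f"topic categories: {', '.join(TOPICS)}.\n"
        "Answer with the category name only.\n\n" + article_text
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder teacher model
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().lower()

def train_student(articles: list[str]) -> Trainer:
    """Student step: fine-tune XLM-RoBERTa on the GPT-annotated data."""
    labels = [gpt_label(a) for a in articles]  # no human annotation involved
    label2id = {t: i for i, t in enumerate(TOPICS)}
    ds = Dataset.from_dict({
        "text": articles,
        # Fall back to the first topic if GPT answers outside the list.
        "label": [label2id.get(l, 0) for l in labels],
    })
    tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
    ds = ds.map(lambda b: tok(b["text"], truncation=True, max_length=512),
                batched=True)
    model = AutoModelForSequenceClassification.from_pretrained(
        "xlm-roberta-base", num_labels=len(TOPICS))
    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="student", num_train_epochs=3,
                               per_device_train_batch_size=16),
        train_dataset=ds,
        data_collator=DataCollatorWithPadding(tokenizer=tok),
    )
    trainer.train()
    return trainer
```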
-----
💡 Key Insights:
→ The GPT teacher's annotation quality matches human inter-annotator agreement levels
→ Student models achieve performance comparable to the teacher with just 15,000 training examples
→ Models show strong zero-shot cross-lingual transfer to languages absent from their training data (illustrated after this list)
→ Weather, sports, and religion topics are classified most accurately (F1 > 0.89)
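To make the zero-shot cross-lingual claim concrete, here is a hypothetical inference snippet that applies the fine-tuned multilingual student to a headline in a language outside its training data. The checkpoint path is a placeholder for wherever the training run above saved the model, and the German headline and expected output are made up for illustration.

```python
from transformers import pipeline

# Placeholder path: wherever the fine-tuned student checkpoint was saved.
classifier = pipeline("text-classification", model="student/checkpoint-final")

# German headline: not among the annotation languages, so this probes
# zero-shot cross-lingual transfer of the multilingual student.
headline = "Starker Schneefall legt den Flugverkehr in München lahm."
print(classifier(headline, truncation=True, max_length=512))
# Expected output shape: [{'label': 'weather', 'score': ...}]
```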
-----
📊 Results:
→ The student model achieves 0.734 micro-F1 and 0.746 macro-F1 (the two metrics are illustrated below)
→ The GPT teacher reaches 0.722 micro-F1 and 0.731 macro-F1
→ The student maintains consistent performance across all four test languages
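For readers unfamiliar with the two scores above, the snippet below (with toy labels, not the paper's data) shows how they differ: micro-F1 pools all predictions (equal to accuracy in single-label multiclass classification), while macro-F1 averages per-class F1 so that rare topics count as much as frequent ones.

```python
from sklearn.metrics import f1_score

# Toy gold and predicted topic labels, for illustration only.
y_true = ["sport", "weather", "sport", "religion", "politics", "sport"]
y_pred = ["sport", "weather", "sport", "politics", "politics", "weather"]

# Micro-F1 pools every prediction: 4/6 correct -> 0.667 (same as accuracy here).
print("micro-F1:", f1_score(y_true, y_pred, average="micro"))
# Macro-F1 averages per-class F1, so the missed 'religion' class drags it down.
print("macro-F1:", f1_score(y_true, y_pred, average="macro"))
```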