
"LLM Teacher-Student Framework for Text Classification With No Manually Annotated Data: A Case Study in IPTC News Topic Classification"

The podcast on this paper is generated with Google's Illuminate.

No humans needed: GPT trains BERT to sort news across languages automatically

This paper introduces a teacher-student framework that uses a GPT model as the teacher to automatically annotate news articles for topic classification and trains smaller BERT-like models as students on those labels. This eliminates manual data annotation while maintaining high accuracy across multiple languages.

-----

https://arxiv.org/abs/2411.19638

🤔 Original Problem:

→ News topic classification requires extensive manually annotated training data, which is especially hard to obtain for non-English languages

→ Using GPT models directly is too computationally expensive to process the millions of news articles published daily

-----

🔧 Solution in this Paper:

→ A GPT model acts as the teacher, automatically labeling news articles with IPTC Media Topic categories

→ The teacher model annotates training data in Catalan, Croatian, Greek, and Slovenian

→ A smaller XLM-RoBERTa model is trained as the student on this GPT-annotated dataset

→ The framework requires no human-labeled training data while maintaining high classification accuracy (a minimal pipeline sketch follows this list)
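
Below is a minimal sketch of that pipeline. It assumes the OpenAI chat completions API as the GPT teacher and Hugging Face transformers for the XLM-RoBERTa student; the prompt, the abbreviated topic list, the model names, and the hyperparameters are illustrative choices, not the paper's exact setup.

```python
# Minimal sketch of the teacher-student pipeline (assumptions, not the paper's code):
# the OpenAI chat completions API stands in for the GPT teacher, the topic list is
# an abbreviated placeholder, and model names/hyperparameters are illustrative.
from openai import OpenAI
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Abbreviated placeholder; the full IPTC Media Topic scheme has 17 top-level topics.
IPTC_TOPICS = ["sport", "weather", "religion and belief", "politics", "economy"]

client = OpenAI()

def teacher_label(article_text: str) -> str:
    """Ask the GPT teacher to assign one top-level IPTC Media Topic to an article."""
    prompt = (
        "Classify the following news article into exactly one IPTC Media Topic from "
        f"this list: {', '.join(IPTC_TOPICS)}. Answer with the topic name only.\n\n"
        f"Article:\n{article_text}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumption: any capable GPT chat model as the teacher
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

def train_student(articles: list[str]) -> None:
    """Fine-tune an XLM-RoBERTa student on the GPT-annotated articles."""
    labels = [teacher_label(text) for text in articles]  # teacher annotation step
    label2id = {name: i for i, name in enumerate(sorted(set(labels)))}

    tok = AutoTokenizer.from_pretrained("xlm-roberta-large")
    ds = Dataset.from_dict({"text": articles, "label": [label2id[l] for l in labels]})
    ds = ds.map(lambda batch: tok(batch["text"], truncation=True, max_length=512),
                batched=True)

    model = AutoModelForSequenceClassification.from_pretrained(
        "xlm-roberta-large", num_labels=len(label2id)
    )
    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="student", num_train_epochs=3,
                               per_device_train_batch_size=16),
        train_dataset=ds,
        tokenizer=tok,  # enables dynamic padding in the data collator
    )
    trainer.train()
    trainer.save_model("student")  # checkpoint reused in the inference sketch below
```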

-----

💡 Key Insights:

→ The GPT teacher's annotation quality matches human inter-annotator agreement levels

→ Student models achieve performance comparable to the teacher's with just 15,000 training examples

→ Models show strong zero-shot cross-lingual capabilities even without target-language training data (see the inference sketch after this list)

→ Weather, sports, and religion topics are classified most accurately (F1 > 0.89)
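
Because the student is a multilingual encoder, it can be applied unchanged to articles in languages that never appeared in the GPT-annotated training data. A hedged sketch of such zero-shot inference, reusing the hypothetical "student" checkpoint saved in the training sketch above:

```python
# Zero-shot cross-lingual inference: the fine-tuned student is applied unchanged to
# an article in a language not covered by its GPT-annotated training data. The
# "student" checkpoint path and the German example sentence are purely illustrative.
from transformers import pipeline

classifier = pipeline("text-classification", model="student", tokenizer="student")
print(classifier("Ein schweres Unwetter mit Hagel zog gestern über Bayern hinweg."))
# Output shape: [{"label": "...", "score": ...}]
```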

-----

📊 Results:

→ The student model achieves 0.734 micro-F1 and 0.746 macro-F1 (both averaging schemes are illustrated below)

→ The GPT teacher scores 0.722 micro-F1 and 0.731 macro-F1

→ The student maintains consistent performance across all four test languages
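
For context on the two averages: micro-F1 weights every article equally, while macro-F1 averages per-topic F1 so rare topics count as much as frequent ones. A tiny scikit-learn illustration with made-up labels:

```python
# Toy illustration of micro- vs. macro-averaged F1 with scikit-learn;
# the labels and predictions below are made up for demonstration only.
from sklearn.metrics import f1_score

y_true = ["sport", "sport", "weather", "religion", "politics", "sport"]
y_pred = ["sport", "weather", "weather", "religion", "sport", "sport"]

# Micro-F1 pools all decisions, so frequent topics dominate the score.
print(f1_score(y_true, y_pred, average="micro", zero_division=0))  # ≈ 0.67
# Macro-F1 averages per-topic F1, so rare topics count as much as common ones.
print(f1_score(y_true, y_pred, average="macro", zero_division=0))  # ≈ 0.58
```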