Large Vision-Language Models for Knowledge-Grounded Data Annotation of Memes

The accompanying podcast was generated with Google's Illuminate.

Rohan Paul

Jan 29, 2025

A new dataset and an automated pipeline simplify large-scale meme analysis.

This paper introduces ClassicMemes-50-templates (CM50), a dataset of 33,172 memes, and an automated annotation pipeline that uses LLMs for meme analysis and retrieval.

-----

Paper - https://arxiv.org/abs/2501.13851

Original Problem 🤔:

→ Existing meme research lacks methods for deeper meme comprehension and retrieval.

→ Current datasets are limited in scope or size, hindering large-scale analysis.

-----

Solution in this Paper 💡:

→ This study introduces CM50, a dataset of 33,172 memes based on 50 popular templates.

→ It proposes an automated annotation pipeline leveraging LLMs, specifically GPT-4o.

→ This pipeline generates image captions, meme captions, and literary device labels, simplifying data annotation and analysis.

→ A meme-text retrieval CLIP model (mtrCLIP) is proposed to enhance meme analysis through cross-modal embedding.
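
To make the cross-modal embedding idea concrete, here is a minimal sketch of text-to-meme retrieval by cosine similarity. It is not the paper's mtrCLIP: the three-dimensional vectors, the template ids, and the helper names `cosine` and `rank_memes` are all illustrative assumptions. In a real setup the vectors would come from the image and text encoders of a CLIP-style model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_memes(text_vec, meme_vecs):
    """Return meme ids sorted from most to least similar to the text vector."""
    scores = {meme_id: cosine(text_vec, vec) for meme_id, vec in meme_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy embeddings (assumed values, standing in for encoder outputs).
meme_vecs = {
    "drake": [0.9, 0.1, 0.0],
    "distracted_boyfriend": [0.1, 0.9, 0.2],
    "two_buttons": [0.0, 0.2, 0.9],
}
query_vec = [0.2, 0.8, 0.1]  # assumed embedding of a text query

ranking = rank_memes(query_vec, meme_vecs)
print(ranking[0])
```

Meme-to-text retrieval works the same way with the roles swapped: embed the meme, then rank candidate texts by similarity.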

-----

Key Insights from this Paper 💎:

→ LLMs like GPT-4o can automate meme annotation with near-human accuracy for captions and embedded text.

→ Template context improves LLM performance in labeling literary devices.

→ Fine-tuning CLIP for meme-text retrieval improves retrieval accuracy by up to 11.2%.

-----

Results 📊:

→ mtrCLIP achieves Recall@1 of 0.760 for meme-to-text and 0.780 for text-to-meme on MemeCap.

→ The automated pipeline with GPT-4o achieves a BLEURT score of 0.525 on MemeCap for meme captions, comparable to human-level performance.

→ A macro F1-score of 0.39 is achieved for literary device labeling on FigMemes, leaving room for improvement.
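
For readers unfamiliar with the metrics above, the sketch below shows how Recall@K and macro F1 are typically computed. The queries, rankings, and labels are toy values invented for illustration, not data from the paper, and the macro F1 here assumes one label per meme, which may be simpler than the actual benchmark setup. The function names `recall_at_k` and `macro_f1` are likewise assumptions.

```python
def recall_at_k(rankings, gold, k=1):
    """Fraction of queries whose correct item appears in the top-k results."""
    hits = sum(1 for query, ranked in rankings.items() if gold[query] in ranked[:k])
    return hits / len(rankings)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (single-label simplification)."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy retrieval results: each query maps to a ranked list of candidate ids.
rankings = {
    "q1": ["m1", "m2"],
    "q2": ["m3", "m2"],
    "q3": ["m1", "m3"],
    "q4": ["m4", "m1"],
}
gold = {"q1": "m1", "q2": "m2", "q3": "m3", "q4": "m4"}

# Toy literary-device labels.
y_true = ["irony", "irony", "metaphor", "none"]
y_pred = ["irony", "metaphor", "metaphor", "none"]

print(recall_at_k(rankings, gold, k=1))
print(macro_f1(y_true, y_pred))
```

Because macro F1 averages per-class scores without weighting by class frequency, rare literary devices count as much as common ones, which is one reason the score can be low even when overall accuracy looks reasonable.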
