A scalable way to create massive training data for multimodal image retrieval by pairing correlated open-domain images.
MegaPairs introduces a data synthesis method that creates high-quality training data for multimodal retrieval by leveraging vision-language models and open-domain images.
-----
https://arxiv.org/abs/2412.14475
🤔 Original Problem:
→ Multimodal retrieval systems suffer from severe data scarcity, limiting their effectiveness across diverse tasks and domains
→ Existing datasets are either too small, lack diversity, or are held privately by research labs
-----
🔧 Solution in this Paper:
→ MegaPairs constructs heterogeneous KNN triplets from open-domain images using three similarity models (sketched in code after this list)
→ A CLIP vision encoder captures visual-semantic correlations between images
→ A DINO vision encoder identifies visual-pattern similarities
→ A CLIP text encoder measures caption correlations between image pairs
→ A two-stage annotation pipeline uses MLLMs to generate detailed descriptions of image relationships
→ LLMs then refine these descriptions into retrieval instructions
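The pair-mining step can be illustrated as follows. This is a minimal sketch, not the authors' code: it assumes image and caption embeddings have already been computed with the three models, and the helper names `knn` and `mine_pairs` are hypothetical. The point is that each similarity model surfaces a different kind of neighbor, so pooling their KNN results yields more heterogeneous image pairs than any single model alone.

```python
import numpy as np

def knn(query_emb, corpus_emb, k):
    """Return indices of the k nearest neighbors by cosine similarity."""
    q = query_emb / np.linalg.norm(query_emb)
    c = corpus_emb / np.linalg.norm(corpus_emb, axis=1, keepdims=True)
    sims = c @ q
    return np.argsort(-sims)[:k]

def mine_pairs(clip_img_emb, dino_emb, clip_txt_emb, query_idx, k=5):
    """Pair a query image with neighbors found by three different similarity
    models; the union is a more diverse pair set than any single model gives."""
    pairs = set()
    for emb in (clip_img_emb, dino_emb, clip_txt_emb):
        for j in knn(emb[query_idx], emb, k + 1):
            if j != query_idx:                # skip the query itself
                pairs.add((query_idx, int(j)))
    return pairs

# Toy usage: random arrays stand in for real model outputs.
rng = np.random.default_rng(0)
n, d = 100, 64
clip_img = rng.normal(size=(n, d))   # CLIP vision-encoder embeddings
dino     = rng.normal(size=(n, d))   # DINO vision-encoder embeddings
clip_txt = rng.normal(size=(n, d))   # CLIP text-encoder embeddings of captions
print(mine_pairs(clip_img, dino, clip_txt, query_idx=0))
```

Each mined pair is then passed through the two-stage annotation pipeline: an MLLM describes how the two images relate, and an LLM rewrites that description into a retrieval instruction, giving the final (query image, instruction, target image) triplet.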
-----
💡 Key Insights:
→ Using multiple similarity models creates more diverse and meaningful image pairs
→ Two-stage annotation ensures high-quality instruction generation
→ Hard negative sampling significantly improves model performance (see the sketch below)
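A rough illustration of hard negative sampling, under the assumption that negatives are drawn from the query's own embedding neighborhood (the post does not spell out the exact strategy); `sample_hard_negatives` is a hypothetical helper, not the paper's implementation.

```python
import numpy as np

def sample_hard_negatives(emb, query_idx, positive_idx, k=20, num_neg=4, seed=0):
    """Draw negatives that are close to the query in embedding space but are not
    the annotated target, forcing the retriever to learn fine-grained distinctions."""
    e = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = e @ e[query_idx]
    neighbors = np.argsort(-sims)[:k]
    candidates = [int(j) for j in neighbors if j not in (query_idx, positive_idx)]
    rng = np.random.default_rng(seed)
    return rng.choice(candidates, size=min(num_neg, len(candidates)), replace=False)

# Toy usage with random embeddings standing in for real ones.
emb = np.random.default_rng(1).normal(size=(100, 64))
print(sample_hard_negatives(emb, query_idx=0, positive_idx=3))
```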
-----
📊 Results:
→ Generated 26M high-quality training instances
→ Models trained on just 500K MegaPairs samples outperformed a baseline trained on 36.7M samples
→ Achieved state-of-the-art performance on 4 composed image retrieval benchmarks
→ Set new records across the 36 datasets of the MMEB benchmark