Maya advances vision-language models by enabling safe, culturally aware content generation across 8 languages through a toxicity-filtered dataset and a multilingual model architecture.
-----
https://arxiv.org/abs/2412.07112
🌍 Original Problem:
Vision-language models excel mainly in English, creating accessibility gaps for speakers of low-resource languages. Existing training datasets contain toxic content and lack cultural diversity, limiting cross-linguistic capability.
-----
🛠️ Solution in this Paper:
→ Maya builds on the LLaVA architecture, using the Aya-23 8B model for language processing and SigLIP for vision encoding.
→ Introduces a new multilingual dataset covering 8 languages, built through a translation pipeline with Aya 35B (a translation sketch follows this list).
→ Implements comprehensive toxicity filtering using LLaVAGuard and Toxic-BERT, removing 7,531 harmful samples.
→ Uses a 2-layer MLP projection with GELU activation to align image features with the language model's embedding space (see the projector sketch below).
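Here is a minimal sketch of one step of such a translation pipeline, assuming the Aya 35B checkpoint is the public "CohereForAI/aya-23-35B" model on Hugging Face and that translation is driven by a plain chat prompt; the paper's exact prompts and post-processing are not reproduced here.

```python
# Hedged sketch of a single translation step; model ID and prompt wording
# are assumptions, not the paper's exact pipeline.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "CohereForAI/aya-23-35B"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

def translate(text: str, target_language: str) -> str:
    """Translate one English caption/instruction into a target language."""
    messages = [{
        "role": "user",
        "content": f"Translate the following text to {target_language}. "
                   f"Return only the translation.\n\n{text}",
    }]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(inputs, max_new_tokens=256, do_sample=False)
    # Decode only the newly generated tokens (drop the prompt).
    return tokenizer.decode(output[0, inputs.shape[-1]:], skip_special_tokens=True)

# Example: translate("A dog jumps over a fence.", "Hindi")
```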
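And a minimal sketch of the 2-layer GELU projector that maps SigLIP image features into the language model's embedding space; the dimensions (SigLIP's 1152, an assumed 4096 LLM hidden size) are illustrative, not the paper's exact configuration.

```python
# Hedged sketch of the cross-modal projector: a 2-layer MLP with GELU.
# Dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class MultimodalProjector(nn.Module):
    def __init__(self, vision_dim: int = 1152, text_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_patches, vision_dim) from the vision encoder
        return self.proj(image_features)  # -> (batch, num_patches, text_dim)

# Example: project dummy SigLIP patch features into the LLM embedding space.
projector = MultimodalProjector()
tokens = projector(torch.randn(2, 729, 1152))  # -> shape (2, 729, 4096)
```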
-----
🔍 Key Insights:
→ Translation quality significantly improves with optimized prompting techniques
→ Toxicity filtering doesn't compromise model performance (a minimal filtering sketch follows this list)
→ SigLIP outperforms CLIP for multilingual vision tasks
→ Cultural context preservation requires targeted dataset curation
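A minimal sketch of the text-side toxicity check referenced above, assuming Toxic-BERT is the public "unitary/toxic-bert" checkpoint and using a simple score threshold; the image-side LLaVAGuard filtering is omitted.

```python
# Hedged sketch of text toxicity filtering; checkpoint and threshold are
# assumptions, and label semantics may differ from the paper's setup.
from transformers import pipeline

toxicity = pipeline("text-classification", model="unitary/toxic-bert")

def is_toxic(text: str, threshold: float = 0.5) -> bool:
    """Flag a sample if any toxicity label's score crosses the threshold."""
    scores = toxicity(text, truncation=True, top_k=None)
    return any(s["score"] >= threshold for s in scores)

# Keep only samples that pass the filter.
samples = ["A cat sleeping on a windowsill.", "Some other caption."]
clean = [s for s in samples if not is_toxic(s)]
```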
-----
📊 Results:
→ Achieves 60.4% average accuracy across 8 languages
→ Outperforms LLaVA-7B in 5 languages
→ Shows 34.98% accuracy on the VizWiz benchmark
→ Maintains performance parity with 13B models despite its smaller size