"Maya: An Instruction Finetuned Multilingual Multimodal Model"

The podcast on this paper was generated with Google's Illuminate.

Maya extends vision-language models to 8 languages, combining a toxicity-filtered dataset with a multilingual model architecture to enable safe, culturally aware content generation.

-----

https://arxiv.org/abs/2412.07112

🌍 Original Problem:

Vision-language models excel mainly in English, leaving accessibility gaps for low-resource languages. Existing training datasets also contain toxic content and lack cultural diversity, limiting cross-lingual capability.

-----

🛠️ Solution in this Paper:

→ Maya builds on the LLaVA architecture, using the Aya-23 8B model as its language backbone and SigLIP as its vision encoder.

→ Introduces a new multilingual dataset covering 8 languages, created through a translation pipeline built on Aya 35B.

→ Implements comprehensive toxicity filtering using LLaVAGuard and Toxic-BERT, removing 7,531 harmful samples.

→ Uses a 2-layer MLP projector with GELU activation to align vision features with the language model's embedding space (sketched below).
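
Since the projector is the one component described in concrete terms above, here is a minimal PyTorch sketch of a LLaVA-style 2-layer MLP projector with GELU. The feature widths (1152 for SigLIP patch features, 4096 for the Aya-23 8B embedding space), the class name, and the patch count in the demo are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of the 2-layer MLP projector with GELU described above.
# Assumed, illustrative widths: ~1152 for SigLIP patch features,
# ~4096 for the Aya-23 8B embedding space.
import torch
import torch.nn as nn


class VisionLanguageProjector(nn.Module):
    """Maps vision-encoder patch features into the LLM embedding space."""

    def __init__(self, vision_dim: int = 1152, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # vision_feats: (batch, num_patches, vision_dim) from SigLIP
        # returns:      (batch, num_patches, llm_dim) pseudo-tokens for the LLM
        return self.proj(vision_feats)


if __name__ == "__main__":
    # Demo with a dummy batch: 2 images, 729 patches each (arbitrary count).
    projector = VisionLanguageProjector()
    feats = torch.randn(2, 729, 1152)
    print(projector(feats).shape)  # torch.Size([2, 729, 4096])
```

This mirrors the standard LLaVA-1.5 "mlp2x_gelu" projector design; swapping CLIP for SigLIP only changes the input width.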

-----

🔍 Key Insights:

→ Translation quality improves significantly with carefully optimized prompts

→ Toxicity filtering does not compromise model performance (a rough sketch of the filtering step follows this list)

→ SigLIP outperforms CLIP for multilingual vision tasks

→ Cultural context preservation requires targeted dataset curation
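
Of these, the toxicity-filtering step is the most mechanical, so a rough sketch of its text side is given below. It assumes the publicly available unitary/toxic-bert checkpoint via the Hugging Face pipeline, an illustrative 0.8 threshold, and a hypothetical sample schema; the LLaVAGuard pass over images that the paper also applies is not shown.

```python
# Rough sketch of the text side of the toxicity filter, assuming the public
# unitary/toxic-bert checkpoint. The 0.8 threshold and the sample schema are
# illustrative; the paper's LLaVAGuard pass over images is not shown.
from transformers import pipeline

# Toxic-BERT is multi-label, so a sigmoid yields an independent probability
# per category (toxic, obscene, insult, ...).
toxicity = pipeline("text-classification", model="unitary/toxic-bert")


def is_toxic(text: str, threshold: float = 0.8) -> bool:
    """Flag text if any toxicity category scores above the threshold."""
    scores = toxicity(text, top_k=None, function_to_apply="sigmoid")
    return any(s["score"] >= threshold for s in scores)


def filter_samples(samples: list[dict]) -> list[dict]:
    """Keep only image-text samples whose caption is not flagged as toxic."""
    return [s for s in samples if not is_toxic(s["caption"])]


if __name__ == "__main__":
    data = [
        {"image": "001.jpg", "caption": "A child feeding pigeons in a park."},
        {"image": "002.jpg", "caption": "You are worthless and everyone hates you."},
    ]
    print(len(filter_samples(data)), "of", len(data), "samples kept")
```

Scoring each category independently (sigmoid rather than softmax) matters here, because a single caption can fall into several toxicity categories at once.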

-----

📊 Results:

→ Achieves 60.4% average accuracy across 8 languages

→ Outperforms LLaVA-7B in 5 languages

→ Shows 34.98% accuracy on the VizWiz benchmark

→ Maintains performance parity with 13B-parameter models despite its smaller size
