
"Retaining and Enhancing Pre-trained Knowledge in Vision-Language Models with Prompt Ensembling"

A podcast on this paper was generated with Google's Illuminate.

Smart prompt grouping helps vision models learn new domains without catastrophic forgetting.

Group-wise Prompt Ensemble enhances vision-language models by integrating domain knowledge while preserving zero-shot capabilities through strategic prompt grouping and ensemble learning.

-----

https://arxiv.org/abs/2412.07077

🤔 Original Problem:

Vision-language models like CLIP struggle to maintain zero-shot capabilities when fine-tuned on specialized datasets, leading to performance drops when adapting to specific domains.

-----

🔧 Solution in this Paper:

→ Introduces Group-wise Prompt Ensemble (GPE) with masked attention to optimize adaptability while protecting zero-shot capabilities.

→ Implements auxiliary prompts to integrate new domain insights without disrupting the original model's representations.

→ Uses an ensemble learning strategy that combines original and new knowledge by promoting diversity among prompts.

→ Employs covariance regularization to ensure each prompt contributes unique information.
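The two core mechanisms above can be sketched in PyTorch. This is a hedged illustration, not the paper's code: the function names, shapes, and group layout are assumptions. The masked attention keeps each prompt group (and the original text tokens) isolated from other groups, which is how the pre-trained zero-shot path is protected; the covariance penalty discourages redundant prompts.

```python
import torch


def group_attention_mask(n_tokens: int, n_groups: int,
                         tokens_per_group: int) -> torch.Tensor:
    """Boolean attention mask (True = blocked).

    Original text tokens attend only to each other, preserving the
    pre-trained zero-shot forward path; each prompt group attends to
    itself and the text tokens, but never to other groups.
    """
    total = n_tokens + n_groups * tokens_per_group
    mask = torch.zeros(total, total, dtype=torch.bool)
    for g in range(n_groups):
        start = n_tokens + g * tokens_per_group
        end = start + tokens_per_group
        # block this group's attention to all prompt tokens...
        mask[start:end, n_tokens:] = True
        # ...then re-allow attention within the group itself
        mask[start:end, start:end] = False
    # text tokens never attend to any prompt token
    mask[:n_tokens, n_tokens:] = True
    return mask


def covariance_regularizer(prompts: torch.Tensor) -> torch.Tensor:
    """Penalize off-diagonal covariance between prompt embeddings so each
    prompt contributes unique information. prompts: (num_prompts, dim)."""
    p = prompts - prompts.mean(dim=0, keepdim=True)
    cov = (p.T @ p) / (p.shape[0] - 1)
    off_diag = cov - torch.diag(torch.diag(cov))
    return (off_diag ** 2).sum() / prompts.shape[1]
```

A mask like this can be passed as `attn_mask` to a standard transformer attention layer, so the grouping needs no architectural change to the text encoder.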

-----

🎯 Key Insights:

→ Prompt grouping with masked attention effectively preserves pre-trained knowledge

→ Auxiliary prompts enhance model adaptation without compromising original capabilities

→ Group-wise ensemble outperforms pair-wise training because it yields more classifiers to combine

→ Special tokens play a crucial role in improving zero-shot generalization
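At inference, the group-wise ensemble amounts to averaging the CLIP-style similarity logits produced by each prompt group's text classifier. A minimal sketch, assuming per-group class text features are already encoded (the function name and temperature value are illustrative):

```python
import torch


def ensemble_logits(image_feat: torch.Tensor,
                    group_text_feats: list,
                    temperature: float = 0.01) -> torch.Tensor:
    """Average cosine-similarity logits across prompt groups.

    image_feat: (dim,) image embedding.
    group_text_feats: one (num_classes, dim) tensor per prompt group.
    Returns (num_classes,) ensembled logits.
    """
    img = image_feat / image_feat.norm()
    logits = []
    for txt in group_text_feats:
        t = txt / txt.norm(dim=-1, keepdim=True)
        logits.append(img @ t.T / temperature)
    return torch.stack(logits).mean(dim=0)
```

With more groups there are more such classifiers to average, which is one reading of why the group-wise ensemble beats pair-wise training.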

-----

📊 Results:

→ Outperforms zero-shot CLIP by 1.7% in novel-class accuracy

→ Achieves 79.24% harmonic mean across 11 datasets

→ Maintains near zero-shot performance even after fine-tuning on specific domains
