Smart prompt grouping helps vision-language models learn new domains without forgetting their zero-shot knowledge.
Group-wise Prompt Ensemble enhances vision-language models by integrating domain knowledge while preserving zero-shot capabilities through strategic prompt grouping and ensemble learning.
-----
https://arxiv.org/abs/2412.07077
🤔 Original Problem:
Vision-language models like CLIP struggle to retain their zero-shot capabilities when fine-tuned on specialized datasets, so adapting them to a specific domain degrades their performance elsewhere.
-----
🔧 Solution in this Paper:
→ Introduces Group-wise Prompt Ensemble (GPE), which uses masked attention to improve adaptability while protecting zero-shot capabilities.
→ Adds auxiliary prompts to integrate new domain knowledge without disrupting the original model's representations.
→ Uses an ensemble learning strategy that combines original and new knowledge by promoting diversity among prompts.
→ Employs covariance regularization so that each prompt contributes unique information (see the sketch after this list).
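The sketch below is a minimal PyTorch illustration, not the authors' code: it shows one plausible reading of the two mechanisms above, with prompt tokens organized into groups, an attention mask that keeps the original token stream from attending to the new prompts, and a covariance penalty that pushes prompts toward non-redundant information. Names such as `GroupedPrompts`, `num_groups`, `prompt_len`, and `covariance_diversity_loss` are illustrative assumptions.

```python
# Minimal sketch (assumed setup, not the paper's implementation).
import torch
import torch.nn as nn

class GroupedPrompts(nn.Module):
    """Learnable prompt tokens organized into groups, plus an attention mask
    that stops the original (frozen) token stream from attending to them."""
    def __init__(self, num_groups=4, prompt_len=4, dim=512):
        super().__init__()
        self.num_groups = num_groups
        self.prompt_len = prompt_len
        # One block of learnable context vectors per group.
        self.prompts = nn.Parameter(0.02 * torch.randn(num_groups, prompt_len, dim))

    def build_mask(self, seq_len):
        """Boolean mask (True = blocked) for a sequence laid out as
        [all prompt groups | original tokens]. Original tokens may only attend
        to other original tokens, so their representations stay close to the
        pre-trained zero-shot ones; prompts may attend to everything."""
        p = self.num_groups * self.prompt_len
        total = p + seq_len
        mask = torch.zeros(total, total, dtype=torch.bool)
        mask[p:, :p] = True  # original tokens cannot see the new prompts
        return mask

def covariance_diversity_loss(prompts):
    """Penalize off-diagonal covariance between prompt vectors so each
    prompt carries information the others do not (diversity term)."""
    flat = prompts.reshape(-1, prompts.shape[-1])       # (groups * length, dim)
    flat = flat - flat.mean(dim=0, keepdim=True)
    cov = (flat.T @ flat) / (flat.shape[0] - 1)          # (dim, dim)
    off_diag = cov - torch.diag(torch.diag(cov))
    return (off_diag ** 2).sum() / prompts.shape[-1]

# Usage: combine the task loss with the diversity regularizer.
grouped = GroupedPrompts()
attn_mask = grouped.build_mask(seq_len=77)               # CLIP-style context length
reg = covariance_diversity_loss(grouped.prompts)
# total_loss = task_loss + lambda_reg * reg
```

Blocking only the original-tokens-to-prompts direction is what lets the pre-trained representation stay intact while the new prompts can still read from it.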
-----
🎯 Key Insights:
→ Prompt grouping with masked attention effectively preserves pre-trained knowledge
→ Auxiliary prompts improve adaptation without compromising the original capabilities
→ The group-wise ensemble outperforms pair-wise training because it yields more classifiers to ensemble (see the sketch below)
→ Special tokens play a crucial role in improving zero-shot generalization
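To make the third insight concrete, here is a small sketch (an assumed setup, not the paper's implementation) of group-wise ensembling at inference: each prompt group produces its own set of class text embeddings, and the cosine-similarity logits from all groups are averaged, so more groups means more classifiers in the ensemble.

```python
# Hedged sketch of group-wise ensembling at inference time.
import torch
import torch.nn.functional as F

@torch.no_grad()
def ensemble_logits(image_features, class_text_features_per_group, temperature=0.01):
    """image_features: (B, D); class_text_features_per_group: list of (C, D) tensors,
    one per prompt group. Returns cosine-similarity logits averaged over groups."""
    img = F.normalize(image_features, dim=-1)
    logits = []
    for text_feats in class_text_features_per_group:
        txt = F.normalize(text_feats, dim=-1)
        logits.append(img @ txt.T / temperature)  # (B, C) for this group
    return torch.stack(logits).mean(dim=0)        # group-wise ensemble

# Usage with dummy tensors: 3 prompt groups, 10 classes, 512-dim features.
img = torch.randn(4, 512)
groups = [torch.randn(10, 512) for _ in range(3)]
pred = ensemble_logits(img, groups).argmax(dim=-1)
```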
-----
📊 Results:
→ Outperforms zero-shot CLIP by 1.7% in novel-class accuracy
→ Achieves a 79.24% harmonic mean across 11 datasets
→ Maintains close-to-zero-shot performance even after fine-tuning on specific domains
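For context, in base-to-novel prompt-learning benchmarks the reported harmonic mean is typically taken over base-class and novel-class accuracy; the numbers below are illustrative, not from the paper.

```python
# Worked example of the harmonic-mean metric (illustrative inputs only).
def harmonic_mean(base_acc: float, novel_acc: float) -> float:
    return 2 * base_acc * novel_acc / (base_acc + novel_acc)

print(round(harmonic_mean(82.0, 76.7), 2))  # -> 79.26, roughly the scale of the reported 79.24
```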