This paper teaches vision-language models to pick their own optimal prompt lengths.
A method that automatically determines optimal context lengths for continuous prompts in vision-language models, improving accuracy by removing the limitations of manual prompt design.
-----
https://arxiv.org/abs/2501.00457
Original Problem 🔍:
→ Current prompt learning methods use a fixed context length across all layers, which must be designed and tuned by hand. This limits adaptability and performance, especially on datasets with large distribution shifts from the pre-training data.
-----
Solution in this Paper 🛠️:
→ DPL (Differentiable Prompt Learning) formulates prompt learning as a bilevel optimization problem that automatically finds the optimal context lengths.
→ It uses differentiable parameters (alphas) to control the contribution of each candidate prompt length.
→ The method searches over the context lengths {0, 2, 4, 6} independently for each layer.
→ A cross-attention mechanism lets prompts of different lengths be mixed during the search.
→ The final configuration keeps, for each layer, the context length with the highest alpha value (see the sketch below).
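
The per-layer search can be pictured as a small DARTS-style mixing module. The sketch below is illustrative only, not the authors' implementation: the class name `LayerPromptSearch`, the 512-wide prompt embeddings, and the use of `nn.MultiheadAttention` as a stand-in for the paper's cross-attention are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

CANDIDATE_LENGTHS = [0, 2, 4, 6]   # context lengths searched per layer
DIM = 512                          # assumed prompt embedding width

class LayerPromptSearch(nn.Module):
    """One learnable prompt per candidate length, plus a differentiable alpha per option."""
    def __init__(self, dim=DIM, mixed_len=max(CANDIDATE_LENGTHS)):
        super().__init__()
        # A continuous prompt for every candidate length (the length-0 option contributes nothing).
        self.prompts = nn.ParameterList(
            [nn.Parameter(0.02 * torch.randn(max(L, 1), dim)) for L in CANDIDATE_LENGTHS]
        )
        # Architecture weights: a softmax over alpha decides each option's contribution.
        self.alpha = nn.Parameter(torch.zeros(len(CANDIDATE_LENGTHS)))
        # Stand-in for the paper's cross-attention: align every option to a common length.
        self.mix_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.query = nn.Parameter(0.02 * torch.randn(mixed_len, dim))

    def forward(self):
        weights = F.softmax(self.alpha, dim=0)     # differentiable "choice" over lengths
        mixed = torch.zeros_like(self.query)
        for w, length, prompt in zip(weights, CANDIDATE_LENGTHS, self.prompts):
            if length == 0:
                continue                           # the zero-length option injects no prompt
            aligned, _ = self.mix_attn(
                self.query.unsqueeze(0),           # shared query of the common length
                prompt[:length].unsqueeze(0),      # this option's prompt as keys/values
                prompt[:length].unsqueeze(0),
            )
            mixed = mixed + w * aligned.squeeze(0)
        return mixed                               # prompt tokens injected into this layer

    def chosen_length(self):
        # After the search, keep the length whose alpha is largest for this layer.
        return CANDIDATE_LENGTHS[int(self.alpha.argmax())]
```

In the bilevel setup, the prompt parameters would be updated on the training split while the alphas are updated on a held-out split, roughly as below (the layer count, optimizers, and learning rates are placeholders, not values from the paper):

```python
layers = nn.ModuleList(LayerPromptSearch() for _ in range(12))  # e.g. one module per layer
alphas = [m.alpha for m in layers]
weights = [p for n, p in layers.named_parameters() if not n.endswith("alpha")]

opt_w = torch.optim.SGD(weights, lr=2e-3)    # inner level: learn the prompts on training data
opt_a = torch.optim.Adam(alphas, lr=2e-2)    # outer level: learn the alphas on validation data
```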
-----
Key Insights from this Paper 💡:
→ Dataset-dependent prompt configurations perform better than fixed designs
→ The text branch shows higher confidence in prompt selection than the image branch
→ Larger distribution shifts require more varied context lengths across layers
→ DPL needs only 0.022% of the CLIP model's parameters while improving performance
-----
Results 📊:
→ Improves average test accuracy by 2.60% across 11 datasets compared to baselines
→ Uses 0.028M learnable parameters vs. 3.56M for the MaPLe baseline
→ Shows the largest gains on Aircraft (+12.14%) and EuroSAT (+7.27%)