"Differentiable Prompt Learning for Vision Language Models"

The podcast below on this paper was generated with Google's Illuminate.

This paper teaches vision-language models to pick their own optimal prompt lengths.

A method that automatically determines optimal context lengths for continuous prompts in vision-language models, improving accuracy by removing the limitations of manual prompt design.

-----

https://arxiv.org/abs/2501.00457

Original Problem 🔍:

→ Current prompt learning methods use fixed context lengths across all layers, requiring manual design and tuning. This limits adaptability and performance, especially for datasets with large distribution shifts from pre-training data.

-----

Solution in this Paper 🛠️:

→ DPL formulates prompt learning as a bilevel optimization problem to automatically find optimal context lengths.

→ It uses differentiable parameters (alphas) to control the contribution of different prompt options.

→ The method searches across different context lengths {0,2,4,6} for each layer independently.

→ A cross-attention mechanism allows mixing prompts of different lengths during the search.

→ The final configuration is determined by the highest alpha value per layer (see the sketch below).
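
Below is a minimal PyTorch sketch of the core idea as described above: each layer holds learnable prompts for every candidate context length plus a learnable alpha per candidate, the forward pass mixes the candidates with softmax(alpha) weights, and after the search the length with the highest alpha is kept. The class name `PromptLengthSearcher`, the zero-padding used to mix candidates, and the alternating update loop are my own simplifications; the paper fuses candidates with cross-attention and solves a proper bilevel problem.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptLengthSearcher(nn.Module):
    """Differentiable search over candidate prompt lengths for one layer.

    Candidates of length 0, 2, 4, 6 are mixed with softmax(alpha) weights.
    (Simplified: candidates are zero-padded to the longest length instead of
    being fused with cross-attention as in the paper.)
    """

    def __init__(self, dim, candidate_lengths=(0, 2, 4, 6)):
        super().__init__()
        self.candidate_lengths = candidate_lengths
        self.max_len = max(candidate_lengths)
        # one learnable continuous prompt per candidate length
        self.prompts = nn.ParameterList(
            [nn.Parameter(torch.randn(L, dim) * 0.02) for L in candidate_lengths]
        )
        # architecture weights (alphas), one per candidate length
        self.alpha = nn.Parameter(torch.zeros(len(candidate_lengths)))

    def forward(self):
        weights = F.softmax(self.alpha, dim=0)               # mixing coefficients
        mixed = 0.0
        for w, p, L in zip(weights, self.prompts, self.candidate_lengths):
            padded = F.pad(p, (0, 0, 0, self.max_len - L))   # pad tokens to max_len
            mixed = mixed + w * padded
        return mixed                                          # (max_len, dim) mixed prompt

    def best_length(self):
        # after the search, keep the length with the highest alpha
        return self.candidate_lengths[int(self.alpha.argmax())]


# toy alternating loop standing in for the bilevel optimization:
# prompts are updated on the training loss, alphas on the validation loss
searchers = nn.ModuleList([PromptLengthSearcher(dim=512) for _ in range(12)])  # 12 layers
alpha_params = [s.alpha for s in searchers]
prompt_params = [p for s in searchers for p in s.prompts]
opt_alpha = torch.optim.Adam(alpha_params, lr=1e-3)
opt_prompt = torch.optim.SGD(prompt_params, lr=1e-2)

def dummy_loss():
    # stand-in for the CLIP contrastive loss with the mixed prompts injected per layer
    return sum(s().pow(2).mean() for s in searchers)

for step in range(100):
    opt_prompt.zero_grad(); dummy_loss().backward(); opt_prompt.step()  # inner step (train split)
    opt_alpha.zero_grad();  dummy_loss().backward(); opt_alpha.step()   # outer step (val split)

print([s.best_length() for s in searchers])  # searched context length per layer
```

In the actual method the inner and outer updates use separate train/validation splits and the candidates are combined via cross-attention rather than padding; the sketch only illustrates the softmax relaxation over lengths and the final argmax selection per layer.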

-----

Key Insights from this Paper 💡:

→ Dataset-dependent prompt configurations perform better than fixed designs

→ Text branch shows higher confidence in prompt selection than image branch

→ Larger distribution shifts require more varied context lengths across layers

→ DPL needs only 0.022% of CLIP model parameters while improving performance

-----

Results 📊:

→ Improves average test accuracy by 2.60% across 11 datasets compared to baselines

→ Uses 0.028M parameters vs 3.56M in MaPLe (baseline)

→ Shows largest gains on EuroSAT (+7.27%) and Aircraft (+12.14%) datasets
