Efficient Visual In-Context Learning is possible with task-level prompt strategies.
This paper shows that a single task-level prompt works well across most test samples, sharply reducing the cost of prompt search.
-----
https://arxiv.org/abs/2501.08841
Original Problem: 🤔
→ Finding optimal prompts for each test sample in Visual In-Context Learning is computationally expensive.
→ Existing demonstration-selection strategies for constructing prompts, such as rule-guided and reward-model-based approaches, are either too simplistic or require extensive training data, driving up cost and risking overfitting.
→ The core issue is determining which demonstrations to use for constructing prompts efficiently.
-----
Solution in this Paper: 💡
→ The paper introduces task-level prompting to reduce the cost of searching for prompts during the inference stage.
→ Two time-saving, training-free, reward-based strategies for task-level prompt search are proposed: Top-K and Greedy search.
→ Top-K assumes the optimal prompt is built from demonstrations that perform well on their own. It scores each demonstration individually and picks the K best to construct the final prompt.
→ Greedy search builds the prompt step by step, at each step adding the demonstration that gives the updated prompt the best performance. Both strategies are sketched in the code below.
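A minimal Python sketch of the two search strategies, assuming a prompt is simply a list of demonstrations and that a hypothetical evaluate_prompt function returns a scalar reward (e.g., average accuracy on a small validation set). The names and signatures here are illustrative assumptions, not the paper's actual implementation:

```python
# Sketch of task-level prompt search: Top-K and Greedy strategies.
# Assumption (not from the paper's code): evaluate_prompt(prompt) returns a
# scalar reward for a prompt, measured on a small validation set.

from typing import Callable, List, Sequence


def top_k_search(
    demos: Sequence,                            # candidate demonstrations
    evaluate_prompt: Callable[[List], float],   # reward of a prompt (hypothetical)
    k: int,
) -> List:
    """Score each demonstration on its own, keep the K best."""
    scored = [(evaluate_prompt([d]), d) for d in demos]   # one evaluation per demo
    scored.sort(key=lambda pair: pair[0], reverse=True)   # higher reward is better
    return [d for _, d in scored[:k]]


def greedy_search(
    demos: Sequence,
    evaluate_prompt: Callable[[List], float],
    k: int,
) -> List:
    """Grow the prompt one demonstration at a time, taking the locally best choice."""
    prompt: List = []
    remaining = list(demos)
    for _ in range(min(k, len(remaining))):
        best_demo, best_score = None, float("-inf")
        for d in remaining:
            score = evaluate_prompt(prompt + [d])   # try extending the prompt with d
            if score > best_score:
                best_demo, best_score = d, score
        prompt.append(best_demo)
        remaining.remove(best_demo)
    return prompt
```

Both searches run once per task; the resulting prompt is then reused unchanged for every test sample, which is where the reported savings in search time come from.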
-----
Key Insights from this Paper: 🧐
→ Most test samples achieve optimal performance under the same prompts, contrary to the assumption that different samples require different prompts.
→ Searching for sample-level prompts is unnecessary and computationally wasteful.
→ Task-level prompting can achieve comparable or better performance than sample-level methods while significantly reducing computational costs and avoiding the risk of overfitting.
-----
Results: 📈
→ The proposed methods identify near-optimal prompts and achieve the best Visual In-Context Learning performance.
→ They achieve optimal results in the detection and segmentation tasks, and the globally optimal solution in the coloring task.
→ Prompt search time is reduced by over 98% compared to state-of-the-art methods, with consistent relative improvements of over 6.2% across downstream tasks.