An LLM acts as a prompt engineer, boosting vision models' performance without touching their parameters.
Vision models level up when LLMs pick their conversation starters
LLM-guided prompt optimization boosts Vision-Language Model (VLM) accuracy without parameter updates or gradient-based learning.
📚 https://arxiv.org/abs/2410.06154
Solution in this Paper 🧠:
• GLOV: Uses LLMs to generate optimized prompts for VLMs
• A meta-prompt queries the LLM with the task description and ranked in-context examples of earlier prompts (see the sketch after this list)
• Embedding-space guidance steers LLM generation by adding an offset vector to an intermediate layer
• Applies to dual-encoder (CLIP) and encoder-decoder (LLaVA) VLMs
• Uses 1-shot labeled data for prompt evaluation
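A minimal sketch of that optimization loop, under stated assumptions: `llm_propose_prompt` and `clip_one_shot_accuracy` are hypothetical stubs standing in for a real LLM call and a CLIP dual-encoder scored on the 1-shot labeled split; only the rank-and-feedback structure mirrors the paper.

```python
import random

# Placeholder stubs (assumptions, not the paper's code): swap in a real LLM call
# and a real dual-encoder (CLIP) accuracy measurement on the 1-shot labeled data.
def llm_propose_prompt(meta_prompt: str) -> str:
    templates = ["A photo of a {}.", "A detailed photo of the {}.", "An image showing a {}."]
    return random.choice(templates)

def clip_one_shot_accuracy(prompt_template: str) -> float:
    return random.random()  # replace with CLIP zero-shot accuracy using this template

def build_meta_prompt(task_description: str, ranked: list[tuple[str, float]]) -> str:
    # Ranked in-context examples: worst to best previous prompts, so the LLM
    # can see which phrasings scored higher and extrapolate.
    examples = "\n".join(f"{p}  (accuracy: {a:.2%})" for p, a in ranked)
    return (f"{task_description}\n"
            f"Previous prompts, ranked from worst to best:\n{examples}\n"
            f"Write a new prompt template that should score even higher:")

def glov_optimize(task_description: str, iterations: int = 10, keep_top: int = 5):
    history: list[tuple[str, float]] = []
    for _ in range(iterations):
        ranked = sorted(history, key=lambda x: x[1])[-keep_top:]  # best prompts last
        candidate = llm_propose_prompt(build_meta_prompt(task_description, ranked))
        history.append((candidate, clip_one_shot_accuracy(candidate)))
    return max(history, key=lambda x: x[1])  # highest-scoring prompt found

best_prompt, best_acc = glov_optimize("Classify images into ImageNet categories.")
print(best_prompt, best_acc)
```

No gradients flow anywhere in this loop: the only feedback signal is the ranked list of scored prompts fed back into the next meta-prompt.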
-----
Key Insights from this Paper 💡:
• LLMs can optimize VLM prompts without gradient-based learning
• Embedding-space guidance enhances prompt generation (see the sketch after this list)
• Effective for both classification and open-ended VQA tasks
• Generalizes across diverse datasets and VLM architectures
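A minimal sketch of the embedding-space guidance idea, with assumptions: `TinyLM`, `layer_idx`, and `alpha` are illustrative, and the offset here is random; in GLOV the offset direction is derived from embeddings of well- versus poorly-performing prompts and added to an intermediate LLM layer during generation.

```python
import torch
import torch.nn as nn

# Toy stand-in for an LLM: with a real model the hook would sit on an
# intermediate transformer block instead of a Linear layer.
class TinyLM(nn.Module):
    def __init__(self, dim: int = 64, n_layers: int = 4):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_layers)])

    def forward(self, x):
        for layer in self.layers:
            x = torch.relu(layer(x))
        return x

dim, layer_idx = 64, 2
model = TinyLM(dim)

# Offset vector (random here; in the paper it points from "bad prompt" toward
# "good prompt" embeddings), plus an assumed steering-strength hyperparameter.
offset = torch.randn(dim)
offset = offset / offset.norm()
alpha = 2.0

def add_offset(module, inputs, output):
    # Shift the intermediate activations so generation drifts toward the
    # "good prompt" direction; no model weights are updated.
    return output + alpha * offset

handle = model.layers[layer_idx].register_forward_hook(add_offset)
steered = model(torch.randn(1, dim))
handle.remove()
```

The steering is purely inference-time: removing the hook restores the original model, which is why the method needs no parameter updates.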
-----
Results 📊:
• Dual-encoder models: Up to 15.0% improvement (3.81% average)
• Encoder-decoder models: Up to 57.5% improvement (21.6% average)
• Outperforms manual prompts on ImageNet by 1.2%
• Consistent improvements across 16 diverse datasets