
GLOV: Guided Large Language Models as Implicit Optimizers for Vision Language Models

This podcast was generated with Google's Illuminate.

An LLM acts as a prompt engineer to boost vision-language models' performance without touching their parameters.

Vision models level up when LLMs pick their conversation starters

LLM-guided prompt optimization boosts Vision-Language Model (VLM) accuracy without parameter updates or gradient-based learning.

📚 https://arxiv.org/abs/2410.06154

Solution in this Paper 🧠:

• GLOV: Uses LLMs to generate optimized prompts for VLMs

• Meta-prompt queries the LLM with a task description and in-context prompt examples ranked by their measured fitness

• Embedding-space guidance steers LLM generation by adding an offset vector to an intermediate layer

• Applies to dual-encoder (CLIP) and encoder-decoder (LLaVA) VLMs

• Evaluates candidate prompts with 1-shot labeled data (a loop along these lines is sketched below)
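
Not the paper's code release, but a minimal sketch of how such an optimization loop could look, assuming OpenAI's CLIP as the dual-encoder VLM; `ask_llm`, the 1-shot data, the class names, and the seed template are hypothetical placeholders:

```python
import torch
import clip  # OpenAI CLIP package; any dual-encoder VLM can play this role

device = "cuda" if torch.cuda.is_available() else "cpu"
vlm, preprocess = clip.load("ViT-B/32", device=device)

def score_prompt(template, images, labels, class_names):
    """1-shot fitness: zero-shot accuracy of a prompt template on a tiny labeled set."""
    texts = clip.tokenize([template.format(c) for c in class_names]).to(device)
    with torch.no_grad():
        tfeat = vlm.encode_text(texts)
        tfeat = tfeat / tfeat.norm(dim=-1, keepdim=True)
        ifeat = vlm.encode_image(images.to(device))  # images already run through `preprocess`
        ifeat = ifeat / ifeat.norm(dim=-1, keepdim=True)
        preds = (ifeat @ tfeat.T).argmax(dim=-1)
    return (preds == labels.to(device)).float().mean().item()

def build_meta_prompt(task, history):
    """Task description plus previously tried prompts, ranked worst-to-best, as in-context examples."""
    ranked = sorted(history, key=lambda p: p[1])
    lines = [f"accuracy {acc:.2f}: {tpl}" for tpl, acc in ranked]
    return ("Task: " + task + "\nPrevious prompt templates (worst to best):\n"
            + "\n".join(lines)
            + "\nWrite one better prompt template containing '{}':")

# Outer loop (ask_llm, images_1shot, labels_1shot, class_names are placeholders):
# history = [("a photo of a {}.", 0.60)]
# for _ in range(10):
#     candidate = ask_llm(build_meta_prompt("Classify bird species.", history))
#     history.append((candidate, score_prompt(candidate, images_1shot, labels_1shot, class_names)))
# best_template = max(history, key=lambda p: p[1])[0]
```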

-----

Key Insights from this Paper 💡:

• LLMs can optimize VLM prompts without gradient-based learning

• Embedding-space guidance enhances prompt generation (see the steering sketch after this list)

• Effective for both classification and open-ended VQA tasks

• Generalizes across diverse datasets and VLM architectures
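
To illustrate the embedding-space guidance idea (again, not the authors' implementation): a sketch that adds a steering offset to one decoder layer of a HuggingFace causal LLM while it generates a prompt. The model name, layer index, scaling factor, and the "good prompt minus bad prompt" offset construction are all assumptions made for illustration:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # assumed LLM backbone
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

LAYER, ALPHA = 15, 2.0  # layer index and scaling factor are illustrative choices

def hidden_mean(text):
    """Mean hidden state of `text` at the chosen decoder layer."""
    ids = tok(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[LAYER].mean(dim=1).squeeze(0)

# Offset points from a low-scoring toward a high-scoring prompt found earlier.
offset = hidden_mean("a photo of a {}.") - hidden_mean("an image.")

def steer_hook(module, inputs, output):
    # Decoder layers usually return a tuple whose first element is the hidden states;
    # add the scaled offset to every position before passing it on.
    hs = output[0] if isinstance(output, tuple) else output
    hs = hs + ALPHA * offset.to(hs.dtype)
    return (hs,) + output[1:] if isinstance(output, tuple) else hs

handle = model.model.layers[LAYER].register_forward_hook(steer_hook)
query = "Write one short, descriptive prompt template for classifying bird photos:"
ids = tok(query, return_tensors="pt").to(model.device)
gen = model.generate(**ids, max_new_tokens=30, do_sample=True)
handle.remove()
print(tok.decode(gen[0][ids["input_ids"].shape[1]:], skip_special_tokens=True))
```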

-----

Results 📊:

• Dual-encoder models: Up to 15.0% improvement (3.81% average)

• Encoder-decoder models: Up to 57.5% improvement (21.6% average)

• Outperforms manual prompts on ImageNet by 1.2%

• Consistent improvements across 16 diverse datasets
