
"Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension"

The podcast on this paper was generated with Google's Illuminate.

Teaching vision models to look before they leap through smart inference-time decisions.

This paper introduces the Vision Value Model (VisVM), a novel approach to improving the response quality of Vision Language Models (VLMs) at inference time without additional training data.

VisVM predicts the long-term value of each candidate sentence to guide decoding toward better visual comprehension and fewer hallucinations, enabling self-improving VLMs through inference-time search.

-----

https://arxiv.org/abs/2412.03704

🔍 Original Problem:

VLMs still struggle with visual hallucinations and often miss less salient image details. Existing remedies typically require expensive additional training data or larger models.

-----

🛠️ Solution in this Paper:

→ VisVM guides VLM inference-time search by predicting the long-term value of each generated sentence (see the sketch after this list)

→ The value model is trained with Temporal Difference (TD) learning, so it anticipates future sentence quality rather than only the immediate reward

→ VisVM steers away from responses prone to hallucinations by considering potential future consequences

→ The approach leverages CLIP's text-image similarity metric as an effective reward signal
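
Below is a minimal sketch of how such value-guided, sentence-level decoding could look. It is an illustration under stated assumptions, not the authors' released code: the VLM sampler, CLIP scorer, and value network are toy stand-ins, and names such as `VisionValueModel`, `clip_reward`, and `visvm_guided_decode` are hypothetical.

```python
# Hypothetical sketch of VisVM-style value-guided, sentence-level decoding.
# The VLM, CLIP scorer, and value network below are toy stand-ins.
import random
from typing import List

GAMMA = 0.9  # discount factor for temporal-difference targets (assumed value)

def clip_reward(image, sentence: str) -> float:
    """Stand-in for the CLIP text-image similarity used as the step reward."""
    random.seed(hash((id(image), sentence)) % (2**32))
    return random.random()

def vlm_sample_sentences(image, prefix: str, n: int) -> List[str]:
    """Stand-in for sampling n candidate next sentences from the VLM."""
    return [f"candidate sentence {i} continuing '{prefix[-20:]}'" for i in range(n)]

class VisionValueModel:
    """Toy value model: scores the long-term (discounted) value of appending a
    candidate sentence. A real VisVM would be a learned network trained with
    temporal-difference targets  V(s_t) ~ r_t + gamma * V(s_{t+1})."""
    def predict(self, image, prefix: str, candidate: str) -> float:
        # Combine the immediate CLIP reward with a crude proxy for future value;
        # a trained model would regress TD targets instead of this heuristic.
        return clip_reward(image, candidate) + GAMMA * 0.5

def visvm_guided_decode(image, prompt: str, value_model: VisionValueModel,
                        num_candidates: int = 4, max_sentences: int = 5) -> str:
    """Greedy sentence-level search: at each step, sample several candidate
    sentences and keep the one with the highest predicted long-term value."""
    response = prompt
    for _ in range(max_sentences):
        candidates = vlm_sample_sentences(image, response, num_candidates)
        best = max(candidates,
                   key=lambda c: value_model.predict(image, response, c))
        response += " " + best
    return response

if __name__ == "__main__":
    fake_image = object()  # placeholder for an actual image tensor
    print(visvm_guided_decode(fake_image, "Describe the image:", VisionValueModel()))
```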

-----

💡 Key Insights:

→ Inference-time computation scaling can improve VLM responses without additional training

→ Long-term value prediction outperforms immediate reward models

→ Self-training with VisVM-guided captions enhances VLM performance (see the sketch below)
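
A rough sketch of that self-training loop, reusing the hypothetical `visvm_guided_decode` from the sketch above; `finetune_vlm` and the data format are assumptions rather than the paper's exact recipe.

```python
# Hypothetical self-training loop: generate higher-quality captions with
# VisVM-guided search, then fine-tune the same VLM on them (SFT).
def self_train(vlm, value_model, images, finetune_vlm):
    sft_data = []
    for image in images:
        caption = visvm_guided_decode(image, "Describe the image:", value_model)
        sft_data.append({"image": image, "caption": caption})
    # Supervised fine-tuning on the searched captions yields the improved VLM.
    return finetune_vlm(vlm, sft_data)
```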

-----

📊 Results:

→ VisVM-guided captions were preferred over greedy decoding 74% of the time

→ Achieved a 10.8% average improvement across 8 benchmarks

→ Significantly reduced hallucinations, lowering the CHAIRs score from 32.4 to 26.2
