Teaching vision-language models to look before they leap through smarter inference-time decisions.
This paper introduces the Vision Value Model (VisVM), a novel approach that improves the response quality of Vision Language Models (VLMs) at inference time without additional training data.
VisVM predicts the long-term value of each candidate sentence to guide better visual comprehension and reduce hallucinations, enabling self-improving VLMs through better inference-time search.
-----
https://arxiv.org/abs/2412.03704
🔍 Original Problem:
VLMs still struggle with visual hallucinations and often overlook less salient image details. Existing remedies typically require expensive additional training data or larger models.
-----
🛠️ Solution in this Paper:
→ VisVM guides VLM inference-time search by predicting a long-term value for each candidate sentence
→ The value model is trained with Temporal Difference (TD) learning, so each score anticipates future sentence quality rather than only the immediate reward (see the sketch after this list)
→ By accounting for these future consequences, VisVM steers generation away from responses prone to hallucination
→ CLIP's image-text similarity serves as the reward signal
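A minimal sketch of the TD-learning idea described above, not the paper's code: `clip_reward` stands in for CLIP image-text similarity over precomputed features, the feature dimensions, hidden sizes, discount `GAMMA`, and learning setup are all illustrative assumptions.

```python
# Sketch: train a value head so V(s_t) matches r_t + gamma * V(s_{t+1}),
# where r_t is a CLIP-style image-text similarity reward for sentence t.
import torch
import torch.nn as nn

GAMMA = 0.9  # discount on future sentence values (illustrative choice)

class ValueHead(nn.Module):
    """Maps (image, sentence) features to a scalar long-term value."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.mlp(feats).squeeze(-1)

def clip_reward(image_feat: torch.Tensor, sent_feats: torch.Tensor) -> torch.Tensor:
    """Stand-in for CLIP similarity: cosine similarity of precomputed features."""
    return torch.cosine_similarity(image_feat, sent_feats, dim=-1)

def td_loss(value_head: ValueHead, image_feat: torch.Tensor, sent_feats: torch.Tensor):
    """TD(0) regression: each sentence's value bootstraps from the next sentence."""
    values = value_head(sent_feats)                # V(s_t) for every sentence in the response
    rewards = clip_reward(image_feat, sent_feats)  # immediate reward r_t
    with torch.no_grad():
        # V(s_{t+1}) for each step; zero after the final sentence
        next_values = torch.cat([values[1:], values.new_zeros(1)]).detach()
    targets = rewards + GAMMA * next_values
    return nn.functional.mse_loss(values, targets)

# Toy usage with random features standing in for real CLIP/VLM encodings.
torch.manual_seed(0)
head = ValueHead()
img_feat = torch.randn(512)
sent_feats = torch.randn(5, 512)   # 5 sentences of one sampled response
loss = td_loss(head, img_feat, sent_feats)
loss.backward()
```

The bootstrapped target `r_t + GAMMA * V(s_{t+1})` is what lets the value of an early sentence reflect the quality of the sentences it tends to lead to, rather than only its own CLIP reward.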
-----
💡 Key Insights:
→ Scaling inference-time computation can improve VLM responses without additional training
→ Predicting long-term value outperforms immediate-reward models
→ Self-training on VisVM-guided captions further improves the VLM (see the sketch after this list)
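A hedged sketch of how value-guided, sentence-level search could look at inference time; `sample_next_sentences`, `encode`, and `value_head` are hypothetical stand-ins for the VLM sampler, the feature extractor, and a trained VisVM, and the candidate count and stopping rule are illustrative.

```python
# Sketch: at each step, propose several next sentences with the VLM,
# keep the one with the highest predicted long-term value, and continue.
from typing import Callable, List
import torch

def visvm_guided_decode(
    image_feat: torch.Tensor,
    sample_next_sentences: Callable[[List[str], int], List[str]],  # VLM proposals (assumed)
    encode: Callable[[torch.Tensor, str], torch.Tensor],           # (image, sentence) -> features (assumed)
    value_head: torch.nn.Module,                                   # trained VisVM head (assumed)
    num_candidates: int = 4,
    max_sentences: int = 8,
) -> str:
    response: List[str] = []
    for _ in range(max_sentences):
        candidates = sample_next_sentences(response, num_candidates)
        if not candidates:
            break
        with torch.no_grad():
            scores = [value_head(encode(image_feat, c)) for c in candidates]
        best = candidates[int(torch.stack(scores).argmax())]
        response.append(best)
        if best.endswith("</s>"):   # illustrative end-of-response marker
            break
    return " ".join(response)
```

The captions selected this way can then be fed back to the same VLM as supervised fine-tuning data, which is the self-training loop referenced in the last insight.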
-----
📊 Results:
→ VisVM-guided captions were preferred over greedy decoding 74% of the time
→ Achieved a 10.8% average improvement across 8 benchmarks
→ Significantly reduced hallucinations, lowering the CHAIRs score from 32.4 to 26.2