Large Vision-Language Models (LVLMs) can see better by boosting their vision-sensitive attention heads.
This paper introduces a method to identify and reduce hallucinations in LVLMs by analyzing how sensitive individual attention heads are to visual information.
-----
https://arxiv.org/abs/2412.13949
🔍 Original Problem:
LVLMs often generate text that doesn't accurately reflect visual content, leading to reliability issues. Current solutions focus on symptoms rather than root causes.
-----
🛠️ Solution in this Paper:
→ Introduces Vision-aware Head Divergence (VHD), a metric that measures how much each attention head's output changes when the image context is removed (see the sketch after this list)
→ Aggregates per-head VHD scores into a token-level score (Token-VHD, or T-VHD) to gauge whether the model is relying on visual evidence or on language priors
→ Proposes Vision-aware Head Reinforcement (VHR), which amplifies the contribution of vision-sensitive attention heads during generation
→ Applies the reinforcement layer by layer to keep the attention mechanisms consistent across the model
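To make the idea concrete, here is a minimal, hypothetical PyTorch sketch of VHD scoring and VHR-style reinforcement. The tensor shapes, the L2 distance as the divergence measure, the top-k head selection, and the scaling factor `alpha` are all illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch of the VHD/VHR idea (not the authors' code). Assumes we already
# have, for one decoding step, each attention head's output computed twice:
# once with the image tokens in context and once with them removed.
import torch

def vision_aware_head_divergence(heads_with_img: torch.Tensor,
                                 heads_without_img: torch.Tensor) -> torch.Tensor:
    """Per-head divergence between outputs computed with and without the image.

    Both inputs are assumed to have shape (num_layers, num_heads, head_dim);
    L2 distance stands in for the paper's divergence measure.
    """
    return (heads_with_img - heads_without_img).norm(dim=-1)  # (layers, heads)

def top_vision_sensitive_heads(vhd: torch.Tensor, k: int = 8) -> torch.Tensor:
    """Indices of the k most vision-sensitive heads in each layer."""
    return vhd.topk(k, dim=-1).indices  # (layers, k)

def reinforce_heads(head_outputs: torch.Tensor, head_idx: torch.Tensor,
                    alpha: float = 1.5) -> torch.Tensor:
    """VHR-style step: scale the selected heads' outputs, layer by layer."""
    out = head_outputs.clone()
    for layer in range(out.shape[0]):
        out[layer, head_idx[layer]] *= alpha  # boost vision-sensitive heads
    return out

# Toy usage with random tensors standing in for real head outputs.
num_layers, num_heads, head_dim = 32, 32, 128
with_img = torch.randn(num_layers, num_heads, head_dim)
without_img = torch.randn(num_layers, num_heads, head_dim)
vhd = vision_aware_head_divergence(with_img, without_img)
reinforced = reinforce_heads(with_img, top_vision_sensitive_heads(vhd))
```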
-----
💡 Key Insights:
→ Only a few attention heads show significant sensitivity to visual information
→ Words associated with hallucinations tend to have lower T-VHD scores (illustrated in the snippet below)
→ Language bias in training data significantly influences hallucination behavior
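As a hedged illustration of the T-VHD insight, building on the sketch above: aggregate the per-head VHD map into a single token-level score and flag low-scoring tokens as likely driven by language priors rather than the image. The mean aggregation and the threshold value are assumptions for illustration, not the paper's exact recipe.

```python
# Continues the sketch above; mean aggregation and the threshold are assumptions.
import torch

def token_vhd(per_head_vhd: torch.Tensor) -> float:
    """Collapse a (num_layers, num_heads) VHD map into one score for a token."""
    return per_head_vhd.mean().item()

def flag_weakly_grounded_tokens(tokens, vhd_maps, threshold: float = 0.1):
    """Tokens whose T-VHD is low, i.e. candidates for language-prior hallucination."""
    return [tok for tok, m in zip(tokens, vhd_maps) if token_vhd(m) < threshold]

# Toy usage: two generated tokens with made-up VHD maps.
tokens = ["dog", "frisbee"]
vhd_maps = [torch.rand(32, 32), torch.rand(32, 32) * 0.05]
print(flag_weakly_grounded_tokens(tokens, vhd_maps))  # likely ['frisbee']
```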
-----
📊 Results:
→ Reduced CHAIR_S by 16.36 and CHAIR_I by 4.61 on LLaVA-1.5
→ Achieved 64% prediction accuracy on the test set
→ Maintained high efficiency, with negligible additional inference time