
"Cracking the Code of Hallucination in LVLMs with Vision-aware Head Divergence"

A podcast on this paper, generated with Google's Illuminate, is below.

Large Vision-Language Models (LVLMs) can see better when their vision-sensitive attention heads are boosted.

This paper introduces a method to identify and reduce hallucinations in LVLMs by analyzing how sensitive each attention head is to visual information.

-----

https://arxiv.org/abs/2412.13949

🔍 Original Problem:

LVLMs often generate text that doesn't accurately reflect visual content, leading to reliability issues. Current solutions focus on symptoms rather than root causes.

-----

🛠️ Solution in this Paper:

→ Introduces Vision-aware Head Divergence (VHD), a metric measuring how much each attention head's output changes when the image context is removed (see the first sketch after this list)

→ Aggregates per-head VHD scores into a token-level score, Token-VHD (T-VHD), to evaluate whether the model is relying on visual evidence or on language priors

→ Proposes Vision-aware Head Reinforcement (VHR), which amplifies vision-sensitive attention heads during generation (see the second sketch after this list)

→ Applies the reinforcement layer by layer to keep the attention mechanisms consistent across the model
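
To make the two metrics concrete, here is a minimal PyTorch sketch of the idea behind VHD and Token-VHD. The tensor layout, the L2-based divergence, and the top-k mean aggregation are illustrative assumptions, not the paper's exact definitions.

```python
import torch

def vision_aware_head_divergence(head_out_with_image, head_out_text_only):
    """VHD sketch: per-head divergence between attention-head outputs computed
    with and without the image in the context.

    Both inputs: [num_layers, num_heads, seq_len, head_dim], aligned on the
    shared text positions.
    """
    diff = head_out_with_image - head_out_text_only
    # One score per head: mean L2 distance over token positions.
    return diff.norm(dim=-1).mean(dim=-1)          # [num_layers, num_heads]

def token_vhd(vhd_scores, top_k=20):
    """Token-VHD sketch: aggregate the largest per-head scores into one scalar
    reflecting how strongly the current token depends on visual evidence."""
    flat = vhd_scores.flatten()
    return flat.topk(min(top_k, flat.numel())).values.mean()

# Toy usage with random tensors standing in for real head outputs.
L, H, T, D = 32, 32, 16, 128
with_img  = torch.randn(L, H, T, D)
text_only = torch.randn(L, H, T, D)
scores = vision_aware_head_divergence(with_img, text_only)
print(scores.shape, token_vhd(scores))
```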
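
And a minimal sketch of the reinforcement step (VHR) for a single layer; the paper applies this layer by layer during decoding. The `top_k` and `alpha` knobs and the simple scaling of head outputs are assumptions for illustration, not the paper's exact procedure.

```python
import torch

def reinforce_vision_heads(head_outputs, vhd_scores, top_k=10, alpha=1.5):
    """VHR sketch: scale up the contribution of the most vision-sensitive
    heads in a single layer, before the attention output projection.

    head_outputs : [batch, num_heads, seq_len, head_dim] for one layer
    vhd_scores   : [num_heads] VHD scores for the same layer
    """
    reinforced = head_outputs.clone()
    top_heads = vhd_scores.topk(min(top_k, vhd_scores.numel())).indices
    # Amplify only the selected heads; the rest are left unchanged, which
    # keeps the layer's overall behaviour close to the original model.
    reinforced[:, top_heads] = alpha * reinforced[:, top_heads]
    return reinforced

# Toy usage: one layer with 32 heads.
B, H, T, D = 1, 32, 16, 128
heads = torch.randn(B, H, T, D)
layer_vhd = torch.rand(H)
out = reinforce_vision_heads(heads, layer_vhd, top_k=8, alpha=1.2)
```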

-----

💡 Key Insights:

→ Only a few attention heads show significant sensitivity to visual information

→ Words associated with hallucinations tend to have lower T-VHD scores, i.e., they rely more on language priors than on the image

→ Language bias in training data significantly influences hallucination behavior

-----

📊 Results:

→ Reduced CHAIR_S by 16.36 and CHAIR_I by 4.61 on LLaVA-1.5 (CHAIR is sketched after this list)

→ Achieved 64% prediction accuracy on test set

→ Maintained high efficiency, adding negligible inference-time overhead
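
For context on the CHAIR numbers above, here is a simplified sketch of how the sentence-level (CHAIR_S) and instance-level (CHAIR_I) hallucination rates are computed; the official CHAIR evaluation additionally maps synonyms onto MSCOCO object categories.

```python
def chair(mentioned_objects, ground_truth_objects):
    """CHAIR sketch: caption hallucination metrics.

    mentioned_objects    : list of sets, objects mentioned per generated caption
    ground_truth_objects : list of sets, objects actually present per image
    Returns (CHAIR_S, CHAIR_I) as percentages.
    """
    hallucinated_captions = 0
    hallucinated_mentions = 0
    total_mentions = 0
    for mentioned, truth in zip(mentioned_objects, ground_truth_objects):
        fake = mentioned - truth                  # objects not in the image
        hallucinated_captions += bool(fake)       # caption hallucinates at all?
        hallucinated_mentions += len(fake)        # how many hallucinated objects
        total_mentions += len(mentioned)
    chair_s = 100.0 * hallucinated_captions / max(len(mentioned_objects), 1)
    chair_i = 100.0 * hallucinated_mentions / max(total_mentions, 1)
    return chair_s, chair_i

# Toy usage: two captions, the first hallucinates a "dog".
print(chair([{"cat", "dog"}, {"car"}], [{"cat"}, {"car", "person"}]))
```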
