This paper introduces Interleaved-modal Chain-of-Thought (ICoT), enhancing Vision Language Models by combining visual and textual reasoning steps for more precise visual understanding.
https://arxiv.org/abs/2411.19488
🤔 Original Problem:
→ Current Chain-of-Thought prompting for vision-language tasks relies on text-only rationales, which makes it hard to express precise associations with specific regions of the input image
→ As a result, existing multimodal CoT methods struggle with fine-grained visual reasoning because their intermediate steps remain purely textual
🔍 Solution in this Paper:
→ Introduces ICoT, which generates sequential reasoning steps with paired visual and textual rationales
→ Proposes Attention-driven Selection (ADS) to implement ICoT by selecting informative regions from the input image
→ ADS analyzes the model's attention maps to identify the most relevant image patches, without requiring any changes to model parameters
→ The selected patches are inserted into the reasoning sequence to guide the generation of the next textual rationale (a minimal sketch follows this list)
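Below is a minimal sketch of the attention-driven selection idea, assuming a decoder-style VLM that exposes last-layer attention weights and per-token embeddings. The function names (attention_driven_selection, interleave_step), tensor shapes, and the top-k selection rule are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def attention_driven_selection(attn_weights, image_token_mask, seq_embeds, k=4):
    """Pick the k image patches the model attended to most while producing
    the latest token, so they can be spliced back into the sequence.

    attn_weights:     (num_heads, seq_len, seq_len) last-layer self-attention (assumed shape)
    image_token_mask: (seq_len,) bool, True where the token is an image patch
    seq_embeds:       (seq_len, hidden) embeddings of the current sequence
    """
    # Average over heads, then take the attention row of the last generated token.
    scores = attn_weights.mean(dim=0)[-1]                      # (seq_len,)
    # Restrict the ranking to image-patch positions only.
    scores = scores.masked_fill(~image_token_mask, float("-inf"))
    top_idx = scores.topk(k).indices                           # most-attended patches
    return seq_embeds[top_idx]                                 # (k, hidden)

def interleave_step(seq_embeds, selected_patches):
    """Append the selected patch embeddings so the next textual rationale
    is generated conditioned on this fine-grained visual evidence."""
    return torch.cat([seq_embeds, selected_patches], dim=0)

# Toy usage with random tensors standing in for real VLM outputs.
heads, seq_len, hidden = 8, 20, 64
attn = torch.rand(heads, seq_len, seq_len)
mask = torch.zeros(seq_len, dtype=torch.bool)
mask[:16] = True                                               # first 16 tokens are image patches
embeds = torch.randn(seq_len, hidden)
patches = attention_driven_selection(attn, mask, embeds, k=4)
new_seq = interleave_step(embeds, patches)                     # (seq_len + 4, hidden)
```

Because the selection only reads attention maps that the model already computes, this kind of plug-in needs no fine-tuning of the underlying VLM.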
💡 Key Insights:
→ Visual information paired with text creates more precise reasoning paths
→ Attention maps can effectively identify relevant image regions
→ Plugin approach makes implementation feasible across different VLM architectures
📊 Results:
→ Achieves up to 14% performance improvement over existing multimodal CoT methods
→ Shows substantial gains in the interpretability of the generated reasoning steps
→ Works as a plug-and-play addition across different VLM architectures with minimal latency overhead