A critic model that speaks human language to guide VLMs through complex reasoning
Critic-V introduces a framework that enhances Vision Language Models (VLMs) with a dedicated critic component that provides natural language feedback during reasoning tasks. The critic is trained with Direct Preference Optimization enhanced by Rule-based Rewards, significantly improving VLM performance across multiple benchmarks.
-----
https://arxiv.org/abs/2411.18203
🤔 Original Problem:
VLMs often generate inaccurate or irrelevant responses due to hallucinated visual content and flawed reasoning paths. Existing approaches to self-correction rely heavily on the model's own internal capabilities, with no external feedback mechanism.
-----
🔧 Solution in this Paper:
→ Critic-V implements a Reasoner-Critic architecture in which the Reasoner generates reasoning paths from visual and text inputs
→ The Critic provides natural language feedback, rather than a scalar reward, to guide the Reasoner's refinements (a minimal loop is sketched after this list)
→ A Vision Error inSertion Technique (VEST) builds training data by injecting plausible errors into ground-truth answers (see the data-generation sketch below)
→ A Rule-based Reward mechanism scores critique quality using a Jaccard index against the reference errors plus GPT-based scoring (see the reward sketch below)
→ Direct Preference Optimization (DPO) then trains the Critic on critique pairs ranked by this reward (see the DPO sketch below)
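A minimal sketch of the Reasoner-Critic loop, assuming `reasoner` and `critic` are two VLM wrappers with a hypothetical `generate(image, prompt)` method; the paper does not prescribe this API, and the stopping rule here is an illustrative assumption:

```python
def reasoner_critic_loop(image, question, reasoner, critic, max_rounds=3):
    """Iteratively refine an answer using free-form critiques."""
    answer = reasoner.generate(image, question)
    for _ in range(max_rounds):
        # The Critic emits natural language feedback, not a scalar reward.
        critique = critic.generate(
            image,
            f"Question: {question}\nAnswer: {answer}\n"
            "Point out any errors in this answer.",
        )
        if "no errors" in critique.lower():  # assumed stopping convention
            break
        # Fold the critique back into the Reasoner's prompt and retry.
        answer = reasoner.generate(
            image,
            f"{question}\nPrevious answer: {answer}\n"
            f"Critique: {critique}\nGive a revised answer.",
        )
    return answer
```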
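The VEST data-generation step could look roughly like the sketch below; the prompt wording and the `llm.complete` call are placeholders for whatever GPT-style model injects the errors:

```python
VEST_PROMPT = (
    "Here is a question about an image and its correct answer.\n"
    "Question: {question}\nCorrect answer: {answer}\n"
    "Rewrite the answer so it contains {n} plausible but incorrect "
    "details, and list the errors you inserted."
)

def make_flawed_sample(llm, question, answer, n_errors=2):
    # llm.complete is a hypothetical text-completion call.
    flawed = llm.complete(
        VEST_PROMPT.format(question=question, answer=answer, n=n_errors)
    )
    # The inserted errors become the reference targets that
    # critiques are later scored against.
    return flawed
```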
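One concrete reading of the reward's Jaccard component, scoring a critique against the reference errors at the word level (word-set tokenization and the `alpha` mixing weight with the GPT-assigned `gpt_score` are assumptions, not values from the paper):

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard index between the word sets of two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if (sa or sb) else 1.0

def rule_based_reward(critique: str, reference_errors: str,
                      gpt_score: float, alpha: float = 0.5) -> float:
    # alpha is an assumed mixing weight between the two signals.
    return alpha * jaccard(critique, reference_errors) + (1 - alpha) * gpt_score
```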
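The DPO objective on ranked critique pairs is the standard one; a PyTorch sketch over per-sequence log-probabilities, with the `beta` value assumed:

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """-log sigmoid(beta * [(log pi/ref)_chosen - (log pi/ref)_rejected]),
    averaged over the batch of ranked critique pairs."""
    chosen = policy_chosen_logps - ref_chosen_logps
    rejected = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen - rejected)).mean()
```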
-----
💡 Key Insights:
→ Natural language feedback provides more nuanced guidance than scalar rewards
→ Decoupling reasoning and critique processes improves overall performance
→ External feedback mechanisms are crucial for reliable multimodal reasoning
-----
📊 Results:
→ Outperforms GPT-4V on 5 out of 8 benchmarks
→ Improves MathVista scores by 11.8% for Qwen2-VL-7B
→ Achieves 17.8% gain on MathVista for DeepSeek-VL-7B
→ Shows 12.8% improvement on RealWorldQA for LLaVA-v1.5-7B