LLaVA-o1 teaches machines to think step-by-step like humans when analyzing images.
LLaVA-o1 introduces a novel approach to enhance Vision Language Models (VLMs) by implementing structured, multi-stage reasoning. This paper tackles the challenge of systematic reasoning in visual tasks by breaking down the process into distinct stages: summary, caption, reasoning, and conclusion.
-----
https://arxiv.org/abs/2411.10440
🤔 Original Problem:
Current VLMs struggle with systematic reasoning and often produce errors or hallucinated outputs during complex visual question-answering tasks. They lack structured thinking processes and tend to jump to conclusions without proper analysis.
-----
🛠️ Solution in this Paper:
→ LLaVA-o1 implements a 4-stage reasoning process with dedicated tags for each stage: summary, caption, reasoning, and conclusion (see the first sketch after this list).
→ The model is trained with supervised fine-tuning on the new LLaVA-o1-100k dataset, whose structured reasoning annotations were generated with GPT-4o (see the second sketch after this list).
→ A stage-level beam search method generates multiple candidates at each reasoning stage, selecting the best one to continue.
→ Training is performed on a single node with 8 H100 GPUs, combining samples from both general VQA and science-targeted datasets.
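
Below is a minimal Python sketch of what a tagged, four-stage response can look like and how it might be parsed. The stage names follow the paper; the exact tag syntax and the `parse_stages` helper are illustrative assumptions, not the paper's released code.

```python
import re

# The four reasoning stages, each wrapped in its own tag in the model output
# (stage names from the paper; the exact tag syntax here is an assumption).
STAGES = ["SUMMARY", "CAPTION", "REASONING", "CONCLUSION"]

def parse_stages(response: str) -> dict:
    """Split a tagged, LLaVA-o1-style response into its four stages."""
    stages = {}
    for stage in STAGES:
        match = re.search(rf"<{stage}>(.*?)</{stage}>", response, re.DOTALL)
        stages[stage.lower()] = match.group(1).strip() if match else ""
    return stages

example = (
    "<SUMMARY>I will identify the objects and count them.</SUMMARY>"
    "<CAPTION>The image shows three red apples on a wooden table.</CAPTION>"
    "<REASONING>Each apple is distinct; counting them gives 1, 2, 3.</REASONING>"
    "<CONCLUSION>There are 3 apples.</CONCLUSION>"
)

print(parse_stages(example)["conclusion"])  # -> "There are 3 apples."
```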
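For the annotation step, here is a hedged sketch of how staged answers could be collected from GPT-4o through the OpenAI API. The prompt wording and the `annotate` helper are assumptions for illustration, not the paper's exact pipeline.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative instruction asking for the four tagged stages.
PROMPT = (
    "Answer the question about the image in four tagged stages: "
    "<SUMMARY>...</SUMMARY><CAPTION>...</CAPTION>"
    "<REASONING>...</REASONING><CONCLUSION>...</CONCLUSION>"
)

def annotate(image_path: str, question: str) -> str:
    """Ask GPT-4o for a staged answer to one VQA sample."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": f"{PROMPT}\n\nQuestion: {question}"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```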
-----
💡 Key Insights:
→ Structured reasoning stages help models organize thoughts before reaching conclusions
→ Special tags for each stage maintain clarity throughout the reasoning process
→ Stage-level beam search is more effective than sentence-level beam search or best-of-N sampling (a sketch follows below)
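
A rough sketch of the stage-level beam search idea, assuming placeholder `generate_stage` and `pick_best` functions; in the paper, the VLM itself generates the candidates and judges which one is best, rather than the trivial heuristics used here.

```python
import random

# Hypothetical stand-ins for the model's generation and comparison steps.
def generate_stage(context: str, stage: str) -> str:
    """Placeholder: sample one candidate completion for the given stage."""
    return f"<{stage}>candidate-{random.randint(0, 999)} for {stage}</{stage}>"

def pick_best(candidates: list[str]) -> str:
    """Placeholder: rank candidates (e.g., via pairwise model judgments)."""
    return max(candidates, key=len)  # trivial heuristic for illustration

def stage_level_beam_search(question: str, n_candidates: int = 4) -> str:
    """Branch at each stage, keep only the best candidate, then continue."""
    context = question
    for stage in ["SUMMARY", "CAPTION", "REASONING", "CONCLUSION"]:
        candidates = [generate_stage(context, stage) for _ in range(n_candidates)]
        best = pick_best(candidates)   # keep the winning candidate
        context += "\n" + best         # condition the next stage on it
    return context

print(stage_level_beam_search("How many apples are in the image?"))
```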
-----
📊 Results:
→ Outperforms its base model by 8.9% on multimodal reasoning benchmarks
→ Surpasses larger models including Gemini-1.5-pro and GPT-4o-mini
→ Stage-level beam search improves MMVet score from 60.3% to 62.9%