"LLaVA-o1: Let Vision Language Models Reason Step-by-Step"

The podcast on this paper is generated with Google's Illuminate.

LLaVA-o1 teaches machines to think step-by-step like humans when analyzing images.

LLaVA-o1 introduces a novel approach to enhance Vision Language Models (VLMs) by implementing structured, multi-stage reasoning. This paper tackles the challenge of systematic reasoning in visual tasks by breaking down the process into distinct stages: summary, caption, reasoning, and conclusion.

-----

https://arxiv.org/abs/2411.10440

🤔 Original Problem:

Current VLMs struggle with systematic reasoning and often produce errors or hallucinated outputs during complex visual question-answering tasks. They lack structured thinking processes and tend to jump to conclusions without proper analysis.

-----

🛠️ Solution in this Paper:

→ LLaVA-o1 implements a 4-stage reasoning process with dedicated tags for each stage: summary, caption, reasoning, and conclusion (see the tag sketch after this list).

→ The model is trained with supervised fine-tuning on the new LLaVA-o1-100k dataset, whose structured, stage-by-stage reasoning annotations were generated with GPT-4o.

→ A stage-level beam search method generates multiple candidates at each reasoning stage, selecting the best one to continue.

→ Training is performed on a single node with 8 H100 GPUs, combining samples from both general VQA and science-targeted datasets.
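
To make the structured output concrete, here is a minimal Python sketch of how a tagged response can be split back into its four stages. The stage names follow the paper, but the example response and the split_stages helper are hypothetical illustrations, not the authors' code.

```python
import re

# Hypothetical tagged response in the LLaVA-o1 style; the four stage names come
# from the paper, but the example text and this helper are illustrative only.
response = (
    "<SUMMARY>I will count the fruit in the image and compare the totals.</SUMMARY>"
    "<CAPTION>The image shows three apples and two oranges on a table.</CAPTION>"
    "<REASONING>There are 3 apples and 2 oranges, so apples outnumber oranges.</REASONING>"
    "<CONCLUSION>There are more apples than oranges.</CONCLUSION>"
)

STAGES = ["SUMMARY", "CAPTION", "REASONING", "CONCLUSION"]

def split_stages(text: str) -> dict:
    """Extract the content of each reasoning stage from a tagged response."""
    stages = {}
    for tag in STAGES:
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
        stages[tag] = match.group(1).strip() if match else None
    return stages

print(split_stages(response)["CONCLUSION"])  # There are more apples than oranges.
```

Keeping the stages explicitly delimited is also what enables the stage-level beam search mentioned above, since each stage can be regenerated and evaluated on its own.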

-----

💡 Key Insights:

→ Structured reasoning stages help models organize thoughts before reaching conclusions

→ Special tags for each stage maintain clarity throughout the reasoning process

→ Stage-level beam search is more effective than sentence-level beam search or best-of-N sampling (see the sketch below)
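
As a rough illustration of this insight, the sketch below samples several candidates per stage and keeps only the best one before moving on. generate_stage and score are hypothetical placeholders for the model's sampling step and for candidate selection (in the paper, the model itself judges which candidate is best); this is not the authors' implementation.

```python
import random

STAGES = ["SUMMARY", "CAPTION", "REASONING", "CONCLUSION"]

def generate_stage(prefix: str, stage: str) -> str:
    """Placeholder: sample one candidate block for `stage`, conditioned on `prefix`."""
    return f"<{stage}>candidate {random.randint(0, 999)}</{stage}>"

def score(prefix: str, candidate: str) -> float:
    """Placeholder: rate a candidate; the paper lets the model itself pick the best."""
    return random.random()

def stage_level_beam_search(question: str, num_candidates: int = 4) -> str:
    """Keep only the best candidate at each stage, then move to the next stage."""
    output = question
    for stage in STAGES:
        # Sample several candidates for this stage only (not for the whole answer).
        candidates = [generate_stage(output, stage) for _ in range(num_candidates)]
        # Retain the single best candidate and continue generation from it.
        output += max(candidates, key=lambda c: score(output, c))
    return output

print(stage_level_beam_search("Q: How many apples are in the image? "))
```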

-----

📊 Results:

→ Outperforms the base model by 8.9% on average across multimodal reasoning benchmarks

→ Surpasses larger models, including Gemini-1.5-Pro and GPT-4o-mini

→ Stage-level beam search alone lifts the MM-Vet score from 60.3% to 62.9%
