"ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding"

The podcast on this paper was generated with Google's Illuminate.

Visual editing through code generation helps AI focus on what matters in complex charts and tables.

Much like a person using a marker to highlight the important parts, the AI edits images to understand them better.

ReFocus enables LLMs to perform visual editing on structured images like tables and charts, improving their ability to understand and reason about complex visual information.

-----

https://arxiv.org/abs/2501.05452

🤔 Original Problem:

Current multimodal LLMs struggle with selective attention and multi-hop reasoning when analyzing structured images like tables and charts. They typically convert images to text and never look back at the visual content.

-----

🔧 Solution in this Paper:

→ ReFocus introduces a framework where LLMs generate Python code to edit input images by drawing boxes, highlighting, or masking regions.

→ The system works iteratively: the LLM sees an image, thinks about what to focus on, edits the image through code, and repeats this process until it reaches an answer.

→ For tables, it enables column and row editing, while for charts it handles bar modifications and subplot management.

→ The editing process helps LLMs break down complex visual reasoning into simpler, focused steps.
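
To make the edit-then-look-again idea concrete, here is a minimal, dependency-free sketch of the three edit types and the iterative loop. It is not the paper's code: the image is a plain nested list of RGB tuples rather than a real image object, the function names (`mask_region`, `draw_box`, `highlight_region`, `refocus_loop`) are made up for illustration, and `propose_edit` is a hypothetical stand-in for the LLM deciding what to edit next.

```python
from typing import Callable, List, Optional, Tuple

Pixel = Tuple[int, int, int]      # (R, G, B)
Image = List[List[Pixel]]         # indexed as image[y][x]
Box = Tuple[int, int, int, int]   # (left, top, right, bottom), right/bottom exclusive
Edit = Callable[[Image], Image]

WHITE: Pixel = (255, 255, 255)
RED: Pixel = (255, 0, 0)
YELLOW: Pixel = (255, 255, 0)


def new_image(width: int, height: int, color: Pixel) -> Image:
    """Create a solid-color image (toy stand-in for a table or chart)."""
    return [[color] * width for _ in range(height)]


def mask_region(img: Image, box: Box, fill: Pixel = WHITE) -> Image:
    """Return a copy with the box painted over, hiding irrelevant content."""
    left, top, right, bottom = box
    out = [row[:] for row in img]
    for y in range(top, bottom):
        for x in range(left, right):
            out[y][x] = fill
    return out


def draw_box(img: Image, box: Box, color: Pixel = RED) -> Image:
    """Return a copy with a 1-pixel outline drawn along the edge of the box."""
    left, top, right, bottom = box
    out = [row[:] for row in img]
    for x in range(left, right):
        out[top][x] = color
        out[bottom - 1][x] = color
    for y in range(top, bottom):
        out[y][left] = color
        out[y][right - 1] = color
    return out


def highlight_region(img: Image, box: Box, tint: Pixel = YELLOW,
                     alpha: float = 0.4) -> Image:
    """Return a copy with the box blended toward a tint color."""
    left, top, right, bottom = box
    out = [row[:] for row in img]
    for y in range(top, bottom):
        for x in range(left, right):
            out[y][x] = tuple(
                round((1 - alpha) * c + alpha * t)
                for c, t in zip(out[y][x], tint)
            )
    return out


def refocus_loop(img: Image,
                 propose_edit: Callable[[Image, int], Optional[Edit]],
                 max_steps: int = 3) -> Image:
    """Apply proposed edits until the policy returns None (ready to answer).

    `propose_edit` is a hypothetical placeholder for the model: it looks at
    the current image and either returns the next edit or None to stop.
    """
    for step in range(max_steps):
        edit = propose_edit(img, step)
        if edit is None:
            break
        img = edit(img)
    return img


# Toy "table": 6x4 gray image with three 2-pixel-wide columns.
table = new_image(6, 4, (100, 100, 100))

# Fixed two-step policy: mask column 0, then highlight column 2, then stop.
steps: List[Edit] = [
    lambda im: mask_region(im, (0, 0, 2, 4)),
    lambda im: highlight_region(im, (4, 0, 6, 4)),
]
result = refocus_loop(table, lambda im, i: steps[i] if i < len(steps) else None)
```

Each edit returns a new image, so the original stays available to look at again. In a real setup the edits would be generated as code by the model and applied to the actual chart or table image; the fixed policy here only shows the control flow.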

-----

💡 Key Insights:

→ Visual editing improves OCR accuracy and reduces hallucinations without adding external information

→ All editing methods (masking, drawing, highlighting) show similar effectiveness

→ The approach works best on Wikipedia tables and chart reasoning tasks

→ The visual editing process can be distilled into other models through finetuning

-----

📊 Results:

→ 11.0% average accuracy gain on table tasks over the GPT-4o baseline without visual editing

→ 6.8% improvement on chart understanding tasks

→ 8.0% better performance when finetuning other models using ReFocus data vs standard QA pairs
