VLMs don't just see; they understand visual cues through language.
This paper reveals that Vision Language Models (VLMs) modulate visual biases, specifically the texture vs. shape bias, through language, and that these biases can be steered via prompting.
-----
Paper - https://arxiv.org/abs/2403.09193
Original Problem 🧐:
→ Vision models often rely more on texture than shape for object recognition.
→ This contrasts with human vision, which is strongly shape-biased.
→ It is unclear how multimodal Vision Language Models are biased.
→ Specifically, it's unknown if VLMs inherit texture bias from vision encoders or if language influences this bias.
-----
Solution in this Paper 💡:
→ This paper investigates texture versus shape bias in VLMs.
→ It uses a cue-conflict dataset to measure shape bias in VLMs across two tasks: Visual Question Answering (VQA) and Image Captioning (see the metric sketch after this list).
→ The paper examines if VLMs understand shape and texture concepts.
→ It tests if visual biases in VLMs can be steered through language prompts and visual input modifications.
→ Prompt-based steering covers both hand-crafted prompts and prompts automatically searched with an LLM optimizer.
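To make the measurement concrete, here is a minimal sketch (not the authors' code) of how shape bias can be computed from a VLM's VQA answers on cue-conflict images, and how a prompt could attempt to steer it. The `vlm_answer` helper and the prompts are illustrative assumptions.

```python
# Minimal sketch, assuming a hypothetical helper `vlm_answer(image, prompt) -> str`
# that returns the VLM's free-form answer for a cue-conflict image.

def shape_bias(vlm_answer, cue_conflict_set,
               prompt="What object is shown in this image?"):
    """Shape bias = shape-consistent answers / (shape- + texture-consistent answers).

    Each cue-conflict sample pairs one class's shape with another class's texture.
    """
    shape_hits, texture_hits = 0, 0
    for image, shape_label, texture_label in cue_conflict_set:
        answer = vlm_answer(image, prompt).lower()
        if shape_label in answer:
            shape_hits += 1
        elif texture_label in answer:
            texture_hits += 1
        # Answers matching neither cue are ignored by the metric.
    return shape_hits / max(shape_hits + texture_hits, 1)

# Prompt-based steering: compare a neutral prompt with a shape-focused one
# (illustrative prompts, not the paper's optimized ones).
# neutral = shape_bias(vlm_answer, data)
# steered = shape_bias(vlm_answer, data,
#                      prompt="Identify the object by its shape and ignore its texture.")
```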
-----
Key Insights from this Paper 🔑:
→ VLMs are generally more shape-biased than vision-only models.
→ VLMs modulate visual biases through text.
→ VLMs understand visual concepts of shape and texture.
→ Visual biases in VLMs can be steered through language prompting.
→ Visual biases in VLMs can also be steered through visual preprocessing, such as adding noise or patch shuffling (see the sketch after this list).
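As a rough illustration of these preprocessing steers (parameter values are assumptions, not the paper's settings), the NumPy sketch below adds Gaussian noise, which suppresses texture cues, and shuffles patches, which destroys global shape.

```python
import numpy as np

def add_gaussian_noise(image: np.ndarray, sigma: float = 0.1) -> np.ndarray:
    """Add pixel-wise Gaussian noise; degrades fine texture cues more than global shape."""
    noisy = image.astype(np.float32) / 255.0 + np.random.normal(0.0, sigma, image.shape)
    return (np.clip(noisy, 0.0, 1.0) * 255).astype(np.uint8)

def patch_shuffle(image: np.ndarray, grid: int = 4) -> np.ndarray:
    """Split the image into a grid x grid tiling of patches and shuffle them;
    destroys global shape while keeping local texture statistics."""
    h, w = image.shape[:2]
    ph, pw = h // grid, w // grid
    patches = [image[i * ph:(i + 1) * ph, j * pw:(j + 1) * pw]
               for i in range(grid) for j in range(grid)]
    np.random.shuffle(patches)
    rows = [np.concatenate(patches[r * grid:(r + 1) * grid], axis=1)
            for r in range(grid)]
    return np.concatenate(rows, axis=0)
```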
-----
Results 📊:
→ Shape bias in VLMs ranges from 52.9% to 73.8% in VQA and 54.1% to 73.2% in captioning tasks.
→ Shape bias was steered from 49% to 72% through prompting alone.
→ Adding Gaussian noise to input images increased shape bias to as high as 91.7%.
→ Patch shuffling decreased shape bias to as low as 6.1%.