VLMs don't just see; they understand visual cues through language.
This paper reveals that Vision Language Models (VLMs) modulate visual biases, specifically the texture vs. shape bias, through language, and that these biases can be steered via prompting.
-----
Paper - https://arxiv.org/abs/2403.09193
Original Problem 🧐:
→ Vision models often rely more on texture than shape for object recognition.
→ This contrasts with human vision, which is strongly shape-biased.
→ It is unclear how multimodal Vision Language Models are biased.
→ Specifically, it's unknown if VLMs inherit texture bias from vision encoders or if language influences this bias.
-----
Solution in this Paper 💡:
→ This paper investigates texture versus shape bias in VLMs.
→ It uses a cue-conflict dataset to measure shape bias in VLMs across two tasks: Visual Question Answering (VQA) and Image Captioning (see the metric sketch after this list).
→ The paper examines if VLMs understand shape and texture concepts.
→ It tests if visual biases in VLMs can be steered through language prompts and visual input modifications.
→ Prompt-based steering covers both hand-crafted prompts and prompts automatically searched with an LLM optimizer.
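To make the measurement concrete, here is a minimal sketch (not the authors' code) of how shape bias can be computed from a VLM's VQA answers on cue-conflict images, and how a prompt could attempt to steer it. The `vlm_answer` helper and the prompts are illustrative assumptions.

```python
# Minimal sketch, assuming a hypothetical helper `vlm_answer(image, prompt) -> str`
# that returns the VLM's free-form answer for a cue-conflict image.

def shape_bias(vlm_answer, cue_conflict_set,
               prompt="What object is shown in this image?"):
    """Shape bias = shape-consistent answers / (shape- + texture-consistent answers).

    Each cue-conflict sample pairs one class's shape with another class's texture.
    """
    shape_hits, texture_hits = 0, 0
    for image, shape_label, texture_label in cue_conflict_set:
        answer = vlm_answer(image, prompt).lower()
        if shape_label in answer:
            shape_hits += 1
        elif texture_label in answer:
            texture_hits += 1
        # Answers matching neither cue are ignored by the metric.
    return shape_hits / max(shape_hits + texture_hits, 1)

# Prompt-based steering: compare a neutral prompt with a shape-focused one
# (illustrative prompts, not the paper's optimized ones).
# neutral = shape_bias(vlm_answer, data)
# steered = shape_bias(vlm_answer, data,
#                      prompt="Identify the object by its shape and ignore its texture.")
```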
-----
Key Insights from this Paper 🔑:
→ VLMs are generally more shape-biased than vision-only models.
→ VLMs modulate visual biases through text.
→ VLMs understand visual concepts of shape and texture.
→ Visual biases in VLMs can be steered through language prompting.
→ Visual biases in VLMs can also be steered through visual preprocessing, such as adding noise or patch shuffling (see the sketch after this list).
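As a rough illustration of these preprocessing steers (parameter values are assumptions, not the paper's settings), the NumPy sketch below adds Gaussian noise, which suppresses texture cues, and shuffles patches, which destroys global shape.

```python
import numpy as np

def add_gaussian_noise(image: np.ndarray, sigma: float = 0.1) -> np.ndarray:
    """Add pixel-wise Gaussian noise; degrades fine texture cues more than global shape."""
    noisy = image.astype(np.float32) / 255.0 + np.random.normal(0.0, sigma, image.shape)
    return (np.clip(noisy, 0.0, 1.0) * 255).astype(np.uint8)

def patch_shuffle(image: np.ndarray, grid: int = 4) -> np.ndarray:
    """Split the image into a grid x grid tiling of patches and shuffle them;
    destroys global shape while keeping local texture statistics."""
    h, w = image.shape[:2]
    ph, pw = h // grid, w // grid
    patches = [image[i * ph:(i + 1) * ph, j * pw:(j + 1) * pw]
               for i in range(grid) for j in range(grid)]
    np.random.shuffle(patches)
    rows = [np.concatenate(patches[r * grid:(r + 1) * grid], axis=1)
            for r in range(grid)]
    return np.concatenate(rows, axis=0)
```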
-----
Results 📊:
→ Shape bias in VLMs ranges from 52.9% to 73.8% in VQA and 54.1% to 73.2% in captioning tasks.
→ Shape bias was steered from 49% to 72% through prompting alone.
→ Adding Gaussian noise to input images increased shape bias to as high as 91.7%.
→ Patch shuffling decreased shape bias to as low as 6.1%.