A benchmark that tests if Vision-Language Models (VLMs) can truly understand and combine visual-text relationships like humans do.
MMCOMPOSITION: Revisiting the…
A benchmark that tests if Vision-Language Models (VLMs) can truly understand and combine visual-text relationships like humans do.