🤗 Dataset 🏆 Leaderboard 🧑‍💻 Code 📄 Paper

Examples of context-violating images and their explanations:
● This image does not align with the "copper combustion experiment" because burning copper produces a green flame, not a yellow one.
● This image does not correspond to Santa Claus because Santa Claus is an elderly man, not a young boy; his vehicle is pulled by reindeer, not sled dogs; and it should be loaded with gifts, not pumpkins.
● This image does not correspond to the story of "Life of Pi" because the original story features a boy and a tiger drifting on a wooden boat, whereas the image shows a girl and a kitten on a motorboat.
● This image does not align with the story of "Little Red Riding Hood" because in the original tale, Red Riding Hood is a young girl, not an old lady, and the animal that appears is a big grey wolf, not a squirrel.
● This image does not match the story of "The Little Match Girl" because the original story takes place on a snowy day, not a rainy one.
● This image does not correspond to the story of "Wu Song Fighting the Tiger" because the protagonist of the original story is an adult man, whereas the image shows a little boy happily playing with the tiger.
Context-violating images contain visual information that is consistent with common sense but conflicts with a given context.
For example, given the story of ‘Little Red Riding Hood’, an image depicting an old lady in a red hat finding a squirrel in the woods is a context-violating image, even though it looks visually plausible without the background story.
Humans can easily identify and explain whether an image is consistent with the implicit constraints of a specific context, but can Multimodal Large Language Models (MLLMs) achieve similar performance?
To explore the contextual reasoning capacity of MLLMs, we construct ContextualBench, a benchmark dataset consisting of context-violating images generated by text-to-image models.
Each image is associated with a specific context and several constraints, and six types of context are defined in total.
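To make this structure concrete, here is a minimal sketch of what a single benchmark instance could look like. The field names (`image_path`, `context`, `category`, `constraints`) and the example values are illustrative assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class BenchInstance:
    image_path: str                 # path to the generated context-violating image
    context: str                    # the background story or setting
    category: str                   # e.g., fable, fairytale, science, history, folklore, movie
    constraints: list[str] = field(default_factory=list)  # implicit constraints imposed by the context

# Hypothetical instance built from the Little Red Riding Hood example above.
example = BenchInstance(
    image_path="images/little_red_riding_hood_01.png",
    context="Little Red Riding Hood",
    category="fairytale",
    constraints=[
        "Red Riding Hood is a young girl, not an old lady.",
        "The animal she encounters is a big grey wolf, not a squirrel.",
    ],
)

print(f"{example.category}: {len(example.constraints)} constraints")
```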
We evaluate 10 MLLMs on four reasoning tasks over ContextualBench, and the results demonstrate that they fail to accurately identify and explain context-violating images, falling significantly behind human performance.
As a pioneering step towards enhancing the contextual reasoning capacity of MLLMs, we propose a framework that retrieves context-related knowledge from external resources and integrates it into the inference phase of MLLMs.
Our research suggests that contextual reasoning remains an open challenge, and the integration of context-related knowledge is crucial for realizing trustworthy artificial intelligence.
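Below is a minimal sketch of the retrieve-then-infer idea described above, under stated assumptions: `retrieve_context_knowledge` and `mllm_generate` are hypothetical stand-ins for an external knowledge retriever and an MLLM inference call; the framework's actual components may differ.

```python
def retrieve_context_knowledge(context: str, top_k: int = 3) -> list[str]:
    """Hypothetical retriever: fetch context-related facts from an external
    resource (e.g., a story synopsis from an encyclopedia)."""
    toy_knowledge_base = {
        "Little Red Riding Hood": [
            "Red Riding Hood is a young girl wearing a red hood.",
            "She encounters a big grey wolf in the woods.",
        ],
    }
    return toy_knowledge_base.get(context, [])[:top_k]

def mllm_generate(image_path: str, prompt: str) -> str:
    """Placeholder for a real MLLM inference call that takes an image
    and a text prompt (e.g., an API request)."""
    return f"[MLLM response for {image_path}]"

def contextual_inference(image_path: str, context: str) -> str:
    """Inject the retrieved knowledge into the prompt at inference time."""
    facts = retrieve_context_knowledge(context)
    prompt = (
        f"Context: {context}\n"
        f"Relevant facts: {' '.join(facts)}\n"
        "Question: Does this image violate any implicit constraint "
        "of the context? Answer and explain."
    )
    return mllm_generate(image_path, prompt)

print(contextual_inference("images/example.png", "Little Red Riding Hood"))
```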
● ContextualBench comprises six categories of contexts: fable, fairytale, science, history, folklore, and movie, with 12 contexts per category.
● We perform content review and filtering during the iterative process of image generation to ensure that the produced images: (1) do not contain visual illusions, (2) adhere to the specified constraints, and (3) avoid potentially offensive content (a sketch of this loop appears below).
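The following is a minimal sketch of the iterative generate-review-filter loop described in the bullet above. `generate_image` is a hypothetical stand-in for a text-to-image call, and the three boolean checks are placeholders for the actual review criteria, which may involve model-based or human review.

```python
def generate_image(prompt: str, seed: int) -> str:
    """Placeholder for a text-to-image call; returns a path to the output."""
    return f"generated/{seed:04d}.png"

def passes_review(image_path: str) -> bool:
    """Stand-ins mirroring the three filtering criteria; real checks are
    not specified here and are simulated as always passing."""
    no_visual_illusion = True       # (1) no visual illusions
    satisfies_constraints = True    # (2) adheres to the specified constraints
    not_offensive = True            # (3) no potentially offensive content
    return no_visual_illusion and satisfies_constraints and not_offensive

def generate_with_review(prompt: str, max_attempts: int = 5) -> str | None:
    """Regenerate until an image passes all checks or attempts run out."""
    for seed in range(max_attempts):
        candidate = generate_image(prompt, seed)
        if passes_review(candidate):
            return candidate
    return None  # caller may revise the prompt and try again
```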
We design four visual reasoning tasks on ContextualBench: