🤗 Dataset 🏆 Leaderboard 🧑‍💻 Code 📄 Paper

Examples of context-violating images and their explanations:
● This image does not align with the "copper combustion experiment" because burning copper produces a green flame, not a yellow one.
● This image does not correspond to Santa Claus because Santa Claus is an elderly man, not a young boy; his vehicle is pulled by reindeer, not sled dogs; and it should be loaded with gifts, not pumpkins.
● This image does not correspond to the story of "Life of Pi" because the original story features a boy and a tiger drifting on a wooden boat, whereas the image shows a girl and a kitten on a motorboat.
● This image does not align with the story of "Little Red Riding Hood" because in the original tale, Red Riding Hood is a young girl, not an old lady, and the animal that appears is a big grey wolf, not a squirrel.
● This image does not match the story of "The Little Match Girl" because the original story takes place on a snowy day, not a rainy one.
● This image does not correspond to the story of "Wu Song Fighting the Tiger" because the protagonist of the original story is an adult man, whereas the image shows a little boy happily playing with the tiger.
Context-violating images contain visual information that is consistent with common sense but conflicts with a given context.
For example, given the story of ‘Little Red Riding Hood’, an image depicting an old lady in a red hat finding a squirrel in the woods is a context-violating image, even though it looks visually plausible without the background story.
Humans can easily identify and explain whether an image is consistent with the implicit constraints of a specific context, but can Multimodal Large Language Models (MLLMs) achieve similar performance?
To explore the contextual reasoning capacity of MLLMs, we construct ContextualBench, a benchmark dataset consisting of context-violating images generated by text-to-image models.
Each image is associated with a specific context and several constraints, and six types of context are defined in total.
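To make this structure concrete, here is a minimal sketch of what a single benchmark instance could look like. The field names (`image_path`, `context`, `category`, `constraints`) and the example values are illustrative assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class BenchInstance:
    image_path: str                 # path to the generated context-violating image
    context: str                    # the background story or setting
    category: str                   # e.g., fable, fairytale, science, history, folklore, movie
    constraints: list[str] = field(default_factory=list)  # implicit constraints imposed by the context

# Hypothetical instance built from the Little Red Riding Hood example above.
example = BenchInstance(
    image_path="images/little_red_riding_hood_01.png",
    context="Little Red Riding Hood",
    category="fairytale",
    constraints=[
        "Red Riding Hood is a young girl, not an old lady.",
        "The animal she encounters is a big grey wolf, not a squirrel.",
    ],
)

print(f"{example.category}: {len(example.constraints)} constraints")
```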
We evaluate 10 MLLMs on four reasoning tasks over ContextualBench, and the results demonstrate that they fail to accurately identify and explain context-violating images, falling significantly behind human performance.
As a pioneering step towards enhancing the contextual reasoning capacity of MLLMs, we propose a framework that retrieves context-related knowledge from external resources and integrates it into the inference phase of MLLMs.
Our research suggests that contextual reasoning remains an open challenge, and the integration of context-related knowledge is crucial for realizing trustworthy artificial intelligence.
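Below is a minimal sketch of the retrieve-then-infer idea described above, under stated assumptions: `retrieve_context_knowledge` and `mllm_generate` are hypothetical stand-ins for an external knowledge retriever and an MLLM inference call; the framework's actual components may differ.

```python
def retrieve_context_knowledge(context: str, top_k: int = 3) -> list[str]:
    """Hypothetical retriever: fetch context-related facts from an external
    resource (e.g., a story synopsis from an encyclopedia)."""
    toy_knowledge_base = {
        "Little Red Riding Hood": [
            "Red Riding Hood is a young girl wearing a red hood.",
            "She encounters a big grey wolf in the woods.",
        ],
    }
    return toy_knowledge_base.get(context, [])[:top_k]

def mllm_generate(image_path: str, prompt: str) -> str:
    """Placeholder for a real MLLM inference call that takes an image
    and a text prompt (e.g., an API request)."""
    return f"[MLLM response for {image_path}]"

def contextual_inference(image_path: str, context: str) -> str:
    """Inject the retrieved knowledge into the prompt at inference time."""
    facts = retrieve_context_knowledge(context)
    prompt = (
        f"Context: {context}\n"
        f"Relevant facts: {' '.join(facts)}\n"
        "Question: Does this image violate any implicit constraint "
        "of the context? Answer and explain."
    )
    return mllm_generate(image_path, prompt)

print(contextual_inference("images/example.png", "Little Red Riding Hood"))
```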
● ContextualBench comprises six categories of contexts: fable, fairytale, science, history, folklore, and movie, with 12 contexts per category.
● We perform content review and filtering during the iterative process of image generation to ensure that the produced images: (1) do not contain visual illusions, (2) adhere to the specified constraints, and (3) avoid potentially offensive content (a sketch of this loop appears below).
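The following is a minimal sketch of the iterative generate-review-filter loop described in the bullet above. `generate_image` is a hypothetical stand-in for a text-to-image call, and the three boolean checks are placeholders for the actual review criteria, which may involve model-based or human review.

```python
def generate_image(prompt: str, seed: int) -> str:
    """Placeholder for a text-to-image call; returns a path to the output."""
    return f"generated/{seed:04d}.png"

def passes_review(image_path: str) -> bool:
    """Stand-ins mirroring the three filtering criteria; real checks are
    not specified here and are simulated as always passing."""
    no_visual_illusion = True       # (1) no visual illusions
    satisfies_constraints = True    # (2) adheres to the specified constraints
    not_offensive = True            # (3) no potentially offensive content
    return no_visual_illusion and satisfies_constraints and not_offensive

def generate_with_review(prompt: str, max_attempts: int = 5) -> str | None:
    """Regenerate until an image passes all checks or attempts run out."""
    for seed in range(max_attempts):
        candidate = generate_image(prompt, seed)
        if passes_review(candidate):
            return candidate
    return None  # caller may revise the prompt and try again
```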
We design four visual reasoning tasks on ContextualBench: