Challenging and Enhancing the Reasoning Capacity of Multimodal LLMs in Context-violating Images

Beijing Institute of Technology  

🤗 Dataset 🏆 Leaderboard 🧑‍💻 Code 📄 Paper

Do these images conform to the given contexts?




《Copper Combustion Experiment》

This image does not align with the "copper combustion experiment" because copper burning produces a green flame, not a yellow one.




《Santa Claus》

This image does not correspond to Santa Claus: Santa Claus is an elderly man, not a young boy; his sleigh is pulled by reindeer, not sled dogs; and it should be loaded with gifts, not pumpkins.




《Life of Pi》

This image does not correspond to the story of "Life of Pi": the original story features a boy and a tiger drifting on a wooden boat, whereas the image shows a girl and a kitten on a motorboat.




《Little Red Riding Hood》

This image does not align with the story of "Little Red Riding Hood": in the original tale, Red Riding Hood is a young girl, not an old lady, and the animal that appears is a big grey wolf, not a squirrel.




《The Little Match Girl》

This image does not match the story of "The Little Match Girl," because the original story takes place on a snowy day, not a rainy one.




《Wu Song Fights the Tiger》

This image does not correspond to the story of "Wu Song Fights the Tiger": the protagonist in the original story is an adult man who fights the tiger, but the picture shows a little boy happily playing with it.