MMErroR: A Benchmark for Erroneous Reasoning

Yang Shi1*, Yifeng Xie2*, Minzhe Guo1, Liangsi Lu1,
Mingxuan Huang3, Jingchao Wang4, Zhihong Zhu4, Boyan Xu1, Zhiqi Huang4
1 Guangdong University of Technology · 2 Hong Kong Baptist University
3 Sun Yat-sen University · 4 Peking University

*Equal contribution

MMErroR delivers 2,013 multimodal cases with a single reasoning error each, spanning 24 subdomains across six domains. Two evaluation modes—Error Type Classification and Error Presence Detection—stress-test VLMs beyond answer accuracy, with Gemini-3.0-Pro reaching 66.47% error classification accuracy.

Abstract

Recent advances in Vision-Language Models (VLMs) have improved performance in multimodal learning, raising the question of whether these models truly understand the content they process. Crucially, can VLMs detect when a reasoning process is wrong and identify its error type? To answer this, we present MMErroR, a multimodal benchmark of 2,013 samples, each embedding a single coherent reasoning error, spanning 24 subdomains across six top-level domains. Unlike existing benchmarks that focus on answer correctness, MMErroR targets a process-level, error-centric evaluation that requires models to detect incorrect reasoning and classify the error type within both visual and linguistic contexts. We evaluate 20 advanced VLMs; even the best model (Gemini-3.0-Pro) classifies the error correctly in only 66.47% of cases, underscoring the challenge of identifying erroneous reasoning. Accurately locating errors offers a sharper lens on the capabilities and limits of multimodal reasoning models.

Benchmark Tasks

Error Type Classification (ETC)

The model is explicitly told an error exists and must classify it using the MMErroR taxonomy (visual perception, knowledge deployment, question comprehension, reasoning).

Error Presence Detection (EPD)

The model first decides whether a reasoning chain is sound, and only then diagnoses the error if present—mirroring real-world uncertainty.

MMErroR targets process-level understanding: VLMs must look beyond answer correctness to locate why reasoning fails.
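The two evaluation modes can be scored with straightforward accuracy computations. A minimal sketch in Python, assuming each sample is a dict with gold fields `has_error` and `error_type` (these field names are our illustrative assumption, not the benchmark's official schema):

```python
# Scoring sketch for the two MMErroR evaluation modes.
# Field names (has_error, error_type) are illustrative assumptions,
# not the benchmark's official data format.

def etc_accuracy(golds, preds):
    """Error Type Classification: every sample is known to contain
    an error; the model must name its type correctly."""
    correct = sum(g["error_type"] == p["error_type"]
                  for g, p in zip(golds, preds))
    return correct / len(golds)

def epd_accuracy(golds, preds):
    """Error Presence Detection: the model first judges whether an
    error exists; a sample counts as correct only if the presence
    judgment is right and, when an error exists, the predicted type
    also matches."""
    correct = 0
    for g, p in zip(golds, preds):
        if g["has_error"] != p["has_error"]:
            continue  # wrong presence judgment
        if not g["has_error"] or g["error_type"] == p["error_type"]:
            correct += 1
    return correct / len(golds)
```

Under this scoring, EPD is strictly harder than ETC: a model that misjudges presence gets no credit even if its type guess would have been right.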

MMErroR Taxonomy

Visual Perception Error

Incorrect grounding such as object misidentification, spatial misinterpretation, or misreading symbols/diagrams.

Knowledge Deployment Error

Incorrect recall or misapplication of external knowledge—e.g., physics laws, math formulas, or domain-specific concepts.

Question Comprehension Error

Misunderstanding task intent, overlooking constraints, or misinterpreting the required target.

Reasoning Error

Logical fallacies, missing premises, invalid inference steps, or internal inconsistencies in the reasoning chain.
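For bookkeeping during evaluation, the four-way taxonomy can be represented as a small enum with a normalizer for free-text model answers (a sketch; the class, member, and function names are ours, not an official API):

```python
from enum import Enum

class MMErrorType(Enum):
    """The four top-level error categories in the MMErroR taxonomy."""
    VISUAL_PERCEPTION = "visual perception error"
    KNOWLEDGE_DEPLOYMENT = "knowledge deployment error"
    QUESTION_COMPREHENSION = "question comprehension error"
    REASONING = "reasoning error"

def parse_error_type(label: str) -> MMErrorType:
    """Map a free-text answer onto a taxonomy class by keyword
    match (illustrative normalization, not the paper's method)."""
    text = label.lower()
    for etype in MMErrorType:
        if etype.value.replace(" error", "") in text:
            return etype
    raise ValueError(f"Unrecognized error type: {label!r}")
```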

Figure 1: MMErroR Benchmark Overview

arXiv

BibTeX

@misc{shi2026mmerror,
  title={MMErroR: A Benchmark for Erroneous Reasoning in Vision-Language Models},
  author={Yang Shi and Yifeng Xie and Minzhe Guo and Liangsi Lu and Mingxuan Huang and Jingchao Wang and Zhihong Zhu and Boyan Xu and Zhiqi Huang},
  year={2026},
  url={https://mmerror-benchmark.github.io}
}