MMErroR: A Benchmark for Erroneous Reasoning in Vision-Language Models

Shi, Yang; Xie, Yifeng; Guo, Minzhe; Lu, Liangsi; Huang, Mingxuan; Wang, Jingchao; Zhu, Zhihong; Xu, Boyan; Huang, Zhiqi


Yang Shi¹*, Yifeng Xie²*, Minzhe Guo¹, Liangsi Lu¹, Mingxuan Huang³, Jingchao Wang⁴, Zhihong Zhu⁴, Boyan Xu¹†, Zhiqi Huang⁴

¹ Guangdong University of Technology · ² Hong Kong Baptist University
³ Sun Yat-sen University · ⁴ Peking University
* Equal contribution · † Corresponding author

ACL 2026


MMErroR delivers 1,997 multimodal cases with a single reasoning error each, spanning 24 subdomains across six domains. Two evaluation modes—Error Type Classification and Error Presence Detection—stress-test VLMs beyond answer accuracy, with Gemini-3-Pro-Preview reaching 66.65% error classification accuracy.

Abstract

Recent advances in Vision-Language Models (VLMs) have improved performance in multimodal learning, raising the question of whether these models truly understand the content they process. Crucially, can VLMs detect when a reasoning process is wrong and identify its error type? To answer this, we present MMErroR, a multimodal benchmark of 1,997 samples, each embedding a single coherent reasoning error. These samples span 24 subdomains across six top-level domains, ensuring broad coverage and taxonomic richness. Unlike existing benchmarks that focus on answer correctness, MMErroR targets a process-level, error-centric evaluation that requires models to detect incorrect reasoning and classify the error type within both visual and linguistic contexts. We evaluate 12 representative VLMs, and even the best model, Gemini-3-Pro-Preview, classifies the error correctly in only 66.65% of cases, underscoring the challenge of identifying erroneous reasoning. Furthermore, the ability to accurately identify errors offers valuable insights into the capabilities of multimodal models.

Overview

MMErroR evaluates whether a vision-language model can move beyond answer correctness and identify why a reasoning chain fails.

MMErroR benchmark workflow with a multimodal query, flawed reasoning steps, and an error diagnosis output — Comparison with existing error localization benchmarks. The sample illustrates an erroneous reasoning chain where the model must both detect and classify the error type.

Statistics

MMErroR contains 1,997 multimodal samples across six domains and 24 subdomains. Each sample contains exactly one verified root-cause error.

Sunburst chart showing MMErroR domain and subdomain distribution — Detailed analysis of the domain, subdomain, and error-type statistics of MMErroR. Physics & Engineering is the largest domain with 459 samples, and Knowledge Deployment Error is the most common error type with 880 samples.

Summary table listing MMErroR domains, error categories, counts, percentages, and average lengths.
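The counts quoted above can be turned into shares of the full benchmark with simple arithmetic. The snippet below is a minimal sketch that uses only the three numbers stated in the text (1,997 samples, 459 for Physics & Engineering, 880 for Knowledge Deployment Error). The helper name `share` is hypothetical, and the percentages in the benchmark's own summary table may be rounded differently.

```python
TOTAL_SAMPLES = 1997  # benchmark size stated in the text

# Counts quoted in the statistics caption above.
quoted_counts = {
    "Physics & Engineering (largest domain)": 459,
    "Knowledge Deployment Error (most common error type)": 880,
}


def share(count: int, total: int = TOTAL_SAMPLES) -> float:
    """Return count as a percentage of total, rounded to two decimals."""
    return round(100.0 * count / total, 2)


for name, count in quoted_counts.items():
    print(f"{name}: {count}/{TOTAL_SAMPLES} = {share(count):.2f}%")
```

Under these numbers, the largest domain covers roughly 23% of the samples and the most common error type roughly 44%.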

Experiment Results

Representative VLMs are evaluated under Error Type Classification (ETC) and Error Presence Detection (EPD), exposing broad gaps in process-level error diagnosis.

Radar charts comparing VLM performance by error cause, ETC domain accuracy, and EPD domain accuracy — Comparison of different VLMs across task domains and four error types: Visual Perception Error, Reasoning Error, Question Comprehension Error, and Knowledge Deployment Error.

Bar chart comparing ETC and EPD accuracy for representative vision-language models — Performance comparison of different VLMs on MMErroR under Error Type Classification (ETC) and Error Presence Detection (EPD).
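A per-error-type breakdown like the one the radar charts summarize can be computed by grouping predictions by their gold label. The following is a minimal sketch under assumed inputs: the `(gold, predicted)` record layout and the function name `accuracy_by_error_type` are hypothetical, and the toy records are illustrative, not results from the benchmark.

```python
from collections import defaultdict


def accuracy_by_error_type(records):
    """Group (gold_label, predicted_label) pairs by gold label.

    Returns a dict mapping each gold label to the fraction of its
    samples whose predicted label matches.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for gold, pred in records:
        totals[gold] += 1
        hits[gold] += int(gold == pred)
    return {label: hits[label] / totals[label] for label in totals}


# Toy records for illustration only (not benchmark data).
toy_records = [
    ("Reasoning Error", "Reasoning Error"),
    ("Reasoning Error", "Knowledge Deployment Error"),
    ("Visual Perception Error", "Visual Perception Error"),
]
print(accuracy_by_error_type(toy_records))
```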

Error taxonomy spans visual perception, knowledge deployment, question comprehension, and reasoning failures.

1,997 multimodal cases across six domains and 24 subdomains—distribution and benchmark summary.

Two evaluation modes: Error Type Classification (ETC) and Error Presence Detection (EPD).

12 representative VLMs evaluated; Gemini-3-Pro-Preview reaches 66.65% error-type accuracy.

Benchmark Tasks

Error Type Classification (ETC)

The model is explicitly told an error exists and must classify it using the MMErroR taxonomy (visual perception, knowledge deployment, question comprehension, reasoning).
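Since ETC reduces to a four-way classification, a score can be sketched as plain label-matching accuracy. The code below is an illustrative sketch, not the authors' evaluation script: the `ErrorType` enum, the `etc_accuracy` function, and the exact-match scoring rule are assumptions, and the toy labels are not benchmark data.

```python
from enum import Enum


class ErrorType(Enum):
    """The four MMErroR error categories named in the text."""
    VISUAL_PERCEPTION = "visual perception"
    KNOWLEDGE_DEPLOYMENT = "knowledge deployment"
    QUESTION_COMPREHENSION = "question comprehension"
    REASONING = "reasoning"


def etc_accuracy(predictions, gold):
    """Fraction of samples whose predicted error type equals the gold type."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


# Toy example for illustration only: three of four predictions match.
gold = [ErrorType.REASONING, ErrorType.VISUAL_PERCEPTION,
        ErrorType.KNOWLEDGE_DEPLOYMENT, ErrorType.QUESTION_COMPREHENSION]
preds = [ErrorType.REASONING, ErrorType.VISUAL_PERCEPTION,
         ErrorType.KNOWLEDGE_DEPLOYMENT, ErrorType.REASONING]
print(f"{etc_accuracy(preds, gold):.2%}")  # prints 75.00%
```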

Error Presence Detection (EPD)

The model first decides whether a reasoning chain is sound, and only then diagnoses the error if present—mirroring real-world uncertainty.
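The two-stage structure of EPD can be sketched as a record holding a presence decision and an optional error type. This is a hypothetical sketch: the `Diagnosis` class, the `epd_correct` function, and the rule that a prediction counts only when both stages match are assumptions made for illustration, since the page does not specify how EPD is scored.

```python
from dataclasses import dataclass
from typing import Optional

# Category names taken from the MMErroR taxonomy described in the text.
TAXONOMY = (
    "visual perception",
    "knowledge deployment",
    "question comprehension",
    "reasoning",
)


@dataclass(frozen=True)
class Diagnosis:
    """Stage 1: is there an error? Stage 2: if so, which type?"""
    has_error: bool
    error_type: Optional[str] = None

    def __post_init__(self):
        # An error type is required exactly when an error is reported.
        if self.has_error and self.error_type not in TAXONOMY:
            raise ValueError("has_error=True requires a taxonomy error_type")
        if not self.has_error and self.error_type is not None:
            raise ValueError("has_error=False must not carry an error_type")


def epd_correct(pred: Diagnosis, gold: Diagnosis) -> bool:
    """Assumed rule: both the presence decision and the type must match."""
    if pred.has_error != gold.has_error:
        return False
    return pred.error_type == gold.error_type


gold = Diagnosis(has_error=True, error_type="reasoning")
print(epd_correct(Diagnosis(True, "reasoning"), gold))
print(epd_correct(Diagnosis(False), gold))
```

Separating the presence decision from the type makes it possible to report detection and diagnosis failures independently, if a finer-grained analysis is wanted.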

MMErroR targets process-level understanding: VLMs must look beyond answer correctness to locate why reasoning fails.

MMErroR Taxonomy

Visual Perception Error

Incorrect grounding such as object misidentification, spatial misinterpretation, or misreading symbols/diagrams.

Knowledge Deployment Error

Misuse or misapplication of external knowledge—e.g., physics, math formulas, or domain-specific concepts.

Question Comprehension Error

Misunderstanding task intent, overlooking constraints, or misinterpreting the required target.

Reasoning Error

Logical fallacies, missing premises, invalid inference steps, or internal inconsistencies in the reasoning chain.
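The four categories and their descriptions can be held in a small mapping and rendered as answer options for a model. The sketch below is illustrative only: the lettered option format and the helper names `render_options` and `parse_choice` are hypothetical and are not taken from the benchmark's actual prompts. The descriptions are copied from the taxonomy above.

```python
# Category names and descriptions as given in the taxonomy above.
TAXONOMY = {
    "Visual Perception Error": "Incorrect grounding such as object "
    "misidentification, spatial misinterpretation, or misreading "
    "symbols/diagrams.",
    "Knowledge Deployment Error": "Misuse or misapplication of external "
    "knowledge, e.g., physics, math formulas, or domain-specific concepts.",
    "Question Comprehension Error": "Misunderstanding task intent, "
    "overlooking constraints, or misinterpreting the required target.",
    "Reasoning Error": "Logical fallacies, missing premises, invalid "
    "inference steps, or internal inconsistencies in the reasoning chain.",
}


def render_options(taxonomy=TAXONOMY) -> str:
    """Render the taxonomy as lettered options: (A) name: description."""
    return "\n".join(
        f"({chr(ord('A') + i)}) {name}: {desc}"
        for i, (name, desc) in enumerate(taxonomy.items())
    )


def parse_choice(letter: str, taxonomy=TAXONOMY) -> str:
    """Map a single-letter answer such as 'b' back to its category name."""
    cleaned = letter.strip().upper()
    names = list(taxonomy)
    if len(cleaned) != 1 or not 0 <= ord(cleaned) - ord("A") < len(names):
        raise ValueError(f"not a valid option letter: {letter!r}")
    return names[ord(cleaned) - ord("A")]


print(render_options())
print(parse_choice("b"))  # prints Knowledge Deployment Error
```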

Figure 1: MMErroR Benchmark Overview


BibTeX

@article{shi2026mmerror,
  title={MMErroR: A Benchmark for Erroneous Reasoning in Vision-Language Models},
  author={Shi, Yang and Xie, Yifeng and Guo, Minzhe and Lu, Liangsi and Huang, Mingxuan and Wang, Jingchao and Zhu, Zhihong and Xu, Boyan and Huang, Zhiqi},
  journal={arXiv preprint arXiv:2601.03331},
  year={2026}
}

More Works from Our Lab

ProcessBench

PRISM-Bench

ErrorRadar
