Yue, Ni, Zhang, Zheng et al. — CVPR 2024

MMMU: Massive Multi-discipline Multimodal Understanding

The benchmark that exposed how far multimodal AI is from expert-level reasoning: 11.5K college exam questions across 30 subjects where GPT-4V scores 55.7% and the best human experts score 88.6%.

Prerequisites: VLMs (vision-language models) + Evaluation benchmarks

Chapter 0: The Problem

By late 2023, multimodal AI models were crushing every benchmark in sight. CogVLM hit 85% on VQA-v2. LLaVA scored 92% on ScienceQA-IMG. RefCOCO? 93%. The leaderboards told a story of rapid, inexorable progress toward human-level visual understanding.

But something was deeply wrong with that story.

All of these benchmarks test perception — recognizing objects, reading text in images, answering common-sense questions about everyday scenes. They ask "what color is the car?" or "how many people are in this photo?" A middle-schooler could answer most of them. And that's the problem: when your test is easy, a high score tells you nothing about whether the system actually understands anything.

Consider what a real expert does. A radiologist doesn't just see a bright spot on an MRI — they combine visual perception with years of medical knowledge to diagnose fat necrosis versus hematoma versus susceptibility artifact. A music theorist doesn't just see sheet music — they identify that a notated interval is a diminished fifth, not a minor seventh. An electrical engineer doesn't just see a circuit diagram — they apply Kirchhoff's laws to compute VCE = 3.75V.

None of the existing benchmarks test this kind of expert-level multimodal reasoning. They test the visual equivalent of asking a language model to complete "The cat sat on the ___." High scores, zero insight into actual intelligence.

The gap in evaluation: Text-only benchmarks like MMLU tested expert knowledge across 57 subjects — but ignored images entirely. Multimodal benchmarks like VQA tested visual perception — but only at a common-sense level. No benchmark tested both: expert domain knowledge combined with sophisticated visual understanding. MMMU fills this gap with 11.5K college-level multimodal exam questions.
Why were existing multimodal benchmarks insufficient for measuring progress toward expert-level AI?

Chapter 1: The Key Insight

How do you test whether an AI system has expert-level multimodal understanding? The MMMU authors realized the answer was hiding in plain sight: college exams.

College exams are purpose-built evaluation instruments. They are designed by domain experts to test whether a student has mastered a subject at a professional level. They combine text and images naturally — circuit diagrams in electrical engineering, pathology slides in medicine, sheet music in music theory, molecular structures in chemistry. And critically, they require three skills simultaneously:

  1. Perception: Accurately parsing heterogeneous visual inputs — not just photographs, but diagrams, charts, tables, chemical structures, medical scans, geometric figures, musical notation
  2. Knowledge: Recalling domain-specific facts and principles — Fourier transforms, equilibrium theory, art history, pharmacology, circuit laws
  3. Reasoning: Applying that knowledge to the perceived visual information through multi-step logical, mathematical, or spatial reasoning to arrive at a solution
The three-skill pyramid: Most existing benchmarks test only perception. Some test perception + common-sense knowledge. MMMU is the first to systematically test all three: perception + domain expertise + deliberate reasoning. A model can't "hack" MMMU by being good at one skill — it needs all three, operating together, at an expert level.

This is why MMMU's subtitle is "A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI." The authors explicitly frame it as measuring progress toward Level 3 AGI in the Morris et al. taxonomy — systems that perform at "the 90th percentile of skilled adults" across broad domains.

The genius of using college exams is that difficulty is built in by construction. You don't need to artificially make questions harder. A second-year medical imaging exam is hard. A graduate-level calculus problem is hard. The domain provides the challenge.

What three skills must a model combine to succeed on MMMU?

Chapter 2: Benchmark Design

MMMU contains 11,550 multimodal questions spanning 6 disciplines, 30 subjects, and 183 subfields. This isn't a random collection — it's a carefully structured taxonomy designed for comprehensive coverage of expert knowledge.

The six disciplines

Art & Design (11%)
Music, art, design — sheet music analysis, art history, visual design principles
Business (14%)
Marketing, finance, accounting, economics — charts, financial statements, market data
Science (23%)
Math, physics, chemistry, biology — formulas, molecular structures, experimental plots
Health & Medicine (17%)
Clinical medicine, pharmacy, biology — MRI scans, pathology slides, anatomical diagrams
Humanities & Social Sci (9%)
History, geography, psychology, sociology — maps, political cartoons, statistical graphs
Tech & Engineering (26%)
EE, CS, mechanical, civil — circuit diagrams, architecture plans, code snippets

Data curation: A three-stage process

Stage 1: Subject selection. University professors identified which subjects commonly use visual inputs. Subjects like law and linguistics were excluded because multimodal questions are rare in those fields.

Stage 2: Question collection. Over 50 university students specializing in these subjects collected questions from major textbooks, online resources, and lecture materials. They were instructed to select questions without immediately available answers (e.g., answers in separate documents) to minimize data contamination.

Stage 3: Quality control. A multi-step cleaning pipeline: duplicate detection via lexical overlap and URL similarity, format/typo checking by co-authors, and difficulty categorization. Approximately 10% of questions classified as "very easy" were removed to maintain the benchmark's expert-level difficulty.
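To make the duplicate-detection step concrete, here is a minimal sketch of lexical-overlap matching. The text above only says "lexical overlap"; the choice of Jaccard similarity over word tokens, the 0.8 threshold, and the function names `tokens`, `jaccard`, and `find_near_duplicates` are all illustrative assumptions, not the authors' pipeline.

```python
import re

def tokens(text: str) -> set[str]:
    """Lowercase alphanumeric word tokens (a deliberately simple tokenizer)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the token sets of two questions."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def find_near_duplicates(questions: list[str], threshold: float = 0.8) -> list[tuple[int, int]]:
    """Return index pairs whose overlap meets the (assumed) threshold."""
    pairs = []
    for i in range(len(questions)):
        for j in range(i + 1, len(questions)):
            if jaccard(questions[i], questions[j]) >= threshold:
                pairs.append((i, j))
    return pairs

questions = [
    "Find VCE for the circuit shown in <image 1>.",
    "Find VCE for the circuit shown in <image 1>",
    "What is the etiology of the finding in the left breast?",
]
print(find_near_duplicates(questions))
```

The pairwise loop is quadratic, which is fine for a sketch; a collection of this size would more plausibly use hashing or an index, and flagged pairs would still need human review.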

Key statistics

Split
Dev: 150 | Validation: 900 | Test: 10,500
Difficulty
Easy: 28% | Medium: 45% | Hard: 27%
Image types
30 heterogeneous types: diagrams, charts, photos, chemical structures, sheet music, medical scans, ...
Image position
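Beginning: 18% | Middle: 37% | End: 50%; truly interleaved (the shares sum to more than 100%, presumably because a question with several images can count toward more than one position)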
Discipline Distribution

How the 11,550 questions are distributed across six disciplines. Engineering and Science dominate, reflecting the prevalence of multimodal content in STEM fields.
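As a quick sanity check on the statistics above, the sketch below confirms that the splits add up to 11,550 and converts the discipline shares into approximate question counts. Because the percentages are rounded, the counts are estimates only, not the paper's exact per-discipline totals; the variable names are illustrative.

```python
TOTAL_QUESTIONS = 11_550

# Rounded shares as listed in this chapter.
DISCIPLINE_SHARE_PCT = {
    "Art & Design": 11,
    "Business": 14,
    "Science": 23,
    "Health & Medicine": 17,
    "Humanities & Social Sci": 9,
    "Tech & Engineering": 26,
}

SPLITS = {"dev": 150, "validation": 900, "test": 10_500}

# The shares should cover the whole benchmark, and the splits should add up.
print("shares sum to:", sum(DISCIPLINE_SHARE_PCT.values()))
print("splits sum to:", sum(SPLITS.values()))

# Approximate counts implied by the rounded percentages.
approx_counts = {
    name: round(TOTAL_QUESTIONS * pct / 100)
    for name, pct in DISCIPLINE_SHARE_PCT.items()
}
for name, count in approx_counts.items():
    print(f"{name}: ~{count} questions")
```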

Why were approximately 10% of collected questions removed from MMMU?

Chapter 3: Question Types

MMMU questions are not the simple "what is in this image?" format you see in VQA. They mirror actual college exam questions, with two key structural innovations that make them uniquely challenging for multimodal models.

Format: Multiple-choice + open-ended

Multiple-choice (94%): The vast majority of MMMU questions present 4 options. But unlike typical VQA multiple-choice, the options often require domain expertise to even understand — "susceptibility artifact" vs. "fat necrosis" vs. "hematoma" in radiology, or integral expressions involving specific function bounds in calculus.

Open-ended (6%): These questions require the model to produce a numerical answer, short phrase, or formula. No options to guess from. For example: "Find VCE for the circuit shown" with answer 3.75V.

The interleaved image challenge

This is what truly sets MMMU apart. Images don't just accompany questions — they are woven into the question text. A breast MRI question might read: "You are shown subtraction <image 1>, T2 weighted <image 2> and T1 weighted axial <image 3> from a screening breast MRI. What is the etiology of the finding in the left breast?"

The model must:

  1. Parse the text to understand which image corresponds to which modality
  2. Interpret each medical image using domain-specific visual knowledge
  3. Synthesize information across all three images
  4. Apply clinical reasoning to reach a diagnosis

7.4% of questions include multiple images, and images appear at different positions: beginning (18%), middle (37%), or end (50%) of the question text.
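The first step in that list, working out where each image sits in the text, can be sketched as a small parser. This assumes placeholders have exactly the form `<image N>` shown in the example; the `Segment` type and `split_interleaved` function are hypothetical names for illustration, not part of any MMMU tooling.

```python
import re
from dataclasses import dataclass

IMAGE_TAG = re.compile(r"<image (\d+)>")

@dataclass
class Segment:
    kind: str   # "text" or "image"
    value: str  # the text content, or the image index as a string

def split_interleaved(question: str) -> list[Segment]:
    """Split a question into ordered text and image segments."""
    segments = []
    cursor = 0
    for match in IMAGE_TAG.finditer(question):
        text = question[cursor:match.start()].strip()
        if text:
            segments.append(Segment("text", text))
        segments.append(Segment("image", match.group(1)))
        cursor = match.end()
    tail = question[cursor:].strip()
    if tail:
        segments.append(Segment("text", tail))
    return segments

question = (
    "You are shown subtraction <image 1>, T2 weighted <image 2> and "
    "T1 weighted axial <image 3> from a screening breast MRI. "
    "What is the etiology of the finding in the left breast?"
)
segments = split_interleaved(question)
image_ids = [s.value for s in segments if s.kind == "image"]
print(image_ids)
```

Keeping the segments in order preserves which phrase ("subtraction", "T2 weighted") introduces which image, which is exactly the association the model has to recover.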

30 heterogeneous image types

This is perhaps the most brutal aspect of MMMU. The 30 image types span wildly different visual domains, including diagrams, charts, photographs, chemical structures, sheet music, and medical scans.

Why heterogeneity matters: A model trained primarily on natural photographs (which most vision encoders are) must generalize to radically different visual domains — reading sheet music notation, parsing circuit schematics, interpreting pathology stains. Each domain has its own visual grammar. This is where most models fail catastrophically.
MMMU Question Anatomy

How MMMU questions interleave text and images across different disciplines.

What makes MMMU's interleaved text-image format harder than typical VQA?

Chapter 4: Difficulty Calibration

A benchmark is only as good as the quality of its questions. The MMMU team invested heavily in ensuring that every question genuinely tests expert-level understanding — not just pattern matching or lucky guessing.

Human expert validation

The authors recruited 90 college senior students — 3 experts per subject across all 30 subjects — to take the benchmark. These weren't random participants; they were students specializing in the relevant field, and they were allowed to consult their textbooks. This establishes a realistic upper bound: humans with domain training and reference materials.

The results were telling: the best human expert scored 88.6%, and even the worst human experts scored 76.2%, well above any AI model. And these are students, not professors. A seasoned professional would likely score even higher.

Difficulty stratification

Questions are categorized into three difficulty levels based on expert annotation: easy (28%), medium (45%), and hard (27%).

The "very easy" questions (approximately 10% of the original collection) were removed entirely. If a question could be answered without domain expertise, it doesn't belong in MMMU.

Anti-contamination measures

Data contamination — where the test questions appear in the model's training data — is a growing concern. MMMU addresses this in two ways:

  1. Source diversity: Questions come from textbooks, exams, and newly created problems — not from widely-scraped web sources
  2. Answer separation: Annotators were instructed to select questions whose answers aren't immediately adjacent (e.g., answers at the back of the textbook or in separate answer keys), making it harder for web scraping to pair question with answer
Why random chance is ~24%: With 4-option multiple-choice (94% of questions), random guessing gives about 25% on those. Open-ended questions (6%) are nearly impossible to guess. Weighted together, that is roughly 0.94 × 25% ≈ 23.5%, in line with the approximately 23.9% random-choice baseline on the test set. The "frequent choice" baseline (always picking the most common answer) scores 25.8%, barely above random. These sanity checks confirm the benchmark isn't exploitable by simple heuristics.
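The weighting behind that estimate is simple enough to check directly. The sketch below uses the shares quoted above and assumes every multiple-choice question has exactly four options and that a random guess never hits an open-ended answer; both are simplifications, which is presumably why the result lands slightly below the reported 23.9%.

```python
MC_SHARE = 0.94     # fraction of questions that are multiple-choice
OPEN_SHARE = 0.06   # fraction that are open-ended
MC_GUESS = 1 / 4    # assumed: exactly four options per question
OPEN_GUESS = 0.0    # assumed: a blind guess never matches an open answer

expected_random = MC_SHARE * MC_GUESS + OPEN_SHARE * OPEN_GUESS
print(f"Expected random accuracy: {expected_random:.1%}")
```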
Why were human experts allowed to consult textbooks during evaluation?

Chapter 5: Model Evaluation

MMMU evaluates 28 open-source large multimodal models (LMMs) plus proprietary systems including GPT-4V(ision). All evaluations are zero-shot — no fine-tuning or few-shot examples on MMMU data. This tests raw generalization.
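Zero-shot scoring of multiple-choice questions means mapping a free-form model response onto an option letter before comparing it with the answer key. The sketch below shows one rule-based way to do that. It is not the authors' evaluation script: `extract_choice`, `accuracy`, the regular expressions, and the sample responses are all illustrative assumptions.

```python
import re
from typing import Optional

def extract_choice(response: str, options: str = "ABCD") -> Optional[str]:
    """Return the option letter found in a response, or None if there is none."""
    # Explicit forms first, e.g. "Answer: B" or "The answer is (C)".
    explicit = re.search(
        rf"(?i:answer)\s*(?:(?i:is))?\s*:?\s*\(?([{options}])\)?(?![A-Za-z])",
        response,
    )
    if explicit:
        return explicit.group(1)
    # Otherwise accept a standalone capital letter such as "B" or "(B)".
    standalone = re.search(rf"(?<![A-Za-z])\(?([{options}])\)?(?![A-Za-z])", response)
    if standalone:
        return standalone.group(1)
    return None

def accuracy(responses: list[str], gold: list[str]) -> float:
    """Fraction of responses whose extracted letter matches the answer key."""
    correct = sum(extract_choice(r) == g for r, g in zip(responses, gold))
    return correct / len(gold)

responses = [
    "The answer is (B).",
    "Answer: C",
    "I cannot determine this from the image.",  # a refusal: no letter to extract
    "D",
]
gold = ["B", "C", "A", "A"]
score = accuracy(responses, gold)
print(f"accuracy = {score:.2f}")
```

Rules like these are brittle. A response that begins "A diminished fifth..." would be read as option A, which is one way the small share of "answer extraction" errors in the later error analysis can arise.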

The proprietary frontier

At the time of publication, GPT-4V achieved 55.7% accuracy on the test set, the highest score of any model. This sounds respectable until you remember that random chance is ~24% and the best human experts score 88.6%. GPT-4V is slightly closer to random than to human.

The open-source landscape

Open-source models performed dramatically worse. The best open-source results at publication were in the 30-34% range.

These same models score 85-93% on standard VQA benchmarks. On MMMU, they barely beat random chance. The contrast is stark and reveals how shallow their "understanding" really is.

Text-only baselines

An important control: what happens when you feed only the text (no images) to powerful LLMs? Or when you add OCR-extracted text or LLaVA-generated captions as a substitute for actual image understanding?

The result: no significant improvement. Adding OCR or captions to text-only LLMs doesn't help because MMMU questions require genuine visual understanding — not just text extraction from images. You can't OCR a pathology slide into useful information.
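One way to picture the caption baseline is to replace each image placeholder with a textual description before handing the question to a text-only LLM. The following is a minimal sketch of that substitution, assuming `<image N>` placeholders and a caption per image. The function name, the bracketed format, and the sample caption are hypothetical, not the paper's exact prompt.

```python
import re

IMAGE_TAG = re.compile(r"<image (\d+)>")

def to_text_only_prompt(question: str, captions: dict[int, str]) -> str:
    """Replace each <image N> placeholder with a caption, if one is available."""
    def replace(match: re.Match) -> str:
        idx = int(match.group(1))
        caption = captions.get(idx)
        if caption:
            return f"[image {idx}: {caption}]"
        return f"[image {idx}: unavailable]"
    return IMAGE_TAG.sub(replace, question)

question = "Find VCE for the circuit shown in <image 1>."
captions = {1: "a schematic with a transistor and several resistors"}
prompt = to_text_only_prompt(question, captions)
print(prompt)
```

The example also shows why the baseline falls short: a caption like this names what is in the picture but not the component values or the wiring, so the information needed to compute the answer never reaches the language model.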

Model Performance on MMMU

Accuracy on the MMMU validation set. The gap between the best AI model and human experts is massive. Note how open-source models cluster near random chance.

The sobering reality: Models that "solved" standard VQA (85-93% accuracy) can barely outperform random guessing on MMMU (30-34%). This isn't an incremental gap — it's a chasm. MMMU exposes that high scores on perception-focused benchmarks don't translate to expert-level multimodal reasoning.
Why doesn't adding OCR or image captions to text-only LLMs improve MMMU performance?

Chapter 6: The Human-AI Gap

Let's put the numbers side by side and understand what they mean.

Human (best expert)
88.6% — college seniors in their field, with textbook access
GPT-4V
55.7% — the most capable AI system at time of publication
Best open-source
~34% — barely above random (24%)
Random chance
~24% — guessing on 4-option multiple choice

The gap between human experts (88.6%) and the best AI (55.7%) is 32.9 percentage points. For context, on MMLU (the text-only equivalent), GPT-4 scores ~86% — nearly matching human experts. But add images and domain-specific visual reasoning, and the gap explodes.
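The gap arithmetic can be made explicit. The sketch below uses the figures quoted in this explainer (88.6% best human expert, 55.7% GPT-4V, and the 23.9% random baseline) and adds a chance-normalized view. That normalization is an illustrative framing introduced here, not a metric from the paper.

```python
HUMAN_BEST = 88.6   # best human expert, percent
GPT4V = 55.7        # GPT-4V, percent
RANDOM = 23.9       # random-choice baseline, percent

gap_to_human = HUMAN_BEST - GPT4V
gap_to_random = GPT4V - RANDOM

# Share of the random-to-human distance that GPT-4V has covered.
fraction_closed = gap_to_random / (HUMAN_BEST - RANDOM)

print(f"Gap to human:  {gap_to_human:.1f} points")
print(f"Gap to random: {gap_to_random:.1f} points")
print(f"Fraction of the random-to-human range covered: {fraction_closed:.2f}")
```

On these numbers GPT-4V sits just under the halfway point between guessing and the best human expert.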

What the gap reveals

This isn't just a "we need better models" gap. It reveals fundamental architectural limitations:

The Human-AI Gap

Visualizing the chasm between human expert performance and AI models on MMMU. The gap is far larger than on any previous multimodal benchmark.

Why is the human-AI gap on MMMU so much larger than on text-only benchmarks like MMLU?

Chapter 7: Error Analysis

The authors analyzed 150 error cases from GPT-4V to understand why the model fails. The taxonomy of errors reveals where the bottlenecks lie — and they're spread across all three skills.

Error breakdown

Perceptual errors (35%)
The model misreads the visual input — confusing symbols, misidentifying structures, failing to parse diagrams. These are basic vision failures.
Knowledge gaps (29%)
The model perceives the image correctly but lacks the domain knowledge to interpret it. It sees the MRI but doesn't know what "fat necrosis" looks like.
Reasoning failures (26%)
The model sees correctly AND has relevant knowledge, but fails to chain the reasoning steps. It skips a step in a mathematical derivation or draws an incorrect logical inference.
Other errors (10%)
Textual understanding (4%), refusal to answer (3%), annotation errors (2%), answer extraction (1%)

Perception: The biggest bottleneck

The most striking finding: 35% of GPT-4V's errors are pure perception failures. The model misreads what's in the image. These are "easy for humans, hard for AI" errors — a human can immediately see that a musical interval is a diminished fifth, but GPT-4V confuses it with a minor seventh.

This is particularly damaging because perception errors cascade. If you misread the circuit diagram, your Kirchhoff's law calculation will be wrong no matter how good your math is. If you misidentify the tissue type in a pathology slide, no amount of medical knowledge can save you.

The interplay of language and vision

The error analysis reveals a subtle problem: language can both help and hurt visual understanding. In some cases, text context helps the model make sense of ambiguous visuals. But in other cases, text leads the model to hallucinate — it generates plausible-sounding medical terminology based on the text prompt while ignoring what's actually in the image.

GPT-4V Error Taxonomy

Distribution of 150 analyzed error cases. Perception errors dominate, followed by knowledge gaps and reasoning failures.

The cascading failure insight: Errors in perception, knowledge, and reasoning are not independent; they cascade. A perception error (misreading a diagram) can lead the model to apply the wrong formula, which in turn produces an incorrect derivation. Fixing perception would therefore remove not only the misreadings themselves but also the downstream mistakes they trigger, which makes the 35% perception share a high-leverage target.
What is the largest single category of GPT-4V errors on MMMU?

Chapter 8: Subject-Level Analysis

Not all subjects are equally hard for AI models. The per-discipline breakdown reveals a clear pattern: models do best on subjects with simpler visual content and worst on subjects requiring complex visual reasoning.

Where models struggle most

Science (among the worst): Questions involving mathematical notation, chemical structures, and geometric constructions require precise visual parsing that current models handle poorly. Reading a complex integral from an image and then computing it is a two-stage challenge where both stages can fail.

Health & Medicine: Medical imaging (MRI, CT, pathology) uses visual conventions completely different from natural images. A "bright spot" on an MRI means something specific in radiology, and models lack this specialized visual vocabulary.

Tech & Engineering: Circuit diagrams, architectural plans, and mechanical drawings require understanding domain-specific symbolic conventions (e.g., reading resistor values from a schematic).

Where models do relatively better

Art & Design (among the best): GPT-4V scores well above its overall average on Art & Design (about 65% on the test set in the paper's results). Why? Visual content in art questions (paintings, design mockups) is closer to the natural images that vision encoders were trained on. Plus, art history knowledge is well-represented in LLM training data.

Humanities & Social Science: Questions often involve photographs, cartoons, and maps — familiar visual formats. The reasoning required is often interpretive rather than computational.

The visual complexity hypothesis: Model performance correlates inversely with the visual complexity and domain-specificity of the image types. Disciplines with natural-image-like visuals (art, humanities) see higher scores. Disciplines with specialized notation systems (science, engineering, medicine) see lower scores. This suggests the bottleneck is in the vision encoder, not the language model.
Performance by Discipline (GPT-4V)

GPT-4V accuracy across six disciplines on the test set. Disciplines with familiar visual content (Art, Humanities) outperform those with specialized imagery (Science, Medicine).

Why do models perform better on Art & Design than on Science or Medicine in MMMU?

Chapter 9: Connections

MMMU sits at a critical junction in the evolution of multimodal evaluation. Here's how it connects to the broader landscape.

Predecessors

Contemporaries and successors

Impact on the field

MMMU became the standard expert-level multimodal benchmark. Within a year of release, it appeared in the evaluation suites of major multimodal model releases such as GPT-4o, Claude 3, Gemini 1.5, LLaVA-NeXT, and InternVL. Scores on MMMU are now reported in model cards alongside MMLU, HumanEval, and other flagship benchmarks.

More importantly, MMMU shifted the conversation. Before MMMU, the narrative was "VLMs are nearly solving vision." After MMMU, the narrative became "VLMs can see, but they can't think." This reframing directed research attention toward deeper multimodal reasoning rather than better perception alone.

MMMU's legacy: MMMU demonstrated that the gap between "seeing" and "understanding" is enormous. It motivated a generation of models that explicitly target expert-level multimodal reasoning, not just perception. Most major VLMs released in 2024 report an MMMU score, making it the de facto benchmark for measuring progress toward Expert AGI.

Cheat sheet

Scale
11.5K questions, 30 subjects, 6 disciplines, 183 subfields, 30 image types
Key finding
GPT-4V: 55.7% | Human experts: 88.6% | Gap: 32.9 points
Error breakdown
Perception 35% | Knowledge 29% | Reasoning 26% | Other 10%
Innovation
First benchmark requiring expert domain knowledge + multimodal reasoning at college level
Impact
Standard expert-level VLM benchmark; widely reported in major model cards since 2024
How does MMMU differ from its closest predecessor, ScienceQA?