MMMU — Veanors

Chapter 0: The Problem

By late 2023, multimodal AI models were crushing every benchmark in sight. CogVLM hit 85% on VQA-v2. LLaVA scored 92% on ScienceQA-IMG. RefCOCO? 93%. The leaderboards told a story of rapid, inexorable progress toward human-level visual understanding.

But something was deeply wrong with that story.

All of these benchmarks test perception — recognizing objects, reading text in images, answering common-sense questions about everyday scenes. They ask "what color is the car?" or "how many people are in this photo?" A middle-schooler could answer most of them. And that's the problem: when your test is easy, a high score tells you nothing about whether the system actually understands anything.

Consider what a real expert does. A radiologist doesn't just see a bright spot on an MRI — they combine visual perception with years of medical knowledge to diagnose fat necrosis versus hematoma versus susceptibility artifact. A music theorist doesn't just see sheet music — they identify that a notated interval is a diminished fifth, not a minor seventh. An electrical engineer doesn't just see a circuit diagram — they apply Kirchhoff's laws to compute V_CE = 3.75V.

None of the existing benchmarks test this kind of expert-level multimodal reasoning. They test the visual equivalent of asking a language model to complete "The cat sat on the ___." High scores, zero insight into actual intelligence.

The gap in evaluation: Text-only benchmarks like MMLU tested expert knowledge across 57 subjects — but ignored images entirely. Multimodal benchmarks like VQA tested visual perception — but only at a common-sense level. No benchmark tested both: expert domain knowledge combined with sophisticated visual understanding. MMMU fills this gap with 11.5K college-level multimodal exam questions.

Why were existing multimodal benchmarks insufficient for measuring progress toward expert-level AI?

They only tested basic visual perception and common-sense knowledge, not the combination of domain expertise and visual reasoning that real experts use
They were too small
They didn't include enough images

Chapter 1: The Key Insight

How do you test whether an AI system has expert-level multimodal understanding? The MMMU authors realized the answer was hiding in plain sight: college exams.

College exams are purpose-built evaluation instruments. They are designed by domain experts to test whether a student has mastered a subject at a professional level. They combine text and images naturally — circuit diagrams in electrical engineering, pathology slides in medicine, sheet music in music theory, molecular structures in chemistry. And critically, they require three skills simultaneously:

Perception: Accurately parsing heterogeneous visual inputs — not just photographs, but diagrams, charts, tables, chemical structures, medical scans, geometric figures, musical notation
Knowledge: Recalling domain-specific facts and principles — Fourier transforms, equilibrium theory, art history, pharmacology, circuit laws
Reasoning: Applying that knowledge to the perceived visual information through multi-step logical, mathematical, or spatial reasoning to arrive at a solution

The three-skill pyramid: Most existing benchmarks test only perception. Some test perception + common-sense knowledge. MMMU is the first to systematically test all three: perception + domain expertise + deliberate reasoning. A model can't "hack" MMMU by being good at one skill — it needs all three, operating together, at an expert level.

This is why MMMU's subtitle is "A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI." The authors explicitly frame it as measuring progress toward Level 3 AGI in the Morris et al. taxonomy — systems that perform at "the 90th percentile of skilled adults" across broad domains.

The genius of using college exams is that difficulty is built in by construction. You don't need to artificially make questions harder. A second-year medical imaging exam is hard. A graduate-level calculus problem is hard. The domain provides the challenge.

What three skills must a model combine to succeed on MMMU?

Visual perception of heterogeneous image types, domain-specific knowledge, and multi-step reasoning, all operating together at an expert level
Object detection, image captioning, and text generation
Speed, accuracy, and model size

Chapter 2: Benchmark Design

MMMU contains 11,550 multimodal questions spanning 6 disciplines, 30 subjects, and 183 subfields. This isn't a random collection — it's a carefully structured taxonomy designed for comprehensive coverage of expert knowledge.

The six disciplines

Art & Design (11%)

Music, art, design — sheet music analysis, art history, visual design principles

Business (14%)

Marketing, finance, accounting, economics — charts, financial statements, market data

Science (23%)

Math, physics, chemistry, biology — formulas, molecular structures, experimental plots

Health & Medicine (17%)

Basic medical science, clinical medicine, diagnostics, pharmacy, public health: MRI scans, pathology slides, anatomical diagrams

Humanities & Social Sci (9%)

History, geography, psychology, sociology — maps, political cartoons, statistical graphs

Tech & Engineering (26%)

EE, CS, mechanical, civil — circuit diagrams, architecture plans, code snippets

Data curation: A three-stage process

Stage 1: Subject selection. University professors identified which subjects commonly use visual inputs. Subjects like law and linguistics were excluded because multimodal questions are rare in those fields.

Stage 2: Question collection. Over 50 university students specializing in these subjects collected questions from major textbooks, online resources, and lecture materials. They were instructed to select questions without immediately available answers (e.g., answers in separate documents) to minimize data contamination.

Stage 3: Quality control. A multi-step cleaning pipeline: duplicate detection via lexical overlap and URL similarity, format/typo checking by co-authors, and difficulty categorization. Approximately 10% of questions classified as "very easy" were removed to maintain the benchmark's expert-level difficulty.
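To make the duplicate-detection step concrete, here is a minimal Python sketch of lexical-overlap filtering. The source does not say which overlap measure or threshold the authors used, so the Jaccard similarity, the 0.8 cutoff, and every function name below are illustrative assumptions rather than the MMMU pipeline itself.

```python
import re


def tokenize(text: str) -> set:
    """Lowercase a question and reduce it to a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Jaccard overlap between the token sets of two questions."""
    tokens_a, tokens_b = tokenize(a), tokenize(b)
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def flag_near_duplicates(questions: list, threshold: float = 0.8) -> list:
    """Return index pairs whose overlap meets the (assumed) threshold."""
    pairs = []
    for i in range(len(questions)):
        for j in range(i + 1, len(questions)):
            if jaccard(questions[i], questions[j]) >= threshold:
                pairs.append((i, j))
    return pairs


questions = [
    "Find V_CE for the circuit shown in <image 1>.",
    "Find V_CE for the circuit shown in <image 1>",
    "What is the etiology of the finding in the left breast?",
]
duplicate_pairs = flag_near_duplicates(questions)
print(duplicate_pairs)
```

Flagged pairs would still need a human look, which matches the description of co-authors checking the data by hand. The pairwise loop is quadratic, so a real pipeline over thousands of questions would likely need something faster.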

Key statistics

Split

Dev: 150 | Validation: 900 | Test: 10,500

Difficulty

Easy: 28% | Medium: 45% | Hard: 27%

Image types

30 heterogeneous types: diagrams, charts, photos, chemical structures, sheet music, medical scans, ...

Image position

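Beginning: 18% | Middle: 37% | End: 50%. Images are truly interleaved with the text; the shares total more than 100%, presumably because a question with several images can count toward more than one position.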

Discipline Distribution

How the 11,550 questions are distributed across six disciplines. Engineering and Science dominate, reflecting the prevalence of multimodal content in STEM fields.

Why were approximately 10% of collected questions removed from MMMU?

They were classified as "very easy" and didn't meet the benchmark's goal of testing expert-level reasoning
They had copyright issues
They were duplicates

Chapter 3: Question Types

MMMU questions are not the simple "what is in this image?" format you see in VQA. They mirror actual college exam questions, with two key structural innovations that make them uniquely challenging for multimodal models.

Format: Multiple-choice + open-ended

Multiple-choice (94%): The vast majority of MMMU questions present 4 options. But unlike typical VQA multiple-choice, the options often require domain expertise to even understand — "susceptibility artifact" vs. "fat necrosis" vs. "hematoma" in radiology, or integral expressions involving specific function bounds in calculus.

Open-ended (6%): These questions require the model to produce a numerical answer, short phrase, or formula. No options to guess from. For example: "Find V_CE for the circuit shown" with answer 3.75V.
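Scoring open-ended answers is harder than scoring a letter choice, because "3.75V", "3.75 V", and "V_CE = 3.75 volts" should all count as the same answer. The sketch below shows one way to compare numerical answers. It is an assumption for illustration, not MMMU's official scoring code: the function names, the 1% relative tolerance, and the rule of taking the first number in the string are all hypothetical choices.

```python
import math
import re
from typing import Optional


def extract_number(answer: str) -> Optional[float]:
    """Pull the first decimal number out of a free-form answer string."""
    match = re.search(r"-?\d+(?:\.\d+)?", answer.replace(",", ""))
    return float(match.group()) if match else None


def numeric_match(prediction: str, reference: str, rel_tol: float = 0.01) -> bool:
    """True when both strings contain a number and the two agree within rel_tol."""
    predicted = extract_number(prediction)
    expected = extract_number(reference)
    if predicted is None or expected is None:
        return False
    return math.isclose(predicted, expected, rel_tol=rel_tol)


correct = numeric_match("V_CE = 3.75 V", "3.75V")
wrong = numeric_match("about 4 volts", "3.75V")
no_number = numeric_match("cannot be determined", "3.75V")
print(correct, wrong, no_number)
```

Taking the first number is a deliberate simplification: a response such as "the 2 resistors give 3.75 V" would be misread, so a production scorer would need a more careful extraction rule, plus separate handling for phrase and formula answers.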

The interleaved image challenge

This is what truly sets MMMU apart. Images don't just accompany questions — they are woven into the question text. A breast MRI question might read: "You are shown subtraction <image 1>, T2 weighted <image 2> and T1 weighted axial <image 3> from a screening breast MRI. What is the etiology of the finding in the left breast?"

The model must:

Parse the text to understand which image corresponds to which modality
Interpret each medical image using domain-specific visual knowledge
Synthesize information across all three images
Apply clinical reasoning to reach a diagnosis

7.4% of questions include multiple images, and images appear at different positions: beginning (18%), middle (37%), or end (50%) of the question text.
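The first step in the list above, working out which image belongs where in the text, can be sketched in a few lines of Python. The `<image N>` placeholder format is taken from the example question; the function name and the segment representation are assumptions made for illustration, not part of any official MMMU tooling.

```python
import re

# Placeholder format as it appears in the example question, e.g. "<image 1>".
IMAGE_TAG = re.compile(r"<image (\d+)>")


def split_interleaved(question: str) -> list:
    """Split a question into ordered ("text", str) and ("image", id) segments."""
    segments = []
    cursor = 0
    for match in IMAGE_TAG.finditer(question):
        text = question[cursor:match.start()].strip()
        if text:
            segments.append(("text", text))
        segments.append(("image", match.group(1)))
        cursor = match.end()
    tail = question[cursor:].strip()
    if tail:
        segments.append(("text", tail))
    return segments


question = (
    "You are shown subtraction <image 1>, T2 weighted <image 2> and "
    "T1 weighted axial <image 3> from a screening breast MRI. "
    "What is the etiology of the finding in the left breast?"
)
segments = split_interleaved(question)
image_ids = [value for kind, value in segments if kind == "image"]
print(image_ids)
for kind, value in segments:
    print(kind, "|", value)
```

Keeping the segments in order preserves the pairing between each modality name ("subtraction", "T2 weighted", "T1 weighted axial") and the image that follows it, which is exactly the information a model loses if all images are simply prepended to the text.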

30 heterogeneous image types

This is perhaps the most brutal aspect of MMMU. The 30 image types span wildly different visual domains:

Scientific: Diagrams, plots/charts, tables, mathematical notations, chemical structures, geometric shapes
Medical: MRI scans, CT scans, pathology slides, microscopic images, X-rays
Artistic: Paintings, sheet music, architectural plans, design mockups
Documentary: Photographs, maps, comics/cartoons, screenshots

Why heterogeneity matters: A model trained primarily on natural photographs (which most vision encoders are) must generalize to radically different visual domains — reading sheet music notation, parsing circuit schematics, interpreting pathology stains. Each domain has its own visual grammar. This is where most models fail catastrophically.

MMMU Question Anatomy

How MMMU questions interleave text and images across different disciplines.

What makes MMMU's interleaved text-image format harder than typical VQA?

Images are woven into the question text at different positions, requiring the model to jointly parse text and multiple images using domain-specific visual knowledge
The images are higher resolution
There are more answer options

Chapter 4: Difficulty Calibration

A benchmark is only as good as the quality of its questions. The MMMU team invested heavily in ensuring that every question genuinely tests expert-level understanding — not just pattern matching or lucky guessing.

Human expert validation

The authors recruited 90 college senior students — 3 experts per subject across all 30 subjects — to take the benchmark. These weren't random participants; they were students specializing in the relevant field, and they were allowed to consult their textbooks. This establishes a realistic upper bound: humans with domain training and reference materials.

The results were telling:

Best expert per subject: 88.6% average accuracy
Medium expert: 82.6%
Worst expert per subject: 76.2%

Even the worst human experts scored 76.2% — significantly above any AI model. And these are students, not professors. A seasoned professional would likely score even higher.

Difficulty stratification

Questions are categorized into three difficulty levels based on expert annotation:

Easy (28%): Requires basic domain knowledge and straightforward visual interpretation
Medium (45%): Requires combining multiple pieces of domain knowledge with careful visual analysis
Hard (27%): Requires deep expertise, multi-step reasoning, and nuanced visual interpretation

The "very easy" questions (approximately 10% of the original collection) were removed entirely. If a question could be answered without domain expertise, it doesn't belong in MMMU.

Anti-contamination measures

Data contamination — where the test questions appear in the model's training data — is a growing concern. MMMU addresses this in two ways:

Source diversity: Questions come from textbooks, exams, and newly created problems — not from widely-scraped web sources
Answer separation: Annotators were instructed to select questions whose answers aren't immediately adjacent (e.g., answers at the back of the textbook or in separate answer keys), making it harder for web scraping to pair question with answer

Why random chance is ~24%: With 4-option multiple-choice (94% of questions), random guessing gives about 25% on those. Open-ended questions (6%) are nearly impossible to guess. Weighted together, that is roughly 0.94 × 25% ≈ 23.5%, in line with the random-choice baseline of 23.9% reported on the test set. The "frequent choice" baseline (always picking the most common answer) scores 25.8%, barely above random. These sanity checks confirm the benchmark isn't exploitable by simple heuristics.
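The weighting described above can be checked with a short calculation. This is a back-of-the-envelope sketch: it assumes every multiple-choice question has exactly four options and that a random guess on an open-ended question never succeeds, both of which are simplifications.

```python
# Shares of each question format, as quoted in the text.
MC_SHARE = 0.94
OPEN_SHARE = 0.06

NUM_OPTIONS = 4          # assumption: every multiple-choice item has 4 options
OPEN_GUESS_RATE = 0.0    # assumption: open-ended answers cannot be guessed

expected_random = MC_SHARE * (1 / NUM_OPTIONS) + OPEN_SHARE * OPEN_GUESS_RATE
print(f"Expected random accuracy: {expected_random:.1%}")
```

The estimate lands slightly below the 23.9% quoted for the test set. That figure comes from an actual random baseline, so a small difference from the idealized calculation is to be expected.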

Why were human experts allowed to consult textbooks during evaluation?

To establish a realistic upper bound: expert performance with reference materials mirrors how professionals actually work, and AI models also have their training data as "reference"
To make the questions easier
Because the questions were too hard without references

Chapter 5: Model Evaluation

MMMU evaluates 28 open-source large multimodal models (LMMs) plus proprietary systems including GPT-4V(ision). All evaluations are zero-shot — no fine-tuning or few-shot examples on MMMU data. This tests raw generalization.
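To illustrate what zero-shot evaluation means in practice, here is a hedged Python sketch of formatting one multiple-choice question and pulling a letter out of the model's reply. The prompt wording, the function names, and the extraction rule are all assumptions for illustration. The source does not describe the exact prompts or parsing used in the MMMU evaluation.

```python
import re
import string
from typing import Optional


def build_zero_shot_prompt(question: str, options: list) -> str:
    """Format a single question with lettered options and no solved examples."""
    lines = [question, ""]
    for letter, option in zip(string.ascii_uppercase, options):
        lines.append(f"({letter}) {option}")
    lines.append("")
    lines.append("Answer with the letter of the correct option.")
    return "\n".join(lines)


def extract_choice(response: str, num_options: int) -> Optional[str]:
    """Return the first standalone option letter in the response, if any.

    Simplification: a response that starts with the English article "A"
    would be misread as choice A.
    """
    valid = string.ascii_uppercase[:num_options]
    match = re.search(rf"\b([{valid}])\b", response)
    return match.group(1) if match else None


prompt = build_zero_shot_prompt(
    "What is the etiology of the finding in the left breast?",
    ["Fat necrosis", "Hematoma", "Susceptibility artifact"],
)
choice = extract_choice("The answer is (B).", num_options=3)
unparsed = extract_choice("I cannot tell.", num_options=3)
print(prompt)
print(choice, unparsed)
```

The key property is that the prompt contains only the question itself: no worked examples and no MMMU-specific fine-tuning, so the score reflects what the model already knows.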

The proprietary frontier

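At the time of publication, GPT-4V achieved 55.7% accuracy on the test set, the highest score of any model evaluated. This sounds respectable until you remember: random chance is ~24%, and the best human experts score 88.6% (measured on the validation set). GPT-4V sits about 32 points above random and about 33 points below the best humans, so it is marginally closer to random than to human.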

The open-source landscape

Open-source models performed dramatically worse. The best open-source results at publication:

LLaVA-1.5: ~34% (13B parameters)
BLIP-2 FLAN-T5-XXL: ~34%
InstructBLIP: ~34%
Qwen-VL-7B-Chat: ~33%
CogVLM: ~30%

These same models score 85-93% on standard VQA benchmarks. On MMMU, they barely beat random chance. The contrast is stark and reveals how shallow their "understanding" really is.

Text-only baselines

An important control: what happens when you feed only the text (no images) to powerful LLMs? Or when you add OCR-extracted text or LLaVA-generated captions as a substitute for actual image understanding?

The result: no significant improvement. Adding OCR or captions to text-only LLMs doesn't help because MMMU questions require genuine visual understanding — not just text extraction from images. You can't OCR a pathology slide into useful information.

Model Performance on MMMU

Accuracy on the MMMU validation set. The gap between the best AI model and human experts is massive. Note how open-source models cluster near random chance.

The sobering reality: Models that "solved" standard VQA (85-93% accuracy) can barely outperform random guessing on MMMU (30-34%). This isn't an incremental gap — it's a chasm. MMMU exposes that high scores on perception-focused benchmarks don't translate to expert-level multimodal reasoning.

Why doesn't adding OCR or image captions to text-only LLMs improve MMMU performance?

MMMU requires genuine visual understanding: you can't reduce a pathology slide or circuit diagram to useful text via OCR or captions
The OCR quality is too low
Text-only models can't read

Chapter 6: The Human-AI Gap

Let's put the numbers side by side and understand what they mean.

Human (best expert)

88.6% — college seniors in their field, with textbook access

GPT-4V

55.7% — the most capable AI system at time of publication

Best open-source

~34% — barely above random (24%)

Random chance

~24% — guessing on 4-option multiple choice

The gap between human experts (88.6%) and the best AI (55.7%) is 32.9 percentage points. For context, on MMLU (the text-only equivalent), GPT-4 scores ~86% — nearly matching human experts. But add images and domain-specific visual reasoning, and the gap explodes.
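The arithmetic behind these comparisons is simple enough to verify directly. The sketch below uses only figures quoted in this guide (88.6% for the best human experts, 55.7% for GPT-4V, 23.9% for the test-set random baseline). The "normalized" score, which rescales accuracy so that random guessing is 0 and the best human is 1, is an illustrative addition and not a metric reported by the MMMU authors.

```python
human_best = 88.6      # best human expert accuracy, in percent
gpt4v = 55.7           # GPT-4V accuracy, in percent
random_chance = 23.9   # random-choice baseline, in percent

gap_to_human = round(human_best - gpt4v, 1)
gap_to_random = round(gpt4v - random_chance, 1)

# Fraction of the distance from random guessing to the best human covered by GPT-4V.
normalized = (gpt4v - random_chance) / (human_best - random_chance)

print(f"Gap to human:  {gap_to_human} points")
print(f"Gap to random: {gap_to_random} points")
print(f"Normalized score: {normalized:.2f}")
```

On these numbers GPT-4V covers just under half of the distance between random guessing and the best human experts. Note that the comparison mixes figures from different evaluation runs, so it is best read as a rough indication rather than a precise measurement.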

What the gap reveals

This isn't just a "we need better models" gap. It reveals fundamental architectural limitations:

Vision encoders are biased toward natural images. CLIP and similar encoders were trained on web photos. They've never seen a pathology slide or a Fourier transform diagram in training. The visual representations they produce for these domains are impoverished.
Multimodal alignment is shallow. Current approaches (linear projection, Q-Former, etc.) align vision and language at a surface level. They don't produce the deep, domain-specific visual understanding needed for expert reasoning.
Reasoning chains break under visual complexity. Even when models perceive correctly, they struggle to chain multiple reasoning steps that depend on visual information — applying a formula from the text to numbers read from a chart, for instance.

The Human-AI Gap

Visualizing the chasm between human expert performance and AI models on MMMU. The gap is far larger than on any previous multimodal benchmark.

Why is the human-AI gap on MMMU so much larger than on text-only benchmarks like MMLU?

MMMU adds expert-level visual understanding on top of domain knowledge: vision encoders trained on web photos can't parse medical scans or circuit diagrams, and multimodal alignment is too shallow for domain-specific reasoning
MMMU has more questions
The human experts were given more time

Chapter 7: Error Analysis

The authors analyzed 150 error cases from GPT-4V to understand why the model fails. The taxonomy of errors reveals where the bottlenecks lie — and they're spread across all three skills.

Error breakdown

Perceptual errors (35%)

The model misreads the visual input — confusing symbols, misidentifying structures, failing to parse diagrams. These are basic vision failures.

Knowledge gaps (29%)

The model perceives the image correctly but lacks the domain knowledge to interpret it. It sees the MRI but doesn't know what "fat necrosis" looks like.

Reasoning failures (26%)

The model sees correctly AND has relevant knowledge, but fails to chain the reasoning steps. It skips a step in a mathematical derivation or draws an incorrect logical inference.

Other errors (10%)

Textual understanding (4%), refusal to answer (3%), annotation errors (2%), answer extraction (1%)

Perception: The biggest bottleneck

The most striking finding: 35% of GPT-4V's errors are pure perception failures. The model misreads what's in the image. These are "easy for humans, hard for AI" errors — a human can immediately see that a musical interval is a diminished fifth, but GPT-4V confuses it with a minor seventh.

This is particularly damaging because perception errors cascade. If you misread the circuit diagram, your Kirchhoff's law calculation will be wrong no matter how good your math is. If you misidentify the tissue type in a pathology slide, no amount of medical knowledge can save you.

The interplay of language and vision

The error analysis reveals a subtle problem: language can both help and hurt visual understanding. In some cases, text context helps the model make sense of ambiguous visuals. But in other cases, text leads the model to hallucinate — it generates plausible-sounding medical terminology based on the text prompt while ignoring what's actually in the image.

GPT-4V Error Taxonomy

Distribution of 150 analyzed error cases. Perception errors dominate, followed by knowledge gaps and reasoning failures.

The cascading failure insight: Errors in perception, knowledge, and reasoning are not independent; they cascade. A perception error (misreading a diagram) can lead to a knowledge error (applying the wrong formula), which in turn leads to a reasoning error (incorrect derivation). Fixing perception alone could therefore eliminate more than the 35% of errors attributed to it, because it would also break these cascading chains.

What is the largest single category of GPT-4V errors on MMMU?

Perceptual errors (35%): the model misreads or misinterprets the visual input, which then cascades into downstream knowledge and reasoning failures
Reasoning errors
Knowledge gaps

Chapter 8: Subject-Level Analysis

Not all subjects are equally hard for AI models. The per-discipline breakdown reveals a clear pattern: models do best on subjects with simpler visual content and worst on subjects requiring complex visual reasoning.

Where models struggle most

Science (worst): Questions involving mathematical notation, chemical structures, and geometric constructions require precise visual parsing that current models handle poorly. Reading a complex integral from an image and then computing it is a two-stage challenge where both stages can fail.

Health & Medicine: Medical imaging (MRI, CT, pathology) uses visual conventions completely different from natural images. A "bright spot" on an MRI means something specific in radiology, and models lack this specialized visual vocabulary.

Tech & Engineering: Circuit diagrams, architectural plans, and mechanical drawings require understanding domain-specific symbolic conventions (e.g., reading resistor values from a schematic).

Where models do relatively better

Art & Design (best): GPT-4V achieves ~57% on Art & Design at the time of the updated leaderboard. Why? Visual content in art questions (paintings, design mockups) is closer to the natural images that vision encoders were trained on. Plus, art history knowledge is well-represented in LLM training data.

Humanities & Social Science: Questions often involve photographs, cartoons, and maps — familiar visual formats. The reasoning required is often interpretive rather than computational.

The visual complexity hypothesis: Model performance correlates inversely with the visual complexity and domain-specificity of the image types. Disciplines with natural-image-like visuals (art, humanities) see higher scores. Disciplines with specialized notation systems (science, engineering, medicine) see lower scores. This suggests the bottleneck is in the vision encoder, not the language model.

Performance by Discipline (GPT-4V)

GPT-4V accuracy across six disciplines on the test set. Disciplines with familiar visual content (Art, Humanities) outperform those with specialized imagery (Science, Medicine).

Why do models perform better on Art & Design than on Science or Medicine in MMMU?

Art questions use visual content (paintings, designs) closer to natural images that vision encoders were trained on, while science and medicine use specialized notation and imaging that encoders have rarely seen
Art questions are easier
There are fewer Art questions

Chapter 9: Connections

MMMU sits at a critical junction in the evolution of multimodal evaluation. Here's how it connects to the broader landscape.

Predecessors

VQA / VQA-v2 (2015/2017): The original visual question answering benchmarks. Focus on natural images and common-sense questions. Saturated by 2023 (85%+ accuracy). MMMU is what VQA would look like if the questions required a PhD.
MMLU (2020): The text-only equivalent — 57 subjects, expert-level knowledge questions. MMMU extends MMLU's vision to the multimodal domain, adding images as a first-class input.
ScienceQA (2022): The closest predecessor, with multimodal questions across disciplines. But ScienceQA's questions are elementary-to-middle-school level. MMMU operates at college level and above, dramatically raising the difficulty bar.
MathVista (2023): Expert-level multimodal reasoning, but limited to mathematics. MMMU covers 30 subjects — the breadth that MathVista lacks.

Contemporaries and successors

GAIA (2023): A concurrent benchmark testing fundamental AI abilities including multimodal handling and tool use. Only 466 questions — MMMU's scale (11.5K) provides far more granular signal.
MMMU-Pro (2024): The successor benchmark by the same team. Makes MMMU harder by augmenting with candidate options to reduce guessing, adding vision-only questions with no text context, and filtering out questions answerable by text-only models. Designed to remain challenging as models improve.
MMBench / SEED / MM-Vet: Holistic LMM benchmarks that test breadth of abilities (OCR, spatial reasoning, etc.) but at a common-sense rather than expert level. Complementary to MMMU rather than competing.

Impact on the field

MMMU became the standard expert-level multimodal benchmark. Within a year of release, it appeared in the evaluation suites of major multimodal model releases (GPT-4o, Claude 3, Gemini 1.5, LLaVA-NeXT, InternVL). Scores on MMMU are now reported in model cards alongside MMLU, HumanEval, and other flagship benchmarks.

More importantly, MMMU shifted the conversation. Before MMMU, the narrative was "VLMs are nearly solving vision." After MMMU, the narrative became "VLMs can see, but they can't think." This reframing directed research attention toward deeper multimodal reasoning rather than better perception alone.

MMMU's legacy: MMMU demonstrated that the gap between "seeing" and "understanding" is enormous. It motivated a generation of models that explicitly target expert-level multimodal reasoning — not just perception. Every major VLM released in 2024 reports its MMMU score, making it the de facto benchmark for measuring progress toward Expert AGI.

Cheat sheet

Scale

11.5K questions, 30 subjects, 6 disciplines, 183 subfields, 30 image types

Key finding

GPT-4V: 55.7% | Human experts: 88.6% | Gap: 32.9 points

Error breakdown

Perception 35% | Knowledge 29% | Reasoning 26% | Other 10%

Innovation

First benchmark requiring expert domain knowledge + multimodal reasoning at college level

Impact

Standard expert-level VLM benchmark; reported in every major model card since 2024

How does MMMU differ from its closest predecessor, ScienceQA?

ScienceQA tests elementary-to-middle-school knowledge; MMMU tests college-level expert knowledge across 30 subjects with 30 heterogeneous image types and interleaved text-image inputs
MMMU has fewer questions
ScienceQA doesn't include images
