The benchmark that exposed how far multimodal AI is from expert-level reasoning — 11.5K college exam questions across 30 subjects where GPT-4V scores 56.8% and humans score 88.6%.
By late 2023, multimodal AI models were crushing every benchmark in sight. CogVLM hit 85% on VQA-v2. LLaVA scored 92% on ScienceQA-IMG. RefCOCO? 93%. The leaderboards told a story of rapid, inexorable progress toward human-level visual understanding.
But something was deeply wrong with that story.
All of these benchmarks test perception — recognizing objects, reading text in images, answering common-sense questions about everyday scenes. They ask "what color is the car?" or "how many people are in this photo?" A middle-schooler could answer most of them. And that's the problem: when your test is easy, a high score tells you nothing about whether the system actually understands anything.
Consider what a real expert does. A radiologist doesn't just see a bright spot on an MRI — they combine visual perception with years of medical knowledge to diagnose fat necrosis versus hematoma versus susceptibility artifact. A music theorist doesn't just see sheet music — they identify that a notated interval is a diminished fifth, not a minor seventh. An electrical engineer doesn't just see a circuit diagram — they apply Kirchhoff's laws to compute VCE = 3.75V.
None of the existing benchmarks test this kind of expert-level multimodal reasoning. They test the visual equivalent of asking a language model to complete "The cat sat on the ___." High scores, zero insight into actual intelligence.
How do you test whether an AI system has expert-level multimodal understanding? The MMMU authors realized the answer was hiding in plain sight: college exams.
College exams are purpose-built evaluation instruments. They are designed by domain experts to test whether a student has mastered a subject at a professional level. They combine text and images naturally: circuit diagrams in electrical engineering, pathology slides in medicine, sheet music in music theory, molecular structures in chemistry. And critically, they require three skills simultaneously: perception, domain knowledge, and reasoning.
This is why MMMU's subtitle is "A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI." The authors explicitly frame it as measuring progress toward Level 3 AGI in the Morris et al. taxonomy — systems that perform at "the 90th percentile of skilled adults" across broad domains.
The genius of using college exams is that difficulty is built in by construction. You don't need to artificially make questions harder. A second-year medical imaging exam is hard. A graduate-level calculus problem is hard. The domain provides the challenge.
MMMU contains 11,550 multimodal questions spanning 6 disciplines, 30 subjects, and 183 subfields. This isn't a random collection — it's a carefully structured taxonomy designed for comprehensive coverage of expert knowledge.
Stage 1: Subject selection. University professors identified which subjects commonly use visual inputs. Subjects like law and linguistics were excluded because multimodal questions are rare in those fields.
Stage 2: Question collection. Over 50 university students specializing in these subjects collected questions from major textbooks, online resources, and lecture materials. They were instructed to select questions without immediately available answers (e.g., answers in separate documents) to minimize data contamination.
Stage 3: Quality control. A multi-step cleaning pipeline: duplicate detection via lexical overlap and URL similarity, format/typo checking by co-authors, and difficulty categorization. Questions classified as "very easy" (approximately 10% of the collection) were removed to maintain the benchmark's expert-level difficulty.
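The duplicate-detection step can be pictured with a small sketch. The text only says "lexical overlap", so the Jaccard measure on word sets, the 0.8 threshold, and the function names below are assumptions chosen for illustration, not the authors' actual pipeline.

```python
import re
from itertools import combinations


def tokenize(text: str) -> set[str]:
    """Lowercase a question and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def find_near_duplicates(questions: list[str], threshold: float = 0.8) -> list[tuple[int, int]]:
    """Return index pairs whose token overlap meets the (assumed) threshold."""
    tokens = [tokenize(q) for q in questions]
    return [
        (i, j)
        for i, j in combinations(range(len(questions)), 2)
        if jaccard(tokens[i], tokens[j]) >= threshold
    ]


questions = [
    "Find VCE for the circuit shown in <image 1>.",
    "Find VCE for the circuit shown in <image 1>",
    "What is the etiology of the finding in the left breast?",
]
pairs = find_near_duplicates(questions)  # only the first two questions are flagged
```

The pairwise comparison is quadratic in the number of questions, which is acceptable at the scale of roughly ten thousand items but would need blocking or hashing for much larger collections.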
How the 11,550 questions are distributed across six disciplines. Engineering and Science dominate, reflecting the prevalence of multimodal content in STEM fields.
MMMU questions are not the simple "what is in this image?" format you see in VQA. They mirror actual college exam questions, with two key structural innovations that make them uniquely challenging for multimodal models.
Multiple-choice (94%): The vast majority of MMMU questions present 4 options. But unlike typical VQA multiple-choice, the options often require domain expertise to even understand — "susceptibility artifact" vs. "fat necrosis" vs. "hematoma" in radiology, or integral expressions involving specific function bounds in calculus.
Open-ended (6%): These questions require the model to produce a numerical answer, short phrase, or formula. No options to guess from. For example: "Find VCE for the circuit shown" with answer 3.75V.
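The two formats call for different scoring rules. The sketch below is one plausible way to represent and score them; the `Question` class, the leading-number parser, and the 1% relative tolerance are illustrative assumptions, not the benchmark's official evaluation code.

```python
import re
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Question:
    text: str
    answer: str  # option letter for multiple-choice, gold answer for open-ended
    options: list[str] = field(default_factory=list)  # empty for open-ended

    @property
    def is_multiple_choice(self) -> bool:
        return bool(self.options)


def _leading_number(s: str) -> Optional[float]:
    """Parse a leading number from strings like '3.75V'; None if there is none."""
    m = re.match(r"\s*(-?\d+(?:\.\d+)?)", s)
    return float(m.group(1)) if m else None


def is_correct(q: Question, prediction: str, rel_tol: float = 1e-2) -> bool:
    """Exact letter match for multiple-choice; tolerant numeric or string match otherwise."""
    if q.is_multiple_choice:
        return prediction.strip().upper() == q.answer.strip().upper()
    gold, pred = _leading_number(q.answer), _leading_number(prediction)
    if gold is not None and pred is not None:
        return abs(gold - pred) <= rel_tol * max(abs(gold), 1e-12)
    return prediction.strip().lower() == q.answer.strip().lower()


mc = Question("What is the etiology of the finding?", "B",
              ["Susceptibility artifact", "Hematoma", "Fat necrosis"])
open_ended = Question("Find VCE for the circuit shown", "3.75V")
```

Note that this simple parser ignores units, so a real scorer would also need to decide whether "3.75 mV" should count as a match for "3.75V".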
Interleaved text and images are what truly set MMMU apart. Images don't just accompany questions; they are woven into the question text. A breast MRI question might read: "You are shown subtraction <image 1>, T2 weighted <image 2> and T1 weighted axial <image 3> from a screening breast MRI. What is the etiology of the finding in the left breast?"
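The model must work out which image each placeholder refers to, interpret each one in light of the text around it, and combine evidence across all of them before answering.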
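7.4% of questions include multiple images, and images appear at different positions: beginning (18%), middle (37%), or end (50%) of the question text. The shares sum to more than 100%, presumably because a question with several images can count toward more than one position.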
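To make the position categories concrete, here is a minimal sketch that finds `<image N>` placeholders in a question and labels each as beginning, middle, or end. The labeling heuristic is an assumption for illustration; the text does not spell out how the authors assigned positions.

```python
import re

IMAGE_TAG = re.compile(r"<image (\d+)>")


def image_positions(question: str) -> list[tuple[int, str]]:
    """Return (image number, position) for each placeholder in the question."""
    results = []
    for m in IMAGE_TAG.finditer(question):
        before = question[: m.start()].strip()
        # Trailing punctuation alone does not count as text after the image.
        after = question[m.end():].strip(" .?!\n\t")
        if not before:
            position = "beginning"
        elif not after:
            position = "end"
        else:
            position = "middle"
        results.append((int(m.group(1)), position))
    return results


mri = ("You are shown subtraction <image 1>, T2 weighted <image 2> and "
       "T1 weighted axial <image 3> from a screening breast MRI. "
       "What is the etiology of the finding in the left breast?")
mri_positions = image_positions(mri)  # three placeholders, all mid-sentence
```

In the MRI example every image sits in the middle of the sentence, so the model cannot treat the images as a detached attachment: each one is tied to the phrase that names its imaging sequence.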
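The heterogeneity of image types is perhaps the most brutal aspect of MMMU. The 30 image types span wildly different visual domains, from circuit diagrams and chemical structures to sheet music, medical scans, paintings, and maps.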
A benchmark is only as good as the quality of its questions. The MMMU team invested heavily in ensuring that every question genuinely tests expert-level understanding — not just pattern matching or lucky guessing.
The authors recruited 90 college senior students — 3 experts per subject across all 30 subjects — to take the benchmark. These weren't random participants; they were students specializing in the relevant field, and they were allowed to consult their textbooks. This establishes a realistic upper bound: humans with domain training and reference materials.
The results were telling. Even the worst human experts scored 76.2%, well above any AI model, and the best reached 88.6%. And these are students, not professors. A seasoned professional would likely score even higher.
Questions are categorized into three difficulty levels (easy, medium, and hard) based on expert annotation.
The "very easy" questions (approximately 10% of the original collection) were removed entirely. If a question could be answered without domain expertise, it doesn't belong in MMMU.
Data contamination — where the test questions appear in the model's training data — is a growing concern. MMMU addresses this in two ways:
MMMU evaluates 28 open-source large multimodal models (LMMs) plus proprietary systems including GPT-4V(ision). All evaluations are zero-shot — no fine-tuning or few-shot examples on MMMU data. This tests raw generalization.
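In a zero-shot setup the prompt contains only the question and its options, and the free-form response must then be mapped back to an option letter. The sketch below shows one way to do both; the prompt wording and the extraction heuristic are assumptions, not the official evaluation script.

```python
import re
import string
from typing import Optional


def build_zero_shot_prompt(question: str, options: list[str]) -> str:
    """Format one question with lettered options and no worked examples."""
    lines = [question]
    for letter, option in zip(string.ascii_uppercase, options):
        lines.append(f"({letter}) {option}")
    lines.append("Answer with the option's letter from the given choices directly.")
    return "\n".join(lines)


def extract_choice(response: str, num_options: int) -> Optional[str]:
    """Map a free-form response to an option letter, or None if none is found.

    Heuristic: prefer a parenthesized letter such as "(B)", then fall back to
    the first standalone capital letter in the valid range.
    """
    letters = string.ascii_uppercase[:num_options]
    for pattern in (rf"\(([{letters}])\)", rf"\b([{letters}])\b"):
        match = re.search(pattern, response)
        if match:
            return match.group(1)
    return None


prompt = build_zero_shot_prompt(
    "What is the etiology of the finding in the left breast?",
    ["Susceptibility artifact", "Hematoma", "Fat necrosis"],
)
```

Answer extraction is a real source of noise: a response that opens with the article "A" and never uses parentheses would be read as option A by the fallback pattern, which is why a stricter output instruction helps.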
At the time of publication, GPT-4V achieved 55.7% accuracy on the test set (56.8% on validation), the highest score of any model. This sounds respectable until you remember: random chance is ~24%, and human experts score 88.6%. GPT-4V sits roughly midway between random guessing and human experts.
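The ~24% random-chance figure is close to what the question mix implies. As a back-of-the-envelope check, assume every multiple-choice question has exactly four options and that a random guess never gets an open-ended question right. Both are simplifying assumptions.

```python
mc_share = 0.94            # share of multiple-choice questions (from the text)
open_share = 0.06          # share of open-ended questions (from the text)
options_per_question = 4   # assumption: every multiple-choice question has 4 options

# Expected accuracy of uniform random guessing under these assumptions.
expected_random = mc_share * (1 / options_per_question) + open_share * 0.0
print(f"{expected_random:.1%}")  # 23.5%
```

The estimate lands within a point of the quoted baseline, which is what you would expect if a minority of questions have fewer or more than four options.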
Open-source models performed dramatically worse. The best open-source results at publication:
These same models score 85-93% on standard VQA benchmarks. On MMMU, they barely beat random chance. The contrast is stark and reveals how shallow their "understanding" really is.
An important control: what happens when you feed only the text (no images) to powerful LLMs? Or when you add OCR-extracted text or LLaVA-generated captions as a substitute for actual image understanding?
The result: no significant improvement. Adding OCR or captions to text-only LLMs doesn't help because MMMU questions require genuine visual understanding — not just text extraction from images. You can't OCR a pathology slide into useful information.
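The three text-only conditions differ only in what replaces the image. A minimal sketch of how such inputs might be assembled follows; the function name, the bracketed labels, and the stand-in OCR and caption functions are all hypothetical, and a real run would call an actual OCR engine or captioning model.

```python
from typing import Callable, Optional


def build_text_only_input(
    question: str,
    image_paths: list[str],
    ocr: Optional[Callable[[str], str]] = None,
    captioner: Optional[Callable[[str], str]] = None,
) -> str:
    """Build the prompt for a text-only LLM under one ablation condition.

    - no ocr, no captioner: the question alone, images dropped
    - ocr given: extracted text for each image is appended
    - captioner given: a generated caption for each image is appended
    """
    parts = [question]
    for i, path in enumerate(image_paths, start=1):
        if ocr is not None:
            parts.append(f"[OCR of image {i}]: {ocr(path)}")
        if captioner is not None:
            parts.append(f"[Caption of image {i}]: {captioner(path)}")
    return "\n".join(parts)


def fake_ocr(path: str) -> str:
    """Hypothetical stand-in: a pathology slide has no text to extract."""
    return ""


def fake_caption(path: str) -> str:
    """Hypothetical stand-in for a generated caption."""
    return "a microscope image of pink and purple tissue"


question = "What is the diagnosis for the tissue shown in <image 1>?"
with_ocr = build_text_only_input(question, ["slide.png"], ocr=fake_ocr)
with_caption = build_text_only_input(question, ["slide.png"], captioner=fake_caption)
```

The toy outputs show why the ablation finds so little gain: the OCR condition adds an empty string, and the caption describes colors and texture without the diagnostic detail the question depends on.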
Accuracy on the MMMU validation set. The gap between the best AI model and human experts is massive. Note how open-source models cluster near random chance.
Let's put the numbers side by side and understand what they mean.
The gap between human experts (88.6%) and the best AI (55.7%) is 32.9 percentage points. For context, on MMLU (the text-only equivalent), GPT-4 scores ~86% — nearly matching human experts. But add images and domain-specific visual reasoning, and the gap explodes.
This isn't just a "we need better models" gap. It reveals fundamental architectural limitations:
Visualizing the chasm between human expert performance and AI models on MMMU. The gap is far larger than on any previous multimodal benchmark.
The authors analyzed 150 error cases from GPT-4V to understand why the model fails. The taxonomy of errors reveals where the bottlenecks lie — and they're spread across all three skills.
The most striking finding: 35% of GPT-4V's errors are pure perception failures. The model misreads what's in the image. These are "easy for humans, hard for AI" errors — a human can immediately see that a musical interval is a diminished fifth, but GPT-4V confuses it with a minor seventh.
This is particularly damaging because perception errors cascade. If you misread the circuit diagram, your Kirchhoff's law calculation will be wrong no matter how good your math is. If you misidentify the tissue type in a pathology slide, no amount of medical knowledge can save you.
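A toy model makes the cascade concrete. Suppose a question is answered correctly only if perception, knowledge, and reasoning all succeed, and suppose the three stages fail independently. Both are strong simplifications, and the stage accuracies below are hypothetical numbers chosen for illustration, not measurements from the paper.

```python
from math import prod


def pipeline_accuracy(stage_accuracies: dict[str, float]) -> float:
    """Overall accuracy when every stage must succeed and stages are independent."""
    return prod(stage_accuracies.values())


# Hypothetical per-stage success rates.
stages = {"perception": 0.80, "knowledge": 0.85, "reasoning": 0.85}

overall = pipeline_accuracy(stages)                                    # about 0.578
perfect_reasoning = pipeline_accuracy({**stages, "reasoning": 1.0})    # about 0.68
```

Under these assumptions even flawless reasoning cannot lift accuracy above the product of the perception and knowledge rates, which is the sense in which a misread diagram caps everything downstream.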
The error analysis reveals a subtle problem: language can both help and hurt visual understanding. In some cases, text context helps the model make sense of ambiguous visuals. But in other cases, text leads the model to hallucinate — it generates plausible-sounding medical terminology based on the text prompt while ignoring what's actually in the image.
Distribution of 150 analyzed error cases. Perception errors dominate, followed by knowledge gaps and reasoning failures.
Not all subjects are equally hard for AI models. The per-discipline breakdown reveals a clear pattern: models do best on subjects with simpler visual content and worst on subjects requiring complex visual reasoning.
Science (worst): Questions involving mathematical notation, chemical structures, and geometric constructions require precise visual parsing that current models handle poorly. Reading a complex integral from an image and then computing it is a two-stage challenge where both stages can fail.
Health & Medicine: Medical imaging (MRI, CT, pathology) uses visual conventions completely different from natural images. A "bright spot" on an MRI means something specific in radiology, and models lack this specialized visual vocabulary.
Tech & Engineering: Circuit diagrams, architectural plans, and mechanical drawings require understanding domain-specific symbolic conventions (e.g., reading resistor values from a schematic).
Art & Design (best): GPT-4V achieves ~57% on Art & Design at the time of the updated leaderboard. Why? Visual content in art questions (paintings, design mockups) is closer to the natural images that vision encoders were trained on. Plus, art history knowledge is well-represented in LLM training data.
Humanities & Social Science: Questions often involve photographs, cartoons, and maps — familiar visual formats. The reasoning required is often interpretive rather than computational.
GPT-4V accuracy across six disciplines on the test set. Disciplines with familiar visual content (Art, Humanities) outperform those with specialized imagery (Science, Medicine).
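Per-discipline figures like these come from grouping per-question results by discipline. A minimal sketch of that aggregation is below; the records are made-up toy data, not MMMU results.

```python
from collections import defaultdict


def accuracy_by_discipline(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Compute the fraction of correct answers within each discipline."""
    totals: dict[str, int] = defaultdict(int)
    correct: dict[str, int] = defaultdict(int)
    for discipline, is_right in results:
        totals[discipline] += 1
        correct[discipline] += int(is_right)
    return {d: correct[d] / totals[d] for d in totals}


# Toy records: (discipline, whether the model answered correctly).
toy_results = [
    ("Art & Design", True),
    ("Art & Design", True),
    ("Art & Design", False),
    ("Science", True),
    ("Science", False),
    ("Science", False),
    ("Science", False),
]
per_discipline = accuracy_by_discipline(toy_results)
```

Because disciplines differ in size, the overall accuracy is a question-weighted average of these values, not a simple mean of the six discipline scores.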
MMMU sits at a critical junction in the evolution of multimodal evaluation. Here's how it connects to the broader landscape.
MMMU became the standard expert-level multimodal benchmark. Within a year of release, it was included in the evaluation suites of most major multimodal model releases (GPT-4o, Claude 3, Gemini 1.5, LLaVA-NeXT, InternVL). Scores on MMMU are now reported in model cards alongside MMLU, HumanEval, and other flagship benchmarks.
More importantly, MMMU shifted the conversation. Before MMMU, the narrative was "VLMs are nearly solving vision." After MMMU, the narrative became "VLMs can see, but they can't think." This reframing directed research attention toward deeper multimodal reasoning rather than better perception alone.