2025

Multimodal RewardBench

Holistic Evaluation of Reward Models for Vision Language Models — how do you evaluate the evaluators? A benchmark for multimodal reward models.

Prerequisites: RLHF basics + Reward models + VLMs. That's it.
8 Chapters · 8+ Simulations · 0 Assumed Knowledge

Chapter 0: Who Judges the Judges?

RLHF (Reinforcement Learning from Human Feedback) has transformed how we align language models. The key component: a reward model that scores model outputs. "This response is helpful and harmless: score 0.92. This one is toxic: score 0.12." The policy model is trained to maximize these scores.

But what makes a good reward model? For text-only models, we have benchmarks like RewardBench. For multimodal models (VLMs that handle images + text), we had no equivalent. How do you know if your reward model correctly judges whether a VLM's response about an image is accurate?

The meta-evaluation problem: If your reward model gives high scores to responses that hallucinate about images (says "there are 5 cats" when there are 3), your VLM will learn to hallucinate. If it can't distinguish between a response that correctly describes spatial relationships and one that gets them wrong, your VLM won't learn spatial reasoning. Multimodal RewardBench evaluates reward models on their ability to correctly rank good vs bad VLM responses across diverse vision-language tasks.
The Reward Model Pipeline

See how reward models fit into the RLHF pipeline for VLMs. A bad reward model produces a bad VLM.
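A toy illustration of how reward-model errors propagate. It uses best-of-n selection as a simple stand-in for policy optimization, not the full RLHF loop. The `fluency_rm` scorer and the candidate strings are hypothetical and chosen only to make the failure visible.

```python
def best_of_n(candidates, reward_fn):
    """Return the candidate that the reward function scores highest."""
    return max(candidates, key=reward_fn)

# Hypothetical candidates for an image that actually contains 3 cats.
candidates = [
    "There are 3 cats.",
    "There are 5 cats lounging happily on a bright red sofa in the sun.",
]

def fluency_rm(response):
    # Flawed, assumed reward: more words score higher; the image is never consulted.
    return len(response.split())

selected = best_of_n(candidates, fluency_rm)
print(selected)
```

Because the flawed scorer never looks at the image, the selection step favors the longer, hallucinated answer. Any optimization against such a reward pushes the VLM in the same direction.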

Why do we need a benchmark specifically for multimodal reward models?

Chapter 1: Reward Models 101

A reward model is a function R(prompt, response) → score that tells you how good a response is. For multimodal reward models, the prompt includes an image, and the response is the VLM's text answer about that image.

Types of multimodal reward models

| Type | How It Works | Examples |
|---|---|---|
| Explicit RM | A VLM fine-tuned with a reward head that outputs a scalar score | InternVL-2-RM, LLaVA-RM |
| Generative RM | A VLM prompted to judge quality and output a score in text | GPT-4V-as-judge, Claude-as-judge |
| Implicit RM | Using the VLM's own log-probabilities as a proxy for quality | DPO-trained models |
python
import torch.nn as nn

# Explicit reward model: a VLM backbone with a scalar reward head
class MultimodalRM(nn.Module):
    def __init__(self, vlm_backbone, hidden_dim):
        super().__init__()
        # Assumed to return hidden states of shape [B, T, hidden_dim]
        self.vlm_backbone = vlm_backbone
        self.reward_head = nn.Linear(hidden_dim, 1)

    def forward(self, image, prompt, response):
        # Encode image + prompt + response
        features = self.vlm_backbone(image, prompt + response)
        # Read the scalar reward from the last token's hidden state
        score = self.reward_head(features[:, -1])  # [B, 1]
        return score  # higher = better response

# Generative reward model: VLM-as-judge
def generative_rm(judge_model, image, prompt, response_a, response_b):
    judge_prompt = f"""Given this image, which response better answers '{prompt}'?
    A: {response_a}
    B: {response_b}
    Output only 'A' or 'B'."""
    verdict = judge_model.generate(image, judge_prompt)
    return verdict  # "A" or "B"
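The third type, the implicit RM, can be sketched in the same spirit. The snippet below uses the DPO-style implicit reward, beta * (log pi(y|x) - log pi_ref(y|x)), where the prompt-dependent normalizer cancels when two responses to the same prompt are compared. The function names, the `beta` default, and the log-probability values are assumptions for illustration.

```python
def implicit_reward(policy_logprob, ref_logprob, beta=0.1):
    """DPO-style implicit reward: beta * (log pi(y|x) - log pi_ref(y|x))."""
    return beta * (policy_logprob - ref_logprob)

def prefers_chosen(chosen, rejected, beta=0.1):
    """Each argument is a (policy_logprob, ref_logprob) pair for one response."""
    return implicit_reward(*chosen, beta=beta) > implicit_reward(*rejected, beta=beta)

# Hypothetical summed token log-probabilities for two responses
# to the same image + prompt.
chosen = (-12.0, -15.0)    # policy assigns more mass than the reference
rejected = (-20.0, -18.0)  # policy assigns less mass than the reference
result = prefers_chosen(chosen, rejected)
print(result)
```

In practice the log-probabilities would come from summing per-token log-probs of the response under the trained VLM and its reference model, both conditioned on the image and prompt.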
The diversity challenge: Text reward models mostly judge language quality. Multimodal reward models must judge a much wider range of capabilities: object counting, spatial relationships, text reading (OCR), color recognition, reasoning about scenes, safety, and more. A reward model might excel at counting objects but fail at spatial reasoning. Multimodal RewardBench tests all these dimensions.
Reward Model Types

Explore the three types of multimodal reward models and how each scores VLM responses.

What makes evaluating multimodal reward models harder than text-only ones?

Chapter 2: Benchmark Design

Multimodal RewardBench is structured as a set of preference pairs: for each image+prompt, there are two responses — one chosen (better) and one rejected (worse). The reward model's job is to assign a higher score to the chosen response.

Dataset construction

1. Source Collection: Gather image+prompt pairs from diverse VLM benchmarks covering different visual capabilities.
2. Response Generation: Generate responses from multiple VLMs. Pair a correct response (chosen) with an incorrect one (rejected).
3. Human Verification: Human annotators verify that the chosen response is genuinely better. Ambiguous pairs are removed.
The difficulty spectrum: Some pairs are easy (correct vs wildly wrong), others are hard (mostly correct vs subtly wrong). Multimodal RewardBench includes both, with difficulty levels explicitly annotated. This reveals whether a reward model can detect fine-grained errors, not just obvious ones.
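One way to picture a single benchmark entry is as a small record holding the image, the prompt, both responses, and the annotations. This is a minimal sketch: the field names, file names, and difficulty labels are hypothetical and not taken from the released dataset.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PreferencePair:
    image_path: str
    prompt: str
    chosen: str      # response verified as better
    rejected: str    # response verified as worse
    category: str
    difficulty: str  # assumed labels: "easy" or "hard"

pairs = [
    PreferencePair("img_001.jpg", "How many cats are in the image?",
                   "There are 3 cats.", "There are 5 cats.",
                   "Object Recognition", "easy"),
    PreferencePair("img_002.jpg", "Where is the cup?",
                   "The cup is on the table.", "The cup is under the table.",
                   "Spatial Reasoning", "hard"),
]

# Filter down to the subtle cases to probe fine-grained error detection.
hard_pairs = [p for p in pairs if p.difficulty == "hard"]
print(len(hard_pairs))
```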
Preference Pair Examples

See examples of easy and hard preference pairs. The reward model must score the chosen response higher than the rejected one.

How is Multimodal RewardBench structured?

Chapter 3: Task Categories

The benchmark covers six major categories of multimodal judgment, each testing a different aspect of visual understanding.

| Category | What It Tests | # Pairs | Example Error to Detect |
|---|---|---|---|
| Object Recognition | Identifying and counting objects | ~500 | "There are 3 cats" vs "There are 5 cats" |
| Spatial Reasoning | Understanding object positions and relationships | ~400 | "The cup is on the table" vs "The cup is under the table" |
| Text/OCR | Reading text in images | ~300 | Correctly vs incorrectly reading a sign |
| Visual Reasoning | Complex inference about scenes | ~400 | "It's raining because people have umbrellas" vs wrong inference |
| Safety | Refusing harmful requests about images | ~200 | Refusing to describe how to replicate a dangerous action shown |
| Hallucination | Detecting made-up visual details | ~300 | "The person is wearing glasses" when they're not |
Category-level evaluation is crucial. A reward model might achieve 85% overall accuracy but score 95% on object recognition and only 60% on spatial reasoning. If you use this reward model for RLHF, your VLM will learn to count objects well but fail at spatial relationships. Multimodal RewardBench exposes these imbalances.
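The imbalance described above can be made concrete with a per-category breakdown. This is a minimal sketch with made-up outcomes: the `results` list is hypothetical and only mirrors the 95% vs 60% contrast, not real benchmark numbers.

```python
from collections import defaultdict

def accuracy_by_category(results):
    """results: iterable of (category, is_correct) tuples."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for category, is_correct in results:
        totals[category] += 1
        hits[category] += int(is_correct)
    return {c: hits[c] / totals[c] for c in totals}

# Hypothetical outcomes: strong on object recognition, weak on spatial reasoning.
results = (
    [("Object Recognition", True)] * 19 + [("Object Recognition", False)] * 1
    + [("Spatial Reasoning", True)] * 12 + [("Spatial Reasoning", False)] * 8
)

per_category = accuracy_by_category(results)
overall = sum(ok for _, ok in results) / len(results)
print(per_category, overall)
```

Here the overall figure sits between the two category scores, so reading it alone would hide the weak spatial-reasoning result.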
Task Category Explorer

Explore different task categories and see example preference pairs for each.

Why does Multimodal RewardBench evaluate per-category rather than just overall accuracy?

Chapter 4: Evaluation Protocol

The benchmark evaluates reward models using a simple metric: preference accuracy. For each pair, does the reward model assign a higher score to the chosen response? The overall accuracy is the fraction of pairs where it gets this right.

Accuracy = (# pairs where R(chosen) > R(rejected)) / (# total pairs)

Random guessing gives 50%. A perfect reward model gives 100%.
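The accuracy formula translates directly into a few lines of code. In this sketch, `score_fn` stands for any pointwise reward model and `toy_score` is a deliberately naive, hypothetical scorer. Ties are counted as errors, matching the strict inequality in the formula.

```python
def preference_accuracy(score_fn, pairs):
    """Fraction of pairs where score_fn rates the chosen response strictly higher.

    score_fn(image, prompt, response) -> float
    pairs: iterable of (image, prompt, chosen, rejected)
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one preference pair")
    correct = sum(
        score_fn(img, prompt, chosen) > score_fn(img, prompt, rejected)
        for img, prompt, chosen, rejected in pairs
    )
    return correct / len(pairs)

# Hypothetical scorer for demonstration: shorter responses score higher.
def toy_score(image, prompt, response):
    return -len(response)

toy_pairs = [
    ("img_001.jpg", "How many cats?", "3 cats.", "There are 5 cats."),
    ("img_002.jpg", "Where is the cup?", "The cup is on the table.", "Under it."),
]
acc = preference_accuracy(toy_score, toy_pairs)
print(acc)
```

The toy scorer gets one pair right and one wrong, which lands exactly on the 50% chance level.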

Evaluation modes

| Mode | Input to RM | Output |
|---|---|---|
| Pointwise | Image + prompt + single response | Scalar score per response; compare |
| Pairwise | Image + prompt + both responses | Direct "A is better" or "B is better" |
Pointwise vs pairwise matters: Some reward models perform differently in each mode. Generative RMs (VLM-as-judge) tend to perform better pairwise because they can compare the two responses directly. Explicit RMs are a more natural fit for pointwise evaluation because they are trained to produce one scalar score per response. The benchmark reports both.
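The two modes differ only in what the model sees and returns. This sketch shows both decision rules. `toy_score` and `toy_judge` are hypothetical stand-ins for real models, and the `chosen_first` flag is an assumed convenience for controlling which slot the chosen response occupies.

```python
def pointwise_correct(score_fn, image, prompt, chosen, rejected):
    """Pointwise mode: score each response separately, then compare."""
    return score_fn(image, prompt, chosen) > score_fn(image, prompt, rejected)

def pairwise_correct(judge_fn, image, prompt, chosen, rejected, chosen_first=True):
    """Pairwise mode: the judge sees both responses and returns 'A' or 'B'."""
    a, b = (chosen, rejected) if chosen_first else (rejected, chosen)
    verdict = judge_fn(image, prompt, a, b).strip().upper()
    return verdict == ("A" if chosen_first else "B")

# Hypothetical stand-ins for real models.
def toy_score(image, prompt, response):
    return 1.0 if "3 cats" in response else 0.0

def toy_judge(image, prompt, response_a, response_b):
    return "A" if "3 cats" in response_a else "B"

example = ("img_001.jpg", "How many cats?", "There are 3 cats.", "There are 5 cats.")
pointwise_ok = pointwise_correct(toy_score, *example)
pairwise_ok = pairwise_correct(toy_judge, *example, chosen_first=False)
print(pointwise_ok, pairwise_ok)
```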
Evaluation Protocol Visualizer

See how pointwise and pairwise evaluation work for a reward model.

What metric does Multimodal RewardBench use to evaluate reward models?

Chapter 5: Key Findings

The benchmark reveals several surprising findings about the current state of multimodal reward models.

Finding 1: Generative RMs beat explicit RMs

Large generative models (GPT-4V, Claude) used as judges consistently outperform purpose-built reward models. This suggests that strong general-purpose VLMs are better judges than specialized reward models — echoing findings from text-only RewardBench.

Finding 2: Hallucination detection is the hardest category

Most reward models perform worst at detecting subtle hallucinations: responses that are mostly correct but fabricate one visual detail. This is concerning because hallucination is among the most harmful failure modes for deployed VLMs.

Finding 3: Size matters, but not linearly

Larger reward models are generally better, but the relationship isn't linear. Some 7B reward models outperform 70B ones on specific categories, suggesting that architecture and training data can matter more than raw size.

The most concerning finding: No reward model achieves above 80% accuracy on the hallucination category. A VLM trained with RLHF using current reward models is therefore likely to keep hallucinating about images, because the reward model cannot reliably penalize these errors.
Reward Model Leaderboard

See how different reward models perform across categories. Note the consistent weakness on hallucination detection.

What is the most concerning finding from Multimodal RewardBench?

Chapter 6: Failure Modes & Showcase

The benchmark identifies specific failure patterns that reveal fundamental limitations in how current reward models process visual information.

Common failure modes

| Failure | Description | Frequency |
|---|---|---|
| Verbosity bias | Longer responses get higher scores regardless of accuracy | ~30% of errors |
| Confidence bias | Confident-sounding wrong answers score higher than hedged correct ones | ~25% of errors |
| Text-only judging | Reward model ignores the image entirely, judging only text quality | ~20% of errors |
| Position bias | In pairwise mode, preferring whichever response is shown first/second | ~15% of errors |
The "text-only judging" problem: Some reward models barely look at the image. They assign high scores to eloquent, well-structured responses even when those responses describe objects that aren't in the image. The model is then rewarding linguistic fluency rather than visual accuracy, which trains VLMs to be fluent hallucinators.
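Two of these failure modes can be probed with simple diagnostics. The sketch below is an assumed approach, not the benchmark's own procedure: an image-ablation check for text-only judging and an order-swap check for position bias. The scorers, the `blank.jpg` placeholder, and the example pair are hypothetical.

```python
def image_ablation_gap(score_fn, pairs, blank_image):
    """Accuracy with the real image minus accuracy with a blank image.

    A gap near zero suggests the scorer is not using the image.
    pairs: list of (image, prompt, chosen, rejected)
    """
    def accuracy(replace_image):
        hits = 0
        for image, prompt, chosen, rejected in pairs:
            img = blank_image if replace_image else image
            hits += score_fn(img, prompt, chosen) > score_fn(img, prompt, rejected)
        return hits / len(pairs)
    return accuracy(False) - accuracy(True)

def position_flip_rate(judge_fn, pairs):
    """Fraction of pairs where the judge's pick changes when A and B are swapped."""
    flips = 0
    for image, prompt, chosen, rejected in pairs:
        picked_chosen_as_a = judge_fn(image, prompt, chosen, rejected) == "A"
        picked_chosen_as_b = judge_fn(image, prompt, rejected, chosen) == "B"
        flips += picked_chosen_as_a != picked_chosen_as_b
    return flips / len(pairs)

# Hypothetical models: one ignores the image, one always answers "A".
def text_only_score(image, prompt, response):
    return len(response)

def always_a_judge(image, prompt, response_a, response_b):
    return "A"

pairs = [("img_001.jpg", "How many cats?",
          "There are 3 cats.", "There are 5 cats, I think.")]
gap = image_ablation_gap(text_only_score, pairs, blank_image="blank.jpg")
flip_rate = position_flip_rate(always_a_judge, pairs)
print(gap, flip_rate)
```

A scorer that never reads the image shows no accuracy change under ablation, and a judge that always picks the first slot flips its preference on every swapped pair.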
Failure Mode Simulator

See how different biases cause reward models to prefer wrong responses. Toggle each bias to see its effect on scoring.

What is the "text-only judging" failure mode?

Chapter 7: Connections

Multimodal RewardBench fills a critical gap in the multimodal AI evaluation ecosystem. As VLMs become more capable, the reward models that guide their alignment become increasingly important.

| Benchmark | Evaluates | Modality |
|---|---|---|
| RewardBench | Text reward models | Text only |
| Multimodal RewardBench | Multimodal reward models | Vision + Language |
| VLMEvalKit | VLM capabilities | Vision + Language |
| GenEval | Image generation quality | Image generation |
Lesson 1: You can't align what you can't evaluate. RLHF quality is bottlenecked by reward model quality. If reward models can't detect hallucinations, RLHF won't fix hallucinations.
Lesson 2: Category-level evaluation reveals hidden weaknesses. Aggregate scores hide critical failures. A reward model must be evaluated on each visual capability separately.
Lesson 3: General-purpose VLMs are currently the best judges. Purpose-built reward models don't yet outperform strong generative VLMs used as judges. This may change as reward model training improves.
Evaluation Ecosystem

See where Multimodal RewardBench fits in the broader evaluation landscape.

What is Multimodal RewardBench's main contribution?