Multimodal RewardBench (2025)

Chapter 0: Who Judges the Judges?

RLHF (Reinforcement Learning from Human Feedback) has transformed how we align language models. The key component: a reward model that scores model outputs. "This response is helpful and harmless: score 0.92. This one is toxic: score 0.12." The policy model is trained to maximize these scores.

But what makes a good reward model? For text-only models, we have benchmarks like RewardBench. For multimodal models (VLMs that handle images + text), we had no equivalent. How do you know if your reward model correctly judges whether a VLM's response about an image is accurate?

The meta-evaluation problem: If your reward model gives high scores to responses that hallucinate about images (for example, saying "there are 5 cats" when there are 3), your VLM will learn to hallucinate. If it can't distinguish a response that correctly describes spatial relationships from one that gets them wrong, your VLM won't learn spatial reasoning. Multimodal RewardBench evaluates reward models on their ability to correctly rank good vs bad VLM responses across diverse vision-language tasks.

The Reward Model Pipeline

See how reward models fit into the RLHF pipeline for VLMs. A bad reward model produces a bad VLM.

Why do we need a benchmark specifically for multimodal reward models?

Because reward models that can't correctly judge VLM responses (e.g., giving high scores to image hallucinations) will train VLMs to produce those same errors; we need to evaluate the evaluators before trusting them to guide VLM training
Because text reward models already work perfectly
Because we need more benchmarks in general

Chapter 1: Reward Models 101

A reward model is a function R(prompt, response) → score that tells you how good a response is. For multimodal reward models, the prompt includes an image, and the response is the VLM's text answer about that image.

Types of multimodal reward models

Type	How It Works	Examples
Explicit RM	A VLM fine-tuned with a reward head that outputs a scalar score	InternVL-2-RM, LLaVA-RM
Generative RM	A VLM prompted to judge quality and output a score in text	GPT-4V-as-judge, Claude-as-judge
Implicit RM	Using the VLM's own log-probabilities as a proxy for quality	DPO-trained models

python
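# Explicit reward model: a VLM backbone with a scalar reward head
import torch.nn as nn

class MultimodalRM(nn.Module):
    def __init__(self, vlm_backbone, hidden_size):
        super().__init__()
        self.vlm_backbone = vlm_backbone              # any VLM returning [B, T, H] hidden states
        self.reward_head = nn.Linear(hidden_size, 1)  # maps a hidden state to a scalar

    def forward(self, image, prompt, response):
        # Encode image + prompt + response into hidden states [B, T, H]
        features = self.vlm_backbone(image, prompt + response)
        # Score from the last token's hidden state
        score = self.reward_head(features[:, -1])  # [B, 1]
        return score  # higher = better response

# Generative reward model: VLM-as-judge
def generative_rm(judge_model, image, prompt, response_a, response_b):
    # Build the prompt line by line so no stray indentation reaches the judge
    judge_prompt = (
        f"Given this image, which response better answers '{prompt}'?\n"
        f"A: {response_a}\n"
        f"B: {response_b}\n"
        "Output only 'A' or 'B'."
    )
    verdict = judge_model.generate(image, judge_prompt)
    return verdict.strip()  # "A" or "B"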
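
The implicit type in the table above has no scalar head and no judge prompt. One common formulation comes from DPO, where the reward is the scaled log-probability ratio between the trained policy and a frozen reference model. The following is a minimal sketch, assuming per-token log-probabilities of the response are already available; the name `implicit_reward`, the default `beta=0.1`, and all numbers are illustrative and not taken from the benchmark.

```python
def implicit_reward(policy_logps, ref_logps, beta=0.1):
    """DPO-style implicit reward for one response.

    policy_logps / ref_logps: per-token log-probabilities of the response
    under the trained policy and the frozen reference model (assumed inputs).
    """
    return beta * (sum(policy_logps) - sum(ref_logps))


# Illustrative numbers only: the policy raised the likelihood of the chosen
# response relative to the reference, and lowered it for the rejected one.
r_chosen = implicit_reward([-0.5, -0.25], [-1.0, -0.75])
r_rejected = implicit_reward([-2.0, -1.5], [-1.0, -1.0])
prefers_chosen = r_chosen > r_rejected
```

Only the difference between the two rewards matters for ranking, so the prompt-dependent constant that appears in the full DPO derivation can be ignored here.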

The diversity challenge: Text reward models mostly judge language quality. Multimodal reward models must judge a much wider range of capabilities: object counting, spatial relationships, text reading (OCR), color recognition, reasoning about scenes, safety, and more. A reward model might excel at counting objects but fail at spatial reasoning. Multimodal RewardBench tests all these dimensions.

Reward Model Types

Explore the three types of multimodal reward models and how each scores VLM responses.

What makes evaluating multimodal reward models harder than text-only ones?

Multimodal reward models must judge a much wider range of capabilities (object counting, spatial relationships, OCR, color recognition, scene reasoning, safety), and a model might excel at some dimensions while failing at others
Because images are larger than text
Because there are fewer VLMs to evaluate

Chapter 2: Benchmark Design

Multimodal RewardBench is structured as a set of preference pairs: for each image+prompt, there are two responses — one chosen (better) and one rejected (worse). The reward model's job is to assign a higher score to the chosen response.
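
One way to picture a single benchmark item is as a small record holding the image, the prompt, and the two responses. The sketch below is an assumed schema for illustration: the field names, the `category` and `difficulty` labels, and the `toy_score` stand-in are hypothetical and do not reflect the benchmark's actual file format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    image_path: str   # path or URL of the image (hypothetical field)
    prompt: str
    chosen: str       # the better response
    rejected: str     # the worse response
    category: str     # hypothetical label, e.g. "object_recognition"
    difficulty: str   # hypothetical label, e.g. "easy" or "hard"


def is_correct(pair, score_fn):
    """True if the reward model scores the chosen response strictly higher."""
    return score_fn(pair.image_path, pair.prompt, pair.chosen) > score_fn(
        pair.image_path, pair.prompt, pair.rejected
    )


pair = PreferencePair(
    image_path="cats.jpg",
    prompt="How many cats are in the image?",
    chosen="There are 3 cats.",
    rejected="There are 5 cats.",
    category="object_recognition",
    difficulty="easy",
)


# Stand-in scorer for illustration only: it "knows" the right count.
def toy_score(image_path, prompt, response):
    return 1.0 if "3 cats" in response else 0.0


result = is_correct(pair, toy_score)
```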

Dataset construction

Source Collection

Gather image+prompt pairs from diverse VLM benchmarks covering different visual capabilities.

↓

Response Generation

Generate responses from multiple VLMs. Pair a correct response (chosen) with an incorrect one (rejected).

↓

Human Verification

Human annotators verify that the chosen response is genuinely better. Ambiguous pairs are removed.
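
The three steps above can be sketched as two small functions: one that pairs correct with incorrect responses, and one that drops pairs annotators disagree on. This is a rough illustration under stated assumptions: the dict keys, the function names, and the unanimous-agreement threshold (`min_agreement=1.0`) are hypothetical, since the source does not specify how ambiguity is decided.

```python
from itertools import product


def build_pairs(responses):
    """Pair every correct response with every incorrect one for the same item.

    responses: list of dicts with keys "text" and "is_correct" (assumed format).
    """
    correct = [r["text"] for r in responses if r["is_correct"]]
    incorrect = [r["text"] for r in responses if not r["is_correct"]]
    return [{"chosen": c, "rejected": w} for c, w in product(correct, incorrect)]


def keep_verified(pairs, votes, min_agreement=1.0):
    """Keep pairs where enough annotators agree that chosen is better.

    votes[i] is a list of booleans for pairs[i], one per annotator.
    """
    kept = []
    for pair, pair_votes in zip(pairs, votes):
        agreement = sum(pair_votes) / len(pair_votes)
        if agreement >= min_agreement:
            kept.append(pair)
    return kept


responses = [
    {"text": "The cup is on the table.", "is_correct": True},
    {"text": "The cup is under the table.", "is_correct": False},
    {"text": "The cup is next to the lamp on the floor.", "is_correct": False},
]
pairs = build_pairs(responses)
votes = [[True, True, True], [True, False, True]]  # second pair is contested
verified = keep_verified(pairs, votes)
```

Here the contested second pair is dropped, leaving one verified pair.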

The difficulty spectrum: Some pairs are easy (correct vs wildly wrong), others are hard (mostly correct vs subtly wrong). Multimodal RewardBench includes both, with difficulty levels explicitly annotated. This reveals whether a reward model can detect fine-grained errors, not just obvious ones.

Preference Pair Examples

See examples of easy and hard preference pairs. The reward model must score the chosen response higher than the rejected one.

How is Multimodal RewardBench structured?

As preference pairs (chosen vs rejected responses for each image+prompt), with varying difficulty levels; the reward model must assign higher scores to chosen responses, testing both easy discriminations and fine-grained error detection
As a set of images to classify
As free-form generation prompts

Chapter 3: Task Categories

The benchmark covers six major categories of multimodal judgment, each testing a different aspect of visual understanding.

Category	What It Tests	# Pairs	Example Error to Detect
Object Recognition	Identifying and counting objects	~500	"There are 3 cats" vs "There are 5 cats"
Spatial Reasoning	Understanding object positions and relationships	~400	"The cup is on the table" vs "The cup is under the table"
Text/OCR	Reading text in images	~300	Correctly vs incorrectly reading a sign
Visual Reasoning	Complex inference about scenes	~400	"It's raining because people have umbrellas" vs wrong inference
Safety	Refusing harmful requests about images	~200	Refusing to describe how to replicate a dangerous action shown
Hallucination	Detecting made-up visual details	~300	"The person is wearing glasses" when they're not

Category-level evaluation is crucial. A reward model might achieve 85% overall accuracy but score 95% on object recognition and only 60% on spatial reasoning. If you use this reward model for RLHF, your VLM will learn to count objects well but fail at spatial relationships. Multimodal RewardBench exposes these imbalances.
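
A short sketch of the per-category breakdown makes the masking effect concrete. The outcomes below are synthetic and chosen to mirror the 95% vs 60% example; with only two categories the overall figure comes out different from the 85% in the text. The function name and category labels are illustrative.

```python
from collections import defaultdict


def accuracy_by_category(results):
    """results: iterable of (category, is_correct) tuples."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for category, correct in results:
        totals[category] += 1
        hits[category] += int(correct)
    return {c: hits[c] / totals[c] for c in totals}


# Synthetic outcomes: 19/20 on object recognition, 12/20 on spatial reasoning.
results = (
    [("object_recognition", True)] * 19 + [("object_recognition", False)] * 1
    + [("spatial_reasoning", True)] * 12 + [("spatial_reasoning", False)] * 8
)
per_category = accuracy_by_category(results)
overall = sum(correct for _, correct in results) / len(results)
```

Reporting `overall` alone would hide the fact that one category sits far below the other.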

Task Category Explorer

Explore different task categories and see example preference pairs for each.

Why does Multimodal RewardBench evaluate per-category rather than just overall accuracy?

Because a high overall score can mask severe weaknesses in specific categories; a reward model scoring 85% overall but only 60% on spatial reasoning will train VLMs that fail at spatial tasks, even though aggregate metrics look fine
Because per-category scores are easier to compute
Because there aren't enough total examples

Chapter 4: Evaluation Protocol

The benchmark evaluates reward models using a simple metric: preference accuracy. For each pair, does the reward model assign a higher score to the chosen response? The overall accuracy is the fraction of pairs where it gets this right.

Accuracy = (# pairs where R(chosen) > R(rejected)) / (# total pairs)

Random guessing gives 50%. A perfect reward model gives 100%.
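
The formula above translates directly into a few lines of code. This sketch assumes the reward model's scores have already been collected as (chosen, rejected) tuples; the function name and the scores are illustrative. Following the strict inequality in the formula, a tie counts as an error.

```python
def preference_accuracy(score_pairs):
    """score_pairs: list of (score_chosen, score_rejected) tuples.

    A pair counts as correct only when the chosen score is strictly higher,
    so ties count as errors.
    """
    if not score_pairs:
        raise ValueError("need at least one pair")
    correct = sum(1 for chosen, rejected in score_pairs if chosen > rejected)
    return correct / len(score_pairs)


# Four illustrative pairs: two ranked correctly, one reversed, one tied.
scores = [(0.92, 0.12), (0.40, 0.55), (0.70, 0.70), (0.81, 0.30)]
acc = preference_accuracy(scores)
```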

Evaluation modes

Mode	Input to RM	Output
Pointwise	Image + prompt + single response	Scalar score per response; compare
Pairwise	Image + prompt + both responses	Direct "A is better" or "B is better"

Pointwise vs pairwise matters: Some reward models perform differently in each mode. Generative RMs (VLM-as-judge) tend to perform better pairwise because they can compare the two responses directly. Explicit RMs work better pointwise because they are trained to produce a scalar score for a single response. The benchmark reports both.
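
The two modes differ only in what the reward model sees at once. The sketch below shows both checks side by side; `score_fn` and `judge_fn` are hypothetical callables standing in for a real reward model, and the toy versions exist only so the example runs.

```python
def pointwise_correct(score_fn, image, prompt, chosen, rejected):
    """Score each response separately, then compare the scalars."""
    return score_fn(image, prompt, chosen) > score_fn(image, prompt, rejected)


def pairwise_correct(judge_fn, image, prompt, chosen, rejected, chosen_first=True):
    """Show both responses at once and check the verdict picks the chosen one."""
    a, b = (chosen, rejected) if chosen_first else (rejected, chosen)
    verdict = judge_fn(image, prompt, a, b).strip().upper()
    return verdict == ("A" if chosen_first else "B")


# Toy stand-ins so the sketch runs; a real setup would call a reward model.
def toy_score(image, prompt, response):
    return 1.0 if "on the table" in response else 0.0


def toy_judge(image, prompt, a, b):
    return "A" if toy_score(image, prompt, a) >= toy_score(image, prompt, b) else "B"


args = ("kitchen.jpg", "Where is the cup?",
        "The cup is on the table.", "The cup is under the table.")
pointwise_ok = pointwise_correct(toy_score, *args)
pairwise_ok = pairwise_correct(toy_judge, *args)
pairwise_swapped_ok = pairwise_correct(toy_judge, *args, chosen_first=False)
```

The `chosen_first` flag is a design choice in this sketch: it lets the same pair be presented in both orders, which is useful when the judge may be sensitive to presentation order.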

Evaluation Protocol Visualizer

See how pointwise and pairwise evaluation work for a reward model.

What metric does Multimodal RewardBench use to evaluate reward models?

Preference accuracy: the fraction of pairs where the reward model assigns a higher score to the genuinely better response (chosen over rejected), with random guessing at 50% and perfect discrimination at 100%
BLEU score on generated text
FID on generated images

Chapter 5: Key Findings

The benchmark reveals several surprising findings about the current state of multimodal reward models.

Finding 1: Generative RMs beat explicit RMs

Large generative models (GPT-4V, Claude) used as judges consistently outperform purpose-built reward models. This suggests that strong general-purpose VLMs are better judges than specialized reward models — echoing findings from text-only RewardBench.

Finding 2: Hallucination detection is the hardest category

Most reward models have the greatest difficulty detecting subtle hallucinations: responses that are mostly correct but fabricate one visual detail. This is concerning because hallucination is the most dangerous failure mode for deployed VLMs.

Finding 3: Size matters, but not linearly

Larger reward models are generally better, but the relationship isn't linear. Some 7B reward models outperform 70B ones on specific categories, suggesting that architecture and training data matter more than raw size.

The most concerning finding: No reward model achieves above 80% accuracy on the hallucination category. This means that any VLM trained with RLHF using current reward models will still learn to hallucinate about images, because the reward model can't reliably detect these errors.

Reward Model Leaderboard

See how different reward models perform across categories. Note the consistent weakness on hallucination detection.

What is the most concerning finding from Multimodal RewardBench?

No reward model exceeds 80% accuracy on hallucination detection, meaning VLMs trained with RLHF using current reward models will still learn to hallucinate about images, because the reward signal doesn't reliably penalize fabricated visual details
That all reward models perform equally
That smaller models are always better

Chapter 6: Failure Modes & Showcase

The benchmark identifies specific failure patterns that reveal fundamental limitations in how current reward models process visual information.

Common failure modes

Failure	Description	Frequency
Verbosity bias	Longer responses get higher scores regardless of accuracy	~30% of errors
Confidence bias	Confident-sounding wrong answers score higher than hedged correct ones	~25% of errors
Text-only judging	Reward model ignores the image entirely, judging only text quality	~20% of errors
Position bias	In pairwise mode, preferring whichever response is shown first/second	~15% of errors
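
Position bias from the table above can be probed by judging each pair twice with the order swapped and checking whether the verdict follows the response or the slot. This is a minimal sketch of that check, not the benchmark's own procedure; `judge_fn` and both toy judges are hypothetical.

```python
def position_consistency(judge_fn, items):
    """Fraction of items where the verdict names the same response in both orders.

    items: list of (image, prompt, response_1, response_2) tuples.
    judge_fn returns "A" or "B" for the responses in the order shown.
    """
    consistent = 0
    for image, prompt, r1, r2 in items:
        first = judge_fn(image, prompt, r1, r2)    # r1 shown as A
        second = judge_fn(image, prompt, r2, r1)   # r1 shown as B
        picked_r1_first = first == "A"
        picked_r1_second = second == "B"
        consistent += int(picked_r1_first == picked_r1_second)
    return consistent / len(items)


# Toy judges for illustration: one always answers "A", one prefers shorter text.
def always_a(image, prompt, a, b):
    return "A"


def prefers_shorter(image, prompt, a, b):
    return "A" if len(a) <= len(b) else "B"


items = [
    ("img1.jpg", "Describe the scene.", "A dog on grass.", "A dog on a red sofa indoors."),
    ("img2.jpg", "What color is the car?", "Blue.", "The car is green."),
]
biased_rate = position_consistency(always_a, items)
unbiased_rate = position_consistency(prefers_shorter, items)
```

A judge that always picks the first slot scores zero on this check, while a judge whose preference depends only on the content stays consistent. Note that the second toy judge is consistent but still biased toward short answers, so this check isolates position bias only.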

The "text-only judging" problem: Some reward models barely look at the image. They assign high scores to eloquent, well-structured responses even when those responses describe objects that aren't in the image. This means the model is rewarding linguistic fluency, not visual accuracy, which trains VLMs to be fluent hallucinators.
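
One simple way to test for this behavior is an ablation: replace the image with an uninformative placeholder and see whether the score moves. The sketch below is an assumed diagnostic, not something the source describes; images are represented as small dicts purely so the example runs, and both scorers are toys.

```python
def image_sensitivity(score_fn, examples, blank_image, tol=1e-6):
    """Fraction of examples whose score changes when the image is replaced.

    examples: list of (image, prompt, response) tuples.
    A value near 0 suggests the scorer is ignoring the image.
    """
    changed = 0
    for image, prompt, response in examples:
        with_image = score_fn(image, prompt, response)
        without_image = score_fn(blank_image, prompt, response)
        changed += int(abs(with_image - without_image) > tol)
    return changed / len(examples)


# Toy scorers: one rewards length only, one checks a (fake) image annotation.
def text_only_scorer(image, prompt, response):
    return len(response) / 100.0


def grounded_scorer(image, prompt, response):
    return 1.0 if image["object"] in response else 0.0


examples = [
    ({"object": "cat"}, "What animal is this?", "This is a cat."),
    ({"object": "dog"}, "What animal is this?", "This is a dog."),
]
blank = {"object": "<blank>"}  # placeholder that never appears in a response
text_only_rate = image_sensitivity(text_only_scorer, examples, blank)
grounded_rate = image_sensitivity(grounded_scorer, examples, blank)
```

A scorer whose output never changes under this ablation is judging the text alone, whatever its accuracy on the benchmark.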

Failure Mode Simulator

See how different biases cause reward models to prefer wrong responses. Toggle each bias to see its effect on scoring.

What is the "text-only judging" failure mode?

When the reward model ignores the image and judges only text quality, assigning high scores to eloquent responses even when they describe objects not present in the image, effectively rewarding fluent hallucination
When the model can only process text input
When text-only tasks are evaluated

Chapter 7: Connections

Multimodal RewardBench fills a critical gap in the multimodal AI evaluation ecosystem. As VLMs become more capable, the reward models that guide their alignment become increasingly important.

Benchmark	Evaluates	Modality
RewardBench	Text reward models	Text only
Multimodal RewardBench	Multimodal reward models	Vision + Language
VLMEvalKit	VLM capabilities	Vision + Language
GenEval	Image generation quality	Image generation

Lesson 1: You can't align what you can't evaluate. RLHF quality is bottlenecked by reward model quality. If reward models can't detect hallucinations, RLHF won't fix hallucinations.

Lesson 2: Category-level evaluation reveals hidden weaknesses. Aggregate scores hide critical failures. A reward model must be evaluated on each visual capability separately.

Lesson 3: General-purpose VLMs are currently the best judges. Purpose-built reward models don't yet outperform strong generative VLMs used as judges. This may change as reward model training improves.

Evaluation Ecosystem

See where Multimodal RewardBench fits in the broader evaluation landscape.

What is Multimodal RewardBench's main contribution?

Providing the first systematic benchmark for evaluating multimodal reward models across diverse visual capabilities, revealing that current reward models have critical blind spots (especially hallucination detection) that directly limit how well VLMs can be aligned through RLHF
A new reward model architecture
A new training dataset for VLMs