Holistic Evaluation of Reward Models for Vision Language Models — how do you evaluate the evaluators? A benchmark for multimodal reward models.
RLHF (Reinforcement Learning from Human Feedback) has transformed how we align language models. The key component: a reward model that scores model outputs. "This response is helpful and harmless: score 0.92. This one is toxic: score 0.12." The policy model is trained to maximize these scores.
But what makes a good reward model? For text-only models, we have benchmarks like RewardBench. For multimodal models (VLMs that handle images + text), we had no equivalent. How do you know if your reward model correctly judges whether a VLM's response about an image is accurate?
The reward model sits at the center of the RLHF pipeline for VLMs: the policy can only be as good as the signal it optimizes, so a bad reward model produces a bad VLM.
A reward model is a function R(prompt, response) → score that tells you how good a response is. For multimodal reward models, the prompt includes an image, and the response is the VLM's text answer about that image.
| Type | How It Works | Examples |
|---|---|---|
| Explicit RM | A VLM fine-tuned with a reward head that outputs a scalar score | InternVL-2-RM, LLaVA-RM |
| Generative RM | A VLM prompted to judge quality and output a score in text | GPT-4V-as-judge, Claude-as-judge |
| Implicit RM | Using the VLM's own log-probabilities as a proxy for quality | DPO-trained models |
```python
import torch.nn as nn


# Explicit reward model: a VLM backbone with a scalar reward head
class MultimodalRM(nn.Module):
    def __init__(self, vlm_backbone, hidden_size):
        super().__init__()
        # Assumption: the backbone returns features of shape [B, T, hidden_size]
        self.vlm_backbone = vlm_backbone
        self.reward_head = nn.Linear(hidden_size, 1)

    def forward(self, image, prompt, response):
        # Encode image + prompt + response
        features = self.vlm_backbone(image, prompt + response)
        # Read the scalar reward off the last token's features
        score = self.reward_head(features[:, -1])  # [B, 1]
        return score  # higher = better response


# Generative reward model: VLM-as-judge
def generative_rm(judge_model, image, prompt, response_a, response_b):
    judge_prompt = (
        f"Given this image, which response better answers '{prompt}'?\n"
        f"A: {response_a}\n"
        f"B: {response_b}\n"
        "Output only 'A' or 'B'."
    )
    verdict = judge_model.generate(image, judge_prompt)
    return verdict  # "A" or "B"
```
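The code above covers the explicit and generative types. For the implicit type, one common formulation comes from DPO: the reward of a response is the scaled log-probability ratio between the trained policy and its frozen reference model. The sketch below assumes that formulation; the per-token log-probabilities are made-up numbers and `beta=0.1` is an illustrative value, not something the benchmark prescribes.

```python
def implicit_reward(policy_token_logps, ref_token_logps, beta=0.1):
    """DPO-style implicit reward: beta * (log pi(y|x) - log pi_ref(y|x)).

    Both arguments are per-token log-probabilities of the same response,
    one list from the trained policy and one from the frozen reference model.
    """
    policy_logp = sum(policy_token_logps)  # sequence log-prob under the policy
    ref_logp = sum(ref_token_logps)        # sequence log-prob under the reference
    return beta * (policy_logp - ref_logp)


# Made-up log-probabilities for two responses to the same image + prompt
chosen_reward = implicit_reward([-0.2, -0.1, -0.3], [-0.5, -0.4, -0.6])
rejected_reward = implicit_reward([-1.2, -0.9, -1.1], [-0.8, -0.7, -0.9])
prefers_chosen = chosen_reward > rejected_reward
```

Only the difference between the two rewards matters for a preference decision, so the choice of `beta` does not change which response wins.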
Multimodal RewardBench is structured as a set of preference pairs: for each image+prompt, there are two responses — one chosen (better) and one rejected (worse). The reward model's job is to assign a higher score to the chosen response.
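One way to picture a single benchmark item is as a small record. This is a sketch only: the field names and the example file name are hypothetical and do not reflect the benchmark's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    image_path: str  # path or URL of the image (hypothetical field name)
    prompt: str      # question asked about the image
    chosen: str      # the better response
    rejected: str    # the worse response
    category: str    # task category the pair belongs to


# Illustrative pair, not taken from the benchmark
pair = PreferencePair(
    image_path="cats.jpg",
    prompt="How many cats are in the image?",
    chosen="There are 3 cats.",
    rejected="There are 5 cats.",
    category="Object Recognition",
)
```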
The benchmark covers several major categories of multimodal judgment. The six listed below each test a different aspect of visual understanding.
| Category | What It Tests | # Pairs | Example Error to Detect |
|---|---|---|---|
| Object Recognition | Identifying and counting objects | ~500 | "There are 3 cats" vs "There are 5 cats" |
| Spatial Reasoning | Understanding object positions and relationships | ~400 | "The cup is on the table" vs "The cup is under the table" |
| Text/OCR | Reading text in images | ~300 | Correctly vs incorrectly reading a sign |
| Visual Reasoning | Complex inference about scenes | ~400 | "It's raining because people have umbrellas" vs wrong inference |
| Safety | Refusing harmful requests about images | ~200 | Refusing to describe how to replicate a dangerous action shown |
| Hallucination | Detecting made-up visual details | ~300 | "The person is wearing glasses" when they're not |
The benchmark evaluates reward models using a simple metric: preference accuracy. For each pair, does the reward model assign a higher score to the chosen response? The overall accuracy is the fraction of pairs where it gets this right.
Random guessing gives 50%. A perfect reward model gives 100%.
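The metric is simple enough to write out directly. In this minimal sketch the scores are made up, and ties are counted as errors, which is an assumption rather than something the text specifies.

```python
def preference_accuracy(score_pairs):
    """score_pairs: list of (score_chosen, score_rejected) tuples, one per pair."""
    if not score_pairs:
        raise ValueError("need at least one preference pair")
    # A pair is correct only when the chosen response scores strictly higher
    correct = sum(1 for chosen, rejected in score_pairs if chosen > rejected)
    return correct / len(score_pairs)


# Made-up reward scores for four pairs; the third pair is misjudged
scores = [(0.92, 0.12), (0.70, 0.65), (0.40, 0.55), (0.81, 0.30)]
accuracy = preference_accuracy(scores)  # 3 of 4 pairs correct
```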
| Mode | Input to RM | Output |
|---|---|---|
| Pointwise | Image + prompt + single response | Scalar score per response; the two scores are then compared |
| Pairwise | Image + prompt + both responses | Direct "A is better" or "B is better" |
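The two modes differ only in how a preference is extracted from the reward model. The sketch below uses toy stand-ins for the models so it runs on its own; the call signatures are hypothetical and real reward models will expose different interfaces.

```python
def pointwise_prefers_chosen(rm, image, prompt, chosen, rejected):
    # Score each response independently, then compare the two scalars
    return rm(image, prompt, chosen) > rm(image, prompt, rejected)


def pairwise_prefers_chosen(rm, image, prompt, chosen, rejected, chosen_first=True):
    # Show both responses at once and track which slot holds the chosen one
    if chosen_first:
        return rm(image, prompt, chosen, rejected) == "A"
    return rm(image, prompt, rejected, chosen) == "B"


# Toy stand-ins so the sketch runs without a real model
def toy_pointwise(image, prompt, response):
    return 1.0 if "3 cats" in response else 0.0


def toy_pairwise(image, prompt, response_a, response_b):
    return "A" if "3 cats" in response_a else "B"


args = ("cats.jpg", "How many cats?", "There are 3 cats.", "There are 5 cats.")
pointwise_ok = pointwise_prefers_chosen(toy_pointwise, *args)
pairwise_ok = pairwise_prefers_chosen(toy_pairwise, *args, chosen_first=False)
```

The `chosen_first` flag is there because pairwise evaluation has to decide which slot each response goes in, and randomizing or swapping that order is a common guard against order effects.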
The benchmark reveals several surprising findings about the current state of multimodal reward models.
Large generative models (GPT-4V, Claude) used as judges consistently outperform purpose-built reward models. This suggests that strong general-purpose VLMs are better judges than specialized reward models — echoing findings from text-only RewardBench.
Most reward models have the hardest time with subtle hallucinations: responses that are mostly correct but fabricate one visual detail. This is concerning because hallucination is the most dangerous failure mode for deployed VLMs.
Larger reward models are generally better, but the relationship is not monotonic. Some 7B reward models outperform 70B ones on specific categories, suggesting that architecture and training data matter more than raw size.
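Category-level comparisons like these depend on breaking preference accuracy down per category instead of reporting a single number. A minimal sketch of that breakdown follows; the outcomes are made up for illustration and are not benchmark results.

```python
from collections import defaultdict


def accuracy_by_category(results):
    """results: iterable of (category, correct) tuples, one per preference pair."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for category, correct in results:
        totals[category] += 1
        hits[category] += int(correct)
    return {category: hits[category] / totals[category] for category in totals}


# Made-up outcomes, not benchmark results
results = [
    ("Object Recognition", True),
    ("Object Recognition", True),
    ("Hallucination", True),
    ("Hallucination", False),
]
per_category = accuracy_by_category(results)
```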
The benchmark identifies specific failure patterns that reveal fundamental limitations in how current reward models process visual information.
| Failure | Description | Frequency |
|---|---|---|
| Verbosity bias | Longer responses get higher scores regardless of accuracy | ~30% of errors |
| Confidence bias | Confident-sounding wrong answers score higher than hedged correct ones | ~25% of errors |
| Text-only judging | Reward model ignores the image entirely, judging only text quality | ~20% of errors |
| Position bias | In pairwise mode, preferring whichever response is shown first/second | ~15% of errors |
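Two of these biases can be probed with simple diagnostics. The sketch below is an illustration under stated assumptions, not the benchmark's own procedure: the `judge(response_a, response_b)` interface is hypothetical (image and prompt are omitted for brevity), and the toy judges exist only to make the example runnable.

```python
def position_consistency(judge, pairs):
    """Fraction of pairs where the judge picks the same response in both orders."""
    consistent = 0
    for first, second in pairs:
        picked_forward = first if judge(first, second) == "A" else second
        picked_swapped = second if judge(second, first) == "A" else first
        consistent += int(picked_forward == picked_swapped)
    return consistent / len(pairs)


def longer_rejected_share(errors):
    """Among misjudged (chosen, rejected) pairs, the share where rejected is longer."""
    longer = sum(1 for chosen, rejected in errors if len(rejected) > len(chosen))
    return longer / len(errors)


# Toy judges: one always answers "A" (pure position bias), one prefers the shorter text
def always_first(response_a, response_b):
    return "A"


def prefers_shorter(response_a, response_b):
    return "A" if len(response_a) <= len(response_b) else "B"


pairs = [("The cup is on the table.", "The cup is clearly and obviously under the table.")]
biased = position_consistency(always_first, pairs)       # pick changes when order swaps
unbiased = position_consistency(prefers_shorter, pairs)  # same pick in both orders
verbosity_share = longer_rejected_share(pairs)           # treats the pair as a misjudged one
```

A similar probe for text-only judging would compare scores with the real image against scores with a blank or mismatched image: if they barely change, the reward model is likely ignoring the visual input.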
Multimodal RewardBench fills a critical gap in the multimodal AI evaluation ecosystem. As VLMs become more capable, the reward models that guide their alignment become increasingly important.
| Benchmark | Evaluates | Modality |
|---|---|---|
| RewardBench | Text reward models | Text only |
| Multimodal RewardBench | Multimodal reward models | Vision + Language |
| VLMEvalKit | VLM capabilities | Vision + Language |
| GenEval | Image generation quality | Image generation |