Image generators are secretly generalist vision learners — instruction-tuning an image generator to output decodable RGB produces a single model that beats SAM 3 on segmentation and Depth Anything 3 on depth.
You want a robot to pick up a mug from a cluttered table. Simple enough. But to do it, the robot needs to understand the scene in at least three ways: it needs segmentation to know which pixels are the mug, depth to know how far away the mug is, and surface normals to know the angle of the surface it's grasping.
Today, you would deploy three separate specialist models for this. SAM 3 for segmentation. Depth Anything 3 for metric depth. Lotus-2 for surface normals. Each is a different architecture, trained on different data, with different input/output formats. Three models to load, three forward passes per frame, three sets of weights eating your GPU memory.
Now scale that to all of computer vision. Instance segmentation? Another model. Referring expression segmentation ("the red mug on the left")? Another model. Semantic segmentation? Yet another. Every new vision task requires a new specialist — new architecture, new training recipe, new failure modes.
Compare this to language. GPT, Claude, Gemini — they are all generalist language models. One model that summarizes, translates, codes, answers questions, writes poetry. The key was instruction-tuning a pretrained text generator. The generator already understood language; instruction-tuning just taught it to follow task formats.
Vision Banana asks: can we do the same thing with images?
Here is the surprising observation that makes Vision Banana work: image generators already understand vision. A model that can generate a photorealistic image of a cat on a table has, implicitly, learned what cats look like (segmentation), how far away objects are (depth), and the geometry of surfaces (normals). All of that knowledge is baked into the generator's weights during pretraining.
The problem is that this knowledge is locked inside the generative process. The model knows what a cat looks like, but it can only express that knowledge by generating images of cats. It cannot directly output a segmentation mask or a depth map.
Vision Banana's key insight: instruction-tuning unlocks it. Just like instruction-tuning an LLM teaches it to follow task prompts ("Summarize this text:", "Translate to French:"), instruction-tuning an image generator teaches it to follow vision task prompts ("Segment the cat:", "Estimate depth:").
The base model is Nano Banana Pro, a pretrained image generator. Vision Banana instruction-tunes it by mixing vision task data at a low ratio into the original training data. The ratio is critical — too much task data and the model forgets how to generate images. Too little and it doesn't learn the output formats. The sweet spot preserves generation capabilities while adding vision understanding.
The result: a single model that does segmentation, depth, normals, and image generation. No specialized architectures. No custom loss functions. No new modules. Just an image generator that has learned to generate a different kind of image.
To understand why Vision Banana works, you need to see the structural parallel with LLMs. Language AI and vision AI are converging on the same pattern. The three steps below describe the language side first, then the vision side.
Step 1: Generative pretraining. Train a model to predict the next token. This forces it to learn grammar, facts, reasoning — the full structure of language. The result is a base model (GPT-3, LLaMA, etc.).
Step 2: Instruction-tuning. Take the base model and fine-tune on (instruction, response) pairs. "Summarize this article: [text]" → "[summary]". The model's language understanding doesn't change — it just learns to follow task formats expressed as text.
Step 3: Universal interface. Text generation becomes the universal interface. Every task — summarization, translation, QA, coding — is expressed as "generate the right text." One model, infinite tasks.
Step 1: Image generation pretraining. Train a model to generate images from text prompts. This forces it to learn object identity, spatial relationships, depth, geometry — the full structure of visual scenes. The result is Nano Banana Pro.
Step 2: Instruction-tuning. Take the base generator and fine-tune on (vision prompt, output image) pairs. "Segment the cat: [image]" → "[colored mask image]". The model's visual understanding doesn't change — it just learns to follow task formats expressed as RGB images.
Step 3: Universal interface. Image generation becomes the universal interface. Every vision task — segmentation, depth, normals, editing — is expressed as "generate the right image." One model, infinite tasks.
Vision Banana's method is deceptively simple. There are three ideas, and they all work together.
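The foundational trick is to treat every output as an image. Instead of building custom decoders for each task, Vision Banana parameterizes every vision task output as an RGB image that the generator can produce: segmentation becomes a color-coded mask, metric depth becomes a color along a path through the RGB cube, and surface normals become one color channel per vector component.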
The key constraint: the mapping must be invertible. You must be able to decode the generated RGB image back into the original task output (depth values, class labels, normal vectors). If the mapping is lossy, you lose information and can't recover the answer.
Instruction-tuning is done by mixing vision task data into the original training data at a low ratio. The model continues to see image generation examples alongside vision task examples. This prevents catastrophic forgetting — the model retains its ability to generate photorealistic images while learning to produce vision outputs.
The mixing ratio is the key hyperparameter. Too high (say 50% task data) and the model degrades at generation. Too low (say 1%) and it doesn't learn the output formats reliably. The paper finds the sweet spot empirically.
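The mixing scheme can be sketched as a sampler that draws each training example from the task pool with a small probability and from the original generation pool otherwise. This is a minimal illustration only: the names `mixed_stream` and `task_ratio`, the toy pools, and the 0.05 value are assumptions, since the text does not state the ratio actually used.

```python
import random
from typing import Iterator, Sequence, Tuple

def mixed_stream(
    generation_data: Sequence[str],
    task_data: Sequence[str],
    task_ratio: float,
    seed: int = 0,
) -> Iterator[Tuple[str, str]]:
    """Yield (source, example) pairs, picking a task example with probability task_ratio."""
    if not 0.0 <= task_ratio <= 1.0:
        raise ValueError("task_ratio must be in [0, 1]")
    rng = random.Random(seed)
    while True:
        if rng.random() < task_ratio:
            yield "task", rng.choice(task_data)
        else:
            yield "generation", rng.choice(generation_data)

# Hypothetical pools; real ones would hold (prompt, image) training examples.
generation_pool = ["a photo of a cat on a table", "a watercolor of a harbor"]
task_pool = ["Segment the cat: <image>", "Estimate depth: <image>"]

# task_ratio=0.05 is an illustrative value, not the paper's setting.
stream = mixed_stream(generation_pool, task_pool, task_ratio=0.05)
batch = [next(stream) for _ in range(10_000)]
task_fraction = sum(1 for source, _ in batch if source == "task") / len(batch)
print(f"observed task fraction: {task_fraction:.3f}")
```

Sampling per example rather than per batch keeps generation examples present in almost every batch, which is the property the text credits with preventing forgetting.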
Perhaps the most striking aspect is what is missing. Vision Banana does not add any new modules, custom loss functions, or architectural modifications to Nano Banana Pro. The training objective is the same as image generation: generate an image that matches the target. The only difference is what the target image represents.
Encoding metric depth as RGB is the paper's most technically interesting contribution. The challenge: metric depth ranges from 0 (at the camera) to infinity (the horizon). How do you map an infinite range to 256×256×256 possible RGB colors — and make it invertible?
First, compress the infinite depth range into [0, 1). The power transform is:
With λ = −3 and c = 10/3. This maps metric depth d ∈ [0, ∞) to a normalized distance f ∈ [0, 1).
Why this specific transform? Think about robotics. A mug 0.3m away vs 0.5m away matters enormously for grasping. A mountain 300m away vs 500m away is irrelevant. The power transform "curves" the mapping to allocate more color resolution to nearby objects. The steep part of the curve is near d = 0, where small depth differences produce large color changes. The flat tail is at large d, where huge depth differences compress into tiny color changes.
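To make the resolution argument concrete, here is a small numerical sketch. The formula itself does not appear in the text above, so the code substitutes a stand-in, f = 1 − (1 + d/c)^λ, chosen only because it shares the stated properties (it maps [0, ∞) to [0, 1), is monotone, and is steepest near d = 0). It should not be read as the paper's transform; the function names are also hypothetical.

```python
LAMBDA = -3.0      # value quoted in the text
C = 10.0 / 3.0     # value quoted in the text

def depth_to_unit(d: float, lam: float = LAMBDA, c: float = C) -> float:
    """Stand-in transform (an assumption, not the paper's formula): 1 - (1 + d/c)**lam."""
    if d < 0:
        raise ValueError("depth must be non-negative")
    return 1.0 - (1.0 + d / c) ** lam

def unit_to_depth(f: float, lam: float = LAMBDA, c: float = C) -> float:
    """Inverse of depth_to_unit for f in [0, 1)."""
    if not 0.0 <= f < 1.0:
        raise ValueError("f must be in [0, 1)")
    return c * ((1.0 - f) ** (1.0 / lam) - 1.0)

# Share of the normalized range spent on a near gap versus a far gap.
near_gap = depth_to_unit(0.5) - depth_to_unit(0.3)
far_gap = depth_to_unit(500.0) - depth_to_unit(300.0)
print(f"0.3 m vs 0.5 m uses {near_gap:.4f} of the range")
print(f"300 m vs 500 m uses {far_gap:.2e} of the range")
```

Under this stand-in, the 20 cm gap at grasping distance occupies several orders of magnitude more of the code range than the 200 m gap at mountain distance, which is the behavior the paragraph describes.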
Now map the normalized distance f ∈ [0, 1) to an RGB color. The path follows the edges of the RGB cube from corner to corner, visiting all eight corners as anchor points. Think of it as the first iteration of a 3D Hilbert curve.
Between these anchor points, colors interpolate linearly along the cube edges. With seven edges of 256 levels each, this gives roughly 7 × 256 = 1,792 color steps (slightly fewer distinct colors, since neighboring edges share a corner), far more precision than a single grayscale channel (256 steps).
This entire pipeline is invertible. Given a generated RGB pixel, you reverse the path: RGB → position along cube edges → normalized distance f → invert the power transform → metric depth d. The only loss comes from 8-bit color quantization and from any artifacts in the generated image.
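The color step and its inverse can be sketched as follows. The corner order used here is an assumption: a Gray-code walk in which consecutive corners differ in exactly one channel, so every segment is a cube edge. The anchor order in the original may differ, and `unit_to_rgb` and `rgb_to_unit` are hypothetical names.

```python
from typing import List, Sequence, Tuple

RGB = Tuple[int, int, int]

# Assumed corner order (Gray-code walk); the source's anchor order may differ.
CORNERS: List[RGB] = [
    (0, 0, 0), (255, 0, 0), (255, 255, 0), (0, 255, 0),
    (0, 255, 255), (255, 255, 255), (255, 0, 255), (0, 0, 255),
]
NUM_EDGES = len(CORNERS) - 1  # 7 edges along the path

def unit_to_rgb(f: float) -> RGB:
    """Map f in [0, 1] to an 8-bit color on the cube-edge path."""
    f = min(max(f, 0.0), 1.0)
    pos = f * NUM_EDGES
    k = min(int(pos), NUM_EDGES - 1)  # index of the edge we are on
    t = pos - k                       # fraction travelled along that edge
    a, b = CORNERS[k], CORNERS[k + 1]
    return tuple(round(a[i] + (b[i] - a[i]) * t) for i in range(3))

def rgb_to_unit(color: Sequence[float]) -> float:
    """Snap a possibly noisy color to the nearest path point and return its position."""
    best_f, best_dist = 0.0, float("inf")
    for k in range(NUM_EDGES):
        a, b = CORNERS[k], CORNERS[k + 1]
        # Only one channel varies along a cube edge, so projection is 1-D.
        axis = next(i for i in range(3) if a[i] != b[i])
        t = (color[axis] - a[axis]) / (b[axis] - a[axis])
        t = min(max(t, 0.0), 1.0)
        point = [a[i] + (b[i] - a[i]) * t for i in range(3)]
        dist = sum((color[i] - point[i]) ** 2 for i in range(3))
        if dist < best_dist:
            best_f, best_dist = (k + t) / NUM_EDGES, dist
    return best_f

for f in (0.0, 0.25, 0.5, 1.0):
    color = unit_to_rgb(f)
    print(f, color, round(rgb_to_unit(color), 4))
```

Decoding by nearest point on the path, rather than by exact lookup, is a design choice that tolerates small color errors in a generated image. The round trip is exact only up to 8-bit rounding.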
During training, the paper also augments with Plasma, Inferno, Viridis, and grayscale colormaps. This teaches the model that the concept of "depth encoded as color" is general, not tied to one specific color scheme. At inference, the RGB cube path is used because it is the most precise and fully invertible.
Segmentation is conceptually simpler than depth: each pixel gets a class label, and each class gets a color. The generated image IS the segmentation mask. But Vision Banana handles three different segmentation paradigms, each with its own color strategy.
The prompt specifies a JSON color mapping: {"cat": "red", "lock": "pink", "background": "yellow"}. The model generates an image where all cat pixels are red, all lock pixels are pink, and all background pixels are yellow. Decoding: cluster pixels by color, assign each cluster the corresponding class label.
Harder: multiple instances of the same class need different colors. Vision Banana handles this by prompting one class at a time. "Show all instances of 'person'." The model dynamically assigns different colors to different person instances — person 1 in red, person 2 in blue, person 3 in green. Decoding: cluster by color, each cluster is one instance.
The most flexible variant. Free-form text queries like "the man in the pink t-shirt" or "the stretching cat." The model generates a mask image where the referred object is highlighted. This is where the image generator's language understanding really shines — it can parse complex referring expressions because it was pretrained on text-image pairs.
The decoding pipeline is straightforward: take the generated image, cluster pixels by their RGB values (accounting for small generation artifacts via nearest-neighbor to the specified palette), and extract a binary mask for each color cluster. Each mask corresponds to one class (semantic) or one instance (instance).
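A minimal sketch of this decoding step for the semantic case, using nearest-neighbor assignment to the prompt's palette. The RGB values chosen for the color names and the function name `decode_semantic` are assumptions; the text gives color names only and does not describe the decoder's implementation.

```python
import json
from typing import Dict, List, Tuple

RGB = Tuple[int, int, int]

# Assumed RGB values for the color names; the source does not specify them.
NAMED_COLORS: Dict[str, RGB] = {
    "red": (255, 0, 0),
    "pink": (255, 192, 203),
    "yellow": (255, 255, 0),
}

def decode_semantic(image: List[List[RGB]], prompt_json: str) -> Dict[str, List[List[bool]]]:
    """Assign each pixel to the class with the nearest palette color; return one mask per class."""
    class_to_color = {cls: NAMED_COLORS[name] for cls, name in json.loads(prompt_json).items()}
    masks = {cls: [[False] * len(row) for row in image] for cls in class_to_color}
    for y, row in enumerate(image):
        for x, pixel in enumerate(row):
            nearest = min(
                class_to_color,
                key=lambda cls: sum((p - q) ** 2 for p, q in zip(pixel, class_to_color[cls])),
            )
            masks[nearest][y][x] = True
    return masks

prompt = '{"cat": "red", "lock": "pink", "background": "yellow"}'
# A 2x3 toy "generated" image whose colors are slightly off-palette.
generated = [
    [(250, 6, 3), (252, 190, 200), (253, 251, 8)],
    [(255, 0, 0), (249, 249, 0), (255, 255, 10)],
]
masks = decode_semantic(generated, prompt)
print(masks["cat"])
```

Squared Euclidean distance in RGB is the simplest choice here. A perceptual color space might separate similar palette entries better, but that goes beyond what the text describes.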
Surface normals describe the orientation of a surface at each pixel. A normal vector n = (nx, ny, nz) has three components, each ranging from -1 to 1. RGB images have three channels, each ranging from 0 to 255. The mapping is natural and elegant.
Each normal component maps linearly to a color channel. A surface pointing right (nx = 1) saturates the red channel. A surface pointing up (ny = 1) saturates the green channel. A surface pointing toward the camera (nz = 1) saturates the blue channel. Most real-world surfaces face roughly toward the camera, so normal maps tend to be dominated by blue with red and green variations encoding the surface tilt.
The precision is limited to 256 levels per component, giving an angular resolution of about 0.4 degrees. For most applications (navigation, grasping, reconstruction), this is more than sufficient. The paper achieves 18.928° mean angular error on average across four benchmarks, beating the specialist Lotus-2 model (19.642°).
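The linear mapping and its quantization error can be checked with a short sketch. It assumes the conventional encoding, channel = (component + 1) / 2 × 255, which matches the description above but is not quoted from the paper. The function names are hypothetical.

```python
import math
from typing import Sequence, Tuple

def normal_to_rgb(n: Sequence[float]) -> Tuple[int, int, int]:
    """Map each component from [-1, 1] to an 8-bit channel in [0, 255]."""
    return tuple(round((c + 1.0) / 2.0 * 255.0) for c in n)

def rgb_to_normal(rgb: Sequence[int]) -> Tuple[float, float, float]:
    """Invert the mapping, then renormalize to unit length."""
    v = [c / 255.0 * 2.0 - 1.0 for c in rgb]
    norm = math.sqrt(sum(c * c for c in v))
    return tuple(c / norm for c in v)

def angle_deg(a: Sequence[float], b: Sequence[float]) -> float:
    """Angle in degrees between two unit vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return math.degrees(math.acos(max(-1.0, min(1.0, dot))))

# A unit normal tilted slightly right and down, mostly facing the camera.
n = (0.3, -0.2, math.sqrt(1.0 - 0.3**2 - 0.2**2))
rgb = normal_to_rgb(n)
err = angle_deg(n, rgb_to_normal(rgb))
print(rgb, f"round-trip error: {err:.3f} degrees")
```

Renormalizing after decoding matters because rounding each channel independently generally leaves a vector that is no longer exactly unit length.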
Vision Banana isn't just a proof of concept. It matches or beats dedicated specialist models on nearly every benchmark below, even though those models were specifically designed, trained, and optimized for a single task. Here are the numbers.
| Task / Benchmark | Metric (higher is better) | Vision Banana | Best Specialist | Delta (points) |
|---|---|---|---|---|
| Semantic Seg (Cityscapes) | mIoU | 0.699 | SAM 3: 0.652 | +4.7 |
| Instance Seg (SA-Co/Gold) | pmF1 | 0.540 | DINO-X: 0.552 | −1.2 |
| Referring Seg (RefCOCOg) | cIoU | 0.738 | SAM 3 Agent: 0.734 | +0.4 |
| Reasoning Seg (ReasonSeg) | gIoU | 0.793 | SAM 3 Agent: 0.770 | +2.3 |

| Task / Benchmark | Metric | Vision Banana | Best Specialist | Delta |
|---|---|---|---|---|
| Metric Depth (avg 4 sets) | δ1 (higher is better) | 0.929 | DA3: 0.918 | +1.1 points |
| Surface Normals (avg 4 sets) | Mean angular error (lower is better) | 18.928° | Lotus-2: 19.642° | −0.714° |

| Task | Metric | Vision Banana |
|---|---|---|
| Text-to-Image (GenAI-Bench) | Win rate vs base | 53.5% |
| Image Editing (ImgEdit) | Win rate vs base | 47.8% |
The instance segmentation result is the only one where Vision Banana doesn't beat the specialist — it trails DINO-X by 1.2 points. But consider: DINO-X is a model built specifically for instance segmentation with custom architectures for handling overlapping instances. Vision Banana is doing instance segmentation by generating colored images. The fact that it's even competitive is remarkable.
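For readers unfamiliar with the depth and normal metrics in the tables above, here is a sketch of how they are conventionally computed. It assumes the standard definitions (δ1 as the fraction of pixels whose depth ratio is under 1.25, and mean angular error as the average angle between unit normals). The paper's exact evaluation protocol may differ, and the toy values are made up for illustration.

```python
import math
from typing import Sequence

def delta1(pred: Sequence[float], target: Sequence[float], threshold: float = 1.25) -> float:
    """Fraction of pixels where max(pred/target, target/pred) is below the threshold."""
    hits = sum(1 for p, t in zip(pred, target) if max(p / t, t / p) < threshold)
    return hits / len(target)

def mean_angular_error(pred: Sequence[Sequence[float]], target: Sequence[Sequence[float]]) -> float:
    """Mean angle in degrees between corresponding unit normals."""
    total = 0.0
    for a, b in zip(pred, target):
        dot = sum(x * y for x, y in zip(a, b))
        total += math.degrees(math.acos(max(-1.0, min(1.0, dot))))
    return total / len(target)

# Toy values for illustration only, not benchmark data.
pred_depth = [1.0, 2.1, 4.0, 9.0]
true_depth = [1.1, 2.0, 6.0, 8.0]
d1 = delta1(pred_depth, true_depth)

pred_normals = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)]
true_normals = [(0.0, 0.0, 1.0), (0.0, 0.0, 1.0)]
mae = mean_angular_error(pred_normals, true_normals)
print(f"delta1 = {d1:.2f}, mean angular error = {mae:.1f} degrees")
```

Because δ1 is a ratio test, it is insensitive to absolute scale errors below 25%, whereas mean angular error penalizes every degree of deviation. The two columns are therefore not directly comparable.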
Vision Banana is not just a good model. It's a thesis about the future of computer vision.
The thesis: image generation is the universal interface for vision, just as text generation is the universal interface for language.
Before LLMs, NLP had separate models for sentiment analysis, named entity recognition, machine translation, question answering, summarization, and dozens of other tasks. Each task had its own architecture, its own training pipeline, its own community. The "GPT moment" was the realization that a single text generator, properly instruction-tuned, could do all of them.
Vision Banana argues we're at the same inflection point for vision. Instead of SAM for segmentation + Depth Anything for depth + Lotus for normals + DALL-E for generation, you have one model that does everything. The key insight is that RGB is rich enough to encode any vision output — you just need clever, invertible mappings.
Of course, there are caveats. Image generation is computationally expensive compared to forward-pass-only specialists. The RGB encoding has limited precision (especially for depth at long range). And the approach depends on the base generator being powerful enough to produce precise, artifact-free outputs. But these look like engineering constraints rather than fundamental limitations, and they should ease as base generators scale.
Vision Banana sits at the intersection of generative models, vision understanding, and the unification of AI modalities.
Depth Anything is the specialist line that Vision Banana beats on metric depth (the benchmark comparison is against Depth Anything 3). Depth Anything v2 uses a DINOv2 encoder with a DPT decoder, trained on massive labeled and pseudo-labeled depth data. Vision Banana achieves better depth without camera intrinsics and without a depth-specific architecture: the generative prior provides geometric understanding that a discriminative depth model must learn from labels alone.
Self-supervised ViT features, such as those from the DINOv2 encoder mentioned above, underpin many specialist models. Vision Banana takes a different path to visual representations, through generative pretraining rather than contrastive/self-distillation objectives. The result is features that can be "read out" as images rather than as embedding vectors.
The Vision Transformer (ViT) is the architectural backbone for most modern vision models, and plausibly for the generator inside Nano Banana Pro as well, although its architecture is not fully specified. ViT's ability to model long-range dependencies across image patches is crucial for global scene understanding tasks like depth and segmentation.
If Nano Banana Pro uses a diffusion-based architecture (the paper doesn't fully specify), then DiT's fusion of transformer attention with diffusion denoising provides the generative backbone. The key insight is that diffusion models learn a denoising objective that implicitly captures the distribution of natural images — and by extension, the structure of visual scenes.
SAM is the segmentation specialist line. It introduced promptable segmentation with a separate architecture (ViT encoder + mask decoder). Vision Banana subsumes this capability into the generator itself, with no separate mask decoder needed. The color-based output format is simpler, yet on the reported benchmarks it is more effective for semantic and reasoning segmentation tasks.
| Model | Approach | Tasks | Vision Banana's Edge |
|---|---|---|---|
| SAM 3 | Discriminative, task-specific | Segmentation only | Generalist beats specialist on 3/4 seg benchmarks |
| Depth Anything 3 | Discriminative, task-specific | Depth only | No camera intrinsics needed, better δ1 |
| Lotus-2 | Discriminative, task-specific | Normals only | Lower mean angular error |
| DINO-X | Discriminative, task-specific | Instance seg only | Competitive (within 1.2 points) |
| Vision Banana | Generative, instruction-tuned | All of the above + image gen | — |