Vision Banana — Veanors

Chapter 0: The Problem

You want a robot to pick up a mug from a cluttered table. Simple enough. But to do it, the robot needs to understand the scene in at least three ways: it needs segmentation to know which pixels are the mug, depth to know how far away the mug is, and surface normals to know the angle of the surface it's grasping.

Today, you would deploy three separate specialist models for this. SAM 3 for segmentation. Depth Anything 3 for metric depth. Lotus-2 for surface normals. Each is a different architecture, trained on different data, with different input/output formats. Three models to load, three forward passes per frame, three sets of weights eating your GPU memory.

Now scale that to all of computer vision. Instance segmentation? Another model. Referring expression segmentation ("the red mug on the left")? Another model. Semantic segmentation? Yet another. Every new vision task requires a new specialist — new architecture, new training recipe, new failure modes.

The fragmentation problem: Computer vision in 2025 looks like NLP in 2017 — a zoo of task-specific models, each excellent at one thing but incapable of anything else. There is no "GPT for vision." No single model that does segmentation AND depth AND normals AND generation. Every new task means training a new model from scratch.

Compare this to language. GPT, Claude, Gemini — they are all generalist language models. One model that summarizes, translates, codes, answers questions, writes poetry. The key was instruction-tuning a pretrained text generator. The generator already understood language; instruction-tuning just taught it to follow task formats.

Vision Banana asks: can we do the same thing with images?

What is the fundamental problem Vision Banana addresses?

Image generators cannot produce high-resolution images
Computer vision requires a separate specialist model for every task, with no generalist foundation model equivalent to LLMs
Depth estimation is too slow for real-time robotics

Chapter 1: The Key Insight

Here is the surprising observation that makes Vision Banana work: image generators already understand vision. A model that can generate a photorealistic image of a cat on a table has, implicitly, learned what cats look like (segmentation), how far away objects are (depth), and the geometry of surfaces (normals). All of that knowledge is baked into the generator's weights during pretraining.

The problem is that this knowledge is locked inside the generative process. The model knows what a cat looks like, but it can only express that knowledge by generating images of cats. It cannot directly output a segmentation mask or a depth map.

Vision Banana's key insight: instruction-tuning unlocks it. Just like instruction-tuning an LLM teaches it to follow task prompts ("Summarize this text:", "Translate to French:"), instruction-tuning an image generator teaches it to follow vision task prompts ("Segment the cat:", "Estimate depth:").

The unlock: You do not need to teach the model to understand vision — it already does. You just need to teach it a new output format. Instead of generating a photorealistic image, generate an image where the pixel colors encode the answer to a vision task. A segmentation mask IS an image. A depth map IS an image. Surface normals ARE an image. The generator just needs to learn which kind of image to produce.

The base model is Nano Banana Pro, a pretrained image generator. Vision Banana instruction-tunes it by mixing vision task data at a low ratio into the original training data. The ratio is critical — too much task data and the model forgets how to generate images. Too little and it doesn't learn the output formats. The sweet spot preserves generation capabilities while adding vision understanding.

The result: a single model that does segmentation, depth, normals, and image generation. No specialized architectures. No custom loss functions. No new modules. Just an image generator that has learned to generate a different kind of image.

Why does instruction-tuning an image generator work for vision tasks?

Because the generator already understands visual concepts from pretraining; instruction-tuning just teaches it to express that knowledge in new output formats (colored masks, depth maps, etc.)
Because the generator has a built-in segmentation module that is activated by instruction-tuning
Because instruction-tuning replaces the generator's weights with task-specific weights

Chapter 2: The Analogy

To understand why Vision Banana works, you need to see the structural parallel with LLMs. Language AI and vision AI are converging on the same pattern.

Language: the path we already walked

Step 1: Generative pretraining. Train a model to predict the next token. This forces it to learn grammar, facts, reasoning — the full structure of language. The result is a base model (GPT-3, LLaMA, etc.).

Step 2: Instruction-tuning. Take the base model and fine-tune on (instruction, response) pairs. "Summarize this article: [text]" → "[summary]". The model's language understanding doesn't change — it just learns to follow task formats expressed as text.

Step 3: Universal interface. Text generation becomes the universal interface. Every task — summarization, translation, QA, coding — is expressed as "generate the right text." One model, infinite tasks.

Vision: the same path, one step behind

Step 1: Image generation pretraining. Train a model to generate images from text prompts. This forces it to learn object identity, spatial relationships, depth, geometry — the full structure of visual scenes. The result is Nano Banana Pro.

Step 2: Instruction-tuning. Take the base generator and fine-tune on (vision prompt, output image) pairs. "Segment the cat: [image]" → "[colored mask image]". The model's visual understanding doesn't change — it just learns to follow task formats expressed as RGB images.

Step 3: Universal interface. Image generation becomes the universal interface. Every vision task — segmentation, depth, normals, editing — is expressed as "generate the right image." One model, infinite tasks.

The parallel is exact: LLM pretraining teaches language understanding through text generation. Image generation pretraining teaches visual understanding through image generation. Instruction-tuning unlocks both. The only difference is the output modality: text tokens vs. RGB pixels.

In the LLM-to-Vision analogy, what corresponds to "instruction-tuning an LLM on (instruction, response) pairs"?

Training an image generator from scratch on vision task data
Fine-tuning the pretrained image generator on (vision prompt, output image) pairs where the output image encodes the task answer
Adding a separate classification head to the image generator

Chapter 3: The Method

Vision Banana's method is deceptively simple. There are three ideas, and they all work together.

Idea 1: All vision outputs are RGB images

This is the foundational trick. Instead of building custom decoders for each task, Vision Banana parameterizes every vision task output as an RGB image that the generator can produce:

Segmentation → colored mask images (one color per class or instance)
Metric depth → false-color images via an invertible power transform + RGB cube path
Surface normals → natural RGB mapping (x→R, y→G, z→B)

The key constraint: the mapping must be invertible. You must be able to decode the generated RGB image back into the original task output (depth values, class labels, normal vectors). If the mapping is lossy, you lose information and can't recover the answer.

Idea 2: Low-ratio data mixing

Instruction-tuning is done by mixing vision task data into the original training data at a low ratio. The model continues to see image generation examples alongside vision task examples. This prevents catastrophic forgetting — the model retains its ability to generate photorealistic images while learning to produce vision outputs.

The mixing ratio is the key hyperparameter. Too high (say 50% task data) and the model degrades at generation. Too low (say 1%) and it doesn't learn the output formats reliably. The paper finds the sweet spot empirically.
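
A minimal sketch of what low-ratio mixing could look like, assuming each training example is drawn independently and the vision-task share is a single scalar. The function name `mixed_stream`, the 5% ratio, and the toy example strings are illustrative assumptions; the actual ratio and data pipeline are not given here.

```python
import random
from typing import Iterator, Sequence, Tuple


def mixed_stream(
    generation_data: Sequence[str],
    task_data: Sequence[str],
    task_ratio: float,
    seed: int = 0,
) -> Iterator[Tuple[str, str]]:
    """Yield (source, example) pairs, drawing task data with probability task_ratio."""
    if not 0.0 <= task_ratio <= 1.0:
        raise ValueError("task_ratio must be in [0, 1]")
    rng = random.Random(seed)
    while True:
        if rng.random() < task_ratio:
            yield "task", rng.choice(task_data)
        else:
            yield "generation", rng.choice(generation_data)


# Toy stand-ins for real training examples (hypothetical).
generation_data = ["t2i: a cat on a table", "edit: make the sky orange"]
task_data = ["segment the cat", "estimate metric depth", "estimate surface normals"]

# task_ratio=0.05 is an assumed value chosen only for illustration.
stream = mixed_stream(generation_data, task_data, task_ratio=0.05)
sample = [next(stream) for _ in range(10_000)]
task_share = sum(1 for source, _ in sample if source == "task") / len(sample)
print(f"observed task share: {task_share:.3f}")
```

Because most draws still come from the generation data, the model keeps practicing its original objective while occasionally seeing a vision-task target.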

Idea 3: No architectural changes

This is perhaps the most striking aspect. Vision Banana does not add any new modules, custom loss functions, or architectural modifications to Nano Banana Pro. The training objective is the same as image generation: generate an image that matches the target. The only difference is what the target image represents.

The data flow:
Input: text prompt ("Segment the scene. Classes: car=red, sky=blue, road=gray.") + input image.
Output: generated RGB image where pixel colors encode the task answer.
Decoding: cluster pixels by color to extract masks, invert the color transform to read depth values, or read the channels directly as normal vectors.
The model is just doing image generation, but the images encode structured vision outputs.

1. Task Prompt

Text instruction specifying the vision task + output format (e.g., color mapping for segmentation, or "estimate metric depth")

↓

2. Input Image

The scene to analyze — same image the generator would normally condition on for editing tasks

↓

3. Generate RGB

The instruction-tuned generator produces an output image using the same generation pipeline as the base Nano Banana Pro (whatever that pipeline is: diffusion, autoregressive, etc.)

↓

4. Decode Output

Post-process the RGB image to extract the vision task answer: color clustering for masks, inverse transform for depth, channel readout for normals

Why must the RGB encoding of vision outputs be invertible?

Because the model needs to verify its own outputs during training
Because you need to decode the generated RGB image back into the original task output (depth values, class labels, normal vectors) without information loss
Because invertible mappings train faster

Chapter 4: Depth as Color

Encoding metric depth as RGB is the paper's most technically interesting contribution. The challenge: metric depth ranges from 0 (at the camera) to infinity (the horizon). How do you map an infinite range to 256×256×256 possible RGB colors — and make it invertible?

Step 1: Power transform

First, compress the infinite depth range into [0, 1). The power transform is:

f(d, λ, c) = 1 − (1 − d/(λc))^(λ+1)

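Here λ = −3 and c = 10/3, so λc = −10 and the exponent is λ + 1 = −2, which simplifies the formula to f(d) = 1 − (1 + d/10)^(−2). This maps metric depth d ∈ [0, ∞) to a normalized distance f ∈ [0, 1).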

Why this specific transform? Think about robotics. A mug 0.3m away vs 0.5m away matters enormously for grasping. A mountain 300m away vs 500m away is irrelevant. The power transform "curves" the mapping to allocate more color resolution to nearby objects. The steep part of the curve is near d = 0, where small depth differences produce large color changes. The flat tail is at large d, where huge depth differences compress into tiny color changes.

The intuition: Imagine you have only 256 colors to represent all possible depths. A linear mapping wastes most of those colors on distant objects you'll never interact with. The power transform spends colors where they matter — on the nearby objects that a robot actually needs to grasp, avoid, or navigate around.
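
The transform and its inverse can be sketched in a few lines, using the formula and the parameters λ = −3, c = 10/3 quoted above. The function names `depth_to_unit` and `unit_to_depth` are illustrative, and the inverse is obtained simply by solving the formula for d; this is a sketch, not the paper's code.

```python
LAMBDA = -3.0
C = 10.0 / 3.0


def depth_to_unit(d: float, lam: float = LAMBDA, c: float = C) -> float:
    """f(d) = 1 - (1 - d / (lam * c)) ** (lam + 1), mapping [0, inf) into [0, 1)."""
    if d < 0:
        raise ValueError("depth must be non-negative")
    return 1.0 - (1.0 - d / (lam * c)) ** (lam + 1.0)


def unit_to_depth(f: float, lam: float = LAMBDA, c: float = C) -> float:
    """Inverse of depth_to_unit, found by solving the formula for d."""
    if not 0.0 <= f < 1.0:
        raise ValueError("f must be in [0, 1)")
    return lam * c * (1.0 - (1.0 - f) ** (1.0 / (lam + 1.0)))


# Compare how much of the [0, 1) range a near gap and a far gap consume.
near_gap = depth_to_unit(0.5) - depth_to_unit(0.3)
far_gap = depth_to_unit(500.0) - depth_to_unit(300.0)
print(f"0.3 m -> 0.5 m changes f by {near_gap:.4f}")
print(f"300 m -> 500 m changes f by {far_gap:.4f}")
print(f"round trip of 2.0 m: {unit_to_depth(depth_to_unit(2.0)):.6f}")
```

The 20 cm gap near the camera consumes far more of the normalized range than the 200 m gap in the distance, which is the "spend colors where they matter" behavior described above.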

Step 2: RGB cube path

Now map the normalized distance f ∈ [0, 1) to an RGB color. The path follows the edges of the RGB cube — think of it as the first iteration of a 3D Hilbert curve:

f = 0.000 → (0,0,0) black
f = 0.143 → (0,0,1) blue
f = 0.286 → (0,1,1) cyan
f = 0.429 → (0,1,0) green
f = 0.571 → (1,1,0) yellow
f = 0.714 → (1,0,0) red
f = 0.857 → (1,0,1) magenta
f = 1.000 → (1,1,1) white

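Between these anchor points, colors interpolate linearly along the cube edges. With 8-bit channels, each of the 7 edges contributes about 256 levels, for roughly 7 × 256 = 1,792 color steps (1,786 distinct colors once the shared corner colors are counted only once). That is far more precision than a single grayscale channel (256 steps).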
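
As a rough sketch of the forward mapping, the snippet below walks the eight anchor corners listed above and interpolates linearly along each edge. The name `unit_to_rgb` and the rounding rule (round half up to 8 bits) are assumptions made for illustration.

```python
from typing import List, Tuple

# Corner sequence taken from the anchor list above.
CUBE_PATH: List[Tuple[int, int, int]] = [
    (0, 0, 0),  # black
    (0, 0, 1),  # blue
    (0, 1, 1),  # cyan
    (0, 1, 0),  # green
    (1, 1, 0),  # yellow
    (1, 0, 0),  # red
    (1, 0, 1),  # magenta
    (1, 1, 1),  # white
]


def unit_to_rgb(f: float) -> Tuple[int, int, int]:
    """Map f in [0, 1] to an 8-bit color by walking the cube edges."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be in [0, 1]")
    segments = len(CUBE_PATH) - 1              # 7 edges
    position = f * segments
    index = min(int(position), segments - 1)   # clamp so f == 1.0 stays on the last edge
    t = position - index                       # fraction travelled along this edge
    start, end = CUBE_PATH[index], CUBE_PATH[index + 1]
    r, g, b = (int((s + t * (e - s)) * 255 + 0.5) for s, e in zip(start, end))
    return (r, g, b)


for f in (0.0, 0.25, 0.5, 1.0):
    print(f, unit_to_rgb(f))

# Count how many distinct 8-bit colors the path can produce.
distinct = len({unit_to_rgb(k / 1785) for k in range(1786)})
print("distinct 8-bit colors on the path:", distinct)
```

In this sketch the sweep yields 7 × 255 + 1 = 1,786 distinct colors, because neighboring edges share a corner.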

Step 3: Inversion

This entire pipeline is a bijection on continuous values. Given a generated RGB pixel, you reverse the path: RGB → position along cube edges → normalized distance f → invert the power transform → metric depth d. Apart from 8-bit quantization and any generation noise, no information is lost.
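
One way to sketch the inversion is to project each pixel onto the nearest cube edge, which also tolerates small color errors in the generated image. The projection strategy and the names `rgb_to_unit` and `rgb_to_depth` are assumptions for illustration; the corner order and the parameters λ = −3, c = 10/3 come from the text above.

```python
import math
from typing import List, Tuple

CUBE_PATH: List[Tuple[int, int, int]] = [
    (0, 0, 0), (0, 0, 1), (0, 1, 1), (0, 1, 0),
    (1, 1, 0), (1, 0, 0), (1, 0, 1), (1, 1, 1),
]
LAMBDA = -3.0
C = 10.0 / 3.0


def rgb_to_unit(rgb: Tuple[int, int, int]) -> float:
    """Project an 8-bit color onto the nearest cube edge and return f in [0, 1]."""
    point = [channel / 255.0 for channel in rgb]
    segments = len(CUBE_PATH) - 1
    best_f, best_dist = 0.0, float("inf")
    for index in range(segments):
        start, end = CUBE_PATH[index], CUBE_PATH[index + 1]
        direction = [e - s for s, e in zip(start, end)]   # axis-aligned, length 1
        offset = [p - s for p, s in zip(point, start)]
        t = sum(o * d for o, d in zip(offset, direction))
        t = min(1.0, max(0.0, t))                          # stay on the edge
        nearest = [s + t * d for s, d in zip(start, direction)]
        dist = sum((p - n) ** 2 for p, n in zip(point, nearest))
        if dist < best_dist:
            best_f, best_dist = (index + t) / segments, dist
    return best_f


def unit_to_depth(f: float, lam: float = LAMBDA, c: float = C) -> float:
    """Invert f = 1 - (1 - d / (lam * c)) ** (lam + 1); f == 1 means 'infinitely far'."""
    if f >= 1.0:
        return math.inf
    return lam * c * (1.0 - (1.0 - f) ** (1.0 / (lam + 1.0)))


def rgb_to_depth(rgb: Tuple[int, int, int]) -> float:
    """Full decode: color -> position on the path -> metric depth in meters."""
    return unit_to_depth(rgb_to_unit(rgb))


print(rgb_to_depth((0, 0, 255)))    # pure blue, the first anchor after black
print(rgb_to_depth((4, 189, 251)))  # slightly noisy color near the blue-cyan edge
```

Snapping to the nearest edge is a design choice of this sketch: a generated pixel will rarely sit exactly on the path, so some form of projection is needed before the position can be read off.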

Why not just use grayscale? Grayscale gives you 256 levels. The RGB cube path gives you 1,792. That is 7x more depth precision. For a robot reaching for a mug, this is the difference between knowing the mug is "somewhere between 28cm and 32cm" vs "at 29.4cm." The extra precision comes free — you're already generating 3-channel RGB images.

During training, the paper also augments with Plasma, Inferno, Viridis, and grayscale colormaps. This teaches the model that the concept of "depth encoded as color" is general, not tied to one specific color scheme. At inference, the RGB cube path is used because it is the most precise and fully invertible.

Why does the power transform allocate more color resolution to nearby objects?

Because the transform is linear and treats all depths equally
Because the steep part of the curve is near d=0, so small depth differences near the camera produce large color changes, while large depth differences far away compress into tiny changes
Because nearby objects are always brighter than distant objects

Chapter 5: Segmentation as Color

Segmentation is conceptually simpler than depth: each pixel gets a class label, and each class gets a color. The generated image IS the segmentation mask. But Vision Banana handles three different segmentation paradigms, each with its own color strategy.

Semantic segmentation

The prompt specifies a JSON color mapping: {"cat": "red", "lock": "pink", "background": "yellow"}. The model generates an image where all cat pixels are red, all lock pixels are pink, and all background pixels are yellow. Decoding: cluster pixels by color, assign each cluster the corresponding class label.

Instance segmentation

Harder: multiple instances of the same class need different colors. Vision Banana handles this by prompting one class at a time. "Show all instances of 'person'." The model dynamically assigns different colors to different person instances — person 1 in red, person 2 in blue, person 3 in green. Decoding: cluster by color, each cluster is one instance.
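
Since instance colors are chosen by the model rather than the prompt, the decoder cannot rely on a known palette. Below is a hedged sketch of one possible decoder: a greedy pass that opens a new cluster whenever a pixel is far from every color seen so far. The black background, the tolerance of 40, and the function names are all assumptions; the text above only says "cluster by color".

```python
from typing import List, Tuple

Color = Tuple[int, int, int]
Image = List[List[Color]]
Mask = List[List[bool]]


def color_distance(a: Color, b: Color) -> float:
    """Euclidean distance between two RGB colors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def cluster_instances(
    image: Image,
    background: Color = (0, 0, 0),   # assumed "no instance" color
    tolerance: float = 40.0,         # assumed noise tolerance
) -> List[Mask]:
    """Greedy color clustering: each sufficiently distinct color becomes one instance mask."""
    height, width = len(image), len(image[0])
    centers: List[Color] = []
    masks: List[Mask] = []
    for y in range(height):
        for x in range(width):
            pixel = image[y][x]
            if color_distance(pixel, background) <= tolerance:
                continue  # near-background pixels belong to no instance
            for index, center in enumerate(centers):
                if color_distance(pixel, center) <= tolerance:
                    break  # pixel matches an existing instance color
            else:
                centers.append(pixel)
                masks.append([[False] * width for _ in range(height)])
                index = len(centers) - 1
            masks[index][y][x] = True
    return masks


# Toy 2x4 "generated" image: two reddish pixels, two bluish, one green, rest near black.
toy = [
    [(250, 5, 5), (255, 0, 0), (0, 0, 0), (0, 0, 255)],
    [(0, 0, 0), (3, 2, 1), (10, 0, 250), (0, 255, 0)],
]
instance_masks = cluster_instances(toy)
print("instances found:", len(instance_masks))
```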

Referring expression segmentation

The most flexible variant. Free-form text queries like "the man in the pink t-shirt" or "the stretching cat." The model generates a mask image where the referred object is highlighted. This is where the image generator's language understanding really shines — it can parse complex referring expressions because it was pretrained on text-image pairs.

Why JSON in the prompt? By specifying colors in the prompt, Vision Banana avoids hardcoding any class-to-color mapping. The model learns the concept of "assign this class this color" rather than a fixed lookup table. This means it generalizes to arbitrary classes it has never seen during training — just specify the class name and a color in the prompt.

The decoding pipeline is straightforward: take the generated image, cluster pixels by their RGB values (accounting for small generation artifacts via nearest-neighbor to the specified palette), and extract a binary mask for each color cluster. Each mask corresponds to one class (semantic) or one instance (instance).
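
The nearest-neighbor decoding step can be sketched as follows, assuming the color names in the prompt have already been resolved to concrete RGB values. The specific RGB triples for "red", "pink", and "yellow", and the function names, are illustrative assumptions.

```python
from typing import Dict, List, Tuple

Color = Tuple[int, int, int]
Image = List[List[Color]]
Mask = List[List[bool]]

# Hypothetical RGB values for the color names used in the example prompt.
PALETTE: Dict[str, Color] = {
    "cat": (255, 0, 0),           # "red"
    "lock": (255, 192, 203),      # "pink"
    "background": (255, 255, 0),  # "yellow"
}


def nearest_class(pixel: Color, palette: Dict[str, Color]) -> str:
    """Return the class whose palette color is closest to the pixel (squared RGB distance)."""
    return min(
        palette,
        key=lambda name: sum((p - q) ** 2 for p, q in zip(pixel, palette[name])),
    )


def decode_masks(image: Image, palette: Dict[str, Color]) -> Dict[str, Mask]:
    """Snap every pixel to the nearest palette color, then build one binary mask per class."""
    labels = [[nearest_class(pixel, palette) for pixel in row] for row in image]
    return {
        name: [[label == name for label in row] for row in labels]
        for name in palette
    }


# Toy 2x2 "generated" image with small color errors.
toy = [
    [(250, 10, 5), (255, 250, 8)],
    [(250, 190, 200), (248, 3, 0)],
]
masks = decode_masks(toy, PALETTE)
for name, mask in masks.items():
    print(name, mask)
```

Snapping to the prompt's palette, rather than clustering blindly, means small generation artifacts cannot create spurious classes.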

The power of the generative prior: A segmentation specialist trained only on segmentation data can only segment classes it saw during training. Vision Banana's image generator was pretrained on billions of image-text pairs. It has seen cats, cars, chandeliers, and cantaloupe. The generative prior gives it zero-shot vocabulary — it can segment classes it was never explicitly trained to segment, because it knows what they look like from pretraining.

How does Vision Banana handle instance segmentation (multiple objects of the same class)?

It prompts one class at a time and the model dynamically assigns different colors to different instances of that class
It uses a fixed color palette where instance 1 is always red, instance 2 is always blue, etc.
It runs a separate clustering algorithm on top of the semantic segmentation output

Chapter 6: Surface Normals as Color

Surface normals describe the orientation of a surface at each pixel. A normal vector n = (n_x, n_y, n_z) has three components, each ranging from -1 to 1. RGB images have three channels, each ranging from 0 to 255. The mapping is natural and elegant.

The direct mapping

R = (n_x + 1) / 2 × 255
G = (n_y + 1) / 2 × 255
B = (n_z + 1) / 2 × 255

Each normal component maps linearly to a color channel. A surface pointing right (n_x = 1) has its red channel at maximum. A surface pointing up (n_y = 1) has its green channel at maximum. A surface pointing toward the camera (n_z = 1) has its blue channel at maximum, with red and green near the midpoint (about 128). Most real-world surfaces face roughly toward the camera, so normal maps tend to be dominated by blue, with red and green variations encoding the surface tilt.
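
The mapping and its inverse can be sketched directly from the three formulas above. The function names, the rounding rule, and the renormalization after decoding are assumptions added for illustration.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]
Color = Tuple[int, int, int]


def encode_normal(n: Vec3) -> Color:
    """Map each component from [-1, 1] to an 8-bit channel: (n + 1) / 2 * 255."""
    r, g, b = (
        int((max(-1.0, min(1.0, c)) + 1.0) / 2.0 * 255.0 + 0.5)  # round half up
        for c in n
    )
    return (r, g, b)


def decode_normal(rgb: Color) -> Vec3:
    """Invert the mapping and renormalize to unit length to absorb quantization error."""
    x, y, z = (channel / 255.0 * 2.0 - 1.0 for channel in rgb)
    # The length cannot be zero: no integer channel value decodes to exactly 0.
    length = math.sqrt(x * x + y * y + z * z)
    return (x / length, y / length, z / length)


facing_camera = encode_normal((0.0, 0.0, 1.0))
print("camera-facing normal ->", facing_camera)
print("decoded back         ->", decode_normal(facing_camera))
```

A camera-facing normal (0, 0, 1) encodes to (128, 128, 255), the familiar light blue of normal maps. Decoding does not return exactly (0, 0, 1), because 0 falls between two 8-bit levels.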

This is not new. Normal maps have been encoded as RGB images in computer graphics since the 1990s. What IS new is having an image generator produce them. The model doesn't need a geometric decoder — it generates normal maps the same way it generates any other image. The generative prior provides an implicit understanding of 3D surface geometry learned from billions of real images.

The precision is limited to 256 levels per component, giving an angular resolution of about 0.4 degrees. For most applications (navigation, grasping, reconstruction), this is more than sufficient. The paper achieves 18.928° mean angular error on average across four benchmarks, beating the specialist Lotus-2 model (19.642°).
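
For reference, mean angular error is typically computed as the average angle between predicted and ground-truth normals. The sketch below assumes that standard definition; the paper's exact evaluation protocol (masking, per-benchmark averaging) is not given here, and the function names are illustrative.

```python
import math
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def angular_error_degrees(pred: Vec3, target: Vec3) -> float:
    """Angle between two vectors in degrees, via the normalized dot product."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm = math.sqrt(sum(p * p for p in pred)) * math.sqrt(sum(t * t for t in target))
    cosine = max(-1.0, min(1.0, dot / norm))  # clamp against floating-point drift
    return math.degrees(math.acos(cosine))


def mean_angular_error(preds: Sequence[Vec3], targets: Sequence[Vec3]) -> float:
    """Average per-pixel angular error over paired predictions and targets."""
    errors = [angular_error_degrees(p, t) for p, t in zip(preds, targets)]
    return sum(errors) / len(errors)


preds = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)]
targets = [(0.0, 0.0, 1.0), (0.0, 1.0, 0.0)]
print(f"mean angular error: {mean_angular_error(preds, targets):.1f} degrees")

# One 8-bit step changes a component by 2/255; as a small angle that is:
print(f"one quantization step is about {math.degrees(2 / 255):.2f} degrees")
```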

Why normals from an image generator? When you train a model to generate photorealistic images, it implicitly learns about lighting, shading, and surface geometry — because these determine how surfaces look in photos. A surface's normal determines its shading. The generator has learned this relationship inside-out. Instruction-tuning just asks it to output the normals directly instead of using them implicitly during generation.

Why is the surface normal-to-RGB mapping particularly natural compared to the depth encoding?

Because normal maps are always smaller than depth maps
Because normals have exactly three components (x,y,z) mapping directly to three RGB channels, each with the same [-1,1] to [0,255] linear transform, so no complex nonlinear transform or cube path is needed
Because surface normals are always blue

Chapter 7: Results

Vision Banana isn't just a proof of concept. It beats dedicated specialist models on five of the six understanding tasks below, models that were specifically designed, trained, and optimized for a single task. Here are the numbers.

2D Understanding

Task / Benchmark	Metric	Vision Banana	Best Specialist	Delta (points)
Semantic Seg (Cityscapes)	mIoU	0.699	SAM 3: 0.652	+4.7
Instance Seg (SA-Co/Gold)	pmF1	0.540	DINO-X: 0.552	−1.2
Referring Seg (RefCOCOg)	cIoU	0.738	SAM 3 Agent: 0.734	+0.4
Reasoning Seg (ReasonSeg)	gIoU	0.793	SAM 3 Agent: 0.770	+2.3

3D Understanding

Task / Benchmark	Metric	Vision Banana	Best Specialist	Delta
Metric Depth (avg 4 sets)	δ1 (higher is better)	0.929	DA3: 0.918	+1.1 points
Surface Normals (avg 4 sets)	Mean angular error (lower is better)	18.928°	Lotus-2: 19.642°	−0.714°
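
For readers unfamiliar with δ1: it is conventionally the fraction of pixels whose predicted depth is within a factor of 1.25 of the ground truth. The sketch below assumes that standard definition and an assumed convention of skipping pixels without valid ground truth; the paper's exact evaluation protocol is not given here.

```python
from typing import Sequence


def delta1(pred: Sequence[float], gt: Sequence[float], threshold: float = 1.25) -> float:
    """Fraction of valid pixels with max(pred / gt, gt / pred) below the threshold."""
    hits = 0
    valid = 0
    for p, g in zip(pred, gt):
        if g <= 0:
            continue  # skip pixels without ground truth (assumed convention)
        valid += 1
        if p > 0 and max(p / g, g / p) < threshold:
            hits += 1
    if valid == 0:
        raise ValueError("no valid ground-truth depths")
    return hits / valid


# Toy depths in meters: the third prediction is off by a factor of about 1.33.
predicted = [1.0, 2.0, 3.0, 10.0]
ground_truth = [1.1, 2.0, 4.0, 9.0]
print(f"delta1 = {delta1(predicted, ground_truth):.2f}")
```

Because δ1 is a ratio test, an error of 10 cm counts against a prediction at 30 cm but is negligible at 30 m.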

Visual Generation (Retained!)

Task	Metric	Vision Banana
Text-to-Image (GenAI-Bench)	Win rate vs base	53.5%
Image Editing (ImgEdit)	Win rate vs base	47.8%

Read that again. Vision Banana beats SAM 3 on semantic segmentation by 4.7 mIoU points. It beats Depth Anything 3 on metric depth WITHOUT camera intrinsics. It beats Lotus-2 on surface normals. And it still generates images roughly on par with the base model (53.5% and 47.8% win rates, both close to 50%). One model. No task-specific architecture. No custom losses. Just instruction-tuned image generation.

The instance segmentation result is the only one where Vision Banana doesn't beat the specialist — it trails DINO-X by 1.2 points. But consider: DINO-X is a model built specifically for instance segmentation with custom architectures for handling overlapping instances. Vision Banana is doing instance segmentation by generating colored images. The fact that it's even competitive is remarkable.

What makes the metric depth result particularly impressive?

Vision Banana beats Depth Anything 3 (δ1 = 0.929 vs 0.918) WITHOUT requiring camera intrinsics, while being a generalist model that also does segmentation, normals, and image generation
Vision Banana uses a more advanced depth sensor
Vision Banana was trained on more depth data than Depth Anything 3

Chapter 8: The Paradigm Shift

Vision Banana is not just a good model. It's a thesis about the future of computer vision.

The thesis: image generation is the universal interface for vision, just as text generation is the universal interface for language.

What this means in practice

Before LLMs, NLP had separate models for sentiment analysis, named entity recognition, machine translation, question answering, summarization, and dozens of other tasks. Each task had its own architecture, its own training pipeline, its own community. The "GPT moment" was the realization that a single text generator, properly instruction-tuned, could do all of them.

Vision Banana argues we're at the same inflection point for vision. Instead of SAM for segmentation + Depth Anything for depth + Lotus for normals + DALL-E for generation, you have one model that does everything. The key insight is that RGB is rich enough to encode a wide range of vision outputs, provided you design clever, invertible mappings.

Implications

Simpler systems. One model replaces a zoo. One set of weights. One generation pipeline. One API.
New tasks for free. Any vision task whose output can be encoded as an image can be added by instruction-tuning. Optical flow? Encode as color. Edge detection? Already an image. Keypoints? Colored dots on black background.
Scaling benefits. When you scale the base generator (bigger model, more data), all vision tasks improve simultaneously. You don't need to scale each specialist independently.
Multi-task reasoning. A single model can potentially leverage depth understanding to improve segmentation and vice versa, because all knowledge lives in the same weights.

The uncomfortable prediction: If this thesis is correct, most specialist vision models become obsolete in the same way that task-specific NLP models became obsolete after GPT-3. The entire SAM line, the Depth Anything line, every task-specific architecture — they become stepping stones toward the generalist image generator that replaces them all.

Of course, there are caveats. Image generation is computationally expensive compared to forward-pass-only specialists. The RGB encoding has limited precision (especially for depth at long range). And the approach depends on the base generator being powerful enough to produce precise, artifact-free outputs. But these are engineering constraints, not fundamental limitations. They will improve with scale.

Why does the paper call this the "GPT moment" for computer vision?

Because Vision Banana uses the same architecture as GPT
Because just as instruction-tuned text generators replaced task-specific NLP models, instruction-tuned image generators can replace task-specific vision models: one model, one interface (RGB), all tasks
Because Vision Banana can also process text

Chapter 9: Connections

Vision Banana sits at the intersection of generative models, vision understanding, and the unification of AI modalities.

Depth Anything (v1 & v2)

The specialist line that Vision Banana is compared against on metric depth (the results above use Depth Anything 3). Depth Anything v2 uses a DINOv2 encoder with a DPT decoder, trained on massive labeled and pseudo-labeled depth data. Vision Banana achieves better depth without camera intrinsics and without a depth-specific architecture. The argument is that the generative prior provides geometric understanding that a discriminative depth model must learn from labels alone.

DINO / DINOv2

Self-supervised ViT features that underpin many specialist models. Vision Banana takes a different path to visual representations — through generative pretraining rather than contrastive/self-distillation objectives. The result is features that can be "read out" as images rather than as embedding vectors.

Vision Transformers (ViT)

The architectural backbone for most modern vision models, and plausibly for Nano Banana Pro as well (the paper doesn't fully specify its architecture). ViT's ability to model long-range dependencies across image patches is crucial for global scene understanding tasks like depth and segmentation.

Diffusion Transformers (DiT)

If Nano Banana Pro uses a diffusion-based architecture (the paper doesn't fully specify), then DiT's fusion of transformer attention with diffusion denoising provides the generative backbone. The key insight is that diffusion models learn a denoising objective that implicitly captures the distribution of natural images — and by extension, the structure of visual scenes.

SAM / SAM 2 / SAM 3

The segmentation specialist line. SAM introduced promptable segmentation with a separate architecture (ViT encoder + mask decoder). Vision Banana subsumes this capability into the generator itself — no separate mask decoder needed. The color-based output format is simpler but surprisingly more effective for semantic and reasoning segmentation tasks.

Model	Approach	Tasks	Vision Banana's Edge
SAM 3	Discriminative, task-specific	Segmentation only	Generalist beats specialist on 3/4 seg benchmarks
Depth Anything 3	Discriminative, task-specific	Depth only	No camera intrinsics needed, better δ1
Lotus-2	Discriminative, task-specific	Normals only	Lower mean angular error
DINO-X	Discriminative, task-specific	Instance seg only	Competitive (within 1.2 points)
Vision Banana	Generative, instruction-tuned	All of the above + image gen	—

The bigger picture: Vision Banana belongs to a growing family of "generalist via generation" models. In language, we went from task-specific models to GPT. In vision, we may be going from SAM/Depth Anything/Lotus to Vision Banana. The pattern is the same: a powerful generative prior + instruction-tuning + a universal output format = a single model that rivals or beats all specialists.

How does Vision Banana's approach to depth estimation fundamentally differ from Depth Anything's?

Vision Banana uses a generative approach (producing depth as a color image from its pretrained image generation prior) while Depth Anything uses a discriminative approach (a DINOv2 encoder with a task-specific DPT decoder trained on depth labels)
Vision Banana uses LiDAR data while Depth Anything uses monocular images
They use the same approach but Vision Banana has more training data