Devoured - April 27, 2026
Vision Banana Generalist Model (39 minute read)


Researchers demonstrate that image generation models can serve as generalist vision systems by reframing perception tasks like segmentation and depth estimation as image generation problems.

What: Vision Banana is a generalist vision model created by instruction-tuning Google's Nano Banana Pro image generator on vision task data, treating every output (segmentation masks, depth maps, etc.) as an RGB image to generate rather than as the product of a task-specific output layer.
Why it matters: This suggests a potential paradigm shift for computer vision similar to what happened with large language models: generative pretraining on images may be the key to building foundational vision models that excel at both creation and understanding, rather than training specialized models for each task.
Takeaway: Monitor the project page for model weights and implementation details if you're working on vision tasks that could benefit from a unified generalist approach instead of maintaining multiple specialized models.
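To make the reframing concrete, here is a minimal sketch of what a single generative interface for perception could look like. The `generate_image` function, the prompt wordings, and the placeholder images are all hypothetical (the article does not describe Vision Banana's actual API); only the core idea, that every task's prediction is itself an RGB image the model draws, comes from the source.

```python
from PIL import Image

def generate_image(instruction: str, image: Image.Image) -> Image.Image:
    """Hypothetical stand-in for an instruction-tuned image generator
    like Vision Banana; here it returns a blank canvas so the sketch
    runs end to end."""
    return Image.new("RGB", image.size)

photo = Image.new("RGB", (640, 480))  # placeholder input photo

# One generative interface for every perception task: the "prediction"
# is itself an RGB image that the model is asked to draw.
mask_rgb = generate_image("Segment all pedestrians as a white-on-black mask.", photo)
depth_rgb = generate_image("Render the metric depth map of this scene.", photo)
edges_rgb = generate_image("Draw object boundaries as white lines.", photo)
```
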
Deep dive
  • The research challenges the traditional computer vision paradigm where separate models are trained for different tasks like segmentation, depth estimation, and object detection
  • Vision Banana achieves state-of-the-art results by converting vision tasks into image generation problems, outputting segmentation masks and depth maps as generated RGB images (a sketch of one possible depth-to-RGB encoding follows this list)
  • The model beats or matches specialized systems including Segment Anything Model 3 for segmentation and Depth Anything for metric depth estimation, despite being a generalist
  • Built through lightweight instruction-tuning of Nano Banana Pro on a mixture of original image generation data plus a small amount of vision task data
  • The key insight mirrors the LLM revolution: just as language generation pretraining gave models emergent understanding capabilities, image generation pretraining provides powerful general visual representations
  • The instruction-tuning approach preserves the base model's image generation capabilities while adding perception abilities
  • Works across both 2D and 3D vision understanding tasks, demonstrating true generalist capabilities
  • The unified interface of image generation for all vision tasks parallels how text generation became the universal interface for language understanding and reasoning
  • Results suggest that the ability to generate visual content inherently requires understanding visual content, validating a long-standing conjecture in computer vision
  • The paper proposes that generative vision pretraining should take a central role in building foundational vision models going forward
  • This approach eliminates the need for task-specific architectures and output layers that have dominated computer vision for decades
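One way to parameterize a continuous output such as metric depth as an RGB image is to quantize each depth value into the 24 bits of a pixel. The paper's actual encoding is not described here, so the packing scheme and the 100 m working range below are assumptions used purely for illustration.

```python
import numpy as np

MAX_DEPTH_M = 100.0  # assumed working range, not from the paper

def depth_to_rgb(depth_m: np.ndarray) -> np.ndarray:
    """Pack metric depth (meters) into a 24-bit RGB image."""
    d = np.clip(depth_m / MAX_DEPTH_M, 0.0, 1.0)
    q = (d * (2**24 - 1)).astype(np.uint32)  # quantize to 24 bits
    r = (q >> 16) & 0xFF
    g = (q >> 8) & 0xFF
    b = q & 0xFF
    return np.stack([r, g, b], axis=-1).astype(np.uint8)

def rgb_to_depth(rgb: np.ndarray) -> np.ndarray:
    """Invert the packing to recover metric depth in meters."""
    q = (rgb[..., 0].astype(np.uint32) << 16) \
        | (rgb[..., 1].astype(np.uint32) << 8) \
        | rgb[..., 2].astype(np.uint32)
    return q / (2**24 - 1) * MAX_DEPTH_M

depth = np.random.rand(4, 4).astype(np.float32) * MAX_DEPTH_M
assert np.allclose(rgb_to_depth(depth_to_rgb(depth)), depth, atol=1e-3)
```
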
Decoder
  • Instruction-tuning: Training a pretrained model on task-specific examples paired with instructions, similar to fine-tuning but focused on teaching the model to follow diverse commands (a mixture-sampling sketch follows this list)
  • Zero-shot: A model's ability to perform tasks it wasn't explicitly trained on, by generalizing from its pretraining
  • SOTA: State-of-the-art, the best currently available performance on a benchmark
  • SAM (Segment Anything Model): Meta's specialized model for image segmentation that can identify and mask objects
  • Metric depth estimation: Predicting actual distance measurements from the camera to objects in a scene, not just relative depth ordering
  • Nano Banana Pro (NBP): Google's image generation model (part of its "Nano Banana" family of image models) that serves as the base for Vision Banana
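As a rough illustration of the "original data plus a small amount of vision task data" recipe mentioned above, a training loader might sample sources with a heavily skewed ratio. The 95/5 split and the source names below are assumptions; the article does not give the actual proportions.

```python
import random

# Assumed mixture weights; the real ratio is not disclosed here.
MIXTURE = [
    ("original_image_generation", 0.95),
    ("vision_task_instructions", 0.05),
]

def sample_source() -> str:
    """Pick which dataset the next training example comes from."""
    sources, weights = zip(*MIXTURE)
    return random.choices(sources, weights=weights, k=1)[0]

counts = {name: 0 for name, _ in MIXTURE}
for _ in range(10_000):
    counts[sample_source()] += 1
print(counts)  # roughly 9500 generation examples to 500 vision-task examples
```
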
Original article

Image Generators are Generalist Vision Learners

Recent works show that image and video generators exhibit zero-shot visual understanding behaviors, in a way reminiscent of how LLMs develop emergent capabilities of language understanding and reasoning from generative pretraining. While it has long been conjectured that the ability to create visual content implies an ability to understand it, there has been limited evidence that generative vision models have developed strong understanding capabilities. In this work, we demonstrate that image generation training serves a role similar to LLM pretraining, letting models learn powerful and general visual representations that enable SOTA performance on various vision tasks. We introduce Vision Banana, a generalist model built by instruction-tuning Nano Banana Pro (NBP) on a mixture of its original training data alongside a small amount of vision task data. By parameterizing the output space of vision tasks as RGB images, we seamlessly reframe perception as image generation. Our generalist model, Vision Banana, achieves SOTA results on a variety of vision tasks involving both 2D and 3D understanding, beating or rivaling zero-shot domain specialists, including Segment Anything Model 3 on segmentation tasks and the Depth Anything series on metric depth estimation. We show that these results can be achieved with lightweight instruction-tuning without sacrificing the base model's image generation capabilities. These superior results suggest that image generation pretraining produces generalist vision learners, and that image generation serves as a unified and universal interface for vision tasks, similar to text generation's role in language understanding and reasoning. We could be witnessing a major paradigm shift for computer vision, where generative vision pretraining takes a central role in building Foundational Vision Models for both generation and understanding.
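On the consuming side of this interface, a generated RGB "mask image" still has to be decoded into the discrete mask that a segmentation benchmark scores. A minimal sketch, assuming the model renders foreground bright on a dark background (the abstract does not state the paper's actual decoding convention):

```python
import numpy as np
from PIL import Image

def rgb_to_binary_mask(mask_rgb: Image.Image, threshold: int = 127) -> np.ndarray:
    """Decode a generated RGB 'mask image' into a boolean mask.
    Assumes foreground is rendered bright and background dark; the
    threshold value is an illustrative choice, not from the paper."""
    gray = np.asarray(mask_rgb.convert("L"))  # collapse RGB to grayscale
    return gray > threshold

demo = Image.new("RGB", (8, 8), (255, 255, 255))  # all-foreground toy mask
assert rgb_to_binary_mask(demo).all()
```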