Devoured - April 21, 2026
DeepMind's TIPSv2 Vision-Language Encoder (6 minute read)


Google DeepMind's TIPSv2 vision-language encoder achieves state-of-the-art zero-shot segmentation by supervising all image patches rather than just masked ones during training.

What: TIPSv2 is a vision-language pretraining model from Google DeepMind that learns joint representations of images and text. It introduces three improvements: iBOT++ (supervising all image patches instead of just masked ones), head-only exponential moving average (reducing training parameters by 42%), and multi-granularity captions using descriptions from PaliGemma and Gemini for richer supervision.
Why it matters: The research reveals a counterintuitive finding that smaller distilled models can outperform their larger teachers at patch-text alignment tasks, which led to discovering that supervising visible image patches (not just masked ones) is critical for dense vision-language understanding like segmentation.
Takeaway: Models, code, interactive demos, and Colab notebooks are available on GitHub and HuggingFace for experimenting with vision-language tasks.
Deep dive
  • Discovery that distilled ViT-L student models dramatically outperform their larger ViT-g teachers in zero-shot segmentation, reversing typical size-performance trends
  • Investigation revealed that supervision on visible tokens (not just masked ones) is the key differentiator between distillation and pretraining success
  • iBOT++ extends patch-level self-distillation loss to all patches (both masked and visible), yielding +14.1 mIoU gain in zero-shot segmentation on ADE150 dataset
  • Head-only EMA applies exponential moving average only to the projector head rather than full model, reducing training parameters by 42% while maintaining performance
  • Multi-granularity captions combine alt-text, PaliGemma, and Gemini Flash descriptions, randomly alternating during training to prevent shortcut learning on coarse keywords
  • Achieves state-of-the-art results on all four zero-shot segmentation benchmarks tested
  • TIPSv2-g outperforms PE-core G/14 on 3 of 5 shared evaluations despite PE having 56% more parameters and 47× more training image-text pairs
  • At ViT-L size, TIPSv2 outperforms DINOv3 on 4 of 6 benchmarks despite DINOv3's teacher using 6× more parameters and 15× more images
  • Produces smoother feature maps with better object boundary delineation and granular semantic details compared to previous models like TIPS, SigLIP2, and DINOv2
  • Presented at CVPR 2026 with full code, model checkpoints, Colab notebooks, and HuggingFace demos publicly available
Decoder
  • Vision-language encoder: A neural network that learns joint representations of images and text for multimodal understanding
  • Patch-text alignment: How well individual image patches (small regions) correspond to text descriptions
  • Zero-shot segmentation: Segmenting objects in images without task-specific training, using only natural language descriptions
  • iBOT: Image BERT pre-training with Online Tokenizer, a self-supervised learning method for vision models
  • Distillation: Training a smaller student model to mimic a larger teacher model's behavior
  • EMA (Exponential Moving Average): A technique that maintains a smoothed version of model weights during training for stability
  • MIM (Masked Image Modeling): Self-supervised learning where the model predicts masked portions of images
  • mIoU: Mean Intersection over Union, a metric measuring segmentation quality by comparing predicted and ground truth regions
  • ViT: Vision Transformer, an architecture that applies transformer models to image patches
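The mIoU metric defined above is straightforward to compute. A minimal sketch with NumPy, averaging per-class IoU over the classes present in either the prediction or the ground truth (a common convention; exact protocols vary by benchmark):

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean Intersection over Union across classes.

    pred, target: integer label maps of the same shape.
    Classes absent from both maps are skipped.
    """
    ious = []
    for c in range(num_classes):
        p, t = pred == c, target == c
        union = np.logical_or(p, t).sum()
        if union == 0:
            continue  # class absent from both maps
        ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious))

# Toy 2x2 example: two classes, three of four pixels agree.
pred = np.array([[0, 0], [1, 1]])
target = np.array([[0, 1], [1, 1]])
print(mean_iou(pred, target, num_classes=2))  # ≈ 0.583
```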
Original article

TIPSv2: Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment

Overview

TIPSv2 is the next generation of the TIPS family of foundational image-text encoders, delivering strong performance across numerous multimodal and vision tasks. Our work starts from a surprising finding: distillation unlocks superior patch-text alignment over standard pretraining, with distilled student models significantly surpassing their much larger teachers in this capability. We carefully investigate this phenomenon and arrive at an improved pretraining recipe that significantly upgrades our vision-language encoder. Three key changes are introduced to our pretraining process (illustrated in the figure below): iBOT++ extends the patch-level self-supervised loss to all tokens for stronger dense alignment; Head-only EMA reduces training cost while retaining performance; and Multi-Granularity Captions use PaliGemma and Gemini descriptions for richer text supervision. Combining these components, TIPSv2 demonstrates strong performance across 9 tasks and 20 datasets, generally on par with or better than recent vision encoder models, with particularly strong gains in zero-shot segmentation.

TIPSv2 Pretraining Overview

TIPSv2 pretraining overview. TIPSv2 introduces 3 pretraining improvements: iBOT++ (enhanced MIM loss), Head-only EMA (memory-efficient self-supervised losses), and Multi-granularity captions (richer text supervision).

Visualization

PCA Feature Maps

TIPSv2 produces smoother feature maps with well-delineated objects compared to prior vision-language models (e.g., TIPS and SigLIP2). While DINOv3 also exhibits smooth feature maps, TIPSv2 shows stronger semantic focus: object boundaries are more precisely delineated and regions show granular semantic details. We compare ViT-g models of several vision encoders, except for DINOv3, where we compare with the 6× larger ViT-7B. Select an image below to explore PCA components of patch embeddings.
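The PCA feature maps shown here follow a standard visualization recipe: project each patch embedding onto its top three principal components and display them as RGB. A minimal sketch (the 16×16 grid and 64-dim embeddings are illustrative, not the model's actual sizes):

```python
import numpy as np

def pca_rgb(patch_embeddings, grid_hw):
    """Project patch embeddings onto their top-3 PCA components
    and rescale to [0, 1] for display as an RGB feature map.

    patch_embeddings: (num_patches, dim) array, e.g. from a ViT.
    grid_hw: (rows, cols) of the patch grid, rows*cols == num_patches.
    """
    x = patch_embeddings - patch_embeddings.mean(axis=0)
    # SVD of the centered matrix: rows of vt are principal directions.
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    comps = x @ vt[:3].T                   # (num_patches, 3)
    comps -= comps.min(axis=0)
    comps /= comps.max(axis=0) + 1e-8      # per-channel rescale to [0, 1]
    return comps.reshape(*grid_hw, 3)

# Hypothetical 16x16 patch grid with 64-dim embeddings.
rng = np.random.default_rng(0)
emb = rng.normal(size=(256, 64))
rgb = pca_rgb(emb, (16, 16))
print(rgb.shape)  # (16, 16, 3)
```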

[Interactive viewer: example images Hike, Copenhagen, Angus, Dadaocheng; panels: Original image, TIPS PCA, SigLIP2 PCA, DINOv2 PCA, DINOv3 PCA, TIPSv2 PCA]

TIPSv2 PCA features demonstrate more fine-grained semantic separation: backpacks, people, and hiking poles are clearly delineated.

Feature Explorer

Upload your own image and explore TIPSv2 patch-embedding feature maps, or applications in zero-shot segmentation and depth and normal prediction. Also available on HuggingFace.

Method

TIPSv2 investigates the differences between pre-training and distillation, motivating the introduction of three targeted pretraining improvements to standard vision-language models: iBOT++, Head-only EMA, and Multi-Granularity Text Captions.

Bridging Pre-training and Distillation

We reveal a surprising gap between pre-training and distillation: a smaller ViT-L model distilled from a larger ViT-g TIPS teacher dramatically outperforms its teacher in zero-shot segmentation, reversing the trend observed on all other evaluation tasks. We observe a similar trend in SigLIP2. In the paper, we ablate the differences between pre-training and distillation, such as masking ratio, encoder initialization, frozen versus trainable parameters, and supervision. Our investigation reveals that the key distinction driving the difference in patch-text alignment between distillation and pre-training is supervision on visible tokens.
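Zero-shot segmentation, the task where this gap appears, reduces at its core to patch-text alignment: label each patch with the class whose text embedding it matches best. A minimal sketch (real protocols such as TCL's sliding window add more machinery on top of this step):

```python
import numpy as np

def zero_shot_segment(patch_emb, text_emb, grid_hw):
    """Zero-shot segmentation via patch-text alignment (sketch).

    patch_emb: (num_patches, dim) image patch embeddings.
    text_emb: (num_classes, dim) embeddings of class-name prompts.
    Each patch is assigned the class whose text embedding has the
    highest cosine similarity.
    """
    p = patch_emb / np.linalg.norm(patch_emb, axis=-1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    sims = p @ t.T                          # (num_patches, num_classes)
    return sims.argmax(axis=-1).reshape(grid_hw)
```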

distillation vs standard pretraining

Distillation vs standard pretraining: surprising findings. Zero-shot segmentation for a TIPS ViT-g pre-trained teacher model and a ViT-L student distilled from the ViT-g teacher. The student model strongly surpasses the teacher for patch-text alignment.

iBOT++: Enhanced Masked Image Modeling

In our investigation of the gap between distillation and standard pretraining, we find that supervising visible patches is the key differentiator. To bring this benefit of distillation into pretraining, we propose a simple extension: iBOT++. Whereas standard iBOT only supervises masked patch tokens, leaving visible token representations unconstrained, iBOT++ extends the patch-level self-distillation loss to all patches (both masked and visible), yielding a +14.1 mIoU gain in zero-shot segmentation on ADE150.
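The change from iBOT to iBOT++ amounts to widening which patches the loss averages over. A sketch of a DINO/iBOT-style patch self-distillation loss illustrating that one-line difference (temperatures and prototype counts are illustrative, not the authors' exact settings):

```python
import numpy as np

def softmax(z, t):
    z = z / t
    z -= z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def patch_distill_loss(student_logits, teacher_logits, mask, all_patches=True):
    """Patch-level self-distillation loss (sketch of an iBOT-style objective).

    student_logits, teacher_logits: (num_patches, num_prototypes).
    mask: boolean (num_patches,), True where the patch was masked
    out of the student's input.
    all_patches=False supervises masked patches only (standard iBOT);
    all_patches=True supervises every patch (the iBOT++ change).
    """
    # Teacher targets use a sharper temperature, as in DINO/iBOT.
    targets = softmax(teacher_logits, t=0.04)
    log_probs = np.log(softmax(student_logits, t=0.1) + 1e-9)
    per_patch = -(targets * log_probs).sum(axis=-1)  # cross-entropy per patch
    keep = np.ones_like(mask) if all_patches else mask
    return per_patch[keep].mean()
```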

iBOT vs iBOT++ — TIPSv2 teaser

iBOT++. Applies the patch-level loss to all patches (masked and visible), dramatically improving patch-text alignment as shown by zero-shot segmentation results.

Head-only EMA

Since the contrastive loss already stabilizes the vision encoder, we apply EMA only to the projector head rather than the full model. This reduces training parameters by 42% while retaining comparable performance.
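The mechanics of head-only EMA can be sketched as an update that maintains an averaged copy only of the projector-head parameters, while the teacher reuses the student's encoder weights directly. The dict layout and `head.` name prefix below are hypothetical, for illustration only:

```python
def ema_update(teacher, student, decay=0.999, head_prefix="head."):
    """Head-only EMA (sketch): only projector-head parameters keep a
    separate exponential-moving-average copy; encoder weights are
    shared with the student, so no extra encoder copy is maintained.

    teacher, student: dicts mapping parameter names to weights.
    """
    for name, w in student.items():
        if name.startswith(head_prefix):
            teacher[name] = decay * teacher[name] + (1 - decay) * w
        else:
            teacher[name] = w  # shared with the student, no EMA copy
    return teacher
```

Skipping the EMA copy for the encoder is where the reported 42% reduction in training parameters comes from: only the (much smaller) head is duplicated.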

Head-only EMA

Head-only EMA. Reduces training parameters while maintaining performance.

Multi-Granularity Text Captions

We supplement alt-text and PaliGemma captions with richer Gemini Flash captions, randomly alternating between them during training to avoid shortcutting on coarse keywords. This boosts both dense and global image-text performance.
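The random alternation between caption sources can be sketched as a per-example draw at each training step. The key names below (`alt_text`, `paligemma`, `gemini`) are illustrative, not the pipeline's actual field names:

```python
import random

def sample_caption(captions, rng=random):
    """Pick one caption source per image per training step (sketch).

    captions: dict mapping source name -> caption string (empty or
    missing captions are skipped). Randomly alternating between
    granularities discourages shortcut learning on coarse keywords.
    """
    available = [c for c in captions.values() if c]
    return rng.choice(available)

caps = {"alt_text": "a dog", "paligemma": "", "gemini": "a brown dog on grass"}
print(sample_caption(caps))  # one of the two non-empty captions
```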

Multi-granularity captions

Multi-granularity captions. Image captions at different granularities.

Ablations

We ablate each component cumulatively from the TIPS baseline. iBOT++ alone yields the largest single gain: a +14.1 mIoU improvement in zero-shot segmentation on ADE150 (3.5 → 17.6), confirming that extending the patch-level loss to visible tokens is the key driver of dense patch-text alignment.

Ablation studies table

Ablation studies. Cumulative ablations from the TIPS baseline, each adding one TIPSv2 component on ViT-g.

Results

We evaluate TIPSv2 across a wide range of evaluation categories, including Dense Image-Text (zero-shot segmentation), Global Image-Text (classification and retrieval), and Image-Only tasks (segmentation, depth, normals, retrieval, classification). Select a tab below to explore the detailed results tables.

Dense image-text evaluations table

Dense image-text evaluations. TIPSv2 achieves SOTA on all four zero-shot segmentation benchmarks, outperforming SILC and DINOv2 even though they use the more complex TCL evaluation protocols.

Global image-text results table

Global image-text evaluations. TIPSv2 achieves best or second-best in 5 of 7 global evaluations. Notably, TIPSv2-g outperforms PE-core G/14 on 3 of 5 shared evals, despite PE having 56% more parameters and 47× more training pairs.

Image-only results table

Image-only evaluations. TIPSv2 achieves best or second-best in 7 of 9 image-only evaluations.

DINOv3 vs TIPSv2 comparison table

DINOv3 vs TIPSv2 comparison. We compare TIPSv2 with DINOv3 at the largest common size between the two families: ViT-L. Despite DINOv3's teacher using 6× more parameters and 15× more images, TIPSv2 wins 4 of 6 shared evaluations including zero-shot segmentation (both using sliding window protocol from TCL in this case).

Acknowledgements

We would like to thank Connor Schenck and Gabriele Berton for thoughtful discussions and suggestions. We also thank the D4RT project for the website template.

Citation

@inproceedings{cao2026tipsv2,
  title     = {{TIPSv2: Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment}},
  author    = {Cao, Bingyi and Chen, Koert and Maninis, Kevis-Kokitsi and Chen, Kaifeng and Karpur, Arjun and Xia, Ye and Dua, Sahil and Dabral, Tanmaya and Han, Guangxing and Han, Bohyung and Ainslie, Joshua and Bewley, Alex and Jacob, Mithun and Wagner, Rene and Ramos, Washington and Choromanski, Krzysztof and Seyedhosseini, Mojtaba and Zhou, Howard and Araujo, Andre},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}