CYBORGS: Contrastively Bootstrapping Object Representations by Grounding in Segmentation

Renhao Wang 1, Hang Zhao 1,2, Yang Gao 1,2
1Tsinghua University, 2Shanghai Qi Zhi Institute

European Conference on Computer Vision (ECCV) 2022

CYBORGS mutually improves representation learning and semantic segmentation by iterating between two stages. In the first stage, we use the current segmentation masks to ground contrastive learning for the model θ. In the second stage, we use representations from the model θ to bootstrap improved segmentation masks.
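At a high level, pretraining alternates between these two stages. The sketch below is a minimal Python rendering of that loop under our own naming; the helpers contrastive_update (one gradient step of the mask-grounded contrastive loss), bootstrap_masks (re-segmentation from current features), initial_mask, and sample are hypothetical placeholders, not functions from the paper.

```python
# Minimal sketch of the two-stage CYBORGS alternation. All helper
# names (initial_mask, sample, contrastive_update, bootstrap_masks)
# are hypothetical placeholders, not from the paper.

def pretrain(model, dataset, num_rounds, steps_per_round):
    # Initialize masks from any cheap segmentation; the framework only
    # requires that some initial masks exist to ground the first round.
    masks = {img_id: initial_mask(img) for img_id, img in dataset.items()}

    for _ in range(num_rounds):
        # Stage 1: ground contrastive updates in the current masks.
        for _ in range(steps_per_round):
            img_id, img = sample(dataset)
            contrastive_update(model, img, masks[img_id])

        # Stage 2: bootstrap improved masks from the partially
        # trained representations.
        for img_id, img in dataset.items():
            masks[img_id] = bootstrap_masks(model, img)

    return model
```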

Abstract

Many recent approaches in contrastive learning have worked to close the gap between pretraining on iconic images like ImageNet and pretraining on complex scenes like COCO. This gap exists largely because commonly used random-crop augmentations yield semantically inconsistent content in crowded scene images containing diverse objects. In this work, we propose a framework which tackles this problem via joint learning of representations and segmentation. We leverage segmentation masks to train a model with a mask-dependent contrastive loss, and use the partially trained model to bootstrap better masks. By iterating between these two components, we ground the contrastive updates in segmentation information, and simultaneously improve segmentation throughout pretraining. Experiments show our representations transfer robustly to downstream tasks in classification, detection, and segmentation.
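A mask-dependent contrastive loss can be thought of as contrasting region-level rather than crop-level features. The sketch below is one plausible instantiation under our own assumptions, not the paper's exact formulation: features from two augmented views are average-pooled within each mask region, and corresponding regions are treated as positives under an InfoNCE objective.

```python
import torch
import torch.nn.functional as F

def mask_pooled_infonce(feats_a, feats_b, masks, temperature=0.2):
    """One plausible mask-dependent contrastive loss (our sketch,
    not the paper's exact formulation).

    feats_a, feats_b: (C, H, W) feature maps of two augmented views,
        assumed here to be spatially aligned with the masks.
    masks: (K, H, W) binary masks, one per region.
    """
    masks = masks.float()
    # Average-pool features inside each mask region -> (K, C).
    area = masks.flatten(1).sum(dim=1).clamp(min=1)                   # (K,)
    pooled_a = torch.einsum("khw,chw->kc", masks, feats_a) / area[:, None]
    pooled_b = torch.einsum("khw,chw->kc", masks, feats_b) / area[:, None]

    pooled_a = F.normalize(pooled_a, dim=1)
    pooled_b = F.normalize(pooled_b, dim=1)

    # InfoNCE: region k in view A should match region k in view B,
    # with all other regions serving as negatives.
    logits = pooled_a @ pooled_b.t() / temperature                    # (K, K)
    targets = torch.arange(masks.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)
```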



Segmentations on COCO 2017


We show that notions of objectness emerge naturally within the learned representations following our bootstrapping process.


Three qualitative examples; each row shows, left to right: Raw RGB, kMeans on Features, After CRF, Ground Truth.
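The "kMeans on Features" and "After CRF" panels suggest a simple recipe for visualizing emergent objectness: cluster dense per-pixel features with k-means, then refine the resulting label map with a dense CRF. Below is a minimal sketch assuming scikit-learn and pydensecrf are installed and that features is an (H, W, C) feature map already upsampled to image resolution; the hyperparameters are illustrative, not taken from the paper.

```python
import numpy as np
from sklearn.cluster import KMeans
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_labels

def segment_from_features(image, features, n_clusters=8, crf_iters=5):
    """Cluster dense features with k-means, then refine with a dense CRF.

    image: (H, W, 3) uint8 RGB image.
    features: (H, W, C) per-pixel features, upsampled to image size.
    """
    h, w, c = features.shape

    # k-means over per-pixel feature vectors -> coarse label map.
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(
        features.reshape(-1, c)).reshape(h, w)

    # Dense CRF refinement: k-means labels give the unary term, while
    # Gaussian and bilateral pairwise terms encourage smooth,
    # edge-aligned segments. Hyperparameters here are illustrative.
    d = dcrf.DenseCRF2D(w, h, n_clusters)
    unary = unary_from_labels(labels, n_clusters, gt_prob=0.7,
                              zero_unsure=False)
    d.setUnaryEnergy(unary)
    d.addPairwiseGaussian(sxy=3, compat=3)
    d.addPairwiseBilateral(sxy=80, srgb=13,
                           rgbim=np.ascontiguousarray(image), compat=10)

    q = d.inference(crf_iters)
    return np.argmax(q, axis=0).reshape(h, w)
```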