CroCo: Self-Supervised Pre-training for 3D Vision Tasks by Cross-View Completion

Philippe Weinzaepfel, Vincent Leroy, Thomas Lucas, Romain Brégier, Yohann Cabon,
Vaibhav Arora, Leonid Antsfeld,
Boris Chidlovskii, Gabriela Csurka, Jérôme Revaud






ABSTRACT

Masked Image Modeling (MIM) has recently been established as a potent
pre-training paradigm. A pretext task is constructed by masking patches in an
input image, and this masked content is then predicted by a neural network using
visible patches as sole input. This pre-training leads to state-of-the-art
performance when finetuned for high-level semantic tasks, e.g. image
classification and object detection. In this paper we instead seek to learn
representations that transfer well to a wide variety of 3D vision and
lower-level geometric downstream tasks, such as depth prediction or optical flow
estimation. Inspired by MIM, we propose an unsupervised representation learning
task trained from pairs of images showing the same scene from different
viewpoints. More precisely, we propose the pretext task of cross-view completion
where the first input image is partially masked, and this masked content has to
be reconstructed from the visible content and the second image. In single-view
MIM, the masked content often cannot be inferred precisely from the visible
portion only, so the model learns to act as a prior influenced by high-level
semantics. In contrast, this ambiguity can be resolved with cross-view
completion from the second unmasked image, on the condition that the model is
able to understand the spatial relationship between the two images. Our
experiments show that our pretext task leads to significantly improved
performance for monocular 3D vision downstream tasks such as depth estimation.
In addition, our model can be directly applied to binocular downstream tasks
like optical flow or relative camera pose estimation, for which we obtain
competitive results without bells and whistles, i.e., using a generic
architecture without any task-specific design.


OVERVIEW

Cross-view Completion (CroCo, in short) is a self-supervised pretext task
consisting of feeding two images of the same scene, one of them being partially
masked, to a network. The goal of the pretext task is then for the network to
recover the masked pixels. Since the two views have different viewpoints, this
is only possible if the network “understands” the 3D structure of the scene, the
camera poses and the visual correspondences between the two images.
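
The input side of this pretext task can be illustrated with a minimal sketch. It is not the authors' implementation: the patch size, the mask ratio, the mean-squared-error objective, and all function names are assumptions chosen for illustration, the two "views" are random arrays standing in for real images of the same scene, and the network itself is replaced by a placeholder prediction.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into (N, patch*patch*C) flattened patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image size must be a multiple of patch"
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)

def random_patch_mask(num_patches, mask_ratio, rng):
    """Boolean mask of length num_patches; True marks a hidden patch."""
    num_masked = int(round(mask_ratio * num_patches))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.choice(num_patches, size=num_masked, replace=False)] = True
    return mask

def masked_reconstruction_error(pred, target, mask):
    """Mean squared error computed over the hidden patches only (assumed objective)."""
    return float(np.mean((pred[mask] - target[mask]) ** 2))

rng = np.random.default_rng(0)
masked_view = rng.random((64, 64, 3))     # stand-in for the view that gets masked
reference_view = rng.random((64, 64, 3))  # stand-in for the other, fully visible view

masked_patches = patchify(masked_view, patch=16)        # assumed patch size
reference_patches = patchify(reference_view, patch=16)
mask = random_patch_mask(len(masked_patches), mask_ratio=0.75, rng=rng)  # assumed ratio

# The network would receive only these two inputs (plus the mask positions):
visible_patches = masked_patches[~mask]  # the visible part of the masked view
network_inputs = (visible_patches, reference_patches)

# Placeholder for the network output: one predicted vector per patch.
prediction = np.zeros_like(masked_patches)
error = masked_reconstruction_error(prediction, masked_patches, mask)
```

Scoring only the hidden patches reflects the description above: the visible patches are already given, so the informative signal is how well the hidden content can be recovered with the help of the other view.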

We present below some reconstruction examples from CroCo on scenes unseen during
training. From top to bottom, we show the first image (input), the masked second
image (input), the output from CroCo, and the original (ground-truth) second
image.








CROCO DOWNSTREAM TRANSFER

Our CroCo pretext task leads to significantly improved performance for 3D vision
downstream tasks, for both monocular and binocular tasks, without bells and
whistles, i.e., using a generic architecture without any task-specific design.
For instance, we show below some qualitative results for the task of monocular
depth estimation.




BIBTEX

@inproceedings{croco,
title={{CroCo: Self-Supervised Pre-training for 3D Vision Tasks by Cross-View Completion}}, 
author={Weinzaepfel, Philippe and Leroy, Vincent and Lucas, Thomas and Br\'egier, Romain and Cabon, Yohann and Arora, Vaibhav and Antsfeld, Leonid and Chidlovskii, Boris and Csurka, Gabriela and Revaud, J\'er\^ome}, 
booktitle={{NeurIPS}}, 
year={2022} 
}


SEE ALSO

 * CroCo v2: Improved Cross-view Completion Pre-training for Stereo Matching and
   Optical Flow
 * DUSt3R: Geometric 3D Vision Made Easy
 * MFOS: Model-Free & One-Shot Object Pose Estimation
 * Win-Win: Training High-Resolution Vision Transformers from Two Windows
 * SACReg: Scene-Agnostic Coordinate Regression for Visual Localization

   
© 2023 NAVER LABS Europe