The effectiveness of MAE pre-pretraining for billion-scale pretraining

Authors: Mannat Singh, Quentin Duval, Kalyan Vasudev Alwala, Haoqi Fan, Vaibhav Aggarwal, Aaron Adcock, Armand Joulin, Piotr Dollár, Christoph Feichtenhofer, Ross Girshick, Rohit Girdhar, Ishan Misra


This paper revisits the standard pretrain-then-finetune paradigm used in computer vision for visual recognition tasks. Typically, state-of-the-art foundation models are pretrained using large scale (weakly) supervised datasets with billions of images. We introduce an additional pre-pretraining stage that is simple and uses the self-supervised MAE technique to initialize the model. While MAE has only been shown to scale with the size of models, we find that it scales with the size of the training dataset as well. Thus, our MAE-based pre-pretraining scales with both model and data size making it applicable for training foundation models. Pre-pretraining consistently improves both the model convergence and the downstream transfer performance across a range of model scales (millions to billions of parameters), and dataset sizes (millions to billions of images). We measure the effectiveness of pre-pretraining on 10 different visual recognition tasks spanning image classification, video recognition, object detection, low-shot classification and zero-shot recognition. Our largest model achieves new state-of-the-art results on iNaturalist-18 (91.3%), 1-shot ImageNet-1k (62.1%), and zero-shot transfer on Food-101 (96.0%). Our study reveals that model initialization plays a significant role, even for web-scale pretraining with billions of images.


