TY - JOUR
T1 - Training object detectors from scratch
T2 - an empirical study in the era of vision transformer
AU - Hong, Weixiang
AU - Ren, Wang
AU - Lao, Jiangwei
AU - Xie, Lele
AU - Zhong, Liheng
AU - Wang, Jian
AU - Chen, Jingdong
AU - Liu, Honghai
AU - Chu, Wei
PY - 2024/2/26
Y1 - 2024/2/26
AB - Modeling in computer vision has long been dominated by convolutional neural networks (CNNs). Recently, in light of the excellent performance of the self-attention mechanism in the language field, transformers tailored for visual data have drawn significant attention and triumphed over CNNs in various vision tasks. These vision transformers rely heavily on large-scale pre-training to achieve competitive accuracy, which not only hinders the freedom of architectural design in downstream tasks like object detection, but also causes learning bias and domain mismatch in the fine-tuning stages. To this end, we aim to get rid of the “pre-train and fine-tune” paradigm of vision transformers and train transformer-based object detectors from scratch. Some earlier works in the CNN era successfully trained CNN-based detectors without pre-training; unfortunately, their findings do not generalize well when the backbone is switched from a CNN to a vision transformer. Instead of proposing a specific vision-transformer-based detector, our goal in this work is to provide insights into training vision-transformer-based detectors from scratch. In particular, we expect these insights to help other researchers and practitioners, and to inspire further research in other fields, such as remote sensing and visual-linguistic pre-training. One of the key findings is that both architectural changes and more training epochs play critical roles in training vision-transformer-based detectors from scratch. Experiments on the MS COCO dataset demonstrate that vision-transformer-based detectors trained from scratch can achieve performance similar to that of their counterparts with ImageNet pre-training.
KW - convolutional neural networks
KW - detection performance and efficiency
KW - large-scale pre-training
KW - object detection
KW - training from scratch
KW - vision transformer
UR - http://www.scopus.com/inward/record.url?scp=85186179826&partnerID=8YFLogxK
U2 - 10.1007/s11263-024-01988-x
DO - 10.1007/s11263-024-01988-x
M3 - Article
AN - SCOPUS:85186179826
SN - 0920-5691
JO - International Journal of Computer Vision
JF - International Journal of Computer Vision
ER -