
SimDETR: Simplifying self-supervised pretraining for DETR

Ioannis Maniadis Metaxas¹*   Adrian Bulat²   Ioannis Patras¹   Brais Martinez²   Georgios Tzimiropoulos¹,²
¹Queen Mary University of London   ²Samsung AI Cambridge

arXiv:2307.15697v1 [cs.CV] 28 Jul 2023

* Corresponding author; Email: i.maniadismetaxas@qmul.ac.uk; Project done during an internship at Samsung AI Cambridge.

Abstract

DETR-based object detectors have achieved remarkable performance but are sample-inefficient and exhibit slow convergence. Unsupervised pretraining has been found to be helpful to alleviate these impediments, allowing training with large amounts of unlabeled data to improve the detector's performance. However, existing methods have their own limitations, like keeping the detector's backbone frozen in order to avoid performance degradation and utilizing pretraining objectives misaligned with the downstream task. To overcome these limitations, we propose a simple pretraining framework for DETR-based detectors that consists of three simple yet key ingredients: (i) richer, semantics-based initial proposals derived from high-level feature maps, (ii) discriminative training using object pseudo-labels produced via clustering, (iii) self-training to take advantage of the improved object proposals learned by the detector. We report two main findings: (1) Our pretraining outperforms prior DETR pretraining works on both the full and low data regimes by significant margins. (2) We show we can pretrain DETR from scratch (including the backbone) directly on complex image datasets like COCO, paving the path for unsupervised representation learning directly using DETR.

Figure 1. SimDETR overview: Labeled object proposals are extracted from images in an unsupervised manner and used to train a DETR-based detector. That detector then generates a new set of proposals, which are used for self-training. Best seen in color.

1. Introduction

Object detection has been a major challenge in computer vision and the focus of extensive research efforts. A recent breakthrough in object detection is the DEtection TRansformer (DETR) [6], an end-to-end trainable single-stage detection framework that reformulates the task as direct set prediction, bypassing the use of hand-crafted components such as non-maximum suppression or anchor generation. Despite these significant advantages, DETR has two important drawbacks, namely sample inefficiency (i.e. requiring large amounts of extensively annotated data) and slow training convergence. DETR-related follow-up works tackle these issues either through architectural changes [36, 68, 45] or through self-supervised pretraining [12, 2, 57, 24]. In this work, we directly adopt recently proposed advanced DETR variants (ViDT+ [45, 46] and Deformable DETR [68]) and focus on the important complementary aspect of self-supervised pretraining.

There are few works for DETR pretraining [12, 2, 57, 24], all of which have some important limitations. Their objectives focus on discriminating between object and non-object regions (i.e., localization task) and ignore class-related information, which is not well aligned with the downstream, class-aware detection task. Instead, they rely on a pretrained backbone to formulate distillation-based auxiliary tasks. As a consequence, the backbone needs to remain frozen during pretraining to avoid the degradation of its instance discrimination power. Finally, they are trained in a single stage, ignoring potential benefits from iterative self-training which seems a natural fit for detection pretraining.

In this work we propose SimDETR, a simple framework for self-supervised pretraining for DETR, which addresses the aforementioned limitations. The main components of our method, seen in Fig. 1, are:

(i) Richer, semantics-based initial proposals derived from high-level feature maps: Rather than relying on random [12] or hand-crafted object proposal methods [2], our proposals are obtained by clustering the feature map produced by a self-supervised pretrained backbone, resulting in segmentation maps. We then identify contiguous regions from where object proposals and their feature representations are extracted.
(ii) Class-aware pretraining via clustering: The high-level feature representations of (i) are clustered across the dataset, with cluster membership being used as a class pseudo-label for standard, class-aware object detection training. Contrary to prior methods that isolate localization and discrimination during pretraining, SimDETR combines them through the class-aware detection objective, better aligning with the downstream task.

(iii) Iterative self-training: We find that, after pretraining, the detector can be used to produce proposals better than the ones it was trained on. Therefore, we find it beneficial for detection pretraining to be applied in an iterative fashion, where the model is trained with pseudo-labels (bounding boxes and pseudo-class assignments) produced by the model itself in the previous round of training. Importantly, we show that all three components are needed for highly effective pretraining.

We report two main findings:

(1) Improved detection accuracy: We show that SimDETR consistently outperforms previous works on DETR pretraining by significant margins in all standard benchmarks, including the full data, semi-supervised and few-shot settings.

(2) Self-supervised representation learning from complex images: We show that SimDETR can be used to train the whole network from scratch directly on complex images, demonstrating impressive performance for unsupervised representation learning.

2. Related Works

DETR and its variants: Transformer-based object detection architectures, introduced by DETR [6], deviate from previous architectures [47, 67, 5, 20, 38] by reformulating object detection as a set prediction problem with bipartite matching. This eliminates the need for hand-crafted components such as non-maximum suppression and a region-proposal network, producing a truly end-to-end object detection pipeline. However, despite its elegance and performance, DETR has been shown to suffer from limited sample efficiency and slow convergence. Subsequent works built on DETR to tackle these issues, including DAB-DETR [33], DN-DETR [30] and Efficient-DETR [65], which focus on improving DETR's queries, H-DETR [27] and Group-DETR [9], that propose improvements on the matching algorithm between queries and objects, C-DETR [36] and Def. DETR [68] that propose methods for making DETR more training-efficient, and ViDT+ [45], which proposes unifying the detector's backbone and encoder with a reconfigured attention module, achieving even further training speed and performance improvements. Despite significant progress, however, transformer-based detectors remain sample-inefficient, highlighting the complementary nature and the importance of unsupervised pretraining methods.

DETR pretraining: Despite significant progress made with novel architectures, DETR-based approaches require long training with a lot of data to achieve strong performance. Their effectiveness, therefore, depends on abundant, extensively annotated training samples. Despite the success of self-supervised backbone pretraining in object detection, few methods have been proposed for DETR pretraining. The first among them, UP-DETR [12], trained the detector to identify random boxes using query tokens. DETReg [2] subsequently used Selective Search [48] to generate object proposals as annotations for the detector, combined with a distillation-based feature matching objective. DETReg was subsequently improved upon by JoinDet [57], which replaced Selective Search with a dynamic object proposal method, and Siamese DETR [24], which proposed a student-teacher architecture for pretraining. Notably, all of these works pretrain detectors in a class-unaware manner, relying on auxiliary objectives to improve their discriminative capacity. Furthermore, they all freeze the detector's backbone encoder during pretraining, as they suffer performance drops otherwise [12, 2]. This is a significant limitation, as it prevents true end-to-end self-supervised training, and makes such frameworks heavily dependent on the quality of the pretrained backbone.

Self-supervised learning for dense prediction downstream tasks: Recently, self-supervised learning has emerged as a powerful paradigm for learning representations from unlabeled visual data [19, 10, 7, 8]. Among such methods, a distinct line of research has focused on self-supervised training of image encoders that better capture local information in images [64, 22, 59, 50, 56, 25, 16, 61, 60, 23, 1, 28, 26, 14, 31, 62]. Such encoders are particularly effective as backbones for detection/segmentation architectures, significantly improving their performance on dense prediction downstream tasks. Among self-supervised methods in this area, some propose pretext tasks related to semantic segmentation [23, 60, 16], while others leverage object priors [22, 59, 28, 31, 62] to formulate effective training objectives. We emphasize, however, that the aforementioned works, including works such as SoCo [59] that leverage components of detector frameworks, focus exclusively on training backbone encoders, not complete detection/segmentation architectures, and use objectives that are not applicable to the training of end-to-end object detection pipelines. Furthermore, despite their usefulness in a wide array of downstream tasks, and their applicability to some dense prediction tasks such as semantic segmentation, backbones trained with these methods are not directly applicable to instance-level dense prediction tasks, and object detection in particular. These methods are, therefore, distinct from those that seek to pretrain detection architectures in a self-supervised, end-to-end manner.

Figure 2. Overview of SimDETR's pretraining Stage 1. Labeled region proposals are extracted at the start of training using a self-supervised pretrained backbone. Those proposals are then used to train the detector in a class-aware manner. Best seen in color.

Unsupervised object discovery: Different from object detector pretraining, this task aims at correctly localizing objects in images in an unsupervised manner. Recently, this area has seen remarkable progress utilizing self-supervised models. Specifically, several works have proposed identifying objects based on local feature similarity [51, 55, 42, 58, 43, 35, 53]. Distinctly, [52] treats localization as a ranking problem, [3] combines a MAE [18] architecture with a GAN [17], [40] combines slot attention and a reconstruction objective, and [41] clusters features from multiple encoders to produce annotations for salient object detection. The proposal(s) produced by these methods can then be used to train a detector or segmentation model, improving performance and allowing for multiple object discovery. Notably, methods such as [55, 66, 53] also leverage self-training. We emphasize, however, that the main goal of works in this area is object discovery, not the training of powerful detectors. Accordingly, the detectors trained by these works typically are not evaluated by finetuning with annotated data. Furthermore, such methods typically restrict their proposals to the most confident few (often just one) to avoid false positives, which is not well suited for detector pretraining, where training benefits from a rich set of object proposals covering as many objects as possible, not only the few most prominent ones.

3. Method

Our method aims to simplify and better align the pretraining with respect to the downstream task (class-aware detection). To this end, we produce object proposals in the form of bounding boxes and class pseudo-label pairs in an unsupervised manner, and then employ a self-training strategy to pretrain and iteratively refine the detector.

3.1. Improving object proposals

Examining existing methods, we note that object discovery works generate very limited initial proposals to facilitate high precision (see Sec. 2). On the other hand, methods like Selective Search [48] generate many proposals, but their reliance on low-level priors such as color and texture makes them very noisy and unsuitable to generate meaningful pseudo-labels.

We seek to address this gap by utilizing semantic information from self-supervised image encoders to produce rich object proposals and coherent pseudo-class memberships. Specifically, we extract feature maps using a pretrained self-supervised encoder and leverage a bi-level clustering strategy. The first level (termed local clustering) produces bounding box proposals and associated feature representations. The second level, termed global clustering, assigns class pseudo-labels to each proposal. Our method leads to rich and diverse region proposals (see Tab. 8) and is essential for the state-of-the-art results of SimDETR (see Tab. 9).

Unsupervised proposal extraction: Given an input image $X \in \mathbb{R}^{3 \times H \times W}$, we use a self-supervised pretrained encoder to extract feature maps $F_l \in \mathbb{R}^{d_l \times H_l \times W_l}$ from each of the encoder's levels $l$. Given a feature map $F$, we employ pixel-wise clustering to group semantically similar features (local clustering). This results in a set of masks $M = \{m_k\}_{k=1:K}$, where $K$ represents the number of clusters, which is a user-defined parameter. In order to provide good coverage for all objects in the image, we apply clustering with different values $K \in \mathcal{K}$ and use feature maps from different layers $l \in \mathcal{L}$, leading to a set of masks $M = \bigcup \{M_{l,K}\}_{K \in \mathcal{K},\, l \in \mathcal{L}}$.

Next, the different connected components of each mask are computed, leading to a set of regions $R$. Each region $r \in R$ is then used to extract a bounding box (proposal) $b$ and a corresponding feature vector $f$, where $f$ is computed by average-pooling the last layer feature map $F_L$ over $r$.

Proposal filtering: Due to the repeated clustering runs, the process just described leads to noisy and overlapping proposals. We employ a number of filters to refine them, including merging proposals that have a high IoU, proposals with highly-related semantic content and proposals that are part of other proposals.
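The sketch below illustrates the per-image proposal extraction step under simplifying assumptions: it uses K-Means rather than Spectral Clustering for the local step, treats the backbone feature maps as plain NumPy arrays, and replaces the full set of filters with a single IoU-based de-duplication. Function and parameter names (extract_proposals, iou_merge_threshold, etc.) are illustrative and not part of the SimDETR codebase.

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy import ndimage


def box_iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / max(area_a + area_b - inter, 1e-6)


def extract_proposals(feature_maps, last_feats, ks=(4, 8), iou_merge_threshold=0.9):
    """Local clustering -> connected components -> (box, feature) pairs.

    feature_maps: list of (d_l, H_l, W_l) arrays from several encoder levels.
    last_feats:   (d_L, H_L, W_L) last-level feature map used to pool f.
    """
    boxes = []
    for feats in feature_maps:
        d, h, w = feats.shape
        pixels = feats.reshape(d, -1).T                      # (H*W, d)
        for k in ks:                                         # several cluster counts K
            labels = KMeans(n_clusters=k, n_init=4).fit_predict(pixels).reshape(h, w)
            for c in range(k):                               # connected components per cluster
                comps, n_comp = ndimage.label(labels == c)
                for comp_id in range(1, n_comp + 1):
                    ys, xs = np.where(comps == comp_id)
                    # relative coordinates so levels of different resolutions agree
                    boxes.append((xs.min() / w, ys.min() / h,
                                  (xs.max() + 1) / w, (ys.max() + 1) / h))
    # pool a descriptor f for each box from the last-level feature map
    d, h, w = last_feats.shape
    proposals = []
    for box in boxes:
        x1, y1, x2, y2 = (np.array(box) * [w, h, w, h]).astype(int)
        f = last_feats[:, y1:max(y2, y1 + 1), x1:max(x2, x1 + 1)].mean(axis=(1, 2))
        proposals.append((box, f))
    # crude filtering: drop near-duplicate boxes with very high mutual IoU
    kept = []
    for box, f in proposals:
        if all(box_iou(box, kb) < iou_merge_threshold for kb, _ in kept):
            kept.append((box, f))
    return kept
```

In the actual pipeline this step is run once per image before Stage 1 training, and the pooled descriptors are exactly what the global clustering described next operates on.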
This results in a set of $N(i)$ bounding box–feature vector pairs for image $i$, $\{b_n, f_n\}_{n=1}^{N(i)}$.

Pseudo-class label generation: We then cluster proposals across the whole dataset (global clustering) based on the feature vectors, i.e. we perform a single clustering round on $\{f_n^i\}_{n=1:N(i)}^{i=1:I}$. Class membership is then used as the pseudo-class label, $c \in C$, for each of the proposals. This results in a training set $T_0 = \{X_i, \{(b_n^i, c_n^i)\}\}$.

We used Spectral Clustering [37] for local and K-Means for global clustering, as Spectral Clustering typically performs better but it cannot handle the millions of data points involved in the global clustering. However, any clustering algorithm may be used in either case.
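A minimal sketch of the global clustering step, assuming the per-image (box, feature) pairs from the previous step are already available; num_pseudo_classes mirrors the 2048 clusters used by default (Appendix A), and the function name assign_pseudo_classes is illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans


def assign_pseudo_classes(dataset_proposals, num_pseudo_classes=2048):
    """dataset_proposals: list (one entry per image) of lists of (box, feature) pairs.

    Returns T0 as a list of (image_index, box, pseudo_class) triplets.
    """
    feats = np.stack([f for per_image in dataset_proposals for _, f in per_image])
    labels = KMeans(n_clusters=num_pseudo_classes, n_init=1).fit_predict(feats)

    training_set, cursor = [], 0
    for image_idx, per_image in enumerate(dataset_proposals):
        for box, _ in per_image:
            training_set.append((image_idx, box, int(labels[cursor])))
            cursor += 1
    return training_set
```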
3.2. Pretraining and Self-Training

We can now use the training set $T_0$ to train an object detector within the DETR framework. In particular, given an input image and its corresponding extracted object proposals $y$, the network predicts a set $\hat{y} = \{\hat{y}_q\}_{q=1}^{Q}$, where $\hat{y}_q = (\hat{b}_q, \hat{c}_q)$, that is, the predicted bounding box and predicted category. We note that the extracted proposals $y$ are padded to size $Q$ with $\varnothing$ (no object). Typically in DETR architectures, the ground truth and the predictions are put in correspondence via bipartite matching, formally defined as:

$$\hat{\sigma} = \arg\min_{\sigma \in S_Q} \sum_{q}^{Q} \mathcal{L}(y_q, \hat{y}_{\sigma(q)}) \qquad (1)$$

where $S_Q$ is the space of permutations of $Q$ elements. The loss between $y$ and $\hat{y}$ is computed as a combination of a bounding box matching loss and a class matching loss:

$$\sum_{q=1}^{Q} \left( -\log \hat{p}_{\hat{\sigma}(q)}(c_q) + \mathbb{1}_{\{c_q \neq \varnothing\}} \, \mathcal{L}_{\text{box}}(b_q, \hat{b}_{\hat{\sigma}(q)}) \right) \qquad (2)$$

where $\hat{p}$ indicates the predicted per-class probabilities. The indicator function $\mathbb{1}_{\{c_q \neq \varnothing\}}$ represents that the box loss only applies to predictions that have been matched to object proposals $y$. Minimizing this loss results in weights $\Theta_0$.
Upon training the detector in this way, we observe that it can identify more objects than those in our original proposals. Critically, this includes smaller and more challenging objects, which constitute a stronger supervisory signal. We thus generate a new set of labels for image $i$ as $\{g(X_i; \Theta_0)\}$, where $g = (g_b, g_h)$ are the detection network, backbone and head respectively. In self-training, pseudo-labels are typically filtered based on confidence. In our case, filtering based on the detector's confidence leads to the removal of small or challenging instances such as partially-occluded or uncommon objects. Thus, we instead filter the new proposals so that any two boxes have an IoU lower than 0.55 (following [44]), with only the most confident box being kept when such conflicts exist. This leads to training set $T_1$.
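A sketch of this overlap-based filter, assuming detections come as (box, score) pairs and reusing the box_iou helper from the proposal-extraction sketch above; the 0.55 threshold follows the text, while the greedy highest-score-first ordering is an assumption about how conflicts are resolved.

```python
def filter_overlaps(detections, iou_threshold=0.55):
    """detections: list of ((x1, y1, x2, y2), score) pairs for one image.

    Keep every box whose IoU with all already-kept (higher-scoring) boxes is
    below the threshold; note that no confidence cut-off is applied.
    """
    kept = []
    for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
        if all(box_iou(box, kb) < iou_threshold for kb, _ in kept):
            kept.append((box, score))
    return kept
```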
ing based on the detector’s confidence leads to the removal used.
of small or challenging instances such as partially-occluded While our method is, in principle, not restricted to
or uncommon objects. Thus, we instead filter the new pro- DETR, we follow prior work [12, 2, 57] and focus on
posals so that any two boxes have an IOU lower than 0.55 DETR-based architectures since a) DETR methods suffer
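For intuition, the outer loop of Algorithm 1 can be sketched as below; train_detector and run_detector stand in for the actual DETR training and inference loops (supplied by the caller), initial_labels is the $T_0$ produced by the steps above, and class pseudo-labels are assumed to travel with each detection. This is a structural sketch, not the SimDETR implementation.

```python
def pretrain_simdetr(images, initial_labels, train_detector, run_detector,
                     filter_fn, num_stages=2):
    """Two-stage pretraining loop: train on the current pseudo-labels, then
    regenerate the labels with the detector itself (self-training)."""
    detector, labels = None, initial_labels
    for stage in range(num_stages):
        # train (or continue training) with the matching loss of Eqs. (1)-(2)
        detector = train_detector(images, labels, init=detector)
        # the detector's own filtered predictions become the next training set
        labels = [filter_fn(run_detector(detector, img)) for img in images]
    return detector
```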
4. Experimental Setting

In order to compare with prior work on object detection pretraining, we follow DETReg [2] in terms of datasets, hyperparameters and experiments (namely, the full data and low data settings). In order to study the effectiveness of our method for unsupervised representation learning, in the absence of a pre-defined protocol, we use the ViDT+ detector and experiment with the most well-established datasets in object detection. The hyperparameters for each experiment are provided in detail in Appendix Section A. Unless stated otherwise, for methods other than SimDETR we report results from the respective papers, except where ViDT+ is used.

While our method is, in principle, not restricted to DETR, we follow prior work [12, 2, 57] and focus on DETR-based architectures since a) DETR methods suffer from slow convergence and sample inefficiency, so they benefit the most from unsupervised pretraining methods, and b) the end-to-end single-stage design is well suited for end-to-end representation learning, which we explore in Sec. 5.2.
Datasets: We use the training sets of ImageNet [39], Open Images [29] and MS COCO [32] for unsupervised pretraining. For finetuning (with annotations) we use the training sets of MS COCO and PASCAL VOC [15]. Results are reported for the corresponding validation sets, using the Average Precision (AP) and Average Recall (AR). Details on the datasets are provided in Appendix Section B.

Architectures: We use Deformable DETR [68] and ViDT+ [45] in our experiments. Def. DETR is used to compare with prior work for detector pretraining. Following [2], a ResNet-50 backbone [21] is used, initialized with SwAV [7], pretrained on ImageNet. ViDT+ is a more recent state-of-the-art method, and is used to compare against unsupervised representation learning methods. Unless stated otherwise, its Swin-T [34] transformer backbone is initialized with MoBY [63], which is pretrained on ImageNet. We emphasize that, for both Def. DETR and ViDT+, their backbones were trained in a totally unsupervised manner.

5. Experiments

We highlight two main results, namely state-of-the-art results for detection pretraining and competitive results for self-supervised representation learning for detection, including pretraining on scene-centric data such as COCO from scratch. We complement these results with a comprehensive set of ablation studies.

5.1. Object detection pretraining

In order to validate our method in terms of object detection pretraining, we closely follow the benchmarks established by DETReg [2]. We use ViDT+ and Def. DETR as base detectors and cover three settings, namely the full data, semi-supervised and few-shot settings.

Full data regime: The detectors are pretrained on ImageNet and then finetuned on detection datasets. We provide a set of comparisons with competing detector pretraining methods in Tab. 1, where we finetune on COCO. Interestingly, all prior work on DETR pretraining requires freezing the backbone. We quantitatively assess the impact of this requirement by making the DETReg backbone trainable, and observe steep performance degradation. Contrary to all these works, SimDETR supports a trainable backbone due to its better alignment of the pretraining and downstream tasks. Thus, we further include results from other self-supervised representation learning methods that focus on backbone pretraining for detection, irrespective of whether they are within the DETR framework. We also report results for ImageNet pretraining and PASCAL VOC finetuning in Tab. 2.

As Tabs. 1 and 2 show, our method significantly outperforms competing detector pretraining methods across datasets and with both detector architectures. Furthermore, it also achieves the highest performance among self-supervised learning methods for detection. While in this case the different base detection frameworks make a direct comparison harder, we still find it a very encouraging result.

Semi-supervised setting: In this protocol unsupervised pretraining is conducted on COCO's train set, and k% of the train set's samples are subsequently used (with annotations) for finetuning. Results in Tab. 3 demonstrate that SimDETR outperforms previous works by large margins, particularly in the more challenging settings with fewer labeled samples.

Few-shot setting: Following [2], we pretrain Def. DETR on ImageNet and report results for two settings. In the first, we finetune on COCO's 60 base classes, and then finetune again on k ∈ {10, 30} instances from all classes. In the second, "extreme" setting, we skip finetuning on the base classes. Results are reported in Tab. 4 on the validation set's novel classes, and demonstrate that SimDETR not only outperforms DETReg by significant margins, but its performance without base class finetuning is very close to its performance with it. These results support that a) our method drastically reduces DETR architectures' dependency on annotated data, and b) SimDETR's learned representations are already class-aware, and the pseudo-labels produced by our method are good enough that SimDETR can align with COCO's classes with minimal (10-shot) supervision. We conduct a more in depth analysis of the few-shot setting outcomes in Appendix Section C.

5.2. Self-supervised representation learning on scene-centric images

We first examine the ability of our method to learn self-supervised representations (i.e., train a backbone) that are suitable for detection. We begin by validating that SimDETR, when trained on scene-centric data (e.g. COCO), can perform competitively compared to ImageNet pretraining. Then we use SimDETR directly for self-supervised representation learning on scene-centric data (i.e., training from scratch on COCO/Open Images), showing promising results. Finally, we show that pretraining on COCO leads to representations that transfer to ImageNet under the linear-probe setting.

Object vs Scene-centric pretraining: In the full data experiments in Sec. 5.1 we followed prior literature and pretrained on object-centric data (ImageNet). In this section, we examine whether our method can achieve competitive performance by pretraining on scene-centric datasets instead. To that end, we pretrain ViDT+ on COCO and Open Images (keeping the initialization settings described in Sec. 4), finetune on COCO's train set and present results on its validation set. These results are shown in Tab. 5. We further report the class-unaware object detection performance in terms of average recall (AR) as it hints at different behaviors between the two settings in this regard.
Backbone Pretraining | Detector Pretraining | Detector | Backbone | Frozen Backbone | AP | AP50 | AP75

Unsupervised representation learning for detection
Supervised¹ | - | Mask-RCNN [20] (x2) | ResNet50 | - | 41.6 | - | -
MoCo v2 [11] | - | Mask-RCNN [20] (x2) | ResNet50 | - | 41.7 | - | -
SimCLR [10] | - | Mask-RCNN [20] (x2) | ResNet50 | - | 41.6 | - | -
DINO [8] | - | Mask-RCNN [20] (x2) | ResNet50 | - | 42.3 | - | -
SlotCon [60] | - | Mask-RCNN [20] (x2) | ResNet50 | - | 42.6 | 62.7 | 46.2
UniVIP [31] | - | Mask-RCNN [20] (x2) | ResNet50 | - | 43.1 | - | -
Supervised¹ | - | FCOS* [47] | ResNet50 | - | 44.2 | - | -
DetConB [22] | - | FCOS* [47] | ResNet50 | - | 45.4 | - | -
Odin [23] | - | FCOS* [47] | ResNet50 | - | 45.6 | - | -
Supervised¹ | - | FCOS* [47] | Swin-T | - | 46.7 | - | -
MoBY [63] | - | FCOS* [47] | Swin-T | - | 47.6 | - | -
DetConB [22] | - | FCOS* [47] | Swin-T | - | 48.4 | - | -
Odin [23] | - | FCOS* [47] | Swin-T | - | 48.5 | - | -
MoBY | - | ViDT+ [45] | Swin-T | - | 48.3 | 66.9 | 52.4
SimDETR² | - | ViDT+ [45] | Swin-T | - | 48.8 | 67.4 | 53.1

Unsupervised detector pretraining
DINO [8] | CutLER [53] | Cascade R-CNN [5] | ResNet50 | ✗ | 44.7 | - | -
Supervised¹ | - | Def. DETR [68] | ResNet50 | - | 44.5 | 63.6 | 48.7
SwAV | - | Def. DETR [68] | ResNet50 | - | 45.2 | 64.0 | 49.5
SwAV | UP-DETR [12] | Def. DETR [68] | ResNet50 | ✓ | 44.7 | 63.7 | 48.6
SwAV | DETReg [2] | Def. DETR [68] | ResNet50 | ✓ | 45.5 | 64.1 | 49.9
SwAV | JoinDet [57] | Def. DETR [68] | ResNet50 | ✓ | 45.6 | 64.3 | 49.8
SwAV | Siamese DETR [24] | Def. DETR [68] | ResNet50 | ✓ | 46.3 | 64.6 | 50.5
SwAV | SimDETR | Def. DETR [68] | ResNet50 | ✗ | 46.7 | 65.4 | 50.9
MoBY | - | ViDT+ [45] | Swin-T | - | 48.3 | 66.9 | 52.4
MoBY | DETReg | ViDT+ [45] | Swin-T | ✓ | 49.1 | 67.4 | 53.1
MoBY | DETReg | ViDT+ [45] | Swin-T | ✗ | 47.8 | 65.9 | 52.0
MoBY | SimDETR | ViDT+ [45] | Swin-T | ✗ | 49.6 | 68.2 | 53.8

Table 1. Lower section: Results for detection pretraining methods. Upper section: Unsupervised representation learning methods (detector head is only trained during the downstream fine-tuning stage), with pretraining on ImageNet and finetuning on COCO. 1: Backbone trained on ImageNet classification with labels (baseline). 2: Backbone initialized with MoBY and pretrained with SimDETR (pretrained detection head was discarded).

Method | AP | AP50 | AP75
Supervised | 59.5 | 82.6 | 65.6
SwAV [7] | 61.0 | 83.0 | 68.1
DETReg [2] | 63.5 | 83.3 | 70.3
JoinDet [57] | 63.7 | 83.8 | 70.7
SimDETR | 64.8 | 84.6 | 72.7

Table 2. PASCAL VOC results. Def. DETR detectors were pretrained on ImageNet and finetuned on PASCAL VOC.

Method | AP (1%) | AP (2%) | AP (5%) | AP (10%)
SwAV | 11.79±0.3 | 16.02±0.4 | 22.81±0.3 | 27.79±0.2
DETReg [2] | 14.58±0.3 | 18.69±0.2 | 24.80±0.2 | 29.12±0.2
JoinDet [57] | 15.89±0.2 | - | - | 30.87±0.1
SimDETR | 18.19±0.1 | 21.80±0.2 | 26.90±0.2 | 30.97±0.2

Table 3. Semi-supervised results. Def. DETR detectors were pretrained on COCO's train set and finetuned on k% labeled samples.

Method | Base Class Finetuning | Novel Class AP (10) | Novel Class AP (30) | Novel Class AP75 (10) | Novel Class AP75 (30)
DETReg [2] | ✗ | 5.6 | 10.3 | 6.0 | 10.9
SimDETR | ✗ | 10.3 | 14.5 | 10.9 | 15.1
DETReg [2] | ✓ | 9.9 | 15.3 | 10.9 | 16.4
SimDETR | ✓ | 12.4 | 18.9 | 13.1 | 20.4

Table 4. Few-shot results. Def. DETR detectors were pretrained on ImageNet and finetuned on k ∈ {10, 30} instances from each class. Results reported on the novel classes. DETReg results reproduced in our codebase using their published model checkpoint.

We observe that, in all cases, our method improves over the baseline, even on COCO, where we pretrain and finetune on the same set of data. Stage 1 achieves similar results across datasets, a significant finding which shows that SimDETR is: a) sample efficient, achieving similar performance pretraining on COCO and on the larger ImageNet and Open Images datasets, and b) flexible, being able to handle both object-centric and scene-centric data.
Pretraining Dataset | Stage | AP | AP50 | AP75 | AR100
- | MoBY | 48.3 | 66.9 | 52.4 | -
COCO | Stage 1 | 48.8 | 67.6 | 53.0 | 23.9
ImageNet | Stage 1 | 48.9 | 67.4 | 52.9 | 25.9
Open Images | Stage 1 | 48.9 | 67.5 | 52.9 | 24.5
COCO | Stage 2 | 49.1 | 67.8 | 53.1 | 25.1
ImageNet | Stage 2 | 49.6 | 68.2 | 53.8 | 27.1
Open Images | Stage 2 | 49.4 | 67.9 | 53.9 | 25.5

Table 5. Object-centric vs Scene-centric. Finetuning results for pretraining on ImageNet vs. pretraining on COCO/Open Images.

Backbone Pretraining | Detector Pretraining | Detector | Pretraining Dataset | AP
MoBY [63] | - | FCOS* | ImageNet | 47.6
DetCon [22] | - | FCOS* | ImageNet | 48.4
Odin [23] | - | FCOS* | ImageNet | 48.5
- | - | ViDT+ | ImageNet | 38.5
MoBY [63] | - | ViDT+ | ImageNet | 48.3
- | SimDETR | ViDT+ | COCO | 48.3
- | SimDETR | ViDT+ | Open Images | 48.8
- | SimDETR | ViDT+ | ImageNet | 49.2

Table 6. Pretraining from scratch. We pretrain SimDETR without backbone initialization and finetune on COCO. For comparison, we finetune ViDT+ without pretraining and with a backbone pretrained with MoBY. We also provide results for [22, 60, 23, 63] that perform self-supervised backbone-only pretraining.

Backbone Pretraining | Acc
DenseCL [56] | 49.9
VirTex [13] | 53.8
MoCo [19] | 49.8
Van Gansbeke et al. [49] | 56.1
SimDETR | 56.4

Table 7. We pretrain SimDETR with ViDT+ on COCO's train set, and apply the backbone to linear evaluation on ImageNet. Results reported on ImageNet's validation set. Baseline results from [49].

Self-training (Stage 2) consistently leads to improved performance. However, Open Images pretraining outperforms COCO, indicating that, with enough training time, pretraining on a larger dataset is impactful. Furthermore, ImageNet pretraining outperforms the two scene-centric datasets, although by a small margin in the case of Open Images. We attribute this to the relative quality of object proposals produced for self-training. As seen by contrasting AR scores in Tab. 5, ImageNet's Stage 1 detector localizes more objects correctly, which likely leads to improved supervision when self-training. Overall, these results indicate that SimDETR does not require carefully curated object-centric data to achieve competitive results.

Self-supervised representation learning on scene-centric data: Experiments conducted in previous sections uniformly initialize the backbone with weights obtained by self-supervised training on ImageNet. In this section, we evaluate the representation learning capacity of SimDETR by pretraining a ViDT+ detector from an untrained backbone (from scratch) to examine whether independent backbone pretraining is indeed necessary. We pretrain on object-centric (ImageNet) and scene-centric (COCO & Open Images) datasets and present results in Tab. 6. For completeness, we also provide results for other methods that focus on self-supervised backbone pretraining, noting that they use different detector architectures during finetuning.

Results again show that SimDETR performs best with a well-curated, object-centric pretraining dataset such as ImageNet, but is competitive even when trained on complex, scene-centric images. Specifically, SimDETR performs on par with backbone-only ImageNet pretraining (MoBY) when applied to COCO, and outperforms it when applied to Open Images. This outcome supports our thesis, namely that unsupervised pretraining directly on scene-centric data with an object detection task is feasible and can be effective.

We further evaluate the quality of the COCO-pretrained backbone by performing a linear-probe experiment on ImageNet. Tab. 7 shows SimDETR's performance as well as that of prior work. We note that prior work uses a ResNet50 encoder and thus a direct comparison is hard. It is however clear that our method is competitive, despite being pretrained for object detection, highlighting the natural fit of SimDETR for general-purpose representation learning from scene-centric images.
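For reference, linear probing can be sketched as follows: the pretrained backbone is frozen and used only as a fixed feature extractor, and a single linear classifier is fit on top of its pooled features. The helper names and the use of scikit-learn's logistic regression are illustrative choices, not the exact protocol used for Tab. 7.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def linear_probe(extract_features, train_images, train_labels, val_images, val_labels):
    """extract_features: frozen, pretrained backbone mapping an image to a 1-D feature vector."""
    x_train = np.stack([extract_features(img) for img in train_images])
    x_val = np.stack([extract_features(img) for img in val_images])

    clf = LogisticRegression(max_iter=1000)   # linear classifier on frozen features
    clf.fit(x_train, train_labels)
    return clf.score(x_val, val_labels)       # top-1 accuracy on the validation set
```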
5.3. Analysis and ablations

Throughout this section we use ViDT+ and, unless stated otherwise, pretrain on ImageNet for 10 epochs per stage.

Impact of object proposals: We evaluate our object proposal method in two ways: a) we examine how well it localizes objects by computing the Average Recall (AR) score on COCO's validation set (see Tab. 8), and b) we investigate its impact on SimDETR by replacing it with Selective Search, and present the outcomes (see Tab. 9).
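The class-unaware AR numbers of Tab. 8 can be reproduced in spirit with pycocotools by treating every proposal as a single foreground class; the snippet below assumes ground-truth and detection files in standard COCO format (the file names are placeholders).

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("instances_val2017.json")              # COCO validation annotations
coco_dt = coco_gt.loadRes("proposals_val2017.json")   # proposals as COCO-format detections

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.params.useCats = 0            # class-unaware: ignore category labels
evaluator.params.maxDets = [1, 10, 100]
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()                   # AR@100 is reported among the summary metrics
```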
Tab. 8 includes results both for our initial proposals (noted as SimDETR-St. 0) and the proposals generated by pretrained detectors. Results show that our approach is superior to Selective Search, that our framework leads to better localization results than DETReg, and that detector pretraining significantly improves over our initial proposals, supporting our decision to self-train. We also notice diminishing gains in terms of AR between Stages 1 and 2.

In Tab. 9 we find that, using Selective Search proposals, SimDETR still outperforms the MoBY baseline, but we observe a performance drop relative to our object proposal method. We attribute this to two reasons: a) our method likely produces more discriminative descriptors f by aggregating representations over a mask of semantically related pixels, rather than over a box, which is the case for Selective Search. This, in turn, leads to better pseudo-labels. b) Our proposals are more robust (see Tab. 8), and therefore provide better supervision. In summary, we conclude that SimDETR is robust to different object proposal methods, but greatly benefits from an appropriate method choice.
Object proposals | Detection Architecture | AR100
Sel. Search | - | 10.9
SimDETR-St. 0 | - | 13.4
DETReg | ViDT+ | 21.5
SimDETR-St. 1 | ViDT+ | 25.9
SimDETR-St. 2 | ViDT+ | 27.1

Table 8. Quality of proposals: AR results on COCO's validation set. Upper section: methods for the initial extraction of object proposals. Lower section: proposals generated by detection/segmentation architectures trained on the initial proposals.

Method | Proposals | AP | AP50 | AP75
MoBY | - | 48.3 | 66.9 | 52.4
SimDETR-St. 1 | Sel. Search | 48.7 | 67.3 | 52.7
SimDETR-St. 2 | Sel. Search | 48.6 | 67.1 | 52.2
SimDETR-St. 1 | Our Anns. | 48.9 | 67.4 | 52.9
SimDETR-St. 2 | Our Anns. | 49.6 | 68.2 | 53.8

Table 9. Impact of initial proposals: AP results on COCO's validation set, using different initial object proposal methods.

Classes | ACC | AR | AP
1 | - | 25.2 | 41.2
256 | 80.01 | 23.9 | 43.8
512 | 75.13 | 24.0 | 43.9
2048 | 53.75 | 23.9 | 44.1

Table 10. Number of classes. Pretraining and finetuning on COCO, evaluation in terms of training accuracy, AR of the pretrained detector, and AP of the finetuned model. 1 class implies class-unaware pretraining.

Detector | Stage | AP | AP50 | AP75
ViDT+ [45] | 1 | 48.9 | 67.4 | 52.9
ViDT+ [45] | 2 | 49.6 | 68.2 | 53.8
ViDT+ [45] | 3 | 49.6 | 68.0 | 53.9
Def. DETR [68] | 1 | 46.1 | 64.6 | 50.3
Def. DETR [68] | 2 | 46.7 | 65.4 | 50.9

Table 11. Self-training rounds. AP results for ViDT+ pretrained with SimDETR on ImageNet and finetuned on COCO. Avg. proposals per image are measured during training.

Stage | Epochs | AP | AP50 | AP75
1 | 10 | 48.9 | 67.4 | 52.9
1 | 25 | 49.2 | 67.7 | 53.6
2 | 10 | 49.6 | 68.2 | 53.8
2 | 25 | 49.7 | 68.1 | 54.2

Table 12. Scheduler length. AP results for varying training epochs. 10 and 25 epoch Stage 2 models are initialized from 10 and 25 epoch Stage 1 models respectively.

Number of classes: We ablate the number of pseudo-classes produced by the global clustering of object proposals. For this set of experiments, we pretrain and finetune on COCO's train set for 25 epochs each. Note this is a simplified (and cheaper) setting for the purpose of ablating. We find that, during pretraining, increasing the number of classes leads to decreased training accuracy (ACC) and class-unaware AR (measured on the validation set), which is expected, since increasing the number of classes makes the task harder. However, the AP score after finetuning increases, indicating that the pretrained detector is more powerful. Overall, results indicate that our method is fairly robust to the number of clusters chosen.

Self-training stages: We examine the impact of self-training in Tab. 11, and find that it produces meaningful gains. We explore additional self-training with ViDT+ [45], but observe no benefits, and therefore limit self-training to one round throughout the paper.

Schedule length: In Tab. 12 we examine the impact of a longer training schedule on our method for both training stages by extending training from 10 to 25 epochs per stage. The results show that a longer training schedule can have some beneficial, yet marginal, effect. Interestingly, Tab. 12 highlights the importance of self-training, as two training stages totaling a combined 20 epochs (10 per stage) clearly outperform a single training round of 25 epochs.

6. Conclusion

We have proposed SimDETR, a novel method for self-supervised pretraining of an end-to-end object detector. Compared to prior work, our method aligns pretraining and downstream tasks through the careful construction of pseudo-labels and the use of self-training. We extensively evaluate SimDETR in typical object detector pretraining benchmarks and demonstrate that it consistently outperforms previous methods. However, unlike prior work, we show that SimDETR is also capable of effectively pretraining the backbone. This brings our method in line with the wider literature on self-supervised representation learning for detection. We again show competitive performance in this area and explore novel settings, specifically pretraining with scene-centric datasets and even pretraining from scratch. Overall, we believe our framework not only outperforms existing DETR pretraining methods, but also represents a promising step toward self-supervised, fully end-to-end object detection pretraining on uncurated images.
References

[1] Yutong Bai, Xinlei Chen, Alexander Kirillov, Alan Yuille, and Alexander C Berg. Point-level region contrast for object detection pre-training. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
[2] Amir Bar, Xin Wang, Vadim Kantorov, Colorado J Reed, Roei Herzig, Gal Chechik, Anna Rohrbach, Trevor Darrell, and Amir Globerson. DETReg: Unsupervised pretraining with region priors for object detection. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
[3] Adam Bielski and Paolo Favaro. Move: Unsupervised movable object segmentation and detection. arXiv preprint arXiv:2210.07920, 2022.
[4] Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao. Yolov4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934, 2020.
[5] Zhaowei Cai and Nuno Vasconcelos. Cascade R-CNN: Delving into high quality object detection. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018.
[6] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European Conference on Computer Vision, 2020.
[7] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. In Advances in Neural Information Processing Systems, 2020.
[8] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In IEEE/CVF International Conference on Computer Vision, 2021.
[9] Qiang Chen, Xiaokang Chen, Jian Wang, Haocheng Feng, Junyu Han, Errui Ding, Gang Zeng, and Jingdong Wang. Group DETR: Fast DETR training with group-wise one-to-many assignment. arXiv preprint arXiv:2207.13085, 2022.
[10] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, 2020.
[11] Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020.
[12] Zhigang Dai, Bolun Cai, Yugeng Lin, and Junying Chen. UP-DETR: Unsupervised pre-training for object detection with transformers. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.
[13] Karan Desai and Justin Johnson. Virtex: Learning visual representations from textual annotations. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.
[14] Jian Ding, Enze Xie, Hang Xu, Chenhan Jiang, Zhenguo Li, Ping Luo, and Gui-Song Xia. Deeply unsupervised patch re-identification for pre-training object detectors. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
[15] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88:303–338, 2010.
[16] Akash Gokul, Konstantinos Kallidromitis, Shufan Li, Yusuke Kato, Kazuki Kozuka, Trevor Darrell, and Colorado J Reed. Refine and represent: Region-to-object representation learning. arXiv preprint arXiv:2208.11821, 2022.
[17] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in Neural Information Processing Systems, 27, 2014.
[18] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
[19] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.
[20] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In IEEE/CVF International Conference on Computer Vision, 2017.
[21] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2016.
[22] Olivier J Hénaff, Skanda Koppula, Jean-Baptiste Alayrac, Aaron Van den Oord, Oriol Vinyals, and João Carreira. Efficient visual pretraining with contrastive detection. In IEEE/CVF International Conference on Computer Vision, 2021.
[23] Olivier J. Hénaff, Skanda Koppula, Evan Shelhamer, Daniel Zoran, Andrew Jaegle, Andrew Zisserman, João Carreira, and Relja Arandjelović. Object discovery and representation networks. In European Conference on Computer Vision, 2022.
[24] Gengshi Huang, Wei Li, Jianing Teng, Kun Wang, Zeren Chen, Jing Shao, Chen Change Loy, and Lu Sheng. Siamese detr. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.
[25] Lang Huang, Shan You, Mingkai Zheng, Fei Wang, Chen Qian, and Toshihiko Yamasaki. Learning where to learn in cross-view self-supervised learning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
[26] Ashraful Islam, Benjamin Lundell, Harpreet Sawhney, Sudipta N Sinha, Peter Morales, and Richard J Radke. Self-supervised learning with local contrastive loss for detection and semantic segmentation. In Winter Conference on Applications of Computer Vision, 2023.
[27] Ding Jia, Yuhui Yuan, Haodi He, Xiaopei Wu, Haojun Yu, Weihong Lin, Lei Sun, Chao Zhang, and Han Hu. Detrs with hybrid matching. arXiv preprint arXiv:2207.13080, 2022.
[28] Robin Karlsson, Tomoki Hayashi, Keisuke Fujii, Alexander Carballo, Kento Ohtani, and Kazuya Takeda. Vice: Improving dense representation learning by superpixelization and contrasting cluster assignment. arXiv e-prints, pages arXiv–2111, 2021.
[29] Ivan Krasin, Tom Duerig, Neil Alldrin, Vittorio Ferrari, Sami Abu-El-Haija, Alina Kuznetsova, Hassan Rom, Jasper Uijlings, Stefan Popov, Shahab Kamali, Matteo Malloci, Jordi Pont-Tuset, Andreas Veit, Serge Belongie, Victor Gomes, Abhinav Gupta, Chen Sun, Gal Chechik, David Cai, Zheyun Feng, Dhyanesh Narayanan, and Kevin Murphy. Openimages: A public dataset for large-scale multi-label and multi-class image classification. Dataset available from https://storage.googleapis.com/openimages/web/index.html, 2017.
[30] Feng Li, Hao Zhang, Shilong Liu, Jian Guo, Lionel M Ni, and Lei Zhang. DN-DETR: Accelerate DETR training by introducing query denoising. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
[31] Zhaowen Li, Yousong Zhu, Fan Yang, Wei Li, Chaoyang Zhao, Yingying Chen, Zhiyang Chen, Jiahao Xie, Liwei Wu, Rui Zhao, et al. Univip: A unified framework for self-supervised visual pre-training. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
[32] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, 2014.
[33] Shilong Liu, Feng Li, Hao Zhang, Xiao Yang, Xianbiao Qi, Hang Su, Jun Zhu, and Lei Zhang. DAB-DETR: Dynamic anchor boxes are better queries for DETR. arXiv preprint arXiv:2201.12329, 2022.
[34] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In IEEE/CVF International Conference on Computer Vision, 2021.
[35] Luke Melas-Kyriazi, Christian Rupprecht, Iro Laina, and Andrea Vedaldi. Deep spectral methods: A surprisingly strong baseline for unsupervised semantic segmentation and localization. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8364–8375, 2022.
[36] Depu Meng, Xiaokang Chen, Zejia Fan, Gang Zeng, Houqiang Li, Yuhui Yuan, Lei Sun, and Jingdong Wang. Conditional DETR for fast training convergence. In IEEE/CVF International Conference on Computer Vision, 2021.
[37] Andrew Ng, Michael Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. Advances in Neural Information Processing Systems, 2001.
[38] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems, 2015.
[39] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Fei-Fei Li. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 2015.
[40] Maximilian Seitzer, Max Horn, Andrii Zadaianchuk, Dominik Zietlow, Tianjun Xiao, Carl-Johann Simon-Gabriel, Tong He, Zheng Zhang, Bernhard Schölkopf, Thomas Brox, and Francesco Locatello. Bridging the gap to real-world object-centric learning. International Conference on Learning Representations, 2023.
[41] Gyungin Shin, Samuel Albanie, and Weidi Xie. Unsupervised salient object detection with spectral cluster voting. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
[42] Oriane Siméoni, Gilles Puy, Huy V Vo, Simon Roburin, Spyros Gidaris, Andrei Bursuc, Patrick Pérez, Renaud Marlet, and Jean Ponce. Localizing objects with self-supervised transformers and no labels. arXiv preprint arXiv:2109.14279, 2021.
[43] Oriane Siméoni, Chloé Sekkat, Gilles Puy, Antonin Vobecky, Éloi Zablocki, and Patrick Pérez. Unsupervised object localization: Observing the background to discover objects. arXiv preprint arXiv:2212.07834, 2022.
[44] Roman Solovyev, Weimin Wang, and Tatiana Gabruseva. Weighted boxes fusion: Ensembling boxes from different object detection models. Image and Vision Computing, 107:104117, 2021.
[45] Hwanjun Song, Deqing Sun, Sanghyuk Chun, Varun Jampani, Dongyoon Han, Byeongho Heo, Wonjae Kim, and Ming-Hsuan Yang. An extendable, efficient and effective transformer-based object detector. arXiv preprint arXiv:2204.07962, 2022.
[46] Hwanjun Song, Deqing Sun, Sanghyuk Chun, Varun Jampani, Dongyoon Han, Byeongho Heo, Wonjae Kim, and Ming-Hsuan Yang. ViDT: An efficient and effective fully transformer-based object detector. In International Conference on Learning Representations, 2022.
[47] Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. FCOS: Fully convolutional one-stage object detection. In IEEE/CVF International Conference on Computer Vision, 2019.
[48] Jasper RR Uijlings, Koen EA Van De Sande, Theo Gevers, and Arnold WM Smeulders. Selective search for object recognition. International Journal of Computer Vision, 104:154–171, 2013.
[49] Wouter Van Gansbeke, Simon Vandenhende, Stamatios Georgoulis, and Luc V Gool. Revisiting contrastive methods for unsupervised learning of visual representations. Advances in Neural Information Processing Systems, 2021.
[50] Wouter Van Gansbeke, Simon Vandenhende, Stamatios Georgoulis, and Luc Van Gool. Revisiting contrastive methods for unsupervised learning of visual representations. In Advances in Neural Information Processing Systems, 2021.
[51] Wouter Van Gansbeke, Simon Vandenhende, and Luc Van Gool. Discovering object masks with transformers for unsupervised semantic segmentation. arXiv preprint arXiv:2206.06363, 2022.
[52] Van Huy Vo, Elena Sizikova, Cordelia Schmid, Patrick Pérez, and Jean Ponce. Large-scale unsupervised object discovery. Advances in Neural Information Processing Systems, 2021.
[53] Xudong Wang, Rohit Girdhar, Stella X Yu, and Ishan Misra. Cut and learn for unsupervised object detection and instance segmentation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.
[54] Xin Wang, Thomas E Huang, Trevor Darrell, Joseph E Gonzalez, and Fisher Yu. Frustratingly simple few-shot object detection. arXiv preprint arXiv:2003.06957, 2020.
[55] Xinlong Wang, Zhiding Yu, Shalini De Mello, Jan Kautz, Anima Anandkumar, Chunhua Shen, and Jose M Alvarez. Freesolo: Learning to segment objects without annotations. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
[56] Xinlong Wang, Rufeng Zhang, Chunhua Shen, Tao Kong, and Lei Li. Dense contrastive learning for self-supervised visual pre-training. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.
[57] Yizhou Wang, Meilin Chen, Shixiang Tang, Feng Zhu, Haiyang Yang, Lei Bai, Rui Zhao, Yunfeng Yan, Donglian Qi, and Wanli Ouyang. Unsupervised object detection pretraining with joint object priors generation and detector learning. In Advances in Neural Information Processing Systems, 2022.
[58] Yangtao Wang, Xi Shen, Yuan Yuan, Yuming Du, Maomao Li, Shell Xu Hu, James L Crowley, and Dominique Vaufreydaz. Tokencut: Segmenting objects in images and videos with self-supervised transformer and normalized cut. arXiv preprint arXiv:2209.00383, 2022.
[59] Fangyun Wei, Yue Gao, Zhirong Wu, Han Hu, and Stephen Lin. Aligning pretraining for detection via object-level contrastive learning. Advances in Neural Information Processing Systems, 2021.
[60] Xin Wen, Bingchen Zhao, Anlin Zheng, Xiangyu Zhang, and Xiaojuan Qi. Self-supervised visual representation learning with semantic grouping. In Advances in Neural Information Processing Systems, 2022.
[61] Enze Xie, Jian Ding, Wenhai Wang, Xiaohang Zhan, Hang Xu, Peize Sun, Zhenguo Li, and Ping Luo. DetCo: Unsupervised contrastive learning for object detection. In IEEE/CVF International Conference on Computer Vision, 2021.
[62] Jiahao Xie, Xiaohang Zhan, Ziwei Liu, Yew Soon Ong, and Chen Change Loy. Unsupervised object-level representation learning from scene images. Advances in Neural Information Processing Systems, 34:28864–28876, 2021.
[63] Zhenda Xie, Yutong Lin, Zhuliang Yao, Zheng Zhang, Qi Dai, Yue Cao, and Han Hu. Self-supervised learning with swin transformers. arXiv preprint arXiv:2105.04553, 2021.
[64] Zhenda Xie, Yutong Lin, Zheng Zhang, Yue Cao, Stephen Lin, and Han Hu. Propagate yourself: Exploring pixel-level consistency for unsupervised visual representation learning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.
[65] Zhuyu Yao, Jiangbo Ai, Boxun Li, and Chi Zhang. Efficient DETR: improving end-to-end object detector with dense prior. arXiv preprint arXiv:2104.01318, 2021.
[66] Andrii Zadaianchuk, Matthaeus Kleindessner, Yi Zhu, Francesco Locatello, and Thomas Brox. Unsupervised semantic segmentation with self-supervised object-centric representations. arXiv preprint arXiv:2207.05027, 2022.
[67] Xingyi Zhou, Dequan Wang, and Philipp Krähenbühl. Objects as points. arXiv preprint arXiv:1904.07850, 2019.
[68] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable DETR: Deformable transformers for end-to-end object detection. International Conference on Learning Representations, 2021.
A. Training Hyperparameters

In this section we provide detailed hyperparameters for each training setting included in the main paper. We use two detectors, Def. DETR [68] and ViDT+ [45], and typically follow the training settings proposed in their respective papers for finetuning and DETReg [2] for pretraining. More specifically, unless stated otherwise, the following hyperparameters apply:

For Def. DETR, we train SimDETR following [2]. Specifically, we pretrain for 5 epochs per stage on ImageNet with a batch size of 192 and a fixed learning rate of 0.0002. For finetuning, we train on COCO for 50 epochs and PASCAL VOC for 100 epochs, with a batch size of 32. The learning rate is set to 0.0002, and is decreased by a factor of 10 at epoch 40 and 100 for COCO and PASCAL VOC respectively.

For ViDT+, we use the training hyperparameters proposed in [45]. Specifically, unless stated otherwise, ViDT+ is pretrained for 10 epochs per stage on ImageNet and Open Images, and for 50 epochs per stage on COCO, with batch size 128. In all cases, the learning rate is set to 0.0001 and follows a cosine decay schedule.

Unless stated otherwise, we pretrain with 2048 pseudo-classes (i.e. we set the number of clusters for the global clustering step to 2048), and apply one round of self-training, following our findings in Tab. 11. Finally, during pretraining, we use the mosaic augmentation [4].
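For convenience, these default pretraining settings can be collected into a simple configuration; the dictionary below is purely illustrative (the key names are ours, not an official config file) and only restates the values given above.

```python
# Illustrative defaults collected from Appendix A (not an official config file)
SIMDETR_PRETRAIN_DEFAULTS = {
    "def_detr": {"epochs_per_stage": 5, "batch_size": 192,
                 "lr": 2e-4, "lr_schedule": "constant"},
    "vidt_plus": {"epochs_per_stage": {"imagenet": 10, "open_images": 10, "coco": 50},
                  "batch_size": 128, "lr": 1e-4, "lr_schedule": "cosine"},
    "num_pseudo_classes": 2048,     # global clustering K
    "self_training_rounds": 1,      # i.e. two training stages in total
    "augmentation": "mosaic",
}
```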
For specific experiments conducted in the paper, we note changes relative to the settings described above:

Full data regime: Same as above.

Semi-supervised: Following DETReg [2], we finetune on COCO for 2,000 epochs for 1% of samples annotated, 1,000 epochs for 2% of samples, 500 epochs for 5% of samples, and 400 epochs for 10% of samples. The learning rate is kept fixed at 0.0002. Results in Table 3 are measured over 5 runs, with different, randomly sampled annotated samples.

Few-shot: We finetune on COCO's base classes, using the splits proposed in [54]. For the standard few-shot setting we a) finetune on the base classes following the COCO finetuning settings outlined above, and b) finetune on the 10- and 30-shot sets for 30 and 50 epochs respectively, with a fixed learning rate of 0.0002 and 0.00004. For the extreme setting, we directly finetune on the 10- and 30-shot sets for 400 epochs with a learning rate of 0.0002 that is decreased by a factor of 10 after 320 epochs. Results in Table 4 correspond to the best validation score of each run during training, averaged over 5 runs, with k-shot samples corresponding to seeds 1-5 of [54]. When finetuning on the k-shot instances, the backbone is kept frozen in both settings.

Object vs Scene-centric pretraining: Same as above.

Self-supervised representation learning on scene-centric data: For these experiments, where the entire architecture is initialized from scratch (backbone & detector), we train for 1,000 epochs on COCO, 100 epochs on ImageNet, and 70 epochs on Open Images. This allows for a fair comparison, with approximately the same number of training steps across datasets.

B. Datasets

In our paper, we use the training sets of ImageNet [39], Open Images [29] and MS COCO (COCO) [32] for unsupervised pretraining. We use the training sets of MS COCO and PASCAL VOC [15] for supervised finetuning and their validation sets for evaluation. ImageNet includes 1.2M object-centric images, classified with 1,000 labels and without object-level annotations. Open Images includes 1.7M scene-centric images, and a total of 14.6M bounding boxes with 600 object classes. COCO is a scene-centric dataset with 120K training images and 5K validation images containing 80 classes. PASCAL VOC is scene-centric and contains 20K images with object annotations covering 21 classes.

C. Convergence & Alignment Analysis

In this section we discuss the convergence and alignment properties of SimDETR by analyzing the results of the "extreme" few-shot experiments. As discussed in Sec. 5 of the main paper, in this setting we pretrain Def. DETR on ImageNet, and then finetune directly on COCO's train set, using k ∈ {10, 30} instances from all classes.

In Figs. 1 and 2 we present the AP scores for SimDETR and DETReg during training, averaged over 5 runs and measured over the validation set's novel classes. As was noted in Sec. 5 of the main paper, SimDETR outperforms DETReg by large margins. Notably, however, it is also shown to converge much faster. More specifically, in Tab. 1 we present results for 50 epochs of k-shot finetuning against the performance reached after 400 epochs. In both cases, we average the best validation score across 5 runs. We see that, at 50 epochs, SimDETR has already reached near-peak performance, while DETReg converges at a much slower rate.

This means SimDETR effectively alleviates the sample inefficiency and slow convergence of DETR architectures, and makes our method particularly useful when annotations and/or computational resources are extremely scarce. These results provide further support for our conclusions in Sec. 5 of the main paper, namely that SimDETR is much better aligned with the downstream task, with learned object representations that are well suited for class-aware object detection, so that minimal training and supervision can lead to strong performance.
Figure 1. AP scores on COCO's validation set's novel classes during finetuning with k=10 instances per class. Results averaged over 5 runs.

Figure 2. AP scores on COCO's validation set's novel classes during finetuning with k=30 instances per class. Results averaged over 5 runs.

Method | Epochs | Novel Class AP (10) | Novel Class AP (30) | Novel Class AP75 (10) | Novel Class AP75 (30)
DETReg [2] | 50 | 1.9 | 3.4 | 1.8 | 3.52
SimDETR | 50 | 8.32 | 13.9 | 8.06 | 14.4
DETReg [2] | 400 | 5.6 | 10.3 | 6.0 | 10.9
SimDETR | 400 | 10.3 | 14.5 | 10.9 | 15.1

Table 1. Results of "extreme" few-shot training for 50 epochs and 400 epochs.

D. Visualization

In Fig. 3 we provide visual examples of bounding boxes produced by Selective Search, our labeled object proposal method, and SimDETR, specifically a ViDT+ detector trained for two stages on ImageNet. To avoid clutter, for all three methods we only include objects whose predicted bounding boxes have an IoU higher than 0.5 with an object in the ground truth set.

The images illustrate that self-training significantly improves the object discovery performance of SimDETR over the original region proposals. Notably, those include much smaller items, and much better performance in cluttered scenes. As stated in the main paper, this contributes to the performance of our framework and specifically the performance gains between stages.
Figure 3. Examples of object proposals extracted from SimDETR, contrasted with the ground truth, Selective Search and our initial labeled
object proposals, extracted as described in paper Sec. 3.1. The images belong to COCO’s train set. To avoid clutter, we only show predicted
objects whose bounding boxes have an IOU greater than 0.5 with at least one ground truth object. Best seen in color.
