Ioannis Maniadis Metaxas¹*   Adrian Bulat²   Ioannis Patras¹   Brais Martinez²   Georgios Tzimiropoulos¹,²
¹Queen Mary University of London   ²Samsung AI Cambridge
arXiv:2307.15697v1 [cs.CV] 28 Jul 2023

Abstract
in a self-supervised, end-to-end manner.

Unsupervised object discovery: Different from object detector pretraining, this task aims at correctly localizing objects in images in an unsupervised manner. Recently, this area has seen remarkable progress utilizing self-supervised models. Specifically, several works have proposed identifying objects based on local feature similarity [51, 55, 42, 58, 43, 35, 53]. Distinctly, [52] treats localization as a ranking problem, [3] combines a MAE [18] architecture with a GAN [17], [40] combines slot attention and a reconstruction objective, and [41] clusters features from multiple encoders to produce annotations for salient object detection. The proposal(s) produced by these methods can then be used to train a detector or segmentation model, improving performance and allowing for multiple object discovery. Notably, methods such as [55, 66, 53] also leverage self-training. We emphasize, however, that the main goal of works in this area is object discovery, not the training of powerful detectors. Accordingly, the detectors trained by these works typically are not evaluated by finetuning with annotated data. Furthermore, such methods typically restrict their proposals to the most confident few (often just one) to avoid false positives, which is not well suited for detector pretraining, where training benefits from a rich set of object proposals covering as many objects as possible, not only the few most prominent ones.

3. Method

Our method aims to simplify and better align the pretraining with respect to the downstream task (class-aware detection). To this end, we produce object proposals in the form of bounding box and class pseudo-label pairs in an unsupervised manner, and then employ a self-training strategy to pretrain and iteratively refine the detector.

3.1. Improving object proposals

Examining existing methods, we note that object discovery works generate very limited initial proposals to facilitate high precision (see Sec. 2). On the other hand, methods like Selective Search [48] generate many proposals, but their reliance on low-level priors such as color and texture makes them very noisy and unsuitable for generating meaningful pseudo-labels.

We seek to address this gap by utilizing semantic information from self-supervised image encoders to produce rich object proposals and coherent pseudo-class memberships. Specifically, we extract feature maps using a pretrained self-supervised encoder and leverage a bi-level clustering strategy. The first level (termed local clustering) produces bounding box proposals and associated feature representations. The second level (termed global clustering) assigns class pseudo-labels to each proposal. Our method leads to rich and diverse region proposals (see Tab. 8) and is essential for the state-of-the-art results of SimDETR (see Tab. 9).

Unsupervised proposal extraction: Given an input image X ∈ R^{3×H×W}, we use a self-supervised pretrained encoder to extract feature maps F_l ∈ R^{d_l×H_l×W_l} from each of the encoder's levels l. Given a feature map F, we employ pixel-wise clustering to group semantically similar features (local clustering). This results in a set of masks M = {m_k}_{k=1:K}, where K, the number of clusters, is a user-defined parameter. In order to provide good coverage of all objects in the image, we apply clustering with different values K ∈ 𝒦 and use feature maps from different layers l ∈ 𝓛, leading to a set of masks M = ⋃ {M_{l,K}}_{K∈𝒦, l∈𝓛}. Next, the connected components of each mask are computed, leading to a set of regions R. Each region r ∈ R is then used to extract a bounding box (proposal) b and a corresponding feature vector f, where f is computed by average-pooling the last-layer feature map F_L over r.

Proposal filtering: Due to the repeated clustering runs, the process just described leads to noisy and overlapping proposals. We employ a number of filters to refine them, including merging proposals that have a high IoU, proposals with highly-related semantic content, and proposals that are part of other proposals. This results in a set of N(i) bounding box-feature vector pairs for image i, {b_n^i, f_n^i}_{n=1}^{N(i)}.

Pseudo-class label generation: We then cluster proposals across the whole dataset (global clustering) based on the feature vectors, i.e. we perform a single clustering round on {f_n^i}_{n=1:N(i)}^{i=1:I}. Class membership is then used as the pseudo-class label, c ∈ C, for each of the proposals. This results in a training set T_0 = {X_i, {(b_n^i, c_n^i)}}.

We used Spectral Clustering [37] for local and K-Means for global clustering, as Spectral Clustering typically performs better but cannot handle the millions of data points involved in global clustering. However, any clustering algorithm may be used in either case.

3.2. Pretraining and Self-Training

We can now use the training set T_0 to train an object detector within the DETR framework. In particular, given an input image and its corresponding extracted object proposals y, the network predicts a set ŷ = {ŷ_q}_{q=1}^{Q}, where ŷ_q = (b̂_q, ĉ_q), that is, the predicted bounding box and predicted category. We note that the extracted proposals y are padded to size Q with ∅ (no object). Typically in DETR architectures, the ground truth and the predictions are put in correspondence via bipartite matching, formally defined as:

    σ̂ = arg min_{σ∈S_Q} ∑_{q}^{Q} L(y_q, ŷ_{σ(q)})    (1)

where S_Q is the space of permutations of Q elements. The loss between y and ŷ is computed as a combination of a bounding box matching loss and a class matching loss:

    ∑_{q=1}^{Q} ( −log p̂_{σ̂(q)}(c_q) + 1_{c_q≠∅} L_box(b_q, b̂_{σ̂(q)}) )    (2)

where p̂ indicates the predicted per-class probabilities. The indicator function 1_{c_q≠∅} denotes that the box loss only applies to predictions that have been matched to object proposals y. Minimizing this loss results in weights Θ_0.

Upon training the detector in this way, we observe that it can identify more objects than those in our original proposals. Critically, this includes smaller and more challenging objects, which constitute a stronger supervisory signal. We thus generate a new set of labels for image i as {g(X_i; Θ_0)}, where g = (g_b, g_h) is the detection network, with g_b and g_h the backbone and head respectively. In self-training, pseudo-labels are typically filtered based on confidence. In our case, filtering based on the detector's confidence leads to the removal of small or challenging instances such as partially-occluded or uncommon objects. Thus, we instead filter the new proposals so that any two boxes have an IoU lower than 0.55 (following [44]), with only the most confident box being kept when such conflicts exist. This leads to training set T_1.

A new set of weights Θ_i can be obtained by training on set T_i, using Θ_{i−1} to initialize the weights. Simultaneously, Θ_i can be used to generate a new training set T_{i+1}. While this process can be iterated indefinitely, we notice that optimal performance involves just two rounds of training, which we refer to as Stages 1 & 2. Stage 1 training, including the proposal extraction process for T_0, is shown in Fig. 2.

We highlight that, importantly, the proposed pretraining is very well-aligned with the downstream task, i.e. supervised class-aware object detection. Furthermore, it allows us to effectively pretrain both the backbone and the detection head simultaneously. This is unlike other detector pretraining methods [12, 2, 57] that require freezing the backbone to avoid performance degradation.

The whole method is summarized in Algorithm 1.

Algorithm 1 Pretraining
Require: {X_i}_{i=1}^{I}, Net g = (g_b, g_h), initial params. Θ_0
 1: ▷ Unsup. train set gen., Sec. 3.1
 2: for i = 1 : I do
 3:     F_l ← g_b(X_i)
 4:     M_i ← ⋃ Cluster(F_l, K)    ▷ K ∈ 𝒦, l ∈ 𝓛
 5:     R_i ← ConnectedComponents(M_i)
 6:     {b_n^i, f_n^i}^{N(i)} ← Filter(R_i)
 7: end for
 8: {c_n^i} ← K-Means({f_n^i}, K = C)    ▷ Pseudo-classes
 9: T_0 ← {X_i, {(b_n^i, c_n^i)}_{n=1}^{N(i)}}_{i=1}^{I}
10: ▷ Self-training (Sec. 3.2)
11: for j stages do
12:     g(−; Θ_{j+1}) ← Train(T_j, g)    ▷ Using eq. 2
13:     T_{j+1} ← Filter({g(X_i; Θ_j)}_{i=1}^{I})
14: end for

4. Experimental Setting

In order to compare with prior work on object detection pretraining, we follow DETReg [2] in terms of datasets, hyperparameters and experiments (namely, the full-data and low-data settings). In order to study the effectiveness of our method for unsupervised representation learning, in the absence of a pre-defined protocol, we use the ViDT+ detector and experiment with the most well-established datasets in object detection. The hyperparameters for each experiment are provided in detail in Appendix Section A. Unless stated otherwise, for methods other than SimDETR we report results from the respective papers, except where ViDT+ is used.

While our method is, in principle, not restricted to DETR, we follow prior work [12, 2, 57] and focus on DETR-based architectures since a) DETR methods suffer from slow convergence and sample inefficiency, so they benefit the most from unsupervised pretraining methods, b)
the end-to-end single-stage design is well suited for end-to-end representation learning, which we explore in Sec. 5.2.

Datasets: We use the training sets of ImageNet [39], Open Images [29] and MS COCO [32] for unsupervised pretraining. For finetuning (with annotations) we use the training sets of MS COCO and PASCAL VOC [15]. Results are reported on the corresponding validation sets, using Average Precision (AP) and Average Recall (AR). Details on the datasets are provided in Appendix Section B.

Architectures: We use Deformable DETR [68] and ViDT+ [45] in our experiments. Def. DETR is used to compare with prior work on detector pretraining. Following [2], a ResNet-50 backbone [21] is used, initialized with SwAV [7], pretrained on ImageNet. ViDT+ is a more recent state-of-the-art method, and is used to compare against unsupervised representation learning methods. Unless stated otherwise, its Swin-T [34] transformer backbone is initialized with MoBY [63], which is pretrained on ImageNet. We emphasize that, for both Def. DETR and ViDT+, their backbones were trained in a totally unsupervised manner.

5. Experiments

We highlight two main results, namely state-of-the-art results for detection pretraining and competitive results for self-supervised representation learning for detection, including pretraining on scene-centric data such as COCO from scratch. We complement these results with a comprehensive set of ablation studies.

datasets and with both detector architectures. Furthermore, it also achieves the highest performance among self-supervised learning methods for detection. While in this case the different base detection frameworks make a direct comparison harder, we still find it a very encouraging result.

Semi-supervised setting: In this protocol, unsupervised pretraining is conducted on COCO's train set, and k% of the train set's samples are subsequently used (with annotations) for finetuning. Results in Tab. 3 demonstrate that SimDETR outperforms previous works by large margins, particularly in the more challenging settings with fewer labeled samples.

Few-shot setting: Following [2], we pretrain Def. DETR on ImageNet and report results for two settings. In the first, we finetune on COCO's 60 base classes, and then finetune again on k ∈ {10, 30} instances from all classes. In the second, "extreme" setting, we skip finetuning on the base classes. Results are reported in Tab. 4 on the validation set's novel classes, and demonstrate that SimDETR not only outperforms DETReg by significant margins, but its performance without base class finetuning is very close to its performance with it. These results support that a) our method drastically reduces DETR architectures' dependency on annotated data, and b) SimDETR's learned representations are already class-aware, and the pseudo-labels produced by our method are good enough that SimDETR can align with COCO's classes with minimal (10-shot) supervision. We conduct a more in-depth analysis of the few-shot setting outcomes in Appendix Section C.
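As a concrete reference for the local clustering of Sec. 3.1, the following sketch derives proposals from a single feature map: pixel-wise clustering into K masks, connected components, and one (bounding box, pooled feature) pair per region. It is illustrative only: the paper uses Spectral Clustering for this step, whereas this sketch substitutes K-Means for brevity, and the function name `local_proposals` is our own.

```python
import numpy as np
from scipy.ndimage import label
from sklearn.cluster import KMeans

def local_proposals(feat, K):
    """Pixel-wise clustering of a (d, H, W) feature map into K masks,
    then one (bounding box, pooled feature) pair per connected component."""
    d, H, W = feat.shape
    pixels = feat.reshape(d, -1).T                            # (H*W, d) pixel features
    assign = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(pixels)
    assign = assign.reshape(H, W)
    proposals = []
    for k in range(K):                                        # masks m_1 .. m_K
        comps, n = label(assign == k)                         # connected components -> regions r
        for c in range(1, n + 1):
            ys, xs = np.nonzero(comps == c)
            box = (xs.min(), ys.min(), xs.max(), ys.max())    # proposal b, (x1, y1, x2, y2)
            f = feat[:, comps == c].mean(axis=1)              # average-pool features over r
            proposals.append((box, f))
    return proposals
```

Running this for several values K ∈ 𝒦 and several encoder levels l ∈ 𝓛, then taking the union, would yield the mask set M described in Sec. 3.1; the subsequent IoU and semantic filtering is omitted here.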
Table 1. Lower section: Results for detection pretraining methods. Upper section: Unsupervised representation learning methods (detector head is only trained during the downstream fine-tuning stage), with pretraining on ImageNet and finetuning on COCO. 1: Backbone trained on ImageNet classification with labels (baseline). 2: Backbone initialized with MoBY and pretrained with SimDETR (pretrained detection head was discarded).

Method         AP     AP50   AP75
Supervised     59.5   82.6   65.6
SwAV [7]       61.0   83.0   68.1
DETReg [2]     63.5   83.3   70.3
JoinDet [57]   63.7   83.8   70.7
SimDETR        64.8   84.6   72.7

Table 4. Few-shot setting: results on the validation set's novel classes.

             Base Class   Novel Class AP   Novel Class AP75
Method       Finetuning   10      30       10      30
DETReg [2]   ✗            5.6     10.3     6.0     10.9
SimDETR      ✗            10.3    14.5     10.9    15.1
DETReg [2]   ✓            9.9     15.3     10.9    16.4
SimDETR      ✓            12.4    18.9     13.1    20.4
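Returning to Sec. 3.2, the matching of eq. (1) and the loss of eq. (2) can be sketched with SciPy's Hungarian solver. This is a simplified illustration, not the paper's implementation: DETR-style detectors combine L1 and generalized-IoU terms in L_box, whereas here L_box is reduced to a bare L1 distance, and the name `match_and_loss` is our own.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_and_loss(logits, pred_boxes, gt_classes, gt_boxes):
    """Bipartite matching in the spirit of eq. (1), then the loss of eq. (2).
    logits: (Q, C+1) class scores, last column = 'no object' (the ∅ padding);
    pred_boxes: (Q, 4); gt_classes / gt_boxes: the extracted proposals y."""
    Q, C1 = logits.shape
    no_obj = C1 - 1
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))  # log-softmax
    classes = np.full(Q, no_obj)                  # pad targets to size Q with ∅
    classes[: len(gt_classes)] = gt_classes
    boxes = np.zeros((Q, 4))
    boxes[: len(gt_boxes)] = gt_boxes
    # cost[t, p] = L(y_t, ŷ_p): class term, plus box term for real objects only
    cost = -logp[:, classes].T
    l1 = np.abs(boxes[:, None] - pred_boxes[None]).sum(axis=-1)
    cost = cost + (classes != no_obj)[:, None] * l1
    t_idx, p_idx = linear_sum_assignment(cost)    # the optimal permutation σ̂ of eq. (1)
    loss = sum(                                   # the loss of eq. (2) under σ̂
        -logp[p, classes[t]]
        + (classes[t] != no_obj) * np.abs(boxes[t] - pred_boxes[p]).sum()
        for t, p in zip(t_idx, p_idx)
    )
    return dict(zip(t_idx, p_idx)), float(loss)
```

Note how the box term is masked out for the ∅ targets, mirroring the indicator 1_{c_q≠∅} in eq. (2).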
Table 8. Quality of proposals: AR results on COCO's validation set. Upper section: methods for the initial extraction of object proposals. Lower section: proposals generated by detection/segmentation architectures trained on the initial proposals.

Table 9. Impact of initial proposals: AP results on COCO's validation set, using different initial object proposal methods.

Method          Proposals     AP     AP50   AP75
MoBY            -             48.3   66.9   52.4
SimDETR-St. 1   Sel. Search   48.7   67.3   52.7
SimDETR-St. 2   Sel. Search   48.6   67.1   52.2
SimDETR-St. 1   Our Anns.     48.9   67.4   52.9
SimDETR-St. 2   Our Anns.     49.6   68.2   53.8

Table 11. Self-training rounds. AP results for ViDT+ pretrained with SimDETR on ImageNet and finetuned on COCO. Avg. proposals per image are measured during training.

Table 12. Scheduler length. AP results for varying training epochs. 10 and 25 epoch Stage 2 models are initialized from 10 and 25 epoch Stage 1 models respectively.

Stage   Epochs   AP     AP50   AP75
1       10       48.9   67.4   52.9
1       25       49.2   67.7   53.6
2       10       49.6   68.2   53.8
2       25       49.7   68.1   54.2
[Figure: plot of AP for SimDETR vs. DETReg]
D. Visualization

In Fig. 3 we provide visual examples of bounding boxes produced by Selective Search, our labeled object proposal method, and SimDETR, specifically a ViDT+ detector trained for two stages on ImageNet. To avoid clutter, we only show predicted objects whose bounding boxes have an IoU greater than 0.5 with at least one ground truth object.

Figure 3. Examples of object proposals extracted from SimDETR, contrasted with the ground truth, Selective Search and our initial labeled object proposals, extracted as described in paper Sec. 3.1. The images belong to COCO's train set. To avoid clutter, we only show predicted objects whose bounding boxes have an IoU greater than 0.5 with at least one ground truth object. Best seen in color.
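The IoU test used for this display filtering (and, with a different threshold, for the 0.55 overlap filter in self-training) can be sketched as follows; the name `box_iou` and the (x1, y1, x2, y2) corner convention are our own choices for illustration.

```python
def box_iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2) corners."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])         # intersection corners
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)   # empty overlap clamps to 0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)
```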