Professional Documents
Culture Documents
6912
(5) Repeat
(1) Train
𝑃!"
Labeled Samples Labeled Samples
(2) Predict
(4) Re-train
(3) Select 𝑃!"#$
Figure 1: Curriculum Labeling (CL) algorithm. The model is (1) trained on the labeled samples, then this model is used to (2)
predict and assign pseudo-labels for the unlabeled samples. Then the distribution of the prediction scores is used to (3) select a
subset of pseudo-labeled samples. Then a new model is (4) re-trained with the labeled and pseudo-labeled samples. This process
is (5) repeated by re-labeling unlabeled samples using this new model. The process stops when all samples in the dataset have
been used during training.
tion scores of the unlabeled samples and applying a criterion careful selection of thresholds and pacing of the pseudo-
based on Extreme Value Theory (EVT) (Clifton, Hugueny, labeling of the unlabeled samples can lead to significant
and Tarassenko 2011). Figure 1 shows an overview of our gains. Here (Lee 2013; Shi et al. 2018; Arazo et al. 2020) re-
model. We empirically demonstrate that curriculum labeling lied on a trained parametric model to pseudo-label unlabeled
can achieve competitive results in comparison to the state- samples (e.g. by choosing the most confident class (Lee
of-the-art, matching the recently proposed UDA (Xie et al. 2013; Shi et al. 2018), or using soft-labels (Arazo et al.
2019) on Imagenet and surpassing UDA (Xie et al. 2019) 2020)), while (Iscen et al. 2019) proposed to propagate la-
and ICT (Verma et al. 2019) on CIFAR-10. bels using a nearest-neighbors graph. Unlike these methods
A current issue in the evaluation of SSL methods was re- we adopt curriculum labeling which adds self-pacing to our
cently highlighted by Oliver et al (Oliver et al. 2018), which training where we only add the most confident samples in
criticize the widespread practice of setting aside a small part the first few iterations and the threshold is chosen based on
of the training data as “labeled” and a large portion of the the distribution of scores on training samples.
same training data as “unlabeled”. This way of splitting the Most closely related to our approach is the concurrent
data leads to a scenario where all images in this type of “un- work of Arazo et al (Arazo et al. 2020) where pseudo-
labeled” set have the implicit guarantee that they are drawn labeling is adopted in combination with Mixup (Zhang et al.
from the exact same class distribution as the “labeled” set. 2018) to prevent what is referred in that work as confirma-
As a result, several SSL methods underperform under the tion bias. Confirmation bias is related in this context to the
more realistic scenario where unlabeled samples do not have phenomenon of concept drift where the properties of a target
this guarantee. We demonstrate that pseudo-labeling under a variable change over time. This deserves special attention in
curriculum is surprisingly robust to out-of-distribution sam- pseudo-labeling since the target variables are affected by the
ples in the unlabeled set compared to other methods. same model feeding onto itself. In our version of pseudo-
Our contributions can be summarized as follows: labeling we alleviate this confirmation bias by re-initializing
• We propose curriculum labeling (CL), which augments the model before every self-training iteration, and as such,
pseudo-labeling with a careful curriculum choice as pac- pseudo-labels from past epochs do not have an outsized ef-
ing criteria based on Extreme Value Theory (EVT). fect on the final model. While our results are comparable on
CIFAR-10, the work of Arazo et al (Arazo et al. 2020) fur-
• We demonstrate that, with the proposed curriculum
ther solidifies our finding that pseudo-labeling has a lot more
paradigm, the classic pseudo-labeling approach can de-
to offer than previously found.
liver near state-of-the-art results on CIFAR-10, Imagenet
ILSVRC top-1 accuracy, and SVHN – compared to very Recent methods for semi-supervised learning rather use
recently proposed methods. a consistency regularization approach. In these works (Saj-
jadi, Javanmardi, and Tasdizen 2016; Laine and Aila 2017;
• Compared to previous approaches, our version of pseudo- Tarvainen and Valpola 2017; Sohn et al. 2020), a network
labeling leads to more consistent results in realistic eval- is trained to make consistent predictions in response to
uation settings with out-of-distribution samples. perturbation of unlabeled samples, by combining the stan-
dard cross-entropy loss with a consistency loss. Consistency
Related Work losses generally encourage that in the absence of labels for
Our work is most closely related to the general use of a sample, its predictions should be consistent with the pre-
pseudo-labeling in semi supervised learning (Lee 2013; Shi dictions for a perturbed version of the same sample. Various
et al. 2018; Iscen et al. 2019; Arazo et al. 2020). Although perturbation operations have been explored, including basic
these methods have been surpassed by consistency regular- image processing operators (Xie et al. 2019), dedicated op-
ization methods, our work suggests that this is not a funda- erators such as MixUp (Zhang et al. 2018; Berthelot et al.
mental flaw of self-training algorithms and demonstrate that 2020), learned transformations (Cubuk et al. 2018), and ad-
6913
versarial perturbations (Miyato et al. 2018). While we op- Algorithm 1 Pseudo-Labeling under Curriculum Labeling
tionally leverage data augmentation, our use of it is the same 1: Require: DL . set of labeled samples
as in the standard supervised setting, and our method does 2: Require: DU L . set of unlabeled samples
not enforce any consistency for pairs of samples. 3: Require: ∆ := 20 . stepping threshold percent
4: Pθt ← train classifier using DL only
Method: Pseudo-Labeling under Curriculum 5: t := 1
In semi-supervised learning (SSL), a dataset D = {x|x ∈ 6: Tr := 100 − ∆
7: do
X} is comprised of two disjoint subsets: a labeled subset
8: T := Percentile(Pθt (DU L ), Tr )
DL = {(x, y)|x ∈ X, y ∈ Y } and an unlabeled subset 9: Xt := DL
DU L = {x|x ∈ X}, where x and y denote the input and the 10: for x ∈ DU L do
corresponding label. Typically |DL | |DU L |. 11: if Pθt (x) > T then
Pseudo-labeling builds upon the general self-training 12: Xt := Xt ∪ (x, pseudo-label(Pθt , x))
framework (McLachlan 1975), where a model goes through 13: t
Pθ ← train classifier from scratch using Xt
multiple rounds of training by leveraging its own past pre- 14: t := t + 1
dictions. In the first round, the model is trained on the la- 15: Tr := Tr − ∆
beled set DL . In all subsequent rounds, the training data is 16: while |Xt | 6= |DL + DU L |
the union of DL and a subset of DU L pseudo-labeled by 17: end
the model in the previous round. In round t, let the training
samples be denoted as (Xt , Yt ) and the current model as Pθt , Theoretical Analysis
where θ are the model parameters, and t ∈ {1, · · · , T }. Af-
Our data consists of N labeled samples (Xi , Yi ) and M un-
ter round t, an unlabeled subset X̄t is added into Xt+1 := labeled samples Xj . Let H be a set of hypotheses hθ where
X1 ∪ X̄t , and the new target set is defined as Yt+1 := Y1 ∪ Y¯t . hθ ∈ H, and each of them denotes a function mapping X
Here Y¯t represents the pseudo labels of X̄t , predicted by to Y . Let Lθ (Xi ) denote the loss for a given example Xi .
model Pθt . In this sense, the labels are “propagated” to the To choose the best predictor with the lowest possible error,
unlabeled data points via Pθt . our formulation can be explained with a regularized Empiri-
The criterion used to decide which subset of samples in cal Risk Minimization (ERM) framework. Below, we define
DU L to incorporate into the training in each round is key L(θ) as the pseudo-labeling regularized empirical loss as:
to our method. Different uncertainty metrics have been ex- 1 X
N M
1 X 0
plored in the previous literature, including choosing the sam- L(θ) = Ê[Lθ ] = Lθ (Xi ) + Lθ (Xj )
N i=1 M j=1
ples with the highest-confidence (Zhu 2005), or retrieving
the nearest samples in feature space (Shi et al. 2018; Iscen θ̂ = arg min L(θ) (1)
et al. 2019; Luo et al. 2017). We draw insights from Extreme θ
Value Theory (EVT) which is often used to simulate extreme Lθ (Xi ) = CEE(Pθt (Xi ), Yi )
events in the tails of one-dimensional probability distribu- L0θ (Xj ) = CEE(Pθt (Xj ), Pθt−1 (Xj ))
tions by assessing the probability of events that are more
extreme (Broadwater and Chellappa 2010; Rudd et al. 2018; Here CEE indicates cross entropy. Following (Hacohen
Al-Behadili, Grumpe, and Wöhler 2015; Clifton et al. 2008; and Weinshall 2019) this can be rewritten as follows:
Shi et al. 2008). In our problem, we observed that the distri-
bution of the maximum probability predictions for pseudo- θ̂ = arg min L(θ) = arg max exp(−L(θ)) (2)
θ θ
labeled data follows this type of Pareto distribution. So in-
stead of using fixed thresholds or tuning thresholds manu- Now to simplify the notions, we reformulate the objective
ally, we use percentile scores to decide which samples to as to maximize a so-called Utility objective, where δ is an
indicator function:
add. Algorithm 1 shows the full pipeline of our model, where 0
Percentile(X, Tr ) returns the value of the r-th percentile. We Uθ (X) = e−δ(X∈L)Lθ (X)−δ(X∈U L)Lθ (X)
use values of r from 0% to 100% in increments of 20. 1 X
N M
1 X 0 (3)
Note that, in Algorithm 1 X̄t is selected from the whole U (θ) = Ê[Uθ ] = Uθ (Xi ) + Uθ (Xj )
N i=1 M i=1
unlabeled set DU L , enabling previous pseudo-annotated
samples to enter or to leave the new training set. This is The pacing function in our pseudo curriculum labeling ef-
used to discourage concept drift or confirmation bias, as it fectively provides a Bayesian prior for sampling unlabeled
can prevent erroneous labels predicted by an undertrained samples. This can be formalized as in Eq 4:
network during the early stages of training to be accumu-
lated. To further alleviate the problem, we also reinitialize 1 X
N XM
the model parameters θ randomly after each round, and em- Up (θ) = Êp [Uθ ] = Uθ (Xi ) + Uθ0 (Xj )p(Xj )
N i=1
pirically observe that, –as opposed to fine-tuning–, our reini- j=1
training data samples which will take place when the per- θ̂ = arg max Up (θ) (4)
centile threshold is lowered to the minimum value (Tr = 0). θ
6914
2016) (depth 28, width 2) for CIFAR-10 and SVHN, and
Here pj = p(Xj ) denotes the induced prior probability ResNet-50 (He et al. 2015) for ImageNet. The networks are
we use to decide how to sample the unlabeled samples and optimized using Stochastic Gradient Descent with nesterov
is determined by the pacing function of curriculum label- momentum. We use weight decay regularization of 0.0005,
ing algorithm, thus, p(Xj ) works like an indicator function momentum factor of 0.9, and an initial learning rate of 0.1
where 1/M are the unlabeled samples whose scores are in which is then updated by cosine annealing (Loshchilov and
the top µ percentile, and 0 otherwise. Hutter 2016). Note that we use the same hyper-parameter
We can then rewrite part of Eq (4) as follows: setting for all of our experiments, except the batch size when
N M applying moderate and heavy data augmentation. We empir-
1 X X
ically observe that small batches (i.e. 64-100) work better
Up (θ) = Uθ (Xi ) + (Uθ0 (Xj ) − Ê[Uθ0 ])(pj − Ê[p])
N i=1 j=1 for moderate data augmentation (random cropping, padding,
whitening and horizontal flipping), while large batches (i.e.
+ M ∗ Ê[Uθ0 ]Ê[p]
512-1024) work better for heavy data augmentation. For
N
1 X ˆ 0 0 CIFAR-10 and SVHN, we train the models for 750 epochs.
= Uθ (Xi ) + Cov[U θ , p] + M ∗ Ê[Uθ ]Ê[p] Starting from the 500th epoch, we also apply stochastic
N i=1
0
weight averaging (SWA) (Izmailov et al. 2018) every 5
ˆ
= U (θ) + Cov[U θ , p] epochs. For ImageNet, we train the network for 220 epochs
(5) and apply SWA, starting from the 100th epoch.
Data Augmentation: While data augmentation has be-
Eq (5) tells that when we sample the unlabeled data points came a common practice in supervised learning especially
according to a probability that positively correlates with the when the training data is scarce (Simard, Steinkraus, and
confidence of the predictions of Xj (negative correlating Platt 2003; Perez and Wang 2017; Shorten and Khoshgoftaar
with the loss L0 (Xj )), we are improving the utility of the 2019; Devries and Taylor 2017), previous work on SSL typ-
estimated θ. Under such a pacing curriculum, we are mini- ically use basic augmentation operations such as cropping,
mizing the overall loss L. The smaller the L0 (Xj ) (the larger padding, whitening and horizontal flipping (Rasmus et al.
the utility of an unlabeled point Xj ), the more likely we sam- 2015; Laine and Aila 2017; Tarvainen and Valpola 2017).
ple Xj . This proves that we want to sample more those un- More recent work relies on heavy data augmentation policies
labeled points that are predicted with more confidence. that are learned automatically by Reinforcement Learning
(Cubuk et al. 2018) or density matching (Lim et al. 2019).
Experiments Other augmentation techniques generate perturbations that
We first discuss our experiment settings in detail, then we take the form of adversarial examples (Miyato et al. 2018),
compare against previous methods using the standard semi- or by interpolations between a random pair of image samples
supervised learning practice of setting aside a portion of the (Zhang et al. 2018). We explore both moderate and heavy
training data as labeled, and the rest as unlabeled, then we data augmentation techniques that do not require to learn or
test in the more realistic scenario where unlabeled training search any policy, but instead apply transformations in an
samples do not follow the same distribution as labeled train- entirely random fashion. We show that using arbitrary trans-
ing samples, finally we conduct extensive evaluation justify- formations on the training set yields positive results. We re-
ing why our version of pseudo-labeling and specifically cur- fer to this technique as Random Augmentation (RA) in our
riculum labeling enables superior results compared to prior experiments. However, we also report results without the use
efforts on pseudo-labeling. of data augmentation.
6915
Approach Method Year CIFAR-10 SVHN
Nl = 4000 Nl = 1000
– Supervised – 20.26 ± 0.38 12.83 ± 0.47
Pseudo PL (Lee 2013) 2013 17.78 ± 0.57 7.62 ± 0.29
Labeling PL-CB (Arazo et al. 2020) 2019 6.28 ± 0.3 -
Π Model (Laine and Aila 2017) 2017 16.37 ± 0.63 7.19 ± 0.27
Mean Teacher (Tarvainen and Valpola 2017) 2017 15.87 ± 0.28 5.65 ± 0.47
VAT (Miyato et al. 2018) 2018 13.86 ± 0.27 5.63 ± 0.20
Consistency VAT + EntMin (Miyato et al. 2018) 2018 13.13 ± 0.39 5.35 ± 0.19
LGA + VAT (Jackson and Schulman 2019) 2019 12.06 ± 0.19 6.58 ± 0.36
Regularization
ICT (Verma et al. 2019) 2019 7.66 ± 0.17 3.53 ± 0.07
MixMatch (Berthelot et al. 2019) 2019 6.24 ± 0.06 3.27 ± 0.31
UDA (Xie et al. 2019) 2019 5.29 ± 0.25 2.46 ± 0.17
ReMixMatch (Berthelot et al. 2020) 2020 5.14 ± 0.04 2.42 ± 0.09
FixMatch (Sohn et al. 2020) 2020 4.26 ± 0.05 2.28 ± 0.11
CL 2020 8.92 ± 0.03 5.65 ± 0.11
Pseudo CL+FA(Lim et al. 2019) 2020 5.51 ± 0.14 2.90 ± 0.19
Labeling CL+FA(Lim et al. 2019)+Mixup(Zhang et al. 2018) 2020 5.09 ± 0.18 2.75 ± 0.15
CL+RA+Mixup(Zhang et al. 2018) 2020 5.27 ± 0.16 2.80 ± 0.18
Table 1: Test error rate on CIFAR-10 and SVHN using WideResNet-28. We show that our CL method can achieve comparable
results to the state-of-the-art. ”Supervised” refers to using only 4,000/1,000 labeled samples from CIFAR-10/SVHN without
relying on any unlabeled data.
Table 2: Test error rate on CIFAR-10 and SVHN using CNN-13. Nl is the number of labeled samples in the training set.
class. In Figure 2, we also evaluate our method using this Realistic Evaluation with Out-of-Distribution
setting on WideResNet for CIFAR-10. We use the standard Unlabeled Samples
validation set size of 5,000 to make our method comparable
with previous work. Decreasing the size of the available la- In a more realistic SSL setting (Oliver et al. 2018), the un-
beled samples recreates a more realistic scenario where there labeled data may not share the same class set as the labeled
is less labeled data available. We keep the same hyperparam- data. We test our method under a scenario where the labeled
eters we use when training on 4,000 labeled samples, which and unlabeled data come from the same underlying distri-
shows that our model does not drastically degrade when bution, but the unlabeled data contains classes not present
used with smaller labeled sets. We show the lines for the in the labeled data as proposed by (Oliver et al. 2018). We
mean and shaded regions for the standard deviation across reproduce the experiment by synthetically varying the class
five independent runs, and our results are closer to the cur- overlap on CIFAR-10, choosing only the animal classes to
rent best method under this benchmark. perform the classification (bird, cat, deer, dog, frog, horse).
ImageNet: We further evaluate our method on the In this setting, the unlabeled data comes from four classes.
large-scale ImageNet dataset (ILSVRC). Following prior The idea is to vary how many of those classes are among
works (Verma et al. 2019; Xie et al. 2019; Berthelot the six animal classes to modulate the class distribution mis-
et al. 2019), we use 10%/90% of the training split as la- match. We report the results of (Lee 2013; Miyato et al.
beled/unlabeled data. Table 3 shows that we achieve compet- 2018) from (Oliver et al. 2018). We also include the results
itive results with the state-of-the-art with scores very close of (Verma et al. 2019; Xie et al. 2019) obtained by running
to the current top performing method, UDA (Xie et al. 2019) their released source code. Figure 3 shows that our method is
on both top-1 and top-5 accuracies. robust to out-of-distribution classes, while the performance
6916
Method Approach Top-1 Top-5
Supervised Baseline (Zhai et al. 2019) – – 80.43
Pseudo-Label (Lee 2013) Pseudo Labeling – 82.41
VAT (Miyato et al. 2018) Consist. Reg. – 82.78
VAT + EntMin (Miyato et al. 2018) Consist. Reg. – 83.39
S 4 L-Rotation (Zhai et al. 2019) Self-Supervision – 83.82
S 4 L-Exemplar (Zhai et al. 2019) Self-Supervision – 83.72
UDA Supervised (Xie et al. 2019) – 55.09 77.26
UDA Supervised (w. Aug) (Xie et al. 2019) – 58.84 80.56
UDA (w. Aug) (Xie et al. 2019) Consist. Reg. 68.78 88.80
FixMatch (Sohn et al. 2020) Consist. Reg. + PL 71.46 89.13
CL Supervised (w. Aug) – 55.75 79.67
CL (w. Aug) Pseudo Labeling 68.87 88.56
Table 3: Top-1 and top-5 accuracies on ImageNet with 10% of the labeled set. Here UDA and CL are trained using ResNet-50.
Previous methods (Lee 2013; Miyato et al. 2018; Zhai et al. 2019) use ResNet-50v2 to report their results.
28%
25%
26%
20%
24%
15% Test Error
22%
10%
20%
5%
500 1000 2000 4000 18%
Number of Labeled Samples 16%
14%
Figure 2: Comparison of test error rate using WideResNet
varying the size of the labeled samples on CIFAR-10. We 0% 25% 50% 75% 100%
Mismatch Percentage Between Classes
use the standard validation set size of 5,000 to make our
method comparable with previous work.
Figure 3: Comparison of test error on CIFAR-10 (six animal
classes) with varying overlap between classes. For example,
in “50%”, two of the four classes in the unlabeled data are
of previous methods drops significantly. We conjecture that
not present in the labeled data. “Supervised” refers to using
our self-pacing curriculum is key to this scenario, where the
only the 2,400 labeled images.
adaptive thresholding scheme could help filter the out-of-
distribution unlabeled samples during training.
6917
Data Augmentation Mixup SWA Top-1 Error Threshold Moderate Aug Heavy (RA)
None 7 7 38.47 0.1 21.12 7.87
None X 7 34.03 0.2 21.12 8.57
None 7 X 37.35 0.3 20.59 7.90
None X X 32.85 0.4 22.11 8.15
Moderate 7 7 22.8 0.5 19.98 7.46
Moderate X 7 16.11 0.6 19.51 6.88
Moderate 7 X 21.63 0.7 19.35 6.65
Moderate X X 15.83 0.8 18.08 6.29
Heavy (RA) 7 7 11.38 0.9 17.11 6.21
Heavy (RA) X 7 8.88 CL 8.92 5.27
Heavy (RA) 7 X 11.32
Heavy (RA) X X 8.59
Table 5: Test errors when using pseudo-labeling with several
Table 4: Test errors when using pseudo-labeling without a fixed thresholds and different data augmentation techniques.
curriculum (the threshold is set to 0.0). We use WideResnet- We use WideResnet-28 as the base network. The Heavy Aug-
28 as the base network. Additionally, we report which data mentation applies Random Augmentation, Mixup and SWA.
augmentation technique was applied. Last row shows results of our approach.
Table 6: Test errors when using two static thresholds (0.91 and 0.99952 ) and our self-pacing training. We use WideResnet-
28(WRN) as the base network. Fully Supervised refers to using only 4,000 labeled datapoints from CIFAR-10 without any
unlabeled data. The # columns show the average numbers of images automatically selected for each iteration during training.
We used ZCA preprocessing and moderate data augmentation on these experiments.
6918
References Fralick, S. 1967. Learning to recognize patterns without a
Agrawala, A. 1970. Learning with a probabilistic teacher. teacher. IEEE Transactions on Information Theory 13(1):
IEEE Transactions on Information Theory 16(4): 373–379. 57–64. doi:10.1109/TIT.1967.1053952.
doi:10.1109/TIT.1970.1054472. Hacohen, G.; and Weinshall, D. 2019. On The Power of
Al-Behadili, H.; Grumpe, A.; and Wöhler, C. 2015. Semi- Curriculum Learning in Training Deep Networks. CoRR
supervised Learning Using Incremental Polynomial Classi- abs/1904.03626. URL http://arxiv.org/abs/1904.03626.
fier and Extreme Value Theory. In 2015 3rd International Halevy, A.; Norvig, P.; and Pereira, F. 2009. The Unreason-
Conference on Artificial Intelligence, Modelling and Simu- able Effectiveness of Data. Intelligent Systems, IEEE 24: 8
lation (AIMS), 332–337. doi:10.1109/AIMS.2015.60. – 12. doi:10.1109/MIS.2009.36.
Arazo, E.; Ortego, D.; Albert, P.; O’Connor, N. E.; and He, K.; Zhang, X.; Ren, S.; and Sun, J. 2015. Deep Resid-
McGuinness, K. 2020. Pseudo-Labeling and Confirmation ual Learning for Image Recognition. CoRR abs/1512.03385.
Bias in Deep Semi-Supervised Learning. In 2020 Interna- URL http://arxiv.org/abs/1512.03385.
tional Joint Conference on Neural Networks (IJCNN), 1–8.
doi:10.1109/IJCNN48605.2020.9207304. Iscen, A.; Tolias, G.; Avrithis, Y.; and Chum, O. 2019. La-
bel Propagation for Deep Semi-supervised Learning. CoRR
Bengio, Y.; Louradour, J.; Collobert, R.; and Weston, J. abs/1904.04717. URL http://arxiv.org/abs/1904.04717.
2009. Curriculum learning. In ICML.
Izmailov, P.; Podoprikhin, D.; Garipov, T.; Vetrov, D. P.; and
Berthelot, D.; Carlini, N.; Cubuk, E. D.; Kurakin, A.; Wilson, A. G. 2018. Averaging Weights Leads to Wider
Sohn, K.; Zhang, H.; and Raffel, C. 2020. ReMixMatch: Optima and Better Generalization. CoRR abs/1803.05407.
Semi-Supervised Learning with Distribution Matching and URL http://arxiv.org/abs/1803.05407.
Augmentation Anchoring. In International Conference
on Learning Representations. URL https://openreview.net/ Jackson, J.; and Schulman, J. 2019. Semi-Supervised Learn-
forum?id=HklkeR4KPB. ing by Label Gradient Alignment. CoRR abs/1902.02336.
URL http://arxiv.org/abs/1902.02336.
Berthelot, D.; Carlini, N.; Goodfellow, I. J.; Papernot,
N.; Oliver, A.; and Raffel, C. 2019. MixMatch: A Krizhevsky, A. 2012. Learning Multiple Layers of Features
Holistic Approach to Semi-Supervised Learning. CoRR from Tiny Images. University of Toronto .
abs/1905.02249. URL http://arxiv.org/abs/1905.02249.
Laine, S.; and Aila, T. 2017. Temporal Ensembling for
Broadwater, J. B.; and Chellappa, R. 2010. Adaptive Semi-Supervised Learning. In 5th International Confer-
Threshold Estimation via Extreme Value Theory. IEEE ence on Learning Representations, ICLR 2017, Toulon,
Transactions on Signal Processing 58(2): 490–500. ISSN France, April 24-26, 2017, Conference Track Proceedings.
1941-0476. doi:10.1109/TSP.2009.2031285. OpenReview.net. URL https://openreview.net/forum?id=
Chapelle, O.; Schlkopf, B.; and Zien, A. 2010. Semi- BJ6oOfqge.
Supervised Learning. The MIT Press, 1st edition. ISBN Lee, D.-H. 2013. Pseudo-Label : The Simple and Efficient
0262514125, 9780262514125. Semi-Supervised Learning Method for Deep Neural Net-
Chapelle, O.; and Zien, A. 2005. Semi-Supervised Classifi- works. ICML 2013 Workshop : Challenges in Representa-
cation by Low Density Separation. In AISTATS. tion Learning (WREPL) .
Clifton, D. A.; Clifton, L. A.; Bannister, P. R.; and Lim, S.; Kim, I.; Kim, T.; Kim, C.; and Kim, S. 2019. Fast
Tarassenko, L. 2008. Automated Novelty Detection in In- AutoAugment. CoRR abs/1905.00397. URL http://arxiv.
dustrial Systems. In Advances of Computational Intelligence org/abs/1905.00397.
in Industrial Systems. Loshchilov, I.; and Hutter, F. 2016. SGDR: Stochastic Gra-
Clifton, D. A.; Hugueny, S.; and Tarassenko, L. 2011. Nov- dient Descent with Restarts. CoRR abs/1608.03983. URL
elty Detection with Multivariate Extreme Value Statistics. http://arxiv.org/abs/1608.03983.
J. Signal Process. Syst. 65(3): 371–389. ISSN 1939-8018. Luo, Y.; Zhu, J.; Li, M.; Ren, Y.; and Zhang, B. 2017.
doi:10.1007/s11265-010-0513-6. URL http://dx.doi.org/10. Smooth Neighbors on Teacher Graphs for Semi-supervised
1007/s11265-010-0513-6. Learning. CoRR abs/1711.00258. URL http://arxiv.org/abs/
Cubuk, E. D.; Zoph, B.; Mané, D.; Vasudevan, V.; and Le, 1711.00258.
Q. V. 2018. AutoAugment: Learning Augmentation Policies McLachlan, G. J. 1975. Iterative Reclassification Procedure
from Data. CoRR abs/1805.09501. URL http://arxiv.org/abs/ for Constructing an Asymptotically Optimal Rule of Allo-
1805.09501. cation in Discriminant Analysis. Journal of the American
Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei- Statistical Association 70(350): 365–369.
Fei, L. 2009. ImageNet: A Large-Scale Hierarchical Image Miyato, T.; ichi Maeda, S.; Koyama, M.; and Ishii, S. 2018.
Database. In CVPR09. Virtual Adversarial Training: a Regularization Method for
Devries, T.; and Taylor, G. W. 2017. Dataset Augmentation Supervised and Semi-supervised Learning. IEEE transac-
in Feature Space. ArXiv abs/1702.05538. tions on pattern analysis and machine intelligence .
6919
Netzer, Y.; Wang, T.; Coates, A.; Bissacco, A.; Wu, B.; Sun, C.; Shrivastava, A.; Singh, S.; and Gupta, A. 2017. Re-
and Ng, A. Y. 2011. Reading Digits in Natural Im- visiting Unreasonable Effectiveness of Data in Deep Learn-
ages with Unsupervised Feature Learning. In NIPS Work- ing Era. CoRR abs/1707.02968. URL http://arxiv.org/abs/
shop on Deep Learning and Unsupervised Feature Learn- 1707.02968.
ing 2011. URL http://ufldl.stanford.edu/housenumbers/ Tarvainen, A.; and Valpola, H. 2017. Mean teachers are
nips2011{\ }housenumbers.pdf. better role models: Weight-averaged consistency targets im-
Oliver, A.; Odena, A.; Raffel, C. A.; Cubuk, E. D.; and prove semi-supervised deep learning results. In ICLR.
Goodfellow, I. J. 2018. Realistic Evaluation of Deep Semi- Verma, V.; Lamb, A.; Kannala, J.; Bengio, Y.; and Lopez-
Supervised Learning Algorithms. In NeurIPS. Paz, D. 2019. Interpolation Consistency Training for Semi-
Perez, L.; and Wang, J. 2017. The Effectiveness of Data Supervised Learning. CoRR abs/1903.03825.
Augmentation in Image Classification using Deep Learn- Xie, Q.; Dai, Z.; Hovy, E. H.; Luong, M.; and Le, Q. V. 2019.
ing. CoRR abs/1712.04621. URL http://arxiv.org/abs/1712. Unsupervised Data Augmentation. CoRR abs/1904.12848.
04621. URL http://arxiv.org/abs/1904.12848.
Rasmus, A.; Berglund, M.; Honkala, M.; Valpola, H.; and
Zagoruyko, S.; and Komodakis, N. 2016. Wide Residual
Raiko, T. 2015. Semi-supervised Learning with Ladder Net-
Networks. CoRR abs/1605.07146. URL http://arxiv.org/abs/
works. In NIPS.
1605.07146.
Rudd, E.; Jain, L. P.; Scheirer, W. J.; and Boult, T. 2018.
Zhai, X.; Oliver, A.; Kolesnikov, A.; and Beyer, L. 2019.
The Extreme Value Machine. IEEE Transactions on Pattern
Analysis and Machine Intelligence (T-PAMI) 40(3). S4 L: Self-Supervised Semi-Supervised Learning. CoRR
abs/1905.03670. URL http://arxiv.org/abs/1905.03670.
Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.;
Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; Zhang, H.; Cisse, M.; Dauphin, Y. N.; and Lopez-Paz, D.
et al. 2015. Imagenet large scale visual recognition chal- 2018. mixup: Beyond Empirical Risk Minimization. In In-
lenge. International journal of computer vision 115(3): 211– ternational Conference on Learning Representations. URL
252. https://openreview.net/forum?id=r1Ddp1-Rb.
Sajjadi, M.; Javanmardi, M.; and Tasdizen, T. 2016. Reg- Zhu, X. 2005. Semi-Supervised Learning Literature Sur-
ularization With Stochastic Transformations and Perturba- vey. Technical Report 1530, Computer Sciences, Univer-
tions for Deep Semi-Supervised Learning. In NIPS. sity of Wisconsin-Madison. URL http://pages.cs.wisc.edu/
∼jerryzhu/pub/ssl survey.pdf.
Scudder, H. 1965. Probability of error of some adaptive
pattern-recognition machines. IEEE Transactions on In-
formation Theory 11(3): 363–371. doi:10.1109/TIT.1965.
1053799.
Shi, W.; Gong, Y.; Ding, C.; MaXiaoyu Tao, Z.; and Zheng,
N. 2018. Transductive Semi-Supervised Deep Learning us-
ing Min-Max Features. In The European Conference on
Computer Vision (ECCV).
Shi, Z.; Kiefer, F.; Schneider, J.; and Govindaraju, V. 2008.
Modeling biometric systems using the general pareto distri-
bution (GPD) - art. no. 69440O. Proc SPIE doi:10.1117/12.
778687.
Shorten, C.; and Khoshgoftaar, T. M. 2019. A survey on Im-
age Data Augmentation for Deep Learning. Journal of Big
Data 6(1): 60. ISSN 2196-1115. doi:10.1186/s40537-019-
0197-0. URL https://doi.org/10.1186/s40537-019-0197-0.
Simard, P. Y.; Steinkraus, D.; and Platt, J. C. 2003. Best
practices for convolutional neural networks applied to visual
document analysis. In Seventh International Conference on
Document Analysis and Recognition, 2003. Proceedings.,
958–963. doi:10.1109/ICDAR.2003.1227801.
Sohn, K.; Berthelot, D.; Li, C.; Zhang, Z.; Carlini, N.;
Cubuk, E. D.; Kurakin, A.; Zhang, H.; and Raffel, C.
2020. FixMatch: Simplifying Semi-Supervised Learning
with Consistency and Confidence. ArXiv abs/2001.07685.
Springenberg, J. T.; Dosovitskiy, A.; Brox, T.; and Ried-
miller, M. A. 2014. Striving for Simplicity: The All Con-
volutional Net. CoRR abs/1412.6806.
6920