
Confident Learning: Estimating Uncertainty in Dataset Labels

Curtis G. Northcutt, MIT, Cambridge, MA, USA (cgn@mit.edu)
Lu Jiang, Google, Mountain View, CA, USA (lujiang@google.com)
Isaac L. Chuang, MIT, Cambridge, MA, USA (ichuang@mit.edu)

Abstract
Learning exists in the context of data, yet notions of confidence typically focus
on model predictions, not label quality. Confident learning (CL) has emerged
as an approach for characterizing, identifying, and learning with noisy labels
in datasets, based on the principles of pruning noisy data, counting to estimate
noise, and ranking examples to train with confidence. Here, we generalize CL,
building on the assumption of a classification noise process, to directly estimate the
joint distribution between noisy (given) labels and uncorrupted (unknown) labels.
This generalized CL, open-sourced as cleanlab, is provably consistent under
reasonable conditions, and experimentally performant on ImageNet and CIFAR,
outperforming recent approaches, e.g. MentorNet, by 30% or more, when label
noise is non-uniform. cleanlab also quantifies ontological class overlap, and can
increase model accuracy (e.g. ResNet) by providing clean data for training.

1 Introduction: model-agnostic dataset uncertainty estimation


Large datasets with noisy labels have become increasingly common. Examples span prominent
benchmark datasets like ImageNet (Russakovsky et al., 2015) and MS-COCO (Lin et al., 2014) to
human-centric datasets like electronic health records (Halpern et al., 2016) and educational data
(Northcutt et al., 2016). The presence of noisy labels in these datasets introduces two problems. How
can examples with label errors be identified, and how can learning be done well in spite of noisy
labels, irrespective of data modality or model employed?
A large body of work, which may be termed “confident learning,” has arisen to address these interesting problems, from which two aspects stand out. First, Angluin and Laird (1988)’s classification noise
process (CNP) provides a starting assumption, that label noise is class-conditional, depending only
on the latent true class, not the data. While there are exceptions, this assumption is commonly used
(Goldberger and Ben-Reuven, 2017; Sukhbaatar et al., 2014) because it is reasonable. For example,
in ImageNet, a leopard is more likely to be mislabeled jaguar than bathtub. Second, direct estimation
of the joint distribution between noisy (given) labels and true (unknown) labels can be pursued
effectively based on three principled approaches: (a) Prune, to search for label errors, e.g. following
the example of Natarajan et al. (2013); van Rooyen et al. (2015); Patrini et al. (2017a), using soft-
pruning via loss-reweighting, to avoid the convergence pitfalls of iterative re-labeling – (b) Count, to
train on clean data, avoiding error-propagation in learned model weights from reweighting the loss
(Natarajan et al., 2017) with imperfect predicted probabilities, generalizing seminal work Forman
(2005, 2008); Lipton et al. (2018) – and (c) Rank which examples to use during training, to allow
learning with unnormalized probabilities or decision boundary distances, building on well-known
robustness findings (Page et al., 1997) and ideas of curriculum learning (Jiang et al., 2018).
To our knowledge, no prior work has thoroughly analyzed direct estimation of the joint distribution
between noisy and uncorrupted labels. Here, we assemble these principled approaches to generalize

This manuscript is under review for publication at AISTATS 2020.


[Figure 1 diagram: noisy data (x, ỹ) and a model θ produce noisy predicted probabilities P̂; counting yields the confident joint Cỹ,y∗, whose rows are normalized to match the prior and divided by the total to give the estimated joint Q̂ỹ,y∗; cleanlab then prunes the examples with label issues, separating dirty data from clean data. Example matrices over classes {dog, fox, cow}:

Cỹ,y∗      y∗=dog  y∗=fox  y∗=cow
ỹ=dog       100      40      20
ỹ=fox        56      60       0
ỹ=cow        32      12      80

Q̂ỹ,y∗     y∗=dog  y∗=fox  y∗=cow
ỹ=dog       0.25    0.10    0.05
ỹ=fox       0.14    0.15    0
ỹ=cow       0.08    0.03    0.20]
Figure 1: The confident learning process and examples of the confident joint Cỹ,y∗ and estimated
joint distribution Q̂ỹ,y∗ . ỹ denotes an observed noisy label and y ∗ denotes a latent uncorrupted label.

confident learning (CL) for this purpose. Estimating the joint distribution is challenging, but useful
because its marginals yield important statistics used in the literature, including latent noise transition
rates (Sukhbaatar et al., 2014; Goldberger and Ben-Reuven, 2017; Reed et al., 2015), latent prior of
uncorrupted labels (Lawrence and Schölkopf, 2001; Graepel and Herbrich, 2001), and inverse noise
rates (Katz-Samuels et al., 2017). While noise rates are useful for loss-reweighting (Natarajan et al.,
2013) in learning with noisy labels, only the joint can directly estimate the number of label errors for
each pair of true and noisy classes. The joint is also useful to discover ontological issues in datasets,
e.g. ImageNet includes two classes for the same maillot class (c.f. Table 3 in Sec. 5).
The resulting CL procedure (Fig. 1) is a model-agnostic family of theory and algorithms for
characterizing, finding, and learning with label errors, which uses predicted probabilities and noisy
labels to count examples in the unnormalized confident joint then normalize to estimate the joint
distribution, and prune noisy data, producing clean data as output.
This new CL generalization provides three key advantages over prior art: (1) direct estimation of the
joint distribution of label noise, (2) robust performance against non-uniformly random label noise,
and (3) consistent joint estimation and exact identification of label errors under realistic sufficient
conditions. We empirically evaluate CL on (a) accuracy of joint estimation, (b) label error finding,
and (c) learning with noisy labels on CIFAR and ImageNet for both synthetic and real-world noise.
These experiments validate the performance benefits of estimating the joint distribution. The new CL
code, which reproduces all results described here, is fully open-sourced as the cleanlab¹ Python package.
Our contributions can be summarized as follows:
1. Proposed confident learning for characterizing, finding, & learning with label errors in datasets.
2. Proved non-trivial conditions for consistent joint estimation and exactly finding label errors.
3. Verified the efficacy of CL on CIFAR (added label noise) and ImageNet (real label noise).
4. Released the cleanlab Python package for accessibility and reproducibility.

¹cleanlab for finding and learning with noisy labels is open-source: https://github.com/cgnorthcutt/cleanlab/

2 Framework
Here, we consider standard multiclass classification with possibly noisy labels. Notation used in
this manuscript is summarized in Table 4 in the Appendix. Let M denote the set of m=|M| unique class labels and X := (x, ỹ)ⁿ ∈ (Rᵈ, Z>0)ⁿ denote the set of n examples x ∈ Rᵈ with associated observed noisy labels ỹ ∈ Z>0. We couple x and ỹ in X to signify that cleaning implies removal of
data and label. The discrete random variable ỹ takes an observed, noisy label (potentially flipped
to an incorrect class), and y ∗ takes a latent, uncorrupted label. We use functions y ∗ (x) and ỹ(x) to
denote the true and noisy label for a given example x. The subset of examples in X with noisy label
i is denoted Xỹ=i , i.e. Xỹ=cat is read, “examples labeled cat.”
Notation. The notation p(ỹ; x), as opposed to p(ỹ|x), expresses our assumption that input x is
deterministic and error-free. We denote the discrete joint probability of the noisy and latent labels ỹ
and y ∗ as p(ỹ, y ∗ ), where conditionals p(ỹ|y ∗ ) and p(y ∗ |ỹ) denote probabilities of label flipping. We
use p̂ for estimated or predicted probabilities. In matrix notation, Qy∗ is the prior of the latent labels;
Qỹ,y∗ is the m × m joint distribution matrix for p(ỹ, y ∗ ); Qỹ|y∗ is the m × m noise transition matrix
(noisy channel) of flipping rates for p(ỹ|y ∗ ); and Qy∗ |ỹ is the inverse noise matrix for p(y ∗ |ỹ). At
times, we abbreviate p̂(ỹ = i; x, θ) as p̂x,ỹ=i , where θ denotes the model parameters.
Definition. Self-Confidence is the predicted probability that an example x belongs to its given label
ỹ, expressed as p̂(ỹ =i; x∈Xỹ=i ). Low self-confidence is a heuristic likelihood of being a label error.
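In code, self-confidence is just each example's predicted probability indexed at its own given label; a toy sketch (our illustrative values, not from the paper):

```python
import numpy as np

psx = np.array([[0.9, 0.1],    # toy (n=2, m=2) predicted probabilities
                [0.4, 0.6]])
labels = np.array([0, 0])      # given noisy labels
self_confidence = psx[np.arange(len(labels)), labels]  # -> [0.9, 0.4]
# The second example is the likelier label error: low confidence in its given label.
```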
Assumptions. Prior to observing ỹ, we assume a class-conditional classification noise process
(CNP) (Angluin and Laird, 1988) maps y ∗ → ỹ such that every label in class j ∈ M may be inde-
pendently mislabeled as class i ∈ M with probability p(ỹ =i|y ∗ =j). This assumption is reasonable
and has been used in prior work (Goldberger and Ben-Reuven, 2017; Sukhbaatar et al., 2014). For
example, in ImageNet (Russakovsky et al., 2015), leopard is more likely to be mislabeled jaguar than
bathtub. CNP implies a data-independent noise transition probability, namely p(ỹ|y ∗ ; x) = p(ỹ|y ∗ ).

3 CL Methods
Confident learning estimates the joint distribution between the (noisy) observed labels and the (true)
latent labels and can be used to (i) improve training with noisy labels, and (ii) identify noisy labels
in existing datasets. The main procedure consists of three steps: (1) estimate the joint Q̂ỹ,y∗ to characterize class-conditional label noise, (2) filter out noisy examples, and (3) train with errors removed, re-weighting examples by class weights Q̂y∗[i] / Q̂ỹ,y∗[i][i] for each class i ∈ M. In this section, we define these three steps and discuss their expected outcomes. Note that only two inputs are used: (1) P̂k,i, the n × m matrix of out-of-sample predicted probabilities p̂(ỹ = i; xk, θ), and (2) an associated array of noisy labels ỹ(xk). We use cross-validation to obtain P̂k,i, hence P̂k,i and xk share the same index. Our method requires no hyperparameters.

3.1 Count: Label Noise Characterization

We estimate Q̂ỹ,y∗ by counting examples in the joint distribution, calibrating estimated counts using
the actual count of noisy labels in each class, |Xỹ=i |, then normalizing. Counts are captured by
the confident joint Cỹ,y∗ ∈ Z≥0^{m×m}, the key structure of confident learning. Diagonal entries of
Cỹ,y∗ count correct labels and non-diagonals capture asymmetric label error counts. As an example,
Cỹ=3,y∗ =1 =10 is read, “10 examples are labeled ‘3’, but should be labeled ‘1’.”
Confusion matrix Cconfusion. Cỹ,y∗ may be constructed as a confusion matrix of given labels ỹ(xk) and predictions arg maxi∈M p̂(ỹ=i; xk, θ). This approach performs reasonably empirically
(Sec. 5) and is a consistent estimator for noiseless predicted probabilities (Thm. 1), but fails when
the distributions of probabilities are not similar for each class (Thm. 2).
The confident joint Cỹ,y∗ . Cỹ,y∗ bins examples x labeled ỹ =i with large enough p̂x,ỹ=j to likely
belong to label y ∗ =j. As a first try, we express Cỹ,y∗ as
Cỹ,y∗[i][j] := |X̂ỹ=i,y∗=j|, where X̂ỹ=i,y∗=j = {x ∈ Xỹ=i : p̂(ỹ = j; x, θ) ≥ tj}
and the threshold tj is the expected (average) self-confidence for each class:

$$ t_j = \frac{1}{|X_{\tilde{y}=j}|} \sum_{x \in X_{\tilde{y}=j}} \hat{p}(\tilde{y} = j;\; x, \theta) \tag{1} $$
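As a concrete illustration of Eqn. 1, here is a minimal NumPy sketch (our own, not the cleanlab source; psx and labels name the two inputs from the start of this section):

```python
import numpy as np

def compute_thresholds(psx, labels):
    """Per-class thresholds t_j (Eqn. 1): the average self-confidence of the
    examples whose given noisy label is j. Assumes every class appears in labels.

    psx    : (n, m) out-of-sample predicted probabilities p-hat(y~ = j; x_k, theta)
    labels : (n,) given noisy labels in {0, ..., m-1}
    """
    m = psx.shape[1]
    return np.array([psx[labels == j, j].mean() for j in range(m)])
```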

This formulation fixes the problems with Cconfusion so that Cỹ,y∗ is robust for any particular class
with large or small probabilities, but introduces label collisions when an example x is confidently
counted into more than one X̂ỹ=i,y∗ =j bin. Collisions only occur along the y ∗ dimension of Cỹ,y∗
because ỹ is given. We handle collisions by selecting ŷ∗ ← arg maxj∈M p̂x,ỹ=j. The result (Eqn. 2) is the confident joint:

$$ C_{\tilde{y},y^*}[i][j] := |\hat{X}_{\tilde{y}=i,\,y^*=j}| \quad \text{where} \quad \hat{X}_{\tilde{y}=i,\,y^*=j} := \Big\{ x \in X_{\tilde{y}=i} \,:\, \hat{p}(\tilde{y}=j;\,x,\theta) \ge t_j,\;\; j = \operatorname*{arg\,max}_{k \in M:\; \hat{p}(\tilde{y}=k;\,x,\theta) \ge t_k} \hat{p}(\tilde{y}=k;\,x,\theta) \Big\} \tag{2} $$

where the j = arg max term only matters when |{k ∈M : p̂(ỹ =k; x∈Xỹ=i , θ) ≥ tk }| > 1 (collision).
In practice with softmax, collisions sometimes occur for softmax outputs with low temperature, few
collisions occur with high temperature, and no collisions occur as the temperature → ∞ because this
reverts to Cconfusion .
Cỹ,y∗ (Eqn. 2) has some nice properties. First, if an example has low (near-uniform) probabilities
across classes, it is not counted so that Cỹ,y∗ is robust to examples from an alien class not in the
dataset. Second, tj embodies the intuition that examples with higher probability of belonging to class
j than the expected probability of examples in class j probably belong to class j. Third, the 90th
percentile may be used in tj instead of the mean for higher confidence.
We provide algorithmic implementations of Eqns. 1, 2, and 3 in the Appendix. Given predicted
probabilities P̂k,i and noisy labels ỹ(xk), these require O(m² + nm) operations to store and compute Cỹ,y∗.
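A sketch of one way to compute Eqn. 2 within the stated O(m² + nm) budget (illustrative, not the cleanlab implementation):

```python
import numpy as np

def confident_joint(psx, labels, thresholds):
    """Count the confident joint C[i][j] (Eqn. 2): examples with noisy label i
    whose predicted probability for class j meets the threshold t_j, breaking
    collisions by the largest above-threshold probability."""
    n, m = psx.shape
    C = np.zeros((m, m), dtype=int)
    for k in range(n):
        above = np.flatnonzero(psx[k] >= thresholds)  # classes clearing t_j
        if above.size == 0:
            continue  # near-uniform example: not counted (robust to alien classes)
        j = above[np.argmax(psx[k, above])]           # arg max resolves collisions
        C[labels[k], j] += 1
    return C
```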

Estimate the joint Q̂ỹ,y∗ . Given the confident joint Cỹ,y∗ , we estimate the joint as
$$ \hat{Q}_{\tilde{y}=i,\,y^*=j} = \frac{ \dfrac{ C_{\tilde{y}=i,\,y^*=j} }{ \sum_{j' \in M} C_{\tilde{y}=i,\,y^*=j'} } \cdot |X_{\tilde{y}=i}| }{ \sum\limits_{i' \in M,\, j' \in M} \left( \dfrac{ C_{\tilde{y}=i',\,y^*=j'} }{ \sum_{j'' \in M} C_{\tilde{y}=i',\,y^*=j''} } \cdot |X_{\tilde{y}=i'}| \right) } \tag{3} $$
The numerator calibrates Σj Q̂ỹ=i,y∗=j = |Xỹ=i| / Σi∈M |Xỹ=i|, ∀i∈M so that row sums match the observed marginals. The denominator calibrates Σi,j Q̂ỹ=i,y∗=j = 1 so the distribution sums to 1.
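A corresponding sketch of Eqn. 3 (illustrative; C and labels follow the sketches above):

```python
import numpy as np

def estimate_joint(C, labels):
    """Normalize the confident joint into Q-hat (Eqn. 3): calibrate each row i to
    sum to the observed count |X_{y~=i}|, then divide by the total so the
    distribution sums to 1."""
    m = C.shape[0]
    class_counts = np.bincount(labels, minlength=m)        # |X_{y~=i}|
    row_sums = C.sum(axis=1, keepdims=True).clip(min=1)    # avoid divide-by-zero
    calibrated = C / row_sums * class_counts[:, None]      # row i sums to |X_{y~=i}|
    return calibrated / calibrated.sum()
```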
Label noise characterization. Using the observed prior Qỹ=i = |Xỹ=i| / Σi∈M |Xỹ=i| and the marginals of Q̂ỹ,y∗, we estimate the latent prior as Q̂y∗=j := Σi Q̂ỹ=i,y∗=j, ∀j∈M; the noise transition matrix (noisy channel) as Q̂ỹ=i|y∗=j := Q̂ỹ=i,y∗=j / Q̂y∗=j, ∀i∈M; and the inverse noise matrix as Q̂y∗=j|ỹ=i := Q̂⊤ỹ=j,y∗=i / Qỹ=i, ∀i∈M. Whereas prior approaches estimate the noise transition matrices from error-prone predicted probabilities (Reed et al., 2015; Goldberger and Ben-Reuven, 2017), CL marginalizes the joint directly; as demonstrated in the experiments (Sec. 5), this is robust to imperfect probability estimation.
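Given Q̂, these marginalizations are mechanical; a hedged NumPy sketch (our naming, assuming nonzero marginals):

```python
import numpy as np

def label_noise_stats(Q):
    """Sketch of the marginalizations in Sec. 3.1 (assumes nonzero marginals).
    Q is the m x m estimated joint Q-hat_{y~, y*} with rows = noisy labels."""
    latent_prior = Q.sum(axis=0)              # Q-hat_{y*}[j] = sum_i Q[i][j]
    observed_prior = Q.sum(axis=1)            # Q_{y~}[i]    = sum_j Q[i][j]
    noise_transition = Q / latent_prior       # Q-hat_{y~=i | y*=j} = Q[i][j] / Q-hat_{y*}[j]
    inverse_noise = (Q / observed_prior[:, None]).T  # Q-hat_{y*=j | y~=i} = Q[i][j] / Q_{y~}[i]
    return latent_prior, noise_transition, inverse_noise
```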

3.2 Rank and Prune: Data Cleaning

Following estimation of the joint, we apply pruning, ranking, and other heuristics for cleaning training
data. Two approaches are: (1) use the off-diagonals of Cỹ,y∗ or (2) use Q̂ỹ,y∗ to estimate the number
of label errors and remove errors by ranking over predicted probability. Sec. 4 and the first two
methods below examine the first approach, while the second is addressed by the last three methods
below:
Method: Cconfusion. Estimate label errors as 𝟙[ỹ(x) ≠ arg maxj∈M p̂(ỹ = j; x, θ)], ∀x∈X. This is identical to using the off-diagonals of Cconfusion.
Method: Cỹ,y∗. Estimate label errors as {x ∈ X̂ỹ=i,y∗=j : i ≠ j} from the off-diagonals of Cỹ,y∗.

Method: Prune by Class (PBC). For each class i ∈ M, select the n · Σj∈M:j≠i Q̂ỹ=i,y∗=j examples with lowest self-confidence p̂(ỹ = i; x ∈ Xỹ=i).
Method: Prune by Noise Rate (PBNR). For each off-diagonal entry in Q̂ỹ,y∗, select the n · Q̂ỹ=i,y∗=j examples x∈Xỹ=i with max margin p̂x,ỹ=j − p̂x,ỹ=i (sketched in code after this list).
Method: C+NR. Combine the previous two methods via element-wise and, i.e. set intersection.
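A sketch of PBNR as described above (our illustration, not the cleanlab API; psx, labels, and Q follow the earlier sketches):

```python
import numpy as np

def prune_by_noise_rate(psx, labels, Q):
    """Sketch of CL: PBNR. For each off-diagonal (i, j) of the estimated joint
    Q-hat, flag the n * Q-hat[i][j] examples labeled i with the largest
    margin p-hat_{x, y~=j} - p-hat_{x, y~=i}."""
    n, m = psx.shape
    issues = np.zeros(n, dtype=bool)
    for i in range(m):
        for j in range(m):
            if i == j:
                continue
            num = int(round(Q[i, j] * n))  # estimated number of errors in bin (i, j)
            if num == 0:
                continue
            idx = np.flatnonzero(labels == i)
            margin = psx[idx, j] - psx[idx, i]
            issues[idx[np.argsort(-margin)[:num]]] = True
    return issues  # boolean mask over the n examples
```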
Which CL method to use? CL requires no hyper-parameters, but five methods are presented to
clean data. By default, we use CL: PBNR because it most closely matches the conditions of
Thm. 2 by pruning for each off-diagonal in Q̂ỹ,y∗ . This choice is justified experimentally in
Table 2. Once label errors are found, we observe ordering label errors by the normalized margin:
p̂(ỹ=i; x, θ) − maxj≠i p̂(ỹ=j; x, θ) (Wei et al., 2018) works well. When training with label errors removed, we re-weight the loss by 1/p̂(ỹ=i|y∗=i) = Q̂y∗[i] / Q̂ỹ,y∗[i][i] for each class i∈M.
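These class weights fall directly out of the estimated joint; a minimal sketch (assuming Q from Eqn. 3 with a nonzero diagonal):

```python
import numpy as np

def class_weights(Q):
    """Sketch of the re-weighting in Sec. 3.2: weight the loss for class i by
    1 / p-hat(y~=i | y*=i) = Q-hat_{y*}[i] / Q-hat_{y~,y*}[i][i]."""
    latent_prior = Q.sum(axis=0)   # Q-hat_{y*}: column marginal of the joint
    return latent_prior / np.diag(Q)
```

The resulting weights can be passed to any standard weighted loss, e.g. the weight argument of PyTorch's CrossEntropyLoss.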

4 Theory

In this section, we examine sufficient conditions when (1) the confident joint exactly finds label errors
and (2) Q̂ỹ,y∗ is a consistent estimator for Qỹ,y∗ . We first analyze CL for noiseless p̂x,ỹ=j , then
evaluate more realistic conditions, culminating in Thm. 2 where we prove (1) and (2) with noise in
predicted probabilities for every example. Proofs are in the Appendix.
In Lemma 1 and Thm. 2, we assume |X| → ∞; however, these results apply to finite-sized X
omitting the precision error of estimating a real-valued Qỹ,y∗ from discrete count-based Cỹ,y∗ .
Throughout, we assume X is error-free and includes an example from every class.

4.1 Noiseless Predicted Probabilities

We start with the ideal condition and a non-obvious lemma that yields a closed-form expression for
tj when p̂x,ỹ=j is ideal. Without some condition on p̂x,ỹ=j , one cannot disambiguate label noise
from model noise.
Condition (Ideal). The predicted probabilities p̂(ỹ; x, θ) for a model θ are ideal if ∀x∈Xy∗=j, ∀i, j∈M,
p̂(ỹ =i; x ∈ Xy∗ =j , θ)=p∗ (ỹ =i|y ∗ =y ∗ (x))=p∗ (ỹ =i|y ∗ =j), where the last equality follows from the
CNP assumption. The ideal condition implies error-free predicted probabilities: they match the noise
rates of the y ∗ label corresponding to x. We use p∗x,ỹ=i as shorthand.
Lemma 1 (Ideal Thresholds). For a dataset X := (x, ỹ)ⁿ ∈ (Rᵈ, Z>0)ⁿ and model θ, if p̂(ỹ; x, θ) is ideal, then ∀i∈M, ti = Σj∈M p(ỹ=i|y∗=j) p(y∗=j|ỹ=i).

This form of the threshold is intuitively reasonable: the contributions to the sum when i = j represent the probabilities of correct labeling, whereas when i ≠ j, the terms give the probabilities of mislabeling p(ỹ=i|y∗=j), weighted by the probability p(y∗=j|ỹ=i) that the mislabeling is corrected. Using Lemma 1 under the ideal condition, we prove in Thm. 1 that confident learning exactly finds label errors and that Q̂ỹ,y∗ is a consistent estimator for Qỹ,y∗ when each diagonal entry of Qỹ|y∗ maximizes its row and column. The proof hinges on the fact that the construction of Cỹ,y∗ eliminates collisions.
Theorem 1 (Exact Label Errors). For a dataset X := (x, ỹ)ⁿ ∈ (Rᵈ, Z>0)ⁿ and model θ : x→p̂(ỹ), if p̂(ỹ; x, θ) is ideal and each diagonal entry of Qỹ|y∗ maximizes its row and column, then X̂ỹ=i,y∗=j = Xỹ=i,y∗=j and, as n → ∞, Q̂ỹ,y∗ = Qỹ,y∗ (consistent estimator for Qỹ,y∗).

While Thm. 1 is a reasonable sanity check, observe that ŷ∗ ← arg maxj∈M p̂(ỹ=j; x, θ), used by Cconfusion, trivially satisfies Thm. 1 under the assumption that the diagonal of Qỹ|y∗ maximizes its row and column. We next consider conditions motivated by real-world settings where this is no longer the case.

4.2 Noisy Predicted Probabilities

Motivated by the importance of addressing class imbalance, we consider linear combinations of noise
per-class.
Condition (Per-Class Diffracted). p̂x,ỹ=i is per-class diffracted if there exist linear combinations of class-conditional error in the predicted probabilities s.t. p̂x,ỹ=i = ε⁽¹⁾ᵢ p∗x,ỹ=i + ε⁽²⁾ᵢ, where ε⁽¹⁾ⱼ, ε⁽²⁾ⱼ ∈ ℝ and εⱼ can be any distribution. This relaxes the ideal condition with noise relevant for neural networks, known to be class-conditionally overly confident (Guo et al., 2017).
Corollary 1.1 (Per-Class Robustness). For a dataset X := (x, ỹ)ⁿ ∈ (Rᵈ, Z>0)ⁿ and model θ : x→p̂(ỹ), if p̂x,ỹ=i is per-class diffracted without label collisions and each diagonal entry of Qỹ|y∗ maximizes its row, then X̂ỹ=i,y∗=j = Xỹ=i,y∗=j and, as n → ∞, Q̂ỹ,y∗ = Qỹ,y∗.

Cor. 1.1 shows us that Cỹ,y∗ in confident learning is robust to any linear combination of per-class error in probabilities. Observe that Cconfusion does not satisfy Cor. 1.1 because the theorem no longer requires that the diagonal of Qỹ|y∗ maximize its column. Intuitively, Cconfusion cannot satisfy Cor. 1.1 because it assumes similar distributions of probabilities for each class by not using thresholds, whereas Cor. 1.1 shows Cỹ,y∗ is robust to distributional shift and class-imbalance.
Cor. 1.1 only allows for m alterations in the probabilities and there are only m2 unique probabilities
under the ideal condition, whereas in real-world conditions, an error-prone model could potentially
output nm unique probabilities. Next, in Thm. 2, we examine a reasonable sufficient condition where
CL is robust to erroneous probabilities for every example and class.
Condition (Per-Example Diffracted). p̂x,ỹ=i is per-example diffracted if ∀j∈M, ∀x∈X, we have error as p̂x,ỹ=j = p∗x,ỹ=j + εx,ỹ=j, where εⱼ = Ex∈X εx,ỹ=j and

$$ \epsilon_{x,\tilde{y}=j} \sim \begin{cases} \mathcal{U}\big(\epsilon_j + t_j - p^*_{x,\tilde{y}=j},\;\; \epsilon_j - t_j + p^*_{x,\tilde{y}=j}\big] & p^*_{x,\tilde{y}=j} \ge t_j \\[4pt] \mathcal{U}\big[\epsilon_j - t_j + p^*_{x,\tilde{y}=j},\;\; \epsilon_j + t_j - p^*_{x,\tilde{y}=j}\big) & p^*_{x,\tilde{y}=j} < t_j \end{cases} \tag{4} $$

where U denotes a uniform distribution (a more general case is discussed in the Appendix).
Theorem 2 (General Per-Example Robustness). For a dataset X := (x, ỹ)ⁿ ∈ (Rᵈ, Z>0)ⁿ and model θ : x→p̂(ỹ), if p̂x,ỹ=i is per-example diffracted without label collisions and each diagonal entry of Qỹ|y∗ maximizes its row, then X̂ỹ=i,y∗=j = Xỹ=i,y∗=j and, as n → ∞, Q̂ỹ,y∗ = Qỹ,y∗ (consistent estimator for Qỹ,y∗).

In Thm. 2, we observe that if each example’s predicted probability resides within the residual range
of the ideal probability and the threshold, then CL exactly identifies label errors and consistently
estimates Qỹ,y∗ . Intuitively, if p̂x,ỹ=j ≥ tj whenever p∗x,ỹ=j ≥ tj and p̂x,ỹ=j < tj whenever p∗x,ỹ=j <
tj, then regardless of error in p̂x,ỹ=j, CL exactly finds label errors. As an example, consider an image that is mislabeled as fox but is actually a dog, where t_fox = 0.6, p∗(ỹ=fox; x ∈ Xy∗=dog, θ) = 0.2, t_dog = 0.8, and p∗(ỹ=dog; x ∈ Xy∗=dog, θ) = 0.9. Then as long as −0.4 ≤ εx,fox < 0.4 and −0.1 < εx,dog ≤ 0.1, CL will guess y∗(x) = dog, not fox, even though ỹ(x) = fox is given.

5 Experiments
This section empirically validates CL on CIFAR (Krizhevsky et al., 2009) and ImageNet (Russakovsky
et al., 2015) benchmarks (see Appendix for MNIST). Sec. 5.1 presents CL’s performance on noisy
examples in CIFAR where true labels are known. Sec. 5.2 shows real-world noise identification using
ImageNet, and the performance gain when training with CL. We compute out-of-sample predicted
probabilities P̂k,i using four-fold cross-validation with ResNet architectures.
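As a sketch of how such held-out probabilities can be obtained (illustrative only: the paper trains ResNets per fold on image data, while this toy example uses scikit-learn's cross_val_predict with synthetic features):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Toy stand-in for (features, noisy labels); the paper uses CIFAR/ImageNet + ResNet.
X, labels = make_classification(n_samples=1000, n_features=20,
                                n_informative=5, n_classes=3, random_state=0)

# Each row of psx comes from a model that never trained on that example,
# avoiding in-sample overconfidence (Guo et al., 2017).
psx = cross_val_predict(LogisticRegression(max_iter=1000), X, labels,
                        cv=4, method="predict_proba")
```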

5.1 Non-uniform Label Noise on CIFAR

We evaluate CL on three criteria: (a) joint estimation (Fig. 2), (b) accuracy finding label errors (Table
2), and (c) accuracy learning with noisy labels (Table 1).
Following prior work Sukhbaatar et al. (2014); Goldberger and Ben-Reuven (2017), we study CL
performance on non-uniformly random label noise for its resemblance to real-world noise. We

[Figure 2 heatmaps: three 10×10 matrices over the CIFAR-10 classes (plane, car, bird, cat, deer, dog, frog, horse, ship, truck), with the noisy label ỹ on the rows and the latent, true label y∗ on the columns; cell values are joint probabilities in units of 10⁻².]
(a) True joint (unknown to CL) Qỹ,y∗ (b) CL estimated joint Q̂ỹ,y∗ (c) |Qỹ,y∗ − Q̂ỹ,y∗ |
Figure 2: Our estimation of the joint distribution of label noise for CIFAR with 40% label noise and
60% sparsity. Observe the similarity of values between (a) and (b) and the low error of the absolute
difference of every entry in the matrix in (c). Probabilities are scaled up by 100.

Table 1: Comparison of confident learning versus prior art for multiclass learning with noisy labels in
CIFAR-10. CL: OPT is the max of (CL: PBC, CL: PBNR, CL: C+NR). See Appendix Table 5 for
individual scores.
N OISE 0.2 0.4 0.7
S PARSITY 0 0.2 0.4 0.6 0 0.2 0.4 0.6 0 0.2 0.4 0.6
CL: CCONFUSION 0.854 0.854 0.863 0.857 0.806 0.796 0.802 0.798 0.332 0.363 0.328 0.291
CL: Cỹ,y∗ 0.848 0.858 0.862 0.861 0.815 0.810 0.816 0.815 0.340 0.398 0.282 0.372
CL: OPT 0.860 0.859 0.865 0.862 0.810 0.801 0.814 0.825 0.468 0.420 0.399 0.371
M ENTORNET 0.849 0.851 0.832 0.834 0.644 0.642 0.624 0.615 0.300 0.316 0.293 0.279
S-M ODEL 0.800 0.800 0.797 0.791 0.586 0.612 0.591 0.575 0.284 0.285 0.279 0.273
R EED 0.781 0.789 0.808 0.793 0.605 0.604 0.612 0.586 0.290 0.294 0.291 0.268
BASELINE 0.784 0.792 0.790 0.782 0.602 0.608 0.596 0.573 0.270 0.297 0.282 0.268

generate noisy data from clean data by stochastically switching some labels of training examples to
different classes non-uniformly according to a randomly generated Qỹ|y∗ noise transition matrix. We
generate Qỹ|y∗ matrices with different traces to run experiments for different noise levels. The noise
matrices we used in our experiments are provided in Fig. 7 in the Appendix.
We generate noise in the CIFAR training dataset for varying amounts of noise (the fraction of incorrect labels) and sparsity (the fraction of off-diagonals in Qỹ,y∗ that are zero). Sparsity quantifies the magnitude of non-uniformity of the label noise. All models are evaluated on the unaltered test set.
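A minimal sketch of this noise-generation protocol (our illustration; the exact Qỹ|y∗ matrices used in the paper are shown in Fig. 7 of the Appendix):

```python
import numpy as np

def flip_labels(true_labels, noise_transition, seed=0):
    """Flip each true label y* = j to a noisy label y~ drawn from column j of the
    noise transition matrix Q_{y~|y*} (each column sums to 1). The trace of the
    matrix sets the overall noise level; zeroed off-diagonals set the sparsity."""
    rng = np.random.default_rng(seed)
    m = noise_transition.shape[0]
    return np.array([rng.choice(m, p=noise_transition[:, j]) for j in true_labels])
```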
In Table 1, we compare CL performance versus three models commonly used for benchmark
comparison on CIFAR including the S-Model (Goldberger and Ben-Reuven, 2017) which uses
an extra softmax layer to model noise during training, Reed (Reed et al., 2015) which uses loss-
reweighting, and MentorNet (Jiang et al., 2018) which uses curriculum learning to avoid noisy data in
early stages of training; and a Baseline model that denotes a vanilla training with the noisy labels. All
models, including ours, are trained using a ResNet50 architecture under identical settings: learning
rate 0.1 for epoch [0,150), 0.01 for epoch [150,250), 0.001 for epoch [250,350); momentum 0.9; and
weight decay 0.0001. In this table, we show the max of the three prune methods (denoted CL: OPT)
to make it easier to compare CL with the other methods: see Table 5 in the Appendix for benchmarks
of all CL methods.
Table 1 lists the test accuracy for learning with noisy labels across different noise fractions and
sparsities, where the first three rows report our CL approaches. As shown for 40% label noise, CL
yields a 34% improvement over other baselines including the competitive MentorNet baseline. We
observe significant improvement in high-noise regimes and moderate improvement in low-noise
regimes, where CL models appear less affected by sparsity at 20% and 40% noise levels. The simplest
CL method, CL: Cconfusion, greatly outperforms prior art, and the best performance is achieved by CL: OPT or Cỹ,y∗ across all noise and sparsity settings. The results validate the benefit of directly
modeling the joint noise distribution.
To understand why CL performs well, we evaluate CL joint estimation across noise and sparsity
with RMSE in Table 6 in the Appendix and estimated Q̂ỹ,y∗ in Fig. 5 in the Appendix. For the 20%

Table 2: Accuracy, F1, precision, and recall measures for finding label errors in CIFAR-10.
M EASURE ACCURACY F1 P RECISION R ECALL
N OISE 0.2 0.4 0.2 0.4 0.2 0.4 0.2 0.4
S PARSITY 0.0 0.6 0.0 0.6 0.0 0.6 0.0 0.6 0.0 0.6 0.0 0.6 0.0 0.6 0.0 0.6
CL: CCONFUSION 0.84 0.85 0.85 0.81 0.71 0.72 0.84 0.79 0.56 0.58 0.74 0.70 0.98 0.97 0.97 0.90
CL: Cỹ,y∗ 0.89 0.90 0.86 0.84 0.75 0.78 0.84 0.80 0.67 0.70 0.78 0.77 0.86 0.88 0.91 0.84
CL: PBC 0.88 0.88 0.86 0.82 0.76 0.76 0.84 0.79 0.64 0.65 0.76 0.74 0.96 0.93 0.94 0.85
CL: PBNR 0.89 0.90 0.88 0.84 0.77 0.79 0.85 0.80 0.65 0.68 0.82 0.79 0.93 0.94 0.88 0.82
CL: C+NR 0.90 0.90 0.87 0.83 0.78 0.78 0.84 0.78 0.67 0.69 0.82 0.79 0.93 0.90 0.87 0.78
Table 3: The ten largest non-diagonal entries in the confident joint Cỹ,y∗ of the ImageNet train set.

ỹ name y ∗ name ỹ nid y ∗ nid Cconfusion Cỹ,y∗ Q̂ỹ,y∗


1 projectile missile n04008634 n03773504 494 645 0.000503
2 tub bathtub n04493381 n02808440 400 539 0.000421
3 breastplate cuirass n02895154 n03146219 398 476 0.000372
4 green_lizard chameleon n01693334 n01682714 369 437 0.000341
5 chameleon green_lizard n01682714 n01693334 362 435 0.000340
6 missile projectile n03773504 n04008634 362 433 0.000338
7 maillot maillot n03710637 n03710721 338 417 0.000325
8 horned_viper sidewinder n01753488 n01756291 336 416 0.000325
9 corn ear n12144580 n13133613 333 410 0.000320
10 keyboard space_bar n04505470 n04264628 293 407 0.000318

and 40% noise settings, on average, CL achieves an RMSE of .004 relative to the true joint Qỹ,y∗
across all sparsities. The simplest CL variant, Cconfusion normalized via Eqn. (3) to obtain Q̂argmax ,
achieves a slightly worse RMSE of .006. In Fig. 2, we visualize the quality of CL joint estimation in
a challenging high-noise (40%), high-sparsity (60%) regime on CIFAR. Sub-figure 2a demonstrates
high sparsity in the latent true joint Qỹ,y∗ , with over half the noise in just six noise rates. Yet, as
can be seen in sub-figures (b) and (c), CL still estimates over 80% of the entries of Qỹ,y∗ within an
absolute difference of .005. The results empirically substantiate the theoretical bounds of Section 4.
We also evaluate CL’s accuracy in finding label errors. In Table 2, we compare five variants of CL
methods across noise and sparsity and report their precision, recall, and F1 in recovering the true
label. The results show that CL is able to find the label errors with high recall and reasonable F1.
There is a slight improvement using remove by rank methods (bottom three).

5.2 Real-world Noise with ImageNet

Russakovsky et al. (2015) suggest label errors exist in ImageNet due to human error, but to our knowledge, no attempt has been made to find label errors in the ILSVRC 2012 training set, characterize them, and re-train without them. Here, we consider each application (see Appendix Sec. E for
MNIST analogues). We use ResNet18 and ResNet50 architectures with standard settings: 0.1 initial
learning rate, 90 training epochs with 0.9 momentum.
Ontological discovery via label noise characterization. Because ImageNet is a single-class
dataset, classes are required to be mutually exclusive. We observe auto-discovery of ontological
issues in datasets in Table 3 by listing the 10 largest non-diagonal entries in Cỹ,y∗. For example, the class maillot appears twice; is-a relationships appear, like bathtub is a tub; misnomers appear, like projectile and missile; and words with multiple definitions, like corn and ear, cause unanticipated issues. We include Cconfusion to show that while it counts fewer examples, it still ranks similarly.
Finding label issues. Fig. 3 depicts the top 16 label issues found using CL: PBNR with ResNet50
ordered by the normalized margin. We use the term issue versus error because examples found by
CL consist of a mixture of multi-label images, ontological issues, and actual label errors. Examples
of each are indicated by colored borders in the figure. To evaluate CL in the absence of true labels,
we conducted a small-scale human validation on a random, unordered sample of 500 CL: PBNR
errors and found 58% were either multi-labeled, ontological issues, or errors. ImageNet data are
often presumed clean yet ours is the first attempt to identify label errors in ImageNet training images.
Training ResNet on ImageNet with label issues removed. To understand the performance
differences, we train both ResNet50 (Fig. 4b) and ResNet18 (Fig. 4a) by progressively removing

Figure 3: Top 32 identified label issues in the 2012 ILSVRC ImageNet train set using CL: PBNR.
Errors are boxed in red. Ontological issues are boxed in green. Multi-label images are boxed in blue.

[Figure 4 plots: ILSVRC validation accuracy (y-axis; roughly 68-69% for ResNet18, 72-74% for ResNet50) versus the number of examples removed before training (x-axis, 0K to 200K), with one line per method: CL: Cconfusion, CL: Cỹ,y∗, CL: opt, random remove, and no removal.
(a) ResNet18 Validation Accuracy (b) ResNet50 Validation Accuracy]
Figure 4: Increased ResNet validation accuracy using CL methods on ImageNet with original labels
(no synthetic noise added). Each point on the line for each method, from left to right, depicts the
accuracy of training with 20%, 40%..., 100% of estimated label errors removed. Error bars are
computed with Clopper-Pearson 95% confidence intervals. The red dash-dotted baseline captures
when examples are removed uniformly randomly. The black dotted line depicts accuracy when
training with all examples.

identified noisy examples in the ImageNet training set. Fig. 4 shows the top-1 accuracy on the
ILSVRC validation set when removing label errors estimated by CL methods versus removing
random examples. We do not compare with other baselines because they may not identify label errors.
For each CL method, we plot the accuracy of training with 20%, 40%,..., 100% of estimated label
errors removed, omitting points beyond 200k.
We find CL methods may even improve the standard ImageNet training on clean training data by
filtering out a subset of training examples. The result is significant as ImageNet training images
are often assumed to have correct labels. These results suggest CL is able to identify the label
noise in the real-world dataset and improve the training over unnoticed label errors in the ImageNet
train set. Once more than 100K examples are removed, CL may no longer improve over standard training. However, CL methods still significantly outperform the random-removal baseline. We
provide additional comparison of CL: PBNR versus random pruning in the Appendix in Figures 9
and 8.
Computation time. Finding label errors in ImageNet takes 3 minutes on an i7 CPU. Figures are
seeded and reproducible via the open-source cleanlab package.

6 Related work

Common approaches to learning with noisy labels include iterative co-training (Blum and Mitchell,
1998), modifying the loss (Patrini et al., 2016, 2017b; Sukhbaatar et al., 2014), imputation (Li et al.,
2017; Amjad et al., 2017), crowd-based approaches (Zhang et al., 2017b; Dawid and Skene, 1979;
Ratner et al., 2016), error removal (Wang et al., 2018; Northcutt et al., 2017), or fixing labels (Han
et al., 2019). Works in these areas introduced a number of grand insights used in CL, discussed in the
next subsection.
Confident learning. Pioneering work by Forman (2008, 2005) introduced counting approaches
to estimate false positive and false negative rates for binary classification. CL uses counting for
robustness to error in predicted probabilities in the multi-class setting. Elkan and Noto (2008)
improved on Forman’s approach using a threshold for robustness, but required uncorrupted positive
labels. CL generalizes the use of thresholds for noise in every class. A number of formative works
(Natarajan et al., 2013; van Rooyen et al., 2015; Katz-Samuels et al., 2017) use loss re-weighting to prove guarantees for empirical risk minimization, motivating CL's loss re-weighting when learning on cleaned datasets. Among recent contributions, Lipton et al. (2018) extended the confusion matrix approach to
multi-class but only for a particular label shift without finding label errors. Han et al. (2019) proposed
a deep self-supervised learning approach to avoid probabilities by using embedding layers of a neural
network. Like CL, these approaches require probabilities obtained out-of-sample. This is important
in deep learning, where outputs have been shown to be overconfident on training examples (Guo et al., 2017). Different from CL, these approaches are either limited to a specific class of models, iterative
(slow), or have limited theoretical justification.
Label noise estimation. Prior work in learning with noisy labels is often restricted to binary
classification. For example, Scott et al. (2013); Scott (2015) developed a theoretical and practical
convergence criterion in the binary setting. With the simplifying assumption that all positive labels
are error-free, Elkan and Noto (2008) introduced a formative time-efficient probabilistic approach
that directly estimates the probability of label flipping using a holdout set. Northcutt et al. (2017)
introduced learning with confident examples for binary classification by a calibration technique.
Assuming the noise rates are given (which in practice is rarely true), a variety of algorithms (Natarajan
et al., 2013; Liu and Tao, 2016; Sugiyama et al., 2012) achieved equivalent expected risk as learning
with binary uncorrupted labels. In contrast to these studies, CL can estimate the label noise for
multi-class classification.
In the multiclass setting, prior work generally falls into five categories: (1) theoretical contributions
(Katz-Samuels et al., 2017; Blanchard and Scott, 2014), (2) modifying the loss for label noise
robustness (Patrini et al., 2016, 2017b; Sukhbaatar et al., 2014; van Rooyen et al., 2015), (3) deep
learning and model-specific approaches (Sukhbaatar et al., 2014; Patrini et al., 2016; Jindal et al.,
2016), (4) improving crowd-sourced labels by multiple workers (Zhang et al., 2017b; Dawid and
Skene, 1979; Ratner et al., 2016), (5) factorization methods for distillation (Li et al., 2017) similar to
using SVD for imputation (Amjad et al., 2017), among other methods (Bootkrajang and Kab, 2011;
Sáez et al., 2014). Different from these approaches, CL directly estimates the joint distribution of
multi-class label noise, supported by theoretical justification.
Noise-robust learning. Extensive studies have investigated training models on noisy datasets, e.g.
(Beigman and Klebanov, 2009; Natarajan et al., 2013; Brodley and Friedl, 1999) as well as some
noise-estimation approaches discussed above. Noise-robust learning is important for deep learning
as modern neural networks trained on noisy labels generalize poorly on clean data (Zhang et al.,
2017a). Recent studies mainly dealt with uniform label noise in which the label is uniformly changed
to another class with a probability, e.g. (Goldberger and Ben-Reuven, 2017; Arazo et al., 2019).
To approximate real-world noise, an increasing number of studies examined non-uniform noise
using multiple approaches such as loss or label correction (Patrini et al., 2017a; Reed et al., 2015;
Goldberger and Ben-Reuven, 2017), example weighting (Jiang et al., 2018; Shu et al., 2019), co-
teaching (Han et al., 2018), semi-supervised learning (Hendrycks et al., 2018; Li et al., 2017; Vahdat,
2017), among others.

7 Conclusion
These findings emphasize the practical nature of confident learning, identifying numerous label issues
in ImageNet and CIFAR, and improving standard ResNet performance by training on a cleaned
dataset. Confident learning motivates the need for further understanding of dataset uncertainty
estimation, methods to clean training and test sets, and approaches to identify ontological and label
issues in datasets.

Acknowledgements
We thank the following colleagues: Jonas Mueller assisted with notation. Anish Athalye suggested
starting the proof in claim 1 of Theorem 1 with the identity. Tailin Wu contributed to Lemma 1.
Niranjan Subrahmanya provided feedback on baselines for confident learning.

References
Amjad, M. J., Shah, D., and Shen, D. (2017). Robust synthetic control.
Angluin, D. and Laird, P. (1988). Learning from noisy examples. Machine Learning, 2(4):343–370.
Arazo, E., Ortego, D., Albert, P., O’Connor, N. E., and McGuinness, K. (2019). Unsupervised label
noise modeling and loss correction. In ICML.
Beigman, E. and Klebanov, B. B. (2009). Learning with annotation noise. In ACL.
Blanchard, G. and Scott, C. (2014). Decontamination of mutually contaminated models. In AISTATS,
volume 33 of JMLR Workshop and Conference Proceedings, pages 1–9. JMLR.org.
Blum, A. and Mitchell, T. (1998). Combining labeled and unlabeled data with co-training. In 11th
Conf. on COLT, pages 92–100, New York, NY, USA. ACM.
Bootkrajang, J. and Kabán, A. (2011). Multi-class Classification in the Presence of Labelling Errors.
Computational Intelligence, pages 27–29.
Brodley, C. E. and Friedl, M. A. (1999). Identifying mislabeled training data. Journal of artificial
intelligence research, 11:131–167.
Dawid, A. P. and Skene, A. M. (1979). Maximum likelihood estimation of observer error-rates
using the em algorithm. Journal of the Royal Statistical Society: Series C (Applied Statistics),
28(1):20–28.
Elkan, C. and Noto, K. (2008). Learning classifiers from only positive and unlabeled data. In Proc.
of 14th KDD, pages 213–220, NYC, NY, USA. ACM.
Forman, G. (2005). Counting positives accurately despite inaccurate classification. In European
Conference on Machine Learning, pages 564–575. Springer.
Forman, G. (2008). Quantifying counts and costs via classification. Data Mining and Knowledge
Discovery, 17(2):164–206.
Goldberger, J. and Ben-Reuven, E. (2017). Training deep neural-networks using a noise adaptation
layer. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France,
April 24-26, 2017, Conference Track Proceedings. OpenReview.net.
Graepel, T. and Herbrich, R. (2001). The kernel gibbs sampler. In Leen, T. K., Dietterich, T. G., and
Tresp, V., editors, Advances in Neural Information Processing Systems 13, pages 514–520. MIT
Press.
Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. (2017). On calibration of modern neural networks.
In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages
1321–1330.
Halpern, Y., Horng, S., Choi, Y., and Sontag, D. (2016). Electronic medical record phenotyping
using the anchor and learn framework. Journal of the American Medical Informatics Association,
23(4):731–740.
Han, B., Yao, Q., Yu, X., Niu, G., Xu, M., Hu, W., Tsang, I., and Sugiyama, M. (2018). Co-teaching:
Robust training of deep neural networks with extremely noisy labels. In NeurIPS.
Han, J., Luo, P., and Wang, X. (2019). Deep self-learning from noisy labels.
Hendrycks, D., Mazeika, M., Wilson, D., and Gimpel, K. (2018). Using trusted data to train deep
networks on labels corrupted by severe noise. In NeurIPS.
Jiang, L., Zhou, Z., Leung, T., Li, L.-J., and Fei-Fei, L. (2018). Mentornet: Learning data-driven
curriculum for very deep neural networks on corrupted labels. In ICML.
Jindal, I., Nokleby, M., and Chen, X. (2016). Learning deep networks from noisy labels with dropout
regularization. In 2016 IEEE 16th International Conference on Data Mining (ICDM), pages
967–972.
Katz-Samuels, J., Blanchard, G., and Scott, C. (2017). Decontamination of mutual contamination
models.
Krizhevsky, A., Hinton, G., et al. (2009). Learning multiple layers of features from tiny images.
Technical report, Citeseer.

Lawrence, N. D. and Schölkopf, B. (2001). Estimating a kernel fisher discriminant in the presence
of label noise. In Proceedings of the Eighteenth International Conference on Machine Learning,
ICML ’01, pages 306–313, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.
LeCun, Y. (1998). The mnist database of handwritten digits. http://yann.lecun.com/exdb/mnist/.
Li, Y., Yang, J., Song, Y., Cao, L., Luo, J., and Li, L.-J. (2017). Learning from noisy labels with
distillation. In 2017 IEEE International Conference on Computer Vision (ICCV), volume 00, pages
1928–1936.
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick,
C. L. (2014). Microsoft coco: Common objects in context. In Fleet, D., Pajdla, T., Schiele,
B., and Tuytelaars, T., editors, Computer Vision – ECCV 2014, pages 740–755, Cham. Springer
International Publishing.
Lipton, Z., Wang, Y.-X., and Smola, A. (2018). Detecting and correcting for label shift with black box
predictors. In Dy, J. and Krause, A., editors, Proceedings of the 35th International Conference on
Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 3122–3130,
Stockholmsmässan, Stockholm Sweden. PMLR.
Liu, T. and Tao, D. (2016). Classification with noisy labels by importance reweighting. IEEE Trans.
Pattern Anal. Mach. Intell., 38(3):447–461.
Natarajan, N., Dhillon, I. S., Ravikumar, P., and Tewari, A. (2017). Cost-sensitive learning with noisy
labels. Journal of Machine Learning Research, 18:155–1.
Natarajan, N., Dhillon, I. S., Ravikumar, P. K., and Tewari, A. (2013). Learning with noisy labels. In
Adv. in NIPS 26, pages 1196–1204. Curran Associates, Inc.
Northcutt, C. G., Ho, A. D., and Chuang, I. L. (2016). Detecting and preventing “multiple-account”
cheating in massive open online courses. Computers & Education, 100:71–80.
Northcutt, C. G., Wu, T., and Chuang, I. L. (2017). Learning with confident examples: Rank pruning
for robust classification with noisy labels. In Proceedings of the Thirty-Third Conference on
Uncertainty in Artificial Intelligence, UAI’17.
Page, L., Brin, S., Motwani, R., and Winograd, T. (1997). Pagerank: Bringing order to the web.
Technical report, Stanford Digital Libraries Working Paper.
Patrini, G., Nielsen, F., Nock, R., and Carioni, M. (2016). Loss factorization, weakly supervised
learning and label noise robustness. In ICML, volume 48 of JMLR Workshop and Conference
Proceedings, pages 708–717. JMLR.org.
Patrini, G., Rozza, A., Krishna Menon, A., Nock, R., and Qu, L. (2017a). Making deep neural
networks robust to label noise: A loss correction approach. In CVPR.
Patrini, G., Rozza, A., Menon, A. K., Nock, R., and Qu, L. (2017b). Making deep neural networks
robust to label noise: A loss correction approach. In CVPR, pages 2233–2241. IEEE Computer
Society.
Ratner, A. J., De Sa, C. M., Wu, S., Selsam, D., and Ré, C. (2016). Data programming: Creating
large training sets, quickly. In Lee, D. D., Sugiyama, M., Luxburg, U. V., Guyon, I., and Garnett,
R., editors, Advances in Neural Information Processing Systems 29, pages 3567–3575. Curran
Associates, Inc.
Reed, S. E., Lee, H., Anguelov, D., Szegedy, C., Erhan, D., and Rabinovich, A. (2015). Training
deep neural networks on noisy labels with bootstrapping. In ICLR.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla,
A., Bernstein, M., Berg, A. C., and Fei-Fei, L. (2015). ImageNet Large Scale Visual Recognition
Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252.
Sáez, J. A., Galar, M., Luengo, J., and Herrera, F. (2014). Analyzing the presence of noise in
multi-class problems: alleviating its influence with the one-vs-one decomposition. Knowledge and
Information Systems, 38(1):179–206.
Scott, C. (2015). A rate of convergence for mixture proportion estimation, with application to learning
from noisy labels. JMLR, 38:838–846.
Scott, C., Blanchard, G., and Handy, G. (2013). Classification with asymmetric label noise: Consis-
tency and maximal denoising. In COLT, pages 489–511.

Shu, J., Xie, Q., Yi, L., Zhao, Q., Zhou, S., Xu, Z., and Meng, D. (2019). Meta-weight-net: Learning
an explicit mapping for sample weighting. In NeurIPS.
Sugiyama, M., Suzuki, T., and Kanamori, T. (2012). Density Ratio Estimation in ML. Cambridge
University Press, New York, NY, USA, 1st edition.
Sukhbaatar, S., Bruna, J., Paluri, M., Bourdev, L., and Fergus, R. (2014). Training Convolutional
Networks with Noisy Labels. ICLR, pages 1–11.
Vahdat, A. (2017). Toward robustness against label noise in training deep discriminative neural
networks. In NeurIPS.
van Rooyen, B., Menon, A. K., and Williamson, R. C. (2015). Learning with symmetric label
noise: The importance of being unhinged. In Advances in Neural Information Processing Systems
28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015,
Montreal, Quebec, Canada, pages 10–18.
Wang, Y., Jha, S., and Chaudhuri, K. (2018). Analyzing the robustness of nearest neighbors to
adversarial examples. In Dy, J. and Krause, A., editors, Proceedings of the 35th International
Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages
5133–5142, Stockholmsmässan, Stockholm Sweden. PMLR.
Wei, C., Lee, J. D., Liu, Q., and Ma, T. (2018). On the margin theory of feedforward neural networks.
arXiv preprint arXiv:1810.05369.
Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. (2017a). Understanding deep learning
requires rethinking generalization. In ICLR.
Zhang, J., Sheng, V. S., Li, T., and Wu, X. (2017b). Improving crowdsourced label quality using
noise correction. IEEE transactions on neural networks and learning systems, 29(5):1675–1688.

A Notation

In this section of the Appendix, we summarize the notation used in confident learning in tabular form.

Table 4: Notation used in confident learning.


Notation Definition
M The set of m=|M | unique class labels
m The cardinality of M , |M |, i.e. the number of unique class labels
ỹ Discrete random variable ỹ ∈ Z>0 takes an observed, noisy label
y∗ Discrete random variable y ∗ ∈ Z>0 takes the unknown, true, uncorrupted label
ỹ(x) The observed, noisy label of example x
y∗(x) The unobserved, true label of example x
X The dataset (x, ỹ)ⁿ ∈ (Rᵈ, Z>0)ⁿ of n examples x ∈ Rᵈ with noisy labels
n The cardinality of X := (x, ỹ)ⁿ, i.e. the number of examples in the dataset
θ Model parameters
Xỹ=i Subset of examples in X with noisy label i, i.e. Xỹ=cat is “examples labeled cat”
Xỹ=i,y∗ =j Subset of examples in X with noisy label i and true label j
X̂ỹ=i,y∗ =j Estimate of subset of examples in X with noisy label i and true label j
p(ỹ =i, y ∗ =j) Discrete joint probability of noisy label i and true label j.
p(ỹ =i|y ∗ =j) Discrete conditional probability of true label flipping, called the noise rate
p(y ∗ =j|ỹ =i) Discrete conditional probability of noisy label flipping, called the inverse noise rate
p̂(·) Estimated or predicted probability (may replace p(·) in any context)
Qy ∗ The prior of the latent labels
Q̂y∗ Estimate of the prior of the latent labels
Qỹ,y∗ The m × m joint distribution matrix for p(ỹ, y ∗ )
Q̂ỹ,y∗ Estimate of the m × m joint distribution matrix for p(ỹ, y ∗ )
Qỹ|y∗ The m × m noise transition matrix (noisy channel) of flipping rates for p(ỹ|y ∗ )
Q̂ỹ|y∗ Estimate of the m × m noise transition matrix of flipping rates for p(ỹ|y ∗ )
Qy∗ |ỹ The inverse noise matrix for p(y ∗ |ỹ)
Q̂y∗ |ỹ Estimate of the inverse noise matrix for p(y ∗ |ỹ)
p̂(ỹ = i; x, θ) Predicted probability of label ỹ = i for example x and model parameters θ
p̂x,ỹ=i Shorthand abbreviation for predicted probability p̂(ỹ = i; x, θ)
p̂(ỹ =i; x∈Xỹ=i ) The self-confidence of example x belonging to its given label ỹ =i
P̂k,i n × m matrix of out-of-sample predicted probabilities p̂(ỹ = i; xk, θ)
Cỹ,y∗ The confident joint Cỹ,y∗ ∈ Z≥0^{m×m}, an unnormalized estimate of Qỹ,y∗
Cconfusion Confusion matrix of given labels ỹ(xk) and predictions arg maxi∈M p̂(ỹ=i; xk, θ)
tj The expected (average) self-confidence for class j used as a threshold in Cỹ,y∗
p∗ (ỹ =i|y ∗ =y ∗ (x)) Ideal probability for an example x, equivalent to noise rate p∗ (ỹ =i|y ∗ =j)
p∗x,ỹ=i Shorthand abbreviation for ideal probability p∗ (ỹ =i|y ∗ =y ∗ (x))

B Theorems and proofs for confident learning

In this section, we restate the main theorems for confident learning and provide their proofs.
Lemma 1 (Ideal Thresholds). For a dataset X of (x, ỹ) pairs and model θ, if p̂(ỹ; x, θ) is ideal, then ∀i∈M, ti = Σj∈M p(ỹ=i|y∗=j) p(y∗=j|ỹ=i).

Proof. We use ti to denote the thresholds used to partition X into m bins, each estimating one of
Xy∗ . By definition,

∀i∈M, ti = Ex∈Xỹ=i p̂(ỹ = i; x, θ)

For any ti, we show the following.

$$
\begin{aligned}
t_i &= \mathop{\mathbb{E}}_{x \in X_{\tilde{y}=i}} \sum_{j \in M} \hat{p}(\tilde{y}=i \,|\, y^*=j;\, x, \theta)\, \hat{p}(y^*=j;\, x, \theta) && \text{(Bayes rule)} \\
t_i &= \mathop{\mathbb{E}}_{x \in X_{\tilde{y}=i}} \sum_{j \in M} \hat{p}(\tilde{y}=i \,|\, y^*=j)\, \hat{p}(y^*=j;\, x, \theta) && \text{(CNP assumption)} \\
t_i &= \sum_{j \in M} \hat{p}(\tilde{y}=i \,|\, y^*=j) \mathop{\mathbb{E}}_{x \in X_{\tilde{y}=i}} \hat{p}(y^*=j;\, x, \theta) \\
t_i &= \sum_{j \in M} p(\tilde{y}=i \,|\, y^*=j)\, p(y^*=j \,|\, \tilde{y}=i) && \text{(ideal condition)}
\end{aligned}
$$

This form of the threshold is intuitively reasonable: the contributions to the sum when i = j represent the probabilities of correct labeling, whereas when i ≠ j, the terms give the probabilities of mislabeling p(ỹ=i|y∗=j), weighted by the probability p(y∗=j|ỹ=i) that the mislabeling is corrected.

Theorem 1 (Exact Label Errors). For dataset X of (x, ỹ) pairs and model θ :x→p̂(ỹ), if p̂(ỹ; x, θ) is
ideal and each diagonal entry of Qỹ|y∗ maximizes its row and column, then X̂ỹ=i,y∗ =j = Xỹ=i,y∗ =j
and as n → ∞, Q̂ỹ,y∗ = Qỹ,y∗ .

Proof. Alg. 1 defines the construction of the confident joint. We consider case 1: when there are
collisions (trivial by construction of Alg. 1) and case 2: when there are no collisions (harder).

Case 1 (collisions):
When a collision occurs, by construction of the confident joint (Eqn. 2), x gets assigned bijectively into bin

x ∈ X̂ỹ,y∗[ỹ(x)][arg maxi∈M p̂(ỹ = i; x, θ)]

Because we have that p̂(ỹ; x, θ) is ideal, we can rewrite this as

x ∈ X̂ỹ,y∗[ỹ(x)][arg maxi∈M p̂(ỹ = i | y∗=y∗(x); x)]

And because by assumption each diagonal entry in Qỹ|y∗ maximizes its column, we have

x ∈ X̂ỹ,y∗[ỹ(x)][y∗(x)]

So any example x ∈ Xỹ=i,y∗=j having a collision will be exactly assigned to X̂ỹ=i,y∗=j.

Case 2 (no collisions):

We want to show that ∀i∈M, j ∈M, X̂ỹ=i,y∗ =j = Xỹ=i,y∗ =j


We can partition Xỹ=i as
Xỹ=i = Xỹ=i,y∗=j ∪ Xỹ=i,y∗≠j

We prove ∀i∈M, j ∈M, X̂ỹ=i,y∗ =j = Xỹ=i,y∗ =j by proving two claims:


Claim 1: Xỹ=i,y∗ =j ⊆ X̂ỹ=i,y∗ =j
Claim 2: Xỹ=i,y∗≠j ⊄ X̂ỹ=i,y∗=j
We don't need to show Xỹ≠i,y∗=j ⊄ X̂ỹ=i,y∗=j and Xỹ≠i,y∗≠j ⊄ X̂ỹ=i,y∗=j because the noisy labels ỹ are given, thus the confident joint (Eqn. 2) will never place them in the wrong bin of X̂ỹ=i,y∗=j. Thus, claim 1 and claim 2 are sufficient to show that X̂ỹ=i,y∗=j = Xỹ=i,y∗=j.

Proof (Claim 1) of Case 2: Inspecting Eqn (2) and Alg (1), by the construction of Cỹ,y∗ , we have
that ∀x ∈ Xỹ=i , p̂(ỹ = j|y ∗ =j; x, θ) ≥ tj −→ Xỹ=i,y∗ =j ⊆ X̂ỹ=i,y∗ =j . In other words, when
the left hand side is true, all examples with noisy label i and hidden, true label j are counted in
X̂ỹ=i,y∗ =j .
Thus, it is sufficient to prove

∀x ∈ Xỹ=i , p̂(ỹ = j|y ∗ =j; x, θ) ≥ tj (5)


Because predicted probabilities satisfy the ideal condition, p̂(ỹ = j|y ∗ =j, x) = p(ỹ = j|y ∗ =j), ∀x ∈
Xỹ=i . Note the change from predicted probability, p̂, to an exact probability, p. Thus by the ideal
condition, the inequality in (5) can be written as p(ỹ = j|y ∗ =j) ≥ tj , which we prove below:

$$
\begin{aligned}
p(\tilde{y}=j \,|\, y^*=j) &\ge p(\tilde{y}=j \,|\, y^*=j) \cdot 1 && \text{(identity)} \\
&\ge p(\tilde{y}=j \,|\, y^*=j) \cdot \sum_{i \in M} p(y^*=i \,|\, \tilde{y}=j) \\
&\ge \sum_{i \in M} p(\tilde{y}=j \,|\, y^*=j) \cdot p(y^*=i \,|\, \tilde{y}=j) && \text{(move product into sum)} \\
&\ge \sum_{i \in M} p(\tilde{y}=j \,|\, y^*=i) \cdot p(y^*=i \,|\, \tilde{y}=j) && \text{(diagonal entry maximizes row)} \\
&\ge t_j && \text{(Lemma 1, ideal condition)}
\end{aligned}
$$

Proof (Claim 2) of Case 2: We prove Xỹ=i,y∗≠j ⊄ X̂ỹ=i,y∗=j by contradiction. Assume there exists some example x ∈ Xỹ=i,y∗=z for z ≠ j such that x ∈ X̂ỹ=i,y∗=j. By claim 1, we have that Xỹ=i,y∗=z ⊆ X̂ỹ=i,y∗=z, therefore, x ∈ X̂ỹ=i,y∗=z.
So, x ∈ X̂ỹ=i,y∗=j and also x ∈ X̂ỹ=i,y∗=z.
But this is a collision, and when a collision occurs, the confident joint breaks the tie with arg max. Because each diagonal entry of Qỹ|y∗ maximizes its row and column, the tie always assigns x ∈ X̂ỹ,y∗[ỹ(x)][y∗(x)] (the assignment from Claim 1).
This theorem also states that as n → ∞, Q̂ỹ,y∗ = Qỹ,y∗. This follows directly from the fact that ∀i∈M, j∈M, X̂ỹ=i,y∗=j = Xỹ=i,y∗=j, i.e. the confident joint exactly counts the partitions Xỹ=i,y∗=j for all pairs (i, j) ∈ M × M, thus Cỹ,y∗ = nQỹ,y∗ and Q̂ỹ,y∗ = Qỹ,y∗. The confident joint is thus a consistent estimator for Qỹ,y∗: the equivalence holds exactly in the limit of infinite examples, and for finite examples only up to discretization rounding errors.

Corollary 1.0 (Consistent Estimation). For (x, ỹ)ⁿ ∈ (Rᵈ, Z>0)ⁿ and θ : x→p̂(ỹ), if p̂(ỹ; x, θ) is ideal and each diagonal entry of Qỹ|y∗ maximizes its row and column, and if X̂ỹ=i,y∗=j = Xỹ=i,y∗=j, then as n → ∞, Q̂ỹ,y∗ = Qỹ,y∗.

Proof. The result follows directly from Thm. 1. Because the confident joint exactly counts the
partitions Xỹ=i,y∗ =j for all pairs (i, j) ∈ M ×M by Thm. 1, Cỹ,y∗ = nQỹ,y∗ , omitting discretization
rounding errors. We name this corollary consistent estimation instead of exact estimation because the equivalence holds exactly only in the limit of infinite examples, due to discretization rounding errors.

In the main text, Theorem 1 includes Corollary 1.0 for brevity. We have separated out Corollary 1.0
here to make apparent that the primary contribution of Thm. 1 is to prove X̂ỹ=i,y∗ =j = Xỹ=i,y∗ =j ,
from which the result of Corollary 1.0, namely that as n → ∞, Q̂ỹ,y∗ =Qỹ,y∗ , naturally follows.
Corollary 1.1 (Per-Class Robustness). For a dataset X := (x, ỹ)ⁿ ∈ (Rᵈ, Z>0)ⁿ and model θ : x→p̂(ỹ), if p̂x,ỹ=i is per-class diffracted without label collisions and each diagonal entry of Qỹ|y∗ maximizes its row, then X̂ỹ=i,y∗=j = Xỹ=i,y∗=j and as n → ∞, Q̂ỹ,y∗ = Qỹ,y∗.

17
Proof. Re-stating the meaning of per-class diffracted, we wish to show that if p̂(ỹ; x, θ) is diffracted with class-conditional noise s.t. ∀j∈M, p̂(ỹ = j; x, θ) = ε_j^(1) · p∗(ỹ = j|y∗=y∗(x)) + ε_j^(2), where ε_j^(1) ∈ R, ε_j^(2) ∈ R (for any distribution), without label collisions, and each diagonal entry of Qỹ|y∗ maximizes its row, then X̂ỹ=i,y∗=j = Xỹ=i,y∗=j and Q̂ỹ,y∗ = Qỹ,y∗.
Firstly, note that linearly combining real-valued ε_j^(1) and ε_j^(2) with the probabilities of class j may result in some examples having p̂x,ỹ=j = ε_j^(1) p∗x,ỹ=j + ε_j^(2) > 1 or p̂x,ỹ=j = ε_j^(1) p∗x,ỹ=j + ε_j^(2) < 0. The proof makes no assumption about the validity of the model outputs and therefore holds when this occurs. Furthermore, confident learning does not require valid probabilities when finding label errors because it relies on the rank principle, not on the probabilities themselves.
When there are no label collisions, the bins created by the confident joint are:

X̂ỹ=i,y∗=j := {x ∈ Xỹ=i : p̂(ỹ = j; x, θ) ≥ tj}        (6)

where

tj = E_{x∈Xỹ=j} p̂x,ỹ=j

WLOG, we re-formulate the error ε_j^(1) p∗x,ỹ=j + ε_j^(2) as ε_j^(1)(p∗x,ỹ=j + ε_j^(2)), re-scaling ε_j^(2) accordingly.
Now, for diffracted (non-ideal) probabilities, we re-write how the threshold tj changes for a given ε_j^(1), ε_j^(2):

t_j^{ε_j} = E_{x∈Xỹ=j} ε_j^(1)(p∗x,ỹ=j + ε_j^(2))
          = ε_j^(1) E_{x∈Xỹ=j} p∗x,ỹ=j + ε_j^(1) ε_j^(2) · E_{x∈Xỹ=j} 1
          = ε_j^(1) t∗j + ε_j^(1) ε_j^(2)
          = ε_j^(1) (t∗j + ε_j^(2))

Thus, for per-class diffracted (non-ideal) probabilities, Eqn. (6) becomes

X̂^{ε_j}_{ỹ=i,y∗=j} = {x ∈ Xỹ=i : ε_j^(1)(p∗x,ỹ=j + ε_j^(2)) ≥ ε_j^(1)(t∗j + ε_j^(2))}
                    = {x ∈ Xỹ=i : p∗x,ỹ=j ≥ t∗j}
                    = Xỹ=i,y∗=j        . by Thm. (1)

In the second-to-last step, the set defining the label errors reduces to the formulation of Cỹ,y∗ for ideal probabilities, which Theorem 1 proved yields exact label errors and consistent estimation of Qỹ,y∗; this concludes the proof. Note that we eliminate the assumption that each diagonal entry of Qỹ|y∗ maximizes its column: that assumption is only used in the proof of Theorem 1 when collisions occur, and here we consider only the case without collisions.
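The invariance underlying this proof is easy to check numerically. The following sketch is our own illustration, not part of cleanlab; the constants are arbitrary, and we take positive scales ε_j^(1) for the demonstration:

    import numpy as np

    # Sketch: a per-class distortion eps1_j * (p + eps2_j) with positive scale
    # leaves the confident-joint bins unchanged (Corollary 1.1). Ties at exactly
    # the threshold are vanishingly unlikely with continuous random draws.
    rng = np.random.default_rng(0)
    n, m = 1000, 3
    p_ideal = rng.dirichlet(np.ones(m), size=n)   # stand-in ideal probabilities
    labels = rng.integers(0, m, size=n)           # noisy labels (every class occurs)

    def bins(P):
        t = np.array([P[labels == j, j].mean() for j in range(m)])  # thresholds
        return {(i, j): frozenset(np.flatnonzero((labels == i) & (P[:, j] >= t[j])))
                for i in range(m) for j in range(m)}

    eps1 = rng.uniform(0.5, 2.0, size=m)   # positive per-class scale
    eps2 = rng.uniform(-0.1, 0.1, size=m)  # per-class shift
    assert bins(p_ideal) == bins(eps1 * (p_ideal + eps2))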

Theorem 2 (General Per-Example Robustness). For a dataset X := (x, ỹ)^n ∈ (R^d, Z_{>0})^n and model θ: x → p̂(ỹ), if p̂x,ỹ=i is per-example diffracted without label collisions and each diagonal entry of Qỹ|y∗ maximizes its row, then X̂ỹ=i,y∗=j = Xỹ=i,y∗=j and as n → ∞, Q̂ỹ,y∗ = Qỹ,y∗ (consistent).

Proof. We consider the non-trivial, real-world setting in which a learning model θ: x → p̂(ỹ) outputs erroneous, non-ideal predicted probabilities, with an error term added for every example, across every class, such that ∀x ∈ X, ∀j ∈ M, p̂x,ỹ=j = p∗x,ỹ=j + ε_x,ỹ=j. As a notation reminder, p∗x,ỹ=j is shorthand for the ideal probability p∗(ỹ = j|y∗=y∗(x)) and p̂x,ỹ=j is shorthand for the predicted probability p̂(ỹ = j; x, θ).

The predicted probability error ε_x,ỹ=j is distributed uniformly, with no other constraints. We use ε_j ∈ R to represent the mean of ε_x,ỹ=j per class, i.e., ε_j = E_{x∈X} ε_x,ỹ=j, which can be seen from the form of the uniform distribution in Eqn. (4). If we wanted, we could add the constraint that ε_j = 0, ∀j ∈ M, which would simplify the theorem and the proof, but it is less general; we prove exact label error and joint estimation without this constraint.
We re-iterate the form of the error in Eqn. (4) here (U denotes a uniform distribution):

ε_x,ỹ=j ∼ U(ε_j + tj − p∗x,ỹ=j, ε_j − tj + p∗x,ỹ=j]    if p∗x,ỹ=j ≥ tj
ε_x,ỹ=j ∼ U[ε_j − tj + p∗x,ỹ=j, ε_j + tj − p∗x,ỹ=j)    if p∗x,ỹ=j < tj
When there are no label collisions, the bins created by the confident joint are:

X̂ỹ=i,y∗=j := {x ∈ Xỹ=i : p̂x,ỹ=j ≥ tj}        (7)

where

tj = (1/|Xỹ=j|) Σ_{x∈Xỹ=j} p̂x,ỹ=j

Rewriting the threshold tj to include the error terms ε_x,ỹ=j and ε_j, we have

t_j^{ε_j} = (1/|Xỹ=j|) Σ_{x∈Xỹ=j} (p∗x,ỹ=j + ε_x,ỹ=j)
          = E_{x∈Xỹ=j} p∗x,ỹ=j + E_{x∈Xỹ=j} ε_x,ỹ=j
          = tj + ε_j

where the last step uses the fact that ε_x,ỹ=j is uniformly distributed and n → ∞, so that E_{x∈Xỹ=j} ε_x,ỹ=j = E_{x∈X} ε_x,ỹ=j = ε_j. We now complete the proof by showing that

p∗x,ỹ=j + ε_x,ỹ=j ≥ tj + ε_j ⟺ p∗x,ỹ=j ≥ tj
If this statement is true, then the subsets created by the confident joint in Eqn. 7 are unaltered, and therefore X̂^ε_{ỹ=i,y∗=j} = X̂_{ỹ=i,y∗=j} = Xỹ=i,y∗=j (by Thm. 1), where X̂^ε_{ỹ=i,y∗=j} denotes the confident-joint subsets computed from the error-perturbed predicted probabilities.


Now we complete the proof. From Eqn. 4 (the distribution for x,ỹ=j ) , we have that
p∗x,ỹ=j < tj =⇒ x,ỹ=j < j + tj − p∗x,ỹ=j
p∗x,ỹ=j ≥ tj =⇒ x,ỹ=j ≥ j + tj − p∗x,ỹ=j
Re-arranging
p∗x,ỹ=j < tj =⇒ p∗x,ỹ=j + x,ỹ=j < tj + j
p∗x,ỹ=j ≥ tj =⇒ p∗x,ỹ=j + x,ỹ=j ≥ tj + j
Using the contrapositive, we have
p∗x,ỹ=j + x,ỹ=j ≥ tj + j =⇒ p∗x,ỹ=j ≥ tj
p∗x,ỹ=j ≥ tj =⇒ p∗x,ỹ=j + x,ỹ=j ≥ tj + j
Combining, we have
p∗x,ỹ=j + x,ỹ=j ≥ tj + j ⇐⇒ p∗x,ỹ=j ≥ tj
Therefore,

X̂^ε_{ỹ=i,y∗=j} = Xỹ=i,y∗=j        . by Thm. 1

The last line follows from the fact that we have reduced X̂^ε_{ỹ=i,y∗=j} to counting the same condition (p∗x,ỹ=j ≥ tj) as the confident joint counts under ideal probabilities in Thm. (1). Thus, we maintain exact label errors, and consistent estimation (Corollary 1.1) also holds when there are no label collisions. While we assume infinite examples in X, the proof applies to finite datasets up to discretization error.
Note that while we use a uniform distribution in Eqn. 4, any bounded symmetric distribution with mode ε_j = E_{x∈X} ε_x,ỹ=j is sufficient. Observe that the bounds of the distribution are non-vacuous (they do not collapse to the single value ε_j) because tj ≠ p∗x,ỹ=j by Lemma 1.
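As a quick numerical sanity check of the equivalence above (a sketch with arbitrary constants, not part of the paper's experiments), one can draw per-example errors from the bounds of Eqn. 4 and confirm the biconditional directly:

    import numpy as np

    # Verify:  p* + eps >= t + eps_j  <=>  p* >= t, with eps drawn from Eqn. 4.
    # The constants t and eps_j are arbitrary choices for this sketch.
    rng = np.random.default_rng(1)
    t, eps_j = 0.6, 0.05
    p_star = rng.uniform(0, 1, size=10_000)  # stand-in ideal probabilities
    lo = np.where(p_star >= t, eps_j + t - p_star, eps_j - t + p_star)
    hi = np.where(p_star >= t, eps_j - t + p_star, eps_j + t - p_star)
    eps = rng.uniform(lo, hi)                # per-example error within the bounds
    assert np.array_equal(p_star + eps >= t + eps_j, p_star >= t)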

C The confident joint and joint algorithms

The confident joint can be expressed succinctly in Eqn. 2, with the thresholds expressed in Eqn. 1. For clarity, we provide these equations in algorithm form below.

Algorithm 1 The Confident Joint Algorithm for class-conditional label noise characterization.
input P̂, an n × m matrix of out-of-sample predicted probabilities P̂[i][j] := p̂(ỹ = j; xi, θ)
input ỹ ∈ Z^n≥0, an n × 1 array of noisy labels
procedure ConfidentJoint(P̂, ỹ):
PART 1 (COMPUTE THRESHOLDS)
for j ← 1, m do
    l ← new empty list []
    for i ← 1, n do
        if ỹ[i] = j then
            append P̂[i][j] to l
    t[j] ← average(l)
PART 2 (COMPUTE CONFIDENT JOINT)
C ← m × m matrix of zeros
for i ← 1, n do
    cnt ← 0
    for j ← 1, m do
        if P̂[i][j] ≥ t[j] then
            cnt ← cnt + 1
            y∗ ← j    . guess of true label
    if cnt > 1 then    . if label collision
        y∗ ← arg max_j P̂[i][j]
    if cnt > 0 then
        C[ỹ[i]][y∗] ← C[ỹ[i]][y∗] + 1
output C, the m × m unnormalized counts matrix

The confident joint algorithm (Alg. 1) is an O(m² + nm)-step procedure to compute Cỹ,y∗. The algorithm takes two inputs: (1) P̂, an n × m matrix of out-of-sample predicted probabilities P̂[i][j] := p̂(ỹ = j; xi, θ), and (2) the associated array of noisy labels. We typically use cross-validation to compute P̂ for train sets, and a model trained on the train set and fine-tuned with cross-validation on the test set to compute P̂ for a test set. Any method works as long as p̂(ỹ = j; x, θ) are out-of-sample, holdout, predicted probabilities.
Note that Alg. 1 embodies Eqn. 2, and Alg. 2 realizes Eqn. 3; a NumPy sketch of both appears after Alg. 2.

Algorithm 2 The Joint Algorithm calibrates the confident joint to estimate the latent, true distribution of class-conditional label noise.
input Cỹ,y∗[i][j], the m × m unnormalized counts
input ỹ, an n × 1 array of noisy integer labels
procedure JointEstimation(C, ỹ):
C̃ỹ=i,y∗=j ← ( Cỹ=i,y∗=j / Σ_{j∈M} Cỹ=i,y∗=j ) · |Xỹ=i|    . calibrate marginals
Q̂ỹ=i,y∗=j ← C̃ỹ=i,y∗=j / ( Σ_{i∈M, j∈M} C̃ỹ=i,y∗=j )       . joint sums to 1
output Q̂ỹ,y∗, the joint distribution matrix ∼ p(ỹ, y∗)
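For concreteness, the following NumPy sketch mirrors Alg. 1 and Alg. 2 above. It is our own minimal re-implementation for illustration, not the cleanlab source, and it assumes every class occurs at least once among the noisy labels:

    import numpy as np

    def confident_joint(psx, labels):
        # Alg. 1 sketch. psx: (n, m) out-of-sample predicted probabilities;
        # labels: (n,) noisy labels in {0, ..., m-1}.
        n, m = psx.shape
        # Part 1: threshold t[j] = mean self-confidence of class j (Eqn. 1)
        t = np.array([psx[labels == j, j].mean() for j in range(m)])
        # Part 2: count examples confidently assigned to each (noisy, true) bin
        C = np.zeros((m, m), dtype=int)
        for i in range(n):
            above = np.flatnonzero(psx[i] >= t)  # classes meeting their threshold
            if above.size == 0:
                continue  # not confident in any class; example is not counted
            # on a label collision, break the tie with arg max, as in Alg. 1
            y_star = above[0] if above.size == 1 else int(np.argmax(psx[i]))
            C[labels[i], y_star] += 1
        return C

    def estimate_joint(C, labels):
        # Alg. 2 sketch: calibrate row marginals to the observed label counts,
        # then normalize so the matrix sums to 1.
        m = C.shape[0]
        label_counts = np.bincount(labels, minlength=m)
        row_sums = np.clip(C.sum(axis=1, keepdims=True), 1, None)  # avoid 0/0
        C_tilde = C / row_sums * label_counts[:, None]
        return C_tilde / C_tilde.sum()

Given out-of-sample probabilities psx and noisy labels, Q̂ỹ,y∗ would be obtained as estimate_joint(confident_joint(psx, labels), labels).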

Table 5: Comparison of confident learning with common state-of-the-art methods for learning with
noisy labels on CIFAR-10. This table extends Table 1 to include all the variations of confident
learning, where CL: OPT in Table 1 is the max across CL: PBC, CL: PBNR, and CL: C+NR.
Noise 0.2 0.4 0.7
Sparsity 0 0.2 0.4 0.6 0 0.2 0.4 0.6 0 0.2 0.4 0.6
CL: Cconfusion 0.854 0.854 0.863 0.857 0.806 0.796 0.802 0.798 0.332 0.363 0.328 0.291
CL: Cỹ,y∗ 0.848 0.858 0.862 0.861 0.815 0.810 0.816 0.815 0.340 0.398 0.282 0.372
CL: PBC 0.860 0.854 0.862 0.855 0.802 0.801 0.810 0.813 0.356 0.373 0.263 0.336
CL: PBNR 0.858 0.854 0.865 0.862 0.810 0.796 0.814 0.825 0.468 0.416 0.399 0.360
CL: C+NR 0.858 0.859 0.862 0.862 0.808 0.800 0.807 0.822 0.460 0.420 0.382 0.371
MentorNet 0.849 0.851 0.832 0.834 0.644 0.642 0.624 0.615 0.300 0.316 0.293 0.279
S-Model 0.800 0.800 0.797 0.791 0.586 0.612 0.591 0.575 0.284 0.285 0.279 0.273
Reed 0.781 0.789 0.808 0.793 0.605 0.604 0.612 0.586 0.290 0.294 0.291 0.268
Baseline 0.784 0.792 0.790 0.782 0.602 0.608 0.596 0.573 0.270 0.297 0.282 0.268

Figure 5: Absolute difference of the true joint Qỹ,y∗ and the joint distribution estimated using
confident learning Q̂ỹ,y∗ on CIFAR-10, for 20%, 40%, and 70% label noise, 20%, 40%, and 60%
sparsity, for all pairs of classes in the joint distribution of label noise.

D Extended Comparison of Confident Learning Methods on CIFAR-10

In Table 5, we extend Table 1 to include all the variations of confident learning, where CL: OPT in
Table 1 is the max across CL: PBC, CL: PBNR, and CL: C+NR from Table 5.
Fig. 5 shows the absolute difference of the true joint Qỹ,y∗ and the joint distribution estimated using confident learning, Q̂ỹ,y∗, on CIFAR-10, for 20%, 40%, and 70% label noise and 20%, 40%, and 60% sparsity, for all pairs of classes in the joint distribution of label noise. Observe that in moderate noise regimes, between 20% and 40% noise, confident learning accurately estimates nearly every entry in the joint distribution of label noise. This figure provides evidence for how confident learning identifies label errors with high accuracy, as shown in Table 2, and supports our theoretical contribution that confident learning is a consistent estimator of the joint distribution.

Table 6: RMSE of joint-distribution estimation on CIFAR-10 using CL (Q̂ỹ,y∗) vs. the confusion-matrix baseline (Q̂argmax).
Noise 0.2 0.4 0.7
Sparsity 0 0.2 0.4 0.6 0 0.2 0.4 0.6 0 0.2 0.4 0.6
‖Q̂ỹ,y∗ − Qỹ,y∗‖₂ 0.004 0.004 0.004 0.004 0.004 0.004 0.004 0.005 0.011 0.010 0.015 0.017
‖Q̂argmax − Qỹ,y∗‖₂ 0.006 0.006 0.005 0.005 0.005 0.005 0.005 0.007 0.011 0.011 0.015 0.019

To our knowledge, this is the first time both ResNet50 (Fig. 9 in Appendix) and ResNet18 (Fig. 8 in Appendix) have been trained on an automatically cleaned ImageNet training set. In Figure 9 we demonstrate how the validation accuracy of ResNet50 improves when removing label errors estimated with confident learning versus removing random examples. In (a) we see the validation accuracy when random pruning is used versus confident learning. Sub-figure (b) shows that the discrepancy increases for the 20 noisiest classes identified by confident learning. The bottom two sub-figures show a consistent increase in accuracy on the class identified as most noisy by confident learning (c) and on a moderately noisy class (d). For (d), the improvement spikes in the middle because confident learning ranks examples, and this class is not ranked as the noisiest, so its label errors are not removed until 40%–60% of all label errors are removed.
Note that we did not remove label errors from the validation set, so by training on a clean train set we may have induced distributional shift, making the moderate accuracy increase more satisfying.
In Table 6 in the Appendix, we estimate Qỹ,y∗ using the confusion-matrix Cconfusion approach, normalized via Eqn. (3) to obtain Q̂argmax, and compare this with CL (Q̂ỹ,y∗) for various amounts of noise and sparsity in Qỹ,y∗. We show a significant improvement using CL over Cconfusion, low RMSE scores, and robustness to sparsity in moderate-noise regimes.

E Case Study: MNIST

In the past two decades, the machine learning and vision communities have increasingly used the
MNIST (LeCun, 1998) and ImageNet (Russakovsky et al., 2015) datasets as benchmarks to measure
state-of-the-art progress and evaluate theoretical findings. The trustworthiness of these benchmarks relies on an implicit, and as we show, fallacious assumption: that these datasets have noise-free labels. In this section, we unveil the existence of numerous label errors in both datasets and thereby demonstrate the efficacy of confident learning as a general tool for automatically detecting label errors in massive, crowd-sourced datasets. For MNIST, composed of black-and-white handwritten digits, we estimate 48 (of 60,000) train label errors and 8 (of 10,000) test label errors; for the 2012 ImageNet validation set, composed of 50 color images for each of 1,000 distinct classes, we estimate that 5,000 (of 50,000) validation labels are erroneous or confounded (the image has more than one valid label).


Figure 6: Label errors of the original MNIST train dataset identified algorithmically with CL: PBNR. Depicts the 24 least confident labels, ordered left-right, top-down by increasing self-confidence, denoted conf in teal. The arg max p̂(ỹ = k; x) label is in green. Overt errors are in red.

MNIST train set We used confident learning to automatically identify label errors, without human intervention, in the original, unperturbed MNIST train set. The key computational step is computing the n × m matrix of predicted probabilities. We first pre-trained a PyTorch MNIST CNN (architecture in Supplementary Materials Fig. 10) for 50 epochs, then used five-fold cross-validation to obtain p̂(ỹ = k; x), the out-of-sample, holdout, transductive predicted probabilities for the train set. The 24 least confident examples in Xerr, i.e., likely label errors, identified by confident learning are shown in Fig. 6. Errors are ordered left-right, top-down by increasing self-confidence, p̂(ỹ = k; x∈Xỹ=k), denoted as conf in teal, alongside the MNIST-provided label. The green boxes depict the arg max p̂(ỹ = k; x ∈ X) predicted label and its confidence. The results are compelling, with the obvious errors enclosed in red. Unlike most approaches, ours is transductive, so no information from other datasets is needed. For verification, the indices of the train label errors in Fig. 6 are shown in grey.
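As a minimal runnable sketch of this key step (using scikit-learn's load_digits as a small stand-in for MNIST, and logistic regression in place of the CNN; both substitutions are ours):

    from sklearn.datasets import load_digits
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_predict

    # Out-of-sample predicted probabilities via 5-fold cross-validation: each
    # row of psx is predicted by a model that never trained on that example.
    X, y_given = load_digits(return_X_y=True)  # y_given plays the noisy labels
    clf = LogisticRegression(max_iter=1000)
    psx = cross_val_predict(clf, X, y_given, cv=5, method="predict_proba")
    print(psx.shape)  # the (n, m) matrix of predicted probabilities for Alg. 1

The resulting psx can be passed directly to the confident_joint sketch in Appendix C.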
The confidences (in green) in Fig. 6 tend to extreme values. This overconfidence is typical of deep architectures. Because confident learning is classifier-agnostic, we also tried the default scikit-learn implementation of logistic regression with isotonic calibration (Fig. 11 in the Supplementary Materials). Although the confidence values are less extreme than those of a convnet, the results are perceptibly less accurate. The extremity of a convnet's confidence values can be improved using modern calibration techniques (Guo et al., 2017). However, this is unnecessary here: calibration is a monotonic operation, and confident learning depends only on the rank of the confidence values, hence the name. Additionally, the confident joint estimation step sets its thresholds based on an average of the probabilities, for which monotonic adjustment is vacuous. These properties, along with the quality of label error identification in Fig. 6, follow from the removal-by-rank principles of confident learning.

F Additional Figures
In this section, we include additional figures that support the main manuscript.
The noise matrices shown in Fig. 7 were used to generate the synthetic noisy labels for the results in Tables 1, 2, and 5.
Fig. 8 replicates the experiment from Fig. 9 using ResNet18 instead of ResNet50. Results are reasonably similar.
Fig. 16 [left] demonstrates the efficacy of confident learning for multiclass learning with noisy labels on MNIST with added uniformly random class-conditional noise, for different noise sparsities. The test accuracy values were averaged over 30 trials per Tr(Qỹ|y∗). Although the test accuracy is lower for sparse Qỹ|y∗, the overall improvement from using CL is higher.

Figure 7: The CIFAR-10 noise transition matrices used to create the synthetic label errors. In the
code base, s is used in place of ỹ and y is used in place of y ∗ .

[Figure 8 plots: (a) Accuracy of Validation Set; (b) Accuracy of top 20 noisiest classes; (c) Accuracy of the noisiest class: foxhound; (d) Accuracy of a moderately noisy class: bighorn. x-axis: number of examples pruned before training (0K–178K); legend: pruning method, confident learning vs. uniformly random.]
Figure 8: ResNet18 Val Accuracy on ImageNet when 20%, 40%,...,100% of the label errors found
automatically with confident learning are removed prior to training from scratch. Vertical bars depict
the improvement or reduction when removing examples with confident learning vs random examples.

[Figure 9 plots: (a) Accuracy of Validation Set; (b) Accuracy of top 20 noisiest classes; (c) Accuracy of the noisiest class: foxhound; (d) Accuracy of a moderately noisy class: bighorn. x-axis: number of examples pruned before training (0K–178K); legend: pruning method, confident learning vs. uniformly random.]
Figure 9: ResNet50 Val Accuracy on ImageNet when 20%, 40%,...,100% of the label errors found
automatically with confident learning are removed prior to training from scratch. Vertical bars depict
the improvement or reduction when removing examples with confident learning vs random examples.

Figure 10: Architecture used for training and probability estimation in our MNIST experiments and
figures. This figure was adapted from https://github.com/floydhub/mnist (2018).


Figure 11: Label errors of the original MNIST train dataset identified algorithmically with CL: PBNR using the default scikit-learn implementation of logistic regression with isotonic calibration. Depicts the 24 least confident labels, ordered left-right, top-down by increasing self-confidence p̂(ỹ = k|Xỹ=k), denoted conf in teal. The arg max p̂(ỹ = k|x) label is in green. Overt errors are in red. Note that the confidence values are less extreme than when using a convolutional network; however, the results are perceptibly less accurate.

Figure 12: Evidence of nearly perfect noise rate matrix Q̂ỹ|y∗ estimation by confident learning. With each row, the label noise increases from top to bottom. The histograms of individual noise rate values in Q̂ỹ|y∗ capture the challenging asymmetry. Noise rates were uniformly randomly generated between 0 and 0.2. To illustrate the severity of noise in the lowest row, note that a trace of 5.12 implies (1/10) Σ_{i=1}^{m=10} p(ỹ=i|y=i) = 0.512.

Figure 13: The synthetic three-Gaussian dataset used in our robust estimation experiment.

[Figure 14 plots, left to right: ‖p̂(y) − p(y)‖, ‖P̂s|y − Ps|y‖, and ‖P̂y|s − Py|s‖ versus Trace(Ps|y), from more noise (1) to less noise (m = 3); estimators: Robust, Robust (converged), Standard, Uniformly random.]
Figure 14: Robust latent estimation on Gaussian (m = 3) synthetic data (w/ added label noise) using the default scikit-learn implementation of logistic regression with isotonic calibration.

[Figure 15 plots, left to right: ‖p̂(y) − p(y)‖, ‖P̂s|y − Ps|y‖, and ‖P̂y|s − Py|s‖ versus Trace(Ps|y) (1, 3, 8, m = 10); estimators: Standard, Robust, Robust (consistent).]
Figure 15: Robust latent estimation of p(y) on MNIST (w/ added label noise) using the PyTorch MNIST CNN. Here we use s to denote the noisy labels and y to denote the true labels, to match the cleanlab Python package notation standard.

Figure 16: Comparison of MNIST (w/ added label noise) test accuracy (averaged over 30 trials per Tr(Qỹ|y∗)) using confident learning versus a vanilla training-on-noisy-labels baseline, for different noise sparsities. Although the test accuracy is lower for sparser Qỹ|y∗, confident learning boasts increased accuracy gains.

