
Handle Class Imbalance

Liang Liang
Class Imbalance

• In a classification application, if one class has many more data points than another
class, this is called class imbalance.
binary classification :
class-0: 1000 data points
class-1: 100 data points
• class imbalance is a very common problem
• credit card fraud detection: most of the transactions are normal
• disease/coronavirus diagnosis based on X-ray images
# of images from diseased patients << # of images from healthy subjects
Class Imbalance: What will happen? see danger_of_class_imbalance.ipynb
• Let's use the dataset in the homework
It is a binary classification problem: predict whether a person will pay off the credit card or not
class-0: the person will pay off the credit card in time
class-1: the person will default on the credit card payment

a highly imbalanced dataset

139974 data samples in class-0


10026 data samples in class-1
Class Imbalance: What will happen?
Let's use a logistic regression classifier on the highly imbalanced dataset:
139974 data samples in class-0
10026 data samples in class-1
step-1: divide the data into three sets: training set, validation set, and test set
step-2: fit the model on the training set
step-3: test the model on the test set
step-4: obtain three confusion matrices on the training/validation/test sets, and compute classification accuracy
(see danger_of_class_imbalance.ipynb)
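A minimal sketch of the four steps above; the actual code in danger_of_class_imbalance.ipynb may differ (split ratios, solver settings, etc.), and X and y are hypothetical names for the full feature matrix and labels:

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# step-1: split into training / validation / test sets
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=0, stratify=y)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=0, stratify=y_temp)

# step-2: fit the model on the training set
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# step-3 and step-4: evaluate and get one confusion matrix per set
for name, Xs, ys in [("train", X_train, y_train), ("val", X_val, y_val), ("test", X_test, y_test)]:
    print(name, clf.score(Xs, ys))                # plain classification accuracy
    print(confusion_matrix(ys, clf.predict(Xs)))  # rows = true classes, columns = predictions
```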
a highly imbalanced dataset: 139974 data samples in class-0, 10026 data samples in class-1
[Confusion matrices on the training/validation/test sets: in each matrix, nearly all of the class-1 samples are wrongly classified into class-0.]
We just got a "clever and lazy" classifier that classified most of the data samples
into class-0, but the goal is to find out who will default credit card (class-1).
Measure the 'real' accuracy of the classifier
by normalizing each row of the confusion matrix

In the training-set confusion matrix: 100577+247 is the number of samples in class-0, and 6883+293 is the number of samples in class-1.
Normalize each row of the confusion matrix such that the summation of the elements in each row is 1

99% of the samples in class-0 are classified into class-0

96% of the samples in class-1 are wrongly-classified into class-0

The weighted/normalized accuracy on the training set is only about as good as a random guess.


Measure the 'real' accuracy of the classifier
by normalizing each row of the confusion matrix

This function measures the real/weighted/normalized accuracy of the classifier
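A minimal sketch of such a function (assuming the confusion matrix rows correspond to true classes, as produced by sklearn.metrics.confusion_matrix); the result is the average per-class recall, the same quantity as sklearn's balanced_accuracy_score:

```python
import numpy as np

def normalized_accuracy(confusion):
    """Row-normalize a confusion matrix and average its diagonal (per-class recall)."""
    cm = np.asarray(confusion, dtype=float)
    cm_norm = cm / cm.sum(axis=1, keepdims=True)   # each row now sums to 1
    return cm_norm.diagonal().mean()
```

For the training-set matrix above this gives roughly (0.99 + 0.04)/2, i.e., about 0.5: no better than a random guess.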


The Danger of Imbalanced Dataset
The classifier is trained on the imbalanced dataset.
It is only about as good as a random guess.
Understand Class Imbalance from the Perspective of PDF
- scenario-1
• Assume there are two classes: class-0 and class-1
• The PDF of class-0 is a GMM with two Gaussians
• The PDF of class-1 is a simple Gaussian
[Figure: the feature space. Class-0 has 100 data points near each of its two Gaussians; class-1 has 100 data points.]
We have collected data samples:
class-0: 200 data points
class-1: 100 data points
It is an imbalanced dataset, but the number of samples is enough for classification:
100 samples are more than enough to estimate a 2D Gaussian.
Understand Class Imbalance from the Perspective of PDF
- scenario-2
• Assume there are two classes: class-0 and class-1
• The PDF of class-0 is a GMM with two Gaussians
• The PDF of class-1 is a GMM with two Gaussians
[Figure: the feature space. Class-0 has 100 data points near each of its two Gaussians; class-1 has 100 data points from its first Gaussian and no data points from its second Gaussian.]
We have collected data samples:
class-0: 200 data points
class-1: 100 data points from the first Gaussian, no data points from the second Gaussian
scenario-1: the data samples contain enough information about the two classes.
scenario-2: the data samples do not contain enough information about class-1.
[Figure: the feature spaces of scenario-1 and scenario-2 side by side]

It is relatively easy to handle class imbalance in scenario-1.

The scenario in real applications is a mixture of scenario-1 and scenario-2.
Computational Methods to Handle Class Imbalance
(assuming scenario-1)

• 1. Class weight
• 2. Re-sampling (e.g. up-sampling, down-sampling)
• 3. Generate synthetic data samples

• assuming scenario-1:
we have sufficient data in each class.
The reason for class imbalance is that some classes have very
complex patterns and therefore need more data samples.
Class weight
• Let's consider a logistic regression classifier $\hat{y} = f(x)$
$x$ is a data sample, $\hat{y}$ is the predicted class label from the classifier $f(x)$
the cross-entropy loss on a training sample $x_n$ with true class label $y_n$ is
$L(f(x_n), y_n)$
• Assume there are two classes
• class-0 has 200 samples: {(𝒙𝟏 , 𝒚𝟏 ), (𝒙𝟐 , 𝒚𝟐 ), (𝒙𝟑 , 𝒚𝟑 )…, (𝒙𝟐𝟎𝟎 , 𝒚𝟐𝟎𝟎 )}
• class-1 has 100 samples: {(𝒙𝟏 , 𝒚𝟏 ), (𝒙𝟐 , 𝒚𝟐 ), (𝒙𝟑 , 𝒚𝟑 )…, (𝒙𝟏𝟎𝟎 , 𝒚𝟏𝟎𝟎 )}
• The total loss for training the classifier is
$L_{total} = \sum_{m=1}^{200} L(f(x_m), y_m) + \sum_{n=1}^{100} L(f(x_n), y_n)$
• $L_{total}$ is largely determined by the samples in class-0 (200 > 100)
• Next, we will modify the total loss such that the classes have equal weight
Class weight
• class-0 has 200 samples: {(𝒙𝟏 , 𝒚𝟏 ), (𝒙𝟐 , 𝒚𝟐 ), (𝒙𝟑 , 𝒚𝟑 )…, (𝒙𝟐𝟎𝟎 , 𝒚𝟐𝟎𝟎 )}
• class-1 has 100 samples: {(𝒙𝟏 , 𝒚𝟏 ), (𝒙𝟐 , 𝒚𝟐 ), (𝒙𝟑 , 𝒚𝟑 )…, (𝒙𝟏𝟎𝟎 , 𝒚𝟏𝟎𝟎 )}
• The total loss for training the classifier is
$L_{total} = \sum_{m=1}^{200} L(f(x_m), y_m) + \sum_{n=1}^{100} L(f(x_n), y_n)$
• $L_{total}$ is largely determined by the samples in class-0 (200 > 100)
• Class-balanced loss:
$L_{total\_cw} = \frac{1}{200}\sum_{m=1}^{200} L(f(x_m), y_m) + \frac{1}{100}\sum_{n=1}^{100} L(f(x_n), y_n)$

Divide the total loss on class-0 by the number of samples in class-0, and divide the total loss on class-1 by the number of samples in class-1.
Then, the samples in the two classes have equal weight in the new loss.
Class weight
• class-0 has 200 samples: {(𝒙𝟏 , 𝒚𝟏 ), (𝒙𝟐 , 𝒚𝟐 ), (𝒙𝟑 , 𝒚𝟑 )…, (𝒙𝟐𝟎𝟎 , 𝒚𝟐𝟎𝟎 )}
• class-1 has 100 samples: {(𝒙𝟏 , 𝒚𝟏 ), (𝒙𝟐 , 𝒚𝟐 ), (𝒙𝟑 , 𝒚𝟑 )…, (𝒙𝟏𝟎𝟎 , 𝒚𝟏𝟎𝟎 )}
• The class-balanced loss in sk-learn:
$L_{total\_cw} = \frac{300}{2 \times 200}\sum_{m=1}^{200} L(f(x_m), y_m) + \frac{300}{2 \times 100}\sum_{n=1}^{100} L(f(x_n), y_n)$

Class weights are {0: 300/(2*200), 1: 300/(2*100)}, i.e., n_samples / (n_classes × [200, 100]) = 300 / (2 × [200, 100]).


You only need to set class_weight = 'balanced'.
It will calculate the weights for you.
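A minimal sketch; X_train and y_train are hypothetical names for the training data:

```python
from sklearn.linear_model import LogisticRegression

# sk-learn computes weight_j = n_samples / (n_classes * N_j) internally
clf = LogisticRegression(class_weight='balanced', max_iter=1000)
clf.fit(X_train, y_train)
```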
Up-sampling
• class-0 has 200 samples: {(𝒙𝟏 , 𝒚𝟏 ), (𝒙𝟐 , 𝒚𝟐 ), (𝒙𝟑 , 𝒚𝟑 )…, (𝒙𝟐𝟎𝟎 , 𝒚𝟐𝟎𝟎 )}
• class-1 has 100 samples: {(𝒙𝟏 , 𝒚𝟏 ), (𝒙𝟐 , 𝒚𝟐 ), (𝒙𝟑 , 𝒚𝟑 )…, (𝒙𝟏𝟎𝟎 , 𝒚𝟏𝟎𝟎 )}
• The total loss for training the classifier is
$L_{total} = \sum_{m=1}^{200} L(f(x_m), y_m) + \sum_{n=1}^{100} L(f(x_n), y_n)$
• $L_{total}$ is largely determined by the samples in class-0 (200 > 100)
• Let's do up-sampling for class-1: make a copy of each of the 100 samples
class-1 has 200 samples after up-sampling
{(𝒙𝟏 , 𝒚𝟏 ), (𝒙𝟐 , 𝒚𝟐 ), (𝒙𝟑 , 𝒚𝟑 )…, (𝒙𝟏𝟎𝟎 , 𝒚𝟏𝟎𝟎 ), # 100 original samples
(𝒙𝟏𝟎𝟏 , 𝒚𝟏𝟎𝟏 ), (𝒙𝟏𝟎𝟐 , 𝒚𝟏𝟎𝟐 ), (𝒙𝟏𝟎𝟑 , 𝒚𝟏𝟎𝟑 )…, (𝒙𝟐𝟎𝟎 , 𝒚𝟐𝟎𝟎 )} # 100 copies
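A minimal sketch of this duplication with NumPy; X1 and y1 are hypothetical names for the 100 class-1 samples and their labels:

```python
import numpy as np

X1_up = np.concatenate([X1, X1], axis=0)   # 100 originals + 100 copies = 200 samples
y1_up = np.concatenate([y1, y1], axis=0)
# stack X1_up / y1_up with the class-0 data before training
```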
Up-sampling and Class Weight
• class-0 has 200 samples: {(𝒙𝟏 , 𝒚𝟏 ), (𝒙𝟐 , 𝒚𝟐 ), (𝒙𝟑 , 𝒚𝟑 )…, (𝒙𝟐𝟎𝟎 , 𝒚𝟐𝟎𝟎 )}
• class-1 has 200 samples after up-sampling
{(𝒙𝟏 , 𝒚𝟏 ), (𝒙𝟐 , 𝒚𝟐 ), (𝒙𝟑 , 𝒚𝟑 )…, (𝒙𝟏𝟎𝟎 , 𝒚𝟏𝟎𝟎 ), # 100 original samples
(𝒙𝟏𝟎𝟏 , 𝒚𝟏𝟎𝟏 ), (𝒙𝟏𝟎𝟐 , 𝒚𝟏𝟎𝟐 ), (𝒙𝟏𝟎𝟑 , 𝒚𝟏𝟎𝟑 )…, (𝒙𝟐𝟎𝟎 , 𝒚𝟐𝟎𝟎 )} # 100 copies
• The new loss function is
$L_{total\_up} = \sum_{m=1}^{200} L(f(x_m), y_m) + \sum_{n=1}^{200} L(f(x_n), y_n)$

which is basically the same as the loss using class weight:

$L_{total\_cw} = \frac{1}{200}\sum_{m=1}^{200} L(f(x_m), y_m) + \frac{1}{100}\sum_{n=1}^{100} L(f(x_n), y_n)$

In fact, $L_{total\_up} = 200 \, L_{total\_cw}$.
The two loss functions are essentially the same
Use Up-sampling and Class Weight for Multiclass Classification
Class 0: 𝑁0 samples
Class 1: 𝑁1 samples
Class 2: 𝑁2 samples
...
Class K-1: 𝑁𝐾−1 samples
assume that: 𝑁0 > 𝑁1 > 𝑁2 > ⋯ > 𝑁𝐾−1
Then we can do
(1) up-sampling such that each class has 𝑁0 samples after up-sampling
(2) class weight: $\text{class\_weight}_j = \dfrac{\sum_{k=0}^{K-1} N_k}{K \times N_j}$
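A minimal sketch of computing these weights with sk-learn; y_train is a hypothetical name for the training labels (values 0 .. K-1):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

classes = np.unique(y_train)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_train)
# weights[j] = n_samples / (K * N_j), matching the formula above
class_weight = dict(zip(classes, weights))   # can be passed to a classifier's class_weight argument
```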
Down-sampling
• Example:
class-0: 10,000 samples
class-1: 6,000 samples
class-2: 5,000 samples
Then, we do down-sampling
(1) randomly select 5000 samples from class-0
(2) randomly select 5000 samples from class-1
After down-sampling, each class has 5000 samples

• Clearly, information will get lost after down-sampling.


Re-sampling https://scikit-learn.org/stable/modules/generated/sklearn.utils.resample.html

• Example:
class-0: 10,000 samples
class-1: 6,000 samples
class-2: 5,000 samples
Then, we do re-sampling
(down-sampling class-0, up-sampling class-1 and class-2)
(1) randomly select 8000 samples from class-0 (without replacement)
(2) randomly select 8000 samples from class-1 (with replacement)
(3) randomly select 8000 samples from class-2 (with replacement)

After re-sampling, each class has 8000 samples
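A minimal sketch of this re-sampling with sklearn.utils.resample (linked above); X0, X1, X2 are hypothetical names for the per-class sample arrays:

```python
from sklearn.utils import resample

X0_down = resample(X0, n_samples=8000, replace=False, random_state=0)   # down-sample class-0
X1_up   = resample(X1, n_samples=8000, replace=True,  random_state=0)   # up-sample class-1
X2_up   = resample(X2, n_samples=8000, replace=True,  random_state=0)   # up-sample class-2
```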


sampling with replacement
dataset={𝒙𝟏 , 𝒙𝟐 , 𝒙𝟑 , 𝒙𝟒 , 𝒙𝟓 , 𝒙𝟔 , 𝒙𝟕 , 𝒙𝟖 , 𝒙𝟗 , 𝒙𝟏𝟎 }
We need to randomly select 3 samples from the dataset
We do it one by one:
(1) Randomly select 1 sample from the dataset, and then we get 𝒙𝟐
after this random sampling, dataset={𝒙𝟏 , 𝒙𝟐 , 𝒙𝟑 , 𝒙𝟒 , 𝒙𝟓 , 𝒙𝟔 , 𝒙𝟕 , 𝒙𝟖 , 𝒙𝟗 , 𝒙𝟏𝟎 }
(2) randomly select 1 sample from the dataset, and then we get 𝒙𝟒
after this random sampling, dataset={𝒙𝟏 , 𝒙𝟐 , 𝒙𝟑 , 𝒙𝟒 , 𝒙𝟓 , 𝒙𝟔 , 𝒙𝟕 , 𝒙𝟖 , 𝒙𝟗 , 𝒙𝟏𝟎 }
(3) randomly select 1 sample from the dataset, and then we get 𝒙𝟐
after this random sampling, dataset={𝒙𝟏 , 𝒙𝟐 , 𝒙𝟑 , 𝒙𝟒 , 𝒙𝟓 , 𝒙𝟔 , 𝒙𝟕 , 𝒙𝟖 , 𝒙𝟗 , 𝒙𝟏𝟎 }

then, we got {𝒙𝟐 , 𝒙𝟒 , 𝒙𝟐 }


sampling without replacement
dataset={𝒙𝟏 , 𝒙𝟐 , 𝒙𝟑 , 𝒙𝟒 , 𝒙𝟓 , 𝒙𝟔 , 𝒙𝟕 , 𝒙𝟖 , 𝒙𝟗 , 𝒙𝟏𝟎 }
We need to randomly select 3 samples from the dataset
We do it one by one:
(1) Randomly select 1 sample from the dataset, and then we get 𝒙𝟐
after this random sampling, dataset={𝒙𝟏 , 𝒙𝟑 , 𝒙𝟒 , 𝒙𝟓 , 𝒙𝟔 , 𝒙𝟕 , 𝒙𝟖 , 𝒙𝟗 , 𝒙𝟏𝟎 }
(2) randomly select 1 sample from the dataset, and then we get 𝒙𝟒
after this random sampling, dataset={𝒙𝟏 , 𝒙𝟑 , 𝒙𝟓 , 𝒙𝟔 , 𝒙𝟕 , 𝒙𝟖 , 𝒙𝟗 , 𝒙𝟏𝟎 }
(3) randomly select 1 sample from the dataset, and then we get 𝒙𝟏
after this random sampling, dataset={𝒙𝟑 , 𝒙𝟓 , 𝒙𝟔 , 𝒙𝟕 , 𝒙𝟖 , 𝒙𝟗 , 𝒙𝟏𝟎 }

then, we got {𝒙𝟐 , 𝒙𝟒 , 𝒙𝟏 }


Sampling without replacement gives no repeated elements; sampling with replacement may give multiple copies of the same element.
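A minimal sketch of both sampling modes with NumPy; the integers 1..10 stand in for 𝒙𝟏..𝒙𝟏𝟎:

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.arange(1, 11)                                          # stands in for {x1, ..., x10}

with_replacement    = rng.choice(data, size=3, replace=True)     # the same element may appear more than once
without_replacement = rng.choice(data, size=3, replace=False)    # all three selected elements are distinct
```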


Generate synthetic data samples
• An old method SMOTE: Synthetic Minority Oversampling Technique.
https://imbalanced-learn.readthedocs.io/en/stable/index.html
https://arxiv.org/pdf/1106.1813.pdf
It may not work for your data!
It assumes that a new sample can be generated by some linear combination (interpolation) of
the training samples.
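A minimal usage sketch with the imbalanced-learn package linked above; X_train and y_train are hypothetical names, and the exact constructor arguments may differ between versions:

```python
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=0)
# only resample the training set, never the validation/test sets
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)
```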
Generate synthetic data samples

• PCA to generate new data


• GMM to generate new data
• neural network to generate new data
(you will see examples in future lectures)
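For example, a GMM fitted to the minority class can be sampled to generate new data points. A minimal sketch; X_minority is a hypothetical name for the minority-class training samples, and the number of components is an arbitrary choice:

```python
from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=2, random_state=0).fit(X_minority)
X_synthetic, _ = gmm.sample(n_samples=100)   # draw 100 new samples from the fitted PDF
```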
Convert classification into anomaly/outlier detection
• binary classification with extremely imbalanced data (scenario-2)
• We have sufficient data in class-0
• We only have a small amount of data in class-1, which cannot represent all data patterns in class-1
[Figure: the scenario-2 feature space — class-0 has 100 data points near each of its two Gaussians; class-1 has data points from only one of its Gaussians]
• The basic idea: estimate $p(x)$, the PDF of class-0.
Given an input $x$, if $p(x) < \text{threshold}$,
then $x$ is not in class-0:
it is an anomaly/outlier
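A minimal sketch of this idea using a GMM as the density model for class-0; X_class0 and X_new are hypothetical names, and the component count and 5th-percentile threshold are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=2, random_state=0).fit(X_class0)   # estimate p(x) for class-0

threshold = np.percentile(gmm.score_samples(X_class0), 5)   # threshold on log p(x), from training densities

log_p = gmm.score_samples(X_new)     # log p(x) for new inputs
is_outlier = log_p < threshold       # True => x is not in class-0 (anomaly/outlier)
```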
Convert classification into anomaly/outlier detection
https://arxiv.org/pdf/1901.03407.pdf

Anomaly detection is very challenging. The methods in the book
and the paper may not work for your data.
challenge: COVID-19 diagnosis using chest X-ray images
• image dataset of healthy/normal subjects
https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia
the number of training samples = 1341

• image dataset of infected patients


https://github.com/ieee8023/covid-chestxray-dataset
the number of samples = 25
this is a small number compared to the number of people who got infected

• So, we are in scenario-2


[Figure: normal chest X-ray images vs. the X-ray image of an infected lung]
