Liang Liang
Class Imbalance
• In a classification application, if one class has many more data points than another
class, the dataset is said to have class imbalance.
Example (binary classification):
class-0: 1,000 data points
class-1: 100 data points
• Class imbalance is a very common problem:
• credit card fraud detection: most of the transactions are normal
• disease diagnosis (e.g., coronavirus) based on X-ray images:
# of images from diseased patients << # of images from healthy subjects
Class Imbalance: What will happen? (see danger_of_class_imbalance.ipynb)
• Let's use the dataset from the homework.
It is a binary classification task: predict whether a person will pay off their credit card or not.
class-0: the person will pay off the credit card in time
class-1: the person will default on the credit card payment
139,974 data samples in class-0
10,026 data samples in class-1
Step 1: divide the data into three sets: training set, validation set, and test set.
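A minimal sketch of step 1 with scikit-learn; the synthetic X and y and the 70/15/15 split ratio are assumptions for illustration, not taken from the homework:

import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the credit-card dataset (assumption: the real
# features and labels would be loaded from the homework files).
rng = np.random.default_rng(0)
X = rng.normal(size=(150_000, 5))
y = (rng.random(150_000) < 0.067).astype(int)  # ~6.7% class-1: imbalanced

# 70% training, then split the remainder evenly: 15% validation, 15% test.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.30, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.50, random_state=0)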
[Figure: confusion matrices showing that almost all class-1 samples are wrongly classified]
We just got a "clever and lazy" classifier that classifies most of the data samples
into class-0, but the goal is to find out who will default on the credit card (class-1).
Measure the 'real' accuracy of the classifier
by normalizing each row of the confusion matrix
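A minimal sketch of this row normalization with scikit-learn; the arrays y_true and y_pred are placeholders for the test labels and the classifier's predictions:

import numpy as np
from sklearn.metrics import confusion_matrix

# Placeholder labels/predictions; in practice use the test set.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])

# normalize='true' divides each row by its total, so entry (i, j)
# is the fraction of class-i samples predicted as class-j.
cm = confusion_matrix(y_true, y_pred, normalize='true')
print(cm)                    # per-class recall on the diagonal
print(cm.diagonal().mean())  # average per-class accuracy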
• 1. Class weight
• 2. Re-sampling (e.g. up-sampling, down-sampling)
• 3. Generate synthetic data samples
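Methods 1 and 2 are detailed below. The slides do not elaborate on method 3; one widely used option (an assumption here, not named in the slides) is SMOTE from the imbalanced-learn package. A minimal sketch:

import numpy as np
from imblearn.over_sampling import SMOTE

# Toy imbalanced data: 200 class-0 samples, 40 class-1 samples.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(3, 1, (40, 2))])
y = np.array([0] * 200 + [1] * 40)

# SMOTE interpolates between minority-class neighbors to create
# synthetic samples until the classes are balanced.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(np.bincount(y_res))  # [200 200]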
• Assuming scenario-1: we have sufficient data in each class.
The class imbalance arises because some classes have very
complex patterns and therefore need more data samples.
Class weight
• Let's consider a logistic regression classifier $\hat{y} = f(x)$
$x$ is a data sample, $\hat{y}$ is the predicted class label from the classifier $f(x)$
The cross-entropy loss on a training sample $x_n$ with true class label $y_n$ is $L(f(x_n), y_n)$
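For reference, the binary cross-entropy loss used by logistic regression (spelled out here, since the slides only name it) is
$$L(f(x_n), y_n) = -\big[\, y_n \log f(x_n) + (1 - y_n) \log\big(1 - f(x_n)\big) \,\big]$$
where $f(x_n) \in (0, 1)$ is the predicted probability of class-1.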
• Assume there are two classes
• class-0 has 200 samples: $\{(x_1, y_1), (x_2, y_2), \ldots, (x_{200}, y_{200})\}$
• class-1 has 100 samples: $\{(x_1, y_1), (x_2, y_2), \ldots, (x_{100}, y_{100})\}$
• The total loss for training the classifier is
$$L_{total} = \sum_{m=1}^{200} L(f(x_m), y_m) + \sum_{n=1}^{100} L(f(x_n), y_n)$$
where $m$ indexes the class-0 samples and $n$ indexes the class-1 samples
• $L_{total}$ is largely determined by the samples in class-0 (200 > 100)
• Next, we will modify the total loss such that the classes have equal weight
• Class-balanced loss:
$$L_{total\_cw} = \frac{1}{200} \sum_{m=1}^{200} L(f(x_m), y_m) + \frac{1}{100} \sum_{n=1}^{100} L(f(x_n), y_n)$$
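In scikit-learn, this per-class weighting can be applied through the class_weight parameter. A minimal sketch on toy data; the weights 1/200 and 1/100 mirror the formula above:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy imbalanced data: 200 class-0 samples, 100 class-1 samples.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(2, 1, (100, 2))])
y = np.array([0] * 200 + [1] * 100)

# Weight each class by 1 / (its sample count), as in L_total_cw.
clf = LogisticRegression(class_weight={0: 1/200, 1: 1/100})
clf.fit(X, y)

# class_weight='balanced' uses n_samples / (n_classes * np.bincount(y)),
# which rescales both weights by the same constant, so it is equivalent.
clf_balanced = LogisticRegression(class_weight='balanced')
clf_balanced.fit(X, y)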
Re-sampling
• Example:
class-0: 10,000 samples
class-1: 6,000 samples
class-2: 5,000 samples
Then, we re-sample each class to 8,000 samples
(down-sampling class-0, up-sampling class-1 and class-2):
(1) randomly select 8,000 samples from class-0 (without replacement)
(2) randomly select 8,000 samples from class-1 (with replacement)
(3) randomly select 8,000 samples from class-2 (with replacement)
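A minimal sketch of these three steps with scikit-learn's resample utility; the toy arrays stand in for the real per-class data:

import numpy as np
from sklearn.utils import resample

# Toy stand-ins for the three classes' feature arrays.
rng = np.random.default_rng(0)
class0 = rng.normal(size=(10_000, 4))
class1 = rng.normal(size=(6_000, 4))
class2 = rng.normal(size=(5_000, 4))

# Down-sample class-0 without replacement, up-sample the others
# with replacement, so every class ends up with 8,000 samples.
class0_ds = resample(class0, replace=False, n_samples=8_000, random_state=0)
class1_us = resample(class1, replace=True, n_samples=8_000, random_state=0)
class2_us = resample(class2, replace=True, n_samples=8_000, random_state=0)

print(class0_ds.shape, class1_us.shape, class2_us.shape)
# (8000, 4) (8000, 4) (8000, 4)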