
Working With Imbalanced Data

Working with imbalanced data has its own difficulties, and we need a strategy for handling that specific data and its imbalance.

What is imbalanced data in the first place?
In classification, data is imbalanced when the classes differ significantly in size; even a single class that is much larger or smaller than the rest can cause imbalance.

Even when there is enough information for all classes, imbalance can cause some algorithms to underperform.

However, that doesn't mean any data with uneven classes will cause problems. The level of imbalance that causes problems depends on the task and the algorithm, but in my experience I can confidently say that, as a rule of thumb, an 80% / 20% ratio in binary classification is imbalanced.

Is there enough information in the data?

In certain cases our data may have a 2% / 43% / 55% class distribution, and out of billions of records that 2% can still be enough to gather the information.

But say we have a glioblastoma tumor classification problem, and out of 5+ million records only 12 are positive. We surely need more data in this particular case.

Accuracy is not your only friend
There are metrics you can take into account:

Cohen’s Kappa

ROC Curve

Confusion Matrix

Precision

Recall

F-Score
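As a minimal sketch, all of these metrics are available in scikit-learn. Here they are on a toy set of hard binary predictions (ROC curves additionally need probability scores rather than labels):

```python
from sklearn.metrics import (
    cohen_kappa_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Toy imbalanced ground truth (8 negatives, 2 positives) and predictions.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]

print(confusion_matrix(y_true, y_pred))   # rows = true class, cols = predicted
print(precision_score(y_true, y_pred))    # TP / (TP + FP) = 2/3
print(recall_score(y_true, y_pred))       # TP / (TP + FN) = 1.0
print(f1_score(y_true, y_pred))           # harmonic mean of the two = 0.8
print(cohen_kappa_score(y_true, y_pred))  # agreement corrected for chance
```

Note how a naive "predict all negative" model would score 80% accuracy here while recall collapses to zero; that is exactly why these metrics matter.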

Resample
Some of the techniques to consider:
- Oversample (repopulate) the minority classes.
- Undersample the majority classes.

You can use the Python module 'UnbalancedDataset' (now developed as imbalanced-learn) for SMOTE.

See: SMOTE with Imbalance Data

However, oversampling can in turn cause overfitting. When you test your model in validation, you might want to reconsider this step if you run into overfitting problems.
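As a minimal sketch of the resampling idea without extra dependencies, here is plain random oversampling with scikit-learn's `resample` on synthetic data (SMOTE goes further and interpolates new synthetic points between minority-class neighbours instead of duplicating rows):

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.RandomState(0)
X = rng.randn(100, 2)
y = np.array([0] * 90 + [1] * 10)  # 90/10 imbalance

# Oversample the minority class up to the majority class size.
X_min, X_maj = X[y == 1], X[y == 0]
X_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=0)

X_bal = np.vstack([X_maj, X_up])
y_bal = np.array([0] * len(X_maj) + [1] * len(X_up))
print(np.bincount(y_bal))  # classes are now even
```

The duplicated minority rows are exactly why oversampling can overfit: the model sees the same few points many times.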

Some Algorithms That Are Immune to Imbalance
Support Vector Machine (a penalized one)

Logistic Regression (imbalance only affects the intercept, not the slope)

Boosting methods like Gradient Boosting and XGBoost

Penalized Models
Penalization tweaks the model to give more consideration to the minority classes; by analogy, it ensures equity between classes.

- Penalized SVM
- Penalized Linear Discriminant Analysis (LDA)

You can also use Python's 'costcla' library for cost-sensitive classification alongside scikit-learn/SciPy, or WEKA's CostSensitiveClassifier.

Or you can re-implement a custom algorithm that already worked relatively well with imbalance, adding a penalty strategy on top.
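In scikit-learn, the penalization idea can be sketched through the `class_weight` parameter, which scales the misclassification penalty per class. A minimal example on synthetic, hypothetical data:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Hypothetical 90/10 imbalanced binary problem.
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

# class_weight="balanced" raises the penalty for minority-class mistakes
# inversely to class frequency, so both classes pull on the decision boundary.
clf = SVC(class_weight="balanced").fit(X, y)
print(clf.score(X, y))
```

You can also pass an explicit dict such as `class_weight={0: 1, 1: 9}` when you want to set the costs yourself.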

Is the Problem One-Class Classification?
When there are positive and negative classes, it is a binary classification problem. As a special case of this, if we only have information about one class (for example, classifying whether something is an apple using only apple data), we call it a one-class classification problem.

You can consider using these algorithms:

- One-Class Support Vector Machine
- Elliptic Envelope
- Isolation Forest
- Gaussian K-Means
- Neural network with cost-function weights
- Lastly, BERT, which doesn't need any balancing augmentation
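As one hedged sketch of the one-class setting, here is scikit-learn's Isolation Forest trained only on the known "apple" class (synthetic, hypothetical data):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
# Train only on the known class: one tight cluster around the origin.
X_train = rng.randn(200, 2) * 0.3

forest = IsolationForest(random_state=42).fit(X_train)

# +1 = looks like the training class, -1 = anomaly ("not an apple").
preds = forest.predict([[0.0, 0.0], [5.0, 5.0]])
print(preds)
```

A point near the training cluster is flagged as the known class, while a far-away point is flagged as an anomaly, which is the whole one-class trick: no negative examples were ever needed.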

Imbalance is a common problem from the simplest to the most complex data, but there are many methods we can use, like SMOTE.

Feel free to leave feedback, and thanks for reading!
