
Inverse Random Under Sampling for the Class Imbalance Problem

Muhammad Usama (18K-0200), Muhammad Ahmed (18K-0231), Taha (18K-XXXX)

June 16, 2022

Teacher: Waqas Sheikh

1 Research Goal

Our research goal is to solve the class imbalance problem on a data set that holds true to the concept of class imbalance. For this we have used a novel approach known as the inverse random under sampling technique, which has previously been used to solve the class imbalance problem for multi-label classification. We will apply this technique to our data set.

Inverse random under sampling works by inverting the cardinalities of the majority and minority classes: the majority class is divided into subsets such that each subset is smaller in size than the minority class.

The concept behind inverse random under sampling is to maintain a high true positive rate through imbalance inversion, but this inversion also induces a high false positive rate. The false positive rate can be controlled with an ensemble technique known as bagging, combining the designed detectors using fusion.

2 Retrieving Data

The data set we retrieved is the Glass Identification data set. It is a multi-class data set covering 7 different types of glass. The motivation behind its creation was criminological investigation: glass correctly identified at a scene can be used as evidence. The data set consists of 10 attributes, which are used to identify the type of glass.

The data set was retrieved from GitHub. It did not provide any attribute names, so these were fetched separately from the official UCI website. This data set does indeed contain a relatively large class imbalance problem.

3 Data Exploration

After the data was extracted it was thoroughly inspected for any unexpected or unusual findings. The 10 attributes in our data set are index, refractive index, Sodium, Magnesium, Aluminum, Potassium, Silicon, Calcium, Barium and Iron. The 7 distinct types of glass are building windows processed, building windows non processed, vehicle windows processed, vehicle windows non processed, containers, tableware and head lamps. All attributes consist of continuous values. In all, there are 214 samples in the data set.

Delving a little deeper, we started to observe the class imbalance in our data set. As there are 7 distinct classes, the ideal situation would be for every class to contain an equal number of samples, but that is certainly not the case here. Before evaluating further, let us go over the per-class sample sizes:

• 1st class, building windows processed, contains a sample size of 70 (32.71 %)

• 2nd class, building windows non processed, contains a sample size of 76 (35.51 %)

• 3rd class, vehicle windows processed, contains a sample size of 17 (7.94 %)

• 4th class, vehicle windows non processed, contains a sample size of 0 (0 %)

• 5th class, containers, contains a sample size of 13 (6.07 %)

• 6th class, tableware, contains a sample size of 9 (4.21 %)

• 7th class, head lamps, contains a sample size of 29 (13.55 %)

The above points show exactly how much imbalance exists between the classes. The 4th class, vehicle windows non processed, does not contain a single sample, so it will no longer be considered a valid class and will be dropped, leaving us with only 6 classes. Among these 6 classes it is easy to see that the difference between the class with the maximum sample size (building windows non processed, 76 samples) and the class with the minimum sample size (tableware, only 9 samples) is exceedingly large.

That is not all: the 6 classes can easily be divided into 2 groups.

• 1st group, where the sample size is greater than or equal to 70 samples. It contains 2 classes.

• 2nd group, where the sample size is smaller than even 30 samples. It contains 4 classes.

This produces a great disparity, and the results obtained with any conventional method would be heavily biased and incorrect. The combined samples of all classes in the 2nd group do not even sum up to the sample size of a single class in group 1. This is the primary problem that will be addressed in our project.

After this we explored the data set further. We applied Pearson correlation between the variables, and a heat map was used to clearly isolate every correlation. This gave us a general idea of the variables and their relationships with one another. Only the attribute 'Calcium' was found to have no correlation with the target variable; apart from that, good relationships between the variables and the target variable were found. Missing values are another hindrance in a data set and have a negative impact when training a model. There were no missing values in any column of our data set, which eliminated the need for imputation.

After finding the correlations we started to study the distribution of each variable. A distribution tells us how the values of the data set are spread across a curve. A normal distribution, also known as a Gaussian distribution, is symmetric about the mean, meaning that the majority of the data lies near the mean; it appears as a bell curve when plotted. We find that most variables in our data set have a somewhat skewed distribution which, if left unhandled, will produce incorrect results and distort the purpose of our project. We also used box plots and histograms to better understand the distributions, get an idea of the outliers and observe the individual distribution of each attribute across all 6 classes. As distribution seemed to be the extent of the problems in our data set, we also decided to use violin plots integrated with box plots for every attribute. This gave us a deep perspective on every attribute and on how each one needs to be catered for individually.

A complete dashboard with all the minor details of the data set was created to examine every last detail that might have been missed or overlooked. It reports the total number of values, distinct values, maximum value, missing values, etc. for each column.
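The exploration steps above (per-class counts and percentages, the Pearson correlation matrix behind the heat map, and the skewness check) can be sketched with pandas. The column names and the tiny synthetic frame below are illustrative stand-ins of our own, not the official UCI headers or the real Glass data.

```python
import pandas as pd
import numpy as np

def explore(df: pd.DataFrame, target: str = "Type"):
    """Summarise class balance, Pearson correlations and skewness."""
    counts = df[target].value_counts()
    pct = (100 * counts / len(df)).round(2)              # per-class share, as in Section 3
    features = df.drop(columns=[target])
    corr = features.corr(method="pearson")               # the matrix the heat map visualises
    skew = features.skew()                               # far from 0 = skewed column
    return counts, pct, corr, skew

# Tiny invented frame standing in for the 214-row Glass data.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "RefractiveIndex": rng.normal(1.52, 0.01, 30),
    "Sodium": rng.normal(13.4, 0.8, 30),
    "Calcium": rng.normal(8.9, 1.4, 30),
    "Type": [1] * 20 + [2] * 7 + [3] * 3,                # deliberately imbalanced
})
counts, pct, corr, skew = explore(df)
print(counts.to_dict())  # → {1: 20, 2: 7, 3: 3}
```

The same three summaries on the real data reproduce the percentages listed above and flag the skewed attributes that are handled in the data preparation stage.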

4 Data preparation

The first thing to be done in this stage is dropping the class vehicle windows non processed from our evaluation, as it contains 0 samples. We also observed during the correlation tests that the attribute 'Calcium' has no relation with the target variable, giving us a reason to drop it and reduce the size of our data set.

As we discovered in the data exploration stage, the data in most columns has a skewed distribution, and we further observed that the data is not on the same scale. So the first priority was to normalize the data so that a distribution as close as possible to a normal distribution is achieved; for this purpose a normalize function is used. Plotting the distributions again after normalizing, we observe that the skewness has been considerably reduced. The next task was to bring the data onto the same scale. For this we standardize the data using a standard scaler function that transforms the entire data set into standardized form. When the data is observed again, a bell-shaped curve is achieved for every attribute.

Another problem observed during the exploration stage was the considerable number of outliers. We removed them by computing the z-score and keeping only the entries with a z-score of less than 3. The distribution was plotted again, and this time it is the closest to a normal distribution. The data has now been taken care of in its entirety and is in its cleanest form, apart of course from the class imbalance problem.

5 Data modeling

Here we deal with the problem of class imbalance, and for this we use the technique of Inverse Random Under Sampling. This technique has previously shown great promise in dealing with the class imbalance problem in multi-class data sets.

Before moving on to how we modeled our data set with Inverse Random Under Sampling, let us understand the mechanism behind the technique. It first increases the true positive rate, and then decreases the false positive rate (which rises as a consequence of increasing the true positive rate) by using the technique called bagging.

Bagging is an ensemble classifier that works on a relatively simple idea: it generates many training subsets from a larger complete data set. Each subset is generated randomly, with each sample selected with replacement and equal probability. A prediction method is applied to each subset, and the final prediction is obtained by majority voting.

It is usually the positive class that has the smaller number of samples, so in Inverse Random Under Sampling the data is manipulated in such a way that the sample size of the majority (negative) class is reduced until the sample size of the positive class exceeds it. This happens by continuously under sampling until each subset is smaller than the positive class. The samples of the positive class then outnumber those of the negative class, so the focus shifts towards the positive class. Each training set yields one classifier design with importance towards the positive class. Then, by combining the different designs through fusion, a composite boundary can be constructed.
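The inversion-plus-fusion scheme described above can be sketched in a few lines of numpy. Everything here is an illustrative assumption of ours rather than the paper's implementation: the helper names (`irus_subsets`, `irus_predict`), the toy nearest-centroid detector standing in for the real classifiers, the mean-of-scores fusion, and the synthetic data.

```python
import numpy as np

def irus_subsets(X, y, rng):
    """Split the majority (negative) class into chunks smaller than the
    minority (positive) class, then pair each chunk with ALL positives,
    so that inside every subset the positive class is the bigger one."""
    pos, neg = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    rng.shuffle(neg)
    size = max(1, len(pos) - 1)                   # each chunk < |positive class|
    chunks = [neg[i:i + size] for i in range(0, len(neg), size)]
    return [np.concatenate([pos, c]) for c in chunks]

def centroid_score(Xtr, ytr, Xte):
    """Toy detector: higher score = closer to the positive centroid."""
    cp, cn = Xtr[ytr == 1].mean(0), Xtr[ytr == 0].mean(0)
    return np.linalg.norm(Xte - cn, axis=1) - np.linalg.norm(Xte - cp, axis=1)

def irus_predict(X, y, Xte, rng):
    """One detector per inverted subset; fusion = mean of their scores."""
    scores = [centroid_score(X[idx], y[idx], Xte) for idx in irus_subsets(X, y, rng)]
    return (np.mean(scores, axis=0) > 0).astype(int)

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (40, 2)),         # majority around (0, 0)
               rng.normal(4, 1, (6, 2))])         # minority around (4, 4)
y = np.array([0] * 40 + [1] * 6)
print(irus_predict(X, y, np.array([[4.0, 4.0], [0.0, 0.0]]), rng))  # → [1 0]
```

With 6 positives and 40 negatives, each subset holds all 6 positives against at most 5 negatives, so every detector is trained on an inverted imbalance, exactly as the mechanism above requires.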

Coming back to our data set, we have used the same one-vs-all approach: we iterate over all the minority classes and consider each of them on a binary level. We specify the 3rd, 5th and 6th classes as the set of minority classes. We used the pseudo code of Inverse Random Under Sampling from the paper "Inverse Random Under Sampling for class imbalance problem and its application to multi-label classification", which was provided to us. We used a decision tree as our base classifier, and for bagging we used the mean method for combining. The pseudo code can be observed below.

    def IRUS.fit(attributes, prediction):
        maj, min, d : majority class in prediction, minority class in
            prediction, total classes with value counts in prediction
        compute subsets(attributes, prediction)
        for each subset created:
            Xtrain : subset[remove class column]
            ytrain : subset[only class column]
            model : Decision Tree Classifier
            model.fit(Xtrain, ytrain)
            bagging.append(model)

6 Automation and presentation

We have used K-fold validation to further understand and present the most valid results we could find; we used 5-fold validation for this purpose. To evaluate our model with better evaluation methods we used the F1 score and the ROC-AUC curve. ROC-AUC is a performance metric for classification problems at various thresholds. AUC stands for Area Under the ROC Curve, which means that it measures the entire two-dimensional area underneath the ROC curve. The F1 score, also known as the harmonic mean of precision and recall, is a measure of a model's accuracy on a data set.

To better understand how Inverse Random Under Sampling helps us, we also used several other machine learning models without this technique, to gauge exactly how it performs in comparison to models that do not incorporate Inverse Random Under Sampling. For this purpose the models we used are SVM, KNN, logistic regression, decision tree C4.5 and decision tree CART.

Finally we ran tests on the Inverse Random Under Sampling model against the other models not using the technique. The IRUS model outperformed every other model we used: it got an F1 score of 0.76, while the 2nd highest score was only 0.67, achieved by 2 different models, KNN and SVM. This result is evidence enough that Inverse Random Under Sampling provides a large increase in performance: it helps solve the class imbalance problem and provides an evident jump in results over other conventional methods.

The full table of all the models with their F1 scores and ROC scores is shown below.

    Models                F1 Score   ROC AUC Score
    IRUS                  0.769      0.858
    SVM                   0.676      0.872
    KNN                   0.676      0.858
    Logistic Regression   0.666      0.862
    Decision Tree-C4.5    0.646      0.788
    Decision Tree-CART    0.610      0.794
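The F1 score used for these comparisons follows directly from its definition as the harmonic mean of precision and recall; the confusion counts below are invented purely for illustration, not taken from our runs.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Invented confusion counts for one minority class (one-vs-all view):
# precision = 12/15 = 0.8, recall = 12/17 ≈ 0.706.
print(f1_score(12, 3, 5))  # → 0.75
```

Because it balances precision against recall, F1 is a far more honest summary than plain accuracy on an imbalanced data set like ours, which is why it is reported alongside ROC-AUC in the table.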
