Classification Performance
Irfan Pratama a,*, Putri Taqwa Prasetyaningrum b, Albert Yakobus Chandra c, Ozzi Suria d
a,b,c,d Universitas Mercu Buana Yogyakarta, Yogyakarta, 55283, Indonesia
Corresponding author: Irfanp@mercubuana-yogya.ac.id
Abstract— Imbalanced data refers to a condition in which the number of samples differs between one class and the other class(es). This gives rise to the term "majority" class, for the class with the larger number of instances in the dataset, and "minority" classes, for those with fewer instances. Since educational data mining demands an accurate measurement of student performance, the mining process requires an appropriate dataset to produce good accuracy. This study measures the performance of resampling methods through the classification process on student performance datasets, which are also multi-class datasets; it therefore also measures how those methods perform on a multi-class classification problem. Utilizing four public educational datasets, each containing the results of an educational process, this study aims to get a clearer picture of which resampling methods are suitable for this kind of dataset. More than twenty resampling methods from the SMOTE-variants library are used. To compare how these resampling methods perform, nine classification methods are applied to measure the performance of the resampled data against the non-resampled data. According to the results, SMOTE-ENN is in general the better resampling method, since it produces an F1 score of 0.97 under the Stacking classification method, the highest among all combinations. However, the resampling methods perform rather poorly on the dataset with many label variations. A follow-up study on how class variation affects resampling and classification performance will therefore be conducted.
Manuscript received 15 Oct. 2020; revised 29 Jan. 2021; accepted 2 Feb. 2021. Date of publication 17 Feb. 2021.
International Journal on Informatics Visualization is licensed under a Creative Commons Attribution-Share Alike 4.0 International License.
A. Data Acquisition
This research utilizes four different public datasets retrieved from either Kaggle or the UCI repository. The datasets contain students' performance data, classified into several labels (classes) and with different imbalance ratios. Table I shows the description of the datasets used in this study.
Figure 3. Por dataset
In the resampling step, this study utilizes a variety of resampling methods and measures how each of them performs on the datasets. The study adopts both well-established and recently developed techniques: SMOTE, ENN, SMOTE-ENN, SMOTE-Tomek, Geometric SMOTE, Borderline-SMOTE1, Borderline-SMOTE2, Polynom-fit-SMOTE, ANS, SN-SMOTE, LLE-SMOTE, DEAGO, CURE-SMOTE, Trim-SMOTE, Gaussian-SMOTE, AND-SMOTE, MDO, SMOTE-D, SMOTE-IPF, MWMOTE, MCT, ADASYN [32], and an experimental hybrid resampling method, ADASYN-Tomek.
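The oversampling core shared by most of these variants is SMOTE's neighbour interpolation: a synthetic point is drawn on the line segment between a minority sample and one of its k nearest minority neighbours. As a minimal illustrative sketch (not any library's implementation; the function name and toy data are ours):

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=None):
    """Generate n_new synthetic minority samples by interpolating
    between each chosen minority point and one of its k nearest
    minority neighbours (the core SMOTE idea)."""
    rng = np.random.default_rng(seed)
    n = len(X_min)
    # Pairwise distances within the minority class only
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    k = min(k, n - 1)
    nn = np.argsort(d, axis=1)[:, :k]               # k nearest neighbours per point
    base = rng.integers(0, n, size=n_new)           # pick a minority sample
    nbr = nn[base, rng.integers(0, k, size=n_new)]  # and one of its neighbours
    gap = rng.random((n_new, 1))                    # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[nbr] - X_min[base])

# Toy minority class: six points in 2-D
X_min = np.array([[0., 0.], [1., 0.], [0., 1.],
                  [1., 1.], [0.5, 0.5], [2., 2.]])
X_syn = smote_sketch(X_min, n_new=10, k=3, seed=0)
print(X_syn.shape)  # (10, 2)
```

The variants listed above differ mainly in how the base points and neighbours are selected (e.g. only borderline points for Borderline-SMOTE, density-weighted selection for ADASYN), not in this interpolation step.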
B. Resampling
An imbalanced condition is encountered at the very start of the data mining process, in the data acquisition step. When a dataset is known to be imbalanced, one of the preprocessing steps is to handle the imbalance problem using a resampling mechanism. To assess the quality of the resampled data, a classification procedure should then be carried out; a whole data mining process is therefore needed to implement and evaluate a resampling method on a dataset.
To obtain a thorough and more comprehensive measurement, this study applies several scenarios of the resampling-classification process. This also helps to identify which resampling method is generally better, given the dataset's imbalance state and the classification method's result.

Figure 4. Math dataset

Figure 5. Student dataset

C. Experimental Setting
This study utilizes the sklearn package for Python to carry out all of the data processing. Each imbalanced dataset first undergoes a resampling process, using the methods mentioned in the previous section, to produce a relatively balanced dataset. Each resampled dataset is then trained using several supervised learning algorithms, namely Logistic Regression, K-Nearest Neighbor, CART, Random Forest, SVM, Stacking, XGBoost, XGBRF, and AdaBoost. For evaluation, this study uses two mechanisms: data splitting (80:20 ratio) and cross validation. The F1 score is used to quantify the performance of each classification result. All evaluation results are provided in tables in the upcoming section.

III. RESULTS AND DISCUSSION
A. Resampling Result
In this section, the result of each process is provided along with an explanation. For the first step of the research, the resampling step, the resampling visualization of each dataset, using one of the resampling methods applied in this study (ADASYN-Tomek), can be seen in Figure 6 to Figure 10.
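The cleaning half of the ADASYN-Tomek hybrid removes Tomek links after oversampling: pairs of mutual nearest neighbours that carry different class labels, which typically sit on a noisy class boundary. A minimal sketch of Tomek-link detection (illustrative only; the function name and toy data are ours):

```python
import numpy as np

def tomek_links(X, y):
    """Return index pairs of samples that form Tomek links: mutual
    nearest neighbours with different class labels. Removing such
    pairs (or their majority-class side) cleans the class boundary
    after oversampling."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    nn = d.argmin(axis=1)          # 1-nearest neighbour of each sample
    links = set()
    for i, j in enumerate(nn):
        # Mutual nearest neighbours with different labels form a Tomek link
        if nn[j] == i and y[i] != y[j]:
            links.add(tuple(sorted((int(i), int(j)))))
    return sorted(links)

# Two classes that overlap at exactly one boundary pair
X = np.array([[0., 0.], [0.1, 0.], [5., 5.], [5.1, 5.],
              [2.0, 2.0], [2.1, 2.0]])
y = np.array([0, 0, 1, 1, 0, 1])
print(tomek_links(X, y))  # [(4, 5)]
```

In the hybrid, ADASYN first generates synthetic minority samples, and this detection step then discards the boundary pairs found in the combined data.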
B. Classification Results
Figure 9. Resampled xAPI dataset

After each resampling process, the dataset is classified using the classification algorithms mentioned above. Table II shows the highest F1 score among all resampling and classification methods on each dataset, under both the splitting and the cross-validation evaluation mechanisms.
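The evaluation protocol behind these scores (an 80:20 hold-out split and k-fold cross validation, both scored with macro F1 via sklearn, which the paper names as its toolkit) can be sketched as follows. The synthetic dataset and the two-classifier subset are stand-ins, not the paper's actual data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for an imbalanced multi-class student dataset
X, y = make_classification(n_samples=600, n_classes=3, n_informative=6,
                           weights=[0.7, 0.2, 0.1], random_state=42)

classifiers = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(random_state=42),
}

for name, clf in classifiers.items():
    # Evaluation 1: 80:20 hold-out split, stratified to preserve imbalance
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)
    f1_split = f1_score(y_te, clf.fit(X_tr, y_tr).predict(X_te),
                        average="macro")
    # Evaluation 2: k-fold cross validation on macro F1
    f1_cv = cross_val_score(clf, X, y, cv=5, scoring="f1_macro").mean()
    print(f"{name}: split={f1_split:.2f}  cv={f1_cv:.2f}")
```

In the full experiment each resampler is applied to the training portion only before fitting, and all nine classifiers are run; the loop structure stays the same.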
TABLE II
EXPERIMENT RESULTS

80:20 splitting

Classifier          | xAPI              | Por               | Math                          | Student
Logistic Regression | SMOTE-ENN (0.86)  | Trim-SMOTE (0.92) | Trim-SMOTE (0.79)             | DEAGO / CURE-SMOTE / Trim-SMOTE / MWMOTE (0.93)
K-NN                | ENN (0.95)        | SMOTE-ENN (0.9)   | SMOTE-ENN (0.9)               | SMOTE-ENN (0.86)
CART                | SMOTE-ENN (0.87)  | SMOTE-ENN (0.96)  | SMOTE-ENN (0.93)              | SMOTE-ENN (0.84)
Random Forest       | SMOTE-ENN (0.95)  | MCT (0.97)        | SMOTE-ENN (0.9)               | SMOTE-ENN / Polynom-fit-SMOTE (0.96)
SVM                 | SMOTE-ENN (0.88)  | Trim-SMOTE (0.9)  | SMOTE-ENN (0.86)              | Borderline-SMOTE1 / LLE-SMOTE (0.92)
Stacking            | ENN (0.93)        | SMOTE-ENN (0.97)  | SMOTE-ENN (0.96)              | SMOTE-ENN (0.91)
XGBoost             | SMOTE-ENN (0.97)  | SMOTE-IPF (0.94)  | ENN (0.93)                    | SMOTE-ENN / Borderline-SMOTE1 / CURE-SMOTE / SMOTE-D (0.95)
XGBRF               | SMOTE-ENN (0.88)  | ENN (0.95)        | Polynom-fit-SMOTE (0.94)      | Trim-SMOTE (0.78)
AdaBoost            | Trim-SMOTE (0.88) | ANS (0.86)        | LLE-SMOTE / CURE-SMOTE (0.89) | Trim-SMOTE (0.68)

K-Fold cross validation

Classifier          | xAPI              | Por               | Math                          | Student