
Measuring Resampling Methods on Educational Datasets'

Classification Performance

Irfan Pratama a,*, Putri Taqwa Prasetyaningrum b, Albert Yakobus Chandra c, Ozzi Suria d
a,b,c,d Universitas Mercu Buana Yogyakarta, Yogyakarta, 55283, Indonesia
* Corresponding author: Irfanp@mercubuana-yogya.ac.id

Abstract— Imbalanced data refers to a condition in which the number of samples differs between one class and the other class(es). This gives rise to the terms "majority" class, for the class with more instances in the dataset, and "minority" class(es), for the class(es) with fewer instances. Since educational data mining demands accurate measurement of student performance, the data mining process requires an appropriate dataset to produce good accuracy. This study aims to measure the performance of resampling methods through the classification process on student performance datasets, which are also multi-class datasets; the study therefore also measures how each method performs on a multi-class classification problem. Using four public educational datasets that record the results of an educational process, this study aims to get a clearer picture of which resampling methods suit this kind of data. More than twenty resampling methods from the smote-variants library are used. To compare how these resampling methods perform, nine classification methods are applied to measure the performance of the resampled data against the non-resampled data. According to the results, SMOTE-ENN turns out to be the better resampling method in general, as it produces a 0,97 F1 score under the Stacking classification method, the highest among the others. However, the resampling methods perform rather poorly on the dataset with many label variations. A study on how class variation affects resampling and classification performance will therefore be conducted as future work.

Keywords—Imbalanced Dataset; Resampling; SMOTE-ENN; Classification.

Manuscript received 15 Oct. 2020; revised 29 Jan. 2021; accepted 2 Feb. 2021. Date of publication 17 Feb. 2021.
International Journal on Informatics Visualization is licensed under a Creative Commons Attribution-Share Alike 4.0 International License.

I. INTRODUCTION

Imbalanced data classification is a common problem in the field of data mining [1]. Imbalanced data refers to a condition in which the number of samples differs between one class and the other class(es). This gives rise to the terms "majority" class, for the class with more instances in the dataset, and "minority" class(es), for the class(es) with fewer instances [2]. Imbalanced data can be encountered in various fields of application such as financial services [3], [4], healthcare [5], [6], blockchain [7], [8], and educational data mining [9]–[14]. Educational data mining in particular is emerging because it can help provide quality education, enhancing students' academic performance based on study records processed with machine learning methods [15]. Research on educational data mining is increasing due to the benefits obtained from the knowledge acquired through the machine learning processes, which improves institutional decision making on learning outcomes and planning [16]. Since educational data mining demands accurate measurement of student performance, the data mining process requires an appropriate dataset to produce good accuracy [11]. However, it is practically impossible to achieve good classification accuracy with an imbalanced amount of data per class, as the imbalance decreases the effectiveness of the classifier [9], [11].

There are two mechanisms to overcome the imbalanced class problem: the data-level approach and the algorithmic-level approach [9]. The algorithmic-level approach relies on certain classifiers to classify the imbalanced dataset with a preferable classification output. Cost-sensitive learning, bagging, boosting, and stacking are some of the widely used algorithmic-level techniques [17]. The drawback of the algorithmic-level approach is that it works less efficiently on data with a high imbalance ratio [9]. As the algorithmic approach is less preferable for solving the imbalanced class problem, the data-level approach has become more popular, and several studies have utilized it to overcome the imbalanced class problem [2], [10], [18]–[23].

The data-level approach has two distinctive mechanisms: random under-sampling and random oversampling [24]. Random oversampling randomly generates duplicates of the minority class, while random under-sampling deletes random instances from the majority class. However, excessive oversampling can lead to overfitting, and excessive under-sampling can lead to information loss [25]. Hence, several hybrid resampling methods and enhancements of the basic oversampling methods have been proposed to prevent the overfitting problem, such as SMOTE-Tomek [26], SMOTE-ENN [27], Borderline-SMOTE [28], Geometric SMOTE [29], and other SMOTE variants and derivatives [8].

This study aims to measure the performance of resampling methods through the classification process on student performance datasets, which are also multi-class datasets. Thus, this study also measures how each method performs on a multi-class classification problem.

TABLE I
DATASET DESCRIPTION

Dataset: xAPI-edu-data [30] | Student Performance Dataset Por [31] | Student Performance Dataset Math [31] | Student's Grade Dataset
Number of attributes: 16 | 33 | 33 | 22
Class variation: 3 | 4 | 4 | 7
Missing values: No | No | No | No
Number of instances: 480 | 649 | 649 | 1203
Imbalance ratio: 1:1.6 | 1:21.8 | 1:4.2 | 1:10.7
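The random oversampling and random under-sampling mechanisms described above can be sketched in plain Python. This is a minimal stdlib-only illustration of the two data-level mechanisms, not the smote-variants implementation used later in this study:

```python
import random

def random_oversample(X, y, minority_class, n_extra, seed=0):
    """Random oversampling: append randomly chosen duplicates of
    minority-class rows until n_extra copies have been added."""
    rng = random.Random(seed)
    pool = [row for row, label in zip(X, y) if label == minority_class]
    extra = [rng.choice(pool) for _ in range(n_extra)]
    return X + extra, y + [minority_class] * n_extra

def random_undersample(X, y, majority_class, n_keep, seed=0):
    """Random under-sampling: keep only n_keep randomly chosen
    majority-class rows and drop the rest of that class."""
    rng = random.Random(seed)
    majority_idx = [i for i, label in enumerate(y) if label == majority_class]
    keep = set(rng.sample(majority_idx, n_keep))
    kept = [(row, label) for i, (row, label) in enumerate(zip(X, y))
            if label != majority_class or i in keep]
    return [row for row, _ in kept], [label for _, label in kept]
```

Either function balances a two-class toy set; the hybrid methods above combine both ideas in sequence.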
II. THE MATERIAL AND METHOD
This research consists of two main processes: a data preparation process and a classification process. In the data preparation stage, several steps are carried out: data acquisition, data preprocessing (resampling), and a training–testing data split. In the classification stage, the dataset is handled under two scenarios: classification on the train–test split dataset, and classification on the full dataset using cross-validation.

The imbalance ratio shown in Table I is calculated by dividing the instance count of the majority class by the instance count of the (lowest) minority class, since these are multi-class datasets. The following figures show the class distribution of each dataset.

Figure 1 Research Flowchart

Figure 2 xAPI Dataset
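The imbalance ratio defined above (majority instance count divided by the smallest minority instance count) can be computed from the label column alone. A minimal sketch; the label counts in the usage lines are hypothetical, not the real class sizes of any of the four datasets:

```python
from collections import Counter

def imbalance_ratio(labels):
    """Imbalance ratio as defined for Table I: the majority-class
    instance count divided by the smallest minority-class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

# Hypothetical three-class label column, for illustration only:
labels = ['M'] * 211 + ['H'] * 142 + ['L'] * 127
ratio = imbalance_ratio(labels)  # 211 / 127, roughly 1:1.7
```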

A. Data Acquisition
This research utilizes four different public datasets retrieved from the Kaggle and UCI repositories. The datasets contain students' performance data, classified into several labels (classes) and with different imbalance ratios. Table I shows the description of the datasets used in this study.
Figure 3 Por Dataset
In the resampling step, this study utilizes a variety of resampling methods to measure how they perform on the datasets. This study adopted the best-performing resampling techniques as well as recently developed ones, namely SMOTE, ENN, SMOTE-ENN, SMOTE-Tomek, Geometric SMOTE, Borderline-SMOTE1, Borderline-SMOTE2, Polynom-fit-SMOTE, ANS, SN-SMOTE, LLE-SMOTE, DEAGO, CURE-SMOTE, Trim-SMOTE, Gaussian-SMOTE, AND-SMOTE, MDO, SMOTE-D, SMOTE-IPF, MWMOTE, MCT, ADASYN [32], and the experimental hybrid resampling ADASYN-Tomek.
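Most of the SMOTE variants listed above build on the same core idea: synthesizing new minority samples by interpolating between a minority point and one of its nearest minority neighbors. A minimal stdlib-only sketch of that core step (an illustration, not the smote-variants implementation used in this study):

```python
import random

def smote(minority, n_new, k=3, seed=0):
    """Minimal SMOTE sketch: each synthetic point lies on the segment
    between a random minority sample and one of its k nearest
    minority-class neighbors."""
    rng = random.Random(seed)

    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbors = sorted((p for p in minority if p is not base),
                           key=lambda p: dist2(base, p))[:k]
        nb = rng.choice(neighbors)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, nb)])
    return synthetic
```

Variants such as Borderline-SMOTE or ADASYN mainly change which base points are chosen; hybrid methods add a cleaning step (ENN or Tomek links) afterwards.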

Figure 4 Math Dataset

C. Experimental Setting
This study utilizes the sklearn package in the Python language to carry out all of the data processing. The imbalanced datasets first undergo a resampling process, using the resampling methods mentioned in the previous section, to produce relatively balanced datasets. Each resampled dataset is then trained using several supervised learning algorithms, namely Logistic Regression, K-Nearest Neighbors, CART, Random Forest, SVM, Stacking, XGBoost, XGBRF, and AdaBoost. For evaluation, this study uses two mechanisms: data splitting (80:20 ratio) and cross-validation. The F1 score is used to measure the performance of each classification result. All of the evaluation results are provided in tables in the upcoming section.
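The study reports F1 scores, but the averaging scheme for the multi-class case is not stated; the sketch below assumes macro-averaging, a common choice for multi-class data, computed with the standard library only:

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: compute the per-class F1 score, then average
    with equal weight per class (so minority classes count as much as
    the majority class)."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(f1s) / len(f1s)
```

In the actual experiments, sklearn's `f1_score` with an `average` argument would play this role.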
Figure 5 Student Dataset

B. Resampling
An imbalanced condition is found at the very start of the data mining process, in the data acquisition step. When the dataset is known to be imbalanced, one of the preprocessing steps is handling the imbalance using a resampling mechanism. To assess the quality of the resampled data, a classification procedure should be carried out; therefore, a whole data mining process is needed to implement and evaluate a resampling method on a dataset.

To get a thorough and more comprehensive measurement, this study applies several scenarios of the resampling–classification process. This also helps to find out which resampling method is generally better, according to the dataset's imbalance state and the classification results.

III. RESULTS AND DISCUSSION

A. Resampling Result
In this section, the result of each process is provided along with an explanation. For the first step of the research, the resampling step, the class distribution of each dataset resampled with one of the methods used in this study (ADASYN-Tomek) can be seen in Figure 6 to Figure 10.

Figure 6 Resampling Stage (Data Acquisition → Resampling → Balanced Data)

Figure 7 Resampled Por Dataset


Figure 8 Resampled Math Dataset

Figure 10 Resampled Student Dataset

The sample resampling method shown in the figures is ADASYN-Tomek. The dataset is first oversampled using ADASYN, then resampled a second time using the Tomek-links under-sampling method. As a result of this resampling, the class instance counts in the datasets are not strictly equal, but still relatively balanced.
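The second stage of such a hybrid (the Tomek-links cleaning step shared by ADASYN-Tomek and SMOTE-Tomek) can be sketched as follows. This is an illustrative stdlib-only version, not the library implementation; it drops only the majority-side member of each Tomek link, i.e. of each pair of mutual nearest neighbors with different labels:

```python
def remove_tomek_links(X, y, majority_class):
    """Drop majority-class points that participate in a Tomek link:
    two samples that are each other's nearest neighbor but carry
    different labels (a sign of class overlap or noise)."""

    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    def nearest(i):
        return min((j for j in range(len(X)) if j != i),
                   key=lambda j: dist2(X[i], X[j]))

    drop = set()
    for i in range(len(X)):
        j = nearest(i)
        if y[i] != y[j] and nearest(j) == i:  # mutual NN, opposite class
            if y[i] == majority_class:
                drop.add(i)
            if y[j] == majority_class:
                drop.add(j)
    X_clean = [x for k, x in enumerate(X) if k not in drop]
    y_clean = [label for k, label in enumerate(y) if k not in drop]
    return X_clean, y_clean
```

Running this after an oversampler explains why the resampled class counts above are close to, but not exactly, equal.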

Figure 9 Resampled xAPI Dataset

B. Classification Results
After each resampling process, the dataset is classified using the classification algorithms mentioned above. Table II shows the highest F1 score among all resampling and classification methods on each dataset, under both the splitting and the cross-validation evaluation mechanisms.
TABLE II
EXPERIMENT RESULTS

80:20 splitting
Classifier | xAPI | Por | Math | Student
Logistic Regression | SMOTEENN (0,86) | DEAGO / CURE-SMOTE / Trim-SMOTE / MWMOTE (0,93) | Trim-SMOTE (0,92) | Trim-SMOTE (0,79)
K-NN | ENN (0,95) | SMOTEENN (0,9) | SMOTEENN (0,9) | SMOTEENN (0,86)
CART | SMOTEENN (0,87) | SMOTEENN (0,96) | SMOTEENN (0,93) | SMOTEENN (0,84)
Random Forest | SMOTEENN (0,95) | SMOTEENN (0,97) | MCT / Polynom-fit-SMOTE (0,96) | SMOTEENN (0,9)
SVM | SMOTEENN (0,88) | Borderline-SMOTE1 / LLE-SMOTE (0,92) | Trim-SMOTE (0,9) | SMOTEENN (0,86)
Stacking | ENN (0,93) | SMOTEENN (0,97) | SMOTEENN (0,96) | SMOTEENN (0,91)
XGBoost | SMOTEENN (0,97) | Borderline-SMOTE1 / Cure-SMOTE / SMOTE-D (0,95) | SMOTE-IPF (0,94) | ENN (0,93)
XGBRF | SMOTEENN (0,88) | ENN (0,95) | Polynom-fit-SMOTE (0,94) | Trim-SMOTE (0,78)
AdaBoost | Trim-SMOTE (0,88) | ANS (0,86) | LLE-SMOTE / Cure-SMOTE (0,89) | Trim-SMOTE (0,68)

K-Fold
Classifier | xAPI | Por | Math | Student
Logistic Regression | SMOTEENN (0,88) | SMOTEENN / Trim-SMOTE (0,92) | SMOTEENN (0,89) | Trim-SMOTE (0,77)
K-NN | SMOTE (0,97) | SMOTEENN (0,94) | SMOTEENN (0,9) | SMOTEENN (0,91)
CART | SMOTEENN (0,89) | SMOTEENN (0,94) | SMOTEENN (0,93) | SMOTEENN (0,94)
Random Forest | SMOTEENN (0,91) | SMOTEENN (0,97) | SMOTEENN (0,95) | SMOTEENN (0,91)
SVM | Trim-SMOTE (0,90) | LLE-SMOTE (0,9) | SMOTE-IPF (0,85) | SMOTEENN (0,88)
Stacking | SMOTEENN (0,94) | SMOTEENN (0,98) | SMOTEENN (0,96) | SMOTEENN (0,93)
XGBoost | SMOTEENN (0,93) | SMOTEENN (0,96) | SMOTEENN (0,95) | SMOTEENN (0,83)
XGBRF | Trim-SMOTE (0,89) | Borderline-SMOTE1 / Polynom-fit-SMOTE / LLE-SMOTE / DEAGO / Cure-SMOTE (0,93) | Polynom-fit-SMOTE (0,93) | Trim-SMOTE (0,79)
AdaBoost | SMOTE-Tomek (0,97) | ENN (0,86) | Cure-SMOTE (0,9) | Trim-SMOTE (0,71)
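The K-Fold half of the results above relies on cross-validation. A minimal sketch of a plain k-fold index splitter (the study itself would use sklearn's cross-validation utilities; this stdlib version only illustrates the mechanism):

```python
import random

def k_fold_indices(n, k, seed=0):
    """Plain k-fold splitter: shuffle the n row indices once, deal them
    into k folds, then use each fold as the test set in turn while the
    remaining folds form the training set."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test
```

One caveat for imbalanced data: resampling should be fitted inside each training fold only, otherwise synthetic copies of a test sample can leak into training.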
Throughout the experiment, each resampling method kept its default parameters, and so did the classification methods. From Table II it can be seen that each scenario on each dataset performs differently. In most scenarios SMOTEENN performs better than any other resampling method, producing good classifier performance under both evaluation mechanisms. It seems that the SMOTEENN resampling method pairs well with any of the classification methods for handling the imbalanced class problem. However, other resampling methods such as ENN, Trim-SMOTE, Polynom-fit-SMOTE, LLE-SMOTE, DEAGO, Cure-SMOTE, Borderline-SMOTE1, MCT, MWMOTE, SMOTE, and SMOTE-Tomek can outperform SMOTEENN in some cases. Among all the results in Table II, the highest F1 score is produced by the Stacking classification method with SMOTEENN resampling, at 0,98 under k-fold evaluation, slightly higher than the 0,97 achieved under the splitting evaluation by the same pair, as well as by the XGBoost and SMOTEENN pair.

C. Influence of the resampling method on classification performance

This research utilizes as many resampling methods as possible to measure how the resampled datasets behave in the classification process. On the first dataset (xAPI), the average over all classification methods and all resampling techniques is an F1 score of 0,78, which is a decent score. When this is broken down into each classification method's average F1 score over all resampling techniques, Random Forest produces the highest, 0,86, measured under the 80:20 splitting evaluation.

TABLE VI
AVERAGE F1 SCORE OF EACH CLASSIFIER ON XAPI DATA

Classifier | Average F1 score over every resampling technique | No-resampling classification result
Logistic Regression | 0,75 | 0,69
K-NN | 0,72 | 0,63
CART | 0,78 | 0,71
Random Forest | 0,86 | 0,8
SVM | 0,69 | 0,60
Stacking | 0,81 | 0,77
XGBoost | 0,84 | 0,75
XGBRF | 0,78 | 0,74
AdaBoost | 0,73 | 0,68

TABLE VII
AVERAGE F1 SCORE OF EACH CLASSIFIER ON POR DATA

Classifier | Average F1 score over every resampling technique | No-resampling classification result
Logistic Regression | 0,87 | 0,78
K-NN | 0,76 | 0,51
CART | 0,87 | 0,71
Random Forest | 0,92 | 0,8
SVM | 0,84 | 0,59
Stacking | 0,93 | 0,81
XGBoost | 0,92 | 0,81
XGBRF | 0,87 | 0,83
AdaBoost | 0,59 | 0,70

TABLE VIII
AVERAGE F1 SCORE OF EACH CLASSIFIER ON MATH DATA

Classifier | Average F1 score over every resampling technique | No-resampling classification result
Logistic Regression | 0,82 | 0,78
K-NN | 0,69 | 0,57
CART | 0,83 | 0,78
Random Forest | 0,90 | 0,83
SVM | 0,74 | 0,54
Stacking | 0,89 | 0,82
XGBoost | 0,89 | 0,82
XGBRF | 0,87 | 0,83
AdaBoost | 0,78 | 0,90

TABLE IX
AVERAGE F1 SCORE OF EACH CLASSIFIER ON STUDENT DATA

Classifier | Average F1 score over every resampling technique | No-resampling classification result
Logistic Regression | 0,43 | 0,24
K-NN | 0,61 | 0,25
CART | 0,57 | 0,28
Random Forest | 0,67 | 0,26
SVM | 0,60 | 0,22
Stacking | 0,64 | 0,28
XGBoost | 0,59 | 0,31
XGBRF | 0,48 | 0,28
AdaBoost | 0,40 | 0,30

From Tables VI to IX, it can be seen that the resampling methods improve most of the classification performances, with Random Forest producing the best average F1 score among all the classification results. Some differences occur: the Por dataset produces its best average F1 score under the Stacking classifier, and AdaBoost produces the highest F1 score (0,90) among all the no-resampling results on the Math dataset. Despite all the differences in the experiment results, the goal of this study has been fulfilled, in that most of the resampling methods enhance the classification results compared to the no-resampling scenario.

D. Influence of the dataset condition on classification performance

According to Table II, the performance of each resampling and classification method differs per dataset, especially on the student dataset compared to the others. That dataset has more class variation than the others, which may cause the differences in the methods' performance. With a broader variation of classes and a different imbalance scale for each class, the methods may find it hard to synthesize minority samples, because there are too many classes. Compared with the student dataset, on the Por and Math datasets, which have a relatively similar imbalance condition, the resampling and classification methods perform well and produce good scores, considerably better when compared head-to-head per classification method against the student dataset.

The class variation may influence the methods' performance, but this is not solidly proven either. The highest results, cumulatively, are produced by most of the classification and resampling methods on the Por and Math datasets, which have four label variations and whose imbalance degrees between the majority and the minority class are 1:21.8 and 1:4.2 respectively.

IV. CONCLUSION

From this study, several things have become known. In general, resampling methods have an impact on classification results, as shown in Tables VI to IX. More specifically, Random Forest dominates in almost every scenario on every dataset. The best resampling method for improving the results, according to the experiment, is SMOTEENN, even though it excels with different classifiers on each dataset. It can be inferred that a resampling method cannot be perfectly stable across different dataset conditions, or that this instability may be caused by the evaluation mechanisms (splitting and k-fold). The different imbalance conditions of the datasets affect how each classifier performs. This may be caused by how well the resampling method synthesizes the data: for example, on the Por dataset, SMOTE-ENN classified using Stacking produces a 0,97 F1 score, while the same resampling method and classifier on the student dataset produce only a 0,13 F1 score. Despite all the differences in the experiment results, the goal of this study has been fulfilled, in that most of the resampling methods enhance the classification results compared to the no-resampling scenario. Future work will dig deeper into why the resampling methods cannot handle larger class variation, since the F1 scores on the student dataset are lower than on the other datasets.

REFERENCES
[1] G. Haixiang, L. Yijing, J. Shang, G. Mingyun, H. Yuanyue, and G. Bing, "Learning from class-imbalanced data: Review of methods and applications," Expert Syst. Appl., vol. 73, pp. 220–239, 2017, doi: 10.1016/j.eswa.2016.12.035.
[2] Z. Xu, D. Shen, T. Nie, and Y. Kou, "A hybrid sampling algorithm combining M-SMOTE and ENN based on Random forest for medical imbalanced data," J. Biomed. Inform., vol. 107, p. 103465, 2020, doi: 10.1016/j.jbi.2020.103465.
[3] S. Makki, Z. Assaghir, Y. Taher, R. Haque, M. S. Hacid, and H. Zeineddine, "An Experimental Study With Imbalanced Classification Approaches for Credit Card Fraud Detection," IEEE Access, vol. 7, pp. 93010–93022, 2019, doi: 10.1109/ACCESS.2019.2927266.
[4] Y. Zhang and P. Trubey, "Machine Learning and Sampling Scheme: An Empirical Study of Money Laundering Detection," Comput. Econ., vol. 54, no. 3, pp. 1043–1063, 2019, doi: 10.1007/s10614-018-9864-z.
[5] B. A. Akinnuwesi et al., "Application of intelligence-based computational techniques for classification and early differential diagnosis of COVID-19 disease," Data Sci. Manag., vol. 4, pp. 10–18, 2021, doi: 10.1016/j.dsm.2021.12.001.
[6] G. Fan, Z. Deng, Q. Ye, and B. Wang, "Machine learning-based prediction models for patients no-show in online outpatient appointments," Data Sci. Manag., vol. 2, pp. 45–52, 2021, doi: 10.1016/j.dsm.2021.06.002.
[7] M. A. Harlev, H. S. Yin, K. C. Langenheldt, R. R. Mukkamala, and R. Vatrapu, "Breaking bad: De-anonymising entity types on the bitcoin blockchain using supervised machine learning," Proc. Annu. Hawaii Int. Conf. Syst. Sci., vol. 2018-Janua, pp. 3497–3506, 2018, doi: 10.24251/hicss.2018.443.
[8] I. Alarab and S. Prakoonwit, "Effect of data resampling on feature importance in imbalanced blockchain data: Comparison studies of resampling techniques," Data Sci. Manag., vol. 5, no. 2, pp. 66–76, 2022, doi: 10.1016/j.dsm.2022.04.003.
[9] Y. Pristyanto, I. Pratama, and A. F. Nugraha, "Data level approach for imbalanced class handling on educational data mining multiclass classification," in 2018 International Conference on Information and Communications Technology, ICOIACT 2018, 2018, vol. 2018-Janua, doi: 10.1109/ICOIACT.2018.8350792.
[10] E. Buraimoh, R. Ajoodha, and K. Padayachee, "Importance of Data Re-Sampling and Dimensionality Reduction in Predicting Students' Success," 3rd Int. Conf. Electr. Commun. Comput. Eng. ICECCE 2021, pp. 12–13, 2021, doi: 10.1109/ICECCE52056.2021.9514123.
[11] D. Jahin, I. J. Emu, S. Akter, M. J. A. Patwary, M. A. S. Bhuiyan, and M. H. Miraz, "A Novel Oversampling Technique to Solve Class Imbalance Problem: A Case Study of Students' Grades Evaluation," in 2021 International Conference on Computing, Networking, Telecommunications & Engineering Sciences Applications (CoNTESA), 2021, pp. 69–75, doi: 10.1109/CoNTESA52813.2021.9657151.
[12] M. Utari, B. Warsito, and R. Kusumaningrum, "Implementation of Data Mining for Drop-Out Prediction using Random Forest Method," 2020 8th Int. Conf. Inf. Commun. Technol. ICoICT 2020, 2020, doi: 10.1109/ICoICT49345.2020.9166276.
[13] R. Ghorbani and R. Ghousi, "Comparing Different Resampling Methods in Predicting Students' Performance Using Machine Learning Techniques," IEEE Access, vol. 8, pp. 67899–67911, 2020, doi: 10.1109/ACCESS.2020.2986809.
[14] M. Revathy, S. Kamalakkannan, and P. Kavitha, "Machine Learning based Prediction of Dropout Students from the Education University using SMOTE," in 2022 4th International Conference on Smart Systems and Inventive Technology (ICSSIT), 2022, pp. 1750–1758, doi: 10.1109/ICSSIT53264.2022.9716450.
[15] P. Dabhade, R. Agarwal, K. P. Alameen, A. T. Fathima, R. Sridharan, and G. Gopakumar, "Educational data mining for predicting students' academic performance using machine learning algorithms," Mater. Today Proc., vol. 47, pp. 5260–5267, 2021, doi: 10.1016/j.matpr.2021.05.646.
[16] A. I. Adekitan and O. Salau, "The impact of engineering students' performance in the first three years on their graduation result using educational data mining," Heliyon, vol. 5, no. 2, p. e01250, 2019, doi: 10.1016/j.heliyon.2019.e01250.
[17] D. Zhang, W. Liu, X. Gong, and H. Jin, "A Novel Improved SMOTE Resampling Algorithm Based on Fractal," J. Comput. Inf. Syst., vol. 7, Jun. 2011.
[18] A. Aditsania, Adiwijaya, and A. L. Saonard, "Handling imbalanced data in churn prediction using ADASYN and backpropagation algorithm," Proceeding - 2017 3rd Int. Conf. Sci. Inf. Technol. Theory Appl. IT Educ. Ind. Soc. Big Data Era, ICSITech 2017, vol. 2018-Janua, pp. 533–536, 2017, doi: 10.1109/ICSITech.2017.8257170.
[19] S. Ahmed, A. Mahbub, F. Rayhan, R. Jani, S. Shatabda, and D. M. Farid, "Hybrid Methods for Class Imbalance Learning Employing Bagging with Sampling Techniques," 2nd Int. Conf. Comput. Syst. Inf. Technol. Sustain. Solut. CSITSS 2017, pp. 1–5, 2018, doi: 10.1109/CSITSS.2017.8447799.
[20] B. S. Raghuwanshi and S. Shukla, "SMOTE based class-specific extreme learning machine for imbalanced learning," Knowledge-Based Syst., vol. 187, p. 104814, 2020, doi: 10.1016/j.knosys.2019.06.022.
[21] D. Bajer, B. Zonć, M. Dudjak, and G. Martinović, "Performance Analysis of SMOTE-based Oversampling Techniques When Dealing with Data Imbalance," in 2019 International Conference on Systems, Signals and Image Processing (IWSSIP), 2019, pp. 265–271, doi: 10.1109/IWSSIP.2019.8787306.
[22] F. Sağlam and M. A. Cengiz, "A novel SMOTE-based resampling technique trough noise detection and the boosting procedure," Expert Syst. Appl., vol. 200, pp. 1–12, 2022, doi: 10.1016/j.eswa.2022.117023.
[23] H. Guo, J. Zhou, and C.-A. Wu, "Imbalanced Learning Based on Data-Partition and SMOTE," Information, vol. 9, no. 9, 2018, doi: 10.3390/info9090238.
[24] A. Fernández, S. García, M. Galar, R. C. Prati, B. Krawczyk, and F. Herrera, Data Level Preprocessing Methods. 2018, doi: 10.1007/978-3-319-98074-4_5.
[25] Y. Pristyanto, N. A. Setiawan, and I. Ardiyanto, "Hybrid resampling to handle imbalanced class on classification of student performance in classroom," Proc. - 2017 1st Int. Conf. Informatics Comput. Sci. ICICoS 2017, vol. 2018-Janua, pp. 207–212, 2017, doi: 10.1109/ICICOS.2017.8276363.
[26] M. Zeng, B. Zou, F. Wei, X. Liu, and L. Wang, "Effective prediction of three common diseases by combining SMOTE with Tomek links technique for imbalanced medical data," in 2016 IEEE International Conference of Online Analysis and Computing Science (ICOACS), 2016, pp. 225–228, doi: 10.1109/ICOACS.2016.7563084.
[27] G. E. A. P. A. Batista, R. C. Prati, and M. C. Monard, "A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data," SIGKDD Explor. Newsl., vol. 6, no. 1, pp. 20–29, Jun. 2004, doi: 10.1145/1007730.1007735.
[28] H. Han, W. Y. Wang, and B. H. Mao, "Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning," Lect. Notes Comput. Sci., vol. 3644, no. PART I, pp. 878–887, 2005, doi: 10.1007/11538059_91.
[29] G. Douzas and F. Bacao, "Geometric SMOTE a geometrically enhanced drop-in replacement for SMOTE," Inf. Sci. (Ny)., vol. 501, pp. 118–135, 2019, doi: 10.1016/j.ins.2019.06.007.
[30] E. A. Amrieh, T. Hamtini, and I. Aljarah, "Mining Educational Data to Predict Student's academic Performance using Ensemble Methods," Int. J. Database Theory Appl., vol. 9, no. 8, pp. 119–136, 2016, doi: 10.14257/ijdta.2016.9.8.13.
[31] P. Cortez and A. Silva, "Using data mining to predict secondary school student performance," 15th Eur. Concurr. Eng. Conf. 2008, ECEC 2008 - 5th Futur. Bus. Technol. Conf. FUBUTEC 2008, vol. 2003, no. 2000, pp. 5–12, 2008.
[32] G. Kovács, "Smote-variants: A python implementation of 85 minority oversampling techniques," Neurocomputing, vol. 366, pp. 352–354, 2019, doi: 10.1016/j.neucom.2019.06.100.
