Comparison of Classification Data Mining C4.5 and Naïve Bayes Algorithms of EDM Dataset
All content following this page was uploaded by Muhammad Arifin on 10 July 2022.
TEM Journal. Volume 10, Issue 4, Pages 1738‐1744, ISSN 2217‐8309, DOI: 10.18421/TEM104‐34, November 2021.
Further research by Pujianto [12] addressed diabetes patients with HbA1c measurement. That paper compares two classification methods, C4.5 and Naïve Bayes, on HbA1c measurements to assess the performance of the two methods. Both classification techniques are combined with preprocessing methods, namely the Synthetic Minority Over-Sampling Technique (SMOTE) and the Wrapper feature selection method. The results show that the C4.5 method gives the best performance in classifying diabetic patients, with an accuracy of 82.74%, a precision of 87.1%, and a recall of 82.7%.

Based on related research, the best classification method differs according to the case studied. The purpose of this study is to evaluate the performance of the C4.5 and Naïve Bayes classification methods by performing a validation test with 10-fold cross-validation and a t-test for differences [13]. The case raised is Educational Data Mining (EDM) [14] on the student graduation dataset in the research conducted in [15], [16]. By varying the dataset used as training and testing data, it is hoped that the selection of the best classification method can be evaluated.

2. Methodology

2.1. Dataset

The dataset used is the graduation data of students majoring in informatics engineering at University XYZ, consisting of 79 records of students who graduated in various cohorts. The attributes used are regional origin, type of school, entrance, cumulative graduation predicate (IPK), first-semester graduation predicate (IP1), second-semester graduation predicate (IP2), third-semester graduation predicate (IP3), fourth-semester graduation predicate (IP4), fifth-semester graduation predicate (IP5), boarding school, and information. The student graduation dataset is shown in Table 1 below:
Table 3. Entrance
No  Entrance             Score
1   Mandiri Ujian Tulis  1
2   SNMPTN Ujian Tulis   2
3   Mandiri Prestasi     3
4   SNMPTN Undangan      4
5   SPMB - PTAIN         5

2.2. Classification Algorithms

The proposed classification approach aims to strike a balance between the classification methods used by comparing the performance of these models [6]. The methods used are the decision tree (C4.5) and the traditional statistical classifier (Naïve Bayes) [17].

2.3. Model Validation

The validation model [18] used is stratified 10-fold cross-validation, which means dividing the training dataset into 10 equal parts and running the learning process 10 times, each time holding out one part for testing and training on the remaining parts. Several tests report that stratification in this validation model slightly improves the results [11].

2.4. Model Evaluation

The area under the curve (AUC) is applied as an accuracy indicator in order to improve convergence across experiments. Guidance for classifying accuracy using AUC is shown in Table 4 [6].

Table 4. AUC value
AUC          Meaning
0.90 - 1.00  Excellent Classification
0.80 - 0.90  Good Classification
0.70 - 0.80  Fair Classification
0.60 - 0.70  Poor Classification
< 0.60       Failure
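The AUC guidance in Table 4 can be illustrated with a minimal Python sketch. The function names `auc_from_scores` and `auc_category` are hypothetical and are not part of the study, which used RapidMiner. Because the bands in Table 4 share their endpoints, the sketch assumes that each lower bound is inclusive; the source does not say how boundary values are treated.

```python
def auc_from_scores(scores, labels):
    """AUC as the probability that a random positive example receives a
    higher score than a random negative one (ties count as one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# (lower bound, label) pairs taken from Table 4.
# Assumption: the lower bound of each band is inclusive.
AUC_BANDS = [
    (0.90, "Excellent Classification"),
    (0.80, "Good Classification"),
    (0.70, "Fair Classification"),
    (0.60, "Poor Classification"),
]


def auc_category(auc):
    """Map an AUC value in [0, 1] to its Table 4 category."""
    if not 0.0 <= auc <= 1.0:
        raise ValueError("AUC must lie between 0 and 1")
    for lower, label in AUC_BANDS:
        if auc >= lower:
            return label
    return "Failure"
```

Under this reading, the AUC values reported later in the paper map as follows: `auc_category(0.75)` falls in the "Fair Classification" band, while `auc_category(0.35)` and `auc_category(0.425)` both fall in "Failure".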
This stage of the analysis uses RapidMiner software to compare the classification methods on the student graduation dataset, which is divided into three datasets. The design model for comparing the C4.5 method with Naïve Bayes in RapidMiner is shown in Figure 1 below:

Figure 1. Classification Method Comparison Design Model (a)(b)(c)

In Figure 1 (a), the input consists of three different datasets (25, 50 and 79 records) in Excel format (.xls). The design uses the Multiply operator, which acts as a bridge so that both classification methods (C4.5 and Naïve Bayes) can be compared at once. Each method then uses the Cross Validation operator, which applies the 10-fold cross-validation test to the training and testing data (b) (c). A statistical test, the t-test, is then used to compare the two methods against each other. The following are the results of the comparative analysis of the C4.5 and Naïve Bayes methods for the various datasets (25, 50 and 79 records).

Figure 2. C4.5 accuracy results (79 records)
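The comparison design described above (stratified 10-fold cross-validation for each classifier, followed by a t-test on the results) can be sketched in plain Python. This is an illustrative sketch only: the study itself used RapidMiner operators, the function names here are hypothetical, and the paired form of the t-test over per-fold accuracies is an assumption, since the source does not state which variant RapidMiner's T-Test operator computes.

```python
import math
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Assign record indices to k folds so that every fold keeps roughly
    the same class proportions as the full dataset."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    offset = 0
    for y in sorted(by_class):
        idxs = by_class[y]
        rng.shuffle(idxs)
        # Deal the records of this class round-robin, continuing where the
        # previous class stopped so that fold sizes differ by at most one.
        for j, idx in enumerate(idxs):
            folds[(offset + j) % k].append(idx)
        offset += len(idxs)
    return folds


def train_test_splits(labels, k=10, seed=0):
    """Yield (train_indices, test_indices); each fold is the test set once."""
    folds = stratified_folds(labels, k, seed)
    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired per-fold scores."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two paired scores")
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    if var == 0:
        raise ValueError("differences have zero variance")
    return mean / math.sqrt(var / n), n - 1
```

In use, each classifier would be trained and scored once per split from `train_test_splits`, and the two resulting lists of per-fold accuracies would be passed to `paired_t_statistic`. The t value is then compared against the critical value for the returned degrees of freedom to decide whether the difference between the methods is significant.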
In Figure 3, the best AUC value is 0.350, which falls in the "Failure" category.

b) Naïve Bayes

The following are the results of the analysis of the Naïve Bayes method using RapidMiner software, for the validation test with 10-fold cross-validation and the t-test for differences, as shown in Figures 9 and 10 below:

Figure 11. T-Test Statistics Test (50 records)

From the t-test above, the comparison between the C4.5 and Naïve Bayes methods shows no significant difference (H0).

Table 6. Comparison results of all tests (50 records)
Dataset  C4.5 Accuracy  C4.5 AUC value  Naïve Bayes Accuracy  Naïve Bayes AUC value
50       76%            0.35            80%                   0.75

Based on the table above, the Naïve Bayes algorithm has the highest accuracy, 80%, while C4.5 reaches 76%. The ROC curve (AUC) test likewise shows that Naïve Bayes achieved the best AUC value, namely 0.75, whereas C4.5 falls in the "Failure" category because its AUC is below 0.60.

3.3. Results of Comparative Analysis of Methods with Dataset 1 (25 records)

Figure 13. Result of AUC (Area Under the ROC Curve) for C4.5 (25 records)

In Figure 13, the best AUC value is 0.425, which falls in the "Failure" category.

b) Naïve Bayes

The following are the results of the analysis of the Naïve Bayes method using RapidMiner software, for the validation test with 10-fold cross-validation and the t-test for differences, as shown in Figures 14 and 15 below:
3.4. Discussion

Experiments were carried out on a laptop with an Intel Core i5 processor, 8 GB of RAM, and the Windows 8 operating system. Applications

Figure 17. Comparison graph of all tests (a)(b)