Article Illiteracy Clustering Using The K Means Method (Upload JOIV Politeknik Negeri Padang)
Abstract— Illiteracy is a condition in which a person is unable to read and write, both for communication and for expressing opinions in
society. Illiteracy is a problem present in almost all countries. In Indonesia, the Indonesian Central Bureau of Statistics recorded that
3.7 million people were still illiterate in 2017. One way to overcome illiteracy is to map the population based on the level of illiteracy
so that areas with high illiteracy can be prioritized. Because the illiterate population data do not yet have classes, the first step is to
group the data into classes using a clustering method so that each region can be labeled based on its illiteracy class. This study aims to
find the optimal number of classes for clustering the level of illiteracy. The k-means clustering algorithm is used to form the classes and
is applied in 4 different scenarios with k = 2, k = 3, k = 5, and k = 7. Before clustering, preprocessing is performed, which includes data
collection, attribute selection, data integration, multicollinearity testing, data normalization, and outlier detection. After
preprocessing, the data are clustered using the k-means algorithm and the result is validated using the silhouette coefficient (SC). The SC
value for each k is as follows: k = 2 (0.24), k = 3 (0.29), k = 5 (0.25), and k = 7 (0.24). It can therefore be concluded that the optimal
number of classes is k = 3.
The Value of Silhouette Coefficient (SC)    The Strength Level of Cluster
0.7 < SC ≤ 1                                Very Strong
0.5 < SC ≤ 0.7                              Strong
0.25 < SC ≤ 0.5                             Moderate
SC ≤ 0.25                                   Poor

E. Evaluation
After the testing or validation process is complete, the evaluation results are presented as the silhouette coefficient value for each of the different k values, so that the k values with the highest and lowest SC can be identified.

IV. RESULT AND DISCUSSION
A. Pre-processing Result
In the pre-processing stage, data collection and integration are carried out, where life expectancy is denoted as F1, per capita expenditure as F2, percentage of poor population as F3, number of people who did not graduate from elementary school as F4, literacy rate as F5, duration of schooling as F6, and number of people who cannot read and write as F7. The results of attribute selection and

Based on Equation 2, the z-score normalized value (v') is obtained by subtracting the mean of attribute A (Ā) from the attribute value (v) and then dividing by the standard deviation of attribute A (σ_A). The results of the data transformation using the z-score can be seen in Table 3.

TABLE III
3 OF 462 SAMPLES OF TRANSFORMED DATA
Region           F1      F2     F3      F4     F5      F6      F7
Kab. Simeulue    1.99    0.24   0.092   0.7    0.575   0.46    0.71
Kab. Nias        0.396   0.82   0.031   0.3    0.112   0.88    0.04
Kab. Siak        0.688   0.16   0.107   0.52   0.452   0.121   0.22

In Table 3, a negative value indicates that the data point is below the overall average, while a positive value indicates that it is above the overall average. The results of the data transformation shown in Table 3 are used as the dataset for the clustering process, which uses the k-means algorithm to form illiteracy classes.
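The z-score transformation described for Equation 2 can be illustrated with a minimal sketch. The raw values below are hypothetical and are not taken from the study's dataset, and the use of the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, pstdev

def z_score(values):
    """Return (v - mean) / std for every v in one attribute column."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumption)
    if sigma == 0:
        raise ValueError("constant column: z-score is undefined")
    return [(v - mu) / sigma for v in values]

# Hypothetical raw values for one attribute (not the study's data).
raw_f1 = [64.0, 68.0, 70.0, 66.0, 72.0]
z_f1 = z_score(raw_f1)

for v, z in zip(raw_f1, z_f1):
    side = "below" if z < 0 else "above" if z > 0 else "equal to"
    print(f"{v:5.1f} -> z = {z:+.3f} ({side} the attribute mean)")
```

After the transformation, each attribute has mean 0 and standard deviation 1, so the sign of a value directly shows whether the region lies below or above the average for that attribute.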
Before clustering, outlier detection is performed by examining the z-score of each object. An object is marked as an outlier when its z-score falls below -3 or above 3 [15]. This process identified 35 outlier objects, which were deleted from the dataset, leaving 462 objects.

B. Clustering Using K-means Algorithm
The clustering process is carried out three times, with k = 2, k = 3, and k = 5. The initial cluster centers for k = 2 are shown in Table 4.

TABLE IV
INITIAL CLUSTER CENTER
Data    Cluster 1     Cluster 2
F1      -2.6896804    -0.3949074
F2       0.7216268    -0.4789252
F3       0.0297028    -0.0967458
F4      -0.4836585     1.3934143
F5      -0.2990172    -0.4923361
F6      -0.6783040     0.3913993
F7      -0.1172511     0.4710987

TABLE VI
10 OF 114 SAMPLES OF CLUSTERING RESULT
Region                Label        Value of Euclidean Distance
Kab. Simeulue         Cluster 1    2.063785
Kab. Aceh Singkil     Cluster 1    1.28987
Kab. Aceh Selatan     Cluster 1    1.214199
Kab. Aceh Tenggara    Cluster 2    1.996979
Kab. Aceh Timur       Cluster 1    2.08613
Kab. Aceh Tengah      Cluster 2    1.12503
Kab. Aceh Barat       Cluster 1    1.82489
Kab. Aceh Besar       Cluster 2    1.336122
Kab. Pidie            Cluster 2    1.210859
Kab. Bireuen          Cluster 2    2.163445
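Each k-means iteration assigns an object to the center with the smallest Euclidean distance, which is the kind of label and distance pair reported in the clustering result table. The sketch below shows one such assignment step using the initial centers for k = 2 from Table IV. The sample object is hypothetical, so its output is not expected to match any row of the study's results.

```python
import math

# Initial cluster centers for k = 2, copied from Table IV (F1..F7).
centers = {
    "Cluster 1": [-2.6896804, 0.7216268, 0.0297028, -0.4836585,
                  -0.2990172, -0.6783040, -0.1172511],
    "Cluster 2": [-0.3949074, -0.4789252, -0.0967458, 1.3934143,
                  -0.4923361, 0.3913993, 0.4710987],
}

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def assign(obj, centers):
    """Return the label of the nearest center and the distance to it."""
    distances = {label: euclidean(obj, c) for label, c in centers.items()}
    label = min(distances, key=distances.get)
    return label, distances[label]

# Hypothetical z-scored object (not a row from the study's dataset).
sample = [0.5, -0.2, 0.1, 1.0, -0.4, 0.3, 0.6]
label, dist = assign(sample, centers)
print(label, round(dist, 6))
```

In a full run, this step alternates with recomputing each center as the mean of its members until no object changes cluster.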
Based on Table 4, the initial centroid selected for cluster 1 is the 27th object and the initial centroid for cluster 2 is the 275th object. With k = 2, the clustering process iterates 21 times until no object changes cluster. The final cluster centers can be seen in Table 5.

TABLE V
FINAL CLUSTER CENTER
Data    Cluster 1      Cluster 2
F1      -0.5635997      0.7654722
F2      -0.3155703      0.5524862
F3      -0.0112379     -0.1187075
F4       0.02913571    -0.2352607
F5      -0.3788719      0.7062448
F6       0.0142193      0.3591641
F7       0.0184489     -0.2833912

From the results shown in Table 6, 247 regions are grouped in cluster 1 and 215 regions are grouped in cluster 2. The grouping process is then repeated in the same way for k = 3 and k = 5. After all the grouping processes are done, the number of members of each cluster is obtained, as shown in Table 7.

TABLE VII
MEMBERS OF EACH CLUSTER
Value of k    Label        Number of Members    Percentage
k = 2         Cluster 1    247                  54%
              Cluster 2    215                  46%
k = 3         Cluster 1    204                  44%
              Cluster 2    177                  38%
              Cluster 3    81                   17%
k = 5         Cluster 1    30                   6%
              Cluster 2    59                   12%
              Cluster 3    155                  33%
              Cluster 4    100                  21%
              Cluster 5    118                  25%
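The comparison of k values by cluster size and silhouette coefficient can be sketched as follows. This is an illustration under stated assumptions, not the study's implementation: the data are synthetic two-dimensional points, the farthest-first initialization is swapped in to keep the demo deterministic (the text reports which objects served as initial centroids but not how they were chosen), and a singleton cluster is given a silhouette of 0 by convention. The `strength` function maps an SC value to the strength levels listed in the silhouette coefficient table.

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def init_centers(data, k):
    """Farthest-first initialization (assumption, deterministic for the demo)."""
    centers = [list(data[0])]
    while len(centers) < k:
        nxt = max(data, key=lambda p: min(dist(p, c) for c in centers))
        centers.append(list(nxt))
    return centers

def kmeans(data, k, max_iter=100):
    """Lloyd's k-means: reassign and update until no object changes cluster."""
    centers = init_centers(data, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: dist(p, centers[j])) for p in data]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(data, labels) if lab == j]
            if members:  # an emptied cluster keeps its previous center
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def silhouette(data, labels):
    """Mean silhouette coefficient (SC) over all objects."""
    groups = {}
    for p, lab in zip(data, labels):
        groups.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(data, labels):
        own = groups[lab]
        if len(own) == 1:  # singleton cluster: score 0 by convention
            scores.append(0.0)
            continue
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(dist(p, q) for q in g) / len(g)
                for other, g in groups.items() if other != lab)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

def strength(sc):
    """Strength level of a clustering, following the SC table."""
    if sc > 0.7:
        return "Very Strong"
    if sc > 0.5:
        return "Strong"
    if sc > 0.25:
        return "Moderate"
    return "Poor"

# Synthetic 2-D data: three tight groups of nine points (not the study's data).
offsets = [(dx * 0.4, dy * 0.4) for dx in (-1, 0, 1) for dy in (-1, 0, 1)]
data = [(cx + ox, cy + oy) for cx, cy in [(0, 0), (10, 0), (5, 9)] for ox, oy in offsets]

results = {}
for k in (2, 3, 5, 7):
    labels = kmeans(data, k)
    sizes = [labels.count(j) for j in range(k)]
    results[k] = silhouette(data, labels)
    print(f"k = {k}: sizes = {sizes}, SC = {results[k]:.2f} ({strength(results[k])})")

best_k = max(results, key=results.get)
print("highest SC at k =", best_k)
```

On real data the groups are rarely this well separated, so SC values are typically much lower than in this synthetic case, and the choice of k rests on relative differences between the scenarios.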
REFERENCES