
Illiteracy Clustering Using the K-means Method

Anwar Ludfianto, Triyanna Widiyaningtyas

Electrical Department, State University of Malang, Jl. Semarang 5, Malang, 65145, Indonesia
E-mail: anwarnagato@gmail.com, triyannaw.ft@um.ac.id

Abstract— Illiteracy is a condition in which a person is unable to read and write, whether for communication or for expressing
opinions in society. Illiteracy is a problem in almost all countries. In Indonesia, the Indonesian Central Bureau of Statistics recorded
that 3.7 million people were still illiterate in 2017. One way to overcome illiteracy is to map the population by level of illiteracy so
that areas with high illiteracy can be prioritized. Because the illiterate population data do not yet have classes, the first step is to
group the data into classes using a clustering method so that each region can be labeled with an illiteracy class. This study aims to
find the optimal number of classes for clustering the level of illiteracy. The k-means clustering algorithm is applied in four
scenarios, with k = 2, k = 3, k = 5, and k = 7. Before clustering, the data are preprocessed. Preprocessing includes data collection,
attribute selection, data integration, multicollinearity testing, data normalization, and outlier detection. After preprocessing, the data
are clustered with the k-means algorithm and the result is validated using the silhouette coefficient (SC). The SC value for each k is
as follows: k = 2 gives 0.24, k = 3 gives 0.29, k = 5 gives 0.25, and k = 7 gives 0.24. It can therefore be concluded that the optimal
number of classes is k = 3.

Keywords— Illiteracy, Clustering, K-means.

I. INTRODUCTION

Education is one of the main needs that must be met, because education helps individuals develop and understand science well. However, not all regions receive proper education, which leads to early school dropouts. This in turn causes side effects such as illiteracy and can lead to a decrease in the quality of education [1].

One of the indicators of education is the illiteracy rate, which describes the residents of a certain age and area who cannot read and write Latin letters. Illiteracy often occurs in third-world or developing countries, including Indonesia, whose large area and widely spread population make education difficult to control. The Central Bureau of Statistics noted that in 2009 there were 8.7 million people affected by illiteracy [2], while in 2017 there were 3.4 million, spread across various regions of Indonesia [3].

Even though the number of illiterate people continues to decrease each year, some regions still have a high level of illiteracy: 11 provinces still have illiteracy rates above the national figure, namely Papua (28.75 percent), NTB (7.91 percent), NTT (5.15 percent), West Sulawesi (4.58 percent), West Kalimantan (4.50 percent), South Sulawesi (4.49 percent), Bali (3.57 percent), East Java (3.47 percent), North Kalimantan (2.90 percent), Southeast Sulawesi (2.74 percent), and Central Java (2.20 percent) [4]. This is a reference for the government to map the districts and cities that have a high percentage so that the problem can be addressed.

Population mapping can be done using a classification method. Classification is a way to group data into classes based on characteristics, differences, and similarities [5]. Several algorithms can be used for classification, such as C4.5, Random Forest, CART, K-Nearest Neighbor (KNN), Naïve Bayes, and Decision Tree [6]. Before classifying, however, the data must first be clustered, because the data used in this research do not yet have classes. Clustering is used to form the classes, and the method used is the K-means clustering algorithm. This method was chosen
because it is easy to implement and modify to determine the classes, so that the optimal number of classes can be found [7]. In addition, this method has advantages in time and storage complexity, and it is easy to adjust by changing values such as the distance function and the maximum number of iterations [8].

II. PROPOSED ALGORITHM
A. K-means
K-means is a non-hierarchical clustering method that groups data into one or more clusters [9]. The algorithm works by grouping data that have the same characteristics into one class. K-means clusters objects based on their attributes into k partitions, where k < n. The algorithm performs well on data with numerical attributes, and one of its advantages is that it can cluster large data sets very quickly [10]. Even so, K-means has some disadvantages: the amount of data used strongly influences clustering performance, and the number of clusters K must be fixed in advance, so an optimal K value must be determined [11].
The K-means pseudocode is as follows:

Input:
    D = {d1, d2, ..., dn}   // set of n data items
    k                       // number of desired clusters

Output:
    A set of k clusters.

Steps:
    1. Arbitrarily choose k data items from D as initial centroids;
    2. Repeat
           Assign each item di to the cluster which has the closest centroid;
           Calculate the new mean for each cluster;
       Until the convergence criterion is met.

Fig. 1 Pseudocode of the K-means algorithm

III. METHODOLOGY
A. Data Research
The data used as input for clustering are illiterate population data and supporting data obtained from the Indonesian Central Bureau of Statistics, collected from the government website data.go.id. The data used are the 2010 illiterate population data and their components. The data consist of 7 attributes and 497 instances. In this study, the object of the research is data from 94 districts in 34 provinces in Indonesia for 2010, which will later be clustered to obtain the target class in the form of an illiteracy level. In determining the label of each instance, 8 initial attributes are used as indicators, including: "pengeluaran_perkapita" (per capita expenditure), "angka_harapan_hidup" (life expectancy), "jumlah_penduduk_miskin" (number of poor people), "tingkat_pendidikan" (education level), "jumlah_perempuan (tidak lulus sd)" (number of women who did not finish elementary school), "jumlah_laki_laki (tidak lulus sd)" (number of men who did not finish elementary school), "lama_sekolah" (length of schooling), and "angka_melek_huruf dan tidak_baca_tulis" (literacy rate and inability to read and write).

B. Pre-processing
Pre-processing in this research includes data collection, data selection, data integration, multicollinearity testing, data transformation, and outlier detection.
1) Data Collection: Data collection is done to find data related to the factors that influence the object of the research, namely illiteracy. The data can be found by searching the data.go.id website.
2) Data Selection: Each data set obtained contains many attributes, but not all of them will be used, so attributes need to be selected by looking for those considered to have an influence on the object of research.
3) Data Integration: After the attributes considered to have an influence on the research object have been selected, the next step is to combine all the data obtained into one table.
4) Multicollinearity Test: The multicollinearity test checks whether a predictor variable has a strong relationship (correlation) with other predictor variables. This test is done to avoid predictor variables that carry redundant information.
5) Data Transformation: Data transformation is a data mining technique that changes the values in the existing data set. The technique used in this research is z-score normalization.
6) Outlier Detection: This stage also uses z-score normalization, because the technique can also be used to detect outliers. A data item is an outlier when its transformed value is smaller than -3.00 or greater than 3.00.

C. Clustering Process

[Flowchart: Start; Illiterate datasets; Determine the value of k; Clustering process using k-means; Save the clustering result; Clustering has been done three times; Finish]

Fig. 2 Sequence of the research
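The pseudocode of Fig. 1 and the repeat-for-each-k loop of Fig. 2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function names `euclidean` and `kmeans`, the optional `init` argument, the `seed` and `max_iter` defaults, and the toy data are all hypothetical. Convergence is taken to mean that no item changes cluster.

```python
import math
import random

def euclidean(p, q):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def kmeans(data, k, init=None, max_iter=100, seed=0):
    """Plain K-means: returns (labels, centroids).

    init is an optional list of k row indices used as initial centroids
    (hypothetical convenience); otherwise k rows are chosen at random.
    """
    rng = random.Random(seed)
    start = init if init is not None else rng.sample(range(len(data)), k)
    centroids = [list(data[i]) for i in start]
    labels = [None] * len(data)
    for _ in range(max_iter):
        # Assignment step: each item goes to the closest centroid.
        new_labels = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in data
        ]
        if new_labels == labels:  # no item moved, so we have converged
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(data, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Toy data (not from the paper): two well-separated groups.
toy = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]

# As in Fig. 2: run the clustering once per k and save each result.
results = {k: kmeans(toy, k) for k in (2, 3, 5)}
```

Passing `init=[0, 3]` fixes the starting centroids to the first and fourth toy rows, which makes a run reproducible in the same way the study reports specific objects as initial cluster centers.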
Figure 2 shows the stages of the clustering process using the K-means method. The first step is preparing the dataset to be used. Then the value of k, which sets how many classes are made, is determined. After that the results are saved for the validation process. The steps are repeated three times to produce clusters with k = 2, then k = 3 and k = 5.

D. Validation and Evaluation
Several steps require validation to ensure the process is running well. The first is validating the clustering results using the Silhouette Coefficient method. The Silhouette Coefficient method is an internal validation method for clustering that looks at the closeness between objects in one cluster [12]. The stages for obtaining Silhouette Coefficient values are as follows [13].
(1) Calculate the average Euclidean distance from object i to all other objects in the same cluster. This average is used as the $a_i$ value.
(2) Calculate the average Euclidean distance from object i to all objects in each other cluster, and take the smallest of these averages. This smallest average is used as the $b_i$ value.
(3) Calculate the Silhouette Coefficient (SC) for object i using the following equation.

$$SC_i = \frac{b_i - a_i}{\max(a_i, b_i)} \tag{1}$$

Whether a cluster is good is assessed from the closeness between objects in the cluster: the closer the SC value is to one, the better the cluster. The full scale of cluster strength is given in Table 1.

TABLE I
SILHOUETTE COEFFICIENT VALUE REFERENCE

Silhouette Coefficient (SC) value    Cluster strength level
0.7 < SC ≤ 1                         Very Strong
0.5 < SC ≤ 0.7                       Strong
0.25 < SC ≤ 0.5                      Moderate
SC ≤ 0.25                            Poor

E. Evaluation
After the validation process is complete, the evaluation results are obtained as a Silhouette Coefficient value for each k value, so that the highest and lowest SC values among the k values can be found.

IV. RESULT AND DISCUSSION
A. Pre-processing Result
In the pre-processing stage, data collection and integration are carried out, where life expectancy is denoted F1, per capita expenditure F2, percentage of poor population F3, number of people not graduating from elementary school F4, literacy rate F5, duration of schooling F6, and number of people who cannot read and write F7. The results of attribute selection and data integration, as well as the multicollinearity test, can be seen in Table 2.

TABLE II
5 OF 462 SAMPLE DATA OF ILLITERACY

Region          F1   F2    F3   F4    F5     F6     F7
Kab. Simeulue   62   618   23   120   98.6   8.52   5195
Kab. Nias       69   607   19   620   90.4   6.4    3676
Kab. Siak       71   644   6    630   98.5   9      1809
Kab. Fakfak     70   589   33   409   97.4   9.2    3792
Kab. Merauke    62   597   14   629   97.9   9.3    1602

The data in Table 2 are transformed because the ranges of values differ, which would make it difficult to calculate distances between variables. The data to be transformed are the life expectancy, per capita expenditure, percentage of poor people, number of people not graduating from elementary school, literacy rate, length of schooling, and not-reading (the population that cannot read and write) columns, because these data become the predictor data used to determine clusters at the next stage.

The transformation technique used is z-score normalization. With the z-score normalization method, the existing data are normalized based on the mean and standard deviation of each attribute [14]. Z-score normalization uses the following equation:

$$v' = \frac{v - \bar{A}}{\sigma_A} \tag{2}$$

Based on Equation 2, the z-score normalized value ($v'$) is obtained by subtracting the mean of attribute A ($\bar{A}$) from the attribute value ($v$) and then dividing by the standard deviation of attribute A ($\sigma_A$). The results of data transformation using z-score can be seen in Table 3.

TABLE III
3 OF 462 SAMPLES OF TRANSFORMED DATA

Region          F1      F2     F3      F4     F5      F6      F7
Kab. Simeulue   1.99    0.24   0.092   0.7    0.575   0.46    0.71
Kab. Nias       0.396   0.82   0.031   0.3    0.112   0.88    0.04
Kab. Siak       0.688   0.16   0.107   0.52   0.452   0.121   0.22

In Table 3, a negative value indicates that the data item is below the overall average, while a positive value indicates that it is above the overall average. The results of the data transformation shown in Table 3 are used as the dataset for the clustering process using the K-means algorithm to form illiteracy classes.
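The z-score normalization of Equation 2, together with the rule that values below -3 or above 3 mark an outlier, can be sketched as follows. This is an illustrative sketch, not the study's code: the function names `zscore_columns` and `drop_outliers` and the toy inputs are hypothetical, and the population standard deviation (`statistics.pstdev`) is an assumption, since the paper does not say whether the population or sample formula was used.

```python
from statistics import mean, pstdev

def zscore_columns(rows):
    """Apply v' = (v - mean_A) / sigma_A to every column of a row-major table."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    # Assumption: population standard deviation; a constant column maps to 0.0.
    sigmas = [pstdev(c) for c in cols]
    return [
        [(v - m) / s if s else 0.0 for v, m, s in zip(row, mus, sigmas)]
        for row in rows
    ]

def drop_outliers(z_rows, limit=3.0):
    """Keep only rows whose z-scores all lie within [-limit, limit]."""
    return [r for r in z_rows if all(abs(z) <= limit for z in r)]

# Toy table (not from the paper): two attributes on very different scales.
demo = [[2.0, 10.0], [4.0, 20.0], [6.0, 30.0]]
z_demo = zscore_columns(demo)

# Toy column with one extreme value: ten zeros and a single 11.
spiky = [[0.0]] * 10 + [[11.0]]
kept = drop_outliers(zscore_columns(spiky))
```

After normalization both toy attributes share the same scale, so neither dominates a Euclidean distance. In the second example the extreme value has a z-score of about 3.16 and is the only row removed.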
Before clustering, however, outlier detection needs to be done by examining the z-score of each object. Outliers are values with an abnormal distribution, that is, values below -3 or above 3 [15]. This process identified 35 objects as outliers, which were then deleted from the dataset, leaving 462 objects.

B. Clustering Using K-means Algorithm
The clustering process is carried out three times, with k = 2, k = 3, and k = 5. The initial step of determining clusters with k = 2 is shown in Table 4.

TABLE IV
INITIAL CLUSTER CENTER

Data   Cluster 1     Cluster 2
F1     -2.6896804    -0.3949074
F2      0.7216268    -0.4789252
F3      0.0297028    -0.0967458
F4     -0.4836585     1.3934143
F5     -0.2990172    -0.4923361
F6     -0.6783040     0.3913993
F7     -0.1172511     0.4710987

Based on Table 4, the initial centroid selected for cluster 1 is the 27th object and the initial centroid for cluster 2 is the 275th object. With k = 2, the clustering process iterates 21 times until no object changes cluster. The final cluster centers can be seen in Table 5.

TABLE V
FINAL CLUSTER CENTER

Data   Cluster 1     Cluster 2
F1     -0.5635997     0.7654722
F2     -0.3155703     0.5524862
F3     -0.0112379    -0.1187075
F4      0.02913571   -0.2352607
F5     -0.3788719     0.7062448
F6      0.0142193     0.3591641
F7      0.0184489    -0.2833912

In Table 5, it can be seen that the cluster 1 average is -0.17249 and the cluster 2 average is 0.2494. The conclusion is that cluster 1 is lower than cluster 2 in some aspects of life, which means cluster 1 is a high illiteracy class and cluster 2 is a low illiteracy class. The final results of the grouping process using the K-means algorithm with k = 2 are shown in Table 6.

TABLE VI
10 OF 114 SAMPLES OF CLUSTERING RESULT

Region               Label       Value of Euclidean Distance
Kab. Simeulue        Cluster 1   2.063785
Kab. Aceh Singkil    Cluster 1   1.28987
Kab. Aceh Selatan    Cluster 1   1.214199
Kab. Aceh Tenggara   Cluster 2   1.996979
Kab. Aceh Timur      Cluster 1   2.08613
Kab. Aceh Tengah     Cluster 2   1.12503
Kab. Aceh Barat      Cluster 1   1.82489
Kab. Aceh Besar      Cluster 2   1.336122
Kab. Pidie           Cluster 2   1.210859
Kab. Bireuen         Cluster 2   2.163445

From the results shown in Table 6, 247 regions are grouped in cluster 1 and 215 regions in cluster 2. The grouping process is then repeated in the same way for k = 3 and k = 5. After all the grouping processes are done, the number of members of each cluster is obtained as shown in Table 7.

TABLE VII
MEMBERS OF EACH CLUSTER

Value of k   Label       Number of Members   Percentage
k=2          Cluster 1   247                 54%
             Cluster 2   215                 46%
k=3          Cluster 1   204                 44%
             Cluster 2   177                 38%
             Cluster 3   81                  17%
k=5          Cluster 1   30                  6%
             Cluster 2   59                  12%
             Cluster 3   155                 33%
             Cluster 4   100                 21%
             Cluster 5   118                 25%

After the clustering process has been run three times, the clustering results are validated using the Silhouette Coefficient method as described in Equation 1. The results of the validation are shown in Table 8. According to Table 8, the best clustering result occurs at k = 3, with a Silhouette Coefficient value of 0.29 in the moderate category, while the worst occurs at k = 2, with a Silhouette Coefficient value of 0.24 in the poor category.
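The validation step can be sketched as below: a per-object silhouette following Equation 1, averaged over all objects, plus a lookup of the strength levels in Table 1. This is a sketch under assumptions rather than the study's procedure: the names `silhouette` and `strength_level` and the toy data are hypothetical, an object that is alone in its cluster is given a score of 0 by convention, and at least two clusters are required.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def silhouette(data, labels):
    """Mean of SC_i = (b_i - a_i) / max(a_i, b_i) over all objects."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            scores.append(0.0)  # assumed convention for single-member clusters
            continue
        # a_i: average distance to the other members of the same cluster.
        a = sum(euclidean(data[i], data[j]) for j in own) / len(own)
        # b_i: smallest average distance to the members of any other cluster.
        b = min(
            sum(euclidean(data[i], data[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

def strength_level(sc):
    """Map an SC value to the strength levels listed in Table 1."""
    if sc > 0.7:
        return "Very Strong"
    if sc > 0.5:
        return "Strong"
    if sc > 0.25:
        return "Moderate"
    return "Poor"

# Toy example (not from the paper): two tight, well-separated 1-D clusters.
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
sc = silhouette(points, [0, 0, 1, 1])
level = strength_level(sc)
```

Under the thresholds of Table 1 a value of exactly 0.25 falls in the Poor band, so a rounded 0.25 that is reported as Moderate was presumably slightly above 0.25 before rounding.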
TABLE VIII
VALIDATION RESULT OF SILHOUETTE COEFFICIENT

Number of Classes   Value of Silhouette Coefficient   Level of Cluster Strength
2                   0.24                              Poor
3                   0.29                              Moderate
5                   0.25                              Moderate

V. CONCLUSIONS
Based on the results and discussion, it can be concluded that:
1. K-means clustering can be used to cluster illiteracy rates.
2. The most optimal number of illiteracy classes is three, consisting of high illiteracy, moderate illiteracy, and low illiteracy, with 204 members in the high illiteracy class, 177 members in the moderate illiteracy class, and 81 members in the low illiteracy class.
3. Of the three clustering processes carried out, evaluation using the Silhouette Coefficient shows that k = 3 gives the best result, with an SC value of 0.29 and a moderate cluster strength level. Second is clustering with five classes, with an SC value of 0.25 and a moderate cluster strength level. Third is clustering with two classes, with an SC value of 0.24 and a poor cluster strength level.

To get better results, future research is recommended to:
• Add other data variables that have an influence on the extent of illiteracy.
• Use other clustering methods or improve the performance of the K-means clustering method.
• Increase the number of observations to improve data accuracy.

REFERENCES

[1] R. D. Bekti, E. Irwansyah, and Andiyono, “Angka Buta Huruf Melalui Geographically Weighted Regression: Studi Kasus Provinsi Jawa Timur,” Comtech, vol. 4, no. 1, pp. 443–449, 2013.

[2] Mariyono, “Strategi Pemberantasan Buta Aksara Melalui Penggunaan Teknik Metastasis Berbasis Keluarga,” Pancaran, vol. 5, no. 1, pp. 55–66, 2016.

[3] Antara, “Indonesia Peringkat Keempat Penduduk dengan Buta Huruf Terbanyak,” Media Indonesia, 2017. [Online]. Available: https://mediaindonesia.com/read/detail/121475-indonesia-peringkat-keempat-penduduk-dengan-buta-huruf-terbanyak. [Accessed: 24-Jul-2019].

[4] B. O. Sepling Paling, Irsan Hengky Sulendorong, “Pendampingan Masyarakat Melalui Program Keaksaraan di Distrik Walelagama, Wamena Kabupaten Jayawijaya, Papua,” vol. 3, no. 1, pp. 11–20, 2019.

[5] O. Irvi and A. A. Supianto, “Perbandingan Teknik Klasifikasi Dalam Data Mining untuk Bank Direct Marketing,” vol. 5, no. 5, pp. 567–576, 2018.

[6] I. Ahmad, H. Mabuchi, M. Kano, S. Hasebe, Y. Inoue, and H. Uegaki, Data-based fault diagnosis of power cable system: Comparative study of k-NN, ANN, random forest, and CART, vol. 18, no. PART 1. IFAC, 2011.

[7] N. Wakhidah, “Clustering Menggunakan K-Means Algorithm (K-Means Algorithm Clustering),” Fak. Teknol. Inf., vol. 21, no. 1, pp. 70–80, 2014.

[8] F. E. M. Agustin, A. Fitria, and A. Hanifah, “Implementasi Algoritma K-Means untuk Menentukan Kelompok Pengayaan Materi Mata Pelajaran Ujian Nasional (Studi Kasus: SMP Negeri 101 Jakarta),” J. Tek. Inform., vol. 8, no. 1, pp. 73–78, 2015.

[9] M. Jain and C. Verma, “Adapting k-means for Clustering in Big Data,” vol. 101, no. 1, pp. 19–24, 2014.

[10] B. M. Metisen and H. L. Sari, “Analisis Clustering Menggunakan K-means Dalam Pengelompokkan Penjualan Produk Pada Swalayan Fadhila,” J. Media Infotama, vol. 11, no. 2, pp. 110–118, 2015.

[11] K. Singh, D. Malik, and N. Sharma, “Evolving limitations in K-means algorithm in data mining and their removal,” IJCEM Int. J. Comput. Eng. Manag. ISSN, vol. 12, no. April, pp. 2230–7893, 2011.

[12] M. Anggara, H. Sujiani, and H. Nasution, “Pemilihan Distance Measure Pada K-Means Clustering Untuk Pengelompokkan Member Di Alvaro Fitness,” vol. 1, no. 1, pp. 1–6, 2016.

[13] D. F. Azuri and R. S. Pontoh, “Pengelompokkan Kabupaten / Kota Di Pulau Jawa Berdasarkan Pembangunan Manusia Berbasis Gender Menggunakan Bisecting K-Means,” pp. 27–28, 2016.

[14] H. Junaedi, H. Budianto, and I. Maryati, “Data transformation pada data mining,” vol. 7, pp. 93–99, 2011.

[15] A. Klaster and A. Rosiatun, “Analisis Klaster Untuk Segmentasi Pemirsa Program Berita Sore Stasiun TV Swasta,” pp. 93–102.
