
IMPLEMENTATION DATA MINING WITH K-MEANS ALGORITHM FOR CLUSTERING DISTRIBUTION RABIES CASE AREA IN PALEMBANG CITY
Kintan Rahayu (1), Leni Novianti (2), Meivi Kusnandar (3)
(1,2,3) Program Studi Manajemen Informatika, Politeknik Negeri Sriwijaya
(2) leninovianti16@gmail.com

Abstract. This research addresses the clustering of rabies cases using the CRISP-DM methodology,
which proceeds through business understanding, data understanding, data preparation, modeling,
evaluation, and deployment. Data with similar characteristics are grouped into one cluster, and
data with different characteristics are grouped into other clusters. The clustering technique used
is the K-Means algorithm, implemented in RapidMiner software. In this case, the data are divided
into three clusters. The first cluster contains areas with high levels of rabies cases, the second
cluster contains areas with medium levels of rabies cases, and the third cluster contains areas
with low levels of rabies cases. Based on the results, of the 16 sub-districts, 7 are classified as
areas with high levels of rabies cases (C1) (Kertapati, SU II, Plaju, Kemuning, Kalidoni, Sako,
and Alang-Alang Lebar), 4 are classified as areas with medium levels of rabies cases (C2) (SU I,
IB I, IT II, and Sukarami), and the remaining 5 are classified as areas with low levels of rabies
cases (C3) (IB II, Gandus, Bukit Kecil, IT I, and Sematang Borang).
Keywords: CRISP-DM, K-Means, Cluster, RapidMiner

1. Introduction
Rabies is a zoonotic disease that can attack all warm-blooded animals and humans. Rabies, also
known as mad dog disease, is an infection caused by the rabies virus. According to the WHO, domestic
dogs are the most common reservoir of the rabies virus, with over 95% of human deaths caused by dogs
carrying the virus. The virus is transmitted through the saliva of an infected animal, which carries
it into the human body. The indicators used in rabies control efforts are cases of bites from
rabies-transmitting animals (GHPR), bite cases vaccinated with the anti-rabies vaccine (VAR), and
positive rabies cases and deaths based on the lyssa test [1].
Rabies in Indonesia is still a serious animal disease and is included among the strategic priority
infectious animal diseases because it affects the social economy and public health. Rabies cases in both
animals and humans almost always end in death (case fatality rate 100%), and as a result the disease
causes fear and anxiety in society. In addition, those at high risk of contracting the virus include
children who live in areas prone to animal bites, lower-income communities, and remote areas where
public awareness of and access to environmental health is low. This causes spatial variability between
regions, each with its own specific characteristics [4]. In this study, the cases are clustered by
sub-district. Clustering is a part of data mining, whose ultimate goal is to find previously unknown
patterns and trends in databases and to use that information to build predictive models [6].
Clustering is commonly used to map or group objects, for example with the k-means method.
This technique has been used many times by earlier researchers to cluster diseases in Indonesia, such as
tuberculosis (TB) [4] and dengue fever [7]. There is even research that groups natural-disaster-prone
provinces into several clusters to serve as an early warning system that informs the public of high-risk
locations and disaster-safe locations [3], as well as research on clustering rice crops with the help of
RapidMiner software [5].
This research addresses the clustering of rabies cases using the CRISP-DM methodology, which
proceeds through business understanding, data understanding, data preparation, modeling,
evaluation, and deployment. The main purpose of this research is for the clustering results to serve
as input for the government, especially Palembang City, in its efforts to disseminate information
about the dangers of rabies, which often ends in death, and to control the cases.

2. Study Area Description


The objects and data sources used in this research come from a government institution, the
Palembang Health Office, located at Jalan Merdeka No. 72 Ka, 19 Ilir, Bukit Kecil, Palembang.

2.1 K-Means
K-Means (KM) clustering is a widely used partitioning method. It aims to divide N sample data
points, each described by D parameters, into K mutually exclusive clusters. Each of the K clusters is
defined by one central point (center of mass) determined by a specific combination of the parameters
contained in its sample data [6].
The steps of clustering with the K-Means method are as follows [7]:
1. Select the number of clusters K.
2. Initialize the K cluster centers. This can be done in various ways, but the most common is
random initialization, in which the cluster centers are given random initial values.
3. Allocate all data/objects to the nearest cluster. The proximity of two objects is determined by
the distance between them. The distance from each data point to each cluster center can be
calculated with the Euclidean distance, formulated as follows:
$$D(i,j) = \sqrt{(x_{1i} - x_{1j})^2 + (x_{2i} - x_{2j})^2 + \cdots + (x_{ki} - x_{kj})^2} \qquad (1)$$
Where:
D(i,j) = distance from data point i to the center of cluster j
x_{ki} = value of data point i on attribute k
x_{kj} = value of the center of cluster j on attribute k
4. Recalculate the cluster centers using the current cluster membership. The cluster center is the
average of all data/objects in a given cluster. If desired, the median of the cluster can be used
instead, so the average (mean) is not the only measure that can be used.
$$R_k = \frac{1}{N_k}\left(x_{1k} + x_{2k} + \cdots + x_{nk}\right) \qquad (2)$$
Where:
R_k = new average (center) of cluster k
N_k = number of training patterns in cluster k
x_{nk} = the n-th pattern belonging to cluster k
5. Reassign each object using the new cluster centers. If the cluster centers no longer change,
the clustering process is complete. Otherwise, return to step 3.
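
The five steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the RapidMiner implementation used in this study: the function names (`euclidean`, `mean_point`, `k_means`) and the toy points are hypothetical, and the random initialization is assumed to pick K distinct data points as starting centers.

```python
import math
import random


def euclidean(a, b):
    """Equation (1): Euclidean distance between two attribute vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_point(points):
    """Equation (2): attribute-wise average of the members of one cluster."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def k_means(data, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 2 (assumption): use k randomly chosen data points as initial centers.
    centers = rng.sample(data, k)
    labels = None
    for _ in range(max_iter):
        # Step 3: allocate every object to its nearest center.
        new_labels = [
            min(range(k), key=lambda j: euclidean(p, centers[j])) for p in data
        ]
        # Step 5: unchanged memberships mean unchanged centers, so stop.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 4: recompute each center from its current members.
        for j in range(k):
            members = [p for p, lab in zip(data, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = mean_point(members)
    return labels, centers


# Toy data (hypothetical): two well-separated groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centers = k_means(points, k=2)
```

The stopping test compares memberships rather than centers; the two are equivalent here because the centers are a function of the memberships alone.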

3. Research Methodology
This study used the Cross-Industry Standard Process for Data Mining (CRISP-DM) as the data
mining methodology that structures the research. CRISP-DM is a standard data mining process that
serves as a general problem-solving strategy for business and research. The CRISP-DM methodology
consists of six phases: Business Understanding, Data Understanding, Data Preparation, Modeling,
Evaluation, and Deployment [2].
Figure 1. CRISP-DM stages
1. Business Understanding.
In this phase, the business objectives are established by understanding the data on bites
from rabies-transmitting animals, the rabies indicators, the way the disease is transmitted,
and the way it appears.
2. Data Understanding.
In this phase, data on victims of bites from rabies-transmitting animals in Palembang are
collected. The data are provided as an Excel document. The attributes selected are the
population of each sub-district, the number of GHPR cases, and the number of vaccinated
cases.
3. Data Preparation.
Data preparation covers all activities needed to build, from the raw data, the dataset that
will be fed into the modeling tool and then processed by data mining.
4. Modeling.
In this stage, the imported data are clustered using the K-means algorithm in
RapidMiner.
5. Evaluation.
This stage assesses the extent to which the data mining model fulfills the data mining
objectives determined at the business understanding stage.
6. Deployment.
In this stage, deployment takes the form of report generation and journal articles.

4. Results and Discussion


The data processed in this research were obtained from the Palembang City Health Office.
They cover victims of bites from rabies-transmitting animals in the city of Palembang in 2018.
Before clustering, the data are first totaled for each sub-district affected by animal bite
cases. The totals comprise the population of each sub-district, the number of GHPR cases, and
the number of cases vaccinated with the anti-rabies vaccine (VAR).
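
The totaling step can be illustrated with a short Python sketch. It assumes, purely for illustration, that the raw data arrive as one record per bite case; the record layout, the sample records and the `summarize` function are hypothetical and are not taken from the Health Office data.

```python
from collections import defaultdict

# Hypothetical per-case records: (sub_district, received_var_dose).
records = [
    ("Plaju", True),
    ("Plaju", False),
    ("Sako", True),
]


def summarize(records):
    """Total the GHPR cases and vaccinated cases for each sub-district."""
    totals = defaultdict(lambda: {"ghpr": 0, "var": 0})
    for sub_district, vaccinated in records:
        totals[sub_district]["ghpr"] += 1
        totals[sub_district]["var"] += int(vaccinated)
    return dict(totals)


summary = summarize(records)
```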
Table 1. Case data report on bites from rabies-transmitting animals (GHPR) in 2018

Sub-district        Population   GHPR    Wound     Anti-rabies vaccine (VAR)
                                 cases   washing   VAR 192   VAR II   VAR III
IB II                    68762       4         4         4        0         0
Gandus                   63440       3         3         3        0         0
Seberang Ulu I          184550       9         8         6        1         0
Kertapati                85541       7         6         4        0         1
Seberang Ulu II          99135       2         2         2        0         0
Plaju                    84335       5         5         3        0         0
IB I                    133687       8         8         4        0         0
Bukit Kecil              47354      20        16        15        1         3
IT I                     72460       9         9         7        0         0
Kemuning                 87555       9         8         8        3         3
IT II                   165511      40        33        28        5         2
Kalidoni                108506      16        15        15        1         0
Sako                     90541       5         5         3        0         0
Sematang Borang          38211      10         9         9        0         0
Sukarami                153339      10         6         1        5         0
Alang-alang Lebar        97590      17         6         7       12         5
(Source: Dinas Kesehatan Kota Palembang)

Next, the totaled data are grouped into regions by applying the K-means algorithm in
RapidMiner with the specified number of clusters.

4.1. Centroid Data


When applying the K-Means algorithm, the initial centroid points are determined by taking
the highest (maximum) value for the high-level cluster (C0), the average value for the
medium-level cluster (C1), and the smallest (minimum) value for the low-level cluster (C2). The
calculated initial centroid values for iteration 1 are given below.

Table 2. Initial centroid data

Cluster                     Population   GHPR     Wound     Anti-rabies vaccine (VAR)
                                         cases    washing   VAR 192   VAR II   VAR III
High-level cluster (C0)        184550    40       33        28        12       5
Medium-level cluster (C1)    98782.31    10.875   8.9375    7.4375    1.75     0.875
Low-level cluster (C2)          38211    2        2         1         0        0
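
The initial centroids in Table 2 can be reproduced from Table 1 with a short Python sketch: the column-wise maximum, mean and minimum give C0, C1 and C2 respectively. The variable names are illustrative only; the figures are those of Table 1.

```python
# Rows of Table 1: (population, GHPR, wound washing, VAR 192, VAR II, VAR III).
table1 = {
    "IB II": (68762, 4, 4, 4, 0, 0),
    "Gandus": (63440, 3, 3, 3, 0, 0),
    "Seberang Ulu I": (184550, 9, 8, 6, 1, 0),
    "Kertapati": (85541, 7, 6, 4, 0, 1),
    "Seberang Ulu II": (99135, 2, 2, 2, 0, 0),
    "Plaju": (84335, 5, 5, 3, 0, 0),
    "IB I": (133687, 8, 8, 4, 0, 0),
    "Bukit Kecil": (47354, 20, 16, 15, 1, 3),
    "IT I": (72460, 9, 9, 7, 0, 0),
    "Kemuning": (87555, 9, 8, 8, 3, 3),
    "IT II": (165511, 40, 33, 28, 5, 2),
    "Kalidoni": (108506, 16, 15, 15, 1, 0),
    "Sako": (90541, 5, 5, 3, 0, 0),
    "Sematang Borang": (38211, 10, 9, 9, 0, 0),
    "Sukarami": (153339, 10, 6, 1, 5, 0),
    "Alang-alang Lebar": (97590, 17, 6, 7, 12, 5),
}

# Transpose the rows so that each tuple holds one attribute column.
columns = list(zip(*table1.values()))

c0 = tuple(max(col) for col in columns)             # high-level cluster
c1 = tuple(sum(col) / len(col) for col in columns)  # medium-level cluster
c2 = tuple(min(col) for col in columns)             # low-level cluster
```

The population entry of `c1` is 98782.3125, which Table 2 shows rounded to two decimals.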

4.2. Implementation of the K-Means Algorithm in RapidMiner


In this study, the data mining model was implemented using RapidMiner 9.3. The researchers
used the K-means algorithm to cluster the data by attribute distance to the initial centroids in
Table 2. The data processing model built with the K-means algorithm in RapidMiner is shown below:
Figure 2. Process design of the K-means algorithm with K = 3

Figure 2 shows the process design configuration for K-means in RapidMiner with K = 3. The
first operator reads the ExampleSet from the specified Excel file. The data then go into the
clustering operator, which performs the clustering using the K-means algorithm; all required
parameters are stored in the resulting model object. After clustering, a performance operator is
used to evaluate the centroid-based clustering method. The model assigns the attribute values to
the defined clusters. Based on the design configuration in the figure above, RapidMiner groups
the GHPR case areas according to the clusters that have been created.

Figure 3. Cluster Model

Figure 3 shows the grouping of the animal bite case areas: seven sub-districts fall in the high
cluster (C0), four sub-districts in the moderate cluster (C1), and five sub-districts in the low
cluster (C2).
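
As a rough cross-check outside RapidMiner, the following Python sketch runs plain K-means on the Table 1 figures, starting from the maximum, mean and minimum centroids of Table 2. It is an illustration under stated assumptions rather than the process used in this study: the attributes are used unscaled, the helper names are hypothetical, and the cluster indices follow the order of the initial centroids, so they need not match RapidMiner's numbering.

```python
import math

names = [
    "IB II", "Gandus", "Seberang Ulu I", "Kertapati", "Seberang Ulu II",
    "Plaju", "IB I", "Bukit Kecil", "IT I", "Kemuning", "IT II", "Kalidoni",
    "Sako", "Sematang Borang", "Sukarami", "Alang-alang Lebar",
]
# Table 1 rows: (population, GHPR, wound washing, VAR 192, VAR II, VAR III).
rows = [
    (68762, 4, 4, 4, 0, 0), (63440, 3, 3, 3, 0, 0), (184550, 9, 8, 6, 1, 0),
    (85541, 7, 6, 4, 0, 1), (99135, 2, 2, 2, 0, 0), (84335, 5, 5, 3, 0, 0),
    (133687, 8, 8, 4, 0, 0), (47354, 20, 16, 15, 1, 3), (72460, 9, 9, 7, 0, 0),
    (87555, 9, 8, 8, 3, 3), (165511, 40, 33, 28, 5, 2),
    (108506, 16, 15, 15, 1, 0), (90541, 5, 5, 3, 0, 0),
    (38211, 10, 9, 9, 0, 0), (153339, 10, 6, 1, 5, 0),
    (97590, 17, 6, 7, 12, 5),
]


def mean_point(points):
    """Attribute-wise mean of a non-empty list of rows."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def nearest(point, centers):
    """Index of the center with the smallest Euclidean distance to point."""
    return min(range(len(centers)), key=lambda j: math.dist(point, centers[j]))


# Initial centroids as in Table 2: column-wise maximum, mean and minimum.
cols = list(zip(*rows))
centers = [
    tuple(max(c) for c in cols),
    mean_point(rows),
    tuple(min(c) for c in cols),
]

labels = None
while True:
    new_labels = [nearest(p, centers) for p in rows]
    if new_labels == labels:  # memberships stable, so the centers are too
        break
    labels = new_labels
    centers = [
        mean_point([p for p, lab in zip(rows, labels) if lab == j])
        for j in range(3)
    ]

clusters = {j: [n for n, lab in zip(names, labels) if lab == j] for j in range(3)}
```

In this sketch the groups have sizes 4, 7 and 5, with the same members as the three groups listed in the abstract. Because the attributes are not rescaled here, the population column, which is several orders of magnitude larger than the case counts, dominates the Euclidean distance.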
Figure 4. The final centroids at the last iteration

As Figure 4 indicates, the cluster center values continue to change from one iteration to the
next until they no longer change, at which point the iteration stops.

Figure 5. K-means clustering result display with scatter diagram.

This view displays the results for the entire processed dataset together with the cluster
assignments. Cluster 0, in orange, has 7 items; cluster 1, in green, has 4 items; and cluster 2,
in blue, has 5 items.
The Result Perspective then displays the processed data as a whole, together with the cluster
assigned to each example in the example set (read from Excel). The Data View is shown in
Figure 6, which shows the cluster of each sub-district.
Figure 6. Data View

4.3. K-Means Performance Accuracy with RapidMiner


The performance vector is one of the measures used to assess the quality of the K-means result.
The cluster model and the clustered set produced by the K-Means operator are provided as inputs to the
Cluster Distance Performance operator, which evaluates the performance of the model and delivers a
performance vector containing the performance criteria values. The resulting performance vector can be
seen in the results workspace in Figure 7.

Figure 7. Result performance vector k-means with RapidMiner

The measurement is based on two parameters. Avg. within_centroid_distance: the average within-cluster
distance, calculated by averaging the distance between the centroid and all examples of a cluster.
Davies_bouldin: algorithms that produce clusters with low intra-cluster distance (high intra-cluster
similarity) and high inter-cluster distance (low inter-cluster similarity) will have a low
Davies-Bouldin index, and the clustering algorithm that produces a collection of clusters with the
smallest Davies-Bouldin index is considered the best algorithm based on this criterion. Based on the
performance vector results, the Davies-Bouldin value is -0484, so the clustering can be regarded as
good according to this criterion.
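
The two criteria can be made concrete with a small Python sketch. It follows the textbook definitions described above and is not RapidMiner's Cluster Distance Performance operator, whose conventions (for example the sign of the reported value or the use of squared distances) may differ; the function names and the toy clusters are hypothetical.

```python
import math


def mean_point(points):
    """Attribute-wise mean (centroid) of a non-empty cluster."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def avg_within_centroid_distance(clusters):
    """Average distance between every example and its own cluster centroid."""
    total = sum(math.dist(p, mean_point(c)) for c in clusters for p in c)
    return total / sum(len(c) for c in clusters)


def davies_bouldin(clusters):
    """Mean, over clusters, of the worst (scatter_i + scatter_j) / separation."""
    centroids = [mean_point(c) for c in clusters]
    scatter = [
        sum(math.dist(p, m) for p in c) / len(c)
        for c, m in zip(clusters, centroids)
    ]
    k = len(clusters)
    worst = [
        max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in range(k)
            if j != i
        )
        for i in range(k)
    ]
    return sum(worst) / k


# Toy clusters (hypothetical): same centroids, different internal spread.
tight = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
loose = [[(0, 0), (0, 8)], [(10, 0), (10, 8)]]

db_tight = davies_bouldin(tight)
db_loose = davies_bouldin(loose)
awd_tight = avg_within_centroid_distance(tight)
```

The tighter clustering yields the smaller index, which is the sense in which a lower Davies-Bouldin value indicates compact, well-separated clusters.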

5. Conclusion
Based on the results and discussion presented in the previous sections, the authors conclude the
following:
1. Clustering the areas of animal bite cases (GHPR) in Palembang City can be done with the
CRISP-DM data mining methodology using the K-means method, in which K-means clusters the
data by partitioning.
2. Processing the data in RapidMiner with the K-Means method shows that, of the 16 sub-districts,
7 are classified as areas with high levels of rabies cases (C1) (Kertapati, SU II, Plaju,
Kemuning, Kalidoni, Sako, and Alang-Alang Lebar), 4 are classified as areas with medium levels
of rabies cases (C2) (SU I, IB I, IT II, and Sukarami), and the remaining 5 are classified as
areas with low levels of rabies cases (C3) (IB II, Gandus, Bukit Kecil, IT I, and Sematang
Borang). These clusters can inform Palembang Health Office programs so that interventions are
fast, responsive, and on target.

REFERENCES

[1] Kementerian Kesehatan RI. Situasi Rabies di Indonesia 2017.
[2] Haryati, Siska., Sudarsono, Aji., Suryana, Eko. Implementasi Data Mining to Predict Student Study
Period Using C4.5 Algorithm (Case Study: Dehasen University Bengkulu). Jurnal Media
Infotama. Vol. 11 No. 2, September 2015.
[3] Supriyadi, Bambang., Windarto, Agus Perdana., Sumartono, Triyuni., Mungad. Classification of
Natural Disaster Prone Areas in Indonesia Using K-Means. International Journal of Grid and
Distributed Computing. Vol. 11, No. 8 (2018), pp. 87-98.
[4] Wardani, Ratih Sari., Purwanto, Sayono, Paramanda, Aditya. Clustering Tuberculosis in Children
Using K-Means Based on Geographic Information System. AIP Conference Proceedings 2114,
060012 (2019).
[5] Sudirman, Windarto Agus Perdana., Wanto, Anjar. Data Mining Tools RapidMiner: K-Means
Method on Clustering of Rice Crops by Province as Effort to Stabilize Food Crops Indonesia.
Nommensen International Conference on Technology and Engineering. IOP Conf. Series:
Materials Science and Engineering 420 (2018) 01208.
[6] Sano Albert V. Dian., Nindito Hendro. Application of K-Means Algorithm for Cluster Analysis on
Poverty of Provinces in Indonesia. ComTech. Vol. 7 No. 2, June 2016: 141-150.
[7] Fatmawati, Kiki., Windarto, Agus Perdana. Data Mining: Application of RapidMiner Using
K-Means Cluster in Area Affected of Demam Berdarah Dengue (DBD) by Province. CESS
(Journal of Computer Engineering System and Science). P-ISSN: 2502-7131. Vol. 3 No. 2, July
2018.
