
Faculty of Engineering & Technology

Subject Name :- Data mining & Business Intelligence


Subject Code:- 203105454
B.Tech. IT Year 3rd Semester 6th

PRACTICAL-10

Aim:- Perform clustering using the WEKA tool.

(1). Percentage Split:-


(A). Using SimpleKMeans Clusterer:- K-means clustering is a simple unsupervised learning
algorithm. It groups ‘n’ data objects into ‘k’ clusters, with each observation belonging to
the cluster with the closest mean. It defines ‘k’ centroids, one for each cluster (a centroid
can be thought of as the center of its cluster). The initial centroids should be placed as
far apart from each other as possible.
• Each data point is then linked to its nearest centroid. When no point is pending, the
first stage is complete and an early grouping has been made. The ‘k’ new centroids are then
recalculated as the barycenters of the clusters from the previous stage.
• Once these ‘k’ new centroids have been created, the same data points are bound to the
nearest new centroid. This creates a loop: the ‘k’ centroids change their position step by
step until no further changes occur.
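The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not WEKA's SimpleKMeans implementation: the function names, the toy points, and the choice of the first ‘k’ points as initial centroids are assumptions made for the example (WEKA picks its initial centroids from a random seed).

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, max_iter=100):
    """Cluster numeric tuples into k groups; returns (centroids, labels)."""
    # Assumption: the first k points are used as initial centroids.
    centroids = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: bind each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # no assignment changed, so the loop ends
            break
        labels = new_labels
        # Update step: move each centroid to the barycenter of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, labels

# Hypothetical toy data: two well-separated groups of 2-D points.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centroids, labels = kmeans(points, k=2)
print(centroids)
print(labels)
```

On this toy data the first three points end up in one cluster and the last three in the other, even though both initial centroids start inside the same group.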
Clustering:- Clustering is the process of dividing a set of abstract objects into groups of
similar objects. Points to keep in mind: a cluster of data objects can be treated as a single
group. When performing cluster analysis, we first divide the data set into groups based on
data similarity, then assign labels to the groups.
Step 1: Open the labor.arff dataset.

(Figure 1.1 labor.arff dataset)

Ankit Pandey(200303108159) Data Mining & Business Intelligence (203105454)


Step 2: Go to the Cluster tab, choose the SimpleKMeans clusterer, set Percentage split
to 50%, and click Start.

(Figure 1.2 Clustering for (50%))

Step 3: Go to the Cluster tab, choose the SimpleKMeans clusterer, set Percentage split
to 70%, and click Start.
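What a percentage split does can be sketched as follows: the clusterer is built on the chosen percentage of the data, and the remaining instances are then assigned to the resulting clusters. The function name, the seed, and the rounding rule below are assumptions for illustration; WEKA's own shuffling and rounding may differ. The 57 stand-in items correspond to the 57 instances of the labor.arff file shipped with WEKA.

```python
import random

def percentage_split(instances, train_percent, seed=1):
    """Shuffle a copy of the data and split it into (train, test) lists."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    # Assumption: the training size is the rounded percentage of the data.
    n_train = round(len(shuffled) * train_percent / 100)
    return shuffled[:n_train], shuffled[n_train:]

data = list(range(57))  # stand-in for the instances of labor.arff
for pct in (50, 70):
    train, test = percentage_split(data, pct)
    print(pct, len(train), len(test))
```

A larger split percentage leaves fewer instances to be clustered in the evaluation, which is why the instance counts reported for different splits differ.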

(Figure 1.3 Clustering for (70%))


❖ ACCURACY TABLE FOR CLUSTERING:

Split            Accuracy (%)               Clustered Instances
Percentage (%)   SimpleKMeans               SimpleKMeans
                 Cluster 0    Cluster 1     Cluster 0    Cluster 1
30               45%          55%           10           22
60               24%          76%           7            22
70               67%          33%           12           6
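The percentage columns appear to be the share of evaluated instances that fall into each cluster, which is the figure WEKA prints next to each count under "Clustered Instances", rather than a classification accuracy. Under that assumption, the arithmetic can be sketched as below; the function name is hypothetical.

```python
def cluster_shares(counts):
    """Return each cluster's share of the clustered instances, in whole percent."""
    total = sum(counts)
    return [round(100 * c / total) for c in counts]

print(cluster_shares([12, 6]))   # 70% split row: 12 and 6 instances -> [67, 33]
print(cluster_shares([7, 22]))   # 60% split row: 7 and 22 instances -> [24, 76]
```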

(B). Using EM Clusterer:- Expectation Maximization (EM) is another popular, though slightly
more complicated, clustering algorithm that relies on maximizing the likelihood to find
the statistical parameters of the underlying sub-populations in the dataset. The
probabilistic theory behind EM is not covered here. To summarize briefly, the EM algorithm
alternates between two steps (E-step and M-step).
In the E-step, the algorithm finds a lower-bound function on the original likelihood
using the current estimate of the statistical parameters. In the M-step, the algorithm finds
new estimates of those parameters by maximizing the lower-bound function (i.e. it
determines the MLE of the statistical parameters). Since each step maximizes the
lower-bound function, the likelihood never decreases from one iteration to the next,
and the algorithm ultimately converges to a local maximum.
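The lower bound mentioned above can be written out explicitly. The notation is assumed here and does not come from the original: x is the observed data, z the hidden cluster assignments, θ the parameters, q any distribution over z, and t the iteration index.

```latex
\begin{align*}
\log p(x \mid \theta)
  &= \log \sum_{z} q(z)\,\frac{p(x, z \mid \theta)}{q(z)}
   \;\ge\; \sum_{z} q(z)\,\log \frac{p(x, z \mid \theta)}{q(z)}
   \;=\; \mathcal{L}(q, \theta)
   && \text{(Jensen's inequality)} \\
\text{E-step:}\quad q^{(t)}(z) &= p\bigl(z \mid x, \theta^{(t)}\bigr)
   \;\Rightarrow\; \mathcal{L}\bigl(q^{(t)}, \theta^{(t)}\bigr) = \log p\bigl(x \mid \theta^{(t)}\bigr) \\
\text{M-step:}\quad \theta^{(t+1)} &= \arg\max_{\theta}\, \mathcal{L}\bigl(q^{(t)}, \theta\bigr) \\
\log p\bigl(x \mid \theta^{(t+1)}\bigr)
  &\ge \mathcal{L}\bigl(q^{(t)}, \theta^{(t+1)}\bigr)
   \ge \mathcal{L}\bigl(q^{(t)}, \theta^{(t)}\bigr)
   = \log p\bigl(x \mid \theta^{(t)}\bigr)
\end{align*}
```

The last line is the reason the likelihood cannot decrease between iterations.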

• EM is an iterative method that alternates between two steps, expectation (E) and
maximization (M). For clustering, EM uses a finite Gaussian mixture model and
estimates a set of parameters iteratively until a desired convergence value is achieved.

• The EM algorithm extends the basic k-means approach to clustering in an important way:
instead of assigning examples to clusters so as to maximize the differences in means for
continuous variables, it computes probabilities of cluster membership based on one or more
probability distributions.
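The difference from k-means can be made concrete with a small EM sketch for a one-dimensional Gaussian mixture. This is an illustration under stated assumptions, not WEKA's EM clusterer: the function names, the toy data, the starting parameters, and the fixed number of iterations (used in place of a convergence test) are all chosen for the example.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(data, means, variances, weights, n_iter=50):
    """Fit a 1-D Gaussian mixture by EM; returns (means, variances, weights, resp)."""
    k = len(means)
    means, variances, weights = list(means), list(variances), list(weights)
    resp = []
    for _ in range(n_iter):
        # E-step: probability that each point belongs to each cluster.
        resp = []
        for x in data:
            joint = [weights[j] * gaussian_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(joint)
            resp.append([p / total for p in joint])
        # M-step: re-estimate the parameters from the soft memberships.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # floor keeps a component from collapsing
    return means, variances, weights, resp

# Hypothetical toy data: two groups of values around 1 and 5.
data = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]
means, variances, weights, resp = em_gmm_1d(data, [0.0, 6.0], [1.0, 1.0], [0.5, 0.5])
print(means)
print([max(range(2), key=lambda j: row[j]) for row in resp])  # most probable cluster
```

Each row of `resp` holds membership probabilities that sum to 1. Taking the most probable cluster per row gives a hard assignment comparable to the k-means labels.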

Step 1: Go to the Cluster tab, choose the EM clusterer, set Percentage split to 70%,
and click Start.

(Figure 1.2 Clustering for 70%)

Step 2: Go to the Cluster tab, choose the EM clusterer, set Percentage split to 30%,
and click Start.

(Figure 1.3 Clustering for 30%)


❖ ACCURACY TABLE FOR CLUSTERING:

Split            Accuracy (%)                            Clustered Instances
Percentage (%)   EM                                      EM
                 Cluster 0    Cluster 1    Cluster 2     Cluster 0    Cluster 1    Cluster 2
30               100%         -            -             40           -            -
60               24%          76%          -             7            22           -
70               28%          22%          50%           5            44           9
