
LAB Manual

PART A
(PART A: TO BE REFERRED TO BY STUDENTS)

Experiment No.05
Aim: Implementation of the two-dimensional K-means algorithm for clustering.
Prerequisites: C/C++/Java Programming
Learning Outcomes:
Concepts of K-means Algorithm and Clustering.
Theory:
Algorithm:

Example:
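Problem Statement: Given {2,4,10,12,3,20,30,11,25} and k=2, randomly assign the initial means m1=3, m2=4.
Iteration 1: K1={2,3}, K2={4,10,12,20,30,11,25}, m1=2.5, m2=16
Iteration 2: K1={2,3,4}, K2={10,12,20,30,11,25}, m1=3, m2=18
Iteration 3: K1={2,3,4,10}, K2={12,20,30,11,25}, m1=4.75, m2=19.6
Iteration 4: K1={2,3,4,10,11,12}, K2={20,30,25}, m1=7, m2=25
Reassigning the points with m1=7, m2=25 leaves both clusters unchanged, so the algorithm stops.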
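
The worked example can be reproduced with a short program. The following is a minimal sketch, assuming one-dimensional points and absolute difference as the distance; the class name KMeans1D and the method name cluster are illustrative and are not part of the lab code in Part B.

```java
import java.util.Arrays;

class KMeans1D {
    // Runs 1-D k-means from the given starting means and returns the final means.
    static double[] cluster(double[] points, double[] initialMeans) {
        double[] means = initialMeans.clone();
        int[] assignment = new int[points.length];
        Arrays.fill(assignment, -1); // -1 means "not assigned yet"
        boolean changed = true;
        while (changed) {
            changed = false;
            // Assignment step: attach each point to its nearest mean.
            for (int p = 0; p < points.length; p++) {
                int best = 0;
                for (int c = 1; c < means.length; c++) {
                    if (Math.abs(points[p] - means[c]) < Math.abs(points[p] - means[best])) {
                        best = c;
                    }
                }
                if (assignment[p] != best) {
                    assignment[p] = best;
                    changed = true;
                }
            }
            // Update step: recompute each mean from its current members.
            for (int c = 0; c < means.length; c++) {
                double sum = 0.0;
                int count = 0;
                for (int p = 0; p < points.length; p++) {
                    if (assignment[p] == c) {
                        sum += points[p];
                        count++;
                    }
                }
                if (count > 0) {
                    means[c] = sum / count;
                }
            }
        }
        return means;
    }

    public static void main(String[] args) {
        double[] points = {2, 4, 10, 12, 3, 20, 30, 11, 25};
        double[] means = cluster(points, new double[] {3, 4});
        System.out.println(Arrays.toString(means)); // prints [7.0, 25.0]
    }
}
```

Starting from m1=3, m2=4, the loop passes through the same intermediate means as the example (2.5 and 16, 3 and 18, 4.75 and 19.6) before settling at 7 and 25.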

PART B
(PART B : TO BE COMPLETED BY STUDENTS)
(Students must submit the soft copy as per the following segments within two hours of the practical slot.
The soft copy must be uploaded on Blackboard, or emailed to the concerned lab in-charge faculty
at the end of the practical in case there is no Blackboard access available.)
Roll No.: E059
Name: Shubham Gupta
Class: B.Tech CS
Batch: E3
Date of Experiment:
Date of Submission:
Time of Submission:
Grade:
Date of Grading:

B.1 Software Code written by student:


(Paste the C/C++/Java code you completed during the 2 hours of practical in the lab here)

/**
*
* @author mpstme.student
*/
//package means;
import java.util.ArrayList;
import java.util.Scanner;

public class KMeans {
    public static int NUM_CLUSTERS;
    public static int TOTAL_DATA;
    private static ArrayList<Data> dataSet = new ArrayList<>();
    private static ArrayList<Centroid> centroids = new ArrayList<>();

    // Picks NUM_CLUSTERS distinct samples at random and uses them as the initial centroids.
    private static void initialize(double SAMPLES[][])
    {
        ArrayList<Integer> temp = new ArrayList<>();
        for (int i = 0; i < NUM_CLUSTERS; i++) {
            int t = (int) Math.floor(Math.random() * TOTAL_DATA);
            if (!temp.contains(t)) {
                temp.add(t);
                centroids.add(new Centroid(SAMPLES[t][0], SAMPLES[t][1]));
            } else {
                i--; // sample index already used: retry this iteration
            }
        }
        System.out.println("Centroids initialized at:");
        for (int i = 0; i < NUM_CLUSTERS; i++) {
            System.out.println(" (" + centroids.get(i).X() + ", " + centroids.get(i).Y() + ")");
        }
    }

    private static void kMeanCluster(double SAMPLES[][])
    {
        final double bigNumber = Math.pow(10, 10);
        double minimum = bigNumber;
        double distance = 0.0;
        int sampleNumber = 0;
        int cluster = 0;
        boolean isStillMoving = true;
        Data newData = null;

        // Add the samples one at a time: assign each to its nearest centroid,
        // then recompute every centroid from the points added so far.
        while (dataSet.size() < TOTAL_DATA) {
            newData = new Data(SAMPLES[sampleNumber][0], SAMPLES[sampleNumber][1]);
            dataSet.add(newData);
            minimum = bigNumber;
            for (int i = 0; i < NUM_CLUSTERS; i++) {
                distance = dist(newData, centroids.get(i));
                if (distance < minimum) {
                    minimum = distance;
                    cluster = i;
                }
            }
            newData.cluster(cluster);

            for (int i = 0; i < NUM_CLUSTERS; i++) {
                double totalX = 0.0;
                double totalY = 0.0;
                double totalInCluster = 0.0;
                for (int j = 0; j < dataSet.size(); j++) {
                    if (dataSet.get(j).cluster() == i) {
                        totalX += dataSet.get(j).X();
                        totalY += dataSet.get(j).Y();
                        totalInCluster++;
                    }
                }
                if (totalInCluster > 0) {
                    centroids.get(i).X(totalX / totalInCluster);
                    centroids.get(i).Y(totalY / totalInCluster);
                }
            }
            sampleNumber++;
        }

        // Repeat the update and assignment steps until no point changes cluster.
        while (isStillMoving) {
            // Update step: move each centroid to the mean of its members.
            for (int i = 0; i < NUM_CLUSTERS; i++) {
                double totalX = 0.0;
                double totalY = 0.0;
                double totalInCluster = 0.0;
                for (int j = 0; j < dataSet.size(); j++) {
                    if (dataSet.get(j).cluster() == i) {
                        totalX += dataSet.get(j).X();
                        totalY += dataSet.get(j).Y();
                        totalInCluster++;
                    }
                }
                if (totalInCluster > 0) {
                    centroids.get(i).X(totalX / totalInCluster);
                    centroids.get(i).Y(totalY / totalInCluster);
                }
            }
            isStillMoving = false;

            // Assignment step: move each point to its nearest centroid.
            for (int i = 0; i < dataSet.size(); i++) {
                Data tempData = dataSet.get(i);
                minimum = bigNumber;
                for (int j = 0; j < NUM_CLUSTERS; j++) {
                    distance = dist(tempData, centroids.get(j));
                    if (distance < minimum) {
                        minimum = distance;
                        cluster = j;
                    }
                }
                // Compare before reassigning; assigning first would hide every move
                // and end the loop after a single pass.
                if (tempData.cluster() != cluster) {
                    tempData.cluster(cluster);
                    isStillMoving = true;
                }
            }
        }
    }

    // Euclidean distance between a data point and a centroid.
    private static double dist(Data d, Centroid c)
    {
        return Math.sqrt(Math.pow((c.Y() - d.Y()), 2) + Math.pow((c.X() - d.X()), 2));
    }

    private static class Data
    {
        private double mX = 0;
        private double mY = 0;
        private int mCluster = 0;

        public Data()
        {
        }

        public Data(double x, double y)
        {
            this.X(x);
            this.Y(y);
        }

        public void X(double x)
        {
            this.mX = x;
        }

        public double X()
        {
            return this.mX;
        }

        public void Y(double y)
        {
            this.mY = y;
        }

        public double Y()
        {
            return this.mY;
        }

        public void cluster(int clusterNumber)
        {
            this.mCluster = clusterNumber;
        }

        public int cluster()
        {
            return this.mCluster;
        }
    }

    private static class Centroid
    {
        private double mX = 0.0;
        private double mY = 0.0;

        public Centroid()
        {
        }

        public Centroid(double newX, double newY)
        {
            this.mX = newX;
            this.mY = newY;
        }

        public void X(double newX)
        {
            this.mX = newX;
        }

        public double X()
        {
            return this.mX;
        }

        public void Y(double newY)
        {
            this.mY = newY;
        }

        public double Y()
        {
            return this.mY;
        }
    }

    public static void main(String[] args)
    {
        Scanner sc = new Scanner(System.in);
        System.out.println("Enter total no of clusters");
        NUM_CLUSTERS = sc.nextInt();
        // Keep asking until there are at least as many data points as clusters.
        do {
            System.out.println("Total No of data");
            TOTAL_DATA = sc.nextInt();
            if (TOTAL_DATA < NUM_CLUSTERS) {
                System.out.println("Number of data should be at least equal to number of clusters");
            }
        } while (TOTAL_DATA < NUM_CLUSTERS);

        // Each sample is an (x, y) pair.
        double SAMPLES[][] = new double[TOTAL_DATA][2];
        System.out.println("Enter sample values");
        for (int i = 0; i < TOTAL_DATA; i++) {
            for (int j = 0; j < 2; j++) {
                SAMPLES[i][j] = sc.nextDouble();
            }
        }

        initialize(SAMPLES);
        kMeanCluster(SAMPLES);

        for (int i = 0; i < NUM_CLUSTERS; i++) {
            System.out.println("Cluster " + i + " includes:");
            for (int j = 0; j < TOTAL_DATA; j++) {
                if (dataSet.get(j).cluster() == i) {
                    System.out.println(" (" + dataSet.get(j).X() + ", " + dataSet.get(j).Y() + ")");
                }
            }
            System.out.println();
        }

        System.out.println("Centroids finalized at:");
        for (int i = 0; i < NUM_CLUSTERS; i++) {
            System.out.println(" (" + centroids.get(i).X() + ", " + centroids.get(i).Y() + ")");
        }
        System.out.print("\n");
    }
}

B.2 Input and Output:


(Paste your program input and output in the following format. If there is an error, paste the specific error in the output
part. In case of an error, with due permission of the faculty, an extension can be given to submit the error-free code with output
in due course of time. Students will be graded accordingly.)

Input Data:

debug:
Enter total no of clusters
5
Total No of data
7
Enter sample values
1 2
3 4
5 6
7 8
8 9
8 8
9 9
Centroids initialized at:
(3.0, 4.0)
(1.0, 2.0)
(5.0, 6.0)
(8.0, 9.0)
(9.0, 9.0)
Cluster 0 includes:
(3.0, 4.0)
Cluster 1 includes:
(1.0, 2.0)
Cluster 2 includes:
(5.0, 6.0)
Cluster 3 includes:
(7.0, 8.0)
(8.0, 8.0)
Cluster 4 includes:
(8.0, 9.0)
(9.0, 9.0)
Centroids finalized at:
(3.0, 4.0)
(1.0, 2.0)
(5.0, 6.0)
(7.5, 8.0)
(8.5, 9.0)
BUILD SUCCESSFUL (total time: 55 seconds)
Output Data:

B.3 Observations and learning:


(Students are expected to comment on the output obtained with clear observations and learning for each task/ sub part
assigned)

After successful completion of this experiment, we learned to implement the k-means method for
clustering a given set of objects using centroids. We observe that each object is assigned to the
cluster whose centroid is nearest to it, and that the initial centroids are chosen randomly from the
input points.

B.4 Conclusion:
(Students must write the conclusions based on their learning)

After successful completion of this experiment, we have implemented the K-means method of
clustering.

B.5 Questions of Curiosity


Q1. Summarize the approaches that are used for clustering with their advantages and limitations.
1) Partitioning algorithms: Construct various partitions and then evaluate them by some criterion.
Advantages:
- Relatively efficient
Disadvantages:
- Often terminates at a local optimum rather than the global optimum
- Need to specify the number of clusters in advance
- Applicable only when the mean is defined

2) Hierarchical algorithms: Create a hierarchical decomposition of the set of data using some
criterion.
Advantages:
- Structure that is more informative
- Does not require the number of clusters to be specified
Disadvantages:
- Selection of merge points is critical.
- Split decisions, if not well chosen, may lead to low-quality clusters.

3) Density-based algorithms: Based on connectivity and density functions.
Advantage:
- Connects points that lie within a certain distance threshold of each other
Disadvantage:
- Expects some kind of density drop to detect cluster borders
Q2. Explain hierarchical algorithms for clustering with an example.
Hierarchical clustering is a method of cluster analysis which seeks to build a hierarchy of clusters.
Strategies for hierarchical clustering generally fall into two types:
Agglomerative: Start with the points as individual clusters. At each step, merge the closest pair of
clusters until only one cluster (or k clusters) is left.

Divisive: Start with one, all-inclusive cluster. At each step, split a cluster until each
cluster contains a single point (or there are k clusters).
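
As an example of the agglomerative strategy, the following is a minimal sketch, assuming one-dimensional points and single linkage (the distance between two clusters is the smallest gap between any two of their members). The linkage choice, the class name Agglomerative1D and its method names are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

class Agglomerative1D {
    // Single-linkage distance: smallest gap between any two members.
    static double linkage(List<Double> a, List<Double> b) {
        double best = Double.MAX_VALUE;
        for (double x : a) {
            for (double y : b) {
                best = Math.min(best, Math.abs(x - y));
            }
        }
        return best;
    }

    // Merges the closest pair of clusters until only k clusters are left.
    static List<List<Double>> cluster(double[] points, int k) {
        // Start with every point in its own cluster.
        List<List<Double>> clusters = new ArrayList<>();
        for (double p : points) {
            List<Double> single = new ArrayList<>();
            single.add(p);
            clusters.add(single);
        }
        while (clusters.size() > k) {
            int bestI = 0;
            int bestJ = 1;
            double bestDist = Double.MAX_VALUE;
            // Find the closest pair of clusters.
            for (int i = 0; i < clusters.size(); i++) {
                for (int j = i + 1; j < clusters.size(); j++) {
                    double d = linkage(clusters.get(i), clusters.get(j));
                    if (d < bestDist) {
                        bestDist = d;
                        bestI = i;
                        bestJ = j;
                    }
                }
            }
            // Merge cluster bestJ into cluster bestI (bestI < bestJ always holds).
            clusters.get(bestI).addAll(clusters.remove(bestJ));
        }
        return clusters;
    }

    public static void main(String[] args) {
        double[] points = {2, 4, 10, 12, 3, 20, 30, 11, 25};
        for (List<Double> c : cluster(points, 2)) {
            System.out.println(c);
        }
    }
}
```

For the data set from Part A with k=2, the merging stops with the groups {2,3,4,10,11,12} and {20,25,30}, because the gap between 12 and 20 is the largest gap in the data. Calling cluster with k=1 would continue merging until a single all-inclusive cluster remains.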

Q3. Explain clustering algorithms used for Large Databases.


Clustering high-dimensional data is the cluster analysis of data with anywhere from a few dozen to
many thousands of dimensions. Such high-dimensional data spaces are often encountered in areas
such as medicine, where DNA microarray technology can produce a large number of measurements at
once, and the clustering of text documents, where, if a word-frequency vector is used, the number of
dimensions equals the size of the vocabulary.
BIRCH (balanced iterative reducing and clustering using hierarchies) is an unsupervised data mining
algorithm used to perform hierarchical clustering over particularly large data-sets. An advantage of
BIRCH is its ability to incrementally and dynamically cluster incoming, multi-dimensional metric data
points in an attempt to produce the best quality clustering for a given set of resources (memory and
time constraints). In most cases, BIRCH only requires a single scan of the database. In addition,
BIRCH also claims to be the "first clustering algorithm proposed in the database area to handle 'noise'
(data points that are not part of the underlying pattern) effectively", beating DBSCAN by two months.
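
The incremental, single-scan behaviour rests on summarizing groups of points instead of storing them. BIRCH does this with a clustering feature, a triple (N, LS, SS) holding the number of points, their linear sum and their sum of squares. The following is a simplified sketch of such a summary, not of the full algorithm: it leaves out the tree that organizes the summaries, and the class and method names are assumptions made for illustration.

```java
class ClusteringFeature {
    int n;               // N: number of points summarized
    double[] linearSum;  // LS: per-dimension sum of the points
    double squaredSum;   // SS: sum of the squared coordinates of the points

    ClusteringFeature(int dims) {
        linearSum = new double[dims];
    }

    // Absorbs one incoming point without storing the point itself.
    void add(double[] point) {
        n++;
        for (int d = 0; d < point.length; d++) {
            linearSum[d] += point[d];
            squaredSum += point[d] * point[d];
        }
    }

    // Two summaries are merged by adding their components.
    void merge(ClusteringFeature other) {
        n += other.n;
        for (int d = 0; d < linearSum.length; d++) {
            linearSum[d] += other.linearSum[d];
        }
        squaredSum += other.squaredSum;
    }

    // The centroid can be recovered from the summary alone: LS / N.
    double[] centroid() {
        double[] c = new double[linearSum.length];
        for (int d = 0; d < c.length; d++) {
            c[d] = linearSum[d] / n;
        }
        return c;
    }

    public static void main(String[] args) {
        ClusteringFeature cf = new ClusteringFeature(2);
        cf.add(new double[] {7, 8});
        cf.add(new double[] {8, 8});
        double[] c = cf.centroid();
        System.out.println("(" + c[0] + ", " + c[1] + ")"); // prints (7.5, 8.0)
    }
}
```

Because each point only updates three running totals, the data can be read once and discarded, which is what makes this kind of summary suitable for databases too large to hold in memory.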