
CLUSTERING

Nikhil Ghag, Business Analyst
TheGreenBillions Limited
WHY THERE IS A NEED FOR CLUSTERING
• Clustering is an important task in analytics in which the data (customers or entities) are grouped into finite subsets such that each subset is a homogeneous group of entities.
• Many analytics projects start with clustering after performing descriptive statistics and visualization on the data, since it helps data scientists apply appropriate strategies to the different clusters identified through their characteristics.
DIFFERENCE BETWEEN CLUSTERING AND CLASSIFICATION
• The main difference between clustering algorithms and classification techniques such as logistic regression and classification trees is that clustering algorithms are unsupervised learning algorithms (classes are not known a priori), whereas logistic regression and classification trees are supervised learning algorithms (classes are known a priori in the training data).
• Another important difference is that clustering is a descriptive analytics technique, whereas classification is usually a predictive analytics technique.
CLASSIFICATION OF CLUSTERS
• Clusters can be classified into the following four categories:
• 1. Non-overlapping clusters: Each observation belongs to exactly one cluster. Non-overlapping clustering is the most frequently used technique in practice.
• 2. Overlapping clusters: An observation may belong to more than one cluster.
• 3. Probabilistic clusters: An observation belongs to each cluster according to a probability distribution.
• 4. Hierarchical clusters: Hierarchical clustering creates subsets of data in a tree-like structure in which the root node corresponds to the complete data set. Branches are created from the root node to split the data into heterogeneous subsets (clusters).
DISTANCE AND DISSIMILARITY MEASURES USED IN CLUSTERING

• Euclidean Distance
• A higher distance implies that observations are dissimilar, whereas a smaller distance indicates that they are similar.
• Euclidean distance is one of the most frequently used distance measures when variables are measured on an interval or ratio scale.
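As a minimal sketch, the Euclidean distance between two observations can be computed as follows (the height/weight values are illustrative, drawn from the example data later in the slides):

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length observations."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two observations on a ratio scale (height in cm, weight in kg):
print(euclidean((185, 72), (130, 56)))  # ≈ 57.28 -> fairly dissimilar
```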
CASE
K-MEANS CLUSTERING

• K-means clustering is one of the most frequently used clustering algorithms. It is a non-hierarchical clustering method in which the number of clusters (K) is decided a priori. The observations in the sample are assigned to one of the K clusters (say C1, C2, …, CK).
• The K-means clustering algorithm uses the following steps:
• 1. Choose K observations from the data that are likely to fall in different clusters. There are many ways of choosing these initial values; the easiest approach is to choose observations that are farthest apart (in one of the parameters of the data).
• 2. The K observations chosen in step 1 serve as the initial centroids of those clusters.
• 3. For each remaining observation, find the cluster whose centroid is closest and add the observation (say observation j) to that cluster. Adjust the centroid after adding the new observation. The closest centroid is chosen based on an appropriate distance measure.
• 4. Repeat step 3 until all observations are assigned to a cluster.
K-MEANS ALGORITHM
Height  Weight
185     72
130     56
168     60
179     68
182     72
188     77
180     71
180     70
183     84
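The steps above can be sketched in pure Python on this height/weight data. This is an illustrative sketch, not a full K-means implementation: the farthest-pair seeding, K = 2, and all names are my own choices, and (unlike the standard iterative K-means) observations are assigned once, in order, exactly as the four steps describe.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(cluster):
    # Component-wise mean of the observations in the cluster.
    return tuple(sum(col) / len(cluster) for col in zip(*cluster))

data = [(185, 72), (130, 56), (168, 60), (179, 68), (182, 72),
        (188, 77), (180, 71), (180, 70), (183, 84)]

# Steps 1-2: pick K = 2 seed observations that are farthest apart
# and treat each as the initial centroid of its own cluster.
seeds = max(((a, b) for a in data for b in data if a != b),
            key=lambda p: euclidean(*p))
clusters = [[s] for s in seeds]

# Steps 3-4: assign each remaining observation to the cluster with
# the closest centroid, recomputing that centroid after each addition.
for obs in data:
    if obs in seeds:
        continue
    nearest = min(clusters, key=lambda c: euclidean(obs, centroid(c)))
    nearest.append(obs)

# With this data and seeding, (130, 56) ends up in a cluster of its own,
# far from the taller/heavier group.
for c in clusters:
    print(sorted(c))
```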
