Distance Measure
A distance measure is a very important aspect of clustering. Knowing how close or how
far apart items are from one another helps in grouping them.
Jaccard Distance
The Jaccard index compares the elements of two sets to identify which members are
shared and which are not. The Jaccard distance is a measure of how different the two
given sets are.
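For instance, a minimal sketch in R (the sets a and b are made-up examples):
a <- c("apple", "banana", "cherry")
b <- c("banana", "cherry", "date", "fig")
# Jaccard distance = 1 - |intersection| / |union|
1 - length(intersect(a, b)) / length(union(a, b))   # 0.6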
Euclidean Distance
Euclidean distance is the shortest (straight-line) distance between two given points in
Euclidean space.
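As a quick sketch in R (the points p and q are hypothetical values):
p <- c(1, 2)
q <- c(4, 6)
sqrt(sum((p - q)^2))                        # 5
dist(rbind(p, q), method = "euclidean")     # same result via dist()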
Cosine Distance
The cosine distance between two given vectors u and v is one minus the cosine of the
angle between them, i.e., 1 minus their cosine similarity.
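A minimal sketch in R (u and v are hypothetical vectors):
u <- c(1, 0, 1)
v <- c(0, 1, 1)
cos_sim <- sum(u * v) / (sqrt(sum(u^2)) * sqrt(sum(v^2)))   # cosine similarity
1 - cos_sim                                                 # cosine distance = 0.5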
Module Summary
Learnings at a high level:
A few prominent distance measures used in clustering data
Hierarchical Clustering
K Means
Hierarchical
1. Begin by assigning each item to its own cluster. If you have N items, you now have
N clusters, each containing a single item. Let the distances (similarities) between
the clusters be the same as the distances (similarities) between the items they contain.
2. Find the closest (most similar) pair of clusters and merge them into a single
cluster, so that you now have one cluster fewer.
3. Compute the distances (similarities) between the new cluster and each of the old
clusters.
4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.
Source:
https://home.deib.polimi.it/matteucc/Clustering/tutorial_html/hierarchical.html
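To make the procedure concrete, here is a minimal sketch in R using the built-in dist()
and hclust() functions on a small made-up data set; the merge order in the result
mirrors the repeated merge-and-update steps described above (the complete-linkage
choice is only an example):
x <- matrix(c(1, 1,
              1.5, 1,
              5, 5,
              5, 5.5,
              9, 9), ncol = 2, byrow = TRUE)   # five toy items
d <- dist(x)                          # pairwise distances between the items
hc <- hclust(d, method = "complete")  # agglomerative hierarchical clustering
hc$merge                              # which clusters were merged at each step
hc$height                             # distance at which each merge happened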
Dendrogram
A dendrogram is a branching diagram that represents the relationships of similarity
among a group of entities
Each branch is called a clade
The terminal end of each clade is called a leaf
There is no limit to the number of leaves in a clade
The arrangement of the clades tells us which leaves are most similar to each other
The height of the branch points indicates how similar or different they are from
each other
The greater the height, the greater the difference between the points
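As an illustrative sketch (reusing the hc object from the example above), the
dendrogram can be drawn directly in R; the leaves are the individual items and higher
branch points indicate greater dissimilarity:
plot(hc, main = "Dendrogram", xlab = "Items", sub = "")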
If data points are grouped incorrectly at the start, they cannot be reassigned to
another cluster later.
Using different similarity measures to compute the similarity between clusters can
lead to entirely different clusterings.
K Means Clustering
Common ways of choosing the number of clusters k include:
By rule of thumb
Elbow method (see the sketch after this list)
Information Criterion Approach
An Information Theoretic Approach
Choosing k using the Silhouette
Cross-validation
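The elbow method, for example, can be sketched in R as follows (using the iris petal
measurements from the snippet below as an assumed example): the total within-cluster
sum of squares is computed for a range of k values, and the "elbow" of the resulting
curve suggests a reasonable k.
set.seed(20)
wss <- sapply(1:10, function(k) kmeans(iris[, 3:4], centers = k, nstart = 20)$tot.withinss)
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")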
Code Snippet
K Means Clustering in R
library(datasets)
head(iris)
Visualizing the data
library(ggplot2)
ggplot(iris, aes(Petal.Length, Petal.Width, color = Species)) + geom_point()
Setting the seed and creating the cluster
set.seed(20)
irisCluster <- kmeans(iris[, 3:4], 3, nstart = 20)
irisCluster
Comparing the clusters with the species
table(irisCluster$cluster, iris$Species)
Plotting the dataset to view the clusters
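A plausible completion of this step (assuming ggplot2 as loaded above), coloring the
points by their assigned cluster:
irisCluster$cluster <- as.factor(irisCluster$cluster)
ggplot(iris, aes(Petal.Length, Petal.Width, color = irisCluster$cluster)) + geom_point()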
Code Snippet
Hierarchical Clustering in R
library(datasets)
head(iris)
library(ggplot2)
ggplot(iris, aes(Petal.Length, Petal.Width, color = Species)) + geom_point()
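The snippet above only loads and visualizes the data; a possible continuation on the
same petal measurements is sketched below (the average-linkage method and the cut into
3 clusters are illustrative choices, not prescribed by the original):
distMatrix <- dist(iris[, 3:4])                # pairwise Euclidean distances
hClusters <- hclust(distMatrix, method = "average")
plot(hClusters)                                # dendrogram of the 150 observations
clusterCut <- cutree(hClusters, 3)             # cut the tree into 3 clusters
table(clusterCut, iris$Species)                # compare clusters with the species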