STAT452 Project1
In the following sections, I introduce K-means clustering, hierarchical clustering, and K-Nearest Neighbors (KNN). For each algorithm, I present the underlying principles and theory and then demonstrate it in practice with R code.
K-means clustering
K-means clustering is an unsupervised learning algorithm. Its main objective is to find a grouping such that observations within the same cluster are similar to each other while observations in different clusters are dissimilar. K-means assigns every point in the dataset to the nearest centroid, minimizing the total distance of the points to their centroids. Each observation is assigned to exactly one cluster, so the clusters do not overlap. To determine which cluster a point belongs to, we compute the distances from that point to each centroid and choose the smallest.
Overall, the K-means algorithm alternates between two steps until the assignments stop changing: assign each observation to its nearest centroid, then recompute each centroid as the mean of the observations currently assigned to it. The distance in K-means is typically based on the Euclidean metric.
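The assign-and-update loop can be sketched by hand in base R. The following is an illustrative toy version with variable names and seed of my own choosing, not the built-in kmeans() function:

```r
# Illustrative, hand-rolled K-means pass in base R (a sketch, not the
# kmeans() function itself): repeat nearest-centroid assignment and
# centroid recomputation a fixed number of times.
set.seed(452)                                   # arbitrary seed
X <- scale(iris[, 1:4])                         # standardized iris measurements
k <- 3
centroids <- X[sample(nrow(X), k), ]            # random initial centroids
for (iter in 1:10) {
  # Euclidean distances from every point (rows) to every centroid (cols)
  d <- as.matrix(dist(rbind(centroids, X)))[-(1:k), 1:k]
  cluster <- apply(d, 1, which.min)             # nearest-centroid assignment
  # recompute each centroid as the mean of its assigned points
  centroids <- apply(X, 2, function(col) tapply(col, cluster, mean))
}
table(cluster, iris$Species)                    # compare clusters to species
```

This sketch assumes no cluster becomes empty during the iterations, which holds in practice for the iris data with these starting points.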
First, we need to choose the number of clusters K. The observations are then assigned to clusters C_1, ..., C_K so as to solve

    minimize over C_1, ..., C_K:  sum_{k=1}^{K} W(C_k)

where W(C_k) is a measure of how much the observations within cluster C_k differ from each other; that is, the sum of the within-cluster variations over all K clusters has to be minimized. Using squared Euclidean distance, the within-cluster variation W(C_k) can be defined as

    W(C_k) = (1 / |C_k|) * sum over i, i' in C_k of sum_{j=1}^{p} (x_{ij} - x_{i'j})^2

where |C_k| denotes the number of observations in the k-th cluster and p is the number of variables.
In the second step, I generate the distance matrix for our observations, which supplies the distances used by the subsequent K-means steps. The dist() function is part of base R (the stats package), so no extra package is needed for this step.
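As a concrete sketch (the object name iris_set_standard follows the report; dist() comes from base R's stats package):

```r
# Standardize the four iris measurements and build the Euclidean
# distance matrix used by the clustering steps that follow.
iris_set_standard <- scale(iris[, 1:4])   # drop the Species column, then scale
iris.dist <- dist(iris_set_standard, method = "euclidean")
```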
The third step is to decide how many clusters K-means should use, since the algorithm requires the number of clusters to be fixed in advance before the centroids can be computed. The specific method I use is the Elbow method: fviz_nbclust(iris_set_standard, kmeans, method = "wss") + labs(subtitle = "Elbow method"). This function, from the factoextra package, returns an elbow-shaped curve depicting the relationship between the total within-cluster sum of squares and the number of clusters (see Graph A in the Appendix). I already know that there are 3 natural clusters in my data (iris), so I employ k = 3. Without such information, I would look for diminishing returns: based on Graph A, the reduction from adding an extra cluster is largest going from 1 to 2 clusters, smaller from 2 to 3, and so on. We usually keep adding clusters until the further reduction in the sum of squares becomes small.
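The same elbow curve can also be reproduced without factoextra by computing the total within-cluster sum of squares directly; this is a sketch with an arbitrary seed of my own choosing:

```r
# Elbow method by hand: total within-cluster SS for k = 1..10.
set.seed(452)                                   # arbitrary seed
iris_set_standard <- scale(iris[, 1:4])
wss <- sapply(1:10, function(k) {
  kmeans(iris_set_standard, centers = k, nstart = 25)$tot.withinss
})
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```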
After determining k = 3, I run the K-means algorithm using the kmeans() function: kmeans(dataset, centers = 3, nstart = 100). The result returns the size of each cluster, which are 50, 53, and 47 respectively, and the centroid position of each cluster. The clustering vector gives the cluster assignment of each data point, and the within-cluster sum of squares by cluster shows the within-cluster variation of each cluster. The ratio between_SS / total_SS = 76% means that 76% of the total variation in the data is explained by the clustering. After visualizing the data with fviz_cluster(list(data = iris_set_standard, cluster = km.clusters)), we can see that the iris data are clustered into 3 groups. Finally, we can make a table to see that some of the data points are misclassified (Appendix A, Table B).
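The run described above can be reproduced roughly as follows; the seed is my own choice, while the cluster sizes and the roughly 76% ratio are the values the report finds:

```r
# Fit K-means with k = 3 and inspect the quantities discussed above.
set.seed(452)                                   # arbitrary seed
iris_set_standard <- scale(iris[, 1:4])
km.res <- kmeans(iris_set_standard, centers = 3, nstart = 100)
km.res$size                                     # cluster sizes
km.res$betweenss / km.res$totss                 # share of variation explained
table(km.res$cluster, iris$Species)             # clusters vs. true species
```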
Hierarchical clustering
Hierarchical clustering is a form of cluster analysis that seeks to build a hierarchy of clusters. It proceeds by repeatedly grouping the most similar observations (or clusters) together, so that in the end observations in the same cluster are similar to each other while different clusters are well separated.
There are several methods to measure the dissimilarity between two clusters of observations.
Complete linkage: This method computes all pairwise dissimilarities between the observations in cluster one and cluster two, then takes the maximum of these values as the distance between the two clusters.
Single linkage: This method is similar to complete linkage; it also computes all pairwise dissimilarities between cluster one and cluster two, but takes the smallest value as the distance between the two clusters.
Average linkage: This method also computes all pairwise dissimilarities between the two clusters, but takes the average of these dissimilarities as the distance between the two clusters.
Centroid linkage: It computes the dissimilarity between the centroids of the two clusters.
Ward's method: At each step, it merges the pair of clusters whose merger yields the minimum increase in total within-cluster variance.
Among these methods, Ward's method can usually provide a more compact and stronger clustering structure (in R it is available through hclust with method = "ward.D2"), but in this project I use the complete linkage method.
Performing the R code:
To perform agglomerative hierarchical clustering, we need a few functions. First, we can use the hclust() function to run complete linkage. We again use the dist() function to compute the dissimilarity matrix. We can also use the agnes() function, from the cluster package, to obtain the agglomerative coefficient.
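A sketch of comparing linkage methods via the agglomerative coefficient (agnes() is from the cluster package; coefficients closer to 1 indicate a stronger clustering structure):

```r
# Agglomerative coefficients for several linkage methods.
library(cluster)
iris_set_standard <- scale(iris[, 1:4])
methods <- c("average", "single", "complete", "ward")
ac <- sapply(methods, function(m) agnes(iris_set_standard, method = m)$ac)
ac   # on iris, ward typically scores highest
```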
In the R code for hierarchical clustering, I employ the agglomerative hierarchical algorithm with complete linkage. I also use the factoextra package, as I did for K-means. At the very beginning, I still need to standardize the data, just as for K-means, since both are unsupervised learning methods; because this was already done in the K-means section, I jump directly to the hierarchical clustering algorithm. I apply the hclust() function: iris_holdout <- hclust(iris.dist, method = "complete"). The first argument is the dissimilarity matrix produced by dist(), and the second argument specifies the linkage method. To visualize the clustering results, I plot them directly and obtain the dendrogram, a tree-type graph that shows very clearly how the clusters are formed: fviz_cluster(list(data = iris_set_standard, cluster = iris.clusters)). Since k = 3 by our choice, the grouping rectangles are cut at the fifth layer of the tree graph. If we instead let k = 5, the rectangles become smaller and cut the tree graph at a lower level.
The cutree() function cuts a dendrogram into groups of data: it returns the clustering results as a vector telling which observations are assigned to which group. Finally, I can present the clustering plot as I did for K-means and obtain a clustered graph. Compared to the K-means algorithm, it appears that fewer observations are correctly assigned to the green group and more have been mis-assigned to the blue group.
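Putting the pieces together (object names follow the report where it gives them; plot() and rect.hclust() are base R):

```r
# Complete-linkage hierarchical clustering on the standardized iris data,
# cut into k = 3 groups.
iris_set_standard <- scale(iris[, 1:4])
iris.dist <- dist(iris_set_standard, method = "euclidean")
iris_holdout <- hclust(iris.dist, method = "complete")
plot(iris_holdout)                      # dendrogram
rect.hclust(iris_holdout, k = 3)        # rectangles around the 3 clusters
iris.clusters <- cutree(iris_holdout, k = 3)
table(iris.clusters, iris$Species)      # compare assignments to true species
```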
K-Nearest Neighbor (KNN)
There are three main steps involved in KNN. The first is to choose K, the number of nearest neighbors used to classify a given observation. We usually begin with K equal to the square root of the number of observations n, and we want K to be odd so that majority votes cannot end in a tie. If K = 1, the model can overfit the data, since each prediction depends on a single, possibly noisy, neighbor. However, if K approaches the total number of observations, the model may underfit, since nearly every point is predicted as the overall majority class; this can also be computationally expensive. The second step, assuming the chosen K is optimal, is to determine which distance metric to use for the data (e.g., Euclidean). In the third step, once the distances between data points have been computed, each point is assigned to whichever group contains the greatest number of its nearest neighbors. This is why we want an odd K: with an odd number of votes in a two-class problem, the algorithm always knows where to assign each data point.
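A minimal sketch of these three steps with the FNN package mentioned later in the report (the 70/30 split and the seed are my own assumptions; class::knn has the same interface):

```r
# KNN on iris: k starts at sqrt(n_train), rounded to an odd number.
library(FNN)
set.seed(452)                                     # arbitrary seed
X <- scale(iris[, 1:4])                           # step 2: Euclidean on scaled data
n <- nrow(X)
train.idx <- sample(n, size = round(0.7 * n))     # assumed 70/30 split
k <- round(sqrt(length(train.idx)))               # step 1: choose K
if (k %% 2 == 0) k <- k + 1                       # keep k odd to avoid ties
pred <- knn(X[train.idx, ], X[-train.idx, ],      # step 3: majority vote
            cl = iris$Species[train.idx], k = k)
err <- mean(pred != iris$Species[-train.idx])     # misclassification rate
err
```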
Define K
The K in this algorithm is the number of neighbours that are measured so that the algorithm can determine the classification of a particular query point. Choosing K involves a trade-off between overfitting and underfitting the model: larger values of K lead to higher bias but lower variance, while smaller values of K give higher variance but lower bias. K is usually set to a larger number when the data contain many outliers.
Tuning K:
The choice of K can significantly impact the accuracy of the KNN algorithm, so we can use R code to choose the K with the smallest misclassification rate. First, I load the FNN package and call set.seed() so the random sampling is reproducible. Then I assign the observations to a training set and a testing set, scale the data, and build a confusion matrix. After that, I search for the value of K with the lowest validation error on the test set, using sapply() to compute and store the misclassification rate for each K. Finally, I plot the misclassification rate against K and choose the K with the minimum rate. Tuning K is usually complex; a quick rule of thumb is to choose K as the square root of the data size.
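The tuning procedure above can be sketched as follows; the split proportion, seed, and grid of K values are my own assumptions:

```r
# Tune K: test misclassification rate over a grid of odd K values.
library(FNN)
set.seed(452)                                     # arbitrary seed
X <- scale(iris[, 1:4])
n <- nrow(X)
train.idx <- sample(n, size = round(0.7 * n))     # assumed 70/30 split
ks <- seq(1, 31, by = 2)                          # odd K values only
err <- sapply(ks, function(k) {
  pred <- knn(X[train.idx, ], X[-train.idx, ],
              cl = iris$Species[train.idx], k = k)
  mean(pred != iris$Species[-train.idx])          # test error for this k
})
plot(ks, err, type = "b", xlab = "K", ylab = "Misclassification rate")
ks[which.min(err)]                                # K with the smallest error
```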
Like the K-means algorithm, the hierarchical clustering algorithm is an unsupervised clustering method for grouping unlabeled data by similarity. One advantage of hierarchical clustering over K-means is that we do not need to subjectively choose the value of k in advance in order to run the algorithm. Moreover, the distance metric in hierarchical clustering can be something other than the Euclidean metric, which is useful when we have certain specific types of data.
The dendrogram produced by hierarchical clustering makes it easier to determine the number of clusters than K-means does. Also, compared to K-means and KNN, the results of hierarchical clustering are easy to visualize, which makes it straightforward for the audience to pick out the important information. On the other hand, when the dataset is large, K-means runs faster than hierarchical clustering.
Appendix
A). K-means:
Graph A
Table A
Graph B
Table B
B) Hierarchical Clustering
Graph A
Table A
Graph B
C) KNN
Graph A
Graph B