
In this report, I would like to discuss three fundamental algorithms in machine learning: K-means clustering, hierarchical clustering, and K-nearest neighbours (KNN).


The first two of these algorithms belong to unsupervised learning, while K-nearest neighbours belongs to supervised learning. Supervised learning refers to machine learning in which the groups need to be pre-specified. It generally works with labelled data, meaning that we already know which group each training observation belongs to. Based on this, we can develop an algorithm to classify new observations into the known groups. Unsupervised learning, by contrast, does not require pre-determined clusters and works with unlabelled data. We essentially need to discover what kinds of groups exist in the data set so that we can potentially build a classifier afterwards; more generally, we can examine the relationships between variables.

In the following sections, I will introduce K-means clustering, hierarchical clustering, and K-nearest neighbours. For each algorithm, I will explain its principles and theory and then demonstrate it in practice with R code.

K-means clustering
K-means clustering is an unsupervised learning algorithm. The main objective of this method is to find a grouping such that the observations within each cluster are similar to each other while observations in different clusters are dissimilar. K-means assigns every point in the dataset to the nearest centroid so as to minimize the total distance of the points to their centroids. Each observation is assigned to exactly one cluster, with no overlap between clusters. To determine which cluster a point belongs to, we compute the distances from that point to each centroid.
Overall, the K-means algorithm alternates between assigning each point to its nearest centroid and recomputing the centroids, and repeats this until the assignments stabilize, leaving exactly k clusters. The distance used in k-means clustering is typically based on the Euclidean metric.

How the K-means algorithm works:


We can also use the mathematical formulation to illustrate how this works.

First, we need to determine the number of clusters K, and then assign the observations to K clusters so that the objective below is minimized. W(Ck) is a measure of how much the observations within cluster k differ from each other; overall, the sum of the within-cluster variation over all K clusters has to be minimized.

To define the within-cluster variation W(Ck), we compute the squared Euclidean distance between every pair of observations in the kth cluster and divide by the number of observations in that cluster, where Ck denotes the kth cluster. This is similar in spirit to the usual variance formula.

In summary, the main objective of the K-means algorithm is to minimize the sum of the within-cluster variation measured by the squared Euclidean distance.
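Written out, a standard way to express this objective (the notation follows the usual textbook formulation and may differ slightly from the formulas shown in the original figures; |Ck| is the number of observations in cluster k and p the number of variables) is, in LaTeX notation:

\min_{C_1, \dots, C_K} \sum_{k=1}^{K} W(C_k),
\qquad
W(C_k) = \frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p} \left( x_{ij} - x_{i'j} \right)^2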

Applications of the K-means algorithm in R code:


In practice, I use the iris dataset and the factoextra package in R, which is powerful for visualizing clusters. The first step is to scale our data; this is vital because I want the distance metric to treat all variables equally, especially for the Euclidean distance used in k-means clustering.

In the second step, I generate the distance matrix for our observations, which provides the distances used in the subsequent k-means algorithm. The dist() function is a built-in base R function (from the stats package).

The third step is to determine how many clusters I need for the k-means algorithm, since k-means requires the number of clusters to be pre-determined before the centroids can be computed. The specific method I use is the Elbow method: fviz_nbclust(iris_set_standard, kmeans, method="wss") + labs(subtitle="Elbow method"). This function returns an elbow-shaped curve depicting the relationship between the total within-cluster sum of squares and the number of clusters (see Graph A in the Appendix). I already know that there are 3 natural clusters in my data (iris), so I will use k = 3. Without such information, however, I would look at the diminishing returns: based on Graph A, the reduction from adding an extra cluster is largest when going from 1 to 2 clusters and becomes smaller from 2 to 3, and so on. We usually keep adding clusters until the further reduction in the sum of squares becomes small.
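As a minimal sketch, these three steps could look as follows in R (the object names iris_set_standard and iris.dist follow the text above; the exact code is in Appendix A and may differ in detail):

library(factoextra)   # provides fviz_nbclust() and fviz_cluster(); loads ggplot2

# Step 1: keep the four numeric columns of iris and scale them so that
# no single variable dominates the Euclidean distance
iris_set_standard <- scale(iris[, 1:4])

# Step 2: pairwise Euclidean distances between the observations
iris.dist <- dist(iris_set_standard, method = "euclidean")

# Step 3: Elbow method - total within-cluster sum of squares against k
fviz_nbclust(iris_set_standard, kmeans, method = "wss") +
  labs(subtitle = "Elbow method")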

After setting k = 3, I run the k-means algorithm using the kmeans() function: kmeans(dataset, centers=3, nstart=100). The result returns the size of each cluster, which are 50, 53, and 47 respectively, and the centroid of each cluster. The clustering vector gives the cluster assignment of each data point, and the within-cluster sum of squares by cluster shows the within-cluster variation of each cluster. The ratio between_SS / total_SS = 76% implies that about 76% of the total variation is accounted for by the between-cluster variation. After visualizing the data with fviz_cluster(list(data=iris_set_standard, cluster=km.clusters)), we can see that the iris data are clustered into 3 groups. Finally, we can make a table to see that some of the data points are misclassified (Appendix A, Table B).
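A sketch of this step, again assuming the objects defined above (the object name km.res and the seed are my own additions; the exact sizes and the 76% figure will only be reproduced with the same settings as in the original run):

set.seed(123)                     # kmeans starts from random centroids

km.res <- kmeans(iris_set_standard, centers = 3, nstart = 100)
km.res                            # prints cluster sizes, centroids, and between_SS / total_SS
km.clusters <- km.res$cluster     # clustering vector: the cluster of each observation

# Cluster plot (Graph B in Appendix A)
fviz_cluster(list(data = iris_set_standard, cluster = km.clusters))

# Compare the clusters with the true species to spot misclassified points (Table B)
table(km.clusters, iris$Species)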
Hierarchical clustering
Hierarchical clustering is a method of cluster analysis that seeks to build a hierarchy of clusters. It proceeds by grouping similar observations into clusters step by step; in the end, observations in the same cluster should be similar to one another, while observations in different clusters should be clearly distinct.

How the hierarchical clustering algorithm works:


Generally, we use agglomerative clustering as our hierarchical clustering algorithm.
In each step, the algorithm finds the two clusters that are nearest to each other and merges these two most similar clusters. The process repeats until all clusters have been merged into one. The main output is called a dendrogram.

There are several methods to measure the dissimilarity between two clusters of observations.
- Complete linkage: This method computes all pairwise dissimilarities between the observations in cluster one and cluster two, and takes the maximum of these values as the distance between the two clusters.
- Single linkage: This method is similar to complete linkage; it also computes all pairwise dissimilarities between cluster one and cluster two, but takes the smallest value as the distance between the two clusters.
- Average linkage: This method also computes all pairwise dissimilarities between two clusters, but takes the average of these dissimilarities as the distance between the two clusters.
- Centroid linkage: It computes the dissimilarity between the centroids of the two clusters.
- Ward's method: It minimizes the total variance within each cluster; at each step, the pair of clusters whose merge increases the total within-cluster variance the least is combined.
Among these methods, Ward's method can usually provide a more compact and stronger clustering structure, but it is harder to set up in R, so I use the complete linkage method instead.

Application of Hierarchical clustering in R:


Data Preparation:
First, we need to ensure that the observations are in rows and the variables are in columns, and that all missing values have been removed. Besides, all the data should be scaled so that variables can be compared with each other and the clustering does not depend on an arbitrary variable's units.

Perform the R code:
To perform agglomerative hierarchical clustering, we need a few functions. First, we can use the hclust() function with complete linkage. We can again use the dist() function to compute the dissimilarity matrix, and we can use the agnes() function (from the cluster package) to obtain the agglomerative coefficient.

In the R code for hierarchical clustering, I employ the agglomerative hierarchical algorithm with complete linkage clustering. I also use the factoextra package, as I did for k-means. At the very beginning, I still need to scale the data, as was done for k-means, since both are unsupervised learning methods; because I have already done this in the K-means section, I jump directly to the hierarchical clustering algorithm. I apply the hclust() function, where the first argument is the dissimilarity matrix of the data and the second argument specifies the linkage method: iris_holdout<-hclust(iris.dist, method="complete"). To visualize the clustering results, I plot the clustering object directly and obtain the dendrogram, a tree-type graph that shows very clearly how the clusters are built, and I then draw the cluster plot with fviz_cluster(list(data=iris_set_standard, cluster=iris.clusters)). Since k = 3 by our choice, the grouping rectangles cut the tree graph at the fifth layer. We could also let k = 5, in which case the rectangles become smaller and cut the tree graph at a lower level.
The cutree() function cuts a tree graph into groups of data: it returns the clustering results in the form of which observations are assigned to which group. Finally, I can present the clustering plot as I did for k-means and obtain a clustered graph. Compared to the k-means algorithm, it seems that fewer observations are correctly assigned to the green group and more have been mis-assigned to the blue group.
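A minimal sketch of this workflow, assuming the scaled data and object names used above (the rect.hclust() call is my own addition for drawing the grouping rectangles; the code in Appendix B may do this differently):

library(factoextra)

# Complete-linkage agglomerative clustering on the scaled iris data
iris_set_standard <- scale(iris[, 1:4])
iris.dist <- dist(iris_set_standard, method = "euclidean")
iris_holdout <- hclust(iris.dist, method = "complete")

# Dendrogram with rectangles drawn around k = 3 groups
plot(iris_holdout)
rect.hclust(iris_holdout, k = 3)

# Cut the tree into 3 groups and plot the clusters as in the k-means section
iris.clusters <- cutree(iris_holdout, k = 3)
fviz_cluster(list(data = iris_set_standard, cluster = iris.clusters))

# Compare with the true species
table(iris.clusters, iris$Species)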

K-nearest neighbour algorithm


The KNN algorithm is a non-parametric, supervised machine learning method. It uses proximity as the criterion for making classifications or predictions for new observations, so it is most often used to solve classification problems. The idea of the method is to find the points that are closest to a given query point and let them determine its class.
There are three typical distance metrics we can use to evaluate the distances, illustrated in the sketch after this list.
- Euclidean distance: This metric directly measures the straight-line distance between the query point and the point being measured: d(p, q) = sqrt((q1 - p1)^2 + (q2 - p2)^2).
- Manhattan distance: This metric measures the sum of the absolute differences between two points: d(x, y) = sum over i of |xi - yi|.
- Minkowski distance: The Minkowski metric is a generalization of both the Euclidean and Manhattan distances; its formula is d(x, y) = (|x1 - y1|^p + |x2 - y2|^p + ... + |xn - yn|^p)^(1/p).
Generally, we use the Euclidean distance as our metric because it directly measures the straight-line distance between two data points.
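To make the formulas concrete, here is a small illustrative sketch of the three metrics in R (the function names are my own, not from any package):

# Distances between two numeric vectors
euclidean <- function(x, y) sqrt(sum((x - y)^2))
manhattan <- function(x, y) sum(abs(x - y))
minkowski <- function(x, y, p) sum(abs(x - y)^p)^(1 / p)

a <- c(1, 2); b <- c(4, 6)
euclidean(a, b)       # 5
manhattan(a, b)       # 7
minkowski(a, b, 1)    # 7, reduces to the Manhattan distance
minkowski(a, b, 2)    # 5, reduces to the Euclidean distance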

There are three broad steps involved in KNN. The first is to choose K, the number of nearest neighbours considered for a given observation. We usually begin with K close to the square root of the number of observations n and prefer K to be odd so that ties in the neighbour vote are less likely. If K = 1, the model can overfit the data, since each prediction depends on a single, possibly noisy, neighbour. However, if K approaches the total number of observations, the model becomes too smooth and may underfit the data, and this can also be computationally expensive. The second step, assuming the chosen K is suitable, is to decide which distance metric to use for the data (e.g. Euclidean). In the third step, once the algorithm has computed the distances between data points, each query point is assigned to whatever group contains the greatest number of its nearest neighbours. This is also why we prefer an odd K: it reduces the chance that the vote among the neighbours ends in a tie, so the computer knows where to assign each data point.

Define K
The K in this algorithm is the number of neighbours that are examined so that the algorithm can determine the classification of a particular query point. Choosing K involves a trade-off that can cause overfitting or underfitting of the model: larger values of K lead to higher bias but lower variance, while smaller values of K give higher variance but lower bias. K is usually set to a larger number if the data contain many outliers.

Tuning K:
The choice of K can significantly affect the accuracy of the KNN algorithm, so we can use R code to choose the K with the smallest misclassification rate.
First, I load the FNN package and set a random seed with set.seed() before sampling. Then I assign observations to a training set and a testing set, scale the data, and create a confusion matrix. After that, I need to find the value of K with the lowest validation error on the test set, so I use an apply-style function to compute and store the misclassification rate for each candidate K. Finally, I plot the error against K and choose the K with the minimum misclassification rate. Tuning K is usually somewhat involved; a quick shortcut is to choose K near the square root of the data size.
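A sketch of this tuning procedure (the seed, the candidate range 1 to 20, and the use of sapply() as the apply-style function are my own choices; knn() here is the FNN package's function, which has the same interface as class::knn()):

library(FNN)

set.seed(42)
idx     <- sample(nrow(iris), 0.8 * nrow(iris))        # 80/20 train/test split
train_x <- scale(iris[idx, 1:4])
test_x  <- scale(iris[-idx, 1:4],
                 center = attr(train_x, "scaled:center"),
                 scale  = attr(train_x, "scaled:scale"))
train_y <- iris$Species[idx]
test_y  <- iris$Species[-idx]

# Misclassification rate on the test set for each candidate K
k_values   <- 1:20
error_rate <- sapply(k_values, function(k) {
  pred <- knn(train = train_x, test = test_x, cl = train_y, k = k)
  mean(pred != test_y)
})

plot(k_values, error_rate, type = "b",
     xlab = "K", ylab = "Misclassification rate")
k_values[which.min(error_rate)]    # K with the lowest error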

Applications of the KNN algorithm in R code:


In the R code in Appendix C, I first need to specify the class labels, which is different from the previous two algorithms because KNN is supervised learning. I also scale columns 1 to 4 because they are all numeric. Then I set the ratio of our testing and training sets to 20/80, so the training set contains 80 percent of the data. After that, I randomly select 80% of the iris data using the sample() function and assign them to the training set, and then label the training data points and the test data points. Then I run the KNN model. In the call to the knn() function, knn(train=data_train, test=data_test, cl=train_labels, k=round(sqrt(nrow(data_train)))), I choose K as the square root of the size of the training set; the cl argument tells the computer the true class labels of the training data. I use the ggplot() function to plot the result of my KNN algorithm, setting length as X and width as Y for both sepal and petal. In the results, I can see that the setosa group seems to have higher sepal length and smaller sepal width than the other two species. Setosa also tends to have smaller petal length and petal width, while virginica tends to have larger petal length and petal width.
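A minimal sketch of this workflow (the seed and the helper objects iris_scaled and plot_df are my own additions; the actual code is in Appendix C):

library(FNN)        # knn(); class::knn() has the same interface
library(ggplot2)

set.seed(7)
iris_scaled  <- as.data.frame(scale(iris[, 1:4]))      # scale columns 1 to 4
idx          <- sample(nrow(iris), 0.8 * nrow(iris))   # 80/20 train/test split
data_train   <- iris_scaled[idx, ]
data_test    <- iris_scaled[-idx, ]
train_labels <- iris$Species[idx]

# K chosen as the square root of the training-set size
knn_pred <- knn(train = data_train, test = data_test,
                cl = train_labels, k = round(sqrt(nrow(data_train))))

# Plot the test points, coloured by predicted species
plot_df <- cbind(iris[-idx, 1:4], Predicted = knn_pred)
ggplot(plot_df, aes(x = Sepal.Length, y = Sepal.Width, colour = Predicted)) + geom_point()
ggplot(plot_df, aes(x = Petal.Length, y = Petal.Width, colour = Predicted)) + geom_point()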

The comparison of these three algorithms


The first key difference between KNN and K-means is that K-means is an unsupervised learning method: it uses a fixed number of clusters, k, and groups the observations together based on their similarities. KNN, by contrast, is a supervised learning method that relates data points to each other through the distances between them. A fundamental function of the KNN algorithm is to classify a new data point into one of the existing groups: given a new data point, the computer looks at its K nearest neighbours and counts how many of them belong to group 1 or group 2. Because of this difference, the K-means method is generally applied for clustering, while the KNN algorithm is often applied for classification.
The KNN algorithm is simple and quick to implement because the code is short and the method is reasonably accurate. The K-means algorithm is often applied when the number of classes is known. To be more specific about the K-means algorithm: it randomly chooses K centroids in the data space and then, according to the positions of the centroids, optimizes their positions. In detail, the algorithm repeatedly assigns the data points that are nearest to each centroid to that cluster and then recalculates the position of the centroid, and it keeps doing this until the positions of the centroids barely change or the required number of iterations has been reached. A bare-bones sketch of this loop is given below.
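The sketch below is a toy illustration of that loop in R (simple_kmeans is my own function, not the implementation behind R's kmeans(); it does not handle empty clusters):

# Repeat: assign each point to the nearest centroid, recompute the centroids,
# and stop when the assignments no longer change
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  centroids  <- x[sample(nrow(x), k), , drop = FALSE]   # random initial centroids
  assignment <- rep(0, nrow(x))
  for (iter in seq_len(max_iter)) {
    # distances of every point (rows) to every centroid (columns)
    d <- as.matrix(dist(rbind(centroids, x)))[-(1:k), 1:k]
    new_assignment <- max.col(-d)                        # index of the nearest centroid
    if (all(new_assignment == assignment)) break         # centroids have stabilized
    assignment <- new_assignment
    for (j in seq_len(k)) {
      centroids[j, ] <- colMeans(x[assignment == j, , drop = FALSE])
    }
  }
  list(cluster = assignment, centers = centroids)
}

table(simple_kmeans(scale(iris[, 1:4]), k = 3)$cluster, iris$Species)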

Like the K-means algorithm, the hierarchical clustering algorithm is an unsupervised clustering algorithm that groups unlabelled data by similarity. The advantage of using hierarchical clustering instead of k-means is that we do not need to choose the value of k subjectively before running the algorithm. Moreover, the distance metric in hierarchical clustering can be a metric other than the Euclidean distance, which is useful when we have certain specific types of data.

The dendrogram produced by hierarchical clustering makes it easier to see how many clusters there are than K-means does. Also, compared with K-means and KNN, the results of hierarchical clustering are easy to visualize, which makes it straightforward for the audience to extract the important information. However, if the dataset is large, K-means can run considerably faster than hierarchical clustering.
Appendix
A). K-means:
Graph A

Table A
Graph B
Table B
B) Hierarchical Clustering
Graph A

Table A
Graph B

C). K-Nearest Neighbor

Graph A
Graph B
