
9 Unsupervised Learning

We have been mainly focusing on supervised learning, where we have i) a set of predictors X1, ..., Xp measured on n observations x1, ..., xn, and ii) corresponding responses y1, ..., yn. The responses (or, “correct answers”) guide the construction of our model. However, in the cases where we only have the observations x1, ..., xn on the p predictors, that “guidance” is no longer available; this is the motivation for unsupervised learning.

9.1 Principal Component Analysis (PCA)

Given the p dimensions/predictors X1, ..., Xp of a data set, the Principal Components (PCs) are linear combinations of the predictors:

Z = \phi_1 X_1 + \ldots + \phi_p X_p, \quad \text{where } \sum_{i=1}^{p} \phi_i^2 = 1 \qquad (9.1)

Each φ is called the loading of its corresponding predictor/variable46. The constraint that the squared loadings sum to 1 prevents the coefficients from becoming arbitrarily large, which would otherwise lead to an arbitrarily large variance. In Ch 5.3, we learned that the 1st PC is the dimension that maximizes the variance explained in a data set47.

46 The loading of a predictor can be thought of as its “contribution” to the value of an observation on the dimension of a PC (i.e. Z).
47 The direction in which the data vary the most.
The maximization goal with which we find the 1st PC, subject to the constraint in Eq. 9.1, is then Eq. 9.2. The optimization problem is solved using eigendecomposition.

\underset{\phi_{11},\ldots,\phi_{p1}}{\text{maximize}} \;\; \frac{1}{n} \sum_{i=1}^{n} \Big( \sum_{j=1}^{p} \phi_{j1} x_{ij} \Big)^{2}, \quad \text{subject to } \sum_{j=1}^{p} \phi_{j1}^{2} = 1 \qquad (9.2)
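For concreteness, here is a minimal sketch of solving Eq. 9.2 via eigendecomposition of the sample covariance matrix, assuming NumPy and a centered toy data matrix (the data and variable names are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # toy data: n = 100 observations, p = 3 predictors
X = X - X.mean(axis=0)                 # center each column (PCA assumes centered data)

cov = (X.T @ X) / X.shape[0]           # sample covariance matrix (1/n scaling, as in Eq. 9.2)
eigvals, eigvecs = np.linalg.eigh(cov) # eigendecomposition; eigh returns ascending eigenvalues

phi1 = eigvecs[:, -1]                  # loading vector of the 1st PC (unit norm by construction)
z1 = X @ phi1                          # scores of the observations on the 1st PC

print(np.sum(phi1 ** 2))               # ~1.0, the constraint in Eq. 9.2
print(z1.var())                        # equals the largest eigenvalue (the maximized variance)
```

The leading eigenvector plays the role of (φ11, ..., φp1), and its eigenvalue is the variance that Eq. 9.2 maximizes.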
As a linear combination of the original p dimensions, a PC may help the discovery of hidden patterns in a data set. For instance, if a PC is mainly loaded by a few of the dimensions, then the combination of these dimensions may have some meaningful interpretation. Another way to look at the PCs is as the low-dimensional linear surface that is closest to the observations.
Further, note that scaling (e.g. standardizing each predictor to N(0, 1)) affects the output of PCA. This is desirable when the predictors are measured in different units. Scaling is unnecessary, and potentially damaging, when the predictors share the same unit of measurement.
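As a small illustration of this scaling choice, the sketch below assumes scikit-learn and a toy data set whose two predictors live on very different scales (the names are placeholders):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# toy data: the second predictor has a much larger scale than the first
X = np.column_stack([rng.normal(0, 1, 200), rng.normal(0, 100, 200)])

pca_raw = PCA().fit(X)                                   # unscaled: the large-variance column dominates
pca_std = PCA().fit(StandardScaler().fit_transform(X))   # standardized to mean 0, variance 1 first

print(pca_raw.explained_variance_ratio_)  # 1st PC explains almost everything (an artifact of units)
print(pca_std.explained_variance_ratio_)  # variance is spread more evenly across the PCs
```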

PCA is, after all, an approximation to the actual data. To evaluate its performance, it is therefore informative to know the proportion of variance explained (PVE) for the data. PVE is defined as follows (for the m-th PC), where the numerator is the variance explained by that PC, and the denominator is the total amount of variance in the data set:

\text{PVE} = \frac{\sum_{i=1}^{n} \big( \sum_{j=1}^{p} \phi_{jm} x_{ij} \big)^{2}}{\sum_{i=1}^{n} \sum_{j=1}^{p} x_{ij}^{2}} \qquad (9.3)
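To make Eq. 9.3 concrete, here is a minimal NumPy sketch that computes the PVE of each PC directly from the formula, assuming a centered toy data matrix and loadings obtained by eigendecomposition as above (names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
X = X - X.mean(axis=0)                       # Eq. 9.3 assumes centered columns

# loading matrix: column m holds the loadings (phi_1m, ..., phi_pm) of the m-th PC
_, eigvecs = np.linalg.eigh((X.T @ X) / X.shape[0])
phi = eigvecs[:, ::-1]                       # reorder so the 1st column is the 1st PC

scores = X @ phi                             # z_im = sum_j phi_jm * x_ij
pve = (scores ** 2).sum(axis=0) / (X ** 2).sum()   # Eq. 9.3, one value per PC

print(pve)            # proportion of variance explained by each PC
print(pve.sum())      # sums to 1 when all p PCs are kept
```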
In order to decide how many PCs to include in a PCA model, we plot the number of PCs against the (cumulative) PVE in a scree plot (as in Fig. 9.1). A rule of thumb here is the elbow method: take the number of PCs at which the scree plot shows a major change of direction. In this example, PC = 2 would be a good number to pick by this rule. However, keep in mind that the actual decision should be based on many other factors (e.g. the number of PCs needed to keep a highly faithful approximation while not paying too much computational cost).
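A sketch of how such a scree plot could be produced, assuming scikit-learn and matplotlib with placeholder data:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 6))                # placeholder data set with p = 6 predictors

pca = PCA().fit(X - X.mean(axis=0))
pve = pca.explained_variance_ratio_          # per-component PVE

plt.plot(range(1, len(pve) + 1), np.cumsum(pve), marker="o")
plt.xlabel("Number of principal components")
plt.ylabel("Cumulative PVE")
plt.show()                                   # look for the elbow where the curve flattens
```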
9.2 Clustering

Succinctly stated, clustering methods look to find homogeneous subgroups among the observations.

9.2.1 K-Means

As its name makes clear, K-Means clustering seeks to partition a data set into K subgroups/classes, where K is empirical and its choice depends on two main factors:

• Clustering Pattern: How the data naturally partition and gather into subgroups.
• Granularity Needed: How much granularity is required for a particular task.

Fig. 9.2 presents the grouping results on a data set using K = 2, 3, 4. A K-Means clustering satisfies two properties:

• C1 ∪ ... ∪ CK = {1, ..., n}. I.e. each observation belongs to at least one of the K clusters.
• Ci ∩ Cj = ∅ for all i ≠ j. I.e. each observation can only belong to one cluster.

An ideal K-Means clustering minimizes the total (i.e. over all K classes) within-class squared Euclidean distance:

\underset{C_1,\ldots,C_K}{\text{minimize}} \Big\{ \sum_{k=1}^{K} W(C_k) \Big\} = \underset{C_1,\ldots,C_K}{\text{minimize}} \sum_{k=1}^{K} \frac{1}{|C_k|} \sum_{i,i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^{2} \qquad (9.4)
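To illustrate the objective in Eq. 9.4, here is a small NumPy sketch that evaluates the total within-cluster variation for a given assignment (the data and labels are placeholders):

```python
import numpy as np

def within_cluster_variation(X, labels):
    """Sum over clusters of (1/|C_k|) times the sum of squared pairwise distances, as in Eq. 9.4."""
    total = 0.0
    for k in np.unique(labels):
        C = X[labels == k]
        diffs = C[:, None, :] - C[None, :, :]     # pairwise differences within cluster k
        total += (diffs ** 2).sum() / len(C)      # (1/|C_k|) * sum_{i,i'} sum_j (x_ij - x_i'j)^2
    return total

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 2))
labels = rng.integers(0, 3, size=30)              # an arbitrary assignment into K = 3 clusters
print(within_cluster_variation(X, labels))
```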
The optimization problem in Eq. 9.4 is solved using the following algorithm:

• Randomly pick K observations to represent the K clusters, and assign each observation to the cluster k whose initializer is closest to it in Euclidean distance.
• Iterate the following steps until the assignments stop changing:
  – Compute the centroid of each cluster, and represent the cluster by its centroid.
  – Reassign each observation to the cluster k whose centroid is closest to it in Euclidean distance.

Although the K-Means algorithm is guaranteed to reduce the optimization objective in Eq. 9.4, it may be trapped in a local optimum, and the clustering results depend heavily on the selection of the random initializers. Therefore, it is recommended that the process be run at least a few times with different random initializers, and that the final clustering be decided by observing the clustering patterns.
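The following is a minimal NumPy sketch of this procedure, including the recommended random restarts; it compares runs by the within-cluster sum of squares, which is equivalent to Eq. 9.4 up to a constant factor. It is an illustrative re-implementation, and all names are placeholders.

```python
import numpy as np

def kmeans(X, K, n_restarts=10, max_iter=100, seed=0):
    """Basic K-Means with several random restarts; keeps the run with the lowest objective."""
    rng = np.random.default_rng(seed)
    best_labels, best_obj = None, np.inf
    for _ in range(n_restarts):
        # step 1: randomly pick K observations as the initial cluster representatives
        centroids = X[rng.choice(len(X), size=K, replace=False)]
        for _ in range(max_iter):
            # assign each observation to the closest centroid (squared Euclidean distance)
            dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
            labels = dists.argmin(axis=1)
            # recompute each cluster's centroid (keep the old one if a cluster empties)
            new_centroids = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                      else centroids[k] for k in range(K)])
            if np.allclose(new_centroids, centroids):   # centroids (and assignments) have stabilized
                break
            centroids = new_centroids
        obj = dists.min(axis=1).sum()                   # within-cluster sum of squares for this run
        if obj < best_obj:
            best_obj, best_labels = obj, labels
    return best_labels, best_obj

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])  # two toy clouds
labels, obj = kmeans(X, K=2)
print(obj, np.bincount(labels))
```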
