Ch12 Unsupervised Learning
The Goals of Unsupervised Learning
The Challenge of Unsupervised Learning
Another advantage
Principal Components Analysis
Principal Components Analysis: details
• The first principal component of a set of features
X1, X2, . . . , Xp is the normalized linear combination of the
features

    Z1 = φ11 X1 + φ21 X2 + · · · + φp1 Xp

that has the largest variance. By normalized, we mean that
Σ_{j=1}^p φj1² = 1.
• We refer to the elements φ11 , . . . , φp1 as the loadings of the
first principal component; together, the loadings make up
the principal component loading vector,
φ1 = (φ11 φ21 . . . φp1 )T .
• We constrain the loadings so that their sum of squares is
equal to one, since otherwise setting these elements to be
arbitrarily large in absolute value could result in an
arbitrarily large variance.
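A minimal numpy sketch of this normalization (synthetic data; the mixing matrix is arbitrary): the first loading vector can be taken as the top right singular vector of the centered data matrix, and the sum-of-squares constraint holds automatically because singular vectors have unit length.

```python
import numpy as np

# Synthetic data: 100 observations of 3 correlated features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) @ np.array([[2.0, 0.5, 0.1],
                                          [0.5, 1.0, 0.2],
                                          [0.1, 0.2, 0.5]])
X = X - X.mean(axis=0)          # center each column

# The loading vectors are the right singular vectors of the centered data
_, _, Vt = np.linalg.svd(X, full_matrices=False)
phi1 = Vt[0]                    # first principal component loading vector

print(phi1)
print(np.sum(phi1 ** 2))        # the normalization: sums to 1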
PCA: example
[Figure: scatter of Ad Spending versus Population]
The population size (pop) and ad spending (ad) for 100 different
cities are shown as purple circles. The green solid line indicates
the first principal component direction, and the blue dashed
line indicates the second principal component direction.
Computation of Principal Components
• Suppose we have an n × p data set X. Since we are only
interested in variance, we assume that each of the variables
in X has been centered to have mean zero (that is, the
column means of X are zero).
• We then look for the linear combination of the sample
feature values of the form

    zi1 = φ11 xi1 + φ21 xi2 + · · · + φp1 xip

for i = 1, . . . , n that has largest sample variance, subject to
the constraint that Σ_{j=1}^p φj1² = 1.
• Since each of the xij has mean zero, so does zi1 (for
any values of φj1). Hence the sample variance of the zi1
can be written as (1/n) Σ_{i=1}^n zi1².
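These quantities can be sketched in a few lines of numpy (synthetic centered data; the loading vector comes from the SVD, which solves this maximization):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
X = X - X.mean(axis=0)            # center: column means are now zero

# The variance-maximizing loading vector is the top right singular vector
_, _, Vt = np.linalg.svd(X, full_matrices=False)
phi1 = Vt[0]

z1 = X @ phi1                     # scores z_i1 = sum_j phi_j1 * x_ij
n = X.shape[0]

print(z1.mean())                  # essentially zero, since X is centered
print(np.sum(z1 ** 2) / n)        # the sample variance of the z_i1
```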
Computation: continued
Geometry of PCA
Further principal components
Further principal components: continued
Illustration
USArrests data: PCA plot
[Figure: biplot of the first two principal components of the
USArrests data; states plotted by their scores, with loading
vectors for the variables Murder, Assault, UrbanPop, and Rape]
Figure details
PCA loadings
PC1 PC2
Murder 0.5358995 -0.4181809
Assault 0.5831836 -0.1879856
UrbanPop 0.2781909 0.8728062
Rape 0.5434321 0.1673186
Another Interpretation of Principal Components
[Figure: a cloud of simulated observations]
PCA finds the hyperplane closest to the observations
Scaling of the variables matters
• If the variables are in different units, scaling each to have
standard deviation equal to one is recommended.
• If they are in the same units, you might or might not scale
the variables.
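A quick numpy illustration of why this matters, using two hypothetical features whose units differ by three orders of magnitude:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
# Hypothetical example: two independent features in very different
# units, e.g. a rate (spread ~1) and a raw count (spread ~1000)
X = np.column_stack([rng.normal(0.0, 1.0, n),
                     rng.normal(0.0, 1000.0, n)])

def first_loading(M):
    """First principal component loading vector of a data matrix."""
    M = M - M.mean(axis=0)
    _, _, Vt = np.linalg.svd(M, full_matrices=False)
    return Vt[0]

# Unscaled: the high-variance feature dominates the first component
print(first_loading(X))

# Scaled to standard deviation one: both features can contribute
Z = (X - X.mean(axis=0)) / X.std(axis=0)
print(first_loading(Z))
```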
[Figure: biplots of the USArrests data — Scaled, with each variable
scaled to standard deviation one (left), and Unscaled (right)]
Proportion Variance Explained
• To understand the strength of each component, we are
interested in knowing the proportion of variance explained
(PVE) by each one.
• The total variance present in a data set (assuming that the
variables have been centered to have mean zero) is defined
as
    Σ_{j=1}^p Var(Xj) = Σ_{j=1}^p (1/n) Σ_{i=1}^n xij².
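Given this definition, the PVE is easy to compute from the singular values of the centered data matrix; a numpy sketch on synthetic data (the mixing matrix is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))   # synthetic data
X = X - X.mean(axis=0)                                    # center

# Total variance = sum over features of (1/n) * sum_i xij^2, which
# equals (1/n) times the sum of squared singular values
n = X.shape[0]
_, s, _ = np.linalg.svd(X, full_matrices=False)

pve = s ** 2 / np.sum(s ** 2)   # proportion of variance explained
print(pve)                      # one proportion per component
print(np.cumsum(pve))           # cumulative PVE, ending at 1.0
```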
[Figure: proportion of variance explained by each principal
component (left) and the cumulative PVE (right)]
How many principal components should we use?
Clustering
PCA vs Clustering
Clustering for Market Segmentation
Two clustering methods
K-means clustering
[Figure: a simulated data set clustered by K-means with K = 2,
K = 3, and K = 4]
Details of K-means clustering: continued
• The idea behind K-means clustering is that a good
clustering is one for which the within-cluster variation is as
small as possible.
• The within-cluster variation for cluster Ck is a measure
WCV(Ck ) of the amount by which the observations within
a cluster differ from each other.
• Hence we want to solve the problem
    minimize_{C1,...,CK} { Σ_{k=1}^K WCV(Ck) }.    (2)
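A compact numpy sketch of this objective and of the iterative algorithm that minimizes it (synthetic two-blob data; within-cluster variation is measured here as squared distance to the cluster mean, which equals the pairwise-distance form up to a constant factor):

```python
import numpy as np

def wcv(X, labels, K):
    """Within-cluster variation of a clustering: the sum of squared
    Euclidean distances from each observation to its cluster mean."""
    return sum(np.sum((X[labels == k] - X[labels == k].mean(axis=0)) ** 2)
               for k in range(K))

def kmeans(X, K, n_iter=20, seed=0):
    """A minimal K-means sketch (no empty-cluster handling,
    fixed iteration count)."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(K, size=len(X))     # Step 1: random assignment
    for _ in range(n_iter):
        # Step 2(a): compute the centroid of each cluster
        centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # Step 2(b): reassign each observation to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
    return labels

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0.0, 0.5, (30, 2)),
               rng.normal(3.0, 0.5, (30, 2))])
labels = kmeans(X, K=2)
print(wcv(X, labels, K=2))    # small when the two groups are recovered
```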
K-Means Clustering Algorithm
Properties of the Algorithm
where x̄kj = (1/|Ck|) Σ_{i∈Ck} xij is the mean for feature j in
cluster Ck.
• However, the algorithm is not guaranteed to give the global
minimum. Why not?
Example
[Figure: progress of the K-means algorithm; the first panels are
labeled Data, Step 1, and Iteration 1, Step 2a]
Details of Previous Figure
The progress of the K-means algorithm with K=3.
• Top left: The observations are shown.
• Top center: In Step 1 of the algorithm, each observation is
randomly assigned to a cluster.
• Top right: In Step 2(a), the cluster centroids are computed.
These are shown as large colored disks. Initially the
centroids are almost completely overlapping because the
initial cluster assignments were chosen at random.
• Bottom left: In Step 2(b), each observation is assigned to
the nearest centroid.
• Bottom center: Step 2(a) is once again performed, leading
to new cluster centroids.
• Bottom right: The results obtained after 10 iterations.
Example: different starting values
[Figure: K-means run from different random starts; the objective
values shown are 320.9, 235.8, and 235.8]
Details of Previous Figure
Hierarchical Clustering
Hierarchical Clustering: the idea
Builds a hierarchy in a “bottom-up” fashion...
[Figure: points, including A and B, merged step by step into a
hierarchy]
Hierarchical Clustering Algorithm
The approach in words:
• Start with each point in its own cluster.
• Identify the closest two clusters and merge them.
• Repeat.
• Ends when all points are in a single cluster.
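The steps above can be sketched with scipy (the five points use made-up coordinates loosely matching the A–E toy example):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical coordinates for five points; A, B, C form one loose
# group and D, E another
pts = np.array([[0.0, 0.0],    # A
                [0.3, 0.1],    # B
                [1.0, 0.2],    # C
                [3.0, 3.0],    # D
                [3.2, 3.1]])   # E

# linkage() performs exactly the merging loop above: start with each
# point in its own cluster, repeatedly fuse the two closest clusters,
# and record the height (dissimilarity) of every fusion
Z = linkage(pts, method="complete")
print(Z)                       # one row per merge: clusters fused + height

# Cutting the dendrogram at height 1.5 gives flat cluster labels
labels = fcluster(Z, t=1.5, criterion="distance")
print(labels)
```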
[Figure: five points A–E in two dimensions and the corresponding
dendrogram, with leaves ordered D, E, B, A, C and fusion heights
from 0 to 4]
An Example
[Figure: a scatter of observations (X1 versus X2) and the
dendrograms of the resulting hierarchical clustering, with fusion
heights from 0 to 10]
Details of previous figure
Types of Linkage
Linkage    Description
Complete   Maximal inter-cluster dissimilarity. Compute all pairwise
           dissimilarities between the observations in cluster A and
           the observations in cluster B, and record the largest of
           these dissimilarities.
Single     Minimal inter-cluster dissimilarity. Compute all pairwise
           dissimilarities between the observations in cluster A and
           the observations in cluster B, and record the smallest of
           these dissimilarities.
Average    Mean inter-cluster dissimilarity. Compute all pairwise
           dissimilarities between the observations in cluster A and
           the observations in cluster B, and record the average of
           these dissimilarities.
Centroid   Dissimilarity between the centroid for cluster A (a mean
           vector of length p) and the centroid for cluster B.
           Centroid linkage can result in undesirable inversions.
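The four rules can be compared directly with scipy's `linkage` function on synthetic two-group data; the height of the final merge (the dissimilarity at which the last two clusters fuse) differs by rule, with complete above average above single:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(6)
# Two well-separated groups of 10 observations each
X = np.vstack([rng.normal(0.0, 1.0, (10, 2)),
               rng.normal(6.0, 1.0, (10, 2))])

# The same data fused under each linkage rule; Z[-1, 2] is the height
# of the final merge (here: the two groups being joined)
heights = {}
for method in ["complete", "single", "average", "centroid"]:
    Z = linkage(X, method=method)
    heights[method] = Z[-1, 2]
    print(f"{method:9s} final merge height = {heights[method]:.2f}")
```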
Choice of Dissimilarity Measure
• So far we have used Euclidean distance.
• An alternative is correlation-based distance which considers
two observations to be similar if their features are highly
correlated.
• This is an unusual use of correlation, which is normally
computed between variables; here it is computed between
the observation profiles for each pair of observations.
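A small numpy illustration (synthetic profiles; the variable count and level shift are made up): two observations with the same shape but different levels are close in correlation-based distance yet far apart in Euclidean distance.

```python
import numpy as np

rng = np.random.default_rng(7)
base = rng.normal(size=20)               # a common "shape" over 20 variables

obs1 = base + 0.1 * rng.normal(size=20)  # noisy copy of the shape
obs2 = rng.normal(size=20)               # an unrelated shape
obs3 = base + 10.0                       # same shape, shifted far upward

def corr_dist(a, b):
    """Correlation-based distance between two observation profiles."""
    return 1.0 - np.corrcoef(a, b)[0, 1]

# Correlation-based distance ignores the level shift ...
print(corr_dist(obs1, obs3))             # near 0: similar profiles
print(corr_dist(obs1, obs2))             # large: unrelated profiles
# ... whereas Euclidean distance is dominated by it
print(np.linalg.norm(obs1 - obs3), np.linalg.norm(obs1 - obs2))
```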
[Figure: profiles of Observations 1, 2, and 3 plotted against the
variable index, 1 through 20]
Practical issues
Example: breast cancer microarray study
Fig. 1. Hierarchical clustering of 115 tumor tissues and 7
nonmalignant tissues using the "intrinsic" gene set. (A) A
scaled-down representation of the entire cluster of 534 genes and
122 tissue samples based on similarities in gene expression.
(B) Experimental dendrogram showing the clustering of the tumors
into five subgroups.
Fig. 5. Kaplan–Meier analysis of disease outcome in two patient
cohorts. (A) Time to development of distant metastasis in the 97
sporadic cases from van't Veer et al. Patients were stratified
according to the subtypes as shown in Fig. 2B. (B) Overall survival
for 72 patients with locally advanced breast cancer in the Norway
cohort. The normal-like tumor subgroups were omitted from both data
sets in this analysis.
Conclusions