Unsupervised Vs Supervised Learning: Supervised Learning
Examples of such subgroups include:
- patients grouped by their gene expression measurements,
- groups of shoppers characterized by their browsing and purchase histories,
- movies grouped by the ratings assigned by movie viewers.
Another advantage: it is often easier to obtain unlabeled data (from a lab instrument or a computer) than labeled data, which can require human intervention.
PCA: example

[Figure: scatterplot of Ad Spending versus Population, with the two principal component directions overlaid.]
The population size (pop) and ad spending (ad) for 100 different
cities are shown as purple circles. The green solid line indicates
the first principal component direction, and the blue dashed
line indicates the second principal component direction.
The first principal component scores take the form
$$z_{i1} = \phi_{11} x_{i1} + \phi_{21} x_{i2} + \cdots + \phi_{p1} x_{ip} \qquad (1)$$
for $i = 1, \ldots, n$, with the loadings chosen so that the $z_{i1}$ have the largest sample variance, subject to the constraint that $\sum_{j=1}^{p} \phi_{j1}^{2} = 1$.
Since each of the $x_{ij}$ has mean zero, then so does $z_{i1}$ (for any values of $\phi_{j1}$).
Computation: continued
Plugging in (1), the first principal component loading vector solves the optimization problem
$$\underset{\phi_{11}, \ldots, \phi_{p1}}{\text{maximize}} \;\; \frac{1}{n} \sum_{i=1}^{n} \left( \sum_{j=1}^{p} \phi_{j1} x_{ij} \right)^{2} \quad \text{subject to} \quad \sum_{j=1}^{p} \phi_{j1}^{2} = 1.$$
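This constrained maximization is solved by the first right singular vector of the centered data matrix. A minimal numpy sketch on simulated data (the data and dimensions here are invented for illustration; the slide's derivation is generic):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: n = 100 observations on p = 3 correlated features.
X = rng.normal(size=(100, 3)) @ np.array([[2.0, 0.0, 0.0],
                                          [1.0, 1.0, 0.0],
                                          [0.5, 0.5, 0.5]])
X = X - X.mean(axis=0)  # center so each x_ij has mean zero

# The constrained maximizer is the first right singular vector of X.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
phi1 = Vt[0]            # loading vector (phi_11, ..., phi_p1)
z1 = X @ phi1           # first principal component scores z_i1

assert np.isclose(np.sum(phi1**2), 1.0)  # constraint: sum_j phi_j1^2 = 1
# No random unit vector yields scores with larger sample variance.
for _ in range(200):
    v = rng.normal(size=3)
    v /= np.linalg.norm(v)
    assert np.var(z1) >= np.var(X @ v) - 1e-9
```

The SVD route is how PCA is computed in practice; forming the covariance matrix and taking its top eigenvector gives the same loading vector up to sign.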
Geometry of PCA
Illustration
[Figure: biplot of the first two principal components for the USArrests data; state names are plotted at their principal component scores, with loading arrows for Murder, Assault, Rape, and UrbanPop.]
Figure details
The first two principal components for the USArrests data.
The blue state names represent the scores for the first two
principal components.
The orange arrows indicate the first two principal component loading vectors.
PCA loadings

                 PC1         PC2
Murder     0.5358995  -0.4181809
Assault    0.5831836  -0.1879856
UrbanPop   0.2781909   0.8728062
Rape       0.5434321   0.1673186
[Figure: biplots of the USArrests data, scaled (left) and unscaled (right), showing how the loadings and scores for Murder, Assault, Rape, and UrbanPop change when the variables are not standardized.]
The proportion of variance explained (PVE) by the $m$th principal component is
$$\mathrm{PVE}_m = \frac{\sum_{i=1}^{n} z_{im}^{2}}{\sum_{j=1}^{p} \sum_{i=1}^{n} x_{ij}^{2}}.$$

[Figure: scree plot of the PVE of each of the four principal components of the USArrests data (left), and the cumulative PVEs (right).]
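The PVEs can be read off the singular values of the centered (and here scaled) data matrix, since $\sum_i z_{im}^2$ equals the $m$th squared singular value. A sketch on simulated data (USArrests itself is not bundled with numpy, so the matrix below is invented):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in for a dataset like USArrests: n = 50, p = 4,
# centered and scaled so each feature has mean zero and unit variance.
X = rng.normal(size=(50, 4))
X = (X - X.mean(axis=0)) / X.std(axis=0)

# PVE_m = sum_i z_im^2 / (sum_j sum_i x_ij^2) = s_m^2 / sum(X^2).
s = np.linalg.svd(X, compute_uv=False)
pve = s**2 / np.sum(X**2)
cum_pve = np.cumsum(pve)

assert np.isclose(cum_pve[-1], 1.0)   # the PVEs sum to one
assert np.all(np.diff(pve) <= 1e-12)  # PVEs are non-increasing
```

Plotting `pve` and `cum_pve` against the component index reproduces the two panels of the scree plot.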
How many principal components should we use?
There is no simple answer, but the scree plot on the previous slide can be used as a guide: we look for an "elbow," beyond which additional components explain little further variance.
Clustering
PCA vs Clustering
PCA looks for a low-dimensional representation of the observations that explains a good fraction of the variance; clustering looks for homogeneous subgroups among the observations.
K-means clustering
[Figure: K-means clustering of a simulated dataset with K = 2, K = 3, and K = 4.]
We would like to partition the observations into K clusters so that the total within-cluster variation is as small as possible:
$$\underset{C_1, \ldots, C_K}{\text{minimize}} \left\{ \sum_{k=1}^{K} \mathrm{WCV}(C_k) \right\}. \qquad (2)$$
Using squared Euclidean distance, the within-cluster variation is
$$\mathrm{WCV}(C_k) = \frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^{2}. \qquad (3)$$
Combining (2) and (3) gives the K-means optimization problem
$$\underset{C_1, \ldots, C_K}{\text{minimize}} \left\{ \sum_{k=1}^{K} \frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^{2} \right\}. \qquad (4)$$
For any cluster, the within-cluster variation in (3) can be rewritten in terms of deviations from the cluster means:
$$\frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^{2} = 2 \sum_{i \in C_k} \sum_{j=1}^{p} (x_{ij} - \bar{x}_{kj})^{2},$$
where $\bar{x}_{kj} = \frac{1}{|C_k|} \sum_{i \in C_k} x_{ij}$ is the mean for feature $j$ in cluster $C_k$.
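The average pairwise squared distance within a cluster equals twice the sum of squared deviations from the cluster mean; a quick numerical check of this identity (cluster data invented):

```python
import numpy as np

rng = np.random.default_rng(2)
C = rng.normal(size=(8, 3))  # one cluster: |C_k| = 8 observations, p = 3

# Left side: all pairwise squared distances, divided by |C_k|.
diffs = C[:, None, :] - C[None, :, :]
lhs = np.sum(diffs**2) / len(C)

# Right side: twice the sum of squared deviations from the cluster mean.
rhs = 2.0 * np.sum((C - C.mean(axis=0))**2)

assert np.isclose(lhs, rhs)
```

This identity is why Step 2a of the algorithm (replacing each cluster by its centroid) decreases the objective: the cluster mean minimizes the sum of squared deviations.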
The algorithm decreases the objective (4) at every step, but it is not guaranteed to reach the global minimum. Why not?
Example

[Figure: progress of the K-means algorithm on a simulated dataset: the raw data, the random initial assignment (Step 1), the alternating centroid and assignment updates (Iteration 1 Steps 2a and 2b, Iteration 2 Step 2a), and the final results.]
[Figure: K-means run from several different random starts; four runs reach objective value 235.8 and one reaches a worse local optimum at 310.9.]
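Because the algorithm only finds a local optimum, in practice it is run from several random starts and the solution with the lowest objective (4) is kept. A minimal numpy sketch of the two-step algorithm on invented toy data:

```python
import numpy as np

def kmeans(X, K, rng, n_iter=100):
    # Step 1: randomly assign each observation to one of the K clusters.
    labels = rng.integers(K, size=len(X))
    for _ in range(n_iter):
        # Step 2a: compute each cluster centroid (reseed empty clusters).
        centroids = np.array([X[labels == k].mean(axis=0)
                              if np.any(labels == k)
                              else X[rng.integers(len(X))]
                              for k in range(K)])
        # Step 2b: reassign each observation to its closest centroid.
        new = np.argmin(((X[:, None, :] - centroids[None, :, :])**2)
                        .sum(axis=-1), axis=1)
        if np.array_equal(new, labels):
            break
        labels = new
    # Objective (4), via the within-cluster identity (no pairwise sums).
    obj = sum(2.0 * np.sum((X[labels == k] - X[labels == k].mean(axis=0))**2)
              for k in range(K) if np.any(labels == k))
    return labels, obj

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0.0, 0.5, size=(30, 2)),   # toy group 1
               rng.normal(5.0, 0.5, size=(30, 2))])  # toy group 2

runs = [kmeans(X, K=2, rng=rng) for _ in range(6)]
best_labels, best_obj = min(runs, key=lambda r: r[1])
assert best_obj == min(obj for _, obj in runs)
```

Different starts can land in different local optima, exactly as in the figure above; keeping the lowest-objective run is the standard remedy.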
Hierarchical Clustering
K-means clustering requires us to pre-specify the number of clusters K; hierarchical clustering is an alternative approach that does not require committing to a particular K in advance.
[Figure: bottom-up hierarchical clustering illustrated step by step on five observations A, B, C, D, E, fusing the closest clusters one pair at a time.]
Dendrogram

[Figure: the dendrogram for observations A, B, C, D, E, with fusion heights on the vertical axis.]
An Example

[Figure: simulated observations plotted in two dimensions, X1 versus X2.]
Another Example

[Figure: nine labeled observations plotted in two dimensions, X1 versus X2.]
[Figure: the nine observations shown again in four panels on X1 and X2, illustrating the clusters implied by the dendrogram.]
Types of Linkage

Complete: Maximal inter-cluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the largest of these dissimilarities.

Single: Minimal inter-cluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the smallest of these dissimilarities.

Average: Mean inter-cluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the average of these dissimilarities.

Centroid: Dissimilarity between the centroid for cluster A (a mean vector of length p) and the centroid for cluster B. Centroid linkage can result in undesirable inversions.
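The table translates into code directly: agglomerative clustering repeatedly fuses the two clusters whose linkage dissimilarity is smallest. A minimal numpy sketch supporting complete, single, and average linkage (the toy data are invented):

```python
import numpy as np

def hierarchical(X, linkage="complete"):
    """Bottom-up clustering: start with singletons, repeatedly fuse the
    two clusters with the smallest linkage dissimilarity, recording the
    fusion height, until one cluster remains."""
    reduce = {"complete": max, "single": min,
              "average": lambda ds: sum(ds) / len(ds)}[linkage]
    clusters = [[i] for i in range(len(X))]
    d = np.sqrt(((X[:, None] - X[None])**2).sum(-1))  # pairwise Euclidean
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                dab = reduce([d[i, j] for i in clusters[a]
                              for j in clusters[b]])
                if best is None or dab < best[0]:
                    best = (dab, a, b)
        h, a, b = best
        merges.append((clusters[a], clusters[b], h))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges

# Toy data: two tight pairs and a far-away singleton.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [20.0, 0.0]])
merges = hierarchical(X, "complete")
heights = [m[2] for m in merges]
assert len(merges) == len(X) - 1       # n - 1 fusions in total
assert heights[0] < heights[-1]        # tight pairs fuse first
```

Swapping `"complete"` for `"single"` or `"average"` changes only the `reduce` function, mirroring how the linkages in the table differ only in which pairwise dissimilarity they record.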
[Figure: profiles of three observations across 20 variables (Variable Index on the x-axis), illustrating the difference between Euclidean and correlation-based distance.]
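The distinction the figure draws can be reproduced numerically: a profile that is far away in Euclidean distance yet perfectly correlated, versus one that is nearby in value but uncorrelated (all three profiles below are invented for illustration):

```python
import numpy as np

t = np.arange(20)
obs1 = np.sin(t / 3.0)          # baseline profile over 20 variables
obs2 = np.sin(t / 3.0) + 5.0    # same shape, shifted up: highly correlated
obs3 = np.cos(t / 1.1)          # similar magnitude, unrelated shape

def euclid(a, b):
    return np.sqrt(np.sum((a - b)**2))

def corr_dist(a, b):
    return 1.0 - np.corrcoef(a, b)[0, 1]

# Euclidean distance pairs obs1 with obs3; correlation-based distance
# pairs obs1 with obs2.
assert euclid(obs1, obs3) < euclid(obs1, obs2)
assert corr_dist(obs1, obs2) < corr_dist(obs1, obs3)
```

Which dissimilarity is appropriate depends on whether absolute levels or profile shapes matter for the application.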
[Figure: numbers of pairs of socks and computers purchased by shoppers, shown on the raw scale and after standardizing each variable; the choice of scaling changes which purchases dominate the dissimilarity.]
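A small numerical illustration of the standardization question (the purchase counts below are invented):

```python
import numpy as np

# Yearly purchases for four shoppers: [pairs of socks, computers].
X = np.array([[8.0, 0.0], [11.0, 0.0], [7.0, 1.0], [6.0, 1.0]])

# On the raw scale, sock counts dominate every Euclidean distance.
raw = np.sqrt(((X[:, None] - X[None])**2).sum(-1))

# After scaling each variable to unit standard deviation, a computer
# purchase carries as much weight as several pairs of socks.
Z = X / X.std(axis=0)
scaled = np.sqrt(((Z[:, None] - Z[None])**2).sum(-1))

# Raw: shopper 0 looks closer to shopper 2 (7 socks, 1 computer) than
# to shopper 1 (11 socks, 0 computers); scaling reverses this.
assert raw[0, 2] < raw[0, 1]
assert scaled[0, 1] < scaled[0, 2]
```

So the answer to "should we standardize?" changes which shoppers cluster together; there is no universally right choice, only one appropriate to the goal.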
Practical issues
Should the observations or features first be standardized in some way?
Fig. 1. Hierarchical clustering of 115 tumor tissues and 7 nonmalignant tissues using the intrinsic gene set. (A) A scaled-down representation of the entire cluster
of 534 genes and 122 tissue samples based on similarities in gene expression. (B) Experimental dendrogram showing the clustering of the tumors into five subgroups.
Conclusions