
Announcements

Unsupervised learning
1. Unsupervised Learning (Ch. 12)

Supervised vs. Unsupervised Learning
• Supervised Learning: both $X$ and $Y$ are known
• Unsupervised Learning: only $X$ is known

[Figure: two panels, "Supervised Learning" vs. "Unsupervised Learning"]

Challenges of Unsupervised Learning
• Tend to be more subjective
• No simple goal like prediction of a response
• Part of an exploratory data analysis
• Hard to assess the results obtained
• The reason is simple: there is no way to check our work, because we don't
know the true answer; the problem is "unsupervised"

Roadmap for today
1. Principal Component Analysis (Ch. 12.2)

What Does PCA Do?
• Dimension = the number of columns/variables ($p$) in the data matrix $X$

$(n \times p) \rightarrow (n \times M)$, with $M < p$

• PCA looks to find a low-dimensional representation of the
observations that explains a good fraction of the variance

Transformation of X to Z
• Transforms the feature matrix $X$ into a lower-dimensional principal
feature matrix $Z$

$X\,(n \times p) \rightarrow Z\,(n \times M)$, with $M < p$
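In matrix form (writing $\Phi$ for the $p \times M$ matrix whose columns are the loading vectors defined on the next slide; this restatement is standard, not verbatim from the slide):

\[
\underset{n \times M}{Z} \;=\; \underset{n \times p}{X}\;\underset{p \times M}{\Phi}
\]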

Why would this be useful?
PCA as a Rotation/Projection
• Animations and short tutorial
https://www.ty-penguin.org.uk/~auj/old_aber_pages/talks/depttalk96/pca.active.html
https://www.youtube.com/watch?v=FgakZw6K1QQ
• Input variables → principal components

Terms
• Scores and loadings

$z_{i1} = \phi_{11} x_{i1} + \phi_{21} x_{i2} + \cdots + \phi_{p1} x_{ip}$

The $z_{i1}$ (one per observation) are the scores of the 1st PC; the
$\phi_{11}, \ldots, \phi_{p1}$ are the loadings of the 1st PC

Algorithm
Typically standardize the individual columns of the X matrix (mean zero
and standard deviation one) before PCA

PCA transforms the original variables (the columns of the X matrix) so that
the sum of squares of the resulting scores is maximized

Equivalent to minimizing the sum of squared perpendicular distances from
the observations to the line
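In symbols, the first-PC problem (the standard formulation from ISLR 12.2, restated here for reference):

\[
\max_{\phi_{11},\ldots,\phi_{p1}}\; \frac{1}{n}\sum_{i=1}^{n}\Bigl(\sum_{j=1}^{p}\phi_{j1}x_{ij}\Bigr)^{2}
\quad \text{subject to} \quad \sum_{j=1}^{p}\phi_{j1}^{2}=1
\]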

This maximizes the spread of data:
Why is this good?
Further components
• $X$ is the $n \times p$ data matrix

• Once we have the first principal component loading vector $\phi_1$
(a $p \times 1$ vector)

• Score: $z_1 = X\phi_1$ (an $n \times 1$ vector)

• Solve for the loadings of the 2nd PC:
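In symbols, a sketch of the 2nd-PC problem (the same form as for the first PC plus an orthogonality constraint; restated here, not verbatim from the slide):

\[
\max_{\phi_2}\; \frac{1}{n}\,\lVert X\phi_2 \rVert^2
\quad \text{subject to} \quad \lVert \phi_2 \rVert = 1,\;\; \phi_2 \perp \phi_1
\]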

Properties of Principal Components
• The principal components (columns of the Z matrix) are orthogonal,
i.e., uncorrelated
• They are ordered according to the decreasing variance in the data
they capture: $Z_1$ has the largest variance, $Z_2$ has the second largest
variance, etc.
• The principal component scores can be used in further supervised
learning (e.g., as predictors in regression)

Example: PCA in Two-dimensional Case

R Implementation
Perform PCA on the USArrests dataset, which is part of base R (no new
package needs to be installed)

For each of the 50 US states, the data set contains the number of arrests
per 100,000 residents for each of three crimes: Assault, Murder, and Rape.
UrbanPop (the percent of each state's population living in urban areas) is
also recorded.

states=row.names(USArrests)
states

names(USArrests)

##find the mean and var of columns of the table
apply(USArrests, 2, mean)
apply(USArrests, 2, var)

Perform PCA
##apply PCA
pr.out=prcomp(USArrests, scale=TRUE)
names(pr.out)

pr.out$center #means of the X variables
pr.out$scale #stds of the X variables

pr.out$rotation #loading matrix

dim(USArrests)

dim(pr.out$x) #x: score vectors (Z matrix)
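As a quick sanity check (not from the original slides; it only uses objects created above), the scores in pr.out$x equal the standardized data multiplied by the loading matrix, and the score columns are uncorrelated:

##sanity check: Z = (standardized X) %*% (loading matrix)
Z <- scale(USArrests) %*% pr.out$rotation
all.equal(unname(Z), unname(pr.out$x)) #should be TRUE
round(cor(pr.out$x), 10) #off-diagonals are ~0: scores are uncorrelated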

Plot of the First Two PCs
biplot(pr.out, scale=0)

Proportion of Variance Explained (PVE)
pr.out$sdev

pr.var=pr.out$sdev^2
pr.var

pve=pr.var/sum(pr.var)
pve
\[
\mathrm{PVE}(PC_m) \;=\;
\frac{\sum_{i=1}^{n}\Bigl(\sum_{j=1}^{p}\phi_{jm}x_{ij}\Bigr)^{2}}
{\sum_{i=1}^{n}\sum_{j=1}^{p}x_{ij}^{2}}
\]
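Equivalently, base R reports the same quantities directly (summary() on a prcomp object is standard):

summary(pr.out) #standard deviation, PVE, and cumulative PVE per PC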

Scree Plot
plot(pve, xlab="Principal Component",
     ylab="Proportion of Variance Explained",
     ylim=c(0,1), type='b')

[Scree plot: Proportion of Variance Explained (0 to 1) vs. Principal Component (1 to 4)]
Plot of Cumulative PVE
plot(cumsum(pve), xlab="Principal Component",
     ylab="Cumulative Proportion of Variance Explained",
     ylim=c(0,1), type='b')
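For example, a simple way to pick the number of components from the cumulative PVE (the 90% threshold below is an arbitrary choice, used here only for illustration):

##smallest number of PCs whose cumulative PVE reaches 90%
which(cumsum(pve) >= 0.9)[1]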

Other Uses for PCA:
Preprocessing for Supervised Learning
Using principal components as input:
• See the R sketch below
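A minimal sketch of using PC scores as regression predictors (the response y below is a made-up placeholder, since USArrests contains no outcome variable; pr.out is the prcomp object fit earlier):

##sketch: PC scores as regression predictors
set.seed(1)
y <- rnorm(nrow(USArrests)) #placeholder response, for illustration only
scores <- pr.out$x[, 1:2]   #keep the first two PCs
summary(lm(y ~ scores))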
Handling Missing Data (See ISLR 12.3 for Details)

• Common approaches:
– Delete rows with missing data
– Estimate missing data with the average for that variable (the column mean)

The Problem with Deleting:

Excerpt of Netflix movie rating data. Movies are rated from 1 (worst) to
5 (best). Most customers have viewed <1% of the movies on Netflix.
The Problem with Averaging:
• Often, input data are correlated.
– E.g., customers who like Avengers 1 typically also like Avengers 2.
– Assume the average ratings for the two movies are (3, 3)
– A customer who liked Avengers 1 but hasn't seen Avengers 2,
i.e., (5, NA), would be represented by (5, 3)
Better Approach:
Estimate Missing Values with PCA
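A minimal sketch of the iterative idea behind ISLR 12.3's matrix-completion algorithm (the function name impute_pca and the fixed iteration count are choices made here for illustration; ISLR's Algorithm 12.1 instead iterates until the objective stops decreasing, and its lab standardizes the matrix first):

##iteratively impute NAs with a rank-M PCA/SVD approximation
impute_pca <- function(X, M = 1, n_iter = 20) {
  miss <- is.na(X)
  Xhat <- X
  ##start by filling missing entries with the column means
  Xhat[miss] <- matrix(colMeans(X, na.rm = TRUE),
                       nrow(X), ncol(X), byrow = TRUE)[miss]
  for (i in seq_len(n_iter)) {
    s <- svd(Xhat, nu = M, nv = M)        #rank-M fit of the current matrix
    approx <- s$u %*% (s$d[1:M] * t(s$v)) #U_M D_M V_M'
    Xhat[miss] <- approx[miss]            #refill only the missing entries
  }
  Xhat
}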
Questions
• What is the purpose of PCA?
• What are “scores” and “loadings” in PCA?
• Should we standardize each variable before PCA?
• If the data consists of $p$ variables (i.e., the $X$ matrix has $p$ columns),
how many principal components will we obtain initially?
• How are the initial principal components ordered?
• How can we reduce the number of principal components?
• How can the final (reduced) principal components be used in
statistical learning?

