
# Lecture Slides for Introduction to Machine Learning 2e

ETHEM ALPAYDIN
The MIT Press, 2010
alpaydin@boun.edu.tr
http://www.cmpe.boun.edu.tr/~ethem/i2ml2e
Why Reduce Dimensionality?
Reduces time complexity: Less computation
Reduces space complexity: Fewer parameters
Saves the cost of observing the feature
Simpler models are more robust on small datasets
More interpretable; simpler explanation
Data visualization (structure, groups, outliers, etc.) if plotted in 2 or 3 dimensions

Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)
Feature Selection vs Extraction
Feature selection: Choosing $k < d$ important features, ignoring the remaining $d - k$
Subset selection algorithms
Feature extraction: Project the original $x_i$, $i = 1,\ldots,d$ dimensions to new $k < d$ dimensions, $z_j$, $j = 1,\ldots,k$
Principal components analysis (PCA), linear discriminant analysis (LDA), factor analysis (FA)

Subset Selection
There are $2^d$ subsets of $d$ features
Forward search: Add the best feature at each step
Set of features $F$ initially $\emptyset$.
At each iteration, find the best new feature
$j = \arg\min_i E(F \cup x_i)$
Add $x_j$ to $F$ if $E(F \cup x_j) < E(F)$
Hill-climbing $O(d^2)$ algorithm
Backward search: Start with all features and remove one at a time, if possible.
Floating search (Add $k$, remove $l$)
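
The forward search above can be sketched in a few lines of Python. This is a minimal illustration, not the book's implementation: `error_fn` is a hypothetical callback standing in for $E(F)$ (for example, validation error of a model trained on the features in $F$), and `toy_error` is made-up data used only to exercise the loop.

```python
def forward_selection(d, error_fn):
    """Greedy forward search over feature indices 0..d-1.

    error_fn(subset) is a hypothetical callback returning E(F) for a
    tuple of feature indices; it is an assumption of this sketch.
    """
    selected = []
    best_error = error_fn(tuple(selected))
    while len(selected) < d:
        candidates = [i for i in range(d) if i not in selected]
        # j = argmin_i E(F U x_i)
        errors = {i: error_fn(tuple(selected + [i])) for i in candidates}
        j = min(errors, key=errors.get)
        # Add x_j only if it lowers the error; otherwise stop.
        if errors[j] >= best_error:
            break
        selected.append(j)
        best_error = errors[j]
    return selected, best_error


def toy_error(subset):
    # Made-up error: features 0 and 2 help, every feature has a small cost.
    return 1.0 - 0.5 * (0 in subset) - 0.3 * (2 in subset) + 0.01 * len(subset)


selected, err = forward_selection(4, toy_error)
print(selected, err)
```

Each pass evaluates at most $d$ candidates and there are at most $d$ passes, which is where the $O(d^2)$ count of error evaluations comes from.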

Principal Components Analysis (PCA)
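Find a low-dimensional space such that when $\mathbf{x}$ is projected there, information loss is minimized.
The projection of $\mathbf{x}$ on the direction of $\mathbf{w}$ is: $z = \mathbf{w}^T\mathbf{x}$
Find $\mathbf{w}$ such that Var(z) is maximized
$$\begin{aligned}
\mathrm{Var}(z) &= \mathrm{Var}(\mathbf{w}^T\mathbf{x}) = E[(\mathbf{w}^T\mathbf{x} - \mathbf{w}^T\boldsymbol{\mu})^2] \\
&= E[(\mathbf{w}^T\mathbf{x} - \mathbf{w}^T\boldsymbol{\mu})(\mathbf{w}^T\mathbf{x} - \mathbf{w}^T\boldsymbol{\mu})] \\
&= E[\mathbf{w}^T(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^T\mathbf{w}] \\
&= \mathbf{w}^T E[(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^T]\mathbf{w} = \mathbf{w}^T\boldsymbol{\Sigma}\mathbf{w}
\end{aligned}$$
where $\mathrm{Var}(\mathbf{x}) = E[(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^T] = \boldsymbol{\Sigma}$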

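Maximize Var(z) subject to $\|\mathbf{w}\| = 1$
$$\max_{\mathbf{w}_1} \; \mathbf{w}_1^T\boldsymbol{\Sigma}\mathbf{w}_1 - \alpha(\mathbf{w}_1^T\mathbf{w}_1 - 1)$$
$\boldsymbol{\Sigma}\mathbf{w}_1 = \alpha\mathbf{w}_1$, that is, $\mathbf{w}_1$ is an eigenvector of $\boldsymbol{\Sigma}$
Choose the one with the largest eigenvalue for Var(z) to be max
Second principal component: Max Var($z_2$), s.t. $\|\mathbf{w}_2\| = 1$ and orthogonal to $\mathbf{w}_1$
$$\max_{\mathbf{w}_2} \; \mathbf{w}_2^T\boldsymbol{\Sigma}\mathbf{w}_2 - \alpha(\mathbf{w}_2^T\mathbf{w}_2 - 1) - \beta(\mathbf{w}_2^T\mathbf{w}_1 - 0)$$
$\boldsymbol{\Sigma}\mathbf{w}_2 = \alpha\mathbf{w}_2$, that is, $\mathbf{w}_2$ is another eigenvector of $\boldsymbol{\Sigma}$
and so on.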

What PCA does
$\mathbf{z} = \mathbf{W}^T(\mathbf{x} - \mathbf{m})$
where the columns of $\mathbf{W}$ are the eigenvectors of $\boldsymbol{\Sigma}$, and $\mathbf{m}$ is the sample mean
Centers the data at the origin and rotates the axes
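
A minimal NumPy sketch of this projection follows. It assumes the sample covariance is used as the estimate of $\boldsymbol{\Sigma}$; the function name `pca` and its return values are choices made for this illustration.

```python
import numpy as np


def pca(X, k):
    """Project the rows of X onto the k leading principal components."""
    m = X.mean(axis=0)                      # sample mean
    S = np.cov(X - m, rowvar=False)         # sample covariance, estimate of Sigma
    eigvals, eigvecs = np.linalg.eigh(S)    # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]       # re-sort in descending order
    eigvals = eigvals[order]
    W = eigvecs[:, order[:k]]               # columns are the top-k eigenvectors
    Z = (X - m) @ W                         # z = W^T (x - m) for every row x
    return Z, W, m, eigvals


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ np.array([[3.0, 0.0, 0.0],
                                          [1.0, 1.0, 0.0],
                                          [0.0, 0.0, 0.1]])
Z, W, m, eigvals = pca(X, 2)
print(Z.shape, eigvals)
```

The variance of the first column of `Z` equals the largest eigenvalue, which matches the derivation that Var(z) is maximized by the leading eigenvector.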

How to choose k ?
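Proportion of Variance (PoV) explained
$$\mathrm{PoV} = \frac{\lambda_1 + \lambda_2 + \cdots + \lambda_k}{\lambda_1 + \lambda_2 + \cdots + \lambda_k + \cdots + \lambda_d}$$
when $\lambda_i$ are sorted in descending order
Typically, stop at PoV > 0.9
Scree graph plots of PoV vs $k$, stop at "elbow"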
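
As a small worked example of the PoV rule, the sketch below returns the smallest $k$ whose cumulative proportion of variance exceeds a threshold. The function name `choose_k` and the eigenvalues in the usage line are made up for illustration.

```python
def choose_k(eigenvalues, threshold=0.9):
    """Smallest k such that PoV(k) > threshold."""
    lams = sorted(eigenvalues, reverse=True)   # descending order
    total = sum(lams)
    running = 0.0
    for k, lam in enumerate(lams, start=1):
        running += lam
        if running / total > threshold:
            return k
    return len(lams)


k = choose_k([7.0, 2.5, 0.3, 0.2])
print(k)
```

With these eigenvalues the first component explains 70% of the variance and the first two explain 95%, so the rule stops at two components.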

Factor Analysis
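Find a small number of factors $\mathbf{z}$, which when combined generate $\mathbf{x}$:
$$x_i - \mu_i = v_{i1}z_1 + v_{i2}z_2 + \cdots + v_{ik}z_k + \varepsilon_i$$
where $z_j$, $j = 1,\ldots,k$ are the latent factors with
$E[z_j] = 0$, $\mathrm{Var}(z_j) = 1$, $\mathrm{Cov}(z_i, z_j) = 0$, $i \neq j$,
$\varepsilon_i$ are the noise sources with
$E[\varepsilon_i] = 0$, $\mathrm{Var}(\varepsilon_i) = \psi_i$, $\mathrm{Cov}(\varepsilon_i, \varepsilon_j) = 0$, $i \neq j$, $\mathrm{Cov}(\varepsilon_i, z_j) = 0$,
and $v_{ij}$ are the factor loadings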

PCA vs FA
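PCA: From $\mathbf{x}$ to $\mathbf{z}$: $\mathbf{z} = \mathbf{W}^T(\mathbf{x} - \boldsymbol{\mu})$
FA: From $\mathbf{z}$ to $\mathbf{x}$: $\mathbf{x} - \boldsymbol{\mu} = \mathbf{V}\mathbf{z} + \boldsymbol{\varepsilon}$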

Factor Analysis
In FA, factors zj are stretched, rotated and translated to
generate x
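
The generative reading of FA can be simulated directly. The sketch below draws unit-variance, uncorrelated factors, maps them through a loading matrix, adds independent noise and a mean, and compares the sample covariance with $\mathbf{V}\mathbf{V}^T + \boldsymbol{\Psi}$, which is the covariance this model implies. All numerical values (`V`, `mu`, `psi`, the sizes) are arbitrary choices for illustration, and Gaussian draws are an assumption of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, N = 5, 2, 50_000

V = rng.normal(size=(d, k))            # factor loadings (arbitrary values)
mu = rng.normal(size=d)                # translation
psi = rng.uniform(0.1, 0.5, size=d)    # noise variances, one per input dimension

Z = rng.standard_normal((N, k))                    # E[z]=0, Var(z)=1, uncorrelated
eps = rng.standard_normal((N, d)) * np.sqrt(psi)   # independent noise sources
X = mu + Z @ V.T + eps                             # x = mu + V z + eps, row by row

S_model = V @ V.T + np.diag(psi)       # covariance implied by the model
S_sample = np.cov(X, rowvar=False)
print(np.abs(S_sample - S_model).max())
```

The printed gap shrinks as `N` grows, since the sample covariance converges to the model covariance.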

Multidimensional Scaling
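Given pairwise distances between $N$ points,
$d_{ij}$, $i,j = 1,\ldots,N$
place on a low-dim map s.t. distances are preserved.
$\mathbf{z} = \mathbf{g}(\mathbf{x} \mid \theta)$: Find $\theta$ that minimizes the Sammon stress
$$\begin{aligned}
E(\theta \mid \mathcal{X}) &= \sum_{r,s} \frac{\left(\|\mathbf{z}^r - \mathbf{z}^s\| - \|\mathbf{x}^r - \mathbf{x}^s\|\right)^2}{\|\mathbf{x}^r - \mathbf{x}^s\|^2} \\
&= \sum_{r,s} \frac{\left(\|\mathbf{g}(\mathbf{x}^r \mid \theta) - \mathbf{g}(\mathbf{x}^s \mid \theta)\| - \|\mathbf{x}^r - \mathbf{x}^s\|\right)^2}{\|\mathbf{x}^r - \mathbf{x}^s\|^2}
\end{aligned}$$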
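
To make the stress concrete, the following sketch evaluates it for a given set of original points and their mapped positions. It assumes the sum runs over unordered pairs $r < s$ (counting ordered pairs would only double the value) and that no two original points coincide, since the denominator would then be zero. The function name `sammon_stress` is chosen for this illustration.

```python
import math
from itertools import combinations


def sammon_stress(X, Z):
    """Sammon stress between original points X and mapped points Z."""
    stress = 0.0
    for r, s in combinations(range(len(X)), 2):
        dx = math.dist(X[r], X[s])   # distance in the original space
        dz = math.dist(Z[r], Z[s])   # distance in the low-dimensional map
        stress += (dz - dx) ** 2 / dx ** 2
    return stress


X = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]   # three collinear points in 2-D
Z_good = [(0.0,), (5.0,), (10.0,)]         # 1-D map that keeps all distances
Z_bad = [(0.0,), (5.0,), (5.0,)]           # 1-D map that collapses two points
print(sammon_stress(X, Z_good), sammon_stress(X, Z_bad))
```

A map that preserves every pairwise distance has zero stress; dividing by the original distance makes errors on nearby pairs count more than the same absolute error on distant pairs.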

Map of Europe by MDS

Map from CIA The World Factbook: http://www.cia.gov/

Linear Discriminant Analysis
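Find a low-dimensional space such that when $\mathbf{x}$ is projected, classes are well-separated.
Find $\mathbf{w}$ that maximizes
$$J(\mathbf{w}) = \frac{(m_1 - m_2)^2}{s_1^2 + s_2^2}$$
$$m_1 = \frac{\sum_t \mathbf{w}^T\mathbf{x}^t r^t}{\sum_t r^t} \qquad s_1^2 = \sum_t \left(\mathbf{w}^T\mathbf{x}^t - m_1\right)^2 r^t$$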

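Between-class scatter:
$$\begin{aligned}
(m_1 - m_2)^2 &= (\mathbf{w}^T\mathbf{m}_1 - \mathbf{w}^T\mathbf{m}_2)^2 \\
&= \mathbf{w}^T(\mathbf{m}_1 - \mathbf{m}_2)(\mathbf{m}_1 - \mathbf{m}_2)^T\mathbf{w} \\
&= \mathbf{w}^T\mathbf{S}_B\mathbf{w} \quad \text{where } \mathbf{S}_B = (\mathbf{m}_1 - \mathbf{m}_2)(\mathbf{m}_1 - \mathbf{m}_2)^T
\end{aligned}$$
Within-class scatter:
$$\begin{aligned}
s_1^2 &= \sum_t \left(\mathbf{w}^T\mathbf{x}^t - m_1\right)^2 r^t \\
&= \sum_t \mathbf{w}^T(\mathbf{x}^t - \mathbf{m}_1)(\mathbf{x}^t - \mathbf{m}_1)^T\mathbf{w}\, r^t = \mathbf{w}^T\mathbf{S}_1\mathbf{w}
\end{aligned}$$
where $\mathbf{S}_1 = \sum_t r^t(\mathbf{x}^t - \mathbf{m}_1)(\mathbf{x}^t - \mathbf{m}_1)^T$
$s_1^2 + s_2^2 = \mathbf{w}^T\mathbf{S}_W\mathbf{w}$ where $\mathbf{S}_W = \mathbf{S}_1 + \mathbf{S}_2$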

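Fisher's Linear Discriminant
Find $\mathbf{w}$ that maximizes
$$J(\mathbf{w}) = \frac{\mathbf{w}^T\mathbf{S}_B\mathbf{w}}{\mathbf{w}^T\mathbf{S}_W\mathbf{w}} = \frac{\left|\mathbf{w}^T(\mathbf{m}_1 - \mathbf{m}_2)\right|^2}{\mathbf{w}^T\mathbf{S}_W\mathbf{w}}$$
LDA soln: $\mathbf{w} = c \cdot \mathbf{S}_W^{-1}(\mathbf{m}_1 - \mathbf{m}_2)$
Parametric soln: $\mathbf{w} = \boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$
when $p(\mathbf{x} \mid C_i) \sim \mathcal{N}(\boldsymbol{\mu}_i, \boldsymbol{\Sigma})$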
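
The two-class solution can be checked numerically. The sketch below computes $\mathbf{S}_W^{-1}(\mathbf{m}_1 - \mathbf{m}_2)$ from labeled data and evaluates $J(\mathbf{w})$ on the projected samples. The function names, the label convention (`r == 1` for the first class, `r == 0` for the second), and the synthetic data are assumptions of this illustration.

```python
import numpy as np


def fisher_direction(X, r):
    """Unit vector along S_W^{-1} (m1 - m2); r is 1 for class 1, 0 for class 2."""
    X1, X2 = X[r == 1], X[r == 0]
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - m1).T @ (X1 - m1)           # scatter of class 1
    S2 = (X2 - m2).T @ (X2 - m2)           # scatter of class 2
    w = np.linalg.solve(S1 + S2, m1 - m2)  # solve S_W w = m1 - m2
    return w / np.linalg.norm(w)           # the constant c is arbitrary


def fisher_criterion(w, X, r):
    """J(w) = (m1 - m2)^2 / (s1^2 + s2^2) on the projected data."""
    z = X @ w
    z1, z2 = z[r == 1], z[r == 0]
    s1 = ((z1 - z1.mean()) ** 2).sum()
    s2 = ((z2 - z2.mean()) ** 2).sum()
    return (z1.mean() - z2.mean()) ** 2 / (s1 + s2)


rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0.0, 0.0], [1.0, 3.0], size=(100, 2)),
               rng.normal([2.0, 1.0], [1.0, 3.0], size=(100, 2))])
r = np.array([1] * 100 + [0] * 100)
w = fisher_direction(X, r)
print(w, fisher_criterion(w, X, r))
```

Because $J$ is unchanged when $\mathbf{w}$ is rescaled, only the direction matters, which is why the sketch normalizes $\mathbf{w}$ to unit length.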

K>2 Classes
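Within-class scatter:
$$\mathbf{S}_W = \sum_{i=1}^{K} \mathbf{S}_i \qquad \mathbf{S}_i = \sum_t r_i^t(\mathbf{x}^t - \mathbf{m}_i)(\mathbf{x}^t - \mathbf{m}_i)^T$$
Between-class scatter:
$$\mathbf{S}_B = \sum_{i=1}^{K} N_i(\mathbf{m}_i - \mathbf{m})(\mathbf{m}_i - \mathbf{m})^T \qquad \mathbf{m} = \frac{1}{K}\sum_{i=1}^{K}\mathbf{m}_i$$
Find $\mathbf{W}$ that maximizes
$$J(\mathbf{W}) = \frac{\left|\mathbf{W}^T\mathbf{S}_B\mathbf{W}\right|}{\left|\mathbf{W}^T\mathbf{S}_W\mathbf{W}\right|}$$
The solution is the eigenvectors of $\mathbf{S}_W^{-1}\mathbf{S}_B$ with the largest eigenvalues
$\mathbf{S}_B$ has a maximum rank of $K - 1$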
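
A minimal NumPy sketch of the $K > 2$ case follows. It builds $\mathbf{S}_W$ and $\mathbf{S}_B$ as defined on this slide (with $\mathbf{m}$ taken as the mean of the class means) and keeps the leading eigenvectors of $\mathbf{S}_W^{-1}\mathbf{S}_B$. The function name `lda_projection`, the integer label encoding, and the synthetic data are assumptions of this illustration.

```python
import numpy as np


def lda_projection(X, y, k):
    """Top-k eigenvectors of S_W^{-1} S_B and all eigenvalues (descending)."""
    classes = np.unique(y)
    d = X.shape[1]
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    m = means.mean(axis=0)                 # mean of the class means
    SW = np.zeros((d, d))
    SB = np.zeros((d, d))
    for c, mi in zip(classes, means):
        Xc = X[y == c] - mi
        SW += Xc.T @ Xc                    # S_i, accumulated into S_W
        diff = (mi - m)[:, None]
        SB += len(Xc) * (diff @ diff.T)    # N_i (m_i - m)(m_i - m)^T
    # S_W^{-1} S_B is not symmetric, so use the general eigensolver and
    # keep the real parts (the eigenvalues are real in exact arithmetic).
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(SW, SB))
    order = np.argsort(eigvals.real)[::-1]
    return eigvecs[:, order[:k]].real, eigvals.real[order]


rng = np.random.default_rng(2)
centers = np.array([[0, 0, 0, 0], [3, 0, 1, 0], [0, 3, 0, 1]], dtype=float)
X = np.vstack([rng.normal(c, 1.0, size=(30, 4)) for c in centers])
y = np.repeat([0, 1, 2], 30)
W, vals = lda_projection(X, y, 2)
print(W.shape, vals)
```

With three classes in four dimensions, only two eigenvalues are nonzero, which illustrates the rank bound of $K - 1$ and why at most $K - 1$ useful projection directions exist.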

Isomap
Geodesic distance is the distance along the manifold that
the data lies in, as opposed to the Euclidean distance in
the input space

Isomap
Instances $r$ and $s$ are connected in the graph if $\|\mathbf{x}^r - \mathbf{x}^s\| < \epsilon$ or if $\mathbf{x}^s$ is one of the $k$ neighbors of $\mathbf{x}^r$
The edge length is $\|\mathbf{x}^r - \mathbf{x}^s\|$
For two nodes $r$ and $s$ not connected, the distance is equal to the length of the shortest path between them in the graph
Once the $N \times N$ distance matrix is thus formed, use MDS to find a lower-dimensional mapping
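
The steps above can be sketched as follows. This illustration makes several assumptions that the slide does not fix: it uses the $k$-nearest-neighbor rule (not the $\epsilon$ rule), symmetrizes the graph, computes shortest paths with Floyd-Warshall, applies classical (eigendecomposition-based) MDS to the geodesic distances, and requires the neighborhood graph to be connected. The function name `isomap` and the half-circle data are made up for the example.

```python
import numpy as np


def isomap(X, n_neighbors, k):
    """Return a k-dim embedding Z and the geodesic distance matrix G."""
    N = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # Euclidean
    G = np.full((N, N), np.inf)
    np.fill_diagonal(G, 0.0)
    for r in range(N):
        nbrs = np.argsort(D[r])[1:n_neighbors + 1]  # skip the point itself
        G[r, nbrs] = D[r, nbrs]                     # edge length ||x^r - x^s||
        G[nbrs, r] = D[r, nbrs]                     # keep the graph symmetric
    for via in range(N):                            # Floyd-Warshall
        G = np.minimum(G, G[:, [via]] + G[[via], :])
    # Classical MDS: double-center the squared distances, then eigendecompose.
    J = np.eye(N) - np.ones((N, N)) / N
    B = -0.5 * J @ (G ** 2) @ J
    eigvals, eigvecs = np.linalg.eigh(B)
    top = np.argsort(eigvals)[::-1][:k]
    Z = eigvecs[:, top] * np.sqrt(np.maximum(eigvals[top], 0.0))
    return Z, G


theta = np.linspace(0.0, np.pi, 50)
X = np.column_stack([np.cos(theta), np.sin(theta)])  # half circle of radius 1
Z, G = isomap(X, n_neighbors=2, k=1)
print(np.linalg.norm(X[0] - X[-1]), G[0, -1])
```

The two endpoints of the half circle are a Euclidean distance of 2 apart, while their graph distance follows the arc and is close to $\pi$, which is the distinction between Euclidean and geodesic distance that Isomap relies on.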

Optdigits after Isomap (with neighborhood graph).
[Figure: two-dimensional Isomap embedding of the Optdigits data, each point labeled with its digit class; both axes run from -150 to 150.]
Matlab source from http://web.mit.edu/cocosci/isomap/isomap.html

Locally Linear Embedding
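1. Given $\mathbf{x}^r$ find its neighbors $\mathbf{x}^s_{(r)}$
2. Find $W_{rs}$ that minimize
$$E(\mathbf{W} \mid \mathcal{X}) = \sum_r \Big\| \mathbf{x}^r - \sum_s W_{rs}\,\mathbf{x}^s_{(r)} \Big\|^2$$
3. Find the new coordinates $\mathbf{z}^r$ that minimize
$$E(\mathbf{z} \mid \mathbf{W}) = \sum_r \Big\| \mathbf{z}^r - \sum_s W_{rs}\,\mathbf{z}^s_{(r)} \Big\|^2$$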
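
Step 2 has a closed-form solution for each point, sketched below for a single $\mathbf{x}^r$. The sketch assumes the usual LLE constraint that the weights of each point sum to one, which this slide does not state, and it adds a small regularization term to the local Gram matrix so the linear system stays solvable when there are more neighbors than input dimensions. The function name `lle_weights` and the `reg` value are choices made for this illustration.

```python
import numpy as np


def lle_weights(x, neighbors, reg=1e-3):
    """Weights that reconstruct x from its neighbors (rows), summing to 1."""
    diffs = neighbors - x                   # neighbors relative to x
    C = diffs @ diffs.T                     # local Gram matrix
    # Regularization is an assumption of this sketch, scaled by trace(C).
    C = C + reg * np.trace(C) * np.eye(len(neighbors))
    w = np.linalg.solve(C, np.ones(len(neighbors)))
    return w / w.sum()                      # enforce the sum-to-one constraint


x = np.array([1.0, 0.0])
neighbors = np.array([[0.0, 0.0], [2.0, 0.0]])
w = lle_weights(x, neighbors)
print(w, w @ neighbors)
```

Here `x` is the midpoint of its two neighbors, so the weights come out equal and the weighted sum of the neighbors reproduces `x`. Step 3 then keeps these weights fixed and searches for low-dimensional coordinates that are reconstructed by the same weights.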

LLE on Optdigits
[Figure: two-dimensional LLE embedding of the Optdigits data, each point labeled with its digit class; the horizontal axis runs from -3.5 to 1.5.]
Matlab source from http://www.cs.toronto.edu/~roweis/lle/code.html
