
Unsupervised Learning: Dimensionality Reduction (PCA and other methods)

CS771: Introduction to Machine Learning
Piyush Rai
Dimensionality Reduction

 Goal: Reduce the dimensionality of each input $\boldsymbol{x}_n$, i.e., compute a compressed version $\boldsymbol{z}_n$ of $\boldsymbol{x}_n$

   $\boldsymbol{z}_n = f(\boldsymbol{x}_n)$

 Also want to be able to (approximately) reconstruct $\boldsymbol{x}_n$ from $\boldsymbol{z}_n$

   $\tilde{\boldsymbol{x}}_n = g(\boldsymbol{z}_n) = g(f(\boldsymbol{x}_n)) \approx \boldsymbol{x}_n$

   Often $\tilde{\boldsymbol{x}}_n$ is a "cleaned" version of $\boldsymbol{x}_n$ (the loss in information is often just the noise/redundant information in $\boldsymbol{x}_n$)

 Sometimes $f$ is called the "encoder" and $g$ the "decoder". They can be linear or nonlinear

 These functions are learned by minimizing the distortion/reconstruction error of the inputs

   $\mathcal{L} = \sum_{n=1}^{N} \|\boldsymbol{x}_n - \tilde{\boldsymbol{x}}_n\|^2 = \sum_{n=1}^{N} \|\boldsymbol{x}_n - g(f(\boldsymbol{x}_n))\|^2$

Image source: Manjon et al (2013), "Diffusion Weighted Image Denoising Using Overcomplete Local PCA"
Dimensionality Reduction

 Choosing $f$ and $g$ as linear transformations, $\boldsymbol{z}_n = \mathbf{W}^\top\boldsymbol{x}_n$ and $\tilde{\boldsymbol{x}}_n = \mathbf{W}\boldsymbol{z}_n$ respectively, gives

   $\mathcal{L} = \sum_{n=1}^{N} \|\boldsymbol{x}_n - \mathbf{W}\mathbf{W}^\top\boldsymbol{x}_n\|^2$

 The minimizer of $\mathcal{L}$, if the columns of $\mathbf{W}$ are required to be orthonormal, is given by the top $K$ eigenvectors of the empirical covariance matrix $\mathbf{S}$ of the inputs. This is Principal Component Analysis (PCA); proof shortly

   $\mathbf{S} = \frac{1}{N}\sum_{n=1}^{N} (\boldsymbol{x}_n - \boldsymbol{\mu})(\boldsymbol{x}_n - \boldsymbol{\mu})^\top = \frac{1}{N}\mathbf{X}_c^\top\mathbf{X}_c$,   where $\mathbf{X}_c$ is the matrix of inputs after centering each input (subtracting off the mean of the inputs from each input)

 The matrix $\mathbf{W}^\top$ does a "linear projection" of each input into a $K$-dim space

   $\boldsymbol{z}_n = \mathbf{W}^\top\boldsymbol{x}_n$ denotes this linear projection

 Note: If we use all $D$ eigenvectors for $\mathbf{W}$, the reconstruction will be perfect ($\mathcal{L} = 0$); a small numerical check of the orthonormal-$\mathbf{W}$ claim is sketched below
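A minimal NumPy check of the claim above (my own sketch, not from the slides): with orthonormal columns, the top-$K$ eigenvectors of $\mathbf{S}$ give a reconstruction error no larger than that of a random orthonormal $\mathbf{W}$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 500, 10, 3
X = rng.normal(size=(N, D)) @ rng.normal(size=(D, D))   # correlated synthetic inputs
Xc = X - X.mean(axis=0)                                  # center the data

S = (Xc.T @ Xc) / N                                      # empirical covariance matrix
eigvals, eigvecs = np.linalg.eigh(S)                     # eigenvalues in ascending order
W_pca = eigvecs[:, -K:]                                  # top-K eigenvectors (D x K)
W_rand, _ = np.linalg.qr(rng.normal(size=(D, K)))        # a random orthonormal D x K matrix

def recon_error(W):
    Xhat = (Xc @ W) @ W.T                                # z_n = W^T x_n, then x~_n = W z_n
    return np.sum((Xc - Xhat) ** 2)

print(recon_error(W_pca) <= recon_error(W_rand))         # True
```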
Dimensionality Reduction

 Consider a linear model of the form (not necessarily PCA, where the columns of $\mathbf{W}$ were constrained to be orthonormal)

   $\boldsymbol{x}_n \approx \tilde{\boldsymbol{x}}_n = \mathbf{W}\boldsymbol{z}_n = \sum_{k=1}^{K} z_{nk}\boldsymbol{w}_k$,   where $\boldsymbol{w}_k$ is the $k$-th column of $\mathbf{W}$

 The above means that each $\boldsymbol{x}_n$ is approximately a linear combination of the $K$ vectors $\boldsymbol{w}_1, \ldots, \boldsymbol{w}_K$

 Example (from the figure): a face image is approximated as $z_{n1}\boldsymbol{w}_1 + z_{n2}\boldsymbol{w}_2 + z_{n3}\boldsymbol{w}_3 + z_{n4}\boldsymbol{w}_4$, a weighted combination of $K = 4$ "basis" face images

   Each "basis" image is like a "template" that captures the common properties of the face images in the dataset

 In this example, $\boldsymbol{z}_n = [z_{n1}, z_{n2}, z_{n3}, z_{n4}]$ is a low-dim feature representation for each image
Principal Component Analysis (PCA)

 PCA learns a different and more economical coordinate system to represent the data

   In the original coordinate system (defined by axes $\boldsymbol{e}_1$ and $\boldsymbol{e}_2$ in the figure), both coordinates have a high variance across all the points, so for any point we can't ignore either of the two coordinates; otherwise we will lose considerable information about the data

   In the new coordinate system (defined by axes $\boldsymbol{w}_1$ and $\boldsymbol{w}_2$), there is not much variance in the $\boldsymbol{w}_2$ coordinates of the points, so we can ignore $\boldsymbol{w}_2$ and represent each point using just the first coordinate (i.e., its projection along $\boldsymbol{w}_1$) without losing much information about the data

 Dimensionality reduction! From two dimensions ($\boldsymbol{e}_1$, $\boldsymbol{e}_2$) to one dimension ($\boldsymbol{w}_1$)
Finding Max. Variance Directions

 Consider projecting an input $\boldsymbol{x}_n$ along a direction $\boldsymbol{w}_1$

   The projection/embedding of $\boldsymbol{x}_n$ will be $\boldsymbol{w}_1^\top\boldsymbol{x}_n$

 Mean of the projections of all inputs: $\frac{1}{N}\sum_{n=1}^{N} \boldsymbol{w}_1^\top\boldsymbol{x}_n = \boldsymbol{w}_1^\top\boldsymbol{\mu}$

 Variance of the projections: $\frac{1}{N}\sum_{n=1}^{N} (\boldsymbol{w}_1^\top\boldsymbol{x}_n - \boldsymbol{w}_1^\top\boldsymbol{\mu})^2 = \boldsymbol{w}_1^\top\mathbf{S}\boldsymbol{w}_1$,   where $\mathbf{S}$ is the covariance matrix of the data

   For already centered data, $\boldsymbol{\mu} = \boldsymbol{0}$ and $\mathbf{S} = \frac{1}{N}\sum_{n=1}^{N}\boldsymbol{x}_n\boldsymbol{x}_n^\top$

 Want $\boldsymbol{w}_1$ such that this variance is maximized:

   $\max_{\boldsymbol{w}_1} \boldsymbol{w}_1^\top\mathbf{S}\boldsymbol{w}_1$   s.t. $\boldsymbol{w}_1^\top\boldsymbol{w}_1 = 1$   (need this constraint, otherwise the objective's max will be infinity)
Max. Variance Direction

 Our objective was $\max_{\boldsymbol{w}_1} \boldsymbol{w}_1^\top\mathbf{S}\boldsymbol{w}_1$   s.t. $\boldsymbol{w}_1^\top\boldsymbol{w}_1 = 1$   (here $\boldsymbol{w}_1^\top\mathbf{S}\boldsymbol{w}_1$ is the variance along the direction $\boldsymbol{w}_1$)

 Can construct a Lagrangian for this problem: $\boldsymbol{w}_1^\top\mathbf{S}\boldsymbol{w}_1 + \lambda_1(1 - \boldsymbol{w}_1^\top\boldsymbol{w}_1)$

 Taking the derivative w.r.t. $\boldsymbol{w}_1$ and setting it to zero gives $\mathbf{S}\boldsymbol{w}_1 = \lambda_1\boldsymbol{w}_1$

 Therefore $\boldsymbol{w}_1$ is an eigenvector of the covariance matrix $\mathbf{S}$ with eigenvalue $\lambda_1$

 Claim: $\boldsymbol{w}_1$ is the eigenvector of $\mathbf{S}$ with the largest eigenvalue $\lambda_1$. Note that $\boldsymbol{w}_1^\top\mathbf{S}\boldsymbol{w}_1 = \lambda_1$

   Thus the variance will be maximized if $\lambda_1$ is the largest eigenvalue (and $\boldsymbol{w}_1$ is the corresponding top eigenvector, also known as the first Principal Component)

 Other large-variance directions can also be found likewise (with each being orthogonal to all the others) using the eigendecomposition of the covariance matrix $\mathbf{S}$ (this is PCA); PCA would keep the top $K$ such directions of largest variance

 Note: In general, $\mathbf{S}$ will have $D$ eigenvectors, and the total variance of the data is equal to the sum of the eigenvalues of $\mathbf{S}$, i.e., $\sum_{d=1}^{D}\lambda_d$
The PCA Algorithm

 Center the data (subtract the mean from each data point)

 Compute the covariance matrix using the centered data matrix $\mathbf{X}$ as

   $\mathbf{S} = \frac{1}{N}\mathbf{X}^\top\mathbf{X}$   (assuming $\mathbf{X}$ is arranged as $N \times D$)

 Do an eigendecomposition of the covariance matrix $\mathbf{S}$ (many methods exist)

 Take the top $K$ leading eigenvectors $\{\boldsymbol{w}_1, \ldots, \boldsymbol{w}_K\}$ with eigenvalues $\{\lambda_1, \ldots, \lambda_K\}$

 The $K$-dimensional projection/embedding of each input is

   $\boldsymbol{z}_n = \mathbf{W}_K^\top\boldsymbol{x}_n$,   where $\mathbf{W}_K = [\boldsymbol{w}_1, \ldots, \boldsymbol{w}_K]$ is the "projection matrix" of size $D \times K$

 Note: Can decide how many eigenvectors to use based on how much variance we want to capture (recall that each $\lambda_k$ gives the variance along the direction $\boldsymbol{w}_k$, and their sum is the total variance); a small implementation sketch follows below
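A minimal NumPy sketch of the algorithm above (illustrative only, not the course's reference implementation):

```python
import numpy as np

def pca(X, K):
    """X: N x D data matrix. Returns (Z, W_K, mu)."""
    mu = X.mean(axis=0)                       # mean of the inputs
    Xc = X - mu                               # step 1: center the data
    S = (Xc.T @ Xc) / X.shape[0]              # step 2: covariance matrix (D x D)
    eigvals, eigvecs = np.linalg.eigh(S)      # step 3: eigendecomposition (ascending)
    idx = np.argsort(eigvals)[::-1][:K]       # step 4: pick the top-K eigenvalues
    W_K = eigvecs[:, idx]                     # projection matrix, D x K
    Z = Xc @ W_K                              # step 5: embeddings, z_n = W_K^T (x_n - mu)
    return Z, W_K, mu

# Example usage on random data:
X = np.random.default_rng(0).normal(size=(100, 5))
Z, W_K, mu = pca(X, K=2)
X_recon = Z @ W_K.T + mu                      # approximate reconstruction of X
```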
The Reconstruction Error View of PCA

 Representing a data point $\boldsymbol{x}_n \in \mathbb{R}^D$ in the standard orthonormal basis $\{\boldsymbol{e}_1, \ldots, \boldsymbol{e}_D\}$:

   $\boldsymbol{x}_n = \sum_{d=1}^{D} x_{nd}\boldsymbol{e}_d$,   where $x_{nd}$ is the coordinate of $\boldsymbol{x}_n$ along $\boldsymbol{e}_d$, and $\boldsymbol{e}_d$ is a vector of all zeros except a single 1 at the $d$-th position (also, $\boldsymbol{e}_d^\top\boldsymbol{e}_{d'} = 0$ for $d \neq d'$)

 Let's represent the same data point in a new orthonormal basis $\{\boldsymbol{w}_1, \ldots, \boldsymbol{w}_D\}$:

   $\boldsymbol{x}_n = \sum_{d=1}^{D} z_{nd}\boldsymbol{w}_d$,   where $z_{nd} = \boldsymbol{x}_n^\top\boldsymbol{w}_d$ is the projection/coordinate of $\boldsymbol{x}_n$ along the direction $\boldsymbol{w}_d$, since the $\boldsymbol{w}_d$'s are orthonormal (verify); $[z_{n1}, \ldots, z_{nD}]$ denotes the coordinates of $\boldsymbol{x}_n$ in the new basis

 Ignoring directions along which the projection is small, we can approximate $\boldsymbol{x}_n$ as

   $\boldsymbol{x}_n \approx \hat{\boldsymbol{x}}_n = \sum_{d=1}^{K} z_{nd}\boldsymbol{w}_d = \sum_{d=1}^{K} (\boldsymbol{x}_n^\top\boldsymbol{w}_d)\boldsymbol{w}_d = \sum_{d=1}^{K} (\boldsymbol{w}_d\boldsymbol{w}_d^\top)\boldsymbol{x}_n$

   Note that $\boldsymbol{x}_n - \hat{\boldsymbol{x}}_n$ is the reconstruction error on $\boldsymbol{x}_n$; we would like to minimize it w.r.t. the $\boldsymbol{w}_d$'s

 Now $\boldsymbol{x}_n$ is represented by the $K$-dim rep. $\boldsymbol{z}_n = [z_{n1}, \ldots, z_{nK}]$. Also, $\boldsymbol{z}_n = \mathbf{W}_K^\top\boldsymbol{x}_n$, where $\mathbf{W}_K$ is the "projection matrix" of size $D \times K$
PCA Minimizes Reconstruction Error

 We plan to use only $K$ directions $\boldsymbol{w}_1, \ldots, \boldsymbol{w}_K$, so we would like them to be such that the total reconstruction error is minimized

   $\mathcal{L}(\boldsymbol{w}_1, \ldots, \boldsymbol{w}_K) = \sum_{n=1}^{N} \|\boldsymbol{x}_n - \hat{\boldsymbol{x}}_n\|^2 = \sum_{n=1}^{N} \|\boldsymbol{x}_n\|^2 - N\sum_{d=1}^{K}\boldsymbol{w}_d^\top\mathbf{S}\boldsymbol{w}_d$   (verify; assuming centered data)

   The first term is a constant that doesn't depend on the $\boldsymbol{w}_d$'s, and each $\boldsymbol{w}_d^\top\mathbf{S}\boldsymbol{w}_d$ is the variance along $\boldsymbol{w}_d$

 Each optimal $\boldsymbol{w}_d$ can be found by solving $\max_{\boldsymbol{w}_d} \boldsymbol{w}_d^\top\mathbf{S}\boldsymbol{w}_d$   subject to $\boldsymbol{w}_d^\top\boldsymbol{w}_d = 1$

 Thus minimizing the reconstruction error is equivalent to maximizing the variance

 The $K$ directions can be found by solving the eigendecomposition of $\mathbf{S}$
Singular Value Decomposition (SVD)

 Any matrix $\mathbf{X}$ of size $N \times D$ can be represented as the following decomposition

   $\mathbf{X} = \mathbf{U}\boldsymbol{\Lambda}\mathbf{V}^\top = \sum_{k=1}^{\min\{N, D\}} \lambda_k\boldsymbol{u}_k\boldsymbol{v}_k^\top$

 $\mathbf{U} = [\boldsymbol{u}_1, \ldots, \boldsymbol{u}_N]$ is the $N \times N$ matrix of left singular vectors, each $\boldsymbol{u}_k \in \mathbb{R}^N$

   $\mathbf{U}$ is also orthonormal ($\mathbf{U}^\top\mathbf{U} = \mathbf{I}$ and $\mathbf{U}\mathbf{U}^\top = \mathbf{I}$)

 $\mathbf{V} = [\boldsymbol{v}_1, \ldots, \boldsymbol{v}_D]$ is the $D \times D$ matrix of right singular vectors, each $\boldsymbol{v}_k \in \mathbb{R}^D$

   $\mathbf{V}$ is also orthonormal ($\mathbf{V}^\top\mathbf{V} = \mathbf{I}$ and $\mathbf{V}\mathbf{V}^\top = \mathbf{I}$)

 $\boldsymbol{\Lambda}$ is an $N \times D$ matrix with only diagonal entries, the singular values $\lambda_1 \geq \lambda_2 \geq \ldots \geq 0$; if $N > D$, the last $N - D$ rows are all zeros, and if $D > N$, the last $D - N$ columns are all zeros

 Note: If $\mathbf{X}$ is symmetric, this is known as the eigenvalue decomposition (with $\mathbf{U} = \mathbf{V}$)
Low-Rank Approximation via SVD

 If we just use the top $K < \min\{N, D\}$ singular values, we get a rank-$K$ SVD approximation

   $\mathbf{X} \approx \hat{\mathbf{X}} = \sum_{k=1}^{K} \lambda_k\boldsymbol{u}_k\boldsymbol{v}_k^\top$   (each $\boldsymbol{u}_k\boldsymbol{v}_k^\top$ is a rank-1 matrix)

 The above SVD approximation can be shown to minimize the reconstruction error $\|\mathbf{X} - \hat{\mathbf{X}}\|$

 Fact: SVD gives the best rank-$K$ approximation of a matrix (a small numerical check is sketched below)

 PCA is done by doing SVD on the covariance matrix (the left and right singular vectors are the same and become the eigenvectors, and the singular values become the eigenvalues)
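A small NumPy sketch of the rank-$K$ SVD approximation (assuming the Frobenius norm as the reconstruction error); the squared error equals the sum of the squared discarded singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 20))
K = 5

U, s, Vt = np.linalg.svd(X, full_matrices=False)   # s: singular values, descending
X_hat = U[:, :K] @ np.diag(s[:K]) @ Vt[:K, :]      # rank-K approximation

err = np.linalg.norm(X - X_hat, "fro") ** 2
print(np.isclose(err, np.sum(s[K:] ** 2)))         # True: error = sum of discarded s_k^2
```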
Dimensionality Reduction: Beyond PCA
Dim-Red as Matrix Factorization

 If we don't care about the orthonormality constraints on $\mathbf{W}$, then dim-red can also be achieved by solving a matrix factorization problem on the $N \times D$ data matrix $\mathbf{X}$: $\mathbf{X} \approx \mathbf{Z}\mathbf{W}$, where $\mathbf{Z}$ is $N \times K$ (the matrix containing the low-dim reps of the inputs) and $\mathbf{W}$ is $K \times D$

   $\{\hat{\mathbf{Z}}, \hat{\mathbf{W}}\} = \arg\min_{\mathbf{Z}, \mathbf{W}} \|\mathbf{X} - \mathbf{Z}\mathbf{W}\|^2$

   If $K < \min\{N, D\}$, such a factorization gives a low-rank approximation of the data matrix

 Can solve such problems using ALT-OPT (alternating optimization; a small sketch follows below)

 Can impose various constraints on $\mathbf{Z}$ and $\mathbf{W}$ (e.g., sparsity, non-negativity, etc.)
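A minimal ALT-OPT (alternating least squares) sketch for $\mathbf{X} \approx \mathbf{Z}\mathbf{W}$ (my own illustration; the variable names and iteration count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 100, 30, 5
X = rng.normal(size=(N, K)) @ rng.normal(size=(K, D))   # synthetic low-rank data

Z = rng.normal(size=(N, K))
W = rng.normal(size=(K, D))
for _ in range(50):
    # Fix W, solve the least-squares problem for Z (each row z_n solves W^T z_n ~ x_n):
    Z = np.linalg.lstsq(W.T, X.T, rcond=None)[0].T
    # Fix Z, solve the least-squares problem for W:
    W = np.linalg.lstsq(Z, X, rcond=None)[0]

print(np.linalg.norm(X - Z @ W))   # close to 0 for this noiseless low-rank example
```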
Matrix Factorization is a very useful method!

 In many problems, we are given co-occurrence data in the form of an $N \times M$ matrix $\mathbf{X}$

   The data consists of relationships between two sets of entities containing $N$ and $M$ entities, respectively

   Each entry $X_{nm}$ denotes how many times the pair $(n, m)$ co-occurs, e.g.,

     The number of times a document (out of $N$ total docs) contains a word of a vocabulary of $M$ total words

     The rating a user gave to an item (or movie) on a shopping (or movie streaming) website

 In such problems, matrix factorization $\mathbf{X} \approx \mathbf{Z}\mathbf{W}$ can be used to learn a $K$-dim feature vector for both sets of entities

 Even if some entries of $\mathbf{X}$ are missing, we can still do matrix factorization using the loss defined only on the observed entries of $\mathbf{X}$, and then use the learned $\mathbf{Z}$ and $\mathbf{W}$ to predict any missing entry as $\hat{X}_{nm} = \boldsymbol{z}_n^\top\boldsymbol{w}_m$ (matrix completion)
Supervised Dimensionality Reduction

 Maximum variance directions may not be aligned with class separation directions (focusing only on the variance/reconstruction error of the inputs is not always ideal)

   Projecting along the max variance direction (given by PCA) will give a one-dimensional embedding of each point with the two classes overlapping with each other

   Projecting along a direction that preserves class separation will give a one-dimensional embedding of each point with the two classes still having a good separation

 Be careful when using methods like PCA for supervised learning problems

 A better option would be to find projection directions such that, after projection,

   Points within the same class are close (low intra-class variance)

   Points from different classes are well separated (the class means are far apart)
Dim. Reduction by Preserving Pairwise Distances

 PCA/SVD etc. assume we are given the points as vectors (e.g., $\boldsymbol{x}_n \in \mathbb{R}^D$)

 Often the data is instead given in the form of pairwise distances $d_{nm}$ between points $\boldsymbol{x}_n$ and $\boldsymbol{x}_m$ ($n, m = 1, \ldots, N$)

 Would like to project the data such that the pairwise distances between points are preserved, i.e., $\|\boldsymbol{z}_n - \boldsymbol{z}_m\| \approx d_{nm}$, where $\boldsymbol{z}_n$ and $\boldsymbol{z}_m$ denote the low-dim embeddings/projections of $\boldsymbol{x}_n$ and $\boldsymbol{x}_m$, respectively

 Basically, if $d_{nm}$ is large (resp. small), we would like $\|\boldsymbol{z}_n - \boldsymbol{z}_m\|$ to be large (resp. small)

 Multi-dimensional Scaling (MDS) is one such algorithm (a small sketch of classical MDS follows below)

 Note: If $d_{nm}$ is the Euclidean distance, MDS is equivalent to PCA
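A brief sketch of classical (metric) MDS in NumPy, assuming a precomputed $N \times N$ Euclidean distance matrix; the embedding comes from eigendecomposing the double-centered squared-distance matrix.

```python
import numpy as np

def classical_mds(D, K=2):
    """D: N x N matrix of pairwise distances. Returns N x K embeddings."""
    N = D.shape[0]
    J = np.eye(N) - np.ones((N, N)) / N      # centering matrix
    B = -0.5 * J @ (D ** 2) @ J              # double-centered squared distances
    eigvals, eigvecs = np.linalg.eigh(B)     # ascending eigenvalues
    idx = np.argsort(eigvals)[::-1][:K]      # top-K components
    lam = np.maximum(eigvals[idx], 0.0)      # guard against tiny negative eigenvalues
    return eigvecs[:, idx] * np.sqrt(lam)

# Example: distances computed from 3-D points, embedded into 2-D
X = np.random.default_rng(0).normal(size=(10, 3))
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
Z = classical_mds(D, K=2)
```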
MDS: An Example

 Result of applying MDS (with $K = 2$) on pairwise distances between some US cities

 Here MDS produces a 2D embedding of each city such that geographically close cities are also close in the 2D embedding space
Nonlinear Dimensionality Reduction
Beyond Linear Projections

 Consider the swiss-roll dataset (points lying close to a nonlinear manifold)

   The relative positions of points are destroyed after a linear projection

 Linear projection methods (e.g., PCA) can't capture intrinsic nonlinearities

 Maximum variance directions may not be the most interesting ones
Nonlinear Dimensionality Reduction

 We want to learn a nonlinear low-dim projection

   Relative positions of points are preserved after the projection

 Some ways of doing this:

   Nonlinearize a linear dimensionality reduction method, e.g.:

     Cluster the data and apply linear PCA within each cluster (mixture of PCA)

     Kernel PCA (nonlinear PCA)

   Use manifold-based methods that intrinsically preserve the nonlinear geometry, e.g.:

     Locally Linear Embedding (LLE), Isomap

     Maximum Variance Unfolding

     Laplacian Eigenmap, and others such as SNE/t-SNE, etc.

   ... or use unsupervised deep learning techniques (later)

 We will look at KPCA, LLE, and SNE/t-SNE
Kernel PCA

 Recall PCA: Given observations $\boldsymbol{x}_n \in \mathbb{R}^D$, $n = 1, \ldots, N$, we compute the (top) eigenvectors of the covariance matrix $\mathbf{S} = \frac{1}{N}\sum_{n=1}^{N}\boldsymbol{x}_n\boldsymbol{x}_n^\top$ (assuming centered data)

 Assume a kernel $k$ with an associated $M$-dimensional nonlinear map $\phi$; kernel PCA computes the eigenvectors of the covariance matrix in the kernel-induced feature space, $\frac{1}{N}\sum_{n=1}^{N}\phi(\boldsymbol{x}_n)\phi(\boldsymbol{x}_n)^\top$ (assuming centered data in that space)

 Would like to do it without computing $\phi$ and the mappings $\phi(\boldsymbol{x}_n)$, since $M$ can be very large (even infinite, e.g., when using an RBF kernel)

 Boils down to doing an eigendecomposition of the $N \times N$ kernel matrix $\mathbf{K}$ (PRML 12.3); a rough sketch follows below

   Can verify that each such eigenvector can be written as a linear combination of the mapped inputs: $\boldsymbol{w} = \sum_{n=1}^{N}\alpha_n\phi(\boldsymbol{x}_n)$

   Can show that finding the coefficients $\alpha_n$ reduces to solving an eigendecomposition of $\mathbf{K}$

 Note: Due to the requirement of centering, we work with a centered kernel matrix $\tilde{\mathbf{K}} = \mathbf{K} - \mathbf{1}_N\mathbf{K} - \mathbf{K}\mathbf{1}_N + \mathbf{1}_N\mathbf{K}\mathbf{1}_N$, where $\mathbf{1}_N$ is the $N \times N$ matrix with every entry equal to $1/N$
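A rough kernel PCA sketch (my own illustration, loosely following PRML 12.3) with an RBF kernel; the embeddings come from the eigenvectors of the centered kernel matrix. The function name and the choice of `gamma` are arbitrary.

```python
import numpy as np

def kernel_pca(X, K=2, gamma=1.0):
    N = X.shape[0]
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    Kmat = np.exp(-gamma * sq_dists)                 # RBF kernel matrix, N x N
    one_N = np.ones((N, N)) / N
    Kc = Kmat - one_N @ Kmat - Kmat @ one_N + one_N @ Kmat @ one_N   # centered kernel
    eigvals, eigvecs = np.linalg.eigh(Kc)            # ascending eigenvalues
    idx = np.argsort(eigvals)[::-1][:K]
    alphas = eigvecs[:, idx] / np.sqrt(np.maximum(eigvals[idx], 1e-12))  # coefficients
    return Kc @ alphas                               # N x K nonlinear embeddings

Z = kernel_pca(np.random.default_rng(0).normal(size=(30, 3)), K=2, gamma=0.5)
```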
Locally Linear Embedding

 Basic idea: If two points are local neighbors in the original space, then they should be local neighbors in the projected space too (essentially neighbourhood preservation, but only local; several nonlinear dim-red algorithms use this idea)

 Given observations $\boldsymbol{x}_n \in \mathbb{R}^D$, $n = 1, \ldots, N$, LLE first learns weights such that each point can be written as a weighted linear combination of its local neighbors (a predefined number, say $K$, of them) in the original feature space:

   $\boldsymbol{x}_n \approx \sum_{m \in \mathcal{N}(n)} W_{nm}\boldsymbol{x}_m$,   where $\mathcal{N}(n)$ denotes the local neighbors of point $n$

 For each point $n$, LLE then learns a low-dim embedding $\boldsymbol{z}_n$ such that the same neighborhood structure exists in the low-dim space too (requires solving an eigenvalue problem):

   $\boldsymbol{z}_n \approx \sum_{m \in \mathcal{N}(n)} W_{nm}\boldsymbol{z}_m$

 Basically, if a point can be reconstructed from its neighbors in the original space, the same weights should be able to reconstruct it in the new space too (a short usage sketch follows below)

S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science 290 (2000)
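A short usage sketch (not from the slides) of scikit-learn's LLE implementation, embedding swiss-roll points into 2D with 10 local neighbors per point; the parameter values are just for illustration.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, _ = make_swiss_roll(n_samples=1000, random_state=0)        # 3-D swiss-roll points
lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2)  # 10 neighbors, 2-D output
Z = lle.fit_transform(X)                                      # 1000 x 2 embeddings
```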
SNE and t-SNE

 Also nonlinear dim-red methods, especially suited for projecting to 2D or 3D (thus very useful if we want to visualize some high-dim data in two or three dims)

 SNE stands for Stochastic Neighbor Embedding (Hinton and Roweis, 2002)

   Uses the idea of preserving probabilistically defined neighborhoods

   For each point, SNE defines the probability of another point being its neighbor, both in the original space and in the projected/embedding space

   SNE ensures that the neighbourhood distributions in both spaces are as close as possible; this is ensured by minimizing their total mismatch (KL divergence)

 t-SNE (van der Maaten and Hinton, 2008) offers a couple of improvements to SNE

   Learns the embeddings by minimizing a symmetric KL divergence

   Uses a Student-t distribution instead of a Gaussian for defining the neighbor probabilities in the embedding space

 A short usage sketch of t-SNE is given below
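A brief usage sketch (not the course's code) of scikit-learn's t-SNE, projecting the small scikit-learn digits dataset to 2D for visualization; the parameter values are just for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)                        # 1797 x 64 digit images
Z = TSNE(n_components=2, random_state=0).fit_transform(X)  # 1797 x 2 embedding for plotting
```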
SNE and t-SNE
 Especially useful for visualizing data by projecting into 2D or 3D

Result of visualizing MNIST digits data in 2D (Figure from van der Maaten and Hinton, 2008)

Word Embeddings: Dim-Reduction for Words

 Feature representations/embeddings of words are very useful in many applications (likewise for sentences, paragraphs, documents, etc., which are basically sets of words)

 Naively, we can use a one-hot vector of size $V$ for each word (where $V$ is the vocabulary size): word $i$ of the vocabulary gets a vector with a 1 only at the $i$-th location and 0s elsewhere, e.g., [0 0 0 0 0 0 0 1 0 0 0 0]

 The one-hot representation of a word has two main issues

   Very high dimensionality ($V$) for each word

   A one-hot vector does not capture word semantics (any pair of distinct words will have zero similarity)

 Desirable: Learning low-dim word embeddings that capture the meaning/semantics

   We want the embedding of each word to be a low-dimensional vector

   If two words are semantically similar (dissimilar), we want their embeddings to be close (far)

 Many methods exist to learn word embeddings (e.g., GloVe and Word2Vec)
GloVe

 GloVe (Global Vectors for Word Representation) is a linear word embedding method

 Based on matrix factorization of a word-context co-occurrence matrix $X$

   In GloVe, the contexts consist of words that co-occur within a specified context window around a target word in a large text corpus

   Entry $X_{ni}$ denotes the number of times word $n$ appears in context $i$

 The factorization is $X \approx EC$: row $\boldsymbol{e}_n^\top$ of $E$ (of the chosen embedding size) denotes the embedding of word $n$, and column $\boldsymbol{c}_i$ of $C$ denotes the embedding of context $i$

   $\|X - EC\|^2 = \sum_{n, i} (X_{ni} - \boldsymbol{e}_n^\top\boldsymbol{c}_i)^2$

 Given $X$, minimize this distortion w.r.t. $E$ and $C$ (e.g., using the SVD of $X$); a simplified sketch follows below

 (Skipping a few other details, such as pre-processing of the data; refer to the paper if interested)

GloVe: Global Vectors for Word Representation (Pennington et al, 2014)
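A highly simplified sketch of the factorization idea only (this is not the actual GloVe objective, which uses a weighted least-squares loss on log co-occurrence counts): factorize a toy word-context count matrix with a truncated SVD to get word and context embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.poisson(lam=2.0, size=(1000, 500)).astype(float)   # toy word-context counts
K = 50                                                      # embedding size

U, s, Vt = np.linalg.svd(X, full_matrices=False)
E = U[:, :K] * np.sqrt(s[:K])                # word embeddings, one row per word
C = np.sqrt(s[:K])[:, None] * Vt[:K, :]      # context embeddings, one column per context
# X is approximated by E @ C, i.e., X_ni is approx e_n^T c_i
```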
Word2Vec

 A deep neural network based nonlinear word embedding method

 Usually learned using one of the following two objectives

   Skip-gram

   Continuous bag of words (CBOW)

 Skip-gram: probability of a context word $i$ occurring around a target word $n$ (a conditional probability which can be estimated from training data)

   $p(i \mid n) = \frac{\exp(\boldsymbol{c}_i^\top\boldsymbol{e}_n)}{\sum_{i'}\exp(\boldsymbol{c}_{i'}^\top\boldsymbol{e}_n)}$

   Embeddings are learned by optimizing a neural network based loss function which makes the difference between the LHS and the RHS small

 CBOW: probability of word $n$ occurring given a context window, e.g., the previous and next $k$ words (again a conditional probability which can be estimated from training data)

   $p(n \mid n-k : n+k) = \frac{\exp(\boldsymbol{e}_n^\top\boldsymbol{c}_n)}{\sum_{n'}\exp(\boldsymbol{e}_{n'}^\top\boldsymbol{c}_n)}$,   where $\boldsymbol{c}_n$ is the sum/average of the embeddings of the context window for word $n$

   Embeddings are again learned by optimizing a neural network based loss function which makes the difference between the LHS and the RHS small

 A tiny numeric illustration of the skip-gram softmax is given below

Efficient Estimation of Word Representations in Vector Space (Mikolov et al, 2013)
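A tiny numeric illustration of the skip-gram softmax above (toy random embeddings only, not a trained model): $p(i \mid n)$ is a softmax over the inner products between all context embeddings and the target word's embedding.

```python
import numpy as np

rng = np.random.default_rng(0)
V, K = 10, 4                               # toy vocabulary size and embedding size
E = rng.normal(size=(V, K))                # target word embeddings e_n
C = rng.normal(size=(V, K))                # context word embeddings c_i

n = 3                                      # index of a target word
scores = C @ E[n]                          # c_i^T e_n for every candidate context word i
p = np.exp(scores) / np.exp(scores).sum()  # p(i | n); sums to 1
```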
Dimensionality Reduction: Out-of-sample Embedding

 Some dim-red methods can only compute the embeddings of the training data

   Given training samples $\boldsymbol{x}_1, \ldots, \boldsymbol{x}_N$, they will give their embeddings $\boldsymbol{z}_1, \ldots, \boldsymbol{z}_N$

   However, given a new point $\boldsymbol{x}_*$ (not in the training samples), they can't produce its embedding $\boldsymbol{z}_*$ easily

   Thus there is no easy way of getting an "out-of-sample" embedding

 Some of the nonlinear dim-red methods, like LLE, SNE, KPCA, etc., have this limitation

   Reason: They don't learn an explicit encoder $f$; they directly optimize for the $\boldsymbol{z}_n$'s given the $\boldsymbol{x}_n$'s

   To get "out-of-sample" embeddings, these methods require some modifications*

 But many other methods do explicitly learn a mapping in the form of an "encoder" that can give $\boldsymbol{z}_* = f(\boldsymbol{x}_*)$ for any new input $\boldsymbol{x}_*$ as well (such methods are more useful)

   For PCA, the projection matrix $\mathbf{W}_K$ is this encoder function, and $\boldsymbol{z}_* = \mathbf{W}_K^\top\boldsymbol{x}_*$

*Out-of-Sample Extensions for LLE, Isomap, MDS, Eigenmaps, and Spectral Clustering (Bengio et al, 2003)
