Image source: Manjon et al. (2013), "Diffusion Weighted Image Denoising Using Overcomplete Local PCA"
Dimensionality Reduction
Choosing $f$ and $g$ as linear transformations, $\boldsymbol{z}_n = \mathbf{W}^\top \boldsymbol{x}_n$ and $\hat{\boldsymbol{x}}_n = \mathbf{W}\boldsymbol{z}_n$, respectively:

$\mathcal{L} = \sum_{n=1}^{N} \left\| \boldsymbol{x}_n - \mathbf{W}\mathbf{W}^\top \boldsymbol{x}_n \right\|^2$
Principal Component Analysis (PCA). Proof shortly.
The minimizer of $\mathcal{L}$, if the columns of $\mathbf{W}$ are constrained to be orthonormal, is given by the top $K$ eigenvectors of the empirical covariance matrix of the inputs

$\mathbf{S} = \frac{1}{N}\sum_{n=1}^{N} (\boldsymbol{x}_n - \boldsymbol{\mu})(\boldsymbol{x}_n - \boldsymbol{\mu})^\top = \frac{1}{N}\,\mathbf{X}_c^\top \mathbf{X}_c$

where $\mathbf{X}_c$ is the $N \times D$ matrix of inputs after centering each input (subtracting off the mean of the inputs from each input).
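A quick NumPy check (my addition, not from the slides) that the two expressions for $\mathbf{S}$ above agree; the synthetic data and sizes are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # N=100 inputs, D=5 dims (toy data)
mu = X.mean(axis=0)

# Sum-of-outer-products form of the empirical covariance matrix
S_sum = sum(np.outer(x - mu, x - mu) for x in X) / len(X)

# Centered-data-matrix form: S = (1/N) * Xc^T Xc
Xc = X - mu
S_mat = Xc.T @ Xc / len(X)

assert np.allclose(S_sum, S_mat)       # both forms give the same matrix
```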
The matrix $\mathbf{W}$ does a "linear projection" of each input $\boldsymbol{x}_n$ into a $K$-dim space
A face image $\approx z_{n1}\,\boldsymbol{w}_1 + z_{n2}\,\boldsymbol{w}_2 + z_{n3}\,\boldsymbol{w}_3 + z_{n4}\,\boldsymbol{w}_4$
[Figure: a 2-D data cloud whose variance is mostly along the direction $\boldsymbol{e}_1$, with only a small variance along the orthogonal direction.]
We can represent each point using just the first coordinate (i.e., its projection along $\boldsymbol{e}_1$) without losing much information about the data; it must be the larger-variance one of the two coordinates that we keep, otherwise we will lose considerable information about the data.
Dimensionality reduction! From two dimensions to one dimension.
Finding Max. Variance Directions
Consider projecting an input $\boldsymbol{x}_n$ along a direction $\boldsymbol{w}_1$ (with $\boldsymbol{w}_1^\top \boldsymbol{w}_1 = 1$)
Projection/embedding of $\boldsymbol{x}_n$ (red points below) will be $\boldsymbol{w}_1^\top \boldsymbol{x}_n$ (green points below)
Variance of the projections: $\frac{1}{N}\sum_{n=1}^{N}\left(\boldsymbol{w}_1^\top \boldsymbol{x}_n - \boldsymbol{w}_1^\top \boldsymbol{\mu}\right)^2 = \boldsymbol{w}_1^\top \mathbf{S}\,\boldsymbol{w}_1$, where $\mathbf{S}$ is the cov matrix of the data
We want the direction $\boldsymbol{w}_1$ that maximizes this variance, subject to $\boldsymbol{w}_1^\top \boldsymbol{w}_1 = 1$
Can construct a Lagrangian for this problem: $\boldsymbol{w}_1^\top \mathbf{S}\,\boldsymbol{w}_1 + \lambda_1\,(1 - \boldsymbol{w}_1^\top \boldsymbol{w}_1)$. (PCA would keep the top $K$ such directions of largest variances.)
Taking the derivative w.r.t. $\boldsymbol{w}_1$ and setting it to zero gives $\mathbf{S}\,\boldsymbol{w}_1 = \lambda_1 \boldsymbol{w}_1$. (Note: in general, $\mathbf{S}$ will have $D$ eigvecs.)
Therefore $\boldsymbol{w}_1$ is an eigenvector of the cov matrix $\mathbf{S}$, with eigenvalue $\lambda_1$.
Claim: $\boldsymbol{w}_1$ is the eigenvector of $\mathbf{S}$ with the largest eigenvalue $\lambda_1$. Note that the variance $\boldsymbol{w}_1^\top \mathbf{S}\,\boldsymbol{w}_1 = \lambda_1 \boldsymbol{w}_1^\top \boldsymbol{w}_1 = \lambda_1$.
Thus the variance will be maximized if $\lambda_1$ is the largest eigenvalue (and $\boldsymbol{w}_1$ is the corresponding top eigenvector, also known as the first Principal Component).
Other large-variance directions can also be found likewise (with each new direction being orthogonal to all previous ones) using the eigendecomposition of the cov matrix $\mathbf{S}$; this is PCA.
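A small numerical sketch (my addition, using NumPy) of the claim above: the variance of the projections onto the top eigenvector of $\mathbf{S}$ equals the largest eigenvalue, and no other unit direction does better. The anisotropic toy data is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3)) @ np.diag([3.0, 1.0, 0.3])  # anisotropic toy data
Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / len(X)

# Eigendecomposition of the (symmetric) covariance matrix
eigvals, eigvecs = np.linalg.eigh(S)           # eigenvalues in ascending order
w1, lam1 = eigvecs[:, -1], eigvals[-1]         # top eigenvector / eigenvalue

# Variance of the projections onto w1 equals the top eigenvalue
proj_var = np.var(Xc @ w1)
print(proj_var, lam1)                          # equal up to floating-point error

# Any other unit direction captures no more variance
w_rand = rng.normal(size=3)
w_rand /= np.linalg.norm(w_rand)
assert np.var(Xc @ w_rand) <= lam1 + 1e-9
```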
The PCA Algorithm
Center the data (subtract the mean from each data point)
Compute the covariance matrix $\mathbf{S}$ using the centered data matrix $\mathbf{X}$ as

$\mathbf{S} = \frac{1}{N}\,\mathbf{X}^\top \mathbf{X}$  (assuming $\mathbf{X}$ is arranged as $N \times D$)

Do an eigendecomposition of the covariance matrix (many methods exist)
Take the top $K$ leading eigenvectors $\{\boldsymbol{w}_1, \ldots, \boldsymbol{w}_K\}$ with eigenvalues $\{\lambda_1, \ldots, \lambda_K\}$
The $K$-dimensional projection/embedding of each input is

$\boldsymbol{z}_n = \mathbf{W}_K^\top \boldsymbol{x}_n$, where $\mathbf{W}_K = [\boldsymbol{w}_1, \ldots, \boldsymbol{w}_K]$ is the "projection matrix" of size $D \times K$

Note: Can decide how many eigvecs to use based on how much variance we want to capture (recall that each $\lambda_k$ gives the variance in the direction $\boldsymbol{w}_k$, and their sum is the total variance).
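A minimal NumPy sketch of the algorithm above (my addition, not from the slides; function and variable names are mine):

```python
import numpy as np

def pca(X, K):
    """PCA as on this slide: center, covariance, eigendecomposition, project."""
    mu = X.mean(axis=0)
    Xc = X - mu                              # step 1: center the data
    S = Xc.T @ Xc / len(X)                   # step 2: covariance matrix (D x D)
    eigvals, eigvecs = np.linalg.eigh(S)     # step 3: eigendecomposition (ascending)
    order = np.argsort(eigvals)[::-1]        # sort eigenvalues in decreasing order
    W_K = eigvecs[:, order[:K]]              # step 4: top-K eigenvectors (D x K)
    Z = Xc @ W_K                             # step 5: K-dim embeddings z_n = W_K^T x_n
    explained = eigvals[order[:K]].sum() / eigvals.sum()  # fraction of variance kept
    return Z, W_K, mu, explained

# toy usage
X = np.random.default_rng(2).normal(size=(200, 10))
Z, W_K, mu, frac = pca(X, K=3)
print(Z.shape, round(frac, 3))
```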
The Reconstruction Error View of PCA
Representing a data point $\boldsymbol{x}_n \in \mathbb{R}^D$ in the standard orthonormal basis gives $\boldsymbol{x}_n = \sum_{d=1}^{D} x_{nd}\,\boldsymbol{e}_d$; in a new orthonormal basis $\{\boldsymbol{w}_1, \ldots, \boldsymbol{w}_D\}$ it is

$\boldsymbol{x}_n = \sum_{d=1}^{D} z_{nd}\,\boldsymbol{w}_d = \sum_{d=1}^{D} (\boldsymbol{x}_n^\top \boldsymbol{w}_d)\,\boldsymbol{w}_d = \sum_{d=1}^{D} (\boldsymbol{w}_d \boldsymbol{w}_d^\top)\,\boldsymbol{x}_n$

Using only $K < D$ of the directions gives the approximation

$\hat{\boldsymbol{x}}_n = \sum_{d=1}^{K} z_{nd}\,\boldsymbol{w}_d = \sum_{d=1}^{K} (\boldsymbol{x}_n^\top \boldsymbol{w}_d)\,\boldsymbol{w}_d = \sum_{d=1}^{K} (\boldsymbol{w}_d \boldsymbol{w}_d^\top)\,\boldsymbol{x}_n \approx \boldsymbol{x}_n$

Note that $\|\boldsymbol{x}_n - \hat{\boldsymbol{x}}_n\|^2$ is the reconstruction error on $\boldsymbol{x}_n$. We would like to minimize it w.r.t. $\boldsymbol{w}_1, \ldots, \boldsymbol{w}_K$.
Now $\boldsymbol{x}_n$ is represented by the $K$-dim rep $\boldsymbol{z}_n = [z_{n1}, \ldots, z_{nK}]^\top$, and $\hat{\boldsymbol{x}}_n = \mathbf{W}_K \boldsymbol{z}_n$.
Also, $\boldsymbol{z}_n = \mathbf{W}_K^\top \boldsymbol{x}_n$, where $\mathbf{W}_K$ is the "projection matrix" of size $D \times K$.
PCA Minimizes Reconstruction Error
We plan to use only $K < D$ directions, so we would like them to be such that the total reconstruction error is minimized:

$\mathcal{L}(\boldsymbol{w}_1, \boldsymbol{w}_2, \ldots, \boldsymbol{w}_K) = \sum_{n=1}^{N} \left\|\boldsymbol{x}_n - \hat{\boldsymbol{x}}_n\right\|^2 = \sum_{n=1}^{N}\|\boldsymbol{x}_n\|^2 \;-\; N\sum_{d=1}^{K} \boldsymbol{w}_d^\top \mathbf{S}\,\boldsymbol{w}_d$  (verify)

The first term is constant (doesn't depend on the $\boldsymbol{w}_d$'s), and each $\boldsymbol{w}_d^\top \mathbf{S}\,\boldsymbol{w}_d$ is the variance along $\boldsymbol{w}_d$. Thus minimizing the reconstruction error is equivalent to maximizing the total captured variance, i.e., choosing the top $K$ eigenvectors of $\mathbf{S}$.
PCA can also be done via SVD of the covariance matrix (its left and right singular vectors coincide and become the eigenvectors, and the singular values become the eigenvalues).
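A small NumPy check (my addition) of this SVD view: for the symmetric PSD matrix $\mathbf{S}$, the singular values match the eigenvalues and the singular vectors match the eigenvectors up to sign. The anisotropic toy data just keeps the eigenvalues well separated:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 6)) * np.array([3.0, 2.5, 2.0, 1.5, 1.0, 0.5])
Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / len(X)

# Eigendecomposition of S ...
eigvals, eigvecs = np.linalg.eigh(S)

# ... versus SVD of S (symmetric PSD: singular values == eigenvalues)
U, sing, Vt = np.linalg.svd(S)
assert np.allclose(np.sort(sing), np.sort(eigvals))

# each left singular vector matches one eigenvector up to sign
for u in U.T:
    assert np.isclose(np.abs(eigvecs.T @ u).max(), 1.0)
```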
Dim-Red as Matrix Factorization
If we don't care about the orthonormality constraints on $\mathbf{W}$, then dim-red can also be achieved by solving a matrix factorization problem on the data matrix $\mathbf{X}$
$\underbrace{\mathbf{X}}_{N \times D} \;\approx\; \underbrace{\mathbf{Z}}_{N \times K}\;\underbrace{\mathbf{W}}_{K \times D}$, where $\mathbf{Z}$ is the matrix containing the low-dim reps of the inputs

$\{\hat{\mathbf{Z}}, \hat{\mathbf{W}}\} = \operatorname*{argmin}_{\mathbf{Z},\,\mathbf{W}} \left\| \mathbf{X} - \mathbf{Z}\mathbf{W} \right\|^2$
If $K \ll \min(N, D)$, such a factorization gives a low-rank approximation of the data matrix $\mathbf{X}$
Can solve such problems using alternating optimization (ALT-OPT); a sketch is given below
Can impose various constraints on $\mathbf{Z}$ and $\mathbf{W}$ (e.g., sparsity, non-negativity, etc.)
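A minimal ALT-OPT (alternating least squares) sketch for the factorization above, assuming the plain squared loss; the small ridge term `lam` and all names are my additions for numerical stability, not part of the slides:

```python
import numpy as np

def factorize(X, K, n_iters=50, lam=1e-3):
    """Alternating least squares for X ~ Z W (no orthonormality constraints)."""
    N, D = X.shape
    rng = np.random.default_rng(0)
    Z = rng.normal(scale=0.1, size=(N, K))
    W = rng.normal(scale=0.1, size=(K, D))
    I = lam * np.eye(K)
    for _ in range(n_iters):
        # Fix W, solve the least-squares problem for Z
        Z = np.linalg.solve(W @ W.T + I, W @ X.T).T
        # Fix Z, solve the least-squares problem for W
        W = np.linalg.solve(Z.T @ Z + I, Z.T @ X)
    return Z, W

X = np.random.default_rng(4).normal(size=(50, 20))
Z, W = factorize(X, K=5)
print(np.linalg.norm(X - Z @ W))   # reconstruction error of the rank-5 approximation
```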
Matrix Factorization is a very useful method!
In many problems, we are given co-occurrence data in the form of an $N \times M$ matrix
Data consists of relationships b/w two sets of entities, containing $N$ and $M$ entities respectively
Each entry $X_{nm}$ denotes how many times the pair $(n, m)$ co-occurs, e.g.,
Number of times document $n$ (total $N$ docs) contains word $m$ of a vocabulary (total $M$ words)
Rating user $n$ gave to item (or movie) $m$ on a shopping (or movie streaming) website
In such problems, matrix factorization $\underbrace{\mathbf{X}}_{N \times M} \approx \underbrace{\mathbf{Z}}_{N \times K}\,\underbrace{\mathbf{W}}_{K \times M}$ can be used to learn a $K$-dim feature vector for both sets of entities
Even if some entries of $\mathbf{X}$ are missing, we can still do matrix factorization using the loss defined only on the given entries of $\mathbf{X}$, and use the learned $\mathbf{Z}$ and $\mathbf{W}$ to predict any missing entry as $\hat{X}_{nm} = \boldsymbol{z}_n^\top \boldsymbol{w}_m$ (matrix completion)
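A hedged sketch of matrix completion under these assumptions: gradient descent on the squared loss restricted to observed entries (one of several options; ALT-OPT would also work). All names and hyperparameters are illustrative:

```python
import numpy as np

def complete(X, mask, K=5, steps=2000, lr=0.01, lam=0.1, seed=0):
    """Predict missing entries of X by factorizing it using only the observed entries."""
    N, M = X.shape
    rng = np.random.default_rng(seed)
    Z = rng.normal(scale=0.1, size=(N, K))
    W = rng.normal(scale=0.1, size=(K, M))
    for _ in range(steps):
        E = mask * (Z @ W - X)                 # error on observed entries only
        Z -= lr * (E @ W.T + lam * Z)          # gradient steps with L2 regularization
        W -= lr * (Z.T @ E + lam * W)
    return Z @ W                               # full matrix of predictions z_n^T w_m

# toy ratings matrix with ~40% observed entries
rng = np.random.default_rng(5)
true = rng.normal(size=(30, 4)) @ rng.normal(size=(4, 20))   # rank-4 ground truth
mask = rng.random(true.shape) < 0.4
pred = complete(true * mask, mask, K=4)
print(np.abs((pred - true)[~mask]).mean())     # error on the held-out (missing) entries
```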
Supervised Dimensionality Reduction
Maximum variance directions may not be aligned with class separation directions (focusing only on the variance/reconstruction error of the inputs $\boldsymbol{x}_n$ is not always ideal)
[Figure: two classes projected onto two different directions.]
Max variance direction (given by PCA): projecting along this will give a one-dimensional embedding of each point with both classes overlapping with each other.
Direction that preserves class separation: projecting along this will give a one-dimensional embedding of each point with both classes still having a good separation.
Be careful when using methods like PCA for supervised learning problems
A better option would be to find projection directions such that, after projection (see the toy sketch below):
Points within the same class are close (low intra-class variance)
Points from different classes are well separated (the class means are far apart)
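A toy illustration (my addition, not from the slides) of why the max-variance direction can be a poor choice for classification. As a simple stand-in for a proper supervised method, the difference of class means is used as a class-separating direction:

```python
import numpy as np

rng = np.random.default_rng(6)
# Two classes separated along the x-axis but with large variance along the y-axis
A = rng.normal(loc=[-2.0, 0.0], scale=[0.5, 5.0], size=(200, 2))
B = rng.normal(loc=[+2.0, 0.0], scale=[0.5, 5.0], size=(200, 2))
X = np.vstack([A, B])
Xc = X - X.mean(axis=0)

# Max-variance (PCA) direction: dominated by the y-axis, where the classes overlap
S = Xc.T @ Xc / len(X)
w_pca = np.linalg.eigh(S)[1][:, -1]

# A simple class-separating direction: the (normalized) difference of class means
w_sep = B.mean(axis=0) - A.mean(axis=0)
w_sep /= np.linalg.norm(w_sep)

for name, w in [("PCA direction", w_pca), ("mean-difference direction", w_sep)]:
    a, b = A @ w, B @ w
    gap = abs(a.mean() - b.mean()) / np.sqrt(0.5 * (a.var() + b.var()))
    print(f"{name}: class separation (gap / within-class std) = {gap:.2f}")
```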
Dim. Reduction by Preserving Pairwise Distances
PCA/SVD etc. assume we are given the points as vectors (e.g., in $D$ dimensions)
Would like to project data such that pairwise distances between points are preserved
Here $\boldsymbol{z}_n$ and $\boldsymbol{z}_m$ denote the low-dim embeddings/projections of $\boldsymbol{x}_n$ and $\boldsymbol{x}_m$, respectively.
A classic example: MDS produces a 2D embedding of cities such that geographically close cities are also close in the 2D embedding space.
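One standard way to realize this idea is classical MDS: double-center the squared-distance matrix and eigendecompose it. The sketch below (my wording and names, assuming NumPy) recovers a 2-D embedding whose pairwise distances match the inputs when the points truly lie in 2-D:

```python
import numpy as np

def classical_mds(D, K=2):
    """Classical MDS: double-center the squared-distance matrix, take top-K eigenvectors.
    D is an n x n matrix of pairwise distances."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # Gram matrix of the centered embedding
    eigvals, eigvecs = np.linalg.eigh(B)
    idx = np.argsort(eigvals)[::-1][:K]          # top-K eigenpairs
    return eigvecs[:, idx] * np.sqrt(np.maximum(eigvals[idx], 0))

# sanity check: distances of the recovered 2-D embedding match the inputs
pts = np.random.default_rng(7).normal(size=(10, 2))
D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
Z = classical_mds(D, K=2)
D_hat = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
print(np.allclose(D, D_hat, atol=1e-8))
```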
Beyond Linear Projections
Consider the swiss-roll dataset (points lying close to a manifold)
Relative positions of the points are destroyed after a linear projection.
Nonlinear Dimensionality Reduction
We want to learn a nonlinear low-dim projection
Relative positions of the points are preserved after the projection.
Would like to do it without explicitly computing the feature mappings $\phi(\boldsymbol{x}_n)$, since their dimensionality can be very large (even infinite, e.g., when using an RBF kernel)
Boils down to doing eigendecomposition of the kernel matrix (PRML 12.3)
Can verify that each such projection direction can be written as a linear combination of the mapped inputs: $\boldsymbol{w}_k = \sum_{n=1}^{N} a_{kn}\,\phi(\boldsymbol{x}_n)$
Can show that finding the coefficients $\{a_{kn}\}$ reduces to solving an eigendecomposition of the kernel matrix $\mathbf{K}$
Note: Due to the requirement of centering, we work with a centered kernel matrix $\tilde{\mathbf{K}} = \mathbf{K} - \mathbf{1}_N \mathbf{K} - \mathbf{K}\mathbf{1}_N + \mathbf{1}_N \mathbf{K} \mathbf{1}_N$, where $\mathbf{1}_N$ is the $N \times N$ matrix with every entry equal to $1/N$
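A kernel PCA sketch under these definitions (RBF kernel, centered kernel matrix as above); the eigenvector scaling follows the common unit-norm-direction convention, and all names and the toy data are mine:

```python
import numpy as np

def kernel_pca(X, K=2, gamma=1.0):
    """Kernel PCA sketch: build an RBF kernel matrix, center it, eigendecompose it."""
    # RBF kernel matrix K_ij = exp(-gamma * ||x_i - x_j||^2)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    Kmat = np.exp(-gamma * sq)

    # Centering: K_tilde = K - 1_N K - K 1_N + 1_N K 1_N, with 1_N the all-(1/N) matrix
    N = len(X)
    one_N = np.full((N, N), 1.0 / N)
    K_tilde = Kmat - one_N @ Kmat - Kmat @ one_N + one_N @ Kmat @ one_N

    eigvals, eigvecs = np.linalg.eigh(K_tilde)
    idx = np.argsort(eigvals)[::-1][:K]
    alphas, lams = eigvecs[:, idx], eigvals[idx]
    # embedding of x_n = its projections onto the top-K (implicit) directions
    return K_tilde @ (alphas / np.sqrt(np.maximum(lams, 1e-12)))

Z = kernel_pca(np.random.default_rng(8).normal(size=(100, 3)), K=2, gamma=0.5)
print(Z.shape)   # (100, 2)
```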
Locally Linear Embedding
Several nonlinear dim-red algos use this idea: essentially, neighbourhood preservation, but only locally.
Basic idea: If two points are local neighbors in the original space then they should be
local neighbors in the projected space too
Given observations $\{\boldsymbol{x}_n\}_{n=1}^{N}$, LLE is formulated as follows. First solve

$\hat{\mathbf{W}} = \operatorname*{argmin}_{\mathbf{W}} \sum_{n=1}^{N} \Big\| \boldsymbol{x}_n - \sum_{m \in \mathcal{N}(n)} W_{nm}\, \boldsymbol{x}_m \Big\|^2$

to learn weights $W_{nm}$ such that each point can be written as a weighted linear combination of its local neighbors in the original feature space; here $\mathcal{N}(n)$ denotes the local neighbors (a predefined number, say $K$, of them) of point $\boldsymbol{x}_n$.
For each point $\boldsymbol{x}_n$, LLE then learns a low-dim embedding $\boldsymbol{z}_n$ such that the same neighborhood structure exists in the low-dim space too (this requires solving an eigenvalue problem).
Basically, if point $\boldsymbol{x}_n$ can be reconstructed from its neighbors in the original space, the same weights should be able to reconstruct $\boldsymbol{z}_n$ in the new space too.
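A minimal usage sketch (assuming scikit-learn is available; not part of the slides):

```python
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding

X = np.random.default_rng(9).normal(size=(500, 10))        # stand-in for high-dim data
lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2)
Z = lle.fit_transform(X)                                    # 2-D embeddings of the 500 points
print(Z.shape, lle.reconstruction_error_)
```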
S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science 290 (2000).
SNE and t-SNE
(Thus very useful if we want to visualize some high-dim data in two or three dims.)
SNE ensures that neighbourhood distributions in both spaces are as close as possible
This is ensured by minimizing their total mismatch (KL divergence)
t-SNE (van der Maaten and Hinton, 2008) offers a couple of improvements to SNE
Learns the $\boldsymbol{z}_n$'s by minimizing a symmetrized KL divergence
Uses a Student-t distribution instead of a Gaussian for defining the low-dimensional neighbourhood distribution
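A minimal usage sketch with scikit-learn's t-SNE (my addition; hyperparameters are illustrative):

```python
import numpy as np
from sklearn.manifold import TSNE

X = np.random.default_rng(10).normal(size=(1000, 50))      # stand-in for high-dim data
Z = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(X)
print(Z.shape)   # (1000, 2) -- ready to scatter-plot for visualization
```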
SNE and t-SNE
Especially useful for visualizing data by projecting into 2D or 3D
Result of visualizing MNIST digits data in 2D (Figure from van der Maaten and Hinton, 2008)
Word Embeddings: Dim-Reduction for Words
(Or for sentences, paragraphs, documents, etc., which are basically sets of words.)
Feature representation/embeddings of words are very useful in many applications
Naively, we can use a one-hot vector of size $V$ for each word (where $V$ is the vocab size)
Word $m$ of the vocabulary: $[0\ 0\ 0\ 0\ 0\ 0\ 0\ 1\ 0\ 0\ 0\ 0]$ (1 only at the $m$-th location)
One-hot representation of a word has two main issues
Very high dimensionality ($V$) for each word
One-hot vector does not capture word semantics (any pair of words will have zero similarity)
CBOW: Probability of word $n$ occurring given a context window, e.g., the $k$ previous and $k$ next words

$p(n \mid n-k : n+k) = \dfrac{\exp(\boldsymbol{e}_n^\top \boldsymbol{c}_n)}{\sum_{n'} \exp(\boldsymbol{e}_{n'}^\top \boldsymbol{c}_n)}$

The left-hand side is a conditional probability which can be estimated from training data; $\boldsymbol{c}_n$ is the sum/average of the embeddings of the context window for word $n$. The embeddings are learned by optimizing a neural-network based loss function which makes the difference b/w the LHS and the RHS small.
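A minimal CBOW usage sketch with gensim's word2vec implementation (assuming gensim >= 4, where the embedding size parameter is `vector_size`; the toy corpus is mine):

```python
from gensim.models import Word2Vec

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "rug"]]                   # toy corpus
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)  # sg=0 -> CBOW
print(model.wv["cat"].shape)              # 50-dim embedding of the word "cat"
print(model.wv.similarity("cat", "dog"))  # cosine similarity between embeddings
```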
Efficient Estimation of Word Representations in Vector Space (Mikolov et al., 2013).
Dimensionality Reduction: Out-of-sample Embedding
Some dim-red methods can only compute the embedding of the training data
Given $N$ training samples $\boldsymbol{x}_1, \ldots, \boldsymbol{x}_N$, these methods will give their embeddings $\boldsymbol{z}_1, \ldots, \boldsymbol{z}_N$
However, given a new point $\boldsymbol{x}_*$ (not in the training samples), they can't produce its embedding $\boldsymbol{z}_*$ easily
Thus no easy way of getting “out-of-sample” embedding
Some of the nonlinear dim-red methods like LLE, SNE, KPCA, etc have this
limitation
Reason: They don't learn an explicit encoder; they directly optimize for the embeddings $\boldsymbol{z}_n$ of the given training inputs
To get “out-of-sample” embeddings, these methods require some modifications*
But many other methods do explicitly learn a mapping in the form of an "encoder" that can give $\boldsymbol{z}_*$ for any new $\boldsymbol{x}_*$ as well (such methods are more useful)
For PCA, the projection matrix $\mathbf{W}_K$ is this encoder, and $\boldsymbol{z}_* = \mathbf{W}_K^\top \boldsymbol{x}_*$
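A small sketch (my addition) of this out-of-sample property for PCA: once the mean and projection matrix are learned from the training data, any new input can be embedded without refitting:

```python
import numpy as np

# Fit PCA on training data, then embed a brand-new point with the learned encoder
rng = np.random.default_rng(11)
X_train = rng.normal(size=(200, 10))
mu = X_train.mean(axis=0)
S = (X_train - mu).T @ (X_train - mu) / len(X_train)
eigvals, eigvecs = np.linalg.eigh(S)
W_K = eigvecs[:, np.argsort(eigvals)[::-1][:3]]    # D x K projection matrix (the encoder)

x_new = rng.normal(size=10)                        # out-of-sample input, unseen in training
z_new = W_K.T @ (x_new - mu)                       # its embedding, no refitting needed
print(z_new.shape)                                 # (3,)
```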
*Out-of-Sample Extensions for LLE, Isomap, MDS, Eigenmaps, and Spectral Clustering (Bengio et al., 2003).