
Singular Value Decomposition (SVD)

Prerequisites
Orthogonal and Orthonormal vectors
LINEAR INDEPENDENCE OF VECTORS

• Definition: An indexed set of vectors {v1, …, vp} in R^n is said to be linearly independent if the vector equation

  c1 v1 + c2 v2 + … + cp vp = 0

  has only the trivial solution c1 = c2 = … = cp = 0.

• The set {v1, …, vp} is said to be linearly dependent if there exist weights c1, …, cp, not all zero, such that

  c1 v1 + c2 v2 + … + cp vp = 0
LINEAR INDEPENDENCE

Determine if the set {v1, v2, v3} is linearly independent.

• The homogeneous system has a non-trivial solution ⇔ det[v1 v2 v3] = 0

• Here the determinant is zero, hence the vectors are linearly dependent.

Exercise-2: Determine whether the set of vectors {(1,2,2), (2,1,2), (2,2,1)} is linearly independent.
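A minimal numerical check for this exercise (a sketch using NumPy, not part of the original slides):

```python
import numpy as np

# The three vectors as rows of a matrix
A = np.array([[1, 2, 2],
              [2, 1, 2],
              [2, 2, 1]], dtype=float)

# The vectors are linearly independent iff the matrix has full rank,
# or equivalently (square matrix) iff its determinant is non-zero.
print(np.linalg.matrix_rank(A))   # 3
print(np.linalg.det(A))           # ~5 (non-zero), so the set is linearly independent
```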
Subspace, Basis

• Given a basis of a subspace, any vector in that subspace will be a linear


combination of the basis vectors.

The smallest subspace containing a finite set of vectors of a vector space is called the linear span of the set; the set (for example, a set of basis vectors) is then said to span the subspace.
Eigen values and Eigen vectors
Eigen Value and Eigen Vector

• Let us take the example of a house price prediction model.

• Here the dependent variable is the house price and there are
a large number of independent variables (features) on which
the house price depends.

• Training with a large number of independent variables makes machine learning models computationally intensive and complex.
Eigen Value and Eigen Vector

• Is it possible to extract a smaller set of variables (features) to train the models and make predictions, while ensuring that most of the information contained in the original variables is retained?

• This is where eigenvalues and eigenvectors come into the picture.
What do matrices do to vectors?

Recall:

[Figure: a 2 x 2 matrix applied to a vector produces the new vector (3,5); the vectors (0,2), (2,1) and (3,1) are shown.]

• The new vector is:
1) rotated
2) scaled
Are there any special vectors that only get scaled?

Try (1,1):   M (1,1) = (3,3) = 3 (1,1)

• For this special vector, multiplying by M is like multiplying by a scalar.
• The special vector (1,1) is called an Eigen vector of M.
• 3 (the scaling factor) is called the Eigen value associated with this eigenvector.
Definition
An Eigen vector of a matrix is a vector which, when multiplied by the matrix (a linear transformation), results in another vector in the same (or exactly reversed) direction, scaled by a scalar multiple; that scalar multiple is termed the Eigen value.
Eigen Vector: a special vector x that points in a direction in which it is only stretched (or shrunk) by the transformation, i.e. Ax has the same (or exactly reversed) direction as x.

Eigen Value: λ is the factor by which the Eigen vector is stretched, shrunk, reversed, or left unchanged.

In the context of data (e.g. a covariance matrix), an Eigen value tells us how much variance the data has along a particular direction, whereas the corresponding Eigen vector tells us what that direction is.
Mathematical Definition

Let A be an n x n matrix, x a non-zero n x 1 column vector, and λ a scalar.

If Ax = λx, then x is an Eigen vector of A and λ is the corresponding Eigen value of A.
Eigen vectors obey this equation:

  Ax = λx
  or, Ax − λx = 0
  or, (A − λI)x = 0        … Eq. 1

Eq. 1 is a homogeneous system of equations in matrix form.

For a non-zero solution of this homogeneous system we require det(A − λI) = 0.
Characteristic equation

• Let A = [a_ij] be an n x n matrix. Then det(A − λI) is said to be the characteristic polynomial of A.

• The equation det(A − λI) = 0 is called the characteristic equation of A; its roots are the Eigen values of A.
Finding Eigen value, Eigen vector

Example-1:

Find the Eigen values and Eigen vectors of the given matrix.
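The example matrix itself is not reproduced on this slide, so the sketch below uses an assumed 2 x 2 matrix purely for illustration; it shows how the characteristic polynomial and Ax = λx can be checked with NumPy.

```python
import numpy as np

# Assumed example matrix (not the one from the slide)
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# Coefficients of the characteristic polynomial det(λI − A):
# [1, -4, 3]  ->  λ^2 − 4λ + 3 = 0  ->  λ = 3, 1
print(np.poly(A))

# Eigen values and Eigen vectors computed directly
vals, vecs = np.linalg.eig(A)
print(vals)                               # [3. 1.] (order not guaranteed)
for lam, v in zip(vals, vecs.T):
    assert np.allclose(A @ v, lam * v)    # verify A v = λ v for each pair
```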


Are Eigen vectors orthogonal?

In general, the Eigen vectors of an arbitrary matrix are not orthogonal.

However, they are orthogonal for particular types of matrices, such as symmetric matrices:
the Eigen vectors corresponding to two distinct Eigen values of a real symmetric matrix are orthogonal.
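A quick numerical illustration of this fact (the symmetric matrix below is an arbitrary example):

```python
import numpy as np

# An arbitrary real symmetric matrix
S = np.array([[4.0, 1.0, 2.0],
              [1.0, 3.0, 0.0],
              [2.0, 0.0, 5.0]])

vals, vecs = np.linalg.eigh(S)   # eigh is specialised for symmetric matrices

# The Eigen vectors are mutually orthogonal (here orthonormal), so the
# matrix of pairwise dot products is (numerically) the identity.
print(np.round(vecs.T @ vecs, 10))
```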
Issues with Eigen vector!

1. Ax = λx requires A to be square.
2. Eigen vectors are generally not orthogonal.
3. There are not always enough Eigen vectors to
construct P for diagonalization

How do we diagonalize an m x n (rectangular) matrix?

Instead of Eigen vectors, consider singular vectors.
Singular Value Decomposition
(SVD)
Singular Value Decomposition (SVD)

• The singular value decomposition is a factorization of an m x n matrix A into the product of three matrices:

  A = U Σ V^T,   where U is m x m, Σ is m x n, and V is n x n.

• The diagonal values in the Sigma matrix are known as the singular
values of the original matrix A.
• The columns of the U matrix are called the left-singular vectors of A.
• The columns of V are called the right-singular vectors of A.
• The left singular vectors are mutually orthogonal, and so are the right singular vectors.
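A minimal sketch of computing an SVD with NumPy (the 2 x 3 matrix is an arbitrary example):

```python
import numpy as np

A = np.array([[3.0, 1.0, 1.0],
              [-1.0, 3.0, 1.0]])            # arbitrary 2 x 3 example

# full_matrices=True returns U (m x m) and V^T (n x n); s holds the singular values
U, s, Vt = np.linalg.svd(A, full_matrices=True)

Sigma = np.zeros(A.shape)
np.fill_diagonal(Sigma, s)                  # singular values on the diagonal of Σ

assert np.allclose(A, U @ Sigma @ Vt)       # A = U Σ V^T
assert np.allclose(U.T @ U, np.eye(2))      # columns of U are orthonormal
assert np.allclose(Vt @ Vt.T, np.eye(3))    # columns of V are orthonormal
```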
Singular Value Decomposition (SVD)

For an m x n (row x col) matrix A of rank r there exists a factorization (SVD) as follows:

  A = U Σ V^T,   where U is m x m, Σ is m x n, and V is n x n.

The columns of U are orthogonal eigenvectors of A A^T.

The columns of V are orthogonal eigenvectors of A^T A.

The eigenvalues λ1, …, λr of A A^T are also the eigenvalues of A^T A, and

  σi = √λi,    Σ = diag(σ1, …, σr)    (the singular values).
Singular values and Singular vectors

• The singular values of A are the square roots of the Eigen values of A^T A (equivalently, of A A^T), denoted σi = √λi.

• The Eigen vectors of A^T A and A A^T are called the (right and left) singular vectors of A.

• In the SVD the columns of U and V are orthonormal and the matrix Σ is diagonal with positive real entries:

  Σ = diag(σ1, …, σr)
Properties of SVD

1. If A is an m x n matrix, then A^T A is an n x n symmetric matrix and is orthogonally diagonalizable.

2. Let v1, …, vn be the orthonormal eigenvectors of A^T A corresponding to the eigenvalues λ1 ≥ λ2 ≥ … ≥ λn. Then ||A vi|| = √λi = σi, i.e. the singular values of A are the lengths of the vectors A vi for i = 1, 2, …, n.
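A short numerical check of Property 2 (the matrix is an arbitrary example):

```python
import numpy as np

A = np.array([[3.0, 1.0, 1.0],
              [-1.0, 3.0, 1.0]])

# Eigen-decomposition of A^T A (symmetric, so eigh), sorted in decreasing order
lam, V = np.linalg.eigh(A.T @ A)
lam, V = lam[::-1], V[:, ::-1]

s = np.linalg.svd(A, compute_uv=False)            # singular values of A

# σ_i = sqrt(λ_i) = ||A v_i|| for the non-zero singular values
print(np.sqrt(np.clip(lam, 0, None))[:len(s)])    # matches s
print(np.linalg.norm(A @ V, axis=0)[:len(s)])     # matches s as well
```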
Singular value decomposition

• Exercise
1. Find the SVD of
2. Find the SVD of
3. Find the SVD of
Low rank approximation of a matrix

• Often a data matrix A is close to a low rank matrix and SVD is useful
to find a good low rank approximation to A.

• For any k (up to the rank of A), the SVD of A gives the best rank-k approximation to A.

***Rank of a matrix: the maximum number of linearly independent columns (or rows) of the matrix.
For a square matrix, all columns (rows) are linearly independent only if the matrix is nonsingular (det A ≠ 0).
Low rank approximation of a matrix

• In mathematics, low-rank approximation is a minimization problem, in which


the cost function measures the fit between a given matrix (the data) and an
approximating matrix, subject to a constraint that the approximating matrix
has reduced rank.

• Low rank approximations have compact representations: you don't need to store all m*n numbers of the matrix (a rank-k approximation needs only about k(m + n) numbers).

• Low-rank approximation is thus a way to recover the "original" (the "ideal" low-rank matrix before it was corrupted by noise, etc.), i.e. to find a low-rank matrix that fits the given matrix as closely as possible.

• The problem is used for mathematical modeling and data compression


Low rank approximation from SVD

• If we want to best approximate a matrix A by a rank-k matrix, how


should we do it?

• Assume we had a representation of the data matrix A as a sum of


several ingredients, with these ingredients ordered by “importance,"
then we could just keep the k “most important" ones. The SVD gives us
exactly such a representation!

• Do Eigen values / Eigen vectors perform a similar job?

• PCA!!
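A minimal sketch of a rank-k approximation via truncated SVD (NumPy; the data matrix and k are illustrative):

```python
import numpy as np

def rank_k_approximation(A, k):
    """Best rank-k approximation of A, obtained by truncating its SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Illustrative data: a random rank-2 matrix plus a little noise
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 2)) @ rng.normal(size=(2, 30)) + 0.01 * rng.normal(size=(50, 30))

A2 = rank_k_approximation(A, k=2)
print(np.linalg.norm(A - A2))   # small: A is close to a rank-2 matrix
```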
Low rank approximation from SVD
Best-fitting k-dimensional subspace

• Since the SVD gives low rank approximations of a matrix (say A), it can be used to find the best-fitting k-dimensional subspace, for k = 1, 2, 3, …, for a set of n data points.

• Here, “best" means minimizing the sum of the squares of the


perpendicular distances of the points to the subspace, or
equivalently, maximizing the sum of squares of the lengths of the
projections of the points onto this subspace.

• It can be shown that the best-fitting k-dimensional subspace can be


found by k applications of the best fitting line algorithm, where on the
i’th iteration we find the best fit line perpendicular to the previous i-1
lines.
Best Rank-k Approximations

• Thus we have two interpretations of the best-fit subspace.

1. To minimize the sum of squared distances of the data points to it.

2. To maximize the sum of squared projections of the data points onto it.

This says that the subspace contains the maximum content of the data among all subspaces of the same dimension.

It is seen that minimizing the sum of squared distances is equivalent to


maximizing the sum of squared projections.
Finding the best-fit subspace

**Argmax is an operation that finds the argument (index) that gives the maximum value from a target
function. Argmax is most commonly used in machine learning for finding the class with the largest
predicted probability.
Finding the best-fit subspace: Greedy algorithm

The following theorem establishes that the greedy algorithm finds the
best subspaces of every dimension.
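A sketch of this greedy procedure in NumPy (illustrative only; in practice the top-k right singular vectors from np.linalg.svd give the same subspace directly):

```python
import numpy as np

def greedy_best_fit_subspace(A, k, n_iter=500):
    """Greedy best-fit subspace: find k orthonormal directions one at a time,
    each maximizing the sum of squared projections of the rows of A while
    staying perpendicular to the directions already found (power iteration)."""
    n = A.shape[1]
    basis = []
    rng = np.random.default_rng(0)
    for _ in range(k):
        v = rng.normal(size=n)
        for _ in range(n_iter):
            for b in basis:                  # stay perpendicular to earlier lines
                v -= (v @ b) * b
            v = A.T @ (A @ v)                # one power-iteration step on A^T A
            v /= np.linalg.norm(v)
        for b in basis:                      # final re-orthogonalization
            v -= (v @ b) * b
        basis.append(v / np.linalg.norm(v))
    return np.array(basis)                   # k x n, rows are v1, ..., vk

# The subspace found greedily matches the span of the top-k right singular vectors
A = np.random.default_rng(1).normal(size=(40, 5))
V_greedy = greedy_best_fit_subspace(A, k=2)
_, _, Vt = np.linalg.svd(A)
print(np.allclose(V_greedy.T @ V_greedy, Vt[:2].T @ Vt[:2], atol=1e-6))  # True
```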

The Frobenius Norm (Euclidean or L2 norm) of a matrix is defined as the square root of the sum of the squares of all the elements of the matrix.
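For example, the Frobenius-norm error of a rank-k truncation can be computed directly; by the Eckart–Young theorem it equals the root-sum-of-squares of the discarded singular values (the matrix below is an arbitrary example):

```python
import numpy as np

A = np.random.default_rng(2).normal(size=(20, 10))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 3
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]      # rank-k truncation

err = np.linalg.norm(A - A_k, ord='fro')
print(err, np.sqrt(np.sum(s[k:] ** 2)))          # the two values agree
```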
Best Rank-k Approximations
Condition for finding the best-fitting subspace

• Centering data: Center the data by subtracting the centroid


of the data from each data point.
• To find the best-fitting (affine) subspace, we first center the data and then find the best-fitting subspace through the origin.

If one wants statistical information relative to the mean of the data, one needs to
center the data. If one wants the best low rank approximation, one would not
center the data.
• It can be shown that the line minimizing the sum of squared
distances to a set of points, if not restricted to go through the
origin, must pass through the centroid of the points.

• This implies that if the centroid is subtracted from each data


point, such a line will pass through the origin.
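The centering step itself is a one-liner (a sketch with made-up data):

```python
import numpy as np

X = np.random.default_rng(3).normal(loc=5.0, size=(100, 4))   # rows = data points
X_centered = X - X.mean(axis=0)              # subtract the centroid from every point
print(np.round(X_centered.mean(axis=0), 12)) # ~ 0 in every coordinate
```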
Principal Component Analysis (PCA)
PCA

Principal component analysis (PCA):

• The traditional use of SVD is in Principal Component Analysis (PCA).

• PCA is illustrated by a movie recommendation setting where there are


‘n’ customers and ‘d’ movies.

• Let A be an n x d matrix whose entry a_ij represents the amount that customer i likes movie j.

• One hypothesizes that there are only k underlying basic factors that determine how much a given customer will like a given movie, where k is much smaller than n or d.
PCA
• Principal components analysis (PCA) is a technique that can be
used to simplify a dataset.

• It is a linear transformation that chooses a new coordinate system for the data set such that the greatest variance under any projection of the data comes to lie along the first axis (then called the first principal component),

• the second greatest variance along the second axis, and so on.

• PCA can be used for reducing dimensionality by eliminating the


later principal components.
PRINCIPAL COMPONENT?

Are they correlated?
PCA

• We define new dimensions (variables) which:
– are linear combinations of the original ones
– are uncorrelated with one another (orthogonal in the original dimension space)
– capture as much of the original variance in the data as possible

• These are called Principal Components.

PCA

• Given a set of points, how do we know if they


can be compressed like in the previous toy
example?

• The answer is to look into the correlation


between the points

• The tool for doing this is called PCA


PCA
• Principle
– Linear projection method to reduce the number of parameters
– Transform a set of correlated variables into a new set of uncorrelated variables
– Map the data into a space of lower dimensionality
– Form of unsupervised learning

• Properties
– It can be viewed as a rotation of the existing axes to new
positions in the space defined by original variables
– New axes are orthogonal and represent the directions with
maximum variability
PCA
• PCA is performed by finding the eigenvalues and eigenvectors
of the covariance matrix.

• We find that the eigenvectors with the largest eigenvalues correspond to the directions in which the data has the strongest variation.
These are the principal components.

• PCA is a useful statistical technique that has found application in:


– fields such as face recognition and image compression
– finding patterns in data of high dimension.
What are the new axes?
• Orthogonal directions of greatest variance in data
• Projections along PC1 discriminate the data most along any one axis

[Figure: data plotted against Original Variable A and Original Variable B, with the new orthogonal axes PC 1 and PC 2 aligned with the directions of greatest variance.]

• The first principal component is the direction of greatest variability (variance) in the data
• Second is the next orthogonal (uncorrelated) direction of greatest variability
• And so on …
PCA
• What are principal components?

– 1st Principal component : The Most-important direction -


Direction of maximum variance in the input space
– 2nd Principal component : 2nd Most-important direction -
Direction of second-largest variance in the input space
– 3rd Principal component : …………
– 4th Principal component : ………….

• How many principal components are possible ?


– As many as the dimensions of the input space
PCA
• Are all principal components equally important ??

• No. Only those principal components that contribute a


significant fraction of total energy are considered important.
Energy along a direction is proportional to Variance along that
direction.

• The less important principal components can be ignored –


leading to reduction in dimensionality

• What does it say about the distribution, if all dimensions are


equally important ??
– Isotropic
How to compute ??
– Given N-dimensional data X (say M points X1, X2, …, XM)

– Find the covariance matrix of X (zero mean):

• Cx = Cov(X) = E[X X^T]

• Now find the eigen values λ of Cx

• Sort the eigen values

• The eigen vector v1 that corresponds to the largest eigen value λ1 is the first principal component

• The eigen vector vj that corresponds to the j'th-largest eigen value λj is the j'th principal component
Steps : PCA
• Arrange data points in matrix X
• Find Cov X
• Find Eigen values and Eigen Vectors of Cov X
• Sort Eigen Values in descending order
• Start building matrix P. First column of P is the eigen vector
that corresponds to largest eigen value.
• Second column of P is the eigen vector that corresponds to
the second-largest eigen value.
• P^T is the transform that completely decorrelates the data X
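A minimal NumPy sketch of these steps (illustrative data; X holds one data point per column, matching the Cov(X) = E[X X^T] convention above):

```python
import numpy as np

rng = np.random.default_rng(4)
# Correlated 2-D data, one point per column (2 x M)
X = rng.multivariate_normal([0, 0], [[3.0, 1.5], [1.5, 1.0]], size=500).T

Xc = X - X.mean(axis=1, keepdims=True)     # center the data
C = (Xc @ Xc.T) / (Xc.shape[1] - 1)        # covariance matrix Cov(X)

vals, vecs = np.linalg.eigh(C)             # eigen-decomposition (C is symmetric)
order = np.argsort(vals)[::-1]             # sort eigen values in descending order
vals, P = vals[order], vecs[:, order]      # columns of P = principal directions

Y = P.T @ Xc                               # transformed (decorrelated) data
print(np.round(np.cov(Y), 3))              # ~ diagonal: Cov(Y) = diag(eigen values)
```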
Dimension reduction
• For dimensionality reduction:
– Choose only significant eigen values. Use only
those corresponding eigenvectors to build matrix
P

– This will lead to dimension reduction in the


transformed data

y = P^T x,   with dimensions (K x 1) = (K x N)(N x 1)
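Continuing in the same spirit, keeping only the top K eigenvectors gives the reduced representation (a self-contained sketch; the data and K are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.multivariate_normal([0, 0, 0],
                            [[3.0, 1.5, 0.2],
                             [1.5, 1.0, 0.1],
                             [0.2, 0.1, 0.5]], size=500).T   # 3 x M data

Xc = X - X.mean(axis=1, keepdims=True)
vals, vecs = np.linalg.eigh(np.cov(Xc))
P = vecs[:, np.argsort(vals)[::-1]]       # eigenvectors sorted by eigen value

K = 2
P_K = P[:, :K]                            # N x K: keep only the top-K directions
Y = P_K.T @ Xc                            # (K x 1) = (K x N)(N x 1), for every point
print(Y.shape)                            # (2, 500): dimension reduced from 3 to 2
```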


• The Eigen vectors of the Covariance matrix are actually the directions
of the axes where there is the most variance(most information) and
that we call Principal Components.

• Eigenvalues are simply the coefficients attached to Eigen vectors,


which give the amount of variance carried in each Principal
Component.

• By ranking your eigenvectors in order of their eigenvalues, highest to


lowest, we get the principal components in order of significance.

• After having the principal components, in order to compute the


percentage of variance (information) accounted for by each
component, we divide the eigenvalue of each component by the sum
of Eigen values.
• % of variance explained by component i = λi / (λ1 + λ2 + … + λN) x 100
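For example (the eigen values here are made up):

```python
import numpy as np

eigvals = np.array([4.0, 1.5, 0.4, 0.1])    # hypothetical sorted eigen values
explained = 100 * eigvals / eigvals.sum()
print(np.round(explained, 1))               # [66.7 25.   6.7  1.7]
```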
• We are looking for a Transform that :
– Represents data along each of the principal
components
– The transformed data should be completely de-
correlated
• i.e Cov(Y) = Diagonal matrix

• How do we compute that Transform ???


Say the transformed data Y is given as Y = P^T X.

We want Cov(Y) to be diagonal:

  Cov(Y) = E[Y Y^T]
         = E[(P^T X)(P^T X)^T]
         = E[P^T X X^T P]
         = P^T E[X X^T] P
         = P^T Cov(X) P

Why? Because only then is the transformed data Y completely decorrelated!
• Hence columns of P should diagonalize Cov(X) matrix

• But is Cov(X) diagonalizable?

– Yes. Cov(X) is symmetric, and all real symmetric matrices are (orthogonally) diagonalizable.
– Covariance matrices are positive semi-definite, so all eigen values are non-negative.

• P should be formed by linearly independent eigen vectors of


Cov(X)

• The eigen vectors of Cov(X) (the columns of P) corresponding to distinct eigen values are guaranteed to be orthogonal.
Clustering a Mixture of Spherical
Gaussians
Clustering a Mixture of Spherical Gaussians:

• Clustering is the task of partitioning a set of points into k subsets


or clusters where each cluster consists of nearby points.

• How to solve a particular clustering problem using SVD?


• A mixture is a probability density or distribution that is the weighted sum of simple component probability densities.
• It is of the form f(x) = w1 p1(x) + w2 p2(x) + … + wk pk(x), where the weights wi are non-negative and sum to 1.
One approach to the model-fitting problem is to break it into two sub-
problems:

1. Cluster the set of samples into k clusters.

2. Then fit a single Gaussian distribution to each cluster of sample points.

• The second problem is relatively easier and indeed we saw the


solution in Module-1, where we showed that taking the empirical
mean (the mean of the sample) and the empirical standard deviation
gives us the best-fit Gaussian.

• The first problem (finding the k-clusters from a mixture of Gaussians)


is harder and this is what we discuss here.

• Using SVD we can find the subspace that contains the centers.
• We discussed in Chapter 2 the distance between two sample points from the same Gaussian, as well as the distance between two sample points from two different Gaussians.
• Recall from Module-1 that if
• It can be shown that the top k singular vectors produced by the SVD span the space of the k centers. For this,

• First, we extend the notion of best-fit to probability distributions.

• Then we show that for a single spherical Gaussian whose center is not the origin, the best-fit 1-dimensional subspace is the line through the center of the Gaussian and the origin.

• Next, we show that the best fit k-dimensional subspace for a single
Gaussian whose center is not the origin is any k-dimensional subspace
containing the line through the Gaussian's center and the origin.

• Finally, for k spherical Gaussians, the best-fit k-dimensional subspace is


the subspace containing their centers.

• Thus, the SVD finds the subspace that contains the centers.
Best-fit k-dimensional subspace
• Recall that for a set of points, the best-fit line is the line passing
through the origin that maximizes the sum of squared lengths of the
projections of the points onto the line.

• We extend this definition to probability densities instead of a set of


points.
• For an infinite set of points drawn according to the mixture, the
k-dimensional SVD subspace gives exactly the space of the
centers.

• In reality, we have only a large number of samples drawn


according to the mixture.

• However, it is intuitively clear that as the number of samples


increases, the set of sample points will approximate the
probability density and so the SVD subspace of the sample will
be close to the space spanned by the centers.
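A small simulation illustrating this idea (a sketch with made-up parameters): sample from a mixture of two spherical Gaussians in high dimension, project the samples onto the top-k SVD subspace, and observe that the two components separate cleanly there.

```python
import numpy as np

rng = np.random.default_rng(6)
d, n, sigma = 50, 400, 1.0

# Two spherical Gaussians in R^d with well-separated centers
c1 = np.zeros(d)
c2 = np.zeros(d); c2[0] = 8.0
X = np.vstack([rng.normal(c1, sigma, size=(n // 2, d)),
               rng.normal(c2, sigma, size=(n // 2, d))])
labels = np.array([0] * (n // 2) + [1] * (n // 2))

# Project the samples onto the top-k right singular vectors of the data matrix
k = 2
_, _, Vt = np.linalg.svd(X, full_matrices=False)
proj = X @ Vt[:k].T

# In the projected space a trivial threshold on the first coordinate already
# separates the two Gaussians almost perfectly.
pred = (proj[:, 0] > proj[:, 0].mean()).astype(int)
accuracy = max((pred == labels).mean(), ((1 - pred) == labels).mean())
print(accuracy)    # close to 1.0
```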
