
Singular Value Decomposition (SVD)

Prerequisites
Orthogonal and Orthonormal vectors
LINEAR INDEPENDENCE OF VECTORS

• Definition: An indexed set of vectors {v1, …, vp} in R^n is said to be linearly independent if the vector equation

  c1 v1 + c2 v2 + … + cp vp = 0

  has only the trivial solution c1 = c2 = … = cp = 0.

• The set {v1, …, vp} is said to be linearly dependent if there exist weights c1, …, cp, not all zero, such that

  c1 v1 + c2 v2 + … + cp vp = 0
LINEAR INDEPENDENCE

Determine if the set {v1, v2, v3} is linearly independent.

• The homogeneous system has a non-trivial solution ⇔ det[v1 v2 v3] = 0

• Here the determinant is zero, hence the vectors are linearly dependent.

Exercise-2: Determine whether the set of vectors {(1,2,2), (2,1,2), (2,2,1)} is linearly independent.
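A minimal numerical check for this exercise (a sketch using NumPy, not part of the original slides):

```python
import numpy as np

# The three vectors as rows of a matrix
A = np.array([[1, 2, 2],
              [2, 1, 2],
              [2, 2, 1]], dtype=float)

# The vectors are linearly independent iff the matrix has full rank,
# or equivalently (square matrix) iff its determinant is non-zero.
print(np.linalg.matrix_rank(A))   # 3
print(np.linalg.det(A))           # ~5 (non-zero), so the set is linearly independent
```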
Subspace, Basis

• Given a basis of a subspace, any vector in that subspace will be a linear


combination of the basis vectors.

The smallest subspace containing a finite set of vectors of a vector space is called the linear span of the set; the set (for example, a set of basis vectors) is then said to span the subspace.
Eigen values and Eigen vectors
Eigen Value and Eigen Vector

• Let us take the example of a house price prediction model.

• Here the dependent variable is the house price and there are
a large number of independent variables (features) on which
the house price depends.

• Training with a large number of independent variables makes machine learning models computationally intensive and complex.
Eigen Value and Eigen Vector

• Is it possible to extract a smaller set of variables (features) to train the models and make predictions, while ensuring that most of the information contained in the original variables is retained?

• This is where eigenvalues and eigenvectors come into the picture.
What do matrices do to vectors?

Recall:

[Figure: a 2 x 2 matrix applied to a vector produces the new vector (3,5); the vectors (0,2), (2,1) and (3,1) are shown.]

• The new vector is:
1) rotated
2) scaled
Are there any special vectors that only get scaled?

Try (1,1):   M (1,1) = (3,3) = 3 (1,1)

• For this special vector, multiplying by M is like multiplying by a scalar.
• The special vector (1,1) is called an Eigen vector of M.
• 3 (the scaling factor) is called the Eigen value associated with this eigenvector.
Definition
An Eigen vector of a matrix is a vector which, when multiplied by the matrix (a linear transformation), results in another vector in the same (or exactly reversed) direction, scaled by a scalar multiple; that scalar multiple is termed the Eigen value.
Eigen Vector: a special vector x that points in a direction in which it is only stretched (or shrunk) by the transformation, i.e. Ax has the same (or exactly reversed) direction as x.

Eigen Value: λ is the factor by which the Eigen vector is stretched, shrunk, reversed, or left unchanged.

In the context of data (e.g. a covariance matrix), an Eigen value tells us how much variance the data has along a particular direction, whereas the corresponding Eigen vector tells us what that direction is.
Mathematical Definition

Let A be an n x n matrix, x a non-zero n x 1 column vector, and λ a scalar.

If Ax = λx, then x is an Eigen vector of A and λ is the corresponding Eigen value of A.
Eigen vectors obey this equation:

  Ax = λx
  or, Ax − λx = 0
  or, (A − λI)x = 0        … Eq. 1

Eq. 1 is a homogeneous system of equations in matrix form.

For a non-zero solution of this homogeneous system we require det(A − λI) = 0.
Characteristic equation

• Let A = [a_ij] be an n x n matrix. Then det(A − λI) is said to be the characteristic polynomial of A.

• The equation det(A − λI) = 0 is called the characteristic equation of A; its roots are the Eigen values of A.
Finding Eigen value, Eigen vector

Example-1:

Find the Eigen values and Eigen vectors of the given matrix.
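The example matrix itself is not reproduced on this slide, so the sketch below uses an assumed 2 x 2 matrix purely for illustration; it shows how the characteristic polynomial and Ax = λx can be checked with NumPy.

```python
import numpy as np

# Assumed example matrix (not the one from the slide)
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# Coefficients of the characteristic polynomial det(λI − A):
# [1, -4, 3]  ->  λ^2 − 4λ + 3 = 0  ->  λ = 3, 1
print(np.poly(A))

# Eigen values and Eigen vectors computed directly
vals, vecs = np.linalg.eig(A)
print(vals)                               # [3. 1.] (order not guaranteed)
for lam, v in zip(vals, vecs.T):
    assert np.allclose(A @ v, lam * v)    # verify A v = λ v for each pair
```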


Are Eigen vectors orthogonal?

In general, the Eigen vectors of an arbitrary matrix are not orthogonal.

However, they are orthogonal for particular types of matrices, such as symmetric matrices:
the Eigen vectors corresponding to two distinct Eigen values of a real symmetric matrix are orthogonal.
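A quick numerical illustration of this fact (the symmetric matrix below is an arbitrary example):

```python
import numpy as np

# An arbitrary real symmetric matrix
S = np.array([[4.0, 1.0, 2.0],
              [1.0, 3.0, 0.0],
              [2.0, 0.0, 5.0]])

vals, vecs = np.linalg.eigh(S)   # eigh is specialised for symmetric matrices

# The Eigen vectors are mutually orthogonal (here orthonormal), so the
# matrix of pairwise dot products is (numerically) the identity.
print(np.round(vecs.T @ vecs, 10))
```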
Issues with Eigen vector!

1. Ax = λx requires A to be square.
2. Eigen vectors are generally not orthogonal.
3. There are not always enough Eigen vectors to
construct P for diagonalization

How do we diagonalize an m x n (rectangular) matrix?

Instead of Eigen vectors, consider singular vectors.
Singular Value Decomposition
(SVD)
Singular Value Decomposition (SVD)

• The singular value decomposition is a factorization of an m x n matrix A into the product of three matrices:

  A = U Σ V^T,   where U is m x m, Σ is m x n, and V is n x n.

• The diagonal values in the Sigma matrix are known as the singular
values of the original matrix A.
• The columns of the U matrix are called the left-singular vectors of A.
• The columns of V are called the right-singular vectors of A.
• The left singular vectors are mutually orthogonal, and so are the right singular vectors.
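A minimal sketch of computing an SVD with NumPy (the 2 x 3 matrix is an arbitrary example):

```python
import numpy as np

A = np.array([[3.0, 1.0, 1.0],
              [-1.0, 3.0, 1.0]])            # arbitrary 2 x 3 example

# full_matrices=True returns U (m x m) and V^T (n x n); s holds the singular values
U, s, Vt = np.linalg.svd(A, full_matrices=True)

Sigma = np.zeros(A.shape)
np.fill_diagonal(Sigma, s)                  # singular values on the diagonal of Σ

assert np.allclose(A, U @ Sigma @ Vt)       # A = U Σ V^T
assert np.allclose(U.T @ U, np.eye(2))      # columns of U are orthonormal
assert np.allclose(Vt @ Vt.T, np.eye(3))    # columns of V are orthonormal
```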
Singular Value Decomposition (SVD)

For an m x n (row x col) matrix A of rank r there exists a factorization (SVD) as follows:

  A = U Σ V^T,   where U is m x m, Σ is m x n, and V is n x n.

The columns of U are orthogonal eigenvectors of A A^T.

The columns of V are orthogonal eigenvectors of A^T A.

The eigenvalues λ1, …, λr of A A^T are also the eigenvalues of A^T A, and

  σi = √λi,    Σ = diag(σ1, …, σr)    (the singular values).
Singular values and Singular vectors

• The singular values of A are the square roots of the Eigen values of A^T A (equivalently, of A A^T), denoted σi = √λi.

• The Eigen vectors of A^T A and A A^T are called the (right and left) singular vectors of A.

• In the SVD the columns of U and V are orthonormal and the matrix Σ is diagonal with positive real entries:

  Σ = diag(σ1, …, σr)
Properties of SVD

1. If A is an m x n matrix, then A^T A is an n x n symmetric matrix and is orthogonally diagonalizable.

2. Let v1, …, vn be the orthonormal eigenvectors of A^T A corresponding to the eigenvalues λ1 ≥ λ2 ≥ … ≥ λn. Then ||A vi|| = √λi = σi, i.e. the singular values of A are the lengths of the vectors A vi for i = 1, 2, …, n.
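A short numerical check of Property 2 (the matrix is an arbitrary example):

```python
import numpy as np

A = np.array([[3.0, 1.0, 1.0],
              [-1.0, 3.0, 1.0]])

# Eigen-decomposition of A^T A (symmetric, so eigh), sorted in decreasing order
lam, V = np.linalg.eigh(A.T @ A)
lam, V = lam[::-1], V[:, ::-1]

s = np.linalg.svd(A, compute_uv=False)            # singular values of A

# σ_i = sqrt(λ_i) = ||A v_i|| for the non-zero singular values
print(np.sqrt(np.clip(lam, 0, None))[:len(s)])    # matches s
print(np.linalg.norm(A @ V, axis=0)[:len(s)])     # matches s as well
```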
Singular value decomposition

• Exercise
1. Find the SVD of
2. Find the SVD of
3. Find the SVD of
Low rank approximation of a matrix

• Often a data matrix A is close to a low rank matrix and SVD is useful
to find a good low rank approximation to A.

• For any k (up to the rank of A), the SVD of A gives the best rank-k approximation to A.

***Rank of a matrix: the maximum number of linearly independent columns (or rows) of the matrix.
For a square matrix, all columns (rows) are linearly independent only if the matrix is nonsingular (det A ≠ 0).
Low rank approximation of a matrix

• In mathematics, low-rank approximation is a minimization problem, in which


the cost function measures the fit between a given matrix (the data) and an
approximating matrix, subject to a constraint that the approximating matrix
has reduced rank.

• Low rank approximations have compact representations: you don't need to store all m*n numbers of the matrix (a rank-k approximation needs only about k(m + n) numbers).

• Low-rank approximation is thus a way to recover the "original" (the "ideal" low-rank matrix before it was corrupted by noise, etc.), i.e. to find a low-rank matrix that fits the given matrix as closely as possible.

• The problem is used for mathematical modeling and data compression


Low rank approximation from SVD

• If we want to best approximate a matrix A by a rank-k matrix, how


should we do it?

• Assume we had a representation of the data matrix A as a sum of


several ingredients, with these ingredients ordered by “importance,"
then we could just keep the k “most important" ones. The SVD gives us
exactly such a representation!

• Do Eigen values / Eigen vectors perform a similar job?

• PCA!!
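A minimal sketch of a rank-k approximation via truncated SVD (NumPy; the data matrix and k are illustrative):

```python
import numpy as np

def rank_k_approximation(A, k):
    """Best rank-k approximation of A, obtained by truncating its SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Illustrative data: a random rank-2 matrix plus a little noise
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 2)) @ rng.normal(size=(2, 30)) + 0.01 * rng.normal(size=(50, 30))

A2 = rank_k_approximation(A, k=2)
print(np.linalg.norm(A - A2))   # small: A is close to a rank-2 matrix
```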
Low rank approximation from SVD
Best-fitting k-dimensional subspace

• Since the SVD gives low rank approximations of a matrix (say A), it can be used to find the best-fitting k-dimensional subspace, for k = 1, 2, 3, …, for a set of n data points.

• Here, “best" means minimizing the sum of the squares of the


perpendicular distances of the points to the subspace, or
equivalently, maximizing the sum of squares of the lengths of the
projections of the points onto this subspace.

• It can be shown that the best-fitting k-dimensional subspace can be


found by k applications of the best fitting line algorithm, where on the
i’th iteration we find the best fit line perpendicular to the previous i-1
lines.
Best Rank-k Approximations

• Thus we have two interpretations of the best-fit subspace.

1. To minimize the sum of squared distances of the data points to it.

2. To maximize the sum of squared projections of the data points onto it.

This says that the subspace contains the maximum content of the data among all subspaces of the same dimension.

It is seen that minimizing the sum of squared distances is equivalent to


maximizing the sum of squared projections.
Finding the best-fit subspace

**Argmax is an operation that finds the argument (index) that gives the maximum value from a target
function. Argmax is most commonly used in machine learning for finding the class with the largest
predicted probability.
Finding the best-fit subspace: Greedy algorithm

The following theorem establishes that the greedy algorithm finds the
best subspaces of every dimension.
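A sketch of this greedy procedure in NumPy (illustrative only; in practice the top-k right singular vectors from np.linalg.svd give the same subspace directly):

```python
import numpy as np

def greedy_best_fit_subspace(A, k, n_iter=500):
    """Greedy best-fit subspace: find k orthonormal directions one at a time,
    each maximizing the sum of squared projections of the rows of A while
    staying perpendicular to the directions already found (power iteration)."""
    n = A.shape[1]
    basis = []
    rng = np.random.default_rng(0)
    for _ in range(k):
        v = rng.normal(size=n)
        for _ in range(n_iter):
            for b in basis:                  # stay perpendicular to earlier lines
                v -= (v @ b) * b
            v = A.T @ (A @ v)                # one power-iteration step on A^T A
            v /= np.linalg.norm(v)
        for b in basis:                      # final re-orthogonalization
            v -= (v @ b) * b
        basis.append(v / np.linalg.norm(v))
    return np.array(basis)                   # k x n, rows are v1, ..., vk

# The subspace found greedily matches the span of the top-k right singular vectors
A = np.random.default_rng(1).normal(size=(40, 5))
V_greedy = greedy_best_fit_subspace(A, k=2)
_, _, Vt = np.linalg.svd(A)
print(np.allclose(V_greedy.T @ V_greedy, Vt[:2].T @ Vt[:2], atol=1e-6))  # True
```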

The Frobenius Norm (Euclidean or L2 norm) of a matrix is defined as the square root of the sum of the squares of all the elements of the matrix.
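For example, the Frobenius-norm error of a rank-k truncation can be computed directly; by the Eckart–Young theorem it equals the root-sum-of-squares of the discarded singular values (the matrix below is an arbitrary example):

```python
import numpy as np

A = np.random.default_rng(2).normal(size=(20, 10))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 3
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]      # rank-k truncation

err = np.linalg.norm(A - A_k, ord='fro')
print(err, np.sqrt(np.sum(s[k:] ** 2)))          # the two values agree
```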
Best Rank-k Approximations
Condition for finding the best-fitting subspace

• Centering data: Center the data by subtracting the centroid


of the data from each data point.
• To find the best-fitting (affine) subspace, we first center the data and then find the best-fitting subspace through the origin.

If one wants statistical information relative to the mean of the data, one needs to
center the data. If one wants the best low rank approximation, one would not
center the data.
• It can be shown that the line minimizing the sum of squared
distances to a set of points, if not restricted to go through the
origin, must pass through the centroid of the points.

• This implies that if the centroid is subtracted from each data


point, such a line will pass through the origin.
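The centering step itself is a one-liner (a sketch with made-up data):

```python
import numpy as np

X = np.random.default_rng(3).normal(loc=5.0, size=(100, 4))   # rows = data points
X_centered = X - X.mean(axis=0)              # subtract the centroid from every point
print(np.round(X_centered.mean(axis=0), 12)) # ~ 0 in every coordinate
```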
Principal Component Analysis (PCA)
PCA

Principal component analysis (PCA):

• The traditional use of SVD is in Principal Component Analysis (PCA).

• PCA is illustrated by a movie recommendation setting where there are


‘n’ customers and ‘d’ movies.

• Let A be an n x d matrix whose entry a_ij represents the amount that customer i likes movie j.

• One hypothesizes that there are only k underlying basic factors that determine how much a given customer will like a given movie, where k is much smaller than n or d.
PCA
• Principal components analysis (PCA) is a technique that can be
used to simplify a dataset.

• It is a linear transformation that chooses a new coordinate system for the data set such that the greatest variance under any projection of the data comes to lie along the first axis (then called the first principal component),

• the second greatest variance along the second axis, and so on.

• PCA can be used for reducing dimensionality by eliminating the


later principal components.
PRINCIPAL COMPONENT?

Are they correlated?
PCA

• We define new dimensions (variables) which:
– are linear combinations of the original ones
– are uncorrelated with one another (orthogonal in the original dimension space)
– capture as much of the original variance in the data as possible

• These are called Principal Components.

PCA

• Given a set of points, how do we know if they


can be compressed like in the previous toy
example?

• The answer is to look into the correlation


between the points

• The tool for doing this is called PCA


PCA
• Principle
– Linear projection method to reduce the number of parameters
– Transform a set of correlated variables into a new set of uncorrelated variables
– Map the data into a space of lower dimensionality
– Form of unsupervised learning

• Properties
– It can be viewed as a rotation of the existing axes to new
positions in the space defined by original variables
– New axes are orthogonal and represent the directions with
maximum variability
PCA
• PCA is performed by finding the eigenvalues and eigenvectors
of the covariance matrix.

• We find that the eigenvectors with the largest eigenvalues correspond to the directions in which the data has the strongest variation.
These are the principal components.

• PCA is a useful statistical technique that has found application in:


– fields such as face recognition and image compression
– finding patterns in data of high dimension.
What are the new axes?
• Orthogonal directions of greatest variance in data
• Projections along PC1 discriminate the data most along any one axis

[Figure: data plotted against Original Variable A and Original Variable B, with the new orthogonal axes PC 1 and PC 2 aligned with the directions of greatest variance.]

• The first principal component is the direction of greatest variability (variance) in the data
• Second is the next orthogonal (uncorrelated) direction of greatest variability
• And so on …
PCA
• What are principal components?

– 1st Principal component : The Most-important direction -


Direction of maximum variance in the input space
– 2nd Principal component : 2nd Most-important direction -
Direction of second-largest variance in the input space
– 3rd Principal component : …………
– 4th Principal component : ………….

• How many principal components are possible ?


– As many as the dimensions of the input space
PCA
• Are all principal components equally important ??

• No. Only those principal components that contribute a


significant fraction of total energy are considered important.
Energy along a direction is proportional to Variance along that
direction.

• The less important principal components can be ignored –


leading to reduction in dimensionality

• What does it say about the distribution, if all dimensions are


equally important ??
– Isotropic
How to compute ??
– Given N-dimensional data X (say M points X1, X2, …, XM)

– Find the covariance matrix of X (zero mean):

• Cx = Cov(X) = E[X X^T]

• Now find the eigen values λ of Cx

• Sort the eigen values

• The eigen vector v1 that corresponds to the largest eigen value λ1 is the first principal component

• The eigen vector vj that corresponds to the j'th-largest eigen value λj is the j'th principal component
Steps : PCA
• Arrange data points in matrix X
• Find Cov X
• Find Eigen values and Eigen Vectors of Cov X
• Sort Eigen Values in descending order
• Start building matrix P. First column of P is the eigen vector
that corresponds to largest eigen value.
• Second column of P is the eigen vector that corresponds to
the second-largest eigen value.
• P^T is the transform that completely decorrelates the data X
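A minimal NumPy sketch of these steps (illustrative data; X holds one data point per column, matching the Cov(X) = E[X X^T] convention above):

```python
import numpy as np

rng = np.random.default_rng(4)
# Correlated 2-D data, one point per column (2 x M)
X = rng.multivariate_normal([0, 0], [[3.0, 1.5], [1.5, 1.0]], size=500).T

Xc = X - X.mean(axis=1, keepdims=True)     # center the data
C = (Xc @ Xc.T) / (Xc.shape[1] - 1)        # covariance matrix Cov(X)

vals, vecs = np.linalg.eigh(C)             # eigen-decomposition (C is symmetric)
order = np.argsort(vals)[::-1]             # sort eigen values in descending order
vals, P = vals[order], vecs[:, order]      # columns of P = principal directions

Y = P.T @ Xc                               # transformed (decorrelated) data
print(np.round(np.cov(Y), 3))              # ~ diagonal: Cov(Y) = diag(eigen values)
```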
Dimension reduction
• For dimensionality reduction:
– Choose only significant eigen values. Use only
those corresponding eigenvectors to build matrix
P

– This will lead to dimension reduction in the


transformed data

y = P^T x,   with dimensions (K x 1) = (K x N)(N x 1)
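Continuing in the same spirit, keeping only the top K eigenvectors gives the reduced representation (a self-contained sketch; the data and K are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.multivariate_normal([0, 0, 0],
                            [[3.0, 1.5, 0.2],
                             [1.5, 1.0, 0.1],
                             [0.2, 0.1, 0.5]], size=500).T   # 3 x M data

Xc = X - X.mean(axis=1, keepdims=True)
vals, vecs = np.linalg.eigh(np.cov(Xc))
P = vecs[:, np.argsort(vals)[::-1]]       # eigenvectors sorted by eigen value

K = 2
P_K = P[:, :K]                            # N x K: keep only the top-K directions
Y = P_K.T @ Xc                            # (K x 1) = (K x N)(N x 1), for every point
print(Y.shape)                            # (2, 500): dimension reduced from 3 to 2
```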


• The Eigen vectors of the Covariance matrix are actually the directions
of the axes where there is the most variance(most information) and
that we call Principal Components.

• Eigenvalues are simply the coefficients attached to Eigen vectors,


which give the amount of variance carried in each Principal
Component.

• By ranking your eigenvectors in order of their eigenvalues, highest to


lowest, we get the principal components in order of significance.

• After having the principal components, in order to compute the


percentage of variance (information) accounted for by each
component, we divide the eigenvalue of each component by the sum
of Eigen values.
• % of variance explained by component i = λi / (λ1 + λ2 + … + λN) x 100
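For example (the eigen values here are made up):

```python
import numpy as np

eigvals = np.array([4.0, 1.5, 0.4, 0.1])    # hypothetical sorted eigen values
explained = 100 * eigvals / eigvals.sum()
print(np.round(explained, 1))               # [66.7 25.   6.7  1.7]
```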
• We are looking for a Transform that :
– Represents data along each of the principal
components
– The transformed data should be completely de-
correlated
• i.e Cov(Y) = Diagonal matrix

• How do we compute that Transform ???


Say the transformed data Y is given as Y = P^T X.

We want Cov(Y) to be diagonal:

  Cov(Y) = E[Y Y^T]
         = E[(P^T X)(P^T X)^T]
         = E[P^T X X^T P]
         = P^T E[X X^T] P
         = P^T Cov(X) P

Why? Because only then is the transformed data Y completely decorrelated!
• Hence columns of P should diagonalize Cov(X) matrix

• But is Cov(X) diagonalizable?

– Yes. Cov(X) is symmetric, and all real symmetric matrices are (orthogonally) diagonalizable.
– Covariance matrices are positive semi-definite, so all eigen values are non-negative.

• P should be formed by linearly independent eigen vectors of


Cov(X)

• The eigen vectors of Cov(X) (the columns of P) corresponding to distinct eigen values are guaranteed to be orthogonal.
Clustering a Mixture of Spherical
Gaussians
Clustering a Mixture of Spherical Gaussians:

• Clustering is the task of partitioning a set of points into k subsets


or clusters where each cluster consists of nearby points.

• How to solve a particular clustering problem using SVD?


• A mixture is a probability density or distribution that is the weighted sum of simple component probability densities.
• It is of the form f(x) = w1 p1(x) + w2 p2(x) + … + wk pk(x), where the weights wi are non-negative and sum to 1.
One approach to the model-fitting problem is to break it into two sub-
problems:

1. Cluster the set of samples into k clusters.

2. Then fit a single Gaussian distribution to each cluster of sample points.

• The second problem is relatively easier and indeed we saw the


solution in Module-1, where we showed that taking the empirical
mean (the mean of the sample) and the empirical standard deviation
gives us the best-fit Gaussian.

• The first problem (finding the k-clusters from a mixture of Gaussians)


is harder and this is what we discuss here.

• Using SVD we can find the subspace that contains the centers.
• We discussed in Chapter 2 the distance between two sample points from the same Gaussian, as well as the distance between two sample points from two different Gaussians.
• Recall from Module-1 that if
• It can be shown that the top k singular vectors produced by the SVD span the space of the k centers. For this,

• First, we extend the notion of best-fit to probability distributions.

• Then we show that for a single spherical Gaussian whose center is not the origin, the best-fit 1-dimensional subspace is the line through the center of the Gaussian and the origin.

• Next, we show that the best fit k-dimensional subspace for a single
Gaussian whose center is not the origin is any k-dimensional subspace
containing the line through the Gaussian's center and the origin.

• Finally, for k spherical Gaussians, the best-fit k-dimensional subspace is


the subspace containing their centers.

• Thus, the SVD finds the subspace that contains the centers.
Best-fit k-dimensional subspace
• Recall that for a set of points, the best-fit line is the line passing
through the origin that maximizes the sum of squared lengths of the
projections of the points onto the line.

• We extend this definition to probability densities instead of a set of


points.
• For an infinite set of points drawn according to the mixture, the
k-dimensional SVD subspace gives exactly the space of the
centers.

• In reality, we have only a large number of samples drawn


according to the mixture.

• However, it is intuitively clear that as the number of samples


increases, the set of sample points will approximate the
probability density and so the SVD subspace of the sample will
be close to the space spanned by the centers.
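A small simulation illustrating this idea (a sketch with made-up parameters): sample from a mixture of two spherical Gaussians in high dimension, project the samples onto the top-k SVD subspace, and observe that the two components separate cleanly there.

```python
import numpy as np

rng = np.random.default_rng(6)
d, n, sigma = 50, 400, 1.0

# Two spherical Gaussians in R^d with well-separated centers
c1 = np.zeros(d)
c2 = np.zeros(d); c2[0] = 8.0
X = np.vstack([rng.normal(c1, sigma, size=(n // 2, d)),
               rng.normal(c2, sigma, size=(n // 2, d))])
labels = np.array([0] * (n // 2) + [1] * (n // 2))

# Project the samples onto the top-k right singular vectors of the data matrix
k = 2
_, _, Vt = np.linalg.svd(X, full_matrices=False)
proj = X @ Vt[:k].T

# In the projected space a trivial threshold on the first coordinate already
# separates the two Gaussians almost perfectly.
pred = (proj[:, 0] > proj[:, 0].mean()).astype(int)
accuracy = max((pred == labels).mean(), ((1 - pred) == labels).mean())
print(accuracy)    # close to 1.0
```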
