2012 Fall

Data Compression (ECE 5546-41)
Introduction to Compressive Sensing

Byeungwoo Jeon
Digital Media Lab, SKKU, Korea
http://media.skku.ac.kr; bjeon@skku.edu

Course Introduction
n Data compression (before)
n Main text: Introduction to Data Compression (3rd Ed) (K. Sayood)
n Main Topics
n Mathematical Preliminaries for Lossless Compression
n Huffman Coding
n Arithmetic Coding
n Dictionary Techniques
n Context-Based Compression
n Lossless Image Compression

n Data compression (this semester): Compressed Sensing
n Texts:
n R. Baraniuk, M. Davenport, M. Duarte, C. Hegde, An Introduction to Compressive Sensing, Connexions Web site, http://cnx.org/content/col11133/1.5/, Apr 2, 2011.
n Compressed Sensing: Theory and Applications, edited by Y. C. Eldar and G. Kutyniok
n Lecture Note, Introduction to Compressed Sensing, Spring 2011, by Prof. Heung-No Lee (http://infonet.gist.ac.kr)
n Selected papers

Major Subjects to Cover
n Basic framework of the course

How to Study
n Lecture (2 hours)
n Paper Investigation (1 hour): student presentation
n Each student should study thoroughly and present at least one paper.
n It should be completely understood by the presenter before presentation.
n A list of papers will be provided by the instructor. However, a preferred paper can be suggested by a student.

n Grading Policy
n Attendance 10%
n Project/Presentation 20%
n Homework 10%
n Exam (Mid 30 + Final 30) 60%

Very Brief Introduction to CS
(modified from a file by Igor Carron (version 2, draft) at
https://docs.google.com/viewer?a=v&pid=sites&srcid=ZGVmYXVsdGRvbWFpbnxpZ29yY2Fycm9uMnxneDoxYmNkZjU5MWQ2NmJkOGUy)

Solving linear equations
n Solving linear equations (Y: measured; X: unknown; A: from the signal model)
Y = AX
n Solving for X is easy as long as the matrix A is invertible.
n A nonlinear system can be approximated by a linear system of equations.
n A continuous system can be discretized to a linear system of equations.

Solving Linear Equations
n Over-determined: too many equations and too few unknowns (tall and thin matrix A)
n Solution: least squares, remove equations
n There are as many equations as there are unknowns (square matrix A)
n Easy case.
n Underdetermined: too few equations and too many unknowns (fat and short matrix A)
n This is what Compressive Sensing deals with.
Y = AX

Underdetermined Case
n Too few equations and too many unknowns means "infinite number of solutions."
n Matrix A cannot be inverted as in the square case.
n How do we choose a solution among many?
n Maybe we should look for a specific feature of x; what should that feature be?
n Cf: ill-posed inverse problem
n Regularization technique
n Condition: sparseness (compressed sensing can help); see the sketch below.
n "Compressed Sensing reconstruction techniques" allow one to find a solution that is sparse, i.e., that has the property of having very few non-zero elements (the rest of the elements are zeros).
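A minimal numerical sketch (not from the lecture) of the underdetermined case: the matrix A and the planted 2-sparse vector below are made-up illustration values. It shows that Y = AX has many solutions and that the standard minimum-L2-norm solution is generally not the sparse one.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 3, 8
A = rng.standard_normal((M, N))          # fat and short matrix (M < N)

x_sparse = np.zeros(N)
x_sparse[[1, 5]] = [2.0, -1.0]           # a 2-sparse "original" vector
Y = A @ x_sparse                         # the measurements

# The minimum-L2-norm solution (pseudo-inverse) also satisfies Y = A X,
# but it is generally not sparse: all N entries are non-zero.
x_l2 = np.linalg.pinv(A) @ Y
print(np.allclose(A @ x_l2, Y))                 # True: a valid solution
print(np.count_nonzero(np.abs(x_l2) > 1e-9))    # typically N, not 2

# Adding any null-space vector of A gives yet another valid solution, so the
# solution set is infinite; picking the sparse one needs extra structure.
```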

Rf: Regularization (1)
n In mathematics and statistics, particularly in the fields of machine learning and inverse problems, regularization involves introducing additional information in order to solve an ill-posed problem or to prevent overfitting.
n This information is usually of the form of a penalty for complexity, such as restrictions for smoothness or bounds on the vector space norm.
n <Examples of applications of different methods of regularization to the linear model: table omitted>
http://en.wikipedia.org/wiki/Regularization_(mathematics)

Rf: Regularization (2)
n A theoretical justification for regularization is that it attempts to impose Occam's razor on the solution. From a Bayesian point of view, many regularization techniques correspond to imposing certain prior distributions on model parameters.
n The same idea arose in many fields of science. For example, the least-squares method can be viewed as a very simple form of regularization.
n A simple form of regularization applied to integral equations, generally termed Tikhonov regularization after Andrey Nikolayevich Tikhonov, is essentially a trade-off between fitting the data and reducing a norm of the solution (see the sketch below).
n More recently, non-linear regularization methods, including total variation regularization, have become popular.
http://en.wikipedia.org/wiki/Regularization_(mathematics)
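A minimal sketch (not from the lecture) of the Tikhonov trade-off mentioned above, using the standard closed form x_hat = (A^T A + lam I)^{-1} A^T y; the matrix sizes and lam values are arbitrary illustration choices.

```python
import numpy as np

def tikhonov(A, y, lam):
    """Closed-form Tikhonov-regularized least-squares solution."""
    N = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(N), A.T @ y)

rng = np.random.default_rng(1)
A = rng.standard_normal((20, 50))        # ill-posed: more unknowns than equations
y = rng.standard_normal(20)

for lam in (1e-3, 1e-1, 10.0):
    x_hat = tikhonov(A, y, lam)
    # Larger lam shrinks the solution norm but increases the data-fit residual.
    print(lam, np.linalg.norm(x_hat), np.linalg.norm(A @ x_hat - y))
```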

Rf: Regularization (3)
n In statistics and machine learning, regularization is used to prevent overfitting. Typical examples of regularization in statistical machine learning include ridge regression, lasso, and the L2 norm in support vector machines.
n Regularization methods are also used for model selection, where they work by implicitly or explicitly penalizing models based on the number of their parameters. For example, Bayesian learning methods make use of a prior probability that (usually) gives lower probability to more complex models. Well-known model selection techniques include the Akaike information criterion (AIC), minimum description length (MDL), and the Bayesian information criterion (BIC). Alternative methods of controlling overfitting not involving regularization include cross-validation.
http://en.wikipedia.org/wiki/Regularization_(mathematics)

Compressed Sensing
n An instance of an underdetermined system of linear equations is a compressed sensing system.
n Y ~ compressed measurements (few)
n A ~ sensing (in the form of linear combinations)
n X ~ original information (what we would like to find)
Y = AX
n The recovery of a sparse solution to an underdetermined system of linear equations is performed using Compressed Sensing reconstruction techniques/solvers.
n Key Question: "Do all underdetermined systems of linear equations admit a very sparse and unique solution?"
n Answer: Some systems do, under a condition (RIP, NSP, ...).

n Issues to study (in this semester)
n Mathematical background
n Checking the condition
n Recovery algorithms
n Implementing the algorithms
n Applications

Summary
n Compressed sensing is about setting up underdetermined systems of linear equations.
n Compressed sensing reconstruction techniques are about finding the sparsest solution (out of an infinite number of solutions) of that system.
n This new framework is very important in the signal processing community, including data compression.

Some Links
n Some solutions
n Solvers (mostly in Matlab): https://sites.google.com/site/igorcarron2/cs#reconstruction
n Acceptable matrix A: https://sites.google.com/site/igorcarron2/cs#measurement
n Hardware/sensors implementing A: https://sites.google.com/site/igorcarron2/compressedsensinghardware
n http://nuit-blanche.blogspot.com/2011/11/how-to-wow-your-friends-in-high-places.html
2012 Fall

Data Compression
(ECE 5546-41)
Ch2. Sparse and Compressible Signal Models

Byeungwoo Jeon
Digital Media Lab, SKKU, Korea
http://media.skku.ac.kr; bjeon@skku.edu

Digital Media Lab.

What we like to cover in this class

(Algorithms for sparse analysis: Lecture I: Background on sparse approximation, by Anna C. Gilbert, Department of Mathematics, University of Michigan)

Digital Media Lab. 2
Underdetermined linear equations
n Solving linear equations (Y: measured; X: unknown; A: from the signal model)
Y = AX
n Underdetermined case:
n Too few equations and too many unknowns means "infinite number of solutions" (i.e., matrix A cannot be inverted as in the square case).
n CS tries to solve this under the condition of "sparseness."
n "Compressed Sensing reconstruction techniques" allow one to find a solution that is sparse, i.e., that has the property of having very few non-zero elements (the rest of the elements are zeros).
n We need to evaluate the fitness of a solution: we need a measure (~ a norm).
n We need to formally define the signal model, sparseness, compressibility, etc.

⇒ Need many concepts from linear algebra
Digital Media Lab. 3

Compressed Sensing
n An instance of an underdetermined system of linear equations is a compressed sensing system.
n Y ~ compressed measurements (few)
n A ~ sensing (in the form of linear combinations)
n X ~ original information (what we would like to find)
Y = AX
n The recovery of a sparse solution to an underdetermined system of linear equations is performed using Compressed Sensing reconstruction techniques/solvers.
n Key Question: "Do all underdetermined systems of linear equations admit a very sparse and unique solution?"
n Answer: Some systems do, under a condition (RIP, NSP, ...).

n Issues to study (in this semester)
n Mathematical background
n Checking the condition
n Recovery algorithms
n Implementing the algorithms
n Applications

Digital Media Lab. 4


BRIEF REVIEW ON LINEAR ALGEBRA

From Paul's Online Math Notes: Linear Algebra (Math 2318)
(http://tutorial.math.lamar.edu/download.aspx)

Digital Media Lab. 5

2.1 VECTOR SPACE

Digital Media Lab. 6


Vector in 2-D (or 3-D) World
n Vector: a directed line segment (direction & magnitude), drawn from an initial point to a terminal point.

n In 2-space
n In 3-space

n Vector addition and scalar multiplication

n Vector norm: If v is a vector, then the magnitude of the vector is called the norm of the vector and is denoted by ||v||. Furthermore, if v is a vector in 2-space (or in 3-space), then
$\|v\| = \sqrt{v_1^2 + v_2^2}$ (in 2-space); $\quad \|v\| = \sqrt{v_1^2 + v_2^2 + v_3^2}$ (in 3-space)

n Dot product: If u and v are two vectors in 2-space (or 3-space), and the angle between them is θ, then the dot product is defined as
$u \cdot v = \|u\|\,\|v\|\cos\theta$
n It is sometimes called the scalar product or Euclidean inner product.
Digital Media Lab. 7

Extension to N-space (1)
n Definition of n-space: For a given positive integer n, an ordered n-tuple is a sequence of n real numbers denoted by $(a_1, a_2, \ldots, a_n)$. The complete set of all ordered n-tuples is called n-space and is denoted by $R^n$.
n It is a natural extension of 2-space and 3-space.

n Definition of arithmetic operations in n-space:
Digital Media Lab. 8


Extension to N-space (2)
n Definition of Euclidean inner product: For two vectors u, v in $R^n$, $u = (u_1, u_2, \ldots, u_n)$, $v = (v_1, v_2, \ldots, v_n)$, the Euclidean inner product is defined as
$u \cdot v = \langle u, v \rangle = \sum_{i=1}^{n} u_i v_i$
n It is a natural extension of the dot product in 2-space.
n It can be written in matrix form as follows (suppose u and v are column vectors):
$u \cdot v = v^T u$
n Note that when we add addition, scalar multiplication, and the Euclidean inner product to n-space, it is often called Euclidean n-space.
Digital Media Lab. 9

Extension to N-space (3)
n Let's extend the concepts of norm and distance to n-space.
n Definition: For a vector $u = (u_1, u_2, \ldots, u_n) \in R^n$, the Euclidean norm is defined as
$\|u\| = \sqrt{u \cdot u} = \sqrt{\sum_{i=1}^{n} u_i^2}$

n Definition: For two vectors $u, v \in R^n$, the Euclidean distance between the two points indicated by the two vectors is defined as
$d(u, v) = \|u - v\| = \sqrt{\sum_{i=1}^{n} (u_i - v_i)^2}$
Digital Media Lab. 10


Generalization to Vector Space
n Up to now, we have had a good geometric analogy, especially in 2-space (or 3-space), coming from the notion that a vector is interpreted as "a directed line segment."
n A vector, however, is a much more general concept and doesn't necessarily have to represent a directed line segment as before.
n For example, a vector can be a matrix or a function, and those are only a couple of possibilities for vectors.
n Nor does a vector have to be one of the vectors we looked at in $R^n$ (that is, a vector may not be in $R^n$; it is a more general object).
n The concept of n-space is now generalized into a vector space.
n A vector space is nothing more than a collection of vectors (whatever those now are...) that satisfies a set of axioms.
n Once we get the general definition of a vector and a vector space out of the way, we'll look at many of the important ideas that come with vector spaces.

Digital Media Lab. 11

Vector Space (1)
n Definition: Let V be a set on which addition and scalar multiplication are defined (this means that if u and v are objects in V and c is a scalar, then we've defined u + v and cu in some way). If the following axioms are true for all objects u, v, and w in V and all scalars c and k, then V is called a vector space and the objects in V are called vectors.
(a) u + v is in V (closure under addition).
(b) cu is in V (closure under scalar multiplication).
(c) u + v = v + u (commutativity of addition).
(d) u + (v + w) = (u + v) + w (associativity of addition).
(e) There is a special object in V, denoted 0 and called the zero vector, such that for all u in V we have u + 0 = u.
(f) For every u in V there is another object in V, denoted -u and called the negative of u, such that u + (-u) = 0.
(g) c(u + v) = cu + cv (distribution).
(h) (c + k)u = cu + ku.
(i) c(ku) = (ck)u.
(j) 1u = u.

n A vector space is simply a collection of vectors satisfying the axioms above.
Digital Media Lab. 12


Vector Space (2)
n Note
n There is no need to be locked into the "standard" ways of defining addition and scalar multiplication. For the most part we will be doing addition and scalar multiplication in a fairly standard way, but there will be the occasional example where we won't.
n In order for something to be a vector space it simply must have an addition and scalar multiplication that meet the above axioms, and it doesn't matter how strange the addition or scalar multiplication might be.
n When the scalars in the definition are complex numbers, it is called a complex vector space. In the same way, when we restrict the scalars to real numbers we generally call the vector space a real vector space.

n Ex1: If n is any positive integer, then the set $V = R^n$ with the standard addition and scalar multiplication as defined in the Euclidean n-space section is a vector space.
n Ex2: Show that the set $V = R^2$ with the standard scalar multiplication and an addition defined as $(u_1, u_2) + (v_1, v_2) = (u_1 + 2v_1, u_2 + v_2)$ is not a vector space. (See the sketch below.)

Digital Media Lab. 13
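A small sketch (not from the notes) checking Ex2 numerically: with the modified addition, axiom (c) (commutativity of addition) already fails, so V cannot be a vector space. The test vectors are arbitrary.

```python
def weird_add(u, v):
    # the modified addition from Ex2: (u1, u2) + (v1, v2) = (u1 + 2*v1, u2 + v2)
    return (u[0] + 2 * v[0], u[1] + v[1])

u, v = (1.0, 0.0), (3.0, 5.0)
print(weird_add(u, v))   # (7.0, 5.0)
print(weird_add(v, u))   # (5.0, 5.0)  -> u + v != v + u, axiom (c) fails
```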

Rf: Signal and Vector Space
n Many natural and man-made systems can be modeled well as linear.

n We model such linear structure by treating a signal as a vector in a vector space.
n The vector space model can capture the linear structure well.
n This modeling allows us to apply intuitions and tools from the geometry of 3-space, such as length, distance, angles, etc.
n This is useful when the signal lives in high-dimensional or infinite-dimensional spaces.
Digital Media Lab. 14


Inner Product
n Generalization of the concept of the dot product (or inner product) from n-space to a general vector space:

Digital Media Lab. 15

Norm in Vector Space
n A norm is a function that assigns a strictly positive length or size to all vectors in a vector space, other than the zero vector (which has zero length assigned to it).
n A simple example is the 2-dimensional Euclidean space $R^2$ equipped with the Euclidean norm. The Euclidean norm assigns to each vector the length of the vector. Because of this, the Euclidean norm is often known as the magnitude.
n A vector space with a norm is called a normed vector space.

n Definition of Norm: Given a vector space V over a subfield F of the complex numbers, a norm on V is a function p: V → R with the following properties: For all a ∈ F and all u, v ∈ V,
n P1: p(av) = |a| p(v) (positive homogeneity or positive scalability).
n P2: p(u + v) ≤ p(u) + p(v) (triangle inequality).
n P3: If p(v) = 0, then v is the zero vector (separates points).

n A simple consequence of the first two axioms, positive homogeneity and the triangle inequality, is p(0) = 0 and thus p(v) ≥ 0 (positivity).
Digital Media Lab. 16


Norm, Seminorm, Quasinorm
n A seminorm, on the other hand, is allowed to assign zero length to some non-zero vectors (in addition to the zero vector).
n Similarly, a vector space with a seminorm is called a seminormed vector space.

n A quasinorm is similar to a norm in that it satisfies the norm axioms, except that the triangle inequality (P2) is replaced by
$\|u + v\| \le K\left(\|u\| + \|v\|\right)$, for some K > 1

n This is not to be confused with a seminorm or pseudonorm, where the norm axioms are satisfied except for positive definiteness.

n Examples
n All norms are seminorms.
n The trivial seminorm, with p(x) = 0 for all x in V.
n The absolute value is a norm on the real numbers.
n Every linear form f on a vector space defines a seminorm by x → |f(x)|.

http://en.wikipedia.org/wiki/Norm_(mathematics)

Examples of Norm (1)
n Euclidean norm (also called the Euclidean length, L2 distance, ℓ2 distance, L2 norm, or ℓ2 norm)
$\|u\|_2 = \sqrt{\sum_{i=1}^{n} u_i^2} \;\;(u \in R^n); \qquad \|u\|_2 = \sqrt{\sum_{i=1}^{n} |u_i|^2} \;\;(u \in C^n)$

n Taxicab norm or Manhattan norm (also called the 1-norm, L1 norm, or L1 distance): its name relates to the distance a taxi has to drive in a rectangular street grid to get from the origin to the point u.
$\|u\|_1 = \sum_{i=1}^{n} |u_i|$
n The set of vectors whose 1-norm is a given constant forms the surface of a cross-polytope of dimension equivalent to that of the norm minus 1.
n The Taxicab norm is also called the L1 norm. The distance derived from this norm is called the Manhattan distance or L1 distance.
Digital Media Lab. 18


Examples of Norm (2)
n Zero norm: In signal processing and statistics, David Donoho referred to the zero "norm" with quotation marks.
$\|u\|_0 = |\mathrm{supp}(u)|$, where $\mathrm{supp}(u) = \{\, i : u_i \ne 0 \,\}$
n supp(x): support of x (the set of indices indicating the non-zero components of x)

n Following Donoho's notation, the zero "norm" of x is simply the number of non-zero coordinates of x, or the Hamming distance of the vector from zero.
n When this "norm" is localized to a bounded set, it is the limit of p-norms as p approaches 0.
n Of course, the zero "norm" is not a B-norm, because it is not positively homogeneous. It is not even an F-norm, because it is discontinuous, jointly and severally, with respect to the scalar argument in scalar-vector multiplication and with respect to its vector argument.
n Abusing terminology, some engineers omit Donoho's quotation marks and inappropriately call the number-of-nonzeros function the L0 norm (sic), also misusing the notation for the Lebesgue space of measurable functions.
http://en.wikipedia.org/wiki/Norm_(mathematics)

Examples of Norm (3)
n p-norm (for p ≥ 1, a real number)
$\|u\|_p = \left( \sum_{i=1}^{n} |u_i|^p \right)^{1/p}$
n Note that for p = 1 we get the taxicab norm, for p = 2 we get the Euclidean norm, and as p approaches infinity the p-norm approaches the infinity norm or maximum norm.
n This definition is still of some interest for 0 < p < 1, but the resulting function does not define a norm (it is a quasinorm), because it violates the triangle inequality.

n Maximum norm (special case of the infinity norm, uniform norm, or supremum norm)
$\|u\|_\infty = \max\left( |u_1|, \ldots, |u_n| \right)$
n The set of vectors whose infinity norm is a given constant, c, forms the surface of a hypercube with edge length 2c.
Digital Media Lab. 20


Examples of Norm (4)
n Lp norm: For $p \in [1, \infty]$ (a numerical sketch follows below),
$\|u\|_p = \begin{cases} \left( \sum_{i=1}^{n} |u_i|^p \right)^{1/p}, & p \in [1, \infty) \\[4pt] \max_{i=1,\ldots,n} |u_i|, & p = \infty \end{cases}$
Digital Media Lab. 21
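A quick numerical sketch (not from the slides) of the norms above, using numpy.linalg.norm on an arbitrary test vector.

```python
import numpy as np

u = np.array([3.0, -4.0, 0.0, 1.0])

print(np.linalg.norm(u, 1))        # L1 (taxicab) norm: 8.0
print(np.linalg.norm(u, 2))        # L2 (Euclidean) norm: sqrt(26)
print(np.linalg.norm(u, np.inf))   # infinity (maximum) norm: 4.0
print(np.count_nonzero(u))         # zero "norm": number of non-zeros = 3

# For 0 < p < 1 the same formula only defines a quasinorm; e.g. p = 0.5
p = 0.5
print(np.sum(np.abs(u) ** p) ** (1.0 / p))
```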

Properties of Norms
n The concept of unit circle (the set of all vectors of norm 1) is different
in different norms
n For the 1-norm the unit circle in R2 is a square
n For the 2-norm (Euclidean norm) it is the well-known unit circle
n For the infinity norm it is a different square.
n For any p-norm it is a superellipse (with congruent axes).
n Due to the definition of the norm, the unit circle is always convex and
centrally symmetric (therefore, the unit ball may be a rectangle but
cannot be a triangle).

n Illustration of unit circles in different norms

Digital Media Lab. 22


2.2 BASES AND FRAMES

Digital Media Lab. 23

Linear Independence
n Linear Independence

n A finite set of vectors that contains the zero vector will be linearly dependent.

n Suppose that $S = \{v_1, \ldots, v_k\}$ is a set of vectors in $R^n$. If k > n, then the set of vectors is linearly dependent.
Digital Media Lab. 24


Bases
n Basis/Bases

n span(S) is the set of all linear combinations of the given set of vectors in S:
$\sum_{i=1}^{n} c_i v_i$ (where $c_i$: scalar)
n Suppose that the set $S = \{v_1, \ldots, v_n\}$ is a basis for the vector space V; then every vector u from V can be expressed as a linear combination of the vectors from S in exactly one way.
n Suppose that $S = \{v_1, \ldots, v_n\}$ is a set of linearly independent vectors; then S is a basis for the vector space V = span(S).
Digital Media Lab. 25

Orthogonality and Basis (1)


n Two vectors u, v in an inner product space are orthogonal if <u,v>=0

n Orthogonal set and orthonormal set

n Orthonormal basis (ONB)

Digital Media Lab. 26


Orthogonality and Basis (2)
n Any vector in an inner product space, with an orthogonal/orthonormal
basis can be easily represented as a linear combination of basis
vectors for that vector.

Digital Media Lab. 27

Orthogonal Complement (1)


n Definition of Orthogonal complement: Suppose that W is a subspace of an inner product space V. We say that a vector u from V is orthogonal to W if it is orthogonal to every vector in W. The set of all vectors that are orthogonal to W is called the orthogonal complement of W and is denoted by $W^\perp$.
n We say that W and $W^\perp$ are orthogonal complements.

n Theorem

Digital Media Lab. 28


Orthogonal Complement (2)
n Extension of Projection

n Theorem

Digital Media Lab. 29

In Matrix Form
n Given a basis set $\{\phi_i\}_{i=1}^{n}$ in $R^n$, any vector x in $R^n$ is uniquely represented as
$x = \sum_{i=1}^{n} c_i \phi_i$

n Form an n×n matrix $\Phi$ with columns given by the $\phi_i$'s, and let c denote the length-n vector with entries $c_i$; the matrix representation is
$x = \Phi c$

n An orthonormal basis should satisfy $\langle \phi_i, \phi_j \rangle = \delta(i - j)$

n Therefore,
$c_i = \langle x, \phi_i \rangle$

n In matrix form (note that orthonormality means $\Phi^T \Phi = I$; see the sketch below):
$c = \Phi^T x$
Digital Media Lab. 30
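A small sketch (not from the slides): build a random orthonormal basis via QR, compute the coefficients c = Φ^T x, and confirm x = Φc. The dimension is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6
Phi, _ = np.linalg.qr(rng.standard_normal((n, n)))   # columns form an ONB

x = rng.standard_normal(n)
c = Phi.T @ x               # analysis: c_i = <x, phi_i>
x_rec = Phi @ c             # synthesis: x = sum_i c_i phi_i

print(np.allclose(Phi.T @ Phi, np.eye(n)))   # orthonormality check
print(np.allclose(x_rec, x))                 # perfect reconstruction
```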


Dictionary
n A dictionary Φ in $R^n$ is a collection $\{\varphi_i\}_{i=1}^{N} \subset R^n$ of unit-norm vectors: $\|\varphi_i\|_2 = 1$
n Each element is called an atom.

n If $\{\varphi_i\}_{i=1}^{N}$ spans $R^n$, the dictionary is complete.
n If $\{\varphi_i\}_{i=1}^{N}$ are linearly dependent, the dictionary is redundant.

n In the sparse approximation literature, it is also common for a basis or frame to be referred to as a dictionary or over-complete dictionary, respectively, with the dictionary elements being called atoms.
Digital Media Lab. 31

2.3 SPARSE REPRESENTATION

Digital Media Lab. 32


K-Sparse
n Definition of K-sparse: A signal is called K-sparse if it has at most K non-zero components, i.e.,
$\|x\|_0 \le K$
n Note that even if a signal x itself does not look K-sparse, we may still refer to x as being K-sparse, with the understanding that x can be expressed K-sparsely through a linear transformation:
$x = \Psi\alpha$ where $\|\alpha\|_0 \le K$

n Ex: x(t) = cos(ωt)
n Time domain
n Fourier domain
n DCT domain
Digital Media Lab. 33

Ex: Sparse Representation of Images
n Sparse representation of an image via a multiscale wavelet transform
n Note that most of the wavelet coefficients are close to zero.
(a) Original image
(b) Wavelet representation (larger coefficient → lighter pixel)
Fig. 1.3 (Compressed Sensing by Y. Eldar et al.)

Ex: Sparse Approximation of a Natural Image
n Sparse approximation of a natural image
(a) Original image
(b) Approximation obtained by keeping only the 10% largest wavelet coefficients
Fig. 1.4 (Compressed Sensing by Y. Eldar et al.)

Set of K-Sparse Signals
n Set of all K-sparse signals
$\Sigma_K = \{\, x : \|x\|_0 \le K \,\}$
n Q: Is the set $\Sigma_K$ a linear space?
n That is, for any pair of vectors x, z in $\Sigma_K$, does x + z also belong to $\Sigma_K$?

n See Fig. 1.5 (Compressed Sensing by Y. Eldar et al.).
Sparseness of Image
n Most natural images are characterized by large smooth or textured
regions with relatively few sharp edges.
n Signals with this structure are known to be very nearly sparse when
represented using a multiscale wavelet approximation.
n K-term approximation
n Need a measure (i.e., an appropriate norm) of the approximation error.
n This kind of approximation is non-linear (since the choice of which coefficients to keep in the approximation depends on the signal itself).

Digital Media Lab. 37

2.4 COMPRESSIBLE SIGNALS

Digital Media Lab. 38


Compressible vs. Sparse
n Few real-world signals are truly sparse; rather, they are compressible (meaning that they can be well-approximated by sparse signals).
n The following terms mean the same concept: compressible, approximately sparse, relatively sparse.

n Quantify the compressibility by calculating the error incurred by approximating a signal x by some $\hat{x} \in \Sigma_K$:
$\sigma_K(x)_p = \min_{\hat{x} \in \Sigma_K} \|x - \hat{x}\|_p$

n If $x \in \Sigma_K$, then $\sigma_K(x)_p = 0$ for any p.

n Thresholding (keeping only the K largest coefficients) gives the optimal approximation for all p (see the sketch below).

n Choose a basis set such that the coefficients obey a power-law decay.
Digital Media Lab. 39
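A minimal sketch (not from the slides): the best K-term approximation keeps the K largest-magnitude coefficients, and σ_K(x)_2 is the norm of what is thrown away. The test vector is arbitrary.

```python
import numpy as np

def best_k_term(x, K):
    """Keep the K largest-magnitude entries of x, zero out the rest."""
    x_hat = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-K:]     # indices of the K largest entries
    x_hat[idx] = x[idx]
    return x_hat

def sigma_K(x, K, p=2):
    return np.linalg.norm(x - best_k_term(x, K), p)

x = np.array([5.0, -0.1, 0.02, 3.0, 0.4, -0.03])
print(sigma_K(x, K=2))   # error of the best 2-term approximation
print(sigma_K(x, K=6))   # 0.0: x is trivially 6-sparse
```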

Compressibility (1)
n Definition of compressibility: A signal is called compressible if its sorted coefficient magnitudes in $\Psi$ decay rapidly:
$x = \Psi\alpha$ where $|\alpha_1| \ge |\alpha_2| \ge \ldots \ge |\alpha_n|$

n Power-law decay: suppose there exist $C_1$ and $q > 0$ such that
$|\alpha_s| \le C_1 s^{-q}, \quad s = 1, 2, \ldots$
(figure: the sorted magnitudes $|\alpha_s|$ should lie below the curve $C_1 s^{-q}$)

n Larger q means faster magnitude decay, and the more compressible a signal is.
n Under power-law decay, a signal can be approximated quite well with K << n terms.
Digital Media Lab. 40


Compressibility (2)
n Depending on the space (referred to by $\Psi$), the signal can be either compressible or not.
n Therefore a proper choice of the space is important.

n Q: For such a compressible signal (K-approximated), there exist constants $C_2$ and $r > 0$, depending only on $C_1$ and q, such that
$\sigma_K(x)_2 \le C_2 K^{-r}$
Digital Media Lab. 41

K-term Approximation
n Only the K largest coefficients are kept, while the others are set to zero to represent the given signal.
n K-term approximation error:
$\sigma_K(x) = \min_{\alpha \in \Sigma_K} \|x - \Psi\alpha\|_2$
Digital Media Lab. 42


-- end --

Digital Media Lab. 43


2012 Fall

Data Compression
(ECE 5546-41)

Ch 3. Sensing Matrices

Byeungwoo Jeon
Digital Media Lab, SKKU, Korea
http://media.skku.ac.kr; bjeon@skku.edu

Digital Media Lab.

What are we doing?

Digital Media Lab.


Sparse vs. Compressible (1)
n Recall that we call a signal x K-sparse if it has at most K non-zeros.
n A K-sparse signal may not itself be sparse, but it admits a sparse representation in some basis Ψ.
$\Sigma_K = \{\, x : \|x\|_0 \le K \,\} \quad \left( \lim_{p \to 0} \|x\|_p^p = |\mathrm{supp}(x)| \right)$
n But few real-world signals are truly sparse.
n Most signals can be represented as compressible signals.
$x = \Psi s$
(figure: example with K = 4, showing x, Ψ, and the sparse coefficient vector s)
Digital Media Lab.

Sparse vs. Compressible (2)
n A compressible signal means that the vector of coefficients in a certain basis has few large coefficients, with the other coefficients having small values.
n If we set the small coefficients to zero, the remaining large coefficients can represent the original signal with hardly noticeable perceptual loss.
(figure: a sparse x vs. a compressible x)
Digital Media Lab.


Compressed Sensing (1)
n Compressed sensing measurement process: $y = \Phi x$ (see the sketch below)
n y: measurement vector (M×1)
n Φ: measurement (sensing) matrix (M×N)
n x: input signal vector in its original domain (e.g., time or spatial) (N×1)
n (CS is also possible for continuous-time signals)
(figure: y (M×1) = Φ (M×N) · x (N×1))

n Φ represents a dimensionality reduction (it maps $R^N$ into $R^M$, M << N).

n Main idea: Sense the reduced vector y, then recover x from y.

n Note: the measurements should be non-adaptive!
n Rows of Φ are fixed in advance and do not depend on the previously acquired measurements.
Digital Media Lab.
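A minimal sketch (not from the slides) of the measurement step y = Φx: a random M×N sensing matrix applied to a K-sparse signal. The sizes are arbitrary illustration choices.

```python
import numpy as np

rng = np.random.default_rng(3)
N, M, K = 256, 64, 8                       # ambient dim, measurements, sparsity

Phi = rng.standard_normal((M, N)) / np.sqrt(M)   # entries ~ N(0, 1/M)

x = np.zeros(N)
support = rng.choice(N, size=K, replace=False)
x[support] = rng.standard_normal(K)        # K-sparse signal

y = Phi @ x                                # M compressed, non-adaptive measurements
print(y.shape)                             # (64,) -- far fewer than N = 256
```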

Compressed Sensing (2)
n What happens if x does not look sparse? $y = \Phi x = \Phi \Psi s = \Theta s$
n Make x represented in a transform domain.
n y: measurement vector (M×1)
n Φ: measurement (sensing) matrix (M×N)
n x: input signal vector in its original domain (e.g., time or spatial) (N×1)
n Ψ: transform matrix (N×N)
n s: input signal vector in its transform domain (N×1) ← sparse
(figure: y (M×1) = Φ (M×N) · Ψ (N×N) · s (N×1), with x = Ψs)
Digital Media Lab.
Compressed Sensing (3)
n Measurement process with $\Theta = \Phi\Psi$
n There is a small number of columns corresponding to the non-zero coefficients.
n The measurement vector y is a linear combination of these columns.

n Ex: The following is an underdetermined system:
n 4-sparse, N unknowns
n Fewer equations than unknowns (M << N)
(figure: y (M×1) = Θ = ΦΨ (M×N) · s (N×1))
Digital Media Lab.

Questions to Answer (1)


n What we have learned so far:

n Natural signals can be represented as sparse signals in a certain basis (~


K-sparse)

n Sparse signals can be compressively sampled, meaning that the number


M of samples needed for perfect reconstruction is less than the number N
of Shannon-Nyquist samples.

n The reconstruction of the signal is done by L1 minimization, rather than


the usual L2 minimization.

Digital Media Lab.


Questions to Answer (2)
n We further need to learn many issues:
n Q1: How to design an M×N sensing matrix Φ to ensure that it preserves the information in the signal x?
n The sensing matrix should be designed to reduce the number of measurements as much as possible while allowing recovery of a wide class of signals x from their measurements.
n Sensing matrix design problem

n Q2: How small M can we choose given K and N?

n Q3: How sparse the signal has to be at a given M?

n Q4: How to recover the original signal x from measurements y?


n Look for fast and robust algorithms
è Signal recovery problem

n Q5: When will the L1 convex relaxation solution attain the L0 solution?

Digital Media Lab.

Questions to Answer (3)


n After answering the previous questions, we further need to investigate yet other issues:

n How to incorporate measurement noise in the signal model y = Fx ?


n What would happen to the L1 minimization signal recovery?
n Reliability issue in signal recovery

n What would happen if there is a model mismatch ?


n If the signal is not exactly a K-sparse signal, what kind of results do we
expect under such an assumption?

Digital Media Lab.


Design of Sensing Matrix

1. Null Space Property


2. Restricted Isometry property
3. Bounded Coherence Property

Digital Media Lab.

1. Null Space Conditions
n Q: We would like to design Φ so that we can recover all sparse signals x corresponding to the measurements y. What condition on Φ do we need?

n Definition of the null space of Φ: $\mathcal{N}(\Phi) = \{\, z : \Phi z = 0 \,\}$

n Uniqueness condition for the solution of $y = \Phi x$:
$\mathcal{N}(\Phi)$ contains no vector in $\Sigma_{2K}$ ⇔ Φ uniquely represents all $x \in \Sigma_K$

n Proof idea: if two distinct K-sparse signals mapped to the same measurement vector, there would be no way to recover all signals x from the measurements y; distinct x must mean distinct measurement vectors.
Digital Media Lab.


Spark
n Definition of Spark: The spark of a given matrix Φ is the smallest number of columns of Φ that are linearly dependent (see the sketch below).

n Spark
n A term coined by Donoho & Elad (2003)
n It is a way of characterizing the null space of Φ using the L0 norm.
n It is very complex to obtain (compared to the rank), since it calls for a combinatorial search over all possible subsets of columns of Φ.
Digital Media Lab.
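A brute-force sketch (not from the slides) of computing spark(Φ) for a tiny made-up matrix, by checking column subsets in increasing size; this combinatorial search is only feasible for very small matrices.

```python
import itertools
import numpy as np

def spark(Phi, tol=1e-10):
    M, N = Phi.shape
    for k in range(1, N + 1):
        for cols in itertools.combinations(range(N), k):
            sub = Phi[:, cols]
            # columns are linearly dependent iff the submatrix is rank-deficient
            if np.linalg.matrix_rank(sub, tol=tol) < k:
                return k
    return N + 1        # all columns independent (only possible if N <= M)

Phi = np.array([[1.0, 0.0, 1.0, 1.0],
                [0.0, 1.0, 1.0, -1.0]])
# any two columns are independent, but any three columns in R^2 are dependent
print(spark(Phi))       # 3, consistent with 2 <= spark <= M + 1
```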

More on Spark (1)
n Solving an underdetermined equation (Φ ~ M×N, N >> M):
$y = \Phi x = \sum_i x_i \phi_i = \underbrace{\sum_k x_k \phi_k}_{A:\ \text{linearly dep.}} + \underbrace{\sum_j x_j \phi_j}_{B:\ \text{linearly indep.}}$

n Term A: its corresponding columns of Φ are linearly dependent.
n Term B: its corresponding columns of Φ are linearly independent.
$\begin{bmatrix} \phi_1 & \phi_2 & \cdots & \phi_N \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_N \end{bmatrix} = \begin{bmatrix} y_1 \\ \vdots \\ y_M \end{bmatrix}$
Digital Media Lab.
More on Spark (2)
n Note that any vector x can be decomposed as $x = x_A + x_B$, where $x_A \in \mathcal{N}(\Phi)$:
$x = (x_1, x_2, \ldots, x_i, x_{i+1}, \ldots, x_N)^T = (x_1, x_2, \ldots, x_i, 0, \ldots, 0)^T + (0, 0, \ldots, 0, x_{i+1}, \ldots, x_N)^T$

n Note that $2 \le \mathrm{spark}(\Phi) \le M + 1$
n (The zero vector is not considered, since it is a trivial case.)
Digital Media Lab.

Spark Condition
n Theorem: For any vector $y \in R^M$, there exists at most one signal $x \in \Sigma_K$ such that $y = \Phi x$ if and only if $\mathrm{spark}(\Phi) > 2K$.

n (This is an equivalent way of characterizing the null space condition.)

n This theorem guarantees uniqueness of representation for K-sparse signals.
n It has combinatorial computational complexity, since it must verify that all sets of columns of a certain size are linearly independent.

n (Proved by D. L. Donoho and M. Elad, "Optimally sparse representation in general (nonorthogonal) dictionaries via l1 minimization.")

n The spark provides a complete characterization of when sparse recovery is possible. However, when dealing with approximately sparse signals (i.e., compressible signals), we must consider somewhat more restrictive conditions on the null space of Φ.
Digital Media Lab.
Proof of the Spark Condition
n (⇒) Suppose spark(Φ) ≤ 2K.
n This means there exists some set of at most 2K columns that are linearly dependent.
⇒ there exists $h \in \mathcal{N}(\Phi)$ s.t. $h \in \Sigma_{2K}$
⇒ we can write $h = x - x'$ where $x, x' \in \Sigma_K$
n Since $h \in \mathcal{N}(\Phi)$, we have $\Phi(x - x') = 0$, i.e., $\Phi x = \Phi x'$
⇒ contradiction of distinctness!

n (⇐) Suppose that spark(Φ) > 2K.
n Assume that for some y there exist $x, x' \in \Sigma_K$ such that $y = \Phi x = \Phi x'$
⇒ $\Phi(x - x') = 0$
n Letting $h = x - x'$, we can write this as $\Phi h = 0$.
n Since spark(Φ) > 2K, all sets of up to 2K columns of Φ are linearly independent, and therefore $h = 0$.
⇒ $x = x'$
Digital Media Lab.

Corollary to the Spark Condition
n Corollary: $2K \le M$

n Proof:
$2 \le \mathrm{spark}(\Phi) \le M + 1$ and $\mathrm{spark}(\Phi) > 2K$
$\Rightarrow 2K < M + 1$
$\Rightarrow 2K \le M$
Digital Media Lab.


More on the Spark Condition
n The spark provides a complete characterization of when sparse recovery is possible:
there exists at most one signal $x \in \Sigma_K$ s.t. $y = \Phi x$ ⇔ $\mathrm{spark}(\Phi) > 2K$

n However, when dealing with approximately sparse signals (i.e., compressible signals), we must consider somewhat more restrictive conditions on the null space of Φ.
n We must also ensure that $\mathcal{N}(\Phi)$ does not contain any vectors that are too compressible, in addition to vectors that are sparse.
⇒ Null space property

n Notation: for a subset of indices $\Lambda \subset \{1, 2, \ldots, N\}$, let $\Lambda^C = \{1, 2, \ldots, N\} \setminus \Lambda$.
n $x_\Lambda$: the length-N vector obtained by setting to zero the entries of x indicated by $\Lambda^C$
n $\Phi_\Lambda$: the M×N matrix obtained by setting to zero the columns at the positions indexed by $\Lambda^C$
Digital Media Lab.

2. Null Space Property (NSP)
n Definition of the null space property (NSP) of order K
n A matrix Φ satisfies the null space property (NSP) of order K if there exists a constant C > 0 such that
$\|h_\Lambda\|_2 \le C \, \frac{\|h_{\Lambda^C}\|_1}{\sqrt{K}}$
holds for all $h \in \mathcal{N}(\Phi)$ and for all Λ such that $|\Lambda| \le K$.

n Rf: for example, with $\Lambda = \{1, 3\}$,
$h = (h_1, h_2, h_3, h_4, \ldots, h_n)^T$, $\; h_\Lambda = (h_1, 0, h_3, 0, \ldots, 0)^T$, $\; h_{\Lambda^C} = (0, h_2, 0, h_4, \ldots, h_n)^T$, and $h = h_\Lambda + h_{\Lambda^C}$.
Digital Media Lab.


Null Space Property (NSP)
n The NSP implies that vectors in the null space of Φ should not be too concentrated on a small subset of indices.
$\|h_\Lambda\|_2 \le C \, \frac{\|h_{\Lambda^C}\|_1}{\sqrt{K}}$

n If a vector h is exactly K-sparse, then there exists a Λ (its support) such that $\|h_{\Lambda^C}\|_1 = 0$. Therefore, the NSP indicates that $\|h_\Lambda\|_2 = 0$, and thus $h = 0$ as well.

n This means that if a matrix Φ satisfies the NSP, then the only K-sparse vector in $\mathcal{N}(\Phi)$ is h = 0.
Digital Media Lab.

NSP and Sparse Recovery
n How do we measure the performance of sparse recovery algorithms when dealing with general non-sparse x?

n The following guarantee under the NSP ensures exact recovery of all possible K-sparse signals, and also ensures a degree of robustness to non-sparse signals that directly depends on how well the signals are approximated by K-sparse vectors:
$\|\Delta(\Phi x) - x\|_2 \le C \, \frac{\sigma_K(x)_1}{\sqrt{K}}$
n where $\Delta: R^M \to R^N$ represents a specific recovery method, and
$\sigma_K(x)_p = \min_{\hat{x} \in \Sigma_K} \|x - \hat{x}\|_p$
Digital Media Lab.


NSP Theorem
n Theorem: For a sensing matrix $\Phi: R^N \to R^M$ and an arbitrary recovery algorithm $\Delta: R^M \to R^N$, if the pair (Φ, Δ) satisfies
$\|\Delta(\Phi x) - x\|_2 \le C \, \frac{\sigma_K(x)_1}{\sqrt{K}}$
then Φ satisfies the NSP of order 2K.
Digital Media Lab.

Proof of the NSP Theorem
n Suppose $h \in \mathcal{N}(\Phi)$ and let Λ be the indices corresponding to the 2K largest entries of h. Split Λ into $\Lambda_0$ and $\Lambda_1$, where $|\Lambda_0| = |\Lambda_1| = K$.
n Set $x = h_{\Lambda_1} + h_{\Lambda^C}$ and $x' = -h_{\Lambda_0}$, so that $h = x - x'$.

n Since by construction $x' \in \Sigma_K$, we can apply $\|\Delta(\Phi x) - x\|_2 \le C \, \sigma_K(x)_1/\sqrt{K}$ to obtain $x' = \Delta(\Phi x)$. Moreover, since $h \in \mathcal{N}(\Phi)$, we have
$\Phi h = \Phi(x - x') = 0$

n so that $\Phi x' = \Phi x$. Thus, $x' = \Delta(\Phi x)$. Finally, we have that
$\|h_\Lambda\|_2 \le \|h\|_2 = \|x - x'\|_2 = \|x - \Delta(\Phi x)\|_2 \le C \, \frac{\sigma_K(x)_1}{\sqrt{K}} = \sqrt{2}\,C \, \frac{\|h_{\Lambda^C}\|_1}{\sqrt{2K}}$

n If the matrix Φ satisfies the NSP, then the only 2K-sparse vector in $\mathcal{N}(\Phi)$ is h = 0.
Digital Media Lab.


Restricted Isometry Property (RIP)
n When measurements are contaminated with noise or have been
corrupted by some error such as quantization, it will be useful to
consider somewhat stronger conditions.

n Candes and Tao introduced the isometry condition on matrices A and


established its important role in CS.

n In mathematics, an isometry is a distance-preserving map between


metric spaces. Geometric figures which can be related by an isometry
are called congruent.

Digital Media Lab.

Restricted Isometry Property (RIP)
n Definition of RIP (an empirical sketch follows below)
n A matrix Φ satisfies the restricted isometry property (RIP) of order K if there exists a $\delta_K \in (0, 1)$ such that
$(1 - \delta_K)\|x\|_2^2 \le \|\Phi x\|_2^2 \le (1 + \delta_K)\|x\|_2^2$
holds for all $x \in \Sigma_K = \{\, x : \|x\|_0 \le K \,\}$.
n If a matrix Φ satisfies the RIP of order 2K, then Φ approximately preserves the distance between any pair of K-sparse vectors.
n This has fundamental implications concerning robustness to noise.

n If a matrix Φ satisfies the RIP of order K with constant $\delta_K$, then for any K' < K we automatically have that Φ satisfies the RIP of order K' with constant $\delta_{K'} \le \delta_K$.

n If a matrix Φ satisfies the RIP of order K with a sufficiently small constant, then it will also automatically satisfy the RIP of order γK for certain γ, albeit with a somewhat worse constant.
Digital Media Lab.
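A small empirical sketch (not from the slides): for a random Gaussian sensing matrix with entries N(0, 1/M), check how far ||Φx||² deviates from ||x||² over random K-sparse vectors. This only samples the RIP inequality; verifying RIP exactly is a combinatorial problem.

```python
import numpy as np

rng = np.random.default_rng(4)
N, M, K, trials = 512, 128, 10, 2000

Phi = rng.standard_normal((M, N)) / np.sqrt(M)

worst = 0.0
for _ in range(trials):
    x = np.zeros(N)
    idx = rng.choice(N, size=K, replace=False)
    x[idx] = rng.standard_normal(K)
    ratio = np.linalg.norm(Phi @ x) ** 2 / np.linalg.norm(x) ** 2
    worst = max(worst, abs(ratio - 1.0))

print(worst)   # an empirical lower bound on the RIP constant delta_K
```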


The RIP and Stability
n Definition of C-stable: Let $\Phi: R^N \to R^M$ denote a sensing matrix and $\Delta: R^M \to R^N$ denote a recovery algorithm. A pair (Φ, Δ) is called C-stable if for any $x \in \Sigma_K$ and any $e \in R^M$, we have that
$\|\Delta(\Phi x + e) - x\|_2 \le C \, \|e\|_2$

n This says that if we add a small amount of noise to the measurements, then the impact of this on the recovered signal should not be arbitrarily large.
n As C → 1, Φ must satisfy the lower bound below with $\delta_K = 1 - 1/C^2 \to 0$:
$(1 - \delta_K)\|x\|_2^2 \le \|\Phi x\|_2^2 \le (1 + \delta_K)\|x\|_2^2$

n Thus, if we desire to reduce the impact of noise in the recovered signal, we must adjust Φ so that it satisfies the lower bound of the above inequality with a tighter constant.
Digital Media Lab.

The RIP and Stability
n Theorem: If a pair (Φ, Δ) is C-stable, then for all $x \in \Sigma_K$,
$\frac{1}{C}\,\|x\|_2 \le \|\Phi x\|_2$

n This demonstrates that the existence of any decoding algorithm that can stably recover from noisy measurements requires that Φ satisfy the lower bound of the RIP with a constant determined by C.
Digital Media Lab.


End of Lecture 3
(Chapter 3 is continued next week)

Digital Media Lab.


2012 Fall

Data Compression
(ECE 5546-41)

Ch 3. Sensing Matrices

Byeungwoo Jeon
Digital Media Lab, SKKU, Korea
http://media.skku.ac.kr; bjeon@skku.edu

Digital Media Lab.

How many measurements are necessary to achieve RIP?
(Measurement Bound)

Digital Media Lab. 2


Measurement Bound (1)
n Lemma: For K and N satisfying K < N/2, there exists a subset X of $\Sigma_K$ such that for any x in X we have
$\|x\|_2 \le \sqrt{K}$
and for any distinct x and z in X,
$\|x - z\|_2 \ge \sqrt{K/2}$ and $\log|X| \ge \frac{K}{2}\log\left(\frac{N}{K}\right)$
n Proof:
Digital Media Lab. 3

Measurement Bound (2)
n Theorem: Let Φ be an M×N matrix that satisfies the RIP of order 2K with constant $\delta \in (0, 0.5]$. Then
$M \ge C K \log\left(\frac{N}{K}\right)$, where $C = \frac{1}{2\log(\sqrt{24}+1)} \approx 0.28$
(see the numerical sketch below)
n Proof:
Digital Media Lab. 4
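A quick sketch (not from the slides) evaluating the bound M ≥ C·K·log(N/K) with C = 1/(2 log(√24 + 1)) ≈ 0.28 for a few illustrative (N, K) pairs.

```python
import numpy as np

C = 1.0 / (2.0 * np.log(np.sqrt(24.0) + 1.0))   # ~ 0.28

for N, K in [(1000, 10), (10000, 50), (10**6, 100)]:
    bound = C * K * np.log(N / K)
    print(N, K, int(np.ceil(bound)))   # minimum M implied by the theorem
```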


Measurement Bound (3)
n Johnson-Lindenstrauss lemma:
$M \ge \frac{c_0 \log(p)}{\varepsilon^2}$, where the constant $c_0 > 0$
Digital Media Lab. 5

How to design the sensing matrix?

Digital Media Lab. 6


RIP and NSP (1)
n Theorem: Suppose Φ satisfies the RIP of order 2K with $\delta_{2K} < \sqrt{2} - 1$. Then Φ satisfies the NSP of order 2K with constant
$C = \frac{\sqrt{2}\,\delta_{2K}}{1 - (1 + \sqrt{2})\,\delta_{2K}}$
Digital Media Lab.

RIP and NSP (2)
n Lemma: Suppose $u \in \Sigma_K$. Then
$\frac{\|u\|_1}{\sqrt{K}} \le \|u\|_2 \le \sqrt{K}\,\|u\|_\infty$

n Lemma: Suppose that Φ satisfies the RIP of order 2K, and let $h \in R^N$, $h \ne 0$, be arbitrary. Let $\Lambda_0$ be any subset of $\{1, 2, \ldots, N\}$ such that $|\Lambda_0| \le K$. Define $\Lambda_1$ as the index set corresponding to the K entries of $h_{\Lambda_0^C}$ with largest magnitude, and set $\Lambda = \Lambda_0 \cup \Lambda_1$. Then
$\|h_\Lambda\|_2 \le \alpha \, \frac{\|h_{\Lambda_0^C}\|_1}{\sqrt{K}} + \beta \, \frac{|\langle \Phi h_\Lambda, \Phi h \rangle|}{\|h_\Lambda\|_2}$
where
$\alpha = \frac{\sqrt{2}\,\delta_{2K}}{1 - \delta_{2K}}, \quad \beta = \frac{1}{1 - \delta_{2K}}$
Digital Media Lab. 8


Matrix Design Satisfying RIP
n Q: How to construct a matrix satisfying the RIP?

n Methods
1. Deterministic method
2. Randomization method
n Method without specified δ2K (← just assume δ2K > 0)
n Method with specified δ2K (← a particular value of δ2K is specified)

n Definition of RIP
n A matrix Φ satisfies the restricted isometry property (RIP) of order K if there exists a $\delta_K \in (0,1)$ such that, for all $x \in \Sigma_K = \{\, x : \|x\|_0 \le K \,\}$,
$(1 - \delta_K)\|x\|_2^2 \le \|\Phi x\|_2^2 \le (1 + \delta_K)\|x\|_2^2$

n Theorem on RIP and NSP: Suppose Φ satisfies the RIP of order 2K with $\delta_{2K} < \sqrt{2} - 1$. Then Φ satisfies the NSP of order 2K with constant
$C = \frac{\sqrt{2}\,\delta_{2K}}{1 - (1 + \sqrt{2})\,\delta_{2K}}$
Digital Media Lab. 9



Deterministic Matrix Design
n Idea: deterministically construct matrices of size M×N that satisfy the RIP of order K.
n This requires M to be relatively large.
n (Ex) It requires $M = O(K^2 \log N)$ in [62]; $M = O(K N^\alpha)$ in [115].
n In real-world problems, these results lead to an unacceptably large M.
Digital Media Lab. 10


Randomization Matrix Design (1)
n Idea: Choose random numbers for the matrix entries.
n For given M and N, generate random matrices Φ by choosing the entries $\phi_{ij}$ as independent realizations from some PDF.

n Randomization method without a specified δ2K (← just assume δ2K > 0)
n Set M = 2K, and draw Φ according to a Gaussian PDF.
n With probability 1, any subset of 2K columns is linearly independent, and hence all subsets of 2K columns will be bounded below by (1 - δ2K), where δ2K > 0.
n Problem: how do we know the value of δ2K?
n We would need to search over all $\binom{N}{K}$ K-dimensional subspaces of $R^N$.
n Considering realistic values of N and K, such a search requires prohibitively much computation.
Digital Media Lab. 11

Randomization Matrix Design (2)
n Randomization method with a specified value of δ2K
n We would like to achieve the RIP of order 2K for a specified constant δ2K.
n This can be achieved by imposing two additional conditions on the PDF.
n Cond 1: The PDF yields a matrix that is norm-preserving, that is,
$E(\phi_{ij}^2) = \frac{1}{M}$
n Under this condition, the variance of the PDF is 1/M.

n Cond 2: The PDF is sub-Gaussian, that is, there exists a constant c > 0 s.t.
$E\left(e^{\phi_{ij} t}\right) \le e^{c^2 t^2 / 2}$ for all $t \in R$
n Note that the moment-generating function of the PDF is dominated by that of a Gaussian PDF, which is also equivalent to requiring that the tails of the PDF decay at least as fast as the tails of a Gaussian PDF.
Digital Media Lab. 12


Randomization Matrix Design (3)
n Examples of sub-Gaussian PDFs
n Gaussian, Bernoulli taking values $\pm 1/\sqrt{M}$, and more generally any PDF with bounded support.
n Strictly sub-Gaussian: a PDF satisfying the sub-Gaussian bound $E\left(e^{\phi_{ij} t}\right) \le e^{c^2 t^2 / 2}$ for all $t \in R$ with the constant c below:
$c^2 = E(\phi_{ij}^2) = \frac{1}{M}$

n Corollary: Suppose that Φ is an M×N matrix whose entries $\phi_{ij}$ are i.i.d. and drawn according to a strictly sub-Gaussian PDF with $c^2 = 1/M$. Let $Y = \Phi x$ for x in $R^N$. Then for any $\varepsilon > 0$ and any x in $R^N$,
$E\left(\|Y\|_2^2\right) = \|x\|_2^2$ and $P\left(\left| \|Y\|_2^2 - \|x\|_2^2 \right| \ge \varepsilon \|x\|_2^2\right) \le 2\exp\left(-\frac{M\varepsilon^2}{\kappa^*}\right)$
with $\kappa^* = \frac{2}{1 - \log(2)} \approx 6.52$
n Note that the norm of a sub-Gaussian random vector strongly concentrates about its mean.
Digital Media Lab. 13

Randomization Matrix Design (4)
n Theorem: Fix $\delta \in (0,1)$. Let Φ be an M×N random matrix whose entries $\phi_{ij}$ are drawn according to a strictly sub-Gaussian PDF with $c^2 = 1/M$. If
$M \ge \kappa_1 K \log\left(\frac{N}{K}\right)$
then Φ satisfies the RIP of order K with the prescribed δ with probability exceeding $1 - 2e^{-\kappa_2 M}$, where $\kappa_1$ is arbitrary and $\kappa_2 = \delta^2/(2\kappa^*) - \log(42e/\delta)/\kappa_1$.

n Note that the measurement bound above matches the optimal number of measurements (up to a constant).
Digital Media Lab. 14


Why is the randomized method better?
n One can show that for the random construction, the measurements are democratic.
n This means that it is possible to recover a signal using any sufficiently large subset of the measurements.
n Thus, by using a random Φ, one can be robust to the loss or corruption of a small fraction of the measurements.

n Universality: it can easily accommodate some other basis.
n In practice, we are often more interested in the setting where x is sparse with respect to some basis Ψ. In this case, what is actually required is the RIP of the product ΦΨ.
n In a deterministic design, the design process must take Ψ into account.
n In a randomized design, Φ can be designed independently of Ψ.
n If Φ is Gaussian and Ψ is orthonormal, note that ΦΨ is also Gaussian.
n Furthermore, for sufficiently large M, ΦΨ will satisfy the RIP with high probability.
Digital Media Lab. 15

Practical Situation
n In practical implementation, the fully random matrix design may be
sometimes impractical to build in HW. Therefore it is possible to:
n Use a reduced amount of randomness
n Or model the architecture via matrices F that has significantly more
structure than a fully random matrix
n EX: random demodulator[192], random filtering [194], modulated
wideband converter [147], random convolution [2,166], compressive
multiplier [179]

n Although not quite as easy as in the fully random case, one can prove that many such constructions also satisfy the RIP.

Digital Media Lab. 16


Coherence
n Definition: The coherence of a matrix Φ, μ(Φ), is the largest absolute inner product between any two columns $\phi_i$, $\phi_j$ of Φ (see the sketch below):
$\mu(\Phi) = \max_{1 \le i < j \le N} \frac{|\langle \phi_i, \phi_j \rangle|}{\|\phi_i\|_2 \, \|\phi_j\|_2}$

n Note that the coherence satisfies the relation below:
$\sqrt{\frac{N - M}{M(N - 1)}} \le \mu(\Phi) \le 1$
n Its lower bound is called the Welch bound.
n When N >> M, the lower bound is approximately $\mu(\Phi) \ge 1/\sqrt{M}$.

n Coherence is related to the spark, NSP, and RIP.
Digital Media Lab. 17
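A small sketch (not from the slides): compute the coherence μ(Φ) of a random M×N matrix and compare it with the Welch bound; the sizes are arbitrary.

```python
import numpy as np

def coherence(Phi):
    # normalize columns, then take the largest off-diagonal |inner product|
    G = Phi / np.linalg.norm(Phi, axis=0)
    gram = np.abs(G.T @ G)
    np.fill_diagonal(gram, 0.0)
    return gram.max()

rng = np.random.default_rng(5)
M, N = 32, 128
Phi = rng.standard_normal((M, N))

welch = np.sqrt((N - M) / (M * (N - 1)))
print(coherence(Phi), welch)   # coherence is always >= the Welch bound
```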

Coherence and Spark
n Theorem (Gershgorin): The eigenvalues of an N×N matrix M with entries $m_{ij}$, $1 \le i, j \le N$, lie in the union of N discs $d_i = d_i(c_i, r_i)$, $1 \le i \le N$, centered at $c_i = m_{ii}$ and with radius $r_i = \sum_{j \ne i} |m_{ij}|$.

n Lemma: For any matrix Φ,
$\mathrm{spark}(\Phi) \ge 1 + \frac{1}{\mu(\Phi)}$
Digital Media Lab. 18


Coherence and NSP
n Theorem: If
$K < \frac{1}{2}\left(1 + \frac{1}{\mu(\Phi)}\right)$
then for each measurement vector $y \in R^M$ there exists at most one signal $x \in \Sigma_K$ such that $y = \Phi x$.

n Lemma: If Φ has unit-norm columns and coherence μ = μ(Φ), then Φ satisfies the RIP of order K with δ = (K - 1)μ for all K < 1/μ.

n The lemma suggests the need for small coherence μ(Φ) for matrices used in CS.
Digital Media Lab. 19


2012 Fall

Data Compression
(ECE 5546-41)

Ch 4. Sparse Signal Recovery via L1 Minimization

Byeungwoo Jeon
Digital Media Lab, SKKU, Korea
http://media.skku.ac.kr; bjeon@skku.edu

Digital Media Lab.

How to recover a sparse signal from a small number of linear measurements?

Digital Media Lab. 2


Sparse Signal Recovery (1)
n Problem: For $y = \Phi x$ and for x assumed to be sparse (or compressible), find $\hat{x}$ satisfying
$\hat{x} = \arg\min_z \|z\|_0$ subject to $z \in \mathcal{B}(y)$
where $\mathcal{B}(y)$ ensures that $\hat{x}$ is consistent with the measurements y.

n Under the assumption of x being sparse (or compressible), find the x corresponding to the measurements y under the L0 optimality condition.
n The solution seeks the sparsest signal in $\mathcal{B}(y)$.
n Consistency with the measurements: depending on the existence of measurement noise, $\mathcal{B}(y)$ has two cases:
$\mathcal{B}(y) = \{\, z : \Phi z = y \,\}$ (noise-free case); $\quad \mathcal{B}(y) = \{\, z : \|\Phi z - y\|_2 \le \varepsilon \,\}$ (noisy case)
Digital Media Lab. 3

Sparse Signal Recovery (2)
n The framework also holds for the case where x is not apparently sparse.

n In that case, suppose $x = \Psi\alpha$; then the problem is
$\hat{\alpha} = \arg\min_z \|z\|_0$ subject to $z \in \mathcal{B}(y)$
where
$\mathcal{B}(y) = \{\, z : \Phi\Psi z = y \,\}$ (noise-free case); $\quad \mathcal{B}(y) = \{\, z : \|\Phi\Psi z - y\|_2 \le \varepsilon \,\}$ (noisy case)

n Note that under the assumption that Ψ is an orthonormal basis, it is possible to assume Ψ = I without loss of generality.
Digital Media Lab. 4


Sparse Signal Recovery (3)
n How to solve the L0 minimization problem?
$\hat{x} = \arg\min_z \|z\|_0$ subject to $z \in \mathcal{B}(y)$
$\mathcal{B}(y) = \{\, z : \Phi\Psi z = y \,\}$ (noise-free case); $\quad \mathcal{B}(y) = \{\, z : \|\Phi\Psi z - y\|_2 \le \varepsilon \,\}$ (noisy case)

n Note that $\|\cdot\|_0$ is a non-convex function.
n It is potentially very complex (NP-hard) to solve this minimization problem.

n L0 solution via L1 minimization (see the sketch below):
$\hat{x} = \arg\min_z \|z\|_1$ subject to $z \in \mathcal{B}(y)$
n If $\mathcal{B}(y)$ is convex, this problem becomes computationally tractable!
n The solution prefers a sparse solution in $\mathcal{B}(y)$.

n Big Question: Will the L1 solution be similar to the L0 solution?
Digital Media Lab. 5
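A minimal sketch (not from the slides) of noise-free L1 minimization (basis pursuit), min ||z||_1 s.t. Φz = y, recast as a linear program with z = u - v, u ≥ 0, v ≥ 0, and solved with scipy.optimize.linprog (assuming SciPy is available). Problem sizes and the planted signal are illustrative only.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(6)
N, M, K = 60, 25, 4

Phi = rng.standard_normal((M, N)) / np.sqrt(M)
x_true = np.zeros(N)
x_true[rng.choice(N, K, replace=False)] = rng.standard_normal(K)
y = Phi @ x_true

c = np.ones(2 * N)                      # objective: sum(u) + sum(v) = ||z||_1
A_eq = np.hstack([Phi, -Phi])           # Phi u - Phi v = y
res = linprog(c, A_eq=A_eq, b_eq=y, bounds=[(0, None)] * (2 * N))

x_hat = res.x[:N] - res.x[N:]
print(np.linalg.norm(x_hat - x_true))   # typically ~ 0 for K << M
```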

Why L1 Minimization Preferred?


n Intuitively,
n L1 minimization promotes sparsity.
n There are a variety of reasons to suspect that L1 minimization will provide an accurate method for sparse signal recovery.
n L1 minimization provides a computationally tractable approach to the
signal recovery.

Digital Media Lab. 6


Analysis of the L1 Minimization Solution
$\hat{x} = \arg\min_z \|z\|_1$ subject to $z \in \mathcal{B}(y)$

Digital Media Lab. 7

Noise-free Signal Recovery (1)
n Lemma: Let Φ be a matrix that satisfies the RIP of order 2K with constant $\delta_{2K} < \sqrt{2} - 1$. Let $x, \hat{x} \in R^N$ be given, and define $h = \hat{x} - x$. Let
$\Lambda_0$ ~ the index set corresponding to the K entries of x with largest magnitude,
$\Lambda_1$ ~ the index set corresponding to the K entries of $h_{\Lambda_0^C}$ with largest magnitude.
Set $\Lambda = \Lambda_0 \cup \Lambda_1$. If $\|\hat{x}\|_1 \le \|x\|_1$, then
$\|h\|_2 \le C_0 \, \frac{\sigma_K(x)_1}{\sqrt{K}} + C_1 \, \frac{|\langle \Phi h_\Lambda, \Phi h \rangle|}{\|h_\Lambda\|_2}$
where
$C_0 = 2\,\frac{1 - (1 - \sqrt{2})\delta_{2K}}{1 - (1 + \sqrt{2})\delta_{2K}}, \quad C_1 = \frac{2}{1 - (1 + \sqrt{2})\delta_{2K}}$

n This shows an error bound for the class of L1 minimization algorithms when combined with a measurement matrix Φ satisfying the RIP.
n For specific bounds for concrete examples of $\mathcal{B}(y)$, we need to examine how requiring $\hat{x} \in \mathcal{B}(y)$ affects $\langle \Phi h_\Lambda, \Phi h \rangle$.
Digital Media Lab. 8


Noise-free Signal Recovery (2)
n Proof: (self-study)
Digital Media Lab. 9

Noise-free Signal Recovery (3)
n Theorem: Let Φ be a matrix that satisfies the RIP of order 2K with constant $\delta_{2K} < \sqrt{2} - 1$. When $\mathcal{B}(y) = \{\, z : \Phi z = y \,\}$, the solution $\hat{x}$ to the L1 minimization obeys
$\|x - \hat{x}\|_2 \le C_0 \, \frac{\sigma_K(x)_1}{\sqrt{K}}$

n For $x \in \Sigma_K = \{\, x : \|x\|_0 \le K \,\}$ and Φ satisfying the RIP,
$\|x - \hat{x}\|_2 = 0$
n Note that L1 minimization exactly provides the solution of L0 minimization.

n In other words, with as few as O(K log(N/K)) measurements, we can exactly recover any K-sparse signal using L1 minimization.
n This can also be shown to be stable under noisy measurements.
Digital Media Lab. 10


Noise-free Signal Recovery (4)
n Proof: For x belonging to $\mathcal{B}(y)$, the lemma can be applied to obtain that, for $h = \hat{x} - x$,
$\|h\|_2 \le C_0 \, \frac{\sigma_K(x)_1}{\sqrt{K}} + C_1 \, \frac{|\langle \Phi h_\Lambda, \Phi h \rangle|}{\|h_\Lambda\|_2}$

n Since $x, \hat{x} \in \mathcal{B}(y)$, we have $y = \Phi x = \Phi \hat{x}$. Therefore $\Phi h = 0$, and the second term vanishes, thus proving the theorem:
$\|h\|_2 \le C_0 \, \frac{\sigma_K(x)_1}{\sqrt{K}}$
--- Q.E.D. ---
Digital Media Lab. 11

Noisy Signal Recovery (1)
n Theorem: Let Φ be a matrix that satisfies the RIP of order 2K with constant $\delta_{2K} < \sqrt{2} - 1$. Let $y = \Phi x + e$ with $\|e\|_2 \le \varepsilon$ (that is, bounded noise). Then, for $\mathcal{B}(y) = \{\, z : \|\Phi z - y\|_2 \le \varepsilon \,\}$, the L1 solution $\hat{x}$ obeys
$\|x - \hat{x}\|_2 \le C_0 \, \frac{\sigma_K(x)_1}{\sqrt{K}} + C_2 \, \varepsilon$
where
$C_0 = 2\,\frac{1 - (1 - \sqrt{2})\delta_{2K}}{1 - (1 + \sqrt{2})\delta_{2K}}, \quad C_2 = 4\,\frac{\sqrt{1 + \delta_{2K}}}{1 - (1 + \sqrt{2})\delta_{2K}}$

n This provides a bound on the worst-case performance for uniformly bounded noise.
Digital Media Lab. 12


Noisy Signal Recovery (2)
n Proof: (self-study)
Digital Media Lab. 13

Noisy Signal Recovery (3)
n What is the bound of the recovery error if the noise is Gaussian?
$y = \Phi x + e$ where $e \in R^M$, i.i.d. with $N(0, \sigma^2)$

n Corollary: Let Φ be a sensing matrix that satisfies the RIP of order 2K with constant $\delta_{2K} < \sqrt{2} - 1$, and for a K-sparse signal $x \in \Sigma_K$ we obtain measurements $y = \Phi x + e$ where the entries of e are i.i.d. $N(0, \sigma^2)$. Then, when $\mathcal{B}(y) = \{\, z : \|\Phi z - y\|_2 \le 2\sqrt{M}\,\sigma \,\}$, the solution to the L1 minimization obeys
$\|x - \hat{x}\|_2 \le 8\,\frac{\sqrt{1 + \delta_{2K}}}{1 - (1 + \sqrt{2})\delta_{2K}} \, \sqrt{M}\,\sigma$ with probability at least $1 - e^{-c_0 M}$
Digital Media Lab. 14


How to recover a non-sparse signal from a small number of linear measurements?

Digital Media Lab. 15

Instance-optimal Guarantee (1)
n Theorem: Let Φ be an M×N matrix and let $\Delta: R^M \to R^N$ be a recovery algorithm satisfying
$\|x - \Delta(\Phi x)\|_2 \le C\,\sigma_K(x)_2$ for some $K \ge 1$;
then $M > \left(1 - \sqrt{1 - 1/C^2}\right) N$.
n In order to make the bound hold for all signals x with a constant $C \approx 1$, then regardless of what recovery algorithm is being used, we need to take $M \approx N$ measurements.
Digital Media Lab.


Rf:: Instance
Rf Instance--Optimal?
n The theorem says not only about exact recovery of all possible k-
sparse signals, but also ensures a degree of robustness to non-
sparse signals that directly depends on how well the signals are
approximated by k-sparse vectors.
è Instance-optimal guarantee (i.e., it guarantees optimal performance for
each instance of x)

n Cf: Guarantee that only holds for some subset of possible signals,
such as compressible or sparse signals (the quality of guarantee
adapts to the particular choice of x)
n In that sense, instance-optimality is also commonly referred to as
“uniform guarantee” since they hold uniformly for all x.

Digital Media Lab. 17

Instance-optimal Guarantee (2)
n Theorem: Fix $\delta \in (0,1)$. Let Φ be an M×N random matrix whose entries $\phi_{ij}$ are i.i.d. and drawn according to a strictly sub-Gaussian distribution with $c^2 = 1/M$. If
$M \ge \kappa_1 K \log\left(\frac{N}{K}\right)$
then Φ satisfies the RIP of order K with the prescribed δ with probability exceeding $1 - 2e^{-\kappa_2 M}$, where $\kappa_1$ is arbitrary and $\kappa_2 = \delta^2/(2\kappa^*) - \log(42e/\delta)/\kappa_1$.
Digital Media Lab.


Instance--optimal Guarantee (3)
Instance
N
n Theorem: Let x Î R be fixed. Set d 2 K < 2 - 1 , and suppose that F
be a MxN sub-Gaussian random matrix with
æNö
M ³ k1 K log ç ÷
èKø
and measurement is y= Fx. Set e = 2s K ( x ) 2 . Then, with probability
exceeding (1 - 2e - k M - e -k M ) , when B ( y ) = { z | F z - y 2 £ e } , the L1
2 3

minimization solution obeys,


1 + d 2 K - (1 + 2)d 2 K
x - xˆ 2 £ 8 s K ( x )2
1 - (1 + 2)d 2 K

Digital Media Lab.

End of Chapter 4

Digital Media Lab. 20


2012 Fall

Data Compression
(ECE 5546-41)

Ch 5. Algorithms for Sparse Recovery


Part 1

Byeungwoo Jeon
Digital Media Lab, SKKU, Korea
http://media.skku.ac.kr; bjeon@skku.edu
Digital Media Lab.

Various recovery algorithms for compressed-sensed sparse signals

Digital Media Lab. 2


From Chapter 4

Sparse Signal Recovery (1)
n Problem: For $y = \Phi x$ and for x assumed to be sparse (or compressible), find $\hat{x}$ satisfying
$\hat{x} = \arg\min_z \|z\|_0$ subject to $z \in \mathcal{B}(y)$
where $\mathcal{B}(y)$ ensures that $\hat{x}$ is consistent with the measurements y.

n Under the assumption of x being sparse (or compressible), find the x corresponding to the measurements y under the L0 optimality condition.
n The solution seeks the sparsest signal in $\mathcal{B}(y)$.
n Consistency with the measurements: depending on the existence of measurement noise, $\mathcal{B}(y)$ has two cases:
$\mathcal{B}(y) = \{\, z : \Phi z = y \,\}$ (noise-free case); $\quad \mathcal{B}(y) = \{\, z : \|\Phi z - y\|_2 \le \varepsilon \,\}$ (noisy case)
n A loss (cost) function other than the Euclidean distance may also be appropriate.
Digital Media Lab. 3

From Chapter 4

Sparse Signal Recovery (2)
n How to solve the L0 minimization problem?
$\hat{x} = \arg\min_z \|z\|_0$ subject to $z \in \mathcal{B}(y)$
$\mathcal{B}(y) = \{\, z : \Phi\Psi z = y \,\}$ (noise-free case); $\quad \mathcal{B}(y) = \{\, z : \|\Phi\Psi z - y\|_2 \le \varepsilon \,\}$ (noisy case)

n Note that $\|\cdot\|_0$ is a non-convex function.
n It is potentially very complex (NP-hard) to solve this minimization problem.

n L0 solution via L1 minimization:
$\hat{x} = \arg\min_z \|z\|_1$ subject to $z \in \mathcal{B}(y)$
n If $\mathcal{B}(y)$ is convex, this problem becomes computationally tractable!
n The solution prefers a sparse solution in $\mathcal{B}(y)$.

n Big Question: Will the L1 solution be similar to the L0 solution?
Digital Media Lab. 4


Use of Different Norms
n Solve the underdetermined system
y = F x where F Î R MxN ; x Î R N ; y Î R M ; M < N

xˆ = minN F x - y p
xÎR

n L2 norm (p=2): small penalty on small residual, strong penalty on


large residual.
n Mathematically tractable (Least-square solution is very well understood)

n L1 norm (p=1): most penalty on small residual, the least penalty on


large residual.
n L0 norm (p=0): zero penalty on zero (component) residual, but
identical penalty on non-zero (component) residual.
n The combinatorial nature of the solution makes it very hard to compute: an
NP-hard problem unless N is small (e.g., N < 10)

n Approximation of L0 norm with others?

Digital Media Lab. 5
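A tiny NumPy illustration (sizes and values are arbitrary, not from the lecture) of how the three penalties score the same residual vector:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 20, 50                       # underdetermined: M < N
Phi = rng.standard_normal((M, N))
x = np.zeros(N)
x[[3, 17, 41]] = [1.0, -2.0, 0.5]   # a 3-sparse ground truth
y = Phi @ x

r = y - Phi @ (x + 0.01 * rng.standard_normal(N))   # residual of a perturbed guess

print("L2 penalty :", np.linalg.norm(r, 2))         # squares large entries -> strong penalty
print("L1 penalty :", np.linalg.norm(r, 1))         # linear in each entry
print("L0 count   :", np.count_nonzero(np.abs(r) > 1e-12))  # just counts non-zeros
```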

Some Requirements to consider for recovery algorithms


n Minimal number of measurements
n How many measurements are required to recover K-sparse signals?
n Robust to measurement noise and model mismatch
n Measurement noise
        y = Fx + e,   where e is measurement noise and x ∈ Σ_K
n Model mismatch
        y = Fx,   where x is not necessarily K-sparse
n Computationally fast
n Computational speed is very important esp. in high-dimensional case.
n Considering the combinatorial nature of L0 optimization, it is important to
solve a large sparse approximation problem in a reasonable time.
n Performance guaranteed
n Instance-optimal or probabilistic guarantee?
n Only for exactly K-sparse signals, or for general signals?
n Under noise-free or noisy measurements?

Digital Media Lab. 6


Recovery Algorithms
n Category 1: Convex optimization approach (or convex relaxation)
n Replace the combinatorial problem with a convex optimization problem.
n Solve the convex-optimization problem with algorithms which can exploit
the problem structure.
n Category 2: Greedy pursuits
n Iteratively refine a sparse solution by successively identifying one or
more components that yield the greatest improvements in quality.
n Category 3: Bayesian framework
n Assume a prior distribution for the unknown coefficients favoring sparsity.
n Develop a maximum a posteriori estimator incorporating the observations.
n Identify a region of significant posterior mass or average over most-
probable models.
n Category 4: Other approaches
n Non-convex optimization method: relax the L0 problem to a related non-
convex problem and attempt to identify a stationary point.
n Brute force method: search through all possible support sets, possibly
using cutting-plane methods to reduce the number of possibilities.
n Heuristic method: based on belief-propagation and message-passing
techniques developed in graphical models and coding theory.
Digital Media Lab. 7

Convex Optimization Approach

Digital Media Lab. 8


Convex optimization-based method (1)
n For given y ∈ R^M and F ∈ R^(M×N), find x by solving the following convex
optimization problem:

        min_x { J(x) | y = Fx }               (noise-free case)
        min_x { J(x) | H(Fx, y) ≤ ε }         (noisy case)

where

n J(.): convex sparsity-promoting cost function


n J(x) has small value for sparse x
n H(.): cost function penalizing the distance between the vector Fx and y.
n Goodness of fit criterion

n The noisy measurement case can be put into an unconstrained formulation by
using a penalty parameter μ > 0 as:

        min_x { J(x) + μ H(Fx, y) }

n The parameter μ can be found by trial and error, or by a statistical
technique such as cross-validation.
n Actually, choosing a proper value of μ is a research problem in itself.

Digital Media Lab. 9

Convex optimization-based method (2)
n Ex: J(x) = ||x||p
n p=0 (L0 norm): directly measure sparsity (but hard to solve)
n p=1 (L1 norm): gives robustness against outliers

n Ex: H(Fx, y) = ||Fx − y||_p, for example p = 2

        min_x { J(x) | y = Fx }           →   min_x ||x||_0   subject to  y = Fx
        min_x { J(x) | H(Fx, y) ≤ ε }     →   min_x ||x||_0   subject to  ||Fx − y||_2 ≤ ε

n Ex: the noisy case can be modified in several ways:

        min_x ||Fx − y||_2                subject to  ||x||_0 ≤ K
        min_x { (1/2)||Fx − y||_2² + μ||x||_0 },   μ > 0

(Review: convexity, optimization, etc.)

Digital Media Lab. 10


Convex optimization-based method (3)
n Standard optimization packages cannot be used for real CS applications since
the number of unknowns (that is, the dimension of x) is very large.

n If there are no restrictions on the sensing matrix F and the signal x,


the solution to the sparse approximation is very complex (NP-hard).
n In practice, sparse approximation algorithms tend to be slow unless the
sensing matrix F admits a fast matrix-vector multiply (like fast transform
algorithm utilizing matrix structure).
n In case of compressible signal which needs some transformation first,
fast multiplication is possible when both the sensing (random) matrix and
sparsity basis are structured.
n Then, the question is how to incorporate more sophisticated signal
constraints into sparsity models.

Need 1~2 volunteers to investigate fast computation utilizing the structure of
the sensing matrix
Digital Media Lab. 11

L0 Approach (1)
n The L0 norm explicitly counts the number of nonzero components of the given data
n Directly related to the sparsity of a signal
n A function card(x): cardinality
n For scalar x:

        card(x) = 0  (x = 0)   and   1  (x ≠ 0)

n card(x) has no convexity properties.
n Note, however, that it is quasiconcave on R_+^n since, for x, y ≥ 0,

        card(x + y) ≥ min{ card(x), card(y) }

Digital Media Lab. From Prof. S. Boyd (EE364a, b) Stanford Univ


12
Rf: Quasiconvexity (1)
n Quasiconvex function: a real-valued function defined on an interval
or on a convex subset of a real vector space such that the inverse
image of any set of the form (-infinity, a) is a convex set.
n Informally, along any stretch of the curve, the highest point is one of the
endpoints.
n The negative of a quasiconvex function is said to be quasiconcave.

[Figure] A quasiconvex function that is not convex.
[Figure] A function that is not quasiconvex: the set of points in the domain
for which the function values are below the dashed red line is the union of
the two red intervals, which is not a convex set.

http://en.wikipedia.org/wiki/Quasiconvex_function
Digital Media Lab. 13

Rf: Quasiconvexity (2)
n Def: A function f : S → R defined on a convex subset S of a real vector
space is quasiconvex if, for all x, y ∈ S and λ ∈ [0,1],

        f( λx + (1−λ)y ) ≤ max( f(x), f(y) )
n Note that the points x and y, and the point directly between them, can be
points on a line or more generally points in n-dimensional space.
n In words, if it is always true that a point directly between two other
points does not give a higher value of the function than both of the other
points, then f is quasiconvex.
n An alternative way of defining a quasiconvex function is to require that
each sublevel set S_α(f) is a convex set:

        S_α(f) = { x | f(x) ≤ α }  ~ convex set

n A concave function can be quasiconvex. For example, log(x) is concave, and
it is quasiconvex.
n Any monotonic function is both quasiconvex and quasiconcave. More
generally, a function which decreases up to a point and increases from
that point on is quasiconvex (compare unimodality).
http://en.wikipedia.org/wiki/Quasiconvex_function
Digital Media Lab. 14
Rf: Quasiconvexity (3)
n Quasiconvexity is a generalization of convexity.
n All convex functions are also quasiconvex, but not all quasiconvex
functions are convex.

n A function that is both quasiconvex and quasiconcave is quasilinear.

[Figure] The probability density function of the normal distribution is
quasiconcave but not concave.
[Figure] A quasilinear function is both quasiconvex and quasiconcave.

http://en.wikipedia.org/wiki/Quasiconvex_function
Digital Media Lab. 15

L0 Approach (2)
n General convex-cardinality problem
n It refers to a problem that would be convex, except for the appearance of
card(·) in the objective or constraints.
n Example: For f, C: convex,
        Minimize  card(x)   subject to  x ∈ C
        Minimize  f(x)      subject to  x ∈ C,  card(x) ≤ K

n Solving the convex-cardinality problem: for x ∈ R^n,
n Fix a sparsity pattern of x (i.e., which entries are zero/nonzero), then
solve the resulting convex problem
n If we solve the 2^n convex problems associated with all possible sparsity
patterns, the convex-cardinality problem is solved completely.
n However, this is practically possible only for n ≤ 10
n General convex-cardinality problem is NP-hard.
n General convex-cardinality problem is NP-hard.

Digital Media Lab. From Prof. S. Boyd (EE364a, b) Stanford Univ


16
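For very small N this exhaustive search over sparsity patterns is easy to write down. A minimal NumPy sketch (sizes, sparsity level, and the use of least squares per support are illustrative assumptions):

```python
import itertools
import numpy as np

def l0_brute_force(Phi, y, K):
    """Enumerate all size-K supports, solve least squares on each, and keep the fit
    with the smallest residual. Only practical for very small N: C(N, K) supports."""
    N = Phi.shape[1]
    best_err, best_x = np.inf, None
    for support in itertools.combinations(range(N), K):
        cols = Phi[:, support]
        coef, *_ = np.linalg.lstsq(cols, y, rcond=None)
        err = np.linalg.norm(y - cols @ coef)
        if err < best_err:
            best_err = err
            best_x = np.zeros(N)
            best_x[list(support)] = coef
    return best_x, best_err

# tiny illustrative problem
rng = np.random.default_rng(1)
Phi = rng.standard_normal((6, 10))
x_true = np.zeros(10)
x_true[[2, 7]] = [1.5, -0.8]
x_hat, err = l0_brute_force(Phi, Phi @ x_true, K=2)
print(np.nonzero(x_hat)[0], err)    # recovers support {2, 7} with ~zero residual
```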
L0 Approach (3)
n Many forms of optimization problems
        Minimize ||x||_0             subject to  ||Fx − y||_2 ≤ ε
        Minimize ||Fx − y||_2        subject to  ||x||_0 ≤ K
        Minimize ||Fx − y||_2 + λ||x||_0

n L1-norm Heuristic
n Replace ||x||_0 with λ||x||_1, or add a regularization term λ||x||_1 to the objective function.
n λ is a parameter used to achieve the desired sparsity.

n More sophisticated versions use Σ_i w_i|x_i| or Σ_i w_i(x_i)_+ + Σ_i v_i(x_i)_−,
where the w_i and v_i are positive weights.

Digital Media Lab. From Prof. S. Boyd (EE364a, b) Stanford Univ 17

Rf: Reweighted L1 algorithm (1)
n (joint work of E. Candès, M. Wakin, and S. Boyd)

n Minimum-L0 recovery requires minimal oversampling but is intractable:

        min ||x||_0 = Σ_i 1{x_i ≠ 0}   subject to   y = Fx

n Observation: if x* is the solution to the combinatorial search and

        w_i* = 1/|x_i*|   if x_i* ≠ 0,        w_i* = ∞   if x_i* = 0,

then x* is also the solution to   min Σ_i w_i* |x_i|   subject to   y = Fx

Digital Media Lab. From CS Theory Lecture Notes by E. Candes, 2007 18


Rf: Reweighted L1 algorithm (2)
Initial step: w_i^(0) = 1 for all i
Loop: for j = 1, 2, 3, …
n Solve

        x̂^(j) = argmin_x  Σ_i w_i^(j−1) |x_i|   such that   y = Fx

n Update

        w_i^(j) = 1 / ( |x̂_i^(j)| + ε )

n Until convergence (typically 2~5 iterations)

n Intuition: down-weight large entries of x to mimic magnitude-


insensitive L0-penalty.

Digital Media Lab. From CS Theory Lecture Notes by E. Candes, 2007 19
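A minimal NumPy/SciPy sketch of the loop above (not from the lecture; the iteration count, ε, and the "highs" LP solver choice are illustrative assumptions). Each weighted-L1 subproblem is posed as a linear program by splitting x into nonnegative parts u, v with x = u − v.

```python
import numpy as np
from scipy.optimize import linprog

def weighted_l1(Phi, y, w):
    """min sum_i w_i |x_i|  s.t.  Phi x = y, posed as an LP with x = u - v, u, v >= 0."""
    M, N = Phi.shape
    c = np.concatenate([w, w])          # objective on the stacked variable [u; v]
    A_eq = np.hstack([Phi, -Phi])       # Phi (u - v) = y
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
    u, v = res.x[:N], res.x[N:]
    return u - v

def reweighted_l1(Phi, y, n_iter=5, eps=1e-3):
    """Reweighted-L1 loop: w_i^(0) = 1, then w_i = 1 / (|x_i| + eps) after each solve."""
    w = np.ones(Phi.shape[1])
    for _ in range(n_iter):             # typically 2~5 iterations suffice
        x = weighted_l1(Phi, y, w)
        w = 1.0 / (np.abs(x) + eps)     # down-weight large entries
    return x
```

With all weights equal to one, the first pass of `weighted_l1` is simply the equality-constrained L1 (basis pursuit) problem; the subsequent passes sharpen the solution toward the L0 behavior described on the previous slide.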

Rf: Reweighted L1 algorithm (3)
n Empirical performance

Digital Media Lab. From CS Theory Lecture Notes by E. Candes, 2007 20


L1 Approach (1)
n Connection between L1 norm and sparsity
n Known for a long time, early ’70s
n Mainly studied in Geophysics (literature on sparse spike trains)
n Key rough empirical fact is that “L1 returns sparse solution”

n Replace the combinatorial L0 function with the L1 norm, yielding a


convex optimization problem
n It makes the problem tractable !
n There can be several variants of the problem.

Digital Media Lab. From CS Theory Lecture Notes by E. Candes, 2007 21

L1 Approach (2)
n The L1 minimization problem
        min_{x ∈ R^N} ||x||_1   subject to   Fx = y,     F ∈ R^(M×N)

n There is always a solution with at most M non-zero terms


n In general, the solution is unique

n Similarly,

        min_{x ∈ R^N} ||Fx − y||_1,     F ∈ R^(M×N)

n There is always a solution whose residual r = y − Fx has at most (N − M) non-zero terms


n In general, the solution is unique

Digital Media Lab. From CS Theory Lecture Notes by E. Candes, 2007 22


L1 Approach (3)
n Variant
n Start with the minimum-cardinality problem (C: convex)
        Minimize  card(x)   subject to  x ∈ C
n Apply the heuristic to obtain the L1-norm minimization problem
        Minimize  ||x||_1   subject to  x ∈ C

n Variant
n Start with the cardinality-constrained problem (f, C: convex)
        Minimize  f(x)      subject to  x ∈ C,  card(x) ≤ K
n Apply the heuristic to obtain the L1-norm constrained problem
        Minimize  f(x)      subject to  x ∈ C,  ||x||_1 ≤ β
n Or the L1-regularized problem
        Minimize  f(x) + λ||x||_1   subject to  x ∈ C
n β and λ are adjusted so that card(x) ≤ K.

Digital Media Lab. From Prof. S. Boyd (EE364a, b) Stanford Univ


23

L1 Approach (4)
n Variant with polishing
n Use L1 heuristic to find x estimate with required sparsity
n Fix the sparsity pattern of x
n Re-solve the (convex) optimization problem with this sparsity pattern to
obtain final (heuristic) solution.

Digital Media Lab. From Prof. S. Boyd (EE364a, b) Stanford Univ


24
Some examples: convex optimization

From Computational Methods for Sparse… 2010 IEEE Proceedings by J.A. Tropp and J. Wright

Digital Media Lab. 25

Equality-constrained Problem
n Equality-constrained problem
n Among all x consistent with measurements, pick one with min L1 norm

        min_x ||x||_1   subject to   y = Fx        (C1)

From Computational Methods for Sparse… 2010 IEEE


Digital Media Lab. Proceedings by J.A. Tropp and J. Wright 26
Convex Relaxation Method
n Convex Relaxation Method
        min_x { (1/2)||Fx − y||_2² + μ||x||_1 },   μ ≥ 0        (C2)
n μ is a regularization parameter: it governs the sparsity of the solution
n a large μ typically produces sparser results.

n How to choose μ?
n One often needs to solve the problem repeatedly for different choices of μ, or to
trace systematically the path of solutions as μ decreases towards zero.

From Computational Methods for Sparse… 2010 IEEE


Digital Media Lab. Proceedings by J.A. Tropp and J. Wright 27

LASSO
n Least Absolute Shrinkage and Selection Operator (LASSO) method
n It is equivalent to the convex relaxation method (C2) in the sense that the
solution path of (C3), parameterized by positive β, matches the solution path
of (C2) as μ varies.

        min_x ||Fx − y||_2²   subject to   ||x||_1 ≤ β        (C3)

n Rf: its L0 version:   min_x ||Fx − y||_2²   subject to   ||x||_0 ≤ K

n One can interpret this as fitting the vector y as a linear combination of K
regressors (chosen from N possible regressors) ~ feature selection (in statistics).
n i.e., choose a subset of K regressors that (together) best fit or explain y.
n It can be solved (in principle) by trying all (N choose K) choices.

n Rf: An independent variable is also known as a "predictor variable", "regressor",


"controlled variable“, "manipulated variable", "explanatory variable", "feature" (see
machine learning and pattern recognition) or an "input variable.”
Digital Media Lab. From CS Tutorial at ITA 2008 by Baraniuk, Romberg, and Wakin 28
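If scikit-learn is available, the penalized (C2)-style form behind the LASSO path can be tried in a few lines. This is a hedged sketch, not part of the lecture: scikit-learn's Lasso minimizes (1/(2M))·||y − Fx||_2² + alpha·||x||_1 (with no intercept here), so its alpha roughly plays the role of μ/M; the problem sizes and alpha value below are arbitrary.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
M, N, K = 40, 100, 5
Phi = rng.standard_normal((M, N)) / np.sqrt(M)       # illustrative sensing matrix
x_true = np.zeros(N)
x_true[rng.choice(N, K, replace=False)] = rng.standard_normal(K)
y = Phi @ x_true

# Penalized form of the LASSO; alpha controls the sparsity of the estimate.
lasso = Lasso(alpha=1e-3, fit_intercept=False, max_iter=10000)
lasso.fit(Phi, y)

print("estimated support:", np.nonzero(np.abs(lasso.coef_) > 1e-6)[0])
print("true support:     ", np.sort(np.nonzero(x_true)[0]))
```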
Others
n Quadratic relaxation (LASSO)
n Explicit parameterization of the error norm

        min_x ||x||_1   subject to   ||Fx − y||_2 ≤ ε        (C4)

n Dantzig selector (with residual correlation constraints)

        min_x ||x||_1   subject to   ||F^T (Fx − y)||_∞ ≤ ε

Digital Media Lab. From CS Tutorial at ITA 2008 by Baraniuk, Romberg, and Wakin 29

Further study (Volunteer?)


n Other optimization algorithms:
n interior point methods (slow, but extremely accurate)

n homotopy methods (fast and accurate for small-scale problems)

Digital Media Lab. 30


Gradient Method (1)
n (also known as first-order methods) iteratively solve the following problem:

        min_x { (1/2)||Fx − y||_2² + μ||x||_1 },   μ ≥ 0        (C2)

n Similar methods under this category


n Operator splitting [65]
n Iterative splitting and thresholding (IST) [66]
n Fixed-point iteration [67]
n Sparse reconstruction via separable approximation (SpaRSA) [68]
n TwIST [70]
n GPSR [71]

From Computational Methods for Sparse… 2010 IEEE


Digital Media Lab. Proceedings by J.A. Tropp and J. Wright 31

Gradient Method (2)


n Gradient-descent framework
• Input: a signal y ∈ R^M, sensing matrix F ∈ R^(M×N), regularization parameter
  μ > 0, and initial estimate x_0
• Output: coefficient vector x ∈ R^N

• Algorithm:
(1) Initialize: set k = 1.
(2) Iterate: choose α_k ≥ 0 and the coefficient vector x_k^+ from

        x_k^+ := argmin_z  (z − x_k)^T F^T(Fx_k − y) + (α_k/2)||z − x_k||_2² + μ||z||_1

    If an acceptance test on x_k^+ is not passed, increase α_k by some factor and repeat.
(3) Line search: choose γ_k ∈ (0,1] and obtain x_{k+1} from

        x_{k+1} := x_k + γ_k ( x_k^+ − x_k )

(4) Test: if the stopping criterion holds, terminate with x = x_{k+1}. Otherwise,
    set k ← k+1 and go to (2).
Digital Media Lab. 32
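One simple member of this family is the IST/ISTA-type iteration: a gradient step on the smooth term followed by soft thresholding (the proximal operator of the L1 term). The NumPy sketch below is a simplified instance of the framework above with a fixed step size 1/L and no acceptance test or line search; it is illustrative only, not the SpaRSA/GPSR implementations cited earlier.

```python
import numpy as np

def soft_threshold(z, t):
    """Component-wise soft thresholding: proximal operator of t * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ista(Phi, y, mu, n_iter=500):
    """Minimize 0.5 * ||Phi x - y||_2^2 + mu * ||x||_1 by iterative shrinkage.
    Step size 1/L with L the squared spectral norm of Phi (Lipschitz constant)."""
    L = np.linalg.norm(Phi, 2) ** 2
    x = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        grad = Phi.T @ (Phi @ x - y)          # gradient of the smooth part
        x = soft_threshold(x - grad / L, mu / L)
    return x
```

A continuation strategy (next slide) would simply call `ista` repeatedly with a decreasing sequence of `mu`, warm-starting each call from the previous solution.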
Gradient Method (3)
n This gradient-based method works well on sparse signals when the
dictionary F satisfies RIP.
n It benefits from warm starting, that is, the work required to identify a
solution can be reduced dramatically when the initial estimate of x is close
to the solution.

n Continuation strategy
n Solve the optimization problem (C2) for a decreasing sequences of m
using the approximate solution for each value as the starting point for the
next sub-problem.

From Computational Methods for Sparse… 2010 IEEE


Digital Media Lab. Proceedings by J.A. Tropp and J. Wright 33

Review:
Convex Optimization

Digital Media Lab. 34


References (1)
n Introduction to Optimization
n http://ocw.mit.edu/courses/electrical-engineering-and-computer-
science/6-079-introduction-to-convex-optimization-fall-2009/index.htm

Digital Media Lab. 35

References (2)
n Convex Optimization (EE364a by Prof. Boyd)
n http://www.stanford.edu/class/ee364a/lectures.html
n Video lecture is also available Introduction
Convex sets
Convex functions
Convex optimization problems
Duality
Approximation and fitting
Statistical estimation
Geometric problems
Numerical linear algebra background
Unconstrained minimization
Equality constrained minimization
Interior-point methods
Conclusions
Lecture slides in one file.
Additional lecture slides:
Convex optimization examples
Stochastic programming
Chance constrained optimization
Filter design and equalization
Disciplined convex programming and CVX
Two lectures from EE364b:
methods for convex-cardinality problems
methods for convex-cardinality problems, part II

Digital Media Lab. 36


Mathematical Optimization Problem
n Optimization problem

Digital Media Lab. From Prof. S. Boyd (EE364a, b) Stanford Univ


37

Solving Optimization Problem


n General optimization problem
n Very difficult to solve
n Methods involve some compromise, e.g., very long computation time, or
not always finding the solution

n Exceptions : certain problem classes can be solved efficiently and


reliably
n Least-squares problems:   min_x ||Fx − y||_2²
n Analytical solution:

        x* = ( F^T F )^(−1) F^T y
n Linear programming problems:   min_x c^T x   subject to   a_i^T x ≤ y_i,  i = 1, …, m
n No analytical formula for the solution
n Reliable and efficient algorithms and software ~ a mature technology

n Convex optimization problems:   min_x f_0(x)   subject to   f_i(x) ≤ y_i,  i = 1, …, m
n Objective and constraint functions are convex
n Includes least-squares problems and linear programming as special cases

Digital Media Lab. From Prof. S. Boyd (EE364a, b) Stanford Univ


38
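As a quick numerical check of the least-squares formula above, a minimal NumPy example (the sizes are arbitrary; in practice `lstsq` is preferred over forming the normal equations explicitly):

```python
import numpy as np

rng = np.random.default_rng(0)
Phi = rng.standard_normal((30, 5))      # tall matrix: over-determined least squares
y = rng.standard_normal(30)

x_normal_eq = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)   # x* = (Phi^T Phi)^{-1} Phi^T y
x_lstsq, *_ = np.linalg.lstsq(Phi, y, rcond=None)       # numerically preferred route
print(np.allclose(x_normal_eq, x_lstsq))                # True
```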
Optimization problem in standard form

Digital Media Lab. 39

Optimal & locally optimal points

Digital Media Lab. 40


Implicit constraints

Digital Media Lab. 41

Convex Set and Others (1)


n Def: A set Ω is convex if and only if for any x_1, x_2 ∈ Ω and any θ with
0 ≤ θ ≤ 1, the convex combination x = θx_1 + (1−θ)x_2 ∈ Ω
n [Figure] Examples: one convex set and two non-convex sets

n Def: Convex combination of x1,. . ., xk: any point x of the form x = θ1x1
+ θ2x2 + ··· + θkxk with θ1 + ··· + θk =1, θj ≥ 0.

n Def: Convex hull (conv S): a set of all convex combinations of points
in S

Digital Media Lab. From Prof. S. Boyd (EE364a, b) Stanford Univ


42
Convex Set and Others (2)
n Def: Conic (nonnegative) combination of x_1 and x_2: any point of the
form x = θ_1 x_1 + θ_2 x_2 with θ_1 ≥ 0, θ_2 ≥ 0.

n Def: Convex cone: a set that contains all conic combinations of points
in the set

Digital Media Lab. From Prof. S. Boyd (EE364a, b) Stanford Univ


43

Convex function
n Def: A function f(x): Ω → R is convex if and only if, for all x_1, x_2 ∈ Ω
and θ with 0 ≤ θ ≤ 1, any convex combination x = θx_1 + (1−θ)x_2 satisfies
f(θx_1 + (1−θ)x_2) ≤ θf(x_1) + (1−θ)f(x_2).

n Note that f is concave if (−f) is convex


n A function f is strictly convex iff f{θx1 + (1-θ) x2} < θf(x1)+(1−θ)f(x2).

Digital Media Lab. From Prof. S. Boyd (EE364a, b) Stanford Univ


44
1st order Condition
n Def: A function f is differentiable if dom f is open and the gradient

        ∇f(x) = ( ∂f/∂x_1, …, ∂f/∂x_n )

exists at each x ∈ dom f.

n Def: 1st-order condition: A differentiable function f with convex


domain is convex iff f(y) ≥ f(x)+∇f(x)T(y−x) for all x,y ∈dom f.

Digital Media Lab. From Prof. S. Boyd (EE364a, b) Stanford Univ


45

2nd order Condition


n Def: A function f is twice differentiable if dom f is open and the
Hessian ∇²f(x) ∈ S^n exists at each x ∈ dom f.

        ( ∇²f(x) )_ij = ∂²f / (∂x_i ∂x_j)   for 1 ≤ i, j ≤ N

n Def: 2nd-order condition: a twice-differentiable function f with convex
domain is convex if and only if its Hessian is positive semidefinite,

        ∇²f(x) ⪰ 0   for all x ∈ dom f

n Strict convexity: if ∇²f(x) ≻ 0 (positive definite), then f is strictly convex.

Digital Media Lab. From Prof. S. Boyd (EE364a, b) Stanford Univ


46
2012 Fall

Data Compression
(ECE 5546-41)

Ch 5. Algorithms for Sparse Recovery


Part 2

Byeungwoo Jeon
Digital Media Lab, SKKU, Korea
http://media.skku.ac.kr; bjeon@skku.edu
Digital Media Lab.

Recovery Algorithms (1)


n Category 1: Convex optimization approach (or convex relaxation)
n Replace the combinatorial problem with a convex optimization problem.
n Solve the convex-optimization problem with algorithms which can exploit
the problem structure.

n Category 2: Greedy algorithms


n Greedy pursuits
n Iteratively refine a sparse solution by successively identifying one or more
components that yield the greatest improvements in quality.
n In general, they are very fast and applicable to very large datasets; however,
their theoretical performance guarantees are typically weaker than those of
some other methods.
n Thresholding algorithms
n These methods alternate element selection and element pruning steps. They are
often very easy to implement and can be relatively fast.
n They have theoretical performance guarantees that rival those derived for
convex optimization-based approaches.

Digital Media Lab. 2


Recovery Algorithms (2)
n Category 3: Bayesian framework
n Assume a prior distribution for the unknown coefficients favoring sparsity.
n Develop a maximum a posteriori estimator incorporating the observations.
n Identify a region of significant posterior mass or average over most-
probable models.

n Category 4: Other approaches


n Non-convex optimization method: relax the L0 problem to a related non-
convex problem and attempt to identify a stationary point.
n Brute force method: search through all possible support sets, possibly
using cutting-plane methods to reduce the number of possibilities.
n Heuristic method: based on belief-propagation and message-passing
techniques developed in graphical models and coding theory.

Digital Media Lab. 3

Greedy Algorithms

A greedy algorithm is an algorithm that follows the problem solving heuristic of


making the locally optimal choice at each stage with the hope of finding a global
optimum. In many problems, a greedy strategy does not in general produce an
optimal solution, but nonetheless a greedy heuristic may yield locally optimal
solutions that approximate a global optimal solution in a reasonable time.

Digital Media Lab. http://en.wikipedia.org/wiki/Greedy_algorithm 4


Greedy Algorithm (1)
n Starting at A, a greedy algorithm (GA) will find the local maximum at
"m", instead of the global maximum at "M".

Search global
maximum starting
from A?

Digital Media Lab. http://en.wikipedia.org/wiki/Greedy_algorithm 5

Greedy Algorithm (2)


n The greedy algorithm determines the
minimum number of coins to give while
making change. These are the steps a Ex: How to pay 36 cents
human would take to emulate a greedy using only coins with
algorithm to represent 36 cents using values {1, 5, 10, 20}?
only coins with values {1, 5, 10, 20}.

n The coin of the highest value, less


than the remaining change owed, is
the local optimum.
n (Note that in general the change-
making problem requires dynamic
programming or integer programming
to find an optimal solution.)
n However, most currency systems, including the Euro and the US Dollar, are
special cases where the greedy strategy does find an optimal solution.

Digital Media Lab. http://en.wikipedia.org/wiki/Greedy_algorithm 6


Sparse Signal Recovery via Greedy Algorithm
n Problem:
For y=Fx and for x assumed to be sparse (or compressible), find x̂
satisfying
        x̂ = argmin_z ||z||_0   subject to   z ∈ B(y)

where B(y) ensures that x̂ is consistent with the measurements y.

        B(y) = { z | Fz = y }               (noise-free case)
        B(y) = { z | ||Fz − y||_2 ≤ ε }     (noisy case)

n This problem can be rewritten as

        min_Ω |Ω|   such that   y = Σ_{i∈Ω} x_i f_i

n Ω denotes a particular subset of the indices i = 1, …, N, and f_i denotes
the i-th column of F.

n Use greedy algorithm to find the index set W.

Digital Media Lab. 7

Rf: Greedy Algorithms
n Greedy algorithms have been called in different terms in other fields
n Statistics: Forward stepwise regression
n Nonlinear approximation: Pure greedy algorithm
n Signal Processing: Matching pursuit
n Radio Astronomy: CLEAN algorithm

Digital Media Lab. 8


Basic idea of Pursuit algorithm (1)
n Problem to solve

        (P0):   x̂ = argmin_z ||z||_0   subject to   y = Fz

n The solution needs two sub-processes:


n Element selection: find support of solution: i.e., supp(x)
n Coefficient finding: find non-zero components of x over the support
n Combinatorial nature of element selection and coefficient finding:
n Example: suppose we recover a K-sparse signal (assume K is known)
n Support of the solution: (N choose K) ~ O(N^K) possibilities
n Non-zero components of x over the support: once the support is known,
a plain least-squares solution finds the coefficients

n Ex: how to solve the problem when K = 1?
n Find a column of F by minimizing

        err(j) = min_{x_j} ||y − x_j f_j||_2²   over  j = 1, …, N

n This requires testing each column of F → N tests ~ O(MN)

Digital Media Lab. 9

Basic idea of Pursuit algorithm (2)


n Note that the suitability of each column can be checked by the minimal
(approximation) error achievable with that column: for j = 1, …, N,

        err(j) = ||y − x_j f_j||_2² = ⟨y − x_j f_j, y − x_j f_j⟩ = (y − x_j f_j)^T (y − x_j f_j)
               = ||y||_2² − 2 x_j f_j^T y + x_j² f_j^T f_j

n This approximation error is minimized by

        d err(j)/d x_j = −2 f_j^T y + 2 x_j f_j^T f_j = 0   →   x̂_j = f_j^T y / ||f_j||_2²

n The minimum error for the j-th column is

        err(j)_min = ||y||_2² − (f_j^T y)² / ||f_j||_2²   →   find j such that (f_j^T y)² / ||f_j||_2² is maximum!

n The solution is to choose the column which maximizes (f_j^T y)² / ||f_j||_2²

Digital Media Lab. 10
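The scoring rule just derived can be vectorized over all columns at once. A minimal NumPy sketch (assuming nothing beyond the formulas on this slide; names are illustrative):

```python
import numpy as np

def best_single_column(Phi, y):
    """Score every column j by err(j)_min = ||y||^2 - (f_j^T y)^2 / ||f_j||^2 and
    return the best column index together with its optimal coefficient x_j."""
    corr = Phi.T @ y                               # f_j^T y for all j at once
    col_norms_sq = np.sum(Phi ** 2, axis=0)        # ||f_j||^2
    err_min = np.dot(y, y) - corr ** 2 / col_norms_sq
    j = int(np.argmin(err_min))                    # same as argmax of corr**2 / col_norms_sq
    return j, corr[j] / col_norms_sq[j]
```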


Basic idea of Pursuit algorithm (3)
n Pre-normalization: assume that all the columns of F are normalized by
multiplying by a normalizing matrix W:

        F ← FW,   where F ∈ R^(M×N), W ∈ R^(N×N),   W = diag( 1/||f_1||_2, …, 1/||f_N||_2 )

n Under this pre-normalization, the solution of the pursuit algorithm can be
found simply by identifying the column maximizing (f_j^T y)²
n As a final step, the solution vector x should be post-normalized by x ← Wx
n A theorem shows that the normalization does not change the solution.
n From now on, assume pre-normalization without loss of generality.
Digital Media Lab. 11

Basic idea of Pursuit algorithm (4)


n Suppose K > 1: since y is a linear combination of K columns of F, the
problem is to find a subset of F consisting of K columns.
n Need to enumerate (N choose K) ~ O(N^K) combinations

n Greedy algorithm (Pursuit-based methods): instead of the exhaustive


search, select column one by one in favor of local optimum.
n Starting from x^(0) = 0 (residual r^(0) = y), it iteratively constructs a K-term
approximation by maintaining a set of active columns and, at each stage,
expanding the set by one additional column.
n The additional column at each stage is the one which maximally reduces
the residual error (in L2 sense) in approximating the measurement y using
the currently active columns.
n Residual: as-yet “unexplained” portion of the measurement
n After constructing an approximation including a new column, a new
residual vector is computed by subtracting the approximation represented
by the newly selected column from the current residual.
n A new residual L2 error is evaluated: if it falls below a threshold, the
algorithm terminates. Otherwise, looks for another column.

Digital Media Lab. 12


Various Pursuit Algorithms
n Matching Pursuit (MP) ~ also known as “pure greedy algorithm”

n Orthogonal Matching Pursuit (OMP)

n Weak-Matching Pursuit

n These algorithms all belong to Greedy algorithms (GA)


n Its variants include
n Pure GA(PGA)
n Orthogonal GA(OGA)
n Relaxed GA(RGA)
n Weak GA (WGA)

n Rf: “At this point, it is not fully clear what role greedy pursuit algorithms
will ultimately play in practice.” From Computational Methods for Sparse… 2010 IEEE
Proceedings by J.A. Tropp and J. Wright

Digital Media Lab. 13

Category 2: Greedy algorithms


n Greedy pursuits
n Thresholding algorithms

Digital Media Lab. 14


Matching Pursuit (MP) (1)
n First proposed by Mallat and Zhang*: iterative greedy algorithm that
decomposes a signal into a linear combination of elements from a
dictionary (i.e., sensing matrix).
Inputs: sensing matrix F, measurement vector y, error threshold ε_0
Outputs: a sparse signal x
· Initialize: set k = 0, index set Ω^(0) = ∅, and residual r^(0) = y.
· Main iteration: increment k by 1 and perform the following:
  (a) Sweep: compute err(j) = min_{λ_j} ||r^(k−1) − λ_j f_j||_2² for all j = 1, …, N,
      with optimal choice λ_j* = f_j^T r^(k−1).
  (b) Update support: find the column i of F most correlated with the residual,
        i = argmin_{1≤j≤N} err(j) = argmax_{1≤j≤N} |⟨r^(k−1), f_j⟩|,
      and update the support Ω^(k) = Ω^(k−1) ∪ {i}
  (c) Update provisional solution: x^(k) = x^(k−1) with the updated entry
      x^(k)(i) = x^(k−1)(i) + λ_i*
  (d) Update residual: r^(k) = y − F x^(k) = r^(k−1) − λ_i* f_i
  (e) Stopping rule: if ||r^(k)||_2 < ε_0, stop. Otherwise, apply another iteration.
· Output: the proposed solution is x^(k) obtained after k iterations.


*S. Mallat and Z. Zhang, “Matching pursuits with time-frequency dictionaries,”
IEEE Trans. Signal Processing, 41(12):3397–3415, 1993.
Digital Media Lab. 15
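A compact NumPy sketch of the MP loop above, assuming the columns of F have been pre-normalized to unit norm as discussed earlier; the threshold ε_0 and the iteration cap are illustrative choices, not from the slide.

```python
import numpy as np

def matching_pursuit(Phi, y, eps0=1e-6, max_iter=1000):
    """Matching pursuit with unit-norm columns assumed."""
    x = np.zeros(Phi.shape[1])
    r = y.copy()
    for _ in range(max_iter):
        corr = Phi.T @ r                      # sweep: lambda_j* = f_j^T r^(k-1)
        i = int(np.argmax(np.abs(corr)))      # column most correlated with the residual
        x[i] += corr[i]                       # update only the selected coefficient
        r -= corr[i] * Phi[:, i]              # r^(k) = r^(k-1) - lambda_i* f_i
        if np.linalg.norm(r) < eps0:
            break
    return x
```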

Matching Pursuit (MP) (2)


n The update in each iteration with the selected coefficient minimizes the
approximation error ||y − Fx^(k)||_2².
n It is known that ||r^(k)||_2 converges linearly to zero if the columns of F
span R^M.
n Therefore MP will stop in a finite number of iterations if the norm of r^(k)
is used to define a stopping criterion for the algorithm.

n The approximation is incremental, that is, it selects one column from


F at a time, at each iteration, and updates only the coefficient
associated with the selected column.
n MP will generally repeatedly select the same column from F in order to
further refine the approximation.

n Major drawbacks of MP:


n No guarantees in terms of recovery error (???)
n It does not exploit the special structure present in the sensing matrix F.
n Computational infeasibility: the required number of iterations can be quite
large
n Complexity of MP ~ O(MNT) (T: number of MP iterations)

Digital Media Lab. 16


Orthogonal Matching Pursuit (OMP) (1)
n OMP (orthogonal MP): it projects the residual onto the orthogonal
subspace corresponding to the linear span of the currently selected
dictionary elements.

n In contrast to MP, the minimization (in the step of update provisional


solution) is performed with respect to all of the currently selected
coefficients.
        x^(k) = argmin_{z, supp(z)=Ω^(k)} ||y − Fz||_2²

n Unlike MP, OMP never re-selects an element already chosen and the
residual at any iteration is always orthogonal to all currently selected
elements.

Digital Media Lab. 17

Orthogonal Matching Pursuit (OMP) (2)


Inputs: sensing matrix F, measurement vector y, error threshold ε_0
Outputs: a sparse signal x
· Initialize: set k = 0, index set Ω^(0) = ∅, and residual r^(0) = y.
· Main iteration: increment k by 1 and perform the following:
  (a) Sweep: compute err(j) = min_{λ_j} ||r^(k−1) − λ_j f_j||_2² for 1 ≤ j ≤ N,
      j ∉ Ω^(k−1), with optimal choice λ_j* = f_j^T r^(k−1).
  (b) Update support: find a column i of F that is not already in Ω^(k−1) such that
        i = argmin_{1≤j≤N, j∉Ω^(k−1)} err(j) = argmax_{1≤j≤N, j∉Ω^(k−1)} |⟨r^(k−1), f_j⟩|,
      and update the support Ω^(k) = Ω^(k−1) ∪ {i}
  (c) Update provisional solution: compute the x^(k) which minimizes ||y − Fx||_2²
      subject to support Ω^(k):
        x^(k) = argmin_{z, supp(z)=Ω^(k)} ||y − Fz||_2²
  (d) Update residual: r^(k) = y − F x^(k)
  (e) Stopping rule: if ||r^(k)||_2 < ε_0, stop. Otherwise, apply another iteration.
· Output: the proposed solution is x^(k) obtained after k iterations.

Digital Media Lab. 18
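A minimal NumPy sketch of the OMP iteration above. The least-squares step uses numpy.linalg.lstsq rather than an explicit QR or Cholesky update; the stopping threshold and iteration cap are illustrative assumptions.

```python
import numpy as np

def omp(Phi, y, eps0=1e-6, max_iter=None):
    """Orthogonal matching pursuit: after each new column is selected, all active
    coefficients are re-fit by least squares, so the residual stays orthogonal
    to every selected column."""
    M, N = Phi.shape
    support = []
    x = np.zeros(N)
    r = y.copy()
    for _ in range(max_iter or M):
        corr = Phi.T @ r
        corr[support] = 0.0                           # never re-select a chosen column
        i = int(np.argmax(np.abs(corr)))
        support.append(i)
        coef, *_ = np.linalg.lstsq(Phi[:, support], y, rcond=None)
        x = np.zeros(N)
        x[support] = coef
        r = y - Phi @ x
        if np.linalg.norm(r) < eps0:
            break
    return x
```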


Orthogonal Matching Pursuit (OMP) (3)
n Update provisional solution:
  (c) Compute the x^(k) which minimizes ||y − Fx||_2² subject to support Ω^(k):
        x^(k) = argmin_{z, supp(z)=Ω^(k)} ||y − Fz||_2²
n Let F_Ω(k) ∈ R^(M×|Ω^(k)|) be the submatrix of F whose columns are selected by Ω^(k)
n Solve the minimization problem min ||y − F_Ω(k) x_Ω(k)||_2², where x_Ω(k) is
the non-zero portion of the vector x:

        F_Ω(k)^T ( y − F_Ω(k) x_Ω(k) ) = 0   →   x_Ω(k) = ( F_Ω(k)^T F_Ω(k) )^(−1) F_Ω(k)^T y = F_Ω(k)† y

n Note that r^(k) = y − F_Ω(k) x_Ω(k) and F_Ω(k)^T r^(k) = 0
n The columns of F selected by Ω^(k) are therefore orthogonal to r^(k).
n This implies that those columns will not be chosen again later for the support.
→ Orthogonal matching pursuit!

Digital Media Lab. 19

Orthogonal Matching Pursuit (OMP) (4)


n Various techniques for solving the least-square problems have been
proposed such as QR factorization, Cholesky decomposition, or
iterative techniques such as conjugate gradient methods.
n Complexity of OMP ~ O(MNK) (K: signal sparsity)
n It includes additional complexity for orthogonalization at each iteration.
n Exact K iterations are needed to approximate y with K columns of F.

n OMP is relatively fast and can be shown to lead to exact recovery.


n However, the guarantee accompanying OMP for sparse recovery is
weaker than those associated with convex optimization techniques.
n Not very robust to noise: it is not certain whether a small amount of
measurement noise perturbs the solution only slightly.
n Nevertheless, the OMP is an efficient method for CS recovery
especially when the signal sparsity K is low.
n It is ineffective when the signal is not very sparse.
n (In a large scale setting, the Stagewise OMP (StOMP) is a better choice).

Digital Media Lab. 20


Orthogonal Matching Pursuit (OMP) (5)
n Approaches to improve OMP
n Select multiple columns per iteration
n Pruning the set of active columns at each step
n Solving the least square problem iteratively
n Design an algorithm whose theoretical analysis is supported by the RIP
bound ~ regularized OMP

Digital Media Lab. 21

Weak Matching Pursuit (Weak-MP)
n Key idea: rather than searching for the largest inner-product value as in MP,
settle for the first one found that exceeds a t-weakened threshold.
n It is a simplification of MP (thus suboptimal to MP): the update-support stage
of MP is relaxed by choosing any index that is within a factor t (in the range
(0,1)) of the optimal choice.
n Note that the inequality

        (f_j^T r^(k−1))² / ||f_j||_2²  ≤  max_{1≤j≤N} (f_j^T r^(k−1))² / ||f_j||_2²  ≤  ||r^(k−1)||_2²

sets an upper bound on the maximal achievable inner product.
n Thus, we can compute ||r^(k−1)||_2² at the beginning of the sweep stage, and
as we search for the index i that gives the smallest err(i), we choose the
first one that gives

        (f_i^T r^(k−1))² / ||f_i||_2²  ≥  t² ||r^(k−1)||_2²  ≥  t² max_{1≤j≤N} (f_j^T r^(k−1))² / ||f_j||_2²,
        for a prechosen t ∈ (0,1)

n If no index satisfying the inequality is found, the maximum can be chosen.


n Faster than MP, but suboptimal to MP.
Digital Media Lab. 22
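A tiny sketch of just the weakened selection rule (unit-norm columns assumed; the rest of the Weak-MP loop is otherwise identical to the MP code shown earlier; names are illustrative):

```python
import numpy as np

def weak_select(Phi, r, t):
    """Weak-MP support update with unit-norm columns: return the first column whose
    squared inner product with the residual exceeds t^2 * ||r||^2; if none passes,
    fall back to the overall maximum as in plain MP."""
    thresh = (t ** 2) * np.dot(r, r)
    corr = Phi.T @ r
    for j, c in enumerate(corr):
        if c ** 2 >= thresh:
            return j
    return int(np.argmax(np.abs(corr)))
```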
Stagewise OMP (StOMP)
n It has considerable computational advantages over L1 optimization
and the OMP for large-scale sparse problem.

n In the update-support stage of OMP, it uses a threshold t to determine the
next best set of columns of F, namely those whose correlations with the
current residual exceed t.
n Afterwards, the new residual is calculated using a least square estimate
of the signal using the expanded set of columns same as OMP.

n Unlike OMP, the number of iterations in StOMP is fixed and chosen
beforehand (e.g., 10 iterations)

n Complexity of StOMP ~ O(KNlogN)


n Significant improvement of OMP

n Drawbacks:
n No reconstruction guarantee
n Moderate memory requirements compared to OMP since
orthogonalization requires maintenance of a Cholesky factorization of the
dictionary elements.
Digital Media Lab. 23

Thresholding Algorithm
n Idea: choose the K largest inner products as the desired support.
n This implies that the search for the K elements of the support amounts to a
simple sorting of the entries of the vector |F^T y|.
n If the number K is not known a priori, it can be increased until the error
||y − Fx^(k)||_2² reaches a pre-specified value ε_0.

Inputs: sensing matrix F, measurement vector y, number of atoms desired K
Outputs: a sparse signal x
· Quality evaluation: compute err(j) = min_{λ_j} ||y − λ_j f_j||_2² for all j,
  using the optimal choice λ_j* = f_j^T y.
· Update support: find a set of indices Ω of cardinality K that contains the
  smallest errors:
        ∀ j ∈ Ω,   err(j) ≤ min_{i ∉ Ω} err(i)
· Update provisional solution: compute the x which minimizes ||y − Fx||_2²
  subject to support Ω:
        x = argmin_{z, supp(z)=Ω} ||y − Fz||_2²
· Output: the proposed solution is the x which minimizes ||y − Fx||_2² subject
  to support Ω.

Digital Media Lab. 24
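A minimal NumPy sketch of this one-shot thresholding recovery, assuming unit-norm columns so that the smallest err(j) correspond to the largest |f_j^T y| (the names and the tolerance-free interface are illustrative):

```python
import numpy as np

def simple_thresholding(Phi, y, K):
    """Take the K columns with the largest |f_j^T y| as the support, then solve a
    least-squares problem restricted to that support."""
    N = Phi.shape[1]
    support = np.argsort(np.abs(Phi.T @ y))[-K:]        # K largest inner products
    coef, *_ = np.linalg.lstsq(Phi[:, support], y, rcond=None)
    x = np.zeros(N)
    x[support] = coef
    return x
```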


Category 2: Greedy algorithms
n Greedy pursuits
n Thresholding algorithms

Digital Media Lab. 25

Thresholding type algorithms


n Greedy pursuits are easy to implement and use, and they can be extremely
fast. However, they offer no recovery guarantees as strong as those of
methods based on convex relaxation.

n Thresholding-type algorithms bridge the gap: they are fairly easy to
implement and use, can be extremely fast, and come with strong performance
guarantees comparable to those of methods based on convex relaxation.
n Iterative Hard Thresholding (IHT)
n Compressive Sampling Matching Pursuit (CoSaMP)
n Subspace Pursuit (SP)

Digital Media Lab. 26


Iterative Hard Thresholding (IHT)
n It seeks an s-sparse representation x of a signal y via the iteration:

        x^(0) = 0
        r^(k) = y − F x^(k)
        x^(k+1) = [ x^(k) + F^T r^(k) ]_s

(rf: [z]_s ~ restriction of a vector z to its s components largest in magnitude)

n It has been proved that this sequence of iterations converges to a fixed
point and that, if the matrix F possesses the RIP, the recovered signal x
satisfies an instance-optimality guarantee of the selected type.

n Empirical evidence also shows that thresholding is reasonably


effective for solving sparse approximation problems in practice.
n (However, some simulations also indicate that simple thresholding techniques
behave poorly in the presence of noise.)

Digital Media Lab. 27
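A minimal NumPy sketch of the IHT iteration above. For the plain iteration to behave well, the operator norm of F should be at most 1, so the sketch rescales F and y by ||F||_2 (an illustrative convergence safeguard, not something stated on the slide); the iteration count is also arbitrary.

```python
import numpy as np

def hard_threshold(z, s):
    """Keep the s entries of z that are largest in magnitude, zero out the rest."""
    out = np.zeros_like(z)
    keep = np.argsort(np.abs(z))[-s:]
    out[keep] = z[keep]
    return out

def iht(Phi, y, s, n_iter=200):
    """Iterative hard thresholding: x^(k+1) = H_s( x^(k) + Phi^T (y - Phi x^(k)) )."""
    scale = np.linalg.norm(Phi, 2)        # rescale so ||Phi||_2 <= 1 (same solution set)
    Phi, y = Phi / scale, y / scale
    x = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        x = hard_threshold(x + Phi.T @ (y - Phi @ x), s)
    return x
```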

CoSaMP (1)
n Compressive Sampling Matching Pursuit (CoSaMP)

n The idea of CoSaMP and subspace pursuit (SP) is very similar.


n It keeps track of an active set T of nonzero elements and both adds and
removes elements in each iteration.
n At the beginning of each iteration, a K-sparse estimate x^(k) is used to
calculate the residual error y − F x^(k).
n The index set of the columns of F with the K (or 2K) largest inner products
with the residual is then selected and added to the support set, giving an
intermediate support T^(k+0.5).
n An intermediate estimate x^(k+0.5) is then calculated as the least-squares
solution argmin || y − F x_{T^(k+0.5)} ||_2² restricted to T^(k+0.5).
n The largest K elements of this intermediate estimate are used as the new
support set T^(k+1).
n In the last step:
n CoSaMP takes, as the new estimate, the intermediate estimate x^(k+0.5)
restricted to the new smaller support set T^(k+1).
n SP solves a second least-squares problem restricted to this reduced support.
Digital Media Lab. 28
CoSaMP (2)
Inputs: sensing matrix F, measurement vector y, target sparsity s, tuning parameter α
Outputs: s-sparse coefficient vector x
· Initialize: set k = 0, initial vector x^(0) = 0, and residual r^(0) = y.
· Main iteration: increment k by 1 and perform the following:
  (a) Identify: find the (at most) αs columns of F most strongly correlated with
      the residual:
        Ω ∈ argmax_{|T| ≤ αs}  Σ_{j∈T} |⟨ r^(k−1), f_j ⟩|
  (b) Merge: put the old and new columns into one set  T = supp( x^(k−1) ) ∪ Ω
  (c) Estimate: find the best coefficients for approximating y with these columns:
        z* = argmin_z || y − F_T z ||_2²
  (d) Prune: retain the s largest coefficients:  x^(k) = [ z* ]_s
  (e) Iterate: update the residual:  r^(k) = y − F x^(k)
  (f) Repeat (a)-(e) until the stopping criterion holds.
· Output: the proposed solution is x^(k).
(rf: [z]_s ~ restriction of a vector z to its s components largest in magnitude)

Digital Media Lab. 29
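A minimal NumPy sketch following the standard CoSaMP formulation (identify 2s columns, merge with supp(x^(k−1)), least-squares fit against y on the merged support, then prune to s terms); the iteration count and stopping tolerance are illustrative assumptions.

```python
import numpy as np

def cosamp(Phi, y, s, n_iter=50):
    """CoSaMP sketch: identify / merge / estimate / prune / update residual."""
    M, N = Phi.shape
    x = np.zeros(N)
    r = y.copy()
    for _ in range(n_iter):
        omega = np.argsort(np.abs(Phi.T @ r))[-2 * s:]          # identify 2s columns
        T = np.union1d(omega, np.nonzero(x)[0]).astype(int)     # merge with supp(x)
        coef, *_ = np.linalg.lstsq(Phi[:, T], y, rcond=None)    # estimate on merged support
        z = np.zeros(N)
        z[T] = coef
        keep = np.argsort(np.abs(z))[-s:]                       # prune to s largest
        x = np.zeros(N)
        x[keep] = z[keep]
        r = y - Phi @ x                                         # update residual
        if np.linalg.norm(r) < 1e-9:
            break
    return x
```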

CoSaMP (3)
n Both CoSaMP and SP offer near-optimal performance guarantees under
conditions on the RIP.
n CoSaMP is the first greedy method to be shown to possess similar
performance guarantees to L1-based methods.
n Complexity of CoSaMP ~ O(MN)
n Note its independence from the sparsity of the original signal. It also
represents an improvement over both greedy algorithms and convex methods.
n CoSaMP is faster but usually less effective than algorithms based on
convex programming.
n It is also faster and more effective than OMP for compressive sampling
problems, except perhaps in the ultra-sparse regime where the number of
non-zeros in the representation is very small.
n Drawback of CoSaMP
n It requires prior knowledge of the sparsity K of the target signal.
n An incorrect choice of input sparsity may lead to a guarantee worse than the
actual error incurred by a weaker algorithm such as OMP.
n The stability bounds accompanying CoSaMP ensure that the error due to
an incorrect parameter choice is bounded, but it is not yet known how
these bounds translate into practice.
Digital Media Lab. 30
2012 Fall

Data Compression
(ECE 5546-41)

Ch 5. Algorithms for Sparse Recovery


Part 3

Byeungwoo Jeon
Digital Media Lab, SKKU, Korea
http://media.skku.ac.kr; bjeon@skku.edu
Digital Media Lab.

Recovery Algorithms (1)


n Category 1: Convex optimization approach (or convex relaxation)
n Replace the combinatorial problem with a convex optimization problem.
n Solve the convex-optimization problem with algorithms which can exploit
the problem structure.

n Category 2: Greedy algorithms


n Greedy pursuits
n Iteratively refine a sparse solution by successively identifying one or more
components that yield the greatest improvements in quality.
n In general, they are very fast and applicable to very large datasets; however,
their theoretical performance guarantees are typically weaker than those of
some other methods.
n Thresholding algorithms
n These methods alternate element selection and element pruning steps. They are
often very easy to implement and can be relatively fast.
n They have theoretical performance guarantees that rival those derived for
convex optimization-based approaches.

Digital Media Lab. 2


Recovery Algorithms (2)
n Category 3: Bayesian framework
n Assume a prior distribution for the unknown coefficients favoring sparsity.
n Develop a maximum a posteriori estimator incorporating the observations.
n Identify a region of significant posterior mass or average over most-
probable models.

n Category 4: Other approaches


n Non-convex optimization method: relax the L0 problem to a related non-
convex problem and attempt to identify a stationary point.
n Brute force method: search through all possible support sets, possibly
using cutting-plane methods to reduce the number of possibilities.
n Heuristic method: based on belief-propagation and message-passing
techniques developed in graphical models and coding theory.

Digital Media Lab. 3

Bayesian Methods for CS Recovery

Digital Media Lab.


Basic Frameworks
n Deterministic signal framework
n Under an assumption that x is fixed and belongs to a known set of
signals, find a sparse vector x s.t. y=Fx where y is a given
measurement vector.
n Most algorithms in this framework give some guarantees (on the required
number of measurements, fidelity of signal reconstruction, etc)

n Probabilistic signal framework (à Bayesian framework)


n Under an assumption that the sparse vector x arises from a known
probability distribution (assume sparsity promoting priors on the elements
of x), find a probability distribution of each non-zero element of x from
the stochastic measurements y=Fx.
n It gives neither guarantees nor a notion of “reconstruction error”
n Also note that in practice, the signal is not exactly sparse, and there will
always be loss in reconstruction.
n Note it is related to “Bayesian Inference” in statistics.

Digital Media Lab. 5

Review: A little bit of Background on


Bayesian Inference

Digital Media Lab.


Bayesian Inference
n Bayesian inference: a method of statistical inference in which some
kind of evidence or observations (i.e., measurements y in CS) are
used to calculate the probability (that is, of the sparse vector x) that a
hypothesis may be true, or else to update its previously-calculated
probability.
n In other words, given some observations, it uses prior probability over
hypothesis to determine the likelihood of a particular hypothesis
(posterior probability of the hypothesis)

n In Bayesian inference there is a fundamental distinction between


n Observable quantities y, i.e. the data
n Unknown quantities x
n x can be statistical parameters, missing data, latent variables…
n Parameters are treated as random variables

n The Bayesian framework makes probability statements about model


parameters
n In the frequentist framework, parameters are fixed non-random quantities
and the probability statements concern the data.

Digital Media Lab. 7

Rf: Latent Variable (1)
n In statistics, latent variables (as opposed to observable variables),
are variables that are not directly observed but are rather inferred
(through a mathematical model) from other variables that are
observed (directly measured).
n Mathematical models that aim to explain observed variables in terms of
latent variables are called latent variable models.
n Latent variable models are used in many disciplines, including
psychology, economics, machine learning/artificial intelligence,
bioinformatics, natural language processing, and the social sciences.

n Sometimes latent variables correspond to aspects of physical reality,


which could in principle be measured, but may not be for practical
reasons.
n In this situation, the term hidden variables is commonly used (reflecting
the fact that the variables are "really there", but hidden).

n Other times, latent variables correspond to abstract concepts, like


categories, behavioral or mental states, or data structures.
n The terms hypothetical variables or hypothetical constructs may be
used in these situations.
Digital Media Lab. http://en.wikipedia.org/wiki/Latent_variable
8
Rf: Latent Variable (2)
n One advantage of using latent variables is that it reduces the
dimensionality of data.
n A large number of observable variables can be aggregated in a model to
represent an underlying concept, making it easier to understand the data.
n In this sense, they serve a function similar to that of scientific theories. At
the same time, latent variables link observable ("sub-symbolic") data in
the real world to symbolic data in the modeled world.

n Latent variables, as created by factor analytic methods, generally


represent 'shared' variance, or the degree to which variables 'move'
together.
n Variables that have no correlation cannot result in a latent construct
based on the common factor model.

Digital Media Lab. http://en.wikipedia.org/wiki/Latent_variable


9

Rf: Latent Variable (3)
n Examples in Economics
n Quality of life, Business confidence, Morale, Happiness, Conservatism
n Note these are all variables which cannot be measured directly. But
linking these latent variables to other, observable variables, the values of
the latent variables can be inferred from measurements of the observable
variables.
n Quality of life is a latent variable which can not be measured directly so
observable variables are used to infer quality of life.
n Observable variables to measure quality of life includes wealth, employment,
environment, physical and mental health, education, recreation and leisure
time, and social belonging.

Digital Media Lab. http://en.wikipedia.org/wiki/Latent_variable


10
Rf: Latent Variable (4)
n Common methods to estimate the latent variables
n Hidden Markov models
n Factor analysis
n Principal component analysis (PCA)
n Latent semantic analysis and Probabilistic latent semantic analysis
n EM algorithms
n Bayesian algorithms and methods
n Bayesian statistics is often used for inferring latent variables.
n Latent Dirichlet Allocation
n The Chinese Restaurant Process is often used to provide a prior distribution
over assignments of objects to latent categories.
n The Indian buffet process is often used to provide a prior distribution over
assignments of latent binary features to objects.

Digital Media Lab. http://en.wikipedia.org/wiki/Latent_variable


11

Rf: Bayes Theorem
n Bayesian statistics is named after Rev.Thomas Bayes (1702-1761)

n Bayes' theorem for probability events A and B:

        p(A | B) = p(B | A) p(A) / p(B),   assuming p(B) ≠ 0

n For a set of mutually exclusive and exhaustive events {A_i}
(i.e., p(∪_i A_i) = Σ_i p(A_i) = 1), then

        p(A_i | B) = p(B | A_i) p(A_i) / Σ_j p(B | A_j) p(A_j)

Digital Media Lab. 12


Prior & Posterior Distributions
n y is the known CS data (i.e., the observation), so it should be conditioned
on; Bayes' theorem then gives the conditional distribution of the unobserved
quantities (i.e., the sparse vector x), called the posterior distribution:

        p(x | y) = p(y | x) p(x) / ∫ p(y | x) p(x) dx  ∝  p(y | x) p(x)

(p(y | x): likelihood function, p(x): prior probability, p(x | y): posterior probability)

n From a Bayesian point of view :


n The prior distribution p(x) expresses our uncertainty about x before
seeing the CS data y
n x is unknown so should have a probability distribution reflecting our
uncertainty about it before seeing the data
n The posterior distribution p(x|y) expresses our uncertainty about x after
seeing the CS data y.

Digital Media Lab. 13

Rf: Prior Probability (1)
n In Bayesian statistical inference, a prior probability distribution,
often called simply the prior, of an uncertain quantity p (for example,
suppose p is the proportion of voters who will vote for the politician
named Smith in a future election) is the probability distribution that
would express one's uncertainty about p before the "data" (for
example, an opinion poll) is taken into account.
n It is meant to attribute uncertainty rather than randomness to the
uncertain quantity. The unknown quantity may be a parameter or latent
variable.

n A prior is often the purely subjective assessment of an experienced


expert.
n Some will choose a conjugate prior when they can, to make calculation of
the posterior distribution easier.

FROM http://en.wikipedia.org/wiki/Prior_probability
Digital Media Lab. 14
Rf: Prior Probability (2)
n Parameters of prior distributions are called hyperparameters, to
distinguish them from parameters of the model of the underlying data.

n For instance, if one is using a beta distribution to model the


distribution of the parameter p of a Bernoulli distribution, then:
n p is a parameter of the underlying system (Bernoulli distribution), and
n α and β are parameters of the prior distribution (beta distribution), hence
hyperparameters.

FROM http://en.wikipedia.org/wiki/Prior_probability
Digital Media Lab. 15

Rf: Point & Interval Estimation
n In Bayesian inference, the outcome of interest for a parameter is its
full posterior distribution, however we may be interested only in
summaries of this distribution.

n A simple point estimate would be the mean of the posterior


n The median and the mode are alternatives

n Interval estimates are also easy to obtain from the posterior


distribution and are given several names.
n Examples (All of these refer to the same quantity.)
n credible intervals
n Bayesian confidence intervals
n Highest density regions (HDR)

Digital Media Lab. 16


Rf: Conjugate Prior (1)
n In Bayesian probability theory, if the posterior distributions p(x|y) are
in the same family as the prior probability distribution p(x), the prior
and posterior are then called conjugate distributions, and the prior
is called a conjugate prior for the likelihood.

n Ex: Gaussian family is conjugate to itself (or self-conjugate) with


respect to a Gaussian likelihood function
n That is, if the likelihood function is Gaussian, choosing a Gaussian prior
over the mean will ensure that the posterior distribution is also Gaussian.
This means that the Gaussian distribution is a conjugate prior for the
likelihood which is also Gaussian.

n Consider the general problem of inferring a distribution for a


parameter x given some datum or data y. From Bayes' theorem, the
posterior distribution is equal to the product of the likelihood function
and prior p(x), normalized (divided) by the probability of the data p(y):
        p(x | y) = p(y | x) p(x) / ∫ p(y | x) p(x) dx
http://en.wikipedia.org/wiki/Conjugate_prior
Digital Media Lab. 17

Rf: Conjugate Prior (2)
n p(y|x): likelihood function
n It is considered fixed and is usually well-determined from a statement of
the data-generating process.

n It is clear that different choices of the prior distribution p(x) may make
the integral more or less difficult to calculate, and the product
p(y|x)p(x) may take one algebraic form or another.
n For certain choices of the prior, the posterior has the same algebraic
form as the prior (generally with different parameter values). Such a
choice is a conjugate prior.

n A conjugate prior is an algebraic convenience, giving a closed-form


expression for the posterior.
n Otherwise a difficult numerical integration may be necessary. Further,
conjugate priors may give intuition, by more transparently showing how a
likelihood function updates a distribution.

n All members of the exponential family have conjugate priors.

http://en.wikipedia.org/wiki/Conjugate_prior
Digital Media Lab. 18
Rf: Conjugate Prior (3)
n Example: Conjugate priors

Likelihood Parameter Prior Posterior

Normal Mean Normal Normal

Normal Precision Gamma Gamma

Binomial Probability Beta Beta

Poisson Mean Gamma Gamma

http://en.wikipedia.org/wiki/Conjugate_prior
Digital Media Lab. 19

Common PDFs (1)


n Gaussian PDF: The Gaussian (also called Normal) RV X is defined in
terms of its PDF as,
        f_X(x) = ( 1 / √(2πσ²) ) e^( −(x−μ)² / (2σ²) )

where μ and σ² are called the mean and variance, respectively.

n Gaussian PDF and CDF: (m = 5 and s = 2)

Digital Media Lab. 20


Common PDFs (2)
n Exponential PDF: the exponential RV X has PDF

        f_X(x) = a e^(−a x) u(x),   a > 0

n Gamma PDF: the Gamma RV X has PDF

        f_X(x) = ( c^b / Γ(b) ) x^(b−1) e^(−c x) u(x),   b, c > 0

where Γ(b) is the gamma function given by the integral Γ(b) = ∫_0^∞ y^(b−1) e^(−y) dy

Digital Media Lab. 21

Common PDFs (3)


n Special cases of the Gamma PDF:
n Chi-square PDF
n Gamma with b = n/2 and c = 1/2
n Often used in statistics

        f_X(x) = ( 1 / (2^(n/2) Γ(n/2)) ) x^(n/2 − 1) e^(−x/2) u(x),   n ∈ Z

n Erlang PDF
n Gamma with b = integer n
n Often used in queueing theory

        f_X(x) = ( c^n / (n−1)! ) x^(n−1) e^(−c x) u(x),   n ∈ Z

[Figure] Example plots with n = 3 and with c = 2, n = 3

Digital Media Lab. 22


Common PDFs (4)
n Beta PDF: for 0 ≤ X ≤ 1 and shape parameters α and β,

        f_X(x; α, β) = C x^(α−1) (1−x)^(β−1),   where C = ( ∫_0^1 u^(α−1) (1−u)^(β−1) du )^(−1)
                     = ( 1 / B(α, β) ) x^(α−1) (1−x)^(β−1),   where B(α, β) = Γ(α)Γ(β) / Γ(α+β)

n Bernoulli distribution: for p with 0 < p < 1 and k ∈ {0, 1},

        P(X = 1) = 1 − P(X = 0) = 1 − q = p

n Binomial distribution: a RV X is binomially distributed if it takes on the
values 0, 1, 2, …, n with probabilities (p + q = 1):

        P(X = k) = (n choose k) p^k q^(n−k),   k = 0, 1, 2, …, n
Digital Media Lab. 23

Bayesian Compressive Sensing (BCS)

Digital Media Lab.


CS as Linear Regression (1)
n The signal model: y=Fx
n Assume that x = xS + xe
n xS : K elements in x with the largest magnitude, others are set to 0
n xe : (N-K) elements in x with the smallest magnitude, others are set to 0

        y = F( x_S + x_e ) = F x_S + n_e,   where n_e = F x_e

n Considering measurement noise n_m in the signal model,

        y = F x_S + n_e + n_m = F x_S + n

n The components of n are approximated as i.i.d. Gaussian noise N(0, σ²)
n This can be assumed Gaussian due to the central limit theorem for large N − M

Digital Media Lab. 25

CS as Linear Regression (2)


n The signal model:  y = F x_S + n_e + n_m = F x_S + n,   n_i ~ i.i.d. N(0, σ²)

n Gaussian likelihood model:

        p(y | x_S, σ²) = (2πσ²)^(−K/2) exp( −(1/(2σ²)) ||y − F x_S||_2² )

n F is known and y is given, but the noise variance σ² is unknown.
n In Bayesian analysis, a full posterior distribution of x and σ² is sought.

n Sparseness Prior: Laplace prior is widely used to impose belief that x


is sparse (~ sparseness-promoting prior)
n (The subscript S is omitted from now on for simplicity.)

        p(x | λ) = (λ/2)^N e^( −λ Σ_{i=1}^N |x_i| )        (λ: Laplacian parameter)

Digital Media Lab. 26


CS as Linear Regression (3)
n The solution which maximizes the posterior p(x|y) is equivalent to

        x̂ = argmin_x { ||y − Fx||_2² + β ||x||_1 }
n However, direct evaluation of the posterior p(x|y) using a Laplace prior is
not tractable since the Laplace prior is not conjugate to the Gaussian
likelihood function, hence the associated Bayesian inference may not be
performed in closed form.

n This problem is addressed by the Relevance Vector Machine (RVM)


method ~ Sparse Bayesian Learning with Hierarchical Sparseness Prior
$$ P(x \mid \alpha) = \prod_{i=1}^{N} \mathcal{N}(x_i \mid 0, \alpha_i^{-1}); \qquad P(\alpha \mid a, b) = \prod_{i=1}^{N} \Gamma(\alpha_i \mid a, b) $$

n ai: precision (inverse-variance) of a Gaussian PDF


n Gamma prior is considered over a.

Digital Media Lab. 27

Hierarchical Priors (1)


n A Gaussian prior is assigned to each of the N elements of x
n αi controls the strength of the prior on its associated weight xi.

$$ P(x \mid \alpha) = \prod_{i=1}^{N} \mathcal{N}(x_i \mid 0, \alpha_i^{-1}) $$

n A Gamma prior is assigned to the precision (inverse-variance) αi of the
i-th Gaussian prior.

$$ P(\alpha \mid a, b) = \prod_{i=1}^{N} \Gamma(\alpha_i \mid a, b) $$

n It can be visualized using a graphical model similar to the one in


“Sparse recovery via belief propagation”.

Digital Media Lab. 28


Hierarchical Priors (2)
n The marginal density function for x may be shown to satisfy

$$ P(x \mid a, b) = \int_{\alpha} p(x \mid \alpha)\, p(\alpha \mid a, b)\, d\alpha = \prod_{i=1}^{N} \int_0^{\infty} \mathcal{N}(x_i \mid 0, \alpha_i^{-1})\, \Gamma(\alpha_i \mid a, b)\, d\alpha_i $$

n The integral can also be evaluated analytically, and it corresponds to a
Student t-distribution.
n With an appropriate choice of a and b, the Student t-distribution is strongly
peaked about xi = 0 → a sparseness prior (see the numerical check below).
n Similarly, a Gamma prior Γ(α0 | c, d) is introduced on the inverse of the
noise variance α0 = 1/σ².

n The hierarchical model may look complicated, but it is actually simple
n Each consecutive term in the hierarchy is in the conjugate-exponential
family
n Also very convenient for implementing iterative algorithms that evaluate
the posterior density functions of x and α0.
n Can be solved using techniques such as Markov Chain Monte Carlo
(MCMC) or Expectation-Maximization (EM).
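The sketch below (not from the lecture; the hyperparameter values a and b are assumed) numerically integrates the Gaussian-Gamma hierarchy and compares it with the closed-form Student-t marginal, illustrating the heavy-tailed, sharply peaked sparseness prior.

```python
import numpy as np
from scipy.stats import norm, gamma, t
from scipy.integrate import quad

a, b = 1.0, 0.1                                  # Gamma hyperparameters (assumed small values)

def marginal(x):
    """p(x) = integral over alpha of N(x | 0, 1/alpha) * Gamma(alpha | shape=a, rate=b)."""
    integrand = lambda al: norm.pdf(x, scale=1.0 / np.sqrt(al)) * gamma.pdf(al, a, scale=1.0 / b)
    value, _ = quad(integrand, 0.0, np.inf)
    return value

for x in (0.0, 0.5, 2.0):
    # Closed form: Student-t with 2a degrees of freedom and scale sqrt(b/a).
    print(x, marginal(x), t.pdf(x, df=2 * a, scale=np.sqrt(b / a)))
```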

Digital Media Lab. 29

Bayesian CS Recovery via RVM (1)


n Assuming the hyperparameters α and α0 are known, and y and Φ are
given, the posterior of x can be expressed analytically as a multivariate
Gaussian with mean and covariance matrix

$$ \mu = \alpha_0 \Sigma \Phi^T y, \qquad \Sigma = (\alpha_0 \Phi^T \Phi + A)^{-1} \quad \text{where } A = \mathrm{diag}(\alpha_1, \ldots, \alpha_N) $$
n In the RVM, these hyperparameters α and α0 are estimated from the data
(by performing a type-II ML procedure)

n By marginalizing over the sparse vector x, the marginal likelihood for
α and α0 can be expressed analytically as

$$ L(\alpha, \alpha_0) = \log p(y \mid \alpha, \alpha_0) = \log \int p(y \mid x, \alpha_0)\, p(x \mid \alpha)\, dx = -\frac{1}{2} \left\{ K \log(2\pi) + \log |C| + y^T C^{-1} y \right\} $$
where C = σ² I + Φ A⁻¹ Φᵀ

Digital Media Lab. 30


Bayesian CS Recovery via RVM (2)
n A type-II ML approximation employs point estimates for α and α0
to maximize L(α, α0), which can be implemented via the EM algorithm:

$$ \alpha_i^{new} = \frac{\gamma_i}{\mu_i^2}, \quad 1 \le i \le N, \qquad \text{where } \gamma_i \equiv 1 - \alpha_i \Sigma_{ii}, \quad \sigma^2 = 1/\alpha_0 $$

n Also,
$$ \alpha_0^{new} = \frac{K - \sum_i \gamma_i}{\| y - \Phi \mu \|_2^2} $$

n Note that (α^new, α0^new) are functions of (μ, Σ), while (μ, Σ) are functions
of (α, α0) → an iterative algorithm!
n This iteration is repeated until convergence; a minimal sketch follows below.
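The following is an illustrative Python sketch of this iteration under assumed toy dimensions; it follows the update equations above but is not the lecture's or the original RVM authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, S = 64, 24, 4                          # signal length, measurements, true sparsity
x_true = np.zeros(N)
x_true[rng.choice(N, S, replace=False)] = rng.standard_normal(S)
Phi = rng.standard_normal((K, N)) / np.sqrt(K)
y = Phi @ x_true + 0.01 * rng.standard_normal(K)

alpha = np.ones(N)                           # precisions of the Gaussian priors on x_i
alpha0 = 1.0 / np.var(y)                     # noise precision 1/sigma^2 (rough initialization)

for it in range(500):
    Sigma = np.linalg.inv(alpha0 * Phi.T @ Phi + np.diag(alpha))
    mu = alpha0 * Sigma @ Phi.T @ y
    gamma_i = np.clip(1.0 - alpha * np.diag(Sigma), 1e-12, None)
    alpha_new = gamma_i / (mu ** 2 + 1e-12)
    alpha0_new = (K - gamma_i.sum()) / np.sum((y - Phi @ mu) ** 2)
    delta = np.max(np.abs(np.log(alpha_new) - np.log(alpha)))
    alpha, alpha0 = alpha_new, alpha0_new
    if delta < 1e-4:                         # hyperparameters stopped changing
        break

print("relative reconstruction error:",
      np.linalg.norm(mu - x_true) / np.linalg.norm(x_true))
```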

Digital Media Lab. 31

CS Recovery via Belief Propagation (BP)

Digital Media Lab.


CS Recovery via Belief Propagation (BP) (1)
n There are significant parallels to be drawn between error correcting
codes and sparse recovery.
n LDPC: sparse code
n The sparsity of the Φ matrix is analogous to the sparsity of LDPC coding
graphs.

< A factor graph depicting the relationship between the variables involved in CS recovery using BP >
(Black: variable node, white: constraint node)

Digital Media Lab. 33

CS Recovery via Belief Propagation (BP) (2)


n Choice of signal probability density functions: the signal is modeled
as being compressible (as opposed to being strictly sparse)
n Two-state Gaussian mixture model (see the sketch after this slide)
n Each signal element of x takes either a “large” or a “small” coefficient-value state.
n The signal elements of x are iid.
n Small coefficients occur more frequently than large coefficients.
n Alternatively, an iid Laplace prior model for x

n The recovery (decoding) problem of x (where y=Fx) is solved as a


Bayesian inference problem.
n Approximate each marginal distribution Prob{x(i) | y, Φ}.
n Compute the Maximum Likelihood (ML) or Maximum a Posteriori (MAP)
estimate of the coefficients from their distributions.
n Solved by variety of methods
n (ex) Belief propagation (BP) ~ graphical methods
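Below is a small sketch (all parameters assumed) that draws a compressible signal from the two-state Gaussian mixture model described above and checks that most of its energy is concentrated in the few "large" coefficients.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 1000
q = 0.05                                   # probability of the "large" state (assumed)
sigma_large, sigma_small = 10.0, 0.1       # assumed state standard deviations

state = rng.random(N) < q                  # iid per-coefficient state selection
x = np.where(state,
             sigma_large * rng.standard_normal(N),
             sigma_small * rng.standard_normal(N))

# The sorted magnitudes decay quickly: most energy sits in the few "large" coefficients.
mags = np.sort(np.abs(x))[::-1]
print("fraction of energy in the largest 5% of coefficients:",
      np.sum(mags[: N // 20] ** 2) / np.sum(mags ** 2))
```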

Digital Media Lab. 34


Listen to the presentation on BP applied to
CS recovery

Digital Media Lab.


2012 Fall

Data Compression
(ECE 5546-41)

Ch 6. Applications of Compressive Sensing


Part 1

Byeungwoo Jeon
Digital Media Lab, SKKU, Korea
http://media.skku.ac.kr; bjeon@skku.edu
Digital Media Lab.

CS Applications
n Linear regression and model selection
n Sparse error correction
n Group testing and data stream algorithms
n Compressive medical imaging
n Analog-to-information conversion
n Single pixel camera
n Hyperspectral Imaging
n Compressive processing of manifold-modeled data
n Inference using compressive measurements
n Compressive sensor networks
n Genomic Sensing

Digital Media Lab. 2


More CS Applications (1)
n Analysis: ”Find x which generates the given y”
n Atomic decomposition problem: infer individual atoms actually generating y
$$ (P_0^{\epsilon}): \quad \min_x \| x \|_0 \ \text{ subject to } \ \| y - \Phi x \|_2 \le \epsilon $$
n Its solution may not necessarily be the underlying x0.
n If x0 is K-sparse, then the solution of (P0^ε) is at most O(ε) away from x0.

n Representing y in terms of at most 2K scalars is a compression problem

n By increasing ε, higher compression with fewer non-zeros is possible → one can
trace out a rate-distortion (RD) performance curve

n Denoising: “Find the denoised version of y from a noisy observation y′,
where y′ = y + n.”
n The noise is power-limited: ‖n‖₂ ≤ δ.
n The solution x0^(ε+δ) of (P0^(ε+δ)) will have at most k0 non-zeros, and
the denoised answer is given as Φ x0^(ε+δ) (a greedy sketch of this idea follows below).
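Since (P0^ε) itself is combinatorial, the sketch below uses greedy Orthogonal Matching Pursuit as a hedged stand-in: it finds a sparse x whose residual to the noisy observation is small and takes Φx as the denoised answer (signal sizes and noise level are assumed).

```python
import numpy as np

def omp(Phi, y, eps):
    """Greedy OMP: add the column most correlated with the residual until ||r||_2 <= eps."""
    M, N = Phi.shape
    support, r = [], y.copy()
    coef = np.zeros(0)
    x = np.zeros(N)
    while np.linalg.norm(r) > eps and len(support) < M:
        support.append(int(np.argmax(np.abs(Phi.T @ r))))
        coef, *_ = np.linalg.lstsq(Phi[:, support], y, rcond=None)
        r = y - Phi[:, support] @ coef
    x[support] = coef
    return x

rng = np.random.default_rng(2)
M, N, K = 40, 100, 5
Phi = rng.standard_normal((M, N)) / np.sqrt(M)
x0 = np.zeros(N)
x0[rng.choice(N, K, replace=False)] = rng.standard_normal(K)
y_noisy = Phi @ x0 + 0.02 * rng.standard_normal(M)       # y' = y + n

x_hat = omp(Phi, y_noisy, eps=0.03 * np.sqrt(M))         # tolerance ~ expected noise level
print("recovered support:", sorted(np.flatnonzero(x_hat).tolist()))
print("denoising error  :", np.linalg.norm(Phi @ x_hat - Phi @ x0))
```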

See M. Elad, Sparse and redundant representations, Springer


Ch10. Image Deblurring
Ch13. Image Compression – Facial Images
Ch14. Image Denoising
Ch15. Other Applications

Digital Media Lab. 3

More CS Applications (2)


n Inverse problem: suppose y′ = Hy + n, where the linear operator H
represents blurring, masking leading to in-painting, projections,
downscaling, or some other degradation.

$$ (P_0^{\epsilon+\delta}): \quad \min_x \| x \|_0 \ \text{ subject to } \ \| y' - H \Phi x \|_2 \le \delta + \epsilon $$

n Its solution directly identifies the sparse components of the underlying
signal and yields the approximation Φ x0^(ε+δ).

n Morphological component analysis (MCA)
n Suppose the observed signal y is a superposition of two different sub-
signals y1 and y2 (that is, y = y1 + y2), where y1 is sparsely generated by
model M1 and y2 is sparsely generated by model M2.
n The problem is to separate y1 and y2 (similar to ICA).

$$ \min_{x_1, x_2} \left( \| x_1 \|_0 + \| x_2 \|_0 \right) \ \text{ subject to } \ \| y - \Phi_1 x_1 - \Phi_2 x_2 \|_2 \le \epsilon_1 + \epsilon_2 $$

n Can be applied to in-painting problem


n Missing pixels in an image are filled-in based on a sparse representation of
the existing pixels.

Digital Media Lab. 4


Linear regression and model selection

Digital Media Lab.

Linear regression & model selection (1)


n The statistical problem of sparse linear regression and model selection
can be thought of as a CS problem y = Φx + ε if the set of input
variables actually required to predict the output variable is sparse.
n Input (training data): a set of M pairs of values {input variables, noisy
output (response) variable}
n Assume there are a total of N input variables, and represent the set of input
variable observations as an M×N matrix Φ and the set of response
variable observations as an M×1 vector y.

< Scatter plots: output variable y versus input variable x >

Digital Media Lab. 6


Rf: Linear Regression (1)
n Linear regression: it models the relationship between a scalar
dependent variable y and one or more explanatory variables denoted
as X using training data. (Note: X here corresponds to Φ in the CS notation.)
n The case of one explanatory variable is called simple regression. More
than one explanatory variable is multiple regression. (This in turn should be
distinguished from multivariate linear regression, where multiple correlated
dependent variables are predicted, rather than a single scalar variable).
n Data is modeled using linear predictor functions, and unknown model
parameters are estimated from the data.
n Training data set: { X_{i1}, ..., X_{iN}, y_i }_{i=1}^{M}
n yi: output variable (also called response variable, regressand,
endogenous variable, measured variable, dependent variable)
n Xi: input variables (also called regressors, exogenous variables,
explanatory variables, covariates, predictor variables, independent
variables)
n X = [X1, …, XN] is called the design matrix
n Usually a constant is included as one of the regressors (e.g., X_{i1} = 1, i = 1, …, M).
The corresponding coefficient is called the intercept.

http://en.wikipedia.org/wiki/Linear_regression_model
Digital Media Lab. 7

Rf: Linear Regression (2)
n Given a data set { X_{i1}, ..., X_{iN}, y_i }_{i=1}^{M} of M statistical units, a linear
regression model assumes a linear relationship between the
dependent variable yi and the N-vector of regressors { X_{i1}, ..., X_{iN} }.
n This relationship is modeled through a disturbance term or error
variable εi, an unobserved random variable that adds noise to the linear
relationship between the dependent variable and regressors. Thus the
model takes the form

$$ y_i = \beta_1 X_{i1} + \cdots + \beta_N X_{iN} + \epsilon_i = X_i^T \beta + \epsilon_i, \quad i = 1, \ldots, M $$

n XiTβ is the inner product between the vectors Xi and β.

n In matrix form: y = Xβ + ε

n Note that in the CS notation y = Φx, X → Φ and β → x

n β is the parameter vector (also called effects or regression coefficients)

http://en.wikipedia.org/wiki/Linear_regression_model
Digital Media Lab. 8
Rf: Linear Regression (3)
n Solution by Ordinary Least Squares (OLS): the simplest and thus
most common estimator. It minimizes the sum of squared residuals,
which leads to the closed form

$$ \hat{\beta} = \left( X^T X \right)^{-1} X^T y = \left( \frac{1}{M} \sum_i X_i X_i^T \right)^{-1} \left( \frac{1}{M} \sum_i X_i y_i \right) $$

n Ex: Consider a situation where a small ball is tossed up in the
air and we measure its heights of ascent hi at various moments
in time ti. Assume it can be modeled as

$$ h_i = \beta_1 t_i + \beta_2 t_i^2 + \epsilon_i $$

where β1 determines the initial velocity of the ball, β2 is proportional to
standard gravity, and εi is due to measurement error.
n Linear regression can be used to estimate the values of β1 and β2 from
the measured data.
n Note that this model is non-linear in the time variable, but it is linear in the
parameters β1 and β2; if we take regressors X_i = (X_{i1}, X_{i2}) = (t_i, t_i^2), the
model takes the standard form h_i = X_i^T β + ε_i (a numerical sketch follows).
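The following sketch (synthetic data, assumed noise level) fits the ball-toss model by ordinary least squares using the regressors (t_i, t_i^2), illustrating that the model is linear in β even though it is nonlinear in time.

```python
import numpy as np

rng = np.random.default_rng(3)
t = np.linspace(0.1, 2.0, 30)                 # measurement times (assumed)
b1_true, b2_true = 9.0, -4.9                  # initial velocity, ~ -g/2
h = b1_true * t + b2_true * t**2 + 0.05 * rng.standard_normal(t.size)

X = np.column_stack([t, t**2])                # design matrix: regressors (t_i, t_i^2)
beta_hat = np.linalg.lstsq(X, h, rcond=None)[0]   # closed-form OLS estimate
print("estimated (beta1, beta2):", beta_hat)
```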

http://en.wikipedia.org/wiki/Linear_regression_model
Digital Media Lab. 9

Group testing and data stream algorithms

Digital Media Lab.


Group Testing
n Assume N total products among which K defective ones must be identified.
n An N×1 vector x indicates the defective ones (non-zero xi ~ defect)

n Problem: design a collection of tests that allows identifying the support
(and possibly the non-zero values also) of x while minimizing the
number of tests to be performed.
n The test design is denoted by a matrix Φ (φij = 1 iff the j-th item is used in the i-th test)
→ Q: “How to design the sensing matrix Φ?” & “Which recovery algorithm?”
n If the test output is linear w.r.t. the inputs, then the problem of recovering the
vector x is essentially the same as the standard CS recovery problem
(see the sketch below).
< Pooling design: y = Φx, where Φ is an M×N (most likely 0-1) matrix; M pools, N items >
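A toy sketch of the pooling setup is shown below (a random Bernoulli 0-1 design is assumed; the actual design question is left open by the text): each test pools a random subset of items and the readout is the linear measurement y = Φx.

```python
import numpy as np

rng = np.random.default_rng(4)
N, K, M = 200, 3, 30                            # items, defectives, pooled tests (assumed)
x = np.zeros(N)
x[rng.choice(N, K, replace=False)] = 1.0        # K defective items

Phi = (rng.random((M, N)) < 0.1).astype(float)  # 0-1 pooling design: ~10% of items per test
y = Phi @ x                                     # linear (quantitative) test outputs
print("tests used:", M, "| nonzero test outcomes:", int(np.count_nonzero(y)))
```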

Digital Media Lab. 11

Data Streaming Problem


n Streaming algorithms (in computer science): algorithms for processing
data streams in which the input is presented as a sequence of items
and can be examined in only a few passes (typically just one).
n These algorithms have limited memory available to them (much less than
the input size) and also limited processing time per item.
n These constraints may mean that an algorithm produces an approximate
answer based on a summary or "sketch" of the data stream in memory.

n In the data stream model, some or all of the input data that are to be
operated on are not available for random access from disk or memory,
but rather arrive as one or more continuous data streams.
n Streams can be denoted as an ordered sequence of points (or "updates")
that must be accessed in order and can be read only once or a small
number of times.

http://en.wikipedia.org/wiki/Streaming_algorithm
Digital Media Lab. 12
Rf: Data Stream
n Data stream means various sequences of information:
n In telecommunications and computing: a sequence of digitally encoded
signals (ex: packets of data)
n In electronics and computer architecture: a data flow determines for
which time which data item is scheduled to enter or leave which port of a
systolic array, a Reconfigurable Data Path Array or similar pipe network,
or other processing unit or block

n Often the data stream is seen as the counterpart of an instruction


stream, since the von Neumann machine is instruction-stream-driven,
whereas its counterpart, the Anti machine, is data stream driven.

Digital Media Lab. http://en.wikipedia.org/wiki/Data_stream 13

Examples of Data Streams (1)


n The algorithms we are going to describe act on massive data that
arrive rapidly and cannot be stored. These algorithms work in a few
passes over the data and use limited space (less than linear in the
input size).
n Ex 1 (Telephone calls): every time a cell phone makes a call to
another phone, several calls between switches are made until
the connection can be established. Every switch writes a record of
approximately 1000 bytes for the call.
n Since a switch can receive up to 500 million calls a day, this adds up to
something like 1 terabyte of information per month. This is a massive
amount of information that has to be analyzed for different purposes.
n An example is searching for dropped calls, trying to find out under what
circumstances such dropped calls happen. Clearly, for dealing with this
problem we do not want to work with all the data, but just want to filter
out the useful information in a few passes.

Digital Media Lab. 14


Examples of Data Streams (2)
n EX2 (The Internet): The Internet is made of a network of routers
connected to each other, receiving and sending IP packets. Each IP
packet contains a packet log including its source and destination
addresses as well as other information that is used by the router to
decide which link to take for sending it.
n The packet headers have to be processed at the rate at which they flow
through the router. Each packet takes about 8 nanoseconds to go
through a router, and modern routers can handle a few million packets
per second.
n Keeping the whole information would need more than one terabyte of
information per day per router.
n Statistical analysis of the traffic through the router can be done, but this
has to be performed online in nearly real time.

Digital Media Lab. 15

Examples of Data Streams (3)


n EX3 (Web Search): Consider a company that places advertising on the
Web. Such a company has to analyze different possibilities, trying to
maximize, for example, the number of clicks they would get by
placing an ad for a certain price.
n For this they need to analyze large amounts of data, including information
on web pages, numbers of page visitors, ad prices, and so on.
n Even if the company keeps a copy of the whole net, the analysis has to
be done very rapidly since this information is continuously changing.

Digital Media Lab. 16


Computation on data streams
n In packet communications
n xi: number of packets passing through a network router with destination i
n i can be as large as 2^64 (for a 64-bit address space)
n Instead of storing all xi’s directly, store y (y = Φx) with M << N.
n y is called a “sketch”

n In actual packet transmission, x is not observed directly; rather, increments
to the xi are observed.
n Thus, y is constructed iteratively by adding the i-th column of Φ to y each time an
increment to xi is observed (note y = Φx is linear in x); see the update sketch below.
n When the network traffic is dominated by traffic to a small number of
destinations, the vector x is compressible, and thus the problem of
recovering x from the sketch Φx is essentially a CS recovery problem.
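The sketch below (toy sizes, assumed ±1 sensing matrix) illustrates the incremental construction of y: each packet arrival adds one column of Φ, and the final y equals Φx without x ever having to be stored by the "router".

```python
import numpy as np

rng = np.random.default_rng(5)
N, M = 4096, 256                               # destination space (toy) and sketch length
Phi = rng.choice([-1.0, 1.0], size=(M, N))     # fixed random +/-1 sketching matrix (assumed)

y = np.zeros(M)                                # the sketch kept in memory
x = np.zeros(N)                                # full count vector, kept here only for verification

for _ in range(10000):                         # stream of packet arrivals
    i = int(rng.integers(0, 50))               # traffic concentrated on a few destinations
    y += Phi[:, i]                             # sketch update: add the i-th column of Phi
    x[i] += 1.0

print("sketch equals Phi @ x:", np.allclose(y, Phi @ x))
```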

Digital Media Lab. 17

Compressive medical imaging

Digital Media Lab.


Compressive Medical Imaging
n MRI (Magnetic Resonance Imaging)
n It is based on the core principle that protons in water molecules in the
human body align themselves in a magnetic field.
n The MRI source repeatedly pulses magnetic fields to cause water molecules in
the human body to disorient and then reorient themselves, which causes a
release of detectable radio frequencies.

n Major problems in MRI
n Linear relation between the number of measurements and the scan time
n A long scan is more susceptible to physiological motion artifacts and causes
discomfort to the patient
n Problem: how to minimize scan time without compromising image quality
n CS can be one solution.

Digital Media Lab. 19

End of Ch.6 Part 1

Digital Media Lab.


2012 Fall

Data Compression
(ECE 5546-41)

Ch 6. Applications of Compressive Sensing


Part 2

Byeungwoo Jeon
Digital Media Lab, SKKU, Korea
http://media.skku.ac.kr; bjeon@skku.edu
Digital Media Lab.

CS Applications
n Linear regression and model selection
n Sparse error correction
n Group testing and data stream algorithms
n Compressive medical imaging
n Analog-to-information conversion
n Single pixel camera
n Hyperspectral Imaging
n Compressive processing of manifold-modeled data
n Inference using compressive measurements
n Compressive sensor networks
n Genomic Sensing

Digital Media Lab. 2


Analog-to-information conversion

Most materials here are from “Compressive Sampling for Analog Time Signals” by Richard
Baraniuk (at IMA2007 Talk) http://www.ima.umn.edu/2006-2007/ND6.4-15.07/activities/Baraniuk-
Richard/baraniuk-IMA-A2I-june07.pdf

Digital Media Lab.

Sensing by Sampling
n Foundation of analog-to-digital conversion (ADC):
n Shannon/Nyquist sampling theorem: “periodically sample at 2x signal
bandwidth” and “perfect reconstruction if the signal is band-limited”

n Signal processing systems rely more and more on the A/D converter at the front-end


n Radio frequency (RF) applications have hit a performance brick wall
n “Moore’s Law” for A/D’s: doubling in performance only every 6 years

n ADC remains the major bottleneck:


n limited bandwidth (# Hz)
n limited dynamic range (# bits)
n deluge of bits to process downstream
n High resolution ADC is costly or infeasible

Digital Media Lab. 4


Signal Sparsity (1)
n The sampling theorem by Shannon gives a worst-case bound.

n Locally Fourier Sparse (LFS) wideband signal
n LFS: at each point in time the signal is well-approximated by a few
sinusoids of constant frequencies
n An LFS signal is sparse in a time-frequency representation such as the STFT

$$ S(t, \omega) = \sum_{\tau=-\infty}^{\infty} x(\tau)\, W(\tau - t)\, e^{-j\omega\tau} \qquad W(\cdot): \text{localized window function} $$

n Examples of LFS signals
n Frequency-hopping communication signals
n Slowly varying chirps from radar and geophysics
n Many acoustic and audio signals

Digital Media Lab. 5

Signal Sparsity (2)


n A frequency-hopping communication signal consists of a sequence of
windowed sinusoids with frequencies between f1 and f2 Hz.

STFT:
$$ S(t, \omega) = \sum_{\tau=-\infty}^{\infty} x(\tau)\, W(\tau - t)\, e^{-j\omega\tau} \qquad W(\cdot): \text{localized window function} $$

Spectrogram: |S(t, ω)|²

n Nyquist rate: 2(f2 − f1) Hz

n Can we acquire the signal with fewer than 2(f2 − f1) samples/sec?
Digital Media Lab. 6


Signal Sparsity (3)
n Sparsity: “information rate” K per second, K<<N
n If a signal is sparse, do not waste effort sampling the empty space.
Instead, use fewer samples and allow ambiguity.
n Use the sparsity model to reconstruct and uniquely resolve the ambiguity.

n Applications: Communications, radar, sonar, …

Digital Media Lab. 7

Local Fourier Sparsity (Spectrogram)

STFT:
$$ S(t, \omega) = \sum_{\tau=-\infty}^{\infty} x(\tau)\, W(\tau - t)\, e^{-j\omega\tau} \qquad W(\cdot): \text{localized window function} $$

Spectrogram: |S(t, ω)|²
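As an illustration (toy frequency-hopping signal with assumed hop frequencies), the sketch below computes the STFT with scipy.signal.stft and confirms that only a few frequency bins are active in each time slice, i.e. the signal is locally Fourier sparse.

```python
import numpy as np
from scipy.signal import stft

fs = 8000.0                                     # assumed sampling rate (Hz)
t = np.arange(0, 1.0, 1.0 / fs)
hops = [600.0, 1500.0, 900.0, 2200.0]           # assumed hop frequencies (Hz)
seg = t.size // 4
x = np.concatenate([np.sin(2 * np.pi * f0 * t[:seg]) for f0 in hops])

f, tt, Z = stft(x, fs=fs, nperseg=256)          # windowed STFT S(t, w)
spec = np.abs(Z) ** 2                           # spectrogram |S(t, w)|^2
active = (spec > 0.1 * spec.max()).sum(axis=0)  # active frequency bins per time slice
print("average active bins per time slice:", active.mean(), "out of", spec.shape[0])
```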

Digital Media Lab. 8


Sparsity & Compressive Sampling

< M measurements, with M ≈ K << N (K: information rate) >

Digital Media Lab. 9

Streaming Measurements
n Streaming applications cannot fit the whole signal into a processing buffer at
one time

y = Φx: streaming requires a specially structured Φ

Digital Media Lab. 10


Analog-to-Information Conversion (A2I)
n The problem of applying CS to the acquisition of a continuous-time
signal x(t) → ADC at a sub-Nyquist rate is possible under the
assumption that x(t) is sparse in the Fourier domain, meaning that
n x(t) is band-limited
n Much of its spectrum is empty

n What is the best way to capture this structure?

n Note that the measurement system must be built with analog HW.

n This is referred to as the problem of analog CS.

Digital Media Lab. 11

Analog measurement model (1)


n Consider M continuous-time test functions {φ_j(t)}_{j=1}^{M}, and collect M
measurements as

$$ y[j] = \int_0^T x(t)\, \phi_j(t)\, dt, \quad j = 1, \ldots, M \qquad \xrightarrow{\text{matrix form}} \quad y = \Phi\, x(t) $$

n The operator F takes analog signal x(t) to generates a discrete vector y.

< Random Demodulator Block Diagram: needs M correlators, M integrators (with zero
initial state) plus sample-and-hold circuits, and hardware for generating the M test functions >
Digital Media Lab. 12


Analog measurement model (2)
n There are many potential practical issues
n How to reliably and accurately produce (and reproduce) arbitrarily
complex fj(t) functions using analog hardware.

n The architecture (M correlator/integrator pairs operating in parallel) is
also potentially prohibitively expensive in implementation cost, size,
weight, and power.

n Need a lot of ideas for a simpler architecture (ex: by designing a set


of structured fj(t)).

Digital Media Lab. 13

Analog measurement model (3)


n Ex1: non-uniform sampling by fj (t) =δ(t−tj).
n {tj, j=1,…M} denotes a sequence of M time-locations to sample x(t).
n If the number of measurements is lower than the Nyquist-rate, then these
locations cannot simply be uniformly spaced in the interval [0,T] , but
must be carefully chosen.
n It can be implemented by using a single traditional ADC with ability to
sample on a non-uniform grid, avoiding the requirement for M parallel
correlator/integrator pairs.
n Such non-uniform sampling systems have been studied in other contexts
outside of the CS framework.
n For example, there exist specialized fast algorithms for the recovery of
extremely large Fourier-sparse signals. The algorithm uses samples at a non-
uniform sequence of locations that are highly structured, but where the initial
location is chosen using a (pseudo) random seed.

Digital Media Lab. 14


Analog measurement model (4)
n Ex2: frameworks for the sampling and recovery of multi-band signals,
whose Fourier transforms are mostly zero except for a few frequency
bands.
n They use non-uniform sampling patterns based on coset sampling.
n Unfortunately, these approaches are often highly sensitive to jitter, or
error in the timing of when the samples are taken.

Digital Media Lab. 15

Architectures for A2I

1. Random Sampling
2. Random Filtering
3. Random Demodulator

Digital Media Lab. 16


Method 1: Random Sampling
n Apply the random sampling concepts from the “randomized ADC” (by
Anna Gilbert) directly to A2I
n Average sampling rate < Nyquist rate
n Appropriate for narrowband signals (sinusoids), wideband signals
(wavelets), histograms,…
n Highly efficient, one-pass decoding algorithms

See [Gilbert, Strauss]


Digital Media Lab. 17
http://www.math.lsa.umich.edu/~annacg/

Randomized ADC (1)


n Acquire data
n Sample at random points of time at much lower than the Nyquist rate

n Process measurements
n use samples and time points in iterative algorithm (not the FFT)

n Extract information
n reconstruct most energetic portions of Fourier spectrum (not entire
spectrum)

Symposium Speaker on Big Data, National Academies of Science


Digital Media Lab. Kavli Frontiers of Science Symposium, Irvine, CA, 2012. 18
Randomized ADC (2)

Symposium Speaker on Big Data, National Academies of Science


Digital Media Lab. Kavli Frontiers of Science Symposium, Irvine, CA, 2012. 19

Randomized ADC (3)


n How does it work?

Symposium Speaker on Big Data, National Academies of Science


Digital Media Lab. Kavli Frontiers of Science Symposium, Irvine, CA, 2012. 20
Sparsogram (1)
n Sparsogram: spectrogram computed using random samples

Digital Media Lab. 21

Sparsogram (2)
n Example: frequency hopper
n Random sampling A2I at 13× sub-Nyquist rate

Digital Media Lab. 22


Method 2: Random Filtering
n Analog LTI filter with “random impulse response”
y(t) = (x*h) (t)

n Quasi-Toeplitz measurement system

Digital Media Lab. 23

Comparison to Full Gaussian F


n Ex: B = length of filter h in terms of Nyquist rate samples
= horizontal width of band of A2I convolution

Digital Media Lab. 24


Method 3: Random Demodulation
n Chipping sequence pc(t): ±1 values alternating at Na Hz (at least
as fast as the Nyquist rate)
n The mixed signal is integrated over a time period of 1/Ma:

$$ y[j] = \int_{(j-1)/M_a}^{\,j/M_a} x(t)\, p_c(t)\, dt $$

n The sample-and-hold ADC runs at Ma Hz.
n Note Ma << Na

< Diagram: x(t) is multiplied by the ±1 chipping signal, then integrated and sampled >

Digital Media Lab. 25

Discrete Formulation of Random Demodulation


n Discrete formulation:

$$ y[1] = \int_0^{1/M_a} x(t)\, p_c(t)\, dt = \sum_{n=1}^{N_a/M_a} p_c[n] \int_{(n-1)/N_a}^{\,n/N_a} x(t)\, dt $$

n Na is the Nyquist rate of x(t)
n Note
$$ \int_{(n-1)/N_a}^{\,n/N_a} x(t)\, dt = x(t) \text{ over the } n\text{-th Nyquist interval} \equiv x[n] $$

n Therefore,
$$ y[1] = \sum_{n=1}^{N_a/M_a} p_c[n]\, x[n] $$

n The CS measurement is equivalent to multiplying the signal x with the


random sequence of +/- 1 in pc[n] and then summing over every
sequence block of Na/Ma coefficients.
n Ex: N = 12, M = 4, T = 1 (each row of Φ carries the chips of one block of 3; all other
entries are zero; a numerical check follows below)

$$ \Phi = \begin{bmatrix} -1 & +1 & +1 & & & & & & & & & \\ & & & -1 & +1 & -1 & & & & & & \\ & & & & & & +1 & +1 & -1 & & & \\ & & & & & & & & & +1 & -1 & -1 \end{bmatrix} $$
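The following sketch builds the block-structured Φ of the example above for assumed toy sizes and verifies that y = Φx equals the chip-multiply-and-sum computation.

```python
import numpy as np

rng = np.random.default_rng(6)
N, M = 12, 4                                  # Nyquist-rate samples and measurements
B = N // M                                    # block length Na/Ma = 3

pc = rng.choice([-1.0, 1.0], size=N)          # +/-1 chipping sequence at the Nyquist rate
x = rng.standard_normal(N)                    # Nyquist-rate samples x[n]

# Build Phi explicitly: row j carries the chips of block j, zeros elsewhere.
Phi = np.zeros((M, N))
for j in range(M):
    Phi[j, j * B:(j + 1) * B] = pc[j * B:(j + 1) * B]

# Direct block-wise computation: multiply by the chips and sum each block of B samples.
y_blocks = np.array([np.dot(pc[j * B:(j + 1) * B], x[j * B:(j + 1) * B]) for j in range(M)])
print("y = Phi x matches the block-wise sums:", np.allclose(Phi @ x, y_blocks))
```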

Digital Media Lab. 26


Method 3: Random Demodulation
n Theorem [Tropp et al., 2007]
n If the sampling rate satisfies M ≥ C K log²(N/δ), 0 < δ < 1, then locally
Fourier K-sparse signals can be recovered exactly with probability 1 − δ.

n Experimental results:

Digital Media Lab. 27

Ex: Frequency Hopper


n Random demodulator A2I converter at 8x sub-Nyquist rate

Digital Media Lab. 28


Summary
n A2I is referred to as the problem of analog CS
n Key concepts of discrete-time CS carry over!

n Streaming signals require specially structured measurement systems


n Tension between what can be built in hardware and what constitutes
a good CS matrix

n Three examples
n Random sampling
n Random Filtering
n Random demodulation

n Open Issues
n New HW designs
n New transforms that sparsify natural and man-made signals
n Analysis and optimization under real-world non-idealities such as jitter,
measurement noise, interference, etc.
n Reconstruction/processing algorithms for dealing with large N

Digital Media Lab. 29

End of Ch.6 Part 2

Digital Media Lab.


2012 Fall

Data Compression
(ECE 5546-41)

Ch 6. Applications of Compressive Sensing


Part 3

Byeungwoo Jeon
Digital Media Lab, SKKU, Korea
http://media.skku.ac.kr; bjeon@skku.edu
Digital Media Lab.

CS Applications
n Linear regression and model selection
n Sparse error correction
n Group testing and data stream algorithms
n Compressive medical imaging
n Analog-to-information conversion
n Single pixel camera
n Hyperspectral Imaging
n Compressive processing of manifold-modeled data
n Inference using compressive measurements
n Compressive sensor networks
n Genomic Sensing

Digital Media Lab. 2


Hyperspectral Imaging

Digital Media Lab.

Hyperspectral Imaging (1)


n Basic Principle
n Different materials produce different electromagnetic radiation spectra,
as shown in this plot of reflectance for soil, dry vegetation, and green
vegetation.
n The spectral information contained in a hyperspectral image pixel can
therefore indicate the various materials present in a scene.

D. Manolakis, D. Marden, and G. A. Shaw, “Hyperspectral Image Processing for


Digital Media Lab. Automatic Target Detection Applications” Lincoln Laboratory Journal, 2003. 4
Digital Media Lab. 5

Hyperspectral Imaging (2)


n The basic datacube structure in hyperspectral imaging illustrates the
simultaneous spatial and spectral characteristics of the data.
n The datacube can be visualized as a set of spectra (left below), one
for each pixel, or as a stack of images (right), one for each
spectral channel.

Digital Media Lab. 6


Hyperspectral Imaging (3)
n Hyperspectral imaging (= imaging spectroscopy) collects & processes
information in multiple bands across electromagnetic spectrum.
n Certain objects leave unique 'fingerprints' across the electromagnetic
spectrum. These 'fingerprints' are known as spectral signatures and
enable identification of the materials that make up a scanned object.
n Often applied in agriculture, mineralogy, physics, and surveillance.

< Data cube: f(x, y, λ) is a voxel; the spectral signature of a pixel is f(x, y) = { f(x, y, λ) }_λ >

Digital Media Lab. http://en.wikipedia.org/wiki/Hyperspectral_imaging 7

Hyperspectral Imaging (4)


n Practical example of hyperspectral imaging
n Airborne sensors: NASA's Airborne Visible/Infrared Imaging
Spectrometer (AVIRIS)
n Satellites: NASA's Hyperion
n Handheld sensors

n The acquisition and processing of hyperspectral images is also referred


to as imaging spectroscopy.

n Considerable structure is present in the observed hyperspectral data.
n Spatial structure as in normal images
n Each pixel’s spectral signature is usually smooth.
→ CS sensing of hyperspectral images: lower-dimensional projections that multiplex
in the spatial domain, the spectral domain, or both.

Digital Media Lab. http://en.wikipedia.org/wiki/Hyperspectral_imaging 8


Hyperspectral Imaging (5)
n Multispectral imaging: use several discrete and narrow bands from
the visible to the longwave infrared.
n Multispectral images do not produce
the "spectrum" of an object.
n Landsat is an example.

n Hyperspectral imaging: imaging


narrow spectral bands over a
continuous spectral range, and
produce the spectra of all pixels in
the scene.
n So a sensor with only 20 bands can
also be hyperspectral when it
covers the range from 500 to 700 nm
with 20 bands each 10 nm wide.
n While a sensor with 20 discrete bands covering the VIS, NIR, SWIR,
MWIR, and LWIR would be considered multispectral.

Digital Media Lab. http://en.wikipedia.org/wiki/Hyperspectral_imaging 9

Hyperspectral Imaging (6)


n Comparison of spatial processing and Spectral Processing in Remote
sensing.

Digital Media Lab. 10


SOME APPLICATIONS OF
HYPERSPECTRAL IMAGING

Digital Media Lab. From http://www.headwallphotonics.com/applications/


11

Applications: Food Safety & Quality


n The utilization of hyperspectral imaging for the in-line inspection of
poultry, fruits, vegetables, and specialty crops not only holds exceptional
potential for increasing the quality and safety of food
products but also offers a significant financial return for food
processors by increasing the throughput and yield of processing
centers.

n While machine vision technology has been a standard approach to


many food inspection and safety applications, hyperspectral imaging
offers the incremental benefit of analyzing the chemical composition
of food products both for in-line inspection and in the
laboratory thereby significantly increasing production
yields. Through high-throughput chemometrics, food
products can be analyzed with hyperspectral sensing
for disease conditions, ripeness, tenderness, grading,
or contamination.
n Agricultural Research
n Crop Management
n High Throughput In-Line Inspection
n Fruits, Vegetables, Poultry, Specialty Crops, Fish
Digital Media Lab. From http://www.headwallphotonics.com/applications/
12
Applications: Forensics
n Hyperspectral imaging, also known as chemical sensing, affords
forensic scientists unique advantages in terms of non-invasively
analyzing crime scenes, evidence, or other objects of interest.
n Crime Scene Investigation
n Counterfeit Detection
n Document Testing & Verification
n Materials Identification
n Non-Invasive Latent Print Analysis

Digital Media Lab. From http://www.headwallphotonics.com/applications/


13

Rf: Latent Print
n Although the word latent means hidden or invisible, in modern usage
for forensic science the term latent prints means any chance or
accidental impression left by friction ridge skin on a surface,
regardless of whether it is visible or invisible at the time of deposition.
n Electronic, chemical and physical processing techniques permit
visualization of invisible latent print residues whether they are from
natural sweat on the skin or from a contaminant such as motor oil, blood,
ink, paint or some other form of dirt.
n The different types of fingerprint patterns, such as arch, loop and whorl,
will be described below.
n Latent prints may exhibit only a small portion of the surface of a finger
and this may be smudged, distorted, overlapped by other prints from
the same or from different individuals, or any or all of these in
combination.
n For this reason, latent prints usually present an “inevitable source of error
in making comparisons,” as they generally “contain less clarity, less
content, and less undistorted information than a fingerprint taken under
controlled conditions, and much, much less detail compared to the actual
patterns of ridges and grooves of a finger.”

http://en.wikipedia.org/wiki/Fingerprint
Digital Media Lab. 14
Applications: Life Science & Biotech.
n Hyperspectral imaging is an invaluable analytical technique for life
sciences and biotechnology applications whether used as a
traditional high performance spectral imaging instrument or whether
deployed as a multi-channel spectroscopy instrument.
n Hyperspectral imaging instruments can give the researcher access to
accurate, calibrated, and repeatable spectral analysis.
n When utilized as a multi-channel spectrometer, researchers are able
to conduct high-throughput screening experiments where high
spectral resolution, spatial differentiation, and channel separation are
all critical parameters. Optimized for high-throughput screening, the
Hyperspec instruments are fully-capable of processing at very high
speeds based on selected spectral bands or wavelengths of interest.
n Fluorescence
n High Throughput Screening
n Laboratory Research & Development
n Multi-Channel Spectroscopy
n Nanobead & Quantum Dot Detection

Digital Media Lab. From http://www.headwallphotonics.com/applications/


15

Applications: Medical Sciences


n For diagnostic medical applications, hyperspectral sensing provides
a highly resolved means of imaging tissues at either a macroscopic
or a cellular level and provides accurate spectral information relating to
the patient, tissue sample, or disease condition.
n It (especially a portable system) can non-invasively scan a complete
sample in-vivo with very high spectral and spatial resolution. The
size of the area or size of the sample that can be scanned is based
on the required field of view (FOV) and the spectral/spatial resolution
required (IFOV, instantaneous field of view) by the application. These
parameters are application-specific and can be adjusted to configure
specific sensor components required to achieve the necessary
diagnostic or investigatory imaging performance.
n Microscopy
n Non-Invasive Diagnostic Imaging
n Optical Biopsy
n Tissue Demarcation
n Therapeutic Analysis

Digital Media Lab. From http://www.headwallphotonics.com/applications/


16
Applications: Microscopy
n For applications such as tracking and classification of cellular drug
absorption and delivery or quantifying the presence of tagged nano-
particles within tissue samples, hyperspectral imaging represents a
valuable extension of traditional research techniques that can utilize
existing optical microscopes available within the laboratory.
n With research samples positioned along the microscope stage, spectral
imaging yields critical analytical information with the addition of a
hyperspectral sensor attached with a C-mount adapter of the exit port of
the microscope.
n With the microscope stage moving the sample area in a "push-
broom" manner, hyperspectral imaging simultaneously yields precise
information for all wavelengths across the complete spectral range of
the sensor.
n Drug Discovery
n Cellular Spectroscopy
n Fluorescence
n Nanobead & Nanoparticle Research

Digital Media Lab. From http://www.headwallphotonics.com/applications/


17

Applications: Military & Defense


n Hyperspectral imaging can be used in various applications
n Border Protection
n Reconnaissance & Surveillance
n Spectral Tagging
n Targeting

Digital Media Lab. From http://www.headwallphotonics.com/applications/


18
Applications: Mining Exploration & Mineral Processing
n Hyperspectral imaging sensors can provide unique benefits to
increasing the production capacity and efficiency of key operational
areas within the mining and mineral processing industries. The
utilization of hyperspectral imaging for the in-line inspection of raw
minerals and materials in process holds notable benefits for not only
increasing the quantity and quality of finished product but also offers
a significant financial return on investment for production facilities by
increasing the throughput and yield within these processing centers -
all with a safe, non-invasive chemical imaging system which can be
cost-effectively deployed at numerous points along the production
process.
n Airborne Exploration
n Drill Core Analysis
n High Volume Process Manufacturing
n Mineral Mapping
n Quarry & Excavation Analysis

Digital Media Lab. From http://www.headwallphotonics.com/applications/


19

Applications: Pharmaceutical Manufacturing
n The utilization of hyperspectral imaging expedites not only the drug
discovery process but holds clear and distinct advantages for
pharmaceutical manufacturers moving novel compounds and drugs
from the laboratory into the manufacturing plant in an environment
governed by the FDA's Process Analytical Technology (PAT)
initiative.
n Blending Quality Control
n Drug Discovery
n Manufacturing to Volume
n Polymorph Analysis
n Spray-Dry Dispersion

Digital Media Lab. From http://www.headwallphotonics.com/applications/


20
Applications: Process Manufacturing
n Implementation of in-line or "at-line" spectral sensing for the
monitoring of critical formulation and inspection processes represents
a valuable analytical technique for capturing important spectral data
critical to the maintenance and operation of key steps within a
process manufacturing operation.
n Within the field of view of the sensor, hyperspectral imaging
simultaneously yields precise information for all wavelengths across
the complete spectral range available. Traditionally, the near infrared
range (NIR) of 900 to 1700 nanometers and the extended visible-near
infrared (Extended VNIR) range of 600 to 1600 nanometers are of
considerable interest for process applications.
n LCD Quality Control
n Semiconductor Operations
n Pharmaceuticals
n Photovoltaics
n Wafer Inspection

Digital Media Lab. From http://www.headwallphotonics.com/applications/


21

Applications: Remote Sensing
n Hyperspectral imaging, also known as chemical sensing, affords
researchers and biologists unique opportunities to conduct both
airborne and stationary spectral analysis for remote sensing
applications.
n Airborne hyperspectral imaging represents an established remote
sensing technique for capturing important spectral data critical to
remote sensing applications. Within the field of view of the sensor,
hyperspectral imaging simultaneously yields precise information for
all wavelengths across the complete spectral range available. With
the creation of the hyperspectral datacube, a data set that includes all
of the spatial and spectral information, researchers are able to
generate and analyze in-depth environmental spectral imaging data.
n Civil & Environmental Engineering
n Environmental Monitoring
n Pollution Detection
n Forestry Management
n Precision Agriculture
n Mineral Exploration

Digital Media Lab. From http://www.headwallphotonics.com/applications/


22
Applications: Space Research & Satellite Sensors
n For space-based applications, hyperspectral sensors offer significant
and unique advantages for researchers. Designed for imaging in
harsh environments, the system should keep high performance
imaging through the utilization of extremely high efficiency optics
which yield exceptional spectral and spatial resolution while featuring
a very tall image slit for a very wide field of view. Also it should offer a
robust, stable, and athermalized design optimized for the rigors of
space research.
n Atmospheric Sciences
n Environmental Monitoring
n Remote Sensing
n Small Satellite Systems

Digital Media Lab. From http://www.headwallphotonics.com/applications/


23

Applications: Antiquities, Artwork & Document Validation (1)


n As a non-destructive imaging technique, hyperspectral imaging
sensors have proven an essential and critical technology enabler to
enhance research and understanding of a wide range of artifacts and
artwork. As an easily deployable imaging instrument, hyperspectral
imaging has been used to reveal secrets of famous documents such
as the Dead Sea Scrolls as well as archeological artifacts such as
pottery shards (ostracons) that are the oldest known representation of
Hebrew writing.
n Archeological Research
n Document Examination & Verification
n Artifact Analysis
n Inspection of Paintings & Artwork

Digital Media Lab. From http://www.headwallphotonics.com/applications/


24
Applications: Antiquities, Artwork & Document Validation (2)
n Hyperspectral imagers offer researchers and scientists unique
advantages:
n Forensic analysis & validation of documents and artifacts
n Discover “original intent” elements & authenticity
n Identify regions for restoration
n Assess original coloring and pigmentation
n Enhance faded or hidden attributes

n Since no preparation of the document or artifact is necessary, this


non-destructive spectral technique is invaluable for a wide range of
historical research and forensic science analysis. Within the field of
view of the sensor, hyperspectral imaging simultaneously yields
precise information for all wavelengths across the complete spectral
range of the sensor. With the creation of the hyperspectral datacube,
a data set that includes all of the spatial and spectral information
within the field of view, research teams are able to more thoroughly
evaluate documents and other crime scene evidence, greatly
enhancing knowledge of the spectral composition and uniqueness of
these samples.

Digital Media Lab. From http://www.headwallphotonics.com/applications/


25

Hyperspectral Imaging using CS

Digital Media Lab.


Compressive hyperspectral imaging (1)
n Single pixel hyperspectral camera
n Instead of the single photodetector in the single pixel camera set-up, use a
single light-modulating element that is reflective across the wavelengths
of interest and a sensor that can record the desired spectral bands
separately.

Digital Media Lab. 27

Compressive hyperspectral imaging (2)


n The single sensor consists of a single spectrometer that spans the
necessary wavelength range, which replaces the photodiode.
n The spectrometer records the intensity of the light reflected by the modulator
in each wavelength.
n The same digital micromirror device (DMD) provides reflectivity for
wavelengths from near infrared to near ultraviolet. Thus, by converting the
datacube into a vector sorted by spectral band, the matrix that operates on
the data to obtain the CS measurements is block-diagonal:

$$ \Phi = \begin{bmatrix} \Phi_{x,y} & 0 & \cdots & 0 \\ 0 & \Phi_{x,y} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \Phi_{x,y} \end{bmatrix} $$

n This architecture performs multiplexing only in the spatial domain, i.e.


dimensions x and y, since there is no mixing of the different spectral bands
along the dimension λ.

Digital Media Lab. 28


Compressive hyperspectral imaging (3)
n Dual disperser coded aperture snapshot spectral imager (DD-CASSI)
n Combines separate multiplexing in the spatial and spectral domain,
which is then sensed by a wide-wavelength sensor/pixel array, thus
flattening the spectral dimension.

Digital Media Lab. http://www.disp.duke.edu/projects/CASSI/index.ptml 29

Compressive hyperspectral imaging (4)


n Single disperser coded aperture snapshot spectral imager (SD-CASSI)
n Simplification of DD-CASSI by removing the first dispersive element.

Digital Media Lab. http://www.disp.duke.edu/projects/CASSI/index.ptml 30


Sparsity structure for hyperspectral data cube (1)
n Dyadic Multiscale Partitioning
n This sparsity structure assumes that the spectral signature for all pixels in
a neighborhood is close to constant.
n That is, the datacube is piecewise constant with smooth borders in the spatial
dimensions. The complexity of an image is then given by the number of
spatial dyadic squares with constant spectral signature necessary to
accurately approximate the datacube
n A reconstruction algorithm then
searches for the signal of lowest
complexity (i.e., with the fewest dyadic
squares) that generates compressive
measurements close to those observed

Digital Media Lab. 31

Sparsity structure for hyperspectral data cube (2)


n Spatial-only Sparsity
n This sparsity structure operates on each spectral band separately and
assumes the same type of sparsity structure for each band.
n The sparsity basis is drawn from those commonly used in images, such
as wavelets, curvelets, or the discrete cosine basis.
n Since each basis operates only on a band, the resulting sparsity basis for
the datacube can be represented as a block-diagonal matrix:

$$ \Psi = \begin{bmatrix} \Psi_{x,y} & 0 & \cdots & 0 \\ 0 & \Psi_{x,y} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \Psi_{x,y} \end{bmatrix} $$

Digital Media Lab. 32


Sparsity structure for hyperspectral data cube (3)
n Kronecker product Sparsity
n This sparsity structure employs separate sparsity bases for the spatial
dimensions and the spectral dimension, and builds a sparsity basis for
the datacube using the Kronecker product of these two.

$$ \Psi = \Psi_{\lambda} \otimes \Psi_{x,y} = \begin{bmatrix} \Psi_{\lambda}[1,1]\,\Psi_{x,y} & \Psi_{\lambda}[1,2]\,\Psi_{x,y} & \cdots \\ \Psi_{\lambda}[2,1]\,\Psi_{x,y} & \Psi_{\lambda}[2,2]\,\Psi_{x,y} & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix} $$

n In this manner, the datacube sparsity basis simultaneously enforces
both spatial and spectral structure, potentially achieving a sparsity level
lower than the sum of the spatial sparsities of the separate spectral
slices, depending on the degree of structure between them and how well
this structure can be captured through sparsity (see the sketch below).
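A compact sketch of this construction is given below; it assumes small toy dimensions and DCT bases for both the spatial and spectral factors (the text allows wavelets or curvelets as well) and only verifies that the Kronecker product yields a valid orthonormal datacube basis.

```python
import numpy as np
from scipy.fft import dct

def dct_basis(n):
    # Orthonormal DCT-II basis matrix (columns form an orthonormal basis of R^n).
    return dct(np.eye(n), norm="ortho", axis=0)

nx, ny, nl = 8, 8, 4                              # toy spatial and spectral sizes (assumed)
Psi_xy = np.kron(dct_basis(nx), dct_basis(ny))    # separable 2-D spatial basis
Psi_lam = dct_basis(nl)                           # spectral basis

Psi = np.kron(Psi_lam, Psi_xy)                    # datacube basis, spectral-band-major ordering
print("Psi shape:", Psi.shape)                    # (nl*nx*ny, nl*nx*ny)
print("orthonormal:", np.allclose(Psi.T @ Psi, np.eye(Psi.shape[0])))
```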

Digital Media Lab. 33

Hyperspectral Imaging (7)


n Compressive sensing will make the largest impact in applications with
very large, high dimensional datasets that exhibit considerable
amounts of structure.

n Hyperspectral imaging is a leading example of such applications


n The sensor architectures and data structure models surveyed in this
module show initial promising work in this new direction, enabling new
ways of simultaneously sensing and compressing such data.
n For standard sensing architectures, the data structures surveyed also
enable new transform coding-based compression schemes.

Digital Media Lab. 34


End of Ch.6 Part 3

Digital Media Lab.
