
Sparse & Redundant Representations

and Their Applications in


Signal and Image Processing
Overview of this Field and this Course

Michael Elad
The Computer Science Department
The Technion – Israel Institute of Technology
Haifa 32000, Israel
What This Field is All About?

Take 1: A New Transform



The Answer is Not Trivial

It depends on whom you ask, as the researchers
in this field come from various disciplines:
• Mathematics
• Applied Mathematics
• Statistics
• Signal & Image Processing: CS, EE, Bio-medical, …
• Computer-Science Theory
• Machine-Learning
• Physics (optics)
• Geo-Physics
• Physics - Astronomy
• Psychology (neuroscience)
• …



My Answer (For Now)

A New Transform for Signals


• We are all well aware of the idea of transforming
a signal and changing its representation
• We apply a transform to gain something –
efficiency, simplicity of the processing, speed, …
• There are many known transforms: Fourier, DCT,
Hadamard, Wavelets and their descendants, …
• Our message: there is a new transform in town,
based on sparse and redundant representations



Transforms – The General Picture

[Diagram: the classical transform picture – an invertible matrix
$D \in \mathbb{R}^{n \times n}$ ties the signal to its representation: $x = D\alpha$]

Such transforms are typically linear and invertible, and often
also unitary, separable, and structured


Redundancy?

• In a redundant transform, the representation vector is
longer than the signal: $\alpha \in \mathbb{R}^m$ with $m > n$, and
$D \in \mathbb{R}^{n \times m}$
• This can still be done while preserving the linearity of
the transform: the inverse transform is $x = D\alpha$, and the
forward transform is $\alpha = D^{\dagger} x$, where $D D^{\dagger} = I$, so that
$D\alpha = D D^{\dagger} x = x$



Sparse & Redundant Representation

[Diagram: $x = D\alpha$ with a redundant $D \in \mathbb{R}^{n \times m}$]

• We shall keep the linearity of the inverse transform
• As for the forward transform (computing $\alpha$ from x),
there are infinitely many possible solutions
• We shall seek the sparsest of all solutions – the one
with the fewest non-zeros



Sparse & Redundant Representation

[Diagram: $x = D\alpha$ with a redundant $D \in \mathbb{R}^{n \times m}$]

• This makes the forward transform a highly non-linear
operation
• The field of sparse and redundant representations is all
about defining this transform clearly, solving various
theoretical and numerical issues related to it, and
showing how to use it in practice



Sparse & Redundant Representation

Sounds great!?
No, in fact, it sounds boring!!!
Who cares about a new transform?



What This Field is All About?

Take 2: Modeling Data



Let's Take a Wider Perspective

[Images: voice signal, still image, stock market, CT, radar
imaging, heart signal, traffic information]

• We are surrounded by various sources of massive
information of different nature
• All these sources have some internal structure, which
can be exploited
Model?

Effective removal of noise (and many other applications)
relies on proper modeling of the signal



Which Model to Choose?

• There are many different ways to mathematically model
signals and images, with varying degrees of success
• The following is a partial list of such models (for images):
Principal-Component-Analysis, Anisotropic diffusion,
Markov Random Field, Wiener Filtering, DCT and JPEG,
Wavelet & JPEG-2000, Piece-Wise-Smooth, C2-smoothness,
Besov-Spaces, Total-Variation, Beltrami-Flow
• Good models should be simple while matching the
signals: a trade-off between Simplicity and Reliability



An Example: JPEG and DCT
[Image series: raw data at 178KB, and JPEG compressions at
24KB, 20KB, 12KB, 8KB, and 4KB]

How and why does it work? JPEG applies the Discrete Cosine
Transform (DCT) to each image block

The model assumption: after the DCT, the top-left
coefficients are dominant and the rest are (nearly) zero
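To make the model assumption concrete, here is a minimal sketch
(with a random stand-in patch and an arbitrary 3×3 kept block)
of a DCT-domain approximation using scipy:

```python
# A minimal sketch of the JPEG/DCT model assumption: transform an 8x8 patch,
# keep only a top-left (low-frequency) block of coefficients, reconstruct.
# The patch is random stand-in data; the kept block size (3x3) is arbitrary.
import numpy as np
from scipy.fft import dctn, idctn

patch = np.random.rand(8, 8)
coeffs = dctn(patch, norm="ortho")
mask = np.zeros((8, 8))
mask[:3, :3] = 1                      # keep only the top-left coefficients
approx = idctn(coeffs * mask, norm="ortho")
print("relative error:", np.linalg.norm(patch - approx) / np.linalg.norm(patch))
```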



Research in Signal/Image Processing

[Diagram: Problem (Application) → Model → Signal → Numerical Scheme]

The fields of signal & image processing are essentially built
of an evolution of models and ways to use them for various
tasks – and this is how a new research work (and paper) is born



Again: What This Field is All About?

A Data Model and its Use

• Almost every task in data processing requires a model –
true for denoising, deblurring, super-resolution,
inpainting, compression, anomaly-detection, sampling,
and more
• There is a new model in town – sparse and redundant
representation – we will call it Sparseland
• We will be interested in a flexible model that can adjust
to the signal



A New Emerging Model

[Diagram: Sparseland at the intersection of many fields –
Signal Processing, Machine Learning, Mathematics, Wavelet
Theory, Approximation Theory, Multi-Scale Analysis, Linear
Algebra, Signal Transforms, Optimization Theory – and feeding
many applications: Compression, Inpainting, Blind Source
Separation, Super-Resolution, Denoising, Demosaicing]



A Closer Look at the

Sparseland Model



The Sparseland Model

• Task: model image patches of size 8×8 pixels
• We assume that a dictionary of such image patches is
given, containing 256 atom images

[Illustration: a patch written as a weighted sum Σ of a few
atoms α₁, α₂, α₃ from the dictionary]

• The Sparseland model assumption: every image patch can
be described as a linear combination of a few atoms



The Sparseland Model

Properties of this model: Sparsity and Redundancy

• We start with an 8-by-8 pixel patch and represent it
using 256 numbers
– This is a redundant representation
• However, out of those 256 elements in the
representation, only 3 are non-zero
– This is a sparse representation
• Bottom line: the 64 numbers representing the patch are
replaced by 6 (3 for the indices of the non-zeros, and 3
for their entries)
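As an illustration of these numbers (64-dimensional patches, 256
atoms, 3 non-zeros), here is a toy sketch of generating a
Sparseland patch; the random dictionary is a stand-in for a
learned one:

```python
# Toy Sparseland generator with the slide's numbers: 8x8 patches (64-dim),
# a 64x256 dictionary, and 3 active atoms. The random dictionary is a
# stand-in for a learned one.
import numpy as np

rng = np.random.default_rng(0)
D = rng.standard_normal((64, 256))
D /= np.linalg.norm(D, axis=0)              # normalize the atoms
alpha = np.zeros(256)
support = rng.choice(256, size=3, replace=False)
alpha[support] = rng.standard_normal(3)     # the sparse representation
x = D @ alpha                               # the resulting patch
print("patch built from", np.count_nonzero(alpha), "atoms")
```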



Chemistry of Data

We could refer to the Sparseland model as the
chemistry of information:
• Our dictionary stands for the Periodic Table,
containing all the elements
• Our model follows a similar rationale: every molecule is
built of a few elements



Model vs. Transform?

• The relation between the signal x and its representation
$\alpha$ is the linear system $x = D\alpha$, just as described earlier
• We shall be interested in seeking sparse solutions to this
system when deploying the sparse and redundant
representation model
• This is EXACTLY the transform we discussed earlier

Bottom line: the transform and the model we described
above are the same thing, and their impact on
signal/image processing is profound and worth studying



Difficulties With Sparseland

• Problem 1: Given an image patch, how can we find
its atom decomposition?
• A simple example:
o There are 2000 atoms in the dictionary
o The signal is known to be built of 15 atoms
o The number of possible supports is
$\binom{2000}{15} \approx 2.4 \times 10^{37}$
• If each test takes 1 nano-second, the whole sweep will
take ~7.5e20 years to finish !!!!!!
• Solution: approximation algorithms
Difficulties With Sparseland

• Various approximation algorithms exist. Their theoretical
analysis guarantees their success if the solution is
sparse enough
• Here is an example – the Iterative Reweighted LS (IRLS):

[Plot: IRLS at iteration 6, recovering a sparse solution
over 2000 entries]
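The following is a minimal IRLS sketch in the spirit of the above
(illustrative sizes and schedule, not the exact algorithm or
parameters behind the plotted experiment):

```python
# A minimal IRLS sketch for approximating the sparsest solution of Ax = b
# (illustrative sizes and annealing schedule, not the slide's experiment).
import numpy as np

def irls(A, b, iters=50, eps=1e-2):
    x = np.linalg.pinv(A) @ b                   # least-squares initialization
    for _ in range(iters):
        W = np.diag(np.abs(x) + eps)            # reweighting matrix
        x = W @ A.T @ np.linalg.solve(A @ W @ A.T, b)
        eps = max(0.5 * eps, 1e-9)              # anneal the smoothing
    return x

rng = np.random.default_rng(1)
A = rng.standard_normal((30, 100))
x0 = np.zeros(100)
x0[rng.choice(100, 4, replace=False)] = rng.standard_normal(4)
x_hat = irls(A, A @ x0)
print("recovery error:", np.linalg.norm(x_hat - x0))
```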



Difficulties With Sparseland

• Problem 2: Given a family of signals, how do we
find the dictionary to represent it well?
• Solution: Learn! Gather a large set of signals
(many thousands), and find the dictionary that
sparsifies them
• Such algorithms were developed in the past decade
(e.g., K-SVD) and their performance is surprisingly good
• This is only the beginning of a new era in signal
processing …



Difficulties With Sparseland
• Problem 3: Is this model flexible enough to
describe various sources? E.g., is it good for
images? Audio? Stocks? …
• General answer: Yes, this model is extremely effective
in representing various sources
o Theoretical answer: yet to be given
o Empirical answer: we will see in this course several
image processing applications where this model leads
to the best known results (benchmark tests)



Difficulties With Sparseland?

• Problem 1: Given an image patch, how can we
find its atom decomposition?
• Problem 2: Given a family of signals, how do we
find the dictionary to represent it well?
• Problem 3: Is this model flexible enough to describe
various sources? E.g., is it good for images? Audio? …





Who Works on This
and Who Are We?



Who is Working in This Field?

Donoho, Candès – Stanford
Goyal – MIT
Tropp – CalTech
Mallat – Ecole-Polytec. Paris
Baraniuk, W. Yin – Rice, Texas
Nowak, Willett – Wisconsin
Gilbert, Vershynin, Plan – U-Michigan
Coifman – Yale
Gribonval, Fuchs – INRIA, France
Romberg – GaTech
Starck – CEA, France
Lustig, Wainwright – Berkeley
Frossard, Cevher – EPFL, Switzerland
Sapiro, Daubechies – Duke
Rao, Delgado – UC San-Diego
Friedlander – UBC, Canada
Do, Ma – U-Illinois
Lei Zhang – Hong-Kong PolyU
Davies – Edinburgh, UK
Cohen, Combettes – Paris VI
Elad, Zibulevsky, Bruckstein, Eldar, Segev, Bronstein, Mendelson – Technion



David L. Donoho

• An extremely talented mathematician and statistician
from the Stanford Statistics Department
• He is among the few who founded this field of sparse
and redundant representations and its spin-off topic of
compressed sensing
• In 2013 he won the Shaw Prize (“the Nobel of the East”)



The Course’s Team: Leading Instructor

Prof. Michael Elad, Computer-Science Department, The Technion

• M. Elad published more than 70 journal papers and a
book in the field of sparse & redundant representations,
spanning theory, algorithms, and applications
• In the past 10 years, Michael Elad has been teaching a
course at the Technion, similar to the one offered here
The Course’s Team: Teaching Assistant

Mr. Yaniv Romano, Electrical Engineering Department, The Technion

• Yaniv is a PhD student, supervised by M. Elad
• He has so far published 8 journal papers in the field of
sparse & redundant representations



This Field is Rapidly Growing …

• Searching ISI-Web-of-Science (December 26th, 2016) with
Topic = ((spars* and (represent* or approx* or solution or
estimation) and (dictionary or pursuit or convex)) or
(compres* and sens* and spars*))
led to 6354 papers (it was ~4000 papers a year ago)
• [Chart: how these papers spread over time
(~146178 citations)]



Which Countries?

[Chart: distribution of these papers by country]



Who is Publishing in This Area?

[Chart: leading authors ranked by number of citations. A group
of authors from the Technion, Israel, is marked, and the list
distinguishes many different authors named Zhang (Lei, Li, Long,
Lan, Liang, Liao, Lin, Lu) from different institutions – Xidian,
HK Polytech, Tsinghua, Xi'an Jiaotong, Chongqing, Beijing, and
Capital Medical]



Books in this field

• We already mentioned the book authored by Michael Elad
(2010). It will serve as the textbook for this course
• This is the first published comprehensive book in this
field, serving as a textbook for similar courses worldwide
• In the past several years, many other books followed …





Several Examples:

Applications Leveraging
this Model



Image Separation [Starck, Elad, & Donoho (`04)]

[Image panels: Galaxy SBS 0335-052 as photographed by Gemini –
the original image; the cartoon part, spanned by wavelets;
the texture part, spanned by a global DCT; and the residual,
being additive noise]



Inpainting [Starck, Elad, and Donoho (‘05)]

Source Outcome



Image Denoising (Gray) [Elad & Aharon (`06)]

[Images: source; noisy image (σ = 20); result (30.829dB);
the initial dictionary (overcomplete DCT, 64×256); and the
dictionary obtained after 10 iterations]



Denoising (Color) [Mairal, Elad & Sapiro (‘06)]

Original Noisy (12.77dB) Result (29.87dB)



Denoising (Color) [Mairal, Elad & Sapiro (‘06)]

Original Noisy (12.77dB) Result (29.87dB)


Original Noisy (20.43dB) Result (30.75dB)



Poisson Denoising [Giryes & Elad (‘14)]

Original Noisy (peak=0.2) Result (PSNR=24.16dB)

Original Noisy (peak=2) Result (PSNR=24.76dB)



Deblurring [Elad, Zibulevsky and Matalon (‘07)]

Original (left), measured (middle), and restored (right):
Iteration: 19, ISNR = 7.0322 dB



Blind Deblurring [Shao and Elad (‘14)]



Inpainting (Again!) [Mairal, Elad & Sapiro (‘06)]

Original 80% missing Result



Inpainting (Again!) [Mairal, Elad & Sapiro (‘06)]



Inpainting (Again!) [Mairal, Elad & Sapiro (‘06)]

Original 80% missing Result



Facial Image Compression [Brytt and Elad (`07)]

[Image panels: original faces and their compressed versions.
Results for 550 bytes per file, per method:
JPEG – 15.81, 13.89, 6.60
JPEG-2000 – 14.67, 12.41, 5.49
Sparse-Based – 15.30, 12.57, 6.36]
Facial Image Compression [Brytt and Elad (`07)]

[Image panels: original faces and their compressed versions.
Results for 400 bytes per file, per method:
JPEG – ?, 18.62, 7.61
JPEG-2000 – ?, 16.12, 6.31
Sparse-Based – ?, 16.81, 7.20]



Super-Resolution [Zeyde, Protter & Elad (‘09)]

[Images: the ideal image; the given (low-resolution) image;
bicubic interpolation (PSNR = 14.68dB); and the SR result
(PSNR = 16.95dB)]



Super-Resolution [Zeyde, Protter & Elad (‘09)]

The Original Bicubic Interpolation SR result



To Summarize

Which model to choose? An effective (yet simple) model for
signals/images is key in getting better algorithms for
various applications. Sparse and redundant representations
and other example-based modeling methods are drawing
considerable attention in recent years.

Are they working well? Yes, these methods have been deployed
in a series of applications, leading to state-of-the-art
results. In parallel, theoretical results provide the
backbone for these algorithms’ stability and good performance.



This Course:

Scope and Style



This Course

Will review 15 years of tremendous progress in the field of

Sparse and Redundant Representations

covering theory, numerical problems, and applications
(image processing)



This Course Structure

• We chose to break this course into two separate parts,
each of which is an independent five-week course:
o Part 1: Introduction to the Fundamentals of
Sparse Representations
o Part 2: Sparse Representations – From
Theory to Practice
• While we recommend that you take both these courses,
each of them can be taken independently of the other



This Course Grading

Part 1: Introduction to the Fundamentals of Sparse
Representations
Part 2: Sparse Representations – From Theory to Practice

• Each of these two parts includes
o Knowledge-check questions and discussions
o A series of quizzes, &
o Matlab programming projects
• Each course will be graded separately, using the average
grades of the questions/discussions [K], quizzes [Q], and
projects [P], by
Final-Grade = 0.1K + 0.5Q + 0.4P
Good Luck!
We hope you will enjoy this course



Sparse & Redundant Representations
and Their Applications in
Signal and Image Processing
Mathematical Warm-Up

Michael Elad
The Computer Science Department
The Technion – Israel Institute of Technology
Haifa 32000, Israel
Underdetermined Linear
System of Equations
& Regularization
Underdetermined Linear System

• Our story begins with a classic problem in linear algebra:
we are given a linear system of equations of the form
$Ax = b$ with n equations and $m > n$ unknowns, where A
is full-rank
• Clearly this system has infinitely many solutions
• Oftentimes in engineering we encounter problems
admitting this form …
• Our Problem: How could we suggest a meaningful single
solution in such a case?

[Diagram: $A \in \mathbb{R}^{n \times m}$, $x \in \mathbb{R}^m$, $b \in \mathbb{R}^n$, $Ax = b$]



Underdetermined Linear System

• Here is a practical example: x is an image of size
100×100 pixels, and b is a 4:1 subsampled version of it
• Solving this linear system amounts to an attempt to
obtain x from b, a problem known as interpolation or
Single-Image Super-Resolution
• Clearly, in such problems we desire a single solution
among the infinitely many that exist



The Answer: Regularization

• If we aim to get a unique solution to our problem,
we should “rank” all the possible solutions based on
a chosen quality measure or penalty, and then choose
the best one
• We define this by:
$\min_x J(x) \ \text{s.t.}\ Ax = b$
• J(x) is the regularization term – it assigns a penalty
value to each possible solution. Among all possible
solutions satisfying the linear system, we seek the
“cheapest” one (the lowest level-set of J(x) that touches
the solution set)



Sounds Easy But …

Who is J(x)?
• Clearly, everything depends on the choice of this
function, so what is it? How should we choose it?
• The answer to this question depends on your needs,
beliefs, mathematical proficiency, your field, and more
• The idea of regularization is central in the sciences, and
it has been used in many fields and in many ways
• Why is it called REGULARIZATION? Because it takes a
problem with a severe singularity (a.k.a. an ill-posed
problem) and turns it into a “regular” one



And the Most Popular Regularization is …

L2-based:
$J(x) = \frac{1}{2}\|Bx\|_2^2$ for some chosen matrix B

Why is it so popular? Because of several critical reasons:
o This choice is easy to mathematically analyze
o This choice leads to a closed-form solution
o It leads to a unique solution under clear conditions



Let’s Have a Closer Look

$\min_x \frac{1}{2}\|Bx\|_2^2 \ \text{s.t.}\ Ax = b$

Let’s derive the closed-form solution to this problem:
o First step: define the Lagrangian that takes care of the
constraint:
$L(x, \lambda) = \frac{1}{2}\|Bx\|_2^2 + \lambda^T (Ax - b)$
o Second step: take the gradient of L w.r.t. x and null it:
$\nabla_x L = B^T B x + A^T \lambda = 0 \ \Rightarrow\ x = -(B^T B)^{-1} A^T \lambda$
o Third step: plug the outcome into the constraint:
$Ax = -A (B^T B)^{-1} A^T \lambda = b \ \Rightarrow\ \lambda = -\left(A (B^T B)^{-1} A^T\right)^{-1} b$
$\Rightarrow\ x = (B^T B)^{-1} A^T \left(A (B^T B)^{-1} A^T\right)^{-1} b$



L2 Regularization – Closed-Form

$\min_x \frac{1}{2}\|Bx\|_2^2 \ \text{s.t.}\ Ax = b$

Theorem: Under some conditions, this problem admits a
unique solution, given by
$x_{opt} = (B^T B)^{-1} A^T \left(A (B^T B)^{-1} A^T\right)^{-1} b$

• As promised, this problem is easy to analyze, it has a
closed-form solution, and this solution is unique under a
simple condition on B and A
• This explains the great popularity of this approach in the
sciences in general, and in image processing in particular



The Temptation
of Convexity
Is There Life Beyond L2?

• We have seen that the L2 regularization is pleasant to
work with, but … this does not mean that it serves our
needs
• Indeed, in image processing, the Wiener filter that
emerges from the L2 regularization strategy is known to
perform very poorly
• So, the question that comes to mind is: what else is out
there that could be used, while preserving the good
properties that the L2 term had?
• The answer: convex functions J(x). Why? Because (strict)
convexity of J(x) guarantees a unique solution, which we
so much desire
Recall: Convex Sets

Definition: A set $\Omega$ is convex if $\forall\, x_1, x_2 \in \Omega$ and
$\forall\, t \in [0,1]$, the convex combination
$x = t x_1 + (1-t) x_2$
is also in $\Omega$

[Illustration: a convex set vs. a non-convex set]



Convex Sets: An Example

• Consider the set of all the solutions of a given linear
system: $\Omega = \{x \mid Ax = b\}$
• This set is convex, since $\forall\, x_1, x_2 \in \Omega$ and $t \in [0,1]$
we get (using $A x_1 = b$ and $A x_2 = b$)
$A\left(t x_1 + (1-t) x_2\right) = t A x_1 + (1-t) A x_2 = t b + (1-t) b = b$
• Observe that the above is correct even if t is outside the
interval [0,1]. What does that mean? It means that $\Omega$ is
not merely convex but an affine subspace



Recall: Convex Functions

Definition: A function $f(x): \mathbb{R}^m \to \mathbb{R}$ is convex if $\forall\, x_1, x_2 \in \mathbb{R}^m$
and $\forall\, t \in [0,1]$, the convex combination
$x = t x_1 + (1-t) x_2$
satisfies
$f\left(t x_1 + (1-t) x_2\right) \le t f(x_1) + (1-t) f(x_2)$

Here is an alternative definition that ties convex
functions to convex sets:

Definition: A function $f(x): \mathbb{R}^m \to \mathbb{R}$ is convex if its epigraph
$\{(x, y) \mid y \ge f(x)\}$
is a convex set



Convex Functions: Illustration

tf  x 1   (1  t)f  x 2 
f x
Epigraph

x1 x x2

f  t x 1  (1  t)x 2   tf  x 1   (1  t)f  x 2 

Convex function

x1 x x2

Non-Convex function



Convexity via Derivatives

If f(x) is twice continuously differentiable, we can tie
convexity to the function’s derivatives:

Theorem: A function $f(x): \mathbb{R}^m \to \mathbb{R}$ is convex iff $\forall\, x_1, x_2$
$f(x_2) \ge f(x_1) + \nabla f(x_1)^T (x_2 - x_1)$

i.e., the tangent planes bound the function from below

[Illustration: the tangent at $x_1$ stays below a convex
function; for a non-convex function it crosses the graph]



Theorem: A function $f(x): \mathbb{R}^m \to \mathbb{R}$ (twice continuously
differentiable) is convex iff $\forall\, x \in \mathbb{R}^m$
$\nabla^2 f(x) \succeq 0$
i.e., the Hessian is positive semi-definite (PSD)
everywhere in $\mathbb{R}^m$



Convexity vs. Strict Convexity

In all the definitions for convexity, if we do not allow
equality, the convexity becomes strict:
$f\left(t x_1 + (1-t) x_2\right) < t f(x_1) + (1-t) f(x_2)$
$f(x_2) > f(x_1) + \nabla f(x_1)^T (x_2 - x_1)$
$\nabla^2 f(x) \succ 0$

[Illustration: piecewise-linear and flat-bottomed functions
that are convex but not strictly so]



Convexity vs. Strict Convexity

Example:
Is the following function (strictly) convex?
$J(x) = \frac{1}{2}\|Bx\|_2^2$

Answer:
The first derivative of this expression is given by
$\nabla J(x) = B^T B x$
and the second derivative is $\nabla^2 J(x) = B^T B$

If B has full column-rank, then $B^T B$ is positive definite,
and thus this function is strictly convex

Question: What about the function $J(x) = \|Bx\|_2$?
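A small numerical check of this argument (random matrices as
stand-ins): the Hessian $B^T B$ is positive definite exactly
when B has full column rank:

```python
# Hessian check for J(x) = (1/2)||Bx||_2^2: the Hessian is B^T B, positive
# definite iff B has full column rank (random matrices as stand-ins).
import numpy as np

rng = np.random.default_rng(3)
B_tall = rng.standard_normal((10, 4))   # full column rank (w.p. 1): PD
B_flat = rng.standard_normal((3, 4))    # rank-deficient: only PSD
for B in (B_tall, B_flat):
    print("min eigenvalue of B^T B:", np.linalg.eigvalsh(B.T @ B).min())
```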
And Now to the Punch-Line ...

Theorem: Consider the following constrained problem:
$\min_x f(x) \ \text{s.t.}\ x \in \Omega$
where $f(x): \mathbb{R}^m \to \mathbb{R}$ is a convex function and $\Omega$ is
a convex set. Then:
o Every local minimum point is a global one
o The set of optimal solutions forms a convex set
o If f(x) is strictly convex, the solution is unique

• Returning to the problem $\min_x J(x) \ \text{s.t.}\ Ax = b$,
if J(x) is strictly convex, we get a unique solution,
as wished
• The L2 case is simply a special case of this broader
property
So, Which J(x) to Choose?

There are many options, and we’ll mention one specific &
popular family of such functions:

• The Lp-norm ($p \ge 1$): $\|x\|_p^p = \sum_k |x_k|^p$
• Special cases of interest are
L1: $\|x\|_1 = \sum_k |x_k|$
L∞: $\|x\|_\infty = \max_k |x_k|$

Note 1: L1 is convex but not strictly so
Note 2: For all $1 < p < \infty$, the norm function is convex but
not strictly so. Its p-th power becomes strictly convex
(just like the L2 case)
Note 3: For $p < 1$, the above is not a convex function, and
therefore it is not a formal norm [see next slide]



Norms vs. Convexity

• A norm is a function $\|x\|: \mathbb{R}^m \to \mathbb{R}$ with the properties:
o Non-negativity: $\forall x,\ \|x\| \ge 0$, and $\|x\| = 0 \Leftrightarrow x = 0$
o Homogeneity: $\forall x\ \&\ \forall c,\ \|cx\| = |c| \cdot \|x\|$
o Triangle inequality: $\forall x, y,\ \|x + y\| \le \|x\| + \|y\|$
• For $0 \le t \le 1$ and for any pair of vectors x and y, the
triangle inequality and homogeneity give
$\|t x + (1-t) y\| \le \|t x\| + \|(1-t) y\| = t\|x\| + (1-t)\|y\|$
so any valid norm function is necessarily convex
• Thus $\|x\|_p^p = \sum_k |x_k|^p$ with $p < 1$, being NOT CONVEX,
is not a valid norm


A Closer Look at
the L1 Minimization
Our Goal

Better understanding of the (P1) problem:
$(P_1)\ \min_x \|x\|_1 \ \text{s.t.}\ Ax = b$

Why are we so interested in this choice of J(x)?
Because, as we are about to see, it favors
SPARSE SOLUTIONS



(P1) Solutions form a Convex Set

• Let’s start with a relatively simple-to-prove property
of (P1):

Theorem: The set of optimal solutions of (P1),
S, forms a bounded and convex set

• Recall: L1 is not strictly convex, and thus a multitude
of solutions may exist
• Example: $\min_{\alpha,\beta} |\alpha| + |\beta| \ \text{s.t.}\ \alpha + \beta = 2$
(i.e., $A = [1\ 1]$, $b = [2]$) – every point on the segment
between $[2, 0]^T$ and $[0, 2]^T$ is optimal
• Our goal here is to say that the situation in this case is
not so grave, since all the solutions are “concentrated”



Proof – Part 1: Convexity

• We start with x₁ & x₂ in S, thus
$\|x_1\|_1 = \|x_2\|_1 = J_{min}$ and $A x_1 = A x_2 = b$
• Consider a convex ($0 \le t \le 1$) combination of x₁ and x₂:
$x^* = t x_1 + (1-t) x_2$
o It satisfies the constraint:
$A x^* = A\left(t x_1 + (1-t) x_2\right) = b$
o Due to the convexity of L1,
$\|x^*\|_1 = \|t x_1 + (1-t) x_2\|_1 \le t \|x_1\|_1 + (1-t) \|x_2\|_1 = J_{min}$
o Due to the optimality of x₁ and x₂: $\|x^*\|_1 \ge J_{min}$
$\Rightarrow\ x^* = t x_1 + (1-t) x_2$ is in the optimal solution set



Proof – Part 2: Bounded Set

• If x₁ and x₂ are any two optimal solutions in S, then
$\|x_1\|_1 = \|x_2\|_1 = J_{min}$
• Using the triangle inequality we get
$\|x_1 - x_2\|_1 \le \|x_1\|_1 + \|x_2\|_1 = 2 J_{min}$

The L1-diameter of the set S is at most $2 J_{min}$

To summarize: S – the set of all optimal solutions of
(P1) – is bounded and convex



Tendency to Sparse Solutions

Theorem: The set of optimal solutions of (P1), S, must
contain a solution with at most n non-zeros

Example: $\min_{\alpha,\beta} |\alpha| + |\beta| \ \text{s.t.}\ \alpha + \beta = 2$
(i.e., $A = [1\ 1]$, $b = [2]$, n = 1): among the segment of
optimal solutions, the endpoints $[2, 0]^T$ and $[0, 2]^T$ are
sparse, with a single non-zero each



Proof: (1)

• Assume that $x^* \in S$, with $k > n$ non-zeros
• The term $A x^*$ linearly combines k columns from A to
form b
• Since $k > n$, clearly these columns are linearly
dependent, so (assuming w.l.o.g. that these are the
first k columns in A)
$\exists\, h \ne 0 \mid A h = 0$ and $\text{supp}(h) \subseteq \text{supp}(x^*)$



Proof: (2)

• Consider the vector $x^* + \varepsilon h$ for a very small $\varepsilon$
o Observation 1: $x^* + \varepsilon h$ is a feasible solution:
$A(x^* + \varepsilon h) = A x^* + \varepsilon A h = b$
o Observation 2: For small enough values of $\varepsilon$,
$|\varepsilon| \le \min_{1 \le j \le k} \frac{|x_j^*|}{|h_j|}$,
the signs are preserved: $\text{sign}(x^* + \varepsilon h) = \text{sign}(x^*)$
o Observation 3: For these $\varepsilon$ we have
$\|x^* + \varepsilon h\|_1 = \sum_{j=1}^{k} \text{sign}(x_j^*)\,(x_j^* + \varepsilon h_j)
= \sum_{j=1}^{k} \text{sign}(x_j^*)\, x_j^* + \varepsilon \sum_{j=1}^{k} \text{sign}(x_j^*)\, h_j
= \|x^*\|_1 + \varepsilon\, h^T \text{sign}(x^*)$



Proof: (3)

$\|x^* + \varepsilon h\|_1 = \|x^*\|_1 + \varepsilon\, h^T \text{sign}(x^*)$

o If $h^T \text{sign}(x^*) > 0$: $\|x^* + \varepsilon h\|_1 < \|x^*\|_1$ for $\varepsilon < 0$
o If $h^T \text{sign}(x^*) < 0$: $\|x^* + \varepsilon h\|_1 < \|x^*\|_1$ for $\varepsilon > 0$

Both these cases stand in complete disagreement with the
optimality of x*. Could this term be 0? It must be:
$h^T \text{sign}(x^*) = 0 \ \Rightarrow\ \|x^*\|_1 = \|x^* + \varepsilon h\|_1$



Proof: (4)

Conclusions:
• $x^* + \varepsilon h$ is in S as well (just as optimal as x*)
• We can choose $\varepsilon$ to null one entry in $x^* + \varepsilon h$, getting
a new optimal solution with k−1 non-zeros
• We can proceed this way and null elements in the optimal
solution until we get n non-zeros, as claimed

We cannot guarantee going below n, since then the linear
dependence of the columns in A is not given



Disclaimer

We have just seen that (P1) favors sparse solutions:
$(P_1)\ \min_x \|x\|_1 \ \text{s.t.}\ Ax = b$
namely, a solution with at most $n < m$ non-zeros

While this course is indeed dealing with
SPARSE REPRESENTATIONS
the above is not the sparsity we are aiming for, as we will
be interested in (much) fewer non-zeros in x



Conversion of (P1) to
Linear Programming
Solving (P1)

• We now turn to present a useful connection between our
problem, (P1), and Linear Programming (LP)
• General LP structure:
$(LP)\ \min_x c^T x \ \text{s.t.}\ Ax = b\ \&\ x \ge 0$
• This connection is valuable since there are many
effective LP solvers available, and with this connection
they become relevant for solving our task:
$(P_1)\ \min_x \|x\|_1 \ \text{s.t.}\ Ax = b$
• We shall build this (P1)-LP connection by construction
Splitting the Unknown

• Starting with (P1), we propose to split the unknown x
into its positive and negative entries:
$x = u - v$, where $x, u, v \in \mathbb{R}^m$ and $u \ge 0\ \&\ v \ge 0$
• Example: for $x = [1, 3, 0, -4, -1, 0, 0, 2]^T$ we get
$u = [1, 3, 0, 0, 0, 0, 0, 2]^T$ and $v = [0, 0, 0, 4, 1, 0, 0, 0]^T$
o All the positive entries go to u
o All the negative ones go (negated) to v
o Both vectors are non-negative
o As a by-product of this split, since their supports do
not overlap, we get $u^T v = 0$



Turning to Use u and v

• Let’s plug this split into the (P1) problem:
$(P_1)\ \min_x \|x\|_1 \ \text{s.t.}\ Ax = b$
o The constraint becomes
$b = Ax = A(u - v) = [A, -A]\begin{bmatrix} u \\ v \end{bmatrix}$
o The objective becomes
$\|x\|_1 = \sum_{k=1}^{m} |x_k| = \sum_{k=1}^{m} u_k + \sum_{k=1}^{m} v_k = [\mathbf{1}^T, \mathbf{1}^T]\begin{bmatrix} u \\ v \end{bmatrix}$
o And we add the constraint $\begin{bmatrix} u \\ v \end{bmatrix} \ge 0$
• So, it appears that we could replace x with its new
description and modify (P1) to an equivalent form



From (P1) to LP

• Plugging this split into the (P1) problem gives
$(LP)\ \min_{u,v} [\mathbf{1}^T, \mathbf{1}^T]\begin{bmatrix} u \\ v \end{bmatrix} \ \text{s.t.}\ b = [A, -A]\begin{bmatrix} u \\ v \end{bmatrix}\ \&\ \begin{bmatrix} u \\ v \end{bmatrix} \ge 0$
• Looks like an LP problem, no?
• One problem, though: what about the condition $u^T v = 0$?
Inserting it into the above problem ruins the LP structure!



Is there a Problem?

Theorem: The constraint $u^T v = 0$ is a passive one, and thus
can be omitted from the LP formulation without harm

Proof:
o Assume that the optimal solution violates $u^T v = 0$:
there exists a non-zero entry in the same location in u
and v. Assume that it is $u_1 v_1 > 0$ (e.g., representing
$x_1 = 1$ by $u_1 = 3$ and $v_1 = 2$)
o Replacing $u_1$ by $u_1 - v_1 > 0$, and $v_1$ by 0, we get
• The constraint is not affected: $A(u - v) = b$
• The penalty $\mathbf{1}^T(u + v)$ is reduced by $2 v_1 > 0$
Our initial assumption about a violation of $u^T v = 0$ does
not hold, and the optimal solution does obey this constraint
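With the passivity of $u^T v = 0$ established, (P1) can be handed
to any off-the-shelf LP solver. A minimal sketch using scipy's
linprog (random data for illustration; "highs" is one available
solver choice):

```python
# A sketch of solving (P1) through the LP reformulation x = u - v.
# linprog's default bounds are (0, None), i.e., u, v >= 0, as required.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(2)
n, m = 15, 40
A = rng.standard_normal((n, m))
x0 = np.zeros(m)
x0[rng.choice(m, 3, replace=False)] = rng.standard_normal(3)  # sparse truth
b = A @ x0

c = np.ones(2 * m)                 # objective: 1^T u + 1^T v  (= ||x||_1)
A_eq = np.hstack([A, -A])          # constraint: [A, -A][u; v] = b
res = linprog(c, A_eq=A_eq, b_eq=b, method="highs")
x = res.x[:m] - res.x[m:]
print("non-zeros in the LP solution:", np.sum(np.abs(x) > 1e-8))
```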
Sparse & Redundant Representations
and Their Applications in
Signal and Image Processing
Seeking Sparse Solutions

Michael Elad
The Computer Science Department
The Technion – Israel Institute of Technology
Haifa 32000, Israel
Promoting Sparse
Solutions
What Have We Done So Far?

$Ax = b$ has infinitely many solutions. What to do?
Regularize: choose J(x) and solve $\min_x J(x) \ \text{s.t.}\ Ax = b$

• Option 1: choose $J(x) = \|x\|_2^2$
• Option 2: choose some other convex J(x). Other options?
• Option 3: choose $J(x) = \|x\|_1$ – tendency to sparse
solutions. Properties? So?

We will now try to better understand this behavior and then
discuss its importance for our needs



Why L1 Leads to Sparsity? Answer #1

• In order to understand this tendency better, consider the
following problem: find the shortest x in L1, given that
its L2-norm is normalized:
$\min_x \|x\|_1 \ \text{s.t.}\ \|x\|_2 = 1$
• While an algebraic solution could be applied here, let’s
try a geometric view
• The solution is given at the “corners” of the L1-sphere,
such as $x_{opt} = [1, 0, 0, 0, 0]^T$
• Conclusion: Among all L2-normalized vectors, the
shortest in L1 are also the sparsest possible



Why L1 Leads to Sparsity? Answer #2

• Indeed, let’s consider our more general problem for p = 1
and p = 2, and again consider things geometrically:
$\min_x \|x\|_p^p \ \text{s.t.}\ Ax = b$
• Geometrically, we inflate the ball $\|x\|_p = const$ until it
first touches the affine set $\{x \mid Ax = b\}$: for p = 1 the
touching point is typically a corner of the ball (a sparse
vector), while for p = 2 it is a generic, dense point
• Conclusion: L2 tends to dense solutions, while L1 (again)
promotes sparse ones
So, Why L1?

• If the migration from L2 to L1 promotes sparsity, why
not go further, to Lp with p < 1?
$\min_x \|x\|_p^p \ \text{s.t.}\ Ax = b$
• Clearly, this choice is no longer a valid norm, as it is
not a convex function, but if it serves our needs, why
not use it?
• [Illustration: the Lp-ball with p < 1 is “spiky”, touching
the set $Ax = b$ at sparse points even more decisively]
• Conclusion: As p decreases below 1, we strengthen the
tendency to get sparse solutions
Other Options

• We have to emphasize: there is nothing special about the
Lp-norm expression
$\|x\|_p^p = \sum_k |x_k|^p$
and one could suggest other functions
• More broadly, any function of the form $J(x) = \sum_{k=1}^{m} \rho(x_k)$
could fit, if the scalar function $\rho(\cdot)$ is
o Symmetric: $\rho(-x) = \rho(x)$
o Monotonically non-decreasing for x > 0
o With a monotonically non-increasing derivative for x > 0
• Examples: $\rho(x) = \log(1 + \alpha|x|)$, $\rho(x) = 1 - e^{-\alpha|x|}$,
$\rho(x) = \frac{|x|}{1 + |x|}$
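A small sketch evaluating these example penalties (the value
α = 2 is an arbitrary choice):

```python
# Evaluating the example penalties rho(x); alpha = 2 is an arbitrary choice.
import numpy as np

alpha = 2.0
penalties = {
    "log(1 + a|x|)":   lambda x: np.log(1 + alpha * np.abs(x)),
    "1 - exp(-a|x|)":  lambda x: 1 - np.exp(-alpha * np.abs(x)),
    "|x| / (1 + |x|)": lambda x: np.abs(x) / (1 + np.abs(x)),
}
xs = np.linspace(-2, 2, 5)
for name, rho in penalties.items():
    print(name, np.round(rho(xs), 3))  # symmetric, concave for x > 0
```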



The L0-Norm and the
(P0) Problem
From Lp to L0

• If we are interested in the sparsest solution of Ax = b,
why not push p all the way to zero? What does that mean?
• Consider the “influence functions” $|x|^p$ for varying p:
[Plot: $|x|^p$ for p = 2, 1, 0.5, 0.1 over $-1.5 \le x \le 1.5$]
• As $p \to 0$, $|x|^p$ becomes an indicator function:
$I(x) = \begin{cases} 0 & x = 0 \\ 1 & x \ne 0 \end{cases}$



L0-Norm: Definition

Definition: Given a vector $x \in \mathbb{R}^m$, its L0-norm is defined as
$\|x\|_0 = \lim_{p \to 0} \|x\|_p^p = \sum_{k=1}^{m} I(x_k)$

• This L0-norm simply counts the number of non-zeros in x
• Source of confusion: is it $\|x\|_0$ or $\|x\|_0^0$?
o While we took the limit of $\|x\|_p^p$, the outcome is
considered a “norm” without the 0-th power
o We shall use both notations to mean the same thing:
the number of non-zeros (the cardinality) of x
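A quick numerical illustration of this limit on a toy vector:

```python
# ||x||_p^p -> number of non-zeros as p -> 0 (a toy vector for illustration).
import numpy as np

x = np.array([0.0, 3.0, 0.0, -0.5, 2.0])
for p in [1.0, 0.5, 0.1, 0.01]:
    print(p, np.sum(np.abs(x) ** p))   # approaches 3, the count of non-zeros
print("L0:", np.count_nonzero(x))
```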



L0-Norm: It is Not Really a Norm!

$\|x\|_0 = \lim_{p \to 0} \|x\|_p^p = \sum_{k=1}^{m} I(x_k)$

• Clearly, $\|x\|_0$ is not a valid norm, as it is not convex
• Which of the norm axioms is violated?
o Non-negativity: $\|x\|_0 \ge 0$ – holds
o Homogeneity: $\|cx\|_0 \ne |c| \cdot \|x\|_0$ – violated
(scaling a vector does not change its non-zero count)
o What about the triangle inequality?



L0-Norm and the Triangle Inequality

• The triangle inequality: $\|x + y\|_0 \le \|x\|_0 + \|y\|_0$
• In words: the number of non-zeros in x+y is equal to or
smaller than the sum of the non-zeros in x and y separately

Equality – when the supports do not overlap:
$x = [1, 0, 0, 7, 1, 0]^T$, $y = [0, 1, 0, 0, 0, -2]^T$:
$5 = \|x + y\|_0 = \|x\|_0 + \|y\|_0 = 5$
Inequality – when the supports overlap:
$x = [1, 0, 0, 7, 1, 0]^T$, $y = [3, 1, 0, 0, 0, -2]^T$:
$5 = \|x + y\|_0 \le \|x\|_0 + \|y\|_0 = 6$
or when there is a cancellation:
$x = [1, 0, 0, 7, 1, 0]^T$, $y = [-1, 1, 0, 0, 0, -2]^T$:
$4 = \|x + y\|_0 < \|x\|_0 + \|y\|_0 = 6$



L0-Norm Properties: Summary

$\|x\|_0 = \lim_{p \to 0} \|x\|_p^p = \sum_{k=1}^{m} I(x_k)$

• Clearly, $\|x\|_0$ is not a valid norm, as it is not convex
• Which of the norm axioms is violated?
o Non-negativity: $\|x\|_0 \ge 0$ – holds
o Homogeneity: $\|cx\|_0 \ne |c| \cdot \|x\|_0$ – violated
o Triangle inequality: surprisingly, it holds:
$\|x + y\|_0 \le \|x\|_0 + \|y\|_0$



Defining Our Holy Grail: (P0)

Definition: Given an under-determined linear system of
equations, Ax = b, the (P0) problem seeks its sparsest
solution:
$(P_0)\ \min_x \|x\|_0 \ \text{s.t.}\ Ax = b$

Observe how similar this problem is to
$(P_2)\ \min_x \|x\|_2^2 \ \text{s.t.}\ Ax = b$

And yet, we know so much about (P2) and so little about (P0)



What Do We Want to Know About (P0)?

$(P_2)\ \min_x \|x\|_2^2 \ \text{s.t.}\ Ax = b$ vs. $(P_0)\ \min_x \|x\|_0 \ \text{s.t.}\ Ax = b$

Here is what we know about (P2):
• There is a unique solution to this problem
• This solution is easily computed (closed-form)

Our questions about (P0):
• Is there a unique solution? Under which conditions?
• Given a candidate solution, could we test its optimality
easily?
• How can we get this solution in reasonable time?



(P0) is (Really) Hard

• Assume we are given (P0) with
o m = 2000, n = 500
o We know that $\|x\|_0 = 20$
• Our goal: find x using a direct approach that checks all
the possible supports:
$\binom{2000}{20} \approx 3.9 \times 10^{47}$
• If every such test takes 1 nano-second, we will conclude
the whole calculation in ~1.2e31 years (!!!)

Is There Hope? Why Should We Care?
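The arithmetic behind these numbers, for the curious:

```python
# The arithmetic from the slide: testing all supports of size 20 out of 2000
# columns at one test per nano-second.
from math import comb

tests = comb(2000, 20)
print(f"{tests:.1e} possible supports")              # ~3.9e+47
seconds = tests * 1e-9
print(f"{seconds / (3600 * 24 * 365):.1e} years")    # ~1.2e+31
```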



Do Sparse Solutions Exist?

Here is an interesting observation:
• Draw A ($n \times m$) at random, draw b at random, and feed
both to a solver of $(P_0)\ \min_x \|x\|_0 \ \text{s.t.}\ Ax = b$
• Question: What would the cardinality of the solution be?
• Answer: $\|x\|_0 = n$ with probability 1 – the solution is
not expected to be “deeply” sparse at all
• So, why bother with (P0), when (P1) could have given
this solution easily?

Clearly, this is not the case we are after.
So, what are we trying to achieve?



Our Point of View

• Draw A ($n \times m$) at random, draw an s-sparse x₀ at
random, and multiply: $b = A x_0$
• Obviously, in this case there is a sparse solution to the
linear system, and the solver of
$(P_0)\ \min_x \|x\|_0 \ \text{s.t.}\ Ax = b$
will give $\|x\|_0 \le s$
• Our field is not about covering all $b \in \mathbb{R}^n$, but rather
about treating only those b that can be explained
“sparsely” – their “dimension” is O(s). In doing so, we
aim to reveal the underlying structure of b, believed to
be a composition of few atoms from A



A Signal Processing
Perspective
Mathematical Riddle?

$(P_0)\ \min_x \|x\|_0 \ \text{s.t.}\ Ax = b$

So far, our discussion has been purely mathematical: we are
interested in better understanding the problem (P0), which
seeks the sparsest solution of a linear system.

We argue that this is more than just a mathematical riddle:
as we will show in this course, (P0) and related topics are
of central importance to data processing in a universal way



Signal Modeling

• The linear system Ax = b describes a signal construction
process
• We consider our signal (the vector b) to be composed as
a linear combination of $s \ll n$ columns from A
• What does this mean? It implies that our n-dimensional
signals are in fact simpler, residing in a union of many
s-dimensional subspaces
• Is it so far-fetched? Clearly, we do know that all the
data we operate on is compressible, suggesting that an
inner structure in it is quite plausible
Signal Modeling: An Example

• Piecewise-constant signals can be described as a sparse
combination of atoms taken from the Heaviside dictionary
(each atom is a step: 0, and then 1 from some index onward)
• [Illustration: a signal with three jumps written as
$b = x_5 h_5 + x_9 h_9 + x_{16} h_{16}$, a combination of three
Heaviside atoms]
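A sketch of this example (the atom indices 5, 9, 16 follow the
illustration, here 0-based; the jump values are arbitrary):

```python
# Heaviside-dictionary sketch: column k of A is a step that is 1 from index k
# onward, so a few atoms sum to a piecewise-constant signal.
import numpy as np

n = 20
A = np.tril(np.ones((n, n)))       # atom k: zeros, then ones from row k on
x = np.zeros(n)
x[[5, 9, 16]] = [1.0, -0.7, 0.5]   # three active atoms, arbitrary values
b = A @ x                          # a signal with exactly three jumps
print(b)
```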



Is This a New Idea?

• JPEG assumed that only a few coefficients of the DCT
capture most of the image content (actually, the
assumption is even more restrictive: the non-zero
coefficients are expected to be the leading ones, as
in PCA)
• JPEG-2000 does the same with the wavelet transform
(allowing any location for the non-zeros)
• Transform (e.g., wavelet) sparsity has been used
effectively for handling problems such as denoising,
deblurring, …
• In our language, A is simply the inverse transform,
since it takes us from the representation-space to the
signal-space
Why Redundancy?

• We propose to use redundant dictionaries (m > n)
• Why? Because
o It enriches the model further, enabling more
combinations for the signal construction
o It enables getting sparser solutions



Sparse & Redundant Representations

• The concept of seeking the sparsest representation for
data has been around, and has been found useful, in many
fields already
• It is now time to make it more clear and more precise,
which is exactly the goal of this field
• … so, welcome to Sparseland

