
Face Recognition with Compressive Sensing

Mathias Lohne

Spring, 2017

1 Introduction

Traditional signal processing usually follows Shannon and Nyquist's sampling theorem. It states that a signal can be recovered perfectly if the sampling frequency is at least twice the highest frequency present in the signal. However, this theorem assumes that the samples are taken with a uniform time interval, and that the reconstruction happens by interpolating the samples with sinc functions.

Compressive sensing is a new scheme for sampling and reconstruction which abandons these assumptions. It turns out that by placing different assumptions on the signal, and by using a different recovery strategy, we can recover our signal with fewer samples than the traditional theory suggests.

The field of compressive sensing has exploded in the last decade, after the initial publications by Candès, Tao, Romberg and Donoho [CT06; CRT06; Don06]. Some of the ideas and concepts have roots further back in time, but these publications mark the beginning of compressive sensing as a field of study. The key idea is to assume that the signal is sparse, and then look for the sparsest possible signal which matches the sampled signal. A lot of new theory has been developed for recovering such signals.

During the development of this theory, several applications of compressive sensing techniques have been found. In this report, we will work more closely with one of these applications, namely face recognition. Face recognition is perhaps one of the most studied problems in machine learning and statistical classification, partly because the traditional approaches still show several weaknesses, as we will discuss in Section 3, but also because of the human mind's extraordinary ability to recognize faces, even when key elements such as skin tone, hair color or length, facial hair or glasses change.

In this report we will first give a general introduction to the basic concepts of compressive sensing, and then look at how these techniques can be applied to the face recognition problem.

1.1 Preliminaries and Notation

In this section we will introduce some of the necessary notation and concepts for this report.

Throughout we will denote vectors by boldface lower case letters, and matrices by boldface upper case letters. For a vector x, we will write xi for the i'th element of the vector. Similarly, for a matrix A, we will write ai for the i'th column of A, and ai,j for the element of A in row i and column j.

Sets will be denoted by italic upper case letters, and the cardinality of a set S is denoted by |S|. The complement of a set S will be written as S̄.

Norms

As we will soon see, a central part of compressive sensing is a minimization problem involving different norms of a vector. Therefore, we will begin with the definition of a norm. Even though the author suspects this to be known material, it is included for ease of reference later.

Definition 1.1. Let V be a vector space over a field K. A norm ‖·‖ on V is a function ‖·‖ : V → R such that

(i) ‖v‖ ≥ 0 for all v ∈ V, with ‖v‖ = 0 if and only if v = 0

(ii) ‖cv‖ = |c| ‖v‖ for all v ∈ V and c ∈ K

(iii) ‖u + v‖ ≤ ‖u‖ + ‖v‖ for all u, v ∈ V

This definition is quite broad and very general. In this report we will be working more closely with a family of norms called the ℓp norms:

Definition 1.2. Let p ≥ 1 and p ∈ R. The ℓp norm ‖·‖p is defined as

    ‖v‖p = ( Σ_{i=1}^{n} |vi|^p )^{1/p}

We will not prove here that the ℓp norms actually are norms, but this can be proven for all p ≥ 1. We observe that for p = 1 we get the Manhattan norm, and for p = 2 we get the Euclidean norm. If we let p → ∞ we arrive at the supremum norm. Even though Definition 1.2 does not allow for p < 1, it can be proven that if we let p → 0, accept that 0^0 = 0, and ignore the 1/p-exponent, we get what is sometimes called the ℓ0 norm:

Definition 1.3. The ℓ0 norm ‖v‖0 of a vector v is the number of non-zero entries in v.

It is worth noting that the ℓ0 norm is strictly speaking not a norm, since ‖·‖0 does not fulfill axiom (ii) of Definition 1.1. In fact, for all q < 1, ‖·‖q is not a norm, as ‖·‖q does not fulfill axiom (iii) for any q ∈ (0, 1). Despite this, it is customary to refer to ‖·‖0 as the ℓ0 norm.

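To make the definitions concrete, here is a small numerical sketch of the ℓp and ℓ0 "norms" (Python with NumPy is assumed here; the report itself does not prescribe any particular tooling):

    import numpy as np

    def lp_norm(v, p):
        """The l_p norm from Definition 1.2 (a true norm only for p >= 1)."""
        return np.sum(np.abs(v) ** p) ** (1.0 / p)

    def l0_norm(v):
        """The 'l_0 norm' of Definition 1.3: the number of non-zero entries."""
        return np.count_nonzero(v)

    v = np.array([3.0, 0.0, -4.0, 0.0])
    print(lp_norm(v, 1))   # 7.0 (Manhattan norm)
    print(lp_norm(v, 2))   # 5.0 (Euclidean norm)
    print(l0_norm(v))      # 2   (two non-zero entries)
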
Support and Sparsity

As compressive sensing deals with the recovery of sparse vectors, we will need to define sparsity. Before we do that, we will introduce support:

Definition 1.4. Let v ∈ CN. The support of v is defined as the index set of its non-zero entries, that is:

    supp v = {j ∈ {1, 2, . . . , N} | vj ≠ 0}

The notion of support yields a new formulation of Definition 1.3: the ℓ0 norm is simply the cardinality of the support, ‖v‖0 = |supp v|.

For a vector v ∈ CN and a set S ⊂ {1, 2, . . . , N}, we denote by vS either the subvector in C|S| consisting of the entries of v indexed by S, that is,

    (vS)i = vi for i ∈ S,    (1.1)

or the vector in CN which coincides with v on the indices in S and is zero otherwise, that is,

    (vS)i = vi if i ∈ S, and 0 otherwise.    (1.2)

It should always be clear from context which of these is used. Similarly, for a matrix A ∈ Rm×n we will by AS ∈ Rm×|S| refer to the matrix consisting of the columns of A indexed by S.

The final concept we will introduce is the notion of sparsity:

Definition 1.5. A vector v ∈ CN is said to be s-sparse if it has no more than s non-zero entries, that is, ‖v‖0 ≤ s.

Note that any vector supported on a set S with |S| = s is s-sparse.

2 A Sparse Introduction to Compressive Sensing

We will introduce the basic idea of compressive sensing with an example quite similar to the one found in [BL13]. Suppose we have 100 coins. We suspect that a few of them might be counterfeit, and thus have a slightly different weight than the normal coins.

The naive approach to finding these coins would be to weigh every one of them with an electronic scale, and detect the coins that are off. In other words, we would have to do as many measurements as there are coins. But what if we weighed more than one coin at a time?

Suppose, instead, that we include, for example, half of the coins in every weighing. The recorded weight would be the sum of all the included coins. Would we be able to make do with fewer weighings than 100?

Figure 2.1: Visualization of the two-dimensional case in Example 2.1: ℓ1 (left) and ℓ2 (right) balls, along with the solution space to Az = y (dashed line).

Say we for example made 20 measurements, and recorded only the deviation from the expected weight. This would lead to the following system of equations:

    Ax = y

Here A ∈ R20×100 is a matrix where each row corresponds to one weighing, and the element ai,j is either 1 or 0, depending on whether coin j was included in weighing i or not. The vector y ∈ R20 is our measurement vector, and x ∈ R100 is the solution to our problem.

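As a concrete illustration, the coin example can be simulated along the following lines (a sketch only: the 20 weighings below are chosen at random rather than by any careful design, and NumPy is assumed):

    import numpy as np

    rng = np.random.default_rng(1)
    n_coins, n_weighings = 100, 20

    # x holds each coin's deviation from the expected weight;
    # only a few coins are counterfeit, so x is sparse.
    x = np.zeros(n_coins)
    x[rng.choice(n_coins, size=3, replace=False)] = rng.normal(0, 0.5, size=3)

    # Each row of A selects roughly half of the coins for one weighing.
    A = (rng.random((n_weighings, n_coins)) < 0.5).astype(float)

    # y records the deviation of each weighing from its expected value.
    y = A @ x
    print(A.shape, y.shape)   # (20, 100) (20,)
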
Unfortunately, since A has far more columns than rows, this system is underdetermined, meaning that solving it yields an infinite set of solutions.

Here comes the key idea of compressive sensing: we will assume that our solution vector x is sparse. In our example it would make sense to assume that most coins are not counterfeit, so the solution vector x (consisting of deviations from the expected weight) would mostly have elements equal to 0. Hence, we want to choose the solution from the solution space of Ax = y with the smallest number of non-zero elements.

The rest of this chapter will consider how to find the sparsest solution in practice, as well as some of the properties we need A to possess in order to make sure that this procedure works for all s-sparse vectors.

2.1 The General Setting

As we will soon see, if we assume that our solution vector x is the sparsest solution z to the system of equations Az = y, there is hope to find a unique solution even though our system of equations is underdetermined. In Section 2.2 we will look at when we have unique solutions up to a given sparsity s, but for now we will only note that minimizing the number of non-zero elements will reduce the solution space drastically. Formally we can write this as an optimization problem as follows:

    minimize_{z ∈ CN} ‖z‖0  subject to  Az = y    (P0)

However, this problem turns out to be intractable in practice. It is, in fact, NP-hard in general. A proof of the NP-hardness of (P0) is found in Section 2.3 of [FR13], and is obtained by reducing the exact cover by 3-sets problem, which is known to be NP-complete, to (P0).

Basis Pursuit

Since (P0) is computationally hard, we need something to approximate it. One intuitive guess would be to use minimization of another norm, like the ℓ1 or ℓ2 norm. It turns out that this is what we usually do in practice.

The question then becomes: which norm do we use? It can be shown that as p gets lower, the ℓp norm approximates the ℓ0 norm better [FR13, Section 4.1]. We will use the lowest value of p possible, while still having a minimization problem that can be solved in polynomial time. Thus, we will use the ℓ1 norm. Figure 2.1 illustrates why ℓ1-minimization works well to find the ℓ0 minimum.

We will illustrate this with an example:

Example 2.1. Let

    A = [1  2],    y = 2

Figure 2.2: Left: The original 5-sparse vector x ∈ R100. Center: The sensed vector y ∈ R20. Right: Result of ℓ1-minimization.

We want to find the solution z to Az = y that minimizes the ℓ0 norm, and we will use the ℓ1 and ℓ2 norms to approximate the ℓ0 norm.

The solution space of Az = y is a line in R2. Intuitively, we can visualize finding the solution to Az = y that minimizes the ℓp norm as taking an ℓp-ball Bp(0, r) centered at the origin and increasing the radius r until it intersects this solution space. The intersection is then the optimal solution. This is what we have shown in Figure 2.1.

The result of ℓ1-minimization is z = (0, 1), which is a correct ℓ0 minimum. The result of ℓ2-minimization is z = (0.4, 0.8), which is not an ℓ0 minimum.

This example illustrates both why ℓ1-minimization is a reasonable choice of approximation, and why ℓ2-minimization is not.

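A quick numerical check of Example 2.1 (a sketch; NumPy is assumed, and the ℓ1 minimum is found by scanning along the solution line rather than with a proper solver):

    import numpy as np

    A = np.array([[1.0, 2.0]])
    y = np.array([2.0])

    # Minimum-l2-norm solution via the pseudoinverse: z = A^T (A A^T)^{-1} y
    z_l2 = np.linalg.pinv(A) @ y
    print(z_l2)                              # [0.4 0.8], not sparse

    # Parametrize the solution line as z(t) = (2 - 2t, t) and scan its l1 norm
    t = np.linspace(-3.0, 3.0, 60001)
    l1 = np.abs(2.0 - 2.0 * t) + np.abs(t)
    t_star = t[np.argmin(l1)]
    print(2.0 - 2.0 * t_star, t_star)        # approximately 0 and 1: the sparse solution
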
We formalize this new problem:

    minimize_{z ∈ CN} ‖z‖1  subject to  Az = y    (P1)

This is also called basis pursuit. We will later discuss how one can guarantee that the solution to (P1) is actually the solution to (P0).

An illustration of this procedure is found in Figure 2.2. In this example we have drawn a random 5-sparse vector x ∈ R100. Using a random sensing matrix A ∈ R20×100, where the elements ai,j are drawn iid from N(0, 1), we obtain our sampled vector y = Ax ∈ R20. We have then applied Vegard Antun's implementation [Ant16] of the Orthogonal Matching Pursuit algorithm described in [FR13, Section 3.2] to recover the sparse vector, giving us our reconstructed vector.

Because we have used a random sensing matrix, there is no guarantee that basis pursuit works. There exist results which state that there is a certain probability that the reconstruction yields the correct result (ie, that A exhibits the NSP, which will be defined later). Running the experiment described above several times does in fact give the wrong result some of the time.

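The experiment behind Figure 2.2 can be reproduced along the following lines. This is a minimal orthogonal matching pursuit sketch written for this report, not Antun's implementation, and since the data is random the recovery can fail from run to run:

    import numpy as np

    def omp(A, y, s):
        """Greedy sparse recovery: repeatedly pick the column of A most
        correlated with the residual, then least-squares fit on that support."""
        m, N = A.shape
        support, residual = [], y.copy()
        coef = np.zeros(0)
        for _ in range(s):
            j = int(np.argmax(np.abs(A.T @ residual)))
            if j not in support:
                support.append(j)
            coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
            residual = y - A[:, support] @ coef
        x_hat = np.zeros(N)
        x_hat[support] = coef
        return x_hat

    rng = np.random.default_rng(0)
    x = np.zeros(100)
    x[rng.choice(100, size=5, replace=False)] = rng.standard_normal(5)
    A = rng.standard_normal((20, 100))         # a_ij ~ N(0, 1), iid
    y = A @ x
    print(np.linalg.norm(x - omp(A, y, s=5)))  # usually tiny, but not guaranteed
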
Recasting (P1) as a Linear Program

Now that we have a tractable way of finding sparse solutions, we will look at one way to actually solve (P1). In this section we will see how one can use linear programming to do this. Linear programming (LP) is the study of problems of the form

    maximize_{x ∈ KN} c^T x  subject to  Ax ≤ b,  x ≥ 0    (2.1)

Thus, to use one of the algorithms developed for LP, we need to rewrite (P1) as a linear program (ie, of the form (2.1)).

In order to do this, three problems arise. First, we see from (2.1) that LP problems do not take constraints on equality form, which is what we have in (P1). Second, LP problems only optimize linear functions (ie, functions that can be written as a dot product between the solution vector and some constant vector); absolute values, as we have in the ℓ1 norm, cannot be described this way. Third, a general LP problem requires all values in x to be non-negative, which is not a constraint we have in (P1).

We begin by addressing the first issue. This can be quite easily solved by observing that an equality constraint can be rewritten as two inequality constraints:

    Az = y  ⟺  Az ≤ y and Az ≥ y

The second issue is a common one in LP, and thus there are also common ways to work around it. One popular way is to introduce new variables ti such that |zi| ≤ ti for i ∈ {1, 2, . . . , N}. Thus,

    minimize_{z ∈ CN} Σ_{i=1}^{N} |zi|

can be rewritten as

    minimize_{z,t ∈ CN} Σ_{i=1}^{N} ti  subject to  |zi| ≤ ti for all i ∈ {1, 2, . . . , N}

We can then rewrite the constraints as

    zi − ti ≤ 0,  −zi − ti ≤ 0    for all i ∈ {1, 2, . . . , N}

to arrive at the standard LP form. The ti's are clearly non-negative, since they are defined to be greater than or equal to an absolute value.

The third issue is also a quite common one in LP. We now have two decision vectors, z and t, where the elements of z do not need to be non-negative. We solve this by introducing two new decision vectors z+ and z−, which will replace z in the problem formulation. We define z+ and z− as follows:

    (z+)i = zi if zi > 0, and 0 otherwise
    (z−)i = −zi if zi < 0, and 0 otherwise

It is clear that z = z+ − z−, and also that z+, z− ≥ 0. Substituting z+ − z− for z in the original problem will then solve this issue.

A final issue is that general LP problems concern maximization, whereas (P1) is a minimization problem. This is easily solved by observing that minimizing a function f is equivalent to maximizing −f.

Combining all of the above, we arrive at the LP formulation of basis pursuit:

    maximize  − Σ_{i=1}^{N} ti
    subject to   Az+ − Az− ≤ y
                −Az+ + Az− ≤ −y
                 z+ − z− − t ≤ 0
                −z+ + z− − t ≤ 0
                 z+, z−, t ≥ 0        (2.2)

Thus, we can use all the celebrated algorithms for linear programming to solve (P1), such as the Simplex method or various interior point methods [Van14].

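As a sketch of how (2.2) might look in code: the snippet below builds the inequality system with the variables stacked as [z+, z−, t] and hands it to SciPy's linprog (which minimizes by default and accepts non-negativity bounds directly, so the sign flip is handled for us). It is meant as an illustration, not as a way to solve large instances:

    import numpy as np
    from scipy.optimize import linprog

    def basis_pursuit_lp(A, y):
        """Solve (P1) via the LP (2.2), variables stacked as [z+, z-, t]."""
        m, N = A.shape
        I = np.eye(N)
        Z = np.zeros((m, N))
        c = np.concatenate([np.zeros(2 * N), np.ones(N)])   # minimize sum(t)
        A_ub = np.block([
            [ A, -A,  Z],    #  A z+ - A z-  <=  y
            [-A,  A,  Z],    # -A z+ + A z-  <= -y
            [ I, -I, -I],    #  z+ - z- - t  <=  0
            [-I,  I, -I],    # -z+ + z- - t  <=  0
        ])
        b_ub = np.concatenate([y, -y, np.zeros(2 * N)])
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None))
        return res.x[:N] - res.x[N:2 * N]                    # z = z+ - z-

    A = np.array([[1.0, 2.0]])
    y = np.array([2.0])
    print(basis_pursuit_lp(A, y))   # close to [0, 1], as in Example 2.1
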
2.2 Good Sensing Matrices

In this section we will look at the minimum number of measurements required, as well as some of the features we want our sensing matrix A to possess, in order to ensure that basis pursuit works well.

We will start by looking at how we can ensure unique s-sparse solutions to the ℓ1-minimization. Then we will look at how we can make sure that the solution to (P1), which is the problem we will solve in practice, is in fact the solution to (P0), which is the problem we actually want to solve.

Minimum Number of Measurements

We begin by looking at how many measurements we need in order to ensure that (P0) has only one s-sparse solution. We begin by stating the main result of this section:

Theorem 2.2. Suppose that for every s-sparse vector x the following equality holds:

    {z ∈ CN | Az = Ax, ‖z‖0 ≤ s} = {x}

that is, every s-sparse x is the unique s-sparse solution to (P0). Then the number of measurements m (ie, the number of rows in A) must satisfy m ≥ 2s.

Before proving this theorem, we need the following lemma:

Lemma 2.3. Every s-sparse vector x is the unique s-sparse solution to (P0) with y = Ax if and only if every set of 2s columns of A is linearly independent.

We will not prove Lemma 2.3 in this report, but a proof can be found in [FR13, Theorem 2.13]. We are now ready to prove the main result:

Proof of Theorem 2.2. Assume that it is possible to uniquely recover any s-sparse vector x from the knowledge of its measurement vector y = Ax. Then, by Lemma 2.3, every set of 2s columns of A must be linearly independent. This implies that rank A ≥ 2s. From elementary linear algebra we

know that the rank of a matrix cannot be bigger than the number of rows, hence rank A ≤ m. Combining this, we get that

    2s ≤ rank A ≤ m

which concludes the proof.

Theorem 2.2 only gives a condition which must be satisfied in order for (P0) to have unique solutions; it gives no guarantee that the solution is unique whenever the number of measurements is at least 2s (ie, it is necessary, but not sufficient). However, it can be proven that there exist matrices that are guaranteed to have unique s-sparse solutions to (P0) if m ≥ 2s. One such matrix is the matrix consisting of the first 2s rows of the DFT matrix.

Theorem 2.4. For any N ≥ 2s, there exists a practical procedure for the reconstruction of every s-sparse vector from its first m = 2s discrete Fourier measurements.

A proof of this theorem will not be given here, but can be found in [FR13, Theorem 2.15].

The Null Space Property

So far, we have only looked at the intuitive reasoning for why ℓ1-minimization works well to find the ℓ0 minimum. In this section we will formalize this relationship using the Null Space Property (often abbreviated NSP). It can be shown that, for the NSP, the real and complex cases are equivalent (for a formal statement and proof, see Theorem 4.7 in [FR13]). Hence we will state the definitions and results for a field K, which can be either R or C. The NSP is defined as follows:

Definition 2.5. A matrix A ∈ Km×N is said to satisfy the Null Space Property relative to a set S ⊂ {1, 2, . . . , N} if

    ‖vS‖1 < ‖vS̄‖1    for all v ∈ ker A \ {0}

It is said to satisfy the Null Space Property of order s if it satisfies the Null Space Property relative to every set S ⊂ {1, 2, . . . , N} with |S| ≤ s.

The definition of the NSP might seem a bit arbitrary, but as we will soon see, the NSP is directly connected to the success of basis pursuit.

Theorem 2.6. Given a matrix A ∈ Km×N, every vector x ∈ KN supported on a set S is the unique solution to (P1) with y = Ax if and only if A satisfies the NSP relative to S.

Proof. We will begin by proving that if every vector x supported on S uniquely solves (P1), then A satisfies the NSP relative to S.

Given an index set S, assume that every vector x ∈ KN supported on S is the unique solution to

    minimize_{z ∈ CN} ‖z‖1  subject to  Az = Ax    (P1)

Since ker A is a subspace of KN, it is clear that for any v ∈ ker A \ {0}, the vector vS is the unique solution to

    minimize_{z ∈ CN} ‖z‖1  subject to  Az = AvS    (2.3)

Because v ∈ ker A, we have that Av = 0, which means that A(vS + vS̄) = 0, giving us A(−vS̄) = AvS. Hence −vS̄ is also a feasible solution to (2.3), but since vS is assumed to be the unique optimal solution to (2.3), we get that ‖vS‖1 < ‖−vS̄‖1. Since ‖·‖1 is a norm, we have that ‖−vS̄‖1 = |−1| ‖vS̄‖1 = ‖vS̄‖1 by Definition 1.1. Thus, we arrive at the following inequality:

    ‖vS‖1 < ‖vS̄‖1

This establishes the NSP for A, relative to S.

To prove the other implication, assume that the NSP holds for A relative to a given set S. Let x be a vector in KN supported on S, let z ∈ KN be a vector that satisfies Ax = Az, and assume that x ≠ z. Our goal will be to show that ‖z‖1 must be strictly bigger than ‖x‖1, which will prove the uniqueness of the solution.

Define v = x − z. Since Ax = Az, we have that

    0 = Ax − Az = A(x − z) = Av

This means that v ∈ ker A. Since x ≠ z, we also have that v ≠ 0. If we use the triangle inequality for norms, as well as the definition of v, we obtain

    ‖x‖1 = ‖x − zS + zS‖1 ≤ ‖x − zS‖1 + ‖zS‖1 = ‖vS‖1 + ‖zS‖1

Now, using the assumption that A satisfies the NSP relative to S, we get the next inequality:

    ‖vS‖1 + ‖zS‖1 < ‖vS̄‖1 + ‖zS‖1

Using the definition of v and z again, we arrive at our final result:

    ‖vS̄‖1 + ‖zS‖1 = ‖xS̄ − zS̄‖1 + ‖zS‖1 = ‖−zS̄‖1 + ‖zS‖1 = ‖z‖1

This proves that ‖x‖1 < ‖z‖1 for any z ∈ KN satisfying Ax = Az and x ≠ z. This establishes the required minimality of ‖x‖1, and thus the uniqueness of the solution.

Theorem 2.6 is not that interesting by itself, but if we let the set S vary, it immediately yields a more general result:

Corollary 2.7. Given a matrix A ∈ Km×N, every s-sparse vector x ∈ KN is the unique solution to (P1) with y = Ax if and only if A satisfies the NSP of order s.

Before we prove this result, we give a small remark: Corollary 2.7 shows that if A satisfies the NSP of order s, the ℓ1-minimization strategy of (P1) will actually solve (P0) for all s-sparse vectors.

Proof of Corollary 2.7. Assume every s-sparse vector x ∈ KN is the unique solution to (P1). Then, for every set S with |S| ≤ s, every vector x0 ∈ KN supported on S is s-sparse, and hence the unique solution to (P1). By Theorem 2.6 we then have that A must satisfy the NSP relative to S. Since this is true for all S with |S| ≤ s, A must satisfy the NSP of order s.

Conversely, assume that A satisfies the NSP of order s. Then, from Definition 2.5, we have that A satisfies the NSP relative to S for every set S with |S| ≤ s. From Theorem 2.6 we have that every vector x ∈ KN supported on such a set S is the unique solution to (P1). Since this is true for any set S with |S| ≤ s, it is true for any s-sparse vector.

2.3 Sparsity in the Real World

The assumption that our desired signal is sparse might seem very strict. After all, images are usually not mostly black, and songs are usually not mostly silence.

In this section we will first introduce compressibility to ease the sparsity assumption. Then, we will see how we can transform most natural signals into a sparse or compressible representation.

Compressibility

A compressible vector is a vector where most entries are almost zero. To define a compressible vector more precisely, we must first define what we mean by almost zero:

Definition 2.8. For any p > 0, the ℓp-error of best s-term approximation to a vector x ∈ CN is defined by

    σs(x)p = inf{ ‖x − z‖p : z is s-sparse }

We say that a vector x is s-compressible if σs(x)1 is "small".

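The best s-term approximation error is easy to compute, since the infimum in Definition 2.8 is attained by keeping the s largest-magnitude entries of x and discarding the rest. A small sketch (NumPy assumed):

    import numpy as np

    def best_s_term_error(x, s, p=1):
        """sigma_s(x)_p from Definition 2.8: the l_p norm of everything
        outside the s largest-magnitude entries of x."""
        idx = np.argsort(np.abs(x))[::-1]     # indices sorted by decreasing magnitude
        tail = np.abs(x[idx[s:]])             # entries outside the best s-term support
        return np.sum(tail ** p) ** (1.0 / p)

    x = np.array([5.0, -3.0, 0.1, 0.05, -0.02])
    print(best_s_term_error(x, s=2, p=1))     # 0.17: x is well approximated by 2 terms
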
The second potential issue we will cover is when our measured vector y contains some noise, so that Ax ≈ y. If the distance between the distorted measurements y and the real, noise-free signal Ax is bounded by a parameter ε, we can rewrite the approximation constraint as ‖Ax − y‖2 ≤ ε. This motivates the following variant of basis pursuit:

    minimize_{z ∈ CN} ‖z‖1  subject to  ‖Az − y‖2 ≤ ε    (P1,ε)

It can be shown that if the sensing matrix exhibits a strengthening of the NSP called the robust NSP, the error made by solving (P1,ε) is bounded by a weighted sum of σs(x)1 and ε [FR13, Section 4.3].

Achieving Sparsity or Compressibility

So far we have simply assumed our solution vector x to be sparse or compressible. However, most real life signals are rarely sparse in their original form. Hence, we need some way to represent natural signals in a sparse way.

We will achieve this by applying what is known as a sparsifying transform. Many such transforms exist, but in this report we will consider the Haar wavelet transform as an example. The Haar wavelet is by far not the most efficient sparsifying transform, but it is quite understandable, which is why we have chosen it.

The key idea in a wavelet transform is to take some object, expressed in a high resolution wavelet basis, and express it in terms of a lower resolution basis and a detail basis. In the specific case of the Haar

Figure 2.3: Left: the original image of Lily. Center: the image after a 1-level discrete wavelet transform using
the Haar wavelet. Right: the image after a 2-level Haar DWT, ie: a 1-level DWT applied to the center image. In
this example, we have used the 2D DWT implementation provided in [Rya16].

wavelet, those functions are defined as follows:

    φ(t) = 1 if 0 ≤ t < 1, and 0 otherwise
    ψ(t) = 1 if 0 ≤ t < 1/2,  −1 if 1/2 ≤ t < 1,  and 0 otherwise

By shifting and scaling these functions we get a basis for the low resolution space (from the φ's) and the detail space (from the ψ's). We will then consider the coefficients of the image matrix as coordinates in a high-resolution wavelet basis.

The Haar Discrete Wavelet Transform (Haar DWT) is essentially a change of coordinates from the higher resolution wavelet basis to a lower resolution basis and a detail basis. Figure 2.3 illustrates what happens when the DWT is applied to an image. The upper left corner of the resulting image is the low resolution space. This is usually not any sparser or more compressible than the original image. However, in the upper right, lower left and lower right corners we see the detail space. This is highly compressible, and it is also clear that the total compressibility of the image has increased, ie, the number of "significant" components has decreased (recall that in an image, coefficients near 0 are depicted as dark/black). We will denote the matrix corresponding to this change of coordinates by Ψ.

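A one-level 2D Haar DWT can be sketched directly from this description: averages over 2×2 blocks of pixels go to the low resolution corner, and horizontal, vertical and diagonal differences go to the three detail corners. This is a simplified, unnormalized sketch with even image dimensions assumed, not the implementation from [Rya16]:

    import numpy as np

    def haar_dwt2_level1(img):
        """One level of a 2D Haar DWT, arranged as in Figure 2.3:
        [[low resolution, horizontal detail], [vertical detail, diagonal detail]]."""
        a = img[0::2, 0::2]
        b = img[0::2, 1::2]
        c = img[1::2, 0::2]
        d = img[1::2, 1::2]
        low      = (a + b + c + d) / 4.0      # local averages
        detail_h = (a - b + c - d) / 4.0      # differences between columns
        detail_v = (a + b - c - d) / 4.0      # differences between rows
        detail_d = (a - b - c + d) / 4.0      # diagonal differences
        return np.block([[low, detail_h], [detail_v, detail_d]])

    img = np.add.outer(np.arange(8.0), np.arange(8.0))   # a smooth toy "image"
    coeffs = haar_dwt2_level1(img)
    # The low resolution block carries the energy; the diagonal detail block is zero here.
    print(np.abs(coeffs[:4, :4]).max(), np.abs(coeffs[4:, 4:]).max())
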
We could assume that this change of coordinates has been done prior to the sensing, so that our solution vector x = Ψx0 is sparse. Here x0 is the underlying non-sparse solution. However, we will instead include the change of coordinates in our sensing matrix A. Hence the sensing matrix becomes the following product:

    A = PΩ Φ Ψ−1    (2.4)

Here, PΩ is a matrix describing the down-sampling, Φ denotes the measurement pattern used, and Ψ is the basis in which x is sparse. Typical choices include letting Ψ be some wavelet basis, letting Φ be rows from the discrete Fourier matrix, and letting PΩ pick out the first n rows, or n randomly chosen rows.

This means that solving (P1) results in a recovered vector z which is also expressed in the Ψ-basis. To recover our real, non-sparse vector z0 we would then have to apply the reverse change of coordinates:

    z0 = Ψ−1 z

For ease of notation, we will simply refer to the overall matrix as A, even though we think of it as a product of multiple matrices. The concepts we discussed earlier in Section 2.2 apply to this product A. We also note that we will later use Φ to denote a matrix generated from training images. That is a different matrix, and should not be confused with the measurement pattern here.

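To make (2.4) concrete, here is a sketch of how such a product could be assembled explicitly for a short 1D signal, with the DFT as the measurement pattern Φ (real part only, to stay in R for this illustration), a random row selection for PΩ, and a one-level Haar matrix for Ψ. Everything is small and dense purely for illustration; in practice these matrices are never formed explicitly:

    import numpy as np

    n, m = 16, 8
    rng = np.random.default_rng(0)

    # Psi: one-level 1D Haar change of coordinates (averages on top, details below)
    I = np.eye(n // 2)
    Psi = 0.5 * np.vstack([np.kron(I, [1, 1]),      # low resolution coefficients
                           np.kron(I, [1, -1])])    # detail coefficients

    # Phi: the discrete Fourier matrix (real part only, for this real-valued sketch)
    Phi = np.real(np.fft.fft(np.eye(n))) / np.sqrt(n)

    # P_Omega: pick m of the n rows at random
    P_Omega = np.eye(n)[rng.choice(n, size=m, replace=False)]

    A = P_Omega @ Phi @ np.linalg.inv(Psi)          # the sensing matrix of (2.4)
    print(A.shape)                                  # (8, 16)
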
3 Applications to Face Recognition

We will now shift our focus away from general compressive sensing, and look at how these techniques of sparse recovery can be used to build a framework for a complete face recognition system.

The face recognition problem is a classical one in the area of statistical learning and machine intelligence. Hence it is a broadly studied problem, with many proposed solutions. The classical way to do face recognition is to first do what is called feature extraction. One can think of feature extraction as a projection onto a lower dimensional feature space, such that the requirements for memory and computational power are reduced. Popular methods for this include Principal Component Analysis and the Discrete Cosine Transform [BW14].

After projecting the images down to a lower dimensional space, one typically applies some statistical classification scheme. Typical classifiers used in classical face recognition include Linear Discriminant Analysis (LDA) [BW14], K Nearest Neighbor/Subspace (KNN/KNS) [LHK05] and (linear-kernel) Support Vector Machines (SVM) [Wri+09]. These methods seem to work well in a controlled environment. However, when parameters such as lighting or noise level change, or when a subject has occluded parts of his/her face (like the addition of glasses), these methods often begin to struggle [EK12].

In this chapter we will introduce a new classification scheme called Sparse Representation-based Classification (SRC) [Wri+09]. This is an alternative to the LDA or KNN discussed above, but it does not replace the dimensionality reduction. In this report, however, we will not discuss feature extraction in any more detail, but focus on the classification scheme.

In order to do face recognition, we first need a set of N labeled training images {(φi, li)}_{i=1}^N. This is a set of images which our algorithm will use to learn what the different people we want to recognize look like. Here, (φi, li) denotes the tuple consisting of the i'th image φi and a label li ∈ {1, 2, . . . , C} indicating which of the C subjects the image φi is of. The task of the system is then, given a new test image y, to say which of the C subjects is pictured in y.

3.1 Sparse Representation-based Classification

First we must address a potential problem. All of the theory developed in Section 2 concerns vectors in KN, while we usually think of images as matrices in Rm×n. It seems we have a problem with dimensionality. However, this problem can easily be solved by simply stacking all the columns of the image matrix A on top of each other, so that

    A = [a1 a2 · · · an] ∈ Rm×n

becomes

    a = [a1^T a2^T · · · an^T]^T ∈ Rmn

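In NumPy terms this stacking is just a column-major flatten, for instance (a trivial sketch):

    import numpy as np

    img = np.arange(12.0).reshape(3, 4)     # a small 3x4 "image" matrix
    a = img.flatten(order="F")              # stack the columns on top of each other
    print(a.shape)                          # (12,), a vector in R^(mn)
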
We begin by making an observation: if our training images of subject i are of varying illumination, and if we assume that they are all aligned correctly, we expect a test image of subject i to be closely approximated by a linear combination of the training images. That is, for a new test image y of subject i, there exist ki coefficients c1, c2, . . . , cki ∈ R (here, ki is the number of images of subject i) such that:

    y ≈ Σ_{j : lj = i} φj cj    (3.1)

We note that if we define ci ∈ Rki to be the vector of all these coefficients, and Φi ∈ Rmn×ki to be the matrix with all the corresponding images as columns, we can rewrite (3.1) as:

    y ≈ Φi ci    (3.2)

This assumes that i is known, which of course it is not. However, we observe that we expect the test image to be a linear combination only of the training images corresponding to the correct subject. This means that the coefficients corresponding to any other subject should be 0, or at least very small. Thus we arrive at the zero-padded version of ci:

    c0 = [· · · 0^T ci^T 0^T · · ·]^T ∈ RN

Now, if we concatenate all the images in the entire training database into a matrix as such:

    Φ = [Φ1 Φ2 · · · ΦC] ∈ Rmn×N

where Φi is the matrix consisting of all the training images of subject i concatenated, we see that we can rewrite (3.2) as:

    y ≈ Φc0    (3.3)

This will clearly be the case, no matter what i is.

Summarizing what we have shown so far: we are given a sensed test image y, and we

are trying to find a vector of coefficients c0 such that y ≈ Φc0. We also know that c0 should be highly sparse, since we expect most images in the database to not be similar to the test image. In fact, we only expect a fraction of roughly 1/C of the entries of c0 to be non-zero.

In the last chapter we introduced theory for finding the sparsest solution to such systems of equations. However, we now have a system of approximations. If we assume that the error is less than some error term ε, we can state a minimization problem that we expect c0 to be the optimal solution to:

    minimize_{c ∈ RN} ‖c‖1  subject to  ‖y − Φc‖2 ≤ ε    (3.4)

We observe that (3.4) has the same form as (P1,ε). Thus, we can use compressive sensing techniques to solve this minimization problem.

By finding an optimal solution to (3.4), we get a vector ĉ0 of estimated coefficients. It remains to determine which person we believe the image to be of. We will do this by choosing the person who minimizes the squared error term, that is, the person i who minimizes the distance between the sampled test image y and the linear combination of training images of person i. In this respect, SRC can be thought of as a type of nearest neighbor classifier.

Summarizing all of the above, we get the algorithm described in Algorithm 3.1.

Algorithm 3.1 Sparse Representation-based Classification for face recognition

Input: A matrix of training images Φ ∈ Rmn×N of C persons, a test image y ∈ Rmn, and a tolerance ε > 0.

1. Normalize the columns of Φ to have ℓ2 norm 1.

2. Solve the ℓ1-minimization problem:

    ĉ0 = arg min_c ‖c‖1  subject to  ‖Φc − y‖2 ≤ ε    (3.5)

3. Compute the residuals ri(y) = ‖y − Φi (ĉ0)i‖2 for all i ∈ {1, 2, . . . , C}. Here Φi and (ĉ0)i denote the training images and estimated coefficients associated with person i.

Output: l̂y = identity of y = arg min_i ri(y)

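A compact sketch of Algorithm 3.1 is given below. It uses cvxpy as a stand-in ℓ1 solver for step 2 (any solver for (P1,ε) would do), and the function and variable names are ours, not from [Wri+09]:

    import numpy as np
    import cvxpy as cp

    def src_classify(Phi, labels, y, eps):
        """Sparse Representation-based Classification (Algorithm 3.1).

        Phi    : (mn, N) matrix of vectorized training images
        labels : length-N array, labels[j] is the subject of column j
        y      : vectorized test image
        eps    : noise tolerance in (3.5)
        """
        # Step 1: normalize the columns of Phi
        Phi = Phi / np.linalg.norm(Phi, axis=0)

        # Step 2: solve the l1-minimization problem (3.5)
        c = cp.Variable(Phi.shape[1])
        problem = cp.Problem(cp.Minimize(cp.norm(c, 1)),
                             [cp.norm(Phi @ c - y, 2) <= eps])
        problem.solve()
        c_hat = c.value

        # Step 3: residual for each subject, using only that subject's columns
        subjects = np.unique(labels)
        residuals = [np.linalg.norm(y - Phi[:, labels == i] @ c_hat[labels == i])
                     for i in subjects]
        return subjects[int(np.argmin(residuals))]

In practice Φ would hold the feature-extracted training images, and the choice of ℓ1 solver matters a great deal for runtime, as discussed in Section 4.2.
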
3.2 Addressing Corruption and Occlusion

A common problem when dealing with numerical problems of any sort is the introduction of round-off errors. A typical computer today uses 64 bit processors, which means that the processor (and all the other components) can store and process 64 binary digits of a number. Any real number must then be approximated by the closest number that can be written with 64 binary digits. Typically, this introduces a small error εi < 2^(−53) ≈ 10^(−16) for every pixel in the image. In addition to round-off errors, all image sensors introduce some noise.

In addition to noise, another problem we have to address is occlusion. A common problem with many of the classical face recognition systems is their inability to deal with occlusion. If a subject puts on a pair of sunglasses, lets his/her hair grow, cuts his beard, or so forth, an ideal system should still be able to classify the subject correctly.

Definition 3.1. We say a sequence of signal-error pairs (x0, e0) exhibits proportional growth with parameters δ > 0, ρ ∈ (0, 1), α > 0 if

    n = ⌊δm⌋,    ‖e0‖0 = ⌊ρm⌋,    ‖x0‖0 = ⌊αm⌋

Before we state the main result of this section, we will give an interpretation of the parameters involved in Definition 3.1. The first parameter, δ, gives the ratio between the length n of the real signal and the length m of the measured signal; in other words, δ is the compression rate. The next two parameters deal with sparsity: α is a measure of the sparsity of the uncorrupted signal, and ρ of the sparsity of the error. If a signal exhibits this property, the following result holds:

Theorem 3.2. Fix any δ > 0, ρ < 1. Suppose that A is a random matrix with columns drawn independently from a multivariate normal distribution as such:

    ai ~iid N(μ, (ν²/m) Im),    ‖μ‖2 = 1,    ‖μ‖∞ ≤ Cμ m^(−1/2)

for some constant Cμ. Assume that ν is sufficiently small, that J ⊂ {1, 2, . . . , m} is a uniform random subset of size ρm, that σ ∈ Rm with σJ ~iid Unif({−1, 1})

(independent of J), that σJ̄ = 0, and that m is sufficiently large. Then with probability at least 1 − Ce^(−εm) in A, J, σ, for all x0 with ‖x0‖0 ≤ α*m and any e0 with signs and support (σ, J),

    (x0, e0) = arg min_{(x,e)} ‖x‖1 + ‖e‖1  subject to  Ax + e = Ax0 + e0

and the minimizer is uniquely defined.

The proof of this theorem is very technical, and will not be given here. It can be found in [WM10]. We will instead give a small interpretation of its consequences.

It is worth noting that many of the technical assumptions in Theorem 3.2 are present only to make the theorem provable. Some of them are reasonable and quite intuitive, such as the assumption of proportional growth. Others, like the assumption that the signs σ are uniformly random, are rather unrealistic.

The main result is that we can recover a signal by using ℓ1-minimization, even if the signal has some noise, and even if the noise is not sparse. However, for this to work we need our error and signal to exhibit certain properties, and the most important one is proportional growth. The intuitive interpretation of these assumptions is that if our error is very dense, we need the signal to be very sparse, and if our signal is dense, we need a sparse error in order for the recovery scheme to work.

One last problem, which we will not look into, is misalignment. Usually, images of faces are not perfectly aligned and cropped, so a fully functioning face recognition system should be able to deal with misaligned images. A more in-depth look at this can be found in Section 12.5 of [EK12].

4 Practical Implementation

We now shift our attention away from the purely theoretical aspect, and towards the practical side. In this chapter we will seek to answer two main questions: How can a face recognition system using the SRC described in Section 3.1 be implemented? And how does it perform compared to the classical, state-of-the-art methods? We will begin by addressing the latter, and then look at some problems that arise when implementing the system.

In order to implement and test a face recognition system, one will need a database of faces to test on. Many such databases exist; some examples are the CMU Multi-PIE database, the Yale database (and its successor, the extended Yale B database) and the AR Face Database.

4.1 Comparison with Classical Methods

Now that we have introduced a new classifier for face recognition, an immediate question becomes: Why develop a new classification scheme based on compressive sensing when there already exist multiple statistical classifiers suitable for face recognition?

In the field of statistical and machine learning, one studies, among other things, different classifiers, such as the LDA, KNN or SVM discussed earlier. Typically, for one specific application, different classifiers will perform differently. Thus, for a given application, we can rate how the different classifiers perform. A common metric for classification performance is the test error rate.

In order to precisely test the accuracy of a classification model, we need some testing data independent from the training data. Typically, this is achieved by dividing the total data available into two sets: a training set used to build the model, and a test set used to test the model. For our case, we will split our set of images {(φi, li)}_{i=1}^N in two, and make sure that for each person i, we have equally many images of i in the training and the testing set. In the rest of this section, we will by I ⊂ {1, 2, . . . , N} denote the set of indices for the training images, and by J = {1, 2, . . . , N} \ I denote the set of indices for the testing images.

Given a model and a set of testing images, the test error rate is given as

    (1/|J|) Σ_{i ∈ J} I(l̂i ≠ li)    (4.1)

where l̂i is the predicted label of the image, li is the true label, and I denotes the indicator function. A good classifier is one for which the test error rate is small [ISLR, Section 2.2].

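Computed in code, (4.1) is essentially a one-liner; the sketch below also shows the per-person train/test split described above (hypothetical label arrays, NumPy assumed):

    import numpy as np

    def test_error_rate(predicted, true):
        """Equation (4.1): the fraction of test images that are misclassified."""
        predicted, true = np.asarray(predicted), np.asarray(true)
        return np.mean(predicted != true)

    def split_per_person(labels, rng):
        """Index sets I (training) and J (testing), split evenly per person."""
        labels = np.asarray(labels)
        train, test = [], []
        for person in np.unique(labels):
            idx = rng.permutation(np.flatnonzero(labels == person))
            half = len(idx) // 2
            train.extend(idx[:half])
            test.extend(idx[half:])
        return np.array(train), np.array(test)

    print(test_error_rate([1, 2, 2, 3], [1, 2, 3, 3]))   # 0.25
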

                    Feature dimension
    Classifier      30       56       120      504
    SRC             0.125    0.083    0.060    0.035
    NN              0.229    0.165    0.128    0.093
    NS              0.110    0.096    0.081    0.066
    SVM             0.280    0.150    0.060    0.023

Table 4.1: Test error rates reported in [Wri+09] for the SRC, NN, NS and SVM using the Laplacian faces scheme for feature extraction, and using the extended Yale B database.

An empirical comparison of the SRC with the more classical methods NN (ie, KNN with K = 1), NS (KNS with K = 1) and the linear-kernel SVM is found in [Wri+09]. In this comparison, the researchers used the Extended Yale B database, consisting of 2 414 images of 38 individuals, as well as the AR database, consisting of over 4 000 images of 126 individuals. For each subject, they used half of the images for training and half of the images for testing. They then paired up the different classifiers with different feature extraction schemes and feature dimensions, and compared their test error rates. (In [Wri+09] the authors actually report the recognition rate, which is 1 minus the test error rate. My guess is that the authors are the "glass is half-full" kind of people.)

We have included some of the results from [Wri+09] in Table 4.1. We have concentrated on the Laplacian faces scheme for feature extraction, as this seemed to perform well with all the different classifiers. As one can see from the table, the SRC performs quite well. It is also worth noting that the performance of the SRC varies less with the choice of feature extraction than that of the other classifiers.

An even more impressive result appears when the different classifiers are applied to occluded or corrupted images. For images where parts of the face are occluded, the researchers found that the test error rate was below 0.02, even when up to 30% of the face was occluded. At 40% occlusion the test error rate increased to 0.1, and at 50% they reported a test error rate of 0.35. Meanwhile, the researchers reported a test error rate of more than 0.2 at 30% occlusion for the nearest-neighbor methods, and at 50% they reported test error rates between 0.5 and 0.7. Similar results were found for corrupted images.

4.2 Performance Issues

So far, we have only discussed the advantages of the SRC-based system for face recognition compared to the classical approaches. However, this gives an incomplete view of the system. In this section we will look at one major disadvantage of the SRC-based method, namely runtime and memory use.

Runtime Analysis

We will begin by looking at how the runtime of Algorithm 3.1 increases as the size of the problem increases (ie, as C, N, m and n increase). To do this we will use Big-O notation.

Definition 4.1. Let f, g be two functions f, g : N → R. Then f(n) ∈ O(g(n)) if there exist a c ∈ R and an N ∈ N such that f(n) ≤ cg(n) for all n > N.

We can think of Big-O as an upper bound on the true runtime.

Steps 1 and 3 of Algorithm 3.1 are obviously linear-time operations; more precisely, they have runtimes of O(N) and O(C) respectively. The bottleneck is thus step 2. In Section 2.1 we saw how (P1) could be recast as an LP problem. However, a famous result due to Klee and Minty renders this more or less useless: they showed that the Simplex method using the largest-coefficient pivoting rule uses O(2^n) pivots in the worst case, where n is the number of decision variables [Van14, Section 4.4]. For our case this means that using the Simplex method to implement Algorithm 3.1 will have a worst case runtime of O(2^N).

Various interior point methods exist, such as the path-following method. This algorithm uses Newton's method to iteratively find better and better

approximations to the optimal solution. However, even though the number of iterations no longer grows exponentially, each iteration typically uses O(N²) operations, so this too is too slow for our case [EK12, Section 12.6]. This is because in a real world example we will have a very large number of training images N.

The last algorithm we will look at is the orthogonal matching pursuit used to create Figure 2.2. This algorithm solves a series of least-squares approximations. This too is an approach that scales badly, which is why it is mostly used for small problems [FR13, Section 3.2].

It is worth noting that some optimization algorithms with better per-iteration runtime exist, such as the Augmented Lagrange multiplier method described in [FR13, Alg. 12.2]. Other algorithms, such as homotopy algorithms, make use of the fact that the solution is sparse, and are able to recover the ℓ1-minimum of an s-sparse vector in RN in O(s³ + N) time [Wri+09], which is a decent runtime. For a very sparse vector, this is close to linear time.

Memory Usage

As we have now seen, the issue with runtime can be somewhat dealt with. Still, another issue remains, and that is memory usage.

Digital images without compression use a lot of disk space. A one megapixel gray-scale image, using 8 bits to encode the light intensity, will take up 1 megabyte of disk space. That in itself is not too much, but as the database of images grows, the sensing matrix Φ quickly becomes very large. For the extended Yale B database, consisting of over 2400 images, Φ takes up 1.2 gigabytes of disk space, assuming that we store the images in the discussed format, and that half of the available data is used for training. For the AR database this grows to roughly 2 gigabytes, and for the enormous CMU Multi-PIE database we would have a sensing matrix of over 350 gigabytes.

It is clear that it is not feasible to allocate such matrices in memory. Hence we need a way to multiply the matrix with a vector (as one needs to do in the constraints of the minimization problem (3.5)) without allocating the matrix. Results from numerical linear algebra tell us that in many cases it is possible to create a function which multiplies a matrix with a vector without allocating the matrix in memory. A simple way to illustrate how we can avoid allocating the whole matrix is to notice that we only need one row at a time. Thus, we can allocate only N bytes of memory space, instead of mnN. Better approaches exist, but we will not go into any more detail in this report.

signal, how one can guarantee that the correct `0
As we now have seen, the issue with runtime can be minimum can be recovered using `1 optimization, and
somewhat dealt with. Still, another issue remains, how to adapt this theory to fit problems when the
and that is memory usage. desired solution is not sparse in its original form.
Digital images without compression use a lot of Further we have seen that these techniques of
disk space. A one megapixel gray-scale image, using sparse recovery via `1 minimization can be used to
8 bits to encode the light intensity, will take up 1 develop a classification scheme well suited for face
megabyte of disk space. That in itself is not too recognition. The SRC introduced in [Wri+09] seems
much, but as the database of images grows, the to perform very well, and is more stable in regard
sensing matrix Φ quickly becomes very large. For to feature extraction, occlusion and corruption than
the extended Yale B database, consisting of over more classical methods such as NN, NS or linear-
2400 images, Φ takes up 1.2 gigabytes of disk space, kernel SVM.
assuming that we store the images in the discussed However, as is often the case for compressive sens-
format, and that half of the available data is used for ing, there are some problems regarding runtime and
training. For the AR database this grows to roughly memory usage which must be overcome in order to
2 gigabytes, and for the enormous CMU Multi-PIE do a full-scale implementation of this classification
database we will have a sensing matrix of over 350 scheme. At the current state of minimization algo-
gigabytes. rithms and computational power, the SRC might be
It is clear that it is not feasible to allocate such unfavorable due to the time required to do a single
matrices in memory. Hence we will need a way to classification. Thus, more work is needed on the im-
multiply the matrix with a vector (as one needs to plementation side.
do in the constraints of the minimization problem
in (3.5)) without allocating the matrix. Results from
numerical linear algebra tell us that in many cases
it is possible to create a function which multiplies a

References

[Ant16] Vegard Antun. Master thesis code. 2016. URL: https://bitbucket.org/vegarant/code-thesis (visited on 03/23/2017).

[BL13] Kurt Bryan and Tanya Leise. "Making Do with Less: An Introduction to Compressed Sensing." In: SIAM Review 55.3 (2013), pp. 547-566.

[BW14] Farooq Ahmad Bhat and M. Arif Wani. "Performance Comparison of Major Classical Face Recognition Techniques." In: IEEE 13th International Conference on Machine Learning and Applications (ICMLA) (2014), pp. 521-528.

[CRT06] Emmanuel J. Candès, Justin Romberg, and Terence Tao. "Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information." In: IEEE Transactions on Information Theory 52.2 (2006), pp. 489-509.

[CT06] Emmanuel J. Candès and Terence Tao. "Near-optimal signal recovery from random projections: Universal encoding strategies?" In: IEEE Transactions on Information Theory 52.12 (2006), pp. 5406-5425.

[Don06] David L. Donoho. "Compressed sensing." In: IEEE Transactions on Information Theory 52.4 (2006), pp. 1289-1306.

[EK12] Yonina C. Eldar and Gitta Kutyniok. Compressed Sensing: Theory and Applications. Cambridge University Press, 2012. Chap. 12.

[FR13] Simon Foucart and Holger Rauhut. A Mathematical Introduction to Compressive Sensing. Birkhäuser, 2013.

[ISLR] Gareth James et al. An Introduction to Statistical Learning with Applications in R. Springer, 2013.

[LHK05] Kuang-Chih Lee, Jeffrey Ho, and David J. Kriegman. "Acquiring linear subspaces for face recognition under variable lighting." In: IEEE Transactions on Pattern Analysis and Machine Intelligence 27.5 (2005), pp. 684-698.

[Rya16] Øyvind Ryan. Linear algebra, signal processing and wavelets. A unified approach. 2016. URL: https://github.com/oyvindry/applinalgcode (visited on 04/10/2017).

[Van14] Robert J. Vanderbei. Linear Programming. 4th ed. Springer, 2014.

[WM10] John Wright and Yi Ma. "Dense Error Correction Via ℓ1-Minimization." In: IEEE Transactions on Information Theory 56.7 (July 2010), pp. 3540-3560.

[Wri+09] John Wright et al. "Robust face recognition via sparse representation." In: IEEE Transactions on Pattern Analysis and Machine Intelligence 31.2 (Feb. 2009), pp. 210-227.
