Cambridge Monographs on Applied and Computational Mathematics: 20

Series Editors: M. Ablowitz, S. Davis, J. Hinch, A. Iserles, J. Ockendon, P. Olver

Greedy Approximation

The Cambridge Monographs on Applied and Computational Mathematics series reflects the crucial role of mathematical and computational techniques in contemporary science. The series publishes expositions on all aspects of applicable and numerical mathematics, with an emphasis on new developments in this fast-moving area of research.

State-of-the-art methods and algorithms, as well as modern mathematical descriptions of physical and mechanical ideas, are presented in a manner suited to graduate research students and professionals alike. Sound pedagogical presentation is a prerequisite. It is intended that books in the series will serve to inform a new generation of researchers.

VLADIMIR TEMLYAKOV
University of South Carolina

CAMBRIDGE UNIVERSITY PRESS
Cambridge, New York, Melbourne, Madrid, Cape Town, Singapore, São Paulo, Delhi, Tokyo, Mexico City

Published in the United States of America by Cambridge University Press, New York
www.cambridge.org
Information on this title: www.cambridge.org/9781107003378

© Cambridge University Press 2011

A catalog record for this publication is available from the British Library
Preface
From the beginning of time, human beings have been trying to replace complicated things with simpler ones. From ancient shamans working magic upon clay figures to heal the sick, to Renaissance artists representing God Almighty as a nude painted onto a ceiling, the fundamental problem of representing the only partially representable continues in contemporary applied mathematics.
A generic problem of mathematical and numerical analysis is to represent a
given function approximately. It is a classical problem that goes back to the
first results on Taylor’s and Fourier’s expansions of a function.
The first step in solving the representation problem is to choose a representation system. Traditionally, a representation system has natural features such as minimality, orthogonality, simple structure and nice computational characteristics. The most typical representation systems are the trigonometric system $\{e^{ikx}\}$, the algebraic system $\{x^k\}$, the spline system, the wavelet system and their multivariate versions. In general we may speak of a basis $\Psi = \{\psi_k\}_{k=1}^\infty$ in a Banach space $X$.
The second step in solving the representation problem is to choose a form of an approximant that is built on the base of the chosen representation system $\Psi$. In a classical way that was used for centuries, an approximant $a_m$ is a polynomial with respect to $\Psi$:
$$a_m := \sum_{k=1}^m c_k\psi_k. \tag{1}$$
function being approximated. Approximation of this type is referred to as linear approximation theory because, for a fixed $m$, approximants come from a linear subspace spanned by $\psi_1, \dots, \psi_m$.
It is understood in numerical analysis and approximation theory that in many problems from signal/image processing it is more beneficial to use an $m$-term approximant with respect to $\Psi$ than a polynomial of order $m$. This means that for $f \in X$ we look for an approximant of the form
$$a_m(f) := \sum_{k \in \Lambda(f)} c_k\psi_k, \tag{2}$$
where $\Lambda(f)$ is a set of $m$ indices that may depend on $f$.
1.1 Introduction
It is well known that in many problems it is very convenient to represent a function by a series with regard to a given system of functions. For example, in 1807 Fourier suggested representing a $2\pi$-periodic function by its series (known as the Fourier series) with respect to the trigonometric system. A very important feature of the trigonometric system that made it attractive for the representation of periodic functions is orthogonality. For an orthonormal system $B := \{b_n\}_{n=1}^\infty$ of a Hilbert space $H$ with an inner product $\langle\cdot,\cdot\rangle$, one can construct a Fourier series of an element $f$ in the following way:
$$f \sim \sum_{n=1}^\infty \langle f, b_n\rangle b_n. \tag{1.1}$$
provide the best approximation; that is, defining
$$E_m(f, B) := \inf_{\{c_n\}}\Big\|f - \sum_{n=1}^m c_nb_n\Big\|, \tag{1.5}$$
we have
$$\|f - S_m(f, B)\| = E_m(f, B). \tag{1.6}$$
Identities (1.3) and (1.6) are fundamental properties of Hilbert spaces and
their orthonormal bases. These properties make the theory of approximation
in $H$ from $\mathrm{span}\{b_1, \dots, b_m\}$, or linear approximation theory, simple and
convenient.
The situation becomes more complicated when we replace a Hilbert space
$H$ by a Banach space $X$. In a Banach space $X$ we consider a Schauder basis $\Psi$ instead of an orthonormal basis $B$ in $H$. In Section 1.2 we discuss Schauder bases in detail. If $\Psi := \{\psi_n\}_{n=1}^\infty$ is a Schauder basis for $X$, then for any $f \in X$ there exists a unique representation
$$f = \sum_{n=1}^\infty c_n(f, \Psi)\psi_n$$
that converges in $X$.
Theorem 1.3 from Section 1.2 states that the partial sum operators $S_m$, defined by
$$S_m(f, \Psi) := \sum_{n=1}^m c_n(f, \Psi)\psi_n,$$
are uniformly bounded:
$$\|S_m(f, \Psi)\| \le B\|f\|.$$
Since $S_m(g) = g$ for every $g \in \mathrm{span}\{\psi_1, \dots, \psi_m\}$, this implies $\|f - S_m(f, \Psi)\| \le (B + 1)E_m(f, \Psi)$, where
$$E_m(f, \Psi) := \inf_{\{c_n\}}\Big\|f - \sum_{n=1}^m c_n\psi_n\Big\|.$$
Also, it is clear that $G_m(f, B)$ realizes the best $m$-term approximation of $f$ with regard to $B$:
$$\|f - G_m(f, B)\| = \sigma_m(f, B) := \inf_{\Lambda : |\Lambda| = m}\,\inf_{\{c_n\}}\Big\|f - \sum_{n \in \Lambda}c_nb_n\Big\|. \tag{1.8}$$
We note that we need some restrictions on the basis (see Sections 1.3 and
1.4 for a detailed discussion) in order to be able to run the greedy algorithm
for each $f \in X$. It is sufficient to assume that $\Psi$ is normalized. We make this
assumption for our further discussion in the Introduction. In some later sections
we continue to use the normalization assumption; in others, we do not.
An application of the greedy algorithm can also be seen as a rearrangement of the series from (1.9) in a special way: according to the size of the coefficients. Let
$$|c_{n_1}(f, \Psi)| \ge |c_{n_2}(f, \Psi)| \ge \cdots.$$
Then
$$G_m(f, \Psi) = \sum_{j=1}^m c_{n_j}(f, \Psi)\psi_{n_j}.$$
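To make this concrete, here is a minimal sketch (ours, not the book's) of the greedy approximant $G_m(f, \Psi)$ for a function given by a finite coefficient vector; the only decision the algorithm makes is which $m$ coefficients are largest in absolute value:

```python
import numpy as np

def greedy_approximant(coeffs: np.ndarray, m: int) -> np.ndarray:
    """Keep the m coefficients of largest absolute value, zero the rest.

    `coeffs` holds expansion coefficients c_n(f, Psi) of f with respect
    to the basis; the returned vector represents G_m(f, Psi).
    """
    out = np.zeros_like(coeffs)
    # Indices of the m largest |c_n|; ties are broken arbitrarily,
    # exactly as the definition of the greedy algorithm allows.
    idx = np.argsort(np.abs(coeffs))[::-1][:m]
    out[idx] = coeffs[idx]
    return out

# For an orthonormal basis of a Hilbert space this choice realizes the
# best m-term approximation, as in (1.8).
c = np.array([0.1, -2.0, 0.5, 1.5, -0.2])
print(greedy_approximant(c, 2))   # keeps -2.0 and 1.5
```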
An immediate question arising from (1.10) is: When does this series converge?
The theory of convergence of rearranged series is a classical topic in analysis.
A series converges unconditionally if every rearrangement of this series converges. A basis of a Banach space $X$ is said to be an unconditional basis if,
for every $f \in X$, its expansion (1.9) converges unconditionally. For a set of indices $\Lambda$ define
$$S_\Lambda(f, \Psi) := \sum_{n \in \Lambda}c_n(f, \Psi)\psi_n.$$
where
$$E_\Lambda(f, \Psi) := \inf_{\{c_n\}}\Big\|f - \sum_{n \in \Lambda}c_n\psi_n\Big\|.$$
Theorem 1.3 Let $X$ be a Banach space with a Schauder basis $\Psi$. Then the operators $S_m : X \to X$ are bounded linear operators and
$$\sup_m \|S_m\| < \infty.$$
The above remark means that, for any Schauder basis $\Psi$ of $X$, we can renorm $X$ (take $X_s$) to make the basis monotone for the new norm.
We note that for a basis $\Psi$ with the basis constant $K$, we have, for any $f \in X$,
$$\|f - S_m(f, \Psi)\| \le (K + 1)\inf_{\{c_k\}}\Big\|f - \sum_{k=1}^m c_k\psi_k\Big\|.$$
where the infimum is taken over coefficients $c_k$ and sets of indices $\Lambda$ with cardinality $|\Lambda| = m$. We note that in the above definition of $\sigma_m(f, \Psi)_X$ the system $\Psi$ may be any system of elements from $X$, not necessarily a basis of $X$.
An immediate natural question is: When does the best m-term approximant
exist? This question is more difficult than the corresponding question in linear
approximation and it has not been studied thoroughly. In what follows, we
present some results that may point us in the right direction.
Let us proceed directly to the setting of our approximation problem. Let a
subset A ⊂ X be given. For any f ∈ X , let
$$d(f, A) := d(f, A)_X := \inf_{a \in A}\|f - a\|$$
denote the distance from f to A, or, in other words, the best approximation
error of f by elements from A in the norm of X . To illustrate some appropriate
techniques, we prove existence theorems in two settings.
$$\Sigma_m(\mathbb{R}) := \Big\{t : t = \sum_{k \in \Lambda_1}a_k\cos kx + \sum_{k \in \Lambda_2}b_k\sin kx,\ |\Lambda_1| + |\Lambda_2| \le m\Big\},$$
$$\sigma_m(f, \mathcal{T})_X := d(f, \Sigma_m)_X.$$
In the setting S2 we prove here the following existence theorem (see DeVore and Lorentz (1993), p. 363).
Consider
$$g(f, c) := \sum_{j \in \Lambda}c_j\chi_{[y^*_{j-1},\,y^*_j]}.$$
such that
$$\sigma_m(f, \mathcal{T}^d)_p = \|f - t_m\|_p. \tag{1.16}$$
Proof We prove this theorem by induction. Let us use the abbreviated notation $\sigma_m(f)_p := \sigma_m(f, \mathcal{T}^d)_p$.
First step Let $m = 1$. We assume $\sigma_1(f)_p < \|f\|_p$, because in the case $\sigma_1(f)_p = \|f\|_p$ the proof is trivial: we take $t_1 = 0$. We now prove that
polynomials of the form $ce^{i(k,x)}$ with large $k$ cannot provide approximation with error close to $\sigma_1(f)_p$. This will allow us to restrict the search for an optimal approximant $c_1e^{i(k^1,x)}$ to a finite number of $k^1$, which in turn will imply the existence.
We introduce a parameter $N \in \mathbb{N}$, which will be specified later, and consider the following polynomials:
$$\mathcal{K}_N(u) := \sum_{|k| < N}\Big(1 - \frac{|k|}{N}\Big)e^{iku}, \quad u \in \mathbb{T}, \tag{1.17}$$
and
$$\mathcal{K}_N(x) := \prod_{j=1}^d \mathcal{K}_N(x_j), \quad x = (x_1, \dots, x_d) \in \mathbb{T}^d.$$
The functions $\mathcal{K}_N$ are the Fejér kernels. These polynomials have the property (see Zygmund (1959), chap. 3, sect. 3 and Section 3.3 here)
$$\|\mathcal{K}_N\|_1 = 1, \quad N = 1, 2, \dots \tag{1.18}$$
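As a quick numerical illustration (our sketch, not part of the proof), one can evaluate $\mathcal{K}_N$ directly from (1.17) and check the normalization (1.18), where the $L_1$ norm is taken with respect to the normalized measure $(2\pi)^{-1}\,du$ on the torus:

```python
import numpy as np

def fejer_kernel(u: np.ndarray, N: int) -> np.ndarray:
    """Fejer kernel K_N(u) = sum_{|k|<N} (1 - |k|/N) e^{iku}; real-valued
    and nonnegative, so |K_N| = K_N."""
    k = np.arange(-(N - 1), N)
    weights = 1.0 - np.abs(k) / N
    return np.real(np.exp(1j * np.outer(u, k)) @ weights)

u = np.linspace(-np.pi, np.pi, 20001)
K = fejer_kernel(u, N=8)
print(np.trapz(np.abs(K), u) / (2 * np.pi))   # ~= 1.0, matching (1.18)
```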
$$\|\mathcal{K}_N(f)\|_p = \|\mathcal{K}_N(g)\|_p \le \|g\|_p. \tag{1.21}$$
$$\|\mathcal{K}_N(f)\|_p \ge \|f\|_p - \|f - \mathcal{K}_N(f)\|_p \ge \|f\|_p - e_N(f). \tag{1.22}$$
$$\|f\|_p - e_N(f) \ge \big(\|f\|_p + \sigma_1(f)_p\big)/2. \tag{1.24}$$
Relations (1.23) and (1.24) imply
$$\sigma_1(f)_p = \inf_{c,\ \|k\|_\infty < N}\|f(x) - c\,e^{i(k,x)}\|_p,$$
which completes the proof for $m = 1$, by the existence theorem in the case of approximation by elements of a subspace of finite dimension.
General step Assume that Theorem 1.7 has already been proved for $m - 1$. We now prove it for $m$. If $\sigma_m(f)_p = \sigma_{m-1}(f)_p$, we are done, by the induction assumption. Let $\sigma_m(f)_p < \sigma_{m-1}(f)_p$. The idea of the proof in the general step is similar to that in the first step.
Take any $k^1, \dots, k^m$. Assume $\|k^j\|_\infty \le \|k^m\|_\infty$, $j = 1, \dots, m-1$, and $\|k^m\|_\infty > N$. We prove that a polynomial with frequencies $k^1, \dots, k^m$ does not provide a good approximation. Take any numbers $c_1, \dots, c_m$, and consider
$$f_{m-1}(x) := f(x) - \sum_{j=1}^{m-1}c_je^{i(k^j,x)}, \qquad g(x) := f_{m-1}(x) - c_me^{i(k^m,x)}.$$
Then, replacing $f$ by $f_{m-1}$, we get in the same way as above the estimate
$$\Big\|f(x) - \sum_{j=1}^m c_je^{i(k^j,x)}\Big\|_p \ge \sigma_{m-1}(f)_p - e_N(f). \tag{1.25}$$
be such that
$$\|f - f_n\| \to \sigma_m(f, \Psi).$$
Then we have
$$|c^n_{k_j}| \le M. \tag{1.29}$$
$$\inf_k\|\psi_k\| = 0. \tag{1.33}$$
In this case we define $G_m(f, \Psi)$ in the same way as above, but only for $f$ of a special form:
$$f = \sum_{k \in Y}c_k(f, \Psi)\psi_k, \quad |Y| < \infty. \tag{1.34}$$
where the infimum is taken over coefficients $c_k$ and sets of indices $\Lambda$ with cardinality $|\Lambda| = m$. The best we can achieve with the algorithm $G_m$ is
$$\|f - G_m(f, \Psi, \rho)\|_X = \sigma_m(f, \Psi)_X,$$
or the slightly weaker
$$\|f - G_m(f, \Psi, \rho)\|_X \le G\sigma_m(f, \Psi)_X, \tag{1.35}$$
for all individual functions. In a Hilbert space $H$ with an orthonormal basis $B$ we do have
$$\|f - G_m(f, B, \rho)\|_H = \sigma_m(f, B)_H.$$
Theorem 1.11 Let $1 < p < \infty$ and let a basis $\Psi$ be $L_p$-equivalent to the Haar basis $\mathcal{H}$. Then, for any $f \in L_p(0, 1)$, we have
$$\|f - G_m(f, \Psi)\|_p \le C(p, \Psi)\sigma_m(f, \Psi)_p.$$
Theorem 1.11A Let $1 < p < \infty$ and let a basis $\Psi$ be $L_p$-equivalent to the Haar basis $\mathcal{H}_p$. Then, for any $f \in L_p(0, 1)$ and any $\rho \in D(f)$, we have
$$\|f - G_m(f, \Psi, \rho)\|_p \le C(p, \Psi)\sigma_m(f, \Psi)_p,$$
$$\|f - G_m(f, \Psi, \rho)\|_X \le G\sigma_m(f, \Psi)_X.$$
We take any finite set $P \subset \mathbb{N}$ satisfying $w(P) \le w(Q)$. Then our assumption $w_n > 0$, $n \in \mathbb{N}$, implies that either $P = Q$ or $Q \setminus P$ is non-empty. As in Section 1.1, let
$$E_P(f, \Psi) := \inf_{\{b_n\}}\Big\|f - \sum_{n \in P}b_n\psi_n\Big\|.$$
$$\|f - S_Q(f)\| \le \|f - S_P(f)\| + \|S_P(f) - S_Q(f)\|. \tag{1.40}$$
We have
$$S_P(f) - S_Q(f) = S_{P\setminus Q}(f) - S_{Q\setminus P}(f). \tag{1.41}$$
w-greedy ⇒ unconditional.
Lemma 1.19 Let $\Psi$ be a basis such that, for any $f$ of the form (1.38), we have
$$\|f - G_m(f, \Psi)\| \le CE_\Lambda(f, \Psi),$$
where
$$G_m(f, \Psi) = \sum_{n \in \Lambda}c_n(f)\psi_n.$$
Then $\Psi$ is unconditional.
Proof It is clear that it is sufficient to prove that there exists a constant $C_0$ such that, for any finite $\Lambda$ and any $f$ of the form (1.38), we have
$$\|S_\Lambda(f)\| \le C_0\|f\|.$$
Set
$$f_M := S_{[1,M]}(f);$$
then $\|f_M\| \le C_B\|f\|$. We take $b > \max_{1 \le n \le M}|c_n(f)|$ and define a new function
$$g := f_M - S_\Lambda(f_M) + b\sum_{n \in \Lambda}\psi_n.$$
Then
$$G_m(g, \Psi) = b\sum_{n \in \Lambda}\psi_n, \quad m := |\Lambda|,$$
and
$$E_\Lambda(g, \Psi) \le \|f_M\|.$$
Therefore
$$\|f_M - S_\Lambda(f_M)\| = \|g - G_m(g, \Psi)\| \le CE_\Lambda(g, \Psi) \le C\|f_M\|,$$
and therefore
$$\|S_\Lambda(f)\| = \|S_\Lambda(f_M)\| \le C_0\|f\|.$$
w-greedy ⇒ w-democratic.
Then
$$G_m(f, \Psi) = (1 + \epsilon)\sum_{n \in B}\psi_n, \quad m := |B|,$$
and
$$E_A(f, \Psi) \le \Big\|\sum_{n \in B}\psi_n\Big\|(1 + \epsilon).$$
Now let $A$, $B$ be any finite subsets of $\mathbb{N}$ for which $w(A) \le w(B)$. Then, using the unconditionality of $\Psi$ proved in IIa and the above part of IIb, we obtain
$$\Big\|\sum_{n \in A}\psi_n\Big\| \le \Big\|\sum_{n \in A\setminus B}\psi_n\Big\| + \Big\|\sum_{n \in A\cap B}\psi_n\Big\| \le C\Big\|\sum_{n \in B\setminus A}\psi_n\Big\| + K\Big\|\sum_{n \in B}\psi_n\Big\| \le C_1\Big\|\sum_{n \in B}\psi_n\Big\|.$$
Democracy does not imply unconditionality Let $X$ be the set of all real sequences $x = (x_1, x_2, \dots)$ such that
$$\|x\|_X := \sup_{N \in \mathbb{N}}\Big|\sum_{n=1}^N x_n\Big| < \infty,$$
but
$$\Big\|\sum_{k=1}^m(-1)^k\psi_k\Big\|_X = 1.$$
We let $\mathcal{H}_p := \{H_{k,p}\}_{k=1}^\infty$ be the Haar basis $\mathcal{H}$ renormalized in $L_p([0, 1))$. We define the multivariate Haar basis $\mathcal{H}^d_p$ to be the tensor product of the univariate Haar bases: $\mathcal{H}^d_p := \mathcal{H}_p \times \cdots \times \mathcal{H}_p$.
The lower bound in (1.44) was proved by R. Hochmuth (see the end of this
section); the upper bound in (1.44) was proved in the case d = 2, 4/3 ≤ p ≤
4, and was conjectured for all d, 1 < p < ∞, in Temlyakov (1998b). The
conjecture was proved in Wojtaszczyk (2000).
Let us return to the problem of finding a near-best $m$-term approximant of $f \in X$ with regard to a basis $\Psi$. This problem consists of two subproblems. First, we need to identify a set $\Lambda_m$ of $m$ indices that can be used in achieving a near-best $m$-term approximation of $f$. Second, we need to find the coefficients $\{c_k\}$, $k \in \Lambda_m$, such that the approximant $\sum_{k \in \Lambda_m}c_k\psi_k$ provides a near-best approximation of $f$. It is clear from the properties of an unconditional basis that, for any $f \in X$ and any $\Lambda$, we have (see (1.12))
$$\Big\|f - \sum_{k \in \Lambda}c_k(f, \Psi)\psi_k\Big\| \le C\inf_{\{c_k\}}\Big\|f - \sum_{k \in \Lambda}c_k\psi_k\Big\|.$$
Theorem 1.20 Let $1 < p < \infty$. Then, for any $\Lambda$, $|\Lambda| = m$, we have, for $2 \le p < \infty$,
$$C^1_{p,d}\,m^{1/p}\min_{n \in \Lambda}|c_n| \le \Big\|\sum_{n \in \Lambda}c_nH_{n,p}\Big\|_p \le C^2_{p,d}\,m^{1/p}(\log m)^{h(p,d)}\max_{n \in \Lambda}|c_n|.$$
Theorem 1.21 Let $1 < p < \infty$ and let $\Psi$ be a greedy basis for $L_p([0, 1))$. Then, for any $\Lambda$, $|\Lambda| = m$, we have, for $2 \le p < \infty$,
$$C^5_{p,d}\,m^{1/p}\min_{n \in \Lambda}|c_n| \le \Big\|\sum_{n \in \Lambda}c_n\psi_n\Big\|_p \le C^6_{p,d}\,m^{1/p}(\log m)^{h(p,d)}\max_{n \in \Lambda}|c_n|.$$
Then the Haar functions H I are indexed by their intervals of support. The first
and the second Haar functions are indexed by [0, 1] and [0, 1), respectively.
Let us take a parameter $0 < t \le 1$ and consider the following greedy-type algorithm $G^{p,t}$ with regard to the Haar system. For the Haar basis $\mathcal{H}$ we define
$$c_I(f) := \langle f, H_I\rangle = \int_0^1 f(x)H_I(x)\,dx$$
and define
$$G^{p,t}_m(f) := G^{p,t}_m(f, \mathcal{H}) := \sum_{I \in \Lambda_m(t)}c_I(f)H_I. \tag{1.48}$$
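A small sketch (ours, with our own sampling conventions) of $G^{p,t}_m$ for the univariate Haar system: the coefficients $c_I(f)$ are approximated from samples of $f$, and one admissible $\Lambda_m(t)$ is obtained by ranking the quantities $\|c_I(f)H_I\|_p = |c_I(f)|\,|I|^{1/p-1/2}$ (this identity is used in the proof of Lemma 1.23 below) and keeping the $m$ largest, which is a valid choice for every $t \le 1$:

```python
import numpy as np

def haar_coeffs(f_samples: np.ndarray, levels: int) -> dict:
    """Coefficients c_I = <f, H_I> of L2-normalized Haar functions on [0,1).

    `f_samples` holds f on a uniform grid of 2**levels points; integrals
    are approximated by Riemann sums. Key (j, k) encodes the dyadic
    interval I = [k 2^-j, (k+1) 2^-j) of measure |I| = 2^-j.
    """
    n = f_samples.size
    h = 1.0 / n
    coeffs = {}
    for j in range(levels):
        half = n >> (j + 1)                      # samples in half of I
        for k in range(2 ** j):
            left = f_samples[2*k*half:(2*k+1)*half].sum() * h
            right = f_samples[(2*k+1)*half:(2*k+2)*half].sum() * h
            coeffs[(j, k)] = (left - right) / np.sqrt(2.0 ** (-j))
    return coeffs

def G_pt(coeffs: dict, m: int, p: float) -> dict:
    """One admissible realization of Lambda_m(t): keep the m intervals I
    with the largest ||c_I H_I||_p = |c_I| * |I|**(1/p - 1/2)."""
    def size(item):
        (j, _), c = item
        return abs(c) * (2.0 ** (-j)) ** (1.0 / p - 0.5)
    return dict(sorted(coeffs.items(), key=size, reverse=True)[:m])
```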
$$\|c_I(f, \Psi)\psi_I\|_p \ge \|c_J(f, \Psi)\psi_J\|_p.$$
Using (1.36), we get
This inequality implies that, for any $m$, we can find a set $\Lambda_m(t)$, where $t = C_1(p)C_2(p)^{-1}$, such that $\Lambda_m(t) = \Lambda_m$, and therefore
$$\|f - G_m(f, \Psi)\|_p \le C_2(p)\|g(f) - G^{p,t}_m(g(f))\|_p. \tag{1.52}$$
Relations (1.50) and (1.52) show that Theorem 1.11 follows from Theorem 1.22 below.
Theorem 1.22 Let $1 < p < \infty$ and $0 < t \le 1$. Then, for any $g \in L_p$, we have
$$\|g - G^{p,t}_m(g, \mathcal{H})\|_p \le C(p, t)\sigma_m(g, \mathcal{H})_p.$$
Proof The Littlewood–Paley theorem, for the Haar system, gives, for $1 < p < \infty$,
$$C_3(p)\Big\|\Big(\sum_I|c_I(g)H_I|^2\Big)^{1/2}\Big\|_p \le \|g\|_p \le C_4(p)\Big\|\Big(\sum_I|c_I(g)H_I|^2\Big)^{1/2}\Big\|_p. \tag{1.53}$$
We first formulate two simple corollaries of (1.53):
$$\|g\|_p \le C_5(p)\Big(\sum_I\|c_I(g)H_I\|_p^p\Big)^{1/p}, \quad 1 < p \le 2, \tag{1.54}$$
$$\|g\|_p \le C_6(p)\Big(\sum_I\|c_I(g)H_I\|_p^2\Big)^{1/2}, \quad 2 \le p < \infty. \tag{1.55}$$
Analogs of these inequalities for the trigonometric system are known (see, for instance, Temlyakov (1989a), p. 37). The same proof yields (1.54) and (1.55). The dual inequalities to (1.54) and (1.55) are
$$\|g\|_p \ge C_7(p)\Big(\sum_I\|c_I(g)H_I\|_p^2\Big)^{1/2}, \quad 1 < p \le 2, \tag{1.56}$$
$$\|g\|_p \ge C_8(p)\Big(\sum_I\|c_I(g)H_I\|_p^p\Big)^{1/p}, \quad 2 \le p < \infty. \tag{1.57}$$
We proceed to the proof of Theorem 1.22. Let $T_m$ be an $m$-term Haar polynomial of best $m$-term approximation to $g$ in $L_p$ (for the existence see Section 1.2):
$$T_m = \sum_{I \in \Lambda}a_IH_I, \quad |\Lambda| = m.$$
Proof of Lemma 1.23 We note that in the case $1 < p \le 2$ the statement of Lemma 1.23 follows from (1.54). We will give a proof of this lemma for all $1 \le p < \infty$. We have
$$\|c_IH_I\|_p = |c_I||I|^{1/p-1/2}.$$
Assumption (1.62) implies
$$|c_I| \le |I|^{1/2-1/p}.$$
Next, we have
$$\|f\|_p \le \Big\|\sum_{I \in Q}|c_IH_I|\Big\|_p \le \Big\|\sum_{I \in Q}|I|^{-1/p}\chi_I(x)\Big\|_p, \tag{1.63}$$
Proof Let
$$F(x) := \sum_{j=1}^s 2^{n_j/q}\chi_{E_j}(x), \qquad E_j := \bigcup_{I \in Q,\ |I| = 2^{-n_j}} I.$$
Proof of Lemma 1.24 We derive Lemma 1.24 from Lemma 1.23. Define
$$u := \sum_{I \in Q}\bar c_I|c_I|^{-1}|I|^{1/p-1/2}H_I,$$
where the bar denotes the complex conjugate. Then for $p' = p/(p-1)$ we have
$$\big\|\bar c_I|c_I|^{-1}|I|^{1/p-1/2}H_I\big\|_{p'} = 1.$$
Consider $\langle f, u\rangle$. We have, on the one hand,
$$\langle f, u\rangle = \sum_{I \in Q}|c_I||I|^{1/p-1/2} = \sum_{I \in Q}\|c_IH_I\|_p \ge N, \tag{1.66}$$
and, on the other hand,
$$\langle f, u\rangle \le \|f\|_p\|u\|_{p'}. \tag{1.67}$$
and let
$$B := \min_{I \in \Lambda_m(t)\setminus\Lambda}\|c_I(g)H_I\|_p.$$
$$B \ge tA. \tag{1.68}$$
Taking into account that $|\Lambda_m(t)\setminus\Lambda| = |\Lambda\setminus\Lambda_m(t)|$, we obtain relation (1.61) from (1.69) and (1.70).
The proof of Theorem 1.22 is now complete.
We now discuss the multivariate analog of Theorem 1.11. There are several natural generalizations of the Haar system to the d-dimensional case. We first describe the one for which the statement of Theorem 1.11 and its proof coincide with the one-dimensional version. First of all, we include in the system of functions the constant function
Next, we define $2^d - 1$ functions with support $[0, 1)^d$. Take any combination of intervals $Q_1, \dots, Q_d$, where $Q_i = [0, 1]$ or $Q_i = [0, 1)$ with at least one $Q_j = [0, 1)$, and define, for $Q = Q_1 \times \cdots \times Q_d$, $x = (x_1, \dots, x_d)$,
$$H_Q(x) := \prod_{i=1}^d H_{Q_i}(x_i).$$
We shall also denote these functions by $H^k_{[0,1)^d}(x)$, $k = 1, \dots, 2^d - 1$. We define the basis of the Haar functions with supports on dyadic cubes of the form
For each dyadic cube of the form (1.71), we define $2^d - 1$ basis functions
$$H^k_J(x) := 2^{nd/2}H^k_{[0,1)^d}\big(2^n(x - (j_1 - 1, \dots, j_d - 1)2^{-n})\big), \quad k = 1, \dots, 2^d - 1.$$
We can also use another enumeration of these functions. Let $H^k_{[0,1)^d}(x) = H_Q(x)$ with
$$Q = Q_1 \times \cdots \times Q_d, \quad Q_i = [0, 1),\ i \in E, \quad Q_i = [0, 1],\ i \in \{1, \dots, d\}\setminus E, \quad E \ne \emptyset,$$
and define $H_I(x) := H^k_J(x)$. Denoting by $D$ the set of all dyadic cubes of the form (1.72) amended by the cube $[0, 1]^d$, and by $\mathcal{H}$ the corresponding basis $\{H_I\}_{I \in D}$, we get the multivariate Haar system.
Remark 1.26 Theorem 1.11 holds for the multivariate Haar system H with
the constant C( p) allowed to depend also on d.
In this section we have studied approximation in L p ([0, 1]) and have made
a remark about approximation in L p ([0, 1]d ). We can treat the approximation
in L p (Rd ) in the same way.
$$B := \{I : I \notin A;\ \text{for all } I' \in A,\ I' \ne I,\ \text{we have } I \cap I' = \emptyset\}, \quad |B| = |A|.$$
Let $2 \le p < \infty$ be given. Let $m := |A|$ and consider
$$f = g_{A,p} + 2g_{B,p}.$$
Then we have, on the one hand,
$$G_m(f, \mathcal{H}^d_p) = 2g_{B,p}$$
and
$$\|f - G_m(f, \mathcal{H}^d_p)\|_p = \|g_{A,p}\|_p \asymp m^{1/p}(\log m)^{(1/2-1/p)(d-1)}. \tag{1.73}$$
On the other hand, we have
$$\sigma_m(f, \mathcal{H}^d_p)_p \le \|2g_{B,p}\|_p \asymp m^{1/p}. \tag{1.74}$$
Relations (1.73) and (1.74) imply the required lower estimate in the case $2 \le p < \infty$. The remaining case $1 < p \le 2$ can be handled in the same way, considering the function $f = 2g_{A,p} + g_{B,p}$.
$$\|f - G_m(f, \Psi, \rho)\| \to 0 \quad \text{as } m \to \infty. \tag{1.77}$$
we have
$$\|S_\Lambda(f, \Psi)\| \le C_Q\|f\|. \tag{1.79}$$
for any two finite sets $P$ and $Q$, $|P| = |Q|$, and any choices of signs $\theta_k = \pm 1$, $k \in P$, and $\varepsilon_k = \pm 1$, $k \in Q$.
Indeed, we have
$$\Big\|\sum_{j=1}^m \pm\psi_{k_j}\Big\|_2 = \sqrt{m},$$
$$\Big\|\sum_{j=1}^m \pm\psi_{k_j}\Big\|_+ \le \sum_{j=1}^m 1/\sqrt{j} < 2\sqrt{m},$$
but
$$\Big\|\sum_{k=1}^m(-1)^k\psi_k/\sqrt{k}\Big\| \asymp \log m.$$
$$\|S_\Lambda(f, \Psi)\|_2 \le \|f\|_2 \le 1. \tag{1.84}$$
$$\Lambda^+(N) := \{k \in \Lambda : k > N\}, \qquad \Lambda^-(N) := \{k \in \Lambda : k \le N\}.$$
$$\asymp (\alpha^2N)^{-1/6}. \tag{1.85}$$
$$\Big|\sum_{k \in \Lambda^-(M)}c_k(f)k^{-1/2}\Big| \le \Big|\sum_{k=1}^M c_k(f)k^{-1/2}\Big| + \Big|\sum_{k \notin \Lambda^-(M),\,k \le M}c_k(f)k^{-1/2}\Big| \le 1 + \alpha\sum_{k=1}^M k^{-1/2} \le 1 + 2\alpha M^{1/2} \ll 1. \tag{1.86}$$
Thus
$$\|S_{\Lambda^+}(f, \Psi)\| \le C,$$
Lemma 1.32 Let $\Psi$ be a quasi-greedy basis. Then, for any two finite sets of indices $A \subseteq B$ and coefficients $0 < t \le |a_j| \le 1$, $j \in B$, we have
$$\Big\|\sum_{j \in A}a_j\psi_j\Big\| \le C(X, \Psi, t)\Big\|\sum_{j \in B}a_j\psi_j\Big\|.$$
$$\|G_m(f)\| \le K\|f\| \quad\text{and}\quad \|f - G_m(f)\| \le K\|f\|, \quad f \in X.$$
It is clear that
$$\sigma_m(f, \Psi) \le \tilde\sigma_m(f, \Psi).$$
It is also clear that for an unconditional basis $\Psi$ we have
$$\tilde\sigma_m(f, \Psi) \le C(X, \Psi)\sigma_m(f, \Psi).$$
Definition 1.35 We call a basis $\Psi$ an almost greedy basis if, for every $f \in X$, there exists a permutation $\rho \in D(f)$ such that we have the inequality
$$\|f - G_m(f, \Psi, \rho)\|_X \le C\tilde\sigma_m(f, \Psi)_X \tag{1.87}$$
with a constant independent of $f$ and $m$.
The following proposition follows from the proof of theorem 3.3 of Dilworth
et al. (2003) (see Theorem 1.37 below).
Proposition 1.36 If $\Psi$ is an almost greedy basis then (1.87) holds for any permutation $\rho \in D(f)$.
The following characterization of almost greedy bases was obtained in
Dilworth et al. (2003).
Theorem 1.37 Suppose $\Psi$ is a basis of a Banach space. The following are equivalent:
(A) $\Psi$ is almost greedy;
(B) $\Psi$ is quasi-greedy and democratic;
(C) for any (respectively, every) $\lambda > 1$ there is a constant $C = C_\lambda$ such that
$$\|f - G_{[\lambda m]}(f, \Psi)\| \le C_\lambda\sigma_m(f, \Psi).$$
In order to give the reader an idea of the relations between $\tilde\sigma$ and $\sigma$, we present an estimate for $\tilde\sigma_n(f, \Psi)$ in terms of $\sigma_m(f, \Psi)$ for a quasi-greedy basis $\Psi$. For a basis $\Psi$ we define the fundamental function
$$\varphi(m) := \sup_{A : |A| \le m}\Big\|\sum_{k \in A}\psi_k\Big\|.$$
of the expansion terms with the $m$ biggest coefficients in absolute value (see (1.32)). In this section we discuss a more flexible way to construct a greedy approximant. The rule for choosing the expansion terms for approximation will be weaker than in the greedy algorithm $G_m(\cdot, \Psi)$. We are motivated by Theorem 1.22. Instead of taking $m$ terms with the biggest coefficients we now take $m$ terms with near-biggest coefficients. We proceed to a formal definition of the Weak Greedy Algorithm with regard to a basis $\Psi$. We assume here that $\Psi$ satisfies (1.75).
Let $t \in (0, 1]$ be a fixed parameter. For a given basis $\Psi$ and a given $f \in X$, let $\Lambda_m(t)$ be any set of $m$ indices such that
$$\min_{k \in \Lambda_m(t)}|c_k(f, \Psi)| \ge t\max_{k \notin \Lambda_m(t)}|c_k(f, \Psi)|, \tag{1.88}$$
and define
$$G^t_m(f) := G^t_m(f, \Psi) := \sum_{k \in \Lambda_m(t)}c_k(f, \Psi)\psi_k.$$
We call it the Weak Greedy Algorithm (WGA) with the weakness sequence $\{t\}$ (the weakness parameter $t$). We note that the WGA with regard to a basis was introduced in the very first paper (see Temlyakov (1998a)) on greedy bases. It is clear that $G^1_m(f, \Psi) = G_m(f, \Psi)$. It is also clear that, in the case $t < 1$, we have more flexibility in building a weak greedy approximant $G^t_m(f, \Psi)$ than in building $G_m(f, \Psi)$: this is one advantage of a weak greedy approximant $G^t_m(f, \Psi)$. The question is: How much does this flexibility affect the efficiency of the algorithm? Surprisingly, it turns out that the effect is minimal: it is only reflected in a multiplicative constant (see below).
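As an illustration (our sketch, not the book's), condition (1.88) can be realized as follows; any index set passing the check is admissible, and taking the $m$ largest coefficients is always one valid choice:

```python
import numpy as np

def weak_greedy_set(coeffs: np.ndarray, m: int) -> np.ndarray:
    """One admissible Lambda_m(t) for any t in (0, 1]: the indices of
    the m coefficients largest in absolute value."""
    return np.argsort(np.abs(coeffs))[::-1][:m]

def is_admissible(coeffs: np.ndarray, kept, t: float) -> bool:
    """Check condition (1.88) for a candidate index set `kept`."""
    kept = np.asarray(kept)
    mask = np.ones(coeffs.size, dtype=bool)
    mask[kept] = False
    discarded = np.abs(coeffs[mask])
    if discarded.size == 0:
        return True
    return np.abs(coeffs[kept]).min() >= t * discarded.max()
```

When $t < 1$ many different index sets pass the check, which is exactly the flexibility discussed above.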
We begin our discussion with the case when $\Psi$ is a greedy basis. It was proved in Temlyakov (1998a) (see Theorem 1.22) that, when $X = L_p$, $1 < p < \infty$, and $\Psi$ is the Haar system $\mathcal{H}_p$ normalized in $L_p$, we have
$$\|f - G^t_m(f, \mathcal{H}_p)\|_{L_p} \le C(p, t)\sigma_m(f, \mathcal{H}_p)_{L_p} \tag{1.89}$$
for any $f \in L_p$. It was noted in Konyagin and Temlyakov (2002) that the proof of (1.89) from Temlyakov (1998a) works for any greedy basis, not merely the Haar system $\mathcal{H}_p$. Thus, we have the following result.
Theorem 1.39 For any greedy basis $\Psi$ of a Banach space $X$, and any $t \in (0, 1]$, we have
$$\|f - G^t_m(f, \Psi)\| \le C(t, \Psi)\sigma_m(f, \Psi) \tag{1.90}$$
for each $f \in X$.
Theorem 1.40 Let $\Psi$ be a quasi-greedy basis. Then, for a fixed $t \in (0, 1]$ and any $m$, we have, for any $f \in X$,
$$\|G^t_m(f, \Psi)\| \le C(t)\|f\|. \tag{1.91}$$
Theorem 1.41 Let $\Psi$ be a quasi-greedy basis for a Banach space $X$. Then, for any fixed $t \in (0, 1]$, we have, for each $f \in X$, that
$$G^t_m(f, \Psi) \to f \quad \text{as } m \to \infty.$$
Let us now proceed to an almost greedy basis $\Psi$. The following result was established in Konyagin and Temlyakov (2002).
Theorem 1.42 Let $\Psi$ be an almost greedy basis. Then, for $t \in (0, 1]$, we have, for any $m$,
$$\|f - G^t_m(f, \Psi)\| \le C(t)\tilde\sigma_m(f, \Psi). \tag{1.92}$$
Proof We drop $\Psi$ from the notation for the sake of brevity. Take any $\epsilon > 0$ and find $P$, $|P| = m$, such that
$$\|f - S_P(f)\| \le \tilde\sigma_m(f) + \epsilon.$$
$$\|f - G^t_m(f)\| \le \|f - S_P(f)\| + \|S_P(f) - S_Q(f)\|. \tag{1.93}$$
We have
$$S_{Q\setminus P}(f) = S_{Q\setminus P}(f_1).$$
Next,
Consider
$$G_m(f, \Psi_\lambda) = \sum_{k \in \Lambda_m}\big(c_k(f)\lambda_k^{-1}\big)\lambda_k\psi_k.$$
Therefore, the set $\Lambda_m$ can be interpreted as a set $\Lambda_m(t)$ with $t = a/b$ with regard to the basis $\Psi$. It remains to apply the corresponding results for $G^t_m(f, \Psi)$: (1.90) in case (1); (1.91) in case (3); and (1.92) in case (2). This completes the proof of Theorem 1.43.
Kamont and Temlyakov (2004) studied the following modification of the above weak-type greedy algorithm as a way to weaken further the restriction (1.88). We call this modification the Weak Greedy Algorithm (WGA) with a weakness sequence $\tau = \{t_k\}$. Let a weakness sequence $\tau := \{t_k\}_{k=1}^\infty$, $t_k \in [0, 1]$, $k = 1, \dots$, be given. We define the WGA by induction. We take an element $f \in X$, and at the first step we let
$$\Lambda_1(\tau) := \{n_1\}; \qquad G^\tau_1(f, \Psi) := c_{n_1}\psi_{n_1},$$
with any $n_1$ satisfying
$$|c_{n_1}| \ge t_1\max_n|c_n|,$$
where we write $c_n := c_n(f, \Psi)$ for brevity. Assume we have already defined
$$G^\tau_{m-1}(f, \Psi) := G^{X,\tau}_{m-1}(f, \Psi) := \sum_{n \in \Lambda_{m-1}(\tau)}c_n\psi_n.$$
Then we set $\Lambda_m(\tau) := \Lambda_{m-1}(\tau) \cup \{n_m\}$ with any $n_m \notin \Lambda_{m-1}(\tau)$ satisfying
$$|c_{n_m}| \ge t_m\max_{n \notin \Lambda_{m-1}(\tau)}|c_n|.$$
where $D \subseteq D_{t,\epsilon}(x)$. We say that the Weak Thresholding Algorithm converges for $x \in X$, and write $x \in WT\{e_n\}(t)$, if, for any $D(\epsilon) \subseteq D_{t,\epsilon}$,
Theorem 1.45 Let $t, t' \in (0, 1)$, $x \in X$. Then the following conditions are equivalent:
(1) $\lim_{\epsilon \to 0}\sup_{D \subseteq D_{t,\epsilon}(x)}\|T_{\epsilon,D}(x) - x\| = 0$;
(2) $\lim_{\epsilon \to 0}T_\epsilon(x) = x$ and
$$\lim_{\epsilon \to 0}\sup_{D \subseteq D_{t,\epsilon}(x)}\Big\|\sum_{j \in D}e^*_j(x)e_j\Big\| = 0. \tag{1.104}$$
So, the set $WT\{e_n\}(t)$ defined above is indeed independent of $t \in (0, 1)$.
and define
$$G^t_m(x) := G^t_m(x, \{e_n\}) := S_{W_m(t)}(x) := \sum_{j \in W_m(t)}e^*_j(x)e_j.$$
It is clear that for any $t \in (0, 1]$ and any $D \subseteq D_{t,\epsilon}(x)$ there exist $m$ and $W_m(t)$ satisfying (1.107) such that
Proposition 1.47 Let $t \in (0, 1)$ and $x \in X$. Then the following two conditions are equivalent:
$$\lim_{\epsilon \to 0}\sup_{D \subseteq D_{t,\epsilon}(x)}\|T_{\epsilon,D}(x) - x\| = 0, \tag{1.108}$$
Clearly $\epsilon_m \to 0$ as $m \to \infty$. We have
$$G^t_m(x) = T_{2\epsilon_m}(x) + \sum_{j \in D_m}e^*_j(x)e_j, \tag{1.110}$$
where
$$\hat f(k) := (2\pi)^{-d}\int_{\mathbb{T}^d}f(x)e^{-i(k,x)}\,dx,$$
and call it an $m$th weak greedy approximant of $f$ with regard to the trigonometric system $\mathcal{T}^d := \{e^{i(k,x)}\}_{k \in \mathbb{Z}^d}$, $\mathcal{T} := \mathcal{T}^1$. We write $G_m(f) = G^1_m(f)$ and call it an $m$th greedy approximant. Clearly, an $m$th weak greedy approximant, and even an $m$th greedy approximant, may not be unique. In this section we do not impose any extra restrictions on $\Lambda_m$ in addition to (1.111). Thus, the theorems formulated in the following hold for any choice of $\Lambda_m$ satisfying (1.111), or, in other words, for any realization $G^t_m(f)$ of the weak greedy approximation.
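For intuition, here is one realization of $G_m(f, \mathcal{T})$ for a univariate sampled signal (our sketch, with the FFT standing in for the Fourier coefficients):

```python
import numpy as np

def greedy_trig_approximant(samples: np.ndarray, m: int) -> np.ndarray:
    """One realization of the mth greedy approximant G_m(f, T).

    `samples` holds f on a uniform grid over [0, 2*pi); the FFT
    approximates the Fourier coefficients f^(k), and we keep the m of
    largest modulus (ties broken arbitrarily, as the text allows).
    """
    n = samples.size
    fhat = np.fft.fft(samples) / n           # approximate f^(k)
    order = np.argsort(np.abs(fhat))[::-1]   # frequencies ranked by |f^(k)|
    kept = np.zeros_like(fhat)
    kept[order[:m]] = fhat[order[:m]]
    return np.real(np.fft.ifft(kept) * n)    # G_m(f) on the same grid

# A signal dominated by two real frequencies is recovered by 4 FFT bins.
x = np.linspace(0.0, 2 * np.pi, 256, endpoint=False)
f = 3 * np.cos(5 * x) + 0.5 * np.sin(2 * x) + 2 * np.sin(11 * x)
g = greedy_trig_approximant(f, m=4)
print(np.max(np.abs(g - (3 * np.cos(5 * x) + 2 * np.sin(11 * x)))))  # tiny
```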
We will discuss in detail only results concerning convergence of the WGA with regard to the trigonometric system. Körner (1996), answering a question raised by L. Carleson and R. R. Coifman, constructed a function from $L_2(\mathbb{T})$ and then, in Körner (1999), a continuous function such that $\{G_m(f, \mathcal{T})\}$ diverges almost everywhere. It has been proved in Temlyakov (1998c) for $p > 2$ and in Cordoba and Fernandez (1998) for $p < 2$ that there exists an $f \in L_p(\mathbb{T})$ such that $\{G_m(f, \mathcal{T})\}$ does not converge in $L_p$. It was remarked in Temlyakov (2003a) that the method from Temlyakov (1998c) gives a little more.
(1) There exists a continuous function f such that {G m ( f, T )} does not
converge in L p (T) for any p > 2.
(2) There exists a function f that belongs to any L p (T), p < 2, such that
{G m ( f, T )} does not converge in measure.
Thus the above negative results show that the condition $f \in L_p(\mathbb{T}^d)$, $p \ne 2$, does not guarantee convergence of $\{G_m(f, \mathcal{T})\}$ in the $L_p$ norm. The main goal of this section is to discuss results in the following setting: find an additional (to $f \in L_p$) condition on $f$ to guarantee that $\|f - G_m(f, \mathcal{T})\|_p \to 0$ as $m \to \infty$.
In Konyagin and Temlyakov (2003b) we proved the following theorem.
Theorem 1.48 Let $f \in L_p(\mathbb{T}^d)$, $2 < p \le \infty$, and let $q > p' := p/(p-1)$. Assume that $f$ satisfies the condition
$$\sum_{|k| > n}|\hat f(k)|^q = o(n^{d(1-q/p')}).$$
Then we have
$$\lim_{m \to \infty}\|f - G^t_m(f, \mathcal{T})\|_p = 0.$$
Proposition 1.49 For each $2 < p \le \infty$ there exists $f \in L_p(\mathbb{T}^d)$ such that
$$|\hat f(k)| = O(|k|^{-d(1-1/p)}),$$
then $G^t_m(f) \to f$ in $L_p$.
Remark 1.51 The proof of Proposition 1.49 (see Konyagin and Temlyakov
(2003b)) implies that there is an f ∈ L p (Td ) such that
We note that Remark 1.50 can also be obtained from some general inequalities for $\|f - G^t_m(f)\|_p$. As in the above general definition of best $m$-term approximation, we define the best $m$-term approximation with regard to $\mathcal{T}^d$:
$$\sigma_m(f)_p := \sigma_m(f, \mathcal{T}^d)_p := \inf_{k^j \in \mathbb{Z}^d,\,c_j}\Big\|f - \sum_{j=1}^m c_je^{i(k^j,x)}\Big\|_p.$$
It was proved in Temlyakov (1998c) that the inequality (1.112) is sharp: there is a positive absolute constant $C$ such that, for each $m$ and $1 \le p \le \infty$, there exists a function $f \ne 0$ with the property
$$\|G_m(f)\|_p \ge Cm^{h(p)}\|f\|_p. \tag{1.113}$$
The above inequality (1.113) shows that the trigonometric system is not a quasi-greedy basis for $L_p$, $p \ne 2$. We formulate one more inequality from Konyagin and Temlyakov (2003b):
$$\|f - G^t_m(f)\|_p \le \|f - S_Q(f)\|_p + (3 + 1/t)(2m)^{h(p)}E_Q(f)_2.$$
We present some results from Konyagin and Temlyakov (2003b) that are formulated in terms of the Fourier coefficients. For $f \in L_1(\mathbb{T}^d)$ let $\{\hat f(k(l))\}_{l=1}^\infty$ denote the decreasing rearrangement of its Fourier coefficients:
$$|\hat f(k(1))| \ge |\hat f(k(2))| \ge \cdots.$$
Let $a_n(f) := |\hat f(k(n))|$.
Theorem 1.54 Let $2 < p < \infty$ and let a decreasing sequence $\{A_n\}_{n=1}^\infty$ satisfy the condition
$$A_n = o(n^{1/p-1}) \quad \text{as } n \to \infty.$$
We also proved in Konyagin and Temlyakov (2003b) that, for any decreasing sequence $\{A_n\}$ satisfying
$$\lim_{m \to \infty}\|f - G^t_m(f, \mathcal{T})\|_\infty = 0.$$
The condition (A∞) is very close to the convergence of the series $\sum_n A_n$; if the condition (A∞) holds then we have
$$\sum_{n=1}^N A_n = o(\log^*(N)) \quad \text{as } N \to \infty,$$
Theorem 1.56 Assume that a decreasing sequence $\{A_n\}_{n=1}^\infty$ does not satisfy the condition (A∞). Then there exists a function $f \in C(\mathbb{T})$ with the property $a_n(f) \le A_n$, $n = 1, 2, \dots$, and such that we have
$$\limsup_{m \to \infty}\|f - G_m(f, \mathcal{T})\|_\infty > 0.$$
$$\|G_{M(m)}(f) - G_m(f)\|_p \to 0, \quad \text{as } m \to \infty.$$
Then we have
$$\|G_m(f) - f\|_p \to 0, \quad \text{as } m \to \infty.$$
Theorem 1.58 For any $p \in (2, \infty)$ there exists a function $f \in L_p(\mathbb{T})$ with an $L_p(\mathbb{T})$-divergent sequence $\{G_m(f)\}$ of greedy approximations with the following property. For any sequence $\{M(m)\}$ such that $m \le M(m) \le m^{1+o(1)}$, we have
$$\|G_{M(m)}(f) - G_m(f)\|_p \to 0, \quad \text{as } m \to \infty.$$
(a) For some $k \in \mathbb{N}$, and for any sufficiently large $m \in \mathbb{N}$, we have the inequality $\alpha_k(m) > e^m$.
(b) If $f \in C(\mathbb{T})$ and
$$\|G_{\alpha(m)}(f) - G_m(f)\|_\infty \to 0, \quad \text{as } m \to \infty,$$
then
$$\|f - G_m(f)\|_\infty \to 0, \quad \text{as } m \to \infty.$$
In order to illustrate the techniques used in the proofs of the above results we discuss some inequalities that were used in proving Theorems 1.57 and 1.59. The reader will also see from the further discussion a connection to some deep results in harmonic analysis. The general style of these inequalities is as follows. A function that has a sparse representation with regard to the trigonometric system cannot be approximated in $L_p$ by functions with small Fourier coefficients. We begin our discussion with some concepts introduced in Konyagin and Temlyakov (2005) that are useful in proving such inequalities. The following new characteristic of a Banach space $L_p$ plays an important role in such inequalities. We now introduce some more notation. Let $\Lambda$ be a finite subset of $\mathbb{Z}^d$; we let $|\Lambda|$ denote its cardinality and let $\mathcal{T}(\Lambda)$ be the span of $\{e^{i(k,x)}\}_{k \in \Lambda}$. Denote
$$\Sigma_m(\mathcal{T}) := \bigcup_{\Lambda : |\Lambda| \le m}\mathcal{T}(\Lambda).$$
For $f \in L_p$, $F \in L_{p'}$, $1 \le p \le \infty$, $p' = p/(p-1)$, we write
$$\langle F, f\rangle := \int_{\mathbb{T}^d}F\bar f\,d\mu, \qquad d\mu := (2\pi)^{-d}\,dx.$$
Definition 1.60 Let $\Lambda$ be a finite subset of $\mathbb{Z}^d$ and $1 \le p \le \infty$. We call a set $\Lambda' := \Lambda'(p, \gamma)$, $\gamma \in (0, 1]$, a $(p, \gamma)$-dual to $\Lambda$ if, for any $f \in \mathcal{T}(\Lambda)$, there exists $F \in \mathcal{T}(\Lambda')$ such that $\|F\|_{p'} = 1$ and $\langle F, f\rangle \ge \gamma\|f\|_p$.
Let $D(\Lambda, p, \gamma)$ denote the set of all $(p, \gamma)$-dual sets $\Lambda'$. The following function is important for us:
$$v(m, p, \gamma) := \sup_{\Lambda : |\Lambda| = m}\,\inf_{\Lambda' \in D(\Lambda, p, \gamma)}|\Lambda'|.$$
Proof Let $h \in \mathcal{T}(\Lambda)$ with $|\Lambda| = m$ and let $\Lambda' \in D(\Lambda, p, \gamma)$. Then, using Definition 1.60, we find $F(h, \gamma) \in \mathcal{T}(\Lambda')$ such that
$$\|F(h, \gamma)\|_{p'} = 1 \quad\text{and}\quad \langle F(h, \gamma), h\rangle \ge \gamma\|h\|_p.$$
We have
$$\langle F(h, \gamma), h\rangle = \langle F(h, \gamma), h + g\rangle - \langle F(h, \gamma), g\rangle \le \|h + g\|_p + |\langle F(h, \gamma), g\rangle|.$$
Next,
$$|\langle F(h, \gamma), g\rangle| \le \big\|\{\hat F(h, \gamma)(k)\}\big\|_1\big\|\{\hat g(k)\}\big\|_\infty.$$
We now combine the above inequalities and use the definition of $v(m, p, \gamma)$.
Setting $f_i = |e_j|^p$, $n = p - 1$, we deduce that any function of the form
$$\prod_{i=1}^m|e_i|^{k_i}, \qquad k_i \in \mathbb{N}, \quad \sum_{i=1}^m k_i = p - 1,$$
belongs to $L_p$. It now follows from (1.115) that
$$w(m, p, 1) \le m^{p-1}, \quad p = 2q,\ q \in \mathbb{N}. \tag{1.116}$$
There is a general theory of the uniform approximation property (UAP) which provides some estimates for $w(m, p, \gamma)$ and $v(m, p, \gamma)$. We give some definitions from this theory. For a given subspace $X$ of $L_p$, $\dim X = m$, and a constant $K > 1$, let $k_p(X, K)$ be the smallest $k$ such that there is an operator $I_X : L_p \to L_p$, with $I_X(f) = f$ for $f \in X$, $\|I_X\|_{L_p \to L_p} \le K$, and $\mathrm{rank}\,I_X \le k$. Define
$$k_p(m, K) := \sup_{X : \dim X = m}k_p(X, K),$$
$$I_X(e^{i(k,x)}) = \sum_{j=1}^M c^k_j\psi_j.$$
Let
$$\lambda_k := \sum_{j=1}^M c^k_j\hat\psi_j(k).$$
We have
$$\sum_k|\lambda_k|^2 \le \sum_k\Big(\sum_{j=1}^M|c^k_j|^2\Big)\Big(\sum_{j=1}^M|\hat\psi_j(k)|^2\Big) \le K^2M.$$
$$\|A\|_{L_p \to L_p} \le K \quad\text{and}\quad \|A\|_{L_2 \to L_\infty} \le KM^{1/2}.$$
Therefore
$$\|A(h + g)\|_p \le K\|h + g\|_p$$
and
$$\|A(h + g)\|_p \ge \|h\|_p - KM^{1/2}\|g\|_2.$$
$$\|h + g\|_p^p \ge 2^{-p-1}\|h\|_p^p - 2m^{p/2}\|h\|_p^{p-2}\|g\|_2^2. \tag{1.120}$$
We mention two inequalities from Konyagin and Temlyakov (2003b) in the style of the inequalities in Lemmas 1.61, 1.63 and 1.65.
Lemma 1.66 Let $2 \le p < \infty$ and $h \in L_p$, $\|h\|_p \ne 0$. Then, for any $g \in L_p$, we have
$$\|h\|_p \le \|h + g\|_p + \big(\|h\|_{2p-2}/\|h\|_p\big)^{p-1}\|g\|_2.$$
Lemma 1.67 Let $h \in \Sigma_m(\mathcal{T})$, $\|h\|_\infty = 1$. Then, for any function $g$ such that $\|g\|_2 \le \frac14(4\pi m)^{-m/2}$, we have
$$\|h + g\|_\infty \ge 1/4.$$
We proceed to estimate $v(m, p, \gamma)$ and $w(m, p, \gamma)$ for $p \in [2, \infty)$. In the special case of even $p$ we have, by (1.114) and (1.116), that
Lemma 1.68 Let $2 \le p < \infty$, and let $\alpha := p/2 - [p/2]$. Then we have
$$v(m, p, \gamma) \le m^{c(\alpha,\gamma)m^{1/2} + p - 1}.$$
Proof We have
$$\sigma_{m_s}(f)_p \le \Big\|\sum_{k > m_s}c_{n_k}(f)\psi_{n_k}\Big\|_p \le \sum_{l=s}^\infty\Big\|\sum_{k \in (m_l, m_{l+1}]}c_{n_k}(f)\psi_{n_k}\Big\|_p,$$
as required.
Theorem 1.74 Assume that a given sequence $E \in MDP$ satisfies the conditions
$$\epsilon_{N_s} \ge C_12^{-s}, \qquad d_{s+1} \le C_2d_s, \qquad s = 0, 1, 2, \dots$$
Proof We first prove $\Rightarrow$. If $N_{s+1} > N_s$, then we use Lemma 1.72 with $M = N_{s+1}$ and $N = N_s$:
$$a_{N_{s+1}}(f, p) \le C(p, \Psi)\sigma_{N_s}(f)_p\,d_s^{-1/p} \le C(p, \Psi)2^{-s-1}(d_{s+1}/C_2)^{-1/p},$$
which implies the statement of Theorem 1.74 in this case. Let $N_{s+1} = N_s = \cdots = N_{s-j} > N_{s-j-1}$. The assumption $\epsilon_{N_s} \ge C_12^{-s}$ combined with the definition of $N_s$: $\epsilon_{N_s} < 2^{-s}$ implies that $j \le C_3$. Then, from the above case we get
$$a_{N_{s-j}}(f, p) \ll 2^{-s+j}(d_{s-j})^{-1/p},$$
and therefore
$$a_{N_{s+1}}(f, p) \ll 2^{-s-1}(d_{s+1})^{-1/p}.$$
The implication $\Rightarrow$ has been proved.
We now prove the inverse statement $\Leftarrow$. Using Lemma 1.73, we get
$$\sigma_{N_s}(f)_p \ll \sum_{l=s}^\infty a_{N_l}(f, p)(N_{l+1} - N_l)^{1/p} \ll \sum_{l=s}^\infty 2^{-l} \ll 2^{-s} \ll \epsilon_{N_s},$$
Corollary 1.75 Theorem 1.74 applied to Examples 1.70 and 1.71 gives the following relations:
$$\sigma_m(f)_p \asymp (m+1)^{-r} \iff a_n(f, p) \asymp n^{-r-1/p}, \tag{1.125}$$
$$\sigma_m(f)_p \asymp 2^{-m^b} \iff a_n(f, p) \asymp 2^{-n^b}n^{(1-1/b)/p}. \tag{1.126}$$
Remark 1.76 Making use of Lemmas 1.72 and 1.73 we can prove a version of Corollary 1.75 with the sign $\asymp$ replaced by $\ll$.
Theorem 1.74 and Corollary 1.75 are in the spirit of the classical Jackson–Bernstein direct and inverse theorems in linear approximation theory, where conditions of the form
$$E_n(f)_p \ll \epsilon_n \quad\text{or}\quad \big\|\{E_n(f)_p/\epsilon_n\}\big\|_\infty < \infty \tag{1.127}$$
Lemmas 1.72 and 1.73 are also useful in handling this more general case. For instance, in the particular case of Example 1.70 we have the following statement.
Theorem 1.77 Let $1 < p < \infty$ and $0 < q < \infty$. Then, for any positive $r$ we have the equivalence relation
$$\sum_m\sigma_m(f)_p^q\,m^{rq-1} < \infty \iff \sum_n a_n(f, p)^q\,n^{rq-1+q/p} < \infty.$$
Remark 1.78 The condition
$$\sum_n a_n(f, p)^q\,n^{rq-1+q/p} < \infty$$
where $\beta := (r + 1/p)^{-1}$.
A statement similar to Corollary 1.79 for free-knot spline approximation was proved in Petrushev (1988). Corollary 1.79 and further results in this direction can be found in DeVore and Popov (1988) and DeVore, Jawerth and Popov (1992). We want to remark here that conditions in terms of $a_n(f, p)$ are convenient in applications. For instance, (1.125) can be rewritten using the idea of thresholding. For a given $f \in L_p$ denote
$$T(\epsilon) := \#\{k : a_k(f, p) \ge \epsilon\}.$$
Then (1.125) is equivalent to
$$\sigma_m(f)_p \asymp (m+1)^{-r} \iff T(\epsilon) \asymp \epsilon^{-(r+1/p)^{-1}}.$$
For further results in this direction see Cohen, DeVore and Hochmuth (2000), DeVore (1998) and Oswald (2001).
The above direct and inverse Theorem 1.77, which holds for greedy bases satisfying (1.121), was extended in Dilworth et al. (2003) and Kerkyacharian and Picard (2004) to the case of quasi-greedy bases satisfying (1.121). Kerkyacharian and Picard (2004) say that a basis $\Psi$ of a Banach space $X$ has the p-Temlyakov property if there exists $0 < C < \infty$ such that, for any finite set of indices $\Lambda$, we have
$$C^{-1}\big(\min_{n \in \Lambda}|c_n|\big)|\Lambda|^{1/p} \le \Big\|\sum_{n \in \Lambda}c_n\psi_n\Big\|_X \le C\big(\max_{n \in \Lambda}|c_n|\big)|\Lambda|^{1/p}. \tag{1.130}$$
Now let
$$f = \sum_{k=1}^\infty c_k(f)\psi_k$$
and
$$|c_{k_1}| \ge |c_{k_2}| \ge \cdots.$$
(1) If $\Psi$ has the p-Temlyakov property (1.130), then, for any $0 < r < \infty$, $0 < q < \infty$, we have
$$\sum_m\sigma_m(f, \Psi)_X^q\,m^{rq-1} < \infty \iff \sum_n|c_{k_n}(f)|^q\,n^{rq-1+q/p} < \infty. \tag{1.131}$$
(2) If (1.131) holds with some $r > 0$, then $\Psi$ has the p-Temlyakov property (1.130).
The extra factor (3m + 1) cannot be replaced by a factor c(m) such that
c(m)/m → 0 as m → ∞.
This particular result indicates that there are problems with greedy approximation in $L_1$ and in $C$ with regard to the Haar basis. We note that, as is
proved in Oswald (2001), the extra factor 3m + 1 is the best-possible extra
factor in Theorem 1.81. The greedy-type bases have nice properties and they
are important in nonlinear m-term approximation. Therefore, one of the new
directions of research in functional analysis and in approximation theory is to
understand which Banach spaces may have such bases. Another direction is to
understand in which Banach spaces some classical bases are of greedy type.
Some results in this direction can be derived immediately from known results
on Banach spaces that have unconditional bases, and from the characterization
Theorem 1.15. For instance, it is well known that the spaces L 1 and C do not
have unconditional bases. Therefore, Theorem 1.15 implies that there are no
greedy bases in L 1 and in C.
It was proved in Dilworth, Kutzarova and Wojtaszczyk (2002) that the Haar
basis H1 is not a quasi-greedy basis for L 1 . We saw in Section 1.6 that the use
of the Weak Greedy Algorithm has some advantages over the Greedy Algo-
rithm. Theorem 1.46 states that the convergence set W T {en } of the WGA is
linear for any t ∈ (0, 1), while the convergence set may not be linear for the
Greedy Algorithm. Recently, Gogyan (2009) proved that, for any t ∈ (0, 1)
and for any f ∈ L 1 (0, 1), there exists a realization of the WGA with respect
to the Haar basis that converges to f in L 1 .
It was proved in Dilworth, Kutzarova and Wojtaszczyk (2002) that there exists an increasing sequence of integers $\{n_j\}$ such that the lacunary Haar system $\{H_{2^{n_j}+l} : l = 1, \dots, 2^{n_j},\ j = 1, 2, \dots\}$ is a quasi-greedy basis for its linear span in $L_1$. Gogyan (2005) proved that the above property holds if either $\{n_j\}$ is a sequence of all even numbers or $\{n_j\}$ is a sequence of all odd numbers. We
also note that the space L 1 (0, 1) has a quasi-greedy basis (Dilworth, Kalton
and Kutzarova (2003)). The reader can find further results on existence (and
nonexistence) of quasi-greedy and almost greedy bases in Dilworth, Kalton
and Kutzarova (2003). In particular, it is proved therein that $C[0, 1]$ does not have a quasi-greedy basis.
We pointed out in Section 1.7 that the trigonometric system is not a quasi-greedy basis for $L_p$, $p \ne 2$. The question of when (and for which weights $w$) the trigonometric system forms a quasi-greedy basis for a weighted space $L_p(w)$ was studied in Nielsen (2009). The author proved that this can happen only for $p = 2$ and that, whenever the system forms a quasi-greedy basis, the basis must be a Riesz basis.
Theorem 1.11A shows that, in the case when a basis $\Psi$ is $L_p$-equivalent to the Haar basis $\mathcal{H}_p$, $1 < p < \infty$, the Greedy Algorithm $G_m(f, \Psi)$ provides near-best approximation for each individual function $f \in L_p$. For a function class $F \subset X$ denote
$$G_m(F, \Psi)_X := \sup_{f \in F}\|f - G_m(f, \Psi)\|_X.$$
Obviously, if $G_m(\cdot, \Psi)$ provides near-best approximation for each individual function, then it provides near-best approximation for each function class $F$:
$$G_m(F, \Psi)_X \le C\sigma_m(F, \Psi)_X.$$
In Section 1.7 we pointed out that the trigonometric system is not a quasi-greedy basis for $L_p$, $p \ne 2$ (see (1.113)). Thus, the trigonometric system is not a greedy basis for $L_p$, $p \ne 2$, and for some functions $f \in L_p$, $p \ne 2$, $G_m(f, \mathcal{T})$ does not provide near-best approximation. However, it was proved in Temlyakov (1998c) that, in many cases, the algorithm $G_m(\cdot, \mathcal{T})$ is optimal for a given class of functions. The reader can find further results on $\sigma_m(F, \mathcal{T}^d)_p$ and $G_m(F, \mathcal{T}^d)_p$ for different classes $F$ in DeVore and Temlyakov (1995) and Temlyakov (1998c, 2000a, 2002a).
Consideration of approximation in a function class leads to the concept of the optimal (best) basis for a given class. The first results on best-basis approximation were given by Kashin (1985), who showed that, for any orthonormal basis $\Phi$ and any $0 < \alpha \le 1$, we have
$$\sigma_m(\mathrm{Lip}\,\alpha, \Phi)_{L_2} \ge cm^{-\alpha}, \tag{1.132}$$
where the constant $c$ depends only on $\alpha$. It follows from this that any of the standard wavelet or Fourier bases is best for the Lipschitz classes, when the approximation is carried out in $L_2$ and the competition is held over all orthonormal bases. The estimate (1.132) rests on some fundamental estimates for the best-basis approximation of finite-dimensional hypercubes using orthonormal bases.
Kashin (2002) considered a function class that is much thinner than $\mathrm{Lip}\,\alpha$:
$$X := \{\chi_{[t,1]},\ t \in [0, 1]\},$$
where $\chi_{[a,b]}$ is the characteristic function of $[a, b]$. Kashin (2002) proved the following lower bounds. For any orthonormal system $\Phi \subset L_2([0, 1])$ we have
$$\sigma_m(X, \Phi)_{L_2} \ge C^{-m}. \tag{1.133}$$
For any complete uniformly bounded ($\|\psi_j\|_\infty \le M$) orthonormal system $\Phi \subset L_2([0, 1])$ we have
$$\sigma_m(X, \Phi)_{L_2} \ge C(M)m^{-1/2}. \tag{1.134}$$
In the proof of estimates (1.133) and (1.134) the technique from the theory of general orthogonal series, developed in Kashin (1977b), was used. It is known and easy to see that bounds (1.133) and (1.134) are sharp. The upper bound complementary to (1.133) follows from approximation by the Haar basis, and the upper bound complementary to (1.134) follows from approximation by the trigonometric system.
The problem of best-basis selection was studied in Coifman and Wickerhauser (1992). Donoho (1993, 1997) also studied the problem of best bases for a function class $F$. He calls a basis $\Phi$ from a collection $B$ best for $F$ if
$$\sigma_m(F, \Phi)_X = O(m^{-\alpha}), \quad m \to \infty,$$
while no basis $\Phi' \in B$ satisfies
$$\sigma_n(F, \Phi')_X = O(n^{-\beta}), \quad n \to \infty,$$
for a value of $\beta > \alpha$. Donoho has shown that in some cases it is possible to determine a best basis (in the above sense) for the class $F$ by intrinsic properties of how the class is represented with respect to the basis. In Donoho's analysis (as was the case for Kashin as well) the space $X$ is $L_2$ (or equivalently any Hilbert space), and the competition for a best basis takes place over all complete orthonormal systems (i.e. $B$ consists of all complete orthonormal bases for $L_2$).
In DeVore, Petrova and Temlyakov (2003) we continued to study the problem of optimal basis selection with regard to natural collections of bases. We worked on the following problem in this direction. We say that a function class $F$ is aligned to the basis $\Psi$ if, whenever $f = \sum_k a_k\psi_k$ is in $F$, then $\sum_k a'_k\psi_k \in F$ for any $|a'_k| \le c|a_k|$, $k = 1, 2, \dots$
Theorem 1.82 Let $\Phi$ be an orthonormal basis for a Hilbert space $H$ and let $F$ be a function class aligned with $\Phi$ such that, for some $\alpha > 0$, $\beta \in \mathbb{R}$, we have
$$\limsup_{m \to \infty}m^\alpha(\log m)^\beta\sigma_m(F, \Phi) > 0.$$
for some $\mu > 0$. Assume that the function class $F$ is aligned with $\Psi$, and that, for some $\alpha > 0$, $\beta \in \mathbb{R}$, we have
$$\limsup_{m \to \infty}m^\alpha(\log m)^\beta\sigma_m(F, \Psi) > 0.$$
Theorem 1.83 is weaker than Theorem 1.82 in the sense that we have an extra factor $(\log m)^\alpha$ in (1.135). Recently, Bednorz (2008) proved Theorem 1.83 with (1.135) replaced by (1.136):
$$\limsup_{m \to \infty}m^\alpha(\log m)^\beta\sigma_m(F, B) > 0. \tag{1.136}$$
In the papers cited above some results were obtained when $B = O$, the set of orthonormal bases, and $F$ is either a multivariate smoothness class of an anisotropic Sobolev–Nikol'skii kind, or a class of functions with bounded mixed derivative.
We conclude this section with a recent result from Wojtaszczyk (2006). Theorem 1.11A says that the univariate Haar basis $\mathcal{H}_p$ is a greedy basis for $L_p := L_p([0, 1])$, $1 < p < \infty$. The spaces $L_p$ are examples of rearrangement-invariant spaces. Let us recall that a rearrangement-invariant space of functions defined on $[0, 1]$ is a Banach space $X$ with norm $\|\cdot\|$ whose elements are measurable (in the sense of Lebesgue) functions defined on $[0, 1]$ satisfying the following conditions.
then $g \in X$ and $\|g\| = \|f\|$.
It is a very interesting result that singles out the $L_p$-spaces with $1 < p < \infty$ from the collection of rearrangement-invariant spaces. Theorem 1.84 emphasizes the importance of the $L_p$ spaces in the theory of greedy approximation.
$$C_1A \le B \le C_2A.$$
We shall indicate what the constants depend on (in the case of (1.138) they may depend on $p$).
Here is a useful remark concerning (1.138). From the validity of (1.138) for
finite sequences, we can deduce its validity for infinite sequences by a limiting
argument. For example, if (c I ) I ∈D is an infinite sequence for which the sum on
the left-hand side of (1.138) converges in L p (R) with respect to some ordering
of the I ∈ D, then the right-hand side of (1.138) will converge with respect to
the same ordering and the right-hand side of (1.138) will be less than a multiple
of the left. Likewise, we can reverse the roles of the left- and right-hand sides.
Similar remarks hold for other statements like (1.138).
The term strong Littlewood–Paley inequality is used to differentiate (1.138) from other possible forms of Littlewood–Paley inequalities. For example, the Littlewood–Paley inequalities for the complex exponentials take a different form (see Zygmund (1959), chap. XV). Another form of interest in our considerations is the following:
$$\Big\|\sum_{I \in D}c_I\eta(I, \cdot)\Big\|_p \asymp \Big\|\Big(\sum_{I \in D}[c_I\chi_I]^2\Big)^{1/2}\Big\|_p. \tag{1.139}$$
We use the notation χ for the characteristic function of [0, 1] and χ I for its
L 2 (R)-normalized, shifted dilates given by (1.137) (with ψ = χ ).
The two forms (1.138) and (1.139) are equivalent under very mild conditions on the functions $\eta(I, \cdot)$. To see this, we shall use the Hardy–Littlewood maximal operator, which is defined for a locally integrable function $g$ on $\mathbb{R}$ by
$$Mg(x) := \sup_{J \ni x}\frac{1}{|J|}\int_J|g(y)|\,dy,$$
with the supremum taken over all intervals $J$ that contain $x$. It is well known that $M$ is a bounded operator on $L_p(\mathbb{R})$ for all $1 < p \le \infty$.
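For intuition, here is a brute-force discrete sketch (ours, with our own grid conventions) of $Mg$ on a uniform grid: each value is the largest average of $|g|$ over a grid interval containing the point.

```python
import numpy as np

def hardy_littlewood_max(g: np.ndarray, h: float) -> np.ndarray:
    """Discrete Hardy-Littlewood maximal function of samples g with grid
    spacing h. Brute force over all grid intervals containing each
    point; O(n^3), fine only for small illustrative inputs."""
    n = g.size
    cs = np.concatenate([[0.0], np.cumsum(np.abs(g)) * h])  # running integral
    M = np.zeros(n)
    for i in range(n):
        for a in range(i + 1):                 # interval start <= i
            for b in range(i + 1, n + 1):      # interval end > i
                avg = (cs[b] - cs[a]) / ((b - a) * h)
                if avg > M[i]:
                    M[i] = avg
    return M
```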
The Fefferman–Stein inequality (Fefferman and Stein (1972)) bounds the mapping $M$ on sequences of functions. We only need the following special case of this inequality, which says that, for any functions $\eta(I, \cdot)$ and constants $c_I$, $I \in D$, we have, for $1 < p \le \infty$,
$$\Big\|\Big(\sum_{I \in D}(c_IM\eta(I, \cdot))^2\Big)^{1/2}\Big\|_p \le A\Big\|\Big(\sum_{I \in D}(c_I\eta(I, \cdot))^2\Big)^{1/2}\Big\|_p, \tag{1.140}$$
with an absolute constant $A$.
Consider now as an example the equivalence of (1.138) and (1.139). If the functions $\eta(I, \cdot)$, $I \in D$, satisfy
$$|\eta(I, x)| \le C\,M\chi_I(x), \qquad \chi_I(x) \le C\,M\eta(I, x), \qquad \text{for almost all } x \in \mathbb{R}, \tag{1.141}$$
then, using (1.140), we see that (1.138) holds if and only if (1.139) holds. The
first inequality in (1.141) is a decay condition on η(I, ·). For example, if η(I, ·)
is given by the normalized, shifted dilates of the function ψ, η(I, ·) = ψ I , then
the first inequality in (1.141) holds whenever
$$|\psi(x)| \le C[\max(1, |x|)]^{-\lambda}, \quad \text{for almost all } x \in \mathbb{R},$$
with λ ≥ 1. The second condition in (1.141) is extremely mild. For example,
it is always satisfied when the family η(I, ·) is generated by the shifted dilates
of a non-zero function ψ.
Suppose that we have two families $\eta(I, \cdot)$, $\mu(I, \cdot)$, $I \in D(\mathbb{R})$. We shall use the notation $\{\eta(I, \cdot)\}_{I \in D} \prec \{\mu(I, \cdot)\}_{I \in D}$ if there is a constant $C > 0$ such that
$$\Big\|\sum_{I \in D}c_I\eta(I, \cdot)\Big\|_p \le C\Big\|\sum_{I \in D}c_I\mu(I, \cdot)\Big\|_p, \tag{1.142}$$
so that
$$\eta(J, \cdot) = \sum_{I \in D}\lambda(I, J)H_I.$$
It follows that
$$T\Big(\sum_{J \in D}c_JH_J\Big) = \sum_{I \in D}\Big(\sum_{J \in D}\lambda(I, J)c_J\Big)H_I. \tag{1.145}$$
Thus the mapping $T$ is tied to the bi-infinite matrix $\Lambda := (\lambda(I, J))_{I,J \in D}$, which maps the sequence $c := (c_J)$ into the sequence $(c'_I) := \Lambda c$.
with
$$\omega(I, J) := \Big(1 + \frac{|\xi_I - \xi_J|}{\max(|I|, |J|)}\Big)^{-1-\epsilon}\min\Big(\frac{|I|}{|J|}, \frac{|J|}{|I|}\Big)^{(1+\epsilon)/2}. \tag{1.147}$$
In DeVore, Konyagin and Temlyakov (1998) we used the following special case of Frazier and Jawerth (1990), theorem 3.3, concerning almost-diagonal operators.
Theorem 1.86 If η(I, ·), I ∈ D, satisfy assumptions A1–A3, then the operator
T defined by (1.143) is bounded from L p (R) into itself for each 1 < p < ∞.
We can use a duality argument to provide sufficient conditions under which the operator $T$ of (1.143) is boundedly invertible. For this, we assume that $\eta(I, \cdot)$, $I \in D$, is a family of functions for which there is a dual family $\eta^*(I, \cdot)$, $I \in D$, that satisfies
$$\langle\eta(I, \cdot), \eta^*(J, \cdot)\rangle = \delta(I, J), \quad I, J \in D.$$
Theorem 1.89 If the systems of functions {η(I, ·)} I ∈D , {η∗ (I, ·)} I ∈D , satisfy
assumptions A1–A3, then the system {η(I, ·)} I ∈D is L p -equivalent to the Haar
system {H I } I ∈D for 1 < p < ∞.
It is known from different results (see DeVore (1998), DeVore, Jawerth and Popov (1992) and Temlyakov (2003a)) that wavelets are well designed for nonlinear approximation. We present here one general result in this direction. We fix $p \in (1, \infty)$ and consider in $L_p([0, 1]^d)$ a basis $\Psi := \{\psi_I\}_{I \in D}$ indexed by dyadic intervals $I$ of $[0, 1]^d$, $I = I_1 \times \cdots \times I_d$, where $I_j$ is a dyadic interval of $[0, 1]$, $j = 1, \dots, d$, which satisfies certain properties. Set $L_p := L_p(\Omega)$ with the normalized Lebesgue measure on $\Omega$, $|\Omega| = 1$. First of all we assume that, for all $1 < q, p < \infty$ and $I \in D$, where $D := D([0, 1]^d)$ is the set of all dyadic intervals of $[0, 1]^d$, we have
$$D_s := \{I = I_1 \times \cdots \times I_d \in D : |I_j| = 2^{-s_j},\ j = 1, \dots, d\}.$$
We assume that
$$\lim_{\min_j\mu_j \to \infty}\Big\|f - \sum_{s : s_j \le \mu_j,\ j=1,\dots,d}\;\sum_{I \in D_s}f_I\psi_I\Big\|_p = 0 \tag{1.151}$$
and
$$\|f\|_p \asymp \Big\|\Big(\sum_s\Big|\sum_{I \in D_s}f_I\psi_I\Big|^2\Big)^{1/2}\Big\|_p. \tag{1.152}$$
We define the class $H^R_q(\Psi)$ as the set of functions $f \in L_q$ representable in the form
$$f = \sum_{l=1}^\infty t_l, \qquad t_l \in (R, l), \qquad \|t_l\|_q \le 2^{-g(R)l}.$$
We also proved in Temlyakov (2002a) that the basis $U^d$ is an optimal orthonormal basis for approximation of the classes $NH^R_q$ in $L_p$:
2.1 Introduction
In this chapter we discuss greedy approximation with regard to redundant systems. Greedy approximation is a special form of nonlinear approximation. The
basic idea behind nonlinear approximation is that the elements used in the
approximation do not come from a fixed linear space but are allowed to depend
on the function being approximated. The standard problem in this regard is the
problem of m-term approximation, where one fixes a basis and aims to approx-
imate a target function f by a linear combination of m terms of the basis. We
discussed this problem in detail in Chapter 1. When the basis is a wavelet basis
or a basis of other waveforms, then this type of approximation is the starting point for compression algorithms. An important feature of approximation using a basis $\Psi := \{\psi_k\}_{k=1}^\infty$ of a Banach space $X$ is that each function $f \in X$ has a unique representation
$$f = \sum_{k=1}^\infty c_k(f)\psi_k, \tag{2.1}$$
and we can identify $f$ with the set of its coefficients $\{c_k(f)\}_{k=1}^\infty$. The problem of $m$-term approximation with regard to a basis has been studied thoroughly and rather complete results have been established (see Chapter 1). In particular, it was established that the greedy-type algorithm which forms a sum of $m$ terms with the largest $\|c_k(f)\psi_k\|_X$ out of the expansion (2.1) realizes in many cases near-best $m$-term approximation for function classes (DeVore, Jawerth and Popov (1992)) and even for individual functions (see Chapter 1).
cases near-best m-term approximation for function classes (DeVore, Jawerth
and Popov (1992)) and even for individual functions (see Chapter 1).
Recently, there has emerged another more complicated form of nonlinear
approximation, which we call highly nonlinear approximation. It takes many
forms but has the basic ingredient that a basis is replaced by a larger system
of functions that is usually redundant. We call such systems dictionaries. On
the one hand, redundancy offers much promise for greater efficiency in terms of the approximation rate, but on the other hand it gives rise to highly nontrivial theoretical and practical problems. The problem of characterizing the approximation rate for a given function or function class is now much more substantial and results are quite fragmentary. However, such results are very important for understanding what this new type of approximation offers. Perhaps the first example of this type was considered by Schmidt (1906), who studied the approximation of functions f(x, y) of two variables by bilinear forms,
forms,
m
u i (x)vi (y),
i=1
in L 2 ([0, 1]2 ). This problem is closely connected with properties of the integral
operator
1
J f (g) := f (x, y)g(y)dy
0
with kernel f (x, y). Schmidt (1906) gave an expansion (known as the Schmidt
expansion):
∞
f (x, y) = s j (J f )φ j (x)ψ j (y),
j=1
m
f (x, y) − s j (J f )φ j (x)ψ j (y)
L 2
j=1
m
= inf
f (x, y) − u j (x)v j (y)
L 2 .
u j ,v j ∈L 2 , j=1,...,m
j=1
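In discrete form the Schmidt expansion is the singular value decomposition, and truncating it gives the best bilinear (rank-$m$) approximation. A small sketch (ours, with an arbitrary sample kernel) illustrating this:

```python
import numpy as np

# Discretized kernel f(x, y); its SVD is the discrete analog of the
# Schmidt expansion f = sum_j s_j phi_j(x) psi_j(y).
x = np.linspace(0.0, 1.0, 200)
F = np.exp(-np.outer(x, x))            # sample kernel f(x, y) = e^{-xy}
U, s, Vt = np.linalg.svd(F)

m = 3
F_m = (U[:, :m] * s[:m]) @ Vt[:m, :]   # best rank-m bilinear approximant
# The error equals sqrt(sum of the discarded squared singular values).
print(np.linalg.norm(F - F_m), np.sqrt((s[m:] ** 2).sum()))
```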
It was understood later that the above best bilinear approximation can be realized by the following greedy algorithm. Assume that $c_j$, $u_j(x)$, $v_j(y)$, $\|u_j\|_{L_2} = \|v_j\|_{L_2} = 1$, $j = 1, \dots, m-1$, have been constructed after $m - 1$ steps of the algorithm. At the $m$th step we choose $c_m$, $u_m(x)$, $v_m(y)$, $\|u_m\|_{L_2} = \|v_m\|_{L_2} = 1$, to minimize
$$\Big\|f(x, y) - \sum_{j=1}^m c_ju_j(x)v_j(y)\Big\|_{L_2}.$$
We call this type of algorithm the Pure Greedy Algorithm (PGA) (see the general definition below).
Another problem of this type, well known in statistics, is the projection pursuit regression problem, mentioned in the Preface. The problem is to approximate in $L_2$ a given function $f \in L_2$ by a sum of ridge functions, i.e. by
$$\sum_{j=1}^m r_j(\omega^j \cdot x), \qquad x, \omega^j \in \mathbb{R}^d, \quad j = 1, \dots, m,$$
where at each step one minimizes
$$\Big\|f(x) - \sum_{j=1}^m r_j(\omega^j \cdot x)\Big\|_{L_2}.$$
This is one more example of the Pure Greedy Algorithm. The Pure Greedy
Algorithm and some other versions of greedy-type algorithms have recently
been intensively studied: see Barron (1993), Davis, Mallat and Avellaneda
(1997), DeVore and Temlyakov (1996, 1997), Donahue et al. (1997), Dubinin
(1997), Huber (1985), Jones (1987, 1992), Konyagin and Temlyakov (1999b),
Livshitz (2006, 2007, 2009), Livshitz and Temlyakov (2001, 2003), Temlyakov
(1999, 2000b, 2002b, 2003b). There are several survey papers that discuss
greedy approximation with regard to redundant systems: see DeVore (1998)
and Temlyakov (2003a, 2006a). In this chapter we discuss along with the PGA
some of its modifications which are more suitable for implementation. This
new type of greedy algorithms will be termed Weak Greedy Algorithms.
In order to orient the reader we recall some notation and definitions from the theory of greedy algorithms. Let $H$ be a real Hilbert space with an inner product $\langle\cdot,\cdot\rangle$ and the norm $\|x\| := \langle x, x\rangle^{1/2}$. We say a set $D$ of functions (elements) from $H$ is a dictionary if each $g \in D$ has norm one ($\|g\| = 1$) and the closure of $\mathrm{span}\,D$ is equal to $H$. Sometimes it will be convenient for us also to consider the symmetrized dictionary $D^\pm := \{\pm g,\ g \in D\}$. In DeVore and Temlyakov (1996) we studied the following two greedy algorithms. If $f \in H$,
we let $g = g(f) \in D$ be the element from $D$ which maximizes $|\langle f, g\rangle|$ (we make an additional assumption that a maximizer exists) and define
$$G(f) := G(f, D) := \langle f, g\rangle g \tag{2.2}$$
and
$$R(f) := R(f, D) := f - G(f).$$
Pure Greedy Algorithm (PGA) We define $f_0 := R_0(f) := R_0(f, D) := f$ and $G_0(f) := G_0(f, D) := 0$. Then, for each $m \ge 1$, we inductively define
$$G_m(f) := G_m(f, D) := G_{m-1}(f) + G(R_{m-1}(f)),$$
$$f_m := R_m(f) := R_m(f, D) := f - G_m(f) = R(R_{m-1}(f)).$$
We note that the Pure Greedy Algorithm is known under the name Matching
Pursuit in signal processing (see, for example, Mallat and Zhang (1993)).
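A minimal sketch (ours) of the PGA/Matching Pursuit iteration for a finite dictionary of unit vectors in $\mathbb{R}^n$, stored as the rows of a matrix `D`:

```python
import numpy as np

def pure_greedy(f: np.ndarray, D: np.ndarray, m: int):
    """Pure Greedy Algorithm (Matching Pursuit) sketch in R^n.

    D holds unit-norm dictionary elements as rows. Each step picks g
    maximizing |<f_{m-1}, g>| and updates G_m and the residual f_m.
    """
    residual = f.astype(float).copy()        # f_0 = f
    approx = np.zeros_like(residual)         # G_0 = 0
    for _ in range(m):
        inner = D @ residual                 # <f_{m-1}, g> for every g
        g = D[np.argmax(np.abs(inner))]      # greedy choice g(f_{m-1})
        c = g @ residual
        approx += c * g                      # G_m = G_{m-1} + <f_{m-1}, g> g
        residual -= c * g                    # f_m = R(f_{m-1})
    return approx, residual
```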
If H0 is a finite-dimensional subspace of H , we let PH0 be the orthogonal
projector from H onto H0 . That is, PH0 ( f ) is the best approximation to f from
H0 .
Orthogonal Greedy Algorithm (OGA) We define $f^o_0 := R^o_0(f) := R^o_0(f, D) := f$ and $G^o_0(f) := G^o_0(f, D) := 0$. Then, for each $m \ge 1$, we inductively define
$$H_m := H_m(f) := \mathrm{span}\{g(R^o_0(f)), \dots, g(R^o_{m-1}(f))\},$$
$$G^o_m(f) := G^o_m(f, D) := P_{H_m}(f),$$
$$f^o_m := R^o_m(f) := R^o_m(f, D) := f - G^o_m(f).$$
We remark that for each $f$ we have
$$\|f^o_m\| \le \|f^o_{m-1} - G_1(f^o_{m-1}, D)\|. \tag{2.3}$$
In Section 1.5 we realized that the Weak Greedy Algorithms with regard to
bases work as well as the corresponding Greedy Algorithms. In this chapter
we study similar modifications of the Pure Greedy Algorithm (PGA) and
the Orthogonal Greedy Algorithm (OGA), which we call, respectively, the
Weak Greedy Algorithm (WGA) and the Weak Orthogonal Greedy Algo-
rithm (WOGA). We now give the corresponding definitions from Temlyakov
(2000b). Let a sequence τ = {tk }∞k=1 , 0 ≤ tk ≤ 1, be given.
(2)
f mτ := f m−1
τ τ
−
f m−1 , ϕmτ ϕmτ .
(3)
m
G τm ( f, D) := τ
f j−1 , ϕ τj ϕ τj .
j=1
f mτ
2 =
f m−1
τ
2 − |cm ( f )|2 .
We prove convergence of greedy expansion (see, for example, Theorem 2.4
below), and therefore, from the above equality, we get for this expansion an
analog of the Parseval formula for orthogonal expansions:
∞
f
2 = |c j ( f )|2 .
j=1
m ( f, D) := PHmτ ( f ),
G o,τ where Hmτ := span(ϕ1o,τ , . . . , ϕmo,τ ).
(3)
f mo,τ := f − G o,τ
m ( f, D).
82 Greedy approximation in Hilbert spaces
It is clear that G τm and G o,τ
m in the case tk = 1, k = 1, 2, . . . , coincide with
the PGA G m and the OGA G om , respectively. It is also clear that the WGA and
the WOGA are more ready for implementaion than the PGA and the OGA. The
WOGA has the same greedy step as the WGA and differs in the construction of
a linear combination of ϕ1 , . . . , ϕm . In the WOGA we do our best to construct
an approximant out of Hm := span(ϕ1 , . . . , ϕm ): we take an orthogonal pro-
jection onto Hm . Clearly, in this way we lose a property of the WGA to build
an expansion into a series in the case of the WOGA. However, this modifica-
tion pays off in the sense of improving the convergence rate of approximation.
To see this, compare Theorems 2.18 and 2.19.
There is one more greedy-type algorithm that works well for functions from
the convex hull of D± , where D± := {±g, g ∈ D}.
For a general dictionary D we define the class of functions
Ao1 (D, M) := f ∈ H : f = ck wk , wk ∈ D, # < ∞, |ck | ≤ M ,
k∈ k∈
| f |A1 ( D )
a(g) =
f m−1 , g − G m−1
g − G m−1
−2 .
Next, it is not difficult to derive from the definition of A1 (D) and from our
assumption on the existence of a maximizer that, for any h ∈ H and u ∈
A1 (D), there exists g ∈ D± such that
h, g ≥
h, u. (2.4)
f m−1 , gm − G m−1 ≥
f m−1 , f − G m−1 =
f m−1
2 . (2.5)
f m−1 , ϕm − G m−1 ≥ tm
f m−1
2 . (2.7)
84 Greedy approximation in Hilbert spaces
(2)
G m := G m ( f, D) := (1 − βm )G m−1 + βm ϕm ,
−1
m
βm := tm 1 + tk2 for m ≥ 1.
k=1
(3)
f m := f − G m .
2.2 Convergence
We begin this section with convergence of the Weak Orthogonal Greedy Algo-
rithm (WOGA). The following theorem was proved in Temlyakov (2000b).
Theorem 2.1 Assume
∞
tk2 = ∞. (2.8)
k=1
Then, for any dictionary D and any f ∈ H , we have for the WOGA
lim
f mo,τ
= 0. (2.9)
m→∞
Proof of Theorem 2.1 Let f ∈ H and let ϕ1o,τ , ϕ2o,τ , . . . be as given in the
definition of the WOGA. Let
Hn := Hnτ = span(ϕ1o,τ , . . . , ϕno,τ ).
It is clear that Hn ⊆ Hn+1 , and therefore {PHn ( f )} converges to some function
v. The following Lemma 2.3 says that v = f and completes the proof of
Theorem 2.1.
sup |
u, g| ≥ 2δ
g∈D
2.2 Convergence 85
with some δ > 0. Therefore, there exists N such that, for all m ≥ N , we have
sup |
f mτ , g| ≥ δ.
g∈D
o,τ 2 o,τ
m
f mo,τ
2 ≤
f m−1
− tm2 (sup |
f m−1 , g|)2 ≤
f No,τ
2 − δ 2 tk2 ,
g∈D k=N +1
2
which contradicts the divergence of k tk .
Theorem 2.1 and Remark 2.2 show that (2.8) is a necessary and sufficient
condition on the weakness sequence τ = {tk } in order that the WOGA con-
verges for each f and all D. Condition (2.8) can be rewritten as τ ∈ / 2 . It
turns out that the convergence of the PGA is more delicate. We now proceed
to the corresponding results. The following theorem gives a criterion of con-
vergence in a special case of monotone weakness sequences {tk }. Sufficiency
was proved in Temlyakov (2000b) and necessity in Livshitz and Temlyakov
(2001).
is necessary and sufficient for convergence of the Weak Greedy Algorithm for
each f and all Hilbert spaces H and dictionaries D.
Remark 2.5 We note that the sufficiency part of Theorem 2.4 (see Temlyakov
(2000b)) does not need the monotonicity of τ .
86 Greedy approximation in Hilbert spaces
Proof of sufficiency condition in Theorem 2.4 This proof (see Temlyakov
(2000b)) is a refinement of the original proof of Jones (1987). The follow-
ing lemma, Lemma 2.6, combined with Lemma 2.3, implies sufficiency in
Theorem 2.4.
Proof of Lemma 2.6 It is easy to derive from the definition of the WGA the
following two relations:
m
f mτ = f − τ
f j−1 , ϕ τj ϕ τj , (2.11)
j=1
m
f mτ
2 =
f
2 − τ
|
f j−1 , ϕ τj |2 . (2.12)
j=1
∞
a 2j ≤
f
2 . (2.13)
j=1
f nτ − f mτ
2 =
f nτ
2 −
f mτ
2 − 2
f nτ − f mτ , f mτ .
Let
τ
θn,m := |
f nτ − f mτ , f mτ |.
Using (2.11) and the definition of the WGA, we obtain, for all n < m and all
m such that tm+1 = 0,
m
am+1
m+1
τ τ
θn,m ≤ |
f j−1 , ϕ τj ||
f mτ , ϕ τj | ≤ aj. (2.14)
tm+1
j=n+1 j=1
We shall prove that this series converges. It is clear that convergence of this
series, together with the assumption ∞ k=1 tk /k = ∞, imply the statement of
Lemma 2.7.
n
We use the following known fact. If {y j }∞
j=1 ∈ 2 then {n
−1 ∞
j=1 y j }n=1 ∈
2 (see Zygmund (1959), chap. 1, sect. 9). By the Cauchy inequality, we have
⎛ ⎞2 ⎞1/2
∞ 1/2 ∞ ⎛
tn yn n ⎜ n
⎟
yj ≤ yn2 ⎝ ⎝n −1 y j ⎠ ⎠ < ∞.
n tn
n∈P(τ ) j=1 n=1 n=1 j=1
Then {xn }∞
n=1 converges.
lim
f mτ
= 0.
m→∞
and
∞
qs
2−s xk2 < ∞, (2.17)
s=1 k=1
Remark 2.11 It is clear from this definition that if x ∈ V and for some N and
c we have 0 ≤ yk ≤ cxk , k ≥ N , then y ∈ V.
Theorem 2.12 solves the problem of convergence of the WGA in a very gen-
eral situation. The sufficiency part of Theorem 2.12 guarantees that, whenever
τ ∈ / V, the WGA converges for each f and all D. The necessity part of
Theorem 2.12 states that, if τ ∈ V, then there exist an element f and a
dictionary D such that some realization G τm ( f, D) of the WGA does not con-
verge to f . However, Theorem 2.12 leaves open the following interesting and
important problem. Let a dictionary D ⊂ H be given. Find necessary and suf-
ficient conditions on a weakness sequence τ in order that G τm ( f, D) → f for
each f ∈ H . The corresponding open problems for special dictionaries are
formulated in Temlyakov (2003a), pp. 78 and 81. They concern the following
two classical dictionaries:
2 := {u(x)v(y), u, v ∈ L 2 ([0, 1]),
u
2 =
v
2 = 1}
and
R2 := {g(x) = r (ω · x),
g
2 = 1},
where r is a univariate function and ω · x is the scalar product of x,
x
2 ≤ 1,
and a unit vector ω ∈ R2 .
If f ∈ Ao1 (D), then f = j c j g j , for some g j ∈ D and with j |c j | ≤ 1.
Since the functions g j all have norm one, it follows that
f
≤ |c j |
g j
≤ 1.
j
Theorem 2.15 For the Relaxed Greedy Algorithm we have, for each f ∈
A1 (D), the estimate
2
f − G rm ( f )
≤ √ , m ≥ 1. (2.22)
m
Proof We use the abbreviation rm := G rm ( f ) and gm := g(Rm−1
r ( f )). From
the definition of rm , we have
2 1
f − rm
2 =
f − rm−1
2 +
f − rm−1 , rm−1 − gm + 2
rm−1 − gm
2 .
m m
(2.23)
The last term on the right-hand side of (2.23) does not exceed 4/m 2 . For the
middle term, we have
f − rm−1 , rm−1 − gm = inf
f − rm−1 , rm−1 − g
g∈D ±
= inf
f − rm−1 , rm−1 − φ
φ∈A1 (D )
≤
f − rm−1 , rm−1 − f = −
f − rm−1
2 .
We substitute this into (2.23) to obtain
2 4
f − rm
2 ≤ 1 −
f − rm−1
2 + 2 . (2.24)
m m
Thus the theorem follows from Lemma 2.14 with A = 4 and am :=
f − rm
2 .
2.3 Rate of convergence 91
We now turn our discussion to the approximation properties of the Pure
Greedy Algorithm and the Orthogonal Greedy Algorithm.
We shall need the following simple known lemma (see, for example, DeVore
and Temlyakov (1996)).
Lemma 2.16 Let {am }∞
m=1 be a sequence of non-negative numbers satisfying
the inequalities
a1 ≤ A, am+1 ≤ am (1 − am /A), m = 1, 2, . . .
Then we have, for each m,
am ≤ A/m.
Proof The proof is by induction on m. For m = 1 the statement is true by
assumption. We assume am ≤ A/m and prove that am+1 ≤ A/(m + 1). If
am+1 = 0 this statement is obvious. Assume therefore that am+1 > 0. Then
we have
−1 −1
am+1 ≥ am (1 − am /A)−1 ≥ am
−1 −1
(1 + am /A) = am + A−1 ≥ (m + 1)A−1 ,
which implies am+1 ≤ A/(m + 1).
We now want to estimate the decrease in error provided by one step of the
Pure Greedy Algorithm. Let D be an arbitrary dictionary. If f ∈ H and
ρ( f ) :=
f, g( f )/
f
, (2.25)
where as before g( f ) ∈ D± satisfies
f, g( f ) = sup
f, g,
g∈D ±
then
R( f )2 =
f − G( f )
2 =
f
2 (1 − ρ( f )2 ). (2.26)
The larger ρ( f ), the better the decrease of the error in the step of the Pure
Greedy Algorithm. The following lemma estimates ρ( f ) from below.
Lemma 2.17 If f ∈ A1 (D, M), then
ρ( f ) ≥
f
/M. (2.27)
Proof It is sufficient to prove (2.27) for f ∈ Ao1 (D, M) since the general result
follows from this by taking limits. We can write f = ck gk , where this sum
has a finite number of terms and gk ∈ D and |ck | ≤ M. Hence,
f
2 =
f, f =
f, ck gk = ck
f, gk ≤ Mρ( f )
f
and (2.27) follows.
92 Greedy approximation in Hilbert spaces
The following theorem was proved in DeVore and Temlyakov (1996).
Theorem 2.18 Let D be an arbitrary dictionary in H . Then, for each f ∈
A1 (D, M), we have
f − G m ( f, D)
≤ Mm −1/6 .
Proof It is enough to prove the theorem for f ∈ A1 (D, 1); the general result
then follows by rescaling. We shall use the abbreviated notation f m := Rm ( f )
for the residual. Let
am :=
f m
2 =
f − G m ( f, D)
2 , m = 0, 1, . . . , f 0 := f,
and define the sequence {bm }∞
m=0 by
b0 := 1, bm+1 := bm + ρ( f m )
f m
, m = 0, 1, . . .
Since f m+1 := f m − ρ( f m )
f m
g( f m ), we obtain by induction that
f m ∈ A1 (D, bm ), m = 0, 1, . . . ,
and consequently we have the following relations for m = 0, 1, . . . :
am+1 = am (1 − ρ( f m )2 ), (2.28)
1/2
bm+1 = bm + ρ( f m )am , (2.29)
1/2 −1
ρ( f m ) ≥ am bm . (2.30)
Equations (2.29) and (2.30) yield
−1
1/2
bm+1 = bm (1 + ρ( f m )am bm ) ≤ bm (1 + ρ( f m )2 ). (2.31)
Combining this inequality with (2.28) we find
am+1 bm+1 ≤ am bm (1 − ρ( f m )4 ),
which in turn implies, for all m,
am bm ≤ a0 b0 =
f
2 ≤ 1. (2.32)
Further, using (2.28) and (2.30) we get
am+1 = am (1 − ρ( f m )2 ) ≤ am (1 − am /bm
2
).
Since bn ≤ bn+1 , this gives
−2
an+1 bn+1 ≤ an bn−2 (1 − an bn−2 ).
−2 } we obtain
Applying Lemma 2.16 to the sequence {am bm
−2
am bm ≤ m −1 . (2.33)
2.3 Rate of convergence 93
Relations (2.32) and (2.33) imply
−2
3
am = (am bm )2 am bm ≤ m −1 .
In other words,
f m
= am ≤ m −1/6 ,
1/2
The next theorem (DeVore and Temlyakov (1996)) estimates the error in
approximation by the Orthogonal Greedy Algorithm.
f − G om ( f, D) ≤ Mm −1/2 .
Proof The proof of this theorem is similar to the proof of Theorem 2.18, but
is technically even simpler. We can again assume that M = 1. We let f mo :=
Rmo ( f ) be the residual in the Orthogonal Greedy Algorithm. Then, from the
f m+1
o
≤
f mo − G 1 ( f mo , D)
. (2.34)
f m+1
o
2 ≤
f mo
2 (1 − ρ( f mo )2 ). (2.35)
f mo
2 =
f mo , f ≤ ρ( f mo )
f mo
.
Hence,
ρ( f mo ) ≥
f mo
.
f m+1
o
2 ≤
f mo
2 (1 −
f mo
2 ).
In order to complete the proof it remains to apply Lemma 2.16 with A = 1 and
am =
f mo
2 .
94 Greedy approximation in Hilbert spaces
2.3.2 Upper estimates for weak-type greedy algorithms
We begin this subsection with an error estimate for the Weak Orthogonal
Greedy Algorithm. The following theorem from Temlyakov (2000b) is a
generalization of Theorem 2.19.
where the supremum is taken over all dictionaries D, all elements f ∈ A1 (D)\
{0} and all possible choices of {G m }.
Recently, the known upper bounds in approximation by the Pure Greedy
Algorithm were improved in Sil’nichenko (2004), who proved the estimate
γ (m) ≤ Cm −s/2(2+s) ,
where s is a solution from [1, 1.5] of the equation
1/(2+x) 2 + x 1+x
(1 + x) − = 0.
1+x x
96 Greedy approximation in Hilbert spaces
Numerical calculations of s (see Sil’nichenko (2004)) give
s
= 0.182 · · · > 11/62.
2(2 + s)
The technique used in Sil’nichenko (2004) is a further development of a
method from Konyagin and Temlyakov (1999b).
There is also some progress in the lower estimates. The estimate
γ (m) ≥ Cm −0.27 ,
with a positive constant C, was proved in Livshitz and Temlyakov (2003). For
previous lower estimates see Temlyakov (2003a), p. 59. Very recently, Livshitz
(2009), using the technique from Livshitz and Temlyakov (2003), proved the
following lower estimate:
We mentioned above that the PGA and its generalization the Weak Greedy
Algorithm (WGA) give, for every element f ∈ H , a convergent expansion
in a series with respect to a dictionary D. We discuss a further generalization
of the WGA that also provides a convergent expansion. We consider here a
generalization of the WGA obtained by introducing to it a tuning parameter
b ∈ (0, 1] (see Temlyakov (2007a)). Let a sequence τ = {tk }∞ k=1 , 0 ≤ tk ≤ 1,
and a parameter b ∈ (0, 1] be given. We define the Weak Greedy Algorithm
with parameter b as follows.
Weak Greedy Algorithm with parameter b (WGA(b)) We define f 0τ,b :=
f . Then, for each m ≥ 1 we have the following inductive definition.
(2)
τ,b τ,b
f mτ,b := f m−1 − b
f m−1 , ϕmτ,b ϕmτ,b .
(3)
m
τ,b
G τ,b
m ( f, D) := b
f j−1 , ϕ τ,b τ,b
j ϕ j .
j=1
This theorem is an extension of the corresponding result for the WGA (see
Theorem 2.22). In the particular case tk = 1, k = 1, 2, . . . , we get the
following rate of convergence:
−r (b) 2−b
m ( f, D)
≤ Cm
f − G 1,b , r (b) := .
2(4 − b)
We note that r (1) = 1/6 and r (b) → 1/4 as b → 0. Thus we can offer
the following observation. At each step of the Pure Greedy Algorithm we can
choose a fixed fraction of the optimal coefficient for that step instead of the
optimal coefficient itself. Surprisingly, this leads to better upper estimates than
those known for the Pure Greedy Algorithm.
f n 2 ≤ d( f, A1 (G))2H + C(θ )( f + C0 )2 n −1 .
f n = f − G n = (1 − α) f n−1 + α( f − ϕn )
and
sup
h, g = sup
h, φ. (2.44)
g∈G ± φ∈A1 (G )
This implies
f n 2 − f ∗ 2 ≤ (1 − α)( f n−1 2 − f ∗ 2 ) + α 2 ( f + C0 )2 .
Theorem 2.29 For θ > 1/2 there exists a constant C := C(θ, C0 ) such that,
for any f ∈ H , we have
f − G n ( f ) 2 ≤ C/n.
Proof If f ∈ A1 (G) then the statement of Theorem 2.29 follows from known
results (see Barron (1993) and Theorem 2.15). If d( f, A1 (G)) > 0, then the
property (2.42) implies that, for any φ ∈ A1 (G), we have
f ∗ , φ − f ≤ 0. (2.45)
f n−1 , f ≤ max
f n−1 , g. (2.49)
g∈G ±
max
f n−1 , g =
f n−1 , ϕn . (2.50)
g∈G ±
f ∗ , ϕn − f ≤ 0. (2.51)
f − G m ( f, D)
≥ m −1/2 , m ≥ 4.
It is clear that σm ( f, D) = 0 for m ≥ 2. This example of a dictionary shows
that in general we cannot improve on an m −1/2 rate of approximation by the
Pure Greedy Algorithm, even if we impose extremely tough restrictions on
σm ( f, D). We call this phenomenon a saturation property.
We give here a sufficient condition on D to have the property formulated in
Problem 2.30. We consider dictionaries which we call λ-quasi-orthogonal.
Definition 2.31 We say D is a λ-quasi-orthogonal dictionary if, for any n ∈ N
and any gi ∈ D, i = 1, . . . , n, there exists a collection ϕ j ∈ D, j = 1, . . . , M,
M ≤ N := λn, with the properties
gi ∈ X M := span(ϕ1 , . . . , ϕ M ); (2.55)
and, for any f ∈ X M , we have
max |
f, ϕ j | ≥ N −1/2
f
. (2.56)
1≤ j≤M
102 Greedy approximation in Hilbert spaces
Remark 2.32 It is clear that an orthonormal dictionary is a 1-quasi-orthogonal
dictionary.
We shall prove the following theorem and its slight generalization on an
asymptotically λ-quasi-orthogonal dictionary. Examples of asymptotically λ-
quasi-orthogonal dictionaries are also given in this section.
Theorem 2.33 Let a given dictionary D be λ-quasi-orthogonal and let
0 < r < (2λ)−1 be a real number. Then, for any f such that
σm ( f, D) ≤ m −r , m = 1, 2, . . . ,
we have
f − G m ( f, D)
≤ C(r, λ)m −r , m = 1, 2, . . .
We will make a comment about this theorem before proceeding to the proof.
In Section 2.3 we formulated an open problem about the rate of convergence of
the PGA with respect to a general dictionary D for elements of A1 (D). Theo-
rem 2.33 allows us to give an example of a collection of dictionaries for which
the behavior of the PGA is better than its behavior for a general dictionary.
Theorem 2.19 guarantees that
σm ( f, D) ≤ m −1/2 , f ∈ A1 (D).
Therefore, for any λ-quasi-orthogonal dictionary we have, for any r < (2λ)−1 ,
f − G m ( f, D)
≤ C(r, λ)m −r , f ∈ A1 (D).
For example, if λ < 2.5 then the rate of convergence of the PGA with respect
to the λ-quasi-orthogonal dictionary is better than m −0.2 , which, in turn, is
better than the known lower bound for the rate of convergence of the PGA for
general dictionaries.
We now proceed to the proof of Theorem 2.33. We begin with a numerical
lemma.
Lemma 2.34 Let three positive numbers α < γ ≤ 1, A > 1 be given and
let a sequence of positive numbers 1 ≥ a1 ≥ a2 ≥ · · · satisfy the following
condition: If, for some ν ∈ N we have
aν ≥ Aν −α ,
then
aν+1 ≤ aν (1 − γ /ν). (2.57)
Then there exists B = B(A, α, γ ) such that, for all n = 1, 2, . . . , we have
an ≤ Bn −α .
2.5 λ-quasi-orthogonal dictionaries 103
Proof We have a1 ≤ 1 < A, which implies that the set
V := {ν : aν ≥ Aν −α }
does not contain ν = 1. We now prove that for any segment [n, n + k] ⊂ V we
have k ≤ C(α, γ )n. Indeed, let n ≥ 2 be such that n − 1 ∈
/ V , which means
n+k−1
n+k−1
an+k ≤ an (1 − γ /ν) ≤ an−1 (1 − γ /ν). (2.60)
ν=n ν=n
n+k−1
(n + k)−α ≤ (n − 1)−α (1 − γ /ν). (2.61)
ν=n
m−1
m
−1
ν ≥ x −1 d x = ln(m/n),
ν=n n
n+k
n+k−1
n+k−1
n+k
−α ln ≤ ln(1 − γ /ν) ≤ − γ /ν ≤ −γ ln .
n−1 ν=n ν=n
n
Hence,
n
(γ − α) ln(n + k) ≤ (γ − α) ln n + α ln ,
n−1
which implies
n + k ≤ 2α/(γ −α) n
and
k ≤ C(α, γ )n.
104 Greedy approximation in Hilbert spaces
Let us take any μ ∈ N. If μ ∈/ V we have the desired inequality with B = A.
Assume μ ∈ V , and let [n, n + k] be the maximal segment in V containing μ.
Then
−α
−α −α n − 1
aμ ≤ an−1 ≤ A(n − 1) = Aμ . (2.62)
μ
Using the inequality k ≤ C(α, γ )n proved above we get
μ n+k
≤ ≤ C1 (α, γ ). (2.63)
n−1 n−1
Substituting (2.63) into (2.62) we complete the proof of Lemma 2.34 with
B = AC1 (α, γ )α .
Proof of Theorem 2.33 Let ν(r, λ) be such that for ν > ν(r, λ) we have
Take two positive numbers C ≥ ν(r, λ)r and κ which will be chosen later.
We consider the sequence aν := 1 for ν < ν(r, λ) and aν :=
f ν
2 , ν ≥
ν(r, λ), where
f ν := f − G ν ( f, D).
aν(r,λ) := f ν(r,λ) 2 ≤ f 1 2 ≤ 1.
Let us assume that for some ν we have aν ≥ C 2 ν −2r . We want to prove that
for those same ν we have
aν+1 ≤ aν (1 − γ /ν)
with some γ > 2r . We shall specify the numbers C and κ in this proof. The
assumptions C ≥ ν(r, λ)r and aν ≥ C 2 ν −2r imply ν ≥ ν(r, λ) and
f ν
≥
Cν −r , or
ν −r ≤ C −1
f ν
. (2.64)
u 2 = f ν 2 − v 2 ≥ f ν 2 (1 − (Cκ r )−2 ).
sup |
f ν , g| ≥ max |
f ν , ϕ j | = max |
u, ϕ j | ≥ N −1/2
u
.
g∈D 1≤ j≤M 1≤ j≤M
Hence,
u
2
f ν+1
2 ≤
f ν
2 − ≤
f ν
2 (1 − (1 − (Cκ r )−2 )(λ([(1 + κ)ν] + 1))−1 ).
N
It is clear that taking a small enough κ > 0 and a sufficiently large C we can
make for ν ≥ ν(r, λ)
≤ C(r, λ)n −r ,
1/2
f n
= an n = 1, 2, . . . ,
max |
f, ϕ j | ≥ N (n)−1/2
f
. (2.67)
1≤ j≤M
σm ( f, D) ≤ m −r , m = 1, 2, . . . ,
we have
f − G m ( f, D) ≤ C(r, λ, D)m −r , m = 1, 2, . . .
Proof In the proof of this theorem we use the following Lemma 2.37 instead
of Lemma 2.34.
aν ≥ Aν −α ,
then
aν+1 ≤ aν (1 − γ /ν).
an ≤ Bn −α .
Proposition 2.38 Let a system {ϕ1 , . . . , ϕ M } and its linear span X M satisfy
(2.56). If M = N and dim X M = N , then {ϕ j } Nj=1 is an orthonormal system.
u j := v j /
v j
, j = 1, 2, . . . N ,
2.5 λ-quasi-orthogonal dictionaries 107
and form a vector
N
−1/2
yt := N ri (t)u i ,
i=1
where the ri (t) are the Rademacher functions. Then for all j = 1, 2, . . . , N
and t ∈ [0, 1] we have
|
yt , ϕ j | = N −1/2 |
u j , ϕ j | ≤ N −1/2 ; (2.68)
and
N
yt
2 = N −1
u i , u i + N −1 ri (t)r j (t)
u i , u j
i=1 i= j
= 1 + 2N −1 ri (t)r j (t)
u i , u j .
1≤i< j≤N
This inequality implies that for some t∗ we have
yt ∗
> 1 and by (2.68) for
this t ∗ we get, for all 1 ≤ j ≤ N ,
|
yt ∗ , ϕ j | < N −1/2
yt ∗
,
which contradicts (2.56).
Definition 2.39 For given μ, γ ≥ 1, a dictionary D is called (μ, γ )-semistable
if, for any gi ∈ D, i = 1, . . . , n, there exist elements h j ∈ D, j = 1, . . . , M ≤
μn, such that
gi ∈ span{h 1 , . . . , h M }
and, for any c1 , . . . , c M , we have
M
M 1/2
c j h j
≥ γ −1/2 c2j . (2.69)
j=1 j=1
M
M 1/2
≤ sup ajcj = γ 1/2
a 2j .
(c1 ,...,c M )
≤γ 1/2 j=1 j=1
Then
f, h j = a j ( f ).
Inequality (2.70) implies
max |a j ( f )| ≥ (γ M)−1/2
f
≥ (γ μn)−1/2
f
.
1≤ j≤M
where
n (D) := f : f = ci gi , gi ∈ D, || = n .
i∈
We shall prove that the dictionary χ from Example 2.41 does not have the
saturation property.
Theorem 2.44 For any f ∈ n (χ ) we have
m/2
1
f − G m ( f, χ)
≤ 1 −
f
.
2n + 1
Proof We prove a variant of Theorem 2.44 for functions of the form
n
f = c j gI j , ∪nj=1 I j = [0, 1), g J := |J |−1/2 χ J , (2.71)
j=1
F(x, x) = 0, 0 ≤ x ≤ 1,
is continuous on Y := {(x, y) : 0 ≤ x ≤ y ≤ 1} for any f of the form (2.71).
This implies the existence of J ∗ such that
|
f, g J ∗ | = max |
f, g J |. (2.73)
J
2.6 Lebesgue-type inequalities 111
Clearly, |
f, g J ∗ | > 0 if f is non-trivial. We complete the proof by contradic-
tion. Assume J ∗ = [a, t) and, for instance, t is an interior point of Is = [b, d).
Apply Lemma 2.46 with I 1 = [a, b), I 2 = [b, d), J = J ∗ . We get strict
inequality, which contradicts (2.73). Hence, t is an endpoint of one of the
intervals I j . The same argument proves that a is also an endpoint of one of
the intervals I j . This completes the proof of Lemma 2.47.
Lemma 2.47 implies that, for f of the form (2.71), all the residuals of the
PGA R j ( f ) are also of the form (2.71). Next, for f of the form (2.71) we have
max |
f, g J | ≥ max |
f, g I j | ≥ n −1/2
f
.
J Ij
Consequently,
Rm ( f )
2 ≤ (1 − 1/n)
Rm−1 ( f )
2 ≤ · · · ≤ (1 − 1/n)m
f
2 ,
which completes the proof of Lemma 2.45.
The statement of Theorem 2.44 follows from Lemmas 2.45 and 2.42.
where the cg are real or complex numbers. In some cases, it may be possible
to write an element from m (D) in this form in more than one way. The space
m (D) is not linear: the sum of two functions from m (D) is generally not in
m (D).
For a function f ∈ H we define its best m-term approximation error
σm ( f ) := σm ( f, D) := inf
f − s
.
s∈m (D )
with
A := (33/89)1/2 and a := (23/11)1/2 .
Then
g
= 1. We define the dictionary D = B ∪ {g}. It has been proved in
DeVore and Temlyakov (1996) (see Section 2.7) that, for the function
f = h1 + h2,
we have
f − G m ( f, D)
≥ m −1/2 , m ≥ 4.
It is clear that σ2 ( f, D) = 0.
Therefore, we look for conditions on a dictionary D that allow us to prove
the Lebesgue-type inequalities. The condition D = B, an orthonormal basis
for H , guarantees that
Rm ( f, B)
= σm ( f, B).
2.6 Lebesgue-type inequalities 113
This is an ideal situation. The results that we will discuss here concern the case
when we replace an orthonormal basis B by a dictionary that is, in a certain
sense, not far from an orthonormal basis.
Let us begin with results from Donoho, Elad and Temlyakov (2007) that are
close to the results from Temlyakov (1999) discussed in Section 2.5. We give a
definition of a λ-quasi-orthogonal dictionary with depth D. In the case D = ∞
this definition coincides with the definition of a λ-quasi-orthogonal dictionary
from Temlyakov (1999) (see Definition 2.31 above).
Definition 2.48 We say that D is a λ-quasi-orthogonal dictionary with depth
D if, for any n ∈ [1, D] and any gi ∈ D, i = 1, . . . , n, there exists a collection
ϕ j ∈ D, j = 1, . . . , J , J ≤ N := λn, with the properties
gi ∈ X J := span(ϕ1 , . . . , ϕ J ), i = 1, . . . , n,
and for any f ∈ X J we have
max |
f, ϕ j | ≥ N −1/2
f
.
1≤ j≤J
It is pointed out in Donoho, Elad and Temlyakov (2007) that the proof of
Theorem 1.1 from Temlyakov (1999) (see Theorem 2.33 above) also works in
the case D < ∞ and gives the following result.
Theorem 2.49 Let a given dictionary D be λ-quasi-orthogonal with depth D,
and let 0 < r < (2λ)−1 be a real number. Then, for any f such that
σm ( f, D) ≤ m −r , m = 1, 2, . . . , D,
we have
f m
=
f − G m ( f, D)
≤ C(r, λ)m −r , m ∈ [1, D/2].
In this section we consider dictionaries that have become popular in signal
processing. Denote by
M(D) := sup |
g, h|
g=h;g,h∈D
f mo
2 ≤
f m−1
o
2 − |
f m−1
o
, g( f m−1
o
)|2 .
We noted in Donoho, Elad and Temlyakov (2007) that the use of this inequality
instead of the equality
f m
2 =
f m−1
2 − |
f m−1 , g( f m−1 )|2 ,
which holds for the PGA, allows us to prove an analog of Theorem 2.49 for
the OGA. The proof repeats the corresponding proof from Temlyakov (1999)
(see the proof of Theorem 2.33 above). We formulate this as a remark.
Remark 2.50 Theorem 2.49 holds for the OGA instead of the PGA (for
f mo
instead of
f m
).
The first general Lebesgue-type inequality for the OGA for the M-coherent
dictionary was obtained in Gilbert, Muthukrishnan and Strauss (2003). They
proved that
f mo
≤ 8m 1/2 σm ( f ) for m < 1/(32M).
f So
2 ≤ 2
f ko
(σ S−k ( f ko ) + 3M S
f ko
), 0 ≤ k ≤ S; (2.76)
f S
2 ≤ 2
f
(σ S ( f ) + 5M S
f
).
f So
2 ≤ σ S−k ( f ko )2 + 5M S
f ko
2 , 0 ≤ k ≤ S; (2.77)
f S
2 ≤ σ S ( f )2 + 7M S
f
2 . (2.78)
The inequality (2.76) can be used for improving (2.75) for small m. The
following inequalities were proved in Donoho, Elad and Temlyakov (2007).
2.6 Lebesgue-type inequalities 115
Theorem 2.53 Let a dictionary D have the mutual coherence M = M(D).
Assume m ≤ 0.05M −2/3 . Then, for l ≥ 1 satisfying 2l ≤ log m, we have
−l
f m(2
o
l −1)
≤ 6m
2
σm ( f ).
Corollary 2.54 Let a dictionary D have the mutual coherence M = M(D).
Assume m ≤ 0.05M −2/3 . Then we have
f [m
o
log m]
≤ 24σm ( f ).
It was pointed out in Donoho, Elad and Temlyakov (2007) that the inequality
f [m
o
log m]
≤ 24σm ( f ) from Corollary 2.54 is almost (up to a log m factor)
perfect Lebesgue inequality. However, we are paying a big price for it in the
sense of a strong assumption on m. It was mentioned in Donoho, Elad and
Temlyakov (2007) that it was not known if the assumption m ≤ 0.05M −2/3
can be substantially weakened. It was shown in Temlyakov and Zheltov (2010)
that the use of Theorem 2.52 instead of Theorem 2.51 allows us to weaken
substantially the assumption m ≤ 0.05M −2/3 .
Theorem 2.55 Let a dictionary D have the mutual coherence M = M(D).
For any δ ∈ (0, 1/4] set L(δ) := [1/δ] + 1. Assume m is such that
20Mm 1+δ 2 L(δ) ≤ 1. Then we have
√
f m(2
o
L(δ)+1 −1)
≤ 3σm ( f ).
Very recently, Livshitz (2010) improved the above Lebesgue-type inequality
by proving that
f 2m
o
≤ 3σm ( f )
for m ≤ (20M)−1 . The proof in Livshitz (2010) is different from the proof of
Theorem 2.55 given below; it is much more technically involved.
We now demonstrate the use of the inequality (2.78). The following result
from Temlyakov and Zheltov (2010) is a corollary of (2.78).
Theorem 2.56 Let a dictionary D have the mutual coherence M = M(D).
For any r > 0 and δ ∈ (0, 1] set L(r, δ) := [2r/δ] + 1. Let f be such that
σm ( f ) ≤ m −r
f
, m ≤ 2−L(r,δ) (14M)−1/(1+δ) .
Then, for all n such that n ≤ (14M)−1/(1+δ) , we have
f n
≤ C(r, δ)n −r
f
.
The classical Lebesgue inequality is a multiplicative inequality, where the
quality of an approximation method is measured by the growth of an extra fac-
tor. It turns out that multiplicative Lebesgue-type inequalities are rather rare
116 Greedy approximation in Hilbert spaces
in nonlinear approximation. The above discussed example from DeVore and
Temlyakov (1996) shows that, even for a simple dictionary that differs from
an orthonormal basis by one element, there is no multiplicative Lebesgue-
type inequality for the Pure Greedy Algorithm. Theorems 2.51 and 2.52 are
not multiplicative Lebesgue-type inequalities – on the right-hand side they
have an additive term of the form M S
f ko
2 . Their applications in the proofs
of Theorems 2.55 and 2.56 indicate that these inequalities are useful and
rather powerful. It seems that the additive Lebesgue-type inequalities are an
appropriate tool in nonlinear approximation.
2.6.2 Proofs
We will use the following simple known lemma (see, for example, Donoho,
Elad and Temlyakov (2007)).
Lemma 2.57 Assume a dictionary D has mutual coherence M. Then we have,
for any distinct g j ∈ D, j = 1, . . . , N , and, for any a j , j = 1, . . . , N , the
inequalities
⎛ ⎞ ⎛ ⎞
N N N
⎝ |a j |2 ⎠ (1− M(N −1)) ≤
a j g j
22 ≤ ⎝ |a j |2 ⎠ (1+ M(N −1)).
j=1 j=1 j=1
Proof We have
N
N
a j g j
22 = |a j |2 + ai ā j
gi , g j .
j=1 j=1 i= j
Next,
⎛ ⎞
N
| ai ā j
gi , g j | ≤ M |ai a j | = M ⎝ |ai a j | − |ai |2 ⎠
i= j i= j i, j i=1
⎛ 2 ⎞
N
N
N
= M⎝ |ai | − |ai |2 ⎠ ≤ |ai |2 M(N − 1).
i=1 i=1 i=1
We now proceed to one more technical lemma (see Donoho, Elad and
Temlyakov (2007)).
Lemma 2.58 Suppose that g1 , . . . , g N are such that
gi
= 1, i = 1, . . . , N ;
|
gi , g j | ≤ M, 1 ≤ i = j ≤ N . Let HN := span(g1 , . . . , g N ). Then, for any
f , we have
2.6 Lebesgue-type inequalities 117
N 1/2 N 1/2
|
f, gi | 2
≥ |ci | 2
(1 − M(N − 1)),
i=1 i=1
Proof We have
f − PHN ( f ), gi = 0, i = 1, . . . , N , and therefore
N
N
|
f, gi | = |
PHN ( f ), gi | = | c j
g j , gi | ≥ |ci |(1 + M) − M |c j |.
j=1 j=1
Next, denoting σ := ( Nj=1 |c j |2 )1/2 and using the inequality Nj=1 |c j | ≤
N 1/2 σ , we get
N 1/2
|
f, gi | 2
≥ σ (1 − M(N − 1)).
i=1
We carry out the proof for the OGA and later point out the necessary changes
for the PGA. For k = S the inequality (2.77) is obvious because σ0 ( f So ) =
f So
. Let k ∈ [0, S) be fixed. Assume f ko = 0. Denote by g1 , . . . , g S−k ⊂ D
the elements that have the biggest inner products with f ko :
|
f ko , g1 | ≥ |
f ko , g2 | ≥ · · · ≥ |
f ko , g S−k | ≥ sup |
f ko , g|.
g∈D ,g=gi ,i=1,...,S−k
f mo , gi =
f ko , gi −
PHm ( f ko ), gi . (2.80)
Let
m
PHm ( f ko ) = cjϕj.
j=1
We continue, to obtain
⎛ ⎞1/2
m m
|
PHm ( f ko ), gi | ≤ M |c j | ≤ Mm 1/2 ⎝ |c j |2 ⎠
j=1 j=1
−1/2
≤ MS 1/2
f ko
(1 − M S) . (2.81)
d( f mo ) ≥ |
f mo , gi | ≥ |
f ko , gi |−M S 1/2
f ko
(1−M S)−1/2 , i ∈ [1, m+1−k],
2.6 Lebesgue-type inequalities 119
and, using the inequality |
f ko , gi | ≥ |
f ko , gm+1−k | that follows from the
definition of {g j }, we obtain
d( f mo ) ≥ |
f ko , gm+1−k | − M S 1/2
f ko
(1 − M S)−1/2 .
Therefore,
k+s−1 1/2 s 1/2
d( f vo )2 ≥ (|
f ko , gi | − M S 1/2
f ko
(1 − M S)−1/2 )2
v=k i=1
s 1/2
≥ |
f ko , gi |2 − M S
f ko
(1 − M S)−1/2 .
i=1
(2.82)
Next, let
n
σ S−k ( f ko ) =
f ko − PH (n) ( f ko )
, PH (n) ( f ko ) = bjψj, n ≤ s,
j=1
PH (n) ( f ko ) 2 = f ko 2 − σ S−k ( f ko )2
By Lemma 2.58,
⎛ ⎞
n
n
|
f ko , ψ j |2 ≥ ⎝ |b j |2 ⎠ (1 − M S)2 . (2.84)
j=1 j=1
v=k
1 − MS
× − M S
f ko
(1 − M S)−1/2 . (2.85)
(1 + M S)1/2
120 Greedy approximation in Hilbert spaces
Let M S ≤ 1/2. Denote x := M S. We use the following simple inequalities:
1−x 3
≥ 1 − x, x ≤ 1/2,
(1 + x)1/2 2
and
(1 − x)−1/2 ≤ 1 + x, x ∈ [0, 1/2].
Next, we use the following inequality for 0 ≤ B ≤ A, a, b ≥ 0, a 2 −2b+1 ≥ 0:
((A2 − B 2 )1/2 (1 − ax) − x(1 + bx)A)2
≥ (A2 − B 2 )(1 − ax)2 − 2x(1 + bx)A2 + x 2 (1 + bx)2 A2
≥ A2 − B 2 − (2a + 2)x A2 . (2.86)
Using (2.86) with A :=
f ko
, B := σ S−k ( f ko ), a = 3/2, b = 1, we obtain
k+s−1 1/2
d( f v )
o 2
≥
f ko
2 − σ S−k ( f ko )2 − 5M S
f ko
2 .
v=k
Thus,
f So
2 ≤ σ S−k ( f ko )2 + 5M S
f ko
2 .
This completes the proof of Theorem 2.52 for the OGA.
A few changes adapt the proof for k = 0 to the PGA setting. As above, we
write
m
f m = f − G m ( f ); G m ( f ) = b j ψ j , ψ j ∈ D,
j=1
and estimate |
f m , gi | with i ∈ [1, m + 1] such that gi = ψ j , j = 1, . . . , m.
Using instead of
PHm ( f )
≤
f
the inequality
G m ( f )
≤
f
+
f m
≤ 2
f
,
we obtain the following analog of (2.85):
s−1 1/2
1 − MS
d( f v )2
≥ (
f
2 −σ S−k ( f )2 )1/2 −2M S
f
(1−M S)−1/2 .
(1 + M S)1/2
v=0
(2.87)
We use the inequality
((A2 − B 2 )1/2 (1 − ax) − 2x(1 + bx)A)2 ≥ A2 − B 2 − (2a + 4)x A2 ,
provided a 2 − 4b + 4 ≥ 0; this is the case for a = 3/2 and b = 1. Therefore,
for the PGA we get
f S
2 ≤ σ S ( f k )2 + 7M S
f
2 . (2.88)
2.6 Lebesgue-type inequalities 121
Proof of Theorem 2.55 We now show how one can combine the inequalities
from Theorem 2.52 with the inequality (2.75). For a given natural number
m, consider a sequence m l := m(2l − 1), l = 1, 2, . . . . We estimate
f mo l
,
l = 1, 2, . . . . For l = 1 we have m 1 = m, and by (2.75) we get
Let δ > 0 be an arbitrary fixed number. Suppose m and L are such that
f m
2 ≤ σm ( f )2 + 7Mm
f
2 . (2.92)
122 Greedy approximation in Hilbert spaces
Applying Theorem 2.52 to f m l−1 with S = m l we get the following analog of
(2.89):
f m l
2 ≤ σm ( f )2 + 7Mm l
f m l−1
2 .
Assuming instead of (2.90) that
7Mm2 L ≤ m −δ /2,
we obtain
L
(7Mm l ) ≤ 2σm ( f )2 +
f
2 m −δL 2−L
2 /2
f m L
2 ≤ 2σm ( f )2 +
f
2 .
l=1
(2.93)
Let r > 0 be a fixed number. Set L(r, δ) := [2r/δ]+1. Let n ≤ (14M)−1/(1+δ)
and let m be the largest natural number such that m2 L(r,δ) ≤ n. Then by (2.93)
we obtain
f n
2 ≤
f m L(r,δ)
2 ≤ 2σm ( f )2 + m −2r
f
2 .
Using the assumption σm ( f ) ≤ m −r
f
we continue:
f
2 ≤ 3m −2r
f
2 ≤ C(r, δ)n −2r
f
2 .
This completes the proof of Theorem 2.56.
with
A := (33/89)1/2 and a := (23/11)1/2 .
Then
g
= 1. We define the dictionary D := B ∪ {g}.
2.7 Saturation property of greedy-type algorithms 123
Theorem 2.60 For the function
f := h 1 + h 2 ,
we have
f − G m ( f )
≥ m −1/2 , m ≥ 4.
Proof We shall examine the steps of the Pure Greedy Algorithm applied to the
function f = h 1 + h 2 . We shall use the abbreviated notation f m := Rm ( f ) :=
f − G m ( f ) for the residual at step m.
First step We have
f, g = 2A > 1, |
f, h k | ≤ 1, k = 1, 2, . . .
This implies
G 1 ( f, D) =
f, gg
and
f 1 = f −
f, gg = (1 − 2A2 )(h 1 + h 2 ) − 2a A2 (k(k + 1))−1/2 h k .
k≥3
f 3 , h k =
f 2 , h k .
This equality and the calculations from the third step show that it is sufficient
to compare
f 3 , h 2 and
f 3 , g . We have
f 3 , g =
f 2 , g −
f 2 , h 1
h 1 , g = −(23/89)(A/2).
|
f m , h m | = 2a A2 (m(m + 1))−1/2
and
|
f m , g| = 2a 2 A3 (k(k + 1))−1 = 2a 2 A3 m −1 .
k≥m
Since
(|
f m , g|/|
f m , h m |)2 = (a A)2 (1 + 1/m) ≤ 345/356 < 1,
we have that
|
f m , g| < |
f m , h m |, m ≥ 4.
with
1/2
1 1
A := − ≥ (1/3)1/2 .
2 54n
Then
1
g
2 = 2A2 + = 1.
27n
Take
3n−1
2
6n−1
f := An −1/2 ϕk + (k(k + 1))−1/2 ϕk .
3
k=1 k=3n
3n−1
G n ( f, D) = u := g + An −1/2 ϕk .
k=2n+1
A2 (2n − 1) 4 1
σn ( f, )2 = + > .
n 54n 27n
(2) Assume that g is among the approximating elements; then we
should estimate
δ := inf σn−1 ( f − ag, )2 .
a
2.7 Saturation property of greedy-type algorithms 127
Denote
∞
gs := (k(k + 1))−1/2 ϕk .
k=s
We have
2n
f − ag = (1 − a)An −1/2 ϕk + An −1/2
k=1
3n−1
(g3n − g6n ) ag6n
× ϕk + (2 − a) − .
3 3
k=2n+1
If |1 − a| ≥ 1 then
1
σn−1 ( f − ag, )2 ≥ (1 − a)2 A2 > .
27n
It remains to consider 0 < a < 2. In this case the n − 1 largest in absolute
value coefficients of f − ag are those of ϕk , k = 2n + 1, . . . , 3n − 1. We
have
It is clear that the right-hand side of (2.99) is greater than or equal to 1/(27n)
for all a, and equals 1/(27n) only for a = 1. This implies that the best n-term
approximant to f with regards to D is unique and coincides with u. This
concludes the first step.
After the first step we get
We have
σn (h s , D)2 = 1/(9(s + n)),
and the best n-term approximant with regard to D is unique and equals
1
s+n−1
vn := ek (k(k + 1))−1/2 ϕk .
3
k=s
128 Greedy approximation in Hilbert spaces
Proof It is easy to verify that
h s − vn 2 = 1/(9(s + n)),
and that vn is the unique best n-term approximant with regard to . We now
prove that for each a we have
2n
s−1
h s − ag = −a An −1/2 ϕk − a/3 (k(k + 1))−1/2 ϕk
k=1 k=3n
∞
1
+ (ek − a)(k(k + 1))−1/2 ϕk .
3
k=s
we conclude that we need to prove the corresponding lower estimate for the
right-hand side of (2.100) for all μ and a. We have
a2 (1 − |a|)2 a2 (1 − |a|)2
e(a, μ)2 ≥ + ≥ + . (2.101)
3 (9(s + μ)) 3 (9(s + n − 1))
We now use the following simple relation, for b, c > 0 we have
bc
inf(a 2 b + (1 − a)2 c) = = c(1 + c/b)−1 . (2.102)
a b+c
Specifying b = 1/3 and c = 1/(9(s + n − 1)), we get for all a and μ
and
Rm
n
( f )
2 = (9n(m + 2))−1 .
It follows from the definition that the n-Greedy Algorithm with regard to D
coincides with the Pure Greedy Algorithm with regard to Dn . Therefore, we
can apply the theory developed for the PGA to the n-Greedy Algorithm. A
typical assumption in that theory is f ∈ A1 (D). We show how this assumption
can be used to prove that f − pn ( f ) ∈ A1 (Dn , Bn −1/2 ) with an absolute
constant B, provided D is M-coherent and Mn ≤ 1/4.
For l = 0 we get
b2j ≤ n(B/n)2 = B 2 n −1 .
j∈J0
For l ≥ 1,
⎛ ⎞1/2 ⎛ ⎞
⎝ b2j ⎠ ≤ n 1/2 |bln+1 | ≤ n 1/2 ⎝n −1 |b j |⎠
j∈Jl j∈Jl−1
and
⎛ ⎞1/2
∞
∞
∞
⎝ b2j ⎠ ≤ n −1/2 |b j | ≤ n −1/2 |b j | ≤ Bn −1/2 .
l=1 j∈Jl l=1 j∈Jl−1 j=1
and
n
f − βi ψi
2 ≤ σn ( f )2 + n −2 .
i=1
2.7 Saturation property of greedy-type algorithms 131
Then there exists a representation
n ∞
f 1 := f − βi ψi = bjϕj, ϕ j ∈ D,
i=1 j=1
such that
∞
|b j | ≤ B/n; |b j | ≤ B, (2.103)
j=1
with B = 20.
Proof Consider
n
n
m
h := ci gi − βjψj = αk ηk , m ≤ 2n, ηk ∈ D.
i=1 j=1 k=1
Then
∞
∞
f1 = h + ci gi , h = f1 − ci gi .
i=n+1 i=n+1
∞
ci gi
≤ (5/4)1/2 n −1/2 . (2.104)
i=n+1
∞
∞
ci gi
≤
ci gi
.
i=n+1 l=1 i∈Jl
⎛ ⎞1/2 ⎛ ⎞1/2
ci gi
≤ ⎝ ci2 ⎠ (1 + Mn)1/2 ≤ (5/4)1/2 ⎝ ci2 ⎠ . (2.105)
i∈Jl i∈Jl i∈Jl
132 Greedy approximation in Hilbert spaces
The monotonicity assumption on the coefficients {ci } from Lemma 2.64
implies
⎛ ⎞1/2
∞ ∞
⎝ ci2 ⎠ ≤ n 1/2 |cln+1 |
l=1 i∈Jl l=1
⎛ ⎞
∞
∞
≤ n 1/2 ⎝n −1 |ci |⎠ ≤ n −1/2 |ci | ≤ n −1/2 .
l=1 i∈Jl−1 i=1
(2.106)
The bound (2.104) follows from (2.105) and (2.106).
∞
Using the inequality σn ( f ) ≤
i=n+1 ci gi
, we obtain from (2.104)
for n ≥ 2
∞
1/2
5 1 1/2 5
h
≤
f 1
+
ci gi
≤ + 2 + ≤ n −1/2 (71/2 +51/2 )/2.
4n n 4n
i=n+1
By Lemma 2.57,
m
αk2 ≤ (1 − 2Mn)−1
h
2
k=1
and
m 1/2
m
|αk | ≤ m 1/2
αk2 ≤ 5. (2.107)
k=1 k=1
Second, we assume that (2.103) is not satisfied. Then, taking into account
(2.108) and the inequality |cn+1 | < 1/n, we conclude that there is ν ∈ [1, m]
such that |αν | > 19/n. The inequality (2.107) implies
n
n
m
|β j | ≤ |ci | + |αk | ≤ 6.
j=1 i=1 k=1
2.7 Saturation property of greedy-type algorithms 133
Assuming, without loss of generality, that |β1 | ≥ |β2 | ≥ · · · ≥ |βn | we get that
|βn | ≤ 6/n. We now prove that replacing βn ψn by αν ην in the approximant
provides an error better than the best-possible one σn ( f ). This contradiction
proves the lemma. Using the fact that ψn is orthogonal to f 1 we obtain
f 1 + βn ψn
2 =
f 1
2 + βn2 ≤
f 1
2 + 36n −2 . (2.109)
Further,
f 1 + βn ψn − αν ην
2 =
f 1 + βn ψn
2 − 2αν
f 1 + βn ψn , ην + αν2 . (2.110)
We have
∞
f 1 + βn ψn , ην = αν + αk
ηk , ην + ci
gi , ην + βn
ψn , ην .
k=ν i=n+1
and
n
f − βi ψi
2 ≤ σn ( f )2 + n −2 /2.
i=1
Then
n
f 1 := f − βi ψi ∈ A1 (Dn , 60n −1/2 ).
i=1
134 Greedy approximation in Hilbert spaces
Proof Take arbitrary
> 0. Let f
be such that
f − f
≤
and
∞
∞
f
= ci gi , gi ∈ D, |c1 | ≥ |c2 | ≥ · · · , |ci | ≤ 1.
i=1 i=1
Denote
Hn ( f ) := span(ψ1 , . . . , ψn ) and f 1
:= f
− PHn ( f ) ( f
).
Then
f 1 − f 1
≤
f − f
+
PHn ( f ) ( f − f
)
≤ 2
. (2.112)
From this and the simple inequality σn ( f ) ≤ σn ( f
) +
we obtain
f 1
2 ≤
f 1
2 +4
f 1
+4
2 ≤ σn ( f )2 +n −2 /2+4
f 1
+4
2 ≤ σn ( f
)2 +n −2
for sufficiently small
. By Lemma 2.64 applied to f
we see that there exists
a representation
∞
f 1
= b j ϕ j , ϕ j ∈ D,
j=1
satisfying (2.103) with B = 20. By Lemma 2.63 f 1
∈ A1 (Dn , 60n −1/2 ). This
inclusion and (2.112) imply that f 1 ∈ A1 (Dn , 60n −1/2 ).
Theorem 2.66 Let D be an M-coherent dictionary and let n be such that
Mn ≤ 1/4. Assume that f ∈ A1 (D). Then
f − G nm ( f, D)
≤ 60n −1/2 γ (m − 1).
Proof Lemma 2.65 implies that f 1 := f − pn ( f ) ∈ A1 (Dn , 60n −1/2 ). This
inclusion and the following simple chain of equalities give the proof:
f − G nm ( f, D) = f 1 − G nm−1 ( f 1 , D) = f 1 − G m−1 ( f 1 , Dn ).
G m ( f ) := G m ( f, D) := G sm ( f, D) := PHm ( f ).
136 Greedy approximation in Hilbert spaces
(3) Define the residual after the mth iteration of the algorithm as
f m := f ms := f − G m ( f, D).
In this section we study the rate of convergence of the WOSGA(s, τ ) in the
case tk = t, k = 1, 2, . . . ; in this case we write t instead of τ in the notations.
We assume that the dictionary D is M-coherent and that f ∈ A1 (D). We
begin with the case t = 1. In this case we impose an additional assumption,
that the ϕi from the first step exist. Clearly, it is the case if D is finite. We
call the algorithm WOSGA(s, 1) the Orthogonal Super Greedy Algorithm with
parameter s (OSGA(s)). We prove the following error bound for the OSGA(s).
Theorem 2.68 Let D be a dictionary with coherence parameter M :=
M(D). Then, for s ≤ (2M)−1 the OSGA(s) provides, after m iterations, an
approximation of f ∈ A1 (D) with the following upper bound on the error:
f m
2 ≤ 40.5(sm)−1 , m = 1, 2, . . .
We note that the OSGA(s) adds s new elements of the dictionary at each
iteration and makes one orthogonal projection at each iteration. For compar-
ison, the OGA adds one new element of the dictionary at each iteration and
makes one orthogonal projection at each iteration. After m iterations of the
OSGA(s) and after ms iterations of the OGA, both algorithms provide ms-
term approximants with a guaranteed error bound for f ∈ A1 (D) of the same
order: O((ms)−1/2 ). Both algorithms use the same number, ms, of elements of
the dictionary. However, the OSGA(s) makes m iterations and the OGA makes
ms (s times more) iterations. Thus, in the sense of number of iterations the
OSGA(s) is s times simpler (more efficient) than the OGA. We gain this sim-
plicity of the OSGA(s) under an extra assumption of D being M-coherent, and
s ≤ (2M)−1 . Therefore, if our dictionary D is M-coherent, then the OSGA(s)
with small enough s approximates with an error whose guaranteed upper bound
for f ∈ A1 (D) is of the same order as that for the OGA.
Proof of Theorem 2.68. Denote
Fm := span(ϕi , i ∈ Im ).
Then Hm is a direct sum of Hm−1 and Fm . Therefore,
f m = f − PHm ( f ) = f m−1 + G m−1 ( f ) − PHm ( f m−1 + G m−1 ( f ))
= f m−1 − PHm ( f m−1 ).
It is clear that the inclusion Fm ⊂ Hm implies
f m
≤
f m−1 − PFm ( f m−1 )
. (2.114)
2.8 Some further remarks 137
Using the notation pm := PFm ( f m−1 ), we continue
f m−1 2 = f m−1 − pm 2 + pm 2 ,
and, by (2.114),
f m
2 ≤
f m−1
2 −
pm
2 . (2.115)
|
f, g| ≤ M + 1/s.
f 1 = f − PH1 ( f ) ( f ) = f − PH1 ( f ) ( f );
f m = f − PHm ( f ) ( f ) = f − PHm ( f ) ( f ).
Therefore,
s
s
(1+Ms)−1
f m−1 , h i 2 ≤
PH (s) ( f m−1 )
2 ≤ (1−Ms)−1
f m−1 , h i 2 ,
i=1 i=1
and thus
1 − Ms 2
pm
2 ≥ q . (2.118)
1 + Ms s
Using the notation Jl := [(l − 1)s + ν + 1, ls + ν] we write for m ≥ 2
∞
f m−1
2 =
f m−1 , f =
f m−1 , cjgj
l=1 j∈Jl
⎛ ⎞1/2
∞
≤ qs (1 + Ms)1/2 ⎝ c2j ⎠ . (2.119)
l=1 j∈Jl
j∈Jl
By (2.118) we have
s(1 − Ms)
pm
2 ≥
f m−1
4 . (2.122)
9(1 + Ms)2
Our assumption Ms ≤ 1/2 implies
1 − Ms
≥ 2/9,
(1 + Ms)2
and therefore (2.122) gives
and
f 1
2 ≤
f
2 ≤ 27/(2s) ≤ A/s.
f m 2 ≤ A(sm)−1 , m = 1, 2, . . .
Proof Proof of this theorem mimics the proof of Theorem 2.68, except that in
the auxiliary observations we choose a threshold B/s with B := (3 + t)/(2t)
instead of 2/s: |cν | ≥ B/s ≥ |cν+1 |, so that our assumption Ms ≤ 1/2 implies
that M + 1/s ≤ t (B/s − M). This, in turn, implies that all g1 , . . . , gν will be
140 Greedy approximation in Hilbert spaces
chosen at the first iteration. As a result, the sequence {c j } satisfies the following
conditions:
∞
|cν+1 | ≥ |cν+2 | ≥ · · · , |c j | ≤ 1, |cν+1 | ≤ B/s. (2.124)
j=ν+1
max |
f m−1 , h i | ≤ t −1 min |
f m−1 , ϕk |. (2.126)
i∈[1,s]\V k∈Im \V
Therefore,
1 + Ms
qs2 ≤ (1 − Ms)−1 t −2
f m−1 , ϕk 2 ≤
pm
2 .
t 2 (1 −
Ms)
k∈Im
The rest of the proof repeats the corresponding part of the proof of Theo-
rem 2.68 with A := (9(B + 1)2 )/2t 2 = (81/8)(1 + t)2 t −4 .
2.9 Open problems 141
2.9 Open problems
We have already formulated some open problems on greedy approximation in
Hilbert spaces. Here we add some more and repeat the problems mentioned
above.
γ (m) := sup (
f − G m ( f, D)
),
f,D ,{G m }
where the supremum is taken over all Riesz bases B ∈ R B(C1 , C2 ), all
elements f ∈ A1 (B) and all possible realizations of {G m }.
Comment For a normalized dictionary that we consider here we have
C1 ≤ 1 ≤ C2 . Using the lower bound from the definition of the Riesz
basis,
⎛ ⎞1/2
C1 ⎝ |c j |2 ⎠ ≤
c j g j
,
j j
f − G m ( f, D) ≤ C(r, γ )m −r .
For example, for r = 0.2 it is better than the rate of convergence of the
PGA with respect to a general dictionary. We note that we can specify
r = 0.2 if C1 > (0.4)1/2 .
142 Greedy approximation in Hilbert spaces
2.3. Find the order of decay of the sequence
γ (m, λ) := sup (
f − G m ( f, D)
),
f,D ,{G m }
where the supremum is taken over all elements f ∈ A1 (R2 ) and all
possible realizations of {G m }.
2.5. Find the necessary and sufficient conditions on a weakness sequence τ to
guarantee convergence of the Weak Greedy Algorithm with regard to R2
for each f ∈ L 2 (D).
2.6. Let 2 denote the system of functions (bilinear system) of the form
u(x1 )v(x2 ) ∈ L 2 ([0, 1]2 ) normalized in L 2 . Find the necessary and suf-
ficient conditions on a weakness sequence τ to guarantee convergence of
the Weak Greedy Algorithm with regard to 2 for each f ∈ L 2 ([0, 1]2 ).
2.7. Let 0 < r ≤ 1/2 be given. Characterize dictionaries D which possess the
property that for any f ∈ H such that
σm ( f, D) ≤ m −r , m = 1, 2, . . . ,
we have
f − G m ( f, D)
≤ C(r, D)m −r , m = 1, 2, . . .
3
Entropy
143
144 Entropy
Denote by M
(A) := M
(A, X ) the maximal cardinality of
-distinguishable
sets of a compact A.
Theorem 3.1 For any compact set A we have
M2
(A) ≤ N
(A) ≤ M
(A). (3.5)
Proof We first prove the second inequality. Let an
-distinguishable set F
realise M
(A), i.e.
F = {x 1 , . . . , x M
(A) }.
By the definition of M
(A) as the maximal cardinality of
-distinguishable sets
of a compact A, we get for any x ∈ A an index j := j (x) ∈ [1, M
(A)] such
that
x − x j
≤
. Thus we have
(A)
A ⊆ ∪M
j=1 B X (x ,
)
j
N (A)
A ⊆ ∪ j=1 B X (y j ,
).
Assume M2
(A) > N
(A). Then the corresponding 2
-distinguishable set
F contains two points that are in the same ball B X (y j ,
), for some j ∈
[1, N
(A)]. This clearly causes a contradiction.
Corollary 3.2 Let A ⊂ Y , and let Y be a subspace of X . Then
N
(A, X ) ≥ N2
(A, Y ).
Indeed, by Theorem 3.1 we have
N2
(A, Y ) ≤ M2
(A, Y ) = M2
(A, X ) ≤ N
(A, X ).
It is convenient to consider along with the entropy H
(A, X ) := log2 N
(A, X )
the entropy numbers
k (A, X ):
k k
k (A, X ) :=
k1 (A, X ) := inf{
: ∃y 1 , . . . , y 2 ∈ X : A ⊆ ∪2j=1 B X (y j ,
)}.
x
∞ :=
x
n∞ := max |x j |.
j
0 0 0
∞
∞
−nt /2 −1/2 1/2
e−y /2 dy,
2 2
≤e 1/2
e dt ≤ n e
0 0
3.2 Finite dimensional spaces 147
where we have made the substitution t = n −1/2 y. This completes the proof of
(3.11) and (3.9).
Lemma 3.5 For 0 < q < ∞ and k ≤ n we have
k (Bqn , n∞ ) ≤ C(q)(ln(en/k)/k)1/q .
Proof For x ∈ Bqn denote
and
ln N ≤ l ln C + n s ln(en/n s ) + (l − s + 1)2sq .
s≤l s≤l
148 Entropy
Taking into account that n s ≤ n, we get
n s ln(en/n s ) ≤ 2sq ln(en2−sq ).
Thus
ln N ≤ C(q)2lq ln(en2−lq ).
Denoting k := [log N ] + 1 we get
1/q
ln(en/k)
k ≤
= 2−l ≤ C1 (q) . (3.13)
k
This proves the lemma.
The following theorem is from Schütt (1984) (see also Höllig (1980) and
Maiorov (1978)).
Theorem 3.6 For any 0 < q ≤ ∞ and max(1, q) ≤ p ≤ ∞ we have
1/q−1/ p
ln(en/k)
, k≤n
k (Bq , p ) ≤ C(q)
n n k
2−k/n n 1/ p−1/q , k ≥ n.
Proof Let us first consider the case k ≤ n. In the case q < ∞ and p = ∞
the estimate follows from Lemma 3.5. We will deduce the case q ≤ p < ∞
from the construction in the proof of Lemma 3.5, and keep the notations from
that proof. We have constructed an 2−l -net {y 1 , . . . , y N } such that for each
x ∈ Bqn there is a y j , j ∈ [1, N ], with the properties yi = 0 if |xi | ≤ 2−l and
j
x − y j
p ≤ 2−lp 2lq +
p
|xi | p
i:|xi |≤2−l
≤ 2l(q− p) + 2l(q− p) |xi |q ≤ 2(2l(q− p) ).
i
2M X αn−1 B X
= (2π )−n/2 e−|x+yi | /2 d x
2
2M X αn−1 B X
≥ 0.5e−|yi |
2 /2
≥ 0.5e−(4M X /
αn ) .
2
3.3 Trigonometric polynomials; volume estimates 151
From here and (3.22) we get
M
≤ 2e(4M X /
αn ) ,
2
which implies
k (n/k)1/2 M X . (3.23)
It makes sense to use this inequality for k ≤ n. For k > n we get from (3.23),
(3.14), and Corollary 3.4 that
k M X 2−k/n , k > n. (3.24)
Thus we have proven the following theorem.
Theorem 3.8 Let X be Rn equipped with
·
and
MX =
x
dσ (x).
S n−1
Then we have
(n/k)1/2 , k≤n
k (B2n , X ) MX
2−k/n , k ≥ n.
Theorem 3.8 is a dual version of the corresponding result from Sudakov
(1971); it was proved in Pajor and Tomczak-Yaegermann (1986).
The estimate
Dn
1 ≤ C ln n, n = 2, 3, . . . , (3.27)
2n
l
2n
l
eikx Dn (x − x l ) = eimx ei(k−m)x = eikx (2n + 1).
l=1 |m|≤n l=0
2n
−1
t (x) = (2n + 1) t (x l )Dn (x − x l ). (3.28)
l=0
2n
l 2
t
22 = (2n + 1)−1 t (x ) . (3.30)
l=0
The de la Vallée Poussin kernels with n = 2m are used often; let us denote
them by
Vm (x) = Vm,2m (x), m ≥ 1, V0 (x) = 1.
Vm = 2K2m−1 − Km−1 ,
Vm 1 ≤ 3. (3.37)
In addition,
Vm
∞ ≤ 3m.
Vm q m 1−1/q , 1 ≤ q ≤ ∞. (3.38)
We denote
x(l) = πl/2m, l = 1, . . . , 4m.
= 2 |P j |2 + |Q j |2 ).
Consequently, for all x,
P j (x)2 + Q j (x)2 = 2 j+1 .
t ∗ DN = t.
For 1 < q ≤ ∞,
DN
q ν(N)1−1/q , (3.42)
$
where N j = max(N j , 1), ν(N) = dj=1 N j and
d
DN
1 ln(N j + 2). (3.43)
j=1
We denote
"
P(N) := n = (n 1 , . . . , n d ), n j are non-negative integers,
#
0 ≤ n j ≤ 2N j , j = 1, . . . , d ,
and set
2π n 1 2πn d
x =
n
,..., , n ∈ P(N).
2N1 + 1 2Nd + 1
Then, for any t ∈ T (N, d),
t (x) = ϑ(N)−1 t (xn )DN (x − xn ), (3.44)
n∈P(N)
$d
where ϑ(N) = j=1 (2N j + 1) and, for any t, u ∈ T (N, d),
t, u = ϑ(N)−1 t (xn )u(xn ), (3.45)
n∈P(N)
t
22 = ϑ(N)−1 t (xn )2 . (3.46)
n∈P(N)
158 Entropy
The Fejér kernels.
d
KN (x) := K N j (x j ), x = (x1 , . . . , xd ), N = (N1 , . . . , Nd ),
j=1
KN 1 = 1, (3.47)
KN q ϑ(N)1−1/q , 1 ≤ q ≤ ∞. (3.48)
VN 1 ≤ 3d , (3.49)
VN q ϑ(N)1−1/q , 1 ≤ q ≤ ∞. (3.50)
We denote
"
P (N) = n = (n 1 , . . . , n d ), n j are natural numbers,
#
1 ≤ n j ≤ 4N j , j = 1, . . . , d
and set
πn 1 π nd
x(n) = ,..., , n ∈ P (N).
2N1 2Nd
In the case N j = 0 we assume x j (n) = 0. Then for any t ∈ T (N, d) we have
the representation
! !
t (x) = ν(4N)−1 t x(n) VN x − x(n) . (3.51)
n∈P (N)
VN
p→ p ≤ 3d , 1 ≤ p ≤ ∞. (3.52)
3.3 Trigonometric polynomials; volume estimates 159
and
t
22 = tˆ(k)2 =
At
2 . (3.54)
2
|k|≤m
160 Entropy
Thus the operator ABϑ(m)1/2 is an orthogonal transformation of R2ϑ(m) .
We shall prove the following assertion.
Then for the volume of this set in the space R2ϑ(m) we have the estimate that
there is a C(d) > 0 such that, for all m,
!
vol S∞ (m) ≥ C(d)−ϑ(m) .
and
d
Km (x) = Km j (x j ).
j=1
which is valid for any pair of trigonometric polynomials u, w ∈ T (m, d). Then
it is not difficult to see that, for all n such that |n j | ≤ m j , j = 1, . . . , d, we
have
ϑ(m)−1 ei(n,x ) Km (xl − xk ) = K̂m (n)ei(n,x ) .
k l
(3.56)
k∈P(m)
" k #
ηn = (Re iei(n,x ) , Im iei(n,x ) ) k∈P(m) ∈ R2ϑ(m)
k
3.3 Trigonometric polynomials; volume estimates 161
are the eigenvectors of the operator K corresponding to eigenvalues K̂m (n),
|n| ≤ m. It is not difficult to verify that the vectors εn , ηn , |n| ≤ m, create a
set of 2ϑ(m) orthogonal vectors from R2ϑ(m) . Consequently, the operator K
maps the unit cube of the space R2ϑ(m) to the set G with volume
vol(G) = K̂m (n)2 ≥ C1 (d)−ϑ(m) , C1 (d) > 0. (3.57)
|n|≤m
and the representation (3.58) it follows that for some C2 (d) we have
t (·, y) ≤ C2 (d). (3.59)
∞
Then for the volume of this set in the space R2ϑ(m) the following estimate
holds: there is a C(d) > 0 such that, for all m,
!
vol A∞ (m) ≥ ϑ(m)−ϑ(m) C(d)−ϑ(m) .
vol(B) ≥ vol(B2N )C −N ,
L h = {l ∈ R N : l = h + u, u ∈ U },
Therefore,
A1 (m) ⊆ {AB y,
y
2ϑ(m) ≤ C(d)ϑ(m)}
1
with C(d) the constant from Lemma 3.18. Using the fact that the operator
ABϑ(m)1/2 is an orthogonal transformation of R2ϑ(m) , we get
vol(A1 (m)) ≤ ϑ(m)−ϑ(m) vol(O2ϑ(m) (C(d)ϑ(m))).
By (3.8) we obtain from here that
vol(A1 (m)) ≤ ϑ(m)−ϑ(m) C(d)ϑ(m) .
Lemma 3.17 is now proved.
Let us apply Theorem 3.3 with n = 2ϑ(m):
z
X :=
A−1 z
L 1 ;
z
Y :=
A−1 z
L ∞ , z ∈ Rn .
Then B X = A1 (m) and BY = A∞ (m). Combining Lemma 3.13 with
Lemma 3.17 we get from (3.6) that
N
(BY , B X ) ≥ (C(d)/
)n . (3.65)
In particular, (3.65) implies
n (BY , B X ) ≥ C(d) > 0.
This inequality can be rewritten in the form
n (T (m, d)∞ , L 1 ∩ T (m, d)) ≥ C(d). (3.66)
We want to replace L 1 ∩ T (m, d) by L 1 in (3.66). There are different ways of
doing this. The simplest way is to use Corollary 3.2.
The proof of Theorem 3.16 is now complete.
Theorem 3.19 For any 1 ≤ q ≤ p ≤ ∞ we have
k (T (m, d)q , L p )
1/q−1/ p
ln(e2ν(4m)/k)
, k ≤ 2ν(4m)
≤ C(q, d)ν(4m) 1/q−1/ p k
2−k/2ν(4m) ν(4m) 1/ p−1/q , k ≥ 2ν(4m).
Proof We consider arrays of complex numbers z = {z n , n ∈ P (m)}. Then,
since z n = Re z n + i Im z n , we interpret z as an element of the space R2ν(4m) .
3.4 The function classes 165
With each polynomial t ∈ T (m, d) we associate z(t) := {t (x(n)), n ∈ P (m)}
and with any z = {z n , n ∈ P (m)} we associate
t (x, z) := ν(4m)−1 z n Vm (x − x(n)).
n∈P (m)
Thus, if
t ∈ T (m, d)q , then
z(t)
2ν(4m) ≤ C2 (d)ν(4m)1/q .
q
Therefore
k (T (m, d)q , L p ) ≤ C(d)ν(4m)1/q−1/ p
k (Bq2ν(4m) , 2ν(4m)
p ).
Applying Theorem 3.6 we complete the proof of Theorem 3.19.
where inf L is taken over all n-dimensional subspaces of X . In other words, the
Kolmogorov n-width gives the best possible error in approximating a compact
set F by n-dimensional linear subspaces.
There has been an increasing interest since 2000 in nonlinear m-term
approximation with regard to different systems. In this section we generalize
the concept of the classical Kolmogorov width in order to use it in estimating
the best m-term approximation (see Temlyakov (1998d)). For this purpose we
introduce a nonlinear Kolmogorov (N , m)-width as follows:
dm (F, X, 1) = dm (F, X ).
N K m, (3.73)
N m am , (3.74)
It is clear that
d1 (F, X, 2n ) ≤
n (F, X ).
We prove here the inequality
max k r
k (F, X ) ≤ C(r, K ) max m r dm−1 (F, X, K m ), (3.76)
1≤k≤n 1≤m≤n
where we denote
d0 (F, X, N ) := sup
f
X .
f ∈F
and give an example showing that k log k in this inequality cannot be replaced
by a slower growing function on k. For non-integer k we set
k (F, X ) :=
[k] (F, X ), where [k] is the integral part of the number k.
Theorem 3.23 For any compact F ⊂ X and any r > 0 we have, for all n ∈ N,
max k r
k (F, X ) ≤ C(r, K ) max m r dm−1 (F, X, K m ).
1≤k≤n 1≤m≤n
X (K 2s+1
, 2s+1 ). Then the total number of the elements y sj in these
n s -nets
does not exceed
s+1
Ms := K 2 2n s .
We now consider the set A of elements of the form
l(r )
y 1j1 + 2−r y 2j2 + · · · + 2−r (l(r )−1) y jl(r) , js = 1, . . . , Ms , s = 1, . . . , l(r ).
The total number of these elements does not exceed
l(r )
l(r)
l(r)+2
M= Ms ≤ K 2 2 s=1 n s .
s=1
It is clear that it suffices to consider the case of large l ≥ l(r, K ). We take now
n s := [(r + 1)(l − s)2s+1 ], s = 1, . . . , l(r ),
where [x] denotes the integer part of a number x. We choose l(r ) ≤ l as a
maximal natural number satisfying
l(r )
n s ≤ 2l−1
s=1
and
2l(r )+2 log K ≤ 2l−1 .
It is clear that
l(r ) ≥ l − C(r, K ).
Then we have
l
M ≤ 22 .
For the error
( f ) of approximation of f ∈ H r (K(K , l)) by elements of A we
have
l(r )
l
( f ) ≤ 2−rl +
ts ( f ) − 2−r (s−1) y sjs
X +
ts ( f )
X
s=1 s=l(r )+1
l(r )
≤ C(r, K )2−rl + 2−r (s−1)
n s (B L s ( f ) , X )
s=1
l(r )
≤ C(r, K )2−rl + 3 2−r (s−1) 2−n s /2 ≤ C(r, K )2−rl .
s+1
s=1
s
This means that, for each s = 1, 2, . . . , l, there is a collection K 2s of K 2
s
2s -dimensional spaces L sj , j = 1, . . . , K 2 , such that, for each f ∈ F, there
exists a subspace L sjs ( f ) and an approximant as ( f ) ∈ L sjs ( f ) such that
f − as ( f ) ≤ 2−r s−1 .
Consider
ts ( f ) := as ( f ) − as−1 ( f ), s = 2, . . . , l. (3.79)
Then we have
ts ( f ) ∈ L sjs ( f ) ⊕ L s−1
js−1 ( f ), dim(L sjs ( f ) ⊕ L s−1
js−1 ( f )) ≤ 2 + 2
s s−1
< 2s+1 .
s+1
Let X (K 2 , 2s+1 ) denote the collection of all L sjs ⊕ L s−1
js−1 over various 1 ≤
2s
js ≤ K ; 1 ≤ js−1 ≤ K 2
s−1
. For ts ( f ) defined by (3.79) we have
f − a1 ( f ) ≤ 1/2
a1 ( f ) ≤ 1.
Remark 3.25 On examining the proof of Theorem 3.23, one can check that
the inequality holds for K m replaced by a larger function. For example, we
have
dm (Fn , L ∞ ) ≤ n −r .
We also have
This implies that for any s ∈ N we have, for the right-hand side of (3.75),
max m r dm (Fn , L ∞ , m am ) ≤ 1.
1≤m≤s
χG Q − χG Q ∞ ≥ 1.
lim m n /n = ∞.
n→∞
and
r
M Wq,α := { f : f = Fr (·, α) ∗ ϕ,
ϕ
q ≤ 1},
where ∗ means convolution.
The problem of estimating
k (M Wq,α r , L ) has a long history. The first
p
result on the right order of
k (M W2,α , L 2 ) was obtained by Smolyak (1960).
ε_k(M W^r_{q,α}, L_∞) ≍ k^{−r}(log k)^{r+1/2} (3.86)

holds for 1 < q < ∞, r > 1. We note that the upper bound in (3.86) was proved under the condition r > 1 and the lower bound in (3.86) was proved under the condition r > 1/q. Belinsky (1998) proved the upper bound in (3.86) for 1 < q < ∞ under the condition r > max(1/q, 1/2). Temlyakov (1998e) proved (3.86) for q = ∞ under the assumption r > 1/2.
The case q = 1, 1 ≤ p ≤ ∞, was settled by Kashin and Temlyakov (2003), who proved that

ε_k(M W^r_{1,α}, L_p) ≍ k^{−r}(log k)^{r+1/2} (3.87)

and

ε_k(M W^r_{1,0}, L_∞) ≍ k^{−r}(log k)^{r+1}, r > 1. (3.88)
E(B_x B_y) = ∏_{i=1}^{d} min(x_i, y_i), x = (x_1, . . . , x_d), y = (y_1, . . . , y_d).

This process is called the Brownian sheet. It is known that the sample paths of B_d are almost surely continuous. We consider them as random elements of the space C([0, 1]^d). The small ball problem is the problem of the asymptotic behavior of the small ball probabilities

P( sup_{x∈[0,1]^d} |B_x| ≤ ε ),
where C is a positive number. Inequality (3.93) plays a key role in the proof of
lower bounds in (3.86).
We now proceed to the case q = 1, d = 2. For a natural number n let us denote

Q_n := ∪_{s: s_1+s_2 ≤ n} ρ(s); ΔQ_n := Q_n \ Q_{n−1} = ∪_{s: s_1+s_2 = n} ρ(s).
We call the set ΔQ_n a hyperbolic layer. For a finite set Λ ⊂ Z² define

T(Λ) := { f ∈ L_1 : f̂(k) = 0, k ∈ Z² \ Λ }.

For a finite set Λ, as in Section 3.3.3, we assign to each f = ∑_{k∈Λ} f̂(k)e^{i(k,x)} ∈ T(Λ) a vector

A(f) := {(Re f̂(k), Im f̂(k)), k ∈ Λ} ∈ R^{2|Λ|},

where |Λ| denotes the cardinality of Λ, and define

B_Λ(L_p) := {A(f) : f ∈ T(Λ), ||f||_p ≤ 1}.
The volume estimates of the sets B_Λ(L_p) and related questions have been studied in a number of papers: Λ = [−n, n], p = ∞ in Kashin (1980); Λ = [−N_1, N_1] × [−N_2, N_2], p = ∞ in Temlyakov (1989b, 1993b) (see also Lemma 3.13 above); arbitrary Λ and p = 1 in Kashin and Temlyakov (1994). In particular, the results of Kashin and Temlyakov (1994) imply for d = 2 and 1 ≤ p < ∞ that

(vol(B_{ΔQ_n}(L_p)))^{(2|ΔQ_n|)^{−1}} ≍ |ΔQ_n|^{−1/2} ≍ (2^n n)^{−1/2}.

It was proved in Kashin and Temlyakov (2003) that in the case p = ∞ the volume estimate is different:

(vol(B_{ΔQ_n}(L_∞)))^{(2|ΔQ_n|)^{−1}} ≍ (2^n n²)^{−1/2}. (3.94)
ε_k(M W^r_{q,α}, L_∞) ≍ k^{−r}(ln k)^{(d−1)r+1/2}, 1 < q < ∞, (3.95)

for large enough r. The upper bound in (3.95) follows from (3.89). It is known that the corresponding lower bound in (3.95) would follow from the d-dimensional version of (3.93), which we formulate below. For s ∈ Z^d_+ define
Recently, Bilyk and Lacey (2008) and Bilyk, Lacey and Vagharshakyan (2008) proved (3.97) with the exponent (d − 1)/2 − δ(d), with some positive δ(d), instead of (d − 2)/2. There has been no progress in proving (3.96).
Kashin and Temlyakov (2008) considered a new norm – the QC norm – which we now briefly discuss. For a periodic univariate function f ∈ L_1 with the Fourier series

f ∼ ∑_k f̂(k)e^{ikx},

we define

δ_0(f, x) := f̂(0), δ_s(f, x) := ∑_{k: 2^{s−1} ≤ |k| < 2^s} f̂(k)e^{ikx}, s = 1, 2, . . . .
4.1 Introduction
This chapter is devoted to some mathematical aspects of recent results on supervised learning. Supervised learning, or learning from examples, refers to a process that builds, on the basis of available data of inputs x_i and outputs y_i, i = 1, . . . , m, a function that best represents the relation between the inputs x ∈ X and the corresponding outputs y ∈ Y. This is a big area of research, both in non-parametric statistics and in learning theory. In this chapter we confine ourselves to recent further developments in the settings and results originating from Cucker and Smale (2001). We illustrate how methods of approximation theory, in particular greedy algorithms, can be used in learning theory. We begin our discussion with a very brief survey of different settings that are close to the setting of our main interest.
x_1, . . . , x_m fixed; ε_i independent identically distributed (i.i.d.), E(ε_i) = 0, f ∈ Θ, we find an approximant for f (estimator f̂). The unknown function f is called the regression function. Error is measured by the expectation E(||f − f̂||²) in one of the standard norms.
(b) Random design model. Given

f_ρ(x) := E(y|x), y_i = f_ρ(x_i) + ε_i, ε := y − f_ρ(x),
close to the form of the random design model. However, in our setting we do not assume that ε and x are independent. While the theories of the fixed and random design models do not directly apply to our setting, they utilize several of the same techniques we shall encounter, such as the use of entropy and the construction of estimators through minimal risk.
We note that a standard setting in the distribution-free theory of regression (see Györfy et al. (2002)) involves the expectation as a measure of quality of an estimator. An important new feature of the setting in learning theory formulated in Cucker and Smale (2001) is the following. They propose to study systematically the probability distribution function

ρ^m{ z : ||f_ρ − f_z||_{L_2(ρ_X)} ≥ η },
where U(B) is the unit ball of a Banach space B. We denote by N(Θ, ε, B) the covering number, that is, the minimal number of balls of radius ε with centers in Θ needed for covering Θ. The corresponding ε-net is denoted by N_ε(Θ, B). In Cucker and Smale (2001), DeVore et al. (2006) and Konyagin and Temlyakov (2004), in most cases the space C := C(X) of continuous functions on a compact X ⊂ R^d has been taken as the Banach space B. This allows us to formulate all results with assumptions on Θ independent of ρ. In Konyagin and Temlyakov (2007) and Binev et al. (2005) some results are obtained for B = L_2(ρ_X). On the one hand, this weakens the assumptions on the class Θ; on the other hand, it results in the use of ρ_X in the construction of an estimator. Thus, we have a tradeoff between treating wider classes and building estimators that are independent of ρ_X. We note that in practice we often do not know ρ_X. Thus, it is very desirable to build estimators independent of ρ_X. In statistics this type of regression problem is referred to as distribution-free. A recent survey of distribution-free regression theory is provided in Györfy et al. (2002).
In this chapter we always assume that the unknown measure ρ satisfies the condition |y| ≤ M (or, a little weaker, |y| ≤ M a.e. with respect to ρ_X) with some fixed M. Then it is clear that for f_ρ we have |f_ρ(x)| ≤ M for all x (for almost all x). Therefore, it is natural to assume that a class Θ of priors to which f_ρ belongs is embedded into the C(X)-ball (L_∞-ball) of radius M.
In DeVore et al. (2006) and Konyagin and Temlyakov (2004) the restrictions on a class Θ were imposed in the following forms:

ε_n(Θ, C) ≤ Dn^{−r}, n = 1, 2, . . . , Θ ⊂ D·U(C), (4.1)

or

d_n(Θ, C) ≤ Kn^{−r}, n = 1, 2, . . . , Θ ⊂ K·U(C). (4.2)

Here, d_n(Θ, B) is the Kolmogorov width. Kolmogorov's n-width of a centrally symmetric compact set Θ in the Banach space B is defined as follows:

d_n(Θ, B) := inf_{L: dim L ≤ n} sup_{f∈Θ} inf_{g∈L} ||f − g||_B,
or a specific estimator. In the case where E(m) is the set of all estimators, m = 1, 2, . . . , we write AC_m(M, η).
We discuss results on AC_m(M, E, η) with M = M(Θ) := {ρ : f_ρ ∈ Θ}. In this case we write AC_m(M(Θ), E, η) =: AC_m(Θ, E, η). Section 4.4 is devoted to the study of priors on f_ρ in the form f_ρ ∈ Θ. This setting is referred to as the proper function learning problem. In Section 4.4 (see Theorem 4.74) we obtain the right behavior of the AC function for classes satisfying the entropy condition

ε_n(Θ, L_2(μ)) ≍ n^{−r}.

This is one of the main results of this chapter. We give a detailed proof of this theorem here. The upper bounds are proved in Section 4.4 and the corresponding lower bounds are proved in Section 4.5. It is interesting to note that the proof uses only well known classical results from statistics: the Bernstein inequality in the proof of the upper bounds and the Fano lemma in the proof of the lower bounds.
We point out that results on the expectation can be obtained as a corollary of the results on the AC function. For a class Θ consider

E(Θ, m, f̂) := sup_{f_ρ∈Θ} E( ||f_ρ − f̂||²_{L_2(ρ_X)} ),
μ(A) ≤ μ(B).

Σ ⊆ Σ_0; A ∈ Σ ⇒ μ(A) = μ_0(A);

D ∈ Σ_0 ⟺ D = A ∪ B, B ∈ Σ, A ⊆ C, C ∈ Σ, μ(C) = 0.

{x ∈ X : f(x) ≥ a} ∈ Σ.

The following theorem shows that the property of being measurable is stable under many operations.

|f_n(x)| ≤ g(x), f_n(x) → f(x).

Then

∫_A f dμ = lim_{n→∞} ∫_A f_n dμ.
A measure ν is said to be absolutely continuous with respect to the measure μ if ν(A) = 0 for each set A for which μ(A) = 0. Whenever we are dealing with more than one measure on a measurable space (X, Σ), the term “almost everywhere” becomes ambiguous, and we must specify almost everywhere with respect to μ, or almost everywhere with respect to ν, etc. These are usually abbreviated as a.e. [μ] and a.e. [ν].
Let μ be a measure and let f be a non-negative measurable function on X. For A ∈ Σ, set

ν(A) := ∫_A f dμ.

Then ν is a set function defined on Σ, and it follows from Corollary 4.13 that ν is countably additive and hence a measure. The measure ν is finite if and only if f is integrable. Since the integral over a set of μ-measure zero is zero, ν is absolutely continuous with respect to μ. The next theorem shows that every absolutely continuous measure ν is obtained in this fashion.
We also have

||F|| = ||g||_q.
We now describe a way to obtain the product measure. Let (X_1, Σ_1, μ_1) and (X_2, Σ_2, μ_2) be two complete measure spaces, and consider the direct product X := X_1 × X_2. If A ⊆ X_1 and B ⊆ X_2, we call A × B a rectangle. If A ∈ Σ_1 and B ∈ Σ_2, we call A × B a measurable rectangle. If A × B is a measurable rectangle, we set

λ(A × B) := μ_1(A)μ_2(B).

The measure λ can be extended to a complete measure on a σ-algebra containing all measurable rectangles. This extended measure μ is called the product measure of μ_1 and μ_2 and is denoted by μ := μ_1 × μ_2. In the particular case when (X_1, Σ_1, μ_1) = (X_2, Σ_2, μ_2) we write μ = μ_1². If μ_i, i = 1, 2, are finite, so is μ. If X_1 and X_2 are the real line and μ_1 and μ_2 are both Lebesgue measures, then μ is called the two-dimensional Lebesgue measure for the plane.
Theorem 4.18 (Fubini theorem) Using the above notations, let f be an integrable function on X with respect to μ. Then we have the following.
(i) For almost all x_1 ∈ X_1 the function f(x_1, ·) is an integrable function on X_2 with respect to μ_2. For almost all x_2 ∈ X_2 the function f(·, x_2) is an integrable function on X_1 with respect to μ_1.
(ii) The functions

∫_{X_2} f(x_1, x_2) dμ_2, ∫_{X_1} f(x_1, x_2) dμ_1

are integrable functions on X_1 and X_2, respectively.
(iii) The following equalities hold:

∫_{X_1} ( ∫_{X_2} f dμ_2 ) dμ_1 = ∫_X f dμ = ∫_{X_2} ( ∫_{X_1} f dμ_1 ) dμ_2.
this condition is to say that if, for each f ∈ F, M_f is a Borel set on the real line, then, for every possible choice of the Borel sets M_f, the events (sets) of the class A := {f^{−1}(M_f) : f ∈ F} are independent.
For integrable independent random variables f_1 and f_2 their product f_1 f_2 is also integrable and

E(f_1 f_2) = E(f_1)E(f_2). (4.11)
The variance of an integrable random variable f, denoted σ²(f), is defined by

σ²(f) := ∫ (f − Ef)² dμ. (4.12)
Next,

cosh x = ∑_{k=0}^{∞} x^{2k}/(2k)!

and

e^{x²/2} = ∑_{k=0}^{∞} x^{2k}/(2^k k!).

Then the inequality (2k)! ≥ 2^k k! implies cosh x ≤ e^{x²/2}.
Next, we have

E(e^{s(ζ−Eζ)}) = ∏_{i=1}^{m} E(e^{s(ξ_i(ω_i)−Eξ_i)}). (4.21)

By Lemma 4.20 we continue:

E(e^{s(ζ−Eζ)}) ≤ ∏_{i=1}^{m} exp((s b_i)²/2).
We note that in the proof of Theorem 4.22 we used the assumption |β| ≤ b, β := ξ − Eξ, to estimate Eβ^k. The above proof gives the following analog of Theorem 4.22.
Proof The beginning of the proof is the same as in the proof of Theorem 4.22. It deviates at the estimate (4.24) for E(e^{sβ}). This time we write

E(e^{sβ}) ≤ 1 + s²σ² ∑_{k=2}^{∞} (sb)^{k−2}/k! = 1 + (σ²/b²) ∑_{k=2}^{∞} (sb)^k/k!

= 1 + (σ²/b²)(e^{sb} − 1 − sb) ≤ exp( (σ²/b²)(e^{sb} − 1 − sb) ).

Thus we need to minimize over s ≥ 0 the expression

F(s) := (σ²/b²)(e^{sb} − 1 − sb) − st.

It is clear that F(s) attains its minimal value at

s_0 = (1/b) ln(1 + tb/σ²).

We have

F(s_0) = (σ²/b²)( (1 + tb/σ²) − 1 − ln(1 + tb/σ²) − (tb/σ²) ln(1 + tb/σ²) ) = −(σ²/b²) h(tb/σ²).

Using the Chernoff inequality (4.19) we complete the proof in the same way as in Theorems 4.21 and 4.22.
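One can also check numerically that s_0 is indeed the minimizer and that F(s_0) = −(σ²/b²)h(tb/σ²), where h(u) := (1 + u) ln(1 + u) − u; a minimal sketch with arbitrary illustrative values of σ², b, t:

```python
import numpy as np

def F(s, sigma2, b, t):
    return (sigma2 / b**2) * (np.exp(s * b) - 1 - s * b) - s * t

def h(u):
    return (1 + u) * np.log(1 + u) - u

sigma2, b, t = 0.7, 1.3, 0.4
s0 = np.log(1 + t * b / sigma2) / b
grid = np.linspace(0, 5, 200001)
# the grid minimum agrees with the closed-form minimizer s0
assert abs(grid[np.argmin(F(grid, sigma2, b, t))] - s0) < 1e-3
assert np.isclose(F(s0, sigma2, b, t), -(sigma2 / b**2) * h(t * b / sigma2))
```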
and

h′′(u) = 1/(1 + u) ≥ g′′(u) = 1/(1 + u/3)³, u ≥ 0.
Denote

S(μ) := {x ∈ X : u(x) > 0}.

We write

K(μ, ν) = − ∫_{S(μ)} u ln(v/u) dw

= − ∫_{S(μ)} ( ln(v/u) − (v/u − 1) ) u dw + ∫_{S(μ)} (1 − v/u) u dw.

Since

ln(v/u) − (v/u − 1) ≤ 0,

we have

− ∫_{S(μ)} ( ln(v/u) − (v/u − 1) ) u dw ≥ 0.

Also, we obtain

∫_{S(μ)} (1 − v/u) u dw = 1 − ∫_{S(μ)} v dw ≥ 0.

Thus,

K(μ, ν) ≥ 0.

It is clear that K(μ, μ) = 0. The quantity K(μ, ν) indicates how close the measures μ and ν are to each other. However, in general K(μ, ν) ≠ K(ν, μ); therefore, it is not a metric.
We proceed to the Hellinger distance, which is a metric. For μ, ν satisfying (4.28) we define

h(μ, ν) := ∫_X (u^{1/2} − v^{1/2})² dw = ||u^{1/2} − v^{1/2}||²_{L_2(w)},

and

K(μ, ν) = − ∫_X u ln(v/u) dw ≥ 2 − 2 ∫_X (uv)^{1/2} dw = h(μ, ν).

Thus,

K(μ, ν) ≥ h(μ, ν).
We note that if μ and ν are of the form (4.28) and μ is not absolutely continuous with respect to ν (there exists an A ∈ Σ such that μ(A) > 0, ν(A) = 0), then K(μ, ν) = ∞. So, we complement the relation (4.30), which gives K(μ, ν) in the case where μ is absolutely continuous with respect to ν, by the relation K(μ, ν) = ∞ if μ is not absolutely continuous with respect to ν. This defines K(μ, ν) for any pair of probability measures μ, ν.
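For discrete measures both K(μ, ν) and h(μ, ν) reduce to finite sums, so the asymmetry of K and the bound K(μ, ν) ≥ h(μ, ν) can be observed directly; a minimal sketch (the example measures are arbitrary):

```python
import numpy as np

def kl(u, v):
    # K(mu, nu) = -sum u ln(v/u) over the support of u; +inf if mu is
    # not absolutely continuous with respect to nu
    u, v = np.asarray(u, float), np.asarray(v, float)
    s = u > 0
    if np.any(v[s] == 0):
        return np.inf
    return float(np.sum(u[s] * np.log(u[s] / v[s])))

def hellinger(u, v):
    u, v = np.asarray(u, float), np.asarray(v, float)
    return float(np.sum((np.sqrt(u) - np.sqrt(v)) ** 2))

mu = [0.5, 0.3, 0.2]
nu = [0.2, 0.2, 0.6]
assert kl(mu, nu) != kl(nu, mu)         # K is not symmetric
assert kl(mu, nu) >= hellinger(mu, nu)  # K(mu,nu) >= h(mu,nu)
```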
We will prove a duality property of K(μ, ν) that will be used later on. Then

K(μ, ν) = sup_{f∈V} ∫ f dμ.

Also,

∫ f dμ = λμ(A) → ∞ as λ → ∞.

Now let dμ = g dν. First, we note that the function f = ln g belongs to V and

K(μ, ν) = ∫ ln g dμ.

Therefore,

K(μ, ν) ≤ sup_{f∈V} ∫ f dμ.

f = λχ_A − ln[(e^λ − 1)ν(A) + 1] ∈ V.
Lemma 4.29 (Fano’s inequality) Let (X, Σ) be a measurable space and let A_i ∈ Σ, i ∈ {0, 1, . . . , M}, be such that A_i ∩ A_j = ∅ for all i ≠ j. Assume that μ_i, i ∈ {0, 1, . . . , M}, are probability measures on (X, Σ). Denote

p := sup_{0≤i≤M} μ_i(X \ A_i).

Then

inf_{j∈{0,1,...,M}} (1/M) ∑_{i: i≠j} K(μ_i, μ_j) ≥ Δ_M(p),

where

Δ_M(p) := (1 − p) ln( (1 − p)(M − p)/p² ) − ln( (M − p)/(Mp) ).

Proof Let us define

a := inf_{0≤i≤M} μ_i(A_i).
Next,

(1/M) ∑_{i=1}^{M} K(μ_i, μ_0) ≥ λ (1/M) ∑_{i=1}^{M} μ_i(A_i) − (1/M) ∑_{i=1}^{M} ln[(e^λ − 1)μ_0(A_i) + 1].

Using the Jensen inequality for the concave function ln x, we continue:

(1/M) ∑_{i=1}^{M} K(μ_i, μ_0) ≥ λa − ln[ (e^λ − 1)(1/M) ∑_{i=1}^{M} μ_0(A_i) + 1 ]

= λa − ln[ (e^λ − 1)(1/M) μ_0(∪_{i=1}^{M} A_i) + 1 ].

By the inequality

μ_0(∪_{i=1}^{M} A_i) ≤ 1 − μ_0(A_0) ≤ 1 − a,

we get

(1/M) ∑_{i=1}^{M} K(μ_i, μ_0) ≥ λa − ln[ (e^λ − 1)(1/M) μ_0(∪_{i=1}^{M} A_i) + 1 ]

≥ λa − ln[ (e^λ − 1)(1 − a)/M + 1 ].

It is clear that we obtain the same inequality if we replace μ_0 by μ_j with some j ∈ [1, M] and replace the summation over [1, M] by summation over [0, M] \ {j}. Therefore, we get

inf_{j∈{0,1,...,M}} (1/M) ∑_{i: i≠j} K(μ_i, μ_j) ≥ λa − ln[ (e^λ − 1)(1 − a)/M + 1 ].
where

Ez(f) := (1/m) ∑_{i=1}^{m} (f(x_i) − y_i)²

is the empirical error (risk) of f. This f_{z,H} is called the empirical optimum, or the least squares estimator (LSE).
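For a finite hypothesis class H the LSE can be found by direct search over H; the following sketch (with a made-up data model and candidate class, used only to illustrate the definitions of Ez and f_{z,H}) shows the computation:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 200
x = rng.uniform(0, 1, m)
y = np.clip(np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(m), -1, 1)

# a finite hypothesis class H: a grid of trigonometric candidates
H = [lambda t, a=a, b=b: a * np.sin(2 * np.pi * t) + b * np.cos(2 * np.pi * t)
     for a in np.linspace(-1, 1, 21) for b in np.linspace(-1, 1, 21)]

def empirical_error(f):
    # Ez(f) = (1/m) sum (f(x_i) - y_i)^2
    return np.mean((f(x) - y) ** 2)

f_lse = min(H, key=empirical_error)  # the least squares estimator f_{z,H}
print(empirical_error(f_lse))
```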
In Section 4.1 we discussed the importance of characteristics of a class W closely related to the concept of entropy numbers. In this section it will be convenient for us to define the entropy numbers in the same way as in Section 4.1. This definition differs slightly from the definition used in Chapter 3. For a compact subset W of a Banach space B we define the entropy numbers as follows:

ε_n(W, B) := ε_n²(W, B) := inf{ ε : ∃ f_1, . . . , f_{2^n} ∈ W : W ⊂ ∪_{j=1}^{2^n} (f_j + εU(B)) },

where U(B) is the unit ball of the Banach space B. A set {f_1, . . . , f_N} ⊂ W satisfying the condition W ⊂ ∪_{j=1}^{N} (f_j + εU(B)) is called an ε-net of W. We denote by N(W, ε, B) the covering number, that is, the minimal number of points in ε-nets of W. We note that N(W, ε_n(W, B), B) ≤ 2^n and

N_ε(W, B) ≤ N(W, ε, B) ≤ N_{ε/2}(W, B).

This implies that

ε_n¹(W, B) ≤ ε_n²(W, B) ≤ 2ε_n¹(W, B),

where N_ε(W, B) and ε_n¹(W, B) are defined in Chapter 3. Therefore, all the conditions that we will impose on {ε_n(W, B)} can be expressed in an equivalent way in terms of either {ε_n¹(W, B)} or {ε_n²(W, B)}.
We mentioned in Section 4.1 that in DeVore et al. (2006) and Konyagin and Temlyakov (2004) the restrictions on a class W were imposed in the following form:

ε_n(W, C) ≤ Dn^{−r}, n = 1, 2, . . . , W ⊂ D·U(C). (4.32)

We denote by S^r the collection of classes satisfying (4.32). In Konyagin and Temlyakov (2007) weaker restrictions were imposed, i.e.

ε_n(W, L_2(ρ_X)) ≤ Dn^{−r}, n = 1, 2, . . . , W ⊂ D·U(L_2(ρ_X)). (4.33)

We denote by S_2^r the collection of classes satisfying (4.33).
After building f_z we need to choose an appropriate way to measure the discrepancy between f_z and the target function f_W. In Cucker and Smale (2001) the quality of approximation is measured by E(f_z) − E(f_W). It is easy to see (Proposition 4.31 below) that for any f ∈ L_2(ρ_X)

E(f) − E(f_ρ) = ||f − f_ρ||²_{L_2(ρ_X)}. (4.34)

Thus the choice ||·|| = ||·||_{L_2(ρ_X)} seems natural. This norm was also used in DeVore et al. (2006) and Konyagin and Temlyakov (2007) for measuring the error. The use of the L_2(ρ_X) norm in measuring the error is one of the reasons for us to consider the restrictions (4.33) instead of (4.32).
One of the important questions discussed in Cucker and Smale (2001), DeVore et al. (2006) and Konyagin and Temlyakov (2007) is to estimate the defect function L_z(f) := E(f) − Ez(f) of f ∈ W. We discuss this question in detail in this section.
If ξ is a random variable (a real-valued function on a probability space Z) then denote

E(ξ) := ∫_Z ξ dρ; σ²(ξ) := ∫_Z (ξ − E(ξ))² dρ. (4.35)

= ∫_Z (f(x) − f_ρ(x))² dρ + 2 ∫_Z (f(x) − f_ρ(x))(f_ρ(x) − y) dρ

|L_z(f) − L_z(f_j)| ≤ ε/2.

f_W = arg min_{f∈W} E(f).

This theorem follows from Theorem 4.33 and the chain of inequalities

E(f_{z,W}) − E(f_W) = E(f_{z,W}) − Ez(f_{z,W}) + Ez(f_{z,W}) − Ez(f_W)
+ Ez(f_W) − E(f_W) ≤ E(f_{z,W}) − Ez(f_{z,W}) + Ez(f_W) − E(f_W).
ε_n(W) ≤ Dn^{−r}, n = 1, 2, . . . , W ⊂ D·U(C). (4.42)

Then

N(W, ε) ≤ 2^{(C_1/ε)^{1/r}+1} ≤ 2^{(C_2/ε)^{1/r}}. (4.43)

Substituting this into Theorem 4.36 and optimizing over ε, we get, for ε = Am^{−r/(1+2r)}, A ≥ A_0(M, D, r),

ρ^m{ z : E(f_{z,W}) − E(f_W) ≤ Am^{−r/(1+2r)} } ≥ 1 − exp( −c(M)A²m^{1/(1+2r)} ).
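The origin of the exponent r/(1 + 2r) can be seen numerically: with ln N(W, ε) ≍ ε^{−1/r} from (4.43), one balances ε^{−1/r} against the deviation exponent ε²m. A minimal sketch (all constants set to 1 for illustration):

```python
import numpy as np

# balance ln N(W, eps) ~ eps^(-1/r) against eps^2 * m: the optimum
# eps ~ m^(-r/(1+2r)) equalizes the two terms
r = 0.5
for m in [10**3, 10**4, 10**5, 10**6]:
    eps = np.geomspace(1e-4, 1, 4000)
    total = eps ** (-1.0 / r) - eps ** 2 * m    # log of N(W,eps)*exp(-eps^2 m)
    eps_star = eps[np.argmin(np.abs(total))]    # where the exponent changes sign
    print(m, eps_star, m ** (-r / (1 + 2 * r)))
```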
Theorem 4.37 Assume that W ∈ S^r and that ρ, W satisfy (4.36). Then for

η ≥ η_m := A_0(M, D, r) m^{−r/(1+2r)}

|ξ| ≤ 2M ||f_1 − f_2||_∞ a.e.
≤ ρ^m{ z : sup_{f∈N_I} |L_z(f)| ≥ η/4 } + ∑_{j∈(I,J]} N_j sup_{f∈W} ρ^m{ z : |L_z(f) − L_z(A_{j−1}(f))| ≥ η_j }. (4.50)

By our choice of δ_j = ε_{2^j}, we get N_j ≤ 2^{2^j} < e^{2^j}. Applying Lemma 4.38, we obtain

sup_{f∈W} ρ^m{ z : |L_z(f) − L_z(A_{j−1}(f))| ≥ η_j } ≤ 2 exp( −mη_j²/(30M²δ_{j−1}²) ).

From the definition (4.48) of η_j we get

mη_j²/(30M²δ_{j−1}²) = mη²2^{j+1}

and

N_j exp( −mη_j²/(30M²δ_{j−1}²) ) ≤ exp(−mη²2^j).

Therefore

∑_{j∈(I,J]} N_j exp( −mη_j²/(30M²δ_{j−1}²) ) ≤ 2 exp(−mη²2^I). (4.51)

By Theorem 4.21,

ρ^m{ z : sup_{f∈N_I} |L_z(f)| ≥ η/4 } ≤ 2N_I exp( −mη²/C(M) ). (4.52)
Theorem 4.40 Assume that ρ, W satisfy (4.36) and that W is such that

∑_{n=1}^{∞} n^{−1/2} ε_n(W) = ∞.

S_J := ∑_{j=1}^{J} 2^{(j+1)/2} ε_{2^{j−1}}.
Proof This proof differs from the above proof of Theorem 4.39 only in the choice of the auxiliary sequence {η_j}. Thus we keep the notations from the proof of Theorem 4.39. Now, instead of (4.48), we define {η_j} as follows:

η_j := (η/4) 2^{(j+1)/2} ε_{2^{j−1}} / S_J.

Proceeding as in the proof of Theorem 4.39 with I = 1, we need to check that

2^j − mη_j²/(30M²δ_{j−1}²) ≤ −2^j.

Indeed, using the assumption m(η/S_J)² ≥ 480M² we obtain
We formulate three corollaries of Theorem 4.40. All the proofs are similar; we prove only Corollary 4.44 here.

Proof We apply Theorem 4.40 to N_δ(W). First of all we note that for n such that ε_n(W) ≤ δ we have ε_n(N_δ(W)) = 0. Also, for n such that ε_n(W) > δ we have

ε_n(N_δ(W)) ≤ ε_n(W) + δ ≤ 2ε_n(W).

Next,

D2^{−r(J_δ−1)} ≥ ε_{2^{J_δ−1}} > δ implies 2^{J_δ} ≤ 2(D/δ)^{1/r}.

Thus

S_J ≤ C_1(D, r)(1/δ)^{1/(2r)−1}.

It remains to apply Theorem 4.40.
The above results on the defect function imply the corresponding results for E(f_{z,W}) − E(f_W). We formulate these results as one theorem.

Theorem 4.45 Assume that ρ and W satisfy (4.32) and (4.36). Then we have the following estimates:

ρ^m{ z : E(f_{z,W}) − E(f_W) ≤ η } ≥ 1 − C(M, D, r) exp(−c(M)mη²), (4.53)

E_H(f) := E(f) − E(f_H); E_{H,z}(f) := Ez(f) − Ez(f_H);
Lemma 4.48 Assume that H is convex and that ρ and H satisfy (4.36). Then we have, for f ∈ H (writing ℓ(f) for the random variable ℓ(f)(z) := (f(x) − y)² − (f_H(x) − y)², so that E(ℓ(f)) = E_H(f)),

σ² := σ²(ℓ(f)) ≤ 4M² E_H(f).

Proof We have

σ²(ℓ(f)) ≤ E(ℓ(f)²) = E( (f(x) − f_H(x))² (f(x) + f_H(x) − 2y)² )

≤ 4M² E( (f(x) − f_H(x))² ) = 4M² ||f − f_H||²_{L_2(ρ_X)} ≤ 4M² E_H(f).

At the last step we have used Lemma 4.47.
Lemma 4.49 Assume that H is convex and that ρ and H satisfy (4.36). Let f ∈ H. For all ε > 0, α ∈ (0, 1], one has

ρ^m{ z : E_H(f) − E_{H,z}(f) ≥ α(E_H(f) + ε) } ≤ exp( −α²εm/(5M²) );

ρ^m{ z : E_H(f) − E_{H,z}(f) ≤ −α(E_H(f) + ε) } ≤ exp( −α²εm/(5M²) ).

Proof Denote a := E_H(f). The proofs of both inequalities are the same, so we will only carry out the proof of the first one. Using the one-sided Bernstein inequality (4.39) for ℓ(f), we obtain

ρ^m{ z : E_H(f) − E_{H,z}(f) ≥ α(a + ε) } ≤ exp( −mα²(a + ε)²/(2(σ² + M²α(a + ε)/3)) ).

It remains to check that

(a + ε)² / (2(σ² + M²α(a + ε)/3)) ≥ ε/(5M²). (4.55)

Using Lemma 4.48 we get, on the one hand,

2(σ² + M²α(a + ε)/3) ≤ M²(9a + 2ε/3); (4.56)

on the other hand,

5M²(a + ε)² ≥ M²ε(10a + 5ε). (4.57)

Comparing (4.56) and (4.57), we obtain (4.55).
Lemma 4.50 Assume that ρ and H satisfy (4.36). Let α ∈ (0, 1), ε > 0, and let f ∈ H be such that

(E_H(f) − E_{H,z}(f)) / (E_H(f) + ε) < α. (4.58)

Then for all g ∈ H such that ||f − g||_{C(X)} ≤ αε/(4M) we have

(E_H(g) − E_{H,z}(g)) / (E_H(g) + ε) < 2α. (4.59)

Proof Denote

a := E_H(f), a′ := E_H(g), b := E_{H,z}(f), b′ := E_{H,z}(g).

Then our assumption ||f − g||_{C(X)} ≤ αε/(4M) implies

|a − a′| ≤ αε/2, |b − b′| ≤ αε/2. (4.60)

By (4.58) and (4.60) we get (a ≥ 0)

a(1 − α) < b + αε ≤ b′ + 3αε/2. (4.61)

Also, by (4.60),

a(1 − α) ≥ (a′ − αε/2)(1 − α) ≥ a′ − αa′ − αε/2. (4.62)

Combining (4.61) and (4.62) we obtain

a′ − b′ < αa′ + 2αε ≤ 2α(a′ + ε),

which implies (4.59).
A combination of Lemmas 4.49 and 4.50 yields the following theorem.

Theorem 4.51 Assume that H is convex and that ρ, H satisfy (4.36). Then for all ε > 0 and α ∈ (0, 1)

ρ^m{ z : sup_{f∈H} (E_H(f) − E_{H,z}(f)) / (E_H(f) + ε) ≥ 2α } ≤ N(H, αε/(4M), C(X)) exp( −α²εm/(5M²) ).

Proof Let f_1, . . . , f_N be the (αε/4M)-net of H in C(X), N := N(H, αε/4M, C(X)). Let Ω be the set of z such that for all j = 1, . . . , N we have

(E_H(f_j) − E_{H,z}(f_j)) / (E_H(f_j) + ε) < α.

Then, by Lemma 4.49,

ρ^m(Ω) ≥ 1 − N exp( −α²εm/(5M²) ). (4.63)

We take any z ∈ Ω and any g ∈ H. Let f_j be such that ||g − f_j||_{C(X)} ≤ αε/(4M). By Lemma 4.50 we obtain that

(E_H(g) − E_{H,z}(g)) / (E_H(g) + ε) < 2α.

It remains to use (4.63).
Theorem 4.52 Let H be a compact and convex subset of C(X) and let ρ, H satisfy (4.36). Then for all ε > 0, with probability at least

p(H, ε) := 1 − N(H, ε/(16M), C(X)) exp( −εm/(80M²) ),

one has, for all f ∈ H,

E(f) − E(f_H) ≤ 2(Ez(f) − Ez(f_H)) + ε. (4.64)

Proof Using Theorem 4.51 with α = 1/4 we get, with probability at least p(H, ε),

E_H(f) ≤ 2E_{H,z}(f) + ε. (4.65)

Substituting

E_H(f) := E(f) − E(f_H); E_{H,z}(f) := Ez(f) − Ez(f_H),

we obtain (4.64).
The definition of f_H and the inequality (4.65) imply that for any f ∈ H we have 2E_{H,z}(f) + ε ≥ 0. Applying this inequality to f = f_{z,H} we obtain the following corollary.

Ez(f_H) − Ez(f_{z,H}) ≤ ε/2
Proof of Theorem 4.46 We use (4.65) with f = f_{z,H}. From the definition of f_{z,H} we obtain that E_{H,z}(f_{z,H}) ≤ 0. This completes the proof of Theorem 4.46.
The following theorem is a direct corollary of Theorem 4.46.

Theorem 4.54 Assume that H ∈ S^r is convex and that ρ, H satisfy (4.36). Then for η ≥ η_m := A_0(M, D, r)m^{−r/(1+r)} one has

ρ^m{ z : E(f_{z,H}) − E(f_H) ≥ η } ≤ exp(−c(M)mη).
Proof As in (4.43), we get from the assumption H ∈ S^r that

Proof We have

||f − f_H||_{L_2(ρ_X)} ≤ ||f − f_ρ||_{L_2(ρ_X)} + ||f_ρ − f_H||_{L_2(ρ_X)}.

Next,

||f − f_ρ||²_{L_2(ρ_X)} = E(f) − E(f_ρ) = E(f) − E(f_H) + E(f_H) − E(f_ρ).

Combining the above two relations, we get

||f − f_H||²_{L_2(ρ_X)} ≤ 2( ||f − f_ρ||²_{L_2(ρ_X)} + ||f_H − f_ρ||²_{L_2(ρ_X)} )

≤ 2( E(f) − E(f_H) + 2||f_H − f_ρ||²_{L_2(ρ_X)} ).
Lemma 4.58 Assume that ρ and H satisfy (4.36). Then we have, for f ∈ H,

σ² := σ²(ℓ(f)) ≤ 8M²(E_H(f) + 2δ(H)).

Proof It is clear that |ℓ(f)| ≤ M². Therefore,

σ²(ℓ(f)) ≤ E(ℓ(f)²) = E( (f(x) − f_H(x))²(f(x) + f_H(x) − 2y)² )

≤ 4M² E( (f(x) − f_H(x))² ) = 4M² ||f − f_H||²_{L_2(ρ_X)} ≤ 8M²(E_H(f) + 2δ(H)).

At the last step we have used Lemma 4.57.
Lemma 4.59 Assume that ρ and H satisfy (4.36). Let f ∈ H. For all ε > 0, α ∈ (0, 1], one has

ρ^m{ z : E_H(f) − E_{H,z}(f) ≥ α(E_H(f) + ε) } ≤ exp( −α²ε²m/(32M²(ε + δ(H))) );

ρ^m{ z : E_H(f) − E_{H,z}(f) ≤ −α(E_H(f) + ε) } ≤ exp( −α²ε²m/(32M²(ε + δ(H))) ).

Proof Denote a := E_H(f). The proofs of both inequalities are the same. We will carry out only the proof of the first one. Using the one-sided Bernstein inequality (4.39) for ℓ(f) we obtain

ρ^m{ z : E_H(f) − E_{H,z}(f) ≥ α(a + ε) } ≤ exp( −mα²(a + ε)²/(2(σ² + M²α(a + ε)/3)) ).

It remains to check that

(a + ε)² / (2(σ² + M²α(a + ε)/3)) ≥ ε²/(32M²(ε + δ(H))). (4.66)

Using Lemma 4.58 we get, on the one hand,
Theorem 4.60 Assume that ρ and H satisfy (4.36) and are such that E(f_H) − E(f_ρ) ≤ δ. Then, for all ε > 0 and α ∈ (0, 1),

ρ^m{ z : sup_{f∈H} (E_H(f) − E_{H,z}(f)) / (E_H(f) + ε) ≥ 2α } ≤ N(H, αε/(4M), C(X)) exp( −α²ε²m/(32M²(ε + δ)) ).

Proof Let f_1, . . . , f_N be the (αε/4M)-net of H in C(X), N := N(H, αε/4M, C(X)). Let Ω be the set of z such that for all j = 1, . . . , N we have

(E_H(f_j) − E_{H,z}(f_j)) / (E_H(f_j) + ε) < α.

Then, by Lemma 4.59,

ρ^m(Ω) ≥ 1 − N exp( −α²ε²m/(32M²(ε + δ)) ). (4.69)

We take any z ∈ Ω and any g ∈ H. Let f_j be such that ||g − f_j||_{C(X)} ≤ αε/(4M). By Lemma 4.50 we obtain

(E_H(g) − E_{H,z}(g)) / (E_H(g) + ε) < 2α.

It remains to use (4.69).
Theorem 4.61 Let H be a compact subset of C(X) such that E(f_H) − E(f_ρ) ≤ δ. Then, for all ε > 0, with probability at least

p(H, ε, δ) := 1 − N(H, ε/(16M), C(X)) exp( −ε²m/(2^9 M²(ε + δ)) ),

one has, for all f ∈ H,

E(f) − E(f_H) ≤ 2(Ez(f) − Ez(f_H)) + ε. (4.70)

Proof Using Theorem 4.60 with α = 1/4 we get, with probability at least p(H, ε, δ),

E_H(f) ≤ 2E_{H,z}(f) + ε. (4.71)

Substituting

E_H(f) := E(f) − E(f_H); E_{H,z}(f) := Ez(f) − Ez(f_H),

we obtain (4.70).

Corollary 4.62 Under the assumptions of Theorem 4.61 we have

Ez(f_H) − Ez(f_{z,H}) ≤ ε/2,

|ξ| ≤ M², σ(ξ) ≤ 2Mδ.
Theorem 4.64 Assume that ρ, W satisfy (4.36) and that W is such that

∑_{n=1}^{∞} n^{−1/2} ε_n(W, L_2(ρ_X)) < ∞. (4.73)

Let mη² ≥ 1. Then for any δ satisfying δ² ≥ η we have, for a minimal δ-net N_δ(W) of W in the L_2(ρ_X) norm,
f_J := A_J(f), f_j := A_j(f_{j+1}), j = 1, . . . , J − 1.

L_z(f_J) = L_z(f_J) − L_z(f_{J−1}) + · · · + L_z(f_{I+1}) − L_z(f_I) + L_z(f_I).

Therefore

≤ ρ^m{ z : sup_{f∈N_I} |L_z(f_I)| ≥ η/2 } + ∑_{j∈(I,J]} N_j sup_{f∈W} ρ^m{ z : |L_z(f) − L_z(A_{j−1}(f))| ≥ η_j }. (4.77)

By our choice of δ_j = ε_{2^j}(W, L_2(ρ_X)) we get N_j ≤ 2^{2^j} < e^{2^j}. Let η, δ be such that mη² ≥ 1 and η ≤ δ². It is clear that δ_j² ≥ η_j, j = I, . . . , J. Applying Lemma 4.63 we obtain, for j ∈ [I, J],

sup_{f∈W} ρ^m{ z : |L_z(f) − L_z(A_{j−1}(f))| ≥ η_j } ≤ 2 exp( −mη_j²/(9M²δ_{j−1}²) ).

mη_j²/(9M²δ_{j−1}²) = mη²2^{j+1}

and

N_j exp( −mη_j²/(9M²δ_{j−1}²) ) ≤ exp(−mη²2^j).

Therefore

∑_{j∈(I,J]} N_j exp( −mη_j²/(9M²δ_{j−1}²) ) ≤ 2 exp(−mη²2^I). (4.78)

By Theorem 4.32,

ρ^m{ z : sup_{f∈N_I} |L_z(f)| ≥ η/2 } ≤ 2N_I exp( −mη²/C(M) ). (4.79)
Theorem 4.65 Assume that ρ, W satisfy (4.36), and W ∈ S_2^r with r > 1/2. Let mη^{1+max(1/r,1)} ≥ A_0(M, D, r) ≥ 1. Then there exists an estimator f_z ∈ W such that

Proof It suffices to prove the theorem for r ∈ (1/2, 1]. Let us take δ_0 := η^{1/2} and H_0 := N_{δ_0}(W) to be a minimal δ_0-net for W. Let δ := η/(2M) and H := N_δ(W) to be a minimal δ-net for W. Denote f_z := f_{z,H}. For any f ∈ H there is A(f) ∈ H_0 such that ||f − A(f)||_{L_2(ρ_X)} ≤ δ_0. By Lemma 4.63,

ρ^m{ z : |L_z(f) − L_z(A(f))| ≤ η } ≥ 1 − 2 exp( −mη/(9M²) ).

Using the above inequality and Theorem 4.64 (mη² ≥ 1) we get

Let us specify A_0(M, D, r) := max(18M²(2MD)^{1/r}, 1), r ∈ (1/2, 1]. Then
Proof In the case J = 0 the statement of Theorem 4.66 follows from Theorem 4.32. In the case J ≥ 1 the proof differs from the proof of Theorem 4.64 only in the choice of the auxiliary sequence {η_j}. Thus we retain the notations from the proof of Theorem 4.64. Now, instead of (4.75), we define {η_j} as follows:

η_j := (η/2) 2^{(j+1)/2} ε_{2^{j−1}} / S_J.

Proceeding as in the proof of Theorem 4.64 with I = 1, we need to check that

2^j − mη_j²/(9M²δ_{j−1}²) = 2^j − 2^{j+1} m(η/S_J)²/(36M²) ≤ −2^j.
The proofs of both corollaries are the same. We present here only the proof of Corollary 4.68.

Proof of Corollary 4.68 We use Theorem 4.66. As in the proof of Theorem 4.66, it is sufficient to consider the case J ≥ 1. We estimate the S_J from Theorem 4.66:

S_J = ∑_{j=1}^{J} 2^{(j+1)/2} ε_{2^{j−1}} ≤ 2^{1/2+r} D ∑_{j=1}^{J} 2^{j(1/2−r)} ≤ C_1(r)D 2^{J(1/2−r)}.

Next,

D2^{−r(J−1)} ≥ ε_{2^{J−1}} > δ implies 2^J ≤ 2(D/δ)^{1/r}.

Thus

S_J ≤ C_1(D, r)(1/δ)^{1/(2r)−1}.

It remains to apply Theorem 4.66.
We now prove an analog of Theorem 4.65.

Theorem 4.69 Assume that ρ, W satisfy (4.36) and that W ∈ S_2^r with r ∈ (0, 1/2]. Let mη^{1+1/r} ≥ A_0(M, D, r) ≥ 1. Then there exists an estimator f_z ∈ W such that

ρ^m{ z : E(f_z) − E(f_W) ≤ 5η }

We complete the proof in the same way as in the proof of Theorem 4.65.
4.3.7 Estimates for classes from S_1^r
In this subsection we demonstrate how a combination of a geometric assumption (convexity) and a complexity assumption (W ∈ S_1^r) results in an improved bound for the accuracy of estimation. We first prove the corresponding theorem and then give a discussion.

Theorem 4.70 Let W be a convex and compact subset of L_1(ρ_X) and let ρ, W satisfy (4.36). Assume W ∈ S_1^r, that is

|N| ≤ 2^{(D/ε_0)^{1/r}+1}. (4.86)

As an estimator f_z we take

We take ε ≥ ε_0 and apply the first inequality of Lemma 4.49 with α = 1/2 to each f ∈ N. In such a way we obtain a set Ω_1 with

ρ^m(Ω_1) ≥ 1 − |N| exp( −εm/(20M²) )

with the following property: for all f ∈ N and all z ∈ Ω_1 one has

E_W(f) ≤ 2E_{W,z}(f) + ε. (4.87)

Therefore, for z ∈ Ω_1,
and

AC_m(M(Θ, μ), η) ≥ Ce^{−c(r)mη²} for η ≥ η_m. (4.93)

Remark 4.72 Theorem 4.71 holds in the case Θ ⊂ (M/4)U(C(X)), |y| ≤ M, with constants allowed to depend on M.
The lower estimates from Theorem 4.71 will serve as a benchmark for the performance of particular estimators. Let us formulate a condition on a measure ρ and a class Θ that we will often use:

for η ≥ η_m^+.
Theorem 4.74 solves the optimization problem. Let us now draw some conclusions. First of all, Theorem 4.74 shows that the entropy numbers ε_n(Θ, L_2(μ)) are the right characteristic of the class Θ in the estimation problem. The behavior of the sequence {ε_n(Θ, L_2(μ))} determines the behavior of the sequence {AC_m(M(Θ, μ), η)} of the AC-functions. Second, the proof of Theorem 4.73 points out that the optimal (in the sense of order) estimator can always be constructed as a least squares estimator. Theorem 4.74 discovers a new phenomenon – a sharp phase transition. The behavior of the accuracy confidence function changes dramatically within the critical interval [η_m^−, η_m^+]. It drops from a constant δ_0 to an exponentially small quantity exp(−cm^{1/(1+2r)}). One may also call the interval [η_m^−, η_m^+] the interval of phase transition.
Let us make a general remark on the technique that we use in this section: it usually consists of a combination of results from non-parametric statistics with results from approximation theory. Both the results from non-parametric statistics and the results from approximation theory that we use are either known or very close to known results. For example, in the proof of Theorem 4.73 we have used a statistical technique that was used in many papers (for example, Barron, Birgé and Massart (1999), Cucker and Smale (2001) and Lee, Bartlett and Williamson (1998)) and goes back to Barron's seminal paper, Barron (1991). We have also used some elementary results on the entropy numbers from approximation theory.
We now proceed to the results from this section on the construction of universal (adaptive) estimators. Let a, b be two positive numbers. Consider a collection K(a, b) of compacts K_n in C(X) satisfying

The following two theorems that we prove here form a basis for constructing universal estimators. We begin with the definition of our estimator. As above, let K := K(a, b) be a collection of compacts K_n in C(X) satisfying (4.95). We take a parameter A ≥ 1 and consider the following estimator, which we call the penalized least squares estimator (PLSE):

f_z^A := f_{z,K_{n(z)}},

with

n(z) := arg min_{1≤j≤m} ( Ez(f_{z,K_j}) + Aj ln m/m ).
For a set L in a Banach space B denote

d(f, L)_B := inf_{g∈L} ||f − g||_B.

Theorem 4.75 For K := {K_n}_{n=1}^{∞} satisfying (4.95) and M > 0 there exists A_0 := A_0(a, b, M) such that, for any A ≥ A_0 and any ρ such that ρ, K_n, n = 1, 2, . . . , satisfy (4.94), we have

||f_z^A − f_ρ||²_{L_2(ρ_X)} ≤ min_{1≤j≤m} ( 3d(f_ρ, K_j)²_{L_2(ρ_X)} + 4Aj ln m/m )

with probability ≥ 1 − m^{−c(M)A}.
Theorem 4.75 is from Temlyakov (2008a). The reader can find results in the style of Theorem 4.75, with bounds on the expectation E_{ρ^m}( ||f_z^A − f_ρ||²_{L_2(ρ_X)} ), in Györfy et al. (2002), chap. 12.
Theorem 4.76 Let compacts {K_n} satisfy (4.95) and M > 0 be given. There exists A_0 := A_0(a, b, M) ≥ 1 such that for any A ≥ A_0 and any ρ satisfying

d(f_ρ, K_n)_{L_2(ρ_X)} ≤ A^{1/2} n^{−r}, n = 1, 2, . . . , (4.96)

and such that ρ, K_n, n = 1, 2, . . . , satisfy (4.94), we have for η ≥ A^{1/2}(ln m/m)^{r/(1+2r)}

ρ^m{ z : ||f_z^A − f_ρ||_{L_2(ρ_X)} ≥ 4A^{1/2}η } ≤ Ce^{−c(M)mη²}.
d(f_ρ, H_n)_{L_2(ρ_X)} ≤ Cn^{−r}.

Usually, H_n is a linear subspace or a nonlinear manifold that is not a bounded subset of the space B(X) of bounded functions with ||f||_{B(X)} := sup_{x∈X} |f(x)|. This is why the truncation operator T_M is used in the construction of the f_z. Results of Györfy et al. (2002) show that the estimator f_z is universal in the sense of expectation. The above approach has a drawback in that the use of the truncation operator entails that the estimator f_z is (in general) not an element of H_n. Thus, one describes a class using approximation by elements of one form and builds an estimator of another, more complex, form.
Let us make a comment on the condition (4.95). This condition is written in a form that allows us to treat simultaneously the case of linear approximation and the case of nonlinear approximation within the framework of a unified general theory. The term (a(1 + 1/ε))^n corresponds to the covering number of an a-ball in an n-dimensional Banach space, and the extra factor n^{bn} takes care of nonlinear n-term approximation (see Section 4.4.3 for details).
In this section we formulate assumptions on a class W in the following form:

ε_n(W, B) ≤ Dn^{−r}, W ⊂ D·U(B). (4.97)

We denote by S^r := S^r(D) := S^r_∞(D) the collection of classes W that satisfy

|N| ≤ 2^{(D²/ε_0)^{1/(2r)}+1}. (4.100)
As an estimator f_z we take

We take ε ≥ ε_0 and apply the first inequality of Lemma 4.78 with α = 1/2 to each f ∈ N. In such a way we obtain a set Ω_1 with

ρ^m(Ω_1) ≥ 1 − |N| exp( −εm/(20M²) )

with the following property: for all f ∈ N and all z ∈ Ω_1 one has

E_ρ(f) ≤ 2E_{ρ,z}(f) + ε. (4.101)

Therefore, for z ∈ Ω_1,

E_ρ(f_z) ≤ 2E_{ρ,z}(f_N) + ε ≤ 3E_ρ(f_N) + 2ε ≤ 3ε_0 + 2ε,

with

n(z) := arg min_{1≤j≤m} ( Ez(f_{z,K_j}) + Aj ln m/m ).
Proof We set ε_j := (2Aj ln m)/m for all j ∈ [1, m]. Applying Theorem 4.80 and Lemma 4.78 we find a set Ω with

ρ^m(Ω) ≥ 1 − m^{−c(M)A}

E_ρ(f) ≤ 2E_{ρ,z}(f) + ε_j, ∀f ∈ K_j,

E_{ρ,z}(f_{K_j}) ≤ (3/2) E_ρ(f_{K_j}) + ε_j/2.

We get from here, for z ∈ Ω,

E_ρ(f_z^A) ≤ 2E_{ρ,z}(f_{z,K_{n(z)}}) + ε_{n(z)} = 2( E_{ρ,z}(f_{z,K_{n(z)}}) + An(z) ln m/m )

= 2 min_{1≤j≤m} ( E_{ρ,z}(f_{z,K_j}) + Aj ln m/m )

≤ 2 min_{1≤j≤m} ( E_{ρ,z}(f_{K_j}) + Aj ln m/m )

≤ 2 min_{1≤j≤m} ( (3/2) E_ρ(f_{K_j}) + 2Aj ln m/m ).
Theorem 4.83 Let compacts {K_n} satisfy (4.104). There exists A_0 := A_0(a, b, M) ≥ 1 such that for any A ≥ A_0 and any ρ satisfying

d(f_ρ, K_n)_{L_2(ρ_X)} ≤ A^{1/2} n^{−r}, n = 1, 2, . . . ,

and such that ρ, K_n, n = 1, 2, . . . , satisfy (4.94), we have for η ≥ A^{1/2}(ln m/m)^{r/(1+2r)}

ρ^m{ z : ||f_z^A − f_ρ||_{L_2(ρ_X)} ≥ 4A^{1/2}η } ≤ Ce^{−c(M)mη²}.
Proof Let η ≥ A^{1/2}(ln m/m)^{r/(1+2r)}. We define n as the smallest integer such that 2^n ≥ mη²/ln m. Denote

ε_j := (2Aj ln m)/m, j ∈ (n, m]; ε_j := Aη², j ∈ [1, n].

We apply Theorem 4.80 to K_j with ε = ε_j. Denote by Ω_j the set of all z such that, for any f ∈ K_j,

E_ρ(f) ≤ 2E_{ρ,z}(f) + ε_j. (4.105)

By Theorem 4.80,

ρ^m(Ω_j) ≥ p(K_j, ρ, ε_j).

For estimating p(K_j, ρ, ε_j) we write (j ∈ [n, m]):

ln( N(K_j, ε_j/(16M), C(X)) exp( −ε_j m/(80M²) ) )

≤ j ln(a(1 + 16M/ε_j)) + bj ln j − ε_j m/(80M²)

≤ j ln(a(1 + 8M)) + j(1 + b) ln m − Ac_2(M) j ln m

≤ −Ac_3(M) j ln m ≤ −Ac_3(M) n ln m ≤ −Ac_4(M) mη²

for A ≥ C_1(a, b, M). A similar estimate for j ∈ [1, n] follows from the above estimate with j = n.
Thus (4.105) holds for all 1 ≤ j ≤ m on the set Ω′ := ∩_{j=1}^{m} Ω_j with
For the set Ω := Ω′ ∩ Ω′′ we have the inequalities (4.105) and (4.107) for all j ∈ [1, m]. Let z ∈ Ω. We apply (4.105) to f_z^A = f_{z,K_{n(z)}}. We consider separately two cases: (i) n(z) > n; (ii) n(z) ≤ n. In the first case we obtain

E_ρ(f_z^A) ≤ 2( E_{ρ,z}(f_z^A) + An(z) ln m/m ). (4.109)

Using the definition of f_z^A and the inequality (4.107) we get

E_{ρ,z}(f_z^A) + An(z) ln m/m = min_{1≤j≤m} ( E_{ρ,z}(f_{z,K_j}) + Aj ln m/m )

≤ min_{1≤j≤m} ( E_{ρ,z}(f_{K_j}) + Aj ln m/m )

≤ min( min_{n<j≤m} ( (3/2)E(f_{K_j}) + 2Aj ln m/m ),

min_{1≤j≤n} ( (3/2)E(f_{K_j}) + Aj ln m/m ) + Aη²/2 )

≤ min_{1≤j≤m} ( (3/2)E(f_{K_j}) + 2Aj ln m/m ) + Aη²/2

≤ min_{1≤j≤m} ( (3/2)Aj^{−2r} + 2Aj ln m/m ) + Aη²/2. (4.110)
||f_z^A − f_ρ||_{L_2(ρ_X)} ≤ 4A^{1/2}η.

E_ρ(f_z^A) ≤ 2E_{ρ,z}(f_z^A) + Aη².

Next, we have

E_{ρ,z}(f_z^A) ≤ min_{1≤j≤m} ( E_{ρ,z}(f_{K_j}) + Aj ln m/m ).

Using Lemma 4.78 we continue:

≤ min_{1≤j≤m} ( (3/2)E_ρ(f_{K_j}) + 2Aj ln m/m ) + Aη²/2

≤ min_{1≤j≤m} ( (3/2)Aj^{−2r} + 2Aj ln m/m ) + Aη²/2 ≤ 6Aη².

Therefore,

E_ρ(f_z^A) ≤ 13Aη².
Step (1) For given M, m, E(m), find for δ ∈ (0, 1) the smallest t_m(M, δ) := t_m(M, E(m), δ) such that

It is clear that for η > t_m(M, δ) we have ac_m(M, E, η) ≤ δ, and for η < t_m(M, δ) we have ac_m(M, E, η) > δ.
The following modification of the above t_m(M, δ) is also of interest. We now look for the smallest t_m(M, δ, c) such that
K_n := DU(C(X)) ∩ L_n, n = 1, 2, . . . .

Then for any ν ∈ V, r ∈ [α, β], our assumptions (4.113) and (4.114) imply (by Carl's inequality, Carl (1981)) that

We note that {K_n}_{n=1}^{∞} = K(max(1, 2Q), 0). Therefore, Theorem 4.83 applies to this sequence of compacts. Let us discuss the condition

d(f_ρ, K_n)_{L_2(ρ_X)} ≤ A^{1/2} n^{−r}, n = 1, 2, . . . , (4.118)
from Theorem 4.83. We compare (4.118) with a standard condition in approximation theory,

d(f_ρ, L_n)_{C(X)} ≤ Dn^{−r}, n = 1, 2, . . . , f_ρ ∈ DU(C(X)). (4.119)

First of all we observe that (4.119) implies that there exists ϕ_n ∈ L_n, ||ϕ_n||_{C(X)} ≤ 2D, such that ||f_ρ − ϕ_n||_{C(X)} ≤ Dn^{−r}. Thus (4.119) implies

d(f_ρ, K_n)_{C(X)} ≤ Dn^{−r}, n = 1, 2, . . . , (4.120)

provided Q ≥ 2D. Also, (4.120) implies (4.118) provided A^{1/2} ≥ D. Therefore, Theorem 4.83 can be used for f_ρ satisfying (4.119). We formulate this result as a theorem.

Theorem 4.85 Let L = {L_n}_{n=1}^{∞} be a sequence of n-dimensional subspaces of C(X). For given positive numbers D, M_1, M := M_1 + D, there exists A_0 := A_0(D, M) with the following property: For any A ≥ A_0 there exists an estimator f_z^A such that, for any ρ with the properties |y| ≤ M_1 a.e. with respect to ρ and

d(f_ρ, L_n)_{C(X)} ≤ Dn^{−r}, n = 1, 2, . . . , f_ρ ∈ DU(C(X)),

we have for η ≥ A^{1/2}(ln m/m)^{r/(1+2r)}

ρ^m{ z : ||f_z^A − f_ρ||_{L_2(ρ_X)} ≥ 4A^{1/2}η } ≤ Ce^{−c(M)mη²}. (4.121)

Theorem 4.85 is an extension of theorem 4.10 from DeVore et al. (2006), which yields (4.121) with e^{−c(M)mη²} replaced by e^{−c(M)mη⁴} under an extra restriction r ≤ 1/2.
Example 2 In the previous example we worked in the space C(X). We now want to replace (4.119) by a weaker condition, i.e.

d(f_ρ, L_n)_{L_2(ρ_X)} ≤ Dn^{−r}, n = 1, 2, . . . , f_ρ ∈ DU(L_2(ρ_X)). (4.122)

This condition is compatible with condition (4.118) (from Theorem 4.83) in the sense of approximation in the L_2(ρ_X) norm. However, conditions (4.122) and (4.118) differ in the sense of the approximation set: it is a linear subspace L_n in (4.122) and a compact subset of C(X) in (4.118). In Example 1, the approximation (4.119) by a linear subspace automatically provided the approximation (4.118) by a suitable compact of C(X). It is clear that, similarly to Example 1, approximation (4.122) by a linear subspace L_n provides approximation (4.118) by a compact K_n ⊂ L_n in the L_2(ρ_X) norm instead of the C(X) norm. We cannot apply Theorem 4.83 in such a situation. In order to overcome this difficulty we impose extra restrictions on the sequence L and on the measure ρ. We discuss the setting from Konyagin and Temlyakov (2007). Let B(X) be the Banach space with the norm ||f||_{B(X)} := sup_{x∈X} |f(x)|. Let {L_n}_{n=1}^{∞} be a given sequence of n-dimensional linear subspaces of B(X) such that L_n is also a subspace of each L_∞(μ), where μ is a Borel probability measure on X, n = 1, 2, . . . . Assume that the n-dimensional linear subspaces L_n have the following property: for any Borel probability measure μ on X one has

||P_{L_n}^{μ}||_{B(X)→B(X)} ≤ K, n = 1, 2, . . . , (4.123)

where P_L^{μ} is the operator of the L_2(μ) projection onto L. Then our standard assumption |y| ≤ M_1 implies ||f_ρ||_{L_∞(ρ_X)} ≤ M_1, and (4.122) and (4.113) yield

d(f_ρ, K_n)_{L_2(ρ_X)} ≤ Dn^{−r}, n = 1, 2, . . . ,

where

K_n := (K + 1)M_1 U(B(X)) ∩ L_n.

We note that Theorem 4.83 holds for compacts satisfying (4.104) in the B(X) norm instead of the C(X) norm. Thus, as a corollary of Theorem 4.83 we obtain the following result.
d(f_ρ, L_n)_{L_2(ρ_X)} ≤ Dn^{−r}, n = 1, 2, . . . ,

ρ^m{ z : ||f_z^A − f_ρ||_{L_2(ρ_X)} ≥ 4A^{1/2}η } ≤ Ce^{−c(M)mη²}. (4.124)
Remark 4.87 In Theorem 4.86 we can replace the assumption that L satisfies (4.123) for all Borel probability measures μ by the assumption that (4.123) is satisfied for μ ∈ M, and add the assumption ρ_X ∈ M.
Example 3 Our construction here is based on the concept of the nonlinear Kolmogorov (N, n)-width (Temlyakov (1998d)):

d_n(F, B, N) := inf_{L_N, #L_N ≤ N} sup_{f∈F} inf_{L∈L_N} inf_{g∈L} ||f − g||_B,

Then {K_n}_{n=1}^{∞} = K(max(1, 2Q), b). It is also clear that the condition

min_{1≤j≤N_n} d(f_ρ, L_n^j)_{C(X)} ≤ Dn^{−r}, n = 1, 2, . . . , f_ρ ∈ DU(C(X)), (4.125)

implies

d(f_ρ, K_n)_{C(X)} ≤ Dn^{−r}, n = 1, 2, . . . ,

provided Q ≥ 2D.
We have the following analog of Theorem 4.85.
We have the following analog of Theorem 4.85.
Theorem 4.88 Let L := {Ln }∞ Nn j
n=1 be a sequence of collections Ln := {L n } j=1
j
of n-dimensional subspaces L n of C(X ). Assume Nn ≤ n bn . For given positive
numbers D, M1 , M := M1 + D there exists A0 := A0 (b, D, M) with the
following property: For any A ≥ A0 there exists an estimator f zA such that for
any ρ with the properties: |y| ≤ M1 a.e. with respect to ρ and
min d( f ρ , L n )C (X ) ≤ Dn −r ,
j
n = 1, 2, . . . , f ρ ∈ DU (C(X )),
1≤ j≤Nn
ρ m {z :
f zA − f ρ
L 2 (ρ X ) ≥ 4A1/2 η} ≤ Ce−c(M)mη .
2
We fix a parameter q ≥ 1 and define the best m-term approximation with depth m^q as follows:

σ_{m,q}(f, Ψ)_{L_2(μ)} := inf_{c_i; n_i ≤ m^q} || f − ∑_{i=1}^{m} c_i ψ_{n_i} ||_{L_2(μ)}.
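When Ψ is an orthonormal system in L_2(μ), the infimum in σ_{m,q} is attained by retaining the m largest (in absolute value) coefficients among the first [m^q]; a minimal sketch under this orthonormality assumption:

```python
import numpy as np

def sigma_mq(coeffs, m, q):
    """Best m-term approximation error with depth m^q for an orthonormal
    system: keep the m largest coefficients among the first m^q ones."""
    coeffs = np.asarray(coeffs, float)
    depth = min(int(m ** q), len(coeffs))
    kept = np.sort(np.abs(coeffs[:depth]))[::-1][:m]
    # squared error = total energy minus the energy of the kept coefficients
    return np.sqrt(np.sum(coeffs ** 2) - np.sum(kept ** 2))

c = 1.0 / np.arange(1, 10001)   # coefficients of a model function
for q in [1.0, 1.5, 2.0]:
    print(q, [round(sigma_mq(c, m, q), 4) for m in (5, 10, 20)])
```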
Then

N(K_n(Q), ε, B(X)) ≤ C_{n^q}^{n} (1 + 2Q/ε)^n ≤ (n^q(1 + 2Q/ε))^n. (4.127)

Suppose we have for μ ∈ M(γ)

such that ||f − ϕ_n||_{L_2(μ)} ≤ Dn^{−r} and

||ϕ_n||_{B(X)} ≤ 2DC_1 n^{γq}.
Theorem 4.89 Let Ψ and M(γ) be as above. For given positive numbers q, D, M_1, M := M_1 + D, there exists A_0 := A_0(γ, C_1, q, D, M) with the following property: For any A ≥ A_0 there exists an estimator f_z^A such that, for any ρ with the properties |y| ≤ M_1 a.e. with respect to ρ, ρ_X ∈ M(γ), and

ρ^m{ z : ||f_z^A − f_ρ||_{L_2(ρ_X)} ≥ 4A^{1/2}η } ≤ Ce^{−c(M)mη²}.

Remark 4.90 In Theorem 4.89 the condition (4.126) can be replaced by the following weaker condition. Let m_i ∈ [1, n^q], i = 1, . . . , n, and

L_n := span{ψ_{m_i}}_{i=1}^{n}.
where sup_ρ is taken over ρ such that ρ, W satisfy (4.94). We note that in the case of convex W we have by Lemma 4.47, for any f ∈ W,

||f − (f_ρ)_W||²_{L_2(ρ_X)} ≤ E(f) − E((f_ρ)_W).

Theorem 4.70 provides an upper estimate for the AC^p-function in the case of convex W from S_1^r:

η ≥ η_m ≍ m^{−r/(1+r)}.

AC_m^p(W, η^{1/2}) ≤ exp(−c(M)mη),

provided η ≥ m^{−1/4}. It would be interesting to find the behavior of

sup_{W∈S} AC_m^p(W, η)

in the following cases: (i) S = S^r(D), r ≤ 1/2; (ii) S = S_2^r(D), r < 1; (iii) S = {W : W ∈ S_q^r(D), W is convex}, q = 1, 2, ∞.
Theorem 4.91 There exist two positive constants c_1, c_2 and a class W consisting of the two functions 1 and −1 such that, for every m = 2, 3, . . . and m^{−1/4} ≤ η ≤ 1, there are two measures ρ_0 and ρ_1 such that, for any estimator f_z ∈ W, for one of ρ = ρ_0 or ρ = ρ_1 we have

ρ^m{ z : E(f_z) − E(f_W) ≥ η² } ≥ c_1 exp(−c_2 mη⁴).

f_{ρ_0} = η²; f_{ρ_1} = −η²

and

(f_{ρ_0})_W = 1; (f_{ρ_1})_W = −1.

Therefore,

≥ c_3 2^{−m} exp(−c_2 mη⁴) ∑_{k: |m/2−k| ≤ m^{1/2}} C_m^k ≥ 2c_1 exp(−c_2 mη⁴).
Theorem 4.92 For any r ∈ (0, 1/2] and for every m ∈ N there is W ⊂ U(B([0, 1])) satisfying ε_n(W, B) ≤ (n/2)^{−r} for n ∈ N such that, for every estimator f_z ∈ W, there is a ρ such that |y| ≤ 1 and

Proof As above, let X := [0, 1], Y := [−1, 1] and let ρ_X be the Lebesgue measure on [0, 1]. Define

g_γ(x) := γ_{[2mx+1]}, f_γ(x) := g_γ(x) m^{−r}.

E(f_z) − E((f_ρ)_W) = E_ρ(f_z) − E_ρ((f_ρ)_W) = ||f_z − f_ρ||₂² − ||(f_ρ)_W − f_ρ||₂²

= 2m^{−1−r} |{i : γ_i(z) ≠ γ_i}| = m^{−1−r} ∑_{i=1}^{2m} |γ_i(z) − γ_i|. (4.130)

We will estimate

E_{ρ^m}( E_ρ(f_z) − E_ρ((f_ρ)_W) ) = m^{−1−r} ∑_{i=1}^{2m} ∫_{[0,1]^m} |γ_i((x, f_ρ(x))) − γ_i| dx_1 · · · dx_m.

Γ(x) := {γ ∈ Γ : g_γ(x_j) = ε_j, j = 1, . . . , m}.

Then

∑_{γ∈Γ(x)} |γ_i(x, ε) − γ_i|.

Define

I(x) := {[2mx_j + 1] : j = 1, . . . , m}

and

|Γ| = 2^{2m},

G_i := {x : i ∉ I(x)}.

It is clear that mes(G_i) ≥ (1 − 1/(2m))^m ≥ 1/2. Thus we obtain

∑_{γ∈Γ} E_{ρ^m}( E_ρ(f_z) − E_ρ((f_ρ)_W) ) ≥ m^{−r} 2^{2m}.
Proof We consider two cases separately: (i) w_n ∈ [0, 1/(2n)] and (ii) w_n ∈ (1/(2n), 1/n].
Case (i) Using the convexity of the function e^{−x} we obtain, for any C > 0,

ε_1² + ε_2² = (1 − w_n)^{−2}( (ε − w_n/2)² + (ε + w_n/2)² ) = (1 − w_n)^{−2}(2ε² + w_n²/2).

S := exp(−c(n − 1)ε_1²) + exp(−c(n − 1)ε_2²)

= exp(−c(n − 1)ε_1²)( 1 + exp(−c(n − 1)(ε_2² − ε_1²)) ).

We have the identity

ε_2² − ε_1² = 2εw_n(1 − w_n)^{−2}.
Then for any ε ∈ [0, 1/2], m, w we have |L(ε, m, w)| ≥ 1. Therefore, (4.138) obviously holds for m ≤ 6, ε ∈ [0, 1/2], and for any m > 6, ε ∈ [β, 1/2], β = (ln 2)^{1/2}/5.
We first establish Lemma 4.95 for ε ∈ [0, (9m)^{−1/2}]. We will use a simple property of the Rademacher functions {r_i(t)}.

Lemma 4.96 Let ∑_{i=1}^{n} |c_i| = 1. Then

mes{ t : | ∑_{i=1}^{n} c_i r_i(t) | ≤ 2(9n)^{−1/2} } ≤ 1 − 5/(9n).

Proof Denote

g := ∑_{i=1}^{n} c_i r_i and E := { t : |g(t)| ≤ 2(9n)^{−1/2} }.
with c = 25.

Proof Denote

L(ε, m, w) := { Λ ⊆ [1, m] : | ∑_{i∈Λ} w_i − 1/2 | ≥ ε }.

(1 − s)/a − 2ε/a ≥ −2ε

and, therefore, (4.144) also holds in this case. It remains to use Lemma 4.95. Theorem 4.97 is now proved.
with

δ ≤ ||f_i − f_j||_B, ∀i ≠ j}. (4.145)

It is well known (Pisier (1989)), and easy to check (see Chapter 3), that N(Θ, δ, B) ≤ P(Θ, δ, B). The tight packing numbers are defined as follows. Let 1 ≤ c_1 < ∞ be a fixed real number. We define the tight packing numbers with

δ ≤ ||f_i − f_j||_B ≤ c_1δ, ∀i ≠ j}. (4.146)

Theorem 4.98 provides lower estimates for classes Θ with known lower estimates for the tight packing numbers P̄(Θ, δ). We now show how this theorem can be used in a situation where we know the behavior of the packing numbers P(Θ, δ).
Lemma 4.100 Let Θ be a compact subset of B. Assume that

C_1ϕ(δ) ≤ ln P(Θ, δ) ≤ C_2ϕ(δ), δ ∈ (0, δ_1],

with a function ϕ(δ) satisfying the following condition: for any γ > 0 there is A_γ such that, for any δ > 0,

ϕ(A_γδ) ≤ γϕ(δ). (4.151)

Then there exist c_1 ≥ 1 and δ_2 > 0 such that

ln P̄(Θ, δ, c_1, B) ≥ C_3 ln P(Θ, δ), δ ∈ (0, δ_2].

Proof For δ > 0 we take the set F := {f_i}_{i=1}^{P(Θ,δ)} ⊂ Θ satisfying (4.145). Considering an lδ-net with l ≥ 1 for covering Θ, we obtain that one of the balls of radius lδ contains at least P(Θ, δ)/P(Θ, lδ) points of the set F. Denote this set of points by F_l = {f_i}_{i∈Λ(l)}. Then, obviously, for any i ≠ j ∈ Λ(l) we have

δ ≤ ||f_i − f_j|| ≤ 2lδ.

Therefore

ln P̄(Θ, δ, 2l, B) ≥ ln P(Θ, δ) − ln P(Θ, lδ) ≥ C_1ϕ(δ) − C_2ϕ(lδ).
Remark 4.102 Theorem 4.101 holds in the case Θ ⊂ (M/4)U(C(X)), |y| ≤ M, with constants allowed to depend on M.
We note that we do not impose direct restrictions on the measure μ in Theorem 4.101. However, the assumption (4.152) imposes an indirect restriction. For instance, if μ is a Dirac measure then we always have ε_n(Θ, L_2(μ)) ≲ 2^{−n}. Therefore, Theorem 4.101 does not apply in this case.
Let us make some comments on Theorem 4.101. It is clear that the parameter r controls the size of the compact Θ: the bigger the r, the smaller the compact Θ. In the statement of Theorem 4.101 the parameter r affects the rate of decay of η_m. The quantity η_m is an important characteristic of the estimation process. The inequality (4.153) says that there is no way to estimate f_ρ from Θ with accuracy ≤ η_m with high confidence (> 1 − δ_0). It seems natural that this critical accuracy η_m depends on the size of Θ (on the parameter r). The inequalities (4.153) and (4.154) give
f_z = ∑_{i=1}^{m} w_i(x_1, . . . , x_m, x) y_i (4.156)

of the regression function f_ρ. Consider the following special case of the measure ρ. Let ρ_X = μ be any probability measure on X. We define ρ(y|x) as the Bernoulli measure:

where

w̄_i(x_1, . . . , x_m) := ∫_X w_i(x_1, . . . , x_m, x) dμ.

Therefore, in the case where E(m) is the set of estimators of the form (4.156), we have for M_μ := {ρ : f_ρ = 1/2, ρ_X = μ}

where

Ez(f) := (1/m) ∑_{i=1}^{m} (f(x_i) − y_i)².
Clearly, a crucial role in this approach is played by the choice of the hypothesis space H. In other words, we need to begin our construction of an estimator with a decision about the form of the estimator. In this section we discuss only the case oriented towards the use of nonlinear approximation, in particular greedy approximation, in such a construction. We want to construct a good estimator that provides high accuracy and that is practically implementable. We will discuss a realization of this plan in several stages. We begin with results on accuracy. We give a presentation in a rather general form of nonlinear approximation.
Let D(n, q) := {g_l^n}_{l=1}^{N_n}, n ∈ N, N_n ≤ n^q, q ≥ 1, be a system of bounded functions defined on X. We will consider a sequence {D(n, q)}_{n=1}^{∞} of such systems. In building an estimator based on D(n, q) we are going to use n-term approximations with regard to D(n, q):

G_Λ := ∑_{l∈Λ} c_l g_l^n, |Λ| = n. (4.157)
Theorem 4.103 For K := {K_n}_{n=1}^{∞} satisfying (4.160) and M > 0 there exists A_0 := A_0(a, b, M) such that, for any A ≥ A_0 and any ρ such that |y| ≤ M a.s., we have

||f_z^A − f_ρ||²_{L_2(ρ_X)} ≤ min_{1≤j≤m} ( 3d(f_ρ, K_j)²_{L_2(ρ_X)} + 4Aj ln m/m )

with probability ≥ 1 − m^{−c(M)A}.
It is clear from (4.159) and from the definition of F_n(q) that we can apply Theorem 4.103 to the sequence of compacts {F_n(q)} and obtain the following error bound, valid with probability ≥ 1 − m^{−c(M)A}:

||f_z^A − f_ρ||²_{L_2(ρ_X)} ≤ min_{1≤j≤m} ( 3d(f_ρ, F_j(q))²_{L_2(ρ_X)} + 4Aj ln m/m ). (4.161)

We note that the inequality (4.161) is a Lebesgue-type inequality (see Section 2.6). Indeed, on the left-hand side of (4.161) we have the error of a particular estimator f_z^A built as the PLSE, and on the right-hand side of (4.161) we have d(f_ρ, F_j(q))_{L_2(ρ_X)} – the best error that we can achieve using estimators from F_j(q), j = 1, 2, . . . . We recall that, by construction, f_z^A ∈ F_{n(z)}(q).
Let us now discuss an application of the theory from Section 4.4 in case (ii). We cannot apply that theory directly to the sequence of sets {F_n^T(q)} because we do not know if these sets satisfy the covering number condition (4.160). However, we can modify the sets F_n^T(q) to make them satisfy the condition (4.160). Let c ≥ 0 and define

F_n^T(q, c) := { f : ∃G_Λ := ∑_{l∈Λ} c_l g_l^n, Λ ⊂ [1, N_n], |Λ| = n, ||G_Λ||_{B(X)} ≤ C_2 n^c, f = T_M(G_Λ) }

with some fixed C_2 ≥ 1. Then, using the inequality |T_M(f_1(x)) − T_M(f_2(x))| ≤ |f_1(x) − f_2(x)|, x ∈ X, it is easy to get that

N(F_n^T(q, c), ε, B(X)) ≤ (2C_2(1 + 1/ε))^n n^{(q+c)n}.

Therefore, (4.160) is satisfied with a = 2C_2 and b = q + c. We note that, from a practical point of view, the extra restriction ||G_Λ||_{B(X)} ≤ C_2 n^c is not a big constraint.
The above estimators (built as the PLSE) are very good from the theoretical point of view: their error bounds satisfy Lebesgue-type inequalities. However, they are not good from the point of view of implementation. For example, there is no simple algorithm to find f_{z,F_n(q)}, because F_n(q) is a union of C_{N_n}^{n} M-balls of n-dimensional subspaces. Thus, finding an exact LSE f_{z,F_n(q)} is practically impossible. We now use a remark from Temlyakov (2008a) that allows us to build an approximate LSE with a good approximation error. We proceed to the definition of the penalized approximate least squares estimator (PALSE) (see Temlyakov (2008a)). Let δ := {δ_{j,m}}_{j=1}^{m} be a sequence of non-negative numbers. We define f_{z,δ,K_j} as an estimator satisfying the relation

Ez(f_{z,δ,K_j}) ≤ Ez(f_{z,K_j}) + δ_{j,m}. (4.162)

In other words, f_{z,δ,K_j} is an approximation to the least squares estimator f_{z,K_j}. Next, we take a parameter A ≥ 1 and define the penalized approximate least squares estimator (PALSE)

f_{z,δ}^A := f_{z,δ}^A(K) := f_{z,δ,K_{n(z)}}

with

n(z) := arg min_{1≤j≤m} ( Ez(f_{z,δ,K_j}) + Aj ln m/m ).

The theory developed in Temlyakov (2008a) gives the following control of the error.

Theorem 4.104 Under the assumptions of Theorem 4.103 we have

||f_{z,δ}^A − f_ρ||²_{L_2(ρ_X)} ≤ min_{1≤j≤m} ( 3d(f_ρ, K_j)²_{L_2(ρ_X)} + 4Aj ln m/m + 2δ_{j,m} )

with probability ≥ 1 − m^{−c(M)A}.
We point out here that the approximate least squares estimator f_{z,δ,K_j} approximates the least squares estimator f_{z,K_j} in the sense that Ez(f_{z,δ,K_j}) − Ez(f_{z,K_j}) is small, and not in the sense that ||f_{z,δ,K_j} − f_{z,K_j}|| is small. Theorem 4.104 guarantees a good error bound for any penalized estimator built
from {f_{z,δ,K_j}} satisfying (4.162). We will use greedy algorithms in building an approximate estimator. We now present results from Temlyakov (2006e). We will need more specific compacts F(n, q) and will impose some restrictions on the g_l^n. We assume that ||g_l^n||_{B(X)} ≤ C_1 for all n and l. We consider the following compacts instead of F_n(q):

F(n, q) := { f : ∃Λ ⊂ [1, N_n], |Λ| = n, f = ∑_{l∈Λ} c_l g_l^n, ∑_{l∈Λ} |c_l| ≤ 1 },

such that

||y − v_j||² ≤ d(y, A_1(G))² + Cj^{−1}, C = C(M, C_1).

We define an estimator

f̂_z := f̂_{z,F(j,q)} := ∑_{l∈Λ} a_l g_l^j.

Then we set

f_m := f_{m−1} − β_m ϕ_m, G_m := G_{m−1} + β_m ϕ_m.
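A minimal finite-dimensional sketch of greedy iterations of this type follows (the step choice β_m = ⟨f_{m−1}, ϕ_m⟩, with ϕ_m maximizing the inner product, is the Pure Greedy Algorithm; the precise step-size rules of the OGA and RGA used in the text differ from this simplification):

```python
import numpy as np

def greedy_approximation(y, dictionary, steps):
    """Greedy iterations f_m = f_{m-1} - beta_m phi_m, G_m = G_{m-1} + beta_m phi_m,
    where f_m is the residual and G_m the current approximant.
    `dictionary` holds unit-norm columns; beta_m = <f_{m-1}, phi_m>."""
    f = y.copy()              # residual f_0 = y
    G = np.zeros_like(y)      # approximant G_0 = 0
    for _ in range(steps):
        inner = dictionary.T @ f
        k = int(np.argmax(np.abs(inner)))  # element with largest inner product
        beta = inner[k]
        f = f - beta * dictionary[:, k]
        G = G + beta * dictionary[:, k]
    return G, f

rng = np.random.default_rng(3)
D = rng.standard_normal((50, 200))
D /= np.linalg.norm(D, axis=0)          # normalize dictionary elements
y = D[:, :5] @ rng.standard_normal(5)   # a 5-sparse target
G, resid = greedy_approximation(y, D, steps=20)
print(np.linalg.norm(resid) / np.linalg.norm(y))
```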
For systems D(n, q) the following estimator is considered in Barron et al. (2008). As above, let z = (z_1, . . . , z_m), z_i = (x_i, y_i), be given. Consider the following system of vectors in R^m:

v_{j,l} := (g_l^j(x_1), . . . , g_l^j(x_m)), l ∈ [1, N_j].

We equip R^m with the norm ||v|| := ( m^{−1} ∑_{i=1}^{m} v_i² )^{1/2} and normalize the above system of vectors. Denote the new system of vectors by G^j. Now we apply either the OGA or the above defined version of the RGA to the vector y ∈ R^m with respect to the system G^j. Similarly to the above discussed case of the system G, we obtain an estimator f̂_j. Next, we look for the penalized estimator built from the estimators {f̂_j} in the following way. Let

n(z) := arg min_{1≤j≤m} ( Ez(T_M(f̂_j)) + Aj log m/m ).

Define

f̂ := T_M(f̂_{n(z)}).

Assuming that the systems D(n, q) are normalized in L_2(ρ_X), Barron et al. (2008) proved the following error estimate.

Theorem 4.105 There exists A_0(M) such that for A ≥ A_0 one has the following bound for the expectation of the error:

E ||f_ρ − f̂||²_{L_2(ρ_X)} ≤ min_{1≤j≤m} ( C(A, M, q) j log m/m
5.1 Introduction
This chapter aims to provide the theoretical foundations of processing large data sets. The main technique used to achieve this goal is based on nonlinear sparse representations. Sparse representations of a function are not only a powerful theoretical tool but are also utilized in many applications in image/signal processing and numerical computation. Sparsity is a very important concept in applied mathematics.
We begin with a general setting of mathematical signal processing problems related to sparse representations. This setting consists of three components, which we describe below: assumptions on a signal, tasks, and methods of recovery of a signal. We give only a brief and schematic description of this setting in order to illustrate the idea.
First, assumptions are imposed on a signal.
approximations of f with respect to D. In other words, this assumption means that f belongs to some approximation class.
Second, we discuss two variants of our main task of recovery of the signal f .
By recovery of a signal f we mean finding its representation with respect
to the dictionary D. There are two forms of recovery widely used in signal
processing:
Usually, we apply task (T1) only in the case of sparsity assumption (A1). In
other cases we apply task (T2). In the case of approximate recovery we need
to specify how we measure the error of approximate recovery. There are two
ways to define error, which lead to different settings. Suppose
∞
f = cjgj, g j ∈ D, j = 1, . . . ,
j=1
In the first method we measure the error between f and Am ( f ) in the norm
·
of the space to which the function f belongs (data space). Then the cor-
responding error is
f − Am ( f )
. In the second method we measure the error
between f and Am ( f ) in the norm of the space to which the coefficients of
function f belong (coefficient space). For instance, if φi = g j (i) , i = 1, . . . , m,
then the error in the 2 norm of the coefficient space will be given by
m
f − Am ( f )
22 = |ai − c j (i) |2 + |c j |2 .
i=1 j= j (i),i=1,...,m
w 1 ≥ η2 n 1/2 w 2 , ∀w ∈ WT , η2 > 0.
(CS3) Denote T c := { j}mj=1 \ T . For any T , |T | ≤ ρn/ log m, and for any
w ∈ WT c one has, for any v satisfying T c v = w,
v 1 (T c ) ≥ η3 (log(m/n))−1/2 w 1 , η3 > 0.
dn (B m
p , q ) = d (Bq , p ),
n m
p := p/( p − 1). (5.1)
In other words, it was proved (see (5.3) and (5.2)) that for any pair (m, n) there
exists a subspace Vn , dim Vn ≥ m − n, such that for any x ∈ Vn one has
It has been understood in Donoho (2006) that properties of the null space
N () := {x : x = 0} of a measurement matrix play an important role
in the compressed sensing problem. Donoho (2006) introduced the following
two characteristics of formulated in terms of N ():
w(, F) := sup
x
2
x∈F∩N ()
and
ν(, T ) := sup
x T
1 /
x
1 ,
x∈N ()
(1 − δ S ) c 22 ≤ T c 22 ≤ (1 + δ S ) c 22
for all subsets T with |T | ≤ S and all coefficient sequences {c j } j∈T . A matrix
is said to have then RIP if this constant exists. Candes and Tao (2005)
proved that if δ2S + δ3S < 1 then for S-sparse u one has A (u) = u
(recovery by 1 minimization is exact). They also proved the existence of sens-
ing matrices obeying the condition δ2S + δ3S < 1 for large values of sparsity
S n/ log(m/n).
For a positive number a denote
σa (v)1 := min
v − w
1 .
w∈Rm :| supp(w)|≤a
Candes, Romberg and Tao (2006) proved that, if δ3S + 3δ4S < 2, then
We note that properties of the RIP-type matrices have already been imployed
in Kashin (1977a) (see Kashin and Temlyakov (2007) for a further discus-
sion) for the widths estimation. The inequality (5.3) with an extra factor
1 + log(m/n) has been established in Kashin (1977a). The proof in Kashin
√
(1977a) is based on properties of a random matrix with elements ±1/ n.
Further investigation of the compressed sensing problem has been con-
ducted by Cohen, Dahmen and DeVore (2009). They proved that if satisfies
the RIP of order 2k with δ2k < δ < 1/3 then one has
2 + 2δ
u − A (u)
1 ≤ σk (u)1 . (5.6)
1 − 3δ
In Cohen, Dahmen and DeVore (2009) the inequality (5.6) has been called
instance optimality. In the proof of (5.6) the authors used the following
5.2 Equivalence of three approximation properties 283
property (null space property) of matrices satisfying the RIP of order 3k/2:
for any x ∈ N () and any T with |T | ≤ k we have
x 1 ≤ C x T c 1 . (5.7)
The null space property (5.7) is closely related to the property (A1) from
Donoho (2006). The proof of (5.6) from Cohen, Dahmen and DeVore (2009)
gives an inequality similar to (5.6) under the assumption that has the null
space property (5.7) with C < 2.
We now discuss the results of Kashin and Temlyakov (2007). We say
that a measurement matrix has a strong compressed sensing property with
parameters k and C (SCSP(k, C)) if, for any u ∈ Rm , we have
We say that satisfies the width property with parameter S (WP(S)) if the
following version of (5.4) holds for the null space N ():
x 2 ≤ S −1/2 x 1 .
The main result of Kashin and Temlyakov (2007) states that the above three
properties of are equivalent (see Theorem 5.7 below).
We remark on the notations used in this chapter. Above we used n and m
for the number of measurements and the size of a signal, respectively. These
notations are motivated by their use in the theory of widths, and we continue
to use these notations in Sections 5.2–5.4. In Sections 5.5–5.8 we use the letter
N for the size of a signal instead of m. This notation is more common in
compressed sensing.
Lemma 5.2 Let satisfy (5.10) and let x = 0, x ∈ . Then for any such
that || < S/4 we have
|x j | <
x
1 /2.
j∈
Proof As in (5.11),
|x j | ≤ ||1/2 S −1/2
x
1 <
x
1 /2.
j∈
Lemma 5.3 Let satisfy (5.10). Suppose u ∈ Rm is sparse with | supp(u)| <
S/4. Then for any v = u + x, x ∈ , x = 0, we have
v
1 >
u
1 .
Proof Let := supp(u). Then
v
1 = |v j | = |u j + x j | + |x j |
j∈[1,m] j∈ j ∈
/
≥ |u j | − |x j | + |x j | =
u
1 +
x
1 − 2 |x j |.
j∈ j∈ j ∈
/ j∈
5.2 Equivalence of three approximation properties 285
By Lemma 5.2,
x
1 − 2 |x j | > 0.
j∈
Lemma 5.3 guarantees that the following algorithm, known as the Basis
Pursuit (see A from Section 5.1), will find a sparse u exactly, provided
| supp(u)| < S/4:
u := u + arg min
u + x
1 .
x∈
Theorem 5.4 Let satisfy (5.10). Then for any u ∈ Rm and u such that
u
1 ≤
u
1 , u − u ∈ , one has
Proof It is given that u − u ∈ . Thus, (5.13) follows from (5.12) and (5.10).
We now prove (5.12). Let , || = [S/16], be the set of indices with the
largest absolute value coordinates of u. Denote by u a restriction of u onto
this set, i.e (u ) j = u j for j ∈ and (u ) j = 0 for j ∈
/ . Also denote
u := u − u . Then
We have
u − u
1 =
(u − u )
1 +
(u − u )
1 .
Next,
(u − u )
1 ≤
u
1 +
(u )
1 .
Using u 1 ≤ u 1 , we obtain
(u ) 1 − u 1 = u 1 − u 1 − u 1 + u 1 ≤ (u − u ) 1 .
Therefore,
(u )
1 ≤
u
1 +
(u − u )
1 (5.15)
and
u − u
1 ≤ 2
(u − u )
1 + 2
u
1 . (5.16)
286 Approximation in compressed sensing
Using the fact u − u ∈ we estimate
(u − u )
1 ≤ ||1/2
(u − u )
2
≤ ||1/2
u − u
2 ≤ ||1/2 S −1/2
u − u
1 . (5.17)
Our assumption on || guarantees that ||1/2 S −1/2 ≤ 1/4. Using this and
substituting (5.17) into (5.16) we obtain
u − u
1 ≤
u − u
1 /2 + 2
u
1 ,
which gives (5.12):
u − u
1 ≤ 4
u
1 .
Corollary 5.5 Let satisfy (5.10). Then for any u ∈ Rm one has
u − u
1 ≤ 4σ S/16 (u)1 , (5.18)
−1/2
u − u
2 ≤ (S/16) σ S/16 (u)1 . (5.19)
Proposition 5.6 Let be such that (5.9) holds with u instead of A (u).
Then satisfies (5.10) with S = kC −2 .
Proof Let u ∈ . Then u = 0 and we get, from (5.9),
u
2 ≤ Ck −1/2
u
1 .
The following theorem states that the three properties – SCSP, WCSP, WP –
are equivalent. In particular, it implies that the WP is a necessary and sufficient
condition for either SCSP or WCSP. Thus, if we are looking for matrices that
are good for 1 minimization operator A , say, such that (5.8) holds, then we
should study matrices such that the corresponding null space N () satisfies
(5.10) with appropriate S. The RIP property is merely a sufficient condition for
either the SCSP or WCSP.
Theorem 5.7 The following implications hold:
SCSP(k, C) ⇒ WCSP(k, C);
WP(S) ⇒ SCSP(S/16, 1);
WCSP(k, C) ⇒ WP(C −2 k).
Proof It is obvious that SCSP(k, C) ⇒ WCSP(k, C). Corollary 5.5 with
= N () implies that WP(S) ⇒ SCSP(S/16, 1). Proposition 5.6 with
= N () implies that WCSP ⇒ WP. Thus the three properties are
equivalent.
5.3 Construction of a good matrix 287
Let us discuss the application of this theorem to the setting discussed in
Section 5.1. We study the 1 minimization method A . We impose the assump-
tion (A2) of approximate sparsity of a signal. As a task let us take the following
version of (T2):
u − A (u)
2 ≤ 4Ck −1/2 σk (u)1 . (5.20)
This means that we measure the error of approximate recovery in the 2 norm
of the coefficient space. Then Theorem 5.7 states that the necessary condition
on to guarantee successful performance of the above task is that the null
space N () satisfies the following property: for any x ∈ N () we have
x
2 ≤ Ck −1/2
x
1 . (5.21)
On the other hand, if the null space N () satisfies (5.21) then by Theorem 5.7
we get
u − A (u)
2 ≤ Ck −1/2 σk/(16C 2 ) (u)1 . (5.22)
Thus the property (5.21) of the null space N () is a necessary and sufficient
condition (with different constants C) for fulfilling (T2) under assumptions
(A2). We note that the necessary and sufficient condition on k for the existence
of matrices satisfying (5.21) is the inequality k n/ log(m/n).
The result (5.5) of Candes, Romberg and Tao (2006) states that the RIP with
S n/ log(m/n) implies the SCSP. Therefore, by Theorem 5.7 it implies the
WP with S n/ log(m/n). One can find a direct proof of the above statement
in Kashin and Temlyakov (2007).
Thus,
1
E(t) = 2C(m) er t (1 − r 2 )(m−3)/2 dr. (5.28)
0
To get the upper estimate for E(t) we shall prove two estimates. First we
prove the lower one:
1
m −1/2
(1 − r 2 )(m−3)/2 dr > (1 − r 2 )(m−3)/2 dr
0 0
> (1 − 1/m)(m−1)/2 m −1/2 > (em)−1/2 . (5.29)
0 0
1
etr −mr
2 /2
≤e 3/2
dr
0
∞
2 /(2m)
m −1/2 e−v
2 /2
≤ e3/2+t dv, (5.30)
−tm −1/2
From (5.31) and the inequality (5.23) with b = 3m −1/2 , t = 3m 1/2 n, we get
" # !n
P v ∈ Y : F(x, v) > 3m −1/2 < e−5/2 (2π )1/2 . (5.32)
From (5.31) and (5.24) with a = (0.01)m −1/2 , t = −100m 1/2 n, taking into
account the inequality
∞
e−v /2 dv < e−z /2 /z,
2 2
z > 0,
z
we obtain
" #
P v ∈ Y : F(x, v) < 0.01m −1/2 < (0.01e3 )n . (5.33)
We set
/ 0
l = An/ ln(em/n) , (5.35)
Indeed, due to Lemma 5.9 the P-measure of those v for which (5.36) does
not hold is not greater than
The number A is chosen so that the right-hand side of (5.37) is less than 1. Let
" #
= x ∈ Rm : F(x, v ∗ ) = 0 .
Let x ∈ B1m and x ∈ B m,l be such that x j and x j have the same sign and
|x j | ≤ |x j |, |x j − x j | ≤ 1/l, j = 1, . . . , m. We consider x := x − x . Then
x
2 ≤
x
1
x
∞ ≤ l −1/2 .
1/2 1/2
(5.39)
Proof Let us denote the left-hand side of (5.42) by a and the right-hand side
of (5.42) by b. From the relation
n
n
|ϕ( f )| = ϕ − ck ϕk ( f ) ≤
ϕ − ck ϕk
,
k=1 k=1
Proof The proof will be carried out by induction. The case n = 1 is evident.
Let us assume that a biorthogonal system can be constructed if the number of
functionals is less than n. Clearly, it suffices to prove the existence of f 1 ∈ X
such that
ϕ1 ( f 1 ) = 1, ϕk ( f 1 ) = 0, k = 2, . . . , n.
and
n
ϕ1 f − ϕk ( f )gk = 0,
k=2
which implies
n
ϕ1 ( f ) = ϕ1 (gk )ϕk ( f ).
k=2
5.3 Construction of a good matrix 293
Consequently,
n
ϕ1 = ϕ1 (gk )ϕk ,
k=2
which implies
n
!
ψ( f ) = ϕ( f ) + ψ( f k ) − ϕ( f k ) ϕk ( f ).
k=1
one has
dn (B m
p , q ) = d (Bq , p ),
m n m m
p := p/( p − 1).
Proof By the definition of the Kolmogorov width, we have
dn (B m
p , q ) :=
m
inf sup inf
f − a
q
L n :dim L n ≤n f ∈B m a∈L n
p
n
= inf sup inf
f − ck ϕk
q .
{ϕ1 ,...,ϕn }⊂Rm f ∈B m {ck }
p k=1
Changing the order of the two supremums in the above formula we obtain
dn (B m
p , q ) =
m
inf sup
g
p .
{ϕ1 ,...,ϕn } g:
ϕk ,g=0, j=1,...,n,
g
≤1
q
The presentation of the following theorem follows ideas from Candes (2008).
We use the notations from Section 5.2.
√
Theorem 5.15 Assume δ2s ≤ 2 − 1. Then, for any u, u ∈ Rm , such that
u
1 ≤
u
1 and
u − u
≤
, we have
u − u
2 ≤ C0 s −1/2 σs (u)1 + C1
.
Proof Set h := u − u. Let T0 be the set of s largest coefficients of u. Let
h T0 = (h 1 , . . . , h m ). Reorder the coordinates in a decreasing way:
|h i1 | ≥ |h i2 | ≥ · · · ≥ |h im |.
For j = 1, 2, . . . define
T j := {i s( j−1)+1 , . . . , i s j } ∩ [1, m].
296 Approximation in compressed sensing
Then for cardinality of these sets we have |T j | = s provided s j ≤ m. Clearly
h= hTj .
j≥0
and
h T
2 =
h T , h T =
h T , h − h T j =
h T , h −
h T , h T j .
j≥2 j≥2
5.4 Dealing with noisy data 297
By our assumption,
h
≤
. Therefore,
|
h T , h| ≤
h T
h
≤
h T
. (5.50)
|
h T , h| ≤
(1 + δ2s )1/2
h T
2 . (5.52)
For j ≥ 2 we have
|
h T , h T j | ≤ |
h T0 , h T j | + |
h T1 , h T j |
≤ δ2s
h T0
2
h T j
2 + δ2s
h T1
2
h T j
2
√
≤ 2δ2s
h T
2
h T j
2 .
By (5.44) we obtain
⎛ ⎞
√
(1 − δ2s )
h T
22 ≤
h T
2 ≤
h T
2 ⎝
(1 + δ2s )1/2 + 2δ2s
h T j
2 ⎠ .
j≥2
This implies
√
(1 + δ2s )1/2 2δ2s
h T
2 ≤ +
h T j
2 . (5.53)
(1 − δ2s ) (1 − δ2s )
j≥2
h T 2 ≤ C2 s −1/2 σs (u)1 + C3 .
This bound and inequalities (5.47) and (5.49) complete the proof.
298 Approximation in compressed sensing
5.5 First results on exact recovery of sparse signals;
the Orthogonal Greedy Algorithm
Let be a matrix with columns ϕi ∈ Rn , i = 1, . . . , N . We will also denote
by the dictionary consisting of ϕi , i = 1, . . . , N . Then the following two
statements: (1) u is k-sparse and (2) u is k-sparse with respect to the dictio-
nary are equivalent. Results from Section 5.2 (see Theorem 5.7) imply that
if the matrix has the width property with parameter S then the 1 minimiza-
tion algorithm A recovers exactly S/16-sparse signals. We begin this section
with a discussion of the following question: Which property of is necessary
and sufficient for the existence of a recovery method that recovers exactly all
k-sparse signals?
Let k and k () denote all k-sparse vectors u ∈ R N and all k-sparse with
respect to the dictionary vectors y ∈ Rn , respectively. The following simple
known theorem answers the above question.
Proof We begin with the implication (i) ⇒ (ii). The proof is by contradiction.
Assume that there is a non-zero vector v ∈ 2k ∩ N (). Clearly, v can be
written in the form
v = v1 − v2, v 1 , v 2 ∈ k .
max |
f, ϕi | ≥ A − AM(m − 1).
1≤i≤m
|
f, g| ≤ AMm.
The inequality
A(1 − M(m − 1))t > AMm (5.56)
Arguing as above, we prove that, at the second iteration, the WOGA(t) picks
one of ϕi again, say ϕi2 . It is clear that i 2 = i 1 . We continue this process till
the mth iteration when all the ϕi , i = 1, . . . , m, will be picked up. Taking the
orthogonal projection, we obtain that f m = 0. This completes the proof.
Theorem 5.21 Let D be an M-coherent dictionary. If, for m < (1/2)
(1 + M −1 ), σm ( f ) = 0, then f ∈ m (D).
Proof The proof is by induction. First, consider the case m = 1. Let σ1 ( f ) = 0.
If f = 0 then f ∈ 1 (D). Suppose that f = 0. Take
> 0 and find ϕ1 ∈ D
and a coefficient b1 such that
f − b1 ϕ1
≤
.
We prove that for sufficiently small
the inequality
f − b2 ϕ2
≤
, ϕ2 ∈ D,
implies that ϕ2 = ϕ1 . Indeed, assuming the contrary, ϕ2 = ϕ1 , we obtain by
Lemma 2.57 that
It is clear that (5.57) and (5.58) contradict each other if
is small enough
compared to
f
. Thus, for sufficiently small
, only one dictionary element,
5.5 First results on exact recovery; OGA 301
ϕ1 , provides a good approximation. This implies
f −
f, ϕ1 ϕ1
= 0, which
in turn implies f =
f, ϕ1 ϕ1 .
Consider the case m > 1. Following the induction argument assume that if
σm−1 ( f ) = 0 then f ∈ m−1 . We now have σm ( f ) = 0. If σm−1 ( f ) = 0
then, by the induction assumption, f ∈ m−1 . So, assume that σm−1 ( f ) > 0.
Let
m
f − bi ϕi
≤
, ϕi ∈ D, i = 1, . . . , m. (5.59)
i=1
m
f − c j ψ j
≤
, ψi ∈ D, i = 1, . . . , m. (5.60)
i=1
m
m
bi ϕi − c j ψ j
≤ 2
.
i=1 i=1
m−1
f − bi ϕi
≤
+ |bm | < σm−1 ( f ).
i=1
The obtained contradiction implies that, for small enough
, inequalities (5.59)
and (5.60) imply that {ψ1 , . . . , ψm } = {ϕ1 , . . . , ϕm }. This reduces the problem
to a finite dimensional case, where existence of best approximation is well
known. In our case it is an orthogonal projection onto span(ϕ1 , . . . , ϕm ). The
assumption σm ( f ) = 0 implies that f is equal to that projection.
f mo
≤ (1 + (5m)1/2 )σm ( f ).
302 Approximation in compressed sensing
Proof In the case σm ( f ) = 0, Theorem 5.21 implies that f ∈ m . Then
by Theorem 5.19 the OGA recovers it exactly after m iterations. Assume that
σm ( f ) > 0. Suppose
m
f − ci ϕi
≤ (1 +
)σm ( f ), ϕi ∈ D, i = 1, . . . , m, (5.62)
i=1
with some fixed
> 0. We will find a constant C(m) := C(m,
) with the
following property. If
f
> C(m)σm ( f )
then the OGA picks one of the ϕi , i = 1, . . . , m, at the first iteration. If the
OGA has picked elements from {ϕ1 , . . . , ϕm } at the first k iterations and
am = PHm ( f ), Hm := span(ϕ1 , . . . , ϕm ).
We write
m
am − G k = di ϕi
i=1
max |
f ko , ϕi | = max |
am − G k , ϕi |
1≤i≤m 1≤i≤m
|
f ko , g| = |
f − am + am − G k , g| ≤ (1 +
)σm ( f ) + AMm.
am − G k
=
am − f + f − G k
≥
f ko
−
f − am
> (C(m) − 1 −
)σm ( f ).
A2 m(1 + Mm) ≥ am − G k 2 .
Therefore,
We complete the proof of Theorem 5.22 in the following way. Fix arbitrarily
small
> 0 and choose C(m) = (1 +
)(1 + (5m)1/2 ). If
f ko
> C(m)σm ( f )
for k = 0, . . . , m − 1 then, as we proved above, the OGA picks all ϕ1 , . . . , ϕm ,
and therefore
f mo
≤ (1 +
)σm ( f ).
A slight modification of the above proof gives the following result for the
WOGA(t).
The following are the scalar products of f with the dictionary elements:
f, φi =
αei − β f, f = α − mβ, φi ∈ Dg ,
f, φ j =
ηe j + γ f, f = mγ , φ j ∈ Db .
φi , φi =
αei − β f, αei − β f = mβ 2 − 2αβ = −mβ(2γ + β), i, i ≤ m,
φ j , φ j =
ηe j + γ f, ηe j + γ f = mγ 2 , j, j > m,
φi , φ j =
αei − β f, ηe j + γ f = γ (−mβ + α) = γ R = mγ 2 , i ≤ m < j.
We equalize
√ γ 2 = β(2γ + β). Solving this quadratic
√ equation, we get γ =
(1 + 2)β. We find α = mγ + mβ = m(2 + 2)β, and, plugging it into
(5.64), we find β:
5.6 Exact recovery of sparse signals; SP 305
√
(m(2 + 2) − 1)2 β 2 + (m − 1)β 2 = 1
1
β2 = √ √ .
m 2 (2 + 2) − m(3 + 2 2)
2
Since all the scalar products are still the same (they do not depend on φ), some
realization of the OGA will select another element from Db . This completes
the induction argument.
from R . n
(1) Take the T j−1 from the previous iteration j − 1 and find
y j := y − PT j−1 (y),
306 Approximation in compressed sensing
where PA (y) denotes the orthogonal projection of y onto the subspace
span(ϕi , i ∈ A).
(2) Find the set of indices L j such that |L j | = K and
min |
y j , ϕi | ≥ max |
y j , ϕi |.
i∈L j i ∈L
/ j
(3) Consider
PT j−1 ∪L j (y) = ci ϕi .
i∈T j−1 ∪L j
We denote the class of Riesz dictionaries with depth D and parameter δ ∈ (0, 1)
by R(D, δ).
It is clear that the term Riesz dictionary with depth D and parameter
δ ∈ (0, 1) is another name for a dictionary satisfying the restricted isometry
property with parameters D and δ.
Theorem 5.26 Let ∈ R(D, δ) with δ ≤ 0.04. For any K -sparse vector y
with 3K ≤ D, the SP recovers y exactly after C K iterations.
This theorem follows from Theorem 5.29 and Lemma 5.30 proved below.
We begin with a simple lemma.
Lemma 5.27 Let D ∈ R(D, δ) and let g j ∈ D, j = 1, . . . , s. For f =
s
i=1 ai gi and ⊂ {1, . . . , s} denote
S ( f ) := ai gi .
i∈
If s ≤ D then
S ( f )
2 ≤ (1 + δ)(1 − δ)−1
f
2 .
5.6 Exact recovery of sparse signals; SP 307
Proof By (5.65) we obtain
s
S ( f )
≤ (1+δ)
2
|ai | ≤ (1+δ)
2
|ai | ≤ (1+δ)(1−δ)−1
f
2 .
2
i∈ i=1
We now proceed to the evaluation of the jth iteration of the SP. First, we
prove a lemma in a style of Lemma 5.14.
|
u, v| ≤ δ
b
2
c
2 (5.66)
and
⎛ ⎞1/2
l
⎝ |
u, ψ j |2 ⎠ ≤ δ
b
2 ≤ δ(1 − δ)−1/2
u
. (5.67)
j=1
Proof Without loss of generality, assume that u and v are non-zero and con-
sider their normalized versions u := u/
b
2 and v := v/
c
2 . We have by
the parallelogram identity and (5.65)
1
|
u , v | = |
u + v
2 −
u − v
2 | ≤ δ.
4
This proves (5.66). The inequality (5.67) follows from (5.66) by the chain of
relations:
⎛ ⎞1/2
l l
⎝ |
u, ψ j |2 ⎠ = max | c j
u, ψ j | ≤ δ
b
2 .
c:
c
2 ≤1
j=1 j=1
xT j : (x T j )i = xi , i ∈ T j , (x T j )i = 0, i ∈
/ T j,
Theorem 5.29 Let ∈ R(D, δ) and let y be a K -sparse vector with respect
to . If 3K ≤ D then
and
x T j
2 ≤ 4δ 1/2 (1 + δ)(1 − δ)−2
x T j−1
2 .
Proof The second inequality follows from the first inequality and (5.65). We
prove the first inequality. From the definition of y j and T j−1 it follows that
Therefore
y j
≤
yT j−1
. (5.69)
(5.72)
By Lemma 5.28,
⎛ ⎞1/2
⎝ |
PT j−1 (yT j−1 ), ϕi |2 ⎠ ≤ δ(1 − δ)−1/2
PT j−1 (yT j−1 )
i∈T j−1
y − PT j−1 ∪L j (y)
2
≤ (1 − ((1 + δ)−1/2 ((1 − δ)1/2 − δ(1 − δ)−1/2 ))2 )
yT j−1
2
≤ 4δ
yT j−1
2 . (5.77)
|T j−1 ∪ L j | = 2K , |T j−1 ∪ L j \ T | ≥ K .
310 Approximation in compressed sensing
Define R j := T j−1 ∪ L j \ T j and take any with the following properties:
⊂ T j−1 ∪ L j \ T, || = K .
Next
ci ϕi
2 ≤ (1 + δ) |ci |2 ≤ (1 + δ) |ci |2
i∈R j i∈R j i∈
≤ (1 + δ)(1 − δ)−1
ci ϕi
2 .
i∈
x A j 2 ≤ q x A j−1 2 , j = 2, . . . , J. (5.80)
Then J ≤ C(q)S.
Proof It suffices to prove the lemma for q = 1/2. Indeed, for the sets Al j+1 ,
j = 0, . . . , [(J − 1)/l], with l := [(log(1/q))−1 ] + 1 we have
1
x Al j+1
2 ≤
x Al( j−1)+1
2 .
2
Thus, assume that we have (5.80) with q = 1/2. Define yn :=
x n
2 /xn ,
n = 1, . . . , S. Consider two cases:
1 1 1 3
x A j
2 ≤
x A1
2 ≤
x
2 ≤ (x1 +
x 1
2 ) ≤ x1 .
2 2 2 4
/ A j . Therefore, in case (I ) the problem is reduced to the
This implies that 1 ∈
same problem with S and J replaced by S − 1 and J − 1, respectively.
We proceed to case (I I ). Let y1 ≥ 1/2, . . . , ys−1 ≥ 1/2 and ys < 1/2. We
note that yS = 0. We prove that for all k = 1, . . . , s − 1
3s
xs−k ≤ 3k xs and
x
2 ≤ xs . (5.81)
2
The proof of the first inequality in (5.81) is by induction. For k = 1 we have
1
ys < 1/2 ⇒ xs >
x s
2 ,
2
1 3
ys−1 ≥ 1/2 ⇒ xs−1 ≤
x s−1
2 ≤ xs +
x s
2 ≤ xs .
2 2
312 Approximation in compressed sensing
This proves the first inequality in (5.81) for k = 1. Let xs−t ≤ 3t xs hold for
t = 1, 2, . . . , k − 1, 1 < k < s. Then
1
xs−k ≤
x s−k
2 ≤ xs−k+1 + · · · + xs +
x s
2
2
3k
≤ xs (3k−1 + · · · + 1 + 1/2) = xs .
2
This completes the proof of the first inequality in (5.81).
Next
3s
x
2 ≤ x1 + · · · + xs +
x s
2 ≤ (3s−1 + · · · + 1 + 1/2)xs = xs .
2
For j > s log 3 + 1 we have
1
x A j
2 ≤ (1/2) j−1
x
2 ≤ 3−s
x
2 ≤ xs .
2
Thus we have A j ⊂ {s + 1, . . . , S} for all j > s log 3 + 1. Therefore, in case
(I I ) the problem is reduced to the same problem with S and J replaced by
S − s and J − [s log 3] − 2, respectively. Combination of cases (I ) and (I I )
implies that J ≤ C S. This completes the proof of the lemma.
We define a variant of the SP for an arbitrary countable dictionary =
∞ of a separable Hilbert space H . The following modification of the SP
{ϕi }i=1
is in a spirit of weak greedy algorithms.
Subspace Pursuit with weakness parameters w1 , w2 (SP(w1 , w2 )) Let K
be a given natural number and let y ∈ H . To start we take an arbitrary set of
indices T0 of cardinality |T0 | = K . At the jth iteration of the algorithm we
construct a set of indices T j of cardinality |T j | = K performing the following
steps.
(1) Take the T j−1 from the previous iteration j − 1 and find
y j := y − PT j−1 (y),
where PA (y) denotes the orthogonal projection of y onto the subspace
span(ϕi , i ∈ A).
(2) Find the set of indices L j such that |L j | = K and
min |
y j , ϕi | ≥ w1 sup |
y j , ϕi |.
i∈L j i ∈L
/ j
(3) Consider
PT j−1 ∪L j (y) = ci ϕi .
i∈T j−1 ∪L j
5.6 Exact recovery of sparse signals; SP 313
Define T j to be the set of K indices i such that
Theorem 5.31 Let ∈ R(D, δ) and let y be a K -sparse vector with respect
to . Suppose that 3K ≤ D and w12 ≥ 1 − 4δ. Then for sufficiently small δ
(δ ≤ 0.02) and large w2 ∈ (0, 1) (w22 ≥ 7δ 1/2 ) there exists q(δ) ∈ (0, 1) such
that for the iterations of the SP(w1 , w2 ) we have
Proof The proof goes along the lines of the proof of Theorem 5.29. We only
point out the places which require modification. Instead of (5.71) we obtain
⎛ ⎞1/2
PL j (y j )
≥ (1 + δ)−1/2 w1 ⎝ |
y j , ϕi |2 ⎠ , (5.82)
i∈T j−1
and
y − PT j−1 ∪L j (y)
2 ≤ 8δ
yT j−1
2 . (5.83)
In this section we have only studied the problem of exact recovery of sparse
signals by the SP. It is shown in Dai and Milenkovich (2009) that the SP is sta-
ble under signal and measurement perturbations in the spirit of Theorem 5.15.
M(D) := sup |
g k , gl |.
k=l
G = T .
5.7 On the size of incoherent systems 317
It is well known and easy to understand that rank G ≤ n (indeed, the
columns of G are linear combinations of n columns h i , i = 1, . . . , n). There-
fore, the positive (non-negative) definite symmetric matrix G has at most n
non-zero eigenvalues λk > 0. By the normalization assumption
g j
= 1,
j = 1, . . . , N , we obtain from the property of traces of matrices that
λk = N .
k
N
(
g i , g j )2 = λ2k . (5.88)
i, j=1 k
By Cauchy’s inequality,
2
−1
λ2k ≥n λk = N 2 /n.
k k
N
N + (N 2 − N )M(D)2 ≥ (
g i , g j )2 ≥ N 2 /n
i, j=1
and
1/2
N −n
M(D) ≥ . (5.89)
n(N − 1)
Theorem 5.34 Let A :=
ai, j
i,N j=1 be a square matrix of the form ai,i = 1,
i = 1, . . . , N ; |ai, j | ≤
< 1/2, i = j. Then
and
N ≤ exp(C1 nμ2 ln(2/μ)). (5.91)
u H v := (u 1 v1 , . . . , u N v N )T , u = (u 1 , . . . , u N )T , v = (v1 , . . . , v N )T .
Therefore,
r +m−1
rank A Hm
≤ dim S ≤ ≤ (e(r + m)/m)m . (5.94)
m
Combining (5.94) and (5.95) we get the required lower bound for r .
5.7 On the size of incoherent systems 319
5.7.3 Lower bounds; probabilistic approach
In this subsection we prove the existence of large systems with small
coherence. We demonstrate how two classical systems – the trigonometric sys-
tem and the Walsh system – can be used in such constructions. We use the
trigonometric system {eikx } for a construction in Cn and the Walsh system for
a construction in Rn . We note that the lower bound for Rn implies the corre-
sponding lower bound for Cn . The proof is based on the following Hoeffding
inequality (see Section 4.2).
Theorem 5.35 Let ξi be real random variables on (X, , ρ) such that
|ξi − Eξi | ≤ bi , i = 1, . . . , m, almost surely. Consider a new random variable
ζ on (X m , m , ρ m ) defined as
m
ζ (ω) := ξi (ωi ), ω = (ω1 , . . . , ωm ).
i=1
n
|n −1 wk (yi )| ≤ μ.
i=1
n
n
|
gl , g m | = |n −1 wl (yi )wm (yi )| = |n −1 ws(l,m) (yi )| ≤ μ.
i=1 i=1
F(a, u) := ar u r + · · · + a1 u.
p
|S(a)| ≤ (r − 1) p 1/2 , S(a) := e2πi F(a,u)/ p . (5.99)
u=1
p
|
v a , v a | = p −1 | e2πi(F(a,u)−F(a ,u))/ p |
u=1
p
= p −1 | e2πi F(a−a ,u)/ p | ≤ (r − 1) p −1/2 . (5.100)
u=1
For given n and μ ≥ (2/n)1/2 we set p to be the biggest prime not exceeding n.
Then n/2 ≤ p ≤ n. We specify r to be the biggest natural number such that
(r − 1) p −1/2 ≤ μ. Then by (5.100) M(W (r, p)) ≤ μ. For the cardinality of
the W (r, p) we have
#W (r, p) = pr = er ln p ≥ eμp
1/2 ln p
.
F(a, u) := ar u r + · · · + a1 u.
Proof For a fixed a denote for brevity f (u) := F(a, u). Let y and z run
over the complete system of residues mod p. Then the sum y + pz runs over
the complete system of residues mod p 2 . It is clear from Taylor’s expansion
formula that
Therefore
2
p
p
p
p
p
(y)z/ p
2πi f (u)/ p 2 2πi f (y+ pz)/ p 2 2πi f (y)/ p 2
e = e = e e2πi f .
u=1 y=1 z=1 y=1 z=1
(5.102)
5.7 On the size of incoherent systems 323
Introducing the notation T := {y ∈ [1, p] : f (y) = 0 mod p} we
continue:
2
p
e2πi f (u)/ p = p e2πi f (y)/ p .
2 2
u=1 y∈T
Thus
2
p
e2πi f (u)/ p | ≤ p|T |.
2
| (5.103)
u=1
By Theorem 5.7 the bound (5.105) implies that the null space N () :=
{u : u = 0} has the following width property. For any u ∈ N () we have
u 2 ≤ C S −1/2 u 1 . (5.106)
We note that it was proved in Section 5.2 that the width property (5.106) of the
N () implies (5.105) with S replaced by S/16.
The inequality (5.106) is related to the concept of Gelfand’s width. We recall
that the Gel’fand width of a compact F in the 2N norm is defined as follows:
N ≤ n exp(Cn/S). (5.109)
The bounds (5.104) and (5.109) describe the behavior of maximal size of
dictionaries with RIP(S, δ).
n
−1 j
a := (1 + M)
j
xk wk , M := M()
k=1
k=1
k=1
≤
x
1 max
y − w
∞ .
y:
y
1 ≤1
M
x
∞ ≤
x
1 .
1+M
This proves the first inequality in (5.112). The second inequality in (5.112)
follows from the first one and the inequality
x
22 ≤
x
1
x
∞ .
The second inequality in (5.112) means that the null space N () of an
M-coherent dictionary satisfies the WP(S) with S := (1 + M −1 ). In particular,
this implies by Theorem 5.7 that M-coherent dictionaries are good for exact
recovery of sparse signals by 1 minimization. We now present a direct proof
of a theorem on the exact recovery of sparse signals by 1 minimization that
gives the optimal bound on sparsity that allows exact recovery: < 12 (1+ M −1 ).
A (u) = u.
N
xi ϕi = 0,
i=1
N
ξ j := | u i ri (w j )|2
u
−2
N, j = 1, . . . , n.
2
i=1
Then
N
Eξ j = u i2
u
−2
N = 1.
2
i=1
Lemma 5.43 Let (w) be as defined above. Then, for any set of indices T ,
|T | = s ≤ n, and any δ ∈ (0, 1), we have, for all x ∈ X T ,
(1 − δ)
x
N ≤
(w)x
n2 ≤ (1 + δ)
x
N
2 2
with probability
Since by the definition A is the smallest number for which (5.117) holds, we
obtain A ≤ δ/2 + (1 + A)δ/4. This implies A ≤ δ and proves the upper bound
in Lemma 5.43. The corresponding lower bound follows from
Theorem 5.44 Suppose that n, N and δ ∈ (0, 1) are given. Let (w) be the
Bernoulli matrix defined above. Then there exist two positive constants C1 (δ)
and C2 (δ) such that, for any s-sparse signal x with s ≤ C1 (δ)n/ ln(eN /n), we
have
with probability
≥ 1 − exp(−C2 (δ)n).
Proof Lemma 5.43 guarantees that for each of the s-dimensional subspaces
X T , the matrix (w) fails to satisfy (5.119) with probability
where the infimum is taken over all n × m matrices and all decoders
: Rn → Rm . It is known (see, for example, Cohen, Dahmen and DeVore
(2009)) that the above optimization problem is closely related to the Gelfand
width of K . Under mild assumptions on K we have
d n (K , X ) ≤ E n (K ) X ≤ Cd n (K , X ). (5.120)
In the particular case K = B1m , X = m
2 , we know from Garnaev and Gluskin
(1984) that
d n (B1m , m
2 ) ≥ C 1 ((ln(em/n))/n)
1/2
. (5.121)
Theorem 5.7 guarantees that for a matrix such that
x
2 ≤ C2 ((ln(em/n))/n)1/2
x
1 , x ∈ N (), (5.122)
5.9 Some further remarks 331
the 1 minimization provides the optimal rate of recovery:
As a direct corollary of Theorem 2.19, we get for any y ∈ A1 (D) that the
Orthogonal Greedy Algorithm guarantees the following upper bound for the
error:
y − G ok (y, D)
≤ k −1/2 . (5.125)
The inequalities (5.125) and (5.126) look alike. However, they provide the
error bounds in different spaces: (5.125) in Rn (the data space) and (5.126)
in Rm (the coefficients space).
constant C?
6
6.1 Introduction
In this chapter we move from Hilbert spaces to more general Banach spaces.
Let X be a Banach space with norm
·
. We say that a set of elements (func-
tions) D from X is a dictionary, respectively, symmetric dictionary, if each
g ∈ D has a norm bounded by 1 (
g
≤ 1),
g∈D implies − g ∈ D,
and the closure of span D is X . We denote the closure (in X ) of the convex hull
of D by A1 (D). We introduce a new norm, associated with a dictionary D, in
the dual space X ∗ by the formula
F
D := sup F(g), F ∈ X ∗.
g∈D
F f = 1, F f ( f ) = f .
f m−1 , ϕm = sup
f m−1 , g. (6.2)
g∈D
334
6.1 Introduction 335
In a Hilbert space both versions (6.1) and (6.2) result in the same PGA. In a
general Banach space the corresponding versions of (6.1) and (6.2) lead to dif-
ferent greedy algorithms. The Banach space version of (6.1) is straightforward:
instead of the Hilbert norm
·
H in (6.1) we use the Banach norm
·
X . This
results in the following greedy algorithm (see Temlyakov (2003a)).
X -Greedy Algorithm (XGA) We define f 0 := f , G 0 := 0. Then, for each
m ≥ 1, we have the following inductive definition.
(1) ϕm ∈ D, λm ∈ R are such that (we assume existence)
f m−1 − λm ϕm
X = inf
f m−1 − λg
X . (6.3)
g∈D ,λ
(2) Denote
f m := f m−1 − λm ϕm , G m := G m−1 + λm ϕm .
The second version of the PGA in a Banach space is based on the concept
of a norming (peak) functional. We note that in a Hilbert space a norming
functional F f acts as follows:
F f (g) =
f /
f
, g.
Therefore, (6.2) can be rewritten in terms of the norming functional F f m−1 as
follows:
F fm−1 (ϕm ) = sup F f m−1 (g). (6.4)
g∈D
This observation leads to the class of dual greedy algorithms. We define the
Weak Dual Greedy Algorithm with weakness τ (WDGA(τ )) (see Dilworth,
Kutzarova and Temlyakov (2002) and Temlyakov (2003a)) that is a general-
ization of the Weak Greedy Algorithm.
Weak Dual Greedy Algorithm (WDGA(τ )) Let τ := {tm }∞ m=1 , tm ∈ [0, 1],
be a weakness sequence. We define f 0 := f . Then, for each m ≥ 1, we have
the following inductive definition.
(1) ϕm ∈ D is any element satisfying
F f m−1 (ϕm ) ≥ tm
F fm−1
D . (6.5)
(2) Define am as
f m−1 − am ϕm
= min
f m−1 − aϕm
.
a∈R
(3) Let
f m := f m−1 − am ϕm .
336 Greedy approximation in Banach spaces
Let us make a remark that justifies the idea of the dual greedy algorithms in
terms of real analysis. We consider here approximation in uniformly smooth
Banach spaces. For a Banach space X we define the modulus of smoothness:
1
ρ(u) := sup (
x + uy
+
x − uy
) − 1 .
x
=
y
=1 2
The uniformly smooth Banach space is the one with the property
lim ρ(u)/u = 0.
u→0
It is easy to see that for any Banach space X its modulus of smoothness ρ(u)
is an even convex function satisfying the inequalities
Proof We have
x + uy ≥ Fx (x + uy) = x + u Fx (y).
This proves the first inequality. Next, from the definition of modulus of
smoothness it follows that
Also,
x − uy
≥ Fx (x − uy) =
x
− u Fx (y). (6.8)
x + uy ≤ x + u Fx (y) + 2 x ρ(u y / x ).
Proposition 6.2 Let X be a uniformly smooth Banach space. Then, for any
x = 0 and y we have
d
Fx (y) =
x + uy
(0) = lim (
x + uy
−
x
)/u. (6.9)
du u→0
6.1 Introduction 337
Proof The equality (6.9) follows from (6.6) and the property that, for a
uniformly smooth Banach space, limu→0 ρ(u)/u = 0.
Proposition 6.2 shows that in the WDGA we are looking for an element ϕm ∈ D
that provides a large derivative of the quantity
f m−1 + ug
. Thus, we have
two classes of greedy algorithms in Banach spaces. The first one is based on
a greedy step of the form (6.3). We call this class the class of X -greedy algo-
rithms. The second one is based on a greedy step of the form (6.5). We call
this class the class of dual greedy algorithms. A very important feature of the
dual greedy algorithms is that they can be modified into a weak form. The
term “weak" in the definition of the WDGA means that at the greedy step
(6.5) we do not aim for the optimal element of the dictionary which realizes
the corresponding supremum, but are satisfied with a weaker property than
being optimal. The obvious reason for this is that we do not know, in general,
that the optimal one exists. Another, practical reason is that the weaker the
assumption, the easier it is satisfied and, therefore, the easier it is to realize in
practice.
The greedy algorithms defined above (XGA, WDGA) are the generaliza-
tions of the PGA and the WGA, studied in Chapter 2, to the case of Banach
spaces. The results of Chapter 2 show that the PGA is not the most effi-
cient greedy algorithm for the approximation of elements of A1 (D). It was
mentioned in Chapter 2 (see Livshitz and Temlyakov (2003) for the proof) that
there exist a dictionary D, a positive constant C and an element f ∈ A1 (D)
such that, for the PGA,
f m
≥ Cm −0.27 . (6.10)
We note that, even before the lower estimate (6.10) was proved, researchers
began looking for other greedy algorithms that provide good rates of approxi-
mation of functions from A1 (D). Two different ideas have been used at this
step. The first idea was that of relaxation: see Barron (1993), DeVore and
Temlyakov (1996), Jones (1992) and Temlyakov (2000b). The corresponding
algorithms (for example, the WRGA, studied in Chapter 2) were designed for
approximation of functions from A1 (D). These algorithms do not provide an
expansion into a series but they have other good features. It was established
(see Theorem 2.21) for the WRGA with τ = {1} in a Hilbert space that for
f ∈ A1 (D)
f m
≤ Cm −1/2 .
Also, for the WRGA we always have G m ∈ A1 (D). The latter property clearly
limits the applicability of the WRGA to the A1 (D).
338 Greedy approximation in Banach spaces
The second idea was that of building the best approximant from the
span(ϕ1 , . . . , ϕm ) instead of the use of only one element ϕm for an update
of the approximant. This idea was realized in the Weak Orthogonal Greedy
Algorithm (see Chapter 2) in the case of a Hilbert space and in the Weak
Chebyshev Greedy Algorithm (WCGA) (see Temlyakov (2001b)) in the case
of a Banach space.
The realization of both ideas resulted in the construction of algorithms
(WRGA and WCGA) that are good for the approximation of functions from
A1 (D). We present results on the WCGA in Section 6.2 and results on the
WRGA in Section 6.3. The WCGA has the following advantage over the
WRGA. It will be proved in Section 6.2 that the WCGA (under some assump-
tions on the weakness sequence τ ) converges for each f ∈ X in any uniformly
smooth Banach space. The WRGA is simpler than the WCGA in the sense of
computational complexity. However, the WRGA has limited applicability. It
converges only for elements of the closure of the convex hull of a dictionary.
In Sections 6.4 and 6.5 we study algorithms that combine good features of both
algorithms the WRGA and the WCGA. In the construction of such algorithms
we use different forms of relaxation.
The Weak Greedy Algorithm with Free Relaxation (WGAFR) (Temlyakov
(2008c)), studied in Section 6.4, is the most powerful of the versions con-
sidered here. We prove convergence of the WGAFR in Theorem 6.22. This
theorem is the same as the corresponding convergence result for the WCGA
(see Theorem 6.6). The results on the rate of convergence for the WGAFR and
the WCGA are also the same (see Theorems 6.23 and Theorem 6.14). Thus,
the WGAFR performs in the same way as the WCGA from the point of view
of convergence and rate of convergence, and outperforms the WCGA in terms
of computational complexity.
In the WGAFR we are optimizing over two parameters w and λ at each
step of the algorithm. In other words we are looking for the best approx-
imation from a two-dimensional linear subspace at each step. In the other
version of the weak relaxed greedy algorithms (see the GAWR), considered in
Section 6.5, we approximate from a one-dimensional linear subspace at each
step of the algorithm. This makes computational complexity of these algo-
rithms very close to that of the PGA. The analysis of the GAWR version turns
out to be more complicated than the analysis of the WGAFR. Also, the results
obtained for the GAWR are not as general as in the case of the WGAFR. For
instance, we present results on the GAWR only in the case τ = {t}, when the
weakness parameter t is the same for all steps.
The XGA and WDGA have a good feature that distinguishes them from all
relaxed greedy algorithms, and also from the WCGA. For an element f ∈ X
they provide an expansion into a series,
6.1 Introduction 339
∞
f ∼ c j ( f )g j ( f ), g j ( f ) ∈ D, c j ( f ) > 0, j = 1, 2, . . . (6.11)
j=1
such that
m
Gm = c j ( f )g j ( f ), fm = f − G m .
j=1
In Section 6.7 we discuss other greedy algorithms that provide the expansion
(6.11).
All the algorithms studied in Sections 6.2–6.7 belong to the class of dual
greedy algorithms. Results obtained in Sections 6.2–6.7 confirm that dual
greedy algorithms provide powerful methods of nonlinear approximation. In
Section 6.8 we present some results on the X -greedy algorithms. These results
are similar to those for the dual greedy algorithms.
The algorithms studied in Sections 6.2–6.8 are very general approximation
methods that work well in an arbitrary uniformly smooth Banach space X for
any dictionary D. This motivates an attempt, made in Section 6.10, to modify
these theoretical approximation methods in the direction of practical applica-
bility. In Section 6.10 we illustrate this idea by modifying the WCGA. We note
that Section 6.6 is also devoted to modification of greedy algorithms in order to
make them more practically feasible. The main idea of Section 6.6 is to replace
the most difficult (expensive) step of an algorithm, namely the greedy step, by
a thresholding step.
In Section 6.11 we give an example of how the greedy algorithms can
be used in constructing deterministic cubature formulas with error estimates
similar to those for the Monte Carlo method.
As a typical example of a uniformly smooth Banach space we will use a
space L p , 1 < p < ∞. It is well known (see, for example, Donahue et al.
(1997), lemma B.1) that in the case X = L p , 1 ≤ p < ∞, we have
It is also known (see Lindenstrauss and Tzafriri (1977), p. 63) that, for any X
with dim X = ∞, one has
ρ(u) ≥ (1 + u 2 )1/2 − 1
ρ(u) ≥ Cu 2 , C > 0.
340 Greedy approximation in Banach spaces
This limits the power type modulus of smoothness of non-trivial Banach spaces
to the case u q , 1 ≤ q ≤ 2.
F fm−1
c (ϕmc ) ≥ tm
F f m−1
c
D .
(2) Define
m := τm := span{ϕ cj }mj=1 ,
Remark 6.3 It follows from the definition of the WCGA that the sequence
{
f mc
} is a non-increasing sequence.
Definition 6.4 Let ρ(u) be an even convex function on (−∞, ∞) with the
property ρ(2) ≥ 1 and
lim ρ(u)/u = 0.
u→0
lim
f mc,τ
= 0.
m→∞
ρ(u)/u ≤ ρ q (u)/u,
ξm (ρ, τ, θ) ≥ ξm (ρ q , τ, θ).
ξm (ρ q , τ, θ) = (θtm /γ )1/(q−1) .
We will use the following two simple and well known lemmas in the proof of
the above two theorems.
Lemma 6.9 Let X be a uniformly smooth Banach space and let L be a finite
dimensional subspace of X . For any f ∈ X \ L let f L denote the best
approximant of f from L. Then we have
F f − f L (φ) = 0
for any φ ∈ L.
F f − f L (φ) = β > 0.
Taking into account that ρ(u) = o(u), we find λ > 0 such that
λ β λ
1− + 2ρ < 1.
f − fL
f − fL
6.2 The Weak Chebyshev Greedy Algorithm 343
Then (6.17) yields
f − f L − λ φ
<
f − f L
,
which contradicts the assumption that f L ∈ L is the best approximant of f .
Lemma 6.10 For any bounded linear functional F and any dictionary D, we
have
F
D := sup F(g) = sup F( f ).
g∈D f ∈A1 (D )
is obvious. We prove the opposite inequality. Take any f ∈ A1 (D). Then for
any
> 0 there exist g1
, . . . , g
N ∈ D and numbers a1
, . . . , a
N such that
ai
> 0, a1
+ · · · + a
N ≤ 1 and
N
f − ai
gi
≤
.
i=1
Thus
N
F( f ) ≤
F
+ F ai
gi
≤
F
+ sup F(g),
i=1 g∈D
= tm sup F f m−1
c (φ) ≥ tm A(
)−1 F fm−1
c ( f
).
φ∈A1 (D )
(6.19)
which proves the lemma.
Let us specify θ := α/8A(
) and take λ = αξm (ρ, τ, θ). Then we obtain
f mc
≤
f m−1
c
(1 − 2θtm ξm ).
The assumption
∞
tm ξ m = ∞
m=1
6.2 The Weak Chebyshev Greedy Algorithm 345
implies that
f mc
→ 0 as m → ∞.
Let
Aq := 2(4γ )1/(q−1) .
By an analog of Lemma 2.16 (see Temlyakov (2000b), lemma 3.1), using the
estimate
f
p ≤ 1 < Aq we get
−1
m
p
f m
≤ Aq 1 +
c p
tn ,
n=1
which implies
−1/ p
m
p
f mc
≤ C(q, γ ) 1 + tn .
n=1
This statement follows from the fact that, in the proof of Theorem 6.8, the
relation
F fm−1
c (ϕmc ) ≥ tm sup F f m−1
c (g)
g∈D
First of all, (6.22) guarantees that f ∈ q . Next, it is well known that F f can
be identified as
∞
1/ p
p
F f = (1, t1 , t2 , . . . )/ 1 + tk ∈ p.
k=1
The following variant of Theorem 6.8 (see Temlyakov (2008c)) follows from
Lemma 6.11.
6.3 Relaxation; co-convex approximation 347
Theorem 6.14 Let X be a uniformly smooth Banach space with modulus of
smoothness ρ(u) ≤ γ u q , 1 < q ≤ 2. Take a number
≥ 0 and two elements
f , f
from X such that
f − f
≤
, f
/A(
) ∈ A1 (D),
with some number A(
) > 0. Then we have ( p := q/(q − 1))
⎛ −1/ p ⎞
m
f mc,τ
≤ max ⎝2
, C(q, γ )( A(
) +
) 1 + ⎠.
p
tk (6.23)
k=1
F f m−1
r (ϕm
r
− G rm−1 ) ≥ tm sup F f m−1
r (g − G rm−1 ).
g∈D
and define
G rm := G r,τ
m := (1 − λm )G m−1 + λm ϕm .
r r
(3) Let
f mr := f mr,τ := f − G rm .
Remark 6.15 It follows from the definition of the WRGA that the sequence
{
f mr
} is a non-increasing sequence.
We call the WRGA relaxed because at the mth step of the algorithm we use a
linear combination (convex combination) of the previous approximant G rm−1
and a new element ϕm r . The relaxation parameter λ in the WRGA is chosen
m
at the mth step depending on f . We prove here the analogs of Theorems 6.6
and 6.8 for the Weak Relaxed Greedy Algorithm.
348 Greedy approximation in Banach spaces
Theorem 6.16 Let X be a uniformly smooth Banach space with modulus of
smoothness ρ(u). Assume that a sequence τ := {tk }∞
k=1 satisfies the following
condition: for any θ > 0 we have
∞
tm ξm (ρ, τ, θ) = ∞.
m=1
lim
f mr,τ
= 0.
m→∞
Proof of Theorems 6.16 and 6.17 This proof is similar to the proof of
Theorems 6.6 and 6.8. Instead of Lemma 6.11 we use the following lemma.
Proof We have
f mr := f − ((1 − λm )G rm−1 + λm ϕm
r
) = f m−1
r
− λm (ϕm
r
− G rm−1 )
and
f mr
= inf
f m−1
r
− λ(ϕm
r
− G rm−1 )
.
0≤λ≤1
f m−1
r
− λ(ϕm r
− G rm−1 )
+
f m−1
r
+ λ(ϕmr
− G rm−1 )
λ
ϕm r − Gr
m−1
≤ 2
f m−1
r
1+ρ . (6.24)
f m−1
r
6.3 Relaxation; co-convex approximation 349
Next for λ ≥ 0 we obtain
f m−1
r
+ λ(ϕm
r
− G rm−1 )
≥ F f m−1
r ( f m−1
r
+ λ(ϕm
r
− G rm−1 ))
=
f m−1
r
+ λF f m−1
r (ϕm
r
− G rm−1 )
≥
f m−1
r
+ λtm sup F f m−1
r (g − G rm−1 ).
g∈D
=
f m−1
r
+ λtm sup F f m−1
r (φ − G rm−1 )
φ∈A1 (D )
≥
f m−1
r
+ λtm
f m−1
r
.
r − Gr
Using the trivial estimate
ϕm m−1
≤ 2 we obtain
2λ
f m−1
r
− λ(ϕm
r
− G rm−1 )
≤
f m−1
r
1 − λtm + 2ρ , (6.25)
f m−1
r
The remaining part of the proof uses the inequality (6.25) in the same way
(6.19) was used in the proof of Theorems 6.6 and 6.8. The only additional
difficulty here is that we are optimizing over 0 ≤ λ ≤ 1. However, it is easy
to check that the corresponding λ chosen in a similar way always satisfies the
restriction 0 ≤ λ ≤ 1. In the proof of Theorem 6.16 we choose θ = α/8 and
λ = αξm (ρ, τ, θ)/2 and in the proof of Theorem 6.17 we choose λ from the
equation
1
λtm = 2γ (2λ)q
f m−1
r
−q .
2
Remark 6.19 Theorems 6.16 and 6.17 hold for a slightly modified version of
the WRGA, the WRGA(1), for which at step (1) we require
r (1) r (1) r (1)
F f r (1) (ϕm − G m−1 ) ≥ tm
f m−1
. (6.26)
m−1
This follows from the observation that in the proof of Lemma 6.18 we used the
inequality from step (1) of the WRGA only to derive (6.26). It is clear from
Lemma 6.10 that in the case of approximation of f ∈ A1 (D), the requirement
(6.26) is weaker and easier to check than (1) of the WRGA.
350 Greedy approximation in Banach spaces
6.4 Free relaxation
Both of the above algorithms, the WCGA and the WRGA, use the functional
F fm−1 in a search for the mth element ϕm from the dictionary to be used in the
approximation. The construction of the approximant in the WRGA is different
from the construction in the WCGA. In the WCGA we build the approxi-
mant G cm so as to use maximally the approximation power of the elements
ϕ1 , . . . , ϕm . The WRGA, by definition, is designed for the approximation of
functions from A1 (D). In building the approximant in the WRGA we keep the
property G rm ∈ A1 (D). As mentioned in Section 6.3 the relaxation parameter
λm in the WRGA is chosen at the mth step depending on f . The following
modification of the above idea of relaxation in greedy approximation will be
studied in this section (see Temlyakov (2008c)).
Weak Greedy Algorithm with Free Relaxation (WGAFR) Let τ :=
{tm }∞
m=1 , tm ∈ [0, 1], be a weakness sequence. We define f 0 := f and G 0 := 0.
Then for each m ≥ 1 we have the following inductive definition.
and define
G m := (1 − wm )G m−1 + λm ϕm .
(3) Let
f m := f − G m .
We now estimate
Next,
Remark 6.21 It follows from the definition of the WGAFR that the sequence
{
f m
} is a non-increasing sequence.
lim
f m
= 0.
m→∞
The assumption
∞
tm s −1 (θtm ) = ∞
m=1
implies that
f m
→ 0 as m → ∞.
f − f ≤ , f /A( ) ∈ A1 (D),
Define
Aq := 4(8γ )1/(q−1) 5q/(q−1) .
354 Greedy approximation in Banach spaces
Using the notation p := q/(q − 1), we get from (6.30)
p
1 λtk tk
f k−1
p
f k
≤
f k−1
1 − =
f k−1
1 − .
4 A(
) Aq A(
) p
Raising both sides of this inequality to the power p and taking into account the
inequality x r ≤ x for r ≥ 1, 0 ≤ x ≤ 1, we obtain
p
tk
f k−1
p
f k
≤
f k−1
1 −
p p
.
Aq A(
) p
By an analog of Lemma 2.16 (see Temlyakov (2000b), lemma 3.1), using the
estimates
f
≤ A(
) +
and Aq > 1, we get
−1
m
p
f m
≤ Aq (A(
) +
) 1 +
p p
tk
k=1
which implies
−1/ p
m
p
f m
≤ C(q, γ )( A(
) +
) 1 + tk .
k=1
Theorem 6.23 is proved.
and define
G m := (1 − rm )G m−1 + λm ϕm .
(3) Let
f m := f − G m .
In the case τ = {t} we write t instead of τ in the notation. We note that in the
case rk = 0, k = 1, . . . , when there is no relaxation the GAWR(τ, 0) coincides
with the Weak Dual Greedy Algorithm. We now proceed to the GAWR. We
begin with an analog of Lemma 6.11.
f m
≤
f m−1
(1 − rm (1 −
/
f m−1
)
+ 2ρ((rm (
f
+ A(
)/t))/((1 − rm )
f m−1
)), m = 1, 2, . . .
Then the GAWR(t, r) converges in any uniformly smooth Banach space for
each f ∈ X and for all dictionaries D.
ak+1 ≤ ak + rk f . (6.32)
Using the fact that the function ρ(u)/u is monotone increasing on [0, ∞), we
obtain from (6.31) for ak > 0
f k−1
Brk Brk
ak+1 ≤ ak 1 − rk + 2 ρ ≤ ak 1 − rk + 2ρ .
ak
f k−1
ak
(6.33)
Consider the case when U is infinite. We note that part (I) of the proof
implies that there is a subsequence {k j } such that ak j ≤ 0, j = 1, 2, . . . .
This means that
U = ∪∞
j=1 [l j , n j ]
Lemma 6.26 Let a sequence {an }∞n=1 have the following property. For given
positive numbers α < γ ≤ 1, A > a1 , we have for all n ≥ 2
an ≤ an−1 + A(n − 1)−α . (6.36)
If for some ν ≥ 2 we have
aν ≥ Aν −α
then
aν+1 ≤ aν (1 − γ /ν). (6.37)
Then there exists a constant C(α, γ ) such that for all n = 1, 2, . . . we have
an ≤ C(α, γ ) An −α .
f m ≤ f m−1 + rm f .
We conclude this section with the following remark. The algorithms GAWR
and WGAFR are both dual type greedy algorithms. The first steps are similar
for both algorithms: we use the norming functional F fm−1 in the search for an
element ϕm . The WGAFR provides more freedom than the GAWR in choosing
good coefficients wm and λm . This results in more flexibility in choosing the
weakness sequence τ = {tm }. For instance, condition (6.29) of Theorem 6.22
is satisfied if τ = {t}, t ∈ (0, 1], for any uniformly smooth Banach space. In
the case ρ(u) ≤ γ u q , 1 < q ≤ 2, condition (6.29) is satisfied if
∞
p
tm = ∞, p := q/(q − 1).
m=1
6.6 Thresholding algorithms 359
6.6 Thresholding algorithms
We begin with a remark on the computational complexity of greedy algorithms.
The main point of Section 6.4 is in proving that relaxation allows us to build
greedy algorithms (see the WGAFR) that are computationally simpler than
the WCGA and perform as well as the WCGA. We note that the WCGA and
the WGAFR differ in the second step of the algorithm. However, the most
computationally involved step of all greedy algorithms is the greedy step (the
first step of the algorithm). One of the goals of relaxation was to get rid of
the assumption f ∈ A1 (D) (as in the WRGA). All relaxed greedy algorithms
from Sections 6.4 and 6.5 are applicable to (and converge for) any f ∈ X .
We want to point out that the information f ∈ A1 (D) allows us to simplify
substantially the greedy step of the algorithm. It is remarked in Section 6.2 (see
Remark 6.12) that we can replace the first step of the WCGA by the following
search criterion:
F fm−1 (ϕm ) ≥ tm
f m−1
. (6.39)
A similar remark (see Section 6.3, Remark 6.19) holds for the WRGA. The
requirement (6.39) is weaker than the requirement of the greedy step of
the WCGA. However, Theorem 6.8 holds for this modification of the WCGA.
The relation (6.39) is a threshold-type inequality and can be checked easier
than the greedy inequality.
We now consider two algorithms defined and studied in Temlyakov (2008c)
with a different type of thresholding. These algorithms work for any f ∈ X .
We begin with the Dual Greedy Algorithm with Relaxation and Thresholding
(DGART).
DGART We define f 0 := f and G 0 := 0. Then for a given parameter
δ ∈ (0, 1/2] we have the following inductive definition for m ≥ 1.
(1) ϕm ∈ D is any element satisfying
F fm−1 (ϕm ) ≥ δ. (6.40)
If there is no ϕm ∈ D satisfying (6.40) then we stop.
(2) Find wm and λm such that
f − ((1 − wm )G m−1 + λm ϕm )
= inf
f − ((1 − w)G m−1 + λϕm )
λ,w
and define
G m := (1 − wm )G m−1 + λm ϕm .
(3) Let
f m := f − G m .
360 Greedy approximation in Banach spaces
If
f m
≤ δ
f
then we stop, otherwise we proceed to the (m + 1)th
iteration.
f − f ≤ , f /A( ) ∈ A1 (D),
with some number A(
) > 0. Then the DGART (CGAT) will stop after m ≤
C(γ )δ − p ln(1/δ), p := q/(q − 1), iterations with
f m ≤ + δ A( ).
Proof We begin with the error bound. For both algorithms, the DGART and the
CGAT, our stopping criterion guarantees that either
F fm
D ≤ δ or
f m
≤
δ
f
. In the latter case the required bound follows from simple inequalities
f ≤ + f ≤ + A( ).
f m = F fm ( f m ) = F fm ( f ) ≤ + F fm ( f ) ≤ + F fm D A( ) ≤ +δ A( ).
For the DGART we apply Lemma 6.9 with f m−1 and L = span(G m−1 , ϕm ),
and get
f m
= F f m ( f m ) = F fm ( f m−1 ) = F fm ( f )
≤
+ F fm ( f
) ≤
+
F fm
D A(
) ≤
+ δ A(
).
Next,
Thus,
f k
≤
f
(1 − c(γ )δ p )k .
(1 − c(γ )δ p )n ≤ δ.
This implies
m ≤ C(γ )δ − p ln(1/δ).
F f i,
(ϕm
i,
− f ) ≥ −
m .
m−1
(2) Define
i,
m := (1 − 1/m)G m−1 + ϕm /m.
G i,
i,
(3) Let
f mi,
:= f − G i,
m .
We note that, as in Lemma 6.10, we have for any bounded linear functional
F and any D+
sup F(g) = sup F( f ).
g∈D + f ∈A1 (D + )
sup F(g) ≥ F( f ).
g∈D +
i,
This guarantees the existence of ϕm .
Since
(2) Define
f m := f m−1 − cm ϕm ,
where cm > 0 is a coefficient either prescribed in advance or chosen
from a concrete approximation procedure.
We call the series
∞
f ∼ cjϕj (6.51)
j=1
We prove some convergence results for the DBE in Sections 6.7.2 and 6.7.3.
In Section 6.7.3 we consider a variant of the Dual-Based Expansion with coef-
ficients chosen by a certain simple rule. The rule depends on two numerical
parameters, t ∈ (0, 1] (the weakness parameter from the definition of the DBE)
and b ∈ (0, 1) (the tuning parameter of the approximation method). The rule
also depends on a majorant μ of the modulus of smoothness of the Banach
space X .
Dual Greedy Algorithm with parameters (t, b, μ) (DGA(t, b, μ)) Let X
be a uniformly smooth Banach space with modulus of smoothness ρ(u), and
let μ(u) be a continuous majorant of ρ(u): ρ(u) ≤ μ(u), u ∈ [0, ∞). For
parameters t ∈ (0, 1], b ∈ (0, 1] we define sequences { f m }∞ ∞
m=0 , {ϕm }m=1 ,
{cm }∞
m=1 inductively. Let f 0 := f . If for m ≥ 1, f m−1 = 0 then we set f j = 0
for j ≥ m and stop. If f m−1 = 0 then we conduct the following three steps.
Theorem 6.30 Let X be a uniformly smooth Banach space with the modulus
of smoothness ρ(u) and let μ(u) be a continuous majorant of ρ(u) with the
property μ(u)/u ↓ 0 as u → +0. Then, for any t ∈ (0, 1] and b ∈ (0, 1) the
DGA(t, b, μ) converges for each dictionary D and all f ∈ X .
The following result from Section 6.7.3 gives the rate of convergence.
∞
m
f ∼ cjϕj, f m := f − cjϕj
j=1 j=1
∞
c j = ∞. (6.56)
j=1
Then
lim inf
f m
= 0. (6.57)
m→∞
Proof The proof of this lemma is similar to the proof of lemma 1 from
Ganichev and Kalton (2003). Denote sn := nj=1 c j . Then (6.56) implies (see
Bary (1961), p. 904) that
∞
cn /sn = ∞. (6.58)
n=1
Thus, by (6.58),
lim inf sn rD ( f n−1 ) = 0
n→∞
Let
lim sn k rD ( f n k ) = 0. (6.59)
k→∞
6.7 Greedy expansions 367
Consider {F fnk }. The unit sphere in the dual X ∗ is weakly∗ compact (see
∞ , F := F
Habala, Hájek and Zizler (1996), p. 45). Let {Fi }i=1 i f n k be a
i
∗
w -convergent subsequence. Denote
F := w ∗ − lim Fi .
i→∞
f m ≥ α, m ≥ N, (6.60)
F( f ) = lim Fi ( f ) (6.61)
i→∞
and
⎛ ⎞
n n
ki
ki
Fi ( f ) = Fi ⎝ f n ki + c j ϕ j ⎠ =
f n ki
+ c j Fi (ϕ j ) ≥ α − sn ki rD ( f n ki )
j=1 j=1
(6.62)
for large i. Relations (6.61), (6.62) and (6.59) imply that F( f ) ≥ α, and hence
F = 0. This implies that there exists g ∈ D for which F(g) > 0. However,
(2) Let
f m := f m−1 − cm ϕm , G m := G m−1 + cm ϕm .
Theorem 6.33 Let X be a uniformly smooth Banach space with the modulus
of smoothness ρ(u). Assume C = {c j }∞
j=1 is such that c j ≥ 0, j = 1, 2, . . . ,
∞
c j = ∞,
j=1
Proof The proof is by contradiction. Assume (6.64) does not hold. Then there
exist α > 0 and N ∈ N such that, for all m ≥ N ,
f m ≥ α > 0.
f m
≤ C(r, t, q, γ )m −r .
370 Greedy approximation in Banach spaces
In the case t = 1, Theorem 6.35 provides the rate of convergence m −r for
f ∈ A1 (D) with r arbitrarily close to (1 − 1/q)/2. Theorem 6.31 provides
a similar rate of convergence. It would be interesting to know if the rate
m −(1−1/q)/2 is the best that can be achieved in greedy expansions (for each
D, any f ∈ A1 (D) and any X with ρ(u) ≤ γ u q , q ∈ (1, 2]). We note that
there are greedy approximation methods that provide error bounds of the order
m 1/q−1 for f ∈ A1 (D) (see Temlyakov (2003a, 2008c) for recent results).
However, these approximation methods do not provide an expansion.
Proof of Theorem 6.30 In this case τ = {t}, t ∈ (0, 1]. We have by (6.68)

Thus

∑_{m=1}^∞ c_m r_D(f_{m−1}) < ∞.    (6.75)

lim_{m→∞} ‖F_{f_m} − F_{f_∞}‖ = 0.

r_D(f_{m−1}) ≤ (2/(tb)) α c_m^{−1} μ(c_m/α) → 0

as m → ∞. Theorem 6.30 is proved.
Remark 6.36 It is clear from the above proof that Theorem 6.30 holds for an algorithm obtained from the DGA(τ, b, μ) by replacing (6.71) by

‖f_{m−1}‖ μ(c_m/‖f_{m−1}‖) = (b/2) c_m F_{f_{m−1}}(ϕ_m).    (6.78)
Also, the parameter b in (6.71) and (6.78) can be replaced by varying parameters b_m ∈ (a, b) ⊂ (0, 1).
‖f_{m−1}‖_{A_1(D)} = ‖f − ∑_{j=1}^{m−1} c_j ϕ_j‖_{A_1(D)} ≤ ‖f‖_{A_1(D)} + ∑_{j=1}^{m−1} c_j.    (6.80)
Denote b_n := 1 + ∑_{j=1}^n c_j. Then, by (6.80), we get

r_D(f_{m−1}) ≥ ‖f_{m−1}‖_{A_1(D)}^{−1} F_{f_{m−1}}(f_{m−1}) ≥ ‖f_{m−1}‖/b_{m−1}.    (6.81)
(1 + x)^α ≤ 1 + αx,  0 ≤ α ≤ 1,  x ≥ 0,
we obtain

b_m^{t_m(1−b)} ≤ b_{m−1}^{t_m(1−b)} (1 + t_m(1 − b) c_m/b_{m−1}).    (6.83)
The function μ(u)/u = γ u^{q−1} is increasing on [0, ∞). Therefore the c_m from (6.71) is greater than or equal to c̄_m defined from (see (6.81))

γ ‖f_{m−1}‖ (c̄_m/‖f_{m−1}‖)^q = (t_m b/2) c̄_m ‖f_{m−1}‖/b_{m−1},    (6.85)

that is,

c̄_m = (t_m b/(2γ))^{1/(q−1)} ‖f_{m−1}‖^{q/(q−1)} / b_{m−1}^{1/(q−1)}.    (6.86)
we obtain, from (6.82) and (6.86),

‖f_m‖ ≤ ‖f_{m−1}‖ (1 − t_m^p ‖f_{m−1}‖^p / (A b_{m−1}^p)).    (6.87)

Noting that b_m ≥ b_{m−1}, we infer from (6.87) that
Remark 6.38 Theorem 6.39 holds for an algorithm obtained from the
DGA(τ, b, μ) by replacing (6.71) by (6.78).
It follows from the proof of Theorem 6.37 that it holds for a modification of the DGA(τ, b, μ) in which we replace the quantity r_D(f_{m−1}) in the definition by its lower estimate ‖f_{m−1}‖/b_{m−1} (see (6.81)), with b_{m−1} := 1 + ∑_{j=1}^{m−1} c_j. Clearly, this modification is better suited to practical implementation than the DGA(τ, b, μ). We formulate the above remark as a separate result.
(3) Define

f_m := f_{m−1} − c_m ϕ_m.
Therefore, by Theorem 6.31 with μ(u) = u²/2, the DGA(t, b, μ) provides the following error estimate:

‖f_m‖ ≤ C(t, b) m^{−t(1−b)/(2(1+t(1−b)))}  for f ∈ A_1(D).    (6.90)

The exponent (1 − b)/(2(2 − b)) in this estimate tends to 1/4 as b tends to zero. Comparing (6.91) with the upper estimate for the PGA (see Section 2.3), we observe that the DGA(1, b, u²/2) with small b has a better upper estimate for the rate of convergence than the known estimates for the PGA. We note also that inequality (2.40) from Chapter 2 indicates that the exponent in the power rate of decay of error for the PGA is less than 0.1898.
Let us figure out how the DGA(1, b, u²/2) works in Hilbert space. Consider its mth step. Let ϕ_m ∈ D be from (6.52). Then it is clear that ϕ_m maximizes ⟨f_{m−1}, g⟩ over the dictionary D, and

⟨f_{m−1}, ϕ_m⟩ = ‖f_{m−1}‖ r_D(f_{m−1}).
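To make this concrete, here is a minimal numerical sketch of the DGA(1, b, u²/2) step in R^n with a finite normalized dictionary. It assumes, per our reading of the coefficient rule specialized to μ(u) = u²/2 and t = 1, that the coefficient reduces to c_m = b⟨f_{m−1}, ϕ_m⟩, i.e. a PGA step damped by the tuning parameter b; all function and variable names here are ours.

```python
import numpy as np

def dga_hilbert(f, dictionary, b=0.5, iters=100, tol=1e-12):
    """Sketch of DGA(1, b, u^2/2) in R^n.

    dictionary: array (N, n) whose rows are the elements g of D,
    assumed normalized and symmetric (g in D implies -g in D).
    Assumption (ours): the coefficient rule reduces to
    c_m = b * <f_{m-1}, phi_m>, a damped PGA step.
    """
    residual = f.astype(float).copy()
    expansion = []                        # greedy expansion: pairs (c_m, index)
    for _ in range(iters):
        inner = dictionary @ residual     # <f_{m-1}, g> for all g in D
        m = int(np.argmax(inner))         # phi_m maximizes <f_{m-1}, g>
        if inner[m] <= tol:
            break
        c = b * inner[m]                  # c_m = b <f_{m-1}, phi_m>
        residual -= c * dictionary[m]     # f_m := f_{m-1} - c_m phi_m
        expansion.append((c, m))
    return expansion, residual
```

Note that b = 1 would give the PGA itself; the damping b < 1 is what yields the better exponent discussed above.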
‖f − f^ε‖ ≤ ε,  f^ε/A(ε) ∈ A_1(D).

Then
We now turn to the L_p-spaces. The following results, Proposition 6.42 and Theorem 6.43, are from Ganichev and Kalton (2003).

φ_p(u) := (u|1 + u|^{p−2}(1 + u) − u) / (|1 + u|^p − pu − 1),  u ≠ 0,  φ_p(0) := 2/p.
We note that |1 + u|^p − pu − 1 > 0 for u ≠ 0. Indeed, it is sufficient to check the inequality for u ≥ −1/p, and in this case |1 + u|^p = (1 + u)^p > 1 + pu for u ≠ 0. It is easy to check that

lim_{u→0} φ_p(u) = 2/p.
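As a quick numerical sanity check of this limit, one can evaluate the φ_p reconstructed above at small arguments (a sketch of ours; the values printed approach 2/p):

```python
def phi_p(u, p):
    """phi_p as displayed above; phi_p(0) := 2/p by definition."""
    if u == 0:
        return 2.0 / p
    num = u * abs(1 + u) ** (p - 2) * (1 + u) - u
    den = abs(1 + u) ** p - p * u - 1
    return num / den

for p in (1.5, 2.0, 3.0):
    # values tend to 2/p as u -> 0; for p = 2, phi_2 is identically 1
    print(p, [round(phi_p(10.0 ** -k, p), 6) for k in (1, 3, 5)], 2 / p)
```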
Combining Theorem 6.41 with Proposition 6.42 we obtain the following result.
Theorem 6.43 Let p ∈ (1, ∞). Then the WDGA(τ ) with τ = {t}, t ∈ (0, 1],
converges for each dictionary and all f ∈ L p .
and

G_m := (1 − w_m) G_{m−1} + λ_m ϕ_m.

(2) Let

f_m := f − G_m.
Theorems 6.22 and 6.23 were derived from Lemma 6.20. In the same way
we derive from Lemma 6.44 the following analogs of Theorems 6.22 and 6.23
for the XGAFR.
lim_{m→∞} ‖f_m‖ = 0.
Theorem 6.46 Let X be a uniformly smooth Banach space with modulus of smoothness ρ(u) ≤ γ u^q, 1 < q ≤ 2. Take a number ε ≥ 0 and two elements f, f^ε from X such that

‖f − f^ε‖ ≤ ε,  f^ε/A(ε) ∈ A_1(D),
and

G_m := (1 − r_m) G_{m−1} + λ_m ϕ_m.

(2) Let

f_m := f − G_m.
Then the XGAR(r) converges in any uniformly smooth Banach space for each
f ∈ X and for all dictionaries D.
‖f − f^ε‖ ≤ ε,  f^ε/A(ε) ∈ A_1(D),

we have

‖f_m‖ ≤ ε + C(q, γ)(‖f‖ + A(ε)) m^{−1+1/q}.
lim_{u→0} (‖g + uy‖ + ‖g − uy‖ − 2‖g‖)/u = 0
Remark 6.50 We note that (6.104) has a unique solution if det ‖F_{ϕ_j}(ϕ_i)‖_{i,j=1}^m ≠ 0. We apply the WQOGA in the case of a dictionary with coherence parameter M := M(D). Then, by a simple well-known argument on the linear independence of the rows of the matrix ‖F_{ϕ_j}(ϕ_i)‖_{i,j=1}^m, we conclude that (6.104) has a unique solution for any m < 1 + 1/M. Thus, in the case of an M-coherent dictionary D, we can run the WQOGA for at least [1/M] iterations.
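The following small computation (our illustration, in the Hilbert-space case) shows the threshold m < 1 + 1/M at work: a worst-case Gram-type matrix with unit diagonal and off-diagonal entries of size M has smallest eigenvalue 1 − (m − 1)M, which vanishes exactly at m = 1 + 1/M.

```python
import numpy as np

def worst_case_gram(m, M):
    """Unit diagonal, all off-diagonal entries equal to -M:
    eigenvalues are 1 + M (multiplicity m - 1) and 1 - (m - 1) * M."""
    return (1 + M) * np.eye(m) - M * np.ones((m, m))

M = 0.2                                              # coherence; 1 + 1/M = 6
for m in (4, 5, 6, 7):
    G = worst_case_gram(m, M)
    print(m, round(np.linalg.eigvalsh(G).min(), 6))  # hits 0 exactly at m = 6
```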
The following result was obtained in Temlyakov (2006a).
Theorem 6.51 Let t ∈ (0, 1]. Assume that D has coherence parameter M. Let S < (t/(1 + t))(1 + 1/M). Then for any f of the form

f = ∑_{i=1}^S a_i ψ_i,

where the ψ_i are distinct elements of D, we have f_S^{q,t} = 0.
We will prove the above theorem in a more general setting. Instead of a pair
(D, D∗ ) of a dictionary D and its dual dictionary D∗ , we now consider a pair
(D, W) of a dictionary D and a set W of normalized elements w indexed by
elements from D. We define
W := {w_g ∈ X*, ‖w_g‖_{X*} = 1, g ∈ D}
and define the coherence parameter of the pair (D, W) in the following way:
M(D, W) := sup_{g≠h; g,h∈D} |w_g(h)|.
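In the Hilbert-space case, where w_g may be taken to be g itself, the coherence parameter is simply the largest inner product between distinct dictionary elements; a one-line check (our sketch):

```python
import numpy as np

def coherence(dictionary):
    """M(D) = sup over distinct g, h in D of |<g, h>| (rows normalized)."""
    G = np.abs(dictionary @ dictionary.T)   # all pairwise |<g, h>|
    np.fill_diagonal(G, 0.0)                # exclude g = h
    return float(G.max())
```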
f = ∑_{i=1}^S a_i ψ_i,

where the ψ_i are distinct elements of D, we have that ϕ_1^{p,t} = ψ_j for some j ∈ [1, S].
Therefore,

max_i |w_{ψ_i}(f)| ≥ A(1 − δ − M(S − 1)).    (6.107)

|w_g(f)| ≤ ∑_{i=1}^S |a_i w_g(ψ_i)| ≤ AMS < tA(1 − δ − M(S − 1)).    (6.108)
Theorem 6.54 Let t ∈ (0, 1]. Assume that the pair (D, W) has coherence parameter M := M(D, W) and satisfies (6.105). Let S < (t/(1 + t))(1 + (1 − δ)/M). Then, for any f of the form

f = ∑_{i=1}^S a_i ψ_i,

where the ψ_i are distinct elements of D, we have f_S^{p,t} = 0.
Proof By Lemma 6.53 we obtain that each f_m^{p,t}, m ≤ S, has the form

f_m^{p,t} = ∑_{i=1}^S a_i^m ψ_i.

We note that by (6.107) |w_{ϕ_{m+1}^{p,t}}(f_m^{p,t})| > 0 for m < S provided f_m^{p,t} ≠ 0. Therefore, ϕ_{m+1}^{p,t} is different from all the previous ϕ_1^{p,t}, . . . , ϕ_m^{p,t}. Thus, assuming, without loss of generality, that f_{S−1}^{p,t} ≠ 0, we conclude that the set ϕ_1^{p,t}, . . . , ϕ_S^{p,t} coincides with the set ψ_1, . . . , ψ_S.
The condition

∑_{i=1}^S w_{ψ_j}((a_i − a_i^S) ψ_i) = 0,  j = 1, . . . , S,
(2) Define

Φ_m := span{ϕ_j}_{j=1}^m,

and let

E_m(f) := inf_{ϕ∈Φ_m} ‖f − ϕ‖.

Take G_m ∈ Φ_m such that

‖f − G_m‖ ≤ E_m(f)(1 + η_m).
(3) Let

f_m := f_m^{τ,δ,η} := f − G_m.
The term approximate in this definition means that we use a functional F_{m−1} that is an approximation to the norming (peak) functional F_{f_{m−1}}, and also that we use an approximant G_m ∈ Φ_m which satisfies a weaker assumption than being a best approximant to f from Φ_m. Thus, in the approximate version of the WCGA, we have addressed the issue of non-exact evaluation of the norming functional and of the best approximant. We did not address the issue of finding sup_{g∈D} F_{f_{m−1}^c}(g). In Temlyakov (2005b) we addressed this issue. We did it in two steps: first we considered the corresponding modification of the WCGA, and then the modification of the AWCGA. These modifications are done in the style of the concept of depth search from Donoho (2001).
We now consider a countable dictionary D = {±ψ_j}_{j=1}^∞. We denote D(N) := {±ψ_j}_{j=1}^N. Let N := {N_j}_{j=1}^∞ be a sequence of natural numbers.

Restricted Weak Chebyshev Greedy Algorithm (RWCGA) We define f_0 := f_0^{c,τ,N} := f. Then, for each m ≥ 1, we have the following inductive definition.
(2) Define

Φ_m := Φ_m^{τ,N} := span{ϕ_j}_{j=1}^m,

and define G_m := G_m^{c,τ,N} to be the best approximant to f from Φ_m.

(3) Let

f_m := f_m^{c,τ,N} := f − G_m.
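A minimal sketch of the RWCGA in the Hilbert-space case (our illustration: the selection step (1), omitted in the excerpt, is filled in with the standard greedy choice t_m = 1, and the best approximant of step (2) becomes the orthogonal projection, computed by least squares):

```python
import numpy as np

def rwcga_hilbert(f, psi, N_seq, iters=50):
    """Sketch of the RWCGA in R^n with t_m = 1.

    psi: array (num_elements, n), rows normalized; the restricted
    dictionary D(N_m) is the first N_seq[m] rows (a depth schedule).
    Assumption (ours): step (1) picks phi_m in D(N_m) maximizing
    |<f_{m-1}, g>|; step (2)'s best approximant is the projection.
    """
    chosen, residual = [], f.astype(float).copy()
    for m in range(min(iters, len(N_seq))):
        search = psi[: N_seq[m]]                     # restricted search D(N_m)
        j = int(np.argmax(np.abs(search @ residual)))
        if j not in chosen:
            chosen.append(j)
        B = psi[chosen].T                            # basis of span{phi_1..phi_m}
        coef, *_ = np.linalg.lstsq(B, f, rcond=None) # projection coefficients
        residual = f - B @ coef                      # f_m := f - G_m
    return chosen, residual
```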
and

δ_m = o(t_m^p),  η_m = o(t_m^p).
A_1^b(K, D) := {f : d(f, A_1(D(n))) ≤ K n^{−b}, n = 1, 2, . . . }.

Here, A_1(D(n)) is the convex hull of {±ψ_j}_{j=1}^n and, for a compact set F,

d(f, F) := inf_{φ∈F} ‖f − φ‖.
(2) Define

Φ_m := span{ϕ_j}_{j=1}^m,

and let

E_m(f) := inf_{ϕ∈Φ_m} ‖f − ϕ‖.

Take G_m ∈ Φ_m such that

‖f − G_m‖ ≤ E_m(f)(1 + η_m).

(3) Let

f_m := f_m^{τ,δ,η,N} := f − G_m.
and

δ_m = o(t_m^p),  η_m = o(t_m^p).
We let Σ_m(D) denote the collection of all functions (elements) in X which can be expressed as a linear combination of at most m elements of D. Thus each function s ∈ Σ_m(D) can be written in the form

s = ∑_{g∈Λ} c_g g,  Λ ⊂ D,  #Λ ≤ m,

where the c_g are real or complex numbers. For a function f ∈ X we define its best m-term approximation error as follows:

σ_m(f) := σ_m(f, D) := inf_{s∈Σ_m(D)} ‖f − s‖.
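For intuition, σ_m(f, D) can be computed by brute force in a small Euclidean example (a sketch of ours; the exhaustive search over index sets Λ is exponential, which is the computational obstacle behind the NP-hardness result cited later in this section):

```python
import itertools
import numpy as np

def sigma_m(f, dictionary, m):
    """Best m-term approximation error in the Euclidean norm, by
    exhaustive search over all index sets Lambda with #Lambda <= m."""
    best = float(np.linalg.norm(f))                  # Lambda = empty set
    for k in range(1, m + 1):
        for idx in itertools.combinations(range(len(dictionary)), k):
            B = dictionary[list(idx)].T              # columns: chosen g in Lambda
            coef, *_ = np.linalg.lstsq(B, f, rcond=None)
            best = min(best, float(np.linalg.norm(f - B @ coef)))
    return best
```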
The first result in this area was the following conjecture of van der Corput (1935a, b). Let ξ^j ∈ [0, 1], j = 1, 2, . . . ; then we have

lim sup_{m→∞} m D((ξ^1, . . . , ξ^m), m, 1)_∞ = ∞.
This problem is still open. Recently, Bilyk and Lacey (2008) and Bilyk, Lacey
and Vagharshakyan (2008) made substantial progress on this problem. The
authors proved that, for some η = η(d) > 0,
They also conjectured that the right order of D(m, d)_∞ is m^{−1}(log m)^{d/2}.
We now proceed to a brief discussion of numerical integration. Numerical integration seeks good ways of approximating an integral

∫_Ω f(x) dμ

by an expression of the form

Λ_m(f, ξ) := ∑_{j=1}^m λ_j f(ξ^j),  ξ = (ξ^1, . . . , ξ^m),  ξ^j ∈ Ω,  j = 1, . . . , m.    (6.113)

It is clear that we must assume that f is integrable and defined at the points ξ^1, . . . , ξ^m. Equation (6.113) is called a cubature formula (Λ, ξ) (if Ω ⊂ R^d, d ≥ 2) or a quadrature formula (Λ, ξ) (if Ω ⊂ R) with knots ξ = (ξ^1, . . . , ξ^m) and weights Λ = (λ_1, . . . , λ_m). For a function class W we introduce the concept of error of the cubature formula Λ_m(·, ξ) by

Λ_m(W, ξ) := sup_{f∈W} |∫_Ω f dμ − Λ_m(f, ξ)|.    (6.114)
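A cubature formula (6.113) is straightforward to evaluate; the sketch below (with illustrative knots, weights and integrand of our own choosing) approximates the normalized integral of a periodic function on the two-dimensional torus:

```python
import numpy as np

def cubature(f, knots, weights):
    """Lambda_m(f, xi) = sum_j lambda_j * f(xi^j), cf. (6.113)."""
    return sum(w * f(x) for w, x in zip(weights, knots))

rng = np.random.default_rng(0)
knots = rng.uniform(0.0, 2.0 * np.pi, size=(64, 2))   # random knots xi^j
weights = np.full(64, 1.0 / 64.0)                     # weights summing to 1
f = lambda x: np.cos(x[0]) * np.cos(x[1])             # true integral is 0
print(cubature(f, knots, weights))                    # Monte Carlo-size error
```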
In order to orient the reader we will begin with the case of univariate periodic functions. Let, for r > 0,

F_r(x) := 1 + 2 ∑_{k=1}^∞ k^{−r} cos(kx − rπ/2)    (6.115)

and

W_p^r := {f : f = ϕ ∗ F_r, ‖ϕ‖_p ≤ 1}.    (6.116)
It is well known that for r > 1/p the class W_p^r is embedded into the space of continuous functions C(T). In the particular case of W_1^1 we also have embedding into C(T). From the definitions (6.113), (6.114) and (6.116) we see that, for the normalized measure dμ = (1/2π) dx,
Λ_m(W_p^r, ξ) = sup_{‖ϕ‖_p≤1} |(1/2π) ∫_T (∫_T F_r(x − y) dμ − ∑_{j=1}^m λ_j F_r(ξ^j − y)) ϕ(y) dy|
= ‖1 − ∑_{j=1}^m λ_j F_r(ξ^j − ·)‖_{p'},   p' := p/(p − 1).    (6.117)
Thus the quality of the quadrature formula Λ_m(·, ξ) for the function class W_p^r is controlled by the quality of Λ_m(·, ξ) for the representing kernel F_r(x − y). In the particular case of W_1^1 we have

Λ_m(W_1^1, ξ) = max_y |1 − ∑_{j=1}^m λ_j F_1(ξ^j − y)|.    (6.118)
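Identity (6.118) can be tested numerically with a truncation of the kernel (6.115); a sketch under our own choices of truncation level, evaluation grid and knots:

```python
import numpy as np

def F(x, r, K=2000):
    """Truncation of (6.115): 1 + 2 sum_{k<=K} k^{-r} cos(kx - r*pi/2)."""
    k = np.arange(1, K + 1)
    return 1.0 + 2.0 * np.sum(np.cos(np.outer(x, k) - r * np.pi / 2) / k**r,
                              axis=1)

def error_W11(knots, weights, grid=4096):
    """Right-hand side of (6.118): max_y |1 - sum_j lambda_j F_1(xi^j - y)|."""
    y = np.linspace(0.0, 2.0 * np.pi, grid, endpoint=False)
    vals = sum(w * F(x - y, 1) for w, x in zip(weights, knots))
    return float(np.max(np.abs(1.0 - vals)))

m = 16
knots = 2.0 * np.pi * np.arange(m) / m          # equispaced knots
print(error_W11(knots, np.full(m, 1.0 / m)))    # decays like 1/m in m
```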
Proposition 6.60 There exist two positive absolute constants C_1 and C_2 such that, for any Λ_m(·, ξ) with the property ∑_j λ_j = 1, we have

C_1 Λ_m(χ, ξ) ≤ Λ_m(W_1^1, ξ) ≤ C_2 Λ_m(χ, ξ).    (6.119)
F_r(x) := ∏_{j=1}^d F_r(x_j)

and

MW_p^r := {f : f = ϕ ∗ F_r, ‖ϕ‖_p ≤ 1}.
As in the univariate case, one obtains analogs of (6.117), (6.118) and Proposition 6.60 (see Temlyakov (2003b)):

Λ_m(MW_p^r, ξ) = ‖1 − ∑_{j=1}^m λ_j F_r(ξ^j − ·)‖_{p'};    (6.120)

Λ_m(MW_1^1, ξ) = max_y |1 − ∑_{j=1}^m λ_j F_1(ξ^j − y)|.    (6.121)
Proposition 6.61 There exist two positive constants C_1(d) and C_2(d) such that, for any Λ_m(·, ξ) with the property ∑_j λ_j = 1, we have
‖f^{(n_1,...,n_d)}‖_p ≤ 1    (6.123)
for all n = (n_1, . . . , n_d) such that n_1 + · · · + n_d ≤ R. These classes appeared as natural ways to measure smoothness in many multivariate problems, including numerical integration. It was established that for Sobolev classes the optimal error of numerical integration by formulas with m knots is of order m^{−R/d}. Assume now, for the sake of simplicity (to avoid fractional differentiation), that R = rd, where r is a natural number. At the end of the 1950s N. M. Korobov discovered the following phenomenon. Let us consider the class of functions which satisfy (6.123) for all n such that n_j ≤ r, j = 1, . . . , d (compare with the above classes MW_p^r). It is clear that this new class (the class of functions with a dominating mixed derivative) is wider than the Sobolev class with R = rd.
For example, all functions of the form

f(x) = ∏_{j=1}^d f_j(x_j),  ‖f_j^{(r)}‖_p ≤ 1,

are in this class, while not necessarily in the Sobolev class (it would require, roughly, ‖f_j^{(rd)}‖_p ≤ 1). Korobov constructed a cubature formula with m knots which guaranteed the accuracy of numerical integration for this class of order m^{−r}(log m)^{rd}, i.e. almost the same accuracy that we had for the Sobolev class.
Korobov’s discovery pointed out the importance of the classes of functions
with dominating mixed derivative in fields such as approximation theory and
numerical analysis. The (convenient for us) definition of these classes (classes of functions with bounded mixed derivative) is given above (see the definition of MW_p^r).
In addition to the classes of 2π-periodic functions, it will be convenient for us to consider the classes of non-periodic functions defined on Ω_d = [0, 1]^d. Let r be a natural number and let MW_p^r(Ω_d), 1 ≤ p ≤ ∞, denote the closure in the uniform metric of the set of rd-times continuously differentiable functions f(x) such that

‖f‖_{MW_p^r} := ∑_{0≤n_j≤r; j=1,...,d} ‖∂^{n_1+···+n_d} f / ∂x_1^{n_1} · · · ∂x_d^{n_d}‖_p ≤ 1,    (6.124)

where

‖g‖_p := (∫_{Ω_d} |g(x)|^p dx)^{1/p}.
where D_r(ξ, Λ, m, d)_q is defined in (6.125). The superscript o stands here for optimal, to emphasize that we are optimizing over the weights Λ. The first result in estimating the generalized discrepancy was obtained by Bykovskii (1985):

D_r^o(m, d)_2 ≥ C(r, d) m^{−r}(log m)^{(d−1)/2}.    (6.126)
This result is a generalization of Roth's result (6.109). The generalization of Schmidt's result (6.111) was obtained by Temlyakov (1990) (see theorem 3.5 of Temlyakov (2003b)):

D_r^o(m, d)_q ≥ C(r, d, q) m^{−r}(log m)^{(d−1)/2},  1 < q ≤ ∞.    (6.127)
We formulate an open problem (see Temlyakov (2003b)) for the case q = ∞.
Conjecture 6.62 For all d, r ∈ N we have

D_r^o(m, d)_∞ ≥ C(r, d) m^{−r}(log m)^{d−1}.
The above lower estimates for D_1^o(m, d)_q are formally stronger than the corresponding estimates for D(m, d)_q because in D_1^o(m, d)_q we are optimizing over the weights Λ. However, the proofs for D(m, d)_q can be adjusted to give the estimates for D_1^o(m, d)_q. The results (6.126) and (6.127) for the generalized discrepancy were obtained as corollaries of the corresponding results on cubature formulas. We do not know if the existing methods for D(m, d)_q can be modified to give the estimates for D_r^o(m, d)_q, r ≥ 2.
Remark 6.63 In the case of natural r the class MW_p^r turns into a subclass of the class MW_p^r(Ω_d)B := {f : f/B ∈ MW_p^r(Ω_d)} after the linear change of variables

x_j = −π + 2π t_j,  j = 1, . . . , d.
We are interested in the dependence on m of the quantities

δ_m(W) = inf_{λ_1,...,λ_m; ξ^1,...,ξ^m} Λ_m(W, ξ)
Theorem 6.65 The following lower estimate is valid for any cubature formula (Λ, ξ) with m knots (r > 1/p):
There are two big open problems in this area. They are formulated in
Temlyakov (2003b) as conjectures.
δ_m(MW_∞^r) ≥ C(r, d) m^{−r}(log m)^{(d−1)/2}.
We note that by Proposition 6.61, Theorem 6.64 and (6.125), Conjecture 6.66
implies Conjecture 6.62 and Conjecture 6.67 implies
where {x} is the fractional part of the number x, which are called the Korobov
cubature formulas. His results lead to the following estimate:
Bakhvalov used the Fibonacci cubature formulas, defined as follows. For periodic functions of two variables we consider the Fibonacci cubature formulas

Φ_n(f) = b_n^{−1} ∑_{μ=1}^{b_n} f(2πμ/b_n, 2π{μ b_{n−1}/b_n}),
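A direct implementation of the Fibonacci cubature formula (our sketch; the zero-mean periodic test integrand is an arbitrary illustrative choice):

```python
import numpy as np
from math import pi

def fibonacci_cubature(f, n):
    """Phi_n(f): average of f over the b_n Fibonacci lattice points
    (2*pi*mu/b_n, 2*pi*{mu*b_{n-1}/b_n}), mu = 1, ..., b_n."""
    fib = [1, 1]                           # b_0 = b_1 = 1
    while len(fib) <= n:
        fib.append(fib[-1] + fib[-2])
    bn, bn1 = fib[n], fib[n - 1]
    mu = np.arange(1, bn + 1)
    x = 2 * pi * mu / bn
    y = 2 * pi * ((mu * bn1) % bn) / bn    # fractional part, computed exactly
    return float(np.mean(f(x, y)))

f = lambda x, y: np.cos(3 * x) * np.cos(5 * y)   # zero-mean periodic test
print(fibonacci_cubature(f, 20))                 # exact up to rounding here
```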
Lemma 6.68 There exists a matrix A such that the lattice L(m) = Am, where m is a (column) vector with integer coordinates and

L(m) = (L_1(m), . . . , L_d(m))^T,

has the following properties:
(1°) ∏_{j=1}^d |L_j(m)| ≥ 1 for all m ≠ 0;
(2°) each parallelepiped P with volume |P| whose edges are parallel to the coordinate axes contains no more than |P| + 1 lattice points.
Let a > 1 and let A be the matrix from Lemma 6.68. We consider the cubature formula

Φ(a, A)(f) = (a^d |det A|)^{−1} ∑_{m∈Z^d} f((A^T)^{−1} m / a)
We now present the upper estimates for the discrepancy. Davenport (1956) proved that

D(m, 2)_2 ≤ C m^{−1}(log m)^{1/2}.
Other proofs of this estimate were later given by Vilenkin (1967), Halton and
Zaremba (1969) and Roth (1976). Roth (1979) proved
We note that the upper estimates for D(m, d)_q are stronger than the same upper estimates for D_1^o(m, d)_q.
The quantity δ_m(MW_p^r), discussed above, is a function of four variables: m, r, p, d. We discussed above the traditional setting of the problem, in which we want to find the right order of decay of δ_m(MW_p^r) as a function of m when m increases to infinity and the other parameters are fixed. As we have seen, there are important problems in this setting which remain open. Clearly, the problem of finding the right dependence of δ_m(MW_p^r) on all four parameters m, r, p, d is more difficult. Recently, driven by possible applications in finance, global optimization and path integrals, some researchers have studied the dependence of δ_m(F_d) on the dimension d for various function classes F_d. Today this topic is being thoroughly studied in information-based complexity, and a number of results have been obtained recently. We refer the reader to a nice survey of the topic written by Novak and Wozniakowski (2001).
We now proceed to applications of greedy approximation to discrepancy estimates in high dimensions. It will be convenient for us to study a slight modification of D(ξ, m, d)_q. For a, t ∈ [0, 1] denote

and, for x, y ∈ Ω_d,

H(x, y) := ∏_{j=1}^d H(x_j, y_j).
As already mentioned above (see (6.111) and (6.137)), the following relation is known:

D(m, d)_q = σ_m^e(J, B)_q := inf_{g_1,...,g_m∈B} ‖J(·) − (1/m) ∑_{μ=1}^m g_μ‖_{L_q(Ω_d)},

where

J(x) = ∫_{Ω_d} B(x, y) dy

and

B = {B(x, y), y ∈ Ω_d}.
It has been proved in Davis, Mallat and Avellaneda (1997) that if an algorithm finds a best m-term approximation for each f ∈ R^N for every dictionary D with a number of elements of order N^k, k ≥ 1, then this algorithm solves an NP-hard problem. Thus, in nonlinear m-term approximation we look for methods (algorithms) which provide approximation close to the best m-term approximation and at each step solve an optimization problem over only one parameter (ξ^μ in our case). In this section we provide such an algorithm for estimating σ_m^e(J, B)_q. We call this algorithm constructive because it provides an explicit construction with feasible one-parameter optimization steps.
We proceed to the construction. We will use in our construction the IA(ε), which was studied in Section 6.6. We will use the following corollaries of Theorem 6.29.
Thus the IA(ε) should find, at step m, an approximate solution to the following optimization problem (over y ∈ Ω_d):

∫_{Ω_d} |f_{m−1}^{i,ε}(x)|^{q−2} f_{m−1}^{i,ε}(x) H(x, y) dx → max.
(1/m) ∑_{μ=1}^m H(x, ξ^μ),
F_{f_{m−1}^r}(ϕ_m^r − G_{m−1}^r) ≥ t_m sup_{g∈D} F_{f_{m−1}^r}(g − G_{m−1}^r)
than an element ϕ_m^{i,ε}:

F_{f_{m−1}^{i,ε}}(ϕ_m^{i,ε} − f) ≥ −ε_m.
We will now derive an estimate for D(m, d)∞ from Corollary 6.70.
Proof We use the inequality from Niederreiter, Tichy and Turnwald (1990):

D(ξ, m, d)_∞ ≤ c(d, q) d(3d + 4) D(ξ, m, d)_q^{q/(q+d)}.    (6.143)
References
Bilyk, D. and Lacey, M. (2008), On the small ball inequality in three dimensions, Duke Math. J., 143, 81–115.
Bilyk, D., Lacey, M. and Vagharshakyan, A. (2008), On the small ball inequality in all
dimensions, J. Func. Anal., 254, 2470–2502.
Binev, P., Cohen, A., Dahmen, W., DeVore, R. and Temlyakov, V. (2005), Universal algorithms for learning theory. Part I: Piecewise constant functions, J. Machine Learning Research (JMLR), 6, 1297–1321.
Bourgain, J. (1992), A remark on the behaviour of L p -multipliers and the range of
operators acting on L p -spaces, Israel J. Math., 79, 193–206.
Bykovskii, V. A. (1985), On the correct order of the error of optimal cubature formulas
in spaces with dominant derivative, and on quadratic deviations of grids,
Preprint, Computing Center Far-Eastern Scientific Center, Akad. Sci. USSR,
Vladivostok.
Candes, E. (2006), Compressive sampling, ICM Proc., Madrid, 3, 1433–1452.
Candes, E. (2008), The restricted isometry property and its implications for
compressed sensing, C. R. Acad. Sci. Paris, Ser. I 346, 589–592.
Candes, E., Romberg, J. and Tao, T. (2006), Stable signal recovery from incomplete
and inaccurate measurements, Commun. Pure Appl. Math., 59, 1207–1223.
Candes, E. and Tao, T. (2005), Decoding by linear programming, IEEE Trans. Inform.
Theory, 51, 4203–4215.
Carl, B. (1981), Entropy numbers, s-numbers, and eigenvalue problems, J. Func.
Anal., 41, 290–306.
Carlitz, L. and Uchiyama, S. (1957), Bounds for exponential sums, Duke Math. J., 24,
37–41.
Chazelle, B. (2000), The Discrepancy Method, (Cambridge: Cambridge University
Press).
Chen, W. W. L. (1980), On irregularities of distribution, Mathematika, 27, 153–170.
Chen, S. S., Donoho, D. L. and Saunders, M. A. (2001), Atomic decomposition by
basis pursuit, SIAM Rev., 43, 129–159.
Cohen, A., Dahmen, W. and DeVore, R. (2007), A taste of compressed sensing, Proc.
SPIE, Orlando, March 2007.
Cohen, A., Dahmen, W. and DeVore, R. (2009), Compressed sensing and k-term
approximation, J. Amer. Math. Soc., 22, 211–231.
Cohen, A., DeVore, R. A. and Hochmuth, R. (2000), Restricted nonlinear
approximation, Construct. Approx., 16, 85–113.
Coifman, R. R. and Wickerhauser, M. V. (1992), Entropy-based algorithms for
best-basis selection, IEEE Trans. Inform. Theory, 38, 713–718.
Conway, J. H., Hardin, R. H. and Sloane, N. J. A. (1996), Packing lines, planes, etc.:
packing in Grassmannian spaces, Experiment. Math. 5, 139–159.
Conway, J. H. and Sloane, N. J. A. (1998), Sphere Packing, Lattices and Groups (New
York: Springer-Verlag).
Cordoba, A. and Fernandez, P. (1998), Convergence and divergence of decreasing
rearranged Fourier series, SIAM J. Math. Anal., 29, 1129–1139.
van der Corput, J. G. (1935a), Verteilungsfunktionen. I, Proc. Kon. Ned. Akad.
v. Wetensch., 38, 813–821.
van der Corput, J. G. (1935b), Verteilungsfunktionen. II, Proc. Kon. Ned. Akad.
v. Wetensch., 38, 1058–1066.
Cucker, F. and Smale, S. (2001), On the mathematical foundations of learning, Bull.
AMS, 39, 1–49.
Dai, W. and Milenkovic, O. (2009), Subspace pursuit for compressive sensing signal reconstruction, IEEE Trans. Inform. Theory, 55, 2230–2249.
Habala, P., Hájek, P. and Zizler, V. (1996), Introduction to Banach spaces [I]
(Karlovy: Matfyzpress).
Halász, G. (1981), On Roth’s method in the theory of irregularities of points
distributions, Recent Prog. Analytic Number Theory, 2, 79–94.
Halton, J. H. and Zaremba, S. K. (1969), The extreme and L 2 discrepancies of some
plane sets, Monats. für Math., 73, 316–328.
Heinrich, S., Novak, E., Wasilkowski, G. and Wozniakowski, H. (2001), The inverse
of the star-discrepancy depends linearly on the dimension, Acta Arithmetica, 96,
279–302.
Hitczenko, P. and Kwapien, S. (1994), On Rademacher series, Prog. Prob., 35, 31–36.
Höllig, K. (1980), Diameters of classes of smooth functions, in R. DeVore and
K. Scherer, eds., Quantitative Approximation (New York: Academic Press),
pp. 163–176.
Huber, P. J. (1985), Projection pursuit, Ann. Stat., 13, 435–475.
Jones, L. (1987), On a conjecture of Huber concerning the convergence of projection
pursuit regression, Ann. Stat., 15, 880–882.
Jones, L. (1992), A simple lemma on greedy approximation in Hilbert space and
convergence rates for projection pursuit regression and neural network training,
Ann. Stat., 20, 608–613.
Kalton, N. J., Peck, N. T. and Roberts, J. W. (1984), An F-space Sampler, London Math. Soc. Lecture Notes 5 (Cambridge: Cambridge University Press).
Kamont, A. and Temlyakov, V. N. (2004), Greedy approximation and the multivariate
Haar system, Studia Mathematica, 161(3), 199–223.
Kashin, B. S. (1975), On widths of octahedrons, Uspekhi Matem. Nauk, 30, 251–252.
Kashin, B. S. (1977a), Widths of certain finite-dimensional sets and classes of smooth
functions, Izv. Akad. Nauk SSSR, Ser. Mat., 41, 334–351; English translation in
Math. USSR IZV., 11.
Kashin, B. S. (1977b), On the coefficients of expansion of functions from a certain
class with respect to complete systems, Siberian J. Math., 18, 122–131.
Kashin, B. S. (1980), On certain properties of the space of trigonometric polynomials
with the uniform norm, Trudy Mat. Inst. Steklov, 145, 111–116; English
translation in Proc. Steklov Inst. Math. (1981), Issue 1.
Kashin, B. S. (1985), On approximation properties of complete orthonormal systems,
Trudy Mat. Inst. Steklov, 172, 187–191; English translation in Proc. Steklov Inst.
Math., 3, 207–211.
Kashin, B. S. (2002), On lower estimates for n-term approximation in Hilbert spaces,
in B. Bojanov, ed., Approximation Theory: A Volume Dedicated to Blagovest
Sendov, (Sofia: DARBA), pp. 241–257.
Kashin, B. S. and Saakyan, A. A. (1989), Orthogonal Series (Providence, RI:
American Mathematical Society).
Kashin, B. S. and Temlyakov, V. N. (1994), On best m-term approximations and the
entropy of sets in the space L 1 , Math. Notes, 56, 1137–1157.
Kashin, B. S. and Temlyakov, V. N. (1995), Estimate of approximate characteristics
for classes of functions with bounded mixed derivative, Math. Notes, 58,
1340–1342.
Kashin, B. S. and Temlyakov, V. N. (2003), The volume estimates and their
applications, East J. Approx., 9, 469–485.
Kashin, B. S. and Temlyakov, V. N. (2007), A remark on compressed sensing, Math.
Notes, 82, 748–755.
Kashin, B. S. and Temlyakov, V. N. (2008), On a norm and approximate
characteristics of classes of multivariate functions, J. Math. Sci., 155, 57–80.