You are on page 1of 93

An Introduction to Inequalities

N. L. Carothers

MATH 6820 Summer 2011


Bowling Green State University

Preface
These are notes for a six week summer course offered to graduate students at Bowling Green
State University. The audience consisted primarily of first and second year students in pure
mathematics, but also included a few students of applied mathematics and statistics.
I had no course outline or plan of attack other than to begin with the most widely
known classical inequalities and see where that led us. For the most part I simply followed
my nose and steered the discussion toward whatever held my interest at the moment. I
should quickly add, however, that the presentation consisted of more than lectures: From
the outset, the students were expected to present (and discuss) their solutions to the
problem sets in class, which added immensely to the experience. On several occasions we
devoted more than half the class period to discussing the problems. This material would
easily lend itself to a student seminar, a course on problem solving, or even a course taught
using the so-called Moore method.
I assumed very little by way of prerequisites and tried to make the course accessible to
a wide audience, which may help to explain the selection of topics. There are a few topics
that I would like to have covered (topics that required a knowledge of Lebesgue integration,
for example, or rudimentary functional analysis) that I feared would be beyond the grasp
of some of the students. Still, I think I managed to demonstrate a variety of techniques and
methods of proof in this brief survey, although my personal preferences, which lean toward
applications of convexity (and, more generally, toward an analytic rather than algebraic
perspective) are plainly reflected in these notes.

Preliminaries
Obviously, the comparison of quantities is an essential tool in mathematics. In this course
well be concerned with all manner of inequalities and, in particular, their systematic study
and the tools used in their creation.
The art of inequalities is found in the clever, often subtle methods used to generate
and verify them. The science of inequalities lies in their careful interpretation and in the
knowledge of their scope and limitations.
We will take great pains to distinguish between strict inequalities (<) and weak inequalities (). In the case of weak inequalities, we will carefully examine the case for
equality. While we will freely use techniques from calculus, we will be sparing with limits,
lest strict inequalities turn into weak inequalities.
Although the study of inequalities is arguably a subfield of real analysis, we will
occasionally find recourse to tools from complex analysis, linear algebra, geometry, and, of
course, algebra. Moreover, we will uncover applications to all of these fields (and more!).
As a starting point, well begin with two simple axioms:
(1) A given real number a satisfies precisely one of the following: a < 0, a = 0, a > 0.
(2) If a > 0 and b > 0, then ab > 0 and a + b > 0.
There are obvious variations and extensions of these axioms; for example:
(10 ) Given a, b R, precisely one of the following holds: a < b, a = b, a > b.
and:
Theorem. If a > b > 0 and c > d > 0, then ac > bd.
Proof. Given a > b, we have ab > 0 and, hence, (ab)c > 0. That is, ac > bc. Similarly,
c d > 0 leads to b(c d) > 0 and, hence, bc > bd. Combining these two observations
yields ac > bd.
Please consult the exercises for more direct algebra proofs.
As you can well imagine, we will also have frequent use of mathematical induction.
1

Theorem. If a > b > 0, then an > bn for all n = 1, 2, 3, . . ..


Theorem. n! > 2n1 for n = 3, 4, . . ..
Bernoullis Inequality. If x 1, then (1+x)n 1+nx for all n = 1, 2, 3, . . .. Equality
occurs only if n = 1 or x = 0.
Proof. Fix x 1. The inequality clearly holds for n = 1, so assume that (1+x)n 1+nx
for some n > 1. Then:
(1 + x)n+1 = (1 + x)(1 + x)n
(1 + x)(1 + nx)

(because 1 + x 0)

(1)

= 1 + (n + 1)x + x2
1 + (n + 1)x.

(2)

Thus, the inequality holds for all n = 1, 2, 3, . . .. Note that equality in (2) forces x = 0.
Refinement. If x 1, x 6= 0, then (1 + x)n > 1 + nx for all n = 2, 3, 4, . . ..
Corollary. Given x, y > 0, we have (x + y)n > xn + nxn1 y for all n = 2, 3, 4, . . ..
Equality can only occur if y = 0.

n
1
Application. 1 +
increases with n.
n
Proof. It suffices to show that 1 +

n+1
1
n+1

1+


1 n
n

> 1. For this we rewrite and apply

Bernoullis inequality:
!n+1
n+1


1
1
1 + n+1
1
+
1
n+1

= 1+

1 n
n
1 + n1
1+ n

  2
n+1
1
n + 2n
= 1+

n
(n + 1)2

 
n+1
1
1
= 1+
1
n
(n + 1)2

 

1
1
> 1+
1
= 1 (by Bernoulli).
n
n+1
2

Example. n! <

 n n
2

for n 6.

Proof. Its not hard to check directly that 6! < 36 . For the inductive step:


n+1 

n  

n+1
n n
n+1
n+1
n+1
=
>
2 n! = (n + 1)!
2
2
n
2
2
Bernoullis inequality is deceptively powerful. As evidence of this, well use it to prove
a classical inequality due to Cauchy.
The Arithmetic-Geometric Mean Inequality. Given n N and positive numbers
a1 , a2 , . . . , an we have
a1 a2 an

1/n

a1 + a2 + + an
n

(AGM)

with equality if and only if a1 = a2 = = an .


Proof. Well proceed by induction, of course. While the inequality is plainly true for n = 1,
its not hard to establish for n = 2 (see the exercises). So, suppose that the theorem holds
for some fixed n and all choices of a1 , a2 , . . . , an > 0. Given an+1 > 0, lets suppose (for
simplicity of notation) that an+1 ak for k = 1, . . . , n (otherwise, relabel the terms).
Lets also agree to write
Gk = a1 a2 ak

1/k

and Ak =

a1 + a2 + + ak
.
k

Then
An+1 =

nAn + an+1
an+1 An
= An +
n+1
n+1

where, by assumption, an+1 An . Thus, by our Corollary to Bernoullis inequality,




an+1 An
n+1
n+1
n
An+1 An + (n + 1)An
n+1

(3)

= an+1 Ann
an+1 Gnn = Gn+1
n+1 .

(4)

Note that equality in (3) would force an+1 = An while equality in (4) would force a1 = =
an (by hypothesis). Hence, equality throughout would yield a1 = a2 = = an+1 .
3

The AGM is extremely importantand well worth examining in greater detail


including variations, extensions, and a number of alternate proofs.
Example. Amongst all rectangles having a fixed perimeter P , the square has maximum
area.
Proof. Given a rectangle with sides of length a and b we have

A=

ab

P
a+b
=
2
4

(a constant).

Thus, A maximum when A = (P/4)2 which occurs when a = b = P/4.


This is a simplified example of what is sometimes called Didos problem or the isoperimetric problem: Amongst all planar regions having a fixed perimeter, which one has maximum area? The answer (which we may have time to pursue later) is the circle.
Example. Find the dimensions of the most economical 12 ounce soda can.
Solution. This is a familiar problem from calculus: We want to find the right-circular
cylinder having fixed volume V = r2 h and minimum surface area S = 2r2 + 2rh. But
S = 2r2 + 2r

V
r2

2V
r
V
V
= 2r2 + +
r
r

1/3
V
2 V
3 2r
r r

1/3
= 3 2V 2
,

= 2r2 +

(5)
(6)

where rewriting in (5) facilitates the application of the AGM in (6). Now equality occurs
when 2r2 = V /r; i.e., when r = (V /2)1/3 . For this value of r we have
 2/3
 1/3
V
V
2
V
h= 2 =

=2
= 2r.
r

V
2
In other words, the can should be as tall as it is wide. Not very realistic, but easy to solve.
4

Cauchys original proof of the AGM used a technique called backward induction: Suppose that P (n) is a proposition about the natural number n and suppose that
P (n) holds for infinitely many n, and
P (n 1) holds whenever P (n) holds.
Then P (n) holds for all n.
Cauchys proof of the AGM. We first establish (AGM) for n = 2m , m = 1, 2, . . .. To begin,
note that

a1 a2 =

a1 + a2
2

2

a1 a2
2

2


<

a1 + a2
2

2

unless a1 = a2 . The inductive step will be clear once we progress from m = 1 to m = 2.



2 
2
a1 + a2
a3 + a4
a1 a2 a3 a4
(7)
2
2
 a1 +a2
22
4
+ a3 +a
2
2

(8)
2
4

a1 + a2 + a3 + a4
.
=
4
Equality in (7) forces a1 = a2 and a3 = a4 , while equality in (8) forces a1 + a2 = a3 + a4 ;
thus, we have strict inequality in either (7) or (8) unless a1 = a2 = a3 = a4 . Continuing
leads to

a1 a2 a2m <

a1 + + a2m
2m

2m

unless a1 = = a2m . Now to see how the case of general n follows, consider this: Given
n < 2m and a1 , . . . , an > 0, set b1 = a1 , . . . , bn = an , bn+1 = = b2m = An . Then
m

a1 an A2n

= b1 b2m

2m
b1 + + b2m
<
2m

2m
m
nAn + (2m n)An
= A2n .
=
m
2

(9)

Hence, a1 an < Ann . Note that equality in (9) would force a1 = = an .


As a corollary, we get a special case of Youngs inequality, which we will see again
very soon.
5

mx + (n m)y
.
n
rx + (1 r)y.

Corollary. Let x, y > 0 and let 1 m < n. Then xm/n y 1m/n


That is, if r is a rational number satisfying 0 < r < 1, then xr y 1r

We conclude this section with a few stray facts from algebra and a few tools from
calculus that will prove helpful in the sequel.
n

The Binomial Theorem. (1 + x) =

n  
X
n

k=0

xk .


n
n
X
1
1
<
for n = 1, 2, . . ..
Application. 1 +
n
k!
k=0


Proof.

1
1+
n

n
=1+

n
X
n(n 1) (n k + 1)

nk

k=1

As it happens,

n
X
k=0

n
X
1
k! <
.
k!
k=0


n+1
1
1
< 1+
, but this is harder to prove.
k!
n

Geometric Series. For r 6= 1 we have 1 + r + r2 + + rn =


|r| < 1, we have

X
k=0

Application.

n
X
k=0

rk =

1 rn+1
. Thus, for
1r

1
. (See the exercises for a proof that rn 0 for |r| < 1.)
1r

1
< 3 for all n = 1, 2, . . ..
k!

Proof. As weve seen, k! 2k1 for k 2. Hence,


n
n
X
X
1
1
= 1+1+
k!
k!

k=0

1+1+

k=2
n
X
k=2

= 1+

n1
X
k=0

< 1+

1
2

1
2k1

k

1
= 3.
1 (1/2)

The Mean Value Theorem. If f : I R is differentiable, then for x, y I, we have


f (x) f (y) = f 0 (c) (x y)
6

for some c between x and y.


Application. For x 1 and r > 1, we have (1 + x)r 1 + rx with equality only at
x = 0.
Proof. If f (x) = (1 + x)r for x 1, then
(1 + x)r 1 = f (x) f (0) = r(1 + c)r1 x
for some c between x and 0. If x > c > 0, then 1 + c > 1 and, hence, (1 + x)r 1 > rx. If
1 x < c < 0, then 0 < 1 + c < 1 and, again, (1 + x)r 1 > rx.
x

x+1

1
1
increases while 1 +
decreases for x > 0.
Application.
1+
x
x
Proof. For x > 0, the mean value theorem assures us that log(x + 1) log x = 1/c for
some x < c < x + 1. Thus,
o
d n
1
x[ log(x + 1) log x ] = log(x + 1) log x
dx
x+1
1
1
=
>0
c x+1
x
and it follows that 1 + x1 = ex[ log(x+1)log x ] is increasing. Similarly,
o
d n
1
(x + 1)[ log(x + 1) log x ] = log(x + 1) log x
dx
x
1 1
= < 0,
c x

x+1
which means that 1 + x1
= e(x+1)[ log(x+1)log x ] is decreasing.
Heres a final application (whose proof is left as an exercise).
Application. If x 6= y are positive, then
(i) rxr1 (x y) > xr y r > ry r1 (x y) for r < 0 or r > 1;
(ii) rxr1 (x y) < xr y r < ry r1 (x y) for 0 < r < 1.
To see how this is related to our earlier work, watch closely:
xr y r < ry r1 (x y)

(0 < r < 1)

= xr < ry r1 (x y) + y r
= xr y r1 < r(x y) + y = rx + (1 r)y.
Look familiar?
7

Problem Set 1

1. If a > b > 0 and if p and q are positive integers, prove that ap/q > bp/q . In particular,
given a, b > 0, conclude that a > b if and only if ap/q > bp/q .
2. If a b > 0, p is a nonnegative integer, and q is a positive integer, show that
ap/q bp/q , with equality occurring if and only if either (i) a = b or (ii) p = 0.
3. If a1 b1 > 0, a2 b2 > 0, . . . , an bn > 0, show that a1 a2 an b1 b2 bn .
Moreover, equality holds if and only if a1 = b1 , a2 = b2 , . . ., an = bn .

4. If m and n are positive integers, show that 2 lies between m/n and (m+2n)/(m+n).

[Hint: Either m/n < 2 or 2 < m/n.]


5. From the inequality (a b)2 0, deduce that 2ab a2 + b2 , with equality occurring
if and only if a = b.
6. Given a, b > 0, show that

ab (a + b)/2, with equality if and only if a = b.

7. Amongst all rectangles having a fixed area A, prove that the square (with sides of

length A ) has the smallest perimeter.



2
1
1
8. From the inequality
0, deduce that
a
b

2
ab
(1/a) + (1/b)
for all positive a, b. Under what circumstances does equality hold?
9. Show that a + (1/a) 2 for all positive values of a. When does equality occur?
10. If 0 < c < 1, use Bernoullis inequality to show that cn 0. [Hint: Write c = 1/(1+x)
where x > 0.]
11. If c > 0, use Bernoullis inequality to show that c1/n 1. [Hint: If c > 1, write
c1/n = 1 + xn , where xn > 0, and estimate xn .]
12. Given x > 0, use Bernoullis inequality to show that (1 + (x/n))n increases while
(1 + (x/n))n+1 decreases as n increases.
13. Given a > 0, show that (1 + a)r > 1 + ra for any rational r > 1. [Hint: Write r = p/q,
where p > q are positive integers, and note that (1 + (x/p))p (1 + (x/q))q for x > 0.]
 n n
 n n
14. Prove by induction that
< n! <
for n 6.
3
2
8

15. Use the fact that k! 2

k1

n
X
1
< 3 for all
, for all k = 1, 2, 3, . . ., to prove that
k!
k=0

n = 0, 1, 2, . . ..
16. Suppose that 0 xi 1 for i = 1, 2, . . .. Prove that
(1 x1 ) (1 xn ) 1 (x1 + + xn ) + (x1 x2 + + xn1 xn )
for all n = 2, 3, . . ..

1
17. For n = 1, 2, 3, . . ., show that 2( n + 1 n ) < < 2( n n 1 ) and use
n
this to prove that, for n = 2, 3, . . .,
n
X

1
< 2 n 1.
2 n2 <
k
k=1

18. Prove that

1
1 3
2n 1
1
<
<
2 4
2n
4n + 1
3n + 1

for n = 2, 3, 4, . . ..

Convex Functions
A function f : I R, defined on a nontrivial interval I, is said to be convex if
f (x + y) f (x) + f (y)

(1)

whenever x, y I, , 0, + = 1. If the inequality in (1) is always strict for x 6= y, ,


> 0, we say that f is strictly convex. We say that f is concave (resp., strictly concave)
if f is convex (resp., strictly convex).
Examples.
(1) f (x) = ax + b is convexand concave, too, of course.
(2) f (x) = |x| is convex because |x + y| |x| + |y| for , 0.
(3) f (x) = x2 is strictly convex.
Proof. Because + = 1 we have
x2 + y 2 (x + y)2 = x2 + y 2 2xy
= (x y)2 > 0,
provided , > 0 and x 6= y.
As it happens, the rather unassuming inequality in (1) actually implies that convex
functions are quite well behaved. For example, its not terribly hard to prove the following:

Fact. Let f : I R be convex, where I is an open interval. Then


(i) f is continuous on I.
(ii) f is differentiable at all but (at most) countably many points of I.
(iii) f 0 is continuous wherever it exists.

An equivalent characterization of convexity is provided by the three chords lemma.


10

Theorem. Let f : I R. Then f is convex if and only if


f (z) f (x)
f (y) f (x)
f (y) f (z)

zx
yx
yz

(2)

for all x < z < y in I.


Proof. First suppose that f is convex. Given x < z < y, we can write
z =

zx
yz
x +
y
yx
yx

(where the coefficients of x and y are nonnegative and add to 1). Thus,
f (z)

yz
zx
f (x) +
f (y).
yx
yx

The inequalities in (2) now follow by rewriting; for example,


f (z) f (x)


zx
f (y) f (x)
yx

f (z) f (x)
f (y) f (x)

,
zx
yx

which is the first half of (1). The other half is entirely similar.
Now suppose that (2) holds. Take x < y in I, , 0, + = 1. Then z = x + y
satisfies
z x = (y x)
That is, z =

and y z = (y x).

yz
zx
x +
y. Thus, from (2),
yx
yx

zx
f (y) f (x)
yx

= f (x) + f (y) f (x)

f (z) f (x) +

= f (x) + f (y).
Corollary. If f : (a, b) R is differentiable and convex, then
f 0 (x)

f (y) f (x)
f 0 (y)
yx

for all x < y in (a, b). In particular, f 0 is increasing.


The converse of this result is also true and, indeed, well worth expanding on.
11

Theorem. Let f : (a, b) R be differentiable. Then the following are equivalent:


(i) f is convex;
(ii) f 0 is increasing;
(iii) f (x) + f 0 (x)(y x) f (y) for all x < y in (a, b).
Note that the left-hand side of (iii) is the tangent line to the graph of f at x. Thus,
(iii) tells us that every tangent line to the graph lies below the graph. A similar result
holds for nondifferentiable convex functions (see the exercises for details).
Proof.

(i) implies (ii) is clear. (ii) implies (iii) follows from the mean value theorem.

Indeed,
f (y) = f (x) + f 0 (c)(y x),

where x < c < y,

f (x) + f 0 (x)(y x).


Finally, (iii) implies (i). Given x < z < y we have:
f (x) + f 0 (x)(z x) f (z),
f (z) + f 0 (z)(x z) f (x),
f (z) + f 0 (z)(y z) f (y).
Thus,
f 0 (x)

f (z) f (x)
f (y) f (z)
f 0 (z)
.
zx
yz


As in the proof of the three chords lemma, writing z = x + y leads to f (z) f (x)

f (y) f (z) or f (z) f (x) + f (y).
Claim. If f : (a, b) R is twice-differentiable, then f is convex if and only if f 00 0
(that is, f is concave up). If f 00 > 0, then f is strictly convex. (See the exercises.)
Examples.
(1) xp is convex on (0, ) for p 1, and strictly convex for p > 1.
(2) ex is strictly convex on R.
(3) log x is strictly concave on (0, ).
(4) x log x is strictly convex on (0, ).
12

Applications to Classical Inequalities


We begin with an easy extension of our characterization of convex functions.
Jensens Inequality. f : I R is convex if and only if
f (1 x1 + + n xn ) 1 f (x1 ) + + n f (xn )
whenever x1 , . . . , xn I, 1 , . . . , n 0, 1 + + n = 1.
Jensens characterization follows easily from ours by induction.
Armed with our knowledge of convex functions, were ready for a fresh attack on
certain classical inequalities. We begin with a generalization of the AGM (but please
compare this result with Youngs inequality).
Theorem. Let x1 , . . . , xk , 1 , . . . , k > 0 with 1 + + k = 1. Then
k
1 2
x
1 x2 xk 1 x1 + 2 x2 + + k xk

with equality if and only if x1 = = xk .


Proof. Because log x is strictly concave on (0, ) we have


k
1 2
log x
x

x
= 1 log x1 + 2 log x2 + + k log xk
1
2
k

log 1 x1 + 2 x2 + + k xk ,
with equality if and only if x1 = = xk . And, because log x (or ex , if you prefer) is
strictly increasing, the result follows.
The case 1 = = k = 1/k is the familiar AGM. Another special case is also
familiar and will lead us to two new inequalities.
Youngs Inequality. If p, q > 1 satisfy

1
p

1
q

= 1, then xy

y > 0, with equality if and only if xp = y q .


This version of Youngs inequality leads to a simple proof of:
13

1 p 1 q
x + y for all x,
p
q

H
olders Inequality. Let x1 , . . . , xn , y1 , . . . , yn > 0 and let p, q > 1 satisfy

1
p

1
q

= 1.

Then
n
X

xi yi

i=1

with equality if and only if

(xpi )

and

n
X

!1/p
xpi

i=1
q
(yi ) are

n
X

!1/q
yiq

(3)

i=1

proportional; that is, if and only if, for some

, , not both zero, we have xpi = yiq for all i = 1, . . . , n. The case p = q = 2 is usually
referred to as the Cauchy-Schwarz inequality.
Pn
Pn
1/p
1/q
Proof. We may assume that u = ( i=1 xpi )
6= 0 and v = ( i=1 yiq )
6= 0, in which
case we have, from Youngs inequality, that
1  xi p 1  yi q
xi yi
+

u v
p u
q v

(4)

for each i, with equality if and only if (xi /u)p = (yi /v)q . Summing over i, we get
n
n
n
1 X
1 X p
1 X q
xi yi
x + q
y = 1.
uv i=1
pup i=1 i
qv i=1 i

Thus,

Pn

i=1

xi yi uv. Please note that equality in (3) would force equality in (4) for all

i and, hence, (xi /u)p = (yi /v)q for all i; that is, (xpi ) and (yiq ) would be proportional.
Corollary. Let x1 , . . . , xn , y1 , . . . , yn > 0, let 0 < p < 1, and let q satisfy
n
X

xi yi

i=1

n
X

!1/p
xpi

i=1

n
X

1
p

+ 1q = 1. Then

!1/q
yiq

i=1

with equality if and only if (xpi ) and (yiq ) are proportional.


Proof.

This version actually follows from our previous version after a bit of judicious

rewriting:
n
X

xpi

i=1

n
X

(xi yi )p yip .

i=1

Now we apply H
olders inequality using the conjugate pair of exponents
p0 =

1
>1
p

and q 0 =
14

q
= 1 q > 1.
p

(5)

Note that




1
p
1
1
+ 0 =p+
=p 1
= 1.
p0
q
q
p
Applying H
olders inequality with this pair of exponents yields
!p n
!p/q
n
n
X
X
X (p)(q/p)
p
xi
(xi yi )p(1/p)
yi
i=1

i=1

n
X

i=1

!p
xi yi

i=1

n
X

!p/q
yiq

i=1

Taking p-th roots and rearranging the terms finishes the proof.
We will see a few more variations and extensions of Holders inequality. For now well
settle for the following:
Theorem. Let a = (ai ), b = (bi ), . . . , h = (hi ) denote elements of Rn in which all entries
are positive, let , , . . . , > 0 with + + + = 1. Then
!
! n !
n
n
n
X
X
X
X

ai
bi

hi
ai bi hi
i=1

i=1

i=1

i=1

with equality if and only if a, b, . . . , h are all proportional.


The proof is essentially identical to the proof of Holders inequality for two sets of
numbers: We apply Youngs inequality (or the AGM) to the expression

 



ai
bi
hi
P
P
P
a
b
h
and sum over i.

15

Problem Set 2
1. If f : I R is convex, show that f (1 x1 + + k xk ) 1 f (x1 ) + + k f (xk ), for
any x1 , . . . , xk I and any 1 , . . . , k > 0 with 1 + + k = 1. [This is sometimes
called Jensens inequality, and follows easily by induction on k.] If f is strictly convex,
prove that equality occurs in Jensens inequality (if and) only if all the xi are equal.
2. Let f , g : I R be convex functions and let R. Determine which of the following
are convex functions: f , f + g, f g, |f |, f g, max{f, g}, min{f, g}.
3. Let f , g : R R be convex functions with g increasing. Show that g f is convex.
Deduce that if h : R R is a positive function and if log h is convex, then h is convex.
Give an example of a positive convex function on R whose logarithm is not convex.
4. Show that the reciprocal of a positive concave function is convex. Is the reciprocal of
a positive convex function always concave?
5. Prove that f : (0, ) R is convex if and only if g(x) = xf (1/x) is convex on (0, ).
6. If f is convex on (0, ), and if x1 , . . . , xm , y1 , . . . , ym > 0, show that


 


y1 + + ym
y1
ym
(x1 + + xm ) f
x1 f
+ + xm f
.
x1 + + xm
x1
xm
Show that f (x) = (1 + xp )1/p is convex on (0, ) for p 1, and deduce from the first

1/p
part of the exercise that (x1 + + xm )p + (y1 + + ym )p
(xp1 + y1p )1/p +
p 1/p
) .
+ (xpm + ym

7. Let f : (a, b) R be differentiable and convex. If f 0 (x0 ) = 0, show that f (x0 ) is a


global minimum for f .
8. Let f : (a, b) R be twice differentiable Show that f is convex if and only if f 00 0.
If f 00 > 0, show that f is strictly convex. Give an example of a twice differentiable,
strictly convex function for which f 00 fails to be strictly positive.
9. If f , g : I R are nonnegative, increasing, and convex, then so is f g. [Hint: First
note that [f (x) f (y)] [g(y) g(x)] 0 for x < y.]
10. Let f : (a, b) R be continuous. Show that f is convex if and only if, for all
1 R t
a < s < t < b, we have (t s)
f (x) dx [f (t) + f (s)]/2.
s
11. Let f : I R be convex. Then f has left- and right-derivatives at each point a in
0
0
the interior of I and, moreover, f
(a) f+
(a). If a < b are points in the interior of

I, show that
0
f+
(a)

f (b) f (a)
0
f
(b).
ba
16

12. If f : I R is convex, then f is locally Lipschitz on I; that is, f is Lipschitz on every


compact subinterval of I. In particular, f is continuous on I.
13. We say that T (x) = mx+b is a supporting line to the graph of f at x0 if T (x0 ) = f (x0 )
and T (x) f (x) for all x. Prove that f : (a, b) R is convex if and only if the graph
of f has a supporting line at each x0 (a, b). [Hint: For the forward implication, take
0
0
m with f
(x0 ) m f+
(x0 ).]

14. Prove that a convex function defined on a bounded interval is bounded below.
15. If f : R R is convex and bounded above, prove that f is constant.
16. Given f : I R, the epigraph of f is the set epif = {(x, y) : x I, y f (x)} in R2 .
Show that f is convex if and only if epif is a convex subset of R2 .
17. We say that f : I R is lower semicontinuous (l.s.c.) if, for each real , the set
{x I : f (x) } is closed. Prove that a convex function f : I R is lower
semicontinuous if and only if epif is closed. For this reason, l.s.c convex functions are
often referred to as closed convex functions.

17

Problem Set 2, Problem 10


10. Let f : (a, b) R be continuous. Show that f is convex if and only if, for all
1 R t
f (x) dx [f (t) + f (s)]/2.
a < s < t < b, we have (t s)
s
Solution. (=) If f is convex and if a < s < t < b, then, on the interval [ s, t ], the graph
of f lies below the chord joining (s, f (s)) and (t, f (t)); that is,


f (t) f (s)
f (x) f (s) +
(x s), s x t.
ts
Integrating both sides over [ s, t ] yields:




Z t
1 f (t) f (s)
f (s) + f (t)
2
f (x) dx f (s)(t s) +
(t s) =
(t s).
2
ts
2
s
(=) To prove the converse, it suffices to show that if f is not convex, then, on some
subinterval [ s, t ] (a, b), the graph of f lies (strictly) above the chord joining (s, f (s))
and (t, f (t)). That is,


f (t) f (s)
f (x) > f (s) +
(x s), s < x < t,
ts
1 R t
for then we just integrate both sides, as before, concluding that (t s)
f (x) dx >
s
[f (t) + f (s)]/2.
Now if f is not convex, then, for some pair of points a < s < t < b and some point




xs
tx
s+
t
x=
ts
ts
we have

f (x) >

tx
ts


f (s) +

xs
ts


f (t) = f (s) +

f (t) f (s)
ts


(x s) = L(x).

As suggested by Mr. Ghosh in class, we next consider:


s0 = sup{s0 : s s0 x, f (s0 ) = L(s0 )}

and t0 = inf{t0 : x t0 t, f (t0 ) = L(t0 )}.

Each of the sets above is closed and bounded, so s0 and t0 exist and satisfy s0 < x < t0
and f (s0 ) = L(s0 ), f (t0 ) = L(t0 ). (The continuity of both f and L has come into play
here.) Moreover, f (y) > L(y) for all s0 < y < t0 . (We cant have f (y) = L(y), and the
intermediate value theorem tells us that we cant have f (y) < L(y).) All that remains
is to note that the line through (s0 , f (s0 )) and (t0 , f (t0 )) coincides with the line through
(s, f (s)) and (t, f (t)); that is, the chord joining (s0 , f (s0 )) and (t0 , f (t0 )) lies on the chord
joining (s, f (s)) and (t, f (t)). (See the figure on the back.)
18

What makes this problem especially interesting is that theres a companion result (see
Theorem 125 in Hardy, Littlewood, and Polya).
100 . Let f : (a, b) R be continuous. Then f is convex if and only if, for all a < s < t < b,
1 R t
we have (t s)
f (x) dx f ((s + t)/2).
s
The forward implication is easy: If f is convex and if a < s < t < b, we can consider
a supporting line at (s + t)/2; that is, we can find an m such that





s+t
s+t
+m x
f (x) f
2
2
for all x. Integration over [ s, t ] then yields:


Z t
s+t
f (x) dx (t s)f
2
s
because


Z t
s+t
x
dx = 0.
2
s
Im not sure how to prove the backward implication. Anyone interested?

19

Elementary Means
Before we pursue more inequalities, lets introduce some notation. Given a vector x = (xi )
in Rn with positive entries and a real number p 6= 0, we write
!1/p
!1/p
n
n
X
X
1
and
Mp (x) =
Sp (x) =
xpi
.
xpi
n
i=1
i=1
The expression Sp (x) is sometimes written as kxkp . The expression Mp (x) is called the
(simple) mean of of x of order p. Of course, were somewhat familiar with such expression
when p > 0, but we will have occasion to explore the full range of valueseven p = !
Given a set of positive weights 1 , . . . , n > 0 with 1 + + n = 1, we write
!1/p
n
X
Mp (x, ) =
i xpi
.
i=1

Mp (x, ) is called the weighted mean of x of order p.


Please note that all of these expression are positive homogeneous; for example,
Mp (kx, ) = kMp (x, ) for k > 0,
where kx = (kxi ). Theyre also increasing functions of x in the sense that Mp (x, )
Mp (y, ) whenever xi yi for all i.
M1 (x) is called the arithmetic mean of x; M2 (x) is sometimes called the root-meansquare of x; M1 (x) is called the harmonic mean of x. One of our goals in this section is
to deduce suitable expressions for M , M , and M0 . First, a few simple
Observations.
1. min{ x1 , . . . , xn } Mp (x, ) max{ x1 , . . . , xn }. Equality occurs (in either inequality) only if x1 = = xn .
This is reasonably clear if p > 0. If p < 0, simplest might be to appeal to the fact that
Mp (x, ) =

1
Mp (1/x, )

where 1/x = (1/xi ).


20

n
1 2
2. M0 (x, ) lim p0 Mp (x, ) = x
1 x2 xn . Thus, in particular, we also have

min{ x1 , . . . , xn } M0 (x, ) max{ x1 , . . . , xn }.


Proof. First write
(
Mp (x, ) = exp

n
X
1
log
i xpi
p
i=1

!)
.

Next, we appeal to lH
opitals rule to find
P

n
p
Pn
log
i=1 i xi
i xpi log xi
i=1
P
= lim
lim
n
p
p0
p0
p
i=1 i xi
n
X
=
i log xi
i=1



n
1 2
= log x
x

x
.
n
1
2
3. M (x) lim p+ Mp (x, ) = max{ x1 , . . . , xn } and
M (x) lim p Mp (x, ) = min{ x1 , . . . , xn }.
1/p

Suppose that xk = max{ x1 , . . . , xn }. Then k xk Mp (x, ) xk

Proof.
1/p

and

1 as p +. The other case is similar but, again, we could just appeal to the

fact that Mp (x, ) = [ Mp (1/x, ) ]1 .


4. For s < t, we have Ms (x, ) Mt (x, ) with equality only if x1 = = xn .
First consider the case s = 1. Given t > 1, let t0 be the conjugate exponent:

Proof.
1
t

1
t0

= 1. Then, from H
olders inequality,
n
X
i=1

because

Pn

i=1

i xi =

n
X

1/t
1/t0
i xi i

i=1

n
X

!1/t
i xti

i=1

i = 1. That is, M1 (x, ) Mt (x, ). Equality can only occur if (i xti ) is

proportional to (i ); i.e., if and only if (xti ) is proportional to (1, 1, . . .); i.e., if and only if
x is constant.
Now consider the case 0 < s < t. In this case, t/s > 1 and, hence,
!1/s
!(s/t)(1/s)
!1/t
n
n
n
X
X
X
s(t/s)
i xsi

i xi
=
i xti
.
i=1

i=1

i=1

21

The case s = 0 is very similar. Given t > 0,


n
t

X
1 2
t n
t 2
n
t 1
i xti
x1 x2 xn
= (x1 ) (x2 ) (xn )
i=1

by the AGM. That is, M0 (x, ) Mt (x, ).


The remaining cases follow easily from what weve already shown. For example,
M0 (x, ) M1 (x, ), for all x, implies that M0 (1/x, ) M1 (1/x, ); that is,
!1
n
n
X
X
n
1 2
1 2
n
x
x2 x

i x1
= x
i x1
.
n
1
1 x2 xn
i
i
i=1

i=1

In other words, M0 (x, ) M1 (x, ).


Of particular interest are these: M M1 M0 M1 M2 M .
Minkowskis Inequality. Let x, y, Rn with positive entries and with 1 + +n =
1. Then
(i) Mp (x + y, ) Mp (x, ) + Mp (y, ) for 1 p ;
(ii) Mp (x + y, ) Mp (x, ) + Mp (y, ) for p 1.
If p 6= 1 is finite, then equality can only occur if x and y are proportional.
Proof.

Of course, equality always occurs (in both (i) and (ii)) when p = 1, so we may

suppose that p 6= 1.
First suppose that 1 < p < and let p0 satisfy 1/p + 1/p0 = 1. Then (p 1)p0 = p.
Next,
p

Mp (x + y, ) =

n
X

i (xi + yi ) =

i=1

n
X

n
X
i=1

i xi (xi + yi )

p1

i=1

n
X

i yi (xi + yi )p1

i=1
n
X
i=1

i (xi + yi )(xi + yi )p1

!1/p
i xpi

n
X
i=1

!1/p
i yip

n
X

i=1

= Mp (x, ) + Mp (y, ) Mp (x + y, )p1 .


22

!1/p0
i (xi + yi )(p1)p

(1)

Thus, Mp (x + y, ) Mp (x, ) + Mp (y, ). Equality occurs only if equality occurs in


both applications of H
olders inequality (in (1)); that is, only if (xpi ) is proportional to
((xi + yi )p ) is proportional to (yip ); all of which translates to: x is proportional to y.
For 0 < p < 1, we know that Holders inequality reverses, so (nearly) the same proof
yields Mp (x + y, ) Mp (x, ) + Mp (y, ). The case < p < 0 follow for precisely the
same reason for, in this case, the conjugate exponent p0 satisfies 0 < p0 < 1 (and Holders
inequality reverses in this case, too).
The three remaining cases are relatively easy to handle. Weve actually shown the
case p = 0, but using different notation:
n
x1 x
+ y11 ynn
M0 (x, ) + M0 (y, )
n
= 1
M0 (x + y, )
(x1 + y1 )1 (xn + yn )n

1 
n

1 
n
x1
xn
y1
yn
=

x1 + y1
xn + yn
x1 + y1
xn + yn

xn
y1
yn
x1
+ + n
+ 1
+ + n
x1 + y1
xn + yn
x1 + y1
xn + yn

= 1!
Equality can only occur if each of the sequences (xi /(xi + yi )) and (yi /(xi + yi )) are
constant, from which youll quickly deduce that x and y must be proportional.
Minkowskis inequality also holds in the cases p = , but the case for equality is a
bit different. For example,
M (x + y) = max{x1 + y1 , . . . , xn + yn } M (x) + M (y).
Equality can only occur if x and y attain their maximum values at the same coordinate;
i.e., if and only if, for some k, we have xk = M (x) and yk = M (y).
The Passage to Infinite Series
Its natural to ask whether our work on finite sums (and elementary means) extends to
infinite sums. For the most part, everything weve done has a satisfactory analogue in the
23

infinite series case, however the proofs can be surprisingly difficultsimply passing to a
limit is often not enough.
For now, well settle for stating a few of these extensions (with the occasional proof).
As well see, there are more convenient paths to all of this (and more) through integrals.
Given a sequence of positive weights = (i ) satisfying

i=1

i = 1 and a positive

number 0 < p < , we define


Mp (x, ) =

!1/p
i xpi

i=1

for x = (xi ) nonnegative. Well forego the case p < 0, but we will consider
!

Y
X
i
M0 (x, ) =
xi = exp
i log xi
i=1

i=1

and
M (x) = sup xi = l.u.b.{ xi : i 1 }.
i1

Facts.
1. If Ms is finite for some 0 < s < , then Mr is finite for all 0 < r < s and, in fact,
Mr Ms with equality if and only if x is constant. Moreover, M0 is finite (or zero)
in this case (i.e., M0 does not diverge to +) and, in fact, M0 = lim r0+ Mr .
2. M0 M1 (AGM) with equality if and only if x is constant.
Proof. From the inequality log x x 1 we have log xk log M1 (xk /M1 ) 1. Thus,





X
X
xk
1
(2)
k log xk log M1
k
M1
k=1

k=1

or, log M0 log M1 1 1 = 0. Equality here would force equality in (2), which would
mean that xk = M1 for all k.
3. If x = (xk ) is bounded, then lim r+ Mr = M .
Virtually all of our main results hold in the infinite series case with, at worst, minor
modifications.
24

The Passage to Integrals


In what follows, we will consider nonnegative, finite (almost everywhere) functions f :
I R, all defined on some common interval (or measurable set) I, and all assumed
R
to be (Riemann or Lebesgue) integrablemeaning that I f (x) dx exists as a finite real
number. In situations where the particular interval has no bearing on the argument or,
more commonly, is the same for all integrals, we may suppress it (and x) and simply write
R
f.
We will only pursue a couple of (very familiar) inequalities.
H
olders Inequality. Let f , g : I R be nonnegative integrable functions and let ,
0 with + = 1. Then f g is integrable and satisfies
Z  Z 
Z

f g
f
g .
Equality can only occur if, for some constants A and B, not both zero, we have Af = Bg.
Proof. The proof is quite familiar by now: We may suppose that u =
R
v = g 6= 0, in which case we have

 

g(x)
f (x)
g(x)
f (x)
+
,

u
v
u
v

f 6== and

with equality (everywhere) if and only if f /u = g/v. Now we integrate both sides:
Z
Z
Z

1

f g
f+
g = 1.
u v
u
v
A particular case is worth repeating:
Corollary. Let 1 < p < and let

1
p

1
q

= 1. If f p and g q are integrable, then f g is

integrable and satisfies


Z

Z
fg

1/p Z
g

1/q
.

Equality can only occur if Af p = Bg q for some constants A and B, not both zero.
As you can imagine, if p < 1, p 6= 0, and if

g q is nonzero and finite, then the

inequality reverses. We will forego the details.


Minkowski is next, but a slightly more general version that will come in handy later.
25

Minkowskis Inequality. Let 1 p < . If |f |p and |g|p are integrable, then so is


|f + g|p and
Z
|f + g|

1/p

Z

1/p

|f |

Z
+

1/p

|g|

(3)

If p = 1, equality can only occur if f g 0. If p > 1, equality can only occur if f g 0 and
Af = Bg for some nonnegative constants A and B, not both zero.
Proof. If p = 1, then
Z
|f + g|

Z 

|f | + |g| =

Z
|f | +

|g|,

(4)

hence |f + g| is integrable. Equality in (4) forces |f + g| = |f | + |g| (everywhere) and,


hence, we must have f g 0 (that is, f (x) and g(x) must have the same sign for all x).
If p > 1, first note that
h
ip
|f + g|p |f | + |g|
h
ip
2 max{ |f |, |g| }
= 2p max{ |f |p , |g|p }


p
p
p
2 |f | + |g| .
Thus, |f + g|p will be integrable. Now let q = p/(p 1) and apply Holders inequality:
Z
Z
p
|f + g| = |f + g| |f + g|p1
Z
Z
p1
|f | |f + g|
+ |g| |f + g|p1
(5)
(Z
1/p Z
1/p ) Z
1(1/p)
p
p
p

|f |
+
|g|
|f + g|
.
(6)
And (3) follows. Note that equality in (3) forces equality in both (5) and (6). Thus we
have f g 0 and
A|f |p B|g|p C|f + g|p
(where means is proportional to), which forces A0 f = B 0 g for some nonnegative
constants A0 and B 0 , not both zero.
26

We next, all too briefly, consider integral means. Given a positive, integrable weight
function (x), we define
Z

1/p

Mp (f, ) =

f (x) (x) dx
I

for 0 < p < and f 0 (and measurable, say, or even continuous). We also define
Z

M0 (f, ) = exp
(x) log f (x) dx
I

and
M (f ) = sup f (x) ( = ess.sup f (x)).
xI

xI

We could then develop the theory of the means Mp (f, ) in complete analogy to our
previous cases. We will, however, settle for a single observation:
If Ms (f, ) < for some 0 < s < , then Mr (f, ) < for all 0 < r < s and
Mr (f, ) Ms (f, ), with equality if and only if f is constant.
Proof. As before, the proof follows from Holders inequality, applied to p = s/r > 1.
Z
1/r Z
1/r
r
r r/s 1(r/s)
f
=
f
Z

(r/s)(1/r) Z

f
Z

(1/r)(1/s)

1/s

This result should be compared with the non-weighted setting, the so-called Lp -norms
Z
1/p
p
kf kp =
f (x) dx
(0 < p < ),
I

which are incomparable if I has infinite length and which otherwise satisfy
kf kr (b a)(1/r)(1/s) kf ks
if r < s and I = [ a, b ] (to see this, just set 1 in our previous calculation).

27

Problem Set 3A
A few calculus problems
1. Let 1 < p < and let q satisfy 1/p + 1/q = 1; in
other words, (p 1)(q 1) = 1. By examining the
graph of y = xp1 (a.k.a., x = y q1 ), pictured at
right, argue that
Z
ab

p1

y q1 dy,

dx +

with equality if and only if b = ap1 (equivalently,


ap = bq ). More generally, if y = f (x) and x = g(y)
are strictly increasing, continuous, inverses of one
another for x, y > 0, argue that
Z a
Z b
ab
f (x) dx +
g(y) dy,
0

with equality if and only if b = f (a).


2. Establish the following (using, for example, the mean value theorem).
(a) x/(1 + x) log(1 + x) x for x > 1, with equality (in either inequality) only
at x = 0.
(b) ex 1 + x with equality only at x = 0. Conclude that ex > (1 + (x/n))n .
(c) Deduce that log(1 + x) x/(1 x) for 1 < x < 1, with equality only at x = 0.
(d) Deduce that ex 1/(1 x) for x < 1, with equality only at x = 0.
3. Establish the following variations on Bernoullis inequality.
(i) (1 + x) 1 + x for x 1 and 0 < < 1.
(ii) (1 + x) 1 + x for x 1 and either < 0 or > 1.
(iii) (1 x) 1 x for 0 x 1 and 0 < < 1. [Hint: Write 1/(1 x) =
1 + x/(1 x).]
(iv) (1 x) 1 x for 0 x 1 and > 1. [Hint: If 0 < x < 1/, consider
(1 x)1/ .]
4. Which is bigger, e or e ?

28

A few algebra problems

a + nb
for all n = 1, 2, . . ., with equality if and only if a = b.
n+1
1
1
3
1
+ +
> for x > 1.
6. (i) Show that
x1 x x+1
x
P
(ii) Use (i) to show that the harmonic series n=1 n1 diverges.
5. Show that

n+1

abn

7. Without appealing to calculus, establish the following, where a, b, c > 0.


(i) ab + bc + ca a2 + b2 + c2
(ii) 9abc (a + b + c)(ab + bc + ca)
(iii) abc(a + b + c) a2 b2 + b2 c2 + c2 a2




1
1
1
8. Show that 64 1 +
1+
1+
for x, y, z > 0, x + y + z = 1.
x
y
z
9. Show that x + 1/(xy) + y 2 5(24/5 ) for any x, y > 0. Find values of x and y that
yield equality.
10. Show that the rectangular box of volume V having minimum surface area S is a cube.
[Hint: Show that V 2/3 S/6, with equality occurring if and only if all three edges
have equal length.]
11. Show that the right circular cylinder of volume V having least surface area S has its
diameter equal to its height.
12. Given a1 , . . . , an > 0, show that

n
X

!
ai

i=1

if a1 = = an .

29

n
X
1
a
i=1 i

!
n2 , with equality if and only

Problem Set 3A, Problem 8



6. Show that 64

1
1+
x




1
1
1+
1+
for x, y, z > 0, x + y + z = 1.
y
z

because f 00 (x) =

1
x2

1
(1+x)2

1
x

= log(1 + x) log x is strictly convex for x > 0



> 0 for x > 0. Thus, f (x) + f (y) + f (z) 3f x+y+z
=
3

Solution. The function f (x) = log 1 +

3f ( 13 ) = log(43 ). Exponentiating, this becomes:






1
1
1
1+
1+
1+
64.
x
y
z
By strict convexity, equality can only occur if x = y = z = 1/3.

30

Problem Set 3B
Notation: Fix n N and = (1 , . . . , n ), where i > 0 for all i and 1 + + n = 1.
For each nonzero real number t and each x = (x1 , . . . , xn ) with xi > 0, i = 1, . . . , n, we
define the weighted mean Mt (x, ) of order t by Mt (x, ) = (1 xt1 + n xtn )1/t . In the
n
1
remaining cases, t = 0, , we define M0 (x, ) = x
1 xn , M (x, ) = M (x) =

max{x1 , . . . , xn }, and M (x, ) = M (x) = min{x1 , . . . , xn }, respectively. In the case


1 = = n = 1/n, we often omit reference to and write Mt (x), which is called the
simple mean of order t.
1. Prove H
olders inequality in the case t = 1 (t0 = ) by showing M1 (x)M (y)
M1 (xy) M1 (x)M (y), where xy denotes the sequence (x1 y1 , . . . , xn yn ). When does
equality occur (in either of these inequalities)?
2. Prove Liapounovs inequality: If 0 < r < s < t and if we write s = r + t, where
, > 0, + = 1, then Ms (x)s Mr (x)r Mt (x)t . Equality can only occur if all
of the xi are equal. Upon taking logarithms, Liapounovs inequality tells us that the
function s 7 s log Ms is convex (for fixed x and ).
Recall that we have also defined the sums St (x) = (xt1 + + xtn )1/t for t > 0. Note that
St (x) = n1/t Mt (x). A somewhat more common notation is kxkt = (xt1 + + xtn )1/t .
These expressions are rarely used for t < 0, thus we are free to consider nonnegative xi .
3. If 0 < p < q, show that Sq (x) Sp (x). Equality can only occur if all but one of
the xi is zero. [Hint: The inequality is homogenous in x, thus we may assume that
Sp (x) = 1.] This is often called Jensens inequality. Please compare this result with
the fact that Mp (x) Mq (x).
4. Show that lim t St (x) = M (x). Thus, we define S (x) = M (x).
5. Given x > 0, show that lim t0+ (1 + xt )1/t = +. Conclude that lim t0+ St (x) =
+ if x has two or more nonzero coordinates.
6. Show that the sums St satisfy Holders inequality: If p > 1 and q satisfies 1/p+1/q = 1,
then S1 (xy) Sp (x)Sq (y). If 0 < p < 1, the inequality reverses: S1 (xy) Sp (x)Sq (y).
In every case, equality can only occur if x and y are proportional.
7. Show that S1 (xy) S1 (x)S (y) (this is Holders inequality in the case p = 1, q = ).
When does equality occur?
8. Show that if p, q > r > 0 satisfy 1/p + 1/q = 1/r, then Sr (xy) Sp (x)Sq (y).
9. Given 0 < p < r < q, write r = p + q, where , > 0, + = 1. Show that
Sr (x)r Sp (x)p Sq (x)q . Equality can only occur if all of the nonzero xi are equal.
31

10. Show that the sums St satisfy Minkowskis inequality: For t > 1 we have St (x + y)
St (x) + St (y), where x + y denotes the sequence (x1 + y1 , . . . , xn + yn ). For 0 < t < 1,
we have St (x + y) St (x) + St (y). In every case, equality can only occur if x and y
are proportional. (Of course, equality always holds if t = 1.)
11. Show that S (x + y) S (x) + S (y). When does equality occur in this case?
For 1 p , the expression kxkp = Sp (x) defines a norm on Rn . In other words, kxkp
satisfies: (i) kxkp 0 for all x Rn and kxkp = 0 if and only if x = 0; (ii) kaxkp = |a| kxkp
for any scalar a R; and (iii) kx + ykp kxkp + kykp for any x, y Rn (which is typically
called the triangle inequality in this context). For 0 < p < 1, (iii) reverses, thus kxkp
is not a norm in this case. However, as well see directly, for 0 < p < 1, the expression
d(x, y) = kx ykpp defines a translation invariant metric on Rn .
Pn
Pn
Pn
12. If p 1, show that i=1 (xi + yi )p i=1 xpi + i=1 yip . If 0 < p 1, the inequality
reverses. If p 6= 1, equality can only occur if xi yi = 0 for all i (in other words, xi and
yi cannot both be nonzero).
13. For 0 < p < 1, show that the expression d(x, y) = kx ykpp satisfies: (i) d(x, y) 0
for all x, y Rn and d(x, y) = 0 if and only if x = y; (ii) d(x, y) = d(y, x) for all
x, y Rn ; (iii) d(x, y) = d(x y, 0) = d(x + z, y + z) for all x, y, z Rn ; and
(iv) d(x, y) d(x, z) + d(z, y) for all x, y, z Rn .

32

Stirlings Formula
Recall (from Problem 14, Problem Set 1) that (n/3)n < n! < (n/2)! for n 6. Thus, its
reasonable to ask whether n! (n/a)n for some constant a (where means that the ratio
n!/(n/a)n 1 as n ). If you guessed that a = e, youd be rightbut the precise
order of magnitude (including error estimates) takes a bit more work.
We begin with a clever summation formula, due essentially to Euler.
Euler-Maclaurin Summation. If f has a continuous derivative on [ 1, n ], then
Z n
Z n
n
X

f (k) =
f (x) dx +
x [x] f 0 (x) dx + f (1),
1

k=1

where [x] denotes the greatest integer x.


Proof.

(If you happen to know Stieltjes integration, theres a very short proof! For

simplicity, well settle for a slightly longer but still elementary proof.) We begin with the
error:
n1
X

Z
f (k)

f (x) dx =
1

k=1

n1
X Z k+1
k=1


f (k) f (x) dx.

(Please note that the upper limit on the sums is now n 1.) We are going to integrate by
parts on the right-hand side, using the variables u = f (k) f (x) and v = x k 1 (!).
With this choice well have u(k) = 0 = v(k + 1), which will simplify things considerably.
Watch closely!
n1
X
k=1

Z
f (k)

f (x) dx =
1

n1
X Z k+1
k=1

n1
X Z k+1
k=1
n

Z
=


x k 1 f 0 (x) dx

x [x] 1 f 0 (x) dx


x [x] f 0 (x) dx + f (1) f (n).

Adding f (n) to both sides completes the proof.


Now Euler made another helpful observation: If we replace x [x] by x [x] 1/2,
well enhance the likelihood of convergence of the second integral (as n ) because well
introduce the possibility of cancellations.
33

y = x [x]

1/2
y=

Corollary.

n
X

Z
f (k) =


x [x] 1/2 f 0 (x) dx + [ f (1) + f (n) ]/2.

f (x) dx +
1

k=1

1
n

Z
Corollary. log( n! ) = (n + 1/2) log n n + 1 +
1

log( n! ) =

Proof.

n
X

x [x] 1/2
x

x [x] 1/2
dx.
x

log k

k=1
n

1
x [x] 1/2
dx + log n
x
2
1
1
Z n
x [x] 1/2
= (n + 1/2) log n n + 1 +
dx.
x
1
Z

log x dx +

In other words, weve just shown that




Z n
n! en
x [x] 1/2
an log
= 1+
dx.
n+1/2
x
n
1

(1)

We next show that (an ) converges using a standard technique from advanced calculus.
Z

Dirichlets Test for Integrals. If


f (t) dt is bounded for x 1 and if g(x) decreases
1
Z
to zero as x , then
f (x) g(x) dx exists (as an improper Riemann integral).
1

Its not hard to see that


Z x
Z



x [x] 1/2 dx =


1

[x]

Z

1

1

x [x] 1/2 dx
x [x] 1/2 dx = .

8
1/2


x [x] 1/2
dx exists. This proves that (an ) converges to a finite limit and,
x
1
hence, that bn = ean converges to a positive, finite limit C; that is, weve shown that
Z

Thus,

n! en
n nn+1/2

C = lim

34

exists.

This result is essentially due to DeMoivre in 1730. Finding the precise value of C, together
with some quantitative error estimates, will take a bit more work.
The Gamma Function
We begin with a continuous extension of n! initiated by an observation of Eulers:
Z 1
Z
n
( log x) dx =
tn1 et dt.
n! =
0

This motivated Legendre to consider the function


Z
(x) =
tx1 et dt,

x > 0.

Its not hard to see that (x) < for all x > 0. Indeed, given x > 0, we can find
t0 = t0 (x) sufficiently large so that tx1 et et/2 for all t t0 . Thus,
Z t0
Z
tx
x1
(x)
t
dt +
et/2 dt = 0 + 2et0 /2 < .
x
0
t0
We next show that (x) is continuousby showing that its convex!
Theorem. (i) (1) = 1.
(ii) (x + 1) = x(x) hence, (n + 1) = n!
(iii) is log-convex; that is, log (x) is convex.
(iv) is convex, hence continuous.
Proof. (i) is clear:

et dt = 1. (ii) follows from integration by parts:


Z
(x + 1) =
tx1 et dt
Z0

=
tx1 d et
0
Z
it=

x
t
= t e

et xtx1 dt
t=0
0
Z
= x
tx1 et dt = x(x).
0

35

To establish (iii), well use H


olders inequality. Given , 0 with + = 1, we have:
Z
(x + y) =
t(x+y)1 et dt
0

Z
=

tx1 et

i h

ty1 et

dt

(x) (y) ,
and it follows that log is convex. Finally, (iv) is a general principle: Every log-convex
function is also convex. This follows from Youngs inequality:
(x + y) (x) (y) (x) + (y).
Corollary. As x 0+ we have x(x) 1 and, hence, (x) +.
Proof. x(x) = (x + 1) (1) = 1 as x 0+ . It follows that (x) =

x(x)
+ as
x

x 0+ .
We next prove a remarkable characterization due to Artin in 1964.
Theorem. Suppose that f : (0, ) (0, ) satisfies (i) f (1) = 1; (ii) f (x + 1) = xf (x);
and (iii) f is log-convex. Then f = .
Proof. Of course, by (i) and (ii), f (n + 1) = n! for n = 0, 1, 2, . . .. Next, given 0 < x 1,
we use (ii) and (iii) to estimate f (n + 1 + x). Watch closely!
f (n + 1 + x) = f (n + 1 + x nx x + nx + x)
= f (1 x)(n + 1) + x(n + 2)

f (n + 1)1x f (n + 2)x

x
= f (n + 1)1x (n + 1)f (n + 1)
= (n + 1)x f (n + 1)
= (n + 1)x n!
36

Also,

n! = f (n + 1) = f x(n + x) + (1 x)(n + 1 + x)
f (n + x)x f (n + 1 + x)1x
= (n + x)x f (n + 1 + x)x f (n + 1 + x)1x
= (n + x)x f (n + 1 + x).
That is, f (n + 1 + x) (n + x)x n!.
Now we use (ii) to write these inequalities in terms of f (x).
f (n + 1 + x) = (n + x) f (n + x)
= (n + x)(n 1 + x) f (n 1 + x)
..
.
= (n + x)(n 1 + x) x f (x).
Thus,
(n + x)x n! (n + x)(n 1 + x) x f (x) (n + 1)x n!
or, after dividing by nx n!,


(n + x)(n 1 + x) x
x x

f (x)
1+
n
nx n!

1
1+
n

x
.

For fixed x, both extremes tend to 1 as n , so weve shown that


nx n!
n (n + x)(n 1 + x) x

f (x) = lim

for 0 < x 1. We next show that (2) holds for x > 1 as well.
Given x > 1, there is a positive integer m with 0 < x m 1. Then
f (x) = (x 1)(x 2) (x m) f (x m)
nxm n!
n (n + x m) (x m)

= (x 1)(x 2) (x m) lim
nxm n!
= lim
n (n + x m) x
37

(2)

nx n!
(n + x) (n + x + (m 1))

n (n + x) x
nm
nx n!
= lim
n (n + x) x

(3)

= lim

because the second factor in (3) consists of m terms each, top and bottom, and m and x
are fixed ; thus, the limit of this factor is 1 as n . Thus, (2) holds for all x > 0. In
other words, there can be only one function satisfying the hypotheses of the theorem.
nx n!
.
n (n + x)(n 1 + x) x

Corollary. (x) = lim

n1/2 n!
. But we can also find this
n (n + 1/2)(n 1/2) (1/2)

In particular, (1/2) = lim

limit by alternate means:


Z
Z
1/2 t
(1/2) =
t
e dt = 2
0

1/2

d t

Z
= 2

eu du =

Return to Stirlings Formula


If youll recall, weve shown that bn =

n! en
b > 0. Thus,
nn+1/2

b2n
(n!)2 e2n (2n)2n+1/2
=

b.
b2n
n2n+1
(2n)! e2n
But
22n n!
22n n!
=
(2n)!
(2n)(2n 1)(2n 2) 3 2 1
=

n!
n(n 1/2)(n 1) (3/2) 1 (1/2)

1
.
(n 1/2)(n 3/2) (3/2) (1/2)

So,

b2n
n1/2 n!
n + 1/2
=

2 2 (1/2) = 2.
b2n
(n + 1/2)(n 1/2) (3/2) (1/2)
n
Thus, weve finally arrived at Stirlings Formula.
lim

n! en
=
2
nn+1/2

or, in the notation from the beginning of this section, n!


38

2 nn+1/2 en .

Stirlings Series
Finally, lets talk about error estimates. In addition to the qualitative statement that

n! en
1
2 nn+1/2

as

n ,

Stirling provided quantitative estimates on




n! en
rn = log
.
2 nn+1/2
He showed that
rn

A
B
C
3 + 5
n
n
n

for certain constants A, B, etc., where means that rn lies between successive partial
sums of the (divergent) alternating series (called Stirlings series). For example,
B
A
A
3 < rn <
n
n
n

(n = 1, 2, 3, . . .)

for suitable A and B. Well settle for the modest estimate:


Lemma. 0 < rn < 1/12n for all n. The constant 1/12 cannot be improved.
Proof. The proof makes repeated use of the fact that if a sequence (or function) decreases
to 0 as n (resp., x ), then it must be positive. Similarly, if a sequence (or
function) increases to 0, then it must be negative.
In order to show that rn > 0, then, it suffices to show that (rn ) is decreasing (because
we already know that rn 0). Thus we want to show that
!

(n + 1)! en+1
2 nn+1/2
rn+1 rn = log

n! en
2 (n + 1)n+1+1/2


1
= 1 (n + 1/2) log 1 +
< 0
n
for all n. For this, it suffices to show that


1
1
log 1 +

> 0
n
n + 1/2
39

for all n. To this end, notice that the function g(x) = log(1 + 1/x) 1/(x + 1/2) tends to
0 as x . If we can show that g is decreasing, well know that g(x) > 0. But
g 0 (x) =

1
1
1
1

= 2
2
<0
2
(x + 1/2)
x(x + 1)
x + x + 1/4 x + x

for x > 0. Thus weve shown that (rn ) decreases to 0 and, hence, that rn > 0 for all n.
We next find a constant A > 0 so that rn < A/n for all n. For this, it suffices to show
that an = rn (A/n) < 0 for all n. But an 0, so it suffices to find A such that (an ) is
increasing. That is, we want


1
= 1 (n + 1/2) log 1 +
n

an+1 an


+A

1
1

n n+1


> 0

for all n. This will be the case if




g(x) =

1
1
log 1 +
x + 1/2
x

1
x


+A

1
x+1

x + 1/2

> 0

for all x > 0. Again, we know that g(x) 0 as x , so it would suffice to find A so
that g 0 (x) < 0 for all x > 0. Hang on!
2
g(x) =
log
2x + 1

x+1
x


+

2A
x(x + 1)(2x + 1)

x(x + 1) A(12x2 + 12x + 2)


= g (x) =
.
x2 (x + 1)2 (2x + 1)2
0

Thus it suffices to find A such that


A >

x(x + 1)
= h(x),
12x2 + 12x + 2

x > 0.

(4)

But h(x) 1/12 as x and


h0 (x) =

2x + 1
> 0
2(6x2 + 6x + 1)2

for x > 0. That is, h(x) strictly increases to 1/12 as x , so A = 1/12 is the smallest
possible upper bound in (4). Because this is such a roundabout proof, lets summarize what
weve done: A = 1/12 = g 0 < 0 = g > 0 = (an ) increasing = an < 0 =
rn < 1/12n. Phew!!
Conclusion. For all n = 1, 2, 3, . . ., we have 1 <

40

n! en
< e1/12n .
2 nn+1/2

Problem Set 4
Notation: A positive function f : I (0, ) defined on an interval I is said to be log
 

convex if log f is convex. Equivalently, f is log-convex if f (x + y) f (x) f (y)
whenever x, y I, , 0, + = 1. From Youngs inequality, a log-convex function is
also convex; thus, the log-convex functions form a subset of the convex functions.
1.

(i) If a, b > 0, show that f (x) = a bx is log-convex on R.


(ii) Show that g(x) = x is convex but not log-convex on (0, ).

2. If f and g are log-convex, show that f g and f + g are log-convex.


3. For x and fixed, show that the function f (t) = Mt (x, )t is log-convex for t > 0.
r
r
( 12 (n + 1))
2
2
4. Use the log-convexity of (x) to show that

for

1
n+1
n
( 2 (n + 2))
n = 1, 2, 3, . . ..
Z
3
5. Evaluate
t1/2 et dt.
0

(2n)!
1
6. Show that n +
=
for n = 0, 1, 2, . . ..
2
n! 4n
 
2n
22n
7. Show that
. [Recall that an bn means that an /bn 1.]
n
n


8. Establish Walliss formula:

22 44
(2n)(2n)
= lim

.
n
2
13 35
(2n 1)(2n + 1)

[Hint: /2 = (1/2)((1/2))2 .]

x x + 1

= x1 (x).
9. Prove Legendres duplication formula:
2
2
2

[Hint: Show that f (x) = (2x1 / )(x/2)((x + 1)/2) satisfies the hypotheses of
Artins theorem.]
Z
10. Consider f (x) =

tx1
dt for 0 < x < 1. [Euler showed that f (x) = / sin(x).]
1+t
0
R
(a) Use the fact that (1 + t)1 = 0 e(1+t)s ds to show that f (x) = (x)(1 x).

(b) Show that f is log-convex on (0, 1).


(c) Conclude that = f (1/2) f (x) for all 0 < x < 1.

41

Inner Product Spaces


An inner product space is a vector space X over C (or R) together with a scalar-valued
function h x, y i of two variables, called an inner product, satisfying:
(i) h, i : X X C (or R)
(ii) h x, y i = h y, x i for x, y X
(iii) h x + y, z i = h x, z i + h y, z i for x, y, z X and , C
From (ii), the inner product isnt fully linear in its second coordinate, so we sometimes
say that h, i is sesquilinear (or one-and-a-half-linear).
(iv) h x, x i 0 for any x X; h x, x i = 0 if and only if x = 0.
Note that by linearity we have h x, 0 i = 0 = h 0, x i for any x X. Moreover, from
(iii), if h x, y i = 0 for all y X, then x = 0.
Examples
1. Cn with its usual inner product h x, y i = y T x =

Pn

k=1

xk y k .

2. Given continuous functions f , g : [ a, b ] C, the expression h f, g i =

Rb
a

f (x) g(x) dx

defines an inner product. Given a strictly positive weight function w : [ a, b ] R, we


Rb
might also consider h f, g i = a f (x) g(x) w(x) dx.
In the remainder of this section, well assume that X is an inner product space over
C. As well see directly, an inner product induces a norm on X by setting
q
kxk = h x, x i .
In term of our first two examples we have
kxk =

n
X

!1/2
|xk |2

on Cn

k=1

and
Z
kf k =
a

!1/2
|f (x)|2 dx

Z
or

!1/2
|f (x)|2 w(x) dx

on C[ a, b ].

A key step in verifying that equation (1) defines a norm is given by:
42

(1)



The Cauchy-Schwarz Inequality. In any inner product space, h x, y i kxk kyk, with
equality if and only if x and y are linearly dependent (that is, parallel).
In terms of our initial examples,

n

X


xk y k



k=1

!1/2

n
X

|xk |

k=1

n
X

!1/2
|yk |

k=1

and
Z

b



f (x) g(x) dx

a

|f (x)|2

!1/2 Z

!1/2

|g(x)|2

Proof. By our earlier remarks, we may assume that x, y 6= 0. Now, given C, note
that
0 kx yk2 = h x y, x y i
= kxk2 h x, y i h x, y i + ||2 kyk2
= kxk2 2 Re ( h x, y i) + ||2 kyk2 .
In particular, setting = h x, y i/h y, y i = h x, y i/kyk2 , leads to
0 kxk2 2

|h x, y i|2
|h x, y i|2
|h x, y i|2
2
2
+
kyk
=
kxk

.
kyk2
kyk4
kyk2

We leave the case for equality as an exercise (just examine the proof closely).
| Reh x, y i|
|h x, y i|

1. Thus
kxk kyk
kxk kyk
we can (and will!) define the angle between (nonzero) vectors x and y by declaring
It follows from the Cauchy-Schwarz inequality that

cos =

Reh x, y i
.
kxk kyk

(This uniquely defines if we insist that 0 .) In particular, we say that x and y


are orthogonal if h x, y i = 0. We will sometimes use the shorthand x y to denote that
x and y are orthogonal. From our earlier remarks, every vector is orthogonal to the zero
vector (and the zero vector is the only vector with this property).
The observation that
kx + yk2 = kxk2 + 2 Reh x, y i + kyk2
(which is the law of cosines in disguise!) leads to two important results.
43

(2)

The Pythagorean Theorem. kx + yk2 = kxk2 + kyk2 if and only if h x, y i = 0.


The Triangle Inequality. kx + yk kxk + kyk.
Proof. From (2) and the Cauchy-Schwarz inequality,
kx + yk2 kxk2 + 2|h x, y i| + kyk2
kxk2 + 2kxk kyk + kyk2
2
= kxk + kyk .
Corollary. kxk =

h x, x i defines a norm on X.

At the risk of a bit of repetition, let's take another look at the proof of the Cauchy-Schwarz inequality.
Lemma. Let $x, y \in X$ with $y \ne 0$ and let $\lambda = \langle x, y \rangle / \langle y, y \rangle$. Then:
(i) $(x - \lambda y) \perp y$ and, of course, $x = (x - \lambda y) + \lambda y$. Thus, $x$ can be written as the sum of a vector orthogonal to $y$ and a vector parallel to $y$.
(ii) $\|x\|^2 = \|x - \lambda y\|^2 + |\lambda|^2 \|y\|^2$. In particular, note that $\|x\| \ge \|x - \lambda y\|$.
(iii) $\|x - \lambda y\| < \|x - \mu y\|$ for all $\mu \ne \lambda$. Thus, $y^* = \lambda y$ is the unique point in $\operatorname{span}\{y\}$ nearest to $x$; it is characterized by the requirement that $(x - y^*) \perp \operatorname{span}\{y\}$.
Proof. (i) follows by design; that is, $\lambda$ is chosen to satisfy $\langle x - \lambda y, y \rangle = 0$. (ii) follows easily from (i) and the Pythagorean theorem. Likewise, (iii) follows from (i) and the Pythagorean theorem. Indeed, from (i) and the linearity of the inner product, it follows that $x - \lambda y$ is perpendicular to every multiple of $y$; thus, given $\mu \in \mathbb{C}$, we have
$$\|x - \mu y\|^2 = \|(x - \lambda y) + (\lambda - \mu) y\|^2 = \|x - \lambda y\|^2 + |\lambda - \mu|^2 \|y\|^2 > \|x - \lambda y\|^2$$
unless $\mu = \lambda$.

The vector $y^* = \lambda y$ in our previous Lemma is called the orthogonal projection of $x$ along $y$; it is the "shadow" of $x$ onto $\operatorname{span}\{y\}$ along a perpendicular line of sight. In calculus, it is often called the component of $x$ in the direction of $y$.
As we'll see directly, if we inductively apply the Lemma to a basis for a (finite-dimensional) subspace $Y$ of $X$, we'll arrive at a technique for building an orthogonal basis (that is, a basis consisting of mutually orthogonal vectors). The benefit in having an orthogonal basis is illustrated by our next result.
Lemma. Let $\{e_1, \ldots, e_n\}$ be a set of nonzero, mutually orthogonal vectors in $X$ and let $E = \operatorname{span}\{e_1, \ldots, e_n\}$. Then
(i) $\{e_1, \ldots, e_n\}$ is linearly independent; in fact, $z \in E$ if and only if $z = \sum_{i=1}^n \frac{\langle z, e_i \rangle}{\langle e_i, e_i \rangle}\, e_i$. Thus, $\{e_1, \ldots, e_n\}$ is a basis for $E$.
(ii) Given any $x \in X$, the vector $x - \sum_{i=1}^n \frac{\langle x, e_i \rangle}{\langle e_i, e_i \rangle}\, e_i$ is orthogonal to $E$.
(iii) $e^* = \sum_{j=1}^n \frac{\langle x, e_j \rangle}{\langle e_j, e_j \rangle}\, e_j$ is the unique nearest point to $x$ in $E$.
(iv) $x \in E$ if and only if $\|x\|^2 = \sum_{j=1}^n \frac{|\langle x, e_j \rangle|^2}{\langle e_j, e_j \rangle} = \|e^*\|^2$.
Proof. To begin, if we set $z = \sum_{i=1}^n \alpha_i e_i$, then
$$\langle z, e_k \rangle = \sum_{i=1}^n \alpha_i \langle e_i, e_k \rangle = \alpha_k \langle e_k, e_k \rangle,$$
which proves (i). Next, given $x \in X$, set $y = x - \sum_{i=1}^n \frac{\langle x, e_i \rangle}{\langle e_i, e_i \rangle}\, e_i$. A similar calculation then yields
$$\langle y, e_k \rangle = \langle x, e_k \rangle - \langle x, e_k \rangle = 0,$$
and it follows that $y$ is orthogonal to every vector in $E$. Parts (iii) and (iv) are now almost immediate. Indeed, from (ii), $x - e^*$ is orthogonal to $E$. Thus, given any vector $e \in E$ we have $e - e^* \in E$ and, hence,
$$\|x - e\|^2 = \|(x - e^*) + (e^* - e)\|^2 = \|x - e^*\|^2 + \|e^* - e\|^2 > \|x - e^*\|^2,$$
unless $e = e^*$. Finally, $x \in E$ if and only if $x = e^*$. But, from (ii), we have $\|x\|^2 = \|x - e^*\|^2 + \|e^*\|^2$. Thus, $x = e^*$ if and only if $\|x\|^2 = \|e^*\|^2 = \sum_{j=1}^n \frac{|\langle x, e_j \rangle|^2}{\langle e_j, e_j \rangle}$.
The vector $e^*$ of our previous Lemma is the orthogonal projection of $x$ onto $E$. Again, it is characterized by the requirement that $(x - e^*) \perp E$ (as in (ii) of the Lemma).
Again, let's pause to examine our first two examples.
Examples
1. The usual basis $e_k = (0, \ldots, 0, 1, 0, \ldots, 0)$, where the single nonzero entry is in the $k$-th coordinate, is an orthonormal basis for $\mathbb{C}^n$. That is, the $e_k$ are not only mutually orthogonal ($\langle e_i, e_j \rangle = 0$ for $i \ne j$) but also norm one ($\langle e_j, e_j \rangle = 1$ for all $j$). Note that every vector $x = (x_1, \ldots, x_n) = \sum_{k=1}^n \langle x, e_k \rangle\, e_k$ has norm $\|x\|^2 = \sum_{k=1}^n |x_k|^2$.
2. Relative to the inner product $\langle f, g \rangle = \frac{1}{2\pi} \int_0^{2\pi} f(x)\, \overline{g(x)}\, dx$, the vectors $\{1, e^{ix}, \ldots, e^{inx}\}$ are orthonormal. Indeed,
$$\int_0^{2\pi} e^{imx}\, \overline{e^{inx}}\, dx = \int_0^{2\pi} e^{i(m-n)x}\, dx = \begin{cases} 0, & \text{if } m \ne n \\ 2\pi, & \text{if } m = n. \end{cases}$$
In the case of real scalars, real-valued functions, and the inner product $\langle f, g \rangle = \frac{1}{\pi} \int_0^{2\pi} f(x)\, g(x)\, dx$, the vectors $\{1/\sqrt{2}, \cos x, \sin x, \ldots, \cos nx, \sin nx\}$ are orthonormal.
We next show how the ideas developed in our first two lemmas can be used to construct orthogonal sequences.
The Gram-Schmidt Process. Let $E$ be a subspace of $X$ with basis $\{x_1, \ldots, x_n\}$. Then we can find an orthogonal basis $\{e_1, \ldots, e_n\}$ for $E$ satisfying
(i) $\operatorname{span}\{e_1, \ldots, e_k\} = \operatorname{span}\{x_1, \ldots, x_k\}$ for $k = 1, \ldots, n$, and
(ii) $0 < \|e_k\| \le \|x_k\|$ for $k = 1, \ldots, n$.
Proof. To begin, set $e_1 = x_1 \ne 0$ and let $e_2 = x_2 - \frac{\langle x_2, e_1 \rangle}{\langle e_1, e_1 \rangle}\, e_1$. Because $x_2 \notin \operatorname{span}\{e_1\} = \operatorname{span}\{x_1\}$, we have $e_2 \ne 0$. Of course, $e_2 \in \operatorname{span}\{e_1, x_2\} = \operatorname{span}\{x_1, x_2\}$ and so we have $\operatorname{span}\{e_1, e_2\} \subseteq \operatorname{span}\{x_1, x_2\}$. But, from the previous Lemma, $e_2$ is orthogonal to $\operatorname{span}\{e_1\} = \operatorname{span}\{x_1\}$. In particular, $e_1$ and $e_2$ are linearly independent and it follows that $\operatorname{span}\{e_1, e_2\} = \operatorname{span}\{x_1, x_2\}$. To see that $e_2$ satisfies (ii), notice that $x_2 = e_2 + \lambda e_1$ and, hence, $\|x_2\|^2 = \|e_2\|^2 + |\lambda|^2 \|e_1\|^2 \ge \|e_2\|^2$.
Continue by induction: Assuming that $\{e_1, \ldots, e_k\}$ have been chosen, set $e_{k+1} = x_{k+1} - \sum_{j=1}^k \frac{\langle x_{k+1}, e_j \rangle}{\langle e_j, e_j \rangle}\, e_j$. Because $x_{k+1} \notin \operatorname{span}\{x_1, \ldots, x_k\} = \operatorname{span}\{e_1, \ldots, e_k\}$, we have $e_{k+1} \ne 0$. Also, $\operatorname{span}\{e_1, \ldots, e_k, e_{k+1}\} \subseteq \operatorname{span}\{x_1, \ldots, x_k, x_{k+1}\}$. But, as before, $e_{k+1}$ is orthogonal to $\operatorname{span}\{e_1, \ldots, e_k\}$ and, hence, $\{e_1, \ldots, e_k, e_{k+1}\}$ are linearly independent. Thus we must have $\operatorname{span}\{e_1, \ldots, e_k, e_{k+1}\} = \operatorname{span}\{x_1, \ldots, x_k, x_{k+1}\}$. Finally, $\|x_{k+1}\|^2 = \|e_{k+1}\|^2 + \sum_{j=1}^k |\lambda_j|^2 \|e_j\|^2 \ge \|e_{k+1}\|^2$.
Clearly, once we have an orthogonal basis, we simply normalize to arrive at an orthonormal basis (alternatively, by slightly altering the process outlined above, we could construct the vectors $e_k$ to have norm one).
Corollary. If $E$ is a finite-dimensional subspace of $X$, then $E$ has an orthonormal basis $\{e_1, \ldots, e_n\}$.
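The induction above is already an algorithm. Here is a minimal NumPy transcription (the function name `gram_schmidt` is our own); it produces the orthogonal, unnormalized vectors of the theorem and checks both conclusions:

```python
import numpy as np

def gram_schmidt(xs):
    """Orthogonalize linearly independent vectors, following the inductive step
    e_{k+1} = x_{k+1} - sum_j <x_{k+1}, e_j>/<e_j, e_j> e_j (no normalization)."""
    es = []
    for x in xs:
        e = x.astype(complex)
        for prev in es:
            e = e - (np.vdot(prev, x) / np.vdot(prev, prev)) * prev
        es.append(e)
    return es

xs = [np.array([1.0, 1, 0]), np.array([1.0, 0, 1]), np.array([0.0, 1, 1])]
es = gram_schmidt(xs)
for j in range(3):
    for k in range(j):
        assert abs(np.vdot(es[j], es[k])) < 1e-12                   # mutually orthogonal
    assert np.linalg.norm(es[j]) <= np.linalg.norm(xs[j]) + 1e-12   # ||e_k|| <= ||x_k||
```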
All of these ideas make very short work of:
The Projection Theorem. Let $E$ be a subspace of $X$ with orthonormal basis $\{e_1, \ldots, e_n\}$. Define $P : X \to X$ by $P(x) = \sum_{k=1}^n \langle x, e_k \rangle\, e_k$ for $x \in X$. Then
(i) $P(x) \in E$ and $(x - P(x)) \perp E$; hence, $\|x\|^2 = \|x - P(x)\|^2 + \|P(x)\|^2$.
(ii) $P(x)$ is the unique nearest point to $x$ in $E$.
(iii) $P$ is a projection; that is, $P^2 = P$.
(iv) $P$ is linear, continuous, and satisfies $\|P(x)\| \le \|x\|$ for all $x \in X$.
$P$ is called the orthogonal projection onto $E$ or, sometimes, the nearest point map on $E$. It follows from (ii) that $P$ is actually independent of the choice of basis.

Hadamard's Inequality
As an application of the ideas in this section, we next present a classical matrix inequality due to Hadamard in 1893.
Theorem. Let $A = [a_{ij}]$ be a real or complex $n \times n$ matrix and let $A_j = (a_{ij})_{i=1}^n$ denote the $j$-th column of $A$. Then
$$|\det A| \le \prod_{j=1}^n \|A_j\| = \prod_{j=1}^n \left( \sum_{i=1}^n |a_{ij}|^2 \right)^{1/2} .$$
Equality can only occur if one of the columns $A_j = 0$ or if the columns are orthogonal; i.e., $\langle A_j, A_k \rangle = 0$ for $j \ne k$.
It is well known that $|\det A|$ represents the volume of the parallelepiped with edges $(A_j)$. Thus, Hadamard's inequality states that the volume is maximal when the edges are orthogonal.
Proof. We may certainly suppose that $\det A \ne 0$, in which case the columns $(A_j)$ are linearly independent. Thus we may apply the Gram-Schmidt process to orthogonalize them, arriving at vectors $(b_j)$ satisfying $b_1 = A_1$,
$$b_j = A_j - \sum_{i=1}^{j-1} \frac{\langle A_j, b_i \rangle}{\langle b_i, b_i \rangle}\, b_i \quad (j \ge 2), \tag{3}$$
and $\|b_j\| \le \|A_j\|$ for all $j$. In particular, the matrix $B = [\, b_1 \cdots b_n \,]$ with columns $(b_j)$ can be obtained from $A$ by elementary column operations. It follows that $\det B = \det A$.
But the columns of $B$ are orthogonal, so
$$B^* B = \operatorname{diag}\bigl( \|b_1\|^2, \ldots, \|b_n\|^2 \bigr)$$
and, hence, $\det(B^* B) = \prod_{j=1}^n \|b_j\|^2$. On the other hand, $\det(B^* B) = \overline{\det B}\, \det B = |\det B|^2$. Consequently,
$$|\det A| = |\det B| = \bigl( \det(B^* B) \bigr)^{1/2} = \prod_{j=1}^n \|b_j\| \le \prod_{j=1}^n \|A_j\| .$$
Equality can only occur if $\|b_j\| = \|A_j\|$ for all $j$, which can only occur if $b_j = A_j$ for all $j$; that is, if and only if the $(A_j)$ were already orthogonal.
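Hadamard's inequality is easy to spot-check numerically; here is a quick sketch (Python with NumPy, our choice of tool), including the equality case for a matrix with orthonormal columns:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 6))

lhs = abs(np.linalg.det(A))
rhs = np.prod(np.linalg.norm(A, axis=0))   # product of column norms
assert lhs <= rhs

# Orthonormal columns attain equality:
Q, _ = np.linalg.qr(A)
assert abs(abs(np.linalg.det(Q)) - np.prod(np.linalg.norm(Q, axis=0))) < 1e-12
```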

Corollary. Suppose that $A$ is an $n \times n$ matrix with real or complex entries satisfying $|a_{ij}| \le 1$ for all $i, j$. Then $|\det A| \le n^{n/2}$, with equality if and only if $|a_{ij}| = 1$ for all $i, j$ and the columns of $A$ are orthogonal.
Proof. In the notation of the previous Theorem, we have $\|A_j\| \le \sqrt{n}$, thus $|\det A| \le n^{n/2}$. Equality would mean that the $A_j$ are orthogonal and it would also mean that $\|A_j\| = \sqrt{n}$, which forces $|a_{ij}| = 1$ for all $i, j$.
An $n \times n$ real matrix having entries $a_{ij} = \pm 1$ and orthogonal columns is called a Hadamard matrix. For example,
$$A = \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}$$
is a $2 \times 2$ Hadamard matrix. As it happens, it's very easy to construct Hadamard matrices of order $2^n$. (And it is not quite so easy, nor, indeed, always possible, to construct Hadamard matrices of other orders. For example, there is no Hadamard matrix of order 3.) The key is the so-called Kronecker product, which we will define by example: The Kronecker product
$$A \otimes A = \begin{pmatrix} 1 \cdot A & 1 \cdot A \\ 1 \cdot A & -1 \cdot A \end{pmatrix} = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & -1 & 1 & -1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & 1 \end{pmatrix} = B$$
of $A$ with itself is a $4 \times 4$ Hadamard matrix. The product $A \otimes B$ would then yield an $8 \times 8$ Hadamard matrix, and so on.
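Here is a sketch of the Kronecker construction (using NumPy's `np.kron`; the helper name `hadamard` is ours), together with a check of the equality case of the Corollary:

```python
import numpy as np

H2 = np.array([[1, 1], [1, -1]])

def hadamard(n):
    """Hadamard matrix of order 2**n via repeated Kronecker products."""
    H = np.array([[1]])
    for _ in range(n):
        H = np.kron(H2, H)
    return H

H8 = hadamard(3)
n = H8.shape[0]
# Orthogonal columns <=> H^T H = n I, and |det H| = n^(n/2), as in the Corollary.
assert np.array_equal(H8.T @ H8, n * np.eye(n, dtype=int))
assert abs(abs(np.linalg.det(H8)) - n ** (n / 2)) < 1e-6
```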

Problem Set 5
Throughout, $X$ denotes a finite-dimensional inner product space over $\mathbb{C}$ with inner product $\langle \cdot, \cdot \rangle$ and with associated norm $\|x\| = \sqrt{\langle x, x \rangle}$.

1. Show that $|\langle x, y \rangle| = \|x\|\, \|y\|$ if and only if $x$ and $y$ are linearly dependent.
2. Show that $\|x + y\| = \|x\| + \|y\|$ if and only if one of $x$ or $y$ is a nonnegative multiple of the other.
3. Show that the parallelogram law: $\|x + y\|^2 + \|x - y\|^2 = 2 \bigl( \|x\|^2 + \|y\|^2 \bigr)$ holds for any $x, y \in X$.
4. Show that the polarization identity: $4 \langle x, y \rangle = \|x + y\|^2 - \|x - y\|^2 + i \|x + iy\|^2 - i \|x - iy\|^2$ holds for any $x, y \in X$.
5. Show that a linear map $T : X \to Y$ between inner product spaces $X$ and $Y$ is an isometry (into) if and only if it preserves inner products. That is, $\|Tx\| = \|x\|$ for all $x \in X$ if and only if $\langle Tx, Ty \rangle = \langle x, y \rangle$ for all $x, y \in X$. In short, a linear map preserves distances if and only if it preserves angles.
6. Let $E$ be a subspace of $X$ and let $x_0 \in X$. Show that the nearest point to $x_0$ in $E$ can be characterized as the (unique) point $y_0 \in E$ satisfying $\operatorname{Re} \langle x_0 - y_0, y - y_0 \rangle \le 0$ for all $y \in E$.
7. Consider $C[-1, 1]$ with the inner product $\langle f, g \rangle = \int_{-1}^1 f(x)\, g(x)\, dx$ (where we consider real-valued functions and real scalars).
(i) Apply the Gram-Schmidt process to the linearly independent set $\{1, x, x^2\}$ to find the first three Legendre polynomials $P_1(x) = 1$, $P_2(x) = x$, and $P_3(x) = x^2 - \frac{1}{3}$.
(ii) Compute $\min \left\{ \int_{-1}^1 \bigl| e^x - (ax^2 + bx + c) \bigr|^2\, dx : a, b, c \text{ real} \right\}$.
8. Given a subset $A$ of $X$, let $A^\perp = \{x \in X : \langle x, a \rangle = 0 \text{ for all } a \in A\}$. The set $A^\perp$ is called the orthogonal complement of $A$. Verify the following for subsets $A$, $B$ of $X$.
(a) $A^\perp$ is a subspace of $X$ and $A \cap A^\perp \subseteq \{0\}$.
(b) $A \subseteq B \Rightarrow B^\perp \subseteq A^\perp$, and $(A \cup B)^\perp = A^\perp \cap B^\perp$.
(c) $A \subseteq A^{\perp\perp}$ and, hence, $\operatorname{span}(A) \subseteq A^{\perp\perp}$. (In fact, for $X$ finite-dimensional, we actually have $\operatorname{span}(A) = A^{\perp\perp}$.)
9. Let $E$ be a subspace of $X$ and let $P : X \to X$ be the orthogonal projection onto $E$. Show that $I - P$ is the orthogonal projection onto $E^\perp$ (where $I$ denotes the identity map on $X$).
10. Let $A$ be an $m \times n$ matrix over $\mathbb{C}$ and let $A^* = \bar{A}^T$ denote the conjugate transpose of $A$. Given $x \in \mathbb{C}^n$ and $y \in \mathbb{C}^m$, show that $\langle Ax, y \rangle = \langle x, A^* y \rangle$. Moreover, this equation characterizes $A^*$; that is, if $B$ is an $n \times m$ matrix over $\mathbb{C}$ which satisfies $\langle Ax, y \rangle = \langle x, By \rangle$ for all $x, y$, show that $B = A^*$.
11. Let $A$ be an $m \times n$ matrix over $\mathbb{C}$.
(a) Prove that $(\operatorname{range} A)^\perp = \ker(A^*)$.
(b) If $b \in \mathbb{C}^m$ is such that the equation $Ax = b$ has no solution, we can always find a vector $x_0 \in \mathbb{C}^n$ that minimizes $\|b - Ax\|$ over all $x \in \mathbb{C}^n$. Prove that $x_0$ satisfies $(b - Ax_0) \perp \operatorname{range} A$.
(c) Prove that $x_0$ satisfies the so-called normal equation: $A^* A x_0 = A^* b$. (Note that $A^* A$ is Hermitian. If $A$ has rank $n$, then $A^* A$ will be invertible and we can use it to solve for $x_0 = (A^* A)^{-1} A^* b$.)
12. Let $M_n(\mathbb{C})$ denote the collection of all $n \times n$ matrices with complex entries. $M_n(\mathbb{C})$ is a vector space over $\mathbb{C}$ under coordinatewise addition and scalar multiplication. Show that the expression $\langle A, B \rangle = \operatorname{trace}(B^* A)$ defines an inner product on $M_n(\mathbb{C})$.
13. Let $A \in M_n(\mathbb{C})$.
(a) Show that $[\, x, y \,] = \langle Ax, y \rangle$ defines an inner product on $\mathbb{C}^n$ if and only if $A$ satisfies:
(i) $A^* = A$. (We say that $A$ is Hermitian or self-adjoint if $A^* = A$.)
(ii) $\langle Ax, x \rangle \ge 0$ for any $x$, and $\langle Ax, x \rangle = 0$ only for $x = 0$. (In other words, the quadratic form $f(x, y) = \langle Ax, y \rangle$ is positive definite; this implies that $A$ has strictly positive [real] eigenvalues.)
(b) Conversely, show that any sesquilinear, positive definite quadratic form on $\mathbb{C}^n$ is of the form $\langle Ax, y \rangle$ for some Hermitian matrix $A$.
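By way of illustration for Problem 11 (a sketch only, with made-up data), the normal equation picks out the least squares solution, and the residual is orthogonal to the range of $A$:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((8, 3))     # overdetermined: Ax = b has no solution in general
b = rng.standard_normal(8)

x0 = np.linalg.solve(A.T @ A, A.T @ b)   # normal equation (real case: A* = A^T)

assert np.allclose(A.T @ (b - A @ x0), 0)                      # residual ⟂ range(A)
assert np.allclose(x0, np.linalg.lstsq(A, b, rcond=None)[0])   # matches least squares
```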

Hilbert's Double Series Theorem

Theorem. If $(a_m)$ and $(b_n)$ are real, square-summable sequences, then
$$\sum_{m=1}^\infty \sum_{n=1}^\infty \frac{a_m b_n}{m + n} < \pi \left( \sum_{m=1}^\infty a_m^2 \right)^{1/2} \left( \sum_{n=1}^\infty b_n^2 \right)^{1/2} \tag{1}$$
unless one of $(a_m)$ or $(b_n)$ is identically zero. The constant $\pi$ is best possible.

It's unusual to see a strict inequality with a sharp bound! Hilbert proved this inequality, with constant $2\pi$, in his famous series of lectures on integral equations (roughly, 1904-1911); his result was first published by Weyl in 1908. The best constant, together with various generalizations, was provided by Schur in 1911. We will present two proofs, along with a generalization to integral inequalities.
Our first approach uses the Cauchy-Schwarz inequality in the form:
$$\sum_{k \in I} x_k y_k \le \left( \sum_{k \in I} x_k^2 \right)^{1/2} \left( \sum_{k \in I} y_k^2 \right)^{1/2}, \tag{2}$$
where $(x_k)$ and $(y_k)$ are real sequences and where $I$ is a countable set; in this case, $I = \{(m, n) : m, n = 1, 2, 3, \ldots\}$.
This is a relatively straightforward generalization of the $\mathbb{R}^n$ version because
$$\sum_{k \in I} x_k^2 = \sup \left\{ \sum_{k \in J} x_k^2 : J \subseteq I, \ J \text{ finite} \right\}.$$
That is, (2) follows from the finite-sum version of Cauchy-Schwarz by showing that all (finite length) partial sums satisfy (2).
The idea behind our first proof of Hilbert's theorem is to take advantage of the symmetry of the summand, relative to $m$ and $n$. We begin by rewriting:
$$\frac{a_m b_n}{m + n} = \frac{a_m}{\sqrt{m + n}} \left( \frac{m}{n} \right)^{\!\alpha} \cdot \frac{b_n}{\sqrt{m + n}} \left( \frac{n}{m} \right)^{\!\alpha},$$

where, if possible, $\alpha$ will be chosen so that each of the factors on the right is square-summable (over the index set $I$). Applying Cauchy-Schwarz we get:
$$\left( \sum_{m,n} \frac{a_m b_n}{m + n} \right)^2 \le \sum_{m,n} \frac{a_m^2}{m + n} \left( \frac{m}{n} \right)^{2\alpha} \cdot \sum_{m,n} \frac{b_n^2}{m + n} \left( \frac{n}{m} \right)^{2\alpha}. \tag{3}$$
But
$$\sum_{m,n} \frac{a_m^2}{m + n} \left( \frac{m}{n} \right)^{2\alpha} = \sum_{m=1}^\infty a_m^2 \sum_{n=1}^\infty \frac{1}{m + n} \left( \frac{m}{n} \right)^{2\alpha}$$
and, similarly,
$$\sum_{m,n} \frac{b_n^2}{m + n} \left( \frac{n}{m} \right)^{2\alpha} = \sum_{n=1}^\infty b_n^2 \sum_{m=1}^\infty \frac{1}{m + n} \left( \frac{n}{m} \right)^{2\alpha}.$$
Thus, we'll have a proof of Hilbert's inequality (with some constant) if we can find a finite bound $B_\alpha$ such that
$$\sum_{n=1}^\infty \frac{1}{m + n} \left( \frac{m}{n} \right)^{2\alpha} \le B_\alpha$$
for all $m$; that is, $B_\alpha$ may depend on $\alpha$, but not on $m$.
Now for $\alpha > 0$ (and any $m$), the sequence $\frac{1}{m + n} \left( \frac{m}{n} \right)^{2\alpha}$ decreases as $n$ increases, so we can appeal to the integral test:

Lemma. If $f : [0, \infty) \to [0, \infty)$ is strictly decreasing, then
$$\int_1^\infty f(x)\, dx < \sum_{n=1}^\infty f(n) < \int_0^\infty f(x)\, dx.$$

Returning to our search for the bound $B_\alpha$, we find
$$\sum_{n=1}^\infty \frac{1}{m + n} \left( \frac{m}{n} \right)^{2\alpha} < \int_0^\infty \frac{1}{m + x} \left( \frac{m}{x} \right)^{2\alpha} dx = \int_0^\infty \frac{1}{1 + y}\, \frac{1}{y^{2\alpha}}\, dy.$$
As we've seen (in a slightly different form), this integral exists provided that $0 < 2\alpha < 1$ and, in fact, equals $\Gamma(2\alpha)\, \Gamma(1 - 2\alpha)$; that is, we can take $B_\alpha = \Gamma(2\alpha)\, \Gamma(1 - 2\alpha)$. In other words, we've just shown that
$$\sum_{m,n} \frac{a_m b_n}{m + n} < B_\alpha \left( \sum_{m=1}^\infty a_m^2 \right)^{1/2} \left( \sum_{n=1}^\infty b_n^2 \right)^{1/2}.$$
But we've also seen that the minimum value of $B_\alpha$ occurs when $\alpha = 1/4$, in which case $B_{1/4} = \bigl( \Gamma(1/2) \bigr)^2 = \pi$. This proves Hilbert's theorem, save the claim that $\pi$ is best possible.
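The finite truncations of (1) are easy to examine numerically. A small sketch (Python with NumPy; the test sequences are our own choices):

```python
import numpy as np

N = 2000
a = 1.0 / np.arange(1, N + 1)            # square-summable
b = 1.0 / np.arange(1, N + 1) ** 0.75    # also square-summable

m = np.arange(1, N + 1)
K = 1.0 / (m[:, None] + m[None, :])      # the Hilbert kernel 1/(m + n)
lhs = a @ K @ b
rhs = np.pi * np.sqrt(a @ a) * np.sqrt(b @ b)
print(lhs, "<", rhs)                     # strictly below the pi bound
assert lhs < rhs
```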

Corollary. Let $(a_m)$ and $(b_n)$ be nonnegative, let $1 < p < \infty$, and let $q$ satisfy $\frac{1}{p} + \frac{1}{q} = 1$. Then
$$\sum_{m=1}^\infty \sum_{n=1}^\infty \frac{a_m b_n}{m + n} < \Gamma(1/p)\, \Gamma(1/q) \left( \sum_{m=1}^\infty a_m^p \right)^{1/p} \left( \sum_{n=1}^\infty b_n^q \right)^{1/q}$$
unless one of $(a_m)$ or $(b_n)$ is identically zero.

Proof. In this case, we write
$$\frac{1}{m + n} = \left[ \frac{1}{m + n} \left( \frac{m}{n} \right)^{1/q} \right]^{1/p} \left[ \frac{1}{m + n} \left( \frac{n}{m} \right)^{1/p} \right]^{1/q}$$
and apply Hölder's inequality, which will lead to the bounds
$$\int_0^\infty \frac{1}{1 + y}\, \frac{1}{y^{1/p}}\, dy = \Gamma(1/p)\, \Gamma(1/q) \quad\text{and}\quad \int_0^\infty \frac{1}{1 + y}\, \frac{1}{y^{1/q}}\, dy = \Gamma(1/q)\, \Gamma(1/p).$$

As it happens, $\Gamma(x)\, \Gamma(1 - x) = \pi / \sin(\pi x)$ for $0 < x < 1$ and so the bound in the Corollary can be written as $\pi / \sin(\pi/p)$, which is actually best possible.
The integral version of Hilbert's inequality can be proved in essentially the same way as the discrete version and, in fact, will yield the discrete version as a corollary.

Theorem. Let $f, g : [0, \infty) \to [0, \infty)$, let $1 < p < \infty$, and let $\frac{1}{p} + \frac{1}{q} = 1$. Then
$$\int_0^\infty \int_0^\infty \frac{f(x)\, g(y)}{x + y}\, dx\, dy < \frac{\pi}{\sin(\pi/p)} \left( \int_0^\infty f(x)^p\, dx \right)^{1/p} \left( \int_0^\infty g(y)^q\, dy \right)^{1/q}$$
unless one of $f$ or $g$ is identically zero. The constant $\pi / \sin(\pi/p)$ is best possible.

In this case we would write
$$\frac{1}{x + y} = \left[ \frac{1}{x + y} \left( \frac{x}{y} \right)^{1/q} \right]^{1/p} \left[ \frac{1}{x + y} \left( \frac{y}{x} \right)^{1/p} \right]^{1/q}$$
and proceed as before. Instead of completing this calculation, let's opt for a slightly more general theorem:

Theorem. Let $K : [0, \infty) \times [0, \infty) \to [0, \infty)$ satisfy $K(\lambda x, \lambda y) = \lambda^{-1} K(x, y)$ for all $\lambda > 0$. Then, for $f$, $g$, and $p$ as above,
$$\int_0^\infty \int_0^\infty K(x, y)\, f(x)\, g(y)\, dx\, dy \le C \left( \int_0^\infty f(x)^p\, dx \right)^{1/p} \left( \int_0^\infty g(y)^q\, dy \right)^{1/q}. \tag{4}$$
The constant $C$ is given by the common value
$$C = \int_0^\infty K(x, 1)\, x^{-1/p}\, dx = \int_0^\infty K(1, y)\, y^{-1/q}\, dy. \tag{5}$$
If $K > 0$, then the inequality is strict unless one of $f$ or $g$ is identically zero.

Proof. We begin with the change of variable $y = ux$ (and a few changes in the order of integration).
$$\begin{aligned}
I = \int_0^\infty \int_0^\infty K(x, y)\, f(x)\, g(y)\, dx\, dy &= \int_0^\infty f(x) \int_0^\infty K(x, y)\, g(y)\, dy\, dx \\
&= \int_0^\infty f(x) \int_0^\infty x\, K(x, ux)\, g(ux)\, du\, dx \\
&= \int_0^\infty f(x) \int_0^\infty K(1, u)\, g(ux)\, du\, dx \\
&= \int_0^\infty K(1, u) \int_0^\infty f(x)\, g(ux)\, dx\, du.
\end{aligned}$$
We now apply Hölder's inequality to the inner integral, using the same change of variable $y = ux$:
$$\int_0^\infty f(x)\, g(ux)\, dx \le \left( \int_0^\infty |f(x)|^p\, dx \right)^{1/p} \left( \int_0^\infty |g(ux)|^q\, dx \right)^{1/q} = u^{-1/q} \left( \int_0^\infty |f(x)|^p\, dx \right)^{1/p} \left( \int_0^\infty |g(y)|^q\, dy \right)^{1/q}.$$
Thus,
$$I \le \left( \int_0^\infty K(1, u)\, u^{-1/q}\, du \right) \left( \int_0^\infty |f(x)|^p\, dx \right)^{1/p} \left( \int_0^\infty |g(y)|^q\, dy \right)^{1/q}.$$
The case for strict inequality in (4) follows from a careful examination of the case for equality in Hölder's inequality (which we will forego). The fact that the two integrals in (5) are equal is left as an exercise.
Now the integral version of Hilbert's inequality can be used to deduce the discrete version; indeed, because we have
$$\int_{m-1}^m \int_{n-1}^n \frac{dy\, dx}{x + y} \ge \frac{1}{m + n},$$
we could define $f(x) = a_m$ for $m - 1 \le x < m$, $g(y) = b_n$ for $n - 1 \le y < n$, and write the double sum in (1) as a double integral over $[0, \infty)^2$. But we can actually do a bit better, arriving at an even sharper version of (1). To see this, note that the function $h(\varepsilon) = (m + n - 1 + \varepsilon)^{-1}$ is strictly convex for $-1 \le \varepsilon \le 1$; thus,
$$\frac{1}{m + n - 1 - \varepsilon} + \frac{1}{m + n - 1 + \varepsilon} > \frac{2}{m + n - 1}.$$
Now watch closely!
$$\int_{m-1}^m \int_{n-1}^n \frac{dy\, dx}{x + y} = \int_{-1/2}^{1/2} \int_{-1/2}^{1/2} \frac{dy\, dx}{m + n - 1 + x + y} = \int_{-1/2}^{1/2} \int_{-1/2}^{1/2} \frac{dy\, dx}{m + n - 1 - x - y}.$$
(Where we first exchanged $x$ and $y$ for $x + m - 1/2$ and $y + n - 1/2$, then exchanged $x$ and $y$ for $-x$ and $-y$.) All three integrals are equal and the second two average to be strictly bigger than $(m + n - 1)^{-1}$; thus we have
$$\int_{m-1}^m \int_{n-1}^n \frac{dy\, dx}{x + y} > \frac{1}{m + n - 1}.$$
Replacing $m$ and $n$ by $m + 1$ and $n + 1$ leads to the following sharper version of Hilbert's inequality.
Corollary. If $(a_m)$ and $(b_n)$ are real, square-summable sequences, then
$$\sum_{m=0}^\infty \sum_{n=0}^\infty \frac{a_m b_n}{m + n + 1} < \pi \left( \sum_{m=0}^\infty a_m^2 \right)^{1/2} \left( \sum_{n=0}^\infty b_n^2 \right)^{1/2}$$
unless one of $(a_m)$ or $(b_n)$ is identically zero.
We now apply our Corollary to the moment sequence of a square-integrable function. For this we'll find it helpful to have a converse to Hölder's inequality, a result that's useful in its own right.
Lemma. Let $1 < p < \infty$ and let $\frac{1}{p} + \frac{1}{q} = 1$. Suppose we're given a real or complex sequence $(a_n)$ and a positive constant $C$ that satisfy:
$$\left| \sum_{n=1}^\infty a_n b_n \right| \le C \left( \sum_{n=1}^\infty |b_n|^q \right)^{1/q}$$
whenever $(b_n)$ is a $q$-th power summable sequence. Then $(a_n)$ is $p$-th power summable and, moreover, $\sum_{n=1}^\infty |a_n|^p \le C^p$.
Proof. It suffices to show that $\sum_{n=1}^N |a_n|^p \le C^p$ for all $N$. But setting $b_n = |a_n|^{p-1} \operatorname{sgn} a_n$ (replace $\operatorname{sgn} a_n$ by $\overline{\operatorname{sgn} a_n}$ in the complex case), for $n = 1, \ldots, N$, and $b_n = 0$ otherwise, we have $|b_n|^q = |a_n|^p$, for $n = 1, \ldots, N$. Thus,
$$\sum_{n=1}^N a_n b_n = \sum_{n=1}^N |a_n|^p \le C \left( \sum_{n=1}^N |a_n|^p \right)^{1/q}.$$
Dividing by $\left( \sum_{n=1}^N |a_n|^p \right)^{1/q}$ (which we may suppose is nonzero), we get $\left( \sum_{n=1}^N |a_n|^p \right)^{1/p} \le C$.
Corollary. Let $f : [0, 1] \to \mathbb{R}$ be square-integrable and nonzero. For each $n = 0, 1, 2, \ldots$, define $a_n = \int_0^1 x^n f(x)\, dx$. Then
$$\sum_{n=0}^\infty a_n^2 \le \pi \int_0^1 f(x)^2\, dx.$$
The constant $\pi$ is best possible.

Proof. We may suppose that $f \ge 0$. Indeed,
$$|a_n| \le \int_0^1 x^n |f(x)|\, dx = b_n$$
(the moment sequence for $|f|$), so it would suffice to consider $|f|$.
Now if $(b_n)$ is any square-summable sequence, then, for all $N$, we have
$$\begin{aligned}
\sum_{n=0}^N a_n b_n = \sum_{n=0}^N b_n \int_0^1 x^n f(x)\, dx &= \int_0^1 f(x) \sum_{n=0}^N b_n x^n\, dx \\
&\le \left( \int_0^1 f(x)^2\, dx \right)^{1/2} \left( \int_0^1 \left( \sum_{n=0}^N b_n x^n \right)^{\!2} dx \right)^{1/2} \\
&= \left( \int_0^1 f(x)^2\, dx \right)^{1/2} \left( \sum_{m=0}^N \sum_{n=0}^N \frac{b_m b_n}{m + n + 1} \right)^{1/2} \\
&< \left( \pi \int_0^1 f(x)^2\, dx \right)^{1/2} \left( \sum_{n=0}^N b_n^2 \right)^{1/2}.
\end{aligned}$$
From the converse to Hölder's inequality (or, in this case, the Cauchy-Schwarz inequality), it follows that $(a_n)$ is square-summable and satisfies $\sum_n a_n^2 \le \pi \int_0^1 f(x)^2\, dx$. The fact that $\pi$ is the best constant follows from considering the function $f(x) = (1 - x)^{\varepsilon - \frac{1}{2}}$ (and letting $\varepsilon \to 0$), a calculation we will omit.
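A quick sanity check of the Corollary for the simplest moment sequence (a sketch; here $f \equiv 1$, so $a_n = 1/(n + 1)$ and $\sum a_n^2 = \pi^2/6 \approx 1.645$, comfortably below $\pi$):

```python
import numpy as np

n = np.arange(0, 100000)
a = 1.0 / (n + 1)                 # moments of f(x) = 1 on [0, 1]
print((a ** 2).sum(), "<", np.pi)
assert (a ** 2).sum() < np.pi
```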
Our second proof of Hilbert's inequality has the advantage of being entirely elementary, but has the disadvantage of giving a slightly weaker result (given our present state of knowledge). The proof, due to Toeplitz in 1910, begins with a simple observation.
Lemma. $\displaystyle \frac{i}{2\pi} \int_0^{2\pi} (t - \pi)\, e^{int}\, dt = \frac{1}{n}$ for $n \in \mathbb{Z}$, $n \ne 0$.
Proof. Integrate by parts:
$$\frac{i}{2\pi} \int_0^{2\pi} (t - \pi)\, e^{int}\, dt = \frac{i}{2\pi} \int_0^{2\pi} (t - \pi)\, d\!\left( \frac{1}{in}\, e^{int} \right) = \frac{1}{2\pi n} \Bigl[ (t - \pi)\, e^{int} \Bigr]_0^{2\pi} = \frac{1}{n},$$
since $\int_0^{2\pi} e^{int}\, dt = 0$.
This simple observation gives us an integral representation for the left-hand side of Hilbert's inequality.
Lemma. $\displaystyle \sum_{m=1}^N \sum_{n=1}^N \frac{a_m b_n}{m + n} = \frac{i}{2\pi} \int_0^{2\pi} (t - \pi)\, f(t)\, g(t)\, dt$, where $f(t) = \sum_{m=1}^N a_m e^{imt}$ and $g(t) = \sum_{n=1}^N b_n e^{int}$.
Finally, we apply the Cauchy-Schwarz inequality (and the fact that the functions $e^{int}$ are orthonormal on $[0, 2\pi]$) to arrive at:
$$\begin{aligned}
\left| \sum_{m=1}^N \sum_{n=1}^N \frac{a_m b_n}{m + n} \right| = \left| \frac{i}{2\pi} \int_0^{2\pi} (t - \pi)\, f(t)\, g(t)\, dt \right| &\le \frac{1}{2} \int_0^{2\pi} |f(t)|\, |g(t)|\, dt \\
&\le \pi \left( \frac{1}{2\pi} \int_0^{2\pi} |f(t)|^2\, dt \right)^{1/2} \left( \frac{1}{2\pi} \int_0^{2\pi} |g(t)|^2\, dt \right)^{1/2} \\
&= \pi \left( \sum_{m=1}^N a_m^2 \right)^{1/2} \left( \sum_{n=1}^N b_n^2 \right)^{1/2}.
\end{aligned}$$
Because of the special nature of the representation in our previous Lemma, it's not so clear that this proof will yield the infinite sum version (unless we assume that $(a_m)$ and $(b_n)$ are nonnegative). Moreover, we would be hard pressed to prove strict inequality by this method alone. Nevertheless, Toeplitz's proof is not only simple, but has the benefit of amply demonstrating the interplay between "discrete" and "continuous" inner product spaces.
Our final result in this section is Schur's generalization of Hilbert's inequality from 1911. Surprisingly, the proof we will give is short, elementary, and does not appeal in any way to Hilbert's inequality.
Lemma. Let $c_k$ be a complex number, let $\lambda_k$ be an integer, and let $0 < \theta < 1$. Then
$$\int_0^{2\pi} \left| \sum_{k=1}^n c_k\, e^{i \lambda_k x} \right| dx \ge 2 \sin(\pi \theta) \left| \sum_{k=1}^n \frac{c_k}{\lambda_k - \theta} \right|.$$
Proof. We begin with the estimate
$$\begin{aligned}
\int_0^{2\pi} \left| \sum_{k=1}^n c_k\, e^{i \lambda_k x} \right| dx \ge \left| \int_0^{2\pi} \sum_{k=1}^n c_k\, e^{i (\lambda_k - \theta) x}\, dx \right| &= \left| \sum_{k=1}^n c_k \int_0^{2\pi} e^{i (\lambda_k - \theta) x}\, dx \right| \\
&= \left| \sum_{k=1}^n \frac{c_k}{\lambda_k - \theta} \int_0^{2\pi (\lambda_k - \theta)} e^{iu}\, du \right|.
\end{aligned}$$
But the modulus of the last integral depends only on $\theta$:
$$\int_0^{2\pi (\lambda_k - \theta)} e^{iu}\, du = -i \Bigl[ e^{iu} \Bigr]_0^{2\pi (\lambda_k - \theta)} = -i \bigl[ e^{-i 2\pi \theta} - 1 \bigr] = -i\, e^{-i \pi \theta} \bigl[ e^{-i \pi \theta} - e^{i \pi \theta} \bigr] = -2\, e^{-i \pi \theta} \sin(\pi \theta).$$
Thus, $\left| \int_0^{2\pi (\lambda_k - \theta)} e^{iu}\, du \right| = 2 \sin(\pi \theta)$ and the result follows.

Theorem. Let $a_m$ and $b_n$ be complex numbers and let $0 < \theta < 1$. Then
$$\left| \sum_{m=1}^N \sum_{n=1}^N \frac{a_m b_n}{m + n - \theta} \right| \le \frac{\pi}{\sin(\pi \theta)} \left( \sum_{m=1}^N |a_m|^2 \right)^{1/2} \left( \sum_{n=1}^N |b_n|^2 \right)^{1/2}.$$
Proof. This is a simple matter of applying the Cauchy-Schwarz inequality and appealing to our previous Lemma. First, we essentially repeat a calculation from Toeplitz's proof:
$$\frac{1}{2\pi} \int_0^{2\pi} \left| \sum_{m=1}^N a_m e^{imx} \right| \left| \sum_{n=1}^N b_n e^{inx} \right| dx \le \left( \sum_{m=1}^N |a_m|^2 \right)^{1/2} \left( \sum_{n=1}^N |b_n|^2 \right)^{1/2}. \tag{6}$$
On the other hand, from our previous Lemma,
$$\frac{1}{2\pi} \int_0^{2\pi} \left| \sum_{m=1}^N a_m e^{imx} \right| \left| \sum_{n=1}^N b_n e^{inx} \right| dx = \frac{1}{2\pi} \int_0^{2\pi} \left| \sum_{m=1}^N \sum_{n=1}^N a_m b_n\, e^{i(m+n)x} \right| dx \ge \frac{1}{2\pi} \cdot 2 \sin(\pi \theta) \left| \sum_{m=1}^N \sum_{n=1}^N \frac{a_m b_n}{m + n - \theta} \right| \tag{7}$$
(by taking $c_k = a_m b_n$ and $\lambda_k = m + n$ in our previous Lemma). Combining (6) and (7) completes the proof.

Hardy's Inequality
In his search for a new proof of Hilbert's double series theorem, Hardy discovered the following inequalities (in 1920, but without the best constant in Theorem 1; this was later rectified by Landau in 1926).
Theorem 1. Let $1 < p < \infty$, let $(a_n)$ be nonnegative and $p$-th power summable, and let $A_n = a_1 + a_2 + \cdots + a_n$. Then
$$\sum_{n=1}^\infty \left( \frac{A_n}{n} \right)^{\!p} < \left( \frac{p}{p - 1} \right)^{\!p} \sum_{n=1}^\infty a_n^p \tag{1}$$
unless $(a_n)$ is identically zero. The constant is best possible.

The integral analogue of Theorem 1 is given as
Theorem 2. Let $1 < p < \infty$, let $f : [0, \infty) \to [0, \infty)$ be $p$-th power integrable, and let $F(x) = \int_0^x f(t)\, dt$. Then
$$\int_0^\infty \left( \frac{F(x)}{x} \right)^{\!p} dx < \left( \frac{p}{p - 1} \right)^{\!p} \int_0^\infty f(x)^p\, dx \tag{2}$$
unless $f$ is identically zero. The constant is best possible.
Our proof of Theorem 1 is due to Elliott (1926). Our proof of Theorem 2 is (essentially) Hardy's original proof.
It may help to outline the heuristics that led Hardy to Theorem 1. Hardy's approach to the double series theorem was to treat the sums above and below the diagonal as essentially identical; that is, he reasoned that it would suffice to examine
$$\sum_{n=1}^\infty \sum_{m=1}^n \frac{a_m b_n}{m + n} \le \sum_{n=1}^\infty \sum_{m=1}^n \frac{a_m b_n}{n} = \sum_{n=1}^\infty \frac{A_n}{n}\, b_n. \tag{3}$$
Now the conclusion of Hilbert's theorem is, essentially, that the sum on the left converges whenever $\sum_m a_m^p$ and $\sum_n b_n^q$ converge (this by way of an application of Hölder's inequality). But, Hardy reasoned, perhaps the convergence of $\sum_m a_m^p$ already implies the convergence of $\sum_n (A_n/n)^p$, in which case an application of Hölder's inequality to the right-hand side of (3) would lead to a more direct proof of Hilbert's theorem; whence Theorem 1.
Proof of Theorem 1. The proof begins with an observation that will help us establish strict inequality: We may suppose that $a_1 > 0$. Indeed, if the theorem has been proved in that case, then it will also hold for a sequence $(b_n)$ with $b_1 = 0$ for, in this case, setting $a_n = b_{n+1}$, we would have
$$\left( \frac{b_2}{2} \right)^{\!p} + \left( \frac{b_2 + b_3}{3} \right)^{\!p} + \cdots = \left( \frac{a_1}{2} \right)^{\!p} + \left( \frac{a_1 + a_2}{3} \right)^{\!p} + \cdots \le \left( \frac{a_1}{1} \right)^{\!p} + \left( \frac{a_1 + a_2}{2} \right)^{\!p} + \cdots < \left( \frac{p}{p - 1} \right)^{\!p} \sum_n a_n^p = \left( \frac{p}{p - 1} \right)^{\!p} \sum_n b_n^p.$$
We now set $x_n = A_n / n$ and estimate:
$$\begin{aligned}
x_n^p - \frac{p}{p - 1}\, x_n^{p-1} a_n &= x_n^p - \frac{p}{p - 1}\, \{ n x_n - (n - 1) x_{n-1} \}\, x_n^{p-1} \\
&= \left( 1 - \frac{np}{p - 1} \right) x_n^p + \frac{(n - 1) p}{p - 1}\, x_n^{p-1} x_{n-1} \\
&\le \left( 1 - \frac{np}{p - 1} \right) x_n^p + \frac{(n - 1) p}{p - 1} \left( \frac{p - 1}{p}\, x_n^p + \frac{1}{p}\, x_{n-1}^p \right) \\
&= \frac{1}{p - 1} \left\{ (n - 1)\, x_{n-1}^p - n\, x_n^p \right\},
\end{aligned} \tag{4}$$
where we've used Young's inequality in (4). Next we sum over $n = 1, \ldots, N$, noting that the terms on the right, above, will telescope, and conclude that
$$\sum_{n=1}^N x_n^p - \frac{p}{p - 1} \sum_{n=1}^N x_n^{p-1} a_n \le -\frac{N x_N^p}{p - 1} \le 0.$$
That is,
$$\sum_{n=1}^N x_n^p \le \frac{p}{p - 1} \sum_{n=1}^N x_n^{p-1} a_n.$$
Applying Hölder's inequality then yields
$$\sum_{n=1}^N x_n^p \le \frac{p}{p - 1} \left( \sum_{n=1}^N x_n^p \right)^{1 - 1/p} \left( \sum_{n=1}^N a_n^p \right)^{1/p}. \tag{5}$$
Collecting terms and raising both sides to the $p$-th power gives us the finite version of our result (but with a weak inequality):
$$\sum_{n=1}^N x_n^p \le \left( \frac{p}{p - 1} \right)^{\!p} \sum_{n=1}^N a_n^p.$$
It follows that
$$\sum_{n=1}^\infty x_n^p \le \left( \frac{p}{p - 1} \right)^{\!p} \sum_{n=1}^\infty a_n^p, \tag{6}$$
where both sides of the inequality are finite. Finally, to see that we actually have strict inequality, note that equality in (5) would force $(x_n^p)$ and $(a_n^p)$ to be proportional. But we took $a_1 = x_1 > 0$, so the constant of proportionality would have to be 1. That is, we would have $a_n = A_n / n$ for all $n$, which can only occur if $(a_n)$ is constant. (And this is consistent with equality in (4), which would force $(x_n)$ to be constant.) This is obviously inconsistent with the fact that $(a_n)$ is $p$-th power summable; thus, (6) (and (1)) actually holds with strict inequality.
To see that $\{ p/(p - 1) \}^p$ is the best constant, one approach is to consider the sequence $a_n = n^{-\frac{1}{p} - \varepsilon}$ and let $\varepsilon \to 0$. It's a straightforward estimate, but we'll skip the details.
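Partial sums make the role of the constant visible. A small numerical sketch (the test sequence mimics the extremal $n^{-1/p - \varepsilon}$ just mentioned; the finite-sum version proved above guarantees the truncated inequality as well):

```python
import numpy as np

p, eps, N = 2.0, 0.05, 100000
n = np.arange(1, N + 1)
a = n ** (-1.0 / p - eps)
x = np.cumsum(a) / n                    # x_n = A_n / n

lhs = (x ** p).sum()
rhs = (p / (p - 1)) ** p * (a ** p).sum()
print(lhs / rhs)                        # close to, but strictly below, 1
assert lhs < rhs
```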
We next give (what is essentially) Hardy's proof of Theorem 2. It is very similar in spirit (if not in detail) to the proof we've just given for Theorem 1.
Proof of Theorem 2. As in the proof of Theorem 1, we will first establish a slightly weaker version of (2). In particular, we will first show that
$$\int_0^T \left( \frac{F(x)}{x} \right)^{\!p} dx \le \left( \frac{p}{p - 1} \right)^{\!p} \int_0^T f(x)^p\, dx \tag{2'}$$
for all $T$ sufficiently large. To this end, note that if $f$ is not identically zero, then $\int_0^\infty f^p > 0$ and, hence, $\int_0^T f^p > 0$ for all $T$ sufficiently large. (This expression is analogous to the sum $\sum_{n=1}^N a_n^p$, and we will need to divide by it later in the proof, much as we did in the proof of Theorem 1.)

Next, note that if $f$ is $p$-th power integrable, then $f$ is integrable over $[0, x]$ for any $x > 0$ and, moreover,
$$F(x) = \int_0^x f(t)\, dt \le x^{1 - \frac{1}{p}} \left( \int_0^x f(t)^p\, dt \right)^{1/p}$$
from Hölder's inequality. In particular, we have $x^{1-p} F(x)^p \to 0$ as $x \to 0^+$, a fact that will come in handy momentarily.
We're ready to attack (2). We first integrate by parts, then apply Hölder's inequality:
$$\begin{aligned}
\int_0^T \left( \frac{F(x)}{x} \right)^{\!p} dx = \frac{1}{1 - p} \int_0^T F(x)^p\, d\bigl( x^{1-p} \bigr) &= \frac{1}{1 - p} \Bigl[ x^{1-p} F(x)^p \Bigr]_0^T + \frac{p}{p - 1} \int_0^T x^{1-p} F(x)^{p-1} f(x)\, dx \\
&= \frac{p}{p - 1} \int_0^T \left( \frac{F(x)}{x} \right)^{\!p-1} f(x)\, dx - \frac{1}{p - 1}\, T^{1-p} F(T)^p \qquad (7) \\
&\le \frac{p}{p - 1} \int_0^T \left( \frac{F(x)}{x} \right)^{\!p-1} f(x)\, dx \qquad (8) \\
&\le \frac{p}{p - 1} \left( \int_0^T f(x)^p\, dx \right)^{1/p} \left( \int_0^T \left( \frac{F(x)}{x} \right)^{\!p} dx \right)^{1 - 1/p}, \qquad (9)
\end{aligned}$$
where (in (7)) we've used the fact that $x^{1-p} F(x)^p \to 0$ as $x \to 0^+$ and (in (8)) the fact that $F \ge 0$. (Note that if $(F(x)/x)^p$ is integrable over $[0, \infty)$, as the theorem suggests, then we would expect to have $x\, (F(x)/x)^p \to 0$ as $x \to \infty$. Thus, dropping the term $T^{1-p} F(T)^p$ is unlikely to cause any problems.)
We've been here before! $\int_0^T (F/x)^p$ occurs on both sides of our inequality, but to different powers. Divide, then raise both sides to the $p$-th power to arrive at (2'):
$$\int_0^T \left( \frac{F(x)}{x} \right)^{\!p} dx \le \left( \frac{p}{p - 1} \right)^{\!p} \int_0^T f(x)^p\, dx.$$
Because this inequality holds for all $T$ sufficiently large, we've proved that
$$\int_0^\infty \left( \frac{F(x)}{x} \right)^{\!p} dx \le \left( \frac{p}{p - 1} \right)^{\!p} \int_0^\infty f(x)^p\, dx, \tag{10}$$
which is very nearly the conclusion we were hoping for. All that's missing is strict inequality, which follows from a closer examination of our application of Hölder's inequality in (9). Indeed, equality in (10) would force equality in (9), which in turn would force $(F(x)/x)^p$ and $f(x)^p$ to be proportional. But this would force $f$ to be a power of $x$, which is inconsistent with the assumption that $f$ is $p$-th power integrable over $[0, \infty)$. (If we assume for the moment that $f$ is continuous, then the equation $x f(x) = C F(x) = C \int_0^x f$ would imply that $f$ is differentiable and satisfies the differential equation $x f'(x) = (C - 1) f(x)$, which has solution $f(x) = B x^{C-1}$.)
As with Theorem 1, the proof that the constant $\{ p/(p - 1) \}^p$ is best would follow from considering, for example, the function $f(x) = x^{-\frac{1}{p} - \varepsilon}$ for $x \ge 1$, and $f(x) = 0$ otherwise (letting $\varepsilon \to 0$). Again, we will forego the details.

The Inequalities of Carleman, Knopp, Jensen, and Carleson

We begin with a simple application of Hardy's Theorem 1.
Corollary. Let $1 < p < \infty$ and let $(a_n)$ be nonnegative. Then
$$\sum_{n=1}^\infty \left( \frac{a_1^{1/p} + a_2^{1/p} + \cdots + a_n^{1/p}}{n} \right)^{\!p} \le \left( \frac{p}{p - 1} \right)^{\!p} \sum_{n=1}^\infty a_n.$$
Note that the summand on the left can be written as $M_{1/p}^{(n)}(a)$, where the superscript $(n)$ is a reminder that we're summing only $n$ terms. Recall, too, that $M_0^{(n)}(a) = (a_1 a_2 \cdots a_n)^{1/n} \le M_{1/p}^{(n)}(a)$; in fact, if we let $p \to \infty$ (with $n$ still fixed), $M_{1/p}^{(n)}(a)$ decreases to $M_0^{(n)}(a)$ while the constant on the right tends to $e$. Thus, we have another "name" inequality, due to Carleman in 1923 (but without a proof of strict inequality):
Carleman's Inequality. For $(a_n)$ nonnegative,
$$\sum_{n=1}^\infty (a_1 a_2 \cdots a_n)^{1/n} < e \sum_{n=1}^\infty a_n \tag{1}$$
provided that $(a_n)$ is not identically zero. The constant $e$ is best possible.
An elementary yet elegant proof of Carleman's inequality, due to Pólya in 1926, is outlined in the exercises. We'll give a second proof shortly. The proof that the constant $e$ is best possible will be left for another day.
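Carleman's inequality is also pleasant to test numerically. A short sketch (the test sequence is our own choice; the cumulative-log trick avoids underflow in the geometric means):

```python
import numpy as np

N = 10000
n = np.arange(1, N + 1)
a = 1.0 / n ** 2                         # a summable test sequence

geo = np.exp(np.cumsum(np.log(a)) / n)   # (a_1 ... a_n)^(1/n)
print(geo.sum(), "<", np.e * a.sum())
assert geo.sum() < np.e * a.sum()
```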
There is an integral analogue of Carleman's inequality. To find it, we first rewrite (1) as $\sum_{n=1}^\infty \exp\left( \frac{1}{n} \sum_{k=1}^n \log a_k \right) < e \sum_{n=1}^\infty a_n$, which suggests:
Knopp's Inequality. If $f : [0, \infty) \to (0, \infty)$ is integrable, then
$$\int_0^\infty \exp\left( \frac{1}{x} \int_0^x \log f(t)\, dt \right) dx < e \int_0^\infty f(x)\, dx$$
unless $f$ is identically zero (and we interpret $\exp(\log(0)) = 0$).
Curiously, Knopp's original inequality (from 1928) was stated with 1 as the lower limit of integration (in all instances), and that inequality is actually false! It fails for the function $f(x) = 1/x^2$, which can be verified by direct calculation. Nevertheless, Knopp's original proof can be used to prove the corrected inequality.
We'll give two proofs of Knopp's inequality. First, though, we'll take a short detour to prove the integral version of Jensen's inequality, which is not only of interest in its own right, but which is also needed for Knopp's proof.
Jensen's Inequality. Let $f, w : I \to \mathbb{R}$ be integrable functions with $w(x) \ge 0$ and $\int_I w(x)\, dx = 1$. Let $\varphi : J \to \mathbb{R}$ be a convex function defined on an interval $J$ containing the range of $f$. Then
$$\varphi\left( \int_I f(x)\, w(x)\, dx \right) \le \int_I \varphi(f(x))\, w(x)\, dx.$$
Proof. Let $\mu = \int_I f(x)\, w(x)\, dx$ and let $T(x) = \varphi(\mu) + m(x - \mu)$ be a supporting line to the graph of $\varphi$ at $\mu$. (Recall that we have $T(x) \le \varphi(x)$ for all $x$ and $T(\mu) = \varphi(\mu)$.) Then $\varphi(\mu) - \varphi(f(x)) \le m(\mu - f(x))$ for all $x \in I$. Multiplying by $w$ and integrating over $I$ then leads to: $\varphi(\mu) - \int_I \varphi(f(x))\, w(x)\, dx \le m(\mu - \mu) = 0$.
If we set $\varphi(x) = e^x$, we have the following integral analogue of the arithmetic-geometric mean inequality:
Corollary. Let $f, w : I \to (0, \infty)$ with $\int_I w(x)\, dx = 1$. Then
$$\exp\left( \int_I \log(f(x))\, w(x)\, dx \right) \le \int_I f(x)\, w(x)\, dx.$$
A special case of Jensen's inequality is worth stating separately:
Corollary. Let $f : [a, b] \to \mathbb{R}$ be continuous and let $\varphi : J \to \mathbb{R}$ be a convex function defined on an interval $J$ containing the range of $f$. Then
$$\varphi\left( \frac{1}{b - a} \int_a^b f(t)\, dt \right) \le \frac{1}{b - a} \int_a^b \varphi(f(t))\, dt.$$
If $\varphi$ is strictly convex, then equality can only occur if $f$ is constant.
Proof. We only need to prove the last assertion. As before, let $\mu = (b - a)^{-1} \int_a^b f(t)\, dt$ and let $T(x) = \varphi(\mu) + m(x - \mu)$ be a supporting line for $\varphi$ at $\mu$. If $\varphi$ is strictly convex, note that $\varphi(y) - \varphi(\mu) > m(y - \mu)$ for $y \ne \mu$. Now, if $f$ is not constant, we have $f(x) \ne \mu$ for all $x$ in some subinterval $I$ of $[a, b]$. Thus $\varphi(f(x)) - \varphi(\mu) > m(f(x) - \mu)$ on $I$ and $\varphi(f(x)) - \varphi(\mu) \ge m(f(x) - \mu)$ in any case. It follows that $\int_a^b \bigl[ \varphi(f(x)) - \varphi(\mu) \bigr]\, dx > 0$. Conversely, $\int_a^b \bigl[ \varphi(f(x)) - \varphi(\mu) \bigr]\, dx = 0$ would mean that $\varphi(f(x)) = \varphi(\mu)$ for all $x$. But strictly convex functions can take on the same value at most twice; thus, $f(x)$ assumes at most two values. As $f$ is continuous, this forces $f$ to be constant.
As a second application of Jensen's inequality, we'll revisit an inequality we saw last week; namely,
$$\int_m^{m+1} \int_n^{n+1} \frac{dy\, dx}{x + y} > \frac{1}{m + n + 1}.$$
To see how this follows from Jensen's inequality, again note that the function $h(x) = 1/(c + x)$ is strictly convex for $x > -c$. Thus,
$$\int_n^{n+1} \frac{dy}{x + y} > \frac{1}{x + \int_n^{n+1} y\, dy} = \frac{1}{x + n + 1/2}.$$
Similarly,
$$\int_m^{m+1} \frac{dx}{x + n + 1/2} > \frac{1}{n + 1/2 + \int_m^{m+1} x\, dx} = \frac{1}{m + n + 1}.$$
One more easy calculation and we'll be ready to prove Knopp's inequality.
Lemma. Let $f : [0, \infty) \to (0, \infty)$ be integrable and define $g(x) = \frac{1}{x^2} \int_0^x y f(y)\, dy$. Then
$$\int_0^\infty f(x)\, dx = \int_0^\infty g(x)\, dx.$$
Proof. This is a simple matter of changing the order of integration.
$$\int_0^\infty g(x)\, dx = \int_0^\infty \frac{1}{x^2} \int_0^x y f(y)\, dy\, dx = \int_0^\infty y f(y) \int_y^\infty \frac{1}{x^2}\, dx\, dy = \int_0^\infty f(y)\, dy.$$

Proof of Knopp's inequality. Recall that an antiderivative for $\log x$ is $x \log x - x$ (which vanishes at 0). Thus,
$$\begin{aligned}
\exp\left( \frac{1}{x} \int_0^x \log(f(y))\, dy \right) &= \exp\left( \frac{1}{x} \int_0^x \log(y f(y))\, dy - \frac{1}{x} \int_0^x \log y\, dy \right) \\
&= \frac{e}{x} \exp\left( \frac{1}{x} \int_0^x \log(y f(y))\, dy \right) \\
&\le \frac{e}{x} \cdot \frac{1}{x} \int_0^x \exp\bigl( \log(y f(y)) \bigr)\, dy \\
&= \frac{e}{x^2} \int_0^x y f(y)\, dy,
\end{aligned} \tag{2}$$
where, in (2), we've applied Jensen's inequality. An appeal to our previous Lemma yields
$$\int_0^\infty \exp\left( \frac{1}{x} \int_0^x \log f(t)\, dt \right) dx \le e \int_0^\infty f(x)\, dx. \tag{3}$$
Now equality in (3) would force equality in (2) for all $x$, which would in turn force $\log(y f(y))$ to be constant. If $f$ is not identically zero, this would mean that $f(y) = c/y$ for some positive constant $c$. This is clearly inconsistent with the fact that $f$ is integrable over $(0, \infty)$. Thus, we have strict inequality in (3) unless $f \equiv 0$.
We next present an inequality due to Lennart Carleson (1954) which is at least partly related to Jensen's inequality, through its style of proof, and which is at least partly related to Hardy's inequality, in a way that will become apparent later. Moreover, it will yield both Carleman's inequality and Knopp's inequality as corollaries.
Carleson's Inequality. Let $\varphi : [0, \infty) \to \mathbb{R}$ be a differentiable convex function with $\varphi(0) = 0$. Then, for any $-1 < \alpha < \infty$, we have
$$\int_0^\infty x^\alpha \exp\left( -\frac{\varphi(x)}{x} \right) dx \le e^{\alpha + 1} \int_0^\infty x^\alpha \exp(-\varphi'(x))\, dx. \tag{4}$$
The constant $e^{\alpha + 1}$ is best possible.


Proof. To begin, note that if $p > 1$, then the convexity of $\varphi$ assures us that
$$\frac{\varphi(py) - \varphi(y)}{py - y} \ge \varphi'(y), \tag{5}$$
which we will write as: $\varphi(py) \ge \varphi(y) + (p - 1)\, y\, \varphi'(y)$. Now we estimate the left-hand side of (4) using (5) and Hölder's inequality with $q = p/(p - 1)$, the conjugate to $p$.
$$\begin{aligned}
I_A = \int_0^A x^\alpha \exp\left( -\frac{\varphi(x)}{x} \right) dx &= p^{\alpha + 1} \int_0^{A/p} y^\alpha \exp\left( -\frac{\varphi(py)}{py} \right) dy \\
&\le p^{\alpha + 1} \int_0^{A/p} y^\alpha \exp\left( -\frac{\varphi(y) + (p - 1)\, y\, \varphi'(y)}{py} \right) dy \\
&= p^{\alpha + 1} \int_0^{A/p} y^{\alpha/p} \exp\left( -\frac{\varphi(y)}{py} \right) \cdot y^{\alpha/q} \exp\left( -\frac{\varphi'(y)}{q} \right) dy \\
&\le p^{\alpha + 1}\, I_A^{1/p} \left\{ \int_0^A y^\alpha \exp(-\varphi'(y))\, dy \right\}^{1/q}.
\end{aligned}$$
We now divide by $I_A^{1/p}$ and raise both sides to the $q$-th power to arrive at
$$\int_0^A x^\alpha \exp\left( -\frac{\varphi(x)}{x} \right) dx \le \left( p^{p/(p-1)} \right)^{\alpha + 1} \int_0^A x^\alpha \exp(-\varphi'(x))\, dx.$$
To finish the proof, we first let $A \to \infty$ and then let $p \to 1^+$ or, equivalently, let $q \to \infty$. Written in terms of $q$ we have
$$p^{p/(p-1)} = \left( \frac{q}{q - 1} \right)^{\!q} = \left( 1 + \frac{1}{q - 1} \right)^{\!q} \to e$$
as $q \to \infty$.

As an application of Carleson's inequality, let's see how it can be used to derive Carleman's inequality.
First suppose that the sequence $(a_n)$ is nonnegative and decreasing and define $s_0 = 0$, $s_n = \sum_{k=1}^n \log(1/a_k)$. Now let $\varphi$ be the piecewise linear continuous function with "nodes" or "corners" at $(n, \varphi(n)) = (n, s_n)$. That is, $\varphi(n) = s_n$ for each $n$ and $\varphi$ is defined linearly on each interval $(n - 1, n)$. Then, on the open interval $(n - 1, n)$, we'll have $\varphi'(x) = s_n - s_{n-1} = \log(1/a_n)$. But if $(a_n)$ decreases, then $\varphi'$ will increase; thus, $\varphi$ will be convex. Also, because $\varphi(0) = 0$, the function $\varphi(x)/x$ will increase; it's the slope of the chord from $(0, 0)$ to $(x, \varphi(x))$. Moreover, note that for $n - 1 < x < n$ we have
$$\frac{\varphi(x)}{x} \le \frac{\varphi(n)}{n} = \frac{1}{n} \sum_{k=1}^n \log(1/a_k). \tag{6}$$
Equality in (6) for all $n$ and $x$ would force $(a_k)$ to be constant, which is plainly inconsistent with the summability of $(a_k)$. Thus, we must have strict inequality in (6), at least for certain $n$ and $x$ (which, by the continuity of $\varphi$, will be good enough for our purposes).
Finally, notice that
$$\left( \prod_{k=1}^n a_k \right)^{1/n} = \exp\left( -\frac{\varphi(n)}{n} \right) \le \int_{n-1}^n \exp\left( -\frac{\varphi(x)}{x} \right) dx,$$
where the inequality is strict, at least for certain values of $n$, per our discussion of (6). Summing this inequality and applying Carleson's inequality (with $\alpha = 0$) leads to
$$\sum_{n=1}^\infty \left( \prod_{k=1}^n a_k \right)^{1/n} < \int_0^\infty \exp\left( -\frac{\varphi(x)}{x} \right) dx \le e \int_0^\infty \exp(-\varphi'(x))\, dx = e \sum_{n=1}^\infty a_n,$$
where, again, strict inequality follows from our remarks concerning (6). That is, we have Carleman's inequality for decreasing sequences $(a_n)$. But that hardly matters:
Claim: If $(a_k)$ is decreasing and if $(b_k)$ is any rearrangement of $(a_k)$, then $b_1 b_2 \cdots b_n \le a_1 a_2 \cdots a_n$ for all $n$.
Indeed, if $n$ is fixed, we may suppose that $b_1, b_2, \ldots, b_n$ have been arranged in decreasing order, in which case it's clear that $b_1 \le a_1$, $b_2 \le a_2$, and so on. Thus, $b_1 b_2 \cdots b_n \le a_1 a_2 \cdots a_n$. It follows that
$$\sum_{n=1}^\infty \left( \prod_{k=1}^n b_k \right)^{1/n} \le \sum_{n=1}^\infty \left( \prod_{k=1}^n a_k \right)^{1/n} < e \sum_{n=1}^\infty a_n = e \sum_{n=1}^\infty b_n.$$
(Because $a_n \ge 0$, the series is unconditionally summable; that is, every rearrangement of $(a_n)$ will sum to the same value.)
Carleson's inequality also leads to a version of Knopp's inequality valid for decreasing functions. In this case, the function $\varphi(x) = \int_0^x \log(1/f(t))\, dt$ will have $\varphi'(x) = \log(1/f(x))$, which is increasing; hence, $\varphi$ is convex.
Corollary. If $f : [0, \infty) \to (0, \infty)$ is decreasing, then
$$\int_0^\infty \exp\left( \frac{1}{x} \int_0^x \log f(t)\, dt \right) dx \le e \int_0^\infty f(x)\, dx.$$

In light of our discussion of decreasing sequences and our knowledge of Knopp's result, this Corollary raises a natural question: If $f$ is decreasing, and if $g$ is a "rearrangement" of $f$ (whatever that might mean), does it follow that
$$\int_0^\infty \exp\left( \frac{1}{x} \int_0^x \log g(t)\, dt \right) dx \le \int_0^\infty \exp\left( \frac{1}{x} \int_0^x \log f(t)\, dt \right) dx\ ?$$

Problem Set 6
1. (a) Given positive weights $w_1, \ldots, w_n$, show that
$$\left( \sum_{k=1}^n a_k \right)^{\!2} \le \left( \sum_{k=1}^n \frac{1}{w_k} \right) \left( \sum_{k=1}^n a_k^2 w_k \right).$$
(b) If we set $w_k = w_k(t) = t + k^2/t$ for $t > 0$, show that the first factor on the right is bounded above by $\pi/2$.
(c) Show that: $\min\left\{ \sum_{k=1}^n a_k^2 w_k(t) : t > 0 \right\} = 2 \left( \sum_{k=1}^n a_k^2 \right)^{1/2} \left( \sum_{k=1}^n k^2 a_k^2 \right)^{1/2}$.
(d) Finally, conclude that
$$\left( \sum_{k=1}^n a_k \right)^{\!4} \le \pi^2 \left( \sum_{k=1}^n a_k^2 \right) \left( \sum_{k=1}^n k^2 a_k^2 \right).$$
This is known as Carlson's inequality. The constant $\pi^2$ is best possible.
2. For any pair of nonnegative, real sequences $(a_m)$ and $(b_n)$, show that
$$\sum_{m=1}^\infty \sum_{n=1}^\infty \frac{a_m b_n}{\max\{m, n\}} < 4 \left( \sum_{m=1}^\infty a_m^2 \right)^{1/2} \left( \sum_{n=1}^\infty b_n^2 \right)^{1/2}.$$
[Hint: Mimic either of the proofs we gave for Hilbert's inequality (with $p = 2$). In the case of the homogeneous kernel approach, take $K(x, y) = [\max\{x, y\}]^{-1}$ and note that $\int_0^\infty \bigl[ \sqrt{u}\, \max\{1, u\} \bigr]^{-1} du = 4$.]
3. Let $(a_n)$ be a nonnegative sequence and let $(b_n)$ denote a rearrangement of $(a_n)$. If $(b_n)$ is decreasing, show that $\sum_{n=1}^\infty \left( \frac{A_n}{n} \right)^p \le \sum_{n=1}^\infty \left( \frac{B_n}{n} \right)^p$, where $A_n = a_1 + a_2 + \cdots + a_n$ and $B_n = b_1 + b_2 + \cdots + b_n$. [Hint: Argue, for example, that the left-hand side increases if a pair of "out of order" terms $a_i < a_j$, where $i < j$, are swapped.] Conclude that it suffices to prove Hardy's inequality (for series) in the case where $(a_n)$ is decreasing.
4. Let $(a_n)$ be nonnegative and decreasing, and define $f(x) = a_n$ for $n - 1 \le x < n$. Then $\sum_{n=1}^\infty a_n^p = \int_0^\infty f(t)^p\, dt$. As usual, we define $F(x) = \int_0^x f(t)\, dt$.
(i) Show that $F(x)/x$ decreases from $A_n/n$ to $A_{n+1}/(n + 1)$ over $n \le x \le n + 1$ and conclude that $\int_0^\infty (F(x)/x)^p\, dx \ge \sum_{n=1}^\infty (A_n/n)^p$.
(ii) Deduce that Hardy's series inequality for $(a_n)$ follows from his integral inequality for $f$. Thus, from 3, the integral inequality implies the series inequality for any nonnegative (not necessarily decreasing) sequence.
5. Here is an outline of an alternate proof of Hardy's integral inequality due to Ingham. In what follows, all functions are nonnegative and $1 < p < \infty$. Parts (a)-(c) and (e) are independent; only part (d) depends on the prior steps.
(a) Show that $\left( \int_0^1 \left( \int_0^1 g(x, y)\, dx \right)^p dy \right)^{1/p} \le \int_0^1 \left( \int_0^1 g(x, y)^p\, dy \right)^{1/p} dx$. [Hint: Write $J(y) = \int_0^1 g(x, y)\, dx$. Use Hölder on $I = \int_0^1 J(y)^p\, dy = \int_0^1 \int_0^1 J(y)^{p-1} g(x, y)\, dx\, dy$. For a stronger result, show that the inequality is strict unless $g(x, y) = h(x)\, k(y)$.]
(b) If $F(x) = \int_0^x f(t)\, dt$, show that $F(y)/y = \int_0^1 f(xy)\, dx$.
(c) Show that $\left( \int_0^1 f(xy)^p\, dy \right)^{1/p} \le x^{-1/p} \left( \int_0^1 f(t)^p\, dt \right)^{1/p}$.
(d) Show that $\left( \int_0^1 (F(y)/y)^p\, dy \right)^{1/p} \le \frac{p}{p - 1} \left( \int_0^1 f(t)^p\, dt \right)^{1/p}$.
(e) This proves Hardy's inequality over $[0, 1]$. It's possible to deduce the inequality over $[0, \infty)$ from (d). How? [Hint: First try to deduce the result over $[0, b]$ by means of a simple change of variable.]
6. Here is an outline of Pólya's proof of Carleman's inequality; it's legendary for its elegance and acuity. Let $(a_n)$ be a sequence of positive real numbers.
(a) Given a positive sequence $(c_n)$, justify the steps in the following calculation:
$$\begin{aligned}
\sum_{n=1}^\infty (a_1 a_2 \cdots a_n)^{1/n} &= \sum_{n=1}^\infty \left( \frac{c_1 a_1 \cdot c_2 a_2 \cdots c_n a_n}{c_1 c_2 \cdots c_n} \right)^{1/n} \\
&\le \sum_{n=1}^\infty (c_1 c_2 \cdots c_n)^{-1/n}\, \frac{1}{n} \sum_{m=1}^n c_m a_m = \sum_{m=1}^\infty a_m c_m \sum_{n=m}^\infty \frac{1}{n}\, (c_1 c_2 \cdots c_n)^{-1/n}.
\end{aligned}$$
(b) If $c_m = (m + 1)^m / m^{m-1}$, show that $(c_1 c_2 \cdots c_n)^{1/n} = n + 1$ and conclude that
$$\sum_{n=m}^\infty \frac{1}{n}\, (c_1 c_2 \cdots c_n)^{-1/n} = \sum_{n=m}^\infty \frac{1}{n(n + 1)} = \frac{1}{m}.$$
(c) Finally, deduce that
$$\sum_{n=1}^\infty (a_1 a_2 \cdots a_n)^{1/n} \le \sum_{m=1}^\infty \frac{a_m c_m}{m} = \sum_{m=1}^\infty a_m \left( 1 + \frac{1}{m} \right)^{\!m} < e \sum_{m=1}^\infty a_m.$$
7. Show that the constant $e^{\alpha + 1}$ in Carleson's inequality is best possible. [Hint: Let $a > \alpha$ and define $\varphi(x) = (a + 1)\, x \log x$ for $x > 1$, $\varphi(x) = 0$ otherwise.]

Problem Set 6, Problem 1

1. (a) Given positive weights $w_1, \ldots, w_n$, show that
$$\left( \sum_{k=1}^n a_k \right)^{\!2} \le \left( \sum_{k=1}^n \frac{1}{w_k} \right) \left( \sum_{k=1}^n a_k^2 w_k \right).$$
(b) If we set $w_k = w_k(t) = t + k^2/t$ for $t > 0$, show that the first factor on the right is bounded above by $\pi/2$.
(c) Show that: $\min\left\{ \sum_{k=1}^n a_k^2 w_k(t) : t > 0 \right\} = 2 \left( \sum_{k=1}^n a_k^2 \right)^{1/2} \left( \sum_{k=1}^n k^2 a_k^2 \right)^{1/2}$.
(d) Finally, conclude that
$$\left( \sum_{k=1}^n a_k \right)^{\!4} \le \pi^2 \left( \sum_{k=1}^n a_k^2 \right) \left( \sum_{k=1}^n k^2 a_k^2 \right).$$

Solution. (a) This is a straight application of Cauchy-Schwarz:
$$\sum_{k=1}^n a_k = \sum_{k=1}^n \frac{1}{\sqrt{w_k}} \cdot a_k \sqrt{w_k} \le \left( \sum_{k=1}^n \frac{1}{w_k} \right)^{1/2} \left( \sum_{k=1}^n a_k^2 w_k \right)^{1/2}.$$
(b) For $w_k = w_k(t) = t + k^2/t$ we have, by the integral test,
$$\sum_{k=1}^n \frac{1}{w_k} = \sum_{k=1}^n \frac{t}{t^2 + k^2} < \int_0^\infty \frac{t}{t^2 + x^2}\, dx = \int_0^\infty \frac{1}{1 + x^2}\, dx = \frac{\pi}{2}.$$
(c) Note that
$$\sum_{k=1}^n a_k^2 w_k(t) = t \sum_{k=1}^n a_k^2 + \frac{1}{t} \sum_{k=1}^n k^2 a_k^2 = At + \frac{B}{t},$$
which is minimized when $t = (B/A)^{1/2}$ (and its minimum value is then $2 A^{1/2} B^{1/2}$).
(d) Finally, we assemble the pieces. From (a) and (b) we have
$$\left( \sum_{k=1}^n a_k \right)^{\!2} \le \frac{\pi}{2} \sum_{k=1}^n a_k^2 w_k(t)$$
for all $t > 0$. Thus, from (c),
$$\left( \sum_{k=1}^n a_k \right)^{\!2} \le \pi \left( \sum_{k=1}^n a_k^2 \right)^{1/2} \left( \sum_{k=1}^n k^2 a_k^2 \right)^{1/2}.$$
Squaring both sides completes the solution (except for the claim that the constant $\pi^2$ is best possible).
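For what it's worth, Carlson's inequality is easy to spot-check numerically (a quick sketch with random data; the data are our own choice):

```python
import numpy as np

rng = np.random.default_rng(3)
a = rng.random(50)          # any nonnegative sequence
k = np.arange(1, 51)

lhs = a.sum() ** 4
rhs = np.pi ** 2 * (a ** 2).sum() * ((k * a) ** 2).sum()
assert lhs <= rhs
```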

Majorization and Schur Convexity

In this section we will discuss an order relation that may help to explain (and consolidate) many of the elementary inequalities that we've encountered. As we'll see, majorization, as the study and application of this order relation is called, is a fruitful undertaking.
Before we say more, let's make a few definitions and look at a few easy examples. To begin, given a finite length sequence of real numbers $x = (x_1, x_2, \ldots, x_n)$, we will denote the decreasing rearrangement of $x$ by $x^* = (x_k^*)$. That is, $x_1^* = \max\{x_k : 1 \le k \le n\} = x_{k_1}$, $x_2^* = \max\{x_k : 1 \le k \le n,\ k \ne k_1\} = x_{k_2}$, and so on. It would be more accurate to say that $x^*$ is the non-increasing rearrangement of $x$, but that's rather awkward. We'll stick with "decreasing."
Here's an easy example of the decreasing rearrangement in action:
Proposition. For $x, y \in \mathbb{R}^n$ we have
$$\sum_{i=1}^n x_i^*\, y_{n+1-i}^* \le \sum_{i=1}^n x_i y_i \le \sum_{i=1}^n x_i^* y_i^*.$$
Proof. We only need to prove the second inequality; the first then follows by considering $x$ and $-y$. To begin, we may suppose that one of the sequences, let's say $x$, is already in decreasing order but $y$ is not. Thus, there are a pair of indices $j < k$ such that $y_j < y_k$ while $x_j \ge x_k$. Now consider:
$$x_j y_k + x_k y_j - (x_j y_j + x_k y_k) = (x_k - x_j)(y_j - y_k) \ge 0.$$
It follows that the sum $\sum_{i=1}^n x_i y_i$ can only increase if we exchange $y_j$ and $y_k$. After a finite number of such exchanges, $y$ would be in decreasing order.


We next define an order relation, written $x \prec y$, read "$x$ is majorized by $y$" (or "$y$ majorizes $x$"), and defined by the following string of inequalities (plus one equation):
$$\begin{aligned}
x_1^* &\le y_1^* \\
x_1^* + x_2^* &\le y_1^* + y_2^* \\
&\ \ \vdots \\
x_1^* + \cdots + x_{n-1}^* &\le y_1^* + \cdots + y_{n-1}^* \\
x_1^* + x_2^* + \cdots + x_n^* &= y_1^* + y_2^* + \cdots + y_n^*
\end{aligned} \tag{1}$$
Example. $(1, 1, 1, 1) \prec (0, 1, 2, 1) \prec (2, 0, 2, 0) \prec (1, 3, 0, 0) \prec (0, 0, 4, 0)$.
Judging by this single example, it would seem that "smaller" sequences are more evenly distributed, while "bigger" sequences are less evenly distributed. It's also reasonably clear that $\prec$ is transitive; thus, $x \prec y$ and $y \prec z$ imply that $x \prec z$. But, because the relation doesn't depend on the order of the terms, we can't expect to have, say, $x \prec y$ and $y \prec x$ imply that $x = y$. It would, however, imply that $x^* = y^*$. Also, because we have one equation to satisfy, not all pairs of sequences will be comparable; for example, $(1, 1, 1)$ and $(1, 0, 1)$ are incomparable. Finally, given any $x$ and any permutations $\sigma$ and $\tau$ of $\{1, 2, \ldots, n\}$, the rearrangements $x_\sigma = (x_{\sigma(1)}, \ldots, x_{\sigma(n)})$ and $x_\tau$ satisfy $x_\sigma \prec x_\tau \prec x_\sigma$.
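The definition translates directly into code. A minimal checker (the function name `majorizes` is our own), verified against the example chain above:

```python
import numpy as np

def majorizes(y, x):
    """Return True if x ≺ y: partial sums of the decreasing rearrangements
    satisfy (1), with equality for the total sums."""
    xs = np.sort(x)[::-1].cumsum()
    ys = np.sort(y)[::-1].cumsum()
    return bool(np.all(xs[:-1] <= ys[:-1] + 1e-12) and abs(xs[-1] - ys[-1]) < 1e-12)

chain = [(1, 1, 1, 1), (0, 1, 2, 1), (2, 0, 2, 0), (1, 3, 0, 0), (0, 0, 4, 0)]
for small, big in zip(chain, chain[1:]):
    assert majorizes(np.array(big, float), np.array(small, float))
```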
In 1923, Schur studied majorization and asked which functions $f : (\mathbb{R}^n)^+ \to \mathbb{R}$ satisfy $f(x) \le f(y)$ whenever $x \prec y$. We now call such functions Schur convex. If, instead, $f(x) \ge f(y)$ whenever $x \prec y$, we say that $f$ is Schur concave. Again, more precise labels might be "Schur increasing" or "Schur monotone," but so it goes. Because we have $x_\sigma \prec x \prec x_\sigma$ for any permutation $\sigma$, note that a Schur convex/concave function must also be symmetric; that is, $f(x_\sigma) = f(x)$ for any permutation $\sigma$.
Fortunately, Schur gave us more than a label for such functions; he also gave us a test for them.
Schur's Criterion. Let $f : (\mathbb{R}^n)^+ \to \mathbb{R}$ be symmetric and continuously differentiable. Then $f$ is Schur convex if and only if, for all $1 \le j, k \le n$ and all $x \in (\mathbb{R}^n)^+$,
$$0 \le (x_j - x_k) \left( \frac{\partial f}{\partial x_j} - \frac{\partial f}{\partial x_k} \right).$$
Proof. Because $f$ is symmetric, it follows that $f$ is Schur convex on $(\mathbb{R}^n)^+$ if and only if $f$ is Schur convex on $D = \{x : x_1 \ge x_2 \ge \cdots \ge x_n \ge 0\}$.
Given $x \in (\mathbb{R}^n)^+$, let's agree to write $\hat{x} = (\hat{x}_1, \ldots, \hat{x}_n)$ for the vector with entries $\hat{x}_k = x_1 + \cdots + x_k$, $1 \le k \le n$. With this notation, the majorization $x \prec y$ (for $x, y \in D$) is equivalent to the string of inequalities $\hat{x}_k \le \hat{y}_k$, $1 \le k < n$, together with the equation $\hat{x}_n = \hat{y}_n$. In this way, majorization is almost the same as the coordinatewise comparison $\hat{x} \le \hat{y}$.
In particular, if we put $\tilde{D} = \{\hat{x} : x \in D\}$ and define $\tilde{f} : \tilde{D} \to \mathbb{R}$ by $\tilde{f}(\hat{x}) = f(x)$, then $f$ is Schur convex on $D$ if and only if $\tilde{f}$ satisfies $\tilde{f}(\hat{x}) \le \tilde{f}(\hat{y})$ whenever $\hat{x}, \hat{y} \in \tilde{D}$ satisfy $\hat{x}_j \le \hat{y}_j$, $1 \le j < n$, and $\hat{x}_n = \hat{y}_n$. That is, $\tilde{f}$ must be nondecreasing in each of its first $n - 1$ coordinates; hence,
$$0 \le \frac{\partial \tilde{f}(\hat{x})}{\partial \hat{x}_j} \quad\text{for all}\quad 1 \le j < n.$$
But $\tilde{f}(\hat{x}) = f(\hat{x}_1, \hat{x}_2 - \hat{x}_1, \ldots, \hat{x}_n - \hat{x}_{n-1})$ so, by the chain rule,
$$0 \le \frac{\partial \tilde{f}(\hat{x})}{\partial \hat{x}_j} = \frac{\partial f(x)}{\partial x_j}\, \frac{\partial x_j}{\partial \hat{x}_j} + \frac{\partial f(x)}{\partial x_{j+1}}\, \frac{\partial x_{j+1}}{\partial \hat{x}_j} = \frac{\partial f(x)}{\partial x_j} - \frac{\partial f(x)}{\partial x_{j+1}} \tag{2}$$
for all $1 \le j < n$. Summing (2) over $j, j + 1, \ldots, k - 1$ gives us
$$0 \le \frac{\partial f(x)}{\partial x_j} - \frac{\partial f(x)}{\partial x_k} \quad\text{for}\quad 1 \le j < k \le n \quad\text{and}\quad x \in D.$$
Because $f$ is symmetric, this is equivalent to
$$0 \le (x_j - x_k) \left( \frac{\partial f(x)}{\partial x_j} - \frac{\partial f(x)}{\partial x_k} \right) \quad\text{for}\quad x \in (\mathbb{R}^n)^+,$$
which is just a way of saying that the variables must be "in the same order" as the corresponding partial derivatives.
By way of an example, let's check that $f(x) = x_1 x_2 \cdots x_n$ is Schur concave on $(\mathbb{R}^n)^+$. In this case, $f_{x_j} = \prod_{i \ne j} x_i$, and so
$$(x_j - x_k)(f_{x_j} - f_{x_k}) = -(x_j - x_k)^2 \prod_{i \ne j, k} x_i \le 0.$$
It follows that $f(x) \ge f(\bar{x})$, where $\bar{x}$ is the constant vector with entries $M_1(x) = (x_1 + \cdots + x_n)/n$, because $\bar{x} \prec x$. That is, we have another proof of the AGM: $x_1 x_2 \cdots x_n \le M_1(x)^n$.
As another example, it's not hard to check that $f(x) = \|x\|_p$ is Schur convex for $1 \le p < \infty$ (and Schur concave for $0 < p < 1$). Thus, using the same test vector as above, we have $f(\bar{x}) \le f(x)$ or: $M_1(x) \le M_p(x)$. It also tells us that the vectors in our first example are, in fact, listed in increasing order of $p$-norm. In this case, $4^{1/p} \le (2 + 2^p)^{1/p} \le (2^p + 2^p)^{1/p} \le (1 + 3^p)^{1/p} \le 4$.
Rather than catalogue more examples, let's turn our attention to a better understanding of the relation $x \prec y$. This will lead us to another test for Schur convexity. What's missing is some justification for the "convexity" in Schur convexity and, ultimately, its connection with majorization. Filling in the details will take some time, and will take us on something of a side trip, but the journey will have its rewards.
A Bit of Matrix Theory
We begin with the notion of a permutation matrix. Given a permutation $\sigma$ of $\{1, \ldots, n\}$, the matrix associated with the transformation $T_\sigma(x) = x_\sigma$ is called a permutation matrix. Note that $T_\sigma$ permutes the basis vectors according to $\sigma$; i.e., $T_\sigma(e_j) = e_{\sigma(j)}$, and, hence, its matrix representation is the corresponding permutation of the columns of $I$, the identity matrix. Said in other words, a permutation matrix is a matrix with entries 0 and 1 in which every row and every column has precisely one nonzero entry. The simplest example of a permutation matrix is a transposition matrix, which corresponds to a simple exchange, or transposition, of two basis vectors:
$$\begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} \ \xrightarrow{\ \text{swap } e_2 \text{ and } e_4\ } \ \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \end{pmatrix}.$$
It won't surprise you to learn that the set of permutation matrices on $\mathbb{R}^n$ forms a group (under multiplication) that is isomorphic to $S_n$, the symmetric group on $n$ letters (that is, the group of all permutations on $\{1, \ldots, n\}$). In particular, note that every permutation matrix is invertible and its inverse is again a permutation matrix; indeed, $(T_\sigma)^{-1} = T_{\sigma^{-1}}$.
We will also need the notion of a transfer matrix, defined by $T_\lambda = (1 - \lambda) I + \lambda T$, where $0 \le \lambda \le 1$ and where $T$ is a transposition matrix. Symbolically, if $T$ exchanges $e_i$ and $e_j$, we would have
$$T_\lambda = \begin{pmatrix}
1 & & & & & \\
& \ddots & & & & \\
& & 1 - \lambda & \cdots & \lambda & \\
& & \vdots & & \vdots & \\
& & \lambda & \cdots & 1 - \lambda & \\
& & & & & \ddots
\end{pmatrix},$$
where all other diagonal entries are 1 and all other off-diagonal entries are 0. To understand the action of $T_\lambda$, suppose (for sake of argument) that we have $x_i > x_j$; then $T_\lambda$ subtracts a bit of the "excess" (namely, $x_i - x_j$) from $x_i$ and transfers it to $x_j$; that is,
$$\begin{aligned}
(T_\lambda(x))_i &= (1 - \lambda) x_i + \lambda x_j = x_i - \lambda (x_i - x_j), \\
(T_\lambda(x))_j &= \lambda x_i + (1 - \lambda) x_j = x_j + \lambda (x_i - x_j), \\
(T_\lambda(x))_k &= x_k, \quad k \ne i, j,
\end{aligned} \tag{3}$$
where $(T_\lambda(x))_k$ denotes the $k$-th coordinate of $T_\lambda(x)$. Also notice that we have $(T_\lambda(x))_i + (T_\lambda(x))_j = x_i + x_j$; thus, $x$ and $T_\lambda(x)$ are comparable and, in fact, $T_\lambda(x) \prec x$ because $T_\lambda(x)$ is "more evenly distributed" than $x$ (but it also follows from direct calculation, as we'll see shortly). Finally, notice that every transposition matrix (including $I$) is also a transfer matrix (by taking $\lambda = 0$ or $\lambda = 1$).
More generally, suppose that we're given a convex combination of several permutation matrices:
$$D = \lambda_1 T_{\sigma_1} + \cdots + \lambda_k T_{\sigma_k}, \qquad \lambda_\ell \ge 0, \qquad \lambda_1 + \cdots + \lambda_k = 1.$$
Now the permutation matrices $T_{\sigma_\ell}$ each have a single 1 in any given row or column, so their convex combination $D$ must satisfy
$$\sum_{i=1}^n d_{i,j} = \sum_{\ell=1}^k \lambda_\ell = 1 \quad\text{and}\quad \sum_{j=1}^n d_{i,j} = \sum_{\ell=1}^k \lambda_\ell = 1$$
for all $i$ and $j$; that is, every row and every column of $D$ sums to 1. Such a matrix is said to be doubly stochastic. In other words, an $n \times n$ matrix is doubly stochastic if $d_{i,j} \ge 0$, for all $i, j$, and every row and every column sums to 1; i.e., $\sum_{i=1}^n d_{i,j} = 1$ and $\sum_{j=1}^n d_{i,j} = 1$ for all $i$ and $j$. (It follows that $d_{i,j} \le 1$ for all $i, j$.) As it happens, every doubly stochastic matrix can be written as a convex combination of permutation matrices (Birkhoff's theorem), so our initial supposition is actually the general case.
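Transfer matrices and the majorization $T_\lambda(x) \prec x$ can be seen concretely (a small sketch; the helper name `transfer_matrix` is our own):

```python
import numpy as np

def transfer_matrix(n, i, j, lam):
    """T_lambda = (1 - lam) I + lam T, where T swaps coordinates i and j."""
    T = np.eye(n)
    T[[i, j]] = T[[j, i]]              # the transposition matrix
    return (1 - lam) * np.eye(n) + lam * T

y = np.array([5.0, 3.0, 1.0])
T = transfer_matrix(3, 0, 2, 0.25)     # move a quarter of the excess from y_1 to y_3
x = T @ y                              # [4., 3., 2.]

# T is doubly stochastic, and x = Ty is majorized by y (same total, more even):
assert np.allclose(T.sum(axis=0), 1) and np.allclose(T.sum(axis=1), 1)
xs, ys = np.sort(x)[::-1].cumsum(), np.sort(y)[::-1].cumsum()
assert np.all(xs <= ys + 1e-12) and np.isclose(xs[-1], ys[-1])
```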
Now might be a good time to say a couple of words about permutations and transpositions. From the proof of our first Proposition, it's not hard to see that any permutation can be written as the product of finitely many transpositions. To see this, let's first establish a useful claim:
Claim. Given $y \in \mathbb{R}^n$, we can write $y^* = T_k T_{k-1} \cdots T_1 y$ for some (finite) sequence of transposition matrices.
We've sketched the proof already: If $i_1$ is the index of the largest coordinate of $y$ and if $\tau_1$ is the transposition that exchanges $i_1$ with 1, then $T_1 = T_{\tau_1}$ moves the largest entry of $y$ into the first coordinate; that is, $T_1 y$ now has its largest entry in the first coordinate. If $i_2 \ne 1$ is the index of the second largest coordinate in $T_1 y$ and if $\tau_2$ exchanges $i_2$ and 2, then $T_2 = T_{\tau_2}$ moves the second largest entry of $T_1 y$ into the second coordinate. That is, $T_2 T_1 y$ agrees with $y^*$ in its first two coordinates. And so on. After at most $n - 1$ such exchanges, we'll arrive at $y^*$.
Now this argument has very little to do with decreasing rearrangements and everything to do with matching a specific permutation of the coordinates of $y$, so we've actually established:
Claim. Given $y \in \mathbb{R}^n$ and $\sigma \in S_n$, we can write $y_\sigma = T_k T_{k-1} \cdots T_1 y$ for some (finite) sequence of transposition matrices. In other words, $T_\sigma = T_k T_{k-1} \cdots T_1$ or, in still other words, $\sigma = \tau_k \tau_{k-1} \cdots \tau_1$. That is, every permutation can be written as the product of finitely many transpositions.
We're now ready to put some of the pieces together.

Theorem. Let $x, y \in \mathbb{R}^n$ with $x, y \ge 0$. Then the following are equivalent:
(i) $x$ is a convex combination (of finitely many) of the vectors $\{y_\sigma : \sigma \in S_n\}$.
(ii) $x = Dy$ for some doubly stochastic matrix $D$.
(iii) $x \prec y$.
(iv) $x = T_1 T_2 \cdots T_k y$ for some (finite) sequence of transfer matrices $T_1, \ldots, T_k$.
Proof. We've actually already shown that (i) implies (ii). Indeed, we only need to recall that $y_\sigma = T_\sigma(y)$ for then:
$$x = \lambda_1 y_{\sigma_1} + \cdots + \lambda_k y_{\sigma_k} = (\lambda_1 T_{\sigma_1} + \cdots + \lambda_k T_{\sigma_k})(y) = Dy,$$
where $D$ is doubly stochastic.
To see that (ii) implies (iii), first note that either condition is unaffected by permutations of $x$ or $y$. As we've already seen, the condition $x \prec y$ is equivalent to the condition $x_\sigma \prec y_\tau$ for any permutations $\sigma, \tau$. Moreover, because the product of a doubly stochastic matrix and a permutation matrix is still doubly stochastic, it follows that the condition $x = Dy$ is equivalent to the condition $x_\sigma = D' y_\tau$ for any permutations $\sigma, \tau$ (and some doubly stochastic matrix $D'$, which may depend on $\sigma$ and $\tau$, of course). Thus we may suppose that both $x$ and $y$ are in decreasing order.
Now if $x = Dy$ and if $1 \le k \le n$ is fixed, then we can write
$$x_i = \sum_{j=1}^n d_{i,j}\, y_j \quad\Longrightarrow\quad \sum_{i=1}^k x_i = \sum_{i=1}^k \sum_{j=1}^n d_{i,j}\, y_j = \sum_{j=1}^n c_j y_j \quad\text{where}\quad c_j = \sum_{i=1}^k d_{i,j}.$$
In particular, if $k = n$, then $c_j = 1$ for all $j$ and, hence, $\sum_{i=1}^n x_i = \sum_{i=1}^n y_i$. For $k < n$, the fact that $D$ is doubly stochastic implies that $0 \le c_j \le 1$ for all $j$ and
$$\sum_{j=1}^n c_j = \sum_{i=1}^k \sum_{j=1}^n d_{i,j} = k.$$
Thus, for $1 \le k < n$, we can write (watch closely!):
$$\begin{aligned}
\sum_{i=1}^k x_i - \sum_{i=1}^k y_i &= \sum_{j=1}^n c_j y_j - \sum_{i=1}^k y_i + y_k \left( k - \sum_{j=1}^n c_j \right) \\
&= \sum_{i=1}^k (y_k - y_i)(1 - c_i) + \sum_{i=k+1}^n (y_i - y_k)\, c_i \le 0,
\end{aligned}$$
since $y$ is decreasing (so $y_i \ge y_k$ for $i \le k$ and $y_i \le y_k$ for $i > k$) and $0 \le c_i \le 1$. In other words, we've shown that $x \prec y$.


We next show that (iii) implies (iv). Let x, y Rn with x y. Again, we may
suppose that x and y are decreasing. We may also suppose that x 6= y, for otherwise we
just take T1 = I in (iv). Our proof proceeds by induction on the number of coordinates N
in which x and y differ. The fact that we have x1 + + xn = y1 + + yn will convince
you that x and y cannot differ in only one coordinate, so lets first suppose that x and y
differ in precisely 2 coordinates; that is, for some 1 j < k n we have xj 6= yj , xk 6= yk
and xi = yi for i 6= j, k. Now, because x1 + + xj y1 + + yj = x1 + + xj1 + yj ,
we must have xj < yj . It follows that we then have xk > yk to compensate. Thus,
yj > xj xk > yk . Finally, notice that xj + xk = yj + yk ; that is, yj xj = xk yk .
With (3) as our guide, we can now write the transfer matrix that maps y into x: We
set T = (1 )I + T(j,k) where = (yj xj )/(yj yk ) = (xk yk )/(yj yk ) and
where T(j,k) is the transposition matrix that exchanges ej and ek . As the following 2 2
calculation shows, well then have x = T y because:


 
 

1

yj
yj (yj yk )
xj
=
=

1
yk
yk + (yj yk )
xk
(and because T fixes all other entries of y). This proves the case N = 2.
Suppose now that (iii) implies (iv) holds whenever x y and x and y differ in N 1 or
fewer coordinates and that were given a pair of (decreasing) sequences with x y where
83

x and y differ in N coordinates. Then, as in the case N = 2, the first place where x and
y differ will have to satisfy xj < yj and, for some k > j well have xk > yk to compensate.
But, in fact, we must have some pair of coordinates j < k which satisfy:
xj < yj ,

xk > yk ,

and xi = yi

for j < i < k.

(We might have k = j + 1, of course; what matters here is that x and y do not differ in
any intermediate coordinates.)
Because weve assumed that x and y are in decreasing order, this again means that
yj > xj xk > yk . In particular, yj yk > xj xk and well transfer a bit of yj to yk to
arrive at x. However, because x and y might differ in several coordinates, we can no longer
be assured that yj xj = xk yk ; instead, well transfer (a portion of) the smaller of
the two amounts. Specifically, set T = (1 )I + T(j,k) where T(j,k) is the transposition
matrix that exchanges ej and ek , and where = min{yj xj , xk yk }/(yj yk ). If we
let y = T y, then, as above, T acts on only two coordinates of y; in this case,


 
 
 

1

yj
yj (yj yk )
yj min{yj xj , xk yk }
yj
=
=
=

1
yk
yk + (yj yk )
yk + min{yj xj , xk yk }
yk
and $\tilde y_i = y_i$ for $i \ne j, k$. It's reasonably clear from the definition of $\tilde y$ that we have
\[
y_j \ge \tilde y_j \ge x_j \ge x_k \ge \tilde y_k \ge y_k.
\]
Moreover, because we have either $\lambda = (y_j - x_j)/(y_j - y_k)$ or $\lambda = (x_k - y_k)/(y_j - y_k)$, it follows that we have either $\tilde y_j = x_j$ (in the first case) or $\tilde y_k = x_k$ (in the second case).

Claim. $x \prec \tilde y = Ty \prec y$.

That $\tilde y = Ty \prec y$ follows from our proof that (ii) implies (iii). What remains is to show that $x \prec \tilde y$. The key to this is the fact that $\tilde y_j + \tilde y_k = y_j + y_k$. Indeed, we have
\[
x_1 + \cdots + x_i \le y_1 + \cdots + y_i = \tilde y_1 + \cdots + \tilde y_i \quad\text{if } 1 \le i < j \text{ or } k \le i \le n; \tag{4}
\]
\[
x_1 + \cdots + x_i \le \tilde y_1 + \cdots + \tilde y_i \le y_1 + \cdots + y_i \quad\text{if } j \le i < k. \tag{5}
\]
That (4) holds is immediate; that (5) holds follows from the facts that $x_j \le \tilde y_j$ and $\tilde y_i = y_i$ for $j < i < k$.
In summary, we've shown that there is a transfer matrix $T$ such that $x \prec \tilde y = Ty \prec y$ and such that $x$ and $\tilde y$ differ in at most $N - 1$ coordinates. By our induction hypothesis, there exist finitely many transfer matrices $T_1, \ldots, T_k$ such that $x = T_1 \cdots T_k(\tilde y) = T_1 \cdots T_k(Ty) = (T_1 \cdots T_k T)(y)$. Thus, we've shown that (iii) implies (iv).
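The induction above is effectively an algorithm, and it is short enough to run. Here is a Python sketch (the function name `transfer_chain` and the example data are our own) that, given decreasing $x \prec y$, records parameters $(j, k, \lambda)$ of transfer matrices whose composition carries $y$ onto $x$:

```python
import numpy as np

def transfer_chain(x, y, tol=1e-12):
    # Given decreasing x majorized by y, return a list of (j, k, lam) so
    # that applying T = (1 - lam) I + lam T_(j,k), triple by triple,
    # carries y onto x.  This follows the induction step of the proof:
    # k is the first coordinate with x_k > y_k, and j is the last earlier
    # coordinate with x_j < y_j, so x and y agree strictly between them.
    x, y = np.asarray(x, float), np.array(y, float)
    steps = []
    while not np.allclose(x, y, atol=tol):
        k = next(i for i in range(len(x)) if x[i] > y[i] + tol)
        j = max(i for i in range(k) if x[i] < y[i] - tol)
        lam = min(y[j] - x[j], x[k] - y[k]) / (y[j] - y[k])
        y[j], y[k] = y[j] - lam * (y[j] - y[k]), y[k] + lam * (y[j] - y[k])
        steps.append((j, k, lam))
    return steps

print(transfer_chain([3, 3, 2], [5, 2, 1]))
# Two transfers: (j, k, lam) = (0, 1, 1/3), then (0, 2, 1/3).
```

Each pass fixes at least one more coordinate, so at most $N - 1$ transfers are produced, just as in the proof.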
Finally, the proof that (iv) implies (i) follows by direct calculation and the observation that if $T$ is a transfer matrix, then $T(y) = (1 - \lambda) y + \lambda y_\tau$ is a convex combination of permutations of $y$. (In other words, if we write $H$ for the set of all convex combinations of the vectors $y_\sigma$, $\sigma \in S_n$, then for any transfer matrix $T$ we have $T(H) \subseteq H$.)
Corollary. $\Phi : (\mathbb{R}^n)^+ \to \mathbb{R}$ is Schur convex if and only if $\Phi(T(x)) \le \Phi(x)$ for every transfer matrix $T$, if and only if $\Phi(D(x)) \le \Phi(x)$ for every doubly stochastic matrix $D$.
Schur's Majorization Inequality. If $f : I \to \mathbb{R}$ is convex, then the function $\Phi : I^n \to \mathbb{R}$ defined by
\[
\Phi(x_1, \ldots, x_n) = \sum_{i=1}^n f(x_i)
\]
is Schur convex. That is,
\[
\sum_{i=1}^n f(x_i) \le \sum_{i=1}^n f(y_i)
\]
whenever $x \prec y$.
Proof. The proof is virtually immediate from the preceding Corollary and (3). Indeed, if $T$ is a transfer matrix that acts only on the $j$-th and $k$-th coordinates, and if we write $\tilde x = T(x)$, then, from (3), we have $\tilde x_j = (1-\lambda) x_j + \lambda x_k$, $\tilde x_k = \lambda x_j + (1-\lambda) x_k$, and $\tilde x_i = x_i$ for $i \ne j, k$. It's then immediate from the convexity of $f$ that $\Phi(T(x)) \le \Phi(x)$.
It's clear that if $\Phi(x)$ is Schur convex and if $h$ is increasing, then $h(\Phi(x))$ is again Schur convex. This explains why $f(x) = \|x\|_p$ is Schur convex for $1 \le p < \infty$: the function $\Phi(x) = \sum_{i=1}^n x_i^p$ is Schur convex (by Schur's majorization inequality) and $h(x) = x^{1/p}$ is increasing.
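A quick numerical illustration (a Python sketch; the test vectors are our own): with $x = (4, 3, 2) \prec y = (6, 3, 0)$, the sum $\sum_i f(x_i)$ should never exceed $\sum_i f(y_i)$ for convex $f$, and the $p$-norms should be ordered the same way.

```python
import numpy as np

x, y = np.array([4.0, 3.0, 2.0]), np.array([6.0, 3.0, 0.0])  # x majorized by y
for f in (np.square, np.exp, np.abs):                # three convex functions
    print(f(x).sum() <= f(y).sum())                  # True, True, True
print(np.linalg.norm(x, 3) <= np.linalg.norm(y, 3))  # the p-norms too: True
```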
By way of another example, we present a classical inequality, due to Muirhead (1903),
who introduced the transfer method.
Muirhead's Theorem. Let $a_1, \ldots, a_n$ be positive. For $x \in \mathbb{R}^n$ with $x \ge 0$ define
\[
\Phi(x) = \sum_{\sigma \in S_n} a_{\sigma(1)}^{x_1}\, a_{\sigma(2)}^{x_2} \cdots a_{\sigma(n)}^{x_n}.
\]
Then $\Phi$ is Schur convex.


Proof. Let $T = (1 - \lambda) I + \lambda T_\tau$ be a transfer matrix, where $0 \le \lambda \le 1$ and where $\tau = (i,j)$ denotes the transposition of $i$ and $j$, $i \ne j$. Following (3), we can write
\[
x_i = m + d, \qquad x_j = m - d, \qquad (T(x))_i = m + \mu d, \qquad (T(x))_j = m - \mu d,
\]
where $\mu = 1 - 2\lambda$ satisfies $-1 \le \mu \le 1$. Then
\begin{align*}
2\Phi(x) &= \bigl[\, \Phi(x) + \Phi(x_\tau) \,\bigr] \\
&= \sum_{\sigma \in S_n} a_{\sigma(1)}^{x_1}\, a_{\sigma(2)}^{x_2} \cdots a_{\sigma(n)}^{x_n}
 + \sum_{\sigma \in S_n} a_{\sigma(\tau(1))}^{x_1}\, a_{\sigma(\tau(2))}^{x_2} \cdots a_{\sigma(\tau(n))}^{x_n} \\
&= \sum_{\sigma \in S_n} \Bigl( \prod_{k \ne i,j} a_{\sigma(k)}^{x_k} \Bigr)
   \bigl[\, a_{\sigma(i)}^{x_i}\, a_{\sigma(j)}^{x_j} + a_{\sigma(j)}^{x_i}\, a_{\sigma(i)}^{x_j} \,\bigr] \\
&= \sum_{\sigma \in S_n} \Bigl( \prod_{k \ne i,j} a_{\sigma(k)}^{x_k} \Bigr)
   \bigl[\, a_{\sigma(i)}^{m+d}\, a_{\sigma(j)}^{m-d} + a_{\sigma(i)}^{m-d}\, a_{\sigma(j)}^{m+d} \,\bigr].
\end{align*}
Similarly,
\[
2\Phi(T(x)) = \sum_{\sigma \in S_n} \Bigl( \prod_{k \ne i,j} a_{\sigma(k)}^{x_k} \Bigr)
   \bigl[\, a_{\sigma(i)}^{m+\mu d}\, a_{\sigma(j)}^{m-\mu d} + a_{\sigma(i)}^{m-\mu d}\, a_{\sigma(j)}^{m+\mu d} \,\bigr].
\]
Consequently,
\[
2\Phi(x) - 2\Phi(T(x)) = \sum_{\sigma \in S_n} \Bigl( \prod_{k \ne i,j} a_{\sigma(k)}^{x_k} \Bigr) \Delta(\sigma),
\]
where
\begin{align*}
\Delta(\sigma) &= a_{\sigma(i)}^m a_{\sigma(j)}^m
   \Bigl[ \bigl( a_{\sigma(i)}^{d}\, a_{\sigma(j)}^{-d} + a_{\sigma(i)}^{-d}\, a_{\sigma(j)}^{d} \bigr)
        - \bigl( a_{\sigma(i)}^{\mu d}\, a_{\sigma(j)}^{-\mu d} + a_{\sigma(i)}^{-\mu d}\, a_{\sigma(j)}^{\mu d} \bigr) \Bigr] \\
&= a_{\sigma(i)}^m a_{\sigma(j)}^m
   \bigl[\, (y^d + y^{-d}) - (y^{\mu d} + y^{-\mu d}) \,\bigr],
\end{align*}
and where $y = a_{\sigma(i)}/a_{\sigma(j)} > 0$. But for $y > 0$, the function $f(s) = y^s + y^{-s}$ is even and increasing on $[\,0, \infty)$; thus $\Delta(\sigma) \ge 0$ because $f(d) \ge f(\mu d)$. Thus, $\Phi(T(x)) \le \Phi(x)$.
Now if we insist that $\sum_{i=1}^n x_i = 1$, then the expression
\[
\mu(a, x) = \frac{1}{n!} \sum_{\sigma \in S_n} a_{\sigma(1)}^{x_1}\, a_{\sigma(2)}^{x_2} \cdots a_{\sigma(n)}^{x_n}
\]
defines a mean (of $a$); it's an average of geometric-like means over the $n!$ members of $S_n$. With this slight change in notation, we get an interesting extension of the arithmetic-geometric mean inequality:
\[
M_0(a) \le \mu(a, x) \le M_1(a),
\]
because $b = (1/n, 1/n, \ldots, 1/n) \prec x \prec (1, 0, \ldots, 0) = c$ and because
\[
\mu(a, b) = \frac{1}{n!} \sum_{\sigma \in S_n} (a_{\sigma(1)}\, a_{\sigma(2)} \cdots a_{\sigma(n)})^{1/n} = (a_1 a_2 \cdots a_n)^{1/n}
\]
and
\[
\mu(a, c) = \frac{1}{n!} \sum_{\sigma \in S_n} a_{\sigma(1)} = \frac{1}{n} \sum_{i=1}^n a_i.
\]
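This is easy to experiment with. A Python sketch (the helper name `muirhead_mean` is our own) computes $\mu(a, x)$ by brute force over $S_n$ and checks that it always lands between the geometric and arithmetic means:

```python
import itertools
import math
import numpy as np

def muirhead_mean(a, x):
    # (1/n!) * sum over sigma in S_n of prod_i a[sigma(i)]**x[i];
    # this is a mean of a whenever the exponents x sum to 1.
    n = len(a)
    total = sum(math.prod(a[s] ** e for s, e in zip(sigma, x))
                for sigma in itertools.permutations(range(n)))
    return total / math.factorial(n)

a = np.array([1.0, 2.0, 5.0])
geo, ari = math.prod(a) ** (1 / 3), a.mean()   # M_0(a) and M_1(a)
for x in ([1/3, 1/3, 1/3], [1/2, 1/3, 1/6], [1.0, 0.0, 0.0]):
    m = muirhead_mean(a, x)
    print(round(m, 4), geo - 1e-9 <= m <= ari + 1e-9)   # always True
```

Since $(1/3, 1/3, 1/3) \prec (1/2, 1/3, 1/6) \prec (1, 0, 0)$, the three printed means also come out in increasing order, exactly as Muirhead's theorem predicts.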

We complete this section with a proof of Garrett Birkhoff's theorem on doubly stochastic matrices (1946), a pivotal result in the theory of majorization. Curiously, this is yet another example of a famous theorem of uncertain provenance. Some sources claim that it was first proved by König in 1936 (in an article on graph theory). The theorem was discovered independently by John von Neumann in 1953 (in an article on game theory). Many contemporary sources refer to it as the Birkhoff-von Neumann theorem.
Theorem. Every doubly stochastic matrix can be written as a convex combination of
permutation matrices.
Proof. We'll prove the theorem by induction on the number of nonzero entries. It's clearly true for doubly stochastic matrices with precisely $n$ nonzero entries, as any such matrix is necessarily a permutation matrix. So we suppose that the theorem holds for doubly stochastic matrices with fewer than $k$ nonzero entries, where $n + 1 \le k \le n^2$, and that we're given a doubly stochastic matrix $D$ with precisely $k$ nonzero entries.

In particular, $D$ has some entry $d_{i_1, j_1}$ with $0 < d_{i_1, j_1} < 1$. But the sum of the entries in the $i_1$-th row is 1, so there must be some $j_2 \ne j_1$ such that $0 < d_{i_1, j_2} < 1$. But the sum of the entries in the $j_2$-th column is 1, so there is some index $i_2 \ne i_1$ such that $0 < d_{i_2, j_2} < 1$, and so on. This process can be iterated until some pair $(i, j)$ is repeated. Thus we may suppose that we've chosen a circuit that begins and ends at $(i_1, j_1)$ and, moreover, that we've chosen the shortest such circuit, whose entries will then satisfy
\[
0 < d_{i_s, j_s} < 1, \qquad 0 < d_{i_s, j_{s+1}} < 1, \qquad 1 \le s \le m,
\]
where $i_1, \ldots, i_m$ are distinct, $j_1, \ldots, j_m$ are distinct, and where $j_{m+1} = j_1$. (If the circuit
produced three consecutive pairs in the same row or column, we could delete one of those
pairs, arriving at a shorter circuit with the property we need. Consequently, a minimal
circuit will have an even number of corners, with no more than two corners in any one
row or column.)
We now define an auxiliary $n \times n$ matrix $A$ by setting $a_{i_s, j_s} = 1$ and $a_{i_s, j_{s+1}} = -1$ for $1 \le s \le m$, and $a_{i,j} = 0$ otherwise. Note that by our construction every row and every column of $A$ sums to zero. Finally, set
\[
\alpha = \min_{1 \le s \le m} d_{i_s, j_s} > 0 \qquad\text{and}\qquad \beta = \min_{1 \le s \le m} d_{i_s, j_{s+1}} > 0.
\]
Then each of the matrices $D - \alpha A$ and $D + \beta A$ has nonnegative entries and each has row and column sums equal to 1. Thus, each is doubly stochastic and, moreover, each has fewer than $k$ nonzero entries. By induction, each can be written as a convex combination of permutation matrices. It follows that the same will be true of $D$ because
\[
D = \frac{\beta}{\alpha + \beta}\,(D - \alpha A) + \frac{\alpha}{\alpha + \beta}\,(D + \beta A).
\]
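The proof also suggests an algorithm. The Python sketch below (the name `birkhoff_decomposition` is our own, and for simplicity it locates a positive diagonal by brute force over permutations, a shortcut justified by Problem 4 below, rather than by the circuit argument) peels off one permutation matrix at a time:

```python
import itertools
import numpy as np

def birkhoff_decomposition(D, tol=1e-9):
    # Peel off one permutation matrix at a time.  A positive diagonal
    # exists as long as anything remains (see Problem 4), and for small
    # n we can find one by brute force over all n! permutations.
    D = D.astype(float).copy()
    n, terms = D.shape[0], []
    while D.max() > tol:
        sigma = max(itertools.permutations(range(n)),
                    key=lambda s: min(D[i, s[i]] for i in range(n)))
        theta = min(D[i, sigma[i]] for i in range(n))
        P = np.zeros((n, n))
        P[np.arange(n), sigma] = 1.0
        D -= theta * P       # leftover is a multiple of a doubly
        terms.append((theta, sigma))  # stochastic matrix, with a new zero
    return terms

D = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.25, 0.50],
              [0.25, 0.25, 0.50]])
for theta, sigma in birkhoff_decomposition(D):
    print(round(theta, 2), sigma)
```

On this $3 \times 3$ example the sketch prints four permutations, each with weight $1/4$; each subtraction creates at least one new zero entry, so the loop always terminates.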

MATH 6820

Problem Set 7

August 1, 2011

1. For $x, y, z > 0$, show that
\[
\left( \frac{2}{x+y} \right)^{\!5} + \left( \frac{6}{3x+y+2z} \right)^{\!5} + \left( \frac{6}{2x+3y+z} \right)^{\!5} \le \frac{1}{x^5} + \frac{1}{y^5} + \frac{1}{z^5}.
\]
2. For $x \in \mathbb{R}^n$, $n \ge 2$, statisticians use the sample variance
\[
s(x) = \frac{1}{n-1} \sum_{k=1}^n (x_k - \bar x)^2,
\]
where $\bar x = (x_1 + \cdots + x_n)/n$, to measure the dispersion of sample data $x = (x_1, \ldots, x_n)$. Information theorists, on the other hand, use the entropy
\[
h(p) = -\sum_{k=1}^n p_k \log p_k
\]
to measure the dispersion of probability distributions $p = (p_1, \ldots, p_n)$, where $p_k \ge 0$ and $p_1 + \cdots + p_n = 1$. Show that $s(x)$ is Schur convex on $\mathbb{R}^n$ while $h(p)$ is Schur concave on $(\mathbb{R}^n)^+$.
3. Given $0 < x, y, z < 1$ such that $\max\{x, y, z\} \le (x + y + z)/2$, show that
\[
\left( \frac{1+x}{1-x} \right) \left( \frac{1+y}{1-y} \right) \left( \frac{1+z}{1-z} \right)
\ge
\left\{ \frac{1 + \tfrac{1}{2}(x+y+z)}{1 - \tfrac{1}{2}(x+y+z)} \right\}^{2}.
\]
4. Show that every doubly stochastic matrix has a positive diagonal. [Hint: If $D$ is doubly stochastic, then $D = \lambda_1 P_1 + \cdots + \lambda_k P_k$, where each $P_j$ is a permutation matrix, and where some $\lambda_j > 0$.]
5. Let $\lambda_1 \ge \cdots \ge \lambda_n$ be the eigenvalues of a Hermitian matrix $A \in M_n(\mathbb{C})$, arranged in decreasing order. Show that for any orthonormal sequence $v_1, \ldots, v_k \in \mathbb{C}^n$ we have
\[
\sum_{j=1}^k \lambda_{n-j+1} \le \sum_{j=1}^k \langle A v_j, v_j \rangle \le \sum_{j=1}^k \lambda_j.
\]
Conclude that
\[
\lambda_n = \inf\{\, \langle Av, v \rangle : \|v\| = 1 \,\} \le \sup\{\, \langle Av, v \rangle : \|v\| = 1 \,\} = \lambda_1.
\]
[Hint: For $x \in \mathbb{R}^n$ and $1 \le k \le n$, the function $g(x) = \sum_{i=1}^k x_i$ (the sum of the $k$ largest entries, when $x$ is arranged in decreasing order) is Schur convex.]

References
Books
Beckenbach, E., and Bellman, R., An Introduction to Inequalities, The Mathematical
Association of America, 1961 (Volume 3 in the New Mathematical Library).
A brief introduction to a handful of classical inequalities. Easy to read and includes full solutions to
the exercises.

Cloud, M. J., and Drachman, B. C., Inequalities: With Applications to Engineering, Springer-Verlag, 1998.
A brief introduction but a relatively broad selection of topics given its size. Includes sections on inner product spaces and matrix inequalities.

Garling, D. J. H., Inequalities: A Journey into Linear Analysis, Cambridge University Press, 2007.
Highly advanced and very detailed.

Hardy, G. H., Littlewood, J. E., and Pólya, G., Inequalities, Cambridge University Press, 2nd ed., 1952.
The bible. But it's beginning to show its age and can be very hard to read in spots. Still a worthwhile reference for any mathematician, in spite of this minor shortcoming.

Horn, Roger A., and Johnson, Charles R., Matrix Analysis, Cambridge University
Press, 1985; reprinted 1990.
Advanced and extremely thorough; the first of two volumes.

Kazarinoff, N., Analytic Inequalities, Dover Publications, 2003 (originally published in 1961 by Holt, Rinehart, and Winston).
A brief introduction that relies heavily on tools from calculus. Its brevity is somewhat compensated for by the wide variety of exercises.

Marcus, M. and Minc, H., A Survey of Matrix Theory and Matrix Inequalities, Dover
Publications, 1992 (originally published in 1969 by Prindle, Weber & Schmidt).
The emphasis here is on survey: Somewhat telegraphic, and hard to read in spots, but packed with
information.

Marshall, A. W., and Olkin, I., Inequalities: Theory of Majorization and its Applications, Academic Press, 1979.
An excellent reference for applications of majorization to probability and statistics.

Steele, J. Michael, The Cauchy-Schwarz Master Class: An Introduction to the Art of Mathematical Inequalities, Cambridge University Press, 2004.
As much an honors seminar as a textbook. Steele approaches each new inequality as a challenge problem. Full of interesting extensions, asides, and fodder for further study, but don't expect a formal textbook. Includes solutions to the exercises.

Webster, Roger, Convexity, Oxford University Press, 1994.
Very detailed and covers a wide range of topics, including sections on the Gamma function, Birkhoff's theorem, and more. In spite of some rough edges, it is an excellent reference. Far more detail than most books at this level. Includes solutions to the exercises.


Articles
Apostol, T. M., An elementary view of Euler's summation formula, The American Mathematical Monthly, 106 (1999), 409–418.
Carleson, L., A proof of an inequality of Carleman, Proceedings of the American Mathematical Society, 5 (1954), 932–933.
Duncan, J., and McGregor, C. M., Carleman's inequality, The American Mathematical Monthly, 110 (2003), 424–431.
Hurlbert, G., A short proof of the Birkhoff-von Neumann theorem, preprint (unpublished), 2008.
Impens, C., Stirling's series made easy, The American Mathematical Monthly, 110 (2003), 730–735.
Redheffer, R., and Volkmann, P., Schur's generalization of Hilbert's inequality, Proceedings of the American Mathematical Society, 87 (1983), 567–568.
Romik, D., Stirling's approximation for n!: the ultimate short proof, The American Mathematical Monthly, 107 (2000), 556–557.
