Introduction to Probability and Random Signals

This is the first draft of my notes on probability. An updated version is available at http://prapun.googlepages.com/.

Prapun Suksompong

School of Electrical and Computer Engineering

Cornell University, Ithaca, NY 14853

ps92@cornell.edu

January 23, 2008

Contents

1 Mathematical Background
  1.1 Set Theory
  1.2 Enumeration / Combinatorics / Counting
  1.3 Dirac Delta Function

2 Classical Probability
  2.1 Examples

3 Probability Foundations
  3.1 Algebra and σ-algebra
  3.2 Kolmogorov's Axioms for Probability
  3.3 Properties of Probability Measure
  3.4 Countable Ω
  3.5 Independence

4 Random Element
  4.1 Random Variable
  4.2 Distribution Function
  4.3 Discrete random variable
  4.4 Continuous random variable
  4.5 Mixed/hybrid Distribution
  4.6 Independence
  4.7 Misc

5 PMF Examples
  5.1 Random/Uniform
  5.2 Bernoulli and Binary distributions
  5.3 Binomial: B(n, p)
  5.4 Geometric: G(β)
  5.5 Poisson Distribution: P(λ)
  5.6 Compound Poisson
  5.7 Hypergeometric
  5.8 Negative Binomial Distribution (Pascal / Pólya distribution)
  5.9 Beta-binomial distribution
  5.10 Zipf or zeta random variable

6 PDF Examples
  6.1 Uniform Distribution
  6.2 Gaussian Distribution
  6.3 Exponential Distribution
  6.4 Pareto: Par(α), a heavy-tailed model/density
  6.5 Laplacian: L(α)
  6.6 Rayleigh
  6.7 Cauchy
  6.8 Weibull

7 Expectation

8 Inequalities

9 Random Vectors
  9.1 Random Sequence

10 Transform Methods
  10.1 Probability Generating Function
  10.2 Moment Generating Function
  10.3 One-Sided Laplace Transform
  10.4 Characteristic Function

  11.1 SISO case
  11.2 MISO case
  11.3 MIMO case
  11.4 Order Statistics

12 Convergences
  12.1 Summation of random variables
  12.2 Summation of independent random variables
  12.3 Summation of i.i.d. random variables
  12.4 Central Limit Theorem (CLT)

  13.1 Conditional Probability
  13.2 Conditional Expectation
  13.3 Conditional Independence

14 Real-valued Jointly Gaussian

A.1 Inequalities
A.2 Summations
A.3 Derivatives
A.4 Integration
A.5 Gamma and Beta functions

1 Mathematical Background

1.1 Set Theory

1.1. Basic Set Identities:

• Double complement (involution): (Ac)c = A

• Commutativity (symmetry): A ∩ B = B ∩ A and A ∪ B = B ∪ A

• Associativity:

◦ A ∩ (B ∩ C) = (A ∩ B) ∩ C

◦ A ∪ (B ∪ C) = (A ∪ B) ∪ C

• Distributivity

◦ A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C)

◦ A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C)

• de Morgan laws

◦ (A ∪ B)c = Ac ∩ B c

◦ (A ∩ B)c = Ac ∪ B c

• A collection of sets (Aα : α ∈ I), with α taking values in an index or label set I, is said to be a partition of Ω if

  (a) Ω = ∪α∈I Aα, and
  (b) for all i ≠ j, Ai ∩ Aj = ∅ (pairwise disjoint) [9, p 9].

  In this case, any set B can be expressed as B = ∪α (B ∩ Aα), where the union is a disjoint union.

• The cardinality (or size) of a collection or set A, denoted |A|, is the number of elements of the collection. This number may be finite or infinite.

• Summary of set identities:

  Associative laws:        A ∩ (B ∩ C) = (A ∩ B) ∩ C;   A ∪ (B ∪ C) = (A ∪ B) ∪ C
  Distributive laws:       A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C);   A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C)
  De Morgan's laws:        (A ∩ B)c = Ac ∪ Bc;   (A ∪ B)c = Ac ∩ Bc
  Complement laws:         A ∩ Ac = ∅;   A ∪ Ac = Ω
  Double complement law:   (Ac)c = A
  Idempotent laws:         A ∩ A = A;   A ∪ A = A
  Absorption laws:         A ∩ (A ∪ B) = A;   A ∪ (A ∩ B) = A
  Dominance laws:          A ∩ ∅ = ∅;   A ∪ Ω = Ω
  Identity laws:           A ∪ ∅ = A;   A ∩ Ω = A

◦ Inclusion-Exclusion Principle:

  |∪_{i=1}^{n} Ai| = Σ_{∅≠I⊂{1,...,n}} (−1)^{|I|+1} |∩_{i∈I} Ai|.

  ◦ |∩_{i=1}^{n} Ai^c| = |Ω| + Σ_{∅≠I⊂{1,...,n}} (−1)^{|I|} |∩_{i∈I} Ai|.

  ◦ If Ai ⊂ B for all i (or, equivalently, ∪_{i=1}^{n} Ai ⊂ B), then

    |∩_{i=1}^{n} (B \ Ai)| = |B| + Σ_{∅≠I⊂{1,...,n}} (−1)^{|I|} |∩_{i∈I} Ai|.

• A set is countable if its elements can be listed in a sequence: a1, a2, . . .. The empty set and finite sets are also said to be countable. By a countably infinite set, we mean a countable set that is not finite.

• N = {1, 2, 3, . . .}; R = (−∞, ∞).

• For a set of sets, to avoid the repeated use of the word "set", we will call it a collection/class/family of sets.

Definition 1.3. Monotone sequences of sets

• The sequence of events (A1, A2, A3, . . .) is a monotone-increasing sequence of events if and only if A1 ⊂ A2 ⊂ A3 ⊂ . . .. In this case,

  ◦ ∪_{i=1}^{n} Ai = An, and
  ◦ lim_{n→∞} An = ∪_{i=1}^{∞} Ai.

  We write An ↗ A when An ⊂ An+1 and ∪_{n=1}^{∞} An = A.

• The sequence of events (B1, B2, B3, . . .) is a monotone-decreasing sequence of events if and only if B1 ⊃ B2 ⊃ B3 ⊃ . . .. In this case,

  ◦ ∩_{i=1}^{n} Bi = Bn, and
  ◦ lim_{n→∞} Bn = ∩_{i=1}^{∞} Bi.

  We write Bn ↘ B when Bn+1 ⊂ Bn and ∩_{n=1}^{∞} Bn = B.

• The indicator function of a set A is defined by

  IA(ω) = 1 if ω ∈ A, and 0 otherwise.

• Alternative notation: 1A.

• A = {ω : IA(ω) = 1}.

• A = B if and only if IA = IB.

• IA∪B(ω) = max(IA(ω), IB(ω)) = IA(ω) + IB(ω) − IA(ω) · IB(ω).

1.5. Suppose ∪_{i=1}^{N} Ai = ∪_{i=1}^{N} Bi for all finite N ∈ N. Then ∪_{i=1}^{∞} Ai = ∪_{i=1}^{∞} Bi.

Proof. To show "⊂", suppose x ∈ ∪_{i=1}^{∞} Ai. Then ∃N0 such that x ∈ A_{N0}. Now, A_{N0} ⊂ ∪_{i=1}^{N0} Ai = ∪_{i=1}^{N0} Bi ⊂ ∪_{i=1}^{∞} Bi. Therefore, x ∈ ∪_{i=1}^{∞} Bi. To show "⊃", use symmetry.

1.6. Let A1, A2, . . . be disjoint sets and define Bn = ∪_{i≥n+1} Ai. Then (1) Bn+1 ⊂ Bn and (2) ∩_{n=1}^{∞} Bn = ∅.

Proof. Bn = ∪_{i≥n+1} Ai = (∪_{i≥n+2} Ai) ∪ An+1 = Bn+1 ∪ An+1. So (1) is true. For (2), consider two cases. (2.1) For an element x ∉ ∪_i Ai, we know that x ∉ Bn for every n and hence x ∉ ∩_{n=1}^{∞} Bn. (2.2) For x ∈ ∪_i Ai, we know that ∃i0 such that x ∈ A_{i0}. Note that x cannot be in any other Ai because the Ai's are disjoint. So x ∉ B_{i0} and therefore x ∉ ∩_{n=1}^{∞} Bn.

1.7. Any countable union can be written as a union of pairwise disjoint sets: Given any sequence of sets Fn, define a new sequence by A1 = F1, and

  An = Fn ∩ F_{n−1}^c ∩ · · · ∩ F1^c = Fn ∩ (∩_{i∈[n−1]} Fi^c) = Fn \ (∪_{i∈[n−1]} Fi)

for n ≥ 2. Then ∪_{i=1}^{∞} Fi = ∪_{i=1}^{∞} Ai, where the union on the RHS is a disjoint union. A numerical sketch of this construction is given below.

Proof. Note that the An are pairwise disjoint. To see this, consider A_{n1}, A_{n2} where n1 ≠ n2. WLOG, assume n1 < n2. This implies that F_{n1}^c gets intersected in the definition of A_{n2}. So,

  A_{n1} ∩ A_{n2} = (F_{n1} ∩ ∩_{i∈[n1−1]} Fi^c) ∩ (F_{n2} ∩ ∩_{i∈[n2−1]} Fi^c) ⊂ F_{n1} ∩ F_{n1}^c = ∅.

Also, for finite N ≥ 1, we have ∪_{n∈[N]} Fn = ∪_{n∈[N]} An (*). To see this, note that (*) is true for N = 1. Suppose (*) is true for N = m and let B = ∪_{n∈[m]} An = ∪_{n∈[m]} Fn. Now, for N = m + 1, by definition, we have

  ∪_{n∈[m+1]} An = (∪_{n∈[m]} An) ∪ A_{m+1} = B ∪ (F_{m+1} ∩ B^c) = (B ∪ F_{m+1}) ∩ Ω = B ∪ F_{m+1} = ∪_{i∈[m+1]} Fi.

So (*) is true for N = m + 1. By induction, (*) is true for all finite N. The extension to ∞ is done via (1.5).

For a finite union, we can modify the above statement by setting Fn = ∅ for n ≥ N. Then An = ∅ for n ≥ N.
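The construction in 1.7 is easy to check numerically. Below is a small Python sketch (the example sets F are made up for illustration) that disjointifies a finite sequence of sets and verifies that the union is preserved and the resulting pieces are pairwise disjoint.

from itertools import combinations

def disjointify(F):
    """Given a list of sets F = [F1, F2, ...], return A = [A1, A2, ...]
    with A_n = F_n minus (F_1 ∪ ... ∪ F_{n-1}); the A_n are pairwise
    disjoint and have the same union as the F_n."""
    A, seen = [], set()
    for Fn in F:
        A.append(set(Fn) - seen)   # A_n = F_n minus everything already covered
        seen |= set(Fn)
    return A

# hypothetical example sets
F = [{1, 2, 3}, {2, 3, 4}, {4, 5}]
A = disjointify(F)
assert set().union(*A) == set().union(*F)                  # same union
assert all(a & b == set() for a, b in combinations(A, 2))  # pairwise disjoint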

By construction, {An : n ∈ N} ⊂ σ({Fn : n ∈ N}). However, in general, it is not true that {Fn : n ∈ N} ⊂ σ({An : n ∈ N}). For example, for a finite union with N = 2, we cannot get F2 back from set operations on A1, A2 because we lost the information about F1 ∩ F2. To create a disjoint union which preserves the information about the overlapping parts of the Fn's, we can define the A's by ∩_{n∈N} Bn, where Bn is Fn or Fn^c. This is done in (1.8). However, this leads to uncountably many Aα, which is why the index α is used there instead of n. The uncountability problem does not occur if we start with a finite union. This is shown in the next result.

1.8. Decomposition:

• Fix sets A1, A2, . . . , An, not necessarily disjoint. Let Π be the collection of all sets of the form

  B = B1 ∩ B2 ∩ · · · ∩ Bn,

where each Bi is either Ai or its complement. There are 2^n of these, say B^(1), B^(2), . . . , B^(2^n). Then,

  (a) Π is a partition of Ω, and
  (b) Π \ {∩_{j∈[n]} Aj^c} is a partition of ∪_{j∈[n]} Aj.

Moreover, any Aj can be expressed as Aj = ∪_{i∈Sj} B^(i) for some Sj ⊂ [2^n]. More specifically, Aj is the union of all B^(i) which are constructed with Bj = Aj. (A small code sketch of this construction follows this list.)

• Fix sets A1, A2, . . ., not necessarily disjoint. Let Π be the collection of all sets of the form B = ∩_{n∈N} Bn, where each Bn is either An or its complement. There are uncountably many of these, hence we will index the B's by α; that is, we write B^(α). Let I be the set of all α after we eliminate all repeated B^(α); note that I can still be uncountable. Then,

  (a) Π = {B^(α) : α ∈ I} is a partition of Ω, and
  (b) Π \ {∩_{n∈N} An^c} is a partition of ∪_{n∈N} An.

Moreover, any Aj can be expressed as Aj = ∪_{α∈Sj} B^(α) for some Sj ⊂ I. Because I is uncountable, in general, Sj can be uncountable. More specifically, Aj is the (possibly uncountable) union of all B^(α) which are constructed with Bj = Aj. The uncountability of Sj can be problematic because it implies that we need an uncountable union to get Aj back.
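For finitely many sets, the partition Π of (1.8) can be generated directly. The following Python sketch (with a hypothetical finite Ω and made-up sets A1, A2) builds the nonempty atoms B = B1 ∩ · · · ∩ Bn and checks that each Aj is recovered as a union of atoms.

from itertools import product

def atoms(Omega, A_list):
    """Return the nonempty sets B = B_1 ∩ ... ∩ B_n, where each B_i is
    A_i or its complement; these atoms partition Omega."""
    out = []
    for choice in product([True, False], repeat=len(A_list)):
        B = set(Omega)
        for take, Ai in zip(choice, A_list):
            B &= Ai if take else (set(Omega) - Ai)
        if B:
            out.append(B)
    return out

Omega = set(range(8))                      # hypothetical sample space
A1, A2 = {0, 1, 2, 3}, {2, 3, 4, 5}
Pi = atoms(Omega, [A1, A2])
assert set().union(*Pi) == Omega           # the atoms cover Omega
# each A_j is a (disjoint) union of atoms
assert A1 == set().union(*[B for B in Pi if B <= A1])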

1.9. Let {Aα : α ∈ I} be a collection of disjoint sets where I is a nonempty index set. For

any set S ⊂ I, deﬁne a mapping g on 2I by g(S) = ∪α∈S Aα . Then, g is a 1:1 function if

and only if none of the Aα ’s is empty.

1.10. Let

  A = {(x, y) ∈ R2 : (x + a1, x + b1) ∩ (y + a2, y + b2) ≠ ∅},

where ai < bi. Then,

  A = {(x, y) ∈ R2 : x + (a1 − b2) < y < x + (b1 − a2)}
    = {(x, y) ∈ R2 : a1 − b2 < y − x < b1 − a2}.

[Figure 2: The region A is the band between the lines y = x + (a1 − b2) and y = x + (b1 − a2).]

1.2 Enumeration / Combinatorics / Counting

1.11. The four kinds of counting problems are:

(a) ordered sampling of r out of n items with replacement: n^r;

(b) ordered sampling of r ≤ n out of n items without replacement: (n)_r;

(c) unordered sampling of r ≤ n out of n items without replacement: C(n, r);

(d) unordered sampling of r out of n items with replacement: C(n + r − 1, r).

1.12. Ordered sampling with replacement: Given a set of n distinct items (symbols), select an ordered sequence (word) of length r drawn from this set with replacement. Let μ_{n,r} denote the number of such sequences; then μ_{n,r} = n^r.

◦ μ_{n,1} = n

◦ μ_{1,r} = 1

◦ μ_{n,r} = n · μ_{n,r−1} for r > 1

◦ Examples:

  ∗ Suppose A is a finite set; then the cardinality of its power set is |2^A| = 2^|A|.

  ∗ There are 2^r binary strings/sequences of length r.

• Ordered sampling without replacement:

  (n)_r = ∏_{i=0}^{r−1} (n − i) = n!/(n − r)! = n · (n − 1) · · · (n − (r − 1))   (r terms),   r ≤ n.

  ◦ This counts ordered sampling of r ≤ n out of n items without replacement.

  ◦ For integers r, n such that r > n, we have (n)_r = 0.

  ◦ The definition in product form (n)_r = ∏_{i=0}^{r−1} (n − i) = n · (n − 1) · · · (n − (r − 1)) can be extended to any real number n and a non-negative integer r. We define (n)_0 = 1. (This makes sense because we usually take the empty product to be 1.)

  ◦ (n)_1 = n

  ◦ (n)_r = (n − (r − 1)) (n)_{r−1}. For example, (7)_5 = (7 − 4)(7)_4.

  ◦ (1)_r = 1 if r = 1, and 0 if r > 1.

• Ratio:

  (n)_r / n^r = ∏_{i=0}^{r−1} (n − i) / n^r = ∏_{i=0}^{r−1} (1 − i/n)
             ≈ ∏_{i=0}^{r−1} e^{−i/n} = e^{−(1/n) Σ_{i=0}^{r−1} i} = e^{−r(r−1)/(2n)}
             ≈ e^{−r^2/(2n)}.

• The number of ways to arrange n ≥ 0 distinct items in a sequence is (n)_n = n!.

• 0! = 1! = 1

• n! = n (n − 1)!

• n! = ∫_0^∞ e^{−t} t^n dt

• Stirling's Formula:

  n! ≈ √(2πn) n^n e^{−n} = √(2πe) e^{(n + 1/2) ln(n/e)}.
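A quick numerical check of Stirling's formula (a Python sketch; the ratios quoted in the comment are approximate):

import math

def stirling(n):
    # sqrt(2*pi*n) * n**n * e**(-n), computed in log-space to avoid overflow
    return math.exp(0.5 * math.log(2 * math.pi * n) + n * math.log(n) - n)

for n in (5, 10, 20):
    exact, approx = math.factorial(n), stirling(n)
    print(n, exact, round(approx, 2), round(approx / exact, 4))
    # the ratio approaches 1 as n grows (about 0.983 at n = 5, 0.996 at n = 20)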


1.14. Binomial coefficient:

  C(n, r) = (n)_r / r! = n! / ((n − r)! r!).

This gives the number of unordered sets of size r drawn from an alphabet of size n without replacement; this is unordered sampling of r ≤ n out of n items without replacement. It is also the number of subsets of size r that can be formed from a set of n elements. Some properties are listed below:

(b) Use combin(n,r) in Mathcad. However, to do symbolic manipulation, use the factorial definition directly.

(c) Reflection property: C(n, r) = C(n, n − r).

(d) C(n, n) = C(n, 0) = 1.

(e) C(n, 1) = C(n, n − 1) = n.

(f) C(n, r) = 0 if n < r or r is a negative integer.

(g) max_r C(n, r) = C(n, ⌊(n + 1)/2⌋).

(h) Pascal's "triangle" rule: C(n, k) = C(n − 1, k) + C(n − 1, k − 1). This property divides the process of choosing k items into two steps, where the first step is to decide whether to choose the first item or not.

(i) Σ_{0≤k≤n, k even} C(n, k) = Σ_{0≤k≤n, k odd} C(n, k) = 2^{n−1}.

There are many ways to show this identity.

  (i) Consider the number of subsets of S = {a1, a2, . . . , an} of n distinct elements. First choose a subset A of the first n − 1 elements of S. There are 2^{n−1} such subsets. Then, for each A, to get a set with an even number of elements, add the element an to A if and only if |A| is odd.

  (ii) Look at the binomial expansion of (x + y)^n with x = 1 and y = −1.

  (iii) For odd n, use the fact that C(n, r) = C(n, n − r).

(j) Σ_{k=0}^{min(n1,n2)} C(n1, k) C(n2, k) = C(n1 + n2, n1) = C(n1 + n2, n2).

  • This property divides the process of choosing items from n1 + n2 items into two steps, where the first step is to choose from the first n1 items.

  • We can replace the min(n1, n2) in the first sum by n1 or n2 if we define C(n, k) = 0 for k > n.

  • In particular, Σ_{r=0}^{n} C(n, r)² = C(2n, n).

(k) Hockey-stick identity:

  Σ_{m=k}^{n} C(m, k) = C(k, k) + C(k+1, k) + · · · + C(n, k) = C(n+1, k+1).

First, we choose whether to choose a1. If so, then we need to choose the remaining k items from the n items a2, . . . , a_{n+1}. Hence, we have the C(n, k) term. Now, suppose we did not choose a1. Then, we still need to choose k + 1 items from the n items a2, . . . , a_{n+1}. We then repeat the same argument on a2 instead of a1.

Equivalently,

  Σ_{k=0}^{r} C(n+k, k) = Σ_{k=0}^{r} C(n+k, n) = C(n+r+1, n+1) = C(n+r+1, r).      (1)

(l) Binomial theorem:

  (x + y)^n = Σ_{r=0}^{n} C(n, r) x^r y^{n−r}.

  • Let x = y = 1; then Σ_{r=0}^{n} C(n, r) = 2^n.

• Entropy function:

  ◦ Binary case (b = 2): H2(p) = −p log2 p − (1 − p) log2(1 − p) for p ∈ [0, 1]. In this case,

    (1/(n+1)) 2^{n H2(r/n)} ≤ C(n, r) ≤ 2^{n H2(r/n)}.

  Hence, C(n, r) ≈ 2^{n H2(r/n)}.

• By repeatedly differentiating with respect to x and then multiplying by x, we have

  ◦ Σ_{r=0}^{n} r C(n, r) x^r y^{n−r} = n x (x + y)^{n−1}, and

  ◦ Σ_{r=0}^{n} r² C(n, r) x^r y^{n−r} = n x (x (n − 1)(x + y)^{n−2} + (x + y)^{n−1}).

For x + y = 1, we have

  ◦ Σ_{r=0}^{n} r C(n, r) x^r (1 − x)^{n−r} = n x, and

  ◦ Σ_{r=0}^{n} r² C(n, r) x^r (1 − x)^{n−r} = n x (n x + 1 − x).

1.16. Multinomial Counting: The multinomial coefficient C(n; n1, n2, . . . , nr) is defined as

  ∏_{i=1}^{r} C(n − Σ_{k<i} nk, ni) = C(n, n1) · C(n − n1, n2) · C(n − n1 − n2, n3) · · · C(nr, nr) = n! / (n1! n2! · · · nr!).

It is the number of ways that we can arrange n = Σ_{i=1}^{r} ni tokens when having r types of symbols and ni indistinguishable copies/tokens of a type-i symbol.

• Multinomial theorem:

  (x1 + . . . + xr)^n = Σ_{i1=0}^{n} Σ_{i2=0}^{n−i1} · · · Σ_{i_{r−1}=0}^{n−Σ_{j<r−1} ij} [ n! / ((n − Σ_{j<r} ij)! ∏_{k<r} ik!) ] x_r^{n−Σ_{j<r} ij} ∏_{k<r} x_k^{ik}.

• r-ary entropy function: Consider any vector p = (p1, p2, . . . , pr) such that pi ≥ 0 and Σ_{i=1}^{r} pi = 1. We define

  H(p) = −Σ_{i=1}^{r} pi log_b pi.

Some standard binomial coefficient identities:

  Factorial expansion:      C(n, k) = n! / (k! (n − k)!),   k = 0, 1, . . . , n
  Symmetry:                 C(n, k) = C(n, n − k),   k = 0, 1, . . . , n
  Monotonicity:             C(n, 0) < C(n, 1) < · · · < C(n, ⌊n/2⌋),   n ≥ 0
  Pascal's identity:        C(n, k) = C(n − 1, k − 1) + C(n − 1, k),   k = 0, 1, . . . , n
  Binomial theorem:         (x + y)^n = Σ_{k=0}^{n} C(n, k) x^k y^{n−k},   n ≥ 0
  Counting all subsets:     Σ_{k=0}^{n} C(n, k) = 2^n,   n ≥ 0
  Even and odd subsets:     Σ_{k=0}^{n} (−1)^k C(n, k) = 0,   n ≥ 0
  Sum of squares:           Σ_{k=0}^{n} C(n, k)² = C(2n, n),   n ≥ 0
  Square of row sums:       (Σ_{k=0}^{n} C(n, k))² = Σ_{k=0}^{2n} C(2n, k),   n ≥ 0
  Absorption/extraction:    C(n, k) = (n/k) C(n − 1, k − 1),   k ≠ 0
  Trinomial revision:       C(n, m) C(m, k) = C(n, k) C(n − k, m − k),   0 ≤ k ≤ m ≤ n
  Parallel summation:       Σ_{k=0}^{m} C(n + k, k) = C(n + m + 1, m),   m, n ≥ 0
  Diagonal summation:       Σ_{k=0}^{n−m} C(m + k, m) = C(n + 1, m + 1),   n ≥ m ≥ 0
  Vandermonde convolution:  Σ_{k=0}^{r} C(m, k) C(n, r − k) = C(m + n, r),   m, n, r ≥ 0
  Diagonal sums in Pascal's triangle (§2.3.2): Σ_{k=0}^{⌊n/2⌋} C(n − k, k) = F_{n+1} (Fibonacci numbers),   n ≥ 0

  Other common identities:
  Σ_{k=0}^{n} k C(n, k) = n 2^{n−1},   n ≥ 0
  Σ_{k=0}^{n} k² C(n, k) = n(n + 1) 2^{n−2},   n ≥ 0
  Σ_{k=0}^{n} (−1)^k k C(n, k) = 0,   n ≥ 0
  Σ_{k=0}^{n} C(n, k)/(k + 1) = (2^{n+1} − 1)/(n + 1),   n ≥ 0
  Σ_{k=0}^{n} (−1)^k C(n, k)/(k + 1) = 1/(n + 1),   n ≥ 0
  Σ_{k=1}^{n} (−1)^{k−1} C(n, k)/k = 1 + 1/2 + 1/3 + · · · + 1/n,   n > 0
  Σ_{k=0}^{n−1} C(n, k) C(n, k + 1) = C(2n, n − 1),   n > 0
  Σ_{k=0}^{m} C(m, k) C(n, p + k) = C(m + n, m + p),   m, n, p ≥ 0, n ≥ p + m

As a special case, let pi = ni/n; then

  C(n; n1, n2, . . . , nr) = n! / ∏_{i=1}^{r} ni! ≈ 2^{n H2(p)},

where H2(p) denotes the entropy computed with base-2 logarithms.

• The number of solutions to x1 + x2 + · · · + xn = k, where the xi's are nonnegative integers, is C(k+n−1, k) = C(k+n−1, n−1).

(a) Suppose we further require that the xi are strictly positive (xi ≥ 1); then there are C(k−1, n−1) solutions.

(b) Extra lower-bound requirement: Suppose we further require that xi ≥ ai, where the ai are some given nonnegative integers; then the number of solutions is C(k − (a1 + a2 + · · · + an) + n − 1, n − 1). Note that here we work with the equivalent problem y1 + y2 + · · · + yn = k − Σ_{i=1}^{n} ai, where yi ≥ 0.

(c) Extra upper-bound requirement: Suppose we further require that 0 ≤ xi < bi. Let Ai be the set of solutions such that xi ≥ bi and xj ≥ 0 for j ≠ i. The number of solutions is C(k+n−1, n−1) − |∪_{i=1}^{n} Ai|, where the second term can be found via the inclusion-exclusion principle

  |∪_{i∈[n]} Ai| = Σ_{∅≠I⊂[n]} (−1)^{|I|+1} |∩_{i∈I} Ai|

and the fact that, for any index set I ⊂ [n], we have |∩_{i∈I} Ai| = C(k − Σ_{i∈I} bi + n − 1, n − 1).

(d) Extra range requirement: Suppose we further require that ai ≤ xi < bi, where 0 ≤ ai < bi; then we work instead with yi = xi − ai. The number of solutions is

  C(k − Σ_{i=1}^{n} ai + n − 1, n − 1) + Σ_{∅≠I⊂[n]} (−1)^{|I|} C(k − Σ_{i=1}^{n} ai − Σ_{i∈I} (bi − ai) + n − 1, n − 1).

• Consider the distribution of r = 10 indistinguishable balls into n = 5 distinguishable cells. Then we are only concerned with the number of balls in each cell. Using n − 1 = 4 bars, we can divide the r = 10 stars into n = 5 groups. For example, ****|***||**|* would mean (4, 3, 0, 2, 1). In general, there are C(n+r−1, r) ways of arranging the bars and stars.

• There are C(n+r−1, r) distinct vectors x = (x1, . . . , xn) of nonnegative integers such that x1 + x2 + · · · + xn = r. We use n − 1 bars to separate r 1's.

• Suppose r letters are drawn with replacement from a set {a1, a2, . . . , an}. Given a drawn sequence, let xi be the number of ai's in the drawn sequence. Then there are C(n+r−1, r) possible vectors x = (x1, . . . , xn).
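The counts above, including the upper-bound case handled by inclusion-exclusion, can be checked by brute force for small instances. The following Python sketch uses made-up values n = 3, k = 10, and bounds (4, 5, 6):

from math import comb
from itertools import combinations, product

def count_bounded(k, bounds):
    """Number of nonnegative integer solutions of x1+...+xn = k with x_i < b_i,
    by inclusion-exclusion over the sets A_i = {solutions with x_i >= b_i}."""
    n = len(bounds)
    total = 0
    for m in range(n + 1):
        for I in combinations(range(n), m):
            rem = k - sum(bounds[i] for i in I)
            if rem >= 0:
                total += (-1) ** m * comb(rem + n - 1, n - 1)
    return total

n, k, bounds = 3, 10, (4, 5, 6)
brute = sum(1 for x in product(range(max(bounds)), repeat=n)
            if sum(x) == k and all(xi < bi for xi, bi in zip(x, bounds)))
assert count_bounded(k, bounds) == brute
# unconstrained stars-and-bars count
assert comb(k + n - 1, n - 1) == sum(1 for x in product(range(k + 1), repeat=n)
                                     if sum(x) == k)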

objects — number of objects (reference)

Arrangements of n objects:
• n distinct objects (in a sequence): n! = P(n, n) = n(n − 1) · · · 2 · 1   (§2.3.1)
• k out of n distinct objects (ordered): P(n, k) = n(n − 1) · · · (n − k + 1)   (§2.3.1)
• some of the n objects are identical: k1 of a first kind, k2 of a second kind, . . . , kj of a jth kind, where k1 + k2 + · · · + kj = n: C(n; k1, k2, . . . , kj) = n!/(k1! k2! · · · kj!)   (§2.3.2)
• none of the n objects remains in its original place (derangements): Dn = n! (1 − 1/1! + · · · + (−1)^n/n!)   (§2.4.2)

Arranging objects in a circle (where rotations, but not reflections, are equivalent):
• n distinct objects: (n − 1)!   (§2.2.1)
• k out of n distinct objects: P(n, k)/k   (§2.2.1)

Selecting k of n distinct objects:
• order matters, no repetitions: P(n, k) = n!/(n − k)!   (§2.3.1)
• order does not matter, no repetitions: C(n, k) = n!/(k!(n − k)!)   (§2.3.2)
• order does not matter, repetitions allowed: C^R(n, k) = C(k + n − 1, k)   (§2.3.3)

objects — number of objects (reference)

Subsets:
• of size k from a set of size n: C(n, k)   (§2.3.2)
• of all sizes from a set of size n: 2^n   (§2.3.4)
• of {1, . . . , n}, without consecutive elements: F_{n+2}   (§3.1.2)

Placing n objects into k cells:
• distinct objects into distinct cells: k^n   (§2.2.1)
• distinct objects into distinct cells, no cell empty: {n k} k!   (§2.5.2)
• distinct objects into identical cells: {n 1} + {n 2} + · · · + {n k} (= Bn when k = n)   (§2.5.2)
• distinct objects into identical cells, no cell empty: {n k}   (§2.5.2)
• distinct objects into distinct cells, with ki in cell i, where k1 + k2 + · · · + kj = n: C(n; k1, k2, . . . , kj)   (§2.3.2)
• identical objects into distinct cells: C(n + k − 1, n)   (§2.3.3)
• identical objects into distinct cells, no cell empty: C(n − 1, k − 1)   (§2.3.3)
• identical objects into identical cells: pk(n)   (§2.5.1)
• identical objects into identical cells, no cell empty: pk(n) − p_{k−1}(n)   (§2.5.1)
• placing n distinct objects into k nonempty cycles: [n k]   (§2.5.2)

Solutions to x1 + · · · + xn = k:
• nonnegative integers: C(k+n−1, k) = C(k+n−1, n−1)   (§2.3.3)
• positive integers: C(k−1, n−1)   (§2.3.3)
• integers where 0 ≤ ai ≤ xi for all i: C(k − (a1 + · · · + an) + n − 1, n − 1)   (§2.3.3)
• integers where 0 ≤ xi ≤ ai for one or more i: inclusion/exclusion principle   (§2.4.2)

objects — number of objects (reference)

Functions from a k-element set to an n-element set:
• all functions: n^k   (§2.2.1)
• one-to-one functions (n ≥ k): n(n − 1) · · · (n − k + 1) = n!/(n − k)! = P(n, k)   (§2.2.1)
• onto functions (n ≤ k): inclusion/exclusion   (§2.4.2)
• partial functions: C(k, 0) + C(k, 1) n + C(k, 2) n² + · · · + C(k, k) n^k = (n + 1)^k   (§2.3.2)

Bit strings of length n:
• all strings: 2^n   (§2.2.1)
• with given entries in k positions: 2^{n−k}   (§2.2.1)
• with exactly k 0s: C(n, k)   (§2.3.2)
• with at least k 0s: C(n, k) + C(n, k+1) + · · · + C(n, n)   (§2.3.2)
• with equal numbers of 0s and 1s: C(n, n/2)   (§2.3.2)
• palindromes: 2^{⌈n/2⌉}   (§2.2.1)
• with an even number of 0s: 2^{n−1}   (§2.3.4)
• without consecutive 0s: F_{n+2}   (§3.1.2)

Notation used in the tables above:
b(n, k): associated Stirling number of the second kind
Cn = (1/(n+1)) C(2n, n): Catalan number
C(n, k) = n!/(k!(n − k)!): binomial coefficient
d(n, k): associated Stirling number of the first kind
En: Euler number
Fn: Fibonacci number
P(n, k) = n!/(n − k)!: k-permutation (falling power)
p(n): number of partitions of n
pk(n): number of partitions of n into at most k summands
p*k(n): number of partitions of n into exactly k summands
[n k]: Stirling cycle number
{n k}: Stirling subset number
ϕ: Euler phi-function

1.3 Dirac Delta Function

The (Dirac) delta function or (unit) impulse function is denoted by δ(t). It is usually depicted as a vertical arrow at the origin. Note that δ(t) is not a true function; it is undefined at t = 0. We define δ(t) as a generalized function which satisfies the sampling property (or sifting property)

  ∫ φ(t) δ(t) dt = φ(0)

for any function φ(t) which is continuous at t = 0. From this definition, it follows that

  (δ ∗ φ)(t) = (φ ∗ δ)(t) = ∫ φ(τ) δ(t − τ) dτ = φ(t).

Intuitively, we may visualize δ(t) as an infinitely tall, infinitely narrow rectangular pulse of unit area: δ(t) = lim_{ε→0} (1/ε) 1[|t| ≤ ε/2].

We list some interesting properties of δ(t) here.

• δ(t) = 0 for t ≠ 0. More generally, δ(t − T) = 0 for t ≠ T.

• ∫_A δ(t) dt = 1A(0). In particular,

  (a) ∫ δ(t) dt = 1.

  (b) ∫_{{0}} δ(t) dt = 1.

  (c) ∫_{−∞}^{x} δ(t) dt = 1_{[0,∞)}(x). Hence, we may think of δ(t) as the "derivative" of the unit step function U(t) = 1_{[0,∞)}(t).

• ∫ φ(t) δ(t) dt = φ(0) for φ continuous at 0.

• ∫ φ(t) δ(t − T) dt = φ(T) for φ continuous at T. In fact, for any ε > 0,

  ∫_{T−ε}^{T+ε} φ(t) δ(t − T) dt = φ(T).

• δ(at) = (1/|a|) δ(t).

• δ(t − t1) ∗ δ(t − t2) = δ(t − (t1 + t2)).

• Fourier properties:

  ◦ Fourier series: δ(x − a) = 1/(2π) + (1/π) Σ_{n=1}^{∞} cos(n(x − a)) on [−π, π].

  ◦ Fourier transform: δ(t) = ∫ e^{j2πft} df.

• If g has only simple real roots ti, then

  δ(g(t)) = Σ_i δ(t − ti) / |g′(ti)|                     (2)

[3, p 387]. Hence,

  ∫ f(t) δ(g(t)) dt = Σ_{x: g(x)=0} f(x) / |g′(x)|.       (3)

Note that the (Dirac) delta function is to be distinguished from the discrete-time Kronecker delta function.

As a finite measure, δ is a unit mass at 0; that is, for any set A, we have δ(A) = 1[0 ∈ A]. In this case, we have again ∫ g dδ = ∫ g(x) δ(dx) = g(0) for any measurable g.

For a function g : D → R^n where D ⊂ R^n,

  δ(g(x)) = Σ_{z: g(z)=0} δ(x − z) / |det dg(z)|          (4)

[3, p 387].
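As a numerical sanity check of the sifting property, one can replace δ by the narrow unit-area pulse (1/ε)1[|t| ≤ ε/2] from above and integrate against a smooth φ. A small Python sketch (the grid size and ε are arbitrary choices):

import numpy as np

def sift(phi, T=0.0, eps=1e-2, a=-1.0, b=1.0, N=200_001):
    """Approximate the integral of phi(t) * delta_eps(t - T) over [a, b],
    where delta_eps is a unit-area rectangular pulse of width eps."""
    t = np.linspace(a, b, N)
    delta_eps = (np.abs(t - T) <= eps / 2) / eps
    return np.trapz(phi(t) * delta_eps, t)

phi = np.cos
print(sift(phi, T=0.0))   # ≈ phi(0) = 1
print(sift(phi, T=0.5))   # ≈ phi(0.5) ≈ 0.8776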

2 Classical Probability

Classical probability, which is based upon the ratio of the number of outcomes favorable to

the occurrence of the event of interest to the total number of possible outcomes, provided

most of the probability models used prior to the 20th century. Classical probability remains

of importance today and provides the most accessible introduction to the more general

theory of probability.

Given a finite sample space Ω, the classical probability of an event A is

  P(A) = |A| / |Ω| = (the number of cases favorable to the occurrence of the event) / (the total number of possible cases).

• In this section, we are more apt to refer to equipossible cases as ones selected at

random. Probabilities can be evaluated for events whose elements are chosen at

random by enumerating the number of elements in the event.

• Equipossibility of the outcomes typically rests on either

  ◦ physical symmetry of the mechanism generating the outcomes (e.g., a die having a cubical shape), or

  ◦ a balance of information or knowledge concerning the various possible outcomes.

• Equipossibility is meaningful only for finite sample spaces, and, in this case, the evaluation of probability is accomplished through the definition of classical probability.

2.1. Basic properties of classical probability:

• P (A) ≥ 0

• P (Ω) = 1

• P (∅) = 0


• P (Ac ) = 1 − P (A)

• A ⊥ B is equivalent to P (A ∩ B) = 0.

• A ⊥ B ⇒ P (A ∪ B) = P (A) + P (B)

• Suppose Ω = {ω1, . . . , ωn} and P({ωi}) = 1/n. Then P(A) = Σ_{ω∈A} P({ω}) = |A|/n.

  ◦ The probability of an event is equal to the sum of the probabilities of its component outcomes because outcomes are mutually exclusive.

2.2. The conditional probability of event A, given that event B ≠ ∅ occurred, is given by

  P(A|B) = |A ∩ B| / |B| = P(A ∩ B) / P(B).      (5)

• It is the updated probability of the event A given that we now know that B occurred.

• P (A|B) = P (A ∩ B|B) ≥ 0

P (Ω|B) = P (B|B) = 1.

• If A ⊥ C, P (A ∪ C |B ) = P (A |B ) + P (C |B )

• P (A ∩ B) = P (B)P (A|B)

• P (A ∩ B) ≤ P (A|B)

• P (A ∩ B) = P (A) × P (B|A)

• P (A ∩ B ∩ C) = P (A ∩ B) × P (C|A ∩ B)

2.3. Total Probability and Bayes' Theorem: If {B1, . . . , Bn} is a partition of Ω, then for any set A,

• Total Probability Theorem: P(A) = Σ_{i=1}^{n} P(A|Bi) P(Bi).

• Bayes' Theorem: Suppose P(A) > 0; then P(Bk|A) = P(A|Bk) P(Bk) / Σ_{i=1}^{n} P(A|Bi) P(Bi).

2.4. Independent Events: A and B are independent (denoted A ⊥⊥ B) if and only if

  P(A ∩ B) = P(A) P(B),      (6)

which, in the classical setting, is equivalent to |A ∩ B| · |Ω| = |A| · |B|.

• Sometimes the definition of independence above does not agree with the everyday-language use of the word "independence". Hence, many authors use the term "statistical independence" for the definition above to distinguish it from other notions.

2.5. Having three pairwise independent events does not imply that the three events are jointly independent. In other words,

  A ⊥⊥ B, B ⊥⊥ C, and A ⊥⊥ C do not imply that A, B, C are jointly independent.

Example: Experiment of flipping a fair coin twice. Ω = {HH, HT, TH, TT}. Define event A to be the event that the first flip gives an H; that is, A = {HH, HT}. Event B is the event that the second flip gives an H; that is, B = {HH, TH}. C = {HH, TT}. Note also that even though the events A and B are not disjoint, they are independent.

2.6. Consider Ω of size 2n with equally likely outcomes. We are given a set A ⊂ Ω of size n. Then P(A) = 1/2. We want to find all sets B ⊂ Ω such that A ⊥⊥ B, i.e., such that P(A ∩ B) = P(A) P(B).

Let r = |A ∩ B|. Then r can be any integer from 0 to n. Also, let k = |B \ A|. Then the condition for independence becomes

  r/(2n) = (1/2) · (r + k)/(2n),

which is equivalent to r = k. So the construction of the set B is given by choosing r elements from the set A, then choosing k = r elements from the set Ω \ A. There are C(n, r) choices for the first part and C(n, k) = C(n, r) choices for the second part. Therefore, the total number of possible B such that A ⊥⊥ B is

  Σ_{r=0}^{n} C(n, r)² = C(2n, n).

2.1 Examples

2.7. Chevalier de Mere’s Scandal of Arithmetic:

Which is more likely, obtaining at least one six in 4 tosses of a fair die (event

A), or obtaining at least one double six in 24 tosses of a pair of dice (event B).

We have

  P(A) = 1 − (5/6)^4 ≈ 0.518

and

  P(B) = 1 − (35/36)^24 ≈ 0.491.

Therefore, the first case is more probable.
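A Monte Carlo check of the two probabilities (a Python sketch; the number of trials is an arbitrary choice):

import random

def estimate(trials=200_000):
    # event A: at least one six in 4 rolls of one die
    a = sum(any(random.randint(1, 6) == 6 for _ in range(4))
            for _ in range(trials)) / trials
    # event B: at least one double six in 24 rolls of a pair of dice
    b = sum(any(random.randint(1, 6) == 6 and random.randint(1, 6) == 6
                for _ in range(24))
            for _ in range(trials)) / trials
    return a, b   # roughly 0.518 and 0.491

print(estimate())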

2.8. A random sample of size r with replacement is taken from a population of n elements. The probability of the event that in the sample no element appears twice (that is, no repetition in our sample) is

  (n)_r / n^r.

The probability that at least one element appears twice is

  pu(n, r) = 1 − ∏_{i=1}^{r−1} (1 − i/n) ≈ 1 − e^{−r(r−1)/(2n)}.

More precisely,

  e^{−(r(r−1)/(2n)) · (3n+2r−1)/(3n)} ≤ ∏_{i=1}^{r−1} (1 − i/n) ≤ e^{−r(r−1)/(2n)}.

The sample size needed so that the probability of at least one repetition is about p is

  r ≈ 1/2 + (1/2) √(1 − 8n ln(1 − p)).

• Probability of coincidence birthday: the probability that at least two people share a birthday in a group of r people is

  pu(365, r) = 1,   if r > 365,
  pu(365, r) = 1 − (365/365) · (364/365) · · · ((365 − (r − 1))/365)   (r factors),   if 0 ≤ r ≤ 365.

  ◦ Birthday Paradox: In a group of 23 randomly selected people, the probability that at least two will share a birthday (assuming birthdays are equally likely to occur on any given day of the year) is about 0.5. See also (3).
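A short computation of pu(n, r) and of the exponential approximation (a Python sketch):

import math

def p_repeat(n, r):
    """Probability of at least one repetition when r samples are drawn
    uniformly with replacement from n values (exact product form)."""
    if r > n:
        return 1.0
    prod = 1.0
    for i in range(r):
        prod *= (n - i) / n
    return 1 - prod

n = 365
for r in (10, 23, 50):
    print(r, round(p_repeat(n, r), 4),
          round(1 - math.exp(-r * (r - 1) / (2 * n)), 4))
# r = 23 gives about 0.507 exactly and about 0.500 from the approximation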

2.9. Monty Hall's Game: The game starts by showing a contestant 3 closed doors, behind one of which is a prize. The contestant selects a door, but before that door is opened, Monty Hall, who knows which door hides the prize, opens one of the remaining doors (one without the prize). The contestant is then allowed either to stay with his original guess or to change to the other closed door.

Question: is it better to stay or to switch? Answer: Switch. Given that the contestant switches, the probability that he wins the prize is 2/3.
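A simulation of the game (a Python sketch) confirms that switching wins with probability about 2/3 while staying wins with probability about 1/3:

import random

def monty(trials=100_000):
    switch_wins = stay_wins = 0
    for _ in range(trials):
        prize = random.randrange(3)
        pick = random.randrange(3)
        # host opens a door that is neither the pick nor the prize
        opened = next(d for d in range(3) if d != pick and d != prize)
        switched = next(d for d in range(3) if d != pick and d != opened)
        stay_wins += (pick == prize)
        switch_wins += (switched == prize)
    return stay_wins / trials, switch_wins / trials

print(monty())   # roughly (0.333, 0.667)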

[Figure 9: pu(n, r) for n = 365 plotted against r, together with the approximation 1 − e^{−r(r−1)/(2n)}; the curve crosses 0.5 near r = 23. A second panel shows, for several fixed values of p, the required r as a function of n.]

2.10. False Positives on Diagnostic Tests: Let D be the event that the testee has the

disease. Let + be the event that the test returns a positive result. Denote the probability

of having the disease by pD .

Now, assume that the test always returns a positive result when the testee has the disease. This is equivalent to P(+|D) = 1 and P(+c|D) = 0. Also, suppose that even when the testee does not have the disease, the test will still return a positive result with probability p+; that is, P(+|Dc) = p+.

If the test returns a positive result, then the probability that the testee has the disease is

  P(D|+) = P(D ∩ +)/P(+) = pD / (pD + p+ (1 − pD)) = 1 / (1 + (p+/pD)(1 − pD))
         ≈ pD / p+   for a rare disease (pD ≪ 1).

            D                               Dc                                 total
  +         P(+ ∩ D) = P(+|D)P(D)           P(+ ∩ Dc) = P(+|Dc)P(Dc)           P(+) = P(+ ∩ D) + P(+ ∩ Dc)
              = 1 · pD = pD                   = p+ (1 − pD)                      = pD + p+ (1 − pD)
  +c        P(+c ∩ D) = P(+c|D)P(D)         P(+c ∩ Dc) = P(+c|Dc)P(Dc)         P(+c) = P(+c ∩ D) + P(+c ∩ Dc)
              = 0 · pD = 0                    = (1 − p+)(1 − pD)                 = (1 − p+)(1 − pD)
  total     P(D) = pD                       P(Dc) = 1 − pD                     P(Ω) = 1
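A numerical illustration of P(D|+) for a rare disease, using made-up values pD = 0.001 and p+ = 0.05 (a Python sketch):

def posterior(p_d, p_plus):
    """P(D | +) when P(+|D) = 1 and P(+|D^c) = p_plus."""
    return p_d / (p_d + p_plus * (1 - p_d))

p_d, p_plus = 0.001, 0.05
print(posterior(p_d, p_plus))   # ≈ 0.0196: a positive test still rarely means disease
print(p_d / p_plus)             # ≈ 0.02: the rare-disease approximation pD / p+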

3 Probability Foundations

To study formal deﬁnition of probability, we start with the probability space (Ω, A, P ). Let

Ω be an arbitrary space or set of points ω. Viewed probabilistically, a subset of Ω is an

event and an element ω of Ω is a sample point. Each event is a collection of outcomes

which are elements of the sample space Ω.

The theory of probability focuses on collections of events, called event algebras and

typically denoted A (or F) that contain all the events of interest (regarding the random

experiment E) to us, and are such that we have knowledge of their likelihood of occurrence.


The probability P itself is deﬁned as a number in the range [0, 1] associated with each event

in A.

The class 2Ω of all subsets can be too large1 for us to deﬁne probability measures with

consistency, across all member of the class. In this section, we present smaller classes

which have several “nice” properties.

Definition 3.1. [7, Def 1.6.1 p 38] An event algebra A is a collection of subsets of the sample space Ω that is nonempty and closed under complementation and finite union. In other words, "a class is called an algebra on Ω if it contains Ω itself and is closed under the formation of complements and finite unions."

Example 3.2. Let Ω = (0, 1] and let B0 be the collection of all finite unions of intervals of the form (a, b] ⊂ (0, 1]. Then B0 is an algebra, but it is not a σ-field: consider, for example, the countable union ∪_{i=1}^{∞} (1/2^{i+1}, 1/2^i], which is not a finite union of such intervals.

3.3. Properties of an event algebra A:

(a) Nonempty: ∅ ∈ A, Ω ∈ A.

(b) A ⊂ 2^Ω.

(c) A ∈ A ⇒ Ac ∈ A.

(d) A, B ∈ A ⇒ A ∪ B ∈ A, A ∩ B ∈ A, A \ B ∈ A, AΔB = (A \ B) ∪ (B \ A) ∈ A. More generally, A1, A2, . . . , An ∈ A ⇒ ∪_{i=1}^{n} Ai ∈ A and ∩_{i=1}^{n} Ai ∈ A.

(e) Let A1, A2 be algebras of subsets of Ω and let A = A1 ∩ A2 be the collection of sets common to both algebras. Then A is an algebra.

1

There is no problem when Ω is countable.

(f) The largest event algebra is the collection of all subsets of Ω, known as the power set and denoted by 2^Ω.

(g) Cardinality of algebras: an algebra of subsets of a finite set of n elements always has a cardinality of the form 2^k, k ≤ n.

3.4. There is a smallest (in the sense of inclusion) algebra containing any given family C of subsets of Ω. For C ⊂ 2^Ω, the algebra generated by C is

  ∩ {G : G is an algebra and C ⊂ G},

i.e., the intersection of all algebras containing C. It is the smallest algebra containing C.

Definition 3.5. A σ-algebra A is an event algebra that is also closed under countable unions:

  (∀i ∈ N) Ai ∈ A ⟹ ∪_{i∈N} Ai ∈ A.

Remarks:

• A σ-algebra is also an algebra.

• A ﬁnite algebra is also a σ-algebra.

3.6. Because every σ-algebra is also an algebra, it has all the properties listed in (3.3).

Extra properties of σ-algebra A are as followed:

(a) A1, A2, . . . ∈ A ⇒ ∪_{j=1}^{∞} Aj ∈ A and ∩_{j=1}^{∞} Aj ∈ A.

(c) The collection of σ-algebras in Ω is closed under arbitrary intersection: let Aα be a σ-algebra for every α ∈ A, where A is some index set, potentially uncountable. Then ∩_{α∈A} Aα is still a σ-algebra.

(e) If A and B are σ-algebras in X, it is not necessarily the case that A ∪ B is also a σ-algebra. For example, let E1, E2 ⊂ X be distinct and not complements of one another, and let Ai = {∅, Ei, Ei^c, X}. Then E1, E2 ∈ A1 ∪ A2 but E1 ∪ E2 ∉ A1 ∪ A2.

Definition 3.7. Generation of σ-algebras: For C ⊂ 2^Ω, the σ-algebra generated by C, denoted σ(C), is

  ∩ {G : G is a σ-field and C ⊂ G},

i.e., the intersection of all σ-fields containing C. It is the smallest σ-field containing C.

• If the set Ω is not implicit, we will explicitly write σX (C)

(b) σ (C) is the smallest σ-ﬁeld containing C in the sense that if H is a σ-ﬁeld and

C ⊂ H, then σ (C) ⊂ H

(c) C ⊂ σ (C)

(h) σ ({A, B}) has at most 16 elements. They are ∅, A, B, A∩B, A∪B, A\B, B \A, AΔB,

and their corresponding complements. See also (3.11). Some of these 16 sets can be

the same and hence the use of “at-most” in the statement.

3.9. For the decomposition described in (1.8), let the starting collection be C1 and the decomposed collection be C2. Then σ(C1) = σ(C2).

3.10. A σ-algebra smaller than 2^Ω can be constructed by first partitioning Ω into subsets and then forming the power set of these subsets, with an individual subset now playing the role of an individual outcome ω.

Given a countable² partition Π = {Ai, i ∈ I} of Ω, we can form a σ-algebra A by including all unions of subcollections:

  A = {∪_{α∈S} Aα : S ⊂ I}      (7)

[7, Ex 1.5 p 39], where we define ∪_{i∈∅} Ai = ∅. It turns out that A = σ(Π). Of course, a σ-algebra is also an algebra. Hence, (7) is also a way to construct an algebra. Note that, from (1.9), the necessary and sufficient condition for distinct S to produce distinct elements in (7) is that none of the Aα's are empty.

In particular, for countable Ω = {xi : i ∈ N}, where the xi's are distinct, if we want a σ-algebra which contains {xi} for every i, then the smallest one is 2^Ω, which happens to be the biggest one. So 2^Ω is the only σ-algebra which is "reasonable" to use.

² In this case, Π is countable if and only if I is countable.

3.11. Generation of a σ-algebra from a finite partition: If a finite collection Π = {Ai, i ∈ I} of non-empty sets forms a partition of Ω, then the algebra generated by Π is the same as the σ-algebra generated by Π, and it is given by (7). Moreover, |σ(Π)| = 2^{|I|} = 2^{|Π|}; that is, distinct sets S in (7) produce distinct members of the (σ-)algebra.

Therefore, given a finite collection of sets C = {C1, C2, . . . , Cn}, to find the algebra or σ-algebra generated by C, the first step is to use the Ci's to create a partition of Ω. Using (1.8), the partition is given by

  Π = {B1 ∩ B2 ∩ · · · ∩ Bn : Bi = Ci or Bi = Ci^c}.      (8)

By (3.9), we know that σ(Π) = σ(C). Note that there are seemingly 2^n sets in Π; however, some of them can be ∅. We can eliminate the empty set(s) from Π and it is still a partition. So the cardinality of Π in (8) (after empty-set elimination) is k, where k is at most 2^n. The partition Π is then a collection of k sets which can be renamed A1, . . . , Ak, all of which are non-empty. Applying the construction in (7) (with I = [k]), we then have σ(C), whose cardinality is 2^k ≤ 2^{2^n}. See also properties (3.8.7) and (3.8.8).

Definition 3.12. In general, the Borel σ-algebra or Borel algebra B is the σ-algebra generated by the open subsets of Ω.

• We call BΩ the σ-algebra of Borel subsets of Ω, or the σ-algebra of Borel sets on Ω.

(a) On Ω = R, the σ-algebra generated by any of the following classes is the Borel σ-algebra:

  (ii) closed sets;
  (iii) intervals;
  (iv) open intervals;
  (v) closed intervals;
  (vi) intervals of the form (−∞, a], where a ∈ R.
      i. Can replace R by Q.
      ii. Can replace (−∞, a] by (−∞, a), [a, +∞), or (a, +∞).
      iii. Can replace (−∞, a] by combinations of those in 1(f)ii.

(c) The Borel σ-algebra on the extended real line is the extended σ-algebra generated by, e.g., any of the following classes (here a, b range over R and â, b̂ over the extended reals):

  (ii) intervals (â, b̂), (â, b̂], [â, b̂];
  (iii) {[−∞, b]}, {[−∞, b)}, {(a, +∞]}, {[a, +∞]};
  (iv) {[−∞, b̂]}, {[−∞, b̂)}, {(â, +∞]}, {[â, +∞]}.

On Ω = R^k, the Borel σ-algebra is generated by any of the following classes:

  (ii) the class of closed sets in R^k;
  (iii) the class of bounded semi-open rectangles (cells) I of the form I = {x ∈ R^k : ai < xi ≤ bi, i = 1, . . . , k}; note that I = ⊗_{i=1}^{k} (ai, bi], where ⊗ denotes the Cartesian product ×;
  (iv) the class of "southwest regions" Sx of points "southwest" of x ∈ R^k, i.e., Sx = {y ∈ R^k : yi ≤ xi, i = 1, . . . , k}.

3.13. The Borel σ-algebra B of subsets of the reals is the usual algebra when we deal

with real- or vector-valued quantities.

Our needs will not require us to delve into these issues beyond being assured that

events we discuss are constructed out of intervals and repeated set operations on these

intervals and these constructions will not lead us out of B. In particular, countable unions

of intervals, their complements, and much, much more are in B.

Definition 3.14. Kolmogorov's Axioms for Probability [12]: A set function satisfying K0–K4 is called a probability measure.

K0 Setup: a random experiment is described by a probability space (Ω, A, P) consisting of an event σ-algebra A and a real-valued set function P : A → R.

K1 Nonnegativity: ∀A ∈ A, P(A) ≥ 0.

K2 Unit normalization: P(Ω) = 1.

K3 Finite additivity: If A, B are disjoint, then P (A ∪ B) = P (A) + P (B).

K4 Monotone continuity: If (∀i > 1) Ai+1 ⊂ Ai and ∩i∈N Ai = ∅ (a nested series of sets

shrinking to the empty set), then

lim P (Ai ) = 0.

i→∞

3.15. Countable (σ-)additivity: if A1, A2, . . . is a sequence of pairwise disjoint (non-overlapping) events, then

  P(∪_{i=1}^{∞} Ai) = Σ_{i=1}^{∞} P(Ai).

• Note that there is never a problem with the convergence of the infinite sum; all partial sums of these non-negative summands are bounded above by 1.

• K4, which is not needed in the finite sample space context of classical probability, is offered by Kolmogorov to ensure a degree of mathematical closure under limiting operations [7, p 111].

• In fact, given K0–K3 (finite additivity in particular), K4 is equivalent to countable additivity [7, Thm 3.5.1 p 111].

Proof. ("⇒") Let Bn = ∪_{i>n} Ai. By (1.6), Bn+1 ⊂ Bn and ∩_{n=1}^{∞} Bn = ∅. So, by K4, lim_{n→∞} P(Bn) = 0. Furthermore,

  ∪_{i=1}^{∞} Ai = Bn ∪ (∪_{i=1}^{n} Ai),

where all the sets on the RHS are disjoint. Hence, by finite additivity,

  P(∪_{i=1}^{∞} Ai) = P(Bn) + Σ_{i=1}^{n} P(Ai),

and letting n → ∞ gives the result. To show "⇐", see (3.17).

Equivalently, instead of K0–K4, we can define a probability measure using P0–P2 below: a probability measure on a σ-algebra A of subsets of Ω is a set function

(P0) P : A → [0, 1]

that satisfies:

(P1, K2) P(Ω) = 1.

(P2) Countable additivity: for every countable sequence (An)_{n=1}^{∞} of disjoint elements of A, one has P(∪_{n=1}^{∞} An) = Σ_{n=1}^{∞} P(An).

• P(∅) = 0.

• The triple (Ω, A, P) is called a probability space.

3.16. Properties of probability measures:

(a) P(∅) = 0.

(b) P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

(c) Inclusion-exclusion:

  P(∪_{k=1}^{n} Ak) = Σ_i P(Ai) − Σ_{i<j} P(Ai ∩ Aj) + Σ_{i<j<k} P(Ai ∩ Aj ∩ Ak) − · · · + (−1)^{n+1} P(A1 ∩ · · · ∩ An)
                   = Σ_{∅≠I⊂{1,...,n}} (−1)^{|I|+1} P(∩_{i∈I} Ai).

  For example,

  P(A1 ∪ A2 ∪ A3) = P(A1) + P(A2) + P(A3) − P(A1 ∩ A2) − P(A1 ∩ A3) − P(A2 ∩ A3) + P(A1 ∩ A2 ∩ A3).

(d) P(∩_{k=1}^{n} Ak^c ∩ B) = P(B) + Σ_{∅≠I⊂[n]} (−1)^{|I|} P(∩_{i∈I} Ai ∩ B).      (9)

(j) Finite additivity: if A = ∪_{j=1}^{n} Aj with the Aj ∈ A disjoint, then P(A) = Σ_{j=1}^{n} P(Aj).

(k) Boole's inequality (union bound): if A1, . . . , An are events, not necessarily disjoint, then P(∪_{i=1}^{n} Ai) ≤ Σ_{i=1}^{n} P(Ai).

  • Countable version: if A1, A2, . . . are events, not necessarily disjoint, then P(∪_{i=1}^{∞} Ai) ≤ Σ_{i=1}^{∞} P(Ai).

3.17. Monotone (sequential) continuity from above: If B1 ⊃ B2 ⊃ · · · is a decreasing sequence of measurable sets, then P(∩_{i=1}^{∞} Bi) = lim_{j→∞} P(Bj). In more compact notation, if Bj ↘ B, then P(B) = lim_{j→∞} P(Bj).

Proof. Let B = ∩_{i=1}^{∞} Bi and let Ak = Bk \ Bk+1, i.e., the new part. We consider two partitions of B1: (1) B1 = B ∪ (∪_{j=1}^{∞} Aj) and (2) B1 = Bn ∪ (∪_{j=1}^{n−1} Aj). (1) implies P(B1) − P(B) = Σ_{j=1}^{∞} P(Aj). (2) implies P(B1) − P(Bn) = Σ_{j=1}^{n−1} P(Aj). We then have

  lim_{n→∞} (P(B1) − P(Bn)) = Σ_{j=1}^{∞} P(Aj) = P(B1) − P(B),

so lim_{n→∞} P(Bn) = P(B).

3.18. Let A be a σ-algebra. Suppose that P : A → [0, 1] satisfies (P1) and is finitely additive. Then the following are equivalent:

• (P2): if An ∈ A are disjoint, then P(∪_{n=1}^{∞} An) = Σ_{n=1}^{∞} P(An);

• (Continuity from below) if An ∈ A and An ↗ A, then P(An) ↗ P(A);

• if An ∈ A and An ↗ Ω, then P(An) ↗ P(Ω) = 1;

• (Continuity from above) if An ∈ A and An ↘ A, then P(An) ↘ P(A);

• if An ∈ A and An ↘ ∅, then P(An) ↘ 0.

Hence, a probability measure satisfies all five properties above. Continuity from above and continuity from below are collectively called sequential continuity properties.

In fact, for a probability measure P, let (An) be a sequence of events in A which converges to A (i.e., An → A). Then A ∈ A and lim_{n→∞} P(An) = P(A). Of course, both An ↗ A and An ↘ A imply An → A. Note also that

  P(∪_{n=1}^{∞} An) = lim_{N→∞} P(∪_{n=1}^{N} An)   and   P(∩_{n=1}^{∞} An) = lim_{N→∞} P(∩_{n=1}^{N} An).

Alternative forms of the sequential continuity properties are

  P(∪_{n=1}^{∞} An) = lim_{N→∞} P(AN)   if An ⊂ An+1,

and

  P(∩_{n=1}^{∞} An) = lim_{N→∞} P(AN)   if An+1 ⊂ An.

3.19. Given a common event algebra A, probability measures P1, . . . , Pm, and numbers λ1, . . . , λm with λi ≥ 0 and Σ_{i=1}^{m} λi = 1, the convex combination P = Σ_{i=1}^{m} λi Pi of probability measures is a probability measure.

3.20. A can not contain an uncountable, disjoint collection of sets of positive probability.

Definition 3.21. Discrete probability measure: P is a discrete probability measure if there exist finitely or countably many points ωk and nonnegative masses mk such that, ∀A ∈ A,

  P(A) = Σ_{k: ωk ∈ A} mk = Σ_k mk IA(ωk).

If there is just one of these points, say ω0, with mass m0 = 1, then P is a unit mass at ω0. In this case, ∀A ∈ A, P(A) = IA(ω0). Notation: P = δ_{ω0}.

• Here, Ω can be uncountable.

3.4 Countable Ω

A sample space Ω is countable if it is either ﬁnite or countably inﬁnite. It is countably

inﬁnite if it has as many elements as there are integers. In either case, the element of Ω

can be enumerated as, say, ω1 , ω2 , . . . . If the event algebra A contains each singleton set

{ωk } (from which it follows that A is the power set of Ω), then we specify probabilities

satisfying the Kolmogorov axioms through a restriction to the set S = {{ωk }} of singleton

events.

Definition 3.22. When Ω is countable, a probability mass function (pmf) is a function p : Ω → [0, 1] such that

  Σ_{ω∈Ω} p(ω) = 1.

3.23. Every pmf p defines a probability measure P, and conversely. Their relationship is given by

  p(ω) = P({ω}),            (10)

  P(A) = Σ_{ω∈A} p(ω).      (11)

The convenience of a speciﬁcation by pmf becomes clear when Ω is a ﬁnite set of, say, n

elements. Specifying P requires specifying 2n values, one for each event in A, and doing so

in a manner that is consistent with the Kolmogorov axioms. However, specifying p requires

only providing n values, one for each element of Ω, satisfying the simple constraints of

nonnegativity and addition to 1. The probability measure P satisfying (11) automatically

satisﬁes the Kolmogorov axioms.

3.5 Independence

Definition 3.24. Independence between events and collections of events.

(i) Events A and B are independent if and only if P(A ∩ B) = P(A)P(B). In particular, ∅ and Ω are independent of any event.

(ii) Two events A, B with positive probabilities are independent if and only if P(B|A) = P(B), which is equivalent to P(A|B) = P(A). When A and/or B has zero probability, A and B are automatically independent.

(iii) An event A is independent of itself if and only if P(A) is 0 or 1.

(iv) If A and B are independent, then the two classes σ({A}) = {∅, A, Ac, Ω} and σ({B}) = {∅, B, Bc, Ω} are independent.

(v) Suppose A and B are disjoint. Then A and B are independent if and only if P(A) = 0 or P(B) = 0.

(vi) Suppose A ⊂ B. Then A and B are independent if and only if P(A) = P(A)P(B), i.e., if and only if P(A) = 0 or P(B) = 1.

• Events A1, A2, . . . , An are (mutually) independent if and only if

  P(∩_{j∈J} Aj) = ∏_{j∈J} P(Aj)   for every J ⊂ [n] with |J| ≥ 2.

  ◦ Note that the case |J| = 1 automatically holds. The case |J| = 0 can be regarded as the ∅ event case, which is also trivially true.

  ◦ There are Σ_{j=2}^{n} C(n, j) = 2^n − 1 − n constraints.

  ◦ Example: A1, A2, A3 are independent if and only if

    P(A1 ∩ A2 ∩ A3) = P(A1) P(A2) P(A3),
    P(A1 ∩ A2) = P(A1) P(A2),
    P(A1 ∩ A3) = P(A1) P(A3),
    P(A2 ∩ A3) = P(A2) P(A3).

    Remark: The first equality alone is not enough for independence, and the last three alone (pairwise independence) are not enough either. In fact, it is possible for the first equation to hold while the last three fail, as shown in (3.26.b). It is also possible to construct events such that the last three equations hold (pairwise independence) but the first one does not, as demonstrated in (3.26.a).

  ≡ P(B1 ∩ B2 ∩ · · · ∩ Bn) = P(B1) P(B2) · · · P(Bn), whenever each Bi = Ai or Bi = Ω.

• A family of events {Aα : α ∈ I} is independent if and only if
  ≡ for every finite J ⊂ I, P(∩_{α∈J} Aα) = ∏_{α∈J} P(Aα);
  ≡ every finite subcollection is independent.

• Collections (classes) of events A1, . . . , An are independent if and only if
  ≡ P(B1 ∩ B2 ∩ · · · ∩ Bn) = P(B1) P(B2) · · · P(Bn), whenever each Bi ∈ Ai or Bi = Ω;
  ≡ P(B1 ∩ B2 ∩ · · · ∩ Bn) = P(B1) P(B2) · · · P(Bn), whenever each Bi ∈ Ai ∪ {Ω};
  ≡ for every choice of Bi ∈ Ai, the events B1, . . . , Bn are independent;
  ≡ A1 ∪ {Ω}, . . . , An ∪ {Ω} are independent;
  ≡ A1 ∪ {∅}, . . . , An ∪ {∅} are independent.

• An arbitrary family of classes {Aθ : θ ∈ Θ} is independent if and only if
  ≡ any finite subcollection of the classes is independent;
  ≡ for every finite Λ ⊂ Θ and every choice of Aθ ∈ Aθ (θ ∈ Λ), P(∩_{θ∈Λ} Aθ) = ∏_{θ∈Λ} P(Aθ).

• By deﬁnition, a subcollection of independent events is also independent.

• The class {∅, Ω} is independent from any class.

Definition 3.25. A collection of events {Aα} is called pairwise independent if for every pair of distinct events Aα1, Aα2 we have P(Aα1 ∩ Aα2) = P(Aα1) P(Aα2).

• Mutual independence implies pairwise independence; the converse is false. See (a) in Example (3.26).

• For K ⊂ J, P(∩_{α∈J} Aα) = ∏_{α∈J} P(Aα) does not imply P(∩_{α∈K} Aα) = ∏_{α∈K} P(Aα).

Example 3.26.

(a) Let Ω = {1, 2, 3, 4}, A = 2^Ω, P({i}) = 1/4, A1 = {1, 2}, A2 = {1, 3}, A3 = {2, 3}. Then P(Ai ∩ Aj) = P(Ai) P(Aj) for all i ≠ j, but P(A1 ∩ A2 ∩ A3) ≠ P(A1) P(A2) P(A3).

(b) One can also construct events A1, A2, A3 for which P(A1 ∩ A2 ∩ A3) = P(A1) P(A2) P(A3) but P(Ai ∩ Aj) ≠ P(Ai) P(Aj) for all i ≠ j.

(c) The paradox of ”almost sure” events: Consider two random events with probabilities

of 99% and 99.99%, respectively. One could say that the two probabilities are nearly

the same, both events are almost sure to occur. Nevertheless the diﬀerence may

become signiﬁcant in certain cases. Consider, for instance, independent events which

may occur on any day of the year with probability p = 99%; then the probability

P that it will occur every day of the year is less than 3%, while if p = 99.99% then

P = 97%.

4 Random Element

4.1. A function X : (Ω, A) → (E, BE) is said to be a random element of E if and only if X is measurable, which is equivalent to each of the following statements:

≡ X^{−1}(BE) ⊂ A

≡ σ(X) ⊂ A

4.3. Law of X or distribution of X: P^X = μX = P ∘ X^{−1} = L(X) : BE → [0, 1], given by

  μX(A) = P^X(A) = P(X^{−1}(A)) = (P ∘ X^{−1})(A) = P({ω : X(ω) ∈ A}) = P([X ∈ A]).

Deﬁnition 4.6. A real-valued function X(ω) deﬁned for points ω in a sample space Ω is

called a random variable.

• Random variables are important because they provide a compact way of referring to

events via their numerical attributes.

• The abbreviation r.v. will be used for “real-valued random variables” [10, p 1].

4.7. At a certain point in most probability courses, the sample space is rarely mentioned

anymore and we work directly with random variables. The sample space often “disappears”

but it is really there in the background.

• [X ∈ B] = {ω ∈ Ω : X(ω) ∈ B} and

• P [X ∈ B, Y ∈ C] = P ([X ∈ B] ∩ [Y ∈ C]) .

4.10. Every random variable can be written as a sum of a discrete random variable and

a continuous random variable.

4.11. A random variable can have at most countably many points x such that P[X = x] > 0.

4.12. Point-mass probability measures (Dirac measures), usually written εα or δα, denote a point mass of size one at the point α. In this case,

• P^X({α}) = 1

• P^X({α}^c) = 0

• FX(x) = 1_{[α,∞)}(x)

4.13. There exists distributions that are neither discrete nor continuous.

Let μ (A) = 12 μ1 (A) + 12 μ2 (A) for μ1 discrete and μ2 coming from a density.

4.14. When X and Y take finitely many values, say x1, . . . , xm and y1, . . . , yn, respectively, we can arrange the probabilities pX,Y(xi, yj) in the m × n matrix

  [ pX,Y(x1, y1)  pX,Y(x1, y2)  . . .  pX,Y(x1, yn) ]
  [ pX,Y(x2, y1)  pX,Y(x2, y2)  . . .  pX,Y(x2, yn) ]
  [      .              .       . .        .       ]
  [ pX,Y(xm, y1)  pX,Y(xm, y2)  . . .  pX,Y(xm, yn) ]

• The sum of the entries in the ith row is pX(xi), and the sum of the entries in the jth column is pY(yj).

• The sum of all the entries in the matrix is one.
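A small numpy sketch with a made-up joint pmf matrix, showing the row and column sums giving the marginals:

import numpy as np

p_xy = np.array([[0.10, 0.20, 0.10],    # hypothetical joint pmf p_{X,Y}(x_i, y_j)
                 [0.05, 0.25, 0.30]])
assert np.isclose(p_xy.sum(), 1.0)      # all entries sum to one

p_x = p_xy.sum(axis=1)   # row sums    -> marginal pmf of X
p_y = p_xy.sum(axis=0)   # column sums -> marginal pmf of Y
print(p_x, p_y)          # [0.4 0.6] [0.15 0.45 0.4]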

4.15. The (cumulative) distribution function (cdf) induced by a probability P on (R, B) (here Ω = R) is the function F(x) = P(−∞, x].

The (cumulative) distribution function (cdf) of the random variable X is the function FX(x) = P^X(−∞, x] = P[X ≤ x].

• The distribution P^X can be obtained from the distribution function by setting P^X(−∞, x] = FX(x); that is, FX uniquely determines P^X.

• 0 ≤ FX ≤ 1.

C1 FX is non-decreasing.

C2 FX is right-continuous:

  ∀x, FX(x⁺) ≡ lim_{y↓x, y>x} FX(y) = FX(x) = P[X ≤ x].

C3 lim_{x→−∞} FX(x) = 0 and lim_{x→∞} FX(x) = 1.

• ∀x, FX(x⁻) ≡ lim_{y↑x, y<x} FX(y) = P^X(−∞, x) = P[X < x].

Figure 10: Right-continuous function at jump point

• For all x < y:

  P((x, y]) = F(y) − F(x)
  P([x, y]) = F(y) − F(x⁻)
  P([x, y)) = F(y⁻) − F(x⁻)
  P((x, y)) = F(y⁻) − F(x)
  P({x}) = F(x) − F(x⁻)
• A function F is the cdf of some random variable if and only if one has (C1), (C2), and (C3).

• If F satisfies (C1), (C2), and (C3), then there exists a unique probability measure P on (R, B) that has P(a, b] = F(b) − F(a) for all a, b ∈ R [21, p 25].

• The left-continuous inverse of a non-decreasing function g is

  g^{−1}(y) = inf{x ∈ R : g(x) ≥ y}.

  ◦ Trick: just flip the graph along the line x = y, then make the graph left-continuous.

  ◦ If g is a cdf, then we only consider y ∈ (0, 1); in this case g^{−1} is called the inverse CDF [7, Def 8.4.1 p 238] or quantile function.

  ◦ In [21, Def 2.16 p 25], the inverse CDF is defined using the strict inequality ">" rather than "≥".

[Figure: a non-decreasing step function g and its left-continuous inverse g^{−1}.]
Distribution      F(x)                          F^{−1}(u)

Exponential       1 − e^{−λx}                   −(1/λ) ln u
Extreme value     1 − e^{−e^{(x−a)/b}}          a + b ln(−ln u)
Geometric         1 − (1 − p)^i                 ⌈ln u / ln(1 − p)⌉
Logistic          1 − 1/(1 + e^{(x−μ)/b})       μ − b ln(1/u − 1)
Pareto            1 − x^{−a}                    u^{−1/a}
Weibull           1 − e^{−(x/a)^b}              a (−ln u)^{1/b}

Deﬁnition 4.19. Let X be a random variable with distribution function F . Suppose that

p ∈ (0, 1). A value of x such that F (x− ) = P [X < x] ≤ p and F (x) = P [X ≤ x] ≥ p is

called a quantile of order p for the distribution. Roughly speaking, a quantile of order p

is a value where the cumulative distribution crosses p. Note that it is not unique. Suppose

F (x) = p on an interval [a, b], then all x ∈ [a, b] are quantile of order p.

A quantile of order 12 is called a median of the distribution. When there is only one

median, it is frequently used as a measure of the center of the distribution. A quantile

of order 14 is called a ﬁrst quartile and the quantile of order 34 is called a third quartile. A

median is a second quartile.

Assuming uniqueness, let q1 , q2 , and q3 denote the ﬁrst, second, and third quartiles

of X. The interquartile range is deﬁned to be q3 − q1 , and is sometimes used as a mea-

sure of the spread of the distribution with respect to the median. The ﬁve parameters

max X, q1 , q2 , q3 , min X are often referred to as the ﬁve-number summary. Graphically,

the ﬁve numbers are often displayed as a boxplot.

4.20. If F is non-decreasing, right-continuous, with lim_{x→−∞} F(x) = 0 and lim_{x→∞} F(x) = 1, then F is the CDF of some probability measure on (R, B).

In particular, let U ~ U(0, 1) and X = F^{−1}(U); then FX = F. Here, F^{−1} is the left-continuous inverse of F. Note that we have just explicitly defined a random variable X(ω) with distribution function F on Ω = (0, 1).
with distribution function F on Ω = (0, 1).

Definition 4.21. A random variable X is said to be a discrete random variable if there exist countably many distinct real numbers xk such that

  Σ_k P[X = xk] = 1.

• pi = pX(xi) = P[X = xi].

Definition 4.22. For a discrete random variable X with distinct values xk, we define its probability mass function (pmf) by

  pX(xk) = P[X = xk].

• If Ω is countable, then there can be only countably many value of X(ω). So, any

random variable deﬁned on countable Ω is discrete.

• Sometimes, we write p(xk) or p_{xk} instead of pX(xk).

• P[X ∈ B] = Σ_{xk∈B} P[X = xk].

• FX(x) = Σ_{xk} pX(xk) U(x − xk).

Definition 4.23 (Discrete CDF). A cdf which can be written in the form Fd(x) = Σ_k pk U(x − xk) is called a discrete cdf [7, Def. 5.4.1 p 163]. Here, U is the unit step function, {xk} is an arbitrary countable set of real numbers, and {pk} is a countable set of positive numbers that sum to 1.

• An integer-valued random variable is a discrete random variable whose distinct values are xk = k. For integer-valued random variables,

  P[X ∈ B] = Σ_{k∈B} P[X = k].

• p : Ω → [0, 1].

• 0 ≤ pX ≤ 1.

• k pX (xk ) = 1.

Definition 4.26. Sometimes, it is convenient to work with the "pdf" of a discrete r.v. Given a discrete random variable X as in Definition 4.22, the "pdf" of X is

  fX(x) = Σ_{xk} pX(xk) δ(x − xk),   x ∈ R.      (12)

Although the delta function is not a well-deﬁned function3 , this technique does allow easy

manipulation of mixed distribution. The deﬁnition of quantities involving discrete random

variables and the corresponding properties can then be derived from the pdf and hence

there is no need to talk about pmf at all!

Deﬁnition 4.27. A random variable X is said to be a continuous random variable if

and only if any one of the following equivalent conditions holds.

≡ ∀x, P [X = x] = 0

≡ FX is continuous

3

Rigorously, it is a unit measure at 0.

4.28. f is a (probability) density function (with respect to Lebesgue measure) of a random variable X (or of the distribution P^X)

≡ P^X has density f with respect to Lebesgue measure

≡ f = dP^X/dλ, the Radon-Nikodym derivative

≡ f is a nonnegative Borel function on R such that, ∀B ∈ BR, P^X(B) = ∫_B f(x) dx = ∫_B f dλ, where λ is the Lebesgue measure. (This extends nicely to the random vector case.)

≡ X is absolutely continuous

≡ ∀x ∈ R, FX(x) = ∫_{−∞}^{x} f(t) dt

≡ ∀a, b, FX(b) − FX(a) = ∫_{a}^{b} f(x) dx

4.29. If FX has a continuous derivative, it follows from the fundamental theorem of calculus that this derivative is indeed a density for F. That is, if F has a continuous derivative, this derivative can serve as the density f.

4.30. Suppose a random variable X has a density f.

• F need not differentiate to f everywhere.

• ∫_R f(x) dx = 1.

• The density is not unique: any nonnegative function that agrees with f Lebesgue-a.e. can also serve as a density for X and P^X.

• P[X ∈ [a, b]] = P[X ∈ [a, b)] = P[X ∈ (a, b]] = P[X ∈ (a, b)] because the corresponding integrals over an interval are not affected by whether the endpoints are included or excluded. In other words, P[X = a] = P[X = b] = 0.

• P[f_X(X) = 0] = 0.

4.31. f_X(x) = E[δ(X − x)].

Definition 4.32 (Absolutely Continuous CDF). An absolutely continuous cdf F_ac can be written in the form

F_ac(x) = ∫_{−∞}^{x} f(z) dz,

where f(x) = (d/dx) F_ac(x) is defined a.e. and is a nonnegative, integrable function (possibly having discontinuities) satisfying ∫ f(x) dx = 1.

4.33. Any nonnegative function that integrates to one is a probability density function (pdf) [9, p 139].

4.34. Remarks: some useful intuitions.

(a) Approximately, for small Δx, P[X ∈ [x, x + Δx]] = ∫_x^{x+Δx} f_X(t) dt ≈ f_X(x) Δx. This is why we call f_X the density function.

(b) In fact, f_X(x) = lim_{Δx→0} P[x < X ≤ x + Δx] / Δx.

4.35. Let T be a nonnegative random variable with distribution function F and density f on the interval [0, ∞). The following terms are often used when T denotes the lifetime of a device or system.

(a) Its survival-, survivor-, or reliability-function is

R(t) = P[T > t] = ∫_t^∞ f(x) dx = 1 − F(t).

• R(0) = P[T > 0] = P[T ≥ 0] = 1.

(b) The mean time to failure (MTTF) is E[T] = ∫_0^∞ R(t) dt.

(c) The (age-specific) failure rate or hazard function of a device or system with lifetime T is

r(t) = lim_{δ→0} P[T ≤ t + δ | T > t]/δ = −R′(t)/R(t) = f(t)/R(t) = −(d/dt) ln R(t).

(i) r(t) δ ≈ P[T ∈ (t, t + δ] | T > t].

(ii) R(t) = e^{−∫_0^t r(τ) dτ}.

(iii) f(t) = r(t) e^{−∫_0^t r(τ) dτ}.
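The relations between r, R, and f can be checked numerically in MATLAB (a sketch; the hazard r(t) = 2t used below is only an illustrative assumption):

t = linspace(0,3,301);
r = 2*t;                          % assumed hazard function r(t) = 2t
R = exp(-cumtrapz(t,r));          % R(t) = exp(-integral of r over [0,t])
f = r.*R;                         % f(t) = r(t) R(t)
plot(t,R,t,f); legend('R(t)','f(t)')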

Deﬁnition 4.36. A random variable whose cdf is continuous but whose derivative is the

zero function is said to be singular.

• It has no density. (Otherwise, the cdf is the zero function.) So, ∃ continuous random

variable X with no density. Hence, ∃ random variable X with no density.

• Even when we allow the use of delta function for the density as in the case of mixed

r.v., it still has no density because there is no jump in the cdf.

• There exists singular r.v. whose cdf is strictly increasing.

d

Deﬁnition 4.37. fX,A (x) = F

dx X,A

(x). See also deﬁnition 4.17.

4.5 Mixed/Hybrid Distribution

There are many occasions where we have a random variable whose distribution is a normalized linear combination of discrete and absolutely continuous distributions. For convenience, we use the Dirac delta function to link the pmf to a pdf as in Definition 4.26. Then we only have to be concerned with the pdf of a mixed distribution/r.v.

4.38. By allowing density functions to contain impulses, the cdf of a mixed random variable can be expressed in the form F(x) = ∫_{(−∞,x]} f(t) dt, with

f_X(x) = f̃_X(x) + ∑_k P[X = x_k] δ(x − x_k),

where

• the x_k are the distinct points at which F_X has jump discontinuities, and

• f̃_X(x) = F′_X(x) when F_X is differentiable at x, and f̃_X(x) = 0 otherwise.

In which case,

E[g(X)] = ∫ g(x) f̃_X(x) dx + ∑_k g(x_k) P[X = x_k].

4.40. Suppose the cdf F can be expressed in the form F(x) = G(x) U(x − x_0) for some function G. Then the density is f(x) = G′(x) U(x − x_0) + G(x_0) δ(x − x_0). Note that G(x_0) = F(x_0) = P[X = x_0] is the jump of the cdf at x_0. When the random variable is continuous, G(x_0) = 0 and thus f(x) = G′(x) U(x − x_0).
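A simulation check of such a mixed cdf (a sketch; the particular G(x) = 1/2 + (1/2)(1 − e^{−x}) below, giving a jump of 1/2 at x_0 = 0, is an assumed example and not from the text):

N = 1e5;
J = (rand(1,N) < 0.5);                       % P[X = 0] = G(0) = 1/2
X = (~J).*exprnd(1,1,N);                     % otherwise X ~ E(1)
x = linspace(-1,5,121);
Femp = arrayfun(@(t) mean(X<=t), x);         % empirical cdf
F = (0.5 + 0.5*(1-exp(-x))).*(x>=0);         % F(x) = G(x)U(x)
plot(x,Femp,x,F,'--')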

4.6 Independence

Deﬁnition 4.41. A family of random variables {Xi : i ∈ I} is independent if ∀ ﬁnite

J ⊂ I, the family of random variables {Xi : i ∈ J} is independent. In words, “an inﬁnite

collection of random elements is by deﬁnition independent if each ﬁnite subcollection is.”

Hence, we only need to know how to test independence for ﬁnite collection.

events (sets) {Ai : i ∈ I} is independent.

Definition 4.42. Independence among a finite collection of random variables: for finite I, the following statements are equivalent.

≡ P[X_i ∈ H_i ∀i ∈ I] = ∏_{i∈I} P[X_i ∈ H_i], where H_i ∈ E_i.

≡ P[(X_i : i ∈ I) ∈ ×_{i∈I} H_i] = ∏_{i∈I} P[X_i ∈ H_i], where H_i ∈ E_i.

≡ P^{(X_i : i∈I)}(×_{i∈I} H_i) = ∏_{i∈I} P^{X_i}(H_i), where H_i ∈ E_i.

≡ P[X_i ≤ x_i ∀i ∈ I] = ∏_{i∈I} P[X_i ≤ x_i].

≡ [Factorization criterion] F_{(X_i : i∈I)}((x_i : i ∈ I)) = ∏_{i∈I} F_{X_i}(x_i).

≡ σ(X_i) and σ(X_1^{i−1}) are independent ∀i ≥ 2.

≡ For discrete random variables X_i with countable range E: P[X_i = x_i ∀i ∈ I] = ∏_{i∈I} P[X_i = x_i] ∀x_i ∈ E, ∀i ∈ I.

≡ For absolutely continuous X_i with densities f_{X_i}: f_{(X_i : i∈I)}((x_i : i ∈ I)) = ∏_{i∈I} f_{X_i}(x_i).

Definition 4.43. If the X_α, α ∈ I, are independent and each has the same marginal distribution Q, we say that the X_α's are iid (independent and identically distributed) and we write X_α ∼ Q, i.i.d.

4.44. (a) A collection of random variables is called pairwise independent if any two of them are independent.

(b) Some pairwise independent collections are not independent. See example (4.45).

Example 4.45. Suppose X, Y, and Z have the joint probability distribution p_{X,Y,Z}(x, y, z) = 1/4 for (x, y, z) ∈ {(0,0,0), (0,1,1), (1,0,1), (1,1,0)}. This, for example, can be constructed by starting with independent X and Y that are Bernoulli(1/2) and setting Z = X ⊕ Y = (X + Y) mod 2.

(a) X, Y, Z are pairwise independent.

(b) The combination of X ⫫ Z and Y ⫫ Z does not imply (X, Y) ⫫ Z.
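A quick simulation of Example 4.45 (a sketch; the sample size is arbitrary):

xyz = [0 0 0; 0 1 1; 1 0 1; 1 1 0];         % the four support points, each with prob 1/4
idx = randi(4,1,1e5);
X = xyz(idx,1); Y = xyz(idx,2); Z = xyz(idx,3);
mean(X==1 & Z==1)                            % approx 1/4 = P[X=1]P[Z=1]: pairwise independence
mean(X==1 & Y==1 & Z==1)                     % 0, not 1/8: joint independence fails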

Definition 4.46. The convolution of probability measures μ_1 and μ_2 on (R, B) is the measure μ_1 ∗ μ_2 defined by

(μ_1 ∗ μ_2)(H) = ∫ μ_2(H − x) μ_1(dx),  H ∈ B_R.

(b) If F_X and F_Y are the distribution functions corresponding to μ_X and μ_Y, the distribution function corresponding to μ_X ∗ μ_Y is

(μ_X ∗ μ_Y)(−∞, z] = ∫ F_Y(z − x) μ_X(dx) = ∫ F_Y(z − x) dF_X(x),

where the last integral is a Stieltjes integral. This is denoted by (F_X ∗ F_Y)(z). That is,

F_Z(z) = (F_X ∗ F_Y)(z) = ∫ F_Y(z − x) dF_X(x).

Similarly, define

(F_X ∗ f_Y)(z) = ∫ f_Y(z − x) dF_X(x).

(i) If Y (or F_Y) is absolutely continuous with density f_Y, then for any X (or F_X), X + Y (or F_X ∗ F_Y) is absolutely continuous with density (F_X ∗ f_Y)(z) = ∫ f_Y(z − x) dF_X(x).

If, in addition, F_X has density f_X, then

∫ f_Y(z − x) dF_X(x) = ∫ f_Y(z − x) f_X(x) dx.

This is denoted by f_X ∗ f_Y. In other words, if densities f_X, f_Y exist, then F_X ∗ F_Y has density f_X ∗ f_Y, where

(f_X ∗ f_Y)(z) = ∫ f_Y(z − x) f_X(x) dx.
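A numerical illustration of the density convolution above for two U(0, 1) densities (a sketch; the grid step dx is an arbitrary choice):

dx = 0.01; x = 0:dx:1;
f = ones(size(x));                    % density of U(0,1) on the grid
g = conv(f,f)*dx;                     % approximates (f*f)(z) on 0:dx:2
z = 0:dx:2;
plot(z,g)                             % triangular density of the sum on [0,2]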

4.47. If random variables X and Y are independent and have distributions μ_X and μ_Y, then X + Y has distribution μ_X ∗ μ_Y.

4.48. Expectation and variance under independence:

(a) Let X and Y be nonnegative independent random variables on (Ω, A, P). Then E[XY] = EX · EY.

(b) If X_1, ..., X_n are independent and the g_k are measurable, then the g_k(X_k)'s are independent. Moreover, if each g_k(X_k) is integrable, then E[∏_{k=1}^n g_k(X_k)] = ∏_{k=1}^n E[g_k(X_k)].

(c) If X_1, ..., X_n are independent and ∑_{k=1}^n X_k has a finite second moment, then all the X_k have finite second moments as well. Moreover, Var[∑_{k=1}^n X_k] = ∑_{k=1}^n Var X_k.

(d) If pairwise independent X_i ∈ L², then Var[∑_{i=1}^k X_i] = ∑_{i=1}^k Var[X_i].

4.7 Misc

4.49. The mode of a discrete probability distribution is the value at which its probability

mass function takes its maximum value. The mode of a continuous probability distribution

is the value at which its probability density function attains its maximum value.

• the mode is not necessarily unique, since the probability mass function or probability

density function may achieve its maximum value at several points.

5 PMF Examples

The following pmf will be deﬁned on its support S. For Ω larger than S, we will simply

put the pmf to be 0.

5.1 Random/Uniform

5.1. Rn , Un

When an experiment results in a ﬁnite number of “equally likely” or “totally random”

outcomes, we model it with a uniform random variable. We say that X is uniformly

distributed on [n] if

P[X = k] = 1/n,  k ∈ [n].

We write X ∼ U_n.


X ∼                 Support set          p_X(k)                          φ_X(u)
Uniform U_n         {1, 2, ..., n}       1/n
U_{0,1,...,n−1}     {0, 1, ..., n−1}     1/n                             (1 − e^{iun}) / (n(1 − e^{iu}))
Bernoulli B(1, p)   {0, 1}               1 − p for k = 0; p for k = 1
Binomial B(n, p)    {0, 1, ..., n}       C(n,k) p^k (1 − p)^{n−k}        (1 − p + p e^{iu})^n
Geometric G_0(β)    N ∪ {0}              (1 − β) β^k                     (1 − β)/(1 − β e^{iu})
Geometric G_1(β)    N                    (1 − β) β^{k−1}
Poisson P(λ)        N ∪ {0}              e^{−λ} λ^k / k!                 e^{λ(e^{iu} − 1)}

• pi = 1

n

for i ∈ S = {1, 2, . . . , n}.

• Examples

◦ fair gaming devices (well-balanced coins and dice, well shuﬄed decks of cards)

◦ experiment where

∗ there are only n possible outcomes and they are all equally probable

∗ there is a balance of information about outcomes

Suppose |S| = n, then p(x) = n1 for all x ∈ S.

Example 5.3. For X uniform on {−M, ..., −1, 0, 1, ..., M}, we have EX = 0 and Var X = M(M + 1)/3. For X uniform on {N, N + 1, ..., M}, we have EX = (M + N)/2 and Var X = (M − N)(M − N + 2)/12.

Example 5.4. Set S = {0, 1, 2, ..., M}; then the sum of two independent U(S) random variables has pmf

p(k) = ((M + 1) − |k − M|) / (M + 1)²

for k = 0, 1, ..., 2M. Note its triangular shape, with maximum value p(M) = 1/(M + 1). To visualize the pmf in MATLAB, try

M = 5;                          % example value
k = 0:2*M;
P = (1/(M+1))*ones(1,M+1);      % pmf of U({0,...,M})
P = conv(P,P);                  % pmf of the sum of two independent copies
stem(k,P)

5.5. Bernoulli : B(1, p) or Bernoulli(p)

• S = {0, 1}


• p0 = q = 1 − p, p1 = p

• EX = E [X 2 ] = p.

Var [X] = p − p2 = p (1 − p). Note that the variance is maximized at p = 0.5.

[Figure: plot of the Bernoulli variance p(1 − p) over p ∈ [0, 1]; it is maximized at p = 0.5, where it equals 0.25.]

5.6. Binary Suppose X takes only two values a and b with b > a. P [X = b] = p. Then,

X can be expressed as X = (b − a) I + a, where I is a Bernoulli random variable with

P [I = 1] = p.

5.3 Binomial: B(n, p)

• Binomial distribution with size n and parameter p ∈ [0, 1].

• p_i = C(n, i) p^i (1 − p)^{n−i} for i ∈ S = {0, 1, 2, ..., n}.

• X is the number of successes in n independent Bernoulli trials, and hence is the sum of n independent, identically distributed Bernoulli r.v.'s.

• φ_X(u) = (1 − p + p e^{ju})^n.

• EX = np.

• E[X²] = (np)² + np(1 − p).

• Var[X] = np(1 − p).

• Tail probability: ∑_{r=k}^{n} C(n, r) p^r (1 − p)^{n−r} = I_p(k, n − k + 1), the regularized incomplete beta function.

• The maximum probability value happens at k_max = mode X = ⌊(n + 1)p⌋ ≈ np.

◦ When (n + 1)p is an integer, the maximum is achieved at both k_max and k_max − 1.

• If E_1, ..., E_n are n unlinked (independent) repetitions of an experiment E and A is an event of E with P(A) = p, then B(n, p) describes the probability that A occurs k times in E_1, ..., E_n.

• Gaussian approximation for binomial probabilities: when n is large, the binomial distribution becomes difficult to compute directly because of the need to calculate factorial terms. We can use

P[X = k] ≈ (1/√(2π np(1 − p))) e^{−(k − np)² / (2np(1 − p))},   (13)

which comes from approximating X by a Gaussian Y with the same mean and variance and the relation

P[X = k] = P[X ≤ k] − P[X ≤ k − 1] ≈ P[Y ≤ k] − P[Y ≤ k − 1] ≈ f_Y(k).
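A quick MATLAB check of (13) (a sketch; n = 100 and p = 0.3 are arbitrary choices):

n = 100; p = 0.3; k = 0:n;
exact  = binopdf(k,n,p);
approx = exp(-(k-n*p).^2/(2*n*p*(1-p)))/sqrt(2*pi*n*p*(1-p));
stem(k,exact); hold on; plot(k,approx,'r'); hold off
max(abs(exact-approx))            % small when n is large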

n

n−k (np)k −np 2

• Approximation: k

p (1 − p)

k

= k!

e 1+O np2 , kn

[Figure 13: Gaussian approximation to the binomial, Poisson, and gamma distributions.]

5.4 Geometric: G(β)

A geometric distribution is defined by the fact that, for some β ∈ [0, 1), p_{k+1} = β p_k for all k ∈ S, where S can be either N or N ∪ {0}.

• When its support is N, p_k = (1 − β) β^{k−1}. This is referred to as G_1(β) or geometric_1(β). In MATLAB, use geopdf(k-1,1-beta).

• When its support is N ∪ {0}, p_k = (1 − β) β^k. This is referred to as G_0(β) or geometric_0(β). In MATLAB, use geopdf(k,1-beta).

5.7. Consider X ∼ G_0(β).

• p_i = (1 − β) β^i for i ∈ S = N ∪ {0}, 0 ≤ β < 1.

• β = m/(m + 1), where m is the average waiting time / lifetime.

• P[X ≥ k] = β^k = the probability of having at least k initial failures = the probability of having to perform at least k + 1 trials.

• P[X > k] = β^{k+1} = the probability of having at least k + 1 initial failures.

• Memoryless property:

◦ P [X ≥ k + c |X ≥ k ] = P [X ≥ c], k, c > 0.

◦ P [X > k + c |X ≥ k ] = P [X > c], k, c > 0.

◦ If a success has not occurred in the ﬁrst k trials (already fails for k times),

then the probability of having to perform at least j more trials is the same the

probability of initially having to perform at least jtrials.

◦ Each time a failure occurs, the system “forgets” and begins anew as if it were

performing the ﬁrst trial.

◦ Geometric r.v. is the only discrete r.v. that satisﬁes the memoryless property.

• Ex.

◦ lifetimes of components, measured in discrete time units, when the fail catas-

trophically (without degradation due to aging)

◦ waiting times

∗ for next customer in a queue

∗ between radioactive disintegrations

∗ between photon emission

◦ number of repeated, unlinked random experiments that must be performed prior

to the ﬁrst occurrence of a given event A

∗ number of coin tosses prior to the ﬁrst appearance of a ‘head’

number of trials required to observe the ﬁrst success

• The sum of independent G_0(p) and G_0(q) has pmf

p(k) = (1 − p)(1 − q)(q^{k+1} − p^{k+1})/(q − p),  p ≠ q,
p(k) = (k + 1)(1 − p)² p^k,                        p = q,

for k ∈ N ∪ {0}.
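A quick simulation check of the memoryless property of G_0(β) noted above (a sketch; beta = 0.7, k = 3, and c = 4 are arbitrary choices):

beta = 0.7; N = 1e6; k = 3; c = 4;
X = geornd(1-beta,1,N);              % X ~ G_0(beta): P[X = j] = (1-beta)*beta^j
lhs = mean(X(X>=k) >= k+c);          % estimate of P[X >= k+c | X >= k]
rhs = beta^c;                        % P[X >= c]
[lhs rhs]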

5.8. Consider X ∼ G_1(β).

• P[X > k] = β^k.

• Suppose the X_i ∼ G_1(β_i) are independent. Then min(X_1, X_2, ..., X_n) ∼ G_1(∏_{i=1}^n β_i).


5.5 Poisson Distribution: P(λ)

5.9. Characterized by

• p_X(k) = P[X = k] = e^{−λ} λ^k / k!, k ∈ N ∪ {0}; or, equivalently,

• φ_X(u) = e^{λ(e^{iu} − 1)}.

In MATLAB, use poisspdf(k,lambda).

5.10. Denoted by P (λ).

5.11. Instead of X, a Poisson random variable is often denoted by Λ.

5.12. EX = Var X = λ.

5.13. Successive probabilities are connected via the relation kpX (k) = λpX (k − 1).

5.14. mode X = ⌊λ⌋.

• When λ ∈ N, there are two maxima, at λ − 1 and λ.

• When λ ≫ 1, p_X(λ) ≈ 1/√(2πλ), via Stirling's formula (1.13).

λ                 mode X          max_k p_X(k)
0 < λ < 1         0               e^{−λ}
λ ∈ N             λ − 1 and λ     (λ^λ / λ!) e^{−λ}
λ ≥ 1, λ ∉ N      ⌊λ⌋             (λ^{⌊λ⌋} / ⌊λ⌋!) e^{−λ}

The cumulative probabilities can be found by

P[X ≤ k] (∗)= P[∑_{i=1}^{k+1} X_i > 1] = (1/Γ(k + 1)) ∫_λ^∞ e^{−t} t^k dt,

P[X > k] = P[X ≥ k + 1] (∗)= P[∑_{i=1}^{k+1} X_i ≤ 1] = (1/Γ(k + 1)) ∫_0^λ e^{−t} t^k dt,

where the X_i are i.i.d. E(λ). The equalities marked (∗) are easily obtained by counting the number of events of a rate-λ Poisson process on the interval [0, 1].

5.16. Fano factor (index of dispersion): Var X / EX = 1.

An important property of the Poisson and Compound Poisson laws is that their classes

are closed under convolution (independent summation). In particular, we have divisibility

properties (5.21) and (5.31) which are straightforward to prove from their characteristic

functions.


5.17 (Recursion equations). Suppose X ∼ P(λ). Let m_k(λ) = E[X^k] and μ_k(λ) = E[(X − EX)^k]. Then

μ_{k+1}(λ) = λ (k μ_{k−1}(λ) + μ_k′(λ)),   (15)

where ′ denotes d/dλ [14, p 112]; a similar recursion, m_{k+1}(λ) = λ (m_k(λ) + m_k′(λ)), holds for the raw moments. Starting with m_1 = λ = μ_2 and μ_1 = 0, these equations give the moments m_k and μ_k recursively.

1

−λ d

5.18. E X+1 = λ 1 − e . Because for d ∈ N, Y = X+1

1 1

n=0 an X

n

can be expressed

d−1

as n=0 bn X

n c

+ X+1 , the value of EY is easy to ﬁnd if we know EX n .

5.19. Mixed Poisson distribution: Let X be Poisson with mean λ, where the mean λ is itself chosen according to a probability distribution with characteristic function φ_Λ. Then

φ_X(u) = E[E[e^{iuX} | Λ]] = E[e^{Λ(e^{iu} − 1)}] = φ_Λ(−i(e^{iu} − 1)).

• EX = EΛ.

• Var X = Var Λ + EΛ.

• E[X²] = E[Λ²] + EΛ.

• Var[X | Λ] = E[X | Λ] = Λ.

• When Λ is a nonnegative integer-valued random variable, we have G_X(z) = G_Λ(e^{z−1}) and P[X = 0] = G_Λ(1/e).

• E[XΛ] = E[Λ²].

• Cov[X, Λ] = Var Λ.

5.20 (Thinning). Suppose X ∼ P(λ) is sent through a binomial channel with success probability s, i.e., each of the X counts independently gets through with probability s; let Y be the number that get through.

• Note that Y is in fact a random sum ∑_{i=1}^{X} I_i, where the i.i.d. I_i have a Bernoulli distribution with parameter s.

• Y ∼ P(sλ).

• p(x | y) = e^{−λ(1−s)} (λ(1−s))^{x−y} / (x − y)!, x ≥ y (a shifted Poisson).

5.21. Finite additivity: Suppose we have independent Λ_i ∼ P(λ_i). Then ∑_{i=1}^n Λ_i ∼ P(∑_{i=1}^n λ_i).

5.22. Raikov’s theorem: independent random variables can have their sum Poisson-

distributed only if every component of the sum is Poisson-distributed.


5.23. Countable Additivity Theorem [11, p 5]: Let (X_j : j ∈ N) be independent random variables, and assume that X_j has the distribution P(μ_j) for each j. If

∑_{j=1}^{∞} μ_j   (16)

converges to μ, then S = ∑_{j=1}^{∞} X_j converges with probability 1, and S has distribution P(μ). If, on the other hand, (16) diverges, then S diverges with probability 1.

5.24. Let X_1, ..., X_n be independent with X_j ∼ P(μ_j). Then S_n = ∑_{j=1}^n X_j has distribution P(μ) with μ = ∑_{j=1}^n μ_j; and so, whenever ∑_{j=1}^n r_j = s,

P[X_j = r_j ∀j | S_n = s] = (s!/(r_1! r_2! ⋯ r_n!)) ∏_{j=1}^n (μ_j/μ)^{r_j}.

In particular, if X and Y are independent Poisson with means λ and μ, then (1) Z = X + Y is P(λ + μ), and (2) conditioned on Z = z, X is B(z, λ/(λ + μ)). So E[X|Z] = (λ/(λ + μ)) Z, Var[X|Z] = Z λμ/(λ + μ)², and E[Var[X|Z]] = λμ/(λ + μ).

5.25. One of the reasons why Poisson distribution is important is because many natural

phenomenons can be modeled by Poisson processes. For example, if we consider the number

of occurrences Λ during a time interval of length τ in a rate-λ homogeneous Poisson process,

then Λ ∼ P(λτ ).

Example 5.26.

• The ﬁrst use of the Poisson model is said to have been by a Prussian physician,

von Bortkiewicz, who found that the annual number of late-19th-century Prussian

soldiers kicked to death by horses followed a Poisson distribution [7, p 150].

• #clicks in a Geiger counter in τ seconds when the average number of click in 1 second

is λ.

in time τ


• #soldiers kicked to death by horses

• Counts of defects in a semiconductor chip.

5.27. Normal approximation to the Poisson distribution for large λ: Let X ∼ P(λ). X can be thought of as a sum of i.i.d. X_i ∼ P(λ_n), i.e., X = ∑_{i=1}^n X_i, where nλ_n = λ. Hence X is approximately normal N(λ, λ) for large λ. Some say that the normal approximation is good when λ > 5.

5.28. The Poisson distribution can be obtained as a limit of negative binomial distributions. Thus, the negative binomial distribution with parameters r and p can be approximated by the Poisson distribution with parameter λ = rq/p (mean matching), provided that p is "sufficiently" close to 1 and r is "sufficiently" large.

5.29. Convergence of sums of Bernoulli random variables to the Poisson law.

Suppose that for each n ∈ N the random variables X_{n,1}, X_{n,2}, ..., X_{n,r_n} are independent; the probability space for the sequence may change with n. Such a collection is called a triangular array [1] or double sequence [8], which captures the nature of the collection when it is arranged as

X_{1,1}, X_{1,2}, ..., X_{1,r_1},
X_{2,1}, X_{2,2}, ..., X_{2,r_2},
  ...
X_{n,1}, X_{n,2}, ..., X_{n,r_n},
  ...

where the random variables in each row are independent. Let S_n = X_{n,1} + X_{n,2} + ⋯ + X_{n,r_n} be the sum of the random variables in the nth row.

Consider a triangular array of Bernoulli random variables X_{n,k} with P[X_{n,k} = 1] = p_{n,k}. If max_{1≤k≤r_n} p_{n,k} → 0 and ∑_{k=1}^{r_n} p_{n,k} → λ as n → ∞, then the sums S_n converge in distribution to the Poisson law. In other words, the Poisson distribution is the rare-events limit of the binomial (large n, small p).

As a simple special case, consider a triangular array of Bernoulli random variables X_{n,k} with P[X_{n,k} = 1] = p_n. If np_n → λ as n → ∞, then the sums S_n converge in distribution to the Poisson law.

To show this special case directly, we bound the i leading factors of n!, i.e. n(n − 1)⋯(n − i + 1), to get (n − i)^i / i! ≤ C(n, i) ≤ n^i / i!. Using the upper bound,

C(n, i) p_n^i (1 − p_n)^{n−i} ≤ (1/i!) (np_n)^i (1 − p_n)^{−i} (1 − np_n/n)^n,

where (np_n)^i → λ^i, (1 − p_n)^{−i} → 1, and (1 − np_n/n)^n → e^{−λ}. The lower bound gives the same limit because (n − i)^i = ((n − i)/n)^i n^i, where the first factor → 1.
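A quick MATLAB illustration of this rare-events limit (a sketch; n = 500 and λ = 3 are arbitrary choices):

n = 500; lambda = 3; k = 0:15;
pb = binopdf(k,n,lambda/n);       % B(n, lambda/n) pmf
pp = poisspdf(k,lambda);          % P(lambda) pmf
max(abs(pb-pp))                   % small for large n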


5.6 Compound Poisson

Given an arbitrary probability measure μ and a positive real number λ, the compound Poisson distribution CP(λ, μ) is the distribution of the sum ∑_{j=1}^{Λ} V_j, where the V_j are i.i.d. with distribution μ and Λ is a P(λ) random variable, independent of the V_j.

Sometimes it is written as POIS(λμ). The parameter λ is called the rate of CP(λ, μ), and μ is called the base distribution.

5.30. The mean and variance of CP (λ, L (V )) are λEV and λEV 2 respectively.

An important property of the Poisson and Compound Poisson laws is that their classes

are closed under convolution (independent summation). In particular, we have divisibility

properties (5.21) and (5.31) which are straightforward to prove from their characteristic

functions.

5.32. Divisibility property of the compound Poisson law: Suppose we have independent Λ_i ∼ CP(λ_i, μ^{(i)}). Then ∑_{i=1}^n Λ_i ∼ CP(λ, (1/λ) ∑_{i=1}^n λ_i μ^{(i)}), where λ = ∑_{i=1}^n λ_i.

Proof. Writing φ_{q_i} for the characteristic function of the base distribution μ^{(i)},

φ_{∑_i Λ_i}(t) = ∏_{i=1}^n e^{λ_i(φ_{q_i}(t) − 1)} = exp(∑_{i=1}^n λ_i(φ_{q_i}(t) − 1))
             = exp(∑_{i=1}^n λ_i φ_{q_i}(t) − λ) = exp(λ ((1/λ) ∑_{i=1}^n λ_i φ_{q_i}(t) − 1)).

We usually focus on the case when μ is a discrete probability measure on N = {1, 2, ...}, in which case we usually refer to μ by its pmf q on N; q is called the base pmf. Equivalently, CP(λ, q) is also the distribution of the sum ∑_{i∈N} i Λ_i, where the (Λ_i : i ∈ N) are independent with Λ_i ∼ P(λ q_i). Note that ∑_{i∈N} λ q_i = λ. The Poisson distribution is the special case of the compound Poisson distribution in which q is the point mass at 1.

5.33. The compound negative binomial [Bower, Gerber, Hickman, Jones, and Nesbitt,

1982, Ch 11] can be approximated by the compound Poisson distribution.

5.7 Hypergeometric

An urn contains N white balls and M black balls. One draws n balls without replacement, so n ≤ N + M. One gets X white balls and n − X black balls:

P[X = x] = C(N, x) C(M, n − x) / C(N + M, n)  for 0 ≤ x ≤ N and 0 ≤ n − x ≤ M,

and P[X = x] = 0 otherwise.


5.34. The hypergeometric distributions "converge" to the binomial distribution: assume that n is fixed while N and M increase to +∞ with lim N/(N + M) = p. Then

p(x) → C(n, x) p^x (1 − p)^{n−x}   (binomial).

Indeed, writing (a)_k = a(a − 1)⋯(a − k + 1),

p(x) = C(n, x) (N)_x (M)_{n−x} / (N + M)_n ≈ C(n, x) (N/(N + M))^x (M/(N + M))^{n−x}.

Intuitively, when N and M are large, there is not much difference between drawing n balls with or without replacement.
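A MATLAB check of this convergence (a sketch; the population sizes and sample size below are arbitrary):

N = 2000; M = 3000; n = 10; x = 0:n;
ph = hygepdf(x, N+M, N, n);            % hypergeometric: population N+M, N white, draw n
pb = binopdf(x, n, N/(N+M));           % limiting binomial
max(abs(ph-pb))                        % small when N and M are large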

5.35. Extension: If we have m colors and Ni balls of color i. The urn contains N =

N1 + · · · + Nm balls. One draws n balls without replacement. Call Xi the number of balls

of color i drawn among n balls. (Of course, X1 + · · · + Xm = n.)

P[X_1 = x_1, ..., X_m = x_m] = C(N_1, x_1) C(N_2, x_2) ⋯ C(N_m, x_m) / C(N, n)  if x_1 + ⋯ + x_m = n and x_i ≥ 0,

and 0 otherwise.

5.8 Negative Binomial Distribution (Pascal / Pólya distribution)

5.36. The probability that the rth success occurs on the (x + r)th trial is

p · C(x + r − 1, r − 1) p^{r−1} (1 − p)^x = C(x + r − 1, r − 1) p^r (1 − p)^x = C(x + r − 1, x) p^r (1 − p)^x,

i.e., among the first (x + r − 1) trials there are r − 1 successes and x failures.

• Fix r.

• φ_X(u) = p^r / (1 − (1 − p) e^{iu})^r.

• EX = rq/p, Var(X) = rq/p², where q = 1 − p.

• Note that if we define C(n, x) ≡ n(n − 1)⋯(n − (x − 1)) / (x(x − 1)⋯1), then

C(−r, x) ≡ (−1)^x r(r + 1)⋯(r + (x − 1)) / (x(x − 1)⋯1) = (−1)^x C(r + x − 1, x).


• If the X_i ∼ NegBin(r_i, p) are independent, then ∑_i X_i ∼ NegBin(∑_i r_i, p). This is easy to see from the characteristic function.

• p(x) = (Γ(r + x)/(Γ(r) x!)) p^r (1 − p)^x arises as a mixture of Poisson distributions whose mean is distributed as a gamma distribution Γ(q = r, λ = p/(1 − p)):

Let X be Poisson with mean λ, where the mean λ is chosen according to a probability distribution F_Λ(λ). Then φ_X(u) = φ_Λ(−i(e^{iu} − 1)) [see the mixed Poisson distribution, 5.19]. Here Λ ∼ Γ(q, λ_0); hence φ_Λ(u) = 1/(1 − iu/λ_0)^q, so

φ_X(u) = (1 − (e^{iu} − 1)/λ_0)^{−q} = ((λ_0/(λ_0 + 1)) / (1 − e^{iu}/(λ_0 + 1)))^q,

which is negative binomial with p = λ_0/(λ_0 + 1).

5.9 Beta-binomial distribution

A variable with a beta-binomial distribution is distributed as a binomial with parameter p, where p is itself distributed according to a beta distribution with parameters q_1 and q_2.

• P(k | p) = C(n, k) p^k (1 − p)^{n−k}, with p having density (Γ(q_1 + q_2)/(Γ(q_1)Γ(q_2))) p^{q_1−1} (1 − p)^{q_2−1} 1_{(0,1)}(p).

• pmf: P(k) = C(n, k) (Γ(q_1 + q_2)/(Γ(q_1)Γ(q_2))) (Γ(k + q_1)Γ(n − k + q_2)/Γ(q_1 + q_2 + n)) = C(n, k) β(k + q_1, n − k + q_2)/β(q_1, q_2).

• EX = n q_1/(q_1 + q_2).

• Var X = n q_1 q_2 (q_1 + q_2 + n) / ((q_1 + q_2)² (1 + q_1 + q_2)).

5.10 Zipf or zeta random variable

5.38. P[X = k] = (1/ξ(p)) (1/k^p), where k ∈ N, p > 1, and ξ is the zeta function defined in (A.11).

5.39. E[X^n] = ξ(p − n)/ξ(p) is finite for n < p − 1, and E[X^n] = ∞ for n ≥ p − 1.

6 PDF Examples

6.1 Uniform Distribution

6.1. Characterization for uniform [a, b]:

(a) f(x) = (1/(b − a)) U(x − a) U(b − x); that is, f(x) = 1/(b − a) for a ≤ x ≤ b, and 0 for x < a or x > b.


X ∼                     f_X(x)                                                              φ_X(u)
Uniform U(a, b)         (1/(b−a)) 1_[a,b](x)                                                e^{iu(b+a)/2} sin(u(b−a)/2) / (u(b−a)/2)
Exponential E(λ)        λ e^{−λx} 1_[0,∞)(x)                                                λ/(λ − iu)
Shifted Exp. (μ, s_0)   (1/(μ−s_0)) e^{−(x−s_0)/(μ−s_0)} 1_[s_0,∞)(x)
Truncated Exp.          (α/(e^{−αa} − e^{−αb})) e^{−αx} 1_[a,b](x)
Laplacian L(α)          (α/2) e^{−α|x|}                                                     α²/(α² + u²)
Normal N(m, σ²)         (1/(σ√(2π))) e^{−(1/2)((x−m)/σ)²}                                   e^{ium − σ²u²/2}
Normal N(m, Λ)          (2π)^{−n/2} (det Λ)^{−1/2} e^{−(1/2)(x−m)^T Λ^{−1}(x−m)}            e^{ju^T m − (1/2)u^T Λ u}
Gamma Γ(q, λ)           (λ^q x^{q−1} e^{−λx}/Γ(q)) 1_{(0,∞)}(x)                             1/(1 − iu/λ)^q
Pareto Par(α)           α x^{−(α+1)} 1_[1,∞)(x)
Par(α, c) = c Par(α)    (α/c) (c/x)^{α+1} 1_{(c,∞)}(x)
Beta β(q_1, q_2)        (Γ(q_1+q_2)/(Γ(q_1)Γ(q_2))) x^{q_1−1} (1−x)^{q_2−1} 1_{(0,1)}(x)
Beta prime              (Γ(q_1+q_2)/(Γ(q_1)Γ(q_2))) x^{q_1−1} (x+1)^{−(q_1+q_2)} 1_{(0,∞)}(x)
Rayleigh                2αx e^{−αx²} 1_[0,∞)(x)
Standard Cauchy         (1/π) 1/(1 + x²)
Cau(α)                  (1/π) α/(α² + x²)
Cau(α, d)               (Γ(d)/(√π α Γ(d − 1/2))) (1 + (x/α)²)^{−d}
Log Normal e^{N(μ,σ²)}  (1/(σx√(2π))) e^{−(1/2)((ln x − μ)/σ)²} 1_{(0,∞)}(x)

Here the parameters are positive and d > 1/2. γ = −ψ(1) ≈ 0.5772 is the Euler constant, ψ(z) = (d/dz) ln Γ(z) = Γ′(z)/Γ(z) is the digamma function, and B(q_1, q_2) = Γ(q_1)Γ(q_2)/Γ(q_1 + q_2) is the beta function.

(b) F(x) = 0 for x < a, (x − a)/(b − a) for a ≤ x ≤ b, and 1 for x > b.

(c) φ_X(u) = e^{iu(b+a)/2} · sin(u(b − a)/2) / (u(b − a)/2).

(d) M_X(s) = (e^{sb} − e^{sa}) / (s(b − a)).

6.2. For most purposes, it does not matter whether the value of the density f at the endpoints is 0 or 1/(b − a).

Example 6.3.

• phase φ ∼ U(−π, π)

• Use with caution to represent ignorance about a parameter taking values in [a, b].

6.4. EX = (a + b)/2, Var X = (b − a)²/12, E[X²] = (1/3)(b² + ab + a²).

6.5. F_X(x) = x − x ln x on [0, 1]. This comes from P[X > x] = ∫_x^1 (1 − F_U(x/t)) dt = ∫_x^1 (1 − x/t) dt.

6.6. Gaussian distribution:

(b) f_X(x) = (1/(√(2π) σ)) e^{−(1/2)((x − m)/σ)²}.

• The standard normal cdf is sometimes denoted by Φ(x). It inherits all properties of a cdf; moreover, Φ(−x) = 1 − Φ(x).

(d) φ_X(v) = E[e^{jvX}] = e^{jmv − (1/2)v²σ²}.

(e) M_X(s) = e^{sm + (1/2)s²σ²}.


[Figure: Gaussian pdf; about 68% of the probability lies within μ ± σ and about 95% within μ ± 2σ.]

(f) Fourier transform: F{f_X}(ω) = ∫_{−∞}^{∞} f_X(x) e^{−jωx} dx = e^{−jωm − (1/2)ω²σ²}.

(g) P[X > x] = P[X ≥ x] = Q((x − m)/σ) = 1 − Φ((x − m)/σ) = Φ(−(x − m)/σ);
P[X < x] = P[X ≤ x] = 1 − Q((x − m)/σ) = Q(−(x − m)/σ) = Φ((x − m)/σ).

6.7. Properties

(a) P[|X − μ| < σ] = 0.6827; P[|X − μ| > σ] = 0.3173;
P[|X − μ| > 2σ] = 0.0455; P[|X − μ| < 2σ] = 0.9545.

(b) Moments:

(i) E[(X − μ)^k] = (k − 1) σ² E[(X − μ)^{k−2}] = 0 for k odd, and 1·3·5⋯(k − 1) σ^k for k even.

(ii) E|X − μ|^k = 2·4·6⋯(k − 1) σ^k √(2/π) for k odd, and 1·3·5⋯(k − 1) σ^k for k even.

(iii) Var[X²] = 4μ²σ² + 2σ⁴.

(c) The first few moments:

n              0    1    2           3              4
E[X^n]         1    μ    μ² + σ²     μ(μ² + 3σ²)    μ⁴ + 6μ²σ² + 3σ⁴
E[(X − μ)^n]   1    0    σ²          0              3σ⁴

For the standard normal (μ = 0, σ = 1),

E[X^k] = (k − 1) E[X^{k−2}] = 0 for k odd, and 1·3·5⋯(k − 1) for k even.

The first equality comes from integration by parts. Observe also that E[X^{2m}] = (2m)!/(2^m m!).

(d) Lévy–Cramér theorem: If the sum of two independent non-constant random variables

is normally distributed, then each of the summands is normally distributed.


• Note that ∫_{−∞}^{∞} e^{−αx²} dx = √(π/α).

6.8. For X ∼ N(0, 1) and any Borel set B,

P[X ∈ B] ≤ ∫_{−|B|/2}^{|B|/2} f_X(x) dx = 1 − 2Q(|B|/2),

where |B| is the length (Lebesgue measure) of the set B. This is because the probability is concentrated around 0. More generally, for X ∼ N(m, σ²),

P[X ∈ B] ≤ 1 − 2Q(|B|/(2σ)).

6.9 (Stein's Lemma). Let X ∼ N(μ, σ²), and let g be a differentiable function satisfying E|g′(X)| < ∞. Then

E[g(X)(X − μ)] = σ² E[g′(X)]

[2, Lemma 3.6.5 p 124]. Note that this is simply integration by parts with u = g(x) and dv = (x − μ) f_X(x) dx.

• E[(X − μ)^k] = E[(X − μ)^{k−1}(X − μ)] = σ² (k − 1) E[(X − μ)^{k−2}].

6.10. Q-function: Q(z) = (1/√(2π)) ∫_z^∞ e^{−x²/2} dx corresponds to P[X > z] where X ∼ N(0, 1); that is, Q(z) is the probability of the "tail" of N(0, 1). The Q-function is thus a complementary cdf (ccdf).

[Figure: the Q-function of the N(0, 1) distribution, plotted for z ∈ [−3, 3]; Q(0) = 0.5 and Q(z) → 0 as z → ∞.]

(d) Craig's formula: Q(x) = (1/π) ∫_0^{π/2} e^{−x²/(2 sin²θ)} dθ = (1/π) ∫_0^{π/2} e^{−x²/(2 cos²θ)} dθ, x ≥ 0.

To see this, consider X, Y i.i.d. ∼ N(0, 1). Then

Q(z) = ∫∫_{(x,y)∈(z,∞)×R} f_{X,Y}(x, y) dx dy = 2 ∫_0^{π/2} ∫_{z/cos θ}^{∞} f_{X,Y}(r cos θ, r sin θ) r dr dθ,

where we evaluate the double integral using polar coordinates [9, Q7.22 p 322].

(e) Q²(x) = (1/π) ∫_0^{π/4} e^{−x²/(2 sin²θ)} dθ.

(f) (d/dx) Q(x) = −(1/√(2π)) e^{−x²/2}.

(g) (d/dx) Q(f(x)) = −(1/√(2π)) e^{−(f(x))²/2} f′(x).

(h) Integration by parts gives

∫_a^b Q(f(x)) g(x) dx = [Q(f(x)) ∫_a^x g(t) dt]_{x=a}^{x=b} + (1/√(2π)) ∫_a^b e^{−(f(x))²/2} f′(x) (∫_a^x g(t) dt) dx.

(i) P[X > x] = Q((x − m)/σ); P[X < x] = 1 − Q((x − m)/σ) = Q(−(x − m)/σ).

(j) Approximations:

(i) Q(z) ≈ [1/((1 − a)z + a√(z² + b))] (1/√(2π)) e^{−z²/2}, with a = 1/π and b = 2π.

(ii) (1 − 1/x²) (1/(x√(2π))) e^{−x²/2} ≤ Q(x) ≤ (1/2) e^{−x²/2}.

(iii) Q(z) ≈ (1/(z√(2π))) (1 − 0.7/z²) e^{−z²/2} for z > 2.
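A quick MATLAB comparison of approximation (j)(i) with the exact Q computed via erfc (a sketch; the range of z is arbitrary):

z = linspace(0.5,5,100);
Q  = 0.5*erfc(z/sqrt(2));                                   % exact Q(z)
a = 1/pi; b = 2*pi;
Qa = exp(-z.^2/2)./((1-a)*z + a*sqrt(z.^2+b))/sqrt(2*pi);   % approximation (j)(i)
semilogy(z,Q,z,Qa,'--')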

6.11. Error function (MATLAB): erf(z) = (2/√π) ∫_0^z e^{−x²} dx = 1 − 2Q(√2 z).

(b) For z ≥ 0, it corresponds to P[|X| < z] where X ∼ N(0, 1/2).

• erf(z) → 1 as z → ∞.

(e) Q(z) = (1/2) erfc(z/√2) = (1/2)(1 − erf(z/√2)).

(f) Φ(x) = (1/2)(1 + erf(x/√2)) = (1/2) erfc(−x/√2).

(g) Q^{−1}(q) = √2 erfc^{−1}(2q).

(h) The complementary error function: erfc(z) = 1 − erf(z) = 2Q(√2 z) = (2/√π) ∫_z^∞ e^{−x²} dx.

[Figure: the N(0, 1/2) density, with erf(z) the area over |x| < z and Q(√2 z) the tail area beyond z.]

6.12. Denoted by E(λ). It is in fact Γ(1, λ). λ > 0 is a parameter of the distribution, often called the rate parameter.

6.13. Characterized by

• F_X(x) = (1 − e^{−λx}) U(x);

• survival- (survivor-, reliability-) function: P[X > x] = e^{−λx} 1_[0,∞)(x) + 1_{(−∞,0)}(x);

• φ_X(u) = λ/(λ − iu);

• M_X(s) = λ/(λ − s) for Re{s} < λ.

6.14. EX = 1/λ and Var X = 1/λ².

6.15. median(X) = (1/λ) ln 2, mode(X) = 0, E[X^n] = n!/λ^n.

6.16. Coefficient of variation: CV = σ_X/EX = 1.

• ⌈X⌉ ∼ G_1(e^{−λ}). In general, for X ∼ E(λ), we have E[X^α] = Γ(α + 1)/λ^α for any α > −1; in particular, for n ∈ N ∪ {0}, E[X^n] = n!/λ^n.

• Central moments: μ_3 = 2/λ³ and μ_4 = 9/λ⁴.

6.21. Hazard function: f(x)/P[X > x] = λ.

6.24. MATLAB:

• X = exprnd(1/lambda)

• fX (x) = exppdf(x,1/lambda)

• FX (x) = expcdf(x,1/lambda)

6.25. Memoryless property: The exponential r.v. is the only continuous r.v. on [0, ∞) that satisfies the memoryless property

P[X > s + x | X > s] = P[X > x]

for all x > 0 and all s > 0 [13, p 157–159]. In words, the future is independent of the past: the fact that it hasn't happened yet tells us nothing about how much longer it will take before it does happen.

More generally, for x > 0 and a set B ⊂ [0, ∞), we have P[X ∈ B + x | X > x] = P[X ∈ B], because

P[X ∈ B + x] / P[X > x] = ∫_{B+x} λe^{−λt} dt / e^{−λx} = ∫_B λe^{−λ(τ+x)} dτ / e^{−λx} = ∫_B λe^{−λτ} dτ,

using the substitution τ = t − x.

6.27. Consider independent X_i ∼ E(λ_i). Let S_n = ∑_{i=1}^n X_i.

(a) min_i X_i ∼ E(∑_i λ_i). Recall order statistics; let Y_1 = min_i X_i.

(b) P[min_i X_i = X_j] = λ_j / ∑_i λ_i. Note that this is ∫_0^∞ f_{X_j}(t) ∏_{i≠j} P[X_i > t] dt.

6.29. If S_i, T_i are i.i.d. ∼ E(α), then

P[∑_{i=1}^n S_i < ∑_{j=1}^m T_j] = ∑_{i=0}^{m−1} C(n + m − 1, i) (1/2)^{n+m−1} = ∑_{i=0}^{m−1} C(n + i − 1, i) (1/2)^{n+i}.

Note that we can set up two Poisson processes and consider the superposed process: the event above is that the nth arrival of the S process comes before the mth arrival of the T process.

6.30. Characterizations: Fix α > 0.

(b) F(x) = (1 − x^{−α}) U(x − 1); that is, F(x) = 0 for x < 1 and F(x) = 1 − 1/x^α for x ≥ 1.

Example 6.31.

• distribution of wealth

6.32. Characterization: α > 0.

(c) F(x) = (1/2) e^{αx} for x < 0, and F(x) = 1 − (1/2) e^{−αx} for x ≥ 0.

(d) φ_X(u) = α²/(α² + u²).

(e) M_X(s) = α²/(α² − s²), −α < Re{s} < α.

6.33. EX = 0, Var X = 2/α².

Example 6.34.

• amplitudes of speech signals

• amplitudes of diﬀerences of intensities between adjacent pixels in an image

• If X and Y are independent E(λ), then X − Y is L(λ). (Easy proof via ch.f.)

6.6 Rayleigh

6.35. Characterizations:

(a) F(x) = (1 − e^{−αx²}) U(x).

(b) f(x) = 2αx e^{−αx²} U(x).

(c) P[X > t] = 1 − F(t) = e^{−αt²} for t ≥ 0, and 1 for t < 0.

(d) Use √(−2σ² ln U) to generate a Rayleigh(α = 1/(2σ²)) sample from U ∼ U(0, 1).

6.36. Read “ray -lee”

Example 6.37.

• noise X at the output of AM envelope detector when no signal is present

6.38. Relationship with other distributions

(a) Let X be a Rayleigh(α) r.v.; then Y = X² is E(α). Hence

E(α) --√(·)--> Rayleigh(α),  and Rayleigh(α) --(·)²--> E(α).   (17)

(b) Suppose X, Y i.i.d. ∼ N(0, σ²). Then R = √(X² + Y²) has a Rayleigh distribution with density

f_R(r) = (r/σ²) e^{−r²/(2σ²)},  r ≥ 0.

• Note that X², Y² i.i.d. ∼ Γ(1/2, 1/(2σ²)). Hence X² + Y² ∼ Γ(1, α = 1/(2σ²)), an exponential. By (17), √(X² + Y²) is a Rayleigh r.v. with α = 1/(2σ²).

• Alternatively, use the transformation from Cartesian to polar coordinates, (x, y) → (r, θ):

f_{R,Θ}(r, θ) = r f_{X,Y}(r cos θ, r sin θ) = r · (1/(σ√(2π))) e^{−(1/2)(r cos θ/σ)²} · (1/(σ√(2π))) e^{−(1/2)(r sin θ/σ)²}
             = (1/(2π)) (r/σ²) e^{−r²/(2σ²)}.

Hence the radius R and the angle Θ are independent, with R having a Rayleigh distribution while the angle Θ is uniformly distributed on (0, 2π).
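A MATLAB sanity check of item (b) (a sketch; sigma = 1.5 and the sample size are arbitrary choices):

sigma = 1.5; N = 1e5;
R = sqrt((sigma*randn(1,N)).^2 + (sigma*randn(1,N)).^2);
r = linspace(0,6,200);
fR = (r/sigma^2).*exp(-r.^2/(2*sigma^2));                % Rayleigh density above
histogram(R,'Normalization','pdf'); hold on; plot(r,fR); hold off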


6.7 Cauchy

6.39. Characterizations: Fix α > 0.

(a) f_X(x) = (1/π) α/(α² + x²), α > 0.

(b) F_X(x) = (1/π) tan^{−1}(x/α) + 1/2.

Note that Cau(α) is simply αX_1 where X_1 ∼ Cau(1). Also, because f_X is even, f_{−X} = f_X, and thus −X is still ∼ Cau(α).

6.40. Odd moments are not defined; even moments are infinite.

• Because the first moment is not defined, central moments, including the variance, are not defined.

• Mean and variance do not exist.

• Note that even though the pdf of the Cauchy distribution is an even function, the mean is not 0; it simply does not exist.

6.41. Suppose the X_i are independent Cau(α_i). Then ∑_i a_i X_i ∼ Cau(∑_i |a_i| α_i).

6.42. Suppose X ∼ Cau(α). Then 1/X ∼ Cau(1/α).

6.8 Weibull

6.43. For λ > 0 and p > 0, the Weibull(p, λ) distribution [9] is characterized by

(a) X = (Y/λ)^{1/p}, where Y ∼ E(1).

(d) E[X^n] = Γ(1 + n/p) / λ^{n/p}.

7 Expectation

Consider probability space (Ω, A, P )

7.1. Let X + = max (X, 0), and X − = − min (X, 0) = max (−X, 0). Then, X = X + − X − ,

and X + , X − are nonnegative r.v.’s. Also, |X| = X + + X −

7.2. A random variable X is integrable if and only if

≡ X has a ﬁnite expectation


≡ E|X| is finite.

≡ EX is finite ≡ EX is defined.

≡ X ∈ L¹.

≡ |X| ∈ L¹.

In which case,

EX = E[X⁺] − E[X⁻] = ∫ X(ω) P(dω) = ∫ X dP = ∫ x dP^X(x) = ∫ x P^X(dx),

and

∫_A X dP = E[1_A X].

Suppose E[X⁺] and E[X⁻] are not both equal to +∞. Then the expectation of X is still given by EX = E[X⁺] − E[X⁻], with the conventions +∞ + a = +∞ and −∞ + a = −∞ when a ∈ R.

For X ≥ 0 and p > 0, the following are equivalent:

(a) (E[X^p])^{1/p} = 0;

(b) E[X^p] = 0;

(c) X = 0 a.s.

7.6. X = Y ⇒ EX = EY

• FX (x) = E 1(−∞,x] (X)

7.8. Expectation rule: Let X be a r.v. on (Ω, A, P ), with values in (E, E), and distri-

bution P X . Let h : (E, E) → (R, B) be measurable. If

• X ≥ 0 or

• h (X) ∈ L 1 (Ω, A, P ) which is equivalent to h ∈ L 1 E, E, P X

then

• E [h (X)] = h (X (ω)) P (dω) = h (x) P X (dx)

• h (X (ω)) P (dω) = h (x) P X (dx)

[X∈G] G


7.9. Expectation of an absolutely continuous random variable: Suppose X has density f_X. Then h is P^X-integrable if and only if h·f_X is integrable w.r.t. Lebesgue measure, in which case

E[h(X)] = ∫ h(x) P^X(dx) = ∫ h(x) f_X(x) dx

and

∫_G h(x) P^X(dx) = ∫_G h(x) f_X(x) dx.

• Caution: if f_X is an even function and h is an odd function, we cannot automatically conclude that E[h(X)] = 0. One obvious odd function is h(x) = x. For example, in (6.40), when X is Cauchy, the expectation does not exist even though the pdf is an even function. Of course, if we also know that h(X) is integrable, then E[h(X)] = 0.

For a discrete random variable X,

EX = ∑_x x P[X = x]

and

E[g(X)] = ∑_x g(x) P[X = x].

Similarly,

E[g(X, Y)] = ∑_x ∑_y g(x, y) P[X = x, Y = y].

These are called the law/rule of the lazy statistician (LOTUS) [21, Thm 3.6 p 48], [9, p 149].
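A small numerical illustration of LOTUS (a sketch; the pmf and the function g below are arbitrary choices):

x = [-1 0 2 5]; p = [.1 .2 .3 .4];          % an arbitrary pmf
g = @(t) t.^2 + 1;
lotus = sum(g(x).*p)                        % E[g(X)] via LOTUS
X = randsample(x, 1e5, true, p);            % simulate X
mc = mean(g(X))                             % Monte Carlo estimate, close to lotus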

7.10. ∫_E P[X ≥ t] dt = ∫_E P[X > t] dt and ∫_E P[X ≤ t] dt = ∫_E P[X < t] dt.

7.11. (a) For nonnegative X,

EX = ∫_0^∞ P[X > y] dy = ∫_0^∞ (1 − F_X(y)) dy = ∫_0^∞ P[X ≥ y] dy,   (18)

and, for p > 0,

E[X^p] = ∫_0^∞ p x^{p−1} P[X > x] dx.


(b) For integrable X,

EX = ∫_0^∞ (1 − F_X(x)) dx − ∫_{−∞}^0 F_X(x) dx.

(c) For nonnegative integer-valued X, EX = ∑_{k=0}^∞ P[X > k] = ∑_{k=1}^∞ P[X ≥ k].

Definition 7.12.

(a) Absolute moment: E|X|^k = ∫ |x|^k P^X(dx), where we define E|X|⁰ = 1.

(b) Moment: if E|X|^k < ∞, then m_k = E[X^k] = ∫ x^k P^X(dx) is the kth moment of X.

(c) Variance:

Var X = E[(X − EX)²] = ∫ (x − EX)² P^X(dx) = E[X²] − (EX)² = E[X(X − EX)].

• Notation: DX, or σ²(X), or σ²_X, or VX [21, p 51].

• If the X_i ∈ L², then ∑_{i=1}^k X_i ∈ L², Var[∑_{i=1}^k X_i] exists, and E[∑_{i=1}^k X_i] = ∑_{i=1}^k EX_i.

(d) Standard deviation: σ_X = √(Var[X]).

(e) Coefficient of variation: CV_X = σ_X/EX.

(f) Fano factor (index of dispersion): Var X / EX.

(g) Central moments: μ_n = E[(X − EX)^n].

(i) μ_1 = E[X − EX] = 0.

(ii) μ_2 = σ²_X = Var X.

(iii) μ_n = ∑_{k=0}^{n} C(n, k) m_{n−k} (−m_1)^k.

(iv) m_n = ∑_{k=0}^{n} C(n, k) μ_{n−k} m_1^k.

(h) Skewness coefficient: γ_X = μ_3/σ³_X.

(i) It describes the deviation of the distribution from a symmetric shape (around the mean).

(ii) It is 0 for any symmetric distribution.

(i) Kurtosis: κ_X = μ_4/σ⁴_X.

(j) Excess coefficient: ε_X = κ_X − 3 = μ_4/σ⁴_X − 3.

(k) Cumulants or semi-invariants: for one variable, γ_k = (1/j^k) (∂^k/∂v^k) ln(φ_X(v)) |_{v=0}.

(i) γ_1 = EX = m_1;
γ_2 = E[(X − EX)²] = m_2 − m_1² = μ_2;
γ_3 = E[(X − EX)³] = m_3 − 3m_1 m_2 + 2m_1³ = μ_3;
γ_4 = m_4 − 3m_2² − 4m_1 m_3 + 12m_1² m_2 − 6m_1⁴ = μ_4 − 3μ_2².

(ii) m_1 = γ_1;
m_2 = γ_2 + γ_1²;
m_3 = γ_3 + 3γ_1 γ_2 + γ_1³;
m_4 = γ_4 + 3γ_2² + 4γ_1 γ_3 + 6γ_1² γ_2 + γ_1⁴.

Summary of moment measures (population formulas for continuous and discrete distributions, and the usual sample estimates):

First (location): mean μ_x = ∫_{−∞}^{∞} x f_x(x) dx = ∑_k x_k p(x_k); sample x̄ = (1/n) ∑ x_i; E(X) = μ_x.

Second (dispersion): variance Var(X) = μ_2 = σ²_x = ∫_{−∞}^{∞} (x − μ_x)² f_x(x) dx = ∑_k (x_k − μ_x)² p_x(x_k); sample s² = (1/(n−1)) ∑ (x_i − x̄)². Standard deviation σ_x = √(Var(X)); sample s. Coefficient of variation σ_x/μ_x; sample C_v = s/x̄.

Third (asymmetry): skewness μ_3 = ∫ (x − μ_x)³ f_x(x) dx = ∑_k (x_k − μ_x)³ p_x(x_k); sample m_3 = n/((n−1)(n−2)) ∑ (x_i − x̄)³. Skewness coefficient γ_x = μ_3/σ³_x; sample g = m_3/s³.

Fourth (peakedness): kurtosis μ_4 = ∫ (x − μ_x)⁴ f_x(x) dx = ∑_k (x_k − μ_x)⁴ p_x(x_k); sample m_4 = n(n+1)/((n−1)(n−2)(n−3)) ∑ (x_i − x̄)⁴. Kurtosis coefficient κ_x = μ_4/σ⁴_x and excess coefficient ε_x = κ_x − 3; sample k = m_4/s⁴.

7.13.

• For c ∈ R, E [c] = c

• E [·] is a linear operator: E [aX + bY ] = aEX + bEY .

• In general, Var[·] is not a linear operator.

7.14. All pairs of mean and variance are possible. A random variable X with EX = m and Var X = σ² can be constructed by setting P[X = m − σ] = P[X = m + σ] = 1/2.

Deﬁnition 7.15.


Model             E[X]                              Var[X]
U{0,1,...,n−1}    (n−1)/2                           (n² − 1)/12
B(n, p)           np                                np(1 − p)
G_0(β)            β/(1 − β)                         β/(1 − β)²
P(λ)              λ                                 λ
U(a, b)           (a + b)/2                         (b − a)²/12
E(λ)              1/λ                               1/λ²
Par(α)            α/(α − 1) if α > 1; ∞ if 0 < α ≤ 1    undefined if 0 < α ≤ 1; ∞ if 1 < α ≤ 2; α/((α − 2)(α − 1)²) if α > 2
L(α)              0                                 2/α²
N(m, σ²)          m                                 σ²
Γ(q, λ)           q/λ                               q/λ²

• Covariance:

Cov[X, Y] = E[(X − EX)(Y − EY)] = E[XY] − EX·EY = E[X(Y − EY)] = E[Y(X − EX)].

• X and Y are uncorrelated if and only if Cov[X, Y] = 0, i.e., E[XY] = EX·EY.

• Correlation coefficient (autocorrelation, normalized covariance):

ρ_XY = Cov[X, Y]/(σ_X σ_Y) = E[((X − EX)/σ_X)((Y − EY)/σ_Y)] = (E[XY] − EX·EY)/(σ_X σ_Y).

7.16. Properties

(a) Var X = Cov[X, X]; ρ_{X,X} = 1.

(b) (Cov[X, Y])² ≤ σ²_X σ²_Y, with equality if and only if σ²_Y (X − EX)² = σ²_X (Y − EY)² a.s.

When σ_X, σ_Y > 0, equality occurs if and only if the following equivalent conditions hold:

≡ ∃ c ≠ 0 and b ∈ R such that Y = cX + b;

≡ ∃ a ≠ 0 and b ∈ R such that X = aY + b;

≡ |ρ_XY| = 1.

• ρ_{aX+b, X} = a/|a| = sgn a. Hence ρ_XY is used to quantify linear dependence between X and Y: the closer |ρ_XY| is to 1, the higher the degree of linear dependence between X and Y.

(f) Linearity:

(i) Let Y_i = a_i X_i + b_i.

i. Cov[Y_1, Y_2] = Cov[a_1 X_1 + b_1, a_2 X_2 + b_2] = a_1 a_2 Cov[X_1, X_2].

ii. ρ is preserved under linear transformation: ρ_{Y_1,Y_2} = sgn(a_1 a_2) ρ_{X_1,X_2}. In particular, if Y = aX + b with a ≠ 0, then ρ_{X,Y} = sgn a.

• Variance of a sum:

Var[∑_{i∈I} a_i X_i] = ∑_{i∈I} a_i² Var X_i + 2 ∑_{i<j} a_i a_j Cov[X_i, X_j].

In particular,

Var(X + Y) = Var X + Var Y + 2 Cov[X, Y]

and

Var(X − Y) = Var X + Var Y − 2 Cov[X, Y].

• Bilinearity:

Cov[∑_{i∈I} a_i X_i, ∑_{j∈J} b_j Y_j] = ∑_{i∈I} ∑_{j∈J} a_i b_j Cov[X_i, Y_j].

(k) Covariance inequality: Let X be any random variable and g and h any functions such that E[g(X)], E[h(X)], and E[g(X)h(X)] exist. If g and h are either both non-decreasing or both non-increasing, then

E[g(X)h(X)] ≥ E[g(X)] E[h(X)].

(See 8.13 for the full statement.)

• Uncorrelated does not imply independent. Discrete example: suppose X has an even pmf and Y = g(X) where g is an even function; then x g(x) p_X(x) is odd, so E[XY] = 0 = E[X] = E[X]E[Y], i.e., Cov[X, Y] = 0. To see that X and Y need not be independent, consider a point x_0 with p_X(x_0) > 0. Then p_{X,Y}(x_0, g(x_0)) = p_X(x_0), so it suffices to show that p_Y(g(x_0)) ≠ 1. For example, let X be uniform on {±1, ±2} and Y = |X|, and consider the point x_0 = 1.

• Continuous example: let Θ be uniform on an interval of length 2π. Set X = cos Θ and Y = sin Θ. See (11.6).

are independent random variables with P [Xi = 1] = p and P [Xi = 0] = 1 − p. Let

N = inf {i : Xi = 1}. Also deﬁne

⎧

⎨ 0, N =0

L(N ) = N

−1

⎩ a i

r , N ∈N

i=0

k−1

and G(N ) = arN − L(N ). To have G(N ) > 0, need ri < rk ∀k ∈ N which turns out to

i=0

require r ≥ 2. In fact, for r ≥ 2, we have G(N ) ≥ a ∀N ∈ N ∪ {0}. Hence, E [G(N )] ≥ a.

It is exactly a when r = 2.

Now, E [L (N )] = a ∞ rn −1 n

n=1 r−1 p (1 − p) = ∞ if and only if r(1 − p) ≥ 1. When

1 − p ≤ 12 , because we already have r ≥ 2, it is true that r(1 − p) ≥ 1.

8 Inequalities

8.1. Let (A_i : i ∈ I) be a finite family of events. Then

(∑_i P(A_i))² / ∑_i ∑_j P(A_i ∩ A_j) ≤ P(∪_i A_i) ≤ ∑_i P(A_i).


8.2. [19, p 14]

−(1 − 1/n)^n ≤ P(∩_{i=1}^n A_i) − ∏_{i=1}^n P(A_i) ≤ (n − 1) n^{−n/(n−1)}.

As n → ∞, the lower bound decreases to −1/e ≈ −0.37 and the upper bound increases to 1.

[Figure 18: plots of (1 − 1/x)^x and (x − 1) x^{−x/(x−1)}, bounding P(∩_{i=1}^n A_i) − ∏_{i=1}^n P(A_i).]

[Figure 19: Proof of Markov's inequality: a · 1[x ≥ a] ≤ x for x ≥ 0.]

8.3 (Markov's inequality). For a > 0, P[|X| ≥ a] ≤ E|X| / a.

(a) It is useless when a ≤ E|X|; hence it is good for bounding the "tails" of a distribution.


(b) Remark: P[|X| > a] ≤ P[|X| ≥ a].

Extensions (apply Markov's inequality to a transformed variable): for nonnegative g, p > 0, and α > 0,

(i) P[g(X) ≥ α] ≤ E[(g(X))^p] / α^p;

(ii) P[g(X − EX) ≥ α] ≤ E[(g(X − EX))^p] / α^p;

and P[|X| ≥ a] ≤ E[X²]/a², a > 0.

Chebyshev-type inequalities:

(i) P[|X| ≥ α] ≤ E[|X|^p] / α^p.

(ii) P[|X − EX| ≥ α] ≤ σ²_X / α²; that is, P[|X − EX| ≥ nσ_X] ≤ 1/n².

• Useful only when α > σ_X.

(iii) For a < b, P[a ≤ X ≤ b] ≥ 1 − (4/(b − a)²)(σ²_X + (EX − (a + b)/2)²).

One-sided (Cantelli-type) inequalities:

(i) If EX = 0, then P[X ≥ a] ≤ E[X²] / (E[X²] + a²).

(ii) For general X,

i. P[X ≥ EX + a] ≤ σ²_X / (σ²_X + a²); that is, P[X ≥ EX + nσ_X] ≤ 1/(1 + n²);

ii. P[X ≤ EX − a] ≤ σ²_X / (σ²_X + a²); that is, P[X ≤ EX − nσ_X] ≤ 1/(1 + n²);

iii. P[|X − EX| ≥ a] ≤ 2σ²_X / (σ²_X + a²); that is, P[|X − EX| ≥ nσ_X] ≤ 2/(1 + n²). This is a better bound than σ²_X/a² iff σ_X > a.

Chernoff bounds:

(i) P[X ≤ b] ≤ E[e^{−θX}] / e^{−θb} for all θ > 0;

(ii) P[X ≥ b] ≤ E[e^{θX}] / e^{θb} for all θ > 0.

8.4. Suppose |X| ≤ M a.s. Then P[|X| ≥ a] ≥ (E|X| − a)/(M − a) for all a ∈ [0, M).

8.5. If X ≥ 0 and E[X²] < ∞, then P[X > 0] ≥ (EX)² / E[X²].

Deﬁnition 8.6. If p and q are positive real numbers such that p + q = pq, or equivalently,

1

p

+ 1q = 1, then we call p and qa pair of conjugate exponents.

• 1 < p, q < ∞

• As p → 1, q → ∞. Consequently, 1 and ∞ are also regarded as a pair of conjugate

exponents.


8.7. Hölder’s Inequality : X ∈ Lp , Y ∈ Lq , p > 1, 1

p

+ 1

q

= 1. Then,

(a) XY ∈ L1

1 1

(b) E [|XY |] ≤ (E [|X|p ]) p (E [|Y |q ]) q with equality if and only if

E [|Y |q ] |X (ω)|p = E [|X|p ] |Y (ω)|q a.s.

1 1

|E [XY ]| ≤ E [|XY |] ≤ E |X|2 2 E |Y |2 2 or equivalently,

2

(E [XY ])2 ≤ E X 2 E Y

with equality if and only if E [Y 2 ] X 2 = E [X 2 ] Y 2 a.s.

(a) |Cov (X, Y )| ≤ σX σY .

(b) (EX)2 ≤ EX 2

(c) (P (A ∩ B))2 ≤ P (A)P (B)

8.9. Minkowski's inequality: for p ≥ 1, X, Y ∈ L^p ⇒ X + Y ∈ L^p and (E[|X + Y|^p])^{1/p} ≤ (E[|X|^p])^{1/p} + (E[|Y|^p])^{1/p}.

8.10. p > q > 0

(a) E [|X|q ] ≤ 1 + E [|X|p ]

1 1

(b) Lyapounov’s inequality : (E [|X|q ]) q ≤ (E [|X|p ]) p

= 1

• E [|X|] ≤ E |X|2 ≤ E |X|3 3 ≤ · · ·

2) X ∈ (a, b) a.s.; and 3) ϕ is convex on (a, b), then ϕ (EX) ≤ E [ϕ (X)]

• For X > 0 (a.s.), E X1 ≥ EX 1

.

8.12.

• For p ∈ (0, 1], E [|X + Y |p ] ≤ E [|X|p ] + E [|Y |p ].

• For p ≥ 1, E [|X + Y |p ] ≤ 2p−1 (E [|X|p ] + E [|Y |p ]).

8.13 (Covariance Inequality). Let X be any random variable and g and h any function

such that E [g(X)], E [h(X)], and E [g(X)h(X)] exist.

• If g and h are either both non-decreasing or non-increasing, then

E [g(X)h(X)] ≥ E [g(X)] E [h(X)] .

In particular, for nondecreasing g, E [g(X)(X − EX)] ≥ 0.

• If g is non-decreasing and h is non-increasing, then

E [g(X)h(X)] ≤ E [g(X)] E [h(X)] .


9 Random Vectors

In this article, a vector is a column matrix with dimension n × 1 for some n ∈ N. We use

1 to denote a vector with all element being 1. Note that 1(1T ) is a square matrix with all

element being 1. Finally, for any matrix A and constant a, we deﬁne the matrix A + a to

be the matrix A with each of the components are added by a. If A is a square matrix, then

A + a = A + a1(1T ).

Deﬁnition 9.1. Suppose I is an index set. When Xi ’s are random variables, we de-

ﬁne a random vector XI by XI = (Xi : i ∈ I). For example, if I = [n], we have XI =

(X1 , X2 , . . . , Xn ). Note also that X[n] is usually denoted by X1n . Sometimes, we simply

write X to denote X1n .

• For vector xI , yI , we say x ≤ y if ∀i ∈ I, xi ≤ yi [7, p 206].

• When the dimension of X is implicit, we simply write X and x to represent X1n and

xn1 , respectively.

• For random vector X, Y , we use (X, Y ) to represent the random vector X or

T T T Y

equivalently X Y .

Deﬁnition 9.2. Half-open cell or bounded rectangle in Rk is set of the form Ia,b =

k

{x : ai < xi ≤ bi , ∀i ∈ [k]} = × (ai , bi ]. For a real function F on Rk , the diﬀerence of F

i=1

around the vertices of Ia,b is

ΔIa,b F = sgnIa,b (v) F (v) = (−1)|{i:vi =ai }| F (v) (21)

v v

where the sum extending over the 2k vertices v of Ia,b . (The ith coordinate of the vertex v

could be either ai or bi .) In particular, for k = 2, we have

( *

FX (x) = FX1k xk1 = P [X1 ≤ x1 , . . . , Xk ≤ xk ] = P X ∈ Sxk1 = P X (Sx )

• ΔIa,b FX ≥ 0

the direction of the ﬁrst orthant) speciﬁed by the point x [7, p 206].

h 0


C3 xi → −∞ for some i (the other coordinates held ﬁxed), then FX (x) → 0

If ∀i xi → ∞, then FX (x) → 1.

h 0

interior of Sx

This comes from (9) with Ai = [Xi ≤ ai ] and B = [∀ Xi ≤ bi ]. Note that

ai , i ∈ I,

P [∩i∈I Ai ∩ B] = F (v) where vi =

bi , otherwise.

• For any function F on Rk with satisﬁes (C1), (C2), and (C3), there is a unique

probability measure μ on BRk such that ∀a, ∀b ∈ Rk with a ≤ b, we have μ (Ia,b ) =

ΔIa,b F (and ∀x ∈ Rk μ (Sx ) = F (x)).

• TFAE:

(a) FX is continuous at x

(b) FX is continuous from below

(c) FX (x) = P X (Sx◦ )

(d) P X (Sx ) = P X (Sx◦ )

(e) P X (∂Sx ) = 0 where ∂Sx = Sx − Sx◦ = {y : yi ≤ xi ∀i, ∃j yj = xj }

xj →∞

9.4 (Joint pdf ). A function f is a multivariate or joint pdf (commonly called a density)

if and only if it satisﬁes the following two conditions:

(a) f ≥ 0;

(b) f (x)d(x) = 1.

lim f (xI ) = 0.

xi →±∞

• P [X ∈ A] = A

fX (x)dx.

• Remarks: Roughly, we may say the following:

<Xi ≤xi +Δxi ]

P [∀i, xi P [x<X≤x+δx]

(a) fX (x) = lim = lim

∀i, Δxi →0 i Δxi Δx→0 i Δxi


(b) For I = [n],

x1 xn

◦ FX (x) = −∞ . . . −∞ fX (x)dxn . . . dx1

∂n

◦ fX (x) = ∂x1 ···∂xn FX (x) .

(k) (k) u, j = k

◦ ∂u FX (u, . . . , u) =

∂

fX v dx[n]\{k} where vj = .

k∈N (−∞,u]n k

xj , j =

∂

u u

For example, ∂u FX,Y (u, u) = fX,Y (x, u)dx + fX,Y (u, y) dy.

−∞ −∞

5 6

n

(c) fX1n (xn1 ) = E δ (Xi − xi ) .

i=1

9.5. Consider two random vectors X : Ω → Rd1 and Y : Ω → Rd2 . Deﬁne Z = (X, Y ) :

Ω → Rd1 +d2 . Suppose that Z has density fX,Y (x, y).

(a) Marginal Density : fY (y) = fX,Y (x, y)dx and fX (x) = fX,Y (x, y)dy.

Rd 1 Rd 2

• In other words, to obtain the marginal densities, integrate out the unwanted

variables.

• fXI\{i} (xI\{i} ) = fXI (xI )dxi .

(b) f_{Y|X}(y|x) = f_{X,Y}(x, y) / f_X(x).

y1 yd

(c) FY |X (y|x) = −∞

··· −∞

2

fY |X (t|x)dtd2 · · · dt1 .

9.6. P [(X + a1 , X + b1 ) ∩ (Y + a2 , Y + b2 ) = ∅] = A

fX,Y (x, y)dxdy where A is deﬁned

in (1.10).

9.7. Expectation and covariance:

(a) The expectation of a random vector X is deﬁned to be the vector of expectations of

its entries. EX is usually denoted by μX or mX .

(b) For non-random matrix A, B, C and a random vector X, E [AXB + C] = AEXB +C.

RX = E XX T .

CX = ΛX = Cov [X] = E (X − EX)(X − EX)T = E XX T − (EX)(EX)T

= RX − (EX)(EX)T .


(ii) ΛX is symmetric.

i. Properties of symmetric matrix

A. All eigenvalues are real.

B. Eigenvectors corresponding to diﬀerent eigenvalues are not just linearly

independent, but mutually orthogonal.

C. Diagonalizable.

ii. Spectral theorem: The following equivalent statements hold for symmet-

ric matrix.

A. There exists a complete set of eigenvectors; that is there exists an or-

thonormal basis u(1) , . . . , u(n) of R with CX u(k) = λk u(k) .

B. CX is diagonalizable by an orthogonal matrix U (U U T = U T U = I).

C. CX can be represented as CX = U ΛU T where U is an orthogonal matrix

whose columns are eigenvectors of CX and λ = diag(λ1 , . . . , λn ) is a

diagonal matrix with the eigenvalues of CX .

(iii) Always nonnegative deﬁnite (positive

( semideﬁnite).* That is ∀a ∈ Rn where n is

2

the dimension of X, aT CX a = E aT (X − μX ) ≥ 0.

• det (CX ) ≥ 0.

1 √ √ √ √ √ √

(iv) We can deﬁne CX2 = CX to be CX = U ΛU T where Λ = diag( λ1 , . . . , λn ).

√ √

i. det CX = det CX .

√

ii. CX is nonnegative deﬁnite.

√ 2 √ √

iii. CX = CX CX = CX .

(v) Suppose, furthermore, that CX is positive deﬁnite.

−1

i. CX = U Λ−1 U T where Λ−1 = diag( λ11 , . . . , λ1n ).

= √

− 12 −1 −1 T 1 1

ii. CX = CX = ( CX ) = U DU where D = √ √

, . . . , λn

λ1

√ −1

√

iii. CX CX CX = I.

1

−1 −1

iv. CX , CX2 , CX 2 are all positive deﬁnite (and hence are all symmetric).

1 2

− −1

v. CX 2 = CX .

−1

vi. Let Y = CX 2 (X − EX). Then, EY = 0 and CY = I.

(vi) For i.i.d. Xi with each with variance σ 2 , ΛX = σ 2 I.

• Cov X T h = Cov hT X = hT Cov [X] h where h is a vector with the same

dimension as X.


• For Yi = X + Zi where X and Z are independent, ΛY = σX

2

+ ΛZ .

(g) ΛX+Y + ΛX−Y = 2ΛX + 2ΛY

(h) det (ΛX+Y ) ≤ 2n det (ΛX + ΛY ) where n is the dimension of X and Y .

2

(i) Y =⎡ (X, X, . . . ⎤

, X) where X is a random variable with variance σX , then ΛY =

1 ··· 1

2 ⎢ .. . ⎥

σX ⎣ . . . . .. ⎦. Note that Y = 1X where 1 has the same dimension as Y .

1 ··· 1

(j) Let X be a zero-mean random vector whose covariance matrix is singular. Then, one

of the Xi is a deterministic linear combination of the remaining components. In other

words, there is a nonzero vector a such that aT X = 0. In general, if ΛX is singular,

then there is a nonzero vector a such that aT X = aT EX.

(k) If X and Y are both random vectors (not necessarily of the same dimension), then

their cross-covariance matrix is

ΛXY = CXY = Cov [X, Y ] = E (X − EX)(Y − EY )T .

Note that the ij-entry of CXY is Cov [Xi , Yj ].

• CY X = (CXY )T .

(l) RXY = E XY T .

(m) If we stack X and Y in to a composite vector Z = XY

, then

CX CXY

CZ = .

CY X CY

(n) X and Y are said to be uncorrelated if CXY = 0, the zero matrix. In which case,

CX 0

C(X ) = ,

Y 0 CY

a block diagonal matrix.

9.8. The joint characteristic function of an n-dimensional random vector X is deﬁned

by ( T *

ϕX (v) = E ejv X = E ej i vi Xi .

When X has a joint density fX , ϕX is just the n-dimensional Fourier transform:

!

T

ϕX (v) = ejv x fX (x)dx,

and the joint density can be recovered using the multivariate inverse Fourier transform:

!

1

e−jv x ϕX (v)dv.

T

fX (x) = n

(2π)


TX

(a) ϕX (u) = Eeiu .

e−jv x ϕX (v)dv.

1 T

(b) fX (x) = (2π)n

T

(c) For Y = AX + b, ϕY (u) = eib u ϕX AT u .

n

n

υi

υi

n

(f) Moment: υ

∂ i=1

υ υn ϕX

∂v1 1 ∂v2 2 ···∂vn

(0) = j i=1 E Xiυi

i=1

∂

(i) ϕ

∂vi X

(0) = jEXi .

∂2

(ii) ϕ

∂vi ∂vj X

(0) = j 2 E [Xi Xj ] .

(i) ∂

∂vi

ln (ϕX (v)) = jEXi .

v=0

∂2

(ii) ∂vi ∂vj

ln (ϕX (v)) = −Cov [Xi , Xj ] .

v=0

∂3

(iii) ∂vi ∂vj ∂vk

ln (ϕX (v)) = j 3 E [(Xi − EXi ) (Xj − EXj ) (Xk − EXk )] .

v=0

(iv) E [(Xi − EXi ) (Xj − EXj ) (Xk − EXk ) (X − EX )] = Ψijk + Ψij Ψk + Ψik Ψj +

4

Ψi Ψjk where Ψijk = ∂vi ∂v∂j ∂vk ∂v ln (ϕX (v)) .

v=0

9.9 (Karhunen–Loève expansion). Let X be an n-dimensional random vector with zero mean and covariance matrix C. Then X has the representation X = PY, where the components of Y are uncorrelated and P is an n × n orthogonal matrix. This representation is called the Karhunen–Loève expansion.

• Y = P^T X.

• P^T = P^{−1} is called a decorrelating transformation.

• Diagonalize C = P D P^T, where D = diag(λ_1, ..., λ_n). Then Cov[Y] = D.

• In MATLAB, use [P,D] = eig(C). To extract the diagonal elements of D as a vector, use the command d = diag(D).

• If C is singular (equivalently, if some of the λ_i are zero), we only need to keep the Y_i for which λ_i > 0 and can throw away the other components of Y without any loss of information. This is because λ_i = E[Y_i²], and E[Y_i²] = 0 if and only if Y_i ≡ 0 a.s. [9, p 338–339].
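A short MATLAB sketch of the decorrelating transformation (the covariance matrix C below is an arbitrary example):

C = [4 1.5; 1.5 1];                   % an arbitrary positive definite covariance
[P,D] = eig(C);
X = chol(C,'lower')*randn(2,1e5);     % zero-mean samples with covariance approx C
Y = P'*X;                             % decorrelated components
cov(Y')                               % approximately diag(D)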


9.1 Random Sequence

9.10. [10, p 9–10] Given a countable family X1 , X2 , . . . of random r.v.’s, their statistical

properties are regarded as deﬁned by prescribing, for each integer n ≥ 1 and every ﬁnite

set I ⊂ N, the joint distribution function FXI of the random vector XI = (Xi : i ∈ I).

Of course, some consistency requirements must be imposed upon the inﬁnite family FXI ,

namely, that for j ∈ I

(a) FXI\{j} xI\{j} = lim FXI (xI ) and that

xj →∞

(b) the distribution function obtained from FXI (xI ) by interchanging two of the indices

i1 , i2 ∈ I and the corresponding variable xi1 and xi2 should be invariant. This simply

means that the manner of labeling the random variables X1 , X2 , . . . is not relevant.

The joint distributions {FXI } are called the ﬁnite-dimensional distributions associated

with XN = (Xn )∞n=1 .

10 Transform Methods

10.1 Probability Generating Function

Deﬁnition 10.1. [9][10, p 11] Let X be a discrete random variable taking only nonnegative

integer values. The probability generating function (pgf) of X is

G_X(z) = E[z^X] = ∑_{k=0}^{∞} z^k P[X = k].

• GX (0) = P [X = 0]

• G(z −1 ) is the z transform of the pmf.

• GX (1) = 1.

• The names derives from the fact that it can be used to compute the pmf.

• It is ﬁnite at least for any complex z with |z| ≤ 1. Hence pgf is well deﬁned for

|z| ≤ 1.

Definition 10.2. G_X^{(k)}(1) = lim_{z↑1} G_X^{(k)}(z).

10.3. Properties

(a) GX is inﬁnitely diﬀerentiable at least for |z| < 1.

1 d(k)

(k)

GX (z) = P [X = k].

k! dz z=0

86

(c) Moment generating property:

8k−1 9

d(k)

GX (z) =E (X − i) .

dz (k) z=1 i=0

(d) In particular,

EX = GX (1)

EX 2 = GX (1) + GX (1)

Var X = GX (1) + GX (1) − (GX (1))

2

n of independent random variables is the product of the individual pgfs.

Let S = i=1 Xi where the Xi ’s are independent.

n

GS (z) = GXi (z).

i=1

10.2 Moment Generating Function

Definition 10.4. The moment generating function of a random variable X is defined as M_X(s) = E[e^{sX}] = ∫ e^{sx} P^X(dx), for all s for which this is finite.

10.5. Properties of moment generating funciton

(a) MX (s) is deﬁned on some interval containing 0. It is possible that the interval consists

of 0 alone.

(i) If X ≥ 0, this interval contains (−∞, 0].

(ii) If X ≤ 0, this interval contains [0, ∞).

(b) Suppose that M (s) is deﬁned throughout an interval (−s0 , s0 ) where s0 > 0, i.e. it

exists (is ﬁnite) in some neighborhood of 0. Then,

( *

(i) X has ﬁnite moments of all order: E |X|k < ∞∀k ≥ 0

∞

sk

k

(ii) M (s) = k!

E X , for complex-valued s with |s| < s0 [1, eqn (21.22) p 278].

k=0

Thus M (s) has a Taylor expansion about 0 with positive radius of convergence.

∞

i. If M (s) can somehow be calculated and expanded in a series ak sk , and if

k=0

the coeﬃcients ak can be identiﬁed, then ak = k!1 E X k . That is E X k =

k!ak

(iii) M (k) (0) = E X k = xk P X (dx)

(c) If M is deﬁned in some neighborhood of s, then M (k) (s) = xk esx P X (dx)

(d) See also Chernoﬀ bound.


10.3 One-Sided Laplace Transform

10.6. The one-sided Laplace transform a nonnegative random variable X is deﬁned

of−sx

for s ≥ 0 by L (s) = M (s) = E e−sX = e P X (dx)

[0,∞)

• Always ﬁnite because e−sx ≤ 1. In fact, it is a decreasing function of s

• L (0) = 1

• L (s) ∈ [0, 1]

(a) Derivative: For s > 0, L(k) (s) = (−1)k xk e−sx P X (dx) = (−1)k E X k e−sX

n n

n L (s) = (−1) E [X ]

d n

(i) lim ds

s↓0

• Because the value at 0 can be ∞, it does not make sense to talk about

dn

L (0) for n > 1

dsn

(ii) ∀s ≥ 0 L (s) is diﬀerentiable and ds d

L (s) = −E Xe−sX , where at 0, this is the

d

right-derivative. ds L (s) is ﬁnite for s > 0

• d

ds

L (0) = −E [X] ∈ [0, ∞]

st

(−1)k k (k)

(b) Inversion formula: If FX is continuous at t > 0, then FX (t) = lim k!

s L (s)

s→∞ k=0

(i) In fact, they are determined by the values of LX (s) for s beyond any arbitrary

s0 . (That is we don’t need to know LX (s) for small s.) Also, knowing LX (s)

on N is also suﬃcient.

(ii) Let μ and ν be probability measures on [0, ∞). If ∃s0 ≥ 0 such that e−sx μ (dx) =

e−sx ν (dx)∀s ≥ s0 , then μ = ν

−sx

(iii) Let f1 , f2 be real functions on [0, ∞). If ∃s0 ≥ 0 such that ∀s ≥ s0 e f1 (x)dx =

−sx [0,∞)

e f2 (x)dx , then f1 = f2 Lebesgue-a.e.

[0,∞)

n (s) =

Xi

i=1

n

LXi (s)

i=1

−sx

(i) e F (x)dx = λ1 L (s)

[0,∞)

(ii) e−sx (1 − F (x))dx = 1

λ

(1 − L (s))

[0,∞)

[16, p 183].


10.4 Characteristic Function

10.7. The characteristic function (abbreviated c.f. or ch.f.) of a probability measure μ on the line is defined for real t by ϕ(t) = ∫ e^{itx} μ(dx). A random variable X has characteristic function ϕ_X(t) = E[e^{itX}] = ∫ e^{itx} P^X(dx).

(a) Always exists because |ϕ(t)| ≤ ∫ |e^{itx}| μ(dx) = ∫ 1 μ(dx) = 1 < ∞.

(b) If X has a density, then ϕ_X(t) = ∫ e^{itx} f_X(x) dx.

(c) ϕ(0) = 1.

(d) ∀t ∈ R, |ϕ(t)| ≤ 1.

(f) Suppose that all moments of X exist and ∀t ∈ R, E[e^{tX}] < ∞. Then ϕ_X(t) = Σ_{k=0}^∞ ((it)^k/k!) E[X^k].

(g) If E[|X|^k] < ∞, then ϕ^(k)(t) = i^k E[X^k e^{itX}] and ϕ^(k)(0) = i^k E[X^k].

• X =^D −X (X and −X have the same distribution) iff ϕ_X is real-valued.

• |ϕ_X| is even.

(l) If X1, X2, ..., Xn are independent, then ϕ_{Σ_{j=1}^n X_j}(t) = Π_{j=1}^n ϕ_{X_j}(t).

(m) Inversion

(i) The inversion formula: If the probability measure μ has characteristic function ϕ and if μ{a} = μ{b} = 0, then

μ(a, b] = lim_{T→∞} (1/2π) ∫_{−T}^{T} ((e^{−ita} − e^{−itb})/(it)) ϕ(t) dt.

i. In fact, if a < b, then lim_{T→∞} (1/2π) ∫_{−T}^{T} ((e^{−ita} − e^{−itb})/(it)) ϕ(t) dt = μ(a, b) + (1/2) μ{a, b}.

ii. If a and b are continuity points of F, then F(b) − F(a) = lim_{T→∞} (1/2π) ∫_{−T}^{T} ((e^{−ita} − e^{−itb})/(it)) ϕ(t) dt.

(ii) Fourier inversion: Suppose that ∫ |ϕ_X(t)| dt < ∞. Then X is absolutely continuous with

i. bounded continuous density f(x) = (1/2π) ∫ e^{−itx} ϕ(t) dt;

ii. μ(a, b] = (1/2π) ∫ ((e^{−ita} − e^{−itb})/(it)) ϕ(t) dt.

(n) Continuity Theorem: X_n →^D X if and only if ∀t, ϕ_{X_n}(t) → ϕ_X(t) (pointwise).

(o) For a nonnegative random variable X, ϕ_X extends to complex arguments z with Im{z} ≥ 0:

(i) |ϕ_X(z)| ≤ 1 for such z.

(ii) In the domain Im{z} > 0, ϕ_X is analytic and continuous up to and including the boundary Im{z} = 0.

(iii) ϕ_X determines uniquely a function L_X(s) of real argument s ≥ 0 which equals L_X(s) = ϕ_X(is) = E[e^{−sX}]. Conversely, L_X(s) on the half-line s ≥ 0 determines ϕ_X uniquely.

(p) Moment expansion: for any n,

| ϕ_X(t) − Σ_{k=0}^{n} ((it)^k/k!) E[X^k] | ≤ E[ min( (|t|^{n+1}/(n+1)!) |X|^{n+1}, (2|t|^n/n!) |X|^n ) ]   (22)

and ϕ_X(t) = Σ_{k=0}^{n} ((it)^k/k!) E[X^k] + t^n β(t) where lim_{|t|→0} β(t) = 0; or, equivalently,

ϕ_X(t) = Σ_{k=0}^{n} ((it)^k/k!) E[X^k] + o(t^n).   (23)

• If EX = 0, then |ϕ_X(t) − 1| ≤ E[min((t²/2) X², 2|t| |X|)] ≤ (t²/2) E[X²].

(ii) For integrable X, |ϕ_X(t) − 1 − itEX| ≤ E[min((t²/2) X², 2|t| |X|)] ≤ (t²/2) E[X²].

i. |ϕ_X(t) − 1 − itEX + (1/2) t² E[X²]| ≤ E[min((|t|³/6) |X|³, t² |X|²)].

ii. ϕ_X(t) = 1 + itEX − (1/2) t² E[X²] + t² β(t) where lim_{|t|→0} β(t) = 0.

10.9. If X is a continuous r.v. with density f_X, then ϕ_X(t) = ∫ e^{jtx} f_X(x) dx (Fourier transform) and f_X(x) = (1/2π) ∫ e^{−jtx} ϕ_X(t) dt (Fourier inversion formula).

• ϕ_X(t) is the Fourier transform of f_X evaluated at −t.

• ϕ_X inherits the properties of a Fourier transform.

(a) For nonnegative a_i such that Σ_i a_i = 1, if f_Y = Σ_i a_i f_{X_i}, then ϕ_Y = Σ_i a_i ϕ_{X_i}.


(b) If fX is even, then ϕX is also even.

• If fX is even, ϕX = ϕ−X .

10.10. Characteristic function of a linear combination of independent random variables: Suppose X1, ..., Xn are independent. Let Y = Σ_{i=1}^n a_i X_i. Then ϕ_Y(t) = Π_{i=1}^n ϕ_{X_i}(a_i t). Furthermore, if |a_i| = 1 and all f_{X_i} are even, then ϕ_Y(t) = Π_{i=1}^n ϕ_{X_i}(t).

10.11. Characteristic function of a mixture of distributions: Consider nonnegative a_i such that Σ_i a_i = 1. Let P_i be probability measures with corresponding ch.f.'s ϕ_i. Then the ch.f. of Σ_i a_i P_i is Σ_i a_i ϕ_i.

(a) Discrete r.v.: Suppose p_i are pmfs with corresponding ch.f.'s ϕ_i. Then the ch.f. of Σ_i a_i p_i is Σ_i a_i ϕ_i.

(b) Absolutely continuous r.v.: Suppose f_i are pdfs with corresponding ch.f.'s ϕ_i. Then the ch.f. of Σ_i a_i f_i is Σ_i a_i ϕ_i.
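A short Monte Carlo sketch (not from the notes) of the characteristic function of the standard normal, whose exact value is e^{−t²/2}; the same estimator E[e^{itX}] works for any distribution we can sample from:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(1_000_000)
for t in (0.5, 1.0, 2.0):
    est = np.mean(np.exp(1j * t * x)).real       # Monte Carlo estimate of phi_X(t)
    print(t, est, np.exp(-t**2 / 2))             # estimate vs exact exp(-t^2/2)
```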

Definition 11.1. The preimage or inverse image of a set B is defined by g^{-1}(B) = {x : g(x) ∈ B}.

11.2. For discrete X, suppose Y = g(X). Then p_Y(y) = Σ_{x∈g^{-1}({y})} p_X(x). The joint pmf of Y and X is given by p_{X,Y}(x, y) = p_X(x) 1[y = g(x)].

• In most cases, we can show that X and Y are not independent: pick a point x^(0) such that p_X(x^(0)) > 0. Pick a point y^(1) such that y^(1) ≠ g(x^(0)) and p_Y(y^(1)) > 0. Then p_{X,Y}(x^(0), y^(1)) = 0 but p_X(x^(0)) p_Y(y^(1)) > 0. Note that this technique does not always work. For example, if g is a constant function which maps all values of x to a constant c, then we won't be able to find such a y^(1). Of course, this is to be expected because we know that a constant is always independent of other random variables.

There are many techniques for ﬁnding the cdf and pdf of Y = g(X).

(a) One may first find F_Y(y) = P[g(X) ≤ y] and then obtain f_Y from F_Y(y). In this case, Leibniz's rule in (38) will be useful.

(b) Formula (25) below provides a convenient way of arriving at fY from fX without

going through FY .


11.3 (Linear transformation). Y = aX + b where a ≠ 0.

F_Y(y) = F_X((y − b)/a) for a > 0, and F_Y(y) = 1 − F_X(((y − b)/a)^−) for a < 0.

f_Y(y) = (1/|a|) f_X((y − b)/a).   (24)

In fact (24) holds even for mixed r.v. if we allow delta functions, because (1/|a|) δ((y − b)/a − x_k) = δ(y − (a x_k + b)).

(b) Suppose X is discrete. Then

p_Y(y) = p_X((y − b)/a).

If we write f_X(x) = Σ_k p_X(x_k) δ(x − x_k), we have

f_Y(y) = Σ_k p_X(x_k) δ(y − (a x_k + b)).

11.4 (Power). Let Y = X^n for a positive integer n.

(a) n odd: F_Y(y) = F_X(y^{1/n}) and f_Y(y) = (1/n) y^{1/n − 1} f_X(y^{1/n}).

(b) n even:

F_Y(y) = F_X(y^{1/n}) − F_X((−y^{1/n})^−) for y ≥ 0, and F_Y(y) = 0 for y < 0;

f_Y(y) = (1/n) y^{1/n − 1} [f_X(y^{1/n}) + f_X(−y^{1/n})] for y ≥ 0, and f_Y(y) = 0 for y < 0.

Again, the density f_Y in the above formula holds when X is absolutely continuous. Note that for n > 1, f_Y need not be defined at 0. If we allow delta functions, then the density formula above is also valid for mixed r.v. because (1/n) y^{1/n − 1} δ(±y^{1/n} − x_k) = δ(y − (±x_k)^n).

In particular, for Y = X²,

f_Y(y) = (1/(2√y)) f_X(√y) + (1/(2√y)) f_X(−√y) for y ≥ 0, and f_Y(y) = 0 for y < 0.

11.5. In general, for Y = g(X), we solve the equation y = g(x). Denoting its real roots by x_k, we have

f_Y(y) = Σ_k f_X(x_k) / |g'(x_k)|.   (25)

If g(x) = c = constant for every x in the interval (a, b), then F_Y(y) is discontinuous at y = c. Hence, f_Y(y) contains an impulse (F_X(b) − F_X(a)) δ(y − c) [14, p 93–94].


• To see this, consider the case when there is a unique x such that g(x) = y. Then, for small Δx and Δy, P[y < Y ≤ y + Δy] = P[x < X ≤ x + Δx], where (y, y + Δy] = g((x, x + Δx]) is the image of the interval (x, x + Δx]. (Equivalently, (x, x + Δx] is the inverse image of (y, y + Δy].) This gives f_Y(y)Δy ≈ f_X(x)Δx.

• The joint density f_{X,Y} is

f_{X,Y}(x, y) = f_X(x) δ(y − g(x)).   (26)

Let the x_k be the solutions for x of g(x) = y. Then, by integrating (26) w.r.t. x, we have (25) via the use of (3).

• When g is bijective,

f_Y(y) = |(d/dy) g^{-1}(y)| f_X(g^{-1}(y)).

• For Y = a/X, f_Y(y) = (|a|/y²) f_X(a/y).

• Suppose X is nonnegative. For Y = √X, f_Y(y) = 2y f_X(y²).

11.6. Given Y = g(X) where X ∼ U(a, b). Then, to get f_Y(y0), plot g on (a, b). Let A = g^{-1}({y0}) be the set of all points x such that g(x) = y0. Suppose A can be written as a countable disjoint union A = B ∪ (∪_i I_i) where B is countable and the I_i's are intervals. We have, near y = y0,

f_Y(y) = (1/(b − a)) [ Σ_{x∈B} 1/|g'(x)| + Σ_i length(I_i) δ(y − y0) ].

Example (arcsine random variables): with Θ ∼ U(0, 2π), Y1 = cos Θ and Y2 = sin Θ are both arcsine random variables with F_{Y_i}(y) = 1 − (1/π) cos^{-1}(y) = 1/2 + (1/π) sin^{-1}(y) and f_{Y_i}(y) = (1/π) · 1/√(1 − y²) for y ∈ [−1, 1]. Note also that E[Y1 Y2] = E Y_i = Cov[Y1, Y2] = 0. Hence, Y1 and Y2 are uncorrelated. However, it is easy to see that Y1 and Y2 are not independent by considering the joint and marginal densities at y1 = y2 = 0.

For nonnegative X (possibly with P[X = 0] > 0) and Y = X²:

(a) F_Y(y) = 0 for y < 0; F_X(0) for y = 0; F_X(√y) for y > 0.

(b) f_Y(y) = (1/(2√y)) f_X(√y) for y > 0, 0 for y < 0, plus the impulse F_X(0) δ(y) if we allow delta functions.


[Figure: chart of relationships among common distributions. Discrete: hypergeometric, binomial, Bernoulli (n = 1), and Poisson (binomial limit with v = np as n → ∞); means and variances m = np, s² = np(1 − p) and m = v, s² = v are indicated. Continuous: normal (and lognormal Y = e^X), beta, gamma, chi-square (n = 2a), Erlang, exponential, standard uniform, uniform, triangular, Cauchy, t, F, Rayleigh, and Weibull, together with the sums, ratios, limits, and monotone transformations (e.g., X1 + ··· + XK, X1/X2, −b log X, a + (b − a)X, min(X1, ..., XK)) that connect them.]

[Figure: summary chart of common continuous densities and the transformations relating them, including: Uniform U(a, b) with f_X(x) = (1/(b − a)) 1_{[a,b]}(x); Exponential E(λ) with f_X(x) = λ e^{−λx} 1_{[0,∞)}(x); Gamma Γ(q, λ) with f_X(x) = (λ^q x^{q−1} e^{−λx}/Γ(q)) 1_{[0,∞)}(x); Chi-squared = Γ(n/2, 1/(2σ²)) arising from Σ_{i=1}^n X_i² with X_i i.i.d. N(0, σ²); Normal N(μ, σ²); Beta with f(x) = (Γ(q1 + q2)/(Γ(q1)Γ(q2))) x^{q1−1}(1 − x)^{q2−1} 1_{(0,1)}(x), arising as X1/(X1 + X2) for independent Γ(q_i, λ); Rayleigh with f(x) = 2αx e^{−αx²} 1_{[0,∞)}(x). Typical relations shown: X_i ∼ Γ(q_i, λ) independent ⇒ Σ_i X_i ∼ Γ(Σ_i q_i, λ); X ∼ Γ(q, λ) ⇒ αX ∼ Γ(q, λ/α); X ∼ U(0, 1) ⇒ −(1/λ) ln X ∼ E(λ); X_i ∼ E(λ_i) independent ⇒ min_i X_i ∼ E(Σ_i λ_i); X ∼ N(0, σ²) ⇒ X² ∼ Γ(1/2, 1/(2σ²)); √(X² + Y²) with X, Y i.i.d. N(0, 1/(2α)) is Rayleigh.]

11.2 MISO case

11.8. Suppose X and Y are jointly continuous random variables with joint density f_{X,Y}. The following two methods give the density of Z = g(X, Y).

• Condition on one of the variables, say Y = y. Once we condition, Z is simply a function of one variable, g(X, y); hence, we can use the one-variable technique to find f_{Z|Y}(z|y). Finally, f_Z(z) = ∫ f_{Z|Y}(z|y) f_Y(y) dy.

• Directly find the joint density of the random vector (Z, Y) = (g(X, Y), Y). Observe that the Jacobian is

[ ∂g/∂x   ∂g/∂y ]
[   0        1   ],

so the magnitude of its determinant is |∂g/∂x|.

Of course, the standard way of finding the pdf of Z is by differentiating the cdf F_Z(z) = ∫_{{(x,y): g(x,y) ≤ z}} f_{X,Y}(x, y) d(x, y). This is still good for solving specific examples. It is also a good starting point for those who haven't learned conditional probability or Jacobians.

Let the x^(k) be the solutions of g(x, y) = z for fixed z and y. The first method gives

f_{Z|Y}(z|y) = Σ_k [ f_{X|Y}(x|y) / |∂g(x, y)/∂x| ]_{x = x^(k)}.

Hence,

f_{Z,Y}(z, y) = Σ_k [ f_{X,Y}(x, y) / |∂g(x, y)/∂x| ]_{x = x^(k)},

which comes out of the second method directly. Both methods then give

f_Z(z) = ∫ Σ_k [ f_{X,Y}(x, y) / |∂g(x, y)/∂x| ]_{x = x^(k)} dy.

The integration for a given z is only over the values of y such that there is at least one solution for x in z = g(x, y). If there is no such solution, f_Z(z) = 0. The same technique works for a function of more than one random variable, Z = g(X1, ..., Xn). For any j ∈ [n], let the x_j^(k) be the solutions for x_j in z = g(x1, ..., xn). Then,

f_Z(z) = ∫ Σ_k [ f_{X1,...,Xn}(x1, ..., xn) / |∂g(x1, ..., xn)/∂x_j| ]_{x_j = x_j^(k)} dx_{[n]\{j}}.

For the second method, we consider the random vector (h_r(X1, ..., Xn), r ∈ [n]) where h_r(X1, ..., Xn) = X_r for r ≠ j and h_j = g. The Jacobian is of the form of an identity matrix except for the j-th row, which is (∂g/∂x_1, ∂g/∂x_2, ..., ∂g/∂x_n).


By swapping the row with all the partial derivatives to the ﬁrst row, the magnitude of

the determinant is unchanged and we also end up with upper triangular matrix whose

determinant is simply the product of the diagonal elements.

(a) For Z = aX + bY,

f_Z(z) = ∫ (1/|a|) f_{X,Y}((z − by)/a, y) dy = ∫ (1/|b|) f_{X,Y}(x, (z − ax)/b) dx.

• Note that the Jacobian of (x, y) ↦ (ax + by, y) is [ a  b ; 0  1 ], with determinant a.

(i) In particular, for Z = X − Y,

f_{X−Y}(z) = ∫ f_{X,Y}(z + y, y) dy = ∫ f_{X,Y}(x, x − z) dx.

(ii) Note that when X and Y are independent and a = b = 1, we have the convolution formula

f_Z(z) = ∫ f_X(z − y) f_Y(y) dy = ∫ f_X(x) f_Y(z − x) dx.
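A minimal numerical sketch of the convolution formula (not from the notes): for X, Y independent U(0, 1), the density of Z = X + Y is the triangular density on [0, 2] with peak 1 at z = 1.

```python
import numpy as np

dx = 0.001
x = np.arange(0, 1, dx)
fX = np.ones_like(x)                # density of U(0,1) on its support
fZ = np.convolve(fX, fX) * dx       # discretized integral of f_X(z - y) f_Y(y) dy
z = np.arange(len(fZ)) * dx
print(np.interp(1.0, z, fZ))        # ~1.0, the peak of the triangular density
print(np.trapz(fZ, z))              # ~1.0, fZ integrates to one
```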

(b) For Z = XY,

f_Z(z) = ∫ f_{X,Y}(x, z/x) (1/|x|) dx = ∫ f_{X,Y}(z/y, y) (1/|y|) dy

[9, Ex 7.2, 7.11, 7.15]. Note that the Jacobian of (x, y) ↦ (xy, y) is [ y  x ; 0  1 ].

(c) For Z = X² + Y²,

f_Z(z) = ∫_{−√z}^{√z} [ f_{X|Y}(√(z − y²) | y) + f_{X|Y}(−√(z − y²) | y) ] / (2√(z − y²)) f_Y(y) dy,

f_Z(z) = (1/2) ∫_0^{2π} f_{X,Y}(√z cos θ, √z sin θ) dθ,  z > 0.   (27)

• This can be used to show that when X, Y i.i.d. ∼ N(0, 1), Z = X² + Y² ∼ E(1/2).

(d) For R = √(X² + Y²),

f_R(r) = r ∫_0^{2π} f_{X,Y}(r cos θ, r sin θ) dθ,  r > 0.


(e) For Z = Y/X,

f_Z(z) = ∫ (|y|/z²) f_{X,Y}(y/z, y) dy = ∫ |x| f_{X,Y}(x, xz) dx.

Similarly, when Z = X/Y,

f_Z(z) = ∫ |y| f_{X,Y}(yz, y) dy.

(f) For Z = min(X, Y)/max(X, Y) where X and Y are strictly positive,

F_Z(z) = ∫_0^∞ F_{Y|X}(zx|x) f_X(x) dx + ∫_0^∞ F_{X|Y}(zy|y) f_Y(y) dy,

f_Z(z) = ∫_0^∞ x f_{Y|X}(zx|x) f_X(x) dx + ∫_0^∞ y f_{X|Y}(zy|y) f_Y(y) dy,  0 < z < 1

[9, Ex 7.17].

11.9 (Random sum). Let S = Σ_{i=1}^N V_i where the V_i's are i.i.d. ∼ V, independent of N.

• For non-negative integer-valued summands, we have G_S(z) = G_N(G_V(z)).

(b) ES = EN · EV.

Remark: If N ∼ P(λ), then ϕ_S(u) = exp(λ(ϕ_V(u) − 1)), the compound Poisson distribution CP(λ, L(V)). Hence, the mean and variance of CP(λ, L(V)) are λEV and λE[V²] respectively.
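A small simulation sketch of the random-sum moments (not from the notes; V ∼ U(0, 1) is an arbitrary choice of summand distribution):

```python
import numpy as np

rng = np.random.default_rng(2)
lam, trials = 3.0, 100_000
N = rng.poisson(lam, size=trials)
S = np.array([rng.uniform(0, 1, size=n).sum() for n in N])   # S = V_1 + ... + V_N
EV, EV2 = 0.5, 1 / 3                                         # E[V], E[V^2] for U(0,1)
print(S.mean(), lam * EV)       # ES = EN * EV  ~ 1.5
print(S.var(),  lam * EV2)      # Var S = lam * E[V^2]  ~ 1.0
```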

Definition 11.10 (Jacobian). In vector calculus, the Jacobian is shorthand for either the Jacobian matrix or its determinant, the Jacobian determinant. Let g be a function from a subset D of R^n to R^m. If g is differentiable at z ∈ D, then all partial derivatives exist at z and the Jacobian matrix of g at a point z ∈ D is

dg(z) = [ ∂g_1/∂x_1(z) ··· ∂g_1/∂x_n(z) ; ⋮ ⋱ ⋮ ; ∂g_m/∂x_1(z) ··· ∂g_m/∂x_n(z) ] = ( ∂g/∂x_1(z), ..., ∂g/∂x_n(z) ).

Alternative notations for the Jacobian matrix are J, ∂(g_1, ..., g_n)/∂(x_1, ..., x_n) [7, p 242], and J_g(x), where it is assumed that the Jacobian matrix is evaluated at z = x = (x_1, ..., x_n).


• Let A be an n-dimensional "box" defined by the corners x and x + Δx. The "volume" of the image g(A) is (Π_i Δx_i) |det dg(x)|. Hence, the magnitude of the Jacobian determinant gives the ratio (scaling factor) of n-dimensional volumes (contents). In other words,

dy_1 ··· dy_n = |∂(y_1, ..., y_n)/∂(x_1, ..., x_n)| dx_1 ··· dx_n.

• In MATLAB, use jacobian.

11.11. Suppose X is an R^n-valued random vector. Define Y = g(X). (Then Y is also an R^n-valued random vector.) If X has joint density f_X, and g is a suitable invertible mapping (such that the inversion mapping theorem is applicable), then

f_Y(y) = (1/|det(dg(g^{-1}(y)))|) f_X(g^{-1}(y)) = |det(d g^{-1}(y))| f_X(g^{-1}(y)).

• Note that for any matrix A, det(A) = det(A^T). Hence, the formula above can tolerate the incorrect "transposed Jacobian".

More generally, let S = {x : f_X(x) > 0}. Consider a new random vector Y = (Y_1, Y_2, ..., Y_n) defined by Y_i = g_i(X). Suppose that A_0, A_1, ..., A_r form a partition of S with these properties: the set A_0, which may be empty, satisfies P[X ∈ A_0] = 0, and the transformation Y = g(X) = (g_1(X), ..., g_n(X)) is a one-to-one transformation from A_k onto some common set B for each k ∈ [r]. Then, for each k, the inverse function from B to A_k can be found. Denote the kth inverse x = h^(k)(y) componentwise by x_j = h_j^(k)(y). This kth inverse gives, for y ∈ B, the unique x ∈ A_k such that y = g(x). Assuming that the Jacobians det(dh^(k)(y)) do not vanish identically on B, we have

f_Y(y) = Σ_{k=1}^r f_X(h^(k)(y)) |det(dh^(k)(y))|,  y ∈ B

[2, p 185].

If Y_k = h(Y_I) for some index set I ⊂ [n]\{k} and some deterministic function h, then the kth row of the Jacobian matrix is a linear combination of the other rows. In particular,

∂y_k/∂x_j = Σ_{i∈I} (∂h/∂y_i)(y_I) · ∂y_i/∂x_j.


11.12. Suppose Y = g(X) where both X and Y have the same dimension. Then the joint density of X and Y is

f_{X,Y}(x, y) = f_X(x) δ(y − g(x)).

• In most cases, we can show that X and Y are not independent: pick a point x^(0) such that f_X(x^(0)) > 0. Pick a point y^(1) such that y^(1) ≠ g(x^(0)) and f_Y(y^(1)) > 0. Then f_{X,Y}(x^(0), y^(1)) = 0 but f_X(x^(0)) f_Y(y^(1)) > 0. Note that this technique does not always work. For example, if g is a constant function which maps all values of x to a constant c, then we won't be able to find such a y^(1). Of course, this is to be expected because we know that a constant is always independent of other random variables.

Example 11.13.

(a) Affine transformation: if Y = AX + b with A invertible, then

f_Y(y) = (1/|det A|) f_X(A^{-1}(y − b)).   (28)

(b) Transformation between Cartesian coordinates (x, y) and polar coordinates (r, θ):

• x = r cos θ, y = r sin θ, r = √(x² + y²), θ = tan^{-1}(y/x).

• The Jacobian determinant is det[ ∂x/∂r ∂x/∂θ ; ∂y/∂r ∂y/∂θ ] = det[ cos θ  −r sin θ ; sin θ  r cos θ ] = r. (Recall that dx dy = r dr dθ.)

We have f_{R,Θ}(r, θ) = r f_{X,Y}(r cos θ, r sin θ) and

f_{X,Y}(x, y) = (1/√(x² + y²)) f_{R,Θ}(√(x² + y²), tan^{-1}(y/x)).

If, furthermore, Θ is uniform on (0, 2π) and independent of R, then

f_{X,Y}(x, y) = (1/2π) (1/√(x² + y²)) f_R(√(x² + y²)).

(c) A related transformation is given by Z = X² + Y² and Θ = tan^{-1}(Y/X). In this case, X = √Z cos Θ, Y = √Z sin Θ, and

f_{Z,Θ}(z, θ) = (1/2) f_{X,Y}(√z cos θ, √z sin θ),

which gives (27).

11.14. Suppose X, Y are i.i.d. N(0, σ²). Then R = √(X² + Y²) and Θ = arctan(Y/X) are independent, with R being Rayleigh, f_R(r) = (r/σ²) e^{−(1/2)(r/σ)²} U(r), and Θ being uniform on [0, 2π].

11.15 (Generation of a random sample of a normally distributed random variable). Let U1, U2 be i.i.d. U(0, 1). Then the random variables

X1 = √(−2 ln U1) cos(2πU2),
X2 = √(−2 ln U1) sin(2πU2)

are i.i.d. N(0, 1), and

Z1 = √(−2σ² ln U1) cos(2πU2),
Z2 = √(−2σ² ln U1) sin(2πU2)

are i.i.d. N(0, σ²).

• det(dx(u)) = −2π/u1, with u1 = e^{−(x1² + x2²)/2}.
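A short sketch of the construction in 11.15 (commonly known as the Box–Muller transform); the sample moments and the correlation confirm two roughly independent standard normals:

```python
import numpy as np

rng = np.random.default_rng(3)
u1, u2 = rng.uniform(size=500_000), rng.uniform(size=500_000)
x1 = np.sqrt(-2 * np.log(u1)) * np.cos(2 * np.pi * u2)
x2 = np.sqrt(-2 * np.log(u1)) * np.sin(2 * np.pi * u2)
print(x1.mean(), x1.var(), x2.var(), np.corrcoef(x1, x2)[0, 1])   # ~0, ~1, ~1, ~0
```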

11.16. In (11.11), suppose dim(Y) = dim(g) < dim(X). One can introduce an "arbitrary" Z = h(X) so that dim(Y, Z) = dim(X), apply (11.11) to (Y, Z), and then marginalize out Z.

Given a sample of n random variables X1 , . . . , Xn , reorder them so that Y1 ≤ Y2 ≤ · · · ≤ Yn .

Then, Yi is called the ith order statistic, sometimes also denoted Xi:n , X i , X(i) , Xn:i , Xi,n ,

or X(i)n .

In particular

• Y1 = X1:n = Xmin is the ﬁrst order statistic denoting the smallest of the Xi ’s,

• Y2 = X2:n is the second order statistic denoting the second smallest of the Xi ’s . . .,

and

• Yn = Xn:n = Xmax is the nth order statistic denoting the largest of the Xi ’s.

In words, the order statistics of a random sample are the sample values placed in ascending

order [2, p 226]. Many results in this section can be found in [4].

[X_min ≥ y] = ∩_i [X_i ≥ y]      [X_max ≥ y] = ∪_i [X_i ≥ y]
[X_min > y] = ∩_i [X_i > y]      [X_max > y] = ∪_i [X_i > y]
[X_min ≤ y] = ∪_i [X_i ≤ y]      [X_max ≤ y] = ∩_i [X_i ≤ y]
[X_min < y] = ∪_i [X_i < y]      [X_max < y] = ∩_i [X_i < y]

Let A_y = [X_max ≤ y] and B_y = [X_min > y]. Then A_y = [∀i X_i ≤ y] and B_y = [∀i X_i > y].


11.18 (Densities). Suppose the X_i are absolutely continuous with joint density f_X. Let S_y be the set of all n! vectors which come from permuting the coordinates of y. Then

f_Y(y) = Σ_{x∈S_y} f_X(x),  y_1 ≤ y_2 ≤ ··· ≤ y_n.   (29)

To see this, note that f_Y(y) Π_j Δy_j is approximately the probability that each Y_j is in a small interval of length Δy_j around y_j. This probability can be calculated by finding the probability that the X_k fall into the corresponding small regions.

From the joint density, we can ﬁnd the joint pdf/cdf of YI for any I ⊂ [n]. However, in

many cases, we can directly reapply the above technique to ﬁnd the joint pdf of YI . This

is especially useful when the Xi are independent or i.i.d.

(a) The marginal density f_{Y_k} can be found by approximating f_{Y_k}(y) Δy with

Σ_{j=1}^n Σ_{I} P[X_j ∈ [y, y + Δy) and ∀i ∈ I, X_i ≤ y and ∀r ∈ (I ∪ {j})^c, X_r > y],

where the inner sum is over all (k − 1)-element subsets I of [n]\{j}, and (I ∪ {j})^c = [n] \ (I ∪ {j}). (For any set A and integer ℓ ≤ |A|, the set of ℓ-element subsets of A is what the binomial-coefficient notation below refers to.)

To see this, we first choose the X_j that will be Y_k with value around y. Then we must have k − 1 of the X_i below y and the rest of the X_i above y.

(b) For integers r < s, the joint density f_{Y_r,Y_s}(y_r, y_s) Δy_r Δy_s can be approximated by the probability that two of the X_i are inside small regions around y_r and y_s. To make them Y_r and Y_s, of the other X_i, r − 1 of them must lie below y_r, s − r − 1 of them between y_r and y_s, and n − s of them beyond y_s.

• f_{X_max,X_min}(u, v) Δu Δv can be approximated by

Σ_{(j,k)∈S} P[X_j ∈ [u, u + Δu), X_k ∈ [v, v + Δv), and ∀i ∈ [n]\{j, k}, v < X_i ≤ u],  v ≤ u,

where S is the set of all n(n − 1) pairs (j, k) from [n] × [n] with j ≠ k. This simply chooses j, k so that X_j will be the maximum with value around u and X_k will be the minimum with value around v. Of course, the rest of the X_i have to be between the min and max.

◦ When n = 2, we can use (29) to get

f_{X_max,X_min}(u, v) = f_{X_1,X_2}(u, v) + f_{X_1,X_2}(v, u),  v ≤ u.

Note that the joint density at a point y_I is 0 if the elements of y_I are not arranged in the "right" order.

11.19 (Distribution functions). We note again that the cdfs may be obtained by integration of the densities in (11.18), as well as by direct arguments valid also in the discrete case.


(a) The marginal cdf is

F_{Y_k}(y) = Σ_{j=k}^n Σ_{I: |I|=j, I⊂[n]} P[∀i ∈ I, X_i ≤ y and ∀r ∈ [n]\I, X_r > y].

This is because the event [Y_k ≤ y] is the same as the event that at least k of the X_i are ≤ y. In other words, let N(a) = Σ_{i=1}^n 1[X_i ≤ a] be the number of X_i which are ≤ a. Then

[Y_k ≤ y] = [N(y) ≥ k] = ∪_{j≥k} [N(y) = j],   (30)

where the union is a disjoint union. Hence, we sum the probability that exactly j of the X_i are ≤ y, for j ≥ k. Alternatively, note that the event [Y_k ≤ y] can also be expressed as the disjoint union

∪_{j≥k} [X_j ≤ y and exactly k − 1 of X_1, ..., X_{j−1} are ≤ y].

This gives

F_{Y_k}(y) = Σ_{j=k}^n Σ_{I: |I|=k−1, I⊂[j−1]} P[X_j ≤ y, ∀i ∈ I, X_i ≤ y, and ∀r ∈ [j−1]\I, X_r > y].

(b) For the joint cdf of Y_r and Y_s (r < s), when y_r < y_s,

[Y_r ≤ y_r] ∩ [Y_s ≤ y_s] = ( ∪_{j=r}^n [N(y_r) = j] ) ∩ ( ∪_{m=s}^n [N(y_s) = m] ) = ∪_{m=s}^n ∪_{j=r}^m [N(y_r) = j and N(y_s) = m],

where the upper limit of the second union is changed from n to m because we must have N(y_r) ≤ N(y_s). Now, to have N(y_r) = j and N(y_s) = m for m > j is to put j of the X_i in (−∞, y_r], m − j of the X_i in (y_r, y_s], and n − m of the X_i in (y_s, ∞).

(c) F_{X_max,X_min}(u, v) = P[X_max ≤ u, X_min ≤ v] = P[∀i X_i ≤ u] − P[∀i v < X_i ≤ u].

When v ≥ u, the second term is 0. When v < u, the second term can be found by (21), which gives

F_{X_max,X_min}(u, v) = F_{X_1,...,X_n}(u, ..., u) − Σ_{w∈S} (−1)^{|{i: w_i = v}|} F_{X_1,...,X_n}(w)
= Σ_{w∈S\{(u,...,u)}} (−1)^{|{i: w_i = v}|+1} F_{X_1,...,X_n}(w),

where S = {u, v}^n is the set of all 2^n vertices w of the "box" ×_{i∈[n]} (v, u]. The joint density is 0 for u < v.

11.20. When the X_i are independent (not necessarily identically distributed):

(a) f_Y(y) = Σ_{x∈S_y} Π_{i=1}^n f_{X_i}(x_i).

(b) F_{Y_k}(y) = Σ_{j=k}^n Σ_{I: |I|=j, I⊂[n]} ( Π_{i∈I} F_{X_i}(y) ) ( Π_{r∈[n]\I} (1 − F_{X_r}(y)) )
= Σ_{j=k}^n Σ_{I: |I|=k−1, I⊂[j−1]} F_{X_j}(y) ( Π_{i∈I} F_{X_i}(y) ) ( Π_{r∈[j−1]\I} (1 − F_{X_r}(y)) ).

(c) F_{X_max,X_min}(u, v) = Π_{k∈[n]} F_{X_k}(u) − { 0 for u ≤ v; Π_{k∈[n]} (F_{X_k}(u) − F_{X_k}(v)) for v < u }.

(d) F_{Y_k}(y) = Σ_{j=k}^n Σ_{I: |I|=j, I⊂[n]} ( Π_{i∈I} F_{X_i}(y) ) ( Π_{r∈[n]\I} (1 − F_{X_r}(y)) ).

(e) F_{X_min}(v) = 1 − Π_i (1 − F_{X_i}(v)).

(f) F_{X_max}(u) = Π_i F_{X_i}(u).
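A quick Monte Carlo sanity check of (e) and (f) — a sketch in Python with the two marginal distributions (U(0, 1) and E(1)) chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1_000_000
x1, x2 = rng.uniform(size=n), rng.exponential(size=n)
u = 0.7
F1, F2 = u, 1 - np.exp(-u)                                        # the two marginal cdfs at u
print(np.mean(np.maximum(x1, x2) <= u), F1 * F2)                  # F_Xmax(u) = F1(u)*F2(u)
print(np.mean(np.minimum(x1, x2) <= u), 1 - (1 - F1) * (1 - F2))  # F_Xmin(u)
```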

11.21. Suppose the X_i are i.i.d. ∼ X with common density f and distribution function F.


(a) If we define y_0 = −∞, y_{k+1} = ∞, n_0 = 0, n_{k+1} = n + 1, then for k ∈ [n] and 1 ≤ n_1 < ··· < n_k ≤ n, the joint density f_{Y_{n_1},...,Y_{n_k}}(y_{n_1}, ..., y_{n_k}) is given by

n! ( Π_{j=1}^{k} f(y_{n_j}) ) Π_{j=0}^{k} (F(y_{n_{j+1}}) − F(y_{n_j}))^{n_{j+1} − n_j − 1} / (n_{j+1} − n_j − 1)!.

In particular, for r < s, the joint density f_{Y_r,Y_s}(y_r, y_s) is given by

(n!/((r − 1)!(s − r − 1)!(n − s)!)) f(y_r) f(y_s) F^{r−1}(y_r) (F(y_s) − F(y_r))^{s−r−1} (1 − F(y_s))^{n−s}

[2, Theorem 5.4.6 p 230].

(b) The joint cdf F_{Y_r,Y_s}(y_r, y_s) is given by

F_{Y_s}(y_s) for y_s ≤ y_r, and

Σ_{m=s}^{n} Σ_{j=r}^{m} (n!/(j!(m − j)!(n − m)!)) (F(y_r))^j (F(y_s) − F(y_r))^{m−j} (1 − F(y_s))^{n−m} for y_r < y_s.

(c) F_{X_max,X_min}(u, v) = (F(u))^n − { 0 for u ≤ v; (F(u) − F(v))^n for v < u }.

(d) f_{X_max,X_min}(u, v) = { 0 for u ≤ v; n(n − 1) f(u) f(v) (F(u) − F(v))^{n−2} for v < u }.

(e) Marginal cdf:

F_{Y_k}(y) = Σ_{j=k}^{n} (n choose j) (F(y))^j (1 − F(y))^{n−j}
= Σ_{j=k}^{n} ((j−1) choose (k−1)) (F(y))^k (1 − F(y))^{j−k} = Σ_{m=0}^{n−k} ((k+m−1) choose (k−1)) (F(y))^k (1 − F(y))^m
= (n!/((k − 1)!(n − k)!)) ∫_0^{F(y)} t^{k−1} (1 − t)^{n−k} dt.

Note that N(y) ∼ B(n, F(y)). The last equality comes from integrating the marginal density f_{Y_k} in (31) with the change of variable t = F(y).

(i) F_{X_max}(y) = (F(y))^n and f_{X_max}(y) = n (F(y))^{n−1} f(y).

(ii) F_{X_min}(y) = 1 − (1 − F(y))^n and f_{X_min}(y) = n (1 − F(y))^{n−1} f(y).

(f) Marginal density:

f_{Y_k}(y) = (n!/((k − 1)!(n − k)!)) (F(y))^{k−1} (1 − F(y))^{n−k} f(y)   (31)

[2, Theorem 5.4.4 p 229].

To see this, consider a small neighborhood Δ_y around y. To have Y_k ∈ Δ_y, we must have exactly one of the X_i in Δ_y, exactly k − 1 of them less than y, and exactly n − k of them greater than y. There are n·((n−1) choose (k−1)) = n!/((k − 1)!(n − k)!) = 1/B(k, n − k + 1) possible setups.
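A short simulation sketch of (31) (not from the notes): for i.i.d. U(0, 1) samples, (31) says Y_k ∼ Beta(k, n − k + 1), so E[Y_k] = k/(n + 1); the second line compares a crude density estimate with the formula at one point.

```python
import math
import numpy as np

rng = np.random.default_rng(5)
n, k = 5, 2
y_k = np.sort(rng.uniform(size=(300_000, n)), axis=1)[:, k - 1]   # k-th order statistic
print(y_k.mean(), k / (n + 1))                                     # ~0.3333
y = 0.3
c = math.factorial(n) / (math.factorial(k - 1) * math.factorial(n - k))
print(np.mean(np.abs(y_k - y) < 0.005) / 0.01, c * y**(k - 1) * (1 - y)**(n - k))
```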


(g) The range R is defined as R = X_max − X_min.

(i) For x > 0, f_R(x) = n(n − 1) ∫ (F(u) − F(u − x))^{n−2} f(u − x) f(u) du.

(ii) For x ≥ 0, F_R(x) = n ∫ (F(u) − F(u − x))^{n−1} f(u) du.

Both the pdf and cdf above are derived by first finding the distribution of the range conditioned on the value of X_max = u.

(h) Discrete case: suppose X is discrete with p_X(x_i) = p_i, where x_1 < x_2 < ... are the possible values of X in ascending order. Define P_0 = 0 and P_i = Σ_{k=1}^{i} p_k. Then

P[Y_j ≤ x_i] = Σ_{k=j}^{n} (n choose k) P_i^k (1 − P_i)^{n−k}

and

P[Y_j = x_i] = Σ_{k=j}^{n} (n choose k) [ P_i^k (1 − P_i)^{n−k} − P_{i−1}^k (1 − P_{i−1})^{n−k} ].

If U_1, ..., U_k are i.i.d. uniform on the interval from 0 to t_0, then they have joint pdf

f_{U_1^k}(u_1^k) = 1/t_0^k for 0 ≤ u_i ≤ t_0, and 0 otherwise,

and their order statistics τ_1 ≤ ··· ≤ τ_k have joint pdf

f_{τ_1^k}(t_1^k) = k!/t_0^k for 0 ≤ t_1 ≤ t_2 ≤ ··· ≤ t_k ≤ t_0, and 0 otherwise.

Let U = max(X, Y) and V = min(X, Y), where X and Y have joint cdf F_{X,Y}. Then

F_{U,V}(u, v) = F_{X,Y}(u, u) for u ≤ v, and F_{X,Y}(v, u) + F_{X,Y}(u, v) − F_{X,Y}(v, v) for u > v,

F_U(u) = F_{X,Y}(u, u),

F_V(v) = F_X(v) + F_Y(v) − F_{X,Y}(v, v).


The marginal densities are given by

f_U(u) = (∂/∂x) F_{X,Y}(x, y)|_{x=u, y=u} + (∂/∂y) F_{X,Y}(x, y)|_{x=u, y=u} = ∫_{−∞}^{u} f_{X,Y}(x, u) dx + ∫_{−∞}^{u} f_{X,Y}(u, y) dy,

f_V(v) = f_X(v) + f_Y(v) − [ (∂/∂x) F_{X,Y}(x, y)|_{x=v, y=v} + (∂/∂y) F_{X,Y}(x, y)|_{x=v, y=v} ]
= ∫_{v}^{∞} f_{X,Y}(v, y) dy + ∫_{v}^{∞} f_{X,Y}(x, v) dx.

If X and Y are independent, then

F_{U,V}(u, v) = F_X(u) F_Y(u) for u ≤ v, and F_X(v) F_Y(u) + F_X(u) F_Y(v) − F_X(v) F_Y(v) for u > v,

F_U(u) = F_X(u) F_Y(u),
F_V(v) = F_X(v) + F_Y(v) − F_X(v) F_Y(v),
f_U(u) = f_X(u) F_Y(u) + F_X(u) f_Y(u),
f_V(v) = f_X(v) + f_Y(v) − f_X(v) F_Y(v) − F_X(v) f_Y(v).

If, furthermore, X and Y are i.i.d., then

F_U(u) = F²(u),
F_V(v) = 2F(v) − F²(v),
f_U(u) = 2 f(u) F(u),
f_V(v) = 2 f(v) (1 − F(v)).

11.25. Let the X_i be i.i.d. with density f and cdf F. The range R is defined as R = X_max − X_min.

F_R(x) = 0 for x < 0, and n ∫ (F(v) − F(v − x))^{n−1} f(v) dv for x ≥ 0;

f_R(x) = 0 for x < 0, and n(n − 1) ∫ (F(v) − F(v − x))^{n−2} f(v − x) f(v) dv for x > 0.

For example, when the X_i are i.i.d. U(0, 1),

f_R(x) = n(n − 1) x^{n−2} (1 − x),  0 ≤ x ≤ 1

[15, Ex 9F p 322–323].
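A quick simulation sketch (not from the notes): the uniform-range density above gives E[R] = (n − 1)/(n + 1), which is easy to verify by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 6
x = rng.uniform(size=(300_000, n))
r = x.max(axis=1) - x.min(axis=1)          # the sample range
print(r.mean(), (n - 1) / (n + 1))         # ~0.7143
```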


12 Convergences

Definition 12.1. A sequence of random variables (X_n) converges pointwise to X if

∀ω ∈ Ω, lim_{n→∞} X_n(ω) = X(ω).

Deﬁnition 12.2 (Strong Convergence). The following statements are all equivalent

conditions/notations for a sequence of random variables (Xn ) to converge almost surely to

a random variable X

a.s.

(a) Xn −−→ X

(i) Xn → X a.s.

(ii) Xn → X with probability 1.

(iii) Xn → X w.p. 1

(iv) lim Xn = Xa.s.

n→∞

a.s.

(b) (Xn − X) −−→ 0

(c) P [Xn → X] = 1

(' )*

(i) P ω : lim Xn (ω) = X (ω) = 1

n→∞

(' )c *

(ii) P ω : lim Xn (ω) = X (ω) =0

n→∞

(iii) P [Xn X] = 0

(' )*

(iv) P ω : lim |Xn (ω) − X (ω)| = 0 = 1

n→∞

(' )*

(d) ∀ε > 0 P ω : lim |Xn (ω) − X (ω)| < ε =1

n→∞

a.s. a.s.

(a) Uniqueness: if Xn −−→ X and Xn −−→ Y , then X = Y a.s.

a.s. a.s.

(b) If Xn −−→ X and Yn −−→ Y , then

a.s.

(i) Xn + Yn −−→ X + Y

a.s.

(ii) Xn Yn −−→ XY

a.s. a.s.

(c) g continuous, Xn −−→ X ⇒ g (Xn ) −−→ g (X)

∞

a.s.

(d) Suppose that ∀ε > 0, P [|Xn − X| > ε] < ∞, then Xn −−→ X

n=1

a.s.

(e) Let A1 , A2 , . . . be independent. Then, 1An −−→ 0 if and only if P (An ) < ∞

n


Definition 12.4 (Convergence in probability). The following statements are all equivalent conditions/notations for a sequence of random variables (X_n) to converge in probability to a random variable X:

(a) X_n →^P X

(i) X_n →_P X

(ii) plim_{n→∞} X_n = X

(b) (X_n − X) →^P 0

(c) (i) P[|X_n − X| > ε] → 0 as n → ∞, for every ε > 0.

(ii) ∀ε > 0 ∀δ > 0 ∃N_δ ∈ N such that ∀n ≥ N_δ, P[|X_n − X| > ε] < δ.

(iii) ∀ε > 0, lim_{n→∞} P[|X_n − X| > ε] = 0.

(iv) The strict inequality between |X_n − X| and ε can be replaced by the corresponding "non-strict" version.

P P

(a) Uniqueness: If Xn −

→ X and Xn −

→ Y , then X = Y a.s.

P P

(b) Suppose Xn −

→ X, Yn −

→ Y , and an → a, then

P

(i) Xn + Yn −

→X +Y

P

(ii) an Xn −

→ aX

P

(iii) Xn Yn −

→ XY

(c) Suppose (Xn ) i.i.d. with distribution U [0, θ]. Let Zn = max {Xi : 1 ≤ i ≤ n}. Then,

P

Zn −

→θ

P P

(d) g continuous, Xn −

→ X ⇒ g (Xn ) −

→ g (X)

P

(i) Suppose that g : Rd → R is continuous. Then, ∀i Xi,n −

→ Xi implies

n→∞

P

g (X1,n , . . . , Xd,n ) −

→ g (X1 , . . . , Xd )

n→∞

P P

(e) Let g be a continuous function at c. Then, Xn −

→ c ⇒ g (Xn ) −

→ g (c)

P

(f) Fatou’s lemma: 0 ≤ Xn −

→ X ⇒ lim inf EXn ≥ EX

n→∞

P

(g) Suppose Xn −

→ X and |Xn | ≤ Y with EY < ∞, then EXn → EX


P

(h) Let A1 , A2 , . . . be independent. Then, 1An −

→ 0 iﬀ P (An ) → 0

P

(i) Xn −

→ 0 iﬀ ∃δ > 0 such that ∀t ∈ [−δ, δ] we have ϕXn (t) → 1

Let (P_n) and P be probability measures on R^d (d ≥ 1). The sequence P_n converges weakly to P if ∫ g dP_n → ∫ g dP for every g which is real-valued, continuous, and bounded on R^d.

12.7. Let (X_n), X be R^d-valued random variables with distribution functions (F_n), F, distributions (μ_n), μ, and ch.f.'s (ϕ_n), ϕ respectively.

The following are equivalent conditions for the sequence (X_n) to converge in distribution to X:

(i) X_n ⇒ X

i. X_n →^L X

ii. X_n →^D X

(ii) F_n ⇒ F

i. F_{X_n} ⇒ F_X

ii. F_n converges weakly to F

iii. lim_{n→∞} P^{X_n}(A) = P^X(A) for every A of the form A = (−∞, x] for which P^X{x} = 0

(iii) μ_n ⇒ μ

i. P^{X_n} converges weakly to P^X

(iv) (Representation) There exist random variables Y_n, Y defined on a common probability space (Ω, F, P) such that Y_n =^D X_n, Y =^D X, and Y_n → Y on (the whole) Ω.

(e) Continuous Mapping theorem: lim_{n→∞} E[g(X_n)] = E[g(X)] for all g which are real-valued, continuous, and bounded on R^d.

(i) lim_{n→∞} E[g(X_n)] = E[g(X)] for all bounded real-valued functions g such that P[X ∈ D_g] = 0, where D_g is the set of points of discontinuity of g.

(ii) lim_{n→∞} E[g(X_n)] = E[g(X)] for all bounded Lipschitz continuous functions g.

(iii) lim_{n→∞} E[g(X_n)] = E[g(X)] for all bounded uniformly continuous functions g.

(iv) lim_{n→∞} E[g(X_n)] = E[g(X)] for all complex-valued functions g whose real and imaginary parts are bounded and continuous.

• For nonnegative random variables: ∀s ≥ 0, L_{X_n}(s) → L_X(s), where L_X(s) = E[e^{−sX}].

Note that there is no requirement that (X_n) and X be defined on the same probability space (Ω, A, P).

12.8. Continuity Theorem: Suppose lim_{n→∞} ϕ_n(t) exists ∀t; call this limit ϕ_∞. Furthermore, suppose ϕ_∞ is continuous at 0. Then there exists a probability distribution μ_∞ such that μ_n ⇒ μ_∞, and ϕ_∞ is the characteristic function of μ_∞.

12.9. Further properties:

(b) Suppose X_n ⇒ X.

(ii) E[g(X_n)] → E[g(X)] for every bounded real-valued function g such that P[X ∈ D_g] = 0, where D_g is the set of points of discontinuity of g.

i. g(X_n) ⇒ g(X) for g continuous.

(iii) If Y_n →^P 0, then X_n + Y_n ⇒ X.

(iv) If X_n − Y_n ⇒ 0, then Y_n ⇒ X.

(d) Suppose (μ_n) is a sequence of probability measures on R that are all point masses with μ_n({α_n}) = 1. Then μ_n converges weakly to a limit μ iff α_n → α for some α; and in this case μ is a point mass at α.

(i) Suppose P^{X_n} and P^X have densities δ_n and δ w.r.t. the same measure μ. Then δ_n → δ μ-a.e. implies

i. ∀B ∈ B_R, P^{X_n}(B) → P^X(B)

• F_{X_n} → F_X

• F_{X_n} ⇒ F_X

ii. Suppose g is bounded. Then ∫ g(x) P^{X_n}(dx) → ∫ g(x) P^X(dx); equivalently, E[g(X_n)] → E[g(X)], where each expectation is taken with respect to the appropriate P.

(ii) Remarks:

i. For absolutely continuous random variables, μ is the Lebesgue measure and δ is the probability density function.

ii. For discrete random variables, μ is the counting measure and δ is the probability mass function.

• Suppose the X_n are normal random variables, X_n ∼ N(μ_n, σ_n²), and X_n ⇒ X. Then (1) the means and variances of X_n converge to some limits m and σ², and (2) X is normal with mean m and variance σ².

(g) X_n ⇒ 0 if and only if ∃δ > 0 such that ∀t ∈ [−δ, δ] we have ϕ_{X_n}(t) → 1.

12.10. Relationship between convergences

[Figure: diagram of implications among modes of convergence. L^p convergence ⇒ convergence in probability; a.s. convergence ⇒ convergence in probability; convergence in probability ⇒ convergence in distribution; convergence in probability yields an a.s.-convergent subsequence (X_{n_k}); domination (|X_n| ≤ Y with Y ∈ L^p) upgrades convergence in probability or a.s. to L^p; continuous mappings f preserve a.s. convergence, convergence in probability, and convergence in distribution; convergence in probability to a constant c is equivalent to convergence in distribution to c; and X_n ⇒ X is equivalent to lim_{n→∞} F_n(x) = F(x) for all x in a dense set of continuity points of F.]

(a) For discrete probability space, Xn −−→ X if and only if Xn −

→X

P Lp

(b) Suppose Xn −

→ X, ∀n |Xn | ≤ Y , Y ∈ Lp . Then, X ∈ Lp and Xn −→ X

a.s. Lp

(c) Suppose Xn −−→ X, ∀n |Xn | ≤ Y , Y ∈ Lp . Then, X ∈ Lp and Xn −→ X

P D

(d) Xn −

→ X ⇒ Xn −

→X

112

D P

(e) If Xn −

→ X and if ∃a ∈ RX = a a.s., then Xn −

→X

P D

(i) Hence, when X = a a.s., Xn −

→ X and Xn −

→ X are equivalent.

See also Figure 22.

Example 12.11.

(a) Ω = [0, 1], P uniform on [0, 1]. X_n(ω) = 0 for ω ∈ [0, 1 − 1/n²) and X_n(ω) = e^n for ω ∈ [1 − 1/n², 1].

(i) X_n does not converge to 0 in L^p.

(ii) X_n →^{a.s.} 0.

(iii) X_n →^P 0.

(b) Let Ω = [0, 1] with uniform probability distribution. Define X_n(ω) = ω + 1_{[(j−1)/k, j/k]}(ω), where the pairs (j, k) with 1 ≤ j ≤ k are enumerated so that the indicator interval sweeps [0, 1] as in the "typewriter" pattern shown in Figure 23:

[0, 1];  [0, 1/2] → [1/2, 1];  [0, 1/3] → [1/3, 2/3] → [2/3, 1];  [0, 1/4] → [1/4, 1/2] → [1/2, 3/4] → [3/4, 1];  ...

Let X(ω) = ω.

(i) The sequence of real numbers (X_n(ω)) does not converge for any ω.

(ii) X_n →^{L^p} X.

(iii) X_n →^P X.

(c) X_n = 1/n with probability 1 − 1/n, and X_n = n with probability 1/n.

(i) X_n →^P 0.

(ii) X_n does not converge to 0 in L^p.


12.1 Summation of random variables

Set S_n = Σ_{i=1}^n X_i.

If (1/n²) Var S_n → 0, then (1/n) S_n − (1/n) E S_n →^P 0.

• If the (X_i) are pairwise independent, then (1/n²) Var S_n = (1/n²) Σ_{i=1}^n Var X_i. See also (12.16).

12.13. For independent (X_n), the probability that the series Σ_{i=1}^∞ X_i converges is either 0 or 1.

12.14 (Kolmogorov's SLLN). Consider a sequence (X_n) of independent random variables. If Σ_n (Var X_n)/n² < ∞, then (1/n) S_n − (1/n) E S_n →^{a.s.} 0.

• In particular, for independent X_n, if (1/n) Σ_{i=1}^n E X_i → a or E X_n → a, then (1/n) S_n →^{a.s.} a.

12.15. Suppose the series Σ_n X_n of independent random variables converges a.s. to X. Then ϕ_X = ϕ_{X_1} ϕ_{X_2} ···.

12.16. For pairwise independent (X_i) with finite variances, if (1/n²) Σ_{i=1}^n Var X_i → 0, then (1/n) S_n − (1/n) E S_n →^P 0.

(a) Chebyshev's Theorem: For pairwise independent (X_i) with uniformly bounded variances, (1/n) S_n − (1/n) E S_n →^P 0.

Let (X_i) be i.i.d. random variables.

12.17 (Chebyshev's inequality). P[ |(1/n) Σ_{i=1}^n X_i − E X_1| ≥ ε ] ≤ (1/ε²) Var[X_1]/n.

12.18 (WLLN). Weak Law of Large Numbers:

(a) L² weak law (finite variance): Let (X_i) be uncorrelated (or pairwise independent) random variables.

(i) Var S_n = Σ_{i=1}^n Var X_i.

(ii) If (1/n²) Σ_{i=1}^n Var X_i → 0, then (1/n) S_n − (1/n) E S_n →^P 0.


(iii) If E X_i = μ and Var X_i ≤ C < ∞, then (1/n) Σ_{i=1}^n X_i →^{L²} μ.

(b) Let (X_i) be i.i.d. random variables with E X_i = μ and Var X_i = σ² < ∞. Then

(i) P[ |(1/n) Σ_{i=1}^n X_i − E X_1| ≥ ε ] ≤ (1/ε²) Var[X_1]/n,

(ii) (1/n) Σ_{i=1}^n X_i →^{L²} μ.

(c) (1/n) Σ_{i=1}^n X_i →^{L²} μ implies (1/n) Σ_{i=1}^n X_i →^P μ, which in turn implies (1/n) Σ_{i=1}^n X_i →^D μ.

(d) If the X_n are i.i.d. random variables such that lim_{t→∞} t P[|X_1| > t] = 0, then (1/n) S_n − E[X_1^{(n)}] →^P 0, where X_1^{(n)} = X_1 1[|X_1| ≤ n] denotes the truncated variable.

(e) Khintchine's theorem: If the X_n are i.i.d. random variables and E|X_1| < ∞, then E[X_1^{(n)}] → E X_1 and (1/n) S_n →^P E X_1.

12.19 (SLLN).

(a) Kolmogorov's SLLN: Consider a sequence (X_n) of independent random variables. If Σ_n (Var X_n)/n² < ∞, then (1/n) S_n − (1/n) E S_n →^{a.s.} 0.

• In particular, for independent X_n, if (1/n) Σ_{i=1}^n E X_i → a or E X_n → a, then (1/n) S_n →^{a.s.} a.

(b) Khintchine's SLLN: If the X_i are i.i.d. with finite mean μ, then (1/n) Σ_{i=1}^n X_i →^{a.s.} μ.

(c) Consider a sequence (X_n) of i.i.d. random variables. If E X_1^− < ∞ and E X_1^+ = ∞, then (1/n) S_n →^{a.s.} ∞.

• Suppose that the X_n ≥ 0 are i.i.d. random variables and E X_n = ∞. Then (1/n) S_n →^{a.s.} ∞.

12.20 (Relative frequency interpretation of probability). Consider i.i.d. Z_i ∼ Z. Let X_i = 1_A(Z_i). Then (1/n) S_n = (1/n) Σ_{i=1}^n 1_A(Z_i) = r_n(A), the relative frequency of the event A. Via the LLN and appropriate conditions, r_n(A) converges to E[(1/n) Σ_{i=1}^n 1_A(Z_i)] = P[Z ∈ A].
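A tiny simulation sketch of 12.20 (choices of Z and A are arbitrary): Z ∼ U(0, 1) and A = [0, 0.3], so the relative frequency should approach P[Z ∈ A] = 0.3.

```python
import numpy as np

rng = np.random.default_rng(7)
z = rng.uniform(size=1_000_000)
indicators = (z <= 0.3).astype(float)       # 1_A(Z_i)
for n in (100, 10_000, 1_000_000):
    print(n, indicators[:n].mean())         # relative frequency r_n(A), approaching 0.3
```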


12.4 Central Limit Theorem (CLT)

12.21. Suppose that (X_k)_{k≥1} is a sequence of i.i.d. random variables with mean m and variance 0 < σ² < ∞. Let S_n = Σ_{k=1}^n X_k.

(a) (S_n − nm)/(σ√n) = (1/√n) Σ_{k=1}^n (X_k − m)/σ ⇒ N(0, 1).

(b) (S_n − nm)/√n = (1/√n) Σ_{k=1}^n (X_k − m) ⇒ N(0, σ²).

To see this, let Z_k = (X_k − m)/σ, which are i.i.d. ∼ Z, and Y_n = (1/√n) Σ_{k=1}^n Z_k. Then EZ = 0, Var Z = 1, and ϕ_{Y_n}(t) = (ϕ_Z(t/√n))^n. By the approximation e^x ≈ 1 + x + (1/2)x², we have ϕ_X(t) ≈ 1 + jtEX − (1/2)t² E[X²] (see also (22)) and

ϕ_{Y_n}(t) ≈ (1 − t²/(2n))^n → e^{−t²/2}.

• The case of Bernoulli(1/2) was derived by Abraham de Moivre around 1733. The case of Bernoulli(p) for 0 < p < 1 was considered by Pierre-Simon Laplace [9, p 208].

12.22 (Approximation of densities and pmfs using the CLT). Approximate the distribution of S_n by N(nm, nσ²).

• F_{S_n}(s) ≈ Φ((s − nm)/(σ√n)).

• f_{S_n}(s) ≈ (1/(√(2π) σ√n)) e^{−(1/2)((s − nm)/(σ√n))²}.

• P[S_n = k] = P[k − 1/2 < S_n ≤ k + 1/2] ≈ (1/(√(2π) σ√n)) e^{−(1/2)((k − nm)/(σ√n))²}.

• The approximation n! ≈ √(2π) n^{n+1/2} e^{−n} can be derived by approximating the density of S_n when X_i ∼ E(1). We know that f_{S_n}(s) = s^{n−1} e^{−s}/(n − 1)!. Approximating the density at s = n gives (n − 1)! ≈ √(2π) n^{n−1/2} e^{−n}; multiply through by n. [9, Ex 5.18 p 212]

• See also the normal approximation for the binomial in (13).
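A small numerical sketch of the first bullet in 12.22 (not from the notes): a sum of n = 30 i.i.d. U(0, 1) variables (m = 1/2, σ² = 1/12) compared with the normal cdf approximation.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(8)
n, m, sigma = 30, 0.5, sqrt(1 / 12)
S = rng.uniform(size=(200_000, n)).sum(axis=1)
Phi = lambda t: 0.5 * (1 + erf(t / sqrt(2)))        # standard normal cdf
s = 16.0
print(np.mean(S <= s), Phi((s - n * m) / (sigma * sqrt(n))))   # empirical vs CLT value
```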


13 Conditional Probability and Expectation

13.1 Conditional Probability

Deﬁnition 13.1. Suppose conditioned on X = x, Y has distribution Q. Then, we write

Y |X = x ∼ Q [21, p 40]. It might be clearer to write P Y |X=x = Q.

13.2. Discrete random variables

(a) p_{X|Y}(x|y) = p_{X,Y}(x, y) / p_Y(y).

(c) The law of total probability: P[Y ∈ C] = Σ_x P[Y ∈ C | X = x] P[X = x]. In particular, p_Y(y) = Σ_x p_{Y|X}(y|x) p_X(x).

(d) p_{X|Y}(x|y) = p_{Y|X}(y|x) p_X(x) / Σ_{x'} p_{Y|X}(y|x') p_X(x').

• If Y = X + Z with X and Z independent, then

• p_{Y|X}(y|x) = p_Z(y − x),

• p_Y(y) = Σ_x p_Z(y − x) p_X(x),

• p_{X|Y}(x|y) = p_Z(y − x) p_X(x) / Σ_{x'} p_Z(y − x') p_X(x').

(e) Substitution: P[g(X, Y) = z | X = x] = P[g(x, Y) = z | X = x].

• When X and Y are independent, we can "drop the conditioning".

13.3. Absolutely continuous random variables

(a) f_{Y|X}(y|x) = f_{X,Y}(x, y) / f_X(x).

(b) F_{Y|X}(y|x) = ∫_{−∞}^{y} f_{Y|X}(t|x) dt.

13.4. P[(X, Y) ∈ A] = E[ P[(X, Y) ∈ A | X] ] = ∫_A f_{X,Y}(x, y) d(x, y).

13.5. (∂/∂z) F_{Y|X}(g(z, x) | x) = (∂/∂z) ∫_{−∞}^{g(z,x)} f_{Y|X}(y|x) dy = f_{Y|X}(g(z, x) | x) (∂/∂z) g(z, x).

13.6. F_{X|Y}(x|y) = P[X ≤ x | Y = y] = lim_{Δ→0} P[X ≤ x, y − Δ < Y ≤ y] / P[y − Δ < Y ≤ y] = E[ 1_{(−∞,x]}(X) | Y = y ].

13.7. Define f_{X|A}(x) to be f_{X,A}(x) / P(A). Then

P[A | X = x] = lim_{Δ→0} ∫_{x−Δ}^{x} f_{X,A}(x') dx' / ∫_{x−Δ}^{x} f_X(x') dx' = f_{X,A}(x) / f_X(x) = P(A) f_{X|A}(x) / f_X(x).

13.8. For independent X_i, P[∀i X_i ∈ B_i | ∀i X_i ∈ C_i] = Π_i P[X_i ∈ B_i | X_i ∈ C_i].


13.2 Conditional Expectation

Definition 13.9. E[g(Y) | X = x] = Σ_y g(y) p_{Y|X}(y|x).

• In particular, E[Y | X = x] = Σ_y y p_{Y|X}(y|x).

• Note that E[Y | X = x] is a function of x.

13.10. Properties of conditional expectation.

(a) Substitution law for conditional expectation: E [g(X, Y )|X = x] = E [g(x, Y )|X = x]

(b) E [h(X)g(Y )|X = x] = h(x)E [g(Y )|X = x].

(c) (The Rule of Iterated Expectations) EY = E [E [Y |X]].

(d) Law of total probability for expectation:

E [g(X, Y )] = E [E [g(X, Y )|X]] .

(i) E [g(Y )] = E [E [g(Y )|X]].

(ii) FY (y) = E FY |X (y|X) . (Take g(z) = 1(−∞,y] (z)).

(e) E [g (X) h(Y ) |X = x ] = g (x) E [h(Y ) |X = x ]

(f) E [g (X) h(Y )] = E [g (X) E [h(Y ) |X ]]

(g) E [X + Y |Z = z ] = E [X |Z = z ] + E [Y |Z = z ]

(h) E [AX|z] = AE [X|z].

(i) E [(X − E [X |Y ])| Y ] = 0 and E [X − E [X |Y ]] = 0

(j) ming(x) E [(Y − g(X))2 ] = E [(Y − E [Y |X])2 ] where g ranges over all functions. Hence,

E [Y |X] is sometimes called the regression of Y on X, the “best” predictor of Y con-

ditional on X.
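A short simulation sketch of properties (c) and (j) (not from the notes; the model Y = X² + noise is an arbitrary example in which E[Y|X] = X²):

```python
import numpy as np

rng = np.random.default_rng(9)
x = rng.uniform(size=1_000_000)
y = x**2 + 0.1 * rng.standard_normal(x.size)            # Y = X^2 + zero-mean noise
print(np.mean(x**2), y.mean())                          # iterated expectations: both ~1/3
print(np.mean((y - x**2) ** 2), np.mean((y - x) ** 2))  # MSE of E[Y|X] vs the predictor g(X) = X
```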

Deﬁnition 13.11. The conditional variance is deﬁned as

!

Var[Y |X = x] = (y − m(x))2 f (y|x)dy

13.12. Properties of conditional variance

(a) Var Y = E [Var[Y |X]] + Var[E [Y |X]].

In other words, suppose given X = x, the mean and variance of Y is m(x), v(x). Then,

the variance of Y is Var Y = E [v(X)] + Var[m(X)]. Recall that for any function g,

we have E [g(Y )] = E [E [g(Y )|X]]. Because EY is just a constant, say μ, we can

deﬁne g(y) = (y − μ)2 which then implies Var Y = E [E [g(Y )|X]]. Note, however,

that E [g(Y )|X] and Var[Y |X] are not the same. Suppose conditioned

on X, Y has

distribution Q with mean m(x) and variance v(x). Then, v(x) = (y − m(x))2 Q(dy).

However, E [g(Y )|X = x] = (y − μ) Q(dy); note the use of μ in stead of m(x).

2


• All three terms in the expression are nonnegative. Var Y is an upper bound for

each of the terms on the RHS.


(c) Var[AX|z] = A Var[X|z]AH .

13.13. Suppose E[Y|X] = X. Then Cov[X, Y] = Var X. See also (5.19). This also holds for Y = X + N with X and N independent and N zero-mean noise.

Definition 13.14. μ_n[Y|X] = E[(Y − E[Y|X])^n | X].

13.15. Properties

(a) μ_3[Y] = E[μ_3[Y|X]] + μ_3[E[Y|X]].

(b) μ_4[Y] = E[μ_4[Y|X]] + 6 E[Var[Y|X]] Var[E[Y|X]] + μ_4[E[Y|X]].

13.16. The following statements are equivalent conditions for X_1, X_2, ..., X_n to be mutually independent conditioned on Y (a.s.):

(a) p(x_1^n | y) = Π_{i=1}^n p(x_i | y).

(b) ∀i ∈ [n]\{1}, p(x_i | x_1^{i−1}, y) = p(x_i | y).

(c) ∀i ∈ [n], X_i and the vector (X_j)_{j∈[n]\{i}} are independent conditioned on Y.

13.17. Even when X and Y are independent, after conditioning on a third random variable Z it is not true in general that X and Y are still independent. See example (4.45). Recall that Z = X ⊕ Y, which can be rewritten as Y = X ⊕ Z. Hence, when Z = 0, we must have Y = X.

13.18. Suppose we know that f_{X|Y,Z}(x|y, z) = g(x, y); that is, f_{X|Y,Z} does not depend on z. Then, conditioned on Y, X and Z are independent. In that case, f_{X|Y,Z}(x|y, z) = f_{X|Y}(x|y) = g(x, y).

13.19. Suppose we know that fZ|V,U1 ,U2 (z |v, u1 , u2 ) = fZ|V (z |v ) for all z, v, u1 , u2 , then

conditioned on V , we can conclude that Z and (U1 , U2 ) are independent. This further

implies Z and Ui are independent. Moreover,


14 Real-valued Jointly Gaussian

Definition 14.1. A random vector X in R^d is jointly Gaussian or jointly normal if and only if ∀v ∈ R^d, the random variable v^T X is Gaussian.

• In order for this definition to make sense when v = 0 or when X has a singular covariance matrix, we agree that any constant random variable is considered to be Gaussian.

• Of course, the mean and variance of v^T X are v^T EX and v^T Λ_X v, respectively.

If X is a Gaussian random vector with mean vector m and covariance matrix Λ, we write X ∼ N(m, Λ).

14.2. Properties of a jointly Gaussian random vector X ∼ N(m, Λ)

(a) m = EX, Λ = Cov[X] = E[(X − EX)(X − EX)^T].

(b) f_X(x) = (1/((2π)^{n/2} √(det Λ))) e^{−(1/2)(x − m)^T Λ^{-1} (x − m)}.

• To remember the form of the above formula, note that the exponent has to be a scalar. So we should have (x − m)^T Λ^{-1} (x − m) rather than the transpose on the last term. To make this clearer, set Λ = I; then we must have a dot product. Note also that v^T A v = Σ_k Σ_ℓ v_k A_{kℓ} v_ℓ.

• The above formula can be derived by starting from a random vector Z whose components are i.i.d. N(0, 1). Let X = Λ^{1/2} Z + m. Use (28) and (32).

• For independent X_i ∼ N(m_i, σ_i²),

f_X(x) = (1/((2π)^{n/2} Π_i σ_i)) e^{−(1/2) Σ_i ((x_i − m_i)/σ_i)²}.

In particular, if the X_i are i.i.d. N(0, 1),

f_X(x) = (1/(2π)^{n/2}) e^{−(1/2) x^T x}.   (32)

• (2π)^{n/2} √(det Λ) = √(det(2πΛ)).

• The Gaussian density is constant on the "ellipsoids" centered at m,

{ x ∈ R^n : (x − m)^T Λ^{-1} (x − m) = constant }.

(c) ϕ_X(v) = e^{j v^T m − (1/2) v^T Λ v} = e^{j Σ_i v_i E X_i − (1/2) Σ_k Σ_ℓ v_k v_ℓ Cov[X_k, X_ℓ]}.

• This can be derived from Definition (14.1) by noting that ϕ_X(v) = E[e^{j v^T X}] is the characteristic function of the scalar Gaussian random variable v^T X evaluated at 1.


(d) X is jointly Gaussian if and only if ∀v ∈ R^d, the random variable v^T X is Gaussian.

(e) Joint Gaussianity is preserved under affine transformation: suppose Y = AX + b; then Y ∼ N(Am + b, AΛA^T).

(f) If (X, Y) are jointly Gaussian, then X and Y are independent if and only if Cov[X, Y] = 0. Hence, uncorrelated jointly Gaussian random variables are independent.

(g) Note that the joint density does not exist when the covariance matrix Λ is singular.

(h) For i.i.d. X_i ∼ N(μ, σ²),

f_{X_1^n}(x_1^n) = (1/((2π)^{n/2} σ^n)) exp( −(1/(2σ²)) ‖x − μ1‖² )
= (1/(2πσ²)^{n/2}) exp( −(1/(2σ²)) Σ_{i=1}^n x_i² + (μ/σ²) Σ_{i=1}^n x_i − nμ²/(2σ²) ).

In particular, E[(X − EX)³] = 0.

(j) Isserlis's Theorem: Any fourth-order central moment of jointly Gaussian r.v.s is expressible as a sum of all possible products of pairs of their covariances:

E[(X_i − EX_i)(X_j − EX_j)(X_k − EX_k)(X_ℓ − EX_ℓ)]
= Cov[X_i, X_j] Cov[X_k, X_ℓ] + Cov[X_i, X_k] Cov[X_j, X_ℓ] + Cov[X_i, X_ℓ] Cov[X_j, X_k].

Note that (1/2) (4 choose 2) = 3 is the number of such pairings.

• In particular, E[(X − EX)^4] = 3σ^4.

(k) Generation: write Λ = V D V^T, where V is an orthogonal matrix whose columns are eigenvectors of Λ and D is a diagonal matrix with the eigenvalues of Λ. The random vector we want is V X + m where X ∼ N(0, D).

The bivariate jointly Gaussian density is

f_{X,Y}(x, y) = (1/(2π σ_X σ_Y √(1 − ρ²))) exp{ −[ ((x − EX)/σ_X)² − 2ρ ((x − EX)/σ_X)((y − EY)/σ_Y) + ((y − EY)/σ_Y)² ] / (2(1 − ρ²)) },

where ρ = Cov(X, Y)/(σ_X σ_Y) ∈ [−1, 1]. Here, x, y ∈ R.


• f_{X,Y}(x, y) = (1/(σ_X σ_Y)) ψ_ρ((x − m_X)/σ_X, (y − m_Y)/σ_Y), with ψ_ρ as below.

• f_{X,Y} is constant on ellipses of the form (x/σ_X)² + (y/σ_Y)² = r².

• Λ = [ σ_X²  Cov[X, Y] ; Cov[X, Y]  σ_Y² ] = [ σ_X²  ρσ_Xσ_Y ; ρσ_Xσ_Y  σ_Y² ].

• The following are equivalent:

(a) ρ = 0

(b) Cov[X, Y] = 0

(c) X and Y are independent.

◦ If Y = kX + c with k ≠ 0, then ρ = k/|k| = sign(k) and |k| = σ_Y/σ_X.

• Suppose f_{X,Y}(x, y) depends only on x² + y² and X and Y are independent. Then X and Y are normal with zero mean and equal variance.

• X|Y ∼ N( ρ (σ_X/σ_Y)(Y − m_Y) + m_X, σ_X²(1 − ρ²) ) and Y|X ∼ N( ρ (σ_Y/σ_X)(X − m_X) + m_Y, σ_Y²(1 − ρ²) ).

ψ_ρ(u, v) = (1/(2π√(1 − ρ²))) e^{ −(u² − 2ρuv + v²)/(2(1 − ρ²)) }
= ψ(u) · (1/√(1 − ρ²)) ψ((v − ρu)/√(1 − ρ²)) = (1/√(1 − ρ²)) ψ((u − ρv)/√(1 − ρ²)) · ψ(v),

i.e., f_U(u) f_{V|U}(v|u) with V|U ∼ N(ρU, 1 − ρ²), or f_{U|V}(u|v) f_V(v) with U|V ∼ N(ρV, 1 − ρ²)

[9, eq. (7.22) p 309, eq. (7.23) p 311, eq. (7.26) p 313]. This is the joint density of (U, V) where U, V ∼ N(0, 1) with Cov[U, V] = E[UV] = ρ.

( X ; Y ) = ( σ_X U + m_X ; σ_Y V + m_Y ) = [ σ_X  0 ; 0  σ_Y ] ( U ; V ) + ( m_X ; m_Y ).


(a) Suppose (X, Y) are jointly Gaussian; that is, (X ; Y) ∼ N( (μ_X ; μ_Y), [ Λ_X  Λ_XY ; Λ_YX  Λ_Y ] ). Then f_{X|Y}(x|y) is N(E[X|y], Λ_{X|y}), where E[X|y] = μ_X + Λ_XY Λ_Y^{-1} (y − μ_Y) and Λ_{X|y} = Λ_X − Λ_XY Λ_Y^{-1} Λ_YX.

(The arrangement Λ_X → Λ_XY ↓ Λ_Y ← Λ_YX is a mnemonic for the order of the blocks in the correction term.)

(b) Suppose (X, Y, W) are jointly Gaussian with W independent of (X, Y). Set V = BX + W. Then V|y ∼ N(E[V|y], Λ_{V|y}), where E[V|y] = B E[X|y] + EW and Λ_{V|y} = B Λ_{X|y} B^T + Λ_W.
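A Monte Carlo sketch of (a) in the bivariate case (the parameters below are arbitrary): conditioning on Y near y0, the conditional mean and variance of X match μ_X + Λ_XY Λ_Y^{-1}(y0 − μ_Y) and Λ_X − Λ_XY Λ_Y^{-1} Λ_YX.

```python
import numpy as np

rng = np.random.default_rng(11)
mu = np.array([1.0, 2.0])
Lam = np.array([[1.0, 0.6],
                [0.6, 2.0]])
XY = rng.multivariate_normal(mu, Lam, size=2_000_000)
y0 = 3.0
sel = np.abs(XY[:, 1] - y0) < 0.02                   # condition on Y ~ y0
print(XY[sel, 0].mean(), mu[0] + Lam[0, 1] / Lam[1, 1] * (y0 - mu[1]))   # E[X|y0]
print(XY[sel, 0].var(),  Lam[0, 0] - Lam[0, 1] ** 2 / Lam[1, 1])         # Lam_{X|y}
```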

Consider a pair of random vectors Θ and Y , where Θ is not observed, but Y is observed.

We know the joint distribution of the pair (Θ, Y ) which is usually given in the form of the

prior distribution pΘ (θ) and the conditional distribution pY |Θ (y|θ). By an estimator of Θ

based on Y , we mean a function g such that Θ̂(Y ) = g(Y ) is our estimate or “guess” of

the value of Θ.

15.1 (Orthogonality Principle). Let D be a collection of random vectors with the same

dimension as Θ. For a random vector Z, suppose that

∀X ∈ D, Z − X ∈ D. (33)

If

E X T (Θ − Z) = 0, ∀X ∈ D, (34)

then

E|Θ − X|² = E|Θ − Z|² + E|Z − X|² + 2 E[(Z − X)^T (Θ − Z)],

where Z − X ∈ D, so the last expectation is 0 by (34). This implies

E|Θ − Z|² ≤ E|Θ − X|²,  ∀X ∈ D.

• If D is a subspace and Z ∈ D, then (33) is automatically satisﬁed.

• (34) says that the vector Θ − Z is orthogonal to all vectors in D.

Example 15.2. Suppose Θ and N are independent Poisson random variables with respective parameters λ and μ. Let Y = Θ + N.

• Y is P(λ + μ).

• Conditioned on Y = y, Θ is B(y, λ/(λ + μ)).

• Θ̂_MMSE(Y) = E[Θ|Y] = (λ/(λ + μ)) Y.

• Var[Θ|Y] = Y λμ/(λ + μ)², and MMSE = E[Var[Θ|Y]] = λμ/(λ + μ) < Var Y = λ + μ.

15.3 (Weighted Error). Suppose we define the error by E = (Θ − Θ̂(Y))^T W (Θ − Θ̂(Y)) for some positive definite matrix W. (Note that the usual MSE uses W = I.) The expected error EE is uniquely minimized by the MMSE estimator Θ̂(Y) = E[Θ|Y]. The resulting minimum is E[(Θ − E[Θ|Y])^T W (Θ − E[Θ|Y])].

In fact, for any function g(Y), the conditional weighted error E[(Θ − g(Y))^T W (Θ − g(Y)) | Y] is given by

E[(Θ − E[Θ|Y])^T W (Θ − E[Θ|Y]) | Y] + (E[Θ|Y] − g(Y))^T W (E[Θ|Y] − g(Y)).

Θ̂_LMMSE = g_LMMSE(Y) minimizes the MSE E|Θ − g(Y)|² among all affine estimators of the form g(y) = Ay + b. In the scalar case,

Θ̂_LMMSE(Y) = EΘ + (Cov[Y, Θ]/Var Y)(Y − EY).

• To see this in Hilbert space, note that we want the orthogonal projection of Θ onto the subspace spanned by two elements: Y and 1. An orthogonal basis of the subspace is {1, Y − EY}. Hence, the orthogonal projection is

((Θ, 1)/(1, 1)) · 1 + ((Θ, Y − EY)/(Y − EY, Y − EY)) · (Y − EY).

• Equivalently, this amounts to finding a, b in Θ̂(Y) = aY + b such that the error E = Θ − Θ̂(Y) is orthogonal to both 1 and Y − EY. The condition (E, 1) = 0 requires Θ̂(Y) to be unbiased. The condition (E, Y − EY) = 0 gives a = Cov[Y, Θ]/Var Y.

In the vector case, Θ̂_LMMSE(Y) = EΘ + Σ_{ΘY} Σ_Y^{-1} (Y − EY) and

MMSE = Cov[Θ − Θ̂_LMMSE(Y)] = Σ_Θ − Σ_{ΘY} Σ_Y^{-1} Σ_{YΘ}.


In fact, the optimal choice of A is any solution of

A Σ_Y = Σ_{ΘY}.

In which case,

Cov[Θ − Θ̂_LMMSE(Y)] = Σ_Θ − A Σ_{YΘ} − Σ_{ΘY} A^T + A Σ_Y A^T
= Σ_Θ − Σ_{ΘY} A^T = Σ_Θ − A Σ_{YΘ} = Σ_Θ − A Σ_Y A^T.

When Σ_Y is invertible, A = Σ_{ΘY} Σ_Y^{-1}. When Σ_Y is singular, see [9, Q8.38 p 359].

• The MSE can be decomposed as E|(Θ − EΘ) − A(Y − EY)|² + |EΘ − AEY − b|², which shows that the optimal choice of b is b = EΘ − A EY. This is the b which makes the estimator unbiased.

• Fix A. Let Θ̃ = Θ − EΘ and Ỹ = Y − EY. Suppose that for any matrix B we have

E[(B Ỹ)^T (Θ̃ − A Ỹ)] = 0.

Then

E|Θ̃ − A Ỹ|² ≤ E|Θ̃ − B Ỹ|².

In which case,

E|Θ̃ − B Ỹ|² = E|Θ̃ − A Ỹ|² + E|(A − B) Ỹ|².

For a scalar signal in additive independent noise, Y = Θ + N with Θ and N independent,

Θ̂_LMMSE(Y) = EΘ + (Cov[Θ, Y]/Var Y)(Y − EY)
= EΘ + (Var Θ/(Var Θ + Var N))(Y − (EΘ + EN))
= EΘ + (SNR/(1 + SNR))(Y − (EΘ + EN)),

where SNR = Var Θ / Var N, and

MMSE = Var Θ − Cov²[Θ, Y]/Var Y = (Var Θ · Var N)/(Var Θ + Var N).
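A minimal simulation sketch of the signal-in-noise LMMSE estimator (Gaussian Θ and N are used only to generate data; the formulas depend only on second moments, and the numbers below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(12)
var_T, var_N, mean_T = 4.0, 1.0, 3.0
theta = mean_T + np.sqrt(var_T) * rng.standard_normal(1_000_000)
y = theta + np.sqrt(var_N) * rng.standard_normal(theta.size)       # Y = Theta + N, EN = 0
snr = var_T / var_N
theta_hat = mean_T + snr / (1 + snr) * (y - mean_T)                 # EΘ + SNR/(1+SNR)(Y - EY)
print(np.mean((theta - theta_hat) ** 2), var_T * var_N / (var_T + var_N))   # MMSE ~0.8
```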


A More Math

A.1 Inequalities

A.1. By deﬁnition,

∞ N

• n=1 an = n∈N an = lim n=1 an and N →∞

∞ N

• n=1 an = n∈N an = lim n=1 an .

N →∞

2

ex−x ≤ 1 + x ≤ ex .

This is because

x − x2 ≤ ln (1 + x) ≤ x,

which is semi-proved by the plot in ﬁgure 24. .

[Figure 24: plot of x, ln(1 + x), x − x², and x/(x + 1) for x ∈ [−0.8, 0.8]; the lower bound x − x² ≤ ln(1 + x) fails below x ≈ −0.6838.]

A.3. Let αi and βi be complex numbers with |αi | ≤ 1 and |βi | ≤ 1. Then,

m m m

αi − βi ≤ |αi − βi |.

i=1 i=1 i=1

rn

A.4. Consider a triangular array of real numbers (xn,k ). Suppose (i) xn,k → x and (ii)

k=1

rn

x2n,k → 0. Then,

k=1

rn

(1 + xn,k ) → ex .

k=1


Proof. Use (A.2).

n

Suppose the sum rk=1 |xn,k | converges as n → ∞. Conditions (i) and (ii) is equivalent

to conditions (i) and (iii) where (iii) is the requirement that max |xk,n | → 0 as n → ∞.

k∈[rn ]

rn

Proof. Suppose k=1 |xn,k | → x0 . To show that (iii) implies (ii) under (i), let an =

max |xk,n |. Then,

k∈[rn ]

rn

rn

0≤ x2n,k ≤ an |xn,k | → 0 × x0 = 0.

k=1 k=1

On the other hand, suppose we have (i) and (ii). Given any ε > 0, by (ii), ∃n0 such that

rn

rn

∀n ≥ n0 , x2n,k ≤ ε2 . Hence, for any k, x2n,k ≤ x2n,k ≤ ε2 and hence |xn,k | ≤ ε which

k=1 k=1

implies an ≤ ε.

rnNote that when the xk,n are non-negative, condition (i) already implies that the sum

k=1 |xn,k | converges as n → ∞.

n

A.5. Suppose lim an = a. Then lim 1 − ann = e−a [9, p 584].

n→∞ n→∞

Proof. Use (A.4) with rn = n, xn,k = − ann . Then, nk=1 xn,k = −an → −a and nk=1 x2n,k =

a2n n1 → a · 0 = 0.

n

Alternatively, from L’Hôpital’s rule, lim 1 − na = e−a . (See also [18, Theorem 3.31,

n→∞

p 64]) This gives

a direct proof for the case when a > 0. For n large enough, note that

both 1− an

and 1 − a

are ≤ 1 where we need a > 0 here. Applying (A.3), we get

n n

1 − an − 1 − a ≤ |an − a| → 0.

n n

n n n n −1

b −1

For a < 0, we use the fact that, for bn → b > 0, (1) 1 + n = 1 + nb → e−b

−1 −1

and (2) for n large enough, both 1 + nb and 1 + bnn are ≤ 1 and hence

+ −1 ,n + −1 ,n

bn b |bn− b|

1+ − 1+ ≤ → 0.

n n 1 + bnn 1 + nb

A.2 Summations

A.6. Basic formulas:

(a) Σ_{k=0}^n k = n(n + 1)/2

(b) Σ_{k=0}^n k² = n(n + 1)(2n + 1)/6 = (1/6)(2n³ + 3n² + n)

(c) Σ_{k=0}^n k³ = (Σ_{k=0}^n k)² = (1/4) n²(n + 1)² = (1/4)(n⁴ + 2n³ + n²)


A nicer formula is given by

n

1

k (k + 1) · · · (k + d) = n (n + 1) · · · (n + d + 1) (35)

k=1

d+2

n

A.7. Let g (n) = h (k) where h is a polynomial of degree d. Then, g is a polynomial

k=0

d+1

of degree d + 1; that is g (n) = am x m .

m=1

• To ﬁnd the coeﬃcients am , evaluate g (n) for n = 1, 2, . . . , d + 1. Note that the case

when n = 0 gives a0 = 0 and hence the sum starts with m = 1.

+ d−1 ,

h (k) = bi k (k + 1) · · · (k + i) + c.

i=0

• k 3 = k (k + 1) (k + 2) − 3k (k + 1) + k

A.8. Geometric Sums (infinite sums require |ρ| < 1):

(a) Σ_{i=0}^∞ ρ^i = 1/(1 − ρ)

(b) Σ_{i=k}^∞ ρ^i = ρ^k/(1 − ρ)

(c) Σ_{i=a}^b ρ^i = (ρ^a − ρ^{b+1})/(1 − ρ)

(d) Σ_{i=0}^∞ i ρ^i = ρ/(1 − ρ)²

(e) Σ_{i=a}^b i ρ^i = (ρ^{b+1}(bρ − b − 1) − ρ^a(aρ − a − ρ))/(1 − ρ)²

(f) Σ_{i=k}^∞ i ρ^i = kρ^k/(1 − ρ) + ρ^{k+1}/(1 − ρ)²

(g) Σ_{i=0}^∞ i² ρ^i = (ρ + ρ²)/(1 − ρ)³
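A two-line numerical check of (d) and (g) (ρ = 0.4 chosen arbitrarily; the series is truncated at 500 terms, which is more than enough here):

```python
import numpy as np

rho = 0.4
i = np.arange(500)
print(np.sum(i * rho**i),    rho / (1 - rho) ** 2)              # (d): ~1.1111
print(np.sum(i**2 * rho**i), (rho + rho**2) / (1 - rho) ** 3)   # (g): ~2.5926
```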

n 2

n n

(a) ai = ai aj

i=1 i=1 j=1


∞

∞

∞

i

(b) f (i, j) = f (i, j) = 1 [i ≥ j]f (i, j)

j=1 i=j i=1 j=1 (i,j)

∞

λk λ2 λ3

• eλ = k!

=1+λ+ 2!

+ 3!

+ ...

k=0

2 3

∞ k−1

∞ k−1

• λeλ + eλ = 1 + 2λ + 3 λ2! + 4 λ3! + . . . = λ

k (k−1)! = λ

k (k−1)!

k=1 k=0

A.11. Zeta function ξ (s) is deﬁned for any complex number s with Re {s} > 1 by the

∞

1

Dirichlet series: ξ (s) = ns

.

n=1

(b) ξ (x) diverges for 0 < x ≤ 1

A.12. Abel’s theorem: Let a = (ai : i ∈ N) be any sequence of real or complex numbers

and let

∞

Ga (z) = ai z i ,

i=0

∞

be the power series with coeﬃcients a. Suppose that the series i=0 ai converges. Then,

∞

lim− Ga (z) = ai . (36)

z→1

i=0

In the special case where all the coeﬃcientsai are nonnegative real numbers, then the

above formula (36) holds also when the series ∞ i=0 ai does not converge. I.e. in that case

both sides of the formula equal +∞.

A.3 Derivatives

A.13. Basic Formulas

d u

(a) dx

a = au ln a du

dx

loga e du

(b) d

dx

loga u = u dx

, a = 0, 1


(c) Derivatives of the products: Suppose f (x) = g (x) h (x), then

n

(n) n (n−k)

f (x) = g (x) h(k) (x).

k=0

k

In fact,

dn dni

r r

n!

fi (t) = fi (t).

dtn i=1 n +···+n

n !n ! · · · nr ! i=1 dtni

=n 1 2

1 r

Deﬁnition A.14 (Jacobian). In vector calculus, the Jacobian is shorthand for either the

Jacobian matrix or its determinant, the Jacobian determinant. Let g be a function

from a subset D of Rn to Rm . If g is diﬀerentiable at z ∈ D, then all partial derivatives

exists at z and the Jacobian matrix of g at a point z ∈ D is

⎛ ∂g1 ⎞

∂x1

(z) · · · ∂g1

∂xn

(z)

⎜ .. . . ⎟ ∂g ∂g

dg (z) = ⎝ . .. .. ⎠ = (z) , . . . , (z) .

∂x1 ∂xn

∂gm

∂x1

(z) · · · ∂xn (z)

∂gm

∂(g1 ,...,gn )

Alternative notations for the Jacobian matrix are J, ∂(x 1 ,...,xn )

[7, p 242], Jg (x) where the it

is assumed that the Jacobian matrix is evaluated at z = x = (x1 , . . . , xn ).

j

exists

in a neighborhood of z and continuous at z, then g is diﬀerentiable at z. And dg is

continuous at z.

• Let A be an n-dimensional

“box” deﬁned by the corners x and x+Δx. The “volume”

of the image g(A) is ( i Δxi ) |det dg(x)|. Hence, the magnitude of the Jacobian

determinant gives the ratios (scaling factor) of n-dimensional volumes (contents). In

other words,

∂(y1 , . . . , yn )

dy1 · · · dyn = dx1 · · · dxn .

∂(x1 , . . . , xn )

• d(g −1 (y)) is the Jacobian of the inverse transformation.

• In MATLAB, use jacobian.

• Change of variable: Let g be a continuous diﬀerentiable map of the open set U onto

V . Suppose that g is one-to-one and that det(dg(x)) = 0 for all x.

! !

h(g(x)) |det(dg(x))| dx = h(y)dy.

U V

Deﬁnition A.15. The gradient (or gradient vector ﬁeld) of a scalar function f (x) with

respect to a vector variable x = (x1 , . . . , xn ) is

⎛ ∂f ⎞

∂θ1

⎜ ⎟

∇θ f (θ) = ⎝ ... ⎠ .

∂f

∂θn


If the argument is row vector, then,

⎛ ∂f1 ∂f2 ∂fm

⎞

⎜ ∂θ1 ∂θ1 ∂θ1

⎟

∇θ f T (θ) = ⎝ ... ..

.

... ..

. ⎠.

∂f1 ∂f2 ∂fm

∂θn ∂θn ∂θn

Deﬁnition A.16. Given a scalar-valued function f , the Hessian matrix is the square

matrix of second partial derivatives

⎡ 2 2f

⎤

∂ f

∂θ12

· · · ∂θ∂1 ∂θ

⎢ . n

⎥

T ⎢

∇θ f (θ) = ∇θ (∇θ f (θ)) = ⎣ .

2 . . . .

.

. . ⎥.

⎦

∂2f ∂2

∂θn ∂θ1

· · · ∂θ 2

n

A.17. ∇x f T (x) = (df (x))T .

dh (x) = (f (x))T dg (x) + (g (x))T df (x) .

• For an n × n matrix A, let f (x) = (Ax, x). Then df (x) = (Ax)T I + xT A =

xT AT + xT A.

◦ If A is symmetric, then df (x) = 2xT A. So, ∂

∂xj

(Ax, x) = 2 (Ax)j .

g ◦ f is diﬀerentiable at y and

d (g ◦ f ) (y) = dg (z) df (y)

p×n p×m m×n

(matrix multiplication).

• In particular,

d ∂ d ∂ d

g (x (t) , y (t) , z (t)) = g (x, y, z) x (t) + g (x, y, z) y (t)

dt ∂x dt ∂y dt

∂ d

+ g (x, y, z) z (t) .

∂z dt (x,y,z)=(x(t),y(t),z(t))

Then, df (x) = 0 ∀x ∈ D ⇒ f is constant.

A.21. If f is diﬀerentiable at y, then all partial and directional derivative exists at y

⎛ ⎞ ⎛ ∂f1 ⎞

∇f1 (y) ∂x1

(y) · · · ∂x

∂f1

(y)

⎜ ⎟ ⎜ ⎟

n

. . ... .. ∂f ∂f

df (y) = ⎝ .. ⎠=⎝ .. . ⎠= (y) , . . . , (y) .

∂x1 ∂xn

∇fm (y) ∂fm

∂x1

(y) · · · ∂xn (y)

∂fm


A.22. Inversion (Mapping) Theorem: Let open Ω ⊂ Rn , f : Ω → Rn is C 1 , c ∈ Ω.

df (c) is bijective, then ∃U open neighborhood of c such that

n×n

(b) f |U : U → V is bijective.

(c) g = (f |U )−1 : V → U is C 1 .

(d) ∀y ∈ V dg (y) = [df (g (y))]−1 .

A.23'. Useful derivatives (left column: differential d(·); right column: gradient ∇_x(·)):

d(x) = I;  ∇_x x^T = I.

d(‖x‖²) = 2x^T;  ∇_x ‖x‖² = 2x.

d(Ax + b) = A;  ∇_x (Ax + b)^T = A^T.

d(a^T x) = a^T;  ∇_x (a^T x) = a.

d(f^T(x) g(x)) = f^T(x) dg(x) + g^T(x) df(x);  ∇_x (f^T(x) g(x)) = (dg(x))^T f(x) + (df(x))^T g(x) = ∇_x g^T(x) · f(x) + ∇_x f^T(x) · g(x).

d(f^T(x) Q f(x)) = f^T(x) Q df(x) + f^T(x) Q^T df(x) (= 2 f^T(x) Q df(x) for symmetric Q);  ∇_x (f^T(x) Q f(x)) = (df(x))^T Q f(x) + (df(x))^T Q^T f(x) (= 2 (df(x))^T Q f(x) = 2 ∇_x f^T(x) · Q f(x) for symmetric Q).

d(x^T Q x) = 2x^T Q (for symmetric Q);  ∇_x (x^T Q x) = 2Qx.

d(‖f(x)‖²) = 2 f^T(x) df(x);  ∇_x ‖f(x)‖² = 2 (df(x))^T f(x) = 2 ∇_x f^T(x) · f(x).

A.4 Integration

A.23. Basic Formulas

(a) ∫ a^u du = a^u / ln a, for a > 0, a ≠ 1.

(b)
$$
\int_0^1 t^\alpha \, dt = \begin{cases} \frac{1}{\alpha+1}, & \alpha > -1,\\ \infty, & \alpha \le -1, \end{cases}
\qquad
\int_1^\infty t^\alpha \, dt = \begin{cases} \frac{-1}{\alpha+1}, & \alpha < -1,\\ \infty, & \alpha \ge -1. \end{cases}
$$
So the integration of the function 1/t is the borderline (test) case; in fact, ∫_0^1 (1/t) dt = ∫_1^∞ (1/t) dt = ∞.

(c)
$$
\int x^m \ln x \, dx = \begin{cases} \frac{x^{m+1}}{m+1}\left( \ln x - \frac{1}{m+1} \right), & m \ne -1,\\ \frac{1}{2}\ln^2 x, & m = -1. \end{cases}
$$

A.24. Integration by Parts:
$$
\int u \, dv = uv - \int v \, du .
$$


[Figure: graphs of several power functions t^α (exponents roughly between 0.5 and 4) on 0 ≤ t ≤ 5.]

(a) Basic idea: Start with an integral of the form ∫ f(x) g(x) dx. Match this with an integral of the form ∫ u dv by choosing dv to be part of the integrand, including dx and possibly f(x) or g(x).

(b) In particular, repeated application of integration by parts gives
$$
\int f(x) g(x) \, dx = f(x) G_1(x) + \sum_{i=1}^{n-1} (-1)^i f^{(i)}(x) G_{i+1}(x) + (-1)^n \int f^{(n)}(x) G_n(x) \, dx ,
\tag{37}
$$
where f^{(i)}(x) = (d^i/dx^i) f(x), G_1(x) = ∫ g(x) dx, and G_{i+1}(x) = ∫ G_i(x) dx. Figure 26 can be used to derive (37).

[Figure 26: The tabular scheme behind (37). The left column lists the successive derivatives f(x), f'(x), f''(x), …, f^{(n)}(x) ("Differentiate"); the right column lists g(x) and its successive integrals G_1(x), G_2(x), …, G_n(x) ("Integrate"); the diagonal products carry alternating signs +, −, +, …, and the last row contributes the remaining integral (−1)^n ∫ f^{(n)}(x) G_n(x) dx.]

Indeed,
$$
\int f(x) g(x) \, dx = f(x) G_1(x) - \int f'(x) G_1(x) \, dx ,
$$
and
$$
\int f^{(n)}(x) G_n(x) \, dx = f^{(n)}(x) G_{n+1}(x) - \int f^{(n+1)}(x) G_{n+1}(x) \, dx .
$$


Two examples of the tabular method: differentiating x² (giving x², 2x, 2, 0) while repeatedly integrating e^{3x} yields
$$
\int x^2 e^{3x} \, dx = \left( \frac{1}{3}x^2 - \frac{2}{9}x + \frac{2}{27} \right) e^{3x} ;
$$
differentiating sin x (giving sin x, cos x, −sin x) while repeatedly integrating e^x yields
$$
\int \sin x \, e^x \, dx = e^x \sin x - e^x \cos x - \int \sin x \, e^x \, dx ,
\qquad\text{so}\qquad
\int \sin x \, e^x \, dx = \frac{1}{2} (\sin x - \cos x)\, e^x .
$$
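Both antiderivatives are easy to confirm with the Symbolic Math Toolbox (a quick sketch):

    syms x real
    simplify(int(x^2*exp(3*x), x) - (x^2/3 - 2*x/9 + 2/27)*exp(3*x))   % 0 (up to a constant)
    simplify(int(sin(x)*exp(x), x) - (sin(x) - cos(x))*exp(x)/2)       % 0 (up to a constant)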

(c)
$$
\int x^n e^{ax} \, dx = \frac{x^n e^{ax}}{a} - \frac{n}{a} \int x^{n-1} e^{ax} \, dx ,
\qquad
\int x^n e^{ax} \, dx = \frac{e^{ax}}{a} \sum_{k=0}^{n} \frac{(-1)^k \, n!}{a^k (n-k)!}\, x^{n-k} .
$$

(a) n = 1: (e^{ax}/a)(x − 1/a).
(b) n = 2: (e^{ax}/a)(x² − (2/a)x + 2/a²).

(c)
$$
\int_0^t x^n e^{ax} \, dx = \frac{e^{at}}{a} \sum_{k=0}^{n} \frac{(-1)^k \, n!}{a^k (n-k)!}\, t^{n-k} - \frac{(-1)^n n!}{a^{n+1}}
= \frac{e^{at}}{a} \sum_{k=0}^{n} \frac{(-1)^k \, n!}{a^k (n-k)!}\, t^{n-k} + \frac{n!}{(-a)^{n+1}} .
$$

(d)
$$
\int_t^\infty x^n e^{-ax} \, dx = \frac{e^{-at}}{a} \sum_{k=0}^{n} \frac{n!}{a^k (n-k)!}\, t^{n-k}
= \frac{e^{-at}}{a} \sum_{j=0}^{n} \frac{n!}{a^{n-j} \, j!}\, t^{j} .
$$

(e) ∫_0^∞ x^n e^{ax} dx = n!/(−a)^{n+1}, for a < 0. (See also the Gamma function.)

• n! = ∫_0^∞ e^{−t} t^n dt.

• In MATLAB, consider using gamma(n+1) instead of factorial(n). Note also that gamma() accepts vector input.
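For example (a quick sketch):

    n = 0:5;
    disp([gamma(n + 1); factorial(n)])   % both rows equal n! = [1 1 2 6 24 120]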

(f) ∫_0^1 x^β e^{−x} dx is finite if and only if β > −1.
Note that (1/e) ∫_0^1 x^β dx ≤ ∫_0^1 x^β e^{−x} dx ≤ ∫_0^1 x^β dx.

(g) For all β ∈ R, ∫_1^∞ x^β e^{−x} dx < ∞.
For β ≤ 0: ∫_1^∞ x^β e^{−x} dx ≤ ∫_1^∞ e^{−x} dx < ∫_0^∞ e^{−x} dx = 1.
For β > 0: ∫_1^∞ x^β e^{−x} dx ≤ ∫_1^∞ x^{⌈β⌉} e^{−x} dx ≤ ∫_0^∞ x^{⌈β⌉} e^{−x} dx = ⌈β⌉!.


(h) ∫_0^∞ x^β e^{−x} dx is finite if and only if β > −1.

Leibniz rule (differentiation under the integral sign): Let g(x, y) and ∂g/∂x(x, y) be continuous, and let a, b : R → R be C^1. Then f(x) = ∫_{a(x)}^{b(x)} g(x, y) dy is C^1 and
$$
f'(x) = b'(x)\, g(x, b(x)) - a'(x)\, g(x, a(x)) + \int_{a(x)}^{b(x)} \frac{\partial g}{\partial x}(x, y) \, dy .
\tag{38}
$$
In particular, we have
$$
\frac{d}{dx} \int_a^x f(t) \, dt = f(x),
\tag{39}
$$
$$
\frac{d}{dx} \int_a^{v(x)} f(t) \, dt = \frac{dv}{dx} \, \frac{d}{dv} \int_a^{v(x)} f(t) \, dt = f(v(x)) \, v'(x),
\tag{40}
$$
$$
\frac{d}{dx} \int_{u(x)}^{v(x)} f(t) \, dt = \frac{d}{dx} \left( \int_a^{v(x)} f(t) \, dt - \int_a^{u(x)} f(t) \, dt \right) = f(v(x)) \, v'(x) - f(u(x)) \, u'(x).
\tag{41}
$$
Note that (38) can be derived from (A.19) by considering f(x) = h(a(x), b(x), x), where h(a, b, c) = ∫_a^b g(c, y) dy. [9, p 318–319]
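A symbolic spot-check of (38) for one particular choice of a, b, g (purely an illustration, not a case from the notes):

    % Requires the Symbolic Math Toolbox.  Take a(x) = x, b(x) = x^2, g(x,y) = x*y^2.
    syms x y real
    a = x;  b = x^2;  g = x*y^2;
    f   = int(g, y, a, b);                       % f(x) = int_{a(x)}^{b(x)} g(x,y) dy
    lhs = diff(f, x);
    rhs = diff(b,x)*subs(g,y,b) - diff(a,x)*subs(g,y,a) + int(diff(g,x), y, a, b);
    simplify(lhs - rhs)                          % 0, in agreement with (38)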

A.27. Gamma function:

(a) Γ(q) = ∫_0^∞ x^{q−1} e^{−x} dx, for q > 0.
(b) Γ(0) = ∞.
(c) Γ(n + 1) = n! for n ∈ N ∪ {0}.
(d) 0! = 1.
(e) Γ(1/2) = √π.

(f) Γ(q + 1) = q Γ(q).
  • This relationship is used to define the gamma function for negative numbers.


(g) Γ(q)/α^q = ∫_0^∞ x^{q−1} e^{−αx} dx, for α > 0.
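Items (a), (e), and (g) are easy to check numerically (a sketch; the test values q and α are arbitrary):

    q = 2.5;  alpha = 3;
    I1 = integral(@(t) t.^(q-1).*exp(-t),       0, Inf);   % approximates gamma(q)
    I2 = integral(@(t) t.^(q-1).*exp(-alpha*t), 0, Inf);   % approximates gamma(q)/alpha^q
    disp([I1, gamma(q); I2, gamma(q)/alpha^q])
    disp([gamma(0.5), sqrt(pi)])                            % item (e)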

0

The incomplete beta function is
$$
B(x; a, b) = \int_0^x t^{a-1} (1-t)^{b-1} \, dt .
$$

For x = 1, the incomplete beta function coincides with the (complete) beta function.

The regularized incomplete beta function (or regularized beta function for short) is defined in terms of the incomplete beta function and the (complete) beta function:
$$
I_x(a, b) = \frac{B(x; a, b)}{B(a, b)} .
$$

• For integers m, k,
$$
I_x(m, k) = \sum_{j=m}^{m+k-1} \frac{(m+k-1)!}{j! \, (m+k-1-j)!} \, x^j (1-x)^{m+k-1-j} .
$$

• I_0(a, b) = 0, I_1(a, b) = 1.
• I_x(a, b) = 1 − I_{1−x}(b, a).
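In MATLAB, betainc(x, a, b) evaluates the regularized incomplete beta function I_x(a, b); a quick numerical check of the binomial-sum formula and the reflection property above (the values of x, m, k are arbitrary choices):

    x = 0.3;  m = 4;  k = 6;  n = m + k - 1;
    j = m:n;
    S = sum(factorial(n)./(factorial(j).*factorial(n-j)) .* x.^j .* (1-x).^(n-j));
    disp([S, betainc(x, m, k)])                   % the two values agree
    disp(betainc(x, m, k) + betainc(1-x, k, m))   % equals 1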


References

[1] Patrick Billingsley. Probability and Measure. John Wiley & Sons, New York, 1995.

5.29, 2b

[2] George Casella and Roger L. Berger. Statistical Inference. Duxbury Press, 2001. 4.42,

6.9, 11, 8.13, 11.11, 11.4, 1, 6, 11.21

[3] Donald G. Childers. Probability And Random Processes Using MATLAB. McGraw-Hill, 1997. 1.3, 1.3

11.4, 11.21

[5] W. Feller. An Introduction to Probability Theory and Its Applications, volume 2. John

Wiley & Sons, 1971. 4.36

[6] William Feller. An Introduction to Probability Theory and Its Applications, Volume 1.

Wiley, 3 edition, 1968.

[7] Terrence L. Fine. Probability and Probabilistic Reasoning for Electrical Engineering.

Prentice Hall, 2005. 3.1, 3.10, 3.14, 4.18, 4.23, 5.26, 9.1, 9.3, 11.10, 15.2, A.14

[8] Boris Vladimirovich Gnedenko. Theory of probability. Chelsea Pub. Co., New York, 4

edition, 1967. Translated from the Russian by B.D. Seckler. 5.29

[9] John A. Gubner. Probability and Random Processes for Electrical and Computer Engineers. Cambridge University Press, 2006. 1.2, 4.30, 4.33, 4.35, 4, 6.43, 7.9, 9.9, 10.1, 2, 3, 6, 11.24, 12.21, 12.22, 14.3, 3, A.5, A.11, A.26

[10] Samuel Karlin and Howard E. Taylor. A First Course in Stochastic Processes. Academic Press, 1975. 4.6, 9.10, 10.1

0198536933. 5.23, 5.24

[13] Nabendu Pal, Chun Jin, and Wooi K. Lim. Handbook of Exponential and Related

Distributions for Engineers and Scientists. Chapman & Hall/CRC, 2005. 6.25

McGraw-Hill Companies, 1991. 5.17, 11.5

CRC, 1999. 1, 4, 5, 6, 7, 8


[18] Walter Rudin. Principles of Mathematical Analysis. McGraw-Hill, 1976. A.5

[19] Gábor J. Székely. Paradoxes in Probability Theory and Mathematical Statistics. 1986.

8.2

[20] Tung. Fundamental of Probability and Statistics for Reliability Analysis. 2005. 17, 20

Springer, 2004. 4.16, 4.18, 4.43, 7.9, 3, 13.1


Index

Bayes Theorem, 21

Binomial theorem, 12

Birthday Paradox, 23


Delta function, 19

Dirac delta function, see Delta function

event algebra, 25

gradient, 130

gradient vector ﬁeld, 130

Jacobian formulas, 99

multinomial coeﬃcient, 13

Multinomial Counting, 13

Multinomial Theorem, 13

uncorrelated

not independent, 76, 93

Zipf or zeta random variable, 59
