Prapun Suksompong
School of Electrical and Computer Engineering
Cornell University, Ithaca, NY 14853
ps92@cornell.edu
January 23, 2008
Contents
1 Mathematical Background 4
1.1 Set Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Enumeration / Combinatorics / Counting . . . . . . . . . . . . . . . . . . 9
1.3 Dirac Delta Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2 Classical Probability 20
2.1 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3 Probability Foundations 24
3.1 Algebra and σ-algebra . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2 Kolmogorov’s Axioms for Probability . . . . . . . . . . . . . . . . . . . . . 29
3.3 Properties of Probability Measure . . . . . . . . . . . . . . . . . . . . . . . 31
3.4 Countable Ω . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.5 Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4 Random Element 36
4.1 Random Variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.2 Distribution Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.3 Discrete random variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.4 Continuous random variable . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.5 Mixed/hybrid Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.6 Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.7 Misc . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5 PMF Examples 48
5.1 Random/Uniform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.2 Bernoulli and Binary distributions . . . . . . . . . . . . . . . . . . . . . . . 49
5.3 Binomial: B(n, p) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
5.4 Geometric: G(β) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.5 Poisson Distribution: P(λ) . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.6 Compound Poisson . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.7 Hypergeometric . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.8 Negative Binomial Distribution (Pascal / Pólya distribution) . . . . . . . . 58
5.9 Beta-binomial distribution . . . . . . . . . . . . . . . . . . . . . . . . . 59
5.10 Zipf or zeta random variable . . . . . . . . . . . . . . . . . . . . . . . . . . 59
6 PDF Examples 59
6.1 Uniform Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
6.2 Gaussian Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
6.3 Exponential Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
6.4 Pareto: Par(α), a heavy-tailed model/density . . . . . . . . . . . . . . . . 67
6.5 Laplacian: L(α) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
6.6 Rayleigh . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
6.7 Cauchy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
6.8 Weibull . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
7 Expectation 69
8 Inequalities 76
9 Random Vectors 80
9.1 Random Sequence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
10 Transform Methods 86
10.1 Probability Generating Function . . . . . . . . . . . . . . . . . . . . . . . . 86
10.2 Moment Generating Function . . . . . . . . . . . . . . . . . . . . . . . . . 87
10.3 One-Sided Laplace Transform . . . . . . . . . . . . . . . . . . . . . . . . 88
10.4 Characteristic Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
11 Functions of random variables 91
11.1 SISO case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
11.2 MISO case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
11.3 MIMO case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
11.4 Order Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
12 Convergences 108
12.1 Summation of random variables . . . . . . . . . . . . . . . . . . . . . . . . 114
12.2 Summation of independent random variables . . . . . . . . . . . . . . . . . 114
12.3 Summation of i.i.d. random variables . . . . . . . . . . . . . . . . . . . 114
12.4 Central Limit Theorem (CLT) . . . . . . . . . . . . . . . . . . . . . . . . . 116
13 Conditional Probability and Expectation 117
13.1 Conditional Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
13.2 Conditional Expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
13.3 Conditional Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
14 Real-valued Jointly Gaussian 120
15 Bayesian Detection and Estimation 123
A More Math 126
A.1 Inequalities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
A.2 Summations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
A.3 Derivatives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
A.4 Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
A.5 Gamma and Beta functions . . . . . . . . . . . . . . . . . . . . . . . . . . 135
1 Mathematical Background
1.1 Set Theory
1.1. Basic Set Identities:
• Double complement (involution): (A^c)^c = A
• Commutativity (symmetry):
A ∪ B = B ∪ A, A ∩ B = B ∩ A
• Associativity:
◦ A ∩ (B ∩ C) = (A ∩ B) ∩ C
◦ A ∪ (B ∪ C) = (A ∪ B) ∪ C
• Distributivity:
◦ A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C)
◦ A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C)
• De Morgan laws:
◦ (A ∪ B)^c = A^c ∩ B^c
◦ (A ∩ B)^c = A^c ∪ B^c
1.2. Basic Terminology:
• A ∩ B is sometimes written simply as AB.
• Sets A and B are said to be disjoint (A ⊥ B) if and only if A ∩ B = ∅.
• A collection of sets (A_i : i ∈ I) is said to be pairwise disjoint or mutually exclusive [9, p 9] if and only if A_i ∩ A_j = ∅ whenever i ≠ j.
• A collection Π = (A_α : α ∈ I) of subsets of Ω (in this case, indexed or labeled by α taking values in an index or label set I) is said to be a partition of Ω if
(a) Ω = ⋃_{α∈I} A_α and
(b) for all i ≠ j, A_i ⊥ A_j (pairwise disjoint).
In which case, the collection (B ∩ A_α : α ∈ I) is a partition of B. In other words, any set B can be expressed as B = ⋃_α (B ∩ A_α) where the union is a disjoint union.
• The cardinality (or size) of a collection or set A, denoted |A|, is the number of elements of the collection. This number may be finite or infinite.
name                    rule
Commutative laws        A ∩ B = B ∩ A;  A ∪ B = B ∪ A
Associative laws        A ∩ (B ∩ C) = (A ∩ B) ∩ C;  A ∪ (B ∪ C) = (A ∪ B) ∪ C
Distributive laws       A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C);  A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C)
De Morgan's laws        (A ∩ B)^c = A^c ∪ B^c;  (A ∪ B)^c = A^c ∩ B^c
Complement laws         A ∩ A^c = ∅;  A ∪ A^c = U
Double complement law   (A^c)^c = A
Idempotent laws         A ∩ A = A;  A ∪ A = A
Absorption laws         A ∩ (A ∪ B) = A;  A ∪ (A ∩ B) = A
Dominance laws          A ∩ ∅ = ∅;  A ∪ U = U
Identity laws           A ∪ ∅ = A;  A ∩ U = A
Figure 1: Set Identities [17]
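These identities are finite-set facts that can be spot-checked mechanically. A small sketch in Python (added for illustration, not part of the original notes), using a toy universe U:

```python
# Spot-check a few identities from Figure 1 on small concrete sets.
U = set(range(6))                      # a toy universe
A, B, C = {0, 1, 2}, {1, 3}, {2, 3, 4}

def comp(S):
    """Complement relative to the universe U."""
    return U - S

# De Morgan's laws
assert comp(A | B) == comp(A) & comp(B)
assert comp(A & B) == comp(A) | comp(B)
# Distributive laws
assert A | (B & C) == (A | B) & (A | C)
assert A & (B | C) == (A & B) | (A & C)
# Absorption laws
assert A & (A | B) == A and A | (A & B) == A
```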
◦ Inclusion-Exclusion Principle:
|⋃_{i=1}^{n} A_i| = ∑_{∅≠I⊂{1,...,n}} (−1)^{|I|+1} |⋂_{i∈I} A_i|.
◦ |⋂_{i=1}^{n} A_i^c| = |Ω| + ∑_{∅≠I⊂{1,...,n}} (−1)^{|I|} |⋂_{i∈I} A_i|.
◦ If ∀i, A_i ⊂ B (or equivalently, ⋃_{i=1}^{n} A_i ⊂ B), then
|⋂_{i=1}^{n} (B \ A_i)| = |B| + ∑_{∅≠I⊂{1,...,n}} (−1)^{|I|} |⋂_{i∈I} A_i|.
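The inclusion-exclusion formula can be verified by brute force on small collections of sets. A sketch (Python, added for illustration):

```python
from itertools import combinations
from functools import reduce

def union_size_by_ie(sets):
    """|A_1 ∪ ... ∪ A_n| via inclusion-exclusion over nonempty index sets I."""
    n = len(sets)
    total = 0
    for k in range(1, n + 1):
        for I in combinations(range(n), k):
            inter = reduce(set.intersection, (sets[i] for i in I))
            total += (-1) ** (k + 1) * len(inter)
    return total

sets = [{1, 2, 3}, {2, 3, 4}, {4, 5}]
assert union_size_by_ie(sets) == len(set().union(*sets))
```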
• An infinite set A is said to be countable if the elements of A can be enumerated or listed in a sequence: a_1, a_2, . . . . The empty set and finite sets are also said to be countable. By a countably infinite set, we mean a countable set that is not finite.
• A singleton is a set with exactly one element.
• N = {1, 2, 3, . . .}; R = (−∞, ∞).
• For a set of sets, to avoid the repeated use of the word "set", we will call it a collection/class/family of sets.
Deﬁnition 1.3. Monotone sequence of sets
• The sequence of events (A_1, A_2, A_3, . . .) is a monotone-increasing sequence of events if and only if
A_1 ⊂ A_2 ⊂ A_3 ⊂ ⋯.
In which case,
◦ ⋃_{i=1}^{n} A_i = A_n
◦ lim_{n→∞} A_n = ⋃_{i=1}^{∞} A_i.
Put A = lim_{n→∞} A_n. We then write A_n ↗ A; that is, A_n ↗ A if and only if ∀n, A_n ⊂ A_{n+1} and ⋃_{n=1}^{∞} A_n = A.
• The sequence of events (B_1, B_2, B_3, . . .) is a monotone-decreasing sequence of events if and only if
B_1 ⊃ B_2 ⊃ B_3 ⊃ ⋯.
In which case,
◦ ⋂_{i=1}^{n} B_i = B_n
◦ lim_{n→∞} B_n = ⋂_{i=1}^{∞} B_i.
Put B = lim_{n→∞} B_n. We then write B_n ↘ B; that is, B_n ↘ B if and only if ∀n, B_{n+1} ⊂ B_n and ⋂_{n=1}^{∞} B_n = B.
Note that A_n ↘ A ⇔ A_n^c ↗ A^c.
1.4. An (event-)indicator function I_A : Ω → {0, 1} is defined by
I_A(ω) = 1 if ω ∈ A; 0 otherwise.
• Alternative notation: 1_A.
• A = {ω : I_A(ω) = 1}
• A = B if and only if I_A = I_B
• I_{A^c}(ω) = 1 − I_A(ω)
• A ⊂ B ⇔ [∀ω, I_A(ω) ≤ I_B(ω)] ⇔ [∀ω, I_A(ω) = 1 ⇒ I_B(ω) = 1]
• I_{A∩B}(ω) = min(I_A(ω), I_B(ω)) = I_A(ω) I_B(ω)
• I_{A∪B}(ω) = max(I_A(ω), I_B(ω)) = I_A(ω) + I_B(ω) − I_A(ω) I_B(ω)
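The min/max/product relations for indicators are pointwise identities and are easy to check exhaustively on a small Ω. A sketch (Python, added for illustration):

```python
def indicator(A):
    """Return the indicator function I_A as a Python callable."""
    return lambda w: 1 if w in A else 0

Omega = set(range(8))
A, B = {1, 2, 3}, {3, 4}
IA, IB = indicator(A), indicator(B)

for w in Omega:
    # intersection: min and product agree
    assert indicator(A & B)(w) == min(IA(w), IB(w)) == IA(w) * IB(w)
    # union: max and the inclusion-exclusion form agree
    assert indicator(A | B)(w) == max(IA(w), IB(w)) == IA(w) + IB(w) - IA(w) * IB(w)
    # complement relative to Omega
    assert indicator(Omega - A)(w) == 1 - IA(w)
```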
1.5. Suppose ⋃_{i=1}^{N} A_i = ⋃_{i=1}^{N} B_i for all finite N ∈ N. Then, ⋃_{i=1}^{∞} A_i = ⋃_{i=1}^{∞} B_i.
Proof. To show "⊂", suppose x ∈ ⋃_{i=1}^{∞} A_i. Then, ∃N_0 such that x ∈ A_{N_0}. Now, A_{N_0} ⊂ ⋃_{i=1}^{N_0} A_i = ⋃_{i=1}^{N_0} B_i ⊂ ⋃_{i=1}^{∞} B_i. Therefore, x ∈ ⋃_{i=1}^{∞} B_i. To show "⊃", use symmetry.
1.6. Let A_1, A_2, . . . be a sequence of disjoint sets. Define B_n = ⋃_{i>n} A_i = ⋃_{i≥n+1} A_i. Then, (1) B_{n+1} ⊂ B_n and (2) ⋂_{n=1}^{∞} B_n = ∅.
Proof. B_n = ⋃_{i≥n+1} A_i = (⋃_{i≥n+2} A_i) ∪ A_{n+1} = B_{n+1} ∪ A_{n+1}. So, (1) is true. For (2), consider two cases. (2.1) For an element x ∉ ⋃_i A_i, we know that x ∉ B_n and hence x ∉ ⋂_{n=1}^{∞} B_n. (2.2) For x ∈ ⋃_i A_i, we know that ∃i_0 such that x ∈ A_{i_0}. Note that x can't be in any other A_i because the A_i's are disjoint. So, x ∉ B_{i_0} and therefore x ∉ ⋂_{n=1}^{∞} B_n.
1.7. Any countable union can be written as a union of pairwise disjoint sets: Given any sequence of sets F_n, define a new sequence by A_1 = F_1, and
A_n = F_n ∩ F_{n−1}^c ∩ ⋯ ∩ F_1^c = F_n ∩ ⋂_{i∈[n−1]} F_i^c = F_n \ (⋃_{i∈[n−1]} F_i)
for n ≥ 2. Then, ⋃_{i=1}^{∞} F_i = ⋃_{i=1}^{∞} A_i where the union on the RHS is a disjoint union.
Proof. Note that the A_n are pairwise disjoint. To see this, consider A_{n_1}, A_{n_2} where n_1 ≠ n_2. WLOG, assume n_1 < n_2. This implies that F_{n_1}^c gets intersected in the definition of A_{n_2}. So,
A_{n_1} ∩ A_{n_2} = (F_{n_1} ∩ F_{n_1}^c) ∩ (⋂_{i=1}^{n_1−1} F_i^c) ∩ F_{n_2} ∩ (⋂_{i=1, i≠n_1}^{n_2−1} F_i^c) = ∅,
because the first factor F_{n_1} ∩ F_{n_1}^c is ∅.
Also, for finite N ≥ 1, we have ⋃_{n∈[N]} F_n = ⋃_{n∈[N]} A_n (*). To see this, note that (*) is true for N = 1. Suppose (*) is true for N = m and let B = ⋃_{n∈[m]} A_n = ⋃_{n∈[m]} F_n. Now, for N = m + 1, by definition (since A_{m+1} = F_{m+1} ∩ B^c), we have
⋃_{n∈[m+1]} A_n = (⋃_{n∈[m]} A_n) ∪ A_{m+1} = (B ∪ F_{m+1}) ∩ Ω = B ∪ F_{m+1} = ⋃_{i∈[m+1]} F_i.
So, (*) is true for N = m + 1. By induction, (*) is true for all finite N. The extension to ∞ is done via (1.5).
For a finite union, we can modify the above statement by setting F_n = ∅ for n ≥ N. Then, A_n = ∅ for n ≥ N.
By construction, {A_n : n ∈ N} ⊂ σ({F_n : n ∈ N}). However, in general, it is not true that {F_n : n ∈ N} ⊂ σ({A_n : n ∈ N}). For example, for a finite union with N = 2, we can't get F_2 back from set operations on A_1, A_2 because we lost information about F_1 ∩ F_2. To create a disjoint union which preserves the information about the overlapping parts of the F_n's, we can define the A's by ⋂_{n∈N} B_n where B_n is F_n or F_n^c. This is done in (1.8). However, this leads to uncountably many A_α, which is why we used the index α above instead of n. The uncountability problem does not occur if we start with a finite union. This is shown in the next result.
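The disjointification in (1.7) can be sketched directly for a finite list of sets. The code below (Python, added for illustration) builds A_n = F_n \ (F_1 ∪ ⋯ ∪ F_{n−1}) and checks the two claimed properties:

```python
def disjointify(F):
    """A_1 = F_1; A_n = F_n minus the union of all earlier F_i."""
    A, seen = [], set()
    for Fn in F:
        A.append(set(Fn) - seen)   # F_n \ (F_1 ∪ ... ∪ F_{n-1})
        seen |= set(Fn)
    return A

F = [{1, 2}, {2, 3}, {1, 4}]
A = disjointify(F)
assert A == [{1, 2}, {3}, {4}]
# same union, and the A_n are pairwise disjoint
assert set().union(*A) == set().union(*F)
assert all(A[i].isdisjoint(A[j]) for i in range(3) for j in range(i + 1, 3))
```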
1.8. Decomposition:
• Fix sets A_1, A_2, . . . , A_n, not necessarily disjoint. Let Π be the collection of all sets of the form
B = B_1 ∩ B_2 ∩ ⋯ ∩ B_n
where each B_j is either A_j or its complement. There are 2^n of these, say B^(1), B^(2), . . . , B^(2^n). Then,
(a) Π is a partition of Ω, and
(b) Π \ {⋂_{j∈[n]} A_j^c} is a partition of ⋃_{j∈[n]} A_j.
Moreover, any A_j can be expressed as A_j = ⋃_{i∈S_j} B^(i) for some S_j ⊂ [2^n]. More specifically, A_j is the union of all the B^(i) which are constructed with B_j = A_j.
• Fix sets A_1, A_2, . . ., not necessarily disjoint. Let Π be the collection of all sets of the form B = ⋂_{n∈N} B_n where each B_n is either A_n or its complement. There are uncountably many of these; hence we index the B's by α, writing B^(α). Let I be the set of all α after we eliminate all repeated B^(α); note that I can still be uncountable. Then,
(a) Π = {B^(α) : α ∈ I} is a partition of Ω, and
(b) Π \ {⋂_{n∈N} A_n^c} is a partition of ⋃_{n∈N} A_n.
Moreover, any A_j can be expressed as A_j = ⋃_{α∈S_j} B^(α) for some S_j ⊂ I. Because I is uncountable, in general, S_j can be uncountable. More specifically, A_j is the (possibly uncountable) union of all the B^(α) which are constructed with B_j = A_j. The uncountability of S_j can be problematic because it implies that we need an uncountable union to get A_j back.
1.9. Let {A_α : α ∈ I} be a collection of disjoint sets where I is a nonempty index set. For any set S ⊂ I, define a mapping g on 2^I by g(S) = ⋃_{α∈S} A_α. Then, g is a 1:1 function if and only if none of the A_α's is empty.
1.10. Let
A = {(x, y) ∈ R^2 : (x + a_1, x + b_1) ∩ (y + a_2, y + b_2) ≠ ∅},
where a_i < b_i. Then,
A = {(x, y) ∈ R^2 : x + (a_1 − b_2) < y < x + (b_1 − a_2)}
  = {(x, y) ∈ R^2 : a_1 − b_2 < y − x < b_1 − a_2}.
Figure 2: The region A = {(x, y) ∈ R^2 : (x + a_1, x + b_1) ∩ (y + a_2, y + b_2) ≠ ∅}. (The figure shows A as the band between the lines y = x + (a_1 − b_2) and y = x + (b_1 − a_2), both parallel to the line y = x.)
1.2 Enumeration / Combinatorics / Counting
1.11. The four kinds of counting problems are:
(a) ordered sampling of r out of n items with replacement: n^r;
(b) ordered sampling of r ≤ n out of n items without replacement: (n)_r;
(c) unordered sampling of r ≤ n out of n items without replacement: \binom{n}{r};
(d) unordered sampling of r out of n items with replacement: \binom{n+r−1}{r}.
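All four counts can be cross-checked against brute-force enumeration with the standard library; a sketch (Python, added for illustration):

```python
from itertools import product, permutations, combinations, combinations_with_replacement
from math import comb, perm

n, r = 5, 3
# (a) ordered, with replacement: n^r
assert len(list(product(range(n), repeat=r))) == n ** r
# (b) ordered, without replacement: (n)_r
assert len(list(permutations(range(n), r))) == perm(n, r)
# (c) unordered, without replacement: binom(n, r)
assert len(list(combinations(range(n), r))) == comb(n, r)
# (d) unordered, with replacement: binom(n+r-1, r)
assert len(list(combinations_with_replacement(range(n), r))) == comb(n + r - 1, r)
```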
1.12. Given a set of n distinct items, select a distinct ordered sequence (word) of length r drawn from this set.
• Sampling with replacement: μ_{n,r} = n^r
◦ Ordered sampling of r out of n items with replacement.
◦ μ_{n,1} = n
◦ μ_{1,r} = 1
◦ μ_{n,r} = n μ_{n,r−1} for r > 1
◦ Examples:
∗ Suppose A is a finite set; then the cardinality of its power set is |2^A| = 2^{|A|}.
∗ There are 2^r binary strings/sequences of length r.
• Sampling without replacement:
(n)_r = ∏_{i=0}^{r−1} (n − i) = n!/(n − r)! = n(n − 1)⋯(n − (r − 1))  [r terms];  r ≤ n.
◦ Ordered sampling of r ≤ n out of n items without replacement.
◦ For integers r, n such that r > n, we have (n)_r = 0.
◦ The definition in product form
(n)_r = ∏_{i=0}^{r−1} (n − i) = n(n − 1)⋯(n − (r − 1))  [r terms]
can be extended to any real number n and a nonnegative integer r. We define (n)_0 = 1. (This makes sense because we usually take the empty product to be 1.)
◦ (n)_1 = n
◦ (n)_r = (n − (r − 1))(n)_{r−1}. For example, (7)_5 = (7 − 4)(7)_4.
◦ (1)_r = 1 if r = 1; 0 if r > 1.
• Ratio:
(n)_r / n^r = (∏_{i=0}^{r−1} (n − i)) / (∏_{i=0}^{r−1} n) = ∏_{i=0}^{r−1} (1 − i/n) ≈ ∏_{i=0}^{r−1} e^{−i/n} = e^{−(1/n) ∑_{i=0}^{r−1} i} = e^{−r(r−1)/(2n)} ≈ e^{−r^2/(2n)}.
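For r much smaller than n, the exponential approximation of (n)_r / n^r is tight; a quick numerical check (Python, added for illustration, using the birthday-problem values n = 365, r = 23):

```python
from math import perm, exp

n, r = 365, 23
exact = perm(n, r) / n ** r                 # (n)_r / n^r
approx1 = exp(-r * (r - 1) / (2 * n))       # e^{-r(r-1)/(2n)}
approx2 = exp(-r ** 2 / (2 * n))            # cruder e^{-r^2/(2n)}
# the first approximation is within 0.01 of the exact ratio here
assert abs(exact - approx1) < 0.01
assert abs(exact - approx2) < 0.02
```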
1.13. Factorial and Permutation: The number of arrangements (permutations) of n ≥ 0 distinct items is (n)_n = n!.
• 0! = 1! = 1
• n! = n(n − 1)!
• n! = ∫_0^∞ e^{−t} t^n dt
• Stirling's Formula:
n! ≈ √(2πn) n^n e^{−n} = √(2πe) e^{(n + 1/2) ln(n/e)}.
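Stirling's formula is accurate already for moderate n; its relative error shrinks roughly like 1/(12n). A quick numerical check (Python, added for illustration):

```python
from math import factorial, sqrt, pi, e

def stirling(n):
    """Stirling's approximation: sqrt(2*pi*n) * n^n * e^{-n}."""
    return sqrt(2 * pi * n) * n ** n * e ** (-n)

# the relative error is bounded by 1/(12n)
for n in (5, 10, 20):
    rel_err = abs(stirling(n) - factorial(n)) / factorial(n)
    assert rel_err < 1 / (12 * n)
```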
1.14. Binomial coefficient:
\binom{n}{r} = (n)_r / r! = n! / ((n − r)! r!)
This gives the number of unordered sets of size r drawn from an alphabet of size n without replacement; this is unordered sampling of r ≤ n out of n items without replacement. It is also the number of subsets of size r that can be formed from a set of n elements. Some properties are listed below:
(a) Use nchoosek(n,r) in MATLAB.
(b) Use combin(n,r) in Mathcad. However, to do symbolic manipulation, use the factorial definition directly.
(c) Reflection property: \binom{n}{r} = \binom{n}{n−r}.
(d) \binom{n}{n} = \binom{n}{0} = 1.
(e) \binom{n}{1} = \binom{n}{n−1} = n.
(f) \binom{n}{r} = 0 if n < r or r is a negative integer.
(g) max_r \binom{n}{r} = \binom{n}{⌊(n+1)/2⌋}.
(h) Pascal's "triangle" rule: \binom{n}{k} = \binom{n−1}{k} + \binom{n−1}{k−1}. This property divides the process of choosing k items into two steps, where the first step is to decide whether to choose the first item or not.
Figure 3: Pascal Triangle.
(i) ∑_{0≤k≤n, k even} \binom{n}{k} = ∑_{0≤k≤n, k odd} \binom{n}{k} = 2^{n−1}.
There are many ways to show this identity.
(i) Consider the subsets of S = {a_1, a_2, . . . , a_n} of n distinct elements. First choose a subset A of the first n − 1 elements of S; there are 2^{n−1} distinct choices. Then, for each A, to get a set with an even number of elements, add the element a_n to A if and only if |A| is odd. This puts the even-sized subsets in one-to-one correspondence with the 2^{n−1} choices of A.
(ii) Look at the binomial expansion of (x + y)^n with x = 1 and y = −1.
(iii) For odd n, use the fact that \binom{n}{r} = \binom{n}{n−r}.
(j) ∑_{k=0}^{min(n_1,n_2)} \binom{n_1}{k} \binom{n_2}{k} = \binom{n_1+n_2}{n_1} = \binom{n_1+n_2}{n_2}.
• This property divides the process of choosing items from n_1 + n_2 items into two steps, where the first step is to choose from the first n_1 items.
• Can replace the min(n_1, n_2) in the first sum by n_1 or n_2 if we define \binom{n}{k} = 0 for k > n.
• ∑_{r=0}^{n} \binom{n}{r}^2 = \binom{2n}{n}.
(k) Parallel summation:
∑_{m=k}^{n} \binom{m}{k} = \binom{k}{k} + \binom{k+1}{k} + ⋯ + \binom{n}{k} = \binom{n+1}{k+1}.
To see this, suppose we try to choose k + 1 items from the n + 1 items a_1, a_2, . . . , a_{n+1}. First, we choose whether to take a_1. If so, then we need to choose the remaining k items from the n items a_2, . . . , a_{n+1}; hence, we have the \binom{n}{k} term. Now, suppose we didn't choose a_1. Then, we still need to choose k + 1 items from the n items a_2, . . . , a_{n+1}. We then repeat the same argument on a_2 instead of a_1.
Equivalently,
∑_{k=0}^{r} \binom{n+k}{k} = ∑_{k=0}^{r} \binom{n+k}{n} = \binom{n+r+1}{n+1} = \binom{n+r+1}{r}. (1)
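Pascal's rule and the parallel-summation identity are easy to verify numerically; a short sketch (Python, added for illustration):

```python
from math import comb

# Pascal's rule: C(n,k) = C(n-1,k) + C(n-1,k-1)
for n in range(1, 12):
    for k in range(1, n):
        assert comb(n, k) == comb(n - 1, k) + comb(n - 1, k - 1)

# Parallel summation: sum_{m=k}^{n} C(m,k) = C(n+1, k+1)
n, k = 10, 3
assert sum(comb(m, k) for m in range(k, n + 1)) == comb(n + 1, k + 1)

# Equivalent form (1): sum_{j=0}^{r} C(n+j, j) = C(n+r+1, r)
n, r = 4, 6
assert sum(comb(n + j, j) for j in range(r + 1)) == comb(n + r + 1, r)
```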
1.15. Binomial theorem:
(x + y)^n = ∑_{r=0}^{n} \binom{n}{r} x^r y^{n−r}
• Let x = y = 1; then
∑_{r=0}^{n} \binom{n}{r} = 2^n.
• Entropy function:
H(p) = −p log_b(p) − (1 − p) log_b(1 − p)
◦ Binary: b = 2 ⇒
H_2(p) = −p log_2(p) − (1 − p) log_2(1 − p).
In which case,
(1/(n + 1)) 2^{n H_2(r/n)} ≤ \binom{n}{r} ≤ 2^{n H_2(r/n)}.
Hence, \binom{n}{r} ≈ 2^{n H_2(r/n)}.
• By repeatedly differentiating with respect to x followed by multiplication by x, we have
◦ ∑_{r=0}^{n} r \binom{n}{r} x^r y^{n−r} = nx(x + y)^{n−1} and
◦ ∑_{r=0}^{n} r^2 \binom{n}{r} x^r y^{n−r} = nx(x(n − 1)(x + y)^{n−2} + (x + y)^{n−1}).
For x + y = 1, we have
◦ ∑_{r=0}^{n} r \binom{n}{r} x^r (1 − x)^{n−r} = nx and
◦ ∑_{r=0}^{n} r^2 \binom{n}{r} x^r (1 − x)^{n−r} = nx(nx + 1 − x).
All identities above can be verified easily via Mathcad.
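For x + y = 1 these two sums are the mean and second moment of a binomial pmf, so they can also be checked numerically; a sketch (Python, added for illustration):

```python
from math import comb

n, x = 12, 0.3
# binomial pmf with parameters n and x
pmf = [comb(n, r) * x ** r * (1 - x) ** (n - r) for r in range(n + 1)]
mean = sum(r * p for r, p in enumerate(pmf))
second_moment = sum(r * r * p for r, p in enumerate(pmf))

assert abs(mean - n * x) < 1e-9                             # = nx
assert abs(second_moment - n * x * (n * x + 1 - x)) < 1e-9  # = nx(nx + 1 - x)
```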
1.16. Multinomial Counting: The multinomial coefficient \binom{n}{n_1 n_2 ⋯ n_r} is defined as
∏_{i=1}^{r} \binom{n − ∑_{k=0}^{i−1} n_k}{n_i} = \binom{n}{n_1} \binom{n − n_1}{n_2} \binom{n − n_1 − n_2}{n_3} ⋯ \binom{n_r}{n_r} = n! / ∏_{i=1}^{r} n_i!
(with n_0 = 0). It is the number of ways that we can arrange n = ∑_{i=1}^{r} n_i tokens when having r types of symbols and n_i indistinguishable copies/tokens of a type-i symbol.
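The telescoping product-of-binomials form and the factorial form agree; a sketch computing both (Python, added for illustration):

```python
from math import comb, factorial, prod

def multinomial(ns):
    """n! / (n_1! n_2! ... n_r!) via the telescoping product of binomials."""
    remaining, result = sum(ns), 1
    for ni in ns:
        result *= comb(remaining, ni)
        remaining -= ni
    return result

ns = [2, 3, 1]
assert multinomial(ns) == factorial(sum(ns)) // prod(factorial(k) for k in ns)
```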
1.17. Multinomial Theorem:
(x_1 + ⋯ + x_r)^n = ∑_{i_1=0}^{n} ∑_{i_2=0}^{n−i_1} ⋯ ∑_{i_{r−1}=0}^{n−∑_{j<r−1} i_j} [n! / ((n − ∑_{k<r} i_k)! ∏_{k<r} i_k!)] x_r^{n−∑_{j<r} i_j} ∏_{k=1}^{r−1} x_k^{i_k}
• r-ary entropy function: Consider any vector p = (p_1, p_2, . . . , p_r) such that p_i ≥ 0 and ∑_{i=1}^{r} p_i = 1. We define
H(p) = −∑_{i=1}^{r} p_i log_b p_i.
Factorial expansion       \binom{n}{k} = n!/(k!(n−k)!),  k = 0, 1, 2, . . . , n
Symmetry                  \binom{n}{k} = \binom{n}{n−k},  k = 0, 1, 2, . . . , n
Monotonicity              \binom{n}{0} < \binom{n}{1} < ⋯ < \binom{n}{⌊n/2⌋},  n ≥ 0
Pascal's identity         \binom{n}{k} = \binom{n−1}{k−1} + \binom{n−1}{k},  k = 0, 1, 2, . . . , n
Binomial theorem          (x + y)^n = ∑_{k=0}^{n} \binom{n}{k} x^k y^{n−k},  n ≥ 0
Counting all subsets      ∑_{k=0}^{n} \binom{n}{k} = 2^n,  n ≥ 0
Even and odd subsets      ∑_{k=0}^{n} (−1)^k \binom{n}{k} = 0,  n ≥ 0
Sum of squares            ∑_{k=0}^{n} \binom{n}{k}^2 = \binom{2n}{n},  n ≥ 0
Square of row sums        (∑_{k=0}^{n} \binom{n}{k})^2 = ∑_{k=0}^{2n} \binom{2n}{k},  n ≥ 0
Absorption/extraction     \binom{n}{k} = (n/k) \binom{n−1}{k−1},  k ≠ 0
Trinomial revision        \binom{n}{m} \binom{m}{k} = \binom{n}{k} \binom{n−k}{m−k},  0 ≤ k ≤ m ≤ n
Parallel summation        ∑_{k=0}^{m} \binom{n+k}{k} = \binom{n+m+1}{m},  m, n ≥ 0
Diagonal summation        ∑_{k=0}^{n−m} \binom{m+k}{m} = \binom{n+1}{m+1},  n ≥ m ≥ 0
Vandermonde convolution   ∑_{k=0}^{r} \binom{m}{k} \binom{n}{r−k} = \binom{m+n}{r},  m, n, r ≥ 0
Diagonal sums in Pascal's triangle (§2.3.2)   ∑_{k=0}^{⌊n/2⌋} \binom{n−k}{k} = F_{n+1} (Fibonacci numbers),  n ≥ 0

Other Common Identities
∑_{k=0}^{n} k \binom{n}{k} = n 2^{n−1},  n ≥ 0
∑_{k=0}^{n} k^2 \binom{n}{k} = n(n + 1) 2^{n−2},  n ≥ 0
∑_{k=0}^{n} (−1)^k k \binom{n}{k} = 0,  n ≥ 0
∑_{k=0}^{n} \binom{n}{k}/(k + 1) = (2^{n+1} − 1)/(n + 1),  n ≥ 0
∑_{k=0}^{n} (−1)^k \binom{n}{k}/(k + 1) = 1/(n + 1),  n ≥ 0
∑_{k=1}^{n} (−1)^{k−1} \binom{n}{k}/k = 1 + 1/2 + 1/3 + ⋯ + 1/n,  n > 0
∑_{k=0}^{n−1} \binom{n}{k} \binom{n}{k+1} = \binom{2n}{n−1},  n > 0
∑_{k=0}^{m} \binom{m}{k} \binom{n}{p+k} = \binom{m+n}{m+p},  m, n, p ≥ 0, n ≥ p + m
Figure 4: Binomial coefficient identities [17]
As a special case, let p_i = n_i/n; then
\binom{n}{n_1 n_2 ⋯ n_r} = n! / ∏_{i=1}^{r} n_i! ≈ 2^{n H_2(p)}.
1.18. The number of solutions to x_1 + x_2 + ⋯ + x_n = k where the x_i's are nonnegative integers is \binom{k+n−1}{k} = \binom{k+n−1}{n−1}.
(a) Suppose we further require that the x_i are strictly positive (x_i ≥ 1); then there are \binom{k−1}{n−1} solutions.
(b) Extra Lower-bound Requirement: Suppose we further require that x_i ≥ a_i where the a_i are some given nonnegative integers; then the number of solutions is \binom{k−(a_1+a_2+⋯+a_n)+n−1}{n−1}. Note that here we work with the equivalent problem: y_1 + y_2 + ⋯ + y_n = k − ∑_{i=1}^{n} a_i where y_i ≥ 0.
(c) Extra Upper-bound Requirement: Suppose we further require that 0 ≤ x_i < b_i. Let A_i be the set of solutions such that x_i ≥ b_i and x_j ≥ 0 for j ≠ i. The number of solutions is \binom{k+n−1}{n−1} − |⋃_{i=1}^{n} A_i| where the second term can be found via the inclusion-exclusion principle
|⋃_{i∈[n]} A_i| = ∑_{∅≠I⊂[n]} (−1)^{|I|+1} |⋂_{i∈I} A_i|
and the fact that for any index set I ⊂ [n], we have |⋂_{i∈I} A_i| = \binom{k−(∑_{i∈I} b_i)+n−1}{n−1}.
(d) Extra Range Requirement: Suppose we further require that a_i ≤ x_i < b_i where 0 ≤ a_i < b_i; then we work instead with y_i = x_i − a_i. The number of solutions is
\binom{k − ∑_{i=1}^{n} a_i + n − 1}{n − 1} + ∑_{∅≠I⊂[n]} (−1)^{|I|} \binom{k − ∑_{i=1}^{n} a_i − ∑_{i∈I} (b_i − a_i) + n − 1}{n − 1}.
1.19. The bars and stars argument:
• Consider the distribution of r = 10 indistinguishable balls into n = 5 distinguishable cells. Then, we are only concerned with the number of balls in each cell. Using n − 1 = 4 bars, we can divide r = 10 stars into n = 5 groups. For example, ****|***||**|* would mean (4, 3, 0, 2, 1). In general, there are \binom{n+r−1}{r} ways of arranging the bars and stars.
• There are \binom{n+r−1}{r} distinct vectors x = x_1^n of nonnegative integers such that x_1 + x_2 + ⋯ + x_n = r. We use n − 1 bars to separate r 1's.
• Suppose r letters are drawn with replacement from a set {a_1, a_2, . . . , a_n}. Given a drawn sequence, let x_i be the number of a_i in the drawn sequence. Then, there are \binom{n+r−1}{r} possible x = x_1^n.
objects — number of objects — reference

Arranging objects in a row:
• n distinct objects: n! = P(n, n) = n(n−1)⋯2·1  (§2.3.1)
• k out of n distinct objects: (n)_k = P(n, k) = n(n−1)⋯(n−k+1)  (§2.3.1)
• some of the n objects are identical: k_1 of a first kind, k_2 of a second kind, . . . , k_j of a jth kind, where k_1 + k_2 + ⋯ + k_j = n: \binom{n}{k_1 k_2 … k_j} = n!/(k_1! k_2! ⋯ k_j!)  (§2.3.2)
• none of the n objects remains in its original place (derangements): D_n = n!(1 − 1/1! + ⋯ + (−1)^n 1/n!)  (§2.4.2)

Arranging objects in a circle (where rotations, but not reflections, are equivalent):
• n distinct objects: (n − 1)!  (§2.2.1)
• k out of n distinct objects: P(n, k)/k  (§2.2.1)

Choosing k objects from n distinct objects:
• order matters, no repetitions: P(n, k) = n!/(n−k)! = (n)_k  (§2.3.1)
• order matters, repetitions allowed: P^R(n, k) = n^k  (§2.3.3)
• order does not matter, no repetitions: C(n, k) = \binom{n}{k} = n!/(k!(n−k)!)  (§2.3.2)
• order does not matter, repetitions allowed: C^R(n, k) = \binom{k+n−1}{k}  (§2.3.3)

Figure 5: Counting problems and corresponding sections in [17].
objects — number of objects — reference

Subsets:
• of size k from a set of size n: \binom{n}{k}  (§2.3.2)
• of all sizes from a set of size n: 2^n  (§2.3.4)
• of {1, . . . , n}, without consecutive elements: F_{n+2}  (§3.1.2)

Placing n objects into k cells:
• distinct objects into distinct cells: k^n  (§2.2.1)
• distinct objects into distinct cells, no cell empty: {n k} k!  (§2.5.2)
• distinct objects into identical cells: {n 1} + {n 2} + ⋯ + {n k} (= B_n, the Bell number, when k = n)  (§2.5.2)
• distinct objects into identical cells, no cell empty: {n k}  (§2.5.2)
• distinct objects into distinct cells, with k_i in cell i (i = 1, . . . , n), and where k_1 + k_2 + ⋯ + k_j = n: \binom{n}{k_1 k_2 … k_j}  (§2.3.2)
• identical objects into distinct cells: \binom{n+k−1}{n}  (§2.3.3)
• identical objects into distinct cells, no cell empty: \binom{n−1}{k−1}  (§2.3.3)
• identical objects into identical cells: p_k(n)  (§2.5.1)
• identical objects into identical cells, no cell empty: p_k(n) − p_{k−1}(n)  (§2.5.1)
• Placing n distinct objects into k nonempty cycles: [n k]  (§2.5.2)

Solutions to x_1 + ⋯ + x_n = k:
• nonnegative integers: \binom{k+n−1}{k} = \binom{k+n−1}{n−1}  (§2.3.3)
• positive integers: \binom{k−1}{n−1}  (§2.3.3)
• integers where 0 ≤ a_i ≤ x_i for all i: \binom{k−(a_1+⋯+a_n)+n−1}{n−1}  (§2.3.3)
• integers where 0 ≤ x_i ≤ a_i for one or more i: inclusion/exclusion principle  (§2.4.2)

Figure 6: Counting problems (cont'd) and corresponding sections in [17].
objects — number of objects — reference

Functions from a k-element set to an n-element set:
• all functions: n^k  (§2.2.1)
• one-to-one functions (n ≥ k): (n)_k = n!/(n−k)! = P(n, k)  (§2.2.1)
• onto functions (n ≤ k): inclusion/exclusion  (§2.4.2)
• partial functions: \binom{k}{0} + \binom{k}{1} n + \binom{k}{2} n^2 + ⋯ + \binom{k}{k} n^k = (n + 1)^k  (§2.3.2)

Bit strings of length n:
• all strings: 2^n  (§2.2.1)
• with given entries in k positions: 2^{n−k}  (§2.2.1)
• with exactly k 0s: \binom{n}{k}  (§2.3.2)
• with at least k 0s: \binom{n}{k} + \binom{n}{k+1} + ⋯ + \binom{n}{n}  (§2.3.2)
• with equal numbers of 0s and 1s: \binom{n}{n/2}  (§2.3.2)
• palindromes: 2^{⌈n/2⌉}  (§2.2.1)
• with an even number of 0s: 2^{n−1}  (§2.3.4)
• without consecutive 0s: F_{n+2}  (§3.1.2)

Figure 7: Counting problems (cont'd) and corresponding sections in [17].
B(n) or B_n: Bell number
b(n, k): associated Stirling number of the second kind
C_n = (1/(n+1)) \binom{2n}{n}: Catalan number
C(n, k) = \binom{n}{k} = n!/(k!(n−k)!): binomial coefficient
d(n, k): associated Stirling number of the first kind
E_n: Euler number
E(n, k): Eulerian number
F_n: Fibonacci number
n^{\underline{k}} = n(n−1)⋯(n−k+1) = P(n, k): falling power
P(n, k) = n!/(n−k)!: k-permutation
p(n): number of partitions of n
p_k(n): number of partitions of n into at most k summands
p*_k(n): number of partitions of n into exactly k summands
[n k]: Stirling cycle number
{n k}: Stirling subset number
T_n: tangent number
ϕ: Euler phi-function
Figure 8: Notation from [17]
1.3 Dirac Delta Function
The (Dirac) delta function or (unit) impulse function is denoted by δ(t). It is usually depicted as a vertical arrow at the origin. Note that δ(t) is not a true function; it is undefined at t = 0. We define δ(t) as a generalized function which satisfies the sampling property (or sifting property)
∫ φ(t)δ(t) dt = φ(0)
for any function φ(t) which is continuous at t = 0. From this definition, it follows that
(δ ∗ φ)(t) = (φ ∗ δ)(t) = ∫ φ(τ)δ(t − τ) dτ = φ(t),
where we assume that φ is continuous at t. Intuitively, we may visualize δ(t) as an infinitely tall, infinitely narrow rectangular pulse of unit area: lim_{ε→0} (1/ε) 1[|t| ≤ ε/2].
We list some interesting properties of δ(t) here.
• δ(t) = 0 when t ≠ 0.
  δ(t − T) = 0 for t ≠ T.
• ∫_A δ(t) dt = 1_A(0).
(a) ∫ δ(t) dt = 1.
(b) ∫_{{0}} δ(t) dt = 1.
(c) ∫_{−∞}^{x} δ(t) dt = 1_{[0,∞)}(x). Hence, we may think of δ(t) as the "derivative" of the unit step function U(x) = 1_{[0,∞)}(x).
• ∫ φ(t)δ(t) dt = φ(0) for φ continuous at 0.
• ∫ φ(t)δ(t − T) dt = φ(T) for φ continuous at T. In fact, for any ε > 0,
∫_{T−ε}^{T+ε} φ(t)δ(t − T) dt = φ(T).
• δ(at) = (1/|a|) δ(t)
• δ(t − t_1) ∗ δ(t − t_2) = δ(t − (t_1 + t_2)).
• Fourier properties:
◦ Fourier series: δ(x − a) = 1/(2π) + (1/π) ∑_{k=1}^{∞} cos(k(x − a)) on [−π, π].
◦ Fourier transform: δ(t) = ∫ e^{j2πft} df
• For a function g whose real-valued roots are t_i,
δ(g(t)) = ∑_{i=1}^{n} δ(t − t_i)/|g′(t_i)| (2)
[3, p 387]. Hence,
∫ f(t)δ(g(t)) dt = ∑_{x: g(x)=0} f(x)/|g′(x)|. (3)
Note that the (Dirac) delta function is to be distinguished from the discrete-time Kronecker delta function.
As a finite measure, δ is a unit mass at 0; that is, for any set A, we have δ(A) = 1[0 ∈ A]. In which case, we have again ∫ g dδ = ∫ g(x)δ(dx) = g(0) for any measurable g.
For a function g : D → R^n where D ⊂ R^n,
δ(g(x)) = ∑_{z: g(z)=0} δ(x − z)/|det dg(z)| (4)
[3, p 387].
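The sifting property can be visualized numerically by replacing δ with the narrow rectangular pulse (1/ε)1[|t| ≤ ε/2] and letting ε shrink; the integral then converges to φ(0). A sketch (Python, added for illustration, using a simple midpoint Riemann sum):

```python
from math import cos

def sift(phi, eps, steps=100000):
    """Midpoint Riemann-sum approximation of the integral of phi(t)*(1/eps)*1[|t|<=eps/2]."""
    a, b = -eps / 2, eps / 2
    dt = (b - a) / steps
    return sum(phi(a + (i + 0.5) * dt) / eps * dt for i in range(steps))

phi = lambda t: cos(t) + 2.0          # any function continuous at 0; phi(0) = 3
for eps in (0.1, 0.01, 0.001):
    # the narrow-pulse integral approaches phi(0) as eps shrinks
    assert abs(sift(phi, eps) - phi(0.0)) < eps
```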
2 Classical Probability
Classical probability, which is based upon the ratio of the number of outcomes favorable to
the occurrence of the event of interest to the total number of possible outcomes, provided
most of the probability models used prior to the 20th century. Classical probability remains
of importance today and provides the most accessible introduction to the more general
theory of probability.
Given a finite sample space Ω, the classical probability of an event A is
P(A) = |A|/|Ω| = (the number of cases favorable to the outcome of the event)/(the total number of possible cases).
• In this section, we are more apt to refer to equipossible cases as ones selected at
random. Probabilities can be evaluated for events whose elements are chosen at
random by enumerating the number of elements in the event.
• The bases for identifying equipossibility were often
◦ physical symmetry (e.g. a wellbalanced die, made of homogeneous material in
a cubical shape) or
◦ a balance of information or knowledge concerning the various possible outcomes.
• Equipossibility is meaningful only for a finite sample space, and, in this case, the evaluation of probability is accomplished through the definition of classical probability.
2.1. Basic properties of classical probability:
• P(A) ≥ 0
• P(Ω) = 1
• P(∅) = 0
• P(A^c) = 1 − P(A)
• P(A ∪ B) = P(A) + P(B) − P(A ∩ B), which comes directly from
|A ∪ B| = |A| + |B| − |A ∩ B|.
• A ⊥ B is equivalent to P(A ∩ B) = 0.
• A ⊥ B ⇒ P(A ∪ B) = P(A) + P(B)
• Suppose Ω = {ω_1, . . . , ω_n} and P({ω_i}) = 1/n. Then P(A) = ∑_{ω∈A} P({ω}).
◦ The probability of an event is equal to the sum of the probabilities of its component outcomes because outcomes are mutually exclusive.
2.2. Classical Conditional Probability: The conditional classical probability P(A|B) of event A, given that event B ≠ ∅ occurred, is given by
P(A|B) = |A ∩ B|/|B| = P(A ∩ B)/P(B). (5)
• It is the updated probability of the event A given that we now know that B occurred.
• Read "conditional probability of A given B".
• P(A|B) = P(A ∩ B|B) ≥ 0
• For any A such that B ⊂ A, we have P(A|B) = 1. This implies
P(Ω|B) = P(B|B) = 1.
• If A ⊥ C, P(A ∪ C|B) = P(A|B) + P(C|B)
• P(A ∩ B) = P(B)P(A|B)
• P(A ∩ B) ≤ P(A|B)
• P(A ∩ B ∩ C) = P(A)P(B|A)P(C|A ∩ B)
• P(A ∩ B) = P(A)P(B|A)
• P(A ∩ B ∩ C) = P(A ∩ B)P(C|A ∩ B)
• P(A, B|C) = P(A|C)P(B|A, C) = P(B|C)P(A|B, C)
2.3. Total Probability and Bayes Theorem: If {B_1, . . . , B_n} is a partition of Ω, then for any set A,
• Total Probability Theorem: P(A) = ∑_{i=1}^{n} P(A|B_i)P(B_i).
• Bayes Theorem: Suppose P(A) > 0; we have
P(B_k|A) = P(A|B_k)P(B_k) / ∑_{i=1}^{n} P(A|B_i)P(B_i).
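A standard worked example of the two theorems (Python sketch, added for illustration; the numbers, a test with 95% sensitivity, 10% false-positive rate, and 1% prevalence, are made up):

```python
# Partition: B1 = "has condition", B2 = "does not". A = "test is positive".
P_B = [0.01, 0.99]
P_A_given_B = [0.95, 0.10]        # sensitivity; false-positive rate

# Total probability: P(A) = sum_i P(A|Bi) P(Bi)
P_A = sum(pa * pb for pa, pb in zip(P_A_given_B, P_B))
# Bayes: P(B1|A)
P_B1_given_A = P_A_given_B[0] * P_B[0] / P_A

assert abs(P_A - 0.1085) < 1e-9
assert abs(P_B1_given_A - 0.0876) < 5e-4   # a positive test still leaves ~8.8%
```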
2.4. Independent Events: A and B are independent (A ⫫ B) if and only if
P(A ∩ B) = P(A)P(B). (6)
In classical probability, this is equivalent to
|A ∩ B| |Ω| = |A| |B|.
• Sometimes the definition of independence above does not agree with the everyday-language use of the word "independence". Hence, many authors use the term "statistical independence" for the definition above to distinguish it from other definitions.
2.5. Having three pairwise independent events does not imply that the three events are jointly independent. In other words,
A ⫫ B, B ⫫ C, A ⫫ C ⇏ A ⫫ B ⫫ C.
Example: Experiment of flipping a fair coin twice. Ω = {HH, HT, TH, TT}. Define event A to be the event that the first flip gives an H; that is, A = {HH, HT}. Event B is the event that the second flip gives an H; that is, B = {HH, TH}. C = {HH, TT}. Here each pair intersects in {HH}, so each pairwise intersection has probability 1/4 = (1/2)(1/2), yet P(A ∩ B ∩ C) = 1/4 ≠ 1/8 = P(A)P(B)P(C). Note also that even though the events A and B are not disjoint, they are independent.
2.6. Consider Ω of size 2n. We are given a set A ⊂ Ω of size n; then P(A) = 1/2. We want to find all sets B ⊂ Ω such that A ⫫ B. (Note that without the required independence, there are 2^{2n} possible B.) For independence, we need

P(A ∩ B) = P(A)P(B).

Let r = |A ∩ B| and k = |B \ A|; r can be any integer from 0 to n. Since P(A ∩ B) = r/(2n) and P(B) = (r + k)/(2n), the condition for independence becomes

r/(2n) = (1/2) · (r + k)/(2n),

which is equivalent to r = k. So the construction of the set B is given by choosing r elements from the set A, then choosing k = r elements from the set Ω \ A. There are C(n, r) choices for the first part and C(n, k) = C(n, r) choices for the second part. Therefore, the total number of possible B such that A ⫫ B is

∑_{r=0}^{n} C(n, r)² = C(2n, n).
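As a small sanity check of 2.6 (not part of the original text), the count can be verified by brute force for n = 3: enumerate all subsets B of a 6-point equally likely Ω and keep those independent of a fixed 3-point set A.

```python
# Brute-force check of 2.6: for |Omega| = 2n equally likely outcomes and
# |A| = n, count the subsets B with P(A ∩ B) = P(A)P(B) and compare C(2n, n).
from itertools import chain, combinations
from math import comb

n = 3
omega = range(2 * n)
A = set(range(n))  # any fixed set of size n

def subsets(s):
    s = list(s)
    return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))

count = 0
for B in map(set, subsets(omega)):
    # Classical probability: P(E) = |E| / |Omega|, so independence is
    # |A ∩ B| * |Omega| = |A| * |B|.
    if len(A & B) * len(omega) == len(A) * len(B):
        count += 1

print(count, comb(2 * n, n))  # both are 20 for n = 3
```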
2.1 Examples
2.7. Chevalier de Méré's "Scandal of Arithmetic":

Which is more likely: obtaining at least one six in 4 tosses of a fair die (event A), or obtaining at least one double six in 24 tosses of a pair of dice (event B)?

We have

P(A) = 1 − (5/6)⁴ ≈ .518

and

P(B) = 1 − (35/36)²⁴ ≈ .491.

Therefore, the first event is more probable.
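The two probabilities in 2.7 follow from the complement rule, P(at least one) = 1 − P(none); a quick numerical check:

```python
# Numerical check of 2.7 via the complement rule.
p_A = 1 - (5 / 6) ** 4       # at least one six in 4 tosses of one die
p_B = 1 - (35 / 36) ** 24    # at least one double six in 24 tosses of a pair

print(round(p_A, 3), round(p_B, 3))  # 0.518 0.491
```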
2.8. A random sample of size r with replacement is taken from a population of n elements. The probability of the event that in the sample no element appears twice (that is, no repetition in our sample) is

(n)_r / n^r.

The probability that at least one element appears twice is

p_u(n, r) = 1 − ∏_{i=1}^{r−1} (1 − i/n) ≈ 1 − e^{−r(r−1)/(2n)}.

In fact, when r − 1 < n/2, (A.2) gives

e^{−(r(r−1)/(2n)) · ((3n+2r−1)/(3n))} ≤ ∏_{i=1}^{r−1} (1 − i/n) ≤ e^{−r(r−1)/(2n)}.

• From the approximation, to have p_u(n, r) = p, we need

r ≈ 1/2 + (1/2)√(1 − 8n ln(1 − p)).
• Probability of a coincident birthday: the probability that at least two people in a class of r students share a birthday is

  1, if r ≥ 365;
  1 − (365/365)(364/365) · · · ((365 − (r − 1))/365)  [r terms], if 0 ≤ r ≤ 365.

◦ Birthday Paradox: In a group of 23 randomly selected people, the probability that at least two will share a birthday (assuming birthdays are equally likely to occur on any given day of the year) is about 0.5. See also (3).
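The exact product formula for p_u(n, r) and the approximation 1 − e^{−r(r−1)/(2n)} can be compared directly at the birthday-paradox point n = 365, r = 23:

```python
# Exact p_u(365, r) from the product formula versus the exponential
# approximation; illustrates the birthday paradox at r = 23.
from math import exp

def p_exact(n, r):
    prob_no_repeat = 1.0
    for i in range(1, r):
        prob_no_repeat *= 1 - i / n
    return 1 - prob_no_repeat

def p_approx(n, r):
    return 1 - exp(-r * (r - 1) / (2 * n))

print(p_exact(365, 23), p_approx(365, 23))  # both close to 0.5
```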
2.9. Monty Hall's Game: The game starts by showing a contestant 3 closed doors, behind one of which is a prize. The contestant selects a door, but before that door is opened, Monty Hall, who knows which door hides the prize, opens one of the remaining doors. The contestant is then allowed either to stay with his original guess or to switch to the other closed door. Question: is it better to stay or to switch? Answer: switch. Given that the contestant switches, the probability that he wins the prize is 2/3.
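A Monte Carlo sketch of 2.9 (switching wins exactly when the initial guess was wrong, which happens with probability 2/3):

```python
# Monte Carlo check of the Monty Hall answer: switching wins ~2/3 of the time.
import random

random.seed(0)
trials = 100_000
switch_wins = 0
for _ in range(trials):
    prize = random.randrange(3)
    guess = random.randrange(3)
    # Monty opens a door that is neither the guess nor the prize;
    # switching then wins exactly when guess != prize.
    if guess != prize:
        switch_wins += 1

print(switch_wins / trials)  # close to 2/3
```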
[Figure 9: p_u(n, r). Left: p_u(n, r) versus r for n = 365, together with the approximation 1 − e^{−r(r−1)/(2n)}; the point (r, n) = (23, 365) is marked. Right: the r required to reach p_u(n, r) = p, as a function of n, for p = 0.1, 0.3, 0.5, 0.7, 0.9.]
2.10. False Positives on Diagnostic Tests: Let D be the event that the testee has the disease, and let + be the event that the test returns a positive result. Denote the probability of having the disease by p_D.

Now, assume that the test always returns a positive result when the testee has the disease. This is equivalent to P(+|D) = 1 and P(+^c|D) = 0. Also, suppose that even when the testee does not have the disease, the test will still return a positive result with probability p_+; that is, P(+|D^c) = p_+.

If the test returns a positive result, then the probability that the testee has the disease is

P(D|+) = P(D ∩ +)/P(+) = p_D / (p_D + p_+(1 − p_D)) = 1 / (1 + (p_+/p_D)(1 − p_D)) ≈ p_D/p_+ for a rare disease (p_D ≪ 1).
The four joint probabilities and their marginals can be tabulated as follows:

             D                                      D^c                                          (marginal)
  +          P(+ ∩ D) = P(+|D)P(D)                  P(+ ∩ D^c) = P(+|D^c)P(D^c)                  P(+) = P(+ ∩ D) + P(+ ∩ D^c)
             = 1 · p_D = p_D                        = p_+(1 − p_D)                               = p_D + p_+(1 − p_D)
  +^c        P(+^c ∩ D) = P(+^c|D)P(D)              P(+^c ∩ D^c) = P(+^c|D^c)P(D^c)              P(+^c) = P(+^c ∩ D) + P(+^c ∩ D^c)
             = 0 · p_D = 0                          = (1 − p_+)(1 − p_D)                         = (1 − p_+)(1 − p_D)
  (marginal) P(D) = P(+ ∩ D) + P(+^c ∩ D) = p_D     P(D^c) = P(+ ∩ D^c) + P(+^c ∩ D^c) = 1 − p_D  P(Ω) = 1
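A quick numerical check of 2.10 under assumed illustrative values (a prevalence of 10⁻⁴ and a false-positive rate of 1%, both chosen here only for the example):

```python
# Check of P(D|+) = p_D / (p_D + p_plus*(1 - p_D)) and the rare-disease
# approximation p_D / p_plus. The two parameter values are illustrative only.
p_D = 1e-4      # assumed prevalence
p_plus = 0.01   # assumed false-positive rate P(+|D^c)

posterior = p_D / (p_D + p_plus * (1 - p_D))
approx = p_D / p_plus

print(posterior, approx)  # nearly equal; both about 1%
```

Even with a perfectly sensitive test, a positive result here leaves the probability of disease at only about p_D/p_+.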
3 Probability Foundations
To study the formal definition of probability, we start with the probability space (Ω, 𝒜, P). Let Ω be an arbitrary space or set of points ω. Viewed probabilistically, a subset of Ω is an event and an element ω of Ω is a sample point. Each event is a collection of outcomes, which are elements of the sample space Ω.
The theory of probability focuses on collections of events, called event algebras and typically denoted 𝒜 (or ℱ), that contain all the events of interest (regarding the random experiment ℰ) to us, and are such that we have knowledge of their likelihood of occurrence. The probability P itself is defined as a number in the range [0, 1] associated with each event in 𝒜.
3.1 Algebra and σ-algebra
The class 2^Ω of all subsets can be too large¹ for us to define probability measures consistently across all members of the class. In this section, we present smaller classes which have several "nice" properties.
Definition 3.1. [7, Def 1.6.1 p 38] An event algebra 𝒜 is a collection of subsets of the sample space Ω such that it is
(1) nonempty (this is equivalent to Ω ∈ 𝒜);
(2) closed under complementation (if A ∈ 𝒜 then A^c ∈ 𝒜);
(3) and closed under finite unions (if A, B ∈ 𝒜 then A ∪ B ∈ 𝒜).
In other words, "a class is called an algebra on Ω if it contains Ω itself and is closed under the formation of complements and finite unions."
3.2. Examples of algebras:
• Ω = any fixed interval of ℝ; 𝒜 = {finite unions of intervals contained in Ω}.
• Ω = (0, 1]; ℬ₀ = the collection of all finite unions of intervals of the form (a, b] ⊂ (0, 1].
  ◦ Not a σ-field: consider the set ⋃_{i=1}^{∞} (1/(2i+1), 1/(2i)], a countable union of such intervals that is not a finite union of them.
3.3. Properties of an algebra 𝒜:
(a) Nonempty: ∅ ∈ 𝒜, Ω ∈ 𝒜.
(b) 𝒜 ⊂ 2^Ω.
(c) An algebra is closed under finite set-theoretic operations:
  • A ∈ 𝒜 ⇒ A^c ∈ 𝒜
  • A, B ∈ 𝒜 ⇒ A ∪ B ∈ 𝒜, A ∩ B ∈ 𝒜, A \ B ∈ 𝒜, A △ B = (A \ B) ∪ (B \ A) ∈ 𝒜
  • A₁, A₂, . . . , A_n ∈ 𝒜 ⇒ ⋃_{i=1}^{n} A_i ∈ 𝒜 and ⋂_{i=1}^{n} A_i ∈ 𝒜
(d) The collection of algebras on Ω is closed under arbitrary intersection. In particular, let 𝒜₁, 𝒜₂ be algebras on Ω and let 𝒜 = 𝒜₁ ∩ 𝒜₂ be the collection of sets common to both algebras. Then 𝒜 is an algebra.
(e) The smallest algebra is {∅, Ω}.
¹ There is no problem when Ω is countable.
(f) The largest algebra is the set of all subsets of Ω, known as the power set and denoted by 2^Ω.
(g) Cardinality of algebras: an algebra of subsets of a finite set of n elements will always have a cardinality of the form 2^k, k ≤ n.
3.4. There is a smallest (in the sense of inclusion) algebra containing any given family 𝒞 of subsets of Ω. For 𝒞 ⊂ 2^Ω, the algebra generated by 𝒞 is

⋂ {𝒞′ : 𝒞′ is an algebra, 𝒞 ⊂ 𝒞′},

i.e., the intersection of all algebras containing 𝒞. It is the smallest algebra containing 𝒞.
Definition 3.5. A σ-algebra 𝒜 is an event algebra that is also closed under countable unions:

(∀i ∈ ℕ) A_i ∈ 𝒜 ⟹ ⋃_{i∈ℕ} A_i ∈ 𝒜.

Remarks:
• A σ-algebra is also an algebra.
• A finite algebra is also a σ-algebra.
3.6. Because every σ-algebra is also an algebra, it has all the properties listed in (3.3). Extra properties of a σ-algebra 𝒜 are as follows:
(a) A₁, A₂, . . . ∈ 𝒜 ⇒ ⋃_{j=1}^{∞} A_j ∈ 𝒜 and ⋂_{j=1}^{∞} A_j ∈ 𝒜.
(b) A σ-field is closed under countable set-theoretic operations.
(c) The collection of σ-algebras on Ω is closed under arbitrary intersection: let 𝒜_α be a σ-algebra for every α ∈ A, where A is some index set, potentially uncountable. Then ⋂_{α∈A} 𝒜_α is still a σ-algebra.
(d) An infinite σ-algebra ℱ on X is uncountable; i.e., a σ-algebra is either finite or uncountable.
(e) If 𝒜 and ℬ are σ-algebras on X, then 𝒜 ∪ ℬ is not necessarily a σ-algebra. For example, let E₁, E₂ ⊂ X be distinct and not complements of one another, and let 𝒜_i = {∅, E_i, E_i^c, X}. Then E_i ∈ 𝒜₁ ∪ 𝒜₂ but E₁ ∪ E₂ ∉ 𝒜₁ ∪ 𝒜₂.
Definition 3.7. Generation of σ-algebra: For 𝒞 ⊂ 2^Ω, the σ-algebra generated by 𝒞, written σ(𝒞), is

⋂ {𝒞′ : 𝒞′ is a σ-field, 𝒞 ⊂ 𝒞′},

i.e., the intersection of all σ-fields containing 𝒞. It is the smallest σ-field containing 𝒞.
• If the set Ω is not implicit, we will explicitly write σ_Ω(𝒞).
• We will say that a set A can be generated by elements of 𝒞 if A ∈ σ(𝒞).
3.8. Properties of σ(𝒞):
(a) σ(𝒞) is a σ-field.
(b) σ(𝒞) is the smallest σ-field containing 𝒞, in the sense that if ℋ is a σ-field and 𝒞 ⊂ ℋ, then σ(𝒞) ⊂ ℋ.
(c) 𝒞 ⊂ σ(𝒞).
(d) σ(σ(𝒞)) = σ(𝒞).
(e) If ℋ is a σ-field and 𝒞 ⊂ ℋ, then σ(𝒞) ⊂ ℋ.
(f) σ({∅}) = σ({Ω}) = {∅, Ω}.
(g) σ({A}) = {∅, A, A^c, Ω} for A ⊂ Ω.
(h) σ({A, B}) has at most 16 elements. They are ∅, A, B, A ∩ B, A ∪ B, A \ B, B \ A, A △ B, and their corresponding complements. See also (3.11). Some of these 16 sets can be the same, hence the "at most" in the statement.
(i) σ(𝒜) = σ(𝒜 ∪ {∅}) = σ(𝒜 ∪ {Ω}) = σ(𝒜 ∪ {∅, Ω}).
(j) 𝒜 ⊂ ℬ ⇒ σ(𝒜) ⊂ σ(ℬ).
(k) σ(𝒜), σ(ℬ) ⊂ σ(𝒜) ∪ σ(ℬ) ⊂ σ(𝒜 ∪ ℬ).
3.9. For the decomposition described in (1.8), let the starting collection be 𝒞₁ and the decomposed collection be 𝒞₂.
• If 𝒞₁ is finite, then σ(𝒞₂) = σ(𝒞₁).
• If 𝒞₁ is countable, then σ(𝒞₂) ⊂ σ(𝒞₁).
3.10. Construction of a σ-algebra from a countable partition: An intermediate-sized σ-algebra (a σ-algebra smaller than 2^Ω) can be constructed by first partitioning Ω into subsets and then forming the power set of these subsets, with an individual subset now playing the role of an individual outcome ω.

Given a countable² partition Π = {A_i, i ∈ I} of Ω, we can form a σ-algebra 𝒜 by including all unions of subcollections:

𝒜 = {⋃_{α∈S} A_α : S ⊂ I}  (7)

[7, Ex 1.5 p 39], where we define ⋃_{i∈∅} A_i = ∅. It turns out that 𝒜 = σ(Π). Of course, a σ-algebra is also an algebra; hence, (7) is also a way to construct an algebra. Note that, from (1.9), the necessary and sufficient condition for distinct S to produce distinct elements in (7) is that none of the A_α's are empty.

In particular, consider countable Ω = {x_i : i ∈ ℕ}, where the x_i's are distinct. If we want a σ-algebra which contains every {x_i}, then the smallest one is 2^Ω, which happens to be the biggest one. So 2^Ω is the only σ-algebra which is "reasonable" to use.

² In this case, Π is countable if and only if I is countable.
3.11. Generation of a σ-algebra from a finite partition: If a finite collection Π = {A_i, i ∈ I} of nonempty sets forms a partition of Ω, then the algebra generated by Π is the same as the σ-algebra generated by Π, and it is given by (7). Moreover, |σ(Π)| = 2^{|I|} = 2^{|Π|}; that is, distinct sets S in (7) produce distinct members of the (σ-)algebra.

Therefore, given a finite collection of sets 𝒞 = {C₁, C₂, . . . , C_n}, to find the algebra or σ-algebra generated by 𝒞, the first step is to use the C_i's to create a partition of Ω. Using (1.8), the partition is given by

Π = {⋂_{i=1}^{n} B_i : B_i = C_i or C_i^c}.  (8)

By (3.9), we know that σ(Π) = σ(𝒞). Note that there are seemingly 2^n sets in Π; however, some of them can be ∅. We can eliminate the empty set(s) from Π and it is still a partition. So the cardinality of Π in (8) (after empty-set elimination) is some k ≤ 2^n. The partition Π is then a collection of k sets, which can be renamed A₁, . . . , A_k, all of which are nonempty. Applying the construction in (7) (with I = [k]), we then have σ(𝒞), whose cardinality is 2^k ≤ 2^{2^n}. See also properties (3.8.7) and (3.8.8).
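The two-step recipe of 3.11 (build the partition (8), then take all unions as in (7)) can be sketched by brute force on a small finite Ω; the particular Ω and 𝒞 below are assumed for illustration only.

```python
# Sketch of 3.11: from a finite collection C of subsets of a finite Omega,
# build the partition atoms (intersections of each C_i or its complement),
# then form sigma(C) as all unions of subcollections of atoms.
from itertools import chain, combinations, product

omega = frozenset(range(4))
C = [frozenset({0, 1}), frozenset({1, 2})]  # assumed example collection

# Step 1: partition atoms, dropping any empty intersections.
atoms = set()
for choice in product([True, False], repeat=len(C)):
    atom = omega
    for pick, Ci in zip(choice, C):
        atom = atom & (Ci if pick else omega - Ci)
    if atom:
        atoms.add(atom)

# Step 2: sigma(C) = all unions of subcollections of atoms (the empty
# subcollection gives the empty set).
sigma = {frozenset(chain.from_iterable(sub))
         for r in range(len(atoms) + 1)
         for sub in combinations(atoms, r)}

print(len(atoms), len(sigma))  # k atoms give 2**k sets
```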
Definition 3.12. In general, the Borel σ-algebra (or Borel algebra) ℬ is the σ-algebra generated by the open subsets of Ω.
• Call ℬ_Ω the σ-algebra of Borel subsets of Ω, or the σ-algebra of Borel sets on Ω.
• Call a set B ∈ ℬ_Ω a Borel set of Ω.
(a) On Ω = ℝ, the σ-algebra generated by any of the following is the Borel σ-algebra:
(i) open sets
(ii) closed sets
(iii) intervals
(iv) open intervals
(v) closed intervals
(vi) intervals of the form (−∞, a], where a ∈ ℝ
  i. Can replace ℝ by ℚ.
  ii. Can replace (−∞, a] by (−∞, a), [a, +∞), or (a, +∞).
  iii. Can replace (−∞, a] by combinations of the intervals in ii.
(b) For Ω ⊂ ℝ, ℬ_Ω = {A ∈ ℬ_ℝ : A ⊂ Ω} = ℬ_ℝ ∩ Ω, where ℬ_ℝ ∩ Ω = {A ∩ Ω : A ∈ ℬ_ℝ}.
(c) The Borel σ-algebra on the extended real line ℝ̄ = ℝ ∪ {−∞, ∞} is the extended σ-algebra

ℬ_ℝ̄ = {A ∪ B : A ∈ ℬ_ℝ, B ∈ {∅, {−∞}, {∞}, {−∞, ∞}}}.

It is generated by, for example,
(i) 𝒜 ∪ {{−∞}, {∞}}, where σ(𝒜) = ℬ_ℝ;
(ii) the classes of extended intervals {[â, b̂]}, {[â, b̂)}, {(â, b̂]};
(iii) {[−∞, b]}, {[−∞, b)}, {(a, +∞]}, {[a, +∞]};
(iv) {[−∞, b̂]}, {[−∞, b̂)}, {(â, +∞]}, {[â, +∞]}.
Here, a, b ∈ ℝ and â, b̂ ∈ ℝ ∪ {±∞}.
(d) The Borel σ-algebra on ℝ^k is generated by
(i) the class of open sets in ℝ^k;
(ii) the class of closed sets in ℝ^k;
(iii) the class of bounded semi-open rectangles (cells) I of the form

I = {x ∈ ℝ^k : a_i < x_i ≤ b_i, i = 1, . . . , k}.

Note that I = ⊗_{i=1}^{k} (a_i, b_i], where ⊗ denotes the Cartesian product;
(iv) the class of "southwest regions" S_x of points "southwest" of x ∈ ℝ^k, i.e.,

S_x = {y ∈ ℝ^k : y_i ≤ x_i, i = 1, . . . , k}.

3.13. The Borel σ-algebra ℬ of subsets of the reals is the usual algebra when we deal with real- or vector-valued quantities.

Our needs will not require us to delve into these issues beyond being assured that the events we discuss are constructed out of intervals and repeated set operations on these intervals, and that these constructions will not lead us out of ℬ. In particular, countable unions of intervals, their complements, and much, much more are in ℬ.
3.2 Kolmogorov's Axioms for Probability
Definition 3.14. Kolmogorov's axioms for probability [12]: A set function satisfying K0–K4 is called a probability measure.
K0 Setup: The random experiment ℰ is described by a probability space (Ω, 𝒜, P) consisting of an event σ-algebra 𝒜 and a real-valued function P : 𝒜 → ℝ.
K1 Nonnegativity: ∀A ∈ 𝒜, P(A) ≥ 0.
K2 Unit normalization: P(Ω) = 1.
K3 Finite additivity: If A, B are disjoint, then P(A ∪ B) = P(A) + P(B).
K4 Monotone continuity: If (∀i ≥ 1) A_{i+1} ⊂ A_i and ⋂_{i∈ℕ} A_i = ∅ (a nested sequence of sets shrinking to the empty set), then

lim_{i→∞} P(A_i) = 0.

K4′ Countable or σ-additivity: If (A_i) is a countable collection of pairwise disjoint (non-overlapping) events, then

P(⋃_{i=1}^{∞} A_i) = ∑_{i=1}^{∞} P(A_i).

• Note that there is never a problem with the convergence of the infinite sum; all partial sums of these nonnegative summands are bounded above by 1.
• K4, which is neither a property of limits of sequences of relative frequencies nor meaningful in the finite sample space context of classical probability, is offered by Kolmogorov to ensure a degree of mathematical closure under limiting operations [7, p 111].
• K4 is an idealization that is rejected by some accounts (usually subjectivist) of probability.
• If P satisfies K0–K3, then it satisfies K4 if and only if it satisfies K4′ [7, Theorem 3.5.1 p 111].
Proof. To show "⇒", consider disjoint A₁, A₂, . . .. Define B_n = ⋃_{i>n} A_i. Then, by (1.6), B_{n+1} ⊂ B_n and ⋂_{n=1}^{∞} B_n = ∅. So, by K4, lim_{n→∞} P(B_n) = 0. Furthermore,

⋃_{i=1}^{∞} A_i = B_n ∪ ⋃_{i=1}^{n} A_i,

where all the sets on the RHS are disjoint. Hence, by finite additivity,

P(⋃_{i=1}^{∞} A_i) = P(B_n) + ∑_{i=1}^{n} P(A_i).

Taking the limit as n → ∞ gives us K4′.

To show "⇐", see (3.17).
Equivalently, instead of K0–K4, we can define a probability measure using P0–P2 below.
Definition 3.15. A probability measure defined on a σ-algebra 𝒜 of Ω is a (set) function
(P0) P : 𝒜 → [0, 1]
that satisfies:
(P1, K2) P(Ω) = 1.
(P2, K4′) Countable additivity: for every countable sequence (A_n)_{n=1}^{∞} of disjoint elements of 𝒜, one has P(⋃_{n=1}^{∞} A_n) = ∑_{n=1}^{∞} P(A_n).
• P(∅) = 0.
• The number P(A) is called the probability of the event A.
• The triple (Ω, 𝒜, P) is called a probability measure space, or simply a probability space.
• A support of P is any 𝒜-set A for which P(A) = 1.
3.3 Properties of Probability Measure
3.16. Properties of probability measures:
(a) P(∅) = 0.
(b) 0 ≤ P ≤ 1: for any A ∈ 𝒜, 0 ≤ P(A) ≤ 1.
(c) If P(A) = 1, A is not necessarily Ω.
(d) Additivity: A, B ∈ 𝒜, A ∩ B = ∅ ⇒ P(A ∪ B) = P(A) + P(B).
(e) Monotonicity: A, B ∈ 𝒜, A ⊂ B ⇒ P(A) ≤ P(B) and P(B − A) = P(B) − P(A).
(f) P(A^c) = 1 − P(A).
(g) P(A) + P(B) = P(A ∪ B) + P(A ∩ B) = P(A − B) + 2P(A ∩ B) + P(B − A). In particular,

P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

(h) P(A ∪ B) ≥ max(P(A), P(B)) ≥ min(P(A), P(B)) ≥ P(A ∩ B).
(i) Inclusion–exclusion principle:

P(⋃_{k=1}^{n} A_k) = ∑_i P(A_i) − ∑_{i<j} P(A_i ∩ A_j) + ∑_{i<j<k} P(A_i ∩ A_j ∩ A_k) − · · · + (−1)^{n+1} P(A₁ ∩ · · · ∩ A_n).

In a more compact form,

P(⋃_{i=1}^{n} A_i) = ∑_{∅≠I⊂[n]} (−1)^{|I|+1} P(⋂_{i∈I} A_i).

For example,

P(A₁ ∪ A₂ ∪ A₃) = P(A₁) + P(A₂) + P(A₃) − P(A₁ ∩ A₂) − P(A₁ ∩ A₃) − P(A₂ ∩ A₃) + P(A₁ ∩ A₂ ∩ A₃).

See also (8.2). Moreover, for any event B, we have

P(⋂_{k=1}^{n} A_k^c ∩ B) = P(B) + ∑_{∅≠I⊂[n]} (−1)^{|I|} P(⋂_{i∈I} A_i ∩ B).  (9)

(j) Finite additivity: If A = ⋃_{j=1}^{n} A_j with A_j ∈ 𝒜 disjoint, then P(A) = ∑_{j=1}^{n} P(A_j).
• If A and B are disjoint sets in 𝒜, then P(A ∪ B) = P(A) + P(B).
(k) Subadditivity or Boole's inequality: If A₁, . . . , A_n are events, not necessarily disjoint, then P(⋃_{i=1}^{n} A_i) ≤ ∑_{i=1}^{n} P(A_i).
(l) σ-subadditivity: If A₁, A₂, . . . is a sequence of measurable sets, not necessarily disjoint, then P(⋃_{i=1}^{∞} A_i) ≤ ∑_{i=1}^{∞} P(A_i).
• This formula is known as the union bound in engineering.
• If A₁, A₂, . . . is a sequence of events, not necessarily disjoint, then ∀α ∈ (0, 1], P(⋃_{i=1}^{∞} A_i) ≤ (∑_{i=1}^{∞} P(A_i))^α.
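The compact form of the inclusion–exclusion principle in 3.16(i) can be verified by brute force on a small equally likely sample space (the three events below are an assumed toy example):

```python
# Brute-force check of inclusion-exclusion on an equally likely Omega.
from itertools import chain, combinations

omega = set(range(10))
events = [set(range(0, 6)), set(range(4, 9)), {0, 2, 4, 6, 8}]
P = lambda E: len(E) / len(omega)

def nonempty_subsets(xs):
    return chain.from_iterable(combinations(xs, r) for r in range(1, len(xs) + 1))

lhs = P(set.union(*events))
rhs = sum((-1) ** (len(J) + 1) * P(set.intersection(*J))
          for J in nonempty_subsets(events))

print(lhs, rhs)  # equal
```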
3.17. Continuity from above: If B₁ ⊃ B₂ ⊃ B₃ ⊃ · · · is a decreasing sequence of measurable sets, then P(⋂_{i=1}^{∞} B_i) = lim_{j→∞} P(B_j). In more compact notation: if B_i ↓ B, then P(B) = lim_{j→∞} P(B_j).

Proof. Let B = ⋂_{i=1}^{∞} B_i and let A_k = B_k \ B_{k+1}, i.e., the "new part". We consider two partitions of B₁: (1) B₁ = B ∪ ⋃_{j=1}^{∞} A_j, and (2) B₁ = B_n ∪ ⋃_{j=1}^{n−1} A_j. (1) implies P(B₁) − P(B) = ∑_{j=1}^{∞} P(A_j). (2) implies P(B₁) − P(B_n) = ∑_{j=1}^{n−1} P(A_j). We then have

lim_{n→∞} (P(B₁) − P(B_n)) =(by (2))= ∑_{j=1}^{∞} P(A_j) =(by (1))= P(B₁) − P(B),

and hence lim_{n→∞} P(B_n) = P(B).
3.18. Let 𝒜 be a σ-algebra. Suppose that P : 𝒜 → [0, 1] satisfies (P1) and is finitely additive. Then the following are equivalent:
• (P2) If A_n ∈ 𝒜 are disjoint, then P(⋃_{n=1}^{∞} A_n) = ∑_{n=1}^{∞} P(A_n).
• (K4) If A_n ∈ 𝒜 and A_n ↓ ∅, then P(A_n) ↓ 0.
• (Continuity from above) If A_n ∈ 𝒜 and A_n ↓ A, then P(A_n) ↓ P(A).
• If A_n ∈ 𝒜 and A_n ↑ Ω, then P(A_n) ↑ P(Ω) = 1.
• (Continuity from below) If A_n ∈ 𝒜 and A_n ↑ A, then P(A_n) ↑ P(A).

Hence, a probability measure satisfies all five properties above. Continuity from above and continuity from below are collectively called sequential continuity properties.

In fact, for a probability measure P, let A_n be a sequence of events in 𝒜 which converges to A (i.e., A_n → A). Then A ∈ 𝒜 and lim_{n→∞} P(A_n) = P(A). Of course, both A_n ↓ A and A_n ↑ A imply A_n → A. Note also that

P(⋃_{n=1}^{∞} A_n) = lim_{N→∞} P(⋃_{n=1}^{N} A_n), and P(⋂_{n=1}^{∞} A_n) = lim_{N→∞} P(⋂_{n=1}^{N} A_n).

Alternative forms of the sequential continuity properties are

P(⋃_{n=1}^{∞} A_n) = lim_{N→∞} P(A_N), if A_n ⊂ A_{n+1}, and

P(⋂_{n=1}^{∞} A_n) = lim_{N→∞} P(A_N), if A_{n+1} ⊂ A_n.
3.19. Given a common event algebra 𝒜, probability measures P₁, . . . , P_m, and numbers λ₁, . . . , λ_m with λ_i ≥ 0 and ∑_{i=1}^{m} λ_i = 1, the convex combination P = ∑_{i=1}^{m} λ_i P_i of probability measures is a probability measure.

3.20. 𝒜 cannot contain an uncountable, disjoint collection of sets of positive probability.
Definition 3.21. Discrete probability measure: P is a discrete probability measure if there exist finitely or countably many points ω_k and nonnegative masses m_k such that ∀A ∈ 𝒜,

P(A) = ∑_{k: ω_k∈A} m_k = ∑_k m_k I_A(ω_k).

If there is just one of these points, say ω₀, with mass m₀ = 1, then P is a unit mass at ω₀. In this case, ∀A ∈ 𝒜, P(A) = I_A(ω₀). Notation: P = δ_{ω₀}.
• Here, Ω can be uncountable.
3.4 Countable Ω
A sample space Ω is countable if it is either finite or countably infinite. It is countably infinite if it has as many elements as there are integers. In either case, the elements of Ω can be enumerated as, say, ω₁, ω₂, . . .. If the event algebra 𝒜 contains each singleton set {ω_k} (from which it follows that 𝒜 is the power set of Ω), then we specify probabilities satisfying the Kolmogorov axioms through a restriction to the set 𝒮 = {{ω_k}} of singleton events.

Definition 3.22. When Ω is countable, a probability mass function (pmf) is any function p : Ω → [0, 1] such that ∑_{ω∈Ω} p(ω) = 1. When the elements of Ω are enumerated, it is common to abbreviate p_i = p(ω_i).

3.23. Every pmf p defines a probability measure P and conversely. Their relationship is given by

p(ω) = P({ω}),  (10)
P(A) = ∑_{ω∈A} p(ω).  (11)

The convenience of a specification by pmf becomes clear when Ω is a finite set of, say, n elements. Specifying P requires specifying 2^n values, one for each event in 𝒜, and doing so in a manner that is consistent with the Kolmogorov axioms. However, specifying p requires only providing n values, one for each element of Ω, satisfying the simple constraints of nonnegativity and summing to 1. The probability measure P satisfying (11) automatically satisfies the Kolmogorov axioms.
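The correspondence (11) can be sketched directly; the three-point pmf below is an assumed example.

```python
# Sketch of 3.23: a pmf p on a countable Omega induces a measure via (11),
# P(A) = sum of p(w) over w in A.
p = {"a": 0.2, "b": 0.5, "c": 0.3}  # assumed pmf on Omega = {a, b, c}

def P(A):
    return sum(p[w] for w in A)

print(P({"a", "b"}), P(set()), P(set(p)))  # 0.7, 0, ~1.0
```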
3.5 Independence
Definition 3.24. Independence between events and collections of events.
(a) Two events A, B are called independent if P(A ∩ B) = P(A)P(B).
(i) An event with probability 0 or 1 is independent of any event (including itself). In particular, ∅ and Ω are independent of any event.
(ii) Two events A, B with positive probabilities are independent if and only if P(B|A) = P(B), which is equivalent to P(A|B) = P(A). When A and/or B has zero probability, A and B are automatically independent.
(iii) An event A is independent of itself if and only if P(A) is 0 or 1.
(iv) If A and B are independent, then the two classes σ({A}) = {∅, A, A^c, Ω} and σ({B}) = {∅, B, B^c, Ω} are independent.
(v) Suppose A and B are disjoint. Then A and B are independent if and only if P(A) = 0 or P(B) = 0.
(vi) Suppose A ⊂ B. Then P(A ∩ B) = P(A), so A and B are independent if and only if P(A) = P(A)P(B), i.e., if and only if P(A) = 0 or P(B) = 1.
(b) Independence for a finite collection {A₁, . . . , A_n} of sets:
≡ P(⋂_{j∈J} A_j) = ∏_{j∈J} P(A_j) for all J ⊂ [n] with |J| ≥ 2.
  ◦ Note that the case |J| = 1 automatically holds. The case |J| = 0 can be regarded as the ∅-event case, which is also trivially true.
  ◦ There are ∑_{j=2}^{n} C(n, j) = 2^n − 1 − n constraints.
  ◦ Example: A₁, A₂, A₃ are independent if and only if

  P(A₁ ∩ A₂ ∩ A₃) = P(A₁)P(A₂)P(A₃),
  P(A₁ ∩ A₂) = P(A₁)P(A₂),
  P(A₁ ∩ A₃) = P(A₁)P(A₃), and
  P(A₂ ∩ A₃) = P(A₂)P(A₃).

  Remark: The first equality alone is not enough for independence. In fact, it is possible for the first equation to hold while the last three fail, as shown in (3.26.b). It is also possible to construct events such that the last three equations hold (pairwise independence) but the first one does not, as demonstrated in (3.26.a).
≡ P(B₁ ∩ B₂ ∩ · · · ∩ B_n) = P(B₁)P(B₂) · · · P(B_n), where B_i = A_i or B_i = Ω.
(c) Independence for a collection {A_α : α ∈ I} of sets:
≡ for every finite J ⊂ I, P(⋂_{α∈J} A_α) = ∏_{α∈J} P(A_α);
≡ each finite subcollection is independent.
(d) Independence for a finite collection {𝒜₁, . . . , 𝒜_n} of classes:
≡ every finite collection of sets A₁, . . . , A_n with A_i ∈ 𝒜_i is independent;
≡ P(B₁ ∩ B₂ ∩ · · · ∩ B_n) = P(B₁)P(B₂) · · · P(B_n), where B_i ∈ 𝒜_i or B_i = Ω;
≡ P(B₁ ∩ B₂ ∩ · · · ∩ B_n) = P(B₁)P(B₂) · · · P(B_n), where B_i ∈ 𝒜_i ∪ {Ω};
≡ for all i and all B_i ∈ 𝒜_i, the events B₁, . . . , B_n are independent;
≡ 𝒜₁ ∪ {Ω}, . . . , 𝒜_n ∪ {Ω} are independent;
≡ 𝒜₁ ∪ {∅}, . . . , 𝒜_n ∪ {∅} are independent.
(e) Independence for a collection {𝒜_θ : θ ∈ Θ} of classes:
≡ any collection {A_θ : θ ∈ Θ} of sets with A_θ ∈ 𝒜_θ is independent;
≡ any finite subcollection of the classes is independent;
≡ for every finite Λ ⊂ Θ, P(⋂_{θ∈Λ} A_θ) = ∏_{θ∈Λ} P(A_θ).
• By definition, a subcollection of independent events is also independent.
• The class {∅, Ω} is independent of any class.

Definition 3.25. A collection of events {A_α} is called pairwise independent if for every pair of distinct events A_{α₁}, A_{α₂}, we have P(A_{α₁} ∩ A_{α₂}) = P(A_{α₁})P(A_{α₂}).
• If a collection of events {A_α : α ∈ I} is independent, then it is pairwise independent. The converse is false; see (a) in Example 3.26.
• For K ⊂ J, P(⋂_{α∈J} A_α) = ∏_{α∈J} P(A_α) does not imply P(⋂_{α∈K} A_α) = ∏_{α∈K} P(A_α).
Example 3.26.
(a) Let Ω = {1, 2, 3, 4}, 𝒜 = 2^Ω, P({i}) = 1/4, A₁ = {1, 2}, A₂ = {1, 3}, A₃ = {2, 3}. Then P(A_i ∩ A_j) = P(A_i)P(A_j) for all i ≠ j, but P(A₁ ∩ A₂ ∩ A₃) ≠ P(A₁)P(A₂)P(A₃).
(b) Let Ω = {1, 2, 3, 4, 5, 6}, 𝒜 = 2^Ω, P({i}) = 1/6, A₁ = {1, 2, 3, 4}, A₂ = A₃ = {4, 5, 6}. Then P(A₁ ∩ A₂ ∩ A₃) = P(A₁)P(A₂)P(A₃), but P(A_i ∩ A_j) ≠ P(A_i)P(A_j) for all i ≠ j.
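Both parts of Example 3.26 can be checked directly with exact arithmetic:

```python
# Direct verification of Example 3.26 with equally likely outcomes.
from fractions import Fraction

def P(E, omega):
    return Fraction(len(E), len(omega))

# (a) pairwise independent but not jointly independent
omega_a = {1, 2, 3, 4}
A1, A2, A3 = {1, 2}, {1, 3}, {2, 3}
pairwise = all(P(X & Y, omega_a) == P(X, omega_a) * P(Y, omega_a)
               for X, Y in [(A1, A2), (A1, A3), (A2, A3)])
triple = (P(A1 & A2 & A3, omega_a)
          == P(A1, omega_a) * P(A2, omega_a) * P(A3, omega_a))
print(pairwise, triple)  # True False

# (b) the triple-product equality holds but no pair is independent
omega_b = {1, 2, 3, 4, 5, 6}
B1, B2, B3 = {1, 2, 3, 4}, {4, 5, 6}, {4, 5, 6}
triple_b = (P(B1 & B2 & B3, omega_b)
            == P(B1, omega_b) * P(B2, omega_b) * P(B3, omega_b))
pair_b = any(P(X & Y, omega_b) == P(X, omega_b) * P(Y, omega_b)
             for X, Y in [(B1, B2), (B1, B3), (B2, B3)])
print(triple_b, pair_b)  # True False
```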
(c) The paradox of "almost sure" events: Consider two random events with probabilities of 99% and 99.99%, respectively. One could say that the two probabilities are nearly the same and that both events are almost sure to occur. Nevertheless, the difference may become significant in certain cases. Consider, for instance, an independent event which may occur on any day of the year with probability p = 99%; then the probability P that it occurs every day of the year is less than 3%, while if p = 99.99% then P ≈ 97%.
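The two year-long probabilities in (c) are just p raised to the 365th power:

```python
# Check of the "almost sure" paradox in 3.26(c).
p_99 = 0.99 ** 365      # under 3%
p_9999 = 0.9999 ** 365  # about 96-97%

print(p_99, p_9999)
```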
4 Random Element
4.1. A function X : (Ω, 𝒜) → (E, ℬ_E) is said to be a random element of E if and only if X is measurable, which is equivalent to each of the following statements:
≡ X^{−1}(ℬ_E) ⊂ 𝒜;
≡ σ(X) ⊂ 𝒜;
≡ (reduced form) ∃𝒞 ⊂ ℬ_E such that σ(𝒞) = ℬ_E and X^{−1}(𝒞) ⊂ 𝒜.
• When E ⊂ ℝ, X is called a random variable.
• When E ⊂ ℝ^d, X is called a random vector.

Definition 4.2. X = Y almost surely (a.s.) if P[X = Y] = 1.
• The a.s. equality is an equivalence relation.
4.3. Law of X, or distribution of X: P_X = μ_X = P X^{−1} = ℒ(X) : (E, ℬ_E) → [0, 1], with

μ_X(A) = P_X(A) = P(X^{−1}(A)) = (P ∘ X^{−1})(A) = P({ω : X(ω) ∈ A}) = P({X ∈ A}).

4.4. For X ∈ L^p, lim_{t→∞} t^p P[|X| ≥ t] = 0.

4.5. A Borel set S is called a support of X if P_X(S^c) = 0 (or, equivalently, P_X(S) = 1).
4.1 Random Variable
Definition 4.6. A real-valued function X(ω) defined for points ω in a sample space Ω is called a random variable.
• Random variables are important because they provide a compact way of referring to events via their numerical attributes.
• The abbreviation r.v. will be used for "real-valued random variable" [10, p 1].
• Technically, a random variable must be measurable.

4.7. At a certain point in most probability courses, the sample space is rarely mentioned anymore and we work directly with random variables. The sample space often "disappears", but it is really there in the background.

4.8. For B ∈ ℬ_ℝ, we use the shorthand
• [X ∈ B] = {ω ∈ Ω : X(ω) ∈ B}, and
• P[X ∈ B] = P([X ∈ B]) = P({ω ∈ Ω : X(ω) ∈ B}).
• In particular, P[X < x] is a shorthand for P({ω ∈ Ω : X(ω) < x}).

4.9. If X and Y are random variables, we use the shorthand
• [X ∈ B, Y ∈ C] = {ω ∈ Ω : X(ω) ∈ B and Y(ω) ∈ C} = [X ∈ B] ∩ [Y ∈ C].
• P[X ∈ B, Y ∈ C] = P([X ∈ B] ∩ [Y ∈ C]).
4.10. Every random variable can be written as a sum of a discrete random variable and a continuous random variable.

4.11. A random variable can have at most countably many points x such that P[X = x] > 0.

4.12. A point-mass probability measure (Dirac measure), usually written ε_α or δ_α, is used to denote a point mass of size one at the point α. In this case,
• P_X({α}) = 1;
• P_X({α}^c) = 0;
• F_X(x) = 1_{[α,∞)}(x).

4.13. There exist distributions that are neither discrete nor continuous. For example, let μ(A) = (1/2)μ₁(A) + (1/2)μ₂(A) for μ₁ discrete and μ₂ coming from a density.
4.14. When X and Y take finitely many values, say x₁, . . . , x_m and y₁, . . . , y_n, respectively, we can arrange the probabilities p_{X,Y}(x_i, y_j) in the m × n matrix

  [ p_{X,Y}(x₁, y₁)   p_{X,Y}(x₁, y₂)   . . .   p_{X,Y}(x₁, y_n) ]
  [ p_{X,Y}(x₂, y₁)   p_{X,Y}(x₂, y₂)   . . .   p_{X,Y}(x₂, y_n) ]
  [       ⋮                  ⋮            ⋱            ⋮          ]
  [ p_{X,Y}(x_m, y₁)  p_{X,Y}(x_m, y₂)   . . .   p_{X,Y}(x_m, y_n) ]

• The sum of the entries in the i-th row is p_X(x_i), and the sum of the entries in the j-th column is p_Y(y_j).
• The sum of all the entries in the matrix is one.
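The row/column-sum facts above can be sketched with an assumed 2×3 joint pmf matrix:

```python
# Sketch of 4.14: row sums of the joint pmf matrix give the pmf of X,
# column sums give the pmf of Y, and all entries sum to 1.
joint = [
    [0.10, 0.20, 0.10],  # p_{X,Y}(x1, y_j), assumed values
    [0.25, 0.15, 0.20],  # p_{X,Y}(x2, y_j)
]

p_X = [sum(row) for row in joint]        # marginal pmf of X (row sums)
p_Y = [sum(col) for col in zip(*joint)]  # marginal pmf of Y (column sums)

print(p_X, p_Y, sum(p_X))
```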
4.2 Distribution Function
4.15. The (cumulative) distribution function (cdf) induced by a probability P on (ℝ, ℬ) is the function F(x) = P(−∞, x].
The (cumulative) distribution function (cdf) of the random variable X is the function F_X(x) = P_X(−∞, x] = P[X ≤ x].
• The distribution P_X can be obtained from the distribution function by setting P_X(−∞, x] = F_X(x); that is, F_X uniquely determines P_X.
• 0 ≤ F_X ≤ 1.
C1 F_X is nondecreasing.
C2 F_X is right-continuous:

∀x, F_X(x⁺) ≡ lim_{y→x, y>x} F_X(y) ≡ lim_{y↓x} F_X(y) = F_X(x) = P[X ≤ x].

  ◦ The function g(x) = P[X < x] is left-continuous in x.
C3 lim_{x→−∞} F_X(x) = 0 and lim_{x→∞} F_X(x) = 1.
• ∀x, F_X(x⁻) ≡ lim_{y→x, y<x} F_X(y) ≡ lim_{y↑x} F_X(y) = P_X(−∞, x) = P[X < x].
• P[X = x] = P_X({x}) = F(x) − F(x⁻) = the jump, or saltus, in F at x.
[Figure 10: Right-continuous function at a jump point.]
• For all x < y:

  P((x, y]) = F(y) − F(x)
  P([x, y]) = F(y) − F(x⁻)
  P([x, y)) = F(y⁻) − F(x⁻)
  P((x, y)) = F(y⁻) − F(x)
  P({x}) = F(x) − F(x⁻)

• A function F is the distribution function of some probability measure on (ℝ, ℬ) if and only if it satisfies (C1), (C2), and (C3).
• If F satisfies (C1), (C2), and (C3), then there exists a unique probability measure P on (ℝ, ℬ) that has P(a, b] = F(b) − F(a) for all a, b ∈ ℝ.
• F_X is continuous if and only if P[X = x] = 0 for every x.
• F_X is continuous if and only if P_X is continuous.
• F_X has at most countably many points of discontinuity.

Definition 4.16. It is traditional to write X ∼ F to indicate that "X has distribution F" [21, p 25].

Definition 4.17. F_{X,A}(x) = P([X ≤ x] ∩ A).

4.18. Left-continuous inverse: g^{−1}(y) = inf{x ∈ ℝ : g(x) ≥ y}, y ∈ (0, 1).
• Trick: just flip the graph along the line y = x, then make the resulting graph left-continuous.
• If g is a cdf, then only consider y ∈ (0, 1). The result is called the inverse cdf [7, Def 8.4.1 p 238] or quantile function.
  ◦ In [21, Def 2.16 p 25], the inverse cdf is defined using the strict inequality ">" rather than "≥".
• See Table 1 for examples.
[Figure 11: Left-continuous inverse on (0, 1).]
  Distribution   | F(x)                      | F^{−1}(u)
  Exponential    | 1 − e^{−λx}               | −(1/λ) ln u
  Extreme value  | 1 − e^{−e^{(x−a)/b}}      | a + b ln(−ln u)
  Geometric      | 1 − (1 − p)^i             | ⌈ln u / ln(1 − p)⌉
  Logistic       | 1 − 1/(1 + e^{(x−μ)/b})   | μ − b ln(1/u − 1)
  Pareto         | 1 − x^{−a}                | u^{−1/a}
  Weibull        | 1 − e^{−(x/a)^b}          | a(−ln u)^{1/b}

Table 1: Left-continuous inverses (here u ∈ (0, 1); since U and 1 − U have the same uniform distribution, u may stand in for 1 − u)
Definition 4.19. Let X be a random variable with distribution function F, and suppose that p ∈ (0, 1). A value x such that F(x⁻) = P[X < x] ≤ p and F(x) = P[X ≤ x] ≥ p is called a quantile of order p for the distribution. Roughly speaking, a quantile of order p is a value where the cumulative distribution crosses p. Note that it is not unique: if F(x) = p on an interval [a, b], then every x ∈ [a, b] is a quantile of order p.

A quantile of order 1/2 is called a median of the distribution. When there is only one median, it is frequently used as a measure of the center of the distribution. A quantile of order 1/4 is called a first quartile, and the quantile of order 3/4 is called a third quartile. A median is a second quartile.

Assuming uniqueness, let q₁, q₂, and q₃ denote the first, second, and third quartiles of X. The interquartile range is defined to be q₃ − q₁, and is sometimes used as a measure of the spread of the distribution with respect to the median. The five parameters min X, q₁, q₂, q₃, max X are often referred to as the five-number summary. Graphically, the five numbers are often displayed as a boxplot.
4.20. If F is nondecreasing, right-continuous, with lim_{x→−∞} F(x) = 0 and lim_{x→∞} F(x) = 1, then F is the cdf of some probability measure on (ℝ, ℬ).

In particular, let U ∼ 𝒰(0, 1) and X = F^{−1}(U); then F_X = F. Here, F^{−1} is the left-continuous inverse of F. Note that we have just explicitly defined a random variable X(ω) with distribution function F on Ω = (0, 1).
• To generate X ∼ ℰ(λ) (exponential), set X = −(1/λ) ln U.
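The inverse-transform recipe of 4.20 can be sketched for the exponential case and checked against the known mean 1/λ:

```python
# Sketch of 4.20 (inverse transform sampling): X = -(1/lambda) ln U with
# U uniform on (0, 1] gives exponential samples; the sample mean should
# approach 1/lambda.
import math
import random

random.seed(1)
lam = 2.0
# 1 - random.random() lies in (0, 1], avoiding log(0).
samples = [-math.log(1.0 - random.random()) / lam for _ in range(200_000)]

mean = sum(samples) / len(samples)
print(mean)  # close to 1/lam = 0.5
```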
4.3 Discrete random variable
Definition 4.21. A random variable X is said to be a discrete random variable if there exist countably many distinct real numbers x_k such that ∑_k P[X = x_k] = 1.
≡ ∃ a countable set {x₁, x₂, . . .} such that μ_X({x₁, x₂, . . .}) = 1;
≡ X has a countable support {x₁, x₂, . . .}.
⇒ X is completely determined by the values μ_X({x₁}), μ_X({x₂}), . . ..
• p_i = p_X(x_i) = P[X = x_i].

Definition 4.22. When X is a discrete random variable taking distinct values x_k, we define its probability mass function (pmf) by

p_X(x_k) = P[X = x_k].

• We can use a stem plot to visualize p_X.
• If Ω is countable, then there can be only countably many values of X(ω). So, any random variable defined on a countable Ω is discrete.
41
• Sometimes, we write p(x
k
) or p
x
k
in stead of p
X
(x
k
).
• P[X ∈ B] =
¸
x
k
∈B
P[X = x
k
].
• F
X
(x) =
¸
x
k
p
X
(x
k
)U(x −x
k
).
Definition 4.23 (Discrete CDF). A cdf which can be written in the form $F_d(x) = \sum_k p_k U(x - x_k)$ is called a discrete cdf [7, Def. 5.4.1, p 163]. Here, U is the unit step function, $\{x_k\}$ is an arbitrary countable set of real numbers, and $\{p_k\}$ is a countable set of positive numbers that sum to 1.
Definition 4.24. An integer-valued random variable is a discrete random variable whose distinct values are $x_k = k$. For integer-valued random variables,
$P[X \in B] = \sum_{k \in B} P[X = k]$.
4.25. Properties of pmf
• $p_X : \mathbb{R} \to [0, 1]$.
• $0 \le p_X \le 1$.
• $\sum_k p_X(x_k) = 1$.
Definition 4.26. Sometimes, it is convenient to work with the "pdf" of a discrete r.v. Suppose X is a discrete random variable as in Definition 4.22. Then, the "pdf" of X is
$f_X(x) = \sum_{x_k} p_X(x_k)\,\delta(x - x_k), \quad x \in \mathbb{R}$. (12)
Although the delta function is not a well-defined function³, this technique does allow easy manipulation of mixed distributions. The definitions of quantities involving discrete random variables and the corresponding properties can then be derived from the pdf, and hence there is no need to talk about the pmf at all!
4.4 Continuous random variable
Definition 4.27. A random variable X is said to be a continuous random variable if and only if any one of the following equivalent conditions holds.
≡ $\forall x$, $P[X = x] = 0$
≡ ∀ countable set C, $P_X(C) = 0$
≡ $F_X$ is continuous

³Rigorously, it is a unit measure at 0.
4.28. f is a (probability) density function (with respect to Lebesgue measure) of a random variable X (or of the distribution $P_X$)
≡ $P_X$ has density f with respect to Lebesgue measure.
≡ $P_X$ is absolutely continuous w.r.t. the Lebesgue measure ($P_X \ll \lambda$) with $f = \frac{dP_X}{d\lambda}$, the Radon–Nikodym derivative.
≡ f is a nonnegative Borel function on $\mathbb{R}$ such that $\forall B \in \mathcal{B}_{\mathbb{R}}$, $P_X(B) = \int_B f(x)\,dx = \int_B f\,d\lambda$, where λ is the Lebesgue measure. (This extends nicely to the random-vector case.)
≡ X is absolutely continuous
≡ X (or $F_X$) comes from the density f
≡ $\forall x \in \mathbb{R}$, $F_X(x) = \int_{-\infty}^x f(t)\,dt$
≡ $\forall a, b$, $F_X(b) - F_X(a) = \int_a^b f(x)\,dx$
4.29. If F does diﬀerentiate to f and f is continuous, it follows by the fundamental
theorem of calculus that f is indeed a density for F. That is, if F has a continuous
derivative, this derivative can serve as the density f.
4.30. Suppose a random variable X has a density f.
• F need not differentiate to f everywhere.
◦ When $X \sim \mathcal{U}(a, b)$, $F_X$ is not differentiable at a nor at b.
• $\int_{\mathbb{R}} f(x)\,dx = 1$
• f is determined only Lebesgue-a.e. That is, if $g = f$ Lebesgue-a.e., then g can also serve as a density for X and $P_X$.
• f is nonnegative a.e. [9, stated on p 138]
• X is a continuous random variable.
• At its continuity points, f must be the derivative of F.
• $P[X \in [a,b]] = P[X \in [a,b)] = P[X \in (a,b]] = P[X \in (a,b)]$ because the corresponding integrals over an interval are not affected by whether or not the endpoints are included or excluded. In other words, $P[X = a] = P[X = b] = 0$.
• $P[f_X(X) = 0] = 0$
4.31. $f_X(x) = E[\delta(X - x)]$.
Definition 4.32 (Absolutely Continuous CDF). An absolutely continuous cdf $F_{ac}$ can be written in the form
$F_{ac}(x) = \int_{-\infty}^x f(z)\,dz$,
where the integrand,
$f(x) = \frac{d}{dx}F_{ac}(x)$,
is defined a.e., and is a nonnegative, integrable function (possibly having discontinuities) satisfying $\int f(x)\,dx = 1$.
4.33. Any nonnegative function that integrates to one is a probability density function (pdf) [9, p 139].
4.34. Remarks: Some useful intuitions
(a) Approximately, for a small Δx, $P[X \in [x, x + \Delta x]] = \int_x^{x+\Delta x} f_X(t)\,dt \approx f_X(x)\,\Delta x$. This is why we call $f_X$ the density function.
(b) In fact, $f_X(x) = \lim_{\Delta x \to 0} \frac{P[x < X \le x + \Delta x]}{\Delta x}$.
4.35. Let T be an absolutely continuous nonnegative random variable with cumulative distribution function F and density f on the interval [0, ∞). The following terms are often used when T denotes the lifetime of a device or system.
(a) Its survival, survivor, or reliability function is
$R(t) = P[T > t] = \int_t^{\infty} f(x)\,dx = 1 - F(t)$.
• $R(0) = P[T > 0] = P[T \ge 0] = 1$.
(b) The mean time to failure (MTTF) $= E[T] = \int_0^{\infty} R(t)\,dt$.
(c) The (age-specific) failure rate or hazard function of a device or system with lifetime T is
$r(t) = \lim_{\delta \to 0} \frac{P[T \le t + \delta \mid T > t]}{\delta} = -\frac{R'(t)}{R(t)} = \frac{f(t)}{R(t)} = -\frac{d}{dt}\ln R(t)$.
(i) $r(t)\,\delta \approx P[T \in (t, t+\delta] \mid T > t]$
(ii) $R(t) = e^{-\int_0^t r(\tau)\,d\tau}$.
(iii) $f(t) = r(t)\,e^{-\int_0^t r(\tau)\,d\tau}$
• For $T \sim \mathcal{E}(\lambda)$, $r(t) = \lambda$.
See also [9, section 5.7].
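As a quick numerical illustration of (ii) above (added here, not from the notes; the rate 1.5 is an arbitrary choice), the constant hazard of an exponential lifetime can be checked, and R(t) recovered from r(t) by numerical integration:

```python
import math

lam = 1.5
f = lambda t: lam * math.exp(-lam * t)   # Exp(lam) density
R = lambda t: math.exp(-lam * t)         # survival function 1 - F(t)
r = lambda t: f(t) / R(t)                # hazard rate; constant = lam for exponential

def R_from_hazard(t, steps=10_000):
    """Recover R(t) = exp(-integral_0^t r(tau) d tau) via the midpoint rule."""
    h = t / steps
    integral = h * sum(r((i + 0.5) * h) for i in range(steps))
    return math.exp(-integral)
```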
Definition 4.36. A random variable whose cdf is continuous but whose derivative is the zero function a.e. is said to be singular.
• See the Cantor-type distribution in [5, p 35–36].
• It has no density. (Otherwise, the density would be 0 a.e. and could not integrate to one.) So, ∃ a continuous random variable X with no density. Hence, ∃ a random variable X with no density.
• Even when we allow the use of the delta function for the density, as in the case of mixed r.v., it still has no density because there is no jump in the cdf.
• There exist singular r.v.'s whose cdfs are strictly increasing.
Definition 4.37. $f_{X,A}(x) = \frac{d}{dx}F_{X,A}(x)$. See also Definition 4.17.
4.5 Mixed/hybrid Distribution
There are many occasions where we have a random variable whose distribution is a normalized linear combination of discrete and absolutely continuous distributions. For convenience, we use the Dirac delta function to link the pmf to a pdf as in Definition 4.26. Then, we only have to concern ourselves with the pdf of a mixed distribution/r.v.
4.38. By allowing density functions to contain impulses, the cdfs of mixed random variables can be expressed in the form $F(x) = \int_{(-\infty,x]} f(t)\,dt$.
4.39. Given the cdf $F_X$ of a mixed random variable X, the density $f_X$ is given by
$f_X(x) = \tilde{f}_X(x) + \sum_k P[X = x_k]\,\delta(x - x_k)$,
where
• the $x_k$ are the distinct points at which $F_X$ has jump discontinuities, and
• $\tilde{f}_X(x) = \begin{cases} F_X'(x), & F_X \text{ is differentiable at } x, \\ 0, & \text{otherwise.} \end{cases}$
In which case,
$E[g(X)] = \int g(x)\tilde{f}_X(x)\,dx + \sum_k g(x_k)\,P[X = x_k]$.
Note also that $P[X = x_k] = F(x_k) - F(x_k^-)$.
4.40. Suppose the cdf F can be expressed in the form $F(x) = G(x)U(x - x_0)$ for some function G. Then, the density is $f(x) = G'(x)U(x - x_0) + G(x_0)\delta(x - x_0)$. Note that $G(x_0) = F(x_0) = P[X = x_0]$ is the jump of the cdf at $x_0$. When the random variable is continuous, $G(x_0) = 0$ and thus $f(x) = G'(x)U(x - x_0)$.
4.6 Independence
Definition 4.41. A family of random variables $\{X_i : i \in I\}$ is independent if, for every finite $J \subset I$, the family of random variables $\{X_i : i \in J\}$ is independent. In words, "an infinite collection of random elements is by definition independent if each finite subcollection is." Hence, we only need to know how to test independence for finite collections.
(a) The $(E_i, \mathcal{E}_i)$'s are not required to be the same.
(b) The collection of random variables $\{1_{A_i} : i \in I\}$ is independent iff the collection of events (sets) $\{A_i : i \in I\}$ is independent.
Definition 4.42. Independence among a finite collection of random variables: For finite I, the following statements are equivalent.
≡ $(X_i)_{i \in I}$ are independent (or mutually independent [2, p 182]).
≡ $P[X_i \in H_i\ \forall i \in I] = \prod_{i \in I} P[X_i \in H_i]$, where $H_i \in \mathcal{E}_i$
≡ $P\left[(X_i : i \in I) \in \prod_{i \in I} H_i\right] = \prod_{i \in I} P[X_i \in H_i]$, where $H_i \in \mathcal{E}_i$
≡ $P_{(X_i : i \in I)}\left(\prod_{i \in I} H_i\right) = \prod_{i \in I} P_{X_i}(H_i)$, where $H_i \in \mathcal{E}_i$
≡ $P[X_i \le x_i\ \forall i \in I] = \prod_{i \in I} P[X_i \le x_i]$
≡ [Factorization Criterion] $F_{(X_i : i \in I)}((x_i : i \in I)) = \prod_{i \in I} F_{X_i}(x_i)$
≡ $X_i$ and $X_1^{i-1} = (X_1, \ldots, X_{i-1})$ are independent $\forall i \ge 2$
≡ $\sigma(X_i)$ and $\sigma(X_1^{i-1})$ are independent $\forall i \ge 2$
≡ For discrete random variables $X_i$ with countable range E:
$P[X_i = x_i\ \forall i \in I] = \prod_{i \in I} P[X_i = x_i]\quad \forall x_i \in E\ \forall i \in I$
≡ For absolutely continuous $X_i$ with densities $f_{X_i}$: $f_{(X_i : i \in I)}((x_i : i \in I)) = \prod_{i \in I} f_{X_i}(x_i)$
Definition 4.43. If the $X_\alpha$, $\alpha \in I$, are independent and each has the same marginal distribution Q, we say that the $X_\alpha$'s are iid (independent and identically distributed), and we write $X_\alpha \overset{iid}{\sim} Q$.
• The abbreviation can be IID [21, p 39].
Definition 4.44. A pairwise independent collection of random variables is a set of random variables any two of which are independent.
(a) Any collection of (mutually) independent random variables is pairwise independent.
(b) Some pairwise independent collections are not independent. See Example (4.45).
Example 4.45. Suppose X, Y, and Z have the following joint probability distribution: $p_{X,Y,Z}(x, y, z) = \frac{1}{4}$ for $(x, y, z) \in \{(0,0,0), (0,1,1), (1,0,1), (1,1,0)\}$. This, for example, can be constructed by starting with independent X and Y that are Bernoulli(1/2). Then set $Z = X \oplus Y = (X + Y) \bmod 2$.
(a) X, Y, Z are pairwise independent.
(b) The combination of $X \perp Z$ and $Y \perp Z$ does not imply $(X, Y) \perp Z$.
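Example 4.45 can be verified exhaustively; the sketch below (added for illustration, not part of the notes) enumerates the four equally likely triples and checks that every pair of coordinates factors into its marginals while the triple does not:

```python
from itertools import product

support = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]
p = {t: 0.25 for t in support}   # joint pmf of (X, Y, Z)

def prob(constraints):
    """P[coordinate i equals v for every (i, v) in constraints]."""
    return sum(pr for t, pr in p.items()
               if all(t[i] == v for i, v in constraints))

# every pair of coordinates factors into its marginals
pairwise_ok = all(
    abs(prob([(i, a), (j, b)]) - prob([(i, a)]) * prob([(j, b)])) < 1e-12
    for i, j in [(0, 1), (0, 2), (1, 2)]
    for a, b in product([0, 1], repeat=2)
)

# but the triple does not: P[X=0, Y=0, Z=0] = 1/4, product of marginals is 1/8
triple_factors = abs(
    prob([(0, 0), (1, 0), (2, 0)])
    - prob([(0, 0)]) * prob([(1, 0)]) * prob([(2, 0)])
) < 1e-12
```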
Definition 4.46. The convolution of probability measures $\mu_1$ and $\mu_2$ on $(\mathbb{R}, \mathcal{B})$ is the measure $\mu_1 * \mu_2$ defined by
$(\mu_1 * \mu_2)(H) = \int \mu_2(H - x)\,\mu_1(dx), \quad H \in \mathcal{B}_{\mathbb{R}}$.
(a) $\mu_X * \mu_Y = \mu_Y * \mu_X$ and $\mu_X * (\mu_Y * \mu_Z) = (\mu_X * \mu_Y) * \mu_Z$.
(b) If $F_X$ and $F_Y$ are the distribution functions corresponding to $\mu_X$ and $\mu_Y$, the distribution function corresponding to $\mu_X * \mu_Y$ is
$(\mu_X * \mu_Y)(-\infty, z] = \int F_Y(z - x)\,\mu_X(dx)$.
In this case, it is notationally convenient to replace $\mu_X(dx)$ by $dF_X(x)$ (Stieltjes integral). Then, $(\mu_X * \mu_Y)(-\infty, z] = \int F_Y(z - x)\,dF_X(x)$. This is denoted by $(F_X * F_Y)(z)$. That is,
$F_Z(z) = (F_X * F_Y)(z) = \int F_Y(z - x)\,dF_X(x)$.
(c) If the density $f_Y$ exists, then $F_X * F_Y$ has density $F_X * f_Y$, where
$(F_X * f_Y)(z) = \int f_Y(z - x)\,dF_X(x)$.
(i) If Y (or $F_Y$) is absolutely continuous with density $f_Y$, then for any X (or $F_X$), $X + Y$ (or $F_X * F_Y$) is absolutely continuous with density $(F_X * f_Y)(z) = \int f_Y(z - x)\,dF_X(x)$.
If, in addition, $F_X$ has density $f_X$, then
$\int f_Y(z - x)\,dF_X(x) = \int f_Y(z - x)f_X(x)\,dx$.
This is denoted by $f_X * f_Y$.
In other words, if densities $f_X$, $f_Y$ exist, then $F_X * F_Y$ has density $f_X * f_Y$, where
$(f_X * f_Y)(z) = \int f_Y(z - x)f_X(x)\,dx$.
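The density-convolution formula can be checked numerically; the sketch below (illustrative only, stdlib Python) convolves two Exp(1) densities by midpoint quadrature and compares with the known Gamma(2, 1) density $\lambda^2 z e^{-\lambda z}$:

```python
import math

lam = 1.0

def f(x):
    """Density of Exp(lam)."""
    return lam * math.exp(-lam * x) if x >= 0 else 0.0

def conv_density(f1, f2, z, steps=20_000):
    """(f1 * f2)(z) = integral of f2(z - x) f1(x) dx, midpoint rule on [0, z]."""
    if z <= 0:
        return 0.0
    h = z / steps
    return h * sum(f2(z - (i + 0.5) * h) * f1((i + 0.5) * h) for i in range(steps))

z = 2.0
numeric = conv_density(f, f, z)
exact = lam**2 * z * math.exp(-lam * z)   # Gamma(2, lam) density at z
```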
4.47. If random variables X and Y are independent and have distributions $\mu_X$ and $\mu_Y$, then $X + Y$ has distribution $\mu_X * \mu_Y$.
4.48. Expectation and independence
(a) Let X and Y be nonnegative independent random variables on $(\Omega, \mathcal{A}, P)$; then $E[XY] = EX\,EY$.
(b) If $X_1, X_2, \ldots, X_n$ are independent and the $g_k$'s are complex-valued measurable functions, then the $g_k(X_k)$'s are independent. Moreover, if each $g_k(X_k)$ is integrable, then
$E\left[\prod_{k=1}^n g_k(X_k)\right] = \prod_{k=1}^n E[g_k(X_k)]$.
(c) If $X_1, \ldots, X_n$ are independent and $\sum_{k=1}^n X_k$ has a finite second moment, then all $X_k$ have finite second moments as well. Moreover,
$\operatorname{Var}\left[\sum_{k=1}^n X_k\right] = \sum_{k=1}^n \operatorname{Var} X_k$.
(d) If the $X_i \in L^2$ are pairwise independent, then $\operatorname{Var}\left[\sum_{i=1}^k X_i\right] = \sum_{i=1}^k \operatorname{Var}[X_i]$.
4.7 Misc
4.49. The mode of a discrete probability distribution is the value at which its probability mass function takes its maximum value. The mode of a continuous probability distribution is the value at which its probability density function attains its maximum value.
• The mode is not necessarily unique, since the probability mass function or probability density function may achieve its maximum value at several points.
5 PMF Examples
The following pmfs will be defined on their supports S. For Ω larger than S, we simply set the pmf to 0.
5.1 Random/Uniform
5.1. Uniform distribution $\mathcal{U}_n$:
When an experiment results in a finite number of "equally likely" or "totally random" outcomes, we model it with a uniform random variable. We say that X is uniformly distributed on $[n] = \{1, 2, \ldots, n\}$ if
$P[X = k] = \frac{1}{n}, \quad k \in [n]$.
We write $X \sim \mathcal{U}_n$.
Table 2: Examples of probability mass functions. Here, $p, \beta \in (0, 1)$ and $\lambda > 0$.
• Uniform $\mathcal{U}_n$: support $\{1, 2, \ldots, n\}$; $p_X(k) = \frac{1}{n}$.
• Uniform $\mathcal{U}_{\{0,1,\ldots,n-1\}}$: support $\{0, 1, \ldots, n-1\}$; $p_X(k) = \frac{1}{n}$; $\varphi_X(u) = \frac{1 - e^{iun}}{n(1 - e^{iu})}$.
• Bernoulli $\mathcal{B}(1, p)$: support $\{0, 1\}$; $p_X(0) = 1 - p$, $p_X(1) = p$.
• Binomial $\mathcal{B}(n, p)$: support $\{0, 1, \ldots, n\}$; $p_X(k) = \binom{n}{k}p^k(1-p)^{n-k}$; $\varphi_X(u) = (1 - p + pe^{iu})^n$.
• Geometric $\mathcal{G}_0(\beta)$: support $\mathbb{N} \cup \{0\}$; $p_X(k) = (1-\beta)\beta^k$; $\varphi_X(u) = \frac{1-\beta}{1-\beta e^{iu}}$.
• Geometric $\mathcal{G}_1(\beta)$: support $\mathbb{N}$; $p_X(k) = (1-\beta)\beta^{k-1}$.
• Poisson $\mathcal{P}(\lambda)$: support $\mathbb{N} \cup \{0\}$; $p_X(k) = e^{-\lambda}\frac{\lambda^k}{k!}$; $\varphi_X(u) = e^{\lambda(e^{iu}-1)}$.
• $p_i = \frac{1}{n}$ for $i \in S = \{1, 2, \ldots, n\}$.
• Examples
◦ classical games of chance / classical probability / drawing at random
◦ fair gaming devices (well-balanced coins and dice, well-shuffled decks of cards)
◦ experiments where
∗ there are only n possible outcomes and they are all equally probable
∗ there is a balance of information about outcomes
5.2. Uniform on a finite set: $\mathcal{U}(S)$
Suppose $|S| = n$; then $p(x) = \frac{1}{n}$ for all $x \in S$.
Example 5.3. For X uniform on $\{-M, \ldots, M\}$, we have $EX = 0$ and $\operatorname{Var} X = \frac{M(M+1)}{3}$.
For X uniform on $\{N, N+1, \ldots, M\}$, we have $EX = \frac{N+M}{2}$ and $\operatorname{Var} X = \frac{1}{12}(M-N)(M-N+2)$.
Example 5.4. Set $S = \{0, 1, 2, \ldots, M\}$; then the sum of two independent $\mathcal{U}(S)$ has pmf
$p(k) = \frac{(M+1) - |k - M|}{(M+1)^2}$
for $k = 0, 1, \ldots, 2M$. Note its triangular shape with maximum value at $p(M) = \frac{1}{M+1}$. To visualize the pmf in MATLAB, try

M = 5;
k = 0:2*M;
p = ones(1,M+1)/(M+1);   % pmf of U(S)
P = conv(p,p);           % pmf of the sum of two independent copies
stem(k,P)
5.2 Bernoulli and Binary distributions
5.5. Bernoulli: $\mathcal{B}(1, p)$ or Bernoulli(p)
• $S = \{0, 1\}$
• $p_0 = q = 1 - p$, $p_1 = p$
• $EX = E[X^2] = p$.
$\operatorname{Var}[X] = p - p^2 = p(1-p)$. Note that the variance is maximized at $p = 0.5$.

Figure 12: The plot of $p(1-p)$ for $p \in [0, 1]$; the maximum value $0.25$ occurs at $p = 0.5$.
5.6. Binary: Suppose X takes only two values a and b with $b > a$, and $P[X = b] = p$. Then, X can be expressed as $X = (b - a)I + a$, where I is a Bernoulli random variable with $P[I = 1] = p$.
• $\operatorname{Var} X = (b-a)^2 \operatorname{Var} I = (b-a)^2 p(1-p)$. Note that it is still maximized at $p = 1/2$.
• Suppose $a = -b$. Then, $X = 2bI + a = 2bI - b$. In this case, $\operatorname{Var} X = 4b^2 p(1-p)$.
5.3 Binomial: $\mathcal{B}(n, p)$
• Binomial distribution with size n and parameter $p \in [0, 1]$.
• $p_i = \binom{n}{i}p^i(1-p)^{n-i}$ for $i \in S = \{0, 1, 2, \ldots, n\}$.
• X is the number of successes in n independent Bernoulli trials, and hence the sum of n independent, identically distributed Bernoulli r.v.'s.
• $\varphi_X(u) = (1 - p + pe^{iu})^n$
• $EX = np$
• $E[X^2] = (np)^2 + np(1-p)$
• $\operatorname{Var}[X] = np(1-p)$
• Tail probability: $\sum_{r=k}^n \binom{n}{r}p^r(1-p)^{n-r} = I_p(k, n-k+1)$, the regularized incomplete beta function.
• The maximum probability value occurs at $k_{\max} = \operatorname{mode} X = \lfloor(n+1)p\rfloor \approx np$.
◦ When $(n+1)p$ is an integer, the maximum is achieved at both $k_{\max}$ and $k_{\max} - 1$.
• If we have $\mathcal{E}_1, \ldots, \mathcal{E}_n$, n unlinked repetitions of an experiment $\mathcal{E}$, and an event A for $\mathcal{E}$, then $\mathcal{B}(n, p)$ describes the probability that A occurs k times in $\mathcal{E}_1, \ldots, \mathcal{E}_n$.
• Gaussian Approximation for Binomial Probabilities: When n is large, the binomial distribution becomes difficult to compute directly because of the need to calculate factorial terms. We can use
$P[X = k] \approx \frac{1}{\sqrt{2\pi np(1-p)}}\, e^{-\frac{(k-np)^2}{2np(1-p)}}$, (13)
which comes from approximating X by a Gaussian Y with the same mean and variance and the relation
$P[X = k] \approx P[X \le k] - P[X \le k-1] \approx P[Y \le k] - P[Y \le k-1] \approx f_Y(k)$.
See also (12.22).
• Poisson approximation:
$\binom{n}{k}p^k(1-p)^{n-k} = \frac{(np)^k}{k!}e^{-np}\left(1 + O\!\left(np^2, \frac{k^2}{n}\right)\right)$.

Figure 13: Gaussian approximation to Binomial, Poisson distribution, and Gamma distribution.
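Formula (13) can be checked directly; the sketch below (added for illustration, not part of the notes; n, p, and k are arbitrary choices) compares the exact binomial pmf with the Gaussian density approximation near the mean:

```python
import math

def binom_pmf(n, k, p):
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def gauss_approx(n, k, p):
    """Right-hand side of (13): normal density with mean np and variance np(1-p)."""
    var = n * p * (1 - p)
    return math.exp(-(k - n * p)**2 / (2 * var)) / math.sqrt(2 * math.pi * var)

n, p = 1000, 0.4
k = 400  # near the mean np = 400
rel_err = abs(gauss_approx(n, k, p) - binom_pmf(n, k, p)) / binom_pmf(n, k, p)
```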
5.4 Geometric: $\mathcal{G}(\beta)$
A geometric distribution is defined by the fact that, for some $\beta \in [0, 1)$, $p_{k+1} = \beta p_k$ for all $k \in S$, where S can be either $\mathbb{N}$ or $\mathbb{N} \cup \{0\}$.
• When its support is $\mathbb{N}$, $p_k = (1 - \beta)\beta^{k-1}$. This is referred to as $\mathcal{G}_1(\beta)$ or geometric$_1(\beta)$. In MATLAB, use geopdf(k-1,1-beta).
• When its support is $\mathbb{N} \cup \{0\}$, $p_k = (1 - \beta)\beta^k$. This is referred to as $\mathcal{G}_0(\beta)$ or geometric$_0(\beta)$. In MATLAB, use geopdf(k,1-beta).
5.7. Consider $X \sim \mathcal{G}_0(\beta)$.
• $p_i = (1 - \beta)\beta^i$ for $i \in S = \mathbb{N} \cup \{0\}$, $0 \le \beta < 1$.
• $\beta = \frac{m}{m+1}$, where m = average waiting time / lifetime.
• $P[X = k] = P[\text{k failures followed by a success}] = (P[\text{failure}])^k\,P[\text{success}]$.
$P[X \ge k] = \beta^k$ = the probability of having at least k initial failures = the probability of having to perform at least $k+1$ trials.
$P[X > k] = \beta^{k+1}$ = the probability of having at least $k+1$ initial failures.
• Memoryless property:
◦ $P[X \ge k + c \mid X \ge k] = P[X \ge c]$, $k, c > 0$.
◦ $P[X > k + c \mid X \ge k] = P[X > c]$, $k, c > 0$.
◦ If a success has not occurred in the first k trials (it has already failed k times), then the probability of having to perform at least j more trials is the same as the probability of initially having to perform at least j trials.
◦ Each time a failure occurs, the system "forgets" and begins anew as if it were performing the first trial.
◦ The geometric r.v. is the only discrete r.v. that satisfies the memoryless property.
• Examples:
◦ lifetimes of components, measured in discrete time units, when they fail catastrophically (without degradation due to aging)
◦ waiting times
∗ for the next customer in a queue
∗ between radioactive disintegrations
∗ between photon emissions
◦ the number of repeated, unlinked random experiments that must be performed prior to the first occurrence of a given event A
∗ the number of coin tosses prior to the first appearance of a 'head'
∗ the number of trials required to observe the first success
• The sum of independent $\mathcal{G}_0(p)$ and $\mathcal{G}_0(q)$ has pmf
$\begin{cases} (1-p)(1-q)\frac{q^{k+1} - p^{k+1}}{q - p}, & p \ne q, \\ (k+1)(1-p)^2 p^k, & p = q, \end{cases}$
for $k \in \mathbb{N} \cup \{0\}$.
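The memoryless property of $\mathcal{G}_0(\beta)$ follows from the tail formula $P[X \ge k] = \beta^k$; a small numerical check (added for illustration; β, k, c are arbitrary choices):

```python
beta = 0.7

def p_ge(k):
    """P[X >= k] for X ~ G0(beta); the closed form is beta**k."""
    return beta ** k

# the tail formula agrees with direct summation of the pmf (1 - beta) * beta**i
tail_sum = sum((1 - beta) * beta**i for i in range(4, 200))

# memoryless: P[X >= k + c | X >= k] = P[X >= c]
k, c = 4, 3
conditional = p_ge(k + c) / p_ge(k)
```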
5.8. Consider $X \sim \mathcal{G}_1(\beta)$.
• $P[X > k] = \beta^k$
• Suppose independent $X_i \sim \mathcal{G}_1(\beta_i)$. Then $\min(X_1, X_2, \ldots, X_n) \sim \mathcal{G}_1\left(\prod_{i=1}^n \beta_i\right)$.
5.5 Poisson Distribution: $\mathcal{P}(\lambda)$
5.9. Characterized by
• $p_X(k) = P[X = k] = e^{-\lambda}\frac{\lambda^k}{k!}$; or equivalently,
• $\varphi_X(u) = e^{\lambda(e^{iu}-1)}$,
where $\lambda \in (0, \infty)$ is called the parameter or intensity parameter of the distribution. In MATLAB, use poisspdf(k,lambda).
5.10. Denoted by $\mathcal{P}(\lambda)$.
5.11. Instead of X, a Poisson random variable is usually denoted by Λ.
5.12. $EX = \operatorname{Var} X = \lambda$.
5.13. Successive probabilities are connected via the relation $k\,p_X(k) = \lambda\,p_X(k-1)$.
5.14. $\operatorname{mode} X = \lfloor\lambda\rfloor$.
• Note that when $\lambda \in \mathbb{N}$, there are two maxima, at $\lambda - 1$ and $\lambda$.
• When $\lambda \gg 1$, $p_X(\lfloor\lambda\rfloor) \approx \frac{1}{\sqrt{2\pi\lambda}}$ via Stirling's formula (1.13).
Most probable value ($i_{\max}$) and associated maximum probability:
• $0 < \lambda < 1$: $i_{\max} = 0$; maximum probability $e^{-\lambda}$.
• $\lambda \in \mathbb{N}$: $i_{\max} = \lambda - 1$ and $\lambda$; maximum probability $\frac{\lambda^\lambda}{\lambda!}e^{-\lambda}$.
• $\lambda \ge 1$, $\lambda \notin \mathbb{N}$: $i_{\max} = \lfloor\lambda\rfloor$; maximum probability $\frac{\lambda^{\lfloor\lambda\rfloor}}{\lfloor\lambda\rfloor!}e^{-\lambda}$.
5.15. $P[X \ge 2] = 1 - e^{-\lambda} - \lambda e^{-\lambda} = O(\lambda^2)$.
The cumulative probabilities can be found by
$P[X \le k] \overset{(*)}{=} P\left[\sum_{i=1}^{k+1} X_i > 1\right] = \frac{1}{\Gamma(k+1)}\int_\lambda^\infty e^{-t}t^k\,dt$,
$P[X > k] = P[X \ge k+1] \overset{(*)}{=} P\left[\sum_{i=1}^{k+1} X_i \le 1\right] = \frac{1}{\Gamma(k+1)}\int_0^\lambda e^{-t}t^k\,dt$,
where the $X_i$'s are i.i.d. $\mathcal{E}(\lambda)$. The equalities given by (*) are easily obtained by counting the number of events from a rate-λ Poisson process on the interval [0, 1].
5.16. Fano factor (index of dispersion): $\frac{\operatorname{Var} X}{EX} = 1$.
An important property of the Poisson and compound Poisson laws is that their classes are closed under convolution (independent summation). In particular, we have the divisibility properties (5.21) and (5.31), which are straightforward to prove from their characteristic functions.
5.17 (Recursion equations). Suppose $X \sim \mathcal{P}(\lambda)$. Let $m_k(\lambda) = E[X^k]$ and $\mu_k(\lambda) = E[(X - EX)^k]$. Then
$m_{k+1}(\lambda) = \lambda\,(m_k(\lambda) + m_k'(\lambda))$ (14)
$\mu_{k+1}(\lambda) = \lambda\,(k\mu_{k-1}(\lambda) + \mu_k'(\lambda))$ (15)
[14, p 112]. Starting with $m_1 = \lambda = \mu_2$ and $\mu_1 = 0$, the above equations lead to recursive determination of the moments $m_k$ and $\mu_k$.
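Starting from $m_1 = \lambda$, recursion (14) gives $m_2 = \lambda(\lambda + 1)$ and $m_3 = \lambda^3 + 3\lambda^2 + \lambda$; the sketch below (added for illustration; λ is an arbitrary choice) checks these against direct summation of the Poisson series:

```python
import math

lam = 2.5

def raw_moment(k, terms=200):
    """E[X^k] for X ~ P(lam), summing i^k * pmf(i) with the pmf built iteratively."""
    total, pmf = 0.0, math.exp(-lam)   # pmf = P[X = 0]
    for i in range(terms):
        total += (i ** k) * pmf
        pmf *= lam / (i + 1)           # P[X = i+1] from P[X = i]
    return total

# recursion (14): m_{k+1} = lam * (m_k + m_k'), with m_1 = lam (so m_1' = 1)
m2 = lam * (lam + 1)            # m_2' = 2*lam + 1
m3 = lam * (m2 + 2 * lam + 1)   # = lam^3 + 3*lam^2 + lam
```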
5.18. $E\left[\frac{1}{X+1}\right] = \frac{1}{\lambda}\left(1 - e^{-\lambda}\right)$. Because, for $d \in \mathbb{N}$, $Y = \frac{1}{X+1}\sum_{n=0}^d a_n X^n$ can be expressed as $\sum_{n=0}^{d-1} b_n X^n + \frac{c}{X+1}$, the value of EY is easy to find if we know the $E[X^n]$.
5.19. Mixed Poisson distribution: Let X be Poisson with mean λ. Suppose that the mean λ is chosen in accord with a probability distribution whose characteristic function is $\varphi_\Lambda$. Then,
$\varphi_X(u) = E\left[E\left[e^{iuX} \mid \Lambda\right]\right] = E\left[e^{\Lambda(e^{iu}-1)}\right] = E\left[e^{i(-i(e^{iu}-1))\Lambda}\right] = \varphi_\Lambda\left(-i\left(e^{iu}-1\right)\right)$.
• $EX = E\Lambda$.
• $\operatorname{Var} X = \operatorname{Var}\Lambda + E\Lambda$.
• $E[X^2] = E[\Lambda^2] + E\Lambda$.
• $\operatorname{Var}[X \mid \Lambda] = E[X \mid \Lambda] = \Lambda$.
• When Λ is a nonnegative integer-valued random variable, we have $G_X(z) = G_\Lambda(e^{z-1})$ and $P[X = 0] = G_\Lambda(1/e)$.
• $E[X\Lambda] = E[\Lambda^2]$.
• $\operatorname{Cov}[X, \Lambda] = \operatorname{Var}\Lambda$.
5.20. Thinned Poisson: Suppose we have $X \to [s] \to Y$ where $X \sim \mathcal{P}(\lambda)$. The box [s] is a binomial channel with success probability s. (Each arrival counted in X gets through the channel with success probability s.)
• Note that Y is in fact a random sum $\sum_{i=1}^X I_i$, where the i.i.d. $I_i$ have a Bernoulli distribution with parameter s.
• $Y \sim \mathcal{P}(s\lambda)$;
• $p(x \mid y) = e^{-\lambda(1-s)}\frac{(\lambda(1-s))^{x-y}}{(x-y)!}$, $x \ge y$ (shifted Poisson);
[Levy and Baxter, 2002]
5.21. Finite additivity: Suppose we have independent $\Lambda_i \sim \mathcal{P}(\lambda_i)$; then $\sum_{i=1}^n \Lambda_i \sim \mathcal{P}\left(\sum_{i=1}^n \lambda_i\right)$.
5.22. Raikov's theorem: Independent random variables can have their sum Poisson-distributed only if every component of the sum is Poisson-distributed.
5.23. Countable Additivity Theorem [11, p 5]: Let $(X_j : j \in \mathbb{N})$ be independent random variables, and assume that $X_j$ has the distribution $\mathcal{P}(\mu_j)$ for each j. If
$\sum_{j=1}^\infty \mu_j$ (16)
converges to μ, then $S = \sum_{j=1}^\infty X_j$ converges with probability 1, and S has distribution $\mathcal{P}(\mu)$. If, on the other hand, (16) diverges, then S diverges with probability 1.
5.24. Let $X_1, X_2, \ldots, X_n$ be independent, and let $X_j$ have distribution $\mathcal{P}(\mu_j)$ for all j. Then $S_n = \sum_{j=1}^n X_j$ has distribution $\mathcal{P}(\mu)$, with $\mu = \sum_{j=1}^n \mu_j$; and so, whenever $\sum_{j=1}^n r_j = s$,
$P[X_j = r_j\ \forall j \mid S_n = s] = \frac{s!}{r_1!\,r_2!\cdots r_n!}\prod_{j=1}^n \left(\frac{\mu_j}{\mu}\right)^{r_j}$,
which is the multinomial distribution [11, p 6–7].
• If X and Y are independent Poisson random variables with respective parameters λ and μ, then (1) $Z = X + Y$ is $\mathcal{P}(\lambda + \mu)$ and (2) conditioned on $Z = z$, X is $\mathcal{B}\left(z, \frac{\lambda}{\lambda+\mu}\right)$. So, $E[X \mid Z] = \frac{\lambda}{\lambda+\mu}Z$, $\operatorname{Var}[X \mid Z] = Z\frac{\lambda\mu}{(\lambda+\mu)^2}$, and $E[\operatorname{Var}[X \mid Z]] = \frac{\lambda\mu}{\lambda+\mu}$.
5.25. One of the reasons the Poisson distribution is important is that many natural phenomena can be modeled by Poisson processes. For example, if we consider the number of occurrences Λ during a time interval of length τ in a rate-λ homogeneous Poisson process, then $\Lambda \sim \mathcal{P}(\lambda\tau)$.
Example 5.26.
• The first use of the Poisson model is said to have been by a Prussian physician, von Bortkiewicz, who found that the annual number of late-19th-century Prussian soldiers kicked to death by horses followed a Poisson distribution [7, p 150].
• the number of photons emitted by a light source of intensity λ [photons/second] in time τ
• the number of atoms of radioactive material undergoing decay in time τ
• the number of clicks in a Geiger counter in τ seconds when the average number of clicks in 1 second is λ
• the number of dopant atoms deposited to make a small device such as an FET
• the number of customers arriving in a queue or workstations requesting service from a file server in time τ
• counts of demands for telephone connections
• the number of occurrences of rare events in time τ
• the number of soldiers kicked to death by horses
• counts of defects in a semiconductor chip
5.27. Normal Approximation to the Poisson Distribution with large λ: Let $X \sim \mathcal{P}(\lambda)$. X can be thought of as a sum of i.i.d. $X_i \sim \mathcal{P}(\lambda_n)$, i.e., $X = \sum_{i=1}^n X_i$, where $n\lambda_n = \lambda$. Hence X is approximately normal $\mathcal{N}(\lambda, \lambda)$ for λ large.
Some say that the normal approximation is good when $\lambda > 5$.
5.28. The Poisson distribution can be obtained as a limit of negative binomial distributions. Thus, the negative binomial distribution with parameters r and p can be approximated by the Poisson distribution with parameter $\lambda = \frac{rq}{p}$ (mean-matching), provided that p is "sufficiently" close to 1 and r is "sufficiently" large.
5.29. Convergence of sums of Bernoulli random variables to the Poisson law
Suppose that for each $n \in \mathbb{N}$,
$X_{n,1}, X_{n,2}, \ldots, X_{n,r_n}$
are independent; the probability space for the sequence may change with n. Such a collection is called a triangular array [1] or double sequence [8], which captures the nature of the collection when it is arranged as
$X_{1,1}, X_{1,2}, \ldots, X_{1,r_1}$
$X_{2,1}, X_{2,2}, \ldots, X_{2,r_2}$
$\vdots$
$X_{n,1}, X_{n,2}, \ldots, X_{n,r_n}$
$\vdots$
where the random variables in each row are independent. Let $S_n = X_{n,1} + X_{n,2} + \cdots + X_{n,r_n}$ be the sum of the random variables in the $n$th row.
Consider a triangular array of Bernoulli random variables $X_{n,k}$ with $P[X_{n,k} = 1] = p_{n,k}$. If $\max_{1 \le k \le r_n} p_{n,k} \to 0$ and $\sum_{k=1}^{r_n} p_{n,k} \to \lambda$ as $n \to \infty$, then the sums $S_n$ converge in distribution to the Poisson law. In other words, the Poisson distribution is the rare-events limit of the binomial (large n, small p).
As a simple special case, consider a triangular array of Bernoulli random variables $X_{n,k}$ with $P[X_{n,k} = 1] = p_n$. If $np_n \to \lambda$ as $n \to \infty$, then the sums $S_n$ converge in distribution to the Poisson law.
To show this special case directly, we bound the first i factors of the falling factorial in $\binom{n}{i}$ to get
$\frac{(n-i)^i}{i!} \le \binom{n}{i} \le \frac{n^i}{i!}$.
Using the upper bound,
$\binom{n}{i}p_n^i(1-p_n)^{n-i} \le \frac{1}{i!}\underbrace{(np_n)^i}_{\to\lambda^i}\underbrace{(1-p_n)^{-i}}_{\to 1}\underbrace{\left(1 - \frac{np_n}{n}\right)^n}_{\to e^{-\lambda}}$.
The lower bound gives the same limit because $(n-i)^i = \left(\frac{n-i}{n}\right)^i n^i$, where the first factor $\to 1$.
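The rare-events limit can also be seen numerically; the sketch below (added for illustration; λ and the n values are arbitrary choices) measures how far $\mathcal{B}(n, \lambda/n)$ is from $\mathcal{P}(\lambda)$ in (truncated) total variation as n grows:

```python
import math

lam = 3.0

def binom_pmf(n, k, p):
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k):
    return math.exp(-lam) * lam**k / math.factorial(k)

def tv_dist(n, kmax=60):
    """Total variation distance between B(n, lam/n) and P(lam), truncated at kmax."""
    p = lam / n
    return 0.5 * sum(abs(binom_pmf(n, k, p) - poisson_pmf(k))
                     for k in range(min(n, kmax) + 1))

distances = [tv_dist(n) for n in (20, 200, 2000)]
```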
5.6 Compound Poisson
Given an arbitrary probability measure μ and a positive real number λ, the compound Poisson distribution $\mathcal{CP}(\lambda, \mu)$ is the distribution of the sum $\sum_{j=1}^\Lambda V_j$, where the $V_j$ are i.i.d. with distribution μ and Λ is a $\mathcal{P}(\lambda)$ random variable, independent of the $V_j$.
Sometimes, it is written as POIS(λμ). The parameter λ is called the rate of $\mathcal{CP}(\lambda, \mu)$, and μ is called the base distribution.
5.30. The mean and variance of $\mathcal{CP}(\lambda, \mathcal{L}(V))$ are $\lambda EV$ and $\lambda E[V^2]$, respectively.
5.31. If $Z \sim \mathcal{CP}(\lambda, q)$, then $\varphi_Z(t) = e^{\lambda(\varphi_q(t)-1)}$.
5.32. Divisibility property of the compound Poisson law: Suppose we have independent $Z_i \sim \mathcal{CP}(\lambda_i, \mu^{(i)})$; then $\sum_{i=1}^n Z_i \sim \mathcal{CP}\left(\lambda, \frac{1}{\lambda}\sum_{i=1}^n \lambda_i\mu^{(i)}\right)$, where $\lambda = \sum_{i=1}^n \lambda_i$.
Proof.
$\varphi_{\sum_{i=1}^n Z_i}(t) = \prod_{i=1}^n e^{\lambda_i(\varphi_{q_i}(t)-1)} = \exp\left(\sum_{i=1}^n \lambda_i(\varphi_{q_i}(t) - 1)\right) = \exp\left(\sum_{i=1}^n \lambda_i\varphi_{q_i}(t) - \lambda\right) = \exp\left(\lambda\left(\frac{1}{\lambda}\sum_{i=1}^n \lambda_i\varphi_{q_i}(t) - 1\right)\right)$.
We usually focus on the case when μ is a discrete probability measure on $\mathbb{N} = \{1, 2, \ldots\}$. In that case, we usually refer to μ by the pmf q on $\mathbb{N}$; q is called the base pmf. Equivalently, $\mathcal{CP}(\lambda, q)$ is also the distribution of the sum $\sum_{i \in \mathbb{N}} i\Lambda_i$, where the $(\Lambda_i : i \in \mathbb{N})$ are independent with $\Lambda_i \sim \mathcal{P}(\lambda q_i)$. Note that $\sum_{i \in \mathbb{N}} \lambda q_i = \lambda$. The Poisson distribution is a special case of the compound Poisson distribution where we set q to be the point mass at 1.
5.33. The compound negative binomial [Bower, Gerber, Hickman, Jones, and Nesbitt, 1982, Ch 11] can be approximated by the compound Poisson distribution.
5.7 Hypergeometric
An urn contains N white balls and M black balls. One draws n balls without replacement, so $n \le N + M$. One gets X white balls and $n - X$ black balls.
$P[X = x] = \begin{cases} \frac{\binom{N}{x}\binom{M}{n-x}}{\binom{N+M}{n}}, & 0 \le x \le N \text{ and } 0 \le n - x \le M, \\ 0, & \text{otherwise.} \end{cases}$
5.34. The hypergeometric distributions "converge" to the binomial distribution: Assume that n is fixed, while N and M increase to +∞ with $\lim_{N,M\to\infty} \frac{N}{N+M} = p$; then
$p(x) \to \binom{n}{x}p^x(1-p)^{n-x}$ (binomial).
Note that the binomial is just drawing balls with replacement:
$p(x) = \binom{n}{x}\frac{N^x M^{n-x}}{(N+M)^n} = \binom{n}{x}\left(\frac{N}{N+M}\right)^x\left(\frac{M}{N+M}\right)^{n-x}$.
Intuitively, when N and M are large, there is not much difference between drawing n balls with or without replacement.
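The convergence in 5.34 can be illustrated numerically (a sketch added here, not from the notes; parameter choices are arbitrary), scaling N and M together with $N/(N+M) = 1/2$ held fixed:

```python
import math

def hypergeom_pmf(N, M, n, x):
    """x white from N white and n - x black from M black, drawn without replacement."""
    return math.comb(N, x) * math.comb(M, n - x) / math.comb(N + M, n)

def binom_pmf(n, x, p):
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

n, x, p = 5, 2, 0.5
errors = [abs(hypergeom_pmf(N, N, n, x) - binom_pmf(n, x, p))
          for N in (10, 100, 1000)]
```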
5.35. Extension: Suppose we have m colors and $N_i$ balls of color i. The urn contains $N = N_1 + \cdots + N_m$ balls. One draws n balls without replacement. Call $X_i$ the number of balls of color i drawn among the n balls. (Of course, $X_1 + \cdots + X_m = n$.)
$P[X_1 = x_1, \ldots, X_m = x_m] = \begin{cases} \frac{\binom{N_1}{x_1}\binom{N_2}{x_2}\cdots\binom{N_m}{x_m}}{\binom{N}{n}}, & x_1 + \cdots + x_m = n \text{ and } x_i \ge 0, \\ 0, & \text{otherwise.} \end{cases}$
5.8 Negative Binomial Distribution (Pascal / Pólya distribution)
5.36. The probability that the $r$th success occurs on the $(x+r)$th trial is
$p\binom{x+r-1}{r-1}p^{r-1}(1-p)^x = \binom{x+r-1}{r-1}p^r(1-p)^x = \binom{x+r-1}{x}p^r(1-p)^x$,
i.e., among the first $(x + r - 1)$ trials, there are $r - 1$ successes and x failures.
• Fix r.
• $\varphi_X(u) = p^r\frac{1}{(1 - (1-p)e^{iu})^r}$
• $EX = \frac{rq}{p}$, $\operatorname{Var}(X) = \frac{rq}{p^2}$, $q = 1 - p$
• Note that if we define $\binom{n}{x} \equiv \frac{n(n-1)\cdots(n-(x-1))}{x(x-1)\cdots 1}$, then
$\binom{-r}{x} \equiv (-1)^x\frac{r(r+1)\cdots(r+(x-1))}{x(x-1)\cdots 1} = (-1)^x\binom{r+x-1}{x}$.
• If independent $X_i \sim \text{NegBin}(r_i, p)$, then $\sum_i X_i \sim \text{NegBin}\left(\sum_i r_i, p\right)$. This is easy to see from the characteristic function.
• $p(x) = \frac{\Gamma(r+x)}{\Gamma(r)\,x!}p^r(1-p)^x$
5.37. A negative binomial distribution can arise as a mixture of Poisson distributions with mean distributed as a gamma distribution $\Gamma\left(q = r, \lambda = \frac{p}{1-p}\right)$.
Let X be Poisson with mean λ. Suppose that the mean λ is chosen in accord with a probability distribution $F_\Lambda(\lambda)$. Then, $\varphi_X(u) = \varphi_\Lambda(-i(e^{iu}-1))$ [see the compound Poisson distribution]. Here, $\Lambda \sim \Gamma(q, \lambda_0)$; hence, $\varphi_\Lambda(u) = \left(\frac{1}{1 - i\frac{u}{\lambda_0}}\right)^q$. So,
$\varphi_X(u) = \left(1 - \frac{e^{iu}-1}{\lambda_0}\right)^{-q} = \left(\frac{\frac{\lambda_0}{\lambda_0+1}}{1 - \frac{1}{\lambda_0+1}e^{iu}}\right)^q$,
which is negative binomial with $p = \frac{\lambda_0}{\lambda_0+1}$.
So, it is a.k.a. the Poisson-gamma distribution, or simply the compound Poisson distribution.
5.9 Beta-binomial distribution
A variable with a beta-binomial distribution is distributed as a binomial distribution with parameter p, where p is distributed with a beta distribution with parameters $q_1$ and $q_2$.
• $P(k \mid p) = \binom{n}{k}p^k(1-p)^{n-k}$.
• $f(p) = f_{\beta_{q_1,q_2}}(p) = \frac{\Gamma(q_1+q_2)}{\Gamma(q_1)\Gamma(q_2)}p^{q_1-1}(1-p)^{q_2-1}1_{(0,1)}(p)$
• pmf: $P(k) = \binom{n}{k}\frac{\Gamma(q_1+q_2)}{\Gamma(q_1)\Gamma(q_2)}\frac{\Gamma(k+q_1)\Gamma(n-k+q_2)}{\Gamma(q_1+q_2+n)} = \binom{n}{k}\frac{\beta(k+q_1,\,n-k+q_2)}{\beta(q_1, q_2)}$
• $EX = \frac{nq_1}{q_1+q_2}$
• $\operatorname{Var} X = \frac{nq_1q_2(n+q_1+q_2)}{(q_1+q_2)^2(1+q_1+q_2)}$
5.10 Zipf or zeta random variable
5.38. $P[X = k] = \frac{1}{\zeta(p)}\frac{1}{k^p}$, where $k \in \mathbb{N}$, $p > 1$, and ζ is the zeta function defined in (A.11).
5.39. $E[X^n] = \frac{\zeta(p-n)}{\zeta(p)}$ is finite for $n < p - 1$, and $E[X^n] = \infty$ for $n \ge p - 1$.
6 PDF Examples
6.1 Uniform Distribution
6.1. Characterization for uniform[a, b]:
(a) $f(x) = \frac{1}{b-a}U(x-a)U(b-x) = \begin{cases} 0, & x < a \text{ or } x > b, \\ \frac{1}{b-a}, & a \le x \le b. \end{cases}$
Table 3: Examples of probability density functions. Here, $c, \alpha, q, q_1, q_2, \sigma, \lambda$ are all strictly positive and $d > \frac{1}{2}$. $\gamma = -\psi(1) \approx 0.5772$ is the Euler constant. $\psi(z) = \frac{d}{dz}\log\Gamma(z) = (\log e)\frac{\Gamma'(z)}{\Gamma(z)}$ is the digamma function. $B(q_1, q_2) = \frac{\Gamma(q_1)\Gamma(q_2)}{\Gamma(q_1+q_2)}$ is the beta function.
• Uniform $\mathcal{U}(a, b)$: $f_X(x) = \frac{1}{b-a}1_{[a,b]}(x)$; $\varphi_X(u) = e^{iu\frac{b+a}{2}}\frac{\sin\left(u\frac{b-a}{2}\right)}{u\frac{b-a}{2}}$.
• Exponential $\mathcal{E}(\lambda)$: $f_X(x) = \lambda e^{-\lambda x}1_{[0,\infty)}(x)$; $\varphi_X(u) = \frac{\lambda}{\lambda - iu}$.
• Shifted Exponential $(\mu, s_0)$: $f_X(x) = \frac{1}{\mu - s_0}e^{-\frac{x-s_0}{\mu-s_0}}1_{[s_0,\infty)}(x)$.
• Truncated Exponential: $f_X(x) = \frac{\alpha}{e^{-\alpha a} - e^{-\alpha b}}e^{-\alpha x}1_{[a,b]}(x)$.
• Laplacian $\mathcal{L}(\alpha)$: $f_X(x) = \frac{\alpha}{2}e^{-\alpha|x|}$; $\varphi_X(u) = \frac{\alpha^2}{\alpha^2+u^2}$.
• Normal $\mathcal{N}(m, \sigma^2)$: $f_X(x) = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}\left(\frac{x-m}{\sigma}\right)^2}$; $\varphi_X(u) = e^{ium - \frac{1}{2}\sigma^2u^2}$.
• Normal $\mathcal{N}(m, \Lambda)$ (vector case): $f_X(x) = \frac{1}{(2\pi)^{n/2}\sqrt{\det\Lambda}}e^{-\frac{1}{2}(x-m)^T\Lambda^{-1}(x-m)}$; $\varphi_X(u) = e^{iu^Tm - \frac{1}{2}u^T\Lambda u}$.
• Gamma $\Gamma(q, \lambda)$: $f_X(x) = \frac{\lambda^q x^{q-1}e^{-\lambda x}}{\Gamma(q)}1_{(0,\infty)}(x)$; $\varphi_X(u) = \frac{1}{(1 - i\frac{u}{\lambda})^q}$.
• Pareto $\operatorname{Par}(\alpha)$: $f_X(x) = \alpha x^{-(\alpha+1)}1_{[1,\infty)}(x)$.
• $\operatorname{Par}(\alpha, c) = c\operatorname{Par}(\alpha)$: $f_X(x) = \frac{\alpha}{c}\left(\frac{c}{x}\right)^{\alpha+1}1_{(c,\infty)}(x)$.
• Beta $\beta(q_1, q_2)$: $f_X(x) = \frac{\Gamma(q_1+q_2)}{\Gamma(q_1)\Gamma(q_2)}x^{q_1-1}(1-x)^{q_2-1}1_{(0,1)}(x)$.
• Beta prime: $f_X(x) = \frac{\Gamma(q_1+q_2)}{\Gamma(q_1)\Gamma(q_2)}\frac{x^{q_1-1}}{(x+1)^{q_1+q_2}}1_{(0,\infty)}(x)$.
• Rayleigh: $f_X(x) = 2\alpha x e^{-\alpha x^2}1_{[0,\infty)}(x)$.
• Standard Cauchy: $f_X(x) = \frac{1}{\pi}\frac{1}{1+x^2}$.
• $\operatorname{Cau}(\alpha)$: $f_X(x) = \frac{1}{\pi}\frac{\alpha}{\alpha^2+x^2}$.
• $\operatorname{Cau}(\alpha, d)$: $f_X(x) = \frac{\Gamma(d)}{\sqrt{\pi}\,\alpha\,\Gamma\left(d-\frac{1}{2}\right)}\frac{1}{\left(1+\left(\frac{x}{\alpha}\right)^2\right)^d}$.
• Log Normal $e^{\mathcal{N}(\mu,\sigma^2)}$: $f_X(x) = \frac{1}{\sigma x\sqrt{2\pi}}e^{-\frac{1}{2}\left(\frac{\ln x - \mu}{\sigma}\right)^2}1_{(0,\infty)}(x)$.
(b) $F(x) = \begin{cases} 0, & x < a, \\ \frac{x-a}{b-a}, & a \le x \le b, \\ 1, & x > b. \end{cases}$
(c) $\varphi_X(u) = e^{iu\frac{b+a}{2}}\frac{\sin\left(u\frac{b-a}{2}\right)}{u\frac{b-a}{2}}$
(d) $M_X(s) = \frac{e^{sb} - e^{sa}}{s(b-a)}$.
6.2. For most purposes, it does not matter whether the values of the density f at the endpoints are 0 or $\frac{1}{b-a}$.
Example 6.3.
• Phase of oscillators ⇒ uniform on [−π, π] or [0, 2π]
• Phase of received signals in incoherent communications → the usual broadcast carrier phase $\phi \sim \mathcal{U}(-\pi, \pi)$
• Mobile cellular communication: multipath → path phases $\phi_c \sim \mathcal{U}(-\pi, \pi)$
• Use with caution to represent ignorance about a parameter taking values in [a, b].
6.4. $EX = \frac{a+b}{2}$, $\operatorname{Var} X = \frac{(b-a)^2}{12}$, $E[X^2] = \frac{1}{3}(b^2 + ab + a^2)$.
6.5. The product X of two independent $\mathcal{U}(0, 1)$ random variables has
$f_X(x) = -\ln(x)\,1_{[0,1]}(x)$
and
$F_X(x) = x - x\ln x$
on [0, 1]. This comes from $P[X > x] = \int_0^1\left(1 - F_U\left(\frac{x}{t}\right)\right)dt = \int_x^1\left(1 - \frac{x}{t}\right)dt$.
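A Monte Carlo check of 6.5 (an illustration added here, not part of the notes; the sample size is arbitrary): simulate products of two independent uniforms and compare the empirical cdf with $F_X(x) = x - x\ln x$:

```python
import math
import random

rng = random.Random(42)
n = 200_000
samples = [rng.random() * rng.random() for _ in range(n)]

def ecdf(x):
    """Empirical cdf of the simulated products."""
    return sum(1 for s in samples if s <= x) / n

def F(x):
    """cdf of the product of two independent U(0,1): x - x ln x on (0, 1]."""
    return x - x * math.log(x)

checks = [(x, abs(ecdf(x) - F(x))) for x in (0.1, 0.5, 0.9)]
```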
6.2 Gaussian Distribution
6.6. Gaussian distribution:
(a) Denoted by $\mathcal{N}(m, \sigma^2)$. $\mathcal{N}(0, 1)$ is the standard Gaussian (normal) distribution.
(b) $f_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma}e^{-\frac{1}{2}\left(\frac{x-m}{\sigma}\right)^2}$.
(c) $F_X(x)$ = normcdf(x,m,sigma) in MATLAB.
• The standard normal cdf is sometimes denoted by Φ(x). It inherits all properties of a cdf. Moreover, note that $\Phi(-x) = 1 - \Phi(x)$.
(d) $\varphi_X(v) = E\left[e^{jvX}\right] = e^{jmv - \frac{1}{2}v^2\sigma^2}$.
(e) $M_X(s) = e^{sm + \frac{1}{2}s^2\sigma^2}$
Figure 14: Probability density function of $X \sim \mathcal{N}(m, \sigma^2)$: about 68% of the probability lies within $[\mu - \sigma, \mu + \sigma]$ and about 95% within $[\mu - 2\sigma, \mu + 2\sigma]$.
(f) Fourier transform: $\mathcal{F}\{f_X\} = \int_{-\infty}^{\infty} f_X(x)e^{-j\omega x}dx = e^{-j\omega m - \frac{1}{2}\omega^2\sigma^2}$.
(g) $P[X > x] = P[X \ge x] = Q\left(\frac{x-m}{\sigma}\right) = 1 - \Phi\left(\frac{x-m}{\sigma}\right) = \Phi\left(-\frac{x-m}{\sigma}\right)$;
$P[X < x] = P[X \le x] = 1 - Q\left(\frac{x-m}{\sigma}\right) = Q\left(-\frac{x-m}{\sigma}\right) = \Phi\left(\frac{x-m}{\sigma}\right)$.
6.7. Properties
(a) P[|X − μ| < σ] ≈ 0.6827; P[|X − μ| > σ] ≈ 0.3173;
P[|X − μ| > 2σ] ≈ 0.0455; P[|X − μ| < 2σ] ≈ 0.9545.
(b) Moments and central moments:
(i) E[(X − μ)^k] = (k − 1)σ² E[(X − μ)^{k−2}] = 0 for k odd, and 1·3·5···(k − 1) σ^k for k even.
(ii) E[|X − μ|^k] = 2·4·6···(k − 1) σ^k √(2/π) for k odd, and 1·3·5···(k − 1) σ^k for k even.
(iii) Var[X²] = 4μ²σ² + 2σ⁴.

n:              0   1   2         3            4
E[X^n]:         1   μ   μ² + σ²   μ(μ² + 3σ²)  μ⁴ + 6μ²σ² + 3σ⁴
E[(X − μ)^n]:   1   0   σ²        0            3σ⁴

(c) For N(0, 1) and k ≥ 1,
E[X^k] = (k − 1) E[X^{k−2}] = 0 for k odd, and 1·3·5···(k − 1) for k even.
The first equality comes from integration by parts. Observe also that E[X^{2m}] = (2m)!/(2^m m!).
(d) Lévy–Cramér theorem: If the sum of two independent nonconstant random variables is normally distributed, then each of the summands is normally distributed.
• Note that ∫_{−∞}^{∞} e^{−αx²} dx = √(π/α).
6.8 (Length bound). For X ∼ N(0, 1) and any (Borel) set B,
P[X ∈ B] ≤ ∫_{−|B|/2}^{|B|/2} f_X(x) dx = 1 − 2Q(|B|/2),
where |B| is the length (Lebesgue measure) of the set B. This is because the probability is concentrated around 0. More generally, for X ∼ N(m, σ²),
P[X ∈ B] ≤ 1 − 2Q(|B|/(2σ)).
6.9 (Stein's Lemma). Let X ∼ N(μ, σ²), and let g be a differentiable function satisfying E|g′(X)| < ∞. Then
E[g(X)(X − μ)] = σ² E[g′(X)].
[2, Lemma 3.6.5 p 124]. Note that this is simply integration by parts with u = g(x) and dv = (x − μ) f_X(x) dx.
• E[(X − μ)^k] = E[(X − μ)^{k−1}(X − μ)] = σ²(k − 1) E[(X − μ)^{k−2}].
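Stein's lemma is easy to check numerically. This Python sketch (not from the notes; the choice g(x) = x², seed, and sample size are arbitrary) compares the two sides by Monte Carlo:

```python
import random

random.seed(1)
mu, sigma = 2.0, 1.5
N = 500_000

# Stein's lemma with g(x) = x^2, so g'(x) = 2x:
# E[g(X)(X - mu)] should equal sigma^2 * E[g'(X)] = 2 * sigma^2 * mu.
samples = [random.gauss(mu, sigma) for _ in range(N)]
lhs = sum(x * x * (x - mu) for x in samples) / N
rhs = 2 * sigma**2 * mu   # = 9.0 for these parameters

print(round(lhs, 1), rhs)
```

The sample average of g(X)(X − μ) lands near 2σ²μ = 9, as the lemma predicts.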
6.10. Q-function: Q(z) = ∫_z^{∞} (1/√(2π)) e^{−x²/2} dx corresponds to P[X > z] where X ∼ N(0, 1); that is, Q(z) is the probability of the "tail" of N(0, 1). The Q-function is thus a complementary cdf (ccdf).
Figure 15: Q-function of the standard normal N(0, 1).
(a) Q is a decreasing function with Q(0) = 1/2.
(b) Q(−z) = 1 − Q(z) = Φ(z).
(c) Q^{−1}(1 − Q(z)) = −z.
(d) Craig's formula: Q(x) = (1/π) ∫_0^{π/2} e^{−x²/(2 sin²θ)} dθ = (1/π) ∫_0^{π/2} e^{−x²/(2 cos²θ)} dθ, x ≥ 0.
To see this, consider X, Y i.i.d. ∼ N(0, 1). Then,
Q(z) = ∫∫_{(x,y)∈(z,∞)×R} f_{X,Y}(x, y) dx dy = 2 ∫_0^{π/2} ∫_{z/cos θ}^{∞} f_{X,Y}(r cos θ, r sin θ) r dr dθ,
where we evaluate the double integral using polar coordinates [9, Q7.22 p 322].
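Because Craig's integral has a finite range and a bounded integrand, it is convenient for numerical work. The following Python sketch (not in the notes; the midpoint rule and step count are arbitrary choices) checks it against Q(x) = (1/2) erfc(x/√2):

```python
import math

def Q(x):
    # Standard-normal tail probability via the complementary error function.
    return 0.5 * math.erfc(x / math.sqrt(2))

def craig_Q(x, n=10_000):
    # Midpoint-rule evaluation of Craig's finite-range integral.
    h = (math.pi / 2) / n
    total = 0.0
    for i in range(n):
        theta = (i + 0.5) * h
        total += math.exp(-x * x / (2 * math.sin(theta) ** 2))
    return total * h / math.pi

for x in (0.5, 1.0, 2.0):
    print(x, round(Q(x), 6), round(craig_Q(x), 6))
```

The two columns agree to roughly six decimal places at these points.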
(e) Q²(x) = (1/π) ∫_0^{π/4} e^{−x²/(2 sin²θ)} dθ.
(f) (d/dx) Q(x) = −(1/√(2π)) e^{−x²/2}.
(g) (d/dx) Q(f(x)) = −(1/√(2π)) e^{−(f(x))²/2} (d/dx) f(x).
(h) ∫ Q(f(x)) g(x) dx = Q(f(x)) ∫ g(x) dx + ∫ (1/√(2π)) e^{−(f(x))²/2} ((d/dx) f(x)) (∫_a^x g(t) dt) dx.
(i) P[X > x] = Q((x − m)/σ); P[X < x] = 1 − Q((x − m)/σ) = Q(−(x − m)/σ).
(j) Approximations:
(i) Q(z) ≈ [1/((1 − a)z + a√(z² + b))] (1/√(2π)) e^{−z²/2}, with a = 1/π and b = 2π.
(ii) (1 − 1/x²) e^{−x²/2}/(x√(2π)) ≤ Q(x) ≤ (1/2) e^{−x²/2}.
(iii) Q(z) ≈ (1/(z√(2π))) (1 − 0.7/z²) e^{−z²/2}, for z > 2.
6.11. Error function (MATLAB): erf(z) = (2/√π) ∫_0^z e^{−x²} dx = 1 − 2Q(√2 z).
(a) It is an odd function of z.
(b) For z ≥ 0, it corresponds to P[|X| < z] where X ∼ N(0, 1/2).
(c) lim_{z→∞} erf(z) = 1.
(d) erf(−z) = −erf(z).
(e) Q(z) = (1/2) erfc(z/√2) = (1/2)(1 − erf(z/√2)).
(f) Φ(x) = (1/2)(1 + erf(x/√2)) = (1/2) erfc(−x/√2).
(g) Q^{−1}(q) = √2 erfc^{−1}(2q).
(h) The complementary error function: erfc(z) = 1 − erf(z) = 2Q(√2 z) = (2/√π) ∫_z^{∞} e^{−x²} dx.
Figure 16: erf-function and Q-function.
6.3 Exponential Distribution

6.12. Denoted by Exp(λ). It is in fact Γ(1, λ). λ > 0 is a parameter of the distribution, often called the rate parameter.

6.13. Characterized by
• f_X(x) = λ e^{−λx} U(x);
• F_X(x) = (1 − e^{−λx}) U(x);
• Survival, survivor, or reliability function: P[X > x] = e^{−λx} 1_{[0,∞)}(x) + 1_{(−∞,0)}(x);
• φ_X(u) = λ/(λ − iu);
• M_X(s) = λ/(λ − s) for Re{s} < λ.

6.14. EX = σ_X = 1/λ, Var X = 1/λ².
6.15. median(X) = (1/λ) ln 2, mode(X) = 0, E[X^n] = n!/λ^n.

6.16. Coefficient of variation: CV = σ_X/EX = 1.

6.17. It is a continuous version of the geometric distribution. In fact, ⌊X⌋ ∼ G₀(e^{−λ}) and ⌈X⌉ ∼ G₁(e^{−λ}).

6.18. X ∼ Exp(λ) is simply (1/λ)X₁ where X₁ ∼ Exp(1).

6.19. Suppose X₁ ∼ Exp(1). Then E[X₁^n] = n! for n ∈ N ∪ {0}.
In general, for X ∼ Exp(λ), we have E[X^α] = (1/λ^α) Γ(α + 1) for any α > −1. In particular, for n ∈ N ∪ {0}, the moment E[X^n] = n!/λ^n.

6.20. μ₃ = E[(X − EX)³] = 2/λ³ and μ₄ = 9/λ⁴.
6.21. Hazard function: f(x)/P[X > x] = λ.

6.22. h(X) = log(e/λ).

6.23. Can be generated by X = −(1/λ) ln U where U ∼ U(0, 1).

6.24. MATLAB:
• X = exprnd(1/lambda)
• f_X(x) = exppdf(x,1/lambda)
• F_X(x) = expcdf(x,1/lambda)
6.25. Memoryless property: The exponential r.v. is the only continuous r.v. on [0, ∞) that satisfies the memoryless property:
P[X > s + x | X > s] = P[X > x]
for all x > 0 and all s > 0 [13, p 157–159]. In words, the future is independent of the past. The fact that it hasn't happened yet tells us nothing about how much longer it will take before it does happen.
• In particular, suppose we define the set B + x to be {x + b : b ∈ B}. For any x > 0 and set B ⊂ [0, ∞), we have
P[X ∈ B + x | X > x] = P[X ∈ B]
because
P[X ∈ B + x]/P[X > x] = (∫_{B+x} λe^{−λt} dt)/e^{−λx} = (substituting τ = t − x) (∫_B λe^{−λ(τ+x)} dτ)/e^{−λx} = ∫_B λe^{−λτ} dτ.
6.26. The difference of two independent Exp(λ) random variables is L(λ).

6.27. Consider independent X_i ∼ Exp(λ). Let S_n = Σ_{i=1}^n X_i.
(a) S_n ∼ Γ(n, λ); i.e., it has an n-Erlang distribution.
(b) Let N = inf{n : S_n ≥ s}. Then N ∼ Λ + 1 where Λ ∼ Poisson(λs).
6.28. If X_i ∼ Exp(λ_i) are independent, then
(a) min_i X_i ∼ Exp(Σ_i λ_i).
Recall order statistics. Let Y₁ = min_i X_i.
(b) P[arg min_i X_i = j] = λ_j / Σ_i λ_i. Note that this is ∫_0^∞ f_{X_j}(t) Π_{i≠j} P[X_i > t] dt.
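The "which exponential fires first" probabilities in 6.28(b) are easy to simulate. A Python sketch (not from the notes; rates, seed, and trial count are arbitrary choices):

```python
import random

random.seed(3)
rates = [1.0, 2.0, 3.0]   # lambda_i for three independent exponentials
N = 100_000

wins = [0, 0, 0]
for _ in range(N):
    draws = [random.expovariate(lam) for lam in rates]
    wins[draws.index(min(draws))] += 1

# Theoretical P[argmin = j] = lambda_j / sum(lambda_i) = 1/6, 2/6, 3/6.
print([round(w / N, 3) for w in wins])
```

The empirical frequencies track λ_j/Σλ_i = 1/6, 1/3, 1/2.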
6.29. If S_i, T_i i.i.d. ∼ Exp(α), then
P[Σ_{i=1}^m S_i > Σ_{j=1}^n T_j] = Σ_{i=0}^{m−1} C(n + m − 1, i) (1/2)^{n+m−1} = Σ_{i=0}^{m−1} C(n + i − 1, i) (1/2)^{n+i},
where C(·, ·) denotes the binomial coefficient. Note that we can set up two Poisson processes. Consider the superposed process. We want the n-th arrival from the T process to come before the m-th one of the S process.
6.4 Pareto: Par(α) — heavy-tailed model/density

6.30. Characterizations: Fix α > 0.
(a) f(x) = α x^{−α−1} U(x − 1)
(b) F(x) = (1 − (1/x)^α) U(x − 1); that is, F(x) = 0 for x < 1 and 1 − (1/x)^α for x ≥ 1.

Example 6.31.
• distribution of wealth
• flood heights of the Nile river
• designing dam height
• (discrete) sizes of files requested by web users
• waiting times between successive keystrokes at computer terminals
• (discrete) sizes of files stored on Unix system file servers
• running times for NP-hard problems as a function of certain parameters
6.5 Laplacian: L(α)

6.32. Characterization: α > 0.
(a) Also known as Laplace or double exponential.
(b) f(x) = (α/2) e^{−α|x|}
(c) F(x) = (1/2) e^{αx} for x < 0, and 1 − (1/2) e^{−αx} for x ≥ 0
(d) φ_X(u) = α²/(α² + u²)
(e) M_X(s) = α²/(α² − s²), for −α < Re{s} < α.

6.33. EX = 0, Var X = 2/α².
Example 6.34.
• amplitudes of speech signals
• amplitudes of differences of intensities between adjacent pixels in an image
• If X and Y are independent Exp(λ), then X − Y is L(λ). (Easy proof via ch.f.)
6.6 Rayleigh

6.35. Characterizations:
(a) F(x) = (1 − e^{−αx²}) U(x)
(b) f(x) = 2αx e^{−αx²} U(x)
(c) P[X > t] = 1 − F(t) = e^{−αt²} for t ≥ 0, and 1 for t < 0
(d) Use √(−2σ² ln U) to generate a Rayleigh(1/(2σ²)) sample from U ∼ U(0, 1).

6.36. Read "RAY-lee".

Example 6.37.
• noise X at the output of an AM envelope detector when no signal is present

6.38. Relationship with other distributions
(a) Let X be a Rayleigh(α) r.v.; then Y = X² is Exp(α). Hence √· maps Exp(α) to Rayleigh(α), and (·)² maps Rayleigh(α) back to Exp(α). (17)
(b) Suppose X, Y i.i.d. ∼ N(0, σ²). Then R = √(X² + Y²) has a Rayleigh distribution with density
f_R(r) = 2r (1/(2σ²)) e^{−r²/(2σ²)}.
• Note that X², Y² i.i.d. ∼ Γ(1/2, 1/(2σ²)). Hence X² + Y² ∼ Γ(1, 1/(2σ²)), i.e., exponential. By (17), √(X² + Y²) is a Rayleigh r.v. with α = 1/(2σ²).
• Alternatively, transform from Cartesian to polar coordinates, (x, y) → (r, θ):
f_{R,Θ}(r, θ) = r f_{X,Y}(r cos θ, r sin θ) = r · (1/(σ√(2π))) e^{−(1/2)(r cos θ/σ)²} · (1/(σ√(2π))) e^{−(1/2)(r sin θ/σ)²} = (1/(2π)) · 2r (1/(2σ²)) e^{−r²/(2σ²)}.
Hence the radius R and the angle Θ are independent, with R having a Rayleigh distribution and Θ uniformly distributed on the interval (0, 2π).
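The Gaussian-to-Rayleigh connection above can be verified by simulation. A Python sketch (not from the notes; σ, the threshold t, the seed, and the sample size are arbitrary):

```python
import random
import math

random.seed(4)
sigma = 2.0
N = 200_000

# R = sqrt(X^2 + Y^2) for X, Y i.i.d. N(0, sigma^2) is Rayleigh with
# alpha = 1/(2 sigma^2), so P[R > t] = exp(-t^2 / (2 sigma^2)).
t = 3.0
count = 0
for _ in range(N):
    x, y = random.gauss(0, sigma), random.gauss(0, sigma)
    if math.hypot(x, y) > t:
        count += 1

empirical = count / N
theoretical = math.exp(-t * t / (2 * sigma * sigma))
print(round(empirical, 3), round(theoretical, 3))
```

Both numbers sit near e^{−9/8} ≈ 0.325.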
6.7 Cauchy

6.39. Characterizations: Fix α > 0.
(a) f_X(x) = (α/π) · 1/(α² + x²)
(b) F_X(x) = (1/π) tan^{−1}(x/α) + 1/2.
(c) φ_X(u) = e^{−α|u|}
Note that Cau(α) is simply αX₁ where X₁ ∼ Cau(1). Also, because f_X is even, f_{−X} = f_X, and thus −X is still ∼ Cau(α).
6.40. Odd moments are not defined; even moments are infinite.
• Because the first moment is not defined, central moments, including the variance, are not defined.
• Mean and variance do not exist.
• Note that even though the pdf of the Cauchy distribution is an even function, the mean is not 0; it simply does not exist.

6.41. Suppose the X_i are independent Cau(α_i). Then Σ_i a_i X_i ∼ Cau(Σ_i |a_i| α_i).

6.42. Suppose X ∼ Cau(α). Then 1/X ∼ Cau(1/α).
6.8 Weibull

6.43. For λ > 0 and p > 0, the Weibull(p, λ) distribution [9] is characterized by
(a) X = (Y/λ)^{1/p} where Y ∼ Exp(1).
(b) f_X(x) = λp x^{p−1} e^{−λx^p}, x > 0
(c) F_X(x) = 1 − e^{−λx^p}, x > 0
(d) E[X^n] = Γ(1 + n/p)/λ^{n/p}.
7 Expectation

Consider a probability space (Ω, A, P).

7.1. Let X⁺ = max(X, 0) and X⁻ = −min(X, 0) = max(−X, 0). Then X = X⁺ − X⁻, and X⁺, X⁻ are nonnegative r.v.'s. Also, |X| = X⁺ + X⁻.

7.2. A random variable X is integrable if and only if
≡ X has a finite expectation
≡ both E[X⁺] and E[X⁻] are finite
≡ E|X| is finite
≡ EX is finite ≡ EX is defined
≡ X ∈ L¹
≡ |X| ∈ L¹
In which case,
EX = E[X⁺] − E[X⁻] = ∫ X(ω) P(dω) = ∫ X dP = ∫ x dP_X(x) = ∫ x P_X(dx)
and
∫_A X dP = E[1_A X].

Definition 7.3. A r.v. X admits (has) an expectation if E[X⁺] and E[X⁻] are not both equal to +∞. Then the expectation of X is still given by EX = E[X⁺] − E[X⁻], with the conventions +∞ + a = +∞ and −∞ + a = −∞ when a ∈ R.
7.4. L¹ = L¹(Ω, A, P) = the set of all integrable random variables.

7.5. For 1 ≤ p < ∞, the following are equivalent:
(a) (E[|X|^p])^{1/p} = 0
(b) E[|X|^p] = 0
(c) X = 0 a.s.
7.6. X = Y a.s. ⇒ EX = EY.

7.7. E[1_B(X)] = P(X^{−1}(B)) = P_X(B) = P[X ∈ B]
• F_X(x) = E[1_{(−∞,x]}(X)]

7.8. Expectation rule: Let X be a r.v. on (Ω, A, P), with values in (E, ℰ) and distribution P_X. Let h : (E, ℰ) → (R, B) be measurable. If
• X ≥ 0, or
• h(X) ∈ L¹(Ω, A, P), which is equivalent to h ∈ L¹(E, ℰ, P_X),
then
• E[h(X)] = ∫ h(X(ω)) P(dω) = ∫ h(x) P_X(dx)
• ∫_{[X∈G]} h(X(ω)) P(dω) = ∫_G h(x) P_X(dx)
7.9. Expectation of an absolutely continuous random variable: Suppose X has density f_X. Then h is P_X-integrable if and only if h·f_X is integrable w.r.t. Lebesgue measure. In which case,
E[h(X)] = ∫ h(x) P_X(dx) = ∫ h(x) f_X(x) dx
and
∫_G h(x) P_X(dx) = ∫_G h(x) f_X(x) dx.
• Caution: Suppose h is an odd function and f_X is an even function. We cannot conclude that E[h(X)] = 0. One obvious odd function h is h(x) = x. For example, in (6.40), when X is Cauchy, the expectation does not exist even though the pdf is an even function. Of course, if we also know that h(X) is integrable, then E[h(X)] is 0.

Expectation of a discrete random variable: Suppose X is a discrete random variable. Then
EX = Σ_x x P[X = x]
and
E[g(X)] = Σ_x g(x) P[X = x].
Similarly,
E[g(X, Y)] = Σ_x Σ_y g(x, y) P[X = x, Y = y].
These are called the law/rule of the lazy statistician (LOTUS) [21, Thm 3.6 p 48], [9, p 149].
7.10. ∫_E P[X ≥ t] dt = ∫_E P[X > t] dt and ∫_E P[X ≤ t] dt = ∫_E P[X < t] dt, since the integrands differ only at the (at most countably many) points t with P[X = t] > 0, a set of Lebesgue measure zero.
7.11. Expectation and cdf:
(a) For nonnegative X,
EX = ∫_0^∞ P[X > y] dy = ∫_0^∞ (1 − F_X(y)) dy = ∫_0^∞ P[X ≥ y] dy. (18)
For p > 0,
E[X^p] = ∫_0^∞ p x^{p−1} P[X > x] dx.
(b) For integrable X,
EX = ∫_0^∞ (1 − F_X(x)) dx − ∫_{−∞}^0 F_X(x) dx.
(c) For nonnegative integer-valued X, EX = Σ_{k=0}^∞ P[X > k] = Σ_{k=1}^∞ P[X ≥ k].
Definition 7.12.
(a) Absolute moment: E[|X|^k] = ∫ |x|^k P_X(dx), where we define E[|X|⁰] = 1.
(b) Moment: If E[|X|^k] < ∞, then m_k = E[X^k] = ∫ x^k P_X(dx) is the k-th moment of X.
(c) Variance: If E[X²] < ∞, then we define
Var X = E[(X − EX)²] = ∫ (x − EX)² P_X(dx) = E[X²] − (EX)² = E[X(X − EX)].
• Notation: D_X, or σ²(X), or σ²_X, or VX [21, p 51].
• If X_i ∈ L², then Σ_{i=1}^k X_i ∈ L², Var(Σ_{i=1}^k X_i) exists, and E[Σ_{i=1}^k X_i] = Σ_{i=1}^k EX_i.
• Suppose E[X²] < ∞. If Var X = 0, then X = EX a.s.
(d) Standard deviation: σ_X = √(Var[X]).
(e) Coefficient of variation: CV_X = σ_X/EX.
(f) Fano factor (index of dispersion): Var X / EX.
(g) Central moments: the n-th central moment is μ_n = E[(X − EX)^n].
(i) μ₁ = E[X − EX] = 0.
(ii) μ₂ = σ²_X = Var X.
(iii) μ_n = Σ_{k=0}^n C(n, k) m_{n−k} (−m₁)^k.
(iv) m_n = Σ_{k=0}^n C(n, k) μ_{n−k} m₁^k.
(h) Skewness coefficient: γ_X = μ₃/σ³_X.
(i) It describes the deviation of the distribution from a symmetric shape (around the mean).
(ii) It is 0 for any symmetric distribution.
(i) Kurtosis: κ_X = μ₄/σ⁴_X.
• κ_X = 3 for the Gaussian distribution.
(j) Excess coefficient: ε_X = κ_X − 3 = μ₄/σ⁴_X − 3.
(k) Cumulants or semi-invariants: For one variable, γ_k = (1/j^k) (∂^k/∂v^k) ln(φ_X(v)) |_{v=0}.
(i) γ₁ = EX = m₁,
γ₂ = E[X − EX]² = m₂ − m₁² = μ₂,
γ₃ = E[X − EX]³ = m₃ − 3m₁m₂ + 2m₁³ = μ₃,
γ₄ = m₄ − 3m₂² − 4m₁m₃ + 12m₁²m₂ − 6m₁⁴ = μ₄ − 3μ₂².
(ii) m₁ = γ₁,
m₂ = γ₂ + γ₁²,
m₃ = γ₃ + 3γ₁γ₂ + γ₁³,
m₄ = γ₄ + 3γ₂² + 4γ₁γ₃ + 6γ₁²γ₂ + γ₁⁴.
Figure 17: Product-moments of random variables [20]. For each moment order, the table lists the measure it captures, its definition for continuous and discrete variables, and the usual sample estimator:

• First (central location): mean (expected value) μ_x = E(X); continuous μ_x = ∫_{−∞}^{∞} x f_x(x) dx; discrete μ_x = Σ_{all x_k} x_k p(x_k); sample estimator x̄ = (Σ x_i)/n.
• Second (dispersion): variance Var(X) = μ₂ = σ²_x; continuous σ²_x = ∫_{−∞}^{∞} (x − μ_x)² f_x(x) dx; discrete σ²_x = Σ_{all x_k} (x_k − μ_x)² p_x(x_k); sample s² = (1/(n−1)) Σ (x_i − x̄)². Standard deviation σ_x = √(Var(X)); sample s = √((1/(n−1)) Σ (x_i − x̄)²). Coefficient of variation σ_x/μ_x; sample C_v = s/x̄.
• Third (asymmetry): skewness μ₃; continuous μ₃ = ∫_{−∞}^{∞} (x − μ_x)³ f_x(x) dx; discrete μ₃ = Σ_{all x_k} (x_k − μ_x)³ p_x(x_k); sample m₃ = (n/((n−1)(n−2))) Σ (x_i − x̄)³. Skewness coefficient γ_x = μ₃/σ³_x; sample g = m₃/s³.
• Fourth (peakedness): kurtosis μ₄; continuous μ₄ = ∫_{−∞}^{∞} (x − μ_x)⁴ f_x(x) dx; discrete μ₄ = Σ_{all x_k} (x_k − μ_x)⁴ p_x(x_k); sample m₄ = (n(n+1)/((n−1)(n−2)(n−3))) Σ (x_i − x̄)⁴. Kurtosis coefficient κ_x = μ₄/σ⁴_x; sample k = m₄/s⁴. Excess coefficient ε_x = κ_x − 3.
7.13.
• For c ∈ R, E[c] = c.
• E[·] is a linear operator: E[aX + bY] = aEX + bEY.
• In general, Var[·] is not a linear operator.

7.14. All pairs of mean and variance are possible. A random variable X with EX = m and Var X = σ² can be constructed by setting P[X = m − σ] = P[X = m + σ] = 1/2.
Definition 7.15.
• Correlation between X and Y: E[XY].
Model                 E[X]                              Var[X]
U{0, 1, ..., n−1}     (n−1)/2                           (n²−1)/12
B(n, p)               np                                np(1−p)
G(β)                  β/(1−β)                           β/(1−β)²
Poisson(λ)            λ                                 λ
U(a, b)               (a+b)/2                           (b−a)²/12
Exp(λ)                1/λ                               1/λ²
Par(α)                α/(α−1) if α > 1; ∞ if 0 < α ≤ 1  undefined if 0 < α < 1; ∞ if 1 < α < 2; α/((α−2)(α−1)²) if α > 2
L(α)                  0                                 2/α²
N(m, σ²)              m                                 σ²
N(m, Λ)               m                                 Λ = [Cov[X_i, X_j]]
Γ(q, λ)               q/λ                               q/λ²

Table 4: Expectations and Variances
• Covariance between X and Y:
Cov[X, Y] = E[(X − EX)(Y − EY)] = E[XY] − EX·EY = E[X(Y − EY)] = E[Y(X − EX)].
• X and Y are said to be uncorrelated if and only if Cov[X, Y] = 0
≡ E[XY] = EX·EY.
• X and Y are said to be orthogonal if E[XY] = 0.
• Correlation coefficient (normalized covariance):
ρ_{XY} = Cov[X, Y]/(σ_X σ_Y) = E[((X − EX)/σ_X)((Y − EY)/σ_Y)] = (E[XY] − EX·EY)/(σ_X σ_Y).
7.16. Properties
(a) Var X = Cov[X, X]; ρ_{X,X} = 1.
(b) Var[aX] = a² Var X; σ_{aX} = |a| σ_X.
(c) If X and Y are independent, then Cov[X, Y] = 0. The converse is not true.
(d) ρ_{XY} ∈ [−1, 1].
(e) By the Cauchy–Schwarz inequality, (Cov[X, Y])² ≤ σ²_X σ²_Y, with equality if and only if σ²_Y (X − EX)² = σ²_X (Y − EY)² a.s.
• This implies |ρ_{X,Y}| ≤ 1.
When σ_Y, σ_X > 0, equality occurs if and only if the following conditions hold:
≡ ∃a ≠ 0 such that (X − EX) = a(Y − EY)
≡ ∃c ≠ 0 and b ∈ R such that Y = cX + b
≡ ∃a ≠ 0 and b ∈ R such that X = aY + b
≡ |ρ_{XY}| = 1
In which case, |a| = σ_X/σ_Y and ρ_{XY} = a/|a| = sgn a. Hence, ρ_{XY} is used to quantify linear dependence between X and Y. The closer |ρ_{XY}| is to 1, the higher the degree of linear dependence between X and Y.
(f) Linearity:
(i) Let Y_i = a_i X_i + b_i.
i. Cov[Y₁, Y₂] = Cov[a₁X₁ + b₁, a₂X₂ + b₂] = a₁a₂ Cov[X₁, X₂].
ii. ρ is preserved under linear transformation up to sign: ρ_{Y₁,Y₂} = sgn(a₁a₂) ρ_{X₁,X₂}.
(ii) Cov[a₁X + b₁, a₂X + b₂] = a₁a₂ Var X.
(iii) ρ_{a₁X+b₁, a₂X+b₂} = sgn(a₁a₂). In particular, if Y = aX + b with a > 0, then ρ_{X,Y} = 1.
(g) ρ_{X,Y} = 0 if and only if X and Y are uncorrelated.
(h) When EX = 0 or EY = 0, orthogonality is equivalent to uncorrelatedness.
(i) For a finite index set I,
Var(Σ_{i∈I} a_i X_i) = Σ_{i∈I} a_i² Var X_i + 2 Σ_{i<j} a_i a_j Cov[X_i, X_j].
In particular,
Var(X + Y) = Var X + Var Y + 2 Cov[X, Y]
and
Var(X − Y) = Var X + Var Y − 2 Cov[X, Y].
(j) For finite index sets I and J,
Cov[Σ_{i∈I} a_i X_i, Σ_{j∈J} b_j Y_j] = Σ_{i∈I} Σ_{j∈J} a_i b_j Cov[X_i, Y_j].
(k) Covariance inequality: Let X be any random variable and g and h any functions such that E[g(X)], E[h(X)], and E[g(X)h(X)] exist.
• If g and h are either both nondecreasing or both nonincreasing, then
Cov[g(X), h(X)] ≥ 0. (19)
• If g is nondecreasing and h is nonincreasing, then
Cov[g(X), h(X)] ≤ 0. (20)
See also [2, p 191–192] and (8.13).
(l) Being uncorrelated does not imply independence.
• Discrete: Suppose p_X is an even function with p_X(0) = 0. Let Y = g(X), where g is also an even function. Then x g(x) p_X(x) is odd in x, so E[XY] = 0 = E[X]E[Y], and hence Cov[X, Y] = 0. Consider a point x₀ such that p_X(x₀) > 0. Then p_{X,Y}(x₀, g(x₀)) = p_X(x₀), so we only need to show that p_Y(g(x₀)) ≠ 1 to conclude that X and Y are not independent. For example, let X be uniform on {±1, ±2} and Y = |X|; consider the point x₀ = 1.
• Continuous: Let Θ be uniform on an interval of length 2π. Set X = cos Θ and Y = sin Θ. See (11.6).
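The discrete example above can be worked out exactly. A Python sketch (not from the notes; it just enumerates the four-point distribution with exact rationals):

```python
from fractions import Fraction

# X uniform on {-2, -1, 1, 2}, Y = |X|: uncorrelated but not independent.
support = [-2, -1, 1, 2]
p = Fraction(1, 4)

EX = sum(x * p for x in support)            # 0
EY = sum(abs(x) * p for x in support)       # 3/2
EXY = sum(x * abs(x) * p for x in support)  # 0
cov = EXY - EX * EY
print(cov)                                  # 0 -> uncorrelated

# Dependence: P[X=1, Y=1] = 1/4, but P[X=1] * P[Y=1] = 1/4 * 1/2 = 1/8.
joint = p
prod = p * sum(p for x in support if abs(x) == 1)
print(joint, prod)
```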
7.17. See (4.48) for relationships between expectation and independence.
Example 7.18 (Martingale betting strategy). Fix a > 0. Suppose X₀, X₁, X₂, . . . are independent random variables with P[X_i = 1] = p and P[X_i = 0] = 1 − p. Let N = inf{i : X_i = 1}. Also define
L(N) = 0 for N = 0, and L(N) = a Σ_{i=0}^{N−1} r^i for N ∈ N,
and G(N) = a r^N − L(N). To have G(N) > 0, we need Σ_{i=0}^{k−1} r^i < r^k for all k ∈ N, which turns out to require r ≥ 2. In fact, for r ≥ 2, we have G(N) ≥ a for all N ∈ N ∪ {0}. Hence E[G(N)] ≥ a. It is exactly a when r = 2.
Now, E[L(N)] = a Σ_{n=1}^∞ ((r^n − 1)/(r − 1)) p(1 − p)^n = ∞ if and only if r(1 − p) ≥ 1. When 1 − p ≥ 1/2, because we already have r ≥ 2, it is true that r(1 − p) ≥ 1, so the expected loss is infinite.
8 Inequalities

8.1. Let (A_i : i ∈ I) be a finite family of events. Then
(Σ_i P(A_i))² / (Σ_i Σ_j P(A_i ∩ A_j)) ≤ P(∪_i A_i) ≤ Σ_i P(A_i).
8.2. [19, p 14]
−(1 − 1/n)^n ≤ P(∩_{i=1}^n A_i) − Π_{i=1}^n P(A_i) ≤ (n − 1) n^{−n/(n−1)},
where the lower bound decreases to −1/e ≈ −0.37 and the upper bound tends to 1 as n → ∞. See Figure 18.
• |P(A₁ ∩ A₂) − P(A₁)P(A₂)| ≤ 1/4.
Figure 18: Bounds for P(∩_{i=1}^n A_i) − Π_{i=1}^n P(A_i) as functions of n.
8.3. Markov's inequality: P[|X| ≥ a] ≤ (1/a) E|X|, a > 0.

Figure 19: Proof of Markov's inequality.

(a) Useless when a ≤ E|X|. Hence, it is good for bounding the "tails" of a distribution.
(b) Remark: P[|X| > a] ≤ P[|X| ≥ a].
(c) P[|X| ≥ a E|X|] ≤ 1/a, a > 0.
(d) Suppose g is a nonnegative function. Then, for all α > 0 and p > 0, we have
(i) P[g(X) ≥ α] ≤ (1/α^p) E[(g(X))^p]
(ii) P[g(X − EX) ≥ α] ≤ (1/α^p) E[(g(X − EX))^p]
(e) Chebyshev's inequality: P[|X| > a] ≤ P[|X| ≥ a] ≤ (1/a²) E[X²], a > 0.
(i) P[|X| ≥ α] ≤ (1/α^p) E[|X|^p]
(ii) P[|X − EX| ≥ α] ≤ σ²_X/α²; that is, P[|X − EX| ≥ nσ_X] ≤ 1/n².
• Useful only when α > σ_X.
(iii) For a < b, P[a ≤ X ≤ b] ≥ 1 − (4/(b − a)²)[σ²_X + (EX − (a + b)/2)²].
(f) One-sided Chebyshev inequalities: If X ∈ L², then for a > 0,
(i) If EX = 0, P[X ≥ a] ≤ E[X²]/(E[X²] + a²).
(ii) For general X,
i. P[X ≥ EX + a] ≤ σ²_X/(σ²_X + a²); that is, P[X ≥ EX + nσ_X] ≤ 1/(1 + n²).
ii. P[X ≤ EX − a] ≤ σ²_X/(σ²_X + a²); that is, P[X ≤ EX − nσ_X] ≤ 1/(1 + n²).
iii. P[|X − EX| ≥ a] ≤ 2σ²_X/(σ²_X + a²); that is, P[|X − EX| ≥ nσ_X] ≤ 2/(1 + n²). This is a better bound than σ²_X/a² iff σ_X > a.
(g) Chernoff bounds:
(i) P[X ≤ b] ≤ E[e^{−θX}]/e^{−θb} for all θ > 0.
(ii) P[X ≥ b] ≤ E[e^{θX}]/e^{θb} for all θ > 0.
These can be optimized over θ.

8.4. Suppose |X| ≤ M a.s. Then P[|X| ≥ a] ≥ (E|X| − a)/(M − a) for all a ∈ [0, M).

8.5. If X ≥ 0 and E[X²] < ∞, then P[X > 0] ≥ (EX)²/E[X²].
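To get a feel for how loose these bounds can be, compare them against an exact tail. A Python sketch (not from the notes; the Exp(1) example and the test points are arbitrary choices):

```python
import math

# For X ~ Exp(1): EX = 1 and Var X = 1. Markov gives P[X >= a] <= 1/a;
# Chebyshev applied to |X - 1| >= a - 1 gives P[X >= a] <= 1/(a-1)^2.
# Both must dominate the exact tail exp(-a).
rows = []
for a in (2.0, 3.0, 5.0):
    exact = math.exp(-a)
    markov = 1 / a
    chebyshev = 1 / (a - 1) ** 2
    rows.append((a, exact, markov, chebyshev))
    print(a, round(exact, 4), round(markov, 4), round(chebyshev, 4))
```

Both bounds hold but overshoot the true tail by a wide margin, which is why the exponential-moment (Chernoff) bound is preferred when the MGF is available.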
Definition 8.6. If p and q are positive real numbers such that p + q = pq, or equivalently, 1/p + 1/q = 1, then we call p and q a pair of conjugate exponents.
• 1 < p, q < ∞.
• As p → 1, q → ∞. Consequently, 1 and ∞ are also regarded as a pair of conjugate exponents.
8.7. Hölder's inequality: Suppose X ∈ L^p, Y ∈ L^q, p > 1, 1/p + 1/q = 1. Then
(a) XY ∈ L¹;
(b) E[|XY|] ≤ (E[|X|^p])^{1/p} (E[|Y|^q])^{1/q}, with equality if and only if
E[|Y|^q] |X(ω)|^p = E[|X|^p] |Y(ω)|^q a.s.
8.8. Cauchy–Bunyakovskii–Schwarz inequality: If X, Y ∈ L², then XY ∈ L¹ and
|E[XY]| ≤ E[|XY|] ≤ (E[|X|²])^{1/2} (E[|Y|²])^{1/2},
or equivalently,
(E[XY])² ≤ E[X²] E[Y²],
with equality if and only if E[Y²] X² = E[X²] Y² a.s.
(a) |Cov(X, Y)| ≤ σ_X σ_Y.
(b) (EX)² ≤ E[X²].
(c) (P(A ∩ B))² ≤ P(A)P(B).
8.9. Minkowski's inequality: For p ≥ 1 and X, Y ∈ L^p, we have X + Y ∈ L^p and
(E[|X + Y|^p])^{1/p} ≤ (E[|X|^p])^{1/p} + (E[|Y|^p])^{1/p}.
8.10. For p > q > 0:
(a) E[|X|^q] ≤ 1 + E[|X|^p].
(b) Lyapounov's inequality: (E[|X|^q])^{1/q} ≤ (E[|X|^p])^{1/p}.
• E[|X|] ≤ (E[|X|²])^{1/2} ≤ (E[|X|³])^{1/3} ≤ · · ·
8.11. Jensen's inequality: For a random variable X, if 1) X ∈ L¹ (and φ(X) ∈ L¹); 2) X ∈ (a, b) a.s.; and 3) φ is convex on (a, b), then φ(EX) ≤ E[φ(X)].
• For X > 0 (a.s.), E[1/X] ≥ 1/EX.
8.12.
• For p ∈ (0, 1], E[|X + Y|^p] ≤ E[|X|^p] + E[|Y|^p].
• For p ≥ 1, E[|X + Y|^p] ≤ 2^{p−1}(E[|X|^p] + E[|Y|^p]).

8.13 (Covariance inequality). Let X be any random variable and g and h any functions such that E[g(X)], E[h(X)], and E[g(X)h(X)] exist.
• If g and h are either both nondecreasing or both nonincreasing, then
E[g(X)h(X)] ≥ E[g(X)] E[h(X)].
In particular, for nondecreasing g, E[g(X)(X − EX)] ≥ 0.
• If g is nondecreasing and h is nonincreasing, then
E[g(X)h(X)] ≤ E[g(X)] E[h(X)].
See also (19), (20), and [2, p 191–192].
9 Random Vectors

In this article, a vector is a column matrix with dimension n × 1 for some n ∈ N. We use 1 to denote a vector with all elements equal to 1. Note that 11^T is a square matrix with all elements equal to 1. Finally, for any matrix A and constant a, we define the matrix A + a to be the matrix A with a added to each of its components. If A is a square matrix, then A + a = A + a·11^T.
Definition 9.1. Suppose I is an index set. When the X_i are random variables, we define a random vector X_I by X_I = (X_i : i ∈ I). For example, if I = [n], we have X_I = (X₁, X₂, . . . , X_n). Note also that X_{[n]} is usually denoted by X₁ⁿ. Sometimes we simply write X to denote X₁ⁿ.
• For disjoint A, B, X_{A∪B} = (X_A, X_B).
• For vectors x_I, y_I, we say x ≤ y if x_i ≤ y_i for all i ∈ I [7, p 206].
• When the dimension of X is implicit, we simply write X and x to represent X₁ⁿ and x₁ⁿ, respectively.
• For random vectors X, Y, we use (X, Y) to represent the stacked random vector (X^T Y^T)^T.
Definition 9.2. A half-open cell or bounded rectangle in R^k is a set of the form I_{a,b} = {x : a_i < x_i ≤ b_i, ∀i ∈ [k]} = Π_{i=1}^k (a_i, b_i]. For a real function F on R^k, the difference of F around the vertices of I_{a,b} is
Δ_{I_{a,b}} F = Σ_v sgn_{I_{a,b}}(v) F(v) = Σ_v (−1)^{|{i : v_i = a_i}|} F(v), (21)
where the sum extends over the 2^k vertices v of I_{a,b}. (The i-th coordinate of the vertex v can be either a_i or b_i.) In particular, for k = 2, we have
Δ_{I_{a,b}} F = F(b₁, b₂) − F(a₁, b₂) − F(b₁, a₂) + F(a₁, a₂).
9.3 (Joint cdf).
F_X(x) = F_{X₁^k}(x₁^k) = P[X₁ ≤ x₁, . . . , X_k ≤ x_k] = P[X ∈ S_x] = P_X(S_x),
where S_x = {y : y_i ≤ x_i, i = 1, . . . , k} consists of the points "southwest" of x.
• Δ_{I_{a,b}} F_X ≥ 0.
• The set S_x is an orthant-like or semi-infinite corner with "northeast" vertex (vertex in the direction of the first orthant) specified by the point x [7, p 206].
C1 F_X is nondecreasing in each variable: if y_i ≥ x_i for all i, then F_X(y) ≥ F_X(x).
C2 F_X is continuous from above: lim_{h↓0} F_X(x₁ + h, . . . , x_k + h) = F_X(x).
C3 If x_i → −∞ for some i (the other coordinates held fixed), then F_X(x) → 0. If x_i → ∞ for all i, then F_X(x) → 1.
• lim_{h↓0} F_X(x₁ − h, . . . , x_k − h) = P_X(S°_x), where S°_x = {y : y_i < x_i, i = 1, . . . , k} is the interior of S_x.
• Given a ≤ b, P[X ∈ I_{a,b}] = Δ_{I_{a,b}} F_X.
This comes from (9) with A_i = [X_i ≤ a_i] and B = [∀i, X_i ≤ b_i]. Note that P[∩_{i∈I} A_i ∩ B] = F(v), where v_i = a_i for i ∈ I and v_i = b_i otherwise.
• For any function F on R^k which satisfies (C1), (C2), and (C3), there is a unique probability measure μ on B_{R^k} such that for all a, b ∈ R^k with a ≤ b, we have μ(I_{a,b}) = Δ_{I_{a,b}} F (and, for all x ∈ R^k, μ(S_x) = F(x)).
• TFAE:
(a) F_X is continuous at x
(b) F_X is continuous from below
(c) F_X(x) = P_X(S°_x)
(d) P_X(S_x) = P_X(S°_x)
(e) P_X(∂S_x) = 0, where ∂S_x = S_x − S°_x = {y : y_i ≤ x_i ∀i, and ∃j with y_j = x_j}
• If k > 1, F_X can have discontinuity points even if P_X has no point masses.
• F_X can be discontinuous at uncountably many points.
• The continuity points of F_X are dense.
• For any j, we have lim_{x_j→∞} F_X(x) = F_{X_{I\{j}}}(x_{I\{j}}).
9.4 (Joint pdf). A function f is a multivariate or joint pdf (commonly called a density) if and only if it satisfies the following two conditions:
(a) f ≥ 0;
(b) ∫ f(x) dx = 1.
• The integrability of the pdf f implies (roughly speaking) that for all i ∈ I, lim_{x_i→±∞} f(x_I) = 0.
• P[X ∈ A] = ∫_A f_X(x) dx.
• Remarks: Roughly, we may say the following:
(a) f_X(x) = lim_{∀i, Δx_i→0} P[∀i, x_i < X_i ≤ x_i + Δx_i] / Π_i Δx_i = lim_{Δx→0} P[x < X ≤ x + Δx] / Π_i Δx_i.
(b) For I = [n],
◦ F_X(x) = ∫_{−∞}^{x₁} · · · ∫_{−∞}^{x_n} f_X(x) dx_n · · · dx₁
◦ f_X(x) = (∂ⁿ/(∂x₁ · · · ∂x_n)) F_X(x).
◦ (∂/∂u) F_X(u, . . . , u) = Σ_{k∈[n]} ∫_{(−∞,u]^{n−1}} f_X(v^{(k)}) dx_{[n]\{k}}, where v^{(k)}_j = u for j = k and v^{(k)}_j = x_j for j ≠ k.
For example,
(∂/∂u) F_{X,Y}(u, u) = ∫_{−∞}^u f_{X,Y}(x, u) dx + ∫_{−∞}^u f_{X,Y}(u, y) dy.
(c) f_{X₁ⁿ}(x₁ⁿ) = E[Π_{i=1}^n δ(X_i − x_i)].
• The level sets of a density are sets where the density is constant.
9.5. Consider two random vectors X : Ω → R^{d₁} and Y : Ω → R^{d₂}. Define Z = (X, Y) : Ω → R^{d₁+d₂}. Suppose that Z has density f_{X,Y}(x, y).
(a) Marginal densities: f_Y(y) = ∫_{R^{d₁}} f_{X,Y}(x, y) dx and f_X(x) = ∫_{R^{d₂}} f_{X,Y}(x, y) dy.
• In other words, to obtain the marginal densities, integrate out the unwanted variables.
• f_{X_{I\{i}}}(x_{I\{i}}) = ∫ f_{X_I}(x_I) dx_i.
(b) f_{Y|X}(y|x) = f_{X,Y}(x, y)/f_X(x).
(c) F_{Y|X}(y|x) = ∫_{−∞}^{y₁} · · · ∫_{−∞}^{y_{d₂}} f_{Y|X}(t|x) dt_{d₂} · · · dt₁.
9.6. P[(X + a₁, X + b₁) ∩ (Y + a₂, Y + b₂) = ∅] = ∫∫_A f_{X,Y}(x, y) dx dy, where A is defined in (1.10).
9.7. Expectation and covariance:
(a) The expectation of a random vector X is defined to be the vector of expectations of its entries. EX is usually denoted by μ_X or m_X.
(b) For nonrandom matrices A, B, C and a random vector X, E[AXB + C] = A(EX)B + C.
(c) The correlation matrix R_X of a random vector X is defined by R_X = E[XX^T]. Note that it is symmetric.
(d) The covariance matrix C_X of a random vector X is defined as
C_X = Λ_X = Cov[X] = E[(X − EX)(X − EX)^T] = E[XX^T] − (EX)(EX)^T = R_X − (EX)(EX)^T.
(i) The ij-entry of Cov[X] is simply Cov[X_i, X_j].
(ii) Λ_X is symmetric.
i. Properties of a symmetric matrix:
A. All eigenvalues are real.
B. Eigenvectors corresponding to different eigenvalues are not just linearly independent, but mutually orthogonal.
C. It is diagonalizable.
ii. Spectral theorem: The following equivalent statements hold for a symmetric matrix.
A. There exists a complete set of eigenvectors; that is, there exists an orthonormal basis u^{(1)}, . . . , u^{(n)} of R^n with C_X u^{(k)} = λ_k u^{(k)}.
B. C_X is diagonalizable by an orthogonal matrix U (UU^T = U^T U = I).
C. C_X can be represented as C_X = UΛU^T, where U is an orthogonal matrix whose columns are eigenvectors of C_X and Λ = diag(λ₁, . . . , λ_n) is a diagonal matrix of the eigenvalues of C_X.
(iii) It is always nonnegative definite (positive semidefinite): for all a ∈ R^n, where n is the dimension of X, a^T C_X a = E[(a^T(X − μ_X))²] ≥ 0.
• det(C_X) ≥ 0.
(iv) We can define C_X^{1/2} = √C_X to be √C_X = U √Λ U^T, where √Λ = diag(√λ₁, . . . , √λ_n).
i. det(√C_X) = √(det C_X).
ii. √C_X is nonnegative definite.
iii. (√C_X)² = √C_X √C_X = C_X.
(v) Suppose, furthermore, that C_X is positive definite.
i. C_X^{−1} = UΛ^{−1}U^T, where Λ^{−1} = diag(1/λ₁, . . . , 1/λ_n).
ii. C_X^{−1/2} = √(C_X^{−1}) = (√C_X)^{−1} = UDU^T, where D = diag(1/√λ₁, . . . , 1/√λ_n).
iii. √C_X C_X^{−1} √C_X = I.
iv. C_X^{−1}, C_X^{1/2}, C_X^{−1/2} are all positive definite (and hence all symmetric).
v. (C_X^{−1/2})² = C_X^{−1}.
vi. Let Y = C_X^{−1/2}(X − EX). Then EY = 0 and C_Y = I.
(vi) For i.i.d. X_i, each with variance σ², Λ_X = σ²I.
(e) Cov[AX + b] = A Cov[X] A^T.
• Cov[X^T h] = Cov[h^T X] = h^T Cov[X] h, where h is a vector with the same dimension as X.
(f) For Y = X + Z, Λ_Y = Λ_X + Λ_{XZ} + Λ_{ZX} + Λ_Z.
• When X and Z are independent, Λ_Y = Λ_X + Λ_Z.
• For Y_i = X + Z_i, where X and the Z_i are independent, Λ_Y = σ²_X + Λ_Z (with σ²_X added to every entry, per the A + a convention above).
(g) Λ_{X+Y} + Λ_{X−Y} = 2Λ_X + 2Λ_Y.
(h) det(Λ_{X+Y}) ≤ 2ⁿ det(Λ_X + Λ_Y), where n is the dimension of X and Y.
(i) If Y = (X, X, . . . , X), where X is a random variable with variance σ²_X, then Λ_Y = σ²_X 11^T, the matrix with every entry equal to σ²_X. Note that Y = 1X, where 1 has the same dimension as Y.
(j) Let X be a zero-mean random vector whose covariance matrix is singular. Then one of the X_i is a deterministic linear combination of the remaining components. In other words, there is a nonzero vector a such that a^T X = 0. In general, if Λ_X is singular, then there is a nonzero vector a such that a^T X = a^T EX.
(k) If X and Y are both random vectors (not necessarily of the same dimension), then their cross-covariance matrix is
Λ_{XY} = C_{XY} = Cov[X, Y] = E[(X − EX)(Y − EY)^T].
Note that the ij-entry of C_{XY} is Cov[X_i, Y_j].
• C_{YX} = (C_{XY})^T.
(l) R_{XY} = E[XY^T].
(m) If we stack X and Y into a composite vector Z = (X, Y), then
C_Z = [ C_X    C_{XY} ]
      [ C_{YX} C_Y    ].
(n) X and Y are said to be uncorrelated if C_{XY} = 0, the zero matrix. In that case,
C_{(X,Y)} = [ C_X 0 ]
            [ 0 C_Y ],
a block diagonal matrix.
9.8. The joint characteristic function of an n-dimensional random vector X is defined by
φ_X(v) = E[e^{jv^T X}] = E[e^{j Σ_i v_i X_i}].
When X has a joint density f_X, φ_X is just the n-dimensional Fourier transform:
φ_X(v) = ∫ e^{jv^T x} f_X(x) dx,
and the joint density can be recovered using the multivariate inverse Fourier transform:
f_X(x) = (1/(2π)ⁿ) ∫ e^{−jv^T x} φ_X(v) dv.
(a) φ_X(u) = E[e^{ju^T X}].
(b) f_X(x) = (1/(2π)ⁿ) ∫ e^{−jv^T x} φ_X(v) dv.
(c) For Y = AX + b, φ_Y(u) = e^{jb^T u} φ_X(A^T u).
(d) φ_X(−u) is the complex conjugate of φ_X(u).
(e) φ_X(u) = φ_{X,Y}(u, 0).
(f) Moments:
(∂^{Σ_i υ_i} / (∂v₁^{υ₁} ∂v₂^{υ₂} · · · ∂v_n^{υ_n})) φ_X(0) = j^{Σ_i υ_i} E[Π_{i=1}^n X_i^{υ_i}].
(i) (∂/∂v_i) φ_X(0) = j EX_i.
(ii) (∂²/(∂v_i ∂v_j)) φ_X(0) = j² E[X_i X_j].
(g) Central moments:
(i) (∂/∂v_i) ln(φ_X(v)) |_{v=0} = j EX_i.
(ii) (∂²/(∂v_i ∂v_j)) ln(φ_X(v)) |_{v=0} = −Cov[X_i, X_j].
(iii) (∂³/(∂v_i ∂v_j ∂v_k)) ln(φ_X(v)) |_{v=0} = j³ E[(X_i − EX_i)(X_j − EX_j)(X_k − EX_k)].
(iv) E[(X_i − EX_i)(X_j − EX_j)(X_k − EX_k)(X_ℓ − EX_ℓ)] = Ψ_{ijkℓ} + Ψ_{ij}Ψ_{kℓ} + Ψ_{ik}Ψ_{jℓ} + Ψ_{iℓ}Ψ_{jk}, where Ψ_{ijkℓ} = (∂⁴/(∂v_i ∂v_j ∂v_k ∂v_ℓ)) ln(φ_X(v)) |_{v=0}, and Ψ_{ij} is the corresponding second-order derivative from (ii).
Remark: we do not require that any or all of i, j, k, and ℓ be distinct.
9.9 (Decorrelation and the Karhunen–Loève expansion). Let X be an n-dimensional random vector with zero mean and covariance matrix C. X has the representation X = PY, where the components of Y are uncorrelated and P is an n × n orthonormal matrix. This representation is called the Karhunen–Loève expansion.
• Y = P^T X.
• P^T = P^{−1} is called a decorrelating transformation.
• Diagonalize C = PDP^T, where D = diag(λ₁, . . . , λ_n). Then Cov[Y] = D.
• In MATLAB, use [P,D] = eig(C). To extract the diagonal elements of D as a vector, use the command d = diag(D).
• If C is singular (equivalently, if some of the λ_i are zero), we only need to keep the Y_i for which λ_i > 0 and can throw away the other components of Y without any loss of information. This is because λ_i = E[Y_i²], and E[Y_i²] = 0 if and only if Y_i ≡ 0 a.s. [9, p 338–339].
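Decorrelation can be demonstrated without MATLAB. A Python sketch (not from the notes; it hand-rolls the 2×2 eigen-rotation, with the correlated construction, seed, and sample size chosen arbitrarily):

```python
import random
import math

random.seed(6)
N = 50_000

# Correlated pair: X1 = Z1, X2 = Z1 + Z2 with Z1, Z2 i.i.d. N(0,1),
# so the covariance matrix is C = [[1, 1], [1, 2]].
data = []
for _ in range(N):
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    data.append((z1, z1 + z2))

def cov2(d):
    mx = sum(x for x, _ in d) / len(d)
    my = sum(y for _, y in d) / len(d)
    cxx = sum((x - mx) ** 2 for x, _ in d) / len(d)
    cyy = sum((y - my) ** 2 for _, y in d) / len(d)
    cxy = sum((x - mx) * (y - my) for x, y in d) / len(d)
    return cxx, cyy, cxy

cxx, cyy, cxy = cov2(data)
# For a 2x2 symmetric matrix, the eigenvector rotation satisfies
# tan(2*theta) = 2*cxy/(cxx - cyy); Y = P^T X then decorrelates the pair.
theta = 0.5 * math.atan2(2 * cxy, cxx - cyy)
c, s = math.cos(theta), math.sin(theta)
decorr = [(c * x + s * y, -s * x + c * y) for x, y in data]

cross = cov2(decorr)[2]
print(round(cross, 4))   # cross-covariance of Y: ~0
```

In higher dimensions one would use an eigendecomposition routine, exactly as the MATLAB eig call above does.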
9.1 Random Sequence

9.10. [10, p 9–10] Given a countable family X₁, X₂, . . . of random variables, their statistical properties are regarded as defined by prescribing, for each integer n ≥ 1 and every finite set I ⊂ N, the joint distribution function F_{X_I} of the random vector X_I = (X_i : i ∈ I). Of course, some consistency requirements must be imposed upon the infinite family F_{X_I}, namely, that for j ∈ I:
(a) F_{X_{I\{j}}}(x_{I\{j}}) = lim_{x_j→∞} F_{X_I}(x_I), and
(b) the distribution function obtained from F_{X_I}(x_I) by interchanging two of the indices i₁, i₂ ∈ I and the corresponding variables x_{i₁} and x_{i₂} should be invariant. This simply means that the manner of labeling the random variables X₁, X₂, . . . is not relevant.
The joint distributions {F_{X_I}} are called the finite-dimensional distributions associated with X_N = (X_n)_{n=1}^∞.
10 Transform Methods
10.1 Probability Generating Function
Definition 10.1. [9][10, p 11] Let X be a discrete random variable taking only nonnegative integer values. The probability generating function (pgf) of X is
G_X(z) = E[z^X] = ∑_{k=0}^∞ z^k P[X = k].
• In the summation, the first term (the k = 0 term) is P[X = 0] even when z = 0.
• G_X(0) = P[X = 0].
• G(z^{−1}) is the z-transform of the pmf.
• G_X(1) = 1.
• The name derives from the fact that it can be used to compute the pmf.
• It is finite at least for any complex z with |z| ≤ 1. Hence the pgf is well defined for |z| ≤ 1.
Definition 10.2. G_X^{(k)}(1) = lim_{z↑1} G_X^{(k)}(z).
10.3. Properties
(a) G_X is infinitely differentiable at least for |z| < 1.
(b) Probability generating property:
(1/k!) (d^k/dz^k) G_X(z)|_{z=0} = P[X = k].
(c) Moment generating property:
(d^k/dz^k) G_X(z)|_{z=1} = E[∏_{i=0}^{k−1} (X − i)].
The RHS is called the kth factorial moment of X.
(d) In particular,
EX = G′_X(1),
E[X²] = G″_X(1) + G′_X(1),
Var X = G″_X(1) + G′_X(1) − (G′_X(1))².
(e) The pgf of a sum of independent random variables is the product of the individual pgfs. Let S = ∑_{i=1}^n X_i where the X_i's are independent. Then
G_S(z) = ∏_{i=1}^n G_{X_i}(z).
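The mean and variance formulas in (d) can be checked numerically. A minimal sketch: the Binomial(10, 0.3) pmf is an arbitrary example, and the derivatives of G_X at z = 1 are taken by finite differences:

```python
from math import comb

# Illustrative pmf: Binomial(n, p), so EX = np and Var X = np(1 - p).
n, p = 10, 0.3
pmf = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

def G(z):
    # pgf: G(z) = sum_k P[X = k] z^k
    return sum(pk * z**k for k, pk in enumerate(pmf))

h = 1e-5
G1 = (G(1 + h) - G(1 - h)) / (2 * h)           # G'(1)
G2 = (G(1 + h) - 2 * G(1) + G(1 - h)) / h**2   # G''(1)

mean = G1                     # should be ≈ np = 3.0
var = G2 + G1 - G1**2         # should be ≈ np(1 - p) = 2.1
print(round(mean, 3), round(var, 3))
```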
10.2 Moment Generating Function
Definition 10.4. The moment generating function of a random variable X is defined as M_X(s) = E[e^{sX}] = ∫ e^{sx} P_X(dx), for all s for which this is finite.
10.5. Properties of the moment generating function
(a) M_X(s) is defined on some interval containing 0. It is possible that the interval consists of 0 alone.
(i) If X ≥ 0, this interval contains (−∞, 0].
(ii) If X ≤ 0, this interval contains [0, ∞).
(b) Suppose M(s) is defined throughout an interval (−s_0, s_0) where s_0 > 0, i.e., it exists (is finite) in some neighborhood of 0. Then:
(i) X has finite moments of all orders: E[|X|^k] < ∞ for all k ≥ 0.
(ii) M(s) = ∑_{k=0}^∞ (s^k/k!) E[X^k] for complex-valued s with |s| < s_0 [1, eqn (21.22) p 278]. Thus M(s) has a Taylor expansion about 0 with positive radius of convergence.
i. If M(s) can somehow be calculated and expanded in a series ∑_{k=0}^∞ a_k s^k, and if the coefficients a_k can be identified, then a_k = (1/k!) E[X^k]; that is, E[X^k] = k! a_k.
(iii) M^{(k)}(0) = E[X^k] = ∫ x^k P_X(dx).
(c) If M is defined in some neighborhood of s, then M^{(k)}(s) = ∫ x^k e^{sx} P_X(dx).
(d) See also the Chernoff bound.
10.3 One-Sided Laplace Transform
10.6. The one-sided Laplace transform of a nonnegative random variable X is defined for s ≥ 0 by
L(s) = M(−s) = E[e^{−sX}] = ∫_{[0,∞)} e^{−sx} P_X(dx).
• Note that 0 is included in the range of integration.
• L(s) is always finite because e^{−sx} ≤ 1. In fact, it is a decreasing function of s.
• L(0) = 1.
• L(s) ∈ [0, 1].
(a) Derivative: For s > 0, L^{(k)}(s) = (−1)^k ∫ x^k e^{−sx} P_X(dx) = (−1)^k E[X^k e^{−sX}].
(i) lim_{s↓0} (d^n/ds^n) L(s) = (−1)^n E[X^n].
• Because the value at 0 can be ∞, it does not make sense to talk about (d^n/ds^n) L(0) for n > 1.
(ii) For all s ≥ 0, L(s) is differentiable and (d/ds) L(s) = −E[X e^{−sX}], where at 0 this is the right-derivative.
• (d/ds) L(s) is finite for s > 0.
• (d/ds) L(0) = −E[X], where E[X] ∈ [0, ∞].
(b) Inversion formula: If F_X is continuous at t > 0, then
F_X(t) = lim_{s→∞} ∑_{k=0}^{⌊st⌋} ((−1)^k/k!) s^k L^{(k)}(s).
(c) F_X and P_X are determined by L_X(s).
(i) In fact, they are determined by the values of L_X(s) for s beyond any arbitrary s_0. (That is, we don't need to know L_X(s) for small s.) Knowing L_X(s) on N is also sufficient.
(ii) Let μ and ν be probability measures on [0, ∞). If there exists s_0 ≥ 0 such that ∫ e^{−sx} μ(dx) = ∫ e^{−sx} ν(dx) for all s ≥ s_0, then μ = ν.
(iii) Let f_1, f_2 be real functions on [0, ∞). If there exists s_0 ≥ 0 such that ∫_{[0,∞)} e^{−sx} f_1(x) dx = ∫_{[0,∞)} e^{−sx} f_2(x) dx for all s ≥ s_0, then f_1 = f_2 Lebesgue-a.e.
(d) Let X_1, …, X_n be independent nonnegative random variables. Then L_{∑_{i=1}^n X_i}(s) = ∏_{i=1}^n L_{X_i}(s).
(e) Suppose F is a distribution function with corresponding Laplace transform L. Then
(i) ∫_{[0,∞)} e^{−sx} F(x) dx = (1/s) L(s);
(ii) ∫_{[0,∞)} e^{−sx} (1 − F(x)) dx = (1/s)(1 − L(s))
[16, p 183].
10.4 Characteristic Function
10.7. The characteristic function (abbreviated c.f. or ch.f.) of a probability measure μ on the line is defined for real t by ϕ(t) = ∫ e^{itx} μ(dx). A random variable X has characteristic function
ϕ_X(t) = E[e^{itX}] = ∫ e^{itx} P_X(dx).
(a) It always exists because |ϕ(t)| ≤ ∫ |e^{itx}| μ(dx) = ∫ 1 μ(dx) = 1 < ∞.
(b) If X has a density, then ϕ_X(t) = ∫ e^{itx} f_X(x) dx.
(c) ϕ(0) = 1.
(d) |ϕ(t)| ≤ 1 for all t ∈ R.
(e) ϕ is uniformly continuous.
(f) Suppose all moments of X exist and E[e^{tX}] < ∞ for all t ∈ R; then ϕ_X(t) = ∑_{k=0}^∞ ((it)^k/k!) E[X^k].
(g) If E[|X|^k] < ∞, then ϕ^{(k)}(t) = i^k E[X^k e^{itX}] and ϕ^{(k)}(0) = i^k E[X^k].
(h) Riemann–Lebesgue theorem: If X has a density, then ϕ_X(t) → 0 as |t| → ∞.
(i) ϕ_{aX+b}(t) = e^{itb} ϕ_X(at).
(j) Conjugate symmetry property: ϕ_{−X}(t) = ϕ_X(−t) = conj(ϕ_X(t)).
• X and −X are equal in distribution iff ϕ_X is real-valued.
• |ϕ_X| is even.
(k) X is a.s. integer-valued if and only if ϕ_X(2π) = 1.
(l) If X_1, X_2, …, X_n are independent, then ϕ_{∑_{j=1}^n X_j}(t) = ∏_{j=1}^n ϕ_{X_j}(t).
(m) Inversion
(i) The inversion formula: If the probability measure μ has characteristic function ϕ and if μ{a} = μ{b} = 0, then
μ(a, b] = lim_{T→∞} (1/2π) ∫_{−T}^{T} ((e^{−ita} − e^{−itb})/(it)) ϕ(t) dt.
i. In fact, if a < b, then lim_{T→∞} (1/2π) ∫_{−T}^{T} ((e^{−ita} − e^{−itb})/(it)) ϕ(t) dt = μ(a, b) + (1/2) μ{a, b}.
ii. Equivalently, if F is the distribution function and a, b are continuity points of F, then F(b) − F(a) = lim_{T→∞} (1/2π) ∫_{−T}^{T} ((e^{−ita} − e^{−itb})/(it)) ϕ(t) dt.
(ii) Fourier inversion: Suppose ∫ |ϕ_X(t)| dt < ∞; then X is absolutely continuous with
i. bounded continuous density f(x) = (1/2π) ∫ e^{−itx} ϕ(t) dt;
ii. μ(a, b] = (1/2π) ∫ ((e^{−ita} − e^{−itb})/(it)) ϕ(t) dt.
(n) Continuity theorem: X_n converges to X in distribution if and only if ϕ_{X_n}(t) → ϕ_X(t) for every t (pointwise).
(o) ϕ_X on the complex plane for X ≥ 0:
(i) ϕ_X(z) is defined in the complex plane for Im z ≥ 0.
i. |ϕ_X(z)| ≤ 1 for such z.
(ii) In the domain Im z > 0, ϕ_X is analytic, and it is continuous up to and including the boundary Im z = 0.
(iii) ϕ_X determines uniquely a function L_X(s) of real argument s ≥ 0 which equals L_X(s) = ϕ_X(is) = E[e^{−sX}]. Conversely, L_X(s) on the half-line s ≥ 0 determines ϕ_X uniquely.
(p) If E[|X|^n] < ∞, then
|ϕ_X(t) − ∑_{k=0}^n ((it)^k/k!) E[X^k]| ≤ E[ min{ |t|^{n+1} |X|^{n+1} / (n+1)!, 2 |t|^n |X|^n / n! } ]   (22)
and ϕ_X(t) = ∑_{k=0}^n ((it)^k/k!) E[X^k] + t^n β(t) where lim_{|t|→0} β(t) = 0, or equivalently,
ϕ_X(t) = ∑_{k=0}^n ((it)^k/k!) E[X^k] + o(t^n).   (23)
(i) |ϕ_X(t) − 1| ≤ E[min{|tX|, 2}].
• If EX = 0, then |ϕ_X(t) − 1| ≤ E[min{ t²X²/2, 2|t||X| }] ≤ (t²/2) E[X²].
(ii) For integrable X, |ϕ_X(t) − 1 − itEX| ≤ E[min{ t²X²/2, 2|t||X| }] ≤ (t²/2) E[X²].
(iii) For X with finite E[X²],
i. |ϕ_X(t) − 1 − itEX + (1/2) t² E[X²]| ≤ E[min{ |t|³|X|³/6, t²|X|² }];
ii. ϕ_X(t) = 1 + itEX − (1/2) t² E[X²] + t² β(t) where lim_{|t|→0} β(t) = 0.
10.8. ϕ_X(u) = M_X(ju).
10.9. If X is a continuous r.v. with density f_X, then ϕ_X(t) = ∫ e^{jtx} f_X(x) dx (Fourier transform) and f_X(x) = (1/2π) ∫ e^{−jtx} ϕ_X(t) dt (Fourier inversion formula).
• ϕ_X(t) is the Fourier transform of f_X evaluated at −t.
• ϕ_X inherits the properties of a Fourier transform.
(a) For nonnegative a_i such that ∑_i a_i = 1, if f_Y = ∑_i a_i f_{X_i}, then ϕ_Y = ∑_i a_i ϕ_{X_i}.
(b) If f_X is even, then ϕ_X is also even.
• If f_X is even, ϕ_X = ϕ_{−X}.
10.10. Linear combination of independent random variables: Suppose X_1, …, X_n are independent. Let Y = ∑_{i=1}^n a_i X_i. Then ϕ_Y(t) = ∏_{i=1}^n ϕ_{X_i}(a_i t). Furthermore, if |a_i| = 1 and all the f_{X_i} are even, then ϕ_Y(t) = ∏_{i=1}^n ϕ_{X_i}(t).
(a) ϕ_{X+Y}(t) = ϕ_X(t) ϕ_Y(t).
(b) ϕ_{X−Y}(t) = ϕ_X(t) ϕ_Y(−t).
(c) If f_Y is even, ϕ_{X−Y} = ϕ_{X+Y} = ϕ_X ϕ_Y.
10.11. Characteristic function of a mixture of distributions: Consider nonnegative a_i such that ∑_i a_i = 1. Let P_i be probability measures with corresponding ch.f.'s ϕ_i. Then the ch.f. of ∑_i a_i P_i is ∑_i a_i ϕ_i.
(a) Discrete r.v.: Suppose p_i is a pmf with corresponding ch.f. ϕ_i. Then the ch.f. of ∑_i a_i p_i is ∑_i a_i ϕ_i.
(b) Absolutely continuous r.v.: Suppose f_i is a pdf with corresponding ch.f. ϕ_i. Then the ch.f. of ∑_i a_i f_i is ∑_i a_i ϕ_i.
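The product rule 10.10(a) for independent summands is easy to illustrate with empirical characteristic functions. A sketch; the two distributions, the seed, and the evaluation point t are arbitrary illustrative choices:

```python
import cmath, random

random.seed(1)
N = 100_000
X = [random.uniform(0, 1) for _ in range(N)]       # independent of Y
Y = [random.expovariate(2.0) for _ in range(N)]

def ecf(samples, t):
    # empirical characteristic function: (1/N) * sum_k exp(i t x_k)
    return sum(cmath.exp(1j * t * x) for x in samples) / len(samples)

t = 1.7
lhs = ecf([x + y for x, y in zip(X, Y)], t)  # estimate of ϕ_{X+Y}(t)
rhs = ecf(X, t) * ecf(Y, t)                  # estimate of ϕ_X(t) ϕ_Y(t)
print(abs(lhs - rhs))                        # small (Monte-Carlo error only)
```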
11 Functions of random variables
Definition 11.1. The preimage or inverse image of a set B is defined by g^{−1}(B) = {x : g(x) ∈ B}.
11.2. For discrete X, suppose Y = g(X). Then p_Y(y) = ∑_{x∈g^{−1}({y})} p_X(x). The joint pmf of Y and X is given by p_{X,Y}(x, y) = p_X(x) 1[y = g(x)].
• In most cases, we can show that X and Y are not independent: pick a point x^{(0)} such that p_X(x^{(0)}) > 0, and pick a point y^{(1)} such that y^{(1)} ≠ g(x^{(0)}) and p_Y(y^{(1)}) > 0. Then p_{X,Y}(x^{(0)}, y^{(1)}) = 0 but p_X(x^{(0)}) p_Y(y^{(1)}) > 0. Note that this technique does not always work. For example, if g is a constant function which maps all values of x to a constant c, then we won't be able to find y^{(1)}. Of course, this is to be expected because we know that a constant is always independent of other random variables.
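The pmf formula in 11.2 amounts to accumulating the mass of every preimage point. A minimal sketch; the pmf and the function g below are arbitrary illustrative choices:

```python
from collections import defaultdict

# Illustrative discrete X: uniform on {-2, -1, 0, 1, 2}, with g(x) = x^2.
p_X = {x: 0.2 for x in (-2, -1, 0, 1, 2)}
g = lambda x: x * x

# p_Y(y) = sum of p_X(x) over x in g^{-1}({y})
p_Y = defaultdict(float)
for x, px in p_X.items():
    p_Y[g(x)] += px

print(dict(p_Y))   # masses at y = 0, 1, 4
```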
11.1 SISO case
There are many techniques for finding the cdf and pdf of Y = g(X).
(a) One may first find F_Y(y) = P[g(X) ≤ y] and then obtain f_Y from F′_Y(y); in this case, Leibniz's rule in (38) will be useful.
(b) Formula (25) below provides a convenient way of arriving at f_Y from f_X without going through F_Y.
11.3 (Linear transformation). Y = aX + b where a ≠ 0.
F_Y(y) = { F_X((y−b)/a), a > 0;
           1 − F_X(((y−b)/a)⁻), a < 0. }
(a) Suppose X is absolutely continuous. Then
f_Y(y) = (1/|a|) f_X((y−b)/a).   (24)
In fact, (24) holds even for a mixed r.v. if we allow delta functions, because (1/|a|) δ((y−b)/a − x_k) = δ(y − (a x_k + b)).
(b) Suppose X is discrete. Then
p_Y(y) = p_X((y−b)/a).
If we write f_X(x) = ∑_k p_X(x_k) δ(x − x_k), we have
f_Y(y) = ∑_k p_X(x_k) δ(y − (a x_k + b)).
11.4 (Power Law Function). Y = X^n, with n ∈ N or n ∈ (0, ∞).
(a) n odd: F_Y(y) = F_X(y^{1/n}) and f_Y(y) = (1/n) y^{(1/n)−1} f_X(y^{1/n}).
(b) n even:
F_Y(y) = { F_X(y^{1/n}) − F_X((−y^{1/n})⁻), y ≥ 0;
           0, y < 0, }
and
f_Y(y) = { (1/n) y^{(1/n)−1} [ f_X(y^{1/n}) + f_X(−y^{1/n}) ], y ≥ 0;
           0, y < 0. }
Again, the density f_Y in the above formula holds when X is absolutely continuous. Note that when n > 1, f_Y is not defined at 0 (the factor y^{(1/n)−1} blows up there). If we allow delta functions, then the density formulas above are also valid for mixed r.v.'s because (1/n) y^{(1/n)−1} δ(±y^{1/n} − x_k) = δ(y − (±x_k)^n).
• Let X be an absolutely continuous random variable. The density of Y = X² is
f_Y(y) = { 0, y < 0;
           (1/(2√y)) f_X(√y) + (1/(2√y)) f_X(−√y), y ≥ 0. }
11.5. In general, for Y = g(X), we solve the equation y = g(x), denoting its real roots by x_k. Then
f_Y(y) = ∑_k f_X(x_k) / |g′(x_k)|.   (25)
If g(x) = c = constant for every x in an interval (a, b), then F_Y(y) is discontinuous at y = c. Hence, f_Y(y) contains an impulse (F_X(b) − F_X(a)) δ(y − c) [14, p 93–94].
• To see this, consider the case when there is a unique x such that g(x) = y. For small Δx and Δy, P[y < Y ≤ y+Δy] = P[x < X ≤ x+Δx], where (y, y+Δy] = g((x, x+Δx]) is the image of the interval (x, x+Δx]. (Equivalently, (x, x+Δx] is the inverse image of (y, y+Δy].) This gives f_Y(y) Δy = f_X(x) Δx.
• The joint density f_{X,Y} is
f_{X,Y}(x, y) = f_X(x) δ(y − g(x)).   (26)
Let the x_k be the solutions for x of g(x) = y. Then, by integrating (26) w.r.t. x, we obtain (25) via the use of (3).
• When g is bijective,
f_Y(y) = |(d/dy) g^{−1}(y)| f_X(g^{−1}(y)).
• For Y = a/X, f_Y(y) = (|a|/y²) f_X(a/y).
• Suppose X is nonnegative. For Y = √X, f_Y(y) = 2y f_X(y²).
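Formula (25) can be sanity-checked numerically. A sketch for g(x) = x² with X standard normal (an illustrative choice): the roots of y = x² are ±√y with |g′(±√y)| = 2√y, and the result should match the chi-square density with one degree of freedom:

```python
from math import sqrt, exp, pi

f_X = lambda x: exp(-x * x / 2) / sqrt(2 * pi)   # N(0,1) density

def f_Y(y):
    # formula (25): sum over the two roots ±sqrt(y), divided by |g'| = 2*sqrt(y)
    r = sqrt(y)
    return (f_X(r) + f_X(-r)) / (2 * r)

# Closed form for comparison: chi-square(1) density.
chi2_1 = lambda y: exp(-y / 2) / sqrt(2 * pi * y)

for y in (0.5, 1.0, 2.3):
    assert abs(f_Y(y) - chi2_1(y)) < 1e-12
print("formula (25) matches the chi-square(1) density")
```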
11.6. Given Y = g(X) where X is uniform on (a, b). Then, to get f_Y(y_0), plot g on (a, b). Let A = g^{−1}({y_0}) be the set of all points x such that g(x) = y_0. Suppose A can be written as a countable disjoint union A = B ∪ (∪_i I_i), where B is countable and the I_i's are intervals. We have, at y = y_0,
f_Y(y) = (1/(b−a)) [ ∑_{x∈B} 1/|g′(x)| + ∑_i ℓ(I_i) δ(y − y_0) ],
where ℓ(I) is the length of the interval I.
• Suppose Θ is uniform on an interval of length 2π. Then Y_1 = cos Θ and Y_2 = sin Θ are both arcsine random variables with
F_{Y_i}(y) = 1 − (1/π) cos^{−1} y = 1/2 + (1/π) sin^{−1}(y) and f_{Y_i}(y) = (1/π) · 1/√(1−y²), for y ∈ [−1, 1].
Note also that E[Y_1 Y_2] = E[Y_i] = Cov[Y_1, Y_2] = 0. Hence, Y_1 and Y_2 are uncorrelated. However, it is easy to see that Y_1 and Y_2 are not independent by considering the joint and marginal densities at y_1 = y_2 = 0.
Example 11.7. Y = X² 1[X ≥ 0].
(a) F_Y(y) = { 0, y < 0;
               F_X(0), y = 0;
               F_X(√y), y > 0. }
(b) f_Y(y) = { 0, y < 0;
               (1/(2√y)) f_X(√y), y > 0 }  + F_X(0) δ(y).
[Figure 20 appears here: a diagram relating the univariate distributions (Bernoulli, binomial, Poisson, hypergeometric; standard uniform, uniform, beta, triangular, exponential, Erlang, gamma, Weibull, Rayleigh, chi-square, F, Cauchy, normal, standard normal, lognormal) through limits, sums X_1 + ⋯ + X_K, minima, and other transformations. The diagram's text is not recoverable from this extraction.]
Figure 20: Relationships among univariate distributions [20]
[Figure 21 appears here (credited "Din, November 8, 2004"): a second diagram of relationships among univariate distributions (exponential, normal/Gaussian, Rayleigh, gamma, chi-square, beta, uniform), listing their densities and closure properties under scaling, minima, and sums. The diagram's text is not recoverable from this extraction.]
Figure 21: Another diagram demonstrating relationships among univariate distributions
11.2 MISO case
11.8. Suppose X and Y are jointly continuous random variables with joint density f_{X,Y}. The following two methods give the density of Z = g(X, Y).
• Condition on one of the variables, say Y = y. Once conditioned, Z is simply a function of one variable, g(X, y); hence we can use the one-variable technique to find f_{Z|Y}(z|y). Finally, f_Z(z) = ∫ f_{Z|Y}(z|y) f_Y(y) dy.
• Directly find the joint density of the random vector (Z, Y) = (g(X, Y), Y). Observe that the Jacobian is
( ∂g/∂x  ∂g/∂y
  0      1 ).
Hence, the magnitude of the determinant is |∂g/∂x|.
Of course, the standard way of finding the pdf of Z is by differentiating the cdf F_Z(z) = ∫∫_{(x,y): g(x,y) ≤ z} f_{X,Y}(x, y) d(x, y). This is still good for solving specific examples. It is also a good starting point for those who haven't learned conditional probability or Jacobians.
Let the x^{(k)} be the solutions of g(x, y) = z for fixed z and y. The first method gives
f_{Z|Y}(z|y) = ∑_k f_{X|Y}(x^{(k)}|y) / |(∂/∂x) g(x, y)|_{x=x^{(k)}}.
Hence,
f_{Z,Y}(z, y) = ∑_k f_{X,Y}(x^{(k)}, y) / |(∂/∂x) g(x, y)|_{x=x^{(k)}},
which comes out of the second method directly. Both methods then give
f_Z(z) = ∫ ∑_k f_{X,Y}(x^{(k)}, y) / |(∂/∂x) g(x, y)|_{x=x^{(k)}} dy.
The integration for a given z is only over the values of y for which there is at least one solution for x in z = g(x, y). If there is no such solution, f_Z(z) = 0. The same technique works for a function of more than one random variable, Z = g(X_1, …, X_n). For any j ∈ [n], let the x_j^{(k)} be the solutions for x_j in z = g(x_1, …, x_n). Then,
f_Z(z) = ∫ ∑_k f_{X_1,…,X_n}(x_1, …, x_n) / |(∂/∂x_j) g(x_1, …, x_n)| evaluated at x_j = x_j^{(k)}, integrated w.r.t. dx_{[n]∖{j}}.
For the second method, we consider the random vector (h_r(X_1, …, X_n), r ∈ [n]) where h_r(X_1, …, X_n) = X_r for r ≠ j and h_j = g. The Jacobian is of the form
( 1        0        ⋯ 0        ⋯ 0
  0        1        ⋯ 0        ⋯ 0
  ∂g/∂x_1  ∂g/∂x_2  ⋯ ∂g/∂x_j  ⋯ ∂g/∂x_n
  0        0        ⋯ 0        ⋯ 1 ),
with the row of partial derivatives in the jth position. By swapping the row with all the partial derivatives to the first row, the magnitude of the determinant is unchanged, and we end up with an upper triangular matrix whose determinant is simply the product of the diagonal elements.
(a) For Z = aX + bY,
f_Z(z) = (1/|a|) ∫ f_{X,Y}((z−by)/a, y) dy = (1/|b|) ∫ f_{X,Y}(x, (z−ax)/b) dx.
• Note that the Jacobian of (x, y) ↦ (ax+by, y) is ( a b ; 0 1 ).
(i) When a = 1, b = −1,
f_{X−Y}(z) = ∫ f_{X,Y}(z+y, y) dy = ∫ f_{X,Y}(x, x−z) dx.
(ii) When X and Y are independent and a = b = 1, we have the convolution formula
f_Z(z) = ∫ f_X(z−y) f_Y(y) dy = ∫ f_X(x) f_Y(z−x) dx.
(b) For Z = XY,
f_Z(z) = ∫ f_{X,Y}(x, z/x) (1/|x|) dx = ∫ f_{X,Y}(z/y, y) (1/|y|) dy
[9, Ex 7.2, 7.11, 7.15]. Note that the Jacobian of (x, y) ↦ (xy, y) is ( y x ; 0 1 ).
(c) For Z = X² + Y²,
f_Z(z) = ∫_{−√z}^{√z} [ f_{X|Y}(√(z−y²) | y) + f_{X|Y}(−√(z−y²) | y) ] / (2√(z−y²)) · f_Y(y) dy
[9, Ex 7.16]. Alternatively,
f_Z(z) = (1/2) ∫_0^{2π} f_{X,Y}(√z cos θ, √z sin θ) dθ,  z > 0   (27)
[15, eq (9.14) p 318].
• This can be used to show that when X, Y are i.i.d. N(0, 1), Z = X² + Y² is exponential with parameter 1/2.
(d) For R = √(X² + Y²),
f_R(r) = r ∫_0^{2π} f_{X,Y}(r cos θ, r sin θ) dθ,  r > 0
[15, eq (9.13) p 318].
(e) For Z = Y/X,
f_Z(z) = ∫ (|y|/z²) f_{X,Y}(y/z, y) dy = ∫ |x| f_{X,Y}(x, xz) dx.
Similarly, when Z = X/Y,
f_Z(z) = ∫ |y| f_{X,Y}(yz, y) dy.
(f) For Z = min(X, Y)/max(X, Y) where X and Y are strictly positive,
F_Z(z) = ∫_0^∞ F_{Y|X}(zx|x) f_X(x) dx + ∫_0^∞ F_{X|Y}(zy|y) f_Y(y) dy,
f_Z(z) = ∫_0^∞ x f_{Y|X}(zx|x) f_X(x) dx + ∫_0^∞ y f_{X|Y}(zy|y) f_Y(y) dy,  0 < z < 1
[9, Ex 7.17].
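The convolution formula in (a)(ii) can be checked numerically. A sketch using a midpoint Riemann sum for X, Y i.i.d. uniform on (0, 1) (an illustrative choice whose exact answer is the triangular density):

```python
# Uniform(0,1) density.
f = lambda u: 1.0 if 0 < u < 1 else 0.0

def f_Z(z, n=20_000):
    # Riemann-sum approximation of  ∫ f(z - y) f(y) dy  over y in (0,1).
    h = 1.0 / n
    ys = [(k + 0.5) * h for k in range(n)]
    return h * sum(f(z - y) * f(y) for y in ys)

# Triangular density: f_Z(z) = z on (0,1) and 2 - z on (1,2).
assert abs(f_Z(0.3) - 0.3) < 1e-3
assert abs(f_Z(1.4) - 0.6) < 1e-3
print("convolution of two U(0,1) densities is triangular")
```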
11.9 (Random sum). Let S = ∑_{i=1}^N V_i, where the V_i's are i.i.d. ∼ V and independent of N.
(a) ϕ_S(u) = ϕ_N(−i ln(ϕ_V(u))).
• ϕ_S(u) = G_N(ϕ_V(u)).
• For nonnegative integer-valued summands, we have G_S(z) = G_N(G_V(z)).
(b) ES = EN · EV.
(c) Var[S] = EN (Var V) + (EV)² (Var N).
Remark: If N is Poisson(λ), then ϕ_S(u) = exp(λ(ϕ_V(u) − 1)); this is the compound Poisson distribution CP(λ, L(V)). Hence, the mean and variance of CP(λ, L(V)) are λEV and λE[V²], respectively.
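The mean and variance formulas (b)-(c) can be checked by simulation. A sketch with N ~ Poisson(4) and V ~ U(0,1) (arbitrary illustrative choices, so ES = 4·(1/2) = 2 and Var S = 4·(1/12) + (1/2)²·4 = 4/3):

```python
import math, random

random.seed(7)
lam = 4.0

def poisson(lam):
    # inverse-transform Poisson sampler
    u, k = random.random(), 0
    p = math.exp(-lam)
    s = p
    while u > s:
        k += 1
        p *= lam / k
        s += p
    return k

def sample_S():
    n = poisson(lam)
    return sum(random.uniform(0, 1) for _ in range(n))

N_SIM = 100_000
xs = [sample_S() for _ in range(N_SIM)]
mean = sum(xs) / N_SIM
var = sum((x - mean) ** 2 for x in xs) / N_SIM
print(round(mean, 2), round(var, 2))   # ≈ 2 and ≈ 4/3
```

This is also consistent with the compound-Poisson form Var S = λ E[V²] = 4 · (1/3) = 4/3.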
11.3 MIMO case
Definition 11.10 (Jacobian). In vector calculus, the Jacobian is shorthand for either the Jacobian matrix or its determinant, the Jacobian determinant. Let g be a function from a subset D of R^n to R^m. If g is differentiable at z ∈ D, then all partial derivatives exist at z, and the Jacobian matrix of g at the point z ∈ D is
dg(z) = ( ∂g_1/∂x_1(z) ⋯ ∂g_1/∂x_n(z)
          ⋮                ⋮
          ∂g_m/∂x_1(z) ⋯ ∂g_m/∂x_n(z) ) = ( ∂g/∂x_1(z), …, ∂g/∂x_n(z) ).
Alternative notations for the Jacobian matrix are J, ∂(g_1, …, g_n)/∂(x_1, …, x_n) [7, p 242], and J_g(x), where it is assumed that the Jacobian matrix is evaluated at z = x = (x_1, …, x_n).
• Let A be an n-dimensional "box" defined by the corners x and x + Δx. The "volume" of the image g(A) is (∏_i Δx_i) |det dg(x)|. Hence, the magnitude of the Jacobian determinant gives the ratio (scaling factor) of n-dimensional volumes (contents). In other words,
dy_1 ⋯ dy_n = |∂(y_1, …, y_n)/∂(x_1, …, x_n)| dx_1 ⋯ dx_n.
• d(g^{−1}(y)) is the Jacobian of the inverse transformation.
• In MATLAB, use jacobian.
See also (A.14).
11.11 (Jacobian formulas). Suppose g is a vector-valued function of x ∈ R^n, and X is an R^n-valued random vector. Define Y = g(X). (Then Y is also an R^n-valued random vector.) If X has joint density f_X, and g is a suitable invertible mapping (such that the inverse mapping theorem is applicable), then
f_Y(y) = (1/|det(dg(g^{−1}(y)))|) f_X(g^{−1}(y)) = |det(d(g^{−1})(y))| f_X(g^{−1}(y)).
• Note that for any matrix A, det(A) = det(A^T). Hence, the formula above tolerates a transposed ("incorrect") Jacobian.
In general, let X = (X_1, X_2, …, X_n) be a random vector with pdf f_X(x). Let S = {x : f_X(x) > 0}. Consider a new random vector Y = (Y_1, Y_2, …, Y_n), defined by Y_i = g_i(X). Suppose that A_0, A_1, …, A_r form a partition of S with these properties. The set A_0, which may be empty, satisfies P[X ∈ A_0] = 0. The transformation Y = g(X) = (g_1(X), …, g_n(X)) is a one-to-one transformation from A_k onto some common set B for each k ∈ [r]. Then, for each k, the inverse function from B to A_k can be found. Denote the kth inverse x = h^{(k)}(y) componentwise by x_j = h_j^{(k)}(y). This kth inverse gives, for y ∈ B, the unique x ∈ A_k such that y = g(x). Assuming that the Jacobians det(dh^{(k)}(y)) do not vanish identically on B, we have
f_Y(y) = ∑_{k=1}^r f_X(h^{(k)}(y)) |det(dh^{(k)}(y))|,  y ∈ B
[2, p 185].
• Suppose for some k, Y_k is a function of the other Y_i; in particular, suppose Y_k = h(y_I) for some index set I and some deterministic function h. Then the kth row of the Jacobian matrix is a linear combination of the other rows. In particular,
∂y_k/∂x_j = ∑_{i∈I} (∂h(y_I)/∂y_i)(∂y_i/∂x_j).
Hence, the Jacobian determinant is 0.
11.12. Suppose Y = g(X), where X and Y have the same dimension. Then the joint density of X and Y is
f_{X,Y}(x, y) = f_X(x) δ(y − g(x)).
• In most cases, we can show that X and Y are not independent: pick a point x^{(0)} such that f_X(x^{(0)}) > 0, and pick a point y^{(1)} such that y^{(1)} ≠ g(x^{(0)}) and f_Y(y^{(1)}) > 0. Then f_{X,Y}(x^{(0)}, y^{(1)}) = 0 but f_X(x^{(0)}) f_Y(y^{(1)}) > 0. Note that this technique does not always work. For example, if g is a constant function which maps all values of x to a constant c, then we won't be able to find y^{(1)}. Of course, this is to be expected because we know that a constant is always independent of other random variables.
Example 11.13.
(a) For Y = AX + b, where A is a square, invertible matrix,
f_Y(y) = (1/|det A|) f_X(A^{−1}(y − b)).   (28)
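Formula (28) can be checked against a closed form. A sketch with X a pair of i.i.d. N(0,1) variables (so Y = AX + b is normal with mean b and covariance AA^T); the matrix A, offset b, and test point are arbitrary illustrative choices:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [0.0, 1.0]])
b = np.array([1.0, -1.0])

def f_X(x):
    # standard bivariate normal density
    return np.exp(-0.5 * x @ x) / (2 * np.pi)

def f_Y_jacobian(y):
    # formula (28): f_X(A^{-1}(y - b)) / |det A|
    x = np.linalg.solve(A, y - b)
    return f_X(x) / abs(np.linalg.det(A))

def f_Y_direct(y):
    # Y ~ N(b, A A^T), density written out
    C = A @ A.T
    d = y - b
    q = d @ np.linalg.solve(C, d)
    return np.exp(-0.5 * q) / (2 * np.pi * np.sqrt(np.linalg.det(C)))

y = np.array([0.7, 2.0])
assert abs(f_Y_jacobian(y) - f_Y_direct(y)) < 1e-12
print("Jacobian formula (28) agrees with the closed-form normal density")
```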
(b) Transformation between Cartesian coordinates (x, y) and polar coordinates (r, θ):
• x = r cos θ, y = r sin θ, r = √(x² + y²), θ = tan^{−1}(y/x).
• det( ∂x/∂r ∂x/∂θ ; ∂y/∂r ∂y/∂θ ) = det( cos θ  −r sin θ ; sin θ  r cos θ ) = r. (Recall that dx dy = r dr dθ.)
We have
f_{R,Θ}(r, θ) = r f_{X,Y}(r cos θ, r sin θ),  r ≥ 0 and θ ∈ (−π, π),
and
f_{X,Y}(x, y) = (1/√(x²+y²)) f_{R,Θ}(√(x²+y²), tan^{−1}(y/x)).
If, furthermore, Θ is uniform on (0, 2π) and independent of R, then
f_{X,Y}(x, y) = (1/2π) (1/√(x²+y²)) f_R(√(x²+y²)).
(c) A related transformation is given by Z = X² + Y² and Θ = tan^{−1}(Y/X). In this case, X = √Z cos Θ, Y = √Z sin Θ, and
f_{Z,Θ}(z, θ) = (1/2) f_{X,Y}(√z cos θ, √z sin θ),
which gives (27).
11.14. Suppose X, Y are i.i.d. N(0, σ²). Then R = √(X² + Y²) and Θ = arctan(Y/X) are independent, with R being Rayleigh,
f_R(r) = (r/σ²) e^{−(1/2)(r/σ)²} U(r),
and Θ being uniform on [0, 2π].
11.15 (Generation of a random sample of a normally distributed random variable). Let U_1, U_2 be i.i.d. uniform on (0, 1). Then the random variables
X_1 = √(−2 ln U_1) cos(2πU_2),
X_2 = √(−2 ln U_1) sin(2πU_2)
are i.i.d. N(0, 1). Moreover,
Z_1 = √(−2σ² ln U_1) cos(2πU_2),
Z_2 = √(−2σ² ln U_1) sin(2πU_2)
are i.i.d. N(0, σ²).
• The idea is to generate R and Θ according to (11.14) first.
• det(dx(u)) = −2π/u_1, where u_1 = e^{−(x_1² + x_2²)/2}.
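The construction in 11.15 (the Box-Muller method) is straightforward to implement and verify by simulation; the seed and sample size below are arbitrary choices:

```python
import math, random

random.seed(42)

def box_muller():
    # one draw of the pair (X_1, X_2) from 11.15
    u1, u2 = random.random(), random.random()
    r = math.sqrt(-2.0 * math.log(u1))
    return r * math.cos(2 * math.pi * u2), r * math.sin(2 * math.pi * u2)

pairs = [box_muller() for _ in range(100_000)]
xs = [x for x, _ in pairs] + [y for _, y in pairs]

mean = sum(xs) / len(xs)
var = sum((x - mean) ** 2 for x in xs) / len(xs)
print(round(mean, 2), round(var, 2))   # ≈ 0 and ≈ 1 for N(0, 1)
```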
11.16. In (11.11), suppose dim(Y) = dim(g) ≤ dim(X). To find the joint pdf of Y, we introduce an "arbitrary" Z = h(X) so that dim(Y, Z) = dim(X).
11.4 Order Statistics
Given a sample of n random variables X_1, …, X_n, reorder them so that Y_1 ≤ Y_2 ≤ ⋯ ≤ Y_n. Then Y_i is called the ith order statistic, sometimes also denoted X_{i:n}, X_{(i)}, X_{n:i}, X_{i,n}, or X_{(i)n}.
In particular,
• Y_1 = X_{1:n} = X_min is the first order statistic, denoting the smallest of the X_i's,
• Y_2 = X_{2:n} is the second order statistic, denoting the second smallest of the X_i's, …, and
• Y_n = X_{n:n} = X_max is the nth order statistic, denoting the largest of the X_i's.
In words, the order statistics of a random sample are the sample values placed in ascending order [2, p 226]. Many results in this section can be found in [4].
11.17. Event properties:
[X_min ≥ y] = ∩_i [X_i ≥ y]    [X_max ≥ y] = ∪_i [X_i ≥ y]
[X_min > y] = ∩_i [X_i > y]    [X_max > y] = ∪_i [X_i > y]
[X_min ≤ y] = ∪_i [X_i ≤ y]    [X_max ≤ y] = ∩_i [X_i ≤ y]
[X_min < y] = ∪_i [X_i < y]    [X_max < y] = ∩_i [X_i < y]
Let A_y = [X_max ≤ y] and B_y = [X_min > y]. Then A_y = [∀i, X_i ≤ y] and B_y = [∀i, X_i > y].
11.18 (Densities). Suppose the X_i are absolutely continuous with joint density f_X. Let S_y be the set of all n! vectors obtained by permuting the coordinates of y. Then
f_Y(y) = ∑_{x∈S_y} f_X(x),  y_1 ≤ y_2 ≤ ⋯ ≤ y_n.   (29)
To see this, note that f_Y(y) ∏_j Δy_j is the probability that each Y_j is in the small interval of length Δy_j around y_j. This probability can be calculated by finding the probability that all the X_k fall into the above small regions.
From the joint density, we can find the joint pdf/cdf of Y_I for any I ⊂ [n]. However, in many cases, we can directly reapply the above technique to find the joint pdf of Y_I. This is especially useful when the X_i are independent or i.i.d.
(a) The marginal density f_{Y_k} can be found by approximating f_{Y_k}(y) Δy with
∑_{j=1}^n ∑_{I ∈ ([n]∖{j} choose k−1)} P[ X_j ∈ [y, y+Δy) and ∀i ∈ I, X_i ≤ y and ∀r ∈ (I ∪ {j})^c, X_r > y ],
where for any set A and integer ℓ ≤ |A| we write (A choose ℓ) for the collection of all ℓ-element subsets of A, and the complement is taken in [n]: (I ∪ {j})^c = [n] ∖ (I ∪ {j}).
To see this, we first choose the X_j that will be Y_k, with value around y. Then we must have k − 1 of the X_i below y and the rest of the X_i above y.
(b) For integers r < s, the joint density f_{Y_r,Y_s}(y_r, y_s) Δy_r Δy_s can be approximated by the probability that two of the X_i are inside small regions around y_r and y_s. To make them Y_r and Y_s, of the other X_i, r − 1 must lie below y_r, s − r − 1 between y_r and y_s, and n − s beyond y_s.
• f_{X_max,X_min}(u, v) Δu Δv can be approximated by
∑_{(j,k)∈S} P[ X_j ∈ [u, u+Δu), X_k ∈ [v, v+Δv), and ∀i ∈ [n]∖{j,k}, v < X_i ≤ u ],  v ≤ u,
where S is the set of all n(n−1) pairs (j, k) from [n] × [n] with j ≠ k. This simply chooses j and k so that X_j will be the maximum with value around u and X_k will be the minimum with value around v. Of course, the rest of the X_i have to be between the min and max.
◦ When n = 2, we can use (29) to get
f_{X_max,X_min}(u, v) = f_{X_1,X_2}(u, v) + f_{X_1,X_2}(v, u),  v ≤ u.
Note that the joint density at a point y_I is 0 if the elements of y_I are not arranged in the "right" (ascending) order.
11.19 (Distribution functions). We note again that the cdf may be obtained by integration of the densities in (11.18), as well as by direct arguments valid also in the discrete case.
(a) The marginal cdf is
F_{Y_k}(y) = ∑_{j=k}^n ∑_{I ∈ ([n] choose j)} P[ ∀i ∈ I, X_i ≤ y and ∀r ∈ [n]∖I, X_r > y ].
This is because the event [Y_k ≤ y] is the same as the event that at least k of the X_i are ≤ y. In other words, let N(a) = ∑_{i=1}^n 1[X_i ≤ a] be the number of X_i which are ≤ a. Then
[Y_k ≤ y] = [N(y) ≥ k] = ∪_{j≥k} [N(y) = j],   (30)
where the union is a disjoint union. Hence, we sum the probability that exactly j of the X_i are ≤ y, for each j ≥ k. Alternatively, note that the event [Y_k ≤ y] can also be expressed as the disjoint union
∪_{j≥k} [ X_j ≤ y and exactly k − 1 of X_1, …, X_{j−1} are ≤ y ].
This gives
F_{Y_k}(y) = ∑_{j=k}^n ∑_{I ∈ ([j−1] choose k−1)} P[ X_j ≤ y, ∀i ∈ I, X_i ≤ y, and ∀r ∈ [j−1]∖I, X_r > y ].
(b) For r < s, because Y_r ≤ Y_s, we have
[Y_r ≤ y_r] ∩ [Y_s ≤ y_s] = [Y_s ≤ y_s],  y_s ≤ y_r.
By (30), for y_r < y_s,
[Y_r ≤ y_r] ∩ [Y_s ≤ y_s] = ( ∪_{j=r}^n [N(y_r) = j] ) ∩ ( ∪_{m=s}^n [N(y_s) = m] ) = ∪_{m=s}^n ∪_{j=r}^m [N(y_r) = j and N(y_s) = m],
where the upper limit of the second union is changed from n to m because we must have N(y_r) ≤ N(y_s). Now, to have N(y_r) = j and N(y_s) = m for m ≥ j is to put j of the X_i in (−∞, y_r], m − j of the X_i in (y_r, y_s], and n − m of the X_i in (y_s, ∞).
(c) Alternatively, for X_max and X_min, we have
F_{X_max,X_min}(u, v) = P(A_u ∩ B_v^c) = P(A_u) − P(A_u ∩ B_v) = P[∀i, X_i ≤ u] − P[∀i, v < X_i ≤ u],
where the second term is 0 when u < v. So,
F_{X_max,X_min}(u, v) = F_{X_1,…,X_n}(u, …, u)
when u < v. When v < u, the second term can be found by (21), which gives
F_{X_max,X_min}(u, v) = F_{X_1,…,X_n}(u, …, u) − ∑_{w∈S} (−1)^{|{i : w_i = v}|} F_{X_1,…,X_n}(w) = ∑_{w ∈ S∖{(u,…,u)}} (−1)^{|{i : w_i = v}|+1} F_{X_1,…,X_n}(w),
where S = {u, v}^n is the set of all 2^n vertices w of the "box" ∏_{i∈[n]} (a_i, b_i] (here with a_i = v and b_i = u). The joint density is 0 for u < v.
11.20. For independent X_i's:
(a) f_Y(y) = ∑_{x∈S_y} ∏_{i=1}^n f_{X_i}(x_i).
(b) Two forms of the marginal cdf:
F_{Y_k}(y) = ∑_{j=k}^n ∑_{I ∈ ([n] choose j)} ∏_{i∈I} F_{X_i}(y) ∏_{r∈[n]∖I} (1 − F_{X_r}(y))
           = ∑_{j=k}^n ∑_{I ∈ ([j−1] choose k−1)} F_{X_j}(y) ∏_{i∈I} F_{X_i}(y) ∏_{r∈[j−1]∖I} (1 − F_{X_r}(y)).
(c) F_{X_max,X_min}(u, v) = ∏_{k∈[n]} F_{X_k}(u) − { 0, u ≤ v;
                                                     ∏_{k∈[n]} (F_{X_k}(u) − F_{X_k}(v)), v < u. }
(d) The marginal cdf is
F_{Y_k}(y) = ∑_{j=k}^n ∑_{I ∈ ([n] choose j)} ∏_{i∈I} F_{X_i}(y) ∏_{r∈[n]∖I} (1 − F_{X_r}(y)).
(e) F_{X_min}(v) = 1 − ∏_i (1 − F_{X_i}(v)).
(f) F_{X_max}(u) = ∏_i F_{X_i}(u).
11.21. Suppose the X_i are i.i.d. ∼ X with common density f and distribution function F.
(a) The joint density is given by
f_Y(y) = n! f(y_1) f(y_2) ⋯ f(y_n),  y_1 ≤ y_2 ≤ ⋯ ≤ y_n.
If we define y_{n_0} = −∞, y_{n_{k+1}} = ∞, n_0 = 0, and n_{k+1} = n + 1, then for k ∈ [n] and 1 ≤ n_1 < ⋯ < n_k ≤ n, the joint density f_{Y_{n_1},Y_{n_2},…,Y_{n_k}}(y_{n_1}, y_{n_2}, …, y_{n_k}) is given by
n! ∏_{j=1}^k f(y_{n_j}) · ∏_{j=0}^{k} (F(y_{n_{j+1}}) − F(y_{n_j}))^{n_{j+1}−n_j−1} / (n_{j+1} − n_j − 1)!.
In particular, for r < s, the joint density f_{Y_r,Y_s}(y_r, y_s) is given by
(n! / ((r−1)!(s−r−1)!(n−s)!)) f(y_r) f(y_s) F^{r−1}(y_r) (F(y_s) − F(y_r))^{s−r−1} (1 − F(y_s))^{n−s}
[2, Theorem 5.4.6 p 230].
(b) The joint cdf F_{Y_r,Y_s}(y_r, y_s) is given by
{ F_{Y_s}(y_s), y_s ≤ y_r;
  ∑_{m=s}^n ∑_{j=r}^m (n! / (j!(m−j)!(n−m)!)) (F(y_r))^j (F(y_s) − F(y_r))^{m−j} (1 − F(y_s))^{n−m}, y_r < y_s. }
(c) F_{X_max,X_min}(u, v) = (F(u))^n − { 0, u ≤ v;
                                         (F(u) − F(v))^n, v < u. }
(d) f_{X_max,X_min}(u, v) = { 0, u ≤ v;
                              n(n−1) f_X(u) f_X(v) (F(u) − F(v))^{n−2}, v < u. }
(e) Marginal cdf:
F_{Y_k}(y) = ∑_{j=k}^n C(n, j) (F(y))^j (1 − F(y))^{n−j}
           = ∑_{j=k}^n C(j−1, k−1) (F(y))^k (1 − F(y))^{j−k}
           = (F(y))^k ∑_{m=0}^{n−k} C(k+m−1, k−1) (1 − F(y))^m
           = (n! / ((k−1)!(n−k)!)) ∫_0^{F(y)} t^{k−1} (1 − t)^{n−k} dt.
Note that N(y) ∼ B(n, F(y)). The last equality comes from integrating the marginal density f_{Y_k} in (31) with the change of variable t = F(y).
(i) F_{X_max}(y) = (F(y))^n and f_{X_max}(y) = n (F(y))^{n−1} f_X(y).
(ii) F_{X_min}(y) = 1 − (1 − F(y))^n and f_{X_min}(y) = n (1 − F(y))^{n−1} f_X(y).
(f) Marginal density:
f_{Y_k}(y) = (n! / ((k−1)!(n−k)!)) (F(y))^{k−1} (1 − F(y))^{n−k} f_X(y)   (31)
[2, Theorem 5.4.4 p 229].
Consider a small neighborhood Δ_y around y. To have Y_k ∈ Δ_y, we must have exactly one of the X_i's in Δ_y, exactly k − 1 of them less than y, and exactly n − k of them greater than y. There are n·C(n−1, k−1) = n!/((k−1)!(n−k)!) = 1/B(k, n−k+1) possible setups.
(g) The range R is defined as R = X_max − X_min.
(i) For x > 0, f_R(x) = n(n−1) ∫ (F(u) − F(u−x))^{n−2} f(u−x) f(u) du.
(ii) For x ≥ 0, F_R(x) = n ∫ (F(u) − F(u−x))^{n−1} f(u) du.
Both the pdf and the cdf above are derived by first finding the distribution of the range conditioned on the value of X_max = u.
See also [4, Sec. 2.2] and [2, Sec. 5.4].
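The extreme-value formulas in 11.21(e)(i)-(ii) are easy to check by simulation. A sketch for X_i i.i.d. U(0,1), where F(y) = y, so F_{X_max}(y) = y^n and F_{X_min}(y) = 1 − (1−y)^n; the parameters n, y, and the seed are arbitrary choices:

```python
import random

random.seed(3)

n, trials, y = 5, 100_000, 0.7
count_max = count_min = 0
for _ in range(trials):
    xs = [random.random() for _ in range(n)]
    count_max += max(xs) <= y      # event [X_max <= y]
    count_min += min(xs) <= y      # event [X_min <= y]

print(round(count_max / trials, 3))   # ≈ 0.7**5 ≈ 0.168
print(round(count_min / trials, 3))   # ≈ 1 - 0.3**5 ≈ 0.998
```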
11.22. Let X_1, X_2, …, X_n be a random sample from a discrete distribution with pmf p_X(x_i) = p_i, where x_1 < x_2 < ⋯ are the possible values of X in ascending order. Define P_0 = 0 and P_i = ∑_{k=1}^i p_k. Then
P[Y_j ≤ x_i] = ∑_{k=j}^n C(n, k) P_i^k (1 − P_i)^{n−k}
and
P[Y_j = x_i] = ∑_{k=j}^n C(n, k) [ P_i^k (1 − P_i)^{n−k} − P_{i−1}^k (1 − P_{i−1})^{n−k} ].
Example 11.23. If U_1, U_2, …, U_k are independent and uniformly distributed on the interval from 0 to t_0, then they have joint pdf
f_{U_1^k}(u_1^k) = { 1/t_0^k, 0 ≤ u_i ≤ t_0;
                     0, otherwise. }
The order statistics τ_1, τ_2, …, τ_k corresponding to U_1, U_2, …, U_k have joint pdf
f_{τ_1^k}(t_1^k) = { k!/t_0^k, 0 ≤ t_1 ≤ t_2 ≤ ⋯ ≤ t_k ≤ t_0;
                     0, otherwise. }
Example 11.24 ($n = 2$). Suppose $U = \max(X,Y)$ and $V = \min(X,Y)$ where $X$ and $Y$ have joint cdf $F_{X,Y}$.
$$F_{U,V}(u,v) = \begin{cases} F_{X,Y}(u,u), & u \le v,\\ F_{X,Y}(v,u) + F_{X,Y}(u,v) - F_{X,Y}(v,v), & u > v,\end{cases}$$
$$F_U(u) = F_{X,Y}(u,u),$$
$$F_V(v) = F_X(v) + F_Y(v) - F_{X,Y}(v,v).$$
[9, Ex 7.5, 7.6]. The joint density is
$$f_{U,V}(u,v) = f_{X,Y}(u,v) + f_{X,Y}(v,u), \qquad v < u.$$
The marginal densities are given by
$$f_U(u) = \left.\frac{\partial}{\partial x} F_{X,Y}(x,y)\right|_{\substack{x=u\\ y=u}} + \left.\frac{\partial}{\partial y} F_{X,Y}(x,y)\right|_{\substack{x=u\\ y=u}} = \int_{-\infty}^{u} f_{X,Y}(x,u)\,dx + \int_{-\infty}^{u} f_{X,Y}(u,y)\,dy,$$
$$\begin{aligned} f_V(v) &= f_X(v) + f_Y(v) - f_U(v)\\ &= f_X(v) + f_Y(v) - \left(\left.\frac{\partial}{\partial x} F_{X,Y}(x,y)\right|_{\substack{x=v\\ y=v}} + \left.\frac{\partial}{\partial y} F_{X,Y}(x,y)\right|_{\substack{x=v\\ y=v}}\right)\\ &= f_X(v) + f_Y(v) - \left(\int_{-\infty}^{v} f_{X,Y}(x,v)\,dx + \int_{-\infty}^{v} f_{X,Y}(v,y)\,dy\right)\\ &= \int_{v}^{\infty} f_{X,Y}(v,y)\,dy + \int_{v}^{\infty} f_{X,Y}(x,v)\,dx.\end{aligned}$$
If, furthermore, $X$ and $Y$ are independent, then
$$F_{U,V}(u,v) = \begin{cases} F_X(u)\,F_Y(u), & u \le v,\\ F_X(v)\,F_Y(u) + F_X(u)\,F_Y(v) - F_X(v)\,F_Y(v), & u > v,\end{cases}$$
$$F_U(u) = F_X(u)\,F_Y(u),$$
$$F_V(v) = F_X(v) + F_Y(v) - F_X(v)\,F_Y(v),$$
$$f_U(u) = f_X(u)\,F_Y(u) + F_X(u)\,f_Y(u),$$
$$f_V(v) = f_X(v) + f_Y(v) - f_X(v)\,F_Y(v) - F_X(v)\,f_Y(v).$$
If, furthermore, $X$ and $Y$ are i.i.d., then
$$F_U(u) = F^2(u), \qquad F_V(v) = 2F(v) - F^2(v), \qquad f_U(u) = 2f(u)\,F(u), \qquad f_V(v) = 2f(v)\,(1-F(v)).$$
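The i.i.d. specializations above are easy to check by simulation. The following Python sketch (not part of the original notes; Uniform(0,1) and the evaluation point are illustrative choices) compares the empirical cdfs of $\max(X,Y)$ and $\min(X,Y)$ with $F^2$ and $2F - F^2$.

```python
import random

def empirical_cdfs(t, trials=200_000, seed=1):
    # Empirical P[max(X,Y) <= t] and P[min(X,Y) <= t] for X, Y i.i.d. Uniform(0,1)
    rng = random.Random(seed)
    max_hits = min_hits = 0
    for _ in range(trials):
        x, y = rng.random(), rng.random()
        if max(x, y) <= t:
            max_hits += 1
        if min(x, y) <= t:
            min_hits += 1
    return max_hits / trials, min_hits / trials

t = 0.3
emp_max, emp_min = empirical_cdfs(t)
exact_max = t ** 2          # F_U(u) = F(u)^2 with F(u) = u
exact_min = 2 * t - t ** 2  # F_V(v) = 2F(v) - F(v)^2
```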
11.25. Let the $X_i$ be i.i.d. with density $f$ and cdf $F$. The range $R$ is defined as $R = X_{\max} - X_{\min}$.
$$F_R(x) = \begin{cases} 0, & x < 0,\\ n\int (F(v)-F(v-x))^{n-1} f(v)\,dv, & x \ge 0,\end{cases}$$
$$f_R(x) = \begin{cases} 0, & x < 0,\\ n(n-1)\int (F(v)-F(v-x))^{n-2} f(v-x)\,f(v)\,dv, & x > 0.\end{cases}$$
For example, when $X_i \overset{\text{i.i.d.}}{\sim} \mathcal{U}(0,1)$,
$$f_R(x) = n(n-1)\,x^{n-2}(1-x), \qquad 0 \le x \le 1$$
[15, Ex 9F p 322–323].
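In the uniform case the density above is a Beta$(n-1, 2)$ density, so $E[R] = \frac{n-1}{n+1}$. A short simulation (an illustrative sketch, not part of the original notes; $n = 6$ is arbitrary) confirms this mean.

```python
import random

def mean_range(n, trials=200_000, seed=2):
    # Average of R = X_max - X_min over samples of n i.i.d. Uniform(0,1)
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        xs = [rng.random() for _ in range(n)]
        total += max(xs) - min(xs)
    return total / trials

n = 6
emp = mean_range(n)
exact = (n - 1) / (n + 1)  # mean of the Beta(n-1, 2) density above
```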
12 Convergences
Definition 12.1. A sequence of random variables $(X_n)$ converges pointwise to $X$ if
$$\forall \omega \in \Omega \quad \lim_{n\to\infty} X_n(\omega) = X(\omega).$$
Definition 12.2 (Strong Convergence). The following statements are all equivalent conditions/notations for a sequence of random variables $(X_n)$ to converge almost surely to a random variable $X$:
(a) $X_n \xrightarrow{\text{a.s.}} X$
  (i) $X_n \to X$ a.s.
  (ii) $X_n \to X$ with probability 1.
  (iii) $X_n \to X$ w.p. 1
  (iv) $\lim_{n\to\infty} X_n = X$ a.s.
(b) $(X_n - X) \xrightarrow{\text{a.s.}} 0$
(c) $P[X_n \to X] = 1$
  (i) $P\left[\left\{\omega : \lim_{n\to\infty} X_n(\omega) = X(\omega)\right\}\right] = 1$
  (ii) $P\left[\left\{\omega : \lim_{n\to\infty} X_n(\omega) = X(\omega)\right\}^c\right] = 0$
  (iii) $P[X_n \nrightarrow X] = 0$
  (iv) $P\left[\left\{\omega : \lim_{n\to\infty} |X_n(\omega) - X(\omega)| = 0\right\}\right] = 1$
(d) $\forall \varepsilon > 0 \quad P\left[\left\{\omega : \lim_{n\to\infty} |X_n(\omega) - X(\omega)| < \varepsilon\right\}\right] = 1$
12.3. Properties of convergence a.s.
(a) Uniqueness: if $X_n \xrightarrow{\text{a.s.}} X$ and $X_n \xrightarrow{\text{a.s.}} Y$, then $X = Y$ a.s.
(b) If $X_n \xrightarrow{\text{a.s.}} X$ and $Y_n \xrightarrow{\text{a.s.}} Y$, then
  (i) $X_n + Y_n \xrightarrow{\text{a.s.}} X + Y$
  (ii) $X_n Y_n \xrightarrow{\text{a.s.}} XY$
(c) $g$ continuous, $X_n \xrightarrow{\text{a.s.}} X \Rightarrow g(X_n) \xrightarrow{\text{a.s.}} g(X)$
(d) Suppose that $\forall \varepsilon > 0$, $\sum_{n=1}^{\infty} P[|X_n - X| > \varepsilon] < \infty$; then $X_n \xrightarrow{\text{a.s.}} X$.
(e) Let $A_1, A_2, \ldots$ be independent. Then $1_{A_n} \xrightarrow{\text{a.s.}} 0$ if and only if $\sum_n P(A_n) < \infty$.
Definition 12.4 (Convergence in probability). The following statements are all equivalent conditions/notations for a sequence of random variables $(X_n)$ to converge in probability to a random variable $X$:
(a) $X_n \xrightarrow{P} X$
  (i) $X_n \overset{P}{\to} X$
  (ii) $\operatorname{p\,lim}_{n\to\infty} X_n = X$
(b) $(X_n - X) \xrightarrow{P} 0$
(c) $\forall \varepsilon > 0 \quad \lim_{n\to\infty} P[|X_n - X| < \varepsilon] = 1$
  (i) $\forall \varepsilon > 0 \quad \lim_{n\to\infty} P(\{\omega : |X_n(\omega) - X(\omega)| > \varepsilon\}) = 0$
  (ii) $\forall \varepsilon > 0\ \forall \delta > 0\ \exists N_\delta \in \mathbb{N}$ such that $\forall n \ge N_\delta$, $P[|X_n - X| > \varepsilon] < \delta$
  (iii) $\forall \varepsilon > 0 \quad \lim_{n\to\infty} P[|X_n - X| > \varepsilon] = 0$
  (iv) The strict inequality between $|X_n - X|$ and $\varepsilon$ can be replaced by the corresponding "non-strict" version.
12.5. Properties of convergence in probability
(a) Uniqueness: If $X_n \xrightarrow{P} X$ and $X_n \xrightarrow{P} Y$, then $X = Y$ a.s.
(b) Suppose $X_n \xrightarrow{P} X$, $Y_n \xrightarrow{P} Y$, and $a_n \to a$; then
  (i) $X_n + Y_n \xrightarrow{P} X + Y$
  (ii) $a_n X_n \xrightarrow{P} aX$
  (iii) $X_n Y_n \xrightarrow{P} XY$
(c) Suppose $(X_n)$ are i.i.d. with distribution $\mathcal{U}[0, \theta]$. Let $Z_n = \max\{X_i : 1 \le i \le n\}$. Then $Z_n \xrightarrow{P} \theta$.
(d) $g$ continuous, $X_n \xrightarrow{P} X \Rightarrow g(X_n) \xrightarrow{P} g(X)$
  (i) Suppose that $g : \mathbb{R}^d \to \mathbb{R}$ is continuous. Then $\forall i\ X_{i,n} \xrightarrow[n\to\infty]{P} X_i$ implies $g(X_{1,n}, \ldots, X_{d,n}) \xrightarrow[n\to\infty]{P} g(X_1, \ldots, X_d)$.
(e) Let $g$ be a continuous function at $c$. Then $X_n \xrightarrow{P} c \Rightarrow g(X_n) \xrightarrow{P} g(c)$.
(f) Fatou's lemma: $0 \le X_n \xrightarrow{P} X \Rightarrow \liminf_{n\to\infty} EX_n \ge EX$
(g) Suppose $X_n \xrightarrow{P} X$ and $|X_n| \le Y$ with $EY < \infty$; then $EX_n \to EX$.
(h) Let $A_1, A_2, \ldots$ be independent. Then $1_{A_n} \xrightarrow{P} 0$ iff $P(A_n) \to 0$.
(i) $X_n \xrightarrow{P} 0$ iff $\exists \delta > 0$ such that $\forall t \in [-\delta, \delta]$ we have $\varphi_{X_n}(t) \to 1$.
Definition 12.6 (Weak convergence for probability measures). Let $P_n$ and $P$ be probability measures on $\mathbb{R}^d$ ($d \ge 1$). The sequence $P_n$ converges weakly to $P$ if the sequence of real numbers $\int g\,dP_n \to \int g\,dP$ for any $g$ which is real-valued, continuous, and bounded on $\mathbb{R}^d$.
12.7. Let $(X_n)$, $X$ be $\mathbb{R}^d$-valued random variables with distribution functions $(F_n)$, $F$, distributions $(\mu_n)$, $\mu$, and ch.f. $(\varphi_n)$, $\varphi$, respectively.
The following are equivalent conditions for a sequence of random variables $(X_n)$ to converge in distribution to a random variable $X$:
(a) $(X_n)$ converges in distribution (or in law) to $X$
  (i) $X_n \Rightarrow X$
    i. $X_n \xrightarrow{\mathcal{L}} X$
    ii. $X_n \xrightarrow{\mathcal{D}} X$
  (ii) $F_n \Rightarrow F$
    i. $F_{X_n} \Rightarrow F_X$
    ii. $F_n$ converges weakly to $F$
    iii. $\lim_{n\to\infty} P_{X_n}(A) = P_X(A)$ for every $A$ of the form $A = (-\infty, x]$ for which $P_X\{x\} = 0$
  (iii) $\mu_n \Rightarrow \mu$
    i. $P_{X_n}$ converges weakly to $P_X$
(b) Skorohod's theorem: $\exists$ random variables $Y_n$ and $Y$ on a common probability space $(\Omega, \mathcal{F}, P)$ such that $Y_n \overset{\mathcal{D}}{=} X_n$, $Y \overset{\mathcal{D}}{=} X$, and $Y_n \to Y$ on (the whole) $\Omega$
(c) $\lim_{n\to\infty} F_n = F$ at all continuity points of $F$
  (i) $F_{X_n}(x) \to F_X(x)$ $\forall x$ such that $P[X = x] = 0$
(d) $\exists$ a (countable) set $D$ dense in $\mathbb{R}$ such that $F_n(x) \to F(x)$ $\forall x \in D$
(e) $\lim_{n\to\infty} Eg(X_n) = Eg(X)$ for all $g$ which is real-valued, continuous, and bounded on $\mathbb{R}^d$
  (i) $\lim_{n\to\infty} Eg(X_n) = Eg(X)$ for all bounded real-valued functions $g$ such that $P[X \in D_g] = 0$, where $D_g$ is the set of points of discontinuity of $g$
  (ii) $\lim_{n\to\infty} Eg(X_n) = Eg(X)$ for all bounded Lipschitz continuous functions $g$
  (iii) $\lim_{n\to\infty} Eg(X_n) = Eg(X)$ for all bounded uniformly continuous functions $g$
  (iv) $\lim_{n\to\infty} Eg(X_n) = Eg(X)$ for all complex-valued functions $g$ whose real and imaginary parts are bounded and continuous
(f) Continuity Theorem: $\varphi_{X_n} \to \varphi_X$
  (i) For nonnegative random variables: $\forall s \ge 0$, $L_{X_n}(s) \to L_X(s)$, where $L_X(s) = Ee^{-sX}$
Note that there is no requirement that $(X_n)$ and $X$ be defined on the same probability space $(\Omega, \mathcal{F}, P)$.
12.8. Continuity Theorem: Suppose $\lim_{n\to\infty} \varphi_n(t)$ exists $\forall t$; call this limit $\varphi_\infty$. Furthermore, suppose $\varphi_\infty$ is continuous at 0. Then there exists a probability distribution $\mu_\infty$ such that $\mu_n \Rightarrow \mu_\infty$, and $\varphi_\infty$ is the characteristic function of $\mu_\infty$.
12.9. Properties of convergence in distribution
(a) If $F_n \Rightarrow F$ and $F_n \Rightarrow G$, then $F = G$.
(b) Suppose $X_n \Rightarrow X$.
  (i) If $P[X$ is a discontinuity point of $g] = 0$, then $g(X_n) \Rightarrow g(X)$.
  (ii) $Eg(X_n) \to Eg(X)$ for every bounded real-valued function $g$ such that $P[X \in D_g] = 0$, where $D_g$ is the set of points of discontinuity of $g$.
    i. $g(X_n) \Rightarrow g(X)$ for $g$ continuous.
  (iii) If $Y_n \xrightarrow{P} 0$, then $X_n + Y_n \Rightarrow X$.
  (iv) If $X_n - Y_n \Rightarrow 0$, then $Y_n \Rightarrow X$.
(c) If $X_n \Rightarrow a$ and $g$ is continuous at $a$, then $g(X_n) \Rightarrow g(a)$.
(d) Suppose $(\mu_n)$ is a sequence of probability measures on $\mathbb{R}$ that are all point masses with $\mu_n(\{\alpha_n\}) = 1$. Then $\mu_n$ converges weakly to a limit $\mu$ iff $\alpha_n \to \alpha$; and in this case $\mu$ is a point mass at $\alpha$.
(e) Scheffé's theorem:
  (i) Suppose $P_{X_n}$ and $P_X$ have densities $\delta_n$ and $\delta$ w.r.t. the same measure $\mu$. Then $\delta_n \to \delta$ $\mu$-a.e. implies
    i. $\forall B \in \mathcal{B}_{\mathbb{R}}$, $P_{X_n}(B) \to P_X(B)$
      • $F_{X_n} \to F_X$
      • $F_{X_n} \Rightarrow F_X$
    ii. Suppose $g$ is bounded. Then $\int g(x)\,P_{X_n}(dx) \to \int g(x)\,P_X(dx)$. Equivalently, $E[g(X_n)] \to E[g(X)]$, where each $E$ is defined with respect to the appropriate $P$.
  (ii) Remarks:
    i. For absolutely continuous random variables, $\mu$ is the Lebesgue measure and $\delta$ is the probability density function.
    ii. For discrete random variables, $\mu$ is the counting measure and $\delta$ is the probability mass function.
(f) Normal r.v.
  (i) Let $X_n \sim \mathcal{N}(\mu_n, \sigma_n^2)$. Suppose $\mu_n \to \mu \in \mathbb{R}$ and $\sigma_n^2 \to \sigma^2 \ge 0$. Then $X_n \Rightarrow \mathcal{N}(\mu, \sigma^2)$.
  (ii) Suppose that the $X_n$ are normal random variables and let $X_n \Rightarrow X$. Then (1) the mean and variance of $X_n$ converge to some limits $m$ and $\sigma^2$, and (2) $X$ is normal with mean $m$ and variance $\sigma^2$.
(g) $X_n \Rightarrow 0$ if and only if $\exists \delta > 0$ such that $\forall t \in [-\delta, \delta]$ we have $\varphi_{X_n}(t) \to 1$.
12.10. Relationship between convergences

Figure 22: Relationship between convergences (a diagram of the implications among $L^p$ convergence, convergence in probability, almost-sure convergence, and convergence in distribution, including the dominated-convergence and continuous-mapping arrows listed below).
(a) For a discrete probability space, $X_n \xrightarrow{\text{a.s.}} X$ if and only if $X_n \xrightarrow{P} X$.
(b) Suppose $X_n \xrightarrow{P} X$, $\forall n\ |X_n| \le Y$, $Y \in L^p$. Then $X \in L^p$ and $X_n \xrightarrow{L^p} X$.
(c) Suppose $X_n \xrightarrow{\text{a.s.}} X$, $\forall n\ |X_n| \le Y$, $Y \in L^p$. Then $X \in L^p$ and $X_n \xrightarrow{L^p} X$.
(d) $X_n \xrightarrow{P} X \Rightarrow X_n \xrightarrow{\mathcal{D}} X$
(e) If $X_n \xrightarrow{\mathcal{D}} X$ and if $\exists a \in \mathbb{R}$ such that $X = a$ a.s., then $X_n \xrightarrow{P} X$.
  (i) Hence, when $X = a$ a.s., $X_n \xrightarrow{P} X$ and $X_n \xrightarrow{\mathcal{D}} X$ are equivalent.
See also Figure 22.
Example 12.11.
(a) $\Omega = [0,1]$, $P$ is uniform on $[0,1]$. $X_n(\omega) = \begin{cases} 0, & \omega \in \left[0,\, 1 - \frac{1}{n^2}\right],\\ e^n, & \omega \in \left(1 - \frac{1}{n^2},\, 1\right].\end{cases}$
  (i) $X_n \overset{L^p}{\nrightarrow} 0$
  (ii) $X_n \xrightarrow{\text{a.s.}} 0$
  (iii) $X_n \xrightarrow{P} 0$
(b) Let $\Omega = [0,1]$ with uniform probability distribution. Define $X_n(\omega) = \omega + 1_{\left[\frac{j-1}{k},\frac{j}{k}\right]}(\omega)$, where $k = \frac{1}{2}\left(\sqrt{1+8n}-1\right)$ and $j = n - \frac{1}{2}k(k-1)$. The sequence of intervals $\left[\frac{j-1}{k},\frac{j}{k}\right]$ under the indicator function is shown in Figure 23.

Figure 23: Diagram for Example 12.11(b): the intervals are $[0,1]$; then $\left[0,\frac{1}{2}\right], \left[\frac{1}{2},1\right]$; then $\left[0,\frac{1}{3}\right], \left[\frac{1}{3},\frac{2}{3}\right], \left[\frac{2}{3},1\right]$; then $\left[0,\frac{1}{4}\right], \left[\frac{1}{4},\frac{1}{2}\right], \left[\frac{1}{2},\frac{3}{4}\right], \left[\frac{3}{4},1\right]$; and so on.

Let $X(\omega) = \omega$.
  (i) The sequence of real numbers $(X_n(\omega))$ does not converge for any $\omega$.
  (ii) $X_n \xrightarrow{L^p} X$
  (iii) $X_n \xrightarrow{P} X$
(c) $X_n = \begin{cases} \frac{1}{n}, & \text{w.p. } 1 - \frac{1}{n},\\ n, & \text{w.p. } \frac{1}{n}.\end{cases}$
  (i) $X_n \xrightarrow{P} 0$
  (ii) $X_n \overset{L^p}{\nrightarrow} 0$
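Example 12.11(c) can be made concrete with a small computation (an illustrative sketch, not part of the original notes): $P[|X_n| > \varepsilon] = \frac{1}{n} \to 0$ gives convergence in probability, while $E[X_n] \to 1 \ne 0$ rules out $L^1$ convergence to 0.

```python
# Example 12.11(c): X_n = 1/n w.p. 1 - 1/n, and X_n = n w.p. 1/n.
def tail_prob(n, eps=0.5):
    # P[|X_n - 0| > eps]; once 1/n <= eps, only the outcome X_n = n exceeds eps
    return 1.0 / n if 1.0 / n <= eps else 1.0

def mean(n):
    # E[X_n] = (1/n)(1 - 1/n) + n * (1/n) = 1 + 1/n - 1/n**2
    return (1.0 / n) * (1 - 1.0 / n) + n * (1.0 / n)

ns = (10, 100, 1000)
tails = [tail_prob(n) for n in ns]  # -> 0: convergence in probability to 0
means = [mean(n) for n in ns]       # -> 1: so X_n does not converge to 0 in L^1
```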
12.1 Summation of random variables
Set $S_n = \sum_{i=1}^{n} X_i$.
12.12 (Markov's theorem; Chebyshev's inequality). For finite-variance $(X_i)$, if $\frac{1}{n^2}\operatorname{Var} S_n \to 0$, then $\frac{1}{n} S_n - \frac{1}{n} ES_n \xrightarrow{P} 0$.
• If the $(X_i)$ are pairwise independent, then $\frac{1}{n^2}\operatorname{Var} S_n = \frac{1}{n^2}\sum_{i=1}^{n}\operatorname{Var} X_i$.
See also (12.16).
12.2 Summation of independent random variables
12.13. For independent $X_n$, the probability that the series $\sum_{i=1}^{\infty} X_i$ converges is either 0 or 1.
12.14 (Kolmogorov's SLLN). Consider a sequence $(X_n)$ of independent random variables. If $\sum_n \frac{\operatorname{Var} X_n}{n^2} < \infty$, then $\frac{1}{n} S_n - \frac{1}{n} ES_n \xrightarrow{\text{a.s.}} 0$.
• In particular, for independent $X_n$, if $\frac{1}{n}\sum_{i=1}^{n} EX_i \to a$ or $EX_n \to a$, then $\frac{1}{n} S_n \xrightarrow{\text{a.s.}} a$.
12.15. Suppose that $X_1, X_2, \ldots$ are independent and that the series $X = X_1 + X_2 + \cdots$ converges a.s. Then $\varphi_X = \varphi_{X_1}\varphi_{X_2}\cdots$.
12.16. For pairwise independent $(X_i)$ with finite variances, if $\frac{1}{n^2}\sum_{i=1}^{n}\operatorname{Var} X_i \to 0$, then $\frac{1}{n} S_n - \frac{1}{n} ES_n \xrightarrow{P} 0$.
(a) Chebyshev's Theorem: For pairwise independent $(X_i)$ with uniformly bounded variances, $\frac{1}{n} S_n - \frac{1}{n} ES_n \xrightarrow{P} 0$.
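The concentration of $\frac{1}{n}S_n$ around its mean is easy to see numerically. The following Python sketch (not part of the original notes; Uniform(0,1) samples, $\varepsilon = 0.05$, and the sample sizes are illustrative) estimates $P\left[\left|\frac{1}{n}S_n - EX\right| \ge \varepsilon\right]$ for a small and a large $n$.

```python
import random

def tail_prob(n, eps=0.05, trials=5_000, seed=3):
    # Empirical P[|S_n/n - EX| >= eps] for i.i.d. Uniform(0,1); EX = 1/2
    rng = random.Random(seed)
    bad = sum(1 for _ in range(trials)
              if abs(sum(rng.random() for _ in range(n)) / n - 0.5) >= eps)
    return bad / trials

p20, p500 = tail_prob(20), tail_prob(500)
# Chebyshev bound with Var X = 1/12: P <= (1/12) / (n * eps^2); for n = 500, about 0.067.
```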
12.3 Summation of i.i.d. random variables
Let $(X_i)$ be i.i.d. random variables.
12.17 (Chebyshev's inequality). $P\left[\left|\frac{1}{n}\sum_{i=1}^{n} X_i - EX_1\right| \ge \varepsilon\right] \le \frac{1}{\varepsilon^2}\frac{\operatorname{Var}[X_1]}{n}$
• The $X_i$'s don't have to be independent; they only need to be pairwise uncorrelated.
12.18 (WLLN). Weak Law of Large Numbers:
(a) $L^2$ weak law (finite variance): Let $(X_i)$ be uncorrelated (or pairwise independent) random variables.
  (i) $\operatorname{Var} S_n = \sum_{i=1}^{n} \operatorname{Var} X_i$
  (ii) If $\frac{1}{n^2}\sum_{i=1}^{n} \operatorname{Var} X_i \to 0$, then $\frac{1}{n} S_n - \frac{1}{n} ES_n \xrightarrow{P} 0$.
  (iii) If $EX_i = \mu$ and $\operatorname{Var} X_i \le C < \infty$, then $\frac{1}{n}\sum_{i=1}^{n} X_i \xrightarrow{L^2} \mu$.
(b) Let $(X_i)$ be i.i.d. random variables with $EX_i = \mu$ and $\operatorname{Var} X_i = \sigma^2 < \infty$. Then
  (i) $P\left[\left|\frac{1}{n}\sum_{i=1}^{n} X_i - EX_1\right| \ge \varepsilon\right] \le \frac{1}{\varepsilon^2}\frac{\operatorname{Var}[X_1]}{n}$
  (ii) $\frac{1}{n}\sum_{i=1}^{n} X_i \xrightarrow{L^2} \mu$
  (The fact that $\sigma^2 < \infty$ implies $|\mu| < \infty$.)
(c) $\frac{1}{n}\sum_{i=1}^{n} X_i \xrightarrow{L^2} \mu$ implies $\frac{1}{n}\sum_{i=1}^{n} X_i \xrightarrow{P} \mu$, which in turn implies $\frac{1}{n}\sum_{i=1}^{n} X_i \xrightarrow{\mathcal{D}} \mu$.
(d) If the $X_n$ are i.i.d. random variables such that $\lim_{t\to\infty} t\,P[|X_1| > t] = 0$, then $\frac{1}{n} S_n - EX_1^{(n)} \xrightarrow{P} 0$.
(e) Khintchine's theorem: If the $X_n$ are i.i.d. random variables and $E|X_1| < \infty$, then $EX_1^{(n)} \to EX_1$ and $\frac{1}{n} S_n \xrightarrow{P} EX_1$.
  • No assumption about the finiteness of variance.
12.19 (SLLN).
(a) Kolmogorov's SLLN: Consider a sequence $(X_n)$ of independent random variables. If $\sum_n \frac{\operatorname{Var} X_n}{n^2} < \infty$, then $\frac{1}{n} S_n - \frac{1}{n} ES_n \xrightarrow{\text{a.s.}} 0$.
  • In particular, for independent $X_n$, if $\frac{1}{n}\sum_{i=1}^{n} EX_i \to a$ or $EX_n \to a$, then $\frac{1}{n} S_n \xrightarrow{\text{a.s.}} a$.
(b) Khintchine's SLLN: If the $X_i$'s are i.i.d. with finite mean $\mu$, then $\frac{1}{n}\sum_{i=1}^{n} X_i \xrightarrow{\text{a.s.}} \mu$.
(c) Consider a sequence $(X_n)$ of i.i.d. random variables. Suppose $EX_1^- < \infty$ and $EX_1^+ = \infty$; then $\frac{1}{n} S_n \xrightarrow{\text{a.s.}} \infty$.
  • Suppose that the $X_n \ge 0$ are i.i.d. random variables and $EX_n = \infty$. Then $\frac{1}{n} S_n \xrightarrow{\text{a.s.}} \infty$.
12.20 (Relationship between the LLN and the convergence of relative frequency to probability). Consider i.i.d. $Z_i \sim Z$. Let $X_i = 1_A(Z_i)$. Then $\frac{1}{n} S_n = \frac{1}{n}\sum_{i=1}^{n} 1_A(Z_i) = r_n(A)$, the relative frequency of an event $A$. Via the LLN and appropriate conditions, $r_n(A)$ converges to $E\left[\frac{1}{n}\sum_{i=1}^{n} 1_A(Z_i)\right] = P[Z \in A]$.
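The relative-frequency interpretation in 12.20 can be demonstrated directly. In this Python sketch (not part of the original notes), the event $A = \{Z \le 1/4\}$ with $Z \sim \mathcal{U}(0,1)$ is an arbitrary illustrative choice, so $P[Z \in A] = 1/4$.

```python
import random

def relative_frequency(n, seed=4):
    # r_n(A) for A = {Z <= 1/4}, Z ~ Uniform(0,1); P[Z in A] = 1/4
    rng = random.Random(seed)
    return sum(1 for _ in range(n) if rng.random() <= 0.25) / n

freqs = [relative_frequency(n) for n in (100, 10_000, 1_000_000)]
```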
12.4 Central Limit Theorem (CLT)
Suppose that $(X_k)_{k\ge1}$ is a sequence of i.i.d. random variables with mean $m$ and variance $0 < \sigma^2 < \infty$. Let $S_n = \sum_{k=1}^{n} X_k$.
12.21 (Lindeberg-Lévy theorem).
(a) $\frac{S_n - nm}{\sigma\sqrt{n}} = \frac{1}{\sqrt{n}}\sum_{k=1}^{n}\frac{X_k - m}{\sigma} \Rightarrow \mathcal{N}(0, 1)$.
(b) $\frac{S_n - nm}{\sqrt{n}} = \frac{1}{\sqrt{n}}\sum_{k=1}^{n}(X_k - m) \Rightarrow \mathcal{N}(0, \sigma^2)$.
To see this, let $Z_k = \frac{X_k - m}{\sigma} \overset{\text{i.i.d.}}{\sim} Z$ and $Y_n = \frac{1}{\sqrt{n}}\sum_{k=1}^{n} Z_k$. Then $EZ = 0$, $\operatorname{Var} Z = 1$, and $\varphi_{Y_n}(t) = \left(\varphi_Z\!\left(\frac{t}{\sqrt{n}}\right)\right)^n$. By approximating $e^x \approx 1 + x + \frac{1}{2}x^2$, we have $\varphi_X(t) \approx 1 + jtEX - \frac{1}{2}t^2 E[X^2]$ (see also (22)) and
$$\varphi_{Y_n}(t) = \left(1 - \frac{\tfrac{1}{2}t^2}{n}\right)^n \to e^{-\frac{t^2}{2}}.$$
• The case of Bernoulli(1/2) was derived by Abraham de Moivre around 1733. The case of Bernoulli($p$) for $0 < p < 1$ was considered by Pierre-Simon Laplace [9, p 208].
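A quick Monte Carlo illustration of the Lindeberg-Lévy theorem (a sketch, not part of the original notes; Uniform(0,1) summands with $n = 30$ are illustrative choices): the standardized sums should be approximately standard normal, so about half fall below 0 and about 84% below 1.

```python
import math
import random

def standardized_sums(n, trials=20_000, seed=5):
    # Z = (S_n - n m) / (sigma sqrt(n)) for i.i.d. Uniform(0,1): m = 1/2, sigma^2 = 1/12
    rng = random.Random(seed)
    m, sigma = 0.5, math.sqrt(1.0 / 12.0)
    return [(sum(rng.random() for _ in range(n)) - n * m) / (sigma * math.sqrt(n))
            for _ in range(trials)]

zs = standardized_sums(30)
frac_le_0 = sum(1 for z in zs if z <= 0) / len(zs)  # Phi(0) = 0.5
frac_le_1 = sum(1 for z in zs if z <= 1) / len(zs)  # Phi(1) is about 0.8413
```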
12.22 (Approximation of densities and pmfs using the CLT). Approximate the distribution of $S_n$ by $\mathcal{N}(nm, n\sigma^2)$.
• $F_{S_n}(s) \approx \Phi\left(\frac{s - nm}{\sigma\sqrt{n}}\right)$
• $f_{S_n}(s) \approx \frac{1}{\sqrt{2\pi}\,\sigma\sqrt{n}}\, e^{-\frac{1}{2}\left(\frac{s - nm}{\sigma\sqrt{n}}\right)^2}$
• If the $X_i$ are integer-valued, then
$$P[S_n = k] = P\left[k - \tfrac{1}{2} < S_n \le k + \tfrac{1}{2}\right] \approx \frac{1}{\sqrt{2\pi}\,\sigma\sqrt{n}}\, e^{-\frac{1}{2}\left(\frac{k - nm}{\sigma\sqrt{n}}\right)^2}$$
[9, eq (5.14) p 213].
The approximation is best for $s$, $k$ near $nm$ [9, p 211].
• The approximation $n! \approx \sqrt{2\pi}\,n^{n+\frac{1}{2}} e^{-n}$ can be derived by approximating the density of $S_n$ when $X_i \sim \mathcal{E}(1)$. We know that $f_{S_n}(s) = \frac{s^{n-1}e^{-s}}{(n-1)!}$. Approximating the density at $s = n$ gives $(n-1)! \approx \sqrt{2\pi}\,n^{n-\frac{1}{2}} e^{-n}$. Multiply through by $n$. [9, Ex 5.18 p 212]
• See also the normal approximation for the binomial in (13).
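The Stirling approximation derived above is easy to check numerically (an illustrative sketch, not part of the original notes):

```python
import math

def stirling(n):
    # n! is approximately sqrt(2 pi) * n**(n + 1/2) * exp(-n)
    return math.sqrt(2 * math.pi) * n ** (n + 0.5) * math.exp(-n)

ratios = [math.factorial(n) / stirling(n) for n in (5, 10, 20)]
# The ratio decreases toward 1 as n grows (roughly like 1 + 1/(12 n)).
```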
13 Conditional Probability and Expectation
13.1 Conditional Probability
Definition 13.1. Suppose that, conditioned on $X = x$, $Y$ has distribution $Q$. Then we write $Y|X = x \sim Q$ [21, p 40]. It might be clearer to write $P_{Y|X=x} = Q$.
13.2. Discrete random variables
(a) $p_{X|Y}(x|y) = \frac{p_{X,Y}(x,y)}{p_Y(y)}$
(b) $p_{X,Y}(x,y) = p_{X|Y}(x|y)\,p_Y(y) = p_{Y|X}(y|x)\,p_X(x)$
(c) The law of total probability: $P[Y \in C] = \sum_x P[Y \in C | X = x]\,P[X = x]$. In particular, $p_Y(y) = \sum_x p_{Y|X}(y|x)\,p_X(x)$.
(d) $p_{X|Y}(x|y) = \frac{p_{Y|X}(y|x)\,p_X(x)}{\sum_{x'} p_{Y|X}(y|x')\,p_X(x')}$
(e) If $Y = X + Z$ where $X$ and $Z$ are independent, then
  • $p_{Y|X}(y|x) = p_Z(y - x)$
  • $p_Y(y) = \sum_x p_Z(y - x)\,p_X(x)$
  • $p_{X|Y}(x|y) = \frac{p_Z(y - x)\,p_X(x)}{\sum_{x'} p_Z(y - x')\,p_X(x')}$.
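The convolution and posterior formulas in 13.2(e) can be checked exactly with rational arithmetic. The following Python sketch (not part of the original notes; two fair dice are an illustrative choice of $X$ and $Z$) computes $p_Y$ and $p_{X|Y}$ for $Y = X + Z$; given $Y = 7$, the posterior on $X$ is uniform over $\{1, \ldots, 6\}$.

```python
from fractions import Fraction

# X and Z are independent fair six-sided dice; Y = X + Z.
p_X = {x: Fraction(1, 6) for x in range(1, 7)}
p_Z = dict(p_X)

def p_Y(y):
    # p_Y(y) = sum_x p_Z(y - x) p_X(x)
    return sum(p_Z.get(y - x, Fraction(0)) * px for x, px in p_X.items())

def p_X_given_Y(x, y):
    # p_{X|Y}(x|y) = p_Z(y - x) p_X(x) / p_Y(y)
    return p_Z.get(y - x, Fraction(0)) * p_X[x] / p_Y(y)

total = sum(p_Y(y) for y in range(2, 13))             # sums to 1
posterior = [p_X_given_Y(x, 7) for x in range(1, 7)]  # uniform when y = 7
```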
(f) The substitution law of conditional probability:
$$P[g(X,Y) = z | X = x] = P[g(x,Y) = z | X = x].$$
• When $X$ and $Y$ are independent, we can "drop the conditioning."
13.3. Absolutely continuous random variables
(a) $f_{X|Y}(x|y) = \frac{f_{X,Y}(x,y)}{f_Y(y)}$.
(b) $F_{Y|X}(y|x) = \int_{-\infty}^{y} f_{Y|X}(t|x)\,dt$
13.4. $P[(X,Y) \in A] = E[P[(X,Y) \in A | X]] = \int_A f_{X,Y}(x,y)\,d(x,y)$
13.5. $\frac{\partial}{\partial z} F_{Y|X}(g(z,x)|x) = \frac{\partial}{\partial z}\int_{-\infty}^{g(z,x)} f_{Y|X}(y|x)\,dy = f_{Y|X}(g(z,x)|x)\,\frac{\partial}{\partial z} g(z,x)$
13.6. $F_{X|Y}(x|y) = P[X \le x | Y = y] = \lim_{\Delta\to0}\frac{P[X \le x,\ y - \Delta < Y \le y]}{P[y - \Delta < Y \le y]} = E\left[1_{(-\infty,x]}(X) \mid Y = y\right]$.
13.7. Define $f_{X|A}(x)$ to be $\frac{f_{X,A}(x)}{P(A)}$. Then
$$P[A | X = x] = \lim_{\Delta\to0}\frac{\int_{x-\Delta}^{x} f_{X,A}(x')\,dx'}{\int_{x-\Delta}^{x} f_X(x')\,dx'} = \frac{f_{X,A}(x)}{f_X(x)} = \frac{P(A)\,f_{X|A}(x)}{f_X(x)}.$$
13.8. For independent $X_i$, $P[\forall i\ X_i \in B_i \mid \forall i\ X_i \in C_i] = \prod_i P[X_i \in B_i \mid X_i \in C_i]$.
13.2 Conditional Expectation
Definition 13.9. $E[g(Y)|X = x] = \sum_y g(y)\,p_{Y|X}(y|x)$.
• In particular, $E[Y|X = x] = \sum_y y\,p_{Y|X}(y|x)$.
• Note that $E[Y|X = x]$ is a function of $x$.
13.10. Properties of conditional expectation.
(a) Substitution law for conditional expectation: $E[g(X,Y)|X = x] = E[g(x,Y)|X = x]$
(b) $E[h(X)g(Y)|X = x] = h(x)\,E[g(Y)|X = x]$.
(c) (The Rule of Iterated Expectations) $EY = E[E[Y|X]]$.
(d) Law of total probability for expectation: $E[g(X,Y)] = E[E[g(X,Y)|X]]$.
  (i) $E[g(Y)] = E[E[g(Y)|X]]$.
  (ii) $F_Y(y) = E\left[F_{Y|X}(y|X)\right]$. (Take $g(z) = 1_{(-\infty,y]}(z)$.)
(e) $E[g(X)h(Y)|X = x] = g(x)\,E[h(Y)|X = x]$
(f) $E[g(X)h(Y)] = E[g(X)\,E[h(Y)|X]]$
(g) $E[X + Y|Z = z] = E[X|Z = z] + E[Y|Z = z]$
(h) $E[AX|z] = A\,E[X|z]$.
(i) $E[(X - E[X|Y])|Y] = 0$ and $E[X - E[X|Y]] = 0$
(j) $\min_{g(\cdot)} E[(Y - g(X))^2] = E[(Y - E[Y|X])^2]$, where $g$ ranges over all functions. Hence, $E[Y|X]$ is sometimes called the regression of $Y$ on $X$, the "best" predictor of $Y$ conditional on $X$.
Definition 13.11. The conditional variance is defined as
$$\operatorname{Var}[Y|X = x] = \int (y - m(x))^2 f(y|x)\,dy,$$
where $m(x) = E[Y|X = x]$.
13.12. Properties of conditional variance
(a) $\operatorname{Var} Y = E[\operatorname{Var}[Y|X]] + \operatorname{Var}[E[Y|X]]$.
In other words, suppose that, given $X = x$, the mean and variance of $Y$ are $m(x)$ and $v(x)$. Then the variance of $Y$ is $\operatorname{Var} Y = E[v(X)] + \operatorname{Var}[m(X)]$. Recall that for any function $g$ we have $E[g(Y)] = E[E[g(Y)|X]]$. Because $EY$ is just a constant, say $\mu$, we can define $g(y) = (y - \mu)^2$, which then implies $\operatorname{Var} Y = E[E[g(Y)|X]]$. Note, however, that $E[g(Y)|X]$ and $\operatorname{Var}[Y|X]$ are not the same. Suppose that, conditioned on $X$, $Y$ has distribution $Q$ with mean $m(x)$ and variance $v(x)$. Then $v(x) = \int (y - m(x))^2\,Q(dy)$. However, $E[g(Y)|X = x] = \int (y - \mu)^2\,Q(dy)$; note the use of $\mu$ instead of $m(x)$. Therefore, in general, $\operatorname{Var} Y \ne E[\operatorname{Var}[Y|X]]$.
• All three terms in the expression are nonnegative. $\operatorname{Var} Y$ is an upper bound for each of the terms on the RHS.
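The decomposition $\operatorname{Var} Y = E[\operatorname{Var}[Y|X]] + \operatorname{Var}[E[Y|X]]$ can be verified exactly on a small discrete model (an illustrative toy example, not part of the original notes), using exact rational arithmetic so the two sides match with no rounding.

```python
from fractions import Fraction

# Toy model: X uniform on {0, 1}; given X = x, Y uniform on {0, ..., x + 1}
p_X = {0: Fraction(1, 2), 1: Fraction(1, 2)}
supp_Y = {0: [0, 1], 1: [0, 1, 2]}

def cond_mean_var(x):
    ys = supp_Y[x]
    m = Fraction(sum(ys), len(ys))
    v = Fraction(sum((y - m) ** 2 for y in ys), len(ys))
    return m, v

means = {x: cond_mean_var(x)[0] for x in p_X}
variances = {x: cond_mean_var(x)[1] for x in p_X}
E_cond_var = sum(p_X[x] * variances[x] for x in p_X)       # E[Var[Y|X]]
E_m = sum(p_X[x] * means[x] for x in p_X)                   # E[E[Y|X]] = EY
Var_cond_mean = sum(p_X[x] * (means[x] - E_m) ** 2 for x in p_X)  # Var[E[Y|X]]
total_variance = E_cond_var + Var_cond_mean

# Direct Var Y from the joint distribution
p_Y = {}
for x in p_X:
    for y in supp_Y[x]:
        p_Y[y] = p_Y.get(y, Fraction(0)) + p_X[x] / len(supp_Y[x])
EY = sum(y * p for y, p in p_Y.items())
VarY = sum((y - EY) ** 2 * p for y, p in p_Y.items())
```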
(b) Suppose $N \perp (X, Z)$; then $\operatorname{Var}[X + N|Z] = \operatorname{Var}[X|Z] + \operatorname{Var} N$.
(c) $\operatorname{Var}[AX|z] = A\,\operatorname{Var}[X|z]\,A^H$.
13.13. Suppose $E[Y|X] = X$. Then $\operatorname{Cov}[X,Y] = \operatorname{Var} X$. See also (5.19). This is also true for $Y = X + N$ with $X \perp N$ and $N$ zero-mean noise.
Definition 13.14. $\mu_n[Y|X] = E[(Y - E[Y|X])^n|X]$
13.15. Properties
(a) $\mu_3[Y] = E[\mu_3[Y|X]] + \mu_3[E[Y|X]]$
$\mu_4[Y] = E[\mu_4[Y|X]] + 6\,E[\operatorname{Var}[Y|X]]\operatorname{Var}[E[Y|X]] + \mu_4[E[Y|X]]$
13.3 Conditional Independence
13.16. The following statements are equivalent conditions for $X_1, X_2, \ldots, X_n$ to be mutually independent conditioned on $Y$ (a.s.):
(a) $p(x_1^n|y) = \prod_{i=1}^{n} p(x_i|y)$.
(b) $\forall i \in [n]\setminus\{1\}$, $p(x_i|x_1^{i-1}, y) = p(x_i|y)$.
(c) $\forall i \in [n]$, $X_i$ and the vector $(X_j)_{j\in[n]\setminus\{i\}}$ are independent conditioned on $Y$.
Example 13.17. Suppose $X$ and $Y$ are independent. Conditioned on another random variable $Z$, it is not true in general that $X$ and $Y$ are still independent. See Example (4.45). Recall that $Z = X \oplus Y$, which can be rewritten as $Y = X \oplus Z$. Hence, when $Z = 0$, we must have $Y = X$.
13.18. Suppose we know that $f_{X|Y,Z}(x|y,z) = g(x,y)$; that is, $f_{X|Y,Z}$ does not depend on $z$. Then, conditioned on $Y$, $X$ and $Z$ are independent. In this case,
$$f_{X|Y,Z}(x|y,z) = f_{X|Y}(x|y) = g(x,y).$$
13.19. Suppose we know that $f_{Z|V,U_1,U_2}(z|v,u_1,u_2) = f_{Z|V}(z|v)$ for all $z, v, u_1, u_2$. Then, conditioned on $V$, we can conclude that $Z$ and $(U_1, U_2)$ are independent. This further implies that $Z$ and each $U_i$ are independent. Moreover,
$$f_{Z|V,U_1,U_2}(z|v,u_1,u_2) = f_{Z|V,U_1}(z|v,u_1) = f_{Z|V,U_2}(z|v,u_2) = f_{Z|V}(z|v).$$
14 Real-valued Jointly Gaussian
Definition 14.1. A random vector $X \in \mathbb{R}^d$ is jointly Gaussian or jointly normal if and only if $\forall v \in \mathbb{R}^d$, the random variable $v^T X$ is Gaussian.
• In order for this definition to make sense when $v = 0$ or when $X$ has a singular covariance matrix, we agree that any constant random variable is considered to be Gaussian.
• Of course, the mean and variance are $v^T EX$ and $v^T \Lambda_X v$, respectively.
If $X$ is a Gaussian random vector with mean vector $m$ and covariance matrix $\Lambda$, we write $X \sim \mathcal{N}(m, \Lambda)$.
14.2. Properties of a jointly Gaussian random vector $X \sim \mathcal{N}(m, \Lambda)$
(a) $m = EX$, $\Lambda = \operatorname{Cov}[X] = E\left[(X - EX)(X - EX)^T\right]$.
(b) $f_X(x) = \frac{1}{(2\pi)^{n/2}\sqrt{\det(\Lambda)}}\, e^{-\frac{1}{2}(x-m)^T \Lambda^{-1} (x-m)}$.
• To remember the form of the above formula, note that the exponent has to be a scalar, so we should have $(x-m)^T \Lambda^{-1}(x-m)$ rather than putting the transpose on the last factor. To make this clearer, set $\Lambda = I$; then we must have a dot product. Note also that $v^T A v = \sum_k \sum_\ell v_k A_{k\ell} v_\ell$.
• The above formula can be derived by starting from a random vector $Z$ whose components are i.i.d. $\mathcal{N}(0,1)$. Let $X = C_X^{1/2} Z + m$. Use (28) and (32).
• For independent $X_i \sim \mathcal{N}(m_i, \sigma_i^2)$,
$$f_X(x) = \frac{1}{(2\pi)^{n/2}\prod_i \sigma_i}\, e^{-\frac{1}{2}\sum_i \left(\frac{x_i - m_i}{\sigma_i}\right)^2}.$$
In particular, if $X_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0,1)$,
$$f_X(x) = \frac{1}{(2\pi)^{n/2}}\, e^{-\frac{1}{2} x^T x}. \qquad (32)$$
• $(2\pi)^{n/2}\sqrt{\det(\Lambda)} = \sqrt{\det(2\pi\Lambda)}$
• The Gaussian density is constant on the "ellipsoids" centered at $m$,
$$\left\{x \in \mathbb{R}^n : (x-m)^T C_X^{-1} (x-m) = \text{constant}\right\}.$$
(c) $\varphi_X(v) = e^{jv^T m - \frac{1}{2} v^T \Lambda v} = e^{j\sum_i v_i EX_i - \frac{1}{2}\sum_k\sum_\ell v_k v_\ell \operatorname{Cov}[X_k, X_\ell]}$.
• This can be derived from Definition (14.1) by noting that
$$\varphi_X(v) = E\left[e^{jv^T X}\right] = E\left[e^{j \cdot 1 \cdot Y}\right], \qquad Y = v^T X,$$
which is simply $\varphi_Y(1)$, where $Y = v^T X$ is by definition normal.
(d) A random vector $X \in \mathbb{R}^d$ is jointly Gaussian if and only if $\forall v \in \mathbb{R}^d$, the random variable $v^T X$ is Gaussian.
• Independent Gaussian random variables are jointly Gaussian.
(e) Joint normality is preserved under linear transformation: suppose $Y = AX + b$; then $Y \sim \mathcal{N}(Am + b,\ A\Lambda A^T)$.
(f) If $(X,Y)$ is jointly Gaussian, then $X$ and $Y$ are independent if and only if $\operatorname{Cov}[X,Y] = 0$. Hence, uncorrelated jointly Gaussian random variables are independent.
(g) Note that the joint density does not exist when the covariance matrix $\Lambda$ is singular.
(h) For i.i.d. $\mathcal{N}(\mu, \sigma^2)$:
$$f_{X_1^n}(x_1^n) = \frac{1}{(2\pi)^{n/2}\sigma^n}\exp\left(-\frac{1}{2\sigma^2}\|x - \mu\|^2\right) = \frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n} x_i^2 + \frac{\mu}{\sigma^2}\sum_{i=1}^{n} x_i - \frac{n\mu^2}{2\sigma^2}\right)$$
(i) Third-order joint Gaussian moments are 0:
$$E[(X_i - EX_i)(X_j - EX_j)(X_k - EX_k)] = 0 \quad \forall\, i, j, k \text{ not necessarily distinct.}$$
In particular, $E\left[(X - EX)^3\right] = 0$.
(j) Isserlis's Theorem: Any fourth-order central moment of jointly Gaussian r.v.s is expressible as a sum of all possible products of pairs of their covariances:
$$\begin{aligned}&E[(X_i - EX_i)(X_j - EX_j)(X_k - EX_k)(X_\ell - EX_\ell)]\\ &\quad= \operatorname{Cov}[X_i,X_j]\operatorname{Cov}[X_k,X_\ell] + \operatorname{Cov}[X_i,X_k]\operatorname{Cov}[X_j,X_\ell] + \operatorname{Cov}[X_i,X_\ell]\operatorname{Cov}[X_j,X_k].\end{aligned}$$
Note that $\frac{1}{2}\binom{4}{2} = 3$.
• In particular, $E\left[(X - EX)^4\right] = 3\sigma^4$.
(k) To generate $\mathcal{N}(m, \Lambda)$: First, by the spectral theorem, $\Lambda = VDV^T$, where $V$ is an orthogonal matrix whose columns are eigenvectors of $\Lambda$ and $D$ is a diagonal matrix with the eigenvalues of $\Lambda$. The random variable we want is $VX + m$ where $X \sim \mathcal{N}(0, D)$.
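The recipe in (k) can be sketched in pure Python for the $2 \times 2$ case, where the eigen-decomposition of a symmetric matrix has a closed form (an illustrative sketch, not part of the original notes; the particular $m$ and $\Lambda$ are arbitrary). The empirical mean and covariance of the generated samples should match $m$ and $\Lambda$.

```python
import math
import random

def eig_sym_2x2(a, b, c):
    # Eigen-decomposition of the symmetric matrix [[a, b], [b, c]] = V diag(l1, l2) V^T
    tr, det = a + c, a * c - b * b
    disc = math.sqrt(max(tr * tr / 4 - det, 0.0))
    l1, l2 = tr / 2 + disc, tr / 2 - disc
    if abs(b) > 1e-15:
        v1 = (l1 - c, b)            # eigenvector for l1
    else:
        v1 = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = math.hypot(*v1)
    v1 = (v1[0] / norm, v1[1] / norm)
    v2 = (-v1[1], v1[0])            # orthogonal unit vector
    return (l1, l2), (v1, v2)

def sample_gaussian(m, Lam, rng):
    # One draw of N(m, Lam) as V X + m with X ~ N(0, D), per property (k)
    (l1, l2), (v1, v2) = eig_sym_2x2(Lam[0][0], Lam[0][1], Lam[1][1])
    x1 = math.sqrt(max(l1, 0.0)) * rng.gauss(0, 1)
    x2 = math.sqrt(max(l2, 0.0)) * rng.gauss(0, 1)
    return (m[0] + v1[0] * x1 + v2[0] * x2,
            m[1] + v1[1] * x1 + v2[1] * x2)

rng = random.Random(6)
m, Lam = (1.0, -2.0), [[2.0, 0.8], [0.8, 1.0]]
samples = [sample_gaussian(m, Lam, rng) for _ in range(100_000)]
mean0 = sum(s[0] for s in samples) / len(samples)
var0 = sum((s[0] - m[0]) ** 2 for s in samples) / len(samples)
cov01 = sum((s[0] - m[0]) * (s[1] - m[1]) for s in samples) / len(samples)
```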
14.3. For the bivariate normal, $f_{X,Y}(x,y)$ is
$$\frac{1}{2\pi\sigma_X\sigma_Y\sqrt{1-\rho^2}}\exp\left\{-\frac{\left(\frac{x-EX}{\sigma_X}\right)^2 - 2\rho\left(\frac{x-EX}{\sigma_X}\right)\left(\frac{y-EY}{\sigma_Y}\right) + \left(\frac{y-EY}{\sigma_Y}\right)^2}{2(1-\rho^2)}\right\},$$
where $\rho = \frac{\operatorname{Cov}(X,Y)}{\sigma_X\sigma_Y} \in [-1,1]$. Here, $x, y \in \mathbb{R}$.
• $f_{X,Y}(x,y) = \frac{1}{\sigma_X\sigma_Y}\,\psi_\rho\!\left(\frac{x-m_X}{\sigma_X},\ \frac{y-m_Y}{\sigma_Y}\right)$
• $f_{X,Y}$ is constant on the ellipses $\left(\frac{x-m_X}{\sigma_X}\right)^2 - 2\rho\left(\frac{x-m_X}{\sigma_X}\right)\left(\frac{y-m_Y}{\sigma_Y}\right) + \left(\frac{y-m_Y}{\sigma_Y}\right)^2 = r^2$.
• $\Lambda = \begin{pmatrix}\sigma_X^2 & \operatorname{Cov}[X,Y]\\ \operatorname{Cov}[X,Y] & \sigma_Y^2\end{pmatrix} = \begin{pmatrix}\sigma_X^2 & \rho\sigma_X\sigma_Y\\ \rho\sigma_X\sigma_Y & \sigma_Y^2\end{pmatrix}$.
• The following are equivalent:
  (a) $\rho = 0$
  (b) $\operatorname{Cov}[X,Y] = 0$
  (c) $X$ and $Y$ are independent.
• $|\rho| = 1$ if and only if $(X - EX) = k(Y - EY)$ for some constant $k$. In this case,
  ◦ $\rho = \frac{k}{|k|} = \operatorname{sign}(k)$
  ◦ $|k| = \frac{\sigma_X}{\sigma_Y}$
• Suppose $f_{X,Y}(x,y)$ depends only on $x^2 + y^2$ and $X$ and $Y$ are independent; then $X$ and $Y$ are normal with zero mean and equal variance.
• $X|Y \sim \mathcal{N}\left(\rho\frac{\sigma_X}{\sigma_Y}(Y - m_Y) + m_X,\ \sigma_X^2(1-\rho^2)\right)$ and $Y|X \sim \mathcal{N}\left(\rho\frac{\sigma_Y}{\sigma_X}(X - m_X) + m_Y,\ \sigma_Y^2(1-\rho^2)\right)$.
• The standard bivariate density is defined as
$$\psi_\rho(u,v) = \frac{1}{2\pi\sqrt{1-\rho^2}}\, e^{-\frac{1}{2(1-\rho^2)}\left(u^2 - 2\rho uv + v^2\right)} = \psi(u)\,\frac{1}{\sqrt{1-\rho^2}}\,\psi\!\left(\frac{v-\rho u}{\sqrt{1-\rho^2}}\right) = \frac{1}{\sqrt{1-\rho^2}}\,\psi\!\left(\frac{u-\rho v}{\sqrt{1-\rho^2}}\right)\psi(v),$$
where the first factorization is $f_U(u)\,f_{V|U}(v|u)$ with $V|U \sim \mathcal{N}(\rho U,\,1-\rho^2)$ and the second is $f_{U|V}(u|v)\,f_V(v)$ with $U|V \sim \mathcal{N}(\rho V,\,1-\rho^2)$ [9, eq. (7.22) p 309, eq. (7.23) p 311, eq. (7.26) p 313]. This is the joint density of $U, V$ where $U, V \sim \mathcal{N}(0,1)$ with $\operatorname{Cov}[U,V] = E[UV] = \rho$.
  ◦ The general bivariate Gaussian pair is obtained from the transformation
$$\begin{pmatrix}X\\ Y\end{pmatrix} = \begin{pmatrix}\sigma_X U + m_X\\ \sigma_Y V + m_Y\end{pmatrix} = \begin{pmatrix}\sigma_X & 0\\ 0 & \sigma_Y\end{pmatrix}\begin{pmatrix}U\\ V\end{pmatrix} + \begin{pmatrix}m_X\\ m_Y\end{pmatrix}.$$
  ◦ $f_{U|V}(u|v)$ is $\mathcal{N}(\rho v,\,1-\rho^2)$. In other words, $U|V \sim \mathcal{N}(\rho V,\,1-\rho^2)$.
14.4 (Conditional Gaussian).
(a) Suppose $(X,Y)$ are jointly Gaussian; that is, $\begin{pmatrix}X\\ Y\end{pmatrix} \sim \mathcal{N}\left(\begin{pmatrix}\mu_X\\ \mu_Y\end{pmatrix},\ \begin{pmatrix}\Lambda_X & \Lambda_{XY}\\ \Lambda_{YX} & \Lambda_Y\end{pmatrix}\right)$. Then $f_{X|Y}(x|y)$ is $\mathcal{N}\left(E[X|y],\ \Lambda_{X|y}\right)$, where $E[X|y] = \mu_X + \Lambda_{XY}\Lambda_Y^{-1}(y - \mu_Y)$ and $\Lambda_{X|y} = \Lambda_X - \Lambda_{XY}\Lambda_Y^{-1}\Lambda_{YX}$.
• Note the direction of the subscripts in the formula for $\Lambda_{X|y}$: the correction term $\Lambda_{XY}\Lambda_Y^{-1}\Lambda_{YX}$ traverses the block matrix $\begin{pmatrix}\Lambda_X & \Lambda_{XY}\\ \Lambda_{YX} & \Lambda_Y\end{pmatrix}$ in the order $\Lambda_{XY} \to \Lambda_Y \to \Lambda_{YX}$.
(b) Suppose $(X, Y, W)$ are jointly Gaussian with $W \perp (X,Y)$. Set $V = BX + W$. Then $V|y \sim \mathcal{N}\left(E[V|y],\ \Lambda_{V|y}\right)$, where $E[V|y] = B\,E[X|y] + EW$ and $\Lambda_{V|y} = B\Lambda_{X|y}B^T + \Lambda_W$.
15 Bayesian Detection and Estimation
Consider a pair of random vectors $\Theta$ and $Y$, where $\Theta$ is not observed but $Y$ is observed. We know the joint distribution of the pair $(\Theta, Y)$, which is usually given in the form of the prior distribution $p_\Theta(\theta)$ and the conditional distribution $p_{Y|\Theta}(y|\theta)$. By an estimator of $\Theta$ based on $Y$, we mean a function $g$ such that $\hat{\Theta}(Y) = g(Y)$ is our estimate or "guess" of the value of $\Theta$.
15.1 (Orthogonality Principle). Let $\mathcal{T}$ be a collection of random vectors with the same dimension as $\Theta$. For a random vector $Z$, suppose that
$$\forall X \in \mathcal{T}, \quad Z - X \in \mathcal{T}. \qquad (33)$$
If
$$E\left[X^T(\Theta - Z)\right] = 0 \quad \forall X \in \mathcal{T}, \qquad (34)$$
then
$$E\left[|\Theta - X|^2\right] = E\left[|\Theta - Z|^2\right] + E\left[|Z - X|^2\right] + 2\underbrace{E\left[(Z - X)^T(\Theta - Z)\right]}_{=0 \text{ since } Z - X \in \mathcal{T}},$$
which implies
$$E\left[|\Theta - Z|^2\right] \le E\left[|\Theta - X|^2\right] \quad \forall X \in \mathcal{T}.$$
• If $\mathcal{T}$ is a subspace and $Z \in \mathcal{T}$, then (33) is automatically satisfied.
• (34) says that the vector $\Theta - Z$ is orthogonal to all vectors in $\mathcal{T}$.
Example 15.2. Suppose $\Theta$ and $N$ are independent Poisson random variables with respective parameters $\lambda$ and $\mu$. Let $Y = \Theta + N$.
• $Y \sim \mathcal{P}(\lambda + \mu)$.
• Conditioned on $Y = y$, $\Theta \sim \mathcal{B}\left(y,\ \frac{\lambda}{\lambda+\mu}\right)$.
• $\hat{\Theta}_{\text{MMSE}}(Y) = E[\Theta|Y] = \frac{\lambda}{\lambda+\mu}\,Y$.
• $\operatorname{Var}[\Theta|Y] = Y\frac{\lambda\mu}{(\lambda+\mu)^2}$, and $\text{MSE} = E[\operatorname{Var}[\Theta|Y]] = \frac{\lambda\mu}{\lambda+\mu} < \operatorname{Var} Y = \lambda + \mu$.
See also [7, Q 15.17].
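Example 15.2 can be checked by simulation (an illustrative sketch, not part of the original notes; $\lambda = 2$, $\mu = 3$, and the conditioning value $y = 5$ are arbitrary choices): the empirical mean of $\Theta$ among samples with $Y = y$ should match $y\,\lambda/(\lambda+\mu)$.

```python
import math
import random

def posterior_mean_sim(lam=2.0, mu=3.0, y_target=5, trials=300_000, seed=7):
    # Theta ~ Poisson(lam), N ~ Poisson(mu) independent, Y = Theta + N.
    # Estimate E[Theta | Y = y_target]; theory: y_target * lam / (lam + mu).
    rng = random.Random(seed)

    def poisson(rate):
        # Knuth's multiplication method (fine for small rates)
        L, k, p = math.exp(-rate), 0, 1.0
        while True:
            p *= rng.random()
            if p <= L:
                return k
            k += 1

    total = count = 0
    for _ in range(trials):
        theta = poisson(lam)
        if theta + poisson(mu) == y_target:
            total += theta
            count += 1
    return total / count

emp = posterior_mean_sim()
exact = 5 * 2.0 / (2.0 + 3.0)  # = 2.0
```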
15.3 (Weighted Error). Suppose we define the error by $\mathcal{E} = \left(\Theta - \hat{\Theta}(Y)\right)^T W \left(\Theta - \hat{\Theta}(Y)\right)$ for some positive definite matrix $W$. (Note that the usual MSE uses $W = I$.) The MSE $E\mathcal{E}$ is uniquely minimized by the MMSE estimator $\hat{\Theta}(Y) = E[\Theta|Y]$. The resulting MSE is $E\left[(\Theta - E[\Theta|Y])^T W (\Theta - E[\Theta|Y])\right]$.
In fact, for any function $g(Y)$, the conditional weighted error $E\left[(\Theta - g(Y))^T W (\Theta - g(Y)) \mid Y\right]$ is given by
$$E\left[(\Theta - E[\Theta|Y])^T W (\Theta - E[\Theta|Y]) \mid Y\right] + (E[\Theta|Y] - g(Y))^T W (E[\Theta|Y] - g(Y)).$$
Hence, for each $Y$, it is minimized by having $g(Y) = E[\Theta|Y]$.
15.4 (Linear minimum mean-squared-error estimator). A linear MMSE estimator $\hat{\Theta}_{\text{LMMSE}} = g_{\text{LMMSE}}(Y)$ minimizes the MSE $E\left[|\Theta - g(Y)|^2\right]$ among all affine estimators of the form $g(y) = Ay + b$.
(a) It is sometimes called the Wiener filter.
(b) The scalar linear (affine) MMSE estimator is given by
$$\hat{\Theta}_{\text{LMMSE}}(Y) = E\Theta + \frac{\operatorname{Cov}[Y,\Theta]}{\operatorname{Var} Y}\,(Y - EY).$$
• To see this in Hilbert space, note that we want the orthogonal projection of $\Theta$ onto the subspace spanned by two elements: $Y$ and $1$. An orthogonal basis of the subspace is $\{1,\ Y - EY\}$. Hence, the orthogonal projection is
$$\frac{\langle\Theta, 1\rangle}{\langle1, 1\rangle} + \frac{\langle\Theta,\ Y - EY\rangle}{\langle Y - EY,\ Y - EY\rangle}\,(Y - EY).$$
• The above discussion suggests an alternative way of arriving at the LMMSE: find $a, b$ in $\hat{\Theta}(Y) = aY + b$ such that the error $\mathcal{E} = \Theta - \hat{\Theta}(Y)$ is orthogonal to both $1$ and $Y - EY$. The condition $\langle\mathcal{E}, 1\rangle = 0$ requires $\hat{\Theta}(Y)$ to be unbiased. The condition $\langle\mathcal{E},\ Y - EY\rangle = 0$ gives $a = \frac{\operatorname{Cov}[Y,\Theta]}{\operatorname{Var} Y}$.
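The scalar LMMSE formula can be exercised on simulated data (an illustrative sketch, not part of the original notes; the model $\Theta \sim \mathcal{N}(0,4)$, $Y = \Theta + N$, $N \sim \mathcal{N}(0,1)$ is an arbitrary test case). The fitted $a$ should be close to $\operatorname{Var}\Theta/(\operatorname{Var}\Theta + \operatorname{Var} N) = 0.8$, with the empirical MSE near the theoretical MMSE $0.8$.

```python
import random

def lmmse_fit(pairs):
    # a = Cov[Y, Theta] / Var Y, b = E[Theta] - a E[Y]  (scalar affine MMSE)
    n = len(pairs)
    m_t = sum(t for t, _ in pairs) / n
    m_y = sum(y for _, y in pairs) / n
    cov = sum((t - m_t) * (y - m_y) for t, y in pairs) / n
    var_y = sum((y - m_y) ** 2 for _, y in pairs) / n
    a = cov / var_y
    return a, m_t - a * m_y

# Toy model: Theta ~ N(0, 4), Y = Theta + N with N ~ N(0, 1) independent.
rng = random.Random(8)
pairs = []
for _ in range(200_000):
    theta = rng.gauss(0, 2)
    pairs.append((theta, theta + rng.gauss(0, 1)))
a, b = lmmse_fit(pairs)
mse = sum((t - (a * y + b)) ** 2 for t, y in pairs) / len(pairs)
# Theory: a = 4 / (4 + 1) = 0.8, b = 0, MMSE = 4 * 1 / (4 + 1) = 0.8.
```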
(c) The vector linear (affine) MMSE estimator is given by
$$\hat{\Theta}_{\text{LMMSE}}(Y) = E\Theta + \Sigma_{\Theta Y}\Sigma_Y^{-1}(Y - EY)$$
and
$$\text{MMSE} = \operatorname{Cov}\left[\Theta - \hat{\Theta}_{\text{LMMSE}}(Y)\right] = \Sigma_\Theta - \Sigma_{\Theta Y}\Sigma_Y^{-1}\Sigma_{Y\Theta}.$$
In fact, the optimal choice of $A$ is any solution of
$$A\Sigma_Y = \Sigma_{\Theta Y}.$$
In that case,
$$\operatorname{Cov}\left[\Theta - \hat{\Theta}_{\text{LMMSE}}(Y)\right] = \Sigma_\Theta - A\Sigma_{Y\Theta} - \Sigma_{\Theta Y}A^T + A\Sigma_Y A^T = \Sigma_\Theta - \Sigma_{\Theta Y}A^T = \Sigma_\Theta - A\Sigma_{Y\Theta} = \Sigma_\Theta - A\Sigma_Y A^T.$$
When $\Sigma_Y$ is invertible, $A = \Sigma_{\Theta Y}\Sigma_Y^{-1}$. When $\Sigma_Y$ is singular, see [9, Q8.38 p 359].
• The MSE can be rewritten as
$$E\left[\left|(\Theta - E\Theta) - A(Y - EY)\right|^2\right] + |E\Theta - A\,EY - b|^2,$$
which shows that the optimal choice of $b$ is $b = E\Theta - A\,EY$. This is the $b$ which makes the estimator unbiased.
• Fix $A$. Let $\tilde{\Theta} = \Theta - E\Theta$ and $\tilde{Y} = Y - EY$. Then, for all matrices $B$,
$$E\left[\left(B\tilde{Y}\right)^T\left(\tilde{\Theta} - A\tilde{Y}\right)\right] = 0$$
if and only if, for all matrices $B$,
$$E\left[\left|\tilde{\Theta} - A\tilde{Y}\right|^2\right] \le E\left[\left|\tilde{\Theta} - B\tilde{Y}\right|^2\right].$$
In that case,
$$E\left[\left|\tilde{\Theta} - B\tilde{Y}\right|^2\right] = E\left[\left|\tilde{\Theta} - A\tilde{Y}\right|^2\right] + E\left[\left|(A - B)\tilde{Y}\right|^2\right].$$
• Additive Noise in 1D: $Y = \Theta + N$ where $\operatorname{Cov}[\Theta, N] = 0$.
$$\hat{\Theta}_{\text{LMMSE}}(Y) = E\Theta + \frac{\operatorname{Cov}[\Theta, Y]}{\operatorname{Var} Y}(Y - EY) = E\Theta + \frac{\operatorname{Var}\Theta}{\operatorname{Var}\Theta + \operatorname{Var} N}\left(Y - (E\Theta + EN)\right) = E\Theta + \frac{\text{SNR}}{1 + \text{SNR}}\left(Y - (E\Theta + EN)\right),$$
where $\text{SNR} = \frac{\operatorname{Var}\Theta}{\operatorname{Var} N}$, and
$$\text{MMSE} = \operatorname{Var}\Theta - \frac{\operatorname{Cov}^2[\Theta, Y]}{\operatorname{Var} Y} = \frac{\operatorname{Var}\Theta\,\operatorname{Var} N}{\operatorname{Var}\Theta + \operatorname{Var} N}.$$
A More Math
A.1 Inequalities
A.1. By definition,
• $\sum_{n=1}^{\infty} a_n = \sum_{n\in\mathbb{N}} a_n = \lim_{N\to\infty}\sum_{n=1}^{N} a_n$ and
• $\prod_{n=1}^{\infty} a_n = \prod_{n\in\mathbb{N}} a_n = \lim_{N\to\infty}\prod_{n=1}^{N} a_n$.
A.2. For $|x| \le 0.5$, we have
$$e^{x - x^2} \le 1 + x \le e^x.$$
This is because
$$x - x^2 \le \ln(1+x) \le x,$$
which is semi-proved by the plot in Figure 24.
Figure 24: Bounds for $\ln(1+x)$ when $x$ is small (plots of $x - x^2$, $\ln(1+x)$, $x$, and $\frac{x}{x+1}$).
A.3. Let $\alpha_i$ and $\beta_i$ be complex numbers with $|\alpha_i| \le 1$ and $|\beta_i| \le 1$. Then
$$\left|\prod_{i=1}^{m}\alpha_i - \prod_{i=1}^{m}\beta_i\right| \le \sum_{i=1}^{m}|\alpha_i - \beta_i|.$$
In particular, $|\alpha^m - \beta^m| \le m|\alpha - \beta|$.
A.4. Consider a triangular array of real numbers $(x_{n,k})$. Suppose (i) $\sum_{k=1}^{r_n} x_{n,k} \to x$ and (ii) $\sum_{k=1}^{r_n} x_{n,k}^2 \to 0$. Then
$$\prod_{k=1}^{r_n}(1 + x_{n,k}) \to e^x.$$
Proof. Use (A.2).
Suppose the sum $\sum_{k=1}^{r_n}|x_{n,k}|$ converges as $n \to \infty$. Conditions (i) and (ii) are equivalent to conditions (i) and (iii), where (iii) is the requirement that $\max_{k\in[r_n]}|x_{n,k}| \to 0$ as $n \to \infty$.
Proof. Suppose $\sum_{k=1}^{r_n}|x_{n,k}| \to x_0$. To show that (iii) implies (ii) under (i), let $a_n = \max_{k\in[r_n]}|x_{n,k}|$. Then
$$0 \le \sum_{k=1}^{r_n} x_{n,k}^2 \le a_n \sum_{k=1}^{r_n}|x_{n,k}| \to 0 \cdot x_0 = 0.$$
On the other hand, suppose we have (i) and (ii). Given any $\varepsilon > 0$, by (ii), $\exists n_0$ such that $\forall n \ge n_0$, $\sum_{k=1}^{r_n} x_{n,k}^2 \le \varepsilon^2$. Hence, for any $k$, $x_{n,k}^2 \le \sum_{k=1}^{r_n} x_{n,k}^2 \le \varepsilon^2$ and hence $|x_{n,k}| \le \varepsilon$, which implies $a_n \le \varepsilon$.
Note that when the $x_{n,k}$ are nonnegative, condition (i) already implies that the sum $\sum_{k=1}^{r_n}|x_{n,k}|$ converges as $n \to \infty$.
A.5. Suppose $\lim_{n\to\infty} a_n = a$. Then $\lim_{n\to\infty}\left(1 - \frac{a_n}{n}\right)^n = e^{-a}$ [9, p 584].
Proof. Use (A.4) with $r_n = n$, $x_{n,k} = -\frac{a_n}{n}$. Then $\sum_{k=1}^{n} x_{n,k} = -a_n \to -a$ and $\sum_{k=1}^{n} x_{n,k}^2 = \frac{a_n^2}{n} \to 0$.
Alternatively, from L'Hôpital's rule, $\lim_{n\to\infty}\left(1 - \frac{a}{n}\right)^n = e^{-a}$. (See also [18, Theorem 3.31, p 64].) This gives a direct proof for the case when $a > 0$. For $n$ large enough, note that both $\left(1 - \frac{a_n}{n}\right)$ and $\left(1 - \frac{a}{n}\right)$ are $\le 1$, where we need $a > 0$ here. Applying (A.3), we get
$$\left|\left(1 - \frac{a_n}{n}\right)^n - \left(1 - \frac{a}{n}\right)^n\right| \le |a_n - a| \to 0.$$
For $a < 0$, we use the fact that, for $b_n \to b > 0$, (1) $\left(1 + \frac{b}{n}\right)^{-n} = \left(\left(1 + \frac{b}{n}\right)^n\right)^{-1} \to e^{-b}$, and (2) for $n$ large enough, both $\left(1 + \frac{b}{n}\right)^{-1}$ and $\left(1 + \frac{b_n}{n}\right)^{-1}$ are $\le 1$, and hence
$$\left|\left(1 + \frac{b_n}{n}\right)^{-n} - \left(1 + \frac{b}{n}\right)^{-n}\right| \le \frac{|b_n - b|}{\left(1 + \frac{b_n}{n}\right)\left(1 + \frac{b}{n}\right)} \to 0.$$
A.2 Summations
A.6. Basic formulas:
(a) $\sum_{k=0}^{n} k = \frac{n(n+1)}{2}$
(b) $\sum_{k=0}^{n} k^2 = \frac{n(n+1)(2n+1)}{6} = \frac{1}{6}(2n^3 + 3n^2 + n)$
(c) $\sum_{k=0}^{n} k^3 = \left(\sum_{k=0}^{n} k\right)^2 = \frac{1}{4}n^2(n+1)^2 = \frac{1}{4}(n^4 + 2n^3 + n^2)$
A nicer formula is given by
$$\sum_{k=1}^{n} k(k+1)\cdots(k+d) = \frac{1}{d+2}\,n(n+1)\cdots(n+d+1) \qquad (35)$$
A.7. Let $g(n) = \sum_{k=0}^{n} h(k)$, where $h$ is a polynomial of degree $d$. Then $g$ is a polynomial of degree $d+1$; that is, $g(n) = \sum_{m=1}^{d+1} a_m n^m$.
• To find the coefficients $a_m$, evaluate $g(n)$ for $n = 1, 2, \ldots, d+1$. Note that the case $n = 0$ gives $a_0 = 0$, and hence the sum starts with $m = 1$.
• Alternatively, first express $h(k)$ as a sum of rising-factorial terms:
$$h(k) = \left(\sum_{i=0}^{d-1} b_i\, k(k+1)\cdots(k+i)\right) + c.$$
To do this, substitute $k = 0, -1, -2, \ldots, -(d-1)$.
• $k^3 = k(k+1)(k+2) - 3k(k+1) + k$
Then, to get $g(n)$, use (35).
A.8. Geometric Sums:
(a) $\sum_{i=0}^{\infty}\rho^i = \frac{1}{1-\rho}$ for $|\rho| < 1$
(b) $\sum_{i=k}^{\infty}\rho^i = \frac{\rho^k}{1-\rho}$
(c) $\sum_{i=a}^{b}\rho^i = \frac{\rho^a - \rho^{b+1}}{1-\rho}$
(d) $\sum_{i=0}^{\infty} i\rho^i = \frac{\rho}{(1-\rho)^2}$
(e) $\sum_{i=a}^{b} i\rho^i = \frac{\rho^{b+1}(b\rho - b - 1) - \rho^a(a\rho - a - \rho)}{(1-\rho)^2}$
(f) $\sum_{i=k}^{\infty} i\rho^i = \frac{k\rho^k}{1-\rho} + \frac{\rho^{k+1}}{(1-\rho)^2}$
(g) $\sum_{i=0}^{\infty} i^2\rho^i = \frac{\rho + \rho^2}{(1-\rho)^3}$
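The finite-sum formulas (c) and (e) are easy to verify against direct summation (an illustrative sketch, not part of the original notes; $\rho = 0.6$, $a = 2$, $b = 9$ are arbitrary test values).

```python
def geometric_partial(rho, a, b):
    # Direct evaluation of sum_{i=a}^{b} rho^i
    return sum(rho ** i for i in range(a, b + 1))

def geometric_closed(rho, a, b):
    # Formula (c): (rho^a - rho^(b+1)) / (1 - rho)
    return (rho ** a - rho ** (b + 1)) / (1 - rho)

def i_geometric_closed(rho, a, b):
    # Formula (e) for sum_{i=a}^{b} i * rho^i
    return (rho ** (b + 1) * (b * rho - b - 1)
            - rho ** a * (a * rho - a - rho)) / (1 - rho) ** 2

rho, a, b = 0.6, 2, 9
direct = geometric_partial(rho, a, b)
direct_i = sum(i * rho ** i for i in range(a, b + 1))
```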
A.9. Double Sums:
(a) $\left(\sum_{i=1}^{n} a_i\right)^2 = \sum_{i=1}^{n}\sum_{j=1}^{n} a_i a_j$
(b) $\sum_{j=1}^{\infty}\sum_{i=j}^{\infty} f(i,j) = \sum_{i=1}^{\infty}\sum_{j=1}^{i} f(i,j) = \sum_{(i,j)} 1[i \ge j]\,f(i,j)$
A.10. Exponential Sums:
• $e^\lambda = \sum_{k=0}^{\infty}\frac{\lambda^k}{k!} = 1 + \lambda + \frac{\lambda^2}{2!} + \frac{\lambda^3}{3!} + \cdots$
• $\lambda e^\lambda + e^\lambda = 1 + 2\lambda + 3\frac{\lambda^2}{2!} + 4\frac{\lambda^3}{3!} + \cdots = \sum_{k=1}^{\infty} k\,\frac{\lambda^{k-1}}{(k-1)!}$
A.11. The zeta function $\zeta(s)$ is defined for any complex number $s$ with $\operatorname{Re}\{s\} > 1$ by the Dirichlet series $\zeta(s) = \sum_{n=1}^{\infty}\frac{1}{n^s}$.
• For real-valued nonnegative $x$,
(a) $\zeta(x)$ converges for $x > 1$
(b) $\zeta(x)$ diverges for $0 < x \le 1$
[9, Q2.48 p 105].
• $\zeta(1) = \infty$ corresponds to the harmonic series.
A.12. Abel's theorem: Let $a = (a_i : i \in \mathbb{N})$ be any sequence of real or complex numbers and let
$$G_a(z) = \sum_{i=0}^{\infty} a_i z^i$$
be the power series with coefficients $a$. Suppose that the series $\sum_{i=0}^{\infty} a_i$ converges. Then
$$\lim_{z \to 1^-} G_a(z) = \sum_{i=0}^{\infty} a_i. \tag{36}$$
In the special case where all the coefficients $a_i$ are nonnegative real numbers, formula (36) also holds when the series $\sum_{i=0}^{\infty} a_i$ does not converge; in that case both sides of the formula equal $+\infty$.
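As a concrete illustration (our own example, not from the text), take $a_i = (-1)^i/(i+1)$, for which $\sum_i a_i = \ln 2$ and $G_a(z) = \ln(1+z)/z$; a Python sketch shows $G_a(z)$ approaching $\ln 2$ as $z \to 1^-$:

```python
import math

# Abel's theorem illustrated with a_i = (-1)^i/(i+1), whose series
# sums to ln 2.  G_a(z) should approach ln 2 as z -> 1 from below.
def G(z, N=200000):
    """Truncated power series G_a(z) = sum a_i z^i."""
    return sum((-1) ** i / (i + 1) * z**i for i in range(N))

for z in (0.9, 0.99, 0.999):
    g = G(z)
    print(z, g, abs(g - math.log(2)))  # gap shrinks as z -> 1-
```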
A.3 Derivatives
A.13. Basic Formulas

(a) $\frac{d}{dx}\, a^u = a^u \ln a\, \frac{du}{dx}$

(b) $\frac{d}{dx} \log_a u = \frac{\log_a e}{u}\, \frac{du}{dx}$, $a \neq 0, 1$
(c) Derivatives of products: Suppose $f(x) = g(x)h(x)$; then
$$f^{(n)}(x) = \sum_{k=0}^{n} \binom{n}{k}\, g^{(n-k)}(x)\, h^{(k)}(x).$$
In fact,
$$\frac{d^n}{dt^n} \prod_{i=1}^{r} f_i(t) = \sum_{n_1 + \cdots + n_r = n} \frac{n!}{n_1!\, n_2! \cdots n_r!} \prod_{i=1}^{r} \frac{d^{n_i}}{dt^{n_i}} f_i(t).$$
Definition A.14 (Jacobian). In vector calculus, the Jacobian is shorthand for either the Jacobian matrix or its determinant, the Jacobian determinant. Let $g$ be a function from a subset $D$ of $\mathbb{R}^n$ to $\mathbb{R}^m$. If $g$ is differentiable at $z \in D$, then all partial derivatives exist at $z$ and the Jacobian matrix of $g$ at a point $z \in D$ is
$$dg(z) = \begin{pmatrix} \frac{\partial g_1}{\partial x_1}(z) & \cdots & \frac{\partial g_1}{\partial x_n}(z) \\ \vdots & \ddots & \vdots \\ \frac{\partial g_m}{\partial x_1}(z) & \cdots & \frac{\partial g_m}{\partial x_n}(z) \end{pmatrix} = \left( \frac{\partial g}{\partial x_1}(z), \ldots, \frac{\partial g}{\partial x_n}(z) \right).$$
Alternative notations for the Jacobian matrix are $J$, $\frac{\partial(g_1, \ldots, g_m)}{\partial(x_1, \ldots, x_n)}$ [7, p 242], and $J_g(x)$, where it is assumed that the Jacobian matrix is evaluated at $z = x = (x_1, \ldots, x_n)$.
• Let $g: D \to \mathbb{R}^m$ with $D \subset \mathbb{R}^n$ open and $z \in D$. If for all $k$ and $j$ the partial derivatives $\frac{\partial g_k}{\partial x_j}$ exist in a neighborhood of $z$ and are continuous at $z$, then $g$ is differentiable at $z$, and $dg$ is continuous at $z$.
• Let $A$ be an $n$-dimensional "box" defined by the corners $x$ and $x + \Delta x$. The "volume" of the image $g(A)$ is approximately $\left(\prod_i \Delta x_i\right) |\det dg(x)|$. Hence, the magnitude of the Jacobian determinant gives the ratio (scaling factor) of $n$-dimensional volumes (contents). In other words,
$$dy_1 \cdots dy_n = \left|\frac{\partial(y_1, \ldots, y_n)}{\partial(x_1, \ldots, x_n)}\right| dx_1 \cdots dx_n.$$

• $d(g^{-1}(y))$ is the Jacobian of the inverse transformation.

• In MATLAB, use jacobian.

• Change of variables: Let $g$ be a continuously differentiable map of the open set $U$ onto $V$. Suppose that $g$ is one-to-one and that $\det(dg(x)) \neq 0$ for all $x$. Then
$$\int_U h(g(x))\, |\det(dg(x))|\, dx = \int_V h(y)\, dy.$$
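The volume-scaling interpretation can be illustrated with polar coordinates $g(r, t) = (r\cos t, r\sin t)$, for which $|\det dg| = r$; integrating the Jacobian determinant over the parameter rectangle recovers the area of a disk (an illustrative Python sketch; the helper name `det_jacobian_polar` is ours):

```python
import math

# The Jacobian determinant as a volume scaling factor: for polar
# coordinates g(r, t) = (r cos t, r sin t), |det dg| = r, and the area
# of a disk of radius R is the integral of r over [0, R] x [0, 2*pi].

def det_jacobian_polar(r, t):
    # dg = [[cos t, -r sin t], [sin t, r cos t]]; det = r cos^2 t + r sin^2 t
    return math.cos(t) * (r * math.cos(t)) - (-r * math.sin(t)) * math.sin(t)

R, n = 2.0, 400
area = 0.0
for i in range(n):
    for j in range(n):
        r = (i + 0.5) * R / n            # midpoint rule in r
        t = (j + 0.5) * 2 * math.pi / n  # midpoint rule in t
        area += det_jacobian_polar(r, t) * (R / n) * (2 * math.pi / n)

assert abs(area - math.pi * R**2) < 1e-6
print("disk area:", area)
```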
Definition A.15. The gradient (or gradient vector field) of a scalar function $f(\theta)$ with respect to a vector variable $\theta = (\theta_1, \ldots, \theta_n)$ is
$$\nabla_\theta f(\theta) = \begin{pmatrix} \frac{\partial f}{\partial \theta_1} \\ \vdots \\ \frac{\partial f}{\partial \theta_n} \end{pmatrix}.$$
If the argument is a row vector $f^T = (f_1, \ldots, f_m)$, then
$$\nabla_\theta f^T(\theta) = \begin{pmatrix} \frac{\partial f_1}{\partial \theta_1} & \frac{\partial f_2}{\partial \theta_1} & \cdots & \frac{\partial f_m}{\partial \theta_1} \\ \vdots & \vdots & & \vdots \\ \frac{\partial f_1}{\partial \theta_n} & \frac{\partial f_2}{\partial \theta_n} & \cdots & \frac{\partial f_m}{\partial \theta_n} \end{pmatrix}.$$
Definition A.16. Given a scalar-valued function $f$, the Hessian matrix is the square matrix of second partial derivatives
$$\nabla^2_\theta f(\theta) = \nabla_\theta \left(\nabla_\theta f(\theta)\right)^T = \begin{bmatrix} \frac{\partial^2 f}{\partial \theta_1^2} & \cdots & \frac{\partial^2 f}{\partial \theta_1 \partial \theta_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial \theta_n \partial \theta_1} & \cdots & \frac{\partial^2 f}{\partial \theta_n^2} \end{bmatrix}.$$
It is symmetric for nice functions (for example, when the second partial derivatives are continuous).
A.17. $\nabla_x f^T(x) = (df(x))^T$.
A.18. Let $f, g: \Omega \to \mathbb{R}^m$ with $\Omega \subset \mathbb{R}^n$, and let $h(x) = \langle f(x), g(x) \rangle : \Omega \to \mathbb{R}$. Then
$$dh(x) = (f(x))^T dg(x) + (g(x))^T df(x).$$

• For an $n \times n$ matrix $A$, let $f(x) = \langle Ax, x \rangle$. Then $df(x) = (Ax)^T I + x^T A = x^T A^T + x^T A$.

  ◦ If $A$ is symmetric, then $df(x) = 2x^T A$. So, $\frac{\partial}{\partial x_j} \langle Ax, x \rangle = 2(Ax)_j$.
A.19. Chain rule: If $f$ is differentiable at $y$ and $g$ is differentiable at $z = f(y)$, then $g \circ f$ is differentiable at $y$ and
$$\underbrace{d(g \circ f)(y)}_{p \times n} = \underbrace{dg(z)}_{p \times m}\; \underbrace{df(y)}_{m \times n} \quad \text{(matrix multiplication)}.$$

• In particular,
$$\frac{d}{dt}\, g(x(t), y(t), z(t)) = \left[ \frac{\partial g}{\partial x}(x,y,z)\, \frac{dx}{dt} + \frac{\partial g}{\partial y}(x,y,z)\, \frac{dy}{dt} + \frac{\partial g}{\partial z}(x,y,z)\, \frac{dz}{dt} \right]_{(x,y,z) = (x(t),\, y(t),\, z(t))}.$$
A.20. Let $f: D \to \mathbb{R}^m$ where $D \subset \mathbb{R}^n$ is open and connected (so arcwise-connected). Then $df(x) = 0$ for all $x \in D$ implies that $f$ is constant.
A.21. If $f$ is differentiable at $y$, then all partial and directional derivatives exist at $y$ and
$$df(y) = \begin{pmatrix} \nabla f_1(y) \\ \vdots \\ \nabla f_m(y) \end{pmatrix} = \begin{pmatrix} \frac{\partial f_1}{\partial x_1}(y) & \cdots & \frac{\partial f_1}{\partial x_n}(y) \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1}(y) & \cdots & \frac{\partial f_m}{\partial x_n}(y) \end{pmatrix} = \left( \frac{\partial f}{\partial x_1}(y), \ldots, \frac{\partial f}{\partial x_n}(y) \right).$$
A.22. Inversion (Mapping) Theorem: Let $\Omega \subset \mathbb{R}^n$ be open, $f: \Omega \to \mathbb{R}^n$ be $C^1$, and $c \in \Omega$. If the $n \times n$ matrix $df(c)$ is bijective, then there exists an open neighborhood $U$ of $c$ such that

(a) $V = f(U)$ is an open neighborhood of $f(c)$;

(b) $f|_U : U \to V$ is bijective;

(c) $g = (f|_U)^{-1} : V \to U$ is $C^1$;

(d) for all $y \in V$, $dg(y) = [df(g(y))]^{-1}$.
Summary of differential and gradient formulas ($\|\cdot\|$ denotes the Euclidean norm):
$$d(x) = I \qquad \nabla_x\, x^T = I$$
$$d\,\|x\|^2 = 2x^T \qquad \nabla_x \|x\|^2 = 2x$$
$$d(Ax + b) = A \qquad \nabla_x (Ax + b)^T = A^T$$
$$d\left(a^T x\right) = a^T \qquad \nabla_x\, a^T x = a$$
$$d\left(f^T(x)\, g(x)\right) = f^T(x)\, dg(x) + g^T(x)\, df(x)$$
$$\nabla_x \left(f^T(x)\, g(x)\right) = (dg(x))^T f(x) + (df(x))^T g(x) = \left(\nabla_x g^T(x)\right) f(x) + \left(\nabla_x f^T(x)\right) g(x)$$
For symmetric $Q$,
$$d\left(f^T(x)\, Q f(x)\right) = f^T(x)\, Q\, df(x) + f^T(x)\, Q^T df(x) = 2 f^T(x)\, Q\, df(x),$$
$$\nabla_x \left(f^T(x)\, Q f(x)\right) = 2\, (df(x))^T Q f(x) = 2 \left(\nabla_x f^T(x)\right) Q f(x),$$
$$d\left(x^T Q x\right) = 2 x^T Q \qquad \nabla_x\, x^T Q x = 2 Q x.$$
$$d\,\|f(x)\|^2 = 2 f^T(x)\, df(x) \qquad \nabla_x \|f(x)\|^2 = 2 \left(\nabla_x f^T(x)\right) f(x)$$
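The entry $\nabla_x\, x^T Q x = 2Qx$ can be checked by central finite differences (an illustrative Python sketch with an arbitrarily chosen symmetric $Q$):

```python
# Finite-difference check of grad_x (x^T Q x) = 2 Q x for symmetric Q.

Q = [[2.0, 1.0, 0.5],
     [1.0, 3.0, -1.0],
     [0.5, -1.0, 4.0]]   # symmetric by construction
x = [0.3, -0.7, 1.2]

def quad(x):
    """x^T Q x"""
    return sum(x[i] * Q[i][j] * x[j] for i in range(3) for j in range(3))

h = 1e-6
for j in range(3):
    xp = list(x); xp[j] += h
    xm = list(x); xm[j] -= h
    fd = (quad(xp) - quad(xm)) / (2 * h)               # central difference
    exact = 2 * sum(Q[j][k] * x[k] for k in range(3))  # j-th entry of 2Qx
    assert abs(fd - exact) < 1e-6
print("gradient check passed")
```

Central differences are exact for quadratics up to rounding, so the tolerance is comfortable.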
A.4 Integration
A.23. Basic Formulas

(a) $\int a^u\, du = \frac{a^u}{\ln a}$, $a > 0$, $a \neq 1$.

(b) $\int_0^1 t^\alpha\, dt = \begin{cases} \frac{1}{\alpha+1}, & \alpha > -1 \\ \infty, & \alpha \leq -1 \end{cases}$ and $\int_1^\infty t^\alpha\, dt = \begin{cases} -\frac{1}{\alpha+1}, & \alpha < -1 \\ \infty, & \alpha \geq -1. \end{cases}$

So, the integration of the function $\frac{1}{t}$ is the test case. In fact, $\int_0^1 \frac{1}{t}\, dt = \int_1^\infty \frac{1}{t}\, dt = \infty$.

(c) $\int x^m \ln x\, dx = \begin{cases} \frac{x^{m+1}}{m+1}\left(\ln x - \frac{1}{m+1}\right), & m \neq -1 \\ \frac{1}{2}\ln^2 x, & m = -1 \end{cases}$
A.24. Integration by Parts:
$$\int u\, dv = uv - \int v\, du$$
Figure 25: Plots of $t^\alpha$ on $t \in [0, 5]$ for $\alpha = 2,\ 1.5,\ 1,\ 0.5,\ -0.5,\ -1$.
(a) Basic idea: Start with an integral of the form $\int f(x)\, g(x)\, dx$. Match this with an integral of the form $\int u\, dv$ by choosing $dv$ to be part of the integrand including $dx$ and possibly $f(x)$ or $g(x)$.

(b) In particular, repeated application of integration by parts gives
$$\int f(x)\, g(x)\, dx = f(x)\, G_1(x) + \sum_{i=1}^{n-1} (-1)^i f^{(i)}(x)\, G_{i+1}(x) + (-1)^n \int f^{(n)}(x)\, G_n(x)\, dx \tag{37}$$
where $f^{(i)}(x) = \frac{d^i}{dx^i} f(x)$, $G_1(x) = \int g(x)\, dx$, and $G_{i+1}(x) = \int G_i(x)\, dx$. Figure 26 can be used to derive (37).
Figure 26: Integration by Parts (tabular scheme: differentiate $f(x), f^{(1)}(x), \ldots, f^{(n)}(x)$ down one column, integrate $g(x), G_1(x), \ldots, G_n(x)$ down the other, and combine the diagonal products with alternating signs).
To see this, note that
$$\int f(x)\, g(x)\, dx = f(x)\, G_1(x) - \int f'(x)\, G_1(x)\, dx,$$
and
$$\int f^{(n)}(x)\, G_n(x)\, dx = f^{(n)}(x)\, G_{n+1}(x) - \int f^{(n+1)}(x)\, G_{n+1}(x)\, dx.$$
Figure 27: Examples of Integration by Parts using Figure 26: $\int x^2 e^{3x}\, dx = \left(\frac{x^2}{3} - \frac{2x}{9} + \frac{2}{27}\right) e^{3x}$ and $\int e^x \sin x\, dx = \frac{1}{2}\left(\sin x - \cos x\right) e^x$.
(c) $\int x^n e^{ax}\, dx = \frac{x^n e^{ax}}{a} - \frac{n}{a} \int x^{n-1} e^{ax}\, dx$

A.25. If $n$ is a positive integer,
$$\int x^n e^{ax}\, dx = \frac{e^{ax}}{a} \sum_{k=0}^{n} (-1)^k \frac{n!}{a^k (n-k)!}\, x^{n-k}.$$
(a) $n = 1$: $\frac{e^{ax}}{a}\left(x - \frac{1}{a}\right)$

(b) $n = 2$: $\frac{e^{ax}}{a}\left(x^2 - \frac{2}{a}x + \frac{2}{a^2}\right)$
(c) $\int_0^t x^n e^{ax}\, dx = \frac{e^{at}}{a} \sum_{k=0}^{n} (-1)^k \frac{n!}{a^k (n-k)!}\, t^{n-k} - \frac{(-1)^n n!}{a^{n+1}} = \frac{e^{at}}{a} \sum_{k=0}^{n} (-1)^k \frac{n!}{a^k (n-k)!}\, t^{n-k} + \frac{n!}{(-a)^{n+1}}$
(d) $\int_t^\infty x^n e^{-ax}\, dx = \frac{e^{-at}}{a} \sum_{k=0}^{n} \frac{n!}{a^k (n-k)!}\, t^{n-k} = \frac{e^{-at}}{a} \sum_{j=0}^{n} \frac{n!}{a^{n-j}\, j!}\, t^j$
(e) $\int_0^\infty x^n e^{ax}\, dx = \frac{n!}{(-a)^{n+1}}$, $a < 0$. (See also the Gamma function.)

• $n! = \int_0^\infty e^{-t}\, t^n\, dt$.

• In MATLAB, consider using gamma(n+1) instead of factorial(n). Note also that gamma() allows vector input.
(f) $\int_0^1 x^\beta e^{-x}\, dx$ is finite if and only if $\beta > -1$. Note that
$$\frac{1}{e} \int_0^1 x^\beta\, dx \leq \int_0^1 x^\beta e^{-x}\, dx \leq \int_0^1 x^\beta\, dx.$$
(g) For all $\beta \in \mathbb{R}$, $\int_1^\infty x^\beta e^{-x}\, dx < \infty$.

For $\beta \leq 0$, $\int_1^\infty x^\beta e^{-x}\, dx \leq \int_1^\infty e^{-x}\, dx < \int_0^\infty e^{-x}\, dx = 1.$

For $\beta > 0$, $\int_1^\infty x^\beta e^{-x}\, dx \leq \int_1^\infty x^{\lceil \beta \rceil} e^{-x}\, dx \leq \int_0^\infty x^{\lceil \beta \rceil} e^{-x}\, dx = \lceil \beta \rceil\,!$

(h) $\int_0^\infty x^\beta e^{-x}\, dx$ is finite if and only if $\beta > -1$.
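Formula (d) above can be spot-checked against crude numerical quadrature (an illustrative Python sketch; `tail_integral_closed` and `tail_integral_numeric` are our own helper names):

```python
import math

# Check of A.25(d): int_t^inf x^n e^{-ax} dx
#   = (e^{-at}/a) * sum_{k=0}^{n} n!/(a^k (n-k)!) * t^{n-k}.

def tail_integral_closed(n, a, t):
    s = sum(math.factorial(n) / (a**k * math.factorial(n - k)) * t**(n - k)
            for k in range(n + 1))
    return math.exp(-a * t) / a * s

def tail_integral_numeric(n, a, t, upper=60.0, steps=200000):
    """Trapezoid rule on [t, upper]; the integrand is negligible beyond."""
    h = (upper - t) / steps
    f = lambda x: x**n * math.exp(-a * x)
    s = 0.5 * (f(t) + f(upper)) + sum(f(t + i * h) for i in range(1, steps))
    return s * h

n, a, t = 3, 1.5, 2.0
assert abs(tail_integral_closed(n, a, t) - tail_integral_numeric(n, a, t)) < 1e-6
print(tail_integral_closed(n, a, t))
```

At $t = 0$ the closed form reduces to $n!/a^{n+1}$, matching (e) with $a$ replaced by $-a$.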
A.26 (Differentiation of an integral). Leibniz's Rule: Let $g: \mathbb{R}^2 \to \mathbb{R}$, $a: \mathbb{R} \to \mathbb{R}$, and $b: \mathbb{R} \to \mathbb{R}$ be $C^1$. Then $f(x) = \int_{a(x)}^{b(x)} g(x, y)\, dy$ is $C^1$ and
$$f'(x) = b'(x)\, g(x, b(x)) - a'(x)\, g(x, a(x)) + \int_{a(x)}^{b(x)} \frac{\partial g}{\partial x}(x, y)\, dy. \tag{38}$$
In particular, we have
$$\frac{d}{dx} \int_a^x f(t)\, dt = f(x), \tag{39}$$
$$\frac{d}{dx} \int_a^{v(x)} f(t)\, dt = \frac{dv}{dx}\, \frac{d}{dv} \int_a^{v} f(t)\, dt = f(v(x))\, v'(x), \tag{40}$$
$$\frac{d}{dx} \int_{u(x)}^{v(x)} f(t)\, dt = \frac{d}{dx} \left( \int_a^{v(x)} f(t)\, dt - \int_a^{u(x)} f(t)\, dt \right) = f(v(x))\, v'(x) - f(u(x))\, u'(x). \tag{41}$$
Note that (38) can be derived from (A.19) by considering $f(x) = h(a(x), b(x), x)$ where $h(a, b, c) = \int_a^b g(c, y)\, dy$ [9, p 318–319].
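Leibniz's rule (38) can be verified on an example computable in closed form, say $g(x,y) = xy^2$, $a(x) = x$, $b(x) = x^2$ (our own example), so that $f(x) = (x^7 - x^4)/3$ and $f'(x) = (7x^6 - 4x^3)/3$:

```python
# Check of Leibniz's rule (38) with g(x, y) = x*y^2, a(x) = x, b(x) = x^2.

def rhs_of_38(x):
    # b'(x) g(x, b(x)) - a'(x) g(x, a(x)) + int_a^b (dg/dx) dy
    term_b = 2 * x * (x * (x**2) ** 2)   # b'(x) = 2x,  g(x, x^2) = x^5
    term_a = 1 * (x * x**2)              # a'(x) = 1,   g(x, x)   = x^3
    # dg/dx = y^2, so the integral term is (b^3 - a^3)/3 = (x^6 - x^3)/3
    term_int = ((x**2) ** 3 - x**3) / 3
    return term_b - term_a + term_int

x = 1.7
exact = (7 * x**6 - 4 * x**3) / 3        # f'(x) from the closed form
assert abs(rhs_of_38(x) - exact) < 1e-9
print("Leibniz rule check passed")
```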
A.5 Gamma and Beta functions
A.27. Gamma function:

(a) $\Gamma(q) = \int_0^\infty x^{q-1} e^{-x}\, dx$, $q > 0$.

(b) $\Gamma(0) = \infty$.

(c) $\Gamma(n) = (n-1)!$ for $n \in \mathbb{N}$; equivalently, $\Gamma(n+1) = n!$ for $n \in \mathbb{N} \cup \{0\}$.

(d) $0! = 1$.

(e) $\Gamma\left(\frac{1}{2}\right) = \sqrt{\pi}$.

(f) $\Gamma(x+1) = x\, \Gamma(x)$ (integration by parts).

• This relationship is used to define the gamma function for negative numbers.

(g) $\frac{\Gamma(q)}{\alpha^q} = \int_0^\infty x^{q-1} e^{-\alpha x}\, dx$, $\alpha > 0$.
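Several of these properties can be confirmed with Python's `math.gamma`, the analogue of MATLAB's gamma() (an illustrative sketch):

```python
import math

# Spot-checks of the gamma-function properties above.
assert math.gamma(5) == math.factorial(4)                   # Gamma(n) = (n-1)!
assert abs(math.gamma(0.5) - math.sqrt(math.pi)) < 1e-9     # Gamma(1/2) = sqrt(pi)
x = 3.7
assert abs(math.gamma(x + 1) - x * math.gamma(x)) < 1e-9    # recursion (f)
print("gamma checks passed")
```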
A.28. The incomplete beta function is defined as
$$B(x; a, b) = \int_0^x t^{a-1} (1-t)^{b-1}\, dt.$$
For $x = 1$, the incomplete beta function coincides with the (complete) beta function. The regularized incomplete beta function (or regularized beta function for short) is defined in terms of the incomplete beta function and the (complete) beta function:
$$I_x(a, b) = \frac{B(x; a, b)}{B(a, b)}.$$

• For integers $m, k$,
$$I_x(m, k) = \sum_{j=m}^{m+k-1} \frac{(m+k-1)!}{j!\, (m+k-1-j)!}\, x^j (1-x)^{m+k-1-j}.$$

• $I_0(a, b) = 0$, $I_1(a, b) = 1$.

• $I_x(a, b) = 1 - I_{1-x}(b, a)$.
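The integer-argument identity is exactly the binomial tail probability $P(\mathrm{Bin}(m+k-1, x) \geq m)$; a Python sketch (helper names ours) checks it against direct numerical integration of $B(x; a, b)/B(a, b)$ and against the complement identity:

```python
import math

# Check of the integer-argument identity for the regularized
# incomplete beta function, I_x(m, k) = P(Bin(m+k-1, x) >= m).

def I_binom(x, m, k):
    """Binomial-tail form of I_x(m, k) for integer m, k."""
    n = m + k - 1
    return sum(math.comb(n, j) * x**j * (1 - x) ** (n - j)
               for j in range(m, n + 1))

def I_numeric(x, m, k, steps=100000):
    """Midpoint-rule evaluation of B(x; m, k) / B(m, k)."""
    f = lambda t: t ** (m - 1) * (1 - t) ** (k - 1)
    h = x / steps
    inc = sum(f((i + 0.5) * h) for i in range(steps)) * h
    complete = math.gamma(m) * math.gamma(k) / math.gamma(m + k)
    return inc / complete

x, m, k = 0.35, 3, 4
assert abs(I_binom(x, m, k) - I_numeric(x, m, k)) < 1e-6
assert abs(I_binom(x, m, k) - (1 - I_binom(1 - x, k, m))) < 1e-12
print("regularized beta checks passed")
```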
References

[1] Patrick Billingsley. Probability and Measure. John Wiley & Sons, New York, 1995. 5.29, 2b

[2] George Casella and Roger L. Berger. Statistical Inference. Duxbury Press, 2001. 4.42, 6.9, 11, 8.13, 11.11, 11.4, 1, 6, 11.21

[3] Donald G. Childers. Probability and Random Processes Using MATLAB. McGraw-Hill, 1997. 1.3, 1.3

[4] Herbert A. David and H. N. Nagaraja. Order Statistics. Wiley-Interscience, 2003. 11.4, 11.21

[5] W. Feller. An Introduction to Probability Theory and Its Applications, volume 2. John Wiley & Sons, 1971. 4.36

[6] William Feller. An Introduction to Probability Theory and Its Applications, volume 1. Wiley, 3rd edition, 1968.

[7] Terrence L. Fine. Probability and Probabilistic Reasoning for Electrical Engineering. Prentice Hall, 2005. 3.1, 3.10, 3.14, 4.18, 4.23, 5.26, 9.1, 9.3, 11.10, 15.2, A.14

[8] Boris Vladimirovich Gnedenko. Theory of Probability. Chelsea Pub. Co., New York, 4th edition, 1967. Translated from the Russian by B. D. Seckler. 5.29

[9] John A. Gubner. Probability and Random Processes for Electrical and Computer Engineers. Cambridge University Press, 2006. 1.2, 4.30, 4.33, 4.35, 4, 6.43, 7.9, 9.9, 10.1, 2, 3, 6, 11.24, 12.21, 12.22, 14.3, 3, A.5, A.11, A.26

[10] Samuel Karlin and Howard E. Taylor. A First Course in Stochastic Processes. Academic Press, 1975. 4.6, 9.10, 10.1

[11] J. F. C. Kingman. Poisson Processes. Oxford University Press, 1993. ISBN: 0-19-853693-3. 5.23, 5.24

[12] A. N. Kolmogorov. The Foundations of Probability. 1933. 3.14

[13] Nabendu Pal, Chun Jin, and Wooi K. Lim. Handbook of Exponential and Related Distributions for Engineers and Scientists. Chapman & Hall/CRC, 2005. 6.25

[14] Athanasios Papoulis. Probability, Random Variables and Stochastic Processes. McGraw-Hill, 1991. 5.17, 11.5

[15] E. Parzen. Stochastic Processes. Holden-Day, 1962. 3, 4, 11.25

[16] Sidney I. Resnick. Adventures in Stochastic Processes. Birkhäuser Boston, 1992. 5

[17] Kenneth H. Rosen, editor. Handbook of Discrete and Combinatorial Mathematics. CRC, 1999. 1, 4, 5, 6, 7, 8

[18] Walter Rudin. Principles of Mathematical Analysis. McGraw-Hill, 1976. A.5

[19] Gábor J. Székely. Paradoxes in Probability Theory and Mathematical Statistics. 1986. 8.2

[20] Tung. Fundamentals of Probability and Statistics for Reliability Analysis. 2005. 17, 20

[21] Larry Wasserman. All of Statistics: A Concise Course in Statistical Inference. Springer, 2004. 4.16, 4.18, 4.43, 7.9, 3, 13.1
Index

Bayes Theorem, 21
Binomial theorem, 12
Birthday Paradox, 23
Chevalier de Mere's Scandal of Arithmetic, 22
Delta function, 19
Dirac delta function, see Delta function
event algebra, 25
False Positives on Diagnostic Tests, 24
gradient, 130
gradient vector field, 130
Hessian matrix, 131
Integration by Parts, 132
Jacobian, 98, 130
Jacobian formulas, 99
Leibniz's Rule, 135
Monte Hall's Game, 23
multinomial coefficient, 13
Multinomial Counting, 13
Multinomial Theorem, 13
Order Statistics, 101
Probability of coincidence birthday, 23
Total Probability Theorem, 21
uncorrelated but not independent, 76, 93
Zeta function, 129
Zipf or zeta random variable, 59