
LECTURE NOTES

STATISTICS 424
MATHEMATICAL STATISTICS
SPRING 2003
Robert J. Boik
Department of Mathematical Sciences
Montana State University Bozeman
Revised August 30, 2004
Contents

0 COURSE INFORMATION & SYLLABUS
  0.1 Course Information
  0.2 Syllabus
  0.3 Study Suggestions
  0.4 Types of Proofs

5 CONTINUOUS RANDOM VARIABLES
  5.1 Cumulative Distribution Function (CDF)
  5.2 Density and the Probability Element
  5.3 The Median and Other Percentiles
  5.4 Expected Value
  5.5 Expected Value of a Function
  5.6 Average Deviations
  5.7 Bivariate Distributions
  5.8 Several Variables
  5.9 Covariance and Correlation
  5.10 Independence
  5.11 Conditional Distributions
  5.12 Moment Generating Functions

6 FAMILIES OF CONTINUOUS DISTRIBUTIONS
  6.1 Normal Distributions
  6.2 Exponential Distributions
  6.3 Gamma Distributions
  6.4 Chi Squared Distributions
  6.5 Distributions for Reliability
  6.6 t, F, and Beta Distributions

7 ORGANIZING & DESCRIBING DATA
  7.1 Frequency Distributions
  7.2 Data on Continuous Variables
  7.3 Order Statistics
  7.4 Data Analysis
  7.5 The Sample Mean
  7.6 Measures of Dispersion
  7.7 Correlation

8 SAMPLES, STATISTICS, & SAMPLING DISTRIBUTIONS
  8.1 Random Sampling
  8.2 Likelihood
  8.3 Sufficient Statistics
  8.4 Sampling Distributions
  8.5 Simulating Sampling Distributions
  8.6 Order Statistics
  8.7 Moments of Sample Means and Proportions
  8.8 The Central Limit Theorem (CLT)
  8.9 Using the Moment Generating Function
  8.10 Normal Populations
  8.11 Updating Prior Probabilities Via Likelihood
  8.12 Some Conjugate Families
  8.13 Predictive Distributions

9 ESTIMATION
  9.1 Errors in Estimation
  9.2 Consistency
  9.3 Large Sample Confidence Intervals
  9.4 Determining Sample Size
  9.5 Small Sample Confidence Intervals for $\mu_X$
  9.6 The Distribution of T
  9.7 Pivotal Quantities
  9.8 Estimating a Mean Difference
  9.9 Estimating Variability
  9.10 Deriving Estimators
  9.11 Bayes Estimators
  9.12 Efficiency

10 SIGNIFICANCE TESTING
  10.1 Hypotheses
  10.2 Assessing the Evidence
  10.3 One Sample Z Tests
  10.4 One Sample t Tests
  10.5 Some Nonparametric Tests
  10.6 Probability of the Null Hypothesis

11 TESTS AS DECISION RULES
  11.1 Rejection Regions and Errors
  11.2 The Power Function
  11.3 Choosing a Sample Size
  11.4 Quality Control
  11.5 Most Powerful Tests
  11.6 Randomized Tests
  11.7 Uniformly Most Powerful Tests
  11.8 Likelihood Ratio Tests
  11.9 Bayesian Testing

A GREEK ALPHABET
B ABBREVIATIONS
C PRACTICE EXAMS
  C.1 Equation Sheet
  C.2 Exam 1
  C.3 Exam 2
  C.4 Exam 3
Chapter 0

COURSE INFORMATION & SYLLABUS

0.1 Course Information

Prerequisite: Stat 420

Required Texts
  Berry, D. A. & Lindgren, B. W. (1996). Statistics: Theory and Methods, 2nd edition. Belmont CA: Wadsworth Publishing Co.
  Lecture Notes

Instructor
  Robert J. Boik, 2260 Wilson, 994-5339, rjboik@math.montana.edu.
  Office Hours: Monday & Wednesday 11:00-11:50 & 2:10-3:00; Friday 11:00-11:50.

Holidays: Monday Jan 20 (MLK Day); Monday Feb 17 (Presidents Day); March 10-14 (Spring Break); Apr 18 (University Day)

Drop dates: Wednesday Feb 5 is the last day to drop without a W grade; Friday April 25 is the last day to drop.

Grading: 600 Points Total; Grade cutoffs (percentages) 90, 80, 70, 60; All exams are closed book. Tables and equations will be provided.
  Homework: 200 points
  Exam 1, Wednesday Feb 19, 6:10-?: 100 points, Wilson 1-139
  Exam 2, Monday March 31, 6:10-?: 100 points, Wilson 1-141
  Comprehensive Final, Monday, May 5, 8:00-9:50 AM: 200 points

Homepage: http://www.math.montana.edu/rjboik/classes/424/stat.424.html
  Homework assignments and revised lecture notes will be posted on the Stat 424 home page.
0.2 Syllabus
1. Continuous Random Variables: Remainder of Chapter 5
2. Families of Continuous Distributions: Chapter 6
3. Data: Chapter 7
4. Samples, Statistics, and Sampling Distributions: Chapter 8
5. Estimation: Chapter 9
6. Significance Testing: Chapter 10
7. Tests as Decision Rules: Chapter 11
0.3 Study Suggestions
1. How to do poorly in Stat 424.
(a) Come to class.
(b) Do homework.
(c) When homework is returned, pay most attention to your score.
(d) Skim the text to see if it matches the lecture.
(e) Read notes to prepare for exams.
Except for item (c), there is nothing in the above list that hurts performance
in 424. On the contrary, if you do not come to class etc, then you will likely do
worse. The problem with the above list is that the suggestions are not active
enough. The way to learn probability and mathematical statistics is to do
probability and mathematical statistics. Watching me do a problem or a proof
helps, but it is not enough. You must do the problem or proof yourself.
2. How to do well in Stat 424.
(a) In class
Make a note of all new terms.
Make a note of new results or theorems.
Make a note of which results or theorems were proven.
For each proof, make a note of the main steps, especially of how the
proof begins.
(b) Re-write notes at home.
  Organize ideas and carefully define all new terms. Use the text and the bound lecture notes for help.
  Write up each proof, filling in details. Use the text and the bound lecture notes for help.
  Review weak areas. For example, a proof may use certain mathematical tools and/or background material in probability and statistics. If your knowledge of the tools and/or background is weak, then review the material. Summarize your review in your notes. Use the text and the bound lecture notes for help.
(c) Prepare for exams.
  Practice: re-work homework problems from scratch. Be careful not to make the same mistakes as were made on the original homework. Learn from the mistakes on the graded homework.
  Practice: take the practice exam with notes, text, and bound lecture notes as aids.
  Practice: re-take the practice exam without notes, text, and bound lecture notes as aids.
  Practice: re-work proofs without notes, text, and bound lecture notes as aids.
0.4 Types of Proofs
1. Proof by Construction: To prove the claim "If A then B," start by assuming that A is true. Ask yourself what the sufficient conditions for B to be true are. Show that if A is true, then one or more of the sufficient conditions for B are true. A proof by construction essentially constructs B from A.

2. Proof by Contradiction: To prove the claim "If A then B," start by assuming that A is true and that B is false. Work forward from both of these assumptions and show that they imply a statement that obviously is false. For example, if A is true and B is false, then Var(X) < 0. This false statement contradicts the possibility that A can be true and B can be false. Therefore, if A is true, B also must be true.

3. Proof by Contrapositive: This is similar to the proof by contradiction. To prove the claim "If A then B," start by assuming that A is true and that B is false. Work forward only from the assumption that B is false and show that it implies that A also must be false. This contradicts the assumptions that A is true and B is false. Therefore, if A is true, B also must be true.
4. Proof by Induction: To prove the claim that if A, then $B_1, B_2, B_3, \ldots, B_n$ all are true, start by proving the claim "If A then $B_1$." Any of the above types of proof may be used to prove this claim. Then prove the claim "If A and $B_k$, then $B_{k+1}$." The proofs of the two claims imply, by induction, that A implies that $B_1, B_2, \ldots, B_n$ all are true.
Chapter 5
CONTINUOUS RANDOM
VARIABLES
5.1 Cumulative Distribution Function (CDF)
1. Definition: Let X be a random variable. Then the cdf of X is denoted by $F_X(x)$ and defined by
   $$F_X(x) = P(X \le x).$$
   If X is the only random variable under consideration, then $F_X(x)$ can be written as $F(x)$.
2. Example: Discrete Distribution. Suppose that $X \sim \mathrm{Bin}(3, 0.5)$. Then $F(x)$ is a step function and can be written as
   $$F(x) = \begin{cases} 0 & x \in (-\infty, 0); \\ \tfrac{1}{8} & x \in [0, 1); \\ \tfrac{1}{2} & x \in [1, 2); \\ \tfrac{7}{8} & x \in [2, 3); \\ 1 & x \in [3, \infty). \end{cases}$$
3. Example: Continuous Distribution. Consider modeling the probability of vehicle accidents on I-94 in the Gallatin Valley by a Poisson process with rate $\lambda$ per year. Let T be the time until the first accident. Then
   $$P(T \le t) = P(\text{at least one accident in time } t) = 1 - P(\text{no accidents in time } t) = 1 - \frac{e^{-\lambda t}(\lambda t)^0}{0!} = 1 - e^{-\lambda t}.$$
   Therefore,
   $$F(t) = \begin{cases} 0 & t < 0; \\ 1 - e^{-\lambda t} & t \ge 0. \end{cases}$$
4. Example: Uniform Distribution. Suppose that X is a random variable with support $\mathcal{S} = [a, b]$, where $b > a$. Further, suppose that the probability that X falls in an interval in $\mathcal{S}$ is proportional to the length of the interval. That is, $P(x_1 \le X \le x_2) = c(x_2 - x_1)$ for $a \le x_1 \le x_2 \le b$. To solve for c, let $x_1 = a$ and $x_2 = b$. Then
   $$P(a \le X \le b) = 1 = c(b - a) \implies c = \frac{1}{b - a}.$$
   Accordingly, the cdf is
   $$F(x) = P(X \le x) = P(a \le X \le x) = \begin{cases} 0 & x < a; \\ \dfrac{x - a}{b - a} & x \in [a, b]; \\ 1 & x > b. \end{cases}$$
   In this case, X is said to have a uniform distribution: $X \sim \mathrm{Unif}(a, b)$.
5. Properties of a cdf
   (a) $F(-\infty) = 0$ and $F(\infty) = 1$. Your text tries (without success) to motivate this result by using equation 1 on page 157. Ignore the discussion on the bottom of page 160 and the top of page 161.
   (b) F is non-decreasing; i.e., $F(a) \le F(b)$ whenever $a \le b$.
   (c) $F(x)$ is right continuous. That is, $\lim_{\varepsilon \to 0^+} F(x + \varepsilon) = F(x)$.

6. Let X be a rv with cdf $F(x)$.
   (a) If $b \ge a$, then $P(a < X \le b) = F(b) - F(a)$.
   (b) For any x, $P(X = x) = \lim_{\varepsilon \to 0^+} P(x - \varepsilon < X \le x) = F(x) - F(x - \varepsilon)$, where $F(x - \varepsilon)$ is F evaluated at $x - \varepsilon$ and $\varepsilon$ is an infinitesimally small positive number. If the cdf of X is continuous from the left, then $F(x - \varepsilon) \to F(x)$ and $P(X = x) = 0$. If the cdf of X has a jump at x, then $F(x) - F(x - \varepsilon)$ is the size of the jump.
   (c) Example: Problem 5-8.

7. Definition of Continuous Distribution: The distribution of the rv X is said to be continuous if the cdf is continuous at each x and the cdf is differentiable (except, possibly, at a countable number of points).
8. Monotonic transformations of a continuous rv: Let X be a continuous rv with cdf $F_X(x)$.
   (a) Suppose that $g(X)$ is a continuous one-to-one increasing function. Then for y in the counter-domain (range) of g, the inverse function $x = g^{-1}(y)$ exists. Let $Y = g(X)$. Find the cdf of Y. Solution:
       $$P(Y \le y) = P[g(X) \le y] = P[X \le g^{-1}(y)] = F_X[g^{-1}(y)].$$
   (b) Suppose that $g(X)$ is a continuous one-to-one decreasing function. Then for y in the counter-domain (range) of g, the inverse function $x = g^{-1}(y)$ exists. Let $Y = g(X)$. Find the cdf of Y. Solution:
       $$P(Y \le y) = P[g(X) \le y] = P[X > g^{-1}(y)] = 1 - F_X[g^{-1}(y)].$$
   (c) Example: Suppose that $X \sim \mathrm{Unif}(0, 1)$, and $Y = g(X) = hX + k$ where $h < 0$. Then $X = g^{-1}(Y) = (Y - k)/h$;
       $$F_X(x) = \begin{cases} 0 & x < 0; \\ x & x \in [0, 1]; \\ 1 & x > 1, \end{cases}$$
       and
       $$F_Y(y) = 1 - F_X[(y - k)/h] = \begin{cases} 0 & y < h + k; \\ \dfrac{y - (h + k)}{-h} & y \in [h + k, k]; \\ 1 & y > k. \end{cases}$$
       That is, $Y \sim \mathrm{Unif}(h + k, k)$.
   (d) Inverse CDF Transformation.
       i. Suppose that X is a continuous rv having a strictly increasing cdf $F_X(x)$. Recall that a strictly monotone function has an inverse. Denote the inverse of the cdf by $F_X^{-1}$. That is, if $F_X(x) = y$, then $F_X^{-1}(y) = x$. Let $Y = F_X(X)$. Then the distribution of Y is $\mathrm{Unif}(0, 1)$.
          Proof: If $W \sim \mathrm{Unif}(0, 1)$, then the cdf of W is $F_W(w) = w$. The cdf of Y is
          $$F_Y(y) = P(Y \le y) = P(F_X(X) \le y) = P\left[X \le F_X^{-1}(y)\right] = F_X\left[F_X^{-1}(y)\right] = y.$$
          If Y has support [0, 1] and $F_Y(y) = y$, then it must be true that $Y \sim \mathrm{Unif}(0, 1)$.
       ii. Let U be a rv with distribution $U \sim \mathrm{Unif}(0, 1)$. Suppose that $F_X(x)$ is a strictly increasing cdf for a continuous random variable X. Then the cdf of the rv $F_X^{-1}(U)$ is $F_X(x)$.
          Proof:
          $$P\left[F_X^{-1}(U) \le x\right] = P[U \le F_X(x)] = F_U[F_X(x)] = F_X(x)$$
          because $F_U(u) = u$.
   (e) Application of inverse cdf transformation: Given $U_1, U_2, \ldots, U_n$, a random sample from $\mathrm{Unif}(0, 1)$, generate a random sample from $F_X(x)$. Solution: Let $X_i = F_X^{-1}(U_i)$ for $i = 1, \ldots, n$.
       i. Example 1: Suppose that $F_X(x) = 1 - e^{-\lambda x}$ for $x > 0$, where $\lambda > 0$. Then $X_i = -\ln(1 - U_i)/\lambda$ for $i = 1, \ldots, n$ is a random sample from $F_X$.
       ii. Example 2: Suppose that
          $$F_X(x) = \left[1 - \left(\frac{a}{x}\right)^b\right] I_{(a,\infty)}(x),$$
          where $a > 0$ and $b > 0$ are constants. Then $X_i = a(1 - U_i)^{-1/b}$ for $i = 1, \ldots, n$ is a random sample from $F_X$.
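A minimal Matlab sketch of the inverse-cdf method for Example 1 (the exponential cdf) is given below; the sample size n and rate lambda are illustrative choices, not values from the notes.

    % Inverse-cdf sampling sketch (illustrative n and lambda)
    n = 10000; lambda = 2;
    u = rand(n, 1);              % U_i ~ Unif(0,1)
    x = -log(1 - u) / lambda;    % X_i = F^{-1}(U_i) for F(x) = 1 - exp(-lambda*x)
    % Check: sample mean and variance should be near 1/lambda and 1/lambda^2
    [mean(x), var(x)]

The same pattern applies to Example 2 with x = a*(1 - u).^(-1/b).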
9. Non-monotonic transformations of a continuous rv. Let X be a continuous rv with cdf $F_X(x)$. Suppose that $Y = g(X)$ is a continuous but non-monotonic function. As in the case of monotonic functions, $F_Y(y) = P(Y \le y) = P[g(X) \le y]$, but in this case each inverse solution $x = g^{-1}(y)$ must be used to find an expression for $F_Y(y)$ in terms of $F_X[g^{-1}(y)]$. For example, suppose that $X \sim \mathrm{Unif}(-1, 2)$ and $g(X) = Y = X^2$. Note that $x = \pm\sqrt{y}$ for $y \in [0, 1]$ and $x = +\sqrt{y}$ for $y \in (1, 4]$. The cdf of Y is
   $$F_Y(y) = P(X^2 \le y) = \begin{cases} P(-\sqrt{y} \le X \le \sqrt{y}) & y \in [0, 1]; \\ P(X \le \sqrt{y}) & y \in (1, 4] \end{cases} = \begin{cases} F_X(\sqrt{y}) - F_X(-\sqrt{y}) & y \in [0, 1]; \\ F_X(\sqrt{y}) & y \in (1, 4] \end{cases} = \begin{cases} 0 & y < 0; \\ 2\sqrt{y}/3 & y \in [0, 1]; \\ (\sqrt{y} + 1)/3 & y \in (1, 4]; \\ 1 & y > 4. \end{cases}$$
   Plot the function $g(x)$ over $x \in \mathcal{S}_X$ as an aid to finding the inverse solutions $x = g^{-1}(y)$.
5.2 Density and the Probability Element

1. Mathematical Result: Assume that $F(x)$ is a continuous cdf. Let $g(x + m)$ be a differentiable function and let $y = x + m$. Then
   $$\frac{d}{dm} g(x + m)\Big|_{m=0} = \frac{d}{dy} g(y)\,\frac{dy}{dm}\Big|_{m=0} = \frac{d}{dy} g(y)\Big|_{m=0} = \frac{d}{dx} g(x)$$
   by the chain rule.
2. Probability Element: Suppose that X is a continuous rv. Let $\Delta x$ be a small positive number. Define $h(a, b)$ as
   $$h(a, b) \overset{\text{def}}{=} P(a \le X \le a + b) = F_X(a + b) - F_X(a).$$
   Expand $h(x, \Delta x) = P(x \le X \le x + \Delta x)$ in a Taylor series around $\Delta x = 0$:
   $$h(x, \Delta x) = F(x + \Delta x) - F(x) = h(x, 0) + \frac{d}{d\Delta x} h(x, \Delta x)\Big|_{\Delta x = 0}\Delta x + o(\Delta x) = 0 + \frac{d}{d\Delta x} F(x + \Delta x)\Big|_{\Delta x = 0}\Delta x + o(\Delta x) = \left[\frac{d}{dx} F(x)\right]\Delta x + o(\Delta x),$$
   where $\lim_{\Delta x \to 0} o(\Delta x)/\Delta x = 0$. The function
   $$dF(x) = \left[\frac{d}{dx} F(x)\right]\Delta x$$
   is called the differential. In the field of statistics, the differential of a cdf is called the probability element. The probability element is an approximation to $h(x, \Delta x)$. Note that the probability element is a linear function of the derivative $\frac{d}{dx} F(x)$.
3. Example: Suppose that
   $$F(x) = \begin{cases} 0 & x < 0; \\ 1 - e^{-3x} & \text{otherwise.} \end{cases}$$
   Note that $F(x)$ is a cdf. Find the probability element at $x = 2$ and approximate the probability $P(2 \le X \le 2.01)$. Solution: $\frac{d}{dx} F(x) = 3e^{-3x}$, so the probability element is $3e^{-6}\Delta x$ and
   $$P(2 \le X \le 2.01) \approx 3e^{-6}\times 0.01 = 0.00007436.$$
   The exact probability is $F(2.01) - F(2) = 0.00007326$.
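As a quick numerical check of the approximation above, the following Matlab lines (a minimal sketch) compare the probability element at x = 2 with the exact cdf increment:

    lambda = 3;                                            % rate in F(x) = 1 - exp(-3x)
    approx = lambda*exp(-lambda*2)*0.01                    % probability element f(2)*0.01
    exact  = (1 - exp(-lambda*2.01)) - (1 - exp(-lambda*2))  % F(2.01) - F(2)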
4. The average density in the interval $(x, x + \Delta x)$ is defined as
   $$\text{Average density} \overset{\text{def}}{=} \frac{P(x < X < x + \Delta x)}{\Delta x}.$$

5. Density: The probability density function (pdf) at x is the limit of the average density as $\Delta x \to 0$:
   $$\text{pdf} = f(x) \overset{\text{def}}{=} \lim_{\Delta x \to 0}\frac{P(x \le X \le x + \Delta x)}{\Delta x} = \lim_{\Delta x \to 0}\frac{F_X(x + \Delta x) - F_X(x)}{\Delta x} = \lim_{\Delta x \to 0}\frac{\left[\frac{d}{dx}F(x)\right]\Delta x + o(\Delta x)}{\Delta x} = \frac{d}{dx} F(x).$$
   Note that the probability element can be written as $dF(x) = f(x)\Delta x$.
6. Example: Suppose that $\lambda$ is a positive real number. If
   $$F(x) = \begin{cases} 1 - e^{-\lambda x} & x \ge 0; \\ 0 & \text{otherwise,} \end{cases} \quad\text{then}\quad f(x) = \frac{d}{dx}F(x) = \begin{cases} \lambda e^{-\lambda x} & x \ge 0; \\ 0 & \text{otherwise.} \end{cases}$$

7. Example: If $X \sim \mathrm{Unif}(a, b)$, then
   $$F(x) = \begin{cases} 0 & x < a; \\ \dfrac{x - a}{b - a} & x \in [a, b]; \\ 1 & x > b \end{cases} \quad\text{and}\quad f(x) = \frac{d}{dx}F(x) = \begin{cases} 0 & x < a; \\ \dfrac{1}{b - a} & x \in [a, b]; \\ 0 & x > b. \end{cases}$$
8. Properties of a pdf
   (i) $f(x) \ge 0$ for all x.
   (ii) $\int_{-\infty}^{\infty} f(x)\,dx = 1$.

9. Relationship between pdf and cdf: If X is a continuous rv with pdf $f(x)$ and cdf $F(x)$, then
   $$f(x) = \frac{d}{dx}F(x), \qquad F(x) = \int_{-\infty}^{x} f(u)\,du, \quad\text{and}$$
   $$P(a < X < b) = P(a \le X \le b) = P(a < X \le b) = P(a \le X < b) = F(b) - F(a) = \int_a^b f(x)\,dx.$$
10. PDF example - Cauchy distribution. Let $f(x) = c/(1 + x^2)$ for $-\infty < x < \infty$ and where c is a constant. Note that $f(x)$ is nonnegative and
    $$\int_{-\infty}^{\infty}\frac{1}{1 + x^2}\,dx = \arctan(x)\Big|_{-\infty}^{\infty} = \frac{\pi}{2} - \left(-\frac{\pi}{2}\right) = \pi.$$
    Accordingly, if we let $c = 1/\pi$, then
    $$f(x) = \frac{1}{\pi(1 + x^2)}$$
    is a pdf. It is called the Cauchy pdf. The corresponding cdf is
    $$F(x) = \int_{-\infty}^{x}\frac{1}{\pi(1 + u^2)}\,du = \frac{\arctan(u)}{\pi}\Big|_{-\infty}^{x} = \frac{1}{\pi}\left[\arctan(x) + \frac{\pi}{2}\right] = \frac{\arctan(x)}{\pi} + \frac{1}{2}.$$
11. PDF example - Gamma distribution: A more general waiting time distribution: Let T be the time of arrival of the rth event in a Poisson process with rate parameter $\lambda$. Find the pdf of T. Solution: $T \in (t, t + \Delta t)$ if and only if (a) $r - 1$ events occur before time t and (b) one event occurs in the interval $(t, t + \Delta t)$. The probability that two or more events occur in $(t, t + \Delta t)$ is $o(\Delta t)$ and can be ignored. By the Poisson assumptions, outcomes (a) and (b) are independent and the probability of outcome (b) is $\lambda\Delta t + o(\Delta t)$. Accordingly,
    $$P(t < T < t + \Delta t) \approx f(t)\Delta t = \frac{e^{-\lambda t}(\lambda t)^{r-1}}{(r - 1)!}\,\lambda\Delta t = \left[\frac{e^{-\lambda t}\lambda^r t^{r-1}}{(r - 1)!}\right]\Delta t$$
    and the pdf is
    $$f(t) = \begin{cases} 0 & t < 0; \\ \dfrac{e^{-\lambda t}\lambda^r t^{r-1}}{(r - 1)!} & t \ge 0 \end{cases} = \frac{e^{-\lambda t}\lambda^r t^{r-1}}{\Gamma(r)}\,I_{[0,\infty)}(t).$$
12. Transformations with Single-Valued Inverses: If X is a continuous random variable with pdf $f_X(x)$ and $Y = g(X)$ is a single-valued differentiable function of X, then the pdf of Y is
    $$f_Y(y) = f_X\left[g^{-1}(y)\right]\left|\frac{d}{dy}g^{-1}(y)\right|$$
    for $y \in \mathcal{S}_{g(X)}$ (i.e., the support of $Y = g(X)$). The term
    $$J(y) = \frac{d}{dy}g^{-1}(y)$$
    is called the Jacobian of the transformation.
    (a) Justification 1: Suppose that $Y = g(X)$ is strictly increasing. Then $F_Y(y) = F_X[g^{-1}(y)]$ and
        $$f_Y(y) = \frac{d}{dy}F_Y(y) = f_X\left[g^{-1}(y)\right]\frac{d}{dy}g^{-1}(y) = f_X\left[g^{-1}(y)\right]\left|\frac{d}{dy}g^{-1}(y)\right|$$
        because the Jacobian is positive. Suppose that $Y = g(X)$ is strictly decreasing. Then $F_Y(y) = 1 - F_X[g^{-1}(y)]$ and
        $$f_Y(y) = \frac{d}{dy}\left\{1 - F_X\left[g^{-1}(y)\right]\right\} = -f_X\left[g^{-1}(y)\right]\frac{d}{dy}g^{-1}(y) = f_X\left[g^{-1}(y)\right]\left|\frac{d}{dy}g^{-1}(y)\right|$$
        because the Jacobian is negative.
    (b) Justification 2: Suppose that $g(x)$ is strictly increasing. Recall that
        $$P(x \le X \le x + \Delta x) = f_X(x)\Delta x + o(\Delta x).$$
        Note that
        $$x \le X \le x + \Delta x \iff g(x) \le g(X) \le g(x + \Delta x).$$
        Accordingly,
        $$P(x \le X \le x + \Delta x) = P(y \le Y \le y + \Delta y) = f_Y(y)\Delta y + o(\Delta y) = f_X(x)\Delta x + o(\Delta x),$$
        where $y + \Delta y = g(x + \Delta x)$. Expanding $g(x + \Delta x)$ around $\Delta x = 0$ reveals that
        $$y + \Delta y = g(x + \Delta x) = g(x) + \frac{dg(x)}{dx}\Delta x + o(\Delta x).$$
        Also,
        $$y = g(x) \implies g^{-1}(y) = x \implies \frac{dg^{-1}(y)}{dy} = \frac{dx}{dy} \implies \frac{dy}{dx} = \frac{dg(x)}{dx} = \left[\frac{dg^{-1}(y)}{dy}\right]^{-1}$$
        $$\implies y + \Delta y = g(x) + \left[\frac{dg^{-1}(y)}{dy}\right]^{-1}\Delta x \implies \Delta y = \left[\frac{dg^{-1}(y)}{dy}\right]^{-1}\Delta x \implies \Delta x = \frac{dg^{-1}(y)}{dy}\Delta y.$$
        Lastly, equating $f_X(x)\Delta x$ to $f_Y(y)\Delta y$ reveals that
        $$f_Y(y)\Delta y = f_X(x)\Delta x = f_X\left[g^{-1}(y)\right]\Delta x = f_X\left[g^{-1}(y)\right]\frac{dg^{-1}(y)}{dy}\Delta y \implies f_Y(y) = f_X\left[g^{-1}(y)\right]\frac{dg^{-1}(y)}{dy}.$$
        The Jacobian $\frac{dg^{-1}(y)}{dy}$ is positive for an increasing function, so the absolute value operation is not necessary. A similar argument can be made for the case when $g(x)$ is strictly decreasing.
13. Transformations with Multiple-Valued Inverses: If $g(x)$ has more than one inverse function, then a separate probability element must be calculated for each of the inverses. For example, suppose that $X \sim \mathrm{Unif}(-1, 2)$ and $Y = g(X) = X^2$. There are two inverse functions for $y \in [0, 1]$, namely $x = -\sqrt{y}$ and $x = +\sqrt{y}$. There is a single inverse function for $y \in (1, 4]$. The pdf of Y is found as
    $$f_Y(y) = \begin{cases} 0 & y < 0; \\ f_X(-\sqrt{y})\left|\dfrac{d(-\sqrt{y})}{dy}\right| + f_X(\sqrt{y})\left|\dfrac{d\sqrt{y}}{dy}\right| & y \in [0, 1]; \\ f_X(\sqrt{y})\left|\dfrac{d\sqrt{y}}{dy}\right| & y \in (1, 4]; \\ 0 & y > 4 \end{cases} = \begin{cases} 0 & y < 0; \\ \dfrac{1}{3\sqrt{y}} & y \in [0, 1]; \\ \dfrac{1}{6\sqrt{y}} & y \in (1, 4]; \\ 0 & y > 4. \end{cases}$$
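The piecewise density above can be checked by simulation. The sketch below (Matlab; the sample size and the interval [0.25, 0.5] are illustrative choices) squares draws from Unif(-1, 2) and compares an empirical probability with the corresponding cdf increment 2*sqrt(y)/3:

    n = 1e6;
    x = -1 + 3*rand(n,1);                   % X ~ Unif(-1, 2)
    y = x.^2;                               % Y = X^2
    emp = mean(y >= 0.25 & y <= 0.5)        % empirical P(0.25 <= Y <= 0.5)
    thy = (2/3)*(sqrt(0.5) - sqrt(0.25))    % F_Y(0.5) - F_Y(0.25)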
5.3 The Median and Other Percentiles

1. Definition: The number $x_p$ is said to be the 100pth percentile of the distribution of X if $x_p$ satisfies
   $$F_X(x_p) = P(X \le x_p) = p.$$

2. If the cdf $F_X(x)$ is strictly increasing, then $x_p = F_X^{-1}(p)$ and $x_p$ is unique.

3. If $F_X(x)$ is not strictly increasing, then $x_p$ may not be unique.

4. Median: The median is the 50th percentile (i.e., $p = 0.5$).

5. Quartiles: The first and third quartiles are $x_{0.25}$ and $x_{0.75}$, respectively.

6. Example: If $X \sim \mathrm{Unif}(a, b)$, then $(x_p - a)/(b - a) = p$; $x_p = a + p(b - a)$; and $x_{0.5} = (a + b)/2$.

7. Example: If $F_X(x) = 1 - e^{-\lambda x}$ (i.e., waiting time distribution), then $1 - e^{-\lambda x_p} = p$; $x_p = -\ln(1 - p)/\lambda$; and $x_{0.5} = \ln(2)/\lambda$.
8. Example - Cauchy: Suppose that X is a random variable with pdf
   $$f(x) = \frac{1}{\pi\beta\left[1 + \dfrac{(x - \alpha)^2}{\beta^2}\right]},$$
   where $-\infty < x < \infty$; $\beta > 0$; and $\alpha$ is a finite number. Then
   $$F(x) = \int_{-\infty}^{x} f(u)\,du = \int_{-\infty}^{(x - \alpha)/\beta}\frac{1}{\pi(1 + z^2)}\,dz \quad\left(\text{make the change of variable from } u \text{ to } z = \frac{u - \alpha}{\beta}\right) = \frac{1}{\pi}\arctan\left(\frac{x - \alpha}{\beta}\right) + \frac{1}{2}.$$
   Accordingly,
   $$F(x_p) = p \implies x_p = \alpha + \beta\tan\left[\pi(p - 0.5)\right];$$
   $$x_{0.25} = \alpha + \beta\tan\left[\pi(0.25 - 0.5)\right] = \alpha - \beta; \quad x_{0.5} = \alpha + \beta\tan(0) = \alpha; \quad\text{and}\quad x_{0.75} = \alpha + \beta\tan\left[\pi(0.75 - 0.5)\right] = \alpha + \beta.$$
9. Definition - Symmetric Distribution: A distribution is said to be symmetric around c if $F_X(c - \varepsilon) = 1 - F_X(c + \varepsilon)$ for all $\varepsilon$.

10. Definition - Symmetric Distribution: A distribution is said to be symmetric around c if $f_X(c - \varepsilon) = f_X(c + \varepsilon)$ for all $\varepsilon$.

11. Median of a symmetric distribution. Suppose that the distribution of X is symmetric around c. Then, set $\varepsilon$ to $c - x_{0.5}$ to obtain
    $$F_X(x_{0.5}) = \frac{1}{2} = 1 - F_X(2c - x_{0.5}) \implies F_X(2c - x_{0.5}) = \frac{1}{2} \implies c = x_{0.5}.$$
    That is, if the distribution of X is symmetric around c, then the median of the distribution is c.
5.4 Expected Value

1. Definition: Let X be a rv with pdf $f(x)$. Then the expected value (or mean) of X, if it exists, is
   $$E(X) = \mu_X = \int_{-\infty}^{\infty} x f(x)\,dx.$$

2. The expectation is said to exist if the integral of the positive part of the function is finite and the integral of the negative part of the function is finite.
5.5 Expected Value of a Function

1. Let X be a rv with pdf $f(x)$. Then the expected value of $g(X)$, if it exists, is
   $$E[g(X)] = \int_{-\infty}^{\infty} g(x) f(x)\,dx.$$

2. Linear Functions. The integral operator is linear. If $g_1(X)$ and $g_2(X)$ are functions whose expectations exist and a, b, c are constants, then
   $$E[a g_1(X) + b g_2(X) + c] = a E[g_1(X)] + b E[g_2(X)] + c.$$
3. Symmetric Distributions: If the distribution of X is symmetric around c and the expectation exists, then $E(X) = c$.
   Proof: Assume that the mean exists. First, show that $E(X - c) = 0$:
   $$E(X - c) = \int_{-\infty}^{\infty}(x - c) f(x)\,dx = \int_{-\infty}^{c}(x - c) f(x)\,dx + \int_{c}^{\infty}(x - c) f(x)\,dx$$
   (let $x = c - u$ in integral 1 and let $x = c + u$ in integral 2)
   $$= -\int_{0}^{\infty} u f(c - u)\,du + \int_{0}^{\infty} u f(c + u)\,du = \int_{0}^{\infty} u\left[f(c + u) - f(c - u)\right]du = 0$$
   by symmetry of the pdf around c. Now use $E(X - c) = 0 \implies E(X) = c$.
4. Example: Suppose that $X \sim \mathrm{Unif}(a, b)$. That is,
   $$f(x) = \begin{cases} \dfrac{1}{b - a} & x \in [a, b]; \\ 0 & \text{otherwise.} \end{cases}$$
   A sketch of the pdf shows that the distribution is symmetric around $(a + b)/2$. More formally,
   $$f\left(\frac{a + b}{2} - \varepsilon\right) = f\left(\frac{a + b}{2} + \varepsilon\right) = \begin{cases} \dfrac{1}{b - a} & \varepsilon \in \left[-\dfrac{b - a}{2}, \dfrac{b - a}{2}\right]; \\ 0 & \text{otherwise.} \end{cases}$$
   Accordingly, $E(X) = (a + b)/2$. Alternatively, the expectation can be found by integrating $x f(x)$:
   $$E(X) = \int_{-\infty}^{\infty} x f(x)\,dx = \int_a^b\frac{x}{b - a}\,dx = \frac{x^2}{2(b - a)}\Big|_a^b = \frac{b^2 - a^2}{2(b - a)} = \frac{(b - a)(b + a)}{2(b - a)} = \frac{a + b}{2}.$$
5. Example: Suppose that X has a Cauchy distribution. The pdf is
   $$f(x) = \frac{1}{\pi\beta\left[1 + \dfrac{(x - \alpha)^2}{\beta^2}\right]},$$
   where $\alpha$ and $\beta$ are constants that satisfy $|\alpha| < \infty$ and $\beta \in (0, \infty)$. By inspection, it is apparent that the pdf is symmetric around $\alpha$. Nonetheless, the expectation is not $\alpha$, because the expectation does not exist. That is,
   $$\int_{-\infty}^{\infty} x f(x)\,dx = \int_{-\infty}^{\infty}\frac{x}{\pi\beta\left[1 + \dfrac{(x - \alpha)^2}{\beta^2}\right]}\,dx = \alpha + \int_{-\infty}^{\infty}\frac{\beta z}{\pi(1 + z^2)}\,dz, \quad\text{where } z = \frac{x - \alpha}{\beta},$$
   $$= \alpha + \int_{-\infty}^{0}\frac{\beta z}{\pi(1 + z^2)}\,dz + \int_{0}^{\infty}\frac{\beta z}{\pi(1 + z^2)}\,dz = \alpha + \frac{\beta\ln(1 + z^2)}{2\pi}\Big|_{-\infty}^{0} + \frac{\beta\ln(1 + z^2)}{2\pi}\Big|_{0}^{\infty}$$
   and neither the positive nor the negative part is finite.
6. Example: Waiting time distribution. Suppose that X is a rv with pdf $\lambda e^{-\lambda x}$ for $x > 0$ and where $\lambda > 0$. Then, using integration by parts,
   $$E(X) = \int_0^{\infty} x\lambda e^{-\lambda x}\,dx = -x e^{-\lambda x}\Big|_0^{\infty} + \int_0^{\infty} e^{-\lambda x}\,dx = 0 - \frac{1}{\lambda}e^{-\lambda x}\Big|_0^{\infty} = \frac{1}{\lambda}.$$
5.6 Average Deviations

1. Variance
   (a) Definition:
       $$\mathrm{Var}(X) \overset{\text{def}}{=} E(X - \mu_X)^2 = \int_{-\infty}^{\infty}(x - \mu_X)^2 f(x)\,dx$$
       if the expectation exists. It is conventional to denote the variance of X by $\sigma_X^2$.
   (b) Computational formula: Be able to verify that
       $$\mathrm{Var}(X) = E(X^2) - [E(X)]^2.$$
   (c) Example: Suppose that $X \sim \mathrm{Unif}(a, b)$. Then
       $$E(X^r) = \int_a^b\frac{x^r}{b - a}\,dx = \frac{x^{r+1}}{(r + 1)(b - a)}\Big|_a^b = \frac{b^{r+1} - a^{r+1}}{(r + 1)(b - a)}.$$
       Accordingly, $\mu_X = (a + b)/2$,
       $$E(X^2) = \frac{b^3 - a^3}{3(b - a)} = \frac{(b - a)(b^2 + ab + a^2)}{3(b - a)} = \frac{b^2 + ab + a^2}{3}, \quad\text{and}$$
       $$\mathrm{Var}(X) = \frac{b^2 + ab + a^2}{3} - \frac{(b + a)^2}{4} = \frac{b^2 - 2ab + a^2}{12} = \frac{(b - a)^2}{12}.$$
   (d) Example: Suppose that $f(x) = \lambda e^{-\lambda x}$ for $x > 0$ and where $\lambda > 0$. Then $E(X) = 1/\lambda$,
       $$E(X^2) = \int_0^{\infty} x^2\lambda e^{-\lambda x}\,dx = -x^2 e^{-\lambda x}\Big|_0^{\infty} + \int_0^{\infty} 2x e^{-\lambda x}\,dx = 0 + \frac{2}{\lambda^2}$$
       and
       $$\mathrm{Var}(X) = \frac{2}{\lambda^2} - \frac{1}{\lambda^2} = \frac{1}{\lambda^2}.$$
2. MAD
   (a) Definition:
       $$\mathrm{Mad}(X) \overset{\text{def}}{=} E(|X - \mu_X|) = \int_{-\infty}^{\infty}|x - \mu_X|\,f(x)\,dx.$$
   (b) Alternative expression: First, note that
       $$E(|X - c|) = \int_{-\infty}^{c}(c - x) f(x)\,dx + \int_{c}^{\infty}(x - c) f(x)\,dx = c\left[2F_X(c) - 1\right] - \int_{-\infty}^{c} x f(x)\,dx + \int_{c}^{\infty} x f(x)\,dx.$$
       Accordingly,
       $$\mathrm{Mad}(X) = \mu_X\left[2F_X(\mu_X) - 1\right] - \int_{-\infty}^{\mu_X} x f(x)\,dx + \int_{\mu_X}^{\infty} x f(x)\,dx.$$
   (c) Leibnitz's Rule: Suppose that $a(\theta)$, $b(\theta)$, and $g(x, \theta)$ are differentiable functions of $\theta$. Then
       $$\frac{d}{d\theta}\int_{a(\theta)}^{b(\theta)} g(x, \theta)\,dx = g[b(\theta), \theta]\frac{d}{d\theta}b(\theta) - g[a(\theta), \theta]\frac{d}{d\theta}a(\theta) + \int_{a(\theta)}^{b(\theta)}\frac{\partial}{\partial\theta}g(x, \theta)\,dx.$$
   (d) Result: If the expectation $E(|X - c|)$ exists, then the minimizer of $E(|X - c|)$ with respect to c is $c = F_X^{-1}(0.5) = $ median of X.
       Proof: Set the derivative of $E(|X - c|)$ to zero and solve for c:
       $$\frac{d}{dc}E(|X - c|) = \frac{d}{dc}\left\{c\left[2F_X(c) - 1\right] - \int_{-\infty}^{c} x f_X(x)\,dx + \int_{c}^{\infty} x f_X(x)\,dx\right\} = 2F_X(c) - 1 + 2c f_X(c) - c f_X(c) - c f_X(c) = 2F_X(c) - 1.$$
       Equating the derivative to zero and solving for c reveals that c is a solution to $F_X(c) = 0.5$. That is, c is the median of X. Use the second derivative test to verify that the solution is a minimizer:
       $$\frac{d^2}{dc^2}E(|X - c|) = \frac{d}{dc}\left[2F_X(c) - 1\right] = 2f_X(c) > 0 \implies c \text{ is a minimizer.}$$
   (e) Example: Suppose that $X \sim \mathrm{Unif}(a, b)$. Then $F_X\left(\frac{a+b}{2}\right) = 0.5$ and
       $$\mathrm{Mad}(X) = -\int_a^{\frac{a+b}{2}}\frac{x}{b - a}\,dx + \int_{\frac{a+b}{2}}^{b}\frac{x}{b - a}\,dx = \frac{b - a}{4}.$$
   (f) Example: Suppose that $f_X(x) = \lambda e^{-\lambda x}$ for $x > 0$ and where $\lambda > 0$. Then $E(X) = 1/\lambda$, $\mathrm{Median}(X) = \ln(2)/\lambda$, $F_X(x) = 1 - e^{-\lambda x}$, and
       $$\mathrm{Mad}(X) = \frac{1}{\lambda}\left[2(1 - e^{-1}) - 1\right] - \int_0^{1/\lambda} x\lambda e^{-\lambda x}\,dx + \int_{1/\lambda}^{\infty} x\lambda e^{-\lambda x}\,dx = \frac{2}{\lambda e},$$
       where $\int x\lambda e^{-\lambda x}\,dx = -x e^{-\lambda x} - \frac{1}{\lambda}e^{-\lambda x}$ has been used. The mean absolute deviation from the median is
       $$E\left|X - \frac{\ln(2)}{\lambda}\right| = -\int_0^{\ln(2)/\lambda} x\lambda e^{-\lambda x}\,dx + \int_{\ln(2)/\lambda}^{\infty} x\lambda e^{-\lambda x}\,dx = \frac{\ln(2)}{\lambda}.$$
3. Standard Scores
   (a) Let $Z = \dfrac{X - \mu_X}{\sigma_X}$.
   (b) Moments: $E(Z) = 0$ and $\mathrm{Var}(Z) = 1$.
   (c) Interpretation: Z scores are scaled in standard deviation units.
   (d) Inverse Transformation: $X = \mu_X + \sigma_X Z$.
5.7 Bivariate Distributions

1. Definition: A function $f_{X,Y}(x, y)$ is a bivariate pdf if
   (i) $f_{X,Y}(x, y) \ge 0$ for all x, y and
   (ii) $\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dx\,dy = 1$.

2. Bivariate CDF: If $f_{X,Y}(x, y)$ is a bivariate pdf, then
   $$F_{X,Y}(x, y) = P(X \le x, Y \le y) = \int_{-\infty}^{x}\int_{-\infty}^{y} f_{X,Y}(u, v)\,dv\,du.$$

3. Properties of a bivariate cdf:
   (i) $F_{X,Y}(x, \infty) = F_X(x)$
   (ii) $F_{X,Y}(\infty, y) = F_Y(y)$
   (iii) $F_{X,Y}(\infty, \infty) = 1$
   (iv) $F_{X,Y}(-\infty, y) = F_{X,Y}(x, -\infty) = F_{X,Y}(-\infty, -\infty) = 0$
   (v) $f_{X,Y}(x, y) = \dfrac{\partial^2}{\partial x\,\partial y}F_{X,Y}(x, y)$.

4. Joint pdfs and joint cdfs for three or more random variables are obtained as straightforward generalizations of the above definitions and conditions.
5. Probability Element: $f_{X,Y}(x, y)\Delta x\Delta y$ is the joint probability element. That is,
   $$P(x \le X \le x + \Delta x, \; y \le Y \le y + \Delta y) = f_{X,Y}(x, y)\Delta x\Delta y + o(\Delta x\Delta y).$$

6. Example: Bivariate Uniform. If $(X, Y) \sim \mathrm{Unif}(a, b, c, d)$, then
   $$f_{X,Y}(x, y) = \begin{cases} \dfrac{1}{(b - a)(d - c)} & x \in (a, b), \; y \in (c, d); \\ 0 & \text{otherwise.} \end{cases}$$
   For this density, the probability $P(x_1 \le X \le x_2, \; y_1 \le Y \le y_2)$ is the volume over the rectangle. For example, if $(X, Y) \sim \mathrm{Unif}(0, 4, 0, 6)$, then $P(2.5 \le X \le 3.5, \; 1 \le Y \le 4) = (3.5 - 2.5)(4 - 1)/(4\times 6) = 3/24$. Another example is $P(X^2 + Y^2 > 16) = 1 - P(X^2 + Y^2 \le 16) = 1 - 4\pi/24 = 1 - \pi/6$ because the area of a circle is $\pi r^2$ and therefore, the area of a circle with radius 4 is $16\pi$ and the area of the quarter circle in the support set is $4\pi$.
7. Example: $f_{X,Y}(x, y) = \frac{6}{5}(x + y^2)$ for $x \in (0, 1)$ and $y \in (0, 1)$. Find $P(X + Y < 1)$. Solution: First sketch the region of integration, then use calculus:
   $$P(X + Y < 1) = P(X < 1 - Y) = \int_0^1\int_0^{1-y}\frac{6}{5}(x + y^2)\,dx\,dy = \frac{6}{5}\int_0^1\left[\frac{x^2}{2} + x y^2\right]_0^{1-y}dy$$
   $$= \frac{6}{5}\int_0^1\left[\frac{(1 - y)^2}{2} + (1 - y)y^2\right]dy = \frac{6}{5}\left[\frac{y}{2} - \frac{y^2}{2} + \frac{y^3}{6} + \frac{y^3}{3} - \frac{y^4}{4}\right]_0^1 = \frac{3}{10}.$$
8. Example: Bivariate standard normal
   $$f_{X,Y}(x, y) = \frac{e^{-\frac{1}{2}(x^2 + y^2)}}{2\pi} = \frac{e^{-\frac{1}{2}x^2}}{\sqrt{2\pi}}\,\frac{e^{-\frac{1}{2}y^2}}{\sqrt{2\pi}} = f_X(x) f_Y(y).$$
   Using numerical integration, $P(X + Y < 1) = 0.7602$. The matlab code is
       g = inline('normpdf(y).*normcdf(1-y)','y');
       Prob = quadl(g,-5,5)
   where $-\infty$ and $\infty$ have been approximated by $-5$ and $5$.
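An alternative to numerical integration is a simple Monte Carlo check (a sketch; the sample size is arbitrary), which should return a value near 0.7602:

    n = 1e6;
    x = randn(n,1); y = randn(n,1);    % iid N(0,1) pairs
    Prob_mc = mean(x + y < 1)          % estimate of P(X + Y < 1)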
9. Marginal Densities:
   (a) Integrate out unwanted variables to obtain marginal densities. For example,
       $$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dy; \qquad f_Y(y) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dx; \qquad\text{and}\qquad f_{X,Y}(x, y) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f_{W,X,Y,Z}(w, x, y, z)\,dw\,dz.$$
   (b) Example: If $f_{X,Y}(x, y) = \frac{6}{5}(x + y^2)$ for $x \in (0, 1)$ and $y \in (0, 1)$, then
       $$f_X(x) = \frac{6}{5}\int_0^1(x + y^2)\,dy = \frac{6x + 2}{5} \;\text{ for } x \in (0, 1) \quad\text{and}\quad f_Y(y) = \frac{6}{5}\int_0^1(x + y^2)\,dx = \frac{6y^2 + 3}{5} \;\text{ for } y \in (0, 1).$$
10. Expected Values
    (a) The expected value of a function $g(X, Y)$ is
        $$E[g(X, Y)] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} g(x, y) f_{X,Y}(x, y)\,dx\,dy.$$
    (b) Example: If $f_{X,Y}(x, y) = \frac{6}{5}(x + y^2)$ for $x \in (0, 1)$ and $y \in (0, 1)$, then
        $$E(X) = \int_0^1\int_0^1 x\,\frac{6}{5}(x + y^2)\,dx\,dy = \int_0^1\frac{3y^2 + 2}{5}\,dy = \frac{3}{5}.$$
5.8 Several Variables

1. The joint pdf of n continuous random variables, $X_1, \ldots, X_n$, is a function that satisfies
   (i) $f(x_1, \ldots, x_n) \ge 0$, and
   (ii) $\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} f(x_1, \ldots, x_n)\,dx_1\cdots dx_n = 1$.

2. Expectations are linear regardless of the number of variables:
   $$E\left[\sum_{i=1}^{k} a_i g_i(X_1, X_2, \ldots, X_n)\right] = \sum_{i=1}^{k} a_i E[g_i(X_1, X_2, \ldots, X_n)]$$
   if the expectations exist.
3. Exchangeable Random Variables
   (a) Let $x_1^*, \ldots, x_n^*$ be a permutation of $x_1, \ldots, x_n$. Then the joint density of $X_1, \ldots, X_n$ is said to be exchangeable if
       $$f_{X_1,\ldots,X_n}(x_1, \ldots, x_n) = f_{X_1,\ldots,X_n}(x_1^*, \ldots, x_n^*)$$
       for all $x_1, \ldots, x_n$ and for all permutations $x_1^*, \ldots, x_n^*$.
   (b) Result: If the joint density is exchangeable, then all marginal densities are identical. For example,
       $$f_{X_1,X_2}(x_1, x_2) = \int_{-\infty}^{\infty} f_{X_1,X_2,X_3}(x_1, x_2, x_3)\,dx_3 = \int_{-\infty}^{\infty} f_{X_1,X_2,X_3}(x_3, x_2, x_1)\,dx_3 \quad\text{by exchangeability}$$
       $$= \int_{-\infty}^{\infty} f_{X_1,X_2,X_3}(x_1, x_2, x_3)\,dx_1 \quad\text{by relabeling variables}\quad = f_{X_2,X_3}(x_2, x_3).$$
   (c) Result: If the joint density is exchangeable, then all bivariate marginal densities are identical, and so forth.
   (d) Result: If the joint density is exchangeable, then the moments of $X_i$ (if they exist) are identical for all i.
   (e) Example: Suppose that $f_{X,Y}(x, y) = 2$ for $x \ge 0$, $y \ge 0$, and $x + y \le 1$. Then
       $$f_X(x) = \int_0^{1-x} 2\,dy = 2(1 - x) \;\text{ for } x \in (0, 1), \qquad f_Y(y) = \int_0^{1-y} 2\,dx = 2(1 - y) \;\text{ for } y \in (0, 1), \quad\text{and}\quad E(X) = E(Y) = \frac{1}{3}.$$
5.9 Covariance and Correlation

1. Review covariance and correlation results for discrete random variables (Section 3.4) because they also hold for continuous random variables. Below are lists of the most important definitions and results.
   (a) Definitions
       $\mathrm{Cov}(X, Y) \overset{\text{def}}{=} E[(X - \mu_X)(Y - \mu_Y)]$. $\mathrm{Cov}(X, Y)$ is denoted by $\sigma_{X,Y}$.
       $\mathrm{Var}(X) = \mathrm{Cov}(X, X)$.
       $\mathrm{Corr}(X, Y) \overset{\text{def}}{=} \mathrm{Cov}(X, Y)/\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}$. $\mathrm{Corr}(X, Y)$ is denoted by $\rho_{X,Y}$.
   (b) Covariance and Correlation Results (be able to prove any of these).
       $\mathrm{Cov}(X, Y) = E(XY) - E(X)E(Y)$.
       Cauchy-Schwartz Inequality: $[E(XY)]^2 \le E(X^2)E(Y^2)$.
       $\rho_{X,Y} \in [-1, 1]$. To prove this, use the Cauchy-Schwartz inequality.
       $\mathrm{Cov}(a + bX, c + dY) = bd\,\mathrm{Cov}(X, Y)$.
       $\mathrm{Cov}\left(\sum_i a_i X_i, \sum_j b_j Y_j\right) = \sum_i\sum_j a_i b_j\,\mathrm{Cov}(X_i, Y_j)$. For example, $\mathrm{Cov}(aW + bX, cY + dZ) = ac\,\mathrm{Cov}(W, Y) + ad\,\mathrm{Cov}(W, Z) + bc\,\mathrm{Cov}(X, Y) + bd\,\mathrm{Cov}(X, Z)$.
       $\mathrm{Corr}(a + bX, c + dY) = \mathrm{sign}(bd)\,\mathrm{Corr}(X, Y)$.
       $\mathrm{Var}\left(\sum_i X_i\right) = \sum_i\sum_j\mathrm{Cov}(X_i, X_j) = \sum_i\mathrm{Var}(X_i) + \sum_{i \ne j}\mathrm{Cov}(X_i, X_j)$.
       Parallel axis theorem: $E(X - c)^2 = \mathrm{Var}(X) + (\mu_X - c)^2$. Hint on proof: first add zero, $X - c = (X - \mu_X) + (\mu_X - c)$, then take the expectation.
2. Example (simple linear regression with correlated observations): Suppose that $Y_i = \alpha + \beta x_i + \varepsilon_i$ for $i = 1, \ldots, n$ and where $\varepsilon_1, \ldots, \varepsilon_n$ have an exchangeable distribution with $E(\varepsilon_1) = 0$, $\mathrm{Var}(\varepsilon_1) = \sigma^2$ and $\mathrm{Cov}(\varepsilon_1, \varepsilon_2) = \rho\sigma^2$. The ordinary least squares estimator of $\beta$ is
   $$\hat\beta = \frac{\sum_{i=1}^n(x_i - \bar x)(Y_i - \bar Y)}{\sum_{i=1}^n(x_i - \bar x)^2}.$$
   Then,
   $$E(\hat\beta) = \beta \quad\text{and}\quad \mathrm{Var}(\hat\beta) = \frac{\sigma^2(1 - \rho)}{\sum_{i=1}^n(x_i - \bar x)^2}.$$
   Proof: First examine the numerator of $\hat\beta$:
   $$\sum_{i=1}^n(x_i - \bar x)(Y_i - \bar Y) = \sum_{i=1}^n(x_i - \bar x)Y_i - \bar Y\sum_{i=1}^n(x_i - \bar x) = \sum_{i=1}^n(x_i - \bar x)Y_i \quad\text{because}\quad \sum_{i=1}^n(x_i - \bar x) = 0.$$
   In the same manner, it can be shown that
   $$\sum_{i=1}^n(x_i - \bar x)^2 = \sum_{i=1}^n(x_i - \bar x)(x_i - \bar x) = \sum_{i=1}^n(x_i - \bar x)x_i. \qquad (*)$$
   Denote the denominator of $\hat\beta$ by $V_x$. That is,
   $$V_x = \sum_{i=1}^n(x_i - \bar x)^2 = \sum_{i=1}^n(x_i - \bar x)x_i.$$
   The least squares estimator can therefore be written as
   $$\hat\beta = \frac{1}{V_x}\sum_{i=1}^n(x_i - \bar x)Y_i \quad\text{or as}\quad \hat\beta = \sum_{i=1}^n w_i Y_i, \quad\text{where } w_i = \frac{x_i - \bar x}{V_x}.$$
   Note that
   $$\sum_{i=1}^n w_i = \frac{1}{V_x}\sum_{i=1}^n(x_i - \bar x) = 0.$$
   The expectation of $\hat\beta$ is
   $$E(\hat\beta) = \sum_{i=1}^n w_i E(Y_i) = \sum_{i=1}^n w_i(\alpha + \beta x_i) \quad\text{because } E(Y_i) = E(\alpha + \beta x_i + \varepsilon_i) = \alpha + \beta x_i$$
   $$= \alpha\sum_{i=1}^n w_i + \beta\sum_{i=1}^n w_i x_i = 0 + \frac{\beta}{V_x}\sum_{i=1}^n(x_i - \bar x)x_i \quad\text{because } \sum_{i=1}^n w_i = 0$$
   $$= \frac{\beta}{V_x}\sum_{i=1}^n(x_i - \bar x)(x_i - \bar x) \quad\text{by } (*)\quad = \frac{\beta}{V_x}V_x = \beta.$$
   The variance of $\hat\beta$ is
   $$\mathrm{Var}(\hat\beta) = \mathrm{Var}\left(\sum_{i=1}^n w_i Y_i\right) = \sum_{i=1}^n w_i^2\,\mathrm{Var}(Y_i) + \sum_{i \ne j} w_i w_j\,\mathrm{Cov}(Y_i, Y_j) \quad\text{using results on variances of linear combinations}$$
   $$= \sigma^2\sum_{i=1}^n w_i^2 + \rho\sigma^2\sum_{i \ne j} w_i w_j. \qquad (**)$$
   To complete the proof, first note that
   $$\sum_{i=1}^n w_i^2 = \frac{1}{V_x^2}\sum_{i=1}^n(x_i - \bar x)^2 = \frac{1}{V_x^2}V_x = \frac{1}{V_x}.$$
   Second, note that $\left(\sum_{i=1}^n w_i\right)^2 = 0$ because $\sum_{i=1}^n w_i = 0$, and
   $$0 = \left(\sum_{i=1}^n w_i\right)^2 = \left(\sum_{i=1}^n w_i\right)\left(\sum_{j=1}^n w_j\right) = \sum_{i=1}^n\sum_{j=1}^n w_i w_j = \sum_{i=1}^n w_i^2 + \sum_{i \ne j} w_i w_j = \frac{1}{V_x} + \sum_{i \ne j} w_i w_j \implies \sum_{i \ne j} w_i w_j = -\frac{1}{V_x}.$$
   Lastly, use the above two results in equation $(**)$ to obtain
   $$\mathrm{Var}(\hat\beta) = \sigma^2\sum_{i=1}^n w_i^2 + \rho\sigma^2\sum_{i \ne j} w_i w_j = \frac{\sigma^2}{V_x} - \frac{\rho\sigma^2}{V_x} = \frac{\sigma^2(1 - \rho)}{V_x} = \frac{\sigma^2(1 - \rho)}{\sum_{i=1}^n(x_i - \bar x)^2}.$$
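The mean and variance results above can be checked by simulation. The Matlab sketch below uses illustrative parameter values (not from the notes) and generates exchangeable, equicorrelated errors as $\varepsilon_i = \sigma(\sqrt{\rho}\,Z_0 + \sqrt{1 - \rho}\,Z_i)$ with iid standard normal Z's, which gives $\mathrm{Var}(\varepsilon_i) = \sigma^2$ and $\mathrm{Cov}(\varepsilon_i, \varepsilon_j) = \rho\sigma^2$ when $\rho \ge 0$:

    nrep = 20000; n = 10;
    alpha = 1; beta = 2; sigma = 3; rho = 0.4;        % illustrative values
    x = (1:n)'; w = (x - mean(x)) / sum((x - mean(x)).^2);   % OLS weights
    betahat = zeros(nrep,1);
    for r = 1:nrep
        z0 = randn; z = randn(n,1);
        e = sigma*(sqrt(rho)*z0 + sqrt(1-rho)*z);     % exchangeable, equicorrelated errors
        y = alpha + beta*x + e;
        betahat(r) = w' * y;
    end
    [mean(betahat), beta]                                    % should agree
    [var(betahat), sigma^2*(1-rho)/sum((x-mean(x)).^2)]      % should agree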
5.10 Independence

1. Definition: Continuous random variables X and Y are said to be independent if their joint pdf factors into a product of the marginal pdfs. That is,
   $$X \perp Y \iff f_{X,Y}(x, y) = f_X(x) f_Y(y) \;\;\forall (x, y).$$

2. Example: if $f_{X,Y}(x, y) = 2$ for $x \in (0, 0.5)$ and $y \in (0, 1)$, then $X \perp Y$. Note, the joint pdf can be written as
   $$f_{X,Y}(x, y) = 2 I_{(0,0.5)}(x) I_{(0,1)}(y) = \left[2 I_{(0,0.5)}(x)\right]\left[I_{(0,1)}(y)\right] = f_X(x) f_Y(y),$$
   where
   $$I_A(x) = \begin{cases} 1 & x \in A; \\ 0 & \text{otherwise.} \end{cases}$$

3. Example: if $f_{X,Y}(x, y) = 8xy$ for $0 \le x \le y \le 1$, then X and Y are not independent. Note
   $$f_{X,Y}(x, y) = 8xy\,I_{(0,1)}(y) I_{(0,y)}(x),$$
   but
   $$f_X(x) = \int_x^1 f_{X,Y}(x, y)\,dy = 4x(1 - x^2) I_{(0,1)}(x), \qquad f_Y(y) = \int_0^y f_{X,Y}(x, y)\,dx = 4y^3 I_{(0,1)}(y), \quad\text{and}$$
   $$f_X(x) f_Y(y) = 16xy^3(1 - x^2) I_{(0,1)}(x) I_{(0,1)}(y) \ne f_{X,Y}(x, y).$$
4. Note: $\mathrm{Cov}(X, Y) = 0$ does not imply $X \perp Y$. For example, if
   $$f_{X,Y}(x, y) = \frac{1}{3} I_{(1,2)}(x) I_{(-x,x)}(y),$$
   then
   $$E(X) = \int_1^2\int_{-x}^{x}\frac{x}{3}\,dy\,dx = \int_1^2\frac{2x^2}{3}\,dx = \frac{14}{9},$$
   $$E(Y) = \int_1^2\int_{-x}^{x}\frac{y}{3}\,dy\,dx = \int_1^2\frac{x^2 - x^2}{6}\,dx = 0, \quad\text{and}$$
   $$E(XY) = \int_1^2\int_{-x}^{x}\frac{xy}{3}\,dy\,dx = \int_1^2\frac{x(x^2 - x^2)}{6}\,dx = 0.$$
   Accordingly, X and Y have correlation 0, but they are not independent.
5. Result: Let A and B be subsets of the real line. Then random variables X and Y are independent if and only if
   $$P(X \in A, Y \in B) = P(X \in A)P(Y \in B)$$
   for all choices of sets A and B.
   Proof: First assume that $X \perp Y$. Let A and B be arbitrary sets on the real line. Then
   $$P(X \in A, Y \in B) = \int_A\int_B f_{X,Y}(x, y)\,dy\,dx = \int_A\int_B f_X(x) f_Y(y)\,dy\,dx \quad\text{by independence}$$
   $$= \int_A f_X(x)\,dx\int_B f_Y(y)\,dy = P(X \in A)P(Y \in B).$$
   Therefore,
   $$X \perp Y \implies P(X \in A, Y \in B) = P(X \in A)P(Y \in B)$$
   for any choice of sets. Second, assume that $P(X \in A, Y \in B) = P(X \in A)P(Y \in B)$ for all choices of sets A and B. Choose $A = (-\infty, x]$ and choose $B = (-\infty, y]$. Then
   $$P(X \in A, Y \in B) = P(X \le x, Y \le y) = F_{X,Y}(x, y) = P(X \in A)P(Y \in B) = P(X \le x)P(Y \le y) = F_X(x)F_Y(y).$$
   Accordingly,
   $$f_{X,Y}(x, y) = \frac{\partial^2}{\partial x\,\partial y}F_{X,Y}(x, y) = \frac{\partial^2}{\partial x\,\partial y}F_X(x)F_Y(y) = \frac{\partial}{\partial x}F_X(x)\,\frac{\partial}{\partial y}F_Y(y) = f_X(x) f_Y(y).$$
   Therefore,
   $$P(X \in A, Y \in B) = P(X \in A)P(Y \in B) \implies f_{X,Y}(x, y) = f_X(x) f_Y(y).$$
6. Result: If X and Y are independent, then so are $g(X)$ and $h(Y)$ for any g and h.
   Proof: Let A be any set of intervals in the range of $g(x)$ and let B be any set of intervals in the range of $h(y)$. Denote by $g^{-1}(A)$ the set of all intervals in the support of X that satisfy $x \in g^{-1}(A) \iff g(x) \in A$. Similarly, denote by $h^{-1}(B)$ the set of all intervals in the support of Y that satisfy $y \in h^{-1}(B) \iff h(y) \in B$. If $X \perp Y$, then
   $$P[g(X) \in A, h(Y) \in B] = P\left[X \in g^{-1}(A), Y \in h^{-1}(B)\right] = P\left[X \in g^{-1}(A)\right]P\left[Y \in h^{-1}(B)\right] = P[g(X) \in A]\,P[h(Y) \in B].$$
   The above equality implies that $g(X) \perp h(Y)$ because the factorization is satisfied for all A and B in the range spaces of $g(X)$ and $h(Y)$. Note that we already proved this result for discrete random variables.

7. The previous two results readily extend to any number of random variables (not just two).
8. Suppose that $X_i$ for $i = 1, \ldots, n$ are independent. Then
   (a) $g_1(X_1), \ldots, g_n(X_n)$ are independent,
   (b) the X's in any subset are independent,
   (c) $\mathrm{Var}\left(\sum a_i X_i\right) = \sum a_i^2\,\mathrm{Var}(X_i)$, and
   (d) if the X's are iid with variance $\sigma^2$, then $\mathrm{Var}\left(\sum a_i X_i\right) = \sigma^2\sum a_i^2$.
5.11 Conditional Distributions

1. Definition: If $f_{X,Y}(x, y)$ is a joint pdf, then the pdf of Y, conditional on $X = x$, is
   $$f_{Y|X}(y|x) \overset{\text{def}}{=} \frac{f_{X,Y}(x, y)}{f_X(x)}$$
   provided that $f_X(x) > 0$.
2. Example: Suppose that X and Y have joint distribution
   $$f_{X,Y}(x, y) = 8xy \;\text{ for } 0 < x < y < 1.$$
   Then,
   $$f_X(x) = \int_x^1 f_{X,Y}(x, y)\,dy = 4x(1 - x^2), \; 0 < x < 1; \qquad E(X^r) = \int_0^1 4x(1 - x^2)x^r\,dx = \frac{8}{(r + 2)(r + 4)};$$
   $$f_Y(y) = 4y^3, \; 0 < y < 1; \qquad E(Y^r) = \int_0^1 4y^3 y^r\,dy = \frac{4}{r + 4};$$
   $$f_{X|Y}(x|y) = \frac{8xy}{4y^3} = \frac{2x}{y^2}, \; 0 < x < y; \quad\text{and}\quad f_{Y|X}(y|x) = \frac{8xy}{4x(1 - x^2)} = \frac{2y}{1 - x^2}, \; x < y < 1.$$
   Furthermore,
   $$E(X^r|Y = y) = \int_0^y x^r\,\frac{2x}{y^2}\,dx = \frac{2y^r}{r + 2} \quad\text{and}\quad E(Y^r|X = x) = \int_x^1 y^r\,\frac{2y}{1 - x^2}\,dy = \frac{2(1 - x^{r+2})}{(r + 2)(1 - x^2)}.$$
3. Regression Function: Let (X, Y) be a pair of random variables with joint pdf $f_{X,Y}(x, y)$. Consider the problem of predicting Y after observing $X = x$. Denote the predictor as $\hat y(x)$. The best predictor is defined as the function $\hat Y(X)$ that minimizes
   $$\mathrm{SSE} = E\left[Y - \hat Y(X)\right]^2 = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}[y - \hat y(x)]^2 f_{X,Y}(x, y)\,dy\,dx.$$
   (a) Result: The best predictor is $\hat y(x) = E(Y|X = x)$.
       Proof: Write $f_{X,Y}(x, y)$ as $f_{Y|X}(y|x) f_X(x)$. Accordingly,
       $$\mathrm{SSE} = \int_{-\infty}^{\infty}\left\{\int_{-\infty}^{\infty}[y - \hat y(x)]^2 f_{Y|X}(y|x)\,dy\right\} f_X(x)\,dx.$$
       To minimize SSE, minimize the quantity in { } for each value of x. Note that $\hat y(x)$ is a constant with respect to the conditional distribution of Y given $X = x$. By the parallel axis theorem, the quantity in { } is minimized by $\hat y(x) = E(Y|X = x)$.
   (b) Example: Suppose that X and Y have joint distribution $f_{X,Y}(x, y) = 8xy$ for $0 < x < y < 1$. Then,
       $$f_{Y|X}(y|x) = \frac{8xy}{4x(1 - x^2)} = \frac{2y}{1 - x^2}, \; x < y < 1, \quad\text{and}\quad \hat y(x) = E(Y|X = x) = \int_x^1 y\,\frac{2y}{1 - x^2}\,dy = \frac{2(1 - x^3)}{3(1 - x^2)}.$$
   (c) Example: Suppose that (Y, X) has a bivariate normal distribution with moments $E(Y) = \mu_Y$, $E(X) = \mu_X$, $\mathrm{Var}(X) = \sigma_X^2$, $\mathrm{Var}(Y) = \sigma_Y^2$, and $\mathrm{Cov}(X, Y) = \rho_{X,Y}\sigma_X\sigma_Y$. Then it can be shown (we will not do so) that the conditional distribution of Y given X is
       $$(Y|X = x) \sim N(\alpha + \beta x, \sigma^2), \quad\text{where}\quad \beta = \frac{\mathrm{Cov}(X, Y)}{\mathrm{Var}(X)} = \frac{\rho_{X,Y}\sigma_Y}{\sigma_X}; \quad \alpha = \mu_Y - \beta\mu_X; \quad\text{and}\quad \sigma^2 = \sigma_Y^2\left(1 - \rho_{X,Y}^2\right).$$
4. Averaging Conditional pdfs and Moments (be able to prove any of these results)
   (a) $E_X\left[f_{Y|X}(y|X)\right] = f_Y(y)$.
       Proof:
       $$E_X\left[f_{Y|X}(y|X)\right] = \int_{-\infty}^{\infty} f_{Y|X}(y|x) f_X(x)\,dx = \int_{-\infty}^{\infty}\frac{f_{X,Y}(x, y)}{f_X(x)} f_X(x)\,dx = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dx = f_Y(y).$$
   (b) $E_X\{E[h(Y)|X]\} = E[h(Y)]$. This is the rule of iterated expectation. A special case is $E_X[E(Y|X)] = E(Y)$.
       Proof:
       $$E_X\{E[h(Y)|X]\} = \int_{-\infty}^{\infty} E[h(Y)|x] f_X(x)\,dx = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} h(y) f_{Y|X}(y|x)\,dy\,f_X(x)\,dx$$
       $$= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} h(y)\frac{f_{X,Y}(x, y)}{f_X(x)}\,dy\,f_X(x)\,dx = \int_{-\infty}^{\infty} h(y)\int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dx\,dy = \int_{-\infty}^{\infty} h(y) f_Y(y)\,dy = E[h(Y)].$$
   (c) $\mathrm{Var}(Y) = E_X[\mathrm{Var}(Y|X)] + \mathrm{Var}[E(Y|X)]$. That is, the variance of Y is equal to the expectation of the conditional variance plus the variance of the conditional expectation.
       Proof:
       $$\mathrm{Var}(Y) = E(Y^2) - [E(Y)]^2 = E_X\left[E(Y^2|X)\right] - \{E_X[E(Y|X)]\}^2 \quad\text{by the rule of iterated expectation}$$
       $$= E_X\left\{\mathrm{Var}(Y|X) + [E(Y|X)]^2\right\} - \{E_X[E(Y|X)]\}^2 \quad\text{because } \mathrm{Var}(Y|X) = E(Y^2|X) - [E(Y|X)]^2$$
       $$= E_X[\mathrm{Var}(Y|X)] + E_X[E(Y|X)]^2 - \{E_X[E(Y|X)]\}^2 = E_X[\mathrm{Var}(Y|X)] + \mathrm{Var}[E(Y|X)]$$
       because $\mathrm{Var}[E(Y|X)] = E_X[E(Y|X)]^2 - \{E_X[E(Y|X)]\}^2$.
5. Example: Suppose that X and Y have joint distribution
   $$f_{X,Y}(x, y) = \frac{3y^2}{x^3} \;\text{ for } 0 < y < x < 1.$$
   Then,
   $$f_Y(y) = \int_y^1\frac{3y^2}{x^3}\,dx = \frac{3}{2}(1 - y^2), \;\text{ for } 0 < y < 1; \qquad E(Y^r) = \int_0^1\frac{3}{2}(1 - y^2)y^r\,dy = \frac{3}{(r + 1)(r + 3)}$$
   $$\implies E(Y) = \frac{3}{8} \;\text{ and }\; \mathrm{Var}(Y) = \frac{19}{320};$$
   $$f_X(x) = \int_0^x\frac{3y^2}{x^3}\,dy = 1, \;\text{ for } 0 < x < 1; \qquad f_{Y|X}(y|x) = \frac{3y^2}{x^3}, \;\text{ for } 0 < y < x < 1;$$
   $$E(Y^r|X = x) = \int_0^x\frac{3y^2}{x^3}\,y^r\,dy = \frac{3x^r}{3 + r} \implies E(Y|X = x) = \frac{3x}{4} \;\text{ and }\; \mathrm{Var}(Y|X = x) = \frac{3x^2}{80};$$
   $$\mathrm{Var}[E(Y|X)] = \mathrm{Var}\left(\frac{3X}{4}\right) = \frac{9}{16}\times\frac{1}{12} = \frac{3}{64}; \qquad E[\mathrm{Var}(Y|X)] = E\left(\frac{3X^2}{80}\right) = \frac{1}{80}; \qquad \frac{19}{320} = \frac{3}{64} + \frac{1}{80}.$$
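The decomposition 19/320 = 3/64 + 1/80 can be verified by simulation. In the Matlab sketch below (illustrative sample size), X is drawn from its marginal Unif(0,1) and then Y given X = x is drawn via the conditional cdf $F_{Y|X}(y|x) = (y/x)^3$, i.e., $Y = xU^{1/3}$:

    n = 1e6;
    x = rand(n,1);                 % f_X(x) = 1 on (0,1)
    u = rand(n,1);
    y = x .* u.^(1/3);             % inverse-cdf draw from f_{Y|X}(y|x) = 3y^2/x^3
    [var(y), 19/320]               % both should be close to 0.0594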
5.12 Moment Generating Functions

1. Definition: If X is a continuous random variable, then the mgf of X is
   $$M_X(t) = E\left(e^{tX}\right) = \int_{-\infty}^{\infty} e^{tx} f_X(x)\,dx,$$
   provided that the expectation exists for t in a neighborhood of 0. If X is discrete, then replace integration by summation. If all of the moments of X do not exist, then the mgf will not exist. Note that the mgf is related to the pgf by
   $$M_X(t) = \eta_X(e^t)$$
   whenever the pgf $\eta_X(s)$ exists for s in a neighborhood of 1. Also note that if $M_X(t)$ is a mgf, then $M_X(0) = 1$.
2. Example: Exponential Distribution. If $f_X(x) = \lambda e^{-\lambda x} I_{(0,\infty)}(x)$, then
   $$M_X(t) = \int_0^{\infty} e^{tx}\lambda e^{-\lambda x}\,dx = \frac{\lambda}{\lambda - t}\int_0^{\infty}(\lambda - t)e^{-(\lambda - t)x}\,dx = \frac{\lambda}{\lambda - t}\int_0^{\infty}\lambda^* e^{-\lambda^* x}\,dx, \;\text{ where } \lambda^* = \lambda - t,$$
   $$= \frac{\lambda}{\lambda - t} \quad\text{provided that } \lambda > t.$$

3. Example: Geometric Distribution. If $X \sim \mathrm{Geom}(p)$, then
   $$M_X(t) = \sum_{x=1}^{\infty} e^{tx}(1 - p)^{x-1}p = pe^t\sum_{x=1}^{\infty}(1 - p)^{x-1}e^{t(x-1)} = pe^t\sum_{x=0}^{\infty}\left[(1 - p)e^t\right]^x = \frac{pe^t}{1 - (1 - p)e^t}$$
   provided that $t < -\ln(1 - p)$.
4. MGF of a linear function: If $M_X(t)$ exists, then
   $$M_{a+bX}(t) = E\left[e^{t(a+bX)}\right] = e^{at}M_X(tb).$$
   For example, if $Z = (X - \mu_X)/\sigma_X$, then
   $$M_Z(t) = e^{-t\mu_X/\sigma_X}M_X(t/\sigma_X).$$
5. Independent Random Variables: If $X_i$ for $i = 1, \ldots, n$ are independent, $M_{X_i}(t)$ exists for each i, and $S = \sum X_i$, then
   $$M_S(t) = E\left(e^{t\sum X_i}\right) = E\left(\prod_{i=1}^n e^{tX_i}\right) = \prod_{i=1}^n M_{X_i}(t).$$
   If the X's are iid random variables, then $M_S(t) = [M_X(t)]^n$.
6. Result: Moment generating functions are unique. Each distribution has a unique moment generating function and each moment generating function corresponds to exactly one distribution. Accordingly, if the moment generating function exists, then it uniquely determines the distribution. For example, if the mgf of Y is
   $$M_Y(t) = \frac{e^t}{2 - e^t} = \frac{\frac{1}{2}e^t}{1 - \frac{1}{2}e^t},$$
   then $Y \sim \mathrm{Geom}(0.5)$.
7. Computing Moments. Consider the derivative of $M_X(t)$ with respect to t, evaluated at $t = 0$:
   $$\frac{d}{dt}M_X(t)\Big|_{t=0} = \int_{-\infty}^{\infty}\left[\frac{d}{dt}e^{tx}\right]_{t=0} f_X(x)\,dx = \int_{-\infty}^{\infty} x f_X(x)\,dx = E(X).$$
   Similarly, higher order moments can be found by taking higher order derivatives:
   $$E(X^r) = \frac{d^r}{(dt)^r}M_X(t)\Big|_{t=0}.$$
   Alternatively, expand $e^{tx}$ around $t = 0$ to obtain
   $$e^{tx} = \sum_{r=0}^{\infty}\frac{(tx)^r}{r!}.$$
   Therefore
   $$M_X(t) = E\left(e^{tX}\right) = E\left[\sum_{r=0}^{\infty}\frac{(tX)^r}{r!}\right] = \sum_{r=0}^{\infty}E(X^r)\frac{t^r}{r!}.$$
   Accordingly, $E(X^r)$ is the coefficient of $t^r/r!$ in the expansion of the mgf.
8. Example: Suppose that $X \sim \mathrm{Geom}(p)$. Then the moments of X are
   $$E(X^r) = \frac{d^r}{(dt)^r}M_X(t)\Big|_{t=0} = \frac{d^r}{(dt)^r}\left[\frac{pe^t}{1 - (1 - p)e^t}\right]_{t=0}.$$
   Specifically,
   $$\frac{d}{dt}M_X(t) = \frac{d}{dt}\left[\frac{pe^t}{1 - (1 - p)e^t}\right] = M_X(t) + \frac{1 - p}{p}M_X(t)^2 \quad\text{and}$$
   $$\frac{d^2}{(dt)^2}M_X(t) = \frac{d}{dt}\left[M_X(t) + \frac{1 - p}{p}M_X(t)^2\right] = M_X(t) + \frac{1 - p}{p}M_X(t)^2 + \frac{2(1 - p)}{p}M_X(t)\left[M_X(t) + \frac{1 - p}{p}M_X(t)^2\right].$$
   Therefore,
   $$E(X) = 1 + \frac{1 - p}{p} = \frac{1}{p}; \qquad E(X^2) = 1 + \frac{1 - p}{p} + \frac{2(1 - p)}{p}\left(1 + \frac{1 - p}{p}\right) = \frac{2 - p}{p^2}; \quad\text{and}$$
   $$\mathrm{Var}(X) = \frac{2 - p}{p^2} - \frac{1}{p^2} = \frac{1 - p}{p^2}.$$
9. Example: Suppose $Y \sim \mathrm{Unif}(a, b)$. Use the mgf to find the central moments $E[(Y - \mu_Y)^r] = E[(Y - \frac{a+b}{2})^r]$. Solution:
   $$M_Y(t) = \frac{1}{b - a}\int_a^b e^{ty}\,dy = \frac{e^{tb} - e^{ta}}{t(b - a)},$$
   $$M_{Y - \mu_Y}(t) = e^{-t(a+b)/2}M_Y(t) = \frac{e^{-t(a+b)/2}\left(e^{tb} - e^{ta}\right)}{t(b - a)} = \frac{2}{t(b - a)}\left[\frac{e^{\frac{t}{2}(b-a)} - e^{-\frac{t}{2}(b-a)}}{2}\right] = \frac{2}{t(b - a)}\sinh\left[\frac{t(b - a)}{2}\right]$$
   $$= \frac{2}{t(b - a)}\sum_{i=0}^{\infty}\left[\frac{t(b - a)}{2}\right]^{2i+1}\frac{1}{(2i + 1)!} = \sum_{i=0}^{\infty}\left[\frac{t(b - a)}{2}\right]^{2i}\frac{1}{(2i + 1)!} = \sum_{i=0}^{\infty}\left[\frac{t^{2i}}{(2i)!}\right]\frac{(b - a)^{2i}}{2^{2i}(2i + 1)}.$$
   Therefore, the odd central moments are zero, and
   $$E(Y - \mu_Y)^{2i} = \frac{(b - a)^{2i}}{4^i(2i + 1)}.$$
   For example, $E(Y - \mu_Y)^2 = (b - a)^2/12$ and $E(Y - \mu_Y)^4 = (b - a)^4/80$.
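A quick simulation check of these last two central moments (Matlab sketch; a and b are arbitrary illustrative values):

    a = 2; b = 7; n = 1e6;
    y = a + (b - a)*rand(n,1);              % Y ~ Unif(a,b)
    m = (a + b)/2;
    [mean((y - m).^2), (b - a)^2/12]        % second central moment
    [mean((y - m).^4), (b - a)^4/80]        % fourth central moment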
Chapter 6

FAMILIES OF CONTINUOUS DISTRIBUTIONS

6.1 Normal Distributions

1. PDF and cdf of the Standard Normal Distribution:
   $$f_Z(z) = \frac{e^{-z^2/2}}{\sqrt{2\pi}} I_{(-\infty,\infty)}(z) = \frac{e^{-z^2/2}}{\sqrt{2\pi}} \quad\text{and}\quad F_Z(z) = P(Z \le z) = \Phi(z) = \int_{-\infty}^{z} f_Z(u)\,du.$$
2. Result:
   $$\int_{-\infty}^{\infty}\frac{e^{-x^2/2}}{\sqrt{2\pi}}\,dx = 1.$$
   Proof: To verify that $f_Z(z)$ integrates to one, it is sufficient to show that
   $$\int_{-\infty}^{\infty} e^{-x^2/2}\,dx = \sqrt{2\pi}.$$
   Let
   $$K = \int_{-\infty}^{\infty} e^{-x^2/2}\,dx.$$
   Then
   $$K^2 = \left[\int_{-\infty}^{\infty} e^{-u^2/2}\,du\right]^2 = \int_{-\infty}^{\infty} e^{-u_1^2/2}\,du_1\int_{-\infty}^{\infty} e^{-u_2^2/2}\,du_2 = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} e^{-\frac{1}{2}(u_1^2 + u_2^2)}\,du_1\,du_2.$$
   Now transform to polar coordinates: $u_1 = r\sin\theta$, $u_2 = r\cos\theta$, and
   $$K^2 = \int_0^{2\pi}\int_0^{\infty} e^{-\frac{1}{2}r^2}\,r\,dr\,d\theta = \int_0^{2\pi}\left[-e^{-\frac{1}{2}r^2}\right]_0^{\infty}d\theta = \int_0^{2\pi}1\,d\theta = 2\pi.$$
   Therefore $K = \sqrt{2\pi}$ and $f_Z(z)$ integrates to one.
3. Other Normal Distributions: Transform from Z to $X = \mu + \sigma Z$, where $\mu$ and $\sigma$ are constants that satisfy $|\mu| < \infty$ and $0 < \sigma < \infty$. The inverse transformation is $z = (x - \mu)/\sigma$ and the Jacobian of the transformation is
   $$|J| = \left|\frac{dz}{dx}\right| = \frac{1}{\sigma}.$$
   Accordingly, the pdf of X is
   $$f_X(x) = f_Z\left(\frac{x - \mu}{\sigma}\right)\frac{1}{\sigma} = \frac{e^{-\frac{1}{2\sigma^2}(x - \mu)^2}}{\sigma\sqrt{2\pi}}.$$
   We will use the notation $X \sim N(\mu, \sigma^2)$ to mean that X has a normal distribution with parameters $\mu$ and $\sigma^2$.
4. Completing a square. Let a and b be constants. Then
   $$x^2 - 2ax + b = (x - a)^2 - a^2 + b \;\text{ for all } x.$$
   Proof:
   $$x^2 - 2ax + b = x^2 - 2ax + a^2 - a^2 + b = (x - a)^2 - a^2 + b.$$
5. Moment Generating Function: Suppose that $X \sim N(\mu, \sigma^2)$. Then
   $$M_X(t) = e^{\mu t + t^2\sigma^2/2}.$$
   Proof:
   $$M_X(t) = E(e^{tX}) = \int_{-\infty}^{\infty} e^{tx}\,\frac{e^{-\frac{1}{2\sigma^2}(x - \mu)^2}}{\sigma\sqrt{2\pi}}\,dx = \int_{-\infty}^{\infty}\frac{e^{-\frac{1}{2\sigma^2}\left[-2t\sigma^2 x + (x - \mu)^2\right]}}{\sigma\sqrt{2\pi}}\,dx.$$
   Now complete the square in the exponent:
   $$-2t\sigma^2 x + (x - \mu)^2 = -2t\sigma^2 x + x^2 - 2\mu x + \mu^2 = x^2 - 2x(\mu + t\sigma^2) + \mu^2 = \left[x - (\mu + t\sigma^2)\right]^2 - (\mu + t\sigma^2)^2 + \mu^2 = \left[x - (\mu + t\sigma^2)\right]^2 - 2\mu t\sigma^2 - t^2\sigma^4.$$
   Therefore,
   $$M_X(t) = e^{\frac{1}{2\sigma^2}\left(2\mu t\sigma^2 + t^2\sigma^4\right)}\int_{-\infty}^{\infty}\frac{e^{-\frac{1}{2\sigma^2}\left[x - (\mu + t\sigma^2)\right]^2}}{\sigma\sqrt{2\pi}}\,dx = e^{\mu t + t^2\sigma^2/2}\int_{-\infty}^{\infty}\frac{e^{-\frac{1}{2\sigma^2}(x - \mu^*)^2}}{\sigma\sqrt{2\pi}}\,dx, \;\text{ where } \mu^* = \mu + t\sigma^2,$$
   $$= e^{\mu t + t^2\sigma^2/2}$$
   because the second term is the integral of the pdf of a random variable with distribution $N(\mu^*, \sigma^2)$ and this integral is one.
6. Moments of Normal Distributions
   (a) Moments of the standard normal distribution: Let Z be a normal random variable with $\mu = 0$ and $\sigma = 1$. That is, $Z \sim N(0, 1)$. The moment generating function of Z is $M_Z(t) = e^{t^2/2}$. The Taylor series expansion of $M_Z(t)$ around $t = 0$ is
       $$M_Z(t) = e^{t^2/2} = \sum_{i=0}^{\infty}\frac{1}{i!}\left(\frac{t^2}{2}\right)^i = \sum_{i=0}^{\infty}\left[\frac{(2i)!}{2^i\,i!}\right]\frac{t^{2i}}{(2i)!}.$$
       Note that all odd powers in the expansion are zero. Accordingly,
       $$E(Z^r) = \begin{cases} 0 & \text{if } r \text{ is odd;} \\ \dfrac{r!}{2^{r/2}\left(\frac{r}{2}\right)!} & \text{if } r \text{ is even.} \end{cases}$$
       It can be shown by induction that if r is even, then
       $$\frac{r!}{2^{r/2}\left(\frac{r}{2}\right)!} = (r - 1)(r - 3)(r - 5)\cdots 1.$$
       In particular, $E(Z) = 0$ and $\mathrm{Var}(Z) = E(Z^2) = 1$.
   (b) Moments of Other Normal Distributions: Suppose that $X \sim N(\mu, \sigma^2)$. Then X can be written as $X = \mu + \sigma Z$, where $Z \sim N(0, 1)$. To obtain the moments of X, one may use the moments of Z or one may differentiate the moment generating function of X. For example, using the moments of Z, the first two moments of X are
       $$E(X) = E(\mu + \sigma Z) = \mu + \sigma E(Z) = \mu \quad\text{and}\quad E(X^2) = E\left[(\mu + \sigma Z)^2\right] = E(\mu^2 + 2\mu\sigma Z + \sigma^2 Z^2) = \mu^2 + \sigma^2.$$
       Note that $\mathrm{Var}(X) = E(X^2) - [E(X)]^2 = \sigma^2$. The alternative approach is to use the moment generating function:
       $$E(X) = \frac{d}{dt}M_X(t)\Big|_{t=0} = \frac{d}{dt}e^{\mu t + t^2\sigma^2/2}\Big|_{t=0} = (\mu + t\sigma^2)e^{\mu t + t^2\sigma^2/2}\Big|_{t=0} = \mu \quad\text{and}$$
       $$E(X^2) = \frac{d^2}{dt^2}M_X(t)\Big|_{t=0} = \frac{d}{dt}\left[(\mu + t\sigma^2)e^{\mu t + t^2\sigma^2/2}\right]_{t=0} = \left[\sigma^2 e^{\mu t + t^2\sigma^2/2} + (\mu + t\sigma^2)^2 e^{\mu t + t^2\sigma^2/2}\right]_{t=0} = \sigma^2 + \mu^2.$$
7. Box-Muller method for generating standard normal variables. Let $Z_1$ and $Z_2$ be iid random variables with distributions $Z_i \sim N(0, 1)$. The joint pdf of $Z_1$ and $Z_2$ is
   $$f_{Z_1,Z_2}(z_1, z_2) = \frac{e^{-\frac{1}{2}(z_1^2 + z_2^2)}}{2\pi}.$$
   Transform to polar coordinates: $Z_1 = R\sin(T)$ and $Z_2 = R\cos(T)$. The joint distribution of R and T is
   $$f_{R,T}(r, t) = \frac{re^{-\frac{1}{2}r^2}}{2\pi}I_{(0,\infty)}(r)I_{(0,2\pi)}(t) = f_R(r) f_T(t), \quad\text{where}\quad f_R(r) = re^{-\frac{1}{2}r^2}I_{(0,\infty)}(r) \;\text{ and }\; f_T(t) = \frac{1}{2\pi}I_{(0,2\pi)}(t).$$
   Factorization of the joint pdf reveals that $R \perp T$. Their respective cdfs are
   $$F_R(r) = 1 - e^{-\frac{1}{2}r^2} \quad\text{and}\quad F_T(t) = \frac{t}{2\pi}.$$
   Let $U_1 = F_R(R)$ and $U_2 = F_T(T)$. Recall that $U_i \sim \mathrm{Unif}(0, 1)$. Solving the cdf equations for R and T yields
   $$R = \sqrt{-2\ln(1 - U_1)} \quad\text{and}\quad T = 2\pi U_2.$$
   Lastly, express $Z_1$ and $Z_2$ as functions of R and T:
   $$Z_1 = R\sin(T) = \sqrt{-2\ln(1 - U_1)}\sin(2\pi U_2) \quad\text{and}\quad Z_2 = R\cos(T) = \sqrt{-2\ln(1 - U_1)}\cos(2\pi U_2).$$
   Note that $U_1$ and $1 - U_1$ have the same distribution. Therefore $Z_1$ and $Z_2$ can be generated from $U_1$ and $U_2$ by
   $$Z_1 = \sqrt{-2\ln(U_1)}\sin(2\pi U_2) \quad\text{and}\quad Z_2 = \sqrt{-2\ln(U_1)}\cos(2\pi U_2).$$
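A direct Matlab implementation of these last two equations (a minimal sketch; the sample size is arbitrary) is:

    n = 1e5;
    u1 = rand(n,1); u2 = rand(n,1);
    z1 = sqrt(-2*log(u1)) .* sin(2*pi*u2);
    z2 = sqrt(-2*log(u1)) .* cos(2*pi*u2);
    % z1 and z2 should each have mean near 0 and variance near 1
    [mean(z1), var(z1), mean(z2), var(z2)]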
8. Linear Functions of Normal Random Variables: Suppose that X and Y are independently distributed random variables with distributions $X \sim N(\mu_X, \sigma_X^2)$ and $Y \sim N(\mu_Y, \sigma_Y^2)$.
   (a) The distribution of $aX + b$ is $N(a\mu_X + b, a^2\sigma_X^2)$.
       Proof: The moment generating function of $aX + b$ is
       $$M_{aX+b}(t) = E\left(e^{t(aX+b)}\right) = e^{tb}E\left(e^{taX}\right) = e^{tb}M_X(ta) = e^{tb}e^{ta\mu + t^2a^2\sigma^2/2} = e^{t(a\mu + b) + t^2(a\sigma)^2/2}$$
       and this is the moment generating function of a random variable with distribution $N(a\mu + b, a^2\sigma^2)$.
   (b) Application: Suppose that $X \sim N(\mu, \sigma^2)$. Let $Z = (X - \mu)/\sigma$. Note, $Z = aX + b$, where $a = 1/\sigma$ and $b = -\mu/\sigma$. Accordingly, $Z \sim N(0, 1)$.
   (c) The distribution of $aX + bY$ is $N(a\mu_X + b\mu_Y, a^2\sigma_X^2 + b^2\sigma_Y^2)$.
       Proof: The moment generating function of $aX + bY$ is
       $$M_{aX+bY}(t) = E\left(e^{t(aX+bY)}\right) = E\left(e^{taX}\right)E\left(e^{tbY}\right) \;\text{ by independence}$$
       $$= M_X(ta)M_Y(tb) = e^{ta\mu_X + t^2a^2\sigma_X^2/2}\,e^{tb\mu_Y + t^2b^2\sigma_Y^2/2} = e^{t(a\mu_X + b\mu_Y) + t^2(a^2\sigma_X^2 + b^2\sigma_Y^2)/2},$$
       and this is the moment generating function of a random variable with distribution $N(a\mu_X + b\mu_Y, a^2\sigma_X^2 + b^2\sigma_Y^2)$.
   (d) The above result is readily generalized. Suppose that $X_i$ for $i = 1, \ldots, n$ are independently distributed as $X_i \sim N(\mu_i, \sigma_i^2)$. If $T = \sum_{i=1}^n a_i X_i$, then $T \sim N(\mu_T, \sigma_T^2)$, where $\mu_T = \sum_{i=1}^n a_i\mu_i$ and $\sigma_T^2 = \sum_{i=1}^n a_i^2\sigma_i^2$.
9. Probabilities and Percentiles
   (a) If $X \sim N(\mu_X, \sigma_X^2)$, then the probability of an interval is
       $$P(a \le X \le b) = P\left(\frac{a - \mu_X}{\sigma_X} \le Z \le \frac{b - \mu_X}{\sigma_X}\right) = \Phi\left(\frac{b - \mu_X}{\sigma_X}\right) - \Phi\left(\frac{a - \mu_X}{\sigma_X}\right).$$
   (b) If $X \sim N(\mu_X, \sigma_X^2)$, then the 100pth percentile of X is
       $$x_p = \mu_X + \sigma_X z_p,$$
       where $z_p$ is the 100pth percentile of the standard normal distribution.
       Proof:
       $$P(X \le \mu_X + \sigma_X z_p) = P\left(\frac{X - \mu_X}{\sigma_X} \le z_p\right) = P(Z \le z_p) = p$$
       because $Z = (X - \mu_X)/\sigma_X \sim N(0, 1)$.
10. Log Normal Distribution
    (a) Definition: If $\ln(X) \sim N(\mu, \sigma^2)$, then X is said to have a log normal distribution. That is,
        $$\ln(X) \sim N(\mu, \sigma^2) \iff X \sim \mathrm{LogN}(\mu, \sigma^2).$$
        Note: $\mu$ and $\sigma^2$ are the mean and variance of $\ln(X)$, not of X.
    (b) PDF: Let $Y = \ln(X)$, and assume that $Y \sim N(\mu, \sigma^2)$. Note that $x = g(y)$ and $y = g^{-1}(x)$, where $g(y) = e^y$ and $g^{-1}(x) = \ln(x)$. The Jacobian of the transformation is
        $$|J| = \left|\frac{d}{dx}y\right| = \left|\frac{d}{dx}\ln(x)\right| = \frac{1}{x}.$$
        Accordingly, the pdf of X is
        $$f_X(x) = f_Y\left[g^{-1}(x)\right]\frac{1}{x} = \frac{e^{-\frac{1}{2\sigma^2}[\ln(x) - \mu]^2}}{x\sigma\sqrt{2\pi}}I_{(0,\infty)}(x).$$
    (c) CDF: If $Y \sim \mathrm{LogN}(\mu, \sigma^2)$, then
        $$P(Y \le y) = P[\ln(Y) \le \ln(y)] = \Phi\left[\frac{\ln(y) - \mu}{\sigma}\right].$$
    (d) Moments of a log normal random variable. Suppose that $X \sim \mathrm{LogN}(\mu, \sigma^2)$. Then $E(X^r) = e^{r\mu + r^2\sigma^2/2}$.
        Proof: Let $Y = \ln(X)$. Then $X = e^Y$ and $Y \sim N(\mu, \sigma^2)$ and
        $$E(X^r) = E\left(e^{rY}\right) = e^{r\mu + r^2\sigma^2/2},$$
        where the result is obtained by using the mgf of a normal random variable. To obtain the mean and variance, set r to 1 and 2:
        $$E(X) = e^{\mu + \sigma^2/2} \quad\text{and}\quad \mathrm{Var}(X) = e^{2\mu + 2\sigma^2} - e^{2\mu + \sigma^2} = e^{2\mu + \sigma^2}\left(e^{\sigma^2} - 1\right).$$
        (A small simulation check of these moment formulas is sketched at the end of this section.)
    (e) Displays of various log normal distributions. The figure below displays four log normal distributions. The parameters of the distributions are summarized in the following table:

        Plot   E[ln(X)]   Var[ln(X)]   E(X)   SD(X)   E(X)/SD(X)
        1      3.2976     4.6151       100    1000    0.1
        2      3.8005     1.6094       100    200     0.5
        3      4.2586     0.6931       100    100     1
        4      4.5856     0.0392       100    20      5

        [Figure: four panels showing the log normal pdf $f_X(x)$ for $x \in (0, 200)$, one panel for each value of the ratio $E(X)/\mathrm{SD}(X) \in \{0.1, 0.5, 1, 5\}$.]

        Note that each distribution has mean equal to 100. The distributions differ in terms of the ratio $E(X)/\mathrm{SD}(X)$, the reciprocal of the coefficient of variation. If this ratio is small (i.e., the coefficient of variation is large), then the log normal distribution resembles an exponential distribution. As the ratio increases, the log normal distribution approaches a normal distribution.
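A simulation check of the log normal moment formulas from item (d) (Matlab sketch; mu and sigma are arbitrary illustrative values):

    mu = 1; sigma = 0.5; n = 1e6;
    x = exp(mu + sigma*randn(n,1));                   % X = e^Y with Y ~ N(mu, sigma^2)
    [mean(x), exp(mu + sigma^2/2)]                    % E(X)
    [var(x), exp(2*mu + sigma^2)*(exp(sigma^2) - 1)]  % Var(X)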
6.2 Exponential Distributions

1. PDF and cdf:
   $$f_X(x) = \lambda e^{-\lambda x}I_{[0,\infty)}(x), \;\text{ where } \lambda \text{ is a positive parameter, and}\quad F_X(x) = 1 - e^{-\lambda x}$$
   provided that $x \ge 0$. We will use the notation $X \sim \mathrm{Expon}(\lambda)$ to mean that X has an exponential distribution with parameter $\lambda$. Note that the 100pth percentile is $x_p = -\ln(1 - p)/\lambda$. The median, for example, is $x_{0.5} = \ln(2)/\lambda$.
2. Moment Generating Function. If $X \sim \mathrm{Expon}(\lambda)$, then $M_X(t) = \lambda/(\lambda - t)$ for $t < \lambda$.
   Proof:
   $$M_X(t) = E(e^{tX}) = \int_0^{\infty} e^{tx}\lambda e^{-\lambda x}\,dx = \int_0^{\infty}\lambda e^{-(\lambda - t)x}\,dx = \frac{\lambda}{\lambda - t}\int_0^{\infty}(\lambda - t)e^{-(\lambda - t)x}\,dx = \frac{\lambda}{\lambda - t}$$
   because the last integral is the integral of the pdf of a random variable with distribution $\mathrm{Expon}(\lambda - t)$, provided that $\lambda - t > 0$.
3. Moments: If $X \sim \mathrm{Expon}(\lambda)$, then $E(X^r) = r!/\lambda^r$.
   Proof:
   $$M_X(t) = \frac{\lambda}{\lambda - t} = \frac{1}{1 - t/\lambda} = \sum_{r=0}^{\infty}\left(\frac{t}{\lambda}\right)^r = \sum_{r=0}^{\infty}\frac{t^r}{r!}\left(\frac{r!}{\lambda^r}\right)$$
   provided that $-\lambda < t < \lambda$. Note that $E(X) = 1/\lambda$, $E(X^2) = 2/\lambda^2$ and $\mathrm{Var}(X) = 1/\lambda^2$.
4. Displays of exponential distributions. Below are plots of four exponential distributions. Note that the shapes of the distributions are identical.

   [Figure: four panels showing exponential pdfs $f_X(x)$ for $\lambda = 0.1$, $\lambda = 1$, $\lambda = 2$, and $\lambda = 5$; only the scales of the axes differ.]
5. Memoryless Property: Suppose that $X \sim \mathrm{Expon}(\lambda)$. The random variable can be thought of as the waiting time for an event to occur. Given that an event has not occurred in the interval $[0, w)$, find the probability that the additional waiting time is at least t. That is, find $P(X > t + w\,|\,X > w)$. Note: $P(X > t)$ is sometimes called the reliability function. It is denoted as $R(t)$ and is related to $F_X(t)$ by
   $$R(t) = P(X > t) = 1 - P(X \le t) = 1 - F_X(t).$$
   The reliability function represents the probability that the lifetime of a product (i.e., waiting time for failure) is at least t units. For the exponential distribution, the reliability function is $R(t) = e^{-\lambda t}$. We are interested in the conditional reliability function $R(t + w\,|\,X > w)$. Solution:
   $$R(t + w\,|\,X > w) = P(X > t + w\,|\,X > w) = \frac{P(X > t + w)}{P(X > w)} = \frac{e^{-\lambda(t + w)}}{e^{-\lambda w}} = e^{-\lambda t}.$$
   Also,
   $$R(t + w\,|\,X > w) = 1 - F_X(t + w\,|\,X > w) \implies F_X(t + w\,|\,X > w) = 1 - e^{-\lambda t}.$$
   That is, no matter how long one has been waiting, the conditional distribution of the remaining lifetime is still $\mathrm{Expon}(\lambda)$. It is as though the distribution does not remember that we have already been waiting w time units.
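The memoryless property can be illustrated numerically. The Matlab sketch below (lambda, w, t, and the sample size are arbitrary illustrative values) compares the conditional probability P(X > t + w | X > w) with the unconditional P(X > t):

    lambda = 2; w = 0.7; t = 0.5; n = 1e6;
    x = -log(rand(n,1)) / lambda;              % X ~ Expon(lambda) via inverse cdf
    cond   = sum(x > t + w) / sum(x > w)       % estimate of P(X > t+w | X > w)
    uncond = exp(-lambda*t)                    % P(X > t) = e^{-lambda*t}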
6. Poisson Inter-arrival Times: Suppose that events occur according to a Poisson process with rate parameter $\lambda$. Assume that the process begins at time 0. Let $T_1$ be the arrival time of the first event and let $T_r$ be the time interval from the $(r-1)$st arrival to the $r$th arrival. That is, $T_1, \ldots, T_n$ are inter-arrival times. Then $T_i$ for $i = 1, \ldots, n$ are iid $\text{Expon}(\lambda)$.
Proof: Consider the joint pdf of $T_1, T_2, \ldots, T_n$:
\[ f_{T_1,T_2,\ldots,T_n}(t_1, t_2, \ldots, t_n) = f_{T_1}(t_1)\, f_{T_2|T_1}(t_2|t_1)\, f_{T_3|T_1,T_2}(t_3|t_1,t_2) \cdots f_{T_n|T_1,\ldots,T_{n-1}}(t_n|t_1,\ldots,t_{n-1}) \]
by the multiplication rule. To obtain the first term, first find the cdf of $T_1$:
\[ F_{T_1}(t_1) = P(T_1 \le t_1) = P[\text{one or more events in } (0, t_1)] = 1 - P[\text{no events in } (0, t_1)] = 1 - \frac{e^{-\lambda t_1}(\lambda t_1)^0}{0!} = 1 - e^{-\lambda t_1}. \]
Differentiating the cdf yields
\[ f_{T_1}(t_1) = \frac{d}{dt_1}\left(1 - e^{-\lambda t_1}\right) = \lambda e^{-\lambda t_1} I_{(0,\infty)}(t_1). \]
The second term is the conditional pdf of $T_2$ given $T_1 = t_1$. Recall that in a Poisson process, events in non-overlapping intervals are independent. Therefore,
\[ f_{T_2|T_1}(t_2|t_1) = f_{T_2}(t_2) = \lambda e^{-\lambda t_2}. \]
Each of the remaining conditional pdfs also is just an exponential pdf. Therefore,
\[ f_{T_1,T_2,\ldots,T_n}(t_1, t_2, \ldots, t_n) = \prod_{i=1}^n \lambda e^{-\lambda t_i}\, I_{[0,\infty)}(t_i). \]
This joint pdf is the product of $n$ marginal exponential pdfs. Therefore, the inter-arrival times are iid exponential random variables. That is, $T_i \sim$ iid $\text{Expon}(\lambda)$.
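The result can be checked numerically. The sketch below (not part of the original notes) uses the standard fact that, conditional on the number of events in $[0, T]$, the event times of a Poisson process are iid Unif$(0, T)$; the rate `lam` and window `T` are arbitrary assumed values.

```python
import numpy as np

rng = np.random.default_rng(1)
lam, T = 2.0, 50_000.0                 # assumed rate and observation window

n = rng.poisson(lam * T)               # number of events in [0, T]
times = np.sort(rng.uniform(0.0, T, size=n))   # given n, event times are iid Unif(0, T)
gaps = np.diff(times)                  # inter-arrival times

print(gaps.mean(), 1 / lam)            # should be close to 1/lambda
print(gaps.var(), 1 / lam**2)          # should be close to 1/lambda^2
print(np.mean(gaps > 1.0), np.exp(-lam * 1.0))  # P(gap > 1) vs. e^{-lambda}
```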
6.3 Gamma Distributions
1. Erlang Distributions:
(a) Consider a Poisson process with rate parameter $\lambda$. Assume that the process begins at time 0. Let $Y$ be the time of the $r$th arrival. Using the differential method, the pdf of $Y$ can be obtained as follows:
\[ P(y < Y < y + dy) \approx P(r-1 \text{ arrivals before time } y)\, P[\text{one arrival in } (y, y+dy)] = \frac{e^{-\lambda y}(\lambda y)^{r-1}}{(r-1)!}\,\lambda\, dy. \]
Accordingly,
\[ f_Y(y) = \frac{e^{-\lambda y}\lambda^r y^{r-1}}{(r-1)!}\, I_{[0,\infty)}(y). \]
The above pdf is called the Erlang pdf.
(b) Note that $Y$ is the sum of $r$ iid $\text{Expon}(\lambda)$ random variables (see page 49 of these notes). Accordingly, $E(Y) = r/\lambda$ and $\text{Var}(Y) = r/\lambda^2$.
(c) CDF of an Erlang random variable: $F_Y(y) = 1 - P(Y > y)$ and $P(Y > y)$ is the probability that fewer than $r$ events occur in $[0, y)$. Accordingly,
\[ F_Y(y) = 1 - P(Y > y) = 1 - \sum_{i=0}^{r-1} \frac{e^{-\lambda y}(\lambda y)^i}{i!}. \]
2. Gamma Function
(a) Definition: $\Gamma(\alpha) = \int_0^\infty u^{\alpha-1} e^{-u}\,du$, where $\alpha > 0$.
(b) Alternative expression: Let $z = \sqrt{2u}$ so that $u = z^2/2$; $du = z\,dz$; and
\[ \Gamma(\alpha) = \int_0^\infty z^{2\alpha - 1} e^{-z^2/2}\, 2^{1-\alpha}\, dz. \]
(c) Properties of $\Gamma(\alpha)$
i. $\Gamma(1) = 1$.
Proof: $\Gamma(1) = \int_0^\infty e^{-w}\,dw = -e^{-w}\big|_0^\infty = 0 + 1 = 1$.
ii. $\Gamma(\alpha + 1) = \alpha\Gamma(\alpha)$.
Proof: $\Gamma(\alpha+1) = \int_0^\infty w^\alpha e^{-w}\,dw$. Let $u = w^\alpha$, let $dv = e^{-w}\,dw$ and use integration by parts to obtain $du = \alpha w^{\alpha-1}\,dw$, $v = -e^{-w}$ and
\[ \Gamma(\alpha+1) = -w^\alpha e^{-w}\big|_0^\infty + \alpha\int_0^\infty w^{\alpha-1}e^{-w}\,dw = 0 + \alpha\Gamma(\alpha). \]
iii. If $n$ is a positive integer, then $\Gamma(n) = (n-1)!$.
Proof: $\Gamma(n) = (n-1)\Gamma(n-1) = (n-1)(n-2)\Gamma(n-2)$, etc.
iv. $\Gamma(\tfrac{1}{2}) = \sqrt{\pi}$.
Proof:
\[ \Gamma\!\left(\tfrac{1}{2}\right) = \int_0^\infty e^{-z^2/2}\,2^{1/2}\,dz = \sqrt{\pi}\int_{-\infty}^\infty \frac{e^{-z^2/2}}{\sqrt{2\pi}}\,dz = \sqrt{\pi} \]
because the integral of the standard normal distribution is one.
3. Gamma Distribution
(a) PDF and cdf: If $Y \sim \text{Gam}(\alpha, \lambda)$, then
\[ f_Y(y) = \frac{y^{\alpha-1}\lambda^\alpha e^{-\lambda y}}{\Gamma(\alpha)}\, I_{(0,\infty)}(y) \quad\text{and}\quad F_Y(y) = \int_0^y \frac{u^{\alpha-1}\lambda^\alpha e^{-\lambda u}}{\Gamma(\alpha)}\,du. \]
(b) Note: $\alpha$ is called the shape parameter and $\lambda$ is called the scale parameter.
(c) Moment Generating Function: If $Y \sim \text{Gam}(\alpha, \lambda)$, then
\[ \psi_Y(t) = \int_0^\infty e^{ty}\,\frac{y^{\alpha-1}\lambda^\alpha e^{-\lambda y}}{\Gamma(\alpha)}\,dy = \int_0^\infty \frac{y^{\alpha-1}\lambda^\alpha e^{-(\lambda - t)y}}{\Gamma(\alpha)}\,dy = \frac{\lambda^\alpha}{(\lambda - t)^\alpha}\int_0^\infty \frac{y^{\alpha-1}(\lambda - t)^\alpha e^{-(\lambda - t)y}}{\Gamma(\alpha)}\,dy = \frac{\lambda^\alpha}{(\lambda - t)^\alpha} \]
because the last integral is the integral of the pdf of a random variable with distribution $\text{Gam}(\alpha, \lambda - t)$, provided that $\lambda - t > 0$.
(d) Moments: If $Y \sim \text{Gam}(\alpha, \lambda)$, then
\[ E(Y) = \frac{d}{dt}\psi_Y(t)\Big|_{t=0} = \frac{\alpha\lambda^\alpha}{(\lambda - t)^{\alpha+1}}\Big|_{t=0} = \frac{\alpha}{\lambda}; \]
\[ E(Y^2) = \frac{d^2}{dt^2}\psi_Y(t)\Big|_{t=0} = \frac{\alpha(\alpha+1)\lambda^\alpha}{(\lambda - t)^{\alpha+2}}\Big|_{t=0} = \frac{\alpha(\alpha+1)}{\lambda^2}; \quad\text{and}\quad \text{Var}(Y) = E(Y^2) - [E(Y)]^2 = \frac{\alpha(\alpha+1)}{\lambda^2} - \frac{\alpha^2}{\lambda^2} = \frac{\alpha}{\lambda^2}. \]
(e) General expression for moments (including fractional moments). If $Y \sim \text{Gam}(\alpha, \lambda)$, then
\[ E(Y^r) = \frac{\Gamma(\alpha + r)}{\lambda^r \Gamma(\alpha)} \quad\text{provided that } \alpha + r > 0. \]
Proof:
\[ E(Y^r) = \int_0^\infty y^r\,\frac{y^{\alpha-1}\lambda^\alpha e^{-\lambda y}}{\Gamma(\alpha)}\,dy = \int_0^\infty \frac{y^{\alpha+r-1}\lambda^\alpha e^{-\lambda y}}{\Gamma(\alpha)}\,dy = \frac{\Gamma(\alpha+r)}{\lambda^r\Gamma(\alpha)}\int_0^\infty \frac{y^{\alpha+r-1}\lambda^{\alpha+r} e^{-\lambda y}}{\Gamma(\alpha+r)}\,dy = \frac{\Gamma(\alpha+r)}{\lambda^r\Gamma(\alpha)} \]
because the last integral is the integral of the pdf of a random variable with distribution $\text{Gam}(\alpha + r, \lambda)$, provided that $\alpha + r > 0$.
(f) Distribution of the sum of iid exponential random variables. Suppose that $Y_1, Y_2, \ldots, Y_k$ are iid $\text{Expon}(\lambda)$ random variables. Then $T = \sum_{i=1}^k Y_i \sim \text{Gam}(k, \lambda)$.
Proof: $\psi_{Y_i}(t) = \lambda/(\lambda - t) \Longrightarrow \psi_T(t) = \lambda^k/(\lambda - t)^k$.
(g) Note that the Erlang distribution is a gamma distribution with shape parameter equal to an integer.
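Result (f) is easy to check by simulation. The short sketch below (not part of the original notes) sums $k$ iid exponential draws and compares the results with the Gam($k, \lambda$) distribution; the values of `lam`, `k`, and `reps` are arbitrary assumed values, and note that SciPy parameterizes the gamma distribution by a scale equal to $1/\lambda$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
lam, k, reps = 1.5, 5, 200_000        # assumed rate, number of summands, replications

t = rng.exponential(1.0 / lam, size=(reps, k)).sum(axis=1)   # T = sum of k iid Expon(lam)

print(t.mean(), k / lam)              # E(T) = k/lambda
print(t.var(), k / lam**2)            # Var(T) = k/lambda^2
# Compare a tail probability with the Gam(k, lam) survival function.
print(np.mean(t > 5.0), stats.gamma.sf(5.0, a=k, scale=1.0 / lam))
```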
6.4 Chi Squared Distributions
1. Definition: Let $Z_i$ for $i = 1, \ldots, k$ be iid $N(0,1)$ random variables. Then $Y = \sum_{i=1}^k Z_i^2$ is said to have a $\chi^2$ distribution with $k$ degrees of freedom. That is, $Y \sim \chi^2_k$.
2. MGF: $\psi_Y(t) = (1 - 2t)^{-k/2}$ for $t < 0.5$.
Proof: First find the mgf of $Z_i^2$:
\[ \psi_{Z_i^2}(t) = E(e^{tZ^2}) = \int_{-\infty}^\infty e^{tz^2}\,\frac{e^{-\frac{1}{2}z^2}}{\sqrt{2\pi}}\,dz = \int_{-\infty}^\infty \frac{e^{-\frac{1}{2(1-2t)^{-1}}z^2}}{\sqrt{2\pi}}\,dz = (1-2t)^{-\frac{1}{2}}\int_{-\infty}^\infty \frac{e^{-\frac{1}{2(1-2t)^{-1}}z^2}}{(1-2t)^{-\frac{1}{2}}\sqrt{2\pi}}\,dz = (1-2t)^{-\frac{1}{2}} \]
because the last integral is the integral of the pdf of a $N[0, (1-2t)^{-1}]$ random variable. It follows that the mgf of $Y$ is $(1-2t)^{-k/2}$. Note that this is the mgf of a Gamma random variable with parameters $\lambda = 0.5$ and $\alpha = k/2$. Accordingly,
\[ Y \sim \chi^2_k \iff Y \sim \text{Gamma}\!\left(\frac{k}{2}, \frac{1}{2}\right) \quad\text{and}\quad f_Y(y) = \frac{y^{\frac{k}{2}-1} e^{-\frac{1}{2}y}}{\Gamma\!\left(\frac{k}{2}\right) 2^{\frac{k}{2}}}\, I_{(0,\infty)}(y). \]
3. Properties of $\chi^2$ random variables
(a) If $Y \sim \chi^2_k$, then $E(Y^r) = \dfrac{2^r\,\Gamma(k/2 + r)}{\Gamma(k/2)}$ provided that $k/2 + r > 0$.
Proof: Use the moment result for Gamma random variables.
(b) Using $\Gamma(\alpha+1) = \alpha\Gamma(\alpha)$, it is easy to show that $E(Y) = k$, $E(Y^2) = k(k+2)$, and $\text{Var}(Y) = 2k$.
(c) $Y \approx N(k, 2k)$ for large $k$. This is an application of the central limit theorem. A better approximation (again for large $k$) is $\sqrt{2Y} - \sqrt{2k-1} \approx N(0,1)$.
(d) If $Y_1, Y_2, \ldots, Y_n$ are independently distributed as $Y_i \sim \chi^2_{k_i}$, then $\sum_{i=1}^n Y_i \sim \chi^2_k$, where $k = \sum_{i=1}^n k_i$.
Proof: use the mgf.
(e) If $X \sim \chi^2_k$, $X + Y \sim \chi^2_n$, and $X \perp Y$, then $Y \sim \chi^2_{n-k}$.
Proof: See page 248 in the text. Note that by independence $\psi_{X+Y}(t) = \psi_X(t)\psi_Y(t)$.
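The defining construction and the moments above are easy to verify by simulation. The sketch below (not part of the original notes) squares and sums standard normal draws; `k` and `reps` are arbitrary assumed values.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
k, reps = 7, 200_000                  # assumed degrees of freedom and replications

y = (rng.standard_normal((reps, k)) ** 2).sum(axis=1)   # Y = sum of k squared N(0,1)

print(y.mean(), k)                    # E(Y) = k
print(y.var(), 2 * k)                 # Var(Y) = 2k
# The simulated upper-tail probability should match the chi-square(k) survival function.
print(np.mean(y > 14.0), stats.chi2.sf(14.0, df=k))
```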
6.5 Distributions for Reliability
1. Definition: Suppose that $L$ is a nonnegative continuous rv. In particular, suppose that $L$ is the lifetime (time to failure) of a component. The reliability function is the probability that the lifetime exceeds $x$. That is,
\[ \text{Reliability Function of } L = R_L(x) \stackrel{\text{def}}{=} P(L > x) = 1 - F_L(x). \]
2. Result: If $L$ is a nonnegative continuous rv whose expectation exists, then
\[ E(L) = \int_0^\infty R_L(x)\,dx = \int_0^\infty [1 - F_L(x)]\,dx. \]
Proof: Use integration by parts with $u = R_L(x) \Longrightarrow du = -f_L(x)\,dx$ and $dv = dx \Longrightarrow v = x$. Making these substitutions,
\[ \int_0^\infty R_L(x)\,dx = \int_0^\infty u\,dv = uv\big|_0^\infty - \int_0^\infty v\,du = x[1 - F_L(x)]\big|_0^\infty + \int_0^\infty x f_L(x)\,dx = \int_0^\infty x f_L(x)\,dx = E(L), \]
provided that $\lim_{x\to\infty} x[1 - F_L(x)] = 0$.
3. Definition: the hazard function is the instantaneous rate of failure at time $x$, given that the component lifetime is at least $x$. That is,
\[ \text{Hazard Function of } L = h_L(x) \stackrel{\text{def}}{=} \lim_{dx\to 0} \frac{P(x < L < x + dx\,|\,L > x)}{dx} = \lim_{dx\to 0}\left[\frac{F_L(x+dx) - F_L(x)}{dx}\right]\frac{1}{R_L(x)} = \frac{f_L(x)}{R_L(x)}. \]
4. Result:
\[ h_L(x) = -\frac{d}{dx}\ln[R_L(x)] = -\frac{1}{R_L(x)}\frac{d}{dx}R_L(x) = -\frac{1}{R_L(x)}\left[-f_L(x)\right] = \frac{f_L(x)}{R_L(x)}. \]
5. Result: If $R_L(0) = 1$, then
\[ R_L(x) = \exp\!\left\{-\int_0^x h_L(u)\,du\right\}. \]
Proof:
\[ h_L(x) = -\frac{d}{dx}\{\ln[R_L(x)] - \ln[R_L(0)]\} \;\Longrightarrow\; \int_0^x h_L(u)\,du = -\ln[R_L(x)] \;\Longrightarrow\; \exp\!\left\{-\int_0^x h_L(u)\,du\right\} = R_L(x). \]
6. Result: the hazard function is constant if and only if time to failure has an exponential distribution. Proof: First, suppose that time to failure has an exponential distribution. Then,
\[ f_L(x) = \lambda e^{-\lambda x} I_{(0,\infty)}(x) \;\Longrightarrow\; R_L(x) = e^{-\lambda x} \;\Longrightarrow\; h_L(x) = \frac{\lambda e^{-\lambda x}}{e^{-\lambda x}} = \lambda. \]
Second, suppose that the hazard function is a constant, $\lambda$. Then,
\[ h_L(x) = \lambda \;\Longrightarrow\; R_L(x) = \exp\!\left\{-\int_0^x \lambda\,du\right\} = e^{-\lambda x} \;\Longrightarrow\; f_L(x) = \frac{d}{dx}\left(1 - e^{-\lambda x}\right) = \lambda e^{-\lambda x}. \]
7. Weibull Distribution: increasing hazard function. The hazard function for the Weibull distribution is
\[ h_L(x) = \alpha\beta x^{\beta - 1}, \]
where $\alpha$ and $\beta$ are positive constants. The corresponding reliability function is
\[ R_L(x) = \exp\!\left\{-\int_0^x h_L(u)\,du\right\} = \exp\!\left\{-\alpha x^\beta\right\}, \]
and the pdf is
\[ f_L(x) = \frac{d}{dx}F_L(x) = \alpha\beta x^{\beta - 1}\exp\!\left\{-\alpha x^\beta\right\} I_{(0,\infty)}(x). \]
8. Gompertz Distribution: exponential hazard function. The hazard function for the Gompertz distribution is
\[ h_L(x) = \alpha e^{\beta x}, \]
where $\alpha$ and $\beta$ are positive constants. The corresponding reliability function is
\[ R_L(x) = \exp\!\left\{-\frac{\alpha\left(e^{\beta x} - 1\right)}{\beta}\right\}, \]
and the pdf is
\[ f_L(x) = \frac{d}{dx}F_L(x) = \alpha e^{\beta x}\exp\!\left\{-\frac{\alpha\left(e^{\beta x} - 1\right)}{\beta}\right\} I_{(0,\infty)}(x). \]
9. Series Combinations: If a system fails whenever any single component fails, then the components are said to be in series. The time to failure of the system is the minimum time to failure of the components. If the failure times of the components are statistically independent, then the reliability function of the system is
\[ R(x) = P(\text{system life} > x) = P(\text{all components survive to } x) = \prod_i R_i(x), \]
where $R_i(x)$ is the reliability function of the $i$th component.
10. Parallel Combinations: If a system fails only if all components fail, then the components are said to be in parallel. The time to failure of the system is the maximum time to failure of the components. If the failure times of the components are statistically independent, then the reliability function of the system is
\[ R(x) = P(\text{system life} > x) = 1 - P(\text{all components fail by time } x) = 1 - \prod_i F_i(x) = 1 - \prod_i [1 - R_i(x)], \]
where $F_i(x)$ is the cdf of the $i$th component.
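The series and parallel formulas can be checked against simulated minimum and maximum lifetimes. This is a minimal sketch (not part of the original notes) assuming three independent components with exponential lifetimes; the rates in `lams`, the time `x`, and `reps` are arbitrary assumed values.

```python
import numpy as np

rng = np.random.default_rng(4)
lams = np.array([0.5, 1.0, 2.0])       # assumed component failure rates
x, reps = 0.8, 200_000

R = np.exp(-lams * x)                  # component reliabilities R_i(x) = e^{-lam_i x}
print("series  :", R.prod())           # prod R_i(x)
print("parallel:", 1 - (1 - R).prod()) # 1 - prod [1 - R_i(x)]

life = rng.exponential(1.0 / lams, size=(reps, 3))   # simulated component lifetimes
print(np.mean(life.min(axis=1) > x))   # P(system life > x), components in series
print(np.mean(life.max(axis=1) > x))   # P(system life > x), components in parallel
```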
6.6 t, F, and Beta Distributions
1. t distributions: Let $Z$ and $X$ be independently distributed as $Z \sim N(0,1)$ and $X \sim \chi^2_k$. Then
\[ T = \frac{Z}{\sqrt{X/k}} \]
has a central t distribution with $k$ degrees of freedom. The pdf is
\[ f_T(t) = \frac{\Gamma\!\left(\frac{k+1}{2}\right)}{\sqrt{k\pi}\,\Gamma\!\left(\frac{k}{2}\right)}\left(1 + \frac{t^2}{k}\right)^{-(k+1)/2}. \]
If $k = 1$, then the pdf of $T$ is
\[ f_T(t) = \frac{1}{\pi(1 + t^2)}, \]
which is the pdf of a standard Cauchy random variable.
Moments of a t random variable. Suppose that $T \sim t_k$. Then
\[ E(T^r) = E\!\left[k^{r/2} Z^r X^{-r/2}\right] = k^{r/2} E(Z^r)\,E(X^{-r/2}), \quad\text{where } Z \sim N(0,1),\; X \sim \chi^2_k,\; Z \perp X. \]
Recall that odd moments of $Z$ are zero. Even moments of $Z$ and moments of $X$ are
\[ E(Z^{2i}) = \frac{(2i)!}{i!\,2^i} \quad\text{and}\quad E(X^a) = \frac{2^a\,\Gamma(k/2 + a)}{\Gamma(k/2)}, \]
provided that $k/2 + a > 0$. Therefore, if $r$ is a non-negative integer, then
\[ E(T^r) = \begin{cases} \text{does not exist} & \text{if } r \ge k;\\[2pt] 0 & \text{if } r \text{ is odd and } r < k;\\[2pt] \dfrac{k^{r/2}\, r!\, \Gamma\!\left(\frac{k-r}{2}\right)}{\left(\frac{r}{2}\right)!\, 2^r\, \Gamma\!\left(\frac{k}{2}\right)} & \text{if } r \text{ is even and } r < k. \end{cases} \]
Using the above expression, it is easy to show that $E(T) = 0$ if $k > 1$ and that $\text{Var}(T) = k/(k-2)$ if $k > 2$.
2. F Distributions: Let $U_1$ and $U_2$ be independent $\chi^2$ random variables with degrees of freedom $k_1$ and $k_2$, respectively. Then
\[ Y = \frac{U_1/k_1}{U_2/k_2} \]
has a central F distribution with $k_1$ and $k_2$ degrees of freedom. That is, $Y \sim F_{k_1,k_2}$. The pdf is
\[ f_Y(y) = \frac{\left(\frac{k_1}{k_2}\right)^{k_1/2}\Gamma\!\left(\frac{k_1+k_2}{2}\right) y^{(k_1-2)/2}}{\Gamma\!\left(\frac{k_1}{2}\right)\Gamma\!\left(\frac{k_2}{2}\right)\left(1 + \frac{y k_1}{k_2}\right)^{(k_1+k_2)/2}}\, I_{(0,\infty)}(y). \]
If $T \sim t_k$, then $T^2 \sim F_{1,k}$.
Moments of an F random variable. Suppose that $Y \sim F_{k_1,k_2}$. Then
\[ E(Y^r) = E\!\left[\frac{(k_2 U_1)^r}{(k_1 U_2)^r}\right] = \left(\frac{k_2}{k_1}\right)^r E(U_1^r)\,E(U_2^{-r}), \quad\text{where } U_1 \sim \chi^2_{k_1},\; U_2 \sim \chi^2_{k_2},\; U_1 \perp U_2. \]
Using the general expression for the moments of a $\chi^2$ random variable, it can be shown that for any real valued $r$,
\[ E(Y^r) = \begin{cases} \text{does not exist} & \text{if } r \ge k_2/2;\\[2pt] \left(\dfrac{k_2}{k_1}\right)^r \dfrac{\Gamma\!\left(\frac{k_1}{2} + r\right)\Gamma\!\left(\frac{k_2}{2} - r\right)}{\Gamma\!\left(\frac{k_1}{2}\right)\Gamma\!\left(\frac{k_2}{2}\right)} & \text{if } r < k_2/2. \end{cases} \]
Using the above expression, it is easy to show that
\[ E(Y) = \frac{k_2}{k_2 - 2} \text{ if } k_2 > 2 \quad\text{and}\quad \text{Var}(Y) = \frac{2k_2^2(k_1 + k_2 - 2)}{k_1(k_2 - 2)^2(k_2 - 4)} \text{ if } k_2 > 4. \]
3. Beta Distributions: Let $U_1$ and $U_2$ be independent $\chi^2$ random variables with degrees of freedom $k_1$ and $k_2$, respectively. Then
\[ Y = \frac{U_1}{U_1 + U_2} \]
has a beta distribution with parameters $k_1/2$ and $k_2/2$. That is, $Y \sim \text{Beta}\!\left(\frac{k_1}{2}, \frac{k_2}{2}\right)$. More generally, if $U_1 \sim \text{Gam}(\alpha_1, \lambda)$, $U_2 \sim \text{Gam}(\alpha_2, \lambda)$, and $U_1 \perp U_2$, then
\[ Y = \frac{U_1}{U_1 + U_2} \sim \text{Beta}(\alpha_1, \alpha_2). \]
If $Y \sim \text{Beta}(\alpha_1, \alpha_2)$, then the pdf of $Y$ is
\[ f_Y(y) = \frac{y^{\alpha_1 - 1}(1-y)^{\alpha_2 - 1}}{B(\alpha_1, \alpha_2)}\, I_{(0,1)}(y), \]
where $B(\alpha_1, \alpha_2)$ is the beta function and is defined as
\[ B(\alpha_1, \alpha_2) = \frac{\Gamma(\alpha_1)\Gamma(\alpha_2)}{\Gamma(\alpha_1 + \alpha_2)}. \]
If $B \sim \text{Beta}(\alpha_1, \alpha_2)$, then
\[ \frac{\alpha_2 B}{\alpha_1(1 - B)} \sim F_{2\alpha_1, 2\alpha_2}. \]
If $B \sim \text{Beta}(\alpha_1, \alpha_2)$, where $\alpha_1 = \alpha_2 = 1$, then $B \sim \text{Unif}(0,1)$.
If $X \sim \text{Beta}(\alpha_1, \alpha_2)$, then
\[ E(X^r) = \frac{\Gamma(\alpha_1 + r)\Gamma(\alpha_1 + \alpha_2)}{\Gamma(\alpha_1 + \alpha_2 + r)\Gamma(\alpha_1)} \quad\text{provided that } \alpha_1 + r > 0. \]
Proof:
\[ E(X^r) = \int_0^1 x^r\,\frac{x^{\alpha_1 - 1}(1-x)^{\alpha_2 - 1}}{B(\alpha_1, \alpha_2)}\,dx = \int_0^1 \frac{x^{\alpha_1 + r - 1}(1-x)^{\alpha_2 - 1}}{B(\alpha_1, \alpha_2)}\,dx = \frac{B(\alpha_1 + r, \alpha_2)}{B(\alpha_1, \alpha_2)}\int_0^1 \frac{x^{\alpha_1 + r - 1}(1-x)^{\alpha_2 - 1}}{B(\alpha_1 + r, \alpha_2)}\,dx \]
\[ = \frac{B(\alpha_1 + r, \alpha_2)}{B(\alpha_1, \alpha_2)} = \frac{\Gamma(\alpha_1 + r)\Gamma(\alpha_1 + \alpha_2)}{\Gamma(\alpha_1 + \alpha_2 + r)\Gamma(\alpha_1)}, \]
provided that $\alpha_1 + r > 0$, because the last integral is the integral of the pdf of a random variable with distribution $\text{Beta}(\alpha_1 + r, \alpha_2)$.
If $F \sim F_{k_1,k_2}$, then
\[ \frac{k_1 F}{k_1 F + k_2} \sim \text{Beta}\!\left(\frac{k_1}{2}, \frac{k_2}{2}\right). \]
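The relationships among the t, F, and beta distributions are easy to check by simulation. The sketch below (not part of the original notes) checks that $T^2$ behaves like an $F_{1,k}$ random variable and that $k_1F/(k_1F + k_2)$ behaves like a Beta($k_1/2, k_2/2$) random variable; the degrees of freedom and `reps` are arbitrary assumed values.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
k, reps = 8, 200_000                  # assumed degrees of freedom and replications

z = rng.standard_normal(reps)
x = rng.chisquare(k, reps)
t = z / np.sqrt(x / k)                # T ~ t_k

# T^2 should behave like an F(1, k) random variable.
print(np.mean(t**2 > 3.0), stats.f.sf(3.0, 1, k))

# k1*F/(k1*F + k2) with F ~ F(k1, k2) should behave like Beta(k1/2, k2/2).
k1, k2 = 4, 10
f = rng.f(k1, k2, reps)
b = k1 * f / (k1 * f + k2)
print(b.mean(), (k1 / 2) / (k1 / 2 + k2 / 2))   # Beta(a, b) mean is a/(a+b)
```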
Chapter 7
ORGANIZING & DESCRIBING
DATA
The topics in this chapter are covered in Stat 216, 217, and 401. Please read this
chapter. With a few exceptions, I will not lecture on these topics. Below is a list of
terms and methods that you should be familiar with.
7.1 Frequency Distributions
1. Contingency (frequency) tables for categorical random variables, cell, marginal
distributions.
2. Bar graph for categorical and for discrete random variables.
7.2 Data on Continuous Variables
1. Stem & Leaf Displays for continuous random variables.
2. Frequency Distributions & Histograms for continuous random variables. Area
should be proportional to frequency regardless of whether bin widths are equal
or not.
3. Scatter Plots for paired continuous random variables.
4. Statistic: A numerical characteristic of the sample. A statistic is a random
variable.
7.3 Order Statistics
1. Order statistics are the ordered sample values. The conventional notation is to denote the $i$th order statistic as $X_{(i)}$, where $X_{(1)} \le X_{(2)} \le X_{(3)} \le \cdots \le X_{(n)}$.
2. Sample median: 50th percentile.
3. Quartiles: $Q_1 = 25$th, $Q_2 = 50$th, and $Q_3 = 75$th percentiles.
4. Interquartile Range: $Q_3 - Q_1$.
5. Range: $X_{(n)} - X_{(1)}$.
6. Midrange: $(X_{(n)} + X_{(1)})/2$.
7. Midhinge: $(Q_1 + Q_3)/2$.
8. Five Number Summary: $X_{(1)}, Q_1, Q_2, Q_3, X_{(n)}$.
9. Quantiles: For a data set of size $n$, the quantiles are the order statistics $X_{(1)}, \ldots, X_{(n)}$. The quantiles are special cases of percentiles (the book has this backwards). The $i$th quantile is the $100p_i$th percentile, where $p_i = (i - 3/8)/(n + 1/4)$. Note, the percentile is defined so that $p_i \in (0,1)$ for all $i$. For large $n$, $p_i \approx i/n$.
10. Q-Q Plots: These are scatter plots of the quantiles from two distributions. If the distributions are the same, then the scatter plot should show a line of points at a 45 degree angle. One application is to plot the empirical quantiles against the quantiles from a theoretical distribution. This is called a probability plot. Suppose, for example, that it is believed that the data have been sampled from a distribution having cdf $F$. Then the probability plot is obtained by plotting $F^{-1}(p_i)$ against $X_{(i)}$ for $i = 1, \ldots, n$.
To visualize whether or not the data could have come from a normal distribution, for example, the empirical quantiles can be plotted against normal quantiles, $\hat{\mu} + \hat{\sigma}\Phi^{-1}(p_i)$. For example, problem 7-17 on page 284 gives the population densities per square mile for each of the 50 states. Below are Q-Q plots comparing the quantiles of the data to the quantiles of the normal distribution and to the quantiles of the log normal distribution. The computations to construct the plots are on the following page. In the table, the variable $\ln(y)$ is labeled as $w$. The quantiles of the normal distribution and the log normal distribution are
\[ \bar{y} + s_y z_{p_i} \quad\text{and}\quad \exp\{\bar{w} + s_w z_{p_i}\}, \]
respectively, where $z_{p_i} = \Phi^{-1}(p_i)$ is the $100p_i$th percentile of the standard normal distribution.
The smallest three values correspond to Alaska, Montana, and Wyoming. Values 46 to 50 correspond to Maryland, Connecticut, Massachusetts, New Jersey, and Rhode Island, respectively.
[Figure: two Q-Q plots of the population density data. Left panel: empirical quantiles versus normal quantiles; right panel: empirical quantiles versus log normal quantiles.]
 i   y_(i)   p_i     z_{p_i}   ybar + s_y z_{p_i}   wbar + s_w z_{p_i}   exp{wbar + s_w z_{p_i}}
 1      1    0.012    -2.24        -355.23               0.95                  2.58
 2      5    0.032    -1.85        -264.99               1.52                  4.58
 3      5    0.052    -1.62        -213.93               1.85                  6.34
 4      7    0.072    -1.46        -176.66               2.08                  8.03
 5      9    0.092    -1.33        -146.62               2.27                  9.72
 6      9    0.112    -1.22        -121.08               2.44                 11.43
 :      :      :         :             :                   :                     :
21     54    0.410    -0.23         104.59               3.87                 47.87
22     55    0.430    -0.18         116.19               3.94                 51.53
23     62    0.450    -0.13         127.70               4.02                 55.43
24     71    0.470    -0.07         139.13               4.09                 59.60
25     77    0.490    -0.02         150.51               4.16                 64.07
26     81    0.510     0.02         161.89               4.23                 68.87
27     87    0.530     0.07         173.27               4.30                 74.03
28     92    0.550     0.13         184.70               4.38                 79.61
29     94    0.570     0.18         196.21               4.45                 85.64
30     95    0.590     0.23         207.81               4.52                 92.18
 :      :      :         :             :                   :                     :
46    429    0.908     1.33         459.02               6.12                454.22
47    638    0.928     1.46         489.06               6.31                549.64
48    733    0.948     1.62         526.33               6.55                696.36
49    987    0.968     1.85         577.39               6.87                962.96
50    989    0.988     2.24         667.63               7.44               1707.73
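The probability plot coordinates above can be computed directly from the plotting-position formula $p_i = (i - 3/8)/(n + 1/4)$. The sketch below (not part of the original notes) builds the normal-quantile coordinates for an arbitrary sample; the log normal data generated at the bottom are only an assumed illustration, not the population density data.

```python
import numpy as np
from scipy import stats

def probability_plot_points(y):
    y = np.sort(np.asarray(y, dtype=float))        # empirical quantiles y_(1) <= ... <= y_(n)
    n = y.size
    i = np.arange(1, n + 1)
    p = (i - 3.0 / 8.0) / (n + 0.25)               # plotting positions p_i
    z = stats.norm.ppf(p)                          # standard normal quantiles z_{p_i}
    theo = y.mean() + y.std(ddof=1) * z            # ybar + s_y * z_{p_i}
    return theo, y                                 # plot y_(i) against these quantiles

rng = np.random.default_rng(6)
theo, emp = probability_plot_points(rng.lognormal(4.0, 1.4, size=50))  # assumed data
print(np.corrcoef(theo, emp)[0, 1])   # a value far from 1 casts doubt on normality
```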
7.4 Data Analysis
1. Random variable versus realization: Let $X_1, X_2, \ldots, X_n$ be a random sample from some population. Then $X_i$ is a random variable whose distribution depends on the population at hand. Also, the distribution of $X_1, X_2, \ldots, X_n$ is exchangeable. We will use lower case letters to denote a realization of the random sample. That is, $x_1, x_2, \ldots, x_n$ is a realization of the random sample.
2. Outlier: An observation that is far from the bulk of the data.
3. Random Sample: A simple random sample is a sample taken from the
population in a manner such that each possible sample of size n has an equal
probability of being selected. Note, this implies that each unit has the same
probability of being selected, but a sample taken such that each unit has the
same probability of being selected is not necessarily a simple random sample.
4. Transformations of X and/or Y are sometimes useful to change a non-linear
relationship into a linear relationship.
7.5 The Sample Mean
1. $\overline{X} = \frac{1}{n}\sum_{i=1}^n X_i$ is a random variable whereas $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$ is a realization.
2. $\sum_{i=1}^n (X_i - \overline{X}) = 0$ with probability 1 and $\sum_{i=1}^n (x_i - \bar{x}) = 0$.
3. If $X_1, \ldots, X_n$ is a random sample without replacement from a finite population of size $N$ with mean $\mu$ and variance $\sigma^2$, then
\[ E(\overline{X}) = \mu \quad\text{and}\quad \text{Var}(\overline{X}) = \frac{\sigma^2}{n}\left[1 - \frac{(n-1)}{(N-1)}\right]. \]
4. If $X_1, \ldots, X_n$ is a random sample with or without replacement from an infinite population or with replacement from a finite population with mean $\mu$ and variance $\sigma^2$, then
\[ E(\overline{X}) = \mu \quad\text{and}\quad \text{Var}(\overline{X}) = \frac{\sigma^2}{n}. \]
7.6 Measures of Dispersion
1. Sample variance: $S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \overline{X})^2$ is a random variable whereas $s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2$ is a realization.
2. If $X_1, \ldots, X_n$ is a random sample with or without replacement from an infinite population or with replacement from a finite population with mean $\mu_X$ and variance $\sigma^2_X$, then $E(S^2_X) = \sigma^2_X$.
Proof: First write $(X_i - \overline{X})^2$ as
\[ (X_i - \overline{X})^2 = X_i^2 - 2X_i\overline{X} + \overline{X}^2. \]
Accordingly,
\[ S^2_X = \frac{1}{n-1}\left[\sum_{i=1}^n X_i^2 - n\overline{X}^2\right]. \]
Recall that if $Y$ is a random variable with mean $\mu_Y$ and variance $\sigma^2_Y$, then $E(Y^2) = \sigma^2_Y + \mu^2_Y$. In this application, $E(\overline{X}^2) = \mu^2_X + \sigma^2_X/n$. Accordingly,
\[ E(S^2_X) = \frac{1}{n-1}\left[n(\sigma^2_X + \mu^2_X) - n\left(\mu^2_X + \frac{\sigma^2_X}{n}\right)\right] = \sigma^2_X. \]
3. Let $Y_1, \ldots, Y_n$ be a sample with sample mean $\overline{Y}$ and sample variance $S^2_Y$. Define $X_i$ by $X_i = a + bY_i$ for $i = 1, \ldots, n$. Then the sample mean and sample variance of $X_1, \ldots, X_n$ are
\[ \overline{X} = a + b\overline{Y} \quad\text{and}\quad S^2_X = b^2 S^2_Y. \]
Proof:
\[ \overline{X} = \frac{1}{n}\sum_{i=1}^n X_i = \frac{1}{n}\sum_{i=1}^n (a + bY_i) = \frac{1}{n}\left[na + b\sum_{i=1}^n Y_i\right] = a + b\overline{Y}. \]
Also,
\[ S^2_X = \frac{1}{n-1}\sum_i\left(X_i - \overline{X}\right)^2 = \frac{1}{n-1}\sum_i\left(a + bY_i - a - b\overline{Y}\right)^2 = \frac{1}{n-1}\sum_i\left(bY_i - b\overline{Y}\right)^2 = \frac{1}{n-1}b^2\sum_i\left(Y_i - \overline{Y}\right)^2 = b^2 S^2_Y. \]
This result also holds true for realizations $y_1, y_2, \ldots, y_n$.
4. MAD $= n^{-1}\sum_{i=1}^n |X_i - \overline{X}|$ or, more commonly, MAD is defined as MAD $= n^{-1}\sum_{i=1}^n |X_i - M|$, where $M$ is the sample median.
5. Result: Let $g(a) = \sum_{i=1}^n |X_i - a|$. Then, the minimizer of $g(a)$ with respect to $a$ is the sample median.
Proof: The strategy is to take the derivative of $g(a)$ with respect to $a$; set the derivative to zero; and solve for $a$. First note that we can ignore any $X_i$ that equals $a$ because it contributes nothing to $g(a)$. If $X_i \ne a$, then
\[ \frac{d}{da}|X_i - a| = \frac{d}{da}\sqrt{(X_i - a)^2} = \frac{1}{2}\left[(X_i - a)^2\right]^{-\frac{1}{2}} 2(X_i - a)(-1) = \frac{-(X_i - a)}{|X_i - a|} = \begin{cases} -1 & X_i > a;\\ \phantom{-}1 & X_i < a. \end{cases} \]
Accordingly,
\[ \frac{d}{da}g(a) = \sum_{i=1}^n\left[-I_{(-\infty,X_i)}(a) + I_{(X_i,\infty)}(a)\right] = -\#\{X\text{s larger than } a\} + \#\{X\text{s smaller than } a\}. \]
Setting the derivative to zero implies that the number of $X$s smaller than $a$ must be equal to the number of $X$s larger than $a$. Thus, $a$ must be the sample median.
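Result 5 is easy to see numerically. The sketch below (not part of the original notes) evaluates $g(a)$ on a fine grid and compares the minimizer with the sample median; the simulated data are an arbitrary assumed example with odd $n$ so the median is unique.

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.gamma(2.0, 3.0, size=101)                  # assumed data, odd n

grid = np.linspace(x.min(), x.max(), 20001)
g = np.abs(x[:, None] - grid[None, :]).sum(axis=0) # g(a) = sum_i |x_i - a| on a grid
print(grid[np.argmin(g)], np.median(x))            # grid minimizer vs. sample median
```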
7.7 Correlation
1. Let $(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)$ be a random sample of ordered pairs from a population having means $(\mu_X, \mu_Y)$, variances $(\sigma^2_X, \sigma^2_Y)$, and covariance $\sigma_{X,Y}$. The sample covariance between $X$ and $Y$ is
\[ S_{X,Y} \stackrel{\text{def}}{=} \frac{1}{n-1}\sum_{i=1}^n (X_i - \overline{X})(Y_i - \overline{Y}). \]
2. The equation for $S_{X,Y}$ can be written as
\[ S_{X,Y} = \frac{1}{n-1}\left[\sum_{i=1}^n X_iY_i - n\overline{X}\,\overline{Y}\right]. \]
Proof: Multiply the $X$ and $Y$ deviations to obtain the following:
\[ S_{X,Y} = \frac{1}{n-1}\sum_{i=1}^n\left(X_iY_i - X_i\overline{Y} - \overline{X}Y_i + \overline{X}\,\overline{Y}\right) = \frac{1}{n-1}\left[\sum_i X_iY_i - \overline{Y}\sum_i X_i - \overline{X}\sum_i Y_i + n\overline{X}\,\overline{Y}\right] \]
\[ = \frac{1}{n-1}\left[\sum_i X_iY_i - n\overline{Y}\,\overline{X} - n\overline{X}\,\overline{Y} + n\overline{X}\,\overline{Y}\right] = \frac{1}{n-1}\left[\sum_i X_iY_i - n\overline{Y}\,\overline{X}\right]. \]
3. If the population is infinite or samples are taken with replacement, then $E(S_{X,Y}) = \sigma_{X,Y}$.
Proof: First note that $\sigma_{X,Y} = E(X_iY_i) - \mu_X\mu_Y$ and, by independence, $E(X_iY_j) = \mu_X\mu_Y$ if $i \ne j$. Also
\[ \overline{X}\,\overline{Y} = \frac{1}{n^2}\left(\sum_i X_i\right)\left(\sum_j Y_j\right) = \frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n X_iY_j = \frac{1}{n^2}\left[\sum_i X_iY_i + \sum_{i\ne j} X_iY_j\right]. \]
Therefore,
\[ E(S_{X,Y}) = \frac{1}{n-1}E\left[\sum_i X_iY_i - \frac{1}{n}\sum_i X_iY_i - \frac{1}{n}\sum_{i\ne j} X_iY_j\right] = \frac{1}{n-1}E\left[\left(1 - \frac{1}{n}\right)\sum_i X_iY_i - \frac{1}{n}\sum_{i\ne j} X_iY_j\right] \]
\[ = \frac{1}{n-1}\left[(n-1)E(X_iY_i) - (n-1)E(X_iY_j)\right] = E(X_iY_i) - E(X_iY_j) = E(X_iY_i) - \mu_X\mu_Y = \sigma_{X,Y}. \]
4. Sample Correlation Coefficient:
\[ r_{X,Y} \stackrel{\text{def}}{=} \frac{S_{X,Y}}{\sqrt{S^2_X S^2_Y}}. \]
5. If $U_i = a + bX_i$ and $V_i = c + dY_i$ for $i = 1, \ldots, n$, then the sample covariance between $U$ and $V$ is $S_{U,V} = bd\,S_{X,Y}$.
Proof: By the definition of sample covariance,
\[ S_{U,V} = \frac{1}{n-1}\sum_{i=1}^n\left(U_i - \overline{U}\right)\left(V_i - \overline{V}\right) = \frac{1}{n-1}\sum_{i=1}^n\left(a + bX_i - a - b\overline{X}\right)\left(c + dY_i - c - d\overline{Y}\right) \]
\[ = \frac{1}{n-1}\sum_{i=1}^n\left(bX_i - b\overline{X}\right)\left(dY_i - d\overline{Y}\right) = \frac{1}{n-1}bd\sum_{i=1}^n\left(X_i - \overline{X}\right)\left(Y_i - \overline{Y}\right) = bd\,S_{X,Y}. \]
6. If $U_i = a + bX_i$ and $V_i = c + dY_i$ for $i = 1, \ldots, n$, then the sample correlation between $U$ and $V$ is $r_{U,V} = \text{sign}(bd)\,r_{X,Y}$.
Proof: By the definition of sample correlation,
\[ r_{U,V} = \frac{S_{U,V}}{\sqrt{S^2_U S^2_V}} = \frac{bd\,S_{X,Y}}{\sqrt{b^2 S^2_X\, d^2 S^2_Y}} = \frac{bd}{|bd|}\,\frac{S_{X,Y}}{\sqrt{S^2_X S^2_Y}} = \text{sign}(bd)\,r_{X,Y}. \]
Chapter 8
SAMPLES, STATISTICS, &
SAMPLING DISTRIBUTIONS
1. Definition: Parameter: a characteristic of the population.
2. Definition: Statistic: a characteristic of the sample. Specifically, a statistic is a function of the sample;
\[ T = g(X_1, X_2, \ldots, X_n) \quad\text{and}\quad t = g(x_1, x_2, \ldots, x_n). \]
The function $T$ is a random variable and the function $t$ is a realization of the random variable. For example, $T_1 = \overline{X}$ and $T_2 = S^2_X$ are statistics.
3. Definition: Sampling Distribution: a sampling distribution is the distribution of a statistic. For example, the sampling distribution of $\overline{X}$ is the distribution of $\overline{X}$.
8.1 Random Sampling
1. Some non-random samples
- Voluntary response sample: the respondent controls whether or not s/he is in the sample.
- Sample of convenience: the investigator obtains a set of units from the population by using units that are available or can be obtained inexpensively.
2. Random sampling from a finite population
- Procedure: select units from the population at random, one at a time. Sampling can be done with or without replacement.
- Properties of random sampling:
  - The distribution of the sample is exchangeable.
  - All possible samples of size $n$ are equally likely (this is the definition of a simple random sample).
  - Each unit in the population has an equal chance of being selected.
- Definition: Population Distribution: the marginal distribution of $X_i$, where $X_i$ is the value of the $i$th unit in the sample. Note, the marginal distributions of all $X_i$s are identical by exchangeability.
3. Random sample of size $n$
- In general, a random sample of size $n$ has many possible meanings (e.g., with replacement, without replacement, stratified, etc.).
- We (the text and lecture) will say random sample of size $n$ when we mean a sequence of independent and identically distributed (iid) random variables. This can occur if one randomly samples from a finite population with replacement, or randomly samples from an infinite population. Unless it is qualified, the phrase random sample of size $n$ refers to iid random variables and does not refer to sampling without replacement from a finite population.
- The joint pdf or pmf of a random sample of size $n$ is denoted by
\[ f_{\mathbf{X}}(\mathbf{x}) \stackrel{\text{def}}{=} f_{X_1,X_2,\ldots,X_n}(x_1, x_2, \ldots, x_n), \]
where $\mathbf{X}$ and $\mathbf{x}$ are vectors of random variables and realizations, respectively. That is, $\mathbf{X} = (X_1, X_2, \ldots, X_n)'$ and $\mathbf{x} = (x_1, x_2, \ldots, x_n)'$. The transpose of a column vector $\mathbf{U}$ is denoted by $\mathbf{U}'$. For example, $\mathbf{X}' = (X_1\; X_2\; \cdots\; X_n)$.
- Using independence,
\[ f_{\mathbf{X}}(\mathbf{x}) = \prod_{i=1}^n f_X(x_i). \]
4. Example: Suppose that $X_1, X_2, \ldots, X_n$ is a random sample of size $n$ from an $\text{Expon}(\lambda)$ distribution. Then the joint pdf of the sample is
\[ f_{\mathbf{X}}(\mathbf{x}) = \prod_{i=1}^n \lambda e^{-\lambda x_i} I_{(0,\infty)}(x_i) = \lambda^n \exp\!\left\{-\lambda\sum_{i=1}^n x_i\right\} I_{(0,\,x_{(n)}]}(x_{(1)})\, I_{[x_{(1)},\,\infty)}(x_{(n)}). \]
5. Example: Suppose that $X_1, X_2, \ldots, X_n$ is a random sample of size $n$ from $\text{Unif}[a, b]$. Then the joint pdf of the sample is
\[ f_{\mathbf{X}}(\mathbf{x}) = \prod_{i=1}^n (b-a)^{-1} I_{[a,b]}(x_i) = (b-a)^{-n}\, I_{[a,\,x_{(n)}]}(x_{(1)})\, I_{[x_{(1)},\,b]}(x_{(n)}). \]
6. PMF of a random sample taken without replacement from a finite population. Consider a population of size $N$ having $k \le N$ distinct values. Denote the values as $v_1, v_2, \ldots, v_k$. Suppose that the population contains $M_1$ units with value $v_1$, $M_2$ units with value $v_2$, etc. Note that $N = \sum_{j=1}^k M_j$. Select $n$ units at random without replacement from the population. Let $X_i$ be the value of the $i$th unit in the sample and denote the $n \times 1$ vector of $X$s by $\mathbf{X}$. Let $\mathbf{x}$ be a realization of $\mathbf{X}$. That is, $\mathbf{x}$ is an $n \times 1$ vector whose elements are chosen from $v_1, v_2, \ldots, v_k$. Then the pmf of the sample is
\[ f_{\mathbf{X}}(\mathbf{x}) = P(\mathbf{X} = \mathbf{x}) = \frac{\prod_{j=1}^k \binom{M_j}{y_j}}{\binom{N}{n}\binom{n}{y_1, y_2, \ldots, y_k}}, \]
where $y_j$ is the frequency of $v_j$ in $\mathbf{x}$.
Proof: Let $Y_j$ for $j = 1, \ldots, k$ be the frequency of $v_j$ in $\mathbf{X}$. Note that $\sum_{j=1}^k Y_j = n$. Denote the vector of $Y$s by $\mathbf{Y}$ and the vector of $y$s by $\mathbf{y}$. Also, denote the number of distinct $\mathbf{x}$ sequences that yield $\mathbf{y}$ by $n_{\mathbf{y}}$. Then
\[ f_{\mathbf{Y}}(\mathbf{y}) = \Pr(\mathbf{Y} = \mathbf{y}) = f_{\mathbf{X}}(\mathbf{x})\, n_{\mathbf{y}}, \]
where $f_{\mathbf{X}}(\mathbf{x})$ is the probability of any specific sequence of $x$s that contains $y_1$ units with value $v_1$, $y_2$ units with value $v_2$, etc. Multiplication of $f_{\mathbf{X}}(\mathbf{x})$ by $n_{\mathbf{y}}$ is correct because each permutation of $\mathbf{x}$ has the same probability (by exchangeability). Using counting rules from Stat 420, we can show that
\[ f_{\mathbf{Y}}(\mathbf{y}) = \frac{\prod_{j=1}^k \binom{M_j}{y_j}}{\binom{N}{n}} \quad\text{and}\quad n_{\mathbf{y}} = \binom{n}{y_1, y_2, \ldots, y_k}. \]
Accordingly, the pmf of the sample is
\[ f_{\mathbf{X}}(\mathbf{x}) = \frac{\prod_{j=1}^k \binom{M_j}{y_j}}{\binom{N}{n}\binom{n}{y_1, y_2, \ldots, y_k}}. \]
7. Example: Consider the population consisting of 12 voles, $M_j$ voles of species $j$ for $j = 1, 2, 3$. Suppose that $X_1, X_2, X_3, X_4$ is a random sample taken without replacement from the population. Furthermore, suppose that
\[ \mathbf{x} = (s_3\; s_1\; s_1\; s_2)', \]
where $s_j$ denotes species $j$. The joint pmf of the sample is
\[ f_{\mathbf{X}}(s_3, s_1, s_1, s_2) = \frac{\binom{M_1}{2}\binom{M_2}{1}\binom{M_3}{1}}{\binom{12}{4}\binom{4}{2,1,1}} = \frac{\frac{1}{2}M_1(M_1 - 1)M_2M_3}{495 \times 12} = \frac{M_1(M_1 - 1)M_2(12 - M_1 - M_2)}{11{,}880}. \]
8.2 Likelihood
1. Family of probability distributions or models: If the joint pdf or pmf of the sample depends on the value of unknown parameters, then the joint pdf or pmf is written as $f_{\mathbf{X}}(\mathbf{x}|\boldsymbol{\theta})$, where $\boldsymbol{\theta} = (\theta_1\; \theta_2\; \cdots\; \theta_k)'$ is a vector of unknown parameters. For example, if $X_1, \ldots, X_n$ is a random sample of size $n$ from $N(\mu, \sigma^2)$, where $\mu$ and $\sigma^2$ are unknown, then the joint pdf is
\[ f_{\mathbf{X}}(\mathbf{x}|\boldsymbol{\theta}) = f_{\mathbf{X}}(\mathbf{x}|\mu, \sigma^2) = \frac{\exp\!\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2\right\}}{(2\pi\sigma^2)^{n/2}}, \quad\text{where } \boldsymbol{\theta} = (\mu\; \sigma^2)'. \]
If $\boldsymbol{\theta}$ contains only one parameter, then it will be denoted as $\theta$ (i.e., no bold face).
2. Likelihood Function: The likelihood function is a measure of how likely a particular value of $\theta$ is, given that $\mathbf{x}$ has been observed. Caution: the likelihood function is not a probability. The likelihood function is denoted by $L(\theta)$ and is obtained by
- interchanging the roles of $\theta$ and $\mathbf{x}$ in the joint pdf or pmf of $\mathbf{x}$, and
- dropping all terms that do not depend on $\theta$.
That is,
\[ L(\theta) = L(\theta|\mathbf{x}) \propto f_{\mathbf{X}}(\mathbf{x}|\theta). \]
3. Example: Suppose that $X_1, X_2, \ldots, X_n$ is a random sample of size $n$ from an $\text{Expon}(\lambda)$ distribution. Then the likelihood function is
\[ L(\lambda) = \lambda^n \exp\!\left\{-\lambda\sum_{i=1}^n x_i\right\}, \]
provided that all $x$s are in $(0, \infty)$. Note that the likelihood function and the joint pdf are identical in this example. Suppose that $n = 10$ and that
\[ \mathbf{x} = (0.4393\; 0.5937\; 0.0671\; 2.0995\; 0.1320\; 0.0148\; 0.0050\; 0.1186\; 0.4120\; 0.3483)' \]
has been observed. The sample mean is $\bar{x} = 4.2303/10 = 0.42303$. The likelihood function is plotted below. Ratios are used to compare likelihoods. For example, the likelihood that $\lambda = 2.5$ is 1.34 times as large as the likelihood that $\lambda = 3$;
\[ \frac{L(2.5)}{L(3)} = 1.3390. \]
Note: the $x$ values actually were sampled from $\text{Expon}(2)$.
[Figure: likelihood function for a sample from $\text{Expon}(\lambda)$; $n = 10$.]
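The likelihood ratio quoted above can be reproduced directly from the data. The sketch below (not part of the original notes) evaluates $L(\lambda) = \lambda^n\exp\{-\lambda\sum x_i\}$ for the sample given in the example.

```python
import numpy as np

x = np.array([0.4393, 0.5937, 0.0671, 2.0995, 0.1320,
              0.0148, 0.0050, 0.1186, 0.4120, 0.3483])   # data from the example
n, s = x.size, x.sum()

def L(lam):
    return lam**n * np.exp(-lam * s)      # L(lambda) = lambda^n exp(-lambda * sum x_i)

print(x.mean())          # 0.42303
print(L(2.5) / L(3.0))   # approximately 1.339, as stated above
```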
4. Example: Suppose that $X_1, X_2, \ldots, X_n$ is a random sample of size $n$ from $\text{Unif}[\pi, b]$. Then the likelihood function is
\[ L(b) = (b - \pi)^{-n} I_{[x_{(n)},\,\infty)}(b), \]
provided that $x_{(1)} > \pi$. Suppose that $n = 10$ and that
\[ \mathbf{x} = (5.9841\; 4.9298\; 3.7507\; 5.1264\; 3.8780\; 4.8656\; 6.0682\; 4.1946\; 5.2010\; 4.3728)' \]
has been observed. For this sample, $x_{(1)} = 3.7507$ and $x_{(n)} = 6.0682$. The likelihood function is plotted below. Note, the $x$ values actually were sampled from $\text{Unif}(\pi, 2\pi)$.
[Figure: likelihood function for a sample from $\text{Unif}(\pi, b)$; $n = 10$.]
5. Example: Consider the population consisting of 12 voles, $M_j$ voles of species $j$ for $j = 1, 2, 3$. Suppose that $X_1, X_2, X_3, X_4$ is a random sample taken without replacement from the population. Furthermore, suppose that
\[ \mathbf{x} = (s_3\; s_1\; s_1\; s_2)', \]
where $s_j$ denotes species $j$. The likelihood function is
\[ L(M_1, M_2) = M_1(M_1 - 1)M_2(12 - M_1 - M_2). \]
Note, there are only two parameters, not three, because $M_1 + M_2 + M_3 = 12$. The likelihood function is displayed in the table below. Note: the $x$ values actually were sampled from a population in which $M_1 = 5$, $M_2 = 3$, and $M_3 = 4$.

              Value of M_2
 M_1     0    1    2    3    4    5    6    7    8    9   10
  0      0    0    0    0    0    0    0    0    0    0    0
  1      0    0    0    0    0    0    0    0    0    0    0
  2      0   18   32   42   48   50   48   42   32   18    0
  3      0   48   84  108  120  120  108   84   48    0    0
  4      0   84  144  180  192  180  144   84    0    0    0
  5      0  120  200  240  240  200  120    0    0    0    0
  6      0  150  240  270  240  150    0    0    0    0    0
  7      0  168  252  252  168    0    0    0    0    0    0
  8      0  168  224  168    0    0    0    0    0    0    0
  9      0  144  144    0    0    0    0    0    0    0    0
 10      0   90    0    0    0    0    0    0    0    0    0
 11      0    0    0    0    0    0    0    0    0    0    0
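A short sketch (not part of the original notes) that reproduces the table above: it evaluates $L(M_1, M_2) = M_1(M_1-1)M_2(12 - M_1 - M_2)$, setting the value to zero whenever the observed sample (two of species 1, one each of species 2 and 3) would be impossible.

```python
def L(M1, M2):
    M3 = 12 - M1 - M2
    if M1 < 2 or M2 < 1 or M3 < 1:    # sample needs 2 of species 1, 1 of species 2, 1 of species 3
        return 0
    return M1 * (M1 - 1) * M2 * M3

for M1 in range(12):
    print(M1, [L(M1, M2) for M2 in range(11)])
```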
6. Likelihood Principle:
"All the information which the data provide concerning the relative merits of two hypotheses is contained in the likelihood ratio of those hypotheses on the data" (Edwards, 1992).
Another way of stating the likelihood principle is that if two experiments, each based on a model for $\theta$, give the same likelihood, then the inference about $\theta$ should be the same in the two experiments.
7. Example
(a) Experiment 1: Toss a 35 cent coin $n$ independent times. Let $\theta$ be the probability of a head and let $X$ be the number of heads observed. Then $X$ has a binomial pmf:
\[ f_X(x|\theta) = \binom{n}{x}\theta^x(1-\theta)^{n-x} I_{\{0,1,\ldots,n\}}(x), \]
where $n = 20$. Suppose that $x = 6$ heads are observed. Then the likelihood function is
\[ L(\theta|x = 6) = \theta^6(1-\theta)^{14}. \]
(b) Experiment 2: The 35 cent coin was tossed on independent trials until $r = 6$ heads were observed. Let $Y$ be the number of tosses required to obtain 6 heads. Then $Y$ has a negative binomial pmf:
\[ f_Y(y|\theta, r) = \binom{y-1}{r-1}\theta^r(1-\theta)^{y-r} I_{\{r,r+1,\ldots\}}(y), \]
where $r = 6$. Suppose that the 6th head occurred on the 20th trial. Then, the likelihood function is
\[ L(\theta|y = 20) = \theta^6(1-\theta)^{14}. \]
The likelihood principle requires that any inference about $\theta$ be the same from the two experiments.
(c) Suppose that we would like to test $H_0\!: \theta = 0.5$ against $H_a\!: \theta < 0.5$. Based on the above two experiments, the p-values are
\[ P(X \le 6\,|\,n = 20, \theta = 0.5) = \sum_{x=0}^{6}\binom{20}{x}(1/2)^x(1 - 1/2)^{20-x} = 0.0577 \]
in the binomial experiment and
\[ P(Y \ge 20\,|\,r = 6, \theta = 0.5) = \sum_{y=20}^{\infty}\binom{y-1}{6-1}(1/2)^6(1 - 1/2)^{y-6} = 0.0318 \]
in the negative binomial experiment. If we fail to reject $H_0$ in the first experiment, but reject $H_0$ in the second experiment, then we have violated the likelihood principle.
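The two p-values above are easy to reproduce. The sketch below (not part of the original notes) uses the fact that the 6th head occurs on toss 20 or later exactly when at most 5 heads appear in the first 19 tosses.

```python
from scipy import stats

p_binom = stats.binom.cdf(6, 20, 0.5)          # P(X <= 6 | n = 20, theta = 0.5)
p_negbin = stats.binom.cdf(5, 19, 0.5)         # P(Y >= 20 | r = 6, theta = 0.5)
print(round(p_binom, 4), round(p_negbin, 4))   # 0.0577 and 0.0318
```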
8.3 Sufficient Statistics
1. Definition from the textbook: A statistic, $T = t(\mathbf{X})$, is sufficient for a family of distributions, $f_{\mathbf{X}}(\mathbf{x}|\theta)$, if and only if the likelihood function depends on $\mathbf{X}$ only through $T$:
\[ L(\theta) = h[t(\mathbf{X}), \theta]. \]
2. Usual definition: A statistic, $T = t(\mathbf{X})$, is sufficient for a family of distributions, $f_{\mathbf{X}}(\mathbf{x}|\theta)$, if and only if the conditional distribution of $\mathbf{X}$ given $T$ does not depend on $\theta$:
\[ f_{\mathbf{X}|T}(\mathbf{x}|t, \theta) = h(\mathbf{x}). \]
This definition says that after observing $T$, no additional functions of the data provide information about $\theta$. It can be shown that the two definitions are equivalent.
3. Sample Space and Partitions: The sample space is the set of all possible values of $\mathbf{X}$. It is the same as the support for the joint pdf (or pmf) of $\mathbf{X}$. A statistic partitions the sample space. Each partition corresponds to a different value of the statistic. A specific partition contains all possible values of $\mathbf{x}$ that yield the specific value of the statistic that indexes the partition. If the statistic is sufficient, then the only characteristic of the data that we need to examine is which partition the sample belongs to.
4. Non-uniqueness of the sufficient statistic: If $T$ is a sufficient statistic, then any one-to-one transformation of $T$ also is sufficient. Note that any such transformation of $T$ induces the same partitioning of the sample space. Accordingly, the sufficient statistic is not unique, but the partitioning that corresponds to $T$ is unique.
5. Factorization Criterion (Neyman): A statistic, $T = t(\mathbf{X})$, is sufficient if and only if the joint pdf (pmf) factors as
\[ f_{\mathbf{X}}(\mathbf{x}|\theta) = g[t(\mathbf{x})|\theta]\,h(\mathbf{x}). \]
In some cases, $h(\mathbf{x})$ is a trivial function of $\mathbf{x}$. For example, $h(\mathbf{x}) = c$, where $c$ is a constant not depending on $\mathbf{x}$.
6. Example: Bernoulli trials. Let $X_i$ for $i = 1, \ldots, n$ be iid $\text{Bern}(p)$ random variables. Note, $\theta = p$. The joint pmf is
\[ f_{\mathbf{X}}(\mathbf{x}|p) = \prod_{i=1}^n p^{x_i}(1-p)^{1-x_i} I_{\{0,1\}}(x_i) = p^y(1-p)^{n-y}\prod_{i=1}^n I_{\{0,1\}}(x_i), \]
where $y = \sum_{i=1}^n x_i$. Accordingly, $Y = \sum_{i=1}^n X_i$ is sufficient.
For this example, it is not too hard to verify that the conditional distribution of $\mathbf{X}$ given $Y$ does not depend on $p$. The conditional distribution of $\mathbf{X}$ given $Y = y$ is
\[ P(\mathbf{X} = \mathbf{x}|Y = y) = \frac{P(\mathbf{X} = \mathbf{x})\,I_{\{y\}}\!\left(\sum x_i\right)}{P(Y = y)} = \frac{\prod_{i=1}^n p^{x_i}(1-p)^{1-x_i} I_{\{0,1\}}(x_i)\,I_{\{y\}}\!\left(\sum x_i\right)}{\binom{n}{y}p^y(1-p)^{n-y} I_{\{0,1,2,\ldots,n\}}(y)} = \frac{\prod_{i=1}^n I_{\{0,1\}}(x_i)\,I_{\{y\}}\!\left(\sum x_i\right)}{\binom{n}{y} I_{\{0,1,2,\ldots,n\}}(y)}, \]
which does not depend on $p$. That is, the conditional distribution of $\mathbf{X}$ given a sufficient statistic does not depend on $\theta$.
7. Example: Sampling from $\text{Poi}(\lambda)$. Let $X_1, \ldots, X_n$ be a random sample of size $n$ from $\text{Poi}(\lambda)$. The joint pmf is
\[ f_{\mathbf{X}}(\mathbf{x}|\lambda) = \frac{e^{-n\lambda}\lambda^t}{\prod_{i=1}^n x_i!}\prod_{i=1}^n I_{\{0,1,\ldots\}}(x_i), \quad\text{where } t = \sum_{i=1}^n x_i. \]
Accordingly, the likelihood function is
\[ L(\lambda) = e^{-n\lambda}\lambda^t \]
and $T = \sum_{i=1}^n X_i$ is sufficient. Recall that $T \sim \text{Poi}(n\lambda)$. Therefore, the distribution of $\mathbf{X}$ conditional on $T = t$ is
\[ P(\mathbf{X} = \mathbf{x}|T = t) = \frac{P(\mathbf{X} = \mathbf{x}, T = t)}{P(T = t)} = \frac{e^{-n\lambda}\lambda^t I_{\{t\}}\!\left(\sum x_i\right)\prod_i I_{\{0,1,\ldots\}}(x_i)}{\prod_i x_i!}\cdot\frac{t!}{e^{-n\lambda}(n\lambda)^t} = \binom{t}{x_1, x_2, \ldots, x_n}\left(\frac{1}{n}\right)^{x_1}\left(\frac{1}{n}\right)^{x_2}\cdots\left(\frac{1}{n}\right)^{x_n} \]
\[ \Longrightarrow (\mathbf{X}|T = t) \sim \text{multinom}\!\left(t, \frac{1}{n}, \frac{1}{n}, \ldots, \frac{1}{n}\right). \]
Note that the distribution of $\mathbf{X}$, conditional on the sufficient statistic, does not depend on $\lambda$.
8. Example: Suppose that $X_i \sim$ iid $N(\mu, 1)$, for $i = 1, \ldots, n$. The joint pdf is
\[ f_{\mathbf{X}}(\mathbf{x}|\mu) = \frac{\exp\!\left\{-\frac{1}{2}\sum_{i=1}^n(x_i - \mu)^2\right\}}{(2\pi)^{\frac{n}{2}}} = \frac{\exp\!\left\{-\frac{1}{2}\sum_{i=1}^n(x_i - \bar{x} + \bar{x} - \mu)^2\right\}}{(2\pi)^{\frac{n}{2}}} = \frac{\exp\!\left\{-\frac{1}{2}\sum_{i=1}^n\left[(x_i - \bar{x})^2 + 2(x_i - \bar{x})(\bar{x} - \mu) + (\bar{x} - \mu)^2\right]\right\}}{(2\pi)^{\frac{n}{2}}} \]
\[ = \frac{\exp\!\left\{-\frac{1}{2}\left[\sum_{i=1}^n(x_i - \bar{x})^2 + n(\bar{x} - \mu)^2\right]\right\}}{(2\pi)^{\frac{n}{2}}} \quad\text{because } \sum_{i=1}^n(x_i - \bar{x}) = 0 \]
\[ = \exp\!\left\{-\frac{n}{2}(\bar{x} - \mu)^2\right\}\frac{\exp\!\left\{-\frac{1}{2}\sum_{i=1}^n(x_i - \bar{x})^2\right\}}{(2\pi)^{\frac{n}{2}}}. \]
Accordingly, the likelihood function is
\[ L(\mu) = \exp\!\left\{-\frac{n}{2}(\bar{x} - \mu)^2\right\}, \]
and $\overline{X}$ is sufficient for the family of distributions. This means that $\overline{X}$ contains all of the information about $\mu$ that is contained in the data. That is, if we want to use the sample to learn about $\mu$, we should examine $\overline{X}$ and we need not examine any other function of the data.
9. Order Statistics are sufficient: If $X_1, \ldots, X_n$ is a random sample (with or without replacement), then the order statistics are sufficient.
Proof: By exchangeability,
\[ f_{\mathbf{X}}(\mathbf{X}|\theta) = f_{\mathbf{X}}(X_{(1)}, X_{(2)}, \ldots, X_{(n)}|\theta). \]
The likelihood function is proportional to the joint pdf or pmf. Therefore, the likelihood function is a function of the order statistics and, by definition 1, the order statistics are sufficient.
If the random sample is taken from a continuous distribution, then it can be shown that
\[ P(\mathbf{X} = \mathbf{x}|x_{(1)}, \ldots, x_{(n)}) = \frac{1}{n!} \]
and this distribution does not depend on $\theta$. Therefore, by definition 2 the order statistics are sufficient.
10. The One Parameter Exponential Family: The random variable $X$ is said to have a distribution within the one parameter regular exponential family if
\[ f_X(x|\theta) = B(\theta)h(x)\exp\{Q(\theta)R(x)\}, \]
where $Q(\theta)$ is a nontrivial continuous function of $\theta$, and $R(x)$ is a nontrivial function of $x$. Note that if the support of $X$ is represented as an indicator variable, then the indicator variable is part of $h(x)$. That is, the support cannot depend on $\theta$. Either or both of the functions $B(\theta)$ and $h(x)$ could be trivial.
A random sample of size $n$ from an exponential family has pdf (or pmf)
\[ f_{\mathbf{X}}(\mathbf{x}|\theta) = B(\theta)^n\exp\!\left\{Q(\theta)\sum_{i=1}^n R(x_i)\right\}\prod_{i=1}^n h(x_i). \]
By the factorization criterion, $T = \sum_{i=1}^n R(X_i)$ is sufficient for $\theta$.
11. Examples of one parameter exponential families and the corresponding sufficient statistic.
- Consider a random sample of size $n$ from $N(\mu, \sigma^2)$, where $\sigma^2$ is known. Then $T = \sum_{i=1}^n X_i$ is sufficient.
- Consider a random sample of size $n$ from $N(\mu, \sigma^2)$, where $\mu$ is known. Then $T = \sum_{i=1}^n (X_i - \mu)^2$ is sufficient.
- Consider a random sample of size $n$ from $\text{Bern}(p)$. Then $T = \sum_{i=1}^n X_i$ is sufficient.
- Consider a random sample of size $k$ from $\text{Bin}(n, p)$. Then $T = \sum_{i=1}^k Y_i$ is sufficient.
- Consider a random sample of size $n$ from $\text{Geom}(p)$. Then $T = \sum_{i=1}^n X_i$ is sufficient.
- Consider a random sample of size $n$ from $\text{NegBin}(r, p)$, where $r$ is known. Then $T = \sum_{i=1}^n X_i$ is sufficient.
- Consider a random sample of size $n$ from $\text{Poi}(\lambda)$. Then $T = \sum_{i=1}^n X_i$ is sufficient.
- Consider a random sample of size $n$ from $\text{Expon}(\lambda)$. Then $T = \sum_{i=1}^n X_i$ is sufficient.
- Consider a random sample of size $n$ from $\text{Gam}(\alpha, \lambda)$, where $\alpha$ is known. Then $T = \sum_{i=1}^n X_i$ is sufficient.
- Consider a random sample of size $n$ from $\text{Gam}(\alpha, \lambda)$, where $\lambda$ is known. Then $T = \sum_{i=1}^n \ln(X_i)$ is sufficient.
- Consider a random sample of size $n$ from $\text{Beta}(\alpha_1, \alpha_2)$, where $\alpha_1$ is known. Then $T = \sum_{i=1}^n \ln(1 - X_i)$ is sufficient.
- Consider a random sample of size $n$ from $\text{Beta}(\alpha_1, \alpha_2)$, where $\alpha_2$ is known. Then $T = \sum_{i=1}^n \ln(X_i)$ is sufficient.
12. Examples of distributions that do not belong to the exponential family.
- Consider a random sample of size $n$ from $\text{Unif}(a, b)$, where $a$ is known. Then $T = X_{(n)}$ is sufficient by the factorization criterion.
- Consider a random sample of size $n$ from $\text{Unif}(a, b)$, where $b$ is known. Then $T = X_{(1)}$ is sufficient by the factorization criterion.
- Consider a random sample of size $n$ from $\text{Unif}(a, b)$, where neither $a$ nor $b$ is known. Then $T = (X_{(1)}\; X_{(n)})'$ is sufficient by the factorization criterion.
- Consider a random sample of size $n$ from $\text{Unif}(\theta, \theta + 1)$. Then $T = (X_{(1)}\; X_{(n)})'$ is sufficient by the factorization criterion.
13. Example: consider a random sample of size $n$ from $N(\mu, \sigma^2)$, where neither parameter is known. Write $(X_i - \mu)^2$ as
\[ (X_i - \mu)^2 = [(X_i - \overline{X}) + (\overline{X} - \mu)]^2 = (X_i - \overline{X})^2 + 2(X_i - \overline{X})(\overline{X} - \mu) + (\overline{X} - \mu)^2. \]
The likelihood function can be written as
\[ L(\mu, \sigma^2|\mathbf{X}) = \frac{\exp\!\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^n(X_i - \mu)^2\right\}}{(2\pi\sigma^2)^{\frac{n}{2}}} = \frac{\exp\!\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^n\left[(X_i - \overline{X})^2 + 2(X_i - \overline{X})(\overline{X} - \mu) + (\overline{X} - \mu)^2\right]\right\}}{(2\pi\sigma^2)^{\frac{n}{2}}} \]
\[ = \frac{\exp\!\left\{-\frac{1}{2\sigma^2}\left[\sum_{i=1}^n(X_i - \overline{X})^2 + n(\overline{X} - \mu)^2\right]\right\}}{(2\pi\sigma^2)^{\frac{n}{2}}}. \]
By the factorization criterion,
\[ T = (S^2_X\;\; \overline{X})' \]
is sufficient.
8.4 Sampling Distributions
Recall that a statistic is a random variable. The distribution of a statistic is called a
sampling distribution. This section describes some sampling distributions that can
be obtained analytically.
1. Sampling without replacement from a finite population. Consider a finite population consisting of $N$ units, where each unit has one of just $k$ values, $v_1, \ldots, v_k$. Of the $N$ units, $M_j$ have value $v_j$ for $j = 1, \ldots, k$. Note that $\sum_{j=1}^k M_j = N$. Take a sample of size $n$, one at a time at random and without replacement. Let $X_i$ be the value of the $i$th unit in the sample. Also, let $Y_j$ be the number of $X$s in the sample that have value $v_j$. If $\boldsymbol{\theta} = (M_1\; M_2\; \cdots\; M_k)'$ is the vector of unknown parameters, then the joint pmf of $X_1, \ldots, X_n$ is
\[ f_{\mathbf{X}}(\mathbf{x}|\boldsymbol{\theta}) = \frac{\prod_{j=1}^k\binom{M_j}{y_j}}{\binom{N}{n}\binom{n}{y_1, y_2, \ldots, y_k}}\, I_{\{n\}}\!\left(\sum_{j=1}^k y_j\right)\prod_{i=1}^n I_{\{v_1,\ldots,v_k\}}(x_i)\prod_{j=1}^k I_{\{0,1,\ldots,M_j\}}(y_j). \]
By the factorization theorem, $\mathbf{Y} = (Y_1\; Y_2\; \cdots\; Y_k)'$ is a sufficient statistic. The sampling distribution of $\mathbf{Y}$ is
\[ f_{\mathbf{Y}}(\mathbf{y}|\boldsymbol{\theta}) = \frac{\prod_{j=1}^k\binom{M_j}{y_j}}{\binom{N}{n}}\, I_{\{n\}}\!\left(\sum_{j=1}^k y_j\right)\prod_{j=1}^k I_{\{0,1,\ldots,M_j\}}(y_j). \]
Note that $\mathbf{T} = (Y_1\; Y_2\; \cdots\; Y_{k-1})'$ also is sufficient because $Y_k = n - \sum_{j=1}^{k-1} Y_j$ and therefore $\mathbf{Y}$ is a one-to-one function of $\mathbf{T}$. If $k = 2$, then the sampling distribution simplifies to the hypergeometric distribution.
2. Sampling with replacement from a finite population that has $k$ distinct values or sampling without replacement from an infinite population that has $k$ distinct values. Consider a population for which the proportion of units having value $v_j$ is $p_j$, for $j = 1, \ldots, k$. Note that $\sum_{j=1}^k p_j = 1$. Take a sample of size $n$, one at a time at random and with replacement if the population is finite. Let $X_i$ be the value of the $i$th unit in the sample. Also, let $Y_j$ be the number of $X$s in the sample that have value $v_j$. Let $\boldsymbol{\theta} = (p_1\; p_2\; \cdots\; p_k)'$ be the vector of unknown parameters. The $X$s are iid and the joint pmf of the sample is
\[ f_{\mathbf{X}}(\mathbf{x}|\boldsymbol{\theta}) = \prod_{i=1}^n\prod_{j=1}^k p_j^{I_{\{v_j\}}(x_i)} I_{\{v_1,\ldots,v_k\}}(x_i) = \prod_{j=1}^k p_j^{y_j}\, I_{\{n\}}\!\left(\sum_{j=1}^k y_j\right)\prod_{j=1}^k I_{\{0,1,\ldots,n\}}(y_j). \]
Accordingly, $\mathbf{Y} = (Y_1\; Y_2\; \cdots\; Y_k)'$ is a sufficient statistic. The sampling distribution of $\mathbf{Y}$ is multinomial:
\[ f_{\mathbf{Y}}(\mathbf{y}|\boldsymbol{\theta}) = \binom{n}{y_1, y_2, \ldots, y_k}\prod_{j=1}^k p_j^{y_j}\, I_{\{n\}}\!\left(\sum_{j=1}^k y_j\right)\prod_{j=1}^k I_{\{0,1,\ldots,n\}}(y_j). \]
If $k = 2$, then the sampling distribution simplifies to the binomial distribution.
3. Sampling from a Poisson distribution. Suppose that litter size in coyotes follows a Poisson distribution with parameter $\lambda$. Let $X_1, \ldots, X_n$ be a random sample of litter sizes from $n$ dens. The $X$s are iid and the joint pmf of the sample is
\[ P(\mathbf{X} = \mathbf{x}) = \prod_{i=1}^n\frac{e^{-\lambda}\lambda^{x_i}}{x_i!} I_{\{0,1,\ldots\}}(x_i) = \frac{e^{-n\lambda}\lambda^y}{\prod_{i=1}^n x_i!}\prod_{i=1}^n I_{\{0,1,\ldots\}}(x_i), \]
where $y = \sum x_i$. Accordingly, $Y = \sum X_i$ is sufficient. The sampling distribution of $Y$ is Poisson:
\[ f_Y(y|\lambda) = \frac{e^{-n\lambda}(n\lambda)^y}{y!} I_{\{0,1,\ldots\}}(y). \]
4. Minimum of exponential random variables. Let $T_i \sim$ iid $\text{Expon}(\lambda)$ for $i = 1, \ldots, n$ and let $T_{(1)}$ be the smallest order statistic. Then the sampling distribution of $T_{(1)}$ is $T_{(1)} \sim \text{Expon}(n\lambda)$. See problem 6-31.
5. Maximum of exponential random variables. As in problem 6-31, let $t_i$ be the failure time for the $i$th bus. Suppose that $T_i \sim$ iid $\text{Expon}(\lambda)$ for $i = 1, \ldots, n$ and let $T_{(n)}$ be the largest order statistic. The cdf of $T_{(n)}$ is
\[ P(T_{(n)} \le t) = F_{T_{(n)}}(t) = P(\text{all buses fail before time } t) = \prod_{i=1}^n P(T_i < t) \quad\text{because the failure times are independent} \]
\[ = \prod_{i=1}^n\left(1 - e^{-\lambda t}\right) = \left(1 - e^{-\lambda t}\right)^n I_{(0,\infty)}(t). \]
The pdf of $T_{(n)}$ can be found by differentiation:
\[ f_{T_{(n)}}(t) = \frac{d}{dt}F_{T_{(n)}}(t) = \left(1 - e^{-\lambda t}\right)^{n-1} n\lambda e^{-\lambda t} I_{(0,\infty)}(t). \]
6. Maximum of uniform random variables. Suppose that $X_i \sim$ iid $\text{Unif}(0, \theta)$. The $X$s are iid and the joint pdf is
\[ f_{\mathbf{X}}(\mathbf{x}|\theta) = \prod_{i=1}^n\frac{1}{\theta} I_{(0,\theta)}(x_i) = \frac{1}{\theta^n} I_{(0,\theta)}(x_{(n)})\prod_{i=1}^n I_{(0,\,x_{(n)})}(x_i). \]
Accordingly, $X_{(n)}$ is sufficient. The cdf of $X_{(n)}$ is
\[ P(X_{(n)} \le x) = F_{X_{(n)}}(x) = P(\text{all } X\text{s} \le x) = \prod_{i=1}^n P(X_i < x) \quad\text{because the } X\text{s are independent} \]
\[ = \left(\frac{x}{\theta}\right)^n I_{(0,\theta)}(x). \]
The pdf of $X_{(n)}$ can be found by differentiation:
\[ f_{X_{(n)}}(x) = \frac{d}{dx}F_{X_{(n)}}(x) = \frac{nx^{n-1}}{\theta^n} I_{(0,\theta)}(x). \]
8.5 Simulating Sampling Distributions
1. How to simulate a sampling distribution
(a) Choose a population distribution of interest: example, Cauchy with $\theta = 100$ and $\beta = 10$.
(b) Choose a statistic or statistics of interest: example, sample median and sample mean.
(c) Choose a sample size: example, $n = 25$.
(d) Generate a random sample of size $n$ from the specified distribution. The inverse cdf method is very useful here. For the Cauchy($\theta, \beta^2$) distribution, the cdf is
\[ F_X(x|\theta, \beta) = \frac{\arctan\!\left(\frac{x - \theta}{\beta}\right)}{\pi} + \frac{1}{2}. \]
Accordingly, if $U \sim \text{Unif}(0,1)$, then
\[ X = \tan\!\left[\pi\left(U - \tfrac{1}{2}\right)\right]\beta + \theta \sim \text{Cauchy}(\theta, \beta^2). \]
(e) Compute the statistic or statistics.
(f) Repeat the previous two steps a large number of times.
(g) Plot, tabulate, or summarize the resulting distribution of the statistic. (A short code sketch implementing these steps appears at the end of this section.)
2. Example: Sampling distribution of the mean; $n = 25$, from Cauchy with $\theta = 100$ and $\beta = 10$;
(a) Number of samples generated: 50,000
(b) Mean of the statistic: 85.44
(c) Standard deviation of the statistic: 4,647.55
(d) Plot of the statistic. [Figure: histogram of the sampling distribution of the sample mean from the Cauchy distribution; $n = 25$, $\theta = 100$, $\beta = 10$.]
(e) Most of the distribution is centered near $\theta$, but the tails are very fat. It can be shown that the sample mean also has a Cauchy distribution with $\theta = 100$ and $\beta = 10$.
3. Example: Sampling distribution of the median; $n = 25$, from Cauchy with $\theta = 100$ and $\beta = 10$;
(a) Number of samples generated: 50,000
(b) Mean of the statistic: 100.01
(c) Standard deviation of the statistic: 3.35
(d) Plot of the statistic. [Figure: histogram of the sampling distribution of the sample median from the Cauchy distribution; $n = 25$, $\theta = 100$, $\beta = 10$.]
(e) Let $M_n$ be the sample median from a sample of size $n$ from the Cauchy distribution with parameters $\theta$ and $\beta$. It can be shown that as $n$ goes to infinity, the distribution of the statistic
\[ Z_n = \frac{\sqrt{n}(M_n - \theta)}{\frac{1}{2}\pi\beta} \]
converges to $N(0,1)$. That is, for large $n$,
\[ M_n \approx N\!\left(\theta, \frac{\pi^2\beta^2}{4n}\right). \]
Note, for $n = 25$ and $\beta = 10$, $\text{Var}(M) \approx \pi^2$.
4. To generate normal random variables, the Box-Muller method can be used; see page 44 of these notes.
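Here is a minimal sketch (not part of the original notes) of the simulation recipe in item 1, using the inverse-cdf draw for the Cauchy population and summarizing the sampling distributions of the mean and median for $n = 25$.

```python
import numpy as np

rng = np.random.default_rng(8)
theta, beta, n, reps = 100.0, 10.0, 25, 50_000

u = rng.uniform(size=(reps, n))
x = theta + beta * np.tan(np.pi * (u - 0.5))   # X = theta + beta*tan(pi(U - 1/2))

means = x.mean(axis=1)
medians = np.median(x, axis=1)
print(means.std())                      # huge: the sample mean is still Cauchy
print(medians.mean(), medians.std())    # near 100, with sd roughly pi*beta/(2*sqrt(n))
```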
8.6 Order Statistics
This section examines the distribution of order statistics from continuous
distributions.
1. Marginal Distributions of Order Statistics
(a) Suppose that $X_i$, $i = 1, \ldots, n$ is a random sample of size $n$ from a population with pdf $f_X(x)$ and cdf $F_X(x)$. Consider $X_{(k)}$, the $k$th order statistic. To find the pdf, $f_{X_{(k)}}(x)$, first partition the real line into three pieces:
\[ I_1 = (-\infty, x], \quad I_2 = (x, x + dx], \quad\text{and}\quad I_3 = (x + dx, \infty). \]
The pdf $f_{X_{(k)}}(x)$ is (approximately) the probability of observing $k - 1$ $X$s in $I_1$, exactly one $X$ in $I_2$, and the remaining $n - k$ $X$s in $I_3$. This probability is
\[ f_{X_{(k)}}(x)\,dx \approx \binom{n}{k-1,\,1,\,n-k}[F_X(x)]^{k-1}[f_X(x)\,dx]^1[1 - F_X(x)]^{n-k}. \]
Accordingly (by the differential method), the pdf of $X_{(k)}$ is
\[ f_{X_{(k)}}(x) = \binom{n}{k-1,\,1,\,n-k}[F_X(x)]^{k-1}[1 - F_X(x)]^{n-k}f_X(x). \]
(b) Example: smallest order statistic:
\[ f_{X_{(1)}}(x) = \binom{n}{0,\,1,\,n-1}[F_X(x)]^0[1 - F_X(x)]^{n-1}f_X(x) = n[1 - F_X(x)]^{n-1}f_X(x). \]
(c) Example: largest order statistic:
\[ f_{X_{(n)}}(x) = \binom{n}{n-1,\,1,\,0}[F_X(x)]^{n-1}[1 - F_X(x)]^0 f_X(x) = n[F_X(x)]^{n-1}f_X(x). \]
(d) Example: Unif(0, 1) distribution. The cdf is $F_X(x) = x$ and the pdf of the $k$th order statistic is
\[ f_{X_{(k)}}(x) = \binom{n}{k-1,\,1,\,n-k}x^{k-1}(1-x)^{n-k} I_{(0,1)}(x) = \frac{x^{k-1}(1-x)^{n-k}}{B(k, n-k+1)} I_{(0,1)}(x), \]
where $B$ is the beta function. That is, $X_{(k)} \sim \text{Beta}(k, n - k + 1)$.
(e) Example: Find the exact pdf of the median from an odd size sample. In this case, $k = (n+1)/2$ and the pdf is
\[ f_{X_{((n+1)/2)}}(x) = \binom{n}{\frac{n-1}{2},\,1,\,\frac{n-1}{2}}[F_X(x)]^{(n-1)/2}[1 - F_X(x)]^{(n-1)/2}f_X(x) = \frac{[F_X(x)]^{(n-1)/2}[1 - F_X(x)]^{(n-1)/2}f_X(x)}{B\!\left(\frac{n+1}{2}, \frac{n+1}{2}\right)}. \]
For example, if $X$ has a Cauchy distribution with parameters $\theta$ and $\beta$, then the cdf is
\[ F_X(x) = \frac{\arctan\!\left(\frac{x - \theta}{\beta}\right)}{\pi} + \frac{1}{2} \]
and the pdf of the median, $M = X_{((n+1)/2)}$, is
\[ f_M(m) = \frac{\left[\frac{\arctan\!\left(\frac{m-\theta}{\beta}\right)}{\pi} + \frac{1}{2}\right]^{(n-1)/2}\left[\frac{1}{2} - \frac{\arctan\!\left(\frac{m-\theta}{\beta}\right)}{\pi}\right]^{(n-1)/2}}{B\!\left(\frac{n+1}{2}, \frac{n+1}{2}\right)\pi\beta\left[1 + \left(\frac{m-\theta}{\beta}\right)^2\right]}. \]
2. Joint Distributions of Order Statistics
(a) Suppose that $X_i$, $i = 1, \ldots, n$ is a random sample of size $n$ from a population with pdf $f_X(x)$ and cdf $F_X(x)$. Consider $(X_{(k)}, X_{(m)})$, the $k$th and $m$th order statistics, where $k < m$. To find the joint pdf $f_{X_{(k)},X_{(m)}}(v, w)$, first partition the real line into five pieces:
\[ I_1 = (-\infty, v], \; I_2 = (v, v + dv], \; I_3 = (v + dv, w], \; I_4 = (w, w + dw], \;\text{and}\; I_5 = (w + dw, \infty). \]
The joint pdf $f_{X_{(k)},X_{(m)}}(v, w)$ is (approximately) the probability of observing $k - 1$ $X$s in $I_1$, exactly one $X$ in $I_2$, $m - k - 1$ $X$s in $I_3$, exactly one $X$ in $I_4$, and the remaining $n - m$ $X$s in $I_5$. This probability is
\[ f_{X_{(k)},X_{(m)}}(v, w)\,dv\,dw \approx \binom{n}{k-1,\,1,\,m-k-1,\,1,\,n-m}[F_X(v)]^{k-1}[f_X(v)\,dv]^1[F_X(w) - F_X(v)]^{m-k-1}[f_X(w)\,dw]^1[1 - F_X(w)]^{n-m}, \]
where $v < w$. Accordingly (by the differential method), the joint pdf of $X_{(k)}$ and $X_{(m)}$ is
\[ f_{X_{(k)},X_{(m)}}(v, w) = \frac{n!}{(k-1)!(m-k-1)!(n-m)!}[F_X(v)]^{k-1}[F_X(w) - F_X(v)]^{m-k-1}[1 - F_X(w)]^{n-m}f_X(v)f_X(w)I_{(v,\infty)}(w). \]
(b) Example: joint distribution of the smallest and largest order statistics. Let $k = 1$ and $m = n$ to obtain
\[ f_{X_{(1)},X_{(n)}}(v, w) = n(n-1)[F_X(w) - F_X(v)]^{n-2}f_X(v)f_X(w)I_{(v,\infty)}(w). \]
(c) Example: joint distribution of the smallest and largest order statistics from Unif(0, 1). The cdf is $F_X(x) = x$ and the joint distribution of $X_{(1)}$ and $X_{(n)}$ is
\[ f_{X_{(1)},X_{(n)}}(v, w) = n(n-1)(w - v)^{n-2} I_{(v,\infty)}(w). \]
3. Distribution of Sample Range
(a) Let $R = X_{(n)} - X_{(1)}$. The distribution of this random variable is needed to construct R charts in quality control applications and to compute percentiles of Tukey's studentized range statistic (useful when making comparisons among means in ANOVA). To find the pdf of $R$, we will first find an expression for the cdf of $R$:
\[ P(R \le r) = F_R(r) = P[X_{(n)} - X_{(1)} \le r] = P[X_{(n)} \le r + X_{(1)}] = P[X_{(1)} \le X_{(n)} \le r + X_{(1)}] \]
because $X_{(1)} \le X_{(n)}$ must be satisfied,
\[ = \int_{-\infty}^{\infty}\int_v^{v+r} f_{X_{(1)},X_{(n)}}(v, w)\,dw\,dv. \]
To obtain $f_R(r)$, take the derivative with respect to $r$. Leibnitz's rule can be used.
Leibnitz's Rule: Suppose that $a(\theta)$, $b(\theta)$, and $g(x, \theta)$ are differentiable functions of $\theta$. Then
\[ \frac{d}{d\theta}\int_{a(\theta)}^{b(\theta)} g(x, \theta)\,dx = g[b(\theta), \theta]\frac{d}{d\theta}b(\theta) - g[a(\theta), \theta]\frac{d}{d\theta}a(\theta) + \int_{a(\theta)}^{b(\theta)}\frac{d}{d\theta}g(x, \theta)\,dx. \]
Accordingly,
\[ f_R(r) = \frac{d}{dr}F_R(r) = \frac{d}{dr}\int_{-\infty}^{\infty}\int_v^{v+r} f_{X_{(1)},X_{(n)}}(v, w)\,dw\,dv = \int_{-\infty}^{\infty}\left[\frac{d}{dr}\int_v^{v+r} f_{X_{(1)},X_{(n)}}(v, w)\,dw\right]dv \]
\[ = \int_{-\infty}^{\infty}\left[f_{X_{(1)},X_{(n)}}(v, v+r)\frac{d}{dr}(v + r) - f_{X_{(1)},X_{(n)}}(v, v)\frac{d}{dr}v + \int_v^{v+r}\frac{d}{dr}f_{X_{(1)},X_{(n)}}(v, w)\,dw\right]dv = \int_{-\infty}^{\infty} f_{X_{(1)},X_{(n)}}(v, v+r)\,dv. \]
(b) Example: distribution of the sample range from Unif(0, 1). In this case, the support for $X_{(1)}, X_{(n)}$ is $0 < v < w < 1$. Accordingly, $f_{X_{(1)},X_{(n)}}(v, v+r)$ is non-zero only if $0 < v < v + r < 1$. This implies that $0 < v < 1 - r$ and that $r \in (0, 1)$. The pdf of $R$ is
\[ f_R(r) = \int_0^{1-r} n(n-1)(v + r - v)^{n-2}\,dv = n(n-1)r^{n-2}(1 - r)I_{(0,1)}(r) = \frac{r^{n-2}(1-r)}{B(n-1, 2)} I_{(0,1)}(r). \]
That is, $R \sim \text{Beta}(n-1, 2)$.
4. Joint distribution of all order statistics. Employing the same procedure as for a pair of order statistics, it can be shown that the joint distribution of $X_{(1)}, \ldots, X_{(n)}$ is
\[ f_{X_{(1)},\ldots,X_{(n)}}(x_1, \ldots, x_n) = n!\prod_{i=1}^n f_X(x_i), \quad\text{where } x_1 < x_2 < \cdots < x_n. \]
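The range result in 3(b) is easy to check by simulation. The sketch below (not part of the original notes) compares the simulated range of uniform samples with the Beta($n-1$, 2) distribution; `n` and `reps` are arbitrary assumed values.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
n, reps = 10, 200_000                      # assumed sample size and replications

u = rng.uniform(size=(reps, n))
r = u.max(axis=1) - u.min(axis=1)          # sample range R = X_(n) - X_(1)

print(r.mean(), (n - 1) / (n + 1))         # Beta(n-1, 2) mean is (n-1)/(n+1)
print(np.mean(r < 0.7), stats.beta.cdf(0.7, n - 1, 2))
```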
8.7 Moments of Sample Means and Proportions
Let $X_1, \ldots, X_n$ be a random sample of size $n$ taken either with or without replacement from a population having mean $\mu_X$ and variance $\sigma^2_X$. Denote the support of the random variable $X$ by $\mathcal{S}_X$. The following definitions are used:
\[ \overline{X} = \frac{1}{n}\sum_{i=1}^n X_i; \qquad \hat{p} = \frac{1}{n}\sum_{i=1}^n X_i \;\text{ if } \mathcal{S}_X = \{0, 1\}; \]
\[ S^2_X = \frac{1}{n-1}\sum_{i=1}^n (X_i - \overline{X})^2 = \frac{1}{n-1}\left[\sum_{i=1}^n X_i^2 - n\overline{X}^2\right]; \quad\text{and}\quad S^2_X = \frac{1}{n-1}\sum_{i=1}^n (X_i - \overline{X})^2 = \frac{n\hat{p}(1 - \hat{p})}{n-1} \;\text{ if } \mathcal{S}_X = \{0, 1\}. \]
This section examines the expectation and variance of $\overline{X}$ and $\hat{p}$; the expectation of $S^2_X$; and unbiased estimators of $\text{Var}(\overline{X})$. The following preliminary results are important and could be asked for on exams:
\[ E(X_i) = \mu_X; \tag{8.1} \]
\[ \text{Var}(X_i) = \sigma^2_X; \tag{8.2} \]
\[ E(X_i^2) = \sigma^2_X + \mu^2_X; \tag{8.3} \]
\[ \text{Var}(\overline{X}) = n^{-2}\left[\sum_{i=1}^n\text{Var}(X_i) + \sum_{i\ne j}\text{Cov}(X_i, X_j)\right]; \tag{8.4} \]
\[ \text{Cov}(X_i, X_j) = \begin{cases} 0 & \text{if sampling with replacement,}\\[2pt] -\dfrac{\sigma^2_X}{N-1} & \text{if sampling without replacement; and} \end{cases} \tag{8.5} \]
\[ E(\overline{X}^2) = \mu^2_X + \text{Var}(\overline{X}). \tag{8.6} \]
The result in equation 8.5 is true because $X_1, \ldots, X_n$ are iid if sampling with replacement and
\[ \text{Var}\!\left(\sum_{i=1}^N X_i\right) = 0 = N\,\text{Var}(X_i) + N(N-1)\,\text{Cov}(X_i, X_j) \]
if sampling without replacement. The remaining results in equations 8.1 to 8.6 follow from exchangeability and from the definition of the variance of a random variable. Be able to use the preliminary results to prove any of the following results. See pages 63 to 64 of these notes.
1. Case I: Random sample of size $n$; that is, $X_1, X_2, \ldots, X_n$ are iid.
(a) Case Ia: the random variable has arbitrary support.
- $E(X_i) = \mu_X$.
- $\text{Cov}(X_i, X_j) = 0$ for $i \ne j$.
- $\text{Var}(X_i) = E(X_i^2) - [E(X_i)]^2 = \sigma^2_X$.
- $E(\overline{X}) = \mu_X$.
- $\text{Var}(\overline{X}) = \dfrac{\sigma^2_X}{n}$.
- $E(S^2_X) = \sigma^2_X$.
- $E\!\left(\dfrac{S^2_X}{n}\right) = \dfrac{\sigma^2_X}{n} = \text{Var}(\overline{X})$.
(b) Case Ib: the random variable has support $\mathcal{S}_X = \{0, 1\}$.
- $E(X_i) = p$.
- $\text{Cov}(X_i, X_j) = 0$ for $i \ne j$.
- $\text{Var}(X_i) = E(X_i^2) - [E(X_i)]^2 = \sigma^2_X = p(1-p)$.
- $E(\hat{p}) = p$.
- $\text{Var}(\hat{p}) = \dfrac{\sigma^2_X}{n} = \dfrac{p(1-p)}{n}$.
- $E(S^2_X) = \sigma^2_X = p(1-p)$. When taking large samples from a binary population, $\sigma^2_X = p(1-p)$ is usually estimated by $\hat{\sigma}^2 = \hat{p}(1-\hat{p})$ rather than $S^2_X = \hat{p}(1-\hat{p})\frac{n}{n-1}$. Note that $\hat{\sigma}^2$ has bias $-p(1-p)/n$.
- $E\!\left(\dfrac{S^2_X}{n}\right) = \dfrac{p(1-p)}{n} = \text{Var}(\hat{p})$.
2. Case II: Random sample of size $n$ without replacement from a finite population of size $N$.
(a) Case IIa: the random variable has arbitrary support.
- $E(X_i) = \mu_X$.
- $\text{Cov}(X_i, X_j) = -\dfrac{\sigma^2_X}{N-1}$ for $i \ne j$.
- $\text{Var}(X_i) = E(X_i^2) - [E(X_i)]^2 = \sigma^2_X$.
- $E(\overline{X}) = \mu_X$.
- $\text{Var}(\overline{X}) = \dfrac{\sigma^2_X}{n}\left[1 - \dfrac{n-1}{N-1}\right]$.
- $E(S^2_X) = \sigma^2_X\dfrac{N}{N-1}$.
- $E\!\left[\dfrac{S^2_X}{n}\left(1 - \dfrac{n}{N}\right)\right] = \dfrac{\sigma^2_X}{n}\left[1 - \dfrac{n-1}{N-1}\right] = \text{Var}(\overline{X})$.
(b) Case IIb: the random variable has support $\mathcal{S}_X = \{0, 1\}$.
- $E(X_i) = p$.
- $\text{Cov}(X_i, X_j) = -\dfrac{\sigma^2_X}{N-1} = -\dfrac{p(1-p)}{N-1}$ for $i \ne j$.
- $\text{Var}(X_i) = E(X_i^2) - [E(X_i)]^2 = \sigma^2_X = p(1-p)$.
- $E(\hat{p}) = p$.
- $\text{Var}(\hat{p}) = \dfrac{\sigma^2_X}{n}\left[1 - \dfrac{n-1}{N-1}\right] = \dfrac{p(1-p)}{n}\left[1 - \dfrac{n-1}{N-1}\right]$.
- $E(S^2_X) = E\!\left[\dfrac{n\hat{p}(1-\hat{p})}{n-1}\right] = \sigma^2_X\left(\dfrac{N}{N-1}\right) = p(1-p)\left(\dfrac{N}{N-1}\right)$.
- $E\!\left[\dfrac{S^2_X}{n}\left(1 - \dfrac{n}{N}\right)\right] = E\!\left[\dfrac{n\hat{p}(1-\hat{p})}{n(n-1)}\left(1 - \dfrac{n}{N}\right)\right] = \dfrac{p(1-p)}{n}\left[1 - \dfrac{n-1}{N-1}\right] = \text{Var}(\hat{p})$.
8.8 The Central Limit Theorem (CLT)
Theorem. Let $X_1, X_2, \ldots, X_n$ be a random sample of size $n$ from a population with mean $\mu_X$ and variance $\sigma^2_X$. Then, the distribution of
\[ Z_n = \frac{\overline{X} - \mu_X}{\sigma_X/\sqrt{n}} \]
converges to $N(0,1)$ as $n \to \infty$.
The importance of the CLT is that the convergence of $Z_n$ to a normal distribution occurs regardless of the shape of the distribution of $X$. Transforming from $Z_n$ to $\overline{X}$ reveals that
\[ \overline{X} \approx N\!\left(\mu_X, \frac{\sigma^2_X}{n}\right) \quad\text{if } n \text{ is large.} \]
1. The asymptotic distribution of $\overline{X}$ is said to be $N(\mu_X, \sigma^2_X/n)$. The limiting distribution of $\overline{X}$ is degenerate: $\lim_{n\to\infty}\Pr(\overline{X} = \mu_X) = 1$.
2. Another way to express the CLT is
\[ \lim_{n\to\infty}\Pr\!\left[\frac{\sqrt{n}(\overline{X} - \mu_X)}{\sigma_X} \le c\right] = \Phi(c). \]
Note, equation (2) on page 341 of the text is not correct. It should be
\[ \lim_{n\to\infty} P(\overline{X} \le c) = \begin{cases} 0 & \text{if } c < \mu_X,\\ 1 & \text{if } c > \mu_X. \end{cases} \]
3. Application to sums of iid random variables: If $X_1, X_2, \ldots, X_n$ are iid from a population with mean $\mu_X$ and variance $\sigma^2_X$, then
\[ E\!\left(\sum_{i=1}^n X_i\right) = n\mu_X, \qquad \text{Var}\!\left(\sum_{i=1}^n X_i\right) = n\sigma^2_X, \quad\text{and}\quad \lim_{n\to\infty}\Pr\!\left[\frac{\sum_{i=1}^n X_i - n\mu_X}{\sqrt{n}\,\sigma_X} \le c\right] = \Phi(c). \]
4. How large must $n$ be before $\overline{X}$ is approximately normal? The closer the parent distribution is to a normal distribution, the smaller is the required sample size. When sampling from a normal distribution, a sample size of $n = 1$ is sufficient. Larger sample sizes are required for parent distributions with strong skewness and/or strong kurtosis. For example, suppose that $X \sim \text{Expon}(\lambda)$. This distribution has skewness and kurtosis
\[ \gamma_3 = \frac{E(X - \mu_X)^3}{(\sigma^2_X)^{3/2}} = 2 \quad\text{and}\quad \gamma_4 = \frac{E(X - \mu_X)^4}{\sigma^4_X} - 3 = 6, \]
where $\mu_X = 1/\lambda$ and $\sigma^2_X = 1/\lambda^2$. The sample mean $\overline{X}$ has distribution $\text{Gam}(n, n\lambda)$. The skewness and kurtosis of $\overline{X}$ are
\[ \gamma_3 = \frac{E(\overline{X} - \mu_X)^3}{(\sigma^2_{\overline{X}})^{3/2}} = \frac{2}{\sqrt{n}} \quad\text{and}\quad \gamma_4 = \frac{E(\overline{X} - \mu_X)^4}{\sigma^4_{\overline{X}}} - 3 = \frac{6}{n}, \]
where $\mu_{\overline{X}} = 1/\lambda$ and $\sigma^2_{\overline{X}} = 1/(n\lambda^2)$. Below are plots of the pdf of $Z_n$ for $n = 1, 2, 5, 10, 25, 100$.
[Figure: pdf of $Z_n$ for $n = 1, 2, 5, 10, 25, 100$, with skewness $\gamma_3 = 2, 1.4, 0.89, 0.63, 0.4, 0.2$ and kurtosis $\gamma_4 = 6, 3, 1.2, 0.6, 0.24, 0.06$, respectively.]
5. Application to the binomial distribution: Suppose that $X \sim \text{Bin}(n, p)$. Recall that $X$ has the same distribution as the sum of $n$ iid $\text{Bern}(p)$ random variables. Accordingly, for large $n$,
\[ \hat{p} \approx N\!\left(p, \frac{p(1-p)}{n}\right) \quad\text{and}\quad \Pr(\hat{p} \le c) \approx \Phi\!\left[\frac{\sqrt{n}(c - p)}{\sqrt{p(1-p)}}\right]. \]
6. Continuity Correction. If $X \sim \text{Bin}(n, p)$, then for large $n$, $X \approx N[np, np(1-p)]$ and
\[ \Pr(X = x) = \Pr\!\left(x - \tfrac{1}{2} \le X \le x + \tfrac{1}{2}\right) \text{ for } x = 0, 1, \ldots, n \approx \Phi\!\left[\frac{x + 0.5 - np}{\sqrt{np(1-p)}}\right] - \Phi\!\left[\frac{x - 0.5 - np}{\sqrt{np(1-p)}}\right]. \]
Adding or subtracting 0.5 is called the continuity correction. The continuity corrected normal approximations to the cdfs of $X$ and $\hat{p}$ are
\[ \Pr(X \le x) = \Pr\!\left(X \le x + \tfrac{1}{2}\right) \text{ for } x = 0, 1, \ldots, n; \quad \approx \Phi\!\left[\frac{x + 0.5 - np}{\sqrt{np(1-p)}}\right] \]
and
\[ \Pr(\hat{p} \le c) = \Pr\!\left(\hat{p} \le c + \tfrac{1}{2n}\right) \text{ for } c = \tfrac{0}{n}, \tfrac{1}{n}, \tfrac{2}{n}, \ldots, \tfrac{n}{n}; \quad \approx \Phi\!\left[\frac{\sqrt{n}\left(c + \tfrac{1}{2n} - p\right)}{\sqrt{p(1-p)}}\right]. \]
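The continuity-corrected approximation is easy to compare with the exact binomial cdf. The sketch below (not part of the original notes) uses arbitrary assumed values of $n$, $p$, and $x$.

```python
import numpy as np
from scipy import stats

n, p, x = 40, 0.3, 15                      # assumed values
mu, sd = n * p, np.sqrt(n * p * (1 - p))

exact = stats.binom.cdf(x, n, p)                   # exact P(X <= x)
approx = stats.norm.cdf((x + 0.5 - mu) / sd)       # continuity-corrected normal approximation
print(exact, approx)
```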
8.9 Using the Moment Generating Function
1. Let $X_1, X_2, \ldots, X_n$ be a random sample of size $n$. We wish to find the distribution of $\overline{X}$. One approach is to find the mgf of $\overline{X}$ and (hopefully) to identify the corresponding pdf or pmf. Let $\psi_X(t)$ be the mgf of $X$. The mgf of $\overline{X}$ is
\[ \psi_{\overline{X}}(t) = E\!\left[\exp\!\left\{\frac{t}{n}\sum_{i=1}^n X_i\right\}\right] = E\!\left[\prod_{i=1}^n\exp\!\left\{\frac{t}{n}X_i\right\}\right] = \prod_{i=1}^n E\!\left[\exp\!\left\{\frac{t}{n}X_i\right\}\right] \quad\text{by independence} \]
\[ = \prod_{i=1}^n\psi_{X_i}\!\left(\frac{t}{n}\right) = \left[\psi_X\!\left(\frac{t}{n}\right)\right]^n \quad\text{because the } X\text{s are identically distributed.} \]
2. Example: Exponential distribution. If $X_1, X_2, \ldots, X_n$ is a random sample of size $n$ from $\text{Expon}(\lambda)$, then
\[ \psi_X(t) = \frac{\lambda}{\lambda - t} \quad\text{and}\quad \psi_{\overline{X}}(t) = \left(\frac{\lambda}{\lambda - \frac{t}{n}}\right)^n = \left(\frac{n\lambda}{n\lambda - t}\right)^n, \]
which is the mgf of $\text{Gam}(n, n\lambda)$.
3. Example: Normal distribution. If $X_1, X_2, \ldots, X_n$ is a random sample of size $n$ from $N(\mu_X, \sigma^2_X)$, then
\[ \psi_X(t) = \exp\!\left\{t\mu_X + \frac{t^2\sigma^2_X}{2}\right\} \quad\text{and}\quad \psi_{\overline{X}}(t) = \left[\exp\!\left\{\frac{t}{n}\mu_X + \frac{t^2\sigma^2_X}{2n^2}\right\}\right]^n = \exp\!\left\{t\mu_X + \frac{t^2\sigma^2_X}{2n}\right\}, \]
which is the mgf of $N(\mu_X, \sigma^2_X/n)$.
4. Example: Poisson Distribution. If $X_1, X_2, \ldots, X_n$ is a random sample from $\mathrm{Poi}(\lambda)$, then
$$M_X(t) = e^{\lambda(e^t - 1)} \quad \text{and} \quad M_Y(t) = e^{n\lambda(e^t - 1)},$$
where $Y = \sum_{i=1}^n X_i = n\bar X$. Accordingly, $n\bar X \sim \mathrm{Poi}(n\lambda)$ and
$$P(\bar X = x) = P(n\bar X = nx) = \begin{cases} \dfrac{e^{-n\lambda}(n\lambda)^{nx}}{(nx)!} & \text{for } x = \tfrac{0}{n}, \tfrac{1}{n}, \tfrac{2}{n}, \ldots;\\ 0 & \text{otherwise.}\end{cases}$$
5. A useful limit result. Let $a$ be a constant and let $o(n^{-1})$ be a term that goes to zero faster than $n^{-1}$ does. That is,
$$\lim_{n\to\infty} \frac{o(n^{-1})}{1/n} = \lim_{n\to\infty} n\,o(n^{-1}) = 0.$$
Then
$$\lim_{n\to\infty} \left( 1 + \frac{a}{n} + o(n^{-1}) \right)^n = e^a.$$
Proof:
$$\lim_{n\to\infty} \left( 1 + \frac{a}{n} + o(n^{-1}) \right)^n = \lim_{n\to\infty} \exp\!\left\{ n \ln\!\left( 1 + \frac{a}{n} + o(n^{-1}) \right) \right\} = \exp\!\left\{ \lim_{n\to\infty} n \ln\!\left( 1 + \frac{a}{n} + o(n^{-1}) \right) \right\}.$$
The Taylor series expansion of $\ln(1 + \varepsilon)$ around $\varepsilon = 0$ is
$$\ln(1 + \varepsilon) = \sum_{i=1}^\infty (-1)^{i+1} \frac{\varepsilon^i}{i} = \varepsilon - \frac{\varepsilon^2}{2} + \frac{\varepsilon^3}{3} - \frac{\varepsilon^4}{4} + \cdots,$$
provided that $|\varepsilon| < 1$. Let $\varepsilon = a/n + o(n^{-1})$. If $n$ is large enough to satisfy $|a/n + o(n^{-1})| < 1$, then
$$\ln\!\left( 1 + \frac{a}{n} + o(n^{-1}) \right) = \frac{a}{n} + o(n^{-1}) - \frac{1}{2}\left( \frac{a}{n} + o(n^{-1}) \right)^2 + \frac{1}{3}\left( \frac{a}{n} + o(n^{-1}) \right)^3 - \cdots = \frac{a}{n} + o(n^{-1}),$$
because terms such as $a^2/n^2$ and $a\,o(n^{-1})/n$ go to zero faster than $1/n$ does. Accordingly,
$$\lim_{n\to\infty} \left( 1 + \frac{a}{n} + o(n^{-1}) \right)^n = \exp\!\left\{ \lim_{n\to\infty} n \ln\!\left( 1 + \frac{a}{n} + o(n^{-1}) \right) \right\} = \exp\!\left\{ \lim_{n\to\infty} n \left( \frac{a}{n} + o(n^{-1}) \right) \right\} = \exp\!\left\{ \lim_{n\to\infty} \left[ a + n\,o(n^{-1}) \right] \right\} = \exp\{a + 0\} = e^a.$$
6. Heuristic Proof of the CLT using the MGF: Write $Z_n$ as
$$Z_n = \frac{\bar X - \mu_X}{\sigma_X/\sqrt{n}} = \frac{\frac{1}{n}\sum_{i=1}^n X_i - \mu_X}{\sigma_X/\sqrt{n}} = \sum_{i=1}^n \frac{\frac{1}{n}(X_i - \mu_X)}{\sigma_X/\sqrt{n}} = \sum_{i=1}^n \frac{Z^*_i}{\sqrt{n}}, \text{ where } Z^*_i = \frac{X_i - \mu_X}{\sigma_X},$$
$$= \sum_{i=1}^n U_i, \text{ where } U_i = \frac{Z^*_i}{\sqrt{n}}.$$
Note that $Z^*_1, Z^*_2, \ldots, Z^*_n$ are iid with $E(Z^*_i) = 0$ and $\mathrm{Var}(Z^*_i) = 1$. Also, $U_1, U_2, \ldots, U_n$ are iid with $E(U_i) = 0$ and $\mathrm{Var}(U_i) = 1/n$. If $U_i$ has a moment generating function, then it can be written in expanded form as
$$M_{U_i}(t) = E\!\left(e^{tU_i}\right) = \sum_{j=0}^\infty \frac{t^j}{j!} E(U_i^j) = 1 + tE(U_i) + \frac{t^2}{2}E(U_i^2) + \frac{t^3}{3!}E(U_i^3) + \frac{t^4}{4!}E(U_i^4) + \cdots$$
$$= 1 + t\,\frac{E(Z^*_i)}{\sqrt{n}} + \frac{t^2}{2}\,\frac{E(Z^{*2}_i)}{n} + \frac{t^3}{3!}\,\frac{E(Z^{*3}_i)}{n^{3/2}} + \frac{t^4}{4!}\,\frac{E(Z^{*4}_i)}{n^2} + \cdots = 1 + \frac{t^2}{2n} + o(n^{-1}).$$
Therefore, the mgf of $Z_n$ is
$$M_{Z_n}(t) = E(\exp\{tZ_n\}) = E\!\left[ \exp\!\left( t \sum_{i=1}^n U_i \right) \right] = [M_{U_i}(t)]^n \text{ because } U_1, U_2, \ldots, U_n \text{ are iid}$$
$$= \left( 1 + \frac{t^2}{2n} + o(n^{-1}) \right)^n.$$
Now use the limit result to take the limit of $M_{Z_n}(t)$ as $n$ goes to $\infty$:
$$\lim_{n\to\infty} M_{Z_n}(t) = \lim_{n\to\infty} \left( 1 + \frac{t^2}{2n} + o(n^{-1}) \right)^n = \exp\!\left( \frac{t^2}{2} \right),$$
which is the mgf of $N(0,1)$. Accordingly, the distribution of $Z_n$ converges to $N(0,1)$ as $n \to \infty$.
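A minimal Python sketch (not from the notes; the exponential parent and the sample sizes are illustrative) showing the convergence of $Z_n$ to $N(0,1)$:

```python
# Simulate standardized means of exponential data; the KS distance to N(0,1) shrinks with n.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
lam = 1.0
for n in (1, 5, 25, 100):
    x = rng.exponential(scale=1 / lam, size=(50_000, n))
    z = (x.mean(axis=1) - 1 / lam) / ((1 / lam) / np.sqrt(n))
    print(n, stats.kstest(z, "norm").statistic)
```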
8.10 Normal Populations
This section discusses three distributional results concerning normal distributions.
Let $X_1, \ldots, X_n$ be a random sample from $N(\mu, \sigma^2)$. Define, as usual, the sample mean and variance as
$$\bar X = n^{-1}\sum_{i=1}^n X_i \quad \text{and} \quad S^2_X = (n-1)^{-1}\sum_{i=1}^n (X_i - \bar X)^2.$$
Recall that $\bar X$ and $S^2_X$ are jointly sufficient for $\mu$ and $\sigma^2$. The three distributional results are the following.
1. $\bar X \sim N(\mu, \sigma^2/n)$.
2. $(n-1)S^2_X/\sigma^2 \sim \chi^2_{n-1}$.
3. $\bar X \perp S^2_X$.
We have already verified result #1. The textbook assumes that result #3 is true and uses results #1 and #3 to prove result #2. The argument relies on another result, one that we already have verified:
$$\sum_{i=1}^n (X_i - \mu)^2 = \sum_{i=1}^n (X_i - \bar X)^2 + n(\bar X - \mu)^2.$$
Divide both sides by $\sigma^2$ to obtain
$$\sum_{i=1}^n \frac{(X_i - \mu)^2}{\sigma^2} = \sum_{i=1}^n \frac{(X_i - \bar X)^2}{\sigma^2} + \frac{n(\bar X - \mu)^2}{\sigma^2}. \tag{8.7}$$
Let
$$Z_i = \frac{X_i - \mu}{\sigma} \quad \text{and let} \quad \bar Z = \frac{\bar X - \mu}{\sigma/\sqrt{n}}.$$
Note that $Z_i \sim \text{iid } N(0,1)$ and that $\bar Z \sim N(0,1)$. The equality in equation (8.7) can be written as
$$\sum_{i=1}^n Z_i^2 = \frac{(n-1)S^2_X}{\sigma^2} + \bar Z^2.$$
The left-hand side above is distributed as $\chi^2_n$ and the second term on the right-hand side is distributed as $\chi^2_1$. If $\bar X$ and $S^2_X$ are independently distributed, then the two right-hand-side terms are independently distributed. Using the second result at the top of page 248 in the text (see page 53 of these notes), it can be concluded that $(n-1)S^2_X/\sigma^2 \sim \chi^2_{n-1}$.
We will not attempt to prove result #3. It is an important result that is proven
in the graduate linear models course (Stat 505).
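A minimal Python sketch (not from the notes; the parameter values are illustrative) checking result #2 numerically:

```python
# Simulate (n-1)*S^2/sigma^2 for normal samples and compare with chi-square(n-1).
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
mu, sigma, n, reps = 5.0, 2.0, 8, 100_000

x = rng.normal(mu, sigma, size=(reps, n))
stat = (n - 1) * x.var(axis=1, ddof=1) / sigma**2

print(stats.kstest(stat, stats.chi2(n - 1).cdf))
```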
8.11 Updating Prior Probabilities Via Likelihood
1. Overview: This section introduces the use of Bayes rule to update probabilities. Let $H$ represent a hypothesis about a numerical parameter $\theta$. In the frequentist tradition, the hypothesis must be either true or false because the value of $\theta$ is a fixed number. That is, $P(H) = 0$ or $P(H) = 1$.
In Bayesian analyses, prior beliefs and information are incorporated by conceptualizing $\theta$ as a realization of a random variable $\Theta$. In this case, $P(H)$ can take on any value in $[0, 1]$. The quantity $P(H)$ is called the prior probability. It represents the belief of the investigator prior to collecting new data. One goal of Bayesian analyses is to compute the posterior probability $P(H|X = x)$, where $X$ represents new data. By Bayes rule,
$$P(H|X = x) = \frac{P(H, X = x)}{P(X = x)} = \frac{P(X = x|H)P(H)}{P(X = x|H)P(H) + P(X = x|H^c)P(H^c)}.$$
The quantity $P(X = x|H)$ is the likelihood function. The quantity $P(X = x)$ does not depend on $H$ and therefore is considered a constant (conditioning on $X$ makes $X$ a constant rather than a random variable). Accordingly, Bayes rule can be written as
$$P(H|X = x) \propto L(H|x)P(H).$$
That is, the posterior is proportional to the prior times the likelihood function. Note, the functions $P(X = x|H)$ and $P(X = x)$ are either pmfs or pdfs depending on whether $X$ is discrete or continuous.
2. Example: The pap smear is a screening test for cervical cancer. The test is not 100% accurate. Let $X$ be the outcome of a pap smear:
$$X = \begin{cases} 0 & \text{if the test is negative, and}\\ 1 & \text{if the test is positive.}\end{cases}$$
Studies have shown that the false negative rate of the pap smear is approximately 0.1625 and the false positive rate is approximately 0.1864.
That is, 16.25% of women with cervical cancer test negative on the pap smear and 18.64% of women without cervical cancer test positive on the pap smear. Suppose a specific woman, say Gloria, plans to have a pap smear test. Define the random variable (parameter) $\Theta$ as
$$\Theta = \begin{cases} 0 & \text{if Gloria does not have cervical cancer, and}\\ 1 & \text{if Gloria does have cervical cancer.}\end{cases}$$
The likelihood function is
$$P(X = 0|\Theta = 1) = 0.1625; \quad P(X = 1|\Theta = 1) = 1 - 0.1625 = 0.8375;$$
$$P(X = 0|\Theta = 0) = 1 - 0.1864 = 0.8136; \quad \text{and} \quad P(X = 1|\Theta = 0) = 0.1864.$$
Suppose that the prevalence rate of cervical cancer is 31.2 per 100,000 women. A Bayesian might use this information to specify a prior probability for Gloria, namely $P(\Theta = 1) = 0.000312$. Suppose that Gloria takes the pap smear test and the test is positive. The posterior probability is
$$P(\Theta = 1|X = 1) = \frac{P(X = 1|\Theta = 1)P(\Theta = 1)}{P(X = 1)} = \frac{P(X = 1|\Theta = 1)P(\Theta = 1)}{P(X = 1|\Theta = 1)P(\Theta = 1) + P(X = 1|\Theta = 0)P(\Theta = 0)}$$
$$= \frac{(0.8375)(0.000312)}{(0.1864)(0.999688) + (0.8375)(0.000312)} = 0.0014.$$
Note that
$$\frac{P(\Theta = 1|X = 1)}{P(\Theta = 1)} = \frac{0.0014}{0.000312} = 4.488,$$
so Gloria is approximately four and a half times more likely to have cervical cancer given the positive test than she was before the test, even though the probability that she has cervical cancer is still low. The posterior probability, like the prior probability, is interpreted as a subjective probability rather than a relative frequency probability. A relative frequency interpretation makes no sense here because the experiment cannot be repeated (there is only one Gloria).
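A minimal Python sketch (not from the notes) reproducing the posterior computation above:

```python
# Posterior probability of cervical cancer given a positive pap smear, via Bayes rule.
prior = 0.000312                       # P(theta = 1)
p_pos_given_cancer = 1 - 0.1625        # P(X = 1 | theta = 1)
p_pos_given_no_cancer = 0.1864         # P(X = 1 | theta = 0)

numerator = p_pos_given_cancer * prior
denominator = numerator + p_pos_given_no_cancer * (1 - prior)
posterior = numerator / denominator

print(round(posterior, 4), round(posterior / prior, 3))  # approximately 0.0014 and 4.488
```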
3. Bayes Factor (BF): One way of summarizing the evidence about the hypothesis $H$ is to compute the posterior odds of $H$ divided by the prior odds of $H$. This odds ratio is called the Bayes Factor (BF) and it is equivalent to the ratio of likelihood functions. Denote the sufficient statistic by $T$. In the pap smear example, $T = X$ because there is just one observation. Denote the pdfs or pmfs of $T$ given $H$ or $H^c$ by $f_{T|H}(t|H)$ and $f_{T|H^c}(t|H^c)$, respectively. The marginal distribution of $T$ is obtained by summing the joint distribution of $T$ and the hypothesis over $H$ and $H^c$:
$$m_T(t) = f_{T|H}(t|H)P(H) + f_{T|H^c}(t|H^c)P(H^c).$$
The posterior odds of $H$ are
$$\text{Posterior odds of } H = \frac{P(H|T = t)}{1 - P(H|T = t)} = \frac{P(H|T = t)}{P(H^c|T = t)} = \frac{f_{T|H}(t|H)P(H)/m_T(t)}{f_{T|H^c}(t|H^c)P(H^c)/m_T(t)} = \frac{f_{T|H}(t|H)P(H)}{f_{T|H^c}(t|H^c)P(H^c)}.$$
The prior odds of $H$ are
$$\text{Prior odds of } H = \frac{P(H)}{1 - P(H)} = \frac{P(H)}{P(H^c)}.$$
4. Result: The Bayes Factor is equivalent to the ratio of likelihood functions,
$$\mathrm{BF} = \frac{f_{T|H}(t|H)}{f_{T|H^c}(t|H^c)}.$$
Proof:
$$\mathrm{BF} = \frac{\text{Posterior odds of } H}{\text{Prior odds of } H} = \frac{P(H|T = t)/P(H^c|T = t)}{P(H)/P(H^c)} = \frac{f_{T|H}(t|H)P(H)}{f_{T|H^c}(t|H^c)P(H^c)} \div \frac{P(H)}{P(H^c)} = \frac{f_{T|H}(t|H)}{f_{T|H^c}(t|H^c)},$$
which is the ratio of likelihood functions.
Frequentist statisticians refer to this ratio as the likelihood ratio. A Bayes factor greater than 1 means that the data provide evidence for $H$ relative to $H^c$, and a Bayes factor less than 1 means that the data provide evidence for $H^c$ relative to $H$. For the cervical cancer example, the hypothesis is $\Theta = 1$ and the Bayes factor is
$$\mathrm{BF} = \frac{P(X = 1|\Theta = 1)}{P(X = 1|\Theta = 0)} = \frac{0.8375}{0.1864} = 4.493.$$
The above Bayes factor is nearly the same as the ratio of the posterior probability to the prior probability of $H$ because the prior probability is nearly zero. In general, these ratios will not be equal.
8.12 Some Conjugate Families
1. Overview: Let $X_1, X_2, \ldots, X_n$ be a random sample (with or without replacement) from a population having pdf or pmf $f_X(x|\theta)$. A first step in making inferences about $\theta$ is to reduce the data by finding a sufficient statistic. Let $T$ be the sufficient statistic and denote the pdf or pmf of $T$ by $f_{T|\Theta}(t|\theta)$. Suppose that prior beliefs about $\theta$ can be represented as the prior distribution $g_\Theta(\theta)$. By the definition of conditional probability, the posterior distribution of $\Theta$ is
$$g_{\Theta|T}(\theta|t) = \frac{f_{\Theta,T}(\theta, t)}{m_T(t)} = \frac{f_{T|\Theta}(t|\theta)\, g_\Theta(\theta)}{m_T(t)},$$
where $m_T(t)$ is the marginal distribution of $T$, which can be obtained as
$$m_T(t) = \int f_{T,\Theta}(t, \theta)\, d\theta = \int f_{T|\Theta}(t|\theta)\, g_\Theta(\theta)\, d\theta.$$
Integration should be replaced by summation if the prior distribution is discrete.
In practice, obtaining an expression for the marginal distribution of $T$ may be unnecessary. Note that $m_T(t)$ is just a constant in the posterior distribution. Accordingly, the posterior distribution is
$$g_{\Theta|T}(\theta|t) \propto L(\theta|t)\, g_\Theta(\theta) \quad \text{because } L(\theta|t) \propto f_{T|\Theta}(t|\theta). \tag{8.8}$$
2. The kernel of a pdf or pmf is proportional to the pdf or pmf and is the part of the function that depends on the random variable. That is, the kernel is obtained by deleting any multiplicative terms that do not depend on the random variable. The right-hand side of equation (8.8) contains the kernel of the posterior. If the kernel can be recognized, then the posterior distribution can be obtained without first finding the marginal distribution of $T$.
The kernels of some well-known distributions are given below.
(a) If $\Theta \sim \mathrm{Unif}(a, b)$, then the kernel is $I_{(a,b)}(\theta)$.
(b) If $\Theta \sim \mathrm{Expon}(\lambda)$, then the kernel is $e^{-\lambda\theta}\, I_{(0,\infty)}(\theta)$.
(c) If $\Theta \sim \mathrm{Gamma}(\alpha, \lambda)$, then the kernel is $\theta^{\alpha-1} e^{-\lambda\theta}\, I_{(0,\infty)}(\theta)$.
(d) If $\Theta \sim \mathrm{Poi}(\lambda)$, then the kernel is $\dfrac{\lambda^\theta}{\theta!}\, I_{\{0,1,2,\ldots\}}(\theta)$.
(e) If $\Theta \sim \mathrm{Beta}(\alpha, \beta)$, then the kernel is $\theta^{\alpha-1}(1-\theta)^{\beta-1}\, I_{(0,1)}(\theta)$.
(f) If $\Theta \sim N(\mu, \sigma^2)$, then the kernel is $e^{-\frac{1}{2\sigma^2}(\theta^2 - 2\theta\mu)}$.
3. Conjugate Families: A family of distributions is conjugate for a likelihood
function if the prior and posterior distributions both are in the family.
4. Example 1. Consider the problem of making inferences about a population proportion, $\theta$. A random sample $X_1, X_2, \ldots, X_n$ will be obtained from a Bernoulli($\theta$) population. By sufficiency, the data can be reduced to $Y = \sum X_i$, and conditional on $\Theta = \theta$, the distribution of $Y$ is $Y \sim \mathrm{Bin}(n, \theta)$. One natural prior distribution for $\Theta$ is $\mathrm{Beta}(\alpha, \beta)$. Your textbook gives plots of several beta pdfs on page 356. The lower left plot is not correct. The beta parameters must be greater than zero. The limiting distribution as $\alpha$ and $\beta$ go to zero is
$$\lim_{\alpha\to 0,\,\beta\to 0} \frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha, \beta)} = \begin{cases} \tfrac{1}{2} & \theta \in \{0, 1\}\\ 0 & \text{otherwise.}\end{cases}$$
The parameter $\alpha - 1$ can be conceptualized as the number of prior successes and $\beta - 1$ can be conceptualized as the number of prior failures.
(a) Prior:
$$g_\Theta(\theta|\alpha, \beta) = \frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha, \beta)}\, I_{(0,1)}(\theta).$$
(b) Likelihood Function:
$$f_{Y|\Theta}(y|\theta) = \binom{n}{y} \theta^y (1-\theta)^{n-y}\, I_{\{0,1,\ldots,n\}}(y) \quad \text{and} \quad L(\theta|y) = \theta^y (1-\theta)^{n-y}.$$
(c) Posterior:
$$g_{\Theta|Y}(\theta|\alpha, \beta, y) \propto \theta^y (1-\theta)^{n-y}\, \frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha, \beta)}\, I_{(0,1)}(\theta).$$
The kernel of the posterior is $\theta^{y+\alpha-1}(1-\theta)^{n-y+\beta-1}$. Accordingly, the posterior distribution is $\mathrm{Beta}(y + \alpha, n - y + \beta)$. Note that the posterior mean (a point estimator) is
$$E(\Theta|Y = y) = \frac{y + \alpha}{n + \alpha + \beta}.$$
(d) Note that both the prior and posterior are beta distributions. Accordingly, the beta family is conjugate for the binomial likelihood.
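A minimal Python sketch (not from the notes; the prior and data values are illustrative) of the beta-binomial update derived above:

```python
# Posterior for a Bernoulli proportion with a Beta(alpha, beta) prior.
from scipy import stats

alpha, beta, n, y = 2.0, 2.0, 20, 7

posterior = stats.beta(alpha + y, beta + n - y)
print(posterior.mean())           # (y + alpha) / (n + alpha + beta) = 9/24
print(posterior.interval(0.95))   # central 95% credibility interval
```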
5. Example 2. Consider the problem of making inferences about a population mean, $\theta$, when sampling from a normal distribution having known variance, $\sigma^2$. By sufficiency, the data can be reduced to $\bar X$. One prior for $\Theta$ is $N(\mu, \tau^2)$.
(a) Prior:
$$g_\Theta(\theta|\mu, \tau) = \frac{e^{-(\theta - \mu)^2/(2\tau^2)}}{\sqrt{2\pi\tau^2}}.$$
(b) Likelihood Function:
$$f_{\bar X|\Theta}(\bar x|\theta, \sigma^2) = \frac{\exp\{-\frac{n}{2\sigma^2}(\bar x - \theta)^2\}}{\sqrt{2\pi\sigma^2/n}} \quad \text{and} \quad L(\theta|\bar x) = e^{-\frac{n}{2\sigma^2}(\theta^2 - 2\theta\bar x)}.$$
(c) Posterior:
$$g_{\Theta|\bar X}(\theta|\mu, \tau, \sigma, \bar x) \propto e^{-\frac{1}{2\tau^2}(\theta^2 - 2\theta\mu)}\, e^{-\frac{n}{2\sigma^2}(\theta^2 - 2\theta\bar x)}.$$
The combined exponent, after dropping multiplicative terms that do not depend on $\theta$, is
$$-\frac{1}{2}\left[ \frac{n}{\sigma^2}(\theta^2 - 2\theta\bar x) + \frac{1}{\tau^2}(\theta^2 - 2\theta\mu) \right] = -\frac{1}{2}\left[ \left( \frac{n}{\sigma^2} + \frac{1}{\tau^2} \right)\theta^2 - 2\theta\left( \frac{n\bar x}{\sigma^2} + \frac{\mu}{\tau^2} \right) + C \right]$$
$$= -\frac{1}{2}\left( \frac{n}{\sigma^2} + \frac{1}{\tau^2} \right)\left[ \theta - \frac{n\bar x/\sigma^2 + \mu/\tau^2}{n/\sigma^2 + 1/\tau^2} \right]^2 + C^*,$$
where $C$ and $C^*$ are terms that do not depend on $\theta$. Note, we have completed the square. This is the kernel of a normal distribution with mean and variance
$$E(\Theta|\bar x) = \frac{n\bar x/\sigma^2 + \mu/\tau^2}{n/\sigma^2 + 1/\tau^2} \quad \text{and} \quad \mathrm{Var}(\Theta|\bar x) = \left( \frac{n}{\sigma^2} + \frac{1}{\tau^2} \right)^{-1}.$$
(d) Note that both the prior and the posterior are normal distributions. Accordingly, the normal family is conjugate for the normal likelihood when $\sigma^2$ is known.
(e) Precision. An alternative expression for the posterior mean and variance uses what is called the precision of a random variable. Precision is defined as the reciprocal of the variance. Thus, as the variance increases, the precision decreases. Your textbook uses the symbol $\lambda$ to stand for precision. For this problem,
$$\lambda_{\bar X} = \frac{n}{\sigma^2}, \quad \lambda_\Theta = \frac{1}{\tau^2}, \quad \text{and} \quad \lambda_{\Theta|\bar X} = \frac{n}{\sigma^2} + \frac{1}{\tau^2} = \lambda_{\bar X} + \lambda_\Theta.$$
That is, the precision of the posterior is the sum of the precision of the prior plus the precision of the data. Using this notation, the posterior mean and variance are
$$E(\Theta|\bar x) = \frac{\lambda_{\bar X}}{\lambda_{\bar X} + \lambda_\Theta}\,\bar x + \frac{\lambda_\Theta}{\lambda_{\bar X} + \lambda_\Theta}\,\mu \quad \text{and} \quad \mathrm{Var}(\Theta|\bar x) = (\lambda_{\bar X} + \lambda_\Theta)^{-1}.$$
Note that the posterior mean is a weighted average of the prior mean and the data mean.
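A minimal Python sketch (not from the notes; the function name and values are illustrative) of the precision-weighted update:

```python
# Posterior mean and variance for a normal mean with known sigma and a N(mu0, tau^2) prior.
def normal_posterior(xbar, n, sigma, mu0, tau):
    lam_data = n / sigma**2          # precision contributed by the data
    lam_prior = 1 / tau**2           # precision of the prior
    post_var = 1 / (lam_data + lam_prior)
    post_mean = post_var * (lam_data * xbar + lam_prior * mu0)
    return post_mean, post_var

print(normal_posterior(xbar=10.3, n=25, sigma=2.0, mu0=8.0, tau=3.0))
```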
8.13 Predictive Distributions
1. The goal in this section is to make predictions about future observations $Y_1, Y_2, \ldots, Y_k$. We may have current observations $X_1, X_2, \ldots, X_n$ to aid us.
2. Case I: No data available. If the value of $\theta$ is known, then the predictive distribution is simply the pdf (or pmf) $f_{Y|\Theta}(y|\theta)$. In most applications $\theta$ is not known. The Bayesian solution is to integrate $\theta$ out of the joint distribution of $(\Theta, Y)$. That is, the Bayesian predictive distribution is
$$f_Y(y) = E_\Theta\!\left[ f_{Y|\Theta}(y|\Theta) \right] = \int f_{Y|\Theta}(y|\theta)\, g_\Theta(\theta)\, d\theta,$$
where $g_\Theta(\theta)$ is the prior distribution of $\Theta$. Replace integration by summation if the distribution of $\Theta$ is discrete.
3. Case II: Data available. Suppose that $X_1, X_2, \ldots, X_n$ has been observed from $f_{X|\Theta}(x|\theta)$. Denote the sufficient statistic by $T$ and denote the pdf (pmf) of $T$ given $\theta$ by $f_{T|\Theta}(t|\theta)$. The Bayesian posterior predictive distribution is given by
$$f_{Y|T}(y|t) = E_{\Theta|T}\!\left[ f_{Y|\Theta}(y|\Theta) \right] = \int f_{Y|\Theta}(y|\theta)\, g_{\Theta|T}(\theta|t)\, d\theta,$$
where $g_{\Theta|T}(\theta|t)$ is the posterior distribution of $\Theta$. Replace integration by summation if the distribution of $\Theta$ is discrete. The posterior distribution of $\Theta$ is found by Bayes rule:
$$g_{\Theta|T}(\theta|t) \propto L(\theta|t)\, g_\Theta(\theta).$$
4. Example of Case I. Consider the problem of predicting the number of successes in $k$ Bernoulli trials. Thus, conditional on $\Theta = \theta$, the distribution of the sum of the Bernoulli random variables is $Y \sim \mathrm{Bin}(k, \theta)$. The probability of success, $\theta$, is not known, but suppose that the prior belief function can be represented by a beta distribution. Then the Bayesian predictive distribution is
$$f_Y(y) = \int_0^1 \binom{k}{y} \theta^y (1-\theta)^{k-y}\, \frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha, \beta)}\, d\theta = \binom{k}{y} \frac{B(\alpha + y, \beta + k - y)}{B(\alpha, \beta)}\, I_{\{0,1,\ldots,k\}}(y).$$
This predictive pmf is known as the beta-binomial pmf. It has expectation
$$E(Y) = E_\Theta[E(Y|\Theta)] = E_\Theta(k\Theta) = \frac{k\alpha}{\alpha + \beta}.$$
For example, suppose that the investigator has no prior knowledge and believes that $\Theta$ is equally likely to be anywhere in the $(0, 1)$ interval. Then an appropriate prior is $\mathrm{Beta}(1, 1)$, the uniform distribution. The Bayesian predictive distribution is
$$f_Y(y) = \binom{k}{y} \frac{B(1 + y, 1 + k - y)}{B(1, 1)}\, I_{\{0,1,\ldots,k\}}(y) = \frac{1}{k + 1}\, I_{\{0,1,\ldots,k\}}(y),$$
which is a discrete uniform distribution with support $\{0, 1, \ldots, k\}$. The expectation of $Y$ is $E(Y) = k/2$.
5. Example of Case II. Consider the problem of predicting the number of successes in $k$ Bernoulli trials. Thus, conditional on $\Theta = \theta$, the distribution of the sum of the Bernoulli random variables is $Y \sim \mathrm{Bin}(k, \theta)$. A random sample of size $n$ from $\mathrm{Bern}(\theta)$ has been obtained. The sufficient statistic is $T = \sum X_i$ and $T \sim \mathrm{Bin}(n, \theta)$. The probability of success, $\theta$, is not known, but suppose that the prior belief function can be represented by a beta distribution. Then the posterior distribution of $\Theta$ is
$$g_{\Theta|T}(\theta|t) \propto \binom{n}{t} \theta^t (1-\theta)^{n-t}\, \frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha, \beta)}.$$
By recognizing the kernel, it is clear that the posterior distribution of $\Theta$ is $\mathrm{Beta}(t + \alpha, n - t + \beta)$. The Bayesian posterior predictive distribution is
$$f_{Y|T}(y|t) = \int_0^1 \binom{k}{y} \theta^y (1-\theta)^{k-y}\, \frac{\theta^{t+\alpha-1}(1-\theta)^{n-t+\beta-1}}{B(t + \alpha, n - t + \beta)}\, d\theta = \binom{k}{y} \frac{B(\alpha + t + y, \beta + n - t + k - y)}{B(t + \alpha, n - t + \beta)}\, I_{\{0,1,\ldots,k\}}(y).$$
This is another beta-binomial pmf. It has expectation
$$E(Y) = E_\Theta[E(Y|\Theta)] = E_\Theta(k\Theta) = \frac{k(\alpha + t)}{n + \alpha + \beta}.$$
For example, suppose that the investigator has no prior knowledge and believes that $\Theta$ is equally likely to be anywhere in the $(0, 1)$ interval. Then an appropriate prior is $\mathrm{Beta}(1, 1)$, the uniform distribution. One Bernoulli random variable has been observed and its value is $x = 0$. That is, the data consist of just one failure; $n = 1$, $t = 0$. The posterior distribution of $\Theta$ is $\mathrm{Beta}(1, 2)$ and the Bayesian posterior predictive distribution is
$$f_{Y|T}(y|t) = \binom{k}{y} \frac{B(1 + y, 2 + k - y)}{B(1, 2)}\, I_{\{0,1,\ldots,k\}}(y) = \frac{2(k + 1 - y)}{(k + 1)(k + 2)}\, I_{\{0,1,\ldots,k\}}(y).$$
The expectation of $Y$ is $E(Y) = k/3$.
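A minimal Python sketch (not from the notes; the function name and values are illustrative) evaluating the beta-binomial posterior predictive pmf derived above:

```python
# Beta-binomial posterior predictive pmf for the number of successes in k future trials.
import numpy as np
from scipy.special import comb, beta as B

def beta_binomial_pmf(y, k, a, b):
    # a, b are the posterior beta parameters for theta
    return comb(k, y) * B(a + y, b + k - y) / B(a, b)

k = 4
y = np.arange(k + 1)
pmf = beta_binomial_pmf(y, k, a=1, b=2)      # posterior Beta(1, 2) from n = 1, t = 0
print(pmf, pmf.sum(), (y * pmf).sum())       # sums to 1; mean equals k/3
```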
Chapter 9
ESTIMATION
9.1 Errors in Estimation
1. Estimator versus Estimate: An estimator of a population parameter, say $\theta$, is a function of the data and is a random variable. An estimate is a realization of the random variable.
2. Variance and Bias: Let $T = T(X)$ be an estimator of $\theta$. The bias of $T$ is $b_T = E(T - \theta) = E(T) - \theta$. If $b_T = 0$, then $T$ is unbiased for $\theta$. The variance of $T$ is $\sigma^2_T = E(T - \mu_T)^2$, where $\mu_T = E(T)$.
3. Mean Square Error: The mean square error of $T$ is
$$\mathrm{MSE}_T(\theta) = E\!\left[(T - \theta)^2\right].$$
4. Result: $\mathrm{MSE}_T(\theta) = \mathrm{Var}(T) + b^2_T$.
Proof:
$$\mathrm{MSE}_T(\theta) = E[(T - \mu_T) + (\mu_T - \theta)]^2 = E[(T - \mu_T)^2 + 2(T - \mu_T)(\mu_T - \theta) + (\mu_T - \theta)^2]$$
$$= E(T - \mu_T)^2 + 2(\mu_T - \theta)E(T - \mu_T) + (\mu_T - \theta)^2 = \mathrm{Var}(T) + b^2_T.$$
5. Root Mean Square Error: $\mathrm{RMSE}_T(\theta) = \sqrt{\mathrm{MSE}_T(\theta)}$.
6. Example: Sample Variance. Let $X_1, \ldots, X_n$ be a random sample from $N(\mu, \sigma^2)$. Compare two estimators of $\sigma^2$:
$$S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar X)^2 \quad \text{and} \quad V = \frac{1}{n}\sum_{i=1}^n (X_i - \bar X)^2.$$
We know that $S^2$ is unbiased for $\sigma^2$ (this result does not depend on normality). Therefore
$$b_{S^2} = 0 \quad \text{and} \quad b_V = E\!\left( \frac{n-1}{n} S^2 \right) - \sigma^2 = -\frac{\sigma^2}{n}.$$
Recall that $(n-1)S^2/\sigma^2 \sim \chi^2_{n-1}$. Therefore
$$\mathrm{Var}\!\left( \sum_{i=1}^n (X_i - \bar X)^2 \right) = \sigma^4\,\mathrm{Var}(\chi^2_{n-1}) = 2(n-1)\sigma^4$$
because $\mathrm{Var}(\chi^2_{n-1}) = 2(n-1)$. The MSEs of $S^2$ and $V$ are
$$\mathrm{MSE}_{S^2}(\sigma^2) = \mathrm{Var}(S^2) = \frac{2(n-1)\sigma^4}{(n-1)^2} = \frac{2\sigma^4}{n-1} \quad \text{and}$$
$$\mathrm{MSE}_V(\sigma^2) = \mathrm{Var}(V) + b^2_V = \frac{2(n-1)\sigma^4}{n^2} + \frac{\sigma^4}{n^2} = \frac{(2n-1)\sigma^4}{n^2} = \frac{2\sigma^4}{n-1}\left( 1 - \frac{3n-1}{2n^2} \right).$$
Note that $\mathrm{MSE}_{S^2}(\sigma^2) > \mathrm{MSE}_V(\sigma^2)$ even though $S^2$ is unbiased and $V$ is biased.
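A minimal Python sketch (not from the notes; the parameter values are illustrative) confirming by Monte Carlo that $V$ has the smaller MSE:

```python
# Compare MSEs of S^2 (divisor n-1) and V (divisor n) when sampling from a normal population.
import numpy as np

rng = np.random.default_rng(3)
mu, sigma, n, reps = 0.0, 3.0, 10, 200_000

x = rng.normal(mu, sigma, size=(reps, n))
s2 = x.var(axis=1, ddof=1)                 # unbiased estimator S^2
v = x.var(axis=1, ddof=0)                  # biased estimator V

for name, est in [("S^2", s2), ("V", v)]:
    print(name, np.mean((est - sigma**2) ** 2))   # V has the smaller MSE
```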
7. Standard Error: An estimator (or estimate) of the standard deviation of an estimator is called the standard error of the estimator.
8. Example of standard errors: In the following table, $f = \dfrac{n-1}{N-1}$.
Parent distribution                      Parameter     Estimator $T$   $\mathrm{Var}(T)$                          $\mathrm{SE}(T)$
Any; sampling with replacement           $\mu$         $\bar X$        $\sigma^2/n$                               $S/\sqrt{n}$
Bernoulli; sampling with replacement     $p$           $\hat p$        $p(1-p)/n$                                 $\sqrt{\hat p(1-\hat p)/(n-1)}$
Any finite pop.; sampling w/o repl.      $\mu$         $\bar X$        $(\sigma^2/n)(1-f)$                        $\sqrt{(S^2/n)(1-f)}$
Finite Bernoulli; sampling w/o repl.     $p$           $\hat p$        $[p(1-p)/n](1-f)$                          $\sqrt{[\hat p(1-\hat p)/(n-1)](1 - n/N)}$
Normal                                   $\sigma^2$    $S^2$           $2\sigma^4/(n-1)$                          $S^2\sqrt{2/(n-1)}$
9.2 Consistency
1. Chebyshev's Inequality: Suppose that $X$ is a random variable with pdf or pmf $f_X(x)$. Let $h(X)$ be a non-negative function of $X$ whose expectation exists and let $k$ be any positive constant. Then
$$P[h(X) \ge k] \le \frac{E[h(X)]}{k}.$$
Proof: Suppose that $X$ is a continuous rv. Let $R$ be the set $R = \{x : x \in S_X,\ h(x) \ge k\}$. Then
$$E[h(X)] = \int_{S_X} h(x) f_X(x)\, dx \ge \int_R h(x) f_X(x)\, dx \ge k \int_R f_X(x)\, dx = k\, P[h(X) \ge k]$$
$$\Longrightarrow \frac{E[h(X)]}{k} \ge P[h(X) \ge k].$$
If $X$ is discrete, then replace integration by summation. A perusal of books in my office reveals that Chebyshev's inequality also is known as
(a) Tchebichev's inequality (Roussas, Introduction to Probability and Statistical Inference, 2003, Academic Press),
(b) Tchebysheff's theorem (Mendenhall et al., A Brief Introduction to Probability and Statistics, 2002, Duxbury; Freund & Wilson, Statistical Methods, 2003, Academic Press),
(c) Tchebychev's inequality (Schervish, Theory of Statistics, 1995, Springer-Verlag),
(d) Chebychev's inequality (Casella & Berger, Statistical Inference, 2002, Duxbury), and
(e) possibly other variants.
2. Application 1: Suppose that $E(X) = \mu_X$ and $\mathrm{Var}(X) = \sigma^2_X < \infty$. Then
$$P\!\left[ \frac{|X - \mu_X|^2}{\sigma^2_X} \ge k^2 \right] \le \frac{1}{k^2}.$$
Proof: Choose $h(X)$ to be
$$h(X) = \frac{(X - \mu_X)^2}{\sigma^2_X}.$$
By the definition of $\mathrm{Var}(X)$, it follows that $E[h(X)] = 1$. Also,
$$P\!\left[ |X - \mu_X| \ge k\sigma_X \right] = P\!\left[ \frac{|X - \mu_X|^2}{\sigma^2_X} \ge k^2 \right] \le \frac{1}{k^2} \text{ by Chebyshev}$$
$$\Longrightarrow P\!\left[ |X - \mu_X| < k\sigma_X \right] \ge 1 - \frac{1}{k^2}.$$
3. Application 2: Suppose that $T$ is a random variable (an estimator of the unknown parameter $\theta$) with $E(T) = \mu_T$ and $\mathrm{Var}(T) = \sigma^2_T < \infty$. Then
$$P\!\left[ |T - \theta| < \varepsilon \right] \ge 1 - \frac{\mathrm{MSE}_T(\theta)}{\varepsilon^2}.$$
Proof: Choose $h(T)$ to be $h(T) = (T - \theta)^2$. Then $E[h(T)] = \mathrm{MSE}_T(\theta)$ and
$$P\!\left[ |T - \theta| \ge \varepsilon \right] = P\!\left[ |T - \theta|^2 \ge \varepsilon^2 \right] \le \frac{\mathrm{MSE}_T(\theta)}{\varepsilon^2} \text{ by Chebyshev, where } \mathrm{MSE}_T(\theta) = \sigma^2_T + [E(T) - \theta]^2$$
$$\Longrightarrow P\!\left[ |T - \theta| < \varepsilon \right] \ge 1 - \frac{\mathrm{MSE}_T(\theta)}{\varepsilon^2}.$$
4. Consistency Definition: A sequence of estimators, $\{T_n\}$, is consistent for $\theta$ if
$$\lim_{n\to\infty} P[|T_n - \theta| < \varepsilon] = 1$$
for every $\varepsilon > 0$.
5. Convergence in Probability Definition: A sequence of estimators, $\{T_n\}$, converges in probability to $\theta$ if the sequence is consistent for $\theta$. Convergence in probability is usually written as $T_n \xrightarrow{\text{prob}} \theta$.
6. Law of Large Numbers: If $\bar X$ is the sample mean based on a random sample of size $n$ from a population having mean $\mu_X$, then $\bar X \xrightarrow{\text{prob}} \mu_X$. We will prove the law of large numbers for the special case in which the population variance is finite (see #9 below). The more general result, which does not require a finite population variance, is sometimes called Khintchine's Theorem.
7. Mean Square Consistency Definition: An estimator of $\theta$ is mean square consistent if $\lim_{n\to\infty} \mathrm{MSE}_{T_n}(\theta) = 0$.
8. Result: If an estimator is mean square consistent, then it is consistent.
Proof: Let $T_n$ be an estimator of $\theta$. Assume that $T_n$ has finite mean and variance. Then it follows from Chebyshev's Theorem that
$$P[|T_n - \theta| < \varepsilon] \ge 1 - \frac{\mathrm{MSE}_{T_n}(\theta)}{\varepsilon^2},$$
where $\varepsilon$ is any positive constant. If $T_n$ is mean square consistent for $\theta$, then
$$\lim_{n\to\infty} P[|T_n - \theta| < \varepsilon] \ge \lim_{n\to\infty} \left[ 1 - \frac{\mathrm{MSE}_{T_n}(\theta)}{\varepsilon^2} \right] = 1 \quad \text{because} \quad \lim_{n\to\infty} \frac{\mathrm{MSE}_{T_n}(\theta)}{\varepsilon^2} = 0$$
for any $\varepsilon > 0$.
9. Application: The sample mean based on a random sample of size $n$ from a population with finite mean and variance has mean $\mu_X$ and variance $\sigma^2_X/n$. Accordingly,
$$\mathrm{MSE}_{\bar X}(\mu_X) = \frac{\sigma^2_X}{n} \quad \text{and} \quad \lim_{n\to\infty} \mathrm{MSE}_{\bar X}(\mu_X) = 0,$$
which reveals that $\bar X$ is mean square consistent. It follows from the result in (8) that $\bar X \xrightarrow{\text{prob}} \mu_X$.
9.3 Large Sample Confidence Intervals
1. General setting: Suppose that $T_n$ is an estimator of $\theta$ and that
$$\lim_{n\to\infty} P\!\left[ \frac{T_n - \theta}{\sigma_{T_n}} \le c \right] = \Phi(c).$$
That is, $T_n \mathrel{\dot\sim} N(\theta, \sigma^2_{T_n})$ provided that the sample size is sufficiently large. Suppose that $\sigma^2_{T_n} = \sigma^2/n$ and that $W^2_n$ is a consistent estimator of $\sigma^2$. That is, $S_{T_n} = \mathrm{SE}(T_n) = W_n/\sqrt{n}$ and $W^2_n \xrightarrow{\text{prob}} \sigma^2$. Then it can be shown that
$$\lim_{n\to\infty} P\!\left[ \frac{T_n - \theta}{S_{T_n}} \le c \right] = \Phi(c).$$
We will not prove the above result. It is an application of Slutsky's theorem, which is not covered in Stat 424.
2. Constructing a confidence interval: Denote the $100(1 - \alpha/2)$ percentile of the standard normal distribution by $z_{\alpha/2}$. That is, $\Phi^{-1}(1 - \alpha/2) = z_{\alpha/2}$. Then, using the large sample distribution of $T_n$, it follows that
$$P\!\left[ -z_{\alpha/2} \le \frac{T_n - \theta}{S_{T_n}} \le z_{\alpha/2} \right] \approx 1 - \alpha.$$
Using simple algebra to manipulate the three sides of the above equation yields
$$P\!\left[ T_n - z_{\alpha/2} S_{T_n} \le \theta \le T_n + z_{\alpha/2} S_{T_n} \right] \approx 1 - \alpha.$$
The above random interval is a large sample $100(1 - \alpha)\%$ confidence interval for $\theta$. The interval is random because $T_n$ and $S_{T_n}$ are random variables.
3. Interpretation of the interval: Let $t_n$ and $s_{T_n}$ be realizations of $T_n$ and $S_{T_n}$. Then
$$(t_n - z_{\alpha/2}\, s_{T_n},\ t_n + z_{\alpha/2}\, s_{T_n})$$
is a realization of the random interval. We say that we are $100(1 - \alpha)\%$ confident that the realization captures the parameter $\theta$. The $1 - \alpha$ probability statement applies to the interval estimator, but not to the interval estimate (i.e., a realization of the interval).
4. Example 1: Confidence interval for $\mu_X$. Let $X_1, \ldots, X_n$ be a random sample of size $n$ from a population having mean $\mu_X$ and variance $\sigma^2_X$. If sample size is large, then $\bar X \mathrel{\dot\sim} N(\mu_X, \sigma^2_X/n)$ by the CLT. The estimated variance of $\bar X$ is $S^2_{\bar X} = S^2_X/n$. It can be shown that
$$\mathrm{Var}(S^2_X) = \frac{2\sigma^4_X}{n-1}\left[ 1 + \frac{(n-1)\gamma_4}{2n} \right],$$
where $\gamma_4$ is the standardized kurtosis of the parent distribution. Recall that if $X$ is normal, then $\gamma_4 = 0$. If $\gamma_4$ is finite, then Chebyshev's inequality reveals that $S^2_X \xrightarrow{\text{prob}} \sigma^2_X$. It follows that $S_X \xrightarrow{\text{prob}} \sigma_X$. Accordingly,
$$\lim_{n\to\infty} P\!\left[ \frac{\bar X - \mu_X}{S_X/\sqrt{n}} \le c \right] = \Phi(c)$$
and
$$P\!\left[ \bar X - z_{\alpha/2}\frac{S_X}{\sqrt{n}} \le \mu_X \le \bar X + z_{\alpha/2}\frac{S_X}{\sqrt{n}} \right] \approx 1 - \alpha.$$
5. Example 2: Confidence interval for a population proportion, $p$. Let $X_1, \ldots, X_n$ be a random sample of size $n$ from $\mathrm{Bern}(p)$. If sample size is large, then $\hat p \mathrel{\dot\sim} N(p, p(1-p)/n)$ by the CLT. The usual estimator of $\sigma^2_X$ is $V_X = \hat p(1 - \hat p)$. We know that $\hat p \xrightarrow{\text{prob}} p$ (by the law of large numbers). It follows that $\hat p(1 - \hat p) \xrightarrow{\text{prob}} p(1 - p)$ and, therefore, $V_X = \hat p(1 - \hat p)$ is consistent for $\sigma^2_X$. Accordingly,
$$\lim_{n\to\infty} P\!\left[ \frac{\hat p - p}{\sqrt{\hat p(1 - \hat p)/n}} \le c \right] = \Phi(c)$$
and
$$P\!\left[ \hat p - z_{\alpha/2}\sqrt{\frac{\hat p(1 - \hat p)}{n}} \le p \le \hat p + z_{\alpha/2}\sqrt{\frac{\hat p(1 - \hat p)}{n}} \right] \approx 1 - \alpha.$$
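A minimal Python sketch (not from the notes; the counts are illustrative) of the large-sample interval in Example 2:

```python
# Large-sample (Wald) confidence interval for a proportion.
import numpy as np
from scipy import stats

def wald_ci(successes, n, conf=0.95):
    p_hat = successes / n
    z = stats.norm.ppf(1 - (1 - conf) / 2)
    half_width = z * np.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

print(wald_ci(successes=30, n=40))           # approximately (0.616, 0.884)
```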
9.4 Determining Sample Size
1. Margin of Error: Suppose that, for large $n$, $T_n \mathrel{\dot\sim} N(\theta, \sigma^2/n)$ and that $\sigma^2$ is known. The large sample $100(1 - \alpha)\%$ confidence interval for $\theta$ is
$$T_n \pm M, \quad \text{where} \quad M = z_{\alpha/2}\frac{\sigma}{\sqrt{n}}.$$
The quantity $M$ is one half of the confidence interval width and is called the margin of error.
2. Choosing $n$: Suppose that the investigator would like to estimate $\theta$ to within $\pm m$ with confidence $100(1 - \alpha)\%$. The required sample size is obtained by equating $m$ to $M$ and solving for $n$. The solution is
$$n = \left( \frac{z_{\alpha/2}\,\sigma}{m} \right)^2.$$
If the solution is not an integer, then round up.
3. Application 1: Let $X_1, \ldots, X_n$ be a random sample from a distribution with unknown mean $\mu_X$ and known variance $\sigma^2_X$. If $n$ is large, then $\bar X \mathrel{\dot\sim} N(\mu_X, \sigma^2_X/n)$ by the CLT. To estimate $\mu_X$ to within $\pm m$ with $100(1 - \alpha)\%$ confidence, use sample size
$$n = \left( \frac{z_{\alpha/2}\,\sigma_X}{m} \right)^2.$$
4. Application 2: Let $X_1, \ldots, X_n$ be a random sample from a distribution with unknown mean $\mu_X$ and unknown variance $\sigma^2_X$. If $n$ is large, then $\bar X \mathrel{\dot\sim} N(\mu_X, \sigma^2_X/n)$. The investigator desires to estimate $\mu_X$ to within $\pm m$ with $100(1 - \alpha)\%$ confidence. To make any progress, something must be known about $\sigma_X$. If the likely range of the data is known, then a rough estimate of $\sigma_X$ is the range divided by four. Another approach is to begin data collection and then use $S_X$ to estimate $\sigma_X$ after obtaining several observations. The sample size formula can be used to estimate the number of additional observations that must be taken. The sample size estimate can be updated after collecting more data and re-estimating $\sigma_X$.
5. Application 3: Let $X_1, \ldots, X_n$ be a random sample from a $\mathrm{Bern}(p)$ distribution. If $n$ is large, then $\hat p \mathrel{\dot\sim} N(p, p(1-p)/n)$ by the CLT. To estimate $p$ to within $\pm m$ with $100(1 - \alpha)\%$ confidence, it would appear that we should use sample size
$$n = \left( \frac{z_{\alpha/2}\sqrt{p(1-p)}}{m} \right)^2.$$
The right-hand side above, however, cannot be computed because $p$ is unknown and, therefore, $p(1-p)$ also is unknown. Note that $p(1-p)$ is a quadratic function that varies between 0 (when $p = 0$ or $p = 1$) and 0.25 (when $p = 0.5$). A conservative approach is to use $p(1-p) = 0.25$ in the sample size formula. This approach ensures that the computed sample size is sufficiently large, but in most cases it will be larger than necessary. If it is known that $p > p_0$ or that $p < p_0$, then $p_0$ may be substituted in the sample size formula.
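A minimal Python sketch (not from the notes; the function name and margin are illustrative) of the conservative sample size calculation:

```python
# Conservative sample size for estimating a proportion to within +/- m.
import math
from scipy import stats

def sample_size_for_proportion(m, conf=0.95, p_guess=0.5):
    z = stats.norm.ppf(1 - (1 - conf) / 2)
    return math.ceil((z * math.sqrt(p_guess * (1 - p_guess)) / m) ** 2)

print(sample_size_for_proportion(m=0.03))    # about 1068 for a 3-point margin of error
```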
9.5 Small Sample Confidence Intervals for $\mu_X$
1. Setting: Let $X_1, \ldots, X_n$ be a random sample from $N(\mu_X, \sigma^2_X)$, where neither the mean nor the variance is known. It is desired to construct a $100(1 - \alpha)\%$ confidence interval for $\mu_X$. If $n$ is small, then the large sample procedure will not work well because $S_X$ may differ substantially from $\sigma_X$.
2. Solution: Consider the random variable
$$T = \frac{\bar X - \mu_X}{S_X/\sqrt{n}} = \frac{\bar X - \mu_X}{\sigma_X/\sqrt{n}}\cdot\frac{\sigma_X}{S_X} = \frac{\bar X - \mu_X}{\sigma_X/\sqrt{n}}\Bigg/\left[ \frac{(n-1)S^2_X}{\sigma^2_X}(n-1)^{-1} \right]^{1/2}.$$
Recall that
$$\frac{\bar X - \mu_X}{\sigma_X/\sqrt{n}} \sim N(0,1), \quad \frac{(n-1)S^2_X}{\sigma^2_X} \sim \chi^2_{n-1}, \quad \text{and} \quad \frac{\bar X - \mu_X}{\sigma_X/\sqrt{n}} \perp \frac{(n-1)S^2_X}{\sigma^2_X}.$$
The independence result follows from $\bar X \perp S^2_X$. Accordingly, the random variable $T$ has the same distribution as the ratio $Z/\sqrt{W/(n-1)}$, where $Z \sim N(0,1)$ and $W \sim \chi^2_{n-1}$. This quantity has a $t$ distribution with $n - 1$ degrees of freedom.
3. Solution to the problem: Let $t_{\alpha/2, n-1}$ be the $100(1 - \alpha/2)$ percentile of the $t_{n-1}$ distribution. That is, $F_T^{-1}(1 - \alpha/2) = t_{\alpha/2, n-1}$, where $F_T(\cdot)$ is the cdf of $T$. Then, using the symmetry of the $t$ distribution around 0, it follows that
$$P\!\left[ -t_{\alpha/2,n-1} \le \frac{\bar X - \mu_X}{S_X/\sqrt{n}} \le t_{\alpha/2,n-1} \right] = 1 - \alpha.$$
Algebraic manipulation reveals that
$$P\!\left[ \bar X - t_{\alpha/2,n-1}\frac{S_X}{\sqrt{n}} \le \mu_X \le \bar X + t_{\alpha/2,n-1}\frac{S_X}{\sqrt{n}} \right] = 1 - \alpha.$$
Accordingly, an exact $100(1 - \alpha)\%$ confidence interval for $\mu_X$ is
$$\bar X \pm t_{\alpha/2,n-1}\frac{S_X}{\sqrt{n}}.$$
4. Caution: The above confidence interval is correct if one is sampling from a normal distribution. If sample size is small and skewness or kurtosis is large, then the true confidence can differ substantially from $100(1 - \alpha)\%$.
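A minimal Python sketch (not from the notes; the data are illustrative) computing the exact $t$-based interval above:

```python
# Exact t-based confidence interval for a normal mean.
import numpy as np
from scipy import stats

x = np.array([10.2, 9.7, 11.1, 10.5, 9.9, 10.8])   # illustrative data
n, alpha = len(x), 0.05

t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
half_width = t_crit * x.std(ddof=1) / np.sqrt(n)
print(x.mean() - half_width, x.mean() + half_width)
```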
9.6 The Distribution of T
Recall that if $Z \sim N(0,1)$, $Y \sim \chi^2_k$ and $Z \perp Y$, then
$$T = \frac{Z}{\sqrt{Y/k}} \sim t_k.$$
In this section, we will derive the pdf of $T$.
The strategy that we will use is (a) first find an expression for the cdf of $T$ and then (b) differentiate the cdf to obtain the pdf. The cdf of $T$ is
$$P(T \le t) = F_T(t) = P\!\left( \frac{Z}{\sqrt{Y/k}} \le t \right) = P\!\left( Z \le t\sqrt{Y/k} \right) = \int_0^\infty \int_{-\infty}^{t\sqrt{y/k}} f_Z(z) f_Y(y)\, dz\, dy,$$
using $f_{Z,Y}(z, y) = f_Z(z) f_Y(y)$, which follows from $Y \perp Z$. Using Leibnitz's rule, the pdf of $T$ is
$$f_T(t) = \frac{d}{dt} F_T(t) = \int_0^\infty \frac{d}{dt} \int_{-\infty}^{t\sqrt{y/k}} f_Z(z) f_Y(y)\, dz\, dy = \int_0^\infty \sqrt{y/k}\, f_Z\!\left(t\sqrt{y/k}\right) f_Y(y)\, dy.$$
Substituting the pdfs for $Z$ and $Y$ yields
$$f_T(t) = \int_0^\infty \sqrt{y/k}\, \frac{e^{-\frac{1}{2}\frac{t^2}{k}y}}{\sqrt{2\pi}}\, \frac{y^{\frac{k}{2}-1} e^{-\frac{1}{2}y}}{\Gamma\!\left(\frac{k}{2}\right) 2^{k/2}}\, dy = \int_0^\infty \frac{y^{\frac{k+1}{2}-1} e^{-\frac{1}{2}\left(\frac{t^2}{k}+1\right)y}}{\sqrt{2\pi k}\, \Gamma\!\left(\frac{k}{2}\right) 2^{k/2}}\, dy$$
$$= \frac{\Gamma\!\left(\frac{k+1}{2}\right)}{\sqrt{k\pi}\,\Gamma\!\left(\frac{k}{2}\right)} \left( \frac{t^2}{k} + 1 \right)^{-\frac{k+1}{2}} I_{(-\infty,\infty)}(t).$$
The last integral is evaluated by recognizing the kernel of a gamma distribution. That is,
$$\int_0^\infty \frac{y^{\alpha-1} e^{-\lambda y} \lambda^\alpha}{\Gamma(\alpha)}\, dy = 1 \Longrightarrow \int_0^\infty y^{\alpha-1} e^{-\lambda y}\, dy = \frac{\Gamma(\alpha)}{\lambda^\alpha}.$$
9.7 Pivotal Quantities
1. Definition: A pivotal quantity is a function of a statistic and a parameter. The distribution of the pivotal quantity does not depend on any unknown parameters.
2. How to construct confidence intervals: Suppose that $Q(T; \theta)$ is a pivotal quantity. The distribution of $Q$ is known, so percentiles of $Q$ can be computed. Let $q_1$ and $q_2$ be percentiles that satisfy
$$P[q_1 \le Q(T; \theta) \le q_2] = 1 - \alpha.$$
If $Q(T; \theta)$ is a monotonic increasing or decreasing function of $\theta$ for each realization of $T$, then the inverse function $Q^{-1}[Q(T; \theta)] = \theta$ exists, and
$$P\!\left[ Q^{-1}(q_1) \le \theta \le Q^{-1}(q_2) \right] = 1 - \alpha$$
if $Q(T; \theta)$ is an increasing function of $\theta$, and
$$P\!\left[ Q^{-1}(q_2) \le \theta \le Q^{-1}(q_1) \right] = 1 - \alpha$$
if $Q(T; \theta)$ is a decreasing function of $\theta$.
3. Example 1: Suppose that $X_1, \ldots, X_n$ is a random sample from $N(\mu, \sigma^2)$.
(a) $Q(\bar X, S_X; \mu) = \dfrac{\bar X - \mu}{S_X/\sqrt{n}} \sim t_{n-1}$, which reveals that $Q$ is a pivotal quantity. Note that $T = (\bar X, S_X)$ is two-dimensional. Also, $Q$ is a decreasing function of $\mu$,
$$Q^{-1}(Q) = \bar X - \frac{S_X}{\sqrt{n}} Q = \mu \quad \text{and} \quad P\!\left[ \bar X - \frac{S_X}{\sqrt{n}} q_2 \le \mu \le \bar X - \frac{S_X}{\sqrt{n}} q_1 \right] = 1 - \alpha,$$
where $q_1$ and $q_2$ are appropriate percentiles of the $t_{n-1}$ distribution.
(b) $Q(S^2_X; \sigma^2) = \dfrac{(n-1)S^2_X}{\sigma^2} \sim \chi^2_{n-1}$, which reveals that $Q$ is a pivotal quantity. Furthermore, $Q$ is a decreasing function of $\sigma^2$,
$$Q^{-1}(Q) = \frac{(n-1)S^2_X}{Q} = \sigma^2 \quad \text{and} \quad P\!\left[ \frac{(n-1)S^2_X}{q_2} \le \sigma^2 \le \frac{(n-1)S^2_X}{q_1} \right] = 1 - \alpha,$$
where $q_1$ and $q_2$ are appropriate percentiles of the $\chi^2_{n-1}$ distribution.
4. Example 2: Suppose that $X_1, \ldots, X_n$ is a random sample from $\mathrm{Unif}(0, \theta)$. It is easy to show that $X_{(n)}$ is a sufficient statistic. Note that $X_i/\theta \sim \mathrm{Unif}(0, 1)$. Accordingly, $X_{(n)}/\theta$ is distributed as the largest order statistic from a $\mathrm{Unif}(0, 1)$ distribution. That is, $Q(X_{(n)}; \theta) = X_{(n)}/\theta \sim \mathrm{Beta}(n, 1)$, which reveals that $Q$ is a pivotal quantity. Furthermore, $Q$ is a decreasing function of $\theta$,
$$Q^{-1}(Q) = \frac{X_{(n)}}{Q} = \theta \quad \text{and} \quad P\!\left[ \frac{X_{(n)}}{q_2} \le \theta \le \frac{X_{(n)}}{q_1} \right] = 1 - \alpha,$$
where $q_1$ and $q_2$ are appropriate percentiles of the $\mathrm{Beta}(n, 1)$ distribution.
(a) Note, $q_2 = 1$ is the 100th percentile of $\mathrm{Beta}(n, 1)$ and $q_1 = \alpha^{1/n}$ is the $100\alpha$ percentile of $\mathrm{Beta}(n, 1)$. Accordingly, a $100(1 - \alpha)\%$ confidence interval for $\theta$ can be based on
$$P\!\left[ X_{(n)} \le \theta \le \frac{X_{(n)}}{\alpha^{1/n}} \right] = 1 - \alpha.$$
(b) Note, $q_1 = 0$ is the 0th percentile of $\mathrm{Beta}(n, 1)$ and $q_2 = (1 - \alpha)^{1/n}$ is the $100(1 - \alpha)$ percentile of $\mathrm{Beta}(n, 1)$. Accordingly, a $100(1 - \alpha)\%$ one-sided confidence interval for $\theta$ can be based on
$$P\!\left[ \theta \ge \frac{X_{(n)}}{(1 - \alpha)^{1/n}} \right] = 1 - \alpha.$$
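A minimal Python sketch (not from the notes; the parameter values are illustrative) of the interval in Example 2(a):

```python
# Pivotal confidence interval for theta from Unif(0, theta) data, using X_(n)/theta ~ Beta(n, 1).
import numpy as np

rng = np.random.default_rng(4)
theta_true, n, alpha = 7.0, 15, 0.05

x = rng.uniform(0, theta_true, size=n)
x_max = x.max()
lower, upper = x_max, x_max / alpha ** (1 / n)
print(lower, upper)                           # covers theta_true about 95% of the time
```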
9.8 Estimating a Mean Difference
1. Setting: Suppose that $T_{1,n_1} \mathrel{\dot\sim} N(\theta_1, \sigma^2_1/n_1)$; $T_{2,n_2} \mathrel{\dot\sim} N(\theta_2, \sigma^2_2/n_2)$; and $T_{1,n_1} \perp T_{2,n_2}$. The goal is to construct a confidence interval for $\theta_1 - \theta_2$. Note that
$$T_{1,n_1} - T_{2,n_2} \mathrel{\dot\sim} N\!\left( \theta_1 - \theta_2,\ \frac{\sigma^2_1}{n_1} + \frac{\sigma^2_2}{n_2} \right).$$
If $W^2_1$ and $W^2_2$ are consistent estimators of $\sigma^2_1$ and $\sigma^2_2$ (i.e., $W^2_1 \xrightarrow{\text{prob}} \sigma^2_1$ and $W^2_2 \xrightarrow{\text{prob}} \sigma^2_2$), then
$$\frac{(T_{1,n_1} - T_{2,n_2}) - (\theta_1 - \theta_2)}{\sqrt{\dfrac{W^2_1}{n_1} + \dfrac{W^2_2}{n_2}}} \mathrel{\dot\sim} N(0, 1).$$
A large sample $100(1 - \alpha)\%$ confidence interval for $\theta_1 - \theta_2$ can be based on
$$P\!\left[ T_{1,n_1} - T_{2,n_2} - z_{\alpha/2}\,\mathrm{SE} \le \theta_1 - \theta_2 \le T_{1,n_1} - T_{2,n_2} + z_{\alpha/2}\,\mathrm{SE} \right] \approx 1 - \alpha,$$
where $\mathrm{SE} = \mathrm{SE}(T_{1,n_1} - T_{2,n_2}) = \sqrt{\dfrac{W^2_1}{n_1} + \dfrac{W^2_2}{n_2}}$.
2. Application 1: Suppose that $X_{11}, X_{12}, \ldots, X_{1n_1}$ is a random sample from a population having mean $\mu_1$ and variance $\sigma^2_1$ and that $X_{21}, X_{22}, \ldots, X_{2n_2}$ is an independent random sample from a population having mean $\mu_2$ and variance $\sigma^2_2$. Then
$$\frac{(\bar X_1 - \bar X_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{S^2_1}{n_1} + \dfrac{S^2_2}{n_2}}} \mathrel{\dot\sim} N(0, 1).$$
A large sample $100(1 - \alpha)\%$ confidence interval for $\mu_1 - \mu_2$ can be based on
$$P\!\left[ \bar X_1 - \bar X_2 - z_{\alpha/2}\sqrt{\frac{S^2_1}{n_1} + \frac{S^2_2}{n_2}} \le \mu_1 - \mu_2 \le \bar X_1 - \bar X_2 + z_{\alpha/2}\sqrt{\frac{S^2_1}{n_1} + \frac{S^2_2}{n_2}} \right] \approx 1 - \alpha.$$
3. Application 2: Suppose that $X_{11}, X_{12}, \ldots, X_{1n_1}$ is a random sample from $\mathrm{Bern}(p_1)$ and that $X_{21}, X_{22}, \ldots, X_{2n_2}$ is an independent random sample from $\mathrm{Bern}(p_2)$. Then
$$\frac{(\hat p_1 - \hat p_2) - (p_1 - p_2)}{\sqrt{\dfrac{\hat p_1(1 - \hat p_1)}{n_1} + \dfrac{\hat p_2(1 - \hat p_2)}{n_2}}} \mathrel{\dot\sim} N(0, 1).$$
A large sample $100(1 - \alpha)\%$ confidence interval for $p_1 - p_2$ can be based on
$$P\!\left[ \hat p_1 - \hat p_2 - z_{\alpha/2}\,\mathrm{SE} \le p_1 - p_2 \le \hat p_1 - \hat p_2 + z_{\alpha/2}\,\mathrm{SE} \right] \approx 1 - \alpha, \text{ where}$$
$$\mathrm{SE} = \mathrm{SE}(\hat p_1 - \hat p_2) = \sqrt{\frac{\hat p_1(1 - \hat p_1)}{n_1} + \frac{\hat p_2(1 - \hat p_2)}{n_2}}.$$
9.9 Estimating Variability
Most of this section is a review of earlier material. The only new material is
concerned with the distribution of the sample range when sampling from N(0, 1).
1. Let $X_1, X_2, \ldots, X_n$ be a random sample from $N(\mu, \sigma^2)$. Define $Z_i$ as $Z_i = (X_i - \mu)/\sigma$. Then $Z_i \sim \text{iid } N(0, 1)$. Note that the joint distribution of $Z_1, \ldots, Z_n$ does not depend on $\mu$ or $\sigma$. Accordingly, the distribution of
$$R_Z = Z_{(n)} - Z_{(1)} = \frac{(X_{(n)} - \mu) - (X_{(1)} - \mu)}{\sigma} = \frac{X_{(n)} - X_{(1)}}{\sigma} = \frac{R_X}{\sigma}$$
does not depend on $\mu$ or $\sigma$. That is, $R_X/\sigma$ is a pivotal quantity.
2. Percentiles of $W = R_X/\sigma$ for various sample sizes are listed in Table XIII. They can be used to make probability statements such as
$$P(w_1 \le W \le w_2) = 1 - \alpha,$$
where $w_1$ and $w_2$ are appropriate percentiles of the distribution of $W$. Note that $W$ is a decreasing function of $\sigma$ and that $W^{-1}(W) = (X_{(n)} - X_{(1)})/W = \sigma$. Therefore, confidence intervals can be based on
$$P\!\left[ \frac{X_{(n)} - X_{(1)}}{w_2} \le \sigma \le \frac{X_{(n)} - X_{(1)}}{w_1} \right] = 1 - \alpha.$$
3. Table XIII also gives $E(W)$ and $\sigma_W$. These values can be used to obtain a point estimator of $\sigma$ and to compute the standard error of the estimator. The point estimator is
$$\hat\sigma = \frac{X_{(n)} - X_{(1)}}{E(W)}.$$
The estimator is unbiased for $\sigma$ because
$$E(\hat\sigma) = \frac{E(R_X)}{E(W)} = \frac{E(R_X)}{E(R_X/\sigma)} = \sigma.$$
The variance of $\hat\sigma$ is
$$\mathrm{Var}(\hat\sigma) = \mathrm{Var}\!\left( \frac{R_X}{E(W)} \right) = \mathrm{Var}\!\left( \frac{\sigma W}{E(W)} \right) = \frac{\sigma^2\,\mathrm{Var}(W)}{[E(W)]^2}.$$
Accordingly,
$$\mathrm{SE}(\hat\sigma) = \frac{\hat\sigma\,\sigma_W}{E(W)}$$
is an estimator of $\sqrt{\mathrm{Var}(\hat\sigma)}$.
9.10 Deriving Estimators
1. Method of Moments
(a) Setting: Suppose that $X_1, X_2, \ldots, X_n$ is a random sample from $f_X(x|\theta)$, where $\theta$ is a $k \times 1$ vector of parameters. The goal is to derive an estimator of $\theta$.
(b) Let $M'_j$ be the $j$th sample moment about the origin and let $\mu'_j$ be the corresponding population moment. That is,
$$M'_j = \frac{1}{n}\sum_{i=1}^n X_i^j \quad \text{and} \quad \mu'_j = \mu'_j(\theta) = E(X^j)$$
for $j = 1, 2, \ldots$. The population moments are denoted by $\mu'_j(\theta)$ because they are functions of the components of $\theta$.
(c) The method of moments estimator consists of equating sample moments to population moments and solving for $\theta$. That is, solve
$$M'_j = \mu'_j(\hat\theta), \quad j = 1, \ldots, k,$$
for $\hat\theta$.
(d) Central Moments: It sometimes is more convenient to use central moments. For the special case of $k = 2$, solve $M_j = \mu_j(\hat\theta)$, $j = 1, 2$, where
$$M_1 = M'_1 = \frac{1}{n}\sum_{i=1}^n X_i; \quad M_2 = S^2_X = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar X)^2;$$
$$\mu_1 = \mu'_1 = E(X); \quad \text{and} \quad \mu_2 = \sigma^2_X = E(X - \mu_1)^2.$$
(e) Example: Gamma Distribution. Suppose $X_1, \ldots, X_n$ is a random sample from $\mathrm{Gamma}(\alpha, \lambda)$. The moments about the origin are
$$\mu'_j(\theta) = E(X^j) = \frac{\Gamma(\alpha + j)}{\Gamma(\alpha)\lambda^j}.$$
The first two central moments are
$$\mu_1(\theta) = E(X) = \frac{\alpha}{\lambda} \quad \text{and} \quad \mu_2(\theta) = \mathrm{Var}(X) = \frac{\alpha}{\lambda^2}.$$
The method of moments estimators of $\alpha$ and $\lambda$ are obtained by solving
$$\bar X = \frac{\hat\alpha}{\hat\lambda} \quad \text{and} \quad S^2_X = \frac{\hat\alpha}{\hat\lambda^2}$$
for $\hat\alpha$ and $\hat\lambda$. The solutions are
$$\hat\alpha = \frac{\bar X^2}{S^2_X} \quad \text{and} \quad \hat\lambda = \frac{\bar X}{S^2_X}.$$
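A minimal Python sketch (not from the notes; the rate parameterization and true values are assumptions) of the gamma method-of-moments estimators:

```python
# Method-of-moments estimates for Gamma(alpha, lambda) with the rate parameterization.
import numpy as np

rng = np.random.default_rng(5)
alpha_true, lam_true = 3.0, 2.0
x = rng.gamma(shape=alpha_true, scale=1 / lam_true, size=5_000)

xbar, s2 = x.mean(), x.var(ddof=1)
alpha_hat, lam_hat = xbar**2 / s2, xbar / s2
print(alpha_hat, lam_hat)                    # close to (3, 2) for a sample this size
```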
(f) Example: Beta Distribution. Suppose $X_1, \ldots, X_n$ is a random sample from $\mathrm{Beta}(\theta_1, \theta_2)$. The moments about the origin are
$$\mu'_j(\theta) = E(X^j) = \frac{B(\theta_1 + j, \theta_2)}{B(\theta_1, \theta_2)} = \frac{\Gamma(\theta_1 + j)\Gamma(\theta_1 + \theta_2)}{\Gamma(\theta_1)\Gamma(\theta_1 + \theta_2 + j)}.$$
The first two central moments are
$$\mu_1(\theta) = E(X) = \frac{\theta_1}{\theta_1 + \theta_2} \quad \text{and} \quad \mu_2(\theta) = \mathrm{Var}(X) = \frac{\theta_1\theta_2}{(\theta_1 + \theta_2)^2(\theta_1 + \theta_2 + 1)}.$$
The method of moments estimators of $\theta_1$ and $\theta_2$ are obtained by solving
$$\bar X = \frac{\hat\theta_1}{\hat\theta_1 + \hat\theta_2} \quad \text{and} \quad S^2_X = \frac{\hat\theta_1\hat\theta_2}{(\hat\theta_1 + \hat\theta_2)^2(\hat\theta_1 + \hat\theta_2 + 1)}$$
for $\hat\theta_1$ and $\hat\theta_2$. The solutions are
$$\hat\theta_1 = \bar X\left[ \frac{\bar X(1 - \bar X)}{S^2_X} - 1 \right] \quad \text{and} \quad \hat\theta_2 = (1 - \bar X)\left[ \frac{\bar X(1 - \bar X)}{S^2_X} - 1 \right].$$
2. Maximum Likelihood Estimators (MLEs)
(a) Setting: Suppose that $X_1, X_2, \ldots, X_n$ is a random sample from $f_X(x|\theta)$, where $\theta$ is a $k \times 1$ vector of parameters. The goal is to derive an estimator of $\theta$.
(b) Definition: A maximum likelihood estimator (MLE) of $\theta$ is any value $\hat\theta$ that maximizes the likelihood function and is a point in the parameter space or on the boundary of the parameter space.
(c) If the likelihood function is a differentiable function of $\theta$, then the maximum likelihood estimator is a solution to
$$\frac{\partial L(\theta|X)}{\partial\theta} = 0.$$
(d) Note, any maximizer of $L(\theta|X)$ also is a maximizer of $\ln[L(\theta|X)]$. Accordingly, one can maximize the log likelihood function rather than the likelihood function.
(e) Example: Suppose that $X_1, \ldots, X_n$ is a random sample from $\mathrm{Expon}(\lambda)$. The log likelihood function is
$$\ln[L(\lambda|X)] = n\ln(\lambda) - \lambda\sum_{i=1}^n X_i.$$
Taking the derivative with respect to $\lambda$, setting it to zero, and solving for $\lambda$ yields $\hat\lambda = 1/\bar X$.
(f) Example: Suppose that $X_1, \ldots, X_n$ is a random sample from $\mathrm{Unif}(0, \theta)$. The likelihood function is
$$L(\theta|X_{(n)}) = \frac{1}{\theta^n}\, I_{(X_{(n)},\infty)}(\theta).$$
Plotting the likelihood function reveals that $\hat\theta = X_{(n)}$ is the MLE. Note, taking derivatives does not work in this case.
(g) Example: Suppose that $X_1, \ldots, X_n$ is a random sample from $N(\mu, \sigma^2)$. The log likelihood function is
$$\ln\!\left[ L(\mu, \sigma^2|\bar X, S^2_X) \right] = -\frac{n}{2}\ln(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (X_i - \bar X)^2 - \frac{n}{2\sigma^2}(\bar X - \mu)^2.$$
Taking the derivatives with respect to $\mu$ and $\sigma^2$ and setting them to zero yields two equations to solve:
$$\frac{n}{\sigma^2}(\bar X - \mu) = 0 \quad \text{and} \quad -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^n (X_i - \bar X)^2 + \frac{n}{2\sigma^4}(\bar X - \mu)^2 = 0.$$
Solving the first equation for $\mu$ yields $\hat\mu = \bar X$. Substituting $\hat\mu = \bar X$ into the second equation and solving for $\sigma^2$ yields
$$\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar X)^2.$$
(h) Invariance property of MLEs: Let $g(\theta)$ be a function of $\theta$. Then the MLE of $g(\theta)$ is $g(\hat\theta)$, where $\hat\theta$ is the MLE of $\theta$.
Proof: We will prove the invariance property only for the special case in which $g(\theta)$ is a one-to-one function. Note, if the dimension of $\theta$ is $k$, then the dimension of $g(\theta)$ also must be $k$. Let $\eta = g(\theta)$. Then $\theta = g^{-1}(\eta)$ because $g$ is one-to-one. Define $L^*(\eta|X)$ as the likelihood function when $g(\theta) = \eta$. That is,
$$L^*(\eta|X) = f_X[X|g^{-1}(\eta)] = L[g^{-1}(\eta)|X], \quad \text{where } g^{-1}(\eta) = \theta.$$
Note that
$$\max_\eta L^*(\eta|X) = \max_\eta L[g^{-1}(\eta)|X] = \max_\theta L[\theta|X].$$
That is, the maximized likelihood is the same whether one maximizes $L^*$ with respect to $\eta$ or maximizes $L$ with respect to $\theta$. Accordingly, if $\hat\theta$ maximizes the likelihood function $L(\theta|X)$, then $\hat\eta = g(\hat\theta)$ maximizes the likelihood function $L^*(\eta|X)$.
Example: Suppose that $X_1, X_2, \ldots, X_n$ is a random sample from $N(\mu, \sigma^2)$. Find the MLE of the coefficient of variation $g(\mu, \sigma^2) = 100\sigma/\mu$. Solution:
$$\hat g = \frac{100\,\hat\sigma}{\bar X}, \quad \text{where } \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar X)^2.$$
(i) Properties of MLEs: Under certain regularity conditions, it can be shown (we will not do so) that
i. MLEs are consistent,
ii. MLEs have asymptotic normal distributions, and
iii. MLEs are functions of sufficient statistics. This property is easy to prove because the likelihood function depends on the data solely through a sufficient statistic.
3. Rao-Blackwell Theorem: If $U(X)$ is unbiased for $\theta$ (a scalar) and $T(X)$ is sufficient, then $V = V(T) = E(U|T)$ is
(a) a statistic,
(b) unbiased for $\theta$, and
(c) $\mathrm{Var}(V) \le \mathrm{Var}(U)$, with equality if and only if $U$ is a function of $T$.
Proof.
(a) $V$ is a statistic because the distribution of $X$ conditional on $T$ does not depend on $\theta$. Accordingly, the expectation of $U$ with respect to the distribution of $X$ conditional on $T$ does not depend on $\theta$.
(b) Note, $E(U) = \theta$ because $U$ is unbiased. Now use iterated expectation:
$$\theta = E(U) = E_T[E(U|T)] = E_T(V) = E(V).$$
(c) Use iterated variance:
$$\mathrm{Var}(U) = E[\mathrm{Var}(U|T)] + \mathrm{Var}[E(U|T)] = E[\mathrm{Var}(U|T)] + \mathrm{Var}(V) \ge \mathrm{Var}(V).$$
Example: Suppose that $X_1, X_2, \ldots, X_n$ is a random sample of size $n$ from $\mathrm{Poi}(\lambda)$. The goal is to find a good unbiased estimator of $P(X = 0) = e^{-\lambda}$. Recall that $T = \sum_{i=1}^n X_i$ is sufficient and that $T \sim \mathrm{Poi}(n\lambda)$. Consider
$$U = I_{\{0\}}(X_1) = \begin{cases} 1 & \text{if } X_1 = 0\\ 0 & \text{if } X_1 \ge 1.\end{cases}$$
The support of $U$ is $\{0, 1\}$ and the expectation of $U$ is
$$E(U) = 1\cdot P(U = 1) + 0\cdot P(U = 0) = P(X_1 = 0) = e^{-\lambda}.$$
Thus, $U$ is unbiased for $e^{-\lambda}$. To find a better estimator, use the Rao-Blackwell theorem. It was shown on page 75 of these notes that the conditional distribution of $X_1, X_2, \ldots, X_n$ given $T = t$ is
$$P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n|T = t) = \binom{t}{x_1, x_2, \ldots, x_n}\left(\frac{1}{n}\right)^{x_1}\left(\frac{1}{n}\right)^{x_2}\cdots\left(\frac{1}{n}\right)^{x_n}.$$
That is, given $T = t$, the $X$s have a multinomial distribution with $t$ trials, $n$ categories, and probability $1/n$ for each category. Note that the conditional distribution of the data given $T$ does not depend on $\lambda$. This is because $T$ is sufficient. The conditional distribution of $X_1$ given $T = t$ is binomial:
$$X_1|(T = t) \sim \mathrm{Bin}\!\left(t, \frac{1}{n}\right).$$
The expectation of $U$ given $T = t$ is
$$E(U|T = t) = 1\cdot P(U = 1|T = t) + 0\cdot P(U = 0|T = t) = P(X_1 = 0|T = t) = \binom{t}{0}\left(\frac{1}{n}\right)^0\left(1 - \frac{1}{n}\right)^{t-0} = \left(1 - \frac{1}{n}\right)^t.$$
Accordingly, an unbiased estimator of $e^{-\lambda}$ that has smaller variance than $U$ is
$$E(U|T) = \left(1 - \frac{1}{n}\right)^T.$$
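A minimal Python sketch (not from the notes; the parameter values are illustrative) comparing the crude estimator with its Rao-Blackwellized version:

```python
# Compare U = 1{X1 = 0} with the Rao-Blackwellized estimator (1 - 1/n)^T for Poisson data.
import numpy as np

rng = np.random.default_rng(6)
lam, n, reps = 1.5, 10, 100_000

x = rng.poisson(lam, size=(reps, n))
u = (x[:, 0] == 0).astype(float)             # unbiased but noisy
v = (1 - 1 / n) ** x.sum(axis=1)             # E(U | T), also unbiased

print(np.exp(-lam))                          # target e^{-lambda}
print(u.mean(), v.mean())                    # both close to the target
print(u.var(), v.var())                      # v has much smaller variance
```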
9.11 Bayes Estimators
1. Setting: Suppose that we have (a) data $X_1, \ldots, X_n$, (b) the corresponding likelihood function $L(\theta|T)$, where $T$ is sufficient, and (c) a prior distribution for $\Theta$, $g_\Theta(\theta)$. Then, if we are skillful, we can find the posterior $g_{\Theta|T}(\theta|T)$. We would like to find point and interval estimators of $\theta$.
2. Point Estimators: In general, we can use some characteristic of the posterior distribution as our point estimator. Suitable candidates are the mean, median, or mode. How do we choose which one to use?
(a) Loss Function: Suppose that we can specify a loss function that describes the penalty for missing the mark when estimating $\theta$. Denote our estimator as $a$ or $a(t)$ because it will depend on $t$. Two possible loss functions are
$$\ell_1(\theta, a) = |\theta - a| \quad \text{and} \quad \ell_2(\theta, a) = (\theta - a)^2.$$
(b) Bayes Estimator: Recall that $\Theta$ is a random variable. A posterior estimator, $a$, is a Bayes estimator with loss function $\ell$ if $a$ minimizes the posterior expected loss (Bayes loss):
$$\text{Bayes Loss} = B(a) = E_{\Theta|T}[\ell(\Theta, a)].$$
(c) From prior results (see page 24 of these notes), the Bayes estimator for loss $\ell_1$ is known to be the median of the posterior distribution. Recall that $E(X)$ is the minimizer of $E(X - a)^2$ with respect to $a$. Accordingly, the Bayes estimator for loss $\ell_2$ is the mean of the posterior distribution.
(d) Example: Suppose that $X_1, X_2, \ldots, X_n$ is a random sample from $\mathrm{Bern}(\theta)$. Recall that $T = \sum_{i=1}^n X_i$ is sufficient. Furthermore, suppose that the prior on $\Theta$ is $\mathrm{Beta}(\alpha_1, \alpha_2)$. Then the posterior, conditional on $T = t$, is $\mathrm{Beta}(\alpha_1 + t, \alpha_2 + n - t)$. The mean of the posterior is
$$E(\Theta|T = t) = \frac{\alpha_1 + t}{\alpha_1 + \alpha_2 + n}.$$
If $n = 10$, $\alpha_1 = \alpha_2 = 1$, and $t = 6$, then
$$E(\Theta|T = t) = \frac{7}{12} = 0.5833$$
and the median of the $\mathrm{Beta}(7, 5)$ distribution is 0.5881, obtained by using Matlab's betainv function.
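A minimal Python sketch (not from the notes) reproducing these two Bayes estimates with scipy instead of Matlab's betainv:

```python
# Posterior mean and median from the Bernoulli example above.
from scipy import stats

a1, a2, n, t = 1, 1, 10, 6
posterior = stats.beta(a1 + t, a2 + n - t)    # Beta(7, 5)

print(posterior.mean())      # 7/12, Bayes estimate under squared-error loss
print(posterior.median())    # about 0.588, Bayes estimate under absolute-error loss
```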
3. Interval Estimator: Use the posterior distribution to find lower and upper limits, $h_1$ and $h_2$, such that
$$P(h_1 \le \Theta \le h_2|T = t) = 1 - \alpha.$$
The above interval is a Bayesian $100(1 - \alpha)\%$ confidence interval. In the statistical literature, this interval usually is called a credibility interval. Unlike the frequentist confidence interval, the credibility interval is interpreted as a probability. That is, we say that $\Theta$ is contained in the interval with probability $1 - \alpha$. This is the interpretation that 216 students often give to frequentist intervals, and we give them no credit when they do so.
Example 1: In the binomial example, the posterior distribution of $\Theta$ is $\mathrm{Beta}(7, 5)$. Using Matlab's betainv function, a 95% Bayesian confidence interval is
$$P(0.3079 \le \Theta \le 0.8325) = 0.95.$$
Example 2: Suppose that $X_1, \ldots, X_n$ is a random sample from $N(\theta, \sigma^2)$, where $\sigma^2$ is known. If the prior on $\Theta$ is $N(\mu, \tau^2)$, then the posterior distribution is
$$\Theta|\bar x \sim N\!\left( \frac{\lambda_{\bar x}}{\lambda_{\bar x} + \lambda_\Theta}\bar x + \frac{\lambda_\Theta}{\lambda_{\bar x} + \lambda_\Theta}\mu,\ (\lambda_{\bar x} + \lambda_\Theta)^{-1} \right),$$
where the precisions are
$$\lambda_{\bar x} = \frac{n}{\sigma^2} \quad \text{and} \quad \lambda_\Theta = \frac{1}{\tau^2}.$$
A 95% Bayesian confidence interval for $\theta$ is
$$P\!\left( \tilde\mu - 1.96\sqrt{(\lambda_{\bar x} + \lambda_\Theta)^{-1}} \le \Theta \le \tilde\mu + 1.96\sqrt{(\lambda_{\bar x} + \lambda_\Theta)^{-1}} \,\Big|\, \bar X = \bar x \right) = 0.95, \quad \text{where} \quad \tilde\mu = \frac{\lambda_{\bar x}}{\lambda_{\bar x} + \lambda_\Theta}\bar x + \frac{\lambda_\Theta}{\lambda_{\bar x} + \lambda_\Theta}\mu.$$
As $\tau$ increases (indicating less and less a priori knowledge about $\theta$), the Bayesian confidence interval approaches
$$P\!\left( \bar x - 1.96\frac{\sigma}{\sqrt{n}} \le \Theta \le \bar x + 1.96\frac{\sigma}{\sqrt{n}} \,\Big|\, \bar X = \bar x \right) = 0.95.$$
This is interpreted as a fixed interval that has probability 0.95 of capturing the random variable $\Theta$. Note that the above Bayesian credibility interval is identical to the usual 95% frequentist confidence interval for $\theta$ when $\sigma^2$ is known.
9.12 Efficiency
1. This section is concerned with optimal estimators. For example, suppose that we are interested in estimating $g(\theta)$, for some function $g$. The question to be addressed is: how do we know if we have the best possible estimator? A partial answer is given by the Cramer-Rao inequality. It gives a lower bound on the variance of an unbiased estimator. If our estimator attains the Cramer-Rao lower bound, then we know that we have the best unbiased estimator.
2. Cramer-Rao Inequality (Information Inequality): Suppose that the joint pdf (pmf) of $X_1, X_2, \ldots, X_n$ is $f_X(x|\theta)$, where $\theta$ is a scalar and the support of $X$ does not depend on $\theta$. Furthermore, suppose that the statistic $T(X)$ is an unbiased estimator of a differentiable function of $\theta$. That is, $E(T) = g(\theta)$. Then, under mild regularity conditions,
$$\mathrm{Var}(T) \ge \frac{\left[ \dfrac{\partial g(\theta)}{\partial\theta} \right]^2}{I_\theta}, \quad \text{where} \quad I_\theta = E\!\left[ \left( \frac{\partial \ln f_X(X|\theta)}{\partial\theta} \right)^2 \right].$$
The quantity $I_\theta$ is called Fisher's information and it is an index of the amount of information that $X$ has about $\theta$.
Proof. Define the random variable $S$ as
$$S = S(X, \theta) = \frac{\partial \ln f_X(X|\theta)}{\partial\theta} = \frac{1}{f_X(X|\theta)}\frac{\partial f_X(X|\theta)}{\partial\theta}.$$
This quantity is called the score function (not in your text). Your text denotes the score function by $W$.
(a) Result: The expected value of the score function is zero.
Proof: This result can be shown by interchanging integration and differentiation (justified if the regularity conditions are satisfied):
$$E(S) = \int S(x, \theta) f_X(x|\theta)\, dx = \int \frac{1}{f_X(x|\theta)}\frac{\partial f_X(x|\theta)}{\partial\theta} f_X(x|\theta)\, dx = \int \frac{\partial f_X(x|\theta)}{\partial\theta}\, dx = \frac{\partial}{\partial\theta}\int f_X(x|\theta)\, dx = \frac{\partial}{\partial\theta} 1 = 0,$$
because the integral of the joint pdf over the entire sample space is 1. Substitute summation for integration if the random variables are discrete.
(b) Result: The variance of $S$ is $I_\theta$.
Proof: This result follows from the first result and from the definition of $I_\theta$:
$$\mathrm{Var}(S) = E(S^2) - [E(S)]^2 = E(S^2) = I_\theta.$$
(c) Result: The covariance between $S$ and $T$ is
$$\mathrm{Cov}(S, T) = \frac{\partial g(\theta)}{\partial\theta}.$$
Proof: To verify this result, again we will interchange integration and differentiation. First, note that $\mathrm{Cov}(S, T) = E(ST) - E(S)E(T) = E(ST)$ because $E(S) = 0$. Accordingly,
$$\mathrm{Cov}(S, T) = E(ST) = \int S(x, \theta)T(x) f_X(x|\theta)\, dx = \int \frac{1}{f_X(x|\theta)}\frac{\partial f_X(x|\theta)}{\partial\theta} T(x) f_X(x|\theta)\, dx = \int \frac{\partial f_X(x|\theta)}{\partial\theta} T(x)\, dx = \frac{\partial}{\partial\theta}\int f_X(x|\theta) T(x)\, dx = \frac{\partial E(T)}{\partial\theta} = \frac{\partial g(\theta)}{\partial\theta}.$$
(d) Result: Cramer-Rao Inequality:
$$\mathrm{Var}(T) \ge \frac{\left[ \partial g(\theta)/\partial\theta \right]^2}{I_\theta}.$$
The right-hand side of the above equation is called the Cramer-Rao Lower Bound (CRLB). That is,
$$\mathrm{CRLB} = \frac{\left[ \partial g(\theta)/\partial\theta \right]^2}{I_\theta}.$$
Proof: If $\rho$ is a correlation coefficient, then from the Cauchy-Schwartz inequality it is known that $0 \le \rho^2 \le 1$. Accordingly,
$$\rho^2_{S,T} = \frac{[\mathrm{Cov}(S, T)]^2}{\mathrm{Var}(S)\,\mathrm{Var}(T)} \le 1 \Longrightarrow \mathrm{Var}(T) \ge \frac{[\mathrm{Cov}(S, T)]^2}{\mathrm{Var}(S)} = \frac{\left[ \partial g(\theta)/\partial\theta \right]^2}{I_\theta}.$$
(e) If $g(\theta) = \theta$, then the inequality simplifies to
$$\mathrm{Var}(T) \ge \frac{1}{I_\theta} \quad \text{because} \quad \frac{\partial\theta}{\partial\theta} = 1.$$
(f) Note: if $X_1, X_2, \ldots, X_n$ are iid, then the score function can be written as
$$S(X|\theta) = \frac{\partial}{\partial\theta}\sum_{i=1}^n \ln f_X(X_i|\theta) = \sum_{i=1}^n \frac{\partial \ln f_X(X_i|\theta)}{\partial\theta} = \sum_{i=1}^n S_i(X_i, \theta), \quad \text{where} \quad S_i = \frac{\partial \ln f_X(X_i|\theta)}{\partial\theta}$$
is the score function for $X_i$. The score functions $S_i$ for $i = 1, \ldots, n$ are iid, each with mean zero. Accordingly,
$$\mathrm{Var}(S) = \mathrm{Var}\!\left( \sum_{i=1}^n S_i \right) = \sum_{i=1}^n \mathrm{Var}(S_i) \text{ by independence} = \sum_{i=1}^n E(S_i^2) = nE(S_1^2),$$
where $S_1$ is the score function for $X_1$.
3. Example: Suppose that $X_1, X_2, \ldots, X_n$ is a random sample from $\mathrm{Poi}(\lambda)$. The score function for a single $X$ is
$$S(X_i, \lambda) = \frac{\partial[-\lambda + X_i\ln(\lambda) - \ln(X_i!)]}{\partial\lambda} = -1 + \frac{X_i}{\lambda}.$$
Accordingly, the information is
$$I_\lambda = n\,\mathrm{Var}\!\left(-1 + \frac{X_i}{\lambda}\right) = n\,\frac{\mathrm{Var}(X_i)}{\lambda^2} = \frac{n\lambda}{\lambda^2} = \frac{n}{\lambda}.$$
Suppose that the investigator would like to estimate $g(\lambda) = \lambda$. The MLE of $\lambda$ is $\bar X$ (homework) and $E(\bar X) = \lambda$, so the MLE is unbiased. The variance of a Poisson random variable is $\lambda$ and therefore $\mathrm{Var}(\bar X) = \lambda/n$. The CRLB for estimating $\lambda$ is
$$\mathrm{CRLB} = \frac{1^2}{n/\lambda} = \frac{\lambda}{n}.$$
Therefore, the MLE attains the CRLB.
4. Efficiency: The efficiency of an unbiased estimator of $g(\theta)$ is the ratio of the CRLB to the variance of the estimator. That is, suppose that $T$ is an unbiased estimator of $g(\theta)$. Then the efficiency of $T$ is
$$\text{Efficiency} = \frac{\mathrm{CRLB}}{\mathrm{Var}(T)} = \frac{\left[ \partial g(\theta)/\partial\theta \right]^2}{I_\theta\,\mathrm{Var}(T)}.$$
If this ratio is one, then the estimator is said to be efficient. Efficiency always is less than or equal to one.
5. Exponential Family Results: Recall, if the distribution of $X$ belongs to the one-parameter exponential family and $X_1, X_2, \ldots, X_n$ is a random sample, then the joint pdf (pmf) is
$$f_X(X|\theta) = [B(\theta)]^n \left[ \prod_{i=1}^n h(X_i) \right] \exp\!\left\{ Q(\theta)\sum_{i=1}^n R(X_i) \right\}.$$
(a) The score function is
$$S(T, \theta) = n\frac{\partial \ln B(\theta)}{\partial\theta} + T\frac{\partial Q(\theta)}{\partial\theta}, \quad \text{where} \quad T = \sum_{i=1}^n R(X_i).$$
(b) Note that the score function is a linear function of $T$:
$$S = a + bT, \quad \text{where} \quad a = n\frac{\partial \ln B(\theta)}{\partial\theta} \quad \text{and} \quad b = \frac{\partial Q(\theta)}{\partial\theta}.$$
(c) Recall that $E(S) = 0$. It follows that
$$n\frac{\partial \ln B(\theta)}{\partial\theta} + E(T)\frac{\partial Q(\theta)}{\partial\theta} = 0 \quad \text{and} \quad E(T) = -n\frac{\partial \ln B(\theta)}{\partial\theta}\left[ \frac{\partial Q(\theta)}{\partial\theta} \right]^{-1}.$$
(d) Result: Suppose that $g(\theta)$ is chosen to be
$$g(\theta) = E(T) = -n\frac{\partial \ln B(\theta)}{\partial\theta}\left[ \frac{\partial Q(\theta)}{\partial\theta} \right]^{-1}.$$
Then
$$\mathrm{Var}(T) = \frac{\left[ \partial g(\theta)/\partial\theta \right]^2}{I_\theta} = \mathrm{CRLB}$$
and $T$ is the minimum variance unbiased estimator of $g(\theta)$.
Proof: First, note that $T$ is unbiased for $E(T)$. Second, note that
$$S = a + bT \Longrightarrow \rho^2_{S,T} = 1 \Longrightarrow \frac{\left[ \partial g(\theta)/\partial\theta \right]^2}{\mathrm{Var}(T)\, I_\theta} = 1 \Longrightarrow \mathrm{Var}(T) = \frac{\left[ \partial g(\theta)/\partial\theta \right]^2}{I_\theta} = \mathrm{CRLB}.$$
6. Example: Suppose that $X_1, X_2, \ldots, X_n$ is a random sample from $\mathrm{Geom}(\theta)$. The pmf of $X_i$ is
$$f_X(x_i|\theta) = \theta(1 - \theta)^{x_i - 1} I_{\{1,2,\ldots\}}(x_i) = \frac{\theta}{1 - \theta}\, I_{\{1,2,\ldots\}}(x_i)\, \exp\{\ln(1 - \theta)\, x_i\}.$$
Accordingly, the distribution of $X$ belongs to the exponential family with
$$B(\theta) = \frac{\theta}{1 - \theta}; \quad h(x_i) = I_{\{1,2,\ldots\}}(x_i); \quad Q(\theta) = \ln(1 - \theta); \quad \text{and} \quad R(x_i) = x_i.$$
The score function for the entire sample is
$$S(X, \theta) = n\frac{\partial \ln\frac{\theta}{1-\theta}}{\partial\theta} + T\frac{\partial \ln(1 - \theta)}{\partial\theta} = n\left( \frac{1}{\theta} + \frac{1}{1 - \theta} \right) - \frac{T}{1 - \theta},$$
where $T = \sum_{i=1}^n X_i$. It follows that
$$E(T) = g(\theta) = \frac{n}{\theta}; \quad E\!\left( \frac{1}{n}T \right) = E(\bar X) = \frac{1}{\theta};$$
and that $T$ is the minimum variance unbiased estimator of $n/\theta$. Equivalently, $T/n = \bar X$ is the minimum variance unbiased estimator of $1/\theta$. The variance of $T$ can be obtained from the moment generating function. The result is
$$\mathrm{Var}(X_i) = \frac{1 - \theta}{\theta^2} \Longrightarrow \mathrm{Var}(T/n) = \frac{1 - \theta}{n\theta^2}.$$
7. Another Exponential Family Result: Suppose that the joint pdf (pmf) of $X_1, X_2, \ldots, X_n$ is $f_X(x|\theta)$; $T(X)$ is a statistic that is unbiased for $g(\theta)$; and $\mathrm{Var}(T)$ attains the Cramer-Rao lower bound. Then $T$ is sufficient and the joint pdf (pmf) belongs to the one-parameter exponential family.
Proof: If $\mathrm{Var}(T)$ attains the Cramer-Rao lower bound, then it must be true that $\rho^2_{S,T} = 1$ and that
$$S(X, \theta) = a(\theta) + b(\theta)T(X)$$
for some functions $a$ and $b$. Integrating $S$ with respect to $\theta$ gives
$$\int S(X, \theta)\, d\theta = \ln f_X(X|\theta) + K_1(X) \text{ for some function } K_1(X)$$
$$= \int a(\theta) + b(\theta)T(X)\, d\theta = A(\theta) + B(\theta)T(X) + K_2(X)$$
$$\Longrightarrow f_X(X|\theta) = \exp\{A(\theta)\}\exp\{K_2(X) - K_1(X)\}\exp\{B(\theta)T(X)\},$$
which shows that the distribution belongs to the exponential family and that $T$ is sufficient.
Chapter 10
SIGNIFICANCE TESTING
This chapter describes hypothesis testing from a Fisherian viewpoint. The main
ingredients are hypotheses, test statistics, and p-values.
10.1 Hypotheses
1. $H_0$ and $H_a$ are statements about probability models or, equivalently, about population characteristics.
2. The null hypothesis, $H_0$, usually says no effect, no difference, etc. In terms of parameters, it usually is written as $H_0\!: \theta = \theta_0$; $H_0\!: \theta \le \theta_0$; or $H_0\!: \theta \ge \theta_0$, where $\theta_0$ is a value specified by the investigator. It is important that the null contains a point of equality.
3. The alternative states that the null is false and usually is written as $H_a\!: \theta \ne \theta_0$; $H_a\!: \theta > \theta_0$; or $H_a\!: \theta < \theta_0$.
10.2 Assessing the Evidence
1. A significance test is a test of $H_0$. The strategy is as follows:
(a) Translate the scientific hypothesis into $H_0$ and $H_a$.
(b) Begin with the assumption that $H_0$ is true.
(c) Collect data.
(d) Determine whether or not the data contradict $H_0$. The p-value is a measure of how strongly the data contradict the null. A small p-value is strong evidence against $H_0$.
2. Test statistic: Definition: A test statistic is a function of the data and $\theta_0$. The test statistic is chosen to discriminate between $H_0$ and $H_a$. Usually, it incorporates an estimator of $\theta$. Familiar test statistics are
(a) $Z = \dfrac{\hat\theta - \theta_0}{\mathrm{SE}(\hat\theta|H_0)}$,
(b) $t = \dfrac{\bar X - \mu_0}{S_X/\sqrt{n}}$, and
(c) $X^2 = \dfrac{(n-1)S^2_X}{\sigma^2_0}$.
3. p-value: Denote the test statistic as $T$ and denote the realized value of $T$ as $t_{\text{obs}}$. The p-value is a measure of consistency between the data and $H_0$. It is defined as
$$\text{p-value} = P\!\left( T \text{ is as or more extreme than } t_{\text{obs}} \text{ in the direction of } H_a \,\big|\, H_0 \right).$$
A small p-value means that the data are not consistent with $H_0$. That is, small p-values are taken to be evidence against $H_0$.
4. Common Error I: Many investigators interpret a large p-value to be evidence for $H_0$. This is not correct. A large p-value means that there is little or no evidence against $H_0$. For example, consider a test of $H_0\!: \mu = 100$ versus $H_a\!: \mu \ne 100$ based on a sample of size $n = 1$ from $N(\mu, 20^2)$. Suppose that the true mean is $\mu = 105$ and that $X = 101$ is observed ($4/20 = 1/5$ standard deviations below the true mean). The $Z$ statistic is
$$z_{\text{obs}} = \frac{101 - 100}{20} = 0.05$$
and the p-value is
$$\text{p-value} = 2[1 - \Phi(0.05)] = 2(1 - 0.5199) = 0.9601.$$
The p-value is large, but the data do not provide strong evidence that $H_0$ is true.
5. Common Error II: Many investigators interpret a very small p-value to mean that a large (important) effect has been found. This is not correct. A very small p-value is strong evidence against $H_0$. It is not evidence that a large effect has been found. Example: Suppose that the standard treatment for the common cold reduces symptoms in 62% of the population. An investigator develops a new treatment and wishes to test $H_0\!: p = 0.62$ against $H_a\!: p > 0.62$. The usual test statistic is
$$Z = \frac{\hat p - p_0}{\sqrt{\dfrac{p_0(1 - p_0)}{n}}},$$
where $p_0 = 0.62$. If the new treatment reduces symptoms in 62.1% of a sample of size $n = 3{,}000{,}000$, then the observed value of the test statistic is
$$z_{\text{obs}} = \frac{0.621 - 0.62}{\sqrt{\dfrac{0.62(1 - 0.62)}{3{,}000{,}000}}} = 3.5684.$$
The p-value is $P(Z > 3.5684|H_0) = 0.00018$. This is strong evidence against $H_0$, but the effect size is very small.
6. Using likelihood to find a test statistic: One way to find a test statistic for testing $H_0\!: \theta = \theta_0$ against a one- or two-sided alternative is to examine the likelihood ratio
$$\mathrm{LR} = \frac{L(\theta_0|X)}{\max_\theta L(\theta|X)},$$
where the maximization in the denominator is over all $\theta$ that satisfy $H_a$. The LR is the ratio of the probability of the data under $H_0$ to the largest possible probability of the data under $H_a$. The LR satisfies $\mathrm{LR} \in (0, 1)$. Small values of LR are interpreted to be evidence against $H_0$. Example: suppose that $H_0\!: p = p_0$ is to be tested against $H_a\!: p \ne p_0$ using a random sample from $\mathrm{Geom}(p)$. Recall that the MLE of $p$ is $\hat p = 1/\bar X$. The likelihood ratio is
$$\mathrm{LR} = \frac{(1 - p_0)^{n(\bar X - 1)} p_0^n}{(1 - 1/\bar X)^{n(\bar X - 1)} \bar X^{-n}} \cdot \frac{1}{\bar X^{n\bar X}}\cdot\bar X^{n\bar X} = \left( \frac{1 - p_0}{\bar X - 1} \right)^{n(\bar X - 1)} \bar X^{n\bar X} p_0^n.$$
In the following display, the log of the LR statistic is plotted against $\bar X$ for the special case $n = 10$ and $p_0 = 0.25$.
[Figure: $\ln(\mathrm{LR})$ plotted against $\bar X$ for the geometric distribution with $n = 10$ and $p_0 = 1/4$.]
The above plot reveals that the log likelihood ratio statistic is small if $\bar X$ is substantially larger or substantially smaller than 4. Note that $\bar X = 4 \Longrightarrow 1/\bar X = 0.25 = p_0$ and that the LR statistic is 1 if $\bar X = 4$; i.e., the log of the LR statistic is zero.
To use the likelihood ratio as a test statistic, it is necessary to determine its sampling distribution under a true null. This can be quite difficult, but fortunately there is an easy-to-use large sample result. If $\theta$ is a scalar, then under certain regularity conditions, the asymptotic null distribution of $-2\ln(\mathrm{LR})$ is $\chi^2_1$.
10.3 One Sample Z Tests
1. Form of the test statistic: Suppose that it is of interest to test $H_0\!: \theta = \theta_0$ against either $H_a\!: \theta \ne \theta_0$, $H_a\!: \theta > \theta_0$, or $H_a\!: \theta < \theta_0$. Further, suppose that we have an estimator $T$ of $\theta$ that satisfies
$$T|H_0 \mathrel{\dot\sim} N\!\left( \theta_0, \frac{\sigma^2_0}{n} \right)$$
and an estimator of $\sigma^2_0$, say $W^2_0$, that satisfies $W^2_0 \xrightarrow{\text{prob}} \sigma^2_0$ whenever $H_0$ is true. The subscript on $\sigma$ is a reminder that the variance of $T$ under $H_0$ could be a function of $\theta_0$.
2. A reasonable test statistic is
$$Z = \frac{T - \theta_0}{W_0/\sqrt{n}}.$$
If sample size is large, then the distribution of $Z$ under $H_0$ is approximately $N(0, 1)$.
3. p-values Let z
obs
be the observed test statistic. Then the p-value for testing H
0
against H
a
is
(a) 1 P(|z
obs
| Z |z
obs
|) = 2 [1 (|z
obs
|)] if the alternative hypothesis
is H
a
: =
0
;
(b) P(z
obs
Z) = 1 (z
obs
) if the alternative hypothesis is H
a
: >
0
; and
(c) P(Z z
obs
) = (z
obs
) if the alternative hypothesis is H
a
: <
0
.
4. Example: X
1
, X
2
, . . . , X
n
is a random sample from a population with mean
and variance
2
. To test H
0
: =
0
against either H
a
: =
0
, H
a
: >
0
, or
H
a
: <
0
, use the test statistic
Z =
X
0
S
X
/

n
.
5. Example: X
1
, X
2
, . . . , X
n
is a random sample from Bern(p). To test H
0
: p = p
0
against either H
a
: p = p
0
, H
a
: p > p
0
, or H
a
: p < p
0
, use the test statistic
Z =
p p
0

p
0
(1 p
0
)/n
.
Note that
2
0
= p
0
(1 p
0
).
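The three p-value rules in item 3 are easy to code once z_obs is in hand. Below is a minimal Matlab sketch; normcdf (Statistics Toolbox) is assumed, and the function name zpvalue is an illustration, not part of the notes.

function p = zpvalue(zobs, tail)
% ZPVALUE  p-value for a one-sample Z test.
%   tail = 'both'  for H_a: theta ~= theta_0
%   tail = 'right' for H_a: theta >  theta_0
%   tail = 'left'  for H_a: theta <  theta_0
switch tail
  case 'both',  p = 2*(1 - normcdf(abs(zobs)));
  case 'right', p = 1 - normcdf(zobs);
  case 'left',  p = normcdf(zobs);
end

For example, zpvalue(1.9365, 'right') returns about 0.026, which matches the Bernoulli calculation that appears in Section 10.6.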
10.4 One Sample t Tests

Suppose that X_1, X_2, . . . , X_n is a random sample from N(μ, σ²). To test H_0: μ = μ_0 against either H_a: μ ≠ μ_0, H_a: μ > μ_0, or H_a: μ < μ_0, use the test statistic

T = (X̄ − μ_0) / (S_X/√n).

Under H_0, the test statistic has a t distribution with n − 1 degrees of freedom.
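A minimal Matlab sketch of the one-sample t test follows; the data vector x is hypothetical and tcdf (Statistics Toolbox) is assumed.

x = [15.9 16.4 16.1 16.5 16.3];     % hypothetical sample
mu0 = 16;
n = length(x);
T = (mean(x) - mu0)/(std(x)/sqrt(n));
p_two = 2*(1 - tcdf(abs(T), n-1))   % two-sided p-value
% One-sided p-values: 1 - tcdf(T, n-1) for H_a: mu > mu0, and tcdf(T, n-1) for H_a: mu < mu0.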
10.5 Some Nonparametric Tests

We will examine only the sign test. Suppose that X_1, X_2, . . . , X_n is a random sample from a continuous distribution having median η. A test of H_0: η = η_0 against either H_a: η ≠ η_0, H_a: η > η_0, or H_a: η < η_0 is desired. Let

U_i = I_{(−∞, η_0]}(X_i) = 1 if X_i ≤ η_0, and 0 otherwise.

Under H_0, U_i iid Bern(0.5) and Y = Σ_{i=1}^n U_i ∼ Bin(n, 0.5). Accordingly, the test statistic

Z = (p̂ − 0.5) / √(0.25/n)

is distributed approximately N(0, 1) under H_0, where p̂ = Y/n.
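For small n the normal approximation can be rough, so it is worth comparing it with the exact binomial calculation. The Matlab sketch below is a hypothetical illustration for H_a: η > η_0 (normcdf and binocdf from the Statistics Toolbox are assumed; the data are invented for the sketch).

x = [2.1 3.4 1.8 4.2 3.9 2.7 5.1 3.3 4.8 3.6];   % hypothetical sample
eta0 = 2.5;                                       % hypothesized median
n = length(x);
y = sum(x <= eta0);                               % number of observations at or below eta0
phat = y/n;
z = (phat - 0.5)/sqrt(0.25/n);
p_approx = normcdf(z)      % approximate p-value for H_a: eta > eta0 (small y supports H_a)
p_exact  = binocdf(y, n, 0.5)   % exact p-value: P(Y <= y | H_0)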
10.6 Probability of the Null Hypothesis

When using the frequentist approach, it is not correct to interpret the p-value as the probability that H_0 is true. When using the Bayesian approach, the posterior probability that H_0 is true can be computed.

Suppose, for example, that X_i iid Bern(θ) for i = 1, . . . , n. Furthermore, suppose that the prior on θ is Beta(α_1, α_2). The sufficient statistic is Y = Σ_{i=1}^n X_i and the posterior is θ|(Y = y) ∼ Beta(α_1 + y, α_2 + n − y). Suppose that one wants to test H_0: θ ≤ θ_0 against H_a: θ > θ_0. Then

P(H_0|Y = y) = P(θ ≤ θ_0|Y = y) = ∫_0^{θ_0} g_{θ|Y}(θ|y) dθ.

For example, if α_1 = α_2 = 1, n = 40, y = 30, and we want to test H_0: θ ≤ 0.6 against H_a: θ > 0.6, then θ|(Y = 30) ∼ Beta(31, 11) and

P(H_0|Y = 30) = ∫_0^{0.6} θ^{30}(1 − θ)^{10} / B(31, 11) dθ = 0.0274.
The test statistic for computing the frequentist p-value is

z_obs = (0.75 − 0.6) / √( 0.6(1 − 0.6)/40 ) = 1.9365.

The p-value is

p-value = P(Z > 1.9365) = 0.0264.

If the correction for continuity is employed, then

z_obs = (0.75 − 1/80 − 0.6) / √( 0.6(0.4)/40 ) = 1.7751 and p-value = 0.0379.
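Both the posterior probability and the frequentist p-values above can be checked in a few lines of Matlab (betacdf and normcdf from the Statistics Toolbox are assumed):

% Posterior probability of H_0: theta <= 0.6 when the posterior is Beta(31, 11)
P_H0 = betacdf(0.6, 31, 11)                        % 0.0274

% Frequentist p-values for H_0: theta = 0.6 versus H_a: theta > 0.6
n = 40; y = 30; p0 = 0.6;
z  = (y/n - p0)/sqrt(p0*(1 - p0)/n);               % 1.9365
p  = 1 - normcdf(z)                                % 0.0264
zc = (y/n - 1/(2*n) - p0)/sqrt(p0*(1 - p0)/n);     % continuity corrected, 1.7751
pc = 1 - normcdf(zc)                               % 0.0379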
There are some complications if one wants a Bayesian test of H_0: θ = θ_0 against H_a: θ ≠ θ_0. Your textbook gives one example. We will not have time to discuss this issue.

Chapter 11

TESTS AS DECISION RULES

This chapter introduces the Neyman-Pearson theory of tests as decision rules. In this chapter it is assumed that a decision about H_0 versus H_a must be made. Based on the data, the investigator either will reject H_0 and act as though H_a is true, or fail to reject H_0 and act as though H_0 is true. The latter decision is called "accept H_0."
11.1 Rejection Regions and Errors

1. Suppose that the data consist of X_1, X_2, . . . , X_n. The joint sample space of X is partitioned into two pieces, the rejection region and the acceptance region. In practice, the partitioning is accomplished by using a test statistic.

(a) Rejection region: the set of values of the test statistic that call for rejecting H_0.

(b) Acceptance region: the set of values of the test statistic that call for accepting H_0. The acceptance region is the complement of the rejection region.

2. Errors

(a) Type I: rejecting a true H_0.

(b) Type II: accepting a false H_0.

3. Size of the test:

size = α = P(reject H_0 | H_0 true).

4. Type II error probability:

P(type II error) = β = P(accept H_0 | H_0 false).

5. Power:

power = 1 − β = P(reject H_0 | H_0 false).

6. Example: Consider a test of H_0: p = 0.4 versus H_a: p ≠ 0.4 based on a random sample of size 10 from Bern(p). Let Y = Σ X_i. If the rejection rule is to reject H_0 if Y ≤ 0 or Y ≥ 8, then the test size is

α = 1 − P(1 ≤ Y ≤ 7 | p = 0.4) = 0.0183.

The type II error probability and the power depend on the value of p when H_0 is not true. Their values are

P(type II error) = P(1 ≤ Y ≤ 7 | p) and power = 1 − P(1 ≤ Y ≤ 7 | p).

The two plots below display the type II error probabilities and power for various values of p.
[Figure: Type II error probability, β = P(1 ≤ Y ≤ 7 | p), plotted against the true value of p.]

[Figure: Power, 1 − β, plotted against the true value of p.]
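The size and the two curves can be reproduced with binocdf (Statistics Toolbox assumed). A minimal sketch:

n = 10; p0 = 0.4;
alpha = 1 - (binocdf(7, n, p0) - binocdf(0, n, p0))   % size = 0.0183

p = 0:0.01:1;
beta  = binocdf(7, n, p) - binocdf(0, n, p);          % P(1 <= Y <= 7 | p)
power = 1 - beta;
subplot(1,2,1), plot(p, beta),  xlabel('True value of p'), ylabel('Type II error probability')
subplot(1,2,2), plot(p, power), xlabel('True value of p'), ylabel('Power')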
11.2 The Power Function

Suppose that X_1, X_2, . . . , X_n is a random sample from f_X(x|θ). Denote the parameter space of θ by Θ and let Θ_0 and Θ_a be two disjoint subspaces of Θ. Consider a test of H_0: θ ∈ Θ_0 against H_a: θ ∈ Θ_a. The power function is a function of θ and is defined by

π(θ) = P(reject H_0 | θ).

The function usually is used when θ ∈ Θ_a, but the function is defined for all θ ∈ Θ.

For example, consider a test of H_0: μ = μ_0 against H_a: μ > μ_0 based on a random sample of size n from N(μ, σ²), where σ² is known. Let Φ^{−1}(1 − α) = z_{1−α} be the 100(1 − α) percentile of the standard normal distribution. Then a one-sample Z test of H_0 will reject H_0 if Z > z_{1−α}, where

Z = (X̄ − μ_0) / (σ/√n).

The power function is

π(μ_a) = P(Z > z_{1−α} | μ = μ_a) = 1 − P[ (X̄ − μ_0)/(σ/√n) ≤ z_{1−α} | μ = μ_a ]
       = 1 − P[ (X̄ − μ_a + μ_a − μ_0)/(σ/√n) ≤ z_{1−α} | μ = μ_a ]
       = 1 − P[ (X̄ − μ_a)/(σ/√n) ≤ z_{1−α} − (μ_a − μ_0)/(σ/√n) | μ = μ_a ]
       = 1 − Φ( z_{1−α} − (μ_a − μ_0)/(σ/√n) ).

As an illustration, if σ = 10, μ_0 = 100, n = 25, and α = 0.05, then the power function is

π(μ_a) = 1 − Φ( 1.645 − (μ_a − 100)/2 ).

This function is displayed below for various values of μ_a.
[Figure: Power Function: One Sample Z when μ_0 = 100, n = 25, σ = 10; π(μ_a) plotted against μ_a for 95 ≤ μ_a ≤ 110.]
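The curve in the display can be reproduced directly from the closed-form power function; a Matlab sketch (normcdf and norminv assumed) follows.

mu0 = 100; sigma = 10; n = 25; alpha = 0.05;
z1ma = norminv(1 - alpha);                       % 1.645
mua = 95:0.1:110;
power = 1 - normcdf(z1ma - (mua - mu0)./(sigma/sqrt(n)));
plot(mua, power), xlabel('\mu_a'), ylabel('\pi(\mu_a)')
power_at_105 = 1 - normcdf(z1ma - (105 - mu0)/(sigma/sqrt(n)))   % power at mu_a = 105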
11.3 Choosing a Sample Size

An investigator may want to plan a study so that power will be adequate to detect a meaningful difference. Consider the power function from the last section. Suppose that the investigator decides that the minimal difference of importance is two points. That is, if μ_a ≥ 102, then the investigator would like to reject H_0. If μ_a is fixed at μ_a = 102, then the power of the test as a function of n is

π(μ_a) = 1 − Φ( 1.645 − (102 − 100)/(10/√n) ) = 1 − Φ( 1.645 − √n/5 ).

This function is plotted below for various values of n.
[Figure: Power Function: One Sample Z when μ_0 = 100, μ_a = 102, σ = 10; π(μ_a) plotted against n for 0 ≤ n ≤ 300.]
If the investigator has decided that a specific power is necessary, then the required sample size can be read from the above display.

In general, the required sample size for a one sample Z test of H_0: μ = μ_0 against H_a: μ > μ_0 can be obtained by equating the power function to the desired value of 1 − β and solving for n. Denote the 100β percentile of the standard normal distribution by Φ^{−1}(β) = z_β. That is,

π(μ_a) = 1 − Φ( z_{1−α} − (μ_a − μ_0)/(σ/√n) ) = 1 − β
⟹ z_{1−α} − (μ_a − μ_0)/(σ/√n) = Φ^{−1}(β) = z_β
⟹ n = σ²(z_{1−α} − z_β)² / (μ_a − μ_0)².

For example, if μ_0 = 100, μ_a = 102, α = 0.05, β = 0.10, and σ = 10, then

n = 100(1.645 + 1.282)² / (102 − 100)² = 214.18.

A sample size of n = 215 is required.
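The sample-size formula is easy to evaluate; the Matlab sketch below (norminv assumed) reproduces the n = 214.18 calculation and rounds up.

mu0 = 100; mua = 102; sigma = 10; alpha = 0.05; beta = 0.10;
z1ma = norminv(1 - alpha);                  % 1.645
zb   = norminv(beta);                       % -1.282
n = sigma^2*(z1ma - zb)^2/(mua - mu0)^2     % about 214.18
n_required = ceil(n)                        % 215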
11.4 Quality Control
Skip this section.
11.5 Most Powerful Tests

1. Definition: Simple Hypothesis. A simple hypothesis is one that completely specifies the joint distribution of the data. That is, there are no unknown parameters under a simple hypothesis. For example, H_0: Y ∼ Bin(25, 1/3) is a simple hypothesis. In this section, we will examine the test of a simple H_0 against a simple H_a.

2. Definition: Most Powerful Test. A test of a simple H_0 versus a simple H_a is a most powerful test of size α if no other test which has size α has greater power.

3. Neyman-Pearson Lemma: Consider the hypotheses H_0: X ∼ f_0(x) against H_a: X ∼ f_1(x), where f_0 and f_1 are the joint pdfs (pmfs) under H_0 and H_a, respectively. Then the most powerful test is to reject H_0 if

Λ(x) < K, where Λ(x) = f_0(x)/f_1(x)

is the likelihood ratio. Furthermore, the size of the test is

α = ∫_R f_0(x) dx, where R = {x : Λ(x) < K}.
Proof: Consider any other test that has size α* ≤ α. Denote the rejection region of the competing test by R* and denote the power of the competing test by 1 − β*. Then

∫_{R*} f_0(x) dx = α*

and

(1 − β) − (1 − β*) = ∫_R f_1(x) dx − ∫_{R*} f_1(x) dx.

We will show that the above difference is greater than or equal to zero. Note that R = (R ∩ R*) ∪ (R ∩ R*^c) and that (R ∩ R*) is disjoint from (R ∩ R*^c). Similarly, R* = (R* ∩ R) ∪ (R* ∩ R^c) and (R* ∩ R) is disjoint from (R* ∩ R^c). Accordingly,

∫_R f_1(x) dx = ∫_{R∩R*} f_1(x) dx + ∫_{R∩R*^c} f_1(x) dx,
∫_{R*} f_1(x) dx = ∫_{R*∩R} f_1(x) dx + ∫_{R*∩R^c} f_1(x) dx, and

[ ∫_{R∩R*} f_1(x) dx + ∫_{R∩R*^c} f_1(x) dx ] − [ ∫_{R*∩R} f_1(x) dx + ∫_{R*∩R^c} f_1(x) dx ]
= ∫_{R∩R*^c} f_1(x) dx − ∫_{R*∩R^c} f_1(x) dx.

Note that (R ∩ R*^c) ⊂ R so that f_1(x) > K^{−1} f_0(x) in the first integral. Also, (R* ∩ R^c) ⊂ R^c so that f_1(x) ≤ K^{−1} f_0(x) in the second integral. Therefore,

(1 − β) − (1 − β*) ≥ (1/K) ∫_{R∩R*^c} f_0(x) dx − (1/K) ∫_{R*∩R^c} f_0(x) dx
= [ (1/K) ∫_{R∩R*^c} f_0(x) dx + (1/K) ∫_{R∩R*} f_0(x) dx ]
  − [ (1/K) ∫_{R*∩R^c} f_0(x) dx + (1/K) ∫_{R*∩R} f_0(x) dx ]   (by adding zero)
= (1/K) ∫_R f_0(x) dx − (1/K) ∫_{R*} f_0(x) dx = (1/K)(α − α*) ≥ 0

because the size of the competing test is α* ≤ α.
4. Example: Suppose that X_1, X_2, . . . , X_n is a random sample from NegBin(k, θ), where k is known. Find the most powerful test of H_0: θ = θ_0 against H_a: θ = θ_a, where θ_a > θ_0. Solution: The likelihood ratio test statistic is

Λ(x) = [ ∏_{i=1}^n C(x_i − 1, k − 1) θ_0^k (1 − θ_0)^{x_i − k} ] / [ ∏_{i=1}^n C(x_i − 1, k − 1) θ_a^k (1 − θ_a)^{x_i − k} ]
     = [ θ_0^{nk} (1 − θ_0)^{n(x̄ − k)} ] / [ θ_a^{nk} (1 − θ_a)^{n(x̄ − k)} ]
     = (θ_0/θ_a)^{nk} [ (1 − θ_0)/(1 − θ_a) ]^{n(x̄ − k)}.

Note that the likelihood ratio test statistic depends on the data solely through x̄ and that

θ_a > θ_0 ⟹ (1 − θ_0)/(1 − θ_a) > 1.

Accordingly, Λ(x) is an increasing function of x̄. Rejecting H_0 for small values of Λ(x) is equivalent to rejecting H_0 for small values of x̄, and the most powerful test is to reject H_0 if x̄ < K*, where K* is determined by the relation

P(X̄ < K* | θ = θ_0) ≤ α.

The above probability can be evaluated without too much difficulty because

nX̄ = Σ_{i=1}^n X_i ∼ NegBin(nk, θ_0)

under H_0.
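The cutoff can be found numerically. The Matlab sketch below uses hypothetical values n = 10, k = 3, θ_0 = 0.25, and α = 0.05; nbincdf (Statistics Toolbox) is assumed, and note that Matlab's negative binomial counts failures before the r-th success, so the total number of trials ΣX_i corresponds to failures plus r.

n = 10; k = 3; theta0 = 0.25; alpha = 0.05;    % hypothetical values
r = n*k;                                        % total number of successes required
s = r:400;                                      % candidate values of sum(X) = total trials
F = nbincdf(s - r, r, theta0);                  % P( sum(X) <= s | theta = theta0 )
sstar = max(s(F <= alpha));                     % largest cutoff with size <= alpha
% Reject H_0: theta = theta0 in favor of H_a: theta > theta0 when sum(X) <= sstar,
% i.e., when xbar <= sstar/n.  The achieved size is
size_achieved = nbincdf(sstar - r, r, theta0)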
11.6 Randomized Tests
Skip this section.
11.7 Uniformly Most Powerful Tests

1. Consider the problem of testing H_0: θ = θ_0 against H_a: θ > θ_0 based on a random sample of size n from f_X(x|θ).

2. Uniformly Most Powerful (UMP) Test. Definition: If a test of H_0: θ = θ_0 against H_a: θ = θ_a is a most powerful test for every θ_a > θ_0 among all tests with size α, then the test is uniformly most powerful for testing H_0: θ = θ_0 against H_a: θ > θ_0.

3. Approach to finding a UMP test. First use the Neyman-Pearson Lemma to find a most powerful test of H_0: θ = θ_0 against H_a: θ = θ_a for some θ_a > θ_0. If the form of the test is the same for all θ_a, then the test is UMP.

4. Example: Suppose that X_1, X_2, . . . , X_n is a random sample from NegBin(k, θ), where k is known. Find the UMP test of H_0: θ = θ_0 against H_a: θ > θ_0. Solution: The most powerful test of H_0: θ = θ_0 against H_a: θ = θ_a, where θ_a > θ_0, is to reject H_0 if x̄ < K*, where K* is determined by the relation

P(X̄ < K* | θ = θ_0) ≤ α.

Note that the form of the test does not depend on the particular value of θ_a. Accordingly, the test that rejects H_0 for small values of x̄ is the UMP test of H_0: θ = θ_0 against H_a: θ > θ_0.

5. A similar argument shows that in the negative binomial example, the UMP test of H_0: θ = θ_0 against H_a: θ < θ_0 is to reject H_0 for large values of x̄.

6. The UMP idea can be extended to tests of H_0: θ ≤ θ_0 against H_a: θ > θ_0. If the power function is monotonic in θ, then the UMP test of H_0: θ = θ_0 against H_a: θ > θ_0 also is UMP for testing H_0: θ ≤ θ_0 against H_a: θ > θ_0. The size of the test is

α = sup_{θ∈Θ_0} P(reject H_0 | θ) = sup_{θ≤θ_0} π(θ) = P(reject H_0 | θ_0)

because the power function is monotonic. It can be shown that the power function is monotone in θ if the distribution of X belongs to the one-parameter exponential family.
11.8 Likelihood Ratio Tests

1. Consider the problem of testing H_0: θ ∈ Θ_0 against H_a: θ ∈ Θ_a, where Θ_0 and Θ_a are disjoint subspaces of the parameter space. The parameter may be a vector.

2. The generalized likelihood ratio test of H_0 against H_a is to reject H_0 for small values of the likelihood ratio test statistic

Λ(X) = L(θ̂_0|X) / L(θ̂_a|X), where
L(θ̂_0|X) = sup_{θ∈Θ_0} L(θ|X) and L(θ̂_a|X) = sup_{θ∈Θ_0∪Θ_a} L(θ|X).

That is, the likelihood function is maximized twice: first under the null, and second under the union of the null and alternative.

3. Properties of Λ(X).

(a) Λ(X) ∈ [0, 1].

(b) Small values of Λ are evidence against H_0.

(c) Under mild regularity conditions, the asymptotic null distribution of −2 ln[Λ(X)] is χ² with degrees of freedom equal to the number of restrictions under H_0 minus the number of restrictions under H_a.

(d) The decision rule is to reject H_0 for large values of −2 ln(Λ).
4. Example 1: Suppose that X_ij ∼ind Bern(p_i) for j = 1, . . . , n_i. That is, we have independent samples from each of two Bernoulli populations. Consider the problem of testing H_0: p_1 = p_2 against H_a: p_1 ≠ p_2. The sufficient statistics are Y_1 = Σ_{j=1}^{n_1} X_{1j} and Y_2 = Σ_{j=1}^{n_2} X_{2j}. These statistics are independently distributed as Y_i ∼ Bin(n_i, p_i). The likelihood function is

L(p_1, p_2|y_1, y_2) = p_1^{y_1}(1 − p_1)^{n_1−y_1} p_2^{y_2}(1 − p_2)^{n_2−y_2}.

Under H_0, the MLE of the common value p = p_1 = p_2 is p̂ = (y_1 + y_2)/(n_1 + n_2). Under the union of H_0 and H_a, there are no restrictions on p_1 and p_2 and the MLEs are p̂_1 = y_1/n_1 and p̂_2 = y_2/n_2. The likelihood ratio test statistic is

Λ(y_1, y_2) = p̂^{y_1+y_2}(1 − p̂)^{n_1+n_2−y_1−y_2} / [ p̂_1^{y_1}(1 − p̂_1)^{n_1−y_1} p̂_2^{y_2}(1 − p̂_2)^{n_2−y_2} ].

If H_0 is true and sample sizes are large, then −2 ln[Λ(Y_1, Y_2)] is approximately distributed as a χ² random variable. There are no restrictions under H_a and one restriction under H_0, so the χ² random variable has 1 degree of freedom. For example, if n_1 = 30, n_2 = 40, y_1 = 20, and y_2 = 35, then Λ(y_1, y_2) = 0.1103, −2 ln(Λ) = 4.4087, and the p-value is 0.0358. For comparison, the familiar large sample test statistic is

z = (p̂_1 − p̂_2) / √( p̂(1 − p̂)(1/n_1 + 1/n_2) ) = −2.1022

and the p-value is 0.0355. Note that Z² ∼ χ²_1 and z² = 4.4192, which is very close to the LR test statistic.
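These numbers can be verified with a few lines of Matlab (chi2cdf and normcdf from the Statistics Toolbox are assumed):

n1 = 30; n2 = 40; y1 = 20; y2 = 35;
p  = (y1 + y2)/(n1 + n2);  p1 = y1/n1;  p2 = y2/n2;
loglam = (y1+y2)*log(p) + (n1+n2-y1-y2)*log(1-p) ...
       - ( y1*log(p1) + (n1-y1)*log(1-p1) + y2*log(p2) + (n2-y2)*log(1-p2) );
lambda = exp(loglam)                          % 0.1103
G2 = -2*loglam                                % 4.4087
p_lrt = 1 - chi2cdf(G2, 1)                    % 0.0358
z = (p1 - p2)/sqrt(p*(1-p)*(1/n1 + 1/n2))     % -2.1022
p_z = 2*(1 - normcdf(abs(z)))                 % 0.0355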
5. Example 2: Your textbook (page 476) shows that the usual one-sample t test of H_0: μ = μ_0 against H_a: μ ≠ μ_0 when sampling from a normal distribution with unknown variance is the likelihood ratio test. Below is another version of the proof.

(a) Lemma: Let a be a constant or a variable that does not depend on σ², and let n be a positive constant. Then the maximizer of

h(σ²; a) = exp[ −a/(2σ²) ] / (σ²)^{n/2}

with respect to σ² is

σ̂² = a/n.

Also,

max_{σ²} h(σ²; a) = h(a/n; a) = e^{−n/2} / (a/n)^{n/2}.

Proof: The natural log of h is

ln(h) = −a/(2σ²) − (n/2) ln(σ²).

Equating the first derivative of ln(h) to zero and solving for σ² yields

∂/∂σ² ln(h) = a/(2σ⁴) − n/(2σ²),
∂/∂σ² ln(h) = 0 ⟹ a − σ²n = 0 ⟹ σ² = a/n.

The solution is a maximizer because the second derivative of ln(h) evaluated at σ² = a/n is negative:

∂²/∂(σ²)² ln(h) = ∂/∂σ² [ a/(2σ⁴) − n/(2σ²) ] = −a/σ⁶ + n/(2σ⁴),
∂²/∂(σ²)² ln(h) evaluated at σ² = a/n is −a/(a/n)³ + n/[2(a/n)²] = −n³/(2a²) < 0.

To obtain the maximum, substitute a/n for σ² in h(σ²; a).
(b) Theorem: The generalized likelihood ratio test of H_0: μ = μ_0 against H_a: μ ≠ μ_0 based on a random sample of size n from a normal distribution with unknown variance is to reject H_0 for large |T|, where

T = √n (X̄ − μ_0) / S_X,
X̄ = (1/n) Σ_{i=1}^n X_i, and S²_X = [1/(n − 1)] Σ_{i=1}^n (X_i − X̄)².

Proof: The likelihood function of μ and σ² given X_i ∼iid N(μ, σ²) for i = 1, . . . , n is

L(μ, σ²|X) = exp[ −(1/(2σ²)) Σ_{i=1}^n (X_i − μ)² ] / [ (σ²)^{n/2} (2π)^{n/2} ].

Under H_0: μ = μ_0, the likelihood function is

L(μ_0, σ²|X) = exp[ −(1/(2σ²)) Σ_{i=1}^n (X_i − μ_0)² ] / [ (σ²)^{n/2} (2π)^{n/2} ].

Using the Lemma with

a = Σ_{i=1}^n (X_i − μ_0)²

yields

σ̂²_0 = (1/n) Σ_{i=1}^n (X_i − μ_0)²

and

max_{σ²} L(μ_0, σ²|X) = L(μ_0, σ̂²_0|X) = e^{−n/2} / { [ (1/n) Σ_{i=1}^n (X_i − μ_0)² ]^{n/2} (2π)^{n/2} }.

Under H_0 ∪ H_a, the likelihood function must be maximized with respect to μ and σ². Note that the sign of the exponent in the numerator of L is negative. Accordingly, to maximize L with respect to μ, we must minimize

Σ_{i=1}^n (X_i − μ)²

with respect to μ. By the parallel axis theorem, it is known that the minimizer is μ̂ = X̄. Substitute X̄ for μ in the likelihood function and use the Lemma with

a = Σ_{i=1}^n (X_i − X̄)²

to obtain

σ̂²_a = (1/n) Σ_{i=1}^n (X_i − X̄)²

and

max_{μ, σ²} L(μ, σ²|X) = L(X̄, σ̂²_a|X) = e^{−n/2} / { [ (1/n) Σ_{i=1}^n (X_i − X̄)² ]^{n/2} (2π)^{n/2} }.

The likelihood ratio test statistic is

Λ = L(μ_0, σ̂²_0) / L(X̄, σ̂²_a) = [ Σ_{i=1}^n (X_i − X̄)² / Σ_{i=1}^n (X_i − μ_0)² ]^{n/2}.

Recall that, from the parallel axis theorem,

Σ_{i=1}^n (X_i − μ_0)² = Σ_{i=1}^n (X_i − X̄)² + n(X̄ − μ_0)².

Accordingly, the likelihood ratio test (LRT) is to reject H_0 for small Λ, where

Λ = [ Σ_{i=1}^n (X_i − X̄)² / ( Σ_{i=1}^n (X_i − X̄)² + n(X̄ − μ_0)² ) ]^{n/2}.

Any monotonic transformation of Λ also can be used as the LRT statistic. In particular,

( Λ^{−2/n} − 1 )(n − 1) = n(X̄ − μ_0)² / S²_X = T²

is a decreasing function of Λ. Therefore, the LRT rejects H_0 for large T² or, equivalently, for large |T|.
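The identity (Λ^{−2/n} − 1)(n − 1) = T² is easy to confirm numerically. The Matlab sketch below simulates a hypothetical normal sample and compares the two quantities; tcdf (Statistics Toolbox) is used only to attach a p-value to T.

n = 20; mu0 = 16;                      % hypothetical settings
x = mu0 + 0.3 + 0.5*randn(n,1);        % hypothetical normal sample
xbar = mean(x);  s2 = var(x);          % var() uses the n-1 divisor
lambda = ( sum((x-xbar).^2) / sum((x-mu0).^2) )^(n/2);
T2 = n*(xbar - mu0)^2/s2;
check = (lambda^(-2/n) - 1)*(n - 1)    % equals T2 up to rounding
T2
p_value = 2*(1 - tcdf(sqrt(T2), n-1))  % two-sided p-value from the t test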
11.9 Bayesian Testing

1. To develop a Bayes test, first make the following definitions:

(a) Loss Function: ℓ(θ, act) = loss incurred if action act is performed when the state of nature is θ. The action act will be either "reject H_0" or "accept H_0." For example, ℓ(H_0, reject H_0) is the loss incurred when a true H_0 is rejected. It is assumed that making the correct action incurs no loss.

(b) Parameter Space: Denote the support of θ under H_0 by Θ_0 and denote the support of θ under H_a by Θ_a. For example, if the hypotheses are H_0: μ ≤ 100 and H_a: μ > 100, then Θ_0 = (−∞, 100] and Θ_a = (100, ∞).

(c) Prior: Before new data are collected, the prior pdf or pmf for θ is denoted by g(θ).

(d) Posterior: After new data have been collected, the posterior pdf or pmf for θ is denoted by g(θ|X).

(e) Bayes Loss: The posterior Bayes loss for action act is

B(act|X) = E_{θ|X}[ℓ(θ, act)]
= ∫_{Θ_0} ℓ(θ, reject H_0) g(θ|X) dθ if act = reject H_0,
= ∫_{Θ_a} ℓ(θ, accept H_0) g(θ|X) dθ if act = accept H_0.

(f) Bayes Test: A Bayes test is the rule that minimizes the Bayes loss.
2. Theorem: When testing a simple null against a simple alternative, the Bayes test is a Neyman-Pearson test, and a Neyman-Pearson rejection region, f_0/f_1 < K, corresponds to a Bayes test for some prior.

Proof: The null and alternative can be written as H_0: f(x) = f_0(x) versus H_1: f(x) = f_1(x). Also, the support of θ has only two points:

Θ = {f_0, f_1}, Θ_0 = {f_0}, and Θ_a = {f_1} or, equivalently,
Θ = {H_0, H_1}, Θ_0 = {H_0}, and Θ_a = {H_1}.

Denote the prior probabilities of θ as

g_0 = P(H_0) and g_1 = P(H_1).

The posterior probabilities are

f(H_0|x) = f(H_0, x)/f(x) = f(x|H_0)f(H_0) / [ f(x|H_0)f(H_0) + f(x|H_1)f(H_1) ]
         = f_0(x)g_0 / [ f_0(x)g_0 + f_1(x)g_1 ] and
f(H_1|x) = f(H_1, x)/f(x) = f(x|H_1)f(H_1) / [ f(x|H_0)f(H_0) + f(x|H_1)f(H_1) ]
         = f_1(x)g_1 / [ f_0(x)g_0 + f_1(x)g_1 ].

Denote the losses for incorrect decisions by

ℓ(H_0, reject H_0) = ℓ_0 and ℓ(H_1, accept H_0) = ℓ_1.

Note that ℓ_0 and ℓ_1 are merely scalar constants. Then, the posterior Bayes losses are

B(reject H_0|x) = ℓ_0 f_0(x)g_0 / [ f_0(x)g_0 + f_1(x)g_1 ] and
B(accept H_0|x) = ℓ_1 f_1(x)g_1 / [ f_0(x)g_0 + f_1(x)g_1 ].

The Bayes test consists of choosing the action that has the smallest Bayes loss. Alternatively, the ratio of Bayes losses can be examined:

B(reject H_0|x) / B(accept H_0|x) = ℓ_0 f_0(x)g_0 / [ ℓ_1 f_1(x)g_1 ].

If the ratio is smaller than 1, then the Bayes test is to reject H_0; otherwise accept H_0. That is, H_0 is rejected if

ℓ_0 f_0(x)g_0 / [ ℓ_1 f_1(x)g_1 ] < 1 or, equivalently, f_0(x)/f_1(x) < K, where K = ℓ_1 g_1 / (ℓ_0 g_0).

Accordingly, the Bayes test is a Neyman-Pearson test. Also, a Neyman-Pearson rejection region, f_0/f_1 < K, corresponds to a Bayes test, where the priors and losses satisfy

K = ℓ_1 g_1 / (ℓ_0 g_0).
3. Example 11.9a (with details): A machine that fills bags with flour is adjusted so that the mean weight in a bag is μ = 16 ounces. To determine whether the machine is at the correct setting, a sample of bags can be weighed. There is a constant cost for readjusting the machine. The cost is due to shutting down the production line, etc. If the machine is not adjusted, then the company may be over-filling the bags, with cost μ − 16, or under-filling the bags, with cost 2(16 − μ). The under-filling cost is due to customer dissatisfaction.

Consider testing H_0: μ ≤ 16 against H_a: μ > 16 based on a random sample of size n from N(μ, σ²), where σ² is known. Furthermore, suppose that the prior on μ is N(ν, τ²). Using the result in Example 2 on page 101 of these notes, the posterior distribution of μ is normal with

E(μ|x̄) = [ nτ²/(nτ² + σ²) ] x̄ + [ 1 − nτ²/(nτ² + σ²) ] ν and
Var(μ|x̄) = ( n/σ² + 1/τ² )^{−1}.

If n = 5, x̄ = 16.4, σ² = 0.05², ν = 16, and τ² = 0.002, then the posterior distribution of μ is N(16.32, 0.02²). Suppose that the loss functions are

ℓ(μ, reject H_0) = ℓ_1 and
ℓ(μ, accept H_0) = 2(16 − μ) if μ ≤ 16, and μ − 16 if μ > 16.

That is, rejecting H_0 (readjusting the machine) incurs the constant readjustment cost ℓ_1, whereas accepting H_0 (leaving the machine alone) incurs the filling cost. The Bayes losses are

B(reject H_0|x̄) = E_{μ|x̄}[ℓ(μ, reject H_0)] = ℓ_1 P(μ ≤ 16|x̄) + ℓ_1 P(μ > 16|x̄) = ℓ_1, and
B(accept H_0|x̄) = E_{μ|x̄}[ℓ(μ, accept H_0)]
= ∫_{−∞}^{16} 2(16 − μ) f_{μ|x̄}(μ|x̄) dμ + ∫_{16}^{∞} (μ − 16) f_{μ|x̄}(μ|x̄) dμ.

The latter quantity can be computed as follows. Denote the conditional mean and variance of μ as μ_{|x̄} and σ²_{|x̄}. That is,

μ_{|x̄} = E(μ|x̄) and σ²_{|x̄} = Var(μ|x̄).

Transform from μ to

z = (μ − μ_{|x̄}) / σ_{|x̄}.

Denote the pdf of the standard normal distribution as φ(z). Then μ = zσ_{|x̄} + μ_{|x̄}, dμ = σ_{|x̄} dz, and

B(accept H_0|x̄) = ∫_{−∞}^{(16 − μ_{|x̄})/σ_{|x̄}} 2(16 − μ_{|x̄} − zσ_{|x̄}) φ(z) dz
                 + ∫_{(16 − μ_{|x̄})/σ_{|x̄}}^{∞} (σ_{|x̄} z + μ_{|x̄} − 16) φ(z) dz
= ∫_{−∞}^{−16} 2(−0.32 − 0.02z) φ(z) dz + ∫_{−16}^{∞} (0.02z + 0.32) φ(z) dz
≈ E(0.02Z + 0.32) = 0.32,

because essentially the entire standard normal distribution lies in the interval (−16, ∞) and essentially none of the distribution lies in the interval (−∞, −16). The Bayes test rejects H_0 if ℓ_1 ≤ 0.32 and accepts H_0 if ℓ_1 > 0.32.

A Matlab program to compute the integrals, together with the program output, is listed below.

n=5;
xbar=16.4;
sigma2=.05^2;
nu=16;
tau2=0.002;
w=n*tau2/(n*tau2+sigma2);
m=w*xbar+(1-w)*nu;                  % posterior mean of mu
v2=(n/sigma2 + 1/tau2)^(-1);        % posterior variance of mu
v=sqrt(v2);
disp('Conditional Mean and SD of mu are')
disp([m v])
g1 = inline('2*(16-z*s-m).*normpdf(z)','z','s','m');
g2 = inline('(z*s+m-16).*normpdf(z)','z','s','m');
z0=(16-m)/v;
tol=1.e-10;
Integral_1=quadl(g1,-30,z0,tol,[],v,m)
Integral_2=quadl(g2,z0,30,tol,[],v,m)
Bayes_Loss = Integral_1+Integral_2

Conditional Mean and SD of mu are 16.3200 0.0200
Integral_1 = 5.6390e-67
Integral_2 = 0.3200
Bayes_Loss = 0.3200
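As a check on the numerical integration, B(accept H_0|x̄) can also be written in closed form using the standard partial-expectation identities for a normal variable, E(μ − 16)⁺ = (m − 16)Φ((m − 16)/v) + v φ((m − 16)/v) and similarly for E(16 − μ)⁺. A Matlab sketch (normcdf and normpdf assumed):

m = 16.32;  v = 0.02;                  % posterior mean and SD of mu
c = (16 - m)/v;                        % standardized cutoff, here -16
under = 2*( (16 - m)*normcdf(c) + v*normpdf(c) );      % 2*E[(16 - mu)^+], essentially 0
over  = (m - 16)*(1 - normcdf(c)) + v*normpdf(c);      % E[(mu - 16)^+], about 0.32
Bayes_loss_accept = under + over       % 0.3200, matching the quadrature result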
4. Example, Problem 11-33 (with details): The goal is to conduct a Bayes test of H_0: p ≤ 1/2 against H_a: p > 1/2 based on a random sample of size n from Bern(p). The losses are

ℓ(p, act) = 0 if the correct decision is made,
ℓ(H_0, reject H_0) = ℓ_0, and
ℓ(H_a, accept H_0) = ℓ_1.

The prior on p is Beta(α, β). Using the results in Example 1 on page 100 of these notes, the posterior distribution of p conditional on x is Beta(α + y, β + n − y), where y is the observed number of successes on the n Bernoulli trials.

The Bayes losses are

B(reject H_0|x) = E_{p|x}[ℓ(p, reject H_0)]
= ∫_{H_0} ℓ_0 p^{α+y−1}(1 − p)^{β+n−y−1} / B(α + y, β + n − y) dp
= ℓ_0 ∫_0^{0.5} p^{α+y−1}(1 − p)^{β+n−y−1} / B(α + y, β + n − y) dp
= ℓ_0 P(p ≤ 0.5|x), and

B(accept H_0|x) = E_{p|x}[ℓ(p, accept H_0)]
= ∫_{H_a} ℓ_1 p^{α+y−1}(1 − p)^{β+n−y−1} / B(α + y, β + n − y) dp
= ℓ_1 ∫_{0.5}^1 p^{α+y−1}(1 − p)^{β+n−y−1} / B(α + y, β + n − y) dp
= ℓ_1 P(p > 0.5|x).

The required probabilities can be computed using any computer routine that calculates the CDF of a beta distribution.

If n = 10, y = 3, α = 7, β = 3, ℓ_0 = 3, and ℓ_1 = 2, then the posterior distribution of p is Beta(10, 10) and the Bayes losses are

B(reject H_0|x) = 3P(W ≤ 0.5) and B(accept H_0|x) = 2P(W > 0.5),

where W ∼ Beta(10, 10). This beta distribution is symmetric around 0.5 and, therefore, each of the above probabilities is 1/2. The Bayes test is to accept H_0 because the Bayes loss for acceptance is 1, whereas the Bayes loss for rejection is 1.5.
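The beta probabilities can be computed with betacdf (Statistics Toolbox assumed); the sketch below reproduces the Bayes losses for the numbers in the example.

n = 10; y = 3; a = 7; b = 3; l0 = 3; l1 = 2;
P_le_half = betacdf(0.5, a + y, b + n - y);      % P(p <= 0.5 | x) = 0.5 for Beta(10,10)
B_reject = l0*P_le_half                          % 1.5
B_accept = l1*(1 - P_le_half)                    % 1.0, so the Bayes test accepts H_0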
Appendix A

GREEK ALPHABET

Name      Lower Case   Upper Case
Alpha     α            A
Beta      β            B
Gamma     γ            Γ
Delta     δ            Δ
Epsilon   ε or ϵ       E
Zeta      ζ            Z
Eta       η            H
Theta     θ or ϑ       Θ
Iota      ι            I
Kappa     κ            K
Lambda    λ            Λ
Mu        μ            M
Nu        ν            N
Xi        ξ            Ξ
Omicron   o            O
Pi        π            Π
Rho       ρ or ϱ       P
Sigma     σ or ς       Σ
Tau       τ            T
Upsilon   υ            Υ
Phi       φ or ϕ       Φ
Chi       χ            X
Psi       ψ            Ψ
Omega     ω            Ω
Appendix B

ABBREVIATIONS

BF: Bayes Factor. If H is a hypothesis and T is a sufficient statistic, then

BF = (Posterior odds of H) / (Prior odds of H) = [ P(H|T = t)/P(H^c|T = t) ] / [ P(H)/P(H^c) ] = f_{T|H}(t|H) / f_{T|H^c}(t|H^c).

CDF or cdf: Cumulative Distribution Function. If X is a random variable, then F_X(x) = P(X ≤ x) is the cdf of X.

CLT: Central Limit Theorem. If X_1, X_2, . . . , X_n is a random sample of size n from a population with mean μ_X and variance σ²_X, then the distribution of

Z_n = (X̄ − μ_X) / (σ_X/√n)

converges to N(0, 1) as n → ∞.

CRLB: Cramer-Rao Lower Bound. The CRLB is the lower bound on the variance of an unbiased estimator of g(θ). The bound is

CRLB = [ ∂g(θ)/∂θ ]² / I_θ,

where I_θ is Fisher's information.

LR: Likelihood Ratio. When testing a simple null against a simple alternative, the LR is

Λ = f_0(x) / f_1(x).

When testing a composite null against a composite alternative, the LR is

Λ = sup_{θ∈Θ_0} f(x|θ) / sup_{θ∈Θ_a} f(x|θ),

where Θ_0 and Θ_a are the parameter spaces under H_0 and H_a, respectively.

LRT: Likelihood Ratio Test. The LRT of H_0 versus H_a is to reject H_0 for small values of the LR. The critical value is chosen so that the size of the test is α.

MGF or mgf: Moment Generating Function. If X is a random variable, then ψ_X(t) = E(e^{tX}) is the mgf of X.

MLE: Maximum Likelihood Estimator. Suppose that X_1, X_2, . . . , X_n is a random sample from f_X(x|θ), where θ is a k × 1 vector of parameters. A maximum likelihood estimator of θ is any value θ̂ that maximizes the likelihood function and is a point in the parameter space or on the boundary of the parameter space.

MSE: Mean Square Error. If T is an estimator of a parameter θ, then

MSE_T(θ) = E(T − θ)² = σ²_T + bias²,

where bias = E(T) − θ.

PDF or pdf: Probability Density Function. If X is a continuous random variable, then (d/dx) F_X(x) = f_X(x) is the pdf of X.

PF or pf: Probability Function. If X is a discrete random variable, then P(X = x) = f_X(x) is the pf of X. The terms pf and pmf are interchangeable.

PGF or pgf: Probability Generating Function. If X is a random variable, then η_X(t) = E(t^X) is the pgf of X. The pgf is most useful for discrete random variables.

PMF or pmf: Probability Mass Function. If X is a discrete random variable, then P(X = x) = f_X(x) is the pmf of X. The terms pmf and pf are interchangeable.

RV or rv: Random Variable.

UMP Test: Uniformly Most Powerful Test. A UMP test of H_0 against H_a is most powerful regardless of the value of the parameter under H_0 and H_a.
Appendix C

PRACTICE EXAMS

C.1 Equation Sheet

Series and Limits

Σ_{i=0}^n r^i = (1 − r^{n+1})/(1 − r) if r ≠ 1; n + 1 if r = 1

Σ_{i=0}^∞ r^i = 1/(1 − r) if |r| < 1; ∞ if r ≥ 1; undefined if r ≤ −1

Σ_{i=1}^n i = n(n + 1)/2

Σ_{i=1}^n i² = n(n + 1)(2n + 1)/6

(a + b)^n = Σ_{i=0}^n C(n, i) a^i b^{n−i}

ln(1 + δ) = Σ_{i=1}^∞ (−1)^{i+1} δ^i / i if |δ| < 1

ln(1 + δ) = δ + o(δ) if |δ| < 1

lim_{n→∞} [ 1 + a/n + o(n^{−1}) ]^n = e^a

e^a = Σ_{i=0}^∞ a^i / i!

Distribution of Selected Sums & Expectations

X_i iid Bern(θ) ⟹ E(X_i) = θ; Var(X_i) = θ(1 − θ); and Σ_{i=1}^n X_i ∼ Bin(n, θ)

X_i iid Geom(θ) ⟹ E(X_i) = 1/θ; Var(X_i) = (1 − θ)/θ²; and Σ_{i=1}^n X_i ∼ NegBin(n, θ)

X_i iid Poi(λ) ⟹ E(X_i) = λ; Var(X_i) = λ; and Σ_{i=1}^n X_i ∼ Poi(nλ)

X_i iid Expon(λ) ⟹ E(X_i) = 1/λ; Var(X_i) = 1/λ²; and Σ_{i=1}^n X_i ∼ Gamma(n, λ)

X_i iid NegBin(k, θ) ⟹ E(X_i) = k/θ; Var(X_i) = k(1 − θ)/θ²; and Σ_{i=1}^n X_i ∼ NegBin(nk, θ)
C.2 Exam 1

1. Suppose X ∼ Gam(α, λ);

f_X(x) = λ^α x^{α−1} e^{−λx} / Γ(α) I_{(0,∞)}(x),

where α > 0 and λ > 0.

(a) Verify that the mgf of X is

ψ_X(t) = [ λ/(λ − t) ]^α.

(b) For what values of t does the mgf exist?

2. Suppose that W_1, . . . , W_n is a random sample of size n from Expon(λ);

f_W(w) = λ e^{−λw} I_{(0,∞)}(w),

where λ > 0. Use mgfs to obtain the distribution of Y = Σ_{i=1}^n W_i. Hint: The mgf of W can be obtained from question #1 because the exponential distribution is a special case of the gamma distribution.

3. Suppose that X is a random variable with mgf

ψ_X(t) = 1/(1 − t).

(a) Give the pdf of X.

(b) Derive an expression for E(X^r); r = 0, 1, 2, . . ..

4. Suppose that X ∼ N(μ_X, σ²_X); Y ∼ N(μ_Y, σ²_Y); and that X ⊥ Y. The mgf of X is

ψ_X(t) = exp( tμ_X + t²σ²_X/2 ).

(a) Verify that E(X) = μ_X and that Var(X) = σ²_X.

(b) Prove that X − Y ∼ N(μ_X − μ_Y, σ²_X + σ²_Y).

5. Suppose that X ∼ LogN(μ, σ²). Compute

Pr( X ≤ e^{μ+σ} ).

6. Let W_i for i = 1, . . . , n and X_i for i = 1, . . . , m be iid random variables, each with distribution N(0, σ²).

(a) Give the distribution of

U = Σ_{i=1}^n (W_i/σ)².

Justify your answer. Hint: First give the distribution of W_i/σ.

(b) Give the distribution of

V = [ m Σ_{i=1}^n W_i² ] / [ n Σ_{i=1}^m X_i² ].

Justify your answer.

7. Suppose that X_i is a random sample of size n from an infinite sized population having mean μ and variance σ². Let X̄ be the sample mean.

(a) Verify that E(X̄) = μ.

(b) Verify that Var(X̄) = σ²/n.

(c) Let S² be the sample variance;

S² = [ 1/(n − 1) ] Σ_{i=1}^n (X_i − X̄)² = [ 1/(n − 1) ] [ Σ_{i=1}^n X_i² − nX̄² ].

Verify that E(S²) = σ².
C.3 Exam 2

1. Suppose that X_1, X_2, . . . , X_n is a random sample of size n from f_X(x|α, β), where

f_X(x|α, β) = α β^α / x^{α+1} I_{(β,∞)}(x),

where α > 0 and β > 0 are unknown parameters. This distribution is called the Pareto(α, β) distribution.

(a) Find a two dimensional sufficient statistic.

(b) Verify that the pdf of X_(1) is Pareto(nα, β). That is,

f_{X_(1)}(x|α, β) = nα β^{nα} / x^{nα+1} I_{(β,∞)}(x).

(c) The joint sampling distribution of the sufficient statistics can be studied using simulation. Let U_1, U_2, . . . , U_n be a random sample from Unif(0, 1). Show how U_i can be transformed into a random variable having a Pareto(α, β) distribution.

2. Suppose that X ∼ Gamma(α, λ), where α is known.

(a) Verify that the distribution of X belongs to the exponential family.

(b) Let X_1, X_2, . . . , X_n be a random sample from the Gamma(α, λ) distribution, where α is known. Use the results from part (a) to find a sufficient statistic.

(c) Give the likelihood function that corresponds to part (b).

3. Consider the problem of making inferences about θ, the parameter of a geometric distribution. Let X_1, X_2, . . . , X_n be a random sample from f_{X|θ}(x|θ), where

f_{X|θ}(x|θ) = θ(1 − θ)^{x−1} I_{{1,2,...}}(x).

(a) Verify that T = Σ_{i=1}^n X_i is a sufficient statistic.

(b) Verify that the conditional distribution P(X = x|T = t) does not depend on θ.

(c) Suppose that the investigator's prior beliefs about θ can be summarized as θ ∼ Beta(α, β). Find the posterior distribution of θ and find the expectation of θ conditional on T = t.

(d) Let Z_1, Z_2, . . . , Z_k be a sequence of future Geom(θ) random variables and let Y = Σ_{i=1}^k Z_i. Find the posterior predictive distribution of Y given T. That is, find f_{Y|T}(y|t).

4. Let X_1, X_2, . . . , X_n be a random sample of size n from a distribution having mean μ and variance σ². Define Z_n as

Z_n = (X̄ − μ) / (σ/√n).

(a) State the central limit theorem.

(b) Verify that

Z_n = Σ_{i=1}^n U_i, where U_i = Z*_i/√n and Z*_i = (X_i − μ)/σ.

(c) Assume that X has a moment generating function. Verify that

ψ_{Z_n}(t) = [ ψ_{U_i}(t) ]^n.

(d) Verify that the mean and variance of U_i are 0 and n^{−1}, respectively.

(e) Complete the proof of the central limit theorem.
C.4 Exam 3

1. Let X be a random variable; let h(X) be a non-negative function whose expectation exists; and let k be any positive number. Chebyshev's inequality reveals that

P[ h(X) ≥ k ] ≤ E[h(X)]/k or, equivalently, that P[ h(X) < k ] ≥ 1 − E[h(X)]/k.

(a) Define what it means for an estimator T_n to be consistent for a parameter θ.

(b) Use Chebyshev's inequality to verify that

lim_{n→∞} MSE_{T_n}(θ) = 0 ⟹ T_n →prob θ.

2. Suppose that X_1, X_2, . . . , X_n is a random sample from Bern(θ).

(a) Give the likelihood function.

(b) Find a sufficient statistic.

(c) Verify that the score function is

S(θ|X) = ( Σ_{i=1}^n X_i − nθ ) / [ θ(1 − θ) ].

(d) Derive the MLE of θ.

(e) Derive the MLE of (1 − θ)/θ.

(f) Derive Fisher's information.

(g) Verify or refute the claim that the MLE of θ is the minimum variance unbiased estimator of θ.

3. Suppose that X_i iid Expon(λ) for i = 1, . . . , n. It can be shown that Y = Σ_{i=1}^n X_i is sufficient and that Y ∼ Gamma(n, λ).

(a) Derive the moment generating function of Q = 2λ Σ_{i=1}^n X_i and verify that Q is a pivotal quantity. Use the moment generating function of Q to determine its distribution.

(b) Use Q to find a 100(1 − α)% confidence interval for λ.

4. Suppose that X_1, X_2, . . . , X_n is a random sample from f_X(x|θ), where

f_X(x|θ) = θ x^{θ−1} I_{(0,1)}(x) and θ > 0.

(a) Verify or refute the claim that the distribution of X belongs to the exponential class.

(b) Find the most powerful test of H_0: θ = θ_0 versus H_a: θ = θ_a, where θ_a > θ_0.

(c) Find the uniformly most powerful test of H_0: θ = θ_0 versus H_a: θ > θ_0.

(d) Suppose that the investigator's prior beliefs about θ can be summarized as θ ∼ Gamma(α, λ). Find the posterior distribution of θ. Hint: write x_i^θ as x_i^θ = e^{θ ln(x_i)}.

(e) Find the Bayes estimator of θ based on a squared error loss function.
