
NICHOLAS N.N. NSOWAH-NUAMAH

ADVANCED TOPICS IN INTRODUCTORY PROBABILITY
A FIRST COURSE IN PROBABILITY THEORY – VOLUME III

Advanced Topics in Introductory Probability: A First Course in Probability Theory – Volume III
2nd edition
© 2018 Nicholas N.N. Nsowah-Nuamah & bookboon.com
ISBN 978-87-403-2238-5

CONTENTS
Part 1 Bivariate Probability Distributions 7

Chapter 1 Probability and Distribution Functions of Bivariate Distributions 8
1.1 Introduction 8
1.2 Concept of Bivariate Random Variables 8
1.3 Joint Probability Distributions 9
1.4 Joint Cumulative Distribution Functions 18
1.5 Marginal Distribution of Bivariate Random Variables 23
1.6 Conditional Distribution of Bivariate Random Variables 30
1.7 Independence of Bivariate Random Variables 35

Chapter 2 Sums, Differences, Products and Quotients of Bivariate Distributions 45
2.1 Introduction 45
2.2 Sums of Bivariate Random Variables 46
2.3 Differences of Random Variables 63


2.4 Products of Bivariate Random Variables 68


2.5 Quotients of Bivariate Random Variables 72

Chapter 3 Expectation and Variance of Bivariate Distributions 80


3.1 Introduction 80
3.2 Expectation of Bivariate Random Variables 80
3.3 Variance of Bivariate Random Variables 104

Chapter 4 Measures of Relationship of Bivariate Distributions 120


4.1 Introduction 120
4.2 Product Moment 121
4.3 Covariance of Random Variables 124
4.4 Correlation Coefficient of Random Variables 130
4.5 Conditional Expectations 135
4.6 Conditional Variances 141
4.7 Regression Curves 143

Part 2 Statistical Inequalities, Limit Laws and Sampling Distributions 154

Chapter 5 Statistical Inequalities and Limit Laws 155


5.1 Introduction 155
5.2 Markov’s Inequality 156
5.3 Chebyshev’s Inequality 160
5.4 Law of Large Numbers 170
5.5 Central Limit Theorem 177

Chapter 6 Sampling Distributions I: Basic Concepts 191


6.1 Introduction 191
6.2 Statistical Inference 192
6.3 Probability Sampling 195
6.4 Sampling With and Without Replacement 200

Chapter 7 Sampling Distributions II: Sampling Distribution of Statistics 202
7.1 Introduction 202
7.2 Sampling Distribution of Means 206
7.3 Sampling Distribution of Proportions 216
7.4 Sampling Distribution of Differences 221
7.5 Sampling Distribution of Variance 225


Chapter 8 Distributions Derived from Normal Distribution 236


8.1 Introduction 236
8.2 χ² Distribution 237
8.3 t Distribution 243
8.4 F Distribution 247

Statistical Tables 254

Answers to Odd-Numbered Exercises 271

Bibliography 273


PART 1
BIVARIATE PROBABILITY DISTRIBUTIONS

I salute the discovery of a single even insignificant truth more highly than
all the argumentation on the highest questions which fails to reach a truth
GALILEO (1564–1642)


Chapter 1

PROBABILITY AND DISTRIBUTION FUNCTIONS
OF BIVARIATE DISTRIBUTIONS

1.1 INTRODUCTION

So far, all discussions in the two volumes of my book on probability (Nsowah-
Nuamah, 2017 and 2018) have been associated with a single random variable
X (that is, a one-dimensional or univariate random variable). Frequently,
we may be concerned with multivariate situations that simultaneously in-
volve two or more random variables. For instance, if we wanted to study
the relationship between weight and height of individual students, we might
consider weight and height to be two random variables X and Y, respec-
tively, whose values are determined by measuring the weights and heights of
the students in the school. Such a study will produce the ordered pair (X, Y).

1.2 CONCEPT OF BIVARIATE RANDOM VARIABLES

1.2.1 Definition of Bivariate Random Variables

Many of the concepts discussed for the one-dimensional random variables
also hold for the higher-dimensional case. Here, in most cases, we shall limit
ourselves to the two-dimensional (bivariate) case; more complex multivariate
situations are straightforward generalisations.

Definition 1.1 BIVARIATE RANDOM VARIABLE


If X = X(s) and Y = Y (s) are two real-valued functions on the
sample space S, then the pair (X, Y ) that assigns a point in the real
(x, y) plane to each point s ∈ S is called a bivariate random variable

Synonyms of a bivariate random variable are a two-dimensional random
variable or random vector. Fig. 1.1 is an illustration of a bivariate random variable.

Fig. 1.1 Bivariate Random Variables

1.2.2 Types of Bivariate Random Variables

Multivariate situations, similar to univariate cases, may involve discrete as
well as continuous random variables.

Definition 1.2 DISCRETE BIVARIATE RANDOM VARIABLE
(X, Y) is a discrete bivariate random variable if each of the random
variables X and Y is discrete.

Definition 1.3 CONTINUOUS BIVARIATE RANDOM VARIABLE
(X, Y) is a continuous bivariate random variable if each of the random
variables X and Y is continuous.

There are cases where one variable is discrete and the other continuous
but this will not be considered here.

1.3 JOINT PROBABILITY DISTRIBUTIONS

A joint distribution is a distribution having two or more random variables,


with each random variable still having its own probability distribution, ex-
pected value and variance. In addition, for ordered pair values of the ran-
dom variables, probabilities will exist and the strength of any relationship
between the two variables can be measured.
In the multivariate case as in the univariate case we often associate a
probability (mass) function with discrete random variables and a probability
density function with continuous random variables. We shall take up the
discrete case first since it is the easier one to understand.

1.3.1 Joint Probability Distribution of Discrete Random Variables

Suppose that X and Y are discrete random variables, and X takes values
i = 0, 1, 2, · · · , n, and Y takes values j = 1, 2, · · · , m. Most often, such a
joint distribution is given in table form. Table 1.1 is an n-by-m array which
displays the number of occurrences of the various combinations of values
of X and Y . We may observe that each row represents values of X and
each column represents values of Y . The row and column totals are called
marginal totals. Such a table is called the joint frequency distribution.

Table 1.1 Joint Frequency Distribution of X and Y

                          Y
 X          y1           y2           ···    ym           Row Totals
 x1         (x1, y1)     (x1, y2)     ···    (x1, ym)     ∑_y (x1, y)
 x2         (x2, y1)     (x2, y2)     ···    (x2, ym)     ∑_y (x2, y)
 ..         ..           ..           ..     ..           ..
 xn         (xn, y1)     (xn, y2)     ···    (xn, ym)     ∑_y (xn, y)
 Column     ∑_x (x, y1)  ∑_x (x, y2)  ···    ∑_x (x, ym)  ∑_x ∑_y (x, y) = N
 Totals

For example, suppose X and Y are discrete random variables, and X takes
values 0, 1, 2, 3, and Y takes values 1, 2, 3. Each of the nm row-column in-
tersections in Table 1.2 represents the frequency that belongs to the ordered
pair (X, Y ).

Table 1.2 Joint Frequency Distribution of X and Y

 Values           Values of Y           Row
 of X         1        2        3       Totals
 0            1        0        0       1
 1            0        2        1       3
 2            0        2        1       3
 3            1        0        0       1
 Column       2        4        2       8
 Totals

Definition 1.4 JOINT PROBABILITY DISTRIBUTION


Let X and Y be discrete random variables with possible values
xi , i = 1, 2, ..., n and yj , j = 1, 2, 3, ..., m, respectively. The joint
(or bivariate) probability distribution for X and Y is given by

p(xi , yj ) = P ({X = xi } ∩ {Y = yj })

defined for all (xi , yj )

The function p(xi , yj ) is sometimes referred to as the joint probability


mass function (p.m.f.) or the joint probability function (p.f.) of X and Y .
This function gives the probability that X will assume a particular value x
while at the same time Y will assume a particular value y.

Note
(a) The notation p(x, y) for all (x, y) is the same as writing p(xi, yj) for
    i = 1, 2, ..., n and j = 1, 2, 3, ..., m. Sometimes when there is no ambiguity
    we shall use simply p(x, y).

(b) The joint probability p(xi , yj ) is sometimes denoted as


P (X = x, Y = y),
where the comma stands for ‘and’ or ‘∩’.

Definition 1.5
If X and Y are discrete random variables with joint probability mass
function p(xi, yj), then

(a) p(xi, yj) ≥ 0, for all i and j

(b) ∑_{i=1}^{n} ∑_{j=1}^{m} p(xi, yj) = 1

Once the joint probability mass function is determined for discrete random
variables X and Y , calculation of joint probabilities involving X and Y is
straightforward.
Let the value that the random variables X and Y jointly take be de-
noted by the ordered pair (xi , yj ). The joint probability p(xi , yj ) is obtained
by counting the number of occurrences of that combination of values X and
Y and dividing the count by the total number of all the sample points. Thus,
    P({X = xi} ∩ {Y = yj}) = #({X = xi} ∩ {Y = yj}) / ∑_{i=1}^{n} ∑_{j=1}^{m} #({X = xi} ∩ {Y = yj})
                           = #(xi, yj) / N

where #(xi, yj) is the number of occurrences in the cell of the ordered pair
(xi, yj), and N = ∑_{i=1}^{n} ∑_{j=1}^{m} #(xi, yj) is the total number of all sample
points (cells) of the ordered pairs (xi, yj).
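
This counting recipe is easy to mechanise. The following minimal Python sketch
(ours, not part of the original text; it uses the frequencies of Table 1.2) turns a
table of counts into a joint probability mass function:

    from fractions import Fraction

    # Frequency table of Table 1.2: keys are (x, y), values are counts #(x, y)
    counts = {
        (0, 1): 1, (0, 2): 0, (0, 3): 0,
        (1, 1): 0, (1, 2): 2, (1, 3): 1,
        (2, 1): 0, (2, 2): 2, (2, 3): 1,
        (3, 1): 1, (3, 2): 0, (3, 3): 0,
    }

    N = sum(counts.values())                       # total number of sample points (8)

    # joint p.m.f.: p(x, y) = #(x, y) / N
    joint_pmf = {xy: Fraction(c, N) for xy, c in counts.items()}

    print(joint_pmf[(1, 2)])                       # 1/4, i.e. 2/8
    print(sum(joint_pmf.values()))                 # 1, as required by Definition 1.5

The same dictionary-of-counts approach works for any finite joint frequency table.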


Joint Probability Distribution of Bivariate Random Variables in Tabular Form

The joint probability distribution may be given in the form of a table of
n rows and m columns (see Table 1.3). The margins of the table indicate
the possible distinct values of X and Y. The numbers in the body
of the table are the probabilities for the joint occurrences of the two events
corresponding to X = xi (1 ≤ i ≤ n) and Y = yj (1 ≤ j ≤ m). The row and
column totals are the probabilities for the individual random variables and
are called marginal probabilities because they appear on the margins of
the table. Such a table is also called the joint relative frequency distribution.

Table 1.3 Joint Probability Distribution of X and Y

                          Y
 X          y1          y2          ···    ym          Row Totals
 x1         p(x1, y1)   p(x1, y2)   ···    p(x1, ym)   p(x1)
 x2         p(x2, y1)   p(x2, y2)   ···    p(x2, ym)   p(x2)
 ..         ..          ..          ..     ..          ..
 xn         p(xn, y1)   p(xn, y2)   ···    p(xn, ym)   p(xn)
 Column     p(y1)       p(y2)       ···    p(ym)       ∑_{i=1}^{n} ∑_{j=1}^{m} p(xi, yj) = 1
 Totals

Note
The marginal probabilities for X are simply the probabilities that X = xi,
obtained by summing over the values of yj, where j assumes a value from 1 to m.
Similarly, the marginal probabilities for Y are the probabilities that Y = yj,
obtained by summing over the values of xi, where i assumes a value from 1 to n.

It is important to note that the distribution satisfies the requirements of a
joint probability function, namely,

(a) p(xi, yj) ≥ 0, for all i = 1, 2, · · · , n; j = 1, 2, · · · , m.

(b) ∑_{i=1}^{n} ∑_{j=1}^{m} p(xi, yj) = 1


Example 1.1
(a) For the data in Table 1.2, calculate the joint probabilities of X and Y.
(b) Does this distribution satisfy the properties of a joint probability function?

Solution
(a) From Table 1.2, the cell ({X = 0} ∩ {Y = 1}) = (0, 1) contains one
    element; the total number of elements in all cells is 8. Hence

    P({X = 0} ∩ {Y = 1}) = p(0, 1)
                         = #({X = 0} ∩ {Y = 1}) / ∑_{i} ∑_{j} #({X = xi} ∩ {Y = yj})
                         = 1/8

    Similarly,

    P({X = 0} ∩ {Y = 2}) = p(0, 2) = 0/8 = 0
    P({X = 0} ∩ {Y = 3}) = p(0, 3) = 0/8 = 0
    P({X = 1} ∩ {Y = 1}) = p(1, 1) = 0/8 = 0
    P({X = 1} ∩ {Y = 2}) = p(1, 2) = 2/8 = 1/4
    P({X = 1} ∩ {Y = 3}) = p(1, 3) = 1/8

    When the probabilities of all possible joint events, P(X = xi, Y = yj),
    have been determined in this fashion, we have a joint probability distri-
    bution of X and Y and these results may be presented in a two-way
    table as shown below:

     X            Y
              1      2      3
     0        1/8    0      0
     1        0      2/8    1/8
     2        0      2/8    1/8
     3        1/8    0      0

(b) From the table above,

    (i) p(xi, yj) ≥ 0, for all i = 0, 1, 2, 3; j = 1, 2, 3.

    (ii) ∑_{i=0}^{3} ∑_{j=1}^{3} p(xi, yj) = 1/8 + 1/8 + 2/8 + 2/8 + 1/8 + 1/8 = 1

    Hence this distribution is a joint probability function.

Joint Probability Distribution of Bivariate Random Variables in Expression Form

Sometimes the joint probability distribution of discrete random variables X
and Y is given by a formula.

Example 1.2
Given the function

    p(x, y) = k(3x + 2y),   x = 0, 1;  y = 0, 1, 2

(a) Find the constant k > 0 such that p(x, y) is a joint probability
    mass function.

(b) Present it in a tabular form for the probabilities associated with the
    sample points (x, y). Obtain the row and column totals.
Solution
(a) (i) p(x, y) ≥ 0, since k > 0 and 3x + 2y ≥ 0 for the given values of x and y.

    (ii) ∑_{x=0}^{1} ∑_{y=0}^{2} k(3x + 2y) = k ∑_{x=0}^{1} [(3x + 0) + (3x + 2) + (3x + 4)]
                                            = k ∑_{x=0}^{1} (9x + 6)
                                            = k[(0 + 6) + (9 + 6)]
                                            = 21k

    For p(x, y) to be a joint probability function we must have 21k = 1, from
    which
        k = 1/21

(b) For the sample point {X = 0, Y = 0} = (0, 0),

        p(0, 0) = (1/21)[3(0) + 2(0)] = 0

    Similarly,

        p(0, 1) = (1/21)[3(0) + 2(1)] = 2/21
        p(0, 2) = (1/21)[3(0) + 2(2)] = 4/21
        p(1, 0) = (1/21)[3(1) + 2(0)] = 3/21
        p(1, 1) = (1/21)[3(1) + 2(1)] = 5/21
        p(1, 2) = (1/21)[3(1) + 2(2)] = 7/21

    These results are presented in the following table. Recollect that the row
    and column totals are the marginal probabilities.

     X            Y                         Row
              0       1       2             Totals
     0        0       2/21    4/21          6/21
     1        3/21    5/21    7/21          15/21
     Column   3/21    7/21    11/21         1
     Totals
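
As a quick check on the algebra, the normalising constant and the marginal
totals can be reproduced with a few lines of Python. This is only a sketch
under our own variable names, not part of the original text:

    from fractions import Fraction

    # p(x, y) = k(3x + 2y) for x = 0, 1 and y = 0, 1, 2  (Example 1.2)
    xs, ys = (0, 1), (0, 1, 2)

    total_weight = sum(3*x + 2*y for x in xs for y in ys)   # 21, so k = 1/21
    k = Fraction(1, total_weight)
    pmf = {(x, y): k * (3*x + 2*y) for x in xs for y in ys}

    print(pmf[(1, 2)])                  # 7/21, printed in lowest terms as 1/3
    print(sum(pmf.values()))            # 1
    for x in xs:                        # row totals 6/21 and 15/21 (printed reduced)
        print(x, sum(pmf[(x, y)] for y in ys))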

1.3.2 Joint Distribution of Continuous Random Variables

Definition 1.6 JOINT PROBABILITY DENSITY FUNCTION
Let X and Y be two continuous random variables. The joint (or
bivariate) probability density function f(x, y) of X and Y is given by

    P({x1 ≤ X ≤ x2} ∩ {y1 ≤ Y ≤ y2}) = ∫_{y1}^{y2} ∫_{x1}^{x2} f(x, y) dx dy

for two pairs (x1, x2) and (y1, y2) with x2 ≥ x1, y2 ≥ y1.

Fig 1.2 depicts the case of the continuous bivariate random variables
(X, Y ) which assume all the values in the rectangle x1 ≤ X ≤ x2 and
y1 ≤ Y ≤ y2 .


Fig. 1.2 Region assumed by (X, Y ) is shaded

Thus, the probability that the event

{x1 ≤ X ≤ x2 } ∩ {y1 ≤ Y ≤ y2 }

will fall in the shaded rectangle is

P ({x1 ≤ X ≤ x2 } ∩ {y1 ≤ Y ≤ y2 })
This probability can be found by subtracting from the probability that the
event will fall in the (semi-infinite) rectangle having the upper-right corner
(x2 , y2 ) the probabilities that it will fall in the semi-infinite rectangle having
the upper-right corner (x1 , y2 ) and (x2 , y1 ) respectively, and then adding
back the probability that it will fall in the semi-infinite rectangle with the
upper-right corner at (x1 , y1 ). That is,

P (x1 ≤ X ≤ x2 , y1 ≤ Y ≤ y2 ) = P (X ≤ x2 , Y ≤ y2 ) − P (X ≤ x2 , Y ≤ y1 )
−P (X ≤ x1 , Y ≤ y2 ) + P (X ≤ x1 , Y ≤ y1 )

For computational purposes, we shall adopt Definition 1.7.

Definition 1.7
Let (X, Y) be a continuous bivariate random variable assuming all
values in the region R. The joint probability density function f is a
function satisfying the following properties:

Property 1   f(x, y) ≥ 0, for all (x, y) ∈ R

Property 2   ∫∫_R f(x, y) dx dy = 1

Property 2 states that the total volume bounded by the surface given by the
equation z = f(x, y) and the region R on the xy-plane is equal to 1.

Example 1.3
Given the following function of a two-dimensional continuous random variable (X, Y):

    f(x, y) = { x² + xy/k,   0 ≤ x ≤ 1, 0 ≤ y ≤ 2
              { 0,           elsewhere

where k is a constant.

(a) Find the value of k > 0 such that f(x, y) is a probability density function.

(b) Find P(0 < X < 1, 1 < Y < 2).

Solution
(a) For f(x, y) to be a p.d.f., it should satisfy the two properties of Definition
    1.7. Obviously,
        f(x, y) ≥ 0
    since x ≥ 0, y ≥ 0, and k > 0.
    We also require
        ∫_{-∞}^{+∞} ∫_{-∞}^{+∞} f(x, y) dx dy = 1

    Now,
        ∫_{y=0}^{2} ∫_{x=0}^{1} (x² + xy/k) dx dy = ∫_{y=0}^{2} [x³/3 + x²y/(2k)]_{x=0}^{1} dy
                                                  = ∫_{y=0}^{2} (1/3 + y/(2k)) dy
                                                  = [y/3 + y²/(4k)]_{0}^{2}
                                                  = 2/3 + 1/k = 1

    giving k = 3.

(b) Using the value of k found in (a) we have

        P(0 < X < 1, 1 < Y < 2) = ∫_{x=0}^{1} ∫_{y=1}^{2} (x² + xy/3) dy dx
                                = ∫_{y=1}^{2} ∫_{x=0}^{1} (x² + xy/3) dx dy
                                = ∫_{y=1}^{2} [x³/3 + x²y/6]_{x=0}^{1} dy
                                = ∫_{y=1}^{2} (1/3 + y/6) dy
                                = [y/3 + y²/12]_{y=1}^{2}
                                = 7/12
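
Readers who like to verify such double integrals symbolically can do so with
sympy; the following sketch (ours, not the author's) reproduces both k = 3 and
the probability 7/12:

    import sympy as sp

    x, y, k = sp.symbols('x y k', positive=True)
    f = x**2 + x*y/k

    # Property 2: the integral over 0 <= x <= 1, 0 <= y <= 2 must equal 1
    total = sp.integrate(f, (x, 0, 1), (y, 0, 2))      # 2/3 + 1/k
    k_val = sp.solve(sp.Eq(total, 1), k)[0]            # 3

    # part (b): P(0 < X < 1, 1 < Y < 2) with k = 3
    prob = sp.integrate(f.subs(k, k_val), (x, 0, 1), (y, 1, 2))
    print(k_val, prob)                                 # 3  7/12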

1.4 JOINT CUMULATIVE DISTRIBUTION FUNCTIONS

1.4.1 Definition of Joint Bivariate Distribution Function

The joint behaviour of two random variables X and Y is determined by the


joint cumulative distribution function, also called the bivariate cumulative
distribution function, or simply the joint or bivariate distribution function
of the two random variables, X and Y . The definition given in Definition
1.8 is applicable whether X and Y are discrete or continuous.

Definition 1.8 JOINT DISTRIBUTION FUNCTION


For any random variables X and Y , the joint (bivariate) cumulative
distribution function F (x, y), is given by

F (x, y) = P ({X ≤ x} ∩ {Y ≤ y}) = P (X ≤ x, Y ≤ y)

1.4.2 Joint Distribution Function of Discrete Random Variables

The joint cumulative distribution function of random variables X and Y
gives the probability that X takes on a value less than or equal to xi,
i = 1, 2, · · · , n, and that Y takes on a value less than or equal to yj,
j = 1, 2, · · · , m.

Definition 1.9 JOINT DISTRIBUTION FUNCTION (Discrete Case)
The joint distribution function of two discrete random variables X
and Y is

    F(x, y) = ∑_{xi ≤ x} ∑_{yj ≤ y} p(xi, yj)

Example 1.4
Refer to the table of Example 1.1. Calculate
(a) the joint probability P(2 ≤ X ≤ 3, 1 ≤ Y ≤ 2);
(b) the joint cumulative probability P(X ≤ 1, Y ≤ 2).

Solution
(a) The joint probability P(2 ≤ X ≤ 3, 1 ≤ Y ≤ 2) is obtained as follows:

    P(2 ≤ X ≤ 3, 1 ≤ Y ≤ 2) = p(2, 1) + p(2, 2) + p(3, 1) + p(3, 2)
                             = 0 + 2/8 + 1/8 + 0 = 3/8

(b) The joint cumulative probability P(X ≤ 1, Y ≤ 2) is as follows:

    P(X ≤ 1, Y ≤ 2) = F(1, 2)
                    = p(0, 1) + p(0, 2) + p(1, 1) + p(1, 2)
                    = 1/8 + 0 + 0 + 2/8 = 3/8
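
Both calculations in Example 1.4 amount to summing entries of the joint table
over a rectangle of values. The Python sketch below (our own helper names,
using the p.m.f. of Example 1.1) does this mechanically:

    from fractions import Fraction

    # joint p.m.f. of Example 1.1 (zero cells omitted)
    pmf = {(0, 1): Fraction(1, 8), (1, 2): Fraction(2, 8), (1, 3): Fraction(1, 8),
           (2, 2): Fraction(2, 8), (2, 3): Fraction(1, 8), (3, 1): Fraction(1, 8)}

    def F(x, y):
        # joint c.d.f.: sum of p(xi, yj) over xi <= x and yj <= y
        return sum(p for (xi, yj), p in pmf.items() if xi <= x and yj <= y)

    def P_box(x1, x2, y1, y2):
        # P(x1 <= X <= x2, y1 <= Y <= y2) by direct summation
        return sum(p for (xi, yj), p in pmf.items()
                   if x1 <= xi <= x2 and y1 <= yj <= y2)

    print(P_box(2, 3, 1, 2))   # 3/8, Example 1.4(a)
    print(F(1, 2))             # 3/8, Example 1.4(b)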
Example 1.5
Refer to Example 1.2. Calculate
(a) the joint probability P(0 ≤ X ≤ 1, 1 ≤ Y ≤ 2);
(b) the joint cumulative distribution function P(X ≤ 1, Y ≤ 1);
(c) the joint cumulative distribution function P(X ≤ 1, Y ≤ 2).

Solution
(a) P(0 ≤ X ≤ 1, 1 ≤ Y ≤ 2) = ∑_{x=0}^{1} ∑_{y=1}^{2} (1/21)(3x + 2y)
                             = (1/21) ∑_{x=0}^{1} [(3x + 2) + (3x + 4)]
                             = (1/21) ∑_{x=0}^{1} (6x + 6)
                             = (1/21)[(0 + 6) + (6 + 6)] = 6/7

(b) P(X ≤ 1, Y ≤ 1) = F(1, 1)
                    = ∑_{x=0}^{1} ∑_{y=0}^{1} (1/21)(3x + 2y)
                    = (1/21) ∑_{x=0}^{1} [(3x + 0) + (3x + 2)]
                    = (1/21) ∑_{x=0}^{1} (6x + 2)
                    = (1/21)[(0 + 2) + (6 + 2)] = 10/21

The reader is asked in Exercise 1.5 to solve part (c) of this example.

1.4.3 Joint Distribution Function of Continuous Random Variables

Definition 1.10 JOINT DISTRIBUTION FUNCTION (Continuous Case)
The cumulative distribution function F of the two-dimensional random
variable (X, Y) is defined as

    F(x, y) = P({X ≤ x} ∩ {Y ≤ y}) = ∫_{-∞}^{y} ∫_{-∞}^{x} f(s, t) ds dt

where f(s, t) is the value of the joint probability density of X and Y at (s, t).

The joint distribution function gives the probability that the point (X, Y)
belongs to a semi-infinite rectangle in the plane, as shown in Fig. 1.3.

Fig. 1.3 Region of {X ≤ x, Y ≤ y} is shaded

Example 1.6
Refer to Example 1.3. Calculate P (X ≤ 1, Y < 1).

Solution

    P(X ≤ 1, Y < 1) = ∫_{y=0}^{1} ∫_{x=0}^{1} (x² + xy/3) dx dy
                    = ∫_{y=0}^{1} [x³/3 + x²y/6]_{x=0}^{1} dy
                    = ∫_{y=0}^{1} (1/3 + y/6) dy
                    = [y/3 + y²/12]_{y=0}^{1}
                    = 5/12

Theorem 1.1
If F is the cumulative distribution function of a two-dimensional
random variable with joint probability density function f(x, y), then

    ∂²F(x, y)/∂x ∂y = f(x, y)

wherever F is differentiable.

Example 1.7
Let
    F(x, y) = (1 − e^{−x})(1 − e^{−y}),   x ≥ 0, y ≥ 0
Find the joint probability density function f(x, y).

Solution

    ∂F(x, y)/∂x = e^{−x}(1 − e^{−y})

    ∂²F(x, y)/∂x∂y = e^{−x} e^{−y}
                   = e^{−(x+y)},   x ≥ 0, y ≥ 0.

Hence
    f(x, y) = e^{−(x+y)},   x ≥ 0, y ≥ 0

Note
    ∂²F(x, y)/∂x∂y = ∂²F(x, y)/∂y∂x
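
Theorem 1.1 is easy to check symbolically for the distribution function of
Example 1.7. The short sympy sketch below (ours, not the author's) recovers
the density and confirms that the mixed partial derivatives agree:

    import sympy as sp

    x, y = sp.symbols('x y', nonnegative=True)
    F = (1 - sp.exp(-x)) * (1 - sp.exp(-y))

    # Theorem 1.1: f(x, y) = d^2 F / dx dy
    f = sp.diff(F, x, y)
    print(sp.simplify(f))                        # exp(-x - y), i.e. e^{-(x+y)}

    # the mixed partials agree, as the Note asserts
    print(sp.simplify(sp.diff(F, y, x) - f))     # 0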

1.4.4 Properties of Bivariate Cumulative Distribution Function

The joint c.d.f. of a bivariate random variable has properties which are
analogous to those of the univariate random variable.

Property 1
The function F(x, y) is a probability, hence

    0 ≤ F(x, y) ≤ 1

Property 2
The bivariate distribution function F(x, y) is monotonic increasing, in a
wider sense, in both variables; that is,

if x1 ≤ x2, then
    F(x1, y) ≤ F(x2, y), for y fixed

if y1 ≤ y2, then
    F(x, y1) ≤ F(x, y2), for x fixed

Property 3
The following relations are also true:
(a) F(−∞, y) = 0
(b) F(x, −∞) = 0
(c) F(+∞, +∞) = 1
(d) F(−∞, −∞) = 0
(e) F(x, +∞) = P(X ≤ x) = F1(x), where F1(x) is the c.d.f. of X
(f) P(a < X < b, Y ≤ y) = F(b, y) − F(a, y)
(g) P(X ≤ x, c < Y < d) = F(x, d) − F(x, c)
(h) P(a < X < b, c < Y < d) = F(b, d) − F(a, d) − F(b, c) + F(a, c)

Property 4
At points of continuity of f(x, y),

    ∂²F(x, y)/∂x∂y = f(x, y)
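
Relation (h) of Property 3 can be verified symbolically for a concrete case,
for instance the distribution function of Example 1.7. The sympy sketch below
(ours) compares the rectangle probability obtained by integrating the density
with the four-term expression in (h):

    import sympy as sp

    x, y, a, b, c, d = sp.symbols('x y a b c d', positive=True)
    F = lambda u, v: (1 - sp.exp(-u)) * (1 - sp.exp(-v))   # c.d.f. of Example 1.7
    f = sp.diff(F(x, y), x, y)                              # its density e^{-(x+y)}

    # property (h): rectangle probability versus the c.d.f. combination
    lhs = sp.integrate(f, (x, a, b), (y, c, d))
    rhs = F(b, d) - F(a, d) - F(b, c) + F(a, c)
    print(sp.simplify(lhs - rhs))   # 0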

1.5 MARGINAL DISTRIBUTION OF BIVARIATE RANDOM VARIABLES

1.5.1 Marginal Distribution of Discrete Bivariate Random Variables

The row totals of Table 1.3 provide us with the probability distribution of
X. Similarly, the column totals provide the probability distribution of Y .
These are typically called marginal probability mass functions because they
are found on the margins of tables.

Definition 1.11 MARGINAL BIVARIATE DISTRIBUTIONS OF X AND Y
Let X and Y be discrete random variables with joint probability
function p(xi, yj). Then the marginal distributions of X and Y respectively
are given by

    g(xi) = P(X = xi) = ∑_{j=1}^{m} p(xi, yj),   i = 1, 2, 3, ..., n

    h(yj) = P(Y = yj) = ∑_{i=1}^{n} p(xi, yj),   j = 1, 2, 3, ..., m
Example 1.8
For the data in Table 1.2, find the marginal probability distribution for
(a) X and (b) Y.

Solution
To calculate the marginal probabilities the joint probabilities are required.
The joint probabilities for this problem have been calculated in Example 1.1.

(a) From the table obtained in Example 1.1 we shall calculate the marginal
    probabilities for each xi by fixing i and summing all the joint probabilities
    across j. Thus:

    P(X = 0) = P(X = 0, Y = 1) + P(X = 0, Y = 2) + P(X = 0, Y = 3)
             = p(0, 1) + p(0, 2) + p(0, 3) = 1/8 + 0 + 0 = 1/8

    P(X = 1) = P(X = 1, Y = 1) + P(X = 1, Y = 2) + P(X = 1, Y = 3)
             = p(1, 1) + p(1, 2) + p(1, 3) = 0 + 2/8 + 1/8 = 3/8

    P(X = 2) = P(X = 2, Y = 1) + P(X = 2, Y = 2) + P(X = 2, Y = 3)
             = p(2, 1) + p(2, 2) + p(2, 3) = 0 + 2/8 + 1/8 = 3/8

    P(X = 3) = P(X = 3, Y = 1) + P(X = 3, Y = 2) + P(X = 3, Y = 3)
             = p(3, 1) + p(3, 2) + p(3, 3) = 1/8 + 0 + 0 = 1/8

    The results are summarised in the table below:

     xi        0      1      2      3
     g(xi)     1/8    3/8    3/8    1/8

(b) Similar to (a), we fix j and sum all the joint probabilities across i.
    Hence,

    P(Y = 1) = P(X = 0, Y = 1) + P(X = 1, Y = 1) + P(X = 2, Y = 1) + P(X = 3, Y = 1)
             = p(0, 1) + p(1, 1) + p(2, 1) + p(3, 1) = 1/8 + 0 + 0 + 1/8 = 2/8

    P(Y = 2) = P(X = 0, Y = 2) + P(X = 1, Y = 2) + P(X = 2, Y = 2) + P(X = 3, Y = 2)
             = p(0, 2) + p(1, 2) + p(2, 2) + p(3, 2) = 0 + 2/8 + 2/8 + 0 = 4/8

    P(Y = 3) = P(X = 0, Y = 3) + P(X = 1, Y = 3) + P(X = 2, Y = 3) + P(X = 3, Y = 3)
             = p(0, 3) + p(1, 3) + p(2, 3) + p(3, 3) = 0 + 1/8 + 1/8 + 0 = 2/8

    The results are summarised in the table below.

     yj        1      2      3
     h(yj)     2/8    4/8    2/8

The results of Examples 1.1 and 1.8 (that is, the joint and marginal proba-
bilities) are usually presented in a single table such as in Table 1.4.

Note
(a) The marginal distributions of X and Y are the ordinary probability
    distribution functions of X and Y, but when derived from the joint
    distribution function the adjective “marginal” is added.

(b) In a marginal distribution, the probability of different values of a ran-
    dom variable in a subset of the random variables is determined without
    reference to any possible values of the other variables.

(c) From the table,

    ∑_{i=1}^{n} ∑_{j=1}^{m} p(xi, yj) = ∑_{i=1}^{n} g(xi) = ∑_{j=1}^{m} h(yj) = 1

    that is, the total probability is 1.

Table 1.4 Joint and Marginal Probabilities for Table 1.2

 X            Y                     Row
          1      2      3           Totals
 0        1/8    0      0           1/8
 1        0      2/8    1/8         3/8
 2        0      2/8    1/8         3/8
 3        1/8    0      0           1/8
 Column   2/8    4/8    2/8         1
 Totals
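
Computationally, the marginal distributions are just the row and column sums
of the joint table. A minimal Python sketch (ours, using the probabilities of
Table 1.4):

    from fractions import Fraction
    from collections import defaultdict

    # joint p.m.f. of Table 1.4 (zero cells omitted)
    pmf = {(0, 1): Fraction(1, 8), (1, 2): Fraction(2, 8), (1, 3): Fraction(1, 8),
           (2, 2): Fraction(2, 8), (2, 3): Fraction(1, 8), (3, 1): Fraction(1, 8)}

    g = defaultdict(Fraction)     # marginal of X (row totals)
    h = defaultdict(Fraction)     # marginal of Y (column totals)
    for (x, y), p in pmf.items():
        g[x] += p
        h[y] += p

    for x in sorted(g):
        print('g(%d) =' % x, g[x])    # 1/8, 3/8, 3/8, 1/8
    for y in sorted(h):
        print('h(%d) =' % y, h[y])    # 1/4, 1/2, 1/4  (i.e. 2/8, 4/8, 2/8)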

Computation of Marginal Probabilities of Discrete Bivariate Random Variables from Formula

We may also compute marginal distributions from a formula.

Example 1.9
Refer to Example 1.2. Find the marginal distributions of
(a) X, (b) Y.

Solution

(a) The marginal distribution of X is given by:

    g(x) = ∑_{y=0}^{2} (1/21)(3x + 2y)
         = (1/21) ∑_{y=0}^{2} (3x + 2y)
         = (1/21)[(3x + 0) + (3x + 2) + (3x + 4)]
         = (1/7)(3x + 2)

(b) The marginal distribution of Y is given by:

    h(y) = ∑_{x=0}^{1} (1/21)(3x + 2y)
         = (1/21) ∑_{x=0}^{1} (3x + 2y)
         = (1/21)[(0 + 2y) + (3 + 2y)]
         = (1/21)(3 + 4y)

1.5.2 Marginal Distribution of Continuous Bivariate Random Variables

Definition 1.12
Let f(x, y) be the joint probability density function of the continuous
two-dimensional random variable (X, Y). We define g(x) and h(y), the
marginal probability density functions of X and Y respectively, by

    g(x) = ∫_{-∞}^{∞} f(x, y) dy

and

    h(y) = ∫_{-∞}^{∞} f(x, y) dx

Example 1.10
Refer to Example 1.3. Find the marginal probability distribution of X and
Y.

Solution
Marginal probability distribution of X:

    g(x) = ∫_{-∞}^{∞} f(x, y) dy
         = ∫_{0}^{2} (x² + xy/3) dy,   0 < x < 1
         = [x²y + xy²/6]_{0}^{2}
         = 2x² + 4x/6

That is,
    g(x) = { 2x² + (2/3)x,   0 < x < 1
           { 0,              elsewhere

Marginal probability distribution of Y:

    h(y) = ∫_{-∞}^{∞} f(x, y) dx
         = ∫_{0}^{1} (x² + xy/3) dx,   0 < y < 2
         = [x³/3 + x²y/6]_{0}^{1}
         = 1/3 + y/6

That is,
    h(y) = { 1/3 + y/6,   0 < y < 2
           { 0,           elsewhere
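
The two marginal densities can be checked with sympy; the sketch below
(ours, not the author's) also confirms that each integrates to 1 over its own
range:

    import sympy as sp

    x, y = sp.symbols('x y')
    f = x**2 + x*y/3                 # joint density of Example 1.3 on 0<=x<=1, 0<=y<=2

    g = sp.integrate(f, (y, 0, 2))   # marginal of X: 2*x**2 + 2*x/3
    h = sp.integrate(f, (x, 0, 1))   # marginal of Y: y/6 + 1/3
    print(g, h)

    # each marginal integrates to 1 over its own range
    print(sp.integrate(g, (x, 0, 1)), sp.integrate(h, (y, 0, 2)))   # 1 1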

Definition 1.13 MARGINAL CUMULATIVE DISTRIBUTION
The marginal probability distribution function of X, denoted by FX(x), is

    FX(x) = P(X ≤ x)

and the marginal probability distribution function of Y, denoted by FY(y), is

    FY(y) = P(Y ≤ y)

Thus,

    FX(x) = P(X ≤ x, Y ≤ ∞)
          = lim_{y→∞} F(x, y)
          = ∫_{-∞}^{x} ∫_{-∞}^{∞} f(u, y) dy du

Similarly,

    FY(y) = lim_{x→∞} F(x, y)
          = ∫_{-∞}^{y} ∫_{-∞}^{∞} f(x, v) dx dv

From this, it follows that the probability density function of X alone, known
as the marginal density of X, is

    g(x) = fX(x) = F′X(x) = ∫_{-∞}^{∞} f(x, y) dy

Similarly,

    h(y) = fY(y) = F′Y(y) = ∫_{-∞}^{∞} f(x, y) dx

Note
The marginal probability density functions g(x) and h(y) can easily be de-
termined from the knowledge of the joint density function f (x, y). However,
the knowledge of the marginal probability density functions does not, in gen-
eral, uniquely determine the joint density function. The exception occurs
when the two random variables are independent.


1.6 CONDITIONAL DISTRIBUTION OF BIVARIATE RANDOM VARIABLES

1.6.1 Conditional Distribution of Discrete Random Variables

The conditional probability distribution of a random variable is analogous


to the concept of conditional probability of events. For two events A and B,
the multiplication law gives the probability of the intersection A ∩ B as

P (A ∩ B) = P (A)P (B|A)

where P (A) is the unconditional probability of A and P (B|A) is the condi-


tional probability of B given that A has occurred.
Now, consider ({X = x}) ∩ ({Y = y}), represented by the bivariate
event (x, y). Then it follows directly from the multiplication law of proba-
bility that the bivariate probability for the intersection (xi , yj ) is

p(xi , yj ) = g(xi )p(yj |xi )

or

p(xi , yj ) = h(yj )p(xi |yj )

The probabilities g(xi ) and h(yj ) are those associated with the marginal
probability distributions for X and Y , respectively. The probability p(x|y)
is the probability that the random variable X takes a specific value x given
that Y takes on the value y, written in full as P(X = x|Y = y). Thus,
P (X = 2|Y = 1) is the conditional probability that X = 2 given that Y =
1. A similar interpretation can be attached to the conditional probability
p(y|x).

Note
Conditional distribution is the opposite of marginal distribution, in which
the probability of a value of a random variable is determined without refer-
ence to the possible values of the other variables.

Definition 1.14 BIVARIATE CONDITIONAL PROBABILITY DISTRIBUTION (Discrete Case)
Suppose X and Y are discrete random variables with joint probability
distribution p(x, y); then the conditional discrete probability function
of X given Y is

    P(X = x | Y = y) = P(X = x, Y = y) / P(Y = y),   P(Y = y) > 0

That is, given the joint probability distribution p(x, y) and marginal prob-
ability functions g(x) and h(y), respectively, the conditional discrete proba-
bility function of X given Y is

    p(x|y) = p(x, y) / h(y),   h(y) > 0

and similarly, the conditional discrete probability function of Y given X is

    p(y|x) = p(x, y) / g(x),   g(x) > 0
This definition shows that if we have the joint probability function
of two random variables and desire the conditional distribution for one of
them when the other is held fixed, it is merely necessary to divide the joint
probability function by the probability function of the fixed variable.
Example 1.11
Refer to the table in Example 1.1. Find
(a) P(Y = 2 | X = 1)
(b) P(X = 1 | Y = 2)
(c) P(X = 1 | Y = 1)


Solution

(a) P(Y = 2 | X = 1) = P(X = 1, Y = 2) / P(X = 1) = p(1, 2)/g(1) = (2/8)/(3/8) = 2/3

(b) P(X = 1 | Y = 2) = P(X = 1, Y = 2) / P(Y = 2) = p(1, 2)/h(2) = (2/8)/(4/8) = 1/2

(c) P(X = 1 | Y = 1) = P(X = 1, Y = 1) / P(Y = 1) = p(1, 1)/h(1) = 0/(2/8) = 0
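
These conditional probabilities are simple ratios of joint-table entries to
marginal totals, which the following Python sketch (ours) computes directly:

    from fractions import Fraction

    pmf = {(0, 1): Fraction(1, 8), (1, 2): Fraction(2, 8), (1, 3): Fraction(1, 8),
           (2, 2): Fraction(2, 8), (2, 3): Fraction(1, 8), (3, 1): Fraction(1, 8)}

    def p(x, y):
        return pmf.get((x, y), Fraction(0))

    def g(x):                     # marginal of X
        return sum(q for (xi, _), q in pmf.items() if xi == x)

    def h(y):                     # marginal of Y
        return sum(q for (_, yj), q in pmf.items() if yj == y)

    print(p(1, 2) / g(1))         # P(Y=2 | X=1) = 2/3
    print(p(1, 2) / h(2))         # P(X=1 | Y=2) = 1/2
    print(p(1, 1) / h(1))         # P(X=1 | Y=1) = 0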

Example 1.12
Refer to Example 1.2. Find the conditional probability function
(a) p(x|y), (b) p(y|x)

Solution
(a) The conditional probability function of X given Y is

        p(x|y) = p(x, y) / h(y)

    From Example 1.2,
        p(x, y) = (1/21)(3x + 2y)
    From Example 1.9,
        h(y) = (1/21)(3 + 4y)
    Hence
        p(x|y) = [(1/21)(3x + 2y)] / [(1/21)(3 + 4y)]
               = (3x + 2y) / (3 + 4y)

(b) The conditional probability function of Y given X is

        p(y|x) = p(x, y) / g(x)

    From Example 1.9,
        g(x) = (1/7)(3x + 2)
    Hence
        p(y|x) = [(1/21)(3x + 2y)] / [(1/7)(3x + 2)]
               = 7(3x + 2y) / [21(3x + 2)]
               = (3x + 2y) / [3(3x + 2)]

1.6.2 Conditional Distribution of Continuous Variables

Definition 1.15
Let (X, Y) be a continuous two-dimensional random variable with
joint probability density function f(x, y). Let g and h be the
marginal probability density functions of X and Y respectively. The
conditional probability density of X for given Y = y is defined by

    f(x|y) = f(x, y) / h(y),   h(y) > 0

and the conditional probability density of Y for given X = x is defined by

    f(y|x) = f(x, y) / g(x),   g(x) > 0

Example 1.13
Refer to Example 1.3. Find (a) f(x|y), (b) f(y|x).

Solution
(a) From Example 1.10,
        h(y) = 1/3 + y/6
    Therefore
        f(x|y) = f(x, y) / h(y)
               = (x² + xy/3) / (1/3 + y/6)
               = (6x² + 2xy) / (2 + y),   0 ≤ x ≤ 1; 0 ≤ y ≤ 2

(b) The conditional probability density of Y for given X = x is
        f(y|x) = f(x, y) / g(x)
    From Example 1.10,
        g(x) = 2x² + (2/3)x
    Therefore
        f(y|x) = (x² + xy/3) / (2x² + (2/3)x)
               = (3x² + xy) / (6x² + 2x)
               = (3x + y) / (6x + 2),   0 ≤ y ≤ 2; 0 ≤ x ≤ 1
Example 1.14
Consider the joint probability density function

    f(x, y) = { 6(1 − y),   0 ≤ x ≤ y ≤ 1
              { 0,          elsewhere

Find P(Y < 1/2 | X < 3/4).

Solution

    P(Y < 1/2 | X < 3/4) = P(Y < 1/2, X < 3/4) / P(X < 3/4)

We first evaluate the numerator. The region of integration is sketched below.

Now,

    P(Y < 1/2, X < 3/4) = ∫_{y=0}^{1/2} ∫_{x=0}^{y} 6(1 − y) dx dy
                        = [3y² − 2y³]_{y=0}^{1/2}
                        = 1/2

The denominator is evaluated as follows. First we find the marginal p.d.f. of X:

    g(x) = ∫_{y=x}^{1} 6(1 − y) dy = 3(1 − x)²,   0 ≤ x ≤ 1

so that

    P(X < 3/4) = ∫_{x=0}^{3/4} 3(1 − x)² dx
               = 63/64

Hence,

    P(Y < 1/2 | X < 3/4) = (1/2) / (63/64) = 32/63
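
The value 32/63 can be confirmed symbolically; the sketch below (ours)
integrates the density over the two regions involved:

    import sympy as sp

    x, y = sp.symbols('x y')
    f = 6 * (1 - y)                 # joint density on the triangle 0 <= x <= y <= 1

    # numerator: P(Y < 1/2, X < 3/4); for y < 1/2 the constraint x < 3/4 is automatic
    num = sp.integrate(f, (x, 0, y), (y, 0, sp.Rational(1, 2)))        # 1/2

    # denominator: P(X < 3/4) from the marginal g(x) = 3(1 - x)^2
    g = sp.integrate(f, (y, x, 1))
    den = sp.integrate(g, (x, 0, sp.Rational(3, 4)))                   # 63/64

    print(num / den)                # 32/63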

1.7 INDEPENDENCE OF BIVARIATE RANDOM VARIABLES

1.7.1 Definition of Independence of Bivariate Random Variables

We recall from the introductory probability course that, in terms of event


probabilities, two events A and B are independent if the realization of A
is not affected by the occurrence of B. That is, two events A and B are
independent if
P (A|B) = P (A)
This condition can be carried over to two random variables. That is, two
random variables are independent if the realization of one does not affect
the probability distribution of the other.
We also recall that the definition of independence of two events is
usually given as
P (A ∩ B) = P (A)P (B)
Similarly, we may carry this condition of independence over to two random
variables as in Definition 1.16.

Definition 1.16 INDEPENDENCE OF BIVARIATE RANDOM VARIABLES
Let X have distribution function F1(x), let Y have distribution function
F2(y), and let X and Y have joint distribution function F(x, y). Then
X and Y are said to be independent if and only if

    F(x, y) = F1(x)F2(y)

for every pair of real numbers (x, y)

Thus, the independence of the random variables X and Y implies that their
joint distribution function factors into the products of their marginal distri-
bution functions. This definition applies whether the random variables are
discrete or continuous.
If X and Y are not independent, they are said to be dependent. It is
usually more convenient to verify independence or otherwise with the help
34 Advanced
of the p.m.f. (in the discrete case) Topics
or p.d.f. in Introductory
(in the Probability
continuous case).

1.7.2 Independence of Bivariate Discrete Random Variables

Definition 1.17
If X and Y are discrete random variables with joint probability func-
tion p(x, y) and marginal probability function g(x) and h(y) respec-
tively, then X and Y are independent if and only if

p(x, y) = g(x)h(y)

for all pairs of real numbers (x, y)

Example 1.15
Refer to the table in Example 1.1, verify whether or not X and Y are
independent.

Solution
Consider the ordered pair (0, 1).
From Example 1.1
1
P (X = 0, Y = 1) =
8
But from the marginal distributions,
1
P (X = 0) =
8
2
P (Y = 1) =
8  

1 2
P (X = 0)P (Y = 1) =
8 8
1
which does not equal . Therefore X and Y are not independent.
8
Example 1.16
Refer to Example 1.2. Are X and Y independent?

Solution 36
From Example 1.2
P (Y = 1) =
8  

1 2
ADVANCED TOPICS IN P (X = 0)P (Y =
INTRODUCTORY 1) =
PROBABILITY: A FIRST COURSE IN 8 8 PROBABILITY AND DISTRIBUTION FUNCTIONS
PROBABILITY THEORY – VOLUME III OF BIVARIATE DISTRIBUTIONS
1
which does not equal . Therefore X and Y are not independent.
8
Example 1.16
Refer to Example 1.2. Are X and Y independent?

Solution
From Example 1.2
Density and Distribution Functions of1 Bivariate Distributions 35
p(x, y) = (3x + 2y)
21
From Example 1.9,
1
g(x) = (3x + 2)
7
1
h(y) = (3 + 4y)
21
Now
1 1
g(x)h(y) = (3x + 2) (3 + 4y)
7 21
1
= (9x + 12xy + 8y + 6)
147
which is not equal to p(x, y); hence X and Y are not independent.

1.7.3 Independence of Continuous Random Variables

Like the discrete case, we usually verify for independence or lack of it of


continuous random variables using the result contained in Definition 1.16.

Definition 1.18
If X and Y are are continuous random variables with a joint density
function of f (x, y) and marginal density functions of g(x) and h(y),
respectively, then X and Y are independent if and only if

f (x, y) = g(x)h(y)

for all pairs of real numbers (x, y)

Example 1.17
Refer to Example 1.3. Verify whether X and Y are independent.

Solution
From Example 1.10,
2
g(x) = 2x2 + x
3
1 y
h(y) = +
3 6

37
1 1
g(x)h(y) = (3x + 2) (3 + 4y)
7 21
ADVANCED TOPICS IN INTRODUCTORY 1
PROBABILITY: A FIRST COURSE =
IN (9x + 12xy + 8y +PROBABILITY
6) AND DISTRIBUTION FUNCTIONS
147
PROBABILITY THEORY – VOLUME III OF BIVARIATE DISTRIBUTIONS
which is not equal to p(x, y); hence X and Y are not independent.

1.7.3 Independence of Continuous Random Variables

Like the discrete case, we usually verify for independence or lack of it of


continuous random variables using the result contained in Definition 1.16.

Definition 1.18
If X and Y are are continuous random variables with a joint density
function of f (x, y) and marginal density functions of g(x) and h(y),
respectively, then X and Y are independent if and only if

f (x, y) = g(x)h(y)

for all pairs of real numbers (x, y)

Example 1.17
Refer to Example 1.3. Verify whether X and Y are independent.

Solution
From Example 1.10,
36
36 Advanced
Advanced Topics
2 in
Topics in Introductory
Introductory Probability
Probability
36 Advanced
g(x) = 2x2Topics
+ x in Introductory Probability
36 3 in Introductory Probability
Advanced Topics
Now 1 y
Now
Now h(y) = +
3 6
Now    
 2 22   1 
11 + yyy 
g(x)h(y)
g(x)h(y) =
= 2x
2x 2+ 2
2+ x
x +
g(x)h(y) =  2x + 33 x  33 + 66 
2 23 132 y6 1
2 22 +1 ++ 11 xy
g(x)h(y) = = 2x2+ 1 2y + 2
x
2
= 2 x
x2 + 1 3xx
x2 yy + 329 x
x
x++6 9 xy
xy
= 3 3 x + 3
3 + 9 9
23 2 13 2 29 19
which is
which is not
is not equal
not equal to
equal to
to = x + x y + x + xy
which 3 32 xy 9 9
ff (x, y) = x 2 + xy
(x, y) = x 2 + xy
which is not equal to f (x, y) = x + 33
3
xy
Hence
Hence X
X and
and Y
Y are
are not
not independent.
(x,
independent.
f y) = x 2
+
Hence X and Y are not independent. 3
Example
Hence
ExampleX 1.18
and Y are not independent.
1.18
Example 1.18
Suppose the
Suppose the joint
the joint p.d.f.
joint p.d.f. of
p.d.f. of X
of X and Y
and Y
X and Y isis given
is given by
given by
by
Suppose
Example 1.18 

Suppose the joint p.d.f. of  and Y 0
X 4xy,
4xy, 0is<given
x< by0 < yy <
1, 1
ff (x, y)
(x, y) =
=  4xy,
y) = 0< <xx<< 1,
1, 00 <
<y< < 11
f (x, 0,
0, elsewhere
elsewhere
0,
4xy, 0elsewhere
< x < 1, 0 < y < 1
Verify whether f (x,
or not y)X=and Y are independent.
Verify whether or not X and 0,
Y are independent.
elsewhere
Verify whether or not X and Y are independent.
Solution
Verify whether or not X and Y are independent.
Solution
Solution
We
We have
We have
have
Solution  
 11
We have g(x) = 1 f (x, y) dy
g(x) =
g(x) =  00 ff (x,
(x, y)
y) dy
dy

 0 1
1
g(x) =
=
 11 f (x, y) dy
4xy dy
=
= 000 4xy
4xy dy
dy

01 
11
=  4xy
y 22 dy
y2 1
=
= 04x y
=  4x
4x 222 10
y2 00
=
= 2x,
4x 00 <
2x, x< 1
= 2x, 2 0 <
<x < 11
x<
0
Similarly
Similarly = 2x, 0 < x < 1
Similarly


Similarly  11 38
h(y) = 1 f (x, y)dx
h(y) =  00 ff (x,
h(y) = (x, y)dx
y)dx
1
= 4xy dy
0
ADVANCED TOPICS IN INTRODUCTORY
 1
PROBABILITY: A FIRST COURSE IN =
y2
4x PROBABILITY AND DISTRIBUTION FUNCTIONS
PROBABILITY THEORY – VOLUME III 2 0 OF BIVARIATE DISTRIBUTIONS
= 2x, 0 < x < 1

Similarly
 1
h(y) = f (x, y)dx
0
= 2y, 0<y<1

Hence,
f (x, y) = g(x)h(y)
Density and Distribution Functions of Bivariate Distributions 37
for all real numbers (x, y), and therefore X and Y are independent.

To conclude this chapter we shall make two important observations.

(a) Multivariate Distributions

Our discussion of the bivariate case can be readily extended to multivariate


distributions. Apart from added computational labour, joint distributions
of three or more variables do not pose new analytical problems. Thus, we
simply state here that the joint probability mass function of k discrete ran-
dom variables p(x1 , x2 , ..., xk ) is specified to be nonnegative and its sum
over the k dimensional space is one. There are marginal density functions of
various dimensions in that case. Suppose X, Y, and Z are jointly continu-
ous random variables with density function f (x, y, z), The one dimensional
marginal distribution of X is
 ∞  ∞
fX (x) = f (x, y, z) dy dz
−∞ −∞

and the two-dimensional marginal distribution of X and Y is


 ∞
fXY (x, y) = f (x, y, z) dz
−∞

(b) Algebra of Random Variables

The four basic operations of algebra, namely, addition, subtraction, multi-


plication and division can be performed on random variables to form new
random variables. That is, if X and Y are random variables and c is a
X
real number, then we can form new variables X + Y, X − Y, XY, , c+
Y
X, cX. We can also form the random variable f (X), where f is any func-
tion on R. That is, if X has range {x1 , x2 , · · · , xn }, then f (X) has range
{f (x1 ), f (x2 ), · · · , f (xn )}. For example, if X is a random variable taking
1 1 1 1
the values {1, 2, 3, 4} with probabilities , , , and , respectively, then
2 8 4 8
1 1 1 1
X takes the values {1, 4, 9, 16} with probabilities , , , and , respec-
2
2 8 4 8
tively.
In general, the probability distributions of cX (that is, adding a random
variable to itself c times) and f (X) (if f is one-to-one) such as multiplying a
38 Advanced Topics in Introductory Probability
random variable by itself, or finding the root or cosine of a random variable
are the same as that of X. However, the determination of the probability
X
distributions of X ± Y , XY , and (Y = 0) is not straightforward and this
Y
is the subject of the next chapter.

EXERCISES

39
1.1 Given the following joint frequency distribution of X and Y
X
distributions of X ± Y , XY , and (Y = 0) is not straightforward and this
Y
is the subject
ADVANCED of the
TOPICS next chapter.X
IN INTRODUCTORY
distributions
PROBABILITY: AofFIRST , XY , and
X ± YCOURSE IN (Y =
 0) is not straightforward and DISTRIBUTION
PROBABILITY AND this FUNCTIONS
PROBABILITY THEORY – VOLUME III
Y OF BIVARIATE DISTRIBUTIONS
is the subject of the next chapter.
EXERCISES

EXERCISES
1.1 Given the following joint frequency distribution of X and Y

1.1 Given the following joint frequency distribution of X and Y


X Y
1 2 3
X1 3 Y5 1
2 12 24 36
13 31 52 10
24 24 41 65
3 1 2 0
4 4 1 5
Calculate the joint probability of X and Y .

Calculate the joint probability of X and Y .


1.2 Refer to Exercise 1.1.

1.2 Refer to Exercise


(a) Find 1.1. probability distribution of
the marginal
(i) X, (ii) Y
(a) Find the marginal probability distribution of
(b) Verify the independence of X and Y for the data of Exercise 1.1.
(i) X, (ii) Y
(b) Verify
1.3 Refer the independence
to Exercise of X and Y for the data of Exercise 1.1.
1.1. Calculate

1.3 Refer
(a) Pto(XExercise
< 2, Y 1.1.
≤ 3),Calculate
(b) P (2 ≤ X < 3, 1 ≤ Y ≤ 2)
(c) P (1 < X < 2, Y < 3), (d) P (X < 2, 0 ≤ Y < 2)
(a) P (X < 2, Y ≤ 3), (b) P (2 ≤ X < 3, 1 ≤ Y ≤ 2)
(c) Pto
1.4 Refer (1 Exercise
< X < 2,1.1.
Y < 3),
Find (d) P (X < 2, 0 ≤ Y < 2)

1.4 Refer
(a) Pto(YExercise
= 3|X = 1.1.
1), Find (b) P (Y = 2|X = 3),
(c) P (X = 2|Y = 2), (d) P (X = 1|Y = 2),
(a) P (Y = 3|X = 1), (b) P (Y = 2|X = 3),
(e) P (X = 3|Y = 1), (f) P (X = 3|Y = 2)
(c) P (X = 2|Y = 2), (d) P (X = 1|Y = 2),
(e) Pto
1.5 Refer (XExample
= 3|Y =1.5.
1), Solve(f)part
P (X = 3|Y = 2)
(c).

1.5 Refer to Example 1.5. Solve part (c).

40
1.3 Refer to Exercise 1.1. Calculate
ADVANCED TOPICS IN INTRODUCTORY
PROBABILITY: A FIRST COURSE IN PROBABILITY AND DISTRIBUTION FUNCTIONS
(a) P (X < 2, Y ≤ 3),
PROBABILITY THEORY – VOLUME III
(b) P (2 ≤ X < 3, 1 ≤ Y ≤ 2) OF BIVARIATE DISTRIBUTIONS
(c) P (1 < X < 2, Y < 3), (d) P (X < 2, 0 ≤ Y < 2)

1.4 Refer to Exercise 1.1. Find

(a) P (Y = 3|X = 1), (b) P (Y = 2|X = 3),


(c) P (X = 2|Y = 2), (d) P (X = 1|Y = 2),
(e) P (X = 3|Y = 1), (f) P (X = 3|Y = 2)
Density and Distribution Functions of Bivariate Distributions 39
1.5 Refer to Example 1.5. Solve part (c).

1.6 The joint probability mass function of X and Y is given by

1 1 1 1
p(1, 1) = , p(1, 2) = , p(2, 1) = , p(2, 2) =
8 4 8 2
(a) Compute the conditional probability mass function of X, given
Y = i, i = 1, 2.
(b) Are X and Y independent?

1.7 Consider the bivariate discrete random variables X and Y with prob-
ability function

1
p(x, y) = (2x + y) x = 0, 1, 2; y = 0, 1, 2.
27
(a) Verify that p(x, y) is a legitimate probability mass function.
(b) Find the joint probability

P (1 ≤ X ≤ 2, 0 ≤ Y ≤ 1)

(c) Find the marginal distributions of X and Y .


(d) Find the conditional probability of
(i) Y for given X = x
(ii) X for given Y = y
(e) Verify whether X and Y are independent or not.

1.8 A bivariate discrete random variable (X, Y ) is such that

P (X = x, Y = y) = θx (1 − θ)y ,

where 0 < θ < 1 and x, y = 1, 2, 3, ...

(a) Show that the above probability mass function is legitimate


(b) Find the marginal probability distribution of X and Y
(c) Find the conditional probability of
(i) X given Y = 2, (ii) Y given X = 0
(d) Verify whether X and Y are independent or not.

41
ADVANCED TOPICS IN INTRODUCTORY
PROBABILITY: A FIRST COURSE IN PROBABILITY AND DISTRIBUTION FUNCTIONS
PROBABILITY
40 THEORY – VOLUME Advanced
III OF BIVARIATE DISTRIBUTIONS
Topics in Introductory Probability

1.9 A bivariate discrete random variable (X, Y ) has the following proba-
bility mass function.
 
x+y x
p (1 − p)y e−λ λx+y
x
p(x, y) = , x, y = 0, 1, 2, ...
(x + y)!
(a) Show that the marginal probability function of X is

e−λ p (λ p)x
g(x) = ; x = 0, 1, 2, ...
x!
(b) Find the marginal probability mass function of Y .
(c) Examine whether X and Y are independent.

1.10 Consider the following function of two discrete random variables


1 2
f (x, y) = (x − y), f or x = 2, 3; y = 1, 2
20
(a) Find the value of k such that the function is a legitimate proba-
bility mass function.
(b) Construct the joint probability table.
(c) Find the marginal probability distributions of X and Y.
(d) Find the conditional probability of
(i) Y for given X = x
(ii) X for given Y = y
(e) Verify whether or not X and Y are independent.

1.11 Suppose that (X, Y ) is a two-dimensional continuous random variable


with joint probability density function.

k(x + y − 2xy), 0 ≤ x ≤ 1, 0 ≤ y ≤ 1
f (x, y) =
0, elsewhere

where k is a constant.

Density(a)
andFind the value Functions
Distribution of k. of Bivariate Distributions 41
(b) Find the marginal probability distribution of X and Y .
(c) Find the conditional probability
(i) X given Y = y
(ii) Y given X = x
(d) Verify whether X and Y are independent or not.
(e) P (X > Y )

1.12 A bivariate continuous random variable (X, Y ) has joint probability


density function

k(x + 2y), 0 ≤ x ≤ 1, 0 ≤ y ≤ 1
f (x, y) =
0, elsewhere

where k is a constant.

(a) Find the value of k.


(b) Find the marginal probability distribution of X and Y .
42
(c) Find the conditional probability of
(c) Find the conditional probability
(i) X given Y = y
ADVANCED TOPICS IN INTRODUCTORY
(ii) Y given X = x
PROBABILITY: A FIRST COURSE IN PROBABILITY AND DISTRIBUTION FUNCTIONS
(d) Verify
PROBABILITY whether
THEORY X and
– VOLUME III Y are independent or not. OF BIVARIATE DISTRIBUTIONS

(e) P (X > Y )

1.12 A bivariate continuous random variable (X, Y ) has joint probability


density function

k(x + 2y), 0 ≤ x ≤ 1, 0 ≤ y ≤ 1
f (x, y) =
0, elsewhere

where k is a constant.

(a) Find the value of k.


(b) Find the marginal probability distribution of X and Y .
(c) Find the conditional probability of
(i) Y given X = x
(ii) X given Y = y
(d) Verify whether or not X and Y are independent

1.13 Suppose that a bivariate function is given by



f (x, y) = a(x2 + xy), 0 ≤ x ≤ 1, 0 ≤ y ≤ 1,
f (x, y) =
0, elsewhere

Find

(a) the constant a such that f (x, y) is a p.d.f.


(b) P (X > Y );
(c) the marginal p.d.f. of X;
(d) the marginal p.d.f. of Y.

1.14 The joint probability density function of X and Y is given by



42 7 (x
6 2
+ xy
2 ),
Advanced 0<x<
Topics in 1, 0 < y < 2Probability
Introductory
f (x, y) =
0, elsewhere
(a) Verify that this is indeed a joint p.d.f.
(b) Compute the distribution function of X.
(c) Find P (X > Y )
  
1  1
(d) Find P Y <  X > .
2 2
(e) Verify whether X and Y are independent.

1.15 Consider the following joint density function

f (x, y) = λ2 eλy , 0≤x≤y

Find

(a) the marginal p.d.f. of X


(b) the marginal p.d.f. of Y ;
(c) the conditional p.d.f. of Y given X = x;
(d) the conditional p.d.f. of X given Y = y

1.16 The number of people that enter a church in a given hour is a Poisson
random variable with λ = 10. Compute the conditional probability
that at most 2 men entered the church in a given hour, given that 10
women entered in that hour. What assumptions have you made?

1.17 The joint probability density function of 43


X and Y is given by

(a) the marginal p.d.f. of X
ADVANCED TOPICS IN INTRODUCTORY
(b) the marginal p.d.f. of Y ;
PROBABILITY: A FIRST COURSE IN PROBABILITY AND DISTRIBUTION FUNCTIONS
(c) the
PROBABILITY conditional
THEORY p.d.f.
– VOLUME III of Y given X = x; OF BIVARIATE DISTRIBUTIONS

(d) the conditional p.d.f. of X given Y = y

1.16 The number of people that enter a church in a given hour is a Poisson
random variable with λ = 10. Compute the conditional probability
that at most 2 men entered the church in a given hour, given that 10
women entered in that hour. What assumptions have you made?

1.17 The joint probability density function of X and Y is given by



xe−(x+y) , x > 0, y > 0
f (x, y) =
0, elsewhere

(a) Find the marginal p.d.f of (i) X (ii) Y


(b) Find the conditional p.d.f of
(i) X, given Y = y;
(ii) Y given X = x.
(c) Verify whether X and Y are independent.

1.18 The joint probability density function of X and Y is given by



Density and Distribution 2, of
Functions 0< 0<y<1
x < y, Distributions
Bivariate 43
f (x, y) =
0, elsewhere

(a) Find the marginal p.d.f of (i) X, (ii) Y.


(b) Find the conditional p.d.f of
(i) X, given Y = y, (ii) Y given X = x

1.19 The joint probability density function of X and Y is given by

f (x, y) = λ1 λ2 exp{−(λ1 x + λ2 y)}, 0 < x < ∞, 0 < y < ∞

(a) Find the marginal p.d.f. of (i) X (ii) Y .


(b) Verify whether X and Y are independent.

1.20 Show that the conditional densities f (x|y) and f (y|x) are legitimate
density functions.

44
ADVANCED TOPICS IN INTRODUCTORY
PROBABILITY: A FIRST COURSE IN SUMS, DIFFERENCES, PRODUCTS AND
PROBABILITY THEORY – VOLUME III QUOTIENTS OF BIVARIATE DISTRIBUTIONS

Chapter 2

SUMS, DIFFERENCES, PRODUCTS AND


QUOTIENTS OF BIVARIATE DISTRIBUTIONS

2.1 INTRODUCTION

In the previous chapter, we discussed the joint probability distributions,


cumulative distributions and marginal distributions of random variables X
and Y . We also discussed the independence of bivariate random variables.
In this chapter, we shall take up the distribution of sums, differences,
products and quotients of X and Y . The numerical characterisation of the
joint distribution of X and Y will be discussed in the next chapter. As
has been the practice in this book, discussions will be done separately for
discrete and continuous cases.

2.2 SUMS OF BIVARIATE RANDOM VARIABLES

2.2.1 Sums of Discrete Bivariate Random Variables

We shall introduce the concept of the sum of discrete random variables with
an example.

Example 2.1
Consider a sequence of independent experiments in each of which a fixed
event is of interest. Suppose X1 , X2 , · · · , Xn are discrete random variables

44 WE WILL TURN YOUR CV


INTO AN OPPORTUNITY
OF A LIFETIME

Do you like cars? Would you like to be a part of a successful brand?


Send us your CV on
As a constructer at ŠKODA AUTO you will put great things in motion. Things that will
www.employerforlife.com
ease everyday lives of people all around Send us your CV. We will give it an entirely
new new dimension.

45
In this chapter, we shall take up the distribution of sums, differences,
products and quotients of X and Y . The numerical characterisation of the
joint distribution
ADVANCED of INTRODUCTORY
TOPICS IN X and Y will be discussed in the next chapter. As
has been the Apractice
PROBABILITY: in thisINbook, discussions will be done SUMS,
FIRST COURSE separately for
DIFFERENCES, PRODUCTS AND
PROBABILITY
discrete and THEORY
continuous– VOLUME
cases. III QUOTIENTS OF BIVARIATE DISTRIBUTIONS

2.2 SUMS OF BIVARIATE RANDOM VARIABLES

2.2.1 Sums of Discrete Bivariate Random Variables

We shall introduce the concept of the sum of discrete random variables with
an example.

Example 2.1
Sums, Differences, Products and Quotients of Bivariate Distributions 45
Consider a sequence of independent experiments in each of which a fixed
event is of interest. Suppose X1 , X2 , · · · , Xn are discrete random variables
defined by
 44
1, if the fixed event happens in the ith experiment;
Xi =
0, if the fixed event does not happen in the ith experiment.

Let
Zn = X1 + X2 + · · · + Xn
The random variable Zn is a sum of discrete random variables Xi , i =
1, 2, ..., n, and denotes the number of times the fixed event happens in the
n trials.
For instance, the President of Regent University receives visitors com-
ing every day from the two surrounding communities. The number of visitors
on a particular day from the communities are X1 and X2 , respectively. Let
Z2 = X1 + X2 . The random variable Z2 is a sum of two random variables
and gives the total number of visitors coming to the President of Regent
University on a particular day.

Distribution of Sums of Discrete Bivariate Random Variables


in Tabular Form

Case 1
Suppose that the joint distribution of X and Y is in the form of the table as
presented in Table 1.3. Such a table can be referred to as the parent joint
probability distribution table. Then the various sums of the discrete
random variables X and Y, denoted as X + Y, and their corresponding
probabilities may be presented as in Table 2.1. This can be referred to as
the derived joint probability distribution table.

Table 2.1 Various Sums (X + Y ) and their Corresponding Probabilities

xi + yj x1 + y1 x1 + y2 ··· x1 + ym x2 + y1 x2 + y2 ···
p(xi , yj ) p(x1 , y1 ) p(x1 , y2 ) ··· p(x1 , ym ) p(x2 , y1 ) p(x2 , y2 ) ···

··· x2 + ym xn + y1 xn + y2 ··· xn + ym
··· p(x2 , ym ) p(xn , y1 ) p(xn , y2 ) ··· p(xn , ym )

To find the distribution of the sum, we apply the principle of the probabilities

46
xi + yj x1 + y1 x1 + y2 ··· x1 + ym x2 + y1 x2 + y2 ···
p(xi , yj ) p(x1 , y1 ) p(x1 , y2 ) ··· p(x1 , ym ) p(x2 , y1 ) p(x2 , y2 ) ···
ADVANCED TOPICS IN INTRODUCTORY
PROBABILITY: A FIRST COURSE IN SUMS, DIFFERENCES, PRODUCTS AND
· · · x2 + ym x + y1 xn + y2 · · · xn + ym
PROBABILITY THEORYn– VOLUME III QUOTIENTS OF BIVARIATE DISTRIBUTIONS
· · · p(x2 , ym ) p(xn , y1 ) p(xn , y2 ) · · · p(xn , ym )
46 Advanced Topics in Introductory Probability
To find the distribution of the sum, we apply the principle of the probabilities
of equivalent events.1 By this principle, the probabilities of equivalent events
are equal. That is, if A ⊂ S and B ⊂ RX are equivalent events, then we
define the probability of the event B, P (B), to be equal to P (A).
This principle is illustrated in some examples below after the following
theorem.

46 Theorem 2.1 Advanced Topics in Introductory Probability


The distribution (probability) of the sum of X = xi and Y = yj
being equal to k is the sum of the probabilities that correspond to
of equivalent
all indicesevents.
i and 1j By this
that sumprinciple,
to k the probabilities of equivalent events
are equal. That is, if A ⊂ S and B ⊂ RX are equivalent events, then we
define the probability of the event B, P (B), to be equal to P (A).
This principle is illustrated in some examples below after the following
Example 2.2
theorem.
Given the table below, find the distribution of the sum X+Y.

Theorem 2.1
Y
The distribution (probability) of the sum of X = xi and Y = yj
X −1 0 1 Row Totals
being equal to k is the sum of the probabilities that correspond to
j that sum0.10
all indices i and−1 to k 0.20 0.11 0.41
0 0.08 0.02 0.26 0.36
2 0.03 0.17 0.03 0.23
Column Totals 0.21 0.39 0.40 1.00
Example 2.2
Given the table below, find the distribution of the sum X+Y.
Solution
We shall list the values of x + y in each cell within parenthesis as in the
table Differences,
Sums, below: Products and Quotients Y of Bivariate Distributions 47
X −1 0 1 Row Totals
1
The events A ⊂ S −1 0.10called
and B ⊂ RX are 0.20 0.11 0.41
Y equivalent events if
0 0.08 0.02 0.26 0.36
X −1 0 1 Row Totals
2 0.03 0.17
A = {s ∈ S|X(s) ∈ B} 0.03 0.23
−1 0.10 0.20 0.11 0.41
Column Totals 0.21 0.39 0.40 1.00
where RX is the range space of the(-2)
random(-1)
variable(0)
X.
0 0.08 0.02 0.26 0.36
(-1) (0) (1)
Solution 2 0.03 0.17 0.03 0.23
We shall list the values of x + (1)y in (2)
each cell(3) within parenthesis as in the
table below:Column Totals 0.21 0.39 0.40 1.00

The1values for x+y,


The events A ⊂ Salso
and known
B ⊂ RX as
arethe support
called eventsare
of x+y,
equivalent if {−2, −1, 0, 1, 2, 3}
and the probability of each value P [(X + Y ) = x + y] is the sum of all the
2

cell probabilities where that valueA = {soccurs.


∈ S|X(s)For
∈ B}example, there are two (mu-
tually exclusive) ways that the sum X + Y can equal 0, and that is, if X = 0
and Y R=X 0is or
where the if X =
range −1ofand
space Y = 1,variable
the random so that
X.the probability of the sum
X + Y = 0 is P (X = 0, Y = 0) or P (X = −1, Y = 1). That is,
P [(X + Y ) = 0] = P (X = 0, Y = 0) + P (X = −1 Y = 1)
= 0.11 + 0.02
= 0.13 = p(0) 47
0 0.08
(-1) 0.02(0) 0.030.26
(1) 0.36
2 0.03 0.17 0.23
2 (-1)
0.03 (0) (1)
(1) 0.17 (2) 0.03 (3) 0.23
ADVANCED TOPICS IN 2INTRODUCTORY 0.03 0.17 0.03 0.23
(1) (2) (3)
PROBABILITY: Column Totals IN0.21
A FIRST COURSE (1) 0.39 (2) 0.40 (3) 1.00 SUMS, DIFFERENCES, PRODUCTS AND
Column– Totals
PROBABILITY THEORY VOLUME III0.21 0.39 0.40 1.00QUOTIENTS OF BIVARIATE DISTRIBUTIONS
Column Totals 0.21 0.39 0.40 1.00
The values for x+y, also known as the support of x+y, are {−2, −1, 0, 1, 2, 3}
The values
and the for x+y, also
probability known
of each as2the
value support
P [(X +Y) = + y] are
ofxx+y, {−2,
is the sum−1,of0,all
1, 2,
the3}
The
and values
the for x+y,
probability also
of known
each as2the
value P support
[(X + Y ) ofxx+y,
= + y] are
is {−2,
the sum−1,of0,all
1, 2,
the3}
cell probabilities where that value2 occurs. For example, there are two (mu-
and
cell the probability
probabilities of each
wherethat value
thatthe
value P [(X + Y ) = x + y] is the sum of all the
tually exclusive) ways sumoccurs.
X + Y canFor equal
example,
0, andthere are
that is,two
if X(mu-
=0
cell probabilities
tually where that value occurs. For example, there are two X(mu-
and Y exclusive)
= 0 or if waysX =that−1 the
andsumY =X1,+ Y so can
thatequal 0, and that is,
the probability of if
the =0
sum
tually
and Y exclusive)
= 0 or if ways
X = that
−1 the
and sum
Y = X1,+ Yso can
that equal
the 0, and that is,
probability of if
the =0
X sum
X + Y = 0 is P (X = 0, Y = 0) or P (X = −1, Y = 1). That is,
and Y = 0 or if X = −1 and Y = 1, so that
X + Y = 0 is P (X = 0, Y = 0) or P (X = −1, Y = 1). That is, the probability of the sum
X +Y = 0 is +
P [(X P (X 0, Y== 0)
Y ) = 0] or P
P (X =(X0, Y= =−1,0) Y+ = 1). =
P (X That
−1 Yis,= 1)
P [(X + Y ) = 0] = P (X = 0, Y = 0) + P (X = −1 Y = 1)
P [(X + Y ) = 0] = = P 0.11
(X+=0.020, Y = 0) + P (X = −1 Y = 1)
= 0.11 + 0.02
= 0.13 =
= 0.11 + 0.02 p(0)
= 0.13 = p(0)
Again, the only way X += Y can0.13equal
= p(0)2 is when X = 2 and Y = 0 so that
Again, the only way X + Y can
the probability of the sum X + Y = 2 is equal 2 is when X = 2 and Y = 0 so that
Again, the only way X + Y can
the probability of the sum X + Y = 2 is equal 2 is when X = 2 and Y = 0 so that
the probability of P the
[(Xsum
+ YX) += Y2] ==2 isP (X = 2, Y = 0)
P [(X + Y ) = 2] = P (X = 2, Y = 0)
P [(X + Y ) = 2] = = P 0.17
(X ==p(2)
2, Y = 0)
= 0.17 = p(2)
Thus, we can summarise the table = 0.17 = p(2)
as follows:
Thus, we can summarise the table as follows:
Thus, we can summarise the table as follows:
(x + y) −2 −1 0 1 2 3
(x + y) −2 −1 0
p(x + y) 0.1 0.28 0.13 0.29 0.17 0.03 1 2 3
(x +
p(x +y)y) −20.1 0.28−1 0.13 0 1
0.29 2
0.17 3
0.03
p(x + y) 0.1 0.28 0.13 0.29 0.17 0.03
Aliter
Aliter
We can obtain this distribution directly from the joint distribution table by
Aliter
We can obtain this
this example
distribution
going through stepdirectly
by step.from the joint distribution table by
We can
going obtain this
through this example
distribution
step directly
by step.from the joint distribution table by
going through this example step by step.
2
Take note that
2
Takenote that p(x + y) = P [(X + Y ) = x + y]
2
Take note that p(x + y) = P [(X + Y ) = x + y]
p(x + y) = P [(X + Y ) = x + y]

Develop the tools we need for Life Science


Masters Degree in Bioinformatics

Bioinformatics is the
exciting field where biology,
computer science, and
mathematics meet.

We solve problems from


biology and medicine using
methods and tools from
computer science and
mathematics.

Read more about this and our other international masters degree programmes at www.uu.se/master

48
ADVANCED TOPICS IN INTRODUCTORY
PROBABILITY: A FIRST COURSE IN SUMS, DIFFERENCES, PRODUCTS AND
48
PROBABILITY THEORY – VOLUME Advanced
III QUOTIENTS
Topics in Introductory OF BIVARIATE DISTRIBUTIONS
Probability

Step 1
Indicate the various summands with their corresponding joint probabilities

xi + yj −1 + (−1) −1 + 0 −1 + 1 0 + (−1) 0+0 0+1


p(xi , yj ) 0.10 0.20 0.11 0.08 0.02 0.26

2 + (−1) 2+0 2+1


0.03 0.17 0.03

Step 2
48
Sum the various values for X Advanced
and Y Topics in Introductory Probability

x+y −2 −1 0 −1 0 1 1 2 3
Step 1
p(x, y) 0.1 0.2 0.11 0.08 0.02 0.26 0.03 0.17 0.03
Indicate the various summands with their corresponding joint probabilities

By the principle of equality of the probabilities of equivalent events, we shall


x + yj −1 + (−1) −1 + 0 −1 + 1 0 + (−1) 0 + 0 0 + 1
perform ithe operation in Step 3.
p(xi , yj ) 0.10 0.20 0.11 0.08 0.02 0.26
Step 3
2 + (−1) 2 + 0 2 + 1
Sum the various probabilities corresponding to a particular value of the sum
0.03 0.17 0.03
x+y, denoted by p(x+y) (by the addition rule of probability)3 . For example,
the probability of the sum x + y = 1, is given by
Step 2
P [(X +
Sum the various values ) =and
forYX 1] Y= p(1)
= 0.26 + 0.03 = 0.29
x+y −2 −1 0 −1 0 1 1 2 3
Thus, we summarise the table in Step 3 in the table below:
p(x, y) 0.1 0.2 0.11 0.08 0.02 0.26 0.03 0.17 0.03

x+y −2 −1 1 0 2 3
By the principle of equality of the probabilities of equivalent events, we shall
p(x + y) 0.1 0.2 + 0.08 0.26 + 0.03 0.11 + 0.02 0.17 0.03
perform the operation in Step 3.
Step 4
Step 3
Present the final result as in the table below:
Sum the various probabilities corresponding to a particular value of the sum
x+y, denoted by p(x+y) (by the addition rule of probability)3 . For example,
Sums, Differences,(x +Products
y) −2 −1Quotients 0 of1by 2 Distributions
3 49
the probability of the sum x and+ y = 1, is given Bivariate
p(x + y) 0.1 0.28 0.13 0.29 0.17 0.03
P [(X + Y ) = 1] = p(1)
3
We iscan
That if averify thatvalue
particular thisindistribution of +
the sum appears
= 0.26 the sum
more
0.03 = is
than a probability
once,
0.29 distri-
their probabilities
are addedThat
bution. together
is, for the purpose of constructing the probability distribution.
Thus, we summarise the table0 in Step+3y)in≤the
≤ p(x 1 table below:
and x+y −2 −1  1 0 2 3
p(x + y) = 1
p(x + y) 0.1 0.2 + 0.08 0.26 + 0.03 0.11 + 0.02 0.17 0.03

Step 4
Case 2
Present the final result as in the table below:
Suppose that the joint distribution of X and Y is in the form of Table 1.3.
Suppose also that X and Y are independent so that p(xi , yj ) = g(xi )h(yj ).
(x + y) −2 −1 0 1 2 3
Then the various sums X + Y and their corresponding probabilities may be
p(x + y) 0.1 0.28 0.13 0.29 0.17 0.03
presented in the form of the following table.
3
That is if a particular value in the sum appears more than once, their probabilities
are added together for the purpose of constructing the probability distribution.
xi + yj x1 + y1 x2 + y2 ··· xn + ym
p(xi , yj ) p(x1 )p(y1 ) p(x2 )p(y2 ) ··· p(xn )p(ym )

49
Theorem 2.2
Case 2
Suppose that the joint distribution of X and Y is in the form of Table 1.3.
ADVANCED TOPICS IN INTRODUCTORY
Suppose also that X and Y are independent so that p(xi , yj ) = g(xi )h(yj ).
PROBABILITY: A FIRST COURSE IN SUMS, DIFFERENCES, PRODUCTS AND
Then the various sums X + Y and their corresponding probabilities
PROBABILITY THEORY – VOLUME III
may be
QUOTIENTS OF BIVARIATE DISTRIBUTIONS
presented in the form of the following table.

xi + yj x1 + y1 x2 + y2 ··· xn + ym
p(xi , yj ) p(x1 )p(y1 ) p(x2 )p(y2 ) ··· p(xn )p(ym )

Theorem 2.2
Suppose that X and Y are independent, discrete random variables
with marginal probability distributions g(x) and h(y) respectively,
then P (X + Y = k) is the sum of the product of g(x) and h(y)
for the corresponding value k = x + y

Example 2.3
Suppose the joint probabilities of two random variables X and Y are given
in the table below.

Y
X 3 6 Row Total
-2 0.28 0.12 0.4
4 0.42 0.18 0.6
Column Total 0.7 0.3 1.0

50 Advanced Topics in Introductory Probability


(a) Verify that X and Y are independent random variables.
(b) Find the distribution of the sum of X and Y .

Solution
(a) The marginal probability distributions of the X is

x -2 4
g(x) 0.4 0.6

and the marginal probability distributions of the Y is

y 3 6
h(y) 0.7 0.3

so that we have

p(−2, 3) = 0.28 = g(−2)h(3) = 0.4(0.7) = 0.28


p(−2, 6) = 0.12 = g(−2)h(6) = 0.4(0.3) = 0.12
p(4, 3) = 0.42 = g(4)h(3) = 0.6(0.7) = 0.42
p(4, 6) = 0.18 = g(4)h(6) = 0.6(0.3) = 0.18

Hence, X and Y are independent.


(b) It has been shown that X and Y are independent. Hence, we obtain
the following table:

xi + yj −2 + 3 −2 + 6 4+3 4+6
p(xi + yj ) 0.4(0.7) 0.4(0.3) 0.6(0.7) 0.6(0.3)
50
p(−2, 6) = 0.12 = g(−2)h(6) = 0.4(0.3) = 0.12
ADVANCED TOPICSp(4, 3) = 0.42 =
IN INTRODUCTORY g(4)h(3) = 0.6(0.7) = 0.42
PROBABILITY: A FIRST COURSE IN SUMS, DIFFERENCES, PRODUCTS AND
p(4, 6) = 0.18 = g(4)h(6) = 0.6(0.3) = 0.18
PROBABILITY THEORY – VOLUME III QUOTIENTS OF BIVARIATE DISTRIBUTIONS
Hence, X and Y are independent.
(b) It has been shown that X and Y are independent. Hence, we obtain
the following table:

xi + yj −2 + 3 −2 + 6 4+3 4+6
p(xi + yj ) 0.4(0.7) 0.4(0.3) 0.6(0.7) 0.6(0.3)

Note
Even though the values of the random variables of X and Y are added
together, their corresponding probabilities g(xi ) and h(yj ) are multi-
plied, if X and Y are independent.

Sums,By the principle


Differences, of equality
Products of probabilities
and Quotients of equivalent
of Bivariate 51
events, the
Distributions
distribution of the sum of X and Y is presented in the table below:
xi + yj 1 4 7 10
p(xi + yj ) 0.28 0.12 0.42 0.18

Distribution of Sums of Discrete Bivariate Random Variable in


Expression Form

In most cases the distribution of X and Y may not be presented in a table.


Suppose that all that we know are the values of X + Y and their corre-
sponding probabilities. In such a case, to obtain the distribution of X + Y
we shall use Theorem 2.3.

Theorem 2.3
Suppose X and Y are nonnegative discrete random variables with
joint probability function p(x, y). Let Z = X + Y . Then the proba-
bility distribution of Z is
z

P (Z = z) = p(x, z − x)
x=0

or equivalently,
Copenhagen P (Z = z) =
z


y=0
p(z − y, y)

Master of Excellence
Proof
cultural studies
P (Z = z)Master
Copenhagen = P (Xof+Excellence
Y = z) are
z
two-year master= degrees
P (X taught
= x, X +in
Y =English
z)
at one of Europe’s x=0
leading universities religious studies
z

= P (X = x, Y = z − x) (i)
Come to Copenhagenx=0
- and aspire!

fromApply now
which the at follows.
result science
Similarly,
www.come.ku.dk
z

P (Z = z) = p(z − y, y)
y=0

51
In most cases
ADVANCED the IN
TOPICS distribution of X and Y may not be presented in a table.
INTRODUCTORY
Suppose thatAall
PROBABILITY: that
FIRST we know
COURSE IN are the values of X + Y and their
SUMS, corre-
DIFFERENCES, PRODUCTS AND
PROBABILITY THEORY – VOLUME
sponding probabilities. In suchIIIa case, to obtain the distribution
QUOTIENTSof OF
X+ BIVARIATE
Y DISTRIBUTIONS

we shall use Theorem 2.3.

Theorem 2.3
Suppose X and Y are nonnegative discrete random variables with
joint probability function p(x, y). Let Z = X + Y . Then the proba-
bility distribution of Z is
z

P (Z = z) = p(x, z − x)
x=0

or equivalently,
z

P (Z = z) = p(z − y, y)
y=0

Proof

P (Z = z) = P (X + Y = z)
z

= P (X = x, X + Y = z)
x=0
z
= P (X = x, Y = z − x) (i)
x=0

from which the result follows.


Similarly,
z

P (Z = z) = p(z − y, y)
52 y=0
Advanced Topics in Introductory Probability

Note
It is important to note the limits of summation. Beyond these limits, one
of the two component mass functions is zero. When dealing with densities
that are nonzero only on some subset of values, we must always be careful.
In case X and Y are allowed to take negative values as well, the lower index
of summation is changed from 0 to −∞.

Example 2.4
Refer to Example 1.2, find

(a) the distribution of Z = X + Y ;

(i) from the first principle (that is, using the joint distribution of X
and Y directly;)
(ii) using Theorem 2.3.

(b) P (Z = 2),

(i) using the distribution obtained in (a) (i);


(ii) using the distribution obtained in (a) (ii).

Solution
Recall from the solution of Example 1.2 that

1
p(x, y) = (3x + 2y), for x = 0, 1; y = 0, 1, 2
21
The various pairs of X and Y that we require 52
are
(b) P (Z = 2),
ADVANCED TOPICS IN INTRODUCTORY
(i) using
PROBABILITY: the distribution
A FIRST COURSE IN obtained in (a) (i); SUMS, DIFFERENCES, PRODUCTS AND
PROBABILITY THEORY – VOLUME III QUOTIENTS OF BIVARIATE DISTRIBUTIONS
(ii) using the distribution obtained in (a) (ii).

Solution
Recall from the solution of Example 1.2 that

1
p(x, y) = (3x + 2y), for x = 0, 1; y = 0, 1, 2
21
The various pairs of X and Y that we require are

(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)

By the principle of equality of probabilities of equivalent events, the events


{(0, 1)} and {1, 0} are equivalent to the event {Z = 1}. Similarly, the
events {(0, 2)} and {(1, 1)} are equivalent to the event {Z = 2}. Hence,
the possible values of Z are z = 0, 1, 2, 3.

(a) (i) Writing the sum of the joint probability mass function as
1 2
1 
P (X + Y = x + y) = (3x +
Sums, Differences, Products and Quotients21of Bivariate 2y)
Distributions 53
x=0 y=0

we calculate the distribution of Z = X + Y as follows.

P (Z = 0) = p(0, 0)
1
= [3(0) + 2(0)]
21
= 0

P (Z = 1) = p(0, 1) + p(1, 0)
1 1
= [3(0) + 2(1)] + [3(1) + 2(0)]
21 21
5
=
21

P (Z = 2) = p(0, 2) + p(1, 1)
1 1
= [3(0) + 2(2)] + [3(1) + 2(1)]
21 21
9
=
21

P (Z = 3) = p(1, 2)
1
[3(1) + 2(2)]
21
7
=
21

Hence the distribution of Z = X + Y is

x+y 0 1 2 3
p(x + y) 0 5
21
9
21
7
21

(ii) The distrbution of Z = X + Y by Theorem 2.3 is


 z
 


 p(0, z − 0), z = 0, 1, 2, 3
 x=0
P (Z = z) = 53

 z
 

x+y
ADVANCED TOPICS IN INTRODUCTORY 0 1 2 3
PROBABILITY: A FIRST COURSE IN+ y)
p(x 0 5 9 7 SUMS, DIFFERENCES, PRODUCTS AND
PROBABILITY THEORY – VOLUME III 21 21 21 QUOTIENTS OF BIVARIATE DISTRIBUTIONS

(ii) The distrbution of Z = X + Y by Theorem 2.3 is


 z
 


 p(0, z − 0), z = 0, 1, 2, 3
 x=0
P (Z = z) =

54  z
Advanced

 Topics in Introductory Probability
 p(1, z − 1), z = 1, 2, 3
x=1

Aliter
 z
 



p(0, z − 0), z = 0, 1, 2, 3

 y=0





 z
P (Z = z) = p(1, z − 1), z = 1, 2, 3

 y=1





 z

 

 p(2, z − 2), z = 2, 3
y=2

(b) (i) From the table obtained in (a), that is, calculating the probability
from first principle, we have
3
P (Z = 2) =
7
(ii) By Theorem 2.3 (that is, using the results in (a)(ii)) we have
1

P (Z = 2) = p(x, 2 − x)
x=0
1
1 
= [3x + 2(2 − x)]
21 x=0
1
1 
Brain power =
21 x=0
3
(x + 4) By 2020, wind could provide one-tenth of our planet’s
electricity needs. Already today, SKF’s innovative know-
how is crucial to running a large proportion of the
= world’s wind turbines.
7 Up to 25 % of the generating costs relate to mainte-
nance. These can be reduced dramatically thanks to our
systems for on-line condition monitoring and automatic
Aliter lubrication. We help make it more economical to create
Using the alternative formula in Theorem 2.3, cleaner, cheaper energy out of thin air.
By sharing our experience, expertise, and creativity,
2
 industries can boost performance beyond expectations.
P (Z = 2) = p(2 − y, y) Therefore we need the best employees who can
y=0 meet this challenge!
2

= p(2 − y, y), since The=Power
p(2, 0) 0 of Knowledge Engineering
y=1
2
1 
= [3(2 − y) + 2y]
21 y=1

Plug into The Power of Knowledge Engineering.


Visit us at www.skf.com/knowledge

54
y=2

(b) (i) From the table obtained in (a), that is, calculating the probability
ADVANCED TOPICS IN INTRODUCTORY
from first
PROBABILITY: principle,
A FIRST COURSEwe IN
have SUMS, DIFFERENCES, PRODUCTS AND
PROBABILITY THEORY – VOLUME III QUOTIENTS OF BIVARIATE DISTRIBUTIONS
3
P (Z = 2) =
7
(ii) By Theorem 2.3 (that is, using the results in (a)(ii)) we have
1

P (Z = 2) = p(x, 2 − x)
x=0
1
1 
= [3x + 2(2 − x)]
21 x=0
1
1 
= (x + 4)
21 x=0
3
=
7

Aliter
Using the alternative formula in Theorem 2.3,
2

P (Z = 2) = p(2 − y, y)
y=0
2

= p(2 − y, y), since p(2, 0) = 0
y=1
2
Sums, Differences, Products 1  55
= and Quotients of 2y]
[3(2 − y) + Bivariate Distributions
21 y=1
2
1 
= (6 − y)
21 y=1
3
=
7

Theorem 2.4 CONVOLUTION THEOREM


(Discrete Case)
Suppose that X and Y are independent nonnegative discrete ran-
dom variables with marginal probability distributions g(x) and h(y)
respectively. Let Z be the sum of X and Y . Then the convolution
of g and h is the distribution of the sum Z given by
z

P (Z = z) = g(x)h(z − x) z≥0
x=0

or equivalently,
z

P (Z = z) = g(z − y)h(y) z≥0
y=0

called the convolution of g and h

Proof
Since X and Y are independent, the proof follows by writing
P (X = x, Y = z − x)
as
P (X = x)P (Y = z − x)
in Theorem 2.3.
55
Theorem 2.4 suggests a method of obtaining probability mass function when
z

P (Z = z) = g(z − y)h(y) z≥0
ADVANCED TOPICS IN INTRODUCTORY
y=0
PROBABILITY: A FIRST COURSE IN SUMS, DIFFERENCES, PRODUCTS AND
PROBABILITY THEORY – VOLUME III QUOTIENTS OF BIVARIATE DISTRIBUTIONS
called the convolution of g and h

Proof
Since X and Y are independent, the proof follows by writing
P (X = x, Y = z − x)
as
P (X = x)P (Y = z − x)
in Theorem 2.3.

Theorem 2.4 suggests a method of obtaining probability mass function when


X + Y are independent.
56 Advanced Topics in Introductory Probability
When we have two one-dimensional probability mass functions g and
h, the convolution formula
is a one-dimensional probability
z mass function. Thus the probability func-
tion
56 of the sum ofPtwo
(Z independent
= z) = random
g(x)h(z
Advanced variables
− x),
Topics is the
z≥ 0 convolution
in Introductory
4 of
Probability
the individual probability functions.
x=0

is
Notea one-dimensional probability mass function. Thus the probability func-
tion of the sum of two independent random variables is the convolution4 of
(a) Theorem 2.3 also expresses convolution;
the individual probability functions.
(b) The convolution of g and h is denoted as g ∗ h.
Note
Since
(a) Theorem 2.3 also expresses convolution;
X +Y =Y +X
(b) The convolution of g and h is denoted as g ∗ h.
it is easy to verify that
Since z

P (Z = z)
X +=Y = Y g(x)h(z
+ X − x)
x
it is easy to verify that z

= z
g(z − y)h(y)

y
P (Z = z) = g(x)h(z − x)
that is, g ∗ h = h ∗ g. x
z

Example 2.5 = g(z − y)h(y)
y
Suppose that X and Y are independent discrete random variables having
that is, g ∗ hdistributions
probability = h ∗ g.

Example 2.5 e−λ λx e−θ θy


pX (x) = and pY (y) =
Suppose that X and Y are independent
x! discrete random y! variables having
probability distributions
respectively, where x = 0, 1, 2, . . . and y = 0, 1, 2, . . . Find
(a) the distribution of thee−λ λx X +pY.(y) = e θ
−θ y
pX (x) = sum Z = and Y
x! y!
(b) P (X + Y = 4), if λ = 1 and θ = 2.
respectively, where x = 0, 1, 2, . . . and y = 0, 1, 2, . . . Find
Solution
(a) the distribution of the sum Z = X + Y.
The joint probability distribution of X and Y is given by
(b) P (X + Y = 4), if λ = 1 and θ = 2.
e−λ λx e−θ θy
p(x, y) = ·
Solution x! y!
The
4 joint probability distribution of X and Y is given by
Some writers prefer to use the French word Composition or even the German equiva-
lent Faltung
e−λ λx e−θ θy
p(x, y) = ·
x! y!
4
Some writers prefer to use the French word Composition or even the German equiva-
lent Faltung
56
x! y!
respectively, where x = 0, 1, 2, . . . and y = 0, 1, 2, . . . Find
ADVANCED TOPICS IN INTRODUCTORY
(a) the distribution
PROBABILITY: of the sum
A FIRST COURSE IN Z = X + Y. SUMS, DIFFERENCES, PRODUCTS AND
PROBABILITY THEORY – VOLUME III QUOTIENTS OF BIVARIATE DISTRIBUTIONS
(b) P (X + Y = 4), if λ = 1 and θ = 2.

Solution
The joint probability distribution of X and Y is given by
Sums, Differences, Products and Quotients
e−λ λx of
e−θBivariate
θy Distributions 57
p(x, y) = ·
x! y!
(a)
4 Bywriters
Some Theorem 2.4
prefer to use the French word Composition or even the German equiva-
lent Faltung z
 e−λ λx e−θ θz−x
P (Z = z) = ·
x=0
x! (z − x)!

(Note that the summation goes from x = 0 to x = z, since x cannot


exceed z = x + y).
z
 e−(λ+θ) λx θz−x
P (Z = z) =
x=0
x!(z − x)!
z
 λx θz−x
= e−(λ+θ)
x=0
x!(z − x)!
z
e−(λ+θ)  z!
= λx θz−x
z! x=0 x!(z − x)!

Identifying the sum as the binomial expansion of (λ + θ)z we obtain

e−(λ+θ) (λ + θ)z
P (Z = z) = , z = 0, 1, 2, ...
z!
(b) When λ = 1 and θ = 2

e−(1+2) (1 + 2)4
P (Z = 4) =
4!
Trust and responsibility
e−3 (3)4
=
4!
NNE and Pharmaplan have joined forces to create – You have to be proactive and open-minded as a
NNE Pharmaplan, the world’s leading engineering newcomer and make it clear to your colleagues what
and consultancy company focused entirely on the you are able to cope. The pharmaceutical field is new
Example 2.5 biotech
pharma and showsindustries.
that, if X and Y are independent random
to me. But busy asvariables
they are, most of my colleagues
having Poisson distribution with parameters λ and θ firespectively thenme,
nd the time to teach their
and they also trust me.
Inés Aréizaga Esteva (Spain), 25 years old Even though it was a bit hard at first, I can feel over
sum hasEducation:
a Poisson distribution with parameter λ + time
Chemical Engineer
θ. That is, the sum of
that I am beginning to be taken seriously and
the two independent Poisson random variables is again that mya contribution
Poisson random
is appreciated.
variable. The ideas here could be used to prove more generally by induction
that the sum of a finite Poisson random variables also has a Poisson distri-
bution, a result we proved earlier using the moment generating function in
my book “Introductory Probability Theory” (Nsowah-Nuamah, 2017).

Note
In general, if two independent random variables follow the same type of

NNE Pharmaplan is the world’s leading engineering and consultancy company


focused entirely on the pharma and biotech industries. We employ more than
1500 people worldwide and offer global reach and local knowledge along with
our all-encompassing list of services. nnepharmaplan.com

57
x=0

Identifying
ADVANCED TOPICS the sum as the binomial
IN INTRODUCTORY expansion of (λ + θ)z we obtain
PROBABILITY: A FIRST COURSE IN SUMS, DIFFERENCES, PRODUCTS AND
PROBABILITY THEORY – VOLUME III QUOTIENTS OF BIVARIATE DISTRIBUTIONS
−(λ+θ)
e (λ + θ)z
P (Z = z) = , z = 0, 1, 2, ...
z!
(b) When λ = 1 and θ = 2

e−(1+2) (1 + 2)4
P (Z = 4) =
4!

e−3 (3)4
=
4!

Example 2.5 shows that, if X and Y are independent random variables


having Poisson distribution with parameters λ and θ respectively then their
sum has a Poisson distribution with parameter λ + θ. That is, the sum of
the two independent Poisson random variables is again a Poisson random
variable. The ideas here could be used to prove more generally by induction
that the sum of a finite Poisson random variables also has a Poisson distri-
bution, a result we proved earlier using the moment generating function in
my book “Introductory Probability Theory” (Nsowah-Nuamah, 2017).
58 Advanced Topics in Introductory Probability
Note
In general, if two independent random variables follow the same type of
distribution, it is not necessarily true that their sum follows the same type
of distribution (see Example 2.7 in the sequel.)

2.2.2 Sums of Continuous Bivariate Random Variables

The formulas for finding sums of continuous random variables are similar to
those in the discrete case by replacing the sum by the integral. However,
the actual computation is more complex with the continuous case.

Theorem 2.5
Suppose that X and Y are continuous random variables having joint
density function f (x, y). Let Z = X + Y and denote the density
function of Z by s(z). Then
 ∞
s(z) = f (x, z − x) dx
−∞

or equivalently,  ∞
s(z) = f (z − y, y) dy
−∞

Proof
We shall use the cumulative distribution function technique. Let S denote
the c.d.f. of the random variable Z. Then
 
S(z) = P (Z ≤ z) = P (X + Y ≤ z) = f (x, y) dx dy
R

where R is the part of the region over which x + y < z, shown shaded in the
Figure 2.1.
To represent this integral as an iterated integral, we fix x and then
integrate with respect to y from z − x to −∞. Next integrate with respect
to x from −∞ to ∞. Thus
 ∞  z−x 58 
S(z) = f (x, y) dy dx
 
S(z) = P (Z ≤ z) = P (X + Y ≤ z) = f (x, y) dx dy
ADVANCED TOPICS IN INTRODUCTORY
PROBABILITY: A FIRST COURSE IN R SUMS, DIFFERENCES, PRODUCTS AND
PROBABILITY THEORY – VOLUME III QUOTIENTS OF BIVARIATE DISTRIBUTIONS
where R is the part of the region over which x + y < z, shown shaded in the
Figure 2.1.
To represent this integral as an iterated integral, we fix x and then
integrate with respect to y from z − x to −∞. Next integrate with respect
to x from −∞ to ∞. Thus
 ∞  z−x 
S(z) = and Quotientsf of
Sums, Differences, Products (x,Bivariate
y) dy dxDistributions 59
x=−∞ y=−∞

Let u = x+y; for fixed x, du = dy. If x is fixed then when y = z−x, u = z.


Thus  ∞  z 
S(z) = f (x, u − x) du dx
x=−∞ u=−∞

We assume that f (x, y) has continuous partial derivative. Therefore we


exchange the order of integration and obtain
 z  ∞ 
S(z) = f (x, u − x) dx du
u=−∞ x=−∞

Fig. 2.1 Region x + y < z is shaded

Differentiating S(z) with respect to z using the fundamental theorem of


calculus will yield the density function of the random variable Z. Hence,
 z  ∞  
d S(z) d
= f (x, u − x) dx du
dz dz u=−∞ x=−∞
 ∞
s(z) = f (x, z − x) dx
−∞

Reversing the roles played by X and Y , we obtain,


 ∞
s(z) = f (z − y, y) dy
−∞

Example 2.6
Refer to Example 1.3.
60 Advanced Topics in Introductory Probability
(a) Find the distribution of Z = X + Y ;
(b) Calculate P (Z ≤ 2) using
(i) the distribution of Z;
(ii) the joint distribution of (X, Y ), that is, from the first principle.

Solution
(a) Z = X + Y =⇒ Y = Z − X
Therefore
59
0 < Y < 2 =⇒ 0 ≤ Z − X ≤ 2 =⇒ X ≤ Z ≤ X + 2
(b) Calculate P (Z ≤ 2) using
ADVANCED TOPICS IN INTRODUCTORY
(i) theA distribution
PROBABILITY: FIRST COURSE ofINZ; SUMS, DIFFERENCES, PRODUCTS AND
PROBABILITY THEORY – VOLUME III QUOTIENTS OF BIVARIATE DISTRIBUTIONS
(ii) the joint distribution of (X, Y ), that is, from the first principle.

Solution
(a) Z = X + Y =⇒ Y = Z − X
Therefore

0 < Y < 2 =⇒ 0 ≤ Z − X ≤ 2 =⇒ X ≤ Z ≤ X + 2

To determine the region of integration we sketch below the preceding


inequality and 0 < X < 1.

It can be seen from the sketch that the region of integration is parti-
tioned into three, namely,

0 ≤ X ≤ Z when 0 < Z ≤ 1
0 ≤ X ≤ 1 when 1 < Z ≤ 2
Z − 2 ≤ X ≤ 1 when 2 < Z ≤ 3

For 0 < Z ≤ 1,
  
z x(z − x) 7 3
s(z) = x2 + dx = z
x=0 3 18
For 1 < Z ≤ 2,

This e-book
  
1 x(z − x) 2 z
s(z) = x +
2
dx = +
x=0 3 9 6

is made with SETASIGN


SetaPDF

PDF components for PHP developers

www.setasign.com

60
tioned into three, namely,

0≤X≤Z
ADVANCED TOPICS IN INTRODUCTORY when 0 < Z ≤ 1
PROBABILITY: A FIRST COURSE IN SUMS, DIFFERENCES, PRODUCTS AND
0≤X≤1
PROBABILITY THEORY – VOLUME III
when 1 < Z ≤ 2QUOTIENTS OF BIVARIATE DISTRIBUTIONS
Z − 2 ≤ X ≤ 1 when 2 < Z ≤ 3

For 0 < Z ≤ 1,
  
z x(z − x) 7 3
s(z) = x +
2
dx = z
x=0 3 18
Sums,For 1 < Z ≤ Products
Differences, 2, and Quotients of Bivariate Distributions 61
Sums, Differences, Products and Quotients of Bivariate

 1and Quotients of Bivariate Distributions Distributions 61
Sums, Differences, Products x(z − x) 2 z 61
s(z) = x +
2
dx = +
Finally, for 2 < Z ≤ x=0 3, 3 9 6
Finally, for 2 < Z ≤ 3,
Finally, for
 1 2 <Z ≤ 3,
x(z − x) 

1  
  2
s(z) =  1
 x2 + x(z − x)  dx = 1 −7z 33 + 36z 22 − 57z + 36
s(z) = x=z−2
1 x + x(z 3− x) dx = 18 1 −7z + 36z − 57z + 36
s(z) = x=z−2 x2 + 3 dx = 18 −7z 3 + 36z 2 − 57z + 36
x=z−2 3 18
Therefore, the distribution of Z is
Therefore, the distribution of Z is
Therefore, the distribution
 7 of Z is
 7 z3,

 0<z≤1


 18

 7 z 33 , 0<z≤1


 1818 z , 0<z≤1

 2 z


 2 + z,

 1<z≤2
s(z) =  
 9
2 + z6 , 1<z≤2
s(z) =  

 9 + 6, 1<z≤2
s(z) =  
 91 6 



 1  −7z 33 + 36z 22 − 57z + 36 , 2 < z ≤ 3

 18
 1 −7z 3 + 36z 2 − 57z + 36 , 2 < z ≤ 3

 18 −7z + 36z − 57z + 36 , otherwise
 0, 2<z≤3



 0,
 18 otherwise
0, otherwise
(b) (i) Using the distribution of Z we calculate the required probability
(b) (i) Using the distribution of Z we calculate the required probability
as follows:
(b) (i) Using the distribution of Z we calculate the required probability
as follows:
as follows:
P (X ≤ 2) = P (0 < z ≤ 1) + P (1 < z < 2)
P (X ≤ 2) =  P (01 <7z ≤ 1) +P 2(1 <  z < 2)
P (X ≤ 2) = P (0 < z ≤ 1) + P (1 < 2z < x2)
=  1 7 x dx +  2 
 
 2 + x  dx
= x=0 1 18
7 x dx + 2
x=1 9
2 x+ 6 dx
= x=0 418 1x dx  +x=1 9 +62 dx
x=07 z 18   1 x=1
2 9z 6
2
=  7 z 4 11 +  1  2 z + z 2 22
= 18
7 z44 + 31 32 z + z42 1
= 18 4 00 + 3 3 z + 4
4118 4 3 3 4 1
= 41 0 1
= 72 41
= 72
72
(ii) Now, using the joint distribution of (X, Y )
(ii) Now, using the joint distribution of (X, Y )
(ii) Now, using the joint distribution of (X, Y )
 
P (Z ≤ 2) =   f (x, y) dy dx
P (Z ≤ 2) =   f (x, y) dy dx
P (Z ≤ 2) = z<2 f (x, y)dy dx 
z<21 2−x xy
=  z<21  2−x x2 + xy  dy dx
x=0   3  dy dx
= 1 2−x x + xy
y=0
2
= x=0 y=0 x2 2+2−x 3 dy dx

1 y=0
x=0 xy 2 2−x3
=  1 x2 y + xy 2−x dx
2
= x=0 1 x y + xy 62 dx
= x=0  x2 y + 6 y=0 dx  
62 
1
x=0 Advanced
 6Topicsy=0
x in Introductory
2  Probability
x 
=  1 x2 (2 − x) + x (2 − x)2 − x2 + x  dx
2 y=0 2
= x=0 1 x2 (2 − x) + x 6 (2 − x)2 − x + x 6 dx
=  x=0 6
1  x (2 − x) + 1 (2 − x) − x +
2 6 dx 
x=0 6 6 2 x
= (2x − x ) + (4x − 4x + x ) − (x + ) dx
2 3 2 3
x=0 6 6
 1  1  1
2x2 4x4 1 4x2 x4 x3 x2
= − + 2x2 − + − +
3 4 0
6 3 4 0
3 12 0
41
=
72

Theorem 2.6 CONVOLUTION THEOREM


61
(Continuous Case)
x=0 6 6
 1  1  1
2x2 4x4 1 4x2 x4 x3 x2
=
ADVANCED TOPICS IN INTRODUCTORY − + 2x2 − + − +
3
PROBABILITY: A FIRST COURSE IN 4 0
6 3 4SUMS,
0
3 12 0 PRODUCTS AND
DIFFERENCES,
41
PROBABILITY THEORY – VOLUME III QUOTIENTS OF BIVARIATE DISTRIBUTIONS
=
72

Theorem 2.6 CONVOLUTION THEOREM


(Continuous Case)

Suppose that X and Y are independent continuous random variables


having joint density function f (x, y) with marginal probability
distributions g(x) and h(y) respectively. Let Z = X + Y and denote
the p.d.f. of Z by s(z). Then
 ∞
s(z) = g(x)h(z − x) dx
−∞

or equivalently,  ∞
s(z) = g(z − y)h(y) dy
−∞

The theorem follows from Theorem 2.5 by noting that since X and Y are
independent:

f (x, z − x) = g(x)h(z − x)
f (z − y, y) = g(z − y)h(y)

The function s is called the convolution of the functions g and h, and the
expression in the Theorem is called the convolution formula.
If X and Y are nonnegative random variables, then the convolution
formula reduces to the following
 z
s(z) = g(x)h(z − x) dx
0

This isDifferences,
Sums, because each of g andand
Products h isQuotients
equal to 0offor negativeDistributions
Bivariate argument. 63

Example 2.7
Suppose that X and Y are independent and identically distributed random
variables having p.d.f.’s

fX (x) = λ e−λ x , x≥0

and
fY (y) = λ e−λ y , y≥0
respectively, find the distribution of the sum Z = X + Y .

Solution
 z
fZ (z) = λ e−λ t λ e−λ(z−t) dt
0
 z
= λ2 e−λ z dt
0
= λ2 z e−λ z , z≥0

Note

(a) See note after Theorem 2.3.

(b) This distribution is a gamma distribution62with parameters 2 and λ.


fZ (z) = λ e−λ t λ e−λ(z−t) dt
0
 z
ADVANCED TOPICS IN INTRODUCTORY
= λ2 e−λ z dt
PROBABILITY: A FIRST COURSE IN 0 SUMS, DIFFERENCES, PRODUCTS AND
PROBABILITY THEORY – VOLUME III 2 −λ z QUOTIENTS OF BIVARIATE DISTRIBUTIONS
= λ ze , z≥0

Note

(a) See note after Theorem 2.3.

(b) This distribution is a gamma distribution with parameters 2 and λ.

2.3 DIFFERENCES OF RANDOM VARIABLES

2.3.1 Differences of Discrete Bivariate Random Variables


The probability distribution of the difference of two independent discrete
random variables X and Y may be considered as the probability distribution
of the sum of X and (−Y ). This is demonstrated in Table 2.2.

Table 2.2 Various Differences (X − Y ) and their Corresponding Proba-


bilities

xi − yj x1 − y1 x1 − y2 · · · x1 − ym x2 − y1 x2 − y2 · · ·
64p(xi , yj ) p(x1 , y1 ) ) · · · p(x
p(x1 , y2Advanced Topics
,
1 my )
in )
Introductory
p(x ,
2 1y 2 , y2 ) · · ·
Probability
p(x

··· x2 − ym xn − y1 xn − y2 ··· xn − ym
··· p(x2 , ym ) p(xn , y1 ) p(xn , y2 ) ··· p(xn , ym )

To find the distribution of the difference, we similarly apply the principle of


the probabilities of equivalent events.
Similar to sums of random variables, we shall present the two cases for
the difference, namely, when X and Y are not independent and when they
are.

ExampleSharp
2.8 Minds - Bright Ideas!
For the data in Example 2.2, find the distribution of the difference of the
Employees at FOSS Analytical A/S are living proof of the company value - First - using The Family owned FOSS group is
random new
variables X and Y .
inventions to make dedicated solutions for our customers. With sharp minds and the world leader as supplier of
cross functional teamwork, we constantly strive to develop new unique products - dedicated, high-tech analytical
SolutionWould you like to join our team? solutions which measure and
control the quality and produc-
IndicateFOSS
theworks
various difference
diligently withand
with innovation their corresponding
development as basis for itsjoint probabilities
growth. It is tion of agricultural, food, phar-
reflected in the fact that more than 200 of the 1200 employees in FOSS work with Re- maceutical and chemical produ-
search & Development in Scandinavia and USA. Engineers at FOSS work in production,
i − yj
xdevelopment −1 − (−1) within
and marketing, −1a−wide 0 range 1 0−
−1 of−different (−1)
fields, 0−0
i.e. Chemistry, 0−1 cts. Main activities are initiated
from Denmark, Sweden and USA
p(x ,
i jy )
Electronics, 0.1
Mechanics, Software, 0.2
Optics, 0.11
Microbiology, 0.08
Chemometrics. 0.08 0.26 with headquarters domiciled in
Hillerød, DK. The products are
We offer marketed globally by 23 sales
2 − (−1) 2 − 0 2 − 1
A challenging job in an international and innovative company that is leading in its field. You will get the companies and an extensive net
0.03 to work
opportunity 0.17 0.03
with the most advanced technology together with highly skilled colleagues. of distributors. In line with
the corevalue to be ‘First’, the
Read more about FOSS at www.foss.dk - or go directly to our student site www.foss.dk/sharpminds where
company intends to expand
you can learn more about your possibilities of working together with us on projects, your thesis etc.
Step 2 its market position.

Subtract the various values for Y from the various values for X:
Dedicated Analytical Solutions
FOSS
−y 0 69−1
xSlangerupgade −2 1 0 −1 3 2 1
p(x,
3400y) 0.1 0.2
Hillerød 0.11 0.08 0.02 0.26 0.03 0.17 0.03
Tel. +45 70103370

Step 3 www.foss.dk
By the principle of equality of the probabilities of equivalent events we shall
obtain the table below:

x−y −2 −1 0 1 2 3
63
p(x + y) 0.11 0.2 + 0.26 0.1 + 0.02 0.08 + 0.03 0.17 0.03
64 Advanced Topics in Introductory Probability
ADVANCED TOPICS IN INTRODUCTORY
PROBABILITY: A FIRST COURSE IN SUMS, DIFFERENCES, PRODUCTS AND
· · · x2 − yTHEORY
PROBABILITY m xn– − y1
VOLUME
xIIIn − y2 ··· xn − ym QUOTIENTS OF BIVARIATE DISTRIBUTIONS
· · · p(x2 , ym ) p(xn , y1 ) p(xn , y2 ) ··· p(xn , ym )

To find the distribution of the difference, we similarly apply the principle of


the probabilities of equivalent events.
Similar to sums of random variables, we shall present the two cases for
the difference, namely, when X and Y are not independent and when they
are.

Example 2.8
For the data in Example 2.2, find the distribution of the difference of the
random variables X and Y .

Solution
Indicate the various difference with their corresponding joint probabilities

xi − yj −1 − (−1) −1 − 0 −1 − 1 0 − (−1) 0−0 0−1


p(xi , yj ) 0.1 0.2 0.11 0.08 0.08 0.26

2 − (−1) 2−0 2−1


0.03 0.17 0.03

Step 2
Subtract the various values for Y from the various values for X:

x−y 0 −1 −2 1 0 −1 3 2 1
p(x, y) 0.1 0.2 0.11 0.08 0.02 0.26 0.03 0.17 0.03

Step 3
By the principle of equality of the probabilities of equivalent events we shall
obtain the table below:

x−y −2 −1 0 1 2 3
p(x + y) 0.11 0.2 + 0.26 0.1 + 0.02 0.08 + 0.03 0.17 0.03

Step 4Differences, Products and Quotients of Bivariate Distributions


Sums, 65
Present the final result as in the table below:

x−y −2 −1 0 1 2 3
p(x − y) 0.11 0.46 0.12 0.11 0.17 0.03

which can be presented as follows:

x−y −2 −1 0 1 2 3
p(x − y) 0.11 0.46 0.12 0.11 0.17 0.03

We can verify that this distribution of the sum is a probability distri-


bution. That is,
0 ≤ p(x − y) ≤ 1

and

p(x − y) = 1

64
We can
ADVANCED verifyINthat
TOPICS this distribution
INTRODUCTORY of the sum is a probability distri-
PROBABILITY: A FIRST COURSE IN SUMS, DIFFERENCES, PRODUCTS AND
bution. That is,
PROBABILITY THEORY – VOLUME III QUOTIENTS OF BIVARIATE DISTRIBUTIONS
0 ≤ p(x − y) ≤ 1

and

p(x − y) = 1

Example 2.9
For the data in Example 2.3, find the distribution of the difference of the
random variables X and Y .

Solution
It has been shown in Example 2.3 that X and Y are independent. Therefor,
the various values of the sum {x + (−y)} and their corresponding probabil-
ities are given in the following table:

xi − yj −2 − 3 −2 − 6 4−3 4−6
p(xi , yj ) 0.4(0.7) 0.4(0.3) 0.6(0.7) 0.6(0.3)

By the principle of equality of probabilities of equivalent events, we obtain


the distribution of the difference of X and Y, presented in the table below:

xi − yj −5 −8 1 −2
p(xi − yj ) 0.28 0.12 0.42 0.18
66 Advanced Topics in Introductory Probability

2.3.2 Differences of Continuous Bivariate Random Variables

Theorem 2.7
Suppose that X and Y are continuous random variables having joint
density function f (x, y) and let U = X − Y . Then
 ∞
fu (u) = f (x, x − u) dx
−∞

or equivalently,  ∞
fu (u) = f (u + y, y) dy
−∞

Corollary 2.1
Suppose that X and Y are independent nonnegative continuous random
variables having joint density function f (x, y) with marginal probability
distributions g(x) and h(y) respectively, and let U = X − Y . Then
 ∞
fu (u) = g(x) h(x − u) dx
0

or equivalently,  ∞
fu (u) = g(u + y) h(y) dy
0

Example 2.10
Refer to Example 1.3.
(a) Find the distribution of U = X − Y ;
65
 
1
fu (u) = g(x) h(x − u) dx
0
ADVANCED TOPICS IN INTRODUCTORY
or equivalently,
PROBABILITY: A FIRST COURSE IN
 ∞ SUMS, DIFFERENCES, PRODUCTS AND
fu (u) =III
PROBABILITY THEORY – VOLUME g(u + y) h(y) dy QUOTIENTS OF BIVARIATE DISTRIBUTIONS
0

Example 2.10
Refer to Example 1.3.
(a) Find the distribution of U = X − Y ;
 
1
(b) Using the distribution in (a) calculate (i) P (U ≤ 1) (ii) P U < .
2

Solution
(a) X − Y = U =⇒ X = U + Y .
Therefore
Sums, Differences, Products and Quotients of Bivariate Distributions 67
0 < X < 1 =⇒ 0 ≤ U + Y ≤ 1 =⇒ −Y ≤ U ≤ 1 − Y

The region defined by the preceding inequality and 0 < Y < 2 is


sketched in the diagram below.

It can be seen from the sketch that the region of integration is parti-
tioned into three, namely,

−U ≤ Y ≤ 2 when −2 ≤ U < −1
−U ≤ Y ≤ 1 − U when −1 ≤ U < 0
0≤Y ≤1−U when 0 ≤ U ≤ 1

For −2 ≤ U < −1
  
2 7 4
fu (u) = u + u y + y 3 dy
2
−u 3 3
5 3 14 32
= u + 2 u2 + u+
18 3 9

For −1 ≤ U < 0
 1−u  7 4

fu (u) = u2 + u y + y 2 dy
−u 3 3
4 u
= −
9 6

For 0 ≤ U ≤ 1
 1−u 7 4

fu (u) = u + u y + y2 2
dy
0 3 3
4 u 5 3
= − − u
9 6 18

66
ADVANCED TOPICS IN INTRODUCTORY
PROBABILITY: A FIRST COURSE IN SUMS, DIFFERENCES, PRODUCTS AND
PROBABILITY
68 THEORY – VOLUME Advanced
III QUOTIENTS
Topics in Introductory OF BIVARIATE DISTRIBUTIONS
Probability

Therefore

 5 3 14 32


 u + 2 u2 + u + , −2 ≤ u < −1



18 3 9





 4 u

 − , −1 ≤ u < 0
fu (u) = 9 6



 4 u 5 3

 − − u , 0≤u≤1



 9 6 18





0, elsewhere

(b)

(i) P (U ≤ 1) = P (−2 ≤ U ≤ 1)
= P (−2 ≤ U < 1) + P (−1 ≤ U < 0) + P (0 ≤ U ≤ 1)

 −1  
5 14 32
= u + 2u + 3
u+ 2
du
−2 18 3 9
 0    1 
4 u 4 u 5 3
+ − du + − − u du
−1 9 6 0 9 6 18
13 19 21
= + +
72 36 72
= 1
 
1
(ii) P U≤ = P (−2 ≤ U ≤ 1) + P (−1 ≤ U < 0)
2
 
1
+P 0 ≤ U ≤
2
 −1  
5 14 32
= u3 + 2u2 + u+ du
−2 18 3 9
 0    1  
u 4 2 4 u 5
+ du +
− − − u3 du
−1 9 6 0 9 6 18
13 19 227
= + +
72 36 1152
1043
=
1152

67
4 u 4 u 5 3
+ − du + − − u du
−1 9 6 0 9 6 18
ADVANCED TOPICS IN INTRODUCTORY
13 IN19 21
=
PROBABILITY: A FIRST COURSE+ + SUMS, DIFFERENCES, PRODUCTS AND
36
72 III
PROBABILITY THEORY – VOLUME 72 QUOTIENTS OF BIVARIATE DISTRIBUTIONS
= 1
 
1
(ii) P U≤ = P (−2 ≤ U ≤ 1) + P (−1 ≤ U < 0)
2
 
1
+P 0 ≤ U ≤
2
 −1  
5 3 14 32
= u + 2u + u +
2
du
−2 18 3 9
 0    1  
u 4 2 4 u 5
+ − du + − − u3 du
−1 9 6 0 9 6 18
13 19 227
= + +
72 36 1152
1043
=
1152
Sums, Differences, Products and Quotients of Bivariate Distributions 69

2.4 PRODUCTS OF BIVARIATE RANDOM VARIABLES

2.4.1 Product of Discrete Bivariate Random Variables

The probability distribution of the product of two discrete random variables


X and Y whose joint probability functions are in tabular form can be calcu-
lated using the principle of the equality of the probabilities of two equivalent
events. This is demonstrated in Table 2.3.

Table 2.3 Various Products XY and their Corresponding Probabilities

xi yj x1 y1 x1 y2 ··· x1 ym x2 y1 x2 y2 ···
p(xi , yj ) p(x1 , y1 ) p(x1 , y2 ) ··· p(x1 , ym ) p(x2 , y1 ) p(x2 , y2 ) ···

··· x2 ym xn y1 xn y2 ··· xn ym
··· p(x2 , ym ) p(xn , y1 ) p(xn , y2 ) ··· p(xn , ym )

Example 2.11
For the data in Example 2.2, find the distribution of the product of the
random variables X and Y .

Solution
Step 1
Indicate the various products with their corresponding joint probabilities

xi yj −1(−1) −1(0) −1(1) 0(−1) 0(0) 0(1)


p(xi , yj ) 0.1 0.2 0.11 0.08 0.08 0.26

2(−1) 2(0) 2(1)


0.03 0.17 0.03

Step 2
Multiply the various values for X by the various values for Y :

68
p(xi , yj ) 0.1 0.2 0.11 0.08 0.08 0.26
ADVANCED TOPICS IN INTRODUCTORY
PROBABILITY: A FIRST COURSE IN SUMS, DIFFERENCES, PRODUCTS AND
2(−1) 2(0) 2(1)
PROBABILITY THEORY – VOLUME III QUOTIENTS OF BIVARIATE DISTRIBUTIONS
0.03 0.17 0.03

70
Step 2 Advanced Topics in Introductory Probability
70 Advanced Topics in Introductory Probability
Multiply the various values for X by the various values for Y :
xy 1 0 −1 0 0 0 −2 0 2
xyy)
p(x, 1
0.1 0
0.2 −1
0.11 0
0.08 0
0.02 0
0.26 −2
0.03 0
0.17 2
0.03
p(x, y) 0.1 0.2 0.11 0.08 0.02 0.26 0.03 0.17 0.03
Step 3
Step
By the3 principle of equality of the probabilities of equivalent events we shall
By the the
obtain principle
table of equality of the probabilities of equivalent events we shall
below:
obtain the table below:
xy −2 −1 0 1 2
xy
p(xy) −2
0.03 −1
0.11 0 + 0.26 + 0.17
0.2 + 0.08 + 0.02 1
0.1 2
0.03
p(xy) 0.03 0.11 0.2 + 0.08 + 0.02 + 0.26 + 0.17 0.1 0.03
Step 4
Step 4 the final result as in the table below:
Present
Present the final result as in the table below:
xy −2 −1 0 1 2
xy
p(xy) −2
0.03 −1
0.11 0
0.73 1
0.1 2
0.03
p(xy) 0.03 0.11 0.73 0.1 0.03
We can verify that this distribution of the sum is a probability distri-
WeThat
bution. can verify
is, that this distribution of the sum is a probability distri-
bution. That is, 0 ≤ p(xy) ≤ 1
0 ≤ p(xy) ≤ 1
and 
and  p(xy) = 1
p(xy) = 1

Example 2.12
Example 2.12in Example 2.3, find the distribution of the product of X and
For the data
For
Y. the data in Example 2.3, find the distribution of the product of X and
Y.
Solution
Solution
It has been shown in Example 2.3 that X and Y are independent. Therefor,
It has
the been shown
various values in
of Example 2.3 that
the product X and
xy and are independent.
Y corresponding
their Therefor,
probabilities
the various values of the product
are given in the following table: xy and their corresponding probabilities
are given in the following table:
xi yj −2(3) −2(6) 4(3) 4(6)
−2(3) 0.4(0.3)
xi y, yj ) 0.4(0.7)
p(x 4(3)
−2(6) 0.6(0.7) 4(6)
0.6(0.3)
i j
p(xi , yj ) 0.4(0.7) 0.4(0.3) 0.6(0.7) 0.6(0.3)
Sums, Differences, Products and Quotients of Bivariate Distributions 71
By the principle of the equality of probabilities of equivalent events we obtain
By 71
the the principle of the equality of probabilities
Sums, Differences, Products and Quotients of of equivalent
Bivariate events we obtain
Distributions
result summarised in the following table.
the result
xi ysummarised
j −6 in the following
−12 12 24table.
x iyyj ) 0.28
p(x
i j −6 0.12 −12 0.42
12 0.1824
p(xi yj ) 0.28 0.12 0.42 0.18

Example 2.13
Suppose there
Example 2.13 are two independent distributions:
Suppose there are two independent distributions:

x −2 4
g(x)
x 0.4
−2 0.6
4
g(x) 0.4 0.6

y −2 4
h(y)
y 0.4
−2 0.6
4 69
x −2 4
ADVANCED
g(x) 0.4 IN0.6
x TOPICS
−2 4INTRODUCTORY
PROBABILITY: A FIRST COURSE IN SUMS, DIFFERENCES, PRODUCTS AND
PROBABILITY 0.4 0.6
g(x) THEORY – VOLUME III QUOTIENTS OF BIVARIATE DISTRIBUTIONS

y −2 4
h(y)
y 0.4
−2 0.6
4
h(y) 0.4 0.6

Find the distribution of (a) X 2 (b) XY.


Find the distribution of (a) X 2 (b) XY.
Solution
Solution

x2 4 16
(a)
x2 ) 0.4 0.6
p(x2
4 16
(a)
p(x2 ) 0.4 0.6

xy −2(−2) −2(4) 4(−2) 4(4)


(b)
p(x,
xy y) 0.4(0.4)
−2(−2) 0.4(0.6)
−2(4) 0.6(0.4)
4(−2) 0.6(0.6)
4(4)
(b)
p(x, y) 0.4(0.4) 0.4(0.6) 0.6(0.4) 0.6(0.6)

xy 8 −8 −8 16
p(x,
xy y) 0.16
8 0.24
−8 0.24
−8 0.36
16
p(x, y) 0.16 0.24 0.24 0.36

xy 4−8 16
72 0.16
0.48 Advanced Topics in Introductory Probability
p(xy)
xy −8 0.36
4 16
p(xy) 0.16 0.48 0.36
2.4.2 Product of Continuous Bivariate Random Variables

Theorem 2.8
Suppose that X and Y are continuous random variables having joint
density function f (x, y) and let V = XY . Then
 ∞    
1 v
fv (v) = f x, dx
−∞ |x| x

or equivalently,
 ∞    
1 v
fv (v) = f ,y dy
−∞ |y| y

Corollary 2.2
Suppose that X and Y are independent, nonnegative continuous random
variables having joint density function f (x, y) with marginal probability
distributions g(x) and h(y) respectively and let V = XY . Then
 ∞   
1 v
fv (v) = h g(x) dx, y>0
0 x x

or equivalently,
 ∞   
1 v
fv (v) = g h(y) dy, x>0
0 y y

70
We can calculate the probability P (XY ≤ t) for the independent, nonnega-
or equivalently,
ADVANCED TOPICS IN INTRODUCTORY
    
PROBABILITY: A FIRST COURSE∞IN 1 v SUMS, DIFFERENCES, PRODUCTS AND
fv (v)
PROBABILITY THEORY =
– VOLUME III g h(y) dy, x > 0QUOTIENTS OF BIVARIATE DISTRIBUTIONS
0 y y

We can calculate the probability P (XY ≤ t) for the independent, nonnega-


tive continuous random variables X and Y thus
 
P (XY ≤ t) = g(x)h(y) dx dy
0<xy≤t
 t  ∞     
1 v
= g h(y) dy dv
0 −∞ y y
 t
Sums, Differences, Products fv (v) dv of Bivariate Distributions
= and Quotients 73
0

The integration is restricted to the first quadrant, since g(x)h(y), the joint
density function of X and Y, are zero elsewhere.

Example 2.14
Refer to Example 1.3.
(a) Find the distribution of V = XY ;
 
1
(b) Using the distribution in (a), calculate P ≤U ≤1 .
2

Solution
V
(a) XY = V =⇒ X = .
Y
Therefore
V
0 < X < 1 =⇒ 0 ≤ ≤ 1 =⇒ 0 ≤ V ≤ Y
Y
The region defined by the preceding inequality and 0 < Y < 2 is
sketched in the diagram below.

It can be seen from the sketch that the region of integration is


V ≤ Y ≤ 2 when 0 < V ≤ 2

Therefore
  
2 v2 1 v
h(u) = + dy
y=v y3 3 y

71
ADVANCED TOPICS IN INTRODUCTORY
PROBABILITY: A FIRST COURSE IN SUMS, DIFFERENCES, PRODUCTS AND
PROBABILITY THEORY – VOLUME III QUOTIENTS OF BIVARIATE DISTRIBUTIONS
74 Advanced Topics in Introductory Probability

 
v 1 2
= ln y −
3 2 y2 v
 
v 2 1 v2
= ln + − , 0<v≤2
3 v 2 8

(b) Now,
   1   
1 v 2 1 v2
P ≤V ≤1 = ln + − dv
2 1
2
3 v 2 8
  1
v v3 v2 2
= − + 2 + ln
2 24 6 v 1
2
= 0.3338

2.5 QUOTIENTS OF BIVARIATE RANDOM VARIABLES

2.5.1 Quotient of Discrete Bivariate Random Variables

The probability distribution of the quotient of two discrete random variables


X
X and Y ( ) whose joint probability functions are in tabular form can be
Y
calculated using the principle of the equality of the probabilities of two
equivalent events. This is demonstrated in Table 2.4.

X
Table 2.4 Various Quotient and their Corresponding Probabilities
Y

xi x1 x1 x1 x2 x2
yj The Wake
p(xi , yj )
y1
p(x1 , y1 )
y2
p(x1 , y2 )
···
···
ym
p(x1 , ym )
y1
p(x2 , y1 )
y2
p(x2 , y2 )
···
···
the only emission we want to leave behind
x2 xn xn xn
··· ···
ym y1 y2 ym
··· p(x2 , ym ) p(xn , y1 ) p(xn , y2 ) ··· p(xn , ym )

Suppose X and Y are two independent variables. Then the distribution of

.QYURGGF'PIKPGU/GFKWOURGGF'PIKPGU6WTDQEJCTIGTU2TQRGNNGTU2TQRWNUKQP2CEMCIGU2TKOG5GTX

6JGFGUKIPQHGEQHTKGPFN[OCTKPGRQYGTCPFRTQRWNUKQPUQNWVKQPUKUETWEKCNHQT/#0&KGUGN6WTDQ
2QYGTEQORGVGPEKGUCTGQHHGTGFYKVJVJGYQTNFoUNCTIGUVGPIKPGRTQITCOOGsJCXKPIQWVRWVUURCPPKPI
HTQOVQM9RGTGPIKPG)GVWRHTQPV
(KPFQWVOQTGCVYYYOCPFKGUGNVWTDQEQO

72
The probability distribution of the quotient of two discrete random variables
X
ADVANCED TOPICS IN INTRODUCTORY
X and Y ( A) FIRST
PROBABILITY:
whoseCOURSE
joint probability
IN
functions are in tabular form can be
SUMS, DIFFERENCES, PRODUCTS AND
Y
calculated using
PROBABILITY the– principle
THEORY VOLUME IIIof the
equality of the probabilities of BIVARIATE
QUOTIENTS OF two DISTRIBUTIONS
equivalent events. This is demonstrated in Table 2.4.

X
Table 2.4 Various Quotient and their Corresponding Probabilities
Y

xi x1 x1 x1 x2 x2
··· ···
yj y1 y2 ym y1 y2
p(xi , yj ) p(x1 , y1 ) p(x1 , y2 ) ··· p(x1 , ym ) p(x2 , y1 ) p(x2 , y2 ) ···

x2 xn xn xn
··· ···
ym y1 y2 ym
··· p(x2 , ym ) p(xn , y1 ) p(xn , y2 ) ··· p(xn , ym )

Sums, Differences, Products and Quotients of Bivariate Distributions 75


Suppose X and Y are two independent variables. Then the distribution 75
Sums, Differences, Products and Quotients of Bivariate Distributions of
X
X , Y = 0 can, as before, be calculated using the principle of equality of
Y , Y = 0 can, as before, be calculated using the principle of equality of
probabilities
Y of equivalent events.
probabilities of equivalent events.

Example 2.15
Example
Consider 2.15
the distributions of two independent random variables X and Y .
Consider the distributions of two independent random variables X and Y .

x 2 5
x
g(x) 2
0.2 5
0.8
g(x) 0.2 0.8

y 4 8
y
h(y) 4
0.7 8
0.3
h(y) 0.7 0.3

X Y
Find the distribution of X and Y.
Find the distribution of Y and X.
Y X
Solution
Solution
Proceeding as before:
Proceeding as before:

x/y 2/4 2/8 5/4 5/8


x/yy)
p(x, 2/4
0.2(0.7) 2/8
0.2(0.3) 5/4
0.8(0.7) 5/8
0.8(0.3)
p(x, y) 0.2(0.7) 0.2(0.3) 0.8(0.7) 0.8(0.3)

X
The distribution of X is thus
The distribution of Y is thus
Y

x/y 1/2 1/4 5/4 5/8


x/y
p(x/y) 1/2
0.14 1/4
0.06 5/4
0.56 5/8
0.24
p(x/y) 0.14 0.06 0.56 0.24

Y
The distribution of Y follows similarly and is given in the table below:
The distribution of X follows similarly and is given in the table below:
X

y/x 2 4 4/5 8/5


y/x
p(y/x) 2
0.14 4
0.06 4/5
0.56 8/5
0.24
p(y/x) 0.14 0.06 0.56 0.24

73
ADVANCED TOPICS IN INTRODUCTORY
PROBABILITY: A FIRST COURSE IN SUMS, DIFFERENCES, PRODUCTS AND
PROBABILITY THEORY – VOLUME Advanced
76 III QUOTIENTS
Topics in Introductory OF BIVARIATE DISTRIBUTIONS
Probability

2.5.2 Quotient of Continuous Bivariate Random Variables

Theorem 2.9
Suppose that X and Y are continuous random variables having joint
X
density function f (x, y) and let W = . Then
Y
 ∞
fw (w) = |y| f (wy, y) dy
−∞

Corollary 2.3
Suppose that X and Y are independent, nonnegative continuous random
variables having joint density function f (x, y) with marginal probability
X
distributions g(x) and h(y) respectively, and let W = . Then
Y
 ∞
fw (w) = y f (wy) h(y) dy
0

X
We can calculate the cumulative distribution function of for the indepen-
Y
dent, nonnegative random variables X and Y by
   
X
P ≤t = y f (wy)h(y) dy
Y
0 <x
y
≤t
0<y<∞
 t  ∞ 
= y f (wy) h(y) dy dw
0 0
 t
= fw (w) dw
0

Example 2.16
Refer to Example 1.3.
X
(a) Find the distribution of W = ;
Y
 
1 Distributions
Sums, Differences, Products and Quotients of Bivariate 77
(b) Using the distribution in (a), calculate P W ≤ .
4
Solution
X
(a) Distribution of W = :
Y
X
= W =⇒ X = Y W
Y
Therefore
1
0 < X < 1 =⇒ 0 ≤ Y W ≤ 1 =⇒ 0 ≤ W ≤
Y
The region defined by the preceding inequality and 0 < Y < 2 is
sketched in the diagram below.

74
Y
X
ADVANCED TOPICS IN INTRODUCTORY= W =⇒ X = Y W
Y
PROBABILITY: A FIRST COURSE IN SUMS, DIFFERENCES, PRODUCTS AND
Therefore
PROBABILITY THEORY – VOLUME III QUOTIENTS OF BIVARIATE DISTRIBUTIONS
1
0 < X < 1 =⇒ 0 ≤ Y W ≤ 1 =⇒ 0 ≤ W ≤
Y
The region defined by the preceding inequality and 0 < Y < 2 is
sketched in the diagram below.

It can be seen from the sketch that the region of integration is parti-
tioned into two, namely,
1
0 < Y ≤ 2 when 0 < W ≤
2
1 1
0<Y ≤ when ≤W <∞
W 2

 
w 2 1
h(w) = w2 + y 3 dy, 0≤w<
3 0 2
  1
w w 1
= w2 + y 3 dy, ≤w<∞
3 0 2
Therefore  w


 4(w2 + ), 0≤w< 1
 3 2
h(w) =

 1 1 w

 (w + ), 1
≤w<∞
4w3 3 2

Challenge the way we run

EXPERIENCE THE POWER OF


FULL ENGAGEMENT…

RUN FASTER.
RUN LONGER.. READ MORE & PRE-ORDER TODAY
RUN EASIER… WWW.GAITEYE.COM

1349906_A6_4+0.indd 1 22-08-2014 12:56:57

75
  2
ADVANCED TOPICS IN INTRODUCTORY
w 1
h(w) = w2 + y 3 dy, 0≤w<
PROBABILITY: A FIRST COURSE IN 3 0 2
SUMS, DIFFERENCES, PRODUCTS AND
PROBABILITY THEORY – VOLUME III  1
w w 1 QUOTIENTS OF BIVARIATE DISTRIBUTIONS
= 2
w + 3
y dy, ≤w<∞
3 0 2
Therefore  w


 4(w2 + ), 0≤w< 1
 3 2
h(w) =

 1 1 w
78 
Advanced
(w +Topics
), 1in≤Introductory
w<∞ Probability
4 w3 3 2

 
1
(b) Calculation of P W ≤ .
4
   
X 1 1
P ≤ = P W ≤
Y 4 4
 1  
4 w
= 4 w2 + dw
0 3
1
=
16

EXERCISES

2.1 Refer to Exercise 1.1. Find the distribution of


X
(a) X + Y (b) X − Y (c) XY (d)
Y
3X
(e) 2X + 3Y (f ) 5X − Y (g) 3X(4Y ) (h)
7Y
(i)X2 + 5Y (j) 3X (k) 3(X + Y ) (l) Y 3
(m) Y 2 − 2Y + 3X

2.2 Refer to Example 2.14. Find the distribution of

(a) the sum Z = X + Y ;


(b) W = 2X.

2.3 Refer to Exercise 1.6. Find

(a) the distribution of the sum Z = X + Y ;


(b) P (Z = 2) (i) using the distribution of Z, (ii) from the first
principle.

2.4 Refer to Exercise 1.7. Find

(a) the distribution of the sum Z = X + Y ;


(b) P (Z = 2) (i) using the distribution of Z, (ii) from the first
principle.
Sums, Differences, Products and Quotients of Bivariate Distributions 79
2.5 Refer to Exercise 1.9. Find

(a) the distribution of the sum Z = X + Y ;


(b) P (Z = 4) (i) using the distribution of Z, (ii) from the first
principle.
(c) Find P (X = x|X + Y = x + y)

2.6 Refer to Exercise 1.10. Find

(a) the distribution of the sum Z = X + Y ;


(b) P (Z = 4) (i) using the distribution of Z, (ii) from the first
principle. 76
(a) the distribution of the sum Z = X + Y ;
ADVANCED TOPICS IN INTRODUCTORY
(b) P (Z = 4) (i) using the
PROBABILITY: A FIRST COURSE IN
distribution of Z, (ii) SUMS,
from DIFFERENCES,
the first PRODUCTS AND
principle.
PROBABILITY THEORY – VOLUME III QUOTIENTS OF BIVARIATE DISTRIBUTIONS
(c) Find P (X = x|X + Y = x + y)

2.6 Refer to Exercise 1.10. Find

(a) the distribution of the sum Z = X + Y ;


(b) P (Z = 4) (i) using the distribution of Z, (ii) from the first
principle.

2.7 Refer to Exercise 1.11, where k = 2. Find

(a) the distribution of the sum Z = X + Y ;


(b) P (Z ≤ 1) (i) using the distribution of Z, (ii) from the first
principle.
2
2.8 Refer to Exercise 1.12, where k = . Find
3
(a) the distribution of the sum Z = X + Y ;
(b) P (Z =≤ 1) (i) using the distribution of Z, (ii) from the first
principle.

2.9 Refer to Exercise 1.14. Find

(a) the distribution of the sum Z = X + Y ;


(b) P (Z ≤ 1) (i) using the distribution of Z, (ii) from the first
principle.
12
2.10 Refer to Exercise 1.13, where a = . Find
7
(a) the distribution of the sum Z = X + Y ;
(b) P (Z ≤ 1) (i) using the distribution of Z, (ii) from the first
principle.

2.11 If X and Y are independent Binomial random variables and X is b(n, p)


80 and Y is b(m, p), use theAdvanced
convolution theorem
Topics to show that
in Introductory X + Y is
Probability
b(n + m, p).
2.12 If X and Y are independent Geometric random variables each with
parameter p. Use the convolution theorem to show that X + Y is a
negative binomial N B(2, p).

2.13 If X and Y are independent Negative Binomial random variables and


X is b− (k1 , p) and Y is b− (k2 , p), use the convolution theorem to show
that X + Y is a negative binomial b− (k1 + k2 , p).

2.14 If X and Y are independent Gamma random variables and X is


Γ(s1 , λ) and Y is Γ(s2 , λ), use the convolution theorem to show that
X + Y is Γ(s1 + s2 , λ).

2.15 Suppose X and Y are independent Normal random variables and X is


N (µ1 , σ12 ) and Y is N (µ2 , σ22 ), use the convolution theorem to show
that X + Y is a N (µ1 + µ2 , σ12 + σ22 ).

2.16 Refer to Example 1.15. Compute P (X + Y > 2).

2.17 The moment generating function of X is given by


 
MX (t) = exp 2et − 2

and that of Y by
77
 10  10
1 t
2.15 Suppose X and Y are independent Normal random variables and X is
ADVANCED , σ12 ) and
N (µ1TOPICS Y is N (µ2 , σ22 ), use the
IN INTRODUCTORY convolution theorem to show
PROBABILITY:
that X A + FIRST
Y is aCOURSE
N (µ1 +INµ2 , σ12 + σ22 ). SUMS, DIFFERENCES, PRODUCTS AND
PROBABILITY THEORY – VOLUME III QUOTIENTS OF BIVARIATE DISTRIBUTIONS
2.16 Refer to Example 1.15. Compute P (X + Y > 2).

2.17 The moment generating function of X is given by


 
MX (t) = exp 2et − 2

and that of Y by
 10  10
1
MY (t) = 3et + 1
4

If X and Y are independent, find P (X + Y = 2).

2.18 Refer to Exercise 1.10. Find

(a) the distribution of the difference U = X − Y ;


(b) (i) P (U = 0); (ii) P (U = 1); (iii) P (U = 2).

2.19 Refer to Exercise 1.11. Find

(a) the distribution of the difference U = X − Y ;


(b) (i) P (U ≤ 12 ), (ii) P (− 12 ≤ U ≤ 12 ).

2.20 Refer to Exercise 1.10. Find


Sums, Differences, Products and Quotients of Bivariate Distributions 81
(a) the distribution of the product V = XY ;

(b) (i) P (V = 2), (ii) P (V = 3) (iii) P (V = 4), (iv) P (V = 6).

2.21 Refer to Exercise 1.11. Find

(a) the distribution of the product U = XY ;


(b) (i) P (U > 34 ), (ii) P ( 12 < U < 1), using the distribution of U,
Technical training on
2.22 Refer to Exercise 2.17. Find P (XY = 0).

2.23
WHAT you need, WHEN you need it
Refer to Exercise 1.10. Find
At IDC Technologies we can tailor our technical and engineering
(a) the distribution of the quotient
training workshops to suitW
your Y ;
= needs.
X We have extensive OIL & GAS
experience in training technical and X
engineering staff and ENGINEERING
(b) (i) P ( X
Y = 1); (ii) P (W = 2); (iii) P ( = 3).
have trained people in organisationsYsuch as General
ELECTRONICS
Motors,1.11.
2.24 Refer to Exercise Shell,Find
Siemens, BHP and Honeywell to name a few.
Our onsite training is cost effective, convenient and completely AUTOMATION &
(a) the distribution oftothe
customisable the quotient =XY ;
W engineering
technical and areas you want PROCESS CONTROL
covered. Our workshops are all comprehensive hands-on learning
(b) P (W = experiences
5). with ample time given to practical sessions and MECHANICAL
demonstrations. We communicate well to ensure that workshop content ENGINEERING
2.25 Refer to Exercise 1.12, match
and timing = 23 . Findskills, and abilities of the participants.
wherethek knowledge,
INDUSTRIAL
(a) the distribution
We run of P (X
onsite − Y );all year round and hold the workshops on
training DATA COMMS
your premises or a venue of your choice for your convenience.
(b) P (− 2 ≤ X − Y ≤ 1).
1
ELECTRICAL
For a no obligation proposal, contact us today POWER
2.26 Refer to Exercise 1.12, where k = 23 . Find
at training@idc-online.com or visit our website
for more information: www.idc-online.com/onsite/
(a) the distribution of XY ;
(b) P (XY ≤ 45 ). Phone: +61 8 9321 1702
Email: training@idc-online.com
2.27 Refer to Exercise 1.12, where k = 23 . Find
Website: www.idc-online.com
Y );
(a) the distribution of P ( X
(b) P ( X
Y < 10).

2.28 Refer to Exercise 1.13. Find


78
(a) the distribution of P (X − Y );
ADVANCED
Sums, TOPICS IN
Differences, INTRODUCTORY
Products and Quotients of Bivariate Distributions 81
PROBABILITY: A FIRST COURSE IN SUMS, DIFFERENCES, PRODUCTS AND
PROBABILITY THEORY – VOLUME III QUOTIENTS OF BIVARIATE DISTRIBUTIONS
(b) (i) P (V = 2), (ii) P (V = 3) (iii) P (V = 4), (iv) P (V = 6).

2.21 Refer to Exercise 1.11. Find

(a) the distribution of the product U = XY ;


(b) (i) P (U > 34 ), (ii) P ( 21 < U < 1), using the distribution of U,

2.22 Refer to Exercise 2.17. Find P (XY = 0).

2.23 Refer to Exercise 1.10. Find

(a) the distribution of the quotient W = Y ;


X

(b) (i) P ( X
Y = 1); (ii) P (W = 2); (iii) P ( X
Y = 3).

2.24 Refer to Exercise 1.11. Find

(a) the distribution of the quotient W = Y ;


X

(b) P (W = 5).

2.25 Refer to Exercise 1.12, where k = 23 . Find

(a) the distribution of P (X − Y );


(b) P (− 12 ≤ X − Y ≤ 1).

2.26 Refer to Exercise 1.12, where k = 23 . Find

(a) the distribution of XY ;


(b) P (XY ≤ 45 ).

2.27 Refer to Exercise 1.12, where k = 23 . Find

Y );
(a) the distribution of P ( X
(b) P ( X
Y < 10).

2.28 Refer to Exercise 1.13. Find

(a) the distribution of P (X − Y );


(b) P (− 12 ≤ X − Y ≤ 12 ).

2.29 Refer to Exercise 1.14. Find


82 Advanced Topics in Introductory Probability
(a) the distribution of XY ;
(b) P (XY ≤ 1).

2.30 Refer to Exercise 1.14. Find


 
(a) the distribution of P X
Y ;
 
(b) P X
Y ≤2 .

79
ADVANCED TOPICS IN INTRODUCTORY
PROBABILITY: A FIRST COURSE IN EXPECTATION AND VARIANCE OF
PROBABILITY THEORY – VOLUME III BIVARIATE DISTRIBUTIONS

Chapter 3

EXPECTATION AND VARIANCE


OF BIVARIATE DISTRIBUTIONS

3.1 INTRODUCTION

In the earlier text5 , we discussed the numerical characterisation of a single


random variable. Specifically, we discussed in Chapter 6 the concepts of
expectation and variance. In this chapter, we shall extend the concept of
expectation and variance to the case of bivariate distribution and discuss a
few more properties. We shall also realise that in this case, there arises the
concept of covariance.

Chapter
3.2
3
EXPECTATION OF BIVARIATE RANDOM VARIABLES

EXPECTATION AND VARIANCE


To refresh our minds, we recapitulate that the expectation of a discrete
random variable X is
OF BIVARIATE DISTRIBUTIONS
n 
E(X) = xi p(xi )
i=1

Example 3.1
3.1 INTRODUCTION
For the data in Example 2.3, find E(X) and E(Y ).

Solution
In the earlier text5 , we discussed the numerical characterisation of a single
For
84 convenience
random variable.weSpecifically,
present theAdvanced
data
we here.
Topics
discussed in in Introductory
Chapter Probability
6 the concepts of
expectation
5
and variance.
Nsowah-Nuamah, 2017 In this chapter, we shall extend the concept of
expectation and variance to thex case of −2bivariate
4 distribution and discuss a
i
83
few more properties. We shall also realise that in this case, there arises the
p(xi ) 0.4 0.6
concept of covariance.

3.2 E(X)
EXPECTATION OF= BIVARIATE
−2(0.4) + 4(0.6) = 1.6 VARIABLES
RANDOM

To refresh our minds, we recapitulate


yj 3 that6 the expectation of a discrete
random variable X is p(yj ) n0.7 0.3

E(X) = xi p(xi )
i=1
E(Y ) = 3(0.7) + 6(0.3) = 3.9
Example 3.1
For the data
3.2.1 in Exampleof2.3,
Expectation find E(X)
Functions ofand E(Y ). Bivariate Random
Discrete
Variables
Solution
For
The convenience
expectationwe
of present the data
any function of here.
discrete bivariate random variables,
H(X,
5
Y ), can be computed
Nsowah-Nuamah, 2017 either the

(a) parent joint probability distribution;


83 or

(b) derived joint probability distribution.


80
Expectation from Parent Joint Probability Distribution
yj )
p(y 3
0.7 6
0.3
j
p(yj ) 0.7 0.3
ADVANCED TOPICS IN INTRODUCTORY
PROBABILITY: A FIRST COURSE IN EXPECTATION AND VARIANCE OF
E(Y ) =III3(0.7)
PROBABILITY THEORY – VOLUME + 6(0.3) = 3.9 BIVARIATE DISTRIBUTIONS
E(Y ) = 3(0.7) + 6(0.3) = 3.9
3.2.1 Expectation of Functions of Discrete Bivariate Random
3.2.1 Expectation
Variables of Functions of Discrete Bivariate Random
Variables
The expectation of any function of discrete bivariate random variables,
The expectation
H(X, of any function
Y ), can be computed of discrete bivariate random variables,
either the
H(X, Y ), can be computed either the
(a) parent joint probability distribution; or
(a) parent joint probability distribution; or
(b) derived joint probability distribution.
(b) derived joint probability distribution.
Expectation from Parent Joint Probability Distribution
Expectation from Parent Joint Probability Distribution
The next theorem shows that we can find E(X), E(Y ) and E(XY ) of a
The next distribution
bivariate theorem shows(X, Ythat we canfrom
) directly findthe
E(X), E(Y
parent ) and
joint E(XY ) of
probability a
dis-
bivariate
tribution distribution
without first(X, Y ) directly
obtaining the from thejoint
derived parent joint probability
probability dis-
distribution
tribution without first obtaining
or the marginal distribution. the derived joint probability distribution
or the marginal distribution.

Definition 3.1
Definition 3.1a two-dimensional discrete random variable and let
Let (X, Y ) be
Let (X, Y ) be a two-dimensional discrete random variable and let
Z = H(X, Y ). Then
Z = H(X, Y ). Then
∞ 
 ∞
= Bivariate
E(Z) of
Expectation and Variance ∞  ∞ H(x
i , yj )p(xi , yj )
Distributions 85
Expectation and Variance = Bivariate
E(Z) of i=1 j=1 H(x i , yj )p(xi , yj )
Distributions 85
Expectation and Variance of Bivariate
i=1 j=1 Distributions 85
For the finite case,
For the finite case,
For the finite case, n 
m

E(Z) = n 
n 
m H(xi , yj ) p(xi , yj )
m
E(Z) = 
i=1 j=1 H(xi , yj ) p(xi , yj )
E(Z) = H(xi , yj ) p(xi , yj )
i=1 j=1
i=1 j=1
That is,
That is,
That is,
E[H(X, Y )] = H(x1 , y1 ) p(x1 , y1 ) + H(x1 , y2 ) p(x1 , y2 ) + · · ·
E[H(X, Y )] = H(x
+H(x 1 , y,1y) p(x 1 , y1,)y+)H(x
) p(x + H(x 1 , y2 ), yp(x 1 , y2 ), y+ ···
E[H(X, Y )] = H(x 1 , 1y1 )m p(x 1 , y11 ) +m H(x 1 , y2 ) 1 ) p(x
p(x 1 , y2 ) +1 )· · ·
+H(x1 ,, yym))p(x
+H(x p(x1 , ym ) ))++ H(x
· + H(x2 , y1 ) ,p(x
ym22),,p(xy1 ) , y )
2
1 2 m ) p(x21,,yy2m +· ·H(x 2 , y1 2) p(x y1 )2 m
+H(x
+ , y
· · · +2 ,H(x ) p(x , y ) + · · · + H(x , y ) p(x
p(x22n,,,yyym )
+H(x n , y21,)yp(x
y2 ) p(x 2) + n ,·y·1·)+ +H(xH(x2n, ,yym2)) p(x 2)
2 2 2 2 2 m
m
+ · · · + H(xn ,, yy1 ))p(x p(xnnn,,,yyy11m) )+ H(xn , y2 ) p(xn , y2 )
+ · · · + H(xn m 1 ) p(x ) + H(xn , y2 ) p(xn , y2 )
+ · · · + H(xn , ym ) p(xn , ym )
+ · · · + H(xn , ym ) p(xn , ym )
Example 3.2
Example
For 3.2
the data
Example 3.2 in Example 2.3, find the expectation of the function
For the data in Example 2.3, find the expectation of the function
For the data in Example 2.3, find the expectation of the function
H(X, Y ) = X 2 Y
H(X, Y ) = X 2 Y
H(X, Y ) = X 2 Y
Solution
Solution
Solution 2
E(X Y ) = {(x1 )2 (y1 )} p(x1 , y1 ) + {(x1 )2 (y2 )}p(x1 , y2 )
E(X 22 Y ) = +{(x
{(x )2 (y 2 1 )} p(x1 , y1 ) + {(x1 )2 (y
2
2 2 )}p(x1 , y2 )
E(X Y ) = {(x11 )22)(y(y 1 )}p(x
1 )} p(x1 ,2y, 1y)1 +) +{(x {(x 1 )2 )(y(y 2 )}p(x
2 )}p(x 1 ,2y, 2y)2 )
+{(x 22 )2 (y1 )}p(x2 , y1 ) +
2
(−2) 2(3)(0.28)
= +{(x +2(−2) 2 {(x2 )2 (y2 )}p(x
2
2 , y2 )
) (y1 )}p(x , y1 ) +(6)(0.12)
{(x2 ) (y+ (4)2 (3)(0.42)
2 )}p(x 2 , y2 )
= (−2)
+(4)2 (3)(0.28)
2
2
(6)(0.18) + (−2)2 (6)(0.12) + (4)2 (3)(0.42)
2 2
= (−2) (3)(0.28) + (−2) (6)(0.12) + (4) (3)(0.42)
+(4)22 (6)(0.18)
= 43.68
+(4) (6)(0.18)
= 43.68
= 43.68

Theorem 3.1 81
Theorem
Let (X, Y )3.1
be two-dimensional discrete random variable and let
= (−2)2 (3)(0.28) + (−2)2 (6)(0.12) + (4)2 (3)(0.42)
+(4)2 (6)(0.18)
ADVANCED TOPICS IN INTRODUCTORY
PROBABILITY: A FIRST COURSE IN EXPECTATION AND VARIANCE OF
= 43.68
PROBABILITY THEORY – VOLUME III BIVARIATE DISTRIBUTIONS

Theorem 3.1
Let (X, Y ) be two-dimensional discrete random variable and let
H(X, Y ) = XY . Then
n 
 m
E(XY ) = xi yj p(xi , yj )
86 Advanced Topics in Introductory Probability
i=1 j=1
86
86 Advanced
Advanced Topics in
Topics in Introductory
Introductory Probability
Probability
Example 3.3
Example
Refer
Example 3.3
to Example
3.3 2.2. Find E(XY ) from the parent joint probability dis-
Refer
Refer to Example 2.2.
to
tribution.Example 2.2. Find E(XY )) from
Find E(XY from the
the parent
parent joint
joint probability
probability dis-
dis-
tribution.
tribution.
Solution
Solution
The parent distribution of XY is given in the table below.
Solution
The
The parent
parent distribution
distribution of XY is
of XY is given
given in
in the
the table
table below.
below.
Y
X −1 Y
0Y 1 Row Totals
X
X
−1 −1
0.10
−1 00
0.20 11
0.11 Row Totals
Row0.41
Totals
−1
−1 0.10
(1)
0.10 0.20
(0)
0.20 0.11
(-1)
0.11 0.41
0.41
0 (1)
0.08
(1) (0)
0.02
(0) (-1)
0.26
(-1) 0.36
00 0.08
(0)
0.08 0.02
(0)
0.02 0.26
(0)
0.26 0.36
0.36
2 (0)
0.03
(0) (0)
0.17
(0) (0)
0.03
(0) 0.23
22 0.03
(-2)
0.03 0.17
(0)
0.17 0.03
(2)
0.03 0.23
0.23
Column Totals (-2)
(-2)
0.21 (0)
(0)
0.39 (2)
(2)
0.40 1.00
Column
Column Totals
Totals 0.21
0.21 0.39
0.39 0.40
0.40 1.00
1.00
The values in parentheses are for xy. Multiplying these values by their
The
The values
values in
corresponding
in parentheses
probabilities,
parentheses are for
we obtain
are xy.the
for xy. Multiplying these
these values
following table:
Multiplying values by
by their
their �e Graduate Programme
I joined
corresponding
MITAS because
probabilities, we obtain the following table: for Engineers and Geoscientists
corresponding probabilities, we obtain the following table:
I wanted real responsibili� www.discovermitas.com
Maersk.com/Mitas �e G
I joined MITAS
Y because for Engine
X I wanted
−1 0real 1responsibili�
Y
Y Ma
X
X
−1 −1
0.10
−1 00
0.00 11
-0.11
0
−1
−1 0.10
0.00
0.10 0.00 -0.11
0.00
-0.11
200 0.00
-0.06
0.00 0.00 0.00
0.06
0.00
22 -0.06
-0.06 0.00
0.00 0.06
0.06
Hence,
Hence,
Hence,
E(XY ) =  n
n 
n 
 m
m xi yj p(xi , yj )
m
Month 16
E(XY )) =
E(XY
 
p(xii,, yyjj))
= i=1 j=1 xxiiyyjj p(x I was a construction Mo
= 0.10
i=1 + 0.00 + (−0.11) + 0.00 + · · · + 0.00 + 0.06
i=1 j=1
=
= 0.10
j=1
0.10 +
−0.01+ 0.00
0.00 + + (−0.11)
(−0.11) + + 0.00
0.00 +
+ ·· ·· ·· +
+ 0.00
0.00 +
+ 0.06
0.06
supervisor ina const
I was
=
= −0.01
−0.01 the North Sea super
Corollary 3.1
Corollary
advising and the No
Let (X, Y )3.1
Corollary be two-dimensional discrete random variable and let H(X, Y ) =
3.1
Let
Let (X, Y )) be
(X, Y be two-dimensional
two-dimensional discrete
discrete random
random variable
Real work
variable and
and let
let H(X,
H(X, he
helping
Y )) =
Y = foremen advis
International
al opportunities
Internationa
�ree wo
work
or placements ssolve problems
Real work he
helping fo
International
Internationaal opportunities
�ree wo
work
or placements ssolve pr

82
(0) (0) (0)
2 0.03
ADVANCED TOPICS IN INTRODUCTORY 0.17 0.03 0.23
PROBABILITY: A FIRST COURSE IN (-2) (0) (2) EXPECTATION AND VARIANCE OF
PROBABILITY THEORY – VOLUME III BIVARIATE DISTRIBUTIONS
Column Totals 0.21 0.39 0.40 1.00

The values in parentheses are for xy. Multiplying these values by their
corresponding probabilities, we obtain the following table:

Y
X −1 0 1
−1 0.10 0.00 -0.11
0 0.00 0.00 0.00
2 -0.06 0.00 0.06

Hence,
n 
 m
E(XY ) = xi yj p(xi , yj )
i=1 j=1
= 0.10 + 0.00 + (−0.11) + 0.00 + · · · + 0.00 + 0.06
= −0.01

Expectation and Variance of Bivariate Distributions 87


Corollary 3.1
Let (X, Y ) be two-dimensional discrete random variable and let H(X, Y ) =
X. Then n m 
E(X) = xi p(xi , yj )
i=1 j=1

and n 
m

E(Y ) = yj p(xi , yj )
i=1 j=1

Example 3.4
For the data in Example 2.2, find the expectation of the functions

(a) H(X, Y ) = X

(b) H(X, Y ) = Y .

Solution
(a) From the table in Example 3.3, we calculate the values of the cells with
the xi values. Thus, for x1 = −1

x1 p(x1 , y1 ) = −1(0.10) = −0.1


x1 p(x1 , y2 ) = −1(0.2) = −0.2
x1 p(x1 , y3 ) = −1(0.11) = −0.11

so that the sum for that row is:


3

x1 p(x1 , yj ) = −0.1 + (−0.2) + (−0.11) = −0.41
j=1

Similarly, for x2 = 0
3

x2 p(x2 , yj ) = 0 + 0 + 0 = 0
j=1

and for x3 = 2
3

x3 p(x3 , yj ) = 0.06 + 0.34 +830.06 = 0.46
j=1
j=1

Similarly, for x2 = 0
ADVANCED TOPICS IN INTRODUCTORY
PROBABILITY: A FIRST COURSE
3 IN EXPECTATION AND VARIANCE OF

PROBABILITY THEORY – VOLUME III , y )
x2 p(x =0+0+0=0 BIVARIATE DISTRIBUTIONS
2 j
j=1

and for x3 = 2
3

x3 p(x3 , yj ) = 0.06 + 0.34 + 0.06 = 0.46
j=1
88 Advanced Topics in Introductory Probability
The results are presented in the table below:

Y
X −1 0 1 Row Totals
−1 -0.1 -0.2 -0.11 -0.41
0 0 0 0 0
2 0.06 0.34 0.06 0.46

Hence
3 
 3
E(X) = xi p(xi , yj )
i=1 j=1
= −0.41 + 0 + 0.46
= 0.05

(b) We compute E(Y ) as follows:


For y1 = −1
3

y1 p(xi , y1 ) = −0.1 + (−0.08) + (−0.03) = −0.21
i=1
Similarly, for y2 = 0
3

y2 p(xi , y2 ) = 0 + 0 + 0 = 0
i=1

and for y3 = 1
3

y3 p(xi , y3 ) = 0.11 + 0.26 + 0.03 = 0.40
i=1

The results are presented in the table below:

Y
X −1 0 1
−1 -0.1 0 0.11
0 -0.08 0 0.26
Expectation and Variance of2Bivariate
-0.03Distributions
0 0.03 89
Total -0.21 0 0.40

Hence
3 
 3
E(Y ) = yj p(xi , yj )
j=1 i=1
= −0.21 + 0 + 0.40
= 0.19

Expectation from Derived Joint Probability Distribution


84
The following theorem shows that we can find E(XY ) of a bivariate dis-
E(Y ) = j=1 i=1 yj p(xi , yj )
= −0.21
j=1 i=1
+ 0 + 0.40
= −0.21 + 0 + 0.40
ADVANCED TOPICS IN INTRODUCTORY
PROBABILITY: A FIRST COURSE IN= 0.19 EXPECTATION AND VARIANCE OF
PROBABILITY THEORY – VOLUME = III 0.19 BIVARIATE DISTRIBUTIONS

Expectation from Derived Joint Probability Distribution


Expectation from Derived Joint Probability Distribution
The following theorem shows that we can find E(XY ) of a bivariate dis-
The following
tribution (X, Ytheorem
) from theshows thatjoint
derived we can find E(XY
probability ) of a bivariate
distribution without dis-
the
tribution (X, Y ) from the derived joint probability distribution without
knowledge of the parent joint probability distribution or the marginal dis- the
knowledge
tribution inofthe
thecase
parent joint probability
of independence of Xdistribution
and Y . or the marginal dis-
tribution in the case of independence of X and Y .

Theorem 3.2
Theorem
Let (X, Y )3.2
be two-dimensional discrete random variable and let
Let
H(X,(X,
Y )Y=) XY
be .two-dimensional
Then discrete random variable and let
H(X, Y ) = XY . Then

E(XY ) =  xy p(xy)
E(XY ) = xy xy p(xy)
xy

Example 3.5
Example
Refer 3.5
to Example 2.2. Find E(XY ) from the derived joint probability dis-
Refer to Example
tribution. 2.2. Find E(XY ) from the derived joint probability dis-
tribution.
Solution
Solution
From Example 2.11, we have the following:
From Example 2.11, we have the following:
xy −2 −1 0 1 2
xy
p(xy) −2
0.03 −1
0.11 0
0.73 1
0.1 2
0.03
p(xy) 0.03 0.11 0.73 0.1 0.03
xy p(xy) -0.06 -0.11 0 0.1 0.06
xy p(xy) -0.06 -0.11 0 0.1 0.06
Hence
Hence 
E(XY ) =  xy p(xy)
E(XY ) = xy xy p(xy)
xy

www.job.oticon.dk

85
From Example 2.11, we have the following:

ADVANCED TOPICS INxy INTRODUCTORY


−2 −1 0 1 2
PROBABILITY: A FIRST COURSE0.03
p(xy) IN 0.11 0.73 0.1 0.03 EXPECTATION AND VARIANCE OF
PROBABILITY THEORY – VOLUME III BIVARIATE DISTRIBUTIONS
xy p(xy) -0.06 -0.11 0 0.1 0.06

Hence
90 Advanced Topics in Introductory Probability
E(XY ) = xy p(xy)
xy
= −0.06 + (−0.11) + 0 + 0.1 + 0.06
= −0.01

the same result as in Example 3.3.

Expression of Expectation through Marginal Distribution

The following Theorem shows that the expectation of X and Y in a bivariate


distribution can be obtained through the marginal distribution.

Theorem 3.3
Let (X, Y ) be a two-dimensional random variable with probability
mass function p(x, y) and let H(X, Y ) = X.
Then the function
n 
 m
E(X) = xi p(xi , yj )
i=1 j=1

simplifies to
n

E(X) = xi p(xi )
i=1

Proof
n 
 m
E(X) = xi p(xi , yj )
i=1 i=1
n  m
= xi p(xi , yj )
i=1 i=1
(since the x values do not depend on yand can therefore be
taken from the summation of y)
n

= xi p(xi )
i=1
(from the definition of the marginal probability mass function
Expectation and Variance of Bivariate Distributions 91
of X in Definition 1.11)

Similarly, if H(X, Y ) = Y . Then


n 
 m
E(Y ) = yi p(xi , yj )
i=1 j=1

which also reduces to m



E(Y ) = yj p(yj )
j=1

Example 3.6
Refer to Example 2.2. Using the marginal distributions, find
(a) E(X) (b) E(Y ).
86
Solution
E(Y ) = yi p(xi , yj )
i=1 j=1
ADVANCED TOPICS IN INTRODUCTORY
which also reduces
PROBABILITY: toCOURSE IN
A FIRST m EXPECTATION AND VARIANCE OF

PROBABILITY THEORY – VOLUME III BIVARIATE DISTRIBUTIONS
E(Y ) = yj p(yj )
j=1

Example 3.6
Refer to Example 2.2. Using the marginal distributions, find
(a) E(X) (b) E(Y ).

Solution
(a) From Example 2.11, the marginal distributions of X is:

x −1 0 2
p(x) 0.41 0.36 0.23
x p(x) -0.41 0 0.46

Hence
n

E(X) = xi p(xi )
i=1
= −41 + 0 + 0.46
= 0.05

(b) From Example 3.3, the marginal distributions of Y is:

y −1 0 1
p(y) 0.21 0.39 0.4
y p(y) -0.21 0 0.4

Hence
m
92 
Advanced Topics in Introductory Probability
E(Y ) = E(Y ) = yj p(yj )
j=1
= −21 + 0 + 0.4
= 0.19

3.2.2 Expectation of Functions of Continuous Bivariate


Random Variables

Definition 3.2
Let (X, Y ) be a two-dimensional random variable with probability
density function f (x, y) and let Z = H(X, Y ).
Then  ∞  ∞
E(Z) = H(x, y) f (x, y) dx dy
−∞ −∞

Theorem 3.4
Let (X, Y ) be a two-dimensional random variable with probability
density function f (x, y) and let H(X, Y ) = XY . Then expectation
of the product of X and Y is
 ∞  ∞
E(XY ) = xy f (x, y) dx dy
−∞ −∞

87

Example 3.7
Let (X, Y ) be a two-dimensional random variable with probability
density function f (x, y) and let Z = H(X, Y ).
ADVANCED
Then TOPICS IN INTRODUCTORY
  ∞
PROBABILITY: A FIRST COURSE ∞ IN EXPECTATION AND VARIANCE OF
E(Z) =
PROBABILITY THEORY – VOLUME III H(x, y) f (x, y) dx dy BIVARIATE DISTRIBUTIONS
−∞ −∞

Theorem 3.4
Let (X, Y ) be a two-dimensional random variable with probability
density function f (x, y) and let H(X, Y ) = XY . Then expectation
of the product of X and Y is
 ∞  ∞
E(XY ) = xy f (x, y) dx dy
−∞ −∞

Example 3.7
Refer to Example 1.3. Find E(XY ).

Solution
 2  1  
xy
E(XY ) = xy x + 2
dx dy
0 3
 2 0   
1 x2y2
= x3 y + dx dy
0 0 3
 2 
y y2
= + dy
0 4 9
4 8 43
= + =
8 27 54

WHY WAIT FOR


PROGRESS?
DARE TO DISCOVER
Discovery means many different things at
Schlumberger. But it’s the spirit that unites every
single one of us. It doesn’t matter whether they
join our business, engineering or technology teams,
our trainees push boundaries, break new ground
and deliver the exceptional. If that excites you,
then we want to hear from you.

careers.slb.com/recentgraduates

88
ADVANCED TOPICS
Expectation IN INTRODUCTORY
and Variance of Bivariate Distributions 93
PROBABILITY: A FIRST COURSE IN EXPECTATION AND VARIANCE OF
PROBABILITY THEORY – VOLUME III BIVARIATE DISTRIBUTIONS

Definition 3.3
Let (X, Y ) be a two-dimensional continuous random variable with
probability density function f (x, y). Then the expectation of the
marginal distributions of X and Y are given by
 ∞
E(X) = x g(x) dx
−∞
 ∞
E(Y ) = y h(y) dy
−∞

where g(x) and h(y) are the marginal probability density function of
X and Y respectively

Example 3.8
Refer to Example 1.3. Find (a) E(X), (b) E(Y )

Solution
To find the expectation of a random variable from a bivariate distribution
of X and Y we require the marginal probability distributions of X and Y .
The marginal probability distribution of X for Example 1.3 was found
in Example 1.10 to be

 2
2 x2 + x, 0 < x < 1
g(x) = 3
 0, elsewhere

and that of Y to be

 1 y
+ , 0<x<2
h(y) = 3 6
 0, elsewhere

Hence, by Definition 3.1


 ∞
(a) E(X) = x g(x) dx
−∞
 1  
94 2
Advanced Topics2 in Introductory Probability
= x 2x + x dx
0 3
 1  
2
= 2x + x2
3
dx
0 3
13
=
18
 2  
1 y
(b) E(Y ) = y +
dy
0 3 6
 2  
1 y2
= y+ dy
0 3 6
10
=
9

3.2.3 Properties of Expectation of Bivariate Random Variables

Similar to the univariate case, we shall assume in our discussions of the prop-
erties of the expectation of bivariate random variables that both variables
have finite expectations.
89

Property 1 Expectation of Identical Distributions


 2
1 y2
= y+ dy
0 3 6
ADVANCED TOPICS IN INTRODUCTORY
PROBABILITY: A FIRST COURSE IN 10 EXPECTATION AND VARIANCE OF
=
PROBABILITY THEORY – VOLUME III 9 BIVARIATE DISTRIBUTIONS

3.2.3 Properties of Expectation of Bivariate Random Variables

Similar to the univariate case, we shall assume in our discussions of the prop-
erties of the expectation of bivariate random variables that both variables
have finite expectations.

Property 1 Expectation of Identical Distributions

Theorem 3.5
If X1 , X2 , ..., Xn have the same distribution, then they possess a com-
mon expectation E(Xi ) = µ

Note
The expression “have the same distribution” is equivalent to the expression
“are identically distributed”. Identical distribution means that the probabil-
ity density function or the probability mass function of the random variable
remains the same from trial to trial.
For example, X and Y are identically distributed if their probability
distribution functions are

g(x) = e−x , x≥0


Expectation and Variance of Bivariate
−y Distributions 95
h(y) = e , y≥0

Property 2 Addition Law of Expectation

Theorem 3.6
Let X and Y be any two random variables with expectation E(X)
and E(Y ), respectively. The expectation of their sum is the sum
of their expectations. That is,

E(X + Y ) = E(X) + E(Y )

Proof
We shall first prove for discrete case.
n 
 m
E(X + Y ) = (xi + yj ) p(xi , yj )
i=1 j=1
n  m n 
 m
= xi p(xi , yj ) + yj p(xi , yj )
i=1 j=1 i=1 j=1
n m m
 n

= xi p(xi , yj ) + yj p(xi , yj )
i=1 j=1 j=1 i=1
n
 m

= xi p(xi ) + yj p(xi )
i=1 j=1
(following from the logic of the proof of Theorem 3.3)
= E(X) + E(Y )
90
= n  m xi p(xi , yj ) +  n  m yj p(xi , yj )
= i=1 j=1 xi p(xi , yj ) + i=1 j=1 yj p(xi , yj )
ADVANCED TOPICS INn INTRODUCTORY
m m j=1 n

i=1 j=1 
i=1
PROBABILITY: A FIRST COURSE IN EXPECTATION AND VARIANCE OF
=  n xi  m p(xi , yj ) + m yj  n p(xi , yj )
PROBABILITY THEORY – VOLUME III BIVARIATE DISTRIBUTIONS
= i=1 xi j=1 p(xi , yj ) + j=1 yj i=1 p(xi , yj )
i=1
n j=1  m j=1 i=1
= n xi p(xi ) +  m yj p(xi )
= i=1 xi p(xi ) + j=1 yj p(xi )
i=1(following from
j=1 the logic of the proof of Theorem 3.3)
(following
= E(X) + E(Y from
) the logic of the proof of Theorem 3.3)
= E(X) + E(Y )

We shall now prove for continuous case.


We shall now prove for
 ∞continuous
 ∞ case.
E(X + Y ) =  ∞  ∞ (x + y)f (x, y) dx dy
E(X + Y ) = −∞ ∞
−∞
 ∞(x + y)f (x, y) dx dy ∞  ∞
= −∞ ∞ x−∞
 ∞ f (x, y) dy dx +  ∞ y  ∞ f (x, y) dx dy
= −∞ ∞ x −∞
f (x, y)dy
∞ dx + −∞
y −∞ f (x, y) dx dy
= −∞ −∞
∞ x fX (x) dx +  ∞ y fY (y) dy
−∞ −∞
−∞ −∞
96 = x fX (x) dx + Topics
Advanced y fY (y) dy
in Introductory Probability
−∞ −∞
By the definition of the marginal probability densities, the result follows.
By the definition of the marginal probability densities, the result follows.
Example 3.9
For the table in Example 2.2, verify that

E(X + Y ) = E(X) + E(Y )

(a) using the parent joint probability distribution table;

(b) using the derived joint probability distribution table.

Solution

(a) The distribution of X+Y from the parent joint probability distribution
table is reproduced from the solution for Example 2.2, taking note that
the values for x + y are those in parenthesis:
PREPARE FOR A
LEADING ROLE.
Y English-taught MSc programmes
X −1 0 1 in engineering: Aeronautical,
Row Totals
−1 0.10 0.20 0.11 0.41 Biomedical, Electronics,
(-2) (-1) (0) Mechanical, Communication
0 0.08 0.02 0.26 0.36 systems and Transport systems.
(-1) (0) (1) No tuition fees.
2 0.03 0.17 0.03 0.23
(1) (2) (3)
Column Totals 0.21 0.39 0.40 1.00

E liu.se/master
To obtain E(X + Y ), we multiply the values for x + y by their corre-
sponding probabilities and obtain the following table:

X −1
Y
0 1

−1 -0.20 -0.20 0.00
0 -0.08 0.00 0.26
2 0.03 0.34 0.09
91
(a) usingTOPICS
ADVANCED the parent joint probability distribution table;
IN INTRODUCTORY
PROBABILITY: A FIRST COURSE IN EXPECTATION AND VARIANCE OF
(b) using the
PROBABILITY derived
THEORY joint probability
– VOLUME III distribution table. BIVARIATE DISTRIBUTIONS

Solution

(a) The distribution of X+Y from the parent joint probability distribution
table is reproduced from the solution for Example 2.2, taking note that
the values for x + y are those in parenthesis:

Y
X −1 0 1 Row Totals
−1 0.10 0.20 0.11 0.41
(-2) (-1) (0)
0 0.08 0.02 0.26 0.36
(-1) (0) (1)
2 0.03 0.17 0.03 0.23
(1) (2) (3)
Column Totals 0.21 0.39 0.40 1.00

To obtain E(X + Y ), we multiply the values for x + y by their corre-


sponding probabilities and obtain the following table:

Y
X −1 0 1
−1 -0.20 -0.20 0.00
Expectation and Variance of 0Bivariate
-0.08 Distributions
0.00 0.26 97
2 0.03 0.34 0.09

Hence,
n 
 m
E(X + Y ) = (xi + yj ) p(xi , yj )
i=1 j=1
= −0.20 + (−0.20) + 0.00 + (−0.08) · · · + 0.34 + 0.09
= 0.24

which is the same as

E(X) + E(Y ) = 0.05 + 0.19 = 0.24

from Example 3.4.

(b) The distribution of X + Y from the derived joint probability distribu-


tion table is taken from Example 2.2. Hence, to obtain E(X + Y ), we
have the following table:

(x + y) −2 −1 0 1 2 3
p(x + y) 0.1 0.28 0.13 0.29 0.17 0.03
(x + y) p(x + y) −0.2 −0.28 0 0.29 0.34 0.09

Hence,

E(X + Y ) = (xi + yj ) p(xi + yj )
= −0.2 + (−0.28) + 92
0 + 0.29 + 0.34 + 0.09
= 0.24
have the following table:

(x + y)IN INTRODUCTORY
ADVANCED TOPICS −2 −1 0 1 2 3
PROBABILITY: A FIRST COURSE IN EXPECTATION AND VARIANCE OF
p(x + y) 0.1 0.28 0.13 0.29 0.17 0.03
PROBABILITY THEORY – VOLUME III BIVARIATE DISTRIBUTIONS
(x + y) p(x + y) −0.2 −0.28 0 0.29 0.34 0.09

Hence,

E(X + Y ) = (xi + yj ) p(xi + yj )
= −0.2 + (−0.28) + 0 + 0.29 + 0.34 + 0.09
= 0.24

Corollary 3.2
Let X1 , X2 , ..., Xn be n random variables, where n is finite. Then the ex-
pectation of the sum Sn = X1 + X2 + ... + Xn is the sum of the expectations
of the individual random variables. Thus,

E(Sn ) = E(X1 + X2 + ... + Xn )


= E(X1 ) + E(X2 ) + ... + E(Xn )
n

= E(Xi )
98 Advanced
i=1 Topics in Introductory Probability

Corollary 3.3
If X1 , X2 , ..., Xn have the same distribution with E(Xi ) = µ for all 1 ≤ i ≤ n,
then the expectation of their sum Sn = X1 + X2 + ... + Xn is

E(Sn ) = E(X1 + X2 + ... + Xn ) = nµ

Proof
Since X1 , X2 , ..., Xn have the same distribution, from Theorem 3.3, they
have a common expectation:

E(X1 ) = E(X2 ) = ... = E(Xn ) = E(X) = µ

From Corollary 3.2


n
 n

E(X1 + ... + Xn ) = E(Xi ) = µ = nµ
i=1 i=1

Example 3.10
Refer to Example 2.13. Find E(X + Y ).

Solution

E(X + Y ) = E(X) + E(Y )


= 1.6 + 1.6
= 2(1.6) = 3.2
= nµ

Property 3 Difference Law of Expectation

Theorem 3.7
Let X and Y be any two random variables with expectations E(X)
and E(Y ) respectively. The expectation of their difference is the
difference of their expectations. That is,

E(X − Y ) = E(X) − E(Y )


93
E(X + Y ) = E(X) + E(Y )
ADVANCED TOPICS IN INTRODUCTORY
= 1.6 + 1.6
PROBABILITY: A FIRST COURSE IN = 2(1.6) = 3.2 EXPECTATION AND VARIANCE OF
PROBABILITY THEORY – VOLUME III BIVARIATE DISTRIBUTIONS
= nµ

Property 3 Difference Law of Expectation

Theorem 3.7
Let X and Y be any two random variables with expectations E(X)
and E(Y ) respectively. The expectation of their difference is the
difference of their expectations. That is,

E(X − Y ) = E(X) − E(Y )


Expectation and Variance of Bivariate Distributions 99

Example 3.11
For the table in Example 2.2, verify that

E(X − Y ) = E(X) − E(Y )

Solution
The distribution of X − Y is given in Example 2.8. Hence, to obtain
E(X + Y ), we have the following table:

x−y −2 −1 0 1 2 3
p(x − y) 0.11 0.46 0.12 0.11 0.17 0.03
(x − y) p(x − y) −0.22 −0.46 0 0.11 0.34 0.09

Hence,

E(X − Y ) = −0.22 + (−0.46) + 0 + 0.11 + 0.34 + 0.09


= −0.14

which is the same as Click here


to learn more

TAKE THE
E(X) − E(Y ) = 0.05 − 0.19 = −0.14

RIGHT TRACK
from Example 3.4.

Property 4

Theorem 3.8 Give your career a head start


by studying withrandom
If a and b are constants, then for any us. Experience
variables the advantages
X and Y
of our collaboration with major companies like
ABB, Volvo and Ericsson!
E(aX + bY ) = aE(X) + bE(Y )

Corollary 3.4 Apply by World class


Let X1 , X2 , ...,15XJanuary
n be n random variables. Then www.mdh.se
research
n

E(c1 X1 + c2 X2 + ... + cn Xn ) = ci E(Xi )
i=1

94
E(X + Y ), we have the following table:

ADVANCED
x −TOPICS
y IN INTRODUCTORY
−2 −1 0 1 2 3
PROBABILITY: A FIRST COURSE IN EXPECTATION AND VARIANCE OF
p(x − y) 0.11
PROBABILITY THEORY – VOLUME III
0.46 0.12 0.11 0.17 0.03 BIVARIATE DISTRIBUTIONS
(x − y) p(x − y) −0.22 −0.46 0 0.11 0.34 0.09

Hence,

E(X − Y ) = −0.22 + (−0.46) + 0 + 0.11 + 0.34 + 0.09


= −0.14

which is the same as

E(X) − E(Y ) = 0.05 − 0.19 = −0.14

from Example 3.4.

Property 4

Theorem 3.8
If a and b are constants, then for any random variables X and Y

E(aX + bY ) = aE(X) + bE(Y )

Corollary 3.4
Let X1 , X2 , ..., Xn be n random variables. Then
n

100 E(c1 X1 + c2 X2Advanced
+ ... + cn Topics
Xn ) = in Introductory
ci E(Xi ) Probability
i=1

where ci ’s are constants.

If ci = c for all 1 ≤ i ≤ n, then Corollary 3.4 becomes


n

E(c1 X1 + c2 X2 + ... + cn Xn ) = c E(Xi )
i=1

The proofs of Theorem 3.8 and its corollary are similar to the proofs of
Theorem 3.6 and its corollaries.

Note
Theorems 3.6, 3.7 and 3.8 and their corollaries hold whether or not the
random variables involved are independent.

Example 3.12
For the data of Example 3.1, if a = 3 and b = 4, verify whether Property 4
is valid.

Solution
From Example 3.1,

3E(X) + 4E(Y ) = 3(1.6) + 4(3.9)


= 20.4

Now,

3x + 4y 3(−2) + 4(3) 3(−2) + 4(6) 3(4) + 4(3) 3(4) + 4(6)


p(x, y) 0.4(0.7) 0.4(0.3) 0.6(0.7) 0.6(0.3)

or
3x + 4y 6 18 24 36 95
p(x + y) 0.28 0.12 0.42 0.18
Solution
From Example
ADVANCED 3.1,
TOPICS IN INTRODUCTORY
PROBABILITY: A FIRST COURSE IN EXPECTATION AND VARIANCE OF
PROBABILITY THEORY3E(X) + 4E(Y
– VOLUME III ) = 3(1.6) + 4(3.9) BIVARIATE DISTRIBUTIONS
= 20.4

Now,

3x + 4y 3(−2) + 4(3) 3(−2) + 4(6) 3(4) + 4(3) 3(4) + 4(6)


p(x, y) 0.4(0.7) 0.4(0.3) 0.6(0.7) 0.6(0.3)

or
3x + 4y 6 18 24 36
p(x + y) 0.28 0.12 0.42 0.18

so that

E(3X + 4Y ) = 6(0.28) + 18(0.12) + 24(0.42) + 36(0.18)


Expectation and Variance of Bivariate Distributions 101
Expectation and Variance= of1.68 + 2.16 +
Bivariate 10.08 + 6.48
Distributions 101
Expectation and Variance= of20.4
Bivariate Distributions 101
Property 5 andExpectation
Expectation of Product
Variance of Bivariate of X and Y
Distributions 101
Thus, Property
Property 4 is valid.
5 Expectation of Product of X and Y
Property 5 Expectation of Product of X and Y
Property 5 Expectation of Product of X and Y
Theorem 3.9
Let X and Y be two random variables with expectations E(X) and
E(Y ), respectively. Then the expectation of their product is

E(XY ) = E(X)E(Y ) + E{[X − E(X)][Y − E(Y )]}

The last term appears sufficiently often to deserve a separate name and
treatment. It is called the covariance of X and Y and denoted as Cov(X, Y ).
The concept of covariance is discussed in detail in the next chapter.
Example 3.13
For the table in Example 2.2, obtain E{[X − E(X)][Y − E(Y )]}.

Solution
It has been proved in Theorem 4.1 in the sequel that

E{[X − E(X)][Y − E(Y )]} = E(XY ) − E(X)E(Y )

From Example 3.3,
E(XY ) = −0.01
From Example 3.4,
E(X) = 0.05 and E(Y ) = 0.19
Hence,

E{[X − E(X)][Y − E(Y )]} = E(XY ) − E(X)E(Y )
                         = −0.01 − 0.05(0.19)
                         = −0.0195

Property 6   Expectation of Product of Independent Random Variables

Theorem 3.10
Let X and Y be two independent random variables with expectations
E(X) and E(Y ) respectively. Then the expectation of their product
is equal to the product of their expectations:

E(XY ) = E(X)E(Y )

Proof (for the discrete bivariate case)


The random variable XY can take all values xi yj , where i = 1, 2, ..., n
and j = 1, 2, ..., m with probabilities g(xi )h(yj ), as long as X and Y are
independent random variables. Hence
E(XY ) = Σ_{i=1}^{n} Σ_{j=1}^{m} xi yj g(xi ) h(yj )
       = [ Σ_{i=1}^{n} xi g(xi ) ] [ Σ_{j=1}^{m} yj h(yj ) ]
       = E(X)E(Y )

The proof of the bivariate continuous case is similar.

Note
The converse of Theorem 3.10 does not hold, that is, the random variables
X and Y may satisfy the relation E(XY ) = E(X)E(Y ) without being
independent.

Example 3.14
For the data in Example 2.3, verify whether

E(XY ) = E(X)E(Y )

Solution
The distribution of XY is given in Example 2.12. Hence, to obtain
E(XY ), we have the following table:

xy −6 −12 12 24
p(xy) 0.28 0.12 0.42 0.18
xy p(xy)   −1.68   −1.44   5.04   4.32

Hence,

E(XY ) = −1.68 + (−1.44) + 5.04 + 4.32


= 6.24

From Example 3.1,

E(X) = 1.6 and E(Y ) = 3.9

It has been shown in Example 2.3 that X and Y are independent. Hence,

E(X)E(Y ) = 6.24

which is the same as E(XY ) above.
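The same check can be done in a few lines of code. The sketch below (a minimal Python illustration, again assuming the joint table of Example 2.3) computes E(XY ) directly from the joint distribution and compares it with E(X)E(Y ), as Theorem 3.10 predicts for independent variables.

    px = {-2: 0.4, 4: 0.6}
    py = {3: 0.7, 6: 0.3}
    joint = {(x, y): px[x] * py[y] for x in px for y in py}

    E_XY = sum(x * y * p for (x, y), p in joint.items())
    E_X = sum(x * p for x, p in px.items())
    E_Y = sum(y * p for y, p in py.items())

    print(E_XY, E_X * E_Y)   # both 6.24 (up to rounding), as in Example 3.14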

Corollary 3.5
Let X1 , X2 , · · · , Xn be independent random variables. Then

E(X1 X2 · · · Xn ) = E(X1 )E(X2 )...E(Xn )

This result may be written as

E( ∏_{i=1}^{n} Xi ) = ∏_{i=1}^{n} E(Xi )

where ∏ is the “product of”.

Proof
We shall demonstrate the result for the case of three random variables X, Y and Z.
The product of three random variables X Y Z may be written as (XY )Z.
Using Theorem 3.10, we obtain:

E(XY Z) = E[(XY )Z] = E(XY )E(Z)

Applying again the same theorem to the random variable XY which is a


product of random variables X and Y we obtain

E(XY Z) = E(X)E(Y )E(Z)


This corollary can be proved by mathematical induction; in this way it is easy
to extend Theorem 3.10 to any finite number of independent random variables.

Property 7 Expectation of Quotient of X and Y

Theorem 3.11
Let X and Y be two random variables with expectations E(X) and
E(Y ) respectively. Then the expectation of their quotient is

E(X/Y ) ≈ E(X)/E(Y ) − Cov(X, Y )/[E(Y )]² + E(X) Var(Y )/[E(Y )]³

E(X) ≠ 0, E(Y ) ≠ 0, where Cov(X, Y ) is called the covariance of
X and Y (see next chapter)

Note
The expectation of the quotient X/Y may not exist even though the moments
of X and Y exist.
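The quality of the approximation in Theorem 3.11 can be examined numerically. The sketch below (a rough Python illustration, again assuming the joint table of Example 2.3, for which Cov(X, Y ) = 0) compares the exact value of E(X/Y ), obtained by enumerating the joint distribution, with the three-term approximation.

    px = {-2: 0.4, 4: 0.6}
    py = {3: 0.7, 6: 0.3}
    joint = {(x, y): px[x] * py[y] for x in px for y in py}

    E_X = sum(x * p for x, p in px.items())                    # 1.6
    E_Y = sum(y * p for y, p in py.items())                    # 3.9
    var_Y = sum((y - E_Y) ** 2 * p for y, p in py.items())     # 1.89
    cov_XY = sum((x - E_X) * (y - E_Y) * p
                 for (x, y), p in joint.items())                # 0 by independence

    exact = sum((x / y) * p for (x, y), p in joint.items())
    approx = E_X / E_Y - cov_XY / E_Y ** 2 + E_X * var_Y / E_Y ** 3

    print(exact, approx)   # about 0.453 versus 0.461: close, but only approximate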

Property 8 Monotonicity of Expectation

Theorem 3.12
Suppose X and Y are two random variables with expectations E(X)
and E(Y ) respectively. If X ≤ Y, then

E(X) ≤ E(Y )

Proof
We write
Y = X + (Y − X)
where Y − X ≥ 0. By Theorem 3.5, we have
E(Y ) = E(X) + E(Y − X) ≥ E(X)

Property 9 Hölder’s Inequality of Expectation

Theorem 3.13
Suppose X and Y are bivariate random variables with finite pth and
qth order moments respectively, where p and q are positive numbers
satisfying 1/p + 1/q = 1. Then XY has a finite first moment and

E(|XY |) ≤ [E(|X|^p)]^{1/p} [E(|Y |^q)]^{1/q}

The equality holds if and only if E(|Y |q ) = c E(|X|p ) for some constant c
or X = 0.

Property 10 Cauchy-Schwartz’ Inequality of Expectation

Theorem 3.14
Suppose X and Y are two random variables with second moments
E(X 2 ) and E(Y 2 ) respectively. Then E(XY ) exists and

[E(XY )]2 ≤ E(X 2 )E(Y 2 )

Proof
Case 1
If E(X 2 ) or E(Y 2 ) is infinite, the theorem is valid.

Case 2
If E(X 2 ) = 0 then we must have P (X = 0) = 1 so that P (XY = 0) = 1
and E(XY ) = 0, hence again the theorem is valid. The same argument
applies if E(Y 2 ) = 0, so that we may assume that 0 < E(X 2 ) < ∞ and
0 < E(Y 2 ) < ∞.

Case 3
Suppose we have a random variable
Zt = tX + Y


Let us consider the real valued function
E(Zt²) = E[(tX + Y )²]
For every t we have
E(Zt²) = E[(tX + Y )²] ≥ 0
since
(tX + Y )² ≥ 0
so that
E(t²X² + 2 tXY + Y²) ≥ 0                  (i)
The term on the left hand side of (i) is a quadratic function in t, which is
always nonnegative for all t. Hence, it does not have more than one real
root. Its discriminant must, therefore, be less than or equal to zero. The
discriminant “b2 − 4ac” for the quadratic equation (i) in t is
[2 E(XY )]2 − 4 E(X 2 )E(Y 2 ) ≤ 0
or
4 [E(XY )]2 ≤ 4 E(X 2 )E(Y 2 )
from which the result follows.
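A numerical illustration of Theorem 3.14 is straightforward. The sketch below (a minimal Python check, once more assuming the joint table of Example 2.3) verifies that [E(XY )]² does not exceed E(X²)E(Y ²).

    px = {-2: 0.4, 4: 0.6}
    py = {3: 0.7, 6: 0.3}
    joint = {(x, y): px[x] * py[y] for x in px for y in py}

    E_XY = sum(x * y * p for (x, y), p in joint.items())   # 6.24
    E_X2 = sum(x * x * p for x, p in px.items())           # 11.2
    E_Y2 = sum(y * y * p for y, p in py.items())           # 17.1

    print(E_XY ** 2, E_X2 * E_Y2)       # 38.94 against 191.52
    assert E_XY ** 2 <= E_X2 * E_Y2     # Cauchy-Schwartz inequality holds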

Note
(a) Equality of Theorem 3.14 holds if and only if one of the random vari-
ables equals a constant multiple of the other, say, Y = t X for some
constant t, or at least one of them is zero, say, X = 0;
(b) This inequality is sometimes simply called Schwartz’ inequality;
(c) Cauchy-Schwartz’ inequality is a special case of Hölder’s inequality
when p = q = 2.

Property 11 Minkowski’s Inequality of Expectation

Theorem 3.15
Suppose X and Y are two random variables with finite pth order
moments. Then for p ≥ 1
[E(|X + Y |^p)]^{1/p} ≤ [E(|X|^p)]^{1/p} + [E(|Y |^p)]^{1/p}

Corollary 3.6
Suppose X and Y are two random variables with expectations E(X) and
E(Y ) respectively. Then

E(|X + Y |) ≤ E(|X|) + E(|Y |)

The proof follows immediately from Theorem 3.15 for the case when p = 1.

Property 12 Expectation of Frequency of Success

Theorem 3.16
Suppose the random variable M is the frequency of occurrence of the
event A (number of successes) in n independent and identical trials.
Then
E(M ) = np

Proof
Let X1 , X2 , · · · , Xn be n independent random variables which are identically
distributed. Suppose

1, if event A occurs in the ith trial,
Xi =
0, otherwise

Then Xi are Bernoulli random variables and

E(Xi ) = pi = p

Letting
M = X1 + X2 + · · · + Xn
we have

E(M ) = E(X1 ) + E(X2 ) + · · · + E(Xn )


= np
Example 3.15
A box contains 20 black, 20 red and 10 green balls. From the box, 25 balls
are randomly selected with replacement.
Find the expectation of the frequency of occurrence of the red balls.

Solution
Let A = {the occurrence of red balls}. Then
p = P (A) = 20/50 = 2/5
Hence the expectation of the frequency of occurrence of the red balls is

E(M ) = np
      = 25 (2/5)
      = 10
That is, if we select 25 balls from a box containing 20 black, 20 red
and 10 green balls at random with replacement, we are likely to have the
red ball appearing 10 times.
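Theorem 3.16 can also be illustrated by simulation. The sketch below (a minimal Python illustration, not part of the original text) repeats the experiment of Example 3.15 many times and compares the average observed frequency of red balls with np = 10; it also reports the average relative frequency, anticipating Theorem 3.17.

    import random

    n, p, trials = 25, 2 / 5, 100_000
    total = 0
    for _ in range(trials):
        # one experiment: draw 25 balls with replacement, count the red ones
        m = sum(1 for _ in range(n) if random.random() < p)
        total += m

    print(total / trials)        # close to n*p = 10
    print(total / trials / n)    # close to p = 0.4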

Property 13   Expectation of Relative Frequency of Success

Theorem 3.17
Let the random variable M/n be the relative frequency of the event
A (or proportion of success) among the n independent and identical
trials of experiment E. Then the expectation of the relative frequency
equals the probability of the event:

E(M/n) = p

Proof
Using Property 4 of expectation (Theorem 3.7),

E(M/n) = (1/n) E(M )
       = (1/n)(n p)          (from Theorem 3.16)
       = p

The above theorem says that the expected relative frequency of the event A
is p, where p = P (A). This is intuitively clear and it establishes a connection
between the relative frequency of an event and the probability of that event.

Example 3.16
Refer to Example 3.15. Find the expectation of the relative frequency of the
occurrence of the red balls.

Solution
From Example 3.15,

p = 2/5

so that

q = 1 − p = 3/5

Hence, the expectation of the relative frequency of occurrence of the red
balls is

E(M/n) = p = 2/5

3.3 VARIANCE OF BIVARIATE RANDOM VARIABLES

3.3.1 Variance of Functions of Bivariate Random Variables

The variance of a univariate random variable has been discussed in the


book on “Introductory Probability Theory” (Nsowah-Nuamah, 2017). We
extend the discussion to bivariate distributions in this section. A few more
properties will also be discussed.
The variance of bivariate random variables X and Y can be expressed
through the joint probability functions.

Definition 3.4
Let (X, Y ) be a two-dimensional discrete random variable with prob-
ability mass function p(xi , yj ).
If H(X, Y ) = (X − µX )², then

Var(X) = Σ_{i=1}^{n} Σ_{j=1}^{m} (xi − µX )² p(xi , yj )

If H(X, Y ) = (Y − µY )², then

Var(Y ) = Σ_{i=1}^{n} Σ_{j=1}^{m} (yj − µY )² p(xi , yj )

where µX = E(X) and µY = E(Y )

Example 3.17
For the data in Example 2.3, find the variance of (a) X and (b) Y .

Solution
(a) From Example 3.1, E(X) = 1.6. Hence,

Var(X) = (x1 − µX )² p(x1 , y1 ) + (x1 − µX )² p(x1 , y2 )
         + (x2 − µX )² p(x2 , y1 ) + (x2 − µX )² p(x2 , y2 )
       = (−2 − 1.6)²(0.28) + (−2 − 1.6)²(0.12) + (4 − 1.6)²(0.42)
         + (4 − 1.6)²(0.18)
       = 8.64

(b) From Example 3.1, E(Y ) = 3.9. Hence,

Var(Y ) = (y1 − µY )² p(x1 , y1 ) + (y2 − µY )² p(x1 , y2 )
         + (y1 − µY )² p(x2 , y1 ) + (y2 − µY )² p(x2 , y2 )
       = (3 − 3.9)²(0.28) + (6 − 3.9)²(0.12) + (3 − 3.9)²(0.42)
         + (6 − 3.9)²(0.18)
       = 1.89
Definition 3.5
Let (X, Y ) be a two-dimensional continuous random variable with
probability density function f (x, y).
If H(X, Y ) = (X − µX )², then

Var(X) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} (x − µX )² f (x, y) dy dx

If H(X, Y ) = (Y − µY )², then

Var(Y ) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} (y − µY )² f (x, y) dy dx

We may also express the variance of X and Y in a bivariate distribution
through the marginal probability functions.

Definition 3.6
Let (X, Y ) be a two-dimensional discrete random variable with prob-
ability mass function p(x, y). Then the variances of the marginal dis-
tributions of X and Y are given by

Var(X) = Σ_{all x} (x − µX )² p(x)

Var(Y ) = Σ_{all y} (y − µY )² p(y)

where p(x) and p(y) are the marginal probability mass functions of
X and Y , respectively
Example 3.18
Example 3.18
For the data in Example 2.3, find (a) Var(X) (b) Var(Y ).

Solution
112 Var(X) = Advanced
(x Topics in Introductory Probability
− µX )2 p(x)
all x

= (−2 − 1.6)2 (0.4) + (4(−2 − 1.6)2 (0.6)


= 8.64

The reader is asked to calculate (b).

Definition 3.7
Let (X, Y ) be a two-dimensional continuous random variable with
probability density function f (x, y). Then the variances of the
marginal distributions of X and Y are given by

Var(X) = ∫_{−∞}^{∞} (x − µX )² g(x) dx

Var(Y ) = ∫_{−∞}^{∞} (y − µY )² h(y) dy

where g(x) and h(y) are the marginal probability density function of
X and Y respectively

Like in the case of the univariate random variable, for computational
purposes, we use the following formulas:

Var(X) = E(X²) − [E(X)]²
       = Σ_{all x} x² p(x) − [E(X)]²                  (Discrete case)
       = ∫_{−∞}^{∞} x² g(x) dx − [E(X)]²              (Continuous case)

Var(Y ) = E(Y ²) − [E(Y )]²
       = Σ_{all y} y² p(y) − [E(Y )]²                  (Discrete case)
       = ∫_{−∞}^{∞} y² h(y) dy − [E(Y )]²              (Continuous case)

Example 3.19
Refer to Example 1.3. Find (a) Var(X) (b) Var(Y ).

Solution
(a) From Example 3.6, E(X) = 13/18.
Now

E(X²) = ∫_0^1 x² (2x² + 2x/3) dx
      = ∫_0^1 (2x⁴ + 2x³/3) dx
      = [ 2x⁵/5 + 2x⁴/12 ]_0^1
      = 2/5 + 2/12
      = 17/30

Therefore
Var(X) = 17/30 − (13/18)² = 0.04506
(b) From Example 3.5, E(Y ) = 10/9.
Now

E(Y ²) = ∫_0^2 y² (1/3 + y/6) dy
       = ∫_0^2 (y²/3 + y³/6) dy
       = [ y³/9 + y⁴/24 ]_0^2
       = 8/9 + 16/24
       = 14/9

Hence
Var(Y ) = 14/9 − (10/9)² = 0.32099
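Both variances in Example 3.19 can be confirmed numerically. The sketch below (a minimal Python illustration, assuming the marginal densities that appear in the computation above, g(x) = 2x² + 2x/3 on [0, 1] and h(y) = 1/3 + y/6 on [0, 2]) approximates the required integrals with a simple midpoint rule.

    def var_from_density(f, a, b, steps=10_000):
        # midpoint-rule approximations of E(T) and E(T^2) for density f on [a, b]
        h = (b - a) / steps
        xs = [a + (i + 0.5) * h for i in range(steps)]
        m1 = sum(x * f(x) for x in xs) * h
        m2 = sum(x * x * f(x) for x in xs) * h
        return m2 - m1 ** 2

    g = lambda x: 2 * x ** 2 + 2 * x / 3     # marginal density of X on [0, 1]
    hd = lambda y: 1 / 3 + y / 6             # marginal density of Y on [0, 2]

    print(var_from_density(g, 0, 1))    # about 0.04506
    print(var_from_density(hd, 0, 2))   # about 0.32099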

3.3.2 Properties of Variance of Bivariate Random Variables

In discussing the properties of variance, it is assumed that the variances


exist.

Property 1   Variance of Identical and Independent Random Variables

Theorem 3.18
If the random variables X1 , X2 , ..., Xn have the same distribution,
then they all have the same variance σ 2 , that is,

Var(Xi ) = σ 2 , 1≤i≤n

Property 2 Variance of Sum and Difference of X and Y

Theorem 3.19
If (X,Y) are two dimensional random variables, then

Var(X ± Y ) = Var(X) + Var(Y ) ± 2 Cov(X, Y )

where Cov(X, Y ) is the covariance of X and Y

Proof
We shall give the proof for the case X + Y . The proof for the case X − Y
should be tried by the reader (see Exercise 3.13).
Var(X + Y ) = E{[X + Y − E(X + Y )]2 }
= E{[X + Y − E(X) − E(Y )]2 }
= E{([X − E(X)] + [Y − E(Y )])2 }
= E{(X − E(X))2 + (Y − E(Y ))2
+2(X − E(X))(Y − E(Y ))}
= E{X − E(X)}2 + E{Y − E(Y )}2
+2{E(X − E(X))(Y − E(Y ))}
= Var(X) + Var(Y ) + 2 Cov(X, Y )
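The identity just proved is easy to confirm by enumeration. The sketch below (a minimal Python illustration using a small, purely hypothetical joint table, not one of the examples in this book) makes X and Y dependent and checks that Var(X + Y ) equals Var(X) + Var(Y ) + 2 Cov(X, Y ).

    # hypothetical dependent joint pmf on {0,1} x {0,1}
    joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

    def E(fun):
        return sum(fun(x, y) * p for (x, y), p in joint.items())

    EX, EY = E(lambda x, y: x), E(lambda x, y: y)
    var_X = E(lambda x, y: (x - EX) ** 2)
    var_Y = E(lambda x, y: (y - EY) ** 2)
    cov = E(lambda x, y: (x - EX) * (y - EY))
    ES = E(lambda x, y: x + y)
    var_S = E(lambda x, y: (x + y - ES) ** 2)

    print(var_S, var_X + var_Y + 2 * cov)   # both 0.8 for this table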

Example 3.20
If Var(X) = 8, Var(Y ) = 5 and Cov(X, Y ) = 3, find
(a) Var(X + Y ),   (b) Var(X − Y )

Solution

(a) Var(X + Y ) = 8 + 5 + 2(3) = 19

(b) Var(X − Y ) = 8 + 5 − 2(3) = 7

The variance is not additive in general, as the expected value is. However,
with the additional assumption of independence, we obtain Theorem 3.20.

Corollary 3.7
If X1 , X2 , ..., Xn are n random variables, then

Var( Σ_{i=1}^{n} Xi ) = Σ_{i=1}^{n} Var(Xi ) + 2 Σ Σ_{i<j} Cov(Xi , Xj ),

or equivalently,

Var( Σ_{i=1}^{n} Xi ) = Σ_{i=1}^{n} Var(Xi ) + Σ Σ_{i≠j} Cov(Xi , Xj ),

or equivalently,

Var( Σ_{i=1}^{n} Xi ) = Σ_{i=1}^{n} Σ_{j=1}^{n} Cov(Xi , Xj )

Property 3   Sum and Difference of Independent Random Variables

Theorem 3.20
If (X, Y ) are two-dimensional random variables and if X and Y are
independent, then

Var(X ± Y ) = Var(X) + Var(Y )

Proof
The theorem follows from Theorem 3.19 by noting that Cov(X, Y ) = 0 for
the case when X and Y are independent (see Theorem 4.7 in the sequel.)

Corollary 3.8 Bienaymé Equality


Let Xi (i = 1, 2, ..., n) be independent random variables. Then

Var( Σ_{i=1}^{n} Xi ) = Σ_{i=1}^{n} Var(Xi )

This corollary follows from Corollary 3.7.

Note
It is important to observe that

E( Σ_{i=1}^{n} Xi ) = Σ_{i=1}^{n} E(Xi )

whether or not the Xi ’s are independent, but it is generally not the case that

Var( Σ_{i=1}^{n} Xi ) = Σ_{i=1}^{n} Var(Xi )

always.

Example 3.21
For the data in Example 2.3, verify that
(a) Var(X + Y ) = Var(X) + Var(Y );

(b) Var(X − Y ) = Var(X) + Var(Y ).

Solution
It has been shown in Example 2.3 that X and Y are independent, hence:

(a) Var(X + Y ) = E(X + Y )2 − [E(X + Y )]2 .

Now
(xi + yj )2 (−2 + 3)2 (−2 + 6)2 (4 + 3)2 (4 + 6)2
p(xi )p(yj ) 0.4(0.7) 0.4(0.3) 0.6(0.7) 0.6(0.3)


Therefore, the distribution of (X + Y )2 is

(xi + yj )2 1 16 49 100
p(xi )p(yj ) 0.28 0.12 0.42 0.18

  
E[(X + Y )²] = Σ (x + y)² p(x)p(y)
             = 1(0.28) + 16(0.12) + 49(0.42) + 100(0.18)
             = 0.28 + 1.92 + 20.58 + 18
             = 40.78

From Example 3.7, E(X + Y ) = 5.50.


Therefore,

Var(X + Y ) = E[(X + Y )²] − [E(X + Y )]²
            = 40.78 − (5.5)²
            = 10.53

Now
Var(X) = E(X 2 ) − [E(X)]2
From Example 3.1, E(X) = 1.6.
From Example 2.3, we shall find E(X 2 ). Now,

xi²        (−2)²    (4)²
p(xi )      0.4      0.6

E(X²) = Σ x² p(x)
      = (−2)²(0.4) + (4)²(0.6)
      = 1.6 + 9.6 = 11.2

Therefore
Var(X) = 11.2 − (1.6)2 = 8.64
Again
Var(Y ) = E(Y 2 ) − [E(Y )]2

From Example 3.1, E(Y ) = 3.9.

Similarly,

yj²        (3)²    (6)²
p(yj )      0.7     0.3

E(Y ²) = Σ y² h(y)
       = (3)²(0.7) + (6)²(0.3)
       = 6.3 + 10.8 = 17.1

Therefore
Var(Y ) = 17.1 − (3.9)² = 1.89

Hence,
Var(X + Y ) = Var(X) + Var(Y )

(b) Var(X − Y ) = E[(X − Y )²] − [E(X − Y )]²
Now

(xi − yj )²     (−2 − 3)²   (−2 − 6)²   (4 − 3)²   (4 − 6)²
p(xi )p(yj )    0.4(0.7)    0.4(0.3)    0.6(0.7)   0.6(0.3)

(xi − yj )²     25     64     1      4
p(xi )p(yj )    0.28   0.12   0.42   0.18

E[(X − Y )²] = Σ (x − y)² g(x) h(y)
             = 25(0.28) + 64(0.12) + 1(0.42) + 4(0.18)
             = 7 + 7.68 + 0.42 + 0.72
             = 15.82

xi − yj         −2 − 3     −2 − 6     4 − 3     4 − 6
p(xi )p(yj )    0.4(0.7)   0.4(0.3)   0.6(0.7)  0.6(0.3)

xi − yj         −5     −8     1      −2
p(xi )p(yj )    0.28   0.12   0.42   0.18

E(X − Y ) = Σ (x − y) g(x) h(y)
          = −5(0.28) + (−8)(0.12) + 1(0.42) + (−2)(0.18)
          = −1.4 − 0.96 + 0.42 − 0.36
          = −2.3

Therefore,
[E(X − Y )]2 = (−2.30)2 = 5.29

Hence,
Var(X − Y ) = 15.82 − 5.29 = 10.53

But
Var(X) + Var(Y ) = 8.64 + 1.89 = 10.53

Hence,

Var(X − Y ) = Var(X) + Var(Y )
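The whole of Example 3.21 can be compressed into a few lines of code. The sketch below (a minimal Python illustration, once again assuming the independent joint table of Example 2.3) recomputes Var(X + Y ) and Var(X − Y ) by enumeration and compares them with Var(X) + Var(Y ).

    px = {-2: 0.4, 4: 0.6}
    py = {3: 0.7, 6: 0.3}
    joint = {(x, y): px[x] * py[y] for x in px for y in py}

    def var(fun):
        m = sum(fun(x, y) * p for (x, y), p in joint.items())
        return sum((fun(x, y) - m) ** 2 * p for (x, y), p in joint.items())

    print(var(lambda x, y: x + y))                    # 10.53
    print(var(lambda x, y: x - y))                    # 10.53
    print(var(lambda x, y: x) + var(lambda x, y: y))  # 10.53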

The next property is of much importance in both theoretical and applied


statistics.

Property 4

Theorem 3.21
If the random variables Xi (i = 1, 2, ..., n) are independent and have
the same variance σ², then

Var( Σ_{i=1}^{n} Xi ) = n σ²

This theorem follows immediately from Corollary 3.8.

Property 5

Theorem 3.22
If (X, Y ) is a two-dimensional random variable, then

Var(aX ± bY ) = a² Var(X) + b² Var(Y ) ± 2 a b Cov(X, Y )

where a and b are constants

Proof
We shall prove for the sum. The proof for the difference is similar.

Var(aX + bY ) = E{[(aX + bY ) − E(aX + bY )]²}
              = E{(a[X − E(X)] + b[Y − E(Y )])²}
              = a² E[X − E(X)]² + b² E[Y − E(Y )]²
                + 2 a b E{[X − E(X)][Y − E(Y )]}
              = a² Var(X) + b² Var(Y ) + 2 a b Cov(X, Y )

More generally, we have the following results for the variance of a linear
combination of random variables.
Corollary 3.9
If Xi (i = 1, 2, ..., n) are n random variables and ci is a constant associated
with the ith random variable, then

Var( Σ_{i=1}^{n} ci Xi ) = Σ_{i=1}^{n} ci² Var(Xi ) + 2 Σ Σ_{i<j} ci cj Cov(Xi , Xj ),

or equivalently,

Var( Σ_{i=1}^{n} ci Xi ) = Σ_{i=1}^{n} ci² Var(Xi ) + Σ Σ_{i≠j} ci cj Cov(Xi , Xj ),

or equivalently,

Var( Σ_{i=1}^{n} ci Xi ) = Σ_{i=1}^{n} Σ_{j=1}^{n} ci cj Cov(Xi , Xj )
Example 3.22
If Var(X) = 5, Var(Y ) = 3 and Cov(X, Y ) = 2, find
(a) Var(4X + 6Y );
(b) Var(4X − 6Y ).

Solution

(a) Var(4X + 6Y ) = 4² Var(X) + 6² Var(Y ) + 2(4)(6) Cov(X, Y )
                  = 16(5) + 36(3) + 48(2)
                  = 284

(b) Var(4X − 6Y ) = 4² Var(X) + 6² Var(Y ) − 2(4)(6) Cov(X, Y )
                  = 16(5) + 36(3) − 48(2)
                  = 92

Property 6
Property 6
Property 66

Theorem 3.23
Theorem 3.23
Theorem
Let X and 3.23
Y be two
two independent random
random variables. Then
Then
Let X and
Let X and YY be
be two independent
independent random variables.
variables. Then
Var(aX ±
Var(aX bY )) =
± bY =aa22 Var(X) + b22 Var(Y )
2 Var(X) + b 2 Var(Y )
Var(aX ± bY ) = a Var(X) + b Var(Y )
where
where a and bb are constants
where aa and
and b are
are constants
constants
Corollary 3.10
Corollary 3.10
Corollary
Let X (i 3.10
= 1, 2, ..., n) be independent random114variables and cci aa constant
Let
Let X
X i (i
i
(i =
= 1,
1, 2, ...,
2,ith
..., n)
n) be
be independent
independent random
random variables
variables and
and cii a constant
constant
associated
i with random variable,
associated with i th random variable, then
th then
Let X and Y be two independent random variables. Then
ADVANCED TOPICS IN INTRODUCTORY 2
Var(aX ± bY ) = a Var(X)
PROBABILITY: A FIRST COURSE IN
+ b2 Var(Y ) EXPECTATION AND VARIANCE OF
PROBABILITY THEORY – VOLUME III BIVARIATE DISTRIBUTIONS
where a and b are constants

Corollary 3.10
Let Xi (i = 1, 2, ..., n) be independent random variables and ci a constant
associated with ith random variable, then
 n  n
 
Var ci Xi = c2i Var(Xi )
i=1 i=1

From Corollary 3.10, if ci = c then


 n  n
 
Var c Xi = c2 Var(Xi )
i=1 i=1

Example 3.23
122 Advanced Topics in Introductory Probability
If Var(X) = 10 and Var(Y ) = 6, find

(a) Var(3X + 2Y ), (b) Var(2X + 2Y ),


(c) Var(3X − 2Y ), (d) Var(2X − 2Y ).

Solution
(a) Var(3X + 2Y ) = 32 Var(X) + 22 Var(Y )
= 9(10) + 4(6)
= 112
(b) Var(2X + 2Y ) = Var[2(X + Y )]
= 22 Var(X + Y )
= 22 [Var(X) + Var(Y )]
= 4(10 + 6)
= 64
The reader should solve (c) and (d).

Property 7 Variance of Frequency of Success

Theorem 3.24
WE WILL TURN YOUR CV
Let the random variable M be the frequency of success in n inde- INTO AN OPPORTUNITY
pendent trials, then OF A LIFETIME
Var(M ) = npq

Proof
We found in the proof of Theorem 3.16 that E(Xi ) = p. To find the variance,
we need to find also E(Xi2 ):

x2i 0 1
p(xi ) q p

i )to=
be1(p)
a part+
of0(q) = p brand?
2
E(X
Do you like cars? Would you like a successful
Send us your CV on
As a constructer at ŠKODA AUTO you will put great things in motion. Things that will
so that ease everyday lives of people all around Send us your CV. We will give it an entirely www.employerforlife.com
new new dimension.
Var(Xi ) = E(Xi2 ) − [E(Xi )]2
= p − p = pq2

115
Solution
ADVANCED TOPICS IN INTRODUCTORY
PROBABILITY: A(a)
FIRSTVar(3X
COURSE+IN
2Y ) = 32 Var(X) + 22 Var(Y ) EXPECTATION AND VARIANCE OF
PROBABILITY THEORY – VOLUME III BIVARIATE DISTRIBUTIONS
= 9(10) + 4(6)
= 114
(b) Var(2X + 2Y ) = Var[2(X + Y )]
= 22 Var(X + Y )
= 22 [Var(X) + Var(Y )]
= 4(10 + 6)
= 64
The reader should solve (c) and (d).

Property 7 Variance of Frequency of Success

Theorem 3.24
Let the random variable M be the frequency of success in n inde-
pendent trials, then
Var(M ) = npq

Proof
We found in the proof of Theorem 3.16 that E(Xi ) = p. To find the variance,
we need to find also E(Xi2 ):

x2i 0 1
p(xi ) q p

E(Xi2 ) = 1(p) + 0(q) = p


so that
Var(X
Expectation and Variance ofi )Bivariate
= E(XDistributions
i ) − [E(Xi )]
2 2 123
= p − p = pq 2

Hence
Var(M ) = Var(X1 + X2 + ... + Xn )
= Var(X1 ) + Var(X2 ) + · · · + Var(Xn )
= npq

Example 3.24
Refer to example 3.15, find Var(M ).

Solution
2 3
n = 25, p = , q=
5 5
Var(M ) = npq
  
2 3
= 25
5 5
= 6

Property 8 Variance of Relative Frequency of Success

Theorem 3.25
M
Let the random variable be the relative frequency of success in n
n
independent trials, then
 
M pq
Var =
n n 116
5 5
Var(M ) = npq
ADVANCED TOPICS IN INTRODUCTORY   
2 3
PROBABILITY: A FIRST COURSE IN = 25 EXPECTATION AND VARIANCE OF
PROBABILITY THEORY – VOLUME III 5 5 BIVARIATE DISTRIBUTIONS
= 6

Property 8 Variance of Relative Frequency of Success

Theorem 3.25
M
Let the random variable be the relative frequency of success in n
n
independent trials, then
 
M pq
Var =
n n

Proof
 
M 1
Var = Var(M )
n n2
npq
= (by Theorem 3.24)
n2
pq
=
n

Example 3.25  
124 to example 3.15, find Var M
Refer Advanced
. Topics in Introductory Probability
n

Solution
2 3
n = 25, p = , q=
5 5
 
M pq
Var =
n n
  
2 3
5 5
=
25
= 0.0096

In this chapter we have discussed some moments of bivariate random


variables, namely, the expectation and variance. In the next chapter, we
shall continue with the discussion of moments and how that can be used to
measure the relationship between two random variables.

EXERCISES

3.1 Refer to the data of Exercise 1.1, find

(a) E(X + Y ) (b) E(X


 −Y )
3X
(c) E(XY ) (d) E
7Y
(e) E(2X + 3Y ) (f ) E(5X
 −Y )
3X
(g) E[(3X)(4Y )] (h) E
7Y
(i) E(X 2 + 5Y ) (j) E[3(X + Y )]
(k) E(Y 3 ) (l) E(Y 2 − 2Y + 3X)

3.2 Refer to Exercise 1.1. Suppose X and Y are not assumed to be inde-
pendent. Find

(a) Var(X + Y ) (b) Var(X


117
 −Y )
X
7Y 7Y
(e) (e)
E(2X
E(2X+ 3Y+ )3Y (f) ) (f )E(5X
E(5X −Y−)Y )
3X 3X
ADVANCED TOPICS(g) (g)
IN INTRODUCTORY
E[(3X)(4Y
E[(3X)(4Y )] )](h) (h)
E E
PROBABILITY: A FIRST COURSE IN 7Y 7Y EXPECTATION AND VARIANCE OF
(i) E(X
PROBABILITY THEORY (i) 2 + 25Y
E(X
– VOLUME +
III)5Y )(j) (j)
E[3(X + Y+)]Y )]
E[3(X BIVARIATE DISTRIBUTIONS
(k) (k) 3) 3)
E(YE(Y (l) E(Y
(l) E(Y
2 − 22Y
− 2Y
+ 3X)
+ 3X)

3.2 3.2
Refer
Refer
to Exercise
to Exercise
1.1.1.1.
Suppose
Suppose X and
X and Y are
Y are notnot assumed
assumed to inde-
to be be inde-
pendence,
pendence, findfind

(a) (a)
Var(X
Var(X
+ Y+) Y ) (b) (b)
Var(X
Var(X
 −Y−)Y )
X X
(c) (c) Var(XY
Var(XY ) ) (d) (d)
VarVar
Y Y
(e) (e)
Var(2X
Var(2X
+ 3Y
+ )3Y (f
) ) (f
Var(5X
) Var(5X
− Y−) Y )
(g) (g)
Var[(3X)(4Y
Var[(3X)(4Y
)] )]
Expectation and Varianceof3X
  
3X
Bivariate Distributions 125
(h) (h)
VarVar
7Y 7Y
(i) Var(X 2 + 5Y )
1
(j) Var(3X) (k) Var(3X − 2Y )
7
(l) Var(Y 2 ) (m) Var(Y 2 − 2Y + 3X)
(n) Var[3(X + Y )]

3.3 Refer to Exercise 3.2 and rework, assuming that X and Y are inde-
pendent.
3.4 A box contains 10 green balls and 15 red balls. 5 balls are randomly
picked from the box with replacement. Find the expectation and vari-
ance of the frequency and relative frequency of the occurrence of green
balls.
3.5 Given a pair of continuous random variables having the joint density

24 x y, x > 0, y > 0 and x + y ≤ 1
f (x, y) =
0, elsewhere

Find
(a) E(X) (b) E(Y ) (c) E(XY )
(d) Var(X) (e) Var(Y ) (f ) E(X + Y )
(g) E(X − Y ) (h) Var(X + Y ) (i) Var(X − Y )
assuming independence.
3.6 Given the joint density

2, 0 < x < 1, 0 < y < 1, x + y < 1
f (x, y) =
0, elsewhere

find
(a) E(X) (b) E(Y) (c) E(XY)
(d) Var(X) (e) Var(Y) (f ) E(X − Y )
(g) E(X + Y ) (h) Var(X + Y ) (i) Var(X − Y )
if independence is assumed.
3.7 Refer to Exercise 1.7. Find
(a) E(X) (b) E(Y) (c) E(XY)
(d) Var(X) (e) Var(Y) (f ) E(X − Y )
(g) E(X + Y ) (h) Var(X + Y ) (i) Var(X − Y )

118
ADVANCED TOPICS IN INTRODUCTORY
PROBABILITY: A FIRST COURSE IN EXPECTATION AND VARIANCE OF
PROBABILITY
126 THEORY – VOLUME III
Advanced BIVARIATE DISTRIBUTIONS
Topics in Introductory Probability

3.8 Refer to Exercise 1.8. Find


(a) E(X) (b) E(Y) (c) E(XY)
(d) Var(X) (e) Var(Y) (f ) E(X − Y )
(g) E(X + Y ) (h) Var(X + Y ) (i) Var(X − Y )

3.9 Refer to Exercise 1.12. Find


(a) E(X) (b) E(Y) (c) E(XY)
(d) Var(X) (e) Var(Y) (f ) E(X − Y )
(g) E(X + Y ) (h) Var(X + Y ) (i) Var(X − Y )

3.10 Refer to Exercise 1.14. Find


(a) E(X) (b) E(Y) (c) E(XY)
(d) Var(X) (e) Var(Y) (f ) E(X − Y )
(g) E(X + Y ) (h) Var(X + Y ) (i) Var(X − Y )

3.11 Refer to Exercise 1.3. Find


(a) E(X) (b) E(Y) (c) E(XY)
(d) Var(X) (e) Var(Y) (f ) E(X − Y )
(g) E(X + Y ) (h) Var(X + Y ) (i) Var(X − Y )

3.12 Refer to Exercise 1.16. Find


(a) E(X) (b) E(Y) (c) E(XY)
(d) Var(X) (e) Var(Y) (f ) E(X − Y )
(g) E(X + Y ) (h) Var(X + Y ) (i) Var(X − Y )

3.13 Prove Theorem 3.19 for the case X − Y .

3.14 Suppose X and Y are independent Normal variables N (µ, σ 2 ) and


N (µ, 2σ 2 ), respectively. Determine µ if

(a) σ = 3 and P (X + 2Y ≤ 10) = 0.3,


(b) σ = 2 and P (2X + Y < 5) = 0.4,
(c) σ = 3 and P (X + Y ≥ 8) = 0.2.

3.15 Refer to Exercise 3.14. Determine µ and σ if


P (|2X − Y | > 10) = 0.05 and P (Y ≤ 10) = 0.9


Chapter 4

MEASURES OF RELATIONSHIP
OF BIVARIATE DISTRIBUTIONS

4.1 INTRODUCTION

We recall that the expectation and variance of a univariate random variable


are special moments about the origin and the mean respectively, the expec-
tation being the first moment about the origin and the variance, the second
moment about the mean. When we have two or more random variables, mo-
ments are similarly calculated. In Chapter 3, we calculated the mean and
variance of bivariate random variables and also discussed their properties.
However, we shall realise in this chapter that in the multivariate case, other
types of moments can be calculated.

4.2 PRODUCT MOMENT

Let g(X1 , ..., Xn ) be any function of the random variables X1 , ..., Xn whose
density function is f (x1 , ..., xn ). Then the k th moment of g(X1 , X2 , ..., Xn )
is defined by

E[g^k(X1 , X2 , ..., Xn )]

= ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} g^k(x1 , x2 , · · · , xn ) f (x1 , ..., xn ) dx1 dx2 ... dxn
In the case of a bivariate distribution a special type of moment that has been
found very useful is the product moment. We shall define product moment
for both discrete and continuous random variables.
In the case of a bivariate distribution a special type of moment that has been
found very useful is the product moment. We shall define product moment
for both discrete and continuous random127 variables.

4.2.1 Definition of Product Moment about Origin

Definition 4.1 PRODUCT MOMENT ABOUT ORIGIN


(Discrete Case)
Suppose X and Y are two discrete random variables. Then the
product moment of orders r and s about the origin of X and Y
respectively is given by
 

Mrs = E (X r Y s ) = xr y s p(x, y)
x=1 y=1

where r and s are nonnegative integers

If r = s = 1 this formula reduces to that of E(XY ) in Theorem 3.1.

Definition 4.2 PRODUCT MOMENT ABOUT ORIGIN


(Continuous Case)
Suppose X and Y are two continuous random variables. Then the
product moment of orders p and q about the origin of X and Y
respectively, is given by
 ∞  ∞

Mpq = E (X p Y q ) = xp y q f (x, y) dy dx
−∞ −∞
where p and q are nonnegative integers

If p = q = 1 this formula reduces to that of E(XY ) in Theorem 3.4.

The moments described in Definitions 4.1 and 4.2 for the case when
both powers are equal to unity are known as the first product moment about
the origin.

121
ADVANCED TOPICS IN INTRODUCTORY
PROBABILITY: A FIRST COURSE IN
PROBABILITY
Measures THEORY – VOLUME
of Relationship III
of Bivariate Measures of Relationship of Bivariate
Distributions 129 Distributions

4.2.2 Definition of Product Moment about Mean

Definition 4.3 PRODUCT MOMENT ABOUT MEAN


(Discrete Case)
Suppose X and Y are two discrete random variables with means µX
and µY respectively. Then the product moment of orders r and s
about the mean of X and Y respectively is defined as

Mrs = E[(X − µX )r (Y − µY )s ]

Measures of Relationship (x − µDistributions
= of Bivariate X ) (y − µY ) p(x, y)
r s 129
x y

4.2.2 Definition of Product Moment about Mean

If r = s = 1, the formula in Definition 4.3 reduces to6


Definition 4.3 PRODUCT MOMENT ABOUT MEAN
M11∗
= (Discrete
E[(X − µX Case)
)(Y − µY )] = Cov(X, Y )
Suppose X and Y are two discrete random variables with means µX
and µY respectively. Then the product moment of orders r and s
about the mean of X and Y respectively is defined as
Definition 4.4 PRODUCT MOMENT ABOUT MEAN
Mrs∗
=(Continuous
E[(X − µX )rCase)
(Y − µY )s ]
 
Suppose X and Y are two continuousr random variables. Then the
= (x − µX ) (y − µY )s p(x, y)
product moment of ordersx y
p and q about the mean of X and Y
respectively is defined as

Mpq = E[(X − µX )p (Y − µY )q ]
∞  ∞ 
If r = s = 1, the formula in Definition 4.3 reduces to6
= (x − µX )p (y − µY )q f (x, y) dy dx
∗ −∞ −∞
M11 = E[(X − µX )(Y − µY )] = Cov(X, Y )

If p = q = 1 this formula reduces to


Definition 4.4∗ PRODUCT MOMENT ABOUT MEAN
M11 = E [(X − µX )(Y Case)
(Continuous − µY )] = Cov(X, Y )
Suppose X and described
The moments Y are twoincontinuous
Definitionsrandom
4.3 andvariables. Then
4.4 for the casethe
when
product moment of orders and
130 powers are equal to unityAdvanced
both p q about the
shall conveniently
Topics in bemean of X and
as theY first
referred to Probability
Introductory
respectively
product momentisabout
definedtheasmean or the first central product moment.
Example
6
4.1
See detailed
Mpq ∗discussion of this in Section
= E[(X − µX )p (Y − µY )q ]
4.3.

Refer to Example 3.8. Calculate
∞  ∞ the first product moment about the mean.
= (x − µX )p (y − µY )q f (x, y) dy dx
−∞ −∞
Solution
The first product moment about the mean, (that is, p = q = 1) for which
we are required to calculate is given by
If p = q = 1 this formula reduces to
 1  2
µ11 = E[(X − ∗ )(Y − µ )] =
X = E [(XY− µX )(Y − µ(x
Mµ11 Y )] )(y − µYY )) f (x, y) dy dx
−=µXCov(X,
0 0

FromThe moments
Example 3.5 described in Definitions 4.3 and 4.4 for the case when
both powers are equal to unity shall conveniently be referred to as the first
product moment about the µmean=or E(X) the first 13
X = central product moment.
6
18
See detailed discussion of this in Section 4.3. 10
µY = E(Y ) =
9
Hence
 1  2    
13 10 xy
= x− y− x + 122 dy dx
2
0 0 18 9 3
 1  2 
Solution
The first product moment about the mean, (that is, p = q = 1) for which
ADVANCED TOPICS
we are required toINcalculate
INTRODUCTORY
is given by
PROBABILITY: A FIRST COURSE IN
PROBABILITY THEORY – VOLUME III  1  2 Measures of Relationship of Bivariate Distributions
µ11 = E[(X − µX )(Y − µY )] = (x − µX )(y − µY ) f (x, y) dy dx
0 0

From Example 3.5


13
µX = E(X) =
18
10
µY = E(Y ) =
9
Hence
 1  2    
13 10 xy
= x− y− x2 + dy dx
0 0 18 9 3
  
1 1 2
= (18x − 13) (27x2 y + 9y 2 x − 30x2 − 10xy) dy dx
486 0 0
 1
1
= (18x − 13)(54x2 + 24x − 60x2 − 20x) dx
486 0
 1
1
= (18x − 13)(2x − 3x2 ) dx
243 0
 1
1
= (75x2 − 54x3 − 26x) dx
243 0
= −0.00617
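The value −0.00617 can also be confirmed by numerical integration. The sketch below (a minimal Python illustration, assuming the joint density f (x, y) = x² + xy/3 on 0 ≤ x ≤ 1, 0 ≤ y ≤ 2 used in this example) evaluates the double integral with a crude midpoint rule.

    def f(x, y):
        return x ** 2 + x * y / 3      # joint density of Example 1.3

    mu_X, mu_Y = 13 / 18, 10 / 9
    nx, ny = 400, 400
    hx, hy = 1 / nx, 2 / ny

    total = 0.0
    for i in range(nx):
        x = (i + 0.5) * hx
        for j in range(ny):
            y = (j + 0.5) * hy
            total += (x - mu_X) * (y - mu_Y) * f(x, y) * hx * hy

    print(total)   # about -0.00617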

4.3 COVARIANCE OF RANDOM VARIABLES

4.3.1 Definition of Covariance

The variance of a random variable is a measure of its variability, and the


covariance of two random variables is a measure of their joint variability, or
their degree of association.
Covariance was first mentioned and defined in Chapter 3 during the

discussion of the properties of variance. In this section, we shall introduce
it formally and discuss its properties.
their degree of association.
Covariance was first mentioned and defined in Chapter 3 during the
discussion of the properties of variance. In this section, we shall introduce
it formally and discuss its properties.

Definition 4.5 COVARIANCE OF X and Y


If (X, Y ) is a bivariate random variable with E(X) = µX and E(Y ) =
µY , then the covariance of X and Y , denoted by Cov(X, Y ), is defined
as the expected value of (X − µX )(Y − µY )

Cov(X, Y ) = E(X − µX )(Y − µY )

Note
M11∗ = Cov(X, Y ) and hence the covariance is sometimes referred to as the

first product moment about the mean or simply the product moment.

Theorem 4.1
Let X and Y be random variables with a joint distribution function,
then
Cov(X, Y ) = E(XY ) − E(X)E(Y )

Proof

Cov(X, Y ) = E{[X − E(X)][Y − E(Y )]}


= E[XY − Y E(X) − XE(Y ) + E(X)E(Y )]
= E(XY ) − E(X)E(Y ) − E(X)E(Y ) + E(X)E(Y )
= E(XY ) − E(X)E(Y )

From Theorem 3.1, the covariance may also be defined as


n 
 m
Cov(X, Y ) = xi yj p(xi , yj ) − E(X)E(Y )
i=1 j=1

124
ADVANCED TOPICS IN INTRODUCTORY
PROBABILITY: A FIRST COURSE IN
PROBABILITY
132 THEORY – VOLUME Advanced
III Measures
Topics of Relationship
in Introductory of Bivariate Distributions
Probability

Note

(a) Cov(X, X) = Var(X) ≥ 0

(b) The covariance need not be finite, or even exist. However, it is finite
if both random variables have finite variances.

Example 4.2
For the data in Example 2.3, find the covariance between X and Y .

Solution
It has been shown in Example 3.14,

E(XY ) = 6.24
E(X)E(Y ) = 6.24

Hence

Cov(X, Y ) = E(XY ) − E(X)E(Y )


= 6.24 − 6.24
= 0

Example 4.3
Refer to Example 1.3. Find Cov(X, Y ).

Solution
From Examples 3.7 and 3.8,

13 10 43
E(X) = ; E(Y ) = ; E(XY ) = .
18 9 54
Hence

Cov(X, Y ) = E(XY ) − E(X)E(Y )


  
43 13 10
= −
54 18 9
= −0.00617

which is the
Measures same as that of
of Relationship obtained in Example
Bivariate 4.1.
Distributions 133

4.3.2 Properties of Covariance

Property 1 Symmetry

Theorem 4.2
Suppose X and Y are two random variables, then

Cov(X, Y ) = Cov(Y, X)

Property 2

Theorem 4.3
Suppose X and Y are two random variables and a is a constant, then

Cov(a + X, Y ) = Cov(X, Y )

Proof

Cov(a + X, Y ) = E{[a + X − E(a + X)][Y − E(Y )]}
               = E{[X − E(X)][Y − E(Y )]}
               = Cov(X, Y )

Corollary 4.1
Suppose X and Y are two random variables and a and b are constants. Then

Cov(a + X, b + Y ) = Cov(X, Y )

Property 3

Theorem 4.4
Suppose X, Y and Z are random variables, then

Cov(X, Y + Z) = Cov(X, Y ) + Cov(X, Z)


Proof

Cov(X, Y + Z) = E{[X − E(X)][Y + Z − E(Y + Z)]}


= E{[X − E(X)][Y − E(Y )] + [X − E(X)][Z − E(Z)]}
= E{[X − E(X)][Y − E(Y )]} + E{[X − E(X)][Z − E(Z)]}
= Cov(X, Y ) + Cov(X, Z)

Property 4

Theorem 4.5
Suppose X and Y are two random variables and a and b are
constants, then

Cov(aX, bY ) = a b Cov(X, Y )

Proof

Cov(aX, bY ) = E {[aX − aE(X)] [bY − bE(Y )]}


= E {[a(X − E(X))] [b(Y − E(Y ))]}
= a b E {[X − E(X)] [Y − E(Y )]}
= a b Cov(X, Y )

Combining Corollary 4.1 and Theorem 4.5, we have another corollary.

Corollary 4.2
Suppose X and Y are two random variables and a, b, c, and d are constants,
then
Cov(a + cX, b + dY ) = c d Cov(X, Y )

In general, the same kind of argument gives the important bilinearity property of covariance.

Property 5 Bilinear Property of Covariance

Theorem 4.6
 
Let U = a + Σ_{i=1}^{n} c_i X_i and V = b + Σ_{j=1}^{m} d_j Y_j . Then

Cov(U, V ) = Σ_{i=1}^{n} Σ_{j=1}^{m} c_i d_j Cov(X_i , Y_j )

This theorem has many applications. In particular, since

Cov(X + Y, X + Y ) = Var(X + Y )
                   = Var(X) + Var(Y ) + 2 Cov(X, Y )
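A quick simulation sketch (illustrative only; the particular distribution below is arbitrary) of the identity Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y):

```python
# Sketch: sample-moment check of Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y).
# The distribution used here is an arbitrary illustration.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = 0.5 * x + rng.normal(size=100_000)   # make X and Y correlated

lhs = np.var(x + y)
rhs = np.var(x) + np.var(y) + 2 * np.cov(x, y, bias=True)[0, 1]
print(lhs, rhs)   # agree up to floating-point error
```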

Property 6 Independence of Random Variables

Theorem 4.7
If X and Y are independent random variables then Cov(X, Y ) = 0

Proof
From Theorem 4.1

Cov(X, Y ) = E(XY ) − E(X)E(Y )

But if X and Y are independent, then from Theorem 3.10

E(XY ) = E(X)E(Y )

The desired result follows immediately.

Note
If Cov(X, Y ) = 0, X and Y need not be independent.
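A standard illustration of this note (a hypothetical construction, not one of the book's examples) takes X symmetric about 0 and Y = X²; then Cov(X, Y) = 0 even though Y is completely determined by X:

```python
# Sketch: zero covariance does not imply independence.
# X is symmetric about 0 and Y = X**2 is completely determined by X,
# yet Cov(X, Y) = E(X^3) - E(X)E(X^2) = 0.
import numpy as np

x_vals = np.array([-1.0, 0.0, 1.0])
p      = np.array([0.25, 0.50, 0.25])     # symmetric p.m.f. of X
y_vals = x_vals**2                        # Y = X^2, fully dependent on X

EX  = np.sum(x_vals * p)
EY  = np.sum(y_vals * p)
EXY = np.sum(x_vals * y_vals * p)
print(EXY - EX * EY)                      # 0.0, although X and Y are dependent
```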

Property 7

Theorem 4.8
Suppose X and Y are random variables having second moments then
Cov(X, Y ) is a well-defined finite number and

[Cov(X, Y )]2 ≤ Var(X)Var(Y )

By replacing X by X − E(X) and Y by Y − E(Y ) in Theorem 3.14,


Theorem 4.8 follows immediately.

Note
The equality holds if and only if Y and X have perfect linear relation.

4.3.3 Uses of Covariance

Cov(X, Y ) is often used as a measure of the linear dependence of X and Y ,


and the reason for this is that Cov(X, Y ) is a single number (rather than
a complicated object such as a joint density function) which contains some
useful information about the joint behaviour of X and Y :
(a) Positive values indicate that X increases as Y increases;
(b) Negative values indicate that X decreases as Y increases or vice versa;
(c) A zero value of the covariance would indicate no linear dependence
between X and Y .

4.3.4 Limitations of Covariance

The covariance is a bad measure of dependence because:

(a) It is not ‘scale-invariant’; that is, the numerical value of the covariance is expressed in the product of the units in which the two variables were measured. This makes it difficult to determine at first glance whether a particular covariance is large.

(b) It is unbounded: it can take any value. This means we cannot determine a maximum value of the covariance beyond which we could say whether it is too large or not.

To deal with these problems, we ‘re-scale’ the covariance to obtain the correlation coefficient of the random variables X and Y .

4.4 CORRELATION COEFFICIENT OF RANDOM VARIABLES

4.4.1 Definition of Correlation Coefficient

The problems associated with covariance can be eliminated by standardising


its value by the standard deviations σX and σY of X and Y , respectively.
This gives the simple coefficient of linear correlation or simply the correlation
coefficient, denoted as ρ(X, Y ) or simply as ρ.

Definition 4.6 CORRELATION COEFFICIENT


If the covariance of X and Y is defined then the correlation coefficient
is defined as
ρ = Cov(X, Y ) / √[Var(X) Var(Y )]

In this definition, 0/0 = 0. Correlation coefficient is also referred to as
Pearson’s correlation coefficient or product-moment correlation coefficient.
The product-moment correlation coefficient ρ can also be expressed as

ρ = ( \overline{XY} − \overline{X} · \overline{Y} ) / (σX σY )          (i)

The reader will be asked in Exercise 4.21 to show that the numerator of
expression (i) above is defined by Cov(X, Y ). This expression may also be
used to establish a very important property of the correlation coefficient (see
Theorem 4.10 in the sequel.)
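As a numerical aside (simulated data, purely illustrative), the defining ratio can be compared with the built-in correlation routine in numpy:

```python
# Sketch: correlation coefficient as re-scaled covariance.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=50_000)
y = 2.0 * x + rng.normal(size=50_000)

rho_def   = np.cov(x, y, bias=True)[0, 1] / np.sqrt(np.var(x) * np.var(y))
rho_numpy = np.corrcoef(x, y)[0, 1]
print(rho_def, rho_numpy)   # essentially identical
```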

4.4.2 Properties of Correlation Coefficient

Property 1 Scale-invariance

Theorem 4.9
For any a, b, c, d ∈ R such that ac ≠ 0,

ρ(aX + b, cY + d) = I_XY ρ(X, Y )

where
I_XY = +1, if ac > 0
       −1, if ac < 0

Proof

ρ(aX + b, cY + d) = Cov(aX + b, cY + d) / √[Var(aX + b) Var(cY + d)]

By Corollary 4.2,

Cov(aX + b, cY + d) = a c Cov(X, Y )

and by Theorem 3.22,

Var(aX + b) = a² Var(X)   and   Var(cY + d) = c² Var(Y )

Hence

ρ(aX + b, cY + d) = a c Cov(X, Y ) / √[a² Var(X) c² Var(Y )]
                  = (a c / |a c|) · ρ(X, Y )
from which the result follows.
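A brief numerical check of Theorem 4.9 (the constants a, b, c, d below are arbitrary choices for illustration):

```python
# Sketch: rho(aX + b, cY + d) = sign(ac) * rho(X, Y).
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=50_000)
y = x + rng.normal(size=50_000)

a, b, c, d = 3.0, -5.0, -0.5, 7.0          # here ac < 0, so the sign flips
rho_xy = np.corrcoef(x, y)[0, 1]
rho_ab = np.corrcoef(a * x + b, c * y + d)[0, 1]
print(rho_ab, np.sign(a * c) * rho_xy)     # equal
```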

Property 2 Bounds of Correlation Coefficient

Theorem 4.10
The correlation coefficient lies between −1 and +1, that is,

−1 ≤ ρ ≤ 1

By dividing both sides of the formula in Theorem 4.8 by Var(X) Var(Y )


and finding the square root, Theorem 4.10 follows immediately.

4.4.3 Interpretation of Correlation Coefficient

For the value of ρ to be meaningful, both variables involved, in theory,


must be distributed normally. Positive values of ρ show that Y tends to
increase with increasing X; negative values of ρ indicate that Y tends to
decrease with increasing values of X.
There are specific interpretations of the correlation coefficient in terms
of the joint behaviour of X and Y.

(a) As a measure of linearity

The correlation coefficient measures the degree of linearity between X and


Y.
When ρ = ±1, there is perfect linearity between X and Y.

(i) If ρ(X, Y ) = 1, then Y is an increasing perfect linear function of X.


(ii) If ρ(X, Y ) = −1, Y is a decreasing perfect linear function of X.

When ρ = 0, there is no linear relationship between X and Y (see


the note below). When the value of ρ is near +1 or −1, that is an
indication of a high degree of linearity and when ρ is very small (near 0) it
indicates a lack of linearity.

Note
When ρ = 0 or near 0 it does not indicate the absence of relationship between
X and Y . It only indicates no linear relationship and it does not preclude
the possibility of some nonlinear relationship.

(b) As a measure of independence

When X and Y are independent then ρ = 0. However, the converse is not


true. That is:
(i) If ρ = 0, it does not mean that X and Y are independent.
(ii) If ρ ≠ 0, it does not mean that X and Y are dependent; the association may be spurious.
In general, correlation and dependence are not equivalent. The existence of a statistical association, even if it is very strong, does not establish that an increase in X causes an increase in Y, or that an increase in Y causes an increase in X. A fundamental weakness of observational studies is that they can demonstrate association but not causation.

Example 4.4
For the data in Example 2.3, calculate the correlation coefficient of X and Y and comment on the result.

Solution
From Example 4.2

Cov(X, Y ) = 0

From Example 3.21

σX = √Var(X) = √8.64 = 2.9394
σY = √Var(Y ) = √1.89 = 1.3748

Hence

ρ = 0 / [(1.3748)(2.9394)] = 0

There is no linear correlation between X and Y .

Example 4.5
Refer to Example 1.3. Calculate the correlation coefficient.

Solution
From Examples 3.7 and 3.8,

E(XY ) = 43/54 = 0.79630
E(X) = 13/18 = 0.72222
E(Y ) = 10/9 = 1.11111

From Example 3.19

Var(X) = 0.04506
Var(Y ) = 0.32099

Hence

ρ = [E(XY ) − E(X)E(Y )] / √[Var(X)Var(Y )]
  = [0.79630 − 0.72222(1.11111)] / √[(0.04506)(0.32099)]
  = −0.05127

4.4.4 Variance of Sum of Random Variables with Common Variance and Common Correlation Coefficient

We now establish the variance of a sum of random variables with a common correlation coefficient.

Theorem 4.11
Suppose X1 , X2 , ..., Xn are random variables with common variance σ² and common correlation Corr(Xi , Xj ) = ρ, i ≠ j (i, j = 1, 2, ..., n). Let

Sn = X1 + X2 + ... + Xn

Then

(a) Var(Sn ) = nσ²[1 + (n − 1)ρ]

(b) Var(Sn /n) = (σ²/n)[1 + (n − 1) ρ]

where Sn /n = X̄n = sample mean

Proof
(a) From Corollary 3.9,

Var(Sn ) = Σ_{i=1}^{n} Var(Xi ) + 2 Σ_{i<j} Cov(Xi , Xj )
         = Σ_{i=1}^{n} Var(Xi ) + 2 Σ_{i<j} √[Var(Xi )Var(Xj )] · [ Cov(Xi , Xj ) / √(Var(Xi )Var(Xj )) ]
         = Σ_{i=1}^{n} σ² + 2 Σ_{i<j} √(σ² σ²) ρ
         = nσ² + 2 Σ_{i<j} ρ σ²
         = nσ² + 2 ρ σ² (n² − n)/2
         = nσ² + ρ σ² (n² − n)
         = nσ² + ρ σ² n(n − 1)
         = nσ²[1 + (n − 1) ρ]

(b) Var(Sn /n) = (1/n²) Var(Sn )
             = (σ²/n)[1 + (n − 1) ρ]

When the X’s are independent, ρ = 0 and

Var(Sn /n) = Var(X̄) = σ²/n

as will be seen in Theorem 7.2.
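Theorem 4.11 can be illustrated by simulation. The sketch below (assuming, for convenience only, a multivariate normal distribution with an equicorrelation covariance matrix) compares the sample variance of Sn with nσ²[1 + (n − 1)ρ]:

```python
# Sketch: Var(S_n) = n*sigma^2*[1 + (n - 1)*rho] for equicorrelated variables.
import numpy as np

n, sigma2, rho = 5, 2.0, 0.3
cov = sigma2 * (rho * np.ones((n, n)) + (1 - rho) * np.eye(n))  # common variance and correlation

rng = np.random.default_rng(3)
xs = rng.multivariate_normal(np.zeros(n), cov, size=200_000)
s_n = xs.sum(axis=1)

print(np.var(s_n))                          # simulated Var(S_n)
print(n * sigma2 * (1 + (n - 1) * rho))     # theoretical value: 5*2*(1 + 4*0.3) = 22.0
```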

4.5 CONDITIONAL EXPECTATIONS

We encountered in my book “Introductory Probability Theory” (Nsowah-Nuamah, 2017) the concept of conditional probability. Again, in Chapter 1 of this book, we dealt with conditional distributions. In this section, we discuss conditional expectation⁷, which is simply the expected value of a variable, given that a set of prior conditions has taken place. As we shall observe later, it is obtained simply by summation or integration with respect to conditional distributions. A precise definition will be given and their properties, some of which are analogous to properties obtained in Chapter 3 for (unconditional) expectations, will be discussed.

⁷ It is sometimes referred to as conditional expected value or conditional mean of a random variable.
 
4.5.1 Definition of Conditional Expectation

Consider the random variable (Y |X), which may be interpreted as the random variable Y given the random variable X. Let E(Y |X) be the conditional expectation of Y for values of X. It is important to remember that E(Y |X) itself is a random variable. Let E(Y |X = x), or simply E(Y |x), represent the expectation of Y conditioned on the event {X = x}. Since E(Y |x) is a function of x it is a random variable though, strictly speaking, it is one value of the random variable E(Y |X).

Definition 4.7 CONDITIONAL EXPECTATION (Discrete Case)

If (X, Y ) is a two-dimensional discrete random variable, we define the conditional expectation of Y for given X = x as

E(Y |X = x) = Σ_y y h(y|x)

where h(y|x) = p_{Y |X}(y|x) denotes the conditional p.m.f. of Y given X

Definition 4.8 CONDITIONAL EXPECTATION (Continuous Case)

If (X, Y ) is a two-dimensional continuous random variable, we define the conditional expectation of Y for given X = x as

E(Y |X = x) = ∫_{−∞}^{∞} y h(y|x) dy

where h(y|x) = f_{Y |X}(y|x) denotes the conditional p.d.f. of Y given X


From the foregoing two definitions we notice that the definition of condi-
tional expectation is almost the same as the definition of expectation, except
that instead of a probability (marginal) distribution it uses a conditional
probability (marginal) distribution.

Example 4.6
Refer to Example 1.3. Determine: (a) E(Y |x), (b) E(X|y)

Solution
(a) E(Y |x) = ∫_{−∞}^{∞} y f (y|x) dy

From Example 1.3,

f (y|x) = (3x + y)/(6x + 2),   0 < y < 2, 0 < x < 1

Hence

E(Y |x) = ∫_0^2 y (3x + y)/(6x + 2) dy
        = ∫_0^2 (3xy + y²)/(6x + 2) dy
        = [1/(3x + 1)] (3x + 4/3)
        = (9x + 4)/(3(3x + 1))

(b) E(X|y) = ∫_{−∞}^{∞} x f (x|y) dx

From Example 1.13,

f (x|y) = (6x² + 2xy)/(2 + y),   0 < x < 1, 0 < y < 2

Hence

E(X|y) = ∫_0^1 x (6x² + 2xy)/(2 + y) dx
       = ∫_0^1 (6x³ + 2x²y)/(2 + y) dx
       = [1/(2 + y)] [6x⁴/4 + 2x³y/3]_0^1
       = [1/(2 + y)] (6/4 + 2y/3)
       = (18 + 8y)/(12(2 + y))
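The two integrations in Example 4.6 can be reproduced symbolically. The following sketch (using the sympy library, which is not part of the text) is one way to check the results:

```python
# Sketch: symbolic computation of the conditional expectations in Example 4.6.
import sympy as sp

x, y = sp.symbols('x y', positive=True)

f_y_given_x = (3*x + y) / (6*x + 2)          # conditional p.d.f. of Y given X = x
E_Y_given_x = sp.integrate(y * f_y_given_x, (y, 0, 2))
print(sp.simplify(E_Y_given_x))              # equal to (9x + 4)/(3(3x + 1))

f_x_given_y = (6*x**2 + 2*x*y) / (2 + y)     # conditional p.d.f. of X given Y = y
E_X_given_y = sp.integrate(x * f_x_given_y, (x, 0, 1))
print(sp.simplify(E_X_given_y))              # equal to (18 + 8y)/(12(2 + y))
```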

4.5.2 Properties of Conditional Expectation

Property 1

Theorem 4.12
Suppose X and Y are random variables. If Y has a finite expectation and Y ≥ 0, then

E(Y |X) ≥ 0

Property 2

Theorem 4.13
Suppose X and Y1 , Y2 , ..., Yn are random variables having finite expectations and the ai are constants, 1 ≤ i ≤ n; then

E( Σ_{i=1}^{n} ai Yi | X ) = Σ_{i=1}^{n} ai E(Yi |X)

In particular, if ai = a then

E( Σ_{i=1}^{n} ai Yi | X ) = a Σ_{i=1}^{n} E(Yi |X)


Property 3

Theorem 4.14
Suppose Y1 and Y2 are random variables with finite expectations. If
Y1 ≤ Y2 then
E(Y1 |X) ≤ E(Y2 |X)

Property 4

Theorem 4.15
Suppose X and Y are two independent random variables. If Y has
a finite expectation, then

E(Y |X) = E(Y )

Property 5

Theorem 4.16
Suppose X and Y are two random variables. If Y has a moment of
order r ≥ 1, then
|E(Y |X)|^r ≤ [E(|Y | | X)]^r ≤ E(|Y |^r | X)

As pointed out earlier, E(Y |X) is a random variable and hence we may find its expectation.

Property 6

Theorem 4.17
Suppose that X and Y are independent random variables. Then the
expectation of conditional expectation is given by

E[E(Y |X)] = E(Y )

Proof
We will prove this for the continuous case. The discrete case is proved similarly on replacing integrals by summations.
By definition,

E(Y |x) = ∫_{−∞}^{+∞} y h(y|x) dy
        = ∫_{−∞}^{∞} y [f (x, y)/g(x)] dy

where f (x, y) is the joint probability density function of (X, Y ) and g(x) is the marginal probability density function of X.
Multiply both sides of the equation by g(x):

g(x)E[Y |x] = [ ∫_{−∞}^{∞} y f (x, y)/g(x) dy ] g(x)

Taking the integral of both sides with respect to x gives an expectation of E(Y |x), namely, E[E(Y |X)]. That is,

E[E(Y |X)] = ∫_{−∞}^{∞} E(Y |x) g(x) dx
           = ∫_{−∞}^{∞} [ ∫_{−∞}^{∞} y f (x, y)/g(x) dy ] g(x) dx
If all the expectations exist, it is permissible to write the above iterated
integral with the order of integration reversed. Thus
E[E(Y |X)] = ∫_{−∞}^{∞} y [ ∫_{−∞}^{∞} f (x, y) dx ] dy
           = ∫_{−∞}^{∞} y h(y) dy
           = E(Y )

Theorem 4.17 gives what might be called the law of total expectation: the expectation of a random variable Y can be calculated by weighting the conditional expectations appropriately and summing or integrating.
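A small discrete illustration of the law of total expectation (the joint p.m.f. below is made up for the purpose):

```python
# Sketch: E[E(Y|X)] = E(Y) on a small discrete joint distribution.
import numpy as np

x_vals = np.array([0.0, 1.0])
y_vals = np.array([1.0, 2.0, 3.0])
p = np.array([[0.10, 0.20, 0.10],          # p[i, j] = P(X = x_i, Y = y_j)
              [0.15, 0.25, 0.20]])

p_x = p.sum(axis=1)                            # marginal p.m.f. of X
E_Y_given_x = (p * y_vals).sum(axis=1) / p_x   # E(Y | X = x_i) for each x_i

print(np.sum(E_Y_given_x * p_x))           # E[E(Y|X)]
print(np.sum(p * y_vals))                  # E(Y); the two agree
```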
conditional expectations appropriately and summing or integrating.
Property 7

Theorem 4.18
Let Ŷ = E(Y |x) be the conditional expectation of Y given X. Then

E[(Y − Ŷ )²] ≤ E[(Y − π)²]

where π = π(x) is any other function of X

Proof

E{(Y − π)²} = E[{(Y − Ŷ ) + (Ŷ − π)}²]
            = E[(Y − Ŷ )²] + 2E[(Y − Ŷ )(Ŷ − π)] + E[(Ŷ − π)²]
            = E[(Y − Ŷ )²] + E[(Ŷ − π)²]                    (i)

because we can show that the second term is zero. Thus,

E(Y − Ŷ )(Ŷ − π) = ∫_x ∫_y (y − ŷ)(ŷ − π) f (x, y) ∂y ∂x
                 = ∫_x [ ∫_y (y − ŷ) f (y|x) ∂y ] (ŷ − π) f (x) ∂x

since f (x, y) can be expressed as the product of the conditional density function of Y given X and the marginal density function of X, that is, f (x, y) has been factored as f (x, y) = f (y|x)f (x). And also
function of Y given X and the marginal density function of Y , that is,
f (x, y) has been factored as f (x, y) = f (y|x)f (x). And also
  
∫_y (y − ŷ) f (y|x) ∂y = ∫_y y f (y|x) ∂y − ∫_y ŷ f (y|x) ∂y
                       = E(Y |x) − E(Y |x)
                       = 0

Now, from (i),

E{(Y − π)²} = E[(Y − Ŷ )²] + E[(Ŷ − π)²]
            ≥ E[(Y − Ŷ )²]

from which the result follows.
Theorem 4.19
If the joint distribution of Y and X is a Normal distribution, then the function E(Y |x) may be expressed as

E(Y |x) = α + βx

Theorem 4.19 indicates that the conditional expectation of Y given X is a linear function of X. As we shall discuss in the sequel, the function is described as a linear regression function.

Property 8
Theorem 4.20
Suppose that X and Y are independent random variables. Then the variance of the conditional expectation of Y is given by

Var[E(Y |X)] = Var(Y ) − E[Var(Y |X)]

Proof
Just as E(Y |X), Var(Y |X) is also a random variable and has expectation E[Var(Y |X)]. Now,

E[Var(Y |X)] = E{E(Y²|X) − [E(Y |X)]²}
             = E[E(Y²|X)] − E{[E(Y |X)]²}
             = E[E(Y²|X)] − {E[E(Y |X)]}² + {E[E(Y |X)]}² − E{[E(Y |X)]²}
             = E(Y²) − [E(Y )]² − Var[E(Y |X)]          (from Theorem 4.17)
             = Var(Y ) − Var[E(Y |X)]

from which the result follows.

4.6 CONDITIONAL VARIANCES

In this section we discuss conditional variances and prove a useful formula


relating conditional and unconditional variances.

4.6.1 Definition of Conditional Variance

Definition 4.9 CONDITIONAL VARIANCE


If (X, Y ) is a two-dimensional discrete random variable, we define
the conditional variance of Y given X as

Var(Y |X) = E(Y 2 |X) − [E(Y |X)]2

provided that E(Y |X) exists

Thus for the discrete case, if pX (x) > 0 and if Y has a second moment, then
the conditional variance of Y given X = x is given by

Var(Y |X = x) = Σ_y y² p_{Y |X}(y|x) − [ Σ_y y p_{Y |X}(y|x) ]²

The conditional variance of Y given X = x for the continuous case is defined


similarly by replacing the summations with integrals.
Another way of defining the conditional variance is given in Definition 4.9a.


Definition 4.9a CONDITIONAL VARIANCE


If (X, Y ) is a two-dimensional discrete random variable, then the
conditional variance of Y given X is


Var(Y |X) = E{[Y − E(Y |X)]² | X}

provided that E(Y |X) exists

4.6.2 Properties of Conditional Variance

Property 1

Theorem 4.21
Suppose that X and Y are independent random variables. Then the
expectation of the conditional variance is

E[Var(Y |X)] = E[E(Y 2 |X)] − E{[E(Y |X)]2 }

Property 2

Theorem 4.22
Suppose that X and Y are independent random variables. Then the
expectation of conditional variance is given by

E[Var(Y |X)] = Var(Y ) − Var[E(Y |X)]

Proof
This follows from the proof of Theorem 4.20.

We shall notice in Theorem 4.23 in the sequel, derived from Theorem 4.20, that the unconditional variance Var(Y ) is not just the expected value of the conditional variance but the sum of the expected conditional variance and the variance of the conditional expectation.


Theorem 4.23
Suppose that X and Y are independent random variables. Then

Var(Y ) = Var[E(Y |X)] + E[Var(Y |X)]

Proof
This follows from the proof of Theorem 4.20.

Aliter
Var(Y ) = E[Y − E(Y )]2
= E{[Y − E(Y |X)] + [E(Y |X) − E(Y )]}2
= E[Y − E(Y |X)]2 + E[E(Y |X) − E(Y )]2
since the cross-product term vanishes. Conditioning on X the two terms of
the expression on the right side, we obtain for the first term:
E{E[Y − E(Y |X)]2 |X} = E[Var(Y |X)] (by Definition 4.9a)
and the second term:
E[E(Y |X) − E(Y )]² = Var[E(Y |X)]
Hence
Var(Y ) = E[Var(Y |X)] + Var[E(Y |X)]
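The decomposition Var(Y) = Var[E(Y|X)] + E[Var(Y|X)] can be verified on any small joint table; the sketch below (same kind of made-up p.m.f. as before) does so numerically:

```python
# Sketch: Var(Y) = Var[E(Y|X)] + E[Var(Y|X)] on a discrete joint distribution.
import numpy as np

x_vals = np.array([0.0, 1.0])
y_vals = np.array([1.0, 2.0, 3.0])
p = np.array([[0.10, 0.20, 0.10],
              [0.15, 0.25, 0.20]])

p_x = p.sum(axis=1)
cond = p / p_x[:, None]                          # conditional p.m.f. of Y given X = x_i
E_Y_given_x  = (cond * y_vals).sum(axis=1)
E_Y2_given_x = (cond * y_vals**2).sum(axis=1)
Var_Y_given_x = E_Y2_given_x - E_Y_given_x**2

EY   = np.sum(p * y_vals)
VarY = np.sum(p * y_vals**2) - EY**2

lhs = VarY
rhs = np.sum(p_x * (E_Y_given_x - EY)**2) + np.sum(p_x * Var_Y_given_x)
print(lhs, rhs)                                  # equal
```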

4.7 REGRESSION CURVES

The graph of E(Y |x) as a function of x, is called the regression curve


(of the mean) of Y on X. Similarly, the graph of E(X|y) as a function of y
is known as the regression curve (of the mean) of X on Y .

4.7.1 Definition of Linear Regression

The linear regression function was encountered in Theorem 4.19 as the expected value of Y given X. In this section, we shall discuss in detail its special characteristics.

Definition 4.10 LINEAR REGRESSION


The two-dimensional random variable (X, Y ) is said to have a linear
regression of Y on X if

E(Y |x) = α + βx

where α and β are constants

That is, the regression curve of Y on X is a straight line, called the re-
gression line of Y on X. The constants α and β are the parameters of the
linear regression equation. The constant α is called the intercept, which is
the point where the regression line cuts the y-axis. The constant β is called
the slope of the regression equation or the regression coefficient of Y on X
which measures a change in Y per unit change in X.
143
The constants α and β are unknown parameters and have to be esti-
mated. We discuss here two estimation methods, namely,
ADVANCED TOPICS IN INTRODUCTORY
E(Y |x) = α + βx
PROBABILITY: A FIRST COURSE IN
where α and
PROBABILITY β are
THEORY constants
– VOLUME III Measures of Relationship of Bivariate Distributions

That is, the regression curve of Y on X is a straight line, called the re-
gression line of Y on X. The constants α and β are the parameters of the
linear regression equation. The constant α is called the intercept, which is
the point where the regression line cuts the y-axis. The constant β is called
the slope of the regression equation or the regression coefficient of Y on X
which measures a change in Y per unit change in X.
The constants α and β are unknown parameters and have to be esti-
mated. We discuss here two estimation methods, namely,

(a) method of moments;

(b) method of least squares.

4.7.2 Method of Moments for Estimating Linear Regression


Function

The method of moments for estimating the linear regression function at-
tempts to find expressions for α and β that are in terms of the first-and
second-order moments of the joint distribution, namely, E(X), E(Y ), Var(X)
and Cov(X, Y ).

Theorem 4.24
If (X, Y ) has linear regression of Y on X, then

α = E(Y ) − β E(X)
β = Cov(X, Y ) / Var(X)


Proof
Proof for Estimating α:
From Definition 4.10, the regression function is given by

E(Y |x) = α + βx (i)

If we multiply (i) by f (x) and integrate it with respect to x, that is,


 
∫_x E(Y |x) f (x) dx = ∫_x (α + βx) f (x) dx
we obtain
E(Y ) = α + βE(X) (ii)
so that
α = E(Y ) − βE(X) (iii)

Proof for Estimating β:


Let us substitute (iii) into (i) to obtain

E(Y |x) = E(Y ) + β(X − E(X)) (iv)

Multiplying (i) by x and f (x) and then integrating with respect to x, we


have
 
∫_x x E(Y |x) f (x) dx = ∫_x x(α + βx) f (x) dx
E(XY ) = αE(X) + βE(X 2 ) (v)

Multiplying (ii) by E(X), we get

E(X)E(Y ) = αE(X) + β[E(X)]2 (vi)

Subtracting (vi) from (v) gives

E(XY ) − E(X)E(Y ) = β{E(X²) − [E(X)]²}          (vii)

so that

β = [E(XY ) − E(X)E(Y )] / [E(X²) − [E(X)]²]
  = Cov(X, Y ) / Var(X)          (from Theorem 4.1)

or

β = E{[X − E(X)][Y − E(Y )]} / E{[X − E(X)]²}
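As an illustration (simulated data, not from the text), the moment expressions α = E(Y) − βE(X) and β = Cov(X, Y)/Var(X) can be compared with an ordinary least-squares straight-line fit:

```python
# Sketch: alpha = E(Y) - beta*E(X), beta = Cov(X, Y)/Var(X), compared with a least-squares fit.
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0, 1, size=100_000)
y = 0.9 - 0.14 * x + rng.normal(scale=0.1, size=100_000)   # arbitrary illustrative data

beta  = np.cov(x, y, bias=True)[0, 1] / np.var(x)    # Theorem 4.24
alpha = y.mean() - beta * x.mean()

slope, intercept = np.polyfit(x, y, deg=1)           # least-squares straight line
print(alpha, beta)
print(intercept, slope)                               # essentially the same values
```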


4.7.3 Least Squares Method of Estimating Linear Regression


Function

Let the values (x1 , y1 ), (x2 , y2 ), · · · (xn , yn ), be plotted as points in the x, y


plane. Then the problem of estimating a linear regression function can be
treated as a problem of fitting a straight line to this set of points. The best
known method is the least squares approach.
The least squares method states that the sum of the squares of the
difference between the observed value of Y and the corresponding fitted
curve value of Y, which we denote by Ŷ , must be minimum. The values of
the parameters obtained by this minimisation determine what is known as
the best fitting curve in the sense of least squares.

Theorem 4.25
If (X, Y ) has linear regression of Y on X, then

α = µY − β µX
β = σXY /σX²

where µY = E(Y ), µX = E(X), σX² = Var(X) and σXY = Cov(X, Y )

Proof
Let us find the best function of the form h(x) = α + β x. This merely requires optimising over the two parameters α and β. Now, we can write⁸

Var(Y − α − β X) = E[(Y − α − β X)²] − [E(Y − α − β X)]²

E[(Y − α − β X)²] = Var(Y − α − β X) + [E(Y − α − β X)]²
                  = (σY² + β² σX² − 2β σXY ) + [E(Y − α − β X)]²

The first term of the expression on the right hand side of the equation does not depend on α, so α can be chosen so as to minimize the second term. Recall that to minimize a function we set the derivative of the function to zero. Thus

∂/∂α [E(Y − α − β X)]² = −2[E(Y − α − β X)] = 0

giving

µY − α − β µX = 0
α = µY − β µX

Now to minimize the first term we set the derivative with respect to β equal to zero, to obtain

∂/∂β (σY² + β² σX² − 2 β σXY ) = 0

giving

2 β σX² − 2 σXY = 0
β = σXY /σX²

⁸ Recollect that Var(X) = E(X²) − [E(X)]² (by Theorem 3.22; and also Var(α) = 0, a property of variance).

Corollary 4.3
If β is the coefficient of the linear regression function of Y on X, then

β = ρ σY /σX

Proof
It is sufficient to show that

σXY /σX² = ρ σY /σX

Thus

β = σXY /σX²


  = [σXY /(σX σX )] · (σY /σY )
  = [σXY /(σX σY )] · (σY /σX )
  = ρ σY /σX
Similarly, (X, Y ) is said to have linear regression of X on Y if
E(X|y) = τ + θ y (i)
with τ and θ as constants. That is, the regression curve of X on Y is a
straight line, which is called the regression line of X on Y . The number θ
is called the regression coefficient of X on Y and is defined as
θ = σXY /σY² = Cov(X, Y )/Var(Y )          (ii)

Theorem 4.26
Let us define Var(Ŷ ) = Var(Y − β X) as the mean squared predic-
tion error. Then
Var(Ŷ ) = σY2 (1 − ρ2 )

Proof
Var(Y − β X) = Var(Y ) + Var(β X) − 2 Cov(Y, β X)
             = σY² + β² σX² − 2 β σXY
             = σY² + (σXY²/σX⁴) σX² − 2 (σXY /σX²) σXY
             = σY² − σXY²/σX²
             = σY² − [σXY /(σX σY )]² σY²
             = σY² − ρ² σY²
             = σY² (1 − ρ²)

Example 4.7
Refer to Example 1.3.

(a) Find the linear regression of Y on X;

(b) Calculate the mean squared prediction error.

Solution
(a) From Example 3.19
σX² = Var(X) = 0.04506
σY² = Var(Y ) = 0.32099
σXY = Cov(X, Y ) = −0.00617

Hence

β = σXY /σX² = −0.00617/0.04506 = −0.13693
From Example 3.8,
µX = E(X) = 13/18 = 0.72222
µY = E(Y ) = 10/9 = 1.11111
Hence

α = µY − β µX
= 1.11111 − (−0.13693)(0.72222) = 1.21001

Finally,
E(Y |x) = 1.21001 − 0.13693 x
(b) Calculation of the mean squared prediction error

Var(Ŷ ) = σY²(1 − ρ²)
        = 0.32099[1 − (−0.05127)²] = 0.32015

Since the value for β is negative, it means that as X increases, Y decreases. That is, a unit increase in X corresponds to a decrease of about 0.14 in Y (where the 0.14 is in the same unit of measurement as Y ).

Theorem 4.27
Let (X, Y ) be a two-dimensional random variable with

E(X) = µX , E(Y ) = µY , V (X) = σX² and V (Y ) = σY²

If the regression of Y on X is linear, then


E(Y |x) = µY + ρ (σY /σX )(X − µX )
where ρ is the correlation coefficient between Y and X

Proof
Substituting the expressions for α and β from Theorem 4.25 into Defi-
nition 4.10, we get
E(Y |x) = (µY − β µX ) + (σXY /σX²) x
        = µY − (σXY /σX²) µX + (σXY /σX²) x
        = µY + (σXY /σX²)(X − µX )
        = µY + ρ (σY /σX )(X − µX )          (from Corollary 4.3)
Similarly, it can be proved that if the regression of X on Y is linear, then
E(X|y) = µX + ρ (σX /σY )(Y − µY )
It follows that if a regression is linear and ρ = 0, then E(Y |x) does not
depend on X and E(X|y) does not depend on Y.
149
Example 4.8
Refer to Example 1.3. Write the regression equation of Y on X.

Solution
From Example 3.8
E(X) = 13/18 = 0.72222
E(Y ) = 10/9 = 1.11111 (to 5 decimal places)

From Example 3.19,

σX = √0.04506 = 0.21227
σY = √0.32099 = 0.56656

From Example 4.5,


ρ = −0.05127.
Hence
E(Y |x) = 1.11111 − 0.05127 (0.56656/0.21227)(x − 0.72222)
        = 1.11111 − 0.13684(x − 0.72222)
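A short arithmetic check (plugging in the moments quoted from Examples 3.19 and 4.5) confirms that the slope ρ σY/σX used here agrees with β = Cov(X, Y)/Var(X) from Example 4.7, as Corollary 4.3 requires:

```python
# Sketch: arithmetic check of the regression slope in Examples 4.7 and 4.8.
import math

var_x, var_y = 0.04506, 0.32099
cov_xy = -0.00617
rho = cov_xy / math.sqrt(var_x * var_y)

beta_from_cov = cov_xy / var_x                              # Theorem 4.24
beta_from_rho = rho * math.sqrt(var_y) / math.sqrt(var_x)   # Corollary 4.3

print(rho)            # about -0.0513
print(beta_from_cov)  # about -0.137
print(beta_from_rho)  # the same value, as Corollary 4.3 requires
```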


Theorem 4.28
The product of the regression coefficients β and θ in the linear
regressions E(Y |x) and E(X|y), respectively, is equal to the square
of the correlation coefficient of X and Y :

ρ2 = βθ

Proof
In Corollary 4.3 and (ii) above,
β = ρ σY /σX

and

θ = ρ σX /σY

Hence

βθ = ρ (σY /σX ) · ρ (σX /σY )
from which the result follows.

Note
(a) The of
Measures sign of the regression
Relationship coefficient
of Bivariate 161
is determined by ρ, since σX >
Distributions 0
and σY > 0.
(b) The two linear regression equations, E(Y |x) and E(X|y), have the
same sign of slope.

(c) The regression line passes through the point [E(X), E(Y )],
which is the expected value of the joint distribution.

(d) The point of the intersection of the linear regression curves E(Y |x)
and E(X|y) is [E(X), E(Y )].

This chapter concludes discussions on bivariate distributions which we


started in Chapter 1. By induction we can easily extend most of the concepts
discussed so far to multivariate distributions in general.

EXERCISES

4.1 For the data of Exercise 1.1, find the covariance between X and Y.

4.2 For the data of Exercise 1.1, find the correlation coefficient of X and
Y.

4.3 If Var(W ) = 5, Var(X) = 3, Var(Y ) = 4 and Var(Z) = 1, find

(a) Cov(X, W ), (b) Cov(X, Y ), (c) Cov(Y, W ),



(d) Cov(X, Z), (e) Cov(Y, Z), (f) Cov(W, Z)

given that the correlation coefficient between any pair is 0.6.

4.4 Refer to Exercise 4.3. If a = 4, b = 2, c = 2 and d = 3, find

(a) Cov(c + X, Y ), (b) Cov(c+X, a+Y),


(c) Cov(dX + bY ), (d) Cov(a + dY, b + cZ),
(e) Cov(aX, bZ + cW ), (f ) Cov(aW + bX, cY + dZ),
(g) Cov(c + aW, c + aW ), (h) Cov(bY + cW, X)

4.5 A die is rolled twice. Let X be the sum of the outcomes, and Y , the
first outcome minus the second. Compute Cov(X, Y )
4.6 Prove Theorem 4.5.
4.7 Prove Theorem 4.7.

4.8 Prove Theorem 4.8.

4.9 Refer to Exercise 1.7, find the correlation coefficient of X and Y.

4.10 Refer to Exercise 1.10, find the correlation coefficient of X and Y.

4.11 Suppose the random variable X and Y have the joint p.d.f.

f (x, y) = x + y,   0 < x < 1, 0 < y < 1
           0,       elsewhere

Compute the correlation coefficient of X and Y.

4.12 Suppose the joint p.d.f. of X and Y is given by



f (x, y) = (1/y) e^{−y},   0 < x < y, 0 < y < ∞
           0,              elsewhere

Compute (a) E(X|y), (b) E(X²|y)

4.13 Refer to Exercise 1.7.

(a) Use the least squares method to obtain the linear regression
equation of (i) Y on X; (ii) X on Y.
(b) (i) Find the product of the regression coefficients in a(i) and
a(ii);
(ii) Take the square root of your results in b(i);
(iii) Compare the result in b(ii) with that in Exercise 4.9.

4.14 Refer to Exercise 1.10. Find


(a) Var(Y |X) (b) Var(X|Y ).

4.15 Refer to Exercise 1.7. Find


(a) Var(Y |X) (b) Var(X|Y ).

4.16 Refer to Exercise 1.11. Find


(a) Var(Y |X) (b) Var(X|Y ).
4.17 Verify Theorem 4.16 using Table 1.1.
4.18 Refer to Exercise 4.10.

(a) Find the regression of X on Y .


(b) Calculate the mean squared prediction error.

4.19 Refer to Exercise 4.16.

(a) Find the regression of X on Y.


(b) Calculate the mean squared prediction error.

4.20 Show that the point of intersection of Y = E(Y |x) and X = E(X|y)
is (µX , µY )

4.21 Show that ρ(X, Y ) can be defined in terms of

Cov(X, Y ) = E(XY ) − E(X)E(Y )

Hence show that the correlation coefficient is scale invariant.

4.22 Prove Theorem 4.10 for the discrete case.


PART 2
STATISTICAL INEQUALITIES, LIMIT LAWS
AND SAMPLING DISTRIBUTIONS

Life is the art of drawing sufficient conclusions from insufficient premises


SAMUEL BUTLER

Chapter 5

STATISTICAL INEQUALITIES AND LIMIT LAWS

5.1 INTRODUCTION

5.1.1 Statistical Inequalities

There are two well-known inequalities due to two Russian mathematicians,
Andrey Markov and his teacher, Pafnuty Chebyshev⁹, which play an
important role in statistical work.
If we know the probability distribution of a random variable X, we may
compute E(X) and Var(X), if these exist. However, the converse is not true.
That is, from a knowledge of E(X) and Var(X), we cannot reconstruct the
probability distribution of X. This is because to describe a probability
distribution completely, we need to know the probability function of the
random variable. However, Markov's and Chebyshev's inequalities enable
us to derive lower (or upper) bounds on such probabilities when only the
mean, or both the mean and the variance, of the distribution are known.

⁹ These names appear in other publications as Markoff and Tchebysheff respectively.
In this book we have used the spellings Markov and Chebyshev, the official American
transliterations from Russian, for consistency.

5.1.2 Limit Laws

Perhaps the two most important laws of probability theory and statistics
are the "Law of Large Numbers" and the "Central Limit Theorem". Both
are related to the notion of the "sum of random variables". Recall from
Chapter 3 that if Xi (i = 1, 2, ..., n) are random variables, then their sum
Sn = X1 + X2 + · · · + Xn is also a random variable with expectation E(Sn),
which is the sum of the expectations of the individual random variables if n is
finite (see Corollary 3.3). Furthermore, if the Xi's are also independent, then
the additivity property of variance also holds for the variance of Sn ; that is,
the variance of the sum, Var(Sn ), is the sum of the individual variances (see
Corollary 3.9). Finally, it follows that if the Xi ’s have the same distribution,
then they possess a common mean µ (Theorem 3.3), a common variance σ 2
(Theorem 3.16), and E(Sn ) = nµ (Corollary 3.4). Theorem 3.19 states that
if Xi ’s are independent and identically distributed, then Var(Sn ) = nσ 2 .
Apart from a few exceptional cases, such as the special probability dis-
tributions discussed in the previous chapters, distributions involving sums
may be very complicated, and computing their probabilities by direct evalua-
tion is extremely difficult. Fortunately, under the 'Law of Large Numbers'
and the 'Central Limit Theorem' (see Sections 5.4 and 5.5, respectively),
the evaluation of probabilities involving sums of random variables turns out
to be fairly simple.
In the subsequent sections we shall discuss statistical inequalities and
limit theorems in detail.

5.2 MARKOV’S INEQUALITY

Markov’s inequality relates a probability to an expectation and provides an


upper bound for the probability that a non-negative function of a random
variable is greater than or equal to some positive constant.

Theorem 5.1 MARKOV’S INEQUALITY


Let X be a nonnegative random variable having a finite expectation
E(X). Let ε be any positive number; then the probability that X is
no less than ε is no greater than the expectation of X divided by ε:

P (X ≥ ε) ≤ E(X)/ε
Statistical Inequalities and Limit Laws 169

Proof
This theorem is valid for both continuous and discrete cases. We shall first
prove for the discrete case.

(a) Discrete Case


Suppose that the probability distribution of X has the form in the following
table:

X x1 x2 ··· xk ··· xn
P (X = x) p(x1 ) p(x2 ) ··· p(xk ) ··· p(xn )

Suppose also that the values of the random variable are arranged in ascending
order
0 ≤ x1 < x2 < · · · < xn
Note that E(X) ≥ 0, since X ≥ 0. We shall consider three cases.

Case 1
If X takes only zero values, then E(X) = 0 and for any constant ε > 0

P (X ≥ ε) = 0 ≤ 0/ε

That is, the theorem is valid.

Suppose now that not all xi equal zero. Clearly,

E(X) > 0

Take an arbitrary constant ε > 0.

Case 2
If ε > xn , then {X < ε} is a 'certain event', so that

P (X ≥ ε) = 0 ≤ E(X)/ε

that is, the theorem is again valid in this case.

Case 3
Let ε ≤ xn and let xk , xk+1 , · · · , xn be all the values of X that are greater
than or equal to ε (if in a special case ε ≤ x1 , then k = 1).

By definition,

E(X) = x1 p1 + x2 p2 + · · · + xk pk + · · · + xn pn
     ≥ xk pk + · · · + xn pn
     ≥ ε(pk + · · · + pn )          (i)

By the addition rule of probability

pk + pk+1 + · · · + pn = P (X ≥ ε)

Then from (i)

E(X) ≥ ε P (X ≥ ε)

or

P (X ≥ ε) ≤ E(X)/ε          (ii)

Since

P (X < ε) = 1 − P (X ≥ ε)

then from (ii) it follows that

P (X < ε) ≥ 1 − E(X)/ε          (iii)



(b) Continuous Case

E(X) = ∫₀^∞ x f(x) dx
     = ∫₀^ε x f(x) dx + ∫_ε^∞ x f(x) dx
     ≥ ∫_ε^∞ x f(x) dx          since ∫₀^ε x f(x) dx ≥ 0
     ≥ ∫_ε^∞ ε f(x) dx          since x ≥ ε > 0 on this range
     = ε ∫_ε^∞ f(x) dx
     = ε P (X ≥ ε)

Therefore

P (X ≥ ε) ≤ E(X)/ε,

and hence

P (X < ε) ≥ 1 − E(X)/ε

Aliter
Let
    Y = 0,  if X < ε
      = ε,  if X ≥ ε
Then
    P (Y = 0) = P (X < ε)
and
    P (Y = ε) = P (X ≥ ε)
Hence
    E(Y ) = 0 · P (Y = 0) + ε · P (Y = ε) = ε P (X ≥ ε)
Clearly, X ≥ Y. Hence
    E(X) ≥ E(Y ) = ε P (X ≥ ε)
Therefore
    P (X ≥ ε) ≤ E(X)/ε

Note
(1) The Markov’s inequality may be equivalently stated as in Theorem
5.1(b).

Theorem 5.1(b) MARKOV’S INEQUALITY


Let X be a nonnegative random variable having a finite expectation
E(X). Let ε be any positive number, then

P (X < ε) ≥ 1 − E(X)/ε

Proof
Since {X ≥ ε} and {X < ε} are complementary events, Theorem 5.1(b)
follows.

(2) Markov's inequality appeared earlier in the work of Pafnuty
Chebyshev, and for this reason it is sometimes referred to in other
books as the first Chebyshev inequality. Such books refer to
Chebyshev's inequality, discussed in the sequel, as the second Cheby-
shev inequality.
Example 5.1
A textile factory produces on the average 150 bales of suiting material a
month. Suppose the number of bales of a suiting material produced each
month is a random variable. Find the bounds for the probability that a
particular month’s production will be

(a) at least 160 bales;

(b) less than 200 bales.

Solution
Let X be the number of bales of the suiting material produced in a month.
E(X) = 150

(a) We are required to find P (X ≥ 160).


It is better to use the first formulation of Markov's inequality in
Theorem 5.1 with ε = 160. Hence

P (X ≥ 160) ≤ 150/160 = 15/16

(b) We are required to find P (X < 200). Using the second formulation we
have

P (X < 200) ≥ 1 − 150/200 = 1/4
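
For readers who want to experiment, the bounds can be set against a simulation. The sketch below is illustrative only: the Poisson model is merely one assumed distribution with the stated mean of 150 bales, since the example does not specify a distribution.

    import numpy as np

    rng = np.random.default_rng(1)
    mean_bales = 150
    x = rng.poisson(mean_bales, size=200_000)     # hypothetical monthly production figures

    markov_bound = mean_bales / 160               # P(X >= 160) <= E(X)/160
    print(markov_bound, (x >= 160).mean())        # bound vs. simulated probability

    lower_bound = 1 - mean_bales / 200            # P(X < 200) >= 1 - E(X)/200
    print(lower_bound, (x < 200).mean())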
5.3 CHEBYSHEV'S INEQUALITY

Chebyshev's inequality was first formulated by Irénée-Jules Bienaymé in
1853, but without proof, and it was not until 1867 that Chebyshev provided
the proof. For this reason, the inequality is sometimes called the Bienaymé-
Chebyshev inequality.

5.3.1 Basic Chebyshev’s Inequality

Theorem 5.2(a) CHEBYSHEV’S INEQUALITY


Let X be a random variable with finite expectation E(X) = µ and
finite variance Var(X). Then for any real number ε > 0,

P (|X − µ| ≥ ε) ≤ Var(X)/ε²

Proof
The inequality

|X − µ| ≥ ε

is equivalent to the inequality 160


(X − µ)² ≥ ε²

The random variable (X − µ)² takes only nonnegative values.
Applying Markov's inequality to it, we have:

P (|X − µ| ≥ ε) = P [(X − µ)² ≥ ε²]
              ≤ E[(X − µ)²]/ε²
              = Var(X)/ε²

Note
(1) The Chebyshev’s inequality may be equivalently stated as in Theorem
5.2(b).

Theorem 5.2(b) CHEBYSHEV’S INEQUALITY


Let X be a random variable with finite expectation E(X) = µ and
finite variance Var(X). Then for any real number ε > 0,

P (|X − µ| < ε) ≥ 1 − Var(X)/ε²

Proof
Since {|X − µ| ≥ ε} and {|X − µ| < ε} are complementary events,
Theorem 5.2(b) follows.
(2) As indicated earlier, the Theorem 5.2 is referred to in some books as
the second Chebyshev’s inequality.

Example 5.2
An electric station services an area with 12, 000 bulbs. The probability of
switching on each of these bulbs every evening is 0.9. What are the bounds
for the probability that the number of bulbs switched on in the area in one
particular evening is different from its expected value in absolute terms by
(a) less than 100? (b) at least 120?

Solution
Let

X = the number of bulbs switched on161in the area in that evening


µ = E(X) = np = 12000(0.9) = 10800
Var(X) = npq = 12000(0.9)(0.1) = 1080

(a) We are required to calculate P (|X − µ| < 100).


We shall use the second Chebyshev's inequality

P (|X − µ| < ε) ≥ 1 − Var(X)/ε²

with ε = 100. Then

P (|X − 10800| < 100) ≥ 1 − 1080/(100)² = 0.892

(b) We are required to calculate P (|X − µ| ≥ 120).

We shall have to use the first Chebyshev's inequality:

P (|X − 10800| ≥ 120) ≤ 1080/(120)² = 0.075
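
A quick simulation shows how conservative these bounds are. The following sketch is an illustration only; the binomial model is the one implied by the example, and the number of simulated evenings is an arbitrary choice.

    import numpy as np

    rng = np.random.default_rng(2)
    n, p = 12_000, 0.9
    mu, var = n * p, n * p * (1 - p)              # 10800 and 1080
    x = rng.binomial(n, p, size=200_000)

    print(1 - var / 100**2, (np.abs(x - mu) < 100).mean())    # part (a): bound vs. simulation
    print(var / 120**2, (np.abs(x - mu) >= 120).mean())       # part (b): bound vs. simulation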

Example 5.3
Suppose that a random variable has mean 5 and standard deviation 1.5. Use
the Chebyshev’s inequality to estimate the probability that an outcome lies
between 3 and 7.

Solution
µ = 5,  σ = 1.5,  so Var(X) = σ² = 2.25
Since we wish to estimate the probability of an outcome lying between 3
and 7, we have

P (3 − 5 < X − µ < 7 − 5) = P (−2 < X − µ < 2) = P (|X − µ| < 2)

That is, ε = 2. By Chebyshev's inequality

P (|X − µ| < 2) ≥ 1 − Var(X)/ε² = 1 − 2.25/2² = 0.4375

The desired probability is at least 0.4375. That is, if the experiment is
repeated a large number of times, we expect at least 43.75% of the outcomes
to be between 3 and 7.

The following two theorems are other forms in which Chebyshev's
inequality is expressed.

Theorem 5.3(a)
Let X be a random variable with finite expectation E(X) = µ and
finite variance Var(X). Then for any positive number ε,

P (|X − µ| ≥ cσ) ≤ 1/c²

where ε = cσ and σ = √Var(X)

In words, the above theorem states that the probability that X assumes a
value outside the interval from µ − cσ to µ + cσ is never more than 1/c².

Theorem 5.3(b)
Let X be a random variable with finite expectation E(X) = µ and
finite variance Var(X). Then for any real number ε > 0,

P (|X − µ| < cσ) ≥ 1 − 1/c²

where ε = cσ and σ = √Var(X)

This states that the probability of the event that X takes on a value x
which is within c standard deviations of its expectation is at least 1 − 1/c²,
no matter what c happens to be. That is, the probability that X assumes a
value within the interval from µ − cσ to µ + cσ is never less than 1 − 1/c².

Note
Theorems 5.3(a) and 5.3(b) are used only when c > 1. Now, if c < 1, then
1 − 1/c² < 0 or 1/c² > 1, but we know that the probability of any event ranges
from zero to one. Thus, the Chebyshev inequality of Theorems 5.3(a) and
5.3(b) is trivially true when c < 1.

Example 5.4
Suppose a random variable X has an expectation µ = 4.6 and a variance
σ 2 = 2.25. Find the bounds for the following probabilities:
(a) P (|X − µ| < 2σ)  and  (b) P (|X − µ| < 3σ)

Solution
µ = 4.6,   σ = √2.25 = 1.5

(a) P [|X − 4.6| < 2(1.5)] ≥ 1 − 1/2² = 3/4 = 0.75
    Thus,
    P [|X − 4.6| < 3] ≥ 0.75

(b) P [|X − 4.6| ≤ 3(1.5)] ≥ 1 − 1/3² = 8/9 = 0.8889
    Thus,
    P (|X − 4.6| < 4.5) ≥ 0.8889

From Example 5.4, we can say that the probability that the random variable
X will take on a value within two standard deviations from the mean is at
least 3/4, and the probability that X will take on a value within three standard
deviations from the mean is at least 8/9. It is in this sense that the standard
deviation σ controls the spread or dispersion of the distribution of a random
variable.
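
In the c-standard-deviation form the bound depends only on c. A few lines of Python, given purely for illustration, make the guarantees of Example 5.4 explicit:

    mu, sigma = 4.6, 1.5
    for c in (2, 3):
        low, high = mu - c * sigma, mu + c * sigma
        print(f"P({low:.1f} < X < {high:.1f}) >= {1 - 1 / c**2:.4f}")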

5.3.2 Special Cases of Chebyshev’s Inequality

The following special cases of Chebyshev’s inequality are very helpful in


sampling theory. They can be used to determine the minimum sample sizes
(number of trials).

Theorem 5.4
Let Xn be the sample mean based on a random sample of size n on a
random variable X with expectation µ and finite variance σ 2 . Then
for any real number ε,

P (|X n − µ| ≥ ε) ≤ σ²/(nε²)

or equivalently,

P (|X n − µ| < ε) ≥ 1 − σ²/(nε²)
where n is the sample size


Example 5.5
Let X denote the mean of random variables Xi with expectation µ = 100
and variance σ 2 = 2.5. Find the bound for the probability that in 120 trials
the mean differs from the expected value in absolute terms by less than 0.8.

Solution
ε = 0.8,   n = 120,   σ² = 2.5

P (|X n − µ| < 0.8) ≥ 1 − 2.5/(120(0.8)²) ≈ 0.967

Example 5.6
Referring to the X of Example 5.5, determine the size of n such that

P (|X n − µ| < 0.8) ≥ 0.99

Solution
P (|X n − 100| < 0.8) ≥ 0.99

Comparing this with Theorem 5.4 we have:

1 − σ²/(nε²) = 0.99
⇒ σ²/(nε²) = 0.01

Rearranging we obtain

n ≥ σ²/(0.01 ε²) = 2.5/(0.01(0.8)²) = 390.6

That is, we need at least 391 trials in order that the probability will be at
least 0.99 that the sample mean X n will lie within 0.8 of the expectation.
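
The same calculation is easily scripted. In the sketch below the function name is an arbitrary choice made for illustration; the function simply rearranges the bound of Theorem 5.4.

    import math

    def chebyshev_sample_size(sigma2, eps, prob):
        """Smallest n with 1 - sigma2/(n*eps**2) >= prob (Theorem 5.4)."""
        return math.ceil(sigma2 / ((1 - prob) * eps**2))

    print(chebyshev_sample_size(2.5, 0.8, 0.99))   # 391, as found above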
Going through the solution of Example 5.5, we may write the formula
for determining the sample size n by Chebyshev's inequality as

n = σ²/(qε²)

where q = 1 − p.

Other forms of Chebyshev's inequality which are of great importance
are the Bernoulli forms, which are applications of the De Moivre-Laplace
Integral Theorem.

Theorem 5.5 FIRST BERNOULLI THEOREM


Let E be an experiment and A an event associated with E. Consider
n independent trials of E. Let the random variable Mn be the number
of times event A occurs among the n trials. Then by Chebyshev's
inequality

P (|Mn − np| ≥ ε) ≤ npq/ε²

or equivalently,

P (|Mn − np| < ε) ≥ 1 − npq/ε²


Proof
The Theorem follows from Chebyshev's inequality (Theorem 5.2) by remem-
bering that E(Mn ) = np and Var(Mn ) = npq.

The reader should attempt Exercise 5.26(a). It is a typical example of
Theorem 5.5.

Theorem 5.5 is true for any value we give to ε > 0. In particular,
if we replace ε by nδ, where δ is thought of as "small", then we have the
following theorem.


Theorem 5.6 SECOND BERNOULLI THEOREM


Let E be an experiment and A be an event associated with E. Con-
sider n independent trials of E. Let Mn/n be the relative frequency of
the occurrence of event A in n trials. Then

P (|Mn/n − p| ≥ δ) ≤ pq/(nδ²)

or equivalently,

P (|Mn/n − p| < δ) ≥ 1 − pq/(nδ²)

For any value of δ, the right side of the second part of Theorem 5.6
converges to 1 as n → ∞. What it means is that for any value of δ, no
matter how small, the probability that the proportion of successes in n
trials differs from the theoretical probability p by less than δ tends to 1 as
the number of trials increases without bound. That is, it is guaranteed that
the observed relative frequency of successes will converge to the theoretical
relative frequency (as measured by p) as the number of trials tends to infinity
(see Section 5.4).

Example 5.7
The probability of the occurrence of an event A in each trial of an experiment
is 2/3. Using Chebyshev's inequality, find a lower bound for the probability
that in 10, 000 trials the deviation of the relative frequency of the event A
from the true probability of A will be less than 0.01.

Solution
The relative frequency in n independent trials is a random variable. Its
expectation equals p = 2/3.
We are required to calculate

P (|Mn/10,000 − 2/3| < 0.01)

Then,

P (|Mn/10,000 − 2/3| < 0.01) ≥ 1 − (2/3)(1/3)/[(10,000)(0.01)²]
                             = 1 − 2/9 = 7/9 ≈ 0.778

Thus, with probability not less than 0.778 we may expect that in 10,000
trials the relative frequency of event A will deviate from its probability by
less than 0.01.
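
The bound may be compared with the probability obtained by simulation. The following sketch is illustrative only; the binomial sampling is just one way of modelling the 10,000 trials.

    import numpy as np

    rng = np.random.default_rng(3)
    n, p = 10_000, 2 / 3
    bound = 1 - p * (1 - p) / (n * 0.01**2)            # lower bound, about 0.778
    freq = rng.binomial(n, p, size=100_000) / n        # simulated relative frequencies
    print(bound, (np.abs(freq - p) < 0.01).mean())     # the true probability is near 0.97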

Example 5.8
How many times should a fair die be tossed in order to be at least 95% sure
that the relative frequency of having a four come up is within 0.01 of the
theoretical probability 1/6?
Solution
p = 1/6,   1 − p = 5/6   and   ε = 0.01

Let Mn/n be the relative frequency. Then

P (|Mn/n − p| < 0.01) ≥ 0.95

By the second Bernoulli Theorem,

P (|Mn/n − p| < ε) ≥ 1 − pq/(nε²)

Thus

1 − pq/(nε²) = 0.95
⇒ pq/(nε²) = 0.05
⇒ n = pq/(0.05 ε²)
     = (1/6)(5/6)/[0.05(0.01)²]
     = 27777.778
≈ 27, 778
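
The computation generalises easily. The helper below is an illustrative sketch (the function name is arbitrary) that rearranges the second Bernoulli bound:

    import math

    def bernoulli_trials_needed(p, eps, confidence):
        """Smallest n with 1 - p(1 - p)/(n*eps**2) >= confidence."""
        return math.ceil(p * (1 - p) / ((1 - confidence) * eps**2))

    print(bernoulli_trials_needed(1 / 6, 0.01, 0.95))   # 27778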

The Chebyshev’s inequality provides only crude and general results. This
limitation arises because of its complete universality; it is valid under any
distribution provided both the expectation and the variance are finite. But
this condition will always be satisfied when the number of values of the
random variable X is finite. In the process of achieving such a general
result, however, this inequality is not particularly tight in terms of the bound
achieved on particular distributions. For most distributions that arise in
practice, there are far sharper bounds for P (|X − µ| < ε) than that given by
Chebyshev's inequality. For instance, if the random variable X is normally
distributed, then the following formula derived in Chapter 11 of Volume I is
more exact:

P (|X − µ| < ε) = 2Φ(ε/σ) − 1

Example 5.9
Referring to Example 5.4, use the Normal distribution to find the probabil-
ities and compare the results.

Solution
µ = 4.6,   σ = √2.25 = 1.5

(a) 2σ = 2(1.5) = 3

    P (|X − µ| < 2σ) = P (|X − µ| < 3)
                     = 2Φ(3/1.5) − 1
                     = 2Φ(2) − 1
                     = 2(0.9772) − 1
                     = 0.9544

(b) 3σ = 3(1.5) = 4.5

    P (|X − µ| < 3σ) = P (|X − µ| < 4.5)
                     = 2Φ(4.5/1.5) − 1
                     = 2Φ(3) − 1
                     = 2(0.9987) − 1
                     = 0.9974

In both cases the probabilities are greater than those obtained by Cheby-
shev’s inequality in Example 5.4.
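
The comparison can be reproduced with the standard normal c.d.f. The fragment below is given for illustration; it prints the Chebyshev bound next to the exact normal probability for c = 2 and c = 3.

    from math import erf, sqrt

    def Phi(z):
        """Standard normal c.d.f."""
        return 0.5 * (1 + erf(z / sqrt(2)))

    for c in (2, 3):
        print(c, 1 - 1 / c**2, round(2 * Phi(c) - 1, 4))   # 0.75 vs 0.9545, 0.8889 vs 0.9973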

Note that Chebyshev's inequality provides a nontrivial estimate of the
probability of the event {|X − µ| < ε} only in the case when the variance
of the random variable X is sufficiently small, namely, less than ε².
Clearly, the left side of the inequality in Theorem 5.2, expressing the
probability of an event, is nonnegative. It gives the estimate of the probability
of the event {|X − µ| < ε} if the right side is positive.
Thus,

1 − Var(X)/ε² > 0,   or   Var(X)/ε² < 1,   or   Var(X) < ε²

If, on the other hand, Var(X) > ε², then the right side of the inequality of
Theorem 5.2 will become negative and the Chebyshev inequality will give

P (|X − µ| < ε) ≥ −a

where a > 0.

Thus, Chebyshev's inequality is trivial for the case when Var(X) > ε².
This of course reduces the role of Chebyshev's inequality in its application
to practical problems; however, its theoretical importance is great. It is
the starting point for several theoretical developments. It provides us with a
convenient interpretation of the concept of variance (or standard deviation).
It can also be used to provide a simple proof for the law of large numbers
in the next section.

Note
The inequality that helps us derive bounds for sums of independent random
variables is Kolmogorov's inequality.
Suppose that X1 , X2 , · · · , Xn are independent random variables with
mean zero and finite variances, and let Sk = X1 + X2 + · · · + Xk . Then for
ε > 0

P ( max_{1≤k≤n} |Sk | ≥ ε ) ≤ Var(Sn )/ε²
For the proof see page 248 of Billingsley, P. (1979) listed in the bibliography.

For n = 1, Kolmogorov's inequality reduces to Chebyshev's inequality.
For n > 1, Chebyshev's inequality gives the same bound for the single
relation |Sk | ≥ ε, so that Kolmogorov's inequality is considerably stronger.

5.4 LAW OF LARGE NUMBERS

5.4.1 Definition of Law of Large Numbers

The Law of Large Numbers is an important consequence of Chebyshev’s


inequality. It is commonly believed that if a fair coin is tossed many times
the number of heads is about the same as the number of tails. The law of
large numbers is a mathematical formulation of this belief.

Definition 5.1 LAW OF LARGE NUMBERS


The law of large numbers states that the average of a number of
independent and identically distributed random variables converges
to the expected value of the underlying distribution of X as the
number of random variables increases

Whatever the "law of averages" may be, it is certainly not reasonable in a
hundred tosses of a fair coin to expect exactly 50 tails, or in a million tosses
to expect exactly 500,000 heads. To be more reasonable, perhaps the best
we can get is that usually the proportion of tails in n tosses is close to 1/2.
J.E. Kerrich, while interned in Denmark as a British subject during the
Second World War, tossed a coin 10,000 times and recorded 5067 heads.
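
Kerrich's experiment is easily imitated on a computer. The sketch below is an illustration only (the seed is arbitrary); it tosses a fair coin 10,000 times and prints the running proportion of heads.

    import numpy as np

    rng = np.random.default_rng(4)
    tosses = rng.integers(0, 2, size=10_000)             # 1 = head, 0 = tail
    running = np.cumsum(tosses) / np.arange(1, 10_001)
    for n in (10, 100, 1_000, 10_000):
        print(n, running[n - 1])                         # proportions settling near 0.5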
Before we give the various forms of the law of large numbers we shall
discuss briefly the term ”converge” used in the definition. Suppose we have a
sequence X1 , X2 , · · · of random variables defined on the same sample space.
In what follows we shall distinguish various possible modes in which the se-
quence {Xn } may converge. Convergence generally comes in two categories,
which we call strong and weak.

Theorem 5.7 CONVERGENCE IN PROBABILITY


Suppose {Xn } is a sequence of random variables (n = 1, 2, · · ·).
Then if for every ε > 0

P (|Xn − θ| > ε)

approaches zero as n approaches infinity, then {Xn } is said to


converge in probability to θ

Symbolically, we may write convergence in probability as

lim_{n→∞} P (|Xn − θ| > ε) = 0

or
p
Xn −→ θ
Synonyms for convergence in probability are stochastic convergence, conver-
gence in measure, or weak convergence.

Theorem 5.8 CONVERGENCE WITH PROBABILITY ONE

Suppose {Xn } is a sequence of random variables (n = 1, 2, · · ·). If

P ( lim_{n→∞} Xn = X ) = 1

then {Xn } is said to converge with probability one (w.p. 1) to X

That is, Xn actually converges to X except in a negligible number of cases.
This form of convergence is also called almost sure convergence (a.s.), or
almost everywhere convergence (a.e.), or strong convergence.
Symbolically, we may write convergence with probability one as

P ( lim_{n→∞} Xn = X ) = 1

These two types of convergence¹⁰ are the basis of the two laws of large
numbers, namely, the weak and the strong laws of large numbers. For our
purposes, which are statistical, the weak law of large numbers is the central
concept, and when the "law of large numbers" is referred to without qualifi-
cation, this one is implied. We shall discuss it in much detail later, but now
we state, without proof, the strong law of large numbers.

¹⁰ The other type of convergence that plays an important role in probability theory is
convergence in distribution, also called complete convergence or weak convergence. Suppose
{Xn } is a sequence of random variables (n = 1, 2, · · ·) and let Fn (t) be their c.d.f.'s. If for
every t at which F0 (t) is continuous, lim_{n→∞} Fn (t) = F0 (t), then Xn is said to converge
in distribution to X0 , denoted by Fn (t) →d F0 (t).

5.4.2 Strong Law of Large Numbers

Even though the strong law of large numbers may not be realistic, what
might not be realised in the real-world situation may sometimes be achieved
in a purely theoretical sense. Such a possibility was unravelled by Borel
in 1909, who established the strong law of large numbers in the case of
independent Bernoulli trials.

Theorem 5.9 STRONG LAW OF LARGE NUMBERS
(Bernoulli Form)

Let Mn be a random variable, the number of successes in n
Bernoulli trials, so that Mn/n is the proportion of successes. Then

P ( lim_{n→∞} Mn/n = p ) = 1

where p is the probability of success

Theorem 5.10 STRONG LAW OF LARGE NUMBERS
(Khinchin Form)

Let X1 , X2 , · · · , Xn be a sequence of independent random variables
having a common distribution and let E(X) = µ. Then there is a
limit number µ, such that

P ( lim_{n→∞} X n = µ ) = 1

where X n = (X1 + X2 + · · · + Xn )/n

The proof of this theorem goes beyond this book. To see it refer to page
250 of Billingsley, P. (1979), listed in the bibliography.

The strong law of large numbers makes better sense than the weak law
and it is indispensable for certain theoretical investigations. It is indeed the
foundation of a mathematical theory of probability based on the concept of
relative frequency. It is also known that the sample mean, X n , converges
with probability 1 to the mean µ, provided that the latter exists. The
strong law of large numbers is usually not given much attention in most
statistical textbooks, including this one partly because it is not realistic to
assert this almost sure convergence, for in any experiment we can neither be
100 percent sure nor 100 percent accurate, otherwise the phenomenon will
not be a random one. Secondly, in general, it is easier to prove the weak
law of large numbers than the strong law of large numbers.

5.4.3 Weak Law of Large Numbers

The weak law of large numbers is one of the earliest and most famous of
the limit laws of probability. We shall consider two of its forms, namely,
Bernoulli law (which relates to proportions) and Khinchin law (which relates
to the means).
173

Bernoulli Law of Large Numbers (Discrete Case)


The first formulation of the law of large numbers, known as the Bernoulli
law of large numbers, was given and proved by Jakob Bernoulli and pub-
lished posthumously in his book “Ars Conjectandi” in 1713 as a crowning
achievement. It states that if Mn represents the number of successes in n
identical Bernoulli trials, with probability of success p in each trial, then the
relative frequency Mn/n is very likely to be close to p when n is a sufficiently
large and fixed integer. The law in a sense justifies the use of the frequency
definition of probability discussed in Chapter 3 of Volume I and it brings
the theory of probability into contact with practice.
Mathematically, the Bernoulli law of large numbers may be expressed
in the following theorem.

Theorem 5.11 BERNOULLI LAW OF LARGE NUMBERS


Let X1 , X2 , · · · , Xn , be a sequence of independent Bernoulli random
variables, each taking on the value 1 (with the occurrence of an event
A) or 0 (with non-occurrence of an event A), and let p = P (Xi =
1); i = 1, 2, ..., n. Then for the sum Mn , the number of successes in n
trials, where

Mn = X1 + X2 + · · · + Xn ,

we have

lim_{n→∞} P (|Mn/n − p| < ε) = 1      for all ε > 0

or equivalently,

lim_{n→∞} P (|Mn/n − p| ≥ ε) = 0      for all ε > 0

That is, for n large, Mn/n is very close to p. In other words, the law of large
numbers may be stated as follows: in n trials the probability that the rela-
tive number of successes Mn/n deviates numerically from the true probability
p by not more than ε (ε > 0) approaches 1 as n approaches infinity.
Or equivalently, in n trials the probability that the relative number
of successes Mn/n deviates numerically from the true probability p by more
than ε (ε > 0) approaches 0 as n approaches infinity.


Khinchin Theorem of Large Numbers (Continuous Case)

The modern discussion of a mathematical version of the weak law of large
numbers was first published by the Russian mathematician, Khinchin, in
1929. The Khinchin Theorem states: "If Xi (i = 1, 2, ..., n) are independent
and identically distributed random variables and if E(Xi ) = µ exists, then
when n becomes a very large integer, the probability that the random vari-
able X n will differ from the common expectation µ of the Xi by not more
than any arbitrarily prescribed small difference ε is very close to one." Math-
ematically, this theorem can be expressed in the form of the following
theorem.

Theorem 5.12(a) KHINCHIN LAW OF LARGE NUMBERS I

Let X1 , · · · , Xn be a sequence of independent and identically
distributed continuous random variables with finite common mean
µ and finite common variance σ². Then for any real number ε > 0

lim_{n→∞} P (|X n − µ| < ε) = 1

or equivalently,

lim_{n→∞} P (|Sn/n − µ| < ε) = 1

where Sn = X1 + X2 + · · · + Xn

Proof
By Theorem 5.4,

P (|X n − µ| < ε) ≥ 1 − σ²/(nε²)

so that

1 − σ²/(nε²) ≤ P (|X n − µ| < ε) ≤ 1

lim_{n→∞} [1 − σ²/(nε²)] ≤ lim_{n→∞} P (|X n − µ| < ε) ≤ 1

1 ≤ lim_{n→∞} P (|X n − µ| < ε) ≤ 1

By the Sandwich rule of limits

lim_{n→∞} P (|X n − µ| < ε) = 1

It is in this sense that the arithmetic mean “converges” to E(X) = µ. That


is, by averaging an increasingly large number of observations of the value
of a quantity, we can obtain increasingly more accurate measures of the
expectation of that quantity.
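
This averaging effect can be seen directly by simulation. In the sketch below the exponential parent distribution is an arbitrary choice made for illustration; any distribution with a finite mean would serve.

    import numpy as np

    rng = np.random.default_rng(5)
    mu = 2.0                                   # true expectation of the chosen distribution
    for n in (10, 1_000, 100_000):
        sample = rng.exponential(mu, size=n)   # Exponential with mean mu
        print(n, sample.mean())                # sample means drifting towards 2.0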

Aliter
We first find E(X n ) and Var(X n ):

E(X n ) = (1/n) Σᵢ E(Xi ) = µ          (see Chapter 7).

Since the Xi are independent and identically distributed,

Var(X n ) = (1/n²) Σᵢ Var(Xi ) = σ²/n          (see Chapter 7).

The desired result now follows immediately from Chebyshev's inequality as
follows:

P (|X n − µ| < ε) ≥ 1 − Var(X n )/ε² = 1 − σ²/(nε²) → 1   as n → ∞

Theorem 5.12(b) KHINCHIN LAW OF LARGE NUMBERS II


The law of large numbers may equivalently be stated as

lim_{n→∞} P (|X n − µ| ≥ ε) = 0

or equivalently,

lim_{n→∞} P (|Sn/n − µ| ≥ ε) = 0
where Sn = X1 + X2 + · · · + Xn

The reader is asked in Exercise 5.27 to prove this theorem.

Note
To apply the weak law of large numbers, the random variables do not have
to be identically distributed; they just need to have the same mean and the
same finite variance.
5.5 CENTRAL LIMIT THEOREM

5.5.1 Definition of Central Limit Theorem

Among all probability laws, the Central Limit Theorem (CLT) is consid-
ered the most remarkable theoretical formulation from both practical and
theoretical viewpoints. It provides some reassurance when we are not cer-
tain whether observations are normally distributed. Whatever the nature of
the random variables X1 , X2 , · · · – whether they are discrete or continuous,
Bernoulli or gamma – as n gets larger and larger the distribution of the sum
(after standardisation) always gets closer and closer to that of the Standard
Normal distribution. In fact, it is the most important reason for the central
role of the Normal distribution in Statistics.
Generally, the name "Central Limit Theorem" is used to describe any
convergence theorem in which the Normal distribution appears as the limit.
More particularly, it applies to sums of large numbers of random variables
which, after standardisation, have approximately the Standard Normal dis-
tribution. Though the name "Central Limit Theorem" was given by G. Polya
in 1920, the concept was first introduced by De Moivre early in the eighteenth
century. De Moivre proved special cases in 1733, and Laplace later, in his
book of 1812, gave the first (incomplete) statement of the general form of
the CLT, so the CLT of Theorem 5.16 in the sequel (the Bernoulli form) is
often referred to as the De Moivre-Laplace CLT. The modern CLT (Theo-
rem 5.15 in the sequel) began to emerge from the work of the Russian school
of Probabilists, notably Chebyshev and his students Markov and Lyapunov,
towards the close of the nineteenth century. In fact Lyapunov proved the
first general version of the CLT in about 1901, though the details of his proof
were complicated.
The central limit theorem has been expressed in many forms, differing
in various degrees of abstraction and generality. We shall present some of
the simplest versions, but first we state its basic theorems.

Theorem 5.13 POLYA’S THEOREM


Let X1 , X2 , · · · , Xn be a sequence of independent and identically distributed
random variables with finite mean E(X) = µ and finite variance Var(X) = σ² > 0
and let

        Zn* = √n (X̄n − µ) / σ

Then

        P (Zn* ≤ z*) → Φ(z*)   as n → ∞

where Φ is the c.d.f. of the Standard Normal distribution

We shall observe in Theorem 5.17 in the sequel that Polya’s Theorem is the
central limit theorem of means.

Theorem 5.14 CENTRAL LIMIT THEOREM


(Lindeberg–Lévy Theorem)
Suppose Z1 , Z2 , · · · , Zn are independent and identically distributed
Standard Normal random variables and let

        Sn* = (Z1 + Z2 + · · · + Zn ) / √n

Then
P (Sn∗ ≤ s∗ ) → Φ(s∗ ) as n → ∞
where Φ is the c.d.f. of the Standard Normal distribution

For a proof see page 316 of Dudewicz and Mishra (1988), listed in the
bibliography.

5.5.2 Central Limit Theorem of Sums

The Central Limit Theorem is concerned with the distribution of the “sum
of random variables”. If X1 , X2 , . . . is a sequence of independent random
variables with expectation E(X) and variance Var(X) and if

        Sn = Σ_{i=1}^n Xi

we know from the law of large numbers that Sn /n converges to E(X) in
probability. The Central Limit Theorem is concerned not with the fact
that the ratio Sn /n converges to E(X) but with how it fluctuates around
E(X). To analyse these fluctuations, we standardise the sum

        Tn = (Sn − E(Sn )) / √Var(Sn )          (i)

It is easily verified that Tn has mean 0 and variance 1.
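This is immediate from the rules E(aX + b) = aE(X) + b and Var(aX + b) = a²Var(X); written out (in LaTeX notation for compactness):

    E(T_n) = \frac{E(S_n) - E(S_n)}{\sqrt{\mathrm{Var}(S_n)}} = 0,
    \qquad
    \mathrm{Var}(T_n) = \frac{\mathrm{Var}(S_n)}{\mathrm{Var}(S_n)} = 1.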

As usual we shall discuss two forms of CLT, namely, the Khinchin form
(continuous case) and the Bernoulli form (discrete case).

Continuous Case

Theorem 5.15 CENTRAL LIMIT THEOREM


(Khinchin Form)
Let X1 , · · · , Xn be a sequence of independently and identically distributed
random variables, each having mean µ and variance σ². Let

        Sn = Σ_{i=1}^n Xi .

Then the distribution of

        Tn = (Sn − nµ) / (σ√n)

tends to the Standard Normal distribution when n becomes infinite



Proof
Since the Xi 's are independent and identically distributed, it follows from
Corollary 3.4 and Theorem 3.19 that

E(Sn ) = nµ, and Var(Sn ) = nσ 2


Substituting these expressions in (i) above the result follows immediately.

Note
These theorems will also hold when X1 , · · · , Xn are independent random
variables with the same mean and the same finite variance but not neces-
sarily identically distributed.

Discrete Case
The De Moivre-Laplace Theorem discussed in Chapter 12 of Volume I is
actually a special case of the Central Limit Theorem, namely, the Bernoulli
form of CLT of frequency.
The Bernoulli and other forms of the central limit theorem arise out of
Theorem 5.15. For instance, if the random variables Xi ’s are independent
Bernoulli random variables, their sum
        Mn = Σ_{i=1}^n Xi

is a binomial random variable. Substituting the expressions


E(Mn ) = np and Var(Mn ) = npq
in (i) above, the result follows immediately.

Theorem 5.16 CENTRAL LIMIT THEOREM


(Bernoulli Form)
Let X1 , X2 , · · · , Xn be independent Bernoulli random variables with
parameter p so that E(Xi ) = p and Var(Xi ) = pq. If

        Mn = Σ_{i=1}^n Xi

then

        Yn = (Mn − np) / √(npq)

tends to the Standard Normal distribution as n → ∞
Limit Laws 195

Thus, the central limit theorem may mathematically be stated as

        lim_{n→∞} P (Zn ≤ z) = Φ(z),

where Φ is the cumulative distribution function of the standard Normal
distribution.

Significance of the Central Limit Theorem

The computational significance of the CLT is that for large n, we can express
the cumulative distribution function of Tn in terms of N (0, 1) as follows.
In the case of frequencies,

        P (Yn ≤ y) ≈ Φ((y − np) / √(npq))
where y is a particular value of Yn .
In the case of continuous random variables

        P (Tn ≤ t) ≈ Φ((t − nµ) / (σ√n))
where t is a particular value of Tn .
The logical question that arises at this point is: How large should n be
to enable us to apply the Central Limit Theorem? There is no single
answer, since this depends on the closeness of approximation required and
the actual distribution forms of Xi ’s. If the random variables Xi ’s are nor-
mally distributed, then no matter how small n is, their sum is also normally
distributed and P (Tn ≤ t) provides exact probabilities. If nothing is known
about the distribution patterns of Xi ’s, or if the distribution of Xi ’s differs
greatly from normality then n must be large enough to guarantee approx-
imate normality for Sn . One rule of thumb states that, in most practical
situations, a value of n of at least 30 is satisfactory. In general, the approximation
of the sum of random variables to normality becomes better and better as
the sample size increases.
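This convergence is easy to observe numerically. The following sketch (Python with NumPy; the exponential parent distribution, the sample sizes and the evaluation point z = 1 are purely illustrative choices, not taken from the text) compares the simulated value of P (Tn ≤ 1) with Φ(1) for a markedly skewed parent distribution:

    import numpy as np
    from statistics import NormalDist

    rng = np.random.default_rng(0)
    mu, sigma = 1.0, 1.0              # mean and s.d. of the Exponential(1) parent
    for n in (5, 30, 200):            # illustrative sample sizes
        sums = rng.exponential(scale=1.0, size=(20_000, n)).sum(axis=1)
        tn = (sums - n * mu) / (sigma * np.sqrt(n))          # standardised sums
        print(n, round(np.mean(tn <= 1.0), 3), round(NormalDist().cdf(1.0), 3))

As n grows, the simulated proportion settles near Φ(1) ≈ 0.841, even though the parent distribution is far from Normal.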

Example 5.10
A die is thrown ninety times. At a given throw the expected number of
points is 7/2 and the variance is 32/12. Find

(a) the expectation of the sum of points;

(b) the variance of the sum of points;

(c) the probability that the sum obtained is at most 300.

Solution
n = 90, µ = 7/2, σ² = 32/12

(a) E(Sn ) = nµ = 90 (7/2) = 315

(b) Var(Sn ) = nσ² = 90 (32/12) = 240

(c) s = 300

        P (Sn ≤ s) = P ((Sn − nµ)/(σ√n) ≤ (s − nµ)/(σ√n)) = P (Tn ≤ (s − nµ)/(σ√n))

Thus,   P (Sn ≤ 300) = P (Tn ≤ (300 − 315) / (√(32/12) √90))
                     = P (Tn ≤ −15 / ((1.6330)(9.4868)))
                     = P (Tn ≤ −0.968)
                     = 1 − Φ(0.97)
                     = 1 − 0.8340 = 0.166
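The table-based calculation in part (c) can be reproduced in a few lines of Python (a sketch only: the helper names are ours, and the exact standard Normal c.d.f. is used in place of the rounded table value):

    from math import erf, sqrt

    def phi(z):
        # standard Normal c.d.f. expressed through the error function
        return 0.5 * (1.0 + erf(z / sqrt(2.0)))

    def clt_sum_cdf(s, n, mu, sigma2):
        # CLT approximation: P(Sn <= s) ≈ Φ((s − nµ)/(σ√n))
        return phi((s - n * mu) / (sqrt(sigma2) * sqrt(n)))

    # Example 5.10(c): n = 90, µ = 7/2, σ² = 32/12, s = 300
    print(round(clt_sum_cdf(300, 90, 7 / 2, 32 / 12), 3))    # 0.166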


5.5.3 Central Limit Theorem of Means

Another important version of the Central Limit Theorem is expressed in
terms of the means of independent and identically distributed random
variables.

Theorem 5.17 CENTRAL LIMIT THEOREM


(of Means)
Let X1 , · · · , Xn be n independent and identically distributed random
variables with means equal to µ and variances equal to σ². Let X̄n
be the random variable defined by

        X̄n = (X1 + X2 + · · · + Xn ) / n

The distribution of

        Zn = (X̄n − µ) / (σ/√n)
tends to the Standard Normal distribution as n → ∞

Note
Zn is the standardised mean of the random variables X1 , · · · , Xn and it is
equal to Tn in Theorem 5.15.

Proof
The proof is quite sophisticated. An outline of a proof is given in Hoel
(1971), and Freund and Walpole (1971), listed among the references.

Example 5.11
Suppose that I.Q. is a random variable with mean µ = 100 and standard
deviation σ = 25. What is the probability that in a class of 40 students

(a) the average I.Q. exceeds 110?

(b) a single person’s I.Q. exceeds 110?

Solution
n = 40, µ = 100, σ = 25, σX̄ = 25/√40 = 3.9528

By the Central Limit Theorem, the mean I.Q. of the 40 students is approximately
normally distributed. Hence

(a) The probability that the average I.Q. exceeds 110 is

        P (X̄ ≥ 110) = P (Z ≥ (110 − 100)/3.9528)
                     = P (Z ≥ 2.53)
                     = 1 − Φ(2.53)
                     = 1 − 0.9943
                     = 0.0057

So, approximately, 0.6% of the mean I.Q.’s of a class of 40 students will
exceed 110.

(b) The probability that a single person’s I.Q. exceeds 110 is

        P (X > 110) = P (Z ≥ (110 − 100)/25)
                    = P (Z ≥ 0.4)
                    = 1 − Φ(0.4)
                    = 1 − 0.6554
                    = 0.3446
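Both probabilities can be checked directly with the standard library (a sketch; statistics.NormalDist has been available since Python 3.8):

    from math import sqrt
    from statistics import NormalDist

    mu, sigma, n, threshold = 100, 25, 40, 110
    phi = NormalDist().cdf

    p_mean = 1 - phi((threshold - mu) / (sigma / sqrt(n)))   # average of 40 I.Q.'s exceeds 110
    p_single = 1 - phi((threshold - mu) / sigma)             # a single I.Q. exceeds 110
    print(round(p_mean, 4), round(p_single, 4))              # ≈ 0.0057 and 0.3446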

Theorem 5.18
Suppose that X1 , X2 , · · · , Xn are independent and identically distributed
random variables, each having mean µ and Var(X) = σ². Suppose also that
the average of the measurements, X̄, is used as an estimate of µ. Then

        P (|X̄ − µ| < ε) ≈ Φ(ε√n / σ) − Φ(−ε√n / σ)

Proof
Suppose that we wish to find

        P (|X̄ − µ| < ε)

for some constant ε > 0. To use the Central Limit Theorem to approximate
this probability, we standardise the mean, using E(X̄) = µ and
Var(X̄) = σ²/n:

        P (|X̄ − µ| < ε) = P (−ε < X̄ − µ < ε)
                        = P (−ε/(σ/√n) < (X̄ − µ)/(σ/√n) < ε/(σ/√n))
                        ≈ Φ(ε√n / σ) − Φ(−ε√n / σ)
If we use the Standard Normal Table I in the Appendix, then we may write
this formula as

        P (|X̄ − µ| < ε) ≈ 2Φ(ε√n / σ) − 1
On the other hand

        P (|X̄ − µ| < ε) ≈ 2Ψ(ε√n / σ)
if we use Table II in the Appendix.

The Law of Large Numbers tells us that X converges to µ in


probability; so we can hope that X is close to µ if n is large. Cheby-
shev’s inequality allows us to determine the bound of the probability of an
error of a given size, but the Central Limit Theorem gives a much sharper
approximation to the actual error.

Example 5.12
Suppose that 25 measurements are taken with σ = 1.2. Find the proba-
bility that the sample mean X deviates from µ by less than 0.3.

Solution
n = 25, σ = 1.2, ε = 0.3

        P (|X̄ − µ| < 0.3) ≈ Φ(0.3√25 / 1.2) − Φ(−0.3√25 / 1.2)
                          = 2Φ(0.3√25 / 1.2) − 1
                          = 2Φ(1.25) − 1
                          = 0.7888
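The same probability can be evaluated without the table (a sketch; the function name is ours):

    from math import sqrt
    from statistics import NormalDist

    def prob_mean_within(eps, sigma, n):
        # P(|X̄ − µ| < ε) ≈ 2Φ(ε√n/σ) − 1
        return 2 * NormalDist().cdf(eps * sqrt(n) / sigma) - 1

    print(round(prob_mean_within(0.3, 1.2, 25), 4))   # ≈ 0.7887, in agreement with Example 5.12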


This sort of reasoning can be turned around. That is, given ε and θ, n
can be found such that

        P (|X̄n − µ| < ε) ≥ θ

where θ is the probability that corresponds to the z value in the Standard


Normal Table (see Tables I and II in the Appendix).

With the Central Limit Theorem we can derive a formula for calculating
the value of n such that the probability that the mean deviates from µ by
less than ε is θ:

        n = [ (σ/ε) Φ⁻¹((1 + θ)/2) ]²

if we use the Full Normal Table (Table I in the Appendix), or equivalently,

        n = [ (σ/ε) Ψ⁻¹(θ/2) ]²
using the Half Normal Table (Table II in the Appendix).

Example 5.13
Refer to Example 5.12. Suppose σ = 1.2 and ε = 0.3. Find n such that the
probability that the mean deviates from µ by less than 0.3 is 0.75.

Solution
σ = 1.2    ε = 0.3    θ = 0.75

Therefore

        n = [ (1.2/0.3) Φ⁻¹((1 + 0.75)/2) ]² = [4 Φ⁻¹(0.875)]² = [4(1.15)]² = 21.16 ≈ 22
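The sample-size formula is easy to evaluate with the inverse Normal c.d.f. in the Python standard library (a sketch; rounding up reflects that n must be a whole number of observations):

    from math import ceil
    from statistics import NormalDist

    def sample_size(sigma, eps, theta):
        # n = [ (σ/ε) Φ⁻¹((1 + θ)/2) ]², rounded up
        z = NormalDist().inv_cdf((1 + theta) / 2)
        return ceil((sigma / eps * z) ** 2)

    print(sample_size(1.2, 0.3, 0.75))   # 22, using the values of Example 5.13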


5.5.4 Central Limit Theorem of Proportions

Sometimes we may have to work with proportions of successes in n independent
trials rather than with the actual number of successes. In this case
(Mn − np)/√(npq) becomes

        (Mn /n − p) / √(pq/n)

and this leads us to the following theorem.

Theorem 5.19 CENTRAL LIMIT THEOREM


(Proportions)
Suppose in n Bernoulli trials with Mn successes, Mn /n is the proportion
(relative frequency) of successes and the probability of success in each
trial is p. Then

        Yn = (Mn /n − p) / √(pq/n)
tends to the Standard Normal distribution as n → ∞
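A short simulation illustrates the theorem (a sketch; p = 0.3, n = 400 and the cut-off 0.33 are arbitrary illustrative values, not taken from the text):

    import random
    from math import sqrt
    from statistics import NormalDist

    p, n, trials = 0.3, 400, 10_000
    hits = 0
    for _ in range(trials):
        successes = sum(random.random() < p for _ in range(n))
        if successes / n <= 0.33:
            hits += 1

    z = (0.33 - p) / sqrt(p * (1 - p) / n)
    print(hits / trials, round(NormalDist().cdf(z), 4))   # simulated vs. Normal approximation

The two printed values should agree to about two decimal places.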

We conclude this chapter by summarising the two limit laws as fol-


lows. Generally, if a theorem is concerned with stating conditions under
which variables converge (in some sense) to the expected average, then it is
a law of large numbers. If on the other hand, the theorem is concerned with
determining conditions under which the sum of a large number of random
variables has a probability distribution that is approximately Normal, it is
considered a central limit theorem.

EXERCISES

5.1 Refer to Example 5.1. Find the bounds for the probability that the
month’s production will be

(a) at most 300 bales;


(b) at least 180 bales.

5.2 The mean lifetime of a certain electrical device is 4 years. Find the
lower bounds for the probability that a randomly selected device from
a consignment of such devices will not exceed 20 years.


5.3 The amount of savings in a certain Rural Bank is ten million cedis.
Suppose the probability that an amount of at most two hundred thou-
sand cedis drawn by a customer selected at random is 0.8. What can
we say about the number of customers.

5.4 Refer to Example 5.2. Find the bounds for the probability that the
number of bulbs switched on in the area in that evening is different
from its expected value in absolute terms by

(a) at least 80 bulbs;


(b) at most 200 bulbs;
(c) not more than 150.

5.5 The distribution of a random variable X is given by the following table:

xi 1 2 3 4 5 6
p(xi ) 0.05 0.10 0.25 0.30 0.20 0.10

(a) Find P (|X − E(X)| < 2)


(b) Use Chebyshev’s inequality to estimate the lower bounds of the
probability in (a) above. Hint: E(X) = 3.8 and Var(X) = 1.77.

5.6 Suppose that a random variable has mean 35 and standard deviation
5. Use the Chebyshev’s inequality to estimate the probability that an
outcome will lie between
(a) 24 and 46 (b) 18 and 52 (c) 31 and 39

5.7 A Pharmaceutical Company supplies boxes of paracetamol tablets to a


particular hospital. Owing to the packaging procedure, not every box
contains exactly 150 tablets. It has been found out that on the average
the number of tablets in a box is indeed 150 and the standard deviation
is 4. If the company supplies the hospital 1, 000 boxes, estimate the
number of boxes having between 140 and 160 tablets.

5.8 Suppose that a random variable has mean 80 and standard deviation
6. Use the Chebyshev’s inequality to find the value of ε for which the
probability that the outcome lies between 80 − ε and 80 + ε is at least
5/12.

5.9 Suppose that a random variable has mean 25 and standard deviation
0.67. Use the Chebyshev’s inequality to find the value of ε for which
the probability that the outcome lies between 25 − ε and 25 + ε is at
most 10/13.

5.10 Refer to Example 5.4. Find the bounds for the following probabilities:

(a) P (|X − µ| ≤ σ)
(b) P (|X − µ| ≥ 3.3σ)
(c) P (|X − µ| ≤ 2.5σ)
(d) P (|X − µ| ≥ 1.65σ)
(e) P (|X − µ| ≥ 2.2σ)

5.11 Suppose that the number of hours a certain type of light bulb will burn
before requiring replacement has a mean of 2, 000 hours and standard
deviation of 150 hours. If 1, 000 such bulbs are installed in a new house,
estimate the number that will require replacement between 1, 200 and
2, 800 hours from the time of installation.

5.12 Refer to Example 5.7. Find an upper bound for the probability that
in 5,000 trials the deviation of the relative frequency of the event A
from its probability will not exceed 0.08.

5.13 A manufacturing company produces ground measuring poles with a


mean length of 2 metres and a standard deviation of 1.5 metre per
pole. Assume that the pole lengths, X, are statistically independent.
A container of 100 poles is sent to a customer for distribution. Find
the probability that the total length of all poles in the container is

(a) not more than 195 metres


(b) at least 220 metres
(c) between 180 and 210

5.14 Refer to Exercise 5.13. How many observations should be included in


the sample if we wish X, the mean length of poles, to be within 0.1
metres of µ with probability 0.95.

5.15 The final scores of the students in Diploma class over a period of four
years have a mean 60 and a variance 64. In a particular year there
were 100 students with a mean score of 58. Calculate the probability
that the sample mean, X is at most 58.

5.16 A bolt manufacturer knows that, on average, 5% of his production


is defective. He gives a guarantee on a shipment of 10, 000 bolts, by
promising to refund the money if more than K bolts are defective.
How small can the manufacturer choose K and still be assured that
he needs not give a refund of more than 1% of his shipment?

5.17 X n is to be used as an approximation to µ. If σ is known to be 2,


calculate the sample size n that will be required to ensure that the
error in this approximation will be less than 0.05 with probability of
at least 0.90.

5.18 Suppose that the number of students that enrol in the Basic Statistics
course in the Faculty of Social Studies at the University of Ghana is
a Poisson random variable with mean 500. The Co-ordinator of the
course has decided that if the number enrolling is 350 or more he will
split the group into two, otherwise they will all be in the same class.
What is the probability that the class will be split?

5.19 The mean and standard deviation of the ages of statistics students of
the University of Ghana are 20 years and 5.8 years respectively. What
is the probability that a random sample of 50 students will have a
mean age of between 18 and 23 years?

5.20 The mean age and the standard deviation of Statistics students are
18 years and 1.8 years respectively. What is the probability that a
random sample of 50 students will have a mean of 16 and 20 years?

5.21 Use the Chebyshev’s inequality to determine how large n should be in


order that you are at least 95 per cent certain that the sample mean
X n is within 0.9 σ of the population mean µ?

5.22 Use the Chebyshev’s inequality to determine how large n should be


in order that you are at most 95 per cent certain that Xn is different
from µ by at least 0.2 σ?

5.23 Let X1 , X2 , · · · , Xn be independent, identically distributed random


variables each having p.d.f.
        f (x) = (1/2) e^(−x/2) ,   x > 0
                0 ,                x ≤ 0


Let Sn = (1/n) Σ_{i=1}^n Xi . Use the Chebyshev’s inequality to estimate the
minimum possible value of n, given that

P {|Sn − E(Sn )| ≥ 0.25} ≤ 0.75

For this value of n, calculate the upper bound to the given probability
obtained by applying Chebyshev’s inequality.

5.24 The burning time of a certain type of lamp is an exponential random


variable with mean µ = 20 hours and standard deviation 14 hours.
What is the probability that 80 of these lamps will provide a total of
more than 2000 hours of burning time?

5.25 Twenty numbers are rounded off to the nearest integer and then added.
Assume the individual round-off errors are independent and uniformly
distributed over (−1/2, 1/2). Find the probability that the given sum will
differ from the sum of the original twenty numbers by 3 or more using
(a) Chebyshev’s inequality (b) Central Limit Theorem.

5.26 Suppose a coin is tossed 26 times. Estimate the probability that the
number of heads will deviate from 12 by less than 4, using
(a) Chebyshev’s inequality (b) Central Limit Theorem.

5.27 Prove Theorem 5.11.

5.28 Prove Theorem 5.12(b).

5.29 A consignment of 1,800 items has been produced by a manufacturing


company whose products are 4% defective. Suppose the company’s
production process is an independent trial process with each item pro-
duced having a probability of 0.3 of being defective. Use Chebyshev’s
inequality to give a bound on the probability that in the batch of 1,800
items the number of defective is between 39 and 69.
5.30 Past experience indicates that wire rods purchased from a certain com-
pany have a mean breaking strength of 400 kg and a standard deviation
of 20 kg. How many rods should you select so that you would be cer-
tain with probability 0.95 that your sample mean would not be in error
by more than 3 kg.


Chapter 6

SAMPLING DISTRIBUTIONS I
Basic Concepts

6.1 INTRODUCTION

So far we have gone through the various aspects of probability theory in


my earlier books. In Volume I (Nsowah-Nuamah, 2017), we discussed
the basic theorems in probability, where we emphasised conditional prob-
ability, Bayes’ Theorem and Independence. We considered the Random
Variable and it was there that the concept of probability distribution was
first introduced. We also discussed the Numerical Characteristics of the
random variable and there we encountered the concepts of Mathematical
Expectation, Variance, Moments, and Moment Generating Function. In the
Volume II (Nsowah-Nuamah, 2018) we considered some special probability
distributions. These included the Bernoulli distribution, Binomial distribu-
tion, Geometric distribution, Negative binomial distribution, Poisson dis-
tribution, Hypergeometric distribution, Multinomial distribution, Uniform
distribution, Exponential distribution, Gamma distribution, Beta distribu-
tion and Normal distribution.
In Chapters 1 to 4 of this book, we extended the concept of probability
distributions to bivariate distributions. In Chapter 5 we discussed the basic
Inequalities in Statistics (Markov’s and Chebyshev’s Inequalities) and the
two main Limit Laws (the Central Limit Theorem and the Law of Large
Numbers) which have wide and practical applications in Statistics.
We can now conclude the book with some analytical statistics con-
cerned with inferential procedures. The present chapter and the subse-

quent ones serve as a bridge between the topics discussed in this book and
statistical inference.

6.2 STATISTICAL INFERENCE

Statistical inference involves drawing conclusions on a large data set on the


basis of information obtained from a small portion of the large collection.
Two main concepts that form the basis of statistical inference are population
and sample.

6.2.1 Population

Definition 6.1 POPULATION


A population is a large data set that contains all the members (ele-
ments) of interest

Definition 6.2 FINITE POPULATION


A finite population is a population that has a stated or limited num-
ber of members

Examples of a finite population include all students in the University of


Ghana, all households in Accra, all the mud houses in Kumasi. It is possible
(though not always practical) to count the number of elements in a finite
population.

Definition 6.3 INFINITE POPULATION


An infinite population is composed of a limitless number of members

The elements in an infinite population cannot be counted completely no
matter how long the counting process is carried on. An example of an
infinite population is an experiment of tossing a coin to determine whether
or not the coin is biased. Theoretically, this experiment could be carried
out an infinite number of times.
Philosophically, no truly infinite population of physical objects exists.
After all, given unlimited resources and time, we could enumerate even the
grains of sand, on a particular seashore. As a practical matter, then, we will
use the term infinite population when we are talking about a population
that could not be enumerated in a reasonable period of time.

6.2.2 Sample

6.2.2 Sample

It is obvious that in an infinite population complete enumeration cannot


be secured. Even in a finite but large population it is often costly, time
consuming, or even physically impossible to conduct a census (complete
enumeration). Any inferences on the population should therefore be based
on a part of the population called the sample.

Definition 6.4 SAMPLE


A sample is a part of the population

An example of a sample includes selecting a few pupils from a particular


school in order to study the socio-economic background of pupils in that
school.
If a sample of n elements is drawn, the sample is said to be of size n.

6.2.3 Parameter and Statistic

From Chapter 7 through 12, we discussed a number of special distributions


but did not specify how they were realised in any particular case. Those
probabilities that we were calculating depended on certain parameters de-
scribing such distributions.

Population Parameter

Definition 6.5 POPULATION PARAMETER


A population parameter is a measure computed from observations of
the population

Examples of population parameter (or simply parameter ) are the values of


p in the case of the binomial distribution, λ in the case of the Poisson distri-
bution or µ and σ in the case of the Normal distribution. Other parameters
include the median and the range.
Most often the probability distribution of a population is not known
precisely although we may have some idea of or at least be able to make
some hypothesis concerning its general behaviour. Thus, for example, we
may have some reason to suppose that a particular population is binomially
distributed. In such a case we would not know the value of p and so we might
have to estimate it.
In short, we have no direct knowledge of the probability distribution
and therefore have to approximate it empirically by a frequency distribution.
The number of experiments performed for this purpose, called a sample, is
necessarily finite.

Definition 6.6 SAMPLE STATISTIC


A sample statistic (or simply statistic) is a193
measure calculated from
the sample
have to estimate.
In short, we have no direct knowledge of the probability distribution
ADVANCED TOPICS IN INTRODUCTORY
PROBABILITY: Ahave
and therefore toCOURSE
FIRST approximate
IN it empirically by a frequency distribution.
The number THEORY
PROBABILITY of experiments
– VOLUMEperformed
III for this purpose, called DISTRIBUTIONS
SAMPLING a sample, is I Basic Concepts
necessarily finite.

Definition 6.6 SAMPLE STATISTIC


A sample statistic (or simply statistic) is a measure calculated from
the sample

Examples of statistic include the sample mean, the sample proportion, the
sample variance, the sample range.

Definition 6.7 SAMPLING


The process of obtaining samples from a population is called sam-
pling

As a rule, small populations, for obvious reasons, are not sampled. Instead,
the entire population is examined. A sample that contains all the members in
the population is called an exhaustive sampling, or a 100 per cent sampling,
which are, of course, only other names for a census.
There are basically two types of sampling: probability sampling and
non-probability sampling. In this book we shall consider only probability
sampling because it is only for probability sampling that there are sound
statistical procedures for drawing conclusions concerning the population of
interest based on the sample drawn.


6.3 PROBABILITY SAMPLING

6.3.1 Definition of Probability Sampling

Probability sampling is the scientific method by which the units of the sam-
ple are chosen based on some definite pre-assigned probability. The various
types of probability sampling include the cases where:

(a) each member of the population has an equal chance of being selected;

(b) the members of the population have different probabilities of being


selected;

(c) the probability of selecting a member is proportional to the sample


size.

Definition 6.8 PROBABILITY SAMPLE


A probability sample is a sample drawn from a population in such a
way that every member of the population has a known and nonzero
probability of being included in the sample

For a probability sample neither the sampler nor the member of the popu-
lation can decide which member will be included in the sample. The selec-
tion is achieved by the operation of chance alone. Synonym for probability
sampling is random sampling. Random sampling methods include simple
random sampling, systematic sampling, and stratified sampling. Samples
obtained by taking every k th name in a list are called systematic samples.
Sometimes a population P is divided into groups, P1 , P2 , · · · , Pr called
strata and random samples are taken from these strata and combined to
get a stratified random sample. This is often done when we want to be sure
that we shall have specified numbers of subjects from each stratum. Other
sampling techniques include cluster sampling and multistage sampling. For
details about sampling techniques see Nuamah (1994).
The simple random sampling is the only one that can be considered a
true random sampling and in this book unless otherwise stated, by random
sampling we mean simple random sampling.

6.3.2 Simple Random Sampling

Before giving the conventional definition of simple random sampling we shall


first consider a mathematical definition.
Let f (x) be the density function of a continuous random variable X
for the population being sampled. We are now interested in the values of
X taken by the elements of the sample. Suppose that k different samples of
size n each are drawn from this population. Let xi^(j) represent the ith
sample value (i = 1, 2, · · · , n) in the jth sample (j = 1, 2, · · · , k).
That is,
        1st sample:   x1^(1) , x2^(1) , · · · , xi^(1) , · · · , xn^(1)
        2nd sample:   x1^(2) , x2^(2) , · · · , xi^(2) , · · · , xn^(2)
           ⋮
        jth sample:   x1^(j) , x2^(j) , · · · , xi^(j) , · · · , xn^(j)
           ⋮
        kth sample:   x1^(k) , x2^(k) , · · · , xi^(k) , · · · , xn^(k)
n

The values in the ith column may be considered the values of a random
variable Xi corresponding to the ith trial of the sample having probability
density function fi (xi ). Thus,

(1) (2) (j) (k)


X1 = x1 , x1 , · · · , x1 , · · · , x1 with density function f1 (x1 )
(1) (2) (j) (k)
X2 = x2 , x2 , ···, x2 , ···, x2 with density function f2 (x2 )
.. ..
. .
(1) (2) (j) (k)
Xi = xi , xi , · · · , xi , · · · , xi with density function fi (xi )
.. ..
. .
j
Xn = x(1) (2) (k)
n , xn , · · · , x(n) , · · · , xn with density function fn (xn )

The probability density function f (xi ) has to fulfill two conditions in order
to describe the process of simple random sampling. These are given in
Definition 6.9.

Definition 6.9 SIMPLE RANDOM SAMPLING


(Mathematical Definition)
Simple random sampling is a method of sampling for which

(a) f (x1 , x2 , ..., xn ) = f1 (x1 )f2 (x2 )...fn (xn ) and

(b) f1 (x1 ) = f2 (x2 ) = ... = fn (xn ) = f (x)

where f (x) is the probability density function of the random variable


X for the population being sampled

Although the random variable X has been treated as continuous variable,


Definition 6.9 applies to both continuous and discrete variables. The definition
means that

(a) the successive selection of the element in the population are indepen-
dent and

(b) the density function of the random variable remains the same from
selection to selection as that of the population.

The independence condition can, to a certain extent, be controlled by com-


paring the empirical distributions for the 1st , 2nd , · · · elements obtained in
a large number of samples. However, it is not at all easy to be sure that
the sample is in fact drawn from a population196 described by the probability
density function f (x).
(a) the successive selection of the element in the population are indepen-
dent TOPICS
ADVANCED and IN INTRODUCTORY
PROBABILITY: A FIRST COURSE IN
(b) the density
PROBABILITY THEORYfunction of the
– VOLUME III random variable remains theDISTRIBUTIONS
SAMPLING same from I Basic Concepts
selection to selection as that of the population.

The independence condition can, to a certain extent, be controlled by com-


paring the empirical distributions for the 1st , 2nd , · · · elements obtained in
a large number of samples. However, it is not at all easy to be sure that
the sample is in fact drawn from a population described by the probability
density function f (x).

Definition 6.10 SIMPLE RANDOM SAMPLE


(Mathematical Definition)
Suppose x1 , x2 , ..., xn are values assumed by a corresponding set of
random variables X1 , X2 , ..., Xn which are independent and which
have the same distribution, then the xi ’s are referred to as a simple
random sample

Conventionally, we shall define simple random sampling and simple random


sample as follows.

Definition 6.11 SIMPLE RANDOM SAMPLING


(Conventional Definition)
Simple random sampling is any technique designed to draw sample
members from a population in such a way that each member in the
population has an equal chance of being selected

Definition 6.12 SIMPLE RANDOM SAMPLE


(Conventional Definition)
A simple random sample is a sample drawn in such a way that

(a) every member of the population has the same probability of


being included in the sample,

(b) different members of the population are selected independently


(that is, selection of one member has no effect on selection of
another)

We must emphasise that in actual process of sampling, it is often quite


difficult to maintain randomness and conscious effort must be made to ensure
that conditions (a) and (b) in Definition 6.12 are fulfilled. There is no
general recipe that can be given to ensure that the results from sampling
are reliable besides the use of objective procedures. Suppose that there is a
sampling frame which is the list of all members of the population of interest.
Then probably, the best method of simple random sampling is the use of a
computer generated table of random numbers.

Random Number Table

A random number table is a table of numbers constructed by a process that


ensures that

(a) in any position in the table, each of the numbers 0 through 9 has equal
    probability of 1/10 of occurring.
(b) The occurrence of any number in one part of the table is independent
of the occurrence of any number in any other part of the table (i.e.
knowing the numbers in one part of the table tells us nothing about
the numbers in another part of the table).
Table X of Appendix A presents a table of random numbers.

We shall illustrate in the following example how random numbers are used
in the selection of a sample.

Example 6.1
Select 15 students from a student population of 250 using a table of random
numbers.

Solution

(a) Assign a number to each member in the population

Compile the list of the students and assign each of the 250 students
a number from 001 to 250. Each student number is called a sampling
number.

(b) Select a page

If the table of random numbers is more than one page then select a
page at random (for example, by writing the page numbers on pieces
of paper and picking one at random). Suppose the third page of Table
X is selected.

(c) Pick an arbitrary starting point in the random table


These random numbers are in groups of five digits. This grouping is
arbitrary. Because our population size (N = 250) has three digits, we
need to pick three-digit numbers from the table.
Suppose we decide to take the first three digits of each five-digit num-
ber. To pick the starting point, close your eyes and place your pencil
point on the table so that we do not subconsciously look for a starting
point that includes a member of interest to us. Assume that your
pencil falls in the fourth column of numbers in the eleventh row.

The number is 46201. The first three digits of this number are 462.
This number is thrown out because it is larger than the population
size (N = 250). There is no student whose number is 462.

(d) Determine the numbers

We select numbers for our sample by moving in some (predetermined)


direction away from the starting point. We can go up, down or across.
Suppose we had decided beforehand to move across the row. Reading
the next first three-digit number we get 045 and the student assigned
number 045 is the first member of our sample. The next three-digit
number going across to the right gives us 425 which we throw out.
Moving across we look for the next three digit number which is not
larger than the population size of 250. This is 211 so the second per-
son in the sample is the student numbered 211. The next three-digit
number is 763 which we throw out. This process of picking numbers
continue until 15 students (the sample size n) are selected. The re-
sulting samples are:

045 211 006 194 214 212 029 276 044 165 154 080
076 043 244

Note
If, by chance, the same number occurs more than once, we would
ignore it, after it has appeared for the first time.
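In practice the same selection is usually carried out in software rather than with a printed table. A minimal sketch in Python (the seed is arbitrary and only makes the illustration reproducible):

    import random

    random.seed(2018)                             # arbitrary seed, for reproducibility only
    sample = random.sample(range(1, 251), k=15)   # 15 distinct sampling numbers from 001 to 250
    print([f"{number:03d}" for number in sample])

random.sample draws without replacement, so no sampling number can occur twice.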

6.4 SAMPLING WITH AND WITHOUT REPLACEMENT

If we select a member from a population we have the choice of replacing or


not replacing the member in the population before a second selection. In
the first case a particular member can come up again and again, whereas in
the second, it can come up only once. This leads us to Definitions 6.13 and
6.14. 199
Note
If, byTOPICS
ADVANCED chance, the same number occurs more than once, we would
IN INTRODUCTORY
ignore it, after it
PROBABILITY: A FIRST COURSEhas appeared
IN for the first time.
PROBABILITY THEORY – VOLUME III SAMPLING DISTRIBUTIONS I Basic Concepts

6.4 SAMPLING WITH AND WITHOUT REPLACEMENT

If we select a member from a population we have the choice of replacing or


not replacing the member in the population before a second selection. In
the first case a particular member can come up again and again, whereas in
the second, it can come up only once. This leads us to Definitions 6.13 and
6.14.

Definition 6.13 RANDOM SAMPLING WITH


REPLACEMENT
If n members are selected from a population of N members, sequen-
tially one by one, such that at every stage the probability of selection
of any member is 1/N irrespective of whether this member has been
selected earlier or not, the procedure is known as random sampling
with replacement

In general, when sampling is with replacement from a population of N mem-


bers, the number of possible samples of size n is equal to Nⁿ.
Sampling with replacement is often associated with an infinite popula-
tion. Thus the expressions “sample from an infinite population”, “sampling
with replacement”, “samples are independent” and “simple random sample”
are all equivalent.
A finite population which is sampled with replacement can theoreti-
cally be considered infinite since samples of any size can be drawn without
exhausting the population. For more practical purposes, sampling from a
finite population which is very large can be considered as sampling from an
infinite population.

Definition 6.14 RANDOM SAMPLING WITHOUT


REPLACEMENT
If at each stage of selection of n members from a population of N
members sequentially one by one, a member is selected with equal
probability only from among members not selected earlier, sampling
is said to be random sampling without replacement

Sampling without replacement is often associated with finite population and


the expressions “sample from a finite population”, “sampling without re-
placement” and “samples are not independent” are all equivalent.
When drawing samples of size n from a finite population of size N
without replacement, and ignoring the order in which the sample values are
drawn, the number of possible samples is given by the combination of N
objects taken n at a time. From Chapter 2 this is given by

        C(N, n) = N! / (n! (N − n)!)
The probability for each subset of n of the N objects of the finite
population is
        1 / C(N, n)
We can also determine the joint probability distributions of the random
variables from a random sample of size n from a finite population by
        f (x1 , x2 , · · · , xn ) = 1 / [N (N − 1) · · · (N − n + 1)]
To generalise: when a simple random sample of size n is drawn from
a finite population with replacement, or from an infinite population with
or without replacement, we have n identically distributed random variables
X1 , X2 , · · · , Xn all possessing the same distribution as the parent population.
Also, if the population is finite but large then even when sampling is without
replacement, the sample observations, xi , can in practice be treated as if
they were identical and independent.
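The counts and probabilities in this section are straightforward to compute (a sketch using math.comb and math.perm from Python 3.8+; N = 250 and n = 15 are simply the numbers of Example 6.1, reused for illustration):

    from math import comb, perm

    N, n = 250, 15

    print(N ** n)           # samples of size n with replacement: N^n
    print(comb(N, n))       # samples without replacement, order ignored: C(N, n)
    print(1 / comb(N, n))   # probability of each such subset
    print(1 / perm(N, n))   # joint probability 1/[N(N−1)···(N−n+1)] of an ordered sample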

EXERCISES

6.1 Write short notes on the following:

(a) Simple random sampling


(b) Systematic sampling
(c) Stratified sampling
(d) Cluster sampling
(e) multistage sampling

6.2 The management of an establishment wants to estimate the age of


twenty workers from the workers population of 300. Use random num-
bers to draw this sample.

Chapter 7

SAMPLING DISTRIBUTIONS II
Sampling Distribution of Statistics

7.1 INTRODUCTION

7.1.1 Definition of Statistic

Having determined in Chapter 6 what a random sample is, we can now examine
the distributions of statistics calculated from random samples. Though we have
already defined what “statistic” is, we shall now present its mathematical
definition.
Definition 7.1 STATISTIC

Let X1 , X2 , ..., Xn be a random sample from a random variable and
let x1 , x2 , ..., xn be the values assumed by the sample. Then the
real-valued function

        Y = G(X1 , X2 , ..., Xn )

is a statistic which assumes the value y = G(x1 , x2 , ..., xn )

Examples of a statistic include

(a) the sample mean: X̄ = (1/n) Σ_{i=1}^n Xi

(b) the sample variance: ŝ² = (1/(n − 1)) Σ_{i=1}^n (Xi − X̄)²

(c) Minimum of the sample: K = min(X1 , X2 , ..., Xn )

(d) Maximum of the sample: M = max(X1 , X2 , ..., Xn )

(e) the sample range: R = M − K

That is, a statistic is a formula for obtaining values of the function. However,
sometimes the term statistic is used to refer to the values themselves. Thus
we may speak of the statistic y = G(x1 , x2 , ..., xn ) when we really should
say that y is the value of the statistic Y = G(X1 , X2 , ..., Xn ).

7.1.2 Sample Observations as Random Variables

Earlier we have referred to sampling as a process of selecting a sample from
a population or the range of a random variable. An observation made from
the range of a random variable is, again, a random variable. This is because
an observed value can assume any one of all possible values of a random
variable.
Suppose X1 , X2 , ..., Xn is a simple random sample and x1 , x2 , ..., xn are
7.1.2 Sample Observations as Random Variables
the actual observations of a specific sample that has been drawn. Thus,
the term “random sample” refers to the observations thought of as random
Earlier we have referred to sampling as a process of selecting a sample from
variables but not to a particular set of actual realisations. However, since
a population or the range of a random variable. An observation made from
the Xi are identically distributed, a specific collection of sample observations
the range of a random variable is, again a random variable. This is because
can be thought of as the particular values of a random variable X. In
an observed value can assume any one of all possible values of a random
connection with this, for a random sample of size n, we may treat the xi
variable.
as realisation of X : X = x1 , X = x2 , ..., X = xn . Consequently, if
Suppose X1 , X2 , ..., Xn is a simple random sample and x1 , x2 , ..., xn are
the sample is random then sample observations thought of as independent
the actual observations of a specific sample that has been drawn. Thus,
random variables
Sampling and sample
Distributions observations thought of as a particular set221
II refers of
the term “random sample” to the observations thought of as random
sample values are equivalent and can be used interchangeably.
variables but not to a particular set of actual realisations. However, since
the
7.1.3 Xi areDefinition
identically and
distributed, a specificof
Construction collection
Sampling of sample observations
Distribution of
can be thought
11
The reason for
Statisticsof as the particular values of a random variable
dividing by n − 1 rather than the seemingly more logical choice, In
X.n, will
be explained in
connection Section
with this,7.5.
for a random sample of size n, we may treat the xi
as
Werealisation
can consider of Xall :possible
X = xsamples
1 , X = ofx2size = xncan
, ..., nXwhich . Consequently,
be drawn from if
the sample is random
the population, and fortheneachsample
sampleobservations
we compute thought of of
a statistic as interest.
independentThe
random variables and sample observations thought of as a particular
idea that sample observations are random variables leads us to the conclusion set of
sample
that anyvalues are computed
quantity equivalentfrom and sample
can be observations
used interchangeably.
must also be a random
variable. As a random variable, the computed quantity or statistic has
a 11probability
The reason fordistribution
dividing by nof
− 1its ownthan
rather called probability
the seemingly moredistribution
logical choice,of
n, the
will
statistic
be or insampling
explained distribution of the statistic which may be established
Section 7.5.
directly from the parent population distribution.

Definition 7.2 SAMPLING DISTRIBUTION OF A STATISTIC


203
The distribution of all possible values that can be assumed by some
Statistics
ADVANCED TOPICS IN INTRODUCTORY
We can consider all possible samples of size n which can be drawnSAMPLING
PROBABILITY: A FIRST COURSE IN
from DISTRIBUTIONS II
the population,
PROBABILITY and for
THEORY each sample
– VOLUME III we compute a statisticSAMPLING
of interest. The
DISTRIBUTION OF STATISTICS
idea that sample observations are random variables leads us to the conclusion
that any quantity computed from sample observations must also be a random
variable. As a random variable, the computed quantity or statistic has
a probability distribution of its own called probability distribution of the
statistic or sampling distribution of the statistic which may be established
directly from the parent population distribution.

Definition 7.2 SAMPLING DISTRIBUTION OF A STATISTIC

The distribution of all possible values that can be assumed by some


statistic, computed from samples of the same size randomly drawn
from the same population, is called the sampling distribution of that
statistic

For a given sampling distribution we are interested in knowing three


things, namely, the mean, the variance and the form of the distribution.
Even though in practice we would not at all attempt to empirically
construct the sampling distribution of a statistic, we can conceptualise the
manner in which it can be done when sampling is from a discrete, finite
population of size N . The procedure is as follows:

(a) Randomly select all possible samples of size n from the finite popula-
tion of size N .

(b) Compute the statistic of interest for each sample.

(c) List the different distinct observed values of the statistic together with
the corresponding frequency of occurrence of each distinct observed
value of the statistic.

(d) Compute the relative frequency (probabilities) of the occurrence of the


222 observed statistic. Advanced Topics in Introductory Probability

For infinite or large finite populations, one could approximate sampling dis-
tribution of a statistic by drawing large number of independent simple ran-
dom samples and by proceeding in the manner just described above. In
fact, the actual construction of a sampling distribution according to the
steps given above is a tedious task. Fortunately, there are theorems that
simplify things for us.
In the subsequent sections we shall introduce sampling distributions
for the most frequently encountered statistics: the mean, proportion and
variance. We shall now introduce an example that will be used most often
in this chapter.

Example 7.1
Suppose a population consists of four numbers: 4, 5, 7, 8.
Find

(a) the population mean;

(b) the population variance and population standard deviation.

Solution

(a) The population mean µ is


N 204
i=1 Xi 4+5+7+8 24
µ= = = =6
Find
ADVANCED TOPICS IN INTRODUCTORY
(a) the population
PROBABILITY: mean; IN
A FIRST COURSE SAMPLING DISTRIBUTIONS II
PROBABILITY THEORY – VOLUME III SAMPLING DISTRIBUTION OF STATISTICS
(b) the population variance and population standard deviation.

Solution

(a) The population mean µ is


N
i=1 Xi 4+5+7+8 24
µ= = = =6
N 4 4

(b) The population variance σ 2 is


N
i=1 (Xi − µ)2
σ 2
=
N
(4 − 6)2 + (5 − 6)2 + (7 − 6)2 + (8 − 6)2
=
4
4+1+1+4
=
4
= 2.5

The population standard deviation is



σ = 2.5 = 1.5811

This e-book
is made with SETASIGN
SetaPDF

PDF components for PHP developers

www.setasign.com

205
ADVANCED TOPICS IN INTRODUCTORY
PROBABILITY: A FIRST COURSE IN SAMPLING DISTRIBUTIONS II
PROBABILITY THEORY – VOLUME
Sampling Distributions II III SAMPLING DISTRIBUTION
223 OF STATISTICS

7.2 SAMPLING DISTRIBUTION OF MEANS

7.2.1 Definition of Sampling Distribution of Means

Let x1 , x2 , ..., xn be the values assumed by a corresponding set of random


variables X1 , X2 , ..., Xn . We refer to the xi ’s as a random sample of size
n. We are accustomed to thinking of the sample mean X as a specific
number calculated from a specific sample. If we consider only one sample,
this is correct. If we consider more than one sample, we need to think of X
differently. The value of X usually changes as the sample changes. When we
select different samples from a population, we generally get different values
for X. The mean X of such a sample is a value assumed by the random
variable
X1 + X2 + ... + Xn
X=
n
The sample mean X therefore is a random variable because X1 , X2 , ..., Xn
corresponding to n trials of the sample values x1 , x2 , ..., xn are random vari-
ables. After a particular random sample has been taken, X will be a number,
but before it has been drawn it will be a random variable capable of assuming
any value that the original variable X can assume.

Definition 7.3 SAMPLING DISTRIBUTION OF MEANS


The probability distribution of the sample statistic X, is called the
sampling distribution for the sample mean or the sampling distribu-
tion of means

7.2.2 Sampling Distribution of Means from Infinite Population


In the previous section we conceptualised the manner in which the sampling
distribution of a statistic can be empirically constructed. In the following
example we shall illustrate the procedure with the sampling distribution of
means.

Example 7.2
Refer to Example 7.1.
(a) List all possible samples Advanced
224 of size twoTopics
that can be drawn from
in Introductory the pop-
Probability
ulation with replacement.

(b) Find
(i) the mean of the distribution of the means;
(ii) the variance and standard deviation of the distribution of the
means.

Solution
(a) N = 4, n = 2
There are N n = 42 = 16 possible samples of size two that can be drawn
from a population of size four (where sampling is with replacement).
These are
(4, 4) (4, 5) (4, 7) (4, 8) (5, 4) (5, 5) (5, 7) (5, 8)
(7, 4) (7, 5) (7, 7) (7, 8) (8, 4) (8, 5) (8, 7) (8, 8)

Note
206
Here the notation (a, b) is an ordered pair. For example, (4, 5) denotes
means.

Solution TOPICS IN INTRODUCTORY


ADVANCED
PROBABILITY: A FIRST COURSE IN SAMPLING DISTRIBUTIONS II
(a) N = THEORY
PROBABILITY 4, n =– 2VOLUME III SAMPLING DISTRIBUTION OF STATISTICS
There are N n = 42 = 16 possible samples of size two that can be drawn
from a population of size four (where sampling is with replacement).
These are
(4, 4) (4, 5) (4, 7) (4, 8) (5, 4) (5, 5) (5, 7) (5, 8)
(7, 4) (7, 5) (7, 7) (7, 8) (8, 4) (8, 5) (8, 7) (8, 8)

Note
Here the notation (a, b) is an ordered pair. For example, (4, 5) denotes
“first a 4 and then a 5” and is different from (5, 4) which denotes “first
a 5 and then a 4.”
(b) (i) The corresponding sample means X i are
4.0 4.5 5.5 6.0 4.5 5.0 6.0 6.5
5.5 6.0 7.0 7.5 6.0 6.5 7.5 8.0
The sample mean and their associated probabilities (relative fre-
quencies) are as follows:

Possible values of X i Frequency Relative Frequency


(Probability)
4.0 1 0.0625
4.5 2 0.1250
5.0 1 0.0625
5.5 2 0.1250
6.0 4 0.2500
6.5 2 0.1250
7.0 1 0.0625
7.5 2 0.1250
8.0 1 0.0625
Total II
Sampling Distributions 16 1.0000 225

For the distribution of the sample means Xi , the expected value is



µX = E(X i ) = xi p(xi )
= 4(0.0625) + 4.5(0.1250) + 5(0.0625) + 5.5(0.1250)
+6(0.2500) + 6.5(0.1250) + 7(0.0625)0 + 7.5(0.1250)
+8(0.0625)
= 6

which is identical with the population mean obtained in Example 7.1(a).


Thus the mean of the sample means is equal to the population mean.

Note
Without calculating the probabilities we could have obtained the mean of
the sampling distribution of means from the raw data (the sample means
themselves) as:

X 4.0 + 4.5 + 5.0 + · · · + 6.5 + 7.0 + 7.5 + 8.0
µX = =
N2 42
96
=
16
= 6

(ii) The variance of the distribution of sample means is:



(Xi − µX )2
2
σX =
Nn 207
(4 − 6)2 + (4.5 − 6)2 + · · · + (8 − 6)2

X 4.0 + 4.5 + 5.0 + · · · + 6.5 + 7.0 + 7.5 + 8.0
µX = =
ADVANCED TOPICS N IN2INTRODUCTORY 42
96
PROBABILITY: A FIRST COURSE IN SAMPLING DISTRIBUTIONS II
=
PROBABILITY THEORY
16 – VOLUME III SAMPLING DISTRIBUTION OF STATISTICS
= 6

(ii) The variance of the distribution of sample means is:



(Xi − µX )2
2
σX =
Nn
(4 − 6)2 + (4.5 − 6)2 + · · · + (8 − 6)2
=
16
4 + 2.25 + 1 + 0.25 + 0 + 0.25 + 1 + 2.25 + 4
=
16
= 1.25

The standard deviation of the distribution of the means is



σX = 1.25 = 1.118

Note
226 variance of the sampling distribution
The of means
Advanced Topics is not equal to
in Introductory the pop-
Probability
ulation variance.
As has been pointed out earlier, it is burdensome to construct the
sampling distribution of means in the way it has been done. In practice, we
shall employ the following two theorems to obtain the mean and variance of
the sample mean.

Theorem 7.1
Let X be a random variable with expectation E(X) = µ and variance
Var(X) = σ 2 . Let X be the sample mean of a random sample of size
n. Then the mean of the sampling distribution of means is given by

µX = E(X) = µ

Sharp
The theorem aboveMinds - Bright
states that Ideas!
the expected value of the sample mean is the
population mean.
Employees at FOSS Analytical A/S are living proof of the company value - First - using The Family owned FOSS group is
new inventions to make dedicated solutions for our customers. With sharp minds and the world leader as supplier of
Proof cross functional teamwork, we constantly strive to develop new unique products - dedicated, high-tech analytical
Since theWould
sample mean
you like to join is
ourdefined
team? as solutions which measure and
control the quality and produc-
FOSS works diligently withX 1 + X2and
innovation + development
· · · + Xn as basis for its growth. It is tion of agricultural, food, phar-
reflected in theX = more than 200 of the 1200 employees in FOSS work with Re-
fact that maceutical and chemical produ-
search & Development in Scandinavia
n
and USA. Engineers at FOSS work in production,
1   cts. Main activities are initiated
development and marketing, within a wide range of different fields, i.e. Chemistry,
E(X) = E(X
Electronics, Mechanics, Software, 1 ) + E(X 2 ) + · · · + E(X
Optics, Microbiology, Chemometrics. n ) from Denmark, Sweden and USA
n with headquarters domiciled in
1 Hillerød, DK. The products are
We offer
= (nµ) marketed globally by 23 sales
n
A challenging job in an international and innovative company that is leading in its field. You will get the
= µ
opportunity to work with the most advanced technology together with highly skilled colleagues.
companies and an extensive net
of distributors. In line with
the corevalue to be ‘First’, the
Read more about FOSS at www.foss.dk - or go directly to our student site www.foss.dk/sharpminds where
company intends to expand
you can learn more about your possibilities of working together with us on projects, your thesis etc.
Theorem 7.2 its market position.

If a Dedicated
populationAnalytical
is infiniteSolutions
or if sampling is with replacement, then
the variance
FOSS
of the sampling 2 ,
distribution of means, denoted by σX
is given by
Slangerupgade 69
3400 Hillerød σ2
Tel. +45 70103370 Var(X) = σ 2
X
=
n
www.foss.dk

That is, the variance of the sampling distribution is equal to the population
variance divided by the size of the sample used to obtain the sampling
distribution. 208
As has
ADVANCED been INpointed
TOPICS out earlier, it is burdensome to construct the
INTRODUCTORY
PROBABILITY: A FIRST COURSE IN in the way it has been done. In practice,
sampling distribution of means we
SAMPLING DISTRIBUTIONS II
shall employ the following two theorems to obtain the mean and variance of
PROBABILITY THEORY – VOLUME III SAMPLING DISTRIBUTION OF STATISTICS
the sample mean.

Theorem 7.1
Let X be a random variable with expectation E(X) = µ and variance
Var(X) = σ 2 . Let X be the sample mean of a random sample of size
n. Then the mean of the sampling distribution of means is given by

µX = E(X) = µ

The theorem above states that the expected value of the sample mean is the
population mean.

Proof
Since the sample mean is defined as
X1 + X 2 + · · · + X n
X =
n
1 
E(X) = E(X1 ) + E(X2 ) + · · · + E(Xn )
n
1
= (nµ)
n
= µ

Theorem 7.2
If a population is infinite or if sampling is with replacement, then
the variance of the sampling distribution of means, denoted by σX 2 ,

is given by
σ2
Var(X) = σX 2
=
n

That is, the variance of the sampling distribution is equal to the population
variance divided by the size of the sample used to obtain the sampling
Sampling Distributions II 227
distribution.

Proof
 
X 1 X2 Xn
Var(X) = Var + + ... +
n n n
1 
= Var(X 1 ) + Var(X 2 ) + ... + Var(X n )
n2
1 σ 2
= 2
(nσ 2 ) =
n n

Note
Theorems 7.1 and 7.2 do not assume Normality of the “parent” population.

Definition 7.4 STANDARD ERROR OF MEAN


The positive square root of the variance of the mean is referred to as
the standard error of the sample mean

The standard error of the mean measures chance variations of sample mean
from sample to sample. Denoting the standard error by σX , it is given by
209
σ
σ =√
Definition 7.4 STANDARD ERROR OF MEAN
ADVANCED TOPICS IN INTRODUCTORY
The positive
PROBABILITY: square
A FIRST root of
COURSE IN the variance
of the mean is referred to SAMPLING
as DISTRIBUTIONS II
the standard
PROBABILITY error
THEORY of the sample
– VOLUME III mean SAMPLING DISTRIBUTION OF STATISTICS

The standard error of the mean measures chance variations of sample mean
from sample to sample. Denoting the standard error by σX , it is given by
σ
σX = √
n

This formula shows that the standard error of the mean decreases when n,
the sample size, is increased. This means that when n is sufficiently large
and we actually have more information, sample means can be expected to
be closer to µ, the quantity which X is usually supposed to estimate.

Example 7.3
Refer to Example 7.1. Employ Theorems 7.1 and 7.2 to obtain the variance
of the sampling distribution of means if a sample of size two is drawn with
replacement from the population.

Solution
From Example 7.2, N = 4, σ 2 = 2.5.
Also, the sample size n = 2. Hence

σ2 2.5
228 Var(X) = = Topics
Advanced = 1.25
in Introductory Probability
n 2

Although Theorems 7.1 and 7.2 give us some characteristics of the sampling
distribution, it does not permit us to calculate probabilities because we do
not know the form of the sampling distributions. To be able to do this we
need to use the Central Limit Theorem, as it can be seen later.

7.2.3 Sampling Distributions of Means from Finite Population


The discussions so far have been based on the assumption that samples are
either drawn with replacement or are drawn from infinite populations. In
general, we do not sample with replacement, and in most practical situations
it is necessary to sample from a finite population; hence, we need to become
familiar with the behaviour of the sampling distribution of the sample mean
when sampling is drawn without replacement or from finite populations.

Example 7.4
Refer to Example 7.1

(a) List all possible samples of size two that can be drawn from the pop-
ulation without replacement.

(b) Find

(i) the mean of the distribution of the means;


(ii) the variance and standard deviation of the distribution of the
means.

Solution
N = 4, n = 2
 
4
(a) There are = 6 samples of size two which can be drawn without
2
replacement, namely,

(4, 5) (4, 7) (4, 8) (5, 7)210 (5, 8) (7, 8)


(b) Find

(i) the
ADVANCED mean
TOPICS of the distribution of the means;
IN INTRODUCTORY
PROBABILITY: A FIRST COURSE IN SAMPLING DISTRIBUTIONS II
(ii) the variance and standard deviation of the distribution of the
PROBABILITY THEORY – VOLUME III SAMPLING DISTRIBUTION OF STATISTICS
means.

Solution
N = 4, n = 2
 
4
(a) There are = 6 samples of size two which can be drawn without
2
replacement, namely,

(4, 5) (4, 7) (4, 8) (5, 7) (5, 8) (7, 8)

Note
Sampling
The Distributions II example, is considered the same as (5, 4).
sample (4, 5) for 229

(b) (i) The means of the corresponding samples, Xi in (a) are


4.5 5.5 6.0 6.0 6.5 7.5
Therefore the mean of the sampling distribution of means is
n
i=1X
µX = 
4
2
4.5 + 5.5 + 6.0 + 6.0 + 6.5 + 7.5
=
6
= 6

Note
Once again the mean of the sampling distribution of means is equal to
the population mean.
(ii) The variance of this sampling distribution is

(Xi − µX )
2
σX =  
N
n
(4.5 − 6)2 + (5.5 − 6)2 + · · · + (7.5 − 6)2
=  
4
2
2.25 + 0.25 + 0 + 0 + 0.25 + 2.25
=
6
5
= = 0.83333
6

5
σX = = 0.9129
6

Note
As in the case of sampling with replacement, the variance of the sam-
pling distribution is not equal to the population variance. Moreover,
it is not equal to the population variance divided by the sample size
(σX2 = 5 = 2.5 ). The formula for obtaining the variance of the sam-
6 4
pling distribution of means in the case of sampling without replace-
ment is given in Theorem 7.3.

211
4
2
ADVANCED TOPICS IN INTRODUCTORY
4.5 + 5.5 + 6.0 + 6.0 + 6.5 + 7.5
= IN
PROBABILITY: A FIRST COURSE SAMPLING DISTRIBUTIONS II
PROBABILITY THEORY – VOLUME III 6 SAMPLING DISTRIBUTION OF STATISTICS
= 6

Note
Once again the mean of the sampling distribution of means is equal to
the population mean.
(ii) The variance of this sampling distribution is

(Xi − µX )
2
σX =  
N
n
(4.5 − 6)2 + (5.5 − 6)2 + · · · + (7.5 − 6)2
=  
4
2
2.25 + 0.25 + 0 + 0 + 0.25 + 2.25
=
6
5
= = 0.83333
6

5
σX = = 0.9129
6

Note
As in the case of sampling with replacement, the variance of the sam-
pling distribution is not equal to the population variance. Moreover,
it is not equal to the population variance divided by the sample size
(σX2 = 5 = 2.5 ). The formula for obtaining the variance of the sam-
6 4
pling distribution of means in the case of sampling without replace-
230 ment is given in Theorem 7.3.
Advanced Topics in Introductory Probability

Theorem 7.3
If X is the mean of a random sample of size n from a finite
population of size N (or if sampling is without replacement), whose
mean is µ and variance is σ 2 , then

E(X) = µ
σ2 N − n
Var(X) = ·
n N −1

The formula for obtaining Var(X) in Theorem 7.2, which applies to values
assumed by independent random variables (or sampling with replacement
from a finite population) and the one in Theorem 7.3, which applies to
sampling without replacement from a finite population, differ by the factor
N −n
. If N , the size of the population, is larger compared to n, the size
N −1
of the sample, the difference between the two formulas becomes negligible.
Indeed, the formula in Theorem 7.2 is frequently used as an approxima-
tion for the variance of the distribution of X for samples obtained without
replacement from sufficiently large finite populations.

Example 7.5
For the data of Example 7.1 if a sample of size two, is drawn without re-
placement, calculate the variance and hence the standard deviation of the
sampling distribution of means.
212
Solution
. If N , the size of the population, is larger compared to n, the size
N −1
of the sample, the difference between the two formulas becomes negligible.
ADVANCED TOPICS IN INTRODUCTORY
Indeed, the formula
PROBABILITY: in Theorem
A FIRST COURSE IN 7.2 is frequently used as an approxima-
SAMPLING DISTRIBUTIONS II
tion for the variance
PROBABILITY THEORY –of the distribution
VOLUME III of X for samples obtained
SAMPLINGwithout
DISTRIBUTION OF STATISTICS
replacement from sufficiently large finite populations.

Example 7.5
For the data of Example 7.1 if a sample of size two, is drawn without re-
placement, calculate the variance and hence the standard deviation of the
sampling distribution of means.

Solution
Since sampling is without replacement, we use the formula in Theorem 7.3
σ 2 = 2.5, N = 4, n=2
σ2 N − n
Var(X) = ·
n N −1
2.5 4 − 2
= ·
2 4−1
= 0.83333
Hence the standard error of the mean is
Sampling Distributions II √ 231
σX = 0.83333 = 0.9129

Note
These results are equal to the ones obtained in Example 7.4

7.2.4 Sampling Distribution of Means from Normal


Distribution

In sampling distribution of means we are assured of at least an approximately


Normal distribution under three conditions. These are when sampling is
from

(i) a normally distributed population;

(ii) a non-normally distributed population but the sample size is large;

(iii) a population whose distribution is unknown but the sample size is


large.

Theorem 7.4
If the population from which random samples are taken is normally
distributed with mean µ and variance σ 2 then the sample mean X
σ2
is normally distributed with mean µ and variance
n

Proof
From Theorem 6.22 of Volume I it follows that
 
t
MX (t) = M 1 (X1 +X2 ...+Xn ) (t) = MX1 +...+Xn
n n

Since the sampling is random (that is, simple random), the variables X1 ,
X2 , .., Xn are independent, and therefore Theorem 6.28 of Volume I may be
applied to give
     
t t t
MX (t) = MX1 MX2 ...MX n (i)
n n n
From Definition 7.2 all the random variables X1 , X2 , ..., Xn have the same
probability density function, namely that of X, and hence the same moment
generating function. Consequently, all the moment
213 generating functions
Since the sampling is random (that is, simple random), the variables X1 ,
X2 , .., Xn are independent, and therefore Theorem 6.28 of Volume I may be
ADVANCED TOPICS IN INTRODUCTORY
applied to give
PROBABILITY: A FIRST COURSE IN SAMPLING DISTRIBUTIONS II
PROBABILITY THEORY – VOLUME III     SAMPLING DISTRIBUTION OF STATISTICS
t t t
MX (t) = MX1 MX2 ...MX n (i)
n n n
From
232 Definition 7.2 all the random variables
Advanced TopicsXin
1, X 2 , ..., Xn have
Introductory the same
Probability
probability density function, namely that of X, and hence the same moment
generating function. Consequently, all the moment generating functions
on the right hand side of (i) are the same function, namely the moment
generating function of the random variable X. Thus
  n
t
MX (t) = MX
n

Since the moment generating function of the Normal distribution N (µ, σ 2 )


is given by (see Theorem 11.7 of Volume I)
1 2 2
MX (t) = eµ t + 2 t σ
,

t
replacing t by in the formula will yield
n
   n
t t2
µ + 12 σ2
MX (t) = e n n2

1 2 σ2
= eµ t+ 2 t n

Therefore, by the Uniqueness Property of m.g.f. (Theorem 6.26 of Volume


σ2
I), X has the Normal distribution N (µ, ).
n
Example 7.6
A random sample of 20 students was taken from a normally distributed
population with mean age of 17 years and variance of 5.

(a) Find the


The Wake
(i) expectation of the sample mean;
thevariance
(ii) only emission we standard
and hence the want todeviation
leave behind
of the sample mean.

(b) What is the probability that the mean of this sample will fall between
16 and 23?

Solution

(a) n = 20, µ = 17, σ2 = 5

(i) E(X) = µ = 17

.QYURGGF'PIKPGU/GFKWOURGGF'PIKPGU6WTDQEJCTIGTU2TQRGNNGTU2TQRWNUKQP2CEMCIGU2TKOG5GTX

6JGFGUKIPQHGEQHTKGPFN[OCTKPGRQYGTCPFRTQRWNUKQPUQNWVKQPUKUETWEKCNHQT/#0&KGUGN6WTDQ
2QYGTEQORGVGPEKGUCTGQHHGTGFYKVJVJGYQTNFoUNCTIGUVGPIKPGRTQITCOOGsJCXKPIQWVRWVUURCPPKPI
HTQOVQM9RGTGPIKPG)GVWRHTQPV
(KPFQWVOQTGCVYYYOCPFKGUGNVWTDQEQO

214
1 2 σ2
= eµ t+ 2 t n

ADVANCED TOPICS IN INTRODUCTORY


Therefore, byAthe
PROBABILITY: Uniqueness
FIRST COURSE INProperty of m.g.f. (Theorem 6.26 of Volume
SAMPLING DISTRIBUTIONS II
PROBABILITY THEORY – VOLUME III σ2 SAMPLING DISTRIBUTION OF STATISTICS
I), X has the Normal distribution N (µ, ).
n
Example 7.6
A random sample of 20 students was taken from a normally distributed
population with mean age of 17 years and variance of 5.

(a) Find the

(i) expectation of the sample mean;


(ii) variance and hence the standard deviation of the sample mean.

(b) What is the probability that the mean of this sample will fall between
16 and 23?

Solution

(a) n = 20, µ = 17, σ 2 = 5


Sampling Distributions II 233
(i) E(X) = µ = 17

σ2 5
(ii) Var(X) = = = 0.25
 n 20
σ2 √
σX = = 0.025 = 0.5
n
(b) Though n = 20 is small, we still apply the Normal distribution because
the population from which the sample is drawn is normally distributed.
 
16 − 17 X − 17 23 − 17
P (16 < X < 23) = P √ < √ < √
0.5 0.5 0.5
= P (−2 < Z < 12)
= Φ(12) − Φ(−2)
= 1 − {1 − Φ(2)}
= Φ(2)
= 0.9872

7.2.5 Sampling Distribution of Means from Non-Normal


Distribution

We are often faced with the problem of sampling from non-normally dis-
tributed populations or from populations whose distributions are not known.
Under these two conditions, we shall have to take large samples since when
the sample size is sufficiently large, by virtue of the Central Limit Theorem
(Theorem 5.16), the means of random samples from any distribution will
σ2
tend to be normally distributed with a mean µ and variance . This per-
n
mits the use of the Normal distribution approximation as though sampling
is from normally distributed populations. Thus inference procedures based
on the sample mean can often use the Normal distribution. But we must be
careful not to imput normality to the original observations.

Example 7.7
A random sample of size 100 is taken from a population with mean 60 and
variance 300.

(a) Find the 215

(i) expectation of the sample mean,


σ
tend to be normally distributed with a mean µ and variance . This per-
n
mits the use of the Normal distribution approximation as though sampling
ADVANCED TOPICS IN INTRODUCTORY
is from normally
PROBABILITY: distributed
A FIRST COURSE INpopulations. Thus inference procedures SAMPLING
based DISTRIBUTIONS II
on the sample
PROBABILITY mean can
THEORY often use
– VOLUME III the Normal distribution.SAMPLING
But we must be
DISTRIBUTION OF STATISTICS
careful not to imput normality to the original observations.

Example 7.7
A random sample of size 100 is taken from a population with mean 60 and
variance 300.

(a) Find the


234 Advanced Topics in Introductory Probability
(i) expectation of the sample mean,
(ii) variance and hence the standard error of the sample mean.

(b) What is the probability that the sample mean will be less than 56?

Solution
n = 100 µ = 60 Var(X) = 300

(a) (i) E(X) = µ = 60


300
(ii) Var(X) = =3
100 √
and hence σ = 3 = 1.732

(b) The sampling distribution of the mean is not known but the sample
size is large so by the Central Limit Theorem we approximate it by
the Normal distribution.
 
X n − 60 56 − 60
P (X n < 56) = P √ < √
3 3
= P (Zn < −2.31)
= Φ(−2.31)
= 1 − Φ(2.31)
= 1 − 0.9896
= 0.0104

7.3 SAMPLING DISTRIBUTION OF PROPORTIONS

7.3.1 Construction of Sampling Distribution of Proportions

In the previous section we discussed the sampling distribution of a statistic


computed from measured variables, namely, the sample mean. In this section
we shall deal with the sampling distribution of statistics that result from
counts or frequency data, namely, a sample proportion.
k
Let a population proportion, p, be defined as p = , where k is the
N
number of members of the population that possess a certain trait and N is
the total number
Sampling of members
Distributions II of the population. Let us also designate 235
the
x
sample proportion by the symbol p̂ (read p hat) and is defined as , where
n
x is the number of elements in the sample that possess a certain trait and n
is the sample size. The sample distribution of the sample proportion could
be constructed experimentally in exactly the same manner as was suggested
in the case of the sample mean. From the population, which we assume to
be finite, we would

(a) take all possible samples of a given size,

(b) for each sample compute the sample proportion p̂,

(c) prepare a frequency distribution of p̂ by 216


listing the different values of
p̂ along with their frequencies of occurrence.
x is the number of elements in the sample that possess a certain trait and n
is the sample size. The sample distribution of the sample proportion could
ADVANCED TOPICS IN INTRODUCTORY
be constructed
PROBABILITY: experimentally
A FIRST COURSE INin exactly the same manner as was suggested
SAMPLING DISTRIBUTIONS II
in the case of the sample
PROBABILITY THEORY – mean.
VOLUME III From the population, which we assume
SAMPLING to
DISTRIBUTION OF STATISTICS
be finite, we would

(a) take all possible samples of a given size,

(b) for each sample compute the sample proportion p̂,

(c) prepare a frequency distribution of p̂ by listing the different values of


p̂ along with their frequencies of occurrence.

This frequency distribution (as well as the corresponding relative frequency


distribution) would constitute the sampling distribution of p̂.

7.3.2 Sampling Distribution of Proportions With Replacement

A sample proportion p̂ may be considered as a proportion of success which


is obtained by dividing the number of success by the sample size n. Hence if
a random sample of size n is obtained with replacement, then the sampling
distribution of p̂ obeys the binomial probability law.

Theorem 7.5
If sampling is with replacement, then

(a) E(p̂) = p
p(1 − p)
(b) Var(p̂) =
n
where p̂ is the sample proportion

Example 7.8
A consignment of bulbs has 20 percent defective. If a random sample of 250
is drawn with replacement from this consignment. What is the

(a) expectation of the sample proportion?

(b) variance and hence the standard error of the sample proportion?
Challenge the way we run

EXPERIENCE THE POWER OF


FULL ENGAGEMENT…

RUN FASTER.
RUN LONGER.. READ MORE & PRE-ORDER TODAY
RUN EASIER… WWW.GAITEYE.COM

1349906_A6_4+0.indd 1 22-08-2014 12:56:57

217
p(1 − p)
(b) Var(p̂)
ADVANCED TOPICS=IN INTRODUCTORY
n
PROBABILITY: A FIRST COURSE IN SAMPLING DISTRIBUTIONS II
where p̂ is
PROBABILITY the sample
THEORY proportion
– VOLUME III SAMPLING DISTRIBUTION OF STATISTICS

Example 7.8
A consignment of bulbs has 20 percent defective. If a random sample of 250
is drawn with replacement from this consignment. What is the

(a) expectation of the sample proportion?


236 Advanced Topics in Introductory Probability
(b) variance and hence the standard error of the sample proportion?

Solution
p = 0.2, n = 250

(a) E(p̂) = p = 0.2

(b)

p(1 − p)
Var(p̂) =
n
0.2(1 − 0.2)
= = 0.00064
√ 250
σp̂ = 0.00064 = 0.0253

7.3.3 Sampling Distribution of Proportions Without


Replacement

If the sampling is without replacement, then the sampling distribution of p̂


obeys the hypergeometric probability law.

Theorem 7.6
If sampling is without replacement, then

(a) E(p̂) = p
p(1 − p) N − n
(b) Var(p̂) = ·
n N −1
where p̂ is the sample proportion

That is, E(p̂) is still identical with the population proportion p but the
N −n
variance is adjusted by a finite population correction factor .
N −1
Example 7.9
In a class of 50 students, twenty are females and thirty are males. A random
sample of twelve students are drawn from this class without replacement.
Determine the variance and hence the standard error of the proportion of
male students in the sample of twelve.

218
ADVANCED TOPICS IN INTRODUCTORY
PROBABILITY: A FIRST COURSE IN SAMPLING DISTRIBUTIONS II
PROBABILITY THEORY – VOLUME
Sampling Distributions II III SAMPLING DISTRIBUTION
237 OF STATISTICS

Solution
Let k represent the number of female students in the class. Hence

N = 50, k = 20, n = 12
The population proportion of females in the class is therefore
k 20
p= = = 0.4
N 50
Since this is sampling without replacement, we have
p(1 − p) N − n
σp̂2 = Var(p̂) = ·
n N −1
0.4(1 − 0.4) 50 − 20
= ·
12 50 − 1
= 0.012245

σp̂ = 0.012245 = 0.1107

7.3.4 Sampling Distribution of Proportions from Non-Normal


Population

The distribution of p̂ for a random sample taken with replacement ap-


proaches the Normal distribution when n becomes infinite. For a random
sampling without replacement, when the sample size is large, the distribu-
tion of sample proportions is approximately normally distributed by virtue
of the Central Limit Theorem.

Theorem 7.7
The variable
p̂ − p

σp̂2
approaches the Standard Normal distribution when n becomes
infinite, where p̂ is a sample proportion that occurs in a sample of
size n

The question that now arises is how large does the sample size have
to be for the use of the Normal approximation to be valid? A widely used
238
criterion is that both np and n(1 − p) must
Advanced be greater
Topics than 5. Probability
in Introductory

Example 7.10
Refer to Example 7.8; what is the probability that 22% will be defective?

Solution
n = 250, p = 0.2
A specific value of p̂, denoted by p0 is p0 = 0.22. We are required to calculate
P (p̂ ≤ p0 ).
 
 
 0.22 − 0.2 
P (p̂ ≤ 0.22) = P Z ≤ 
 

 0.2(1 − 0.2) 
250
 
0.02
= P Z≤
0.0253
= P (Z ≤ 0.7906) = Φ(0.7906)
219
= 0.61141
P (p̂ ≤ p0 ).  

 

ADVANCED TOPICS IN INTRODUCTORY  0.22 − 0.2 
P (p̂ ≤ 0.22) = P 

 Z ≤  


PROBABILITY: A FIRST COURSE IN 
 0.22
0.2(1−−0.2
0.2)

 SAMPLING DISTRIBUTIONS II
P (p̂ ≤– 0.22)
PROBABILITY THEORY VOLUME=III P Z ≤ 
  SAMPLING DISTRIBUTION OF STATISTICS

 0.2(1250
− 0.2) 
 
0.02 250
= P Z ≤ 
0.02
0.0253
= P Z≤
= P (Z ≤ 0.7906)
0.0253 = Φ(0.7906)
=
= P (Z ≤
0.611410.7906) = Φ(0.7906)
= 0.61141
The Normal approximation may be improved by the continuity correction
factor,
The a device
Normal that makes an
approximation mayadjustment
be improvedfor the
by fact
the that a discrete
continuity distri-
correction
bution ais device
factor, being that
approximated
makes an by a continuous
adjustment distribution.
for the fact that a The correction
discrete distri-
factor isismore
bution beingimportant if n isby
approximated small. With the distribution.
a continuous continuity correction factor,
The correction
factor is more important if n is small. With the continuity correction factor,
 

 0.5 
 p0 + − p
P (p̂ ≤ p0 ) = P   Z ≤  0.5
n 


 p 0 + − p 

, x < np

P (p̂ ≤ p0 ) = P Z ≤ 
 p(1n − p) 

, x < np
 p(1 n− p) 
 
n

 p0 + 0.5 
 − p
= Φ 0.5
n


 p
0+ − p


= Φ   p(1n− p)  
 
 p(1 n− p) 
or n
or
 

 0.5 
 p0 − − p
P (p̂ ≤ p0 ) = P 
 Z ≤  0.5
n 
 , x > np

 p 0 − − p 


P (p̂ ≤ p0 ) = P Z ≤ 
 p(1n − p) 

, x > np
 p(1 n− p) 
n

Technical training on
WHAT you need, WHEN you need it
At IDC Technologies we can tailor our technical and engineering
training workshops to suit your needs. We have extensive OIL & GAS
experience in training technical and engineering staff and ENGINEERING
have trained people in organisations such as General
ELECTRONICS
Motors, Shell, Siemens, BHP and Honeywell to name a few.
Our onsite training is cost effective, convenient and completely AUTOMATION &
customisable to the technical and engineering areas you want PROCESS CONTROL
covered. Our workshops are all comprehensive hands-on learning
experiences with ample time given to practical sessions and MECHANICAL
demonstrations. We communicate well to ensure that workshop content ENGINEERING
and timing match the knowledge, skills, and abilities of the participants.
INDUSTRIAL
We run onsite training all year round and hold the workshops on DATA COMMS
your premises or a venue of your choice for your convenience.
ELECTRICAL
For a no obligation proposal, contact us today POWER
at training@idc-online.com or visit our website
for more information: www.idc-online.com/onsite/

Phone: +61 8 9321 1702


Email: training@idc-online.com
Website: www.idc-online.com

220
ADVANCED TOPICS IN INTRODUCTORY
PROBABILITY: A FIRST COURSE IN SAMPLING DISTRIBUTIONS II
PROBABILITY THEORY – VOLUME III SAMPLING DISTRIBUTION OF STATISTICS
Sampling Distributions II 239

 
 p0 − 0.5 − p 
 n 
= Φ
 


 p(1 − p) 
n

Example 7.11
Refer to Example 7.9. Calculate the probability that out of the selected
twelve students, five or less of them will be females.

Solution
5
n = 12, p = 0.4 p0 = = 0.42
12
   
0.5
 0.42 − − 0.4 
 12 
P (p̂ ≤ 0.42) = P Z ≤
  

 0.4(1 − 0.4) 
12
 
−0.002
= P Z≤
0.01414
= Φ(−0.1414) = 1 − Φ(0.01414)
= 1 − 0.0056
= 0.9944

7.4 SAMPLING DISTRIBUTION OF DIFFERENCES

7.4.1 Construction of Sampling Distribution of Differences

There are many problems in applied statistics where our interest is in two
populations, such as knowing something about the difference between two
population means or the difference between two population proportions. In
one situation we may wish to know if it is reasonable to conclude that the
two population means (or proportions) are different. In another situation we
may wish to know the magnitude of the difference between two population
means (or proportions). A knowledge of the sampling distribution of the
240
difference between two means Advanced Topics isinuseful
(or proportions) Introductory Probability
in investigations of
this type.
To empirically construct the sampling distribution of the difference
between two sample means (or proportions) we would adopt the following
procedure.
Suppose there are two populations, Population 1 and Population 2. We
would draw all possible random
 samples of size n1 from Population 1 of size
N1
N1 . There would be such samples. For each set of sample data, the
n1
sample statistic (mean or proportion) would be computed. From Population
2 of size N2 we 
would draw separately and independently all possible random
N2
samples of size in all. The sample mean (or proportion) of each sample
n2
is computed and the difference between all possible pairs of the means (or
proportions) is taken. The sampling distribution of the difference between
sample means (or proportions) would consist of all such distinct differences,
accompanied by their frequencies or relative frequencies of occurrence.
221
7.4.2 Sampling Distribution of Difference between Means
N1
N1 . There would be such samples. For each set of sample data, the
n1
ADVANCED TOPICS IN INTRODUCTORY
sample statistic (mean or proportion) would be computed. From Population
PROBABILITY: A FIRST COURSE IN SAMPLING DISTRIBUTIONS II
2 of size N2 we
PROBABILITY would

THEORY –draw
VOLUMEseparately
III and independently allSAMPLING
possible random
DISTRIBUTION OF STATISTICS
N2
samples of size in all. The sample mean (or proportion) of each sample
n2
is computed and the difference between all possible pairs of the means (or
proportions) is taken. The sampling distribution of the difference between
sample means (or proportions) would consist of all such distinct differences,
accompanied by their frequencies or relative frequencies of occurrence.

7.4.2 Sampling Distribution of Difference between Means

Theorem 7.8
Let X11 , X12 , ..., X1n1 , X21 , X22 , ...X2n2 be n1 + n2 independent ran-
dom variables, the first n1 , having identical distributions with mean
µ1 and variance σ12 , and the remaining n2 having identical distribu-
tions with mean µ2 and variance σ22 . Then

E(X 1 − X 2 ) = µ1 − µ2

and
σ12 σ22
Var(X 1 − X 2 ) = +
n1 n2
where X 1 and X 2 are the sample means of Populations 1 and 2
respectively

Example 7.12
Two brands
Sampling of tyres areIIbeing compared by a motor firm. Brand A has
Distributions 241a
mean life of 24,000 km and a standard deviation of 1,200 km, while Brand

B has a mean life of 22,000 km and a standard deviation of 1,080 km.


A random sample of 100 and 90 tyres were taken from Brand A and Brand B
respectively. Compute the mean and variance of the sampling distribution
of the difference between the two means.

Solution
n1 = 100 n2 = 90
µ1 = 24, 000 µ2 = 22, 000
σ1 = 1, 200 σ2 = 1, 080

µ1 − µ2 = 24, 000 − 22, 000 = 2, 000


(1200)2 (1080)2
Var(X 1 − X 2 ) = +
100 90
= 14, 400 + 12, 960
= 27, 360

When sampling is from non-normally distributed populations or from


populations whose distribution are not known, we need to take large samples
so that we can apply the Central Limit Theorem, under which the differ-
ence between two samples is approximately normally distributed with mean
σ2 σ2
µ1 − µ2 and variance 1 + 2 . To find probabilities
222
associated with spe-
n n
cific values of the statistic, our procedure would be the same as that given
100 90
= 14, 400 + 12, 960
ADVANCED TOPICS IN INTRODUCTORY= 27, 360
PROBABILITY: A FIRST COURSE IN SAMPLING DISTRIBUTIONS II
PROBABILITY THEORY – VOLUME III SAMPLING DISTRIBUTION OF STATISTICS

When sampling is from non-normally distributed populations or from


populations whose distribution are not known, we need to take large samples
so that we can apply the Central Limit Theorem, under which the differ-
ence between two samples is approximately normally distributed with mean
σ2 σ2
µ1 − µ2 and variance 1 + 2 . To find probabilities associated with spe-
n n
cific values of the statistic, our procedure would be the same as that given
when sampling is from normally distributed populations. For example, the
problem in Example 7.12 is considered to be normally distributed because
n1 and n2 are both far greater than 30.

7.4.3 Sampling Distribution of Difference between Proportions

Suppose two random samples of different sizes are drawn from binomial
populations. If we need to compare the number of successes in the samples
242have to work with their proportions.
we Advanced Topics in Introductory Probability

Theorem 7.9
If independent random samples of different sizes n1 and n2 are drawn
from two binomial populations with proportions p1 and p2 respec-
tively, then the distribution of the difference between the two sample
proportions, p̂1 − p̂2 has mean

E(p̂1 − p̂2 ) = p1 − p2

and variance
p1 (1 − p1 ) p2 (1 − p2 )
Var(p̂1 − p̂2 ) = +
n1 n2

Example 7.13
In a national election, 60 percent of voters in Community A are in favour
of a certain candidate and 50 percent in Community B are in favour of the
candidate. If a sample of 210 voters from Community A and 160 voters from
Community B are drawn, find

(a) the mean of the difference between the sample proportions;

(b) the variance and standard deviation of the sampling distribution of


difference between the two sample proportions.

Solution

p1 = 0.6 p2 = 0.5 n1 = 200 n2 = 160


(a) E(p̂1 − p̂2 ) = p1 − p2 = 0.6 − 0.5 = 0.1

p1 (1 − p1 ) p2 (1 − p2 )
(a) Var(p̂1 − p̂2 ) = +
n1 n2
0.6(1 − 0.6) 0.5(1 − 0.5)
= +
200 160
0.6(0.4) 0.5(0.5)
= +
200 160
= 0.0012 + 223
0.00156
Solution
ADVANCED TOPICS IN INTRODUCTORY
p1 =COURSE
PROBABILITY: A FIRST
0.6 p2IN= 0.5 n1 = 200 n2 = 160 SAMPLING DISTRIBUTIONS II
PROBABILITY THEORY – VOLUME III SAMPLING DISTRIBUTION OF STATISTICS
(a) E(p̂1 − p̂2 ) = p1 − p2 = 0.6 − 0.5 = 0.1

p1 (1 − p1 ) p2 (1 − p2 )
(a) Var(p̂1 − p̂2 ) = +
n1 n2
0.6(1 − 0.6) 0.5(1 − 0.5)
= +
200 160
Sampling Distributions II 0.6(0.4) 0.5(0.5) 243
= +
200 160
= 0.0012 + 0.00156
= 0.00276

σ(p̂1 −p̂2 ) = 0.00276 = 0.0525

If n1 and n2 are sufficiently large then the sampling distribution of


p̂1 − p̂2 is approximately Normal. That is, by the Central Limit Theorem
 
 
 (p01 − p02 ) − (p1 − p2 ) 
P (p̂1 − p̂2 ≤ p01 − p02 ) ≈ Φ  
 

 p1 (1 − p1 ) p2 (1 − p2 ) 
+
n1 n2

Example 7.14
In Example 7.13 if in the sample of voters from Community A and Commu-
nity B there were 30 and 20 voters respectively in favour of the candidate
what is the probability of obtaining this or a smaller difference in the sample
proportions if the belief about the population parameters is correct?

Solution

x1 = 30, x2 = 20
x1 30 �e Graduate Programme
I joined MITAS because
p01 = = = 0.143 for Engineers and Geoscientists
n1 210
I wanted real responsibili� 20
www.discovermitas.com
Maersk.com/Mitas �e G
p02I joined
x2 MITAS because
= = = 0.125 for Engine
I wanted
n 2 real responsibili�
160 Ma
p01 − p02 = 0.143 − 0.125 = 0.018
p1 − p2 = 0.6 − 0.5 = 0.1

The desired probability is


 
 
 (p01 − p02 ) − (p1 − p2 ) 
P (p̂1 − p̂2 ≤ 0.018) ≈ Φ 

 p1 (1 − p1 )


p2 (1 − p2 )  Month 16
+

n1 n2

I was a construction Mo

 0.018 − 0.1


supervisor ina const
I was
= Φ 

 0.6(0.4)


0.5(0.5) 
the North Sea super
+
200 160 advising and the No
Real work he
helping foremen advis
International
al opportunities
Internationa
�ree wo
work
or placements ssolve problems
Real work he
helping fo
International
Internationaal opportunities
�ree wo
work
or placements ssolve pr

224
Example 7.14
In Example 7.13 if in the sample of voters from Community A and Commu-
ADVANCED TOPICS IN INTRODUCTORY
nity B there Awere
PROBABILITY: 30COURSE
FIRST and 20 INvoters respectively in favour of the candidate
SAMPLING DISTRIBUTIONS II
what is the probability
PROBABILITY of obtaining
THEORY – VOLUME III this or a smaller difference in the sample
SAMPLING DISTRIBUTION OF STATISTICS
proportions if the belief about the population parameters is correct?

Solution

x1 = 30, x2 = 20
x1 30
p01 = = = 0.143
n1 210
x2 20
p02 = = = 0.125
n2 160
p01 − p02 = 0.143 − 0.125 = 0.018
p1 − p2 = 0.6 − 0.5 = 0.1

The desired probability is


 
 
 (p01 − p02 ) − (p1 − p2 ) 
P (p̂1 − p̂2 ≤ 0.018) ≈ Φ  
 

 p1 (1 − p1 ) p2 (1 − p2 ) 
+
n1 n2
 
 
 0.018 − 0.1 
= Φ 


244  0.6(0.4)
Advanced 0.5(0.5)  Probability
Topics in Introductory
+
200 160
 
−0.082
= Φ √
0.0012 + 0.00276
 
−0.082
= Φ
0.0629
= Φ(−1.30) = 0.0968

7.5 SAMPLING DISTRIBUTION OF VARIANCE

If X1 , X2 , ..., Xn denote the random variables for a sample j of size n, then


the random variable s2j which denotes the sample variance is defined by

(X1j − X j )2 + (X2j − X j )2 + ... + (Xnj − X j )2


s2j =
n
where X nj , for instance, means the nth value of the j th sample and X j is
the mean of the j th sample. To find the sampling distribution of variance we
need to calculate the mean and variance of all the possible sample values.

7.5.1 Sampling Distribution of Variance with Replacement

Example 7.15
With reference to Examples 7.1, if a sample of size two is drawn with re-
placement from the population, find

(a) the mean of the sampling distribution of variances;

(b) the variance of the sampling distribution of the variances.

Solution
The sample values and means have been obtained in Example 7.2. For the
sake of convenience, we reproduce them here. The sample values i for each
j th sample of size two (Xij ) are:
225
(4, 4) (4, 5) (4, 7) (4, 8) (5, 4) (5, 5) (5, 7) (5, 8)
placement from the population, find
ADVANCED TOPICS IN INTRODUCTORY
(a) the mean of the sampling distribution
PROBABILITY: A FIRST COURSE IN
of variances; SAMPLING DISTRIBUTIONS II
PROBABILITY THEORY – VOLUME III SAMPLING DISTRIBUTION OF STATISTICS
(b) the variance of the sampling distribution of the variances.

Solution
The sample values and means have been obtained in Example 7.2. For the
sake of convenience, we reproduce them here. The sample values i for each
j th sample of size two (Xij ) are:

(4, 4) (4, 5) (4, 7) (4, 8) (5, 4) (5, 5) (5, 7) (5, 8)


(7, 4) (7, 5) II
Sampling Distributions (7, 7) (7, 8) (8, 4) (8, 5) (8, 7) (8, 8) 245

The corresponding sample means for each j th sample (Xj ) are:

4.0 4.5 5.5 6.0 4.5 5.0 6.0 6.5


5.5 6.0 7.0 7.5 6.0 6.5 7.5 8.0

The corresponding sample variances s2j for the 16 samples are:

0 0.25 2.25 4.00 0.25 0 1.00 2.25


2.25 1.00 0 0.25 4.00 2.25 0.25 0

Note that we have defined the sample variance as


n
i=1 (Xij − X j )2
s2j =
n

(a) Mean of sampling distribution of variances


k 2
j=1 sj
E(s2j ) = µsj =
Nn
0 + 0.25 + 2.25 + ... + 0.25 + 0
=
16
= 1.25

(b) The variance of the sampling distribution of variances


k
j=1 (sj − µs2 )2
2
Var(s2j ) =
Nn
(0 − 1.25)2 + (0.25 − 1.25)2 + (2.25 − 1.25)2 + ...
=
16
+(0.25 − 1.25)2 + (0 − 1.25)2
16
1.5625 + 1.0000 + 1.0000 + ... + 1.0000 + 1.5625
=
16
29.5
=
16
= 1.84375

226
ADVANCED TOPICS IN INTRODUCTORY
PROBABILITY: A FIRST COURSE IN SAMPLING DISTRIBUTIONS II
PROBABILITY THEORY – VOLUME III SAMPLING DISTRIBUTION OF STATISTICS
246 Advanced Topics in Introductory Probability

Theorem 7.10
Let X1 , ..., Xn be a random sample of size n, from a random variable
X with expectation µ and variance σ 2 . Let s2 be the sample variance
defined as n
1 
s2 = (X − X)2
n i=1
If sampling is from an infinite population or with replacement from
a finite population, then
n−1 2
E(s2 ) = σ
n

Proof

(Xi − X)2 = [(Xi − µ) − (X − µ)]2


Taking the sum of both sides
n
 n

(Xi − X) 2
= [(Xi − µ)2 − 2(X − µ)(Xi − µ) + (X − µ)2 ]
i=1 i=1
n
 n

= [(Xi − µ)2 − 2(X − µ) (Xi − µ)
i=1 i=1
+n(X − µ)2 ] (i)
But
n
 n
 
(Xi − µ) = Xi − µ
i=1 i=1
= nX − n µ
= n(X − µ) (ii)
Substituting (ii) into (i) we obtain
n
 n

(Xi − X)2 = (X − µ)2 − 2n (X − µ)2 + n(X − µ)2
i=1 i=1
n

= (Xi − µ)2 − n(X − µ)2
i=1

www.job.oticon.dk

227
(Xi − X)2 = [(Xi − µ)2 − 2(X − µ)(Xi − µ) + (X − µ)2 ]
i=1 i=1
ADVANCED TOPICS IN INTRODUCTORY
n n

PROBABILITY: A FIRST =
COURSE [(X
INi − µ)2 − 2(X − µ) (Xi − µ) SAMPLING DISTRIBUTIONS II
PROBABILITY THEORY – VOLUME
i=1 III i=1 SAMPLING DISTRIBUTION OF STATISTICS

+n(X − µ)2 ] (i)


But
n
 n
 
(Xi − µ) = Xi − µ
i=1 i=1
= nX − n µ
= n(X − µ) (ii)
Substituting (ii) into (i) we obtain
n
 n

(Xi − X)2 = (X − µ)2 − 2n (X − µ)2 + n(X − µ)2
i=1 i=1
n
Sampling Distributions II  247
Sampling Distributions =
II (Xi − µ)2 − n(X − µ)2 247
i=1

Taking the expectation of both sides, we obtain:


Taking the expectation of both sides, we obtain:

n  
n   
E   (Xi − X)22  = E   (Xi − µ)22  − nE (X − µ)22 
n n
E i=1 (Xi − X) = E i=1 (Xi − µ) − nE (X − µ)
i=1 n i=1

n
=  E(Xi − µ)22 − nE[(X − µ)22 ]
= i=1
E(Xi − µ) − nE[(X − µ) ]
i=1
n
n
=  Var(Xi ) − n Var(X)
= i=1
Var(Xi ) − n Var(X)
i=1  
 σ2 
= nσ 22
− n σ2
= nσ − n n
n
= (n − 1)σ 22 (iii)
= (n − 1)σ (iii)
Now,
Now, n
1 n
= 1
s22  (X − X)2
s = n i=1 (Xii − X)2
n i=1
and
and  
n
1  
n
E(s22 ) = E 1  (Xi − X)2
2
E(s ) = E n i=1 (Xi − X)
n i=1
 n 
1   n

= 1 E (Xi − X)
= n E i=1 (Xi − X)
n
n − 1 i=1
= n − 1 σ 22 (iv)
= n σ (iv)
n
which is very nearly σ 22 for large values of n (say n ≥ 30). Thus, if we use s22
which is very nearly2σ for large values of n (say n ≥ 30). Thus, if we use s
as an estimate of σ 2 we shall be underestimating the population variance.
as an estimate of σ we shall be underestimating the population variance.
To eliminate such a bias we often use what is called the desired unbiased
To eliminate such a bias we often use what is called the desired unbiased
estimator of variance (also called the corrected sample variance) defined by
estimator of variance (also called the corrected sample variance) defined by
n
ŝ22 = n s22
ŝ = n − 1 s
n−1
From this definition
From this definition  
 n 
E(ŝ22 ) = E n s22
E(ŝ ) = E n − 1 s
nn−1 2
= n E(s )
= n − 1 E(s2 )
n−1

228
ADVANCED TOPICS IN INTRODUCTORY
PROBABILITY: A FIRST COURSE IN SAMPLING DISTRIBUTIONS II
PROBABILITY THEORY – VOLUME III SAMPLING DISTRIBUTION OF STATISTICS
248 Advanced Topics in Introductory Probability

n n−1 2
= · σ
n−1 n
= σ2

The computational formula for the corrected sample variance is


n 1 
ŝ2 = · (Xi − X)2
n−1 n

(Xi − X)2
=
n−1
Because of this, some statisticians choose to define the sample variance by
simply replacing n by n−1 in the denominator of the definition of s2 given in
Example 7.15. Others continue to define the sample variance by maintaining
the n in the denominator, because by doing this many later results are
simplified. In practice the corrected sample variance is used when n < 30.
When n ≥ 30 the sample variance is used. In fact if n is large there is very
little difference in the two results.

Example 7.16
Refer to Example 7.1. List all possible samples of size two that can be drawn
from the population with replacement and calculate

(a) the variance of the sample values;

(b) the corrected sample variance;

(c) the expectation of the sample variance.

Solution
From Example 7.2, N = 4, σ 2 = 2.5. From Example 7.3, n = 2 and
N n = 16. The mean of the sample means is 6.

(a) To find the variance we subtract the mean of the sample means from
each of the sample means, square it and divide it by the number of
the sample means.
Thus,

(4 −II6)2 + (4.5 − 6)2 + · · · + (7.5 − 6)2 + (8 − 6)2


Sampling Distributions 249
s2 =
16
= 1.25
(b) The corrected sample variance is:
(4 − 6)2 + (4 − 6)2 + · · · + (8 − 6)2 + (8 − 8)2
ŝ2 =
16 − 1
= 2.503

From the results in Example 7.16, we may note the following.


(i) The expectation of the sample variance is
n−1 2 2−1 
E(s2 ) = σ = 2.5 = 1.25
n 2
which is the same as the mean of the sampling distribution of variances
obtained in Example 7.15(a).
(ii) As the number of sample values increases, s2 and ŝ2 tend to be the
same. Using the definitional formula for the corrected sample variance
we obtain
n 2 229
ŝ2 = s2 = (1.25) = 2.5
n−1 2−1
(i) The expectation of the sample variance is
n−1
ADVANCED TOPICS IN INTRODUCTORY 2−1 
E(s2
) = 2.5 = 1.25
σ2 =
PROBABILITY: A FIRST COURSE IN n 2 SAMPLING DISTRIBUTIONS II
PROBABILITY THEORY – VOLUME III SAMPLING DISTRIBUTION OF STATISTICS
which is the same as the mean of the sampling distribution of variances
obtained in Example 7.15(a).
(ii) As the number of sample values increases, s2 and ŝ2 tend to be the
same. Using the definitional formula for the corrected sample variance
we obtain
n 2
ŝ2 = s2 = (1.25) = 2.5
n−1 2−1
which is the same as the value obtained in part (b) of the example
above with a difference of 0.003.
(iii) The expectation of the corrected sample variance is
n 2
E(ŝ2 ) = E(s2 ) = (1.25) = 2.5 = σ 2
n−1 2−1
(see Example 7.1).

7.5.2 Sampling Distribution of Variance Without Replacement

Example 7.17
With reference to Examples 7.1, if a sample of size two is drawn without
replacement, find
(a) the mean, (b) the variance
of the sampling distribution of variances.

Solution
The sample values and means have been obtained in Example 7.5. For the
sake of convenience, we reproduce them here.
(4, 5) (4, 7) (4, 8) (5, 7) (5, 8) (7, 8)

WHY WAIT FOR


PROGRESS?
DARE TO DISCOVER
Discovery means many different things at
Schlumberger. But it’s the spirit that unites every
single one of us. It doesn’t matter whether they
join our business, engineering or technology teams,
our trainees push boundaries, break new ground
and deliver the exceptional. If that excites you,
then we want to hear from you.

careers.slb.com/recentgraduates

230
ADVANCED TOPICS IN INTRODUCTORY
PROBABILITY: A FIRST COURSE IN SAMPLING DISTRIBUTIONS II
PROBABILITY
250 THEORY – VOLUME Advanced
III SAMPLING
Topics in Introductory DISTRIBUTION OF STATISTICS
Probability

The corresponding sample means are

(4.5) (5.5) (6.0) (6.0) (6.5) (7.5)

(a) Mean of sampling distribution of variances:


s2
E(s2j ) = µs2 =  
j
N
n
0.25 + 2.25 + 4.00 + 1.00 + 2.25 + 0.25
=
6
10
= = 1.6667
6

(b) The variance of sampling distribution of variances is

k
j=1 (sj − µs2 )
2
Var(s2j ) =  
4
2
(0.25 − 1.6667)2 + (2.25 − 1.6667)2
=
6
(4.00 − 1.6667) + (1 − 1.6667)2
2
+
6
(2.25 − 1.6667) + (0.25 − 1.6667)2
2
+
6
2.0070 + 0.3404 + · · · + 0.3404 + 2.0070
=
6
10.5836
=
6
Sampling Distributions =
II 1.7639 251

Theorem 7.11
Let X1 , X2 , ..., Xn be a random sample of size n, with expectation µ
and variance σ 2 . Let s2 be the sample variance defined as
n
1 
s2 = (Xi − X)2
n i=1

If sampling is without replacement from a finite population of size


N , then the variance of the sampling distribution of variances is
  
N n−1
Var(s ) =
2
σ2
N −1 n

Note
As N → ∞ this result reduces to that in Theorem
231 7.10.
Var(s ) = σ
N −1 n
ADVANCED TOPICS IN INTRODUCTORY
PROBABILITY: A FIRST COURSE IN SAMPLING DISTRIBUTIONS II
PROBABILITY THEORY – VOLUME III SAMPLING DISTRIBUTION OF STATISTICS

Note
As N → ∞ this result reduces to that in Theorem 7.10.

Example 7.18
With reference to Examples 7.17 and 7.5, find the variance of the sample
variance s2 if sampling is without replacement.

Solution
N = 4, n = 2, σ 2 = 2.5 (from Example 7.2). Hence,

  
4 2−1
Var(s ) =
2
(2.5)
4−1 2
= 1.6667

which is equal to the results obtained in Example 7.17.

To find the sampling distribution of the sample variance, it is more


252
convenient to do so in terms ofAdvanced Topics
the related in Introductory
random variable. Probability

Theorem 7.12
If random samples of size n are taken from a population having a
Normal distribution, and if

i (Xi
− X)2
s =
2
n−1
then
(n − 1)s2
σ2
has a Chi-square distribution with n − 1 degrees of freedom

Note
Chi-square distribution is discussed in Chapter 8.

Theorem 7.13
If X is normally distributed with mean µ and variance σ 2 and
X1 , X2 ..., Xn is a random sample of size n of X, then the random
variable 
(Xi − µ)2
V = i
σ2
will possess a Chi-square distribution with n degrees of freedom

This chapter concludes discussions on basic theory on sampling and


sampling distributions. The next chapter continues discussions on sampling
distributions which are based on small samples.

EXERCISES

232

7.1 A population consists of five values 2, 4, 6, 8 and 14.


This chapter
ADVANCED TOPICS INconcludes discussions on basic theory on sampling and
INTRODUCTORY
sampling distributions.
PROBABILITY: The next
A FIRST COURSE IN chapter continues discussions on sampling
SAMPLING DISTRIBUTIONS II
PROBABILITY
distributionsTHEORY – VOLUME
which are III small samples.
based on SAMPLING DISTRIBUTION OF STATISTICS

EXERCISES

7.1 A population consists of five values 2, 4, 6, 8 and 14.

(a) Calculate the mean and variance of the population.


Sampling
(b) Distributions 253
II distinct samples of size 2 drawn with replacement
List all possible
from the population and compute the mean of each sample.
(c) Construct the sampling distribution of the sample means and
calculate the mean and variance of the distribution.
(d) Comment on your results.

7.2 Rework Exercise 7.1 for the case when a random sample of size 2 is
drawn without replacement.

7.3 Referring to Exercise 7.1, calculate

(a) the expectation of the sample mean;


(b) the variance and hence the standard deviation of the sample mean
if sampling is made; (i) with replacement; (ii) without replace-
ment.

7.4 A random sample of 25 adults is taken from a Normal population with


mean height of 1.65 metres and standard deviation of 0.8 metres. Find

(a) (i) the expectation of the sample mean;


(ii) the variance and standard deviation of the sample mean;
(b) the probability that the height of an adult selected at random
will be
(i) less than 1.50 metres; PREPARE FOR A
(ii)
LEADING ROLE.
not less than 1.70 metres; (iii) between 1.55 and 1.68.

7.5 The weights of the products of a certain factory have a mean of 140
kilograms and a standard deviation of 20 kilograms. IfEnglish-taught
180 of the MSc programmes
products are selected at random, find in engineering: Aeronautical,
Biomedical, Electronics,
(a) (i) the expectation of the sample mean; Mechanical, Communication
systems
(ii) the variance and standard deviation of the sample mean;and Transport systems.
No weigh
(b) the probability that a product selected at random will tuition fees.
(i) at most 150 kilograms; (ii) at least 120 kilograms;
(iii) between 110 and 155 kilograms.

7.6 The time that a cashier spends in processing each person’s order is an
E liu.se/master
independent random variable with mean of 45 minutes and standard
deviation of 30 minutes. What is the approximate probability that the
orders of 100 persons can be processed in less than 4000 minutes?


233
(a) the expectation of the sample mean;
ADVANCED TOPICS IN INTRODUCTORY
(b) theAvariance
PROBABILITY: and hence
FIRST COURSE IN the standard deviation of the sample SAMPLING
mean DISTRIBUTIONS II
if sampling is made; (i) with replacement; (ii) without replace-
PROBABILITY THEORY – VOLUME III SAMPLING DISTRIBUTION OF STATISTICS
ment.

7.4 A random sample of 25 adults is taken from a Normal population with


mean height of 1.65 metres and standard deviation of 0.8 metres. Find

(a) (i) the expectation of the sample mean;


(ii) the variance and standard deviation of the sample mean;
(b) the probability that the height of an adult selected at random
will be
(i) less than 1.50 metres;
(ii) not less than 1.70 metres; (iii) between 1.55 and 1.68.

7.5 The weights of the products of a certain factory have a mean of 140
kilograms and a standard deviation of 20 kilograms. If 180 of the
products are selected at random, find

(a) (i) the expectation of the sample mean;


(ii) the variance and standard deviation of the sample mean;
(b) the probability that a product selected at random will weigh
(i) at most 150 kilograms; (ii) at least 120 kilograms;
(iii) between 110 and 155 kilograms.

7.6 The time that a cashier spends in processing each person’s order is an
independent random variable with mean of 45 minutes and standard
254 deviation of 30 minutes. Advanced
What is the approximate
Topics probability
in Introductory that the
Probability
orders of 100 persons can be processed in less than 4000 minutes?

7.7 Suppose a population consists of 10 identical balls in a box, four of


which are white and six green. A random sample of five is drawn
without replacement. Find

(a) the expectation of the sample proportion of white balls;


(b) the variance and hence the standard error of the sample propor-
tion.

7.8 Steel wires produced by a certain factory have a mean tensile strength
of 1, 000 kilograms and a variance of 900 kilograms. If a random sample
of 250 wires is drawn from the production line during a certain month
with a total output of 100, 000 units, find

(a) (i) the expectation and (ii) the variance of the sample mean
tensile strength;
(b) the probability that the sample mean will
(i) be more than 1, 010 kilograms;
(ii) be less than 1, 003 kilograms;
(iii) be between 995 and 1, 006 kilograms;
(iv) differ from 1, 000 by 5 kilograms or more;
(v) differ from 1, 000 by at most 7 kilograms.

7.9 Repeat Exercise 7.7 for the case when sample is with replacement.

7.10 A group of children consists of twenty males and twenty-five females.


A random sample of twenty-three children is drawn from this group
with replacement.

(a) Find
(i) the expectation of the proportion of females;
234
(ii) the variance and hence the standard error of the sample pro-
(iii)
(iii) be
be between
between 995
995 and
and 1,
1, 006
006 kilograms;
kilograms;
(iv)
ADVANCED(iv) differ
differ
TOPICS from
from 1, 000 by 5 kilograms or
1, 000
IN INTRODUCTORY by 5 kilograms or more;
more;
(v)
PROBABILITY: A differ
FIRST from 1,
COURSE 000
IN by at most 7 kilograms.
(v) differ from 1, 000 by at most 7 kilograms.
SAMPLING DISTRIBUTIONS II
PROBABILITY THEORY – VOLUME III SAMPLING DISTRIBUTION OF STATISTICS
7.9
7.9 Repeat
Repeat Exercise
Exercise 7.7
7.7 for
for the
the case
case when
when sample
sample is
is with
with replacement.
replacement.
7.10
7.10 A
A group
group of
of children
children consists
consists of
of twenty
twenty males
males and
and twenty-five
twenty-five females.
females.
A random sample of twenty-three children is drawn
A random sample of twenty-three children is drawn fromfrom this
this group
group
with replacement.
with replacement.

(a)
(a) Find
Find
(i)
(i) the
the expectation
expectation of
of the
the proportion
proportion of
of females;
females;
(ii)
(ii) the variance and hence the standard error
the variance and hence the standard error of
of the
the sample
sample pro-
pro-
portion of females.
portion of females.
(b)
(b) What
What isis the
the probability
probability that
that the
the sample
sample proportion
proportion of
of females
females is
is
at least 0.6?
at least 0.6?
7.11
7.11 Rework
Rework Exercise
Exercise 7.10
7.10 if
if sampling
sampling was
was made
made without
without replacement.
replacement.
7.12
7.12 Thirty-six
Sampling percent
percentIIof
Distributions
Thirty-six of University
University staff
staff are
are against
against aa strike
strike action. 255
action. If
If aa
sample of 120 of them are drawn at random with replacement,
sample of 120 of them are drawn at random with replacement, find find
(a) (i) the expectation of the sample proportion,
(ii) the variance and hence the standard error of the sample pro-
portion,
of the University staff who are against a strike action.
(b) What is the probability that the proportion of the University staff
who are against the strike will be between 0.5 and 0.7.
7.13 Rework Exercise 7.12 if sampling is without replacement and if there
are 400 university staff members.
7.14 A box contains 80 black balls and 60 white balls. Two samples of 40
marbles each are randomly selected with replacement from the box
and their colours noted. Suppose that the first and second samples
contained 30 and 25 black balls respectively. Find
(a) the variance of the difference in proportion of black balls in the
two sample;
(b) the probability that the difference between the two samples does
not exceed 10 balls.
7.15 Rework Exercise 7.14 if sampling is without replacement.
7.16 A random sample of size 5 is drawn from a population which is nor-
mally distributed with mean 49 and variance 9. A second random
sample of size 4 is drawn from a different population which is also nor-
mally distributed with mean 39 and variance 4. The two samples are
independent with means X 1 and X 2 respectively. Let W = X 1 − X 2 .
Calculate (a) the mean and variance of W ; (b) P (W > 8.2).
7.17 With reference to Exercise 7.2, find
(a) (i) the mean; (ii) the variance;
of the sampling distribution of variances;
(b) the corrected variance for sampling.
7.18 With reference to Exercise 7.1, find
(a) (i) the mean (ii) the variance
of the sampling distribution of variances;
(b) the corrected variance for sampling.

235
ADVANCED TOPICS IN INTRODUCTORY
PROBABILITY: A FIRST COURSE IN
PROBABILITY THEORY – VOLUME III Distributions Derived from Normal Distribution

Chapter 8

DISTRIBUTIONS DERIVED FROM NORMAL


DISTRIBUTION

8.1 INTRODUCTION

So far we have encountered distributions which are related to the Normal


distribution on the basis of the Central Limit Theorem. Recall that by
the Central Limit Theorem (CLT) the probability distribution of a sample
statistic from an independent identical non-Normal distribution is approxi-
mately Normal if the sample size is sufficiently large. When the sample size
is small we cannot apply the CLT or assume that sampling distributions
are Normal. The sampling distribution in this case differs from one case
to another. Besides the binomial and Poisson distributions there are three
probability distributions, which are byproducts of the Normal distribution
such can be described by one of them under certain conditions. These are
the Chi-square (χ2 ) distribution, the Student’s t (briefly t) distribution, and
the Fisher’s variance ratio (briefly F ) distribution.

8.2 χ2 DISTRIBUTION

Perhaps, the Chi-square distribution is, of all statistical distributions, the


most widely used one in statistical applications. We have already encoun-
Click here
tered the Chi-square distribution in Chapter 19 where we considered to sam-
learn more

TAKE THE
pling distribution of the sample variance.

RIGHT TRACK
256

Give your career a head start


by studying with us. Experience the advantages
of our collaboration with major companies like
ABB, Volvo and Ericsson!

Apply by World class


www.mdh.se
15 January research

236
probability distributions, which are byproducts of the Normal distribution
such can be
ADVANCED described
TOPICS by one of them under certain conditions. These are
IN INTRODUCTORY
the Chi-square (χ2 ) distribution, the Student’s t (briefly t) distribution, and
PROBABILITY: A FIRST COURSE IN
the Fisher’s THEORY
PROBABILITY variance– ratio
VOLUME(briefly
III F ) distribution.
Distributions Derived from Normal Distribution

8.2 χ2 DISTRIBUTION

Perhaps, the Chi-square distribution is, of all statistical distributions, the


most widely used one in statistical applications. We have already encoun-
tered the Chi-square distribution in Chapter 19 where we considered sam-
pling distribution
Distributions of the
Derived sample
From variance.
Normal Distribution 257

256
8.2.1 Definition of Chi-square Distribution

Recall from Volume II (Nsowah-Nuamah, 2018) that the Chi-square distri-


bution is a special type of the Gamma distribution with α = v2 and β = 2.

Definition 8.1 CHI-SQUARE DISTRIBUTION


A continuous random variable X is said to have a Chi-square dis-
tribution with parameter v > 0 if its probability density function is
given by
1 v x
f (x) = v  v  x 2 −1 e− 2 , x>0
22 Γ 2

The Chi-square distribution is completely defined by its parameter v


which is called the number of degrees of freedom (df). If a random variable
X follows a Chi-square distribution with v degrees of freedom then we write
X ∼ χ2(v) . Fig. 8.1 depicts the density curves of the Chi-square distributions
for a few selected values of v.

Fig. 8.1 χ2 Distribution with various Degrees of Freedom

237
ADVANCED TOPICS IN INTRODUCTORY
PROBABILITY: A FIRST COURSE IN
PROBABILITY
258 THEORY – VOLUME Advanced
III Distributions
Topics Derived
in Introductory from Normal Distribution
Probability

8.2.2 Properties of Chi-square Distribution

Property 1

Theorem 8.1
Suppose X has a χ2 with parameter v distribution. Then the
moment-generating function of X is
v 1
MX (t) = (1 − 2t)− 2 , t<
2

The proof of this theorem is left as an exercise for the reader (see Exercise
8.1)

Example 8.1
Suppose X has a moment generating function (m.g.f)

1
MX (t) = (1 − 2t)−10 , for t <
2

then this is a χ2 distribution. What is its degree of freedom.

Solution
The Chi-square distribution has v degrees of freedom. The exponent of the
m.g.f. equals −10. Hence −10 = − v2 which gives v = 20.

Example 8.2
Suppose X has a Chi-square distribution with 5 degrees of freedom. Find
its m.g.f.

Solution
From Theorem 8.1 with v = 5, the moment generating function of X is

Distributions Derived From Normal Distribution


v 5 259
MX (t) = (1 − 2t)− 2 = (1 − 2t)− 2

Property 2

Theorem 8.2
Suppose X has a χ2 distribution with parameter v. Then

(a) E(X) = v (b) Var(X) = 2 v

Proof
Since the Chi-square distribution is a special case of the Gamma distribution
when α = v2 and β = 2, its expectation and variance is obtained simply by
substituting these values in the formulas of Theorem 10.18. Thus,
Example 8.3
238
Refer to Example 8.2. Find the (a) mean (b) variance.
Suppose X has a χ2 distribution with parameter v. Then

(a) E(X) = v
ADVANCED TOPICS IN INTRODUCTORY (b) Var(X) = 2 v
PROBABILITY: A FIRST COURSE IN
PROBABILITY THEORY – VOLUME III Distributions Derived from Normal Distribution

Proof
Since the Chi-square distribution is a special case of the Gamma distribution
when α = v2 and β = 2, its expectation and variance is obtained simply by
substituting these values in the formulas of Theorem 10.18. Thus,
Example 8.3
Refer to Example 8.2. Find the (a) mean (b) variance.

Solution
(a) E(X) = v = 5 (b) Var(X) = 2 v = 10

Property 3

Theorem 8.3
The cumulative distribution function of the Chi-square distribution
is given by   x
v v v x
22 Γ t 2 −1 e− 2 dt
2 0

Property 4

Theorem 8.4
If X1 , X2 , ..., Xn are independent random variables having Chi-square
distributions with v1 , v2 , ... , vn degrees of freedom respectively,
then the distribution of Y = X1 + X2 + ... + Xn will possess a
Chi-square distribution with v1 + v2 + ... + vn degrees of freedom

How will people travel in the future, and


how will goods be transported? What re-
sources will we use, and how many will
we need? The passenger and freight traf-
fic sector is developing rapidly, and we
provide the impetus for innovation and
movement. We develop components and
systems for internal combustion engines
that operate more cleanly and more ef-
ficiently than ever before. We are also
pushing forward technologies that are
bringing hybrid vehicles and alternative
drives into a new dimension – for private,
corporate, and public use. The challeng-
es are great. We deliver the solutions and
offer challenging jobs.

www.schaeffler.com/careers

239
The cumulative distribution function of the Chi-square distribution
is given by   x
ADVANCED TOPICS IN INTRODUCTORY
v v v x
22 Γ t 2 −1 e− 2 dt
PROBABILITY: A FIRST COURSE IN 2 0
PROBABILITY THEORY – VOLUME III Distributions Derived from Normal Distribution

Property 4

Theorem 8.4
If X1 , X2 , ..., Xn are independent random variables having Chi-square
distributions with v1 , v2 , ... , vn degrees of freedom respectively,
then the distribution of Y = X1 + X2 + ... + Xn will possess a
Chi-square distribution with v1 + v2 + ... + vn degrees of freedom
260 Advanced Topics in Introductory Probability

Corollary 8.1
If X1 , X2 , ..., Xn are independent random variables having Chi-square
distributions with 1 degree of freedom each, then the distribution of
Y = X1 + X2 + ... + Xn will possess a Chi-square distribution with n
degrees of freedom.

Corollary 8.2
If X and Y are independent and X ∼ χ2v and Y ∼ χ2u , then
Z = X + Y ∼ χ2v+u .

Property 5

Theorem 8.5
If X1 , and X2 are independent random variables, X1 has a Chi-
square distribution with v1 degrees of freedom and X1 + X2 has a
Chi-square distribution with v (> v1 ) degrees of freedom, then X2
has a Chi-square distribution with v − v1 degrees of freedom.

The Chi-square distribution is positively skewed (Fig. 8.1). It has been


extensively tabulated. Table I in Appendix A contains the values of χ2α,v
for  2
  χα,v
P χ2 ≤ χ2α,v = f (x) dx = 1 − α
0
where

α = 0.005, 0.01, 0.025, 0.05, 0.95, 0.975, 0.99, 0.995 and v = 1, 2, ..., 30

It is a convention in statistics to use the same symbol χ2 for both the random
variable and a value of that random variable. Thus percentile values of the
Chi-square distribution with v degrees of freedom is denoted by χ2α,v or
simply χ2α if n is understood, where the suffix α is used from now on and
in applications of probability and Statistics, to denote “lower percentile”
(100α% point).
Example 8.4
Find from Table I the following:
(a) χ20.025, 13 (b) χ20.99, 5 (c) χ20.99, 28 (d) χ20,95, 3 (e) χ20.975, 10

240
ADVANCED TOPICS IN INTRODUCTORY
PROBABILITY: A FIRST COURSE IN
PROBABILITY THEORY – VOLUME III Distributions Derived from Normal Distribution
Distributions Derived From Normal Distribution 261

Solution

(a) This is a Chi-square for probability 0.025 and degrees of freedom of 13.
Now, referring to Table I, we proceed downward under column labelled
d.f. until we reach entry 12. Then we proceed right to meet column
headed χ20.025 . The value at that meeting point, which is 24.736, is
the required value of χ20.025,12 . Therefore the probability that a χ2 -
distributed random variable with 12 degrees of freedom does not ex-
ceed 24.736 is 0.025.

(b) This is a Chi-square for probability 0.99 and degree of freedom of 5.


Proceeding as in (a), that is, we go to Table I and proceed downward
under column v until we reach the entry 4. Then proceed right to
meet column headed χ20.99 . The value at the meeting point 0.554 is
the required value of χ20.99,5 . Therefore the probability that a χ2 -
distributed random variable with 5 degrees of freedom does not exceed
0.554 is 0.99.

The reader should complete this example.

Example 8.5
Suppose a certain Chi-square distribution value for a probability of 0.975 is
30.2. What is the degree of freedom.

Solution
Refer to Table I and proceed downward under column 0.975 until you locate
the value 30.2 in the body of the table. Then proceed left to meet column
headed v. The value of v = 17 is the required degrees of freedom.

8.2.3 Relationship Between Chi-Square and Normal


Distributions

Chi-square Distribution appears in the theory associated with random vari-


ables which are normally distributed.

Theorem 8.6
Let Z be a Standard Normal variable. Then U = Z 2 is a Chi-square
distribution with 1 degree of freedom
262 Advanced Topics in Introductory Probability

The Chi-square distribution with 1 degree of freedom is denoted by χ2(1) .

Proof
In Theorem 10.28, we proved that

1 u 1 t
P (U ≤ u) = √ t− 2 e− 2 dt (i)
2π 0
If we put v = 1 in the Chi-square cumulative distribution function (Theorem
8.3), that is, if we consider Chi-square
 √ distribution with 1 degree of freedom,
and recalling also that Γ 2 = π, we will obtain expression (i).
1

Note
If X ∼ N (µ, σ), then
 2
X −µ
∼ χ21
σ
241
1 1 t
P (U ≤ u) = √ t− 2 e− 2 dt (i)
2π 0
ADVANCED TOPICS IN INTRODUCTORY
If we put v = 1 in the Chi-square cumulative distribution function (Theorem
PROBABILITY: A FIRST COURSE IN
8.3), that is, THEORY
PROBABILITY if we consider
 Chi-square
– VOLUME  III√ distribution with 1 degree
Distributions of freedom,
Derived from Normal Distribution
and recalling also that Γ 2 = π, we will obtain expression (i).
1

Note
If X ∼ N (µ, σ), then
 2
X −µ
∼ χ21
σ

Theorem 8.7
Let Z1 , Z2 , ... , Zv be n independent Standard Normal random vari-
ables. Then the sum of the squares of these variables is a Chi-square
variable, χ2 , with v degrees of freedom. That is,

χ2(v) = Z12 + Z22 + ... + Zv2


v

= Zi2 where Zi are independent
i=1

Example 8.6
Suppose Z1 , Z2 , ... , Zv is a random sample from the Standard Normal dis-
tribution. Find the number c such that
 22 

P Zi2 > c = 0.10
i=1

Solution
22

By Theorem 8.7, Zi has a χ2 distribution with 22 degrees of freedom.
i=1


678'<)25<2850$67(5©6'(*5((

&KDOPHUV8QLYHUVLW\RI7HFKQRORJ\FRQGXFWVUHVHDUFKDQGHGXFDWLRQLQHQJLQHHU
LQJDQGQDWXUDOVFLHQFHVDUFKLWHFWXUHWHFKQRORJ\UHODWHGPDWKHPDWLFDOVFLHQFHV
DQGQDXWLFDOVFLHQFHV%HKLQGDOOWKDW&KDOPHUVDFFRPSOLVKHVWKHDLPSHUVLVWV
IRUFRQWULEXWLQJWRDVXVWDLQDEOHIXWXUH¤ERWKQDWLRQDOO\DQGJOREDOO\
9LVLWXVRQ&KDOPHUVVHRU1H[W6WRS&KDOPHUVRQIDFHERRN

242
Suppose Z1 , Z2 , ... , Zv is a random sample from the Standard Normal dis-
tribution. Find the number c such that
ADVANCED TOPICS IN INTRODUCTORY
 22 
PROBABILITY: A FIRST COURSE IN
P III Zi2
PROBABILITY THEORY – VOLUME >c = Distributions
0.10 Derived from Normal Distribution
i=1

Solution
22
Distributions Derived From Normal Distribution 263
By Theorem 8.7, Zi has a χ2 distribution with 22 degrees of freedom.
i=1

Reading from Table I in Appendix A the row labelled 22 d.f. and the column
headed by an upper-tail area of 0.10, we get the number 30.813. Therefore
 22 

P Zi2 > 30.813 = 0.10
i=1

which may equivalently be stated as


 22 

P Zi2 ≤ 30.813 = 0.90
i=1

Thus c = 30.813

Theorem 8.8
The Chi-square distribution approaches the Normal distribution
N (v, 2v) as the degrees of freedom v gets large

In practice when v > 30, Chi-square probabilities can be computed by


employing Normal functions in the usual fashion.

8.2.4 Relationship Between Chi-Square and Multinomial


Distributions

Theorem 8.9
Suppose X1 , X2 , ... , Xs has a joint multinomial distribution with
parameters n, p1 , p2 , · · · , ps . Then the sum
s
 (Xi − npi )2
i=1
npi

tends to the Chi-square distribution as n → ∞

Theorem 8.9 is very important in applied statistics. It is used for goodness-


of-fit test and contingency-table tests (tests of independence and homogene-
ity). The details of these tests may be found in Statistics books.
264 Advanced Topics in Introductory Probability

8.3 t DISTRIBUTION

The t-distribution was first introduced by W. Gosset who published his work
under the name of ”Student”.

8.3.1 Definition of t Distribution

Definition 8.2 t DISTRIBUTION


A continuous random variable X is said to have the Student’s t dis-
243
tribution with parameter v if its probability density function is given
8.3 t DISTRIBUTION
8.3 t DISTRIBUTION
ADVANCED TOPICS IN INTRODUCTORY
PROBABILITY: A FIRST
The t-distribution COURSE
was IN
first introduced by W. Gosset who published his work
The t-distribution
PROBABILITY THEORYwas first introduced
– VOLUME
under the name of ”Student”. III by W. Distributions
Gosset who published
Derivedhis work
from Normal Distribution
under the name of ”Student”.
8.3.1 Definition of t Distribution
8.3.1 Definition of t Distribution

Definition 8.2 t DISTRIBUTION


Definition
A continuous8.2 t DISTRIBUTION
random variable X is said to have the Student’s t dis-
A continuous random variable
tribution with parameter is said to have
v if itsXprobability thefunction
density Student’s t dis-
is given
tribution
by with parameter v if its probability density function is given
by
 
v + 1   v+1
Γ v+1  2 − v+1
2
2
x
f (x) = Γ 2  1 + x 2 − 2 −∞<x<∞
f (x) = √v π Γ  v  1 + v −∞<x<∞
√ v v
vπΓ 2
2

The parameter v is called the degree of freedom. The t distribution with v


The parameter
degrees of freedomv is is
called the denoted
usually degree ofbyfreedom. The t distribution
tα,v or simply tv . Fig. 8.2 with v
shows
degrees
the curveofoffreedom is usually
the density of thedenoted by tα,vfororvarious
t distribution simplydegrees
tv . Fig.of8.2 shows
freedom.
the curve of the density of the t distribution for various degrees of freedom.

Distributions Derived From Normal Distribution 265


Fig. 8.2 t Distribution with various Degrees of Freedom
Fig. 8.2 t Distribution with various Degrees of Freedom
8.3.2 Properties of t Distribution

Property 1
The t distribution is symmetric and for α = 0.50, t(1−α),v = −tα,v

Property 2

Theorem 8.10
Suppose a random variable X has the Student’s t distribution with
parameter v. Then

(a) E(X) = 0, for v > 1


v
(a) Var(X) = , for v > 2
v−2

Thus a t-distribution possesses no mean when v = 1 and the variance does


not exist for v ≤ 2.

Property 2

244
Theorem 8.11
(a) Var(X) = , for v > 2
v−2
ADVANCED TOPICS IN INTRODUCTORY
PROBABILITY: A FIRST COURSE IN
Thus a t-distribution
PROBABILITY THEORY – possesses
VOLUME IIIno mean when v = 1 and the
Distributions variance
Derived does
from Normal Distribution
not exist for v ≤ 2.

Property 2

Theorem 8.11
If X1 , X2 · · · , Xn is a random sample from the Normal population
having the mean µ and variance σ 2 , and let X be the sample mean,
then
X −µ
T = √
ŝ/ n
has the Student’s t distribution with v = n − 1 degrees of freedom
where n
1 
ŝ =
2
(Xi − X)2
n − 1 i=1

The t-distribution has been tabulated. Table II in Appendix A contains


values of t(1−α),v for
266 Advanced Topics in Introductory Probability
α = 0.10, 0.05, 0.025, 0.01, 0.005 and v = 1, 2, 3, 4, ..., 30, 40, 60, 120, ∞,
where tα,v is such that the area to the left of tα,v under the curve of the t
distribution with v degrees of freedom is equal to 1 − α. That is, t1−α,v is
such that

 tα,v
P (t ≤ tα,v ) = f (x) dx = 1 − α
−∞

To say that P (t ≤ tα,v ) = 1 − α is the same as saying that

P (t > t−α,v ) = 1 − P (t > tα; v )

It is therefore uncommon to find texts reporting the α instead of 1 − α at


the top of the t distribution table.

Example 8.7
If t distribution has 25 degrees of freedom, find
(a) t0.975,25 (b) t0.05,25

Solution

(a) From Table II in Appendix A with v = d.f. = 25 and 1 − α = 0.975


we proceed downward under column d.f. until we reach 25. Then we
proceed right to meet column headed 0.975. The value at that meeting
point, which is 2.0595 is the required value.

Therefore the probability that a random variable with t distribution


with 25 degrees of freedom does not exceed 3.345 is 0.975.

(b) The value of t at α = 0.05 can be obtained from the value of t at


1 − α = 0.95 in Table II in Appendix A. Using the same procedure as
in (a), the reader will see that t0.05,25 = 1.708.
245
Therefore the probability that a t- distributed random variable with
proceed right to meet column headed 0.975. The value at that meeting
point, which is 2.0595 is the required value.
ADVANCED TOPICS IN INTRODUCTORY
Therefore
PROBABILITY: the probability
A FIRST COURSE IN that a random variable with t distribution
PROBABILITY THEORY – VOLUME III Distributions Derived from Normal Distribution
with 25 degrees of freedom does not exceed 3.345 is 0.975.

(b) The value of t at α = 0.05 can be obtained from the value of t at


1 − α = 0.95 in Table II in Appendix A. Using the same procedure as
in (a), the reader will see that t0.05,25 = 1.708.

Therefore
Distributions the probability
Derived that Distribution
From Normal a t- distributed random variable with
267
25 degrees of freedom does not exceed 1.708 is 0.90.

8.3.3 Relationship Between t, χ2 and Normal Distributions

Theorem 8.12
If Z is a Standard Normal variable and χ2 is a Chi-square distribution
with v degrees of freedom and Z and χ2 are independently distributed
then the distribution
Z
T =
χ2
v
has a Student’s t distribution with v degrees of freedom

Proof
We simply express the given ratio in a different form as follows:
 
X − µ
 
 σ 

X −µ n
s = 
√ s2
n
σ2
Z
= 
(n − 1)s2
(n − 1)σ 2
Z
= 
χ2n−1
n−1

(n − 1)s2
since = χ2n−1 by Theorem 7.12
σ2
By Definition 8.11, this is a t distribution with n − 1 degrees of freedom.

268
Note Advanced Topics in Introductory Probability
This
268 quantity is usually called Advanced
‘Student’sTopics
t and the corresponding
in Introductory distribu-
Probability
tion is called ‘Student’s t distribution.
Theorem 8.13
The t curve approaches a Normal curve as n approaches infinity
Theorem 8.13
The t curve approaches a Normal curve as n approaches infinity

The t distribution, however, has a greater dispersion than the Standard


Normal distribution. In practice, we can treat tn as N (0, 1) when n > 30.
The t distribution, however, has a greater dispersion than the Standard
Normal distribution. In practice, we can treat tn as N (0, 1) when n > 30.
8.4 F DISTRIBUTION
246
8.4 F DISTRIBUTION
Theorem
ADVANCED 8.13
TOPICS IN INTRODUCTORY
The t curve approaches
PROBABILITY: A FIRST COURSEaIN
Normal curve as n approaches infinity
PROBABILITY THEORY – VOLUME III Distributions Derived from Normal Distribution

The t distribution, however, has a greater dispersion than the Standard


Normal distribution. In practice, we can treat tn as N (0, 1) when n > 30.

8.4 F DISTRIBUTION

The F distribution was named after Sir R.A. Fisher.

8.4.1 Definition of F Distribution

Definition 8.3 F DISTRIBUTION


A continuous random variable X is said to have the distribution with
parameter v if its probability density function is given by
     v1
v1 + v2 v1 2
−1
Γ x
2 v2
f (x) =    
Γ
v1
Γ
v2     v1 + v2
2 2 v1 2
1+ x
v2
where x > 0

The F distribution with v1 and v2 degrees of freedom is denoted by


Fα;v1 ,v2 . Note that the order v1 , v2 is important, that is, in general,

Scholarships
F = F
α;v1 ;v2 α;v2 ;v1

Density curves for a few selected degrees of freedom of an F distribution are

Open your mind to


new opportunities
With 31,000 students, Linnaeus University is
one of the larger universities in Sweden. We
are a modern university, known for our strong
international profile. Every year more than Bachelor programmes in
1,600 international students from all over the Business & Economics | Computer Science/IT |
world choose to enjoy the friendly atmosphere Design | Mathematics
and active student life at Linnaeus University.
Master programmes in
Welcome to join us! Business & Economics | Behavioural Sciences | Computer
Science/IT | Cultural Studies & Social Sciences | Design |
Mathematics | Natural Sciences | Technology & Engineering
Summer Academy courses

247
Γ Γ v1
2 2 1+ x 2
v2
ADVANCED TOPICS IN INTRODUCTORY
where x >A 0FIRST COURSE IN
PROBABILITY:
PROBABILITY THEORY – VOLUME III Distributions Derived from Normal Distribution

The F distribution with v1 and v2 degrees of freedom is denoted by


Fα;v1 ,v2 . Note that the order v1 , v2 is important, that is, in general,

Fα;v1 ;v2 = Fα;v2 ;v1


Distributions Derived From Normal Distribution 269
Density curves for a few selected degrees of freedom of an F distribution are
shown in Fig 8.3.

Fig. 8.3 (a) Typical Curves of F Distribution


(v1 = 2 with various values of v2 )

Fig. 8.3 (b) Typical Curves of F Distribution


(v1 = 20 with various values of v2 )

248
ADVANCED TOPICS IN INTRODUCTORY
PROBABILITY: A FIRST COURSE IN
PROBABILITY THEORY – VOLUME III Distributions Derived from Normal Distribution
270 Advanced Topics in Introductory Probability

8.4.2 Properties of F distribution

Property 1
As a ratio of two non-negative values (χ2v ≥ 0), an F random variable ranges
in value from 0 to +∞.

Property 2

Theorem 8.14
Suppose a random variable X has the F distribution with parameters
v1 and v2 . Then
v2
(a) E(X) = , v2 > 2
v2 − 2
2v22 (v1 + v2 − 2)
(b) Var(X) = , v2 > 4
v1 (v2 − 2)2 (v2 − 4)

Theorem 8.14 implies that an F variable has no mean when v2 ≤ 2 and has
no variance when v2 ≤ 4.

Property 3 RECIPROCAL PROPERTY

Theorem 8.15
1
Suppose X has the distribution Fv1 ,v2 then Y = has the distribu-
X
tion Fv2 ,v1 , that is,

1
F(1−α,v1 ,v2 ) =
F(α,v1 ,v2 )

where both α and 1 − α are lower - tail percentage points of an F


distribution
Distributions Derived From Normal Distribution 271

Property 4
F is a positively skewed distribution; however, its skewness decreases with
increasing v1 and v2 .

The F distribution has been tabulated for α = 0.50, 0.90, 0.95, 0.975, 0.99,
0.995, 0.999. Table III in Appendix A gives the percentage points for the
right tail of several F distributions at 1, 5 and 10 percent levels. To say
that P (F > Fv1 ,v2 ) = α is the same as saying that

P (0 ≤ F ≤ Fv1 ,v2 ) = 1 − α

Example 8.8
If the F distribution has degrees of freedom v249
1 = 30 and v2 = 24, find the
table value of
that P (F > Fv1 ,v2 ) = α is the same as saying that
ADVANCED TOPICS IN INTRODUCTORY
P (0 ≤
PROBABILITY: A FIRST COURSE IN F ≤ Fv1 ,v2 ) = 1 − α
PROBABILITY THEORY – VOLUME III Distributions Derived from Normal Distribution

Example 8.8
If the F distribution has degrees of freedom v1 = 30 and v2 = 24, find the
table value of
(a) F0.90;30,24 (b) F0.95;12,8 (c) F0.05;12,8

Solution

(a) Table III in Appendix A gives the percentage point of the distribution
at 90 percent levels. Read the value of the table at the meeting point
of the degrees of freedom of the numerator, v1 = 30 and degrees of
freedom of the denominator, v2 = 24. The value is 1.67. That is, the
probability that an F-distributed random variable with 30 numerator
degrees of freedom and 24 denominator degrees of freedom does not
exceed 1.67 is 0.90.

(b) From Table III in Appendix A, the value is 3.28. Hence

F0.95;12,8 = 3.28

(c) By Theorem 8.15

1 1
F0.05;12,5 = = = 0.305
F0.95;8,12 3.28

250
ADVANCED TOPICS IN INTRODUCTORY
PROBABILITY: A FIRST COURSE IN
PROBABILITY THEORY – VOLUME III Distributions Derived from Normal Distribution
272 Advanced Topics in Introductory Probability

8.4.3 Relationship among χ2 , t and F Distributions

Theorem 8.16
Let χ2(v1 ) and χ2(v2 ) be two independent χ2 random variables with
parameters v1 and v2 respectively. Then F -distribution is given by

χ2(v1 ) χ2(v2 )
F(v1 ,v2 ) = /
v1 v2
where v1 is the degrees of freedom of the numerator and v2 is the
degrees of freedom of the denominator

Theorem 8.17
Let χ2α,v be a Chi-square random variable with v degrees of freedom.
Then
χ2α,v
Fα,v,∞ =
v

Theorem 8.18
Let F1,v be an F distribution with degrees of freedom 1 and v. Then

F1−α; 1, v = t21− α , v
2

Proof
From Theorem 8.12
Z2
T2 =
χ2
v
But from Theorem 12.6
Z 2 ∼ χ2 with 1 degree of freedom. Hence,
Z2

Distributions Derived From Normal t2v = 12Distribution


= F1, v 273
χ
v
Example 8.9
Verify that F0.95;1,16 = t2.975,16

Solution
From Table III in Appendix A,

F0.95;1,16 = F1−0.05;1,16 = 4.49


From Table III in Appendix A,

t1− α2 ,16 = t0.975,16 = 2.12


Comparing the F and t values, we observe that
4.49 = (2.12)2

The χ2 , t and F distributions have many important applications.


251
The reader will find them in most aspects of both theoretical and applied
t1− α2 ,16 = t0.975,16 = 2.12
ADVANCED TOPICSthe
Comparing IN INTRODUCTORY
F and t values, we observe that
PROBABILITY: A FIRST COURSE IN
PROBABILITY THEORY – VOLUME III4.49 = (2.12)2 Distributions Derived from Normal Distribution

The χ2 , t and F distributions have many important applications.


The reader will find them in most aspects of both theoretical and applied
Statistics especially under Inferential Statistics (estimation and hypothesis
testing).

EXERCISES

8.1 Prove Theorem 8.1.


8.2 A random variable X has the χ2 distribution. Find the degrees of
freedom if its moment generating function is
(a) MX (t) = (1 − 2t)−18 ,
1
(b) MX (t) = ,
(1 − 2t)6
(c) MX (t) = (1 − 2t)−32 ,
where t < 1
2

8.3 If X has Chi-Square distribution


2 x
f (x) = x5 e− 2
9
274 Advanced Topics in Introductory Probability
How many degrees of freedom has f (x).

8.4 A random variable X has the χ2 distribution. Find the m.g.f. of X if


the distribution has degrees of freedom (a) 4, (b) 10, (c) 28.

8.5 Referring to Exercise 8.3. Find the mean and variance of the χ2 vari-
able.

8.6 Suppose X and Y are independent χ2 random variables with v1 and v2


degrees of freedom, respectively. Show that the m.g.f. of Z = X + Y
is v1 +v2
MX (t) = (1 − 2t)−( 2 )

8.7 If a random variable X has the χ2 distribution. Find from Table I in


Appendix A the value for
(a) χ20.90,15 (b) χ20.99,8 (c) χ20.025,21

8.8 If X has the χ2 distribution with degrees of freedom of 12, find x such
that

(a) P (X ≤ x) = 0.05 (b) P (X ≥ x) = 0.975

8.9 If X has the t distribution, find from Table II in Appendix A the value
for
(a) t0.95,28 (b) t0.975,15 (c) t0.90,22

8.10 Suppose X has t distribution and given


(a) P (X ≥ 1.323); (b) P (X ≤ 3.078)
find (i) the degrees of freedom (ii) α

8.11 If X is Fv1 ,v2 , find from Table III in Appendix A the value for (a)
F0.95,20,30 (b) F0.90,20,24

8.12 Suppose X has the F distribution with v1 = 15 and v2 = 8, find x


such that (a) P (X ≥ x) = 0.10 (b) P (X ≤ x) = 0.10
252
(a) P (X ≤ x) = 0.05 (b) P (X ≥ x) = 0.975
ADVANCED TOPICS IN INTRODUCTORY
8.9 If X hasA the
PROBABILITY: t distribution,
FIRST COURSE IN find from Table II in Appendix A the value
PROBABILITY
for THEORY – VOLUME III Distributions Derived from Normal Distribution

(a) t0.95,28 (b) t0.975,15 (c) t0.90,22

8.10 Suppose X has t distribution and given


(a) P (X ≥ 1.323); (b) P (X ≤ 3.078)
find (i) the degrees of freedom (ii) α

8.11 If X is Fv1 ,v2 , find from Table III in Appendix A the value for (a)
F0.95,20,30 (b) F0.90,20,24

8.12 Suppose X has the F distribution with v1 = 15 and v2 = 8, find x


such that (a) P (X ≥ x) = 0.10 (b) P (X ≤ x) = 0.10

8.13 Suppose X is Fv1 ,v2 . Find the degrees of freedom if


(a) P (X ≥ 2.49) = 0.05 (b) P (X ≤ 3.34) = 0.99

8.14 Verify that (a) F0.95;1,20 = t20.975,20 (b) F0.90;1,7 = t20.95,7

253
ADVANCED TOPICS IN INTRODUCTORY
PROBABILITY: A FIRST COURSE IN
PROBABILITY THEORY – VOLUME III Statistical Tables

STATISTICAL TABLES

254
ADVANCED TOPICS IN INTRODUCTORY
PROBABILITY: A FIRST COURSE IN
PROBABILITY THEORY – VOLUME III Statistical Tables
Statistical Tables 277

Table I
Standard Normal Distribution
(Full Table)

 z
1 t2
Φ(z) = √ e− 2 dt
2π −∞

255
ADVANCED TOPICS IN INTRODUCTORY
PROBABILITY: A FIRST COURSE IN
PROBABILITY THEORY – VOLUME III Statistical Tables
278 Advanced Topics in Introductory Probability

Table II
Standard Normal Distribution
(Half Table)


1 z t2
Φ(z) = √ e− 2 dt
2π 0

256
ADVANCED TOPICS IN INTRODUCTORY
PROBABILITY: A FIRST COURSE IN
PROBABILITY THEORY – VOLUME III Statistical Tables
Statistical Tables 279

Table III
Normal Density Function

1 − t2
φ(z) = √ e 2

257
ADVANCED TOPICS IN INTRODUCTORY
PROBABILITY: A FIRST COURSE IN
PROBABILITY THEORY – VOLUME III Statistical Tables
280 Advanced Topics in Introductory Probability

Table IV
Percentiles of Chi Square Distribution

258
ADVANCED TOPICS IN INTRODUCTORY
PROBABILITY: A FIRST COURSE IN
PROBABILITY THEORY – VOLUME III Statistical Tables
Statistical Tables 281

Table V
Percentiles of t Distribution

259
ADVANCED TOPICS IN INTRODUCTORY
PROBABILITY: A FIRST COURSE IN
PROBABILITY THEORY – VOLUME III Statistical Tables
282 Advanced Topics in Introductory Probability

Table VI
Percentiles of F Distribution
F0.90

260
ADVANCED TOPICS IN INTRODUCTORY
PROBABILITY: A FIRST COURSE IN
PROBABILITY THEORY – VOLUME III
Statistical Tables 283Statistical Tables

Percentiles of F Distribution (continued )

261
ADVANCED TOPICS IN INTRODUCTORY
PROBABILITY: A FIRST COURSE IN
PROBABILITY
284 THEORY – VOLUME IIIAdvanced Topics in Introductory ProbabilityStatistical Tables

Percentiles of F Distribution
F0.95

262
ADVANCED TOPICS IN INTRODUCTORY
PROBABILITY: A FIRST COURSE IN
PROBABILITY
StatisticalTHEORY
Tables – VOLUME III Statistical Tables
285

Percentiles of F Distribution (continued )

263
ADVANCED TOPICS IN INTRODUCTORY
PROBABILITY: A FIRST COURSE IN
PROBABILITY
286 THEORY – VOLUME III Advanced Statistical Tables
Topics in Introductory Probability

Percentiles of F Distribution
F0.99

264
ADVANCED TOPICS IN INTRODUCTORY
PROBABILITY: A FIRST COURSE IN
PROBABILITY THEORY – VOLUME III
Statistical Tables 287Statistical Tables

Percentiles of F Distribution (continued )

265
ADVANCED TOPICS IN INTRODUCTORY
PROBABILITY: A FIRST COURSE IN
PROBABILITY THEORY – VOLUME III Statistical Tables
288 Advanced Topics in Introductory Probability

Table VII
Random Numbers

266
ADVANCED TOPICS IN INTRODUCTORY
PROBABILITY:
Statistical A FIRST COURSE IN
Tables 289
PROBABILITY THEORY – VOLUME III Statistical Tables

Random Numbers (Continued)

267
ADVANCED TOPICS IN INTRODUCTORY
290
PROBABILITY: A FIRST COURSE INAdvanced Topics in Introductory Probability
PROBABILITY THEORY – VOLUME III Statistical Tables

Random Numbers (Continued)

268
ADVANCED TOPICS IN INTRODUCTORY
Statistical
PROBABILITY:Tables
A FIRST COURSE IN 291
PROBABILITY THEORY – VOLUME III Statistical Tables

Random Numbers (Continued)

269
ADVANCED TOPICS IN INTRODUCTORY
PROBABILITY: A FIRST COURSE IN
292
PROBABILITY THEORY – VOLUME Advanced
III Topics in Introductory Probability Statistical Tables

Random Numbers (Continued)

270
ADVANCED TOPICS IN INTRODUCTORY
PROBABILITY: A FIRST COURSE IN
PROBABILITY THEORY – VOLUME III Answers to Odd-Numbered Exercises

Answers to Odd-Numbered Exercises

Chapter 1

X Y
1 2 3
3 5 1
1 34 34 34
1.1 2 4 6
2 34 34 34
1 2
3 34 34
0
4 1 5
4 34 34 34
9 7 211 e−λp (λp)x
1.3 (a) (c) 1.5 = 1 1.7 (c)
(2x + 1) 1.9 (a) , x = 0, 1, 2, · · · 1.11 (c) (i)
34 34 219 x!  
2(x + y − 2xy), 0 ≤ x ≤ 1, 0 ≤ y ≤ 1 1.13 12
7
(c) 12
7
x 2+ x , 0≤x
2
≤ 1 1.15 (a) λe−λx , x ≥ 0
(c) λe−λ(y−x) , y ≥ x 1.17 (a) (i)x e−x , y ≥ 0

Chapter 2

X +Y 2 3 4 5 6 7
2.1 (a) 3 7 6 12 1 5
P (X + Y = k) 34 34 34 34 34 34

XY 1 2 3 4 6 8 9 12
(c) 3 7 2 8 8 1 5
P (XY = k) 34 34 34 34 34 34
0 34

2X + 3Y 5 7 8 9 10 11 12 13 14 15 17
(e) 3 2 5 1 4 5 2 6 1 5
P (2X + 3Y = k) 34 34 34 34 34 34 34 34 34
0 34

3(X)4(Y ) = 12XY 12 24 36 48 72 96 108 144


(g) 3 7 2 8 8 1 5
P (12XY = k) 34 34 34 34 34 34
0 34

3X 3 6 9 12
(j) 9 12 3 10
P (3X = k) 34 34 34 34

X +Y 2 3 4
2.3 1 3 1
P (X + Y ) 8 8 2


2 2
λz e−λ 3
z (3 − z), 0≤z≤1
2.5(a) , z = 0, 1, 2, · · · 2.7(a) s(z) = 2 2
z!
3
z (4 − 3z 2 + z 3 ), 1≤z≤2

293

271
ADVANCED TOPICS IN INTRODUCTORY
PROBABILITY: A FIRST COURSE IN
PROBABILITY THEORY – VOLUME III Answers to Odd-Numbered Exercises
294 Advanced Topics in Introductory Probability

 5

 z3 , 0<z≤1
 14 3 z 1

2.9(a) s(z) = + , 1<z≤2

 7 2 3 
 3 5 3
z − 4z 2 +
11
z + 36 , 2 < z ≤ 3
7 6 2
2
(u3 + 1), −1 ≤ u < 0
2.17 6.027 × 10−5 2.19 (a) h(u) = 3
2
3
(1 − u3 ), 0≤u≤1
X 3
Y
1 2
2 3
2.21 4[1 − u(1 − ln u)], 0≤u≤1 2.23 X 2 7 3 8
P(Y = k)
 20  20 20 20
u2 2
1 + 23 u − 3
, −1 ≤ u < 0 9
(u + 2), 0≤u<1
2.25 h(u) = u2
2.27 h(u) = 2
1 − 43 u + , 0≤u≤1 9u3
(u + 2), 1≤u<∞
3
3 u2 2
2.29 7
(1 − 4
+ u ln( u )), 0 < u < 2

Chapter 3

3.1(a) 4.47 3.3 (a) 2.01 (c) 10.39 (e) 11.24 (g) 1496.32 (i) 52.80 (k)2.11 (m) 3.21 3.5(a) 52 (c) 2
15
(e) 0.04 (g) 0 (i) 0.08 3.15 6.46

Chapter 4

1
4.1 0.064 4.3 (a) 2.3 (c) 2.68 (e) 1.2 4.5 0 4.9 -0.1832 4.11 − 11 4.13(a)(i) −0.21x + 1.52
2
24x +24x+2
(ii) −0.16y + 1.63 4.15 9(2x+1)2
4.19 0.05y + 2.68

Chapter 5

5.1(a) 12 5.3 n ≤ 250 5.5(a) 0.85 5.7 0.84 5.9 1.3947 5.11 0.9648 5.13(a) 0.3409 (c) 0.8185
5.15 0.0062 5.17 1600 5.19 0.9927 5.21 n ≈ 25 5.23 n ≈ 85,  = 0.43 5.25(a) 0.1852
5.29 0.2328

Chapter 7

7.1(a) 3.66.8 (c) E(X) = 6.8, Var = 8.48 7.3(a) 6.8 7.5(a)(i) 140 (ii) 1.49 0.85 7.7(a)(i) 1,000
(ii) 7.9 s.d = 0.657 7.11(a)(i) 0.56 (ii) 0.4855 7.15(a) 0.004 7.17(a)(ii) 458.8

Chapter 8

8.3 12 8.5(a) E(X) = 12, Var = 24 8.7(a) 22.367 (c) 10.285 8.9(a) 1.701 (c) 1.321 8.11 1.93

272
ADVANCED TOPICS IN INTRODUCTORY
PROBABILITY: A FIRST COURSE IN
PROBABILITY THEORY – VOLUME III Bibliography

Bibliography

Applebaum, D., (1996), Probability and Information: An Integrated Approach , Cambridge


University Press, Cambridge.

Bajpal, A.C., Calus, L.M. and Fairley, J.A., (1978), Statistical Methods for Engineers and
Scientists, John Wiley and Sons, New York.

Barlow, R. (1989), Statistics, A Guide to the Use of Statistical Methods in the Physical
Sciences, John Wiley and Sons, New York.

Bartoszyński, R and Niewiadomska-Bugay, M (1996), Probability and Statistical Inference,


John Wiley and Sons, New York.

Beaumont, G.P. (1986), Probability and Random Variables, Ellis Norwood Ltd, Chichester.

Berger, M.A. (1993) An Introduction to Probability and Stochastic Processes, Springer-


Verlag, New York.

Billinsley, P. (1979) Probability and Measure, John Wiley and Sons, New York.

Birnbaum, Z.W. (1962), Introduction to Probability and Mathematical Statistics, Harper


and Brothers Publishers, New York.

Blake, Ian F.B. (1979), An Introduction to Applied Probability, John Wiley and Sons, New
York.

Bradley, J., (1988), Introduction to Discrete Mathematics, Addison-Wesley Publishing


Company, Reading.

Breiman L., (1972), Society for Industrial and Applied Maths, Siam, Philadelphia.

Brémaud, Pierre (1994), An Introduction to Probabilistic Modeling, Springer - Verlag, New


York.

Chow, Y.S., Teicher, H. (1988), Probability Theory, Springer-Verlag, New York.

David F.N. (1951), Probability Theory for Statistical Methods, Cambridge University Press,
Cambridge.

Drake, A.W. (1967), Fundamentals of Applied Probability Theory, McGraw-Hill, New


York.
296 Advanced Topics in Introductory Probability
Dudewicz, E.J. and Mishra, S.N. (1988), Modern Mathematical Statistics, John Wiley and
Sons, New York.

Feller, W. (1970) An Introduction to Probability Theory and its Applications Vol. I, Second
Edition, John Wiley and Sons, New York.295
Feller, W. (1971) An Introduction to Probability Theory and its Applications Vol. II,
Second Edition, John Wiley and Sons, New York.

Freund, J.E.and Walpole, R.E. (1971), Mathematical Statistics, Prentice - Hall, Inc, En-
glewood Cliffs, New Jersey.

Fristedt, B. Gray, L. (1997), Modern Approach to Probability Theory, Birhhäuser, Boston.

Gnedenko, B.V. and Khinchin, A.Ya. (1962), An Elementary Introduction to the Theory
of Probability, Dover Publications, Inc., New York.

Goldberg, S. (1960), Probability, An Introduction, Prentice-Hall, Inc., Englewood Cliffe,


New Jersey.

Grimmeth, G. and Weslsh, D. (1986), Probability, An Introduction, Clarendon Press, Ox-


ford.

Guttman, I., Wilks, S.S and Hunter, J.S. (1982), Introduction Engineering Statistics, John
Wiley and Sons, New York.

Haln, G.J., and Shapiro, S.S. (1967) Statistical Models in Engineering, John WIley and
Sons.

Hoel, P.G. (1984), Mathematical Statistics, Fifth ed., John Wiley and Sons, New York.

Hoel, P.G., Port, S.C, and Stone, J.S. (1971), Introduction to Probability Theory, Houghton
Mifflin.

Hogg, R.V. and Cragg, A.T. (1978), Introduction to273


Mathematical Statistics, Fourth Edi-
tion, Macmillan, New York.
Grimmeth, G. and Weslsh, D. (1986), Probability, An Introduction, Clarendon Press, Ox-
ford.

ADVANCED TOPICS
Guttman, IN INTRODUCTORY
I., Wilks, S.S and Hunter, J.S. (1982), Introduction Engineering Statistics, John
PROBABILITY:
Wiley andASons,
FIRSTNewCOURSE
York. IN
PROBABILITY THEORY – VOLUME III Bibliography
Haln, G.J., and Shapiro, S.S. (1967) Statistical Models in Engineering, John WIley and
Sons.

Hoel, P.G. (1984), Mathematical Statistics, Fifth ed., John Wiley and Sons, New York.

Hoel, P.G., Port, S.C, and Stone, J.S. (1971), Introduction to Probability Theory, Houghton
Mifflin.

Hogg, R.V. and Cragg, A.T. (1978), Introduction to Mathematical Statistics, Fourth Edi-
tion, Macmillan, New York.

Johnson, N.L. and Kotz, S.(1969) Distributions in Statistics: Discrete Distributions , John
Wiley and Sons, New York.

Johnson, N.L. and Kotz, S.(1970) Distributions in Statistics: Continuous Multivariate


Distributions, John Wiley and Sons, New York.

Johnson, N.L. and Kotz, S.(1970) Distributions in Statistics: Continuous Univariate Dis-
tributions 1 & 2, John Wiley and Sons, New York.

Kendal, M.G. and Stuart, A. (1963) The Advanced Theory of Statistics,, Vol. 1, 2nd ed.,
Charles Griffin & Company Ltd., London.

Kendal, M.G. and Stuart, A. (1976) The Advanced Theory of Statistics,, Vol. 3, 4th ed.,
Charles Griffin & Company Ltd., London.

Kendal, M.G. and Stuart, A. (1979) The Advanced Theory of Statistics,, Vol. 2, 4th ed.,
Charles Griffin & Company Ltd., London.
Bibliography 297
Kolmogorov, A.N. (1956), Foundations of the Theory of Probability, Chelsea Publishing
Company, New York.

Lai, Chung Kai (1975), Elementary Probability: Theory with Stochastic Processes, Springer
- Verlag, New York.

Lamperti, J. (1966), Probability: A Survey of the Mathematical Theory, W.A. Benjamin,


INC., New York.

Lindgren, B.W. (1968) Statistical Theory, McMillan Company, New York.

Lupton R., (1993), Statistics in Theory and Practice, Princeton, Princeton University
Press.

Mendenhall, W. and Scheaffer, R.L. (1990), Mathematical Statistics With Applications,


PWS-KENT Publishing Company, Mass.

Meyer, D. (1970), Probability, W. A. Benjamin, Inc. New York.

Meyer, P.L. (1970), Introductory Probability and Its Applications, 2nd Edition, Addison
Wesley Publishing Company.

Miller, I. and Freund, J.E. (1987), Probability and Statistics for Engineers, Third Edition,
Prentice-Hall, Englewood Cliffs, New Delhi.

Mood, A.M., Graybill, F.A. and Boes, D.C (1974), Introduction to the Theorey of Statis-
tics, Third Edition, McGraw-Hill, New York.

Nsowah-Nuamah, N.N.N., (2017), A First Course in Probability Theory, Vol. I: Introduc-


tory Probability Theory and Random Variables, Ghana Universities Press, Accra, Ghana.

Nsowah-Nuamah, N.N.N., (2018), A First Course in Probability Theory, Vol. II: Theo-
retical Probability Distributions, Ghana Universities Press, Accra, Ghana.

Page, L.B. (1989), Probability for Engineering with Applications to Reliability, Computer
Science Press.

Prohorov, Yu.V. and Rozanov, Yu.A. (1969) Probability Theory, Springer-Verlag, New
York.

Robinson, E.A. (1985), Probability Theory and Applications International Human Re-
sources Dev. Corporation.

Ross, S. (1984), A First Course in Probability, 2nd Edition, Macmillan Publishing Com-
pany.

Spiegel, M.R (1980), Probability and Statistics, McGraw-Hill Book Company, New York.

Stirzaker, D.(1994), Elementaty Probability, Cambridge University Press, Cambridge.

Stoyanov. J, et. al. (1989), Exercise Manual in Probability Theory, Kluwer Academic
Publishers, London.
274
Studies in the History of Statistics and Probability, Vol. 1, edited by Pearson, E.S., and
Kendal, M. (1970), Charles Griffin & Company Ltd., London.
Prohorov, Yu.V. and Rozanov, Yu.A. (1969) Probability Theory, Springer-Verlag, New
York.
ADVANCED TOPICS IN INTRODUCTORY
Robinson, E.A. (1985), Probability Theory and Applications International Human Re-
PROBABILITY: A FIRST
sources Dev. COURSE IN
Corporation.
PROBABILITY THEORY – VOLUME III Bibliography
Ross, S. (1984), A First Course in Probability, 2nd Edition, Macmillan Publishing Com-
pany.

Spiegel, M.R (1980), Probability and Statistics, McGraw-Hill Book Company, New York.

Stirzaker, D.(1994), Elementaty Probability, Cambridge University Press, Cambridge.

Stoyanov. J, et. al. (1989), Exercise Manual in Probability Theory, Kluwer Academic
Publishers, London.

Studies in the History of Statistics and Probability, Vol. 1, edited by Pearson, E.S., and
Kendal, M. (1970), Charles Griffin & Company Ltd., London.
298 Advanced Topics in Introductory Probability
Studies in the History of Statistics and Probability, Vol. 1, edited by Pearson, E.S., and
Kendal, M. (1977), Charles Griffin & Company Ltd., London.

Trivedi, K.S. (1988), Probability and Statistics with Reliability, Queuing, and Computer
Science Applications, Prentice-Hall of India, New Delhi.

Tucker, H.G. (1963), An Introduction to Probability and Mathematical Statistics, Academic


Press, New York.

Wayne, W. D. (1991), Biostatistics: A Foundation for Analysis in the Health Sciences,


Fifth Edition, John Wiley and Sons, Singapore.

Wilks, S.S. (1962), Mathematical Statistics, John Wiley and Sons, New York.

275

You might also like