Recall that for a discrete random variable, the set of all possible outcomes has to be discrete in the
sense that the elements of the set can be listed sequentially beginning with the smallest number. On
the other hand, if the set of all possible values of a random variable is an entire interval of numbers,
then the random variable is said to be continuous. In this case, between any two possible
values we can find infinitely many other possible values. This implies that the values cannot be
listed sequentially.
Example: For instance, let X = length of a randomly chosen rattlesnake. Suppose the
smallest rattlesnake is of length a and the longest rattlesnake is b units long. Then the
set of all possible values lies between a and b. We then write a ≤ x ≤ b as the range of
possible values.
It is possible to argue that restrictions of our measuring instruments may sometimes limit us to a
discrete world. For instance, our measurements of temperature may sometimes make us feel that
temperature is discrete. However, continuous random variables and distributions often approximate
real-world situations very well. Furthermore, problems based on continuous variables are often easier
to solve than those based on discrete variables.
In the discrete case, the probability distribution of a random variable called the probability mass
function (pmf) shows us how the total probability of 1 is distributed among the possible values of
the random variable. The probability distribution of a continuous random variable is called the
probability density function.
Now, let Y be a continuous random variable. Then a probability density function (pdf) of Y is a
function f (y) such that for any two numbers a and b, with a ≤ b,
P(a ≤ Y ≤ b) = ∫_a^b f(y) dy.
That is, if we plot the graph of f(y), the density curve, then the probability that Y takes a value
between a and b is the area between these two numbers and under the graph of f(y). In other words,
probabilities involving continuous random variables are areas under the density curve. It therefore
makes sense that for any constant, say c,
P (Y = c) = 0.
That is, the area under the graph at a single point is zero. This implies that if Y is a continuous
random variable, then for any two numbers a and b, with a ≤ b,

P(a ≤ Y ≤ b) = P(a < Y ≤ b) = P(a ≤ Y < b) = P(a < Y < b).

We wish to emphasize the point that the equality of the four probabilities above holds for only
continuous random variables.
For any function f(y) to be a legitimate probability density function, it must satisfy f(y) ≥ 0 for all y, and ∫_{−∞}^{∞} f(y) dy = 1.
Example: Problem 5, Page 151 A college professor never finishes his lecture before
the end of the hour and always finishes his lectures within 2 min after the hour. Let X =
the time that elapses between the end of the hour and the end of the lecture and suppose
the pdf of X is
f(x) = kx² for 0 ≤ x ≤ 2, and f(x) = 0 otherwise.
(b) What is the probability that the lecture ends within 1 min of the end of the hour ?
Solution:
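The solution is worked in class, but the computation can be sketched in Python. Requiring the total area under the pdf to equal 1 gives k = 3/8; part (b) then asks for P(X ≤ 1). The use of Fraction for exact arithmetic is our choice, not part of the problem.

```python
from fractions import Fraction

# Normalization: integral of k*x^2 from 0 to 2 is k * 8/3 = 1, so k = 3/8
k = Fraction(3, 8)

def F(x):
    """cdf of X on [0, 2]: F(x) = k * x^3 / 3."""
    return k * Fraction(x) ** 3 / 3

# sanity check: total probability is 1
assert F(2) == 1

# (b) probability the lecture ends within 1 min of the end of the hour
print(F(1))  # 1/8
```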
In our discussion on discrete random variables we studied some random variables with special mass
functions such as binomial, hypergeometric and Poisson. The first random variable with a special
density function we shall discuss is the uniformly distributed random variable. By definition, a
continuous random variable X is said to have a uniform distribution on the interval [A, B] if the
pdf of X is
f(x; A, B) = 1/(B − A) for A ≤ x ≤ B, and f(x; A, B) = 0 otherwise.
This is equivalent to the notion of equally likely outcomes of a sample space.
Example: Problem 7, Page 151 The time X (min) for a lab assistant to prepare
the equipment for a certain experiment is believed to have a uniform distribution with
parameters 25 and 35.
(b) For any a such that 25 < a < a + 2 < 35, what is the probability that preparation
time is between a and a + 2 min ?
Solution:
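A sketch of part (b) in Python: for a uniform distribution, every subinterval of the same length has the same probability, so the answer is 2/(35 − 25) = 0.2 regardless of the particular a.

```python
def uniform_prob(lo, hi, a, b):
    """P(a <= X <= b) for X ~ Uniform[lo, hi], assuming lo <= a <= b <= hi."""
    return (b - a) / (hi - lo)

# (b) the probability does not depend on which a is chosen
for a in (25.0, 28.3, 33.0):
    print(round(uniform_prob(25, 35, a, a + 2), 6))  # 0.2 each time
```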
The cumulative distribution function (cdf) F (x) of a continuous random variable, for every number
x, is the entire area to the left of the point x under the density curve f(x) of X. That is,
F(x) = P(X ≤ x) = ∫_{−∞}^{x} f(y) dy.
Notice that the variable in the integrand has been changed to y in order not to confuse that with the
upper limit of the integral. For the purpose of illustration, let us determine the cdf of the uniform
random variable X in Problem 7, Page 151. Now, for this problem,
f(x; 25, 35) = 1/(35 − 25) = 1/10 for 25 ≤ x ≤ 35, and 0 otherwise.
It follows that for every x with 25 ≤ x < 35,

F(x) = ∫_{−∞}^{x} f(y) dy = ∫_{25}^{x} (1/10) dy = (x − 25)/10.
Therefore the cdf of X = time for a lab assistant to prepare equipment is
F(x) = 0 for x < 25;  F(x) = (x − 25)/10 for 25 ≤ x < 35;  and F(x) = 1 for x ≥ 35.
The idea of complements can also be used to compute probabilities when dealing with continuous
random variables. If X is a continuous random variable and a is any constant, then
P (X ≥ a) = P (X > a) = 1 − P (X ≤ a) = 1 − F (a).
Some students may have noticed that this is different from the corresponding result for discrete random variables.
Recall that if Y is a discrete random variable and a is any integer, then P(Y ≥ a) = 1 − P(Y < a) =
1 − F(a − 1). Students should therefore take note of this difference between discrete and continuous
random variables. Also, we note that if X is a continuous random variable and a and b are constants,
such that a ≤ b, then
P (a ≤ X ≤ b) = P (X ≤ b) − P (X ≤ a) = F (b) − F (a).
Again, this is slightly different from what we would have done if X were to be discrete.
Now, we have seen that given a value, say x = 4 and the density function of the continuous random
variable X, we can compute the cdf of X at the point x = 4. For instance, let
f(x) = 0.5x for 0 ≤ x ≤ 2, and f(x) = 0 otherwise.
Thus, we can see that at x = 3, F(3) = 1 and F(−0.2) = 0. Similarly, at x = 1.5, F(1.5) =
0.25 × (1.5)² = 0.5625. Now, the question is, suppose we know that F(x) = 0.5625 and we are given
the density function f(x), is it possible to find the corresponding value of x, say x = c ? In this
case, we can see that the value of x that gives F(x) = 0.5625 is x = 1.5. In statistics, we call the
point x = 1.5, the (100 × 0.5625)th = 56.25th percentile of the distribution of the continuous random
variable X. Students may have observed that for both discrete and continuous random variables, the
smallest value of the cdf is zero and the largest value is 1. A general definition of the percentile of a
distribution is as follows.
Definition: Let p be a number between 0 and 1. The (100p)th percentile of the dis-
tribution of a continuous random variable X, which we shall denote by c, is that value for
which

p = F(c) = ∫_{−∞}^{c} f(y) dy.
(a) p = 0.5 gives the 50th percentile which is also the median of the distribution of X. The median
is the point on the x axis which divides the area under the density curve of X into two equal
parts. In the course text, the median is denoted by µ̃.
(b) p = 0.25 gives the 25th percentile which is also the first quartile of the distribution of X. The
first quartile is the point on the x axis which divides the area under the density curve of X
into two parts such that the area to the left is 25% of the total area and the area to the right
is 75% of the total area.
(c) p = 0.75 gives the 75th percentile which is also the third quartile of the distribution of X. The
third quartile is the point on the x axis which divides the area under the density curve of X
into two parts such that the area to the left is 75% of the total area and the area to the right
is 25% of the total area.
As an example, suppose we wish to find the median of the distribution of the random variable X
with density function given by
f(x) = 0.5x for 0 ≤ x ≤ 2, and f(x) = 0 otherwise.
Now, we are seeking the value of x for which F(x) = 50% = 0.5. Thus, we solve the equation F(x) = 0.25x² = 0.5, which gives x = √2 ≈ 1.414 as the median.
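The algebra can be checked numerically; a minimal sketch in Python, using the cdf F(x) = 0.25x² obtained by integrating this density:

```python
import math

def F(x):
    """cdf of f(x) = 0.5x on [0, 2]: F(x) = 0.25 x^2."""
    return 0.25 * x * x

median = math.sqrt(2)      # solves 0.25 x^2 = 0.5
print(round(median, 4))    # 1.4142
print(F(median))           # 0.5, up to float rounding
```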
Problem 1, Page 150
Let X denote the amount of time for which a book on a 2-hour reserve at a college
library is checked out by a randomly selected student and suppose that X has density
function
f(x) = 0.5x for 0 ≤ x ≤ 2, and f(x) = 0 otherwise.

Calculate the following probabilities:
(a) P (X ≤ 1)
(b) P (0.5 ≤ X ≤ 1.5)
(c) P (1.5 ≤ X).
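The three parts can be sketched in Python, using the cdf F(x) = 0.25x² derived from this same density earlier in the notes:

```python
def F(x):
    """cdf of f(x) = 0.5x on [0, 2]."""
    if x < 0:
        return 0.0
    return 0.25 * x * x if x < 2 else 1.0

print(F(1))             # (a) 0.25
print(F(1.5) - F(0.5))  # (b) 0.5
print(1 - F(1.5))       # (c) 0.4375
```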
Problem
Find the cdf of X = measurement error with pdf
f(x) = (3/32)(4 − x²) for −2 ≤ x ≤ 2, and f(x) = 0 otherwise.
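Integrating the pdf term by term gives F(x) = (3/32)(4x − x³/3 + 16/3) on [−2, 2]; a sketch checking this antiderivative with exact rational arithmetic (the check itself is our addition):

```python
from fractions import Fraction

def F(x):
    """cdf obtained by integrating (3/32)(4 - y^2) from -2 to x."""
    x = Fraction(x)
    if x <= -2:
        return Fraction(0)
    if x >= 2:
        return Fraction(1)
    return Fraction(3, 32) * (4 * x - x ** 3 / 3 + Fraction(16, 3))

assert F(2) == 1       # total probability is 1
print(F(0))            # 1/2, since the pdf is symmetric about 0
```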
Definition: The expected or mean value of a continuous random variable X with pdf
f(x) is

µ = µX = E(X) = ∫_{−∞}^{∞} x · f(x) dx.
If h(X) is any function of X, then

µh(X) = E[h(X)] = ∫_{−∞}^{∞} h(x) · f(x) dx.
A very important special case, which we met in our study of discrete random variables, is
h(X) = (X − µ)². The expectation is then called the variance of X and denoted by σ², σX², or
V(X). That is,

σ² = V(X) = ∫_{−∞}^{∞} (x − µ)² · f(x) dx.
We repeat, once again, that for computational purposes, it is easier to use the shortcut expression σ² = V(X) = E(X²) − µ².
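As an illustration (our example, not one from the text), the moments of the density f(x) = 0.5x on [0, 2] used earlier can be computed exactly from the antiderivatives:

```python
from fractions import Fraction

# E(X)   = integral of x   * 0.5x on [0, 2] = [x^3/6] from 0 to 2 = 8/6 = 4/3
# E(X^2) = integral of x^2 * 0.5x on [0, 2] = [x^4/8] from 0 to 2 = 16/8 = 2
mu = Fraction(8, 6)
ex2 = Fraction(16, 8)

var = ex2 - mu ** 2          # shortcut: V(X) = E(X^2) - mu^2
print(mu, var)               # 4/3 2/9
```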
§4.5. The Normal Distribution
In statistics, the normal distribution is the most important distribution because several random
variables that we encounter in real life have distributions that can be approximated by the normal
distribution. Students who chose to pursue a career in statistics or take more statistics courses will
find that the normal distribution is central to several statistical techniques that are used in data
analysis.
A continuous random variable X is said to have a normal distribution with parameters µ and σ²
if its pdf is f(x; µ, σ) = (1/(σ√(2π))) exp(−(x − µ)²/(2σ²)), −∞ < x < ∞. It can be shown
that the mean and variance of a normally distributed random variable X are E(X) = µ and
V(X) = σ². The function g(x) = exp(x) is called the exponential function, with
g(1) = exp(1) ≈ 2.718282. In statistics, the notation X ∼ N(µ, σ²) is interpreted to mean that the
random variable X has the normal distribution with mean µ and variance σ². Some properties of
the normal distribution are as follows.
(c) The median of the normal density curve is equal to the mean of the normal density curve. That
is, the curve is centered on x = µ and the point x = µ divides the area under the normal curve
into two equal parts. Therefore, the area to the left of x = µ is 0.5 and the area to the right is
also 0.5.
(d) If X ∼ N(µ, σ1²) and Y ∼ N(µ, σ2²) are two normally distributed random variables with the same
mean but different variances (σ1² > σ2²), then the density curve of the random variable X with
the larger variance σ1² will be more spread out than the density curve for Y.
Now, let X ∼ N(µ, σ²) and consider a change of variable from x to z by defining a new random variable

Z = (X − µ)/σ,
which comes from the argument of the exponential function in the expression for the normal density.
This technique is called standardizing the random variable X. If Q is a random variable with mean
µQ = E(Q) and standard deviation σQ = √V(Q), then we can define a new random variable by
standardizing Q. In general, to standardize Q, we subtract µQ from Q and divide by σQ to obtain
the expression, say U, given by

U = (Q − µQ)/σQ.
It is quite straightforward to verify that µZ = E(Z) = 0 and σZ² = V(Z) = 1. Now, using a common
technique in statistics, which is beyond the scope of this course, we can see that the new density
function for Z, called the standard normal density function, is

f(z; 0, 1) = (1/√(2π)) exp(−z²/2),  −∞ < z < ∞.
The properties of the standard normal density curve are the same as before except that µ = 0 and
σ² = 1. We shall denote the cdf of a standard normal random variable Z by Φ(z), where

Φ(z) = P(Z ≤ z) = ∫_{−∞}^{z} f(y; 0, 1) dy.
Observe that computing probabilities involving normal random variables will require that we integrate
the normal density function. This integral is not easy to evaluate. As a result, values of the cumulative
distribution function Φ(z) of the standard normal curve have been evaluated and tabulated for us to
use.
Example: Using the standard normal table, compute the following probabilities.

(a) P (0 ≤ Z ≤ 2.17)
(b) P (−1.53 ≤ Z ≤ 2)
(c) P (−1.75 ≤ Z)
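In Python, the tabulated cdf Φ can be replaced by `statistics.NormalDist` (available in Python 3.8+); a sketch of the three parts:

```python
from statistics import NormalDist

Phi = NormalDist().cdf   # standard normal cdf, in place of the printed table

print(round(Phi(2.17) - Phi(0), 4))    # (a) 0.485
print(round(Phi(2) - Phi(-1.53), 4))   # (b) 0.9142
print(round(Phi(1.75), 4))             # (c) P(Z >= -1.75) = Phi(1.75) = 0.9599
```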
Percentiles of the Standard Normal Distribution
We recall that the point c, such that Φ(c) = P (Z ≤ c) = p is called the (100p)th percentile of
the standard normal distribution. A notation that is commonly used in relation to the normal
distribution is zα , where zα is the 100(1 − α)th percentile of the standard normal distribution. That
is,
Φ(zα ) = P (Z ≤ zα ) = 1 − α.
This means that the area under the standard normal curve to the right of zα is α and the area to
the left is 1 − α. As a result of symmetry of the standard normal curve about the point µz = 0,
we have that the area to the left of −zα is α and the area to the right is 1 − α. The zα values are
usually called critical values rather than percentiles for reasons that will be made clear later in our
studies. This notation is particularly useful and convenient because later in this course we will need
the critical values for some small values of α.
Example: From the standard normal table, we can see that
(a) z0.05 = 95th percentile of the standard normal distribution = 1.645. That is, for
Φ(z0.05 ) = P (Z ≤ z0.05 ) = 1 − 0.05 = 0.95, we must have z0.05 = 1.645.
(b) Similarly, z0.025 = 97.5th percentile of the standard normal distribution = 1.96.
(c) z0.01 = 99th percentile of the standard normal distribution ≈ 2.326.
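These critical values can be recovered from the inverse cdf rather than from the table; a sketch using `statistics.NormalDist`:

```python
from statistics import NormalDist

def z(alpha):
    """z_alpha = the 100(1 - alpha)th percentile of the standard normal."""
    return NormalDist().inv_cdf(1 - alpha)

print(round(z(0.05), 3))   # 1.645
print(round(z(0.025), 2))  # 1.96
print(round(z(0.01), 3))   # 2.326
```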
Students will sometimes encounter problems that involve nonstandard normal distributions in the
sense that although the distribution is normal, the mean is not zero and the variance may or may
not be 1. In that case, we cannot use the standard normal tables directly to compute probabilities.
We first have to transform or standardize the nonstandard normal random variable into a standard
normal random variable as previously discussed before we can use the standard normal table. For
instance, let X ∼ N (µ, σ 2 ) with µ 6= 0 and σ 6= 1, and we wish to compute P (a ≤ X ≤ b). Since X
has a nonstandard normal distribution, we first apply the transformation
Z = (X − µ)/σ,
to the variable X and the limits a and b and then use the standard normal table to obtain the
required probability as follows.
à !
a−µ X −µ b−µ
P (a ≤ X ≤ b) = P ≤ ≤
σ σ σ
à !
a−µ b−µ
= P ≤Z≤
σ σ
à ! µ ¶
b−µ a−µ
= Φ −Φ .
σ σ
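The standardization step can be wrapped in a small helper. The numbers below (µ = 100, σ = 15) are our own illustration, not values from the text:

```python
from statistics import NormalDist

def normal_prob(a, b, mu, sigma):
    """P(a <= X <= b) for X ~ N(mu, sigma^2), computed via standardization."""
    Phi = NormalDist().cdf
    return Phi((b - mu) / sigma) - Phi((a - mu) / sigma)

# P(85 <= X <= 115) for X ~ N(100, 15^2): one sigma on either side of the mean
print(round(normal_prob(85, 115, 100, 15), 4))  # 0.6827
```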
Example (see class notes for solution)
Students may have noticed that the normal distribution is a continuous distribution. However, it is
sometimes used to approximate distributions that are not continuous. Surprisingly, these approxima-
tions are sometimes very close to the true value. In this section we shall discuss one of several possible
approximations. In such cases a correction, called continuity correction is often introduced in the
calculations to account for the fact that we are using a continuous distribution to approximate a
discrete distribution.
Let X ∼ Bin(n, p). Then, for sufficiently large values of n, the distribution of X can be
approximated by a normal distribution with mean µ = np and variance σ² = np(1 − p).
The approximation appears to work well if both np ≥ 10 and n(1 − p) ≥ 10. In particular,
if x is any one of the possible values of X, then
P(X ≤ x) = B(x; n, p) ≈ P((X − µ)/σ ≤ (x + 0.5 − µ)/σ)
         = P(Z ≤ (x + 0.5 − np)/√(np(1 − p)))
         = Φ((x + 0.5 − np)/√(np(1 − p))).
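A sketch comparing the exact binomial cdf with the continuity-corrected normal approximation, for illustrative values n = 50 and p = 0.3 (so np = 15 and n(1 − p) = 35 both exceed 10; these numbers are ours, not from the text):

```python
import math
from statistics import NormalDist

n, p = 50, 0.3
mu = n * p
sigma = math.sqrt(n * p * (1 - p))

def binom_cdf(x):
    """Exact B(x; n, p) by summing the binomial pmf."""
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x + 1))

def approx_cdf(x):
    """Continuity-corrected approximation Phi((x + 0.5 - np)/sigma)."""
    return NormalDist().cdf((x + 0.5 - mu) / sigma)

for x in (10, 15, 20):
    print(x, round(binom_cdf(x), 4), round(approx_cdf(x), 4))
```

The two columns should agree to within a couple of hundredths across the range.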
Example (see class notes for solution)
Definition: Let X and Y be two discrete random variables defined on the sample space
of an experiment. The joint probability mass function (pmf) p(x, y) of X and Y is defined,
for each pair of numbers (x, y), by p(x, y) = P(X = x and Y = y).
Given the joint pmf of two discrete random variables, it is possible to obtain the pmf of each of the
random variables separately. These separate pmfs are called marginal pmfs. The marginal pmf of,
say X, is obtained by summing the joint pmf over all possible values of Y .
Definition: Let the joint probability mass function (pmf) of X and Y be p(x, y). Then
the marginal pmf of X, denoted by pX(x), is given by

pX(x) = Σ_{all y} p(x, y).
Given the marginal pmfs of X and Y, we say that the two random variables are independent if the
joint pmf is equal to the product of the marginal pmfs at all possible pairs of values (x, y). That is,
if X and Y are independent, then p(x, y) = pX(x) · pY(y) for every pair (x, y).
If this relation fails to hold, even for only one of all the possible values, then the random variables
are not independent. They are dependent.
Examples and Practice Problems (see class notes for solutions)
(a) What is the probability that there is exactly one customer in each line ?
(b) What is the probability that the numbers of customers in the two lines are
identical ?
(c) What is the probability that there are at least two more customers in one line
than in the other line ?
(d) What is the probability that the total number of customers in the two lines is
exactly four ? At least four ?
(a) Determine the marginal pmf of X1 and then calculate the expected number of
customers in line at the express checkout.
(b) Determine the marginal pmf of X2 .
(c) Are X1 and X2 independent ?
Consider two jointly distributed discrete random variables X and Y with joint pmf p(x, y). Let
h(X, Y ) be any function of the random variables, then h(X, Y ) is itself a random variable. The
expected or mean value of h(X, Y ) denoted by µh(X,Y ) or E[h(X, Y )] is given by
µh(X,Y) = E[h(X, Y)] = Σ_{all x} Σ_{all y} h(x, y) · p(x, y).
A special case is when h(X, Y ) takes the form h(X, Y ) = (X −µX )(Y −µY ). In that case, E[h(X, Y )]
is called the covariance of X and Y and denoted by Cov(X, Y ). That is, the covariance between X
and Y is
Cov(X, Y) = E[(X − µX)(Y − µY)] = Σ_{all x} Σ_{all y} (x − µX)(y − µY) · p(x, y).

For computational purposes, it is easier to use the shortcut formula

Cov(X, Y) = E(XY) − µX · µY.
If X and Y are independent, it can be shown that E(XY ) = E(X) · E(Y ) = µX · µY . It is then clear
that when X and Y are independent, Cov(X, Y ) = 0. On the other hand Cov(X, Y ) = 0 does not
necessarily imply independence.
We have seen that the variance of a random variable, σX² or V(X), is a measure of the size of the error
made by an experimenter. If the experimenter committed a lot of mistakes during experimentation,
the variance will tend to be large. Alternatively, if the experimenter was very careful and made few
or no mistakes during repeated trials of the experiment, the variance will tend to be small. Now, the
covariance between X and Y measures how the two variables are changing together. For instance if
the values of Y tend to increase as X increases, then we say that there is a positive linear relationship
between X and Y . In that case, the value of Cov(X, Y ) will be positive and large if the linear
relationship is very strong. On the other hand if the values of Y tend to decrease as X increases,
then there is a negative linear relationship between X and Y and the Cov(X, Y ) will be negative.
One problem with using covariance as a measure of the strength of linear relationships is the fact
that it is usually affected by the unit in which the observations were measured. In order to remove
the dependence of the covariance on the units, it is standardized to obtain the correlation coefficient
denoted by ρX,Y or simply ρ or Corr(X, Y ) and given by
ρX,Y = Cov(X, Y) / (σX · σY),
where σX is the standard deviation of X and σY is the standard deviation of Y .
An instructor has given a short quiz consisting of two parts. For a randomly selected student,
let X = the number of points earned on the first part and Y = the number of points earned
on the second part. Suppose that the joint pmf of X and Y is given in the accompanying table
p(x, y)    y = 0    y = 5    y = 10    y = 15
x = 0      0.02     0.06     0.02      0.10
x = 5      0.04     0.15     0.20      0.10
x = 10     0.01     0.15     0.14      0.01
(a) If the score recorded in the grade book is the total number of points earned on the two
parts, what is the expected recorded score E(X + Y ) ?
(b) If the maximum of the two scores is recorded, what is the expected recorded score.
(e) Instead of marking the first part out of 10 and the second part out of 15, suppose the
instructor changed the points so that x = 0, 1, 2 for the first part and y = 0, 1, 2, 3 for the second
part. How would this affect the covariance and the correlation ? Explain.
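Parts of this problem can be sketched directly from the table. The dictionary below encodes the joint pmf, and E is a small helper for E[h(X, Y)] as defined above:

```python
# Joint pmf p(x, y) from the table above
pmf = {
    (0, 0): 0.02,  (0, 5): 0.06,  (0, 10): 0.02,  (0, 15): 0.10,
    (5, 0): 0.04,  (5, 5): 0.15,  (5, 10): 0.20,  (5, 15): 0.10,
    (10, 0): 0.01, (10, 5): 0.15, (10, 10): 0.14, (10, 15): 0.01,
}

def E(h):
    """E[h(X, Y)] = sum over all (x, y) of h(x, y) * p(x, y)."""
    return sum(h(x, y) * p for (x, y), p in pmf.items())

mu_x, mu_y = E(lambda x, y: x), E(lambda x, y: y)

print(round(E(lambda x, y: x + y), 2))      # (a) E(X + Y) = 14.1
cov = E(lambda x, y: x * y) - mu_x * mu_y   # shortcut formula for Cov(X, Y)
print(round(cov, 4))                        # -3.2025
```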
Suppose that the owner of a vending machine on campus is trying to determine how often the machine
should be refilled in a week. The owner decides to observe the number of items left in the vending
machine at the end of each day. Let X = number of items left in the machine on any given day.
Now, at the end of Day 1 suppose the observed number of items is x1 . Similarly, let x2 , x3 , · · · , x30
be the observed numbers from Day 2 to Day 30. We shall call these observed values a sample of size
30. From these observed numbers we can compute the average number of items the owner expects
will be left over in any month of the year,
x̄ = (x1 + x2 + · · · + x30)/30 = (1/30) Σ_{i=1}^{30} xi.
We can also determine how the numbers vary from day to day,
s² = (1/29) Σ_{i=1}^{30} (xi − x̄)²,
called the sample variance. Suppose, we repeat the experiment for another 30 days, we expect the
observed values x1 , x2 , x3 , · · · , x30 to be slightly or very different from the observed values from the
first experiment. This will in turn lead to different values of x̄ and s2 . This variation in observed
values in turn implies that the value of any function of the sample observations, such as mean,
variance, standard deviation, and so on, also varies from sample to sample.
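These two computations can be sketched with the `statistics` module, using made-up daily counts (the notes do not supply data):

```python
import statistics

# hypothetical leftover-item counts for one 30-day period
sample = [4, 7, 5, 6, 3, 8, 5, 4, 6, 7, 2, 5, 6, 4, 5,
          7, 3, 6, 5, 4, 8, 5, 6, 2, 7, 5, 4, 6, 5, 3]

x_bar = statistics.mean(sample)     # (1/30) * sum of the x_i
s2 = statistics.variance(sample)    # divides by n - 1 = 29, as in the formula above
print(x_bar, round(s2, 4))
```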
Now, before we actually observe the daily number of left over items, the owner has no way of knowing
what the value will be on any given day. The number could range from 0 to the total number of
items the machine can contain, say n. Therefore, before the experiment is actually performed, the
daily number of leftover items is itself viewed as a random variable. Denote the random variables by
X1, X2, X3, · · · , X30. Therefore, the mean and variance computed from these random variables will
also be random variables,
X̄ = (X1 + X2 + · · · + X30)/30 = (1/30) Σ_{i=1}^{30} Xi,
and
S² = (1/29) Σ_{i=1}^{30} (Xi − X̄)².
Notice that we have used lower case or small letter x for the actual observations and upper case or
capital letter X to represent the random variable. Thus, the sample mean regarded as a statistic
is denoted by X̄ with calculated or observed value x̄. Similarly, the sample variance regarded as a
statistic is represented as S 2 with s2 as its computed value.
Another way of saying the same thing is to say that X1 , X2 , · · · , Xn are independent and
identically distributed abbreviated as iid.
Recall that every random variable has a probability distribution. This implies that every statistic,
being a random variable, has a probability distribution. The probability distribution of a statistic is
sometimes called its sampling distribution, simply to emphasize how the value of a statistic varies from
sample to sample. The main issue is how the sampling distribution of a statistic can be derived.
There are a number of ways in which this can be done. If the Xi's form a random sample and their
distribution is known, then we can use probability rules to determine the distribution of a statistic
computed from the random sample. In other situations, where it is too complicated to use probability
rules, we may obtain information about the sampling distribution of a statistic by using a computer
to mimic the behaviour of the statistic through what is normally called a simulation experiment.
We shall use an example to illustrate how the probability mass function of a statistic can be derived.
x1:      0     1     2
p(x1):  0.2   0.5   0.3        (µ = 1.1, σ² = 0.49)
Let X2 be the number of lights at which I must stop on the way home; X2 is
independent of X1 . Assume that X2 has the same distribution as X1 , so that X1 ,
X2 is a random sample of size n = 2.
In Problem 41, Page 236, we saw that the population mean of the random variable X was µ =
E(X) = 2 and the population variance was σ² = V(X) = E(X²) − µ² = 5 − 4 = 1. The sample
mean X̄ of a random sample of size n on the random variable X provides an indication of what a
typical value of X will be and the sample variance S 2 tells us how spread out the values of X will
be from this typical value X̄. Intuitively, we can say that since X̄ is a typical value of X, the mean
or expected value of X̄ should be the same as that of X. That is, we expect that µX̄ = E(X̄) = µ.
In our example in class, Problem 41, Page 236, we saw that this was the case. However, we saw that
the variance of X̄ was slightly different from the variance of X, in the sense that
σX̄² = V(X̄) = V(X)/n = σ²/n = 1/2 = 0.5.
This gives us an idea of the relationship between the mean µ of X and the mean µX̄ of X̄, and that
between the variance σ² of X and the variance σX̄² of X̄. This relationship can be extended to more
general situations. We state this extension below.
We remark that the above statement refers to a random sample from any distribution. We state
below the distribution of X̄ if the random sample is from a normal distribution.
1. X̄ ∼ N(µ, σ²/n).

2. T0 ∼ N(nµ, nσ²).
Remark: Even when the random sample X1 , X2 , · · · , Xn is not from a normal distri-
bution but the sample size n is sufficiently large, it can be shown that both the sample
mean X̄ and the sample total T0 can be approximated by a normal distribution. That is,
if the random sample is not from a normal distribution, then
1. X̄ ≈ N(µ, σ²/n).

2. T0 ≈ N(nµ, nσ²).
The larger the value of n, the better the approximation. This statement of approximation
is called the central limit theorem (CLT). In this class, by sufficiently large, we mean
n > 30.
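The CLT can be illustrated by simulation. Below, samples of size n = 40 are drawn from a (decidedly non-normal) Uniform[0, 1] population, for which µ = 0.5 and σ² = 1/12; the sample means cluster around 0.5 with standard deviation close to σ/√n ≈ 0.046. The sample size and repetition count are our choices for illustration:

```python
import random
import statistics

random.seed(0)                    # reproducible illustration
n, reps = 40, 2000

# mean of each of `reps` samples of size n from Uniform[0, 1]
means = [statistics.mean(random.random() for _ in range(n)) for _ in range(reps)]

print(round(statistics.mean(means), 3))   # close to mu = 0.5
print(round(statistics.stdev(means), 3))  # close to sqrt((1/12)/40) ~ 0.046
```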
(a) If X̄ is the sample mean diameter for a random sample of n = 16 rings, where
is the sampling distribution of X̄ centered, and what is the standard deviation
of the X̄ distribution ?
(b) Suppose the distribution of diameter is normal, calculate P (11.99 ≤ X̄ ≤ 12.01).
Problem 48, Page 242
Let X1 , X2 , . . . , X100 denote the actual net weights of 100 randomly selected 50-lb
bags of fertilizer. If the expected weight of each bag is 50 and the variance is 1,
calculate P (49.75 ≤ X̄ ≤ 50.25).
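A sketch of Problem 48 in Python: by the CLT, X̄ is approximately N(50, 1/100), so its standard deviation is 0.1:

```python
from statistics import NormalDist

n, mu, sigma = 100, 50, 1.0
se = sigma / n ** 0.5              # standard deviation of X-bar = 0.1
Phi = NormalDist().cdf

p = Phi((50.25 - mu) / se) - Phi((49.75 - mu) / se)
print(round(p, 4))                 # Phi(2.5) - Phi(-2.5) = 0.9876
```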
Problem 50, Page 243
The breaking strength of a rivet has a mean value of 10,000 psi and a standard
deviation of 500 psi. What is the probability that the sample mean breaking strength
for a random sample of 40 rivets is between 9900 and 10,200 psi ?
5.4. The Distribution of a Linear Combination
Given constants a1, a2, · · · , an and random variables X1, X2, · · · , Xn, the random variable
Y = a1X1 + a2X2 + · · · + anXn is called a linear combination, and we can use it to represent
many statistics for different values of the constants. We observe that the sample mean X̄ is a
special case of the linear combination Y with a1 = a2 = · · · = an = 1/n. When
a1 = a2 = a3 = · · · = an = 1, the linear combination Y becomes the sample total T0. The next
issue is how to compute the expectation and variance of a linear combination.
µY = E(a1X1 + a2X2 + · · · + anXn)
   = a1E(X1) + a2E(X2) + · · · + anE(Xn)
   = a1µ1 + a2µ2 + · · · + anµn = Σ_{i=1}^{n} ai·µi.
Furthermore, if the random variables X1 , X2 , · · · , Xn are independent, then the variance
of the linear combination Y is
σY² = V(a1X1 + a2X2 + · · · + anXn)
    = a1²V(X1) + a2²V(X2) + · · · + an²V(Xn)
    = a1²σ1² + a2²σ2² + · · · + an²σn² = Σ_{i=1}^{n} ai²·σi².
Now, if the random variables X1 , X2 , · · · , Xn are not independent then the variance of
the linear combination Y is
σY² = V(a1X1 + a2X2 + · · · + anXn) = Σ_{i=1}^{n} Σ_{j=1}^{n} ai·aj·Cov(Xi, Xj).
(d) If µ1 = 40, µ2 = 50, µ3 = 60, σ12 = 10, σ22 = 12 and σ32 = 14, calculate
P (X1 + X2 + X3 ≤ 160).
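A sketch of part (d), assuming (as in the course text) that the Xi are independent and normally distributed, so that the sum is N(150, 36):

```python
from statistics import NormalDist

mus = [40, 50, 60]
variances = [10, 12, 14]

mu_T = sum(mus)                    # 150
sd_T = sum(variances) ** 0.5       # sqrt(36) = 6, using independence

print(round(NormalDist().cdf((160 - mu_T) / sd_T), 4))   # Phi(10/6) = 0.9522
```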
ranks of the positive observations are 2 and 3, so W = 5. In Chapter 15, W is called
Wilcoxon’s signed rank statistic. W can be represented as follows:
W = 1·Y1 + 2·Y2 + 3·Y3 + · · · + n·Yn = Σ_{i=1}^{n} i·Yi,
where the Yi ’s are independent Bernoulli random variables, each with p = 0.5 (Yi =
1 corresponds to the observation with rank i being positive). Compute the following:
6. POINT ESTIMATION
7. INTERVAL ESTIMATION
Let us assume that we are trying to figure out the amount of money a student needs to survive
in a typical university for a given semester. Now, based on our experience as students or other
information available to us, in the form of data, we may estimate the amount in two ways. We can
choose to use the data to specify a single amount, say $5000.00, as our estimate. Such an estimate
which specifies a single value is called a point estimate. It is called a point estimate because it is
a single value, a point in some sense. Such estimates are discussed in Chapter 6. Someone who is
more careful and does not like to be wrong may suggest instead that we specify an interval as an
estimate. So, we may say, the amount should lie between $3500 and $4500. Such an estimate is
called an interval estimate. We may decide not to be too precise by specifying a very wide range
of values, say between $3500 and $10,000. In that case, the interval estimate will not be very
helpful to students because it is too wide, the width being $10,000 − $3500 = $6500. In this
section, we will be focusing on how to use information in data to obtain interval estimates.
We shall introduce the basic concepts of interval estimation with an example that is practically un-
realistic. We shall then consider situations that may arise in practice. Consider a random variable
X = alcohol content of a certain brand of wine manufactured by a company. Suppose, the problem
is to estimate the unknown population mean µ of the alcohol content of all bottles of this particular
brand of wine manufactured by the company.
Let us assume that X ∼ N (µ, σ 2 ), with σ 2 known and suppose that we take a random sample of
size n wine bottles of this brand with alcohol content X1 , X2 , · · · , Xn . On the average, the alcohol
content of the n wine bottles will be X̄. From results of §5, we know that X̄ ∼ N (µ, σ 2 /n). If we
standardize X̄, we obtain a standard normal random variable

Z = (X̄ − µ)/(σ/√n).
Since Z has the standard normal distribution, P(−1.96 < Z < 1.96) = 0.95. Then, by simplifying
the expression within the brackets, we can rewrite this probability statement as

P(X̄ − 1.96·σ/√n < µ < X̄ + 1.96·σ/√n) = 0.95.
This simply means that the probability that the random interval

(X̄ − 1.96·σ/√n, X̄ + 1.96·σ/√n)
will cover the “true” mean µ is 0.95. The interval is random because the statistic X̄ is random. That
is, before the experiment is performed and measurements are taken, we are 95% certain or confident
that the “true” mean µ will lie inside the interval above. Once the experiment is performed and
measurements are observed, say X1 = x1 , X2 = x2 , · · · , Xn = xn , then the observed average or mean
becomes X̄ = x̄. The resulting fixed, non-random interval given by
à !
σ σ
x̄ − 1.96 √ , x̄ + 1.96 √ ,
n n
or written as
σ σ
x̄ − 1.96 √ < µ < x̄ + 1.96 √
n n
or in a more compact form as
σ
x̄ ± 1.96 √ ,
n
is called a 95% confidence interval for µ. The situation we have used to obtain the expression
for the confidence interval is practically unrealistic because of the assumptions we have made. We
assumed that the random sample is from a normal population with unknown mean µ and known
variance σ 2 . In practice, the mean is needed in order to compute the variance. So, if the mean is
unknown, it is highly unlikely that the variance will be known.
Some students may have noticed that there is a relationship between the value 1.96 in the expression
for the confidence interval and the confidence level 0.95. We observe that 1.96 is the (1 − 0.05/2)100th
= (1 − 0.025)100th = 97.5th percentile or critical value of the standard normal distribution. Using
the notation from Chapter 4, we write z0.05/2 = z0.025 = 1.96.
Thus, for a general confidence level, say 99% or 98% or 90%, a (1 − α)100% confidence interval for
the population mean µ of a distribution is obtained by replacing 1.96 with the critical value zα/2:

(x̄ − zα/2·σ/√n, x̄ + zα/2·σ/√n).
An interval with a large width shows that the error in our estimation is large, and therefore the
interval will not be very informative as an estimate.
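The known-σ interval can be wrapped in a small helper; the data below (x̄ = 12.5, σ = 0.8, n = 25) are invented for illustration and are not from the text:

```python
from statistics import NormalDist

def z_interval(x_bar, sigma, n, level=0.95):
    """(level*100)% CI for mu when sigma is known: x_bar +/- z_{alpha/2}*sigma/sqrt(n)."""
    alpha = 1 - level
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half = z * sigma / n ** 0.5
    return x_bar - half, x_bar + half

lo, hi = z_interval(12.5, 0.8, 25)
print(round(lo, 3), round(hi, 3))   # roughly 12.186 12.814
```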
Problems 1, 5, Page 290
Assume that the helium porosity (in percentage) of coal samples taken from any
particular seam is normally distributed with true standard deviation 0.75.
(a) What is the confidence level for the interval x̄ ± 2.81·σ/√n ?
(b) What value of zα/2 in the CI formula results in a confidence level of 99.7% ?
(c) Compute a 98% CI for the true average porosity of a certain seam if the average
porosity for 16 specimens from the seam was 4.56.
(d) How large a sample size is necessary if the width of the 95% interval is to be
0.4 ?
Problem 2, Page 290
Each of the following is a confidence interval for µ = true average (i.e., population
mean) resonance frequency (Hz) for all tennis rackets of a certain type:
(114.4, 115.6) (114.1, 115.9)
(a) What is the value of the sample mean resonance frequency ?
(b) Both intervals were calculated from the same sample data. The confidence level
for one of these intervals is 90% and for the other is 99%. Which of the intervals
has the 90% confidence level, and why ?
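The reasoning for both parts can be verified in a few lines; the arithmetic is the whole point of the sketch.

```python
lo1, hi1 = 114.4, 115.6
lo2, hi2 = 114.1, 115.9

# (a) Each interval has the form x̄ ± margin, so x̄ is the common midpoint.
xbar = (lo1 + hi1) / 2

# (b) Same data → same s/√n, so the higher confidence level gives the wider
# interval; the narrower interval must be the 90% one.
width1, width2 = hi1 - lo1, hi2 - lo2
ninety_percent_interval = (lo1, hi1) if width1 < width2 else (lo2, hi2)
```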
7.2. Large-Sample Confidence Intervals for a Population Mean and Proportion
In §7.1 we made some assumptions which are practically unrealistic. We assumed that the variance
of the population distribution σ 2 is known. The assumption that the random sample is from a
normal population may be valid sometimes but may fail in other situations. We will now discuss
large-sample confidence intervals whose validity does not depend on these assumptions.
Large-Sample Interval for µ
Let X1 , X2 , · · · , Xn be a random sample from a population whose distribution is unknown, with
mean µ and variance σ 2 that are also unknown. Recall that if the sample size n is large, then by
the Central Limit Theorem we discussed in Chapter 5, the distribution of the sample mean X̄ is
approximately normal with mean E(X̄) = µ and variance V (X̄) = σ 2 /n. That is,
X̄ ≈ N(µ, σ 2 /n).
Consequently, the standardized variable
Z = (X̄ − µ)/(σ/√n)
has approximately the standard normal distribution. Therefore,
P( −zα/2 ≤ (X̄ − µ)/(σ/√n) ≤ zα/2 ) ≈ 1 − α.
Following the technique used in §7.1 we can rearrange the expression in the parentheses to obtain an
approximate (1 − α)100% confidence interval for the population mean µ as
x̄ ± zα/2 · σ/√n.
The difficulty with this expression for confidence interval is that σ is not known. Therefore, since n
is sufficiently large, we simply replace σ by the sample standard deviation
s = √[ ( Σᵢ₌₁ⁿ xᵢ² − n x̄² ) / (n − 1) ],
to obtain the large-sample confidence interval for µ with an approximate confidence level of
(1 − α)100% as
x̄ ± zα/2 · s/√n.
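As a sketch, the shortcut formula for s and the large-sample interval can be packaged together; the data list below is hypothetical, and for a real problem n should exceed 30.

```python
from math import sqrt
from statistics import NormalDist

def large_sample_ci(data, level=0.95):
    """x̄ ± z_{alpha/2} · s/√n, with s from the shortcut formula."""
    n = len(data)
    xbar = sum(data) / n
    # shortcut formula: s² = (Σx² − n·x̄²)/(n − 1)
    s = sqrt((sum(x * x for x in data) - n * xbar ** 2) / (n - 1))
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    half = z * s / sqrt(n)
    return xbar - half, xbar + half

data = [4.2, 5.1, 3.8, 4.9, 5.5, 4.0, 4.7, 5.2] * 5   # 40 observations
lo, hi = large_sample_ci(data)
```

The shortcut value of s agrees with the usual definition based on Σ(xᵢ − x̄)².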
Examples (see class notes for solution)
(a) Calculate a 95% (two-sided) CI for the true average dye-layer density for all
such trees.
(b) Suppose the investigators had made a rough guess of 0.16 for the value of s before
collecting data. What sample size would be necessary to obtain an interval width
of 0.05 for a confidence level of 95% ?
Large-Sample Interval for a Population Proportion p
In an earlier chapter we discussed how the normal distribution can be used to approximate the
binomial distribution. Now, suppose that X is a binomial random variable with parameters n and p, where
n is the number of trials and p is the probability or proportion of successes. We recall that if n
is sufficiently large, then X has approximately a normal distribution with mean E(X) = np and
variance V (X) = np(1 − p). It follows that the sample proportion p̂ = X/n has approximate mean p
and variance p(1 − p)/n, so the standardized variable
Z = (p̂ − p)/√( p(1 − p)/n )
is approximately standard normal. Thus, using the same approach as before, we find that
P( −zα/2 ≤ (p̂ − p)/√( p(1 − p)/n ) ≤ zα/2 ) ≈ 1 − α.
If we rearrange the inequalities in the bracket, we obtain a large-sample confidence interval for
a population proportion p with an approximate confidence level of (1 − α)100% to be
[ p̂ + (zα/2)²/(2n) ± zα/2 √( p̂(1 − p̂)/n + (zα/2)²/(4n²) ) ] / [ 1 + (zα/2)²/n ].
Suppose that the width of the confidence interval for p is w. Then, the sample size required to obtain
a confidence interval of width w is
n = [ 2(zα/2)² p̂(1 − p̂) − (zα/2)² w² ± √( 4(zα/2)⁴ p̂(1 − p̂)[p̂(1 − p̂) − w²] + w²(zα/2)⁴ ) ] / w².
Unfortunately, the formula for n involves the unknown p̂. It can be shown that the maximum value
of p̂(1 − p̂) is attained when p̂ = 0.5. Therefore, using the value of p̂ = 0.5 will ensure that the
width will be at most w regardless of what value of p̂ results from the sample. Alternatively, if the
investigator believes strongly, based on prior experience and information, that the true value of p,
say p0 , is less than 0.5 then the value p0 can be used in place of p̂.
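The sample-size formula above translates directly into code; a sketch, taking the + root of the ± and using the conservative p̂ = 0.5 (the target width w = 0.1 and the 95% level are hypothetical choices).

```python
from math import sqrt
from statistics import NormalDist

def sample_size_for_width(w, phat=0.5, level=0.95):
    """n from the width formula, taking the + root; phat = 0.5 is conservative."""
    z2 = NormalDist().inv_cdf(1 - (1 - level) / 2) ** 2   # z_{alpha/2} squared
    pq = phat * (1 - phat)
    num = 2 * z2 * pq - z2 * w**2 + sqrt(4 * z2**2 * pq * (pq - w**2) + w**2 * z2**2)
    return num / w**2

n = sample_size_for_width(0.1)   # 95% interval of width 0.1, worst-case p̂
# p(1 − p) is largest at p = 0.5, so this n is conservative:
assert all(0.5 * 0.5 >= p * (1 - p) for p in [0.1, 0.3, 0.7, 0.9])
```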
Problem 21, Page 298
A random sample of 539 households from a certain city was selected, and it was
determined that 133 of these households owned at least one firearm. Compute a
95% confidence interval for the proportion of all households in this city that own at
least one firearm.
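A sketch of Problem 21 using the score interval from this section; the final digits depend on rounding, so treat this as a check rather than the book answer.

```python
from math import sqrt
from statistics import NormalDist

x, n = 133, 539
phat = x / n                                  # ≈ 0.2468
z = NormalDist().inv_cdf(0.975)               # ≈ 1.96 for a 95% interval
z2 = z * z

# score interval: [p̂ + z²/(2n) ± z√(p̂(1−p̂)/n + z²/(4n²))] / (1 + z²/n)
center = (phat + z2 / (2 * n)) / (1 + z2 / n)
margin = z * sqrt(phat * (1 - phat) / n + z2 / (4 * n**2)) / (1 + z2 / n)
ci = (center - margin, center + margin)       # roughly (0.212, 0.285)
```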
The interval estimates we discussed in §7.1 were not practical because of the assumptions underlying
the estimates. In §7.2 we discussed intervals for large samples, that is, when n > 30. Here, we
discuss interval estimates for small samples and the assumptions under which the interval estimation
method can be used. In §7.2 the intervals were obtained by invoking the central limit theorem
because the sample size was assumed to be large. The key was that the standardized variable
Z = (X̄ − µ)/(S/√n),
has approximately the standard normal distribution when n is large even when the population dis-
tribution of the random sample X1 , X2 , · · · , Xn is not the normal distribution.
Now, when the sample size n is small, we cannot use the central limit theorem to obtain a random
variable that has an approximate standard normal distribution. However, we can show that if n is
small and the random sample X1 , X2 , · · · , Xn of size n is from a normal population, then
the standardized variable
T = (X̄ − µ)/(S/√n),
has a probability distribution that is called a Student t-distribution with n − 1 degrees of freedom,
where S is the sample standard deviation. Notice that the standardized variable has not changed; we
have simply chosen to denote it by T rather than Z in order to emphasize the point that it is no
longer normally distributed. Due to the complicated mathematical expression for the probability
density function of T , we shall not provide the expression in this class. In addition, to assist us
in computing probabilities without actually integrating the density function, a table of values of
cumulative probability for the t-distribution is available in most elementary statistics texts and will
be made available to us in the class.
Properties of t Distributions
The normal distribution is characterized by two parameters, the mean µ and the variance σ 2 . By that
we mean, once these two parameters are known, the normal distribution is completely determined.
On the other hand, the t-distribution is characterized by only one parameter called the degrees of
freedom we shall denote by the Greek letter ν. Possible values of ν are 1, 2, 3, · · ·. For a fixed value
of ν, some of the properties or features of the density curve of the t-distribution are as follows.
Let tν denote the density curve of the t-distribution with ν degrees of freedom. Then,
1. each tν curve is bell-shaped and centered at 0. The standard normal z curve also
has this property.
2. each tν curve is more spread out than the standard normal z curve. In other words,
the variance of data from a population with a t-distribution is more likely to be larger
than the variance of data from a population with a standard normal distribution.
3. as ν increases, the spread of the corresponding tν curve decreases.
4. as ν becomes larger and larger or tends to infinity, the sequence of tν curves ap-
proaches the standard normal z curve.
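Properties 2-4 can be checked numerically. The sketch below integrates the tν density (whose formula these notes omit; it is stated here as an assumption) with the trapezoid rule and inverts the tail by bisection; a t table or a statistics library would normally be used instead.

```python
from math import gamma, pi, sqrt

def t_pdf(x, nu):
    """Density of the t-distribution with nu degrees of freedom (assumed formula)."""
    c = gamma((nu + 1) / 2) / (sqrt(nu * pi) * gamma(nu / 2))
    return c * (1 + x * x / nu) ** (-(nu + 1) / 2)

def t_upper_tail(t, nu, steps=8000, cutoff=60.0):
    """P(T > t) by the trapezoid rule on [t, cutoff]; the tail past 60 is negligible."""
    h = (cutoff - t) / steps
    area = 0.5 * (t_pdf(t, nu) + t_pdf(cutoff, nu))
    for i in range(1, steps):
        area += t_pdf(t + i * h, nu)
    return area * h

def t_critical(alpha, nu):
    """Value t with P(T > t) = alpha, found by bisection."""
    lo, hi = 0.0, 30.0
    for _ in range(40):
        mid = (lo + hi) / 2
        if t_upper_tail(mid, nu) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Property 3: the 2.5% critical value shrinks as nu grows ...
crit = [t_critical(0.025, nu) for nu in (4, 30, 120)]
# ... and Property 4: it approaches z_{0.025} = 1.96 from above.
```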
Notation
Consider the random sample X1 , X2 , · · · , Xn of size n from a normal population with n < 30. Let X̄
be the sample mean. Then, the standardized random variable
T = (X̄ − µ)/(S/√n),
has a student t-distribution with n − 1 degrees of freedom. Following the technique of §7.1, we can
write that
P( −tα/2,n−1 ≤ (X̄ − µ)/(S/√n) ≤ tα/2,n−1 ) ≈ 1 − α.
Therefore, if x̄ is the observed sample mean and s is the observed sample standard deviation computed
from n < 30 observed data from a normal population with unknown mean µ and unknown variance
σ 2 , then a (1 − α)100% confidence interval for the population mean µ is
x̄ ± tα/2,n−1 · s/√n.
Examples (see class notes for solution)
Calculate a two-sided 95% confidence interval for the true average degree of polymerization. Does
the interval suggest that 440 is a plausible value for the true average degree of polymerization?
What about 450?
Problem 50, Page 311
A journal article reports that a sample of size 5 was used as a basis for calculating a
95% CI for the true average natural frequency (Hz) of delaminated beams of a certain
type. The resulting interval was (229.764, 233.504). You decide that a confidence
level of 99% is more appropriate than the 95% level used. What are the limits of
the 99% interval ?
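Problem 50 can be worked by recovering s/√n from the 95% interval. The two t critical values below are assumptions read from a standard t table for ν = n − 1 = 4.

```python
t_025_4 = 2.776        # t_{0.025, 4}, assumed from a t table
t_005_4 = 4.604        # t_{0.005, 4}, assumed from a t table

lo95, hi95 = 229.764, 233.504
xbar = (lo95 + hi95) / 2                  # the interval is centered at x̄
se = (hi95 - lo95) / 2 / t_025_4          # s/√n recovered from the 95% half-width
ci99 = (xbar - t_005_4 * se, xbar + t_005_4 * se)   # ≈ (228.5, 234.7)
```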
Before the new treatment method was proposed, it was believed that any other method is at best
as good as the standard method. That is, the standard method was initially favored. Therefore,
the prior belief or initially favored claim can be written mathematically as µ1 − µ2 ≤ 0. As a result
of the new proposal, the scientists are claiming mathematically that there is an alternative given by
µ1 − µ2 > 0. The initially favored claim or the current state of affairs will not be rejected in favor of
the alternative unless there is strong evidence in favor of the alternative claim.
It is usual to refer to the prior belief as the null hypothesis denoted by H0 and the
claim that is to be proved or disproved or the claim that contradicts the prior belief as the
alternative hypothesis and denoted by Ha . By definition, a statistical hypothesis
or simply hypothesis is a claim or assertion about the value of a single parameter or about
the values of several parameters or about the form of an entire probability distribution.
For the water treatment example, we then write
H0 : µ1 − µ2 ≤ 0 versus Ha : µ1 − µ2 > 0.
A test of hypotheses is a method for using sample data to decide whether to reject or not
reject the null hypothesis H0 .
In order to decide whether to reject a claim or not, the investigator has to establish a reasonable
rule based on available evidence or sample data that will form the basis of the decision. Such a rule
is called a test procedure.
Test Procedures
Suppose that we claim that female students perform better than male students in mathematics
courses at Memorial University. If X = number of female students who pass mathematics courses,
then p = X/N = proportion of females with a pass in mathematics courses, where N = total number
of students in mathematics courses. Now, the prior belief or initially favored claim is that students
performance in mathematics courses is gender neutral. Thus, the null hypothesis can be written
mathematically as
H0 : p = 0.5,
Ha : p > 0.5.
Since the number of students taking mathematics courses is so large, I decide to take a random
sample of, say, n students and observe whether each student is male or female and whether the
student passed or failed the course. We note that for each student, there are only two possible
outcomes, pass (coded as 1) or fail (coded as 0). At the end, we count the total number of 1’s, which is the
observed value, say x, of the random variable X = number of female students among n students who
passed the mathematics course. Clearly, X is a binomial random variable with parameters n and
unknown p. That is, X ∼ Bin(n, p).
Now, a reasonable rule for deciding whether to reject the null hypothesis H0 or not will depend on X.
Suppose n = 40 and we found x = 22. Is this strong evidence in favour of rejecting H0? The answer
is NO. The value x = 22 is probably due to the particular sample that we took. A different sample may turn
up x = 21. However, we can decide that any value of x ≥ 28 is strong enough for us to reject H0 .
Now, in hypothesis testing, the set of all values for which we reject H0 is called the critical region
or rejection region. For this example, we reject H0 for all values of x greater than or equal to 28.
Thus, the rejection or critical region, which we shall denote by R, can be written mathematically as
R = {x : x ≥ 28}.
Clearly, our test procedure or rule for deciding on whether to reject or not reject H0 is made up of
two components,
(a) the value of the random variable X, which counts the number of 1’s or pass in the response or
sample of students performance. In other words, X is a function of the sample data. In this
example, the function of the sample data X is called a test statistic, and
(b) the value 28, which we shall call the critical value (the value at which rejection begins). That
means, once the critical value is reached or exceeded, the decision is to reject the null hypothesis.
Now, we reject H0 if the observed value of X falls within the rejection region.
We generalize the ideas described above by stating that, in general, a test procedure is specified by
the following;
1. A test statistic which is a function of the sample data on which the decision to reject H0 or
not to reject H0 is to be based.
2. A critical or rejection region R described by the set of all possible test statistic values for which
H0 will be rejected.
Again, we reject H0 only if the observed or computed value of the test statistic falls in the critical
or rejection region.
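The two components can be illustrated on the gender example, where X ~ Bin(40, 0.5) under H0 and the critical value is 28; a sketch using exact binomial probabilities.

```python
from math import comb

def prob_ge(c, n, p):
    """P(X >= c) for X ~ Bin(n, p): the chance the test statistic lands in R."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(c, n + 1))

n, p0 = 40, 0.5
# Observing x = 22 is unremarkable when H0 is true ...
p_22 = prob_ge(22, n, p0)
# ... while the region R = {x : x >= 28} is rarely entered under H0.
p_region = prob_ge(28, n, p0)
```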
In every decision making process, there is always the risk of making the wrong decision. It is possible
to reject H0 when in fact H0 is true. This error is called a type I error. It is also possible to not
reject H0 when it is false leading to a type II error. We can tabulate these errors as follows
Decision            H0 True             H0 False
Reject H0           Type I error        Correct decision
Do not reject H0    Correct decision    Type II error
In statistics, the chance of committing a type I error is called the level of significance and denoted
by α. That is,
α = P (Type I error) = P (Reject H0 | H0 is true).
The desire of managers is always to make the chance of committing both Type I and Type II errors
as small as possible. Unfortunately, it is impossible to minimize both simultaneously. Statisticians
have found that as we decrease the size of α, the size of β = P (Type II error) becomes larger and larger. Thus, the
approach that is commonly adopted is to fix a value for α and construct a test procedure that will
minimize the size of β. The test procedure is then called a level α test.
(a) What function of the random sample of 25 individuals will you use to decide
whether to reject H0 or not to reject H0 ?
(b) Which of the following rejection regions is most appropriate and why?
R1 = {x : x ≤ 7 or x ≥ 18},
R2 = {x : x ≤ 8},
R3 = {x : x ≥ 17}.
(c) In the context of this problem situation, describe what type I and type II errors
are.
(d) What is the probability distribution of the test statistic X when H0 is true ?
Use it to compute the probability of a type I error.
(e) Compute the probability of a type II error for the selected region when p = 0.3.
(f) Using the selected region, what would you conclude if 6 of the 25 queried favored
company 1?
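The full problem statement is not reproduced above, so the sketch below assumes, per part (d), that the test statistic is X ~ Bin(25, p) with H0 : p = 0.5 and that the two-sided region R1 is the one selected.

```python
from math import comb

def pmf(k, n, p):
    """P(X = k) for X ~ Bin(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def prob_region(region, n, p):
    """Probability that X lands in the rejection region when X ~ Bin(n, p)."""
    return sum(pmf(k, n, p) for k in region)

n = 25
R1 = list(range(0, 8)) + list(range(18, 26))    # x <= 7 or x >= 18

alpha = prob_region(R1, n, 0.5)     # (d) P(type I error) under H0, ≈ 0.04
beta = 1 - prob_region(R1, n, 0.3)  # (e) P(type II error) at p = 0.3
```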
Problem 10, Page 325
A mixture of pulverized fuel ash and Portland cement to be used for grouting should
have a compressive strength of more than 1300 KN/m2 . The mixture will not be
used unless experimental evidence indicates conclusively that the strength specifi-
cation has been met. Suppose compressive strength for specimens of this mixture
is normally distributed with σ = 60. Let µ denote the true average compressive
strength.
(a) What are the appropriate null and alternative hypotheses?
(b) Let X̄ = sample average compressive strength for n = 20 randomly selected
specimens. Consider a test procedure with test statistic X̄ and rejection region
x̄ ≥ 1331.26. What is the probability distribution of the test statistic when H0
is true? What is the probability that the mixture was judged to be satisfactory
and used when it should not have been used?
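Part (b) asks for the type I error probability of the given region; a sketch, using the fact that under H0 at the boundary µ = 1300 the test statistic X̄ is N(1300, 60²/20).

```python
from math import sqrt
from statistics import NormalDist

mu0, sigma, n = 1300, 60, 20
cutoff = 1331.26

se = sigma / sqrt(n)                         # ≈ 13.42
# P(X̄ ≥ 1331.26 | µ = 1300): the chance of using an unsatisfactory mixture
alpha = 1 - NormalDist(mu0, se).cdf(cutoff)  # ≈ 0.01
```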
8.2. Tests About a Population Mean
In our discussion on tests about a population mean, we shall begin with a case that is rarely met in
practice. We begin by assuming that the population is normal and that σ 2 , the population variance is
known. Then we shall discuss the case of large-sample test where the distribution of the population
is unknown and the population variance is also unknown. We will conclude this section with a
discussion on the case where the sample size is small. In that case, we will require the population to
be normally distributed. In all three cases, the null hypothesis of interest will be
H0 : µ = µ0 ,
and the alternative hypothesis Ha will be one of the following three.
(A) An upper-tailed test, Ha : µ > µ0 .
(B) A lower-tailed test, Ha : µ < µ0 .
(C) A two-tailed test, Ha : µ ≠ µ0 .
A test involving this alternative is said to be two-tailed because the direction of the departure of µ
from µ0 is not specified. Therefore, the critical region lies on both tails of the distribution of the
test statistic.
Let X1 , X2 , . . . , Xn represent a random sample of size n from a normal population. Then a reasonable
function of the sample data that can be used for deciding whether to reject H0 : µ = µ0 or not is the
sample mean. We saw in our previous example that to compute the probability of a type I error, we
had to standardize X̄. Thus, rather than use X̄ we shall use the standardized X̄, when H0 is true,
as our test statistic. That is, the test statistic for testing H0 : µ = µ0 against any one of the three
alternatives is given by
Z = (X̄ − µ0 )/(σ/√n).
Now let z be a computed value of the test statistic Z. For a fixed value of the significance level α,
the rejection region for a level α test is z ≥ zα for the upper-tailed alternative, z ≤ −zα for the
lower-tailed alternative, and either z ≥ zα/2 or z ≤ −zα/2 for the two-tailed alternative.
When testing hypotheses about a parameter, it may be helpful to follow the steps below.
3. State the formula for the test statistic and compute its value.
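The test statistic and the usual level-α rejection rules (z ≥ zα, z ≤ −zα, and |z| ≥ zα/2 for the three alternatives) can be bundled into one sketch.

```python
from math import sqrt
from statistics import NormalDist

def z_test(xbar, mu0, sigma, n, alpha=0.05, alternative="two-tailed"):
    """Level-alpha z test of H0: mu = mu0 (normal population, known sigma)."""
    z = (xbar - mu0) / (sigma / sqrt(n))
    nd = NormalDist()
    if alternative == "upper":          # Ha: mu > mu0
        reject = z >= nd.inv_cdf(1 - alpha)
    elif alternative == "lower":        # Ha: mu < mu0
        reject = z <= -nd.inv_cdf(1 - alpha)
    else:                               # Ha: mu != mu0
        reject = abs(z) >= nd.inv_cdf(1 - alpha / 2)
    return z, reject
```

For example, with x̄ = 1331.26, µ0 = 1300, σ = 60, n = 20 the upper-tailed test rejects at α = 0.05.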
Determination of β and Sample size n
We shall use the lower-tailed test with rejection region z ≤ −zα to illustrate how we can determine
the probability of committing a type II error under Case I. This case is one of the few cases in which
a closed form expression can be derived for both the sample size and β. By definition,
β = P (Type II error) = P (do not reject H0 when H0 is false).
Now, rejecting H0 simply means that the alternative Ha is more plausible. However, under Ha there
are several possible values for µ. Thus, the value of β will depend on the value of µ that we use
among all the eligible values under Ha . Now, suppose µ = µ′ is a value of µ that is less than µ0
under Ha . Then,
β(µ′) = 1 − Φ( −zα + (µ0 − µ′)/(σ/√n) ).
Since µ′ is less than µ0 , the difference µ0 − µ′ will be positive. Now, as µ′ decreases, µ0 − µ′ becomes
more positive and the value of β(µ′) becomes smaller. Similarly, we can show that
(a) For an upper-tailed test with Ha : µ > µ0 , the probability of committing a type II error is
β(µ′) = Φ( zα + (µ0 − µ′)/(σ/√n) ),
(b) For a two-tailed test with Ha : µ ≠ µ0 , the probability of committing a type II error is
β(µ′) = Φ( zα/2 + (µ0 − µ′)/(σ/√n) ) − Φ( −zα/2 + (µ0 − µ′)/(σ/√n) ),
where µ′ is a value of µ that is not equal to µ0 under Ha . For a two-tailed test, it can be shown
that the approximate sample size n for which a level α test also has β(µ′) equal to a specified β at
the alternative value µ′ is
n = [ σ(zα/2 + zβ ) / (µ0 − µ′) ]².
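Both formulas translate directly into code; a sketch, where Φ and the critical values come from `NormalDist` and the example values µ0 = 100, µ′ = 105, σ = 10 are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

nd = NormalDist()

def beta_upper(mu_prime, mu0, sigma, n, alpha=0.05):
    """beta(mu') = Phi(z_alpha + (mu0 - mu')/(sigma/sqrt(n))), upper-tailed test."""
    return nd.cdf(nd.inv_cdf(1 - alpha) + (mu0 - mu_prime) / (sigma / sqrt(n)))

def n_two_tailed(mu_prime, mu0, sigma, alpha, beta):
    """Approximate n for a level-alpha two-tailed test with beta(mu') = beta."""
    z_a2, z_b = nd.inv_cdf(1 - alpha / 2), nd.inv_cdf(1 - beta)
    return (sigma * (z_a2 + z_b) / (mu0 - mu_prime)) ** 2

b = beta_upper(105, 100, 10, 16)                    # ≈ 0.36 at mu' = 105
n = ceil(n_two_tailed(105, 100, 10, 0.05, 0.10))    # 43
```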
Examples (see class notes for solution)
Case II: Large-Sample Tests
In this section we assume that the random sample X1 , X2 , . . . , Xn is from an unknown distribution
with unknown mean µ and variance σ 2 . We also assume that the sample size n is large (n > 30).
Then, from the central limit theorem a reasonable test statistic for testing H0 : µ = µ0 against one
of the three possible alternatives is
Z = (X̄ − µ0 )/(S/√n),
which has approximately a standard normal distribution. The rejection regions remain the same as
before.
Case III: Tests About µ When n is small
When n is small but the sample X1 , X2 , . . . , Xn is from a normal population, we have seen that
the standardized random variable now has a t-distribution with n − 1 degrees of freedom. Thus, a
reasonable test statistic for testing H0 : µ = µ0 against one of the three possible alternatives is
T = (X̄ − µ0 )/(S/√n).
Americans: A National Study” (J. Gerontology, 1992: M145-150) reports the following
summary data on intake for a sample of males age 65-74 years: n = 115,
x̄ = 11.3, and s = 6.43. Does this data indicate that average daily zinc intake in the
population of all males age 65-74 falls below the recommended allowance ?
Problem 31, Page 338
In an experiment designed to measure the time necessary for an inspector’s eyes
to become used to the reduced amount of light necessary for penetrant inspection,
the sample average time for n = 9 inspectors was 6.32 sec and the sample standard
deviation was 1.65 sec. It has previously been assumed that the average adaptation
time was at least 7 sec. Assuming adaptation time to be normally distributed, does
the data contradict prior belief ? Conduct your test at α = 0.1.
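A sketch of Problem 31 as a lower-tailed t test; the critical value t_{0.10,8} = 1.397 is an assumption read from a t table.

```python
from math import sqrt

# summary data from the problem
n, xbar, s = 9, 6.32, 1.65
mu0 = 7.0                                # prior belief: mean time at least 7 sec
# H0: mu = 7 (equivalently mu >= 7) against Ha: mu < 7, at alpha = 0.10
t_stat = (xbar - mu0) / (s / sqrt(n))    # ≈ -1.24
t_crit = 1.397                           # t_{0.10, 8}, assumed from a t table
reject = t_stat <= -t_crit               # False: data do not contradict prior belief
```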
8.3. Tests About a Population Proportion
Let p be the proportion of objects or individuals with a specified characteristic or property. Suppose
we label those individuals or objects with the specified property, success (S) and those without the
property, failure (F). Then p = population proportion of successes and 1− p = population proportion
of failures. When claims are made about p, the problem is to test hypotheses about p. In general,
the null hypothesis is of the form
H0 : p = p0 ,
and the alternative hypothesis can be any one of the following three hypotheses.
Ha : p > p0 .
Ha : p < p0 .
Ha : p 6= p0 .
It can be shown that, when the sample size n is large (n > 30) and the null hypothesis is true, the
test statistic
Z = (p̂ − p0 )/√( p0 (1 − p0 )/n ),
has approximately the standard normal distribution. The rejection regions for a level α test are as
follows.
(a) z ≥ zα if the alternative hypothesis is Ha : p > p0 (upper-tailed test).