
CHAPTER 5

Introduction to Inference and Sampling Distributions


5.1 Sampling

A sample is a finite representative part of a statistical population whose properties are studied to
gain information about the whole population (Webster, 1985). When dealing with people, it can
be defined as a set of respondents (people) selected from a larger population for the purpose of a
survey.

Sampling is the act, process, or technique of selecting a suitable sample, or a representative part
of a population for the purpose of determining parameters or characteristics of the whole
population.

The main purpose of sampling is to draw conclusions about populations by directly observing a
portion of the population.

Definition 5.1 Target population: The totality of elements which are under discussion and about
which information is desired will be called the target population.

How should a sample be chosen from the target population for the inference to be credible?
A population has a density if we can assume that each element in the population has some
numerical value associated with it and the distribution of these numerical values is given by the
density.

Definition 5.2 Random sample: Let the random variables X_1, X_2, X_3, ..., X_n have a joint density

f_{X_1,...,X_n}(x_1, x_2, ..., x_n) = f(x_1) f(x_2) \cdots f(x_n);

then X_1, X_2, X_3, ..., X_n is defined to be a random sample of size n from a population with density f(·).

Often it is not possible to select a random sample from the target population, but a random sample
can be selected from some related population.

Definition 5.3 Sampled population: Let X_1, X_2, X_3, ..., X_n be a random sample from a population with density f(·); then this population is called the sampled population.

Valid probability statements can be made about sampled populations based on random samples,
but statements about the target populations are not valid in a relative-frequency probability sense
unless the target population is also the sampled population.

Example 5.1 Suppose that a sociologist desires to study the religious habits of 20-year-old males
in South Africa. He draws a sample from the 20-year-old males of Kimberley to conduct his study.
In this case, the target population is the 20-year-old males in South Africa, and the sampled
population is the 20-year-old males in Kimberley, which he sampled. He can draw valid relative-
frequency-probabilistic conclusions about his sampled population, but he must use his personal
judgment to extrapolate to the target population, and the reliability of the extrapolation cannot be
measured in relative-frequency probability terms.

Example 5.2 A wheat researcher is studying the yield of a certain variety of wheat in Free State.
She has at her disposal five farms scattered throughout Free State on which she can plant the wheat
and observe the yield. The sampled population consists of the yields on these five farms, whereas
the target population consists of the yields of wheat on every farm in Free State.

5.2 Sampling distribution

A central problem in discovering new knowledge in the real world consists of observing a few of
the elements under discussion, and based on these few we make a statement about the totality of
elements. A population distribution is the distribution of all the individual measurements in a
population, and a sample distribution is the distribution of the individual values included in a
sample.

In contrast to such distributions for individual measurements, a sampling distribution refers to the
distribution of different values that a sample statistic, or estimator, would have over many samples
of the same size. Thus, even though we typically would have just one random sample or rational
subgroup, we recognize that the particular sample statistic that we determine, such as the sample
mean or median, is not exactly equal to the respective population parameter.

Further, a sample statistic will vary in value from sample to sample because of random sampling
variability, or sampling error. This is the idea underlying the concept that any sample statistic is in
fact a type of variable whose distribution of values is represented by a sampling distribution.

A point estimator is the numeric value of a sample statistic that is used to estimate the value of a
population or process parameter. One of the most important characteristics of an estimator is that
it be unbiased. An unbiased estimator is a sample statistic whose expected value is equal to the
parameter being estimated.

Advantages of Sampling

1. It is economical: The economic advantage of using a sample in research is that a sample


requires fewer resources than a census.
2. It saves time: A sample may provide you with needed information quickly. For example, suppose
you are a doctor and a contagious disease that kills within hours has broken out in a village within
your area of jurisdiction, and nobody knows what it is. You are required to conduct quick tests to
help save the situation. If you attempt a census of those affected, they may be long dead by the
time you arrive with your results. In such a case, just a few of those already infected could be used
to provide the required information.
3. The large size of many populations: Many populations about which inferences must be
made are quite large. For example, consider the population of high school students in South Africa.
4. Inaccessibility of some of the population: The inaccessibility may be economic or time
related. A particular study population may be so costly to reach, like the population of
planets, that only a sample can be used. In other cases, a population of events may take
so long to occur that only sample information can be relied on; for example, natural
disasters like a flood that occurs every 100 years, or the flood that occurred in Noah's
days, which has never occurred again.
5. Destructiveness of the observation: Sometimes the very act of observing the desired
characteristic of a unit of the population destroys it
for the intended use. For example, to test the quality of a fuse, to determine whether it is
defective, it must be destroyed. To obtain a census of the quality of a lorry load of fuses,
you have to destroy all of them. This is contrary to the purpose served by quality control
testing. In this case, only a sample should be used to assess the quality of the fuses.
6. Accuracy: A sample may be more accurate than a census. A poorly conducted census can
provide less reliable information than a carefully obtained sample.

5.2.1 Sampling Distribution of Means and the Central Limit Theorem

Suppose that a random sample of n observations is taken from a normal population with mean μ
and variance σ2. Each observation X i , i  1, 2, . . . , n , of the random sample will then have the
same normal distribution as the population being sampled. Hence, the sample mean defined by

\bar{X} = \frac{1}{n}\left(X_1 + X_2 + \cdots + X_n\right)

has a normal distribution with mean and variance, respectively,

\mu_{\bar{X}} = \frac{1}{n}(\mu + \mu + \cdots + \mu) = \mu \quad \text{and} \quad \sigma^2_{\bar{X}} = \frac{1}{n^2}(\sigma^2 + \sigma^2 + \cdots + \sigma^2) = \frac{\sigma^2}{n}.

If we are sampling from a population with unknown distribution, either finite or infinite, the
sampling distribution of \bar{X} will still be approximately normal with mean μ and variance σ²/n, provided
the sample size is large. This is an immediate consequence of the following result, called the Central Limit
Theorem.

Theorem 5.1 Central Limit Theorem: If \bar{X} is the mean of a random sample of size n taken from
a population with mean μ and finite variance σ², then the limiting form of the distribution of

Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}},

as n → ∞, is the standard normal distribution N(z; 0, 1).

The normal approximation for X will generally be good if n ≥ 30, provided the population
distribution is not terribly skewed. If n < 30, the approximation is good only if the population is
not too different from a normal distribution and, as stated above, if the population is known to be
normal, the sampling distribution of X will follow a normal distribution exactly, no matter how
small the size of the samples.

The sample size n = 30 is a guideline to use for the Central Limit Theorem. However, as the
statement of the theorem implies, the presumption of normality on the distribution of X becomes
more accurate as n grows larger.
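To see the Central Limit Theorem at work numerically, the short Python sketch below (assuming NumPy is available; the variable names are ours) draws many samples of size n = 30 from a strongly skewed exponential population and checks that the sample means have mean close to μ and variance close to σ²/n.

import numpy as np

rng = np.random.default_rng(0)
n, reps = 30, 10_000
# exponential(scale=1) population: mean 1 and variance 1, heavily right-skewed
samples = rng.exponential(scale=1.0, size=(reps, n))
means = samples.mean(axis=1)          # one sample mean per replication

print(means.mean())                   # close to the population mean, 1.0
print(means.var(ddof=1))              # close to sigma^2 / n = 1/30 = 0.033
# A histogram of `means` looks approximately normal even though the
# population itself is far from normal.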

5.2.2 t-Distribution
The applications of the Central Limit Theorem revolve around inferences on a population mean or
the difference between two population means. Use of the Central Limit Theorem and the normal
distribution is certainly helpful in this context. However, it was assumed that the population
standard deviation is known. In many situations, knowledge of σ is certainly no more reasonable
than knowledge of the population mean μ. Often, in fact, an estimate of σ must be supplied by the
same sample information that produced the sample average \bar{X}.

Let X 1 , X 2 , . . . , X n be independent random variables that are all normal with mean μ and

standard deviation σ. Let

\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i \quad \text{and} \quad S^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2.

Then the random variable T = \frac{\bar{X} - \mu}{S/\sqrt{n}} has a t-distribution with v = n − 1 degrees of freedom.

If the sample size is small, the values of S2 fluctuate considerably from sample to sample and the
distribution of T deviates appreciably from that of a standard normal distribution. If the sample
size is large enough, say n ≥ 30, the distribution of T does not differ considerably from the standard
normal. However, for n < 30, it is useful to deal with the exact distribution of T. In developing the
sampling distribution of T, we shall assume that our random sample was selected from a normal
population.

Comparing the t-Distribution to Z (Standard normal) distribution

1. The distribution of T is similar to the distribution of Z in that they both are symmetric about
a mean of zero.

2. Both distributions are bell shaped, but the t-distribution is more variable, owing to the fact
that the T-values depend on the fluctuations of X and S2, whereas the Z-values depend
only on the changes in X from sample to sample.
3. The variance of T depends on the sample size n and is always greater than 1. When the
sample size n→∞ the two distributions become the same.

Applications of t-Distribution

The t-distribution is used extensively in problems that deal with inference about the population
mean or in problems that involve comparative samples (i.e., in cases where one is trying to
determine if means from two samples are significantly different).

NB

 The use of t-distribution requires that the sample is obtained from a normal distribution
 The use of the t-distribution and the sample size consideration do not relate to the Central
Limit Theorem.
 The use of the standard normal distribution rather than T for n ≥ 30 merely implies that S
is a sufficiently good estimator of σ in this case.
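The remark that S is a sufficiently good estimator of σ once n ≥ 30 can be illustrated numerically: the sketch below (assuming SciPy is installed) compares the upper 2.5% critical value of the t-distribution with the standard normal value 1.96 as the degrees of freedom grow.

from scipy import stats

for df in (5, 10, 29, 100, 1000):
    # upper 2.5% critical value of the t-distribution with df degrees of freedom
    print(df, round(stats.t.ppf(0.975, df), 3))
print("normal", round(stats.norm.ppf(0.975), 3))   # 1.960

The t critical values (2.571, 2.228, 2.045, ...) shrink towards 1.960, which is why the standard normal table becomes an acceptable substitute for large samples.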

5.4 Statistical Inference

Statistical inference consists of those methods by which one makes inferences or generalizations
about a population. The trend today is to distinguish between the classical method of estimating a
population parameter, whereby inferences are based strictly on information obtained from a
random sample selected from the population, and the Bayesian method, which utilizes prior
subjective knowledge about the probability distribution of the unknown parameters in conjunction
with the information provided by the sample data. Throughout this module, we shall use classical
methods to estimate unknown population parameters such as the mean, the proportion, and the
variance by computing statistics from random samples and applying the theory of sampling
distributions.

When choosing an estimator for a population parameter, certain desirable properties need to be
taken into account. These properties include unbiasedness, efficiency, consistency and
sufficiency.

(a) Unbiased Estimator

Let θ̂ be an estimator whose value is a point estimate of some unknown population parameter
θ. If the mean of the sampling distribution of θ̂ is equal to the parameter being estimated, then θ̂ is
said to be unbiased.

Definition 5.4 A statistic θ̂ is said to be an unbiased estimator of the parameter θ if

E(θ̂) = θ.
Example 5.1: Show that S² is an unbiased estimator of the parameter σ².


Solution

Note that

\sum_{i=1}^{n}(X_i - \mu)^2 = \sum_{i=1}^{n}\left[(X_i - \bar{X}) + (\bar{X} - \mu)\right]^2 = \sum_{i=1}^{n}(X_i - \bar{X})^2 + 2(\bar{X} - \mu)\sum_{i=1}^{n}(X_i - \bar{X}) + n(\bar{X} - \mu)^2,

and since \sum_{i=1}^{n}(X_i - \bar{X}) = 0, this reduces to

\sum_{i=1}^{n}(X_i - \mu)^2 = \sum_{i=1}^{n}(X_i - \bar{X})^2 + n(\bar{X} - \mu)^2, \quad \text{so that} \quad \sum_{i=1}^{n}(X_i - \bar{X})^2 = \sum_{i=1}^{n}(X_i - \mu)^2 - n(\bar{X} - \mu)^2.

Now, since

S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2,

we have

E(S^2) = \frac{1}{n-1} E\left[\sum_{i=1}^{n}(X_i - \bar{X})^2\right] = \frac{1}{n-1}\left[\sum_{i=1}^{n} E(X_i - \mu)^2 - n\,E(\bar{X} - \mu)^2\right] = \frac{1}{n-1}\left[\sum_{i=1}^{n}\sigma_{X_i}^2 - n\,\sigma_{\bar{X}}^2\right].

However, \sigma_{X_i}^2 = \sigma^2 for i = 1, 2, ..., n, and \sigma_{\bar{X}}^2 = \sigma^2/n. Therefore,

E(S^2) = \frac{1}{n-1}\left(n\sigma^2 - n\,\frac{\sigma^2}{n}\right) = \frac{1}{n-1}(n-1)\sigma^2 = \sigma^2,

which shows that S² is an unbiased estimator of σ².

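The algebra above can also be checked empirically. The Monte Carlo sketch below (an illustration only, assuming NumPy) draws many normal samples with σ² = 4 and compares the average of the (n − 1)-divisor estimator S² with the average of the biased n-divisor version.

import numpy as np

rng = np.random.default_rng(1)
sigma2, n, reps = 4.0, 10, 50_000
x = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(reps, n))

s2_unbiased = x.var(axis=1, ddof=1)   # divisor n - 1
s2_biased = x.var(axis=1, ddof=0)     # divisor n

print(s2_unbiased.mean())   # close to 4.0, i.e. E(S^2) = sigma^2
print(s2_biased.mean())     # close to (n - 1)/n * sigma^2 = 3.6, i.e. biased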
(b) Efficient Estimator

If θ̂₁ and θ̂₂ are two unbiased estimators of the same population parameter θ, we want to choose
the estimator whose sampling distribution has the smaller variance.

Definition 5.5 Let θ̂₁ and θ̂₂ be two unbiased estimators of the population parameter θ. If

Var(θ̂₁) < Var(θ̂₂), i.e. σ²_{θ̂₁} < σ²_{θ̂₂},

then θ̂₁ is said to be more efficient than θ̂₂. Also, the relative efficiency of θ̂₁ with respect to θ̂₂
is given by

Var(θ̂₂) / Var(θ̂₁).

If the relative efficiency is more than 1, then θ̂₁ is more efficient; otherwise, θ̂₂ is more efficient.

Example 5.2:
Let X₁, X₂, X₃ be a random sample of size n = 3 from a distribution with unknown mean μ,
−∞ < μ < ∞, where the variance σ² is a known positive number.

(a) Show that both μ̂₁ = X̄ and μ̂₂ = (2X₁ + X₂ + 5X₃)/8 are unbiased estimators of μ.
(b) Which of the two estimators is more efficient?
(c) Calculate the relative efficiency of μ̂₁ with respect to μ̂₂ and comment on it.

Home work.
Let Y₁, Y₂, Y₃ be a random sample from the normal distribution where both the mean and the
variance are unknown. If μ̂₁ = (1/4)Y₁ + (1/2)Y₂ + (1/4)Y₃ and μ̂₂ = (1/3)(Y₁ + Y₂ + Y₃),
(a) Which of the two estimators is more efficient?
(b) Calculate the relative efficiency of μ̂₂ with respect to μ̂₁ and comment on it.

5.5 Confidence interval for population mean for one sample

We discuss the estimation of population means for one sample when the population variance is
either known or unknown. If our sample is selected from a normal population or, if n is sufficiently
large, we can establish a confidence interval for μ by considering the sampling distribution of X .
From the Central Limit Theorem, we can expect the sampling distribution of \bar{X} to be
approximately normally distributed with mean \mu_{\bar{X}} = \mu and variance \sigma^2_{\bar{X}} = \sigma^2/n. Hence

Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}

has (approximately) the standard normal distribution.

Writing z_{α/2} for the z-value above which we find an area of α/2 under the standard normal curve, we can write

P\left(-z_{\alpha/2} < \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < z_{\alpha/2}\right) = 1 - \alpha.

From this we have

P\left(\bar{X} - z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}\right) = 1 - \alpha.

In constructing the 100(1 − α)% confidence interval for the mean, we may come across two main

situations. The first situation is when the population variance is known and the second situation is
when the population variance is unknown.

(a) Situation 1: Confidence Interval for population mean when σ is known

When the population variance is known, we use the normal distribution in the following two
cases. In both cases, a 100(1 − α)% confidence interval for μ is given by x̄ ± z_{α/2} σ/√n, and the
corresponding z_{α/2} values are read from the standard normal tables.

Case 1: In this case, we use the normal distribution when

 population standard deviation is known,


 sample size is small i.e. n < 30 and
 Population from which the sample is selected is normally distributed.

Case 2: We use the normal distribution by invoking the central limit theorem when the

 population standard deviation is known and the
 sample size is large, i.e. n ≥ 30.

Example 5.3

Suppose the weights in kg of 5 newly born babies randomly selected from a maternity clinic are 3,
4, 5, 5, 8. Find a 95% confidence interval for the mean weight μ of all the newly born babies at the
clinic if the population variance of the weights of the babies is 2 kg². Assume the weights of the
babies are approximately normal.

Solution

In this example, the population variance is known, the sample size is small and the population is
approximately normally distributed, hence we make use of the formula x̄ ± z_{α/2} σ/√n.

Here σ² = 2 (so σ = √2), x̄ = (3 + 4 + 5 + 5 + 8)/5 = 5, n = 5 and α = 0.05.

z_{0.05/2} = z_{0.025}; the z-value corresponding to a cumulative probability of 1 − 0.025 = 0.975 from the normal table is 1.96.

Therefore, the 95% confidence interval for the mean is

5 ± 1.96 √(2/5) = 5 ± 1.239, i.e. 3.761 < μ < 6.239.

Thus, we are 95% confident that the true mean weight is between 3.761 and 6.239 kg.
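The interval in Example 5.3 can be reproduced in a few lines of Python (a sketch assuming SciPy; the helper name ci_known_sigma is ours, not a library routine).

import math
from scipy.stats import norm

def ci_known_sigma(xbar, sigma, n, conf=0.95):
    # 100(1 - alpha)% confidence interval for the mean when sigma is known
    z = norm.ppf(1 - (1 - conf) / 2)
    half = z * sigma / math.sqrt(n)
    return xbar - half, xbar + half

weights = [3, 4, 5, 5, 8]
xbar = sum(weights) / len(weights)                         # 5.0
print(ci_known_sigma(xbar, math.sqrt(2), len(weights)))    # about (3.76, 6.24)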

Example 5.4 (Class work)

Scholastic Aptitude Test (SAT) mathematics scores of a random sample of 300 high school seniors
in the state of Texas are collected, and the sample mean is found to be 401. If the population
standard deviation is 112, obtain a 95% confidence interval on the mean SAT mathematics score
for seniors in the state of Texas.

(b) Confidence Interval for population mean when σ is unknown

There are two cases involved when the population variance is unknown. In case 1, a 100(1 − α)%
confidence interval for μ is given by x̄ ± t_{α/2, n−1} s/√n, while in case 2 the confidence interval is
given by either x̄ ± t_{α/2, n−1} s/√n or x̄ ± z_{α/2} s/√n. The t_{α/2, n−1} value is read from the Student t-table.

Case 1: Use the t-distribution when

 the population standard deviation is unknown,
 the sample size is small, i.e. n < 30, and
 the population from which the sample is selected is normally distributed.
Case 2: Use the t-distribution or invoke the central limit theorem when

 the population standard deviation is unknown,
 the sample size is large, i.e. n ≥ 30, and
 the population from which the sample is selected is normally distributed.

NB: The central limit theorem may also be invoked here and it is expected that the two methods
will yield approximately the same results as the sample size increases.
Example 5.5

The contents of seven similar containers of sulfuric acid are 9.8, 10.2, 10.4, 9.8, 10.0, 10.2, and
9.6 liters. Find a 95% confidence interval for the mean contents of all such containers, assuming
an approximately normal distribution.

Solution

The sample mean and standard deviation for the given data are x̄ = 10.0 and s = 0.283.

From the t-table, t_{0.025} = 2.447 for v = n − 1 = 6 degrees of freedom. Hence, the 95% confidence interval is

10.0 ± (2.447)(0.283/√7) = 10.0 ± 0.262.

Therefore the confidence interval is 9.74 < μ < 10.26.
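A quick computational check of Example 5.5 (a sketch assuming NumPy and SciPy are available):

import numpy as np
from scipy.stats import t

data = np.array([9.8, 10.2, 10.4, 9.8, 10.0, 10.2, 9.6])
n = data.size
xbar, s = data.mean(), data.std(ddof=1)       # 10.0 and about 0.283
half = t.ppf(0.975, df=n - 1) * s / np.sqrt(n)
print(xbar - half, xbar + half)               # about (9.74, 10.26)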

Example 5.6

Scholastic Aptitude Test (SAT) mathematics scores of a random sample of 500 high school seniors
in the state of Texas are collected, and the sample mean and standard deviation are found to be
501 and 112, respectively. Find a 99% confidence interval on the mean SAT mathematics score
for seniors in the state of Texas. Hint: The population variance is unknown and the distribution is
unknown, thus invoke the central limit theorem and replace σ by s.

5.6 One-Sided Confidence Bounds

There are many applications in which only a one-sided bound is desired. For example, if the
measurement of interest is tensile strength, the engineer receives better information from a lower
bound only. This bound communicates the worst-case scenario. On the other hand, if the
measurement is something for which a relatively large value of μ is not profitable or desirable,
then an upper confidence bound is of interest. An example would be a case in which inferences

need to be made concerning the mean mercury composition in a river. An upper bound is very
informative in this case.


The one-sided upper confidence bound based on the normal distribution is given by
μ < x̄ + z_α σ/√n, while the lower confidence bound is μ > x̄ − z_α σ/√n. For the t-distribution, the
bounds are respectively μ < x̄ + t_{α, n−1} s/√n and μ > x̄ − t_{α, n−1} s/√n.
n n

5.7 Determining the sample size for the estimation of the mean

We now see that a longer interval is required to estimate μ with a higher degree of confidence. The
100 1    % confidence interval provides an estimate of the accuracy of our point estimate. If μ

is actually the center value of the interval, then x estimates μ without error. Most of the time,
however, x will not be exactly equal to μ and the point estimate will be in error. The size of this
error will be the absolute value of the difference between μ and x̄, and we can be 100(1 − α)%
confident that this difference will not exceed z_{α/2} σ/√n. Hence we can choose n such that ε = z_{α/2} σ/√n.

Let X₁, X₂, ..., Xₙ be a random sample of size n from a normal population and let x̄ and s be the
sample mean and standard deviation respectively. Then the minimum sample size required for x̄ to be
within ε of the true mean μ with 100(1 − α)% confidence is

n = \left(\frac{z_{\alpha/2}\,\sigma}{\varepsilon}\right)^2.

Strictly speaking, the formula is applicable only if we know the variance of the population from
which we select our sample. If we do not know the variance, we could take a preliminary sample
of size n ≥ 30 to provide an estimate of σ. Then, using s as an approximation for σ, we could

determine approximately how many observations are needed to provide the desired degree of
accuracy.

Example 5.7

How large a sample is required if we want to be 95% confident that our estimate of μ is off by less
than 0.05, given that the population standard deviation is σ = 0.3?

Solution

The population standard deviation is σ = 0.3, z_{α/2} = z_{0.025} = 1.96, and ε = 0.05. Therefore

n = \left(\frac{z_{\alpha/2}\,\sigma}{\varepsilon}\right)^2 = \left(\frac{(1.96)(0.3)}{0.05}\right)^2 = 138.3.

Therefore, we can be 95% confident that a random sample of size 139 will provide an estimate x
differing from μ by an amount less than 0.05.
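The same computation, written as a reusable sketch (the function name sample_size_for_mean is illustrative, not a standard library call); the result is rounded up to the next whole observation.

import math
from scipy.stats import norm

def sample_size_for_mean(sigma, error, conf=0.95):
    # minimum n so that x-bar is within `error` of mu with the stated confidence
    z = norm.ppf(1 - (1 - conf) / 2)
    return math.ceil((z * sigma / error) ** 2)

print(sample_size_for_mean(sigma=0.3, error=0.05))   # 139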

Example 5.8

A soft drink machine is regulated so that the amount of drink dispensed is approximately normally
distributed with mean μ and variance σ². If a random sample of selected drinks yields a
standard deviation of 0.3, what size of the sample is needed to be:

i) 95% confident that the sample mean will be within 0.02 of the true mean?
ii) 90% confident that the sample mean will be within 0.01 of the true mean?

Exercise

1. The average zinc concentration from a sample of measurements taken in 36 different


locations in a river is found to be 2.6 grams per milliliter with a standard deviation of 0.3
grams per milliliter. Find the 95%
(a) Confidence intervals for the mean zinc concentration in the river.
(b) Upper bound for the mean zinc concentration in the river.

2. The height (ft.) of a random sample of 64 college students produced a mean height of
4.4583ft and sample standard deviation of 1.889. Construct a 95%
(a) Confidence interval for the mean height.
(b) Lower bound for the mean height.
3. Consider a random sample 2.4; 1.6; 3.8; 4.0; 5.2 and 2.0 from a normal distribution.
Construct a 90% confidence interval for the mean μ in each of the following cases:
(a) Variance is 1.44
(b) Variance is unknown
4. An efficiency expert wishes to determine the average time that it takes to drill three holes
in a certain metal clamp. How large a sample will she need to be 95% confident that her
sample mean will be within 15 seconds of the true mean? Assume that it is known from
previous studies that σ = 40 seconds.
5. A machine produces metal pieces that are cylindrical in shape. A sample of pieces is taken,
and the diameters are found to be 1.01, 0.97, 1.03, 1.04, 0.99, 0.98, 0.99, 1.01, and 1.03
centimeters. Find a 99% confidence interval for the mean diameter of pieces from this
machine, assuming an approximately normal distribution.

5.8 Prediction Intervals and tolerance level

Sometimes, it is necessary to predict the possible value of a future observation. For instance,

 In quality control, the experimenter may need to use the observed data to predict a new
observation.
 A weather forecaster may be interested in predicting tomorrow’s temperature.

In these cases, a confidence interval on the mean tensile strength or the mean temperature does
not capture the required information. These types of requirements are addressed by constructing a
prediction interval.

A prediction interval provides a good estimate of the location of a future observation, which is quite
different from the estimate of the sample mean value. The variation of the prediction interval is
the sum of the variation due to an estimation of the mean and the variation of a single observation.

One important use of prediction intervals is the detection of outlying observations. The majority
of scientific investigators are keenly sensitive to the existence of outlying observations or so-called
faulty or "bad" data. An outlying observation can be viewed as one that comes from a
population with a mean that is different from the mean that governs the rest of the sample of size
n being studied. The prediction interval produces a bound that "covers" a future single observation
with probability 1 − α if it comes from the population from which the sample was drawn. As a
result, a methodology for outlier detection involves the rule that an observation is an outlier if it
falls outside the prediction interval computed without including the questionable observation in
the sample.

(a) Constructing prediction interval


Assume that the random sample comes from a normal population with unknown mean μ and
known variance σ 2 . A natural point estimator of a new observation is X . Recall that the variance
of X is 2 / n . In predicting a new observation, we need to account for both the variation due to
estimating the mean and the variation of a future observation. We know that the variance of the
random error in a new observation is σ². Consider the normal random variable X₀ − X̄, where X₀
is the new observation and X̄ comes from the sample. Since X₀ and X̄ are independent,
Var(X₀ − X̄) = σ² + σ²/n, and

z = \frac{x_0 - \bar{x}}{\sqrt{\sigma^2 + \sigma^2/n}} = \frac{x_0 - \bar{x}}{\sigma\sqrt{1 + 1/n}}

is n(z; 0, 1).

Recall that

P(−z_{α/2} < Z < z_{α/2}) = 1 − α.

Therefore, for a normal distribution of measurements with unknown mean μ and known variance
σ², a 100(1 − α)% prediction interval of a future observation x₀ is

x̄ − z_{α/2} σ √(1 + 1/n) < x₀ < x̄ + z_{α/2} σ √(1 + 1/n).

Similarly, for a normal distribution of measurements with unknown mean μ and unknown variance
σ², a 100(1 − α)% prediction interval of a future observation x₀ is

x̄ − t_{α/2} s √(1 + 1/n) < x₀ < x̄ + t_{α/2} s √(1 + 1/n),

where t_{α/2} has v = n − 1 degrees of freedom.
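Both prediction-interval formulas can be wrapped in a small sketch (assuming SciPy; the function names are ours). Applied to the data of Example 5.9 below, the known-σ version reproduces the interval quoted there up to rounding of the z-value.

import math
from scipy.stats import norm, t

def pred_interval_known_sigma(xbar, sigma, n, conf=0.95):
    z = norm.ppf(1 - (1 - conf) / 2)
    half = z * sigma * math.sqrt(1 + 1 / n)
    return xbar - half, xbar + half

def pred_interval_unknown_sigma(xbar, s, n, conf=0.95):
    tcrit = t.ppf(1 - (1 - conf) / 2, df=n - 1)
    half = tcrit * s * math.sqrt(1 + 1 / n)
    return xbar - half, xbar + half

# Example 5.9: mean loan $257,300, sigma $25,000, n = 50
print(pred_interval_known_sigma(257_300, 25_000, 50))   # about (207,813; 306,787)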

Example 5.9

Due to the decrease in interest rates, the First Citizens Bank received a lot of mortgage applications.
A recent sample of 50 mortgage loans resulted in an average loan amount of $257,300. Assume a
population standard deviation of $25,000.

(a) For the next customer who fills out a mortgage application, find a 95% prediction interval
for the loan amount.
(b) Determine whether it would be abnormal if the next customer who fills out a mortgage
application obtains a loan amount of $200,700.

Solution

(a) The point prediction of the next customer’s loan amount is x = $257,300 . The z-value here
is z0.025 = 1.96 . Hence, a 95% prediction interval for the future loan amount is

257,300 − (1.96)(25,000)√(1 + 1/50) < x₀ < 257,300 + (1.96)(25,000)√(1 + 1/50).

Therefore the prediction interval is  $207,812.43; $306, 787.57  .

(b) A loan amount of $200,700 would be considered abnormal, because it falls outside (below) the 95% prediction interval.

Example 5.10
A meat inspector has randomly selected 30 packs of 95% lean beef. The sample resulted in a mean
of 96.2% with a sample standard deviation of 0.8%. Find a 99% prediction interval for the leanness
of a new pack. Assume that the distribution of the meat packs is normal.

Solution
For v = 29 degrees of freedom, t0.005 = 2.756 . Hence, a 99% prediction interval for a new
observation is
96.2 − (2.756)(0.8)√(1 + 1/30) < x₀ < 96.2 + (2.756)(0.8)√(1 + 1/30).

Therefore, the prediction interval is (93.96; 98.44).

(b) Tolerance Limits

Sometimes, one is interested in determining bounds that, in some probabilistic sense, "cover" values
in the population (i.e., the measured values of the dimension). One method of establishing the
desired bounds is to determine a confidence interval on a fixed proportion of the measurements. A
bound that covers the middle 95% of the population of observations is μ ±1.96σ . This is called a
tolerance interval, and indeed its coverage of 95% of measured observations is exact. However, in
practice, μ and σ are seldom known; thus, the user must apply

x̄ ± ks.

The interval is a random variable, and hence the coverage of a proportion of the population by the
interval is not exact. As a result, a 100(1 − γ)% confidence interval must be used, since x̄ ± ks
cannot be expected to cover any specified proportion all the time. For a normal distribution of
measurements with unknown mean μ and unknown standard deviation σ, tolerance limits are
given by x̄ ± ks, where k is determined such that one can assert with 100(1 − γ)% confidence
that the given limits contain at least the proportion 1 − α of the measurements.

Example 5.11

A machine produces metal pieces that are cylindrical in shape. Assuming that the diameters are
approximately normally distributed, find the 99% tolerance limits that will contain 95% of the
metal pieces produced by this machine, if a random sample of 9 metal pieces has mean
x̄ = 1.0056 and standard deviation s = 0.0246.

Solution

For n = 9, 1 − γ = 0.99 and 1 − α = 0.95, we find k = 4.550 for two-sided limits. Hence, the 99%
tolerance limits are given by

x̄ ± ks = 1.0056 ± (4.550)(0.0246),

with the bounds being 0.8937 and 1.1175. We are 99% confident that the tolerance interval from
0.8937 to 1.1175 will contain the central 95% of the distribution of diameters produced.

Exercise (Tutorials)

1. The following measurements were recorded for the drying time, in hours, of a certain brand
of latex paint:
3.4; 2.5; 4.8; 2.9; 3.6; 2.8; 3.3; 5.6; 3.7; 2.8; 4.4; 4.0; 5.2; 3.0; 4.8.
If the measurements represent a random sample from a normal population, find a 95%
prediction interval for the drying time for the next trial of the paint.
2. A random sample of 100 automobile owners in the state of Virginia shows that an
automobile is driven on average 23,500 kilometers per year with a standard deviation of
3900 kilometers. Assume the distribution of measurements to be approximately normal,
construct a 99% prediction interval for the kilometers traveled annually by an automobile
owner in Virginia.
3. A random sample of 12 graduates of a certain secretarial school typed an average of 79.3
words per minute with a standard deviation of 7.8 words per minute. Assuming a normal
distribution for the number of words typed per minute, find a 95% prediction interval for
the average number of words typed by all graduates of this school.
4. A random sample of 12 shearing pins is taken in a study of the Rockwell hardness of the
pin head. Measurements on the Rockwell hardness are made for each of the 12, yielding
an average value of 48.50 with a sample standard deviation of 1.5. Assuming the
measurements to be normally distributed, construct a 95% tolerance interval containing
90% of the measurements.
5. A random sample of 25 tablets of buffered aspirin contains, on average, 325.05 mg of
aspirin per tablet, with a standard deviation of 0.5 mg. Find the 95% tolerance limits that
will contain 95% of the tablet contents for this brand of buffered aspirin. Assume that the
aspirin content is normally distributed.
6. The following data are diameter measurements from a sample of metal pieces: 1.01, 0.97,
1.03, 1.04, 0.99, 0.98, 1.01, 1.03, 0.99, 1.00, 1.00, 0.99, 0.98, 1.01, 1.02, 0.99 centimeters.
If the sample is drawn from a normal distribution, construct the following intervals and
interpret your results.
(a) Compute a 99% confidence interval on the mean diameter.
(b) Compute a 99% prediction interval on the next diameter to be measured.
(c) Compute a 99% tolerance interval for coverage of the central 95% of the distribution
of diameters.

5.9 Confidence Interval for a proportion

Let p be the proportion of individuals in a population with a certain property P (for example, the
proportion of students, adults, etc.). Suppose that a random sample of size n is drawn from such a
population and let k be the number of individuals in the sample with property P. Then a 100(1 − α)%
confidence interval for p is given by

p̂ ± z_{α/2} √( p̂(1 − p̂)/n ),

where p̂ = k/n is the sample proportion and z_{α/2} is the 100(1 − α/2)th percentile of the standard normal
distribution. NB. It is always assumed that the sample size is large since we are invoking the central
limit theorem here.
Example 5.12
In a random sample of n = 500 families owning television sets in the city, it is found that x = 340
subscribe to ETV. Find a 95% confidence interval for the actual proportion of families with
television sets in this city that subscribe to ETV.
Solution

The point estimate of p is p̂ = 340 / 500 = 0.68 . Recall that z0.025 = 1.96 . Therefore, a 95%
confidence interval for p is

0.68 ± 1.96 √( 0.68(1 − 0.68)/500 ),

which simplifies to 0.6391 < p < 0.7209.
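A minimal sketch of the same calculation (assuming SciPy; the helper name ci_proportion is ours):

import math
from scipy.stats import norm

def ci_proportion(k, n, conf=0.95):
    # large-sample 100(1 - alpha)% confidence interval for a population proportion
    phat = k / n
    z = norm.ppf(1 - (1 - conf) / 2)
    half = z * math.sqrt(phat * (1 - phat) / n)
    return phat - half, phat + half

print(ci_proportion(340, 500))   # about (0.639, 0.721)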

Example 5.13

A new rocket-launching system is being considered for deployment of small, short-range missiles.
The existing system has a success probability of 0.8. A sample of 40 experimental launches is
made with the new system and 34 are successful.

(a) Construct a 90% confidence interval for the success probability
(b) Based on the 90% confidence interval, would you conclude that the new system is
better?

5.10 Determining sample for estimating proportions

If p̂ is used as an estimate of p, we can be 100(1 − α)% confident that the error will be less than
a specified amount ε when the sample size is approximately

n = z²_{α/2} p̂(1 − p̂) / ε².

If a crude estimate of p can be made without taking a sample, this value can be used to determine
n. However, if this is not possible, we could take a preliminary sample of size n ≥ 30 to provide an
estimate of p. Using the above formula, we could determine approximately how many
observations are needed to provide the desired degree of accuracy.

Example 5.14
A random sample of families owning television sets in the city is to be selected such that we are
95% confident that the actual proportion of families with television sets in a city that subscribe to
ETV is within 0.02 of the true value. How large should the sample be?

Solution
Let us treat the 500 families as a preliminary sample, providing an estimate p̂ = 0.68. Then

n = (1.96)²(0.68)(0.32) / (0.02)² = 2089.8 ≈ 2090.

Sometimes, it is not practical to obtain an estimate of p to be used for determining the sample size
for a specified degree of confidence. If this happens, an upper bound for n is established by noting
that p̂q̂ = p̂(1 − p̂), which must be at most 1/4, since p̂ must lie between 0 and 1. In such a
situation, n is computed using the formula below:

n = z²_{α/2} / (4ε²).

Example 5.15

How large a sample is required if we want to be at least 95% confident that our estimate of p in
Example 5.12 is within 0.02 of the true value?

Solution

1.96 
2

Substituting the relevant data into the correct formula, we obtain n= = 2401 .
 4  0.02 
2
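Both sample-size formulas, with and without a preliminary estimate of p, in one short sketch (assuming SciPy; the function name is illustrative):

import math
from scipy.stats import norm

def sample_size_for_proportion(error, conf=0.95, p_hat=None):
    z = norm.ppf(1 - (1 - conf) / 2)
    if p_hat is None:
        # conservative bound: p(1 - p) is at most 1/4
        return math.ceil(z ** 2 / (4 * error ** 2))
    return math.ceil(z ** 2 * p_hat * (1 - p_hat) / error ** 2)

print(sample_size_for_proportion(0.02, p_hat=0.68))   # 2090
print(sample_size_for_proportion(0.02))               # 2401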

5.11 Single Sample: Estimating the Variance

If a sample of size n is drawn from a normal population with variance σ2 and the sample variance
s2 is computed, we obtain a value of the statistic S2. This computed sample variance is used as a
point estimate of σ2. Hence, the statistic S2 is called an estimator of σ2. An interval estimate of σ2
can be established by using the statistic

χ² = (n − 1)S² / σ².

If s2 is the variance of a random sample of size n from a normal population, a 100(1 − α) %


confidence interval for σ² is

(n − 1)s² / χ²_{α/2} < σ² < (n − 1)s² / χ²_{1−α/2},

where χ²_{α/2} and χ²_{1−α/2} are χ²-values with v = n − 1 degrees of freedom, leaving areas of α/2 and
1− α/2, respectively, to the right. An approximate 100(1 − α) % confidence interval for σ is
obtained by taking the square root of each endpoint of the interval for σ2.

Example 5.16
The following are the weights, in decagrams, of 10 packages of grass seed distributed by a certain
company: 46.4, 46.1, 45.8, 47.0, 46.1, 45.9, 45.8, 46.9, 45.2, and 46.0. Find a 95% confidence
interval for the variance of the weights of all such packages of grass seed distributed by this
company, assuming a normal population.

Solution
Note that s² = (1/(n − 1)) Σ(xᵢ − x̄)², hence s² = 0.286.
To obtain a 95% confidence interval, we choose α = 0.05. Then, using the Chi-square Table with
v = 9 degrees of freedom, we find χ²_{0.025} = 19.023 and χ²_{0.975} = 2.700. Therefore, the 95%
confidence interval for σ² is

(10 − 1)(0.286)/19.023 < σ² < (10 − 1)(0.286)/2.700,

which implies

0.135 < σ² < 0.953.
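A computational check of Example 5.16 (a sketch assuming NumPy and SciPy):

import numpy as np
from scipy.stats import chi2

data = np.array([46.4, 46.1, 45.8, 47.0, 46.1, 45.9, 45.8, 46.9, 45.2, 46.0])
n = data.size
s2 = data.var(ddof=1)                            # about 0.286
lower = (n - 1) * s2 / chi2.ppf(0.975, n - 1)    # chi-square value leaving 0.025 to the right
upper = (n - 1) * s2 / chi2.ppf(0.025, n - 1)    # chi-square value leaving 0.975 to the right
print(lower, upper)                              # about (0.135, 0.953)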
Exercise
1. A machine is used to fill boxes with product in an assembly line operation. Much concern
centers around the variability in the number of ounces of product in a box. The standard
deviation in weight of product is known to be 0.3 ounce. An improvement is implemented,
after which a random sample of 20 boxes is selected and the sample variance is found to
be 0.045 ounce2. Find a 95% confidence interval on the variance in the weight of the
product. Does it appear from the range of the confidence interval that the improvement of
the process enhanced quality as far as variability is concerned? Assume normality on the
distribution of weights of product.
2. A manufacturer of car batteries claims that the batteries will last, on average, 3 years with
a variance of 1 year. If 5 of these batteries have lifetimes of 1.9, 2.4, 3.0, 3.5, and 4.2 years,
construct a 95% confidence interval for σ2 and decide if the manufacturer’s claim that
2  1 is valid. Assume the population of battery lives to be approximately normally
distributed.
3. A random sample of 20 students yielded a mean of 72 and a variance of 16 for scores on a
college placement test in mathematics. Assuming the scores to be normally distributed,
construct a 98% confidence interval for σ2.

CHAPTER 6

One sample hypothesis testing


6.1 Introduction

A statistical hypothesis is an assertion or conjecture concerning one or more populations. The truth
or falsity of a statistical hypothesis is never known with absolute certainty unless we examine the
entire population. This, of course, would be impractical in most situations. Instead, we take a
random sample from the population of interest and use the data contained in this sample to provide
evidence that either supports or does not support the hypothesis. Evidence from the sample that is
inconsistent with the stated hypothesis leads to a rejection of the hypothesis.

In hypothesis testing, the hypothesis we wish to test is called the null hypothesis and is denoted
by H0. The alternative hypothesis H1 usually represents the question to be answered or the theory
to be tested. The rejection of H0 leads to the acceptance of an alternative hypothesis.

The null hypothesis H0 nullifies or opposes H1 and is often the logical complement to H1 . The
following are the plausible conclusions made in hypothesis testing:

 Reject H0 in favour of H1 because of sufficient evidence in the data or


 Fail to reject H0 because of insufficient evidence in the data.

Note that the conclusions do not involve a formal and literal "accept H0". The statement of H0
often represents the "status quo" in opposition to the new idea, conjecture, and so on, stated in
H1, while failure to reject H0 represents the proper conclusion.

Example

Consider a person accused of a crime who has been brought before a jury for trial. The status quo
is that the accused is not guilty until proven otherwise, thus the trial involves the following
hypothesis:

 H0: defendant is innocent


 H1 : defendant is guilty.

The accusation comes because of suspicion of guilt. The hypothesis H0 (the status quo) stands in
opposition to H1 and is maintained unless H1 is supported by evidence “beyond a reasonable
doubt.” However, “failure to reject H0 ” in this case does not imply innocence, but merely that the
evidence was insufficient for conviction. Therefore, the jury does not necessarily accept H0 but
fails to reject H0.

6.2 Types of hypothesis test

A hypothesis test is differentiated by its alternative hypothesis since the null hypothesis is often
the same. Depending on the way the alternative hypothesis is stated, a test can be either one-tail or
two-tailed. By one tail, we simply mean that the rejection region is found in only one direction
while in two tailed test, it could be found equally in both directions.

Definition 6.1 A test of any statistical hypothesis where the alternative is one-sided, that is,
H0: μ = μ0 against H1: μ > μ0, or H0: μ = μ0 against H1: μ < μ0,
is called a one-tailed test.
Generally, the critical region for the alternative hypothesis μ > μ0 lies in the right tail of the
distribution of the test statistic, while the critical region for the alternative hypothesis μ < μ0 lies
entirely in the left tail. (In a sense, the inequality symbol points in the direction of the critical
region.)
Definition 6.2 Critical region (rejection region) is the area (s) of the sampling distribution of a
statistic that will lead to the rejection of the hypothesis tested when that hypothesis is true. The
critical region is also called the α-region.

A test of any statistical hypothesis where the alternative is two sided, such as

Definition 6.3 H0: μ = μ0 against H1: μ ≠ μ0 is called a two-tailed test.

The critical region of a two-tailed test is split into two parts, often having equal probabilities, in
each tail of the distribution of the test statistic. The alternative hypothesis μ ≠ μ0 states that either
μ < μ0 or μ > μ0.

Examples 6.1

A manufacturer of a certain brand of rice cereal claims that the average saturated fat content does
not exceed 1.5 grams per serving. State the null and alternative hypotheses to be used in testing
this claim and determine where the critical region is located.

Solution

The claim implies that μ ≤ 1.5; thus, the claim can only be rejected if μ > 1.5. Therefore we test the
hypothesis H0: μ = 1.5 against H1: μ > 1.5. The critical region lies in the right tail of the distribution
of the test statistic, i.e. the claim is rejected when the sample mean is significantly greater than 1.5.

Examples 6.2

A real estate agent claims that 60% of all private residences being built today are 3-bedroom
homes. To test this claim, a large sample of new residences is inspected; the proportion of these
homes with 3 bedrooms is recorded and used as the test statistic. State the null and alternative
hypotheses to be used in this test and determine the location of the critical region.

6.3 The Probabilities of Type I and Type II Errors

Consider the case of a judicial trial. Sometimes an innocent person is jailed due to false witness or
evidence, and this leads to rejecting H0 when in fact it should not be rejected (H0 is true). In such cases,
the jury commits an error (mistake) by rejecting H0 (sentencing an innocent person) when in
fact he should be acquitted and discharged. Such an error is called a type I error,
usually called a false positive.

Definition 6.4: Rejection of the null hypothesis when it is true is called a type I error.

The probability of committing a type I error is also called the level of significance or size of the
test, and it is denoted by α, i.e.

α = P(Test statistic ∈ C | H0 is true),

where C denotes the critical region.

If the size of the critical region is small, it is unlikely that a type I error will be committed. For
example, a critical region of size 0.0409 is very small, and therefore it is unlikely that a type I
error will be committed. In the case of the jury, if α = 0.0409 it may imply that it is very unlikely
that the jury will sentence an innocent person to imprisonment.

Sometimes, due to lack of evidence, a guilty person is let go without punishment. In such cases,
the jury fails to sentence a guilty person (fails to reject H0) while in actual fact he is guilty (H0 is false).
This type of mistake is called a type II error, usually called a false negative.
Definition 6.5: Non-rejection of the null hypothesis when it is false is called a type II error.

A type II error is denoted by β, and can be computed as

β = P(Test statistic ∉ C | H1 is true).

The probability of committing a type II error is impossible to compute unless we have a specific
alternative hypothesis. The smaller the probability of committing type II error, the higher the
power of the test.
Definition 6.6: The power of a test is the probability of rejecting H0 given that a specific alternative
is true. Mathematically, the power of the test is 1 − β.

Power of test is correctly rejecting 𝐻0 when it is indeed false and 𝐻1is true. Ideally, we like to use
a test procedure for which the type I and type II error probabilities are both small. Statistical
hypothesis testing leads to four possible situations that determine whether our decision is correct
or not. These are summarized in the table below:
                        H0 is true          H0 is false
Fail to reject H0       Correct decision    Type II error
Reject H0               Type I error        Correct decision

The probability of committing both types of error can be reduced by increasing the sample size. In
the Jury case, both errors can be reduced by collecting more evidence and considering other factors
that may affect the decision. The following are the important Properties of a Test of Hypothesis
 The type I error and type II error are related. A decrease in the probability of one generally
results in an increase in the probability of the other.
 The size of the critical region, and therefore the probability of committing a type I error,
can always be reduced by adjusting the critical value(s).
 An increase in the sample size n will reduce α and β simultaneously.
 If the null hypothesis is false, β is a maximum when the true value of a parameter
approaches the hypothesized value.

 The greater the distance between the true value and the hypothesized value, the smaller β
will be.

Examples 6.3
The standard deviation of a population of weights of students is 3.6. A sample of 36 students is
selected and weighed. The researcher is interested in testing the null hypothesis that the average
weight of male students in a certain college is 68 kilograms against the alternative hypothesis that
it is not equal to 68.
(a) Compute the probability of committing a type I error.
(b) Assuming that the true mean happens to be 70 kilograms, compute the probability of
committing a type II error.
Solution
(a) H0: μ = 68 versus H1: μ ≠ 68
A critical region for the test statistic might arbitrarily be chosen to be the two intervals
x̄ < 67 and x̄ > 69. The non-rejection region will then be the interval 67 ≤ x̄ ≤ 69. Thus

Z₁ = (67 − 68)/(3.6/√36) = −1.67 and Z₂ = (69 − 68)/(3.6/√36) = 1.67.
Therefore, α = P(Z < −1.67) + P(Z > 1.67) = 2P(Z < −1.67) = 2(0.0475)=0.0950.
If you increase the size of the sample to say 64, the Type one error reduces to 0.0264.
(b) The reduction in α is not sufficient by itself to guarantee a good testing procedure. We must
also evaluate β for various alternative hypotheses. If it is important to reject H0 when the true mean
is some value μ ≥ 70 or μ ≤ 66, then the probability of committing a type II error should be
computed and examined for the alternatives μ = 66 and μ = 70. Because of symmetry, it is only
necessary to consider the probability of not rejecting the null hypothesis that μ = 68 when the
alternative μ = 70 is true. A type II error will result when the sample mean falls between 67 and
69 when H1 is true. Therefore, with the increased sample size n = 64, we are interested in
β = P(67 < X̄ < 69 when μ = 70). Thus

Z₁ = (67 − 70)/(3.6/√64) = −6.67 and Z₂ = (69 − 70)/(3.6/√64) = −2.22.
Therefore,

β = P(−6.67 < Z < −2.22) = P(Z < −2.22) − P(Z < −6.67)
= 0.0132 − 0.0000 = 0.0132.

Examples 6.4
Assuming the time spent at ATM follows a normal distribution with mean 2.1 minutes and
standard deviation 0.9 minutes. A new system is to be implemented to reduce the average time
spent on the ATM. The null hypothesis is that the mean time spent is the same under the new and
the old systems. It is decided to reject the null hypothesis and conclude that the new system is
quicker if the mean withdrawal time from a random sample of 20 cash withdrawals is less than 1.7
minutes.
Assume that, for the new system, the standard deviation is still 0.9 minutes, and the time spent still
follows a normal distribution.
(a) Calculate the probability of committing a Type I error
(b) If the mean withdrawal time under the new system is actually 1.25 minutes, calculate the
probability of a Type II error
(c) Calculate the power of this test.

Solution

Let X be the time spent at the ATM, where X ~ N(2.1, 0.9²).

We are given the rejection criterion: reject H0 when X̄ < 1.7.

(a) The probability of committing a type I error is given by

α = P(Reject H0 | H0 is true) = P(X̄ < 1.7 | μ = 2.1).

But

P(X̄ < 1.7 | μ = 2.1) = P( Z < (1.7 − 2.1)/(0.9/√20) ) = P(Z < −1.99) = 0.0233.

Therefore, about 2.33% of all samples of size 20 will lead us to reject H0: μ = 2.1 when in
fact it is true.
(b) The probability of committing a type II error is given by

β = P(Fail to reject H0 | H1 is true) = P(X̄ ≥ 1.7 | μ = 1.25).

But

P(X̄ ≥ 1.7 | μ = 1.25) = P( Z ≥ (1.7 − 1.25)/(0.9/√20) ) = P(Z ≥ 2.24) = 0.0125.

Therefore, about 1.25% of all samples of size 20 will lead us to fail to reject H0: μ = 2.1
when in fact the true mean is 1.25.

(c) Power of the test = 1 − β = 1 − 0.0125 = 0.9875.
Therefore, about 98.75% of all samples of size 20 will lead us to reject H0: μ = 2.1 when
in fact the true mean is 1.25.
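The α, β and power values of Example 6.4 can be verified directly; the sketch below assumes SciPy, and small differences from the hand calculation are only due to rounding of the z-values.

import math
from scipy.stats import norm

mu0, mu1, sigma, n, cutoff = 2.1, 1.25, 0.9, 20, 1.7
se = sigma / math.sqrt(n)

alpha = norm.cdf((cutoff - mu0) / se)   # P(X-bar < 1.7 | mu = 2.1), about 0.023
beta = norm.sf((cutoff - mu1) / se)     # P(X-bar >= 1.7 | mu = 1.25), about 0.013
power = 1 - beta                        # about 0.987
print(alpha, beta, power)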

Examples 6.5 (Class work)


A new curing process developed for a certain type of cement results in a mean compressive
strength of 5000 kilograms per square centimeter with a standard deviation of 120 kilograms. To
test the hypothesis that μ = 5000 against the alternative that μ < 5000, a random sample of 50
pieces of cement is tested. The critical region is defined to be x̄ < 4970.
(a) Find the probability of committing a type I error when H0 is true.
(b) Evaluate β for the alternatives μ = 4970 and μ = 4960.

Answer
(a) P(Z < −1.77) = 0.0384.
(b) β = P(Z > 0) = 0.5 and β = P(Z > 0.59) = 0.2776

Examples 6.6 (Class work)


Suppose you wish to test the hypothesis that the mean age of student in NSTA515 module is 19.5
years. Explain how you could commit a
(a) type I error;
(b) type II error.

Exercise 6
1. Suppose that an allergist wishes to test the hypothesis that at least 30% of the public is
allergic to some cheese products. Explain how the allergist could commit

(a) a type I error;
(b) a type II error.
2. A sociologist is concerned about the effectiveness of a training course designed to get more
drivers to use seat belts in automobiles.
(a) What hypothesis is she testing if she commits a type I error by erroneously
concluding that the training course is ineffective?
(b) What hypothesis is she testing if she commits a type II error by erroneously
concluding that the training course is effective?
3. A large manufacturing firm is being charged with discrimination in its hiring practices.
(a) What hypothesis is being tested if a jury commits a type I error by finding the firm
guilty?
(b) What hypothesis is being tested if a jury commits a type II error by finding the firm
guilty?
4. A random sample of 400 voters in a certain city are asked if they favor an additional 4%
gasoline sales tax to provide badly needed revenues for street repairs. If more than 220 but
fewer than 260 favor the sales tax, we shall conclude that 60% of the voters are for it.
(a) Find the probability of committing a type I error if 60% of the voters favor the increased
tax.
(b) What is the probability of committing a type II error using this test procedure if actually
only 48% of the voters are in favor of the additional gasoline tax?
5. A soft-drink machine at a steak house is regulated so that the amount of drink dispensed is
approximately normally distributed with a mean of 200 milliliters and a standard deviation
of 15 milliliters. The machine is checked periodically by taking a sample of 9 drinks and
computing the average content. If X̄ falls in the interval 191 < X̄ < 209, the machine is
thought to be operating satisfactorily; otherwise, we conclude that μ ≠ 200 milliliters.
(a) Find the probability of committing a type I error when μ = 200 milliliters.
(b) Find the probability of committing a type II error when μ = 215 milliliters.

6.4 Hypothesis Testing

In this section, we formally consider tests of hypotheses on a single population means, proportions
and variances. A hypothesis test involves four main identified procedures and either the p-value
or the critical region approach may be used. The general approach is given below:

1. From the problem, identify the parameter of interest and state the null and alternative
hypothesis and the type of the test (i.e. two tail or one tail test).
2. Identify the level of significance, the distribution and select or identify the critical region.
3. Compute the test statistic and or the p-value.
4. Using your test statistic or p-value, make a decision with regards to H0.

Hypothesis test about population mean

Hypothesis test involving population mean can take two main forms, i.e. when the population
standard deviation is known, and when it is not known.

Hypothesis test about population mean when the population variance is known

1. State the null and the alternative hypothesis and state the type of the test.

(a) H0: μ = μ0 versus H1: μ ≠ μ0 for a two-tailed test
(b) H0: μ = μ0 versus H1: μ > μ0 for a right-tailed test
(c) H0: μ = μ0 versus H1: μ < μ0 for a left-tailed test

2. Identify the level of significance and the type of the test (i.e. two tail or one tail test)

(a) Identify the critical region.

For a two-tailed test, read z_{α/2} and z_{1−α/2}.
For a right-tailed test, read z_{1−α}.
For a left-tailed test, read z_{α}.

(b) Compute the test statistic Z = (x̄ − μ0)/(σ/√n) and/or the p-value.

(c) Decision
Using the p-value:
 Reject H0 if p-value < α and conclude H1.
 Fail to reject H0 if p-value ≥ α and conclude H0.
Using critical values:

For H1: μ ≠ μ0, reject H0 if Z < z_{α/2} or Z > z_{1−α/2}.
For H1: μ > μ0, reject H0 if Z > z_{1−α}.
For H1: μ < μ0, reject H0 if Z < z_{α}.
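The procedure above can be written as a small sketch (assuming SciPy; the function name z_test_mean is ours). As an illustration it is applied to the life-span data used later in this section (n = 100, x̄ = 71.8, σ = 8.9, testing H1: μ > 70).

import math
from scipy.stats import norm

def z_test_mean(xbar, mu0, sigma, n, alternative="two-sided"):
    # one-sample z-test for the mean when the population sigma is known
    z = (xbar - mu0) / (sigma / math.sqrt(n))
    if alternative == "greater":
        p = norm.sf(z)
    elif alternative == "less":
        p = norm.cdf(z)
    else:
        p = 2 * norm.sf(abs(z))
    return z, p

z, p = z_test_mean(71.8, 70, 8.9, 100, alternative="greater")
print(z, p)   # z is about 2.02 and p about 0.022, so H0 is rejected at the 5% level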

Hypothesis test about population mean when the population variance is unknown

1. State the null and the alternative hypothesis and state the type of the test.

(a) H0: μ = μ0 versus H1: μ ≠ μ0 for a two-tailed test
(b) H0: μ = μ0 versus H1: μ > μ0 for a right-tailed test
(c) H0: μ = μ0 versus H1: μ < μ0 for a left-tailed test

2. Identify the level of significance and the type of the test (i.e. two tail or one tail test)

(a) Identify the critical region

For a two-tailed test, read t_{α/2, n−1} (and −t_{α/2, n−1}).
For a right-tailed test, read t_{α, n−1}.
For a left-tailed test, read −t_{α, n−1}.

(b) Compute the test statistic using t = (x̄ − μ0)/(s/√n).
(c) Decision
Using the p-value:
 Reject H0 if p-value < α and conclude H1.
 Fail to reject H0 if p-value ≥ α and conclude H0.

Using critical values:

For H1: μ ≠ μ0, reject H0 if t < −t_{α/2, n−1} or t > t_{α/2, n−1}.
For H1: μ > μ0, reject H0 if t > t_{α, n−1}.
For H1: μ < μ0, reject H0 if t < −t_{α, n−1}.
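A matching sketch for the unknown-variance case (assuming SciPy; the function name t_test_mean is ours), working from summary statistics; it reproduces the aflatoxin calculation further below (x̄ = 17.9, s = 3, n = 15, H1: μ > 16.5).

import math
from scipy.stats import t

def t_test_mean(xbar, mu0, s, n, alternative="two-sided"):
    # one-sample t-test for the mean when the population sigma is unknown
    tstat = (xbar - mu0) / (s / math.sqrt(n))
    df = n - 1
    if alternative == "greater":
        p = t.sf(tstat, df)
    elif alternative == "less":
        p = t.cdf(tstat, df)
    else:
        p = 2 * t.sf(abs(tstat), df)
    return tstat, p

tstat, p = t_test_mean(17.9, 16.5, 3, 15, alternative="greater")
print(tstat, p)   # t is about 1.81 and p about 0.046, so H0 is not rejected at the 1% level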

Examples 6.7
A random sample of 100 recorded deaths in the United States during the past year showed an
average life span of 71.8 years. Assuming a population standard deviation of 8.9 years, does this
seem to indicate that the mean life span today is greater than 70 years? Use a 0.05 level of
significance.

Solution

Examples 6.8
A manufacturer of sports equipment has developed a new synthetic fishing line that the company
claims has a mean breaking strength of 8 kilograms with a standard deviation of 0.5 kilogram. A
random sample of 50 lines is tested and found to have a mean breaking strength of 7.8 kilograms.
Does this support the company's claim at the 99% confidence level?

Solution

Examples 6.9
Aflatoxins produced by mold on peanut crops in Kimberly is assumed to be normally distributed.

A sample of 15 batches of peanuts reveals levels of 17.9 ppm, on average, with a standard deviation
of 3 ppm. On the basis of this data can we conclude at 1% level of significance that the mean
Aflatoxins is more than 16.5 ppm?

Solution

Data: n = 15, x̄ = 17.9, s = 3, μ0 = 16.5

1. H0: μ = 16.5 versus H1: μ > 16.5
2. This is a one-tailed t-test with α = 0.01.
3. The critical region: reject H0 if t > t_{0.01, 14} = 2.624.
4. t_cal = (17.9 − 16.5)/(3/√15) = 1.81.

35
5. Decision: Since tcal  2.624 , we fail to reject the null hypothesis. There is insufficient
evidence to conclude that the mean Aflatoxins is more than 16.5.
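The figures above can be checked numerically. The sketch below (Python with SciPy, which we assume is available) reproduces the test statistic, the critical value and a right-tailed p-value from the summary data.

```python
# Re-checking the aflatoxin example from its summary statistics (a sketch).
from math import sqrt
from scipy.stats import t

n, xbar, s, mu0, alpha = 15, 17.9, 3.0, 16.5, 0.01

t_cal = (xbar - mu0) / (s / sqrt(n))   # test statistic, about 1.81
t_crit = t.ppf(1 - alpha, df=n - 1)    # t_{0.01,14}, about 2.624
p_value = 1 - t.cdf(t_cal, df=n - 1)   # right-tailed p-value

print(round(t_cal, 2), round(t_crit, 3), round(p_value, 4))
# Since t_cal < t_crit, H0 is not rejected, as in the solution above.
```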

CHAPTER 7

Correlation and Regression Analysis

7.1 Correlation

Correlation analysis is a statistical technique used to quantify the direction and strength of
association between two variables. An estimate of the correlation between two variables is called
the correlation coefficient, denoted by r . In correlation analysis, we estimate a sample correlation
coefficient. The sample correlation coefficient ranges between -1 and +1. When the correlation
coefficient is positive, higher levels of one variable are associated with higher levels of the other
and when it is negative, higher levels of one variable are associated with lower levels of the other.
The sign of the correlation coefficient indicates the direction of the association while the magnitude
of the correlation coefficient indicates the strength of the association.

There are two main approaches commonly used in computing the sample correlation coefficient.
These are:

- The Pearson product-moment correlation coefficient: for estimating linear correlation among continuous variables, and

- The Spearman rank correlation coefficient: a rank-based measure used for ordinal data or when the association is monotonic rather than strictly linear.

Pearson product-moment

This is used for quantitative data measured on an interval or ratio scale. For paired observations $x$
and $y$, the Pearson product-moment correlation coefficient is defined by:

$$r = \frac{\operatorname{cov}(x,y)}{\sigma_x \sigma_y}$$

where

$$\operatorname{cov}(x,y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y}), \qquad
\sigma_x^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2 \quad \text{and} \quad
\sigma_y^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i-\bar{y})^2$$

Here $\sigma_x$ and $\sigma_y$ denote the sample standard deviations of $x$ and $y$.

Spearman’s correlation coefficient

This is applied to the ranks of the observations instead of the actual data. It is defined by:

$$r_s = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n\left(n^2-1\right)}$$

where

$d_i$ = the difference between the pair of ranks for the $i$th observation, and

$n$ = the total number of observations (pairs of ranks).
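A short Python sketch of the same computation follows; `rankdata` from SciPy assigns averaged ranks to ties, matching the convention used in the worked example further below. The function name `spearman_r` is ours.

```python
# A minimal sketch of the Spearman rank correlation formula above.
import numpy as np
from scipy.stats import rankdata

def spearman_r(x, y):
    rx, ry = rankdata(x), rankdata(y)   # ranks; ties receive averaged ranks
    d = rx - ry                         # differences between pairs of ranks
    n = len(x)
    # Exact when there are no ties; with a few ties this is the usual approximation.
    return 1 - 6 * np.sum(d ** 2) / (n * (n ** 2 - 1))
```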

Example 7.1

The table below provides the scores in Math and English for 10 students in a test.

Math scores ($x$):    5  0  3  1  2  2  5  3  5  4
English scores ($y$): 1  2  1  3  3  1  3  1  6  2
a) Find the correlation coefficient between the math grade and English grade and discuss or
interpret it.

b) Test the hypothesis that there is no linear association among the variables at 5% level of
significance.

Solution

Math $x_i$ | English $y_i$ | $x_i-\bar{x}$ | $y_i-\bar{y}$ | $(x_i-\bar{x})^2$ | $(y_i-\bar{y})^2$ | $(x_i-\bar{x})(y_i-\bar{y})$
5 | 1 | 2 | -1.3 | 4 | 1.69 | -2.6
0 | 2 | -3 | -0.3 | 9 | 0.09 | 0.9
3 | 1 | 0 | -1.3 | 0 | 1.69 | 0
1 | 3 | -2 | 0.7 | 4 | 0.49 | -1.4
2 | 3 | -1 | 0.7 | 1 | 0.49 | -0.7
2 | 1 | -1 | -1.3 | 1 | 1.69 | 1.3
5 | 3 | 2 | 0.7 | 4 | 0.49 | 1.4
3 | 1 | 0 | -1.3 | 0 | 1.69 | 0
5 | 6 | 2 | 3.7 | 4 | 13.69 | 7.4
4 | 2 | 1 | -0.3 | 1 | 0.09 | -0.3
Mean/Sum | $\bar{x}=3$, $\bar{y}=2.3$ | | | $\sum(x_i-\bar{x})^2=28$ | $\sum(y_i-\bar{y})^2=22.1$ | $\sum(x_i-\bar{x})(y_i-\bar{y})=6$

a) $\bar{y} = 2.3$ and $\bar{x} = 3$

$$\sigma_x^2 = \frac{1}{10-1}(28) = 3.1111 \qquad \sigma_y^2 = \frac{1}{10-1}(22.1) = 2.4556$$

$$\operatorname{cov}(x,y) = \frac{1}{10-1}(6) = 0.6667$$

$$r = \frac{\operatorname{cov}(x,y)}{\sigma_x \sigma_y} = \frac{0.6667}{\sqrt{3.1111}\,\sqrt{2.4556}} = 0.2412$$

The correlation coefficient between the Math grade and the English grade is about 0.24, which indicates
a weak positive (direct) linear relationship: higher Math grades tend to be associated with only slightly
higher English grades.

b)

1. $H_0: \rho = 0$ versus $H_1: \rho \neq 0$

2. Under $H_0$, the test statistic follows the t-distribution with $n-2 = 8$ degrees of freedom.

3. The two-tailed critical values at the 5% level are $\pm t_{0.025,\,8} = \pm 2.306$.

4. Compute the test statistic using $T = \dfrac{r\sqrt{n-2}}{\sqrt{1-r^2}}$, hence
$T = \dfrac{0.2412\sqrt{10-2}}{\sqrt{1-0.2412^2}} = 0.703$.

5. Decision: Do not reject $H_0$ because $|T| = 0.703 < 2.306$, and conclude that there is no significant
linear association between the variables.
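Part (b) can be reproduced numerically; the sketch below (Python, with NumPy and SciPy assumed) recomputes r and the T statistic directly from the raw scores.

```python
# Re-checking Example 7.1(b): t-test for H0: rho = 0 (a sketch).
import numpy as np
from scipy.stats import t

math_x = np.array([5, 0, 3, 1, 2, 2, 5, 3, 5, 4])
eng_y = np.array([1, 2, 1, 3, 3, 1, 3, 1, 6, 2])
n = len(math_x)

r = np.corrcoef(math_x, eng_y)[0, 1]           # about 0.24
T = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)   # test statistic, about 0.70
t_crit = t.ppf(0.975, df=n - 2)                # two-tailed 5% critical value, about 2.306

print(round(r, 4), round(T, 3), round(t_crit, 3))
# |T| < 2.306, so H0 is not rejected.
```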

Example 7.2

Consider the Table below.

x 69 66 68 73 71 74 71 69
y 163 153 185 186 157 220 190 185

(a) Find the rank correlation coefficient and interpret it


(b) What percentage of the total variations in y is explained by x ?

Solution.

(a) Let the rank values of $x$ and $y$ be $R_x$ and $R_y$ respectively.

x  | y   | $R_x$ | $R_y$ | $d = R_x - R_y$ | $d_i^2$
69 | 163 | 3.5 | 3   | 0.5  | 0.25
66 | 153 | 1   | 1   | 0    | 0
68 | 185 | 2   | 4.5 | -2.5 | 6.25
73 | 186 | 7   | 6   | 1    | 1
71 | 157 | 5.5 | 2   | 3.5  | 12.25
74 | 220 | 8   | 8   | 0    | 0
71 | 190 | 5.5 | 7   | -1.5 | 2.25
69 | 185 | 3.5 | 4.5 | -1   | 1

$$\sum_{i=1}^{n} d_i^2 = 0.25 + 0 + 6.25 + 1 + 12.25 + 0 + 2.25 + 1 = 23$$

Hence, the rank correlation coefficient is given by:

$$r_s = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n\left(n^2-1\right)} = 1 - \frac{6(23)}{8(64-1)} = 0.7262$$

The rank correlation coefficient between the two variables is about 0.73, which indicates a fairly strong
positive (direct) association: higher values of x tend to be accompanied by higher values of y.

(b) The percentage of the total variation in y explained by x is

$$r^2 = 0.7262^2 \times 100\% = 52.74\%.$$

Exercise 7.3
(a) Consider the Table of grades below:

Mathematics grade 70 92 80 74 65 83
English grade 74 84 63 87 78 90

(i) Compute and interpret the correlation coefficient if the grades of students are selected at
random.
(ii) Test the hypothesis that there is no linear association among the variables at 5% level of
significance.

(b) The Statistics Consulting Center at Virginia Tech analyzed data for the Department of Veterinary
Medicine. The variables of interest in the table below are body weight in kilograms and chest size in
centimetres. It was desired to develop a linear regression equation in order to determine if there is a
significant linear relationship between chest size and body weight.

Weight (kg):     2.75  2.15  4.41  5.52  3.21  4.32  2.31  4.30  3.71
Chest size (cm): 29.5  26.3  32.2  36.5  27.2  27.7  28.3  30.3  28.7
(i) Calculate r and interpret it.

(ii) Test the null hypothesis that ρ = 0 against the alternative that ρ > 0. Use the α = 0.01 level.

(iii) What percentage of the variation in chest size is explained by differences in weight?
{NB: square r to obtain this percentage}.

(iv) Use the Spearman rank approach to compute r using the data in question 8.1.

7.2 Regression Analysis

Regression analysis is a statistical technique in which we use observed data to relate a variable of
interest (the response or dependent variable) to one or more independent (or predictor) variables. A
regression analysis in which the response variable depends on one independent or predictor variable is
called a simple regression model. The main objective of regression analysis is to build a regression
model or prediction equation that can be used to describe, predict, and control the dependent variable
on the basis of the independent variables (Bowerman et al., 2001).

7.2.2 Scatter Diagram

One way to explore the relationship between a dependent variable y and an independent variable
(denoted x) is to make a scatter diagram, or scatter plot, of y versus x. First, data concerning
the two variables are observed in pairs. To construct the scatter plot, each value of y is plotted
against its corresponding value of x. If y and x are related, the plot shows us the direction of the
relationship. That is, y could be positively related to x (y increases as x increases) or y could
be negatively related to x (y decreases as x increases) (Bowerman et al., 2001). The figures
below show some examples of scatter plots.

[Figure 1(a)-(d): example scatter plots illustrating different directions of the relationship between x and y.]

7.2.3 Simple linear regression model

The simple linear regression model assumes that the relationship between the dependent variable,
denoted y, and the independent variable, denoted x, can be approximated by a straight line. We can
tentatively decide whether there is an approximate straight-line relationship between y and x by
making a scatter diagram, or scatter plot, of y versus x. The simple linear regression model is given by:

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$

where

$\beta_0$ = the intercept of the model on the $y$-axis

$\beta_1$ = the slope of the linear model

$y_i$ = the value of the response variable in the $i$th observation

$x_i$ = the known value of the predictor variable in the $i$th observation

$\varepsilon_i$ = the random error or noise term, which accounts for errors due to chance and neglected
factors assumed not to be important.

7.2.4 Assumptions Underlying the Simple Linear Model.

(i) The values of $y_i$ are random and independent of each other.

(ii) The error term is normally distributed: $\varepsilon_i \sim N(0, \sigma^2)$.

(iii) The mean of the error term is zero, i.e. $E(\varepsilon_i) = 0$.

(iv) The variance of the error term is constant, i.e. $\operatorname{Var}(\varepsilon_i) = \sigma^2$ for all $x_i$.

(v) The random errors $\varepsilon_i$ and $\varepsilon_j$ are independent, i.e. $\operatorname{Cov}(\varepsilon_i, \varepsilon_j) = 0$ for $i \neq j$.

From the above assumptions, we establish the following:

$$E(y_i) = E(\beta_0 + \beta_1 x_i + \varepsilon_i) = \beta_0 + \beta_1 x_i$$

$$\operatorname{Var}(y_i) = \operatorname{Var}(\beta_0 + \beta_1 x_i + \varepsilon_i) = \operatorname{Var}(\varepsilon_i) = \sigma^2 \ \text{for all } x_i$$

(a) $y_i \sim N(\beta_0 + \beta_1 x_i,\ \sigma^2)$

(b) $\operatorname{cov}(y_i, y_j) = 0$ for $i \neq j$

7.2.5 The Least Squares Point Estimates of the Linear Regression Model

We seek to find the estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ by minimizing the sum of squared
errors (SSE):

$$SSE = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2 \qquad (1)$$

Differentiating the SSE with respect to $\hat{\beta}_0$ and $\hat{\beta}_1$ we have

$$\frac{\partial (SSE)}{\partial \hat{\beta}_0} = -2\sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i), \qquad
\frac{\partial (SSE)}{\partial \hat{\beta}_1} = -2\sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)\,x_i \qquad (2)$$

By setting the partial derivatives to zero and rearranging the terms, we obtain the normal equations:

$$n\hat{\beta}_0 + \hat{\beta}_1 \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i, \qquad
\hat{\beta}_0 \sum_{i=1}^{n} x_i + \hat{\beta}_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i \qquad (3)$$

which may be solved simultaneously to yield:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \qquad (4)$$

and

$$\hat{\beta}_0 = \frac{\sum_{i=1}^{n} y_i - \hat{\beta}_1 \sum_{i=1}^{n} x_i}{n} = \bar{y} - \hat{\beta}_1 \bar{x} \qquad (5)$$
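As a computational illustration of equations (4) and (5), here is a minimal Python sketch; the function name `least_squares_fit` is ours.

```python
# A minimal sketch of the least-squares point estimates in equations (4) and (5).
import numpy as np

def least_squares_fit(x, y):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)  # equation (4)
    beta0 = y.mean() - beta1 * x.mean()                                            # equation (5)
    return beta0, beta1
```

For a quick cross-check, `np.polyfit(x, y, 1)` returns the same slope and intercept (in that order).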

Example 7.4

The table below provides the scores in Math and English for 10 students in a test.

Math scores ($x$):    5  0  3  1  2  2  5  3  5  4
English scores ($y$): 1  2  1  3  3  1  3  1  6  2

a) Fit the linear regression line $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$ to the data.

b) Interpret the estimated coefficients.

c) Determine, with a reason, the nature of the relationship between the grades for Math and English.

Solution

a) The data are summarized in the table below.

Math $x_i$ | English $y_i$ | $x_i-\bar{x}$ | $y_i-\bar{y}$ | $(x_i-\bar{x})^2$ | $(x_i-\bar{x})(y_i-\bar{y})$
5 | 1 | 2 | -1.3 | 4 | -2.6
0 | 2 | -3 | -0.3 | 9 | 0.9
3 | 1 | 0 | -1.3 | 0 | 0
1 | 3 | -2 | 0.7 | 4 | -1.4
2 | 3 | -1 | 0.7 | 1 | -0.7
2 | 1 | -1 | -1.3 | 1 | 1.3
5 | 3 | 2 | 0.7 | 4 | 1.4
3 | 1 | 0 | -1.3 | 0 | 0
5 | 6 | 2 | 3.7 | 4 | 7.4
4 | 2 | 1 | -0.3 | 1 | -0.3
Mean/Sum | $\bar{x}=3$, $\bar{y}=2.3$ | | | $\sum(x_i-\bar{x})^2=28$ | $\sum(x_i-\bar{x})(y_i-\bar{y})=6$
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{6}{28} = 0.2143, \qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 2.3 - 0.2143 \times 3 = 1.657$$

Hence $\hat{y}_i = 1.657 + 0.2143 x_i$.

b) $\hat{\beta}_1 = 0.2143$: all things being equal, a one-unit increase in the Math score is associated
with a 0.2143 increase in the English score.

$\hat{\beta}_0 = 1.657$: when the Math score is zero, the predicted English score is 1.657 (the intercept).

c) There is a positive (direct) linear relationship between x and y because $\hat{\beta}_1$ is positive.
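As a cross-check of the hand calculation above (a sketch, not part of the original solution), `numpy.polyfit` reproduces the same estimates from the raw scores.

```python
# Cross-checking Example 7.4(a) with numpy.polyfit (a sketch).
import numpy as np

math_x = np.array([5, 0, 3, 1, 2, 2, 5, 3, 5, 4])
eng_y = np.array([1, 2, 1, 3, 3, 1, 3, 1, 6, 2])

slope, intercept = np.polyfit(math_x, eng_y, 1)   # degree-1 (straight-line) fit
print(round(intercept, 3), round(slope, 4))       # about 1.657 and 0.2143
```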

7.2.6 Analysis of Variance

The application of analysis of variance (ANOVA) in regression analysis is based on the partitioning
of the total variation and its degrees of freedom into components. Partitioning the total variation gives

$$\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} \left[(y_i - \hat{y}_i) + (\hat{y}_i - \bar{y})\right]^2
= \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + 2\sum_{i=1}^{n} (y_i - \hat{y}_i)(\hat{y}_i - \bar{y})$$

The cross-product term can be shown to be zero for the least-squares fit, so that

$$\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$$

i.e. Total variation = Unexplained variation + Explained variation

$$SST = SSE + SSR$$

- The SST measures the total variation (dispersion) in the observed values $y_i$.
- The SSR measures the amount of the total variation in the observed values $y_i$ that is accounted for by the model.
- The SSE measures the dispersion of the observed values $y_i$ about the regression line.

It can be shown that

$$r^2 = \frac{SSR}{SST}$$

This $r^2$ is called the coefficient of determination, which is the explained variation expressed as a
fraction of the total variation. It gives the proportion of the total variation in $y_i$ accounted for
by the model.

We can deduce from $r^2 = \dfrac{SSR}{SST}$ that

i) $SSR = r^2\, SST$

ii) $SSE = (1 - r^2)\, SST$

Also $0 \leq r^2 \leq 1$ and $r^2 = \hat{\beta}_1^2 \, \dfrac{SS_{xx}}{SST}$.

Table 8.1 The analysis of variance (ANOVA) table for the regression model.

Source of Variation | Sum of Squares | Degrees of freedom | Mean square | F-Ratio
Regression | $SSR = \hat{\beta}_1 SS_{xy}$ | $1$ | $MSR = \dfrac{SSR}{1}$ | $F = \dfrac{MSR}{MSE}$
Residual (Error) | $SSE = SS_{yy} - \hat{\beta}_1 SS_{xy}$ | $n-2$ | $MSE = \dfrac{SSE}{n-2}$ |
Total | $SST = SS_{yy}$ | $n-1$ | |

From the ANOVA table the following results can be established:

i) $E(SST) = (n-1)\sigma^2 + \beta_1^2 SS_{xx}$

ii) $E(MSR) = \sigma^2 + \beta_1^2 SS_{xx}$

iii) $E(SSE) = (n-2)\sigma^2$

iv) $E(MSE) = \sigma^2$

Under the null hypothesis $H_0: \beta_1 = 0$,

$$E(MSR) = \sigma^2 = E(MSE)$$

Thus MSE is an unbiased estimator of $\sigma^2$ whether or not $x$ and $y$ are linearly related (i.e.
whether or not $\beta_1 = 0$). If $\beta_1 = 0$, then $E(MSR) = \sigma^2$, since $\beta_1^2 SS_{xx} = 0$.
Thus, for testing whether or not $\beta_1 = 0$, a comparison of MSR and MSE is made. If MSR and MSE are
of the same order of magnitude, this suggests that $\beta_1 = 0$; on the other hand, if MSR is much larger
than MSE, this suggests that $\beta_1 \neq 0$.

These two mean squares (MSR and MSE) form the basic idea underlying the ANOVA test of
the overall regression model.

7.2.7 The F-Ratio Test

The ANOVA approach provides a highly useful test for regression models (and other linear
statistical models).

(a) The hypotheses are

$H_0: \beta_1 = 0$ versus $H_1: \beta_1 \neq 0$ at the $\alpha$ level of significance.

(b) The test statistic is $F = \dfrac{MSR}{MSE} \sim F(1,\, n-2)$, which is close to 1 when $\beta_1 = 0$
and larger than 1 when $\beta_1 \neq 0$.

(c) Decision rule: Reject $H_0: \beta_1 = 0$ if $F > F_{\alpha}(1,\, n-2)$.
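The sketch below (Python; the function name `regression_anova` is ours) builds the quantities in Table 8.1 and applies the F-ratio decision rule.

```python
# A minimal sketch of the regression ANOVA and F-ratio test described above.
import numpy as np
from scipy.stats import f

def regression_anova(x, y, alpha=0.05):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(y)
    beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    beta0 = y.mean() - beta1 * x.mean()
    y_hat = beta0 + beta1 * x
    sst = np.sum((y - y.mean()) ** 2)     # total variation
    sse = np.sum((y - y_hat) ** 2)        # unexplained variation
    ssr = sst - sse                       # explained variation
    msr, mse = ssr / 1, sse / (n - 2)
    F = msr / mse
    f_crit = f.ppf(1 - alpha, 1, n - 2)   # F_alpha(1, n-2)
    return {"SSR": ssr, "SSE": sse, "SST": sst, "F": F, "F_crit": f_crit,
            "reject_H0": F > f_crit}
```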

7.2.8 Coefficient of determination

The coefficient of determination, denoted by $R^2$, measures the proportion of the total variability
in the dependent variable (y) that is explained by the independent variable (x):

$$R^2 = \frac{SSR}{SSTO} = \frac{SSTO - SSE}{SSTO} = 1 - \frac{SSE}{SSTO}$$

Note the following:

(a) $0 \leq R^2 \leq 1$.
(b) If all the data points fall exactly on a regression line having a non-zero slope, then $R^2 = 1$.
(c) If $\hat{\beta}_1 = 0$, then $R^2 = 0$.
(d) The square root of $R^2$ gives the correlation coefficient, where the sign is determined by the
sign of $\hat{\beta}_1$, i.e. $r = +\sqrt{R^2}$ if $\hat{\beta}_1 > 0$ and $r = -\sqrt{R^2}$ if $\hat{\beta}_1 < 0$.
7.2.9 Pitfalls and limitations associated with regression and correlation analysis

(a) In regression analysis a value of Y cannot be legitimately estimated if the value of X is


outside the range of values that served as the basis for the regression equation.
(b) If the estimate of Y involves the prediction of a result that has not yet occurred, the
historical data that served as the basis for the regression equation may not be relevant for
future events.
(c) The use of a prediction or a confidence interval is based on the assumption that the
conditional distributions of Y, and thus of the residuals, are normal and have equal
variances.
(d) A significant correlation coefficient does not necessarily indicate causation, but rather may
indicate a common linkage to other events.
(e) A significant correlation is not necessarily an important correlation. Given a large sample,
a correlation of, say, r=+0.10 can be significantly different from 0 at a   0.05 . Yet the

coefficient of determination of r2 = 0.01 for this example indicates that only 1 percent of
the variance in Y is statistically explained by knowing X.
(f) The interpretation of the coefficients of correlation and determination is based on the
assumption of a bivariate normal distribution for the population and, for each variable,
equal conditional variances.
(g) For both regression and correlation analysis, a linear model is assumed. For a relationship
that is curvilinear, a transformation to achieve linearity may be available. Another
possibility is to restrict the analysis to the range of values within which the relationship is
essentially linear.
Example 8.3

Use the estimated model below to describe the relationship between x and y:

$$\hat{y} = 3.829633 + 0.903643x$$

Example 8.4

Suppose an analyst takes a random sample of 10 recent truck shipments made by a company and
records the distance in miles and delivery time to the nearest half-day from the time that the
shipment was made available for pick-up. Use the Table below to answer the following questions.

(a) Construct a scatter plot and use it to determine if a linear regression will be appropriate.
(b) Determine the least-squares regression equation for the data.
(c) What is the nature of the relationship between distance and delivery time? Motivate your
answer.
(d) Interpret the estimated value for β1.
(e) Determine the number of days it will take a shipment to arrive if the total miles is 1000.
(f) Determine the coefficient of determination and interpret it.
(g) Construct an ANOVA table to represent the data.

(h) Perform a hypothesis test to determine the significance of the estimated model at α=0.05.

(i) Compute the correlation coefficient and use it to test the hypothesis $H_0: \rho = 0$ versus
$H_1: \rho \neq 0$ at the 0.05 level of significance.

Solution

(a) Looking at the scatter plot below, the points seem to follow an approximately straight-line pattern,
so a linear regression model may be appropriate for the data.

[Scatter plot: delivery time (days) versus distance (miles).]

(b) We will need the following summary statistics (computed from the data):

$$\bar{x} = 762, \quad \bar{y} = 2.85, \quad \sum_{i=1}^{10} (x_i - \bar{x})^2 = 1\,297\,860, \quad
\sum_{i=1}^{10} (y_i - \bar{y})^2 = 18.525, \quad \sum_{i=1}^{10} (x_i - \bar{x})(y_i - \bar{y}) = 4653$$

$$\hat{\beta}_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \frac{4653}{1\,297\,860} = 0.003585$$

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 2.85 - 0.003585 \times 762 = 0.118129$$

Therefore the estimated regression line is $\hat{y} = 0.118129 + 0.003585x$.

(c) There is a direct (positive) relationship between delivery time and distance because $\hat{\beta}_1$
is positive.

(d) All things being equal, a one-mile increase in distance is associated with a 0.0036-day increase in
delivery time.

(e) $\hat{y} = 0.118 + 0.0036(1000) = 3.70$ days.

(f) $SSE = SS_{yy} - \hat{\beta}_1 SS_{xy} = 18.525 - (0.003585)(4653) = 1.844$

Note from the variance decomposition that $SST = \sum_{i=1}^{10} (y_i - \bar{y})^2 = 18.525$, so

$$r^2 = 1 - \frac{SSE}{SST} = 1 - \frac{1.844}{18.525} = 0.90$$

About 90% of the total variation in delivery time is explained by distance.

(g) ANOVA table

Source of Variation | Sum of Squares | Degrees of freedom | Mean square | F-Ratio
Regression | 16.681 | 1 | 16.681 | $F = \dfrac{16.681}{0.2305} = 72.396$
Residual Error | 1.844 | 8 | $\dfrac{1.844}{10-2} = 0.2305$ |
Total | 18.525 | 9 | |

(h) $H_0: \beta_1 = 0$ versus $H_1: \beta_1 \neq 0$ at the 0.05 level of significance.

$$F = 72.396, \qquad F_{\alpha}(1,\, n-2) = F_{0.05}(1,\, 8) = 5.32$$

Since $F = 72.396 > 5.32$, we reject $H_0: \beta_1 = 0$ and conclude that the model is significant.

(i) $H_0: \rho = 0$ versus $H_1: \rho \neq 0$

The two-tailed critical values at the 5% level are $\pm t_{0.025,\,8} = \pm 2.306$.

$r = +\sqrt{0.90} = 0.95$ (positive, since $\hat{\beta}_1$ is positive).

$$T = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} = \frac{0.95\sqrt{10-2}}{\sqrt{1-0.95^2}} = 8.61$$

We reject $H_0$ because $T = 8.61 > 2.306$ and conclude that there is a significant linear relationship
between distance and delivery time.
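The whole solution can be re-traced in a few lines from the summary statistics quoted in part (b); the Python sketch below does so (illustrative only).

```python
# Re-checking the truck-shipment example from its summary statistics (a sketch).
from scipy.stats import f

n = 10
xbar, ybar = 762, 2.85
ss_xx, ss_yy, ss_xy = 1297860, 18.525, 4653

beta1 = ss_xy / ss_xx             # about 0.003585
beta0 = ybar - beta1 * xbar       # about 0.118
sse = ss_yy - beta1 * ss_xy       # about 1.844
ssr = ss_yy - sse                 # about 16.681
r2 = ssr / ss_yy                  # about 0.90
F = (ssr / 1) / (sse / (n - 2))   # about 72.4
f_crit = f.ppf(0.95, 1, n - 2)    # about 5.32

print(round(beta1, 6), round(beta0, 4), round(r2, 3), round(F, 1), round(f_crit, 2))
```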
