Contents

1 Review
1.1 Learning Outcomes
1.2 Key Terms and Concepts
1.3 Introduction to Statistics
1.4 Types of Sampling
1.5 Experimental Design
1.6 Measures of Central Tendency
1.7 Measures of Dispersion
1.8 Measures of Position
1.9 Five-Number Summaries and Boxplots
1.10 Distributions
1.10.1 Skewness
1.11 Hypothesis Testing
1.11.1 Independent Random Samples from one Population: Continuous data
1.11.2 Independent Large Random Samples from one Population: Count data
1.11.3 Independent Small Random Samples from one Population: Count data
1.11.4 Random Samples from two Independent Populations: Continuous Variables
1.11.5 Two Dependent Random Samples: Continuous Variables
1.11.6 Large Random Samples from two Independent Populations: Count Data
1.11.7 Small Random Samples from two Independent Populations: Count Variables
1.11.8 Testing Normality
1.12 Correlation
1.12.1 Scatter Plots
1.12.2 Pearson’s Correlation Coefficient
1.12.3 A Test for Independence
1.12.4 Spearman’s Rank Correlation Coefficient
1.13 Univariate Linear Regression
1.13.1 Inferences about β1
1.13.2 Inferences about β0
1.13.3 Inferences about an expected response
1.13.4 Diagnostics for the least-squares regression line

2 Analysis of Variance

3 Non-parametric Methods

4 Multivariate Linear Regression
4.1 Overview

Chapter 1

Review

This chapter highlights the statistical prerequisite material required for successful completion of
the course. It provides an overview of some of the descriptive and inferential statistics topics that
will be used throughout the course and would have been taught in an undergraduate statistics
course.

1.1 Learning Outcomes

At the completion of this chapter, you will be able to:

• define several key terms relevant to Biostatistics;

• identify the level of measurement associated with different data sets;

• identify and calculate three measures of the central tendency (the mean, median, and mode)
of data;

• identify and calculate three measures of data dispersion (range, variance, and standard
deviation);

• identify and calculate two measures of data position (z-scores and percentile ranks);

• draw and interpret a box-and-whisker plot, a confidence interval plot, and a scatter plot;

• implement several hypothesis tests, including z-tests for a single proportion and the
difference in two independent proportions; t-tests for a single mean, the difference in means
for two independent samples, and the difference in means for two dependent samples; an
F-test for two independent standard deviations; Levene’s test for equality of variances; the
Shapiro-Wilk W test for Normality; a test for significant correlation; and tests for regression
coefficients;

• determine graphically and through inference whether a data set is normally distributed;

• compute confidence intervals for various parameters including means and regression
coefficients;

• identify visually and through inference whether two variables are linearly related;

• calculate a least-squares regression line using bivariate sample data;

• use a sample regression line to make interpolations and extrapolations and provide the
associated confidence and prediction intervals; and

• draw public health conclusions based on the reviewed statistical inference topics.

1.2 Key Terms and Concepts


• Statistics, Biostatistics, Descriptive Statistics, and Inferential Statistics

• Population versus sample

• Quantitative versus qualitative variables (data)

• Levels of data measurements: nominal, ordinal, interval, and ratio

• Observational studies

• Experimental design

• Lurking variables

• Types of sampling: convenience, systematic, cluster, and stratified

• Measures of central tendency: mean, median, and mode
• Measures of dispersion: range, variance, and standard deviation
• Measures of position: z-scores, percentiles, and quartiles
• Hypothesis testing: significance level, Type I error, Type II error, power of a test, null and
alternative hypotheses, left-tail, right-tail, two-tail, p-value, rejection region, test statistic, and
critical value
• Confidence intervals: confidence level, margin of error, and confidence interval width
• Distributions: Normal, Student-t, and F
• z-tests: for a single proportion and for the difference in two independent proportions
• t-tests: for a single mean, for the difference of two means from independent samples, and
for the difference of two means from dependent samples
• F-test and Levene’s Test for the equality of two variances
• QQ and PP plots for visualizing normality
• Shapiro-Wilk Test for normality
• Influential observations
• Outliers
• Scatter plot
• Pearson’s correlation coefficient
• Spearman’s rank correlation coefficient
• The Principle of Least-Squares
• Line of best-fit
• Coefficient of Determination
• Regression line
• Interpolation/extrapolation
• Confidence and prediction intervals for expected responses

1.3 Introduction to Statistics

Statistics is the science of collecting, organizing, summarizing and analyzing information in order
to draw conclusions. Biostatistics is the application of statistics to medicine and the biological
sciences. The two branches of Biostatistics on which we will be focusing are Descriptive Statistics
and Inferential Statistics. We will briefly discuss two other branches: Sampling Theory and
Experimental Design.

The area of Descriptive Statistics consists of organizing and summarizing the information
collected. Inferential Statistics uses methods that generalize results obtained from a sample to a
population and measure their reliability. The information required is either based on an entire
population or a sample from within the population. A population is the complete collection of
objects, subjects, units or individuals of interest in a study. A population can also be the complete
set of measurements or observations on these objects, subjects, units, or individuals. A sample
is a subset or part of a population.

Individuals or elements from a population are the objects described by a data set. They can
be people, animals, or things. Variables are the characteristics or attributes of the individuals
(elements) within the population and are capable of assuming any value within the data set. Data
consist of the values (measurements or observations) that the variables can assume. Variables
whose values are determined by chance are called random variables.

There are two types of variables:


Qualitative (also called categorical) variables allow for classification of individuals into groups
based on some attribute or characteristic. It does not make sense to apply mathematical
operations to the values of these variables.

– e.g. citizenship, zip code, public health fields


Quantitative variables provide numerical measures of individuals. They can be ranked, ordered,
and measured. Arithmetic operations (e.g. addition) can be performed on the values and provide
meaningful results.

– e.g. age, weight, medication dosage, bacterial counts


Quantitative variables can further be divided into two types: discrete and continuous.

Discrete variables have a finite number of possible values or a countable number of possible
values.

– e.g. group size, number of steps you take in your life

Continuous variables can assume an infinite number of possible values between any two given
values and therefore can be measured to any desired level of accuracy.

– e.g. height, weight, time taken from sample collected in lab until patient receives the test
results

In addition to classifying variables as qualitative or quantitative, variables can be classified
according to how they are categorized, counted, or measured. This type of classification uses four
common types of measurement scales:

Nominal data are names of categories, characteristics, or properties. Data are classified into
non-overlapping, exhaustive categories on which no order or rank can be imposed.

– e.g. Classifying the residents of Saskatoon by their postal code; classifying the employees of
the university by their office telephone number; classifying the students in this PUBH 805 class
by their eye colour, their gender, by whether they are wearing socks or not.

Ordinal data are observations that can be ordered or ranked on some basis of magnitude. Note
that differences between ordinal values are NOT meaningful.

– e.g. Opinions about a concert may be summarized as terrible, okay, excellent; the ratings of
movies are ranked according to G, PG, PG13, and R; the alert level of an American airport may
be yellow, orange, or red.

Interval data are numerical observations such that equal differences in the numbers define equal
differences in magnitude; there is no meaningful value of zero.

– e.g. Temperature: 34 °C yesterday and 26 °C today implies an 8 °C difference in temperature
but 0 °C does not mean there is no temperature at all; IQ: for an IQ of 142 and an IQ of 110,
there is a difference of 32 between the two IQs but there is no such thing as a person with 0 IQ
(no intelligence).

Ratio data are numerical observations such that ratios of numbers define ratios of magnitudes.
The value of zero does have meaning.

– e.g. Heights, weights, bacterial counts.

Data can be obtained from four different sources:

1. A census is a list of all individuals in a population along with certain characteristics of each
individual.

2. An observational study measures the characteristics of a population by studying individuals
in a sample, but does not attempt to manipulate or influence the variable(s) of interest.
Observational studies are sometimes referred to as ex post facto (after-the-fact) studies
because the value of the variable of interest has already been established.

3. A designed experiment applies a treatment to individuals (experimental units) and attempts
to isolate the effects of the treatment on a response variable.

4. Survey sampling does not attempt to manipulate or influence the variable(s) of interest. It
leads to observational studies, which measure the characteristics of a population by studying
individuals in a sample. The value of the variable of interest has already been established.

NOTA BENE: Observational studies are very useful tools for determining whether there is a relation
between two variables, BUT a designed experiment is required to isolate the cause of the relation.
When implementing a statistical study, it is important to check whether the variable being
studied is influenced by underlying hidden variables. These underlying hidden variables
are sometimes called lurking variables.

For Discussion. Suppose a cancer researcher is interested in determining whether there is a
connection between smoking and lung cancer. This type of research project would be considered
an observational study because the data are collected after an individual has been smoking for
some period of time. Suppose that from the observed data there is a higher incidence rate of
lung cancer amongst the smokers than the non-smokers.

1. What conclusion could the researcher make?

2. How could we change the study so that causation rather than association can be concluded?

3. Identify a lurking variable and explain how it, rather than smoking, could plausibly be the
cause of lung cancer.

Observational studies are performed for two reasons:

1. to learn characteristics of a population, and


2. to determine whether there is an association between two or more variables where the
values of the variables have already been determined.

Observational (or ex post facto) research is usually implemented in situations where the control of
certain variables is unethical or simply impractical (impossible). Experiments are used whenever
control of certain variables is desired (and morally/ethically allowed). Once one has identified the
source of the data, how does one actually form a sample?

1.4 Types of Sampling

A sample of size n from a population of size N is obtained through simple random sampling if
every possible sample of size n is equally likely to be selected. The sample is then
called a simple random sample. To generate a simple random sample:

1. Obtain a list of all the N individuals in the population of interest. This list is called the frame.

2. Number the individuals in the frame from 1 to N.

3. Randomly generate n distinct numbers between 1 and N, where n is the desired sample size. The
individuals with these numbers form the sample.
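
The three steps above can be sketched in Python. This is only an illustration and not part of the course materials: the function name `simple_random_sample` and the frame of patient identifiers are hypothetical, and the standard-library call `random.sample` is assumed to be an acceptable source of randomness.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw a simple random sample of size n from a list of individuals (the frame)."""
    rng = random.Random(seed)  # the seed only makes the illustration reproducible
    N = len(frame)
    # Number the individuals 1..N and draw n distinct numbers without replacement.
    chosen_numbers = rng.sample(range(1, N + 1), n)
    # Map each chosen number back to the individual carrying that number.
    return [frame[number - 1] for number in chosen_numbers]

# Hypothetical frame of N = 10 patient identifiers.
frame = [f"patient_{i:02d}" for i in range(1, 11)]
sample = simple_random_sample(frame, n=3, seed=1)
print(sample)
```

Because `random.sample` draws without replacement, no individual can appear twice in the sample.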

Four other common types of sampling are:

1. A convenience sample is a sample in which the individuals in the sample are easily obtained.
The most popular type of convenience sample is a self-selected sample (that is, the individuals
volunteer to participate). Convenience sampling will generally yield results that are
suspect. Any results should be looked upon with extreme skepticism.

2. A systematic sample is a sample obtained by selecting every kth individual from the population.
The first individual selected is chosen at random from among the first k individuals.

3. A cluster sample is obtained by selecting all individuals within a randomly selected collection
or group of individuals.

4. A stratified sample is obtained by separating the population into homogeneous,
non-overlapping groups (called strata) and then obtaining a simple random sample from each
stratum.
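
The systematic, cluster, and stratified schemes listed above can be sketched as follows. The function names, the household numbers, and the ward labels are all hypothetical and chosen only for illustration; the sketch assumes the population is available as Python lists and dictionaries.

```python
import random

def systematic_sample(frame, k, seed=None):
    """Select every k-th individual, starting at a random position among the first k."""
    rng = random.Random(seed)
    start = rng.randrange(k)  # 0-based index of the first selected individual
    return frame[start::k]

def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly select whole clusters and keep every individual inside them."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [person for label in chosen for person in clusters[label]]

def stratified_sample(strata, n_per_stratum, seed=None):
    """Draw a simple random sample of fixed size from each stratum."""
    rng = random.Random(seed)
    return {label: rng.sample(members, n_per_stratum)
            for label, members in strata.items()}

# Hypothetical population of 20 numbered households.
households = list(range(1, 21))
sys_sample = systematic_sample(households, k=5, seed=0)

# Hypothetical clusters (hospital wards) and strata (households by child status).
wards = {"ward_A": [1, 2, 3], "ward_B": [4, 5], "ward_C": [6, 7, 8, 9]}
clu_sample = cluster_sample(wards, n_clusters=2, seed=0)

strata = {"with_child": [1, 4, 9, 16], "without_child": [2, 3, 5, 6, 7, 8]}
strat_sample = stratified_sample(strata, n_per_stratum=2, seed=0)
print(sys_sample, clu_sample, strat_sample)
```

The stratified version takes the same number from every stratum for simplicity; sampling in proportion to stratum size is a common alternative.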

For Discussion. A study of 300 households in a rural community revealed that 20 percent had
at least one school-age child living in the household. Describe how you would use a stratified
random sample to collect data from this group. BFAHS, p. 17. q. 8

For Discussion. A study of 250 patients admitted to a hospital during the past year revealed
that, on average, the patients lived within 15 miles of the hospital. Describe how you would use
systematic sampling of patient records to collect data from this group. BFAHS, p. 17. q. 8

1.5 Experimental Design

A designed experiment is a controlled study in which one or more treatments are applied to
experimental units. An experimental unit is a person, object or some other well-defined item
upon which a treatment is applied. A treatment is a condition applied to the experimental
unit. A response variable is a qualitative or quantitative variable in which we are interested. A
predictor variable is a qualitative or quantitative variable that affects the response variable. It
may be controlled or uncontrolled.

A double-blind experiment is an experiment in which neither the experimental unit nor the
experimenter knows what treatment is being administered to the experimental unit.

An experiment in which the different treatments are randomly assigned to the experimental units
is called a completely randomized design.

In a randomized block design, the experimental units are first grouped into blocks of similar
units and the treatments are then randomly assigned to the experimental units within each
block.

A matched-pairs design is a randomized block design in which the experimental units are somehow
related (e.g. the same person before and after a treatment, twins, husband/wife, etc.). There are
only TWO treatments in a matched-pairs design.

In a repeated measures design, multiple treatments in the same order are applied to each
experimental unit and the same variables are measured in the same order for each treatment on each
experimental unit.

Key ingredients of a well-designed experiment are control, manipulation, randomization and
replication.

1.6 Measures of Central Tendency

A parameter is a descriptive measure of a population. It is obtained by using all the data from a
population. A statistic is a descriptive measure of a sample. It is obtained by using the data
from a sample. Usually we use Greek letters to represent parameters. One exception to this rule
is that we usually use N to represent the size of a population and n to represent the size of a sample.

The arithmetic mean (or average) of a variable is computed by determining the sum of all values
of the variable in the data set and then dividing this sum by the number of elements in the
data set. The population arithmetic mean, denoted μ, is computed using all the individuals in a
population. This would be an example of a parameter. The sample arithmetic mean, denoted x̄,
is computed using sample data. The sample mean is an example of a statistic.

$$\mu = \frac{x_1 + x_2 + \cdots + x_N}{N} = \frac{\sum_{i=1}^{N} x_i}{N}$$

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$$
The sample mean x̄ is an example of a statistic and it is quite often used to estimate the parameter
μ, the population mean. Why? Quite often it is unreasonable (if not impossible) to determine
the population mean μ. Using x̄ to estimate μ is an example from “inferential statistics” which
we will be discussing later in the course.

Both the population mean μ and the sample mean x̄ are measures of central tendency. μ is
the average or “central” value for the population and x̄ is the average or “central” value for the
sample. The arithmetic mean can be thought of as the point about which the data balance: half
of the “weight” lies to its left and half of the “weight” lies to its right.

For Discussion. Does the mean provide sufficient information about a population (sample)?
When is/isn’t it enough?

The mean (or average) is not the only measure of central tendency. Two other types of central
tendency are the median and mode.

The median of a variable is the value that lies in the middle of the data when arranged in ascending
order. That is, half of the actual data points lie below the median and half of the data are above
the median. We use M to represent the median value. If there is an odd number of elements in
this ordered set, then the median is the middle value. If there is an even number of elements in
this ordered set, then the median is the average of the two middle values.

The mode of a variable is the most frequent observation of the variable that occurs in the data
set. If no value occurs more often than any other value in the data, there is no
mode. A data set can have two modes (bimodal), three modes (trimodal), etc.

For Discussion. Which measure of central tendency is the “best”? Note the mode is the only
measure of central tendency that applies to qualitative data.

For Practice. Suppose we are interested in testing the effectiveness of a new type of antibiotic.
Three different types of bacteria are exposed to the drug and the survival time for a particular
bacteria culture is measured as the amount of time required to kill 50% of the cells in the petri
dish. The survival times for eight colonies of one particular bacteria culture are 1.1 hours, 1.2
hours, 1.5 hours, 1.7 hours, 1.9 hours, 1.1 hours, 1.3 hours, and 1.8 hours. Calculate the mean,
median, and mode.

Solution: The mean is 1.45 hours. The median is 1.4 hours. The mode is 1.1 hours.
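
As a check on this solution, the three measures can be computed with Python's standard `statistics` module. This is a sketch for verification only (the course itself uses SPSS), and the variable names are illustrative.

```python
import statistics

# Survival times (hours) for the eight colonies in the practice question.
survival_hours = [1.1, 1.2, 1.5, 1.7, 1.9, 1.1, 1.3, 1.8]

mean = statistics.mean(survival_hours)
median = statistics.median(survival_hours)    # averages the two middle values when n is even
modes = statistics.multimode(survival_hours)  # list of all most-frequent values

print(f"mean = {mean:.2f} h, median = {median:.2f} h, mode(s) = {modes}")
```

Note that `multimode` returns every value when all values occur equally often, whereas the convention used in these notes is that such a data set has no mode.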

1.7 Measures of Dispersion

In this section we are going to quantify how “spread out” (or dispersed) a set of values is. The simplest
measure of this variability among a set of data is the range. The range of a sample is simply the
difference between the largest and smallest values in the set.

For Discussion. Is the range a “good” measure of dispersion? Does the range describe well
how dispersed the data in a set are?

Because x̄ is a measure of the center of a data set, one method of determining the variability of
the individual data points $x_i$ about the center is their deviation from the mean, that is, $x_i - \bar{x}$.

For Discussion. Is the total deviation $\sum_{i=1}^{n} (x_i - \bar{x})$ a good measure of dispersion? Why or why
not?

Another measure of dispersion is referred to as the variance. The population variance of a
variable is denoted σ² and is the average of the squared deviations of the observations about the
population mean, that is

$$\sigma^2 = \sum_{i=1}^{N} \frac{(x_i - \mu)^2}{N} = \frac{1}{N}\left[\sum_{i=1}^{N} x_i^2 - \frac{\left(\sum_{i=1}^{N} x_i\right)^2}{N}\right].$$

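The sample variance of a variable is denoted s² and is approximately the average of the squared
deviations of the observations about the sample mean (the divisor is n − 1 rather than n), that is

$$s^2 = \sum_{i=1}^{n} \frac{(x_i - \bar{x})^2}{n-1} = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right].$$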

The variance (whether population or sample) has different units than the values used to compute
it. Hence the variance cannot be directly compared to the mean or the data used to compute
it. To solve this problem, we simply take the square root of the variance to get the standard deviation. The
population standard deviation is denoted σ and is defined to be $\sigma = \sqrt{\sigma^2}$. The sample standard
deviation is denoted s and is defined to be $s = \sqrt{s^2}$.

In order to give meaning to the magnitude of the standard deviation, it needs to be compared to
the mean. The coefficient of variation is the standard deviation divided by the mean times 100%.
For samples, it is $\frac{s}{\bar{x}} \times 100\%$ and for populations, it is $\frac{\sigma}{\mu} \times 100\%$ (the larger the coefficient of
variation, the more variation in the corresponding data).

For Practice. Calculate the sample range, variance, standard deviation, and coefficient of
variation for the survival times in the previous practice question.

Solution: The sample range is 0.8 hours. The sample variance is 0.10285714 hours². The sample
standard deviation is 0.32071349 hours. The sample coefficient of variation is 22.118171%.
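
These values can be reproduced with Python's standard `statistics` module, whose `variance` and `stdev` functions use the n − 1 divisor of the sample formulas. The sketch below is for verification only; the variable names are illustrative.

```python
import statistics

# Survival times (hours) for the eight colonies in the practice question.
survival_hours = [1.1, 1.2, 1.5, 1.7, 1.9, 1.1, 1.3, 1.8]

data_range = max(survival_hours) - min(survival_hours)
sample_variance = statistics.variance(survival_hours)  # divisor n - 1
sample_sd = statistics.stdev(survival_hours)           # square root of the sample variance
cv_percent = sample_sd / statistics.mean(survival_hours) * 100

print(f"range = {data_range:.1f} h")
print(f"variance = {sample_variance:.8f} h^2")
print(f"standard deviation = {sample_sd:.8f} h")
print(f"coefficient of variation = {cv_percent:.4f} %")
```

The population versions, with divisor N, are available as `statistics.pvariance` and `statistics.pstdev`.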

1.8 Measures of Position

Suppose one needs to know the position of a data point relative to the other points in the data set.
One measure of a data point’s position is the z-score which represents the number of standard
deviations that the data point is from the mean. It is obtained by subtracting the mean from
the data value and dividing this difference by the standard deviation. There is both a population
z-score and a sample z-score. Note that the z-score is unitless.

$$z = \frac{x_i - \mu}{\sigma} \quad \text{and} \quad z = \frac{x_i - \bar{x}}{s}$$

are respectively the population and sample z-scores.

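
As an illustration of the sample z-score, the sketch below standardizes the eight survival times from the earlier practice question. It is a minimal example using only the standard library; the variable names are illustrative.

```python
import statistics

# Survival times (hours) from the earlier practice question.
survival_hours = [1.1, 1.2, 1.5, 1.7, 1.9, 1.1, 1.3, 1.8]

x_bar = statistics.mean(survival_hours)
s = statistics.stdev(survival_hours)

# Sample z-score: number of sample standard deviations each value lies from the sample mean.
z_scores = [(x - x_bar) / s for x in survival_hours]

for x, z in zip(survival_hours, z_scores):
    print(f"{x:.1f} h -> z = {z:+.3f}")
```

By construction the z-scores have mean 0 and sample standard deviation 1, which is why they can be compared across data sets measured in different units.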
Another measure of a value’s position within a data set is the sample k’th percentile, denoted
Pk. Pk is a value such that after the data are ordered from smallest to largest, at least k% of
the observations are at or below the value Pk and at least (100−k)% are at or above the value
Pk. Special percentiles are the first, second, and third quartiles (respectively denoted Q1, Q2,
and Q3), which are simply the 25th, 50th, and 75th percentiles respectively.

How does one compute Pk from a data set with n observations? There are several different
methods. A simple method is as follows.

1. Order the data from smallest to largest

2. Calculate nk/100.

3. If nk/100 is not an integer, round it up to the next integer and find the corresponding
ordered value. If nk/100 is an integer, say I, calculate the average of the I’th and the
(I+1)’st ordered values.
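
The three-step rule above can be sketched in Python. The function name `simple_percentile` is hypothetical, and the sketch assumes k is a whole number strictly between 0 and 100 so that nk/100 can be checked exactly with integer arithmetic.

```python
def simple_percentile(data, k):
    """k-th percentile (integer k, 0 < k < 100) using the simple three-step rule."""
    if not data:
        raise ValueError("data must not be empty")
    if not 0 < k < 100:
        raise ValueError("k must satisfy 0 < k < 100")
    ordered = sorted(data)                    # step 1: order the data
    n = len(ordered)
    position, remainder = divmod(n * k, 100)  # step 2: nk/100 as quotient and remainder
    if remainder:
        # nk/100 is not an integer: round up to position + 1 (1-based),
        # which is index `position` in 0-based Python indexing.
        return ordered[position]
    # nk/100 is an integer I: average the I-th and (I+1)-st ordered values.
    return (ordered[position - 1] + ordered[position]) / 2

# Survival times (hours) from the earlier practice question.
survival_hours = [1.1, 1.2, 1.5, 1.7, 1.9, 1.1, 1.3, 1.8]
p25 = simple_percentile(survival_hours, 25)
p50 = simple_percentile(survival_hours, 50)
p90 = simple_percentile(survival_hours, 90)
print(p25, p50, p90)
```

With this rule the 50th percentile coincides with the median defined earlier.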

Note: SPSS does not use the above procedure. Instead of a simple average, SPSS uses a weighted
average. Percentiles you would compute using the above procedure will generally be close to, if not
exactly equal to, the values computed by SPSS.

Suppose we need to know the approximate percentile rank (the value of k) of a particular
observation X within a data set. We can find the approximate ranking by using the following
expression:

$$k \approx \frac{(\text{number of values below } X) + 0.5}{\text{total number of data points}} \times 100\%.$$

We can use percentiles and z-scores to identify extreme observations. Any extreme observation
is referred to as an outlier. Any statistic that is heavily influenced by outliers is referred to as a
nonresistant statistic. Any statistic that is minimally influenced by outliers is a resistant statistic.

One method for checking a data set for outliers uses percentiles, as follows:

1. Determine the first and third quartiles.

2. Compute the interquartile range: IQR = Q3 − Q1 .

3. Determine the fences. Fences serve as the cutoff points for determining outliers. The
lower fence is computed using Q1 − 1.5(IQR) and the upper fence is computed using
Q3 + 1.5(IQR).

4. If a data value is less than the lower fence or greater than the upper fence, then it is
considered an outlier.

For Practice. The following are the systolic blood pressures (in mmHg) of 20 men: 150 141 90 108 158
119 156 114 95 97 145 167 144 171 132 97 163 111 186 98.

1. Compute the 10th percentile (P10).

2. Compute Q1, Q2, and Q3.

3. Determine if the above data set has any outliers.

4. Determine the percentile rank of the value 163.

Solution: The SPSS output which gives the solutions to (1) and (2) is

1. The 10th percentile is P10 = 95.200 mmHg.

2. The first quartile is Q1 = 100.500 mmHg. The second quartile is Q2 = 136.500 mmHg.
The third quartile is Q3 = 157.500 mmHg.

3. The value for the lower fence is Q1 − 1.5(IQR) = 15.000 mmHg. The value for the upper
fence is Q3 + 1.5(IQR) = 243.000 mmHg. Because no value in the data set is below the lower
fence or above the upper fence, this data set has no outliers.

4. The percentile rank associated with 163 is k = (16 + 0.5)/20 × 100% = 82.5%, which suggests
that the percentile rank of 163 is 83, i.e. P83 = 163.
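
The four answers can be reproduced with the Python standard library. The sketch assumes that the weighted-average percentiles reported by SPSS correspond to interpolation at position (n + 1)k/100, which is what `statistics.quantiles` does with `method="exclusive"`; that assumption reproduces the values quoted above. The helper `percentile_rank` is a hypothetical name.

```python
import statistics

# Systolic blood pressures (mmHg) of the 20 men.
pressures = [150, 141, 90, 108, 158, 119, 156, 114, 95, 97,
             145, 167, 144, 171, 132, 97, 163, 111, 186, 98]

# Weighted-average percentiles at position (n + 1) * k / 100.
p10 = statistics.quantiles(pressures, n=10, method="exclusive")[0]
q1, q2, q3 = statistics.quantiles(pressures, n=4, method="exclusive")

# Fences for the outlier check.
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in pressures if x < lower_fence or x > upper_fence]

def percentile_rank(data, value):
    """Approximate percentile rank: (values below + 0.5) / n * 100."""
    below = sum(1 for x in data if x < value)
    return (below + 0.5) * 100 / len(data)

rank_163 = percentile_rank(pressures, 163)

print(p10, q1, q2, q3)
print(lower_fence, upper_fence, outliers)
print(rank_163)
```

Applying the simple three-step rule instead would give slightly different values for P10, Q1, and Q3, which is the discrepancy the earlier note about SPSS refers to.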

1.9 Five-Number Summaries and Boxplots

The three quartiles are used as part of the Five-Number Summary needed to draw a
box-and-whisker plot (also called a boxplot). A five-number summary for a data set consists of the
minimum value, Q1, Q2, Q3, and the maximum value. If there are no outliers in the data, the
boxplots drawn by SPSS use this five-number summary. However, if there are outliers, SPSS uses
a slightly different method for drawing the boxplot.
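
A five-number summary can be computed as in the sketch below, shown for the blood-pressure data of the previous section. The function name `five_number_summary` is hypothetical, and the quartiles use the `"exclusive"` interpolation of `statistics.quantiles`, assumed here to match the SPSS quartiles quoted earlier.

```python
import statistics

def five_number_summary(data):
    """Return (minimum, Q1, Q2, Q3, maximum) for a list of numbers."""
    q1, q2, q3 = statistics.quantiles(data, n=4, method="exclusive")
    return min(data), q1, q2, q3, max(data)

# Systolic blood pressures (mmHg) of the 20 men.
pressures = [150, 141, 90, 108, 158, 119, 156, 114, 95, 97,
             145, 167, 144, 171, 132, 97, 163, 111, 186, 98]

summary = five_number_summary(pressures)
print(summary)
```

In a boxplot of data without outliers, the box spans Q1 to Q3 with a line at Q2, and the whiskers extend to the minimum and maximum.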

For Practice. Draw a boxplot for the systolic blood pressures of the 20 men in the previous example.

Solution: The boxplot for the systolic blood pressures of the 20 men is

1.10 Distributions

The distribution underlying a data set provides probabilistic information about the data. From
the distribution, one can compute the probability of different outcomes occurring.

Boxplots can be viewed as very coarse illustrations of the distribution associated with a data
set. These illustrations provide some useful information about the data:

1. If the median is near the center of the box and each whisker (horizontal line) is approximately
the same length, then the distribution is roughly symmetric.

2. If the distance between the minimum value and the median is smaller than the distance from
the median to the maximum value, the distribution is skewed right.

3. If the distance between the minimum value and the median is greater than the distance from
the median to the maximum value, the distribution is skewed left.

4. If one wanted to compare the underlying distributions of two different data sets, one would
create a boxplot for both data sets and plot them one on top of the other on the same
horizontal scale.

Now Your Turn: BFAHS p. 61 q. 29 except do not draw the histogram, frequency polygon,
and stem-and-leaf plot.

Thilothammal et al. (A-19) designed a study to determine the efficacy of BCG (bacillus Calmette-
Guerin) vaccine in preventing tuberculous meningitis. Among the data collected on each study
was a measure of nutritional status. The data set in the textbook is for 107 cases. Using SPSS
to assist you, complete all parts of Question 29 except do not draw the histogram, frequency
polygon, and stem-and-leaf plot.

1.10.1 Skewness

In the previous discussion, we introduced the idea of the skewness of a distribution. Distributions
may be symmetric (the left half of the graph is the mirror image of the right half) or asymmetric
(the left half of the graph is not the mirror image of the right half). In the case where a
distribution is not symmetric, we say the distribution is skewed.

We say that a distribution is skewed to the left (or negatively skewed) if the tail of the distribution’s
left side is much longer than the tail of the right side (cf. the following figure).

We say that a distribution is skewed to the right (or positively skewed) if the tail of the
distribution’s right side is much longer than the tail on the left side (cf. the following figure).

The skewness of a distribution can be summarized quantitatively using

$$\text{skewness} = \frac{\sqrt{n}\,\sum_{i=1}^{n} (x_i - \bar{x})^3}{(n-1)^{3/2}\, s^3}.$$

If the value computed is positive, then the distribution is skewed to the right. If the value
computed is negative, then the distribution is skewed to the left. If the value is zero, then the
distribution is symmetric.
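
The skewness formula can be evaluated directly, as in the sketch below. The function name `sample_skewness` and the three small data sets are hypothetical and chosen only to show the sign of the result.

```python
import math
import statistics

def sample_skewness(data):
    """Skewness = sqrt(n) * sum((x - x_bar)^3) / ((n - 1)^(3/2) * s^3)."""
    n = len(data)
    x_bar = statistics.mean(data)
    s = statistics.stdev(data)  # sample standard deviation (divisor n - 1)
    cubed_deviations = sum((x - x_bar) ** 3 for x in data)
    return math.sqrt(n) * cubed_deviations / ((n - 1) ** 1.5 * s ** 3)

# Hypothetical data sets chosen to show each sign.
symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 2, 2, 3, 10]          # one large value stretches the right tail
left_tailed = [-x for x in right_tailed]    # mirror image, so the sign flips

skew_symmetric = sample_skewness(symmetric)
skew_right = sample_skewness(right_tailed)
skew_left = sample_skewness(left_tailed)
print(skew_symmetric, skew_right, skew_left)
```

Because the deviations are cubed, a few values far from the mean dominate the sum, so the statistic is sensitive to outliers.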

For Practice. Using the boxplot generated for the nutritional status of children example, describe
the shape of the distribution of the nutritional statuses of children.

Solution: The above box plot illustrates that there are three outliers. If the three outliers were
ignored, then the lower whisker (representing the distance between Q1 and the minimum) is about
the same size as the upper whisker (representing the distance between Q3 and the upper fence)
and the distance between Q1 and Q2 is approximately equal to the distance between Q2 and Q3 .
This implies the data is distributed symmetrically.

If the outliers were not ignored, then the distance from the minimum value to Q2 is less than
the distance between Q2 and the maximum value. This implies that the data is not distributed
symmetrically, but is rather skewed to the right (positively skewed).

1.11 Hypothesis Testing

A point estimate of a parameter is the value of a statistic that estimates the value of the
parameter. The sample mean x̄ is a point estimate of the population mean μ. The sample
standard deviation s is a point estimate of the population standard deviation σ.

A hypothesis is a statement or claim regarding a characteristic of one or more populations. The
steps in Hypothesis Testing are:

1. A claim is made

2. Evidence (sample data) is collected in order to test the claim.

3. The data are analyzed in order to support or refute the claim.

To understand the rationale behind Hypothesis Testing, we will turn to our legal system. Consider
a court case where the defendant is charged with murder. The only person who truly knows
whether the defendant is innocent is the defendant. From the jury’s perspective, the defendant’s
innocence will never be known with absolute certainty. Two hypotheses are put forth to the jury:

(1) The null hypothesis: H0 : the defendant is NOT guilty

(2) The alternative hypothesis: Ha : the defendant is guilty.

In our legal system one is innocent until proven guilty. Therefore Ha (the alternative hypothesis)
is always what we are trying to “prove”.

In Statistics, we do not use a jury to choose between H0 and Ha; we perform a hypothesis test
to determine whether there is enough support to conclude the alternative hypothesis. We either
do NOT reject H0 or we reject H0.

If we do NOT reject H0, IT DOES NOT MEAN THAT WE BELIEVE H0 IS TRUE! Not
rejecting H0 really means that you do not have enough evidence to conclude that H0 is false. In
terms of a court case, we do not have enough evidence to prove guilt beyond a reasonable doubt.
If a jury returns a verdict of NOT GUILTY, it does NOT imply the defendant was innocent.

If we reject H0 , then we are saying that there is enough evidence to conclude that the defendant
was guilty. We are NOT saying the defendant is guilty. Sometimes we can reject H0 when H0
was actually true. This is equivalent to sending an innocent man to jail. This is quite serious.
We call rejecting H0 when H0 was actually true a Type I Error. We denote the probability of a
Type I Error occurring by the symbol α.

P(Type I Error)=P(Reject H0 |H0 is true)=α.

α is also referred to as the significance level of a hypothesis test. Since Type I errors are very
serious, we choose a small value for α when implementing a hypothesis test.

NOTE: We always choose α BEFORE we begin the test.

The second type of error that can occur is a Type II Error. Basically, a Type II Error occurs when
you do NOT reject H0 when it is false (you let a guilty man go free). We use the symbol β to
represent the probability of a Type II Error occurring, P(Type II Error)=P(do NOT reject H0 |H0
is false)=β.

β is important because 1 − β is the power of a hypothesis test.

NOTE: For a fixed sample size, as we make α smaller, we make β larger.
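The trade-off between α and β can be seen numerically. The sketch below is only an illustration under stated assumptions: it uses a right-tailed z-test with a known population standard deviation, and the effect size, σ and n are hypothetical values chosen for the example, not values from this course.

```python
from statistics import NormalDist


def type_ii_error(alpha, effect, sigma, n):
    """Beta for a right-tailed z-test of H0: mu = mu0 versus Ha: mu > mu0.

    effect is the assumed true shift (mu - mu0); sigma is assumed known.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)   # reject H0 when z > z_crit
    shift = effect * n ** 0.5 / sigma        # mean of z when Ha is true
    return std_normal.cdf(z_crit - shift)    # P(do NOT reject H0 | H0 is false)


# Hypothetical numbers, for illustration only.
betas = {alpha: type_ii_error(alpha, effect=5.0, sigma=20.0, n=30)
         for alpha in (0.10, 0.05, 0.01)}

for alpha, beta in betas.items():
    print(f"alpha = {alpha:.2f}   beta = {beta:.3f}   power = {1 - beta:.3f}")
```

With the sample size held fixed, the printed β grows as α shrinks, so the power 1 − β falls.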

In order to implement a hypothesis test, we need to calculate the value of a test statistic (a
numerical summary of a set of data that reduces the data set to a single value) and compare
the value of the test statistic to a critical value from a corresponding table of values OR we can
use the test statistic to compute the associated p-value for the problem and then compare the
p-value to the significance level.

The p-value is the probability, computed assuming the null hypothesis is true, of observing a
value of the test statistic at least as extreme as the value that was actually observed.

If the p-value < α, then we reject H0 . If the p-value ≥ α, then we do not reject H0 .

There are three ways to set up the null and alternative hypotheses and calculate the associated
p-value:

1. H0 : parameter = some value; Ha : parameter ≠ some value (this is a two-tailed test).

Then p-value=2Prob(X > ts) when ts lies in the upper tail, or =2Prob(X < ts) when ts lies in the
lower tail, where X is the random variable representing the test statistic and ts is the calculated
value of the test statistic.

2. H0 : parameter = some value; Ha : parameter < some value (this is a left-tailed test).

Then p-value=Prob(X < ts), where X is the random variable representing the test statistic and
ts is the calculated value of the test statistic.

3. H0 : parameter = some value; Ha : parameter > some value (this is a right-tailed test).

Then p-value=Prob(X > ts), where X is the random variable representing the test statistic and
ts is the calculated value of the test statistic.
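The three p-value rules can be sketched in a few lines of Python. This is a minimal illustration that assumes the test statistic follows the standard normal distribution under H0 (as in a z-test); the function name `p_value` and the tail labels are hypothetical, and for a t-test the standard normal would be replaced by the appropriate Student's t-distribution.

```python
from statistics import NormalDist


def p_value(ts, tail):
    """p-value for a test statistic ts that is standard normal under H0."""
    z = NormalDist()
    if tail == "left":        # Ha: parameter < some value
        return z.cdf(ts)
    if tail == "right":       # Ha: parameter > some value
        return 1 - z.cdf(ts)
    if tail == "two":         # Ha: parameter differs from some value
        return 2 * (1 - z.cdf(abs(ts)))
    raise ValueError("tail must be 'left', 'right' or 'two'")


for tail in ("left", "right", "two"):
    print(tail, round(p_value(1.5, tail), 4))
```

Using the absolute value in the two-tailed case relies on the distribution being symmetric about zero.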

There is a general framework that we can use to implement a hypothesis test.

1. Research Question

2. Population declarations

3. Hypothesis to be tested

4. Hypothesis Test to be used

5. Assumptions required to implement the hypothesis test

6. The Significance Level

7. The Test Statistic and corresponding p-value

8. The Decision Rule

9. The Conclusion

We will use this framework for each hypothesis test that we conduct throughout the rest of the
course.

Suppose a sample has been drawn from a single population and you want to extrapolate (i.e.
infer) properties about the population from this sample. If one wants to infer something regarding
a continuous variable, then one might attempt to use a t-test for a single mean or a confidence
interval for a true mean. If one wants to infer something regarding a count variable, then one might
attempt to use a z-test for a single proportion or a confidence interval for a true proportion.

1.11.1 Independent Random Samples from one Population: Continuous data

t-test for a single mean

Suppose we wish to know the population mean but it is not feasible to determine its exact value.
We use X̄ to estimate μ because

1. X̄ is an unbiased estimator of μ, that is, the expected value of X̄ is μ, the parameter we
are trying to estimate. Note an unbiased estimator of a parameter is an estimator that does
not systematically overestimate or underestimate the value of the parameter it estimates.

2. X̄ is a consistent estimator of μ, that is, the larger the sample used, the closer the value of
the sample mean gets to the value of the population mean.

3. X̄ is an efficient estimator of μ, that is, in repeated samples, the majority of the sample
means will be “close” to the value of the population mean.

To make inferences about μ using a sample’s mean (when we do not know the population standard
deviation σ), we need to know if the following two conditions are both satisfied:

1. a simple random sample is obtained and

2. the population from which the sample is drawn is normally distributed OR the sample size
is greater than 29.

If the above two conditions are both satisfied, then we can use the t-test statistic

$$t(df) = \frac{\bar{x} - \mu}{s/\sqrt{n}}$$

with df=n-1 degrees of freedom and the critical values tα(df) (or tα/2(df)) to implement a “one
sample t-test” and make inferences about the true mean of the population of interest.

Note that the t-test statistic t(df ) defined above will follow a Student’s t-distribution with df=n-1
and we determine the critical t-values using the Student’s t-distribution table.

We refer to a hypothesis test that uses this test statistic t(df) with df=n-1 and the Student’s
t-distribution to determine the critical values as a t-test for a single mean.

Note that in the following examples, the parts of the SPSS output table that are highlighted in yellow
contain the information we need. We ignore the rest of the output table.

For Practice. Each of 15 hypertension patients was administered several drugs on different
occasions. The results of concern are for a placebo drug compared with Inderal. Each patient
first took the placebo for one month. After the month, their systolic blood pressures were
recorded. They then stopped taking the placebo and started taking 120 mg of Inderal for one
month. After the month, their blood pressures were recorded. The data presented in the following
table are the systolic blood pressures measured.

Patient Placebo Inderal
1 175 176
2 199 181
3 180 146
4 180 140
5 164 127
6 174 139
7 195 129
8 204 133
9 205 194
10 180 169
11 195 186
12 161 158
13 164 141
14 190 150
15 178 164

For every question based on this scenario, you may assume that it is known that the systolic
blood pressures (of both the placebo and treatment group) are normally distributed and that the
patients selected were randomly chosen.

At the 5% level of significance, test whether the true average systolic blood pressure of those
people who took inderal is 160.

Solution: Note: at this point in the course, we are not going to worry about testing the assumptions
necessary for the results of this test to be valid. This will not always be the case.

The Research Question: Does the systolic blood pressure of those people who take inderal differ from
160?

Population Declarations:

Let Population 1 be the group of people who have hypertension. Then define μ to be the true
average systolic blood pressure of people in Population 1 after they take inderal.

Hypothesis to be tested:

H0 : μ = 160 (i.e. the true mean systolic blood pressure of those people who take inderal is
equal to 160 mmHg)

Ha : μ ≠ 160 (i.e. the true mean systolic blood pressure of those people who take inderal is not
equal to 160 mmHg)

Hypothesis Test to be used: A t-test for a single mean

Assumptions required to implement the hypothesis test:

1. We assume that a simple random sample was obtained.

2. We assume that the systolic blood pressures of Population 1 after taking inderal are normally
distributed.

The Significance Level: α = 0.05

The Test Statistic and corresponding p-value:

Based on the One-Sample Test table above, the test statistic is t = -0.796 with df =14 and the
corresponding p-value = 0.439. (Note SPSS indicates that this value is for a two-tailed test).

The Decision Rule: Since p-value = 0.439 > 0.05=α, we do not reject H0 .

The Conclusion: At the 5% level of significance, there is not enough evidence to conclude that
the true mean systolic blood pressure of people who take inderal differs from 160 mmHg (p-value
= 0.439). At the same level of significance, there is no evidence to reject the assumption that
the true mean systolic blood pressure of people who take inderal is 160 mmHg.
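The SPSS figures quoted in this solution can be checked by hand. The following Python sketch (standard library only) recomputes the test statistic from the Inderal column of the data table and approximates the two-tailed p-value by integrating the Student's t density numerically with Simpson's rule. The function names are illustrative and are not part of SPSS; a statistics package or a t-table would normally supply the p-value.

```python
import math
from statistics import mean, stdev

# Systolic blood pressures after one month on Inderal (from the table above).
inderal = [176, 181, 146, 140, 127, 139, 129, 133, 194, 169, 186, 158, 141, 150, 164]


def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def t_upper_tail(t, df, steps=2000):
    """Approximate P(T > |t|) using Simpson's rule on [0, |t|] (steps is even)."""
    b = abs(t)
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3


def one_sample_t(data, mu0):
    """Return (t, df, two-tailed p-value) for H0: mu = mu0."""
    n = len(data)
    t = (mean(data) - mu0) / (stdev(data) / math.sqrt(n))
    df = n - 1
    return t, df, 2 * t_upper_tail(t, df)


t_stat, df, p_two_tailed = one_sample_t(inderal, 160)
print(f"t = {t_stat:.3f}, df = {df}, two-tailed p-value = {p_two_tailed:.3f}")
```

The result should agree with the values read from the One-Sample Test table (t = -0.796, df = 14, p-value = 0.439) up to rounding.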

For Discussion. Suppose we wanted to know whether the true mean systolic pressure was less
than 160 mmHg. How would we modify the above solution?

Solution: Note: at this point in the course, we are not going to worry about testing the assumptions
necessary for the results of this test to be valid. This will not always be the case.
The Research Question: Is the systolic blood pressure of those people who take inderal less than 160?
Population Declarations:
Let Population 1 be the group of people who have hypertension. Then define μ to be the true
average systolic blood pressure of people in Population 1 after they take inderal.
Hypothesis to be tested:
H0 : μ = 160 (i.e. the true mean systolic blood pressure of those people who take inderal is
equal to 160 mmHg)
Ha : μ < 160 (i.e. the true mean systolic blood pressure of those people who take inderal is less
than 160 mmHg)
Hypothesis Test to be used: A t-test for a single mean
Assumptions required to implement the hypothesis test:
1. We assume that a simple random sample was obtained.
2. We assume that the systolic blood pressures of Population 1 after taking inderal are normally
distributed.
The Significance Level: α = 0.05
The Test Statistic and corresponding p-value:

Based on the One-Sample Test table above, the test statistic is t = -0.796 with df =14 and the
corresponding p-value = 0.439/2 (why?). (Note SPSS indicates that this value is for a two-tailed
test. There is no way to tell SPSS that you are doing a left-tailed test.)
The Decision Rule: Since p-value = 0.439/2=0.2195 > 0.05=α, we do not reject H0 .
The Conclusion: At the 5% level of significance, there is not enough evidence to conclude that
the true mean systolic blood pressure of people who take inderal is less than 160 mmHg (p-value
= 0.2195). At the same level of significance, there is no evidence to reject the assumption that
the true mean systolic blood pressure of people who take inderal is 160 mmHg.

For Discussion. Suppose we wanted to know whether the true mean systolic pressure was
greater than 160 mmHg. How would we modify the above solution?

Solution: Note: at this point in the course, we are not going to worry about testing the assumptions
necessary for the results of this test to be valid. This will not always be the case.
The Research Question: Is the systolic blood pressure of those people who take inderal greater than
160?
Population Declarations:
Let Population 1 be the group of people who have hypertension. Then define μ to be the true
average systolic blood pressure of people in Population 1 after they take inderal.
Hypothesis to be tested:
H0 : μ = 160 (i.e. the true mean systolic blood pressure of those people who take inderal is
equal to 160 mmHg)
Ha : μ > 160 (i.e. the true mean systolic blood pressure of those people who take inderal is
greater than 160 mmHg)
Hypothesis Test to be used: A t-test for a single mean
Assumptions required to implement the hypothesis test:
1. We assume that a simple random sample was obtained.
2. We assume that the systolic blood pressures of Population 1 after taking inderal are normally
distributed.
The Significance Level: α = 0.05
The Test Statistic and corresponding p-value:

Based on the One-Sample Test table above, the test statistic is t = -0.796 with df =14 and the
corresponding p-value = 1-0.439/2=0.7805 (why?). (Note SPSS indicates that this value is for
a two-tailed test. There is no way to tell SPSS that you are doing a right-tailed test.)
The Decision Rule: Since p-value = 1-0.439/2=0.7805 > 0.05=α, we do not reject H0 .
The Conclusion: At the 5% level of significance, there is not enough evidence to conclude that the
true mean systolic blood pressure of people who take inderal is greater than 160 mmHg (p-value
= 0.7805). At the same level of significance, there is no evidence to reject the assumption that
the true mean systolic blood pressure of people who take inderal is 160 mmHg.
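The conversion from the two-tailed p-value reported by SPSS to a one-tailed p-value depends on the sign of the test statistic. The following sketch captures that rule; it assumes the test statistic has a distribution that is symmetric about zero (as the t-distribution is), and the function name and the labels "less" and "greater" are hypothetical.

```python
def one_tailed_p(two_tailed_p, t, alternative):
    """Convert a two-tailed p-value into a one-tailed p-value.

    alternative is "less" for Ha: parameter < value (left-tailed test)
    or "greater" for Ha: parameter > value (right-tailed test).
    """
    half = two_tailed_p / 2
    if alternative == "less":
        # The left tail is the small tail only when t is negative.
        return half if t < 0 else 1 - half
    if alternative == "greater":
        # The right tail is the small tail only when t is positive.
        return half if t > 0 else 1 - half
    raise ValueError("alternative must be 'less' or 'greater'")


print(one_tailed_p(0.439, -0.796, "less"))      # left-tailed test
print(one_tailed_p(0.439, -0.796, "greater"))   # right-tailed test
```

For the Inderal example, t is negative, so the left-tailed p-value is half of 0.439 and the right-tailed p-value is one minus that half.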

Confidence interval for the true mean

Recognize that a statistic is an estimate of a parameter but, due to the random nature of the sample,
the statistic is very unlikely to be exactly the true value of the parameter. Suppose you wanted
to capture an interval of plausible values for the parameter. One can do this via a 100 ∙ (1 − α)%
confidence interval.

For Discussion. Recall 1 − α is the confidence one has in a statistical test. A 100 ∙ (1 − α)%
confidence interval is NOT an interval that contains the parameter with probability 1 − α. Why?

For Discussion. What is a 100 ∙ (1 − α)% confidence interval? How do you explain what it
means to a lay person?

If it is known that the data is normally distributed but the population standard deviation σ is
unknown, then a (1 − α) ∙ 100% confidence interval for μ is computed using

$$\left( \bar{x} - t_{\alpha/2}(df)\,\frac{s}{\sqrt{n}},\ \ \bar{x} + t_{\alpha/2}(df)\,\frac{s}{\sqrt{n}} \right)$$

where tα/2(df) is referred to as a critical t-value determined from a Student’s t-distribution with
df = n − 1 degrees of freedom. The quantity

$$E = t_{\alpha/2}(df)\,\frac{s}{\sqrt{n}}$$

is referred to as the (1 − α) ∙ 100% error margin. The quantity s/√n is referred to as the standard
error of the sample mean x̄.

For Practice. For the hypertension study, determine a 99% confidence interval for the true
mean systolic blood pressure of the placebo group.

Solution: Using SPSS, the following output was generated.

Then the corresponding 99% confidence interval for the true mean systolic blood pressure of the
placebo group is 171.8312 to 194.0355.
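The interval reported by SPSS can be reproduced from the confidence interval formula. The sketch below is an illustration using only the Python standard library; because the standard library has no t-distribution, the critical value t0.005(14) ≈ 2.977 is taken from a Student's t-table and passed in by hand (an assumption of this sketch, as is the function name).

```python
import math
from statistics import mean, stdev

# Systolic blood pressures after one month on the placebo (from the data table).
placebo = [175, 199, 180, 180, 164, 174, 195, 204, 205, 180, 195, 161, 164, 190, 178]


def t_confidence_interval(data, t_crit):
    """Confidence interval x_bar +/- t_crit * s / sqrt(n) for the true mean."""
    n = len(data)
    x_bar = mean(data)
    margin = t_crit * stdev(data) / math.sqrt(n)   # the error margin E
    return x_bar - margin, x_bar + margin


T_CRIT_99_DF14 = 2.977   # t_{0.005}(14), read from a t-table
lower, upper = t_confidence_interval(placebo, T_CRIT_99_DF14)
print(f"99% CI for the true mean: ({lower:.2f}, {upper:.2f})")
```

Small differences from the SPSS output in the later decimal places come from rounding the critical value.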

Now Your Turn: Regarding the nutritional status of children example:

(1) A group that opposes mandatory vaccination has suggested that one of the side effects of
the BCG vaccination is the underdevelopment (physically and cognitively) of the children who
have received the BCG vaccine. Suppose it is known that the nutritional status measure is a
good indicator of a child’s physical and cognitive development. If the average nutritional status
measure of those children who did not receive the BCG vaccine is 85.2, test the group’s claim at
the α=0.05 level of significance.

(2) Create a 90% confidence interval for the true mean nutritional status measure of children
who receive the BCG vaccination.

Solution: (1) Research Question: Is the average nutritional status of BCG vaccinated children
lower than the average nutritional status of non-BCG vaccinated children?

Population Declarations:

The population of interest is the group of children eligible for the BCG vaccination. Define μ to be
the true mean nutritional status measure of this population after they have the BCG vaccination.

Hypothesis to be tested:

H0 : μ = 85.2 HA : μ < 85.2

Hypothesis Test to be used: T-test for a single mean

Assumptions:

o A simple random sample

o The nutritional statuses of the population from which the sample is drawn are normally
distributed OR the sample size is greater than 29.

We were told to assume that all assumptions are satisfied.

The Significance Level: α= 0.05

The Test Statistic and corresponding p-value:

Based on the SPSS output above, the p-value < 0.001/2 = 0.0005 (because it is a 1-tailed test).

The Decision: Since the p-value < 0.0005 < 0.05, we reject H0 .

The Conclusion: At the α= 0.05 level of significance (with p-value < 0.0005), we have evidence to
conclude that the average nutritional status of BCG vaccinated children is lower than 85.2, the
average nutritional status of non-BCG vaccinated children.

Solution: (2) According to the below SPSS output, a 90% confidence interval for the true average
nutritional status measure is (72.18, 77.28).

Using a confidence interval to test a hypothesis

Once one has computed an appropriate (1 − α) ∙ 100% confidence interval, we can use it to test
a hypothesis about the true mean (in the situation in which the data is normally distributed and
the sample standard deviation is known). How?

If we wish to implement a two-tailed hypothesis test at the significance level α, we first compute
a (1 − α) ∙ 100% confidence interval. If the hypothesized mean value lies in the interval, we do
not reject H0 ; otherwise we reject H0 .

If we wish to implement a one-tailed hypothesis test at the α level of significance, we must
compute a (1 − 2α) ∙ 100% confidence interval. Then, to perform a right-tailed test, if the
hypothesized mean value is less than the lower boundary of the (1 − 2α) ∙ 100% confidence
interval, we reject H0 ; otherwise we do not reject H0 . To perform a left-tailed test, if the
hypothesized mean value is greater than the upper boundary of the (1 − 2α) ∙ 100% confidence
interval, we reject H0 ; otherwise we do not reject H0 .

For example, suppose we read in a journal that a 90% confidence interval for the true average is
(150,170).

(1) Suppose we need to test if the true average is 160 against the alternative it is not (using a 10%
level of significance). This is a two-tailed test. We can use a 90% CI (Since 100%-90%=10%,
our level of significance) to directly test our hypotheses. The reject region associated with this
test will then be best described by any value greater than 170 or any value less than 150. The
do not reject region would be characterized by values between 150 and 170. In other words if
our hypothesized average of 160 is greater than 170 or less than 150, we would reject our null
hypothesis and have evidence to conclude the alternative hypothesis. For the value I provided
(160), 160 lies between 150 and 170. Hence at the 10% level of significance, we would not reject
our null hypothesis that the true mean is 160.

(2) Suppose we need to test if the true average is 160 against the alternative it is actually greater
than 160 (using a 5% level of significance). This is a right-tailed test. We can use a 90% CI to
directly test our hypotheses (since 100%-90%=10%, half of this value, i.e. 5%, our level of significance,
is the area of the reject region in the right tail and the other half is the area of the reject region
in the left tail). The reject region associated with this test will
then be best described by any value less than 150. Notice the reject region is in the opposite tail
than what we might have expected. The do not reject region would be characterized by values
greater than 150. In other words if our hypothesized average of 160 is less than 150, we would
reject our null hypothesis and have evidence to conclude the alternative hypothesis. For the value
I provided (160), 160 is greater than 150. Hence at the 5% level of significance, we would not
reject our null hypothesis that the true mean is 160.

(3) Suppose we need to test if the true average is 160 against the alternative it is actually less
than 160 (using a 5% level of significance). This is a left-tailed test. We can use a 90% CI to
directly test our hypotheses (since 100%-90%=10%, half of this value, i.e. 5%, our level of significance,
is the area of the reject region in the right tail and the other half is the area of the reject region
in the left tail). The reject region associated with this test will then
be best described by any value greater than 170. Notice the reject region is in the opposite tail
than what we might have expected. The do not reject region would be characterized by values
less than 170. In other words if our hypothesized average of 160 is greater than 170, we would
reject our null hypothesis and have evidence to conclude the alternative hypothesis. For the value
I provided (160), 160 is less than 170. Hence at the 5% level of significance, we would not reject
our null hypothesis that the true mean is 160.

This process can be used to implement a hypothesis test using a confidence interval, regardless
of the confidence interval and its associated parameter.
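The decision rules for testing a hypothesis with a confidence interval can be written as a small function. This is only a sketch with hypothetical names. It assumes the caller supplies the right interval for the test: a (1 − α) ∙ 100% interval for a two-tailed test at level α, and a (1 − 2α) ∙ 100% interval for a one-tailed test at level α.

```python
def reject_h0(ci, hypothesized, tail):
    """Decide whether to reject H0: parameter = hypothesized using a CI.

    ci is a (lower, upper) pair; tail is "two", "right" or "left".
    """
    lower, upper = ci
    if tail == "two":
        return hypothesized < lower or hypothesized > upper
    if tail == "right":    # Ha: parameter > hypothesized
        return hypothesized < lower
    if tail == "left":     # Ha: parameter < hypothesized
        return hypothesized > upper
    raise ValueError("tail must be 'two', 'right' or 'left'")


journal_ci = (150, 170)   # the 90% confidence interval from the example above
for tail in ("two", "right", "left"):
    print(tail, "reject H0" if reject_h0(journal_ci, 160, tail) else "do not reject H0")
```

For the hypothesized value 160, all three tests give "do not reject H0", matching cases (1) to (3) above.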

Confidence Interval Plots

We can depict the information captured in a confidence interval via a confidence interval plot.
This plot is formed using the lower and upper boundaries of the confidence interval and the point
estimate for the parameter estimated via the confidence interval.

For Practice. Draw a 95% confidence interval plot for the true mean blood pressure of patients after
they have taken inderal.

Solution:

The circle represents the point estimate for the parameter of interest (in this case the true mean
systolic blood pressure of patients who take inderal). The lower and upper whiskers respectively
represent the lower and upper boundaries for the calculated confidence interval.

For Discussion. How could we use this plot to test the hypotheses:

(a) the true mean systolic blood pressure is 160 mmHg? 140 mmHg? 170 mmHg?

(b) the true mean systolic blood pressure is less than 160 mmHg? 140 mmHg? 170 mmHg?

(c) the true mean systolic blood pressure is higher than 160 mmHg? 140 mmHg? 170 mmHg?

Now Your Turn: The box plot and confidence interval plot below are both summarizing the
systolic blood pressures of patients who took inderal.

(a) Which plot would you use to see if the data is normally distributed? Why?

(b) From these plots, is there evidence the true mean blood pressure is equal to the true median
blood pressure?

1.11.2 Independent Large Random Samples from one Population: Count data

Theorem 1.11.1 (Point Estimate of a Population Proportion) Suppose a simple random
sample of size n is obtained from a population in which each individual either does or does not
have a certain characteristic. The best point estimate of π (the proportion of the
population with the desired characteristic), denoted π̂, is given by

$$\hat{\pi} = \frac{x}{n}$$

where x is the number of individuals in the sample with the specified characteristic.

For Practice. In a poll conducted May 7-10, 2000, by ABC News, a simple random sample of
1068 American adults was asked “Have you ever been shot at?”. Of the 1068 American adults
surveyed, 96 responded yes. Obtain a point estimate for the population proportion of American
adults who have been shot at.

Solution: An estimate for the true proportion of American adults who have been shot at is

π̂ = 96/1068 ≈ 0.0899.

Theorem 1.11.2 (Sampling distribution of π̂) For a simple random sample of size n such
that n ≤ 0.05N (where N is the population size), the sampling distribution of π̂ is approximately
normal with mean μπ̂ = π and standard deviation

$$\sigma_{\hat{\pi}} = \sqrt{\frac{\pi(1-\pi)}{n}},$$

provided nπ(1 − π) ≥ 10.

z-test for a single proportion

In order to test a hypothesis regarding whether a population proportion is some hypothesized
value π0 , we need to know if the following two conditions are both satisfied:

1. a simple random sample is obtained and

2. nπ0 (1 − π0 ) ≥ 10 and n ≤ 0.05N (where n is the sample size and N is the population
size).

If the above two conditions are both satisfied then we can modify the three z hypothesis tests
by replacing μ with π and μ0 with π0 in the hypotheses; and the test statistic

$$z = \frac{\bar{x} - \mu_0}{\sigma_0/\sqrt{n}}$$

with the new test statistic

$$z = \frac{\hat{\pi} - \pi_0}{\sqrt{\dfrac{\pi_0(1-\pi_0)}{n}}}$$

where π̂ = x/n and x is the number of individuals in the sample with the specified characteristic.
The other steps remain the same.

We refer to a hypothesis test that uses the test statistic

$$z = \frac{\hat{\pi} - \pi_0}{\sqrt{\dfrac{\pi_0(1-\pi_0)}{n}}}$$

and the Standard Normal Distribution to determine the critical values as “a large sample z-test
for a single proportion”.

For Practice. The drug Prevnar is a vaccine meant to prevent meningitis. It is typically
administered to infants. In clinical trials, the vaccine was given to 710 randomly sampled infants
between 12 and 15 months of age. Of the 710 infants, 121 experienced a decrease in appetite.
Is there significant evidence (at the 1% level of significance) to conclude that the proportion
of infants who receive Prevnar and experience a decrease in appetite is larger than 0.135, the
proportion of infants who experience a decrease in appetite because of competing medications?
Solution:

1. Research Question: Do more infants experience a decrease in appetite after being vaccinated
with Prevnar when compared to infants who received an existing vaccine?

2. Population declarations: The population being studied is the group of infants between 12 and
15 months of age. Let π be the true proportion of infants from this population who experience
a decrease in appetite after being vaccinated with Prevnar.

3. Hypothesis to be tested:

H0 : π = 0.135

HA : π > 0.135

4. Hypothesis Test to be used: A large sample z-test for a single proportion

5. Assumptions required to implement the hypothesis test:

(a) We are told the sample was randomly selected.

(b) While we don’t know the population size, we will assume 710 is less than 5% of the population
size. Note
(710)(0.135)(1 − 0.135) ≈ 82.9 > 10.
Hence we can use the large sample test.

6. The Significance Level: α = 0.01

7. The Test Statistic and corresponding p-value:


$$z = \frac{121/710 - 0.135}{\sqrt{\dfrac{0.135(1-0.135)}{710}}} \approx 2.76$$

and since we are completing a right-tailed test,

$$p\text{-value} = \Pr(Z > 2.76) = 0.0029.$$

8. The Decision Rule: Since p-value = 0.0029 < 0.01 = α, we reject H0 .

9. The Conclusion: At the 1% level of significance, we have evidence to conclude that the true
proportion of infants who experience a decrease in appetite after being vaccinated with Prevnar
is greater than 0.135.
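The arithmetic in steps 5 and 7 can be verified with a short Python sketch that uses only the standard library. The function name is illustrative; the numbers are those given in the Prevnar example.

```python
import math
from statistics import NormalDist


def one_proportion_z(x, n, pi0):
    """Large-sample z statistic for H0: pi = pi0, where x of n have the characteristic."""
    pi_hat = x / n
    standard_error = math.sqrt(pi0 * (1 - pi0) / n)
    return (pi_hat - pi0) / standard_error


x, n, pi0 = 121, 710, 0.135

# Large-sample condition from step 5: n * pi0 * (1 - pi0) >= 10.
large_sample_ok = n * pi0 * (1 - pi0) >= 10

z = one_proportion_z(x, n, pi0)
p_right = 1 - NormalDist().cdf(z)   # right-tailed test: Pr(Z > z)

print(f"condition met: {large_sample_ok}, z = {z:.2f}, p-value = {p_right:.4f}")
```

The p-value is computed from the unrounded z, so it can differ slightly in the last digit from a table lookup at z = 2.76.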

Now Your Turn: Suppose the current president heard the statistic regarding the proportion of
Americans who have been shot at and dismissed the evidence. He first claimed that the survey was
too small to be used to draw any country wide statement. He then claimed that the proportion
was an anomaly and in fact this statistic was so small that it is essentially zero. Discuss the
validity of the “logic” for both his claims.
Solution:

A confidence interval for a true proportion

Theorem 1.11.3 Suppose a simple random sample of size n is taken from a population. A
(1 − α) ∙ 100% confidence interval for π is given by

$$\left( \hat{\pi} - z_{\alpha/2}\sqrt{\frac{\hat{\pi}(1-\hat{\pi})}{n}},\ \ \hat{\pi} + z_{\alpha/2}\sqrt{\frac{\hat{\pi}(1-\hat{\pi})}{n}} \right)$$

where nπ̂(1 − π̂) ≥ 10 must be true.

For Practice. For the above poll, compute a 95% confidence interval for the proportion π of
the population that has been shot at.
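One way to carry out this computation is sketched below in Python (standard library only). The function name is hypothetical; the code applies the interval from Theorem 1.11.3 to the poll figures given earlier (96 of 1068 adults) and checks the large-sample condition before computing.

```python
import math
from statistics import NormalDist


def proportion_ci(x, n, confidence):
    """Large-sample confidence interval for a population proportion."""
    pi_hat = x / n
    if n * pi_hat * (1 - pi_hat) < 10:
        raise ValueError("large-sample condition n*pi_hat*(1 - pi_hat) >= 10 is not met")
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # z_{alpha/2}
    margin = z_crit * math.sqrt(pi_hat * (1 - pi_hat) / n)
    return pi_hat - margin, pi_hat + margin


lower, upper = proportion_ci(96, 1068, 0.95)
print(f"95% CI for the true proportion: ({lower:.4f}, {upper:.4f})")
```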

1.11.3 Independent Small Random Samples from one Population: Count data

The above are inference techniques for large samples but what happens if one needs to make
inferences for π when nπ̂(1 − π̂) < 10? As long as n ≥ 10, one can use the “Plus Four”
confidence interval to determine a (1 − α) ∙ 100% confidence interval for π.

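If the point estimate used in the “Plus Four” confidence interval is

$$\tilde{\pi} = \frac{x+2}{n+4}$$

and the standard error is

$$S.E. = \sqrt{\frac{\tilde{\pi}(1-\tilde{\pi})}{n+4}},$$

then the (1 − α) ∙ 100% confidence interval for π is

$$\left( \tilde{\pi} - z_{\alpha/2}\sqrt{\frac{\tilde{\pi}(1-\tilde{\pi})}{n+4}},\ \ \tilde{\pi} + z_{\alpha/2}\sqrt{\frac{\tilde{\pi}(1-\tilde{\pi})}{n+4}} \right) \tag{1.1}$$

and holds when n ≥ 10.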

If one wishes to implement a hypothesis test via a test statistic and critical value, then one will have
to use an appropriate non-parametric test.

For Practice. In a hospital emergency room, all staff (nurses and doctors) use the same central
computer. Of 97 staff members who were observed to have used the computer, only 7 washed
their hands after they finished using the computer. Determine a 95% confidence interval for the
true proportion of staff who wash their hands after using the computer.

Solution:
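One possible way to carry out the computation is sketched below in Python (standard library only). Here nπ̂(1 − π̂) = 97(7/97)(90/97) ≈ 6.5 < 10 and n = 97 ≥ 10, so the “Plus Four” interval (1.1) applies. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def plus_four_ci(x, n, confidence):
    """'Plus Four' confidence interval for a proportion (requires n >= 10)."""
    if n < 10:
        raise ValueError("the Plus Four interval requires n >= 10")
    pi_tilde = (x + 2) / (n + 4)                              # adjusted point estimate
    standard_error = math.sqrt(pi_tilde * (1 - pi_tilde) / (n + 4))
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # z_{alpha/2}
    margin = z_crit * standard_error
    return pi_tilde - margin, pi_tilde + margin


lower, upper = plus_four_ci(7, 97, 0.95)
print(f"95% Plus Four CI: ({lower:.4f}, {upper:.4f})")
```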

1.11.4 Random Samples from two Independent Populations: Continuous Variables

t-test for the difference in two independent means with unequal variances

Suppose X1 , ..., Xn1 is a random sample of size n1 taken from Population 1 whose mean is μ1 ;
Y1 , ..., Yn2 is a random sample of size n2 taken from Population 2 whose mean is μ2 ; and the
samples from the two populations are independent. In order to form a (1 − α) ∙ 100% confidence
interval for, and test a hypothesis regarding, μ1 − μ2 , we need to know the following are all
satisfied:

1. a simple random sample is obtained.

2. The populations from which both samples are drawn are normally distributed OR both sample
sizes are large (n1 ≥ 30 and n2 ≥ 30).

3. The two samples are independent.

There is actually no exact solution to the situation where σ1 and σ2 are unknown and σ1 ≠ σ2 .
We can use an approximate solution, that is we can use Welch’s approximate t-distribution, to
form a (1 − α) ∙ 100% confidence interval for, and test a hypothesis regarding, μ1 − μ2 , if we
know the following are satisfied:

1. a simple random sample is obtained.

2. The populations from which both samples are drawn are normally distributed.

3. The population standard deviations σ1 and σ2 are unknown and σ1 ≠ σ2 .

4. The samples are independent.

If the above four conditions are simultaneously true, then a (1 − α) ∙ 100% confidence interval
for μ1 − μ2 is given by

$$\left( \bar{x}_1 - \bar{x}_2 - t_{\alpha/2}(df)\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}},\ \ \bar{x}_1 - \bar{x}_2 + t_{\alpha/2}(df)\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \right)$$

where tα/2(df) is referred to as a critical t-value from a Student’s t-distribution with df=min(n1 , n2 )
degrees of freedom.

To implement a hypothesis test, the parameter is μ1 − μ2 ; the hypothesized difference between
the two population means is k; and the test statistic is

$$t = \frac{\bar{x}_1 - \bar{x}_2 - k}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$

with df=min(n1 , n2 ) degrees of freedom.

We will use the Student’s t-distribution to determine the critical values. The other steps remain
the same. We refer to a hypothesis test that uses the test statistic

$$t = \frac{\bar{x}_1 - \bar{x}_2 - k}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$

with df=min(n1 , n2 ) degrees of freedom and the Student’s t-Distribution to determine the critical
values as a t-test for the difference in two independent means with unequal variances.

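Note the degrees of freedom in the above discussion are very conservative. SPSS calculates the
degrees of freedom to be:

$$df = \frac{\left( \dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2} \right)^2}{\dfrac{\left( s_1^2/n_1 \right)^2}{n_1 - 1} + \dfrac{\left( s_2^2/n_2 \right)^2}{n_2 - 1}}.$$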
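The test statistic and the degrees of freedom used by SPSS can be computed directly from summary statistics. The sketch below is a minimal Python illustration with a hypothetical function name; the summary statistics in the usage lines are invented for the example and are not the cadmium data.

```python
import math


def welch_t_and_df(mean1, s1, n1, mean2, s2, n2, k=0.0):
    """Test statistic and approximate degrees of freedom for the
    t-test for two independent means with unequal variances."""
    v1 = s1 ** 2 / n1          # s1^2 / n1
    v2 = s2 ** 2 / n2          # s2^2 / n2
    t = (mean1 - mean2 - k) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Hypothetical summary statistics, for illustration only.
t_stat, df = welch_t_and_df(mean1=10.0, s1=2.0, n1=10, mean2=12.0, s2=2.0, n2=10)
print(f"t = {t_stat:.3f}, df = {df:.3f}")
```

With equal sample sizes and equal sample standard deviations, as in this toy example, the formula gives df = n1 + n2 − 2 = 18; in general the result is not a whole number, which is why SPSS reports values such as df = 26.671.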

For Practice. BFAHS p185, q. 6.4.10: In a study of factors thought to be responsible for the
adverse effects of smoking on human reproduction, cadmium level determinations (nanograms
per gram) were made on placenta tissue of a random sample of 14 mothers who were smokers
and an independent random sample of 18 nonsmoking mothers. The data is summarized below:

1. At the α = 0.10 level of significance, test the claim that the mean cadmium level is higher
among smokers than nonsmokers.

2. Determine a 95% confidence interval for the true difference in the mean cadmium levels
between the two groups.

Solution: (1) Note: at this point in the course, we are not going to worry about testing the
assumptions necessary for the results of this test to be valid. This will not always be the case.

Research Question: Is the cadmium level higher in smoking mothers than in nonsmoking mothers?

Population Declarations:

Let population 1 be the group of all non-smoking mothers. Then define μ1 to be the true mean
cadmium level in population 1.

Let population 2 be the group of all smoking mothers. Then define μ2 to be the true mean
cadmium level in population 2.

Hypothesis to be tested:

H0 : μ1 = μ2 (i.e. the true mean cadmium level of population 1 is equal to the true mean
cadmium level of population 2)

HA : μ1 < μ2 (i.e. the true mean cadmium level of population 1 is less than the true mean
cadmium level of population 2)

Hypothesis Test to be used: t-test for the difference in two independent means with unequal
variances

Assumptions Required to Implement the Hypothesis Test:

o The cadmium levels in each population are normally distributed.

o Each of the two samples is a simple random sample.

o The two samples are independent.

The Significance Level: α = 0.10

The Test Statistic and corresponding p-value: Based on the following output from SPSS,

the value of the test statistic is t=-2.438 with df=26.671 degrees of freedom. The corresponding
p-value = 0.022/2 = 0.011 (since we are implementing a 1-tailed test).

The Decision Rule: Since the p-value = 0.011 < 0.10 = α, we reject H0 .

Conclusion: At the 10% level of significance (with equal variances not assumed) there is evidence
to conclude that the true mean cadmium level is lower in the group of mothers who do not smoke
than in the group of mothers who do smoke (p-value = 0.011).

(2)

Based on the Independent Samples Test table above, the estimated 95% confidence interval for
the true mean difference in cadmium levels between the two groups is -10.48540 to -0.89872.

Hypothesis testing via confidence intervals

One can use (1 − α) ∙ 100% confidence intervals to judge visually whether it is plausible that
two populations have the same mean. How? On the same graph, plot a (1 − α) ∙ 100% confidence
interval for each sample. If the sample mean of one group lies within the confidence interval
for the other group and vice versa, then it is plausible (at the α significance level) that the two
populations have the same mean. If the two confidence intervals do not overlap, then, at the α
level of significance, it is plausible that the two populations have different means.
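The overlap check described above can be sketched in a few lines. The helper names and the summary statistics below are hypothetical (they are not the cadmium data), and the critical t-values are assumed to be looked up in a t-table for df = n − 1.

```python
from math import sqrt


def mean_ci(xbar, s, n, t_crit):
    """Confidence interval for one mean: xbar -/+ t_crit * s / sqrt(n)."""
    half_width = t_crit * s / sqrt(n)
    return (xbar - half_width, xbar + half_width)


def intervals_overlap(ci_a, ci_b):
    """True when the two closed intervals share at least one point."""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]


# Hypothetical summary statistics, for illustration only.
# t_crit values are table look-ups for 95% confidence with df = n - 1.
ci_1 = mean_ci(xbar=20.0, s=4.0, n=14, t_crit=2.160)
ci_2 = mean_ci(xbar=26.0, s=5.0, n=18, t_crit=2.110)
print(ci_1, ci_2, intervals_overlap(ci_1, ci_2))
```

Plotting the two intervals side by side gives the same information graphically; the function only automates the comparison of the endpoints.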

For Practice. Based on the below 80% confidence intervals for the true mean cadmium levels
for the two groups, is there evidence to conclude the true mean cadmium levels for the two groups
differ?

Solution: The two confidence intervals plotted do not overlap. Consequently we would believe
that, at the α = 0.20 level of significance, the true mean cadmium levels for smokers and non-
smokers differ.

NOTA BENE: Boxplots are used to illustrate the distribution of the data. Confidence interval
plots do not illustrate this distribution. Confidence interval plots can be used to visualize whether
means are equal and whether it is reasonable that the variances of different populations are equal.

Now Your Turn: The total cholesterol levels (mg/dl) for 133 randomly selected hypertensive pa-
tients and 41 randomly selected normotensive patients were collected. The data for this problem
is available online at the textbook’s website (cf. Chapter 7, Section 3, Exercise 4). Assume that
the total cholesterol levels of both populations are normally distributed. From a 95% confidence
interval plot, would you conclude that the true mean cholesterol levels of hypertensive patients
equals the true mean cholesterol levels of normotensive patients?

t-test for the difference in two independent means with equal variances

When using a statistical package, one might also see the t-test "testing two means from independent
samples with equal variances". The (1 − α) ∙ 100% confidence interval computed is

$$\left(\bar{x}_1 - \bar{x}_2 - t_{\alpha/2}(df)\, s_{pooled}\sqrt{\frac{1}{n_1} + \frac{1}{n_2}},\;\; \bar{x}_1 - \bar{x}_2 + t_{\alpha/2}(df)\, s_{pooled}\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}\right)$$

and the test statistic is

$$t = \frac{\bar{x}_1 - \bar{x}_2 - k}{s_{pooled}\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$$

where tα/2 (df ) is referred to as a critical t-value from a Student’s t-distribution with df=n1 +n2 −2
degrees of freedom.
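As an illustration, the equal-variances test statistic and confidence interval can be computed from summary statistics. This is a sketch under one assumption: the pooled standard deviation is taken to be the usual weighted combination sqrt(((n1 − 1)s1² + (n2 − 1)s2²)/(n1 + n2 − 2)), which is not written out in this section. The function names and numbers are hypothetical, and the critical value `t_crit` is assumed to come from a t-table.

```python
from math import sqrt


def pooled_sd(s1, n1, s2, n2):
    """Pooled standard deviation (assumed standard definition)."""
    return sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2))


def pooled_t(xbar1, s1, n1, xbar2, s2, n2, k=0.0):
    """Return (t, df) for H0: mu1 - mu2 = k with equal variances assumed."""
    se = pooled_sd(s1, n1, s2, n2) * sqrt(1 / n1 + 1 / n2)
    return (xbar1 - xbar2 - k) / se, n1 + n2 - 2


def pooled_ci(xbar1, s1, n1, xbar2, s2, n2, t_crit):
    """Confidence interval for mu1 - mu2; t_crit is t_{alpha/2}(n1 + n2 - 2)."""
    half = t_crit * pooled_sd(s1, n1, s2, n2) * sqrt(1 / n1 + 1 / n2)
    diff = xbar1 - xbar2
    return (diff - half, diff + half)


# Hypothetical summary statistics, for illustration only.
t_stat, df = pooled_t(xbar1=12.0, s1=2.0, n1=8, xbar2=10.0, s2=2.0, n2=8)
print(t_stat, df)
```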

For Practice. Repeat the above cadmium level example (both the hypothesis test and confidence
interval using the level of significance in the original question) assuming both populations have
the same variance.

Solution: (1) Note: at this point in the course, we are not going to worry about testing the

assumptions necessary for the results of this test to be valid. This will not always be the case.

Research Question: Is the cadmium level higher in smoking mothers than in nonsmoking mothers?

Population Declarations:

Let population 1 be the group of all non-smoking mothers. Then define μ1 to be the true mean
cadmium level in population 1.

Let population 2 be the group of all smoking mothers. Then define μ2 to be the true mean
cadmium level in population 2.

Hypothesis to be tested:

H0 : μ1 = μ2 (i.e. the true mean cadmium level of population 1 is equal to the true mean
cadmium level of population 2)

HA : μ1 < μ2 (i.e. the true mean cadmium level of population 1 is less than the true mean
cadmium level of population 2)

Hypothesis Test to be used: t-test for the difference in two independent means with equal
variances

Assumptions Required to Implement the Hypothesis Test:

o The cadmium levels in each population are normally distributed.

o Each of the two samples is a simple random sample.

o The two populations are independent.

o The two populations have the same variance.

The Significance Level: α = 0.10

The Test Statistic and corresponding p-value: Based on the following output from SPSS,

the value of the test statistic is t=-2.468 with df=30 degrees of freedom. The corresponding p-value
= 0.020/2 = 0.010 (since we are implementing a 1-tailed test).

The Decision Rule: Since the p-value = 0.010 < 0.10 = α, we reject H0 .

Conclusion: At the 10% level of significance (with equal variances assumed) there is evidence
to conclude that the true mean cadmium level is lower in the group of mothers who do not smoke
than in the group of mothers who do smoke (p-value = 0.010).

(2)

Based on the Independent Samples Test table above, the estimated 95% confidence interval for
the true mean difference in cadmium levels between the two groups is -10.4025 to -0.9816.

Now Your Turn: Does texting while driving really slow one’s ability to react? A psychologist
measured the reaction time (in seconds) to stop when a hazard was suddenly placed in the driver’s
lane. A sample consisted of 18 randomly selected individuals who were not texting while driving
and 16 randomly selected individuals who were texting while driving. The results are summarized
in the following table:

Assume that the reaction times of both populations are normally distributed.

(a) If the variances of the two populations are not equal, at the α = 0.10 level of significance,
test the claim that a person’s stopping reaction time is increased if texting while driving.

Solution: Research Question:

Is the stopping reaction time of a driver increased if texting while driving?

Population Declarations:

Let population 1 be the group of all the drivers who do not text while driving and μ1 be the true
mean stopping reaction time associated with this group.

Let population 2 be the group of all the drivers who do text while driving and μ2 be the true
mean stopping reaction time associated with this group.

Hypothesis to be tested:

H0 : μ1 = μ2 (i.e. the true mean stopping reaction time of population 1 is equal to the true
mean stopping reaction time of population 2)

HA : μ1 < μ2 (i.e. the true mean stopping reaction time of population 1 is less than the true
mean stopping reaction time of population 2)

Hypothesis Test to be used: T-test for two independent samples with unequal variances

Assumptions Required to Implement the Hypothesis Test:

• The stopping reaction times of both populations must be normally distributed.

• Both samples must be simple random samples.

• Both populations must be independent of each other.

We are told to assume the above assumptions hold.

The Significance Level: α = 0.10

The test statistic and corresponding p-value: Based on the following output from SPSS,

the test statistic is T=-5.928 with an associated p-value < 0.001/2 = 0.0005 (since it is a 1-tailed
test).

The Decision Rule: Since p-value < 0.0005 < 0.10 = α, we reject H0 .

Conclusion: At the 10% level of significance (with a p-value < 0.0005) there is evidence to
conclude that the true mean stopping reaction time of drivers who were not texting while driving
is less than the true mean stopping reaction time of the drivers who were texting while driving.

(b) Assuming the variances of the two populations are not equal, construct an appropriate plot
(using the level of significance α=0.05) from which you could visually inspect whether it was
plausible that the two populations had the same mean. Referencing this plot, discuss why or why
not you would conclude the two populations have the same mean.

Solution:

Because the 95% confidence intervals in the above plot do not overlap, it is not likely that the
two populations have the same mean.

(c) If the variances of the two populations are not equal, determine a 99% confidence interval for
the true difference in the mean reaction times between the two groups.

Solution:

From the SPSS output above, a 99% confidence interval for the true mean difference μ1 − μ2
(assuming unequal variances) is (-2.82937,-1.03729).

(d) If the variances of the two populations are equal, at the α = 0.10 level of significance, test
the claim that a person’s stopping reaction time is increased if texting while driving.

Solution:

Research Question:

Is the reaction time of a driver increased if texting while driving?

Population Declarations:

Let population 1 be the group of all the drivers who do not text while driving and μ1 be the true
mean stopping reaction time associated with this group.

Let population 2 be the group of all the drivers who do text while driving and μ2 be the true
mean stopping reaction time associated with this group.

Hypothesis to be tested:

H0 : μ1 = μ2 (i.e. the true mean stopping reaction time of population 1 is equal to the true
mean stopping reaction time of population 2)

HA : μ1 < μ2 (i.e. the true mean stopping reaction time of population 1 is less than the true
mean stopping reaction time of population 2)

Hypothesis Test to be used: T-test for two independent samples for equal variances

Assumptions Required to Implement the Hypothesis Test:

• The stopping reaction times of both populations must be normally distributed.


• Both samples must be simple random samples.
• Both populations must be independent of each other.
• The variances of the stopping reaction times of both populations are equal.

We are told to assume the above assumptions hold.

The Significance Level: α = 0.10

The test statistic and corresponding p-value: Based on the following output from SPSS,

the test statistic is T=-5.968 with an associated p-value < 0.001/2 = 0.0005 (since it is a 1-tailed
test).

The Decision Rule: Since p-value < 0.0005 < 0.10 = α, we reject H0 .

Conclusion: At the 10% level of significance (with a p-value < 0.0005) there is evidence to
conclude that the true mean stopping reaction time of drivers who were not texting while driving
is less than the true mean stopping reaction time of the drivers who were texting while driving.

(e) If the variances of the two populations are equal, determine a 99% confidence interval for the
true difference in the mean stopping reaction times between the two groups.

Solution:

From the SPSS output above, a 99% confidence interval for the true mean difference μ1 − μ2
(assuming equal variances) is (-2.82048,-1.04619).

For Discussion. What are the pitfalls (if any) associated with using the assume equal variances
Confidence Interval/Test Statistic?

In order for the confidence interval/result of the hypothesis test to be valid, one must be able to
verify that the two populations indeed have the same variance. To verify this, one would have to
do a hypothesis test. Consequently the validity of the confidence interval/result of the hypothesis
test depends on the result of a hypothesis test, thus increasing the chance of making a Type I
error.

Levene’s test for the equality of variances

The F-distribution (named after R. Fisher) is a family of curves, each of which is completely
specified by two degrees of freedom ν and d. We use the symbol Fα (ν, d) to denote the
specific value of the F-distribution with ν and d degrees of freedom at the significance level α.

The F-distribution is skewed to the right, so it is not symmetric
about any particular value. We need to take this fact into consideration when we are working with
the F-distribution. The simplest way is to assign the populations in such a way that Population
1 has the larger sample variance. Then, given two samples from two independent normal Populations 1
and 2, we wish to test $H_0: \sigma_1^2 = \sigma_2^2$ against either $H_A: \sigma_1^2 \neq \sigma_2^2$ or $H_A: \sigma_1^2 > \sigma_2^2$.

The test statistic used to test the above hypotheses is defined as

$$F(\nu, d) = \frac{s_1^2}{s_2^2},$$

where the numerator and denominator are the sample variances from Population 1 and Population
2, respectively AND ν = n1 − 1 and d = n2 − 1. If you know both populations are normally
distributed, then this test statistic will follow an F-distribution with ν and d degrees of freedom.

Since most F-tables only provide values for right-tailed tests and due to the lack of symmetry
of the F-distribution, we have to be careful if we wish to implement a TWO-TAILED hypothesis
test. To implement a two-tailed F-Test, we have to implement both one-tailed alternatives at
half the significance level. Then if we reject H0 for either of these one-tailed alternatives, we
reject H0 for the two-tailed test; otherwise we conclude that we cannot reject H0 . Because in
practice there is really only one direction that can lead to a rejection of H0 , as in the one-tailed
version of the hypothesis test, we let Population 1 correspond to the population with the largest
sample variance.
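The labelling convention above (Population 1 is the one with the larger sample variance) can be sketched as follows. The function name and the two data sets are hypothetical; the statistic would still need to be compared against an F-table value at α (one-tailed) or α/2 (two-tailed).

```python
from statistics import variance


def variance_ratio_f(sample_a, sample_b):
    """Return (F, nu, d) with the larger sample variance in the numerator."""
    var_a, var_b = variance(sample_a), variance(sample_b)
    if var_a >= var_b:
        return var_a / var_b, len(sample_a) - 1, len(sample_b) - 1
    # Swap the roles so that "Population 1" has the larger sample variance.
    return var_b / var_a, len(sample_b) - 1, len(sample_a) - 1


# Hypothetical data, for illustration only.
group_a = [1, 2, 3, 4, 5]
group_b = [2, 4, 6, 8, 10, 12]
f_stat, nu, d = variance_ratio_f(group_a, group_b)
print(f_stat, nu, d)
```

Because the larger variance is always placed in the numerator, the statistic is at least 1 and only the right tail of the F-distribution is needed.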

For Discussion. What, if any, are the potential issues with using the above test for the equality
of variances?

Suppose that you do not know the data is normally distributed OR that you know the data
is not normally distributed OR that you need to know whether the variances from popula-
tions are equal. In these three scenarios, the test we just learned cannot be used. If you
know the data is sampled from a symmetric distribution then you can use Bartlett’s Test for
Equality of Variances (cf. http://www.itl.nist.gov/div898/handbook/eda/section3/eda357.htm).
Suppose that the assumptions required to use Bartlett’s Test are not true or are known to
be not true. Then, in this situation, you can use Levene’s Test for Equality of Variances

(cf. http://www.itl.nist.gov/div898/handbook/eda/section3/eda35a.htm). This is the test that
SPSS implements.
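To make the idea behind Levene's test concrete, here is a minimal sketch of the mean-centred version of the statistic as described on the NIST handbook page cited above: each observation is replaced by its absolute deviation from its group mean, and a one-way ANOVA-style F ratio is computed on those deviations. The function name and data are hypothetical, the sketch returns only the statistic (not a p-value), and it is not claimed to reproduce any package's output exactly.

```python
from statistics import mean


def levene_w(*groups):
    """Levene's statistic using absolute deviations from each group mean."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    # Z_ij = |Y_ij - mean of group i|
    z = [[abs(y - mean(g)) for y in g] for g in groups]
    z_bar_i = [mean(zi) for zi in z]                  # group means of Z
    z_bar = sum(sum(zi) for zi in z) / n_total        # overall mean of Z
    between = sum(len(zi) * (zb - z_bar) ** 2 for zi, zb in zip(z, z_bar_i))
    within = sum((v - zb) ** 2 for zi, zb in zip(z, z_bar_i) for v in zi)
    return (n_total - k) / (k - 1) * between / within


# Hypothetical data, for illustration only.
w = levene_w([1, 2, 3, 4], [2, 4, 6, 8])
print(w)
```

Two groups with the same spread but different locations give a statistic of zero, which reflects that the test looks only at variability and not at the means.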

For Practice. Use Levene’s Test for Equality of Variances to determine whether or not (at
the 10% level of significance) the variances between the smoking and non-smoking group in the
cadmium level study are equal.

Solution:

The Research Question: Is there a difference between the true variances of cadmium levels in
pregnant women who smoke and those who do not smoke?

Population Declarations: Let population 1 be the group of all smoking pregnant women. Let
population 2 be the group of all non-smoking pregnant women.

Let σ1 be the true standard deviation of the cadmium levels of population 1 and σ2 be the true
standard deviation of the cadmium levels of population 2.

Hypothesis to be tested:

H0 : $\sigma_1^2 = \sigma_2^2$ (there is no difference between the true variances of the cadmium levels of the
mothers who smoke and those who do not smoke)

HA : $\sigma_1^2 \neq \sigma_2^2$ (the true variance in the cadmium levels of the pregnant women who smoke differs
from the true variance in the cadmium levels of the pregnant women who do not smoke)

Hypothesis Test to be used: Levene’s Test for the Equality of Variances.

Assumptions Required to Implement The Hypothesis Test:

1. The data collected is a simple random sample.

2. There are at least three data points in each sample.

The Significance Level: α= 0.10

The Test Statistic and corresponding p-value:

From the Independent Samples Test table above, the value of Levene’s Test Statistic is F=0.461
with ν = 2 − 1 = 1 and d = 32 − 2 = 30 degrees of freedom (the number of groups minus one, and the
total sample size minus the number of groups) and the associated p-value=0.502.

Decision Rule: Since the p-value = 0.502 > 0.10 = α, we do not reject H0 .

The Conclusion: At the α= 0.10 level of significance, there is no evidence to conclude that the
true variance in the cadmium levels of pregnant women who smoke differs from the true variance
in the cadmium levels of pregnant women who do not smoke (p-value=0.502). In other words,
at the α= 0.10 level of significance, there is no evidence to reject the assumption that the true
variance in the cadmium levels of the pregnant women who smoke equals the true variance in the
cadmium levels of the pregnant women who do not smoke.

For Discussion. In the above example, how does one determine which is Population 1 and which
is Population 2?

Now Your Turn: For the texting while driving example above, test, at the 10% level of signif-
icance, whether both samples were drawn from populations with the same variance. Be sure to

write your solution in the format discussed in class.

Solution: Research Question: Are the reaction times of drivers who text while driving and the
reaction times of drivers who do not text while driving from two populations with the same
variance?

Population Declarations: Let population 1 be the group of drivers who text while driving and
population 2 be the group of drivers who do not text while driving. Then let σ1 be the standard
deviation in the stopping reaction times of individuals from population 1 and σ2 be the standard
deviation in the stopping reaction times of individuals from population 2.

Hypothesis to be tested:

H0 : The variances of the stopping reaction times of both populations are equal.

HA : the variances of the stopping reaction times of both populations are not equal.

Hypothesis Test to be used: Levene’s Test for the Homogeneity of Variances

Assumptions: The data was randomly selected.

The Significance Level: α= 0.10

The Test Statistic and corresponding p-value:

Based on the above SPSS output, the value of the test statistic is F=0.618 with ν = 2 − 1 = 1
and d = 34 − 2 = 32 degrees of freedom and an associated p-value = 0.437.

The Decision Rule: Since p-value = 0.437 > 0.10 = α, we do not reject H0 .

The Conclusion: At α= 0.10 level of significance, there is no evidence to reject the assumption
that the variances of the stopping reaction times of both populations are equal (p-value= 0.437).

1.11.5 Two Dependent Random Samples: Continuous Variables

The work we have done thus far was based on the assumption that our data was unpaired. In the
situations where we had two different populations, we assumed that the data samples were drawn
independently from the two populations. How do we make conclusions about two population
means, when the two samples are not independent?

Two samples are said to be paired when for each data value collected from one sample, there is
a corresponding data value collected from the second sample, and both of these data values are
collected from the same source. A perfect example would be the midterm and the final exam
marks for a group of students.

When we have a small paired data sample, we can run a t-test to test the null hypothesis that
the means of the two populations are equal (the same) against one of the three usual alternative
hypotheses. How do we implement such a test?

Match-paired t-test

Suppose (X1 , Y1 ), ..., (Xn , Yn ) is a random sample of size n with mean (μ1 , μ2 ) and μ1 − μ2 is the
difference of the two population means. In order to form a (1 − α) ∙ 100% confidence interval for
μ1 − μ2 , and test a hypothesis regarding μ1 − μ2 , we need to know the following are all satisfied:

1. A simple random sample is obtained.

2. The populations from which both samples are drawn are normally distributed.

3. The two samples are matched-pairs.

If the above three conditions are simultaneously true, then for the paired differences $d_i = X_i - Y_i$ with

$$\bar{d} = \frac{\sum d_i}{n}, \qquad s_d = \sqrt{\frac{n\left(\sum d_i^2\right) - \left(\sum d_i\right)^2}{n(n-1)}},$$

the test statistic required to test a hypothesis regarding μ1 − μ2 , which has a Student’s
t-distribution with df = n − 1, is

$$t = \frac{\bar{d} - k}{s_d/\sqrt{n}}$$

and a (1 − α) ∙ 100% confidence interval is given by

$$\left(\bar{d} - t_{\alpha/2}(df)\,\frac{s_d}{\sqrt{n}},\;\; \bar{d} + t_{\alpha/2}(df)\,\frac{s_d}{\sqrt{n}}\right).$$

We refer to a hypothesis test that uses the test statistic

$$t = \frac{\bar{d} - k}{s_d/\sqrt{n}}$$

with df = n − 1 degrees of freedom as a match-paired t-test.
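The computation of the match-paired test statistic can be sketched directly from the formulas for the mean and standard deviation of the differences. The function name and the four data pairs are hypothetical and serve only to show the arithmetic.

```python
from math import sqrt


def paired_t(x, y, k=0.0):
    """Return (t, df) for H0: mu1 - mu2 = k using differences d_i = x_i - y_i."""
    d = [xi - yi for xi, yi in zip(x, y)]
    n = len(d)
    d_bar = sum(d) / n
    # Computational formula for the standard deviation of the differences.
    s_d = sqrt((n * sum(di ** 2 for di in d) - sum(d) ** 2) / (n * (n - 1)))
    return (d_bar - k) / (s_d / sqrt(n)), n - 1


# Hypothetical paired measurements, for illustration only.
first = [10, 12, 14, 16]
second = [8, 11, 11, 12]
t_stat, df = paired_t(first, second)
print(round(t_stat, 4), df)
```

Note that the order of subtraction fixes the sign of the statistic, so the first list should correspond to population 1 in the hypotheses.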

For Practice. Referring back to the inderal example, the hope is that one’s systolic blood
pressure after taking inderal would be lower than while on the placebo.

(a) At the α = 0.1 level of significance, test the claim that one’s systolic blood pressure is lower
after taking Inderal than when on the placebo.

(b) Determine a 99% confidence interval for the true difference in the mean systolic blood
pressures (μ1 − μ2 ) between the two groups.

Solution: (a) The Research Question: Does one’s systolic blood pressure become lower after
taking Inderal (when compared to the blood pressure when taking the placebo)?

Population Declarations:

Let the population of interest be the group of people with hypertension. Then define μ1 to
be the true mean systolic blood pressure of this population after taking the placebo and define
μ2 to be the true mean systolic blood pressure of this population after taking inderal.

Hypothesis to be tested:

H0 : μ1 = μ2 (i.e. the true mean systolic blood pressure of population 1 is equal to the true
mean systolic blood pressure of population 2)

HA : μ1 > μ2 (i.e. the true mean systolic blood pressure of population 1 is greater than the true
mean systolic blood pressure of population 2)

Hypothesis Test to be used: match-paired t-test

Assumptions Required to Implement the Hypothesis Test:

o The systolic blood pressures after taking the placebo and after taking inderal are normally
distributed.

o The systolic blood pressures after taking the placebo and after taking inderal were simple
random samples.

o The two samples are match-paired.

The Significance Level: α = 0.10

The Test Statistic and corresponding p-value: Based on the following output from SPSS,

The value of the test statistic is t =4.937 with df =14 and the corresponding p-value < 0.0005.

The Decision Rule: Since the p-value < 0.0005 < 0.10 = α, we reject H0 .

The Conclusion: At the α = 0.10 level of significance, there is evidence to conclude that the true
mean systolic blood pressure after taking inderal is lower than the true mean systolic blood
pressure after taking the placebo (p-value < 0.0005).

In the above solution, you will notice that one cell of the table is highlighted in orange. This
cell tells you how SPSS calculated the differences used to calculate the test statistic. In order
to avoid confusion, it is simplest if the identified first value used in the difference (Placebo) is
associated with population 1 and the second value used in the difference (Inderal) is associated
with population 2. The same rule of thumb is used when computing a confidence interval for the
difference of the two means.

(b)

The estimated 99% confidence interval for the true difference μ1 − μ2 is 10.880 to 43.920.

Remark: To use the paired data t-test statistic, we have made the assumption that D = X − Y
is a normal random variable.

Remark: If you have paired data, you can either implement a paired t-test or an unpaired two-
sample t-test, because the paired t-test tests the same hypotheses as the unpaired two sample
tests that we have studied. BUT, if you do not have paired data, you CAN ONLY implement an
unpaired two-sample test!!!

Now Your Turn: BFAHS, p252, example 7.4.1: John M. Morton et al. (A-14) examined
gallbladder function before and after fundoplication–a surgery used to stop stomach contents
from flowing back into the esophagus (reflux)–in patients with gastroesophageal reflux disease.
The authors measured gall bladder functionality by calculating the gall bladder ejection fraction
(GBEF) before and after fundoplication. These values are stored in the table below.

The goal of fundoplication is to increase GBEF, which is measured as a percent. Does the data
support, at the 5% level of significance, that fundoplication increases GBEF functioning? You
may assume that the patients were randomly selected and that the differences in the Pre-op and
Post-op GBEF are normally distributed.

Solution: The Research Question: Does fundoplication increase GBEF functioning?

Population Declarations: The population of interest is the group of patients who have gastroe-
sophageal reflux disease. Then let μ1 be the true mean gall bladder ejection fraction (GBEF)
before fundoplication and μ2 be the true mean gall bladder ejection fraction (GBEF) after fun-
doplication.

Hypothesis to be tested:

H0 : μ1 = μ2 (i.e. the true mean GBEF before fundoplication equals the true mean GBEF after
fundoplication)

HA : μ1 < μ2 (i.e. the true mean GBEF before fundoplication is less than the true mean GBEF
after fundoplication)

Hypothesis Test to be used: Matched-Pairs Sample T-Test.

Assumptions Required to Implement The Hypothesis Test:

1. The data collected is a simple random sample.

2. The populations from which both samples are drawn are normally distributed.

3. The two samples are matched-pairs.

The Significance Level: α= 0.05

The Test Statistic and corresponding p-value:

From the SPSS output above, the test statistic is t =-1.916 with df =11 and an associated
p-value=0.041.

The Decision Rule: Since α = 0.05 > 0.082/2 = 0.041= p-value, we can reject H0.

The Conclusion: At the α= 0.05 level of significance (with a p-value=0.041), there is evidence to
conclude that the true mean GBEF increases after fundoplication.

1.11.6 Large Random Samples from two Independent Populations: Count Data

Suppose we wish to know whether there is a difference between the proportion of men who feel satisfied
with a health promotion activity for stopping smoking and the proportion of women who feel
satisfied with the same health promotion activity, but we do not have a priori knowledge of either
proportion. We can use sample data from the two populations to determine a (1 − α)100%
confidence interval for the true difference in the proportions and to implement the appropriate
hypothesis test.

z-test for the difference in two independent proportions

Let ni , xi , π̂i , and πi respectively be the sample size, the number of successes observed, the
observed proportion of successes, and the true proportion of successes in population i. If both
samples are large, then π̂1 − π̂2 will be approximately normally distributed with standard error

$$\sqrt{\frac{\hat{\pi}_1(1 - \hat{\pi}_1)}{n_1} + \frac{\hat{\pi}_2(1 - \hat{\pi}_2)}{n_2}}.$$

As long as n1 π̂1 ≥ 5, n1 (1 − π̂1 ) ≥ 5 and n2 π̂2 ≥ 5, n2 (1 − π̂2 ) ≥ 5, we can implement a
hypothesis test regarding π1 − π2 by simply making the following changes to our hypothesis testing
recipe for a z test:

The Hypotheses are:

$$H_0: \pi_1 - \pi_2 = 0 \quad \text{and} \quad H_A: \begin{cases} \pi_1 - \pi_2 < 0 \\ \pi_1 - \pi_2 > 0 \\ \pi_1 - \pi_2 \neq 0. \end{cases}$$

The name of the hypothesis test: z-test for the difference in two independent proportions

The assumptions:

1. Both must be simple random samples

2. Both must be large samples, i.e. n1 π̂1 ≥ 5, n1 (1 − π̂1 ) ≥ 5 and n2 π̂2 ≥ 5, n2 (1 − π̂2 ) ≥ 5.

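The test statistic is

$$z = \frac{\hat{\pi}_1 - \hat{\pi}_2}{\sqrt{\hat{\pi}(1 - \hat{\pi})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} \quad \text{where} \quad \hat{\pi} = \frac{x_1 + x_2}{n_1 + n_2}.$$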
The other steps do not change.

For Practice. In a group of 1000 men polled, 850 supported an issue. Of 500 women surveyed,
400 supported the issue. Assume the data was collected using simple random sampling.
Test the hypothesis that the true proportion of men supporting the issue equals the proportion
of women supporting the issue against the alternative. Use α=0.01.

Solution:
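The arithmetic for this problem can be checked with a short sketch of the pooled z statistic. The function name is hypothetical, and because the problem does not spell out the alternative, the p-value returned here assumes a two-sided alternative; for a one-sided alternative it would be halved.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z(x1, n1, x2, n2):
    """Return (z, two-sided p-value) using the pooled proportion."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))  # assumed two-sided
    return z, p_two_sided


# Men: 850 of 1000 in favour; women: 400 of 500 in favour.
z, p_value = two_proportion_z(850, 1000, 400, 500)
print(round(z, 3), round(p_value, 4))
```

The decision then follows the usual recipe: compare the p-value with α = 0.01.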

Large sample confidence interval for the difference in two independent proportions

Both samples must be large, where we will still use the condition that the variance of each of the
samples is greater than or equal to 10 to define large enough.

Let ni , xi , π̂i , and πi respectively be the sample size, the number of successes observed, the
observed proportion of successes, and the true proportion of successes in population i. If both
samples are large, then π̂1 − π̂2 will be approximately normally distributed with standard error

$$\sqrt{\frac{\hat{\pi}_1(1 - \hat{\pi}_1)}{n_1} + \frac{\hat{\pi}_2(1 - \hat{\pi}_2)}{n_2}}.$$

A (1 − α)100% confidence interval for π1 − π2 will therefore be

$$\left((\hat{\pi}_1 - \hat{\pi}_2) - z_{\alpha/2}\sqrt{\frac{\hat{\pi}_1(1 - \hat{\pi}_1)}{n_1} + \frac{\hat{\pi}_2(1 - \hat{\pi}_2)}{n_2}},\;\; (\hat{\pi}_1 - \hat{\pi}_2) + z_{\alpha/2}\sqrt{\frac{\hat{\pi}_1(1 - \hat{\pi}_1)}{n_1} + \frac{\hat{\pi}_2(1 - \hat{\pi}_2)}{n_2}}\right). \tag{1.2}$$

For Practice. In a group of 1000 men polled, 850 supported an issue. Of 500 women surveyed,
400 supported the issue. Assume the data was collected using simple random sampling.
Find a 90% confidence interval for the true difference in the proportions of men and women that
support the issue.

Solution:
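A sketch of how formula (1.2) could be evaluated for these data is given below. The function name is hypothetical; the critical value z_{α/2} is obtained from the standard normal distribution rather than from a table.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_ci(x1, n1, x2, n2, conf=0.90):
    """Large-sample confidence interval for pi1 - pi2 (unpooled standard error)."""
    p1, p2 = x1 / n1, x2 / n2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z_crit = NormalDist().inv_cdf(1 - (1 - conf) / 2)  # z_{alpha/2}
    diff = p1 - p2
    return (diff - z_crit * se, diff + z_crit * se)


# Men: 850 of 1000 in favour; women: 400 of 500 in favour.
lower, upper = two_proportion_ci(850, 1000, 400, 500, conf=0.90)
print(round(lower, 4), round(upper, 4))
```

Unlike the test statistic, the interval uses the two sample proportions separately in the standard error instead of the pooled proportion.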

Now Your Turn: Do our emotions influence the economic decisions we make? One way to
examine the issue is to have subjects play an “ultimatum game” against other people and against
a computer. Your partner (person or computer) gets $10, on the condition that it be shared with
you. The partner makes you an offer. If you refuse, neither of you gets anything. Consequently,
it is to your advantage to accept an unreasonable offer (such as you get $2 and your partner
gets $8). Some people get mad and refuse unfair offers. Here are data on the response of
228 subjects randomly selected to receive $2 from either a person they were introduced to or a
computer.

Human offers accepted: 60; Human offers rejected: 54; Computer offers accepted: 96; Computer
offers rejected: 18.

We suspect that emotion will lead to offers from another being rejected more often than offers
from an impersonal computer. Test this claim at the α = 0.05 level of significance.

1.11.7 Small Random Samples from two Independent Populations: Count Variables

Suppose that either the number of observed successes OR the number of observed failures in at least
one of the samples is less than 10. We should not blindly use the confidence interval formula
above. We are going to use the “Plus Four” method to compute a (1 − α)100% confidence
interval when n1 ≥ 5 and n2 ≥ 5.

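If the point estimates used in the “Plus Four” confidence interval are

$$\tilde{\pi}_1 = \frac{x_1 + 1}{n_1 + 2}, \qquad \tilde{\pi}_2 = \frac{x_2 + 1}{n_2 + 2}$$

and the standard error is

$$S.E. = \sqrt{\frac{\tilde{\pi}_1(1 - \tilde{\pi}_1)}{n_1 + 2} + \frac{\tilde{\pi}_2(1 - \tilde{\pi}_2)}{n_2 + 2}},$$

then the (1 − α) ∙ 100% confidence interval for π1 − π2 is

$$\left((\tilde{\pi}_1 - \tilde{\pi}_2) - z_{\alpha/2}\sqrt{\frac{\tilde{\pi}_1(1 - \tilde{\pi}_1)}{n_1 + 2} + \frac{\tilde{\pi}_2(1 - \tilde{\pi}_2)}{n_2 + 2}},\;\; (\tilde{\pi}_1 - \tilde{\pi}_2) + z_{\alpha/2}\sqrt{\frac{\tilde{\pi}_1(1 - \tilde{\pi}_1)}{n_1 + 2} + \frac{\tilde{\pi}_2(1 - \tilde{\pi}_2)}{n_2 + 2}}\right) \tag{1.3}$$

as long as n1 ≥ 5 and n2 ≥ 5.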

For Practice. In an experiment to determine if consuming Omega-3 fatty acids can improve
one’s memory, the brains of 20 randomly selected healthy rats were treated in a controlled,
humane way, to damage their memories. The rats were trained to run a maze. (At the end
of the training, all 20 could find their way through the maze quickly). After a day, controlled
amounts of Omega-3 fatty acids were introduced to the diets of 10 rats; the diets of the other
10 rats were identical to those of the rats receiving the treatment EXCEPT THAT no Omega-3 fatty acids
were added to their diets. After one day of treatment, 7 of the Omega-3 group successfully
made it through the maze; 2 of the control group made it through the maze. Determine a 95%
confidence interval for the true difference in the proportions of the Omega-3 group and control
group that successfully made it through the maze, post treatment.

Solution:
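One way to carry out the “Plus Four” computation for these data is sketched below, following formula (1.3). The function name is hypothetical, and population 1 is taken to be the Omega-3 group and population 2 the control group.

```python
from math import sqrt
from statistics import NormalDist


def plus_four_ci(x1, n1, x2, n2, conf=0.95):
    """'Plus Four' confidence interval for pi1 - pi2."""
    p1 = (x1 + 1) / (n1 + 2)  # adjusted point estimate, sample 1
    p2 = (x2 + 1) / (n2 + 2)  # adjusted point estimate, sample 2
    se = sqrt(p1 * (1 - p1) / (n1 + 2) + p2 * (1 - p2) / (n2 + 2))
    z_crit = NormalDist().inv_cdf(1 - (1 - conf) / 2)  # z_{alpha/2}
    diff = p1 - p2
    return (diff - z_crit * se, diff + z_crit * se)


# Omega-3 group: 7 of 10 successes; control group: 2 of 10 successes.
lower, upper = plus_four_ci(7, 10, 2, 10, conf=0.95)
print(round(lower, 4), round(upper, 4))
```

The adjustment adds one success and one failure to each sample, which pulls both point estimates toward 0.5 before the usual large-sample formula is applied.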

1.11.8 Testing Normality

In order to use the t-tests that we just reviewed when the number of data points is fewer than
30, the data must be normally distributed in order for the test statistic to have a distribution that

can be approximated by the Student’s t-distribution. The examples we have looked at thus far
have told us that we can assume that the data is normally distributed but, in practice, researchers
cannot just assume that their data is normally distributed. The normality of the data needs to
be tested.

In this course we will look at a graphical method and a numerical method for testing whether the
data is normally distributed.

Graphical Method for Testing Normality

The probability-probability plot (P-P plot or percent plot) compares an empirical cumulative
distribution function of a variable with a specific theoretical cumulative distribution function
(e.g., the standard normal distribution function).

For Practice. Create a P-P Plot for the Placebo group in the Systolic Blood Pressure example.

Solution: The P-P Plot,

For Discussion. Based on the above P-P Plot, are the blood pressures from this population
(from which the sample was drawn) normally distributed?
Solution: Because all the points lie quite close to the plotted line y = x and there seem to be
just as many points lying above the line as below, I would suspect that the data was drawn from
a normally distributed population.

A second plot that can be used to help determine the normality of the data is the quantile-
quantile plot (Q-Q plot) which compares ordered values of a variable with quantiles of a specific
theoretical distribution (i.e., the normal distribution).

For Practice. Create a Q-Q Plot for the Placebo group in the Systolic Blood Pressure example.

Solution: The Q-Q Plot,

For Discussion. Based on the above Q-Q Plot, are the blood pressures from this population
(from which the sample was drawn) normally distributed?
Solution: Because all the points lie quite close to the plotted line y = x and there seem to be
just as many points lying above the line as below, I would suspect that the data was drawn from
a normally distributed population.

In either the P-P or Q-Q plot, if the two distributions match, the points in the plot will form a
linear pattern passing through the origin with a unit slope. P-P and Q-Q plots are used to see
how well a theoretical distribution models the empirical data. The question becomes how large
a deviation from this line is needed to conclude non-normality. Is the Inderal data plotted
in the Q-Q plot normal, or is the deviation present sufficient to conclude that the underlying
population is not normal?
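
To make the construction of the two plots concrete, here is a minimal Python sketch that computes the coordinates of the P-P and Q-Q points against a normal distribution fitted with the sample mean and standard deviation. The plotting positions pi = (i − 0.5)/n are an assumption (statistical packages may use a different convention), the function name is hypothetical, and the readings are made-up values used for illustration only.

```python
from statistics import NormalDist, mean, stdev


def pp_qq_points(sample):
    """Return (P-P points, Q-Q points) for a normal reference distribution.

    Assumption: plotting positions p_i = (i - 0.5) / n; other conventions
    exist and statistical packages may use a different one.
    """
    xs = sorted(sample)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))  # normal with sample mean and s
    probs = [(i - 0.5) / n for i in range(1, n + 1)]
    # P-P: theoretical cumulative probability vs. empirical cumulative probability
    pp = [(fitted.cdf(x), p) for x, p in zip(xs, probs)]
    # Q-Q: theoretical quantile vs. ordered observation
    qq = [(fitted.inv_cdf(p), x) for x, p in zip(xs, probs)]
    return pp, qq


# Hypothetical systolic blood pressure readings (illustration only).
readings = [118, 124, 131, 127, 135, 122, 140, 129]
pp, qq = pp_qq_points(readings)
for (theo_p, emp_p), (theo_q, obs) in zip(pp, qq):
    print(f"P-P: ({theo_p:.3f}, {emp_p:.3f})   Q-Q: ({theo_q:.1f}, {obs})")
```

In both lists, points that fall close to the line y = x are what one would hope to see for normally distributed data.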

Shapiro-Wilk W Test for Normality

There are several different statistical tests which can be used to quantitatively determine (up to
some level of significance) whether or not a given data set is drawn from a normally distributed
population. The two tests that SPSS implements are the Shapiro-Wilk W Test for Normality
and the Kolmogorov-Smirnov D Test. The Shapiro-Wilk W Test for Normality is valid for data
sets whose size is between 3 and 2000 (inclusive). If one has more than 2000 data points, then
the Kolmogorov-Smirnov D Test should be used. The assumptions required for the Kolmogorov-
Smirnov D Test are that the data is randomly sampled and that the underlying distribution of
the population is continuous. Here we will focus on the Shapiro-Wilk W Test for Normality.

The hypotheses of the Shapiro-Wilk W Test for Normality are:

H0 : the population from which the data was sampled is normally distributed

HA : the population from which the data was sampled is not normally distributed.

The test statistic is given by

$$W = \frac{\left(\sum_{i=1}^{n} a_i x_{(i)}\right)^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2},$$

where x1 , x2 , ..., xn represent the sample data, x(1) , x(2) , ..., x(n) represent the sample data
ordered from least to greatest, and ai are constants generated from the means, variances, and
covariances of the order statistics of a sample of size n from a normal distribution (see Pearson and
Hartley (1972, Table 15)).
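
As a small illustration of how W is assembled once the coefficients ai are known, the Python sketch below evaluates the formula for a sample of size n = 3, where the tabulated coefficients are (−1/√2, 0, 1/√2). The function name is hypothetical, the two tiny samples are made up, and the sketch does not compute a p-value (that still requires tables or software such as SPSS).

```python
from math import sqrt


def shapiro_wilk_w(sample, coeffs):
    """W = (sum a_i x_(i))^2 / sum (x_i - xbar)^2 for given coefficients a_i."""
    xs = sorted(sample)                      # order statistics x_(1) <= ... <= x_(n)
    xbar = sum(xs) / len(xs)
    numerator = sum(a * x for a, x in zip(coeffs, xs)) ** 2
    denominator = sum((x - xbar) ** 2 for x in xs)
    return numerator / denominator


# Tabulated coefficients for n = 3: (-1/sqrt(2), 0, 1/sqrt(2)).
a_n3 = [-1 / sqrt(2), 0.0, 1 / sqrt(2)]

w_even = shapiro_wilk_w([1, 2, 3], a_n3)     # evenly spaced values
w_skew = shapiro_wilk_w([1, 2, 4], a_n3)     # unevenly spaced values
print(round(w_even, 4), round(w_skew, 4))
```

Values of W close to 1 are consistent with normality; smaller values point away from it.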

For Practice. Determine at the 5% level of significance if the systolic blood pressures of
hypertensive people after taking a placebo are normally distributed.

Solution: Research Question: Are the systolic blood pressures of hypertensive people after taking
a placebo normally distributed?

Hypothesis to be tested:

H0 : The systolic blood pressures of hypertensive people after taking a placebo are normally
distributed.

HA : The systolic blood pressures of hypertensive people after taking a placebo are not normally
distributed.

Hypothesis Test to be used: Shapiro Wilk Test for Normality

Assumptions Required to Implement the Hypothesis Test: We must have between 3 and 2000
data points in each sample.

The Significance Level: α= 0.05

The Test Statistic and corresponding p-value:

From the Tests of Normality table above, the value of the test statistic is with an associated
p-value=0.349.

The Decision Rule: Since the p-value = 0.349 > 0.05 = α, we do not reject H0 .

The Conclusion: At the α=0.05 level of significance, there is no evidence to conclude that the
systolic blood pressures of hypertensive people after taking a placebo are not normally distributed
(p-value=0.349). In other words, at the α=0.05 level of significance, there is no evidence to reject
the assumption that the systolic blood pressures of hypertensive people after taking a placebo
are normally distributed (p-value=0.349).

Now Your Turn: Determine at the 5% level of significance if the systolic blood pressures of
hypertensive people after taking Inderal are normally distributed.

Solution: Research Question: Are the systolic blood pressures of hypertensive people after taking
Inderal normally distributed?

Hypothesis to be tested:

H0 : The systolic blood pressures of hypertensive people after taking Inderal are normally
distributed.

HA : The systolic blood pressures of hypertensive people after taking Inderal are not normally
distributed.

Hypothesis Test to be used: Shapiro Wilk Test for Normality

Assumptions Required to Implement the Hypothesis Test: We must have between 3 and 2000
data points in each sample.

The Significance Level: α= 0.05

The Test Statistic and corresponding p-value:

From the Tests of Normality table above, the value of the test statistic is W = 0.938 with an
associated p-value=0.363.

The Decision Rule: Since the p-value = 0.363 > 0.05 = α, we do not reject H0 .

The Conclusion: At the α = 0.05 level of significance, there is no evidence to conclude that the
systolic blood pressures of hypertensive people after taking Inderal are not normally distributed
(p-value=0.363). In other words, at the α = 0.05 level of significance, there is no evidence to
reject the assumption that the systolic blood pressures of hypertensive people after taking Inderal
are normally distributed (p-value=0.363).

Now Your Turn: Using the data presented in the Texting While Driving Now-Your-Turn Scenario,

a) generate the P-P plots required to visually inspect whether each sample was drawn from a
normally distributed population. Comment on whether, from the plots, you would believe each
population is normally distributed.

b) generate the Q-Q plots required to visually inspect whether each sample was drawn from a
normally distributed population. Comment on whether, from the plots, you would believe each
population is normally distributed.

c) Test, at the 10% level of significance, whether each sample was drawn from a normally
distributed population. Be sure to write your solution in the format discussed in class.

Solution: a)

Based on the above P-P plots, because most of the data points for the drivers not texting while
driving and the drivers texting while driving are far away from the y = x line and the points seem
to be distributed in a sinusoidal pattern about the y = x line, one might suspect that neither of
the two samples was drawn from a normally distributed population.

b)

Based on the above two Q-Q plots, because most of the data points for the drivers not texting
while driving and the drivers texting while driving are far away from the y = x line and the points
seem to be distributed in a sinusoidal pattern about the y = x line, one might suspect that
neither of the two samples was drawn from a normally distributed population.

c) Research Question: Are the stopping reaction times of drivers who text while driving and the
stopping reaction times of drivers who do not text while driving normally distributed?

Hypothesis to be tested:

H0,1 : The stopping reaction times of drivers who do not text while they drive are normally
distributed.

HA,1 : The stopping reaction times of drivers who do not text while they drive are not normally
distributed.

H0,2 : The stopping reaction times of drivers who do text while they drive are normally distributed.

HA,2 : The stopping reaction times of drivers who do text while they drive are not normally
distributed.

Hypothesis Test to be used: Shapiro-Wilk Test

Assumptions: In order to use the Shapiro-Wilk Test, we must have between 3 and 2000 data
points in each sample.

The Significance Level: α = 0.10

The Test Statistic and corresponding p-value:

Based on the above Tests of Normality table, the test statistic associated with drivers who were
not texting while driving is W(18) = 0.819 with a corresponding p-value=0.003; and the test
statistic associated with the drivers who were texting while driving is W(16) = 0.817 with a
corresponding p-value=0.005.

The Decision Rule:

For those who were not texting while driving: since p-value = 0.003 < 0.10 = α, we reject H0,1 .

For those who were texting while driving: since p-value = 0.005 < 0.10 = α, we reject H0,2 .

The Conclusion: At the α = 0.10 level of significance, there is evidence to conclude that the stopping
reaction times of drivers who do not text while driving (p-value=0.003) and the stopping reaction
times of drivers who text while driving (p-value=0.005) are both not normally distributed.

For more information about the Shapiro-Wilk W Test for Normality refer to Shapiro, S. S. and
Wilk, M. B. (1965). “An analysis of variance test for normality (complete samples)”, Biometrika,
52, 3 and 4, pages 591-611.

Pearson, E. S., and Hartley, H. O. (1972). Biometrika Tables for Statisticians, Vol 2, Cambridge,
England, Cambridge University Press.

1.12 Correlation

Suppose we wish to somehow characterize the relationship, if one exists, between two quantitative
variables. Further suppose that, given we know some relationship exists, we would like to use
this relationship to make predictions about unknown values from the population of
interest. To this end, suppose we have n cases and for each case we take a measurement for two
variables X and Y . Then the point (xi , yi ) is formed using the value of X and the value of Y
for case i. The point (x̄, ȳ) can be plotted on the scatter plot and is called the centroid.

1.12.1 Scatter Plots

To construct a scatter diagram (or scatter plot), we simply plot the points (xi , yi ) for the n cases.

For Practice. A medical researcher wants to determine if there is a linear relationship between
the costs of prescription drugs that can be administered to both humans and pets. The data
collected (in Canadian dollars) is summarized in the following table.

Solution:

Using this sample data, create a scatterplot.

For Discussion. From the above scatter plot, what relationship would you suspect there to
be (if any) between the prescription drug cost of medications that can be administered to both
humans and pets?

1.12.2 Pearson’s Correlation Coefficient

Suppose we would like to determine if there is a linear relationship between the Human Drug
Cost X and the Pet Drug Cost Y . This relationship is referred to as the correlation. We can
measure this correlation using the Pearson’s correlation coefficient r where

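$$r = \frac{n\sum x_i y_i - \left(\sum x_i\right)\left(\sum y_i\right)}{\sqrt{\left[n\sum x_i^2 - \left(\sum x_i\right)^2\right]\left[n\sum y_i^2 - \left(\sum y_i\right)^2\right]}} \tag{1.4}$$

$$= \frac{\sum \left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right)}{n-1} \tag{1.5}$$

$$= \frac{S_{xy}}{\sqrt{S_{xx}}\sqrt{S_{yy}}}, \tag{1.6}$$

with

$$S_{xx} = \sum (x_i-\bar{x})^2 = \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n},$$

$$S_{yy} = \sum (y_i-\bar{y})^2 = \sum y_i^2 - \frac{\left(\sum y_i\right)^2}{n},$$

and

$$S_{xy} = \sum (x_i-\bar{x})(y_i-\bar{y}) = \sum x_i y_i - \frac{\left(\sum x_i\right)\left(\sum y_i\right)}{n}.$$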
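
The equivalence of forms (1.4) and (1.6) can be checked numerically. The short Python sketch below does so with the midpoint age and median blood pressure data tabulated later in these notes (for which SPSS reports r = 0.997); the function names are hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's r via S_xy / (sqrt(S_xx) * sqrt(S_yy)), as in (1.6)."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    s_xx = sum((x - xbar) ** 2 for x in xs)
    s_yy = sum((y - ybar) ** 2 for y in ys)
    s_xy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    return s_xy / (sqrt(s_xx) * sqrt(s_yy))


def pearson_r_raw_sums(xs, ys):
    """Pearson's r via the raw-sum form (1.4)."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    return (n * sxy - sx * sy) / sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))


ages = [35, 45, 55, 65, 75]
pressures = [117, 131, 140, 155, 169]
r_16 = pearson_r(ages, pressures)
r_14 = pearson_r_raw_sums(ages, pressures)
print(f"r from (1.6): {r_16:.3f}; r from (1.4): {r_14:.3f}")
```

The raw-sum form is convenient for hand calculation, while the deviation form tends to be less prone to rounding problems on a computer.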

Note:

1. It can be shown using the above formula for r that we always have −1 ≤ r ≤ 1. In other
words, the sample correlation coefficient can NEVER be smaller than -1 or greater than
+1.

2. r does NOT measure the slope of the straight line (referred to as the regression line or the
line of best fit) that we are trying to fit to our data, apart from the sign.

3. If r = 1.0, then we have perfect positive correlation between the two variables.

4. If 0.7 < r < 1, then we have a strong positive correlation between the two variables.

5. If 0.4 < r ≤ 0.7, then we have a moderate positive correlation between the two variables.

6. If 0.0 < r ≤ 0.4, then we have a weak positive correlation between the two variables.

7. If r = −1.0, then we have perfect negative correlation between the two variables.

8. If −1.0 < r < −0.7, then we have a strong negative correlation between the two variables.

9. If −0.7 ≤ r < −0.4, then we have a moderate negative correlation between the two
variables.

10. If −0.4 ≤ r < 0.0, then we have a weak negative correlation between the two variables.

11. If r is close to ZERO, then there is little to no linear relationship between the two variables.
This does not imply that there is no relationship between the two variables!!!!
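
The cut-offs in the list above can be collected into a small lookup. The Python sketch below is only a convenience for applying these particular thresholds; the function name and the wording of the labels are hypothetical, and the label returned for r = 0 is a choice made here, since the list does not assign one.

```python
def describe_correlation(r):
    """Label r using the cut-offs listed in the notes above."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie in [-1, 1]")
    if r == 0.0:
        return "no linear relationship"
    size = abs(r)
    if size == 1.0:
        strength = "perfect"
    elif size > 0.7:
        strength = "strong"
    elif size > 0.4:
        strength = "moderate"
    else:
        strength = "weak"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction}"


for value in (0.942, 0.464, -0.4, -1.0):
    print(value, "->", describe_correlation(value))
```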

For Practice. For our Human versus Pet Drug Cost data, calculate Pearson’s correlation
coefficient.

Solution: Pearson’s correlation coefficient is r=0.942.

For Discussion. Does |r| = 1 always imply a strong linear relationship between the two
variables?

A lurking variable is a variable that is not among the explanatory or response variables in a study
and yet may influence the interpretation of the relationships among the explanatory and response
variables.

A common error that people make is that they interpret a strong correlation as a cause and effect.
Sometimes such a relationship does exist (Smokers and Physical Endurance), but in many cases
no such causal relationship exists even if the correlation is strong (our human and pet drug cost).
In such situations, there usually are hidden variables linking the two quantities of interest.

Another common error is to interpret the correlation value itself as the percentage of the
variability in the dependent variable that is explained by the independent variable(s) (e.g. 60%
when r = 0.6). That proportion is given by r², the coefficient of determination (0.36 when r = 0.6).

Using data collected for two variables, if r = 0.0, then there is no linear relationship between the
two variables. Hence there is either no relationship between the two variables or there is some
non-linear relationship between the two variables. From a scatter plot of the two variables, if no
trend is apparent and if r = 0.0, then we would conclude there is no relationship between the
variables (i.e. the variables are independent).

For Discussion. How do we determine if the data in a single set of observations are independent,
i.e. how do we determine if r = 0.0?

1.12.3 A Test for Independence

To determine whether or not there is significant correlation between your independent and
dependent variables, test the hypotheses:

H0 : ρ = 0

HA : ρ ≠ 0,

using the test statistic

$$t = r\sqrt{\frac{n-2}{1-r^2}},$$

with df = n − 2.

Note that here ρ represents the population correlation. If we reject H0 , then we conclude that
the true correlation is not zero and that some relationship exists between the two variables, i.e.
the two variables are dependent. If we do not reject H0 , then we cannot reject that the true
correlation is zero and hence we cannot reject that there is no relationship between the two
variables, i.e. the two variables are independent.

If one wanted to test if there is significant positive correlation then HA : ρ > 0 and if one wanted
to test if there is significant negative correlation then HA : ρ < 0 .

The critical values are from a Student’s t-distribution with df = n − 2.
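
The calculation of this test statistic can be sketched in Python as follows. The data are the midpoint ages and median blood pressures used in the Now Your Turn example below; the critical value 3.1824 (two-sided, α = 0.05, df = 3) is the one quoted in the prediction interval example later in these notes, and the function name is hypothetical.

```python
from math import sqrt


def correlation_t_statistic(r, n):
    """t = r * sqrt((n - 2) / (1 - r^2)), referred to a t-distribution with n - 2 df."""
    if abs(r) >= 1:
        raise ValueError("the statistic is undefined when |r| = 1")
    return r * sqrt((n - 2) / (1 - r ** 2))


# Midpoint ages and median blood pressures from the example in these notes.
ages = [35, 45, 55, 65, 75]
pressures = [117, 131, 140, 155, 169]

n = len(ages)
xbar, ybar = sum(ages) / n, sum(pressures) / n
s_xx = sum((x - xbar) ** 2 for x in ages)
s_yy = sum((y - ybar) ** 2 for y in pressures)
s_xy = sum((x - xbar) * (y - ybar) for x, y in zip(ages, pressures))
r = s_xy / sqrt(s_xx * s_yy)

t_stat = correlation_t_statistic(r, n)
t_crit = 3.1824  # two-sided 5% critical value for df = 3, as quoted later in the notes
reject_h0 = abs(t_stat) > t_crit
print(f"r = {r:.3f}, t = {t_stat:.3f}, reject H0: {reject_h0}")
```

Comparing |t| with the critical value is equivalent to comparing the two-sided p-value with α; the p-value itself is left to software here because the Python standard library has no t-distribution.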

In order for the above test to be valid, the following assumptions must be true:

1. the variables x and y are linearly related;

2. each pair was randomly selected; and

3. the variables must have a bivariate normal distribution.

For Practice. For the Human versus Pet Drug Cost example, determine whether there is
significant positive correlation between the independent and dependent variables at the 10% level
of significance. Assume all the required assumptions hold.

Solution: Research Question: Is there significant positive correlation between the costs of the
prescription drugs that can be administered to both humans and pets?

Population Declarations: Let population 1 be the costs for prescription drugs for humans and let
population 2 be the costs for prescription drugs for pets.

Hypotheses to be tested:

H0 : ρ = 0

HA : ρ > 0

Hypothesis Test to be used: Test for Significant Correlation

Assumptions to be tested: We are told to assume that all the required assumptions hold.

Point estimate and associated P-value:

From the Correlations table above, Pearson’s correlation coefficient is 0.942 and the p-value
associated with the test statistic to test for significant correlation is p-value=0.001/2=0.0005.

Statistical Decision: Since the p-value= 0.0005 < 0.10 = α, we reject H0 .

Conclusion: At the 10% level of significance, we have evidence to conclude that the true
correlation between human prescription drug costs and pet prescription drug costs is greater than
zero (p-value=0.0005). This implies that, at the 10% level of significance, we have evidence to
conclude that there is significant positive correlation between the human and pet prescription
drug costs.

Now Your Turn: The blood pressure and age were measured for female patients. The patients
were then grouped by age and, for each of the age groups, the median blood pressure measurement
was computed. The data are summarized below:

Determine at the 5% level of significance if the midpoints of the age group are independent of
the median blood pressure for the age group.

Solution: Research Question: Are the midpoints of the age group independent of the median
blood pressure for the age group?

Population Declarations: Let population 1 be the midpoint ages of female patients, grouped by
the decade in which their ages fall and let population 2 be the median blood pressure of the age
group.

Hypotheses to be tested:

H0 : ρ = 0

HA : ρ ≠ 0

Hypothesis Test to be used: Test for Significant Correlation

Assumptions to be tested: We are told to assume that all the required assumptions hold.

Point estimate and associated P-value:

From the Correlations table above, Pearson’s correlation coefficient is 0.997 and the p-value
associated with the test statistic to test for significant correlation is p-value<0.001.

Statistical Decision: Since the p-value< 0.001 < 0.05 = α, we reject H0 .

Conclusion: At the 5% level of significance, we have evidence to conclude that there is nonzero
correlation between the midpoints of the age group and the group’s associated median blood
pressure (p-value<0.001). Hence, at the 5% level of significance, we have evidence to conclude that
the midpoints of an age group and the group’s associated median blood pressure are dependent.

When the null hypothesis in a test for significant correlation has been rejected, it could mean:

1. there is a cause-and-effect relationship between the two variables (X causes Y or vice versa);

2. the relationship between the two variables is caused by a third variable;

3. there may be more complex interrelationships amongst the variables; or

4. the relationship may be coincidental.

We say that an observation is influential for a statistical calculation if removing it from the
calculation significantly changes the result of the calculation.

For Discussion. Does our human/pet drug cost data have any influential observations?

How do we analyze data with influential observations?

As outliers affect Pearson’s correlation coefficient, we can instead measure the (monotonic) relationship
between two variables using Spearman’s Rank Correlation coefficient ρs .

1.12.4 Spearman’s Rank Correlation Coefficient

We can calculate Spearman’s Rank Correlation coefficient ρs using the following procedure:

1. Arrange the observations of the independent variable in increasing order and assign them ranks
1,2,...,n.

2. Arrange the observations of the dependent variable in increasing order and assign them ranks
1,2,...,n.

3. For a particular data point, let (xi , yi ) denote actual observations and (ri , si ) denote the ranks
of the independent and dependent variable. Then Spearman’s Rank Correlation coefficient is
defined to be

$$\rho_s = \frac{\sum (r_i-\bar{r})(s_i-\bar{s})}{\sqrt{\sum (r_i-\bar{r})^2}\,\sqrt{\sum (s_i-\bar{s})^2}}.$$

What happens if two or more observations of a variable are identical? How does one rank
these identical observations? One should assign to all the tied observations the average of the
consecutive ranks that would have been assigned to the tied values.

Note:

1. the above formula for ρs is nothing but Pearson’s Correlation Coefficient formula where the
bivariate data are the ordered pairs of the ranks of the original data.

2. If the variables X and Y are strongly positively correlated, the ranks on X should generally
agree with the ranks on Y and ρs will be positive.

3. If the variables X and Y are strongly negatively correlated, the ranks on X should be in the
reverse order to the ranks on Y and ρs will be negative.

4. If the variables X and Y are uncorrelated, the ranks on X should be randomly distributed
with the ranks on Y and ρs will be essentially zero.

5. −1 ≤ ρs ≤ 1.

6. ρs = 1.0 indicates the ranks on X completely agree with the ranks on Y. ρs = −1.0
indicates the ranks on X are in the reverse order to the ranks on Y.

7. ρs is less sensitive to outliers than Pearson’s correlation coefficient ρp .
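
The ranking procedure, including the averaging of tied ranks, can be sketched in Python as shown below. The function names are hypothetical and the small data set (with one tie in x) is made up purely to exercise the tie rule.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1          # average of ranks pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks


def spearman_rho(xs, ys):
    """Pearson's formula applied to the ranks of xs and ys."""
    r, s = average_ranks(xs), average_ranks(ys)
    n = len(r)
    rbar, sbar = sum(r) / n, sum(s) / n
    num = sum((ri - rbar) * (si - sbar) for ri, si in zip(r, s))
    den = sqrt(sum((ri - rbar) ** 2 for ri in r)) * sqrt(sum((si - sbar) ** 2 for si in s))
    return num / den


# Hypothetical data; the two values of 20 share the average rank 2.5.
x = [10, 20, 20, 30, 40]
y = [1, 3, 2, 5, 4]
print(average_ranks(x))
print(round(spearman_rho(x, y), 3))
```

Because only the ranks enter the calculation, replacing the largest x value by an extreme outlier would leave ρs unchanged.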

For Practice. For our Human versus Pet Drug Cost data, calculate Spearman’s Rank Correlation
Coefficient.

Solution:

From the Correlations table above, Spearman’s correlation coefficient is 0.464.

Once we have determined that a linear relationship between the independent and the dependent
variable explains a significant proportion of the variability in the dependent variable, we can use
the following linear model as a method for determining the line of best fit or linear regression
equation for the data set.

1.13 Univariate Linear Regression

Once we have determined that a linear relationship exists between the two variables of interest,
we would like to determine the linear function which best fits the data, that is we would like to
determine a line of best fit through the data points. This line of best fit is also called a regression
line. One reason for determining the regression line is, under certain conditions, it can be used
to predict the dependent variable given a specific value for the independent variable.

We refer to the dependent (Y) variable as the response variable or the outcome variable. The
response variable is the variable that we want to predict using the values of other variables.
These other variables are referred to as the predictor variables or the independent variables and
are usually denoted as X1 , X2 , ....

When we have one independent variable X1 , we generally drop the subscript and refer to it simply
as X. Then, the model we use relating the response Y to the predictor variable X is

Yi = β0 + β1 Xi + εi , i = 1, ..., n,

where Yi denotes the response corresponding to the i’th experimental run in which the predictor
variable X has the value Xi and εi are the unknown error components that are assumed to be
independent and normally distributed with mean zero and an unknown standard deviation σ; and
the parameters β0 and β1 , which together define the straight line, are also unknown. β1 is referred
to as the regression coefficient. It is the estimated mean change in Y per unit of change in X.

According to the above model, the observation yi corresponding to the input value xi is one
observation from a normal distribution with mean β0 + β1 xi and standard deviation σ.

In regression, there are two goals. The first goal is to develop an equation by which the average
value of a particular random variable (Y) can be estimated or predicted based on the knowledge
of values of the predictor variables. The second goal is to quantify the relationship of one or more
independent variables to a dependent variable.

Of course, we will rarely know the true values of the parameters β0 and β1 . To
estimate β0 and β1 , we can use The Principle of Least-Squares which determines the values for
the parameters so that the overall discrepancy

$$D = \sum (\text{Observed response} - \text{Predicted response})^2$$

is minimized. The values for the parameters β0 and β1 that minimize D are referred to as the
least-squares estimates.

The linear function formed using these estimates is referred to as the least-squares regression line.
The least-squares regression line is nothing but the line that minimizes the sum of the squared vertical
distances between the observed values of Y and those predicted by the line

ŷi = β̂0 + β̂1 xi ,

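where the hat (ˆ) indicates an estimate;

$$\hat{\beta}_1 = \frac{n\sum x_i y_i - \left(\sum x_i\right)\left(\sum y_i\right)}{n\sum x_i^2 - \left(\sum x_i\right)^2} = \frac{S_{xy}}{S_{xx}};$$

$s_x^2$ is the sample variance of the X variable data;

$$S_{xx} = (n-1)s_x^2;$$

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x};$$

and

$$s_e^2 = \frac{\sum e_i^2}{n-2} = \frac{\sum (y_i-\hat{y}_i)^2}{n-2} = \frac{SSE}{n-2}$$

is the estimate for σ².

$$SE(\hat{\beta}_0) = s_e\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}} \quad\text{and}\quad SE(\hat{\beta}_1) = \frac{s_e}{\sqrt{S_{xx}}}$$

are the standard errors associated with the estimates β̂0 and β̂1 respectively.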
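
As a check on these formulas, the Python sketch below computes the least-squares estimates and their standard errors for the midpoint age and median blood pressure data used in the worked example later in this section. The function name and the dictionary keys are hypothetical.

```python
from math import sqrt


def least_squares_fit(xs, ys):
    """Least-squares estimates and standard errors for y = b0 + b1 * x."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    s_xx = sum((x - xbar) ** 2 for x in xs)
    s_xy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    b1 = s_xy / s_xx                       # slope estimate S_xy / S_xx
    b0 = ybar - b1 * xbar                  # intercept estimate
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s_e = sqrt(sse / (n - 2))              # estimate of sigma
    se_b0 = s_e * sqrt(1 / n + xbar ** 2 / s_xx)
    se_b1 = s_e / sqrt(s_xx)
    return {"b0": b0, "b1": b1, "s_e": s_e, "se_b0": se_b0, "se_b1": se_b1}


ages = [35, 45, 55, 65, 75]
pressures = [117, 131, 140, 155, 169]
fit = least_squares_fit(ages, pressures)
for name, value in fit.items():
    print(f"{name} = {value:.4f}")
```

The slope and intercept should agree with the regression line ŷ = 72.0 + 1.28x reported from the SPSS output in the example.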

Note:

1. β̂0 and β̂1 are the least-squares estimates for β0 and β1 .

2. β̂1 is the estimated slope of the least-squares regression line.

3. β̂0 is the y-intercept of the least-squares regression line.

4. the y-intercept only has meaning if ZERO is a possible value for the predictor variable AND
there are observed values of the predictor variable near ZERO.

5. interpolation (extrapolation) is the use of a regression line for prediction within (outside) the
range of values of the independent variable X that were observed and used to obtain the regression
line.

For Discussion. Why should we be leery of extrapolated values?

The assumptions that must be true in order to use a linear regression model:

1. Existence: For any fixed value of X, Y is a random variable with a specific probability
distribution that has a finite mean and finite variance that depend on X.

2. Independence: The y-values are statistically independent of one another. This cannot be
assumed in the case of longitudinal studies.

3. Linearity: The mean value of the variable Y is a straight line that is a function of X.

4. Homoscedasticity: The variance of Y is the same for every value for X. Further, the errors
must be normally distributed with mean 0 and variance σ 2 .

5. For each fixed value for X, Y has a normal distribution.

For Practice. Blood pressure and age were measured for female patients. The patients were
then grouped by age and, for each of the age groups, the median Blood Pressure measurement
was computed. The data are summarized below:

Age Group:                   [30,40)   [40,50)   [50,60)   [60,70)   [70,80)
Midpoint Age Group (X):         35        45        55        65        75
Median Blood Pressure (Y):     117       131       140       155       169

Draw a scatter plot, calculate Pearson’s correlation coefficient, and calculate a least-squares
regression line for the above data.

Solution: The scatter plot is:

From the Correlations table above, Pearson’s correlation coefficient is 0.997.

Using the following SPSS output,

we can determine a least-squares regression line to be

ŷi = 72.0 + 1.28xi ,

for x ∈ [35, 75].

For Discussion. Why is it dangerous to use the equation

ŷi = 72.0 + 1.28xi ,

outside the interval [35,75]?

For Discussion. Is the regression line

ŷi = 72.0 + 1.28xi

the best line?

1.13.1 Inferences about β1

To test the null hypothesis H0 : β1 = k against one of the three alternatives, we can use our
hypothesis recipe with the test statistic

$$t = \frac{\hat{\beta}_1 - k}{s_e/\sqrt{S_{xx}}} \quad\text{with } df = n-2.$$

We expect the above test statistic to have a Student’s t-distribution with df=n-2 degrees of
freedom.

A 100 ∙ (1 − α)% confidence interval for β1 is given by

$$\left(\hat{\beta}_1 - t_{\alpha/2}(df)\,\frac{s_e}{\sqrt{S_{xx}}},\; \hat{\beta}_1 + t_{\alpha/2}(df)\,\frac{s_e}{\sqrt{S_{xx}}}\right),$$

where tα/2 (df ) is a critical value from a Student’s t-distribution with df=n-2 degrees of freedom.
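
The slope test statistic and confidence interval can be computed as in the following Python sketch, again using the age and median blood pressure data from the worked example. The critical value t0.025(3) = 3.1824 is taken from the prediction interval example later in the notes and is passed in as a parameter; the function name is hypothetical.

```python
from math import sqrt


def slope_inference(xs, ys, t_crit, k=0.0):
    """Return (t statistic for H0: beta1 = k, confidence interval for beta1)."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    s_xx = sum((x - xbar) ** 2 for x in xs)
    s_xy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    b1 = s_xy / s_xx
    b0 = ybar - b1 * xbar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    se_b1 = sqrt(sse / (n - 2)) / sqrt(s_xx)      # s_e / sqrt(S_xx)
    t_stat = (b1 - k) / se_b1
    return t_stat, (b1 - t_crit * se_b1, b1 + t_crit * se_b1)


ages = [35, 45, 55, 65, 75]
pressures = [117, 131, 140, 155, 169]
t_slope, ci_slope = slope_inference(ages, pressures, t_crit=3.1824)
print(f"t = {t_slope:.3f}, 95% CI for beta1 = ({ci_slope[0]:.3f}, {ci_slope[1]:.3f})")
```

A confidence interval that excludes k leads to the same decision as the two-sided test of H0 : β1 = k at the matching level α.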

For Practice. Referring back to our Age-Median Blood pressure example, should “median age”
be included in our model? Use α = 0.05.

Solution: Research Question: Is an individual’s age linearly related to one’s blood pressure?

Hypotheses to be tested:

H0 : β1 = 0

HA : β1 ≠ 0

Hypothesis Test to be used: t-test for a regression coefficient

Assumptions to be tested: Assume that all the required assumptions hold.

Level of Significance: α=0.05

The value of the test statistic and p-value: From the following SPSS output,

the value of the test statistic is T=23.634 with df=3 and the associated p-value< 0.001.

Statistical Decision: Since the p-value< 0.001 < 0.05 = α, we reject H0 .

Conclusion: At the α=0.05 level of significance, we have evidence to conclude that the true
regression coefficient associated with the Median Age variable is not zero (p-value <0.001). Hence
the Median Age variable needs to be in the model.

1.13.2 Inferences about β0

To test the null hypothesis H0 : β0 = k against one of the three alternatives, we can use our
hypothesis recipe with the test statistic

$$t = \frac{\hat{\beta}_0 - k}{s_e\sqrt{\dfrac{1}{n} + \dfrac{\bar{x}^2}{S_{xx}}}} \quad\text{with } df = n-2.$$

We expect the above test statistic to have a Student’s t-distribution with df=n-2 degrees of
freedom.

A 100 ∙ (1 − α)% confidence interval for β0 is given by

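$$\left(\hat{\beta}_0 - t_{\alpha/2}(df)\,s_e\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}},\; \hat{\beta}_0 + t_{\alpha/2}(df)\,s_e\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}}\right),$$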

where tα/2 (df ) is a critical value from a Student’s t-distribution with df=n-2 degrees of freedom.
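
The intercept calculations follow the same pattern. The Python sketch below applies them to the age and median blood pressure data from the worked example, using the critical value 3.1824 (df = 3, two-sided α = 0.05) quoted later in the notes; the function name is hypothetical.

```python
from math import sqrt


def intercept_inference(xs, ys, t_crit, k=0.0):
    """Return (t statistic for H0: beta0 = k, confidence interval for beta0)."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    s_xx = sum((x - xbar) ** 2 for x in xs)
    s_xy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    b1 = s_xy / s_xx
    b0 = ybar - b1 * xbar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s_e = sqrt(sse / (n - 2))
    se_b0 = s_e * sqrt(1 / n + xbar ** 2 / s_xx)
    t_stat = (b0 - k) / se_b0
    return t_stat, (b0 - t_crit * se_b0, b0 + t_crit * se_b0)


ages = [35, 45, 55, 65, 75]
pressures = [117, 131, 140, 155, 169]
t_int, ci_int = intercept_inference(ages, pressures, t_crit=3.1824)
print(f"t = {t_int:.3f}, 95% CI for beta0 = ({ci_int[0]:.2f}, {ci_int[1]:.2f})")
```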

For Practice. Referring back to our Age-Median Blood pressure example, should the y-intercept
be included in our model? Use α = 0.05.

Solution: Research Question: Is the true y-intercept of the regression line different from zero?

Hypotheses to be tested:

H0 : β0 = 0

HA : β0 ≠ 0

Hypothesis Test to be used: t-test for a regression intercept

Assumptions to be tested: Assume that all the required assumptions hold.

Level of Significance: α=0.05

The value of the test statistic and p-value: From the following SPSS output,

the value of the test statistic is T=23.409 with df=3 and the associated p-value< 0.001.

Statistical Decision: Since the p-value< 0.001 < 0.05 = α, we reject H0 .

Conclusion: At the α=0.05 level of significance, we have evidence to conclude that the true
regression intercept is not zero (p-value<0.001). Hence the intercept needs to be in the model.

In the last two For Practice questions, we established that the y-intercept (β0 ) and the
regression coefficient associated with the Age variable (β1 ) were both non-zero. Consequently
the model that we should fit to the data is

Y = β0 + β1 X + ε

and the corresponding estimated regression equation is

ŷi = 72.0 + 1.28xi , xi ∈ [35, 75].

To recap the model building process: The first step is to determine whether or not the independent
and dependent variables are statistically correlated. If no, there is no model to build as the
independent variable is not a good predictor for the dependent variable. If yes, then the next step
is to check if the coefficient of determination is greater than 0.50 (or equivalently 50%). If it is
not greater than 0.5 (50%), then there is no model to build, as the independent variable is not
a good predictor for the dependent variable. If it is greater than 0.5 (50%), then you perform
the regression to determine which of the slope and y-intercept are statistically significant. If
you follow this process, the independent variable should end up being statistically significant and
included in the model. There is no guarantee that the y-intercept will be statistically significant.

1.13.3 Inferences about an expected response

Given a value x0 , we can estimate the expected response using ŷ = β̂0 + β̂1 x0 . BUT this is only a sample estimate
for the expected response. We can form a 100 ∙ (1 − α)% confidence interval for the expected
(or true) response by using the formula
$$\left(\hat{\beta}_0 + \hat{\beta}_1 x_0 - t_{\alpha/2}(df)\,s_e\sqrt{\frac{1}{n} + \frac{(x_0-\bar{x})^2}{S_{xx}}},\; \hat{\beta}_0 + \hat{\beta}_1 x_0 + t_{\alpha/2}(df)\,s_e\sqrt{\frac{1}{n} + \frac{(x_0-\bar{x})^2}{S_{xx}}}\right),$$

where tα/2 (df ) is a critical value from a Student’s t-distribution with df=n-2 degrees of freedom.

For Discussion. Does this interval look familiar?

For Practice. Estimate the median blood pressure you would expect a randomly selected 40 year
old to have. Also compute a corresponding 95% confidence interval for this expected response.

What if we were interested not in the expected response for the entire population with value x0
but in the response of a single individual with value x0? We can compute a 100 ∙ (1 − α)% interval
for this individual prediction using

\[
\left( \hat{\beta}_0 + \hat{\beta}_1 x_0 - t_{\alpha/2}(df)\, s_e \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}},\;
       \hat{\beta}_0 + \hat{\beta}_1 x_0 + t_{\alpha/2}(df)\, s_e \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}} \right),
\]

where tα/2 (df ) is a critical value from a Student’s t-distribution with df=n-2 degrees of freedom.
This is called a prediction interval.

For Practice. What median blood pressure would you expect your biostatistics professor to have
when he turned 40? Also compute a corresponding 95% prediction interval.

Solution: We would expect Michael to have a median blood pressure of

\[ \hat{y} = 72.0 + 1.28(40) = 123.2 \]

when he turned 40. A corresponding 95% prediction interval would be given by

\[
72.0 + 1.28(40) \pm 3.1824\,(1.7127)\sqrt{1 + \frac{1}{5} + \frac{(40 - 55)^2}{(5 - 1)(15.81139)^2}} = 123.2 \pm 6.5.
\]
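
The arithmetic in this solution can be checked with a few lines of code. This is a sketch only: the function name is hypothetical, the critical value 3.1824 is taken from the solution rather than computed, and Sxx is formed as (n − 1)s_x² exactly as in the denominator above.

```python
import math


def prediction_interval(b0, b1, x0, t_crit, s_e, n, x_bar, s_xx):
    """Point prediction and half-width of the prediction interval at x0."""
    y_hat = b0 + b1 * x0
    half_width = t_crit * s_e * math.sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / s_xx)
    return y_hat, half_width


s_xx = (5 - 1) * 15.81139 ** 2  # (n - 1) * s_x^2
y_hat, half_width = prediction_interval(b0=72.0, b1=1.28, x0=40, t_crit=3.1824,
                                        s_e=1.7127, n=5, x_bar=55, s_xx=s_xx)
print(f"{y_hat:.1f} ± {half_width:.1f}")
```

The only difference from the confidence interval for the expected response is the extra 1 under the square root, which accounts for the variability of a single individual around the regression line.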

Now Your Turn: In a certain species of fish, females have the remarkable ability to change their
sex to male if too few males are in the population. This species of fish is used to study the
impact of the “estrogen-like” compounds introduced to natural water bodies through human
pollution. After a specified number of males, between three and nine, were randomly removed from
a fish tank, the number of females that changed their sex to male was counted. The data are
summarized in the following chart:

Number of males removed (x):               6   5   9   3   3   9   4   8   9   9

Number of females whose sex changed (y):   5   5   6   5   2   9   3   9  11   9

(a) Draw a scatter plot for the above data.

(b) Do you suspect that there is a linear relationship between the two variables? Justify your
response.

(c) Determine a 95% confidence interval for the true regression coefficient and the true y-value
of the y-intercept.

(d) Determine the best linear model representing the data.

(e) How many female fish would you expect to change sex whenever 4 males are removed from
the tank? Determine a 95% confidence interval for this estimated expected value.

(f) Suppose your biostatistics professor has a tank containing this fish species. Predict how
many females would change their sex after he removed 6 males from the tank, and give a
corresponding 95% prediction interval.

Solution: (a)

In the scatterplot above, as the number of male fish removed from the tank increases, the number
of female fish who change sex also increases. Hence one would expect there to be a positive
correlation between the number of male fish removed from the tank and the number of female
fish who change sex.

(b) From the Correlations table above, Pearson’s correlation coefficient is 0.857. Because the
p-value = 0.002 < 0.05 = α, there is evidence to reject the hypothesis that the true value of
Pearson’s correlation coefficient is 0 and to conclude that it is not 0. Combining this with the
fact that the estimate is 0.857, there is evidence suggesting a strong positive correlation
between the independent and dependent variables. Consequently there is evidence that a linear
relationship may exist between the two variables.

(c)

Referring to the Coefficients table above, a 95% confidence interval for the true y-value of the
y-intercept is (-3.266, 3.388).

Referring to the Coefficients table above, a 95% confidence interval for the true regression
coefficient is (0.496, 1.454).

(d) Referring to the Coefficients table in c), because the p-value=0.002 < 0.05 = α level of
significance, there is evidence to reject the hypothesis that the true slope is zero and conclude
that the true slope is not zero (p-value=0.002). Therefore the variable representing the number
of males removed from the tank needs to be included in the model.

Referring to the Coefficients table in c), because the p-value= 0.967 > 0.05 = α level of
significance, there is no evidence to reject the hypothesis that the true y-intercept is zero (p-
value=0.967). Therefore the y-intercept does not need to be included in the model.

Because the coefficient of determination (r² = 0.734) associated with the regression line

y = 0.061 + 0.975x

is less than the coefficient of determination (r² = 0.957) associated with the regression line
y = 0.983x, the model Y = β1 X + ε is the better model. Hence y = 0.983x is the sample
regression line which best predicts the number of female fish that change sex in response to a
particular number of male fish being removed from the tank.

The above analysis suggests that the best model would be of the form
Y = β1 X + ε.
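
The estimates quoted in parts (b) to (d) can be reproduced from the raw data with the usual least-squares formulas. The sketch below uses only the Python standard library; all variable names are illustrative. One assumption is flagged in the comments: the value 0.957 for the no-intercept model is recovered when the total sum of squares is taken about zero rather than about the mean of y, which is how statistical software commonly reports r² for a model without an intercept.

```python
import math

# Fish data from the exercise.
x = [6, 5, 9, 3, 3, 9, 4, 8, 9, 9]      # males removed
y = [5, 5, 6, 5, 2, 9, 3, 9, 11, 9]     # females that changed sex
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Model with intercept: y = b0 + b1 * x
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
r = s_xy / math.sqrt(s_xx * s_yy)       # Pearson's correlation coefficient
r2 = r ** 2

# Model through the origin: y = b1_origin * x
b1_origin = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi ** 2 for xi in x)
rss_origin = sum((yi - b1_origin * xi) ** 2 for xi, yi in zip(x, y))
# Assumption: r^2 measured about zero, not about y_bar.
r2_origin = 1 - rss_origin / sum(yi ** 2 for yi in y)

print(f"with intercept: y = {b0:.3f} + {b1:.3f}x, r = {r:.3f}, r^2 = {r2:.3f}")
print(f"through origin: y = {b1_origin:.3f}x, r^2 (about zero) = {r2_origin:.3f}")
```

Because the two r² values are measured against different baselines under this assumption, they are not on exactly the same scale, which is worth keeping in mind when comparing them.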

(e) We would expect

\[ \hat{y} = 0.983(4) = 3.932 \]

females to change sex. A corresponding 95% confidence interval would be given by

\[
0.983(4) \pm 2.262\,(1.52270)\sqrt{\frac{1}{10} + \frac{(4 - 6.5)^2}{(10 - 1)(2.59272)^2}}.
\]

(f) We would expect

\[ \hat{y} = 0.983(6) = 5.898 \]

females to change sex. A corresponding 95% prediction interval would be given by

\[
0.983(6) \pm 2.262\,(1.52270)\sqrt{1 + \frac{1}{10} + \frac{(6 - 6.5)^2}{(10 - 1)(2.59272)^2}}.
\]

Note that the degrees of freedom in (e) and (f) are n − 1, not n − 2, because there is only one
parameter in the model.

1.13.4 Diagnostics for the least-squares regression line

The coefficient of determination is r² and measures the proportion of the total variation in the
response variable that is explained by the least-squares regression line.

The adjusted coefficient of determination is r²_adj and is used for model selection.

Note that r²_adj < r². The adjusted coefficient of determination cannot be interpreted exactly
like the coefficient of determination. At best, it can be read as saying that the least-squares
regression line explains at least r²_adj × 100% of the total variation in the response variable.
Note the words “at least” in this definition.
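
The notes do not state how the adjusted coefficient of determination is computed. A minimal sketch, assuming the standard textbook definition r²_adj = 1 − (1 − r²)(n − 1)/(n − p − 1) for a model with an intercept and p independent variables, is given below. The function name is hypothetical, and statistical software may use a different convention for models without an intercept.

```python
def adjusted_r2(r2, n, p):
    """Adjusted r^2 under the standard definition (assumed, not from the notes).

    r2: coefficient of determination
    n:  number of observations
    p:  number of independent variables (intercept not counted)
    """
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# Fish example with intercept: r^2 = 0.734, n = 10, one independent variable.
r2_adj = adjusted_r2(r2=0.734, n=10, p=1)
print(f"adjusted r^2 = {r2_adj:.3f}")
```

The penalty factor (n − 1)/(n − p − 1) is larger than 1, which is why the adjusted value falls below r² whenever r² is less than 1.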

For Practice. For the Blood Pressure and Age data in the previous example, determine the
coefficient of determination.

A residual is the difference between an observed value of the response variable and the value
predicted by the fitted curve, that is for the i’th observation, the residual is ri = yi − ŷi .

Residuals calculated using the least-squares regression line are called least-squares residuals.
When the model includes an intercept, the mean of the least-squares residuals is always zero, so
the residuals cannot simply be averaged to measure how well the line fits. Hence we calculate
the residual sum of squares (RSS):

\[ RSS = \sum_i r_i^2 = \sum_i (y_i - \hat{y}_i)^2, \]

which is a measure of the goodness of fit of the line to the data.

RSS measures the variability in the dependent variable that is NOT explained by the independent
variable. Consequently RSS is also abbreviated SSunexplained .

The total sum of squares, abbreviated SStotal, is calculated using

\[ SS_{total} = \sum_i (y_i - \bar{y})^2, \]

which is the total variability in the response variable Y ignoring the independent variable X.

The sum of squares measuring the total variability in Y explained by the independent variable X,
abbreviated SSexplained, is calculated using

\[ SS_{explained} = \sum_i (\hat{y}_i - \bar{y})^2. \]

Note that
SStotal = SSexplained + SSunexplained .
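
As an illustration, the decomposition can be checked numerically on the fish data from the Now Your Turn exercise, using the least-squares line with an intercept. This is a sketch with illustrative variable names; the identity relies on the intercept being in the model.

```python
# Fish data from the Now Your Turn exercise.
x = [6, 5, 9, 3, 3, 9, 4, 8, 9, 9]
y = [5, 5, 6, 5, 2, 9, 3, 9, 11, 9]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares estimates for y = b0 + b1 * x.
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar

fitted = [b0 + b1 * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

ss_total = sum((yi - y_bar) ** 2 for yi in y)
ss_explained = sum((fi - y_bar) ** 2 for fi in fitted)
ss_unexplained = sum(ri ** 2 for ri in residuals)   # this is RSS

print(f"mean residual  = {sum(residuals) / n:.6f}")
print(f"SStotal        = {ss_total:.3f}")
print(f"SSexplained    = {ss_explained:.3f}")
print(f"SSunexplained  = {ss_unexplained:.3f}")
```

The residuals average to zero up to floating-point rounding, and the explained and unexplained sums of squares add up to the total.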

For Practice. For the Blood Pressure and Age data, calculate SStotal , SSexplained , SSunexplained .

Solution:

Referring to the above table, SStotal = 1647.2, SSexplained = 1638.4, SSunexplained = 8.8.
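
These three sums of squares are enough to recover two quantities used earlier in the section. The sketch below assumes the standard relations r² = SSexplained / SStotal and s_e = sqrt(SSunexplained / (n − 2)), with n = 5 as in the worked prediction interval; the variable names are illustrative.

```python
import math

# Sums of squares reported for the Blood Pressure and Age data.
ss_total = 1647.2
ss_explained = 1638.4
ss_unexplained = 8.8
n = 5  # sample size used in the worked prediction interval

# Proportion of total variation explained by the regression line.
r2 = ss_explained / ss_total

# Residual standard error, with n - 2 degrees of freedom
# (two estimated parameters: intercept and slope).
s_e = math.sqrt(ss_unexplained / (n - 2))

print(f"r^2 = {r2:.3f}")
print(f"s_e = {s_e:.4f}")
```

The value of s_e obtained this way agrees with the 1.7127 used in the prediction interval for a 40-year-old.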

A residual plot is a scatter plot of the regression residuals against the independent variable.
Residual plots help us assess how well a regression line fits the data.

1) If a plot of the residuals against the predictor variable shows a discernible pattern, then the
predictor and response variables may not be linearly related.

2) If a plot of the residuals against the predictor variable shows the spread of the residuals
increasing or decreasing as the predictor variable increases, then an assumption of the least-
squares linear model is violated. This is the constant variance of the errors assumption.

3) If a plot of the residuals against the predictor variable shows that one residual is much larger
or smaller than all the other residuals, then the observation used to calculate this residual may
be an outlier.

For Practice. Draw a residual plot for the Age vs Median Blood Pressure example.

Solution:

Chapter 2

Analysis of Variance

Chapter 3

Non-parametric Methods

Chapter 4

Multivariate Linear Regression

4.1 Overview

In Module I: Review, we discussed fitting a simple linear model (i.e. one continuous dependent
and one continuous independent variable) to a set of data. Quite often, this single independent
variable does not predict the dependent variable well. In fact, the dependent variable is
frequently influenced by many independent variables simultaneously, and it is this combination
of independent variables that leads to a good predictive model for the dependent variable. This
module provides you with an introduction to modeling the relationship between one continuous
dependent variable and several independent variables. In addition to quantifying the
relationship between the independent variables and the dependent variable, the estimated model
can be used to predict the response of the dependent variable for a given set of values of the
independent variables.
