http://www.netnam.vn/unescocourse/statistics/stat_frm.htm
MODULE “STATISTICAL DATA ANALYSIS”
By Dr. Dang Quang A and Dr. Bui The Hong
Institute of Information Technology
Hoang Quoc Viet road, Cau giay, HANOI
Preface
Statistics is the science of collecting, organizing and interpreting numerical and nonnumerical
facts, which we call data.
The collection and study of data are important in the work of many professions, so training in
the science of statistics is valuable preparation for a variety of careers, for example, those of
economists and financial advisors, businessmen, engineers and farmers.
Knowledge of probability and statistical methods is also useful for informatics specialists in
various fields such as data mining, knowledge discovery, neural networks and fuzzy systems.
Whatever else it may be, statistics is first and foremost a collection of tools used for converting
raw data into information to help decision makers in their work.
The science of data, statistics, is the subject of this course.
Chapter 1 is an introduction to the statistical analysis of data. Chapters 2 and 3 deal with
statistical methods for presenting and describing data. Chapters 4 and 5 introduce the basic
concepts of probability and probability distributions, which are the foundation for our study of
statistical inference in later chapters. Sampling and sampling distributions are the subject of
Chapter 6. The remaining seven chapters discuss statistical inference: methods for drawing
conclusions from properly produced data. Chapter 7 deals with estimating the characteristics of a
population by observing the characteristics of a sample. Chapters 8 to 13 describe some of the
most common methods of inference: for drawing conclusions about means, proportions and
variances from one and two samples, about relations in categorical data, regression and
correlation, and analysis of variance. In every chapter we include examples to illustrate the
concepts and methods presented. The use of computer packages such as SPSS and
STATGRAPHICS will be involved.
Audience
This tutorial, an introductory course in statistics, is intended mainly for users such as
engineers, economists and managers who need to use statistical methods in their work, and for
students. However, many aspects will also be useful for computer trainers.
Objectives
Understanding statistical reasoning
Mastering basic statistical methods for analyzing data such as descriptive and inferential
methods
Ability to use methods of statistics in practice with the help of computer software
Entry requirements
High school algebra course (+elements of calculus)
Elementary computer skills
CONTENTS
Chapter 1 Introduction
1.1 What is Statistics
1.2 Populations and samples
1.3 Descriptive and inferential statistics
1.4 Brief history of statistics
1.5 Computer software for statistical analysis
Chapter 2 Data presentation
2.1 Introduction
2.2 Types of data
2.3 Qualitative data presentation
2.4 Graphical description of qualitative data
2.5 Graphical description of quantitative data: Stem and Leaf displays
2.6 Tabulating quantitative data: Relative frequency distributions
2.7 Graphical description of quantitative data: histogram and polygon
2.8 Cumulative distributions and cumulative polygons
2.9 Summary
2.10 Exercises
Chapter 3 Data characteristics: descriptive summary statistics
3.1 Introduction
3.2 Types of numerical descriptive measures
3.3 Measures of location (or measures of central tendency)
3.4 Measures of data variation
3.5 Measures of relative standing
3.6 Shape
3.7 Methods for detecting outliers
3.8 Calculating some statistics from grouped data
3.9 Computing descriptive summary statistics using computer software
3.10 Summary
3.11 Exercises
Chapter 4 Probability: Basic concepts
4.1 Experiment, Events and Probability of an Event
4.2 Approaches to probability
4.3 The field of events
4.4 Definitions of probability
4.5 Conditional probability and independence
4.6 Rules for calculating probability
4.7 Summary
4.8 Exercises
Chapter 5 Basic Probability distributions
5.1 Random variables
5.2 The probability distribution for a discrete random variable
5.3 Numerical characteristics of a discrete random variable
5.4 The binomial probability distribution
5.5 The Poisson distribution
5.6 Continuous random variables: distribution function and density function
5.7 Numerical characteristics of a continuous random variable
5.8 Normal probability distribution
5.10 Exercises
Chapter 6 Sampling Distributions
6.1 Why the method of sampling is important
6.2 Obtaining a Random Sample
6.3 Sampling Distribution
6.4 The sampling distribution of x̄: the Central Limit Theorem
6.5 Summary
6.6 Exercises
Chapter 7 Estimation
7.1 Introduction
7.2 Estimation of a population mean: Large-sample case
7.3 Estimation of a population mean: small sample case
7.4 Estimation of a population proportion
7.5 Estimation of the difference between two population means
7.6 Estimation of the difference between two population means: Matched pairs
7.7 Estimation of the difference between two population proportions
7.8 Choosing the sample size
7.9 Estimation of a population variance
7.10 Summary
7.11 Exercises
Chapter 8 Hypothesis Testing
8.1 Introduction
8.2 Formulating Hypotheses
8.3 Types of errors for a Hypothesis Test
8.4 Rejection Regions
8.5 Summary
8.6 Exercises
Chapter 9 Applications of Hypothesis Testing
9.1 Introduction
9.2 Hypothesis test about a population mean
9.3 Hypothesis tests of population proportions
9.4 Hypothesis tests about the difference between two population means
9.5 Hypothesis tests about the difference between two proportions
9.6 Hypothesis test about a population variance
9.7 Hypothesis test about the ratio of two population variances
9.8 Summary
9.9 Exercises
Chapter 10 Categorical data analysis and analysis of variance
10.1 Introduction
10.2 Tests of goodness of fit
10.3 The analysis of contingency tables
10.4 Contingency tables in statistical software packages
10.5 Introduction to analysis of variance
10.6 Design of experiments
10.7 Completely randomized designs
10.8 Randomized block designs
10.9 Multiple comparisons of means and confidence regions
10.10 Summary
10.11 Exercises
Chapter 11 Simple Linear regression and correlation
11.1 Introduction: Bivariate relationships
11.2 Simple Linear regression: Assumptions
11.3 Estimating A and B: the method of least squares
11.4 Estimating σ²
11.5 Making inferences about the slope, B
11.6 Correlation analysis
11.7 Using the model for estimation and prediction
11.8 Simple Linear Regression: An Example
11.9 Summary
11.10 Exercises
Chapter 12 Multiple regression
12.1 Introduction: the general linear model
12.2 Model assumptions
12.3 Fitting the model: the method of least squares
12.4 Estimating σ²
12.5 Estimating and testing hypotheses about the B parameters
12.6 Checking the utility of a model
Figure 12.3 STATGRAPHICS Printout for Electrical Usage Example
12.7 Using the model for estimating and prediction
12.8 Multiple linear regression: An overview example
12.8 Model building: interaction models
12.9 Model building: quadratic models
12.11 Summary
12.12 Exercises
Chapter 13 Nonparametric statistics
13.1 Introduction
13.2 The sign test for a single population
13.3 Comparing two populations based on independent random samples
13.4 Comparing two populations based on matched pairs
13.5 Comparing populations using a completely randomized design
13.6 Rank Correlation: Spearman’s r_s statistic
13.7 Summary
13.8 Exercises
Reference
Index
Appendixes
THE STATISTICAL ANALYSIS OF DATA
Chapter 1 Introduction
CONTENTS
1.1. What is Statistics?
1.2. Populations and samples
1.3. Descriptive and inferential statistics
1.4. Brief history of statistics
1.5. Computer software for statistical analysis
1.1 What is Statistics
The word statistics in our everyday life means different things to different people. To a football
fan, statistics are the information about rushing yardage, passing yardage, and first downs
given at halftime. To a manager of a power generating station, statistics may be information
about the quantity of pollutants being released into the atmosphere. To a school principal,
statistics are information on absenteeism, test scores and teacher salaries. To a medical
researcher investigating the effects of a new drug, statistics are evidence of the success of
research efforts. And to a college student, statistics are the grades made on all the quizzes in a
course this semester.
Each of these people is using the word statistics correctly, yet each uses it in a slightly different
way and for a somewhat different purpose. Statistics is a word that can refer to quantitative data
or to a field of study.
As a field of study, statistics is the science of collecting, organizing and interpreting numerical
facts, which we call data. We are bombarded by data in our everyday life. The collection and
study of data are important in the work of many professions, so training in the science of
statistics is valuable preparation for a variety of careers. Each month, for example, government
statistical offices release the latest numerical information on unemployment and inflation.
Economists and financial advisors as well as policy makers in government and business study
these data in order to make informed decisions. Farmers study data from field trials of new crop
varieties. Engineers gather data on the quality and reliability of manufactured products. Most
areas of academic study make use of numbers, and therefore also make use of the methods of
statistics.
Whatever else it may be, statistics is, first and foremost, a collection of tools used for converting
raw data into information to help decision makers in their work.
The science of data, statistics, is the subject of this course.
1.2 Populations and samples
In statistics, the data set that is the target of your interest is called a population. Note that a
statistical population does not refer to people, as in our everyday usage of the term; it refers to a
collection of data.
Definition 1.1
A population is a collection (or set) of data that describes some phenomenon of
interest to you.
Definition 1.2
A sample is a subset of data selected from a population.
Example 1.1 The population may be all women in a country, for example, in Vietnam. If from
each city or province we select 50 women, then the set of selected women is a sample.
Example 1.2 The set of all whisky bottles produced by a company is a population. For
quality control, 150 whisky bottles are selected at random. This portion is a sample.
1.3 Descriptive and inferential statistics
If you have every measurement (or observation) of the population in hand, then statistical
methodology can help you to describe this typically large set of data. We will find graphical and
numerical ways to make sense out of a large mass of data. The branch of statistics devoted to
this application is called descriptive statistics.
Definition 1.3
The branch of statistics devoted to the summarization and description of data
(population or sample) is called descriptive statistics.
If it is too expensive or impossible to acquire every measurement in the
population, then we will want to select a sample of data from the population and use the sample
to infer the nature of the population.
Definition 1.4
The branch of statistics concerned with using sample data to make an inference
about a population of data is called inferential statistics.
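To make the distinction concrete, the following minimal sketch revisits Example 1.2. The population size (10,000 bottles), the measured quantity (fill volume in ml) and its simulated values are assumptions made purely for illustration; only the sample size of 150 comes from the example. Computing the mean of all values is a descriptive task, while using the mean of the random sample to estimate the population mean is an inferential one.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so that the sketch is reproducible

# Hypothetical population: fill volumes (ml) of 10,000 bottles (assumed values).
population = [rng.gauss(700, 5) for _ in range(10_000)]

# A sample: 150 bottles selected at random, without replacement, as in Example 1.2.
sample = rng.sample(population, 150)

# Descriptive statistics: every measurement is in hand, so we simply summarize it.
population_mean = statistics.mean(population)

# Inferential statistics: only the sample is observed; its mean estimates the population mean.
sample_mean = statistics.mean(sample)

print(f"population mean = {population_mean:.2f} ml")
print(f"sample mean     = {sample_mean:.2f} ml")
```

In practice the population mean would be unknown; it is computed here only so that the quality of the estimate can be seen.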
1.4 Brief history of statistics
The word statistik comes from the Italian word statista (meaning “statesman”). It was first used
by Gottfried Achenwall (1719-1772), a professor at Marburg and Göttingen. Dr. E.A.W.
Zimmermann introduced the word statistics to England. Its use was popularized by Sir John
Sinclair in his work “Statistical Account of Scotland 1791-1799”. Long before the eighteenth
century, however, people had been recording and using data.
Official government statistics are as old as recorded history. The emperor Yao took a
census of the population in China in the year 2238 B.C. The Old Testament contains several
accounts of census taking. Governments of ancient Babylonia, Egypt and Rome gathered detailed
records of population and resources. In the Middle Ages, governments began to register the
ownership of land. In A.D. 762 Charlemagne asked for detailed descriptions of church-owned
properties. Early in the ninth century, he completed a statistical enumeration of the serfs
attached to the land. About 1086, William the Conqueror ordered the writing of the Domesday
Book, a record of the ownership, extent, and value of the lands of England. This work was
England’s first statistical abstract.
Because of Henry VIII’s fear of the plague, England began to register its dead in 1532. About
this same time, French law required the clergy to register baptisms, deaths and marriages.
During an outbreak of the plague in the late 1500s, the English government started publishing
weekly death statistics. This practice continued, and by 1632 these Bills of Mortality listed births
and deaths by sex. In 1662, Captain John Graunt used thirty years of these Bills to make
predictions about the number of persons who would die from various diseases and the
proportion of male and female births that could be expected. Summarized in his work, Natural
and Political Observations ...Made upon the Bills of Mortality, Graunt’s study was a pioneer
effort in statistical analysis. For his achievement in using past records to predict future events,
Graunt was made a member of the original Royal Society.
The history of the development of statistical theory and practice is a lengthy one. We have only
begun to list the people who have made significant contributions to this field. Later we will
encounter others whose names are now attached to specific laws and methods. Many people
have brought to the study of statistics refinements or innovations that, taken together, form the
theoretical basis of what we will study in this course.
1.5 Computer software for statistical analysis
Many real problems have so much data that doing the calculations by hand is not feasible. For
this reason, most real-world statistical analysis is done on computers. You must prepare the
input data, interpret the results of the analysis and take appropriate action, but the machine
does all the “number crunching”. There are many widely used software packages for statistical
analysis. Below we list some of them.
• Minitab (registered trademark of Minitab, Inc., University Park, Pa.)
• SAS (registered trademark of SAS Institute, Inc., Cary, N.C.)
• SPSS (registered trademark of SPSS, Inc., Chicago)
• SYSTAT (registered trademark of SYSTAT, Inc., Evanston, Ill.)
• STATGRAPHICS (registered trademark of Statistical Graphics Corp., Maryland).
Besides the software listed above, it is possible to carry out simple statistical analysis of data by
using the “Data analysis” tool in Microsoft EXCEL.
Chapter 2 Data presentation
CONTENTS
2.1. Introduction
2.2. Types of data
2.3. Qualitative data presentation
2.4. Graphical description of qualitative data
2.5. Graphical description of quantitative data: Stem and Leaf displays
2.6. Tabulating quantitative data: Relative frequency distributions
2.7. Graphical description of quantitative data: histogram and polygon
2.8. Cumulative distributions and cumulative polygons
2.9. Summary
2.10. Exercises
2.1 Introduction
The objective of data description is to summarize the characteristics of a data set. Ultimately, we want to
make the data set more comprehensible and meaningful. In this chapter we will show how to construct
charts and graphs that convey the nature of a data set. The procedure that we will use to accomplish this
objective in a particular situation depends on the type of data that we want to describe.
2.2 Types of data
Data can be one of two types, qualitative and quantitative.
Definition 2.1
Quantitative data are observations measured on a numerical scale.
In other words, quantitative data are those that represent the quantity or amount of something.
Example 2.1 The height (in centimeters) and weight (in kilograms) of each student in a group are both
quantitative data.
Definition 2.2
Nonnumerical data that can only be classified into one of a group of categories are
said to be qualitative data.
In other words, qualitative data are those that have no quantitative interpretation, i.e., they can
only be classified into categories.
Example 2.2 The education level, nationality and sex of each person in a group of people are qualitative
data.
2.3 Qualitative data presentation
When describing qualitative observations, we define the categories in such a way that each
observation falls in one and only one category. The data set is then described by giving the
number of observations, or the proportion of the total number of observations, that fall in each of
the categories.
Definition 2.3
The category frequency for a given category is the number of observations that fall in
that category.
Definition 2.4
The category relative frequency for a given category is the proportion of the total
number of observations that fall in that category.
\[ \text{Relative frequency for a category} = \frac{\text{Number of observations falling in that category}}{\text{Total number of observations}} \]
Instead of the relative frequency for a category, one often uses the percentage for the category,
which is computed as follows:
Percentage for a category = Relative frequency for the category × 100%
Example 2.3 The classification of the students in a group by their score in the subject “Statistical
analysis” is presented in Table 2.0a. The table of frequencies for this data set, generated by
computer using the software SPSS, is shown in Figure 2.1.
Table 2.0a The classification of students (No. = number of the student)
No.  CATEGORY   No.  CATEGORY   No.  CATEGORY   No.  CATEGORY
1    Bad        13   Good       24   Good       35   Good
2    Medium     14   Excellent  25   Medium     36   Medium
3    Medium     15   Excellent  26   Bad        37   Good
4    Medium     16   Excellent  27   Good       38   Excellent
5    Good       17   Excellent  28   Bad        39   Good
6    Good       18   Good       29   Bad        40   Good
7    Excellent  19   Excellent  30   Good       41   Medium
8    Excellent  20   Excellent  31   Excellent  42   Bad
9    Excellent  21   Good       32   Excellent  43   Excellent
10   Excellent  22   Excellent  33   Excellent  44   Excellent
11   Bad        23   Excellent  34   Good       45   Good
12   Good
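The category frequencies, relative frequencies and percentages defined above can be computed directly from Table 2.0a. The following is a minimal sketch in Python rather than the SPSS procedure used in this course; the one-letter codes and the function name frequency_table are our own illustrative choices. It also accumulates the cumulative percentage that statistical packages usually report.

```python
from collections import Counter

# Scores of the 45 students of Table 2.0a, in order of student number.
# One-letter codes (our own abbreviation): B = Bad, M = Medium, G = Good, E = Excellent.
codes = (
    "BMMMGGEEEEBG"   # students 1-12
    "GEEEEGEEGEE"    # students 13-23
    "GMBGBBGEEEG"    # students 24-34
    "GMGEGGMBEEG"    # students 35-45
)
names = {"B": "Bad", "E": "Excellent", "G": "Good", "M": "Medium"}
scores = [names[c] for c in codes]


def frequency_table(observations):
    """Return rows (category, frequency, relative frequency, percentage, cumulative percentage)."""
    counts = Counter(observations)
    total = len(observations)
    rows = []
    cumulative = 0
    for category in sorted(counts):          # alphabetical order of the categories
        frequency = counts[category]
        cumulative += frequency
        rows.append((category,
                     frequency,
                     frequency / total,
                     100 * frequency / total,
                     100 * cumulative / total))
    return rows


table = frequency_table(scores)
for category, freq, rel, pct, cum in table:
    print(f"{category:<10}{freq:>4}{rel:>8.3f}{pct:>8.1f}{cum:>8.1f}")
```

Each observation is counted in exactly one category, so the frequencies add up to the total number of observations (45) and the percentages add up to 100%.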
2.4 Graphical description of qualitative data
Bar graphs and pie charts are two of the most widely used graphical methods for describing
qualitative data sets.
Bar graphs give the frequency (or relative frequency) of each category with the height or length
of the bar proportional to the category frequency (or relative frequency).
CATEGORY
                   Frequency   Percent   Valid Percent   Cumulative Percent
Valid  Bad              6        13.3        13.3              13.3
       Excellent       18        40.0        40.0              53.3
       Good            15        33.3        33.3              86.7
       Medium           6        13.3        13.3             100.0
       Total           45       100.0       100.0
Figure 2.1 Output from SPSS showing the frequency table for the variable
CATEGORY.
Example 2.4a (Bar Graph) The bar graph generated by computer using SPSS for the variable
CATEGORY is depicted in Figure 2.2.
[Bar graph of the categories Bad, Excellent, Good and Medium; frequency axis marked from 0 to 20]
Figure 2.2 Bar graph showing the number of students of each category
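The idea that the length of each bar is proportional to the category frequency can be illustrated without any graphics package. The sketch below is not the SPSS procedure; it prints a simple text bar graph from the frequencies of Figure 2.1, and the function name text_bar_graph and the choice of one "#" per student are assumptions made for illustration.

```python
# Category frequencies taken from Figure 2.1.
frequencies = {"Bad": 6, "Excellent": 18, "Good": 15, "Medium": 6}


def text_bar_graph(freqs, symbol="#"):
    """Return one text line per category; bar length equals the category frequency."""
    label_width = max(len(name) for name in freqs)
    lines = []
    for name, count in freqs.items():
        lines.append(f"{name:<{label_width}} | {symbol * count} {count}")
    return lines


bars = text_bar_graph(frequencies)
print("\n".join(bars))
```

Replacing the frequencies by relative frequencies changes only the scale of the axis, not the relative lengths of the bars.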
Pie charts divide a complete circle (a pie) into slices, each corresponding to a category, with the
central angle and hence the area of the slice proportional to the category relative frequency.
Example 2.4b (Pie Chart) The pie chart generated by computer using EXCEL CHARTS for the
variable CATEGORY is depicted in Figure 2.3.
[Pie chart with slices for the categories Bad, Excellent, Good and Medium]
Figure 2.3 Pie chart showing the number of students of each category
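Since the central angle of a slice is proportional to the category relative frequency, it equals 360° times that relative frequency. The short sketch below computes the angles for the frequencies of Figure 2.1; the function name central_angles is a hypothetical helper, not part of EXCEL or SPSS.

```python
# Category frequencies taken from Figure 2.1.
frequencies = {"Bad": 6, "Excellent": 18, "Good": 15, "Medium": 6}


def central_angles(freqs):
    """Central angle (in degrees) of each pie slice: 360 x relative frequency."""
    total = sum(freqs.values())
    return {name: 360 * count / total for name, count in freqs.items()}


angles = central_angles(frequencies)
for name, angle in angles.items():
    print(f"{name:<10}{angle:>7.1f} degrees")
```

For example, the category Excellent has relative frequency 18/45 = 0.4 and therefore a central angle of 0.4 × 360° = 144°.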
2.5 Graphical description of quantitative data: Stem and Leaf displays
One graphical method for describing quantitative data is the stem and leaf display, which is widely
used in exploratory data analysis when the data set is small.
To explain what a stem and a leaf are, consider the data in Table 2.0b. For a two-digit number,
for example 79, we designate the first digit (7) as its stem and call the last digit (9) its leaf.
For a three-digit number, for example 112, we designate the first two digits (11) as its stem
and again call the last digit (2) its leaf.
Steps to follow in constructing a Stem and Leaf Display
1. Divide each observation in the data set into two parts, the Stem and the Leaf.
2. List the stems in order in a column, starting with the smallest stem and ending with
the largest.
3. Proceed through the data set, placing the leaf for each observation in the
appropriate stem row.
Depending on the data, a display can use one, two or five lines per stem. Of these options,
two-line stems are widely used.
Example 2.5 The quantity of glucose in the blood of 100 persons is measured and recorded in
Table 2.0b (unit: mg %). Using SPSS we obtain the Stem-and-Leaf display shown in Figure 2.4 for this
data set.
Table 2.0b Quantity of glucose in blood of 100 persons (unit: mg %)
70 79 80 83 85 85 85 85 86 86
86 87 87 88 89 90 91 91 92 92
93 93 93 93 94 94 94 94 94 94
95 95 96 96 96 96 96 97 97 97
97 97 98 98 98 98 98 98 100 100
101 101 101 101 101 101 102 102 102 103
103 103 103 104 104 104 105 106 106 106
106 106 106 106 106 106 106 107 107 107
107 108 110 111 111 111 111 111 112 112
112 115 116 116 116 116 119 121 121 126
GLUCOSE Stem-and-Leaf Plot
 Frequency    Stem &  Leaf
     1.00 Extremes    (=<70)
     1.00        7 .  9
     2.00        8 .  03
    11.00        8 .  55556667789
    15.00        9 .  011223333444444
    18.00        9 .  556666677777888888
    18.00       10 .  001111112223333444
    16.00       10 .  5666666666677778
     9.00       11 .  011111222
     6.00       11 .  566669
     2.00       12 .  11
     1.00 Extremes    (>=126)
 Stem width:  10
 Each leaf:   1 case(s)
Figure 2.4 Output from SPSS showing the Stem-and-Leaf display for the data set of glucose
The stem and leaf display in Figure 2.4 partitions the data set into 12 classes, one per line of
the display. Two-line stems are used here, and the smallest and largest values are listed
separately as extremes. The number of leaves in each class gives the class frequency.
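The three construction steps can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions, not the SPSS algorithm: the function name stem_and_leaf is hypothetical, lines with no leaves are omitted, and the values 70 and 126 are displayed on ordinary stem lines instead of being flagged as extremes.

```python
# Glucose data of Table 2.0b (unit: mg %).
glucose = [
    70, 79, 80, 83, 85, 85, 85, 85, 86, 86,
    86, 87, 87, 88, 89, 90, 91, 91, 92, 92,
    93, 93, 93, 93, 94, 94, 94, 94, 94, 94,
    95, 95, 96, 96, 96, 96, 96, 97, 97, 97,
    97, 97, 98, 98, 98, 98, 98, 98, 100, 100,
    101, 101, 101, 101, 101, 101, 102, 102, 102, 103,
    103, 103, 103, 104, 104, 104, 105, 106, 106, 106,
    106, 106, 106, 106, 106, 106, 106, 107, 107, 107,
    107, 108, 110, 111, 111, 111, 111, 111, 112, 112,
    112, 115, 116, 116, 116, 116, 119, 121, 121, 126,
]


def stem_and_leaf(values, lines_per_stem=2):
    """Return the display as a list of (stem, leaves) rows.

    Step 1: split each observation into stem (value // 10) and leaf (value % 10).
    Step 2: order the stem lines from the smallest to the largest.
    Step 3: place each leaf in the appropriate stem line.
    """
    if lines_per_stem not in (1, 2, 5):
        raise ValueError("lines_per_stem must be 1, 2 or 5")
    leaves_per_line = 10 // lines_per_stem   # 5 leaf digits per line for two-line stems
    rows = {}
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        key = (stem, leaf // leaves_per_line)
        rows.setdefault(key, []).append(str(leaf))
    return [(stem, "".join(leaves)) for (stem, _), leaves in sorted(rows.items())]


rows = stem_and_leaf(glucose)
for stem, leaves in rows:
    # Class frequency, stem and leaves, laid out roughly like Figure 2.4.
    print(f"{len(leaves):>3}  {stem:>2} . {leaves}")
```

Apart from the two extreme values, the lines produced agree with those of Figure 2.4; calling stem_and_leaf(glucose, lines_per_stem=1) gives the coarser one-line display.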
Advantages of a stem and leaf display over a frequency distribution (considered in the
next section):
1. The original data are preserved.
2. A stem and leaf display arranges the data in an orderly fashion and makes it easy to
determine certain numerical characteristics to be discussed in the following chapter.
3. The classes and numbers falling in them are quickly determined once we have selected the
digits that we want to use for the stems and leaves.
Disadvantage of a stem and leaf display:
There is sometimes little flexibility in choosing the stems.
2.6 Tabulating quantitative data: Relative frequency distributions
Frequency distributions and relative frequency distributions are the tables most often used in
scientific publications to describe quantitative data sets. They are better suited than stem and
leaf displays to the description of large data sets, and they permit greater flexibility in the
choice of class widths.
A frequency distribution is a table that organizes data into classes. It shows the number of
observations from the data set that fall into each of the classes. It should be emphasized that we
always have in mind non-overlapping classes, i.e. classes without common items.
Steps for constructing a frequency distribution and relative frequency
distribution:
1. Decide the type and number of classes for dividing the data set, and the lower limit and
upper limit of the classes:
Lower limit < Minimum of values
Upper limit > Maximum of values
2. Determine the width of class intervals:
$$\text{Width of class intervals} = \frac{\text{Upper limit} - \text{Lower limit}}{\text{Total number of classes}}$$
3. For each class, count the number of observations that fall in that class. This
number is called the class frequency.
4. Calculate each class relative frequency:
$$\text{Class relative frequency} = \frac{\text{Class frequency}}{\text{Total number of observations}}$$
Besides the frequency distribution and the relative frequency distribution, one often uses the
relative class percentage, which is calculated by the formula:
Relative class percentage = Class relative frequency × 100%
Example 2.6 Construct a frequency table for the data set of quantity of glucose in blood of 100
persons recorded in Table 2.0b (unit: mg %).
Using the software STATGRAPHICS with Lower limit = 62, Upper limit = 150 and Total
number of classes = 22, we obtained the following table.
Table 2.1 Frequency distribution for glucose in blood of 100 persons
Class  Lower limit  Upper limit  Midpoint  Frequency  Relative frequency  Cumulative frequency  Cum. rel. frequency
0      62           66           64        0          0                   0                     0
1      66           70           68        1          0.01                1                     0.01
2      70           74           72        0          0                   1                     0.01
3      74           78           76        0          0                   1                     0.01
4      78           82           80        2          0.02                3                     0.03
5      82           86           84        8          0.08                11                    0.11
6      86           90           88        5          0.05                16                    0.16
7      90           94           92        14         0.14                30                    0.30
8      94           98           96        18         0.18                48                    0.48
9      98           102          100       11         0.11                59                    0.59
10     102          106          104       18         0.18                77                    0.77
11     106          110          108       6          0.06                83                    0.83
12     110          114          112       8          0.08                91                    0.91
13     114          118          116       5          0.05                96                    0.96
14     118          122          120       3          0.03                99                    0.99
15     122          126          124       1          0.01                100                   1
16     126          130          128       0          0                   100                   1
17     130          134          132       0          0                   100                   1
18     134          138          136       0          0                   100                   1
19     138          142          140       0          0                   100                   1
20     142          146          144       0          0                   100                   1
21     146          150          148       0          0                   100                   1
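The construction steps can be checked against Table 2.1 with a short Python sketch. The function name `frequency_table` is hypothetical, and the sketch assumes that each class includes its upper limit, which is the convention that reproduces the tabulated counts; it is not a description of how STATGRAPHICS works internally.

```python
from bisect import bisect_left
from itertools import accumulate

# Table 2.0b: quantity of glucose in blood (mg %), 100 observations.
glucose = [
    70, 79, 80, 83, 85, 85, 85, 85, 86, 86,
    86, 87, 87, 88, 89, 90, 91, 91, 92, 92,
    93, 93, 93, 93, 94, 94, 94, 94, 94, 94,
    95, 95, 96, 96, 96, 96, 96, 97, 97, 97,
    97, 97, 98, 98, 98, 98, 98, 98, 100, 100,
    101, 101, 101, 101, 101, 101, 102, 102, 102, 103,
    103, 103, 103, 104, 104, 104, 105, 106, 106, 106,
    106, 106, 106, 106, 106, 106, 106, 107, 107, 107,
    107, 108, 110, 111, 111, 111, 111, 111, 112, 112,
    112, 115, 116, 116, 116, 116, 119, 121, 121, 126,
]

def frequency_table(data, lower, upper, n_classes):
    """Return rows (class, lower, upper, midpoint, freq, rel. freq, cum. freq)."""
    width = (upper - lower) / n_classes                      # step 2
    upper_limits = [lower + width * (k + 1) for k in range(n_classes)]
    frequencies = [0] * n_classes
    for x in data:
        if not lower < x <= upper:
            raise ValueError(f"{x} lies outside ({lower}, {upper}]")
        # Assumption: class k is (lower limit, upper limit], so a value equal
        # to a class boundary is counted in the class that ends there.
        frequencies[bisect_left(upper_limits, x)] += 1       # step 3
    cumulative = list(accumulate(frequencies))
    rows = []
    for k, (f, c) in enumerate(zip(frequencies, cumulative)):
        lo, hi = lower + width * k, upper_limits[k]
        rows.append((k, lo, hi, (lo + hi) / 2, f, f / len(data), c))  # step 4
    return rows

table = frequency_table(glucose, lower=62, upper=150, n_classes=22)
for row in table:
    print(*row)
```

Changing the comparison so that classes include their lower limit instead would shift boundary values such as 70, 86 and 90 into the next class and give different counts.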
Remarks:
1. All classes of frequency table must be mutually exclusive.
2. Classes may be open-ended when either the lower or the upper end of a quantitative
classification scheme is limitless. For example:
Class: age
birth to 7
8 to 15
........
64 to 71
72 and older
3. Classification schemes can be either discrete or continuous. Discrete classes are separate
entities that do not progress from one class to the next without a break, such as the
number of children in each family or the number of trucks owned by moving companies.
Discrete data are data that can take only a limited number of values. Continuous data do
progress from one class to the next without a break. They involve numerical measurement,
such as the weights of cans of tomatoes or the kilograms of pressure on concrete. Usually,
continuous classes are half-open intervals. For example, the classes in Table 2.1 are the
half-open intervals (62, 66], (66, 70], ...; the tabulated frequencies count an observation
equal to an upper limit in the class that ends there (the value 70 falls in class 1).
2.7 Graphical description of quantitative data: histogram and polygon
There is an old saying that “one picture is worth a thousand words”. Indeed, statisticians have
employed graphical techniques to describe sets of data more vividly. Bar charts and pie charts
were presented in Figure 2.2 and Figure 2.3 to describe qualitative data. For quantitative data
summarized into frequency or relative frequency tables, however, histograms and polygons are
used to describe the data.
2.7.1 Histogram
When plotting histograms, the phenomenon of interest is plotted along the horizontal axis, while
the vertical axis represents the number, proportion or percentage of observations per class
interval, depending on whether the particular histogram is, respectively, a frequency
histogram, a relative frequency histogram or a percentage histogram.
Histograms are essentially vertical bar charts in which the rectangular bars are centered at the
midpoints of the classes.
Example 2.7 Below we present the frequency histogram for the data set of quantities of
glucose, for which the frequency table is constructed in Table 2.1.
[Histogram: horizontal axis Quantity of glucose (mg%), class midpoints 68 to 140; vertical axis Frequency, 0 to 20]
Figure 2.5 Frequency histogram for quantities of glucose, tabulated in Table 2.1
Remark: When comparing two or more sets of data, the various histograms cannot be
constructed on the same graph, because superimposing the vertical bars of one on another
would make interpretation difficult. In such cases it is necessary to construct relative
frequency or percentage polygons.
2.7.2 Polygons
As with histograms, when plotting polygons the phenomenon of interest is plotted along the
horizontal axis, while the vertical axis represents the number, proportion or percentage of
observations per class interval, depending on whether the particular polygon is,
respectively, a frequency polygon, a relative frequency polygon or a percentage polygon. For
example, the frequency polygon is a line graph connecting the midpoints of the class intervals
of a data set, each plotted at a height corresponding to the frequency of the class.
Example 2.8 Figure 2.6 is a frequency polygon constructed from data in Table 2.1.
Figure 2.6 Frequency polygon for data of glucose in Table 2.1
Advantages of polygons:
• The frequency polygon is simpler than its histogram counterpart.
• It sketches an outline of the data pattern more clearly.
• The polygon becomes increasingly smooth and curve-like as we increase the number
of classes and the number of observations.
2.8 Cumulative distributions and cumulative polygons
Other useful methods of presentation which facilitate data analysis and interpretation are the
construction of cumulative distribution tables and the plotting of cumulative polygons. Both may
be developed from the frequency distribution table, the relative frequency distribution table or
the percentage distribution table.
A cumulative frequency distribution enables us to see how many observations lie above or
below certain values, rather than merely recording the number of items within intervals.
A “less-than” cumulative frequency distribution may be developed from the frequency table as
follows:
Suppose a data set is divided into n classes by boundary points $x_1, x_2, \ldots, x_n, x_{n+1}$. Denote the
classes by $C_1, C_2, \ldots, C_n$. Thus, the class $C_k = [x_k, x_{k+1})$. See Figure 2.7.
Suppose the frequency and relative frequency of class $C_k$ are $f_k$ and $r_k$ ($k = 1, 2, \ldots, n$),
respectively. Then the cumulative frequency of the observations that fall into classes $C_1, C_2, \ldots, C_k$,
i.e. lie below the value $x_{k+1}$, is the sum $f_1 + f_2 + \cdots + f_k$. The corresponding cumulative relative
frequency is $r_1 + r_2 + \cdots + r_k$.
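The running sums just defined can be sketched in Python using the class frequencies listed in Table 2.1. The variable names are illustrative only.

```python
from itertools import accumulate

# Class frequencies f_k for classes 0..21, copied from Table 2.1.
frequencies = [0, 1, 0, 0, 2, 8, 5, 14, 18, 11, 18, 6, 8, 5, 3, 1, 0, 0, 0, 0, 0, 0]
n_observations = sum(frequencies)

# Cumulative frequency up to class k is the sum of the frequencies of all
# classes up to and including class k.
cumulative = list(accumulate(frequencies))

# Cumulative relative frequency is the same running sum applied to r_k = f_k / n.
cumulative_relative = [c / n_observations for c in cumulative]

# Class 6 has upper limit 90 in Table 2.1.
print(cumulative[6], cumulative_relative[6])
```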
Example 2.9 Table 2.1 gives the frequency, relative frequency, cumulative frequency and
cumulative relative frequency distributions for quantity of glucose in blood of 100 students.
According to this table, the number of students whose quantity of glucose does not exceed 90 is 16.
A graph of a cumulative frequency distribution is called a “less-than” ogive, or simply an ogive.
Figure 2.8 shows the cumulative frequency distribution for quantity of glucose in blood of 100
students (data from Table 2.1).
Figure 2.7 Class intervals (boundary points $x_1, x_2, \ldots, x_k, x_{k+1}, \ldots, x_n, x_{n+1}$ delimiting the classes $C_1, C_2, \ldots, C_k, \ldots, C_n$)
Figure 2.8 Cumulative frequency distribution for quantity of glucose
(for data in Table 2.1); horizontal axis: Quantity of glucose (mg%), vertical axis: Cumulative frequency
2.9 Summary
This chapter discussed methods for presenting data sets of qualitative and quantitative variables.
For a qualitative data set we first define categories and the category frequency, which is the
number of observations falling in each category. Further, the category relative frequency
and the percentage for a category are introduced. Bar graphs and pie charts are constructed as
graphical pictures of the data set.
If the data are quantitative and the number of observations is small, the categorization and
the determination of class frequencies can be done by constructing a stem and leaf display.
Large sets of data are best described using a relative frequency distribution, a table that
organizes data into classes with their relative frequencies. For describing quantitative data
graphically, histograms and polygons are used.
2.10 Exercises
1) In a recent national cancer institute survey, 1,580 adult women responded to the question
“In your opinion, what is the most serious health problem facing women?” The responses
are summarized in the following table:
The most serious health problem for women    Relative frequency
Breast cancer                                0.44
Other cancers                                0.31
Emotional stress                             0.07
High blood pressure                          0.06
Heart trouble                                0.03
Other problems                               0.09
a) Use one of graphical methods to describe the data.
b) What proportion of the respondents believe that high blood pressure or heart trouble is the
most serious health problem for women?
c) Estimate the percentage of all women who believe that some type of cancer is the most
serious health problem for women?
2) The administrator of a hospital has ordered a study of the amount of time a patient must wait
before being treated by emergency room personnel. The following data were collected
during a typical day:
WAITING TIME (MINUTES)
12 16 21 20 24 3 11 17 29 18
26 4 7 14 25 2 26 15 16 6
a) Arrange the data in an array from lowest to highest. What comment can you make
about patient waiting time from your data array?
b) Construct a frequency distribution using 6 classes. What additional interpretation can
you give to the data from the frequency distribution?
c) Construct the cumulative relative frequency polygon and from this ogive state how long
75% of the patients should expect to wait.
3) Bacteria are the most important component of microbial ecosystems in sewage treatment
plants. Water management engineers must know the percentage of active bacteria at each
stage of the sewage treatment. The accompanying data represent the percentages of
respiring bacteria in 25 raw sewage samples collected from a sewage plant.
42.3 50.6 41.7 36.5 28.6
40.7 48.1 48.0 45.7 39.9
32.3 31.7 39.6 37.5 40.8
50.1 39.2 38.5 35.6 45.6
34.9 46.1 38.3 44.5 37.2
a. Construct a relative frequency distribution for the data.
b. Construct a stem and leaf display for the data.
c. Compare the two graphs of parts a and b.
4) At a newspaper office, the time required to set the entire front page in type was recorded for
50 days. The data, to the nearest tenth of a minute, are given below.
20.8 22.8 21.9 22.0 20.7 20.9 25.0 22.2 22.8 20.1
25.3 20.7 22.5 21.2 23.8 23.3 20.9 22.9 23.5 19.5
23.7 20.3 23.6 19.0 25.1 25.0 19.5 24.1 24.2 21.8
21.3 21.5 23.1 19.9 24.2 24.1 19.8 23.9 22.8 23.9
19.7 24.2 23.8 20.7 23.8 24.3 21.1 20.9 21.6 22.7
a) Arrange the data in an array from lowest to highest.
b) Construct a frequency distribution and a “less-than” cumulative frequency distribution
from the data, using intervals of 0.8 minutes.
c) Construct a frequency polygon from the data.
d) Construct a “less-than” ogive from the data.
e) From your ogive, estimate what percentage of the time the front page can be set in less
than 24 minutes.
Chapter 3 Data characteristics: descriptive summary statistics
CONTENTS
3.1. Introduction
3.2. Types of numerical descriptive measures
3.3. Measures of central tendency
3.4. Measures of data variation
3.5. Measures of relative standing
3.6. Shape
3.7. Methods for detecting outliers
3.8. Calculating some statistics from grouped data
3.9. Computing descriptive summary statistics using computer software
3.10. Summary
3.11. Exercises
3.1 Introduction
In the previous chapter data were collected and appropriately summarized into tables and charts. In
this chapter a variety of descriptive summary measures will be developed. These descriptive measures
are useful for analyzing and interpreting quantitative data, whether collected in raw form (ungrouped
data) or summarized into frequency distributions (grouped data).
3.2 Types of numerical descriptive measures
Four types of characteristics which describe a data set pertaining to some numerical variable or
phenomenon of interest are:
• Location
• Dispersion
• Relative standing
• Shape
In any analysis and/or interpretation of numerical data, a variety of descriptive measures
representing the properties of location, variation, relative standing and shape may be used to
extract and summarize the salient features of the data set.
If these descriptive measures are computed from a sample of data, they are called statistics. In
contrast, if these descriptive measures are computed from an entire population of data, they are
called parameters.
Since statisticians usually take samples rather than use entire populations, our primary
emphasis is on statistics rather than parameters.
3.3 Measures of location (or measures of central tendency)
3.3.1. Mean
Definition 3.1
The arithmetic mean of a sample (or simply the sample mean) of n observations
$x_1, x_2, \ldots, x_n$, denoted by $\bar{x}$, is computed as
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$$
Definition 3.1a
The population mean is defined by the formula
$$\mu = \frac{\text{Sum of the values of all observations in population}}{\text{Total number of observations in population}} = \frac{\sum_{i=1}^{N} x_i}{N}$$
Note that the population mean and the sample mean are defined in the same way. The same is
true of the other measures of central tendency. In the next section, however, we will give
different formulas for the variances of a population and of a sample.
Example 3.1 Consider 7 observations: 4.2, 4.3, 4.7, 4.8, 5.0, 5.1, 9.0.
By definition
$\bar{x}$ = (4.2 + 4.3 + 4.7 + 4.8 + 5.0 + 5.1 + 9.0)/7 = 5.3
Advantages of the mean:
• It is a measure that can be calculated and is unique.
• It is useful for performing statistical procedures such as comparing the means from several
data sets.
Disadvantages of the mean:
It is affected by extreme values that are not representative of the rest of the data.
Indeed, if in the above example we compute the mean of the first 6 numbers and exclude the
9.0 value, then the mean is 4.7. The one extreme value 9.0 distorts the value we get for the
mean. It would be more representative to calculate the mean without including such an extreme
value.
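The effect of the extreme value in Example 3.1 can be reproduced with a minimal Python sketch. The helper name `mean` is chosen for the illustration only.

```python
# Observations from Example 3.1.
observations = [4.2, 4.3, 4.7, 4.8, 5.0, 5.1, 9.0]

def mean(values):
    """Arithmetic mean: the sum of the values divided by their number."""
    return sum(values) / len(values)

mean_all = mean(observations)                    # includes the extreme value 9.0
mean_without_extreme = mean(observations[:-1])   # first six observations only

print(round(mean_all, 1), round(mean_without_extreme, 1))
```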
3.3.2. Median
Definition 3.2
The median m of a sample of n observations $x_1, x_2, \ldots, x_n$ arranged in ascending or
descending order is the middle number that divides the data set into two equal
halves: one half of the items lie above this point, and the other half lie below it.
Formula for calculating the median of a data set arranged in ascending order:
$$m = \begin{cases} x_k & \text{if } n = 2k - 1 \ (n \text{ is odd}) \\ \frac{1}{2}\,(x_k + x_{k+1}) & \text{if } n = 2k \ (n \text{ is even}) \end{cases}$$
Example 3.2 Find the median of the data set consisting of the observations 7, 4, 3, 5, 6, 8, 10.
Solution First, we arrange the data set in ascending order
3 4 5 6 7 8 10.
Since the number of observations is odd, n = 2 × 4 − 1, the median is m = $x_4$ = 6. We see that half of the
observations, namely 3, 4, 5, lie below the value 6 and the other half, namely 7, 8 and
10, lie above the value 6.
Example 3.3 Suppose we have an even number of observations: 7, 4, 3, 5, 6, 8, 10, 1. Find
the median of this data set.
Solution First, we arrange the data set in ascending order
1 3 4 5 6 7 8 10.
Since the number of observations is even, n = 2 × 4, by Definition 3.2
Median = ($x_4$ + $x_5$)/2 = (5 + 6)/2 = 5.5
Advantage of the median over the mean: Extreme values in a data set do not affect the median
as strongly as they do the mean.
Indeed, in Example 3.1 we have
mean = 5.3, median = 4.8.
The extreme value of 9.0 does not affect the median.
3.3.3 Mode
Definition 3.3
The mode of a data set $x_1, x_2, \ldots, x_n$ is the value of x that occurs with the
greatest frequency, i.e., is repeated most often in the data set.
Example 3.4 Find the mode of the data set in Table 3.1.
Table 3.1 Quantity of glucose (mg%) in blood of 25 students
70 88 95 101 106
79 93 96 101 107
83 93 97 103 108
86 93 97 103 112
87 95 98 106 115
Solution First we arrange this data set in the ascending order
70 88 95 101 106
79 93 96 101 107
83 93 97 103 108
86 93 97 103 112
87 95 98 106 115
This data set contains 25 numbers. We see that the value 93 is repeated most often.
Therefore, the mode of the data set is 93.
Multimodal distribution: A data set may have several modes. In this case it is called a
multimodal distribution.
Example 3.5 The data set
0 2 6 9
0 4 6 10
1 4 7 11
1 4 8 11
1 5 9 12
has two modes: 1 and 4. This distribution is called a bimodal distribution.
Advantage of the mode: Like the median, the mode is not unduly affected by extreme values.
Even if the high values are very high and the low values are very low, we choose the most
frequent value of the data set to be the modal value. We can use the mode no matter how large,
how small, or how spread out the values in the data set happen to be.
Disadvantages of the mode:
• The mode is not used as often to measure central tendency as are the mean and the
median. Too often, there is no modal value because the data set contains no values that
occur more than once. Other times, every value is the mode because every value occurs for
the same number of times. Clearly, the mode is a useless measure in these cases.
• When data sets contain two, three, or many modes, they are difficult to interpret and
compare.
Comparing the Mean, Median and Mode
• In general, the three measures of central tendency of a data set (the mean, the median and
the mode) are different. For example, for the data set in Table 3.1, mean = 96.48, median = 97
and mode = 93.
• If all observations in a data set are arranged symmetrically about an observation then this
observation is the mean, the median and the mode.
• Which of these three measures of central tendency is better? The best measure of central
tendency for a data set depends on the type of descriptive information you want. For most
data sets encountered in business, engineering and computer science, this will be the
MEAN.
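The three measures quoted for Table 3.1 can be recomputed with Python's standard `statistics` module. This is a sketch for checking the figures, not the method used in the course; the variable names are illustrative.

```python
import statistics

# Table 3.1: quantity of glucose (mg%) in blood of 25 students.
glucose_25 = [
    70, 79, 83, 86, 87, 88, 93, 93, 93, 95, 95, 96, 97,
    97, 98, 101, 101, 103, 103, 106, 106, 107, 108, 112, 115,
]

mean_value = statistics.mean(glucose_25)
median_value = statistics.median(glucose_25)   # middle (13th) ordered value
mode_value = statistics.mode(glucose_25)       # most frequent value

# Example 3.5: a data set with two modes.
bimodal = [0, 0, 1, 1, 1, 2, 4, 4, 4, 5, 6, 6, 7, 8, 9, 9, 10, 11, 11, 12]
modes = statistics.multimode(bimodal)          # all values with the top frequency

print(mean_value, median_value, mode_value, modes)
```

`statistics.multimode` requires Python 3.8 or later.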
3.3.4 Geometric mean
Definition 3.4
Suppose all the n observations in a data set are positive: $x_1, x_2, \ldots, x_n > 0$. Then the geometric
mean of the data set is defined by the formula
$$\bar{x}_G = GM = \sqrt[n]{x_1 x_2 \cdots x_n}$$
The geometric mean is appropriate to use whenever we need to measure the average rate of
change (the growth rate) over a period of time.
From the above formula it follows that
$$\log \bar{x}_G = \frac{1}{n} \sum_{i=1}^{n} \log x_i$$
where log is the logarithmic function of any base.
Thus, the logarithm of the geometric mean of the values of a data set is equal to the arithmetic mean of
the logarithms of the values of the data set.
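The relation between the geometric mean and the mean of the logarithms can be checked numerically. The sketch below uses hypothetical yearly growth factors invented for the illustration; the function name `geometric_mean` is also an assumption of the example.

```python
import math

def geometric_mean(values):
    """n-th root of the product of n positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("the geometric mean requires positive values")
    return math.prod(values) ** (1 / len(values))

# Hypothetical growth factors over four periods (e.g. 1.10 means +10%).
growth_factors = [1.10, 1.20, 0.95, 1.05]

gm = geometric_mean(growth_factors)
mean_of_logs = sum(math.log(v) for v in growth_factors) / len(growth_factors)

# log of the geometric mean equals the arithmetic mean of the logs.
print(math.log(gm), mean_of_logs)
```

For long series, computing the mean of the logarithms first and exponentiating afterwards avoids overflow or underflow in the product.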
3.4 Measures of data variation
Just as measures of central tendency locate the “center” of a relative frequency distribution,
measures of variation measure its “spread”.
The most commonly used measures of data variation are the range, the variance and the
standard deviation.
3.4.1 Range
Definition 3.5
The range of a quantitative data set is the difference between the largest and smallest values
in the set.
Range = Maximum − Minimum,
where Maximum = Largest value, Minimum = Smallest value.
3.4.2 Variance and standard deviation
Definition 3.6
The population variance of the population of the observations x is defined by the formula
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$
where:
$\sigma^2$ = population variance
$x_i$ = the item or observation
$\mu$ = population mean
N = total number of observations in the population.
From Definition 3.6 we see that the population variance is the average of the squared
distances of the observations from the mean.
Definition 3.7
The standard deviation of a population is equal to the square root of the variance:
$$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$$
Note that for the variance, the units are the squares of the units of the data. For the
standard deviation, the units are the same as those used in the data.
Definition 3.6a
The sample variance of the sample of the observations $x_1, x_2, \ldots, x_n$ is defined by the
formula
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
where:
$s^2$ = sample variance
$\bar{x}$ = sample mean
n = total number of observations in the sample
The standard deviation of the sample is
$$s = \sqrt{s^2}$$
Remark: In the denominator of the formula for $s^2$ we use n − 1 instead of n because statisticians
have proved that if $s^2$ is defined as above, then $s^2$ is an unbiased estimate of the variance of the
population from which the sample was selected (i.e. the expected value of $s^2$ is equal to the
population variance).
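The difference between the two denominators can be seen in a short Python sketch. It reuses the observations of Example 3.1 purely as sample input and compares the hand-written formulas with the standard `statistics` module.

```python
import math
import statistics

# Observations from Example 3.1, used here only as illustrative input.
sample = [4.2, 4.3, 4.7, 4.8, 5.0, 5.1, 9.0]

n = len(sample)
xbar = sum(sample) / n
sum_sq_dev = sum((x - xbar) ** 2 for x in sample)   # sum of squared deviations

data_range = max(sample) - min(sample)              # Definition 3.5
population_variance = sum_sq_dev / n                # Definition 3.6 (divide by N)
sample_variance = sum_sq_dev / (n - 1)              # Definition 3.6a (divide by n - 1)
sample_sd = math.sqrt(sample_variance)

print(data_range, population_variance, sample_variance, sample_sd)
```

The sample variance is always the larger of the two, by the factor n/(n − 1).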
Uses of the standard deviation
The standard deviation enables us to determine, with a great deal of accuracy, where the
values of a frequency distribution are located in relation to the mean. We can do this according
to a theorem devised by the Russian mathematician P.L. Chebyshev (1821-1894).
Chebyshev’s Theorem
For any data set with the mean $\bar{x}$ and the standard deviation s, at least 75% of the
values will fall within the interval $\bar{x} \pm 2s$ and at least 8/9 (about 89%) of the values will fall within
the interval $\bar{x} \pm 3s$.
We can measure with even more precision the percentage of items that fall within specific
ranges under a symmetrical, bell-shaped curve. In these cases we have:
The Empirical Rule
If a relative frequency distribution of sample data is bell-shaped with mean $\bar{x}$ and
standard deviation s, then the proportions of the total number of observations falling
within the intervals $\bar{x} \pm s$, $\bar{x} \pm 2s$, $\bar{x} \pm 3s$ are as follows:
$\bar{x} \pm s$: Close to 68%
$\bar{x} \pm 2s$: Close to 95%
$\bar{x} \pm 3s$: Near 100%
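Chebyshev's bounds can be compared with the observed proportions for a concrete data set. The sketch below uses the 25 glucose values of Table 3.1 and the general form of the bound, 1 − 1/k² for an interval of k standard deviations, which gives 75% for k = 2 and 8/9 for k = 3. The helper name `share_within` is hypothetical.

```python
import statistics

# Table 3.1: quantity of glucose (mg%) in blood of 25 students.
glucose_25 = [
    70, 79, 83, 86, 87, 88, 93, 93, 93, 95, 95, 96, 97,
    97, 98, 101, 101, 103, 103, 106, 106, 107, 108, 112, 115,
]

xbar = statistics.mean(glucose_25)
s = statistics.stdev(glucose_25)       # sample standard deviation (n - 1)

def share_within(values, center, spread, k):
    """Proportion of values lying within k * spread of center."""
    inside = sum(1 for v in values if abs(v - center) <= k * spread)
    return inside / len(values)

for k in (2, 3):
    observed = share_within(glucose_25, xbar, s, k)
    guaranteed = 1 - 1 / k ** 2        # Chebyshev lower bound
    print(k, round(observed, 3), round(guaranteed, 3))
```

The observed proportions are typically well above the guaranteed minimum, because the theorem must hold for every possible data set.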
3.4.3 Relative dispersion: The coefficient of variation
The standard deviation is an absolute measure of dispersion that expresses variation in the
same units as the original data. For example, the unit of the standard deviation of the heights
of a group of students is the centimeter, while the unit of the standard deviation of their
weights is the kilogram. Can we compare the values of these standard deviations? Unfortunately, no,
because they are in different units.
We need a relative measure that will give us a feel for the magnitude of the deviation relative to
the magnitude of the mean. The coefficient of variation is one such relative measure of
dispersion.
Definition 3.8
The coefficient of variation of a data set is the ratio of its standard deviation to its
mean, expressed as a percentage:
$$cv = \text{Coefficient of variation} = \frac{\text{Standard deviation}}{\text{Mean}} \times 100\%$$
This definition applies to both populations and samples.
The unit of the coefficient of variation is percent.
Example 3.6 Suppose that each day laboratory technician A completes 40 analyses with a
standard deviation of 5. Technician B completes 160 analyses per day with a standard deviation
of 15. Which employee shows less variability?
At first glance, it appears that technician B has three times more variation in the output rate than
technician A. But B completes analyses at a rate 4 times faster than A. Taking all this
information into account, we compute the coefficient of variation for both technicians:
For technician A: cv = 5/40 × 100% = 12.5%
For technician B: cv = 15/160 × 100% = 9.4%.
So we find that technician B, who has more absolute variation in output than technician A, has
less relative variation.
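The two coefficients of Example 3.6 can be recomputed with a one-line function. The function name is chosen for the illustration.

```python
def coefficient_of_variation(standard_deviation, mean):
    """Standard deviation as a percentage of the mean."""
    return standard_deviation / mean * 100

cv_a = coefficient_of_variation(5, 40)     # technician A
cv_b = coefficient_of_variation(15, 160)   # technician B

print(f"A: {cv_a:.1f}%  B: {cv_b:.1f}%")
```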
3.5 Measures of relative standing
In some situations, you may want to describe the relative position of a particular observation in a
data set.
Descriptive measures that locate the relative position of an observation in relation to the other
observations are called measures of relative standing.
A measure that expresses this position in terms of a percentage is called a percentile for the
data set.
Definition 3.9
Suppose a data set is arranged in ascending (or descending) order. The p-th
percentile is a number such that p% of the observations of the data set fall below it and
(100 − p)% of the observations fall above it.
The median, by definition, is the 50th percentile.
The 25th percentile, the median and the 75th percentile are often used to describe a data set
because they divide the data set into 4 groups, with each group containing one-fourth (25%) of
the observations. They also divide the relative frequency distribution for a data set into 4
parts, each containing the same area (0.25), as shown in Figure 3.1. Consequently, the 25th
percentile, the median, and the 75th percentile are called the lower quartile, the mid quartile, and
the upper quartile, respectively, for a data set.
Definition 3.10
The lower quartile, $Q_L$, for a data set is the 25th percentile.
Definition 3.11
The mid quartile, M, for a data set is the 50th percentile.
Definition 3.12
The upper quartile, $Q_U$, for a data set is the 75th percentile.
Definition 3.13
The interquartile range of a data set is $Q_U - Q_L$.
Figure 3.1 Location of the lower, mid and upper quartiles ($Q_L$, M, $Q_U$)
For a large data set, quartiles are found by locating the corresponding areas under the relative
frequency distribution polygon, as in Figure 3.1. However, when the sample data set is small, it
may be impossible to find an observation in the data set that exceeds, say, exactly 25% of the
remaining observations. Consequently, the lower and the upper quartiles for a small data set are
not well defined. The following box describes a procedure for finding quartiles for small data
sets.
Finding quartiles for small data sets:
1. Rank the n observations in the data set in ascending order of magnitude.
2. Calculate the quantity (n+1)/4 and round to the nearest integer. The observation
with this rank represents the lower quartile. If (n+1)/4 falls halfway between two
integers, round up.
3. Calculate the quantity 3(n+1)/4 and round to the nearest integer. The observation
with this rank represents the upper quartile. If 3(n+1)/4 falls halfway between two
integers, round down.
Example 3.7 Find the lower quartile, the median, and the upper quartile for the data set in
Table 3.1.
Solution For this data set n = 25. Therefore, (n+1)/4 = 26/4 = 6.5 and 3(n+1)/4 = 3 × 26/4 = 19.5. We
round 6.5 up to 7 and 19.5 down to 19. Hence, the lower quartile = 7th observation = 93 and the
upper quartile = 19th observation = 103. We also have the median = 13th observation = 97. The
location of these quartiles is presented in Figure 3.2.
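The small-sample procedure, including its two different tie-breaking rules, can be sketched as follows. The function names are hypothetical; note that library routines for quantiles generally use other interpolation conventions and may return slightly different values.

```python
import math
import statistics

def round_half_up(x):
    """Round to the nearest integer; exact halves go up."""
    return math.floor(x + 0.5)

def round_half_down(x):
    """Round to the nearest integer; exact halves go down."""
    return math.ceil(x - 0.5)

def small_sample_quartiles(values):
    """Lower and upper quartiles by the rank rule for small data sets."""
    ordered = sorted(values)                          # step 1
    n = len(ordered)
    lower_rank = round_half_up((n + 1) / 4)           # step 2
    upper_rank = round_half_down(3 * (n + 1) / 4)     # step 3
    return ordered[lower_rank - 1], ordered[upper_rank - 1]   # ranks are 1-based

# Table 3.1: quantity of glucose (mg%) in blood of 25 students.
glucose_25 = [
    70, 79, 83, 86, 87, 88, 93, 93, 93, 95, 95, 96, 97,
    97, 98, 101, 101, 103, 103, 106, 106, 107, 108, 112, 115,
]

q_lower, q_upper = small_sample_quartiles(glucose_25)
median_value = statistics.median(glucose_25)
print(q_lower, median_value, q_upper, q_upper - q_lower)
```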
Figure 3.2 Location of the quartiles for the data set of Table 3.1 (Min = 70, $Q_L$ = 93, M = 97, $Q_U$ = 103, Max = 115)
Another measure of relative standing is the z-score (or standard score) of an observation.
It describes how far an individual item in a distribution departs from the mean of the distribution.
The standard score gives the number of standard deviations by which a particular observation
lies below or above the mean.
Definition 3.14
Standard score (or z-score) is defined as follows:
For a population:
$$z\text{-score} = \frac{x - \mu}{\sigma}$$
where x = the observation from the population,
$\mu$ = the population mean,
$\sigma$ = the population standard deviation.
For a sample:
$$z\text{-score} = \frac{x - \bar{x}}{s}$$
where x = the observation from the sample,
$\bar{x}$ = the sample mean,
s = the sample standard deviation.
3.6 Shape
The fourth important numerical characteristic of a data set is its shape. In describing a
numerical data set it is not only necessary to summarize the data by presenting appropriate
measures of central tendency, dispersion and relative standing; it is also necessary to consider
the shape of the data, that is, the manner in which the data are distributed.
There are two measures of the shape of a data set: skewness and kurtosis.
3.6.1 Skewness
If the distribution of the data is not symmetrical, it is called asymmetrical or skewed.
Skewness characterizes the degree of asymmetry of a distribution around its mean. For
sample data, the skewness is defined by the formula:
$$\text{Skewness} = \frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^3,$$
where n = the number of observations in the sample,
$x_i$ = i-th observation in the sample,
$\bar{x}$ = the sample mean,
s = standard deviation of the sample.
The direction of the skewness depends upon the location of the extreme values. If the extreme
values are the larger observations, the mean will be the measure of location most greatly
distorted in the upward direction. Since the mean exceeds the median and the mode, such a
distribution is said to be positive or right-skewed. The tail of its distribution is extended to the
right. This is depicted in Figure 3.3a.
On the other hand, if the extreme values are the smaller observations, the mean will be the
measure of location most greatly reduced. Since the mean is exceeded by the median and the
mode, such a distribution is said to be negative or left-skewed. The tail of its distribution is
extended to the left. This is depicted in Figure 3.3b.
Figure 3.3a Right-skewed distribution
Figure 3.3b Left-skewed distribution
3.6.2 Kurtosis
Kurtosis characterizes the relative peakedness or flatness of a distribution compared with the
bell-shaped distribution (normal distribution).
Kurtosis of a sample data set is calculated by the formula:
$$\text{Kurtosis} = \left\{ \frac{n(n+1)}{(n-1)(n-2)(n-3)} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^4 \right\} - \frac{3(n-1)^2}{(n-2)(n-3)}$$
Positive kurtosis indicates a relatively peaked distribution. Negative kurtosis indicates a
relatively flat distribution.
The distributions with positive and negative kurtosis are depicted in Figure 3.4, where the
distribution with zero kurtosis is the normal distribution.
Figure 3.4 The distributions with positive and negative kurtosis
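The two shape formulas of Sections 3.6.1 and 3.6.2 translate directly into code. The sketch below is a plain transcription with hypothetical function names; the requirement of at least 3 observations for skewness and 4 for kurtosis follows from the denominators (n − 2) and (n − 3).

```python
import math

def _mean_and_sd(values):
    """Sample mean and sample standard deviation (n - 1 in the denominator)."""
    n = len(values)
    xbar = sum(values) / n
    s = math.sqrt(sum((x - xbar) ** 2 for x in values) / (n - 1))
    return xbar, s

def skewness(values):
    n = len(values)
    if n < 3:
        raise ValueError("skewness needs at least 3 observations")
    xbar, s = _mean_and_sd(values)
    return n / ((n - 1) * (n - 2)) * sum(((x - xbar) / s) ** 3 for x in values)

def kurtosis(values):
    n = len(values)
    if n < 4:
        raise ValueError("kurtosis needs at least 4 observations")
    xbar, s = _mean_and_sd(values)
    leading = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    fourth_powers = sum(((x - xbar) / s) ** 4 for x in values)
    return leading * fourth_powers - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))

symmetric = [1, 2, 3, 4, 5]                          # illustrative symmetric data
right_tailed = [4.2, 4.3, 4.7, 4.8, 5.0, 5.1, 9.0]   # observations of Example 3.1

print(skewness(symmetric), kurtosis(symmetric))
print(skewness(right_tailed))
```

The symmetric set gives a skewness of zero and a negative kurtosis (a flat shape), while the data of Example 3.1, with the extreme value 9.0 on the right, give a positive skewness.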
3.7 Methods for detecting outliers
Definition 3.15
An observation (or measurement) that is unusually large or small relative to the
other values in a data set is called an outlier. Outliers typically are attributable to
one of the following causes:
1. The measurement is observed, recorded, or entered into the computer
incorrectly.
2. The measurements come from a different population.
3. The measurement is correct, but represents a rare event.
Outliers tend to occur when the relative frequency distribution of the data set is extremely
skewed, because such a distribution has a tendency to include extremely large or small
observations.
There are two widely used methods for detecting outliers.
Method of using z-scores:
According to Chebyshev's theorem, at least 8/9 (about 89%) of the observations in any data set
have a z-score less than 3 in absolute value, i.e. fall into the interval $(\bar{x} - 3s, \bar{x} + 3s)$,
where $\bar{x}$ is the mean and s is the standard deviation of the sample; for bell-shaped
distributions the Empirical Rule puts this proportion near 100%. Therefore, observations with a
z-score greater than 3 in absolute value are regarded as outliers.
Example 3.8 The doctor of a school has measured the heights of the pupils in class 5A. The
results (in cm) are as follows.
Table 3.2 Heights of the pupils of the class 5A
For the data set in Table 3.2, $\bar{x}$ = 132.77, s = 6.06 and 3s = 18.18. The z-score of the observation
153 is (153 − 132.77)/6.06 = 3.34, and the z-score of 110 is (110 − 132.77)/6.06 = −3.76. Since the absolute
values of the z-scores of 153 and 110 are more than 3, the height of 153 cm and the height of 110
cm are outliers in the data set.
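The z-score test of Example 3.8 can be written as a short sketch. It takes the mean and standard deviation exactly as quoted in the example rather than recomputing them; the function name and the threshold parameter are illustrative.

```python
def z_score(x, mean, sd):
    """Number of standard deviations by which x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

# Summary values as quoted in Example 3.8 (heights in cm).
mean_height, sd_height = 132.77, 6.06
threshold = 3   # cut-off on |z| used by the z-score method

for height in (153, 110, 135):
    z = z_score(height, mean_height, sd_height)
    flag = "outlier" if abs(z) > threshold else "not an outlier"
    print(height, round(z, 2), flag)
```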
Box plot method
Another procedure for detecting outliers is to construct a box plot of the data. Below we present
steps to follow in constructing a box plot.
Steps to follow in constructing a box plot
1. Calculate the median M, the lower and upper quartiles, $Q_L$ and $Q_U$, and the interquartile range, IQR = $Q_U - Q_L$, for the data set.
2. Construct a box with $Q_L$ and $Q_U$ located at the lower corners. The base width will then be equal to IQR. Draw a vertical line inside the box to locate the median M.
3. Construct two sets of limits on the box plot: inner fences are located a distance of 1.5 * IQR below $Q_L$ and above $Q_U$; outer fences are located a distance of 3 * IQR below $Q_L$ and above $Q_U$ (see Figure 3.5).
4. Observations that fall between the inner and outer fences are called suspect outliers. Locate the suspect outliers on the box plot using asterisks (*). Observations that fall outside the outer fences are called highly suspect outliers. Use small circles to locate them.
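Steps 3 and 4 can be sketched as follows. The helper names `fences` and `classify` are hypothetical, the quartiles are passed in rather than computed (software packages use slightly different quartile conventions), and the numbers used for illustration are the quartiles 94 and 106 reported for the GLUCOSE data in Section 3.9. How an observation lying exactly on a fence is treated is an assumption of this sketch.

```python
def fences(q_l, q_u):
    """Return ((inner_low, inner_high), (outer_low, outer_high)) for the given quartiles."""
    iqr = q_u - q_l
    inner = (q_l - 1.5 * iqr, q_u + 1.5 * iqr)
    outer = (q_l - 3.0 * iqr, q_u + 3.0 * iqr)
    return inner, outer


def classify(x, q_l, q_u):
    """Classify x using the box plot rule; points exactly on a fence count as inside it."""
    inner, outer = fences(q_l, q_u)
    if x < outer[0] or x > outer[1]:
        return "highly suspect outlier"
    if x < inner[0] or x > inner[1]:
        return "suspect outlier"
    return "not an outlier"


Q_L, Q_U = 94.0, 106.0  # quartiles from the GLUCOSE output in Section 3.9
print(fences(Q_L, Q_U))
for value in (70, 100, 126):  # minimum, a central value, maximum of the GLUCOSE data
    print(value, classify(value, Q_L, Q_U))
```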
Heights (in cm) of the pupils of class 5A, the data of Table 3.2:
130  132  138  136  131  153
131  133  129  133  110  132
129  134  135  132  135  134
133  132  130  131  134  135
For large data sets, a box plot can be constructed using available statistical computer software. A box plot generated by SPSS for the data set in Table 3.2 is shown in Figure 3.6.
Figure 3.6 Output from SPSS showing box plot for the data set in Table 3.2
Figure 3.5 Box plot (the box spans $Q_L$ to $Q_U$ with the median M marked inside; the inner fences lie 1.5 * IQR beyond the quartiles and the outer fences a further 1.5 * IQR out; asterisks mark suspect outliers)
3.8 Calculating some statistics from grouped data
In Sections 3.3 through 3.6 we gave formulas for computing the mean, median, standard deviation, etc. of a data set. However, these formulas apply only to raw data sets, i.e., those in which the value of each individual observation is known. If the data have already been grouped into classes of equal width and arranged in a frequency table, you must use an alternative method to compute the mean, standard deviation, etc.
Example 3.9 Suppose we have a frequency table of the average monthly checking-account balances of 600 customers at a branch bank.
CLASS (DOLLARS)     FREQUENCY
  0 –  49.99             78
 50 –  99.99            123
100 – 149.99            187
150 – 199.99             82
200 – 249.99             51
250 – 299.99             47
300 – 349.99             13
350 – 399.99              9
400 – 449.99              6
450 – 499.99              4
From the information in this table, we can easily compute an estimate of the value of the mean
and the standard deviation.
Formulas for calculating the mean and the standard deviation for grouped data:
$$\bar{x} = \frac{\sum_{i=1}^{k} f_i x_i}{n}, \qquad s^2 = \frac{\sum_{i=1}^{k} f_i x_i^2 - \frac{\left(\sum_{i=1}^{k} f_i x_i\right)^2}{n}}{n-1},$$
where $\bar{x}$ = mean of the data set, $s^2$ = variance of the data set (the standard deviation is $s = \sqrt{s^2}$),
$x_i$ = midpoint of the ith class, $f_i$ = frequency of the ith class,
k = number of classes, n = total number of observations in the data set.
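As an illustration, the grouped-data formulas can be applied to the frequency table of Example 3.9. The sketch below assumes that each class midpoint is the lower class limit plus 25 (so 25, 75, ..., 475); the function name `grouped_mean_variance` is hypothetical.

```python
import math

# Lower class limits and frequencies from the table of Example 3.9.
lower_limits = [0, 50, 100, 150, 200, 250, 300, 350, 400, 450]
frequencies = [78, 123, 187, 82, 51, 47, 13, 9, 6, 4]

# Assumption: midpoint of each class = lower limit + half the class width (50 / 2).
midpoints = [low + 25.0 for low in lower_limits]


def grouped_mean_variance(midpoints, freqs):
    """Approximate mean and sample variance from class midpoints and frequencies."""
    n = sum(freqs)
    sum_fx = sum(f * x for f, x in zip(freqs, midpoints))
    sum_fx2 = sum(f * x * x for f, x in zip(freqs, midpoints))
    mean = sum_fx / n
    variance = (sum_fx2 - sum_fx ** 2 / n) / (n - 1)
    return mean, variance


mean, variance = grouped_mean_variance(midpoints, frequencies)
std_dev = math.sqrt(variance)
print(f"n = {sum(frequencies)}, mean = {mean:.2f}, s = {std_dev:.2f}")
```

These are only estimates: the true mean and standard deviation of the 600 balances depend on where the observations actually fall inside each class.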
3.9 Computing descriptive summary statistics using computer software
All statistical software packages have procedures for computing descriptive summary statistics. Below we present outputs from STATGRAPHICS and SPSS for computing descriptive summary statistics for the GLUCOSE data in Table 2.0b.
Variable: GLUCOSE.GLUCOSE

Sample size            100.
Average                100.
Median                 100.5
Mode                   106.
Geometric mean         99.482475
Variance               102.767677
Standard deviation     10.137439
Standard error         1.013744
Minimum                70.
Maximum                126.
Range                  56.
Lower quartile         94.
Upper quartile         106.
Interquartile range    12.
Skewness               0.051526
Kurtosis               0.131118
Coeff. of variation    10.137439

Figure 3.7 Output from STATGRAPHICS for Glucose data
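Several entries of this output are tied together by the definitions of Sections 3.3 through 3.6, which gives a quick way to read the listing. The sketch below only re-derives those entries from the printed figures; the variable names are illustrative, and the formula used for the coefficient of variation (100 times the standard deviation divided by the mean) is an assumption about what the package reports.

```python
import math

# Figures copied from the STATGRAPHICS output for the GLUCOSE data.
n = 100
mean = 100.0
variance = 102.767677
minimum, maximum = 70.0, 126.0
q_lower, q_upper = 94.0, 106.0

std_dev = math.sqrt(variance)             # standard deviation = square root of variance
std_error = std_dev / math.sqrt(n)        # standard error of the mean
data_range = maximum - minimum            # range
iqr = q_upper - q_lower                   # interquartile range
coeff_variation = 100.0 * std_dev / mean  # in percent (assumed definition)

print(std_dev, std_error, data_range, iqr, coeff_variation)
```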
3.10 Summary
Numerical descriptive measures enable us to construct a mental image of the relative frequency
distribution for a data set pertaining to a numerical variable. There are 4 types of these
measures: location, dispersion, relative standing and shape.
The three numerical descriptive measures used to locate a relative frequency distribution are the mean, the median, and the mode. Each conveys a special piece of information. In a sense, the mean is the balancing point for the data. The median, which is insensitive to extreme values, divides the data set into two equal halves: half of the observations will be less than the median and half will be larger. The mode is the observation that occurs with greatest frequency. It is the value of the data set that locates the point where the relative frequency distribution achieves its maximum relative frequency.
The range and the standard deviation measure the spread of a relative frequency distribution. In particular, we can obtain a very good notion of the way data are distributed around the mean by constructing the intervals $\bar{x} \pm s$, $\bar{x} \pm 2s$, $\bar{x} \pm 3s$ and referring to Chebyshev's theorem and the Empirical rule.
Percentiles, quartiles, and z-scores measure the relative position of an observation in a data set. The lower and upper quartiles and the distance between them, called the interquartile range, can also help us visualize a data set. Box plots constructed from intervals based on the interquartile range and z-scores provide an easy way to detect possible outliers in the data.
The two numerical measures of the shape of a data set are skewness and kurtosis. The skewness characterizes the degree of asymmetry of a distribution around its mean. The kurtosis characterizes the relative peakedness or flatness of a distribution compared with the bell-shaped distribution.
3.11 Exercises
1. The ages of a sample of the people attending a training course on networking in IOIT in
Hanoi are:
29 20 23 22 30 32 28
23 24 27 28 31 32 33
31 28 26 25 24 23 22
26 28 31 25 28 27 34
a) Construct a frequency distribution with intervals 15–19, 20–24, 25–29, 30–34, 35–39.
b) Compute the mean and the standard deviation of the raw data set.
c) Compute the approximate values for the mean and the standard deviation using the
constructed frequency distribution table. Compare these values with ones obtained in b).
2. Industrial engineers periodically conduct “work measurement” analyses to determine the time used to produce a single unit of output. At a large processing plant, the total number of man-hours required per day to perform a certain task was recorded for 50 days. This information will be used in a work measurement analysis. The total man-hours required for each of the 50 days are listed below.
128 119 95 97 124 128 142 98 108 120
113 109 124 97 138 133 136 120 112 146
128 103 135 114 109 100 111 131 113 132
124 131 133 88 118 116 98 112 138 100
112 111 150 117 122 97 116 92 122 125
a) Compute the mean, the median, and the mode of the data set.
b) Find the range, the variance and the standard deviation of the data set.
c) Construct the intervals $\bar{x} \pm s$, $\bar{x} \pm 2s$, $\bar{x} \pm 3s$. Count the number of observations that fall within each interval and find the corresponding proportions. Compare the results to the Chebyshev theorem. Do you detect any outliers?
e) Find the 75th percentile for the data on total daily man-hours.
3. An engineer tested nine samples of each of three designs of a certain bearing for a new electrical winch. The following data are the number of hours it took for each bearing to fail when the winch motor was run continuously at maximum output, with a load on the winch equivalent to 1.9 times the intended capacity.
DESIGN
A B C
16 18 21
16 27 17
53 34 23
21 34 32
17 32 21
25 19 18
30 34 21
21 17 28
45 43 19
a) Calculate the mean and the median for each group.
b) Calculate the standard deviation for each group.
c) Which design is best and why?
4. The projected 30-day storage charges (in US$) for 35 web pages stored on the web server of a university are listed here:
120 125 145 180 175 167 154
143 120 180 175 190 200 145
165 210 120 187 179 167 165
134 167 189 182 145 178 231
185 200 231 240 230 180 154
a) Construct a stem-and-leaf display for the data set.
b) Compute $\bar{x}$, $s^2$ and $s$.
c) Calculate the intervals $\bar{x} \pm s$, $\bar{x} \pm 2s$, and $\bar{x} \pm 3s$ and count the number of observations that fall within each interval. Compare your results with the Empirical rule.
Chapter 4 Probability: Basic concepts
CONTENTS
4.1. Experiment, Events and Probability of an Event
4.2. Approaches to probability
4.3. The field of events
4.4. Definitions of probability
4.5. Conditional probability and independence
4.6. Rules for calculating probability
4.7. Summary
4.8. Exercises
4.1 Experiment, Events and Probability of an Event
Definition 4.1
The process of making an observation or recording a measurement under a given set of conditions is called a trial or experiment.
Thus, an experiment is realized whenever the set of conditions is realized.
Definition 4.2
Outcomes of an experiment are called events.
We denote events by capital letters A, B, C,...
Example 4.1 Consider the following experiment. Toss a coin and observe whether the upper side of the coin is Head or Tail. Two events may occur:
• H: Head is observed,
• T: Tail is observed.
Example 4.2 Toss a die and observe the number of dots on its upper face. You may observe one, two, three, four, five or six dots on the upper face of the die. You cannot predict this number.
Example 4.3 When you draw one card from a standard 52-card bridge deck, some possible outcomes of this experiment, which cannot be predicted with certainty in advance, are:
• A: You draw an ace of hearts
• B: You draw an eight of diamonds
• C: You draw a spade
• D: You do not draw a spade.
The probability of an event A, denoted by P(A), is, in general, the chance that A will happen.
But how do we measure the chance of occurrence, i.e., how do we determine the probability of an event? The answer to this question will be given in the next sections.
4.2 Approaches to probability
The number of different definitions of probability that have been proposed by various authors is
very large. But the majority of definitions can be subdivided into 3 groups:
1. Definitions of probability as a quantitative measure of the “degree of certainty” of the
observer of experiment.
2. Definitions that reduce the concept of probability to the more primitive notion of “equal likelihood” (the so-called “classical definition”).
3. Definitions that take as their point of departure the “relative frequency” of occurrence of the
event in a large number of trials (“statistical” definition).
According to the first approach to the definition of probability, the theory of probability is something not unlike a branch of psychology, and all conclusions on probabilistic judgements are deprived of the objective meaning that they have independent of the observer. Those probabilities that depend upon the observer are called subjective probabilities.
In the next sections we shall give the classical and statistical definitions of probability.
4.3 The field of events
Before proceeding to the classical definition of the concept of probability we shall introduce some definitions and relations between the events, which may or may not occur when an experiment is realized.
1. If whenever the event A occurs the event B also occurs, then we say that A implies B (or A is contained in B) and write A ⊂ B or B ⊃ A.
2. If A implies B and at the same time, B implies A, i.e., if for every realization of the
experiment either A and B both occur or both do not occur, then we say that the events A
and B are equivalent and write A=B.
3. The event consisting in the simultaneous occurrence of A and B is called the product or
intersection of the events A and B, and will be denoted by AB or A∩B.
4. The event consisting in the occurrence of at least one of the events A or B is called the
sum, or union, of the events A and B, and is denoted by A+B or A∪B.
5. The event consisting in the occurrence of A and the nonoccurrence of B is called the difference of the events A and B and is denoted by A − B or A\B.
6. An event is called certain (or sure) if it must inevitably occur whenever the experiment is
realized.
7. An event is called impossible if it can never occur.
Clearly, all certain events are equivalent to one another. We shall denote these events by the letter E. All impossible events are likewise equivalent and denoted by 0.
8. Two events $A$ and $\bar{A}$ are complementary if $A + \bar{A} = E$ and $A\bar{A} = 0$ hold simultaneously.
For example, in the experiment of tossing a die the following events are complementary:
• $D_{even}$ = {even number of dots is observed on upper face}
• $D_{odd}$ = {odd number of dots is observed on upper face}
9. Two events A and B are called mutually exclusive if when one of the two events occurs in the experiment, the other cannot occur, i.e., if their joint occurrence is impossible: AB = 0.
10. If $A = B_1 + B_2 + \dots + B_n$ and the events $B_i$ (i = 1, 2, ..., n) are mutually exclusive in pairs (or pairwise mutually exclusive), i.e., $B_i B_j = 0$ for any i ≠ j, then we say that the event A is decomposed into the mutually exclusive events $B_1, B_2, \dots, B_n$.
For example, in the experiment of tossing a single die, the event consisting of the throw of an even number of dots is decomposed into the mutually exclusive events $D_2$, $D_4$ and $D_6$, where $D_k$ = {observing k dots on the upper face of the die}.
11. An event A is called simple (or elementary) if it cannot be decomposed into other events.
For example, the events $D_k$ that k dots (k = 1, 2, 3, 4, 5, 6) are observed in the experiment of tossing a die are simple events.
12. The sample space of an experiment is the collection of all its simple events.
13. Complete list of events: Suppose that when the experiment is realized there may be a list of events $A_1, A_2, \dots, A_n$ with the following properties:
I. $A_1, A_2, \dots, A_n$ are pairwise mutually exclusive events,
II. $A_1 + A_2 + \dots + A_n = E$.
Then we say that the list of events $A_1, A_2, \dots, A_n$ is complete.
Examples of complete lists of events are:
• The list of events Head and Tail in tossing a coin.
• The list of events $D_1, D_2, D_3, D_4, D_5, D_6$ in the experiment of tossing a die.
• The list of events $D_{even}$ and $D_{odd}$ in the experiment of tossing a die.
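The relations between events listed above map directly onto set operations, which the following sketch illustrates for the experiment of tossing a die. Representing events as Python sets of outcomes is a modelling choice made here, not part of the course text, and the helper name `is_complete` is hypothetical.

```python
from itertools import combinations

E = frozenset(range(1, 7))       # certain event: all simple events of one die toss
IMPOSSIBLE = frozenset()         # impossible event, written 0 in the text

D_even = frozenset({2, 4, 6})
D_odd = E - D_even               # complementary event
A = frozenset({1, 2, 3})         # "at most three dots"

product_event = D_even & A       # AB: both occur
sum_event = D_even | A           # A+B: at least one occurs
difference = D_even - A          # A − B: first occurs, second does not
implies = frozenset({2}) <= D_even   # "two dots" implies "even number of dots"


def is_complete(events, certain_event):
    """True if the events are pairwise mutually exclusive and their sum is the certain event."""
    exclusive = all(x & y == IMPOSSIBLE for x, y in combinations(events, 2))
    return exclusive and frozenset().union(*events) == certain_event


print(sorted(product_event), sorted(sum_event), sorted(difference), implies)
print(is_complete([D_even, D_odd], E))
```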
All relations between events may be interpreted geometrically by Venn diagrams. In these diagrams the entire sample space is represented by a rectangle and events are represented by parts of the rectangle. If two events are mutually exclusive, their parts of the rectangle will not overlap each other as shown in Figure 4.1a. If two events are not mutually exclusive, their parts of the rectangle will overlap as shown in Figure 4.1b.
Figure 4.1a Two mutually exclusive events
Figure 4.1b Two nonmutually exclusive events
Figure 4.2 Venn diagrams of events formed from A and B (panels labelled A+B and AB, among others)
In every problem in the theory of probability one has to deal with an experiment (under some
specific set of conditions) and some specific family of events S.
Definition 4.3
A family S of events is called a field of events if it satisfies the following properties:
1. If the events A and B belong to the family S, then so do the events AB, A+B and A − B.
2. The family S contains the certain event E and the impossible event 0 .
We see that the sample space of an experiment together with all the events generated from the
events of this space by operations “sum”, “product” and “complement” constitute a field of
events. Thus, for every experiment we have a field of events.
4.4 Definitions of probability
4.4.1 The classical definition of probability
The classical definition of probability reduces the concept of probability to the concept of
equiprobability (equal likelihood) of events, which is regarded as a primitive concept and hence
not subject to formal definition. For example, in the tossing of a single perfectly cubical die,
made of completely homogeneous material, the equally likely events are the appearance of any
of the specific number of dots (from 1 to 6) on its upper face.
Thus, for the classical definition of probability we suppose that all possible simple events are
equally likely.
Definition 4.4 (The classical definition of probability)
The probability P(A) of an event A is equal to the number of possible simple events (outcomes) favorable to A divided by the total number of possible simple events of the experiment, i.e.,
$$P(A) = \frac{m}{N},$$
where m = number of the simple events into which the event A can be decomposed and N = total number of possible simple events of the experiment.
Example 4.4 Consider again the experiment of tossing a balanced coin (see Example 4.1). In
this experiment the sample space consists of two simple events: H (Head is observed ) and T
(Tail is observed ). These events are equally likely. Therefore, P(H)=P(T)=1/2.
Example 4.5 Consider again the experiment of tossing a balanced die (see Example 4.2). In this experiment the sample space consists of 6 simple events: $D_1, D_2, D_3, D_4, D_5, D_6$, where $D_k$ is the event that k dots (k = 1, 2, 3, 4, 5, 6) are observed on the upper face of the die. These events are equally likely. Therefore, $P(D_k)$ = 1/6 (k = 1, 2, 3, 4, 5, 6).
Since $D_{odd} = D_1 + D_3 + D_5$ and $D_{even} = D_2 + D_4 + D_6$, where $D_{odd}$ is the event that an odd number of dots is observed and $D_{even}$ the event that an even number of dots is observed, we have $P(D_{odd})$ = 3/6 = 1/2 and $P(D_{even})$ = 3/6 = 1/2. If we denote by A the event that fewer than 6 dots are observed, then P(A) = 5/6 because $A = D_1 + D_2 + D_3 + D_4 + D_5$.
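The counting in Example 4.5 can be reproduced by enumerating the sample space. The sketch below uses exact fractions; the function name `classical_probability` is hypothetical, and an event is represented as a predicate on the number of dots.

```python
from fractions import Fraction

SAMPLE_SPACE = range(1, 7)  # the six equally likely simple events D_1, ..., D_6


def classical_probability(event):
    """P(A) = m / N: favorable simple events divided by all simple events."""
    favorable = sum(1 for outcome in SAMPLE_SPACE if event(outcome))
    return Fraction(favorable, len(SAMPLE_SPACE))


p_odd = classical_probability(lambda k: k % 2 == 1)
p_even = classical_probability(lambda k: k % 2 == 0)
p_less_than_six = classical_probability(lambda k: k < 6)
print(p_odd, p_even, p_less_than_six)
```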
According to the above definition, every event belonging to the field of events S has a well-defined probability. Therefore, the probability P(A) may be regarded as a function of the event A defined over the field of events S. This function has the following properties, which are easily proved.
The properties of probability:
1. For every event A of the field S, P(A) ≥ 0
2. For the certain event E, P(E) = 1
3. If the event A is decomposed into the mutually exclusive events B and C
belonging to S then P(A)=P(B)+P(C)
This property is called the theorem on the addition of probabilities.
4. The probability of the event $\bar{A}$ complementary to the event A is given by the formula $P(\bar{A}) = 1 - P(A)$.
5. The probability of the impossible event is zero, P(0) = 0.
6. If the event A implies the event B then P(A) ≤ P(B).
7. The probability of any event A lies between 0 and 1: 0 ≤ P(A) ≤ 1.
Example 4.6 Consider the experiment of tossing two fair coins. Find the probability of the event
A = {observe at least one Head} by using the complement relationship.
Solution The experiment of tossing two fair coins has 4 simple events: HH, HT, TH and TT, where H = {Head is observed}, T = {Tail is observed}. We see that the event A consists of the simple events HH, HT, TH. Then the complementary event for A is $\bar{A}$ = {No Heads observed} = TT. We have $P(\bar{A})$ = P(TT) = 1/4. Therefore, P(A) = 1 − $P(\bar{A})$ = 1 − 1/4 = 3/4.
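The complement relationship used in Example 4.6 can be checked by listing the four simple events. This is a small sketch with illustrative names; outcomes are written as two-letter strings such as "HT".

```python
from fractions import Fraction
from itertools import product

# The four equally likely simple events: HH, HT, TH, TT.
outcomes = ["".join(pair) for pair in product("HT", repeat=2)]


def prob(event):
    """Classical probability of an event given as a predicate on an outcome string."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))


p_no_heads = prob(lambda o: "H" not in o)   # P(complement of A) = P(TT)
p_at_least_one_head = 1 - p_no_heads        # complement rule
p_direct = prob(lambda o: "H" in o)         # direct count, for comparison
print(p_no_heads, p_at_least_one_head, p_direct)
```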
4.4.2 The statistical definition of probability
The classical definition of probability encounters insurmountable difficulties of a fundamental nature in passing from the simplest examples to a consideration of complex problems. First of all, the question arises, in a majority of cases, as to a reasonable way of selecting the “equally likely cases”. Thus, for example, it is difficult to determine the probability that tomorrow the weather will be good, or the probability that a baby to be born is a boy, or to answer the question “what are the chances that I will blow one of my stereo speakers if I turn my amplifier up to wide open?”
Lengthy observations as to the occurrence or nonoccurrence of an event A in a large number of repeated trials under the same set of conditions show that for a wide class of phenomena, the number of occurrences or nonoccurrences of the event A is subject to a stable law. Namely, if we denote by m the number of times the event A occurs in N independent trials, then it turns out that for sufficiently large N the ratio m/N, in most such series of observations, assumes an almost constant value. Since this constant is an objective numerical characteristic of the phenomenon, it is natural to call it the statistical probability of the random event A under investigation.
Definition 4.5 (The statistical definition of probability)
The probability of an event A can be approximated by the proportion of times that A
occurs when the experiment is repeated a very large number of times.
Fortunately, for the events to which the classical definition of probability is applicable, the
statistical probability is equal to the probability in the sense of the classical definition.
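A simulation gives a feel for the statistical definition: the relative frequency m/N of an event settles near its classical probability as N grows. The sketch below simulates tosses of a fair die with Python's pseudo-random generator; the function names and the fixed seed are choices made for this illustration only.

```python
import random


def toss_die(rng):
    """One trial of the experiment: the number of dots on a fair die."""
    return rng.randint(1, 6)


def relative_frequency(event, n_trials, seed=0):
    """Proportion m / N of trials in which the event occurs."""
    rng = random.Random(seed)  # fixed seed so that the run is reproducible
    hits = sum(1 for _ in range(n_trials) if event(toss_die(rng)))
    return hits / n_trials


def is_even(k):
    return k % 2 == 0


for n_trials in (100, 10_000, 100_000):
    print(n_trials, relative_frequency(is_even, n_trials))
```

The printed ratios fluctuate for small N and stay close to the classical value 1/2 for large N.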
4.4.3 Axiomatic construction of the theory of probability (optional)
The classical and statistical definitions of probability reveal some restrictions and shortcomings when dealing with complex natural phenomena and, in particular, they may lead to paradoxical conclusions, for example, the well-known Bertrand's paradox. Therefore, in order to find wide applications of the theory of probability, mathematicians have constructed a rigorous foundation for this theory. The first step was the axiomatic definition of probability, which includes as special cases both the classical and statistical definitions of probability and overcomes the shortcomings of each.
Below we formulate the axioms that define probability.
Axioms for probability
1. With each random event A in a field of events S, there is associated a non-negative number P(A), called its probability.
2. The probability of the certain event E is 1, i.e., P(E) = 1.
3. (Addition axiom) If the events $A_1, A_2, \dots, A_n$ are pairwise mutually exclusive, then
$$P(A_1 + A_2 + \dots + A_n) = P(A_1) + P(A_2) + \dots + P(A_n).$$
4. (Extended axiom of addition) If the event A is equivalent to the occurrence of at least one of the pairwise mutually exclusive events $A_1, A_2, \dots, A_n, \dots$, then
$$P(A) = P(A_1) + P(A_2) + \dots + P(A_n) + \dots$$
Obviously, the classical and statistical definitions of probability, which deal with finite sums of events, satisfy the axioms formulated above. The necessity for introducing the extended axiom of addition is motivated by the fact that in probability theory we constantly have to consider events that decompose into an infinite number of subevents.
4.5 Conditional probability and independence
We have said that a certain set of conditions $\mathcal{C}$ underlies the definition of the probability of an event. If no restrictions other than the conditions $\mathcal{C}$ are imposed when calculating the probability P(A), then this probability is called unconditional.
However, in many cases, one has to determine the probability of an event under the condition that another event B, whose probability is greater than 0, has already occurred.
Definition 4.6
The probability of an event A, given that an event B has occurred, is called the conditional probability of A given B and denoted by the symbol P(A|B).
Example 4.7 Consider the experiment of tossing a fair die. Denote by A and B the following events:
A = {Observing an even number of dots on the upper face of the die},
B = {Observing a number of dots less than or equal to 3 on the upper face of the die}.
Find the probability of the event A, given the event B.
Solution We know that the sample space of the experiment of tossing a fair die consists of 6 simple events: $D_1, D_2, D_3, D_4, D_5, D_6$, where $D_k$ is the event that k dots (k = 1, 2, 3, 4, 5, 6) are observed on the upper face of the die. These events are equally likely, and $P(D_k)$ = 1/6 (k = 1, 2, 3, 4, 5, 6). Since $A = D_2 + D_4 + D_6$ and $B = D_1 + D_2 + D_3$, we have P(A) = $P(D_2) + P(D_4) + P(D_6)$ = 3 * 1/6 = 1/2 and P(B) = $P(D_1) + P(D_2) + P(D_3)$ = 3 * 1/6 = 1/2.
If the event B has occurred then it reduces the sample space of the experiment from 6 simple events to 3 simple events (namely $D_1$, $D_2$, $D_3$, which are contained in the event B). Since the only even number among the three numbers 1, 2, 3 is 2, there is only one simple event $D_2$ of the reduced sample space that is contained in the event A. Therefore, we conclude that the probability that A occurs given that B has occurred is one in three, or 1/3, i.e., P(A|B) = 1/3.
For the above example it is easy to verify that $P(A|B) = \frac{P(AB)}{P(B)}$: indeed, P(AB) = $P(D_2)$ = 1/6 and P(B) = 1/2, so that P(AB)/P(B) = 1/3. In the general case, we use this formula to define the conditional probability.
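The ratio P(AB)/P(B) can be evaluated mechanically for Example 4.7. In this sketch events are modelled as sets of die faces and probabilities as exact fractions; the function names `prob` and `conditional` are hypothetical.

```python
from fractions import Fraction

SAMPLE_SPACE = frozenset(range(1, 7))  # faces of a fair die


def prob(event):
    """Classical probability of an event given as a set of faces."""
    return Fraction(len(event & SAMPLE_SPACE), len(SAMPLE_SPACE))


def conditional(a, b):
    """P(A|B) = P(AB) / P(B); only defined when P(B) > 0."""
    if prob(b) == 0:
        raise ValueError("conditioning event must have positive probability")
    return prob(a & b) / prob(b)


A = frozenset({2, 4, 6})   # even number of dots
B = frozenset({1, 2, 3})   # at most three dots
print(conditional(A, B))   # probability of A given B
print(conditional(B, A))   # probability of B given A
```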
Formula for conditional probability
If the probability of an event B is greater than 0 then the conditional probability of an event A, given that the event B has occurred, is calculated by the formula
$$P(A|B) = \frac{P(AB)}{P(B)}, \qquad (1)$$
where AB is the event that both A and B occur.
In the same way, if P(A) > 0, the conditional probability of an event B, given that the event A has occurred, is defined by the formula
$$P(B|A) = \frac{P(AB)}{P(A)}. \qquad (1')$$
Each of formulas (1) and (1') is equivalent to the so-called Multiplication Theorem.
Multiplication Theorem
The probability of the product of two events is equal to the product of the probability of one of the events by the conditional probability of the other event, given that the first event has occurred, namely
$$P(AB) = P(A)\,P(B|A) = P(B)\,P(A|B). \qquad (2)$$
The Multiplication Theorem is also applicable if one of the events A and B is impossible: if, say, A is impossible, then the equalities P(A|B) = 0 and P(AB) = 0 hold along with P(A) = 0.
Definition 4.7
We say that an event A is independent of an event B if
P(A|B) = P(A),
i.e., the occurrence of the event B does not affect the probability of the event A.
If the event A is independent of the event B, then it follows from the Multiplication Theorem (2) that P(A) P(B|A) = P(B) P(A). From this we find P(B|A) = P(B) if P(A) > 0, i.e., the event B is also independent of A. Thus, independence is a symmetrical relation.
Example 4.8 Consider the experiment of tossing a fair die and define the following events:
A = {Observe an even number of dots}
B = {Observe a number of dots less than or equal to 4}.
Are events A and B independent?
Solution As in Example 4.7 we have P(A) = 1/2 and P(B) = $P(D_1) + P(D_2) + P(D_3) + P(D_4)$ = 4 * 1/6 = 2/3, where $D_k$ is the event that k dots (k = 1, 2, 3, 4, 5, 6) are observed on the upper face of the die. Since $AB = D_2 + D_4$, we have P(AB) = $P(D_2) + P(D_4)$ = 1/6 + 1/6 = 1/3.
Now assuming B has occurred, the probability of A given B is
$$P(A|B) = \frac{P(AB)}{P(B)} = \frac{1/3}{2/3} = \frac{1}{2} = P(A).$$
Thus, assuming B has occurred does not alter the probability of A. Therefore, the events A and B are independent.
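For events of positive probability, the condition P(A|B) = P(A) is equivalent to P(AB) = P(A)P(B), which is convenient to test by enumeration. The sketch below checks it for the events of Example 4.8 and, for contrast, for the events of Example 4.7, which turn out to be dependent. Names are illustrative only.

```python
from fractions import Fraction

SAMPLE_SPACE = frozenset(range(1, 7))  # faces of a fair die


def prob(event):
    return Fraction(len(event & SAMPLE_SPACE), len(SAMPLE_SPACE))


def independent(a, b):
    """Product form of independence: P(AB) == P(A) * P(B)."""
    return prob(a & b) == prob(a) * prob(b)


EVEN = frozenset({2, 4, 6})
AT_MOST_4 = frozenset({1, 2, 3, 4})   # event B of Example 4.8
AT_MOST_3 = frozenset({1, 2, 3})      # event B of Example 4.7

print(independent(EVEN, AT_MOST_4))   # events of Example 4.8
print(independent(EVEN, AT_MOST_3))   # events of Example 4.7
```

In the second case P(AB) = 1/6 while P(A)P(B) = 1/4, so learning that at most three dots were observed does change the chance of an even number.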
The concept of independence of events plays an important role in the theory of probability and
its applications. In particular, the greater part of the results presented in this course is obtained
on the assumption that the various events considered are independent.
In practical problems, we rarely resort to verifying that the relations P(A|B) = P(A) or P(B|A) = P(B) are satisfied in order to determine whether or not the given events are independent. To determine independence, we usually make use of intuitive arguments based on experience.
The Multiplication Theorem in the case of independent events takes on a simple form.
Multiplication Theorem for independent events
If the events A and B are independent then
P(AB) = P(A) P(B).
We next generalize the notion of the independence of two events to that of a collection of
events.
Definition 4.8
The events $B_1, B_2, \dots, B_n$ are called collectively independent or mutually independent if for any event $B_p$ (p = 1, 2, ..., n) and for any group of other events $B_q, B_r, \dots, B_s$ of this collection, the event $B_p$ and the event $B_q B_r \cdots B_s$ are independent.
Note that for several events to be mutually independent, it is not sufficient that they be pairwise independent.
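A standard illustration of the last remark, not taken from the course text, uses two fair coins: let A = {first coin shows Head}, B = {second coin shows Head} and C = {both coins show the same face}. The sketch below verifies by enumeration that these events are pairwise independent, while A is not independent of the product BC.

```python
from fractions import Fraction
from itertools import combinations, product

OUTCOMES = frozenset(product("HT", repeat=2))  # (first coin, second coin)


def prob(event):
    return Fraction(len(event), len(OUTCOMES))


def independent(x, y):
    return prob(x & y) == prob(x) * prob(y)


A = frozenset(o for o in OUTCOMES if o[0] == "H")    # first coin shows Head
B = frozenset(o for o in OUTCOMES if o[1] == "H")    # second coin shows Head
C = frozenset(o for o in OUTCOMES if o[0] == o[1])   # both coins show the same face

pairwise = all(independent(x, y) for x, y in combinations((A, B, C), 2))
a_vs_bc = independent(A, B & C)   # the extra condition required by Definition 4.8

print(pairwise, a_vs_bc)
```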
4.6 Rules for calculating probability
4.6.1 The addition rule
From the classical definition of probability we deduced the addition theorem, which serves as
the addition axiom for the axiomatic definition of probability. Using this axiom we get the
following rule:
Addition rule
If the events $A_1, A_2, \dots, A_n$ are pairwise mutually exclusive, then
$$P(A_1 + A_2 + \dots + A_n) = P(A_1) + P(A_2) + \dots + P(A_n).$$
In the case of two nonmutually exclusive events A and B we have the formula
P(A+B) = P(A) + P(B) − P(AB).
Example 4.9 In a box there are 10 red balls, 20 blue balls, 10 yellow balls and 10 white balls. Draw one ball at random from the box. Find the probability that this ball is colored (i.e., not white).
Solution Denote by R, B, Y and W the events that the ball drawn is red, blue, yellow and white, respectively, and by C the event that it is colored. Then P(R) = 10/(10+20+10+10) = 10/50 = 1/5, P(B) = 20/50 = 2/5, P(Y) = 10/50 = 1/5. Since C = R+B+Y and the events R, B and Y are mutually exclusive, we have P(C) = P(R+B+Y) = 1/5 + 2/5 + 1/5 = 4/5.
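Both forms of the addition rule can be checked with exact fractions. The first part of the sketch reproduces Example 4.9; the second part applies the formula for nonmutually exclusive events to two die events chosen here purely for illustration (an even number of dots, and at most four dots).

```python
from fractions import Fraction

# Example 4.9: mutually exclusive events, probabilities simply add.
counts = {"red": 10, "blue": 20, "yellow": 10, "white": 10}
total = sum(counts.values())
p = {color: Fraction(k, total) for color, k in counts.items()}
p_colored = p["red"] + p["blue"] + p["yellow"]

# General rule on a fair die: P(A+B) = P(A) + P(B) - P(AB).
SPACE = frozenset(range(1, 7))


def prob(event):
    return Fraction(len(event), len(SPACE))


A = frozenset({2, 4, 6})      # even number of dots (illustrative choice)
B = frozenset({1, 2, 3, 4})   # at most four dots (illustrative choice)
lhs = prob(A | B)
rhs = prob(A) + prob(B) - prob(A & B)

print(p_colored, lhs, rhs)
```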
In the preceding section we also obtained the Multiplication Theorem. We recall it below for the purpose of computing probabilities.
Multiplicative rule
For any two events A and B from the same field of events there holds the formula
P(AB) = P(A) P(B|A) = P(B) P(A|B).
If these events are independent then
P(AB) = P(A) P(B).
Now suppose that the event B may occur together with one and only one of n mutually exclusive events $A_1, A_2, \dots, A_n$, that is
$$B = A_1 B + A_2 B + \dots + A_n B.$$
By the Addition rule we have
$$P(B) = P(A_1 B) + P(A_2 B) + \dots + P(A_n B).$$
Further, by the Multiplicative rule we get a formula, called the formula of total probability.
Formula of total probability
If the event B may occur together with one and only one of n mutually exclusive events $A_1, A_2, \dots, A_n$ then
$$P(B) = P(A_1)P(B|A_1) + P(A_2)P(B|A_2) + \dots + P(A_n)P(B|A_n).$$
Example 4.10 There are 5 boxes of lamps:
3 boxes with the content $A_1$: 9 good lamps and 1 defective lamp,
2 boxes with the content $A_2$: 4 good lamps and 2 defective lamps.
Select one box at random and draw one lamp from this box. Find the probability that the drawn lamp is defective.
Solution Denote by B the event that the drawn lamp is defective and, by the same symbols $A_1$, $A_2$, the events that a box with content $A_1$, $A_2$, respectively, is selected. Since the defective lamp may be drawn from a box of either content $A_1$ or content $A_2$, we have $B = A_1 B + A_2 B$. By the formula of total probability, $P(B) = P(A_1)P(B|A_1) + P(A_2)P(B|A_2)$.
Since $P(A_1)$ = 3/5, $P(A_2)$ = 2/5, $P(B|A_1)$ = 1/10 and $P(B|A_2)$ = 2/6 = 1/3, we have
P(B) = 3/5 * 1/10 + 2/5 * 1/3 = 29/150 ≈ 0.19.
Thus, the probability that the drawn lamp is defective is approximately 0.19.
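The computation of Example 4.10 is a direct application of the formula of total probability and can be sketched as below. The function name `total_probability` is hypothetical; the priors are the probabilities of selecting each type of box and the likelihoods are the conditional probabilities of drawing a defective lamp.

```python
from fractions import Fraction


def total_probability(priors, likelihoods):
    """P(B) = sum of P(A_i) * P(B|A_i) over a complete list of events A_i."""
    if sum(priors) != 1:
        raise ValueError("the events A_i must form a complete list (priors sum to 1)")
    return sum(p * l for p, l in zip(priors, likelihoods))


priors = [Fraction(3, 5), Fraction(2, 5)]          # P(A_1), P(A_2)
likelihoods = [Fraction(1, 10), Fraction(2, 6)]    # P(B|A_1), P(B|A_2)

p_defective = total_probability(priors, likelihoods)
print(p_defective, float(p_defective))
```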
Now, under the same assumptions and notations as in the formula of total probability, find the probability of the event $A_k$, given that the event B has occurred.
According to the Multiplicative rule,
$$P(A_k B) = P(B)\,P(A_k|B) = P(A_k)\,P(B|A_k).$$
Hence,
$$P(A_k|B) = \frac{P(A_k)\,P(B|A_k)}{P(B)}.$$
Using the formula of total probability, we then find the following
Bayes's Formula
If the event B may occur together with one and only one of n mutually exclusive events $A_1, A_2, \dots, A_n$ then
$$P(A_k|B) = \frac{P(A_k)\,P(B|A_k)}{P(B)} = \frac{P(A_k)\,P(B|A_k)}{\sum_{j=1}^{n} P(A_j)\,P(B|A_j)}.$$
The formula of Bayes is sometimes called the formula for probabilities of hypotheses.
Example 4.11 As in Example 4.10, there are 5 boxes of lamps:
3 boxes with the content $A_1$: 9 good lamps and 1 defective lamp,
2 boxes with the content $A_2$: 4 good lamps and 2 defective lamps.
From one of the boxes, chosen at random, a lamp is withdrawn. It turns out to be defective (event B). What is the probability, after the experiment has been performed (the a posteriori probability), that the lamp was taken from a box of content $A_1$?
Solution We have calculated $P(A_1)$ = 3/5, $P(A_2)$ = 2/5, $P(B|A_1)$ = 1/10, $P(B|A_2)$ = 2/6 = 1/3 and P(B) = 29/150. Hence, the formula of Bayes gives
$$P(A_1|B) = \frac{P(A_1)\,P(B|A_1)}{P(B)} = \frac{3/5 \cdot 1/10}{29/150} = \frac{9}{29} \approx 0.31.$$
Thus, the probability that the lamp was taken from a box of content $A_1$, given that the experiment has been performed and the lamp is defective, is approximately 0.31.
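Bayes's formula for Example 4.11 can be sketched in the same style. The function name `posterior` is hypothetical; it returns the a posteriori probabilities of all hypotheses at once, so the value for the second type of box is obtained as a by-product.

```python
from fractions import Fraction


def posterior(priors, likelihoods):
    """Bayes's formula: P(A_k|B) for every k, given P(A_k) and P(B|A_k)."""
    joint = [p * l for p, l in zip(priors, likelihoods)]   # P(A_k) * P(B|A_k)
    evidence = sum(joint)                                  # P(B), by total probability
    return [j / evidence for j in joint]


priors = [Fraction(3, 5), Fraction(2, 5)]          # P(A_1), P(A_2)
likelihoods = [Fraction(1, 10), Fraction(1, 3)]    # P(B|A_1), P(B|A_2)

post = posterior(priors, likelihoods)
print(post, [round(float(x), 2) for x in post])
```

Although a box of content $A_1$ is more likely to be selected, observing a defective lamp shifts most of the probability to content $A_2$, whose boxes contain a larger share of defective lamps.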
4.7 Summary
In this chapter we introduced the notion of an experiment whose outcomes, called events, cannot be predicted with certainty in advance. The uncertainty associated with these events is measured by their probabilities. But what is probability? To answer this question we briefly discussed approaches to probability and gave the classical and statistical definitions of probability. The classical definition of probability reduces the concept of probability to the concept of equiprobability of simple events. According to the classical definition, the probability of an event A is equal to the number of possible simple events favorable to A divided by the total number of possible simple events of the experiment. By the statistical definition, in turn, the probability of an event A is approximated by the proportion of times that A occurs when the experiment is repeated a very large number of times.
4.8 Exercises
A, B, C are random events.
1) Explain the meaning of the relations:
a) ABC = A;
b) A + B + C = A.
2) Simplify the expressions
a) (A+B)(B+C);
b) $(A+B)(A+\bar{B})$;
c) $(A+B)(A+\bar{B})(\bar{A}+B)$.
3) A four-volume work is placed on a shelf in random order. What is the probability that the books are in proper order from right to left or left to right?
4) In a lot consisting of N items, M are defective; n items are selected at random from the lot (n<N). Find the probability that m (m ≤ N) of them will prove to be defective.
5) A quality control inspector examines the articles in a lot consisting of m items of first grade
and n items of second grade. A check of the first b articles chosen at random from the lot
has shown that all of them are of second grade (b<m). Find the probability that of the next
two items selected at random from those remaining at least one proves to be second grade.
6) From a box containing m white balls and n black balls (m>n), one ball after another is drawn
at random. What is the probability that at some point the number of white balls and black
balls drawn will be the same?
7) Two newly designed data base management systems (DBMS), A and B, are being
considered for marketing by a large computer software vendor. To determine whether
DBMS users have a preference for one of the two systems, four of the vendor’s customers
are randomly selected and given the opportunity to evaluate the performances of each of
the two systems. After sufficient testing, each user is asked to state which DBMS gave the
better performance (measured in terms of CPU utilization, execution time, and disk access).
a) Count the possible outcomes for this marketing experiment.
b) If DBMS users actually have no preference for one system over the other (i.e.,
performances of the two systems are identical), what is the probability that all four
sampled users prefer system A?
c) If all four customers express their preference for system A, can the software vendor infer
that DBMS users in general have a preference for one of the two systems?
Chapter 5 Basic Probability distributions
CONTENTS
5.1. Random variables
5.2. The probability distribution for a discrete random variable
5.3. Numerical characteristics of a discrete random variable
5.4. The binomial probability distribution
5.5. The Poisson distribution
5.6 Continuous random variables: distribution function and density function
5.7 Numerical characteristics of a continuous random variable
5.8. The normal distribution
5.9. Summary
5.10. Exercises
5.1 Random variables
One of the fundamental concepts of probability theory is that of a random variable.
Definition 5.1
A random variable is a variable that assumes numerical values associated with
events of an experiment.
Example 5.1 Observe 100 babies to be born in a clinic. The number of boys born is a random
variable. It may take values from 0 to 100.
Example 5.2 The number of patients visiting a clinic each day is a random variable.
Example 5.3 Select one student from a university, measure his/her height and denote this
height by x. Then x is a random variable, assuming values from, say, 100 cm to 250 cm,
depending on the student selected.
Example 5.4 The weight of babies at birth is also a random variable. It can assume values in
an interval, for example, from 800 grams to 6000 grams.
Classification of random variables: Random variables may be divided into two types:
discrete random variables and continuous random variables.
Definition 5.2
A discrete random variable is one that can assume only a countable number of
values.
A continuous random variable can assume any value in one or more intervals on
a line.
Among the random variables described above the number of boys in Example 5.1 and the
number of patients in Example 5.2 are discrete random variables, the height of students and the
weight of babies are continuous random variables.
Example 5.5 Suppose you randomly select a student attending your university. Classify each of
the following random variables as discrete or continuous:
a) Number of credit hours taken by the student this semester
b) Current grade point average of the student.
Solution a) The number of credit hours taken by the student this semester is a discrete random
variable because it can assume only a countable number of values (for example 10, 11, 12, and
so on). It is not continuous since the number of credit hours cannot assume values such as
11.5678, 15.3456 and 12.9876 hours.
b) The grade point average for the student is a continuous random variable because it could
theoretically assume any value (for example, 5.455, 8.986) corresponding to the points on the
interval from 0 to 10 of a line.
5.2 The probability distribution for a discrete random variable
Definition 5.3
The probability distribution for a discrete random variable x is a table, graph, or
formula that gives the probability of observing each value of x. We shall denote the
probability of x by the symbol p(x).
Thus, the probability distribution for a discrete random variable x may be given in one of the
following ways:
1. the table
x      p
x_1    p_1
x_2    p_2
...    ...
x_n    p_n
where p_k is the probability that the variable x assumes the value x_k (k = 1, 2, ..., n).
2. a formula for calculating p(x_k) (k = 1, 2, ..., n).
3. a graph presenting the probability of each value x_k.
Example 5.6 A balanced coin is tossed twice and the number x of heads is observed. Find the
probability distribution for x.
Solution Let H_k and T_k denote the observation of a head and a tail, respectively, on the k-th
toss, for k = 1, 2. The four simple events and the associated values of x are shown in Table 5.1.
Table 5.1 Simple events of the experiment of tossing a coin twice
SIMPLE EVENT    DESCRIPTION    PROBABILITY    NUMBER OF HEADS
E_1             H_1 H_2        0.25           2
E_2             H_1 T_2        0.25           1
E_3             T_1 H_2        0.25           1
E_4             T_1 T_2        0.25           0
The event x = 0 is the collection of all simple events that yield a value of x = 0, namely, the
simple event E_4. Therefore, the probability that x assumes the value 0 is
P(x = 0) = p(0) = P(E_4) = 0.25.
The event x = 1 contains two simple events, E_2 and E_3. Therefore,
P(x = 1) = p(1) = P(E_2) + P(E_3) = 0.25 + 0.25 = 0.5.
Finally,
P(x = 2) = p(2) = P(E_1) = 0.25.
The probability distribution p(x) is displayed in tabular form in Table 5.2 and as a probability
histogram in Figure 5.1.
Table 5.2 Probability distribution for x, the number of heads in two
tosses of a coin
x p(x)
0 0.25
1 0.5
2 0.25
Figure 5.1 Probability distribution for x, the number of heads in two tosses of
a coin
Properties of the probability distribution for a discrete random variable x
1. 0 ≤ p(x) ≤ 1
2. \[ \sum_{\text{all } x} p(x) = 1 \]
Relationship between the probability distribution for a discrete random variable and the
relative frequency distribution of data:
Suppose you were to toss two coins over and over again a very large number of times and
record the number x of heads for each toss. A relative frequency distribution for the resulting
collection of 0’s, 1’s and 2’s would be very similar to the probability distribution shown in Figure
5.1. In fact, if it were possible to repeat the experiment an infinitely large number of times, the
two distributions would be almost identical.
Thus, the probability distribution of Figure 5.1 provides a model for a conceptual population of
values x – the values of x that would be observed if the experiment were to be repeated an
infinitely large number of times.
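The relationship between the relative frequency distribution and the probability distribution can be illustrated by simulation. The sketch below assumes a pseudo-random generator is an acceptable stand-in for a balanced coin; the function name `simulate_two_coin_tosses` and the fixed seed are illustrative assumptions.

```python
import random
from collections import Counter


def simulate_two_coin_tosses(repetitions, seed=0):
    """Toss two balanced coins `repetitions` times and return the
    relative frequency of x = 0, 1, 2 heads."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    counts = Counter()
    for _ in range(repetitions):
        # Each coin contributes 1 (head) or 0 (tail) with equal probability.
        heads = rng.randint(0, 1) + rng.randint(0, 1)
        counts[heads] += 1
    return {x: counts[x] / repetitions for x in (0, 1, 2)}


rel_freq = simulate_two_coin_tosses(100_000)
for x in (0, 1, 2):
    print(x, rel_freq[x])
```

With a large number of repetitions the printed relative frequencies should lie close to the values 0.25, 0.5 and 0.25 of Table 5.2, although they will not match them exactly.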
5.3 Numerical characteristics of a discrete random variable
5.3.1 Mean or expected value
Since a probability distribution for a random variable x is a model for a population relative
frequency distribution, we can describe it with numerical descriptive measures, such as its
mean and standard deviation, and we can use Chebyshev's theorem to identify improbable
values of x.
The expected value (or mean) of a random variable x, denoted by the symbol E(x), is defined as
follows:
Definition 5.4
Let x be a discrete random variable with probability distribution p(x). Then the mean
or expected value of x is
\[ \mu = E(x) = \sum_{\text{all } x} x\,p(x) \]
Example 5.6 Refer to the coin-tossing experiment of Section 5.2 (a balanced coin tossed twice)
and the probability distribution for the random variable x, shown in Figure 5.1. Demonstrate
that the formula for E(x) gives the mean of the probability distribution for the discrete random
variable x.
Solution If we were to repeat the experiment a large number of times, say 400,000 times, we
would expect to observe x = 0 heads approximately 100,000 times, x = 1 head approximately
200,000 times and x = 2 heads approximately 100,000 times. Calculating the mean of these
400,000 values of x, we obtain
\[ \mu \approx \frac{\sum x}{n} = \frac{100{,}000(0) + 200{,}000(1) + 100{,}000(2)}{400{,}000} = \frac{1}{4}(0) + \frac{1}{2}(1) + \frac{1}{4}(2) = \sum_{\text{all } x} x\,p(x) \]
Thus, the mean of x is μ = 1.
If x is a random variable then any function g(x) of x also is a random variable. The expected
value of g(x) is defined as follows:
Definition 5.5
Let x be a discrete random variable with probability distribution p(x) and let g(x) be a
function of x. Then the mean or expected value of g(x) is
\[ E[g(x)] = \sum_{\text{all } x} g(x)\,p(x) \]
5.3.2 Variance and standard deviation
The second important numerical characteristics of a random variable are its variance and
standard deviation, which are defined as follows:
Definition 5.6
Let x be a discrete random variable with probability distribution p(x). Then the
variance of x is
\[ \sigma^2 = E[(x - \mu)^2] \]
The standard deviation of x is the positive square root of the variance of x:
\[ \sigma = \sqrt{\sigma^2} \]
Example 5.7 Refer to the two-coin tossing experiment and the probability distribution for x,
shown in Figure 5.1. Find the variance and standard deviation of x.
Solution In Example 5.6 we found that the mean of x is 1. Then
\[ \sigma^2 = E[(x - \mu)^2] = \sum_{x=0}^{2} (x - \mu)^2 p(x) = (0 - 1)^2 \left(\frac{1}{4}\right) + (1 - 1)^2 \left(\frac{1}{2}\right) + (2 - 1)^2 \left(\frac{1}{4}\right) = \frac{1}{2} \]
and
\[ \sigma = \sqrt{\sigma^2} = \sqrt{\frac{1}{2}} \approx 0.707 \]
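Definitions 5.4 to 5.6 translate directly into code. The following sketch represents a discrete probability distribution as a Python dictionary mapping each value x to p(x); this representation and the function names `mean`, `expected_value` and `variance` are assumptions made for illustration.

```python
import math


def mean(dist):
    """E(x): sum of x * p(x) over all values x."""
    return sum(x * p for x, p in dist.items())


def expected_value(dist, g):
    """E[g(x)]: sum of g(x) * p(x) over all values x."""
    return sum(g(x) * p for x, p in dist.items())


def variance(dist):
    """E[(x - mu)^2], computed with the general formula for E[g(x)]."""
    mu = mean(dist)
    return expected_value(dist, lambda x: (x - mu) ** 2)


# Probability distribution of Table 5.2 (number of heads in two tosses).
two_coin = {0: 0.25, 1: 0.5, 2: 0.25}

mu = mean(two_coin)
var = variance(two_coin)
sigma = math.sqrt(var)
print(mu, var, round(sigma, 3))
```

For the two-coin distribution this reproduces the mean 1, the variance 1/2 and the standard deviation of about 0.707 obtained by hand above.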
5.4 The binomial probability distribution
Many real-life experiments are analogous to tossing an unbalanced coin a number n of times.
Example 5.8 Suppose that 80% of the jobs submitted to a data-processing center are of a
statistical nature. Then selecting a random sample of 10 submitted jobs would be analogous to
tossing an unbalanced coin 10 times, with the probability of observing a head (drawing a
statistical job) on a single trial equal to 0.80.
Example 5.9 Tests for impurities commonly found in drinking water from private wells showed
that 30% of all wells in a particular country have impurity A. Selecting 20 wells at random
would be analogous to tossing an unbalanced coin 20 times, with the probability of
observing a head (selecting a well with impurity A) on a single trial equal to 0.30.
Example 5.10 Public opinion or consumer preference polls that elicit one of two responses
(Yes or No, Approve or Disapprove, ...) are also analogous to the unbalanced coin tossing
experiment if the size N of the population is large and the size n of the sample is relatively small.
All these experiments are particular examples of a binomial experiment, also known as a
Bernoulli process, after the seventeenth-century Swiss mathematician Jacob Bernoulli. Such
experiments and the resulting binomial random variables have the following characteristics,
which form the model of a binomial random variable.
Model (or characteristics) of a binomial random variable
1. The experiment consists of n identical trials.
2. There are only 2 possible outcomes on each trial. We will denote one outcome by
S (for Success) and the other by F (for Failure).
3. The probability of S remains the same from trial to trial. This probability will be
denoted by p, and the probability of F will be denoted by q (q = 1 − p).
4. The trials are independent.
5. The binomial random variable x is the number of S's in n trials.
The binomial probability distribution, its mean and its variance are given by the following
formulas:
The probability distribution, mean and variance for a binomial random
variable:
1. The probability distribution:
\[ p(x) = C_n^x\, p^x q^{n-x} \quad (x = 0, 1, 2, \ldots, n), \]
where
p = probability of a success on a single trial, q = 1 − p,
n = number of trials, x = number of successes in n trials,
\[ C_n^x = \frac{n!}{x!\,(n-x)!} \]
= number of combinations of x from n.
2. The mean: μ = np
3. The variance: σ² = npq
Example 5.11 (see also Example 5.9) Tests for impurities commonly found in drinking water from
private wells showed that 30% of all wells in a particular country have impurity A. If a random
sample of 5 wells is selected from the large number of wells in the country, what is the
probability that:
a) Exactly 3 will have impurity A?
b) At least 3?
c) Fewer than 3?
Solution First we confirm that this experiment possesses the characteristics of a binomial
experiment. This experiment consists of n = 5 trials, one corresponding to each randomly
selected well. Each trial results in an S (the well contains impurity A) or an F (the well does not
contain impurity A). Since the total number of wells in the country is large, the probability of
drawing a single well and finding that it contains impurity A is equal to 0.30 and this probability
will remain the same for each of the 5 selected wells. Further, since the sampling is random, we
assume that the outcome on any one well is unaffected by the outcome of any other and that
the trials are independent. Finally, we are interested in the number x of wells in the sample of
n = 5 that contain impurity A. Therefore, the sampling process represents a binomial experiment
with n = 5 and p = 0.30.
a) The probability of drawing exactly x = 3 wells containing impurity A is given by
p(x) = C_n^x p^x q^{n−x} with n = 5, p = 0.30 and x = 3. By this formula we have
\[ p(3) = \frac{5!}{3!\,(5-3)!}\,(0.30)^3 (1 - 0.30)^{5-3} = 0.1323. \]
b) The probability of observing at least 3 wells containing impurity A is
P(x ≥ 3) = p(3) + p(4) + p(5). We have calculated p(3) = 0.1323 and we leave it to the reader
to verify that p(4) = 0.02835 and p(5) = 0.00243. As a result, P(x ≥ 3) = 0.1323 + 0.02835 +
0.00243 = 0.16308.
c) Although P(x < 3) = p(0) + p(1) + p(2), we can avoid calculating 3 probabilities by using the
complementary relationship P(x < 3) = 1 − P(x ≥ 3) = 1 − 0.16308 = 0.83692.
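The three answers of Example 5.11 can be reproduced numerically. This is a minimal sketch that uses only the Python standard library (`math.comb` requires Python 3.8 or later); the helper name `binomial_pmf` is an illustrative assumption.

```python
from math import comb


def binomial_pmf(x, n, p):
    """p(x) = C(n, x) * p^x * q^(n - x) with q = 1 - p."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


n, p = 5, 0.30

p_exactly_3 = binomial_pmf(3, n, p)
p_at_least_3 = sum(binomial_pmf(x, n, p) for x in range(3, n + 1))
p_fewer_than_3 = 1 - p_at_least_3  # complementary relationship

print(round(p_exactly_3, 5))
print(round(p_at_least_3, 5))
print(round(p_fewer_than_3, 5))
```

Summing `binomial_pmf` over x = 0, 1, 2 directly gives the same value as the complementary relationship, which is a convenient consistency check.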
5.5 The Poisson distribution
The Poisson probability distribution is named for the French mathematician S. D. Poisson
(1781-1840). It is used to describe a number of processes, including the distribution of telephone
calls going through a switchboard system, the demand of patients for service at a health
institution, the arrivals of trucks and cars at a tollbooth, and the number of accidents at an
intersection.
Characteristics defining a Poisson random variable
1. The experiment consists of counting the number x of times a particular event
occurs during a given unit of time
2. The probability that an event occurs in a given unit of time is the same for all
units.
3. The number of events that occur in one unit of time is independent of the
number that occur in other units.
4. The mean number of events in each unit will be denoted by the Greek letter λ
The formulas for the probability distribution, the mean and the variance of a Poisson random
variable are shown in the next box.
The probability distribution, mean and variance for a Poisson random variable x:
1. The probability distribution:
\[ p(x) = \frac{\lambda^x e^{-\lambda}}{x!} \quad (x = 0, 1, 2, \ldots), \]
where
λ = mean number of events during the given time period,
e = 2.71828... (the base of natural logarithm).
2. The mean: μ = λ
3. The variance: σ² = λ
Note that instead of time, the Poisson random variable may be considered in the experiment of
counting the number x of times a particular event occurs during a given unit of area, volume,
etc.
Example 5.12 Suppose that we are investigating the safety of a dangerous intersection. Past
police records indicate a mean of 5 accidents per month at this intersection. Suppose the
number of accidents is distributed according to a Poisson distribution. Calculate the probability
in any month of exactly 0, 1, 2, 3 or 4 accidents.
Solution Since the number of accidents is distributed according to a Poisson distribution and the
mean number of accidents per month is 5, the probability of x accidents occurring in any month is
\[ p(x) = \frac{5^x e^{-5}}{x!}. \]
By this formula we can calculate
p(0) = 0.00674, p(1) = 0.03369, p(2) = 0.08422, p(3) = 0.14037, p(4) = 0.17547.
The probability distribution of the number of accidents per month is presented in Table 5.3 and
Figure 5.2.
Table 5.3 Poisson probability distribution of the number of accidents per month
x (number of accidents)    p(x) (probability)
0 0.006738
1 0.03369
2 0.084224
3 0.140374
4 0.175467
5 0.175467
6 0.146223
7 0.104445
8 0.065278
9 0.036266
10 0.018133
11 0.008242
12 0.003434
Figure 5.2 The Poisson probability distribution of the number of accidents
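The entries of Table 5.3 follow from the Poisson formula with λ = 5. The sketch below recomputes them with the Python standard library; the helper name `poisson_pmf` and the dictionary `table` are illustrative assumptions.

```python
from math import exp, factorial


def poisson_pmf(x, lam):
    """p(x) = lam^x * e^(-lam) / x! for x = 0, 1, 2, ..."""
    return lam**x * exp(-lam) / factorial(x)


lam = 5  # mean number of accidents per month
table = {x: poisson_pmf(x, lam) for x in range(13)}

for x, prob in table.items():
    print(f"{x:2d}  {prob:.6f}")
```

Note that p(4) and p(5) coincide, as they do in Table 5.3: for an integer mean λ the Poisson probabilities at x = λ − 1 and x = λ are equal.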
5.6 Continuous random variables: distribution function and density
function
Many random variables observed in real life are not discrete random variables because the
number of values they can assume is not countable. In contrast to discrete random variables,
these variables can take on any value within an interval. Examples include the daily rainfall at
some location, the strength of a steel bar and the intensity of sunlight at a particular time of day.
In Section 5.1 these random variables were called continuous random variables.
The distinction between discrete random variables and continuous random variables is usually
based on the difference in their cumulative distribution functions.
Definition 5.7
Let ξ be a continuous random variable assuming any value in the interval (−∞, +∞).
Then the cumulative distribution function F(x) of the variable ξ is defined as
follows:
\[ F(x) = P(\xi \le x) \]
i.e., F(x) is equal to the probability that the variable ξ assumes values which are
less than or equal to x.
Note that here and from now on we denote by the letter ξ a continuous random variable and
by x a point on the number line.
From the definition of the cumulative distribution function F(x) it is easy to derive the following
properties.
Properties of the cumulative distribution function F(x) for a continuous
random variable ξ
1. 0 ≤ F(x) ≤ 1,
2. F(x) is a monotonically nondecreasing function, that is, if a ≤ b then F(a) ≤ F(b)
for any real numbers a and b.
3. P(a ≤ ξ ≤ b) = F(b) − F(a)
4. F(x) → 0 as x → −∞ and F(x) → 1 as x → +∞
In Chapter 2 we described a large data set by means of a relative frequency distribution. If the
data represent measurements on a continuous random variable and if the amount of data is
very large, we can reduce the width of the class intervals until the distribution appears to be a
smooth curve. A probability density is a theoretical model for this distribution.
Definition 5.8
If F(x) is the cumulative distribution function for a continuous random variable ξ, then
the probability density function f(x) for ξ is
f(x) = F'(x),
i.e., f(x) is the derivative of the distribution function F(x).
The density function for a continuous random variable ξ, the model for some real-life population
of data, will usually be a smooth curve as shown in Figure 5.3.
Figure 5.3 Density function f(x) for a continuous random variable
It follows from Definition 5.8 that
\[ F(x) = \int_{-\infty}^{x} f(t)\,dt \]
Thus, the cumulative area under the curve between −∞ and a point x_0 is equal to F(x_0).
The density function for a continuous random variable must always satisfy the two properties
given in the box.
Properties of a density function
1. f(x) ≥ 0
2. \[ \int_{-\infty}^{+\infty} f(x)\,dx = F(\infty) = 1 \]
5.7 Numerical characteristics of a continuous random variable
Definition 5.8
Let ξ be a continuous random variable with density function f(x). Then the mean or
the expected value of ξ is
\[ E(\xi) = \int_{-\infty}^{+\infty} x f(x)\,dx \]
Definition 5.9
Let ξ be a continuous random variable with density function f(x) and let g(x) be a
function of x. Then the mean or the expected value of g(ξ) is
\[ E[g(\xi)] = \int_{-\infty}^{+\infty} g(x) f(x)\,dx \]
Definition 5.10
Let ξ be a continuous random variable with the expected value E(ξ) = μ. Then the
variance of ξ is
\[ \sigma^2 = E[(\xi - \mu)^2] \]
The standard deviation of ξ is the positive square root of the variance:
\[ \sigma = \sqrt{\sigma^2} \]
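The integrals in these definitions can be approximated numerically. The sketch below applies a simple midpoint rule to a uniform density on the interval [a, b] = [2, 6]; the choice of density, the interval, the number of steps and the helper name `integrate` are all assumptions made for illustration. Because this density is zero outside [a, b], the integration is restricted to that interval.

```python
def integrate(func, lower, upper, steps=100_000):
    """Approximate the integral of func over [lower, upper] by the midpoint rule."""
    width = (upper - lower) / steps
    return sum(func(lower + (i + 0.5) * width) for i in range(steps)) * width


a, b = 2.0, 6.0  # hypothetical interval for the illustration


def uniform_density(x):
    """f(x) = 1 / (b - a) on [a, b] and 0 elsewhere."""
    return 1.0 / (b - a) if a <= x <= b else 0.0


# Property 2 of a density function: the total area equals 1.
total_area = integrate(uniform_density, a, b)
# Mean: integral of x * f(x).
mean = integrate(lambda x: x * uniform_density(x), a, b)
# Variance: E[g(xi)] with g(x) = (x - mean)^2.
variance = integrate(lambda x: (x - mean) ** 2 * uniform_density(x), a, b)

print(total_area, mean, variance)
```

For this interval the results should be close to 1, (a + b)/2 = 4 and (b − a)²/12 ≈ 1.333, the closed-form values that Exercise 1 of Section 5.10 asks the reader to derive.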
5.8 Normal probability distribution
The normal (or Gaussian) density function was proposed by C. F. Gauss (1777-1855) as a model
for the relative frequency distribution of errors, such as errors of measurement. Amazingly, this
bell-shaped curve provides an adequate model for the relative frequency distributions of data
collected from many different scientific areas.
The density function, mean and variance for a normal random variable
The density function:
\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2/(2\sigma^2)} \]
The parameters μ and σ² are the mean and the variance, respectively, of the normal
random variable.
There is an infinite number of normal density functions, one for each combination of μ and σ.
The mean measures the location of the distribution and the variance measures its spread.
Several different normal density functions are shown in Figure 5.4.
Figure 5.4 Several normal distributions: Curve 1 with μ = 3, σ = 1;
Curve 2 with μ = −1, σ = 0; and Curve 3 with μ = 0, σ = 1.5
If μ = 0 and σ = 1 then
\[ f(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}. \]
The distribution with this density function is called the standardized normal distribution.
The graph of the standardized normal density is shown in Figure 5.5.
Figure 5.5 The standardized normal density distribution
If ξ is a normal random variable with mean μ and variance σ², then
1) the variable
\[ z = \frac{\xi - \mu}{\sigma} \]
is the standardized normal random variable;
2) \[ P(|\xi - \mu| \le n\sigma) = 2\Phi(n), \]
where
\[ \Phi(x) = \frac{1}{\sqrt{2\pi}} \int_0^x e^{-t^2/2}\,dt. \]
This function is called the Laplace function and it is tabulated.
In particular, we have
P(|ξ − μ| ≤ σ) = 0.6826
P(|ξ − μ| ≤ 2σ) = 0.9544
P(|ξ − μ| ≤ 3σ) = 0.9973
These equalities are known as the σ, 2σ and 3σ rules, respectively, and are often used in
statistics. Namely, if a population of measurements has approximately a normal distribution, the
probability that a randomly selected observation falls within the intervals (μ − σ, μ + σ),
(μ − 2σ, μ + 2σ) and (μ − 3σ, μ + 3σ) is approximately 0.6826, 0.9544 and 0.9973, respectively.
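Instead of reading the Laplace function from a table, it can be evaluated through the error function, since Φ(x) = erf(x/√2)/2 for the definition of Φ used above. The following sketch relies on `math.erf` from the Python standard library; the function names `laplace_function` and `prob_within` are illustrative assumptions.

```python
from math import erf, sqrt


def laplace_function(x):
    """Phi(x) = (1 / sqrt(2*pi)) * integral from 0 to x of exp(-t^2 / 2) dt."""
    return 0.5 * erf(x / sqrt(2))


def prob_within(n):
    """P(|xi - mu| <= n * sigma) = 2 * Phi(n) for a normal random variable."""
    return 2 * laplace_function(n)


for n in (1, 2, 3):
    print(n, round(prob_within(n), 4))
```

The computed values agree with the σ, 2σ and 3σ rules to within about 0.0001; the small differences in the last digit come from the four-decimal rounding of tabulated values of Φ.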
The normal distribution as an approximation to various discrete probability
distributions
Although the normal distribution is continuous, it is interesting to note that it can sometimes be
used to approximate discrete distributions. Namely, we can use normal distribution to
approximate binomial probability distribution.
Suppose we have a binomial distribution defined by two parameters: the number of trials n and
the probability of success p. The normal distribution with the parameters μ = np and
σ = sqrt(np(1 − p)) will be a good approximation for that binomial distribution if both
\[ \mu - 2\sigma = np - 2\sqrt{np(1-p)} \quad \text{and} \quad \mu + 2\sigma = np + 2\sqrt{np(1-p)} \]
lie between 0 and n.
For example, the binomial distribution with n = 10 and p = 0.5 is well approximated by the
normal distribution with μ = np = 10*0.5 = 5.0 and σ = sqrt(np(1 − p)) = 0.5*sqrt(10) = 1.58. See
Figure 5.6 or Table 5.4.
Figure 5.6 Approximation of binomial distribution (bar graph) with n=10, p=0.5
by a normal distribution (smoothed curve)
Table 5.4 The binomial and normal probability distributions for
the same values of x
x     Binomial distribution    Normal distribution
0     0.000977                 0.0017
1     0.009766                 0.010285
2     0.043945                 0.041707
3     0.117188                 0.113372
4     0.205078                 0.206577
5     0.246094                 0.252313
6     0.205078                 0.206577
7     0.117188                 0.113372
8     0.043945                 0.041707
9     0.009766                 0.010285
10    0.000977                 0.0017
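The two columns of Table 5.4 can be recomputed as follows. In this sketch the normal column is taken to be the value of the normal density at each integer x, which matches the tabulated numbers; the helper names `binomial_pmf` and `normal_density` are illustrative assumptions.

```python
from math import comb, exp, pi, sqrt

n, p = 10, 0.5
mu = n * p                     # mean of the approximating normal distribution
sigma = sqrt(n * p * (1 - p))  # its standard deviation


def binomial_pmf(x):
    """Binomial probability of x successes in n trials."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def normal_density(x):
    """Normal density with parameters mu and sigma evaluated at x."""
    return exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * sqrt(2 * pi))


# Condition from the text: mu - 2*sigma and mu + 2*sigma must lie between 0 and n.
condition_holds = 0 <= mu - 2 * sigma and mu + 2 * sigma <= n

rows = [(x, binomial_pmf(x), normal_density(x)) for x in range(n + 1)]
for x, b, d in rows:
    print(f"{x:2d}  {b:.6f}  {d:.6f}")
```

The largest discrepancy between the two columns occurs at x = 5, where it is still below 0.01.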
5.9. Summary
This chapter introduced the notion of a random variable, one of the fundamental concepts of
probability theory. It is a rule that assigns one and only one value of a variable x to each
simple event in the sample space. A variable is said to be discrete if it can assume only a
countable number of values.
The probability distribution of a discrete random variable is a table, graph or formula that gives
the probability associated with each value of x. The expected value E(x) = μ is the mean of
this probability distribution and E[(x − μ)²] = σ² is its variance.
Two discrete random variables, the binomial and the Poisson, were presented, along with
their probability distributions.
In contrast to discrete random variables, a continuous random variable can assume values
corresponding to the infinitely large number of points contained in one or more intervals on the
real line. The relative frequency distribution for a population of data associated with a
continuous random variable can be modeled using a probability density function. The expected
value (or mean) of a continuous random variable is defined in the same manner as for discrete
random variables, except that integration is substituted for summation. The most important
probability distribution, the normal distribution, was considered together with its properties.
5.10 Exercises
1) The continuous random variable ξ is called a uniform random variable if its density function
is
\[ f(x) = \begin{cases} \dfrac{1}{b-a} & \text{if } a \le x \le b \\ 0 & \text{elsewhere} \end{cases} \]
Show that for this variable the mean is μ = (a + b)/2 and the variance is σ² = (b − a)²/12.
2) The continuous random variable ξ is called an exponential random variable if its density
function is
\[ f(x) = \frac{e^{-x/\beta}}{\beta} \quad (0 \le x < \infty) \]
Show that for this random variable μ = β and σ² = β².
3) Find the area beneath a standardized normal curve between the mean z = 0 and the point
z = 1.26.
4) Find the probability that a normally distributed random variable ξ lies more than z = 2
standard deviations above its mean.
5) Suppose y is a normally distributed random variable with mean 10 and standard deviation 2.1.
a) Find P(y ≥ 11).
b) Find P(7.6 ≤ y ≤ 12.2).
Chapter 6. Sampling Distributions
CONTENTS
6.1 Why the method of sampling is important
6.2 Obtaining a Random Sample
6.3 Sampling Distribution
6.4 The sampling distribution of the sample mean: the Central Limit Theorem
6.5 Summary
6.6 Exercises
6.1 Why the method of sampling is important
Much of our statistical information comes in the form of samples from populations of interest. In
order to develop and evaluate methods for using sample information to obtain knowledge of the
population, it is necessary to know how closely a descriptive quantity such as the mean or the
median of a sample resembles the corresponding population quantity. In this chapter, the ideas
of probability will be used to study the sample-to-sample variability of these descriptive
quantities.
We now return to the objective of statistics, namely, the use of sample information to infer the
nature of a population. We will explain why the method of sampling is important through an
example.
Example 6.1 The Vietnam Demographic and Health Survey (VNDHS) was a nationwide
representative sample survey conducted in May 1988 to collect data on fertility and a few
indicators of child and maternal health. In the survey a total of 4,171 eligible women, all
aged 15 to 49, were interviewed. The survey data are given in Appendix A in Excel
format. The relative frequency distribution of the number of children ever born
for the 4,171 women appears in Table 6.1 and in Figure 6.1. In actual practice, the
number of children ever born for the entire population of 4,171 women may not be easily
accessible. Now, we draw two samples of 50 women from the population of 4,171 women.
The relative frequency distributions of the two samples are given in Tables 6.2a and 6.2b
and graphed in Figures 6.2a and 6.2b.
Compare the distributions of number of children ever born for two samples. Which appears to
better characterize number of children ever born for the population?
Solution It is clear that the two samples lead to quite different conclusions about the
same population from which they were both selected. From Figure 6.2a, we see that
18% of the sampled women bore 3 children, whereas from Figure 6.2b, we see that 26%
of the sampled women bore that number of children. This may be compared to the
relative frequency distribution for the population (shown in Figure 6.1), in which we
observe that 18% of all the women bore 3 children. In addition, note that none of the
women in the second sample (Figure 6.2b) was childless, whereas 10% of the women
in the first sample (Figure 6.2a) had no children. The value from the first sample compares
favorably with the 7% of childless women in the entire population (Figure 6.1).
Table 6.1 Frequency distribution of number of children ever born for 4,171 women
Number of Children    Frequency    Relative Frequency
0                     312          0.07
1                     708          0.17
2                     881          0.21
3                     737          0.18
4                     570          0.14
5                     354          0.08
6                     243          0.06
7                     172          0.04
>7                    194          0.05
Total                 4171         1.00
Figure 6.1 Relative frequency distribution of number of children ever born for 4,171
women (horizontal axis: number of children ever born; vertical axis: relative frequency)
Table 6.2 Frequency distribution of number of children ever born for each of two samples
of 50 women selected from 4,171 women
a) First sample
Number of Children    Frequency    Relative Frequency
0                     5            0.10
1                     8            0.16
2                     10           0.20
3                     9            0.18
4                     8            0.16
5                     3            0.06
6                     4            0.08
7                     2            0.04
>7                    1            0.02
Total                 50           1.00
b) Second sample
Number of Children    Frequency    Relative Frequency
0                     0            0.00
1                     8            0.16
2                     8            0.16
3                     13           0.26
4                     9            0.18
5                     6            0.12
6                     2            0.04
7                     4            0.08
Total                 50           1.00
Figure 6.2 Frequency distribution of number of children ever born for each of two
samples of 50 women selected from 4,171 women (panels a and b; horizontal axis: number
of children ever born; vertical axis: relative frequency)
To rephrase the question posed in the example, we could ask: Which of the two samples is
more representative of, or characteristic of, the number of children ever born for all 4,171 of the
VNDHS women? Clearly, the information provided by the first sample (Table 6.2a and Figure 6.2a)
gives a better picture of the actual population of numbers of children ever born. Its relative
frequency distribution is closer to that for the entire population (Table 6.1 and Figure 6.1) than
is the one provided by the second sample (Table 6.2b and Figure 6.2b). Thus, if we were to rely
on information from the second sample only, we might have a distorted, or biased, impression of
the true situation with respect to numbers of children ever born.
How is it possible that two samples from the same population can provide contradictory
information about the population? The key issue is the method by which the samples are
obtained. The examples in this section demonstrate that great care must be taken in order to
select a sample that will give an unbiased picture of the population about which inferences are
to be made. One way to cope with this problem is to use random sampling. Random sampling
eliminates the possibility of bias in selecting a sample and, in addition, provides a probabilistic
basis for evaluating the reliability of an inference. We will have more to say about random
sampling in Section 6.2.
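The sample-to-sample variability discussed in this section can be explored by drawing random samples from a population that has the frequencies of Table 6.1. The sketch below is only an illustration: it rebuilds a stand-in population from the tabulated frequencies rather than from the survey data in Appendix A, and the function name `sample_relative_frequencies` and the seed are assumptions.

```python
import random
from collections import Counter

# Frequencies of number of children ever born, copied from Table 6.1.
frequencies = {"0": 312, "1": 708, "2": 881, "3": 737, "4": 570,
               "5": 354, "6": 243, "7": 172, ">7": 194}

# Stand-in population: one entry per woman, labelled by her category.
population = [cat for cat, count in frequencies.items() for _ in range(count)]


def sample_relative_frequencies(population, n, categories, rng):
    """Draw a random sample of size n without replacement and return
    the relative frequency of each category in the sample."""
    counts = Counter(rng.sample(population, n))
    return {cat: counts[cat] / n for cat in categories}


rng = random.Random(1)  # fixed seed so the run is reproducible
first = sample_relative_frequencies(population, 50, frequencies, rng)
second = sample_relative_frequencies(population, 50, frequencies, rng)

for cat in frequencies:
    print(f"{cat:>2}  population {frequencies[cat] / len(population):.2f}"
          f"  sample 1 {first[cat]:.2f}  sample 2 {second[cat]:.2f}")
```

Running the sketch with different seeds shows how much two random samples of 50 can differ from each other and from the population, even though neither is biased by the selection method.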
6.2 Obtaining a Random Sample
In the previous section, we demonstrated the importance of obtaining a sample that exhibits
characteristics similar to those possessed by the population from which it came, the population
about which we wish to make inferences. One way to satisfy this requirement is to select the
sample in such a way that every different sample of size n has an equal probability of being
selected. This procedure is called random sampling and the resulting sample is called a random
sample of size n. In this section we will explain how to draw a random sample, and will then
employ random sampling in sections that follow.
Definition 6.1
A random sample of n experimental units is one selected in such a way that every
different sample of size n has an equal probability of selection.
Example 6.2 A city purchasing agent can obtain stationery and office supplies from any
of eight companies. If the purchasing agent decides to use three suppliers in a given
year and wants to avoid accusations of bias in their selection, the sample of three
suppliers should be selected from among the eight.
a. How many different samples of three suppliers can be chosen from among the eight?
b. List them.
c. State the criterion that must be satisfied in order for the selected sample to be random.
Solution In this example, the population of interest consists of eight suppliers (call them
A, B, C, D, E, F, G, H), from which we want to select a sample of size n = 3.
a. The number of different samples of n = 3 elements that can be selected from a population of
N = 8 elements is
\[
C_N^n = \frac{N!}{n!\,(N-n)!} = \frac{8!}{3!\,5!}
= \frac{8 \cdot 7 \cdot 6 \cdot 5 \cdot 4 \cdot 3 \cdot 2 \cdot 1}{(3 \cdot 2 \cdot 1)(5 \cdot 4 \cdot 3 \cdot 2 \cdot 1)} = 56
\]
b. The following is a list of the 56 samples:
A, B, C A, C, F A, E, G B, C, G B, E, H C, E, F D, E, H
A, B, D A, C, G A, E, H B, C, H B, F, G C, E, G D, F, G
A, B, E A, C, H A, F, G B, D, E B, F, H C, E, H D, F, H
A, B, F A, D, E A, F, H B, D, F B, G, H C, F, G D, G, H
A, B, G A, D, F A, G, H B, D, G C, D, E C, F, H E, F, G
A, B, H A, D, G B, C, D B, D, H C, D, F C, G, H E, F, H
A, C, D A, D, H B, C, E B, E, F C, D, G D, E, F E, G, H
A, C, E A, E, F B, C, F B, E, G C, D, H D, E, G F, G, H
c. Each sample must have the same chance of being selected in order to ensure that we have
a random sample. Since there are 56 possible samples of size n = 3, each must have a
probability equal to 1/56 of being selected by the sampling procedure.
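The three parts of Example 6.2 can also be checked by computer. The following is a minimal Python sketch, not part of the original course material; the seed value is arbitrary and is used only so that the run is repeatable.

```python
import itertools
import math
import random

suppliers = list("ABCDEFGH")        # the eight suppliers of Example 6.2

# Part a: number of samples, N! / (n! (N - n)!) with N = 8 and n = 3
count = math.comb(len(suppliers), 3)

# Part b: list every distinct sample of three suppliers
samples = list(itertools.combinations(suppliers, 3))

# Part c: choose one listed sample so that each has probability 1/56
rng = random.Random(2024)           # arbitrary seed (an assumption)
chosen = rng.choice(samples)

print(count, len(samples))
print(samples[0], "...", samples[-1])
print("selected sample:", chosen)
```

Choosing uniformly from the full list is practical only because the list is short; for larger populations one would draw the sample directly rather than enumerate every possible sample.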
What procedures may one use to generate a random sample? If the population is not too large,
each observation may be recorded on a piece of paper and placed in a suitable container. After
the collection of papers is thoroughly mixed, the researcher can remove n pieces of paper from
the container; the elements named on these n pieces of paper would be the ones included in the
sample.
However, this method has the following drawbacks: it is not feasible when the population
consists of a large number of observations, and, since it is very difficult to achieve a thorough
mixing, the procedure provides only an approximation to a random sample.
A more practical method of generating a random sample, and one that may be used with larger
populations, is to use a table of random numbers. At present, almost all statistical program
packages use this method to select random samples. For example, SPSS/PC, a
comprehensive system for analyzing data, provides a procedure to select a random sample
based on an approximate percentage or an exact number of observations. The two samples in
Example 6.1 were drawn by the SPSS "Select cases" procedure from the data on the fertility of
4,171 women recorded in Appendix A.
For the first sample, the mean is
\[
\bar{x} = \frac{\sum vf}{n} = \frac{0 \cdot 5 + 1 \cdot 8 + 2 \cdot 10 + 3 \cdot 9 + 4 \cdot 8 + 5 \cdot 3 + 6 \cdot 4 + 7 \cdot 2 + 8 \cdot 1}{50} = 2.96
\]
For the second sample, the mean is
\[
\bar{x} = \frac{\sum vf}{n} = \frac{0 \cdot 0 + 1 \cdot 8 + 2 \cdot 8 + 3 \cdot 13 + 4 \cdot 9 + 5 \cdot 6 + 6 \cdot 2 + 7 \cdot 4 + 8 \cdot 0}{50} = 3.38
\]
where the mean for all 4,171 observations is 3.15. In the next section, we discuss how to judge
the performance of a statistic computed from a random sample.
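The two kinds of selection mentioned in this section (an exact number of cases, or an approximate percentage of cases) can be sketched in Python as follows. This is only an illustration of the idea; it is not the algorithm used by SPSS, and the function names, the case numbers and the seed are assumptions made for the example.

```python
import random

def select_exact(cases, n, rng):
    """Draw a simple random sample of exactly n cases, without replacement."""
    return rng.sample(cases, n)

def select_approximate(cases, fraction, rng):
    """Keep each case independently with probability `fraction`, so the
    sample size is only approximately fraction * len(cases)."""
    return [case for case in cases if rng.random() < fraction]

rng = random.Random(1)                   # arbitrary seed, for repeatability
case_ids = list(range(1, 4172))          # hypothetical case numbers 1..4171

exact = select_exact(case_ids, 50, rng)
approx = select_approximate(case_ids, 0.012, rng)
print(len(exact), len(approx))
```

With the exact method every sample of size n is equally likely, which is the requirement of Definition 6.1; the approximate method trades a fixed sample size for a single pass through the data.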
6.3 Sampling Distribution
In the previous section, we learned how to generate a random sample from a population of
interest, the ultimate goal being to use information from the sample to make an inference about
the nature of the population. In many situations, the objective will be to estimate a numerical
characteristic of the population, called a parameter, using information from the sample. For
example, from the first sample of 50 women in Example 6.1, we computed x̄ = 2.96, the
mean number of children ever born from the sample of n = 50. In other words, we used the
sample information to compute a statistic, namely, the sample mean, x̄.
Definition 6.2
A numerical descriptive measure of a population is called a parameter.
Definition 6.3
A quantity computed from the observations in a random sample is called a statistic.
You may have observed that the value of a population parameter (for example, the mean μ) is a
constant (although it is usually unknown to us); its value does not vary from sample to sample.
However, the value of a sample statistic (for example, the sample mean x̄) is highly dependent
on the particular sample that is selected. As seen in the previous section, the means of two
samples with the same size of n = 50 are different.
Since statistics vary from sample to sample, any inferences based on them will necessarily be
subject to some uncertainty. How, then, do we judge the reliability of a sample statistic as a tool
in making an inference about the corresponding population parameter? Fortunately, the
uncertainty of a statistic generally has characteristic properties that are known to us, and that
are reflected in its sampling distribution. Knowledge of the sampling distribution of a particular
statistic provides us with information about its performance over the long run.
Definition 6.4
A sampling distribution of a sample statistic (based on n observations) is the relative
frequency distribution of the values of the statistic theoretically generated by taking
repeated random samples of size n and computing the value of the statistic for each
sample. (See Figure 6.3.)
We will illustrate the notion of a sampling distribution with an example in which our interest
focuses on the numbers of children ever born to the 4,171 women in VNDHS 1988. The data are
given in Appendix A. In particular, we wish to estimate the mean number of children ever born to
all women. In this case, the 4,171 observations constitute the entire population and we know
that the true value of μ, the mean of the population, is 3.15 children.
Example 6.3 How could we generate the sampling distribution of x̄, the mean of a
random sample of n = 5 observations from the population of 4,171 numbers of children ever
born in Appendix A?
Solution The sampling distribution for the statistic x̄, based on a random sample of n =
5 measurements, would be generated in this manner: Select a random sample of five
measurements from the population of 4,171 observations on number of children ever
born in Appendix A; compute and record the value of x̄ for this sample. Then return
these five measurements to the population and repeat the procedure (see Figure 6.3). If
this sampling procedure could be repeated an infinite number of times, the infinite
number of values of x̄ obtained could be summarized in a relative frequency
distribution, called the sampling distribution of x̄.
The task described in Example 6.3, which may seem impractical if not impossible, is not
performed in actual practice. Instead, the sampling distribution of a statistic is obtained by
applying mathematical theory or computer simulation, as illustrated in the next example.
Figure 6.3 Generating the theoretical sampling distribution of the sample mean x̄
Example 6.4 Use computer simulation to find the approximate sampling distribution of
x̄, the mean of a random sample of n = 5 observations from the population of 4,171
numbers of children ever born in Appendix A.
Solution We used a statistical program (SPSS, for example) to obtain 100 random
samples of size n = 5 from the target population. The first ten of these samples are presented
in Table 6.3.
Table 6.3 The first ten samples of n = 5 measurements from the
population of numbers of children ever born to 4,171 women
Sample   Number of children ever born   Mean (x̄)
1        1  1  1  2  2                  1.4
2        1  2  3  3  3                  2.4
3        0  0  4  6  7                  3.4
4        0  1  2  2  3                  1.6
5        2  2  3  4  7                  3.6
6        1  2  3  5  8                  3.8
7        1  2  2  5  6                  3.2
8        1  2  2  3  6                  2.8
9        2  2  3  3  11                 4.2
10       0  0  2  3  4                  1.8
For each sample of five observations, the sample mean x̄ was computed. The relative
frequency distribution of the number of children ever born for the entire population of 4,171
women was plotted in Figure 6.4 and the 100 values of x̄ are summarized in the relative
frequency distribution shown in Figure 6.5.
Figure 6.4 Relative frequency distribution for 4,171 numbers of children ever born
We can see that the values of x̄ in Figure 6.5 tend to cluster around the population mean, μ =
3.15 children. Also, the values of the sample mean are less spread out (that is, they have less
variation) than the population values shown in Figure 6.4. These two observations are borne out
by comparing the means and standard deviations of the two sets of observations, as shown in
Table 6.4.
Figure 6.5 Sampling distribution of x̄: relative frequency distribution of x̄ based on
100 samples of size n = 5
[Figures 6.4 and 6.5 are bar charts with Percentage on the vertical axis. In Figure 6.4 the horizontal axis shows the number of children ever born (0 to 13); in Figure 6.5 it shows x̄ (1 to 5).]
Table 6.4 Comparison of the population and the approximate sampling distribution of
x̄ based on 100 samples of size n = 5
                                                               Mean       Standard deviation
Population of 4,171 numbers of children ever born (Fig. 6.4)   μ = 3.15   σ = 2.229
100 values of x̄ based on samples of size n = 5 (Fig. 6.5)      3.11       .920
Example 6.5 Refer to Example 6.4. Simulate the sampling distribution of x̄ for samples
of size n = 25 from the population of 4,171 observations of number of children ever born.
Compare the result with the sampling distribution of x̄ based on samples of
n = 5, obtained in Example 6.4.
Solution We obtained 100 computer-generated random samples of size n = 25 from the
target population. A relative frequency distribution for the 100 corresponding values of x̄ is
shown in Figure 6.6.
It can be seen that, as with the sampling distribution based on samples of size n = 5, the values
of x̄ tend to center about the population mean. However, a visual inspection shows that the
variation of the x̄ values about their mean in Figure 6.6 is less than the variation in the values
of x̄ based on samples of size n = 5 (Figure 6.5). The mean and standard deviation for these
100 values of x̄ are shown in Table 6.5 for comparison with previous results.
Table 6.5 Comparison of the population distribution and the approximate sampling
distributions of x̄, based on 100 samples of size n = 5 and n = 25
                                                               Mean       Standard deviation
Population of 4,171 numbers of children ever born (Fig. 6.4)   μ = 3.15   σ = 2.229
100 values of x̄ based on samples of size n = 5 (Fig. 6.5)      3.11       .920
100 values of x̄ based on samples of size n = 25 (Fig. 6.6)     3.14       .492
Figure 6.6 Relative frequency distribution of x̄ based on 100 samples of size n = 25
[Figure 6.6 is a bar chart with Percentage (0 to 80) on the vertical axis and x̄ (1 to 5) on the horizontal axis.]
From Table 6.5 we observe that, as the sample size increases, there is less variation in the
sampling distribution of x̄; that is, the values of x̄ tend to cluster more closely about the
population mean as n gets larger. This intuitively appealing result will be stated formally in the
next section.
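The simulations of Examples 6.4 and 6.5 can be imitated in a general-purpose language. The Python sketch below is an illustration under stated assumptions: the Appendix A data are not reproduced here, so a hypothetical right-skewed population of 4,171 small integers stands in for them, and the numbers printed will therefore differ from those in Tables 6.4 and 6.5. The seed and the function name simulate_means are likewise our own choices.

```python
import random
import statistics

rng = random.Random(0)    # arbitrary seed, for repeatability

# Hypothetical stand-in for the 4,171 values of Appendix A: a right-skewed
# population of small non-negative integers. It is NOT the survey data.
population = [min(int(rng.expovariate(1 / 3.2)), 13) for _ in range(4171)]
mu = statistics.fmean(population)
sigma = statistics.pstdev(population)

def simulate_means(n, repetitions=100):
    """Return the means of `repetitions` random samples of size n."""
    return [statistics.fmean(rng.sample(population, n))
            for _ in range(repetitions)]

results = {}
for n in (5, 25):
    means = simulate_means(n)
    results[n] = (statistics.fmean(means), statistics.stdev(means))
    print(f"n = {n:2d}: mean of sample means = {results[n][0]:.2f}, "
          f"standard deviation = {results[n][1]:.3f}")
print(f"population: mean = {mu:.2f}, standard deviation = {sigma:.3f}")
```

As in the tables above, the sample means should center near the population mean for both sample sizes, with visibly less spread for n = 25 than for n = 5.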
6.4 The sampling distribution of x̄: the Central Limit Theorem
Estimating the mean number of children ever born for a population of women, or the mean
height for all 3-year-old boys in a day-care center, are examples of practical problems in which
the goal is to make an inference about the mean, μ, of some target population. In previous
sections, we have indicated that the sample mean x̄ is often used as a tool for making an inference
about the corresponding population parameter μ, and we have shown how to approximate its
sampling distribution. The following theorem, of fundamental importance in statistics, provides
information about the actual sampling distribution of x̄.
The Central Limit Theorem
If the sample size is sufficiently large, the mean x̄ of a random sample from a population has a
sampling distribution that is approximately normal, regardless of the shape of the
relative frequency distribution of the target population. The larger the sample size,
the better will be the normal approximation to the sampling distribution.
(*)
The sampling distribution of x̄, in addition to being approximately normal, has other known
characteristics, which are summarized as follows.
Properties of the Sampling Distribution of x̄
If x̄ is the mean of a random sample of size n from a population with mean μ and
standard deviation σ, then:
1. The sampling distribution of x̄ has a mean equal to the mean of the population
from which the sample was selected. That is, if we let μ_x̄ denote the mean of the
sampling distribution of x̄, then
\[ \mu_{\bar{x}} = \mu \]
2. The sampling distribution of x̄ has a standard deviation equal to the standard
deviation of the population from which the sample was selected, divided by the
square root of the sample size. That is, if we let σ_x̄ denote the standard deviation
of the sampling distribution of x̄, then
\[ \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \]
(*) This is why the normal distribution is so important!
Example 6.6 Show that the empirical evidence obtained in Examples 6.4 and 6.5
supports the Central Limit Theorem and the two properties of the sampling distribution of x̄.
Recall that in Examples 6.4 and 6.5, we obtained repeated random samples of size n = 5
and n = 25 from the population of numbers of children ever born in Appendix A. For this
target population, we know the values of the parameters μ and σ:
Population mean: μ = 3.15 children
Population standard deviation: σ = 2.229 children
Solution In Figures 6.5 and 6.6, we note that the values of x̄ tend to cluster about the
population mean, μ = 3.15. This is guaranteed by property 1, which implies that, in the long
run, the average of all values of x̄ that would be generated in infinite repeated sampling would
be equal to μ.
We also observed, from Table 6.5, that the standard deviation of the sampling distribution of x̄
decreases as the sample size increases from n = 5 to n = 25. Property 2 quantifies the decrease
and relates it to the sample size. As an example, note that, for our approximate sampling
distribution based on samples of size n = 5, we obtained a standard deviation of .920, whereas
property 2 tells us that, for the actual sampling distribution of x̄, the standard deviation is equal
to
\[ \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{2.229}{\sqrt{5}} = .997 \]
Similarly, for samples of size n = 25, the sampling distribution of x̄ actually has a standard
deviation of
\[ \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{2.229}{\sqrt{25}} = .446 \]
The value we obtained by simulation was .492.
Finally, the Central Limit Theorem guarantees an approximately normal distribution for x̄,
regardless of the shape of the original population. In our examples, the population from which
the samples were selected is seen in Figure 6.4 to be moderately skewed to the right. Note from
Figures 6.5 and 6.6 that, although the sampling distribution of x̄ tends to be bell-shaped in each
case, the normal approximation improves when the sample size is increased from n = 5 (Figure
6.5) to n = 25 (Figure 6.6).
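The arithmetic behind property 2 in Example 6.6 is easy to verify. The short Python sketch below simply recomputes σ/√n for the two sample sizes and sets the result beside the simulated standard deviations quoted from Tables 6.4 and 6.5; the variable names are our own.

```python
import math

sigma = 2.229                        # population standard deviation (Example 6.6)
simulated = {5: 0.920, 25: 0.492}    # simulated values from Tables 6.4 and 6.5

# Property 2: standard deviation of the sampling distribution of the mean
theoretical = {n: sigma / math.sqrt(n) for n in simulated}

for n, observed in simulated.items():
    print(f"n = {n:2d}: sigma/sqrt(n) = {theoretical[n]:.3f}, "
          f"simulated = {observed:.3f}")
```

Multiplying the sample size by 5 divides the standard deviation of x̄ by √5, which is why the value for n = 25 is a little under half of the value for n = 5.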
Example 6.7 In research on the health and nutrition of children in a rural area of
Vietnam in 1988, it was reported that the average height of 823 three-year-old children in
rural areas in 1988 was 89.67 centimeters with a standard deviation of 4.99 centimeters.
These observations are given in Appendix B. In order to check these figures, we will
randomly sample 100 three-year-old children from the rural area and monitor their
heights.
a. Assuming the report's figures are true, describe the sampling distribution of the mean height
for a random sample of 100 three-year-old children in the rural area.
b. Assuming the report's figures are true, what is the probability that the sample mean height will
be at least 91 centimeters?
Solution
a. Although we have no information about the shape of the relative frequency distribution of the
heights of the children, we can apply the Central Limit Theorem to conclude that the
sampling distribution of the sample mean height of the 100 three-year-old children is
approximately normal. In addition, the mean, μ_x̄, and the standard deviation,
σ_x̄, of the sampling distribution are given by
\[ \mu_{\bar{x}} = \mu = 89.67 \text{ cm} \quad \text{and} \quad \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{4.99}{\sqrt{100}} = .499 \text{ cm} \]
assuming that the reported values of μ and σ are correct.
b. If the reported values are correct, then P(x̄ ≥ 91), the probability of observing a mean height
of 91 cm or higher in the sample of 100 observations, is equal to the green area shown in
Figure 6.7.
Since the sampling distribution is approximately normal, with mean and standard deviation as
obtained in part a, we can compute the desired area by obtaining the z-score for x̄ = 91:
\[ z = \frac{\bar{x} - \mu_{\bar{x}}}{\sigma_{\bar{x}}} = \frac{91 - 89.67}{.499} = 2.67 \]
Thus, P(x̄ ≥ 91) = P(z ≥ 2.67), and this probability (area) may be found in Table 1 of Appendix
C.
P(x̄ ≥ 91) = P(z ≥ 2.67)
= .5 − A (see Figure 6.7)
= .5 − .4962
= .0038
Figure 6.7 Sampling distribution of x̄ in Example 6.7
[Figure 6.7 shows a normal curve centered at 89.67 (z = 0). The area A lies between 89.67 and 91.00 (z = 2.67); the tail area to the right of 91.00 is P(x̄ ≥ 91).]
The probability that we would obtain a sample mean height of 91 cm or higher is only .0038, if
the reported values are true. If the 100 randomly selected three-year-old children have an
average height of 91 cm or higher, we would have strong evidence that the reported values are
false, because such a large sample mean is very unlikely to occur if the reported values are true.
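The tail area in Example 6.7 can be computed directly instead of being read from a table. The Python sketch below uses the complementary error function from the standard library; because it does not round z to 2.67 before computing the area, its result may differ from the tabulated .0038 in the last decimal place. The variable names are our own.

```python
import math

mu, sigma, n = 89.67, 4.99, 100     # reported values and planned sample size
x_bar = 91.0                        # sample mean of interest

standard_error = sigma / math.sqrt(n)
z = (x_bar - mu) / standard_error

# Upper-tail area of the standard normal curve:
# P(Z >= z) = erfc(z / sqrt(2)) / 2
tail = 0.5 * math.erfc(z / math.sqrt(2))

print(f"standard error = {standard_error:.3f}")
print(f"z = {z:.2f}")
print(f"P(x_bar >= 91) = {tail:.4f}")
```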
In practical terms, the Central Limit Theorem and the two properties of the sampling distribution of
x̄ assure us that the sample mean x̄ is a reasonable statistic to use in making inferences about
the population mean μ, and they allow us to compute a measure of the reliability of inferences
made about μ. As we noted earlier, we will not be required to obtain sampling distributions by
simulation or by mathematical arguments. Rather, for all the statistics to be used in this course,
the sampling distribution and its properties will be presented as the need arises.
6.5 Summary
The objective of most statistical investigations is to make an inference about a population
parameter. Since we often base inferences upon information contained in a sample from the
target population, it is essential that the sample be properly selected. A procedure for obtaining
a random sample using statistical software (SPSS) was described in this chapter.
After the sample has been selected, we compute a statistic that contains information about the
target parameter. The sampling distribution of the statistic characterizes the relative frequency
distribution of values of the statistic over an infinitely large number of samples.
The Central Limit Theorem provides information about the sampling distribution of the sample
mean, x̄. In particular, if you have used random sampling, the sampling distribution of x̄ will be
approximately normal if the sample size is sufficiently large.
6.6 Exercises
6.1 Use the Select cases command of SPSS/PC to obtain 30 random samples of size
n = 5 from the "population" of 4,171 numbers of children ever born in Appendix A.
a. Calculate x̄ for each of the 30 samples. Construct a relative frequency distribution for
the 30 sample means. Compare with the population relative frequency distribution
shown in Table 6.1.
b. Compute the average of the 30 sample means.
c. Compute the standard deviation of the 30 sample means.
6.2 Repeat parts a, b, and c of Exercise 6.1, using random samples of size n = 10. Compare
the relative frequency distribution with that of Exercise 6.1. Do the values of x̄ generated from
samples of size n = 10 tend to cluster more closely about μ?
6.3 Suppose a random sample of n measurements is selected from a population with mean μ
= 60 and variance σ² = 100. For each of the following values of n, give the mean and
standard deviation of the sampling distribution of the sample mean, x̄:
a. n = 10 b. n = 25 c. n = 50
d. n = 75 e. n = 100 f. n = 500
6.4 A random sample of n = 225 observations is selected from a population with
μ = 70 and σ = 30. Calculate each of the following probabilities:
a. P(x̄ > 72.5) b. P(x̄ < 73.6)
c. P(69.1 < x̄ < 74.0) d. P(x̄ < 65.5)
6.5 This past year, an elementary school began using a new method to teach arithmetic to first
graders. A standardized test, administered at the end of the year, was used to measure
the effectiveness of the new method. The relative frequency distribution of the test scores
in past years had a mean of 75 and a standard deviation of 10. Consider the standardized
test scores for a random sample of 36 first graders taught by the new method.
a. If the relative frequency distribution of test scores for first graders taught by the new
method is no different from that of the old method, describe the sampling distribution of
x̄, the mean test score for a random sample of 36 first graders.
b. If the sample mean test score was computed to be x̄ = 79, what would you conclude
about the effectiveness of the new method of teaching arithmetic? (Hint: Calculate
P(x̄ ≥ 79) using the sampling distribution described in part a.)
Chapter 7 Estimation
CONTENTS
7.1 Introduction
7.2 Estimation of a population mean: Large-sample case
7.3 Estimation of a population mean: Small-sample case
7.4 Estimation of a population proportion
7.5 Estimation of the difference between two population means: Independent samples
7.6 Estimation of the difference between two population means: Matched pairs
7.7 Estimation of the difference between two population proportions
7.8 Choosing the sample size
7.9 Estimation of a population variance
7.10 Summary
7.11 Exercises
7.1 Introduction
In preceding chapters we learned that populations are characterized by numerical descriptive
measures (parameters), and that inferences about parameter values are based on statistics
computed from the information in a sample selected from the population of interest. In this
chapter, we will demonstrate how to estimate population means, proportions, or variances, and
how to estimate the difference between two population means or proportions. We will also be
able to assess the reliability of our estimates, based on knowledge of the sampling distributions
of the statistics being used.
Example 7.1 Suppose we are interested in estimating the average number of children
ever born to all 4,171 women in the VNDHS 1988 in Appendix A. Although we already
know the value of the population mean, this example will be continued to illustrate the
concepts involved in estimation. How could one estimate the parameter of interest in this
situation?
Solution An intuitively appealing estimate of a population mean, μ, is the sample mean,
x̄, computed from a random sample of n observations from the target population.
Assume, for example, that we obtain a random sample of size n = 30 from the numbers of
children ever born in Appendix A, and then compute the value of the sample mean to be
x̄ = 3.05 children. This value of x̄ provides a point estimate of the population mean.
Definition 7.1
A point estimate of a parameter is a statistic, a single value computed from the
observations in a sample that is used to estimate the value of the target parameter.
How reliable is a point estimate for a parameter? In order to be truly practical and meaningful,
an inference concerning a parameter must consist of more than just a point estimate; that is, we
need to be able to state how close our estimate is likely to be to the true value of the population
parameter. This can be done by using the characteristics of the sampling distribution of the
statistic that was used to obtain the point estimate; the procedure will be illustrated in the next section.
7.2 Estimation of a population mean: Large-sample case
Recall from Section 6.4 that, for a sufficiently large sample size, the sampling distribution of the
sample mean, x̄, is approximately normal, as shown in Figure 7.1.
Example 7.2 Suppose we plan to take a sample of n = 30 measurements from the
population of numbers of children ever born in Appendix A and construct the interval
\[ \bar{x} \pm 1.96\,\sigma_{\bar{x}} = \bar{x} \pm 1.96\left(\frac{\sigma}{\sqrt{n}}\right) \]
where σ is the population standard deviation of the 4,171 numbers of children ever born and
σ_x̄ = σ/√n is the standard deviation of the sampling distribution of x̄ (often called the
standard error of x̄). In other words, we will construct an interval 1.96 standard deviations
around the sample mean x̄. What can we say about how likely it is that this interval will
contain the true value of the population mean, μ?
Figure 7.1 Sampling distribution of x̄
Solution We arrive at a solution by the following three-step process:
Step 1 First note that the area beneath the sampling distribution of x̄ between
μ − 1.96σ_x̄ and μ + 1.96σ_x̄ is approximately .95. (This area, colored green in
Figure 7.1, is obtained from Table 1 of Appendix C.) This implies that, before
the sample of measurements is drawn, the probability that x̄ will fall within
the interval μ ± 1.96σ_x̄ is approximately .95.
Step 2 If in fact the sample yields a value of x̄ that falls within the interval
μ ± 1.96σ_x̄, then it is true that x̄ ± 1.96σ_x̄ will contain μ, as demonstrated in
Figure 7.2. For a particular value of x̄ that falls within the interval μ ± 1.96σ_x̄, a
distance of 1.96σ_x̄ is marked off both to the left and to the right of x̄. You
can see that the value of μ must fall within x̄ ± 1.96σ_x̄.
Step 3 Step 1 and Step 2 combined imply that, before the sample is drawn, the
probability that the interval x̄ ± 1.96σ_x̄ will enclose μ is approximately .95.
Figure 7.2 Sampling distribution of x̄ in Example 7.2
The interval x̄ ± 1.96σ_x̄ in Example 7.2 is called a large-sample 95% confidence interval for
the population mean μ. The term large-sample refers to the sample being of a sufficiently large
size that we can apply the Central Limit Theorem to determine the form of the sampling
distribution of x̄.
Definition 7.2
A confidence interval for a parameter is an interval of numbers within which we
expect the true value of the population parameter to be contained. The endpoints of
the interval are computed based on sample information.
Example 7.3 Suppose that a random sample of n = 30 observations from the population of
heights of three-year-old children yields the following sample statistics:
x̄ = 88.62 cm and s = 4.09 cm
Construct a 95% confidence interval for μ, the population mean height, based on this sample.
Solution A 95% confidence interval for μ, based on a sample of size n = 30, is given by
\[ \bar{x} \pm 1.96\,\sigma_{\bar{x}} = \bar{x} \pm 1.96\left(\frac{\sigma}{\sqrt{n}}\right) = 88.62 \pm 1.96\left(\frac{\sigma}{\sqrt{30}}\right) \]
In most practical applications, the value of the population standard deviation σ will be unknown.
However, for large samples (n ≥ 30), the sample standard deviation s provides a good
approximation to σ, and may be used in the formula for the confidence interval. For this
example, we obtain
\[ 88.62 \pm 1.96\left(\frac{\sigma}{\sqrt{30}}\right) \approx 88.62 \pm 1.96\left(\frac{4.09}{\sqrt{30}}\right) = 88.62 \pm 1.46 \]
or (87.16, 90.08). Hence, we estimate that the population mean height falls within the interval
from 87.16 cm to 90.08 cm.
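The calculation in Example 7.3 can be packaged as a small function. The Python sketch below is an illustration; the function name large_sample_ci and its default z = 1.96 are our own choices, and the sample standard deviation s is used in place of σ, as described above.

```python
import math

def large_sample_ci(x_bar, s, n, z=1.96):
    """Large-sample confidence interval: x_bar ± z * s / sqrt(n)."""
    margin = z * s / math.sqrt(n)
    return x_bar - margin, x_bar + margin

lower, upper = large_sample_ci(x_bar=88.62, s=4.09, n=30)
print(f"({lower:.2f}, {upper:.2f})")    # the interval found in Example 7.3
```

Passing a different value of z gives an interval with a different confidence coefficient.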
How much confidence do we have that μ, the true population mean height, lies within the
interval (87.16, 90.08)? Although we cannot be certain whether the sample interval contains μ
(unless we calculate the true value of μ for all 823 observations in Appendix B), we can be
reasonably sure that it does. This confidence is based on the interpretation of the confidence
interval procedure: If we were to select repeated random samples of size n = 30 heights, and
form a 1.96 standard deviation interval around x̄ for each sample, then approximately 95% of
the intervals constructed in this manner would contain μ. Thus, we are 95% confident that the
particular interval (87.16, 90.08) contains μ, and this is our measure of the reliability of the
point estimate x̄.
Example 7.4 To illustrate the classical interpretation of a confidence interval, we
generated 40 random samples, each of size n = 30, from the population of heights in
Appendix B. For each sample, the sample mean and standard deviation are presented in
Table 7.1. We then constructed the 95% confidence interval for μ, using the information
from each sample. Interpret the results, which are shown in Table 7.2.
Table 7.1 Means and standard deviations for 40 random samples of 30 heights
from Appendix B
Sample   Mean   Standard deviation   Sample   Mean   Standard deviation
1 89.53 6.39 21 91.17 5.67
2 90.70 4.64 22 89.47 6.68
3 89.02 5.08 23 88.86 4.63
4 90.45 4.69 24 88.70 5.02
5 89.96 4.85 25 90.13 5.07
6 89.96 5.53 26 91.10 5.27
7 89.81 5.60 27 89.27 4.91
8 90.12 6.70 28 88.85 4.77
9 89.45 3.46 29 89.34 5.68
10 89.00 4.61 30 89.07 4.85
11 89.95 4.48 31 91.17 5.30
12 90.18 6.34 32 90.33 5.60
13 89.15 5.98 33 89.31 5.82
14 90.11 5.86 34 91.05 4.96
15 90.40 4.50 35 88.30 5.48
16 90.04 5.26 36 90.13 6.74
17 88.88 4.29 37 90.33 4.77
18 90.98 4.56 38 86.82 4.82
19 88.44 3.64 39 89.63 6.37
20 89.44 5.05 40 88.00 4.51
Table 7.2 95% confidence intervals for μ for 40 random samples of
30 heights from Appendix B (LCL = lower confidence limit; UCL = upper confidence limit)
Sample LCL UCL Sample LCL UCL
1 87.24 91.81 21 89.14 93.20
2 89.04 92.36 22 87.07 91.86
3 87.20 90.84 23 87.20 90.52
4 88.77 92.13 24 86.90 90.50
5 88.23 91.69 25 88.31 91.95
6 87.99 91.94 26 89.22 92.99
7 87.81 91.82 27 87.51 91.02
8 87.72 92.51 28 87.14 90.56
9 88.21 90.69 29 87.31 91.37
10 87.35 90.65 30 87.33 90.80
11 88.35 91.56 31 89.27 93.07
12 87.91 92.45 32 88.33 92.33
13 87.01 91.29 33 87.23 91.39
14 88.01 92.21 34 89.27 92.83
15 88.79 92.01 35 86.34 90.26
16 88.16 91.92 36 87.71 92.54
17 87.35 90.41 37 88.62 92.04
18 89.35 92.61 38 85.10 88.55
19 87.14 89.75 39 87.35 91.91
20 87.63 91.25 40 86.39 89.62
(Note: The intervals for samples 38 and 40 do not contain μ = 89.67 cm.)
Solution For the target population of 823 heights, we have obtained the population mean
value μ = 89.67 cm. In the 40 repetitions of the confidence interval procedure described
above, note that only two of the intervals (those based on samples 38 and 40)
do not contain the value of μ, whereas the remaining 38 intervals (or 95% of
the 40 intervals) do contain the true value of μ.
Note that, in actual practice, you would not know the true value of μ and you would not perform
this repeated sampling; rather you would select a single random sample and construct the
associated 95% confidence interval. The one confidence interval you form may or may not contain
μ, but you can be fairly sure it does because of your confidence in the statistical procedure, the
basis for which was illustrated in this example.
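The repeated-sampling experiment of Example 7.4 can be imitated on a computer with many more repetitions. The Python sketch below is an illustration under stated assumptions: the Appendix B heights are not reproduced here, so a hypothetical population of 823 normally distributed heights stands in for them, and the seed and the number of repetitions (1,000 rather than 40) are our own choices.

```python
import math
import random
import statistics

rng = random.Random(7)    # arbitrary seed, for repeatability

# Hypothetical population of 823 heights in cm; NOT the Appendix B data.
population = [rng.gauss(89.67, 4.99) for _ in range(823)]
mu = statistics.fmean(population)

def interval(sample, z=1.96):
    """Large-sample interval: sample mean ± z * s / sqrt(n)."""
    centre = statistics.fmean(sample)
    margin = z * statistics.stdev(sample) / math.sqrt(len(sample))
    return centre - margin, centre + margin

repetitions = 1000
hits = 0
for _ in range(repetitions):
    lcl, ucl = interval(rng.sample(population, 30))
    hits += lcl <= mu <= ucl     # True counts as 1

coverage = hits / repetitions
print(f"{hits} of {repetitions} intervals contain mu ({coverage:.1%})")
```

The proportion of intervals that contain μ should come out close to .95, which is the long-run interpretation of the confidence interval procedure.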
Suppose you want to construct an interval that you believe will contain μ with some degree of
confidence other than 95%; in other words, you want to choose a confidence coefficient other
than .95.
Definition 7.3
The confidence coefficient is the proportion of times that a confidence interval
encloses the true value of the population parameter if the confidence interval
procedure is used repeatedly a very large number of times.
The first step in constructing a confidence interval with any desired confidence coefficient is to
notice from Figure 7.1 that, for a 95% confidence interval, the confidence coefficient of 95% is
equal to the total area under the sampling distribution (1.00), less .05 of the area, which is
divided equally between the two tails of the distribution. Thus, each tail has an area of .025.
Second, consider that the tabulated value of z (Table 1 of Appendix C) that cuts off an area of
.025 in the right tail of the standard normal distribution is 1.96 (see Figure 7.3). The value z =
1.96 is also the distance, in terms of standard deviations, that x̄ is from each endpoint of the
95% confidence interval. By assigning a confidence coefficient other than .95 to a confidence
interval, we change the area under the sampling distribution between the endpoints of the
interval, which in turn changes the tail area associated with z. Thus, this z-value provides the
key to constructing a confidence interval with any desired confidence coefficient.
Figure 7.3 Tabulated z-value corresponding to a tail area of .025
Definition 7.4
We define z_{α/2} to be the z-value such that an area of α/2 lies to its right (see Figure
7.4).
Figure 7.4 Locating z_{α/2} on the standard normal curve
Now, if an area of α/2 lies beyond z_{α/2} in the right tail of the standard normal (z) distribution,
then an area of α/2 lies to the left of −z_{α/2} in the left tail (Figure 7.4) because of the symmetry
of the distribution. The remaining area, (1 − α), is equal to the confidence coefficient; that is,
the probability that x̄ falls within z_{α/2} standard deviations of μ is (1 − α). Thus, a large-sample
confidence interval for μ, with confidence coefficient equal to (1 − α), is given by
\[ \bar{x} \pm z_{\alpha/2}\,\sigma_{\bar{x}} \]
Example 7.5 In statistical problems using confidence interval techniques, a very common
confidence coefficient is .90. Determine the value of $z_{\alpha/2}$ that would be used in
constructing a 90% confidence interval for a population mean based on a large sample.
Solution For a confidence coefficient of .90, we have
$1-\alpha = .90$
$\alpha = .10$
$\alpha/2 = .05$
and we need to obtain the value $z_{\alpha/2} = z_{.05}$ that locates an area of .05 in the upper tail of the
standard normal distribution. Since the total area to the right of 0 is .50, $z_{.05}$ is the value such
that the area between 0 and $z_{.05}$ is .50 − .05 = .45. From Table 1 of Appendix C, we find that
$z_{.05}$ = 1.645 (see Figure 7.5). We conclude that a large-sample 90% confidence interval for a
population mean is given by
$$\bar{x} \pm 1.645\,\sigma_{\bar{x}}$$
In Table 7.3, we present the values of $z_{\alpha/2}$ for the most commonly used confidence coefficients.
Figure 7.5 Location of $z_{\alpha/2}$ for Example 7.5
Table 7.3 Commonly used confidence coefficients
Confidence coefficient $(1-\alpha)$    $\alpha/2$    $z_{\alpha/2}$
.90                                    .050          1.645
.95                                    .025          1.960
.98                                    .010          2.330
.99                                    .005          2.58
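The tabulated values can also be computed rather than looked up. The following is a minimal sketch, assuming Python's standard `statistics.NormalDist` is an acceptable stand-in for Table 1 of Appendix C; the helper name `z_critical` is ours, not part of the course material. Computed values carry more decimals than the table, so the entries for .98 and .99 appear here as roughly 2.326 and 2.576 before rounding.

```python
from statistics import NormalDist


def z_critical(confidence: float) -> float:
    """Return z_{alpha/2}: the z-value with an area of alpha/2 to its right."""
    alpha = 1.0 - confidence
    # inv_cdf returns the point with the requested area to its LEFT,
    # so an upper-tail area of alpha/2 corresponds to 1 - alpha/2.
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


for conf in (0.90, 0.95, 0.98, 0.99):
    tail = (1.0 - conf) / 2.0
    print(f"confidence {conf:.2f}  alpha/2 {tail:.3f}  z {z_critical(conf):.3f}")
```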
A summary of the large-sample confidence interval procedure for estimating a population
mean appears in the next box.
Large-sample $(1-\alpha)100\%$ confidence interval for a population mean, μ
$$\bar{x} \pm z_{\alpha/2}\,\sigma_{\bar{x}} = \bar{x} \pm z_{\alpha/2}\left(\frac{\sigma}{\sqrt{n}}\right)$$
where $z_{\alpha/2}$ is the z-value that locates an area of $\alpha/2$ to its right, σ is the standard
deviation of the population from which the sample was selected, n is the sample size,
and $\bar{x}$ is the value of the sample mean.
Assumption: n ≥ 30
[When the value of σ is unknown, the sample standard deviation s may be used to
approximate σ in the formula for the confidence interval. The approximation is generally quite
satisfactory when n ≥ 30.]
Example 7.6 Suppose that in the previous year all graduates at a certain university
reported the number of hours spent on their studies during a certain week; the average
was 40 hours and the standard deviation was 10 hours. Suppose we want to investigate
whether students now are studying more than they used to. This year a
random sample of n = 50 students is selected. Each student in the sample was
interviewed about the number of hours spent on his/her study. This experiment produced
the following statistics:
$\bar{x}$ = 41.5 hours    s = 9.2 hours
Estimate μ, the mean number of hours spent on study, using a 99% confidence interval.
Interpret the interval in terms of the problem.
Solution The general form of a large-sample 99% confidence interval for μ is
$$\bar{x} \pm 2.58\left(\frac{\sigma}{\sqrt{n}}\right) \approx \bar{x} \pm 2.58\left(\frac{s}{\sqrt{n}}\right) = 41.5 \pm 2.58\left(\frac{9.2}{\sqrt{50}}\right) = 41.5 \pm 3.36$$
or (38.14, 44.86).
We can be 99% confident that the interval (38.14, 44.86) encloses the true mean weekly time
spent on study this year. Since all the values in the interval fall above 38 hours and below 45
hours, we conclude that students now tend to spend more than 6 hours and less
than 7.5 hours per day on average (supposing that they do not study on Sunday). Note that the
interval also contains 40 hours, last year's average, so this sample alone does not show that
students now study more than they used to.
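The arithmetic of Example 7.6 can be checked with a few lines of code. This is only a sketch: the function name `large_sample_ci` is hypothetical, and the z-value 2.58 is taken from Table 7.3 rather than computed.

```python
from math import sqrt


def large_sample_ci(xbar: float, s: float, n: int, z: float) -> tuple[float, float]:
    """Large-sample interval xbar +/- z * s / sqrt(n), with s standing in for sigma."""
    half_width = z * s / sqrt(n)
    return xbar - half_width, xbar + half_width


# Values from Example 7.6: xbar = 41.5, s = 9.2, n = 50, z_.005 = 2.58.
low, high = large_sample_ci(xbar=41.5, s=9.2, n=50, z=2.58)
print(f"99% confidence interval: ({low:.2f}, {high:.2f})")
```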
Example 7.7 Refer to Example 7.6.
a. Using the sample information in Example 7.6, construct a 95% confidence interval for mean
weekly time spent on study of all students in the university this year.
b. For a fixed sample size, how is the width of the confidence interval related to the confidence
coefficient?
Solution
a. The form of a large-sample 95% confidence interval for a population mean μ is
$$\bar{x} \pm 1.96\left(\frac{\sigma}{\sqrt{n}}\right) \approx \bar{x} \pm 1.96\left(\frac{s}{\sqrt{n}}\right) = 41.5 \pm 1.96\left(\frac{9.2}{\sqrt{50}}\right) = 41.5 \pm 2.55$$
or (38.95, 44.05).
b. The 99% confidence interval for μ was determined in Example 7.6 to be (38.14, 44.86).
The 95% confidence interval, obtained in this example and based on the same sample
information, is narrower than the 99% confidence interval. This relationship holds in general,
as stated in the next box.
Relationship between width of confidence interval and confidence coefficient
For a given sample size, the width of the confidence interval for a parameter increases
as the confidence coefficient increases. Intuitively, the interval must become wider for
us to have greater confidence that it contains the true parameter value.
Example 7.8 Refer to Example 7.6.
a. Assume that the given values of the statistics $\bar{x}$ and s were based on a sample of size n =
100 instead of a sample of size n = 50. Construct a 99% confidence interval for μ, the
population mean weekly time spent on study of all students in the university this year.
b. For a fixed confidence coefficient, how is the width of the confidence interval related to the
sample size?
Solution
a. Substitution of the values of the sample statistics into the general formula for a 99%
confidence interval for μ yields
$$\bar{x} \pm 2.58\left(\frac{\sigma}{\sqrt{n}}\right) \approx \bar{x} \pm 2.58\left(\frac{s}{\sqrt{n}}\right) = 41.5 \pm 2.58\left(\frac{9.2}{\sqrt{100}}\right) = 41.5 \pm 2.37$$
or (39.13, 43.87).
b. The 99% confidence interval based on a sample of size n = 100, constructed in part a., is
narrower than the 99% confidence interval based on a sample of size n = 50, constructed
in Example 7.6. This will also hold in general, as stated in the box.
Relationship between width of confidence interval and sample size
For a fixed confidence coefficient, the width of the confidence interval decreases as
the sample size increases. That is, larger samples generally provide more information
about the target population than do smaller samples.
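Both width relationships can be seen numerically. The sketch below is illustrative only: it reuses s = 9.2 from Example 7.6, the helper name `ci_width` is an assumption of ours, and z-values are computed with `statistics.NormalDist` instead of being read from Table 7.3.

```python
from math import sqrt
from statistics import NormalDist


def ci_width(s: float, n: int, confidence: float) -> float:
    """Full width of the large-sample interval: 2 * z_{alpha/2} * s / sqrt(n)."""
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    return 2.0 * z * s / sqrt(n)


s = 9.2  # sample standard deviation from Example 7.6
for n in (50, 100, 200):
    for confidence in (0.90, 0.95, 0.99):
        width = ci_width(s, n, confidence)
        print(f"n = {n:3d}  confidence = {confidence:.2f}  width = {width:.2f}")
```

Because the width is proportional to $1/\sqrt{n}$, quadrupling the sample size halves the width, while raising the confidence coefficient at a fixed n widens the interval.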
In this section we introduced the concepts of point and interval estimation of the population
mean μ, based on large samples. The general theory appropriate for the estimation of μ also
carries over to the estimation of other population parameters. Hence, in subsequent sections we
will present only the point estimate, its sampling distribution, the general form of a confidence
interval for the parameter of interest, and any assumptions required for the validity of the procedure.
7.3 Estimation of a population mean: small sample case
In the previous section, we discussed the estimation of a population mean based on large
samples (n ≥ 30). However, time or cost limitations may often restrict the number of sample
observations that may be obtained, so that the estimation procedures of Section 7.2 would not
be applicable.
With small samples, the following two problems arise:
1. Since the Central Limit Theorem applies only to large samples, we are not able to assume
that the sampling distribution of $\bar{x}$ is approximately normal. For small samples, the sampling
distribution of $\bar{x}$ depends on the particular form of the relative frequency distribution of the
population being sampled.
2. The sample standard deviation s may not be a satisfactory approximation to the population
standard deviation σ if the sample size is small.
Fortunately, we may proceed with estimation techniques based on small samples if we can
make the following assumption:
Assumption required for estimating μ based on small samples (n < 30)
The population from which the sample is selected has an approximately normal
distribution.
If this assumption is valid, then we may again use $\bar{x}$ as a point estimate for μ, and the
general form of a small-sample confidence interval for μ is as shown in the next box.
Small-sample confidence interval for μ
$$\bar{x} \pm t_{\alpha/2}\left(\frac{s}{\sqrt{n}}\right)$$
where the distribution of t is based on (n − 1) degrees of freedom.
Upon comparing this to the large-sample confidence interval for μ, you will observe that the
sample standard deviation s replaces the population standard deviation σ. Also, the sampling
distribution upon which the confidence interval is based is known as a Student's t-distribution.
Consequently, we must replace the value of $z_{\alpha/2}$ used in a large-sample confidence interval by
a value obtained from the t-distribution.
The t-distribution is very much like the z-distribution. In particular, both are symmetric,
bell-shaped, and have a mean of 0. However, the distribution of t depends on a quantity called its
degrees of freedom (df), which is equal to (n − 1) when estimating a population mean based on
a small sample of size n. Intuitively, we can think of the number of degrees of freedom as the
amount of information available for estimating, in addition to μ, the unknown quantity $\sigma^2$.
Table 2 of Appendix C, a portion of which is reproduced in Table 7.4, gives the value of $t_{\alpha}$ that
locates an area of α in the upper tail of the t-distribution for various values of α and for
degrees of freedom ranging from 1 to 120.
Table 7.4 Some values for Student's t-distribution
Degrees of freedom   $t_{.100}$   $t_{.050}$   $t_{.025}$   $t_{.010}$   $t_{.005}$   $t_{.001}$   $t_{.0005}$
 1                   3.078        6.314        12.706       31.821       63.657       318.31       636.62
 2                   1.886        2.920        4.303        6.965        9.925        22.326       31.598
 3                   1.638        2.353        3.182        4.541        5.841        10.213       12.924
 4                   1.533        2.132        2.776        3.747        4.604        7.173        8.610
 5                   1.476        2.015        2.571        3.365        4.032        5.893        6.869
 6                   1.440        1.943        2.447        3.143        3.707        5.208        5.959
 7                   1.415        1.895        2.365        2.998        3.499        4.785        5.408
 8                   1.397        1.860        2.306        2.896        3.355        4.501        5.041
 9                   1.383        1.833        2.262        2.821        3.250        4.297        4.781
10                   1.372        1.812        2.228        2.764        3.169        4.144        4.587
11                   1.363        1.796        2.201        2.718        3.106        4.025        4.437
12                   1.356        1.782        2.179        2.681        3.055        3.930        4.318
13                   1.350        1.771        2.160        2.650        3.012        3.852        4.221
14                   1.345        1.761        2.145        2.624        2.977        3.787        4.140
15                   1.341        1.753        2.131        2.602        2.947        3.733        4.073
Example 7.9 Use Table 7.4 to determine the t-value that would be used in constructing
a 95% confidence interval for μ based on a sample of size n = 14.
Solution For a confidence coefficient of .95, we have
$1-\alpha = .95$
$\alpha = .05$
$\alpha/2 = .025$
We require the value of $t_{.025}$ for a t-distribution based on (n − 1) = (14 − 1) = 13 degrees of
freedom. In Table 7.4, at the intersection of the column labeled $t_{.025}$ and the row corresponding to df
= 13, we find the entry 2.160 (see Figure 7.6). Hence, a 95% confidence interval for μ, based
on a sample of size n = 14 observations, would be given by
$$\bar{x} \pm 2.160\left(\frac{s}{\sqrt{14}}\right)$$
Figure 7.6 Location of $t_{.025}$ for Example 7.9
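The small-sample procedure can be sketched in code as a table lookup followed by the interval formula. The dictionary below copies the $t_{.025}$ column of the table above; the function name `small_sample_ci_95` and the measurements in `sample` are hypothetical, invented purely for illustration, and the normality assumption must still be judged separately.

```python
from math import sqrt
from statistics import mean, stdev

# t_.025 critical values keyed by degrees of freedom (copied from the t table above).
T_025 = {
    1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
    6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262, 10: 2.228,
    11: 2.201, 12: 2.179, 13: 2.160, 14: 2.145, 15: 2.131,
}


def small_sample_ci_95(data: list[float]) -> tuple[float, float]:
    """95% interval xbar +/- t_.025 * s / sqrt(n), with df = n - 1 (n <= 16 here)."""
    n = len(data)
    t = T_025[n - 1]      # KeyError if df falls outside the copied table
    xbar = mean(data)
    s = stdev(data)       # sample standard deviation (divisor n - 1)
    half_width = t * s / sqrt(n)
    return xbar - half_width, xbar + half_width


# Hypothetical sample of n = 14 measurements, so df = 13 and t_.025 = 2.160.
sample = [9.8, 10.2, 10.4, 9.9, 10.0, 10.1, 10.3, 9.7, 10.5, 10.0, 9.6, 10.2, 10.1, 9.9]
low, high = small_sample_ci_95(sample)
print(f"95% confidence interval: ({low:.3f}, {high:.3f})")
```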
At this point, the reasoning for the arbitrary cutoff point of n = 30 for distinguishing between
large and small samples may be better understood. Observe that the values in the last row of
Table 2 in Appendix C (corresponding to df = ∞) are the values from the standard normal z
distribution. This phenomenon occurs because, as the sample size increases, the t distribution
becomes more like the z distribution. By the time n reaches 30, i.e., df = 29, there is very little
difference between tabulated values of t and z.
Before concluding this section, we will comment on the assumption that the sampled population
is normally distributed. In the real world, we rarely know whether a sampled population has an
exact normal distribution. However, empirical studies indicate that moderate departures from
this assumption do not seriously affect the confidence coefficients for small-sample confidence
intervals. As a consequence, the small-sample confidence interval defined in this section
is frequently used by experimenters when estimating the population mean of a non-normal
distribution, as long as the distribution is bell-shaped and only moderately skewed.
7.4 Estimation of a population proportion
We will now consider the method for estimating the binomial proportion of successes, that is,
the proportion of elements in a population that have a certain characteristic. For example, a
demographer may be interested in the proportion of a city's residents who are married; a
physician may be interested in the proportion of men who are smokers. How would you estimate
a binomial proportion p based on information contained in a sample from a population?
Example 7.10 A commission on crime is interested in estimating the proportion of
crimes related to firearms in an area with one of the highest crime rates in a country. The
commission selects a random sample of 300 files of recently committed crimes in the
area and determines that a firearm was reportedly used in 180 of them. Estimate the true
proportion p of all crimes committed in the area in which some type of firearm was
reportedly used.
Solution A logical candidate for a point estimate of the population proportion p is the
proportion of observations in the sample that have the characteristic of interest (called a
"success"); we will call this sample proportion $\hat{p}$ (read "p hat"). In this example, the
sample proportion of crimes related to firearms is given by
$$\hat{p} = \frac{\text{Number of crimes in sample in which a firearm was reportedly used}}{\text{Total number of crimes in sample}} = \frac{180}{300} = .60$$
That is, 60% of the crimes in the sample were related to firearms; the value $\hat{p}$ = .60 serves as
our point estimate of the population proportion p.
To assess the reliability of the point estimate $\hat{p}$, we need to know its sampling distribution. This
information may be derived by an application of the Central Limit Theorem. Properties of the
sampling distribution of $\hat{p}$ are given in the next box.
Sampling distribution of $\hat{p}$
For sufficiently large samples, the sampling distribution of $\hat{p}$ is approximately normal,
with
Mean: $\mu_{\hat{p}} = p$
and Standard deviation: $\sigma_{\hat{p}} = \sqrt{\dfrac{pq}{n}}$
where q = 1 − p.
A large-sample confidence interval for p may be constructed by using a procedure analogous to
that used for estimating a population mean.
Large-sample $(1-\alpha)100\%$ confidence interval for a population proportion, p
$$\hat{p} \pm z_{\alpha/2}\,\sigma_{\hat{p}} \approx \hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}}$$
where $\hat{p}$ is the sample proportion of observations with the characteristic of interest,
and $\hat{q} = 1 - \hat{p}$.
Note that we must substitute $\hat{p}$ and $\hat{q}$ into the formula for $\sigma_{\hat{p}} = \sqrt{pq/n}$ in order to construct the
confidence interval. This approximation will be valid as long as the sample size n is sufficiently
large.
Example 7.11 Refer to Example 7.10. Construct a 95% confidence interval for p, the
population proportion of crimes committed in the area in which some type of firearm is
reportedly used.
Solution For a confidence coefficient of .95, we have $1-\alpha = .95$; $\alpha = .05$; $\alpha/2 = .025$;
and the required z-value is $z_{.025}$ = 1.96. In Example 7.10, we obtained $\hat{p}$ = 180/300 = .60.
Thus, $\hat{q} = 1 - \hat{p}$ = 1 − .60 = .40. Substitution of these values into the formula for an
approximate confidence interval for p yields
$$\hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}} = .60 \pm 1.96\sqrt{\frac{(.60)(.40)}{300}} = .60 \pm .06$$
or (.54, .66). Note that the approximation is valid since the interval does not contain 0 or 1.
We are 95% confident that the interval from .54 to .66 contains the true proportion of crimes
committed in the area that are related to firearms. That is, in repeated construction of 95%
confidence intervals, 95% of all samples would produce confidence intervals that enclose p.
It should be noted that small-sample procedures are available for the estimation of a population
proportion p. We will not discuss details here, however, because most surveys in actual practice
use samples that are large enough to employ the procedure of this section.
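As a check on the proportion interval of Example 7.11, the calculation can be sketched as follows. The function name `proportion_ci` is hypothetical. Note that the half-width before rounding is about .055; the text reports it rounded to two decimals as .06.

```python
from math import sqrt


def proportion_ci(successes: int, n: int, z: float) -> tuple[float, float]:
    """Return (p_hat, half_width) for the interval p_hat +/- z * sqrt(p_hat * q_hat / n)."""
    p_hat = successes / n
    q_hat = 1.0 - p_hat
    half_width = z * sqrt(p_hat * q_hat / n)
    return p_hat, half_width


# Example 7.10/7.11: 180 firearm-related crimes in a sample of 300, z_.025 = 1.96.
p_hat, half_width = proportion_ci(successes=180, n=300, z=1.96)
print(f"p_hat = {p_hat:.2f}, half-width = {half_width:.3f}")
print(f"interval: ({p_hat - half_width:.3f}, {p_hat + half_width:.3f})")
```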
7.5 Estimation of the difference between two population
means: Independent samples
In Section 7.2, we learned how to estimate the parameter μ based on a large sample from a
single population. We now proceed to a technique for using the information in two samples to
estimate the difference between two population means. For example, we may want to compare
the mean heights of the children in province No.18 and in province No.14 using the
observations in Appendix B. The technique to be presented is a straightforward extension of
that used for largesample estimation of a single population mean.
Example 7.12 To estimate the difference between the mean heights for all children of
province No. 18 and province No. 14 use the following information
1. A random sample of 30 heights of children in province No. 18 produced a sample mean of
91.72 cm and a standard deviation of 4.50 cm.
2. A random sample of 40 heights of children in province No. 14 produced a sample mean of
86.67 cm and a standard deviation of 3.88 cm.
Calculate a point estimate for the difference between heights of children in two provinces.
Solution We will let the subscript 1 refer to province No. 18 and the subscript 2 to province
No. 14. We will also define the following notation:
$\mu_1$ = Population mean height of all children of province No. 18.
$\mu_2$ = Population mean height of all children of province No. 14.
Similarly, let $\bar{x}_1$ and $\bar{x}_2$ denote the respective sample means; $s_1$ and $s_2$, the respective sample
standard deviations; and $n_1$ and $n_2$, the respective sample sizes. The given information may be
summarized as in Table 7.5.
Table 7.5 Summary information for Example 7.12
                             Province No. 18         Province No. 14
Sample size                  $n_1$ = 30              $n_2$ = 40
Sample mean                  $\bar{x}_1$ = 91.72 cm  $\bar{x}_2$ = 86.67 cm
Sample standard deviation    $s_1$ = 4.50 cm         $s_2$ = 3.88 cm
To estimate $(\mu_1-\mu_2)$, it seems sensible to use the difference between the sample means
$(\bar{x}_1-\bar{x}_2)$ = (91.72 − 86.67) = 5.05 as our point estimate of the difference between the two
population means. The properties of the point estimate $(\bar{x}_1-\bar{x}_2)$ are summarized by its
sampling distribution shown in Figure 7.8.
Figure 7.8 Sampling distribution of $(\bar{x}_1-\bar{x}_2)$
Sampling distribution of $(\bar{x}_1-\bar{x}_2)$
For sufficiently large sample sizes ($n_1$ and $n_2$ ≥ 30), the sampling distribution of
$(\bar{x}_1-\bar{x}_2)$, based on independent random samples from two populations, is
approximately normal with
Mean: $\mu_{(\bar{x}_1-\bar{x}_2)} = (\mu_1-\mu_2)$
Standard deviation: $\sigma_{(\bar{x}_1-\bar{x}_2)} = \sqrt{\dfrac{\sigma_1^2}{n_1}+\dfrac{\sigma_2^2}{n_2}}$
where $\sigma_1^2$ and $\sigma_2^2$ are the variances of the two populations from which the samples
were selected.
As was the case with large-sample estimation of a single population mean, the requirement of
large sample sizes enables us to apply the Central Limit Theorem to obtain the sampling
distribution of $(\bar{x}_1-\bar{x}_2)$; it also suffices to use $s_1^2$ and $s_2^2$ as approximations to the respective
population variances, $\sigma_1^2$ and $\sigma_2^2$.
The procedure for forming a large-sample confidence interval for $(\mu_1-\mu_2)$ appears in the
accompanying box.
Large-sample $(1-\alpha)100\%$ confidence interval for $(\mu_1-\mu_2)$
$$(\bar{x}_1-\bar{x}_2) \pm z_{\alpha/2}\,\sigma_{(\bar{x}_1-\bar{x}_2)} = (\bar{x}_1-\bar{x}_2) \pm z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}}$$
$$\approx (\bar{x}_1-\bar{x}_2) \pm z_{\alpha/2}\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}$$
(Note: We have used the sample variances $s_1^2$ and $s_2^2$ as approximations to the corresponding
population parameters.)
The assumptions upon which the above procedure is based are the following:
Assumptions required for large-sample estimation of $(\mu_1-\mu_2)$
1. The two random samples are selected in an independent manner from the target
populations. That is, the choice of elements in one sample does not affect, and is
not affected by, the choice of elements in the other sample.
2. The sample sizes $n_1$ and $n_2$ are sufficiently large (at least 30).
Example 7.13 Refer to Example 7.12. Construct a 95% confidence interval for $(\mu_1-\mu_2)$,
the difference between mean heights of all children in province No. 18 and province No.
14. Interpret the interval.
Solution The general form of a 95% confidence interval for $(\mu_1-\mu_2)$, based on large
samples from the target populations, is given by
$$(\bar{x}_1-\bar{x}_2) \pm z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}}$$
Recall that $z_{.025}$ = 1.96 and use the information in Table 7.5 to make the following substitutions
to obtain the desired confidence interval:
$$(91.72-86.67) \pm 1.96\sqrt{\frac{\sigma_1^2}{30}+\frac{\sigma_2^2}{40}}$$
$$\approx (91.72-86.67) \pm 1.96\sqrt{\frac{(4.50)^2}{30}+\frac{(3.88)^2}{40}}$$
$$\approx 5.05 \pm 2.01$$
or (3.04, 7.06).
The use of this method of estimation produces confidence intervals that will enclose $(\mu_1-\mu_2)$,
the difference between population means, 95% of the time. Hence, we can be reasonably sure
that the mean height of children in province No. 18 was between 3.04 cm and 7.06 cm greater
than the mean height of children in province No. 14 at the survey time.
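The substitution in Example 7.13 can be reproduced with a short sketch. The function name `two_mean_ci` is hypothetical, and the sample standard deviations stand in for the unknown population values, as in the text.

```python
from math import sqrt


def two_mean_ci(x1: float, s1: float, n1: int,
                x2: float, s2: float, n2: int, z: float) -> tuple[float, float]:
    """Large-sample interval (x1 - x2) +/- z * sqrt(s1^2/n1 + s2^2/n2)."""
    diff = x1 - x2
    std_error = sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
    return diff - z * std_error, diff + z * std_error


# Summary statistics from Table 7.5; z_.025 = 1.96.
low, high = two_mean_ci(x1=91.72, s1=4.50, n1=30, x2=86.67, s2=3.88, n2=40, z=1.96)
print(f"95% confidence interval for the difference: ({low:.2f}, {high:.2f}) cm")
```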
When estimating the difference between two population means, based on small samples from
each population, we must make specific assumptions about the relative frequency distributions
of the two populations, as indicated in the box.
Assumptions required for small-sample estimation of $(\mu_1-\mu_2)$
1. Both of the populations from which the samples are selected have relative frequency
distributions that are approximately normal.
2. The variances $\sigma_1^2$ and $\sigma_2^2$ of the two populations are equal.
3. The random samples are selected in an independent manner from the two
populations.
When these assumptions are satisfied, we may use the procedure specified in the next box to
construct a confidence interval for $(\mu_1-\mu_2)$, based on small samples
($n_1$ and $n_2$ < 30) from the respective populations.
Small-sample $(1-\alpha)100\%$ confidence interval for $(\mu_1-\mu_2)$
$$(\bar{x}_1-\bar{x}_2) \pm t_{\alpha/2}\sqrt{s_p^2\left(\frac{1}{n_1}+\frac{1}{n_2}\right)}$$
where
$$s_p^2 = \frac{(n_1-1)s_1^2+(n_2-1)s_2^2}{n_1+n_2-2}$$
and the value of $t_{\alpha/2}$ is based on $(n_1+n_2-2)$ degrees of freedom.
Since we assume that the two populations have equal variances (i.e., $\sigma_1^2=\sigma_2^2=\sigma^2$), we
construct an estimate of $\sigma^2$ based on the information contained in both samples. This pooled
estimate is denoted by $s_p^2$ and is computed as in the previous box.
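The text gives no worked example for the pooled procedure, so the following sketch uses hypothetical summary statistics, invented only to show the mechanics. The function names are ours; the t-value 2.131 is the $t_{.025}$ entry for 15 degrees of freedom in the Student's t table of Section 7.3.

```python
from math import sqrt


def pooled_variance(s1: float, n1: int, s2: float, n2: int) -> float:
    """s_p^2 = ((n1 - 1) s1^2 + (n2 - 1) s2^2) / (n1 + n2 - 2)."""
    return ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)


def small_sample_two_mean_ci(x1: float, s1: float, n1: int,
                             x2: float, s2: float, n2: int,
                             t: float) -> tuple[float, float]:
    """(x1 - x2) +/- t * sqrt(s_p^2 * (1/n1 + 1/n2)); t must match df = n1 + n2 - 2."""
    sp2 = pooled_variance(s1, n1, s2, n2)
    half_width = t * sqrt(sp2 * (1.0 / n1 + 1.0 / n2))
    diff = x1 - x2
    return diff - half_width, diff + half_width


# Hypothetical data: n1 = 8 and n2 = 9 give df = 15, so t_.025 = 2.131.
low, high = small_sample_two_mean_ci(x1=52.3, s1=4.1, n1=8,
                                     x2=48.9, s2=3.7, n2=9, t=2.131)
print(f"95% confidence interval for the difference: ({low:.2f}, {high:.2f})")
```

Passing the t-value explicitly keeps the sketch free of any dependency beyond the standard library; the caller is responsible for choosing the entry that matches the degrees of freedom.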
7.6 Estimation of the difference between two population
means: Matched pairs
The procedures for estimating the difference between two population means presented in
Section 7.5 were based on the assumption that the samples were independently and randomly
selected from the target populations. Sometimes we can obtain more information about the
difference between population means $(\mu_1-\mu_2)$ by selecting paired observations.
For example, suppose we want to compare two methods for teaching reading skills to first
graders using a sample of ten students with each method. The best method of sampling would be
to match the first graders in pairs according to IQ and other factors that might affect reading
achievement. For each pair, one member would be randomly selected to be taught by method
1; the other member would be assigned to a class taught by method 2. Then the differences
between matched pairs of achievement test scores should provide a clearer picture of the
difference in achievement for the two reading methods because the matching would tend to
cancel the effects of the factors that formed the basis of the matching.
In the following boxes, we give the assumptions required and the procedure to be used for
estimating the difference between two population means based on matchedpairs data.
Assumptions required for estimation of $(\mu_1-\mu_2)$: Matched pairs
1. The sample paired observations are randomly selected from the target population
of paired observations.
2. The population of paired differences is normally distributed.
Small-sample $(1-\alpha)100\%$ confidence interval for $\mu_d = (\mu_1-\mu_2)$
Let $d_1, d_2, \ldots, d_n$ represent the differences between the pairwise observations in a
random sample of n matched pairs. Then the small-sample confidence interval for
$\mu_d = (\mu_1-\mu_2)$ is
$$\bar{d} \pm t_{\alpha/2}\left(\frac{s_d}{\sqrt{n}}\right)$$
where $\bar{d}$ is the mean of the n sample differences, $s_d$ is their standard deviation, and $t_{\alpha/2}$
is based on (n − 1) degrees of freedom.
Example 7.14 Suppose that the n = 10 pairs of achievement test scores are as given in
Table 7.7. Find a 95% confidence interval for the difference in mean achievement,
$\mu_d = (\mu_1-\mu_2)$.
Table 7.7 Reading achievement test scores for Example 7.14
Student pair        1    2    3    4    5    6    7    8    9   10
Method 1 score     78   63   72   89   91   49   68   76   85   55
Method 2 score     71   44   61   84   74   51   55   60   77   39
Pair difference     7   19   11    5   17   −2   13   16    8   16
Solution The differences between matched pairs of reading achievement test scores are
computed as
d = (method 1 score − method 2 score)
The mean, variance, and standard deviation of the differences are
$$\bar{d} = \frac{\sum d}{n} = \frac{110}{10} = 11.0$$
$$s_d^2 = \frac{\sum d^2 - \dfrac{\left(\sum d\right)^2}{n}}{n-1} = \frac{1{,}594 - \dfrac{(110)^2}{10}}{9} = \frac{1{,}594 - 1{,}210}{9} = 42.6667$$
$$s_d = \sqrt{42.67} = 6.53$$
The value of $t_{.025}$, based on (n − 1) = 9 degrees of freedom, is given in Table 2 of Appendix C as
$t_{.025}$ = 2.262. Substituting these values into the formula for the confidence interval, we obtain
$$\bar{d} \pm t_{.025}\left(\frac{s_d}{\sqrt{n}}\right) = 11.0 \pm 2.262\left(\frac{6.53}{\sqrt{10}}\right) = 11.0 \pm 4.7$$
or (6.3, 15.7).
We estimate, with 95% confidence, that the difference between mean reading achievement test
scores for methods 1 and 2 falls within the interval from 6.3 to 15.7. Since all the values within
the interval are positive, method 1 seems to produce a mean achievement test score that is
substantially higher than the mean score for method 2.
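The matched-pairs computation of Example 7.14 can be retraced from the raw scores of Table 7.7. This is a sketch only: variable names are ours, and the critical value $t_{.025}$ = 2.262 (9 degrees of freedom) is entered by hand from the t table rather than computed.

```python
from math import sqrt
from statistics import mean, stdev

# Scores from Table 7.7, one entry per matched pair.
method1 = [78, 63, 72, 89, 91, 49, 68, 76, 85, 55]
method2 = [71, 44, 61, 84, 74, 51, 55, 60, 77, 39]

# Pairwise differences d = method 1 score - method 2 score (pair 6 is negative).
diffs = [a - b for a, b in zip(method1, method2)]

d_bar = mean(diffs)     # mean of the differences
s_d = stdev(diffs)      # sample standard deviation (divisor n - 1)
t_025 = 2.262           # t_.025 for n - 1 = 9 degrees of freedom

half_width = t_025 * s_d / sqrt(len(diffs))
low, high = d_bar - half_width, d_bar + half_width
print(f"differences: {diffs}")
print(f"d_bar = {d_bar}, s_d = {s_d:.2f}")
print(f"95% confidence interval: ({low:.1f}, {high:.1f})")
```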
7.7 Estimation of the difference between two population proportions
This section extends the method of Section 7.4 to the case in which we want to estimate the
difference between two population proportions. For example, we may be interested in
comparing the proportions of married and unmarried persons who are overweight.
Example 7.15 Suppose that there were two surveys, one carried out in 1990 and
another in 1998. In both surveys, random samples of 1,400 adults in a country were
asked whether they were satisfied with their life. The results of the surveys are reported
in Table 7.8. Construct a point estimate for the difference between the proportions of adults
in the country in 1990 and in 1998 who were satisfied with their life.
Table 7.8 Proportions of two samples for Example 7.15
                                                               1990             1998
Number surveyed                                                $n_1$ = 1,400    $n_2$ = 1,400
Number in sample who said they were satisfied with their life  462              674
Solution We define some notation:
$p_1$ = Population proportion of adults who said that they were satisfied with their life in 1990.
$p_2$ = Population proportion of adults who said that they were satisfied with their life in 1998.
As a point estimate of $(p_1-p_2)$, we will use the difference between the corresponding sample
proportions, $(\hat{p}_1-\hat{p}_2)$, where
$$\hat{p}_1 = \frac{\text{Number of adults in 1990 who said that they were satisfied with their life}}{\text{Number of adults surveyed in 1990}} = \frac{462}{1{,}400} = .33$$
and
$$\hat{p}_2 = \frac{\text{Number of adults in 1998 who said that they were satisfied with their life}}{\text{Number of adults surveyed in 1998}} = \frac{674}{1{,}400} = .48$$
Thus, the point estimate of $(p_1-p_2)$ is
$(\hat{p}_1-\hat{p}_2)$ = .33 − .48 = −.15
To judge the reliability of the point estimate $(\hat{p}_1-\hat{p}_2)$, we need to know the characteristics of its
performance in repeated independent sampling from two populations. This information is
provided by the sampling distribution of $(\hat{p}_1-\hat{p}_2)$, shown in the next box.
Sampling distribution of $(\hat{p}_1-\hat{p}_2)$
For sufficiently large sample sizes, $n_1$ and $n_2$, the sampling distribution of $(\hat{p}_1-\hat{p}_2)$,
based on independent random samples from two populations, is approximately normal
with
Mean: $\mu_{(\hat{p}_1-\hat{p}_2)} = (p_1-p_2)$
and
Standard deviation: $\sigma_{(\hat{p}_1-\hat{p}_2)} = \sqrt{\dfrac{p_1q_1}{n_1}+\dfrac{p_2q_2}{n_2}}$
where $q_1 = 1-p_1$ and $q_2 = 1-p_2$.
It follows that a large-sample confidence interval for $(p_1-p_2)$ may be obtained as shown in the
box.
Large-sample $(1-\alpha)100\%$ confidence interval for $(p_1-p_2)$
$$(\hat{p}_1-\hat{p}_2) \pm z_{\alpha/2}\,\sigma_{(\hat{p}_1-\hat{p}_2)} \approx (\hat{p}_1-\hat{p}_2) \pm z_{\alpha/2}\sqrt{\frac{\hat{p}_1\hat{q}_1}{n_1}+\frac{\hat{p}_2\hat{q}_2}{n_2}}$$
where $\hat{p}_1$ and $\hat{p}_2$ are the sample proportions of observations with the characteristics
of interest, $\hat{q}_1 = 1-\hat{p}_1$ and $\hat{q}_2 = 1-\hat{p}_2$.
Assumption: The samples are sufficiently large so that the approximation is valid. As a
general rule of thumb we will require that the intervals
$\hat{p}_1 \pm 2\sqrt{\dfrac{\hat{p}_1\hat{q}_1}{n_1}}$ and $\hat{p}_2 \pm 2\sqrt{\dfrac{\hat{p}_2\hat{q}_2}{n_2}}$ do not contain 0 or 1.
Example 7.16 Refer to Example 7.15. Estimate the difference between the proportions of
the adults in this country in 1990 and in 1998 who said that they were satisfied with their
life, using a 95% confidence interval.
Solution From Example 7.15, we have $n_1 = n_2$ = 1,400, $\hat{p}_1$ = .33 and $\hat{p}_2$ = .48.
Thus, $\hat{q}_1$ = 1 − .33 = .67 and $\hat{q}_2$ = 1 − .48 = .52. Note that the intervals
$$\hat{p}_1 \pm 2\sqrt{\frac{\hat{p}_1\hat{q}_1}{n_1}} = .33 \pm 2\sqrt{\frac{(.33)(.67)}{1{,}400}} = .33 \pm .025$$
$$\hat{p}_2 \pm 2\sqrt{\frac{\hat{p}_2\hat{q}_2}{n_2}} = .48 \pm 2\sqrt{\frac{(.48)(.52)}{1{,}400}} = .48 \pm .027$$
do not contain 0 or 1. Thus, we can apply the large-sample confidence interval for
$(p_1-p_2)$.
The 95% confidence interval is
$$(\hat{p}_1-\hat{p}_2) \pm z_{.025}\sqrt{\frac{\hat{p}_1\hat{q}_1}{n_1}+\frac{\hat{p}_2\hat{q}_2}{n_2}} = (.33-.48) \pm 1.96\sqrt{\frac{(.33)(.67)}{1{,}400}+\frac{(.48)(.52)}{1{,}400}} = -.15 \pm .036$$
or (−.186, −.114). Thus we estimate that the interval (−.186, −.114) encloses the difference $(p_1-p_2)$
with 95% confidence. It appears that the proportion of adults who said that they were satisfied
with their life was between 11.4 and 18.6 percentage points higher in 1998 than in 1990.
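The two-proportion interval of Example 7.16 can be reproduced with the sketch below. The function name `two_proportion_ci` is hypothetical, and the sample proportions are entered already rounded to .33 and .48, as in the text; using the unrounded value 674/1,400 would shift the endpoints slightly.

```python
from math import sqrt


def two_proportion_ci(p1_hat: float, n1: int,
                      p2_hat: float, n2: int, z: float) -> tuple[float, float]:
    """(p1_hat - p2_hat) +/- z * sqrt(p1_hat*q1_hat/n1 + p2_hat*q2_hat/n2)."""
    q1_hat, q2_hat = 1.0 - p1_hat, 1.0 - p2_hat
    std_error = sqrt(p1_hat * q1_hat / n1 + p2_hat * q2_hat / n2)
    diff = p1_hat - p2_hat
    return diff - z * std_error, diff + z * std_error


# Rounded sample proportions from Example 7.15; z_.025 = 1.96.
low, high = two_proportion_ci(p1_hat=0.33, n1=1400, p2_hat=0.48, n2=1400, z=1.96)
print(f"95% confidence interval for p1 - p2: ({low:.3f}, {high:.3f})")
```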
7.8 Choosing the sample size
Before constructing a confidence interval for a parameter of interest, we will have to decide on
the number n of observations to be included in a sample. Should we sample n = 10
observations, n = 20, or n = 100? To answer this question we need to decide how wide a
confidence interval we are willing to tolerate and the measure of confidence, that is, the
confidence coefficient, that we wish to place in it. The following example will illustrate the
method for determining the appropriate sample size for estimating a population mean.
Example 7.17 A mail-order house wants to estimate the mean length of time between
shipment of an order and receipt by the customer. The management plans to randomly
sample n orders and determine, by telephone, the number of days between shipment
and receipt for each order. If management wants to estimate the mean shipping time
correct to within .5 day with probability equal to .95, how many orders should be sampled?
Solution We will use $\bar{x}$, the sample mean of the n measurements, to estimate μ, the
mean shipping time. Its sampling distribution will be approximately normal and the
probability that $\bar{x}$ will lie within
$$1.96\,\sigma_{\bar{x}} = 1.96\left(\frac{\sigma}{\sqrt{n}}\right)$$
of the mean shipping time, μ, is approximately .95 (see Figure 7.9). Therefore, we want to
choose the sample size n so that $1.96\,\sigma/\sqrt{n}$ equals .5 day:
$$1.96\left(\frac{\sigma}{\sqrt{n}}\right) = .5$$
Figure 7.9 Sampling distribution of the sample mean, $\bar{x}$
To solve the equation $1.96\,\sigma/\sqrt{n} = .5$, we need to know the value of σ, a measure of variation
of the population of all shipping times. Since σ is unknown, we must approximate its value
using the standard deviation of some previous sample data or deduce an approximate value
from other knowledge about the population. Suppose, for example, that we know almost all
shipments will be delivered within 7 days. Then the population of shipping times might appear as
shown in Figure 7.10.
Figure 7.10 Hypothetical relative frequency distribution of population of shipping
times for Example 7.17.
Figure 7.10 provides the information we need to find an approximation for σ. Since the Empirical
Rule tells us that almost all the observations in a data set will fall within the interval μ ± 3σ, it
follows that the range of a population is approximately 6σ. If the range of the population of
shipping times is 7 days, then
6σ = 7 days
and σ is approximately equal to 7/6 or 1.17 days.
The final step in determining the sample size is to substitute this approximate value of σ into
the equation obtained previously and solve for n.
Thus, we have
$$1.96\left(\frac{1.17}{\sqrt{n}}\right) = .5 \quad\text{or}\quad \sqrt{n} = \frac{1.96(1.17)}{.5} = 4.59$$
Squaring both sides of this equation yields n = 21.07.
We will follow the usual convention of rounding the calculated sample size upward. Therefore,
the mail-order house needs to sample approximately n = 22 shipping times in order to estimate
the mean shipping time correct to within .5 day with probability equal to .95.
In Example 7.17, we wanted our sample estimate to lie within .5 day of the true mean shipping
time, μ, with probability .95, where .95 represents the confidence coefficient. We could
calculate the sample size for a confidence coefficient other than .95 by changing the z-value in
the equation. In general, if we want $\bar{x}$ to lie within a distance d of μ with probability $(1 - \alpha)$, we
would solve for n in the equation
$$z_{\alpha/2}\left(\frac{\sigma}{\sqrt{n}}\right) = d$$
where the value of $z_{\alpha/2}$ is obtained from Table 1 of Appendix C. The solution is given by
$$n = \left(\frac{z_{\alpha/2}\,\sigma}{d}\right)^2$$
For example, for a confidence coefficient of .90, we would require a sample size of
$$n = \left(\frac{1.645\,\sigma}{d}\right)^2$$
Choosing the sample size for estimating a population mean μ to within d units
with probability $(1 - \alpha)$
$$n = \left(\frac{z_{\alpha/2}\,\sigma}{d}\right)^2$$
(Note: The population standard deviation σ will usually have to be approximated.)
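The boxed formula can be checked numerically. The following is a minimal sketch, not part of the original course material: the function name `sample_size_for_mean` is our own, and the z-value is computed from the standard normal distribution in Python's standard library instead of being read from Table 1 of Appendix C.

```python
import math
from statistics import NormalDist


def sample_size_for_mean(sigma, d, confidence=0.95):
    """Sample size n = (z_{alpha/2} * sigma / d)^2, rounded upward.

    sigma      -- (approximate) population standard deviation
    d          -- desired half-width of the interval around the mean
    confidence -- confidence coefficient (1 - alpha)
    """
    alpha = 1.0 - confidence
    # z_{alpha/2} cuts off an upper-tail area of alpha/2 under the standard normal curve.
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return math.ceil((z * sigma / d) ** 2)


# Example 7.17: sigma is approximated by 1.17 days, d = .5 day, confidence .95.
print(sample_size_for_mean(sigma=1.17, d=0.5, confidence=0.95))
```

With the values of Example 7.17 the sketch returns n = 22, in agreement with the text. The unrounded value it computes (about 21.03) differs slightly from the 21.07 quoted above only because the text rounds $\sqrt{n}$ to 4.59 before squaring.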
The procedures for determining the sample sizes needed to estimate a population proportion,
the difference between two population means, or the difference between two population
proportions are analogous to the procedure for determining the sample size for estimating a
population mean.
Choosing the sample size for estimating a population proportion p to within d
units with probability $(1 - \alpha)$
$$n = \left(\frac{z_{\alpha/2}}{d}\right)^2 pq$$
where p is the value of the population proportion that we are attempting to estimate
and q = 1 − p.
(Note: This technique requires previous estimates of p and q. If none are available, use
p = q = .5 for a conservative choice of n.)
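As an illustration of the note above, the following sketch evaluates the formula for a proportion. The function name `sample_size_for_proportion` and the numbers used (d = .03 and a prior estimate p = .2) are hypothetical choices made for this illustration; they do not come from the text.

```python
import math
from statistics import NormalDist


def sample_size_for_proportion(d, confidence=0.95, p=0.5):
    """Sample size n = (z_{alpha/2} / d)^2 * p * q, rounded upward.

    d          -- desired half-width of the interval around p
    confidence -- confidence coefficient (1 - alpha)
    p          -- prior estimate of the proportion; .5 is the conservative default
    """
    alpha = 1.0 - confidence
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # z_{alpha/2}
    q = 1.0 - p
    return math.ceil((z / d) ** 2 * p * q)


# Hypothetical target: estimate p to within .03 with probability .95.
conservative = sample_size_for_proportion(d=0.03)            # p = q = .5
with_estimate = sample_size_for_proportion(d=0.03, p=0.2)    # prior estimate p = .2
print(conservative, with_estimate)
```

Because the product pq is largest when p = q = .5, the conservative choice never yields a smaller n than any other prior estimate of p.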
7.9 Estimation of a population variance
In the previous sections, we considered interval estimates for population means or proportions.
In this optional section, we discuss a confidence interval for a population variance, $\sigma^2$. Intuitively,
it seems reasonable to use the sample variance $s^2$ to estimate $\sigma^2$ and to construct our
confidence interval around this value. However, unlike sample means and sample proportions,
the sampling distribution of the sample variance does not possess a normal z-distribution or a
t-distribution.
Rather, when certain assumptions are satisfied, the sampling distribution of $s^2$ possesses
approximately a chi-square ($\chi^2$) distribution. The chi-square probability distribution, like the
t-distribution, is characterized by a quantity called the degrees of freedom associated with the
distribution. Several chi-square probability distributions with different degrees of freedom are
shown in Figure 7.11. Unlike the z- and t-distributions, the chi-square distribution is not symmetric about 0.
Throughout this section we will use the words chi-square and the Greek symbol $\chi^2$
interchangeably.
Example 7.18 Tabulated values of the $\chi^2$ distribution are given in Table 3 of Appendix C;
a partial reproduction of this table is shown in Table 7.9. Entries in the table give an
upper-tail value of $\chi^2$, call it $\chi^2_{\alpha}$, such that $P(\chi^2 > \chi^2_{\alpha}) = \alpha$. Find the tabulated value of $\chi^2$
corresponding to 9 degrees of freedom that cuts off an upper-tail area of .05.
Figure 7.11 Several chi-square probability distributions
Solution The value of $\chi^2$ that we seek appears in the df = 9 row and the $\chi^2_{.050}$ column of the partial
reproduction of Table 3 of Appendix C given in Table 7.9. The columns of the table identify the value of α
associated with the tabulated value of $\chi^2_{\alpha}$, and the rows correspond to the degrees of
freedom. For this example, we have df = 9 and α = .05. Thus, the tabulated value of $\chi^2$
corresponding to 9 degrees of freedom is
$$\chi^2_{.05} = 16.9190$$
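The tabulated value can also be reproduced numerically. The sketch below is an illustration under our own assumptions, not the method used to build Table 3 of Appendix C: it evaluates the chi-square cumulative probability with the standard series for the regularized incomplete gamma function and then finds the upper-tail critical value by bisection. The function names `chi2_cdf` and `chi2_upper_critical` are hypothetical, and only Python's standard library is used.

```python
import math


def chi2_cdf(x, df):
    """P(chi-square with df degrees of freedom <= x).

    Uses the series for the regularized lower incomplete gamma function
    P(a, t) with a = df/2 and t = x/2.
    """
    if x <= 0:
        return 0.0
    a, t = df / 2.0, x / 2.0
    term = 1.0 / a          # k = 0 term of the series
    total = term
    k = 0
    while abs(term) > 1e-15 * abs(total):
        k += 1
        term *= t / (a + k)
        total += term
    # Multiply the series by t^a * exp(-t) / Gamma(a), computed on the log scale.
    return total * math.exp(a * math.log(t) - t - math.lgamma(a))


def chi2_upper_critical(alpha, df):
    """Value c such that P(chi-square > c) = alpha, found by bisection."""
    lo, hi = 0.0, df + 10.0 * math.sqrt(2.0 * df) + 50.0  # hi lies far in the upper tail
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if 1.0 - chi2_cdf(mid, df) > alpha:
            lo = mid   # upper-tail area still too large: move right
        else:
            hi = mid
    return (lo + hi) / 2.0


# Example 7.18: 9 degrees of freedom, upper-tail area .05.
print(round(chi2_upper_critical(0.05, 9), 4))
```

For df = 9 and α = .05 the sketch agrees with the tabulated 16.9190 to the accuracy shown in the table.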
Table 7.9 Reproduction of part of Table 3 of Appendix C

Degrees of
freedom      χ²_.100     χ²_.050     χ²_.025     χ²_.010     χ²_.005
   1         2.70554     3.84146     5.02389     6.63490     7.87944
   2         4.60517     5.99147     7.37776     9.21034    10.59660
   3         6.25139     7.81473     9.34840    11.34490    12.83810
   4         7.77944     9.48773    11.14330    13.27670    14.86020
   5         9.23635    11.07050    12.83250    15.08630    16.74960
   6        10.64460    12.59160    14.44940    16.81190    18.54760
   7        12.01700    14.06710    16.01280    18.47530    20.27770
   8        13.36160    15.50730    17.53460    20.09020    21.95500
   9        14.68370    16.91900    19.02280    21.66600    23.58930
  10        15.98710    18.30700    20.48310    23.20930    25.18820
  11        17.27500    19.67510    21.92000    24.72500    26.75690
  12        18.54940    21.02610    23.33670    26.21700    28.29950
  13        19.81190    22.36210    24.73560    27.68830    29.81940
  14        21.06420    23.68480    26.11900    29.14130    31.31930
  15        22.30720    24.99580    27.48840    30.57790    32.80130
  16        23.54180    26.29620    28.84540    31.99990    34.26720
  17        24.76900    27.58710    30.19100    33.40870    35.71850
  18        25.98940    28.86930    31.52640    34.80530    37.15640
  19        27.20360    30.14350    32.85230    36.19080    38.58220

We use the tabulated values of $\chi^2$ to construct a confidence interval for $\sigma^2$, as the next example shows.
Example 7.19 There was a study of contaminated fish in a river. Suppose it is important
for the study to know how stable the weights of the contaminated fish are. That is, how
large is the variance $\sigma^2$ in the fish weights? The sample of 144 fish in the study
produced the following summary statistics:
$$\bar{x} = 1{,}049.7 \text{ grams}, \quad s = 376.6 \text{ grams}.$$
Use this information to construct a 95% confidence interval for the true variation in weights of
contaminated fish in the river.
Solution A (1 − α)100% confidence interval for $\sigma^2$ depends on the quantities $s^2$,
(n − 1), and critical values of $\chi^2$, as shown in the box. Note that (n − 1) represents the
degrees of freedom associated with the $\chi^2$ distribution. To construct the interval, we first
locate the critical values $\chi^2_{1-\alpha/2}$ and $\chi^2_{\alpha/2}$. These are the values of $\chi^2$ that cut off an area
of α/2 in the lower and upper tails, respectively, of the chi-square distribution (see Figure
7.11).
A (1 − α)100% confidence interval for a population variance, $\sigma^2$
$$\frac{(n-1)s^2}{\chi^2_{\alpha/2}} \le \sigma^2 \le \frac{(n-1)s^2}{\chi^2_{(1-\alpha/2)}}$$
where $\chi^2_{\alpha/2}$ and $\chi^2_{(1-\alpha/2)}$ are values of $\chi^2$ that locate an area of α/2 to the right and α/2
to the left, respectively, of a chi-square distribution based on (n − 1) degrees of
freedom.
Assumption: The population from which the sample is selected has an approximate
normal distribution.
For a 95% confidence interval, (1 − α) = .95 and α/2 = .05/2 = .025. Therefore, we need the
tabulated values $\chi^2_{.025}$ and $\chi^2_{.975}$ for (n − 1) = 143 df. Looking in the
df = 150 row of Table 3 of Appendix C (the row with the df value closest to 143), we find $\chi^2_{.025}$
= 185.800 and $\chi^2_{.975}$ = 117.985. Substituting into the formula given in the box, we obtain
$$\frac{(144-1)(376.6)^2}{185.800} \le \sigma^2 \le \frac{(144-1)(376.6)^2}{117.985}$$
We are 95% confident that the true variance in weights of contaminated fish in the river falls
between 109,156.8 and 171,898.4.
Figure 7.11 The location of $\chi^2_{1-\alpha/2}$ and $\chi^2_{\alpha/2}$ for a chi-square distribution
Example 7.20 Refer to Example 7.19. Find a 95% confidence interval for σ, the true
standard deviation of the fish weights.
Solution A confidence interval for σ is obtained by taking the square roots of the lower
and upper endpoints of a confidence interval for $\sigma^2$. Thus, the 95% confidence interval is
$$109{,}156.8 \le \sigma^2 \le 171{,}898.4$$
$$330.4 \le \sigma \le 414.6$$
Thus, we are 95% confident that the true standard deviation of the fish weights is between
330.4 grams and 414.6 grams.
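The arithmetic of Examples 7.19 and 7.20 can be retraced with a few lines of code. This is a sketch under the same choices as the text: the critical values 185.800 and 117.985 are the tabulated values from the df = 150 row, passed in by hand, and the function name `variance_ci` is our own.

```python
import math


def variance_ci(n, s, chi2_upper, chi2_lower):
    """Confidence interval for a population variance.

    chi2_upper -- critical value chi^2_{alpha/2} (area alpha/2 to its right)
    chi2_lower -- critical value chi^2_{1 - alpha/2} (area alpha/2 to its left)
    Returns (lower endpoint, upper endpoint) for sigma^2.
    """
    sum_of_squares = (n - 1) * s ** 2
    return sum_of_squares / chi2_upper, sum_of_squares / chi2_lower


# Example 7.19: n = 144 fish, s = 376.6 grams, tabulated critical values (df = 150 row).
var_lo, var_hi = variance_ci(n=144, s=376.6, chi2_upper=185.800, chi2_lower=117.985)

# Example 7.20: take square roots of the endpoints to get an interval for sigma.
sd_lo, sd_hi = math.sqrt(var_lo), math.sqrt(var_hi)

print(f"variance: {var_lo:,.1f} to {var_hi:,.1f}")
print(f"std dev:  {sd_lo:.1f} to {sd_hi:.1f}")
```

The endpoints agree with those in the text up to rounding of the tabulated critical values. Note that the larger critical value produces the lower endpoint, because it appears in the denominator.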
Note that the procedure for calculating a confidence interval for $\sigma^2$ in Example 7.19 (and the
confidence interval for σ in Example 7.20) requires an assumption regardless of whether the
sample size n is large or small (see box). We must assume that the population from which the
sample is selected has an approximate normal distribution. It is reasonable to expect this
assumption to be satisfied in Examples 7.19 and 7.20 since the histogram of the 144 fish
weights in the sample is approximately normal.
7.10 Summary
This chapter presented the technique of estimation, that is, using sample information to make
an inference about the value of a population parameter, or the difference between two
population parameters. In each instance, we presented the point estimate of the parameter of
interest, its sampling distribution, the general form of a confidence interval, and any
assumptions required for the validity of the procedure. In addition, we provided techniques for
determining the sample size necessary to estimate each of these parameters.
7.11 Exercises
7.1. Use Table 1 of Appendix C to determine the value of $z_{\alpha/2}$ that would be used to construct a
large-sample confidence interval for μ, for each of the following confidence coefficients:
a) .85
b) .95
c) .975
7.2. Suppose a random sample of size n = 100 produces a mean of $\bar{x} = 81$ and a standard
deviation of s = 12.
a) Construct a 90% confidence interval for μ.
b) Construct a 95% confidence interval for μ.
c) Construct a 99% confidence interval for μ.
7.3. Use Table 2 of Appendix C to determine the values of $t_{\alpha/2}$ that would be used in the
construction of a confidence interval for a population mean for each of the following
combinations of confidence coefficient and sample size:
a) Confidence coefficient .99, n = 18.
b) Confidence coefficient .95, n = 10.
c) Confidence coefficient .90, n = 15.
7.4. A random sample of n = 10 measurements from a normally distributed population yields
$\bar{x} = 9.4$ and s = 1.8.
a) Calculate a 90% confidence interval for μ.
b) Calculate a 95% confidence interval for μ.
c) Calculate a 99% confidence interval for μ.
7.5. The mean and standard deviation of n measurements randomly sampled from a normally
distributed population are 33 and 4, respectively. Construct a 95% confidence interval for
μ when:
a) n = 5 b) n = 15 c) n = 25
7.6. Random samples of n measurements are selected from a population with unknown
proportion of successes p. Compute an estimate of $\sigma_{\hat{p}}$ for each of the following situations:
a) n = 250, $\hat{p}$ = .4 b) n = 500, $\hat{p}$ = .85 c) n = 95, $\hat{p}$ = .25
7.7. A random sample of size 150 is selected from a population and the number of successes
is 60.
a) Find $\hat{p}$.
b) Construct a 90% confidence interval for p.
c) Construct a 95% confidence interval for p.
d) Construct a 99% confidence interval for p.
7.8. Independent random samples from two normal populations produced the sample means
and variances listed in the following table.

Sample from population 1      Sample from population 2
$n_1$ = 14                    $n_2$ = 7
$\bar{x}_1$ = 53.2            $\bar{x}_2$ = 43.4
$s_1^2$ = 96.8                $s_2^2$ = 102.0

a) Find a 90% confidence interval for $(\mu_1 - \mu_2)$.
b) Find a 95% confidence interval for $(\mu_1 - \mu_2)$.
c) Find a 99% confidence interval for $(\mu_1 - \mu_2)$.
7.9. A random sample of ten paired observations yielded the following summary information:
$\bar{d}$ = 2.3, $s_d$ = 2.67
a) Find a 90% confidence interval for $\mu_d$.
b) Find a 95% confidence interval for $\mu_d$.
c) Find a 99% confidence interval for $\mu_d$.
Chapter 8 Hypothesis Testing
CONTENTS
8.1 Introduction
8.2 Formulating Hypotheses
8.3 Types of errors for a Hypothesis Test
8.4 Rejection Regions
8.5 Summary
8.6 Exercises
8.1 Introduction
In this chapter we will study another method of inference-making: hypothesis testing. The
procedures to be discussed are useful in situations where we are interested in making a
decision about a parameter value, rather than obtaining an estimate of its value. It is often
desirable to know whether some characteristic of a population is larger than a specified value,
or whether the obtained value of a given parameter is less than a value hypothesized for the
purpose of comparison.
8.2 Formulating Hypotheses
When we set out to test a new theory, we first formulate a hypothesis, or a claim, which we
believe to be true. For example, we may claim that the mean number of children born to urban
women is less than the mean number of children born to rural women.
Since the value of the population characteristic is unknown, the information provided by a
sample from the population is used to answer the question of whether or not the population
quantity is larger than the specified or hypothesized value. In statistical terms, a statistical
hypothesis is a statement about the value of a population parameter. The hypothesis that we try
to establish is called the alternative hypothesis and is denoted by $H_a$. Paired with the
alternative hypothesis is the null hypothesis, which is the "opposite" of the alternative hypothesis
and is denoted by $H_0$. In this way, the null and alternative hypotheses, both stated in terms of
the appropriate parameters, describe two possible states of nature that cannot simultaneously
be true. When a researcher begins to collect information about the phenomenon of interest, he
or she generally tries to present evidence that lends support to the alternative hypothesis. As
you will subsequently learn, we take an indirect approach to obtaining support for the alternative
hypothesis: instead of trying to show that the alternative hypothesis is true, we attempt to
produce evidence to show that the null hypothesis is false.
It should be stressed that researchers frequently put forward a null hypothesis in the hope that
they can discredit it. For example, consider an educational researcher who designed a new way
to teach a particular concept in science, and wanted to test experimentally whether this new
method worked better than the existing method. The researcher would design an experiment
comparing the two methods. Since the null hypothesis would be that there is no difference
between the two methods, the researcher would be hoping to reject the null hypothesis and
conclude that the method he or she developed is the better of the two.
The null hypothesis is typically a hypothesis of no difference, as in the above example where it
is the hypothesis that there is no difference between population means. That is why the word
"null" in "null hypothesis" is used − it is the hypothesis of no difference.
Example 8.1 Formulate appropriate null and alternative hypotheses for testing the
demographer's theory that the mean number of children born to urban women is less
than the mean number of children born to rural women.
Solution The hypotheses must be stated in terms of a population parameter or
parameters. We will thus define
$\mu_1$ = Mean number of children born to urban women, and
$\mu_2$ = Mean number of children born to rural women.
The demographer wants to support the claim that $\mu_1$ is less than $\mu_2$; therefore, the null and
alternative hypotheses, in terms of these parameters, are
$H_0$: $(\mu_1 - \mu_2) = 0$ (i.e., $\mu_1 = \mu_2$; there is no difference between the mean numbers of children
born to urban and rural women)
$H_a$: $(\mu_1 - \mu_2) < 0$ (i.e., $\mu_1 < \mu_2$; the mean number of children born to urban women is less
than that for rural women)
Example 8.2 For many years, cigarette advertisements have been required to carry the
following statement: "Cigarette smoking is dangerous to your health." But this warning is
often located in inconspicuous corners of the advertisements and printed in small type.
Consequently, a researcher believes that over 80% of those who read cigarette
advertisements fail to see the warning. Specify the null and alternative hypotheses that
would be used in testing the researcher's theory.
Solution The researcher wants to make an inference about p, the true proportion of all
readers of cigarette advertisements who fail to see the warning. In particular, he wishes
to collect evidence to support the claim that p is greater than .80; thus, the null and
alternative hypotheses are
$H_0$: p = .80
$H_a$: p > .80
Observe that the statement of $H_0$ in these examples, and in general, is written with an equality (=)
sign. In Example 8.2, you may have been tempted to write the null hypothesis as $H_0$: p ≤ .80.
However, since the alternative of interest is that p > .80, any evidence that would cause you
to reject the null hypothesis $H_0$: p = .80 in favor of $H_a$: p > .80 would also cause you to reject $H_0$:
p = p', for any value of p' that is less than .80. In other words, $H_0$: p = .80 represents the worst
possible case, from the researcher's point of view, when the alternative hypothesis is not correct.
Thus, for mathematical ease, we combine all possible situations for describing the opposite of $H_a$
into one statement involving equality.
Example 8.3 A metal lathe is checked periodically by quality control inspectors to
determine if it is producing machine bearings with a mean diameter of .5 inch. If the
mean diameter of the bearings is larger or smaller than .5 inch, then the process is out of
control and needs to be adjusted. Formulate the null and alternative hypotheses that
could be used to test whether the bearing production process is out of control.
Solution We define the following parameter:
μ = True mean diameter (in inches) of all bearings produced by the lathe
If either μ > .5 or μ < .5, then the metal lathe's production process is out of control. Since we
wish to be able to detect either possibility, the null and alternative hypotheses would be
$H_0$: μ = .5 (i.e., the process is in control)
$H_a$: μ ≠ .5 (i.e., the process is out of control)
An alternative hypothesis may hypothesize a change from $H_0$ in a particular direction, or it may
merely hypothesize a change without specifying a direction. In Examples 8.1 and 8.2, the
researcher is interested in detecting departure from $H_0$ in one particular direction. In Example
8.1, the interest focuses on whether the mean number of children born to urban women is
less than the mean number of children born to rural women. In Example 8.2, the interest focuses on
whether the proportion of cigarette advertisement readers who fail to see the warning is greater than .80.
These two tests are called one-tailed tests. In contrast, Example 8.3 illustrates a
two-tailed test in which we are interested in whether the mean diameter of the machine bearings
differs in either direction from .5 inch, i.e., whether the process is out of control.
8.3 Types of errors for a Hypothesis Test
The goal of any hypothesis test is to make a decision. In particular, we will decide whether to
reject the null hypothesis, $H_0$, in favor of the alternative hypothesis, $H_a$. Although we would like
always to be able to make a correct decision, we must remember that the decision will be based
on sample information, and thus we are liable to make one of two types of error, as defined in
the accompanying boxes.
Definition 8.1
A Type I error is the error of rejecting the null hypothesis when it is true. The
probability of committing a Type I error is usually denoted by α.
Definition 8.2
A Type II error is the error of accepting the null hypothesis when it is false. The
probability of making a Type II error is usually denoted by β.
The null hypothesis can be either true or false; further, we will conclude either to reject
or not to reject the null hypothesis. Thus, there are four possible situations that may arise in
testing a hypothesis (see Table 8.1).
Table 8.1 Conclusions and consequences for testing a hypothesis

                                             Conclusions
True "State of Nature"       Do not reject Null Hypothesis     Reject Null Hypothesis
Null Hypothesis              Correct conclusion                Type I error
Alternative Hypothesis       Type II error                     Correct conclusion

The kind of error that can be made depends on the actual state of affairs (which, of course, is
unknown to the investigator). Note that we risk a Type I error only if the null hypothesis is
rejected, and we risk a Type II error only if the null hypothesis is not rejected. Thus, we may
make no error, or we may make either a Type I error (with probability α), or a Type II error
(with probability β), but not both. We do not know which of these situations actually holds,
and so would like to keep the probabilities of both types of errors small. There is an intuitively
appealing relationship between the probabilities for the two types of error: for a fixed sample
size, as α increases, β decreases; similarly, as β increases, α decreases. The only way to reduce α and β simultaneously
is to increase the amount of information available in the sample, i.e., to increase the sample size.
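The trade-off between α and β can be made concrete with a small calculation. The sketch below is an illustration only: it assumes an upper-tailed large-sample z-test with known σ, and the numbers used (null value 72, true mean 75, σ = 10, n = 36 or 100) are hypothetical values chosen for the illustration. The function name `type_ii_error` is our own.

```python
import math
from statistics import NormalDist


def type_ii_error(alpha, n, mu0, mu_true, sigma):
    """beta for the upper-tailed z-test of H0: mu = mu0 against Ha: mu > mu0.

    Assumes sigma is known and that the true mean is mu_true (> mu0).
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1.0 - alpha)            # critical value of the test
    shift = (mu_true - mu0) / (sigma / math.sqrt(n))      # mean of z when mu = mu_true
    # H0 is not rejected when z <= z_alpha; under mu_true, z ~ Normal(shift, 1).
    return std_normal.cdf(z_alpha - shift)


# Fixed sample size: a larger alpha gives a smaller beta.
for alpha in (0.01, 0.05, 0.10):
    print("alpha =", alpha, "beta =", round(type_ii_error(alpha, 36, 72, 75, 10), 3))

# Fixed alpha: a larger sample size gives a smaller beta.
for n in (36, 100):
    print("n =", n, "beta =", round(type_ii_error(0.05, n, 72, 75, 10), 3))
```

The first loop shows the inverse relationship described above for a fixed sample size; the second shows that increasing n lowers β without raising α.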
Example 8.4 Refer to Example 8.3. Specify what Type I and Type II errors would
represent, in terms of the problem.
Solution A Type I error is the error of incorrectly rejecting the null hypothesis. In our
example, this would occur if we conclude that the process is out of control when in fact
the process is in control, i.e., if we conclude that the mean bearing diameter is different
from .5 inch, when in fact the mean is equal to .5 inch. The consequence of making such
an error would be that unnecessary time and effort would be expended to repair the
metal lathe.
A Type II error, that of accepting the null hypothesis when it is false, would occur if we conclude
that the mean bearing diameter is equal to .5 inch when in fact the mean differs from .5 inch. The
practical significance of making a Type II error is that the metal lathe would not be repaired,
when in fact the process is out of control.
The probability of making a Type I error (α) can be controlled by the researcher (how to do this
will be explained in Section 8.4). α is often used as a measure of the reliability of the conclusion
and called the level of significance (or significance level) for a hypothesis test.
You may note that we have carefully avoided stating a decision in terms of "accept the null
hypothesis $H_0$." Instead, if the sample does not provide enough evidence to support the
alternative hypothesis $H_a$, we prefer a decision "not to reject $H_0$." This is because, if we were to
"accept $H_0$," the reliability of the conclusion would be measured by β, the probability of a Type II
error. However, the value of β is not constant, but depends on the specific alternative value of the
parameter and is difficult to compute in most testing situations.
In summary, we recommend the following procedure for formulating hypotheses and stating
conclusions.
Formulating hypotheses and stating conclusions
1. State the hypothesis that you wish to establish as the alternative hypothesis $H_a$.
2. The null hypothesis, $H_0$, will be the opposite of $H_a$ and will contain an equality sign.
3. If the sample evidence supports the alternative hypothesis, the null hypothesis will be
rejected and the probability of having made an incorrect decision (when in fact $H_0$ is true) is
α, a quantity that can be manipulated to be as small as the researcher wishes.
4. If the sample does not provide sufficient evidence to support the alternative hypothesis, then
conclude that the null hypothesis cannot be rejected on the basis of your sample. In this
situation, you may wish to collect more information about the phenomenon under study.
Example 8.5 The logic used in hypothesis testing has often been likened to that used in
the courtroom in which a defendant is on trial for committing a crime.
a. Formulate appropriate null and alternative hypotheses for judging the guilt or innocence of
the defendant.
b. Interpret the Type I and Type II errors in this context.
c. If you were the defendant, would you want α to be small or large? Explain.
Solution
a. Under a judicial system, a defendant is "innocent until proven guilty." That is, the burden of
proof is not on the defendant to prove his or her innocence; rather, the court must collect
sufficient evidence to support the claim that the defendant is guilty. Thus, the null and
alternative hypotheses would be
$H_0$: Defendant is innocent
$H_a$: Defendant is guilty
b. The four possible outcomes are shown in Table 8.2. A Type I error would be to conclude
that the defendant is guilty, when in fact he or she is innocent; a Type II error would be to
conclude that the defendant is innocent, when in fact he or she is guilty.
Table 8.2 Conclusions and consequences in Example 8.5

                                     Decision of Court
True State of Nature         Defendant is innocent     Defendant is guilty
Defendant is innocent        Correct decision          Type I error
Defendant is guilty          Type II error             Correct decision

c. Most would probably agree that the Type I error in this situation is by far the more serious.
Thus, we would want α, the probability of committing a Type I error, to be very small indeed.
A convention that is generally observed when formulating the null and alternative hypotheses of
any statistical test is to state $H_0$ so that the possible error of incorrectly rejecting $H_0$ (Type I
error) is considered more serious than the possible error of incorrectly failing to reject $H_0$ (Type
II error). In many cases, the decision as to which type of error is more serious is admittedly not
as clear-cut as that of Example 8.5; experience will help to minimize this potential difficulty.
8.4 Rejection Regions
In this section we will describe how to arrive at a decision in a hypothesis-testing situation.
Recall that when making any type of statistical inference (of which hypothesis testing is a special
case), we collect information by obtaining a random sample from the populations of interest. In
all our applications, we will assume that the appropriate sampling process has already been
carried out.
Example 8.6 Suppose we want to test the hypotheses
$H_0$: μ = 72
$H_a$: μ > 72
What is the general format for carrying out a statistical test of hypothesis?
Solution The first step is to obtain a random sample from the population of interest. The
information provided by this sample, in the form of a sample statistic, will help us decide
whether to reject the null hypothesis. The sample statistic upon which we base our
decision is called the test statistic.
The second step is to determine a test statistic that is reasonable in the context of a given
hypothesis test. For this example, we are hypothesizing about the value of the population mean
μ. Since our best guess about the value of μ is the sample mean $\bar{x}$ (see Section 7.2), it seems
reasonable to use $\bar{x}$ as a test statistic. We will learn how to choose the test statistic for other
hypothesis-testing situations in the examples that follow.
The third step is to specify the range of possible computed values of the test statistic for which
the null hypothesis will be rejected. That is, what specific values of the test statistic will lead us
to reject the null hypothesis in favor of the alternative hypothesis? These specific values are
known collectively as the rejection region for the test. For this example, we would need to
specify the values of $\bar{x}$ that would lead us to believe that $H_a$ is true, i.e., that μ is greater than
72. We will learn how to find an appropriate rejection region in later examples.
Once the rejection region has been specified, the fourth step is to use the data in the sample to
compute the value of the test statistic. Finally, we make our decision by observing whether the
computed value of the test statistic lies within the rejection region. If in fact the computed value
falls within the rejection region, we will reject the null hypothesis; otherwise, we do not reject
the null hypothesis.
An outline of the hypothesis-testing procedure developed in Example 8.6 is given below.
Outline for testing a hypothesis
1. Obtain a random sample from the population(s) of interest.
2. Determine a test statistic that is reasonable in the context of the given hypothesis test.
3. Specify the rejection region, the range of possible computed values of the test statistic for
which the null hypothesis will be rejected.
4. Use the data in the sample to compute the value of the test statistic.
5. Observe whether the computed value of the test statistic lies within the rejection region. If so,
reject the null hypothesis; otherwise, do not reject the null hypothesis.
Recall that the null and alternative hypotheses will be stated in terms of specific population
parameters. Thus, in step 2 we decide on a test statistic that will provide information about the
target parameter.
Example 8.7 Refer to Example 8.1, in which we wish to test
$H_0$: $(\mu_1 - \mu_2) = 0$
$H_a$: $(\mu_1 - \mu_2) < 0$
where $\mu_1$ and $\mu_2$ are the population mean numbers of children born to urban women and rural
women, respectively. Suggest an appropriate test statistic in the context of this problem.
Solution The parameter of interest is $(\mu_1 - \mu_2)$, the difference between the two population
means. Therefore, we will use $(\bar{x}_1 - \bar{x}_2)$, the difference between the corresponding sample
means, as a basis for deciding whether to reject $H_0$. If the difference between the sample
means, $(\bar{x}_1 - \bar{x}_2)$, falls well below the hypothesized value of $(\mu_1 - \mu_2) = 0$, then we have
evidence that disagrees with the null hypothesis. In fact, it would support the alternative
hypothesis that $(\mu_1 - \mu_2) < 0$. Again, we are using the point estimate of the target
parameter as the test statistic in the hypothesis-testing approach. In general, when the
hypothesis test involves a specific population parameter, the test statistic to be used is
the conventional point estimate of that parameter.
In step 3, we divide all possible values of the test statistic into two sets: the rejection region and its
complement. If the computed value of the test statistic falls within the rejection region, we reject
the null hypothesis. If the computed value of the test statistic does not fall within the rejection
region, we do not reject the null hypothesis.
Example 8.8 Refer to Example 8.6. For the hypothesis test
$H_0$: μ = 72
$H_a$: μ > 72
indicate which decision you may make for each of the following values of the test statistic:
a. $\bar{x} = 110$ b. $\bar{x} = 59$ c. $\bar{x} = 73$
Solution
a. If $\bar{x} = 110$, then much doubt is cast upon the null hypothesis. In other words, if the null
hypothesis were true (i.e., if μ is in fact equal to 72), then it is very unlikely that we would
observe a sample mean $\bar{x}$ as large as 110. We would thus tend to reject the null hypothesis
on the basis of information contained in this sample.
b. Since the alternative of interest is μ > 72, this value of the sample mean, $\bar{x} = 59$, provides
no support for $H_a$. Thus, we would not reject $H_0$ in favor of $H_a$: μ > 72, based on this sample.
c. Does a sample value of $\bar{x} = 73$ cast sufficient doubt on the null hypothesis to warrant its
rejection? Although the sample mean $\bar{x} = 73$ is larger than the null hypothesized value of μ
= 72, is this due to chance variation, or does it provide strong enough evidence to conclude
in favor of $H_a$? We think you will agree that the decision is not as clear-cut as in parts a and
b, and that we need a more formal mechanism for deciding what to do in this situation.
We now illustrate how to determine a rejection region that takes into account such factors as the
sample size and the maximum probability of a Type I error that you are willing to tolerate.
Example 8.9 Refer to Example 8.8. Specify completely the form of the rejection region for
a test of
$H_0$: μ = 72
$H_a$: μ > 72
at a significance level of α = .05.
Solution We are interested in detecting a directional departure from $H_0$; in particular, we
are interested in the alternative that μ is greater than 72. Now, what values of the sample
mean $\bar{x}$ would cause us to reject $H_0$ in favor of $H_a$? Clearly, values of $\bar{x}$ which are
"sufficiently greater" than 72 would cast doubt on the null hypothesis. But how do we
decide whether a value such as $\bar{x} = 73$ is "sufficiently greater" than 72 to reject $H_0$? A convenient
measure of the distance between $\bar{x}$ and 72 is the z-score, which "standardizes" the value
of the test statistic $\bar{x}$:
$$z = \frac{\bar{x} - \mu_{\bar{x}}}{\sigma_{\bar{x}}} = \frac{\bar{x} - 72}{\sigma/\sqrt{n}} \approx \frac{\bar{x} - 72}{s/\sqrt{n}}$$
The z-score is obtained by using the values of $\mu_{\bar{x}}$ and $\sigma_{\bar{x}}$ that would be valid if the null
hypothesis were true, i.e., if μ = 72. The z-score then gives us a measure of how many standard
deviations the observed $\bar{x}$ is from what we would expect to observe if $H_0$ were true.
We examine Figure 8.1a and observe that the chance of obtaining a value of $\bar{x}$ more than 1.645
standard deviations above 72 is only .05, when in fact the true value of μ is 72. We are
assuming that the sample size is large enough to ensure that the sampling distribution of $\bar{x}$ is
approximately normal. Thus, if we observe a sample mean located more than 1.645 standard
deviations above 72, then either $H_0$ is true and a relatively rare (with probability .05 or less)
event has occurred, or $H_a$ is true and the population mean exceeds 72. We would tend to favor
the latter explanation for obtaining such a large value of $\bar{x}$, and would then reject $H_0$.
Figure 8.1 Location of rejection region of Example 8.9
In summary, our rejection region for this example consists of all values of z that are greater than
1.645 (i.e., all values of x̄ that are more than 1.645 standard deviations above 72). The value at
the boundary of the rejection region is called the critical value. The critical value 1.645 is shown
in Figure 8.1b. In this situation, the probability of a Type I error (that is, deciding in favor of Ha
if in fact H0 is true) is equal to α = .05.
Example 8.10 Specify the form of the rejection region for a test of
H0: μ = 72
Ha: μ < 72
at significance level α = .01.
Solution Here, we want to be able to detect the directional alternative that μ is less than
72; in this case, it is "sufficiently small" values of the test statistic x̄ that would cast
doubt on the null hypothesis. As in Example 8.9, we will standardize the value of the test
statistic to obtain a measure of the distance between x̄ and the null hypothesized value of
72:
$$z = \frac{\bar{x} - \mu_{\bar{x}}}{\sigma_{\bar{x}}} = \frac{\bar{x} - 72}{\sigma/\sqrt{n}} \approx \frac{\bar{x} - 72}{s/\sqrt{n}}$$
This z-value tells us how many standard deviations the observed x̄ is from what would be
expected if H0 were true. Here, we have also assumed that n ≥ 30 so that the sampling
distribution of x̄ will be approximately normal. The appropriate modifications for small samples
will be indicated in Chapter 9.
Figure 8.2a shows us that, when in fact the true value of μ is 72, the chance of observing a
value of x̄ more than 2.33 standard deviations below 72 is only .01. Thus, at significance level
(probability of Type I error) equal to .01, we would reject the null hypothesis for all values of z
that are less than −2.33 (see Figure 8.2b), i.e., for all values of x̄ that lie more than 2.33
standard deviations below 72.
Figure 8.2 Location of rejection region of Example 8.10
Example 8.11 Specify the form of the rejection region for a test of
H0: μ = 72
Ha: μ ≠ 72
where we are willing to tolerate a .05 chance of making a Type I error.
Solution For this two-sided (nondirectional) alternative, we would reject the null
hypothesis for "sufficiently small" or "sufficiently large" values of the standardized test
statistic
$$z \approx \frac{\bar{x} - 72}{s/\sqrt{n}}$$
Now, from Figure 8.3a, we note that the chance of observing a sample mean x̄ more than 1.96
standard deviations below 72 or more than 1.96 standard deviations above 72, when in fact H0 is
true, is only α = .05. Thus, the rejection region consists of two sets of values: We will reject H0 if
z is either less than −1.96 or greater than 1.96 (see Figure 8.3b). For this rejection rule, the
probability of a Type I error is .05.
The three previous examples all exhibit certain common characteristics regarding the rejection
region, as indicated in the guidelines that follow.
Figure 8.3 Location of rejection region of Example 8.11
Guidelines for Step 3 of Hypothesis Testing
1. The value of α, the probability of a Type I error, is specified in advance by the researcher. It
can be made as small or as large as desired; typical values are α = .01, .02, .05, and .10.
For a fixed sample size, the size of the rejection region decreases as the value of α
decreases (see Figure 8.4). That is, for smaller values of α, more extreme departures of the
test statistic from the null hypothesized parameter value are required to permit rejection of
H0.
2. For testing means or proportions, the test statistic (i.e., the point estimate of the target
parameter) is standardized to provide a measure of how great its departure is from the null
hypothesized value of the parameter. The standardization is based on the sampling
distribution of the point estimate, assuming H0 is true. (It is through standardization that the
rejection rule takes into account the sample sizes.)
Standardized test statistic = (Point estimate − Hypothesized value) / (Standard deviation of point estimate)
3. The location of the rejection region depends on whether the test is one-tailed or two-tailed,
and on the prespecified significance level, α.
a. For a one-tailed test in which the symbol ">" occurs in Ha, the rejection region consists of
values in the upper tail of the sampling distribution of the standardized test statistic. The
critical value is selected so that the area to its right is equal to α.
b. For a one-tailed test in which the symbol "<" appears in Ha, the rejection region consists
of values in the lower tail of the sampling distribution of the standardized test statistic. The
critical value is selected so that the area to its left is equal to α.
c. For a two-tailed test, in which the symbol "≠" occurs in Ha, the rejection region consists of
two sets of values. The critical values are selected so that the area in each tail of the
sampling distribution of the standardized test statistic is equal to α/2.
Figure 8.4 Size of the uppertail rejection region for different values of α
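The guidelines for Step 3 can be illustrated with a short computation. The following Python sketch is our own illustration and not part of the original course material. It assumes the standardized test statistic is approximately standard normal (the large-sample case) and uses the standard library class statistics.NormalDist; the function names rejection_region and in_rejection_region are hypothetical.

```python
from statistics import NormalDist


def rejection_region(alpha, alternative):
    """Return (lower, upper) critical z-values for significance level alpha.

    A value of None means that tail is not part of the rejection region.
    alternative is "greater" (Ha uses >), "less" (Ha uses <) or "two-sided" (Ha uses !=).
    """
    std_normal = NormalDist()  # mean 0, standard deviation 1
    if alternative == "greater":
        # area alpha to the right of the critical value
        return (None, std_normal.inv_cdf(1 - alpha))
    if alternative == "less":
        # area alpha to the left of the critical value
        return (std_normal.inv_cdf(alpha), None)
    if alternative == "two-sided":
        # area alpha/2 in each tail
        c = std_normal.inv_cdf(1 - alpha / 2)
        return (-c, c)
    raise ValueError("alternative must be 'greater', 'less' or 'two-sided'")


def in_rejection_region(z, region):
    """Return True if the standardized statistic z falls in the rejection region."""
    lower, upper = region
    return (lower is not None and z < lower) or (upper is not None and z > upper)


# Rejection regions of Examples 8.9, 8.10 and 8.11
for alpha, alternative in [(0.05, "greater"), (0.01, "less"), (0.05, "two-sided")]:
    print(alpha, alternative, rejection_region(alpha, alternative))
```

The computed critical values agree with the tabulated 1.645, −2.33 and ±1.96 after rounding.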
Steps 4 and 5 of the hypothesis-testing approach require the computation of a test statistic from
the sample information. Then we determine whether the standardized value of the test statistic
lies within the rejection region in order to make a decision about whether to reject the null
hypothesis.
Example 8.12 Refer to Example 8.9. Suppose the following statistics were calculated
based on a random sample of n = 30 measurements: x̄ = 73, s = 13. Perform a test of
H0: μ = 72
Ha: μ > 72
at a significance level of α = .05.
Solution In Example 8.9, we determined the following rejection rule for the given value of α
and the alternative hypothesis of interest:
Reject H0 if z > 1.645.
The standardized test statistic, computed assuming H0 is true, is given by
$$z = \frac{\bar{x} - \mu_{\bar{x}}}{\sigma_{\bar{x}}} = \frac{\bar{x} - 72}{\sigma/\sqrt{n}} \approx \frac{\bar{x} - 72}{s/\sqrt{n}} = \frac{73 - 72}{13/\sqrt{30}} = .42$$
Figure 8.5 Location of rejection region of Example 8.12
Since this value does not lie within the rejection region (shown in Figure 8.5), we fail to reject H0
and conclude there is insufficient evidence to support the alternative hypothesis, Ha: μ > 72.
(Note that we do not conclude that H0 is true; rather, we state that we have insufficient evidence
to reject H0.)
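The arithmetic of Example 8.12 can be checked with a few lines of Python. This is a minimal sketch of our own, assuming the large-sample normal approximation; the helper name z_statistic is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def z_statistic(sample_mean, mu0, s, n):
    """Standardize the sample mean under H0: mu = mu0, using s in place of sigma."""
    return (sample_mean - mu0) / (s / sqrt(n))


z = z_statistic(sample_mean=73, mu0=72, s=13, n=30)
critical = NormalDist().inv_cdf(1 - 0.05)  # upper-tail critical value for alpha = .05
reject = z > critical

print(round(z, 2), round(critical, 3), reject)  # prints: 0.42 1.645 False
```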
8.5 Summary
In this chapter, we have introduced the logic and general concepts involved in the statistical
procedure of hypothesis testing. The techniques will be illustrated more fully with practical
applications in Chapter 9.
8.6 Exercises
8.1. A medical researcher would like to determine whether the proportion of males admitted to
a hospital because of heart disease differs from the corresponding proportion of females.
Formulate the appropriate null and alternative hypotheses and state whether the test is
one-tailed or two-tailed.
8.2. Why do we avoid stating a decision in terms of "accept the null hypothesis H0"?
8.3. Suppose it is desired to test
H0: μ = 65
Ha: μ ≠ 65
at significance level α = .02. Specify the form of the rejection region. (Hint: assume that
the sample size will be sufficient to guarantee the approximate normality of the sampling
distribution of x̄.)
8.4. Indicate the form of the rejection region for a test of
H0: (p1 − p2) = 0
Ha: (p1 − p2) > 0
Assume that the sample size will be appropriate to apply the normal approximation to the
sampling distribution of (p̂1 − p̂2), and that the maximum tolerable probability of
committing a Type I error is .05.
8.5. For each of the following rejection regions, determine the value of α, the probability of a
Type I error:
a) z < −1.96 b) z > 1.645 c) z < −2.58 or z > 2.58
Chapter 9 Applications of Hypothesis Testing
9.1 Introduction
In this chapter, we will present applications of the hypothesis-testing logic developed in Chapter
8. Among the population parameters to be considered are (μ1 − μ2), p, and (p1 − p2).
The concepts of a hypothesis test are the same for all these parameters; the null and alternative
hypotheses, test statistic, and rejection region all have the same general form (see Chapter 8).
However, the manner in which the test statistic is actually computed depends on the parameter of
interest. For example, in Chapter 7 we saw that the large-sample test statistic for testing a
hypothesis about a population mean μ is given by
$$z = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$$
(see also Example 8.9), while the test statistic for testing a hypothesis about the parameter p is
$$z = \frac{\hat{p} - p_0}{\sqrt{p_0 q_0 / n}}$$
The key to correctly diagnosing a hypothesis test is to determine first the parameter of interest.
In this section, we will present several examples illustrating how to determine the parameter of
interest. The following are the key words to look for when conducting a hypothesis test about a
population parameter.
Determining the parameter of interest
PARAMETER      DESCRIPTION
μ              Mean; average
(μ1 − μ2)      Difference in means or averages; mean difference;
               comparison of means or averages
p              Proportion; percentage; fraction; rate
(p1 − p2)      Difference in proportions, percentages, fractions, or rates;
               comparison of proportions, percentages, fractions, or rates
σ²             Variance; variation; precision
σ1²/σ2²        Ratio of variances; difference in variation; comparison of
               variances
In the following sections we will present a summary of the hypothesis-testing procedures for
each of the parameters listed in the previous box.
9.2 Hypothesis test about a population mean
Suppose that last year all students at a certain university reported the number of hours
spent on their studies during a certain week; the average was 40 hours. This year we want to
determine whether the mean time spent on studies of all students at the university is in excess of
40 hours per week. That is, we will test
H0: μ = 40
Ha: μ > 40
where
μ = Mean time spent on studies of all students at the university.
We are conducting this study in an attempt to gather support for Ha; we hope that the sample
data will lead to the rejection of H0. Now, the point estimate of the population mean μ is the
sample mean x̄. Will the value of x̄ that we obtain from our sample be large enough for us to
conclude that μ is greater than 40? In order to answer this question, we need to perform each
step of the hypothesis-testing procedure developed in Chapter 8.
Tests of population means using large samples
The following box contains the elements of a large-sample hypothesis test about a population
mean, μ. Note that for this case, the only assumption required for the validity of the procedure is
that the sample size is in fact large (n ≥ 30).
Large-sample test of hypothesis about a population mean
ONE-TAILED TEST
H0: μ = μ0
Ha: μ > μ0 (or Ha: μ < μ0)
Rejection region: z > z_α (or z < −z_α)
TWO-TAILED TEST
H0: μ = μ0
Ha: μ ≠ μ0
Rejection region: z < −z_{α/2} or z > z_{α/2}
Test statistic (both cases):
$$z = \frac{\bar{x} - \mu_0}{\sigma_{\bar{x}}} \approx \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$$
where z_α is the z-value such that P(z > z_α) = α; and z_{α/2} is the z-value such that
P(z > z_{α/2}) = α/2. [Note: μ0 is our symbol for the particular numerical value specified for μ
in the null hypothesis.]
Assumption: The sample size must be sufficiently large (say, n ≥ 30) so that the
sampling distribution of x̄ is approximately normal and that s provides a good
approximation to σ.
Example 9.1 The mean time spent on studies of all students at a university last year was
40 hours per week. This year, a random sample of 35 students at the university was
drawn. The following summary statistics were computed:
x̄ = 42.1 hours; s = 13.85 hours
Test the hypothesis that μ, the population mean time spent on studies per week, is equal to 40
hours against the alternative that μ is larger than 40 hours. Use a significance level of α = .05.
Solution We have previously formulated the hypotheses as
H0: μ = 40
Ha: μ > 40
Note that the sample size n = 35 is sufficiently large so that the sampling distribution of x̄ is
approximately normal and that s provides a good approximation to σ. Since the required
assumption is satisfied, we may proceed with a large-sample test of hypothesis about μ.
Using a significance level of α = .05, we will reject the null hypothesis for this one-tailed test if
z > z_α = z_.05
i.e., if z > 1.645. This rejection region is shown in Figure 9.1.
Figure 9.1 Rejection region for Example 9.1
Computing the value of the test statistic, we obtain
$$z = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{42.1 - 40}{13.85/\sqrt{35}} = .897$$
Since this value does not fall within the rejection region (see Figure 9.1), we do not reject H0.
We say that there is insufficient evidence (at α = .05) to conclude that the mean time spent on
studies per week of all students at the university this year is greater than 40 hours. We would
need to take a larger sample before we could detect whether μ > 40, if in fact this were the
case.
Example 9.2 A sugar refiner packs sugar into bags weighing, on average, 1 kilogram. Now
the setting of the machine tends to drift, i.e., the average weight of bags filled by the machine
sometimes increases and sometimes decreases. It is important to control the average weight
of bags of sugar. The refiner wishes to detect shifts in the mean weight of bags as quickly
as possible, and reset the machine. In order to detect shifts in the mean weight, he will
periodically select 50 bags, weigh them, and calculate the sample mean and standard
deviation. The data of one such periodic sample are as follows:
x̄ = 1.03 kg; s = .05 kg
Test whether the population mean μ is different from 1 kg at significance level
α = .01.
Solution We formulate the following hypotheses:
H0: μ = 1
Ha: μ ≠ 1
Since the sample size (50) exceeds 30, we may proceed with the large-sample test about μ.
Because shifts in μ in either direction are important, the test is two-tailed.
At significance level α = .01, we will reject the null hypothesis for this two-tailed test if
z < −z_{α/2} = −z_.005 or z > z_{α/2} = z_.005
i.e., if z < −2.576 or z > 2.576.
The value of the test statistic is computed as follows:
$$z \approx \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{1.03 - 1}{.05/\sqrt{50}} = 4.243$$
Since this value is greater than the upper-tail critical value (2.576), we reject the null hypothesis
and accept the alternative hypothesis at the significance level of 1%. We would conclude that the
overall mean weight was no longer 1 kg, and would run a less than 1% chance of committing a
Type I error.
Example 9.3 Prior to the institution of a new safety program, the average number of
on-the-job accidents per day at a factory was 4.5. To determine if the safety program has
been effective in reducing the average number of accidents per day, a random sample of
30 days is taken after the institution of the new safety program and the number of
accidents per day is recorded. The sample mean and standard deviation were computed
as follows:
x̄ = 3.7; s = 1.3
a. Is there sufficient evidence to conclude (at significance level .01) that the average number of
on-the-job accidents per day at the factory has decreased since the institution of the safety
program?
b. What is the practical interpretation of the test statistic computed in part a?
Solution
a. In order to determine whether the safety program was effective, we will conduct a
large-sample test of
H0: μ = 4.5 (i.e., no change in average number of on-the-job accidents per day)
Ha: μ < 4.5 (i.e., average number of on-the-job accidents per day has decreased)
where μ represents the average number of on-the-job accidents per day at the factory after
institution of the new safety program. For a significance level of
α = .01, we will reject the null hypothesis if
z < −z_.01 = −2.33 (see Figure 9.2)
The computed value of the test statistic is
$$z = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{3.7 - 4.5}{1.3/\sqrt{30}} = -3.37$$
Since this value does fall within the rejection region, there is sufficient evidence (at α = .01)
to conclude that the average number of on-the-job accidents per day at the factory has
decreased since the institution of the safety program. It appears that the safety program was
effective in reducing the average number of accidents per day.
b. If the null hypothesis is true, μ = 4.5. Recall that for large samples, the sampling distribution
of x̄ is approximately normal, with mean μ_x̄ = μ and standard deviation σ_x̄ = σ/√n. Then
the z-score for x̄, under the assumption that H0 is true, is given by
$$z = \frac{\bar{x} - 4.5}{\sigma/\sqrt{n}}$$
Figure 9.2 Location of rejection region of Example 9.3
You can see that the test statistic computed in part a is simply the z-score for the sample mean
x̄, if in fact μ = 4.5. A calculated z-score of −3.37 indicates that the value of x̄ computed from
the sample falls a distance of 3.37 standard deviations below the hypothesized mean of μ = 4.5.
Of course, we would not expect to observe a z-score this extreme if in fact μ = 4.5.
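The three large-sample examples of this section (upper-tailed, two-tailed and lower-tailed) follow one pattern, which can be sketched in Python. The code below is an illustrative sketch of our own; it assumes n ≥ 30 so that the normal approximation applies, and the function name large_sample_mean_test is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def large_sample_mean_test(xbar, s, n, mu0, alpha, alternative):
    """Return (z, reject) for a large-sample test of H0: mu = mu0.

    alternative is "greater", "less" or "two-sided"; s approximates sigma.
    """
    z = (xbar - mu0) / (s / sqrt(n))
    std_normal = NormalDist()
    if alternative == "greater":
        reject = z > std_normal.inv_cdf(1 - alpha)
    elif alternative == "less":
        reject = z < std_normal.inv_cdf(alpha)
    elif alternative == "two-sided":
        reject = abs(z) > std_normal.inv_cdf(1 - alpha / 2)
    else:
        raise ValueError("alternative must be 'greater', 'less' or 'two-sided'")
    return z, reject


# Summary statistics taken from Examples 9.1, 9.2 and 9.3
examples = {
    "9.1": (42.1, 13.85, 35, 40, 0.05, "greater"),
    "9.2": (1.03, 0.05, 50, 1, 0.01, "two-sided"),
    "9.3": (3.7, 1.3, 30, 4.5, 0.01, "less"),
}
for name, args in examples.items():
    z, reject = large_sample_mean_test(*args)
    print(f"Example {name}: z = {z:.3f}, reject H0: {reject}")
```

Note that the statistic for Example 9.3 comes out negative, which is what places it in the lower-tail rejection region.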
Tests of population means using small samples
When the assumption required for a large-sample test of hypothesis about μ is violated, we need
a hypothesis-testing procedure that is appropriate for use with small samples. If we were to use
the methods of the large-sample test, we would run into trouble on two counts. Firstly, a small
sample tends to underestimate the population variance, so our test statistic will be unreliable.
Secondly, the standardized means of small samples are not normally distributed, so our critical
values will be wrong. We have learnt that the standardized means of small samples have a
t-distribution, and the appropriate t-distribution will depend on the number of degrees of
freedom in estimating the population variance. If we use large samples to test a hypothesis, then
the critical values we use will depend upon the type of test (one- or two-tailed). But if we use
small samples, then the critical values will depend upon the degrees of freedom as well as the
type of test.
A hypothesis test about a population mean, μ, based on a small sample (n < 30) consists of the
elements listed in the accompanying box.
Small-sample test of hypothesis about a population mean
ONE-TAILED TEST
H0: μ = μ0
Ha: μ > μ0 (or Ha: μ < μ0)
Rejection region: t > t_α (or t < −t_α)
TWO-TAILED TEST
H0: μ = μ0
Ha: μ ≠ μ0
Rejection region: t < −t_{α/2} or t > t_{α/2}
Test statistic (both cases):
$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$$
where the distribution of t is based on (n − 1) degrees of freedom; t_α is the
t-value such that P(t > t_α) = α; and t_{α/2} is the t-value such that P(t > t_{α/2}) = α/2.
Assumption: The relative frequency distribution of the population from which the
sample was selected is approximately normal.
As we noticed in the development of estimation procedures, when we are making inferences
based on small samples, more restrictive assumptions are required than when making inferences
from large samples. In particular, this hypothesis test requires the assumption that the population
from which the sample is selected is approximately normal.
Notice that the test statistic given in the box is a t statistic and is calculated exactly as our
approximation to the large-sample test statistic, z, given earlier in this section. Therefore, just
like z, the computed value of t indicates the direction and approximate distance (in units of
standard deviations) that the sample mean, x̄, is from the hypothesized population mean, μ0.
Example 9.4 The expected lifetime of electric light bulbs produced by a given process
was 1500 hours. To test a new batch, a sample of 10 was taken, which showed a mean
lifetime of 1410 hours. The standard deviation is 90 hours. Test the hypothesis that the
mean lifetime of the electric light bulbs has not changed, using a level of significance of
α = .05.
Solution This question asks us to test that the mean has not changed, so we must employ a
two-tailed test:
H0: μ = 1500
Ha: μ ≠ 1500
Since we are restricted to a small sample, we must make the assumption that the lifetimes of the
electric light bulbs have a relative frequency distribution that is approximately normal. Under
this assumption, the test statistic will have a t-distribution with (n − 1) = (10 − 1) = 9 degrees
of freedom. The rejection rule is then to reject the null hypothesis for values of t such that
t < −t_{α/2} or t > t_{α/2} with α/2 = .05/2 = .025.
From Table 7.6 in Chapter 7 (or Table 2 of Appendix C) with 9 degrees of freedom, we find that
t_.025 = 2.262.
The value of the test statistic is
$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{1410 - 1500}{90/\sqrt{10}} = -3.162$$
The computed value of the test statistic, t = −3.162, falls below the critical value of
−2.262. We reject H0 and accept Ha at a significance level of .05, and conclude that there is some
evidence to suggest that the mean lifetime of all light bulbs has changed.
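The small-sample computation can be sketched in the same way. The Python standard library has no t-distribution, so this sketch of ours takes the critical value from the t table quoted above (t_.025 = 2.262 with 9 degrees of freedom) as a given input rather than computing it. It evaluates the statistic directly from the summary values x̄ = 1410, s = 90 and n = 10, which gives a t of about −3.16. The helper name small_sample_t is hypothetical.

```python
from math import sqrt


def small_sample_t(xbar, s, n, mu0):
    """t statistic for H0: mu = mu0, with n - 1 degrees of freedom.

    Valid only under the assumption that the sampled population is
    approximately normal.
    """
    return (xbar - mu0) / (s / sqrt(n))


t = small_sample_t(xbar=1410, s=90, n=10, mu0=1500)
t_crit = 2.262  # t_.025 with 9 degrees of freedom, read from the t table

# Two-tailed rejection rule: reject H0 if t < -t_crit or t > t_crit
reject = abs(t) > t_crit
print(f"t = {t:.3f}, reject H0: {reject}")
```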
9.3 Hypothesis tests of population proportions
Tests involving sample proportions are extremely important in practice. Many market
researchers express their results in terms of proportions, e.g. "40% of the population clean their
teeth with brand A toothpaste". It will be useful to design tests that will detect changes in
proportions. For example, we may want to test the null hypothesis that the true proportion of
people who use brand A is equal to .40 (i.e., H0: p = .40) against the alternative Ha: p > .40.
The procedure described in the next box is used to test a hypothesis about a population
proportion, p, based on a large sample from the target population. (Recall that p represents the
probability of success in a Bernoulli process.)
For the procedure to be valid, the sample size must be sufficiently large to guarantee
approximate normality of the sampling distribution of the sample proportion, p̂. A general rule of
thumb for determining whether n is "sufficiently large" is that the interval p̂ ± 2√(p̂q̂/n) does not
include 0 or 1.
Large-sample test of hypothesis about a population proportion
ONE-TAILED TEST
H0: p = p0
Ha: p > p0 (or Ha: p < p0)
Rejection region: z > z_α (or z < −z_α)
TWO-TAILED TEST
H0: p = p0
Ha: p ≠ p0
Rejection region: z < −z_{α/2} or z > z_{α/2}
Test statistic (both cases):
$$z = \frac{\hat{p} - p_0}{\sqrt{p_0 q_0 / n}}$$
where q0 = 1 − p0.
Assumption: The interval p̂ ± 2√(p̂q̂/n) does not contain 0 or 1.
Example 9.5 Suppose it is claimed that in a very large batch of components, about 10%
of items contain some form of defect. It is proposed to check whether this proportion has
increased, and this will be done by drawing randomly a sample of 150 components. In
the sample, 20 are defective. Does this evidence indicate that the true proportion of
defective components is significantly larger than 10%? Test at significance level α = .05.
Solution We wish to perform a large-sample test about a population proportion, p:
H0: p = .10 (i.e., no change in proportion of defectives)
Ha: p > .10 (i.e., proportion of defectives has increased)
where p represents the true proportion of defects.
At significance level α = .05, the rejection region for this one-tailed test consists of all values of z
for which
z > z_.05 = 1.645
The test statistic requires the calculation of the sample proportion, p̂, of defects:
$$\hat{p} = \frac{\text{Number of defective components in the sample}}{\text{Number of sampled components}} = \frac{20}{150} = .133$$
Noting that q0 = 1 − p0 = 1 − .10 = .90, we obtain the following value of the test statistic:
$$z = \frac{\hat{p} - p_0}{\sqrt{p_0 q_0 / n}} = \frac{.133 - .10}{\sqrt{(.10)(.90)/150}} = 1.361$$
This value of z lies outside the rejection region, so we would conclude that the proportion
defective in the sample is not significantly larger than 10%. We have no evidence to reject the
null hypothesis that the proportion defective is .10 at the 5% level of significance. In failing to
reject H0 we risk a Type II error (accepting H0 when, in fact, it is not true); its probability β
depends on the true value of p and is not equal to the significance level α = .05.
[Note that the interval
$$\hat{p} \pm 2\sqrt{\hat{p}\hat{q}/n} = .133 \pm 2\sqrt{(.133)(1 - .133)/150} = .133 \pm .056$$
does not contain 0 or 1. Thus, the sample size is large enough to guarantee the
validity of the hypothesis test.]
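The steps of Example 9.5, including the rule-of-thumb check on the sample size, can be collected in one short function. This is a sketch of our own for the upper-tailed case only; the function name proportion_z_test and its return layout are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def proportion_z_test(successes, n, p0, alpha):
    """Upper-tailed large-sample test of H0: p = p0 against Ha: p > p0.

    Returns (p_hat, z, critical, large_enough).
    """
    p_hat = successes / n
    q0 = 1 - p0
    # The standard error uses the hypothesized p0, as in the box
    z = (p_hat - p0) / sqrt(p0 * q0 / n)
    critical = NormalDist().inv_cdf(1 - alpha)
    # Rule of thumb: p_hat +/- 2*sqrt(p_hat*q_hat/n) must not contain 0 or 1
    half_width = 2 * sqrt(p_hat * (1 - p_hat) / n)
    large_enough = (p_hat - half_width > 0) and (p_hat + half_width < 1)
    return p_hat, z, critical, large_enough


p_hat, z, critical, large_enough = proportion_z_test(20, 150, 0.10, 0.05)
print(f"p_hat = {p_hat:.3f}, z = {z:.3f}, critical = {critical:.3f}")
print("reject H0:", z > critical, "| sample large enough:", large_enough)
```

The value z = 1.361 is obtained with the unrounded p̂ = 20/150; rounding p̂ to .133 before standardizing would give a slightly smaller value.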
Although small-sample procedures are available for testing hypotheses about a population
proportion, the details are omitted from our discussion. It is our experience that they are of
limited utility since most surveys of binomial populations performed in practice use samples
that are large enough to employ the techniques of this section.
9.4 Hypothesis tests about the difference between two population means
There are two brands of coffee, A and B. Suppose a consumer group wishes to determine
whether the mean price per pound of brand A exceeds the mean price per pound of brand B.
That is, the consumer group will test the null hypothesis H0: (μ1 − μ2) = 0 against the
alternative Ha: (μ1 − μ2) > 0. The large-sample procedure described in the box is applicable
for testing a hypothesis about (μ1 − μ2), the difference between two population means.
Large-sample test of hypothesis about (μ1 − μ2)
ONE-TAILED TEST
H0: (μ1 − μ2) = D0
Ha: (μ1 − μ2) > D0 (or Ha: (μ1 − μ2) < D0)
Rejection region: z > z_α (or z < −z_α)
TWO-TAILED TEST
H0: (μ1 − μ2) = D0
Ha: (μ1 − μ2) ≠ D0
Rejection region: z < −z_{α/2} or z > z_{α/2}
Test statistic (both cases):
$$z = \frac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sigma_{(\bar{x}_1 - \bar{x}_2)}} \approx \frac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$
[Note: In many practical applications, we wish to hypothesize that there is no difference
between the population means; in such cases, D0 = 0.]
Assumptions:
1. The sample sizes n1 and n2 are sufficiently large (n1 ≥ 30 and n2 ≥ 30).
2. The samples are selected randomly and independently from the target
populations.
Example 9.6 A consumer group selected independent random samples of supermarkets
located throughout a country for the purpose of comparing the retail prices per
pound of coffee of brands A and B. The results of the investigation are summarized in
Table 9.1. Does this evidence indicate that the mean retail price per pound of brand A
coffee is significantly higher than the mean retail price per pound of brand B coffee? Use
a significance level of α = .01.
Table 9.1 Coffee prices for Example 9.6
Brand A          Brand B
n1 = 75          n2 = 64
x̄1 = $3.00       x̄2 = $2.95
s1 = $.11        s2 = $.09
Solution The consumer group wants to test the hypotheses
H0: (μ1 − μ2) = 0 (i.e., no difference between mean retail prices)
Ha: (μ1 − μ2) > 0 (i.e., mean retail price per pound of brand A is higher than that of brand B)
where
μ1 = Mean retail price per pound of brand A coffee at all supermarkets
μ2 = Mean retail price per pound of brand B coffee at all supermarkets
This one-tailed, large-sample test is based on a z statistic. Thus, we will reject H0 if
z > z_α = z_.01. Since z_.01 = 2.33, the rejection region is given by z > 2.33 (see Figure 9.3).
We compute the test statistic as follows:
$$z = \frac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} = \frac{(3.00 - 2.95) - 0}{\sqrt{\dfrac{(.11)^2}{75} + \dfrac{(.09)^2}{64}}} = 2.947$$
Figure 9.3 Rejection region for Example 9.6
Since this computed value of z = 2.947 lies in the rejection region, there is sufficient evidence (at
α = .01) to conclude that the mean retail price per pound of brand A coffee is significantly higher
than the mean retail price per pound of brand B coffee. The probability of our having committed
a Type I error is α = .01.
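The two-sample statistic of Example 9.6 is easy to verify numerically. The sketch below is our own illustration, assuming both samples are large (n1 ≥ 30 and n2 ≥ 30) and independent; the function name two_sample_z is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_z(x1, s1, n1, x2, s2, n2, d0=0.0):
    """Large-sample z statistic for H0: (mu1 - mu2) = d0."""
    standard_error = sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
    return ((x1 - x2) - d0) / standard_error


# Table 9.1: brand A is sample 1, brand B is sample 2
z = two_sample_z(x1=3.00, s1=0.11, n1=75, x2=2.95, s2=0.09, n2=64)
critical = NormalDist().inv_cdf(1 - 0.01)  # upper-tail critical value for alpha = .01

print(f"z = {z:.3f}, critical = {critical:.3f}, reject H0: {z > critical}")
```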
When the sample sizes n1 and n2 are inadequate to permit use of the large-sample procedure of
Example 9.6, we must make some modifications to perform a small-sample test of hypothesis
about the difference between two population means. The test procedure is based on
assumptions that are more restrictive than in the large-sample case. The elements of the
hypothesis test and the required assumptions are listed in the next box.
Small-sample test of hypothesis about (μ1 − μ2)
ONE-TAILED TEST
H0: (μ1 − μ2) = D0
Ha: (μ1 − μ2) > D0 (or Ha: (μ1 − μ2) < D0)
Rejection region: t > t_α (or t < −t_α)
TWO-TAILED TEST
H0: (μ1 − μ2) = D0
Ha: (μ1 − μ2) ≠ D0
Rejection region: t < −t_{α/2} or t > t_{α/2}
Test statistic (both cases):
$$t = \frac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sqrt{s_p^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$
where
$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$$
and the distribution of t is based on (n1 + n2 − 2) degrees of freedom.
Assumptions:
1. The populations from which the samples are selected both have approximately
normal relative frequency distributions.
2. The variances of the two populations are equal.
3. The random samples are selected in an independent manner from the two
populations.
Example 9.7 A study was conducted on the weights at birth of the children of urban and
rural women. The researcher suspects there is a significant difference between the mean
weights at birth of children of urban and rural women. To test this hypothesis, he selects
independent random samples of weights at birth of children of mothers from each group,
calculates the mean weights and standard deviations and summarizes them in Table 9.2. Test
the researcher's belief, using a significance level of α = .02.
Table 9.2 Weight at birth data for Example 9.7
Urban mothers        Rural mothers
n1 = 15              n2 = 14
x̄1 = 3.5933 kg       x̄2 = 3.2029 kg
s1 = .3707 kg        s2 = .4927 kg
Solution The researcher wants to test the following hypotheses:
H0: (μ1 − μ2) = 0 (i.e., no difference between mean weights at birth)
Ha: (μ1 − μ2) ≠ 0 (i.e., mean weights at birth of children of urban and rural
women are different)
where μ1 and μ2 are the true mean weights at birth of children of urban and rural women,
respectively.
Since the sample sizes for the study are small (n1 = 15, n2 = 14), the following assumptions are
required:
1. The populations of weights at birth of children both have approximately normal distributions.
2. The variances of the populations of weights at birth of children for the two groups of mothers
are equal.
3. The samples were independently and randomly selected.
If these three assumptions are valid, the test statistic will have a t-distribution with
(n1 + n2 − 2) = (15 + 14 − 2) = 27 degrees of freedom. With a significance level of
α = .02, the rejection region is given by
t < −t_.01 = −2.473 or t > t_.01 = 2.473 (see Figure 9.4)
Figure 9.4 Rejection region of Example 9.7
Since we have assumed that the two populations have equal variances (i.e., that
σ1² = σ2² = σ²), we need to compute an estimate of this common variance. Our pooled estimate
is given by
$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} = \frac{(15 - 1)(.3707)^2 + (14 - 1)(.4927)^2}{15 + 14 - 2} = .1881$$
Using this pooled sample variance in the computation of the test statistic, we obtain
$$t = \frac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sqrt{s_p^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} = \frac{(3.5933 - 3.2029) - 0}{\sqrt{.1881\left(\dfrac{1}{15} + \dfrac{1}{14}\right)}} = 2.422$$
Now the computed value of t does not fall within the rejection region; thus, we fail to reject the
null hypothesis (at α = .02) and conclude that there is insufficient evidence of a difference
between the mean weights at birth of children of urban and rural women.
In this example, we can see that the computed value of t is very close to the upper boundary of
the rejection region. This region is specified by the significance level and the degrees of freedom.
How is the conclusion about the difference between the mean weights at birth affected if the
significance level is α = .05? We will answer the question in the next example.
Example 9.8 Refer to Example 9.7. Test the investigator's belief, using a significance level
of α = .05.
Solution With a significance level of α = .05, the rejection region is given by
t < −t_.025 = −2.052 or t > t_.025 = 2.052 (see Figure 9.5)
Since the sample data are unchanged, the test statistic is the same as in Example 9.7,
t = 2.422.
Now the value of t falls in the rejection region, and we have sufficient evidence at a significance
level of α = .05 to conclude that the mean weight at birth of children of urban women differs
significantly from (in this sample, is higher than) the mean weight at birth of children of rural
women. But you should notice that the probability of committing a Type I error is now α = .05.
Figure 9.5 Rejection region of Example 9.8
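Examples 9.7 and 9.8 differ only in the critical value, which the following sketch makes explicit. The code is our own illustration under the three assumptions listed in the box (normal populations, equal variances, independent samples). The critical values 2.473 and 2.052 are copied from the t table for 27 degrees of freedom rather than computed, and the function name pooled_t is hypothetical.

```python
from math import sqrt


def pooled_t(x1, s1, n1, x2, s2, n2, d0=0.0):
    """Return (pooled variance, t statistic, degrees of freedom) for H0: (mu1 - mu2) = d0."""
    df = n1 + n2 - 2
    sp2 = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / df
    t = ((x1 - x2) - d0) / sqrt(sp2 * (1 / n1 + 1 / n2))
    return sp2, t, df


# Table 9.2: urban mothers are sample 1, rural mothers are sample 2
sp2, t, df = pooled_t(3.5933, 0.3707, 15, 3.2029, 0.4927, 14)
print(f"pooled variance = {sp2:.4f}, t = {t:.3f}, df = {df}")

# Two-tailed decisions at the two significance levels (table values for 27 df)
decisions = {}
for alpha, t_crit in [(0.02, 2.473), (0.05, 2.052)]:
    decisions[alpha] = abs(t) > t_crit
    print(f"alpha = {alpha}: reject H0: {decisions[alpha]}")
```

The same statistic leads to opposite decisions at the two levels, which is why α should be fixed before the data are examined.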
9.5 Hypothesis tests about the difference between two proportions
Suppose we are interested in comparing p1, the proportion of one population, with p2, the
proportion of another population. Then the target parameter about which we will test a hypothesis
is (p1 − p2). Recall that p1 and p2 also represent the probabilities of success for two binomial
experiments. The method for performing a large-sample test of hypothesis about (p1 − p2), the
difference between two binomial proportions, is outlined in the following box.
Large-sample test of hypothesis about $(p_1 - p_2)$
ONE-TAILED TEST
$H_0$: $(p_1 - p_2) = D_0$
$H_a$: $(p_1 - p_2) > D_0$ [or $H_a$: $(p_1 - p_2) < D_0$]
Rejection region: $z > z_\alpha$ [or $z < -z_\alpha$]
TWO-TAILED TEST
$H_0$: $(p_1 - p_2) = D_0$
$H_a$: $(p_1 - p_2) \neq D_0$
Rejection region: $z < -z_{\alpha/2}$ or $z > z_{\alpha/2}$
Test statistic (both cases):
$$z = \frac{(\hat{p}_1 - \hat{p}_2) - D_0}{\sigma_{(\hat{p}_1 - \hat{p}_2)}}$$
where
$$\sigma_{(\hat{p}_1 - \hat{p}_2)} = \sqrt{\frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}}$$
When $D_0 \neq 0$, calculate $\sigma_{(\hat{p}_1 - \hat{p}_2)}$ using $\hat{p}_1$ and $\hat{p}_2$:
$$\sigma_{(\hat{p}_1 - \hat{p}_2)} \approx \sqrt{\frac{\hat{p}_1 \hat{q}_1}{n_1} + \frac{\hat{p}_2 \hat{q}_2}{n_2}}$$
where $\hat{q}_1 = 1 - \hat{p}_1$ and $\hat{q}_2 = 1 - \hat{p}_2$.
For the special case where $D_0 = 0$, calculate
$$\sigma_{(\hat{p}_1 - \hat{p}_2)} \approx \sqrt{\hat{p}\hat{q}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$
where $\hat{q} = 1 - \hat{p}$, the total number of successes in the combined samples is $(x_1 + x_2)$ and
$$\hat{p}_1 = \hat{p}_2 = \hat{p} = \frac{x_1 + x_2}{n_1 + n_2}$$
Assumption: The intervals $\hat{p}_1 \pm 2\sqrt{\hat{p}_1\hat{q}_1/n_1}$ and $\hat{p}_2 \pm 2\sqrt{\hat{p}_2\hat{q}_2/n_2}$ do not contain 0 or 1.
When testing the null hypothesis that $(p_1 - p_2)$ equals some specified difference $D_0$, we make a
distinction between the case $D_0 = 0$ and the case $D_0 \neq 0$. For the special case $D_0 = 0$, i.e., when
we are testing $H_0$: $(p_1 - p_2) = 0$ or, equivalently, $H_0$: $p_1 = p_2$, the best estimate of $p_1 = p_2 = p$ is
found by dividing the total number of successes in the combined samples by the total number of
observations in the two samples. That is, if $x_1$ is the number of successes in sample 1 and $x_2$ is
the number of successes in sample 2, then
$$\hat{p} = \frac{x_1 + x_2}{n_1 + n_2}$$
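In this case, the best estimate of the standard deviation of the sampling distribution of
$(\hat{p}_1 - \hat{p}_2)$ is found by substituting $\hat{p}$ for both $\hat{p}_1$ and $\hat{p}_2$:
$$\sigma_{(\hat{p}_1 - \hat{p}_2)} = \sqrt{\frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}} \approx \sqrt{\frac{\hat{p}\hat{q}}{n_1} + \frac{\hat{p}\hat{q}}{n_2}} = \sqrt{\hat{p}\hat{q}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$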
For all cases in which $D_0 \neq 0$ [for example, when testing $H_0$: $(p_1 - p_2) = .2$], we use
$\hat{p}_1$ and $\hat{p}_2$ in the formula for $\sigma_{(\hat{p}_1 - \hat{p}_2)}$.
However, in most practical situations, we will want to test for a difference between proportions;
that is, we will want to test $H_0$: $(p_1 - p_2) = 0$.
The sample sizes $n_1$ and $n_2$ must be sufficiently large to ensure that the sampling distributions of
$\hat{p}_1$ and $\hat{p}_2$, and hence of the difference $(\hat{p}_1 - \hat{p}_2)$, are approximately normal. The rule of thumb
given in the previous box may be used to determine if the sample sizes are "sufficiently large."
Example 9.9 Two types of needles, the old type and the new type, are used for injecting
medical patients with a certain substance. The patients were allocated at random to two
groups, one to receive the injection from needles of the old type, the other to receive the
injection from needles of the new type. Table 9.3 shows the number of patients showing
reactions to the injection. Does the information support the belief that the proportion of
patients giving reactions to needles of the old type is less than the corresponding
proportion of patients giving reactions to needles of the new type? Test at a significance
level of α = .01.
Table 9.3 Data on the patients' reactions in Example 9.9
                                   Injected by old type needles   Injected by new type needles
Number of sampled patients                     100                            100
Number in sample with reactions                 37                             56
Solution We wish to perform a test of
$H_0$: $(p_1 - p_2) = 0$
$H_a$: $(p_1 - p_2) < 0$
where
$p_1$ = Proportion of patients giving reactions to needles of the old type.
$p_2$ = Proportion of patients giving reactions to needles of the new type.
For this large-sample, one-tailed test, the null hypothesis will be rejected if
$z < -z_{.01} = -2.33$ (see Figure 9.6)
The sample proportions $\hat{p}_1$ and $\hat{p}_2$ are computed for substitution into the formula for the test
statistic:
$\hat{p}_1$ = Sample proportion of patients giving reactions with needles of the old type = 37/100 = .37
$\hat{p}_2$ = Sample proportion of patients giving reactions with needles of the new type = 56/100 = .56
Hence,
$\hat{q}_1 = 1 - \hat{p}_1 = 1 - .37 = .63$ and $\hat{q}_2 = 1 - \hat{p}_2 = 1 - .56 = .44$
Since $D_0 = 0$ for this test of hypothesis, the test statistic is given by
$$z = \frac{(\hat{p}_1 - \hat{p}_2) - D_0}{\sqrt{\hat{p}\hat{q}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$
where
$$\hat{p} = \frac{\text{Total number of patients giving reactions with needles of both types}}{\text{Total number of patients sampled}} = \frac{37 + 56}{100 + 100} = .465$$
Then we have
$$z = \frac{(.37 - .56) - 0}{\sqrt{(.465)(.535)\left(\frac{1}{100} + \frac{1}{100}\right)}} = -2.69$$
This value falls below the critical value of -2.33. Thus, at α = .01, we reject the null hypothesis;
there is sufficient evidence to conclude that the proportion of patients giving reactions to
needles of the old type is significantly less than the corresponding proportion of patients giving
reactions to needles of the new type, i.e., $p_1 < p_2$.
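The test statistic of Example 9.9 can be reproduced with a few lines of code. This is a minimal sketch for the special case $D_0 = 0$ only; the function name `two_proportion_z` is hypothetical, and the critical value is obtained from Python's standard-library normal distribution instead of the normal table.

```python
import math
from statistics import NormalDist

def two_proportion_z(x1, n1, x2, n2):
    """z statistic for H0: (p1 - p2) = 0, using the pooled estimate of p."""
    p1_hat = x1 / n1
    p2_hat = x2 / n2
    p_hat = (x1 + x2) / (n1 + n2)  # pooled proportion of successes
    q_hat = 1 - p_hat
    std_err = math.sqrt(p_hat * q_hat * (1 / n1 + 1 / n2))
    return (p1_hat - p2_hat) / std_err

# Data of Table 9.3: 37 of 100 (old type) and 56 of 100 (new type)
z = two_proportion_z(37, 100, 56, 100)

alpha = 0.01
z_crit = NormalDist().inv_cdf(alpha)  # lower-tail critical value, about -2.33
reject_h0 = z < z_crit                # one-tailed test, Ha: (p1 - p2) < 0

print(f"z = {z:.2f}, critical value = {z_crit:.2f}, reject H0: {reject_h0}")
```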
The inference derived from the test in Example 9.9 is valid only if the sample sizes, $n_1$ and $n_2$,
are sufficiently large to guarantee that the intervals
$$\hat{p}_1 \pm 2\sqrt{\frac{\hat{p}_1\hat{q}_1}{n_1}} \quad \text{and} \quad \hat{p}_2 \pm 2\sqrt{\frac{\hat{p}_2\hat{q}_2}{n_2}}$$
do not contain 0 or 1. This requirement is satisfied for Example 9.9:
$$\hat{p}_1 \pm 2\sqrt{\frac{\hat{p}_1\hat{q}_1}{n_1}} = .37 \pm 2\sqrt{\frac{(.37)(.63)}{100}} = .37 \pm .097, \text{ or } (.273, .467)$$
$$\hat{p}_2 \pm 2\sqrt{\frac{\hat{p}_2\hat{q}_2}{n_2}} = .56 \pm 2\sqrt{\frac{(.56)(.44)}{100}} = .56 \pm .099, \text{ or } (.461, .659)$$
Figure 9.6 Rejection region of Example 9.9
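The sample-size rule of thumb used above is easy to automate. The sketch below uses hypothetical helper names (`two_sd_interval`, `sample_large_enough`) and simply checks that $\hat{p} \pm 2\sqrt{\hat{p}\hat{q}/n}$ stays strictly between 0 and 1.

```python
import math

def two_sd_interval(p_hat, n):
    """Interval p_hat +/- 2*sqrt(p_hat*q_hat/n) used in the rule of thumb."""
    half_width = 2 * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

def sample_large_enough(p_hat, n):
    """True when the interval contains neither 0 nor 1."""
    lower, upper = two_sd_interval(p_hat, n)
    return lower > 0 and upper < 1

# Sample proportions of Example 9.9
interval_old = two_sd_interval(0.37, 100)
interval_new = two_sd_interval(0.56, 100)

print([round(v, 3) for v in interval_old])
print([round(v, 3) for v in interval_new])
print(sample_large_enough(0.37, 100) and sample_large_enough(0.56, 100))
```

With very small samples or proportions close to 0 or 1 the check fails, which signals that the normal approximation behind the z test should not be trusted.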
9.6 Hypothesis test about a population variance
Hypothesis tests about a population variance $\sigma^2$ are conducted using the chi-square ($\chi^2$)
distribution introduced in Section 7.9. The test is outlined in the box. Note that the assumption of
a normal population is required regardless of whether the sample size n is large or small.
Example 9.10 A quality control supervisor in a cannery knows that the exact amount
each can contains will vary, since there are certain uncontrollable factors that affect the
amount of fill. The mean fill per can is important, but equally important is the variation $\sigma^2$
of the amount of fill. If $\sigma^2$ is large, some cans will contain too little and others too much.
Suppose regulatory agencies specify that the standard deviation of the amount of fill
should be less than .1 ounce. The quality control supervisor sampled n = 10 cans and
calculated s = .04. Does this value of s provide sufficient evidence to indicate that the
standard deviation $\sigma$ of the fill measurements is less than .1 ounce?
Test of hypothesis about a population variance $\sigma^2$
ONE-TAILED TEST
$H_0$: $\sigma^2 = \sigma_0^2$
$H_a$: $\sigma^2 > \sigma_0^2$ [or $H_a$: $\sigma^2 < \sigma_0^2$]
Rejection region: $\chi^2 > \chi^2_\alpha$ [or $\chi^2 < \chi^2_{1-\alpha}$]
TWO-TAILED TEST
$H_0$: $\sigma^2 = \sigma_0^2$
$H_a$: $\sigma^2 \neq \sigma_0^2$
Rejection region: $\chi^2 < \chi^2_{1-\alpha/2}$ or $\chi^2 > \chi^2_{\alpha/2}$
Test statistic (both cases):
$$\chi^2 = \frac{(n - 1)s^2}{\sigma_0^2}$$
where $\chi^2_\alpha$ and $\chi^2_{1-\alpha}$ are values of $\chi^2$ that locate an area of α to the right and α to
the left, respectively, of a chi-square distribution based on (n - 1) degrees of
freedom.
[Note: $\sigma_0^2$ is our symbol for the particular numerical value specified for $\sigma^2$ in the null
hypothesis.]
Assumption: The population from which the random sample is selected has an
approximate normal distribution.
Solution Since the null and alternative hypotheses must be stated in terms of $\sigma^2$ (rather
than $\sigma$), we will want to test the null hypothesis that $\sigma^2 = .01$ against the alternative that
$\sigma^2 < .01$. Therefore, the elements of the test are
$H_0$: $\sigma^2 = .01$
$H_a$: $\sigma^2 < .01$
Assumption: The population of "amounts of fill" of the cans is approximately normal.
Test statistic: $\chi^2 = \dfrac{(n - 1)s^2}{\sigma_0^2}$
Rejection region: The smaller the value of $s^2$ we observe, the stronger the evidence in favor of
$H_a$. Thus, we reject $H_0$ for "small values" of the test statistic. With α = .05 and 9 df, the $\chi^2$ value
for rejection is found in Table 3, Appendix C and pictured in Figure 9.7. We will reject $H_0$ if
$\chi^2 < 3.32511$.
Remember that the area given in Table 3 of Appendix C is the area to the right of the numerical
value in the table. Thus, to determine the lower-tail value that has α = .05 to its left, we use the
$\chi^2_{.95}$ column in Table 3 of Appendix C.
Since
$$\chi^2 = \frac{(n - 1)s^2}{\sigma_0^2} = \frac{9(.04)^2}{.01} = 1.44$$
is less than 3.32511, the supervisor can conclude that the variance of the population of all
amounts of fill is less than .01 ($\sigma < 0.1$) with 95% confidence. As usual, the confidence is in the
procedure used, the $\chi^2$ test. If this procedure is repeatedly used, it will incorrectly reject $H_0$ only
5% of the time. Thus, the quality control supervisor is confident in the decision that the cannery
is operating within the desired limits of variability.
Figure 9.7 Rejection region of Example 9.10
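The arithmetic of Example 9.10 can be sketched in code as follows. The function name `chi_square_variance_statistic` is our own (hypothetical), and the critical value 3.32511 is taken from the text (Table 3 of Appendix C) rather than computed.

```python
def chi_square_variance_statistic(n, s, sigma0_sq):
    """Test statistic (n - 1) * s^2 / sigma0^2 for H0: sigma^2 = sigma0^2."""
    return (n - 1) * s ** 2 / sigma0_sq

# Example 9.10: n = 10 cans, s = .04, hypothesized variance .01
chi2 = chi_square_variance_statistic(10, 0.04, 0.01)

chi2_crit = 3.32511            # chi-square_.95 with 9 df, quoted in the text
reject_h0 = chi2 < chi2_crit   # lower-tailed test, Ha: sigma^2 < .01

print(f"chi-square = {chi2:.2f}, reject H0: {reject_h0}")
```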
9.7 Hypothesis test about the ratio of two population variances
In this section, we present a test of hypothesis for comparing two population variances,
$\sigma_1^2$ and $\sigma_2^2$. Variance tests have broad applications in business. For example, a production manager
may be interested in comparing the variation in the length of eye-screws produced on each of
two assembly lines. A line with a large variation produces too many individual eye-screws that
do not meet specifications (either too long or too short), even though the mean length may be
satisfactory. Similarly, an investor might want to compare the variation in the monthly rates of
return for two different stocks that have the same mean rate of return. In this case, the stock
with the smaller variance may be preferred because it is less risky; that is, it is less likely to
have many very low and very high monthly return rates.
Test of hypothesis for the ratio of two population variances, $\sigma_1^2/\sigma_2^2$
ONE-TAILED TEST
$H_0$: $\sigma_1^2/\sigma_2^2 = 1$ (i.e., $\sigma_1^2 = \sigma_2^2$)
$H_a$: $\sigma_1^2/\sigma_2^2 > 1$ (i.e., $\sigma_1^2 > \sigma_2^2$)
[or $H_a$: $\sigma_1^2/\sigma_2^2 < 1$ (i.e., $\sigma_1^2 < \sigma_2^2$)]
Test statistic:
$$F = \frac{s_1^2}{s_2^2} \quad \left[\text{or } F = \frac{s_2^2}{s_1^2}\right]$$
Rejection region: $F > F_\alpha$
TWO-TAILED TEST
$H_0$: $\sigma_1^2/\sigma_2^2 = 1$ (i.e., $\sigma_1^2 = \sigma_2^2$)
$H_a$: $\sigma_1^2/\sigma_2^2 \neq 1$ (i.e., $\sigma_1^2 \neq \sigma_2^2$)
Test statistic:
$$F = \frac{\text{Larger sample variance}}{\text{Smaller sample variance}}, \quad \text{i.e.,} \quad F = \begin{cases} s_1^2/s_2^2 & \text{when } s_1^2 > s_2^2 \\ s_2^2/s_1^2 & \text{when } s_2^2 > s_1^2 \end{cases}$$
Rejection region: $F > F_{\alpha/2}$
where $F_\alpha$ and $F_{\alpha/2}$ are values that locate an area α and α/2, respectively, in the
upper tail of the F-distribution with $\nu_1$ = numerator degrees of freedom (i.e., the df
for the sample variance in the numerator) and $\nu_2$ = denominator degrees of
freedom (i.e., the df for the sample variance in the denominator).
Assumptions: 1. Both of the populations from which the samples are selected have
relative frequency distributions that are approximately normal.
2. The random samples are selected in an independent manner from
the two populations.
Variance tests can also be applied prior to conducting a small-sample t test for
$(\mu_1 - \mu_2)$, discussed in Section 9.4. Recall that the t test requires the assumption that the
variances of the two sampled populations are equal. If the two population variances are greatly
different, any inferences derived from the t test are suspect. Consequently, it is important that
we detect a significant difference between the two variances, if it exists, before applying the
small-sample t test.
The common statistical procedure for comparing two population variances, $\sigma_1^2$ and $\sigma_2^2$, makes
an inference about the ratio $\sigma_1^2/\sigma_2^2$. This is because the sampling
distribution of the estimator for $\sigma_1^2/\sigma_2^2$ is well known when the samples are randomly and
independently selected from two normal populations.
The elements of a hypothesis test for the ratio of two population variances, $\sigma_1^2/\sigma_2^2$, are given in
the preceding box.
Example 9.11 A class of 31 students was randomly divided into an experimental set of
size $n_1 = 18$ that received instruction in a new statistics unit and a control set of size $n_2 = 13$
that received the standard statistics instruction. All students were given a test of
computational skill at the end of the course. A summary of the results appears in Table
9.4. Do the data provide sufficient evidence to indicate a difference in the variability of
this skill in the hypothetical population of students who might be given the new
instruction and the population of students who might be given the standard instruction?
Test using α = .10.
Table 9.4 Data on students' scores in Example 9.11
                      Control set   Experimental set
Sample size               18              13
Standard deviation        1.93            3.10
Solution Let
$\sigma_1^2$ = Variance of test scores of the experimental population
$\sigma_2^2$ = Variance of test scores of the control population
The hypotheses of interest are
$H_0$: $\sigma_1^2/\sigma_2^2 = 1$ ($\sigma_1^2 = \sigma_2^2$)
$H_a$: $\sigma_1^2/\sigma_2^2 \neq 1$ ($\sigma_1^2 \neq \sigma_2^2$)
According to the box, the test statistic for this two-tailed test is
$$F = \frac{\text{Larger } s^2}{\text{Smaller } s^2} = \frac{s_2^2}{s_1^2} = \frac{(3.10)^2}{(1.93)^2} = 2.58$$
To find the appropriate rejection region, we need to know the sampling distribution of the test
statistic. Under the assumption that both samples of test scores come from normal populations,
the F statistic, $F = s_2^2/s_1^2$, possesses an F-distribution with
$\nu_1 = (n_2 - 1)$ numerator degrees of freedom and $\nu_2 = (n_1 - 1)$ denominator degrees of freedom.
Unlike the z- and t-distributions of the preceding sections, an F-distribution is not symmetric
about its mean: it takes only nonnegative values and is skewed to the right, and its exact shape
depends on the degrees of freedom associated with $s_2^2$ and $s_1^2$, in this example
$(n_2 - 1) = 12$ and $(n_1 - 1) = 17$, respectively. An F-distribution with $\nu_1 = 12$ numerator df and
$\nu_2 = 17$ denominator df is shown in Figure 9.8. You can see that this particular
F-distribution is skewed to the right.
Upper-tail critical values of F are found in Table 4 of Appendix C. Table 9.5 is partially
reproduced from this table. It gives F values that correspond to α = .05 upper-tail areas for
different pairs of degrees of freedom. The columns of the table correspond to various numerator
degrees of freedom, while the rows correspond to various denominator degrees of freedom.
Thus, if the numerator degrees of freedom are 12 and the denominator degrees of freedom are
17, we find the F value,
$F_{.05} = 2.38$
As shown in Figure 9.8, α/2 = .05 is the tail area to the right of 2.38 in the
F-distribution with 12 numerator df and 17 denominator df. Thus, when $H_0$ is true, the
probability that the F statistic will exceed 2.38 is α/2 = .05.
Given this information on the F-distribution, we are now able to find the rejection region for this
test. Since the test is two-tailed, we will reject $H_0$ if $F > F_{\alpha/2}$. For
α = .10, we have α/2 = .05 and $F_{.05} = 2.38$ (based on $\nu_1 = 12$ and $\nu_2 = 17$ df). Thus, the rejection
region is
Rejection region: Reject $H_0$ if F > 2.38.
Figure 9.8 Rejection region of Example 9.11
Since the test statistic, F = 2.58, falls in the rejection region (see Figure 9.8), we reject $H_0$.
Therefore, at α = .10, the data provide sufficient evidence to indicate that the population
variances differ. It appears that the new statistics instruction results in a greater variability in
computational skill.
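The two-tailed F test of Example 9.11 can be sketched in code. The function name `two_tailed_f_statistic` is hypothetical, and the critical value 2.38 is copied from Table 9.5 instead of being computed; the function places the larger sample variance in the numerator and returns the matching degrees of freedom.

```python
def two_tailed_f_statistic(s_a, n_a, s_b, n_b):
    """Return (F, numerator df, denominator df) with the larger variance on top."""
    var_a, var_b = s_a ** 2, s_b ** 2
    if var_a >= var_b:
        return var_a / var_b, n_a - 1, n_b - 1
    return var_b / var_a, n_b - 1, n_a - 1

# Sample 1: s = 1.93 with n = 18; sample 2: s = 3.10 with n = 13
f_value, df_num, df_den = two_tailed_f_statistic(1.93, 18, 3.10, 13)

f_crit = 2.38                  # F_.05 with (12, 17) df, from Table 9.5
reject_h0 = f_value > f_crit   # two-tailed test at alpha = .10

print(f"F = {f_value:.2f} with ({df_num}, {df_den}) df, reject H0: {reject_h0}")
```

Because the function always returns F of at least 1, only the upper tail of the F table is ever needed.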
Example 9.11 illustrates the technique for calculating the test statistic and rejection region for a
two-tailed F test. The reason we place the larger sample variance in the numerator of the test
statistic is that only upper-tail values of F are shown in the F table of Appendix C; no lower-tail
values are given. By placing the larger sample variance in the numerator, we make certain that
only the upper tail of the rejection region is used. The fact that the upper-tail area is α/2 reminds
us that the test is two-tailed.
The problem of not being able to locate an F value in the lower tail of the
F-distribution is easily avoided in a one-tailed test because we can control how we specify the
ratio of the population variances in $H_0$ and $H_a$. That is, we can always make a one-tailed test an
upper-tailed test.
Table 9.5 Reproduction of part of Table 4 from Appendix C; α = .05
Columns: numerator degrees of freedom $\nu_1$. Rows: denominator degrees of freedom $\nu_2$.
 ν2 \ ν1      10      12      15      20      24      30      40      60     120       ∞
    1     241.90  243.90  245.90  248.00  249.10  250.10  251.10  252.20  253.33  254.30
    2      19.40   19.41   19.43   19.45   19.45   19.46   19.47   19.48   19.49   19.50
    3       8.79    8.74    8.70    8.66    8.64    8.62    8.59    8.57    8.55    8.53
    4       5.96    5.91    5.86    5.80    5.77    5.75    5.72    5.69    5.66    5.63
    5       4.74    4.68    4.62    4.56    4.53    4.50    4.46    4.43    4.40    4.36
    6       4.06    4.00    3.94    3.87    3.84    3.81    3.77    3.74    3.70    3.67
    7       3.64    3.57    3.51    3.44    3.41    3.38    3.34    3.30    3.27    3.23
    8       3.35    3.28    3.22    3.15    3.12    3.08    3.04    3.01    2.97    2.93
    9       3.14    3.07    3.01    2.94    2.90    2.86    2.83    2.79    2.75    2.71
   10       2.98    2.91    2.85    2.77    2.74    2.70    2.66    2.62    2.58    2.54
   11       2.85    2.79    2.72    2.65    2.61    2.57    2.53    2.49    2.45    2.40
   12       2.75    2.69    2.62    2.54    2.51    2.47    2.43    2.38    2.34    2.30
   13       2.67    2.60    2.53    2.46    2.42    2.38    2.34    2.30    2.25    2.21
   14       2.60    2.53    2.46    2.39    2.35    2.31    2.27    2.22    2.18    2.13
   15       2.54    2.48    2.40    2.33    2.29    2.25    2.20    2.16    2.11    2.07
   16       2.49    2.42    2.35    2.28    2.24    2.19    2.15    2.11    2.06    2.01
   17       2.45    2.38    2.31    2.23    2.19    2.15    2.10    2.06    2.01    1.96
9.8 Summary
In this chapter we have learnt the procedures for testing hypotheses about various population
parameters. Often the comparison focuses on the means. As we noted with the estimation
techniques of Chapter 7, fewer assumptions about the sampled populations are required when
the sample sizes are large. It should be emphasized that statistical significance differs from
practical significance, and the two must not be confused. A reasonable approach to hypothesis
testing blends a valid application of the formal statistical procedures with the researcher's
knowledge of the subject matter.
9.9 Exercises
9.1. A random sample of n observations is selected from a population with unknown mean $\mu$
and variance $\sigma^2$. For each of the following situations, specify the test statistic and rejection
region.
a. $H_0$: $\mu = 40$, $H_a$: $\mu > 40$; $n = 35$, $\bar{x} = 60$, $s = 64$; $\alpha = .05$
b. $H_0$: $\mu = 120$, $H_a$: $\mu \neq 120$; $n = 40$, $\bar{x} = 140.5$, $s^2 = 9.6$; $\alpha = .01$
c. $H_0$: $\mu = 11$, $H_a$: $\mu < 11$; $n = 48$, $\bar{x} = 9.5$, $s = .6$; $\alpha = .10$
9.2. A random sample of 51 measurements produced the following sums:
$\sum x = 50.3$, $\sum x^2 = 68$
a. Test the null hypothesis that $\mu = 1.18$ against the alternative that $\mu < 1.18$. Use α =
.01.
b. Test the null hypothesis that $\mu = 1.18$ against the alternative that $\mu < 1.18$. Use α =
.10.
9.3. A random sample of n observations is selected from a binomial population. For each of
the following situations, specify the rejection region, test statistic value, and conclusion:
a. $H_0$: $p = .25$, $H_a$: $p > .25$, $\hat{p} = .28$, $n = 200$, $\alpha = .10$
b. $H_0$: $p = .05$, $H_a$: $p < .05$, $\hat{p} = .04$, $n = 1,000$, $\alpha = .05$
c. $H_0$: $p = .85$, $H_a$: $p \neq .85$, $\hat{p} = .80$, $n = 60$, $\alpha = .01$
9.4. Two independent random samples are selected from populations with means $\mu_1$ and $\mu_2$,
respectively. The sample sizes, means, and standard deviations are shown in the table.
Sample 1              Sample 2
$\bar{x}_1 = 7.5$     $\bar{x}_2 = 6.5$
$s_1 = 3.0$           $s_2 = 1.0$
$n_1 = 45$            $n_2 = 55$
a. Test the null hypothesis $H_0$: $(\mu_1 - \mu_2) = 0$ against the alternative hypothesis
$H_a$: $(\mu_1 - \mu_2) \neq 0$ at α = .05.
b. Test the null hypothesis $H_0$: $(\mu_1 - \mu_2) = .5$ against the alternative hypothesis
$H_a$: $(\mu_1 - \mu_2) \neq .5$ at α = .05.
9.5. Independent random samples selected from two binomial populations produced the
results given in the table.
                        Sample 1   Sample 2
Number of successes        80         74
Sample sizes              100        100
a. Test $H_0$: $(p_1 - p_2) = 0$ against $H_a$: $(p_1 - p_2) > 0$ at $\alpha = .10$.
b. Suppose $n_1 = n_2 = 1,000$ but the sample estimates $\hat{p}_1$, $\hat{p}_2$, and $\hat{p}$ remain the
same as in part a. Test $H_0$: $(p_1 - p_2) = 0$ against $H_a$: $(p_1 - p_2) > 0$ at $\alpha = .05$.
9.6. A random sample of n = 10 observations yields $\bar{x} = 231.7$ and $s^2 = 15.5$. Test the null
hypothesis $H_0$: $\sigma^2 = 20$ against the alternative hypothesis
$H_a$: $\sigma^2 < 20$. Use α = .05. What assumptions are necessary for the test to be valid?
9.7. The following measurements represent a random sample of n = 5 observations from a
normal population: 10, 2, 7, 9, 14. Is this sufficient evidence to conclude that $\sigma^2 \neq 2$? Test
using α = .10.
9.8. Calculate the value of the test statistic for testing $H_0$: $\sigma_1^2/\sigma_2^2 = 1$ in each of the following cases:
a. $H_a$: $\sigma_1^2/\sigma_2^2 > 1$; $s_1^2 = 1.75$, $s_2^2 = 1.23$
b. $H_a$: $\sigma_1^2/\sigma_2^2 < 1$; $s_1^2 = 1.52$, $s_2^2 = 5.90$
c. $H_a$: $\sigma_1^2/\sigma_2^2 \neq 1$; $s_1^2 = 1,750$, $s_2^2 = 2,235$
Chapter 10 Categorical data analysis and analysis of variance
CONTENTS
10.1 Introduction
10.2 Tests of goodness of fit
10.3 The analysis of contingency tables
10.4 Contingency tables in statistical software packages
10.5 Introduction to analysis of variance
10.6 Design of experiments
10.7 Completely randomized designs
10.8 Randomized block designs
10.9 Multiple comparisons of means and confidence regions
10.10 Summary
10.11 Exercises
10.1 Introduction
In this chapter we present some methods for treatment of categorical data. The methods involve
the comparison of a set of observed frequencies with frequencies specified by some hypothesis
to be tested. A test of such a hypothesis is called a test of goodness of fit.
We will show how to test the hypothesis that two categorical variables are independent. The test
statistics discussed have sampling distributions that are approximated by chi-square
distributions, so the tests are called chi-square tests.
In this chapter we will also discuss the procedures for selecting sample data and analyzing
variances, which are useful in comparing more than two population means. The objective of
these sections is to introduce some aspects of experimental design and the analysis of data
from such experiments using an analysis of variance.
10.2 Tests of goodness of fit
We know that observations of a qualitative variable can only be categorized. For example,
consider the highest level of education attained by each woman in a group of women in a rural region.
"Level of education" is a qualitative variable and each woman would fall into one and only one of
the following three categories: can read/write degree; primary degree; and secondary and
above degree. The result of the categorization would be a count of the numbers of rural women
falling in the respective categories. When the qualitative variable results in one of two
responses (yes or no, success or failure, favor or do not favor, etc.) the data (i.e., the counts)
can be analyzed using the binomial probability distribution. However, qualitative variables such
as "level of education" that allow for more than two categories for a response are much more
common, and these must be analyzed using a different method called a test of goodness of fit. A
test of goodness of fit tests whether a given distribution fits a set of data. It is based on
a comparison of an observed frequency distribution with the hypothesized distribution.
Example 10.1 The level of education attained by the women of a rural region is divided
into three categories: can read/write degree; primary degree; secondary and above
degree. A demographer estimates that 28% of them have can read/write degree, 61%
have primary degree and 11% have secondary and above degree. In order to verify these
percentages, a random sample of n = 100 women in the region was selected and their
level of education recorded. The number of women whose level of education falls
into each of the three categories is shown in Table 10.1.
Table 10.1 Categories corresponding to level of education
Level of education
Can read/write   Primary   Secondary and above   Total
      22            64             14             100
Do the data given in Table 10.1 disagree with the percentages of 28%, 61%, and 11%
estimated by the demographer? As a first step in answering this question, we need to find the
number of women in the sample of 100 that would be expected to fall in each of the three
educational categories of Table 10.1, assuming that the demographer's percentages are
accurate.
Solution Each woman in the sample was assigned to one and only one of the three
educational categories listed in Table 10.1. If the demographer's percentages are correct,
then the probabilities that a woman's education level will fall in the three educational
categories are as shown in Table 10.2.
Table 10.2 Category probabilities based on the demographer's percentages
                     Level of education
                     Can read/write   Primary        Secondary and above   Total
Cell number                1             2                   3
Cell probability      $p_1 = .28$    $p_2 = .61$        $p_3 = .11$         1.00
Consider first the "Can read/write" cell of Table 10.2. If we assume that the level of education of
any woman is independent of the level of education of any other, then the observed number $O_1$ of
responses falling into cell 1 is a binomial random variable and its expected value is
$e_1 = np_1 = (100)(.28) = 28$
Similarly, the expected numbers of responses in cells 2 and 3 (categories 2 and 3) are
$e_2 = np_2 = (100)(.61) = 61$
and
$e_3 = np_3 = (100)(.11) = 11$
The observed numbers of responses and the corresponding expected numbers (in parentheses)
are shown in Table 10.3.
Table 10.3 Observed and expected numbers of responses falling in the cell categories
for Example 10.1
                     Level of education
                     Can read/write   Primary   Secondary and above   Total
Observed numbers          22            64              14             100
Expected numbers         (28)          (61)            (11)            100
Formula for calculating expected cell counts
$e_i = np_i$
where
$e_i$ = Expected count for cell i
$n$ = Sample size
$p_i$ = Hypothesized probability that an observation will fall in cell i.
Do the observed responses for the sample of 100 women disagree with the category
probabilities based on the demographer's estimates? If they do, we say that the theorized
demographer probabilities do not fit the data or, alternatively, that a lack of fit exists. The
relevant null and alternative hypotheses are:
$H_0$: The category (cell) probabilities are $p_1 = .28$, $p_2 = .61$, $p_3 = .11$
$H_a$: At least two of the probabilities, $p_1$, $p_2$, $p_3$, differ from the values specified in the null
hypothesis
To find the value of the test statistic, we first calculate
$$\frac{(O_i - e_i)^2}{e_i} = \frac{(\text{Observed cell count} - \text{Expected cell count})^2}{\text{Expected cell count}}$$
for each of the cells, i = 1, 2, 3. The sum of these quantities is the test statistic used for the
goodness-of-fit test:
$$\chi^2 = \sum_{i=1}^{3} \frac{(O_i - e_i)^2}{e_i} = \frac{(O_1 - e_1)^2}{e_1} + \frac{(O_2 - e_2)^2}{e_2} + \frac{(O_3 - e_3)^2}{e_3}$$
Substituting the values of the observed and expected cell counts from Table 10.3 into the
formula for calculating $\chi^2$, we obtain
$$\chi^2 = \sum_{i=1}^{3} \frac{(O_i - e_i)^2}{e_i} = \frac{(22 - 28)^2}{28} + \frac{(64 - 61)^2}{61} + \frac{(14 - 11)^2}{11} = 1.29 + .15 + .82 = 2.26$$
Example 10.2 Specify the rejection region for the test described in the preceding
discussion. Use α = .05. Test to determine whether the sample data disagree with the
demographer's estimated percentages.
Solution Since the value of chi-square increases as the differences between the
observed and expected cell counts increase, we will reject
$H_0$: $p_1 = .28$, $p_2 = .61$, $p_3 = .11$
for values of chi-square larger than some critical value, say $\chi^2_\alpha$, i.e.,
Rejection region: $\chi^2 > \chi^2_\alpha$
The critical values of the $\chi^2$ distribution are given in Table 3 of Appendix C. The degrees of
freedom for the chi-square statistic used to test the goodness of fit of a set of cell probabilities
will always be 1 less than the number of cells. For example, if k cells were used in the
categorization of the sample data, then
Degrees of freedom: df = k - 1
For our example, df = (k - 1) = (3 - 1) = 2 and α = .05. From Table 3 of Appendix C, the
tabulated value of $\chi^2_\alpha$ corresponding to df = 2 is 5.99147.
The rejection region for the test, $\chi^2 > \chi^2_\alpha$, is illustrated in Figure 10.1. We will reject $H_0$ if
$\chi^2 > 5.99147$. Since the calculated value of the test statistic, $\chi^2 = 2.26$, is less than
$\chi^2_{.05}$, we cannot reject $H_0$. There is insufficient information to indicate a
lack of fit of the sample data to the percentages estimated by the demographer.
Figure 10.1 Rejection region for Example 10.2
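The goodness-of-fit computation of Examples 10.1 and 10.2 can be sketched as follows. The function name `goodness_of_fit` is hypothetical, and the critical value 5.99147 is the tabulated value quoted in the text. Note that the code sums the unrounded terms, which gives about 2.25; the value 2.26 in the text results from rounding each term to two decimals before adding.

```python
def goodness_of_fit(observed, probabilities):
    """Return (expected counts, chi-square statistic, degrees of freedom)."""
    n = sum(observed)
    expected = [n * p for p in probabilities]  # e_i = n * p_i
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return expected, chi2, len(observed) - 1

# Observed counts of Table 10.3 and the demographer's probabilities
expected, chi2, df = goodness_of_fit([22, 64, 14], [0.28, 0.61, 0.11])

chi2_crit = 5.99147            # chi-square_.05 with 2 df, quoted in the text
reject_h0 = chi2 > chi2_crit

print([round(e, 1) for e in expected], round(chi2, 2), df, reject_h0)
```

Either value of the statistic leads to the same decision here, since both are far below the critical value.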
Summary of a goodness of fit test for specified values of the cell probabilities
$H_0$: The k cell probabilities are $p_1, p_2, \ldots, p_k$
$H_a$: At least two of the cell probabilities differ from the values specified in $H_0$
Test statistic:
$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - e_i)^2}{e_i}$$
where
k = Number of cells in the categorization table
$O_i$ = Observed count for cell i
$e_i$ = Expected count for cell i
n = Sample size = $O_1 + O_2 + \cdots + O_k$
Rejection region: $\chi^2 > \chi^2_\alpha$
At the start, we assumed that each of n observations could fall into one of k categories (or cells),
that the probability that an observation would fall in cell i was $p_i$, i = 1, 2, . . . , k, and that the
outcome for any one observation was independent of the outcome for any others. These
characteristics define a multinomial experiment. The binomial experiment is a multinomial
experiment with k = 2.
Properties of the underlying distribution of response data for a chi-square goodness of
fit test
1. The experiment consists of n identical trials.
2. There are k possible outcomes to each trial.
3. The probabilities of the k outcomes, denoted by $p_1, p_2, \ldots, p_k$, remain the same from trial
to trial, where $p_1 + p_2 + \cdots + p_k = 1$.
4. The trials are independent.
5. The (estimated) expected number of responses for each of the k cells should be at least 5.
Because it is widely used, the chi-square test is also one of the most abused statistical
procedures. The user should always be certain that the experiment satisfies the assumptions
before proceeding with the test. In addition, the chi-square test should be avoided when the
estimated expected cell counts are small, because in this case the chi-square probability
distribution gives a poor approximation to the sampling distribution of the $\chi^2$ statistic. The
estimated expected number of responses for each of the k cells should be at least 5. In this
case the chi-square distribution can be used to determine an approximate critical value that
specifies the rejection region.
In the sections that follow, we will present a method for analyzing data that have been
categorized according to two qualitative variables. The objective is to determine whether a
dependency exists between the two qualitative variables; this is the qualitative variable analogue
to a correlation analysis for two quantitative random variables. As you will see subsequently, these
methods are also based on the assumption that the sampling satisfies the requirements for one
or more multinomial experiments.
10.3 The analysis of contingency tables
Qualitative data are often categorized according to two qualitative variables. As a practical
example of a two-variable classification of data, we will consider a 2 × 3 table.
Suppose that a random sample of men and women indicated their view on a certain proposal as
shown in Table 10.4.
Table 10.4 Contingency table for views of women and men on a proposal
          In favour   Opposed   Undecided   Total
Women        118         62         25       205
Men           84         78         37       199
Total        202        140         62       404
We are to test the statement that there is no difference in opinion between men and women, i.e.
the response is independent of the sex of the person interviewed, and we adopt this as our null
hypothesis. Now if the statement is not true, then the response will depend on the sex of the
person interviewed, and the table will enable us to calculate the degree of dependence. A table
constructed in this way (to indicate dependence or association) is called a contingency table.
"Contingency" means dependence; many of you will be familiar with the term "contingency
planning", i.e. plans that will be put into operation if certain things happen. Thus, the purpose of
a contingency table analysis is to determine whether a dependence exists between the two
qualitative variables.
We adopt the null hypothesis that there is no association between the response and the sex of
person interviewed. On this basis we may deduce that the proportion of the sample who are
female is 205/404, and as 202 people are in favour of the proposal, the expected number of
women in favour of proposal is 205/404 × 202 = 102.5. Therefore, the estimated expected
number of women (row 1) in favour of the proposal (column 1) is
\[ e_{11} = \left(\frac{\text{Row 1 total}}{n}\right) \times (\text{Column 1 total}) = \left(\frac{205}{404}\right) \times 202 = 102.5 \]
Also, as 140 people are against the proposal, the expected number of women against the
proposal is (row 1, column 2)
\[ e_{12} = \left(\frac{\text{Row 1 total}}{n}\right) \times (\text{Column 2 total}) = \left(\frac{205}{404}\right) \times 140 \approx 71 \]
And the expected number of undecided women is (row 1, column 3)
\[ e_{13} = \left(\frac{\text{Row 1 total}}{n}\right) \times (\text{Column 3 total}) = \left(\frac{205}{404}\right) \times 62 \approx 31.5 \]
We now move to row 2 for men and note that the row total is 199. Therefore, we would expect
the proportion of the sample who are male, 199/404, to hold for all three types of opinion. The
estimated expected cell counts for the columns of row 2 are
\[ e_{21} = \left(\frac{\text{Row 2 total}}{n}\right) \times (\text{Column 1 total}) = \left(\frac{199}{404}\right) \times 202 = 99.5 \]
\[ e_{22} = \left(\frac{\text{Row 2 total}}{n}\right) \times (\text{Column 2 total}) = \left(\frac{199}{404}\right) \times 140 \approx 69 \]
\[ e_{23} = \left(\frac{\text{Row 2 total}}{n}\right) \times (\text{Column 3 total}) = \left(\frac{199}{404}\right) \times 62 \approx 30.5 \]
The formula for calculating any estimated expected value can be deduced from the values
calculated above. Each estimated expected cell count is equal to the product of its respective
row and column totals divided by the total sample size n:
\[ e_{ij} = \frac{R_i \times C_j}{n} \]
where
$e_{ij}$ = Estimated expected count for the cell in row i and column j
$R_i$ = Row total corresponding to row i
$C_j$ = Column total corresponding to column j
n = Sample size
The observed and estimated expected cell counts for the contingency table of Table 10.4 are
shown in Table 10.5.
Table 10.5 Observed and expected (in parentheses) counts for
response of women and men
          In favour      Opposed     Undecided
Women     118 (102.5)    62 (71)     25 (31.5)
Men        84 (99.5)     78 (69)     37 (30.5)
In this example, the chi-square test statistic, $\chi^2$, is calculated in the same manner as shown in
Example 10.1.
\[
\chi^2 = \frac{(O_{11} - e_{11})^2}{e_{11}} + \frac{(O_{12} - e_{12})^2}{e_{12}} + \frac{(O_{13} - e_{13})^2}{e_{13}} + \cdots + \frac{(O_{23} - e_{23})^2}{e_{23}}
\]
\[
= \frac{(118 - 102.5)^2}{102.5} + \frac{(62 - 71)^2}{71} + \frac{(25 - 31.5)^2}{31.5} + \cdots + \frac{(37 - 30.5)^2}{30.5} \approx 9.80
\]
The appropriate degrees of freedom for a contingency table analysis will always be
(r − 1) × (c − 1), where r is the number of rows and c is the number of columns in the table. In
this example, we have (2 − 1) × (3 − 1) = 2 degrees of freedom. Consulting
Table 3 of Appendix C, we see that the critical values for $\chi^2$ are 5.99 at a significance level of
α = .05 and 9.21 at a level of α = .01. In both cases, the computed test statistic is larger than the
critical value. Hence, we reject the null hypothesis at the 1% significance level and accept the
alternative hypothesis that men and women differ in their views on the proposal.
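The expected counts, the test statistic and the degrees of freedom for Table 10.4 can be checked with a short script. This is a minimal sketch in plain Python; the function names `expected_counts` and `chi_square` are illustrative choices, not part of the course material. Because the script keeps the expected counts unrounded, the statistic may differ in the second decimal place from a hand calculation with rounded expected counts.

```python
# Observed counts from Table 10.4
# (rows: women, men; columns: in favour, opposed, undecided).
observed = [
    [118, 62, 25],
    [84, 78, 37],
]


def expected_counts(table):
    """Estimated expected counts e_ij = R_i * C_j / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_square(table):
    """Return the chi-square statistic and (r - 1)(c - 1) degrees of freedom."""
    expected = expected_counts(table)
    stat = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


expected = expected_counts(observed)
stat, df = chi_square(observed)
print([[round(e, 1) for e in row] for row in expected])
print(round(stat, 2), df)
```

With unrounded expected counts the statistic comes out at about 9.79 on 2 degrees of freedom, which is above both tabulated critical values quoted in the text (5.99 and 9.21).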
General form of a chi-square test for independence of two directions of classification
$H_0$: The two directions of classification in the contingency table are independent
$H_a$: The two directions of classification in the contingency table are dependent
Test statistic:
\[ \chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - e_{ij})^2}{e_{ij}} \]
where
r = Number of rows in the table
c = Number of columns in the table
$O_{ij}$ = Observed number of responses in the cell in row i and column j
$e_{ij}$ = Estimated expected number of responses in cell (ij) = $(R_i \times C_j) / n$
Rejection region: $\chi^2 > \chi^2_{\alpha}$
where $\chi^2_{\alpha}$ is the tabulated value of the chi-square distribution based on (r − 1) × (c − 1) degrees of
freedom such that $P(\chi^2 > \chi^2_{\alpha}) = \alpha$.
10.4 Contingency tables in statistical software packages
Statistical software packages generally provide procedures for the analysis of categorical data.
The following printouts come from the SPSS procedure "Crosstabs", which creates the contingency
table and computes the value of the $\chi^2$ statistic to test whether education level depends on the
living region of the women interviewed in the DHS Survey 1988 in Vietnam (the survey data are given in
Appendix A).
CROSSTABS
/TABLES=urban BY gd1
/FORMAT= AVALUE TABLES
/STATISTIC=CHISQ CC PHI
/CELLS= COUNT EXPECTED ROW .
Crosstabs

Case Processing Summary
                                            Cases
                               Valid           Missing           Total
                             N     Percent    N    Percent     N     Percent
URBAN * Education Level     4172   100.0%     0    .0%        4172   100.0%

URBAN * Education Level Crosstabulation
                                          Education Level
                                 Can read/write   Primary   Secondary and above   Total
URBAN   Urban   Count                 163            299           266             728
                Expected Count        197.5          415.5         115.0           728.0
                % within URBAN        22.4%          41.1%         36.5%           100.0%
        Rural   Count                 969           2082           393            3444
                Expected Count        934.5         1965.5         544.0          3444.0
                % within URBAN        28.1%          60.5%         11.4%           100.0%
Total           Count                1132           2381           659            4172
                Expected Count       1132.0         2381.0         659.0          4172.0
                % within URBAN        27.1%          57.1%         15.8%           100.0%

Chi-Square Tests
                                Value        df   Asymp. Sig. (2-sided)
Pearson Chi-Square              287.084 (a)   2   .000
Likelihood Ratio                241.252       2   .000
Linear-by-Linear Association    137.517       1   .000
N of Valid Cases                4172
(a) 0 cells (.0%) have expected count less than 5. The minimum expected count is 114.99.
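The Pearson and likelihood-ratio values in the printout can be reproduced from the observed counts alone. The sketch below uses plain Python rather than SPSS. The likelihood-ratio statistic is computed with the usual formula 2 × Σ O ln(O/e), which the text does not spell out, so treat that line as an assumption about what SPSS reports.

```python
import math

# Observed counts from the Crosstabs printout
# (columns: can read/write, primary, secondary and above).
observed = [
    [163, 299, 266],   # Urban
    [969, 2082, 393],  # Rural
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

pearson = 0.0
likelihood_ratio = 0.0
min_expected = float("inf")
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / n      # expected count e_ij
        pearson += (o - e) ** 2 / e                # Pearson chi-square term
        likelihood_ratio += 2 * o * math.log(o / e)  # assumed G^2 formula
        min_expected = min(min_expected, e)

df = (len(observed) - 1) * (len(observed[0]) - 1)
print(round(pearson, 3), round(likelihood_ratio, 3), df, round(min_expected, 2))
```

The results should agree with the printout to about three decimals (Pearson near 287.08, likelihood ratio near 241.25, minimum expected count 114.99).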
Before turning to the analysis of variance, we make some remarks on methods for
treating categorical data.
• Surveys that allow for more than two categories for a single response (a one-way table) can
be analyzed using the chi-square goodness-of-fit test. The appropriate test statistic, called the $\chi^2$
statistic, has a sampling distribution approximated by the chi-square probability distribution
and measures the amount of disagreement between the observed number of responses and
the expected number of responses in each category.
• A contingency table analysis is an application of the $\chi^2$ test for a two-way (or two-variable)
classification of data. The test allows us to determine whether the two directions of
classification are independent.
10.5 Introduction to analysis of variance
As we have seen in the preceding chapters, the solutions to many statistical problems are
based on inferences about population means. The next sections extend the methods of Chapters 7 to
9 to the comparison of more than two means.
When the data have been obtained according to certain specified sampling procedures, they are
easy to analyze and also may contain more information pertinent to the population means than
could be obtained using simple random sampling. The procedure for selecting sample data is
called the design of the experiment and the statistical procedure for comparing the population
means is called an analysis of variance.
We will introduce some aspects of experimental design and the analysis of the data from such
experiments using an analysis of variance.
10.6 Design of experiments
The process of collecting sample data is called an experiment and the variable to be measured
in the experiment is called the response. The planning of the sampling procedure is called the
design of the experiment. The object upon which the response measurement is taken is called
an experimental unit.
Variables that may be related to a response variable are called factors. The value − that is, the
intensity setting − assumed by a factor in an experiment is called a level. The combinations of
levels of the factors for which the response will be observed are called treatments.
The process of the design of an experiment can be divided into four steps as follows:
1. Select the factors to be included in the experiment and identify the parameters that are the
object of the study. Usually, the target parameters are the population means associated with
the factor level.
2. Choose the treatments to be included in the experiment.
3. Determine the number of observations (sample size) to be made for each treatment.
4. Decide how the treatments will be assigned to the experimental units.
Once the data for a designed experiment have been collected, we will want to use the sample
information to make inferences about the population means associated with the various
treatments. The method used to compare the treatment means is known as analysis of
variance, or ANOVA. The concept behind an analysis of variance can be explained using the
following simple example.
Example 10.3 An elementary school teacher wants to try out three different reading
workbooks. At the end of the year the 18 children in the class will take a test in reading
achievement. These test scores will be used to compare the workbooks. Table 10.6 gives
reading achievement scores. Each set of scores of the 6 children using a type of
workbook is considered as a sample from the hypothetical population of all kindergarten
children who might use that type of workbook. The scores are plotted as line plots in
Figure 10.2.
Table 10.6 Reading scores of 18 children using three different workbooks
                Workbook 1   Workbook 2   Workbook 3
                     2            9            4
                     4           10            5
                     3           10            6
                     4            7            3
                     5            8            7
                     6           10            5
Sums                24           54           30
Sample means         4            9            5
Total of 3 samples: 108; mean of 3 samples: 6
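Figure 10.2 Reading scores by workbook used and for combined sample
[Figure not reproduced: dot plots on a 0 to 10 scale for children using Workbook 1, children using Workbook 2, children using Workbook 3, and all children.]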
The means of the three samples are 4, 9, and 5, respectively. Figure 10.2 shows these as the
centers of the three samples; there is clearly variability from group to group. The variability in
the entire pooled sample of 18 is shown by the last line.
In contrast to this rather typical allocation, we consider Tables 10.7 and 10.8 as illustrations of
extreme cases. In Table 10.7 every observation in Group A is 3, every observation in Group B is
5, and every observation in Group C is 8. There is no variation within groups, but there is
variation between groups.
Table 10.7 No variation within groups
             Group
          A    B    C
          3    5    8
          3    5    8
          3    5    8
          3    5    8
Means     3    5    8
In Table 10.8 the mean of each group is 3. There is no variation among the group means,
although there is variability within each group. Neither extreme can be expected to occur in an
actual data set. In actual data, one needs to make an assessment of the relative sizes of the
between-groups and within-groups variability. It is to this assessment that the term "analysis of
variance" refers.
Table 10.8 No variation between groups
             Group
          A    B    C
          3    3    1
          5    6    4
          1    2    3
          3    1    4
Means     3    3    3
In Example 10.3, the overall mean, $\bar{x}$, is the sum of all the observations divided by the total
number of observations:
\[ \bar{x} = \frac{2 + 4 + \cdots + 5}{18} = \frac{108}{18} = 6 \]
The sum of squared deviations of all 18 observations from mean of the combined sample is a
measure of variability of the combined sample. This sum is called Total Sum of Squares and is
denoted by SS(Total).
\[
\begin{aligned}
SS(\text{Total}) ={} & (2-6)^2 + (4-6)^2 + (3-6)^2 + (4-6)^2 + (5-6)^2 + (6-6)^2 \\
& + (9-6)^2 + (10-6)^2 + (10-6)^2 + (7-6)^2 + (8-6)^2 + (10-6)^2 \\
& + (4-6)^2 + (5-6)^2 + (6-6)^2 + (3-6)^2 + (7-6)^2 + (5-6)^2 \\
={} & 34 + 62 + 16 = 112
\end{aligned}
\]
Next we measure the variability within samples. We calculate the sum of squared deviation of
each of 18 observations from their respective group means. This sum is called the Sum of
Squares Within Groups (or Sum of Squared Errors) and is denoted by SS(Within Groups) (or
SSE).
\[
\begin{aligned}
SSE ={} & (2-4)^2 + (4-4)^2 + (3-4)^2 + (4-4)^2 + (5-4)^2 + (6-4)^2 \\
& + (9-9)^2 + (10-9)^2 + (10-9)^2 + (7-9)^2 + (8-9)^2 + (10-9)^2 \\
& + (4-5)^2 + (5-5)^2 + (6-5)^2 + (3-5)^2 + (7-5)^2 + (5-5)^2 \\
={} & 10 + 8 + 10 = 28
\end{aligned}
\]
Now let us consider the group means of 4, 9, and 5. The sum of squared deviation of the group
means from the pooled mean of 6 is
\[ (4-6)^2 + (9-6)^2 + (5-6)^2 = 4 + 9 + 1 = 14. \]
However, this sum is not comparable to the sum of squares within groups because the sampling
variability of means is less than that of individual measurements. In fact, the mean of a sample
of 6 observations has a sampling variance equal to 1/6 of the variance of a single observation. Hence, to
put the sum of squared deviations of the group means on a basis that can be compared with
SS(Within Groups), we must multiply it by 6, the number of observations in each sample, to
obtain 6 × 14 = 84. This is called the Sum of Squares Between Groups (or Sum of Squares
for Treatments) and is denoted by SS(Between Groups) (or SST).
Now we have three sums that can be compared: SS(Between Groups), SS(Within Groups),
and SS(Total). They are given in Table 10.9. Observe that adding the first two sums of
squares gives the last sum. This demonstrates what we mean by the allocation of the total
variability to the variability due to differences between means of groups and variability of
individuals within groups.
Table 10.9 Sums of Squares for Example 10.3
SS(Between Groups)     84
SS(Within Groups)      28
SS(Total)             112
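The three sums of squares for Example 10.3 can be recomputed directly from their definitions. The following is a minimal sketch in plain Python; the variable names are illustrative only.

```python
# Reading scores from Table 10.6.
groups = {
    "Workbook 1": [2, 4, 3, 4, 5, 6],
    "Workbook 2": [9, 10, 10, 7, 8, 10],
    "Workbook 3": [4, 5, 6, 3, 7, 5],
}

all_scores = [x for scores in groups.values() for x in scores]
grand_mean = sum(all_scores) / len(all_scores)

# Squared deviations of every observation from the overall mean.
ss_total = sum((x - grand_mean) ** 2 for x in all_scores)

# Squared deviations of every observation from its own group mean.
ss_within = sum(
    (x - sum(scores) / len(scores)) ** 2
    for scores in groups.values()
    for x in scores
)

# Squared deviations of group means from the overall mean,
# each weighted by the number of observations in the group.
ss_between = sum(
    len(scores) * (sum(scores) / len(scores) - grand_mean) ** 2
    for scores in groups.values()
)

print(ss_between, ss_within, ss_total)
```

The between-groups and within-groups sums add up to the total sum of squares, which is the partition the text describes.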
In this example we notice that the variability between groups is a large proportion of the total
variability. However, we have to adjust the numbers in Table 10.9 in order to take account of the
number of pieces of information going into each sum of squares. That is, we want to use the
sums of squares to calculate sample variances. The sum of squares between groups has 3
deviations about the mean of the combined sample. Therefore its number of degrees of freedom is
3 − 1 = 2 and the sample variance based on this sum of squares is
\[ \text{Between-Group Variation} = \frac{SS(\text{Between Groups})}{3 - 1} = \frac{84}{2} = 42 \]
This quantity is also called Mean Square for Treatments (MST).
The sum of squares within groups is made up of 3 sample sums of squares. Each involves 6
squared deviations, and hence, each has 6 − 1 = 5 degrees of freedom. Therefore the 3 samples
have 18 − 3 = 15 degrees of freedom. The sample variance based on this sum is
\[ \text{Within-Group Variation} = \frac{SS(\text{Within Groups})}{18 - 3} = \frac{28}{15} = 1.867 \]
This variation is also called Mean Square for Error (MSE).
The two estimates of variation MST, measuring variability among groups and MSE, measuring
variability within groups, are now comparable. Their ratio is
\[ F = \frac{MST}{MSE} = \frac{42}{1.867} = 22.50. \]
The fact that MST is 22.5 times MSE seems to indicate that the variability among groups is
much greater than that within groups. However, we know that such a ratio computed for
different triplets of random samples would vary from triplet to triplet, even if the population
means were the same. We must take account of this sampling variability. This is done by
referring to the F-tables, which depend on the desired significance level as well as on the number of
degrees of freedom of MST, which is 2 here, and the number of degrees of freedom of MSE,
which is 15 here. The value in the F-table for a significance level of .01 is 6.36. Thus we would
consider the calculated ratio of 22.50 as very significant. We conclude that there are real
differences in average reading achievement due to the use of different workbooks.
The results of computation are set out in Table 10.10.
Table 10.10 Analysis of Variance Table for Example 10.3
Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square   F
Between groups              84                  2               42        22.50
Within groups               28                 15                1.867
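The entries of the analysis of variance table follow from the sums of squares and their degrees of freedom. The helper below is a small sketch (the name `f_ratio` is an illustrative choice); the critical value 6.36 is the tabulated value quoted in the text, not something the script computes.

```python
def f_ratio(ss_between, ss_within, k, n):
    """Return (MST, MSE, F) for k groups and n observations in total."""
    mst = ss_between / (k - 1)   # mean square for treatments
    mse = ss_within / (n - k)    # mean square for error
    return mst, mse, mst / mse


mst, mse, f = f_ratio(ss_between=84, ss_within=28, k=3, n=18)

# Tabulated F value for 2 and 15 degrees of freedom at the .01 level,
# as quoted in the text.
critical_value = 6.36

print(mst, round(mse, 3), round(f, 2), f > critical_value)
```

Since the computed ratio of 22.5 exceeds 6.36, the script reaches the same decision as the text.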
In the next sections we will consider the analysis of variance for the general problem of comparing k
population means for three special types of experimental designs.
10.7 Completely randomized designs
The most common experimental design employed in practice is called a completely randomized
design. This experiment involves a comparison of the means for a number, say k, of treatments,
based on independent random samples of $n_1, n_2, \ldots, n_k$ observations, drawn from populations
associated with treatments 1, 2, . . . , k, respectively.
After collecting the data from a completely randomized design, our goal is to make inferences
about k population means, where $\mu_i$ is the mean of the population of measurements associated
with treatment i, for i = 1, 2, . . . , k. The null hypothesis to be tested is that the k treatment
means are equal, i.e.,
$H_0: \mu_1 = \mu_2 = \cdots = \mu_k$
and the alternative hypothesis is that at least two of the treatment means differ.
An analysis of variance provides an easy way to analyze the data from a completely
randomized design. The analysis partitions SS(Total) into two components, SST and SSE.
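These two quantities are defined in general terms as follows:
\[ SST = \sum_{j=1}^{k} n_j (\bar{x}_j - \bar{x})^2 \]
\[ SSE = \sum_{j=1}^{k} \sum_{i=1}^{n_j} (x_{ij} - \bar{x}_j)^2 \]
where $x_{ij}$ is the ith observation in sample j, $\bar{x}_j$ is the mean of sample j, and $\bar{x}$ is the overall mean.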
Recall that the quantity SST denotes the sum of squares for treatments and measures the
variation explained by the differences between the treatment means. The sum of squares for
error, SSE, is a measure of the unexplained variability, obtained by calculating a pooled
measure of the variability within the k samples. If the treatment means truly differ, then the
between-sample variation should be substantially larger than the within-sample variation. We
compare the two sources of variability by forming an F statistic:
\[ F = \frac{\text{Between-sample variation}}{\text{Within-sample variation}} = \frac{SST/(k-1)}{SSE/(n-k)} = \frac{MST}{MSE} \]
where n is the total number of measurements. Under certain conditions, the F statistic has a
repeated sampling distribution known as the F-distribution. Recall from Section 9.6 that the
F-distribution depends on $\nu_1$ numerator degrees of freedom and $\nu_2$ denominator degrees of
freedom. For the completely randomized design, F is based on $\nu_1 = k - 1$ and $\nu_2 = n - k$
degrees of freedom. If the computed value of F exceeds the upper critical value, $F_{\alpha}$, we reject $H_0$
and conclude that at least two of the treatment means differ.
Test to Compare k Population Means for a Completely Randomized Design
$H_0: \mu_1 = \mu_2 = \cdots = \mu_k$ [i.e., there is no difference in the treatment (population)
means]
$H_a$: At least two treatment means differ
Test statistic: F = MST/MSE
Rejection region: $F > F_{\alpha}$
where the distribution of F is based on (k − 1) numerator df and (n − k) denominator df, and $F_{\alpha}$ is
the F value found in Table 4 of Appendix C such that $P(F > F_{\alpha}) = \alpha$.
Assumptions: 1. All k population probability distributions are normal.
2. The k population variances are equal.
3. The samples from each population are random and
independent.
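The test statistic for a completely randomized design can be sketched as a small function that also handles unequal sample sizes, in which case each squared deviation of a sample mean is weighted by its sample size. The function name `one_way_anova` and the three small samples are hypothetical, invented only for illustration; they are not data from the course.

```python
def one_way_anova(samples):
    """Return (SST, SSE, MST, MSE, F) for a list of independent samples."""
    k = len(samples)
    n = sum(len(s) for s in samples)
    grand_mean = sum(sum(s) for s in samples) / n
    means = [sum(s) / len(s) for s in samples]

    # Between-sample sum of squares, weighted by sample size n_j.
    sst = sum(len(s) * (m - grand_mean) ** 2 for s, m in zip(samples, means))
    # Within-sample sum of squares, pooled over the k samples.
    sse = sum((x - m) ** 2 for s, m in zip(samples, means) for x in s)

    mst = sst / (k - 1)
    mse = sse / (n - k)
    return sst, sse, mst, mse, mst / mse


# Hypothetical data with unequal sample sizes (illustration only).
samples = [[1, 2, 3], [4, 6], [7, 8, 9, 10]]
sst, sse, mst, mse, f = one_way_anova(samples)
print(round(sst, 3), round(sse, 3), round(mst, 3), round(mse, 3), round(f, 3))
```

The computed F would then be compared with the tabulated $F_{\alpha}$ for (k − 1) and (n − k) degrees of freedom; the script does not look up that critical value.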
The results of an analysis of variance are usually summarized and presented in an analysis of
variance (ANOVA) table. Such a table shows the sources of variation, their respective degrees
of freedom, sums of squares, mean squares, and computed F statistic. The results of the
analysis of variance for Example 10.3 are given in Table 10.10, and the general form of the
ANOVA table for a completely randomized design is shown in Table 10.11.
Table 10.11 Analysis of Variance Table for Completely Randomized Design
Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square          F
Between groups        SST              k − 1                MST = SST/(k − 1)    F = MST/MSE
Within groups         SSE              n − k                MSE = SSE/(n − k)
Total                 SS(Total)        n − 1
Example 10.4 Consider the problem of comparing the mean number of children born to
women in 10 provinces numbered from 1 to 10. The numbers of children born to 3448 women
from these provinces are randomly selected from the column headed CEB in Appendix
A. The women selected from the 10 provinces are considered to be the only ones of interest.
This ensures the assumption of equality between the population variances. Now, we want
to compare the mean numbers of children born to all women in these provinces, i.e., we
wish to test
$H_0: \mu_1 = \mu_2 = \cdots = \mu_{10}$
$H_a$: At least two population means differ
Solution We will use the SPSS package to carry out an analysis of variance. The syntax
and the printout of the SPSS procedure "One-Way ANOVA" for the analysis of
CEB by province follow.
ONEWAY
ceb BY province
/STATISTICS DESCRIPTIVES
/MISSING ANALYSIS .
ONEWAY
Descriptives
Children ever born
                                                    95% Confidence Interval for Mean
         N      Mean   Std. Deviation   Std. Error   Lower Bound   Upper Bound   Minimum   Maximum
1        228    2.40   1.55             .10          2.19          2.60          0         10
2        323    2.84   2.30             .13          2.59          3.09          0         11
3        302    3.15   2.09             .12          2.91          3.39          0         12
4        354    2.80   2.00             .11          2.59          3.01          0         10
5        412    2.53   1.61             7.93E-02     2.37          2.68          0         9
6        366    3.08   1.99             .10          2.88          3.29          0         11
7        402    3.26   1.83             9.13E-02     3.08          3.44          0         10
8        360    3.45   2.21             .12          3.23          3.68          0         11
9        297    3.87   2.66             .15          3.56          4.17          0         12
10       403    3.75   2.52             .13          3.51          4.00          0         12
Total    3448   3.13   2.15             3.66E-02     3.06          3.20          0         12
ANOVA
Children born
                  Sum of Squares   df     Mean Square   F        Sig.
Between Groups        702.326         9     78.036      17.621   .000
Within Groups       15221.007      3437      4.429
Total               15923.333      3446
From the printout we can see that the SPSS One-Way ANOVA procedure presents the results
in the form of an ANOVA table. The corresponding sums of squares and mean squares are:
SST = 702.326
SSE = 15221.007
MST = 78.036
MSE = 4.429
The computed value of the test statistic, given under the column heading F, is
F = 17.621
with $\nu_1 = 9$ degrees of freedom between provinces and $\nu_2 = 3437$ degrees of freedom within
provinces.
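The mean squares and the F value in the printout can be cross-checked from the sums of squares and degrees of freedom. This is only an arithmetic sketch in plain Python with the values copied from the ANOVA table above.

```python
# Values copied from the SPSS ANOVA printout.
ss_between, df_between = 702.326, 9
ss_within, df_within = 15221.007, 3437

mst = ss_between / df_between   # mean square between groups
mse = ss_within / df_within     # mean square within groups
f = mst / mse
ss_total = ss_between + ss_within

print(round(mst, 3), round(mse, 3), round(f, 3), round(ss_total, 3))
```

The recomputed values should match the printed mean squares, F and total sum of squares up to rounding.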
To determine whether to reject the null hypothesis
$H_0: \mu_1 = \mu_2 = \cdots = \mu_{10}$
in favor of the alternative
$H_a$: at least two population means are different
we may consult Table 4 of Appendix C for tabulated values of the F distribution corresponding
to an appropriately chosen significance level α. However, since the SPSS printout gives the
observed significance level (under the column heading Sig.) of the test, we will use this quantity
to assist us in reaching a conclusion. This quantity is the probability of obtaining an F statistic at
least as large as the one calculated when all population means are equal. If this probability is
small enough, the null hypothesis (all population means are equal) is rejected. In this example,
the observed significance level is approximately .0001. It implies that $H_0$ will be rejected at any
chosen level of α larger than .0001. Thus, there is very strong evidence of a difference among
the mean numbers of children ever born to women in the 10 provinces. The probability that this
procedure will lead to a Type I error is .0001.
Before ending our discussion of completely randomized designs, we make the following
comment. The proper application of the ANOVA procedure requires that certain assumptions be
satisfied, i.e., all k populations are approximately normal with equal variances. If you know, for
example, that one or more of the populations are nonnormal (e.g., highly skewed), then any
inferences derived from the ANOVA of the data are suspect. In this case, we can apply a
nonparametric technique.
10.8 Randomized block designs
Example 10.5 Three methods of treating beer cans are being compared by a panel of 5
people. Each person samples beer from each type of can and scores the beer with a
number (integer) between 0 and 6, 6 indicating a strong metallic taste and 0 meaning no
metallic taste. It is obvious that different people will use the scale somewhat differently,
and we shall take this into account when we compare the different types of can.
The data are reported in Table 10.12. This is an example of a situation in which the investigator
has data pertaining to k treatments (k = 3 types of can) in b blocks (b = 5 persons). We let $x_{gj}$
denote the observation corresponding to the gth treatment and the jth block, $\bar{x}_{g.}$ denote the
mean of the b observations for the gth treatment, $\bar{x}_{.j}$ the mean of the k observations in the jth
block, and $\bar{x}$ the overall mean of all n = kb observations. When this particular design is used,
the three types of can are presented to the individuals in random order.
An experimental design of this type is called a randomized blocks design. In agricultural
experiments the k treatments might correspond, for example, to k different fertilizers; the field
would be divided into blocks of presupposed similar fertility; and every fertilizer would be used in
each block so that differences in fertility of the soil in different parts of the field (blocks) would
not bias the comparison of the fertilizers. Each block would be subdivided into k sub-blocks,
called "plots." The k fertilizers would be randomly assigned to the plots in each block; hence the
name, "randomized blocks."
Table 10.12 Scores of three types of can on "metallic" scale
                      Person
Type of Can    P1    P2    P3    P4    P5    Sums
A               6     5     6     4     3     24
B               2     3     2     2     1     10
C               6     4     4     4     3     21
Sums           14    12    12    10     7     55
In general terms, we can define a randomized block design as a design in which k
treatments are compared within each of b blocks. Each block contains k matched experimental
units and the k treatments are randomly assigned, one to each of the units within each block.
Table 10.13 shows the pattern of a data set resulting from a randomized blocks design; it is a
two-way table with single measurements as entries. In the example people correspond to blocks
and cans to treatments. The observation $x_{gj}$ is called the response to treatment g in block j.
The treatment mean, $\bar{x}_{g.}$, estimates the population mean $\mu_g$ for treatment g (averaged out over
people). An objective may be to test the hypothesis that treatments make no difference,
$H_0: \mu_1 = \mu_2 = \cdots = \mu_k$
Table 10.13 Randomized Blocks Design
                          Blocks
Treatments    1           2           . . .    b
1             $x_{11}$    $x_{12}$    . . .    $x_{1b}$
2             $x_{21}$    $x_{22}$    . . .    $x_{2b}$
.             .           .                    .
.             .           .                    .
.             .           .                    .
k             $x_{k1}$    $x_{k2}$    . . .    $x_{kb}$
Each observation $x_{gj}$ can be written as a sum of meaningful terms by means of the identity
\[ x_{gj} = \bar{x} + (\bar{x}_{g.} - \bar{x}) + (\bar{x}_{.j} - \bar{x}) + (x_{gj} - \bar{x}_{g.} - \bar{x}_{.j} + \bar{x}). \]
In words,
(observed value for gth treatment in jth block) = (overall mean) + (deviation due to gth treatment) + (deviation due to jth block) + (residual).
The "residual" is
\[ x_{gj} - [\bar{x} + (\bar{x}_{g.} - \bar{x}) + (\bar{x}_{.j} - \bar{x})], \]
which is the difference between the observation and
\[ \bar{x} + (\bar{x}_{g.} - \bar{x}) + (\bar{x}_{.j} - \bar{x}), \]
obtained by taking into account the overall mean, the effect of the gth treatment, and the effect
of the jth block. Algebra shows that the corresponding decomposition is true for sums of
squares:
\[
\sum_{g=1}^{k} \sum_{j=1}^{b} (x_{gj} - \bar{x})^2 = b \sum_{g=1}^{k} (\bar{x}_{g.} - \bar{x})^2 + k \sum_{j=1}^{b} (\bar{x}_{.j} - \bar{x})^2 + \sum_{g=1}^{k} \sum_{j=1}^{b} (x_{gj} - \bar{x}_{g.} - \bar{x}_{.j} + \bar{x})^2
\]
that is,
SS(Total) = SS(Treatments) + SS(Blocks) + SS(Residuals).
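The decomposition can be verified numerically on the beer-can scores of Table 10.12 by computing each sum of squares from its definition. This is a minimal sketch in plain Python; the variable names are illustrative.

```python
# Scores from Table 10.12: rows are can types (treatments),
# columns are persons (blocks).
rows = [
    [6, 5, 6, 4, 3],  # can type A
    [2, 3, 2, 2, 1],  # can type B
    [6, 4, 4, 4, 3],  # can type C
]
k, b = len(rows), len(rows[0])

grand_mean = sum(map(sum, rows)) / (k * b)
treatment_means = [sum(r) / b for r in rows]
block_means = [sum(col) / k for col in zip(*rows)]

ss_total = sum((x - grand_mean) ** 2 for r in rows for x in r)
ss_treatments = b * sum((m - grand_mean) ** 2 for m in treatment_means)
ss_blocks = k * sum((m - grand_mean) ** 2 for m in block_means)
ss_residuals = sum(
    (rows[g][j] - treatment_means[g] - block_means[j] + grand_mean) ** 2
    for g in range(k)
    for j in range(b)
)

print(round(ss_total, 2), round(ss_treatments, 2),
      round(ss_blocks, 2), round(ss_residuals, 2))
```

Up to floating-point rounding, the treatment, block and residual sums of squares add up to the total sum of squares.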
The number of degrees of freedom of SS(Total) is kb − 1 = n − 1, the number of observations
less 1 for the overall mean.
The number of degrees of freedom of SS(Treatments) is k − 1, the number of treatments less 1
for the overall mean.
Similarly, the number of degrees of freedom of SS(Blocks) is b − 1. There remain, as the number
of degrees of freedom for SS(Residuals),
kb − 1 − (k − 1) − (b − 1) = (k − 1)(b − 1).
There is a hypothetical model behind the analysis. It assumes that in repeated experiments
the measurement for the gth treatment in the jth block would be the sum of a constant
pertaining to the treatment, namely $\mu_g$, a constant pertaining to the jth block, and a random
"error" term with a variance of $\sigma^2$. The mean square for residuals,
MS(Residuals) = SS(Residuals) / [(k − 1)(b − 1)]
is an unbiased estimate of $\sigma^2$ regardless of whether the $\mu_g$'s differ (that is, whether there are true
effects due to treatments). If there are no differences in the $\mu_g$'s,
MS(Treatments) = SS(Treatments) / (k − 1)
is an unbiased estimate of $\sigma^2$ (whether or not there are true effects due to blocks). If there are
differences among the $\mu_g$'s, then MS(Treatments) will tend to be larger than $\sigma^2$. One tests $H_0$ by
means of
F = MS(Treatments) / MS(Residuals)
When $H_0$ is true, F is distributed as an F-distribution based on (k − 1) numerator df and
(k − 1)(b − 1) denominator df. One rejects $H_0$ if F is sufficiently large, that is, if F exceeds $F_{\alpha}$.
Table 10.14 is the analysis of variance table.
Table 10.14 Analysis of variance table for randomized blocks design
Sources of variation   Sum of squares                                                                        Degrees of freedom   Mean square      F
Treatments             $b \sum_{g=1}^{k} (\bar{x}_{g.} - \bar{x})^2$                                         k − 1                MS(Treatments)   MS(Treatments) / MS(Residuals)
Blocks                 $k \sum_{j=1}^{b} (\bar{x}_{.j} - \bar{x})^2$                                         b − 1                MS(Blocks)       MS(Blocks) / MS(Residuals)
Residuals              $\sum_{g=1}^{k} \sum_{j=1}^{b} (x_{gj} - \bar{x}_{g.} - \bar{x}_{.j} + \bar{x})^2$    (k − 1)(b − 1)       MS(Residuals)
Total                  $\sum_{g=1}^{k} \sum_{j=1}^{b} (x_{gj} - \bar{x})^2 = SS(\text{Total})$               n − 1
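The computational formulas are
\[ SS(\text{Total}) = \sum_{g=1}^{k} \sum_{j=1}^{b} x_{gj}^2 - \frac{1}{kb} \left( \sum_{g=1}^{k} \sum_{j=1}^{b} x_{gj} \right)^2 \]
\[ SS(\text{Treatments}) = \frac{1}{b} \sum_{g=1}^{k} \left( \sum_{j=1}^{b} x_{gj} \right)^2 - \frac{1}{kb} \left( \sum_{g=1}^{k} \sum_{j=1}^{b} x_{gj} \right)^2 \]
\[ SS(\text{Blocks}) = \frac{1}{k} \sum_{j=1}^{b} \left( \sum_{g=1}^{k} x_{gj} \right)^2 - \frac{1}{kb} \left( \sum_{g=1}^{k} \sum_{j=1}^{b} x_{gj} \right)^2 \]
SS(Residuals) = SS(Total) − SS(Treatments) − SS(Blocks)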
For the data in Table 10.12 we have
\[ \sum_{g=1}^{k} \sum_{j=1}^{b} x_{gj}^2 = 6^2 + 5^2 + \cdots + 3^2 = 237 \]
\[ \frac{1}{kb} \left( \sum_{g=1}^{k} \sum_{j=1}^{b} x_{gj} \right)^2 = \frac{55^2}{15} = 201.67 \]
SS(Total) = 237 − 201.67 = 35.33
\[ SS(\text{Treatments}) = \frac{24^2 + 10^2 + 21^2}{5} - 201.67 = 223.40 - 201.67 = 21.73 \]
\[ SS(\text{Blocks}) = \frac{14^2 + 12^2 + 12^2 + 10^2 + 7^2}{3} - 201.67 = 211 - 201.67 = 9.33 \]
SS(Residuals) = 35.33 − 21.73 − 9.33 = 4.27
The analysis of variance table is Table 10.15. From Table 4 in Appendix C, the tabulated value
of $F_{.05}$ with 2 and 8 df is 4.46. Therefore, we will reject $H_0$ if the calculated value of F exceeds 4.46.
Since the computed value of the test statistic,
F = 20.40, exceeds 4.46, we have sufficient evidence to reject the null hypothesis of no
difference in metallic taste among the types of can at α = .05.
Table 10.15 Analysis of variance table for "Metallic" scale
Sources of variation   Sum of squares   Degrees of freedom   Mean square   F
Cans                       21.73                2               10.87      20.40
Persons                     9.33                4                2.33       4.38
Residual                    4.27                8                0.533
Total                      35.33               14
The roles of cans and people can be interchanged. To test the hypothesis that there are no
differences in scoring among persons (in the hypothetical population of repeated experiments),
one uses the ratio of MS(Blocks) to MS(Residuals) and rejects the null hypothesis if that ratio
is greater than an F-value for b − 1 and
(k − 1)(b − 1) degrees of freedom. The value here of 4.38 is referred to Table 4 of Appendix C
with 4 and 8 degrees of freedom, for which the 5% point is 3.84; it is barely significant.
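Both F ratios for the beer-can data can be reproduced with the computational (totals-based) formulas. The sketch below is plain Python; the critical values 4.46 and 3.84 are the tabulated values quoted in the text, not computed by the script. Because no intermediate rounding is done, the ratios may differ slightly in the second decimal from hand calculations based on rounded mean squares.

```python
# Scores from Table 10.12: rows are can types, columns are persons.
rows = [
    [6, 5, 6, 4, 3],  # can type A
    [2, 3, 2, 2, 1],  # can type B
    [6, 4, 4, 4, 3],  # can type C
]
k, b = len(rows), len(rows[0])

grand_total = sum(map(sum, rows))
correction = grand_total ** 2 / (k * b)   # (1/kb) * (sum of all x)^2

ss_total = sum(x * x for r in rows for x in r) - correction
ss_treatments = sum(sum(r) ** 2 for r in rows) / b - correction
ss_blocks = sum(sum(col) ** 2 for col in zip(*rows)) / k - correction
ss_residuals = ss_total - ss_treatments - ss_blocks

ms_treatments = ss_treatments / (k - 1)
ms_blocks = ss_blocks / (b - 1)
ms_residuals = ss_residuals / ((k - 1) * (b - 1))

f_treatments = ms_treatments / ms_residuals
f_blocks = ms_blocks / ms_residuals

# Tabulated 5% points quoted in the text: F(2, 8) = 4.46, F(4, 8) = 3.84.
print(round(f_treatments, 3), f_treatments > 4.46)
print(round(f_blocks, 3), f_blocks > 3.84)
```

Both ratios exceed their 5% points, in line with the conclusions drawn in the text.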
10.9 Multiple comparisons of means and confidence regions
The F-test gives information about all means $\mu_1, \mu_2, \ldots, \mu_k$ simultaneously. In this section we
consider inferences about differences of pairs of means. Instead of simply concluding that some
of $\mu_1, \mu_2, \ldots, \mu_k$ are different, we may conclude that specific pairs $\mu_g$, $\mu_h$ are different.
The variance of the difference between two means, say $\bar{x}_1$ and $\bar{x}_2$, is $\sigma^2 (1/n_1 + 1/n_2)$, which is
estimated as $s^2 (1/n_1 + 1/n_2)$. The corresponding estimated standard deviation is
\[ s \sqrt{1/n_1 + 1/n_2}. \]
If one were interested simply in determining whether the first two population means differed, one
would test the null hypothesis that $\mu_1 = \mu_2$ at significance level α by using a t-test, rejecting the
null hypothesis if
\[ \frac{|\bar{x}_1 - \bar{x}_2|}{s \sqrt{1/n_1 + 1/n_2}} > t_{\alpha/2} \]
where the number of degrees of freedom for the t-value is the number of degrees of freedom for
s. However, now we want to consider each possible difference $\mu_g - \mu_h$; that is, we want to test all
the null hypotheses
$H_{gh}: \mu_g = \mu_h$, with g ≠ h; g, h = 1, . . . , k.
There are k(k − 1)/2 such hypotheses.
If, indeed, all the $\mu$'s were equal, so that there were no real differences, the probability that any
particular one of the pairwise differences in absolute value would exceed the relevant t-value is
α. Hence the probability that at least one of them would exceed the t-value would be greater
than α. When many differences are tested, the probability that some will appear to be
"significant" is greater than the nominal significance level α when all the null hypotheses are
true. How can one eliminate this false significance? It can be shown that, if m comparisons are
to be made and the overall Type I error probability is to be at most α, it is sufficient to use α/m
for the significance level of the individual tests. By overall Type I error we mean concluding
$\mu_g \neq \mu_h$ for at least one pair g, h when actually $\mu_1 = \mu_2 = \cdots = \mu_k$.
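The "it can be shown" step above is usually justified by the Bonferroni (union bound) inequality. As a brief sketch, let $A_i$ denote the event that the ith individual test rejects its null hypothesis when all the means are in fact equal; this notation is introduced here only for illustration. If each test is carried out at level α/m, then $P(A_i) = \alpha/m$ and

\[ P\left( \bigcup_{i=1}^{m} A_i \right) \le \sum_{i=1}^{m} P(A_i) = m \cdot \frac{\alpha}{m} = \alpha. \]

The bound holds whether or not the individual tests are independent, which matters here because the pairwise comparisons share the same sample means and the same estimate s.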
Example 10.5 We illustrate with Example 10.3 (Tables 10.6 and 10.9). Here
$s^2 = 1.867$, based on 15 degrees of freedom (s = 1.366). Since all the sample sizes are 6,
the value with which to compare each difference $\bar{x}_g - \bar{x}_h$ is
\[ t_{\alpha^*/2} \times s \times \sqrt{1/n_1 + 1/n_2} = t_{\alpha^*/2} \times 1.366 \times \sqrt{1/6 + 1/6} = t_{\alpha^*/2} \times 1.366 \times \sqrt{1/3} = t_{\alpha^*/2} \times .789 \]
where $\alpha^*$ is to be the level of the individual tests.
The number of comparisons to be made for k = 3 is k(k  1)/2 = 3 = m. If we want the overall
Type I error probability to be at most .03, then it suffices to choose the level α
*
to be .03/3 = .01.
The corresponding percentage point of Student's tdistribution with 15 degrees of freedom is
t
.01/2
= t
.005
= 2.947. The value with which to compare
h g
x x − is .789 x 2.947 = 2.33. In Table
10.6 the means are . 5 , 9 , 4
3 2 1
= = = x x x
The difference 5 4 9
1 2
= − = − x x is significant; so is the 4 5 9
3 2
= − = − x x . The difference
1 4 5
1 3
= − = − x x is not significant. The conclusion is that u
2
is different from both u
1
and u
3
, but
u
1
and u
3
may be equal; Workbook 2 appears to be superior.
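The comparison procedure of Example 10.5 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the critical value 2.947 is copied from the text's table rather than computed, and the variable names are our own.

```python
from itertools import combinations
from math import sqrt

# Values quoted in Example 10.5.
means = {1: 4.0, 2: 9.0, 3: 5.0}   # sample means from Table 10.6
n = 6                               # observations in each sample
s = 1.366                           # pooled standard deviation (15 df)
overall_alpha = 0.03                # desired overall Type I error bound

pairs = list(combinations(means, 2))
m = len(pairs)                      # k(k - 1)/2 comparisons
alpha_star = overall_alpha / m      # level of each individual test

# Assumption: t_{alpha*/2} for 15 df is read from the table in the text.
t_crit = 2.947
threshold = t_crit * s * sqrt(1 / n + 1 / n)

results = {}
for g, h in pairs:
    diff = abs(means[g] - means[h])
    results[(g, h)] = diff > threshold
    print(f"|mean {g} - mean {h}| = {diff:.1f}, significant: {results[(g, h)]}")
```

The unrounded threshold comes out slightly below the 2.33 quoted in the example because the text rounds the intermediate factor to .789; the conclusions about the three pairs are unaffected.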
Confidence Regions
With confidence at least 1 − α, the following inequalities hold:
$$\bar{x}_g - \bar{x}_h - t_{\alpha^*/2}\, s\sqrt{1/n_g + 1/n_h} < \mu_g - \mu_h < \bar{x}_g - \bar{x}_h + t_{\alpha^*/2}\, s\sqrt{1/n_g + 1/n_h}$$
for g ≠ h; g, h = 1, ..., k, if $\alpha^* = \alpha/m$ and the distribution of t is based on (n − k)
degrees of freedom.
10.10 Summary
This chapter presented an extension of the methods for comparing two population means to
allow for the comparison of more than two means. The completely randomized design uses
independent random samples selected from each of k populations. The comparison of the
population means is made by comparing the variance among the sample means, as measured
by the mean square for treatments (MST), to the variation attributable to differences within the
samples, as measured by the mean square for error (MSE). If the ratio of MST to MSE is large,
we conclude that a difference exists between the means of at least two of the k populations.
We also presented an analysis of variance for a comparison of two or more population means
using matched groups of experimental units in a randomized block design, an extension of the
matched-pairs design. The design not only allows us to test for differences among the treatment
means, but also enables us to test for differences among block means. By testing for
differences among block means, we can determine whether blocking is effective in reducing the
variation present when comparing the treatment means.
Remember that the proper application of these ANOVA techniques requires that certain
assumptions be satisfied. In most applications, the assumptions will not be satisfied exactly.
However, these analysis of variance procedures are flexible in the sense that slight departures
from the assumptions will not significantly affect the analysis or the validity of the resulting
inferences.
10.11 Exercises
10.1. A random sample of n = 500 observations was allocated to the k = 5 categories shown
in the table. Suppose we want to test the null hypothesis that the category probabilities
are $p_1 = .1$, $p_2 = .1$, $p_3 = .5$, $p_4 = .1$, and $p_5 = .2$.

Category     1     2     3     4     5     Total
Count        27    62    241   69    101   500

a. Calculate the expected cell counts.
b. Find $\chi^2_{\alpha}$ for α = .05.
c. State the alternative hypothesis for the test.
d. Do the data provide sufficient evidence to indicate that the null hypothesis is false?
10.2. Refer to the accompanying 2 × 3 contingency table.

                 Columns
                 1     2     3     Totals
Rows     1       14    37    23    74
         2       21    32    38    91
Totals           35    69    61    165

a. Calculate the estimated expected cell counts for the contingency table.
b. Calculate the chi-square statistic for the table.
10.3. A partially completed ANOVA table for a completely randomized design is shown here.

Source            SS      df     MS     F
Between groups    24.7    4      ___    ___
Within groups     ___     ___    ___
Total             62.4    34

a. Complete the ANOVA table.
b. How many treatments are involved in the experiment?
c. Do the data provide sufficient evidence to indicate a difference among the population
means? Test using α = .10.
10.4. A randomized block design was conducted to compare the mean responses for three
treatments, A, B, and C, in four blocks. The data are shown in the accompanying table,
followed by a partial summary ANOVA table.

             Block
Treatment    1    2    3    4
A            3    6    1    2
B            5    7    4    6
C            2    3    2    2

Source        SS        df     MS       F
Treatments    23.167    ___    ___      ___
Blocks        14.250    ___    4.750    ___
Residuals     ___       ___    .917
Total         42.917    ___

a. Complete the ANOVA table.
b. Do the data provide sufficient evidence to indicate a difference among treatment
means? Test using α = .05.
c. Do the data provide sufficient evidence to indicate that blocking was effective in
reducing the experimental error? Test using α = .10.
d. What assumptions must the data satisfy to make the F test in parts b and c valid?
10.5. At the 5% level make the F-test of equality of population (treatment) means for the data
in the table.

             Blocks
Treatment    1    2     3
1            1    4     9
2            4    9     16
3            9    16    23
Chapter 11 Simple Linear regression and correlation
CONTENTS
11.1 Introduction: Bivariate relationships
11.2 Simple Linear Regression: Assumptions
11.3 Estimating A and B: the method of least squares
11.4 Estimating $\sigma^2$
11.5 Making inferences about the slope, B
11.6 Correlation analysis
11.7 Using the model for estimation and prediction
11.8 Simple Linear Regression: An Example
11.9 Summary
11.10 Exercises
11.1 Introduction: Bivariate relationships
The subject of this chapter is determining the relationship between variables.
In Chapter 10 we used chi-square tests of independence to determine whether a statistical
relationship existed between two variables. The chi-square test tells us if there is such a
relationship, but it does not tell us what the relationship is. Regression and correlation analyses
will show how to determine both the nature and the strength of a relationship between two
variables.
The term “regression” was first used as a statistical concept by Sir Francis Galton. He
used the word regression as the name of the general process of predicting one variable
(the height of the children) from another (the height of the parent). Later, statisticians coined
the term multiple regression to describe the process by which several variables are used to
predict another.
In regression analysis we shall develop an estimating equation, that is, a mathematical formula
that relates the known variables to the unknown variable. Then, after we have learned the
pattern of this relationship, we can apply correlation analysis to determine the degree to which
the variables are related. Correlation analysis tells us how well the estimating equation actually
describes the relationship.
Types of relationships
Regression and correlation analyses are based on the relationship or association between two
or more variables.
Definition 11.1
The relationship between two random variables is known as a bivariate relationship.
The known variable (or variables) is called the independent variable(s). The variable
we are trying to predict is the dependent variable.
Example 11.1 A farmer may be interested in the relationship between the level of fertilizer x
and the yield of potatoes y. Here the level of fertilizer x is the independent variable and the
yield of potatoes y is the dependent variable.
Example 11.2 A medical researcher may be interested in the bivariate relationship between a
patient’s blood pressure x and heart rate y. Here x is the independent variable and y is the
dependent variable.
Example 11.3 Economists might base their predictions of the annual gross domestic product
(GDP) on the final consumption spending within the economy. Then, the final consumption
spending is the independent variable, and the GDP would be the dependent variable.
In regression analysis we can have only one dependent variable in our estimating equation.
However, we can use more than one independent variable. We often add independent variables
in order to improve the accuracy of our prediction.
Definition 11.2
If the dependent variable y increases when the independent variable x increases,
then the relationship between x and y is a direct relationship. If the dependent
variable y decreases as the independent variable x increases, we call the relationship
inverse.
Scatter diagrams
The first step in determining whether there is a relationship between two variables is to
examine the graph of the observed (or known) data, i.e. of the data points.
Definition 11.3
The graph of the data points is called a scatter diagram or scatter gram.
Example 11.4 In recent years, physicians have used the so-called diving reflex to reduce
abnormally rapid heartbeats in humans by submerging the patient’s face in cold water. A
research physician conducted an experiment to investigate the effects of various cold
temperatures on the pulse rates of ten small children. The results are presented in Table 11.1.
Table 11.1 Temperature of water – Pulse rate data

Child    Temperature of water, x (°F)    Reduction in pulse, y (beats/minute)
1        68                              2
2        65                              5
3        70                              1
4        62                              10
5        60                              9
6        55                              13
7        58                              10
8        65                              3
9        69                              4
10       63                              6

The scatter gram of the data set in Table 11.1 is depicted in Figure 11.1.
Figure 11.1 Scatter gram for the data in Table 11.1
From the scatter gram we can visualize the relationship that exists between the two variables.
As a result we can draw or “fit” a straight line through our scatter gram to represent the
relationship. We have done this in Figure 11.2.
Figure 11.2 Scatter gram with straight line representing the relationship between x and y
“fitted” through it
We see that the pattern formed by the data points is well described by a straight line.
Thus, we can say that it is a linear relationship. This relationship, as we see, is inverse because
y decreases as x increases.
Example 11.5 To model the relationship between the CO (carbon monoxide) ranking, y, and
the nicotine content, x, of an American-made cigarette, the Federal Trade Commission tested a
random sample of 5 cigarettes. The CO ranking and nicotine content values are given in Table
11.2.
Table 11.2 CO Ranking – Nicotine Content Data

Cigarette    Nicotine content, x, mgs    CO ranking, y, mgs
1            0.2                         2
2            0.4                         10
3            0.6                         13
4            0.8                         15
5            1                           20

The scatter gram with straight line representing the relationship between Nicotine Content x and
CO Ranking y “fitted” through it is depicted in Figure 11.3. From this we see that the
relationship here is direct.
Figure 11.3 Scatter gram with straight line representing the relationship between x and y
“fitted” through it (horizontal axis: nicotine content x, mgs; vertical axis: CO ranking y, mgs)
11.2 Simple Linear regression: Assumptions
Suppose we believe that the value of y tends to increase or decrease in a linear manner as x
increases. Then we could select a model relating y to x by drawing a line which is well fitted to a
given data set. Such a deterministic model – one that does not allow for errors of prediction –
might be adequate if all of the data points fell on the fitted line. However, you can see that this
idealistic situation will not occur for the data of Tables 11.1 and 11.2. No matter how you draw a
line through the points in Figure 11.2 and Figure 11.3, at least some of the points will deviate
substantially from the fitted line.
The solution to the preceding problem is to construct a probabilistic model relating y to x, one
that acknowledges the random variation of the data points about a line. One type of probabilistic
model, a simple linear regression model, makes the assumption that the mean value of y for a given
value of x graphs as a straight line and that points deviate about this line of means by a random
amount equal to e, i.e.
y = A + B x + e,
where A and B are unknown parameters of the deterministic (nonrandom) portion of the model.
If we suppose that the points deviate above or below the line of means with expected value
E(e) = 0, then the mean value of y is
E(y) = A + B x.
Therefore, the mean value of y for a given value of x, represented by the symbol E(y), graphs as a
straight line with y-intercept A and slope B.
A graph of the hypothetical line of means, E(y) = A + B x, is shown in Figure 11.4.
Figure 11.4 The straight line of means
A SIMPLE LINEAR REGRESSION MODEL
y = A + B x + e,
where
y = dependent variable (variable to be modeled, sometimes called
the response variable)
x = independent variable (variable used as a predictor of y)
e = random error
A = y-intercept of the line
B = slope of the line
In order to fit a simple linear regression model to a set of data, we must find estimators for the
unknown parameters A and B of the line of means E(y) = A + B x. Since the sampling distributions
of these estimators will depend on the probability distribution of the random error e, we must
first make specific assumptions about its properties.
ASSUMPTIONS REQUIRED FOR A LINEAR REGRESSION MODEL
1. The mean of the probability distribution of the random error is 0, E(e) =
0. That is, the average of the errors over an infinitely long series of
experiments is 0 for each setting of the independent variable x. This
assumption implies that the mean value of y, E(y), for a given value of x is
E(y) = A + B x.
2. The variance of the random error is equal to a constant, say $\sigma^2$, for all
values of x.
3. The probability distribution of the random error is normal.
4. The errors associated with any two different observations are
independent. That is, the error associated with one value of y has no
effect on the errors associated with other values.
11.3 Estimating A and B: the method of least squares
The first problem of simple regression analysis is to find estimators of A and B of the regression
model based on sample data.
Suppose we have a sample of n data points $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$. The
straight-line model for the response y in terms of x is
y = A + B x + e.
The line of means is E(y) = A + B x and the line fitted to the sample data is $\hat{y} = a + bx$.
Thus, $\hat{y}$ is an estimator of the mean value of y and a predictor of some future value of y;
and a, b are estimators of A and B, respectively.
For a given data point, say the point $(x_i, y_i)$, the observed value of y is $y_i$ and the
predicted value of y would be
$$\hat{y}_i = a + b x_i$$
and the deviation of the ith value of y from its predicted value is
$y_i - \hat{y}_i = y_i - (a + b x_i)$. The sum of the squares of these deviations over all n data
points is
$$SSE = \sum_{i=1}^{n} [y_i - (a + b x_i)]^2.$$
The values of a and b that make the SSE a minimum are called the least squares estimators of the
population parameters A and B, and the prediction equation $\hat{y} = a + bx$ is called the least
squares line.
Definition 11.4
The least squares line is one that has a smaller SSE than any other straight-line model.
FORMULAS FOR THE LEAST SQUARES ESTIMATORS
Slope: $b = \dfrac{SS_{xy}}{SS_{xx}}$, y-intercept: $a = \bar{y} - b\bar{x}$
where
$$SS_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}), \qquad SS_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2,$$
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i,$$
n = sample size
Example 11.6 Refer to Example 11.5. Find the best-fitting straight line through the sample
data points.
Solution By the least squares method we found the equation of the best-fitting straight line. It
is $\hat{y} = -0.3 + 20.5x$. The graph of this line is shown in Figure 11.5.
Figure 11.5 Least squares line for Example 11.6
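The solution of Example 11.6 can be reproduced directly from the least squares formulas. The following Python sketch is illustrative only; the function name `least_squares` is our own and the data are those of Table 11.2.

```python
def least_squares(x, y):
    """Return (a, b) for the least squares line y_hat = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = ss_xy / ss_xx          # slope
    a = y_bar - b * x_bar      # y-intercept
    return a, b

# Data of Table 11.2 (nicotine content x, CO ranking y).
nicotine = [0.2, 0.4, 0.6, 0.8, 1.0]
co_ranking = [2, 10, 13, 15, 20]

a, b = least_squares(nicotine, co_ranking)
print(f"y_hat = {a:.1f} + {b:.1f} x")
```

Here $SS_{xy} = 8.2$ and $SS_{xx} = 0.4$, so the slope is 20.5 and the intercept is 12 − 20.5 × 0.6 = −0.3, in agreement with the line quoted in the solution.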
11.4 Estimating $\sigma^2$
In most practical situations, the variance $\sigma^2$ of the random error e will be unknown and
must be estimated from the sample data. Since $\sigma^2$ measures the variation of the y values
about the regression line, it seems intuitively reasonable to estimate $\sigma^2$ by dividing the
total error SSE by an appropriate number.
ESTIMATION OF $\sigma^2$
$$s^2 = \frac{SSE}{\text{Degrees of freedom for error}} = \frac{SSE}{n-2}$$
where
$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
From the following theorem it is possible to prove that $s^2$ is an unbiased estimator of
$\sigma^2$, that is, $E(s^2) = \sigma^2$.
Theorem 11.1
Let $s^2 = \dfrac{SSE}{n-2}$. Then, when the assumptions of Section 11.2 are satisfied, the statistic
$$\chi^2 = \frac{SSE}{\sigma^2} = \frac{(n-2)s^2}{\sigma^2}$$
has a chi-square distribution with ν = (n − 2) degrees of freedom.
Usually, s is referred to as the standard error of estimate.
Example 11.7 Refer to Example 11.5. Estimate the value of the error variance $\sigma^2$.
Data analysis or statistical software packages provide procedures or functions for computing the
standard error of estimate s. For example, the function STEYX of MS Excel gives, for the data
of Example 11.5, the result s = 1.816590.
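The same value can be checked by hand from the definition $s^2 = SSE/(n-2)$. The sketch below is an illustration in Python, assuming the least squares estimates a = −0.3 and b = 20.5 from Example 11.6; it does not depend on any spreadsheet function.

```python
from math import sqrt

# Data of Table 11.2 and the fitted line of Example 11.6.
x = [0.2, 0.4, 0.6, 0.8, 1.0]
y = [2, 10, 13, 15, 20]
a, b = -0.3, 20.5

# Residuals y_i - y_hat_i and their sum of squares.
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
sse = sum(r ** 2 for r in residuals)

s2 = sse / (len(x) - 2)    # estimate of the error variance
s = sqrt(s2)               # standard error of estimate
print(f"SSE = {sse:.2f}, s^2 = {s2:.2f}, s = {s:.6f}")
```

The residuals are −1.8, 2.1, 1.0, −1.1 and −0.2, so SSE = 9.9 and $s^2$ = 9.9/3 = 3.3, whose square root agrees with the value 1.816590 quoted above.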
Recall that the least squares line estimates the mean value of y for a given value of x. Since s
measures the spread of distribution of y values about the least squares line, most observations
will lie within 2s of the least squares line.
INTERPRETATION OF s, THE ESTIMATED STANDARD DEVIATION OF e
We expect most of the observed y values to lie within 2s of their respective least
squares predicted values $\hat{y}$.
11.5 Making inferences about the slope, B
In Section 11.2 we proposed the probabilistic model y = A + B x + e for the relationship between
two random variables x and y, where x is the independent variable and y is the dependent variable,
A and B are unknown parameters, and e is a random error. Under the assumptions made on the
random error e we have E(y) = A + B x. This is the population regression line. If we are given a
sample of n data points $(x_i, y_i)$, i = 1, ..., n, then by the least squares method in Section
11.3 we can find the straight line $\hat{y} = a + bx$ fitted to these sample data. This line is the
sample regression line. It is an estimate of the population regression line. We should be able to
use it to make inferences about the population regression line. In this section we shall make
inferences about the slope B of the “true” regression equation that are based upon the slope b
of the sample regression equation.
The theoretical background for making inferences about the slope B lies in the following
properties of the least squares estimator b:
PROPERTIES OF THE LEAST SQUARES ESTIMATOR b
1. Under the assumptions in Section 11.2, b will possess a sampling distribution that
is normally distributed.
2. The mean of the least squares estimator b is B, E(b) = B, that is, b is an
unbiased estimator for B.
3. The standard deviation of the sampling distribution of b is
$$\sigma_b = \frac{\sigma}{\sqrt{SS_{xx}}},$$
where σ is the standard deviation of the random error e and
$SS_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2$.
We will use these results to test hypotheses about and to construct a confidence interval for the
slope B of the population regression line.
Since σ is usually unknown, we use its estimator s, and instead of
$\sigma_b = \dfrac{\sigma}{\sqrt{SS_{xx}}}$ we use its estimate $s_b = \dfrac{s}{\sqrt{SS_{xx}}}$.
For testing hypotheses about B, we first state the null and alternative hypotheses:
$$H_0: B = B_0$$
$$H_a: B \neq B_0 \quad (\text{or } B < B_0 \text{ or } B > B_0)$$
where $B_0$ is the hypothesized value of B.
Often, one tests whether B = 0 or not, that is, whether x does or does not contribute
information for the prediction of y. The setup of our test of utility of the model is summarized in
the box.
A TEST OF MODEL UTILITY
ONE-TAILED TEST
$H_0: B = 0$
$H_a: B < 0$ (or $B > 0$)
Test statistic: $t = \dfrac{b}{s_b} = \dfrac{b}{s/\sqrt{SS_{xx}}}$
Rejection region: $t < -t_{\alpha}$ (or $t > t_{\alpha}$),
where $t_{\alpha}$ is based on (n − 2) df.
TWO-TAILED TEST
$H_0: B = 0$
$H_a: B \neq 0$
Test statistic: $t = \dfrac{b}{s_b} = \dfrac{b}{s/\sqrt{SS_{xx}}}$
Rejection region: $t < -t_{\alpha/2}$ or $t > t_{\alpha/2}$,
where $t_{\alpha/2}$ is based on (n − 2) df.
The values of $t_{\alpha}$ such that $P(t \geq t_{\alpha}) = \alpha$ are given in Table 7.4.
Example 11.8 Refer to the nicotine-carbon monoxide ranking problem of Example 11.5. At
significance level α = 0.05, test the hypothesis that the nicotine content of a cigarette
contributes useful information for the prediction of carbon monoxide ranking y, i.e. test the
prediction ability of the least squares straight line model $\hat{y} = -0.3 + 20.5x$.
Solution Testing the usefulness of the model requires testing the hypothesis
$$H_0: B = 0$$
$$H_a: B \neq 0$$
With n = 5 and α = 0.05, the critical value based on (5 − 2) = 3 df is obtained from Table 7.4:
$$t_{\alpha/2} = t_{0.025} = 3.182.$$
Thus, we will reject $H_0$ if t < −3.182 or t > 3.182.
In order to compute the test statistic we need the values of b, s and $SS_{xx}$. In Example 11.6
we computed b = 20.5. From Example 11.7 we know s = 1.82, and we can compute $SS_{xx} = 0.4$.
Hence, the test statistic is
$$t = \frac{b}{s/\sqrt{SS_{xx}}} = \frac{20.5}{1.82/\sqrt{0.4}} = 7.12$$
Since the calculated t-value is greater than the critical value $t_{0.025} = 3.182$, we reject the
null hypothesis and conclude that the slope B ≠ 0. At the significance level α = 0.05, the sample
data provide sufficient evidence to conclude that nicotine content does contribute useful
information for prediction of carbon monoxide ranking using the linear model.
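The test of Example 11.8 reduces to a short calculation, sketched below in Python. The critical value 3.182 is taken from the text's table rather than computed, and the variable names are illustrative.

```python
from math import sqrt

# Quantities quoted in Example 11.8.
b = 20.5        # least squares slope
s = 1.82        # standard error of estimate (rounded, as in the text)
ss_xx = 0.4     # sum of squared deviations of x
t_crit = 3.182  # t_{0.025} with 3 df, read from the table

s_b = s / sqrt(ss_xx)          # estimated standard deviation of b
t = b / s_b                    # test statistic for H0: B = 0
reject_h0 = abs(t) > t_crit    # two-tailed rejection rule

print(f"s_b = {s_b:.3f}, t = {t:.2f}, reject H0: {reject_h0}")
```

Using the unrounded s = 1.8166 instead of 1.82 would change the statistic only in the second decimal place and would not alter the conclusion.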
Example 11.9 A consumer investigator obtained the following least squares straight line model
(based on a sample of n = 100 families) relating the yearly food cost y for a family of 4 to
annual income x:
$$\hat{y} = 467 + 0.26x.$$
In addition, the investigator computed the quantities s = 1.1, $SS_{xx} = 26$. Compute the observed
p-value for a test to determine whether mean yearly food cost y increases as annual income x
increases, i.e., whether the slope of the population regression line B is positive.
Solution The consumer investigator wants to test
$$H_0: B = 0$$
$$H_a: B > 0$$
To compute the observed significance level (p-value) of the test we must first find the
calculated value of the test statistic, $t_c$. Since b = 0.26, s = 1.1, and $SS_{xx} = 26$ we have
$$t_c = \frac{b}{s/\sqrt{SS_{xx}}} = \frac{0.26}{1.1/\sqrt{26}} = 1.21$$
The observed significance level or p-value is given by
$P(t > t_c) = P(t > 1.21)$, where the t-distribution is based on (n − 2) = (100 − 2) = 98 df.
Since df > 30 we can approximate the t-distribution with the z-distribution. Thus,
p-value = P(t > 1.21) ≈ P(z > 1.21) = 0.5 − 0.3869 = 0.1131.
In order to conclude that the mean yearly food cost increases as annual income increases (B >
0) we must tolerate α ≥ 0.1131. This is a big risk, and usually we take α = 0.05. At this
significance level we cannot reject the hypothesis $H_0$. It means we consider the sample result
to be statistically insignificant.
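The p-value of Example 11.9 can be approximated without a table by using the standard normal distribution, as the example itself does. The Python sketch below relies only on the standard library; the helper `normal_upper_tail` is our own name, not part of any package.

```python
from math import erfc, sqrt

def normal_upper_tail(z):
    """P(Z > z) for a standard normal variable Z."""
    return 0.5 * erfc(z / sqrt(2))

# Quantities quoted in Example 11.9.
b, s, ss_xx = 0.26, 1.1, 26.0

t_c = b / (s / sqrt(ss_xx))        # calculated test statistic
p_value = normal_upper_tail(t_c)   # normal approximation, since df = 98 > 30

print(f"t_c = {t_c:.3f}, approximate p-value = {p_value:.4f}")
```

Because the sketch keeps the unrounded statistic (about 1.205 rather than 1.21), the p-value it prints is slightly larger than the tabulated 0.1131; either way it is well above α = 0.05.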
Another way to make inferences about the slope B is to estimate it using a confidence interval.
This interval is formed as shown in the box.
A (1 − α)100% CONFIDENCE INTERVAL FOR THE SLOPE B
$$b \pm t_{\alpha/2}\, s_b,$$
where $s_b = \dfrac{s}{\sqrt{SS_{xx}}}$ and $t_{\alpha/2}$ is based on (n − 2) df.
Example 11.10 Find the 95% confidence interval for B in Example 11.8.
Solution For a 95% confidence interval α = 0.05. Therefore, we need to find the value of
$t_{\alpha/2} = t_{0.025}$ based on (5 − 2) = 3 df. In Example 11.8 we found that
$t_{0.025} = 3.182$. Also, we have b = 20.5, s = 1.82, $SS_{xx} = 0.4$. Thus, a 95% confidence
interval for the slope in the model relating carbon monoxide to nicotine content is
$$b \pm t_{\alpha/2}\left(\frac{s}{\sqrt{SS_{xx}}}\right) = 20.5 \pm 3.182\left(\frac{1.82}{\sqrt{0.4}}\right) = 20.5 \pm 9.16$$
Our interval estimate of the slope parameter B is then 11.34 to 29.66. Since all the values in
this interval are positive, it appears that B is positive and that the mean of y, E(y), increases
as x increases.
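The interval of Example 11.10 follows from the box formula in one step. The Python sketch below is illustrative; as in the example, the critical value is taken from the table and s is used in its rounded form 1.82.

```python
from math import sqrt

# Quantities quoted in Examples 11.8 and 11.10.
b, s, ss_xx = 20.5, 1.82, 0.4
t_crit = 3.182                     # t_{0.025} with 3 df, read from the table

half_width = t_crit * s / sqrt(ss_xx)
lower, upper = b - half_width, b + half_width

print(f"95% CI for B: {b} +/- {half_width:.2f} -> ({lower:.2f}, {upper:.2f})")
```

The interval excludes 0, which matches the rejection of $H_0: B = 0$ by the two-tailed test at the same α.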
Remark From the above we see the complete similarity between the t-statistic for testing
hypotheses about the slope B and the t-statistic for testing hypotheses about the means of
normal populations in Chapter 9, and the similarity of the corresponding confidence intervals. In
each case, the general form of the test statistic is
$$t = \frac{\text{Parameter estimator} - \text{Its hypothesized mean}}{\text{Estimated standard error of the estimator}}$$
and the general form of the confidence interval is
Point estimator ± $t_{\alpha/2}$ (Estimated standard error of the estimator)
11.6. Correlation analysis
Correlation analysis is the statistical tool that we can use to describe the degree to which one
variable is linearly related to another. Frequently, correlation analysis is used in conjunction with
regression analysis to measure how well the least squares line fits the data . Correlation
analysis can also be used by itself, however, to measure the degree of association between two
variables.
In this section we present two measures for describing the correlation between two variables:
the coefficient of determination and the coefficient of correlation.
11.6.1 The coefficient of correlation
Definition 11.5
The Pearson product moment coefficient of correlation (or simply, the coefficient of
correlation) r is a measure of the strength of the linear relationship between two
variables x and y. It is computed (for a sample of n measurements on x and y) as
follows:
$$r = \frac{SS_{xy}}{\sqrt{SS_{xx}\, SS_{yy}}},$$
where
$$SS_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}), \qquad SS_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad SS_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2,$$
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i.$$
Some properties of the coefficient of correlation:
i) −1 ≤ r ≤ 1 (this follows from the Cauchy-Bunyakovsky inequality)
ii) r and b (the slope of the least squares line) have the same sign
iii) A value of r near or equal to 0 implies little or no linear relationship between x and y. The
closer r is to 1 or to −1, the stronger the linear relationship between x and y.
Keep in mind that the correlation coefficient r measures the correlation between x values and y
values in the sample, and that a similar linear coefficient of correlation exists for the population
from which the data points were selected. The population correlation coefficient is denoted by ρ
(rho). As you might expect, ρ is estimated by the corresponding sample statistic r. Or, rather
than estimating ρ, we might want to test the hypothesis $H_0$: ρ = 0 against $H_a$: ρ ≠ 0, i.e.,
test the hypothesis that x contributes no information for predicting y using the straight line
model against the alternative that the two variables are at least linearly related. But it can be
shown that the null hypothesis $H_0$: ρ = 0 is equivalent to the hypothesis $H_0$: B = 0.
Therefore, we omit the test of hypothesis for linear correlation.
11.6.2 The coefficient of determination
Another way to measure the contribution of x in predicting y is to consider how much the errors
of prediction of y can be reduced by using the information provided by x.
The sample coefficient of determination is developed from the relationship between two kinds of
variation: the variation of the y values in a data set around
1. The fitted regression line
2. Their own mean
The term variation in both cases is used in its usual statistical sense to mean “the sum of a
group of squared deviations”.
The first variation is the variation of y values around the regression line, i.e., around their
predicted values. This variation is the sum of squares for error (SSE) of the regression model
$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
The second variation is the variation of y values around their own mean
$$SS_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2$$
Definition 11.6
The coefficient of determination is
$$\frac{SS_{yy} - SSE}{SS_{yy}}.$$
It is easy to verify that
$$r^2 = \frac{SS_{yy} - SSE}{SS_{yy}} = 1 - \frac{SSE}{SS_{yy}},$$
where r is the coefficient of correlation, defined in Subsection 11.6.1.
Therefore, usually we call $r^2$ the coefficient of determination.
Statisticians interpret the coefficient of determination by looking at the amount of the variation
in y that is explained by the regression line. To understand this meaning of $r^2$ consider Figure
11.6.
Figure 11.6 The explained and unexplained deviations
Here we singled out one observed value of y and showed the total deviation of this y from its
mean, $y - \bar{y}$, the unexplained deviation $y - \hat{y}$ and the remaining explained deviation
$\hat{y} - \bar{y}$. Now consider a whole set of observed y values instead of only one value. The
total variation, i.e., the sum of squared deviations of these points from their mean, would be
$$SS_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2.$$
The unexplained portion of the total variation of these points from the regression line is
$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2.$$
The explained portion of the total variation is
$$\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2.$$
It is true that
Total variation = Explained variation + Unexplained variation.
Therefore,
$$r^2 = \frac{\text{Explained variation}}{\text{Total variation}}$$
PRACTICAL INTERPRETATION OF THE COEFFICIENT OF DETERMINATION, $r^2$
About 100($r^2$)% of the total sum of squares of deviations of the sample y-values
about their mean $\bar{y}$ can be explained by (or attributed to) using x to predict y in the
straight-line model.
Example 11.11 Refer to Example 11.5. Calculate the coefficient of determination for the
nicotine content-carbon monoxide ranking data and interpret its value.
Solution By the formulas given in this section we found $r^2 = 0.9444$. We interpret this value as
follows: The use of nicotine content, x, to predict carbon monoxide ranking, y, with the least
squares line
$$\hat{y} = -0.3 + 20.5x$$
accounts for approximately 94% of the total sum of squares of deviations of the five sample CO
rankings about their mean. That is, we can reduce the total sum of squares of our prediction
errors by more than 94% by using the least squares equation instead of $\bar{y}$.
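The value $r^2 = 0.9444$ can be verified in two ways, from the correlation coefficient and from SSE. The Python sketch below does both for the data of Table 11.2; it is an illustration with variable names of our own choosing.

```python
from math import sqrt

# Data of Table 11.2.
x = [0.2, 0.4, 0.6, 0.8, 1.0]
y = [2, 10, 13, 15, 20]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
ss_xx = sum((xi - x_bar) ** 2 for xi in x)
ss_yy = sum((yi - y_bar) ** 2 for yi in y)

# Route 1: coefficient of correlation, then square it.
r = ss_xy / sqrt(ss_xx * ss_yy)

# Route 2: 1 - SSE/SS_yy using the least squares line.
b = ss_xy / ss_xx
a = y_bar - b * x_bar
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
r2_from_sse = 1 - sse / ss_yy

print(f"r = {r:.4f}, r^2 = {r ** 2:.4f}, 1 - SSE/SS_yy = {r2_from_sse:.4f}")
```

With $SS_{yy} = 178$ and SSE = 9.9, both routes give $r^2 = 1 - 9.9/178 \approx 0.9444$, and r is positive, in agreement with the positive slope b.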
11.7 Using the model for estimation and prediction
The most common uses of a probabilistic model can be divided into two categories:
1) The use of the model for estimating the mean value of y, E(y), for a specific value of x
2) The use of the model for predicting a particular y value for a given x value.
In case 1) we are attempting to estimate the mean result of a very large number of experiments
at the given x value. In case 2) we are trying to predict the outcome of a single experiment at
the given x value.
The difference in these two model uses lies in the relative accuracy of the estimate and the
prediction. These accuracies are best measured by the repeated sampling errors of the least
squares line when it is used as estimator and as a predictor, respectively. These errors are
given in the next box.
SAMPLING ERRORS FOR THE ESTIMATOR OF THE MEAN AND THE
PREDICTOR OF AN INDIVIDUAL y
The standard deviation of the sampling distribution of the estimator $\hat{y}$ of the
mean value of y at a fixed x is
$$\sigma_{\hat{y}} = \sigma\sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{SS_{xx}}}$$
The standard deviation of the prediction error for the predictor $\hat{y}$ of an individual
y-value at a fixed x is
$$\sigma_{(y - \hat{y})} = \sigma\sqrt{1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{SS_{xx}}}$$
where σ is the square root of $\sigma^2$, the variance of the random error (see Section 11.2).
The true value of σ will rarely be known. Thus, we estimate σ by s and calculate the estimation
and prediction intervals as follows.
A (1 − α)100% CONFIDENCE INTERVAL FOR THE MEAN VALUE OF y FOR x = $x_p$
$$\hat{y} \pm t_{\alpha/2}\,(\text{Estimated standard deviation of } \hat{y})$$
or
$$\hat{y} \pm t_{\alpha/2}\, s\sqrt{\frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}}$$
where $t_{\alpha/2}$ is based on (n − 2) df.
A (1 − α)100% PREDICTION INTERVAL FOR AN INDIVIDUAL y FOR x = $x_p$
$$\hat{y} \pm t_{\alpha/2}\,[\text{Estimated standard deviation of } (y - \hat{y})]$$
or
$$\hat{y} \pm t_{\alpha/2}\, s\sqrt{1 + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}}$$
where $t_{\alpha/2}$ is based on (n − 2) df.
Example 11.12 Find a 95% confidence interval for the mean carbon monoxide ranking of all
cigarettes that have a nicotine content of 0.4 milligram. Also, find a 95% prediction interval for a
particular cigarette if its nicotine content is 0.4 mg.
Solution For a nicotine content of 0.4 mg, $x_p = 0.4$ and the confidence interval for the mean
of y is calculated by the formula for the mean value of y in the box above, with s = 1.82, n = 5,
df = n − 2 = 5 − 2 = 3, $t_{0.025} = 3.182$,
$\hat{y} = -0.3 + 20.5x_p = -0.3 + 20.5 \times 0.4 = 7.9$, $\bar{x} = 0.6$ and $SS_{xx} = 0.4$.
Hence, we obtain the confidence interval (7.9 ± 3.17).
Also, by the formula for an individual y we obtain the 95% prediction interval for a particular
cigarette with nicotine content of 0.4 mg as (7.9 ± 6.60).
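Both half-widths in Example 11.12 come from the same ingredients and differ only by the extra 1 under the square root. The Python sketch below illustrates this; the function `interval_half_widths` is a hypothetical helper written for this example, and the critical value is taken from the table.

```python
from math import sqrt

def interval_half_widths(x_p, x_bar, n, s, ss_xx, t_crit):
    """Half-widths of the interval for E(y) and for an individual y at x_p."""
    core = 1 / n + (x_p - x_bar) ** 2 / ss_xx
    mean_half_width = t_crit * s * sqrt(core)
    prediction_half_width = t_crit * s * sqrt(1 + core)
    return mean_half_width, prediction_half_width

# Quantities quoted in Example 11.12 (x_bar = 0.6 is the mean of Table 11.2).
x_p, x_bar, n = 0.4, 0.6, 5
s, ss_xx, t_crit = 1.82, 0.4, 3.182

y_hat = -0.3 + 20.5 * x_p
mean_hw, pred_hw = interval_half_widths(x_p, x_bar, n, s, ss_xx, t_crit)

print(f"mean of y:    {y_hat:.1f} +/- {mean_hw:.2f}")
print(f"individual y: {y_hat:.1f} +/- {pred_hw:.2f}")
```

Since the quantity under the second square root exceeds the first by exactly 1, the prediction half-width is larger than the confidence half-width for every $x_p$.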
From Example 11.12 it is important to note that the prediction interval for the carbon monoxide
ranking of an individual cigarette is wider than the corresponding confidence interval for the mean
carbon monoxide ranking. By examining the formulas for the two intervals, we can see that this
will always be true.
Additionally, over the range of sample data, the width of both intervals increases as the value of
x gets further from $\bar{x}$ (see Figure 11.7).
Figure 11.7 Comparison of 95% confidence interval and
prediction interval
11.8. Simple Linear Regression: An Example
In the previous sections we have presented the basic elements necessary to fit and use a straight-line regression model. In this section we will assemble these elements by applying them to an example.
Example 11.13 The International Rice Research Institute in the Philippines wants to relate the grain yield of rice varieties, y, to the tiller number, x. They conducted experiments for several rice varieties and tiller numbers. The results obtained for the rice variety Milfor 6 are given in Table 11.3.
Table 11.3 The grain yield of rice, y, for the tiller number, x

Grain Yield, kg/ha    Tillers, no./m²
4,862                 160
5,244                 175
5,128                 192
5,052                 195
5,298                 238
5,410                 240
5,234                 252
5,608                 282
Step 1 Assuming that the assumptions listed in Section 11.2 are satisfied, we hypothesize a straight-line probabilistic model for the relationship between the grain yield, y, and the tillers, x:
y = A + B x + e.
Step 2 Use the sample data to find the least squares line. For this purpose we calculate
$$SS_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad SS_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}),$$
$$b = \frac{SS_{xy}}{SS_{xx}}, \qquad a = \bar{y} - b\bar{x}$$
for the data. As a result, we obtain the least squares line
$$\hat{y} = 4242 + 4.56x$$
The scattergram for the data and the least squares line fitted to the data are depicted in Figure
11.8.
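The Step 2 calculation can be reproduced with a few lines of code. The following is a minimal sketch, assuming the data of Table 11.3; the function name `least_squares_line` is chosen here for illustration and does not come from the course text.

```python
# Table 11.3: tillers per m^2 (x) and grain yield in kg/ha (y)
x = [160, 175, 192, 195, 238, 240, 252, 282]
y = [4862, 5244, 5128, 5052, 5298, 5410, 5234, 5608]


def least_squares_line(x, y):
    """Return (a, b) for the fitted line y_hat = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = ss_xy / ss_xx          # slope
    a = y_bar - b * x_bar      # intercept
    return a, b


a, b = least_squares_line(x, y)
print(f"y_hat = {a:.2f} + {b:.5f} x")
```

Rounded to the precision used in the text, the result agrees with the line 4242 + 4.56x and with the Intercept and Slope rows of the STATGRAPHICS printout in Figure 11.9.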
Figure 11.8 Simple linear model relating Grain Yield to Tiller Number
Step 3 Compute an estimator, $s^2$, for the variance $\sigma^2$ of the random error e:
$$s^2 = \frac{SSE}{n-2}, \quad \text{where} \quad SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2.$$
The computations give $s^2$ = 16,229.66 and s = 127.39. The value of s implies that most of the 8 observed values will fall within 2s = 254.78 of their respective predicted values.
Step 4 Check the utility of the hypothesized model, that is, whether x really contributes information for the prediction of y using the straight-line model. First test the hypothesis that the slope B is 0, i.e., that there is no linear relationship between the grain yield, y, and the tillers, x. We test:
$$H_0: B = 0, \qquad H_a: B \neq 0$$
Test statistic:
$$t = \frac{b}{s_b} = \frac{b}{s/\sqrt{SS_{xx}}}$$
For the significance level α = 0.05, we will reject $H_0$ if $t < -t_{\alpha/2}$ or $t > t_{\alpha/2}$, where $t_{\alpha/2}$ is based on (n − 2) = (8 − 2) = 6 df. For this df we find $t_{0.025} = 2.447$, and
$$t = \frac{4.56}{127.39/\sqrt{12541.5}} = 4.004.$$
This t-value is greater than $t_{0.025}$. Thus, we reject the hypothesis B = 0.
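Steps 3 and 4 can be checked numerically. The sketch below assumes the Table 11.3 data and takes the critical value 2.447 directly from the text rather than computing it; the variable names are illustrative only.

```python
import math

x = [160, 175, 192, 195, 238, 240, 252, 282]
y = [4862, 5244, 5128, 5052, 5298, 5410, 5234, 5608]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
ss_xx = sum((xi - x_bar) ** 2 for xi in x)
ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b = ss_xy / ss_xx
a = y_bar - b * x_bar

# SSE: squared deviations of the observations from the fitted line
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
s = math.sqrt(sse / (n - 2))      # estimate of sigma, based on n - 2 df
s_b = s / math.sqrt(ss_xx)        # estimated standard error of the slope
t = b / s_b                       # test statistic for H0: B = 0

T_CRIT = 2.447                    # t_{0.025} for 6 df, copied from the text
reject_h0 = abs(t) > T_CRIT       # two-tailed rejection rule
print(f"s = {s:.3f}, s_b = {s_b:.5f}, t = {t:.5f}, reject H0: {reject_h0}")
```

The values of s, s_b and t obtained this way correspond to "Stnd. Error of Est.", the standard error of the slope and the slope's T value in Figure 11.9.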
Next, we obtain additional information about the relationship by forming a confidence interval for the slope B. A 95% confidence interval is
$$b \pm t_{\alpha/2}\left(\frac{s}{\sqrt{SS_{xx}}}\right) = 4.56 \pm 2.447\left(\frac{127.39}{\sqrt{12541.5}}\right) = 4.56 \pm 2.78.$$
It is the interval (1.78, 7.34).
Another measure of the utility of the model is the coefficient of correlation
$$r = \frac{SS_{xy}}{\sqrt{SS_{xx}\,SS_{yy}}}, \quad \text{where} \quad SS_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2.$$
Computations give r = 0.853.
The high correlation confirms our conclusion that B differs from 0. It appears that the grain yield and tillers are rather highly correlated.
The coefficient of determination is $r^2$ = 0.7277, which implies that 72.77% of the total variation in grain yield is explained by the tillers.
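As a quick check of the correlation figures, here is a small sketch that computes r and r² from the Table 11.3 data using the formula above (variable names are illustrative):

```python
import math

x = [160, 175, 192, 195, 238, 240, 252, 282]
y = [4862, 5244, 5128, 5052, 5298, 5410, 5234, 5608]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
ss_xx = sum((xi - x_bar) ** 2 for xi in x)
ss_yy = sum((yi - y_bar) ** 2 for yi in y)
ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

r = ss_xy / math.sqrt(ss_xx * ss_yy)   # coefficient of correlation
r_squared = r ** 2                     # coefficient of determination
print(f"r = {r:.6f}, r^2 = {r_squared:.4f}")
```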
Step 5 Use the least squares model:
Suppose the researchers want to predict the grain yield if the tillers are 210 per m², i.e., $x_p = 210$. The predicted value is
$$\hat{y} = 4242 + 4.56 x_p = 4242 + 4.56 \times 210 = 5199.6.$$
If we want a 95% prediction interval, we calculate
$$\hat{y} \pm t_{\alpha/2}\, s \sqrt{1 + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}} = 5199.6 \pm 2.447 \times 127.39 \sqrt{1 + \frac{1}{8} + \frac{(210 - 216.75)^2}{12541.5}} = 5199 \pm 331.18 = (4867.82, 5530.18).$$
Thus, the model yields a 95% prediction interval for the grain yield for the given value 210 of tillers from 4867.82 kg/ha to 5530.18 kg/ha.
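The interval in Step 5 can be sketched in code as follows. The function name `interval_half_widths` is hypothetical, and the critical value 2.447 is again taken from the text. Because the code keeps full precision for a and b, the predicted value comes out near 5198.75 rather than the 5199.6 obtained from the rounded coefficients 4242 and 4.56; the half-width is essentially unchanged.

```python
import math

x = [160, 175, 192, 195, 238, 240, 252, 282]
y = [4862, 5244, 5128, 5052, 5298, 5410, 5234, 5608]


def interval_half_widths(x, y, x_p, t_crit):
    """Return (y_hat, half-width for the mean, half-width for an individual y)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = ss_xy / ss_xx
    a = y_bar - b * x_bar
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))
    core = 1 / n + (x_p - x_bar) ** 2 / ss_xx
    half_mean = t_crit * s * math.sqrt(core)        # confidence interval for E(y)
    half_pred = t_crit * s * math.sqrt(1 + core)    # prediction interval for y
    return a + b * x_p, half_mean, half_pred


y_hat, half_mean, half_pred = interval_half_widths(x, y, x_p=210, t_crit=2.447)
print(f"y_hat = {y_hat:.2f}")
print(f"95% CI for mean:  +/- {half_mean:.2f}")
print(f"95% PI for one y: +/- {half_pred:.2f}")
```

The extra 1 under the square root is what makes the prediction interval wider than the confidence interval for the mean at every value of x_p.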
Below we include the STATGRAPHICS printout for this example.
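Regression Analysis - Linear model: Y = a+bX
----------------------------------------------------------------------
Dependent variable: GrainYield      Independent variable: Tillers
----------------------------------------------------------------------
                           Standard      T          Prob.
Parameter     Estimate     Error         Value      Level
----------------------------------------------------------------------
Intercept     4242.13      250.649       16.9245    0.00000
Slope         4.55536      1.13757       4.00445    0.00708
----------------------------------------------------------------------
                       Analysis of Variance
----------------------------------------------------------------------
Source          Sum of Squares   Df   Mean Square   F-Ratio   Prob. Level
Model           260252.06         1   260252.06     16.0      0.00708
Residual        97377.944         6   16229.657
----------------------------------------------------------------------
Total (Corr.)   357630.00         7
Correlation Coefficient = 0.853061      R-squared = 72.77 percent
Stnd. Error of Est. = 127.396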
Figure 11.9 STATGRAPHICS printout for Example 11.13
11.9 Summary
In this chapter we have introduced bivariate relationships and shown how to compute the coefficient of correlation, r, a measure of the strength of the linear relationship between two variables. We have also presented the method of least squares for fitting a prediction equation to a data set. This procedure, along with the associated statistical tests and estimations, is called a regression analysis. The steps that we follow in a simple linear regression analysis are:
To hypothesize a probabilistic straight-line model y = A + Bx + e.
To make assumptions about the random error component e.
To use the method of least squares to estimate the unknown parameters in the deterministic component, y = A + Bx.
To assess the utility of the hypothesized model. Included here are making inferences about the slope B and calculating the coefficient of correlation r and the coefficient of determination $r^2$.
If we are satisfied with the model, to use it to estimate the mean y value, E(y), for a given x value and to predict an individual y value for a specific x value.
11.10 Exercises
1. Consider the seven data points in the table
x −5 −3 −1 0 1 3 5
y 0.8 1.1 2.5 3.1 5.0 4.7 6.2
a) Construct a scatter diagram for the data. After examining the scattergram, do you think
that x and y are correlated? If correlation is present, is it positive or negative?
b) Find the correlation coefficient r and interpret its value.
c) Find the least squares prediction equation.
d) Calculate SSE for the data and calculate $s^2$ and s.
e) Test the null hypothesis that the slope B = 0 against the alternative hypothesis that B ≠ 0. Use α = 0.05.
f) Find a 90% confidence interval for the slope B.
2. In fitting a least squares line to n = 22 data points, suppose you computed the following quantities:
$SS_{xx} = 25$, $SS_{yy} = 17$, $SS_{xy} = 20$, $\bar{x} = 2$, $\bar{y} = 3$
a) Find the least squares line.
b) Calculate SSE.
c) Calculate $s^2$.
d) Find a 95% confidence interval for the mean value of y when x = 1.
e) Find a 95% prediction interval for y when x = 1.
f) Find a 95% confidence interval for the mean value of y when x = 0.
3. A study was conducted to examine the inhibiting properties of the sodium salts of phosphoric acid on the corrosion of iron. The data shown in the table provide a measure of corrosion of Armco iron in tap water containing various concentrations of NaPO₄ inhibitor:

Concentration of NaPO₄, x (parts per million)    Measure of corrosion rate, y
2.50                                              7.68
5.03                                              6.95
7.60                                              6.30
11.60                                             5.75
13.00                                             5.01
19.60                                             1.43
26.20                                             0.93
33.00                                             0.72
40.00                                             0.68
50.00                                             0.65
55.00                                             0.56

a) Construct a scatter diagram for the data.
b) Fit the linear model y = A + B x + e to the data.
c) Does the model of part b) provide an adequate fit? Test using α = 0.05.
d) Construct a 95% confidence interval for the mean corrosion rate of iron in tap water in which the concentration of NaPO₄ is 20 parts per million.
4. A linear model is used for the relationship between the variables x and y, and for the data collected STATGRAPHICS gives the following printout:

Regression Analysis - Linear model: Y = a+bX
----------------------------------------------------------------------
Dependent variable: ELECTRIC.Y      Independent variable: ELECTRIC.X
----------------------------------------------------------------------
                           Standard      T          Prob.
Parameter     Estimate     Error         Value      Level
----------------------------------------------------------------------
Intercept     279.763      116.445       2.40252    0.04301
Slope         0.720119     0.0623473     11.5501    0.00000
----------------------------------------------------------------------
                       Analysis of Variance
----------------------------------------------------------------------
Source          Sum of Squares   Df   Mean Square   F-Ratio   Prob. Level
Model           798516.89         1   798516.89     133.4     0.00000
Residual        47885.214         8   5985.652
----------------------------------------------------------------------
Total (Corr.)   846402.10         9
Correlation Coefficient = 0.971301      R-squared = 94.34 percent
Stnd. Error of Est. = 77.367

Figure 11.10 STATGRAPHICS printout for Exercise 11.4

a) Identify the least squares model fitted to the data.
b) What are the values of SSE and $s^2$ for the data?
c) Perform a test of model adequacy. Use α = 0.05.
Chapter 12 Multiple regression
CONTENTS
12.1 Introduction: the general linear model
12.2 Model assumptions
12.3 Fitting the model: the method of least squares
12.4 Estimating $\sigma^2$
12.5 Estimating and testing hypotheses about the B parameters
12.6 Checking the utility of a model
12.7 Using the model for estimating and prediction
12.8 Multiple linear regression: An overview example
12.9 Model building: interaction models
12.10 Model building: quadratic models
12.11 Summary
12.12 Exercises
___________________________________________________________________________
12.1. Introduction: the general linear model
The models for a multiple regression analysis are similar to the simple regression model except that they contain more terms.
Example 12.1 The researchers at the International Rice Research Institute suppose that Grain Yield, y, is related to Plant Height, $x_1$, and Tiller Number, $x_2$, by the linear model
$$E(y) = B_0 + B_1 x_1 + B_2 x_2.$$
Example 12.2 Suppose we think that the mean time E(y) required to perform a data-processing job increases as the computer utilization increases and that the relationship is curvilinear. Instead of using the straight-line model $E(y) = A + B x_1$ to model the relationship, we might use the quadratic model $E(y) = A + B_1 x_1 + B_2 x_1^2$, where $x_1$ is a variable that measures computer utilization. A quadratic model is often referred to as a second-order linear model, in contrast to a straight-line or first-order model.
If, in addition, we think that the mean time required to process a job is also related to the size $x_2$ of the job, we could include $x_2$ in the model. For example, the first-order model in this case is
$$E(y) = B_0 + B_1 x_1 + B_2 x_2$$
and the second-order model is
$$E(y) = B_0 + B_1 x_1 + B_2 x_2 + B_3 x_1 x_2 + B_4 x_1^2 + B_5 x_2^2.$$
All the models that we have written so far are called linear models, because E(y) is a linear function of the unknown parameters $B_0, B_1, B_2, \ldots$
The model
$$E(y) = A e^{Bx}$$
is not a linear model because E(y) is not a linear function of the unknown model parameters A and B.
Note that by introducing new variables, second-order models may be written in the form of first-order models. For example, putting $x_2 = x_1^2$, the second-order model
$$E(y) = B_0 + B_1 x_1 + B_2 x_1^2$$
becomes the first-order model
$$E(y) = B_0 + B_1 x_1 + B_2 x_2.$$
Therefore, from now on we consider only the multiple first-order regression model.
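The substitution above amounts to adding a derived column to the data before fitting. A minimal sketch, with a hypothetical helper `quadratic_design_row` and made-up utilization values used purely for illustration:

```python
def quadratic_design_row(x1):
    """Return the predictor row (1, x1, x2) with the new variable x2 = x1**2."""
    return [1.0, x1, x1 ** 2]


# Hypothetical utilization values, not taken from the course data
utilization = [0.5, 1.0, 2.0, 3.0]
design = [quadratic_design_row(x1) for x1 in utilization]
for row in design:
    print(row)
```

Each row now has the shape required by a first-order model in two predictors, so the quadratic model can be fitted with the same least squares machinery.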
THE GENERAL MULTIPLE LINEAR MODEL
$$y = B_0 + B_1 x_1 + \cdots + B_k x_k + e,$$
where
y = dependent variable (variable to be modeled, sometimes called the response variable)
$x_1, x_2, \ldots, x_k$ = independent variables (variables used as predictors of y)
e = random error
$B_i$ determines the contribution of the independent variable $x_i$
12.2 Model assumptions
ASSUMPTIONS REQUIRED FOR A MULTIPLE LINEAR REGRESSION MODEL
1. $y = B_0 + B_1 x_1 + \cdots + B_k x_k + e$, where e is random error.
2. For any given set of values $x_1, x_2, \ldots, x_k$, the random error e has a normal probability distribution with mean equal to 0 and variance equal to $\sigma^2$.
3. The random errors are independent.
12.3 Fitting the model: the method of least squares
The method of fitting a multiple regression model is identical to that of fitting the straight-line model.
Suppose we are given the sample data that are presented in Table 12.1.

Table 12.1

DATA POINT   Y VALUE   x_1     x_2     ...   x_k
1            y_1       x_11    x_21    ...   x_k1
2            y_2       x_12    x_22    ...   x_k2
...          ...       ...     ...     ...   ...
n            y_n       x_1n    x_2n    ...   x_kn
We will use the method of least squares and choose estimates of $B_0, B_1, B_2, \ldots, B_k$ that minimize
$$SSE = \sum_{i=1}^{n} [y_i - \hat{y}_i]^2 = \sum_{i=1}^{n} [y_i - (b_0 + b_1 x_{1i} + b_2 x_{2i} + \cdots + b_k x_{ki})]^2$$
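In order to write the solution of the least squares problem briefly, we introduce the matrix notation
$$Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad
X = \begin{pmatrix} 1 & x_{11} & x_{21} & \cdots & x_{k1} \\ 1 & x_{12} & x_{22} & \cdots & x_{k2} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{1n} & x_{2n} & \cdots & x_{kn} \end{pmatrix}, \quad
b = \begin{pmatrix} b_0 \\ b_1 \\ \vdots \\ b_k \end{pmatrix}$$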
Then we can write the least squares equations in matrix form as
THE LEAST SQUARES MATRIX EQUATION
(X’X) b = X’Y,
where X’ is the transpose of X. The solution of the least squares equations therefore is
LEAST SQUARES SOLUTION
$b = (X'X)^{-1} X'Y$.
Example 12.3 Refer to Example 12.1 relating Grain Yield, y, to Plant Height, $x_1$, and Tiller Number, $x_2$, by the linear model
$$E(y) = B_0 + B_1 x_1 + B_2 x_2.$$
Find the least squares estimates of $B_0, B_1, B_2$. The data are shown in Table 12.2.
Table 12.2 Data for Grain Yield Study
VARIETY NUMBER   GRAIN YIELD, kg/ha (y)   PLANT HEIGHT, cm (x_1)   TILLER, no./hill (x_2)
1 5755 110.5 14.5
2 5939 105.4 16.0
3 6010 118.1 14.6
4 6545 104.5 18.2
5 6730 93.6 15.4
6 6750 84.1 17.6
7 6899 77.8 17.9
8 7862 75.6 19.4
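Solution The Y, X and b are shown below:
$$Y = \begin{pmatrix} 5755 \\ 5939 \\ 6010 \\ 6545 \\ 6730 \\ 6750 \\ 6899 \\ 7862 \end{pmatrix}, \quad
X = \begin{pmatrix} 1 & 110.5 & 14.5 \\ 1 & 105.4 & 16.0 \\ 1 & 118.1 & 14.6 \\ 1 & 104.5 & 18.2 \\ 1 & 93.6 & 15.4 \\ 1 & 84.1 & 17.6 \\ 1 & 77.8 & 17.9 \\ 1 & 75.6 & 19.4 \end{pmatrix}, \quad
b = \begin{pmatrix} b_0 \\ b_1 \\ b_2 \end{pmatrix}$$
After the calculations we obtain b = (6335.59, −23.75, 150.31)’.
Thus, the prediction equation is
$$\hat{y} = 6335.59 - 23.75 x_1 + 150.31 x_2.$$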
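The matrix solution for Example 12.3 can be sketched without any external library. The code below assumes the data of Table 12.2, forms X'X and X'Y explicitly, and solves the normal equations (X'X)b = X'Y by Gaussian elimination rather than by computing the inverse. The function names are illustrative, and for badly scaled data a dedicated numerical library would be a safer choice.

```python
def solve_linear_system(a, rhs):
    """Solve a @ z = rhs by Gaussian elimination with partial pivoting."""
    n = len(a)
    # Augmented matrix [a | rhs], copied so the inputs are not modified
    m = [row[:] + [r] for row, r in zip(a, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):          # back substitution
        tail = sum(m[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (m[r][n] - tail) / m[r][r]
    return z


def least_squares(rows, y):
    """Return b solving the normal equations (X'X) b = X'Y."""
    X = [[1.0] + list(r) for r in rows]     # prepend the column of ones
    p = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)


# Table 12.2: (plant height in cm, tillers per hill) and grain yield in kg/ha
rows = [(110.5, 14.5), (105.4, 16.0), (118.1, 14.6), (104.5, 18.2),
        (93.6, 15.4), (84.1, 17.6), (77.8, 17.9), (75.6, 19.4)]
y = [5755, 5939, 6010, 6545, 6730, 6750, 6899, 7862]

b = least_squares(rows, y)
print("b0 = %.3f, b1 = %.3f, b2 = %.3f" % tuple(b))
```

The negative sign of b1 reflects the pattern in the table: the taller varieties have the lower yields.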
Below we include the STATGRAPHICS printout for this example.
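Model fitting results for: GRAIN.Y
----------------------------------------------------------------------
Independent variable    coefficient     std. error     t-value   sig. level
----------------------------------------------------------------------
CONSTANT                6335.596495     2942.930958     2.1528    0.0839
GRAIN.X1                 -23.748104       12.895492    -1.8416    0.1249
GRAIN.X2                 150.312641      112.069368     1.3412    0.2375
----------------------------------------------------------------------
R-SQ. (ADJ.) = 0.7474   SE = 340.427774   MAE = 248.149078   DurbWat = 2.337
Previously:    0.0000        0.000000           0.000000             0.000
8 observations fitted, forecast(s) computed for 0 missing val. of dep. var.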
12.4 Estimating $\sigma^2$
We recall that the variances of the estimators of all the B parameters and of $\hat{y}$ depend on the value of $\sigma^2$, the variance of the random error e that appears in the linear model. Since $\sigma^2$ will rarely be known in advance, we must use the sample data to estimate its value.
ESTIMATOR OF $\sigma^2$, THE VARIANCE OF e IN A MULTIPLE REGRESSION MODEL
$$s^2 = \frac{SSE}{\text{Degrees of freedom for error}} = \frac{SSE}{n - \text{Number of } B \text{ parameters in model}}$$
where
$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
It can be proved that $s^2$ is an unbiased estimator of $\sigma^2$, that is, $E(s^2) = \sigma^2$.
Note that in statistical software SSE is often referred to as the Sum of Squares for Error and $s^2$ as the Mean Square for Error. For example, for the Grain Yield Study data in Table 12.2 the STATGRAPHICS printout is as follows:
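Analysis of Variance for the Full Regression
----------------------------------------------------------------------
Source          Sum of Squares   DF   Mean Square   F-Ratio   P-value
----------------------------------------------------------------------
Model           2632048.          2   1316024.      11.3557   0.0138
Error            579455.          5    115891.
----------------------------------------------------------------------
Total (Corr.)   3211504.          7
R-squared = 0.819569                 Stnd. error of est. = 340.428
R-squared (Adj. for d.f.) = 0.747396 Durbin-Watson statistic = 2.33739
We see on this printout that SSE = 579455 and $s^2$ = 115891.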
12.5 Estimating and testing hypotheses about the B parameters
12.5.1 Properties of the sampling distributions of $b_0, b_1, \ldots, b_k$
Before making inferences about the B parameters of the multiple linear model we provide some properties of the least squares estimators b, which serve as the theoretical background for estimating and testing hypotheses about B.
From Section 12.3 we know that the least squares estimators b are computed by the formula $b = (X'X)^{-1}X'Y$. Now, we can rewrite b in the form
$$b = [(X'X)^{-1}X']\,Y.$$
From this form we see that the components of b, namely $b_0, b_1, \ldots, b_k$, are linear functions of the n normally distributed random variables $y_1, y_2, \ldots, y_n$. Therefore, $b_i$ (i = 0, 1, ..., k) has a normal sampling distribution.
It can be shown that the least squares estimators provide unbiased estimators of $B_0, B_1, \ldots, B_k$, that is, $E(b_i) = B_i$ (i = 0, 1, ..., k).
The standard errors and covariances of the estimators are defined by the elements of the matrix $(X'X)^{-1}$. Thus, if we denote
$$(X'X)^{-1} = \begin{pmatrix} c_{00} & c_{01} & \cdots & c_{0k} \\ c_{10} & c_{11} & \cdots & c_{1k} \\ c_{20} & c_{21} & \cdots & c_{2k} \\ \vdots & \vdots & & \vdots \\ c_{k0} & c_{k1} & \cdots & c_{kk} \end{pmatrix},$$
then the standard deviations of the sampling distributions of $b_0, b_1, \ldots, b_k$ are
$$\sigma_{b_i} = \sigma\sqrt{c_{ii}} \quad (i = 0, 1, \ldots, k),$$
where σ is the standard deviation of the random error e.
The properties of the sampling distributions of the least squares estimators are summarized in the box.
THEOREM 12.1 (properties of the sampling distributions of $b_0, b_1, \ldots, b_k$)
The sampling distribution of $b_i$ (i = 0, 1, ..., k) is normal with:
mean $E(b_i) = B_i$, variance $V(b_i) = c_{ii}\sigma^2$,
standard deviation $\sigma_{b_i} = \sigma\sqrt{c_{ii}}$ (i = 0, 1, ..., k).
The covariance of two parameter estimators is equal to
$$Cov(b_i, b_j) = c_{ij}\sigma^2 \quad (i \neq j).$$
12.5.2 Estimating and testing hypotheses about the B parameters
A (1−α)100% confidence interval for a model parameter $B_i$ (i = 0, 1, ..., k) can be constructed using the t statistic
$$t = \frac{b_i - B_i}{s_{b_i}} = \frac{b_i - B_i}{s\sqrt{c_{ii}}},$$
where s is an estimate of σ.
A (1−α)100% CONFIDENCE INTERVAL FOR $B_i$
$b_i \pm t_{\alpha/2}$ (Estimated standard error of $b_i$) or
$$b_i \pm t_{\alpha/2}\, s\sqrt{c_{ii}}$$
where $t_{\alpha/2}$ is based on [n − (k+1)] df.
Similarly, the test statistic for testing the null hypothesis $H_0: B_i = 0$ is
$$t = \frac{b_i}{\text{Estimated standard error of } b_i}$$
The test is summarized in the box:
TEST OF AN INDIVIDUAL PARAMETER COEFFICIENT IN THE MULTIPLE REGRESSION MODEL $y = B_0 + B_1 x_1 + \cdots + B_k x_k + e$

ONE-TAILED TEST
$H_0: B_i = 0$
$H_a: B_i < 0$ (or $B_i > 0$)
Test statistic:
$$t = \frac{b_i}{s_{b_i}} = \frac{b_i}{s\sqrt{c_{ii}}}$$
Rejection region: $t < -t_{\alpha}$ (or $t > t_{\alpha}$),
where $t_{\alpha}$ is based on [n − (k+1)] df,
n = number of observations,
k = number of independent variables in the model

TWO-TAILED TEST
$H_0: B_i = 0$
$H_a: B_i \neq 0$
Test statistic:
$$t = \frac{b_i}{s_{b_i}} = \frac{b_i}{s\sqrt{c_{ii}}}$$
Rejection region: $t < -t_{\alpha/2}$ or $t > t_{\alpha/2}$,
where $t_{\alpha/2}$ is based on [n − (k+1)] df,
n = number of observations,
k = number of independent variables in the model

The values of $t_{\alpha}$ such that $P(t \geq t_{\alpha}) = \alpha$ are given in Table 7.4.
Example 12.4 An electrical utility company wants to predict the monthly power usage of a home as a function of the size of the home based on the model
$$y = B_0 + B_1 x + B_2 x^2 + e.$$
Data are shown in Table 12.3.
Table 12.3 Data for Power Usage Study
SIZE OF HOME x, square feet   MONTHLY USAGE y, kilowatt-hours
1290 1182
1350 1172
1470 1264
1600 1493
1710 1571
1840 1711
1980 1804
2230 1840
2400 1956
2390 1954
a. Find the least squares estimators of $B_0$, $B_1$, $B_2$.
b. Compute the estimated standard error for $b_1$.
c. Compute the value of the test statistic for testing $H_0: B_2 = 0$.
d. Test $H_0: B_2 = 0$ against $H_a: B_2 \neq 0$. State your conclusions.
Solution We use computer with the software STATGRAPHICS to do this example. Below is a
part of the printout of the procedure “ Multiple regression “.
Model fitting results for: ELECTRIC.Y
----------------------------------------------------------------------
Independent variable       coefficient     std. error    t-value   sig. level
----------------------------------------------------------------------
CONSTANT                   -1303.382558    415.209833    -3.1391    0.0164
ELECTRIC.X                     2.497984      0.46109      5.4176    0.0010
ELECTRIC.X * ELECTRIC.X       -0.000477      0.000123    -3.8687    0.0061
----------------------------------------------------------------------
R-SQ. (ADJ.) = 0.9768   SE = 46.689335   MAE = 32.230298   DurbWat = 2.094
Previously:    0.9768        46.689335         32.230298             2.094
10 observations fitted, forecast(s) computed for 0 missing val. of dep. var.
Figure 12.1 STATGRAPHICS printout for Example 12.4
From the printout we see that
a. The least squares model is $\hat{y} = -1303.382558 + 2.497884x - 0.000477x^2$.
b. The estimated standard error for $b_1$ is 0.461069 (in the std. error column).
c. The value of the test statistic for testing $H_0: B_2 = 0$ is t = −3.8687.
d. At significance level α = 0.05, for df = [10 − (2+1)] = 7 we have $t_{\alpha/2}$ = 2.365. Therefore, we will reject $H_0: B_2 = 0$ if t < −2.365 or t > 2.365. Since the observed value t = −3.8687 is less than −2.365, we reject $H_0$, that is, $x^2$ contributes information for the prediction of y.
Below we include also a printout from SPSS for the Example 12.4.
Coefficients
                 Unstandardized Coefficients                      95% Confidence Interval for B
Model            B             Std. Error    t        Sig.    Lower Bound    Upper Bound
1  (Constant)    -1303.383     415.210       -3.139   .016    -2285.196      -321.570
   X             2.498         .461          5.418    .001    1.408          3.588
   X2            -4.768E-04    .000          -3.869   .006    -.001          .000
Figure 12.2 A part of SPSS printout for Example 12.4
12.6. Checking the utility of a model
Conducting t-tests on each B parameter in a model is not a good way to determine whether a model is contributing information for the prediction of y. If we were to conduct a series of t-tests to determine whether the individual variables are contributing to the predictive relationship, it is very likely that we would make one or more errors in deciding which terms to retain in the model and which to exclude.
To test the utility of a multiple regression model, we will need a global test (one that encompasses all the B parameters). We would like to find some statistical quantity that measures how well the model fits the data.
We begin with the easier problem: finding a measure of how well a linear model fits a set of data. For this we use the multiple regression equivalent of $r^2$, the coefficient of determination for the straight-line model (Chapter 11).
Definition 12.1
The multiple coefficient of determination $R^2$ is defined as
$$R^2 = 1 - \frac{SSE}{SS_{yy}}$$
where
$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad SS_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2$$
and $\hat{y}_i$ is the predicted value of $y_i$ for the multiple regression model.
From the definition we see that $R^2 = 0$ implies a complete lack of fit of the model to the data, while $R^2 = 1$ implies a perfect fit with the model passing through every data point. In general, the larger the value of $R^2$, the better the model fits the data.
$R^2$ is a sample statistic that tells how well the model fits the data, and thereby represents a measure of the utility of the entire model. It can be used to make inferences about the utility of the model for predicting y values for specific settings of the independent variables.
TESTING THE OVERALL UTILITY OF THE MODEL $E(y) = B_0 + B_1 x_1 + \cdots + B_k x_k$
$H_0: B_1 = B_2 = \cdots = B_k = 0$ (Null hypothesis: y doesn’t depend on any $x_i$)
$H_a$: At least one $B_i \neq 0$ (Alternative hypothesis: y depends on at least one of the $x_i$’s)
Test statistic:
$$F = \frac{R^2/k}{(1 - R^2)/[n - (k+1)]} = \frac{\text{Mean Square for Model}}{\text{Mean Square for Error}} = \frac{SS(\text{Model})/k}{SSE/[n - (k+1)]}$$
Rejection region: $F > F_{\alpha}$, where $F_{\alpha}$ is the value that locates area α in the upper tail of the F-distribution with $\nu_1 = k$ and $\nu_2 = n - (k+1)$ df,
n = Number of observations, k = Number of parameters in the model (excluding $B_0$),
$R^2$ = Multiple coefficient of determination.
Example 12.5 Refer to Example 12.4. Test to determine whether the model contributes information for the prediction of the monthly power usage.
Solution For the electrical usage example, n = 10, k = 2 and n − (k+1) = 7. At the significance level α = 0.05 we will reject $H_0: B_1 = B_2 = 0$ if $F > F_{0.05}$, where $\nu_1 = 2$ and $\nu_2 = 7$, that is, if F > 4.74. From the computer printout (see Figure 12.3) we find that the computed F is 190.638. Since this value greatly exceeds 4.74, we reject $H_0$ and conclude that at least one of the model coefficients $B_1$ and $B_2$ is nonzero. Therefore, this F test indicates that the second-order model $y = B_0 + B_1 x + B_2 x^2 + e$ is useful for predicting electrical usage.
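Analysis of Variance for the Full Regression
----------------------------------------------------------------------
Source          Sum of Squares   DF   Mean Square   F-Ratio   P-value
----------------------------------------------------------------------
Model           831143.           2   415571.       190.638   0.0000
Error            15259.3          7     2179.89
----------------------------------------------------------------------
Total (Corr.)   846402.           9
R-squared = 0.981972                 Stnd. error of est. = 46.6893
R-squared (Adj. for d.f.) = 0.976821 Durbin-Watson statistic = 2.09356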
Figure 12.3 STATGRAPHICS Printout for Electrical Usage Example
Example 12.6 Refer to Example 12.3. Test the utility of the model $E(y) = A + B_1 x_1 + B_2 x_2$.
Solution From the SPSS printout (Figure 12.4) we see that the F value is 11.356 and the corresponding observed significance level is 0.014. Thus, at any significance level greater than 0.014 we reject the null hypothesis and conclude that the linear model $E(y) = A + B_1 x_1 + B_2 x_2$ is useful for prediction of the grain yield.
ANOVA
Model            Sum of Squares   df   Mean Square    F        Sig.
1  Regression    2632048.153       2   1316024.076    11.356   .014
   Residual       579455.347       5    115891.069
   Total         3211503.500       7
Figure 12.4 SPSS Printout for Grain Yield Example
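The relationship between the ANOVA table, $R^2$ and the F statistic can be verified by hand. The sketch below uses the sums of squares from Figure 12.4 as inputs; the function name `regression_summary` is illustrative, and the adjusted $R^2$ uses the standard degrees-of-freedom adjustment, 1 − (1 − R²)(n − 1)/[n − (k+1)], which the printouts report but the text does not define.

```python
def regression_summary(sse, ss_yy, n, k):
    """Return (R^2, adjusted R^2, F) for a model with k predictors and n observations."""
    df_error = n - (k + 1)
    r2 = 1 - sse / ss_yy
    adj_r2 = 1 - (1 - r2) * (n - 1) / df_error   # standard adjustment (assumption)
    f = (r2 / k) / ((1 - r2) / df_error)
    return r2, adj_r2, f


# Residual and total sums of squares for the grain yield model (Figure 12.4)
r2, adj_r2, f = regression_summary(sse=579455.347, ss_yy=3211503.5, n=8, k=2)
print(f"R^2 = {r2:.6f}, adjusted R^2 = {adj_r2:.6f}, F = {f:.3f}")
```

The three values agree with the R-squared, R-squared (Adj. for d.f.) and F entries printed by STATGRAPHICS and SPSS for this example.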
12.7. Using the model for estimating and prediction
After checking the utility of the linear model and finding it to be useful for prediction and estimation, we may decide to use it for those purposes. Our methods for prediction and estimation using any general model are identical to those discussed in Section 11.7 for the simple straight-line model. We will use the model to form a confidence interval for the mean E(y) for a given value $x^*$ of x, or a prediction interval for a future value of y for a specific $x^*$.
The procedure for forming a confidence interval for E(y) is shown in the following box.
A (1−α)100% CONFIDENCE INTERVAL FOR E(y)
$$\hat{y} \pm t_{\alpha/2}\, s \sqrt{(x^*)'(X'X)^{-1}x^*}$$
where
$\hat{y} = b_0 + b_1 x_1^* + b_2 x_2^* + \cdots + b_k x_k^*$,
$x^* = (1 \;\; x_1^* \;\; x_2^* \;\; \cdots \;\; x_k^*)'$ is the given value of x,
s and $(X'X)^{-1}$ are obtained from the least squares analysis,
$t_{\alpha/2}$ is based on the number of degrees of freedom associated with s, namely, [n − (k+1)].
The procedure for forming a prediction interval for y for a given $x^*$ is shown in the following box.
A (1−α)100% PREDICTION INTERVAL FOR y
$$\hat{y} \pm t_{\alpha/2}\, s \sqrt{1 + (x^*)'(X'X)^{-1}x^*}$$
where
$\hat{y} = b_0 + b_1 x_1^* + b_2 x_2^* + \cdots + b_k x_k^*$,
$x^* = (1 \;\; x_1^* \;\; x_2^* \;\; \cdots \;\; x_k^*)'$ is the given value of x,
s and $(X'X)^{-1}$ are obtained from the least squares analysis,
$t_{\alpha/2}$ is based on the number of degrees of freedom associated with s, namely, [n − (k+1)].
12.8 Multiple linear regression: An overview example
In the previous sections we have presented the basic elements necessary to fit and use a multiple linear regression model. In this section we will assemble these elements by applying them to an example.
Example 12.7 Suppose a property appraiser wants to model the relationship between the sale price of a residential property in a mid-sized city and the following three independent variables:
(1) appraised land value of the property,
(2) appraised value of improvements (i.e., home value),
(3) area of living space on the property (i.e., home size).
Consider the linear model
$$y = B_0 + B_1 x_1 + B_2 x_2 + B_3 x_3 + e$$
where
y = Sale price (dollars)
$x_1$ = Appraised land value (dollars)
$x_2$ = Appraised improvements (dollars)
$x_3$ = Area (square feet)
In order to fit the model, the appraiser selected a random sample of n = 20 properties from the thousands of properties that were sold in a particular year. The resulting data are given in Table 12.4.
Table 12.4 Real Estate Appraisal Data
Property # (Obs.)   Sale price, y   Land value, x_1   Improvements value, x_2   Area, x_3
1 68900 5960 44967 1873
2 48500 9000 27860 928
3 55500 9500 31439 1126
4 62000 10000 39592 1265
5 116500 18000 72827 2214
6 45000 8500 27317 912
7 38000 8000 29856 899
8 83000 23000 47752 1803
9 59000 8100 39117 1204
10 47500 9000 29349 1725
11 40500 7300 40166 1080
12 40000 8000 31679 1529
13 97000 20000 58510 2455
14 45500 8000 23454 1151
15 40900 8000 20897 1173
16 80000 10500 56248 1960
17 56000 4000 20859 1344
18 37000 4500 22610 988
19 50000 3400 35948 1076
20 22400 1500 5779 962
Step 1 Hypothesize the form of the linear model
$$y = B_0 + B_1 x_1 + B_2 x_2 + B_3 x_3 + e$$
Step 2 Use the sample data to find the least squares prediction equation. Using the formulas given in Section 12.3 we found
$$\hat{y} = 1470.28 + 0.8145 x_1 + 0.8204 x_2 + 13.53 x_3.$$
This is the same result as that obtained by computer using STATGRAPHICS (see Figure 12.5).
Step 3 Compute an estimator, $s^2$, for the variance $\sigma^2$ of the random error e:
$$s^2 = \frac{SSE}{n - (k+1)}, \quad \text{where} \quad SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2.$$
STATGRAPHICS gives s = 7919.48 (see Stnd. error of est. in Figure 12.6).
Step 4 Check the utility of the model
a) Does the model fit the data well?
For this purpose calculate the coefficient of determination
$$R^2 = 1 - \frac{SSE}{SS_{yy}}$$
You can see in the printout in Figure 12.6 that SSE = 1003491259 (in column “Sum of Squares” and row “Error”) and $SS_{yy}$ = 9783168000 (in column “Sum of Squares” and row “Total”), and that $R^2$ is R-squared = 0.897427. This large value of $R^2$ indicates that the model provides a good fit to the n = 20 sample data points.
b) Usefulness of the model
Test $H_0: B_1 = B_2 = \cdots = B_k = 0$ (Null hypothesis) against $H_a$: At least one $B_i \neq 0$ (Alternative hypothesis).
Test statistic:
$$F = \frac{R^2/k}{(1 - R^2)/[n - (k+1)]} = \frac{\text{Mean Square for Model}}{\text{Mean Square for Error}} = \frac{SS(\text{Model})/k}{SSE/[n - (k+1)]}$$
In the printout F = 46.6620, and the observed significance level for this test is 0.0000 (under the column P-value). This implies that we would reject the null hypothesis at any commonly used significance level, for example 0.01. Thus, we have strong evidence to reject $H_0$ and conclude that the model is useful for predicting the sale price of residential properties.
Model fitting results for: ESTATE.Y
----------------------------------------------------------------------
Independent variable    coefficient     std. error     t-value   sig. level
----------------------------------------------------------------------
CONSTANT                1470.275919     5746.324583    0.2559    0.8013
ESTATE.X1                  0.81449         0.512219    1.5901    0.1314
ESTATE.X2                  0.820445        0.211185    3.8850    0.0013
ESTATE.X3                 13.52865         6.58568     2.0543    0.0567
----------------------------------------------------------------------
R-SQ. (ADJ.) = 0.8782   SE = 7919.482541   MAE = 5009.367657   DurbWat = 1.242
Previously:    0.0000         0.000000            0.000000              0.000
20 observations fitted, forecast(s) computed for 0 missing val. of dep. var.
Figure 12.5 STATGRAPHICS Printout for Estate Appraisal Example

Analysis of Variance for the Full Regression
----------------------------------------------------------------------
Source          Sum of Squares   DF   Mean Square    F-Ratio   P-value
----------------------------------------------------------------------
Model           8779676741.       3   2926558914.    46.6620   0.0000
Error           1003491259.      16     62718204.
----------------------------------------------------------------------
Total (Corr.)   9783168000.      19
R-squared = 0.897427                 Stnd. error of est. = 7919.48
R-squared (Adj. for d.f.) = 0.878194 Durbin-Watson statistic = 1.24161
Figure 12.6 STATGRAPHICS Printout for Estate Appraisal Example
Step 5 Use the model for estimation and prediction
(1) Construct a confidence interval for E(y) for particular values of the independent variables.
Estimate the mean sale price, E(y), for a property with $x_1$ = 15000, $x_2$ = 50000 and $x_3$ = 1800, using a 95% confidence interval. Substituting these particular values of the independent variables into the least squares prediction equation yields the predicted value 79061.4. In the printout reproduced in Figure 12.7 the 95% confidence interval for the mean sale price corresponding to the given ($x_1$, $x_2$, $x_3$) is (73379.3, 84743.6).
Regression results for ESTATE.Y
Observation Number   Observed Values   Fitted Values   Lower 95% CL for means   Upper 95% CL for means
1 68900 68556.7
2 48500 44212.9
3 55500 50235.2
4 62000 59212
5 116500 105834
6 45000 43143.7
7 38000 44643.6
8 83000 83773.6
9 59000 56449.5
10 47500 56216.8
11 40500 54981
12 40000 54662.4
13 97000 98977.1
14 45500 42800.4
15 40900 41000.1
16 80000 82686.9
17 56000 40024.4
18 37000 37052
19 50000 48289.7
20 22400 20447.9
21 79061.4 73379.3 84743.6
Figure 12.7 STATGRAPHICS Printout for estimated mean and corresponding confidence interval for $x_1$ = 15000, $x_2$ = 50000 and $x_3$ = 1800
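The fitted value in row 21 of the printout can be reproduced by substituting the given x values into the prediction equation. A minimal sketch, assuming the coefficients printed in Figure 12.5; the helper name `predict` is hypothetical:

```python
# Coefficients b0, b1, b2, b3 as printed in Figure 12.5
coefficients = [1470.275919, 0.81449, 0.820445, 13.52865]


def predict(coefficients, x_star):
    """Evaluate y_hat = b0 + b1*x1 + ... + bk*xk at the point x_star."""
    b0, *slopes = coefficients
    return b0 + sum(b * x for b, x in zip(slopes, x_star))


y_hat = predict(coefficients, [15000, 50000, 1800])

# The confidence interval from Figure 12.7 is centred on the predicted value
lower, upper = 73379.3, 84743.6
ci_center = (lower + upper) / 2
print(f"y_hat = {y_hat:.1f}, centre of CI = {ci_center:.2f}")
```

Both intervals reported for this property are symmetric about the same predicted value; only their half-widths differ.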
(2) Construct a prediction interval for y for particular values of the independent variables.
For example, construct a 95% prediction interval for y with $x_1$ = 15000, $x_2$ = 50000 and $x_3$ = 1800.
The printout reproduced in Figure 12.8 shows that the prediction interval for y with the given x is (61333.4, 96789.4).
We see that the prediction interval for a particular value of y is wider than the confidence interval for the mean value.
Regression results for ESTATE.Y
Observation Number   Observed Values   Fitted Values   Lower 95% CL for forecasts   Upper 95% CL for forecasts
1 68900 68556.7
2 48500 44212.9
3 55500 50235.2
4 62000 59212
5 116500 105834
6 45000 43143.7
7 38000 44643.6
8 83000 83773.6
9 59000 56449.5
10 47500 56216.8
11 40500 54981
12 40000 54662.4
13 97000 98977.1
14 45500 42800.4
15 40900 41000.1
16 80000 82686.9
17 56000 40024.4
18 37000 37052
19 50000 48289.7
20 22400 20447.9
21 79061.4 61333.4 96789.4
Figure 12.8 STATGRAPHICS Printout for estimated mean and corresponding prediction interval for $x_1$ = 15000, $x_2$ = 50000 and $x_3$ = 1800
12.9 Model building: interaction models
Suppose the relationship between the dependent variable y and the independent variables $x_1$ and $x_2$ is described by the first-order linear model $E(y) = B_0 + B_1 x_1 + B_2 x_2$. When the value of one variable, say $x_2$, is fixed, then E(y) is a linear function of the other variable ($x_1$):
$$E(y) = (B_0 + B_2 x_2) + B_1 x_1.$$
Therefore, the graphs of E(y) against $x_1$ for different fixed values of $x_2$ form a set of parallel straight lines.
For example, if
$$E(y) = 1 + 2x_1 - x_2,$$
the graphs of E(y) for $x_2$ = 0, $x_2$ = 2 and $x_2$ = 3 are depicted in Figure 12.9.
When this situation occurs (as it always does for a first-order model), we say that the relationship between E(y) and any one independent variable does not depend on the value of the other independent variable(s) in the model; that is, we say that the independent variables do not interact.
Figure 12.9 Graphs of $E(y) = 1 + 2x_1 - x_2$ versus $x_1$ for fixed values of $x_2$ (lines shown for $x_2 = 0$, $x_2 = 2$ and $x_2 = 3$)
However, if the relationship between E(y) and $x_1$ does, in fact, depend on the value of $x_2$ held fixed, then the first-order model is not appropriate for predicting y. In this case we need another model that takes this dependence into account. Such a model is illustrated in the next example.
Example 12.8 Suppose that the mean value E(y) of a response y is related to two quantitative variables $x_1$ and $x_2$ by the model
$E(y) = 1 + 2x_1 - x_2 + x_1 x_2$.
Graph the relationship between E(y) and $x_1$ for $x_2$ = 0, 2 and –3. Interpret the graph.
Figure 12.10 Graphs of $E(y) = 1 + 2x_1 - x_2 + x_1 x_2$ versus $x_1$ for fixed values of $x_2$ (lines shown for $x_2 = 0$, $x_2 = 2$ and $x_2 = -3$)
Solution For fixed values of $x_2$, E(y) is a linear function of $x_1$. Graphs of the straight lines of E(y) for $x_2$ = 0, 2 and –3 are depicted in Figure 12.10. Note that the slope of each line is $2 + x_2$, since $E(y) = (1 - x_2) + (2 + x_2)x_1$. The effect of adding a term involving the product $x_1 x_2$ can be seen in the figure. In contrast to Figure 12.9, the lines relating E(y) to $x_1$ are no longer parallel. The effect on E(y) of a change in $x_1$ (i.e. the slope) now depends on the value of $x_2$.
When this situation occurs, we say that $x_1$ and $x_2$ interact. The cross-product term, $x_1 x_2$, is called an interaction term and the model
$E(y) = B_0 + B_1 x_1 + B_2 x_2 + B_3 x_1 x_2$
is called an interaction model with two independent variables.
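The dependence of the slope on $x_2$ in Example 12.8 can be checked numerically. The following is a minimal sketch; the function names `expected_y` and `slope_in_x1` are illustrative and not part of the course material.

```python
def expected_y(x1, x2):
    """E(y) = 1 + 2*x1 - x2 + x1*x2, the interaction model of Example 12.8."""
    return 1 + 2 * x1 - x2 + x1 * x2


def slope_in_x1(x2):
    """Change in E(y) per unit increase in x1 with x2 held fixed."""
    return expected_y(1, x2) - expected_y(0, x2)


# Slope of each line in Figure 12.10; the text gives it as 2 + x2.
slopes = {x2: slope_in_x1(x2) for x2 in (0, 2, -3)}
for x2, slope in slopes.items():
    intercept = expected_y(0, x2)
    print(f"x2 = {x2:>2}: E(y) = {intercept} + ({slope}) * x1")
```

Because the three slopes differ, the lines are not parallel, which is what the interaction term $x_1 x_2$ contributes.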
Below we suggest a practical procedure for building an interaction model.
Procedure to build an interaction model for the relationship between E(y) and two independent variables $x_1$ and $x_2$
1. If from observations it is known that the rate of change of E(y) in $x_1$ depends on $x_2$ and vice versa, then the interaction model
$E(y) = B_0 + B_1 x_1 + B_2 x_2 + B_3 x_1 x_2$
is hypothesized.
2. Fit the model to the data.
3. Check if the model fits the data well.
4. Test whether the model is useful for predicting y, i.e., test the hypothesis $H_0: B_1 = B_2 = B_3 = 0$ (null hypothesis) against $H_a$: at least one $B_i \neq 0$ (alternative hypothesis).
5. If the model is useful for predicting y (i.e., $H_0$ is rejected), test whether the interaction term contributes significantly to the model:
$H_0: B_3 = 0$ (no interaction between $x_1$ and $x_2$)
$H_a: B_3 \neq 0$ ($x_1$ and $x_2$ interact)
12.9. Model building: quadratic models
A quadratic (second-order) model in a single quantitative independent variable is
$E(y) = B_0 + B_1 x + B_2 x^2$
where
$B_0$ = y-intercept of the curve
$B_1$ = shift parameter
$B_2$ = rate of curvature
Procedure to build a quadratic model for the relationship between E(y) and an independent variable x
1. State the hypothesized model
$E(y) = B_0 + B_1 x + B_2 x^2$
2. Fit the model to the data.
3. Check if the model fits the data well.
4. Test whether the model is useful for predicting y, i.e., test the hypothesis $H_0: B_1 = B_2 = 0$ (null hypothesis) against $H_a$: at least one $B_i \neq 0$ (alternative hypothesis).
5. If the model is useful for predicting y (i.e., $H_0$ is rejected), test whether the second-order term contributes significantly to the model:
$H_0: B_2 = 0$
$H_a: B_2 \neq 0$.
12.11 Summary
In this chapter we have discussed some of the methodology of multiple regression analysis, a technique for modeling a dependent variable y as a function of several independent variables $x_1, x_2, \ldots, x_k$. The steps employed in a multiple regression analysis are much the same as those employed in a simple regression analysis:
1. The form of the probabilistic model is hypothesized.
2. The appropriate model assumptions are made.
3. The model coefficients are estimated using the method of least squares.
4. The utility of the model is checked using the overall F-test and t-tests on individual B parameters.
5. If the model is deemed useful and the assumptions are satisfied, it may be used to make
estimates and to predict values of y to be observed in the future.
12.12 Exercises
1. Suppose you fit the first-order multiple regression model
$y = B_0 + B_1 x_1 + B_2 x_2 + e$
to n = 20 data points and obtain the prediction equation
$\hat{y} = 6.4 + 3.1 x_1 + 0.92 x_2$
The estimated standard deviations of the sampling distributions of $b_1$ and $b_2$ (the least squares estimators of $B_1$ and $B_2$) are 2.3 and 0.27, respectively.
a) Test $H_0: B_1 = 0$ against $H_a: B_1 > 0$. Use α = 0.05.
b) Test $H_0: B_2 = 0$ against $H_a: B_2 > 0$. Use α = 0.05.
c) Find a 95% confidence interval for $B_1$. Interpret the interval.
d) Find a 99% confidence interval for $B_2$. Interpret the interval.
2. Suppose you fit the first-order multiple regression model
$y = B_0 + B_1 x_1 + B_2 x_2 + B_3 x_3 + e$
to n = 20 data points and obtain $R^2 = 0.2632$. Test the null hypothesis $H_0: B_1 = B_2 = B_3 = 0$ against the alternative hypothesis that at least one of the B parameters is nonzero. Use α = 0.05.
3. Plastics made under different environmental conditions are known to have differing strengths. A scientist would like to know which combination of temperature and pressure yields a plastic with a high breaking strength. A small preliminary experiment was run at two pressure levels and two temperature levels. The following model is proposed:
$E(y) = B_0 + B_1 x_1 + B_2 x_2$
where
y = Breaking strength (pounds)
$x_1$ = Temperature (°F)
$x_2$ = Pressure (pounds per square inch).
A sample of n = 16 observations yields
$\hat{y} = 226.8 + 4.9 x_1 + 1.2 x_2$
with $s_{b_1} = 1.11$, $s_{b_2} = 0.27$.
Do the data indicate that pressure is an important predictor of breaking strength? Test using α = 0.05.
4. Suppose you fit the interaction model
$E(y) = B_0 + B_1 x_1 + B_2 x_2 + B_3 x_1 x_2$
to n = 32 data points and obtain the following results:
$SS_{yy} = 479$, SSE = 21, $b_3 = 10$, $s_{b_3} = 4$.
a) Find $R^2$ and interpret its value.
b) Is the model adequate for predicting y? Test at α = 0.05.
c) Use a graph to explain the contribution of the $x_1 x_2$ term to the model.
d) Is there evidence that $x_1$ and $x_2$ interact? Test at α = 0.05.
5. Researchers at the International Rice Research Institute in the Philippines conducted a study on the yield response of rice variety IR6611170 to nitrogen fertilizer. They obtained the following data
Pair Number   Grain Yield, kg/ha, y   Nitrogen Rate, kg/ha, x
1             4878                    0
2             5506                    30
3             6083                    60
4             6291                    90
5             6361                    120
and suggested the quadratic model
$E(y) = B_0 + B_1 x + B_2 x^2$
Portions of the STATGRAPHICS printouts are shown below.
a) Identify the least squares model fitted to the data.
b) What are the values of SSE and $s^2$ for the data?
c) Perform a test of overall model adequacy. Use α = 0.05.
d) Test whether the second-order term contributes significantly to the model. Use α = 0.05.
Model fitting results for: NITROGEN.y
Independent variable   coefficient    std. error   t-value    sig. level
CONSTANT               4861.457143    47.349987    102.6707   0.0001
x                      26.64619       1.869659     14.2519    0.0049
x * x                  -0.117857      0.014941     -7.8884    0.0157
R-SQ. (ADJ.) = 0.9935   SE = 50.312168   MAE = 25.440000   DurbWat = 3.426
5 observations fitted, forecast(s) computed for 0 missing val. of dep. var.
Analysis of Variance for the Full Regression
Source          Sum of Squares   DF   Mean Square   F-Ratio   P-value
Model           1564516          2    782258        309.032   0.0032
Error           5062.63          2    2531.31
Total (Corr.)   1569579          4
R-squared = 0.996775   Stnd. error of est. = 50.3122
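The figures in the printout can be cross-checked by solving the least squares normal equations for the five data pairs directly. The following is a minimal sketch in plain Python; the helper `solve_linear_system` and all variable names are illustrative assumptions, and a statistics package would normally be used instead.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


nitrogen = [0, 30, 60, 90, 120]            # x, kg/ha
yield_kg = [4878, 5506, 6083, 6291, 6361]  # y, kg/ha

# Design matrix with columns 1, x, x^2
design = [[1.0, x, x * x] for x in nitrogen]
# Normal equations (X'X) b = X'y
xtx = [[sum(row[i] * row[j] for row in design) for j in range(3)] for i in range(3)]
xty = [sum(row[i] * y for row, y in zip(design, yield_kg)) for i in range(3)]
b0, b1, b2 = solve_linear_system(xtx, xty)

fitted = [b0 + b1 * x + b2 * x * x for x in nitrogen]
sse = sum((y - f) ** 2 for y, f in zip(yield_kg, fitted))
mean_y = sum(yield_kg) / len(yield_kg)
ss_total = sum((y - mean_y) ** 2 for y in yield_kg)
r_squared = 1 - sse / ss_total
# Overall F statistic: k = 2 model terms, n - (k + 1) = 2 error degrees of freedom
f_ratio = ((ss_total - sse) / 2) / (sse / (len(yield_kg) - 3))

print(f"b0 = {b0:.4f}, b1 = {b1:.5f}, b2 = {b2:.6f}")
print(f"SSE = {sse:.2f}, R^2 = {r_squared:.6f}, F = {f_ratio:.3f}")
```

The negative coefficient of the squared term corresponds to a curve that bends downward: yield rises with nitrogen rate but at a decreasing rate.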
Chapter 13 Nonparametric statistics
CONTENTS
13.1 Introduction
13.2 The sign test for a single population
13.3 Comparing two populations based on independent random samples: Wilcoxon rank sum test
13.4 Comparing two populations based on matched pairs: the Wilcoxon signed ranks test
13.5 Comparing populations using a completely randomized design: the Kruskal-Wallis H test
13.6 Rank correlation: Spearman's $r_s$ statistic
13.7 Summary
13.8 Exercises
13.1. Introduction
The majority of hypothesis tests (t- and F-tests) discussed so far have made inferences about population parameters, such as the mean and the proportion. These parametric tests have used the parametric statistics of samples that came from the population being tested. To formulate these tests, we made restrictive assumptions about the populations from which we drew our samples. In each case of Chapter 9, for example, we assumed that our samples either were large or came from normally distributed populations. But populations are not always normal. And even if a goodness-of-fit test indicates that a population is approximately normal, we cannot always be certain we are right, because the test is not 100 percent reliable. Clearly, there are certain situations in which the use of the normal curve is not appropriate.
Another case in which the t- and F-tests are inappropriate is when the data are not measurements but can be ranked in order of magnitude. For example, suppose we want to compare the ease of operation of two types of computer software based on subjective evaluations by trained observers. Although we cannot give an exact value to the variable "ease of operation of the software package", we may be able to decide that package A is better than package B. If packages A and B are evaluated by each of ten observers, we have the standard problem of comparing the probability distributions for two populations of ratings, one for package A and one for package B. But the t-test of Chapter 9 would be inappropriate, because the only data that can be recorded are preferences; that is, each observer decides either that A is better than B or vice versa.
For these two types of situations statisticians have developed useful techniques called nonparametric methods or nonparametric statistics. The nonparametric counterparts of the t- and F-tests compare the relative locations of the probability distributions of the sampled populations, rather than specific parameters of these populations (such as the means or variances). Many nonparametric methods use the relative ranks of the sample observations rather than their actual numerical values.
A large number of nonparametric tests exist, but this chapter will examine only a few of the
better known and more widely used ones.
13.2. The sign test for a single population
Recall from Chapter 9 that small-sample procedures for testing a hypothesis about a population mean require that the population have an approximately normal distribution. For situations in which we collect a small sample (n < 30) from a non-normal distribution, the t-test is not valid and we must resort to a nonparametric procedure. The simplest nonparametric technique to apply in this situation is the sign test. The sign test is specifically designed for testing hypotheses about the median of any continuous population. Like the mean, the median is a measure of the center, or location, of the distribution; therefore, the sign test is sometimes referred to as a test for location.
The theoretical background of the sign test follows.
Let $x_1, x_2, \ldots, x_n$ be a random sample from a population with unknown median M. Suppose we want to test the null hypothesis $H_0: M = M_0$ against the one-sided alternative $H_a: M > M_0$. From Definition 3.2 we know that the median is a number such that half the area under the probability distribution lies to the left of M and half lies to the right. Therefore, the probability that an x-value selected from the population is larger than M is 0.5, i.e., $P(x_i > M) = 0.5$. If, in fact, the null hypothesis is true, then we should expect to observe approximately half the sample x-values greater than $M = M_0$.
The sign test utilizes the test statistic S, where
S = {number of values $x_i$ that exceed $M_0$}.
Notice that S depends only on the sign (positive or negative) of the difference $x_i - M_0$. That is, we simply count the number of positive (+) signs among the differences $x_i - M_0$. If S is "too large" then we will reject $H_0$ in favor of $H_a: M > M_0$.
The rejection region for the sign test is derived as follows. Let each sample difference $x_i - M_0$ denote the outcome of a single trial in an experiment consisting of n identical trials. If we call a positive difference a "Success" and a negative difference a "Failure", then S is the number of successes in n trials. Under $H_0$ the probability of observing a success on any one trial is
$p = P(\text{Success}) = P(x_i - M_0 > 0) = P(x_i > M_0) = 0.5$
Since the trials are independent, the properties of a binomial distribution, listed in Section 5.3, are satisfied. Therefore, S has a binomial distribution with parameters n and p = 0.5. We can use this fact to calculate the observed significance level (p-value) of the sign test.
The procedure for the sign test is presented in the following box.
SIGN TEST FOR A POPULATION MEDIAN
ONE-TAILED TEST
$H_0: M = M_0$
$H_a: M > M_0$ (or $H_a: M < M_0$)
Test statistic:
S = Number of sample observations greater than $M_0$
(or S = Number of sample observations less than $M_0$)
Observed significance level:
p-value = $P(S \ge S_c)$
TWO-TAILED TEST
$H_0: M = M_0$
$H_a: M \neq M_0$
Test statistic:
S = max($S_1$, $S_2$),
where $S_1$ = Number of sample observations greater than $M_0$,
$S_2$ = Number of sample observations less than $M_0$
[Note: By definition $S_2 = n - S_1$]
Observed significance level:
p-value = $2P(S \ge S_c)$
where $S_c$ is the computed value of the test statistic and S has a binomial distribution with parameters n and p = 0.5.
Rejection region: Reject $H_0$ if α > p-value.
Example 13.1 Suppose the following sample is randomly selected from a population:
41 33 43 52 46 37 44 49 53 30.
Do the data provide sufficient evidence to indicate that the median percentage of the population is greater than 40? Test using α = 0.05.
Solution We want to test
$H_0: M = 40$
$H_a: M > 40$
using the sign test. The test statistic
S = {Number of sample observations greater than 40}
has a binomial distribution with n = 10 and p = 0.5.
The computed test statistic is $S_c = 7$ and the p-value = P(S ≥ 7) = 1 – P(S ≤ 6) = 1 – 0.828 = 0.172.
Since the p-value > α = 0.05, we cannot reject the null hypothesis. That is, there is insufficient evidence to indicate that the median percentage of the population exceeds 40.
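The count of positive signs and the exact binomial p-value of Example 13.1 can be reproduced in a few lines. This is a minimal sketch; the variable names are illustrative.

```python
from math import comb

sample = [41, 33, 43, 52, 46, 37, 44, 49, 53, 30]
m0 = 40  # hypothesized median

# Observations equal to m0 carry no sign and are dropped (none occur here).
signs = [x - m0 for x in sample if x != m0]
n = len(signs)
s_c = sum(1 for d in signs if d > 0)  # number of positive signs

# Upper-tail binomial probability P(S >= s_c) with p = 0.5
p_value = sum(comb(n, k) for k in range(s_c, n + 1)) / 2 ** n

print(f"n = {n}, S_c = {s_c}, p-value = {p_value:.3f}")
```

With p = 0.5 every outcome sequence has probability $1/2^n$, so the p-value reduces to a sum of binomial coefficients divided by $2^n$.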
Recall from Section 5.8 that a normal distribution with mean μ = np and variance $\sigma^2 = np(1-p)$ can be used to approximate the binomial distribution for large n. When p = 0.5, the normal approximation performs reasonably well even for n as small as 10 (see Figure 5.6 or Table 5.4).
Thus, for n ≥ 10 we can conduct the sign test using the familiar standard normal z-statistic.
SIGN TEST BASED ON A LARGE SAMPLE (n ≥ 10)
ONE-TAILED TEST
$H_0: M = M_0$
$H_a: M > M_0$ (or $H_a: M < M_0$)
TWO-TAILED TEST
$H_0: M = M_0$
$H_a: M \neq M_0$
Test statistic (both tests):
$z = \frac{S - E(S)}{\sigma_S} = \frac{S - 0.5n}{\sqrt{(0.5)(0.5)n}} = \frac{S - 0.5n}{0.5\sqrt{n}}$
One-tailed test:
S = Number of sample observations greater than $M_0$
(or S = Number of sample observations less than $M_0$)
Rejection region: $z > z_\alpha$ (or $z < -z_\alpha$)
Two-tailed test:
S = max($S_1$, $S_2$),
where $S_1$ = Number of sample observations greater than $M_0$,
$S_2$ = Number of sample observations less than $M_0$
[Note: By definition $S_2 = n - S_1$]
Rejection region: $z < -z_{\alpha/2}$ or $z > z_{\alpha/2}$
where $z_\alpha$ and $z_{\alpha/2}$ are tabulated values given in any table of normal curve areas.
Example 13.2 Refer to Example 13.1. Conduct the sign test based on the z-statistic.
Solution For this example the software STATGRAPHICS provides the following printout.
Tests for Location
Data: 41 33 43 52 46 37 44 49 53 30
Hypothesized median: 40
Test based on: Signs
Sample median = 43.5
Number of values above hypothesized median = 7
Number of values below hypothesized median = 3
Expected number = 5
Large sample test statistic Z = 0.948683
Two-tailed probability of equaling or exceeding Z = 0.34278
NOTE: 10 observations. 0 values equal to hypothesized median ignored.
Figure 13.1 STATGRAPHICS printout for Example 13.2.
From the printout we see that the computed statistic is $z_c = 0.948683$ and the two-tailed probability is $P(|z| \ge z_c) = 0.34278$. Therefore $P(z \ge z_c) = 0.17139$, that is, the one-tailed p-value = 0.17139.
Since the p-value > α = 0.05, we cannot reject the null hypothesis. That is, there is insufficient evidence to indicate that the median percentage of the population exceeds 40.
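The z-statistic in the printout can be compared with the formula in the box above. The sketch below is an observation about the numbers rather than documented STATGRAPHICS behavior: the box formula gives a different value, while subtracting a continuity correction of 0.5 from S reproduces the printed Z. The function name `upper_tail` is illustrative.

```python
from math import erfc, sqrt

n, s_c = 10, 7  # sample size and number of positive signs from Example 13.1


def upper_tail(z):
    """P(Z >= z) for a standard normal Z."""
    return 0.5 * erfc(z / sqrt(2))


z_box = (s_c - 0.5 * n) / (0.5 * sqrt(n))              # formula in the box
z_corrected = (s_c - 0.5 - 0.5 * n) / (0.5 * sqrt(n))  # assumed continuity correction

print(f"z (box formula)           = {z_box:.6f}")
print(f"z (continuity correction) = {z_corrected:.6f}")
print(f"one-tailed p-value        = {upper_tail(z_corrected):.5f}")
```

The corrected p-value is close to the exact binomial value 0.172 found in Example 13.1, which is the usual motivation for the correction when n is small.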
13.3 Comparing two populations based on independent random
samples: Wilcoxon rank sum test
In Chapter 9 we presented parametric tests (tests about population parameters) based on the z- and t-statistics, to test for a difference between two population means. Recall that the mean of a population measures the location of the population distribution. Another measure of the location of the population distribution is the median M. Thus, if the data provide sufficient evidence to indicate that $M_1 > M_2$, we picture the distribution of population 1 as shifted to the right of that of population 2.
The equivalent nonparametric test is not a test about the difference between population means. Rather, it is a test to detect whether distribution 1 is shifted to the right of distribution 2 or vice versa. The test, based on independent random samples of $n_1$ and $n_2$ observations from the respective populations, is known as the Wilcoxon rank sum test.
To use the Wilcoxon rank sum test, we first rank all ($n_1 + n_2$) observations, assigning a rank of 1 to the smallest, 2 to the second smallest, and so on. Tied observations (if they occur) are assigned ranks equal to the average of the ranks of the tied observations. For example, if the second and the third ranked observations were tied, each would be assigned the rank 2.5. The sum of the ranks, called a rank sum, is then calculated for each sample. If the two distributions are identical, we would expect the rank sums, designated as $T_1$ and $T_2$, to be nearly equal (for equal sample sizes). In contrast, if one rank sum, say $T_1$, is much larger than the other, $T_2$, then the data suggest that the distribution for population 1 is shifted to the right of the distribution for population 2. The procedure for conducting a Wilcoxon rank sum test is summarized in the following box.
WILCOXON RANK SUM TEST FOR A SHIFT IN POPULATION LOCATIONS: INDEPENDENT RANDOM SAMPLES
ONE-TAILED TEST
$H_0$: The sampled populations have identical probability distributions
$H_a$: The probability distribution for population 1 is shifted to the right of that for population 2
TWO-TAILED TEST
$H_0$: The sampled populations have identical probability distributions
$H_a$: The probability distribution for population 1 is shifted either to the left or to the right of that for population 2
Rank the $n_1 + n_2$ observations in the two samples from the smallest (rank 1) to the largest (rank $n_1 + n_2$). Calculate $T_1$ and $T_2$, the rank sums associated with sample 1 and sample 2, respectively. Then calculate the test statistic.
One-tailed test:
Test statistic: $T_1$ if $n_1 < n_2$, or $T_2$ if $n_2 \le n_1$
Rejection region: $T_1 \ge T_U$ if $T_1$ is the test statistic; or $T_2 \le T_L$ if $T_2$ is the test statistic
Two-tailed test:
Test statistic: $T_1$ if $n_1 < n_2$; $T_2$ if $n_2 \le n_1$. We will denote this rank sum as T.
Rejection region: $T \ge T_U$ or $T \le T_L$
where $T_U$ and $T_L$ are obtained from Table 1 of Appendix D
Example 13.3 Independent random samples were selected from two populations. The data are shown in Table 13.1. Is there sufficient evidence to indicate that population 1 is shifted to the right of population 2? Test using α = 0.05.
Table 13.1 Data for Example 13.3
Sample from Population 1   Sample from Population 2
17 10
14 15
12 7
16 6
23 13
18 11
10 12
8 9
19 17
22 14
Solution The ranks of the 20 observations, from lowest to highest, are shown in Table 13.2.
We test
$H_0$: The sampled populations have identical probability distributions
$H_a$: The probability distribution for population 1 is shifted to the right of that for population 2
Because $n_1 = n_2$, the test statistic is $T_2 = 78$. Examining Table 13.3 we find that the critical values corresponding to $n_1 = n_2 = 10$ are $T_L = 79$ and $T_U = 131$. Therefore, for a one-tailed test at α = 0.025, we will reject $H_0$ if $T_2 \le T_L$, i.e., reject $H_0$ if $T_2 \le 79$. Since the observed value of the test statistic, $T_2 = 78$, is less than 79, we reject $H_0$ and conclude (at α = 0.025, and hence also at α = 0.05) that the probability distribution for population 1 is shifted to the right of that for population 2.
Table 13.2 Calculations of rank sums for Example 13.3
Sample from Population 1     Sample from Population 2
Raw data   Rank              Raw data   Rank
17         15.5              10         5.5
14         11.5              15         13
12         8.5               7          2
16         14                6          1
23         20                13         10
18         17                11         7
10         5.5               12         8.5
8          3                 9          4
19         18                17         15.5
22         19                14         11.5
$T_1 = 132$                  $T_2 = 78$
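The ranking with averaged ties and the two rank sums of Table 13.2 can be reproduced as follows. This is a minimal sketch; the helper name `average_ranks` is illustrative.

```python
def average_ranks(values):
    """Rank values from 1 upward, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + 1 + end + 1) / 2  # mean of ranks start+1 .. end+1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


sample1 = [17, 14, 12, 16, 23, 18, 10, 8, 19, 22]
sample2 = [10, 15, 7, 6, 13, 11, 12, 9, 17, 14]

ranks = average_ranks(sample1 + sample2)
t1 = sum(ranks[:len(sample1)])
t2 = sum(ranks[len(sample1):])

t_lower = 79  # T_L from Table 13.3 for n1 = n2 = 10
print(f"T1 = {t1}, T2 = {t2}, reject H0: {t2 <= t_lower}")
```

As a check, $T_1 + T_2$ must equal the sum of the ranks 1 to 20, which is 20(21)/2 = 210.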
Table 13.3 A Partial Reproduction of Table 1 of Appendix D
Critical values of $T_L$ and $T_U$ for the Wilcoxon Rank Sum Test: Independent samples
a. Alpha = 0.025 one-tailed; alpha = 0.05 two-tailed
n2\n1      3         4         5         6         7         8         9         10
        TL  TU    TL  TU    TL  TU    TL  TU    TL  TU    TL  TU    TL  TU    TL  TU
 3       5  16     6  18     6  21     7  23     7  26     8  28     8  31     9  33
 4       6  18    11  25    12  28    12  32    13  35    14  38    15  41    16  44
 5       6  21    12  28    18  37    19  41    20  45    21  49    22  53    24  56
 6       7  23    12  32    19  41    26  52    28  56    29  61    31  65    32  70
 7       7  26    13  35    20  45    28  56    37  68    39  73    41  78    43  83
 8       8  28    14  38    21  49    29  61    39  73    49  87    51  93    54  98
 9       8  31    15  41    22  53    31  65    41  78    51  93    63 108    66 114
10       9  33    16  44    24  56    32  70    43  83    54  98    66 114    79 131
Many nonparametric test statistics have sampling distributions that are approximately normal when $n_1$ and $n_2$ are large. For these situations we can test hypotheses using the large-sample z-test.
WILCOXON RANK SUM TEST FOR LARGE SAMPLES ($n_1 \ge 10$ and $n_2 \ge 10$)
ONE-TAILED TEST
$H_0$: The sampled populations have identical probability distributions
$H_a$: The probability distribution for population 1 is shifted to the right of that for population 2
(or population 1 is shifted to the left of population 2)
TWO-TAILED TEST
$H_0$: The sampled populations have identical probability distributions
$H_a$: The probability distribution for population 1 is shifted either to the left or to the right of that for population 2
Test statistic (both tests):
$z = \frac{T_1 - \frac{n_1(n_1 + n_2 + 1)}{2}}{\sqrt{\frac{n_1 n_2 (n_1 + n_2 + 1)}{12}}}$
Rejection region (one-tailed): $z > z_\alpha$ (or $z < -z_\alpha$)
Rejection region (two-tailed): $z < -z_{\alpha/2}$ or $z > z_{\alpha/2}$
where $z_\alpha$ and $z_{\alpha/2}$ are tabulated values given in Table 1 of Appendix C
Example 13.4 Refer to Example 13.3. Using the above large-sample z-test, check whether there is sufficient evidence to indicate that population 1 is shifted to the right of population 2. Test using α = 0.05.
Solution We do this example with the help of a computer, using STATGRAPHICS. The printout is given in Figure 13.2.
Comparison of Two Samples
Sample 1: 17 14 12 16 23 18 10 8 19 22
Sample 2: 10 15 7 6 13 11 12 9 17 14
Test: Unpaired
Average rank of first group = 13.2 based on 10 values.
Average rank of second group = 7.8 based on 10 values.
Large sample test statistic Z = 2.00623
Two-tailed probability of equaling or exceeding Z = 0.0448313
NOTE: 20 total observations.
Figure 13.2 STATGRAPHICS printout for Example 13.4
From the printout we see that the computed test statistic is $z_c = 2.00623$ and the two-tailed probability is $P(|z| \ge z_c) = 0.0448313$. Therefore, the one-tailed p-value is $P(z \ge z_c) = 0.022415$. Hence, at any significance level α ≥ 0.023, and in particular at α = 0.05, we reject the null hypothesis and conclude that the probability distribution for population 1 is shifted to the right of that for population 2.
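The large-sample statistic can be evaluated directly from $T_1 = 132$ and $n_1 = n_2 = 10$. The sketch below first applies the formula in the box and then, as an assumption about how the printed value arises, applies a continuity correction of 0.5 and the usual variance adjustment for tied ranks. With both adjustments the result agrees with the printed Z = 2.00623; this is an observation about the numbers, not documented STATGRAPHICS behavior.

```python
from math import erfc, sqrt

n1, n2, t1 = 10, 10, 132.0  # sample sizes and rank sum from Example 13.3
n = n1 + n2

mean_t1 = n1 * (n + 1) / 2
z_box = (t1 - mean_t1) / sqrt(n1 * n2 * (n + 1) / 12)  # formula in the box

# Assumed adjustments: the data contain four tied pairs (values 10, 12, 14, 17),
# each contributing t**3 - t = 6 to the tie term of the variance.
tie_sizes = [2, 2, 2, 2]
tie_term = sum(t ** 3 - t for t in tie_sizes) / (n * (n - 1))
var_tied = n1 * n2 / 12 * ((n + 1) - tie_term)
z_corrected = (t1 - mean_t1 - 0.5) / sqrt(var_tied)

p_one_tailed = 0.5 * erfc(z_corrected / sqrt(2))  # P(Z >= z_corrected)

print(f"z (box formula)      = {z_box:.5f}")
print(f"z (with adjustments) = {z_corrected:.5f}")
print(f"one-tailed p-value   = {p_one_tailed:.6f}")
```

Either version of z exceeds $z_{0.05} = 1.645$, so the conclusion at α = 0.05 does not depend on the adjustments.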
13.4. Comparing two populations based on matched pairs:
the Wilcoxon signed ranks test
Recall from Chapter 9 that the analysis of matched-pairs data is based on the differences within the matched pairs of observations. The Wilcoxon signed ranks test is a nonparametric test to detect shifts in locations for population probability distributions. The test is summarized in the box.
WILCOXON SIGNED RANKS TEST: MATCHED PAIRS
ONE-TAILED TEST
$H_0$: The sampled populations have identical probability distributions
$H_a$: The probability distribution for population 1 is shifted to the right of that for population 2
TWO-TAILED TEST
$H_0$: The sampled populations have identical probability distributions
$H_a$: The probability distribution for population 1 is shifted either to the left or to the right of that for population 2
Calculate the differences within each of the n matched pairs of observations. Then rank the absolute values of the n differences from the smallest (rank 1) to the highest (rank n) and calculate the rank sum $T_-$ of the negative differences and the rank sum $T_+$ of the positive differences.
One-tailed test:
Test statistic: $T_-$, the rank sum of the negative differences
Rejection region: $T_- \le T_0$
Two-tailed test:
Test statistic: $T = \min(T_-, T_+)$
Rejection region: $T \le T_0$
where $T_0$ is given in Table 2 of Appendix D
[Note: differences equal to 0 are eliminated and the number n of differences is reduced accordingly. Tied absolute differences receive ranks equal to the average of the ranks they would have received had they not been tied.]
Example 13.5 Suppose that a company wants to know the opinion of customers about the quality of its product before and after introducing a new technology. The company randomly selects 10 customers and each of them is given a sample of the product before (B) and after (A) introducing the new technology. Each customer rates the quality of each product on a scale from 1 to 10. The results of the experiment are shown in Table 13.4. Is there sufficient evidence to indicate that the product after introducing the new technology is rated higher than the one before the new technology? Test using α = 0.05.
Table 13.4 Product quality ratings
Customer   Product A   Product B   Difference (A – B)   Absolute value |A – B|   Rank of |A – B|
1          6           4            2                   2                        5
2          8           5            3                   3                        7.5
3          4           5           -1                   1                        2
4          9           8            1                   1                        2
5          4           1            3                   3                        7.5
6          7           9           -2                   2                        5
7          6           2            4                   4                        9
8          5           3            2                   2                        5
9          6           7           -1                   1                        2
10         8           2            6                   6                        10
$T_+$ = Sum of positive ranks = 46
$T_-$ = Sum of negative ranks = 9
Solution We must test the hypotheses:
$H_0$: The sampled populations have identical probability distributions
$H_a$: The probability distribution for population 1 (A) is shifted to the right of that for population 2 (B).
We will use $T_-$ as the test statistic and reject $H_0$ if $T_- \le T_0$.
For our example, the computed value is $T_- = 9$. Examining Table 13.5 in the row corresponding to a one-tailed test with α = 0.05 and the column for n = 10, we read $T_0 = 11$. Since $T_- = 9 < 11$ we reject $H_0$ and conclude that there is sufficient evidence to indicate that the probability distribution of population A is shifted to the right of the probability distribution of population B, that is, after introducing the new technology the product is rated higher than before.
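The signed rank sums of Example 13.5 can be recomputed from the ratings in Table 13.4. This is a minimal sketch; the list and function names are illustrative.

```python
after = [6, 8, 4, 9, 4, 7, 6, 5, 6, 8]   # ratings for product A
before = [4, 5, 5, 8, 1, 9, 2, 3, 7, 2]  # ratings for product B

# Zero differences are dropped (none occur here).
diffs = [a - b for a, b in zip(after, before) if a != b]
abs_sorted = sorted(abs(d) for d in diffs)


def average_rank(value):
    """Average of the 1-based positions that value occupies in abs_sorted."""
    first = abs_sorted.index(value) + 1
    last = first + abs_sorted.count(value) - 1
    return (first + last) / 2


t_plus = sum(average_rank(abs(d)) for d in diffs if d > 0)
t_minus = sum(average_rank(abs(d)) for d in diffs if d < 0)

t0 = 11  # Table 13.5: one-tailed alpha = 0.05, n = 10
print(f"T+ = {t_plus}, T- = {t_minus}, reject H0: {t_minus <= t0}")
```

The two rank sums always add up to n(n + 1)/2, here 55, which gives a quick arithmetic check.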
Table 13.5 A Partial Reproduction of Table 2 of Appendix D
Critical values of $T_0$ in the Wilcoxon Matched Pairs Signed Ranks Test
α one-tailed   α two-tailed   n=5    n=6    n=7    n=8    n=9    n=10
0.05           0.1            1      2      4      6      8      11
0.025          0.05                  1      2      4      6      8
0.01           0.02                         0      2      3      5
0.005          0.01                                0      2      3
α one-tailed   α two-tailed   n=11   n=12   n=13   n=14   n=15   n=16
0.05           0.1            14     17     21     26     30     36
0.025          0.05           11     14     17     21     25     30
0.01           0.02           7      10     13     16     20     24
0.005          0.01           5      7      10     13     16     19
α one-tailed   α two-tailed   n=17   n=18   n=19   n=20   n=21   n=22
0.05           0.1            41     47     54     60     68     75
0.025          0.05           35     40     46     52     59     66
0.01           0.02           28     33     38     43     49     56
0.005          0.01           23     28     32     37     43     49
α one-tailed   α two-tailed   n=23   n=24   n=25   n=26   n=27   n=28
0.05           0.1            83     92     101    110    120    130
0.025          0.05           73     81     90     98     107    117
0.01           0.02           62     69     77     85     93     102
0.005          0.01           55     61     68     76     84     92
The Wilcoxon signed ranks test for large samples
The Wilcoxon signed ranks test statistic has a sampling distribution that is approximately normal when the number n of pairs is large, say n ≥ 25. This large-sample nonparametric matched-pairs test is summarized in the following box.
WILCOXON SIGNED RANK SUM TEST FOR LARGE SAMPLES (n ≥ 25)
ONE-TAILED TEST
$H_0$: The sampled populations have identical probability distributions
$H_a$: The probability distribution for population 1 is shifted to the right of that for population 2
(or population 1 is shifted to the left of population 2)
TWO-TAILED TEST
$H_0$: The sampled populations have identical probability distributions
$H_a$: The probability distribution for population 1 is shifted either to the left or to the right of that for population 2
Test statistic (both tests):
$z = \frac{T_+ - [n(n+1)/4]}{\sqrt{[n(n+1)(2n+1)]/24}}$
Rejection region (one-tailed): $z > z_\alpha$ (or $z < -z_\alpha$)
Rejection region (two-tailed): $z < -z_{\alpha/2}$ or $z > z_{\alpha/2}$
where $z_\alpha$ and $z_{\alpha/2}$ are tabulated values given in any table of normal curve areas (see Table 1 of Appendix C).
Example 13.6 Suppose we select a sample from each of two populations, giving the following 30 matched pairs:
Sample 1: 4 5 6 4 7 8 6 9 7 4 10 7 6 8 5 4 6 7 9 7 4 6 7 9 6 10 9 7 8 5
Sample 2: 5 6 7 8 5 9 6 8 3 7 5 7 5 8 9 4 6 8 4 6 7 9 10 6 8 5 7 8 9 6
Use the Wilcoxon signed ranks test to check whether the probability distributions of the populations are identical.
Solution For this example, using STATGRAPHICS we obtain the following printout.
Comparison of Two Samples
Sample 1: 4 5 6 4 7 8 6 9 7 4 10 7 6 8 5 4 6 7 9 7 4 6 7 9 6 10 9 7 8 5
Sample 2: 5 6 7 8 5 9 6 8 3 7 5 7 5 8 9 4 6 8 4 6 7 9 10 6 8 5 7 8 9 6
Test: Ranks
Number of positive differences = 10 with average rank = 15.4
Number of negative differences = 15 with average rank = 11.4
Large sample test statistic Z = 0.242162
Two-tailed probability of equaling or exceeding Z = 0.80865
NOTE: 30 total pairs. 5 tied pairs ignored.
Figure 13.3 STATGRAPHICS printout for Example 13.6
From the printout we see that the p-value for the two-tailed test is 0.80865. This is far larger than any conventional significance level. Therefore, we cannot reject the hypothesis that the probability distributions of the populations are identical.
13.5. Comparing populations using a completely randomized design: the Kruskal-Wallis H test
In Chapter 10 we compared the means of k populations based on data collected according to a completely randomized design. The analysis of variance F-test, used to test the null hypothesis of equality of means, is based on the assumption that the populations are normally distributed with common variance $\sigma^2$.
The Kruskal-Wallis H-test is the nonparametric equivalent of the analysis of variance F-test. It tests the null hypothesis that all k populations possess the same probability distribution against the alternative hypothesis that the distributions differ in location, that is, one or more of the distributions are shifted to the right or left of each other. The advantage of the Kruskal-Wallis H-test is that we need make no assumptions about the nature of the sampled populations.
A completely randomized design specifies that we select independent random samples of $n_1, n_2, \ldots, n_k$ observations from the k populations. To conduct the test, we first rank all $n = n_1 + n_2 + \ldots + n_k$ observations and compute the rank sums $R_1, R_2, \ldots, R_k$ for the k samples. The ranks of tied observations are averaged in the same manner as for the Wilcoxon rank sum test. Then, if $H_0$ is true, and if the sample sizes $n_1, n_2, \ldots, n_k$ each equal 5 or more, the test statistic
$H = \frac{12}{n(n+1)} \sum_{i=1}^{k} \frac{R_i^2}{n_i} - 3(n+1)$
will have a sampling distribution that can be approximated by a chi-square distribution with (k – 1) degrees of freedom. Large values of H imply rejection of $H_0$. Therefore, the rejection region for the test is $H > \chi^2_\alpha$, where $\chi^2_\alpha$ is the value that locates α in the upper tail of the chi-square distribution.
The test is summarized in the following box.
KRUSKAL-WALLIS H TEST FOR COMPARING k POPULATION PROBABILITY DISTRIBUTIONS
$H_0$: The k population probability distributions are identical
$H_a$: At least two of the k population probability distributions differ in location
Test statistic:
$H = \frac{12}{n(n+1)} \sum_{i=1}^{k} \frac{R_i^2}{n_i} - 3(n+1)$
where
$n_i$ = Number of observations in sample i
$R_i$ = Rank sum of sample i, where the rank of each observation is computed according to its relative magnitude in the totality of data for the k samples
$n = n_1 + n_2 + \ldots + n_k$
Rejection region: $H > \chi^2_\alpha$ with df = k – 1
Assumptions:
1. The k samples are random and independent
2. $n_i \ge 5$ for each i
3. The observations can be ranked.
No assumptions have to be made about the shape of the population probability distribution.
Example 13.7 Independent random samples of three different brands of magnetron tubes were
subjected to stress testing, and the number of hours each operated without repair was recorded.
Although these times do not represent typical lifetimes, they do indicate how well the tubes can
withstand extreme stress. The data are shown in the table. Experience has shown that the
distributions of lifetimes for manufactured products are usually nonnormal.
A B C
36 49 71
48 33 31
5 60 140
67 2 59
53 55 42
Use the Kruskal-Wallis H-test to determine whether evidence exists to conclude that the brands
of magnetron tubes tend to differ in length of life under stress. Test using α = 0.05.
Solution The first step in performing the Kruskal-Wallis H-test is to rank the n = 15
observations in the complete data set. The ranks and rank sums for the three samples are shown
in Table 13.6.
Table 13.6 Ranks and Rank Sums for Example 13.7
A RANK B RANK C RANK
36 5 49 8 71 14
48 7 33 4 31 3
5 2 60 12 140 15
67 13 2 1 59 11
53 9 55 10 42 6
$R_1 = 36$ $R_2 = 35$ $R_3 = 49$
We want to test the null hypothesis
$H_0$: The population probability distributions of lifetimes under stress are identical for the
three brands of magnetron tubes
against the alternative hypothesis
$H_a$: At least two of the population probability distributions differ in location
using the test statistic
$$H = \frac{12}{n(n+1)} \sum_{i=1}^{k} \frac{R_i^2}{n_i} - 3(n+1) = \frac{12}{(15)(16)} \left[ \frac{(36)^2}{5} + \frac{(35)^2}{5} + \frac{(49)^2}{5} \right] - 3(16) = 1.22$$
The rejection region for the H-test is $H > \chi^2_\alpha$ with df = k – 1 = 3 – 1 = 2. For
α = 0.05 and df = 2, $\chi^2_\alpha = 5.99147$. Since the computed value H = 1.22 is less than
5.99147, we cannot reject $H_0$.
There is insufficient evidence to indicate a difference in location among the distributions of
lifetimes for the three brands of magnetron tubes.
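The hand calculation of H can be checked with a short program. The following is a minimal Python sketch, not part of the original course material: the function names average_ranks and kruskal_wallis_h are illustrative, and no correction for ties is applied (none is needed for the data of Example 13.7).

```python
from typing import Dict, List


def average_ranks(values: List[float]) -> List[float]:
    """Return 1-based ranks of `values`; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # average of positions pos+1 .. end+1
        for j in range(pos, end + 1):
            ranks[order[j]] = shared
        pos = end + 1
    return ranks


def kruskal_wallis_h(samples: Dict[str, List[float]]) -> float:
    """Compute H = 12 / (n(n+1)) * sum(R_i^2 / n_i) - 3(n+1)."""
    pooled = [x for obs in samples.values() for x in obs]
    ranks = average_ranks(pooled)
    n = len(pooled)
    total, start = 0.0, 0
    for obs in samples.values():
        rank_sum = sum(ranks[start:start + len(obs)])  # R_i for this sample
        total += rank_sum ** 2 / len(obs)
        start += len(obs)
    return 12.0 / (n * (n + 1)) * total - 3 * (n + 1)


# Data of Example 13.7 (hours of operation under stress).
tubes = {
    "A": [36, 48, 5, 67, 53],
    "B": [49, 33, 60, 2, 55],
    "C": [71, 31, 140, 59, 42],
}
h = kruskal_wallis_h(tubes)
print(round(h, 2))
```

The pooled ranking is done once and the rank sums are read off per sample, which mirrors the layout of Table 13.6.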
For this example the STATGRAPHICS printout is given in Figure 13.3. In the printout we see
that Test statistic = 1.22 and Significance level = 0.543351. Therefore, at significance level
α = 0.05 we cannot reject the hypothesis $H_0$.
Kruskal-Wallis analysis of LIFELEN.lengths by LIFELEN.brand
--------------------------------------------------
Level     Sample Size     Average Rank
--------------------------------------------------
A         5               7.20000
B         5               7.00000
C         5               9.80000
--------------------------------------------------
Test statistic = 1.22     Significance level = 0.543351
Figure 13.3 STATGRAPHICS printout for Example 13.7
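Both the critical value and the significance level quoted for this example can be reproduced by hand, because a chi-square distribution with 2 degrees of freedom has the closed-form upper-tail area exp(–x/2). The Python sketch below relies on that fact; it applies only to df = 2 and is offered as a check, not as the method used by STATGRAPHICS.

```python
import math

alpha = 0.05
h = 1.22  # test statistic from Example 13.7

# With 2 degrees of freedom, P(chi-square > x) = exp(-x / 2),
# so the critical value solves exp(-x / 2) = alpha.
critical_value = -2.0 * math.log(alpha)
p_value = math.exp(-h / 2.0)

print(round(critical_value, 4))   # agrees with 5.99147 to four decimals
print(round(p_value, 6))          # compare with the printed significance level
print(h > critical_value)         # True would mean: reject H0
```

For other degrees of freedom there is no such simple formula, and a chi-square table or statistical package is needed.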
13.6 Rank Correlation: Spearman's $r_s$ statistic
Several different nonparametric statistics have been developed to measure and to test for
correlation between two random variables. One of these statistics is Spearman's rank
correlation coefficient $r_s$.
The first step in finding $r_s$ is to rank the values of each of the variables separately; ties are
treated by averaging the tied ranks. Then $r_s$ is computed in exactly the same way as the simple
correlation coefficient r. The only difference is that the values of x and y that appear in the
formula for $r_s$ denote the ranks of the raw data rather than the raw data themselves.
Formulas for computing Spearman's rank correlation coefficient
Rank the values for each of the variables and let x and y denote the ranks of a pair
of observations. Then
$$r_s = \frac{SS_{xy}}{\sqrt{SS_{xx}\, SS_{yy}}}$$
where
$$SS_{xx} = \sum (x - \bar{x})^2, \quad SS_{yy} = \sum (y - \bar{y})^2, \quad SS_{xy} = \sum (x - \bar{x})(y - \bar{y})$$
When there are no ties, the formula for $r_s$ reduces to
$$r_s = 1 - \frac{6 \sum d^2}{n(n^2 - 1)}$$
where d is the difference between the values of x and y corresponding to a pair of
observations. This simple formula will provide a good approximation to $r_s$ when the
number of ties in the ranks is small.
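The two formulas in the box can be compared directly. The Python sketch below is an illustration only: the function names and the small data set are hypothetical and do not come from the course. It computes $r_s$ both from the sums of squares and from the shortcut formula, and the two agree when there are no ties.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks of `values`; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # average rank for the tied block
        for j in range(pos, end + 1):
            ranks[order[j]] = shared
        pos = end + 1
    return ranks


def spearman_rs(x_raw: Sequence[float], y_raw: Sequence[float]) -> float:
    """r_s = SS_xy / sqrt(SS_xx * SS_yy), computed on the ranks."""
    x, y = average_ranks(x_raw), average_ranks(y_raw)
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    ss_yy = sum((yi - y_bar) ** 2 for yi in y)
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return ss_xy / math.sqrt(ss_xx * ss_yy)


def spearman_shortcut(x_raw: Sequence[float], y_raw: Sequence[float]) -> float:
    """r_s = 1 - 6 * sum(d^2) / (n(n^2 - 1)); exact only when there are no ties."""
    x, y = average_ranks(x_raw), average_ranks(y_raw)
    n = len(x)
    sum_d2 = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# Hypothetical data without ties, for illustration only.
x_demo = [10, 20, 30, 40, 50]
y_demo = [12, 9, 30, 25, 41]
print(round(spearman_rs(x_demo, y_demo), 4), round(spearman_shortcut(x_demo, y_demo), 4))
```

When ties are present, the sums-of-squares version remains the definition of $r_s$, while the shortcut is only an approximation.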
The nonparametric test of hypothesis for rank correlation is shown in the box.
Spearman's Nonparametric Test for Rank Correlation
ONE-TAILED TEST
$H_0$: There is no correlation between the ranked pairs
$H_a$: Ranked pairs are positively correlated
(or Ranked pairs are negatively correlated)
Test statistic: $r_s$
Rejection region: $r_s \ge r_0$ (or $r_s \le -r_0$)
TWO-TAILED TEST
$H_0$: There is no correlation between the ranked pairs
$H_a$: Ranked pairs are correlated
Test statistic: $r_s$
Rejection region: $r_s \ge r_0$ or $r_s \le -r_0$
where the value of $r_0$ is given in Table 3 of Appendix D
Example 13.8 A large manufacturing firm wants to determine whether a relationship exists
between the number of work-hours an employee misses per year and the employee's annual
wages (in thousands of dollars). A sample of 15 employees produced the data shown in
Table 13.7.
Table 13.7 Data for Example 13.8
EMPLOYEE HOURS WAGES
1 49 15.8
2 36 17.5
3 127 11.3
4 91 13.2
5 72 13.0
6 34 14.5
7 155 11.8
8 11 20.2
9 191 10.8
10 6 18.8
11 63 13.8
12 79 12.7
13 43 15.1
14 57 24.2
15 82 13.9
a) Calculate Spearman's rank correlation coefficient as a measure of the strength of the
relationship between work-hours missed and annual wages.
b) Is there sufficient evidence to indicate that work-hours missed decrease as annual wages
increase, i.e., that work-hours missed and annual wages are negatively correlated? Test
using α = 0.01.
Solution
a) First we rank the values of work-hours missed and rank the values of the annual wages.
Let these rankings be $x_i$ and $y_i$, respectively; they are shown in Table 13.8. The next
step is to calculate the differences $d_i = x_i - y_i$ (i = 1, 2, ..., 15). These differences $d_i$
and their squares are also shown in the table.
Table 13.8 Calculations for Example 13.8
EMPLOYEE HOURS RANK WAGES RANK $d_i$ $d_i^2$
1 49 6 15.8 11 –5 25
2 36 4 17.5 12 –8 64
3 127 13 11.3 2 11 121
4 91 12 13.2 6 6 36
5 72 9 13.0 5 4 16
6 34 3 14.5 9 –6 36
7 155 14 11.8 3 11 121
8 11 2 20.2 14 –12 144
9 191 15 10.8 1 14 196
10 6 1 18.8 13 –12 144
11 63 8 13.8 7 1 1
12 79 10 12.7 4 6 36
13 43 5 15.1 10 –5 25
14 57 7 24.2 15 –8 64
15 82 11 13.9 8 3 9
$\sum d_i^2 = 1038$
Since there are no ties, we calculate $r_s$ by the formula
$$r_s = 1 - \frac{6 \sum d^2}{n(n^2 - 1)} = 1 - \frac{6(1038)}{15(224)} = -0.854$$
This large negative value of $r_s$ implies that a negative correlation exists between
work-hours missed and annual wages in the sample of 15 employees.
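The ranking and the sum of squared differences for this example can be verified with a few lines of Python. This is a sketch for checking the arithmetic only; the helper ranks_no_ties is an illustrative name and assumes all values are distinct, as they are in Table 13.7.

```python
from typing import List, Sequence

# Data of Table 13.7.
hours = [49, 36, 127, 91, 72, 34, 155, 11, 191, 6, 63, 79, 43, 57, 82]
wages = [15.8, 17.5, 11.3, 13.2, 13.0, 14.5, 11.8, 20.2, 10.8,
         18.8, 13.8, 12.7, 15.1, 24.2, 13.9]


def ranks_no_ties(values: Sequence[float]) -> List[int]:
    """Rank distinct values from 1 (smallest) to n (largest)."""
    position = {v: r for r, v in enumerate(sorted(values), start=1)}
    return [position[v] for v in values]


x = ranks_no_ties(hours)   # ranks of work-hours missed
y = ranks_no_ties(wages)   # ranks of annual wages
n = len(x)

sum_d2 = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
r_s = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))
print(sum_d2, round(r_s, 4))
```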
b) To test $H_0$: No correlation exists between work-hours missed and annual wages in the
population against $H_a$: Work-hours missed and annual wages are negatively correlated, we
use $r_s$ as the test statistic and obtain the critical value $r_0$ from Table 3 of Appendix D.
This table gives the critical values of $r_0$ for an upper-tailed test, i.e., a test to detect a positive
rank correlation. For our example, with α = 0.01 and n = 15, the critical value is $r_0$ = 0.623.
Therefore, we reject the null hypothesis in favor of the alternative hypothesis if the computed
$r_s$ statistic is less than or equal to –0.623. Since our computed $r_s$ = –0.854 < –0.623, we
reject $H_0$ and conclude that there is ample evidence to indicate that work-hours missed
decrease as annual wages increase.
Below we reproduce the STATGRAPHICS printout for our example. In this printout we see that
the rank correlation coefficient between the variable HOURS (work-hours missed) and the
variable WAGES (annual wages) is –0.8536 and the significance level is 0.0014. Since the
observed significance level is very small, it is natural to reject the null hypothesis.
Spearman Rank Correlations
--------------------------------------------------
          HOURS       WAGES
HOURS     1.0000      –0.8536
          ( 15)       ( 15)
          1.0000      0.0014
WAGES     –0.8536     1.0000
          ( 15)       ( 15)
          0.0014      1.0000
--------------------------------------------------
Coefficient (sample size) significance level
Figure 13.4 STATGRAPHICS printout for Example 13.8
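A significance level of about this size can be obtained from the large-sample normal approximation, in which $z = r_s \sqrt{n - 1}$ is treated as a standard normal variable under $H_0$. The Python sketch below uses that approximation with a two-sided tail area; that the package computes its significance level in exactly this way is an assumption, not something stated in the printout.

```python
import math

n = 15
r_s = -0.8536  # rank correlation reported in the printout

# Large-sample approximation: z = r_s * sqrt(n - 1) is roughly N(0, 1) under H0.
z = r_s * math.sqrt(n - 1)

# Two-sided tail area of the standard normal, via the complementary error function.
two_sided_p = math.erfc(abs(z) / math.sqrt(2.0))

print(round(z, 2), round(two_sided_p, 4))
```

For the one-tailed test of the example, the corresponding tail area would be half of this two-sided value.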
13.7 Summary
We have presented several useful nonparametric techniques for testing the location of a single
population, or for comparing two or more populations. Nonparametric techniques are useful
when the underlying assumptions for their parametric counterparts are not justified or when it is
impossible to assign specific values to the observations. Nonparametric methods provide more
general comparisons of populations than parametric methods, because they compare the
probability distributions of the populations rather than specific parameters.
Rank sums are the primary tools of nonparametric statistics. The Wilcoxon rank sum test can
be used to compare two populations based on independent random samples, and the Wilcoxon
signed ranks test can be used for a matched-pairs experiment. The Kruskal-Wallis H-test is
applied when comparing k populations using a completely randomized design.
13.8 Exercises
1. Suppose you want to use the sign test to test the null hypothesis that the population median
equals 75, i.e., $H_0$: M = 75. Use the table of binomial probabilities to find the observed
significance level (p-value) of the test for each of the following situations:
a) $H_a$: M > 75, n = 5, S = 2
b) $H_a$: M ≠ 75, n = 15, S = 9
c) $H_a$: M < 75, n = 10, S = 7
2. A random sample of 8 observations from a continuous population resulted in the following:
17 16.5 20 18.2 19.6 14.9 21.1 19.4
Is there sufficient evidence to indicate that the population median differs from 20? Test using
α = 0.05.
3. Independent random samples were selected from two populations. The data are shown in
the table.
Sample from population 1: 15 16 13 14 12 17
Sample from population 2: 6 13 8 9 7 5 4 10
a) Use the Wilcoxon rank sum test to determine whether the data provide sufficient evidence
to indicate a shift in the locations of the probability distributions of the sampled
populations. Test using α = 0.05.
b) Do the data provide sufficient evidence to indicate that the probability distribution for
population 1 is shifted to the right of the probability distribution for population 2? Use the
Wilcoxon rank sum test with α = 0.05.
4. The following data show employees' rates of defective work before and after a change in
the wage incentive plan. Compare the two sets of data to see if the change lowered the
number of defective units produced. (Use the Wilcoxon signed ranks test for a matched pairs
design with α = 0.01.)
Before 8 7 6 9 7 10 8 6 5 8 10 8
After 6 5 8 6 9 8 10 7 5 6 9 5
5. The following table shows sample retail prices for three brands of shoes. Use the
Kruskal-Wallis test to determine if there is any difference among the retail prices of the brands
throughout the country. Use the 0.05 level of significance.
Brand A $89 90 92 81 76 88 85 95 97 86 100
Brand B $78 93 81 87 89 71 90 96 82 85
Brand C $80 88 86 85 79 80 84 85 90 92
6. A random sample of seven pairs of observations is recorded on two variables, X and Y.
The data are shown in the table. Use Spearman's nonparametric test for rank correlation to
answer the following:
a) Do the data provide sufficient evidence to conclude that the rank correlation between X
and Y is greater than 0? Test using α = 0.05.
b) Do the data provide sufficient evidence to conclude that the rank correlation between X
and Y is not 0? Test using α = 0.05.
X 65 57 55 38 29 43 49
Y 58 61 58 23 34 38 37
7. Below are ratings of aggressiveness (X) and amount of sales in the last year (Y) for eight
salespeople. Is there a significant rank correlation between the two measures? Use the 0.05
significance level.
X 30 17 35 28 42 25 19 29
Y 35 31 43 46 50 32 33 42
References
1. Berenson, M.L. and D.M. Levine, Basic Business Statistics: Concepts and Applications, 4th
ed. Englewood Cliffs, NJ, Prentice Hall, 1989.
2. McClave, J.T. and F.H. Dietrich, Statistics, 4th ed., San Francisco: Dellen, 1988.
3. Fahrmeir L. and Tutz G., Multivariate statistical modeling based on generalized linear
models, New York: Springer-Verlag, 1994.
4. Gnedenko B.V., The theory of probability, Chelsea Publ. Comp., New York, 1962.
5. Goldstein H. (ed.) Multilevel statistical models, London: Edward Arnold, 1995.
6. Iman, R. L., and W. J. Conover, Modern Business Statistics, 2nd ed. New York, NY, John
Wiley & Sons, 1989.
7. Kwanchai A. Gomez and Arturo A. Gomez, Statistical procedures for agricultural research,
John Wiley & Sons, 1982.
8. Levin, R.I. and D. S. Rubin, Statistics for management, 5th ed. Englewood Cliffs, NJ,
Prentice Hall, 1991.
9. Moore D. S. and G.P. McCabe, Introduction to the Practice of Statistics, W.H. Freeman and
Company, 1989.
10. Mendenhall, W. and Sincich T., Statistics for the engineering and computer sciences, 2nd
edition, Dellen Publ. Comp., 1989.
11. Rosenbaum P.R., Observational Studies, New York: Springer-Verlag, 1995.
12. STATGRAPHICS Plus, Reference manual, Manugistics Inc., 1992.
Appendix C Table 1 Normal Curve Areas
z .00 .01 .02 .03 .04 .05 .06 .07 .08 .09
0.0 .0000 .0040 .0080 .0120 .0160 .0199 .0239 .0279 .0319 .0359
0.1 .0398 .0438 .0478 .0517 .0557 .0596 .0636 .0675 .0714 .0753
0.2 .0793 .0832 .0871 .0910 .0948 .0987 .1026 .1064 .1103 .1141
0.3 .1179 .1217 .1255 .1293 .1331 .1368 .1406 .1443 .1480 .1517
0.4 .1554 .1591 .1628 .1664 .1700 .1736 .1772 .1808 .1844 .1879
0.5 .1915 .1950 .1985 .2019 .2054 .2088 .2123 .2157 .2190 .2224
0.6 .2257 .2291 .2324 .2357 .2389 .2422 .2454 .2486 .2517 .2549
0.7 .2580 .2611 .2642 .2673 .2704 .2734 .2764 .2794 .2823 .2852
0.8 .2881 .2910 .2939 .2967 .2995 .3023 .3051 .3078 .3106 .3133
0.9 .3159 .3186 .3212 .3238 .3264 .3289 .3315 .3340 .3365 .3389
1.0 .3413 .3438 .3461 .3485 .3508 .3531 .3554 .3577 .3599 .3621
1.1 .3643 .3665 .3686 .3708 .3729 .3749 .3770 .3790 .3810 .3830
1.2 .3849 .3869 .3888 .3907 .3925 .3944 .3962 .3980 .3997 .4015
1.3 .4032 .4049 .4066 .4082 .4099 .4115 .4131 .4147 .4162 .4177
1.4 .4192 .4207 .4222 .4236 .4251 .4265 .4279 .4292 .4306 .4319
1.5 .4332 .4345 .4357 .4370 .4382 .4394 .4406 .4418 .4429 .4441
1.6 .4452 .4463 .4474 .4484 .4495 .4505 .4515 .4525 .4535 .4545
1.7 .4554 .4564 .4573 .4582 .4591 .4599 .4608 .4616 .4625 .4633
1.8 .4641 .4649 .4656 .4664 .4671 .4678 .4686 .4693 .4699 .4706
1.9 .4713 .4719 .4726 .4732 .4738 .4744 .4750 .4756 .4761 .4767
2.0 .4772 .4778 .4783 .4788 .4793 .4798 .4803 .4808 .4812 .4817
2.1 .4821 .4826 .4830 .4834 .4838 .4842 .4846 .4850 .4854 .4857
2.2 .4861 .4864 .4868 .4871 .4875 .4878 .4881 .4884 .4887 .4890
2.3 .4893 .4896 .4898 .4901 .4904 .4906 .4909 .4911 .4913 .4916
2.4 .4918 .4920 .4922 .4925 .4927 .4929 .4931 .4932 .4934 .4936
2.5 .4938 .4940 .4941 .4943 .4945 .4946 .4948 .4949 .4951 .4952
2.6 .4953 .4955 .4956 .4957 .4959 .4960 .4961 .4962 .4963 .4964
2.7 .4965 .4966 .4967 .4968 .4969 .4970 .4971 .4972 .4973 .4974
2.8 .4974 .4975 .4976 .4977 .4977 .4978 .4979 .4979 .4980 .4981
2.9 .4981 .4982 .4982 .4983 .4984 .4984 .4985 .4985 .4986 .4986
3.0 .4987 .4987 .4987 .4988 .4988 .4989 .4989 .4989 .4990 .4990
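Appendix C Table 2 Critical Values for Student's t
Each row gives the degrees of freedom (df) followed by the values of t with upper-tail areas
0.100, 0.050, 0.025, 0.010, 0.005, 0.001 and 0.0005.
df $t_{.100}$ $t_{.050}$ $t_{.025}$ $t_{.010}$ $t_{.005}$ $t_{.001}$ $t_{.0005}$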
1 3.078 6.314 12.706 31.821 63.657 318.310 636.620
2 1.886 2.920 4.303 6.965 9.925 22.326 31.598
3 1.638 2.353 3.182 4.541 5.841 10.213 12.924
4 1.533 2.132 2.776 3.747 4.604 7.173 8.610
5 1.476 2.015 2.571 3.365 4.032 5.893 6.869
6 1.440 1.943 2.447 3.143 3.707 5.208 5.959
7 1.415 1.895 2.365 2.998 3.499 4.785 5.408
8 1.397 1.860 2.306 2.896 3.355 4.501 5.041
9 1.383 1.833 2.262 2.821 3.250 4.297 4.781
10 1.372 1.812 2.228 2.764 3.169 4.144 4.587
11 1.363 1.796 2.201 2.718 3.106 4.025 4.437
12 1.356 1.782 2.179 2.681 3.055 3.930 4.318
13 1.350 1.771 2.160 2.650 3.012 3.852 4.221
14 1.345 1.760 2.145 2.624 2.977 3.787 4.140
15 1.341 1.753 2.131 2.602 2.947 3.733 4.073
16 1.337 1.746 2.120 2.583 2.921 3.686 4.015
17 1.333 1.740 2.110 2.567 2.898 3.646 3.965
18 1.330 1.734 2.101 2.552 2.878 3.610 3.922
19 1.328 1.729 2.093 2.539 2.861 3.579 3.883
20 1.325 1.725 2.086 2.528 2.845 3.552 3.850
21 1.323 1.721 2.080 2.518 2.831 3.527 3.819
22 1.321 1.717 2.074 2.508 2.819 3.505 3.792
23 1.319 1.714 2.069 2.500 2.807 3.485 3.767
24 1.318 1.711 2.064 2.492 2.797 3.467 3.745
25 1.316 1.708 2.060 2.485 2.787 3.450 3.725
26 1.315 1.706 2.056 2.479 2.779 3.435 3.707
27 1.314 1.703 2.052 2.473 2.771 3.421 3.690
28 1.313 1.701 2.048 2.467 2.763 3.408 3.674
29 1.311 1.699 2.045 2.462 2.756 3.396 3.659
30 1.310 1.697 2.042 2.457 2.750 3.385 3.646
40 1.303 1.684 2.021 2.423 2.704 3.307 3.551
60 1.296 1.671 2.000 2.390 2.660 3.232 3.460
120 1.289 1.658 1.980 2.358 2.617 3.160 3.373
∞ 1.282 1.645 1.960 2.326 2.576 3.090 3.291
Table 1 Critical values of $T_L$ and $T_U$ for the Wilcoxon Rank Sum Test: Independent samples
Rows are indexed by $n_2$ and column pairs by $n_1$; each pair of entries gives $T_L$ and $T_U$.
a. Alpha = 0.025 one-tailed; alpha = 0.05 two-tailed
n2   n1=3     n1=4     n1=5     n1=6     n1=7     n1=8     n1=9     n1=10
     TL TU    TL TU    TL TU    TL TU    TL TU    TL TU    TL TU    TL TU
3    5 16     6 18     6 21     7 23     7 26     8 28     8 31     9 33
4    6 18     11 25    12 28    12 32    13 35    14 38    15 41    16 44
5    6 21     12 28    18 37    19 41    20 45    21 49    22 53    24 56
6    7 23     12 32    19 41    26 52    28 56    29 61    31 65    32 70
7    7 26     13 35    20 45    28 56    37 68    39 73    41 78    43 83
8    8 28     14 38    21 49    29 61    39 73    49 87    51 93    54 98
9    8 31     15 41    22 53    31 65    41 78    51 93    63 108   66 114
10   9 33     16 44    24 56    32 70    43 83    54 98    66 114   79 131
b. Alpha = 0.05 one-tailed; alpha = 0.10 two-tailed
n2   n1=3     n1=4     n1=5     n1=6     n1=7     n1=8     n1=9     n1=10
     TL TU    TL TU    TL TU    TL TU    TL TU    TL TU    TL TU    TL TU
3    6 15     7 17     7 20     8 22     9 24     9 27     10 29    11 31
4    7 17     12 24    13 27    14 30    15 33    16 36    17 39    18 42
5    7 20     13 27    19 36    20 40    22 43    24 46    25 50    26 54
6    8 22     14 30    20 40    28 50    30 54    32 58    33 63    35 67
7    9 24     15 33    22 43    30 54    39 66    41 71    43 76    46 80
8    9 27     16 36    24 46    32 58    41 71    52 84    54 90    57 95
9    10 29    17 39    25 50    33 63    43 76    54 90    66 105   69 111
10   11 31    18 42    26 54    35 67    46 80    57 95    69 111   83 127
Table 2 Critical values of $T_0$ in the Wilcoxon Matched Pairs Signed Ranks Test
A hyphen marks an entry that is blank in the table.
Alpha        Alpha
ONE-TAILED   TWO-TAILED   n=5   n=6   n=7   n=8   n=9   n=10
0.05         0.10         1     2     4     6     8     11
0.025        0.05         -     1     2     4     6     8
0.01         0.02         -     -     0     2     3     5
0.005        0.01         -     -     -     0     2     3
                          n=11  n=12  n=13  n=14  n=15  n=16
0.05         0.10         14    17    21    26    30    36
0.025        0.05         11    14    17    21    25    30
0.01         0.02         7     10    13    16    20    24
0.005        0.01         5     7     10    13    16    19
                          n=17  n=18  n=19  n=20  n=21  n=22
0.05         0.10         41    47    54    60    68    75
0.025        0.05         35    40    46    52    59    66
0.01         0.02         28    33    38    43    49    56
0.005        0.01         23    28    32    37    43    49
                          n=23  n=24  n=25  n=26  n=27  n=28
0.05         0.10         83    92    101   110   120   130
0.025        0.05         73    81    90    98    107   117
0.01         0.02         62    69    77    85    93    102
0.005        0.01         55    61    68    76    84    92
                          n=29  n=30  n=31  n=32  n=33  n=34
0.05         0.10         141   152   163   175   188   201
0.025        0.05         127   137   148   159   171   183
0.01         0.02         111   120   130   141   151   162
0.005        0.01         100   109   118   128   138   149
                          n=35  n=36  n=37  n=38  n=39
0.05         0.10         214   228   242   256   271
0.025        0.05         195   208   222   235   250
0.01         0.02         174   186   198   211   224
0.005        0.01         160   171   183   195   208
                          n=40  n=41  n=42  n=43  n=44  n=45
0.05         0.10         287   303   319   336   353   371
0.025        0.05         264   279   295   311   327   344
0.01         0.02         238   252   267   281   297   313
0.005        0.01         221   234   248   262   277   292
                          n=46  n=47  n=48  n=49  n=50
0.05         0.10         389   408   427   446   466
0.025        0.05         361   379   397   415   434
0.01         0.02         329   345   362   380   398
0.005        0.01         307   323   339   356   373
Table 3 Critical values of Spearman's Rank Correlation Coefficient
The alpha-values correspond to a one-tailed test of the null hypothesis. The alpha-value
should be doubled for two-tailed tests. A hyphen marks an entry that is blank in the table.
n alpha=0.05 alpha=0.025 alpha=0.01 alpha=0.005
5 0.900 - - -
6 0.829 0.886 0.943 -
7 0.714 0.786 0.893 -
8 0.643 0.738 0.833 0.881
9 0.600 0.683 0.783 0.833
10 0.564 0.648 0.745 0.794
11 0.523 0.623 0.736 0.818
12 0.497 0.591 0.703 0.780
13 0.475 0.566 0.673 0.745
14 0.457 0.545 0.646 0.716
15 0.441 0.525 0.623 0.689
16 0.425 0.507 0.601 0.666
17 0.412 0.490 0.582 0.645
18 0.399 0.476 0.564 0.625
19 0.388 0.462 0.549 0.608
20 0.377 0.450 0.534 0.591
21 0.368 0.438 0.521 0.576
22 0.359 0.428 0.508 0.562
23 0.351 0.418 0.496 0.549
24 0.343 0.409 0.485 0.537
25 0.336 0.400 0.475 0.526
26 0.329 0.392 0.465 0.515
27 0.323 0.385 0.456 0.505
28 0.317 0.377 0.448 0.496
29 0.311 0.370 0.440 0.487
30 0.305 0.364 0.432 0.478
Index
A
Additive rule of probabilities, 4.6
Alternative hypothesis, 8.2
Analysis of variance, 10.4
completely randomized design, 10.6
oneway, 10.6
randomized block design, 10.7
Arithmetic mean, 3.3
Axiomatic construction of the theory of probability, 4.4
B
Bar graph, 2.4
Bayes’s formula, 4.6
Bernoulli process, 5.4
Biased estimator, 10.7
Bimodal distribution, 3.3
Binomial probability distribution, 5.4
normal approximation to, 5.8
Bivariate relationships, 11.1
Box plot, 3.7
C
Categorical data, 10.1
Central limit theorem, 6.4
Central tendency, 3.3
Chebyshev’s theorem, 3.4
Chi-square distribution, 7.9, 9.6
Chi-square test, 10.1, 10.2
Class
frequency, 2.5, 2.6
interval, 2.6
relative frequency, 2.6
Classical definition of probability, 4.4
Coefficient of correlation, 11.6
Coefficient of determination, 11.6, 12.6
Coefficient of multiple determination, 12.6
Coefficient of variation, 3.4
Conditional probability, 4.5
Confidence interval, 7.2
Contingency table, 10.2
Continuous data, 2.6
Continuous probability distribution, 5.6
Continuous random variable, 5.1, 5.6
Correlation analysis, 11.6
Cumulative frequency distribution, 2.7
D
Data
grouped, 3.1, 3.8
qualitative, 2.3, 2.4
quantitative, 2.5, 2.6
raw, 3.8
Degree of freedom, 7.3
Dependent variable, 11.1
Descriptive statistics, 1.3
Direct relationship, 11.1
Discrete data, 2.6
Discrete probability distribution, 5.2
Discrete random variable, 5.1, 5.2, 5.3
Dispersion, 3.2, 3.4
Distribution
bimodal, 3.3
binomial, 5.4
chi-square, 7.9
frequency, 2.6
normal, 5.8
Poisson, 5.5
probability, 5.2 – 5.8
sampling, 6.1, 6.3
standard normal probability, 5.8
Student’s, 7.3
E
Empirical Rule, 3.4
Error
of Type I, 8.3
of Type II, 8.3
Estimator
error variance, least squares line, 11.4
error variance, multiple regression, 12.4
Events, 4.1
certain, 4.3
complementary, 4.3
equally likely, 4.4
impossible, 4.3
independent, 4.3
simple, 4.3, 4.4
mutually exclusive, 4.3
non-mutually exclusive, 4.3
Expected value, 5.3, 5.7
Experiment, 4.1
Exponential random variable, 5.9
F
F probability distribution, 9.7, 10.6
F statistic, 9.7
Factor level
Frequency distribution, 2.6, 2.7
Frequency polygon, 2.7
G
General linear model, 12.1
Geometric mean, 3.3
Goodness-of-fit test, 10.2
H
Highly suspect outlier, 2.7
Histogram, 2.7
Hypothesis
alternative, 8.2
null, 8.2
one-tailed, 8.2
two-tailed, 8.2
Hypothesis testing, 8.1
I
Independence, 4.5
Independent events, 4.5
Independent variables, 11.1
Inferential statistics, 1.3
Inner fences, 3.7
Interaction model, 12.8
Intersection of events, 4.1
Interquartile range, 3.5
Inverse relationship, 11.1
K
Kruskal-Wallis test, 13.5
Kurtosis, 3.6
L
Least squares
estimates, 11.3
line, 11.3
matrix equation, 12.3
method of, 11.3, 12.3
prediction equation, 11.3, 12.3
Level of significance, 8.3
Linear relationship, 11.6
Linear regression model, 11.2, 12.2
Lower quartile, 3.5
M
Matched pairs, 7.6
Mean, 3.3
Median, 3.3
Measure of central tendency, 3.3
Measure of dispersion, 3.4
Measure of location, 3.3
Method of least squares, 11.3, 12.3
Midquartile, 3.5
Mode, 3.3
Model
first-order, 12.1
probabilistic, 11.2, 12.2
quadratic, 12.9
second-order, 12.1
Model building, 12.8, 12.9
Multiple coefficient of determination, 12.6
Multiple regression analysis, 12.1
Multiplication rule for probability, 4.6
Mutually exclusive events, 4.3
N
Nonparametric methods, 13.1–13.7
Kruskal-Wallis test, 13.5
Sign test for a population median, 13.2
Spearman’s rank correlation coefficient, 13.6
Wilcoxon rank sum test, 13.3
Wilcoxon signed ranks test, 13.4
Nonparametric statistics, 13.1–13.7
Normal probability distribution, 5.8
Null hypothesis, 8.2
Numerical descriptive measures, 3.2, 5.3
O
Objective of statistics, 1.1
Ogive, 2.7, 2.8
One-tailed test, 8.2
Outer fences, 3.7
Outlier, 3.7
P
Parameters, 3.2
Percentage relative, 2.6
Pie chart, 2.4
Point estimate, 7.1
Poisson random variable, 5.5
Poisson probability distribution, 5.5
Population, 1.2
Prediction equation, 11.3, 12.3
Prediction interval (regression), 11.7, 12.7
multiple, 12.7
single, 11.7
Probabilistic model, 11.2, 12.2
Probability
axiomatic definition, 4.4
classical definition, 4.4
conditional, 4.5
statistical definition, 4.4
total, 4.6
unconditional, 4.5
Q
Quadratic model, 12.9
Qualitative data, 2.2–2.4
Quantitative data, 2.2, 2.5, 2.6
Quartiles, 3.5
R
Random sample, 6.2
Random sampling, 6.2
Range
interquartile, 3.5
Rank correlation coefficient, 13.6
Rank sum, 13.3
Regression analysis
multiple, 12.1
simple, 11.1
Regression models, 11.2, 12.2
Relative frequency, 2.6
Relative frequency distribution, 2.6
Relative standing, measures of, 3.5
S
Sample space, 4.3
Sampling distribution, 6.1
Scatter gram, 11.1
Shape, 3.6
Sign test, 13.2
Signed ranks, 13.4
Significance level, 8.3
Simple linear regression, 11.2
Skewness, 3.6
Spearman's rank correlation coefficient, 13.6
Standard deviation, 3.4, 5.3, 5.6
Standard normal variable, 5.8
Standard score, 3.5
Statistical software packages, 1.5
Statistics
descriptive, 1.3
nonparametric, 11.3
summary, 3.9
Stem and leaf display, 2.5
Straight line model, 11.2
T
t distribution, 7.3
Test of hypotheses
Test statistic, 8.4
Two-tailed test, 8.2
Type I error, 8.3
Type II error, 8.3
U
Unconditional probability, 4.5
Uniform random variable, 5.9
Union of events, 4.3
Upper quartile, 3.5
Utility of model, 11.7, 12.6
V
Variability, measures of, 3.4
Variance, 3.4
Venn diagram, 4.3
W
Wilcoxon rank sum test, 13.3
Wilcoxon signed ranks test, 13.4
Z
z-score, 3.5, 6.4
z statistic, 9.4
CONTENTS
Chapter 1 Introduction
1.1 What is Statistics
1.2 Populations and samples
1.3 Descriptive and inferential statistics
1.4 Brief history of statistics
1.5 Computer softwares for statistical analysis
Chapter 2 Data presentation
2.1 Introduction
2.2 Types of data
2.3 Qualitative data presentation
2.4 Graphical description of qualitative data
2.5 Graphical description of quantitative data: Stem and Leaf displays
2.6 Tabulating quantitative data: Relative frequency distributions
2.7 Graphical description of quantitative data: histogram and polygon
2.8 Cumulative distributions and cumulative polygons
2.9 Summary
2.10 Exercises
Chapter 3 Data characteristics: descriptive summary statistics
3.1 Introduction
3.2 Types of numerical descriptive measures
3.3 Measures of location (or measures of central tendency)
3.4 Measures of data variation
3.5 Measures of relative standing
3.6 Shape
3.7 Methods for detecting outlier
3.8 Calculating some statistics from grouped data
3.9 Computing descriptive summary statistics using computer softwares
3.10 Summary
3.11 Exercises
Chapter 4 Probability: Basic concepts
4.1 Experiment, Events and Probability of an Event
4.2 Approaches to probability
4.3 The field of events
4.4 Definitions of probability
4.5 Conditional probability and independence
4.6 Rules for calculating probability
4.7 Summary
4.8 Exercises
Chapter 5 Basic Probability distributions
5.1 Random variables
5.2 The probability distribution for a discrete random variable
5.3 Numerical characteristics of a discrete random variable
5.4 The binomial probability distribution
5.5 The Poisson distribution
5.6 Continuous random variables: distribution function and density function
5.7 Numerical characteristics of a continuous random variable
5.8 Normal probability distribution
5.10 Exercises
Chapter 6 Sampling Distributions
6.1 Why the method of sampling is important
6.2 Obtaining a Random Sample
6.3 Sampling Distribution
6.4 The sampling distribution of $\bar{x}$: the Central Limit Theorem
6.5 Summary
6.6 Exercises
Chapter 7 Estimation
7.1 Introduction
7.2 Estimation of a population mean: Large-sample case
7.3 Estimation of a population mean: small sample case
7.4 Estimation of a population proportion
7.5 Estimation of the difference between two population means
7.6 Estimation of the difference between two population means: Matched pairs
7.7 Estimation of the difference between two population proportions
7.8 Choosing the sample size
7.9 Estimation of a population variance
7.10 Summary
7.11 Exercises
Chapter 8 Hypothesis Testing
8.1 Introduction
8.2 Formulating Hypotheses
8.3 Types of errors for a Hypothesis Test
8.4 Rejection Regions
8.5 Summary
8.6 Exercises
Chapter 9 Applications of Hypothesis Testing
9.1 Introduction
9.2 Hypothesis test about a population mean
9.3 Hypothesis tests of population proportions
9.4 Hypothesis tests about the difference between two population means
9.5 Hypothesis tests about the difference between two proportions
9.6 Hypothesis test about a population variance
9.7 Hypothesis test about the ratio of two population variances
9.8 Summary
9.9 Exercises
Chapter 10 Categorical data analysis and analysis of variance
10.1 Introduction
10.2 Tests of goodness of fit
10.3 The analysis of contingency tables
10.4 Contingency tables in statistical software packages
10.5 Introduction to analysis of variance
10.6 Design of experiments
10.7 Completely randomized designs
10.8 Randomized block designs
10.9 Multiple comparisons of means and confidence regions
10.10 Summary
10.11 Exercises
Chapter 11 Simple Linear regression and correlation
11.1 Introduction: Bivariate relationships
11.2 Simple Linear regression: Assumptions
171 11.3 Estimating A and B: the method of least squares ........................................... 173 11.4 Estimating σ2 .................................................................................................. 174 11.5 Making inferences about the slope, B ............................................................. 175 11.6. Correlation analysis ........................................................................................ 179 11.7 Using the model for estimation and prediction................................................. 182 11.8. Simple Linear Regression: An Example .......................................................... 184 11.9 Summary ......................................................................................................... 188 11.10 Exercises ...................................................................................................... 188 Chapter 12 Multiple regression.................................................................................. 191 12.1. Introduction: the general linear model ............................................................. 191 12.2 Model assumptions ......................................................................................... 192 12.3 Fitting the model: the method of least squares.............................................. 192 12.4 Estimating σ2................................................................................................... 195 12.5 Estimating and testing hypotheses about the B parameters........................... 195 12.6. Checking the utility of a model ........................................................................ 199 Figure 12.3 STATGRAPHICS Printout for Electrical Usage Example...................... 200 12.7. Using the model for estimating and prediction................................................. 201 12.8 Multiple linear regression: An overview example............................................. 202 12.8. 
Model building: interaction models .................................................................. 206 12.9. Model building: quadratic models.................................................................... 208 12.11 Summary ....................................................................................................... 209 12.12 Exercises ...................................................................................................... 209 Chapter 13 Nonparametric statistics............................................................................ 213 13.1. Introduction ..................................................................................................... 213 13.2. The sign test for a single population................................................................ 214 13.3 Comparing two populations based on independent random samples.............. 217 13.4. Comparing two populations based on matched pairs: ..................................... 221
iv
13.5. Comparing population using a completely randomized design ........................ 225 13.6. Rank Correlation: Spearman’s rs statistic ........................................................ 228 13.7 Summary ......................................................................................................... 231 13.8 Exercises ........................................................................................................ 232 Reference Index Appendixes
v
THE STATISTICAL ANALYSIS OF DATA

Chapter 1 Introduction

CONTENTS
1.1. What is Statistics?
1.2. Populations and samples
1.3. Descriptive and inferential statistics
1.4. Brief history of statistics
1.5. Computer softwares for statistical analysis

1.1 What is Statistics

The word statistics in our everyday life means different things to different people. To a football fan, statistics are the information about rushing yardage, passing yardage, and first downs, given at halftime. To a manager of a power generating station, statistics may be information about the quantity of pollutants being released into the atmosphere. To a school principal, statistics are information on absenteeism, test scores and teacher salaries. To a medical researcher investigating the effects of a new drug, statistics are evidence of the success of research efforts. And to a college student, statistics are the grades made on all the quizzes in a course this semester.

Each of these people is using the word statistics correctly, yet each uses it in a slightly different way and for a somewhat different purpose. Statistics is a word that can refer to quantitative data or to a field of study.

As a field of study, statistics is the science of collecting, organizing and interpreting numerical facts, which we call data. We are bombarded by data in our everyday life. The collection and study of data are important in the work of many professions, so that training in the science of statistics is valuable preparation for a variety of careers. Each month, for example, government statistical offices release the latest numerical information on unemployment and inflation. Economists and financial advisors as well as policy makers in government and business study these data in order to make informed decisions. Farmers study data from field trials of new crop varieties. Engineers gather data on the quality and reliability of manufactured products. Most areas of academic study make use of numbers, and therefore also make use of methods of statistics.

Whatever else it may be, statistics is, first and foremost, a collection of tools used for converting raw data into information to help decision makers in their work.

The science of data, statistics, is the subject of this course.
1.2 Populations and samples

In statistics, the data set that is the target of your interest is called a population. Notice that a statistical population does not refer to people as in our everyday usage of the term; it refers to a collection of data.

Definition 1.1 A population is a collection (or set) of data that describes some phenomenon of interest to you.

Definition 1.2 A sample is a subset of data selected from a population.

Example 1.1 The population may be all women in a country, for example, in Vietnam. If from each city or province we select 50 women, then the set of selected women is a sample.

Example 1.2 The set of all whisky bottles produced by a company is a population. For quality control, 150 whisky bottles are selected at random. This portion is a sample.

1.3 Descriptive and inferential statistics

If you have every measurement (or observation) of the population in hand, then statistical methodology can help you to describe this typically large set of data. We will find graphical and numerical ways to make sense out of a large mass of data. The branch of statistics devoted to this application is called descriptive statistics.

Definition 1.3 The branch of statistics devoted to the summarization and description of data (population or sample) is called descriptive statistics.

If it is too expensive or impossible to acquire every measurement in the population, then we will want to select a sample of data from the population and use the sample to infer the nature of the population.

Definition 1.4 The branch of statistics concerned with using sample data to make an inference about a population of data is called inferential statistics.
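The distinction between describing a population and inferring from a sample can be illustrated with a short Python sketch in the spirit of Example 1.2. The population below (fill volumes of 10,000 bottles) is invented purely for illustration and is not data from this course; only the sample size of 150 comes from the example.

```python
import random
import statistics

# Hypothetical population: fill volumes (ml) of 10,000 bottles.
# These numbers are invented for illustration only.
rng = random.Random(1)
population = [rng.gauss(700.0, 5.0) for _ in range(10_000)]

# Descriptive statistics: every measurement is in hand, so we summarize it.
population_mean = statistics.mean(population)

# Inferential statistics: select 150 bottles at random (without replacement)
# and use the sample mean as an estimate of the population mean.
sample = rng.sample(population, 150)
sample_mean = statistics.mean(sample)

print(f"population mean:       {population_mean:.2f} ml")
print(f"sample mean (n = 150): {sample_mean:.2f} ml")
```

The sample mean is close to, but generally not equal to, the population mean; how close it can be expected to be is the subject of the later chapters on sampling distributions and estimation.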
1.4 Brief history of statistics

The word statistik comes from the Italian word statista (meaning "statesman"). It was first used by Gottfried Achenwall (1719-1772), a professor at Marburg and Göttingen. Dr. E.A.W. Zimmermann introduced the word statistics to England. Its use was popularized by Sir John Sinclair in his work "Statistical Account of Scotland 1791-1799". Long before the eighteenth century, however, people had been recording and using data.

Official government statistics are as old as recorded history. The emperor Yao had taken a census of the population in China in the year 2238 B.C. The Old Testament contains several accounts of census taking. Governments of ancient Babylonia, Egypt and Rome gathered detailed records of population and resources. In the Middle Ages, governments began to register the ownership of land. In A.D. 762 Charlemagne asked for detailed descriptions of church-owned properties. Early in the ninth century, he completed a statistical enumeration of the serfs attached to the land. About 1086, William the Conqueror ordered the writing of the Domesday Book, a record of the ownership, extent, and value of the lands of England. This work was England's first statistical abstract.

Because of Henry VIII's fear of the plague, England began to register its dead in 1532. About this same time, French law required the clergy to register baptisms, deaths and marriages. During an outbreak of the plague in the late 1500s, the English government started publishing weekly death statistics. This practice continued, and by 1632 these Bills of Mortality listed births and deaths by sex. In 1662, Captain John Graunt used thirty years of these Bills to make predictions about the number of persons who would die from various diseases and the proportion of male and female births that could be expected. Summarized in his work, Natural and Political Observations ... Made upon the Bills of Mortality, Graunt's study was a pioneer effort in statistical analysis. For his achievement in using past records to predict future events, Graunt was made a member of the original Royal Society.

The history of the development of statistical theory and practice is a lengthy one. We have only begun to list the people who have made significant contributions to this field. Later we will encounter others whose names are now attached to specific laws and methods. Many people have brought to the study of statistics refinements or innovations that, taken together, form the theoretical basis of what we will study in this course.

1.5 Computer softwares for statistical analysis

Many real problems have so much data that doing the calculations by hand is not feasible. For this reason, most real-world statistical analysis is done on computers. You must prepare the input data, interpret the results of the analysis and take appropriate action, but the machine does all the "number crunching". There are many widely used software packages for statistical analysis. Below we list some of them.

• Minitab (registered trademark of Minitab, Inc., University Park, Pa)
• SAS (registered trademark of SAS Institute, Inc., Cary, N.C.)
• SPSS (registered trademark of SPSS, Inc., Chicago)
• SYSTAT (registered trademark of SYSTAT, Inc., Evanston, IL)
• STATGRAPHICS (registered trademark of Statistical Graphics Corp., Maryland)

Besides the software listed above, simple statistical analysis of data can be done with the "Data analysis" part of Microsoft EXCEL.
Chapter 2 Data presentation

CONTENTS
2.1. Introduction
2.2. Types of data
2.3. Qualitative data presentation
2.4. Graphical description of qualitative data
2.5. Graphical description of quantitative data: Stem and Leaf displays
2.6. Tabulating quantitative data: Relative frequency distributions
2.7. Graphical description of quantitative data: histogram and polygon
2.8. Cumulative distributions and cumulative polygons
2.9. Summary
2.10. Exercises

2.1 Introduction

The objective of data description is to summarize the characteristics of a data set. Ultimately, we want to make the data set more comprehensible and meaningful. In this chapter we will show how to construct charts and graphs that convey the nature of a data set. The procedure that we will use to accomplish this objective in a particular situation depends on the type of data that we want to describe.

2.2 Types of data

Data can be one of two types, qualitative and quantitative.

Definition 2.1 Quantitative data are observations measured on a numerical scale. In other words, quantitative data are those that represent the quantity or amount of something.

Example 2.1 The height (in centimeters) and the weight (in kilograms) of each student in a group are both quantitative data.
. i. we define the categories in such a way that each observations can fall in one and only one category.3 The classification of students of a group by the score on the subject “Statistical analysis” is presented in Table 2.Definition 2.e.3 The category frequency for a given category is the number of observations that fall in that category. Definition 2. 2. which is computed as follows Percentage for a category = Relative frequency for the category x 100% Example 2.2 Education level. The table of frequencies for the data set generated by computer using the software SPSS is shown in Figure 2. qualitative data are those that have no quantitative interpretation. x . sex of each person in a group of people are qualitative data. Definition 2.2 Nonnumerical data that can only be classified into one of a group of categories are said to be qualitative data.4 The category relative frequency for a given category is the proportion of the total number of observations that fall in that category. The data set is then described by giving the number of observations. In other words. or the proportion of the total number of observations that fall in each of the categories.0a. they can only be classified into categories.1. Example 2. nationality. Relative frequency for a category = Number of observations falling in that category Total number of observations Instead of the relative frequency for a category one usually uses percentage for a category.3 Qualitative data presentation When describing qualitative observations.
Table 2.0a The classification of students

No of stud.  CATEGORY     No of stud.  CATEGORY     No of stud.  CATEGORY     No of stud.  CATEGORY
     1       Bad              13       Good             24       Good             35       Good
     2       Medium           14       Excellent        25       Medium           36       Medium
     3       Medium           15       Excellent        26       Bad              37       Good
     4       Medium           16       Excellent        27       Good             38       Excellent
     5       Good             17       Excellent        28       Bad              39       Good
     6       Good             18       Good             29       Bad              40       Good
     7       Excellent        19       Excellent        30       Good             41       Medium
     8       Excellent        20       Excellent        31       Excellent        42       Bad
     9       Excellent        21       Good             32       Excellent        43       Excellent
    10       Excellent        22       Excellent        33       Excellent        44       Excellent
    11       Bad              23       Excellent        34       Good             45       Good
    12       Good

CATEGORY
                    Frequency   Percent   Valid Percent   Cumulative Percent
Valid   Bad              6        13.3         13.3              13.3
        Excellent       18        40.0         40.0              53.3
        Good            15        33.3         33.3              86.7
        Medium           6        13.3         13.3             100.0
        Total           45       100.0        100.0

Figure 2.1 Output from SPSS showing the frequency table for the variable CATEGORY.

2.4 Graphical description of qualitative data

Bar graphs and pie charts are two of the most widely used graphical methods for describing qualitative data sets.

Bar graphs give the frequency (or relative frequency) of each category with the height or length of the bar proportional to the category frequency (or relative frequency).
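The frequency table of Figure 2.1 can be checked directly from Table 2.0a. The following Python sketch is one possible way to do so, not the SPSS procedure itself; the variable names are ours. It prints the frequency, percentage and cumulative percentage of each category, followed by a crude text bar whose length equals the category frequency.

```python
from collections import Counter

# Categories of the 45 students of Table 2.0a, in order of student number.
categories = (
    "Bad Medium Medium Medium Good Good Excellent Excellent Excellent "
    "Excellent Bad Good "
    "Good Excellent Excellent Excellent Excellent Good Excellent Excellent "
    "Good Excellent Excellent "
    "Good Medium Bad Good Bad Bad Good Excellent Excellent Excellent Good "
    "Good Medium Good Excellent Good Good Medium Bad Excellent Excellent Good"
).split()

frequency = Counter(categories)   # category frequency (Definition 2.3)
total = len(categories)

cumulative_percent = 0.0
for category in sorted(frequency):
    count = frequency[category]
    percent = 100 * count / total          # relative frequency x 100%
    cumulative_percent += percent
    bar = "#" * count                      # bar length proportional to frequency
    print(f"{category:<10}{count:>4}{percent:>7.1f}{cumulative_percent:>7.1f}  {bar}")
```

The categories are listed alphabetically, which is the order SPSS uses in Figure 2.1.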
Example 2.4a (Bar Graph) The bar graph generated by computer using SPSS for the variable CATEGORY is depicted in Figure 2.2.

[Bar graph: categories Medium, Good, Excellent, Bad; frequency axis from 0 to 20]

Figure 2.2 Bar graph showing the number of students of each category

Pie charts divide a complete circle (a pie) into slices, each corresponding to a category, with the central angle and hence the area of the slice proportional to the category relative frequency.

Example 2.4b (Pie Chart) The pie chart generated by computer using EXCEL CHARTS for the variable CATEGORY is depicted in Figure 2.3.

[Pie chart: slices Bad, Excellent, Good, Medium]

Figure 2.3 Pie chart showing the number of students of each category
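Since the central angle of a slice is proportional to the category relative frequency, it equals the relative frequency times 360 degrees. A minimal Python sketch for the frequencies of Figure 2.1 (the variable names are ours):

```python
# Category frequencies from Figure 2.1 (45 students in total).
frequency = {"Bad": 6, "Excellent": 18, "Good": 15, "Medium": 6}
total = sum(frequency.values())

# Each slice receives a share of the 360 degrees equal to its relative frequency.
angles = {category: 360 * count / total for category, count in frequency.items()}

for category, angle in angles.items():
    print(f"{category:<10}{angle:6.1f} degrees")
```

For these data the angles are 48, 144, 120 and 48 degrees, which add up to the full circle.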
2.5 Graphical description of quantitative data: Stem and Leaf displays

One of the graphical methods for describing quantitative data is the stem and leaf display, which is widely used in exploratory data analysis when the data set is small.

In order to explain what a stem is and what a leaf is, we consider the data from Table 2.0b. For a two-digit number, for example 79, we designate the first digit (7) as its stem and call the last digit (9) its leaf. For a three-digit number, for example 112, we designate the first two digits (11) as its stem and again call the last digit (2) its leaf.

Steps to follow in constructing a Stem and Leaf Display
1. Divide each observation in the data set into two parts, the Stem and the Leaf.
2. List the stems in order in a column, starting with the smallest stem and ending with the largest.
3. Proceed through the data set, placing the leaf for each observation in the appropriate stem row.

Depending on the data, a display can use one, two or five lines per stem. Among these, two-line stems are widely used.

Example 2.5 The quantity of glucose in blood of 100 persons is measured and recorded in Table 2.0b (unit is mg %). Using SPSS we obtain the Stem-and-Leaf display for this data set shown in Figure 2.4.

Table 2.0b Quantity of glucose in blood of 100 students (unit: mg %)

 70   86   93   95   97  101  103  106  107  112
 79   87   93   95   97  101  103  106  108  115
 80   87   93   96   98  101  103  106  110  116
 83   88   93   96   98  101  104  106  111  116
 85   89   94   96   98  101  104  106  111  116
 85   90   94   96   98  101  104  106  111  116
 85   91   94   96   98  102  105  106  111  119
 85   91   94   97   98  102  106  107  111  121
 86   92   94   97  100  102  106  107  112  121
 86   92   94   97  100  103  106  107  112  126
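The three construction steps can be sketched in Python as follows. This is a minimal illustration and not the SPSS algorithm: the function name `stem_and_leaf` and its parameter are ours, and unlike SPSS the sketch does not set aside extreme values, so 70 and 126 appear as ordinary leaves.

```python
from collections import defaultdict

# Table 2.0b: quantity of glucose in blood of 100 students (mg %).
glucose = [
    70, 79, 80, 83, 85, 85, 85, 85, 86, 86,
    86, 87, 87, 88, 89, 90, 91, 91, 92, 92,
    93, 93, 93, 93, 94, 94, 94, 94, 94, 94,
    95, 95, 96, 96, 96, 96, 96, 97, 97, 97,
    97, 97, 98, 98, 98, 98, 98, 98, 100, 100,
    101, 101, 101, 101, 101, 101, 102, 102, 102, 103,
    103, 103, 103, 104, 104, 104, 105, 106, 106, 106,
    106, 106, 106, 106, 106, 106, 106, 107, 107, 107,
    107, 108, 110, 111, 111, 111, 111, 111, 112, 112,
    112, 115, 116, 116, 116, 116, 119, 121, 121, 126,
]


def stem_and_leaf(values, lines_per_stem=2):
    """Return the rows of a stem and leaf display as (stem, leaves) pairs."""
    leaves_per_line = 10 // lines_per_stem      # 5 leaves (0-4, 5-9) per line
    rows = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)          # step 1: split the observation
        rows[(stem, leaf // leaves_per_line)].append(str(leaf))
    # steps 2 and 3: stems in increasing order, each with its leaves
    return [(stem, "".join(leaves)) for (stem, _), leaves in sorted(rows.items())]


for stem, leaves in stem_and_leaf(glucose):
    print(f"{stem:>3} . {leaves}")
```

With `lines_per_stem=2` each stem is split into a line for the leaves 0 to 4 and a line for the leaves 5 to 9, which is the two-line layout used in the example.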
GLUCOSE Stem-and-Leaf Plot

 Frequency    Stem &  Leaf
     1.00 Extremes    (=<70)
     1.00        7 .  9
     2.00        8 .  03
    11.00        8 .  55556667789
    15.00        9 .  011223333444444
    18.00        9 .  556666677777888888
    18.00       10 .  001111112223333444
    16.00       10 .  5666666666677778
     9.00       11 .  011111222
     6.00       11 .  566669
     2.00       12 .  11
     1.00 Extremes    (>=126)

 Stem width:  10
 Each leaf:   1 case(s)

Figure 2.4 Output from SPSS showing the Stem-and-Leaf display for the data set of glucose

The stem and leaf display of Figure 2.4 partitions the data set into 12 classes corresponding to 12 stems. Here two-line stems are used. The number of leaves in each class gives the class frequency.

Advantages of a stem and leaf display over a frequency distribution (considered in the next section):
1. The original data are preserved.
2. A stem and leaf display arranges the data in an orderly fashion and makes it easy to determine certain numerical characteristics to be discussed in the following chapter.
3. The classes and numbers falling in them are quickly determined once we have selected the digits that we want to use for the stems and leaves.

Disadvantage of a stem and leaf display: sometimes there is not much flexibility in choosing the stems.

2.6 Tabulating quantitative data: Relative frequency distributions

Frequency distributions and relative frequency distributions are most often used in scientific publications to describe quantitative data sets. They are better suited to the description of large data sets and they permit a greater flexibility in the choice of class widths.
A frequency distribution is a table that organizes data into classes. It shows the number of observations from the data set that fall into each of the classes. It should be emphasized that we always have in mind nonoverlapping classes, i.e. classes without common items.

Steps for constructing a frequency distribution and relative frequency distribution:
1. Decide the type and number of classes for dividing the data set, and the lower limit and upper limit of the classes:
   Lower limit < Minimum of values
   Upper limit > Maximum of values
2. Determine the width of class intervals:
   $$\text{Width of class intervals} = \frac{\text{Upper limit} - \text{Lower limit}}{\text{Total number of classes}}$$
3. For each class, count the number of observations that fall in that class. This number is called the class frequency.
4. Calculate each class relative frequency:
   $$\text{Class relative frequency} = \frac{\text{Class frequency}}{\text{Total number of observations}}$$

Besides the frequency distribution and the relative frequency distribution one usually uses the relative class percentage, which is calculated by the formula:

$$\text{Relative class percentage} = \text{Class relative frequency} \times 100\%$$

Example 2.6 Construct a frequency table for the data set of quantity of glucose in blood of 100 persons recorded in Table 2.0b (unit is mg %).

Using the software STATGRAPHICS, taking Lower limit = 62, Upper limit = 150 and Total number of classes = 22, we obtained Table 2.1.
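The four steps can be sketched in Python with the settings of Example 2.6. This is our own illustration, not the STATGRAPHICS procedure. One assumption is made explicit in the code: an observation equal to a class limit is counted in the class that ends at that limit; with this convention the class frequencies of Table 2.1 are reproduced. The cumulative columns of the table are obtained as running sums of the frequencies.

```python
import math
from itertools import accumulate

# Table 2.0b: quantity of glucose in blood of 100 students (mg %).
glucose = [
    70, 79, 80, 83, 85, 85, 85, 85, 86, 86,
    86, 87, 87, 88, 89, 90, 91, 91, 92, 92,
    93, 93, 93, 93, 94, 94, 94, 94, 94, 94,
    95, 95, 96, 96, 96, 96, 96, 97, 97, 97,
    97, 97, 98, 98, 98, 98, 98, 98, 100, 100,
    101, 101, 101, 101, 101, 101, 102, 102, 102, 103,
    103, 103, 103, 104, 104, 104, 105, 106, 106, 106,
    106, 106, 106, 106, 106, 106, 106, 107, 107, 107,
    107, 108, 110, 111, 111, 111, 111, 111, 112, 112,
    112, 115, 116, 116, 116, 116, 119, 121, 121, 126,
]

# Step 1: limits and number of classes as chosen in Example 2.6.
lower, upper, n_classes = 62, 150, 22

# Step 2: width of class intervals, (150 - 62) / 22 = 4.
width = (upper - lower) / n_classes

# Step 3: class frequencies. Assumption: class k covers the interval
# (lower + k*width, lower + (k+1)*width], i.e. it includes its upper limit.
frequency = [0] * n_classes
for value in glucose:
    k = math.ceil((value - lower) / width) - 1
    frequency[k] += 1

# Step 4: relative frequencies, plus running sums for the cumulative columns.
total = len(glucose)
relative = [f / total for f in frequency]
cumulative = list(accumulate(frequency))

for k in range(n_classes):
    low = lower + k * width
    print(f"{k:>2} {low:>5.0f} {low + width:>5.0f} {low + width / 2:>5.0f} "
          f"{frequency[k]:>3} {relative[k]:>5.2f} "
          f"{cumulative[k]:>4} {cumulative[k] / total:>5.2f}")
```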
08 0.18 0. All classes of frequency table must be mutually exclusive.01 0.18 0.96 0.01 0.08 0.11 0. 2.83 0.05 0.02 0.06 0. Rel. For example xvi . Frequency Frequency Frequency 0 0.05 0.48 0.1 Frequency distribution for glucose in blood of 100 persons Class Lower Upper Midpoint Limit Limit 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 62 66 70 74 78 82 86 90 94 98 102 106 110 114 118 122 126 130 134 138 142 146 66 70 74 78 82 86 90 94 98 102 106 110 114 118 122 126 130 134 138 142 146 150 64 68 72 76 80 84 88 92 96 100 104 108 112 116 120 124 128 132 136 140 144 Frequency 0 1 0 0 2 8 5 14 18 11 18 6 8 5 3 1 0 0 0 0 0 0 Relative Cumulative Cum.03 0.Table 2.03 0.16 0.99 1 1 1 1 1 1 1 Remarks: 1.11 0.14 0.77 0.3 0.91 0.59 0.01 0.01 0 0 0. Classes may be openended when either the lower or the upper end of a quantitative classification scheme is limitless.01 0 0 0 0 0 0 0 1 1 1 3 11 16 30 48 59 77 83 91 96 99 100 100 100 100 100 100 100 0 0.
Class: age
birth to 7
8 to 15
...
64 to 71
72 and older

3. Classification schemes can be either discrete or continuous. Discrete classes are separate entities that do not progress from one class to the next without a break, such as the number of children in each family or the number of trucks owned by moving companies. Discrete data are data that can take only a limited number of values. Continuous data do progress from one class to the next without a break. They involve numerical measurement such as the weights of cans of tomatoes or the kilograms of pressure on concrete. Usually, continuous classes are half-open intervals. For example, the classes in Table 2.1 are the half-open intervals (62, 66], (66, 70], ...; each class includes its upper limit, as the class frequencies show (the observation 70 is counted in the class from 66 to 70).

2.7 Graphical description of quantitative data: histogram and polygon

There is an old saying that "one picture is worth a thousand words". Indeed, statisticians have employed graphical techniques to describe sets of data more vividly. Bar charts and pie charts were presented in Figure 2.2 and Figure 2.3 to describe qualitative data. With quantitative data summarized into frequency or relative frequency tables, however, histograms and polygons are used to describe the data.

2.7.1 Histogram

When plotting histograms, the phenomenon of interest is plotted along the horizontal axis, while the vertical axis represents the number, proportion or percentage of observations per class interval, depending on whether the particular histogram is, respectively, a frequency histogram, a relative frequency histogram or a percentage histogram.

Histograms are essentially vertical bar charts in which the rectangular bars are constructed at midpoints of classes.

Example 2.7 Below we present the frequency histogram for the data set of quantities of glucose, for which the frequency table is constructed in Table 2.1.
1.5 Frequency histogram for quantities of glucose. proportion or percentage of observations per class interval – depending on whether or not the particular polygon is respectively.7. Example 2. Figure 2. the frequency polygon is a line graph connecting the midpoints of each class interval in a data set. plotted at a height corresponding to the frequency of the class.6 is a frequency polygon constructed from data in Table 2. For example. It sketches an outline of the data pattern more clearly. a relative frequency polygon or a percentage polygon. a frequency polygon. Both may xviii . when plotting polygons the phenomenon of interest is plotted along the horizontal axis while the vertical axis represents the number.8 Figure 2.20 Frequency 15 10 5 0 100 108 116 124 132 Quantity of glucoza (mg% ) 140 68 76 84 92 Figure 2. The polygon becomes increasingly smooth and curve like as we increase the number of classes and the number of observations. the various histograms can not be constructed on the same graph because superimposing the vertical bars of one on another would cause difficulty in interpretation.6 Frequency polygon for data of glucose in Table 2.1 Advantages of polygons: • • • The frequency polygon is simpler than its histogram counterpart.2 Polygons As with histograms.1 Remark: When comparing two or more sets of data. For such cases it is necessary to construct relative frequency or percentage polygons. tabulated in Table 2.8 Cumulative distributions and cumulative polygons Other useful methods of presentation which facilitate data analysis and interpretation are the construction of cumulative distribution tables and the plotting of cumulative polygons. 2. 2.
be developed from the frequency distribution table, the relative frequency distribution table or the percentage distribution table.

A cumulative frequency distribution enables us to see how many observations lie above or below certain values, rather than merely recording the number of items within intervals.

A "less-than" cumulative frequency distribution may be developed from the frequency table as follows:

Suppose a data set is divided into n classes by boundary points $x_1, x_2, \dots, x_n, x_{n+1}$. Denote the classes by $C_1, C_2, \dots, C_n$. Thus, the class $C_k = [x_k, x_{k+1})$. See Figure 2.7.

[Diagram: the boundary points $x_1, x_2, \dots, x_k, x_{k+1}, \dots, x_n, x_{n+1}$ on a line, with the classes $C_1, C_2, \dots, C_k, \dots, C_n$ between consecutive points]

Figure 2.7 Class intervals

Suppose the frequency and relative frequency of class $C_k$ are $f_k$ and $r_k$ ($k = 1, 2, \dots, n$), respectively. Then the cumulative frequency of observations that fall into classes $C_1, C_2, \dots, C_k$, or lie below the value $x_{k+1}$, is the sum $f_1 + f_2 + \dots + f_k$. The corresponding cumulative relative frequency is $r_1 + r_2 + \dots + r_k$.

Example 2.9 Table 2.1 gives the frequency, relative frequency, cumulative frequency and cumulative relative frequency distribution for quantity of glucose in blood of 100 students. According to this table the number of students having quantity of glucose of 90 or less is 16 (the classes of Table 2.1 include their upper limits).

A graph of a cumulative frequency distribution is called a "less-than" ogive or simply an ogive. Figure 2.8 shows the cumulative frequency distribution for quantity of glucose in blood of 100 students (data from Table 2.1).

[Ogive: horizontal axis "Quantity of glucose (mg %)", vertical axis "Cumulative frequency"]
Figure 2.8 Cumulative frequency distribution for quantity of glucose (for data in Table 2.1)

2.9 Summary

This chapter discussed methods for presenting data sets of qualitative and quantitative variables.

For a qualitative data set we first define categories and the category frequency, which is the number of observations falling in each category. Further, the category relative frequency and the percentage for a category are introduced. Bar graphs and pie charts are constructed as graphical pictures of the data set.

If the data are quantitative and the number of observations is small, the categorization and the determination of class frequencies can be done by constructing a stem and leaf display. Large sets of data are best described using a relative frequency distribution. The latter presents a table that organizes data into classes with their relative frequencies. For describing quantitative data graphically, histograms and polygons are used.

2.10 Exercises

1) A national cancer institute survey of 1,580 adult women recently responded to the question "In your opinion, what is the most serious health problem facing women?" The responses are summarized in the following table:

The most serious health problem for women    Relative frequency
Breast cancer                                       0.44
Other cancers                                       0.31
Emotional stress                                    0.07
High blood pressure                                 0.06
Heart trouble                                       0.03
Other problems                                      0.09

a) Use one of the graphical methods to describe the data.
b) What proportion of the respondents believe that high blood pressure or heart trouble is the most serious health problem for women?
c) Estimate the percentage of all women who believe that some type of cancer is the most serious health problem for women.

2) The administrator of a hospital has ordered a study of the amount of time a patient must wait before being treated by emergency room personnel. The following data were collected during a typical day:
9 19.1 31. d) Construct a “lessthan” ogive from the data.7 21.3 19.9 23.1 24.8 20.0 39. to the nearest tenth of a minute.5 21.7 20.8 21.8 25.9 40.3 23.0 19.7 39.6 23. the time required to set the entire front page in type was recorded for 50 days. The accompanying data represent the percentages of respiring bacteria in 25 raw sewage samples collected from a sewage plant.2 19.2 21. estimate what percentage of the time the front page can be set in less than 24 minutes.3 25.5 38.3 40.5 28. 3) Bacteria are the most important component of microbial eco systems in sewage treatment plants.9 22.9 50.0 20.1 19.6 37.1 24.1 34. are given below.0 24.5 45.2 46.8 23.3 21. What additional interpretation can you give to the data from the frequency distribution? c) Construct the cumulative relative frequency polygon and from this ogive state how long 75% of the patients should expect to wait. Construct a stem and leaf display for the data.2 22.7 22.9 22. c) Construct a frequency polygon from the data.9 20. What comment can you make about patient waiting time from your data array? b) Construct a frequency distribution using 6 classes.1 22.8 21. b) Construct a frequency distribution and a “lessthan” cumulative frequency distribution from the data.8 45.7 23. c.7 a) Arrange the data in an array from lowest to heighest. e) From your ogive.1 41.6 38. 20.3 36.2 a.8 25. Construct a relative frequency distribution for the data.7 20.6 48.7 48.9 22.8 20. 42.6 20.2 23.1 23. The data.2 22.7 37.9 20. Compare the two graphs of parts a and b.6 44.8 22.6 39.5 23.8 minutes.5 24.WAITING TIME (MINUTES) 12 26 16 4 21 7 20 14 24 25 3 2 11 26 17 15 29 16 18 6 a) Arrange the data in an array from lowest to heighest.8 23.3 50.7 32. Water management engineers must know the percentage of active bacteria at each stage of the sewage treatment.0 21. xxi .3 25.5 19. using intervals of 0. b.9 24.5 35.5 24.1 23. 4) At a newspaper office.
Chapter 3 Data characteristics: descriptive summary statistics

CONTENTS
3.1. Introduction
3.2. Types of numerical descriptive measures
3.3. Measures of central tendency
3.4. Measures of data variation
3.5. Measures of relative standing
3.6. Shape
3.7. Methods for detecting outlier
3.8. Calculating some statistics from grouped data
3.9. Computing descriptive summary statistics using computer softwares
3.10. Summary
3.11. Exercises

3.1 Introduction

In the previous chapter data were collected and appropriately summarized into tables and charts. In this chapter a variety of descriptive summary measures will be developed. These descriptive measures are useful for analyzing and interpreting quantitative data, whether collected in raw form (ungrouped data) or summarized into frequency distributions (grouped data).

3.2 Types of numerical descriptive measures

Four types of characteristics which describe a data set pertaining to some numerical variable or phenomenon of interest are:
• Location
• Dispersion
• Relative standing
• Shape

In any analysis and/or interpretation of numerical data, a variety of descriptive measures representing the properties of location, variation, relative standing and shape may be used to extract and summarize the salient features of the data set.

If these descriptive measures are computed from a sample of data they are called statistics. In contrast, if these descriptive measures are computed from an entire population of data, they are called parameters. Since statisticians usually take samples rather than use entire populations, our primary emphasis deals with statistics rather than parameters.
3.3 Measures of location (or measures of central tendency)

3.3.1 Mean

Definition 3.1
The arithmetic mean of a sample (or simply the sample mean) of n observations $x_1, x_2, \ldots, x_n$, denoted by $\bar{x}$, is computed as

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$$

Definition 3.1a
The population mean is defined by the formula

$$\mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{\text{Sum of the values of all observations in the population}}{\text{Total number of observations in the population}}$$

Note that the definitions of the population mean and the sample mean are the same. The same holds for the definitions of the other measures of central tendency. But in the next section we will give different formulas for the variances of a population and of a sample.

Example 3.1 Consider 7 observations: 4.2, 4.3, 4.7, 4.8, 5.0, 5.1, 9.0. By definition

$\bar{x}$ = (4.2 + 4.3 + 4.7 + 4.8 + 5.0 + 5.1 + 9.0)/7 = 5.3

Advantages of the mean:
• It is a measure that can be calculated and is unique.
• It is useful for performing statistical procedures such as comparing the means from several data sets.

Disadvantages of the mean:
It is affected by extreme values that are not representative of the rest of the data. Indeed, if in the above example we compute the mean of the first 6 numbers and exclude the 9.0 value, then the mean is 4.7. The one extreme value 9.0 distorts the value we get for the mean. It would be more representative to calculate the mean without including such an extreme value.
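The effect of the extreme value in Example 3.1 can be checked with a few lines of Python. This is only a minimal sketch using the standard library `statistics` module; the variable names are illustrative and not part of the course text.

```python
from statistics import mean

# The seven observations of Example 3.1; the last one is the extreme value.
observations = [4.2, 4.3, 4.7, 4.8, 5.0, 5.1, 9.0]

mean_all = mean(observations)                   # all seven observations
mean_without_extreme = mean(observations[:-1])  # first six, excluding 9.0

print(round(mean_all, 1))              # 5.3
print(round(mean_without_extreme, 1))  # 4.7
```

Dropping the single value 9.0 moves the mean from 5.3 to about 4.7, which is the distortion described above.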
3.3.2 Median

Definition 3.2
The median m of a sample of n observations $x_1, x_2, \ldots, x_n$ arranged in ascending or descending order is the middle number that divides the data set into two equal halves: one half of the items lie above this point, and the other half lie below it.

Formula for calculating the median of a data set arranged in ascending order:

$$m = \text{Median} = \begin{cases} x_k & \text{if } n = 2k - 1 \ (n \text{ is odd}) \\ \tfrac{1}{2}(x_k + x_{k+1}) & \text{if } n = 2k \ (n \text{ is even}) \end{cases}$$

Example 3.2 Find the median of the data set consisting of the observations 7, 4, 3, 5, 6, 8, 10.

Solution First, we arrange the data set in ascending order
3 4 5 6 7 8 10.
Since the number of observations is odd, n = 2 x 4 - 1, the median is m = x4 = 6. We see that half of the observations, namely 3, 4, 5, lie below the value 6 and the other half, namely 7, 8 and 10, lie above the value 6.

Example 3.3 Suppose we have an even number of observations 7, 4, 3, 5, 6, 8, 10, 1. Find the median of this data set.

Solution First, we arrange the data set in ascending order
1 3 4 5 6 7 8 10.
Since the number of observations is n = 2 x 4, by the definition
Median = (x4 + x5)/2 = (5 + 6)/2 = 5.5

Advantage of the median over the mean: Extreme values in a data set do not affect the median as strongly as they do the mean. Indeed, in Example 3.1 we have mean = 5.3 and median = 4.8. The extreme value of 9.0 does not affect the median.

3.3.3 Mode
Definition 3.3
The mode of a data set $x_1, x_2, \ldots, x_n$ is the value of x that occurs with the greatest frequency, i.e., is repeated most often in the data set.

Example 3.4 Find the mode of the data set in Table 3.1.

Table 3.1 Quantity of glucose (mg%) in blood of 25 students
70 79 83 86 87 88 93 93 93 95 95 96 97 97 98 101 101 103 103 106 106 107 108 112 115

Solution First we arrange this data set in ascending order
70 79 83 86 87 88 93 93 93 95 95 96 97 97 98 101 101 103 103 106 106 107 108 112 115
This data set contains 25 numbers. We see that the value 93 is repeated most often. Therefore, the mode of the data set is 93.

Multimodal distribution: A data set may have several modes. In this case it is called a multimodal distribution.

Example 3.5 The data set
0 0 1 1 1 2 4 4 4 5 6 6 7 8 9 9 10 11 11 12
has two modes: 1 and 4. This distribution is called a bimodal distribution.

Advantage of the mode: Like the median, the mode is not unduly affected by extreme values. Even if the high values are very high and the low values are very low, we choose the most frequent value of the data set to be the modal value. We can use the mode no matter how large, how small, or how spread out the values in the data set happen to be.

Disadvantages of the mode:
• The mode is not used as often to measure central tendency as are the mean and the median. Too often, there is no modal value because the data set contains no values that occur more than once. Other times, every value is the mode because every value occurs the same number of times. Clearly, the mode is a useless measure in these cases.
• When data sets contain two, three, or many modes, they are difficult to interpret and compare.

Comparing the Mean, Median and Mode
• In general, the three measures of central tendency of a data set (the mean, the median and the mode) are different. For example, for the data set in Table 3.1, mean = 96.48, median = 97 and mode = 93.
• If all observations in a data set are arranged symmetrically about an observation, then this observation is the mean, the median and the mode.
• Which of these three measures of central tendency is better? The best measure of central tendency for a data set depends on the type of descriptive information you want. For most data sets encountered in business, engineering and computer science, this will be the MEAN.

3.3.4 Geometric mean

Definition 3.4
Suppose all the n observations in a data set satisfy $x_1, x_2, \ldots, x_n > 0$. Then the geometric mean of the data set is defined by the formula

$$x_G = G.M = \sqrt[n]{x_1 x_2 \cdots x_n}$$

The geometric mean is appropriate to use whenever we need to measure the average rate of change (the growth rate) over a period of time.

From the above formula it follows that

$$\log x_G = \frac{1}{n} \sum_{i=1}^{n} \log x_i$$

where log is the logarithmic function of any base.
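The measures of central tendency discussed in this section can be checked numerically. The sketch below uses Python's standard `statistics` module (Python 3.8 or later is assumed for `multimode` and `geometric_mean`) on the data of Table 3.1 and Example 3.5, and also verifies the logarithm identity for the geometric mean. The variable names are illustrative only.

```python
import math
from statistics import geometric_mean, mean, median, mode, multimode

# Table 3.1: quantity of glucose (mg%) in blood of 25 students
glucose = [70, 79, 83, 86, 87, 88, 93, 93, 93, 95, 95, 96, 97,
           97, 98, 101, 101, 103, 103, 106, 106, 107, 108, 112, 115]

print(mean(glucose), median(glucose), mode(glucose))  # 96.48 97 93

# Example 3.5: a data set with more than one mode
example_3_5 = [0, 0, 1, 1, 1, 2, 4, 4, 4, 5, 6, 6, 7, 8, 9, 9, 10, 11, 11, 12]
print(multimode(example_3_5))  # [1, 4]

# Geometric mean and the identity log(x_G) = (1/n) * sum(log x_i)
x_g = geometric_mean(glucose)
mean_of_logs = sum(math.log(x) for x in glucose) / len(glucose)
print(math.isclose(math.log(x_g), mean_of_logs))  # True
```

For positive data the geometric mean never exceeds the arithmetic mean, so `x_g` comes out slightly below 96.48 here.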
Thus, the logarithm of the geometric mean of the values of a data set is equal to the arithmetic mean of the logarithms of the values of the data set.

3.4 Measures of data variation

Just as measures of central tendency locate the "center" of a relative frequency distribution, measures of variation measure its "spread". The most commonly used measures of data variation are the range, the variance and the standard deviation.

3.4.1 Range

Definition 3.5
The range of a quantitative data set is the difference between the largest and smallest values in the set.

Range = Maximum - Minimum,

where Maximum = largest value, Minimum = smallest value.

3.4.2 Variance and standard deviation

Definition 3.6
The population variance of the population of observations x is defined by the formula

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$

where:
$\sigma^2$ = population variance
$x_i$ = the item or observation
$\mu$ = population mean
N = total number of observations in the population.

From Definition 3.6 we see that the population variance is the average of the squared distances of the observations from the mean.
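As a small illustration of Definitions 3.5 and 3.6, the following sketch computes the range and the population variance directly from the formulas and compares the latter with `statistics.pvariance`. Treating the 25 glucose values of Table 3.1 as a complete population is an assumption made only for this example.

```python
import math
from statistics import fmean, pvariance

# Table 3.1 values, treated here as a whole population (assumption).
population = [70, 79, 83, 86, 87, 88, 93, 93, 93, 95, 95, 96, 97,
              97, 98, 101, 101, 103, 103, 106, 106, 107, 108, 112, 115]

# Definition 3.5: Range = Maximum - Minimum
data_range = max(population) - min(population)

# Definition 3.6: average squared distance from the population mean
mu = fmean(population)
sigma_squared = sum((x - mu) ** 2 for x in population) / len(population)

print(data_range)  # 45
print(math.isclose(sigma_squared, pvariance(population)))  # True
```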
Definition 3.6a
The sample variance of the sample of observations $x_1, x_2, \ldots, x_n$ is defined by the formula

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

where:
$s^2$ = sample variance
$\bar{x}$ = sample mean
n = total number of observations in the sample

The standard deviation of the sample is $s = \sqrt{s^2}$.

Remark: In the denominator of the formula for $s^2$ we use n - 1 instead of n because statisticians proved that if $s^2$ is defined as above, then $s^2$ is an unbiased estimate of the variance of the population from which the sample was selected (i.e. the expected value of $s^2$ is equal to the population variance).

Definition 3.7
The standard deviation of a population is equal to the square root of the variance

$$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$$

Note that for the variance, the units are the squares of the units of the data. For the standard deviation, the units are the same as those used in the data.

Uses of the standard deviation
The standard deviation enables us to determine, with a great deal of accuracy, where the values of a frequency distribution are located in relation to the mean. We can do this according to a theorem devised by the Russian mathematician P.L. Chebyshev (1821-1894).
Chebyshev's Theorem
For any data set with mean $\bar{x}$ and standard deviation s, at least 75% of the values will fall within the interval $\bar{x} \pm 2s$ and at least 89% of the values will fall within the interval $\bar{x} \pm 3s$.

We can measure with even more precision the percentage of items that fall within specific ranges under a symmetrical, bell-shaped curve. In these cases we have:

The Empirical Rule
If a relative frequency distribution of sample data is bell-shaped with mean $\bar{x}$ and standard deviation s, then the proportions of the total number of observations falling within the intervals $\bar{x} \pm s$, $\bar{x} \pm 2s$, $\bar{x} \pm 3s$ are as follows:
$\bar{x} \pm s$: Close to 68%
$\bar{x} \pm 2s$: Close to 95%
$\bar{x} \pm 3s$: Near 100%

3.4.3 Relative dispersion: The coefficient of variation

The standard deviation is an absolute measure of dispersion that expresses variation in the same units as the original data. For example, the unit of the standard deviation of the data set of heights of a group of students is the centimeter, while the unit of the standard deviation of the data set of their weights is the kilogram. Can we compare the values of these standard deviations? Unfortunately, no, because they are in different units.

We need a relative measure that will give us a feel for the magnitude of the deviation relative to the magnitude of the mean. The coefficient of variation is one such relative measure of dispersion.

Definition 3.8
The coefficient of variation of a data set is the ratio of its standard deviation to its mean

$$cv = \text{Coefficient of variation} = \frac{\text{Standard deviation}}{\text{Mean}} \times 100\%$$

This definition applies to both population and sample. The unit of the coefficient of variation is percent.
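The statements above can be explored on real numbers. The sketch below takes the glucose data of Table 3.1 as a sample, computes $\bar{x}$ and s, counts the share of observations inside $\bar{x} \pm ks$ for k = 1, 2, 3, and computes the coefficient of variation. The helper name `proportion_within` is hypothetical and introduced only for this illustration.

```python
from statistics import fmean, stdev

# Table 3.1 values, treated here as a sample.
glucose = [70, 79, 83, 86, 87, 88, 93, 93, 93, 95, 95, 96, 97,
           97, 98, 101, 101, 103, 103, 106, 106, 107, 108, 112, 115]

x_bar = fmean(glucose)
s = stdev(glucose)  # sample standard deviation (denominator n - 1)


def proportion_within(data, centre, spread, k):
    """Share of observations lying inside centre +/- k * spread."""
    inside = [x for x in data if abs(x - centre) <= k * spread]
    return len(inside) / len(data)


for k in (1, 2, 3):
    print(k, proportion_within(glucose, x_bar, s, k))

# Definition 3.8: coefficient of variation in percent
cv = s / x_bar * 100
print(round(cv, 1))
```

The proportions for k = 2 and k = 3 can be compared with Chebyshev's lower bounds of 75% and 89%, and all three with the Empirical Rule.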
Example 3.6 Suppose that each day laboratory technician A completes 40 analyses with a standard deviation of 5. Technician B completes 160 analyses per day with a standard deviation of 15. Which employee shows less variability?

At first glance, it appears that technician B has three times more variation in the output rate than technician A. But B completes analyses at a rate 4 times faster than A. Taking all this information into account, we compute the coefficient of variation for both technicians:
For technician A: cv = 5/40 x 100% = 12.5%
For technician B: cv = 15/160 x 100% = 9.4%.
So, we find that technician B, who has more absolute variation in output than technician A, has less relative variation.

3.5 Measures of relative standing

In some situations, you may want to describe the relative position of a particular observation in a data set. Descriptive measures that locate the relative position of an observation in relation to the other observations are called measures of relative standing. A measure that expresses this position in terms of a percentage is called a percentile for the data set.

Definition 3.9
Suppose a data set is arranged in ascending (or descending) order. The pth percentile is a number such that p% of the observations of the data set fall below it and (100 - p)% of the observations fall above it.

The median, by definition, is the 50th percentile. The 25th percentile, the median and the 75th percentile are often used to describe a data set because they divide the data set into 4 groups, with each group containing one-fourth (25%) of the observations. They also divide the relative frequency distribution for a data set into 4 parts, each containing the same area (0.25), as shown in Figure 3.1. Consequently, the 25th percentile, the median, and the 75th percentile are called the lower quartile, the mid quartile, and the upper quartile, respectively, for a data set.

Definition 3.10
The lower quartile, QL, for a data set is the 25th percentile.
Definition 3.11
The mid-quartile, M, for a data set is the 50th percentile.

Definition 3.12
The upper quartile, QU, for a data set is the 75th percentile.

Definition 3.13
The interquartile range of a data set is QU - QL.

[Figure: relative frequency distribution with QL, M and QU marked on the horizontal axis]
Figure 3.1 Location of the lower, mid and upper quartiles

For a large data set, quartiles are found by locating the corresponding areas under the relative frequency distribution polygon as in Figure 3.1. However, when the sample data set is small, it may be impossible to find an observation in the data set that exceeds, say, exactly 25% of the remaining observations. Consequently, the lower and the upper quartiles for a small data set are not well defined. The following box describes a procedure for finding quartiles for small data sets.

Finding quartiles for small data sets:
1. Rank the n observations in the data set in ascending order of magnitude.
2. Calculate the quantity (n+1)/4 and round to the nearest integer. The observation with this rank represents the lower quartile. If (n+1)/4 falls halfway between two integers, round up.
3. Calculate the quantity 3(n+1)/4 and round to the nearest integer. The observation with this rank represents the upper quartile. If 3(n+1)/4 falls halfway between two integers, round down.

Example 3.7 Find the lower quartile, the median, and the upper quartile for the data set in Table 3.1.

Solution For this data set n = 25. Therefore, (n+1)/4 = 26/4 = 6.5 and 3(n+1)/4 = 3*26/4 = 19.5. We round 6.5 up to 7 and 19.5 down to 19. Hence, the lower quartile = 7th observation = 93, and the upper quartile = 19th observation = 103. We also have the median = 13th observation = 97. The location of these quartiles is presented in Figure 3.2.

[Figure: number line from 70 to 115 with Min = 70, QL = 93, M = 97, QU = 103, Max = 115 marked]
Figure 3.2 Location of the quartiles for the data set of Table 2.1

Another measure of relative standing is the z-score for an observation (or standard score). It describes how far an individual item in a distribution departs from the mean of the distribution. The standard score gives us the number of standard deviations a particular observation lies below or above the mean.
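The quartile procedure for small data sets used in Example 3.7 above can be sketched in Python as follows. The function name `small_sample_quartiles` is hypothetical; the rounding is written so that a rank halfway between two integers is rounded up for the lower quartile and down for the upper quartile, as the procedure prescribes.

```python
import math
from statistics import median


def small_sample_quartiles(data):
    """Return (QL, QU) using the ranks (n+1)/4 and 3(n+1)/4."""
    ordered = sorted(data)                            # step 1: rank the data
    n = len(ordered)
    lower_rank = math.floor((n + 1) / 4 + 0.5)        # halfway case rounds up
    upper_rank = math.ceil(3 * (n + 1) / 4 - 0.5)     # halfway case rounds down
    return ordered[lower_rank - 1], ordered[upper_rank - 1]


# Table 3.1: quantity of glucose (mg%) in blood of 25 students
glucose = [70, 79, 83, 86, 87, 88, 93, 93, 93, 95, 95, 96, 97,
           97, 98, 101, 101, 103, 103, 106, 106, 107, 108, 112, 115]

q_l, q_u = small_sample_quartiles(glucose)
print(q_l, median(glucose), q_u)  # 93 97 103
print(q_u - q_l)                  # interquartile range: 10
```

Statistical packages often interpolate between neighbouring observations instead, so their quartiles may differ slightly from the ones produced by this rule.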
Definition 3.14
The standard score (or z-score) is defined as follows:

For a population:

$$z\text{-score} = \frac{x - \mu}{\sigma}$$

where
x = the observation from the population,
$\mu$ = the population mean,
$\sigma$ = the population standard deviation.

For a sample:

$$z\text{-score} = \frac{x - \bar{x}}{s}$$

where
x = the observation from the sample,
$\bar{x}$ = the sample mean,
s = the sample standard deviation.

3.6 Shape

The fourth important numerical characteristic of a data set is its shape. In describing a numerical data set it is not only necessary to summarize the data by presenting appropriate measures of central tendency, dispersion and relative standing; it is also necessary to consider the shape of the data, that is, the manner in which the data are distributed.

There are two measures of the shape of a data set: skewness and kurtosis.

3.6.1 Skewness

If the distribution of the data is not symmetrical, it is called asymmetrical or skewed. Skewness characterizes the degree of asymmetry of a distribution around its mean. For sample data, the skewness is defined by the formula:

$$\text{Skewness} = \frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^3$$

where
n = the number of observations in the sample,
$x_i$ = ith observation in the sample,
s = standard deviation of the sample.

The direction of the skewness depends upon the location of the extreme values. If the extreme values are the larger observations, the mean will be the measure of location most greatly distorted toward the upward direction. Since the mean exceeds the median and the mode, such a distribution is said to be positive or right-skewed. The tail of its distribution is extended to the right. This is depicted in Figure 3.3a.

On the other hand, if the extreme values are the smaller observations, the mean will be the measure of location most greatly reduced. Since the mean is exceeded by the median and the mode, such a distribution is said to be negative or left-skewed. The tail of its distribution is extended to the left. This is depicted in Figure 3.3b.

Figure 3.3a Right-skewed distribution
Figure 3.3b Left-skewed distribution

3.6.2 Kurtosis

Kurtosis characterizes the relative peakedness or flatness of a distribution compared with the bell-shaped distribution (normal distribution). Kurtosis of a sample data set is calculated by the formula:

$$\text{Kurtosis} = \frac{n(n+1)}{(n-1)(n-2)(n-3)} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^4 - \frac{3(n-1)^2}{(n-2)(n-3)}$$

Positive kurtosis indicates a relatively peaked distribution. Negative kurtosis indicates a relatively flat distribution.

The distributions with positive and negative kurtosis are depicted in Figure 3.4, where the distribution with null kurtosis is the normal distribution.
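The sample skewness and kurtosis formulas of Sections 3.6.1 and 3.6.2 translate directly into code. The following is a minimal sketch; the function names and the two small test samples are hypothetical and chosen only to show the sign of the results.

```python
from statistics import fmean, stdev


def sample_skewness(data):
    """n / ((n-1)(n-2)) * sum(((x_i - mean) / s) ** 3); needs n >= 3."""
    n = len(data)
    m, s = fmean(data), stdev(data)
    return n / ((n - 1) * (n - 2)) * sum(((x - m) / s) ** 3 for x in data)


def sample_kurtosis(data):
    """Sample kurtosis as defined in Section 3.6.2; needs n >= 4."""
    n = len(data)
    m, s = fmean(data), stdev(data)
    fourth = sum(((x - m) / s) ** 4 for x in data)
    return (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * fourth
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))


symmetric = [1, 2, 3, 4, 5]      # hypothetical symmetric sample
long_right_tail = [1, 1, 1, 2, 10]  # hypothetical sample with a large extreme value

print(sample_skewness(symmetric))            # 0.0: no asymmetry
print(sample_skewness(long_right_tail) > 0)  # True: right-skewed
print(round(sample_kurtosis(symmetric), 6))  # -1.2: flatter than the normal curve
```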
Figure 3.4 The distributions with positive and negative kurtosis

3.7 Methods for detecting outlier

Definition 3.15
An observation (or measurement) that is unusually large or small relative to the other values in a data set is called an outlier. Outliers typically are attributable to one of the following causes:
1. The measurement is observed, recorded, or entered into the computer incorrectly.
2. The measurements come from a different population.
3. The measurement is correct, but represents a rare event.

Outliers occur when the relative frequency distribution of the data set is extremely skewed, because such a distribution has a tendency to include extremely large or small observations.

There are two widely used methods for detecting outliers.

Method of using z-score:
According to Chebyshev's theorem, at least 89% of the observations in a data set (and, by the Empirical Rule, nearly all of them when the distribution is bell-shaped) will have a z-score less than 3 in absolute value, i.e. fall into the interval $(\bar{x} - 3s, \bar{x} + 3s)$, where $\bar{x}$ is the mean and s is the standard deviation of the sample. Therefore, the observations with a z-score greater than 3 in absolute value are regarded as outliers.

Example 3.8 The doctor of a school has measured the height of pupils in the class 5A. The result (in cm) is as follows:
Table 3.2 Heights of the pupils of the class 5A
130 131 129 133 132 133 134 132 138 129 135 130 136 133 132 131 131 110 135 134 153 132 134 135

For the data set in Table 3.2, $\bar{x}$ = 132.77, s = 6.06, 3s = 18.18. The z-score of the observation 153 is (153 - 132.77)/6.06 = 3.34, and the z-score of 110 is (110 - 132.77)/6.06 = -3.76. Since the absolute values of the z-scores of 153 and 110 are more than 3, the height of 153 cm and the height of 110 cm are outliers in the data set.

Box plot method
Another procedure for detecting outliers is to construct a box plot of the data. Below we present steps to follow in constructing a box plot.

Steps to follow in constructing a box plot
1. Calculate the median M, the lower and upper quartiles, QL and QU, and the interquartile range, IQR = QU - QL, for the data set.
2. Construct a box with QL and QU located at the lower corners. The base width will then be equal to IQR. Draw a vertical line inside the box to locate the median M.
3. Construct two sets of limits on the box plot: inner fences are located a distance of 1.5 * IQR below QL and above QU; outer fences are located a distance of 3 * IQR below QL and above QU (see Figure 3.5).
4. Observations that fall between the inner and outer fences are called suspect outliers. Locate the suspect outliers on the box plot using asterisks (*). Observations that fall outside the outer fences are called highly suspect outliers. Use small circles to locate them.
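Both detection methods can be tried on the heights of Table 3.2 as printed above. The sketch below is illustrative only: the quartile ranks follow the small-sample rule of Section 3.5, the variable names are hypothetical, and the mean and standard deviation computed from the table as reproduced here may differ slightly from the figures quoted in the text, although the conclusion about 110 and 153 is the same.

```python
import math
from statistics import fmean, stdev

# Table 3.2: heights (cm) of the pupils of class 5A
heights = [130, 131, 129, 133, 132, 133, 134, 132, 138, 129, 135, 130,
           136, 133, 132, 131, 131, 110, 135, 134, 153, 132, 134, 135]

# Method 1: z-scores larger than 3 in absolute value
x_bar, s = fmean(heights), stdev(heights)
z_outliers = sorted(x for x in heights if abs((x - x_bar) / s) > 3)

# Method 2: inner and outer fences of the box plot
ordered = sorted(heights)
n = len(ordered)
q_l = ordered[math.floor((n + 1) / 4 + 0.5) - 1]     # lower quartile
q_u = ordered[math.ceil(3 * (n + 1) / 4 - 0.5) - 1]  # upper quartile
iqr = q_u - q_l
inner = (q_l - 1.5 * iqr, q_u + 1.5 * iqr)
outer = (q_l - 3 * iqr, q_u + 3 * iqr)

suspect = [x for x in ordered
           if outer[0] <= x < inner[0] or inner[1] < x <= outer[1]]
highly_suspect = [x for x in ordered if x < outer[0] or x > outer[1]]

print(z_outliers)      # [110, 153]
print(inner, outer)    # (125.0, 141.0) (119, 147)
print(suspect)         # []
print(highly_suspect)  # [110, 153]
```

Here the two methods agree: both flag the heights 110 cm and 153 cm and nothing else.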
[Figure: box plot showing QL, M and QU, the box width IQR, the inner fences at 1.5 * IQR and the outer fences at a further 1.5 * IQR (3 * IQR in total) on each side; suspect outliers marked with *]
Figure 3.5 Box plot

For a large data set, a box plot can be constructed using available statistical computer software. A box plot generated by SPSS for the data set in Table 3.2 is shown in Figure 3.6.

Figure 3.6 Output from SPSS showing box plot for the data set in Table 3.2
3.8 Calculating some statistics from grouped data

In Sections 3.3 through 3.6 we gave formulas for computing the mean, median, standard deviation etc. of a data set. However, these formulas apply only to raw data sets, i.e., those in which the value of each of the individual observations in the data set is known. If the data have already been grouped into classes of equal width and arranged in a frequency table, you must use an alternative method to compute the mean, standard deviation etc.

Example 3.9 Suppose we have a frequency table of average monthly checking-account balances of 600 customers at a branch bank.

CLASS (DOLLARS)     FREQUENCY
0 – 49.99           78
50 – 99.99          123
100 – 149.99        187
150 – 199.99        82
200 – 249.99        51
250 – 299.99        47
300 – 349.99        13
350 – 399.99        9
400 – 449.99        6
450 – 499.99        4

From the information in this table, we can easily compute an estimate of the value of the mean and the standard deviation.

Formulas for calculating the mean and the standard deviation for grouped data:

$$\bar{x} = \frac{\sum_{i=1}^{k} f_i x_i}{n}, \qquad s^2 = \frac{\sum_{i=1}^{k} f_i x_i^2 - \dfrac{\left( \sum_{i=1}^{k} f_i x_i \right)^2}{n}}{n - 1}$$

where
$\bar{x}$ = mean of the data set,
$s^2$ = variance of the data set (the standard deviation is $s = \sqrt{s^2}$),
$x_i$ = midpoint of the ith class,
$f_i$ = frequency of the ith class,
k = number of classes,
n = total number of observations in the data set.

3.9 Computing descriptive summary statistics using computer softwares

All statistical computer software packages have procedures for computing descriptive summary statistics. Below we present output from STATGRAPHICS for computing descriptive summary statistics for the GLUCOSE data in Table 2.0b.

Variable: GLUCOSE

Sample size            100
Average                100
Median                 100.5
Mode                   106
Geometric mean         99.482475
Variance               102.767677
Standard deviation     10.137439
Standard error         1.013744
Minimum                70
Maximum                126
Range                  56
Lower quartile         94
Upper quartile         106
Interquartile range    12
Skewness               0.051526
Kurtosis               0.131118
Coeff. of variation    10.137439

Figure 3.7 Output from STATGRAPHICS for Glucose data
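As a further illustration of doing such computations in software, the grouped-data formulas of Section 3.8 can be applied to the frequency table of Example 3.9 in a few lines of Python. This is a sketch under one stated assumption: the class midpoints are taken as 25, 75, ..., 475 (classes treated as running from 0 to 50, 50 to 100, and so on); taking the midpoint of 0 and 49.99 instead would shift every midpoint by half a cent.

```python
import math

# Example 3.9: ten classes of width 50 dollars and their frequencies
lower_limits = range(0, 500, 50)
frequencies = [78, 123, 187, 82, 51, 47, 13, 9, 6, 4]
midpoints = [lower + 25 for lower in lower_limits]  # assumed class midpoints

n = sum(frequencies)  # total number of observations
sum_fx = sum(f * x for f, x in zip(frequencies, midpoints))
sum_fx2 = sum(f * x * x for f, x in zip(frequencies, midpoints))

grouped_mean = sum_fx / n
grouped_variance = (sum_fx2 - sum_fx ** 2 / n) / (n - 1)
grouped_std = math.sqrt(grouped_variance)

print(n)                      # 600
print(grouped_mean)           # 142.25
print(round(grouped_std, 2))  # roughly 89.46
```

These values are only estimates of the mean and standard deviation of the underlying balances, because every observation in a class is replaced by the class midpoint.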
3.10 Summary

Numerical descriptive measures enable us to construct a mental image of the relative frequency distribution for a data set pertaining to a numerical variable. There are 4 types of these measures: location, dispersion, relative standing and shape.

The three numerical descriptive measures used to locate a relative frequency distribution are the mean, the median, and the mode. Each conveys a special piece of information. In a sense, the mean is the balancing point for the data. The median, which is insensitive to extreme values, divides the data set into two equal halves: half of the observations will be less than the median and half will be larger. The mode is the observation that occurs with greatest frequency. It is the value of the data set that locates the point where the relative frequency distribution achieves its maximum relative frequency.

The range and the standard deviation measure the spread of a relative frequency distribution. In particular, we can obtain a very good notion of the way data are distributed around the mean by constructing the intervals $\bar{x} \pm s$, $\bar{x} \pm 2s$, $\bar{x} \pm 3s$ and referring to Chebyshev's theorem and the Empirical Rule.

Percentiles, quartiles, and z-scores measure the relative position of an observation in a data set. The lower and upper quartiles and the distance between them, called the interquartile range, can also help us visualize a data set. Box plots constructed from intervals based on the interquartile range and z-scores provide an easy way to detect possible outliers in the data.

The two numerical measures of the shape of a data set are skewness and kurtosis. The skewness characterizes the degree of asymmetry of a distribution around its mean. The kurtosis characterizes the relative peakedness or flatness of a distribution compared with the bell-shaped distribution.
3.11 Exercises

1. Industrial engineers periodically conduct "work measurement" analyses to determine the time used to produce a single unit of output. At a large processing plant, the total number of man-hours required per day to perform a certain task was recorded for 50 days. This information will be used in a work measurement analysis. The total man-hours required for each of the 50 days are listed below.

128 113 128 124 112 119 109 103 131 111
95 124 135 133 150 97 97 114 88 117
124 138 109 118 122 128 133 100 116 97
142 136 111 98 116 98 120 131 112 92
108 112 113 138 122 120 146 132 100 125

a) Compute the mean, the median, and the mode of the data set.
b) Find the range, the variance and the standard deviation of the data set.
c) Construct the intervals $\bar{x} \pm s$, $\bar{x} \pm 2s$, $\bar{x} \pm 3s$. Count the number of observations that fall within each interval and find the corresponding proportions. Compare the results to the Chebyshev theorem. Do you detect any outliers?
e) Find the 75th percentile for the data on total daily man-hours.

2. The ages of a sample of the people attending a training course on networking in IOIT in Hanoi are:

29 23 31 26 20 24 28 28 23 27 26 31 22 28
25 25 30 31 24 28 32 32 23 27 28 33 22 34

a) Construct a frequency distribution with intervals 15-19, 20-24, 25-29, 30-34, 35-39.
b) Compute the mean and the standard deviation of the raw data set.
c) Compute the approximate values for the mean and the standard deviation using the constructed frequency distribution table. Compare these values with the ones obtained in b).

3. An engineer tested nine samples of each of three designs of a certain bearing for a new electrical winch. The following data are the number of hours it took for each bearing to fail when
the winch motor was run continuously at maximum output, with a load on the winch equivalent to 1.9 times the intended capacity.

DESIGN
A: 16 16 53 21 17 25 30 21 45
B: 18 27 34 34 32 19 34 17 43
C: 21 17 23 32 21 18 21 28 19

a) Calculate the mean and the median for each group.
b) Calculate the standard deviation for each group.
c) Which design is best and why?

4. The projected 30-day storage charges (in US$) for 35 web pages stored on the web server of a university are listed here:

120 143 165 134 185 125 120 210 167 200
145 180 120 189 231 180 175 187 182 240
175 190 179 145 230 167 200 167 178 180
154 145 165 231 154

a) Construct a stem-and-leaf display for the data set.
b) Compute $\bar{x}$, $s^2$ and s.
c) Calculate the intervals $\bar{x} \pm s$, $\bar{x} \pm 2s$, and $\bar{x} \pm 3s$ and count the number of observations that fall within each interval. Compare your results with the Empirical Rule.
Chapter 4 Probability: Basic concepts

CONTENTS
4.1. Experiment, Events and Probability of an Event
4.2. Approaches to probability
4.3. The field of events
4.4. Definitions of probability
4.5. Conditional probability and independence
4.6. Rules for calculating probability
4.7. Summary
4.8. Exercises

4.1 Experiment, Events and Probability of an Event

Definition 4.1
The process of making an observation or recording a measurement under a given set of conditions is a trial or experiment.

Thus, an experiment is realized whenever the set of conditions is realized.

Definition 4.2
Outcomes of an experiment are called events.

We denote events by capital letters A, B, C, ...

Example 4.1 Consider the following experiment. Toss a coin and observe whether the upside of the coin is Head or Tail. Two events may occur:
• H: Head is observed,
• T: Tail is observed.

Example 4.2 Toss a die and observe the number of dots on its upper face. You may observe one, two, three, four, five or six dots on the upper face of the die. You can not predict this number.

Example 4.3 When you draw one card from a standard 52 card bridge deck, some possible outcomes of this experiment, which can not be predicted with certainty in advance, are:
• A: You draw an ace of hearts
• B: You draw an eight of diamonds
• C: You draw a spade
• D: You do not draw a spade.

The probability of an event A, denoted by P(A), is, in general, the chance that A will happen. But how do we measure the chance of occurrence, i.e., how do we determine the probability of an event? The answer to this question will be given in the next sections.

4.2 Approaches to probability

The number of different definitions of probability that have been proposed by various authors is very large. But the majority of definitions can be subdivided into 3 groups:
1. Definitions of probability as a quantitative measure of the "degree of certainty" of the observer of the experiment.
2. Definitions that reduce the concept of probability to the more primitive notion of "equal likelihood" (the so-called "classical definition").
3. Definitions that take as their point of departure the "relative frequency" of occurrence of the event in a large number of trials ("statistical" definition).

According to the first approach to the definition of probability, the theory of probability is something not unlike a branch of psychology, and all conclusions on probabilistic judgements are deprived of the objective meaning that they have independent of the observer. Those probabilities that depend upon the observer are called subjective probabilities.

In the next sections we shall give the classical and statistical definitions of probability.

4.3 The field of events

Before proceeding to the classical definition of the concept of probability we shall introduce some definitions and relations between the events, which may or may not occur when an experiment is realized.

1. If whenever the event A occurs the event B also occurs, then we say that A implies B (or A is contained in B) and write A ⊂ B or B ⊃ A.
2. If A implies B and at the same time B implies A, i.e., if for every realization of the experiment either A and B both occur or both do not occur, then we say that the events A and B are equivalent and write A = B.
3. The event consisting in the simultaneous occurrence of A and B is called the product or intersection of the events A and B, and will be denoted by AB or A ∩ B.
4. The event consisting in the occurrence of at least one of the events A or B is called the sum, or union, of the events A and B, and is denoted by A+B or A ∪ B.
5. The event consisting in the occurrence of A and the nonoccurrence of B is called the difference of the events A and B and is denoted by A-B or A\B.
6. An event is called certain (or sure) if it must inevitably occur whenever the experiment is realized.
7. An event is called impossible if it can never occur.

Clearly, all certain events are equivalent to one another. We shall denote these events by the letter E. All impossible events are likewise equivalent and denoted by 0.
8. Two events A and $\bar{A}$ are complementary if A + $\bar{A}$ = E and A$\bar{A}$ = 0 hold simultaneously. For example, in the experiment of tossing a die the following events are complementary:
• Deven = {even number of dots is observed on upper face}
• Dodd = {odd number of dots is observed on upper face}
9. Two events A and B are called mutually exclusive if when one of the two events occurs in the experiment, the other can not occur, i.e., if their joint occurrence is impossible: AB = 0.
10. If A = B1+B2+...+Bn and the events Bi (i = 1, 2, ..., n) are mutually exclusive in pairs (or pairwise mutually exclusive), i.e., BiBj = 0 for any i ≠ j, then we say that the event A is decomposed into the mutually exclusive events B1, B2, ..., Bn. For example, in the experiment of tossing a single die, the event consisting of the throw of an even number of dots is decomposed into the mutually exclusive events D2, D4 and D6, where Dk = {observing k dots on the upper face of the die}.
11. An event A is called simple (or elementary) if it can not be decomposed into other events. For example, the events Dk that k dots (k = 1, 2, 3, 4, 5, 6) are observed in the experiment of tossing a die are simple events.
12. The sample space of an experiment is the collection of all its simple events.
13. Complete list of events: Suppose that when the experiment is realized there may be a list of events A1, A2, ..., An with the following properties:
I. A1, A2, ..., An are pairwise mutually exclusive events,
II. A1+A2+...+An = E.
Then we say that the list of events A1, A2, ..., An is complete.
Examples of complete lists of events are:
• List of events Head and Tail in tossing a coin
• List of events D1, D2, D3, D4, D5, D6 in the experiment of tossing a die
• List of events Deven and Dodd in the experiment of tossing a die.

All relations between events may be interpreted geometrically by Venn diagrams. In these diagrams the entire sample space is represented by a rectangle and events are represented by parts of the rectangle. If two events are mutually exclusive, their parts of the rectangle will not overlap each other, as shown in Figure 4.1a. If two events are not mutually exclusive, their parts of the rectangle will overlap, as shown in Figure 4.1b.

Figure 4.1a Two mutually exclusive events
Figure 4.1b Two non-mutually exclusive events
Figure 4.2 Events Ā, B̄, A + B and AB

In every problem in the theory of probability one has to deal with an experiment (under some specific set of conditions) and some specific family of events S.

Definition 4.3 A family S of events is called a field of events if it satisfies the following properties:
1. The family S contains the certain event E and the impossible event 0.
2. If the events A and B belong to the family S, then so do the events AB, A + B, Ā and B̄.

We see that the sample space of an experiment, together with all the events generated from the events of this space by the operations "sum", "product" and "complement", constitutes a field of events. Thus, for every experiment we have a field of events.

4.4 Definitions of probability

4.4.1 The classical definition of probability

The classical definition of probability reduces the concept of probability to the concept of equiprobability (equal likelihood) of events, which is regarded as a primitive concept and hence not subject to formal definition. For example, in the tossing of a single perfectly cubical die, made of completely homogeneous material, the equally likely events are the appearance of any of the specific number of dots (from 1 to 6) on its upper face. Thus, for the classical definition of probability we suppose that all possible simple events are equally likely.

Definition 4.4 (The classical definition of probability) The probability P(A) of an event A is equal to the number of possible simple events (outcomes) favorable to A divided by the total number of possible simple events of the experiment, i.e.,
P(A) = m/N,
where m = number of the simple events into which the event A can be decomposed and N = total number of possible simple events of the experiment.

Example 4.4 Consider again the experiment of tossing a balanced coin (see Example 4.1). In this experiment the sample space consists of two simple events: H (Head is observed) and T (Tail is observed). These events are equally likely. Therefore, P(H) = P(T) = 1/2.

Example 4.5 Consider again the experiment of tossing a balanced die (see Example 4.2). In this experiment the sample space consists of 6 simple events: D1, D2, D3, D4, D5, D6, where Dk is the event that k dots (k = 1, 2, 3, 4, 5, 6) are observed on the upper face of the die. These events are equally likely. Therefore, P(Dk) = 1/6 (k = 1, 2, 3, 4, 5, 6). Since Dodd = D1 + D3 + D5 and Deven = D2 + D4 + D6, where Dodd is the event that an odd number of dots is observed and Deven the event that an even number of dots is observed, we have P(Dodd) = 3/6 = 1/2 and P(Deven) = 3/6 = 1/2. If we denote by A the event that fewer than 6 dots are observed, then P(A) = 5/6 because A = D1 + D2 + D3 + D4 + D5.

According to the above definition, every event belonging to the field of events S has a well-defined probability. Therefore, the probability P(A) may be regarded as a function of the event A defined over the field of events S. This function has the following properties, which are easily proved.

The properties of probability:
1. For every event A of the field S, P(A) ≥ 0.
2. For the certain event E, P(E) = 1.
3. If the event A is decomposed into the mutually exclusive events B and C belonging to S, then P(A) = P(B) + P(C). This property is called the theorem on the addition of probabilities.
4. The probability of the event Ā complementary to the event A is given by the formula P(Ā) = 1 − P(A).
5. The probability of the impossible event is zero: P(0) = 0.
6. If the event A implies the event B, then P(A) ≤ P(B).
7. The probability of any event A lies between 0 and 1: 0 ≤ P(A) ≤ 1.
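The classical definition and the properties listed above can be checked mechanically on the die experiment. The sketch below is only an illustration: modelling events as Python sets of outcomes and the helper name `prob` are assumptions made here, not part of the course text.

```python
from fractions import Fraction

# Sample space of the die experiment: outcome k stands for the simple event Dk.
SAMPLE_SPACE = frozenset(range(1, 7))

def prob(event, sample_space=SAMPLE_SPACE):
    """Classical probability P(A) = m / N for equally likely simple events."""
    event = frozenset(event)
    if not event <= sample_space:
        raise ValueError("event must consist of outcomes of the sample space")
    return Fraction(len(event), len(sample_space))

D_even = {2, 4, 6}
D_odd = {1, 3, 5}
A = {1, 2, 3, 4, 5}          # fewer than 6 dots, as in Example 4.5

p_even = prob(D_even)        # 3 favorable outcomes out of 6
p_A = prob(A)                # 5 favorable outcomes out of 6

# Property 4: the complement of A within the sample space has probability 1 - P(A).
p_not_A = prob(SAMPLE_SPACE - frozenset(A))
```

Using `Fraction` keeps the results exact, so identities such as P(Ā) = 1 − P(A) can be compared with `==` rather than with a floating-point tolerance.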
Example 4.6 Consider the experiment of tossing two fair coins. Find the probability of the event A = {observe at least one Head} by using the complement relationship.

Solution The experiment of tossing two fair coins has 4 simple events: HH, HT, TH and TT, where H = {Head is observed}, T = {Tail is observed}. We see that the event A consists of the simple events HH, HT, TH. Then the complementary event for A is Ā = {No Heads observed} = TT. We have P(Ā) = P(TT) = 1/4. Therefore, P(A) = 1 − P(Ā) = 1 − 1/4 = 3/4.

4.4.2 The statistical definition of probability

The classical definition of probability encounters insurmountable difficulties of a fundamental nature in passing from the simplest examples to a consideration of complex problems. First of all, the question arises, in a majority of cases, as to a reasonable way of selecting the "equally likely cases". Thus, for example, it is difficult to determine the probability that tomorrow the weather will be good, or the probability that a baby to be born is a boy, or to answer the question "what are the chances that I will blow one of my stereo speakers if I turn my amplifier up to wide open?"

Lengthy observations of the occurrence or nonoccurrence of an event A in a large number of repeated trials under the same set of conditions show that, for a wide class of phenomena, the number of occurrences or nonoccurrences of the event A is subject to a stable law. Namely, if we denote by m the number of times the event A occurs in N independent trials, then it turns out that for sufficiently large N the ratio m/N, in most of such series of observations, assumes an almost constant value. Since this constant is an objective numerical characteristic of the phenomena, it is natural to call it the statistical probability of the random event A under investigation.

Definition 4.5 (The statistical definition of probability) The probability of an event A can be approximated by the proportion of times that A occurs when the experiment is repeated a very large number of times.

Fortunately, for the events to which the classical definition of probability is applicable, the statistical probability is equal to the probability in the sense of the classical definition.

4.4.3 Axiomatic construction of the theory of probability (optional)

The classical and statistical definitions of probability reveal some restrictions and shortcomings when dealing with complex natural phenomena; in particular, they may lead to paradoxical conclusions, for example, the well-known Bertrand's paradox. Therefore, in order to find wide applications of the theory of probability, mathematicians have constructed a rigorous foundation of this theory. Their first step is the axiomatic definition of probability, which includes as special cases both the classical and statistical definitions of probability and overcomes the shortcomings of each. Below we formulate the axioms that define probability.
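The stabilisation of the ratio m/N described in Section 4.4.2 can be observed in a small simulation. This is a sketch under stated assumptions: the function name `relative_frequency`, the choice of event (an even number of dots on a fair die) and the use of a seeded pseudo-random generator are illustrative choices, not something prescribed by the course.

```python
import random

def relative_frequency(n_trials, seed=0):
    """Return m/N, where m counts how often an even face appears in N die tosses."""
    rng = random.Random(seed)   # seeded so that repeated runs give the same series
    m = sum(1 for _ in range(n_trials) if rng.randint(1, 6) % 2 == 0)
    return m / n_trials

# The ratio m/N for increasingly long series of trials.
for n in (10, 100, 1_000, 10_000, 100_000):
    print(n, relative_frequency(n))
```

For short series the ratio can differ noticeably from 1/2, the classical value of P(Deven); for long series it typically settles close to it, which is the behaviour the statistical definition relies on.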
Axioms for probability
1. With each random event A in a field of events S, there is associated a nonnegative number P(A), called its probability.
2. The probability of the certain event E is 1, i.e., P(E) = 1.
3. (Addition axiom) If the events A1, A2, ..., An are pairwise mutually exclusive events, then
P(A1 + A2 + ... + An) = P(A1) + P(A2) + ... + P(An).
4. (Extended axiom of addition) If the event A is equivalent to the occurrence of at least one of the pairwise mutually exclusive events A1, A2, ..., An, ..., then
P(A) = P(A1) + P(A2) + ... + P(An) + ...

Obviously, the classical and statistical definitions of probability, which deal with finite sums of events, satisfy the axioms formulated above. The necessity for introducing the extended axiom of addition is motivated by the fact that in probability theory we constantly have to consider events that decompose into an infinite number of subevents.

4.5 Conditional probability and independence

We have said that a certain set of conditions Ç underlies the definition of the probability of an event. If no restrictions other than the conditions Ç are imposed when calculating the probability P(A), then this probability is called unconditional. However, in many cases, one has to determine the probability of an event under the condition that another event B, whose probability is greater than 0, has already occurred.

Definition 4.6 The probability of an event A, given that an event B has occurred, is called the conditional probability of A given B and denoted by the symbol P(A|B).

Example 4.7 Consider the experiment of tossing a fair die. Denote by A and B the following events:
A = {Observing an even number of dots on the upper face of the die},
B = {Observing a number of dots less than or equal to 3 on the upper face of the die}.
Find the probability of the event A, given the event B.

Solution We know that the sample space of the experiment of tossing a fair die consists of 6 simple events: D1, D2, D3, D4, D5, D6, where Dk is the event that k dots (k = 1, 2, 3, 4, 5, 6) are observed on the upper face of the die. These events are equally likely, and P(Dk) = 1/6 (k = 1, 2, 3, 4, 5, 6). Since A = D2 + D4 + D6 and B = D1 + D2 + D3, we have
P(A) = P(D2) + P(D4) + P(D6) = 3*1/6 = 1/2,
P(B) = P(D1) + P(D2) + P(D3) = 3*1/6 = 1/2.
If the event B has occurred, then it reduces the sample space of the experiment from 6 simple events to 3 simple events (namely those contained in the event B: D1, D2, D3). Since the only even number among the three numbers 1, 2, 3 is 2, there is only one simple event D2 of the reduced sample space that is contained in the event A. Therefore, we conclude that the probability that A occurs given that B has occurred is one in three, or 1/3, i.e., P(A|B) = 1/3.

For the above example it is easy to verify that P(A|B) = P(AB)/P(B). In the general case, we use this formula to define the conditional probability.

Formula for conditional probability
If the probability of an event B is greater than 0, then the conditional probability of an event A, given that the event B has occurred, is calculated by the formula
$P(A|B) = \frac{P(AB)}{P(B)}$    (1)
where AB is the event that both A and B occur.
In the same way, if P(A) > 0, the conditional probability of an event B, given that the event A has occurred, is defined by the formula
$P(B|A) = \frac{P(AB)}{P(A)}$    (1')

Each of formulas (1) and (1') is equivalent to the so-called Multiplication Theorem.

Multiplication Theorem
The probability of the product of two events is equal to the product of the probability of one of the events by the conditional probability of the other event, given that the first event has occurred, namely
P(AB) = P(A) P(B|A) = P(B) P(A|B).    (2)

The Multiplication Theorem is also applicable if one of the events A and B is impossible since, in this case, one of the equalities P(A|B) = 0 and P(AB) = 0 holds along with P(A) = 0.
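The conditional probability formula and the Multiplication Theorem can be verified on the die events of Example 4.7. The following is a minimal sketch; the set-based representation of events and the helper names `prob` and `cond_prob` are assumptions made for illustration.

```python
from fractions import Fraction

E = frozenset(range(1, 7))   # sample space of a fair die

def prob(event):
    """Classical probability of an event given as a set of die outcomes."""
    return Fraction(len(frozenset(event) & E), len(E))

def cond_prob(a, b):
    """P(A|B) = P(AB) / P(B); defined only when P(B) > 0."""
    a, b = frozenset(a), frozenset(b)
    p_b = prob(b)
    if p_b == 0:
        raise ValueError("P(B) must be greater than 0")
    return prob(a & b) / p_b   # a & b is the product event AB

A = frozenset({2, 4, 6})     # even number of dots
B = frozenset({1, 2, 3})     # at most 3 dots

p_a_given_b = cond_prob(A, B)
# Multiplication Theorem: P(AB) = P(A) P(B|A) = P(B) P(A|B)
lhs = prob(A & B)
rhs_1 = prob(A) * cond_prob(B, A)
rhs_2 = prob(B) * cond_prob(A, B)
```

Here `p_a_given_b` reproduces the value 1/3 obtained in the text by reducing the sample space to D1, D2, D3.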
Definition 4.7 We say that an event A is independent of an event B if P(A|B) = P(A), i.e., the occurrence of the event B does not affect the probability of the event A.

If the event A is independent of the event B, then it follows from (2) that P(A) P(B|A) = P(B) P(A). From this we find P(B|A) = P(B) if P(A) > 0, i.e., the event B is also independent of A. Thus, independence is a symmetrical relation.

In practical problems, we rarely resort to verifying that the relations P(A|B) = P(A) or P(B|A) = P(B) are satisfied in order to determine whether or not the given events are independent. To determine independence, we usually make use of intuitive arguments based on experience.

Example 4.8 Consider the experiment of tossing a fair die and define the following events:
A = {Observe an even number of dots},
B = {Observe a number of dots less than or equal to 4}.
Are events A and B independent?

Solution As in Example 4.7 we have P(A) = 1/2 and P(B) = P(D1) + P(D2) + P(D3) + P(D4) = 4*1/6 = 2/3, where Dk is the event that k dots (k = 1, 2, 3, 4, 5, 6) are observed on the upper face of the die. Since AB = D2 + D4, we have P(AB) = P(D2) + P(D4) = 1/6 + 1/6 = 1/3. Now assuming B has occurred, the probability of A given B is
P(A|B) = P(AB)/P(B) = (1/3)/(2/3) = 1/2 = P(A).
Thus, assuming B has occurred does not alter the probability of A. Therefore, the events A and B are independent.

The concept of independence of events plays an important role in the theory of probability and its applications. In particular, the greater part of the results presented in this course is obtained on the assumption that the various events considered are independent.

The Multiplication Theorem in the case of independent events takes on a simple form.
Multiplication Theorem for independent events
If the events A and B are independent then P(AB) = P(A) P(B).

We next generalize the notion of the independence of two events to that of a collection of events.

Definition 4.8 The events B1, B2, ..., Bn are called collectively independent or mutually independent if for any event Bp (p = 1, 2, ..., n) and for any group of other events Bq, Br, ..., Bs of this collection, the event Bp and the event BqBr...Bs are independent.

Note that for several events to be mutually independent, it is not sufficient that they be pairwise independent.

4.6 Rules for calculating probability

4.6.1 The addition rule

From the classical definition of probability we deduced the addition theorem, which serves as the addition axiom for the axiomatic definition of probability. Using this axiom we get the following rule:

Addition rule
If the events A1, A2, ..., An are pairwise mutually exclusive events, then
P(A1 + A2 + ... + An) = P(A1) + P(A2) + ... + P(An).
In the case of two non-mutually exclusive events A and B we have the formula
P(A + B) = P(A) + P(B) - P(AB).

Example 4.9 In a box there are 10 red balls, 20 blue balls, 10 yellow balls and 10 white balls. Draw one ball at random from the box. Find the probability that this ball is colored (that is, not white).

Solution Call the event that the ball drawn is red R, blue B, yellow Y, white W and colored C. Then P(R) = 10/(10+20+10+10) = 10/50 = 1/5, P(B) = 20/50 = 2/5, P(Y) = 10/50 = 1/5. Since C = R + B + Y and the events R, B and Y are mutually exclusive, we have P(C) = P(R + B + Y) = 1/5 + 2/5 + 1/5 = 4/5.
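Two of the results above lend themselves to a quick numerical check: the product test for independence, P(AB) = P(A) P(B), and the addition rule for events that are not mutually exclusive. The sketch below uses the fair-die events from Examples 4.7 and 4.8; the helper names `prob`, `independent` and `prob_sum` are hypothetical and introduced only for this illustration.

```python
from fractions import Fraction

E = frozenset(range(1, 7))   # sample space of a fair die

def prob(event):
    """Classical probability of an event given as a set of die outcomes."""
    return Fraction(len(frozenset(event) & E), len(E))

def independent(a, b):
    """Product test for independence: P(AB) == P(A) P(B)."""
    a, b = frozenset(a), frozenset(b)
    return prob(a & b) == prob(a) * prob(b)

def prob_sum(a, b):
    """Addition rule for two arbitrary events: P(A+B) = P(A) + P(B) - P(AB)."""
    a, b = frozenset(a), frozenset(b)
    return prob(a) + prob(b) - prob(a & b)

even = {2, 4, 6}
at_most_4 = {1, 2, 3, 4}     # event B of Example 4.8
at_most_3 = {1, 2, 3}        # event B of Example 4.7

check_48 = independent(even, at_most_4)   # events of Example 4.8
check_47 = independent(even, at_most_3)   # events of Example 4.7
p_union = prob_sum(even, at_most_3)       # P(even or at most 3 dots)
```

The events of Example 4.8 pass the product test, while those of Example 4.7 do not, since there P(AB) = 1/6 but P(A) P(B) = 1/4. The value of `prob_sum` agrees with counting the outcomes of the sum event directly.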
In the preceding section we also got the multiplicative theorem. Below, for the purpose of computing probability, we recall it.

Multiplicative rule
For any two events A and B from the same field of events there holds the formula
P(AB) = P(A) P(B|A) = P(B) P(A|B).
If these events are independent then P(AB) = P(A) P(B).

Now suppose that the event B may occur together with one and only one of n mutually exclusive events A1, A2, ..., An, that is
B = A1B + A2B + ... + AnB.
By the Addition rule we have
P(B) = P(A1B) + P(A2B) + ... + P(AnB).
Further, by the Multiplicative rule we get a formula, called the formula of total probability.

Formula of total probability
If the event B may occur together with one and only one of n mutually exclusive events A1, A2, ..., An, then
P(B) = P(A1)P(B|A1) + P(A2)P(B|A2) + ... + P(An)P(B|An).

Example 4.10 There are 5 boxes of lamps: 3 boxes with the content A1: 9 good lamps and 1 defective lamp, and 2 boxes with the content A2: 4 good lamps and 2 defective lamps. Select one box at random and from this box draw one lamp. Find the probability that the drawn lamp is defective.

Solution Denote by B the event that the drawn lamp is defective and by the same symbols A1, A2 the events that a box with content A1, A2, respectively, is selected. Since the defective lamp may be drawn from a box of either content A1 or content A2, we have B = A1B + A2B. By the formula of total probability,
P(B) = P(A1)P(B|A1) + P(A2)P(B|A2).
Since P(A1) = 3/5, P(A2) = 2/5, P(B|A1) = 1/10, P(B|A2) = 2/6 = 1/3, we have
P(B) = 3/5 * 1/10 + 2/5 * 1/3 = 29/150 ≈ 0.19.
Thus, the probability that the drawn lamp is defective is about 0.19.

Now, under the same assumptions and notations as in the formula of total probability, find the probability of the event Ak, given that the event B has occurred. According to the Multiplicative rule,
P(AkB) = P(B)P(Ak|B) = P(Ak)P(B|Ak).
Hence,
$P(A_k|B) = \frac{P(A_k)P(B|A_k)}{P(B)}$.
Using the formula of total probability, we then find the following

Bayes's Formula
If the event B may occur together with one and only one of n mutually exclusive events A1, A2, ..., An, then
$P(A_k|B) = \frac{P(A_k)P(B|A_k)}{P(B)} = \frac{P(A_k)P(B|A_k)}{\sum_{j=1}^{n} P(A_j)P(B|A_j)}$

The formula of Bayes is sometimes called the formula for probabilities of hypotheses.

Example 4.11 As in Example 4.10, there are 5 boxes of lamps: 3 boxes with the content A1: 9 good lamps and 1 defective lamp, and 2 boxes with the content A2: 4 good lamps and 2 defective lamps. From one of the boxes, chosen at random, a lamp is withdrawn. It turns out to be defective (event B). What is the probability, after the experiment has been performed (the a posteriori probability), that the lamp was taken from a box of content A1?

Solution We have calculated P(A1) = 3/5, P(A2) = 2/5, P(B|A1) = 1/10, P(B|A2) = 2/6 = 1/3, P(B) = 29/150. Hence, the formula of Bayes gives
$P(A_1|B) = \frac{P(A_1)P(B|A_1)}{P(B)} = \frac{3/5 \cdot 1/10}{29/150} = \frac{9}{29} \approx 0.31$.
Thus, the probability that the lamp was taken from a box of content A1, given that the experiment has been performed, is approximately 0.31.
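The calculations of Examples 4.10 and 4.11 can be reproduced with exact fractions. This is a sketch under the assumption that the hypotheses A1, ..., An are passed as two parallel lists; the function names `total_probability` and `bayes` are illustrative.

```python
from fractions import Fraction

def total_probability(priors, likelihoods):
    """P(B) = sum over j of P(Aj) P(B|Aj) for mutually exclusive hypotheses Aj."""
    return sum(p * l for p, l in zip(priors, likelihoods))

def bayes(priors, likelihoods, k):
    """Posterior P(Ak|B) = P(Ak) P(B|Ak) / P(B); k is a 0-based index."""
    return priors[k] * likelihoods[k] / total_probability(priors, likelihoods)

# Lamp boxes: 3 of 5 boxes have content A1, 2 of 5 have content A2.
priors = [Fraction(3, 5), Fraction(2, 5)]          # P(A1), P(A2)
likelihoods = [Fraction(1, 10), Fraction(2, 6)]    # P(B|A1), P(B|A2)

p_defective = total_probability(priors, likelihoods)   # 29/150
p_a1_given_b = bayes(priors, likelihoods, 0)           # 9/29
p_a2_given_b = bayes(priors, likelihoods, 1)           # 20/29
```

The two posterior probabilities add up to 1, as they must, because a defective lamp comes from a box of content A1 or A2 and from no other.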
4.7 Summary

In this chapter we introduced the notion of an experiment whose outcomes, called events, cannot be predicted with certainty in advance. The uncertainty associated with these events was measured by their probabilities. But what is probability? To answer this question we briefly discussed approaches to probability and gave the classical and statistical definitions of probability. The classical definition of probability reduces the concept of probability to the concept of equiprobability of simple events. According to the classical definition, the probability of an event A is equal to the number of possible simple events favorable to A divided by the total number of possible simple events of the experiment. By the statistical definition, in turn, the probability of an event A is approximated by the proportion of times that A occurs when the experiment is repeated a very large number of times.

4.8 Exercises

A, B, C are random events.

1) Explain the meaning of the relations:
a) ABC = A;
b) A + B + C = A.

2) Simplify the expressions
a) (A + B)(B + C);
b) (A + B)(A + B̄);
c) (A + B)(A + B̄)(Ā + B).

3) A four-volume work is placed on a shelf in random order. What is the probability that the books are in proper order from right to left or left to right?

4) In a lot consisting of N items, M are defective. n items are selected at random from the lot (n < N). Find the probability that m (m ≤ N) of them will prove to be defective.

5) A quality control inspector examines the articles in a lot consisting of m items of first grade and n items of second grade. A check of the first b articles chosen at random from the lot has shown that all of them are of second grade (b < m). Find the probability that of the next two items selected at random from those remaining at least one proves to be second grade.

6) From a box containing m white balls and n black balls (m > n), one ball after another is drawn at random. What is the probability that at some point the number of white balls and black balls drawn will be the same?

7) Two newly designed data base management systems (DBMS), A and B, are being considered for marketing by a large computer software vendor. To determine whether DBMS users have a preference for one of the two systems, four of the vendor's customers are randomly selected and given the opportunity to evaluate the performances of each of the two systems. After sufficient testing, each user is asked to state which DBMS gave the better performance (measured in terms of CPU utilization, execution time, and disk access).
a) Count the possible outcomes for this marketing experiment.
b) If DBMS users actually have no preference for one system over the other (i.e., performances of the two systems are identical), what is the probability that all four sampled users prefer system A?
c) If all four customers express their preference for system A, can the software vendor infer that DBMS users in general have a preference for one of the two systems?
Chapter 5 Basic Probability distributions

CONTENTS
5.1. Random variables
5.2. The probability distribution for a discrete random variable
5.3. Numerical characteristics of a discrete random variable
5.4. The binomial probability distribution
5.5. The Poisson distribution
5.6. Continuous random variables: distribution function and density function
5.7. Numerical characteristics of a continuous random variable
5.8. The normal distribution
5.9. Summary
5.10. Exercises

5.1 Random variables

One of the fundamental concepts of probability theory is that of a random variable.

Definition 5.1 A random variable is a variable that assumes numerical values associated with events of an experiment.

Example 5.1 Observe 100 babies to be born in a clinic. The number of boys born is a random variable. It may take values from 0 to 100.

Example 5.2 The number of patients of a clinic daily is a random variable.

Example 5.3 Select one student from a university, measure his/her height and record this height by x. Then x is a random variable, assuming values from, say, 100 cm to 250 cm, depending upon each specific student.

Example 5.4 The weight of babies at birth also is a random variable. It can assume values in the interval, for example, from 800 grams to 6000 grams.

Classification of random variables: Random variables may be divided into two types: discrete random variables and continuous random variables.
Definition 5.2 A discrete random variable is one that can assume only a countable number of values. A continuous random variable can assume any value in one or more intervals on a line.

Among the random variables described above, the number of boys in Example 5.1 and the number of patients in Example 5.2 are discrete random variables; the height of students and the weight of babies are continuous random variables.

Example 5.5 Suppose you randomly select a student attending your university. Classify each of the following random variables as discrete or continuous:
a) Number of credit hours taken by the student this semester
b) Current grade point average of the student.

Solution
a) The number of credit hours taken by the student this semester is a discrete random variable because it can assume only a countable number of values (for example 10, 11, 12, 15, and so on). It is not continuous since the number of credit hours cannot assume values such as 11.5678, 12.3456 and 12.9876 hours.
b) The grade point average for the student is a continuous random variable because it could theoretically assume any value (for example, 5.455, 8.986) corresponding to the points on the interval from 0 to 10 of a line.

5.2 The probability distribution for a discrete random variable

Definition 5.3 The probability distribution for a discrete random variable x is a table, graph, or formula that gives the probability of observing each value of x. We shall denote the probability of x by the symbol p(x).

Thus, the probability distribution for a discrete random variable x may be given in one of the following ways:
1. the table

x    x1    x2    ...    xn
p    p1    p2    ...    pn

where pk is the probability that the variable x assumes the value xk (k = 1, 2, ..., n);
2. a formula for calculating p(xk) (k = 1, 2, ..., n);
3. a graph presenting the probability of each value xk.

Example 5.6 A balanced coin is tossed twice and the number x of heads is observed. Find the probability distribution for x.

Solution Let Hk and Tk denote the observation of a head and a tail, respectively, on the kth toss, for k = 1, 2. The four simple events and the associated values of x are shown in Table 5.1.

Table 5.1 Simple events of the experiment of tossing a coin twice

SIMPLE EVENT    DESCRIPTION    PROBABILITY    NUMBER OF HEADS
E1              H1H2           0.25           2
E2              H1T2           0.25           1
E3              T1H2           0.25           1
E4              T1T2           0.25           0

The event x = 0 is the collection of all simple events that yield a value of x = 0, namely, the simple event E4. Therefore, the probability that x assumes the value 0 is
P(x = 0) = p(0) = P(E4) = 0.25.
The event x = 1 contains two simple events, E2 and E3. Therefore,
P(x = 1) = p(1) = P(E2) + P(E3) = 0.25 + 0.25 = 0.5.
Finally,
P(x = 2) = p(2) = P(E1) = 0.25.
The probability distribution p(x) is displayed in tabular form in Table 5.2 and as a probability histogram in Figure 5.1.

Table 5.2 Probability distribution for x, the number of heads in two tosses of a coin

x       0       1      2
p(x)    0.25    0.5    0.25
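The distribution of the number of heads can also be obtained by listing the simple events programmatically, exactly as Table 5.1 does by hand. The sketch below assumes a balanced coin, so that all sequences of tosses are equally likely; the function name `heads_distribution` is chosen for illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def heads_distribution(n_tosses=2):
    """Probability distribution p(x) of the number x of heads in n tosses."""
    # Every sequence of H/T of length n_tosses is one equally likely simple event.
    outcomes = list(product("HT", repeat=n_tosses))
    counts = Counter(outcome.count("H") for outcome in outcomes)
    return {x: Fraction(c, len(outcomes)) for x, c in sorted(counts.items())}

p = heads_distribution(2)
for x, px in p.items():
    print(x, px)
```

For two tosses the four simple events are grouped by their number of heads, which gives the same probabilities as in the example above; calling the function with a larger `n_tosses` extends the enumeration without further changes.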
Figure 5.1 Probability distribution for x, the number of heads in two tosses of a coin

Properties of the probability distribution for a discrete random variable x
1. 0 ≤ p(x) ≤ 1
2. $\sum_{\text{all } x} p(x) = 1$

Relationship between the probability distribution for a discrete random variable and the relative frequency distribution of data: Suppose you were to toss two coins over and over again a very large number of times and record the number x of heads for each toss. A relative frequency distribution for the resulting collection of 0's, 1's and 2's would be very similar to the probability distribution shown in Figure 5.1. In fact, if it were possible to repeat the experiment an infinitely large number of times, the two distributions would be almost identical. Thus, the probability distribution of Figure 5.1 provides a model for a conceptual population of values x: the values of x that would be observed if the experiment were to be repeated an infinitely large number of times.

5.3 Numerical characteristics of a discrete random variable

5.3.1 Mean or expected value

Since a probability distribution for a random variable x is a model for a population relative frequency distribution, we can describe it with numerical descriptive measures, such as its mean and standard deviation, and we can use the Chebyshev theorem to identify improbable values of x. The expected value (or mean) of a random variable x, denoted by the symbol E(x), is defined as follows:
Definition 5.4 Let x be a discrete random variable with probability distribution p(x). Then the mean or expected value of x is
$\mu = E(x) = \sum_{\text{all } x} x\,p(x)$

Example 5.6 Refer to the two-coin tossing experiment of Section 5.2 and the probability distribution for the random variable x, shown in Figure 5.1. Demonstrate that the formula for E(x) gives the mean of the probability distribution for the discrete random variable x.

Solution If we were to repeat the two-coin tossing experiment a large number of times, say 400,000 times, we would expect to observe x = 0 heads approximately 100,000 times, x = 1 head approximately 200,000 times and x = 2 heads approximately 100,000 times. Calculating the mean of these 400,000 values of x, we obtain
$\mu \approx \frac{\sum x}{n} = \frac{100{,}000(0) + 200{,}000(1) + 100{,}000(2)}{400{,}000} = \frac{1}{4}(0) + \frac{1}{2}(1) + \frac{1}{4}(2) = \sum_{\text{all } x} p(x)\,x$
Thus, the mean of x is µ = 1.

If x is a random variable then any function g(x) of x also is a random variable. The expected value of g(x) is defined as follows:

Definition 5.5 Let x be a discrete random variable with probability distribution p(x) and let g(x) be a function of x. Then the mean or expected value of g(x) is
$E[g(x)] = \sum_{\text{all } x} g(x)\,p(x)$

5.3.2 Variance and standard deviation

The second important numerical characteristics of a random variable are its variance and standard deviation, which are defined as follows:
Definition 5.6 Let x be a discrete random variable with probability distribution p(x). Then the variance of x is
$\sigma^2 = E[(x - \mu)^2]$
The standard deviation of x is the positive square root of the variance of x:
$\sigma = \sqrt{\sigma^2}$

Example 5.7 Refer to the two-coin tossing experiment and the probability distribution for x, shown in Figure 5.1. Find the variance and standard deviation of x.

Solution In Example 5.6 we found that the mean of x is 1. Then
$\sigma^2 = E[(x - \mu)^2] = \sum_{x=0}^{2} (x - \mu)^2 p(x) = (0 - 1)^2 \cdot \frac{1}{4} + (1 - 1)^2 \cdot \frac{1}{2} + (2 - 1)^2 \cdot \frac{1}{4} = \frac{1}{2}$
and
$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{1}{2}} \approx 0.707$

5.4 The binomial probability distribution

Many real-life experiments are analogous to tossing an unbalanced coin a number n of times.

Example 5.8 Suppose that 80% of the jobs submitted to a data-processing center are of a statistical nature. Then selecting a random sample of 10 submitted jobs would be analogous to tossing an unbalanced coin 10 times, with the probability of observing a head (drawing a statistical job) on a single trial equal to 0.80.

Example 5.9 Tests for impurities commonly found in drinking water from private wells showed that 30% of all wells in a particular country have impurity A. If 20 wells are selected at random, then it would be analogous to tossing an unbalanced coin 20 times, with the probability of observing a head (selecting a well with impurity A) on a single trial equal to 0.30.

Example 5.10 Public opinion or consumer preference polls that elicit one of two responses (Yes or No, Approve or Disapprove, ...) are also analogous to the unbalanced coin tossing experiment if the size N of the population is large and the size n of the sample is relatively small.

All these experiments are particular examples of a binomial experiment known as a Bernoulli process, after the seventeenth-century Swiss mathematician, Jacob Bernoulli. Such experiments and the resulting binomial random variables have the following characteristics, which form the model of a binomial random variable.
11 (see also Example 5. where p = probability of a success on a single trial. The binomial random variable x is the number of S’ in n trials. and the probability of F will be denoted by q ( q = 1p). 3. 2. n). q=1p n = number of trials.Model (or characteristics) of a binomial random variable 1. 1. The probability of S remains the same from trial to trial. x!(nx)! 2. The trials are independent. the probability of lxii .9) Test for impurities commonly found in drinking water from private wells showed that 30% of all wells in a particular country have impurity A.. We will denote one outcome by S (for Success) and the other by F (for Failure). Since the total number of wells in the country is large. 4. The mean: 3. Each trial results in an S (the well contains impurity A) or an F (the well does not contain impurity A). If a random sample of 5 wells is selected from the large number of wells in the country. one corresponding to each random selected well. what is the probability that: a) Exactly 3 will have impurity A? b) At least 3? c) Fewer than 3? Solution First we confirm that this experiment possesses the characteristics of a binomial experiment. The binomial probability distribution. There are only 2 possible outcomes on each trial. x= number of successes in n trials C nx = n! = combination of x from n. its mean and its standard deviation are given the following formulas: The probability distribution.. This probability will be denoted by p. This experiment consists of n = 5 trials. .. µ = np σ 2 = npq The variance: Example 5. 5. The experiment consists of n identical trials 2. The probability distribution: p(x) = C nx p x q n − x (x = 0. mean and variance for a binomial random variable: 1.
the mean and the variance of a Poisson random variable are shown in the next box. lxiii . In result.30 )5−3 = 0.16380 = 0.1323+0. 4.5 The Poisson distribution The Poisson probability distribution is named for the French mathematician S. since the sampling is random.83692.drawing a single well and finding that it contains impurity A is equal to 0. The probability that an event occurs in a given unit of time is the same for all units.30. the demand of patients for service at a health institution.30 and x = 3. 3. Poisson (18711840. we can avoid calculating 3 probabilities by using the complementary relationship P(x<3) = 1P(x ≥ 3) = 10. Further. The mean number of events in each unit will be denoted by the Greek letter λ The formulas for the probability distribution.30 and this probability will remain the same for each of the 5 selected wells. The number of events that occur in one unit of time is independent of the number that occur in other units.02835.1323 and we leave to the reader to verify that p(4) = 0. 3!2! b) The probability of observing at least 3 wells containing impurity A is P(x ≥ 3) = p(3)+p(4)+p(5).D.00243. we assume that the outcome on any one well is unaffected by the outcome of any other and that the trials are independent. It is used to describe a number of processes. P(3) = 0. We have by this formula p( 3 ) = 5! ( 0. the sampling process represents a binomial experiment with n = 5 and p = 0. 5.16380. We have calculated p(3) = 0. p = 0.30 )3( 1 − 0. p(5) = 0. a) The probability of drawing exactly x = 3 wells containing impurity A is p(x) = C nx p x q n − x with n = 5. Finally.00243 = 0. Therefore. and the number of accidents at an intersection. The experiment consists of counting the number x of times a particular event occurs during a given unit of time 2. the arrivals of trucks and cars at a tollbooth. including the distribution of telephone calls going through a switchboard system. 
we are interested in the number x of wells in the sample of n = 5 that contain impurity A. Characteristics defining a Poisson random variable 1.1323 .02835+0. c) Although P(x<3) = p(0)+p(1)+p(2).
The mean: µ =λ 3.3 and Figure 5. 2. The variance: σ2 =λ Note that instead of time.00674.).(the base of natural logarithm). p(2) = 0. Solution Since the number of accidents is distributed according to a Poisson distribution and the mean number of accidents per month is 5. 3 or 4 accidents. p(4) = 0. e = 2.2. 2. we have the probability of happening accidents in any month p(x) = 5 x e −5 .3370. volume. the Poisson random variable may be considered in the experiment of counting the number x of times a particular event occurs during a given unit of area. 1.71828. The probability distribution: p(x) = where λ x e −λ x! ( x = 0..12 Suppose that we are investigating the safety of a dangerous intersection.08425. etc.17552.The probability distribution. By this formula we can calculate x! p(0) = 0. The probability distribution of the number of accidents per month is presented in Table 5. mean and variance for a Poisson random variable x: 1.14042. Past police records indicate a mean of 5 accidents per month at this intersection.. lxiv . 1. 2. p(3) = 0. Example 5. Suppose the number of accidents is distributed according to a Poisson distribution. λ = mean number of events during the given time period.... Calculate the probability in any month of exactly 0. p(1) = 0.
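The arithmetic in Examples 5.11 and 5.12 can be checked with a few lines of Python. This is only a sketch using the standard library; the function names binomial_pmf and poisson_pmf are illustrative and do not come from the course material.

```python
from math import comb, exp, factorial


def binomial_pmf(x, n, p):
    """p(x) = C(n, x) * p^x * q^(n - x) with q = 1 - p."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def poisson_pmf(x, lam):
    """p(x) = lam^x * e^(-lam) / x!"""
    return lam**x * exp(-lam) / factorial(x)


# Example 5.11: n = 5 wells, p = 0.30
p3 = binomial_pmf(3, 5, 0.30)                                     # exactly 3
at_least_3 = sum(binomial_pmf(k, 5, 0.30) for k in range(3, 6))   # 3, 4 or 5
fewer_than_3 = 1 - at_least_3                                     # complement

# Example 5.12: mean of 5 accidents per month, x = 0, 1, 2, 3, 4
accidents = [poisson_pmf(k, 5) for k in range(5)]

print(f"p(3) = {p3:.4f}, P(x >= 3) = {at_least_3:.5f}, P(x < 3) = {fewer_than_3:.5f}")
for k, prob in enumerate(accidents):
    print(f"p({k}) = {prob:.6f}")
```

Using the complement for part c) mirrors the text: one subtraction replaces three separate probability calculations.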
Table 5.3 Poisson probability distribution of the number of accidents per month
X (number of accidents)    P(X) (probability)
 0                         0.006738
 1                         0.033690
 2                         0.084224
 3                         0.140374
 4                         0.175467
 5                         0.175467
 6                         0.146223
 7                         0.104445
 8                         0.065278
 9                         0.036266
10                         0.018133
11                         0.008242
12                         0.003434
Figure 5.2 The Poisson probability distribution of the number of accidents
5.6 Continuous random variables: distribution function and density function
Many random variables observed in real life are not discrete random variables because the number of values they can assume is not countable. In contrast to discrete random variables, these variables can take on any value within an interval. Examples are the daily rainfall at some location, the strength of a steel bar and the intensity of sunlight at a particular time of day. In Section 5.1 these random variables were called continuous random variables.
The distinction between discrete random variables and continuous random variables is usually based on the difference in their cumulative distribution functions.
Definition 5.7 Let ξ be a continuous random variable assuming any value in the interval (−∞, +∞). Then the cumulative distribution function F(x) of the variable ξ is defined as follows: F(x) = P(ξ ≤ x), i.e., F(x) is equal to the probability that the variable ξ assumes values which are less than or equal to x.
Note that here and from now on we denote by the letter ξ a continuous random variable and denote by x a point on the number line.
From the definition of the cumulative distribution function F(x) it is easy to show the following properties.
Properties of the cumulative distribution function F(x) for a continuous random variable ξ
1. 0 ≤ F(x) ≤ 1.
2. F(x) is a monotonically nondecreasing function, that is, if a ≤ b then F(a) ≤ F(b) for any real numbers a and b.
3. P(a ≤ ξ ≤ b) = F(b) − F(a).
4. F(x) → 0 as x → −∞ and F(x) → 1 as x → +∞.
In Chapter 2 we described a large data set by means of a relative frequency distribution. If the data represent measurements on a continuous random variable and if the amount of data is very large, we can reduce the width of the class intervals until the distribution appears to be a smooth curve. A probability density is a theoretical model for this distribution.
Definition 5.8 If F(x) is the cumulative distribution function for a continuous random variable ξ then the probability density function f(x) for ξ is f(x) = F'(x), i.e., f(x) is the derivative of the distribution function F(x).
The density function for a continuous random variable ξ, the model for some real-life population of data, will usually be a smooth curve as shown in Figure 5.3.
Figure 5.3 Density function f(x) for a continuous random variable
It follows from Definition 5.8 that
$F(x) = \int_{-\infty}^{x} f(t)\,dt$
Thus, the cumulative area under the curve between −∞ and a point x0 is equal to F(x0).
The density function for a continuous random variable must always satisfy the two properties given in the box.
Properties of a density function
1. $f(x) \ge 0$
2. $\int_{-\infty}^{+\infty} f(x)\,dx = F(\infty) = 1$
5.7 Numerical characteristics of a continuous random variable
Definition 5.8 Let ξ be a continuous random variable with density function f(x). Then the mean or the expected value of ξ is
$E(\xi) = \int_{-\infty}^{+\infty} x f(x)\,dx$
Definition 5.9 Let ξ be a continuous random variable with density function f(x) and let g(x) be a function of x. Then the mean or the expected value of g(ξ) is
$E[g(\xi)] = \int_{-\infty}^{+\infty} g(x) f(x)\,dx$
Definition 5.10 Let ξ be a continuous random variable with the expected value E(ξ) = µ. Then the variance of ξ is
$\sigma^2 = E[(\xi - \mu)^2]$
The standard deviation of ξ is the positive square root of the variance, $\sigma = \sqrt{\sigma^2}$.
5.8 Normal probability distribution
The normal (or Gaussian) density function was proposed by C.F. Gauss (1777-1855) as a model for the relative frequency distribution of errors, such as errors of measurement. Amazingly, this bell-shaped curve provides an adequate model for the relative frequency distributions of data collected from many different scientific areas.
The density function, mean and variance for a normal random variable
The density function: $f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2/(2\sigma^2)}$
The parameters µ and σ² are the mean and the variance, respectively, of the normal random variable.
There is an infinite number of normal density functions, one for each combination of µ and σ. The mean measures the location of the distribution and the variance measures its spread. Several different normal density functions are shown in Figure 5.4.
Figure 5.4 Several normal distributions: Curve 1 with µ = 3, Curve 2 with µ = −1, and Curve 3 with µ = 0
If µ = 0 and σ = 1 then $f(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}$. The distribution with this density function is called the standardized normal distribution. The graph of the standardized normal density distribution is shown in Figure 5.5.
Figure 5.5 The standardized normal density distribution
If ξ is a normal random variable with the mean µ and variance σ² then
1) the variable $z = \frac{\xi - \mu}{\sigma}$ is the standardized normal random variable;
2) $P(|\xi - \mu| \le n\sigma) = 2\Phi(n)$, where $\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{0}^{x} e^{-t^2/2}\,dt$. This function is called the Laplace function and it is tabulated.
In particular, we have
P(|ξ − µ| ≤ σ) = 0.6826
P(|ξ − µ| ≤ 2σ) = 0.9544
P(|ξ − µ| ≤ 3σ) = 0.9973
These equalities are known as the σ, 2σ and 3σ rules, respectively, and are often used in statistics. Namely, if a population of measurements has approximately a normal distribution, the probability that a randomly selected observation falls within the intervals (µ − σ, µ + σ), (µ − 2σ, µ + 2σ), and (µ − 3σ, µ + 3σ) is approximately 0.6826, 0.9544 and 0.9973, respectively.
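The σ, 2σ and 3σ rules can be verified numerically instead of from a printed table. The sketch below assumes only the Python standard library and uses the identity Φ(x) = erf(x/√2)/2 for the Laplace function as defined in this section (the integral from 0 to x); the function names are illustrative. The computed values agree with the tabulated ones to about four decimal places, with small differences in the last digit due to table rounding.

```python
from math import erf, sqrt


def laplace(x):
    """Laplace function: area under the standard normal curve from 0 to x."""
    return 0.5 * erf(x / sqrt(2))


def prob_within(n_sigmas):
    """P(|xi - mu| <= n * sigma) = 2 * Phi(n) for a normal random variable."""
    return 2 * laplace(n_sigmas)


rules = {n: prob_within(n) for n in (1, 2, 3)}
for n, prob in rules.items():
    print(f"P(|xi - mu| <= {n} sigma) = {prob:.4f}")
```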
The normal distribution as an approximation to various discrete probability distributions
Although the normal distribution is continuous, it is interesting to note that it can sometimes be used to approximate discrete distributions. Namely, we can use the normal distribution to approximate the binomial probability distribution. Suppose we have a binomial distribution defined by two parameters: the number of trials n and the probability of success p. The normal distribution with the parameters µ and σ will be a good approximation for that binomial distribution if both
$\mu - 2\sigma = np - 2\sqrt{np(1-p)}$ and $\mu + 2\sigma = np + 2\sqrt{np(1-p)}$ lie between 0 and n.
For example, the binomial distribution with n = 10 and p = 0.5 is well approximated by the normal distribution with µ = np = 10*0.5 = 5.0 and $\sigma = \sqrt{np(1-p)} = 0.5\sqrt{10} = 1.58$. See Figure 5.6 or Table 5.4.
Figure 5.6 Approximation of binomial distribution (bar graph) with n = 10, p = 0.5 by a normal distribution (smoothed curve)
Table 5.4 The binomial and normal probability distributions for the same values of x
x     Binomial distribution    Normal distribution
 0    0.000977                 0.001700
 1    0.009766                 0.010285
 2    0.043945                 0.041707
 3    0.117188                 0.113372
 4    0.205078                 0.206577
 5    0.246094                 0.252313
 6    0.205078                 0.206577
 7    0.117188                 0.113372
 8    0.043945                 0.041707
 9    0.009766                 0.010285
10    0.000977                 0.001700
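The two columns of Table 5.4 can be recomputed as a check. The sketch below assumes that the normal column holds the value of the normal density at each integer x (this reading reproduces the tabulated figures); the helper names, including two_sigma_rule_ok for the µ ± 2σ criterion, are illustrative.

```python
from math import comb, exp, pi, sqrt


def binomial_pmf(x, n, p):
    """Binomial probability p(x) = C(n, x) * p^x * (1 - p)^(n - x)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def normal_pdf(x, mu, sigma):
    """Normal density with mean mu and standard deviation sigma."""
    return exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * sqrt(2 * pi))


def two_sigma_rule_ok(n, p):
    """True when both mu - 2*sigma and mu + 2*sigma lie between 0 and n."""
    mu, sigma = n * p, sqrt(n * p * (1 - p))
    return 0 <= mu - 2 * sigma and mu + 2 * sigma <= n


n, p = 10, 0.5
mu, sigma = n * p, sqrt(n * p * (1 - p))
table = [(x, binomial_pmf(x, n, p), normal_pdf(x, mu, sigma)) for x in range(n + 1)]

print("criterion satisfied:", two_sigma_rule_ok(n, p))
for x, b, g in table:
    print(f"{x:2d}  {b:.6f}  {g:.6f}")
```

For a very small p with the same n, for instance p = 0.05, the lower limit µ − 2σ falls below 0 and the criterion fails, which signals that the normal curve is a poor fit there.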
5.9. Summary
This chapter introduces the notion of a random variable, one of the fundamental concepts of probability theory. It is a rule that assigns one and only one value of a variable x to each simple event in the sample space. A variable is said to be discrete if it can assume only a countable number of values. The probability distribution of a discrete random variable is a table, graph or formula that gives the probability associated with each value of x. The expected value E(x) = µ is the mean of this probability distribution and E[(x − µ)²] = σ² is its variance. Two discrete random variables, the binomial and the Poisson, were presented, along with their probability distributions. In contrast to discrete random variables, a continuous random variable can assume values corresponding to the infinitely large number of points contained in one or more intervals on the real line. The relative frequency distribution for a population of data associated with a continuous random variable can be modeled using a probability density function. The expected value (or mean) of a continuous random variable is defined in the same manner as for discrete random variables, except that integration is substituted for summation. The most important probability distribution, the normal distribution, was considered together with its properties.
5.10 Exercises
1) The continuous random variable ξ is called a uniform random variable if its density function is
$f(x) = \begin{cases} \dfrac{1}{b-a} & \text{if } a \le x \le b \\ 0 & \text{elsewhere} \end{cases}$
Show that for this variable, the mean $\mu = \dfrac{a+b}{2}$ and the variance $\sigma^2 = \dfrac{(b-a)^2}{12}$.
2) The continuous random variable ξ is called an exponential random variable if its density function is
$f(x) = \dfrac{e^{-x/\beta}}{\beta}$  (0 ≤ x < ∞)
Show that for this random variable µ = β, σ² = β².
3) Find the area beneath a standardized normal curve between the mean z = 0 and the point z = 1.26.
4) Find the probability that a normally distributed random variable ξ lies more than z = 2 standard deviations above its mean.
5) Suppose y is a normally distributed random variable with mean 10 and standard deviation 2.1.
a) Find P(y ≥ 11).
b) Find P(7.6 ≤ y ≤ 12.2).
Chapter 6. Sampling Distributions
CONTENTS
6.1 Why the method of sampling is important
6.2 Obtaining a Random Sample
6.3 Sampling Distribution
6.4 The sampling distribution of x̄: the Central Limit Theorem
6.5 Summary
6.6 Exercises
6.1 Why the method of sampling is important
Much of our statistical information comes in the form of samples from populations of interest. In order to develop and evaluate methods for using sample information to obtain knowledge of the population, it is necessary to know how closely a descriptive quantity such as the mean or the median of a sample resembles the corresponding population quantity. In this chapter, the ideas of probability will be used to study the sample-to-sample variability of these descriptive quantities. We now return to the objective of statistics, namely, the use of sample information to infer the nature of a population. We will explain why the method of sampling is important through an example.
Example 6.1 The Vietnam Demographic and Health Survey (VNDHS) was a nationwide representative sample survey conducted in May 1988 to collect data on fertility and a few indicators of child and maternal health. In the survey a total of 4,171 eligible women, all aged 15 to 49 years, were interviewed. The survey data are given in Appendix A in Excel format. The relative frequency distribution of the number of children ever born for the 4,171 women appears in Table 6.1 and in Figure 6.1. In actual practice, the entire population of 4,171 women's numbers of children ever born may not be easily accessible. Now, we draw two samples of 50 women from the population of 4,171 women. The relative frequency distributions of the two samples are given in Tables 6.2a and 6.2b and graphed in Figures 6.2a and 6.2b. Compare the distributions of the number of children ever born for the two samples. Which appears to better characterize the number of children ever born for the population?
Solution It is clear that the two samples lead to quite different conclusions about the same population from which they were both selected. From Figure 6.2a, we see that only 18% of the sampled women bore 3 children, whereas from Figure 6.2b, we see that 26% of the sampled women bore that number of children. This may be compared to the relative frequency distribution for the population (shown in Figure 6.1), in which we observe that 18% of all the women bore 3 children. In addition, note that none of the women in the second sample (Figure 6.2b) had no children, whereas 10% of the women in the first sample (Figure 6.2a) had no child. This value from the first sample compares favorably with the 7% of "no children" of the entire population (Figure 6.1).
Table 6.1 Frequency distribution of number of children ever born for 4,171 women
Number of Children    Frequency    Relative Frequency
0                     312          0.07
1                     708          0.17
2                     881          0.21
3                     737          0.18
4                     570          0.14
5                     354          0.08
6                     243          0.06
7                     172          0.04
>7                    194          0.05
Total                 4171         1.00
Figure 6.1 Relative frequency distribution of number of children ever born for 4,171 women (horizontal axis: number of children ever born; vertical axis: relative frequency)
Table 6.2 Frequency distribution of number of children ever born for each of two samples of 50 women selected from 4,171 women
Figure 6.2 Relative frequency distribution of number of children ever born for each of two samples of 50 women selected from 4,171 women (panels a and b; horizontal axis: number of children ever born; vertical axis: relative frequency)
a
Number of Children    Frequency    Relative Frequency
0                     5            0.10
1                     8            0.16
2                     10           0.20
3                     9            0.18
4                     8            0.16
5                     3            0.06
6                     4            0.08
7                     2            0.04
>7                    1            0.02
Total                 50           1.00
b
Number of Children    Frequency    Relative Frequency
0                     0            0.00
1                     8            0.16
2                     8            0.16
3                     13           0.26
4                     9            0.18
5                     6            0.12
6                     2            0.04
7                     4            0.08
Total                 50           1.00
To rephrase the question posed in the example, we could ask: Which of the two samples is more representative of, or characteristic of, the number of children ever born for all 4,171 of the VNDHS's women? Clearly, the information provided by the first sample (Table and Figure 6.2a) gives a better picture of the actual population of numbers of children ever born. Its relative frequency distribution is closer to that for the entire population (Table and Figure 6.1) than is the one provided by the second sample (Table and Figure 6.2b). Thus, if we were to rely on information from the second sample only, we may have a distorted, or biased, impression of the true situation with respect to numbers of children ever born.
How is it possible that two samples from the same population can provide contradictory information about the population? The key issue is the method by which the samples are obtained. The examples in this section demonstrate that great care must be taken in order to select a sample that will give an unbiased picture of the population about which inferences are to be made. One way to cope with this problem is to use random sampling. Random sampling eliminates the possibility of bias in selecting a sample and, in addition, provides a probabilistic basis for evaluating the reliability of an inference. We will have more to say about random sampling in Section 6.2.
6.2 Obtaining a Random Sample
In the previous section, we demonstrated the importance of obtaining a sample that exhibits characteristics similar to those possessed by the population from which it came, the population about which we wish to make inferences. One way to satisfy this requirement is to select the sample in such a way that every different sample of size n has an equal probability of being selected. This procedure is called random sampling and the resulting sample is called a random sample of size n. In this section we will explain how to draw a random sample, and will then employ random sampling in sections that follow.
Definition 6.1 A random sample of n experimental units is one selected in such a way that every different sample of size n has an equal probability of selection.
Example 6.2 A city purchasing agent can obtain stationery and office supplies from any of eight companies. If the purchasing agent decides to use three suppliers in a given year and wants to avoid accusations of bias in their selection, the sample of three suppliers should be selected from among the eight.
a. How many different samples of three suppliers can be chosen from among the eight?
b. List them.
c. State the criterion that must be satisfied in order for the selected sample to be random.
Solution In this example, the population of interest consists of eight suppliers (call them A, B, C, D, E, F, G, H), from which we want to select a sample of size n = 3.
a. The number of different samples of n = 3 elements that can be selected from a population of N = 8 elements is
$C_N^n = \frac{N!}{n!(N-n)!} = \frac{8!}{3!\,5!} = \frac{8 \cdot 7 \cdot 6 \cdot 5 \cdot 4 \cdot 3 \cdot 2 \cdot 1}{(3 \cdot 2 \cdot 1)(5 \cdot 4 \cdot 3 \cdot 2 \cdot 1)} = 56$
b. The following is a list of the 56 samples:
A,B,C  A,B,D  A,B,E  A,B,F  A,B,G  A,B,H  A,C,D  A,C,E
A,C,F  A,C,G  A,C,H  A,D,E  A,D,F  A,D,G  A,D,H  A,E,F
A,E,G  A,E,H  A,F,G  A,F,H  A,G,H  B,C,D  B,C,E  B,C,F
B,C,G  B,C,H  B,D,E  B,D,F  B,D,G  B,D,H  B,E,F  B,E,G
B,E,H  B,F,G  B,F,H  B,G,H  C,D,E  C,D,F  C,D,G  C,D,H
C,E,F  C,E,G  C,E,H  C,F,G  C,F,H  C,G,H  D,E,F  D,E,G
D,E,H  D,F,G  D,F,H  D,G,H  E,F,G  E,F,H  E,G,H  F,G,H
c. Each sample must have the same chance of being selected in order to ensure that we have a random sample. Since there are 56 possible samples of size n = 3, each must have a probability equal to 1/56 of being selected by the sampling procedure.
What procedures may one use to generate a random sample? If the population is not too large, each observation may be recorded on a piece of paper and placed in a suitable container. After the collection of papers is thoroughly mixed, the researcher can remove n pieces of paper from the container; the elements named on these n pieces of paper would be the ones included in the sample. However, this method has the following drawbacks: it is not feasible when the population consists of a large number of observations, and since it is very difficult to achieve a thorough mixing, the procedure provides only an approximation to a random sample.
A more practical method of generating a random sample, and one that may be used with larger populations, is to use a table of random numbers. At present, this method is used to select random samples in almost all statistical program packages. For example, SPSS/PC, a comprehensive system for analyzing data, provides a procedure to select a random sample based on an approximate percentage or an exact number of observations. The two samples in Example 6.1 were drawn by the SPSS "Select cases" procedure from the data on fertility of the 4,171 women recorded in Appendix A.
For the first sample, the mean is
$\bar{x} = \frac{\sum vf}{n} = \frac{0 \cdot 5 + 1 \cdot 8 + 2 \cdot 10 + 3 \cdot 9 + 4 \cdot 8 + 5 \cdot 3 + 6 \cdot 4 + 7 \cdot 2 + 8 \cdot 1}{50} = 2.96$
For the second sample, the mean is
$\bar{x} = \frac{\sum vf}{n} = \frac{0 \cdot 0 + 1 \cdot 8 + 2 \cdot 8 + 3 \cdot 13 + 4 \cdot 9 + 5 \cdot 6 + 6 \cdot 2 + 7 \cdot 4 + 8 \cdot 0}{50} = 3.38$
where the mean for all 4,171 observations is 3.15.
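The count and the listing asked for in Example 6.2, and the drawing of one random sample of three suppliers, can be sketched with the Python standard library. The seed value below is arbitrary and only makes the illustration repeatable; it is an assumption of this sketch, not part of the example.

```python
import random
from itertools import combinations
from math import comb

suppliers = list("ABCDEFGH")          # the eight suppliers of Example 6.2

# Parts a and b: every different sample of size 3, in alphabetical order
samples = list(combinations(suppliers, 3))
print(len(samples), "possible samples; formula gives", comb(8, 3))

# Part c: random.sample draws without replacement, so each of the
# possible samples has the same chance of being selected
rng = random.Random(0)                # arbitrary seed (assumption)
chosen = rng.sample(suppliers, 3)
print("selected suppliers:", sorted(chosen))
```

This plays the role of the table of random numbers mentioned above: the generator, not the person choosing, decides which suppliers enter the sample.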
In the next section, we discuss how to judge the performance of a statistic computed from a random sample.
6.3 Sampling Distribution
In the previous section, we learned how to generate a random sample from a population of interest, the ultimate goal being to use information from the sample to make an inference about the nature of the population. In many situations, the objective will be to estimate a numerical characteristic of the population, called a parameter, using information from the sample.
Definition 6.2 A numerical descriptive measure of a population is called a parameter.
For example, from the first sample of 50 women in Example 6.1, we computed x̄ = 2.96, the mean number of children ever born from the sample of n = 50. In other words, we used the sample information to compute a statistic, namely, the sample mean, x̄.
Definition 6.3 A quantity computed from the observations in a random sample is called a statistic.
You may have observed that the value of a population parameter (for example, the mean µ) is a constant (although it is usually unknown to us); its value does not vary from sample to sample. However, the value of a sample statistic (for example, the sample mean x̄) is highly dependent on the particular sample that is selected. As seen in the previous section, the means of two samples with the same size of n = 50 are different.
Since statistics vary from sample to sample, any inferences based on them will necessarily be subject to some uncertainty. How, then, do we judge the reliability of a sample statistic as a tool in making an inference about the corresponding population parameter? Fortunately, the uncertainty of a statistic generally has characteristic properties that are known to us, and that are reflected in its sampling distribution. Knowledge of the sampling distribution of a particular statistic provides us with information about its performance over the long run.
Definition 6.4 A sampling distribution of a sample statistic (based on n observations) is the relative frequency distribution of the values of the statistic theoretically generated by taking repeated random samples of size n and computing the value of the statistic for each sample. (See Figure 6.3.)
We will illustrate the notion of a sampling distribution with an example, in which our interest focuses on the numbers of children ever born of the 4,171 women in VNDHS 1988. The data are given in Appendix A. In particular, we wish to estimate the mean number of children ever born to all women. In this case, the 4,171 observations constitute the entire population and we know that the true value of µ, the mean of the population, is 3.15 children.
Example 6.3 How could we generate the sampling distribution of x̄, the mean of a random sample of n = 5 observations from the population of 4,171 numbers of children ever born in Appendix A?
Solution The sampling distribution for the statistic x̄, based on a random sample of n = 5 measurements, would be generated in this manner: Select a random sample of five measurements from the population of 4,171 observations on number of children ever born in Appendix A; compute and record the value of x̄ for this sample. Then return these five measurements to the population and repeat the procedure. (See Figure 6.3.) If this sampling procedure could be repeated an infinite number of times, the infinite number of values of x̄ obtained could be summarized in a relative frequency distribution, called the sampling distribution of x̄.
Figure 6.3 Generating the theoretical sampling distribution of the sample mean x̄
The task described in Example 6.3, which may seem impractical if not impossible, is not performed in actual practice. Instead, the sampling distribution of a statistic is obtained by applying mathematical theory or computer simulation, as illustrated in the next example.
Example 6.4 Use computer simulation to find the approximate sampling distribution of x̄, the mean of a random sample of n = 5 observations from the population of 4,171 numbers of children ever born in Appendix A.
Solution We used a statistical program, for example SPSS, to obtain 100 random samples of size n = 5 from the target population. The first ten of these samples are presented in Table 6.3.
Table 6.3 The first ten samples of n = 5 measurements from the population of numbers of children ever born of 4,171 women
Sample    Number of children ever born    Mean (x̄)
 1        1   1   1   2   2               1.4
 2        1   2   3   3   3               2.4
 3        0   0   4   6   7               3.4
 4        0   1   2   2   3               1.6
 5        2   2   3   4   7               3.6
 6        1   2   3   5   8               3.8
 7        1   2   2   5   6               3.2
 8        1   2   2   3   6               2.8
 9        2   2   3   3   11              4.2
10        0   0   2   3   4               1.8
For each sample of five observations, the sample mean x̄ was computed, and the 100 values of x̄ are summarized in the relative frequency distribution shown in Figure 6.5.
Figure 6.5 Sampling distribution of x̄: Relative frequency distribution of x̄ based on 100 samples of size n = 5 (vertical axis: percentage)
The relative frequency distribution of the number of children ever born for the entire population of 4,171 women was plotted in Figure 6.4.
Figure 6.4 Relative frequency distribution for 4,171 numbers of children ever born (vertical axis: percentage)
We can see that the values of x̄ in Figure 6.5 tend to cluster around the population mean, µ = 3.15 children. Also, the values of the sample mean are less spread out (that is, they have less variation) than the population values shown in Figure 6.4. These two observations are borne out by comparing the means and standard deviations of the two sets of observations, as shown in Table 6.4.
Table 6.4 Comparison of the population and the approximate sampling distribution of x̄ based on 100 samples of size n = 5
                                                                 Mean        Standard Deviation
Population of 4,171 numbers of children ever born (Fig. 6.4)    µ = 3.15    σ = 2.229
100 values of x̄ based on samples of size n = 5 (Fig. 6.5)       3.11        .920
Example 6.5 Refer to Example 6.4. Simulate the sampling distribution of x̄ for samples of size n = 25 from the population of 4,171 observations of number of children ever born. Compare the result with the sampling distribution of x̄ based on samples of n = 5, obtained in Example 6.4.
Solution We obtained 100 computer-generated random samples of size n = 25 from the target population. A relative frequency distribution for the 100 corresponding values of x̄ is shown in Figure 6.6.
Figure 6.6 Relative frequency distribution of x̄ based on 100 samples of size n = 25 (vertical axis: percentage)
It can be seen that, as with the sampling distribution based on samples of size n = 5, the values of x̄ tend to center about the population mean. However, a visual inspection shows that the variation of the x̄ values about their mean in Figure 6.6 is less than the variation in the values of x̄ based on samples of size n = 5 (Figure 6.5). The mean and standard deviation for these 100 values of x̄ are shown in Table 6.5 for comparison with previous results.
Table 6.5 Comparison of the population distribution and the approximate sampling distributions of x̄, based on 100 samples of size n = 5 and n = 25
                                                                 Mean        Standard Deviation
Population of 4,171 numbers of children ever born (Fig. 6.4)    µ = 3.15    σ = 2.229
100 values of x̄ based on samples of size n = 5 (Fig. 6.5)       3.11        .920
100 values of x̄ based on samples of size n = 25 (Fig. 6.6)      3.14        .492
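The simulations of Examples 6.4 and 6.5 were run in SPSS on the Appendix A data. A rough Python counterpart is sketched below under two stated assumptions: the population is rebuilt from the frequencies in Table 6.1, and the open class ">7" is coded as exactly 8 children. Because of that coding, the rebuilt population has a somewhat smaller mean and standard deviation than the true values µ = 3.15 and σ = 2.229, and the simulated figures will not match Tables 6.4 and 6.5 exactly. The seed is arbitrary.

```python
import random
from statistics import mean, stdev

# Frequencies from Table 6.1; ">7" is coded as 8 children (assumption)
freq = {0: 312, 1: 708, 2: 881, 3: 737, 4: 570, 5: 354, 6: 243, 7: 172, 8: 194}
population = [value for value, count in freq.items() for _ in range(count)]


def simulate_sample_means(n, repeats, rng):
    """Draw `repeats` random samples of size n and return their means."""
    return [mean(rng.sample(population, n)) for _ in range(repeats)]


rng = random.Random(1988)             # arbitrary seed (assumption)
means_5 = simulate_sample_means(5, 100, rng)
means_25 = simulate_sample_means(25, 100, rng)

print(f"population   : mean {mean(population):.2f}")
print(f"x-bar, n = 5 : mean {mean(means_5):.2f}, sd {stdev(means_5):.3f}")
print(f"x-bar, n = 25: mean {mean(means_25):.2f}, sd {stdev(means_25):.3f}")
```

Both sets of sample means should center near the population mean, with visibly less spread for n = 25 than for n = 5, which is the pattern reported in Table 6.5.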
From Table 6.5 we observe that, as the sample size increases, there is less variation in the sampling distribution of x̄, that is, the values of x̄ tend to cluster more closely about the population mean as n gets larger. This intuitively appealing result will be stated formally in the next section.
6.4 The sampling distribution of x̄: the Central Limit Theorem
Estimating the mean number of children ever born for a population of women, or the mean height for all 3-year-old boys in a day-care center, are examples of practical problems in which the goal is to make an inference about the mean, µ, of some target population. In previous sections, we have indicated that the mean x̄ is often used as a tool for making an inference about the corresponding population parameter µ, and we have shown how to approximate its sampling distribution. The following theorem, of fundamental importance in statistics, provides information about the actual sampling distribution of x̄.
The Central Limit Theorem
If the sample size is sufficiently large, the mean x̄ of a random sample from a population has a sampling distribution that is approximately normal, regardless of the shape of the relative frequency distribution of the target population. The larger the sample size, the better will be the normal approximation to the sampling distribution. (*)
(*) This is why the normal distribution is so important!
The sampling distribution of x̄, in addition to being approximately normal, has other known characteristics, which are summarized as follows.
Properties of Sampling Distribution of x̄
If x̄ is the mean of a random sample of size n from a population with mean µ and standard deviation σ, then:
1. The sampling distribution of x̄ has a mean equal to the mean of the population from which the sample was selected. That is, if we let $\mu_{\bar{x}}$ denote the mean of the sampling distribution of x̄, then $\mu_{\bar{x}} = \mu$.
2. The sampling distribution of x̄ has a standard deviation equal to the standard deviation of the population from which the sample was selected, divided by the square root of the sample size. That is, if we let $\sigma_{\bar{x}}$ denote the standard deviation of the sampling distribution of x̄, then $\sigma_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}}$.
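Property 2 is easy to evaluate for any sample size. The short sketch below takes the population standard deviation σ = 2.229 reported in Table 6.5 and tabulates σ/√n for a few sample sizes; the function name standard_error and the choice of sample sizes other than 5 and 25 are illustrative.

```python
from math import sqrt


def standard_error(sigma, n):
    """Standard deviation of the sampling distribution of the mean: sigma / sqrt(n)."""
    return sigma / sqrt(n)


sigma = 2.229                          # population standard deviation (Table 6.5)
for n in (5, 25, 100):
    print(f"n = {n:3d}: sigma / sqrt(n) = {standard_error(sigma, n):.3f}")
```

Quadrupling the sample size halves the standard deviation of x̄, so precision improves only with the square root of n.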
Example 6.6 Show that the empirical evidence obtained in Examples 6.4 and 6.5 supports the Central Limit Theorem and the two properties of the sampling distribution of x̄. Recall that in Examples 6.4 and 6.5, we obtained repeated random samples of size n = 5 and n = 25 from the population of numbers of children ever born in Appendix A. For this target population, we know the values of the parameters µ and σ:
Population mean: µ = 3.15 children
Population standard deviation: σ = 2.229 children
Solution In Figures 6.5 and 6.6, we note that the values of x̄ tend to cluster about the population mean, µ = 3.15. This is guaranteed by property 1, which implies that, in the long run, the average of all values of x̄ that would be generated in infinite repeated sampling would be equal to µ.
We also observed, from Table 6.5, that the standard deviation of the sampling distribution of x̄ decreases as the sample size increases from n = 5 to n = 25. Property 2 quantifies the decrease and relates it to the sample size. As an example, note that, for our approximate sampling distribution based on samples of size n = 5, we obtained a standard deviation of .920, whereas property 2 tells us that, for the actual sampling distribution of x̄, the standard deviation is equal to
$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{2.229}{\sqrt{5}} = .997$
Similarly, for samples of size n = 25, the sampling distribution of x̄ actually has a standard deviation of
$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{2.229}{\sqrt{25}} = .446$
The value we obtained by simulation was .492.
Finally, the Central Limit Theorem guarantees an approximately normal distribution for x̄, regardless of the shape of the original population. In our examples, the population from which the samples were selected is seen in Figure 6.4 to be moderately skewed to the right. Note from Figures 6.5 and 6.6 that, although the sampling distribution of x̄ tends to be bell-shaped in each case, the normal approximation improves when the sample size is increased from n = 5 (Figure 6.5) to n = 25 (Figure 6.6).
Example 6.7 In research on the health and nutrition of children in a rural area of Vietnam in 1988, it was reported that the average height of 823 three-year-old children in rural areas was 89.67 centimeters with a standard deviation of 4.99 centimeters. These observations are given in Appendix B. In order to check these figures, we will randomly sample 100 three-year-old children from the rural area and monitor their heights.
a. Assuming the report's figures are true, describe the sampling distribution of the mean height for a random sample of 100 three-year-old children in the rural area.
b. Assuming the report's figures are true, what is the probability that the sample mean height will be at least 91 centimeters?
Solution
a. Although we have no information about the shape of the relative frequency distribution of the heights of the children, we can apply the Central Limit Theorem to conclude that the sampling distribution of the sample mean height of the 100 three-year-old children is approximately normally distributed. In addition, the mean, $\mu_{\bar{x}}$, and the standard deviation, $\sigma_{\bar{x}}$, of the sampling distribution are given by

\[ \mu_{\bar{x}} = \mu = 89.67 \text{ cm} \quad \text{and} \quad \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{4.99}{\sqrt{100}} = .499 \text{ cm} \]

assuming that the reported values of µ and σ are correct.

b. If the reported values are correct, then $P(\bar{x} \ge 91)$, the probability of observing a mean height of 91 cm or higher in the sample of 100 observations, is equal to the green area shown in Figure 6.7. Since the sampling distribution is approximately normal, with mean and standard deviation as obtained in part a, we can compute the desired area by obtaining the z-score for $\bar{x} = 91$:

\[ z = \frac{\bar{x} - \mu_{\bar{x}}}{\sigma_{\bar{x}}} = \frac{91 - 89.67}{.499} = 2.67 \]

Thus, $P(\bar{x} \ge 91) = P(z \ge 2.67)$, and this probability (area) may be found in Table 1 of Appendix C. Writing A for the tabulated area between z = 0 and z = 2.67 (see Figure 6.7),

\[ P(\bar{x} \ge 91) = P(z \ge 2.67) = .5 - A = .5 - .4962 = .0038 \]
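The calculation in part b can be checked numerically. The following is a minimal Python sketch, assuming only the reported values µ = 89.67 cm, σ = 4.99 cm and n = 100 from Example 6.7; the variable names are illustrative, and the normal tail area is computed with the Python standard library rather than read from Table 1 of Appendix C.

```python
from math import sqrt
from statistics import NormalDist

# Reported population values from Example 6.7 (assumed correct).
mu = 89.67      # population mean height, cm
sigma = 4.99    # population standard deviation, cm
n = 100         # sample size

# Property 2: standard deviation of the sampling distribution of the mean.
sigma_xbar = sigma / sqrt(n)

# z-score of a sample mean of 91 cm and the upper-tail area beyond it.
z = (91 - mu) / sigma_xbar
p_at_least_91 = 1 - NormalDist().cdf(z)

print(f"sigma_xbar = {sigma_xbar:.3f}")
print(f"z = {z:.2f}")
print(f"P(xbar >= 91) = {p_at_least_91:.4f}")
```

Because the code keeps the unrounded z-score, the tail area may differ from the table-based value in the last decimal place.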
Figure 6.7 Sampling distribution of $\bar{x}$ in Example 6.7

The probability that we would obtain a sample mean height of 91 cm or higher is only .0038, if the reported values are true. If the 100 randomly selected three-year-old children have an average height of 91 cm or higher, we would have strong evidence that the reported values are false, because such a large sample mean is very unlikely to occur if the reported values are true.

In practical terms, the Central Limit Theorem and the two properties of the sampling distribution of $\bar{x}$ assure us that the sample mean $\bar{x}$ is a reasonable statistic to use in making inferences about the population mean µ, and they allow us to compute a measure of the reliability of inferences made about µ. As we noticed earlier, we will not be required to obtain sampling distributions by simulation or by mathematical arguments. Rather, for all the statistics to be used in this course, the sampling distribution and its properties will be presented as the need arises.

6.5 Summary
The objective of most statistical investigations is to make an inference about a population parameter. Since we often base inferences upon information contained in a sample from the target population, it is essential that the sample be properly selected. A procedure for obtaining a random sample using statistical software (SPSS) was described in this chapter.
After the sample has been selected, we compute a statistic that contains information about the target parameter. The sampling distribution of the statistic characterizes the relative frequency distribution of values of the statistic over an infinitely large number of samples.
The Central Limit Theorem provides information about the sampling distribution of the sample mean, $\bar{x}$. In particular, if you have used random sampling, the sampling distribution of $\bar{x}$ will be approximately normal if the sample size is sufficiently large.

6.6 Exercises
6.1 Use the command Select cases of SPSS/PC to obtain 30 random samples of size n = 5 from the "population" of 4,171 numbers of children ever born from Appendix A.
a. Calculate $\bar{x}$ for each of the 30 samples. Construct a relative frequency distribution for the 30 sample means. Compare with the population relative frequency distribution tabulated earlier in this chapter.
b. Compute the average of the 30 sample means.
c. Compute the standard deviation of the 30 sample means.

6.2 Repeat parts a, b, and c of Exercise 6.1, using random samples of size n = 10. Compare the relative frequency distribution with that of Exercise 6.1. Do the values of $\bar{x}$ generated from samples of size n = 10 tend to cluster more closely about µ?

6.3 Suppose a random sample of n measurements is selected from a population with mean µ = 60 and variance σ² = 100. For each of the following values of n, give the mean and standard deviation of the sampling distribution of the sample mean, $\bar{x}$:
a. n = 10
b. n = 25
c. n = 50
d. n = 75
e. n = 100
f. n = 500
6.4 A random sample of n = 225 observations is selected from a population with µ = 70 and σ = 30. Calculate each of the following probabilities:
a. $P(\bar{x} > 72.5)$
b. $P(\bar{x} < 73.6)$
c. $P(69.1 < \bar{x} < 74.0)$
d. $P(\bar{x} < 65.5)$

6.5 This past year, an elementary school began using a new method to teach arithmetic to first graders. A standardized test, administered at the end of the year, was used to measure the effectiveness of the new method. The relative frequency distribution of the test scores in past years had a mean of 75 and a standard deviation of 10. Consider the standardized test scores for a random sample of 36 first graders taught by the new method.
a. If the relative frequency distribution of test scores for first graders taught by the new method is no different from that of the old method, describe the sampling distribution of $\bar{x}$, the mean test score for a random sample of 36 first graders.
b. If the sample mean test score was computed to be $\bar{x} = 79$, what would you conclude about the effectiveness of the new method of teaching arithmetic? (Hint: Calculate $P(\bar{x} \ge 79)$ using the sampling distribution described in part a.)
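The repeated-sampling experiment of Exercises 6.1 and 6.2 is carried out with SPSS in this course. As a rough illustration only, the same idea can be sketched in Python. The population below is synthetic (a hypothetical right-skewed list of 4,171 counts, not the Appendix A data), and the number of repetitions is an arbitrary choice larger than the 30 samples asked for in the exercises, so that property 2 is easier to see.

```python
import random
from math import sqrt
from statistics import mean, pstdev, stdev

rng = random.Random(1)

# Hypothetical right-skewed "population" of 4,171 counts; NOT the Appendix A data.
population = [min(int(rng.expovariate(1 / 3.15)), 12) for _ in range(4171)]
mu = mean(population)
sigma = pstdev(population)

def sample_means(n, repetitions):
    """Means of `repetitions` simple random samples of size n, drawn without replacement."""
    return [mean(rng.sample(population, n)) for _ in range(repetitions)]

for n in (5, 10, 25):
    means = sample_means(n, 2000)
    # Columns: n, average of the sample means, their standard deviation, sigma / sqrt(n)
    print(n, round(mean(means), 3), round(stdev(means), 3), round(sigma / sqrt(n), 3))
```

The second column should stay close to the population mean (property 1), and the third should track the fourth as n grows (property 2).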
Chapter 7 Estimation

CONTENTS
7.1 Introduction
7.2 Estimation of a population mean: Large-sample case
7.3 Estimation of a population mean: small sample case
7.4 Estimation of a population proportion
7.5 Estimation of the difference between two population means: Independent samples
7.6 Estimation of the difference between two population means: Matched pairs
7.7 Estimation of the difference between two population proportions
7.8 Choosing the sample size
7.9 Estimation of a population variance
7.10 Summary
7.11 Exercises

7.1 Introduction
In preceding chapters we learned that populations are characterized by numerical descriptive measures (parameters), and that inferences about parameter values are based on statistics computed from the information in a sample selected from the population of interest. In this chapter, we will demonstrate how to estimate population means, proportions, or variances, and how to estimate the difference between two population means or proportions. We will also be able to assess the reliability of our estimates, based on knowledge of the sampling distributions of the statistics being used.

Example 7.1 Suppose we are interested in estimating the average number of children ever born to all 4,171 women in the VNDHS 1998 in Appendix A. Although we already know the value of the population mean, this example will be continued to illustrate the concepts involved in estimation. How could one estimate the parameter of interest in this situation?

Solution An intuitively appealing estimate of a population mean, µ, is the sample mean, $\bar{x}$, computed from a random sample of n observations from the target population. Assume, for example, that we obtain a random sample of size n = 30 from the numbers of children ever born in Appendix A, and then compute the value of the sample mean to be $\bar{x} = 3.05$ children. This value of $\bar{x}$ provides a point estimate of the population mean.

Definition 7.1 A point estimate of a parameter is a statistic, a single value computed from the observations in a sample, that is used to estimate the value of the target parameter.
How reliable is a point estimate for a parameter? In order to be truly practical and meaningful, an inference concerning a parameter must consist of more than just a point estimate; that is, we need to be able to state how close our estimate is likely to be to the true value of the population parameter. This can be done by using the characteristics of the sampling distribution of the statistic that was used to obtain the point estimate; the procedure will be illustrated in the next section.

7.2 Estimation of a population mean: Large-sample case
Recall from Section 6.4 that, for a sufficiently large sample size, the sampling distribution of the sample mean, $\bar{x}$, is approximately normal, as shown in Figure 7.1.

Figure 7.1 Sampling distribution of $\bar{x}$ (area = .95 between $\mu - 1.96\sigma_{\bar{x}}$ and $\mu + 1.96\sigma_{\bar{x}}$)

Example 7.2 Suppose we plan to take a sample of n = 30 measurements from the population of numbers of children ever born in Appendix A and construct the interval

\[ \bar{x} \pm 1.96\,\sigma_{\bar{x}} = \bar{x} \pm 1.96\,\frac{\sigma}{\sqrt{n}} \]

where σ is the population standard deviation of the 4,171 numbers of children ever born and $\sigma_{\bar{x}} = \sigma/\sqrt{n}$ is the standard deviation of the sampling distribution of $\bar{x}$ (often called the standard error of $\bar{x}$). In other words, we will construct an interval 1.96 standard deviations around the sample mean $\bar{x}$. What can we say about how likely it is that this interval will contain the true value of the population mean, µ?

Solution We arrive at a solution by the following three-step process:

Step 1 First note that the area beneath the sampling distribution of $\bar{x}$ between $\mu - 1.96\sigma_{\bar{x}}$ and $\mu + 1.96\sigma_{\bar{x}}$ is approximately .95. (This area, colored green in Figure 7.1, is obtained from Table 1 of Appendix C.) This implies that, before the sample of measurements is drawn, the probability that $\bar{x}$ will fall within the interval $\mu \pm 1.96\sigma_{\bar{x}}$ is approximately .95.

Step 2 If in fact the sample yields a value of $\bar{x}$ that falls within the interval $\mu \pm 1.96\sigma_{\bar{x}}$, then it is true that $\bar{x} \pm 1.96\sigma_{\bar{x}}$ will contain µ, as demonstrated in Figure 7.2. For a particular value of $\bar{x}$ that falls within the interval $\mu \pm 1.96\sigma_{\bar{x}}$, a distance of $1.96\sigma_{\bar{x}}$ is marked off both to the left and to the right of $\bar{x}$. You can see that the value of µ must fall within $\bar{x} \pm 1.96\sigma_{\bar{x}}$.

Step 3 Step 1 and Step 2 combined imply that, before the sample is drawn, the probability that the interval $\bar{x} \pm 1.96\sigma_{\bar{x}}$ will enclose µ is approximately .95.

Figure 7.2 Sampling distribution of $\bar{x}$ in Example 7.2

The interval $\bar{x} \pm 1.96\sigma_{\bar{x}}$ in Example 7.2 is called a large-sample 95% confidence interval for the population mean µ. The term large-sample refers to the sample being of a sufficiently large size that we can apply the Central Limit Theorem to determine the form of the sampling distribution of $\bar{x}$.

Definition 7.2 A confidence interval for a parameter is an interval of numbers within which we expect the true value of the population parameter to be contained. The endpoints of
13 Standard Deviation 5.02 5.47 88.93. using the information from each sample. 90.67 ± 1. the population mean height. for larger samples (n ≥ 30).96 Standard Deviation 6. from the population of heights in Appendix B. 95.62 ± 1.62 cm and s = 4.2. then approximately 95% of the intervals constructed in this manner would contain µ . we obtain 4.1.68 4.96 standard deviation interval around x for each sample. How much confidence do we have that µ .08)? Although we cannot be certain whether the sample interval contain µ (unless we calculate the true value of µ for all 823 observations in Appendix B). each of size n = 30. the value of the population deviation σ will be unknown.62 ± 1.70 90.85 Sample 21 22 23 24 25 Mean 91.09 σ 88.39 4. the sample mean and standard deviation are presented in Table 7. Thus. we are 95% confident that the particular interval (89. Solution A 95% confidence interval for µ . the sample standard deviation s provides a good approximation to σ .69 4.08 cm. 90. and this is our measure of the reliability of the point estimate x .4 To illustrate the classical interpretation of a confidence interval.45 89.67 6.96σ x = x ± 1.17 89. We then constructed the 95% confidence interval for µ .09 cm Construct a 95% confidence interval of µ . For this example. However.16.46 = 88.63 5. Example 7. based on this sample.3 Suppose that a random sample of observations from the population of threeyear old children heights yield the following sample statistics: x = 88.96 = 88.16.86 88. which are shown in Table 7. we can be reasonably sure that it does.96 30 30 or (87.41) contains µ . and may be used on the formula for the confidence interval.70 89. Table 7. Interpret the results. the true population mean height.08).62 ± 1. lies within the interval (87.02 90.96 = 92.07 xc .08 4. Hence. Example 7. is given by σ σ x ± 1. we generated 40 random samples. 
This confidence is based on the interpretation of the confidence interval procedure: If we were to select repeated random samples of size n = 30 heights.16 cm to 90.1 Sample 1 2 3 4 5 Means and standard deviations for 40 random samples of 30 heights from Appendix B Mean 89.64 5. we estimate that the population mean height falls within the interval from 87. based on a sample of size = 30.53 90. and from a 1. For each sample.the interval are computed based on sample information.96 30 n In most practical applications.
51 Table 7.18 89.35 86.56 3.33 91.48 6.68 4.04 88.51 90.27 4.98 5.02 90.80 93.35 87.61 89.26 92.75 91.20 88.54 92.67 cm) (Note: The green intervals don't contain µ Solution For the target population of 823 heights.05 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 91.64 5.44 5.91 89.98 88.82 89.39 random samples UCL 93.65 91.36 90.07 91.27 86.95 90.31 91.86 4.15 90.81 87.40 90.00 5.85 89.14 87.30 90.34 89.34 5.30 5.35 87.21 87.14 87.96 89.91 4.74 4.29 92.6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 89.01 88.50 91. we have obtained the population mean value µ = 89.04 87.01 91.11 90.21 92.13 90.77 5.13 91.10 87.55 91.23 89.48 6.82 6.63 UCL 91.53 5.33 89.85 5.71 88.39 92.26 4.33 87.2 95% confidence intervals 30 heights from Appendix B LCL 87.95 92.17 90.07 87.67 cm.27 88.31 89.92 90.77 88.33 86.56 91.69 91.25 for µ for 40 LCL 89.24 89.88 90.29 4.01 88.91 87.99 91.07 92.20 86.96 5. indicated xci .33 89. note that only two of the intervals (those based on samples 38 and 40.84 92.34 87.51 87.99 87.82 4.50 5.27 88.44 89.35 89.82 92.60 5.81 92.22 87.20 91.70 3.62 of Sample 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Sample 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 = 89.12 89.45 91.10 89.14 87.62 85.46 4.37 4.72 88.61 4.81 90.52 90.35 88.60 6.94 91.69 90.04 88.41 92.79 88.63 88.77 4.00 89.31 87.56 92.05 88.86 90.83 90.45 89. In the 40 repetitions of the confidence interval procedure described above.23 87.37 90.16 87.90 88.
by red color) do not contain the value of µ, whereas the remaining 38 intervals (or 95% of the 40 intervals) do contain the true value of µ.

Note that, in actual practice, you would not know the true value of µ and you would not perform this repeated sampling; rather you would select a single random sample and construct the associated 95% confidence interval. The one confidence interval you form may or may not contain µ, but you can be fairly sure it does because of your confidence in the statistical procedure, the basis for which was illustrated in this example.

Suppose you want to construct an interval that you believe will contain µ with some degree of confidence other than 95%; in other words, you want to choose a confidence coefficient other than .95.

Definition 7.3 The confidence coefficient is the proportion of times that a confidence interval encloses the true value of the population parameter if the confidence interval procedure is used repeatedly a very large number of times.

The first step in constructing a confidence interval with any desired confidence coefficient is to notice from Figure 7.1 that, for a 95% confidence interval, the confidence coefficient of 95% is equal to the total area under the sampling distribution (1.00), less .05 of the area, which is divided equally between the two tails of the distribution. Thus, each tail has an area of .025. Second, consider that the tabulated value of z (Table 1 of Appendix C) that cuts off an area of .025 in the right tail of the standard normal distribution is 1.96 (see Figure 7.3). The value z = 1.96 is also the distance, in terms of standard deviations, that $\bar{x}$ is from each endpoint of the 95% confidence interval. By assigning a confidence coefficient other than .95 to a confidence interval, we change the area under the sampling distribution between the endpoints of the interval, which in turn changes the tail area associated with z. Thus, this z-value provides the key to constructing a confidence interval with any desired confidence coefficient.

Figure 7.3 Tabulated z-value corresponding to a tail area of .025

Definition 7.4 We define $z_{\alpha/2}$ to be the z-value such that an area of α/2 lies to its right (see Figure 7.4).
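The repeated-sampling interpretation behind Definition 7.3 can be illustrated by simulation. The sketch below is only an illustration under stated assumptions: the population is a synthetic stand-in for the 823 heights (normal values generated around the reported mean 89.67 cm and standard deviation 4.99 cm, not the Appendix B data), and the function names are hypothetical. It builds the interval $\bar{x} \pm 1.96\,s/\sqrt{n}$ from many samples of size 30 and reports the proportion that enclose µ.

```python
import random
from math import sqrt
from statistics import mean, stdev

rng = random.Random(7)

# Hypothetical stand-in for the 823 heights of Appendix B (synthetic, not the real data).
population = [rng.gauss(89.67, 4.99) for _ in range(823)]
mu = mean(population)

def interval_95(sample):
    """Large-sample 95% interval: xbar +/- 1.96 * s / sqrt(n)."""
    half_width = 1.96 * stdev(sample) / sqrt(len(sample))
    centre = mean(sample)
    return centre - half_width, centre + half_width

def coverage(repetitions, n=30):
    """Proportion of repeated-sample intervals that enclose mu."""
    hits = 0
    for _ in range(repetitions):
        low, high = interval_95(rng.sample(population, n))
        hits += low <= mu <= high
    return hits / repetitions

print(coverage(40))     # 40 repetitions, as in Example 7.4
print(coverage(5000))   # many repetitions: the proportion settles near .95
```

With only 40 repetitions the proportion can move noticeably from run to run; with several thousand it stays close to the confidence coefficient.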
Figure 7.4 Locating $z_{\alpha/2}$ on the standard normal curve

Now, if an area of α/2 lies beyond $z_{\alpha/2}$ in the right tail of the standard normal (z) distribution, then an area of α/2 lies to the left of $-z_{\alpha/2}$ in the left tail (Figure 7.4) because of the symmetry of the distribution. The remaining area, (1 − α), is equal to the confidence coefficient; that is, the probability that $\bar{x}$ falls within $z_{\alpha/2}$ standard deviations of µ is (1 − α). Thus, a large-sample confidence interval for µ, with confidence coefficient equal to (1 − α), is given by

\[ \bar{x} \pm z_{\alpha/2}\,\sigma_{\bar{x}} \]

Example 7.5 In statistical problems using confidence interval techniques, a very common confidence coefficient is .90. Determine the value of $z_{\alpha/2}$ that would be used in constructing a 90% confidence interval for a population mean based on a large sample.

Solution For a confidence coefficient of .90, we have

1 − α = .90
α = .10
α/2 = .05

and we need to obtain the value $z_{\alpha/2} = z_{.05}$ that locates an area of .05 in the upper tail of the standard normal distribution. Since the total area to the right of 0 is .50, $z_{.05}$ is the value such that the area between 0 and $z_{.05}$ is .50 − .05 = .45. From Table 1 of Appendix C, we find that $z_{.05} = 1.645$ (see Figure 7.5). We conclude that a large-sample 90% confidence interval for a population mean is given by

\[ \bar{x} \pm 1.645\,\sigma_{\bar{x}} \]

In Table 7.3, we present the values of $z_{\alpha/2}$ for the most commonly used confidence coefficients.
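The table lookup of Example 7.5 and the interval $\bar{x} \pm z_{\alpha/2}\,\sigma_{\bar{x}}$ can be sketched in Python as follows. This is a minimal sketch, not part of the course software: the function names `z_critical` and `large_sample_ci` are hypothetical, the inverse normal distribution from the standard library replaces Table 1 of Appendix C, and s is used in place of σ, which assumes n ≥ 30. The numerical illustration uses the sample statistics of Example 7.3 ($\bar{x}$ = 88.62 cm, s = 4.09 cm, n = 30).

```python
from math import sqrt
from statistics import NormalDist

def z_critical(confidence):
    """z-value with an area of alpha/2 to its right, where alpha = 1 - confidence."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)

def large_sample_ci(xbar, s, n, confidence):
    """xbar +/- z_{alpha/2} * s / sqrt(n), with s approximating sigma (assumes n >= 30)."""
    half_width = z_critical(confidence) * s / sqrt(n)
    return xbar - half_width, xbar + half_width

# Critical values for commonly used confidence coefficients.
for confidence in (0.90, 0.95, 0.98, 0.99):
    print(confidence, round(z_critical(confidence), 3))

# Example 7.3: xbar = 88.62 cm, s = 4.09 cm, n = 30, 95% confidence.
low, high = large_sample_ci(88.62, 4.09, 30, 0.95)
print(round(low, 2), round(high, 2))
```

The computed critical values carry more decimals than a printed z table, so they can differ from tabulated entries in the third decimal place.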
Table 7.3 Commonly used confidence coefficients

Confidence coefficient (1 − α)    α/2     z(α/2)
.90                               .050    1.645
.95                               .025    1.960
.98                               .010    2.330
.99                               .005    2.58

Figure 7.5 Location of $z_{\alpha/2}$ for Example 7.5 ($z_{.05} = 1.645$)

A summary of the large-sample confidence interval procedure for estimating a population mean appears in the next box.

Large-sample (1 − α)100% confidence interval for a population mean, µ

\[ \bar{x} \pm z_{\alpha/2}\,\sigma_{\bar{x}} = \bar{x} \pm z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} \]

where $z_{\alpha/2}$ is the z-value that locates an area of α/2 to its right, σ is the standard deviation of the population from which the sample was selected, n is the sample size, and $\bar{x}$ is the value of the sample mean.
Assumption: n ≥ 30
[When the value of σ is unknown, the sample standard deviation s may be used to approximate σ in the formula for the confidence interval. The approximation is generally quite satisfactory when n ≥ 30.]

Example 7.6 Suppose that in the previous year all graduates at a certain university reported the number of hours spent on their studies during a certain week; the average was 40 hours and the standard deviation was 10 hours. Suppose we want to investigate whether students now are studying more than they used to. This year a random sample of n = 50 students is selected. Each student in the sample was interviewed about the number of hours spent on his/her study. This experiment produced the following statistics:

$\bar{x}$ = 41.5 hours    s = 9.2 hours

Estimate µ, the mean number of hours spent on study, using a 99% confidence interval. Interpret the interval in terms of the problem.

Solution The general form of a large-sample 99% confidence interval for µ is

\[ \bar{x} \pm 2.58\,\frac{\sigma}{\sqrt{n}} \approx \bar{x} \pm 2.58\,\frac{s}{\sqrt{n}} \]

Substitution of the values of the sample statistics into the general formula for a 99% confidence interval for µ yields

\[ 41.5 \pm 2.58\,\frac{9.2}{\sqrt{50}} = 41.5 \pm 3.36 \]

or (38.14, 44.86).

We can be 99% confident that the interval (38.14, 44.86) encloses the true mean weekly time spent on study this year. Since all the values in the interval fall above 38 hours and below 45 hours, we conclude that there is a tendency for students now to spend more than 6 hours and less than 7.5 hours per day on average (suppose that they don't study on Sunday).

Example 7.7 Refer to Example 7.6.
a. Using the sample information in Example 7.6, construct a 95% confidence interval for the mean weekly time spent on study of all students in the university this year.
b. For a fixed sample size, how is the width of the confidence interval related to the confidence coefficient?

Solution
a. The form of a large-sample 95% confidence interval for a population mean µ is

\[ \bar{x} \pm 1.96\,\frac{\sigma}{\sqrt{n}} \approx \bar{x} \pm 1.96\,\frac{s}{\sqrt{n}} = 41.5 \pm 1.96\,\frac{9.2}{\sqrt{50}} = 41.5 \pm 2.55 \]

or (38.95, 44.05).
b. The 99% confidence interval for µ was determined in Example 7.6 to be (38.14, 44.86). The 95% confidence interval, obtained in this example and based on the same sample information, is narrower than the 99% confidence interval. This relationship holds in general, as stated in the next box.

Relationship between width of confidence interval and confidence coefficient
For a given sample size, the width of the confidence interval for a parameter increases as the confidence coefficient increases. Intuitively, the interval must become wider for us to have greater confidence that it contains the true parameter value.

Example 7.8 Refer to Example 7.6.
a. Assume that the given values of the statistics $\bar{x}$ and s were based on a sample of size n = 100 instead of a sample of size n = 50. Construct a 99% confidence interval for µ, the population mean weekly time spent on study of all students in the university this year.
b. For a fixed confidence coefficient, how is the width of the confidence interval related to the sample size?

Solution
a. Substitution of the values of the sample statistics into the general formula for a 99% confidence interval for µ yields

\[ \bar{x} \pm 2.58\,\frac{\sigma}{\sqrt{n}} \approx \bar{x} \pm 2.58\,\frac{s}{\sqrt{n}} = 41.5 \pm 2.58\,\frac{9.2}{\sqrt{100}} = 41.5 \pm 2.37 \]

or (39.13, 43.87).
b. The 99% confidence interval based on a sample of size n = 100, constructed in part a, is narrower than the 99% confidence interval based on a sample of size n = 50, constructed
in Example 7.6. This will also hold in general, as stated in the box.

Relationship between width of confidence interval and sample size
For a fixed confidence coefficient, the width of the confidence interval decreases as the sample size increases. That is, larger samples generally provide more information about the target population than do smaller samples.

In this section we introduced the concepts of point estimation of the population mean µ, based on large samples. The general theory appropriate for the estimation of µ also carries over to the estimation of other population parameters. Hence, in subsequent sections we will present only the point estimate, its sampling distribution, the general form of a confidence interval for the parameter of interest, and any assumptions required for the validity of the procedure.

7.3 Estimation of a population mean: small sample case
In the previous section, we discussed the estimation of a population mean based on large samples (n ≥ 30). However, time or cost limitations may often restrict the number of sample observations that may be obtained, so that the estimation procedures of Section 7.2 would not be applicable.
With small samples, the following two problems arise:
1. Since the Central Limit Theorem applies only to large samples, we are not able to assume that the sampling distribution of $\bar{x}$ is approximately normal. For small samples, the sampling distribution of $\bar{x}$ depends on the particular form of the relative frequency distribution of the population being sampled.
2. The sample standard deviation s may not be a satisfactory approximation to the population standard deviation σ if the sample size is small.

Fortunately, we may proceed with estimation techniques based on small samples if we can make the following assumption:

Assumption required for estimating µ based on small samples (n < 30)
The population from which the sample is selected has an approximate normal distribution.

If this assumption is valid, then we may again use $\bar{x}$ as a point estimate for µ, and the general form of a small-sample confidence interval for µ is as shown in the next box.

Small-sample confidence interval for µ

\[ \bar{x} \pm t_{\alpha/2}\,\frac{s}{\sqrt{n}} \]

where the distribution of t is based on (n − 1) degrees of freedom.

Upon comparing this to the large-sample confidence interval for µ, you will observe that the sample standard deviation s replaces the population standard deviation σ. Also, the sampling distribution upon which the confidence interval is based is known as a Student's t-distribution.
Consequently, we must replace the value of $z_{\alpha/2}$ used in a large-sample confidence interval by a value obtained from the t-distribution.

The t-distribution is very much like the z-distribution. In particular, both are symmetric, bell-shaped, and have a mean of 0. However, the distribution of t depends on a quantity called its degrees of freedom (df), which is equal to (n − 1) when estimating a population mean based on a small sample of size n. Intuitively, we can think of the number of degrees of freedom as the amount of information available for estimating, in addition to µ, the unknown quantity σ². Table 2 of Appendix C, a portion of which is reproduced in Table 7.4, gives the value of $t_{\alpha}$ that locates an area of α in the upper tail of the t-distribution for various values of α and for degrees of freedom ranging from 1 to 120.

Table 7.4 Some values for Student's t-distribution

Degrees of
freedom    t.100    t.050    t.025     t.010     t.005     t.001     t.0005
 1         3.078    6.314    12.706    31.821    63.657    318.31    636.62
 2         1.886    2.920     4.303     6.965     9.925    22.326    31.598
 3         1.638    2.353     3.182     4.541     5.841    10.213    12.924
 4         1.533    2.132     2.776     3.747     4.604     7.173     8.610
 5         1.476    2.015     2.571     3.365     4.032     5.893     6.869
 6         1.440    1.943     2.447     3.143     3.707     5.208     5.959
 7         1.415    1.895     2.365     2.998     3.499     4.785     5.408
 8         1.397    1.860     2.306     2.896     3.355     4.501     5.041
 9         1.383    1.833     2.262     2.821     3.250     4.297     4.781
10         1.372    1.812     2.228     2.764     3.169     4.144     4.587
11         1.363    1.796     2.201     2.718     3.106     4.025     4.437
12         1.356    1.782     2.179     2.681     3.055     3.930     4.318
13         1.350    1.771     2.160     2.650     3.012     3.852     4.221
14         1.345    1.761     2.145     2.624     2.977     3.787     4.140
15         1.341    1.753     2.131     2.602     2.947     3.733     4.073

Example 7.9 Use Table 7.4 to determine the t-value that would be used in constructing a 95% confidence interval for µ based on a sample of size n = 14.

Solution For a confidence coefficient of .95, we have

1 − α = .95
α = .05
α/2 = .025

We require the value of $t_{.025}$ for a t-distribution based on (n − 1) = (14 − 1) = 13 degrees of freedom. In Table 7.4, at the intersection of the column labeled $t_{.025}$ and the row corresponding to df = 13, we find the entry 2.160 (see Figure 7.6). Hence, a 95% confidence interval for µ, based on a sample of size n = 14 observations, would be given by

\[ \bar{x} \pm 2.160\,\frac{s}{\sqrt{14}} \]

Figure 7.6 Location of $t_{.025}$ for Example 7.9 (t-distribution with 13 df, $t_{.025} = 2.160$)

At this point, the reasoning for the arbitrary cutoff point of n = 30 for distinguishing between large and small samples may be better understood. Observe that the values in the last row of Table 2 in Appendix C (corresponding to df = ∞) are the values from the standard normal z-distribution. This phenomenon occurs because, as the sample size increases, the t-distribution becomes more like the z-distribution. By the time n reaches 30, i.e., df = 29, there is very little difference between tabulated values of t and z.

Before concluding this section, we will comment on the assumption that the sampled population is normally distributed. In the real world, we rarely know whether a sampled population has an exact normal distribution. However, empirical studies indicate that moderate departures from this assumption do not seriously affect the confidence coefficients for small-sample confidence intervals. As a consequence, the definition of the small-sample confidence interval given in this section is frequently used by experimenters when estimating the population mean of a non-normal distribution, as long as the distribution is bell-shaped and only moderately skewed.

7.4 Estimation of a population proportion
We will consider now the method for estimating the binomial proportion of successes, that is, the proportion of elements in a population that have a certain characteristic. For example, a demographer may be interested in the proportion of a city's residents who are married; a physician may be interested in the proportion of men who are smokers. How would you estimate a binomial proportion p based on information contained in a sample from a population?

Example 7.10 A commission on crime is interested in estimating the proportion of crimes related to firearms in an area with one of the highest crime rates in a country. The commission selects a random sample of 300 files of recently committed crimes in the area and determines that a firearm was reportedly used in 180 of them. Estimate the true
proportion p of all crimes committed in the area in which some type of firearm was reportedly used.

Solution A logical candidate for a point estimate of the population proportion p is the proportion of observations in the sample that have the characteristic of interest (called a "success"); we will call this sample proportion $\hat{p}$ (read "p hat"). In this example, the sample proportion of crimes related to firearms is given by

\[ \hat{p} = \frac{\text{Number of crimes in sample in which a firearm was reportedly used}}{\text{Total number of crimes in sample}} = \frac{180}{300} = .60 \]

That is, 60% of the crimes in the sample were related to firearms; the value $\hat{p} = .60$ serves as our point estimate of the population proportion p.

To assess the reliability of the point estimate $\hat{p}$, we need to know its sampling distribution. This information may be derived by an application of the Central Limit Theorem. Properties of the sampling distribution of $\hat{p}$ are given in the next box.

Sampling distribution of $\hat{p}$
For sufficiently large samples, the sampling distribution of $\hat{p}$ is approximately normal, with

Mean: $\mu_{\hat{p}} = p$
and
Standard deviation: $\sigma_{\hat{p}} = \sqrt{\dfrac{pq}{n}}$

where q = 1 − p.

A large-sample confidence interval for p may be constructed by using a procedure analogous to that used for estimating a population mean.

Large-sample (1 − α)100% confidence interval for a population proportion, p

\[ \hat{p} \pm z_{\alpha/2}\,\sigma_{\hat{p}} \approx \hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}} \]

where $\hat{p}$ is the sample proportion of observations with the characteristic of interest, and $\hat{q} = 1 - \hat{p}$.

Note that we must substitute $\hat{p}$ and $\hat{q}$ into the formula for $\sigma_{\hat{p}} = \sqrt{pq/n}$ in order to construct the confidence interval. This approximation will be valid as long as the sample size n is sufficiently large.

Example 7.11 Refer to Example 7.10. Construct a 95% confidence interval for p, the population proportion of crimes committed in the area in which some type of firearm is reportedly used.
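The large-sample interval for a proportion given in the box above can be sketched in Python. This is an illustrative sketch only: the function name `proportion_ci` is hypothetical, the critical value comes from the standard library's inverse normal distribution instead of Table 1 of Appendix C, and the sample is assumed large enough for the normal approximation. It is applied to the counts of Example 7.10 (180 of 300 files).

```python
from math import sqrt
from statistics import NormalDist

def proportion_ci(successes, n, confidence=0.95):
    """Return (p_hat, lower, upper) for the large-sample interval for a proportion."""
    p_hat = successes / n
    q_hat = 1 - p_hat
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # z_{alpha/2}
    half_width = z * sqrt(p_hat * q_hat / n)             # z * estimated sigma of p_hat
    return p_hat, p_hat - half_width, p_hat + half_width

# Example 7.10: a firearm was reportedly used in 180 of 300 sampled crimes.
p_hat, low, high = proportion_ci(180, 300)
print(round(p_hat, 2), round(low, 2), round(high, 2))
```

Rounded to two decimals, the endpoints can be compared directly with a hand calculation that uses z = 1.96.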
Solution For a confidence interval of .95, we have $1 - \alpha = .95$, $\alpha = .05$, $\alpha/2 = .025$, and the required z-value is $z_{.025} = 1.96$. In Example 7.10, we obtained $\hat{p} = 180/300 = .60$. Thus, $\hat{q} = 1 - \hat{p} = 1 - .60 = .40$. Substitution of these values into the formula for an approximate confidence interval for p yields

$\hat{p} \pm z_{\alpha/2}\sqrt{\hat{p}\hat{q}/n} = .60 \pm 1.96\sqrt{(.60)(.40)/300} = .60 \pm .06$

or (.54, .66). Note that the approximation is valid since the interval does not contain 0 or 1.

We are 95% confident that the interval from .54 to .66 contains the true proportion of crimes committed in the area that are related to firearms. That is, in repeated construction of 95% confidence intervals, 95% of all samples would produce confidence intervals that enclose p.

It should be noted that small-sample procedures are available for the estimation of a population proportion p. We will not discuss details here, however, because most surveys in actual practice use samples that are large enough to employ the procedure of this section.

7.5 Estimation of the difference between two population means: Independent samples

In Section 7.2, we learned how to estimate the parameter µ based on a large sample from a single population. We now proceed to a technique for using the information in two samples to estimate the difference between two population means. For example, we may want to compare the mean heights of the children in province No. 18 and in province No. 14 using the observations in Appendix B. The technique to be presented is a straightforward extension of that used for large-sample estimation of a single population mean.

Example 7.12 To estimate the difference between the mean heights for all children of province No. 18 and province No. 14, use the following information:
1. A random sample of 30 heights of children in province No. 18 produced a sample mean of 91.72 cm and a standard deviation of 4.50 cm.
2. A random sample of 40 heights of children in province No. 14 produced a sample mean of 86.67 cm and a standard deviation of 3.88 cm.
Calculate a point estimate for the difference between the mean heights of children in the two provinces.

Solution We will let subscript 1 refer to province No. 18 and subscript 2 to province No. 14. We will also define the following notation:
$\mu_1$ = Population mean height of all children of province No. 18.
$\mu_2$ = Population mean height of all children of province No. 14.
Similarly, let $\bar{x}_1$ and $\bar{x}_2$ denote the respective sample means; $s_1$ and $s_2$, the respective sample standard deviations; and $n_1$ and $n_2$, the respective sample sizes. The given information may be summarized as in Table 7.5.

Table 7.5 Summary information for Example 7.12
                            Province No. 18       Province No. 14
Sample size                 n1 = 30               n2 = 40
Sample mean                 x̄1 = 91.72 cm         x̄2 = 86.67 cm
Sample standard deviation   s1 = 4.50 cm          s2 = 3.88 cm

To estimate $(\mu_1 - \mu_2)$, it seems sensible to use the difference between the sample means $(\bar{x}_1 - \bar{x}_2) = (91.72 - 86.67) = 5.05$ as our point estimate of the difference between two population means. The properties of the point estimate $(\bar{x}_1 - \bar{x}_2)$ are summarized by its sampling distribution shown in Figure 7.8.

Figure 7.8 Sampling distribution of $(\bar{x}_1 - \bar{x}_2)$

Sampling distribution of $(\bar{x}_1 - \bar{x}_2)$
For sufficiently large sample sizes ($n_1$ and $n_2 \ge 30$), the sampling distribution of $(\bar{x}_1 - \bar{x}_2)$, based on independent random samples from two populations, is approximately normal with
Mean: $\mu_{(\bar{x}_1 - \bar{x}_2)} = (\mu_1 - \mu_2)$
Standard deviation: $\sigma_{(\bar{x}_1 - \bar{x}_2)} = \sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}$
where $\sigma_1$ and $\sigma_2$ are the standard deviations of the two populations from which the samples were selected.

As was the case with large-sample estimation of a single population mean, the requirement of large sample sizes enables us to apply the Central Limit Theorem to obtain the sampling distribution of $(\bar{x}_1 - \bar{x}_2)$; it also suffices to use $s_1^2$ and $s_2^2$ as approximations to the respective population variances, $\sigma_1^2$ and $\sigma_2^2$.

The procedure for forming a large-sample confidence interval for $(\mu_1 - \mu_2)$ appears in the accompanying box.
Large-sample $(1-\alpha)100\%$ confidence interval for $(\mu_1 - \mu_2)$
$(\bar{x}_1 - \bar{x}_2) \pm z_{\alpha/2}\,\sigma_{(\bar{x}_1 - \bar{x}_2)} = (\bar{x}_1 - \bar{x}_2) \pm z_{\alpha/2}\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}} \approx (\bar{x}_1 - \bar{x}_2) \pm z_{\alpha/2}\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}$
(Note: We have used the sample variances $s_1^2$ and $s_2^2$ as approximations to the corresponding population parameters.)

The assumptions upon which the above procedure is based are the following:

Assumptions required for large-sample estimation of $(\mu_1 - \mu_2)$
1. The two random samples are selected in an independent manner from the target populations. That is, the choice of elements in one sample does not affect, and is not affected by, the choice of elements in the other sample.
2. The sample sizes $n_1$ and $n_2$ are sufficiently large (at least 30).

Example 7.13 Refer to Example 7.12. Construct a 95% confidence interval for $(\mu_1 - \mu_2)$, the difference between mean heights of all children in province No. 18 and province No. 14. Interpret the interval.

Solution The general form of a 95% confidence interval for $(\mu_1 - \mu_2)$, based on large samples from the target populations, is given by

$(\bar{x}_1 - \bar{x}_2) \pm z_{\alpha/2}\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}$

Recall that $z_{.025} = 1.96$ and use the information in Table 7.5 to make the following substitutions to obtain the desired confidence interval:

$(91.72 - 86.67) \pm 1.96\sqrt{\dfrac{\sigma_1^2}{30} + \dfrac{\sigma_2^2}{40}} \approx (91.72 - 86.67) \pm 1.96\sqrt{\dfrac{(4.50)^2}{30} + \dfrac{(3.88)^2}{40}} \approx 5.05 \pm 2.01$

or (3.04, 7.06).

The use of this method of estimation produces confidence intervals that will enclose $(\mu_1 - \mu_2)$, the difference between population means, 95% of the time. Hence, we can be reasonably sure
that the mean height of children in province No. 18 was between 3.04 cm and 7.06 cm higher than the mean height of children in province No. 14 at the survey time.

When estimating the difference between two population means, based on small samples from each population, we must make specific assumptions about the relative frequency distributions of the two populations, as indicated in the box.

Assumptions required for small-sample estimation of $(\mu_1 - \mu_2)$
1. Both of the populations from which the samples are selected have relative frequency distributions that are approximately normal.
2. The variances $\sigma_1^2$ and $\sigma_2^2$ of the two populations are equal.
3. The random samples are selected in an independent manner from the two populations.

When these assumptions are satisfied, we may use the procedure specified in the next box to construct a confidence interval for $(\mu_1 - \mu_2)$, based on small samples ($n_1$ and $n_2 < 30$) from the respective populations.

Small-sample $(1-\alpha)100\%$ confidence interval for $(\mu_1 - \mu_2)$
$(\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2}\sqrt{s_p^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}$
where
$s_p^2 = \dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$
and the value of $t_{\alpha/2}$ is based on $(n_1 + n_2 - 2)$ degrees of freedom.

Since we assume that the two populations have equal variances (i.e., $\sigma_1^2 = \sigma_2^2 = \sigma^2$), we construct an estimate of $\sigma^2$ based on the information contained in both samples. This pooled estimate is denoted by $s_p^2$ and is computed as in the previous box.

7.6 Estimation of the difference between two population means: Matched pairs

The procedures for estimating the difference between two population means presented in Section 7.5 were based on the assumption that the samples were randomly and independently selected from the target populations. Sometimes we can obtain more information about the difference between population means, $(\mu_1 - \mu_2)$, by selecting paired observations.

For example, suppose we want to compare two methods for teaching reading skills to first graders using a sample of ten students with each method. The best method of sampling would be to match the first graders in pairs according to IQ and other factors that might affect reading
achievement. For each pair, one member would be randomly selected to be taught by method 1; the other member would be assigned to a class taught by method 2. Then the differences between matched pairs of achievement test scores should provide a clearer picture of the difference in achievement for the two reading methods, because the matching would tend to cancel the effects of the factors that formed the basis of the matching.

In the following boxes, we give the assumptions required and the procedure to be used for estimating the difference between two population means based on matched-pairs data.

Assumptions required for estimation of $(\mu_1 - \mu_2)$: Matched pairs
1. The sample paired observations are randomly selected from the target population of paired observations.
2. The population of paired differences is normally distributed.

Small-sample $(1-\alpha)100\%$ confidence interval for $\mu_d = (\mu_1 - \mu_2)$
Let $d_1, d_2, \ldots, d_n$ represent the differences between the pairwise observations in a random sample of n matched pairs. Then the small-sample confidence interval for $\mu_d = (\mu_1 - \mu_2)$ is
$\bar{d} \pm t_{\alpha/2}\,\dfrac{s_d}{\sqrt{n}}$
where $\bar{d}$ is the mean of the n sample differences, $s_d$ is their standard deviation, and $t_{\alpha/2}$ is based on $(n - 1)$ degrees of freedom.

Example 7.14 Suppose that the n = 10 pairs of achievement test scores are given in Table 7.7. Find a 95% confidence interval for the difference in mean achievement, $\mu_d = (\mu_1 - \mu_2)$.

Table 7.7 Reading achievement test scores for Example 7.14
Student pair   Method 1 score   Method 2 score   Pair difference
 1             78               71                 7
 2             63               44                19
 3             72               61                11
 4             89               84                 5
 5             91               74                17
 6             49               51                -2
 7             68               55                13
 8             76               60                16
 9             85               77                 8
10             55               39                16

Solution The differences between matched pairs of reading achievement test scores are computed as
d = (method 1 score - method 2 score)
The mean, variance, and standard deviation of the differences are

$\bar{d} = \dfrac{\sum d}{n} = \dfrac{110}{10} = 11.0$
$s_d^2 = \dfrac{\sum d^2 - \dfrac{(\sum d)^2}{n}}{n - 1} = \dfrac{1{,}594 - \dfrac{(110)^2}{10}}{9} = \dfrac{1{,}594 - 1{,}210}{9} = 42.6667$

$s_d = \sqrt{42.67} = 6.53$

The value of $t_{.025}$, based on $(n - 1) = 9$ degrees of freedom, is given in Table 2 of Appendix C as $t_{.025} = 2.262$. Substituting these values into the formula for the confidence interval, we obtain

$\bar{d} \pm t_{.025}\,\dfrac{s_d}{\sqrt{n}} = 11.0 \pm 2.262\,\dfrac{6.53}{\sqrt{10}} = 11.0 \pm 4.7$

or (6.3, 15.7). We estimate, with 95% confidence, that the difference between mean reading achievement test scores for methods 1 and 2 falls within the interval from 6.3 to 15.7. Since all the values within the interval are positive, method 1 seems to produce a mean achievement test score that is substantially higher than the mean score for method 2.

7.7 Estimation of the difference between two population proportions

This section extends the method of Section 7.4 to the case in which we want to estimate the difference between two population proportions. For example, we may be interested in comparing the proportions of married and unmarried persons who are overweight.

Example 7.15 Suppose that there were two surveys, one carried out in 1990 and another in 1998. In both surveys, random samples of 1,400 adults in a country were asked whether they were satisfied with their life. The results of the surveys are reported in Table 7.8. Construct a point estimate for the difference between the proportions of adults in the country in 1990 and in 1998 who were satisfied with their life.

Table 7.8 Proportions of two samples for Example 7.15
                                                                   1990          1998
Number surveyed                                                    n1 = 1,400    n2 = 1,400
Number in sample who said they were satisfied with their life     462           674

Solution We define some notation:
$p_1$ = Population proportion of adults who said that they were satisfied with their life in 1990.
$p_2$ = Population proportion of adults who said that they were satisfied with their life in 1998.
As a point estimate of $(p_1 - p_2)$, we will use the difference between the corresponding sample proportions, $(\hat{p}_1 - \hat{p}_2)$, where

$\hat{p}_1 = \dfrac{\text{Number of adults in 1990 who said that they were satisfied with their life}}{\text{Number of adults surveyed in 1990}} = \dfrac{462}{1{,}400} = .33$
and

$\hat{p}_2 = \dfrac{\text{Number of adults in 1998 who said that they were satisfied with their life}}{\text{Number of adults surveyed in 1998}} = \dfrac{674}{1{,}400} = .48$

Thus, the point estimate of $(p_1 - p_2)$ is

$(\hat{p}_1 - \hat{p}_2) = .33 - .48 = -.15$

To judge the reliability of the point estimate $(\hat{p}_1 - \hat{p}_2)$, we need to know the characteristics of its performance in repeated independent sampling from two populations. This information is provided by the sampling distribution of $(\hat{p}_1 - \hat{p}_2)$, shown in the next box.

Sampling distribution of $(\hat{p}_1 - \hat{p}_2)$
For sufficiently large sample sizes, $n_1$ and $n_2$, the sampling distribution of $(\hat{p}_1 - \hat{p}_2)$, based on independent random samples from two populations, is approximately normal with
Mean: $\mu_{(\hat{p}_1 - \hat{p}_2)} = (p_1 - p_2)$
and
Standard deviation: $\sigma_{(\hat{p}_1 - \hat{p}_2)} = \sqrt{\dfrac{p_1 q_1}{n_1} + \dfrac{p_2 q_2}{n_2}}$
where $q_1 = 1 - p_1$ and $q_2 = 1 - p_2$.

It follows that a large-sample confidence interval for $(p_1 - p_2)$ may be obtained as shown in the box.

Large-sample $(1-\alpha)100\%$ confidence interval for $(p_1 - p_2)$
$(\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2}\,\sigma_{(\hat{p}_1 - \hat{p}_2)} \approx (\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2}\sqrt{\dfrac{\hat{p}_1\hat{q}_1}{n_1} + \dfrac{\hat{p}_2\hat{q}_2}{n_2}}$
where $\hat{p}_1$ and $\hat{p}_2$ are the sample proportions of observations with the characteristics of interest.
Assumption: The samples are sufficiently large so that the approximation is valid. As a general rule of thumb we will require that the intervals
$\hat{p}_1 \pm 2\sqrt{\dfrac{\hat{p}_1\hat{q}_1}{n_1}}$ and $\hat{p}_2 \pm 2\sqrt{\dfrac{\hat{p}_2\hat{q}_2}{n_2}}$
do not contain 0 or 1.
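The boxed interval and its rule-of-thumb check can be sketched in Python as follows. This is an illustration under stated assumptions: the names `two_proportion_ci` and `large_enough` are hypothetical, the z-value comes from `statistics.NormalDist` (Python 3.8 or later) rather than Table 1 of Appendix C, and the sample proportions are not rounded to two decimals before use. The usage lines take the counts of Table 7.8 (462 and 674 satisfied adults out of 1,400 in each survey).

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_ci(x1, n1, x2, n2, confidence=0.95):
    """Large-sample confidence interval for (p1 - p2) from two independent samples."""
    p1, p2 = x1 / n1, x2 / n2
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    # Estimated standard deviation of (p1_hat - p2_hat)
    se = sqrt(p1 * (1.0 - p1) / n1 + p2 * (1.0 - p2) / n2)
    diff = p1 - p2
    return diff - z * se, diff + z * se


def large_enough(x, n):
    """Rule of thumb: p_hat +/- 2*sqrt(p_hat*q_hat/n) must not contain 0 or 1."""
    p = x / n
    half = 2.0 * sqrt(p * (1.0 - p) / n)
    return 0.0 < p - half and p + half < 1.0


# Counts from Table 7.8 (1990 survey first, 1998 survey second)
ok = large_enough(462, 1400) and large_enough(674, 1400)
low, high = two_proportion_ci(462, 1400, 674, 1400)
print(ok, round(low, 3), round(high, 3))
```

Because the proportions are kept unrounded here (674/1,400 is about .481 rather than .48), the endpoints can differ in the third decimal place from a hand calculation that uses .33 and .48.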
Example 7.16 Refer to Example 7.15. Estimate the difference between the proportions of the adults in this country in 1990 and in 1998 who said that they were satisfied with their life, using a 95% confidence interval.

Solution From Example 7.15, we have $n_1 = n_2 = 1{,}400$, $\hat{p}_1 = .33$ and $\hat{p}_2 = .48$. Thus, $\hat{q}_1 = 1 - .33 = .67$ and $\hat{q}_2 = 1 - .48 = .52$. Note that the intervals

$\hat{p}_1 \pm 2\sqrt{\dfrac{\hat{p}_1\hat{q}_1}{n_1}} = .33 \pm 2\sqrt{\dfrac{(.33)(.67)}{1{,}400}} = .33 \pm .025$

$\hat{p}_2 \pm 2\sqrt{\dfrac{\hat{p}_2\hat{q}_2}{n_2}} = .48 \pm 2\sqrt{\dfrac{(.48)(.52)}{1{,}400}} = .48 \pm .027$

do not contain 0 or 1. Thus, we can apply the large-sample confidence interval for $(p_1 - p_2)$.

The 95% confidence interval is

$(\hat{p}_1 - \hat{p}_2) \pm z_{.025}\sqrt{\dfrac{\hat{p}_1\hat{q}_1}{n_1} + \dfrac{\hat{p}_2\hat{q}_2}{n_2}} = (.33 - .48) \pm 1.96\sqrt{\dfrac{(.33)(.67)}{1{,}400} + \dfrac{(.48)(.52)}{1{,}400}} = -.15 \pm .036$

or (-.186, -.114). Thus we estimate, with 95% confidence, that the interval (-.186, -.114) encloses the difference $(p_1 - p_2)$. It appears that the proportion of adults who said that they were satisfied with their life was between 11.4 and 18.6 percentage points higher in 1998 than in 1990.

7.8 Choosing the sample size

Before constructing a confidence interval for a parameter of interest, we will have to decide on the number n of observations to be included in a sample. Should we sample n = 10 observations, n = 20, or n = 100? To answer this question we need to decide how wide a confidence interval we are willing to tolerate and the measure of confidence, that is, the confidence coefficient, that we wish to place in it. The following example will illustrate the method for determining the appropriate sample size for estimating a population mean.

Example 7.17 A mail-order house wants to estimate the mean length of time between shipment of an order and receipt by the customer. The management plans to randomly sample n orders and determine, by telephone, the number of days between shipment and receipt for each order. If management wants to estimate the mean shipping time correct to within .5 day with probability equal to .95, how many orders should be sampled?

Solution We will use $\bar{x}$, the sample mean of the n measurements, to estimate µ, the mean shipping time. Its sampling distribution will be approximately normal and the probability that $\bar{x}$ will lie within

$1.96\,\sigma_{\bar{x}} = 1.96\,\dfrac{\sigma}{\sqrt{n}}$

of the mean shipping time, µ, is approximately .95 (see Figure 7.9). Therefore, we want to choose the sample size n so that $1.96\sigma/\sqrt{n}$ equals .5 day:
$1.96\,\dfrac{\sigma}{\sqrt{n}} = .5$

Figure 7.9 Sampling distribution of the sample mean, $\bar{x}$

To solve the equation $1.96\sigma/\sqrt{n} = .5$, we need to know the value of σ, a measure of variation of the population of all shipping times. Since σ is unknown, we must approximate its value using the standard deviation of some previous sample data or deduce an approximate value from other knowledge about the population. Suppose, for example, that we know almost all shipments will be delivered within 7 days. Then the population of shipping times might appear as shown in Figure 7.10.

Figure 7.10 Hypothetical relative frequency distribution of population of shipping times for Example 7.17

Figure 7.10 provides the information we need to find an approximation for σ. Since the Empirical Rule tells us that almost all the observations in a data set will fall within the interval $\mu \pm 3\sigma$, it follows that the range of a population is approximately $6\sigma$. If the range of the population of shipping times is 7 days, then $6\sigma = 7$ days and σ is approximately equal to 7/6 or 1.17 days.

The final step in determining the sample size is to substitute this approximate value of σ into the equation obtained previously and solve for n.
Thus, we have

$\dfrac{1.96(1.17)}{\sqrt{n}} = .5$ or $\sqrt{n} = \dfrac{1.96(1.17)}{.5} = 4.59$

Squaring both sides of this equation yields n = 21.07. We will follow the usual convention of rounding the calculated sample size upward. Therefore, the mail-order house needs to sample approximately n = 22 shipping times in order to estimate the mean shipping time correct to within .5 day with probability equal to .95.

In Example 7.17, we wanted our sample estimate to lie within .5 day of the true mean shipping time, µ, with probability .95, where .95 represents the confidence coefficient. We could calculate the sample size for a confidence coefficient other than .95 by changing the z-value in the equation. In general, if we want $\bar{x}$ to lie within a distance d of µ with probability $(1 - \alpha)$, we would solve for n in the equation

$z_{\alpha/2}\,\dfrac{\sigma}{\sqrt{n}} = d$

where the value of $z_{\alpha/2}$ is obtained from Table 1 of Appendix C. The solution is given by

$n = \left(\dfrac{z_{\alpha/2}\,\sigma}{d}\right)^2$

For example, for a confidence coefficient of .90, we would require a sample size of

$n = \left(\dfrac{1.64\,\sigma}{d}\right)^2$

Choosing the sample size for estimating a population mean µ to within d units with probability $(1 - \alpha)$
$n = \left(\dfrac{z_{\alpha/2}\,\sigma}{d}\right)^2$
(Note: The population standard deviation σ will usually have to be approximated.)

The procedures for determining the sample sizes needed to estimate a population proportion, the difference between two population means, or the difference between two population proportions are analogous to the procedure for determining the sample size for estimating a population mean.
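The sample-size rule for a population mean can be sketched in a few lines of Python. The function name `sample_size_for_mean` is a hypothetical choice, and the z-value is computed with `statistics.NormalDist` (Python 3.8 or later) rather than read from Table 1 of Appendix C, so it is slightly more precise than the rounded table value. The usage line repeats the mail-order calculation with σ approximated by 1.17 days and d = .5 day.

```python
from math import ceil
from statistics import NormalDist


def sample_size_for_mean(sigma, d, confidence=0.95):
    """Smallest n with z_{alpha/2} * sigma / sqrt(n) <= d, rounded upward."""
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    return ceil((z * sigma / d) ** 2)


# Example 7.17: sigma is approximated by 1.17 days, d = .5 day, confidence .95
n = sample_size_for_mean(1.17, 0.5)
print(n)  # 22
```

Note how sensitive the answer is to the approximation of σ: a larger assumed σ or a smaller tolerance d increases the required n quadratically.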
Choosing the sample size for estimating a population proportion p to within d units with probability $(1 - \alpha)$
$n = \left(\dfrac{z_{\alpha/2}}{d}\right)^2 pq$
where p is the value of the population proportion that we are attempting to estimate and $q = 1 - p$.
(Note: This technique requires previous estimates of p and q. If none are available, use p = q = .5 for a conservative choice of n.)

7.9 Estimation of a population variance

In the previous sections, we considered interval estimates for population means or proportions. In this optional section, we discuss a confidence interval for a population variance, $\sigma^2$.

Intuitively, it seems reasonable to use the sample variance $s^2$ to estimate $\sigma^2$ and to construct our confidence interval around this value. However, unlike sample means and sample proportions, the sampling distribution of the sample variance does not possess a normal z-distribution or a t-distribution. Rather, when certain assumptions are satisfied, the sampling distribution of $s^2$ (more precisely, of the scaled quantity $(n-1)s^2/\sigma^2$) possesses approximately a chi-square ($\chi^2$) distribution. The chi-square probability distribution, like the t-distribution, is characterized by a quantity called the degrees of freedom associated with the distribution. Several chi-square probability distributions with different degrees of freedom are shown in Figure 7.11. Unlike z- and t-distributions, the chi-square distribution is not symmetric about 0.

Throughout this section we will use the words chi-square and the Greek symbol $\chi^2$ interchangeably.

Example 7.18 Tabulated values of the $\chi^2$ distribution are given in Table 3 of Appendix C; a partial reproduction of this table is shown in Table 7.9. Entries in the table give an upper-tail value of $\chi^2$, call it $\chi^2_\alpha$, such that $P(\chi^2 > \chi^2_\alpha) = \alpha$. Find the tabulated value of $\chi^2$ corresponding to 9 degrees of freedom that cuts off an upper-tail area of .05.
Figure 7.11 Several chi-square probability distributions

Solution The value of $\chi^2$ that we seek appears (shaded) in the partial reproduction of Table 3 of Appendix C given in Table 7.9. The columns of the table identify the value of α associated with the tabulated value of $\chi^2_\alpha$ and the rows correspond to the degrees of freedom. For this example, we have df = 9 and α = .05. Thus, the tabulated value of $\chi^2$ corresponding to 9 degrees of freedom is $\chi^2_{.05} = 16.9190$.

Table 7.9 Reproduction of part of Table 3 of Appendix C
Degrees of freedom   χ².100     χ².050     χ².025     χ².010     χ².005
 1                    2.70554    3.84146    5.02389    6.63490    7.87944
 2                    4.60517    5.99147    7.37776    9.21034   10.59660
 3                    6.25139    7.81473    9.34840   11.34490   12.83810
 4                    7.77944    9.48773   11.14330   13.27670   14.86020
 5                    9.23635   11.07050   12.83250   15.08630   16.74960
 6                   10.64460   12.59160   14.44940   16.81190   18.54760
 7                   12.01700   14.06710   16.01280   18.47530   20.27770
 8                   13.36160   15.50730   17.53460   20.09020   21.95500
 9                   14.68370   16.91900   19.02280   21.66600   23.58930
10                   15.98710   18.30700   20.48310   23.20930   25.18820
11                   17.27500   19.67510   21.92000   24.72500   26.75690
12                   18.54940   21.02610   23.33670   26.21700   28.29950
13                   19.81190   22.36210   24.73560   27.68830   29.81940
14                   21.06420   23.68480   26.11900   29.14130   31.31930
15                   22.30720   24.99580   27.48840   30.57790   32.80130
16                   23.54180   26.29620   28.84540   31.99990   34.26720
17                   24.76900   27.58710   30.19100   33.40870   35.71850
18                   25.98940   28.86930   31.52640   34.80530   37.15640
19                   27.20360   30.14350   32.85230   36.19080   38.58220

We use the tabulated values of $\chi^2$ to construct a confidence interval for $\sigma^2$, as in the next example.

Example 7.19 There was a study of contaminated fish in a river. Suppose it is important for the study to know how stable the weights of the contaminated fish are. That is, how large is the variance $\sigma^2$ in the fish weights? The 144 samples of fish in the study produced the following summary statistics: $\bar{x} = 1{,}049.7$ grams, $s = 376.6$ grams. Use this information to construct a 95% confidence interval for the true variation in weights of contaminated fish in the river.

Solution A $(1-\alpha)100\%$ confidence interval for $\sigma^2$ depends on the quantities $s^2$, $(n - 1)$, and critical values of $\chi^2$ as shown in the box.

A $(1-\alpha)100\%$ confidence interval for a population variance, $\sigma^2$
$\dfrac{(n-1)s^2}{\chi^2_{\alpha/2}} \le \sigma^2 \le \dfrac{(n-1)s^2}{\chi^2_{(1-\alpha/2)}}$
where $\chi^2_{\alpha/2}$ and $\chi^2_{(1-\alpha/2)}$ are values of $\chi^2$ that locate an area of α/2 to the right and α/2 to the left, respectively, of a chi-square distribution based on $(n - 1)$ degrees of freedom.
Assumption: The population from which the sample is selected has an approximate normal distribution.

Note that $(n - 1)$ represents the degrees of freedom associated with the $\chi^2$ distribution. To construct the interval, we first locate the critical values $\chi^2_{(1-\alpha/2)}$ and $\chi^2_{\alpha/2}$. These are the values of $\chi^2$ that cut off an area of α/2 in the lower and upper tails, respectively, of the chi-square distribution (see Figure 7.11).

For a 95% confidence interval, $(1 - \alpha) = .95$ and $\alpha/2 = .05/2 = .025$. Therefore, we need the tabulated values $\chi^2_{.025}$ and $\chi^2_{.975}$ for $(n - 1) = 143$ df. Looking in the df = 150 row of Table 3 of Appendix C (the row with the df value closest to 143), we find $\chi^2_{.025} = 185.800$ and $\chi^2_{.975} = 117.985$. Substituting into the formula given in the box, we obtain

$\dfrac{(144 - 1)(376.6)^2}{185.800} \le \sigma^2 \le \dfrac{(144 - 1)(376.6)^2}{117.985}$
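The substitution above can be evaluated with a short Python sketch. The function name `variance_ci` is a hypothetical choice. The two critical values are passed in by hand, using the df = 150 row values quoted in the text (185.800 and 117.985), because the Python standard library has no chi-square quantile function; with exact critical values for 143 df the endpoints would shift somewhat.

```python
def variance_ci(n, s, chi2_upper, chi2_lower):
    """Confidence interval for a population variance.

    chi2_upper is the tabulated chi-square value with area alpha/2 to its right,
    chi2_lower the value with area alpha/2 to its left, both for (n - 1) df.
    """
    sum_of_squares = (n - 1) * s ** 2
    return sum_of_squares / chi2_upper, sum_of_squares / chi2_lower


# Example 7.19: n = 144 fish, s = 376.6 grams, table values from the df = 150 row
low, high = variance_ci(144, 376.6, 185.800, 117.985)
print(round(low, 1), round(high, 1))
```

The lower endpoint comes out near 109,157 and the upper endpoint near 171,898, in squared grams.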
We are 95% confident that the true variance in weights of contaminated fish in the river falls between 109,156.8 and 171,898.4.

Figure 7.11 The location of $\chi^2_{(1-\alpha/2)}$ and $\chi^2_{\alpha/2}$ for a chi-square distribution

Example 7.20 Refer to Example 7.19. Find a 95% confidence interval for σ, the true standard deviation of the fish weights.

Solution A confidence interval for σ is obtained by taking the square roots of the lower and upper endpoints of a confidence interval for $\sigma^2$. Thus, the 95% confidence interval is

$\sqrt{109{,}156.8} \le \sigma \le \sqrt{171{,}898.4}$
$330.4 \le \sigma \le 414.6$

Thus, we are 95% confident that the true standard deviation of the fish weights is between 330.4 grams and 414.6 grams.

Note that the procedure for calculating a confidence interval for $\sigma^2$ in Example 7.19 (and the confidence interval for σ in Example 7.20) requires an assumption regardless of whether the sample size n is large or small (see box). We must assume that the population from which the sample is selected has an approximate normal distribution. It is reasonable to expect this assumption to be satisfied in Examples 7.19 and 7.20 since the histogram of the 144 fish weights in the sample is approximately normal.

7.10 Summary

This chapter presented the technique of estimation, that is, using sample information to make an inference about the value of a population parameter, or the difference between two population parameters. In each instance, we presented the point estimate of the parameter of interest, its sampling distribution, the general form of a confidence interval, and any assumptions required for the validity of the procedure. In addition, we provided techniques for determining the sample size necessary to estimate each of these parameters.

7.11 Exercises

7.1. Use Table 1 of Appendix C to determine the value of $z_{\alpha/2}$ that would be used to construct a large-sample confidence interval for µ, for each of the following confidence coefficients:
a) .85
b) .95
c) .975

7.2. Suppose a random sample of size n = 100 produces a mean of $\bar{x} = 81$ and a standard deviation of s = 12.
a) Construct a 90% confidence interval for µ.
b) Construct a 95% confidence interval for µ.
c) Construct a 99% confidence interval for µ.

7.3. Use Table 2 of Appendix C to determine the values of $t_{\alpha/2}$ that would be used in the construction of a confidence interval for a population mean for each of the following combinations of confidence coefficient and sample size:
a) Confidence coefficient .99, n = 18.
b) Confidence coefficient .95, n = 10.
c) Confidence coefficient .90, n = 15.

7.4. A random sample of n = 10 measurements from a normally distributed population yields $\bar{x} = 9.4$ and s = 1.8.
a) Calculate a 90% confidence interval for µ.
b) Calculate a 95% confidence interval for µ.
c) Calculate a 99% confidence interval for µ.

7.5. The mean and standard deviation of n measurements randomly sampled from a normally distributed population are 33 and 4, respectively. Construct a 95% confidence interval for µ when:
a) n = 5
b) n = 15
c) n = 25

7.6. Random samples of n measurements are selected from a population with unknown proportion of successes p. Compute an estimate of $\sigma_{\hat{p}}$ for each of the following situations:
a) n = 250, $\hat{p}$ = .4
b) n = 500, $\hat{p}$ = .85
c) n = 95, $\hat{p}$ = .25

7.7. A random sample of size 150 is selected from a population and the number of successes is 60.
a) Find $\hat{p}$.
b) Construct a 90% confidence interval for p.
c) Construct a 95% confidence interval for p.
d) Construct a 99% confidence interval for p.

7.8. Independent random samples from two normal populations produced the sample means and variances listed in the following table.

Sample from population 1     Sample from population 2
n1 = 14                      n2 = 7
x̄1 = 53.2                    x̄2 = 43.4
s1² = 96.8                   s2² = 102.0

a) Find a 90% confidence interval for $(\mu_1 - \mu_2)$.
b) Find a 95% confidence interval for $(\mu_1 - \mu_2)$.
c) Find a 99% confidence interval for $(\mu_1 - \mu_2)$.

7.9. A random sample of ten paired observations yielded the following summary information:
$\bar{d} = 2.3$, $s_d = 2.67$
a) Find a 90% confidence interval for $\mu_d$.
b) Find a 95% confidence interval for $\mu_d$.
c) Find a 99% confidence interval for $\mu_d$.
Chapter 8 Hypothesis Testing

CONTENTS
8.1 Introduction
8.2 Formulating Hypotheses
8.3 Types of errors for a Hypothesis Test
8.4 Rejection Regions
8.5 Summary
8.6 Exercises

8.1 Introduction

In this chapter we will study another method of inference-making: hypothesis testing. The procedures to be discussed are useful in situations where we are interested in making a decision about a parameter value, rather than obtaining an estimate of its value. It is often desirable to know whether some characteristic of a population is larger than a specified value, or whether the obtained value of a given parameter is less than a value hypothesized for the purpose of comparison. Since the value of the population characteristic is unknown, the information provided by a sample from the population is used to answer the question of whether or not the population quantity is larger than the specified or hypothesized value.

8.2 Formulating Hypotheses

When we set out to test a new theory, we first formulate a hypothesis, or a claim, which we believe to be true. For example, we may claim that the mean number of children born to urban women is less than the mean number of children born to rural women.

In statistical terms, a statistical hypothesis is a statement about the value of a population parameter. The hypothesis that we try to establish is called the alternative hypothesis and is denoted by Ha. To be paired with the alternative hypothesis is the null hypothesis, which is the "opposite" of the alternative hypothesis, and is denoted by H0. In this way, the null and alternative hypotheses, both stated in terms of the appropriate parameters, describe two possible states of nature that cannot simultaneously be true.

When a researcher begins to collect information about the phenomenon of interest, he or she generally tries to present evidence that lends support to the alternative hypothesis. As you will subsequently learn, we take an indirect approach to obtaining support for the alternative hypothesis: instead of trying to show that the alternative hypothesis is true, we attempt to produce evidence to show that the null hypothesis is false.

It should be stressed that researchers frequently put forward a null hypothesis in the hope that they can discredit it. For example, consider an educational researcher who designed a new way to teach a particular concept in science, and wanted to test experimentally whether this new method worked better than the existing method. The researcher would design an experiment comparing the two methods. Since the null hypothesis would be that there is no difference between the two methods, the researcher would be hoping to reject the null hypothesis and conclude that the method he or she developed is the better of the two.
The null hypothesis is typically a hypothesis of no difference, as in the above example where it is the hypothesis that there is no difference between population means. That is why the word "null" in "null hypothesis" is used: it is the hypothesis of no difference.

Example 8.1 Formulate appropriate null and alternative hypotheses for testing the demographer's theory that the mean number of children born to urban women is less than the mean number of children born to rural women.

Solution The hypotheses must be stated in terms of a population parameter or parameters. We will thus define
µ1 = Mean number of children born to urban women, and
µ2 = Mean number of children born to rural women.
The demographer wants to support the claim that µ1 is less than µ2; therefore, the null and alternative hypotheses, in terms of these parameters, are
H0: (µ1 - µ2) = 0 (i.e., µ1 = µ2; there is no difference between the mean numbers of children born to urban and rural women)
Ha: (µ1 - µ2) < 0 (i.e., µ1 < µ2; the mean number of children born to urban women is less than that for the rural women)

Example 8.2 For many years, cigarette advertisements have been required to carry the following statement: "Cigarette smoking is dangerous to your health." But this warning is often located in inconspicuous corners of the advertisements and printed in small type. Consequently, a researcher believes that over 80% of those who read cigarette advertisements fail to see the warning. Specify the null and alternative hypotheses that would be used in testing the researcher's theory.

Solution The researcher wants to make an inference about p, the true proportion of all readers of cigarette advertisements who fail to see the warning. In particular, he wishes to collect evidence to support the claim that p is greater than .80; thus, the null and alternative hypotheses are
H0: p = .80
Ha: p > .80

Observe that the statement of H0 in these examples, and in general, is written with an equality (=) sign. In Example 8.2, you may have been tempted to write the null hypothesis as H0: p ≤ .80. However, since the alternative of interest is that p > .80, any evidence that would cause you to reject the null hypothesis H0: p = .80 in favor of Ha: p > .80 would also cause you to reject H0: p = p', for any value of p' that is less than .80. In other words, H0: p = .80 represents the worst possible case, from the researcher's point of view, when the alternative hypothesis is not correct. Thus, for mathematical ease, we combine all possible situations for describing the opposite of Ha into one statement involving equality.

Example 8.3 A metal lathe is checked periodically by quality control inspectors to determine if it is producing machine bearings with a mean diameter of .5 inch. If the mean diameter of the bearings is larger or smaller than .5 inch, then the process is out of control and needs to be adjusted. Formulate the null and alternative hypotheses that could be used to test whether the bearing production process is out of control.

Solution We define the following parameter:
µ = True mean diameter (in inches) of all bearings produced by the lathe
If either µ > .5 or µ < .5, then the metal lathe's production process is out of control. Since we wish to be able to detect either possibility, the null and alternative hypotheses would be
H0: µ = .5 (i.e., the process is in control)
Ha: µ ≠ .5 (i.e., the process is out of control)
An alternative hypothesis may hypothesize a change from H0 in a particular direction, or it may merely hypothesize a change without specifying a direction. In Examples 8.1 and 8.2, the researcher is interested in detecting departure from H0 in one particular direction. In Example 8.1, the interest focuses on whether the mean number of children born to urban women is less than the mean number of children born to rural women. In Example 8.2, the interest focuses on whether the proportion of cigarette advertisement readers who fail to see the warning is greater than .80. These two tests are called one-tailed tests. In contrast, Example 8.3 illustrates a two-tailed test in which we are interested in whether the mean diameter of the machine bearings differs in either direction from .5 inch, i.e., whether the process is out of control.
8.3 Types of errors for a Hypothesis Test
The goal of any hypothesis testing is to make a decision. In particular, we will decide whether to reject the null hypothesis, H0, in favor of the alternative hypothesis, Ha. Although we would like always to be able to make a correct decision, we must remember that the decision will be based on sample information, and thus we are subject to making one of two types of error, as defined in the accompanying boxes.
Definition 8.1
A Type I error is the error of rejecting the null hypothesis when it is true. The probability of committing a Type I error is usually denoted by α.
Definition 8.2
A Type II error is the error of accepting the null hypothesis when it is false. The probability of making a Type II error is usually denoted by β.
The null hypothesis can be either true or false; further, we will make a conclusion either to reject or not to reject the null hypothesis. Thus, there are four possible situations that may arise in testing a hypothesis (see Table 8.1).
Table 8.1 Conclusions and consequences for testing a hypothesis
                                  True "State of Nature"
Conclusions                       Null Hypothesis        Alternative Hypothesis
Do not reject Null Hypothesis     Correct conclusion     Type II error
Reject Null Hypothesis            Type I error           Correct conclusion
The kind of error that can be made depends on the actual state of affairs (which, of course, is unknown to the investigator). Note that we risk a Type I error only if the null hypothesis is rejected, and we risk a Type II error only if the null hypothesis is not rejected. Thus, we may make no error, or we may make either a Type I error (with probability α) or a Type II error (with probability β), but not both. We don't know which type of error corresponds to actuality and so would like to keep the probabilities of both types of errors small. There is an intuitively appealing relationship between the probabilities for the two types of error: as α increases, β decreases; similarly, as β increases, α decreases.
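The relationship between α and β can be made concrete with a small numerical sketch. It assumes an upper-tailed large-sample z test of the kind developed in Section 8.4 and a known standard deviation; every number in it (µ0 = 72, a true mean of 75, σ = 13, the sample sizes) is a hypothetical value chosen only for illustration, and the helper name type_ii_error is our own.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution


def type_ii_error(alpha, mu0, mu_true, sigma, n):
    """Beta for an upper-tailed z test of H0: mu = mu0 vs Ha: mu > mu0.

    Assumes the sample mean is normally distributed with a known
    standard deviation sigma (an illustrative simplification).
    """
    se = sigma / sqrt(n)                      # standard deviation of x-bar
    z_crit = STD_NORMAL.inv_cdf(1 - alpha)    # reject H0 when z > z_crit
    cutoff = mu0 + z_crit * se                # the same rule in terms of x-bar
    # Beta = P(x-bar <= cutoff) when the true mean is mu_true
    return NormalDist(mu_true, se).cdf(cutoff)


# Hypothetical setting: H0: mu = 72, true mean 75, sigma = 13, n = 30
for alpha in (0.10, 0.05, 0.01):
    beta = type_ii_error(alpha, 72, 75, 13, n=30)
    print(f"n=30   alpha={alpha:.2f}  beta={beta:.3f}")

# Same alpha with a larger (hypothetical) sample
print(f"n=120  alpha=0.05  beta={type_ii_error(0.05, 72, 75, 13, n=120):.3f}")
```

Under these assumptions, β grows as α is made smaller for a fixed sample size, and it shrinks when the sample size is increased at a fixed α. The value of β also changes with the assumed true mean, so a different alternative would give different numbers.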
The only way to reduce α and β simultaneously is to increase the amount of information available in the sample, i.e., to increase the sample size.
Example 8.4 Refer to Example 8.3. Specify what Type I and Type II errors would represent, in terms of the problem.
Solution A Type I error is the error of incorrectly rejecting the null hypothesis. In our example, this would occur if we conclude that the process is out of control when in fact the process is in control, i.e., if we conclude that the mean bearing diameter is different from .5 inch, when in fact the mean is equal to .5 inch. The consequence of making such an error would be that unnecessary time and effort would be expended to repair the metal lathe.
A Type II error, that of accepting the null hypothesis when it is false, would occur if we conclude that the mean bearing diameter is equal to .5 inch when in fact the mean differs from .5 inch. The practical significance of making a Type II error is that the metal lathe would not be repaired, when in fact the process is out of control.
The probability of making a Type I error (α) can be controlled by the researcher (how to do this will be explained in Section 8.4). α is often used as a measure of the reliability of the conclusion and is called the level of significance (or significance level) for a hypothesis test.
You may note that we have carefully avoided stating a decision in terms of "accept the null hypothesis H0." Instead, if the sample does not provide enough evidence to support the alternative hypothesis Ha, we prefer a decision "not to reject H0." This is because, if we were to "accept H0," the reliability of the conclusion would be measured by β, the probability of Type II error. However, the value of β is not constant, but depends on the specific alternative value of the parameter and is difficult to compute in most testing situations.
In summary, we recommend the following procedure for formulating hypotheses and stating conclusions.
Formulating hypotheses and stating conclusions
1. State the hypothesis for which you seek support as the alternative hypothesis Ha.
2. The null hypothesis, H0, will be the opposite of Ha and will contain an equality sign.
3. If the sample evidence supports the alternative hypothesis, the null hypothesis will be rejected and the probability of having made an incorrect decision (when in fact H0 is true) is α, a quantity that can be manipulated to be as small as the researcher wishes.
4. If the sample does not provide sufficient evidence to support the alternative hypothesis, then conclude that the null hypothesis cannot be rejected on the basis of your sample. In this situation, you may wish to collect more information about the phenomenon under study.
Example 8.5 The logic used in hypothesis testing has often been likened to that used in the courtroom in which a defendant is on trial for committing a crime.
a. Formulate appropriate null and alternative hypotheses for judging the guilt or innocence of the defendant.
b. Interpret the Type I and Type II errors in this context.
c. If you were the defendant, would you want α to be small or large? Explain.
Solution
a. Under a judicial system, a defendant is "innocent until proven guilty." That is, the burden of proof is not on the defendant to prove his or her innocence; rather, the court must collect
sufficient evidence to support the claim that the defendant is guilty. Thus, the null and alternative hypotheses would be
H0: Defendant is innocent
Ha: Defendant is guilty
b. The four possible outcomes are shown in Table 8.2. A Type I error would be to conclude that the defendant is guilty, when in fact he or she is innocent; a Type II error would be to conclude that the defendant is innocent, when in fact he or she is guilty.
Table 8.2 Conclusions and consequences in Example 8.5
                           True State of Nature
Decision of Court          Defendant is innocent     Defendant is guilty
Defendant is innocent      Correct decision          Type II error
Defendant is guilty        Type I error              Correct decision
c. Most would probably agree that the Type I error in this situation is by far the more serious. Thus, we would want α, the probability of committing a Type I error, to be very small indeed.
A convention that is generally observed when formulating the null and alternative hypotheses of any statistical test is to state H0 so that the possible error of incorrectly rejecting H0 (Type I error) is considered more serious than the possible error of incorrectly failing to reject H0 (Type II error). In many cases, the decision as to which type of error is more serious is admittedly not as clear-cut as that of Example 8.5; experience will help to minimize this potential difficulty.
8.4 Rejection Regions
In this section we will describe how to arrive at a decision in a hypothesis-testing situation. Recall that when making any type of statistical inference (of which hypothesis testing is a special case), we collect information by obtaining a random sample from the populations of interest. In all our applications, we will assume that the appropriate sampling process has already been carried out. The information provided by this sample, in the form of a sample statistic, will help us decide whether to reject the null hypothesis. The sample statistic upon which we base our decision is called the test statistic.
Example 8.6 Suppose we want to test the hypotheses
H0: µ = 72
Ha: µ > 72
What is the general format for carrying out a statistical test of hypothesis?
Solution The first step is to obtain a random sample from the population of interest. The second step is to determine a test statistic that is reasonable in the context of a given hypothesis test. For this example, we are hypothesizing about the value of the population mean µ. Since our best guess about the value of µ is the sample mean x̄ (see Section 7.2), it seems reasonable to use x̄ as a test statistic. We will learn how to choose the test statistic for other hypothesis-testing situations in the examples that follow.
The third step is to specify the range of possible computed values of the test statistic for which the null hypothesis will be rejected. That is, what specific values of the test statistic will lead us to reject the null hypothesis in favor of the alternative hypothesis? These specific values are known collectively as the rejection region for the test. For this example, we would need to specify the values of x̄ that would lead us to believe that Ha is true, i.e., that µ is greater than 72. We will learn how to find an appropriate rejection region in later examples.
Once the rejection region has been specified, the fourth step is to use the data in the sample to compute the value of the test statistic. Finally, we make our decision by observing whether the computed value of the test statistic lies within the rejection region. If in fact the computed value falls within the rejection region, we will reject the null hypothesis; otherwise, we do not reject the null hypothesis.
An outline of the hypothesis-testing procedure developed in Example 8.6 is given below.
Outline for testing a hypothesis
1. Obtain a random sample from the population(s) of interest.
2. Determine a test statistic that is reasonable in the context of the given hypothesis test.
3. Specify the rejection region, the range of possible computed values of the test statistic for which the null hypothesis will be rejected.
4. Use the data in the sample to compute the value of the test statistic.
5. Observe whether the computed value of the test statistic lies within the rejection region. If so, reject the null hypothesis; otherwise, do not reject the null hypothesis.
Recall that the null and alternative hypotheses will be stated in terms of specific population parameters. Thus, in step 2 we decide on a test statistic that will provide information about the target parameter.
Example 8.7 Refer to Example 8.1, in which we wish to test
H0: (µ1 − µ2) = 0
Ha: (µ1 − µ2) < 0
where µ1 and µ2 are the population mean numbers of children born to urban women and rural women, respectively. Suggest an appropriate test statistic in the context of this problem.
Solution The parameter of interest is (µ1 − µ2), the difference between the two population means. Therefore, we will use (x̄1 − x̄2), the difference between the corresponding sample means, as a basis for deciding whether to reject H0. If the difference between the sample means, (x̄1 − x̄2), falls greatly below the hypothesized value of (µ1 − µ2) = 0, then we have evidence that disagrees with the null hypothesis. In fact, it would support the alternative hypothesis that (µ1 − µ2) < 0. Again, we are using the point estimate of the target parameter as the test statistic in the hypothesis-testing approach. In general, when the hypothesis test involves a specific population parameter, the test statistic to be used is the conventional point estimate of that parameter.
In step 3, we divide all possible values of the test statistic into two sets: the rejection region and its complement. If the computed value of the test statistic falls within the rejection region, we reject the null hypothesis. If the computed value of the test statistic does not fall within the rejection region, we do not reject the null hypothesis.
Example 8.8 Refer to Example 8.6. For the hypothesis test
H0: µ = 72
Ha: µ > 72
indicate which decision you may make for each of the following values of the test statistic:
a. x̄ = 110
b. x̄ = 59
c. x̄ = 73
Solution
a. If x̄ = 110, then much doubt is cast upon the null hypothesis. In other words, if the null hypothesis were true (i.e., if µ is in fact equal to 72), then it is very unlikely that we would observe a sample mean x̄ as large as 110. We would thus tend to reject the null hypothesis on the basis of information contained in this sample.
b. Since the alternative of interest is µ > 72, this value of the sample mean, x̄ = 59, provides no support for Ha. Thus, we would not reject H0 in favor of Ha: µ > 72, based on this sample.
c. Does a sample value of x̄ = 73 cast sufficient doubt on the null hypothesis to warrant its rejection? Although the sample mean x̄ = 73 is larger than the null hypothesized value of µ = 72, is this due to chance variation, or does it provide strong enough evidence to conclude in favor of Ha? We think you will agree that the decision is not as clear-cut as in parts a and b, and that we need a more formal mechanism for deciding what to do in this situation.
We now illustrate how to determine a rejection region that takes into account such factors as the sample size and the maximum probability of a Type I error that you are willing to tolerate.
Example 8.9 Refer to Example 8.8. Specify completely the form of the rejection region for a test of
H0: µ = 72
Ha: µ > 72
at a significance level of α = .05.
Solution We are interested in detecting a directional departure from H0; in particular, we are interested in the alternative that µ is greater than 72. Now, what values of the sample mean x̄ would cause us to reject H0 in favor of Ha? Clearly, values of x̄ which are "sufficiently greater" than 72 would cast doubt on the null hypothesis. But how do we decide whether a value, x̄ = 73, is "sufficiently greater" than 72 to reject H0? A convenient measure of the distance between x̄ and 72 is the z-score, which "standardizes" the value of the test statistic x̄:
$$z = \frac{\bar{x} - \mu_{\bar{x}}}{\sigma_{\bar{x}}} = \frac{\bar{x} - 72}{\sigma/\sqrt{n}} \approx \frac{\bar{x} - 72}{s/\sqrt{n}}$$
The z-score is obtained by using the values of µ_x̄ and σ_x̄ that would be valid if the null hypothesis were true, i.e., if µ = 72. The z-score then gives us a measure of how many standard deviations the observed x̄ is from what we would expect to observe if H0 were true.
We examine Figure 8.1a and observe that the chance of obtaining a value of x̄ more than 1.645 standard deviations above 72 is only .05, when in fact the true value of µ is 72. We are assuming that the sample size is large enough to ensure that the sampling distribution of x̄ is approximately normal. Thus, if we observe a sample mean located more than 1.645 standard deviations above 72, then either H0 is true and a relatively rare (with probability .05 or less) event has occurred, or Ha is true and the population mean exceeds 72. We would tend to favor the latter explanation for obtaining such a large value of x̄, and would then reject H0.
Figure 8.1 Location of rejection region of Example 8.9
In summary, our rejection region for this example consists of all values of z that are greater than 1.645 (i.e., all values of x̄ that are more than 1.645 standard deviations above 72). The value at the boundary of the rejection region is called the critical value. The critical value 1.645 is shown in Figure 8.1b. In this situation, the probability of a Type I error (that is, deciding in favor of Ha if in fact H0 is true) is equal to α = .05.
Example 8.10 Specify the form of the rejection region for a test of
H0: µ = 72
Ha: µ < 72
at significance level α = .01.
Solution Here, we want to be able to detect the directional alternative that µ is less than 72; in this case, it is "sufficiently small" values of the test statistic x̄ that would cast doubt on the null hypothesis. As in Example 8.9, we will standardize the value of the test statistic to obtain a measure of the distance between x̄ and the null hypothesized value of 72:
$$z = \frac{\bar{x} - \mu_{\bar{x}}}{\sigma_{\bar{x}}} = \frac{\bar{x} - 72}{\sigma/\sqrt{n}} \approx \frac{\bar{x} - 72}{s/\sqrt{n}}$$
This z-value tells us how many standard deviations the observed x̄ is from what would be expected if H0 were true. Here, we have also assumed that n ≥ 30 so that the sampling distribution of x̄ will be approximately normal. The appropriate modifications for small samples will be indicated in Chapter 9.
Figure 8.2a shows us that, when in fact the true value of µ is 72, the chance of observing a value of x̄ more than 2.33 standard deviations below 72 is only .01. Thus, at significance level (probability of Type I error) equal to .01, we would reject the null hypothesis for all values of z that are less than −2.33 (see Figure 8.2b), i.e., for all values of x̄ that lie more than 2.33 standard deviations below 72.
Figure 8.2 Location of rejection region of Example 8.10
Example 8.11 Specify the form of the rejection region for a test of
H0: µ = 72
Ha: µ ≠ 72
where we are willing to tolerate a .05 chance of making a Type I error.
Solution For this two-sided (nondirectional) alternative, we would reject the null hypothesis for "sufficiently small" or "sufficiently large" values of the standardized test statistic
$$z \approx \frac{\bar{x} - 72}{s/\sqrt{n}}$$
Now, from Figure 8.3a, we note that the chance of observing a sample mean x̄ more than 1.96 standard deviations below 72 or more than 1.96 standard deviations above 72, when in fact H0 is true, is only α = .05. Thus, the rejection region consists of two sets of values: we will reject H0 if z is either less than −1.96 or greater than 1.96 (see Figure 8.3b). For this rejection rule, the probability of a Type I error is .05.
The three previous examples all exhibit certain common characteristics regarding the rejection
region, as indicated in the next paragraph.
Figure 8.3 Location of rejection region of Example 8.11
Guidelines for Step 3 of Hypothesis Testing
1. The value of α, the probability of a Type I error, is specified in advance by the researcher. It can be made as small or as large as desired; typical values are α = .01, .02, .05, and .10. For a fixed sample size, the size of the rejection region decreases as the value of α decreases (see Figure 8.4). That is, for smaller values of α, more extreme departures of the test statistic from the null hypothesized parameter value are required to permit rejection of H0.
2. For testing means or proportions, the test statistic (i.e., the point estimate of the target parameter) is standardized to provide a measure of how great is its departure from the null hypothesized value of the parameter. The standardization is based on the sampling distribution of the point estimate, assuming H0 is true. (It is through standardization that the rejection rule takes into account the sample sizes.)
Standardized test statistic = (Point estimate − Hypothesized value) / (Standard deviation of point estimate)
3. The location of the rejection region depends on whether the test is one-tailed or two-tailed, and on the prespecified significance level, α.
a. For a one-tailed test in which the symbol ">" occurs in Ha, the rejection region consists of values in the upper tail of the sampling distribution of the standardized test statistic. The critical value is selected so that the area to its right is equal to α.
b. For a one-tailed test in which the symbol "<" appears in Ha, the rejection region consists of values in the lower tail of the sampling distribution of the standardized test statistic. The critical value is selected so that the area to its left is equal to α.
c. For a two-tailed test, in which the symbol "≠" occurs in Ha, the rejection region consists of two sets of values. The critical values are selected so that the area in each tail of the sampling distribution of the standardized test statistic is equal to α/2.
Figure 8.4 Size of the upper-tail rejection region for different values of α
Steps 4 and 5 of the hypothesis-testing approach require the computation of a test statistic from the sample information. Then we determine if the standardized value of the test statistic lies within the rejection region in order to make a decision about whether to reject the null hypothesis.
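As a sketch of guideline 3, the critical values for the three kinds of rejection region can be read off the standard normal distribution. The snippet assumes the standardized test statistic is approximately standard normal (the large-sample case) and uses Python's standard-library statistics.NormalDist; the function name rejection_region and the labels "upper", "lower" and "two" are our own choices, not notation from the text.

```python
from statistics import NormalDist


def rejection_region(alpha, tail):
    """Return (lower_cutoff, upper_cutoff) for a large-sample z test.

    tail = "upper": Ha contains ">", reject H0 if z > upper_cutoff
    tail = "lower": Ha contains "<", reject H0 if z < lower_cutoff
    tail = "two":   Ha contains "≠", reject H0 if z falls outside the cutoffs
    A cutoff of None means that tail is not part of the rejection region.
    """
    std_normal = NormalDist()
    if tail == "upper":
        return None, std_normal.inv_cdf(1 - alpha)   # area alpha to the right
    if tail == "lower":
        return std_normal.inv_cdf(alpha), None       # area alpha to the left
    if tail == "two":
        upper = std_normal.inv_cdf(1 - alpha / 2)    # area alpha/2 in each tail
        return -upper, upper
    raise ValueError("tail must be 'upper', 'lower' or 'two'")


# The settings of Examples 8.9, 8.10 and 8.11
print(rejection_region(0.05, "upper"))
print(rejection_region(0.01, "lower"))
print(rejection_region(0.05, "two"))
```

Rounded to the precision used in the text, these cutoffs are 1.645, −2.33 and ±1.96.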
Example 8.12 Refer to Example 8.9. Suppose the following statistics were calculated based on a random sample of n = 30 measurements: x̄ = 73, s = 13. Perform a test of
H0: µ = 72
Ha: µ > 72
at a significance level of α = .05.
Solution In Example 8.9, we determined the following rejection rule for the given value of α and the alternative hypothesis of interest:
Reject H0 if z > 1.645.
The standardized test statistic, computed assuming H0 is true, is given by
$$z = \frac{\bar{x} - \mu_{\bar{x}}}{\sigma_{\bar{x}}} = \frac{\bar{x} - 72}{\sigma/\sqrt{n}} \approx \frac{\bar{x} - 72}{s/\sqrt{n}} = \frac{73 - 72}{13/\sqrt{30}} = .42$$
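The arithmetic above can be checked with a few lines of code. This is only a sketch: it plugs the summary statistics of Example 8.12 into the large-sample formula, with s standing in for σ, and the variable names are our own.

```python
from math import sqrt
from statistics import NormalDist

x_bar, s, n = 73, 13, 30      # summary statistics from Example 8.12
mu0, alpha = 72, 0.05         # H0: mu = 72, tested at the 5% level

z = (x_bar - mu0) / (s / sqrt(n))            # standardized test statistic
z_crit = NormalDist().inv_cdf(1 - alpha)     # upper-tail critical value
reject = z > z_crit                          # rejection rule: z > z_crit

print(f"z = {z:.2f}, critical value = {z_crit:.3f}, reject H0: {reject}")
# prints: z = 0.42, critical value = 1.645, reject H0: False
```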
Figure 8.5 Location of rejection region of Example 8.12
Since the computed value z = .42 does not lie within the rejection region (shown in Figure 8.5), we fail to reject H0 and conclude there is insufficient evidence to support the alternative hypothesis, Ha: µ > 72. (Note that we do not conclude that H0 is true; rather, we state that we have insufficient evidence to reject H0.)
8.5 Summary
In this chapter, we have introduced the logic and general concepts involved in the statistical procedure of hypothesis testing. The techniques will be illustrated more fully with practical applications in Chapter 9.
8.6 Exercises
8.1. A medical researcher would like to determine whether the proportion of males admitted to a hospital because of heart disease differs from the corresponding proportion of females. Formulate the appropriate null and alternative hypotheses and state whether the test is one-tailed or two-tailed.
8.2. Why do we avoid stating a decision in terms of "accept the null hypothesis H0"?
8.3. Suppose it is desired to test
H0: µ = 65
Ha: µ ≠ 65
at significance level α = .02. Specify the form of the rejection region. (Hint: assume that the sample size will be sufficient to guarantee the approximate normality of the sampling distribution of x̄.)
8.4. Indicate the form of the rejection region for a test of
H0: (p1 − p2) = 0
Ha: (p1 − p2) > 0
Assume that the sample size will be appropriate to apply the normal approximation to the sampling distribution of (p̂1 − p̂2), and that the maximum tolerable probability of committing a Type I error is .05.
8.5. For each of the following rejection regions, determine the value of α, the probability of a Type I error:
a) z < −1.96
b) z > 1.645
c) z < −2.58 or z > 2.58
Chapter 9 Applications of Hypothesis Testing
9.1 Introduction
In this chapter, we will present applications of the hypothesis-testing logic developed in Chapter 8. Among the population parameters to be considered are (µ1 − µ2), p, and (p1 − p2).
The concepts of a hypothesis test are the same for all these parameters; the null and alternative hypotheses, test statistic, and rejection region all have the same general form (see Chapter 8). However, the manner in which the test statistic is actually computed depends on the parameter of interest. For example, in Chapter 7 we saw that the large-sample test statistic for testing a hypothesis about a population mean µ is given by
$$z = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$$
(see also Example 8.9), while the test statistic for testing a hypothesis about the parameter p is
$$z = \frac{\hat{p} - p_0}{\sqrt{p_0 q_0 / n}}$$
The key to correctly diagnosing a hypothesis test is to determine first the parameter of interest. In this section, we will present several examples illustrating how to determine the parameter of interest. The following are the key words to look for when conducting a hypothesis test about a population parameter.
Determining the parameter of interest
PARAMETER      DESCRIPTION
µ              Mean, average
(µ1 − µ2)      Difference in means or averages, mean difference, comparison of means or averages
p              Proportion, percentage, fraction, rate
(p1 − p2)      Difference in proportions, percentages, fractions, or rates; comparison of proportions, percentages, fractions, or rates
σ²             Variance, variation, precision
σ1²/σ2²        Ratio of variances, difference in variation, comparison of variances
In the following sections we will present a summary of the hypothesis-testing procedures for each of the parameters listed in the previous box.
9.2 Hypothesis test about a population mean
Suppose that in the last year all students at a certain university reported the number of hours spent on their studies during a certain week; the average was 40 hours. This year we want to
determine whether the mean time spent on studies of all students at the university is in excess of 40 hours per week. That is, we will test
H0: µ = 40
Ha: µ > 40
where µ = Mean time spent on studies of all students at the university. We are conducting this study in an attempt to gather support for Ha; we hope that the sample data will lead to the rejection of H0. Now, the point estimate of the population mean µ is the sample mean x̄. Will the value of x̄ that we obtain from our sample be large enough for us to conclude that µ is greater than 40? In order to answer this question, we need to perform each step of the hypothesis-testing procedure developed in Chapter 8.
Tests of population means using large samples
The following box contains the elements of a large-sample hypothesis test about a population mean, µ. Note that for this case, the only assumption required for the validity of the procedure is that the sample size is in fact large (n ≥ 30).
Large-sample test of hypothesis about a population mean
ONE-TAILED TEST                   TWO-TAILED TEST
H0: µ = µ0                        H0: µ = µ0
Ha: µ > µ0 (or Ha: µ < µ0)        Ha: µ ≠ µ0
Test statistic:
$$z = \frac{\bar{x} - \mu_0}{\sigma_{\bar{x}}} \approx \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$$
Rejection region:                 Rejection region:
z > zα (or z < −zα)               z < −zα/2 or z > zα/2
where zα is the z-value such that P(z > zα) = α, and zα/2 is the z-value such that P(z > zα/2) = α/2. [Note: µ0 is our symbol for the particular numerical value specified for µ in the null hypothesis.]
Assumption: The sample size must be sufficiently large (say, n ≥ 30) so that the sampling distribution of x̄ is approximately normal and that s provides a good approximation to σ.
Example 9.1 The mean time spent on studies of all students at a university last year was 40 hours per week. This year, a random sample of 35 students at the university was drawn. The following summary statistics were computed:
x̄ = 42.1 hours, s = 13.85 hours
Test the hypothesis that µ, the population mean time spent on studies per week, is equal to 40 hours against the alternative that µ is larger than 40 hours. Use a significance level of α = .05.
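The elements of the box above can be collected into one small function. The sketch below is an illustration rather than a reference implementation: the name large_sample_z_test and its tail argument are hypothetical, s is used in place of σ as the box allows for n ≥ 30, and critical values come from Python's standard-library statistics.NormalDist. It is applied here to the summary statistics of Example 9.1.

```python
from math import sqrt
from statistics import NormalDist


def large_sample_z_test(x_bar, s, n, mu0, alpha, tail):
    """Large-sample z test about a population mean.

    tail: "greater" (Ha: mu > mu0), "less" (Ha: mu < mu0),
          or "two-sided" (Ha: mu != mu0).
    Returns (z, reject_h0).
    """
    if n < 30:
        raise ValueError("the large-sample test assumes n >= 30")
    z = (x_bar - mu0) / (s / sqrt(n))        # s approximates sigma
    std_normal = NormalDist()
    if tail == "greater":
        reject = z > std_normal.inv_cdf(1 - alpha)
    elif tail == "less":
        reject = z < std_normal.inv_cdf(alpha)
    elif tail == "two-sided":
        reject = abs(z) > std_normal.inv_cdf(1 - alpha / 2)
    else:
        raise ValueError("tail must be 'greater', 'less' or 'two-sided'")
    return z, reject


# Example 9.1: x-bar = 42.1, s = 13.85, n = 35, H0: mu = 40, Ha: mu > 40
z, reject = large_sample_z_test(42.1, 13.85, 35, mu0=40, alpha=0.05, tail="greater")
print(f"z = {z:.3f}, reject H0: {reject}")
# prints: z = 0.897, reject H0: False
```

Keeping the tail as an explicit argument mirrors the two columns of the box: the test statistic is the same in every case, and only the rejection rule changes.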
1).e.03 kg s = .e.85 / 35 = . In order to detect shifts in the mean weight.645.05 i.1. and calculate the sample mean and standard deviation. if z > 1. we obtain z= x − µ0 s/ n = 42. Using a significance level of α = .05. Figure 9. and reset the machine.1 Computing the value of the test statistic. This rejection region is shown in Figure 9.05) to conclude that the mean time spent on studies per week of all students at the university this year is greater than 40 hours. The data of a periodical sample as follows: x = 1. Now the setting of machine tends to drift i. We say that there is insufficient evidence (at α = . weigh them.897 Since this value does not fall within the rejection region (see Figure 9. It is important to control the average weight of bags of sugar. The refiner wish to detect shifts in the mean weight of bags as quickly as possible. if in fact this were the case.2 A sugar refiner packs sugar into bags weighing.Solution We have previously formulated the hypotheses as H0: µ = 40 Ha: µ > 40 Note that the sample size n = 35 is sufficiently large so that the sampling distribution of x is approximately normal and that s provides a good approximation to σ. We would need to take a larger sample before we could detect whether µ > 40. Since the required assumption is satisfied. we may proceed with a largesample test of hypothesis about µ. on average 1 kilogram.1 − 40 13. he will periodically select 50 bags. we do not reject H0.1 Rejection region for Example 9. Example 9. the average weight of bags filled by the machine sometimes increases sometimes decreases.. we will reject the null hypothesis for this onetailed test if z > zα/2 = z.05 kg cxxx .
Test whether the population mean µ is different from 1 kg at significance level α = .01.
Solution We formulate the following hypotheses:
H0: µ = 1
Ha: µ ≠ 1
The sample size (50) exceeds 30, so we may proceed with the large-sample test about µ. Because shifts in µ in either direction are important, the test is two-tailed.
At significance level α = .01, we will reject the null hypothesis for this two-tailed test if z < −zα/2 = −z.005 or z > zα/2 = z.005, i.e., if z < −2.576 or z > 2.576.
The value of the test statistic is computed as follows:
$$z \approx \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{1.03 - 1}{.05/\sqrt{50}} = 4.243$$
Since this value is greater than the upper-tail critical value (2.576), we reject the null hypothesis and accept the alternative hypothesis at the significance level of 1%. We would conclude that the overall mean weight was no longer 1 kg, and would run a less than 1% chance of committing a Type I error.
Example 9.3 Prior to the institution of a new safety program, the average number of on-the-job accidents per day at a factory was 4.5. To determine if the safety program has been effective in reducing the average number of accidents per day, a random sample of 30 days is taken after the institution of the new safety program and the number of accidents per day is recorded. The sample mean and standard deviation were computed as follows:
x̄ = 3.7, s = 1.3
a. Is there sufficient evidence to conclude (at significance level .01) that the average number of on-the-job accidents per day at the factory has decreased since the institution of the safety program?
b. What is the practical interpretation of the test statistic computed in part a?
Solution
a. In order to determine whether the safety program was effective, we will conduct a large-sample test of
H0: µ = 4.5 (i.e., no change in average number of on-the-job accidents per day)
Ha: µ < 4.5 (i.e., average number of on-the-job accidents per day has decreased)
where µ represents the average number of on-the-job accidents per day at the factory after institution of the new safety program. For a significance level of α = .01, we will reject the null hypothesis if z < −z.01 = −2.33 (see Figure 9.2). The computed value of the test statistic is
$$z = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{3.7 - 4.5}{1.3/\sqrt{30}} = -3.37$$
Since this value does fall within the rejection region, there is sufficient evidence (at α = .01) to conclude that the average number of on-the-job accidents per day at the factory has decreased since the institution of the safety program. It appears that the safety program was effective in reducing the average number of accidents per day.
Figure 9.2 Location of rejection region of Example 9.3
b. If the null hypothesis is true, µ = 4.5. Recall that for large samples, the sampling distribution of x̄ is approximately normal, with mean µ_x̄ = µ and standard deviation σ_x̄ = σ/√n. Then the z-score for x̄, under the assumption that H0 is true, is given by
$$z = \frac{\bar{x} - 4.5}{\sigma/\sqrt{n}}$$
You can see that the test statistic computed in part a is simply the z-score for the sample mean x̄, if in fact µ = 4.5. A calculated z-score of −3.37 indicates that the value of x̄ computed from the sample falls a distance of 3.37 standard deviations below the hypothesized mean of µ = 4.5. Of course, we would not expect to observe a z-score this extreme if in fact µ = 4.5.
Tests of population means using small samples
When the assumption required for a large-sample test of hypothesis about µ is violated, we need a hypothesis-testing procedure that is appropriate for use with small samples. If we use the methods of the large-sample test, we will run into trouble on two accounts. Firstly, a small sample may underestimate the population variance, so our test statistic may be misleading. Secondly, the standardized mean of a small sample (computed with s in place of σ) is not normally distributed, so our critical values will be wrong. We have learnt that it has a t-distribution, and the appropriate t-distribution will depend on the number of degrees of freedom in estimating the population variance. If we use large samples to test a hypothesis, then the critical values we use will depend
upon the type of test (one or two tailed). But if we use small samples, then the critical values will depend upon the degrees of freedom as well as the type of test.
A hypothesis test about a population mean, µ, based on a small sample (n < 30) consists of the elements listed in the accompanying box.
Small-sample test of hypothesis about a population mean
ONE-TAILED TEST                   TWO-TAILED TEST
H0: µ = µ0                        H0: µ = µ0
Ha: µ > µ0 (or Ha: µ < µ0)        Ha: µ ≠ µ0
Test statistic:
$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$$
Rejection region:                 Rejection region:
t > tα (or t < −tα)               t < −tα/2 or t > tα/2
where the distribution of t is based on (n − 1) degrees of freedom, tα is the t-value such that P(t > tα) = α, and tα/2 is the t-value such that P(t > tα/2) = α/2.
Assumption: The relative frequency distribution of the population from which the sample was selected is approximately normal.
As we noticed in the development of estimation procedures, when we are making inferences based on small samples, more restrictive assumptions are required than when making inferences from large samples. In particular, this hypothesis test requires the assumption that the population from which the sample is selected is approximately normal.
Notice that the test statistic given in the box is a t statistic and is calculated exactly as our approximation to the large-sample test statistic, z, given earlier in this section. Therefore, just like z, the computed value of t indicates the direction and approximate distance (in units of standard deviations) that the sample mean, x̄, is from the hypothesized population mean, µ0.
Example 9.4 The expected lifetime of electric light bulbs produced by a given process was 1500 hours. To test a new batch a sample of 10 was taken which showed a mean lifetime of 1410 hours. The standard deviation is 90 hours. Test the hypothesis that the mean lifetime of the electric light bulbs has not changed, using a level of significance of α = .05.
Solution This question asks us to test that the mean has not changed, so we must employ a two-tailed test:
H0: µ = 1500
Ha: µ ≠ 1500
Since we are restricted to a small sample, we must make the assumption that the lifetimes of the electric light bulbs have a relative frequency distribution that is approximately normal. Under
Under this assumption, the test statistic will have a t-distribution with (n − 1) = (10 − 1) = 9 degrees of freedom. The rejection rule is then to reject the null hypothesis for values of t such that t < −tα/2 or t > tα/2 with α/2 = .05/2 = .025. From Table 7.6 in Chapter 7 (or Table 2 of Appendix C) with 9 degrees of freedom, we find that t.025 = 2.262. The value of the test statistic is

$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{1410 - 1500}{90/\sqrt{10}} = -3.162$$

The computed value of the test statistic, t = −3.162, falls below the critical value of −2.262. We reject H0 and accept Ha at significance level of .05, and conclude that there is some evidence to suggest that the mean lifetime of all light bulbs has changed.

9.3 Hypothesis tests of population proportions

Tests involving sample proportions are extremely important in practice. Many market researchers express their results in terms of proportions, e.g., "40% of the population clean their teeth with brand A toothpaste". It will be useful to design tests that will detect changes in proportions. For example, we may want to test the null hypothesis that the true proportion of people who use brand A is equal to .40 (i.e., H0: p = .40) against the alternative Ha: p > .40.

The procedure described in the next box is used to test a hypothesis about a population proportion, p, based on a large sample from the target population. (Recall that p represents the probability of success in a Bernoulli process.)

In order for the procedure to be valid, the sample size must be sufficiently large to guarantee approximate normality of the sampling distribution of the sample proportion, $\hat{p}$. A general rule of thumb for determining whether n is "sufficiently large" is that the interval $\hat{p} \pm 2\sqrt{\hat{p}\hat{q}/n}$ does not include 0 or 1.

Large-sample test of hypothesis about a population proportion

ONE-TAILED TEST
H0: p = p0
Ha: p > p0 (or Ha: p < p0)
Test statistic: $z = \dfrac{\hat{p} - p_0}{\sqrt{p_0 q_0 / n}}$
Rejection region: z > zα (or z < −zα)

TWO-TAILED TEST
H0: p = p0
Ha: p ≠ p0
Test statistic: $z = \dfrac{\hat{p} - p_0}{\sqrt{p_0 q_0 / n}}$
Rejection region: z < −zα/2 or z > zα/2

where q0 = 1 − p0.

Assumption: The interval $\hat{p} \pm 2\sqrt{\hat{p}\hat{q}/n}$ does not contain 0 and 1.

Example 9.5 Suppose it is claimed that in a very large batch of components, about 10% of items contain some form of defect. It is proposed to check whether this proportion has increased, and this will be done by drawing randomly a sample of 150 components. In the sample, 20 are defectives. Does this evidence indicate that the true proportion of defective components is significantly larger than 10%? Test at significance level α = .05.

Solution We wish to perform a large-sample test about a population proportion, p:

H0: p = .10 (i.e., no change in proportion of defectives)
Ha: p > .10 (i.e., proportion of defectives has increased)

where p represents the true proportion of defects.

At significance level α = .05, the rejection region for this one-tailed test consists of all values of z for which z > z.05 = 1.645.

The test statistic requires the calculation of the sample proportion, $\hat{p}$, of defects:

$$\hat{p} = \frac{\text{Number of defective components in the sample}}{\text{Number of sampled components}} = \frac{20}{150} = .133$$

Noting that q0 = 1 − p0 = 1 − .10 = .90, we obtain the following value of the test statistic:

$$z = \frac{\hat{p} - p_0}{\sqrt{p_0 q_0 / n}} = \frac{.133 - .10}{\sqrt{(.10)(.90)/150}} = 1.361$$

This value of z lies outside the rejection region, so we conclude that the sample proportion of defectives is not significantly larger than .10. We have no evidence to reject the null hypothesis that the proportion defective is .10 at the 5% level of significance. If H0 is in fact false, we have made a Type II error (accepting H0 when, in fact, it is not true); its probability, β, depends on the true value of p and is not fixed by the choice of α.

[Note that the interval

$$\hat{p} \pm 2\sqrt{\hat{p}\hat{q}/n} = .133 \pm 2\sqrt{\frac{(.133)(1 - .133)}{150}} = .133 \pm .056$$

does not contain 0 or 1. Thus, the sample size is large enough to guarantee the validity of the hypothesis test.]

Although small-sample procedures are available for testing hypotheses about a population proportion, the details are omitted from our discussion. It is our experience that they are of limited utility, since most surveys of binomial populations performed in practice use samples that are large enough to employ the techniques of this section.
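The one-sample test statistics of Examples 9.4 and 9.5 can be checked numerically. The following is a minimal sketch, not part of the original module: the function names are hypothetical, the t critical value is simply copied from the t table (the Python standard library has no t-distribution), and the normal critical value comes from `statistics.NormalDist`.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_t(xbar, mu0, s, n):
    """t statistic for H0: mu = mu0; compare with a t table at n - 1 df."""
    return (xbar - mu0) / (s / sqrt(n))


def proportion_z(successes, n, p0):
    """Large-sample z statistic for H0: p = p0."""
    p_hat = successes / n
    return (p_hat - p0) / sqrt(p0 * (1 - p0) / n)


def sample_is_large_enough(successes, n):
    """Rule of thumb: p_hat +/- 2*sqrt(p_hat*q_hat/n) must exclude 0 and 1."""
    p_hat = successes / n
    half_width = 2 * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width > 0 and p_hat + half_width < 1


# Example 9.4 (two-tailed, alpha = .05): the critical value is copied from
# the t table for 9 degrees of freedom, not computed here.
T_025_DF9 = 2.262
t_stat = one_sample_t(xbar=1410, mu0=1500, s=90, n=10)
reject_mean_h0 = abs(t_stat) > T_025_DF9

# Example 9.5 (upper-tailed, alpha = .05): z_.05 from the standard normal.
z_stat = proportion_z(successes=20, n=150, p0=0.10)
z_crit = NormalDist().inv_cdf(1 - 0.05)
reject_prop_h0 = z_stat > z_crit

print(f"Example 9.4: t = {t_stat:.3f}, reject H0: {reject_mean_h0}")
print(f"Example 9.5: z = {z_stat:.3f}, critical z = {z_crit:.3f}, "
      f"rule of thumb met: {sample_is_large_enough(20, 150)}, "
      f"reject H0: {reject_prop_h0}")
```

Keeping the statistic and the decision rule in separate steps mirrors the boxes above: the same statistic serves a one-tailed or two-tailed test, and only the comparison with the critical value changes.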
9.4 Hypothesis tests about the difference between two population means

There are two brands of coffee, A and B. Suppose a consumer group wishes to determine whether the mean price per pound of brand A exceeds the mean price per pound of brand B. That is, the consumer group will test the null hypothesis H0: (µ1 − µ2) = 0 against the alternative Ha: (µ1 − µ2) > 0. The large-sample procedure described in the box is applicable to testing a hypothesis about (µ1 − µ2), the difference between two population means.

Large-sample test of hypothesis about (µ1 − µ2)

ONE-TAILED TEST
H0: (µ1 − µ2) = D0
Ha: (µ1 − µ2) > D0 (or Ha: (µ1 − µ2) < D0)
Rejection region: z > zα (or z < −zα)

TWO-TAILED TEST
H0: (µ1 − µ2) = D0
Ha: (µ1 − µ2) ≠ D0
Rejection region: z < −zα/2 or z > zα/2

Test statistic (both cases):

$$z = \frac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sigma_{(\bar{x}_1 - \bar{x}_2)}} \approx \frac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$

[Note: In many practical applications, we wish to hypothesize that there is no difference between the population means; in such cases, D0 = 0.]

Assumptions:
1. The sample sizes n1 and n2 are sufficiently large (n1 ≥ 30 and n2 ≥ 30).
2. The samples are selected randomly and independently from the target populations.

Example 9.6 A consumer group selected independent random samples of supermarkets located throughout a country for the purpose of comparing the retail prices per pound of coffee of brands A and B. The results of the investigation are summarized in Table 9.1. Does this evidence indicate that the mean retail price per pound of brand A coffee is significantly higher than the mean retail price per pound of brand B coffee? Use a significance level of α = .01.

Table 9.1 Coffee prices for Example 9.6

                      Brand A                Brand B
Sample size           n1 = 75                n2 = 64
Sample mean           $\bar{x}_1$ = $3.00    $\bar{x}_2$ = $2.95
Standard deviation    s1 = $.11              s2 = $.09

Solution The consumer group wants to test the hypotheses

H0: (µ1 − µ2) = 0 (i.e., no difference between mean retail prices)
Ha: (µ1 − µ2) > 0 (i.e., mean retail price per pound of brand A is higher than that of brand B)

where
µ1 = Mean retail price per pound of brand A coffee at all supermarkets
µ2 = Mean retail price per pound of brand B coffee at all supermarkets

This one-tailed, large-sample test is based on a z statistic. Thus, we will reject H0 if z > zα = z.01. Since z.01 = 2.33, the rejection region is given by z > 2.33 (see Fig. 9.3).

We compute the test statistic as follows:

$$z = \frac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} = \frac{(3.00 - 2.95) - 0}{\sqrt{\dfrac{(.11)^2}{75} + \dfrac{(.09)^2}{64}}} = 2.947$$

Figure 9.3 Rejection region for Example 9.6

Since this computed value of z = 2.947 lies in the rejection region, there is sufficient evidence (at α = .01) to conclude that the mean retail price per pound of brand A coffee is significantly higher than the mean retail price per pound of brand B coffee. The probability of our having committed a Type I error is α = .01.
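The large-sample z statistic of Example 9.6 can be reproduced in a few lines. This is an illustrative sketch only: `two_sample_z` is a hypothetical helper name, and the critical value is taken from the standard normal distribution in Python's `statistics` module rather than from the printed table.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_z(xbar1, s1, n1, xbar2, s2, n2, d0=0.0):
    """Large-sample z statistic for H0: (mu1 - mu2) = d0.

    The sample standard deviations s1 and s2 stand in for the unknown
    population values, which is reasonable when n1 and n2 are both >= 30.
    """
    standard_error = sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
    return ((xbar1 - xbar2) - d0) / standard_error


# Data of Table 9.1: brand A is sample 1, brand B is sample 2.
z_stat = two_sample_z(xbar1=3.00, s1=0.11, n1=75,
                      xbar2=2.95, s2=0.09, n2=64)

# Upper-tailed test at alpha = .01: reject H0 when z exceeds z_.01.
alpha = 0.01
z_crit = NormalDist().inv_cdf(1 - alpha)
reject_h0 = z_stat > z_crit

print(f"z = {z_stat:.3f}, critical z = {z_crit:.2f}, reject H0: {reject_h0}")
```

For a two-tailed alternative, the same statistic would be compared with `NormalDist().inv_cdf(1 - alpha / 2)` in absolute value.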
When the sample sizes n1 and n2 are inadequate to permit use of the large-sample procedure of Example 9.6, we make some modifications to perform a small-sample test of hypothesis about the difference between two population means. The test procedure is based on assumptions that are more restrictive than in the large-sample case. The elements of the hypothesis test and required assumptions are listed in the next box.

Small-sample test of hypothesis about (µ1 − µ2)

ONE-TAILED TEST
H0: (µ1 − µ2) = D0
Ha: (µ1 − µ2) > D0 (or Ha: (µ1 − µ2) < D0)
Rejection region: t > tα (or t < −tα)

TWO-TAILED TEST
H0: (µ1 − µ2) = D0
Ha: (µ1 − µ2) ≠ D0
Rejection region: t < −tα/2 or t > tα/2

Test statistic (both cases):

$$t = \frac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sqrt{s_p^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$

where

$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$$

and the distribution of t is based on (n1 + n2 − 2) degrees of freedom.

Assumptions:
1. The populations from which the samples are selected both have approximately normal relative frequency distributions.
2. The variances of the two populations are equal.
3. The random samples are selected in an independent manner from the two populations.

Example 9.7 There was a research on the weights at birth of the children of urban and rural women. The researcher suspects there is a significant difference between the mean weights at birth of children of urban and rural women. To test this hypothesis, he selects independent random samples of weights at birth of children of mothers from each group, calculates the mean weights and standard deviations, and summarizes them in Table 9.2. Test the researcher's belief, using a significance level of α = .02.

Table 9.2 Weight at birth data for Example 9.7

                      Urban mothers              Rural mothers
Sample size           n1 = 15                    n2 = 14
Sample mean           $\bar{x}_1$ = 3.5933 kg    $\bar{x}_2$ = 3.2029 kg
Standard deviation    s1 = .3707 kg              s2 = .4927 kg

Solution The researcher wants to test the following hypothesis:

H0: (µ1 − µ2) = 0 (i.e., no difference between mean weights at birth)
Ha: (µ1 − µ2) ≠ 0 (i.e., mean weights at birth of children of urban and rural women are different)

where µ1 and µ2 are the true mean weights at birth of children of urban and rural women, respectively.

Since the sample sizes for the study are small (n1 = 15, n2 = 14), the following assumptions are required:
1. The populations of weights at birth of children both have approximately normal distributions.
2. The variances of the populations of weights at birth of children for the two groups of mothers are equal.
3. The samples were independently and randomly selected.

If these three assumptions are valid, the test statistic will have a t-distribution with (n1 + n2 − 2) = (15 + 14 − 2) = 27 degrees of freedom. With a significance level of α = .02, the rejection region is given by t < −t.01 = −2.473 or t > t.01 = 2.473 (see Figure 9.4).

Figure 9.4 Rejection region of Example 9.7

Since we have assumed that the two populations have equal variances (i.e., that $\sigma_1^2 = \sigma_2^2 = \sigma^2$), we need to compute an estimate of this common variance. Our pooled estimate is given by

$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} = \frac{(15 - 1)(.3707)^2 + (14 - 1)(.4927)^2}{15 + 14 - 2} = .1881$$

Using this pooled sample variance in the computation of the test statistic, we obtain

$$t = \frac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sqrt{s_p^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} = \frac{(3.5933 - 3.2029) - 0}{\sqrt{.1881\left(\dfrac{1}{15} + \dfrac{1}{14}\right)}} = 2.422$$

Now the computed value of t does not fall within the rejection region; thus, we fail to reject the null hypothesis (at α = .02) and conclude that there is insufficient evidence of a difference between the mean weights at birth of children of urban and rural women.

In this example, we can see that the computed value of t is very close to the upper boundary of the rejection region. This region is specified by the significance level and the degrees of freedom. How is the conclusion about the difference between the mean weights at birth affected if the significance level is α = .05? We will answer the question in the next example.
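The pooled-variance calculation of Example 9.7 is easy to get wrong by hand, so a short numerical check may help. The sketch below is an illustration under stated assumptions: the helper names are hypothetical, and the critical value 2.473 (t.01 with 27 degrees of freedom) is copied from the t table because the Python standard library does not provide the t-distribution.

```python
from math import sqrt


def pooled_variance(s1, n1, s2, n2):
    """Pooled estimate of the common variance of two normal populations."""
    return ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)


def pooled_t(xbar1, s1, n1, xbar2, s2, n2, d0=0.0):
    """Small-sample t statistic for H0: (mu1 - mu2) = d0 (equal variances)."""
    sp2 = pooled_variance(s1, n1, s2, n2)
    return ((xbar1 - xbar2) - d0) / sqrt(sp2 * (1 / n1 + 1 / n2))


# Data of Table 9.2: urban mothers are sample 1, rural mothers are sample 2.
sp2 = pooled_variance(s1=0.3707, n1=15, s2=0.4927, n2=14)
t_stat = pooled_t(xbar1=3.5933, s1=0.3707, n1=15,
                  xbar2=3.2029, s2=0.4927, n2=14)
df = 15 + 14 - 2

# Two-tailed test at alpha = .02; tabulated t_.01 for 27 degrees of freedom.
T_01_DF27 = 2.473
reject_h0 = abs(t_stat) > T_01_DF27

print(f"pooled variance = {sp2:.4f}, t = {t_stat:.3f} on {df} df, "
      f"reject H0 at alpha = .02: {reject_h0}")
```

Changing the significance level only changes the tabulated critical value that `abs(t_stat)` is compared with; the statistic itself stays the same.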
Example 9.8 Refer to Example 9.7. Test the investigator's belief, using a significance level of α = .05.

Solution With a significance level of α = .05, the rejection region is given by t < −t.025 = −2.052 or t > t.025 = 2.052 (see Figure 9.5).

Figure 9.5 Rejection region of Example 9.8

Since the sample sizes are not changed, the test statistic is the same as in Example 9.7, t = 2.422.

Now the value of t falls in the rejection region, and we have sufficient evidence at a significance level of α = .05 to conclude that the mean weight at birth of children of urban women differs significantly from (or, we can say, is higher than) the mean weight at birth of children of rural women. But you should notice that the probability of our having committed a Type I error is α = .05.

9.5 Hypothesis tests about the difference between two proportions

Suppose we are interested in comparing p1, the proportion of one population, with p2, the proportion of another population. Then the target parameter about which we will test a hypothesis is (p1 − p2). Recall that p1 and p2 also represent the probabilities of success for two binomial experiments. The method for performing a large-sample test of hypothesis about (p1 − p2), the difference between two binomial proportions, is outlined in the following box.

Large-sample test of hypothesis about (p1 − p2)

ONE-TAILED TEST
H0: (p1 − p2) = D0
Ha: (p1 − p2) > D0 (or Ha: (p1 − p2) < D0)
Rejection region: z > zα (or z < −zα)

TWO-TAILED TEST
H0: (p1 − p2) = D0
Ha: (p1 − p2) ≠ D0
Rejection region: z < −zα/2 or z > zα/2

Test statistic (both cases):

$$z = \frac{(\hat{p}_1 - \hat{p}_2) - D_0}{\sigma_{(\hat{p}_1 - \hat{p}_2)}} \qquad \text{where} \qquad \sigma_{(\hat{p}_1 - \hat{p}_2)} = \sqrt{\frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}}$$

When D0 ≠ 0, calculate $\sigma_{(\hat{p}_1 - \hat{p}_2)}$ using $\hat{p}_1$ and $\hat{p}_2$:

$$\sigma_{(\hat{p}_1 - \hat{p}_2)} \approx \sqrt{\frac{\hat{p}_1 \hat{q}_1}{n_1} + \frac{\hat{p}_2 \hat{q}_2}{n_2}}$$

where $\hat{q}_1 = 1 - \hat{p}_1$ and $\hat{q}_2 = 1 - \hat{p}_2$.

For the special case where D0 = 0, calculate

$$\sigma_{(\hat{p}_1 - \hat{p}_2)} \approx \sqrt{\hat{p}\hat{q}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$

where the total number of successes in the combined samples is (x1 + x2) and

$$\hat{p}_1 = \hat{p}_2 = \hat{p} = \frac{x_1 + x_2}{n_1 + n_2}$$

Assumption: The intervals $\hat{p}_1 \pm 2\sqrt{\hat{p}_1 \hat{q}_1 / n_1}$ and $\hat{p}_2 \pm 2\sqrt{\hat{p}_2 \hat{q}_2 / n_2}$ do not contain 0 and 1.

When testing the null hypothesis that (p1 − p2) equals some specified difference D0, we make a distinction between the case D0 = 0 and the case D0 ≠ 0. For the special case D0 = 0, i.e., when we are testing H0: (p1 − p2) = 0 or, equivalently, H0: p1 = p2, the best estimate of p1 = p2 = p is found by dividing the total number of successes in the combined samples by the total number of observations in the two samples. That is, if x1 is the number of successes in sample 1 and x2 is the number of successes in sample 2, then

$$\hat{p} = \frac{x_1 + x_2}{n_1 + n_2}$$

In this case, the best estimate of the standard deviation of the sampling distribution of $(\hat{p}_1 - \hat{p}_2)$ is found by substituting $\hat{p}$ for both $\hat{p}_1$ and $\hat{p}_2$:

$$\sigma_{(\hat{p}_1 - \hat{p}_2)} = \sqrt{\frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}} \approx \sqrt{\frac{\hat{p}\hat{q}}{n_1} + \frac{\hat{p}\hat{q}}{n_2}} = \sqrt{\hat{p}\hat{q}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$

For all cases in which D0 ≠ 0 [for example, when testing H0: (p1 − p2) = .2], we use $\hat{p}_1$ and $\hat{p}_2$ in the formula for $\sigma_{(\hat{p}_1 - \hat{p}_2)}$. However, in most practical situations, we will want to test for a difference between proportions, that is, we will want to test H0: (p1 − p2) = 0.

The sample sizes n1 and n2 must be sufficiently large to ensure that the sampling distributions of $\hat{p}_1$ and $\hat{p}_2$, and hence of the difference $(\hat{p}_1 - \hat{p}_2)$, are approximately normal. The rule of thumb given in the previous box may be used to determine if the sample sizes are "sufficiently large."

Example 9.9 Two types of needles, the old type and the new type, are used for injection of medical patients with a certain substance. The patients were allocated at random to two groups, one to receive the injection from needles of the old type, the other to receive the injection from needles of the new type. Table 9.3 shows the number of patients showing reactions to the injection. Does the information support the belief that the proportion of patients giving reactions to needles of the old type is less than the corresponding proportion of patients giving reactions to needles of the new type? Test at significance level of α = .01.

Table 9.3 Data on the patients' reactions in Example 9.9

                                    Injected by old type needles    Injected by new type needles
Number of sampled patients          100                             100
Number in sample with reactions     37                              56

Solution We wish to perform a test of

H0: (p1 − p2) = 0
Ha: (p1 − p2) < 0

where
p1 = Proportion of patients giving reactions to needles of the old type
p2 = Proportion of patients giving reactions to needles of the new type

For this large-sample, one-tailed test, the null hypothesis will be rejected if z < −z.01 = −2.33 (see Figure 9.6).

The sample proportions $\hat{p}_1$ and $\hat{p}_2$ are computed for substitution into the formula for the test statistic:

$\hat{p}_1$ = Sample proportion of patients giving reactions with needles of the old type = 37/100 = .37
$\hat{p}_2$ = Sample proportion of patients giving reactions with needles of the new type = 56/100 = .56

Hence, $\hat{q}_1 = 1 - \hat{p}_1 = 1 - .37 = .63$ and $\hat{q}_2 = 1 - \hat{p}_2 = 1 - .56 = .44$.

Since D0 = 0 for this test of hypothesis, the test statistic is given by

$$z = \frac{(\hat{p}_1 - \hat{p}_2) - D_0}{\sqrt{\hat{p}\hat{q}\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$

where

$$\hat{p} = \frac{\text{Total number of patients giving reactions with needles of both types}}{\text{Total number of patients sampled}} = \frac{37 + 56}{100 + 100} = .465$$

Then we have

$$z = \frac{(.37 - .56) - 0}{\sqrt{(.465)(.535)\left(\dfrac{1}{100} + \dfrac{1}{100}\right)}} = -2.69$$

This value falls below the critical value of −2.33. Thus, at α = .01, we reject the null hypothesis; there is sufficient evidence to conclude that the proportion of patients giving reactions to needles of the old type is significantly less than the corresponding proportion of patients giving reactions to needles of the new type, i.e., p1 < p2.

Figure 9.6 Rejection region of Example 9.9

The inference derived from the test in Example 9.9 is valid only if the sample sizes, n1 and n2, are sufficiently large to guarantee that the intervals

$$\hat{p}_1 \pm 2\sqrt{\frac{\hat{p}_1 \hat{q}_1}{n_1}} \qquad \text{and} \qquad \hat{p}_2 \pm 2\sqrt{\frac{\hat{p}_2 \hat{q}_2}{n_2}}$$

do not contain 0 and 1. This requirement is satisfied for Example 9.9:

$$\hat{p}_1 \pm 2\sqrt{\frac{\hat{p}_1 \hat{q}_1}{n_1}} = .37 \pm 2\sqrt{\frac{(.37)(.63)}{100}} = .37 \pm .097 \quad \text{or} \quad (.273, .467)$$

$$\hat{p}_2 \pm 2\sqrt{\frac{\hat{p}_2 \hat{q}_2}{n_2}} = .56 \pm 2\sqrt{\frac{(.56)(.44)}{100}} = .56 \pm .099 \quad \text{or} \quad (.461, .659)$$
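The pooled two-proportion test of Example 9.9, together with the rule-of-thumb intervals, can be sketched as follows. The function names are hypothetical and chosen for this illustration; the code covers only the special case D0 = 0, where the pooled estimate of p is used in the standard error.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z(x1, n1, x2, n2):
    """z statistic for H0: (p1 - p2) = 0 using the pooled estimate of p."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    p_pooled = (x1 + x2) / (n1 + n2)
    standard_error = sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
    return (p1_hat - p2_hat) / standard_error


def rule_of_thumb_interval(x, n):
    """Interval p_hat +/- 2*sqrt(p_hat*q_hat/n); it should exclude 0 and 1."""
    p_hat = x / n
    half_width = 2 * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


# Data of Table 9.3: old-type needles are sample 1, new-type are sample 2.
z_stat = two_proportion_z(x1=37, n1=100, x2=56, n2=100)

# Lower-tailed test at alpha = .01: reject H0 when z < -z_.01.
z_crit = -NormalDist().inv_cdf(1 - 0.01)
reject_h0 = z_stat < z_crit

old_interval = rule_of_thumb_interval(37, 100)
new_interval = rule_of_thumb_interval(56, 100)

print(f"z = {z_stat:.2f}, critical z = {z_crit:.2f}, reject H0: {reject_h0}")
print(f"old type: ({old_interval[0]:.3f}, {old_interval[1]:.3f})")
print(f"new type: ({new_interval[0]:.3f}, {new_interval[1]:.3f})")
```

For a nonzero hypothesized difference D0, the standard error would instead use the two sample proportions separately, as described in the box above.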
9.6 Hypothesis test about a population variance

Hypothesis tests about a population variance σ² are conducted using the chi-square (χ²) distribution introduced in Chapter 7. The test is outlined in the box. Note that the assumption of a normal population is required regardless of whether the sample size n is large or small.

Test of hypothesis about a population variance σ²

ONE-TAILED TEST
H0: σ² = σ0²
Ha: σ² > σ0² (or Ha: σ² < σ0²)
Rejection region: χ² > χ²α (or χ² < χ²1−α)

TWO-TAILED TEST
H0: σ² = σ0²
Ha: σ² ≠ σ0²
Rejection region: χ² < χ²1−α/2 or χ² > χ²α/2

Test statistic (both cases):

$$\chi^2 = \frac{(n - 1)s^2}{\sigma_0^2}$$

where χ²α and χ²1−α are values of χ² that locate an area of α to the right and α to the left, respectively, of a chi-square distribution based on (n − 1) degrees of freedom.

[Note: σ0² is our symbol for the particular numerical value specified for σ² in the null hypothesis.]

Assumption: The population from which the random sample is selected has an approximately normal distribution.

Example 9.10 A quality control supervisor in a cannery knows that the exact amount each can contains will vary, since there are certain uncontrollable factors that affect the amount of fill. The mean fill per can is important, but equally important is the variation σ² of the amount of fill. If σ² is large, some cans will contain too little and others too much. Suppose regulatory agencies specify that the standard deviation of the amount of fill should be less than .1 ounce. The quality control supervisor sampled n = 10 cans and calculated s = .04. Does this value of s provide sufficient evidence to indicate that the standard deviation σ of the fill measurements is less than .1 ounce?

Solution Since the null and alternative hypotheses must be stated in terms of σ² (rather than σ), we will want to test the null hypothesis that σ² = .01 against the alternative that σ² < .01. Therefore, the elements of the test are

H0: σ² = .01
Ha: σ² < .01

Assumption: The population of "amounts of fill" of the cans is approximately normal.

Test statistic: $\chi^2 = \dfrac{(n - 1)s^2}{\sigma_0^2}$

Rejection region: The smaller the value of s² we observe, the stronger the evidence in favor of Ha. Thus, we reject H0 for "small values" of the test statistic. With α = .05 and 9 df, the χ² value for rejection is found in Table 3, Appendix C and pictured in Figure 9.7. We will reject H0 if χ² < 3.32511.

Remember that the area given in Table 3 of Appendix C is the area to the right of the numerical value in the table. Thus, to determine the lower-tail value that has α = .05 to its left, we use the χ².95 column in Table 3 of Appendix C.

Figure 9.7 Rejection region of Example 9.10

Since

$$\chi^2 = \frac{(n - 1)s^2}{\sigma_0^2} = \frac{9(.04)^2}{.01} = 1.44$$

is less than 3.32511, the supervisor can conclude that the variance of the population of all amounts of fill is less than .01 (σ < 0.1) with 95% confidence. As usual, the confidence is in the procedure used, the χ² test. If this procedure is repeatedly used, it will incorrectly reject H0 only 5% of the time. Thus, the quality control supervisor is confident in the decision that the cannery is operating within the desired limits of variability.
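The chi-square statistic of Example 9.10 is a single line of arithmetic, sketched below for checking. The helper name is hypothetical, and the lower-tail critical value 3.32511 (the χ².95 entry for 9 degrees of freedom) is copied from the table, since the Python standard library has no chi-square distribution.

```python
def variance_chi_square(s, n, sigma0_squared):
    """Chi-square statistic (n - 1) * s^2 / sigma0^2 for H0: sigma^2 = sigma0^2."""
    return (n - 1) * s ** 2 / sigma0_squared


# Example 9.10: n = 10 cans, s = .04 ounce, hypothesized variance (.1)^2 = .01.
chi2_stat = variance_chi_square(s=0.04, n=10, sigma0_squared=0.01)

# Lower-tailed test at alpha = .05 with 9 df; tabulated chi-square_.95 value.
CHI2_95_DF9 = 3.32511
reject_h0 = chi2_stat < CHI2_95_DF9

print(f"chi-square = {chi2_stat:.2f}, critical value = {CHI2_95_DF9}, "
      f"reject H0: {reject_h0}")
```

Note that the hypothesis is stated for the variance, so the regulatory limit on the standard deviation (.1 ounce) enters the function squared.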
Consequently. respectively. 2 The elements of a hypothesis test for the ratio of two population variances.e. 2 The common statistical procedure for comparing two population variances.µ2). If the two population variances are greatly different. the df for the sample variance in the denominator). This is because the sampling 2 distribution of the estimator for σ 12 / σ 2 is well known when the samples are randomly and independently selected from two normal populations. The random samples are selected in an independent manner from the two populations.11 A class of 31 students were randomly divided into an experimental set of size n1 = 18 that received instruction in a new statistics unit and a control set of size n2 = cxlvi . in the upper tail of the Fdistribution with ν1 = numerator degrees of freedom (i. and Fα/2 are values that locate an area α and α/2...e.e.2 Ha: σ 12 / σ 2 > 1 2 [Ha: σ 12 / σ 2 < 1 2 (i. Example 9. discussed in Section 9. are given in the preceding box. before applying the smallsample t test. the df for the sample variance in the numerator) and ν2 = denominator degrees of freedom (i. any inferences derived from the t test are suspect.σ 12 > σ 2 ) or 2 (i.4. Smaller sample variance s12 2 s2 F = 2 s2 s2 1 2 when s12 > s 2 2 when s 2 > s12 Rejection region: F > Fα Rejection region: F > Fα/2 where Fα. σ 12 / σ 2 .e. Variance tests can also be applied prior to conducting a smallsample t test for (µ1 . it is important that we detect a significant difference between the two variances. makes 2 an inference about the ratio σ 12 / σ 2 . Both of the populations from which the samples are selected have relative frequency distributions that are approximately normal.e.σ 12 ≠ σ 2 ) Test statistic: Test statistic: F= s12 s2 or F = 2 2 s2 s12 F= Larger sample variance i.e. if it exists. σ 12 and σ 2 . Assumptions: 1. Recall that the t test requires the assumption that the variances of the two sampled populations are equal. 2. 
σ 12 < σ 2 ) ] 2 Ha: σ 12 / σ 2 ≠ 1 2 (i.
Table 9.13 that received the standard statistics instruction. an Fdistribution can be symmetric about its mean.11 Control set Experimental set Sample size 18 13 Standard deviation 1. Under the assumption that both samples of test scores come from normal populations.58 1 Smaller s 2 s1 (1. You can see that this particular Fdistribution is skewed to the right. The columns of the table correspond to various numerator degrees of freedom. in this example.1) denominator degrees of freedom.93 3. Unlike the z and tdistributions of the preceding sections. Table 9.4. degrees of freedom associated with (n2 .05 = 2.4 Data on students' scores in Example 9. respectively.10 Solution Let σ 12 = Variance of test scores of the experimental population 2 σ 2 = Variance of test scores of the control population The hypotheses of interest are 2 H0: σ 12 / σ 2 = 1 2 Ha: σ 12 / σ 2 ≠ 1 2 (σ 12 = σ 2 ) 2 (σ 12 ≠ σ 2 ) According to the box.1) = 17. while the rows correspond to various denominator degrees of freedom.5 is partially reproduced from this table. we need to know the sampling distribution of the test statistic. It gives F values that correspond to α = . F.05 uppertail areas for different pairs of degrees of freedom. we find the F value.8. Uppertail critical values of F are found in Table 4 of Appendix C. Do the data provide sufficient evidence to indicate a difference in the variability of this skill in the hypothetical population of students who might be given the new instruction and the population of students who might be given the standard instruction? Test using α = .1) = 12 and (n2 .38 cxlvii .1) numerator degrees of freedom and ν2= (n1 . An Fdistribution with ν1= 12 numerator df and ν2 = 17 denominator df is shown in Figure 9.93) 2 To find the appropriate rejection region.10) 2 Larger s 2 = 2 = = 2. the test statistic for this twotailed test is F= s 2 (3. if the numerator degrees of freedom are 12 and the denominator degrees of freedom are 17. Thus. 
possesses an F distribution with ν1 = (n2 . 2 the F statistic. its exact shape depends on the 2 s 2 and s12 . skewed to the left. or skewed to the right.01. A summary of the results appears in Table 9. F = s 2 / s12 . All students were given a test of computational skill at the end of the course.
For α = . α/2 = . Since the test is twotailed. we reject H0.no lowertail values are given. Given this information on the Fdistribution. The problem of not being able to locate an F value in the lower tail of the Fdistribution is easily avoided in a onetailed test because we can control how we specify the cxlviii . The reason we place the larger sample variance in the numerator of the test statistic is that only uppertail values of F are shown in the F table of Appendix C . Since the test statistic.38 is α/2 = .38 in the Fdistribution with 12 numerator df and 17 denominator df. we make certain that only the upper tail of the rejection region is used. Therefore. The fact that the uppertail area is α/2 reminds us that the test is twotailed. F = 2. Example 9. It appears that the new statistics instruction results in a greater variability in computational skill. Thus.38 (based on ν1 = 12 and ν2 = 17 df). at α = . the probability that the F statistic will exceed 2.05 and F.05 = 2. we will reject H0 if F > Fα/2. the rejection region is Figure 9.11 illustrates the technique for calculating the test statistic and rejection region for a twotailed F test. Thus.8. By placing the larger sample variance in the numerator.38.8).8 Rejection region of Example 9. we are now able to find the rejection region for this test. falls in the rejection region (see Figure 9. the data provide sufficient evidence to indicated that the population variances differ.As shown in Figure 9.10.11 Rejection region: Reject H0 if F > 2.58.10.05.0 5 is the tail area to the right of 2. we have α/2 = .
66 2.74 5.ratio of the population variances in H0 and Ha.62 2.30 2.49 2.01 2.27 2.46 8.38 19.45 8.16 2.18 2.23 2.20 253.96 cxlix .57 3.43 8.44 3.45 2.42 2.10 251.54 2.12 2.39 2.43 2.83 2.80 4.13 2.06 3.15 2.53 2.05 ν1 ν2 1 2 Denominator degrees of freedom 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 10 12 15 Numerator degrees of freedom 20 24 30 40 60 120 ∞ 241. α = .57 5.64 3.54 2.15 2.42 2.66 4.58 2.27 2.34 2.46 2.10 252.5 Reproduction of part of Table 4 from Appendix C.87 3.74 3. That is.11 2.94 3.90 2.19 19.40 8.70 2.91 2.75 2.85 2.70 3.53 2.10 250.64 5.67 3.56 3.28 2.85 2.61 2.00 3.31 19.53 5.57 2.25 2.63 4.34 3. Table 9.24 2.72 4.55 5.38 2.77 2.69 2.74 2.15 19.50 8.45 19.41 3.65 2.38 3.74 4.96 4.62 3.77 3.22 3.19 2.35 2.98 2.07 2.06 19.72 2.20 2.45 8.29 2.60 2.22 2.90 245.79 2.86 4.70 5.33 254.31 2.47 8.33 2.34 2.67 2.75 2.00 249.28 3.51 2.48 2.08 2.30 3.93 2.07 2.47 2.94 2.40 2.91 4.79 5.84 3.49 8.90 243.01 19.60 2.35 2.01 2.36 3.81 3.40 2.01 1.35 3.40 3.68 4.46 3.06 2.53 3.10 19.66 5.86 2.62 5.30 2.77 4.04 2.71 2.59 5.21 2.69 4.54 2.62 2.90 248.41 8.38 2.14 2.79 2.51 3.11 2.25 2. we can always make a onetailed test an uppertailed test.23 19.75 4.48 8.43 3.50 3.49 2.97 2.30 19.53 2.46 2.
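The two-tailed F test of Example 9.11 can be sketched as below. The helper name is hypothetical; it places the larger sample variance in the numerator and reports the matching degrees of freedom, so that only an upper-tail table value is needed. The critical value 2.38 is the tabulated F.05 for 12 and 17 degrees of freedom quoted in the text, not something computed by the code.

```python
def two_tailed_f(s_a, n_a, s_b, n_b):
    """Return (F, numerator df, denominator df) with the larger variance on top."""
    var_a, var_b = s_a ** 2, s_b ** 2
    if var_a >= var_b:
        return var_a / var_b, n_a - 1, n_b - 1
    return var_b / var_a, n_b - 1, n_a - 1


# Example 9.11: one sample has n = 18 and s = 1.93, the other n = 13 and s = 3.10.
f_stat, df_num, df_den = two_tailed_f(s_a=1.93, n_a=18, s_b=3.10, n_b=13)

# Two-tailed test at alpha = .10, so the upper-tail area is alpha/2 = .05.
F_05_DF12_17 = 2.38
reject_h0 = f_stat > F_05_DF12_17

print(f"F = {f_stat:.2f} with ({df_num}, {df_den}) df, "
      f"critical value = {F_05_DF12_17}, reject H0: {reject_h0}")
```

Returning the degrees of freedom alongside F guards against the most common slip in this test, which is looking up the table with numerator and denominator df interchanged.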
9. H 0 : µ = 40.18.0 n = 55 a.80. n = 1. b. s = . test statistic value. As we note with the estimation techniques of Chapter 7. a.25. specify the rejection region.5. H a : µ ≠ 120. means.8 Summary In this chapter we have learnt the procedures for testing hypotheses about various population parameters.0 n = 45 Sample 2 x = 6 . H a : p > .18 against the alternative that µ < 1. Test the null hypothesis H0: (µ1 .05 ˆ c. and standard deviations are shown in the table.5 at α = . α = .6. α = . s 2 = 64. b.2.85. b.18. α = . A reasonable approach to hypothesis testing blends a valid application of the formal statistical procedures with the researcher's knowledge of the subject matter.6.3 ∑x 2 = 68 9.10 ˆ b.4.10 9.05. n = 48. specify the test statistic and reject region. p = .5 s = 3 . For each of the following situations. fewer assumptions about the sampled populations are required when the sample sizes are large. Test the null hypothesis that µ = 1. H a : p < . and conclusion: ˆ a.05.05 a. It would be emphasized that statistical significance differs from practical significance. Use α = . Use α = . and the two must not be confused.05. H a : µ > 40.85.9. Two independent random samples are selected from populations with means µ1 and µ2. H 0 : p = .04. x = 9.5.25.18 against the alternative that µ < 1. p = . For each of the following situations. A random sample of n observation is selected from a population with unknown mean µ and variance σ2. H 0 : p = . s = 9. α = . H a : p ≠ . H 0 : µ = 120.01 9.28. n = 200.01 c.10.5 against the alternative hypothesis Ha: (µ1 µ2) ≠ .1.µ2) = . The sample sizes.5 s = 1 . α = . Independent random samples selected from two binomial populations produced the results given in the table cl . Test the null hypothesis H0: (µ1 .µ2) = 0 against the alternative hypothesis Ha: (µ1 µ2) ≠ 0 at α = . 9.3.05.01.000. α = . n = 60.9 Exercises 9. n = 40. respectively. Sample 1 x = 7 . A random sample of n observations is selected from a binominal population. 
H 0 : p = . H 0 : µ = 11. n = 35. x = 140. p = . Test the null hypothesis that µ = 1. Often the comparison focuses on the means. x = 60. H a : µ < 11. A random sample of 51 measurements produced the following sums: ∑ x = 50.5.
s12 = 1. Suppose n1 = n 2 = 1.6.10 ˆ ˆ ˆ b. Test H 0 : ( p1 − p 2 ) = 0. H a : σ 12 / σ 2 > 1. Is this sufficient evidence to conclude that σ2 ≠ 2. 9. 9. 14.52. Use α = . 7. and p remain the same as in part a. s 2 = 2.75.05. but the sample estimates p1 . 2. A random sample of n = 10 observations yields x = 231. s 2 = 5. Test the null hypothesis H0: σ2 = 20 against the alternative hypothesis Ha: σ2 < 20. H a : ( p1 − p 2 ) > 0.05. cli . s1 = 1. The following measurements represent a random sample of n = 5 observations from a normal population: 10. Calculate the value of the test statistic for testing H0: σ12/σ22 in each of following cases: 2 2 a.750. 9.8.Sample 1 Number of successes Sample sizes 80 100 Sample 2 74 100 a. H a : σ 1 / σ 2 ≠ 1.000. H a : ( p1 − p 2 ) > 0. at α = .10.23 2 2 b. What assumptions are necessary for the test to be valid.90 2 2 2 2 c.7. s12 =1. at α = . Test H 0 : ( p1 − p 2 ) = 0. p 2 . Test using α = . s 2 =1.7 and s 2 = 15.5. H a : σ 12 / σ 2 < 1.235 9.
Chapter 10 Categorical data analysis and analysis of variance

CONTENTS
10.1 Introduction
10.2 Tests of goodness of fit
10.3 The analysis of contingency tables
10.4 Contingency tables in statistical software packages
10.5 Introduction to analysis of variance
10.6 Design of experiments
10.7 Completely randomized designs
10.8 Randomized block designs
10.9 Multiple comparisons of means and confidence regions
10.10 Summary
10.11 Exercises

10.1 Introduction
In this chapter we present some methods for the treatment of categorical data. The methods involve the comparison of a set of observed frequencies with frequencies specified by some hypothesis to be tested. A test of such a hypothesis is called a test of goodness of fit. We will also show how to test the hypothesis that two categorical variables are independent. The test statistics discussed have sampling distributions that are approximated by chi-square distributions, and the tests are therefore called chi-square tests.
In this chapter we will also discuss the procedures for selecting sample data and analyzing variances. The objective of these sections is to introduce some aspects of experimental design and the analysis of data from such experiments using an analysis of variance. These tests are useful in analyzing more than two population means.

10.2 Tests of goodness of fit
We know that observations of a qualitative variable can only be categorized. For example, consider the highest level of education attained by each woman in a group of women in a rural region. "Level of education" is a qualitative variable and each woman would fall into one and only one of the following three categories: can read/write degree, primary degree, and secondary and above degree. The result of the categorization would be a count of the numbers of rural women falling in the respective categories.
When the qualitative variable results in one of two responses (yes or no, success or failure, favor or do not favor, etc.) the data (i.e., the counts) can be analyzed using the binomial probability distribution. However, qualitative variables such as "level of education" that allow for more than two categories for a response are much more common, and these must be analyzed using a different method called a test of goodness of fit. A test of goodness of fit tests whether a given distribution fits a set of data. It is based on the comparison of an observed frequency distribution with the hypothesized distribution.

Example 10.1 Level of education attained by the women from a rural region is divided into three categories: can read/write degree, primary degree, secondary and above degree. A demographer estimates that 28% of them have can read/write degree, 61% have primary degree and 11% have secondary and above degree. In order to verify these percentages, a random sample of n = 100 women in the region was selected and their level of education recorded. The number of women whose level of education falls into each of the three categories is shown in Table 10.1.

Table 10.1 Categories corresponding to level of education
Level of education    Can read/write    Primary    Secondary and above    Total
Number of women       22                64         14                     100

Do the data given in Table 10.1 disagree with the percentages of 28%, 61%, and 11% estimated by the demographer? As a first step in answering this question, we need to find the number of women in the sample of 100 that would be expected to fall in each of the three educational categories of Table 10.1, assuming that the demographer's percentages are accurate.

Solution Each woman in the sample was assigned to one and only one of the three educational categories listed in Table 10.1. If the demographer's percentages are correct, then the probabilities that an education level will fall in the three educational categories are as shown in Table 10.2.

Table 10.2 Category probabilities based on the demographer's percentages
Level of education    Can read/write    Primary     Secondary and above    Total
Cell number           1                 2           3
Cell probability      p1 = .28          p2 = .61    p3 = .11               1.00

Consider first the "Can read/write" cell of Table 10.2. If we assume that the level of education of any woman is independent of the level of education of any other, then the observed number O1 of responses falling into cell 1 is a binomial random variable and its expected value is
e1 = np1 = (100)(.28) = 28
Similarly, the expected numbers of responses in cells 2 and 3 (categories 2 and 3) are
e2 = np2 = (100)(.61) = 61
and
e3 = np3 = (100)(.11) = 11
The observed numbers of responses and the corresponding expected numbers (in parentheses) are shown in Table 10.3.

Table 10.3 Observed and expected numbers of responses falling in the cell categories for Example 10.1
Level of education    Can read/write    Primary    Secondary and above    Total
Observed numbers      22                64         14                     100
Expected numbers      (28)              (61)       (11)                   100

Formula for calculating expected cell counts

$$e_i = n p_i$$

where
$e_i$ = Expected count for cell i
$n$ = Sample size
$p_i$ = Hypothesized probability that an observation will fall in cell i

Do the observed responses for the sample of 100 women disagree with the category probabilities based on the demographer's estimates? If they do, we say that the theorized demographer probabilities do not fit the data or, alternatively, that a lack of fit exists. The relevant null and alternative hypotheses are:
H0: The category (cell) probabilities are p1 = .28, p2 = .61, p3 = .11
Ha: At least two of the probabilities p1, p2, p3 differ from the values specified in the null hypothesis
To find the value of the test statistic, we first calculate

$$\frac{(\text{Observed cell count} - \text{Expected cell count})^2}{\text{Expected cell count}} = \frac{(O_i - e_i)^2}{e_i}$$

for each of the cells, i = 1, 2, 3. The sum of these quantities is the test statistic used for the goodness-of-fit test:

$$\chi^2 = \frac{(O_1 - e_1)^2}{e_1} + \frac{(O_2 - e_2)^2}{e_2} + \frac{(O_3 - e_3)^2}{e_3} = \sum_{i=1}^{3} \frac{(O_i - e_i)^2}{e_i}$$

Substituting the values of the observed and expected cell counts from Table 10.3 into the formula for calculating χ², we obtain

$$\chi^2 = \sum_{i=1}^{3} \frac{(O_i - e_i)^2}{e_i} = \frac{(22 - 28)^2}{28} + \frac{(64 - 61)^2}{61} + \frac{(14 - 11)^2}{11} = 1.29 + .15 + .82 = 2.26$$

Example 10.2 Specify the rejection region for the test described in the preceding discussion. Use α = .05. Test to determine whether the sample data disagree with the demographer's estimated percentages.

Solution Since the value of chi-square increases as the differences between the observed and expected cell counts increase, we will reject
H0: p1 = .28, p2 = .61, p3 = .11
for values of chi-square larger than some critical value, say $\chi^2_\alpha$.
Rejection region: $\chi^2 > \chi^2_\alpha$
The critical values of the χ² distribution are given in Table 3 of Appendix C. The degrees of freedom for the chi-square statistic used to test the goodness of fit of a set of cell probabilities will always be 1 less than the number of cells. For example, if k cells were used in the categorization of the sample data, then
Degrees of freedom: df = k - 1
For our example, df = (k - 1) = (3 - 1) = 2 and α = .05. From Table 3 of Appendix C, the tabulated value of $\chi^2_\alpha$ corresponding to df = 2 is 5.99147.
The rejection region for the test, $\chi^2 > \chi^2_{.05}$, is illustrated in Figure 10.1. We will reject H0 if χ² > 5.99147. Since the calculated value of the test statistic, χ² = 2.26, is less than $\chi^2_{.05}$, we cannot reject H0. There is insufficient information to indicate a lack of fit of the sample data to the percentages estimated by the demographer.

[Figure 10.1 Rejection region for Example 10.2]

Summary of a goodness of fit test for specified values of the cell probabilities
H0: The k cell probabilities are p1, p2, . . . , pk
Ha: At least two of the cell probabilities differ from the values specified in H0
Test statistic:

$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - e_i)^2}{e_i}$$

where
$k$ = Number of cells in the categorization table
$O_i$ = Observed count for cell i
$e_i$ = Expected count for cell i
$n$ = Sample size = $O_1 + O_2 + \dots + O_k$
Rejection region: $\chi^2 > \chi^2_\alpha$
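As a quick numerical check of Examples 10.1 and 10.2, the following Python sketch recomputes the expected counts, the χ² statistic and the decision at α = .05. It is only an illustration: the variable names are ours, and the critical value 5.99147 is typed in from the text rather than computed. Working with unrounded terms gives χ² of about 2.25; the value 2.26 in the text comes from rounding each term before adding.

```python
# Goodness-of-fit check for Examples 10.1 and 10.2 (illustrative sketch).
observed = [22, 64, 14]            # sample counts: can read/write, primary, secondary and above
hypothesized = [0.28, 0.61, 0.11]  # demographer's cell probabilities under H0

n = sum(observed)                          # sample size
expected = [n * p for p in hypothesized]   # e_i = n * p_i

# chi-square statistic: sum over cells of (O_i - e_i)^2 / e_i
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

df = len(observed) - 1      # k - 1 degrees of freedom
critical_value = 5.99147    # tabulated value for df = 2, alpha = .05 (taken from the text)

reject_h0 = chi_square > critical_value

print("expected counts:", [round(e, 2) for e in expected])
print(f"chi-square = {chi_square:.2f} with df = {df}")
print("reject H0" if reject_h0 else "cannot reject H0")
```

The decision is the same as in the text: the statistic is well below the critical value, so H0 is not rejected.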
At the start, we assumed that each of n observations could fall into one of k categories (or cells), that the probability that an observation would fall in cell i was pi, i = 1, 2, . . . , k, and that the outcome for any one observation was independent of the outcome for any others. These characteristics define a multinomial experiment. The binomial experiment is a multinomial experiment with k = 2.

Properties of the underlying distribution of response data for a chi-square goodness of fit test
1. The experiment consists of n identical trials.
2. There are k possible outcomes to each trial.
3. The probabilities of the k outcomes, denoted by p1, p2, . . . , pk, remain the same from trial to trial, where p1 + p2 + . . . + pk = 1.
4. The trials are independent.
5. The (estimated) expected number of responses for each of the k cells should be at least 5.

Because it is widely used, the chi-square test is also one of the most abused statistical procedures. The user should always be certain that the experiment satisfies the assumptions before proceeding with the test. In addition, the chi-square test should be avoided when the estimated expected cell counts are small, because in this case the chi-square probability distribution gives a poor approximation to the sampling distribution of the χ² statistic. The estimated expected number of responses for each of the k cells should be at least 5. In this case the chi-square distribution can be used to determine an approximate critical value that specifies the rejection region.
In the sections that follow, we will present a method for analyzing data that have been categorized according to two qualitative variables. The objective is to determine whether a dependency exists between the two qualitative variables, the qualitative variable analogue to a correlation analysis for two quantitative random variables. As you will see subsequently, these methods are also based on the assumption that the sampling satisfies the requirements for one or more multinomial experiments.

10.3 The analysis of contingency tables
Qualitative data are often categorized according to two qualitative variables. As a practical example of a two-variable classification of data, we will consider a 2 × 3 table.
Suppose that a random sample of men and women indicated their view on a certain proposal as shown in Table 10.4.

Table 10.4 Contingency table for views of women and men on a proposal
          In favour    Opposed    Undecided    Total
Women     118          62         25           205
Men       84           78         37           199
Total     202          140        62           404

We are to test the statement that there is no difference in opinion between men and women, i.e. the response is independent of the sex of the person interviewed, and we adopt this as our null hypothesis. Now if the statement is not true, then the response will depend on the sex of the person interviewed, and the table will enable us to calculate the degree of dependence. A table
constructed in this way (to indicate dependence or association) is called a contingency table. "Contingency" means dependence; many of you will be familiar with the term "contingency planning", i.e. plans that will be put into operation if certain things happen. Thus, the purpose of a contingency table analysis is to determine whether a dependence exists between the two qualitative variables.
We adopt the null hypothesis that there is no association between the response and the sex of the person interviewed. On this basis we may deduce that the proportion of the sample who are female is 205/404, and as 202 people are in favour of the proposal, the expected number of women in favour of the proposal is 205/404 × 202 = 102.5. Therefore, the estimated expected number of women (row 1) in favour of the proposal (column 1) is

$$e_{11} = \frac{\text{Row 1 total}}{n} \times (\text{Column 1 total}) = \frac{205}{404} \times 202 = 102.5$$

Also, as 140 people are against the proposal, the expected number of women against the proposal (row 1, column 2) is

$$e_{12} = \frac{\text{Row 1 total}}{n} \times (\text{Column 2 total}) = \frac{205}{404} \times 140 \approx 71$$

And the expected number of undecided women (row 1, column 3) is

$$e_{13} = \frac{\text{Row 1 total}}{n} \times (\text{Column 3 total}) = \frac{205}{404} \times 62 \approx 31.5$$

We now move to row 2 for men and note that the row total is 199. Therefore, we would expect the proportion of the sample who are male to be 199/404 for all three types of opinion. The estimated expected cell counts for the columns of row 2 are

$$e_{21} = \frac{\text{Row 2 total}}{n} \times (\text{Column 1 total}) = \frac{199}{404} \times 202 = 99.5$$

$$e_{22} = \frac{\text{Row 2 total}}{n} \times (\text{Column 2 total}) = \frac{199}{404} \times 140 \approx 69$$

$$e_{23} = \frac{\text{Row 2 total}}{n} \times (\text{Column 3 total}) = \frac{199}{404} \times 62 \approx 30.5$$

The formula for calculating any estimated expected value can be deduced from the values calculated above. Each estimated expected cell count is equal to the product of its respective row and column totals divided by the total sample size n:

$$e_{ij} = \frac{R_i \times C_j}{n}$$

where
$e_{ij}$ = Estimated expected count for the cell in row i and column j
$R_i$ = Row total corresponding to row i
$C_j$ = Column total corresponding to column j
$n$ = Sample size
The observed and estimated expected cell counts for the contingency table are shown in Table 10.5.

Table 10.5 Observed and expected (in parentheses) counts for response of women and men
          In favour      Opposed    Undecided
Women     118 (102.5)    62 (71)    25 (31.5)
Men       84 (99.5)      78 (69)    37 (30.5)

In this example, the chi-square test statistic, χ², is calculated in the same manner as shown in Example 10.1:

$$\chi^2 = \frac{(O_{11} - e_{11})^2}{e_{11}} + \frac{(O_{12} - e_{12})^2}{e_{12}} + \frac{(O_{13} - e_{13})^2}{e_{13}} + \dots + \frac{(O_{23} - e_{23})^2}{e_{23}}$$

$$= \frac{(118 - 102.5)^2}{102.5} + \frac{(62 - 71)^2}{71} + \frac{(25 - 31.5)^2}{31.5} + \frac{(84 - 99.5)^2}{99.5} + \frac{(78 - 69)^2}{69} + \frac{(37 - 30.5)^2}{30.5} \approx 9.80$$

The appropriate degrees of freedom for a contingency table analysis will always be (r - 1) × (c - 1), where r is the number of rows and c is the number of columns in the table. In this example, we have (2 - 1) × (3 - 1) = 2 degrees of freedom. Consulting Table 3 of Appendix C, we see that the critical values for χ² are 5.99 at a significance level of α = .05 and 9.21 at a level of α = .01. In both cases, the computed test statistic is larger than the critical value. Hence, we reject the null hypothesis at the 1% significance level and accept the alternative hypothesis that men and women think differently.

General form of a chi-square test for independence of two directions of classification
H0: The two directions of classification in the contingency table are independent
Ha: The two directions of classification in the contingency table are dependent
Test statistic:

$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - e_{ij})^2}{e_{ij}}$$

where
$r$ = Number of rows in the table
$c$ = Number of columns in the table
$O_{ij}$ = Observed number of responses in the cell in row i and column j
$e_{ij}$ = Estimated expected number of responses in cell (i, j) = $(R_i \times C_j)/n$
Rejection region: $\chi^2 > \chi^2_\alpha$, where $\chi^2_\alpha$ is the tabulated value of the chi-square distribution based on (r - 1) × (c - 1) degrees of freedom such that $P(\chi^2 > \chi^2_\alpha) = \alpha$

10.4 Contingency tables in statistical software packages
Most statistical software packages provide procedures for the analysis of categorical data.
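For readers without access to a statistical package, the test of independence for the opinion data of Table 10.4 can be sketched in a few lines of plain Python. This is an illustration under our own naming, not the output of any package, and the critical values 5.99 and 9.21 are copied from the text rather than computed. Because the expected counts are not rounded here, the statistic comes out at about 9.79.

```python
# Chi-square test of independence for the 2 x 3 table of opinions (illustrative sketch).
observed = [
    [118, 62, 25],   # women: in favour, opposed, undecided
    [84, 78, 37],    # men:   in favour, opposed, undecided
]

row_totals = [sum(row) for row in observed]         # R_i
col_totals = [sum(col) for col in zip(*observed)]   # C_j
n = sum(row_totals)                                 # total sample size

# Estimated expected counts e_ij = R_i * C_j / n
expected = [[r * c / n for c in col_totals] for r in row_totals]

# chi-square statistic: sum over all cells of (O_ij - e_ij)^2 / e_ij
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

df = (len(observed) - 1) * (len(observed[0]) - 1)   # (r - 1)(c - 1)
critical_values = {0.05: 5.99, 0.01: 9.21}          # for df = 2, taken from the text

for row in expected:
    print([round(e, 2) for e in row])
print(f"chi-square = {chi_square:.2f} with df = {df}")
for alpha, crit in critical_values.items():
    print(f"alpha = {alpha}: reject H0 -> {chi_square > crit}")
```

The margins are computed from the cell counts, so the same lines work for a table with any number of rows and columns, provided the critical values are replaced by those for the new degrees of freedom.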
Following are printouts of the procedure "Crosstabs" of SPSS for creating the contingency table and computing the value of the χ² statistic to test dependence of the education level on the living region of women interviewed in the DHS Survey 1988 in Vietnam (data of the survey is given in Appendix A).

CROSSTABS /TABLES=urban BY gd1 /FORMAT= AVALUE TABLES /STATISTIC=CHISQ CC PHI /CELLS= COUNT EXPECTED ROW .

Crosstabs

Case Processing Summary
                                                 Cases
                           Valid              Missing            Total
                           N       Percent    N       Percent    N       Percent
URBAN * Education Level    4172    100.0%     0       .0%        4172    100.0%

URBAN * Education Level Crosstabulation
                                     Education Level
                                     Can read/write    Primary    Secondary and above    Total
URBAN  Urban   Count                 163               299        266                    728
               Expected Count        197.5             415.5      115.0                  728.0
               % within URBAN        22.4%             41.1%      36.5%                  100.0%
       Rural   Count                 969               2082       393                    3444
               Expected Count        934.5             1965.5     544.0                  3444.0
               % within URBAN        28.1%             60.5%      11.4%                  100.0%
Total          Count                 1132              2381       659                    4172
               Expected Count        1132.0            2381.0     659.0                  4172.0
               % within URBAN        27.1%             57.1%      15.8%                  100.0%

Chi-Square Tests
                                 Value         df    Asymp. Sig. (2-sided)
Pearson Chi-Square               287.084 (a)   2     .000
Likelihood Ratio                 241.252       2     .000
Linear-by-Linear Association     137.517       1     .000
N of Valid Cases                 4172
a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 114.99.

Before turning to the analysis of variance, we make some remarks on methods for treating categorical data.
- Surveys that allow for more than two categories for a single response (a one-way table) can be analyzed using the chi-square goodness of fit test. The appropriate test statistic, called the χ² statistic, has a sampling distribution approximated by the chi-square probability distribution and measures the amount of disagreement between the observed number of responses and the expected number of responses in each category.
- A contingency table analysis is an application of the χ² test for a two-way (or two-variable) classification of data. The test allows us to determine whether the two directions of classification are independent.

10.5 Introduction to analysis of variance
As we have seen in the preceding chapters, the solutions to many statistical problems are based on inferences about population means. The next sections extend the methods of Chapters 7 to 9 to the comparison of more than two means.
When the data have been obtained according to certain specified sampling procedures, they are easy to analyze and also may contain more information pertinent to the population means than could be obtained using simple random sampling. The procedure for selecting sample data is called the design of the experiment and the statistical procedure for comparing the population means is called an analysis of variance.
We will introduce some aspects of experimental design and the analysis of the data from such experiments using an analysis of variance.

10.6 Design of experiments
The process of collecting sample data is called an experiment and the variable to be measured in the experiment is called the response. The planning of the sampling procedure is called the design of the experiment. The object upon which the response measurement is taken is called an experimental unit.
Variables that may be related to a response variable are called factors. The value (that is, the intensity setting) assumed by a factor in an experiment is called a level. The combinations of levels of the factors for which the response will be observed are called treatments.
The process of the design of an experiment can be divided into four steps as follows:
1. Select the factors to be included in the experiment and identify the parameters that are the object of the study. Usually, the target parameters are the population means associated with the factor levels.
2. Choose the treatments to be included in the experiment.
3. Determine the number of observations (sample size) to be made for each treatment.
4. Decide how the treatments will be assigned to the experimental units.
Once the data for a designed experiment have been collected, we will want to use the sample information to make inferences about the population means associated with the various treatments. The method used to compare the treatment means is known as analysis of variance, or ANOVA. The concept behind an analysis of variance can be explained using the following simple example.

Example 10.3 An elementary school teacher wants to try out three different reading workbooks. At the end of the year the 18 children in the class will take a test in reading achievement. These test scores will be used to compare the workbooks. Table 10.6 gives the reading achievement scores. Each set of scores of the 6 children using a type of workbook is considered as a sample from the hypothetical population of all kindergarten children who might use that type of workbook. The scores are plotted as line plots in Figure 10.2.

Table 10.6 Reading scores of 18 children using three different workbooks
                Workbook 1    Workbook 2    Workbook 3
                2             9             4
                4             10            5
                3             10            6
                4             7             3
                5             8             7
                6             10            5
Sums            24            54            30
Sample means    4             9             5
Total of 3 samples: 108; mean of 3 samples: 6

[Figure 10.2 Reading scores by workbook used and for combined sample: four line plots on a common 0 to 10 scale, for the children using Workbook 1, Workbook 2 and Workbook 3 and for all children]

The means of the three samples are 4, 9, and 5, respectively. Figure 10.2 shows these as the centers of the three samples; there is clearly variability from group to group. The variability in the entire pooled sample of 18 is shown by the last line.
In contrast to this rather typical allocation, we consider Tables 10.7 and 10.8 as illustrations of extreme cases. In Table 10.7 every observation in Group A is 3, every observation in Group B is 5, and every observation in Group C is 8. There is no variation within groups, but there is variation between groups.

Table 10.7 No variation within groups
              Group
         A    B    C
         3    5    8
         3    5    8
         3    5    8
         3    5    8
Means    3    5    8

In Table 10.8 the mean of each group is 3. There is no variation among the group means, although there is variability within each group.

Table 10.8 No variation between groups
              Group
         A    B    C
         3    3    1
         5    6    4
         1    2    3
         3    1    4
Means    3    3    3

Neither extreme can be expected to occur in an actual data set. In actual data, one needs to make an assessment of the relative sizes of the between-groups and within-groups variability. It is to this assessment that the term "analysis of variance" refers.
In Example 10.3, the overall mean, $\bar{x}$, is the sum of all the observations divided by the total number of observations:

$$\bar{x} = \frac{2 + 4 + \dots + 5}{18} = \frac{108}{18} = 6$$

The sum of squared deviations of all 18 observations from the mean of the combined sample is a measure of the variability of the combined sample. This sum is called the Total Sum of Squares and is denoted by SS(Total).

$$\begin{aligned}
SS(\text{Total}) &= (2-6)^2 + (4-6)^2 + (3-6)^2 + (4-6)^2 + (5-6)^2 + (6-6)^2 \\
&\quad + (9-6)^2 + (10-6)^2 + (10-6)^2 + (7-6)^2 + (8-6)^2 + (10-6)^2 \\
&\quad + (4-6)^2 + (5-6)^2 + (6-6)^2 + (3-6)^2 + (7-6)^2 + (5-6)^2 \\
&= 34 + 62 + 16 = 112
\end{aligned}$$

Next we measure the variability within samples. We calculate the sum of squared deviations of each of the 18 observations from their respective group means. This sum is called the Sum of Squares Within Groups (or Sum of Squared Errors) and is denoted by SS(Within Groups) (or SSE).

$$\begin{aligned}
SSE &= (2-4)^2 + (4-4)^2 + (3-4)^2 + (4-4)^2 + (5-4)^2 + (6-4)^2 \\
&\quad + (9-9)^2 + (10-9)^2 + (10-9)^2 + (7-9)^2 + (8-9)^2 + (10-9)^2 \\
&\quad + (4-5)^2 + (5-5)^2 + (6-5)^2 + (3-5)^2 + (7-5)^2 + (5-5)^2 \\
&= 10 + 8 + 10 = 28
\end{aligned}$$

Now let us consider the group means of 4, 9, and 5. The sum of squared deviations of the group means from the pooled mean of 6 is

$$(4-6)^2 + (9-6)^2 + (5-6)^2 = 4 + 9 + 1 = 14.$$

However, this sum is not comparable to the sum of squares within groups because the sampling variability of means is less than that of individual measurements. In fact, the mean of a sample of 6 observations has a sampling variance equal to 1/6 of the sampling variance of a single observation. Hence, to
put the sum of squared deviations of the group means on a basis that can be compared with SS(Within Groups), we must multiply it by 6, the number of observations in each sample, to obtain 6 × 14 = 84. This is called the Sum of Squares Between Groups (or Sum of Squares for Treatments) and is denoted by SS(Between Groups) (or SST).
Now we have three sums that can be compared: SS(Between Groups), SS(Within Groups), and SS(Total). They are given in Table 10.9. Observe that addition of the first two sums of squares gives the last sum. This demonstrates what we mean by the allocation of the total variability to the variability due to differences between means of groups and the variability of individuals within groups.

Table 10.9 Sums of Squares for Example 10.3
SS(Between Groups)    84
SS(Within Groups)     28
SS(Total)             112

In this example we notice that the variability between groups is a large proportion of the total variability. However, we have to adjust the numbers in Table 10.9 in order to take account of the number of pieces of information going into each sum of squares. That is, we want to use the sums of squares to calculate sample variances. The sum of squares between groups has 3 deviations about the mean of the combined sample. Therefore its number of degrees of freedom is 3 - 1 = 2 and the sample variance based on this sum of squares is

$$\text{Between-Group Variation} = \frac{SS(\text{Between Groups})}{3 - 1} = \frac{84}{2} = 42$$

This quantity is also called the Mean Square for Treatments (MST).
The sum of squares within groups is made up of 3 sample sums of squares. Each involves 6 squared deviations, and hence each has 6 - 1 = 5 degrees of freedom. Therefore the 3 samples have 18 - 3 = 15 degrees of freedom. The sample variance based on this sum is

$$\text{Within-Group Variation} = \frac{SS(\text{Within Groups})}{18 - 3} = \frac{28}{15} = 1.867$$

This variation is also called the Mean Square for Error (MSE).
The two estimates of variation, MST, measuring variability among groups, and MSE, measuring variability within groups, are now comparable. Their ratio is

$$F = \frac{MST}{MSE} = \frac{42}{1.867} = 22.50.$$

The fact that MST is 22.5 times MSE seems to indicate that the variability among groups is much greater than that within groups. However, we know that such a ratio computed for different triplets of random samples would vary from triplet to triplet, even if the population means were the same. We must take account of this sampling variability. This is done by referring to the F-tables depending on the desired significance level as well as on the number of degrees of freedom of MST, which is 2 here, and the number of degrees of freedom of MSE, which is 15 here. The value in the F-table for a significance level of .01 is 6.36. Thus we would consider the calculated ratio of 22.50 as very significant. We conclude that there are real differences in average reading readiness due to the use of different workbooks.
The results of the computation are set out in Table 10.10.

Table 10.10 Analysis of Variance Table for Example 10.3
Source of Variation    Sum of Squares    Degrees of Freedom    Mean Square    F
Between groups         84                2                     42             22.50
Within groups          28                15                    1.867

In the next sections we will consider the analysis of variance for the general problem of comparing k population means for three special types of experimental designs.

10.7 Completely randomized designs
The most common experimental design employed in practice is called a completely randomized design. This experiment involves a comparison of the means for a number, say k, of treatments, based on independent random samples of n1, n2, . . . , nk observations, drawn from populations associated with treatments 1, 2, . . . , k, respectively.
After collecting the data from a completely randomized design, our goal is to make inferences about k population means, where µi is the mean of the population of measurements associated with treatment i, for i = 1, 2, . . . , k. The null hypothesis to be tested is that the k treatment means are equal, i.e.,
H0: µ1 = µ2 = . . . = µk
and the alternative hypothesis is that at least two of the treatment means differ.
An analysis of variance provides an easy way to analyze the data from a completely randomized design. The analysis partitions SS(Total) into two components, SST and SSE. These two quantities are defined in general terms as follows:

$$SST = \sum_{j=1}^{k} n_j (\bar{x}_j - \bar{x})^2$$

$$SSE = \sum_{j=1}^{k} \sum_{i=1}^{n_j} (x_{ij} - \bar{x}_j)^2$$

where $\bar{x}_j$ is the mean of the $n_j$ observations in sample j and $\bar{x}$ is the overall mean. Recall that the quantity SST denotes the sum of squares for treatments and measures the variation explained by the differences between the treatment means. The sum of squares for error, SSE, is a measure of the unexplained variability, obtained by calculating a pooled measure of the variability within the k samples. If the treatment means truly differ, then SSE should be substantially smaller than SST. We compare the two sources of variability by forming an F statistic:

$$F = \frac{\text{Between-sample variation}}{\text{Within-sample variation}} = \frac{SST/(k - 1)}{SSE/(n - k)} = \frac{MST}{MSE}$$

where n is the total number of measurements. Under certain conditions, the F statistic has a repeated sampling distribution known as the F-distribution. Recall from Section 9.6 that the F distribution depends on ν1 numerator degrees of freedom and ν2 denominator degrees of freedom. For the completely randomized design, F is based on ν1 = (k - 1) and ν2 = (n - k) degrees of freedom. If the computed value of F exceeds the upper critical value, Fα, we reject H0 and conclude that at least two of the treatment means differ.

Test to Compare k Population Means for a Completely Randomized Design
H0: µ1 = µ2 = . . . = µk [i.e., there is no difference in the treatment (population) means]
Ha: At least two treatment means differ
Test statistic: F = MST/MSE
Rejection region: F > Fα
where the distribution of F is based on (k - 1) numerator df and (n - k) denominator df, and Fα is the F value found in Table 4 of Appendix C such that P(F > Fα) = α.
Assumptions:
1. All k population probability distributions are normal.
2. The k population variances are equal.
3. The samples from each population are random and independent.

The results of an analysis of variance are usually summarized and presented in an analysis of variance (ANOVA) table. Such a table shows the sources of variation, their respective degrees of freedom, sums of squares, mean squares, and computed F statistic. The results of the analysis of variance for Example 10.3 are given in Table 10.10, and the general form of the ANOVA table for a completely randomized design is shown in Table 10.11.
Table 10.11 Analysis of Variance Table for Completely Randomized Design
Source of Variation    Sum of Squares    Degrees of Freedom    Mean Square            F
Between groups         SST               k - 1                 MST = SST/(k - 1)      F = MST/MSE
Within groups          SSE               n - k                 MSE = SSE/(n - k)
Total                  SS(Total)         n - 1
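To see how the entries of Table 10.11 arise from raw data, the sketch below fills in the table for the reading scores of Example 10.3. It is a minimal illustration in plain Python; the dictionary layout and variable names are our own choices, and no p-value is computed, so the resulting F still has to be compared with a tabulated critical value.

```python
# One-way ANOVA table for the reading scores of Example 10.3 (illustrative sketch).
groups = {
    "Workbook 1": [2, 4, 3, 4, 5, 6],
    "Workbook 2": [9, 10, 10, 7, 8, 10],
    "Workbook 3": [4, 5, 6, 3, 7, 5],
}

def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)

all_scores = [x for scores in groups.values() for x in scores]
n = len(all_scores)            # total number of measurements
k = len(groups)                # number of treatments
grand_mean = mean(all_scores)

# SST: each squared deviation of a group mean is weighted by the group size n_j
sst = sum(len(s) * (mean(s) - grand_mean) ** 2 for s in groups.values())
# SSE: squared deviations of each observation from its own group mean
sse = sum((x - mean(s)) ** 2 for s in groups.values() for x in s)
# SS(Total): squared deviations of each observation from the overall mean
ss_total = sum((x - grand_mean) ** 2 for x in all_scores)

mst = sst / (k - 1)            # mean square for treatments
mse = sse / (n - k)            # mean square for error
f_statistic = mst / mse

print(f"Between groups: SS = {sst:.0f}, df = {k - 1}, MS = {mst:.3f}, F = {f_statistic:.2f}")
print(f"Within groups:  SS = {sse:.0f}, df = {n - k}, MS = {mse:.3f}")
print(f"Total:          SS = {ss_total:.0f}, df = {n - 1}")
```

The group sizes enter SST through the factor `len(s)`, so the same code also covers samples of unequal size.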
Example 10.4 Consider the problem of comparing the mean number of children born to women in 10 provinces numbered from 1 to 10. Numbers of children born to 3448 women from these provinces are randomly selected from the column headed CEB of Appendix A. The women selected from the 10 provinces are considered to be the only ones of interest. This ensures the assumption of equality between the population variances. Now, we want to compare the mean numbers of children born to all women in these provinces, i.e., we wish to test H0: µ1 = µ2 = . . . = µ10
Ha: At least two population means differ
Solution We will use the SPSS package to carry out the analysis of variance. Following are the syntax and the printout of the SPSS procedure "OneWay ANOVA" for the analysis of CEB by province.
ONEWAY ceb BY province /STATISTICS DESCRIPTIVES /MISSING ANALYSIS .
ONEWAY
Descriptives
Children ever born
                                                        95% Confidence Interval for Mean
Province    N       Mean    Std. Deviation    Std. Error    Lower Bound    Upper Bound    Minimum    Maximum
1           228     2.40    1.55              .10           2.19           2.60           0          10
2           323     2.84    2.30              .13           2.59           3.09           0          11
3           302     3.15    2.09              .12           2.91           3.39           0          12
4           354     2.80    2.00              .11           2.59           3.01           0          10
5           412     2.53    1.61              7.93E-02      2.37           2.68           0          9
6           366     3.08    1.99              .10           2.88           3.29           0          11
7           402     3.26    1.83              9.13E-02      3.08           3.44           0          10
8           360     3.45    2.21              .12           3.23           3.68           0          11
9           297     3.87    2.66              .15           3.56           4.17           0          12
10          403     3.75    2.52              .13           3.51           4.00           0          12
Total       3448    3.13    2.15              3.66E-02      3.06           3.20           0          12
ANOVA
Children born
                  Sum of Squares    df      Mean Square    F         Sig.
Between Groups    702.326           9       78.036         17.621    .000
Within Groups     15221.007         3437    4.429
Total             15923.333         3446
From the printout we can see that the SPSS OneWay ANOVA procedure presents the results in the form of an ANOVA table. The corresponding sums of squares and mean squares are:
SST = 702.326
SSE = 15221.007
MST = 78.036
MSE = 4.429
The computed value of the test statistic, given under the column heading F, is
F = 17.621
with ν1 = 9 degrees of freedom between provinces and ν2 = 3437 degrees of freedom within provinces. To determine whether to reject the null hypothesis
H0: µ1 = µ2 = . . . = µ10
in favor of the alternative
Ha: at least two population means are different
we may consult Table 4 of Appendix C for tabulated values of the F distribution corresponding to an appropriately chosen significance level α. However, since the SPSS printout gives the observed significance level (under the column heading Sig.) of the test, we will use this quantity to assist us in reaching a conclusion. This quantity is the probability of obtaining an F statistic at least as large as the one calculated when all population means are equal. If this probability is small enough, the null hypothesis (all population means are equal) is rejected. In this example, the observed significance level is approximately .0001. It implies that H0 will be rejected at any chosen level of α larger than .0001. Thus, there is very strong evidence of a difference among the mean numbers of children ever born to women in the 10 provinces. If H0 is rejected on this basis, the probability of a Type I error is about .0001. Before ending our discussion of completely randomized designs, we make the following comment. The proper application of the ANOVA procedure requires that certain assumptions be satisfied, i.e., all k populations are approximately normal with equal variances. If you know, for example, that one or more of the populations are nonnormal (e.g., highly skewed), then any inferences derived from the ANOVA of the data are suspect. In this case, we can apply a nonparametric technique.
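The columns of the SPSS ANOVA table are linked by the definitions MST = SST/(k - 1), MSE = SSE/(n - k) and F = MST/MSE. As a small consistency check, the following sketch recomputes the mean squares and the F ratio from the sums of squares and degrees of freedom in the printout. The variable names are ours and the observed significance level is not recomputed, since that would require the F distribution itself.

```python
# Recompute the mean squares and F ratio from the SPSS ANOVA printout (illustrative sketch).
ss_between = 702.326     # sum of squares between provinces (SST)
ss_within = 15221.007    # sum of squares within provinces (SSE)
df_between = 9           # k - 1 with k = 10 provinces
df_within = 3437         # n - k, as reported in the printout

ms_between = ss_between / df_between   # MST
ms_within = ss_within / df_within      # MSE
f_ratio = ms_between / ms_within

ss_total = ss_between + ss_within      # should agree with the Total row
df_total = df_between + df_within

print(f"MST = {ms_between:.3f}, MSE = {ms_within:.3f}, F = {f_ratio:.3f}")
print(f"SS(Total) = {ss_total:.3f} on {df_total} degrees of freedom")
```

The recomputed values agree with the printed Mean Square, F and Total entries to the three decimals shown.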
10.8 Randomized block designs
Example 10.5 Three methods of treating beer cans are being compared by a panel of 5 people. Each person samples beer from each type of can and scores the beer with an integer between 0 and 6, with 6 indicating a strong metallic taste and 0 meaning no metallic taste. It is obvious that different people will use the scale somewhat differently, and we shall take this into account when we compare the different types of can. The data are reported in Table 10.12. This is an example of a situation in which the investigator has data pertaining to k treatments (k = 3 types of can) in b blocks (b = 5 persons). We let $x_{gj}$ denote the observation corresponding to the gth treatment and the jth block, $\bar{x}_{g\cdot}$ the mean of the b observations for the gth treatment, $\bar{x}_{\cdot j}$ the mean of the k observations in the jth block, and $\bar{x}$ the overall mean of all n = kb observations. When this particular design is used, the three types of can are presented to the individuals in random order. An experimental design of this type is called a randomized blocks design. In agricultural experiments the k treatments might correspond, for example, to k different fertilizers; the field would be divided into blocks of presupposed similar fertility; and every fertilizer would be used in each block so that differences in fertility of the soil in different parts of the field (blocks) would not bias the comparison of the fertilizers. Each block would be subdivided into k subblocks, called "plots." The k fertilizers would be randomly assigned to the plots in each block; hence the name, "randomized blocks."
Table 10.12 Scores of three types of can on "metallic" scale

              Type of Can
Person      A      B      C     Sums
P1          6      2      6      14
P2          5      3      4      12
P3          6      2      4      12
P4          4      2      4      10
P5          3      1      3       7
Sums       24     10     21      55
In general terms, we can define a randomized block design as a design in which k treatments are compared within each of b blocks. Each block contains k matched experimental units and the k treatments are randomly assigned, one to each of the units within each block. Table 10.13 shows the pattern of a data set resulting from a randomized blocks design; it is a two-way table with single measurements as entries. In the example, people correspond to blocks and cans to treatments. The observation $x_{gj}$ is called the response to treatment g in block j. The treatment mean $\bar{x}_{g\cdot}$ estimates the population mean µg for treatment g (averaged out over people). An objective may be to test the hypothesis that treatments make no difference,
H0: µ1 = µ2 = . . . = µk
Table 10.13 Randomized Blocks Design

                        Blocks
Treatments      1       2      ...      b
    1          x11     x12     ...     x1b
    2          x21     x22     ...     x2b
   ...         ...     ...     ...     ...
    k          xk1     xk2     ...     xkb
Each observation $x_{gj}$ can be written as a sum of meaningful terms by means of the identity

$$x_{gj} = \bar{x} + (\bar{x}_{g\cdot} - \bar{x}) + (\bar{x}_{\cdot j} - \bar{x}) + (x_{gj} - \bar{x}_{g\cdot} - \bar{x}_{\cdot j} + \bar{x}).$$

In words,

(observed value for gth treatment in jth block) = (overall mean) + (deviation due to gth treatment) + (deviation due to jth block) + (residual).

The "residual" is

$$x_{gj} - \left[\bar{x} + (\bar{x}_{g\cdot} - \bar{x}) + (\bar{x}_{\cdot j} - \bar{x})\right],$$

which is the difference between the observation and

$$\bar{x} + (\bar{x}_{g\cdot} - \bar{x}) + (\bar{x}_{\cdot j} - \bar{x}),$$

obtained by taking into account the overall mean, the effect of the gth treatment, and the effect of the jth block. Algebra shows that the corresponding decomposition is true for sums of squares:

$$\sum_{g=1}^{k}\sum_{j=1}^{b} (x_{gj} - \bar{x})^2 = b\sum_{g=1}^{k} (\bar{x}_{g\cdot} - \bar{x})^2 + k\sum_{j=1}^{b} (\bar{x}_{\cdot j} - \bar{x})^2 + \sum_{g=1}^{k}\sum_{j=1}^{b} (x_{gj} - \bar{x}_{g\cdot} - \bar{x}_{\cdot j} + \bar{x})^2,$$
that is, SS(Total) = SS(Treatments) + SS(Blocks) + SS(Residuals). The number of degrees of freedom of SS(Total) is kb − 1 = n − 1, the number of observations less 1 for the overall mean. The number of degrees of freedom of SS(Treatments) is k − 1, the number of treatments less 1 for the overall mean. Similarly, the number of degrees of freedom of SS(Blocks) is b − 1. There remain, as the number of degrees of freedom for SS(Residuals),
kb − 1 − (k − 1) − (b − 1) = kb − k − b + 1 = (k − 1)(b − 1).
There is a hypothetical model behind the analysis. It assumes that in repeated experiments the measurement for the gth treatment in the jth block would be the sum of a constant pertaining to the treatment, namely µg, a constant pertaining to the jth block, and a random "error" term with a variance of σ². The mean square for residuals, MS(Residuals) = SS(Residuals) / [(k − 1)(b − 1)], is an unbiased estimate of σ² regardless of whether the µg's differ (that is, whether there are true effects due to treatments). If there are no differences in the µg's, MS(Treatments) = SS(Treatments) / (k − 1) is an unbiased estimate of σ² (whether or not there are true effects due to blocks). If there are differences among the µg's, then MS(Treatments) will tend to be larger than σ². One tests H0 by means of
F = MS(Treatments) / MS(Residuals)
When H0 is true, F has an F-distribution with (k − 1) numerator df and (k − 1)(b − 1) denominator df. One rejects H0 if F is sufficiently large, that is, if F exceeds Fα. Table 10.14 is the analysis of variance table.
Table 10.14 Analysis of variance table for randomized blocks design

Sources of variation   Sum of squares   Degrees of freedom   Mean square   F

Treatments   $b\sum_{g=1}^{k} (\bar{x}_{g\cdot} - \bar{x})^2$   k − 1   MS(Treatments)   MS(Treatments) / MS(Residuals)

Blocks   $k\sum_{j=1}^{b} (\bar{x}_{\cdot j} - \bar{x})^2$   b − 1   MS(Blocks)   MS(Blocks) / MS(Residuals)

Residuals   $\sum_{g=1}^{k}\sum_{j=1}^{b} (x_{gj} - \bar{x}_{g\cdot} - \bar{x}_{\cdot j} + \bar{x})^2$   (k − 1)(b − 1)   MS(Residuals)

Total   $\sum_{g=1}^{k}\sum_{j=1}^{b} (x_{gj} - \bar{x})^2$ = SS(Total)   n − 1
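The computational formulas are

$$SS(Total) = \sum_{g=1}^{k}\sum_{j=1}^{b} x_{gj}^2 - \frac{1}{kb}\left(\sum_{g=1}^{k}\sum_{j=1}^{b} x_{gj}\right)^2$$

$$SS(Treatments) = \frac{1}{b}\sum_{g=1}^{k}\left(\sum_{j=1}^{b} x_{gj}\right)^2 - \frac{1}{kb}\left(\sum_{g=1}^{k}\sum_{j=1}^{b} x_{gj}\right)^2$$

$$SS(Blocks) = \frac{1}{k}\sum_{j=1}^{b}\left(\sum_{g=1}^{k} x_{gj}\right)^2 - \frac{1}{kb}\left(\sum_{g=1}^{k}\sum_{j=1}^{b} x_{gj}\right)^2$$

SS(Residuals) = SS(Total) − SS(Treatments) − SS(Blocks)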
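The computational formulas above can be checked numerically. The following is a minimal sketch in plain Python, applied to the scores of Table 10.12; the function name `randomized_block_anova` and the dictionary keys are illustrative choices, not part of the course material.

```python
# Scores from Table 10.12: rows are treatments (cans A, B, C),
# columns are blocks (persons P1..P5).
scores = [
    [6, 5, 6, 4, 3],  # can A
    [2, 3, 2, 2, 1],  # can B
    [6, 4, 4, 4, 3],  # can C
]


def randomized_block_anova(x):
    """Sums of squares, degrees of freedom and F ratios for a k x b table."""
    k = len(x)         # number of treatments
    b = len(x[0])      # number of blocks
    grand_total = sum(sum(row) for row in x)
    correction = grand_total ** 2 / (k * b)   # (1/kb) * (sum of all x)^2

    ss_total = sum(v * v for row in x for v in row) - correction
    ss_treat = sum(sum(row) ** 2 for row in x) / b - correction
    block_totals = [sum(x[g][j] for g in range(k)) for j in range(b)]
    ss_block = sum(t * t for t in block_totals) / k - correction
    ss_resid = ss_total - ss_treat - ss_block

    df_treat, df_block, df_resid = k - 1, b - 1, (k - 1) * (b - 1)
    ms_resid = ss_resid / df_resid
    return {
        "ss_total": ss_total,
        "ss_treat": ss_treat,
        "ss_block": ss_block,
        "ss_resid": ss_resid,
        "df": (df_treat, df_block, df_resid),
        "f_treat": (ss_treat / df_treat) / ms_resid,
        "f_block": (ss_block / df_block) / ms_resid,
    }


result = randomized_block_anova(scores)
for name, value in result.items():
    print(name, value)
```

The F ratios computed this way use unrounded mean squares, so they can differ in the second decimal place from values obtained by dividing rounded table entries.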
For the data in Table 10.12 we have

$$\sum_{g=1}^{k}\sum_{j=1}^{b} x_{gj}^2 = 6^2 + 5^2 + \ldots + 3^2 = 237$$

$$\frac{1}{kb}\left(\sum_{g=1}^{k}\sum_{j=1}^{b} x_{gj}\right)^2 = \frac{55^2}{15} = 201.67$$

SS(Total) = 237 − 201.67 = 35.33

SS(Treatments) = (24² + 10² + 21²) / 5 − 201.67 = 223.40 − 201.67 = 21.73

SS(Blocks) = (14² + 12² + 12² + 10² + 7²) / 3 − 201.67 = 211 − 201.67 = 9.33

SS(Residuals) = 35.33 − 21.73 − 9.33 = 4.27

The analysis of variance table is Table 10.15.

Table 10.15 Analysis of variance table for "Metallic" scale

Sources of variation   Sum of squares   Degrees of freedom   Mean square     F
Cans                        21.73               2               10.87      20.40
Persons                      9.33               4                2.33       4.38
Residual                     4.27               8                0.533
Total                       35.33              14

From Table 4 in Appendix C, the tabulated value of F.05 with 2 and 8 df is 4.46. Therefore, we will reject H0 if the calculated value of F is F > 4.46. Since the computed value of the test statistic, F = 20.40, exceeds 4.46, we have sufficient evidence to reject the null hypothesis of no difference in metallic taste of types of can at α = .05.
The roles of cans and people can be interchanged. To test the hypothesis that there are no differences in scoring among persons (in the hypothetical population of repeated experiments), one uses the ratio of MS(Blocks) to MS(Residuals) and rejects the null hypothesis if that ratio is greater than an F-value for b − 1 and (k − 1)(b − 1) degrees of freedom. The value here of 4.38 is referred to Table 4 of Appendix C with 4 and 8 degrees of freedom, for which the 5% point is 3.84; it is barely significant.

10.9 Multiple comparisons of means and confidence regions
The F-test gives information about all means µ1, µ2, . . . , µk simultaneously. In this section we consider inferences about differences of pairs of means. Instead of simply concluding that some of µ1, µ2, . . . , µk are different, we may conclude that specific pairs µg, µh are different.
The variance of the difference between two means, say $\bar{x}_1$ and $\bar{x}_2$, is σ²(1/n1 + 1/n2), which is estimated as s²(1/n1 + 1/n2). The corresponding estimated standard deviation is $s\sqrt{1/n_1 + 1/n_2}$.
If one were interested simply in determining whether the first two population means differed, one would test the null hypothesis that µ1 = µ2 at significance level α by using a t-test, rejecting the null hypothesis if

$$\frac{|\bar{x}_1 - \bar{x}_2|}{s\sqrt{1/n_1 + 1/n_2}} > t_{\alpha/2},$$

where the number of degrees of freedom for the t-value is the number of degrees of freedom for s.
However, now we want to consider each possible difference µg − µh; that is, we want to test all the null hypotheses Hgh: µg = µh, with g ≠ h; g, h = 1, . . . , k. There are k(k − 1)/2 such hypotheses.
If, indeed, all the µ's were equal, so that there were no real differences, the probability that any particular one of the pairwise differences in absolute value would exceed the relevant t-value is α. Hence the probability that at least one of them would exceed the t-value would be greater than α. When many differences are tested, the probability that some will appear to be "significant" is greater than the nominal significance level α when all the null hypotheses are true. How can one eliminate this false significance? It can be shown that, if m comparisons are to be made and the overall Type I error probability is to be at most α, it is sufficient to use α/m for the significance level of the individual tests. By overall Type I error we mean concluding µg ≠ µh for at least one pair g, h when actually µ1 = µ2 = . . . = µk.
Example 10.5 We illustrate with Example 10.3 (Tables 10.6 and 10.9). Here s² = 1.867, based on 15 degrees of freedom (s = 1.366). Since all the sample sizes are 6, the value with which to compare each difference $\bar{x}_g - \bar{x}_h$ is

$$t_{\alpha^*/2} \times s\sqrt{1/n_1 + 1/n_2} = t_{\alpha^*/2} \times 1.366 \times \sqrt{1/6 + 1/6} = t_{\alpha^*/2} \times 1.366 \times \sqrt{1/3} = t_{\alpha^*/2} \times .789,$$

where α* is to be the level of the individual tests. The number of comparisons to be made for k = 3 is k(k − 1)/2 = 3 = m. If we want the overall Type I error probability to be at most .03, then it suffices to choose the level α* to be .03/3 = .01. The corresponding percentage point of Student's t-distribution with 15 degrees of freedom is t.01/2 = t.005 = 2.947. The value with which to compare $\bar{x}_g - \bar{x}_h$ is .789 × 2.947 = 2.33. In Table 10.6 the means are $\bar{x}_1 = 4$, $\bar{x}_2 = 9$, $\bar{x}_3 = 5$. The difference $\bar{x}_2 - \bar{x}_1 = 9 - 4 = 5$ is significant; so is $\bar{x}_2 - \bar{x}_3 = 9 - 5 = 4$. The difference $\bar{x}_3 - \bar{x}_1 = 5 - 4 = 1$ is not significant. The conclusion is that µ2 is different from both µ1 and µ3, but µ1 and µ3 may be equal. Workbook 2 appears to be superior.
Confidence Regions
With confidence at least 1 − α, the following inequalities hold:

$$(\bar{x}_g - \bar{x}_h) - t_{\alpha^*/2}\, s\sqrt{1/n_g + 1/n_h} < \mu_g - \mu_h < (\bar{x}_g - \bar{x}_h) + t_{\alpha^*/2}\, s\sqrt{1/n_g + 1/n_h}$$

for g ≠ h; g, h = 1, . . . , k, if α* = α/m and the distribution of t is based on (n − k) degrees of freedom.
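As a small illustration of the interval above, the sketch below computes the simultaneous interval for µ2 − µ1 using the quantities quoted in the text for Example 10.3 (means 9 and 4, sample sizes 6, s = 1.366, tabulated t.005 = 2.947 with 15 df). The function name `bonferroni_interval` is an assumption made for this example.

```python
from math import sqrt


def bonferroni_interval(mean_g, mean_h, n_g, n_h, s, t_crit):
    """Interval for mu_g - mu_h, where t_crit is t_{alpha*/2} with alpha* = alpha/m."""
    half_width = t_crit * s * sqrt(1 / n_g + 1 / n_h)
    diff = mean_g - mean_h
    return diff - half_width, diff + half_width


# Values quoted in the text; the critical value is read from a t table.
low, high = bonferroni_interval(9.0, 4.0, 6, 6, s=1.366, t_crit=2.947)
print(f"{low:.2f} < mu2 - mu1 < {high:.2f}")
```

Because the interval excludes 0, it agrees with the conclusion that µ2 differs from µ1.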
10.10 Summary
This chapter presented an extension of the methods for comparing two population means to allow for the comparison of more than two means. The completely randomized design uses independent random samples selected from each of k populations. The comparison of the population means is made by comparing the variation among the sample means, as measured by the mean square for treatments (MST), to the variation attributable to differences within the samples, as measured by the mean square for error (MSE). If the ratio of MST to MSE is large, we conclude that a difference exists between the means of at least two of the k populations.
We also presented an analysis of variance for a comparison of two or more population means using matched groups of experimental units in a randomized block design, an extension of the matched-pairs design. The design not only allows us to test for differences among the treatment means, but also enables us to test for differences among block means. By testing for differences among block means, we can determine whether blocking is effective in reducing the variation present when comparing the treatment means.
Remember that the proper application of these ANOVA techniques requires that certain assumptions be satisfied. In most applications, the assumptions will not be satisfied exactly. However, these analysis of variance procedures are flexible in the sense that slight departures from the assumptions will not significantly affect the analysis or the validity of the resulting inferences.
10.11 Exercises
10.1. A random sample of n = 500 observations were allocated to the k = 5 categories shown in the table. Suppose we want to test the null hypothesis that the category probabilities are p1 =.1, p2 =.1, p3 =.5, p4 =.1, and p5 =.2.
Category      1      2      3      4      5     Total
Count        27     62    241     69    101      500

a. Calculate the expected cell counts.
b. Find $\chi^2_{\alpha}$ for α = .05.
c. State the alternative hypothesis for the test.
d. Do the data provide sufficient evidence to indicate that the null hypothesis is false?
10.2. Refer to the accompanying 2 × 3 contingency table.

                  Columns
Rows         1      2      3     Totals
1           14     37     23       74
2           21     32     38       91
Totals      35     69     61      165

a. Calculate the estimated expected cell counts for the contingency table.
b. Calculate the chi-square statistic for the table.

10.3. A partially completed ANOVA table for a completely randomized design is shown here.
Source Between groups Within groups Total SS 24.7 62.4 df 4 34 MS F
a. Complete the ANOVA table.
b. How many treatments are involved in the experiment?
c. Do the data provide sufficient evidence to indicate a difference among the population means? Test using α = .10.

10.4. A randomized block design was conducted to compare the mean responses for three treatments, A, B, and C, in four blocks. The data are shown in the accompanying table, followed by a partial summary ANOVA table.

             Treatment
Block      A      B      C
1          3      5      2
2          6      7      3
3          1      4      2
4          2      6      2

Source          SS        df      MS       F
Treatments    23.167
Blocks        14.250             4.750
Residuals                         .917
Total         42.917

a. Complete the ANOVA table.
b. Do the data provide sufficient evidence to indicate a difference among treatment means? Test using α = .05.
c. Do the data provide sufficient evidence to indicate that blocking was effective in reducing the experimental error? Test using α = .10.
d. What assumptions must the data satisfy to make the F tests in parts b and c valid?

10.5. At the 5% level make the F-test of equality of population (treatment) means for the data in the table.

               Blocks
Treatment    1      2      3
1            1      4      9
2            4      9     16
3            9     16     23
Chapter 11
Simple Linear regression and correlation

CONTENTS
11.1 Introduction: Bivariate relationships
11.2 Simple Linear Regression: Assumptions
11.3 Estimating A and B: the method of least squares
11.4 Estimating σ²
11.5 Making inferences about the slope, B
11.6 Correlation analysis
11.7 Using the model for estimation and prediction
11.8 Simple Linear Regression: An Example
11.9 Summary
11.10 Exercises
11.1 Introduction: Bivariate relationships
The subject of this chapter is determining the relationship between variables. In Chapter 10 we used chi-square tests of independence to determine whether a statistical relationship existed between two variables. The chi-square test tells us if there is such a relationship, but it does not tell us what the relationship is. Regression and correlation analyses will show how to determine both the nature and the strength of a relationship between two variables.
The term "regression" was first used as a statistical concept by Sir Francis Galton. He chose the word regression as the name of the general process of predicting one variable (the height of the children) from another (the height of the parent). Later, statisticians coined the term multiple regression to describe the process by which several variables are used to predict another.
In regression analysis we shall develop an estimating equation, that is, a mathematical formula that relates the known variables to the unknown variable. Then, after we have learned the pattern of this relationship, we can apply correlation analysis to determine the degree to which the variables are related. Correlation analysis tells us how well the estimating equation actually describes the relationship.
Types of relationships
Regression and correlation analyses are based on the relationship or association between two or more variables.
Definition 11.1
The relationship between two random variables is known as a bivariate relationship. The known variable ( or variables ) is called the independent variable(s). The variable we are trying to predict is the dependent variable.
Example 11.1 A farmer may be interested in the relationship between the level of fertilizer x and the yield of potatoes y. Here the level of fertilizer x is the independent variable and the yield of potatoes y is the dependent variable.
Example 11.2 A medical researcher may be interested in the bivariate relationship between a patient's blood pressure x and heart rate y. Here x is the independent variable and y is the dependent variable.
Example 11.3 Economists might base their predictions of the annual gross domestic product (GDP) on the final consumption spending within the economy. Then the final consumption spending is the independent variable, and the GDP is the dependent variable.
In regression analysis we can have only one dependent variable in our estimating equation. However, we can use more than one independent variable. We often add independent variables in order to improve the accuracy of our prediction.
Definition 11.2
If the dependent variable y increases when the independent variable x increases, the relationship between x and y is called direct. If the dependent variable y decreases as the independent variable x increases, the relationship is called inverse.
Scatter diagrams
The first step in determining whether there is a relationship between two variables is to examine the graph of the observed (or known) data, i.e. of the data points.
Definition 11.3
The graph of the data points is called a scatter diagram or scatter gram.
Example 11.4 In recent years, physicians have used the so-called diving reflex to reduce abnormally rapid heartbeats in humans by submerging the patient's face in cold water. A research physician conducted an experiment to investigate the effects of various cold temperatures on the pulse rates of ten small children. The results are presented in Table 11.1.
Table 11.1 Temperature of water – Pulse rate data

Child    Temperature of Water, x (°F)    Reduction in Pulse, y (beats/minute)
1                   68                                  2
2                   65                                  5
3                   70                                  1
4                   62                                 10
5                   60                                  9
6                   55                                 13
7                   58                                 10
8                   65                                  3
9                   69                                  4
10                  63                                  6

The scatter gram of the data set in Table 11.1 is depicted in Figure 11.1. From the scatter gram we can visualize the relationship that exists between the two variables. As a result we can draw or "fit" a straight line through our scatter gram to represent the relationship. We have done this in Figure 11.2.

Figure 11.1 Scatter gram for the data in Table 11.1
clxxix .2 0. Thus.4 0.6 0.5 To model the relationship between the CO (Carbon Monoxide) ranking. mgs 2 10 13 15 20 The scatter gram with straight line representing the relationship between Nicotine Content x and CO Ranking y “fitted” through it is depicted in Figure 11. and the nicotine content. we can say that it is a linear relationship.14 12 10 8 6 4 2 0 50 55 60 65 70 75 Figure 11. x. y.2 CO RankingNicotine Content Data Cigarett e 1 2 3 4 5 Nicotine Content. y. of an Americanmade cigarette the Federal Trade commission tested a random sample of 5 cigarettes. is inverse because y decreases as x increases Example 11. From this we see that the relationship here is direct.8 1 CO ranking. mgs 0.3. as we see. x. This relationship.2 Scatter gram with straight line representing the relationship between x and y “fitted” through it We see that the relationship described by the data points is well described by a straight line.2 Table 11. The CO ranking and nicotine content values are given in Table 11.
Figure 11.3 Scatter gram with straight line representing the relationship between x and y "fitted" through it (horizontal axis: Nicotine Content x, mgs; vertical axis: CO ranking y, mgs)

11.2 Simple Linear regression: Assumptions
Suppose we believe that the value of y tends to increase or decrease in a linear manner as x increases. Then we could select a model relating y to x by drawing a line which is well fitted to a given data set. Such a deterministic model, one that does not allow for errors of prediction, might be adequate if all of the data points fell on the fitted line. However, you can see that this idealistic situation will not occur for the data of Tables 11.1 and 11.2. No matter how you draw a line through the points in Figure 11.2 and Figure 11.3, at least some of the points will deviate substantially from the fitted line.
The solution to the preceding problem is to construct a probabilistic model relating y to x, one that acknowledges the random variation of the data points about a line. One type of probabilistic model, a simple linear regression model, makes the assumption that the mean value of y for a given value of x graphs as a straight line and that points deviate about this line of means by a random amount equal to e, i.e.
y = A + B x + e,
where A and B are unknown parameters of the deterministic (nonrandom) portion of the model. If we suppose that the points deviate above or below the line of means with expected value E(e) = 0, then the mean value of y is E(y) = A + B x. Therefore, the mean value of y for a given value of x, represented by the symbol E(y), graphs as a straight line with y-intercept A and slope B. A graph of the hypothetical line of means, E(y) = A + B x, is shown in Figure 11.4.
Figure 11.4 The straight line of means

A SIMPLE LINEAR REGRESSION MODEL
y = A + B x + e,
where
y = dependent variable (variable to be modeled, sometimes called the response variable)
x = independent variable (variable used as a predictor of y)
e = random error
A = y-intercept of the line
B = slope of the line

In order to fit a simple linear regression model to a set of data, we must find estimators for the unknown parameters A and B of the line of means E(y) = A + B x. Since the sampling distributions of these estimators will depend on the probability distribution of the random error e, we must first make specific assumptions about its properties.
ASSUMPTIONS REQUIRED FOR A LINEAR REGRESSION MODEL
1. The mean of the probability distribution of the random error is 0, E(e) = 0; that is, the average of the errors over an infinitely long series of experiments is 0 for each setting of the independent variable x. This assumption implies that the mean value of y for a given value of x is E(y) = A + B x.
2. The variance of the random error is equal to a constant, say σ², for all values of x.
3. The probability distribution of the random error is normal.
4. The errors associated with any two different observations are independent. That is, the error associated with one value of y has no effect on the errors associated with other values.

11.3 Estimating A and B: the method of least squares
The first problem of simple regression analysis is to find estimators of A and B of the regression model based on sample data.
Suppose we have a sample of n data points (x1, y1), (x2, y2), ..., (xn, yn). The straight-line model for the response y in terms of x is y = A + B x + e. The line of means is E(y) = A + B x and the line fitted to the sample data is $\hat{y} = a + bx$. Thus, $\hat{y}$ is an estimator of the mean value of y and a predictor of some future value of y; and a, b are estimators of A and B, respectively.
For a given data point, say the point (xi, yi), the observed value of y is yi, the predicted value of y would be $\hat{y}_i = a + bx_i$, and the deviation of the ith value of y from its predicted value is $y_i - \hat{y}_i = y_i - (a + bx_i)$. The sum of the squares of these deviations over all n data points is

$$SSE = \sum_{i=1}^{n} [y_i - (a + bx_i)]^2.$$

The values of a and b that make the SSE a minimum are called the least squares estimators of the population parameters A and B, and the prediction equation $\hat{y} = a + bx$ is called the least squares line.
Definition 11.4 The least squares line is the one that has a smaller SSE than any other straight-line model.
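Definition 11.4 can be illustrated numerically. The sketch below, written in plain Python, evaluates SSE on the cigarette data of Table 11.2 for the line $\hat{y} = -0.3 + 20.5x$ that Example 11.6 reports for these data, and for a few other candidate lines. The alternative lines and the names `sse` and `candidates` are arbitrary choices made only for this illustration.

```python
x = [0.2, 0.4, 0.6, 0.8, 1.0]   # nicotine content, Table 11.2
y = [2, 10, 13, 15, 20]         # CO ranking, Table 11.2


def sse(a, b):
    """Sum of squared deviations of the observed y from the line a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))


least_squares = sse(-0.3, 20.5)   # line reported in Example 11.6

# A few other straight lines, chosen arbitrarily for comparison.
candidates = {
    "steeper slope": sse(-0.3, 22.0),
    "flatter slope": sse(-0.3, 19.0),
    "shifted up by 1": sse(0.7, 20.5),
    "through first and last points": sse(-2.5, 22.5),
}

print("least squares line:", round(least_squares, 4))
for label, value in candidates.items():
    print(f"{label}: {value:.4f}")
```

Each alternative line gives a larger SSE than the least squares line, which is what the definition asserts; a finite list of alternatives is of course only an illustration, not a proof.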
FORMULAS FOR THE LEAST SQUARES ESTIMATORS
Slope: $b = \dfrac{SS_{xy}}{SS_{xx}}$
y-intercept: $a = \bar{y} - b\bar{x}$
where

$$SS_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}), \quad SS_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2, \quad \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \quad \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i,$$

and n = sample size.

Example 11.6 Refer to Example 11.5. Find the best-fitting straight line through the sample data points.
Solution By the least squares method we found the equation of the best-fitting straight line. It is $\hat{y} = -0.3 + 20.5x$. The graph of this line is shown in Figure 11.5.

Figure 11.5 Least squares line for Example 11.6

11.4 Estimating σ²
In most practical situations, the variance σ² of the random error e will be unknown and must be estimated from the sample data. Since σ² measures the variation of the y values about the regression line, it seems intuitively reasonable to estimate σ² by dividing the total error SSE by an appropriate number.
ESTIMATION OF σ²

$$s^2 = \frac{SSE}{\text{Degrees of freedom for error}} = \frac{SSE}{n - 2}, \quad \text{where} \quad SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2.$$

From the following Theorem it is possible to prove that s² is an unbiased estimator of σ², that is, E(s²) = σ².
Theorem 11.1 Let $s^2 = \dfrac{SSE}{n-2}$. Then, when the assumptions of Section 11.2 are satisfied, the statistic

$$\chi^2 = \frac{SSE}{\sigma^2} = \frac{(n-2)s^2}{\sigma^2}$$

has a chi-square distribution with ν = (n − 2) degrees of freedom.
Usually, s is referred to as the standard error of estimate.
Example 11.7 Refer to Example 11.5. Estimate the value of the error variance σ².
Data analysis or statistical software packages provide procedures or functions for computing the standard error of estimate s. For example, the function STEYX of MS-Excel gives, for the data of Example 11.5, the result s = 1.816590, so that s² ≈ 3.30.

INTERPRETATION OF s, THE ESTIMATED STANDARD DEVIATION OF e
We expect most of the observed y values to lie within 2s of their respective least squares predicted values $\hat{y}$.

Recall that the least squares line estimates the mean value of y for a given value of x. Since s measures the spread of the distribution of y values about the least squares line, most observations will lie within 2s of the least squares line.
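The formulas of Sections 11.3 and 11.4 can be carried out directly on the data of Table 11.2. The following plain-Python sketch is one way to do so; the variable names are illustrative, and the last lines simply count how many observations fall within 2s of the fitted line.

```python
from math import sqrt

x = [0.2, 0.4, 0.6, 0.8, 1.0]   # nicotine content, Table 11.2
y = [2, 10, 13, 15, 20]         # CO ranking, Table 11.2
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
ss_xx = sum((xi - x_bar) ** 2 for xi in x)
ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b = ss_xy / ss_xx          # slope of the least squares line
a = y_bar - b * x_bar      # y-intercept of the least squares line

residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
sse = sum(r * r for r in residuals)
s2 = sse / (n - 2)         # estimate of the error variance
s = sqrt(s2)               # standard error of estimate

within_2s = sum(abs(r) <= 2 * s for r in residuals)
print(f"a = {a:.4f}, b = {b:.4f}, SSE = {sse:.4f}, s^2 = {s2:.4f}, s = {s:.6f}")
print(f"{within_2s} of {n} observations lie within 2s of the line")
```

The fitted coefficients agree with the line of Example 11.6 and the value of s agrees with the STEYX result quoted in Example 11.7.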
11.5 Making inferences about the slope, B
In Section 11.2 we proposed the probabilistic model
y = A + B x + e
for the relationship between two random variables x and y, where x is the independent variable and y is the dependent variable, A and B are unknown parameters, and e is a random error. Under the assumptions made on the random error e we have E(y) = A + B x. This is the population regression line. If we are given a sample of n data points (xi, yi), i = 1, ..., n, then by the least squares method in Section 11.3 we can find the straight line $\hat{y} = a + bx$ fitted to these sample data. This line is the sample regression line. It is an estimate of the population regression line. We should be able to use it to make inferences about the population regression line. In this section we shall make inferences about the slope B of the "true" regression equation that are based upon the slope b of the sample regression equation.
The theoretical background for making inferences about the slope B lies in the following properties of the least squares estimator b:

PROPERTIES OF THE LEAST SQUARES ESTIMATOR b
1. Under the assumptions in Section 11.2, b will possess a sampling distribution that is normally distributed.
2. The mean of the least squares estimator b is B, E(b) = B; that is, b is an unbiased estimator of B.
3. The standard deviation of the sampling distribution of b is
$$\sigma_b = \frac{\sigma}{\sqrt{SS_{xx}}},$$
where σ is the standard deviation of the random error e and $SS_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2$.

We will use these results to test hypotheses about, and to construct a confidence interval for, the slope B of the population regression line.
Since σ is usually unknown, we use its estimator s, and instead of $\sigma_b = \sigma/\sqrt{SS_{xx}}$ we use its estimate $s_b = s/\sqrt{SS_{xx}}$.
For testing hypotheses about B, first we state the null and alternative hypotheses:
H0: B = B0
Ha: B ≠ B0 (or B < B0 or B > B0)
where B0 is the hypothesized value of B.
Often, one tests whether B = 0 or not, that is, whether x does or does not contribute information for the prediction of y. The setup of our test of utility of the model is summarized in the box.
A TEST OF MODEL UTILITY

ONE-TAILED TEST
H0: B = 0
Ha: B < 0 (or B > 0)
Test statistic: $t = \dfrac{b}{s_b} = \dfrac{b}{s/\sqrt{SS_{xx}}}$
Rejection region: t < −tα (or t > tα), where tα is based on (n − 2) df.

TWO-TAILED TEST
H0: B = 0
Ha: B ≠ 0
Test statistic: $t = \dfrac{b}{s_b} = \dfrac{b}{s/\sqrt{SS_{xx}}}$
Rejection region: t < −tα/2 or t > tα/2, where tα/2 is based on (n − 2) df.

The values of tα such that P(t ≥ tα) = α are given in Table 7.4.

Example 11.8 Refer to the nicotine-carbon monoxide ranking problem of Example 11.5. At significance level α = 0.05, test the hypothesis that the nicotine content of a cigarette contributes useful information for the prediction of carbon monoxide ranking y, i.e. test the prediction ability of the least squares straight line model $\hat{y} = -0.3 + 20.5x$.
Solution Testing the usefulness of the model requires testing the hypothesis
H0: B = 0
Ha: B ≠ 0
with n = 5 and α = 0.05. The critical value based on (5 − 2) = 3 df is obtained from Table 7.4:
tα/2 = t0.025 = 3.182.
Thus, we will reject H0 if t < −3.182 or t > 3.182.
In order to compute the test statistic we need the values of b, s and SSxx. In Example 11.6 we computed b = 20.5. From Example 11.7 we know s = 1.82, and we can compute SSxx = 0.4. Hence, the test statistic is

$$t = \frac{b}{s/\sqrt{SS_{xx}}} = \frac{20.5}{1.82/\sqrt{0.4}} = 7.12$$

Since the calculated t-value is greater than the critical value t0.025 = 3.182, we reject the null hypothesis and conclude that the slope B ≠ 0. At the significance level α = 0.05, the sample data provide sufficient evidence to conclude that nicotine content does contribute useful information for prediction of carbon monoxide ranking using the linear model.
Example 11.9 A consumer investigator obtained the following least squares straight line model (based on a sample of n = 100 families) relating the yearly food cost y for a family of 4 to annual income x:
$\hat{y} = 467 + 0.26x$.
In addition, the investigator computed the quantities s = 1.1, SSxx = 26. Compute the observed p-value for a test to determine whether mean yearly food cost y increases as annual income x increases, i.e., whether the slope of the population regression line B is positive.
Solution The consumer investigator wants to test
H0: B = 0
Ha: B > 0
To compute the observed significance level (p-value) of the test we must first find the calculated value of the test statistic, tc. Since b = 0.26, s = 1.1, and SSxx = 26 we have

$$t_c = \frac{b}{s/\sqrt{SS_{xx}}} = \frac{0.26}{1.1/\sqrt{26}} = 1.21$$

The observed significance level or p-value is given by P(t > tc) = P(t > 1.21), where the t-distribution is based on (n − 2) = (100 − 2) = 98 df. Since df > 30 we can approximate the t-distribution with the z-distribution. Thus,
p-value = P(t > 1.21) ≈ P(z > 1.21) = 0.5 − 0.3869 = 0.1131.
In order to conclude that the mean yearly food cost increases as annual income increases (B > 0) we must tolerate α ≥ 0.1131. But it is a big risk and usually we take α = 0.05. Under this significance level we cannot reject the hypothesis H0. It means we consider the sample result to be statistically insignificant.
Another way to make inferences about the slope B is to estimate it using a confidence interval. This interval is formed as shown in the box.

A (1 − α)100% CONFIDENCE INTERVAL FOR THE SLOPE B
$b \pm t_{\alpha/2}\, s_b$, where $s_b = \dfrac{s}{\sqrt{SS_{xx}}}$ and tα/2 is based on (n − 2) df.

Example 11.10 Find the 95% confidence interval for B in Example 11.8.
Solution For a 95% confidence interval α = 0.05. Therefore, we need to find the value of tα/2 = t0.025 based on (5 − 2) = 3 df. In Example 11.8 we found that t0.025 = 3.182. Also, we have b = 20.5, SSxx = 0.4. Thus, a 95% confidence interval for the slope in the model relating carbon monoxide to nicotine content is

$$b \pm t_{\alpha/2}\frac{s}{\sqrt{SS_{xx}}} = 20.5 \pm 3.182 \times \frac{1.82}{\sqrt{0.4}} = 20.5 \pm 9.16.$$

Our interval estimate of the slope parameter B is then 11.34 to 29.66. Since all the values in this interval are positive, it appears that B is positive and that the mean of y, E(y), increases as x increases.
Remark From the above we see the complete similarity between the t-statistic for testing hypotheses about the slope B and the t-statistic for testing hypotheses about the means of normal populations in Chapter 9, and the similarity of the corresponding confidence intervals. In each case, the general form of the test statistic is

t = (Parameter estimator − Its hypothesized mean) / (Estimated standard error of the estimator)

and the general form of the confidence interval is

Point estimator ± tα/2 × (Estimated standard error of the estimator)

11.6 Correlation analysis
Correlation analysis is the statistical tool that we can use to describe the degree to which one variable is linearly related to another. Frequently, correlation analysis is used in conjunction with regression analysis to measure how well the least squares line fits the data. Correlation analysis can also be used by itself, however, to measure the degree of association between two variables.
In this section we present two measures for describing the correlation between two variables: the coefficient of determination and the coefficient of correlation.

11.6.1 The coefficient of correlation
Definition 11.5 The Pearson product moment coefficient of correlation (or simply, the coefficient of correlation) r is a measure of the strength of the linear relationship between two variables x and y. It is computed (for a sample of n measurements on x and y) as follows:

$$r = \frac{SS_{xy}}{\sqrt{SS_{xx}\, SS_{yy}}},$$

where

$$SS_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}), \quad SS_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2, \quad SS_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2, \quad \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \quad \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i.$$

Some properties of the coefficient of correlation:
i) −1 ≤ r ≤ 1 (this follows from the Cauchy-Bunyakovsky inequality)
ii) r and b (the slope of the least squares line) have the same sign
iii) A value of r near or equal to 0 implies little or no linear relationship between x and y. The closer r is to 1 or to −1, the stronger the linear relationship between x and y.

Keep in mind that the correlation coefficient r measures the correlation between x values and y values in the sample, and that a similar linear coefficient of correlation exists for the population from which the data points were selected. The population correlation coefficient is denoted by ρ (rho). As you might expect, ρ is estimated by the corresponding sample statistic r. Or, rather than estimating ρ, we might want to test the hypothesis H0: ρ = 0 against Ha: ρ ≠ 0, i.e., test the hypothesis that x contributes no information for predicting y using the straight line model against the alternative that the two variables are at least linearly related. But it can be shown that the null hypothesis H0: ρ = 0 is equivalent to the hypothesis H0: B = 0. Therefore, we omit the test of hypothesis for linear correlation.

11.6.2 The coefficient of determination
Another way to measure the contribution of x in predicting y is to consider how much the errors of prediction of y can be reduced by using the information provided by x.
The sample coefficient of determination is developed from the relationship between two kinds of variation: the variation of the y values in a data set around
1. The fitted regression line
2. Their own mean
The term variation in both cases is used in its usual statistical sense to mean "the sum of a group of squared deviations".
The first variation is the variation of y values around the regression line, i.e., around their predicted values. This variation is the sum of squares for error (SSE) of the regression model

$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

The second variation is the variation of y values around their own mean

$$SS_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2$$

Definition 11.6 The coefficient of determination is

$$\frac{SS_{yy} - SSE}{SS_{yy}}$$

It is easy to verify that

$$r^2 = \frac{SS_{yy} - SSE}{SS_{yy}} = 1 - \frac{SSE}{SS_{yy}},$$

where r is the coefficient of correlation, defined in Subsection 11.6.1. Therefore, usually we call r² the coefficient of determination.
Statisticians interpret the coefficient of determination by looking at the amount of the variation in y that is explained by the regression line. To understand this meaning of r² consider Figure 11.6.

Figure 11.6 The explained and unexplained deviations

Here we singled out one observed value of y and showed the total deviation of this y from its mean $\bar{y}$, $y - \bar{y}$, the unexplained deviation $y - \hat{y}$ and the remaining explained deviation $\hat{y} - \bar{y}$. Now consider a whole set of observed y values instead of only one value. The total variation, i.e., the sum of squared deviations of these points from their mean, would be

$$SS_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2.$$

The unexplained portion of the total variation of these points from the regression line is

$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2.$$

The explained portion of the total variation is

$$\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2.$$

It is true that
Total variation = Explained variation + Unexplained variation.
we can reduce the total sum of squares of our prediction errors by more than 94% by using the least squares equation instead of y . 11.9444.7 Using the model for estimation and prediction The most common uses of a probabilistic model can be divided into two categories: 1) The use of the model for estimating the mean value of y. to predict carbon monoxide ranking. respectively. x. r2 = Explained variation Total variation PRACTICAL INTERPRETATION OF THE COEFFICIENT OF DETERMINATION.11 Refer to Example 11.5. In case 2) we are trying to predict the outcome of a single experiment at the given x value. These errors are given in the next box. Solution By the formulas given in this section we found r2 = 0. That is. for a specific value of x 2) The second use of the model entails predicting a particular y value for a given x value. The difference in these two model uses lies in the relative accuracy of the estimate and the prediction.3 + 20. In case 1) we are attempting to estimate the mean result of a very large number of experiments at the given x value. with the least squares line ˆ y = −0. E(y). cxci . Calculate the coefficient of determination for the nicotine contentcarbon monoxide ranking and interpret its value.5 x accounts for approximately 94% of the total sum of squares of deviations of the five sample CO rankings about their mean.Therefore. y. r2 About 100(r2) % of the total sum of squares of deviations of the sample yvalues about their mean y can be explained by (or attributed to) using x to predict y in the straightline model. These accuracies are best measured by the repeated sampling errors of the least squares line when it is used as estimator and as a predictor. We interpret this value as follows: The use of nicotine content. Example 11.
SAMPLING ERRORS FOR THE ESTIMATOR OF THE MEAN AND THE PREDICTOR OF AN INDIVIDUAL y
The standard deviation of the sampling distribution of the estimator $\hat{y}$ of the mean value of y at a fixed x is

$$\sigma_{\hat{y}} = \sigma \sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{SS_{xx}}}$$

The standard deviation of the prediction error for the predictor $\hat{y}$ of an individual y-value at a fixed x is

$$\sigma_{(y - \hat{y})} = \sigma \sqrt{1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{SS_{xx}}}$$
where σ is the square root of σ², the variance of the random error (see Section 11.2). The true value of σ will rarely be known. Thus, we estimate σ by s and calculate the estimation and prediction intervals as follows.
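A (1−α)100% CONFIDENCE INTERVAL FOR THE MEAN VALUE OF y FOR $x = x_p$

$$\hat{y} \pm t_{\alpha/2}\,(\text{Estimated std of } \hat{y}) \quad\text{or}\quad \hat{y} \pm t_{\alpha/2}\, s \sqrt{\frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}}$$

where $t_{\alpha/2}$ is based on (n − 2) df

A (1−α)100% PREDICTION INTERVAL FOR AN INDIVIDUAL y FOR $x = x_p$

$$\hat{y} \pm t_{\alpha/2}\,[\text{Estimated std of } (y - \hat{y})] \quad\text{or}\quad \hat{y} \pm t_{\alpha/2}\, s \sqrt{1 + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}}$$

where $t_{\alpha/2}$ is based on (n − 2) df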
Example 11.12 Find a 95% confidence interval for the mean carbon monoxide ranking of all cigarettes that have a nicotine content of 0.4 milligram. Also, find a 95% prediction interval for a particular cigarette if its nicotine content is 0.4 mg.
Solution For a nicotine content of 0.4 mg, $x_p = 0.4$ and the confidence interval for the mean of y is calculated by the formula for the mean value of y in the above box with s = 1.82, n = 5, df = n − 2 = 5 − 2 = 3, $t_{0.025} = 3.182$, $\hat{y} = -0.3 + 20.5x_p = -0.3 + 20.5 \times 0.4 = 7.9$, $SS_{xx} = 0.4$. Hence, we obtain the confidence interval (7.9 ± 3.17). Also, by the formula for an individual y in the same box we obtain the 95% prediction interval for a particular cigarette with nicotine content of 0.4 mg as (7.9 ± 6.60).
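The two half-widths quoted in Example 11.12 can be reproduced with a few lines of Python. This is only a sketch: the function name `half_widths` is made up for illustration, and the distance $x_p - \bar{x} = 0.2$ is an assumption, since $\bar{x}$ is not stated in this example; it was chosen because it is consistent with the reported values 3.17 and 6.60.

```python
import math

def half_widths(s, n, ss_xx, dx, t_crit):
    """Half-widths of the interval for the mean of y and for an individual y.

    dx is the distance x_p - x_bar; t_crit is t_{alpha/2} with n - 2 df.
    """
    core = 1.0 / n + dx ** 2 / ss_xx
    ci = t_crit * s * math.sqrt(core)        # mean value of y
    pi = t_crit * s * math.sqrt(1.0 + core)  # individual y
    return ci, pi

# s, n, SSxx and t are the values quoted in Example 11.12.
# dx = 0.2 is an ASSUMPTION (x_bar is not given in the example).
ci_hw, pi_hw = half_widths(s=1.82, n=5, ss_xx=0.4, dx=0.2, t_crit=3.182)
print(f"confidence half-width: {ci_hw:.2f}")
print(f"prediction half-width: {pi_hw:.2f}")
```

The only difference between the two formulas is the extra 1 under the square root, which is why the prediction interval is always the wider of the two.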
From Example 11.12 it is important to note that the prediction interval for the carbon monoxide ranking of an individual cigarette is wider than the corresponding confidence interval for the mean carbon monoxide ranking. By examining the formulas for the two intervals, we can see that this will always be true. Additionally, over the range of sample data, the width of both intervals increases as the value of x gets further from $\bar{x}$ (see Figure 11.7).
Figure 11.7 Comparison of 95% confidence interval and prediction interval
11.8. Simple Linear Regression: An Example
In the previous sections we have presented the basic elements necessary to fit and use a straightline regression model. In this section we will assemble these elements by applying them to an example.
Example 11.13 The International Rice Research Institute in the Philippines wants to relate the grain yield of rice varieties, y, to the tiller number, x. They conducted experiments for some rice varieties and tillers. The results obtained for the rice variety Milfor 6 are given below.
Table 11.3 The grain yield of rice, y, for the tiller number, x

Grain yield, y (kg/ha)    Tillers, x (no./m2)
4,862                     160
5,244                     175
5,128                     192
5,052                     195
5,298                     238
5,410                     240
5,234                     252
5,608                     282
Step 1 Assuming that the assumptions listed in Section 11.2 are satisfied, we hypothesize a straight-line probabilistic model for the relationship between the grain yield, y, and the tillers, x:
y = A + B x + e.
Step 2 Use the sample data to find the least squares line. For this purpose we calculate

$$SS_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad SS_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}),$$

$$b = \frac{SS_{xy}}{SS_{xx}}, \qquad a = \bar{y} - b\bar{x}$$

for the data. As a result, we obtain the least squares line

$$\hat{y} = 4242 + 4.56x$$
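The calculation in Step 2 can be sketched in plain Python using the data of Table 11.3. The variable names are illustrative only; the formulas are the ones given above.

```python
# Data from Table 11.3 (rice variety Milfor 6).
tillers = [160, 175, 192, 195, 238, 240, 252, 282]              # x, no./m2
grain_yield = [4862, 5244, 5128, 5052, 5298, 5410, 5234, 5608]  # y, kg/ha

n = len(tillers)
x_bar = sum(tillers) / n
y_bar = sum(grain_yield) / n

# Sums of squares and cross-products about the means.
ss_xx = sum((x - x_bar) ** 2 for x in tillers)
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(tillers, grain_yield))

b = ss_xy / ss_xx       # slope of the least squares line
a = y_bar - b * x_bar   # intercept of the least squares line

print(f"SSxx = {ss_xx:.1f}, SSxy = {ss_xy:.1f}")
print(f"y_hat = {a:.2f} + {b:.4f} x")
```

Rounded to the precision used in the text, this gives the line $\hat{y} = 4242 + 4.56x$.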
The scattergram for the data and the least squares line fitted to the data are depicted in Figure 11.8.
Figure 11.8 Simple linear model relating Grain Yield to Tiller Number
Step 3 Compute an estimator, $s^2$, for the variance $\sigma^2$ of the random error e:

$$s^2 = \frac{SSE}{n - 2}, \qquad\text{where}\qquad SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2.$$
The result of the computations is s² = 16,229.66, s = 127.39. The value of s implies that most of the 8 observed y values will fall within 2s = 254.78 of their respective predicted values.
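As a check on Step 3, the following sketch recomputes SSE, $s^2$ and s from the residuals of the fitted line. It refits the line from the Table 11.3 data so that it can be run on its own; the names are illustrative.

```python
import math

# Data from Table 11.3.
x = [160, 175, 192, 195, 238, 240, 252, 282]
y = [4862, 5244, 5128, 5052, 5298, 5410, 5234, 5608]
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
ss_xx = sum((xi - x_bar) ** 2 for xi in x)
ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b = ss_xy / ss_xx
a = y_bar - b * x_bar

# SSE is the sum of squared residuals; s^2 divides it by n - 2 df.
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
s2 = sse / (n - 2)
s = math.sqrt(s2)

print(f"SSE = {sse:.3f}, s^2 = {s2:.3f}, s = {s:.3f}")
```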
Step 4 Check the utility of the hypothesized model, that is, whether x really contributes information for the prediction of y using the straightline model. First test the hypothesis that the slope B is 0, i.e., there is no linear relationship between the grain yield, y, and the tillers, x. We test:
H0 : B = 0 Ha : B ≠ 0
Test statistic:

$$t = \frac{b}{s_b} = \frac{b}{s/\sqrt{SS_{xx}}}$$

For the significance level α = 0.05, we will reject H0 if $t < -t_{\alpha/2}$ or $t > t_{\alpha/2}$, where $t_{\alpha/2}$ is based on (n − 2) = (8 − 2) = 6 df. For this df we find $t_{0.025} = 2.447$, and

$$t = \frac{4.56}{127.39/\sqrt{12541.5}} = 4.004.$$

This t-value is greater than $t_{0.025}$. Thus, we reject the hypothesis B = 0.
Next, we obtain additional information about the relationship by forming a confidence interval for the slope B. A 95% confidence interval is

$$b \pm t_{\alpha/2}\,\frac{s}{\sqrt{SS_{xx}}} = 4.56 \pm 2.447\,\frac{127.39}{\sqrt{12541.5}} = 4.56 \pm 2.78.$$

It is the interval (1.78, 7.34).
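The test statistic and the confidence interval for the slope can be sketched as follows. The critical value 2.447 is taken from the text rather than computed, and unrounded intermediate values are used, so the last digits can differ slightly from the hand calculation above.

```python
import math

# Data from Table 11.3.
x = [160, 175, 192, 195, 238, 240, 252, 282]
y = [4862, 5244, 5128, 5052, 5298, 5410, 5234, 5608]
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
ss_xx = sum((xi - x_bar) ** 2 for xi in x)
ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b = ss_xy / ss_xx
a = y_bar - b * x_bar
s = math.sqrt(sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2))

t_crit = 2.447                 # t_{0.025} with 6 df, as quoted in the text
s_b = s / math.sqrt(ss_xx)     # estimated standard error of the slope
t_stat = b / s_b
lo, hi = b - t_crit * s_b, b + t_crit * s_b

print(f"s_b = {s_b:.5f}, t = {t_stat:.3f}")
print(f"reject H0: {abs(t_stat) > t_crit}")
print(f"95% CI for B: ({lo:.2f}, {hi:.2f})")
```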
Another measure of the utility of the model is the coefficient of correlation

$$r = \frac{SS_{xy}}{\sqrt{SS_{xx}\,SS_{yy}}}, \qquad\text{where}\qquad SS_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2.$$
Computations give r = 0.853. The high correlation confirms our conclusion that B differs from 0. It appears that the grain yield and tillers are rather highly correlated. The coefficient of determination is r2 = 0.7277, which implies that 72.77% of the total variation is explained by the tillers.
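The values of r and r² can be checked with the short sketch below, which also confirms numerically that $r^2 = 1 - SSE/SS_{yy}$ for these data. Names are illustrative.

```python
import math

# Data from Table 11.3.
x = [160, 175, 192, 195, 238, 240, 252, 282]
y = [4862, 5244, 5128, 5052, 5298, 5410, 5234, 5608]
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
ss_xx = sum((xi - x_bar) ** 2 for xi in x)
ss_yy = sum((yi - y_bar) ** 2 for yi in y)
ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

r = ss_xy / math.sqrt(ss_xx * ss_yy)   # coefficient of correlation
r2 = r ** 2                            # coefficient of determination

# Same quantity from the two kinds of variation.
b = ss_xy / ss_xx
a = y_bar - b * x_bar
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
r2_from_sse = 1 - sse / ss_yy

print(f"r = {r:.6f}, r^2 = {r2:.4f}, 1 - SSE/SSyy = {r2_from_sse:.4f}")
```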
Step 5 Use the least squares model:
Suppose the researchers want to predict the grain yield if the tillers are 210 per m2, i.e., xp =210. The predicted value is
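$$\hat{y} = 4242 + 4.56x_p = 4242 + 4.56 \times 210 = 5199.6.$$

If we want a 95% prediction interval, we calculate

$$\hat{y} \pm t_{\alpha/2}\, s \sqrt{1 + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}} = 5199.6 \pm 2.447 \times 127.39 \sqrt{1 + \frac{1}{8} + \frac{(210 - 216.75)^2}{12541.5}}$$

$$\approx 5199 \pm 331.18 = (4867.82,\ 5530.18),$$

where $\bar{x} = 216.75$ is the mean of the eight tiller numbers in Table 11.3.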
Thus, the model yields a 95% prediction interval for the grain yield for the given value 210 of tillers from 4867.82 kg/ha to 5530.18 kg/ha. Below we include the STATGRAPHICS printout for this example.
Regression Analysis - Linear model: Y = a+bX
Dependent variable: GrainYield          Independent variable: Tillers

                            Standard       T          Prob.
Parameter     Estimate      Error          Value      Level
Intercept     4242.13       250.649        16.9245    0.00000
Slope         4.55536       1.13757        4.00445    0.00708

Analysis of Variance
Source           Sum of Squares   Df   Mean Square   F-Ratio   Prob. Level
Model            260252.06        1    260252.06     16.0      0.00708
Residual         97377.944        6    16229.657
Total (Corr.)    357630.00        7

Correlation Coefficient = 0.853061      R-squared = 72.77 percent
Stnd. Error of Est. = 127.396
Figure 11.9 STATGRAPHICS printout for Example 11.13
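The prediction interval of Step 5 can be reproduced with the sketch below. It uses the unrounded least squares coefficients, so the centre of the interval comes out near 5198.75 rather than the rounded 5199.6 used in the hand calculation, while the half-width agrees with the value 331.18 in the text. The critical value 2.447 is taken from the text.

```python
import math

# Data from Table 11.3.
x = [160, 175, 192, 195, 238, 240, 252, 282]
y = [4862, 5244, 5128, 5052, 5298, 5410, 5234, 5608]
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
ss_xx = sum((xi - x_bar) ** 2 for xi in x)
ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b = ss_xy / ss_xx
a = y_bar - b * x_bar
s = math.sqrt(sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2))

x_p = 210
t_crit = 2.447                              # t_{0.025} with 6 df, from the text
y_hat = a + b * x_p
core = 1 / n + (x_p - x_bar) ** 2 / ss_xx
ci_half = t_crit * s * math.sqrt(core)      # interval for the mean yield
pi_half = t_crit * s * math.sqrt(1 + core)  # interval for an individual yield

print(f"y_hat = {y_hat:.2f}")
print(f"95% CI for E(y): {y_hat:.2f} +/- {ci_half:.2f}")
print(f"95% prediction interval: {y_hat:.2f} +/- {pi_half:.2f}")
```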
The steps that we follow in the simple linear regression analysis are: To hypothesize a probabilistic straightline model y=A + Bx + e. In fitting a least squares line to n = 22 data points.10 Exercises 1.05. Consider the seven data points in the table x y 5 0. To make assumptions on the random error component e. f) ) Find a 95% confidence interval for the mean value of y when x = 0.11. is it positive or negative? b) Find the correlation coefficient r and interpret its value. suppose you computed the following quantities: SSxx = 25 x=2 SSyy = 17 y =3 SSxy = 20 a) Find the least squares line. To use the method of least squares to estimate the unknown parameters in the deterministic component. 2. We have also presented the method of least squares for fitting a prediction equation to a data set.9 Summary In this chapter we have introduced bivariate relationships and showed how to compute the coefficient of correlation. c) Find the least squares prediction equation. d) Find a 95% confidence interval for the mean value of y when x = 1. along with associated statistical tests and estimations. calculating the coefficient of correlation r and the coefficient of determination r2. E(y). is called a regression analysis.5 0 3.1 1 5. y=A + Bx. e) Test the null hypothesis that the slope B = 0 against the alternative hypothesis that B ≠ 0 . To assess the utility of the hypothesized model. Use α = 0. Included here are making inferences about the slope B.8 3 1. b) Calculate s2 .7 5 6. e) Find a 95% prediction interval for y when x = 1. After examining the scattergram.0 3 4. This procedure. a measure of the strength of the linear relationship between two variables. r .1 1 2. for a given x value and to predict an individual y value for a specific x value 11. If we are satisfied with the model we used it to estimate the mean y value.2 a) Construct a scatter diagram for the data. b) Calculate SSE. f) Find a 90% confidence interval for the slope B. 
d) Calculate SSE for the data and calculate s2 and s. do you think that x and y are correlated? If correlation is present. cxcvii .
65 0. 4. The data shown in the table provide a measure of corrosion of Armco iron in tap water containing various concentrations of NaPO4 inhibitor: Concentratio n of NaPO4. parts per million Measure of corrosion rate.40252 11.60 11.Y Independent variable: ELECTRIC.00 0. d) Construct a 95% confidence interval for the mean corrosion rate of iron in tape water in which the concentration of NaPO4 is 20 parts per milllion.03 7.68 0. For the relationship between the variables x and y one uses a linear model and for some data collected STATGRAPHICS gives the following printout Regression Analysis .95 6. b) Fit the linear model y = A + B x + e to the data.72 0.0623473 2.445 0.4 Prob.89 5985.214 Df Mean Square 1 8 798516.763 0.56 a) Construct a scatter diagram for the data .00000 Analysis of Variance Source Model Residual Sum of Squares 798516.01 1.00000 Total (Corr.60 13.05.89 47885. y Concentratio n of NaPO4.75 5. A study was conducted to examine the inhibiting properties of the sodium salts of phosphoric acid on the corrosion of iron.3.30 5.43 26. y 2.X Standard Parameter Estimate Error T Value Prob.652 FRatio 133.00 19.00 40.) 846402. x.93 0. Level Intercept Slope 279. parts per million Measure of corrosion rate.Linear model: Y = a+bX Dependent variable: ELECTRIC. Level 0.00 55.10 9 cxcviii .50 5.68 6.20 33.720119 116.00 50.5501 0. c) Does the model of part b) provide an adequate fit? Test using α = 0.04301 0. x.60 7.
Correlation Coefficient = 0. = 77. b) What are the values of SSE and s2 for the data? c) Perform a test of model adequacy.971301 Stnd.05.34 percent Figure 11. a) Identify the least squares model fitted to the data. cxcix . Error of Est.10 STATGRAPHICS printout for Exercise 11.367 Rsquared = 94. Use α = 0.4 .
Chapter 12 Multiple regression

CONTENTS
12.1 Introduction: the general linear model
12.2 Model assumptions
12.3 Fitting the model: the method of least squares
12.4 Estimating σ²
12.5 Estimating and testing hypotheses about the B parameters
12.6 Checking the utility of a model
12.7 Using the model for estimating and prediction
12.8 Multiple linear regression: An overview example
12.9 Model building: interaction models
12.10 Model building: quadratic models
12.11 Summary
12.12 Exercises
___________________________________________________________________________

12.1. Introduction: the general linear model

The models for a multiple regression analysis are similar to the simple regression model except that they contain more terms.

Example 12.1 The researchers in the International Rice Research Institute suppose that Grain Yield, y, relates to Plant Height, $x_1$, and Tiller Number, $x_2$, by the linear model

$$E(y) = B_0 + B_1x_1 + B_2x_2.$$

Example 12.2 Suppose we think that the mean time E(y) required to perform a data-processing job increases as the computer utilization increases and that the relationship is curvilinear. Instead of using the straight-line model $E(y) = A + Bx_1$ to model the relationship, we might use the quadratic model $E(y) = A + B_1x_1 + B_2x_1^2$, where $x_1$ is a variable that measures computer utilization. A quadratic model is often referred to as a second-order linear model in contrast to a straight-line or first-order model. If, in addition, we think that the mean time required to process a job is also related to the size $x_2$ of the job, we could include $x_2$ in the model. For example, the first-order model in this case is

$$E(y) = B_0 + B_1x_1 + B_2x_2$$

and the second-order model is

$$E(y) = B_0 + B_1x_1 + B_2x_2 + B_3x_1x_2 + B_4x_1^2 + B_5x_2^2.$$

All the models that we have written so far are called linear models, because E(y) is a linear function of the unknown parameters $B_0, B_1, B_2, \ldots$ The model

$$E(y) = Ae^{Bx}$$

is not a linear model because E(y) is not a linear function of the unknown model parameters A and B.

Note that by introducing new variables, second-order models may be written in the form of first-order models. For example, putting $x_2 = x_1^2$, the second-order model $E(y) = B_0 + B_1x_1 + B_2x_1^2$ becomes the first-order model $E(y) = B_0 + B_1x_1 + B_2x_2$. Therefore, in the future we consider only multiple first-order regression models.

THE GENERAL MULTIPLE LINEAR MODEL

$$y = B_0 + B_1x_1 + \ldots + B_kx_k + e,$$

where
y = dependent variable (variable to be modeled, sometimes called the response variable)
$x_1, x_2, \ldots, x_k$ = independent variables (variables used as predictors of y)
e = random error
$B_i$ determines the contribution of the independent variable $x_i$

12.2 Model assumptions

ASSUMPTIONS REQUIRED FOR A MULTIPLE LINEAR REGRESSION MODEL
1. $y = B_0 + B_1x_1 + \ldots + B_kx_k + e$, where e is random error.
2. For any given set of values $x_1, x_2, \ldots, x_k$, the random error e has a normal probability distribution with the mean equal to 0 and variance equal to σ².
3. The random errors are independent.

12.3 Fitting the model: the method of least squares

The method of fitting a multiple regression model is identical to that of fitting the straight-line model. Suppose we are given the sample data that are presented in Table 12.1.
Table 12.1

DATA POINT   Y VALUE   x1     x2     ...   xk
1            y1        x11    x21    ...   xk1
2            y2        x12    x22    ...   xk2
...          ...       ...    ...    ...   ...
n            yn        x1n    x2n    ...   xkn

We will use the method of least squares and choose estimates of $B_0, B_1, \ldots, B_k$ that minimize

$$SSE = \sum_{i=1}^{n} [y_i - \hat{y}_i]^2 = \sum_{i=1}^{n} [y_i - (b_0 + b_1x_{1i} + b_2x_{2i} + \ldots + b_kx_{ki})]^2$$

In order to briefly write the solution of the least squares problem we introduce the matrix notations

$$Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \qquad
X = \begin{pmatrix} 1 & x_{11} & x_{21} & \cdots & x_{k1} \\ 1 & x_{12} & x_{22} & \cdots & x_{k2} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{1n} & x_{2n} & \cdots & x_{kn} \end{pmatrix}, \qquad
b = \begin{pmatrix} b_0 \\ b_1 \\ \vdots \\ b_k \end{pmatrix}.$$

Then we can write the least squares equations in matrix form as

THE LEAST SQUARES MATRIX EQUATION

$$(X'X)\,b = X'Y,$$

where X' is the transpose of X. The solution of the least squares equations therefore is

LEAST SQUARES SOLUTION

$$b = (X'X)^{-1}X'Y.$$

Example 12.3 Refer to Example 12.1 relating Grain Yield, y, to Plant Height, $x_1$, and Tiller Number, $x_2$, by the linear model $E(y) = B_0 + B_1x_1 + B_2x_2$.
5 14./hill ( x2 ) Solution The Y. B2.Find the least squares estimates of B0.75.2 15.6 84.3412 0. 150.8 17.9 75.8 75.1 77.5 18. 23.596495 2942. b2 After calculations.6 17.5 93.4 b0 b = b1 .X1 GRAIN.2375  cciii .0 118.4 84.930958 2.895492 1.9 19. finally.6 18.2 .6 19. Model fitting results for: GRAIN.6 77.1 14.5 105. error tvalue sig. kg/ha (y) 1 2 3 4 5 6 7 8 5755 5939 6010 6545 6730 6750 6899 7862 PLANT HEIGHT. X and b are shown below 5755 5939 6010 6545.0 14.1 17.4 TILLER.5 16. cm ( x1 ) 110.31 x2.312641 112. Thus. 93.59.1249 GRAIN.X2 150.75 x1 + 150.748104 12. The data are shown in Table 12.4 16.8416 0.1528 0. Y= 6730 6750 6899 7862 1 1 1 1 X = 1 1 1 1 110.59 23.4 17.6 104. the prediction equation is y = 6335. no.31 )’. level Independent variable 6335.6 15.0839 CONSTANT 23.2 Data for Grain Yield Study VARIETY NUMBER GRAIN YIELD. we obtain b = ( 6335.5 105.1 104.Y coefficient std.2 Table 12.069368 1.4 118. Below we include the STATGRAPHICS printout for this example.6 14. B1.
var. 11.1 Properties of the sampling distributions of b0. 12.0138 579455. Since σ2 will rarely be known in advance.149078 DurbWat= 2. (ADJ.2 the STATGRAPHICS printout is following Analysis of Variance for the Full Regression Source Sum of Squares DF Mean Square FRatio Pvalue Model 2632048. bk cciv .. = 340. ESTIMATOR OF σ2. for the data for Grain Yield Study in Table 12. error of est. For example.819569 Rsquared (Adj.000 8 observations fitted. 2 1316024. 7 Rsquared = 0.3557 0..428 DurbinWatson statistic = 2. Error Total (Corr.) = 0. b1.0000 0. of dep. for d..7474 SE= 340. we must use the sample data to estimate its value.RSQ.5 Estimating and testing hypotheses about the B parameters 12. that is E(s2) = σ2. the variance of the random error e that appears in the linear model.5. Notice that in softwares SSE often is referred to as Sum of Squares for Error and s2 is refereed to as Mean Squares for Error.) = 0.000000 0.747396 Stnd. THE VARIANCE OF e IN A MULTIPLE REGRESSION MODEL s2 = where SSE SSE = Degree of freedom for error n − Number of B parameters in model ˆ SSE = ∑ ( y i − y i ) 2 i =1 n It can be proved that s2 is an unbiased estimator of σ2.) 3211504. 5 115891.000000 0. forecast(s) computed for 0 missing val.427774 MAE= 248.f. 12.33739 We see on this printout that SSE = 579455 and s2 = 115891.4 Estimating σ2 ˆ We recall that the variances of the estimators of all the B parameters and of y will depend on 2 the value of σ .337 Previously: 0. .
Before making inferences about the B parameters of the multiple linear model we provide some properties of the least squares estimators b, which serve as the theoretical background for estimating and testing hypotheses about B.

From Section 12.3 we know that the least squares estimators b are computed by the formula $b = (X'X)^{-1}X'Y$. Now, we can rewrite b in the form

$$b = [(X'X)^{-1}X']\,Y.$$

From this form we see that the components of b: $b_0, b_1, \ldots, b_k$ are linear functions of n normally distributed random variables $y_1, y_2, \ldots, y_n$. Therefore, $b_i$ (i = 0, 1, ..., k) has a normal sampling distribution.

It has been shown that the least squares estimators provide unbiased estimators of $B_0, B_1, \ldots, B_k$, that is, $E(b_i) = B_i$ (i = 0, 1, ..., k).

The standard errors and covariances of the estimators are defined by the elements of the matrix $(X'X)^{-1}$. Thus, if we denote

$$(X'X)^{-1} = \begin{pmatrix} c_{00} & c_{01} & \cdots & c_{0k} \\ c_{10} & c_{11} & \cdots & c_{1k} \\ c_{20} & c_{21} & \cdots & c_{2k} \\ \vdots & \vdots & & \vdots \\ c_{k0} & c_{k1} & \cdots & c_{kk} \end{pmatrix},$$

then the standard deviations of the sampling distributions of $b_0, b_1, \ldots, b_k$ are

$$\sigma_{b_i} = \sigma\sqrt{c_{ii}} \quad (i = 0, 1, \ldots, k),$$

where σ is the standard deviation of the random error e.

The properties of the sampling distributions of the least squares estimators are summarized in the box.

THEOREM 12.1 (properties of the sampling distributions of $b_0, b_1, \ldots, b_k$)
The sampling distribution of $b_i$ (i = 0, 1, ..., k) is normal with:
mean $E(b_i) = B_i$,
variance $V(b_i) = c_{ii}\,\sigma^2$,
standard deviation $\sigma_{b_i} = \sigma\sqrt{c_{ii}}$ (i = 0, 1, ..., k).
The covariance of two parameter estimators is equal to $\mathrm{Cov}(b_i, b_j) = c_{ij}\,\sigma^2$ (i ≠ j).

12.5.2 Estimating and testing hypotheses about the B parameters

A (1−α)100% confidence interval for a model parameter $B_i$ (i = 0, 1, ..., k) can be constructed using the t statistic

$$t = \frac{b_i - B_i}{s_{b_i}} = \frac{b_i - B_i}{s\sqrt{c_{ii}}},$$

where s is an estimate of σ.

A (1−α)100% CONFIDENCE INTERVAL FOR $B_i$

$$b_i \pm t_{\alpha/2}\,(\text{Estimated standard error of } b_i) \quad\text{or}\quad b_i \pm t_{\alpha/2}\, s\sqrt{c_{ii}}$$

where $t_{\alpha/2}$ is based on [n − (k+1)] df.

Similarly, the test statistic for testing the null hypothesis H0: $B_i = 0$ is

$$t = \frac{b_i}{\text{Estimated standard error of } b_i}$$

The test is summarized in the box:

TEST OF AN INDIVIDUAL PARAMETER COEFFICIENT IN THE MULTIPLE REGRESSION MODEL $y = B_0 + B_1x_1 + \ldots + B_kx_k + e$

ONE-TAILED TEST
H0: $B_i = 0$
Ha: $B_i < 0$ (or $B_i > 0$)
Test statistic: $t = \dfrac{b_i}{s_{b_i}} = \dfrac{b_i}{s\sqrt{c_{ii}}}$
Rejection region: $t < -t_{\alpha}$ (or $t > t_{\alpha}$)
where $t_{\alpha}$ is based on [n − (k+1)] df, n = number of observations, k = number of independent variables in the model

TWO-TAILED TEST
H0: $B_i = 0$
Ha: $B_i \ne 0$
Test statistic: $t = \dfrac{b_i}{s_{b_i}} = \dfrac{b_i}{s\sqrt{c_{ii}}}$
Rejection region: $t < -t_{\alpha/2}$ or $t > t_{\alpha/2}$
where $t_{\alpha/2}$ is based on [n − (k+1)] df, n = number of observations, k = number of independent variables in the model

The values of $t_{\alpha}$ such that $P(t \ge t_{\alpha}) = \alpha$ are given in Table 7.4
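To make the matrix formulas concrete, the sketch below solves $(X'X)b = X'Y$ and computes the standard errors $s\sqrt{c_{ii}}$ and the t statistics in plain Python. As an assumption made for illustration, it uses the simple one-predictor rice data of Table 11.3 (so k = 1), because those results can be compared with the STATGRAPHICS printout in Figure 11.9; the helper `invert` is a small Gauss-Jordan routine written for this example, not a library function.

```python
import math

def invert(m):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    size = len(m)
    aug = [list(map(float, row)) + [float(i == j) for j in range(size)]
           for i, row in enumerate(m)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(size):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[size:] for row in aug]

# Table 11.3 data; each row of X is (1, x_i).
x = [160, 175, 192, 195, 238, 240, 252, 282]
y = [4862, 5244, 5128, 5052, 5298, 5410, 5234, 5608]
X = [[1.0, float(xi)] for xi in x]
n, p = len(X), len(X[0])          # p = k + 1 parameters

xtx = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
c = invert(xtx)                   # the matrix (X'X)^{-1} with elements c_ij
b = [sum(c[i][j] * xty[j] for j in range(p)) for i in range(p)]

resid = [yi - sum(bj * rj for bj, rj in zip(b, r)) for r, yi in zip(X, y)]
s = math.sqrt(sum(e * e for e in resid) / (n - p))
se = [s * math.sqrt(c[i][i]) for i in range(p)]   # estimated std errors of b_i
t = [bi / sei for bi, sei in zip(b, se)]          # statistics for H0: B_i = 0

for name, bi, sei, ti in zip(["b0", "b1"], b, se, t):
    print(f"{name}: estimate = {bi:.5f}, std error = {sei:.5f}, t = {ti:.4f}")
```

The same code works for more predictors by adding columns to each row of X; only the degrees of freedom n − (k+1) change.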
0061 RSQ.689335 32. c. b. Below is a part of the printout of the procedure “ Multiple regression “.230298 2.000123 3.689335 MAE= 32.) = 0. Test H0: B2 = 0 against Ha: B2 ≠ 0. square feet 1290 1350 1470 1600 1710 1840 1980 2230 2400 2390 y.209833 3.Example 12. B2.Y Independent variable coefficient std.094 Previously: 0. var.4 An electrical utility company wants to predict the monthly power usage of a home as a function of the size of the home based on the model y = B0 + B1x + B2x2 + e. State your conclusions. of dep.3.094 10 observations fitted.0010 ELECTRIC.X 2.0164 ELECTRIC. (ADJ.X 0.382558 415. Table 12.000477 0. Data are shown in Table 12.46109 5.230298 DurbWat= 2. d.level CONSTANT 1303. ccvii . Compute the estimated standard error for b1. forecast(s) computed for 0 missing val. Find the least squares estimators of B0.8687 0. Solution We use computer with the software STATGRAPHICS to do this example.9768 46.3 Data for Power Usage Study SIZE OF HOME MONTHY USAGE x. Model fitting results for: ELECTRIC. B1.4176 0.497984 0. error tvalue sig. kilowatthours 1182 1172 1264 1493 1571 1711 1804 1840 1956 1954 a.1391 0. Compute the value of the test statistic for testing H0: B2 = 0.X * ELECTRIC.9768 SE= 46.
We begin with the easier problem – finding a measure of how well a linear model fits a set of data. the coefficient of determination for the straight line model (Chapter 11). Below we include also a printout from SPSS for the Example 12.8687.210 . for df = [10 – (2+1)] =7 we have tα/2 = 2.497884 x – 0. b.001 Figure 12.196 .382558 + 2.000477 x2. At significance level α = 0.461 . it is very likely that we would make one or more errors in deciding which terms to retain in the model and which to exclude.016 2285.006 1.365. Since the observed value of t = –3. Lower Bound Upper Bound 321. The value of the test statistic for testing H0: B2 = 0 is t = –3.365. We would like to find some statistical quantity that measures how well the model fits the data. The estimated standard error for b1 is 0.4. Coefficients Unstandardized Coefficients 95% Confidence Interval for B t 415.Figure 12. we will need a global test (one that encompasses all the B parameters).869 Sig. Error .461069 ( in std. If we were to conduct a series of ttests to determine whether the individual variables are contributing to the predictive relationship . For this we use the multiple regression equivalent of r2. we will reject H0: B2 = 0 if t < 2.498 4.000 B Model 1 (Constant) 1303.570 3. Checking the utility of a model Conducting ttests on each B parameter in a model is not a good way to determine whether a model is contributing information for the prediction of y. that is. x2 contributes information for the prediction of y.768E04 Std. we reject H0.error column) c.365 or t >2. ccviii . d.6.418 3. To test the utility of a multiple regression model.001 . The least squares model are y = 1303.2 A part of SPSS printout for Example 12.408 .365.4 12. Therefore.000 3.139 5.05.383 X X2 2.8687 is less than 2.1 STATGRAPHICS printout for Example 12.588 .4 From the printout we see that a.
Definition 12.1 The multiple coefficient of determination $R^2$ is defined as

$$R^2 = 1 - \frac{SSE}{SS_{yy}},$$

where

$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad SS_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2$$

and $\hat{y}_i$ is the predicted value of $y_i$ for the multiple regression model.

From the definition we see that $R^2 = 0$ implies a complete lack of fit of the model to the data, and $R^2 = 1$ implies a perfect fit with the model passing through every data point. In general, the larger the value of $R^2$, the better the model fits the data.

$R^2$ is a sample statistic that tells how well the model fits the data, and thereby represents a measure of the utility of the entire model. It can be used to make inferences about the utility of the model for predicting y values for specific settings of the independent variables.

TESTING THE OVERALL UTILITY OF THE MODEL $E(y) = B_0 + B_1x_1 + \ldots + B_kx_k$

H0: $B_1 = B_2 = \ldots = B_k = 0$ (Null hypothesis: y doesn't depend on any $x_i$)
Ha: At least one $B_i \ne 0$ (Alternative hypothesis: y depends on at least one of the $x_i$'s)

Test statistic:

$$F = \frac{SS(\text{Model})/k}{SSE/[n - (k+1)]} = \frac{R^2/k}{(1 - R^2)/[n - (k+1)]} = \frac{\text{Mean Square for Model}}{\text{Mean Square for Error}}$$

Rejection region: $F > F_{\alpha}$, where $F_{\alpha}$ is the value that locates area α in the upper tail of the F-distribution with $\nu_1 = k$ and $\nu_2 = n - (k+1)$,
n = Number of observations,
k = Number of parameters in the model (excluding $B_0$),
$R^2$ = Multiple coefficient of determination.

Example 12.5 Refer to Example 12.4. Test to determine whether the model contributes information for the prediction of the monthly power usage.
Solution For the electrical usage example, n = 10, k = 2 and n − (k+1) = 7. At the significance level α = 0.05 we will reject H0: $B_1 = B_2 = 0$ if $F > F_{0.05}$, where $\nu_1 = 2$ and $\nu_2 = 7$, or F > 4.74. From the computer printout (see Figure 12.3) we find that the computed F is 190.638. Since this value greatly exceeds 4.74 we reject H0 and conclude that at least one of the model coefficients $B_1$ and $B_2$ is nonzero. Therefore, this F test indicates that the second-order model $y = B_0 + B_1x + B_2x^2 + e$ is useful for predicting electrical usage.
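The relation between $R^2$ and the F statistic can be illustrated with a small sketch. The function name `overall_f` is hypothetical; the $R^2$ values are the ones reported in the printouts of this chapter for the electrical usage model (0.981972, n = 10) and the grain yield model (0.819569, n = 8), and the critical value 4.74 is the one quoted in Example 12.5.

```python
def overall_f(r2, n, k):
    """F statistic for H0: B1 = ... = Bk = 0, computed from R^2."""
    return (r2 / k) / ((1.0 - r2) / (n - (k + 1)))

# R^2 values as reported in the printouts for the two examples (k = 2 in both).
f_power = overall_f(0.981972, n=10, k=2)   # electrical usage study
f_grain = overall_f(0.819569, n=8, k=2)    # grain yield study

f_crit_power = 4.74   # F_{0.05} with (2, 7) df, as quoted in Example 12.5
print(f"electrical usage: F = {f_power:.2f}, reject H0: {f_power > f_crit_power}")
print(f"grain yield:      F = {f_grain:.2f}")
```

Because the inputs are rounded $R^2$ values, the results agree with the printed F ratios only to about two decimal places.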
069 7 . Solution From the SPSS Printout ( Figure 12.356 and the corresponding observed significance level is 0. 2 415571. 190.6893 DurbinWatson statistic = 2.356 6 5 115891.0000 Error 15259. Our methods for prediction and estimation using any general model are identical to those discussed in Section 11.09356 Figure 12. we may decide use it for those purposes.Analysis of Variance for the Full Regression Source Sum of Squares DF Mean Square FRatio Pvalue Model 831143.) 846402.4) we see that the F value is 11. at the significance level greater than 0. error of est. and conclude that the linear model E(y) = A + B1x1 + B2x2 is useful for prediction of the grain yield.) = 0.981972 Rsquared (Adj.7. 9 Rsquared = 0.976821 Stnd.3. ANOVA Model Sum of Squares df Mean Square F Sig. = 46. Thus.014 we reject the null hypothesis.50 0 2 1316024.7 for the simple straight ccx .4 SPSS Printout for Grain Yield Example 12. for d.15 3 Residual 579455.347 Total 3211503.3 STATGRAPHICS Printout for Electrical Usage Example Example 12.07 11.014. 1 Regression 2632048.638 0.6 Refer to Example 12.014 Figure 12. test the utility of the model E(y) = A + B1x1 + B2x2.f.89 Total (Corr.3 7 2179. Using the model for estimating and prediction After checking the utility of the linear model and finding it to be useful for prediction and estimation.
line model. We will use the model to form a confidence interval for the mean E(y) for a given value x* of x, or a prediction interval for a future value of y for a specific x*. The procedure for forming a confidence interval for E(y) is shown in the following box.

A (1−α)100% CONFIDENCE INTERVAL FOR E(y)

$$\hat{y} \pm t_{\alpha/2}\, s\sqrt{(x^*)'(X'X)^{-1}x^*}$$

where
$\hat{y} = b_0 + b_1x_1^* + b_2x_2^* + \ldots + b_kx_k^*$,
$x^* = (1 \;\; x_1^* \;\; x_2^* \;\; \ldots \;\; x_k^*)'$ is the given value of x,
s and $(X'X)^{-1}$ are obtained from the least squares analysis,
$t_{\alpha/2}$ is based on the number of degrees of freedom associated with s, namely, [n − (k+1)]

The procedure for forming a prediction interval for y for a given x* is shown in the following box.

A (1−α)100% PREDICTION INTERVAL FOR y

$$\hat{y} \pm t_{\alpha/2}\, s\sqrt{1 + (x^*)'(X'X)^{-1}x^*}$$

where
$\hat{y} = b_0 + b_1x_1^* + b_2x_2^* + \ldots + b_kx_k^*$,
$x^* = (1 \;\; x_1^* \;\; x_2^* \;\; \ldots \;\; x_k^*)'$ is the given value of x,
s and $(X'X)^{-1}$ are obtained from the least squares analysis,
$t_{\alpha/2}$ is based on the number of degrees of freedom associated with s, namely, [n − (k+1)]

12.8 Multiple linear regression: An overview example

In the previous sections we have presented the basic elements necessary to fit and use a multiple linear regression model. In this section we will assemble these elements by applying them to an example.

Example 12.7 Suppose a property appraiser wants to model the relationship between the sale price of a residential property in a mid-sized city and the following three independent variables:
(1) appraised land value of the property,
(2) appraised value of improvements (i.e., home value)
(3) area of living space on the property (i.e., home size)
Consider the linear model $y = B_0 + B_1x_1 + B_2x_2 + B_3x_3 + e$ where
53 x3 . Using the formulas given in Section 12. s2.3 we found ˆ y = 1470.8145 x1 + 0.824 x 2 + 13.4 Real Estate Appraisal Data Property # (Obs.) 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Sale price.5) Step 3 Compute an estimator. Table 12. x1 5960 9000 9500 10000 18000 8500 8000 23000 8100 9000 7300 8000 20000 8000 8000 10500 4000 4500 3400 1500 Improvement s value . the appraiser selected a random sample of n = 20 properties from the thousands of properties that were sold in a particular year. Land value.y = Sale price (dollars) x1 = Appraised land value ( dollars) x2 = Appraised improvements ( dollars) x3 = Area (square feet) In order to fit the model. for the variance σ2 of the random error e : s2 = SSE n − (k + 1) where ccxii .28 + 0. y 68900 48500 55500 62000 116500 45000 38000 83000 59000 47500 40500 40000 97000 45500 40900 80000 56000 37000 50000 22400 x3 1873 928 1126 1265 2214 912 899 1803 1204 1725 1080 1529 2455 1151 1173 1960 1344 988 1076 962 Step 1 Hypothesize the form of the linear model y = B0 + B1x1 + B2x2 + B3x3 + e Step 2 Use the sample data to find least squares prediction equation. This is the same result obtained by computer using STATGRAPHICS (see Figure 12. x2 44967 27860 31439 39592 72827 27317 29856 47752 39117 29349 40166 31679 58510 23454 20897 56248 20859 22610 35948 5779 Area.4. The resulting data are given in Table 12.
$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2.$$
STATGRAPHICS gives s = 7919.48 (see Stnd. error of est. in Figure 12.6).

Step 4 Check the utility of the model
a) Does the model fit the data well? For this purpose calculate the coefficient of determination
$$R^2 = 1 - \frac{SSE}{SS_{yy}}$$
You can see in the printout in Figure 12.6 that SSE = 1003491259 (in column "Sum of Squares" and row "Error") and SSyy = 9783168000 (in column "Sum of Squares" and row "Total"), and R² is R-squared = 0.897427. This large value of R² indicates that the model provides a good fit to the n = 20 sample data points.

b) Usefulness of the model
Test H0: B1 = B2 = ... = Bk = 0 (null hypothesis) against Ha: At least one Bi ≠ 0 (alternative hypothesis).
Test statistic:
$$F = \frac{R^2/k}{(1-R^2)/[n-(k+1)]} = \frac{SS(\text{Model})/k}{SSE/[n-(k+1)]} = \frac{\text{Mean Square for Model}}{\text{Mean Square for Error}}$$
In the printout F = 46.6620, and the observed significance level for this test is 0.0000 (under the column P-value). This implies that we would reject the null hypothesis at any commonly used level, for example 0.01. Thus, we have strong evidence to reject H0 and conclude that the model is useful for predicting the sale price of residential properties.

Model fitting results for: ESTATE.Y
Independent variable    coefficient    std. error     t-value   sig.level
CONSTANT                1470.275919    5746.324583    0.2559    0.8013
ESTATE.X1               0.81449        0.512219       1.5901    0.1314
ESTATE.X2               0.820445       0.211185       3.8850    0.0013
ESTATE.X3               13.52865       6.58568        2.0543    0.0567
R-SQ. (ADJ.) = 0.8782   SE= 7919.482541   MAE= 5009.367657   DurbWat= 1.242
Previously:    0.0000       0.000000           0.000000                0.000
20 observations fitted, forecast(s) computed for 0 missing val. of dep. var.

Figure 12.5 STATGRAPHICS Printout for Estate Appraisal Example
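Steps 2 to 4, together with the interval formulas from the two boxes of the previous section, can be cross-checked with a short sketch. It is not the STATGRAPHICS procedure: it simply solves the normal equations (X'X)b = X'y in plain Python for the Table 12.4 data. The helper names `solve` and `dot` are illustrative, and the critical value 2.120 is an assumed table value for t with 16 degrees of freedom at α/2 = 0.025.

```python
import math

# Table 12.4 rows: (sale price y, land value x1, improvements value x2, area x3)
DATA = [
    (68900, 5960, 44967, 1873), (48500, 9000, 27860, 928),
    (55500, 9500, 31439, 1126), (62000, 10000, 39592, 1265),
    (116500, 18000, 72827, 2214), (45000, 8500, 27317, 912),
    (38000, 8000, 29856, 899), (83000, 23000, 47752, 1803),
    (59000, 8100, 39117, 1204), (47500, 9000, 29349, 1725),
    (40500, 7300, 40166, 1080), (40000, 8000, 31679, 1529),
    (97000, 20000, 58510, 2455), (45500, 8000, 23454, 1151),
    (40900, 8000, 20897, 1173), (80000, 10500, 56248, 1960),
    (56000, 4000, 20859, 1344), (37000, 4500, 22610, 988),
    (50000, 3400, 35948, 1076), (22400, 1500, 5779, 962),
]

def solve(a, rhs):
    """Solve a x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [list(row) + [rhs[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

y = [float(row[0]) for row in DATA]
X = [[1.0] + [float(v) for v in row[1:]] for row in DATA]
n, p = len(X), len(X[0])      # p = k + 1 estimated parameters
k = p - 1

# Step 2: least squares estimates from the normal equations (X'X) b = X'y
XtX = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
b = solve(XtX, Xty)

# Steps 3 and 4: s, R^2 and the overall F statistic
sse = sum((yi - dot(b, r)) ** 2 for r, yi in zip(X, y))
y_bar = sum(y) / n
ss_yy = sum((yi - y_bar) ** 2 for yi in y)
s = math.sqrt(sse / (n - p))
r2 = 1.0 - sse / ss_yy
f_stat = (r2 / k) / ((1.0 - r2) / (n - p))

# Intervals at x* = (1, 15000, 50000, 1800)
x_star = [1.0, 15000.0, 50000.0, 1800.0]
y_hat = dot(b, x_star)
h = dot(x_star, solve(XtX, x_star))   # (x*)'(X'X)^{-1} x*
T_CRIT = 2.120                        # assumed table value: t(0.025), 16 d.f.
ci = (y_hat - T_CRIT * s * math.sqrt(h), y_hat + T_CRIT * s * math.sqrt(h))
pred = (y_hat - T_CRIT * s * math.sqrt(1 + h),
        y_hat + T_CRIT * s * math.sqrt(1 + h))

print("b =", [round(v, 5) for v in b])
print("s =", round(s, 2), " R^2 =", round(r2, 6), " F =", round(f_stat, 3))
print("95% CI for E(y):", ci)
print("95% PI for y:   ", pred)
```

Because the term 1 is added under the square root, the prediction interval is always wider than the confidence interval at the same x*.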
Analysis of Variance for the Full Regression
Source           Sum of Squares   DF   Mean Square    F-Ratio   P-value
Model            8779676741.      3    2926558914.    46.6620   0.0000
Error            1003491259.      16   62718204.
Total (Corr.)    9783168000.      19
R-squared = 0.897427                     Stnd. error of est. = 7919.48
R-squared (Adj. for d.f.) = 0.878194     Durbin-Watson statistic = 1.24161

Figure 12.6 STATGRAPHICS Printout for Estate Appraisal Example

Step 5 Use the model for estimation and prediction
(1) Construct a confidence interval for E(y) for particular values of the independent variables.
Estimate the mean sale price, E(y), for a property with x1 = 15000, x2 = 50000 and x3 = 1800, using a 95% confidence interval. Substituting these particular values of the independent variables into the least squares prediction equation yields the predicted value 79061.4. In the printout reproduced in Figure 12.7 the 95% confidence interval for the mean sale price corresponding to the given (x1, x2, x3) is (73379.3, 84743.6).

Regression results for ESTATE.Y
Observation   Observed   Fitted     Lower 95% CL   Upper 95% CL
Number        Values     Values     for means      for means
1             68900      68556.7
2             48500      44212.9
3             55500      50235.2
4             62000      59212
5             116500     105834
6             45000      43143.7
7             38000      44643.6
8             83000      83773.6
9             59000      56449.5
10            47500      56216.8
11            40500      54981
12            40000      54662.4
13            97000      98977.1
14            45500      42800.4
15            40900      41000.1
16            80000      82686.9
17            56000      40024.4
18            37000      37052
19            50000      48289.7
20            22400      20447.9
21                       79061.4    73379.3        84743.6

Figure 12.7 STATGRAPHICS Printout for estimated mean and corresponding confidence interval for x1 = 15000, x2 = 50000 and x3 = 1800

(2) Construct a prediction interval for y for particular values of the independent variables.
For example, construct a 95% prediction interval for y with x1 = 15000, x2 = 50000 and x3 = 1800. The printout reproduced in Figure 12.8 shows that the prediction interval for y with the given x is (61333.4, 96789.4). We see that the prediction interval for a particular value of y is wider than the confidence interval for the mean value.

Regression results for ESTATE.Y
Observation   Observed   Fitted     Lower 95% CL    Upper 95% CL
Number        Values     Values     for forecasts   for forecasts
1             68900      68556.7
2             48500      44212.9
3             55500      50235.2
4             62000      59212
5             116500     105834
6             45000      43143.7
7             38000      44643.6
8             83000      83773.6
9             59000      56449.5
10            47500      56216.8
11            40500      54981
12            40000      54662.4
13            97000      98977.1
14            45500      42800.4
15            40900      41000.1
16            80000      82686.9
17            56000      40024.4
18            37000      37052
19            50000      48289.7
20            22400      20447.9
21                       79061.4    61333.4         96789.4

Figure 12.8 STATGRAPHICS Printout for estimated mean and corresponding prediction interval for x1 = 15000, x2 = 50000 and x3 = 1800

12.9. Model building: interaction models
Suppose the relationship between the dependent variable y and the independent variables x1 and x2 is described by the first-order linear model E(y) = B0 + B1x1 + B2x2. When the value of one variable, say x2, is fixed, then E(y) is a linear function of the other variable (x1):
E(y) = (B0 + B2x2) + B1x1.
Therefore, the graph of E(y) against x1 is a set of parallel straight lines. For example, if E(y) = 1 + 2x1 – x2, the graphs of E(y) for x2 = 0, x2 = 2 and x2 = 3 are depicted in Figure 12.9. When this situation occurs (as it always does for a first-order model), we say that the relationship between E(y) and any one independent variable does not depend on the value of the other independent variable(s) in the model; that is, we say that the independent variables do not interact.

[Figure: E(y) plotted against x1, one straight line for each of x2 = 0, x2 = 2 and x2 = 3]
Figure 12.9 Graphs of E(y) = 1 + 2x1 – x2 versus x1 for fixed values of x2
However, if the relationship between E(y) and x1 does, in fact, depend on the value of x2 held fixed, then the first-order model is not appropriate for predicting y. In this case we need another model that will take into account this dependence. This model is illustrated in the next example.

Example 12.8 Suppose that the mean value E(y) of a response y is related to two quantitative variables x1 and x2 by the model
E(y) = 1 + 2x1 – x2 + x1x2.
Graph the relationship between E(y) and x1 for x2 = 0, 2 and –3. Interpret the graph.

Solution For fixed values of x2, E(y) is a linear function of x1. Graphs of the straight lines of E(y) for x2 = 0, 2 and –3 are depicted in Figure 12.10. Note that the slope of each line is represented by 2 + x2. The effect of adding a term involving the product x1x2 can be seen in the figure. In contrast to Figure 12.9, the lines relating E(y) to x1 are no longer parallel. The effect on E(y) of a change in x1 (i.e., the slope) now depends on the value of x2. When this situation occurs, we say that x1 and x2 interact. The cross-product term, x1x2, is called an interaction term and the model E(y) = B0 + B1x1 + B2x2 + B3x1x2 is called an interaction model with two independent variables.

[Figure: E(y) plotted against x1, one straight line for each fixed value of x2]
Figure 12.10 Graphs of E(y) = 1 + 2x1 – x2 + x1x2 versus x1 for fixed values of x2

Below we suggest a practical procedure for building an interaction model.
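As a quick numerical check of the contrast between Figure 12.9 and Figure 12.10, the sketch below evaluates the slope in x1 of both mean functions at x2 = 0, 2 and –3. The function names are illustrative only.

```python
def mean_no_interaction(x1, x2):
    """First-order model of Figure 12.9: E(y) = 1 + 2*x1 - x2."""
    return 1 + 2 * x1 - x2

def mean_interaction(x1, x2):
    """Interaction model of Example 12.8: E(y) = 1 + 2*x1 - x2 + x1*x2."""
    return 1 + 2 * x1 - x2 + x1 * x2

def slope_in_x1(model, x2, x1=0.0):
    """Change in E(y) per unit increase in x1 with x2 held fixed."""
    return model(x1 + 1, x2) - model(x1, x2)

for x2 in (0, 2, -3):
    print("x2 =", x2,
          "| slope without interaction:", slope_in_x1(mean_no_interaction, x2),
          "| slope with interaction:", slope_in_x1(mean_interaction, x2))
# Without the x1*x2 term the slope is 2 for every x2 (parallel lines);
# with it the slope is 2 + x2, i.e. 2, 4 and -1 for x2 = 0, 2 and -3.
```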
Procedure to build an interaction model for the relationship between E(y) and two independent variables x1 and x2
1. If from observations it is known that the rate of change of E(y) in x1 depends on x2 and vice versa, then the interaction model
E(y) = B0 + B1x1 + B2x2 + B3x1x2
is hypothesized.
2. Fit the model to the data.
3. Check if the model fits the data well.
4. Test whether the model is useful for predicting y, i.e., test the hypothesis
H0: B1 = B2 = B3 = 0 (null hypothesis)
against Ha: At least one Bi ≠ 0 (alternative hypothesis).
5. If the model is useful for predicting y (i.e., H0 is rejected), test whether the interaction term contributes significantly to the model:
H0: B3 = 0 (no interaction between x1 and x2)
Ha: B3 ≠ 0 (x1 and x2 interact)

12.10. Model building: quadratic models
A quadratic (second-order) model in a single quantitative independent variable is
$$E(y) = B_0 + B_1 x + B_2 x^2$$
where
B0 = y-intercept of the curve
B1 = shift parameter
B2 = rate of curvature
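The role of the curvature parameter can be seen in a small sketch. The coefficient values used here are hypothetical and chosen only for illustration: with B2 < 0 the curve opens downward, with B2 > 0 it opens upward, and the curve changes direction where the derivative B1 + 2·B2·x equals zero.

```python
def quadratic_mean(x, b0, b1, b2):
    """E(y) = b0 + b1*x + b2*x^2."""
    return b0 + b1 * x + b2 * x ** 2

def turning_point(b1, b2):
    """x-value where dE(y)/dx = b1 + 2*b2*x equals zero."""
    if b2 == 0:
        raise ValueError("b2 = 0 gives a straight line with no turning point")
    return -b1 / (2 * b2)

# Hypothetical coefficients: b2 < 0, so the curve opens downward.
b0, b1, b2 = 1.0, 4.0, -0.5
x_top = turning_point(b1, b2)
for x in (x_top - 1, x_top, x_top + 1):
    print(x, quadratic_mean(x, b0, b1, b2))
```

With these values the maximum of E(y) is at x = 4, and the mean falls off symmetrically on either side.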
Procedure to build a quadratic model for the relationship between E(y) and the independent variable x
1. State the hypothesized model
$$E(y) = B_0 + B_1 x + B_2 x^2$$
2. Fit the model to the data.
3. Check if the model fits the data well.
4. Test whether the model is useful for predicting y, i.e., test the hypothesis
H0: B1 = B2 = 0 (null hypothesis)
against Ha: At least one Bi ≠ 0 (alternative hypothesis).
5. If the model is useful for predicting y (i.e., H0 is rejected), test whether the second-order term contributes significantly to the model:
H0: B2 = 0
Ha: B2 ≠ 0.

12.11 Summary
In this chapter we have discussed some of the methodology of multiple regression analysis, a technique for modeling a dependent variable y as a function of several independent variables x1, x2, ..., xk. The steps employed in a multiple regression analysis are much the same as those employed in a simple regression analysis:
1. The form of the probabilistic model is hypothesized.
2. The appropriate model assumptions are made.
3. The model coefficients are estimated using the method of least squares.
4. The utility of the model is checked using the overall F-test and t-tests on individual B-parameters.
5. If the model is deemed useful and the assumptions are satisfied, it may be used to make estimates and to predict values of y to be observed in the future.

12.12 Exercises
1. Suppose you fit the first-order multiple regression model
y = B0 + B1x1 + B2x2 + e
to n = 20 data points and obtain the prediction equation
$$\hat{y} = 6.4 + 3.1 x_1 + 0.92 x_2$$
The estimated standard deviations of the sampling distributions of b1, b2 (least squares estimators of B1, B2) are 2.3 and 0.27, respectively.
a) Test H0: B1 = 0 against Ha: B1 > 0. Use α = 0.05.
b) Test H0: B2 = 0 against Ha: B2 > 0. Use α = 0.05.
c) Find a 95% confidence interval for B1. Interpret the interval.
d) Find a 99% confidence interval for B2. Interpret the interval.
2. Suppose you fit the first-order multiple regression model
y = B0 + B1x1 + B2x2 + B3x3 + e
to n = 20 data points and obtain R² = 0.2632. Test the null hypothesis H0: B1 = B2 = B3 = 0 against the alternative hypothesis that at least one of the B parameters is nonzero. Use α = 0.05.

3. Suppose you fit the interaction model
E(y) = B0 + B1x1 + B2x2 + B3x1x2
to n = 32 data points and obtain the following results:
SSyy = 479, SSE = 21, b3 = 10, sb3 = 4.
a) Find R² and interpret its value.
b) Is the model adequate for predicting y? Test at α = 0.05.
c) Use a graph to explain the contribution of the x1x2 term to the model.
d) Is there evidence that x1 and x2 interact? Test at α = 0.05.

4. Plastics made under different environmental conditions are known to have differing strengths. A scientist would like to know which combination of temperature and pressure yields a plastic with a high breaking strength. A small preliminary experiment was run at two pressure levels and two temperature levels. The following model is proposed:
E(y) = B0 + B1x1 + B2x2
where
y = Breaking strength (pounds)
x1 = Temperature (°F)
x2 = Pressure (pounds per square inch).
A sample of n = 16 observations yields
$$\hat{y} = 226.8 + 4.9 x_1 + 1.2 x_2$$
with sb1 = 1.11, sb2 = 0.27. Do the data indicate that the pressure is an important predictor of breaking strength? Test using α = 0.05.

5. The researchers in the International Rice Research Institute in the Philippines conducted a study on the Yield Response of Rice Variety IR6611170 to Nitrogen Fertilizer. They obtained the following data

Pair Number   Grain Yield, kg/ha, y   Nitrogen Rate, kg/ha, x
1             4878                    0
2             5506                    30
3             6083                    60
4             6291                    90
5             6361                    120

and suggested the quadratic model
$$E(y) = B_0 + B_1 x + B_2 x^2$$
The portions of STATGRAPHICS printouts are shown below.

Model fitting results for: NITROGEN.y
Independent variable    coefficient    std. error    t-value    sig.level
CONSTANT                4861.457143    47.349987     102.6707   0.0001
x                       26.64619       1.869659      14.2519    0.0049
x*x                     -0.117857      0.014941      -7.8884    0.0157
R-SQ. (ADJ.) = 0.9935   SE= 50.312168   MAE= 25.440000   DurbWat= 3.426
5 observations fitted, forecast(s) computed for 0 missing val. of dep. var.

Analysis of Variance for the Full Regression
Source           Sum of Squares   DF   Mean Square   F-Ratio   P-value
Model            1564516          2    782258        309.032   0.0032
Error            5062.63          2    2531.31
Total (Corr.)    1569579          4
R-squared = 0.996775     Stnd. error of est. = 50.3122

a) Identify the least squares model fitted to the data.
b) What are the values of SSE and s² for the data?
c) Perform a test of overall model adequacy. Use α = 0.05.
d) Test whether the second-order term contributes significantly to the model. Use α = 0.05.
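The printout for the rice data can be cross-checked with a short least squares sketch. It is not the STATGRAPHICS procedure: it solves the normal equations for the quadratic model directly, and the helper name `solve` is illustrative.

```python
x = [0.0, 30.0, 60.0, 90.0, 120.0]             # nitrogen rate, kg/ha
y = [4878.0, 5506.0, 6083.0, 6291.0, 6361.0]   # grain yield, kg/ha

def solve(a, rhs):
    """Solve a x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [list(row) + [rhs[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    sol = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * sol[j] for j in range(i + 1, n))
        sol[i] = (m[i][n] - tail) / m[i][i]
    return sol

# Design matrix with columns 1, x, x^2
X = [[1.0, xi, xi * xi] for xi in x]
n, p = len(X), 3
k = p - 1

XtX = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
b = solve(XtX, Xty)                     # b[0], b[1], b[2]

fitted = [sum(bj * xj for bj, xj in zip(b, r)) for r in X]
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
y_bar = sum(y) / n
ss_yy = sum((yi - y_bar) ** 2 for yi in y)
s2 = sse / (n - p)                      # estimator of the error variance
f_stat = ((ss_yy - sse) / k) / s2       # overall F statistic

print("b =", [round(v, 6) for v in b])
print("SSE =", round(sse, 2), " s^2 =", round(s2, 2), " F =", round(f_stat, 3))
```

The negative coefficient of x² reflects the flattening of the yield response at higher nitrogen rates.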
Chapter 13
Nonparametric statistics

CONTENTS
13.1 Introduction
13.2 The sign test for a single population
13.3 Comparing two populations based on independent random samples: Wilcoxon rank sum test
13.4 Comparing two populations based on matched pairs: the Wilcoxon signed ranks test
13.5 Comparing population using a completely randomized design: The Kruskal-Wallis H test
13.6 Rank Correlation: Spearman's rs statistic
13.7 Summary
13.8 Exercises

13.1. Introduction
The majority of hypothesis tests (t- and F-tests) discussed so far have made inferences about population parameters, such as the mean and the proportion. These parametric tests have used the parametric statistics of samples that came from the population being tested. To formulate these tests, we made restrictive assumptions about the populations from which we drew our samples. In each case of Chapter 9, for example, we assumed that our samples either were large or came from normally distributed populations.

But populations are not always normal. And even if a goodness-of-fit test indicates that a population is approximately normal, we cannot always be certain we are right, because the test is not 100 percent reliable. Clearly, there are certain situations in which the use of the normal curve is not appropriate.

Another case in which the t- and F-tests are inappropriate is when the data are not measurements but can be ranked in order of magnitude. For example, suppose we want to compare the ease of operation of two types of computer software based on subjective evaluations by trained observers. Although we cannot give an exact value to the variable Ease of operation of the software package, we may be able to decide that package A is better than package B. If packages A and B are evaluated by each of ten observers, we have the standard problem of comparing the probability distributions for two populations of ratings, one for package A and one for package B. But the t-test of Chapter 9 would be inappropriate, because the only data that can be recorded are preferences; that is, each observer decides either that A is better than B or vice versa.

For these two types of situations statisticians have developed useful techniques called nonparametric methods or nonparametric statistics. The nonparametric counterparts of the t- and F-tests compare the relative locations of the probability distributions of the sampled populations, rather than specific parameters of these populations (such as the means or variances). Many nonparametric methods use the relative ranks of the sample observations rather than their actual numerical values. A large number of nonparametric tests exist, but this chapter will examine only a few of the better known and more widely used ones.
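To illustrate what it means to use relative ranks rather than actual numerical values, here is a minimal sketch with hypothetical data. It assumes all observations are distinct; handling of tied observations is not covered by this sketch.

```python
def ranks(values):
    """Rank of each observation: 1 for the smallest, n for the largest.

    Assumes all values are distinct (no ties).
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

sample = [12.4, 3.1, 250.0, 7.8]        # hypothetical measurements
print(ranks(sample))                    # [3, 1, 4, 2]

# Making the largest value ten times larger leaves the ranks unchanged,
# which is why rank-based methods are insensitive to extreme observations.
print(ranks([12.4, 3.1, 2500.0, 7.8]))  # [3, 1, 4, 2]
```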
i. the properties of a binomial distribution. That is.. the probability that a xvalue selected from the population is larger than M is 0. the ttestis not valid and we must resort to a nonparametric procedure.M0 > 0) = P(xi > M0) = 0.5. If we call a positive difference a “Success” and a negative difference a “Failure”. or location.M0. x2. For situations in which we collect a small sample (n < 30) from a nonnormal distribution. Therefore.5. If S is “too large” the we will reject H0 in favor of Ha: M > M0. the null hypothesis is true.3. If.e. Therefore. The sign test is specifically designed for testing hypotheses about the median of any continuous population. The rejection region for the sign test is derived as follows.5 Since the trials are independent. Notice that S depends only on the sign (positive or negative) of the difference xi . then S is the number of successes in n trials. xn be a random sample form a population with unknown median M. of the distribution. where S1 = Number of sample observations greater than M0. then we should expect to observe approximately half the sample xvalue greater than M= M0. SIGN TEST FOR A POPULATION MEDIAN ONETAILED TEST TWOTAILED TEST H0 : M = M0 H a : M > M 0 (or M < M 0 ) Test statistic: S = Number of sample observations greater than M0 ( or S = Number of sample observations less than M0 ) H0 : M = M0 Ha : M ≠ M0 Test statistic: S = max ( S1.. The sign test for a single population Recall from Chapter 9 that smallsample procedures for testing a hypothesis about a population mean. Let each sample difference xi . require that the population have an approximately normal distribution. S2). where S = { number of values xi that exceed M0}. Let x1. The sign test utilizes the test statistic S. We can use this fact to calculate the observed significance level (pvalue ) of the sign test. 
therefore.2 we know that the median is a number such that half the area under the probability distribution lies to the left of M and half lies to the right. listed in Section 5..M0. Like the mean.M0 denote the outcome of a single trial in an experiment consisting of n identical trial