
Chapter 1: Introduction, Graphs and Descriptive Statistics

Example 1.1

A dataset can consist of:

1) a population of all individuals or objects of interest or the measurements obtained from all
individuals or objects of interest

Or

2) a sample that is a portion or part of a population of interest.

When all samples of the same size have an equal probability of being drawn from a population of
interest, we consider one such sample to be a simple random sample.

Methods of descriptive statistics use graphs and calculated summary numbers to organize,
summarize, and present datasets of interest in an informative way. We can describe both
population and sample datasets.

Methods of inferential statistics use information gathered from a sample dataset to investigate
the potential truth of hypotheses made about the larger population from which the sample was
drawn.

Example 1.2:

Population Dataset Examples with possible measurements of interest:

-All Canadians (possible measurements are birth country, age, or income)

-All NHL hockey players (possible measurements are height, goals scored, or assists scored)

-All Monarch butterflies at Cambridge Butterfly Conservatory (possible measurements are wing
width or antenna length)

Sample Dataset Examples with possible measurements of interest:

-10 randomly chosen students from the population of 60 students taking Statistics 161 this
term (possible measurements are favourite music genre, resting pulse, or if they like math).

-1000 randomly chosen voters from the population of all eligible Alberta voters (possible
measurements are preferred political party, age, gender identity, or amount of education).

Example 1.3A: Why do we sample rather than look at everyone in a population?

-Time: too time consuming to ask everyone
-Cost: too costly to ask everyone
-Destructive testing: a destructive test could wipe out a rare population (tiles on the space shuttle)
-Impossible: we cannot always check every unit (all frogs in a lake)

Example 1.3B: How can we take a simple random sample?

-Lottery method: put all units (balls, papers with names) in a container (hat, bingo ball cage, etc.) and
draw out the required number of sample units
-Use a random number generator in a software package (Excel, Minitab, SAS, SPSS)
-Use an online random number generator such as the one at stattrek.com
(http://stattrek.com/Tables/Random.aspx, http://www.randomizer.org/form.htm )
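
As a small illustration of the software approach, here is a minimal Python sketch; the list of 60
numbered students standing in for a sampling frame is a hypothetical stand-in.

    import random

    # Hypothetical sampling frame: the 60 students in Statistics 161, labeled 1-60.
    population = list(range(1, 61))

    # random.sample draws without replacement, and every subset of size 10 is
    # equally likely -- exactly the definition of a simple random sample.
    srs = random.sample(population, k=10)
    print(sorted(srs))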

Example 1.4: Observational Study versus Designed Experiment

Observational Study: researchers simply observe and measure specific characteristics, as in a
sample survey.

Designed Experiment: researchers apply some treatment and controls, randomly assign subjects
to treatment groups, and then proceed to observe effects on the subjects and take measurements.

DATA TYPES

Example 1.5:

Qualitative variables take on nominal (categorical) data values.

Qualitative data can be ordinal (ranked) or non-ordered (no ranking)

Examples of categorical non-ordered variables include:

-pet preference (cat or dog)

-preferred political candidate (O’Toole, Trudeau, Singh)

Examples of categorical ordered variables include:

-level of satisfaction with service in a restaurant (unsatisfied, neutral, satisfied)

-military rankings (private, sergeant, corporal, … , general)

Quantitative variables take on numerical data values.

Discrete quantitative variables can take only certain numerical values in a range on a scale.
That is, there are gaps between the values that the variable can take on.

Examples of discrete data include:

-# children in family (0 to 69*)

-air quality index (1 to 10 in steps of 1)

*Guinness World Records lists the 69 babies born through the mid-1700s to a woman recorded
only as the “wife” of Feodor Vassilyev.

Continuous quantitative variables can take any numerical value in a range along a numerical
scale.

Examples of Continuous Data include:

-yearly rainfall (0 to 11873 mm*)

-pH (0 to 14)

-age (0 to 122.45 years**)

* Mawsynram, Meghalaya, India, has 11,873 mm (467 in) of rain per annum.
** Jeanne Calment of France (1875–1997) lived to the age of 122 years, 164 days. She met
Vincent van Gogh when she was 12 or 13.

Example 1.6:

The characteristics that observations from datasets of different types (with different levels of
measurement) possess can be investigated. These characteristics include identity (observation
uniqueness), magnitude (order to observations), equal intervals between values of observations,
and whether an absolute (meaningful) zero point exists for such observations.

Data types can be split into nominal and numerical, as noted above. Furthermore, nominal data
can be split into ordinal (the data has a ranking) and non-ordinal (the data has no ranking).
Numerical data can be split into interval (equal distances between values are meaningful) and
ratio (equal distances between values are meaningful and a meaningful zero point exists). The
distinction between interval and ratio data is mentioned merely because it can matter in some
scientific studies. We will not remark on it further.

The following table summarizes this for you. Characteristics are listed on the left side of the
table and the different types (levels) of data are listed along the top of the table.

Levels of Measurement

Characteristic                           Nominal        Nominal     Numerical   Numerical
                                         (categorical)  (categorical)  Interval    Ratio
                                         non-ordinal    ordinal
Identity: each observation has a
unique meaning or classification              √             √            √           √
Magnitude: ranked data where an
ordered relationship exists                                 √            √           √
Equal Interval: meaningful equal
distance between values                                                  √           √
Absolute Zero: a meaningful zero point
(absence of something) exists                                                        √

Example 1.7:

Here are some examples of variables for each data type (with their different levels of
measurement). Clarifying notes are given in parentheses.

Levels of Measurement

Nominal (categorical) non-ordinal: gender identity, insect species, hair colour, team names,
job titles, birthplace

Nominal (categorical) ordinal: letter grades, NHL team standing, military ranking, level of
compliance with a law banning cell phone use in vehicles (ordinal values are sometimes
reported as numbers, i.e. a rating scale of 1 to 5, but are still categorical)

Numerical Interval: shoe size, temperature (F), temperature (C), dress size, IQ (here zero does
not mean an absence of something – a temperature of zero is a temperature)

Numerical Ratio: yearly income, pulse rate, hours worked during the week, width of rubber tires
GRAPHING

ONE CATEGORICAL VARIABLE

PIE CHART

A pie chart is a simple visual way to represent categorical (non-ordinal or ordinal) data when we
have a few non-overlapping classifications (levels or choices) for the categorical variable and
either counts or percents available for each of the choices.

Example 1.8: Environment and Climate Change Canada has much data available for the public
online. A 2013 Statistics on the International Movements of Hazardous Waste and
Hazardous Recyclable Material report can be found at http://www.ec.gc.ca/gdd-
mw/default.asp?lang=En&n=BE2CD950-1&printfullpage=true . It tells us that of the
435,300 metric tons of hazardous waste and hazardous recyclable material imported in 2013,
44% was destined for disposal while 56% was destined for recycling.

Summary count information can be placed in a table, and then graphed, as follows.

Destination   Count     Percent
Disposal      191,532   44%
Recycling     243,768   56%
Total         435,300   100%

Draw a pie chart as shown, with the sizes of the slices proportional to the counts for the choices
(levels), which can be determined with a little algebra (44% of 435,300 = 191,532 and 56% of
435,300 = 243,768). Be sure to include an appropriate title, a legend, and the counts on the
slices.
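
A minimal sketch of how such a pie chart could be produced in Python with matplotlib, using
the counts from the table above (the title and label wording are our own):

    import matplotlib.pyplot as plt

    # Counts from the 2013 hazardous waste table above.
    labels = ["Disposal", "Recycling"]
    counts = [191_532, 243_768]

    fig, ax = plt.subplots()
    # Slice sizes are proportional to counts; autopct prints the percent on each slice.
    ax.pie(counts, labels=labels, autopct="%1.0f%%")
    ax.set_title("Imported Hazardous Waste and Recyclable Material, 2013")
    plt.show()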

BAR CHART

Another visual way to present data when we have a categorical variable with a few non-
overlapping classifications (choices or levels) and either counts or percents available for each of
the choices is with a bar chart.

Example 1.9:

Protected Planet (https://protectedplanet.net) is an online source of information on worldwide
protected areas that is “updated monthly with submissions from governments, non-governmental
organizations, landowners and communities. It is managed by the United Nations Environment
World Conservation Monitoring Centre (UNEP-WCMC) with support from the International
Union for Conservation of Nature (IUCN) and its World Commission on Protected Areas
(WCPA).”

Statistics for Canada can be found on the page https://protectedplanet.net/country/CAN. Canada
has 7642 protected areas! 63 of these have been given international designations. Summary
counts for the Canadian protected areas holding each of the three international designations are
detailed here.

Designation Count
World Heritage Site 10
UNESCO-MAB Biosphere Reserve 16
Ramsar Site, Wetland of International Importance 37
TOTAL 63

Draw a chart as shown, with the heights of the columns showing the counts corresponding to the
classifications (choices or levels) of Designation as labeled on the bottom axis. It is prudent to
include an appropriate title, axis labels, a scale, and counts (and percents) on the bars.
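
A matching sketch for the bar chart in Python with matplotlib, using the designation counts
above (the labels are abbreviated by us):

    import matplotlib.pyplot as plt

    # Counts from the Protected Planet designation table above.
    designations = ["World Heritage", "UNESCO-MAB", "Ramsar"]
    counts = [10, 16, 37]

    fig, ax = plt.subplots()
    ax.bar(designations, counts)            # bar heights show the counts
    ax.set_title("International Designations of Canadian Protected Areas (n = 63)")
    ax.set_xlabel("Designation")
    ax.set_ylabel("Count")
    plt.show()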

TWO CATEGORICAL VARIABLES

CLUSTER BAR CHARTS

Consider two categorical variables such that one categorical variable has I levels (i = 1, 2, …, I)
and the second categorical variable has J levels (j = 1, 2, …, J).

We can make an “overall” bar graph where the heights of the bars are the frequencies
(counts) and relative frequencies (percent) for each of the possible (i,j) cases.

We can also make “cluster” bar graphs that examine what is going on for all levels of one
variable within each of the levels of the other variable.
The levels of the “within” (outermost) variable (labels are on the bottommost scale) are called
clusters and the heights of the bars within each cluster add to 100%, enabling us to examine
what is happening in each cluster separately.

Cluster bar charts are generally more useful than overall bar charts.

Example 1.10: 70 students are surveyed and asked to indicate their planned degree (BA or BS)
along with their preference from the two choices of having free software in return for testing it
and reporting problems to the distributor (betaTester) or paying for their software and having it
be problem free (buGfree).

Here we have two categorical variables, each with two levels:

Degree (levels: BA and BS)
Preference (levels: betaTester and buGfree)

In raw form, this data would consist of 70 rows (records for each respondent) and 2 columns
(one for each variable). Each row (record) would contain a column entry for Degree and, beside
it, a column entry for Preference. A partial table of raw data is presented here.

Respondent   Degree   Preference
1            BA       T
2            BS       G
…            …        …
69           BS       T
70           BA       G

We can also present such data in a crosstab table with counts (frequencies) for the (Degree,
Preference) cases (cells) already completed for us.

          betaTester (T)   buGfree (G)   Total
BA (A)    26               18            44
BS (S)    21               5             26
Total     47               23            70

Finally, a summary distribution table illustrates another way to summarize the frequencies
(count) and percents of observations in each of the four mutually exclusive cells (no individual
appears in more than one cell) above.

Degree   Preference   Count (Frequency)   Overall Percent
BA       betaTester   26                  26/70 = 37.14%
BA       buGfree      18                  18/70 = 25.71%
BS       betaTester   21                  21/70 = 30.00%
BS       buGfree      5                   5/70 = 7.14%
Totals                70                  100.00%

Cluster Bar Chart: Overall Percents

A chart to illustrate the total overall counts and percents for each of the (Preference, Degree)
cases with Preference on the bottommost (outermost) X axis and Degree in the clusters for each
of BetaTester and BuGfree (and on the legend) is below. The heights of the bars correspond to
each of the (Preference, Degree) cases. Overall, the percents of all the bars add to 100%.
Students could also make a graph that places Degree on the x axis and Preference in the legend.

Cluster Bar Charts: Percents of one Category within Levels of another Category

A chart in which percents for BA and BS add to 100% within each group of preferences
(betaTester and BuGfree) is presented here. Preferences are on the outermost (bottommost)
X axis, and the “clusters” of degree bars (for BA and BS) add to 100% above each preference.

To do the problem by hand, students must first make new distribution tables that give the
distribution of counts and percents for the mutually exclusive cells of BA and BS within each of
betaTester and BuGfree separately, as shown below.

Percents in these new tables will match the percent heights in the cluster bar chart to be
produced. Note, again, how degree percents add to 100% in each of the BetaTester cluster and
the BuGfree cluster.

BetaTester
Degree   Count   Percent
BA       26      26/47 = 55.32%
BS       21      21/47 = 44.68%
Total    47      100%

BuGfree
Degree   Count   Percent
BA       18      18/23 = 78.26%
BS       5       5/23 = 21.74%
Total    23      100%
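
A minimal pandas/matplotlib sketch of this within-cluster calculation, rebuilding the crosstab
counts from Example 1.10 (the variable names are our own):

    import pandas as pd
    import matplotlib.pyplot as plt

    counts = pd.DataFrame({"betaTester": [26, 21], "buGfree": [18, 5]},
                          index=["BA", "BS"])

    # Divide each column by its total so the degree percents add to 100%
    # within each preference cluster (55.32/44.68 and 78.26/21.74).
    within = counts.div(counts.sum(axis=0), axis=1) * 100

    # Transposing puts Preference on the x axis, one BA and one BS bar per cluster.
    within.T.plot(kind="bar")
    plt.ylabel("Percent of degree within preference")
    plt.show()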

A cluster bar chart such as this allows the situations within the different clusters to be compared
when the counts for each cluster differ. The left cluster of bars contains degree percent bars
within the betaTesters preference, while the right cluster contains degree percent bars within the
buGfree preference. The BA and BS bars in the left cluster of betaTesters total 100%, as do the
BA and BS bars in the right cluster of buGfree types.

ONE NUMERICAL VARIABLE

Example 1.11: Definitions

(Absolute) Frequency Distribution: a grouping of numerical data into mutually exclusive
classes that shows the number (count) of observations in each.
Relative Frequency Distribution: a grouping of numerical data into mutually exclusive
classes that shows the relative frequency (proportion) of observations in each.
Percent Distribution: a grouping of numerical data into mutually exclusive classes that
shows the percent of observations in each (mostly used in software).

A set of n = 32 data points taken from a 1st year Statistics class measured X = the number of
breaths taken in a minute by the students. They have been kindly ordered for you!

4 7 8 8 9 9 9 10 11 12 12 13 13 13 14 14
14 15 15 15 15 15 16 17 17 17 18 18 19 20 21 25

Distributions that summarize the frequency, relative frequency, and percent of observations that
occur in some reasonably chosen mutually exclusive classes encompassing all the data are of
interest.

How many classes? – We choose enough classes to get an idea of the “shape” of our distribution
of data, generally between 5 and 15. We will choose 6 for this example.

Class interval – The class width is generally the same for all classes. We may have a wider
interval at either end of the range of data if there are outlying unusual values.

Class Limits – Classes are kept mutually exclusive (that is, they do not overlap). The
convention is to have an interval that is closed at the left boundary and open at the right, as
below.

Distributions

BREATHS          Absolute Frequency, f   Relative Frequency, f/n      Percent
Class Interval   (Counts)                (fraction and proportion)    (100 × f/n)
0–under 5        1                       1/32 = 0.03                  3%
5–under 10       6                       6/32 = 0.19                  19%
10–under 15      10                      10/32 = 0.31                 31%
15–under 20      12                      12/32 = 0.38                 38%
20–under 25      2                       2/32 = 0.06                  6%
25–under 30      1                       1/32 = 0.03                  3%
Total            32 (always totals n)    32/32 = 1.00 (always 1)      100% (always 100%)

For example, f = 1 in the first class because one individual took between 0 and (under) 5
breaths, and f = 6 in the second class because six individuals took between 5 and (under) 10
breaths.

Lower Class Limits = 0, 5, 10, 15, 20, 25

Class Interval (Width) = Lower Limit of a Class – Lower Limit of the Preceding Class = 5

Class Absolute Frequency – Tally (Count) for a class

Class Relative Frequency – Fraction (proportion) of total number of observations in a class
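
A minimal Python sketch that reproduces the distribution table above with numpy; the class
edges 0, 5, …, 30 give the left-closed, right-open classes used in the table:

    import numpy as np

    breaths = np.array([4, 7, 8, 8, 9, 9, 9, 10, 11, 12, 12, 13, 13, 13, 14, 14,
                        14, 15, 15, 15, 15, 15, 16, 17, 17, 17, 18, 18, 19, 20, 21, 25])
    n = len(breaths)

    edges = np.arange(0, 35, 5)                  # lower class limits 0, 5, ..., 30
    freq, _ = np.histogram(breaths, bins=edges)  # counts per class: 1, 6, 10, 12, 2, 1

    for lo, hi, f in zip(edges[:-1], edges[1:], freq):
        print(f"{lo:2d}-under {hi:2d}: f = {f:2d}, f/n = {f/n:.2f}, {100*f/n:.0f}%")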

HISTOGRAM

A histogram is a graph that is a pictorial representation of a frequency or relative frequency
distribution. Class endpoints are placed on a horizontal scale and bars are drawn above each
class. The heights of the bars are f (frequencies or counts) or f/n (relative frequencies, expressed
as fractions or proportions), when graphs are made by hand (as below). Note that frequencies
must total n, and relative frequencies must sum to 1.

By the way, the average adult takes between 12 and 18 breaths a minute, and my student
distribution looks to be not far different from this.
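
A matching matplotlib sketch of the histogram itself, with the same data and class edges:

    import numpy as np
    import matplotlib.pyplot as plt

    breaths = [4, 7, 8, 8, 9, 9, 9, 10, 11, 12, 12, 13, 13, 13, 14, 14,
               14, 15, 15, 15, 15, 15, 16, 17, 17, 17, 18, 18, 19, 20, 21, 25]

    # Bars sit on the classes [0,5), [5,10), ..., [25,30); heights are the counts f.
    plt.hist(breaths, bins=np.arange(0, 35, 5), edgecolor="black")
    plt.xlabel("Breaths per minute")
    plt.ylabel("Frequency, f")
    plt.title("Breaths taken in a minute by n = 32 students")
    plt.show()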

BOXPLOT

A boxplot is made from a column of raw quantitative (numerical) data, and is a way of
presenting a five number summary of the set of data that highlights the minimum, maximum,
median (the value at the middle of the ranked data set), 1st quartile (Q1 = the value at the middle
of the bottom half of the data) and the 3rd quartile (Q3 = the value at the middle of the top half of
the data). A numerical scale that runs from just below the minimum value of the dataset to just
above the maximum value of the dataset appears at the left of the figure. Then a short horizontal
line is drawn at the height of the median (the value at the middle of the ranked data set), and a
box is created around it that extends up to the height of the 3rd quartile value and down to the
height of the 1st quartile value.

The endpoints of the whiskers depend on whether there are “outliers” (unusual values for our
data). To determine this, we need to do a bit of simple math, as outlined below.

The range of the data values is MAX – MIN.
The interquartile range of the data values is IQR = Q3 – Q1.
Outliers are values above Q3 + 1.5×IQR or below Q1 – 1.5×IQR.

By hand, if there are no outliers, a whisker is drawn that extends up from the top of the box at the
3rd quartile height to the maximum value of the dataset, and another whisker is drawn down from
the bottom of the box at the 1st quartile height to the minimum value of the dataset. If there are
outliers, the top whisker will terminate at Q3 + 1.5×IQR and the bottom whisker will terminate at
Q1 – 1.5×IQR. Outliers will then be indicated as points beyond the endpoints of the whiskers.

Example 1.12: We present the breath data again, prior to drawing a boxplot for the data.

Position 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
Value 4 7 8 8 9 9 9 10 11 12 12 13 13 13 14 14

Position 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
Value 14 15 15 15 15 15 16 17 17 17 18 18 19 20 21 25

The minimum value in the data set is 4. This student told your instructor that she was a
meditator. The maximum value in the data set is 25. This student had run into the classroom
just after the start of the class. The middle value can be “extrapolated” to be halfway between
the 16th ranked data value (14) and the 17th ranked data value (also 14), so we record the median
as 14. Similarly, the middle value of the first quarter of the data can be extrapolated to be
halfway between the 8th ranked data value (10) and the 9th ranked data value (11), so we record
the first quartile, Q1, as 10.5. Finally, the middle value of the third quarter of the data can be
extrapolated to be halfway between the 24th ranked data value (17) and the 25th ranked data value
(also 17), so we record the third quartile, Q3, as 17.

Here the interquartile range of the data values is Q3 – Q1 = 17 – 10.5 = 6.5, Q3 + 1.5×IQR = 17
+ 1.5(6.5) = 17 + 9.75 = 26.75, and Q1 – 1.5×IQR = 10.5 – 1.5(6.5) = 10.5 – 9.75 = 0.75. There
are no values below 0.75 and no values above 26.75, so there are no outliers.
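
A minimal Python sketch of the five-number summary and the outlier fences for the breath data,
implementing the median-of-halves quartile convention used in this example (software packages
may use slightly different quartile rules):

    import statistics

    breaths = sorted([4, 7, 8, 8, 9, 9, 9, 10, 11, 12, 12, 13, 13, 13, 14, 14,
                      14, 15, 15, 15, 15, 15, 16, 17, 17, 17, 18, 18, 19, 20, 21, 25])
    n = len(breaths)

    median = statistics.median(breaths)            # 14
    q1 = statistics.median(breaths[: n // 2])      # middle of the bottom half: 10.5
    q3 = statistics.median(breaths[n - n // 2:])   # middle of the top half: 17
    iqr = q3 - q1                                  # 6.5

    lower_fence = q1 - 1.5 * iqr                   # 0.75
    upper_fence = q3 + 1.5 * iqr                   # 26.75
    outliers = [x for x in breaths if x < lower_fence or x > upper_fence]

    print(min(breaths), q1, median, q3, max(breaths))   # 4 10.5 14.0 17.0 25
    print(outliers)                                     # [] -- no outliers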

TWO (OR MORE) NUMERICAL VARIABLES

SIDE BY SIDE HISTOGRAM

Sometimes one wants to compare two or more histograms side by side. If this is done, scales
should match on the two histograms. This generally happens when our data consists of one
categorical variable (with two or more levels) and each entry in the categorical variable has a
matching entry in a corresponding numerical variable.

Example 1.13:

In September 2015, your instructor surveyed all her Statistics 151 classes (a population of 446
students) and asked them several election-related questions, including which party they thought
had the best platform, and how many minutes a day they spent consuming political news. The
data below shows news minutes for the parties chosen by a random sample of 29 of those students.

Conservative   Liberal   NDP
0              5         5
0              5         10
5              5         15
5              10        15
15             15        30
15             20        40
15             30        60
20             35
30             45
               50
               60
               60
               135

We choose common scales for both the x and y axes on the histograms. There is a count of 3
values of 15 minutes in the Conservative sample, so we set the vertical axis from 0 to 4 in steps
of 1. We choose intervals [0,10), [10,20), …, [130,140) for our intervals.

Here are the counts for our classes (intervals).

COUNTS OF OBSERVATIONS FOR THE PREFERRED PARTIES

Interval           Conservative   Liberal   NDP
0 to under 10      4              3         1
10 to under 20     3              2         3
20 to under 30     1              1         0
30 to under 40     1              2         1
40 to under 50     0              1         1
50 to under 60     0              1         0
60 to under 70     0              2         1
70 to under 80     0              0         0
80 to under 90     0              0         0
90 to under 100    0              0         0
100 to under 110   0              0         0
110 to under 120   0              0         0
120 to under 130   0              0         0
130 to under 140   0              1         0
TOTAL              9              13        7

SIDE BY SIDE BOXPLOTS

Sometimes one wants to compare two or more boxplots side-by-side. If this is done, scales
should match on the boxplots.

This happens when our data consists of one categorical variable (with two or more levels) and
each entry in the categorical variable has a matching entry in a corresponding numerical variable.

To create side-by-side boxplots for the election data by hand, calculate the minimum, median,
first quartile, third quartile, and maximum for each of our Conservative, Liberal and NDP
groups, along with the values of Q1-1.5xIQR and Q3+1.5xIQR, and then draw our boxplots.

Students can use the original ranked data above to intuitively determine the 5 numbers of interest
for the boxplots below. (You may wish to look ahead to the descriptive statistics section for
formulas for rank (position), or return to this example after we learn them.) Note that a matching
scale of 0 to 120 in steps of 10 is used on the vertical scale for NEWSMIN on the boxplot.

Example 1.14:
             Conservative           Liberal                   NDP
             Rank    Value          Rank    Value             Rank   Value
Min          1       0              1       5                 1      5
Q1           2.5     2.5            3.5     7.5               2      10
Median       5       15             7       30                4      15
Q3           7.5     17.5           10.5    55                6      40
Max          9       30             13      135               7      60
IQR                  17.5-2.5=15            55-7.5=47.5              40-10=30
Q1–1.5IQR            2.5-1.5(15)=-20        7.5-1.5(47.5)=-63.75     10-1.5(30)=-35
Q3+1.5IQR            17.5+1.5(15)=40        55+1.5(47.5)=126.25      40+1.5(30)=85
Outliers             No                     Yes, 135 is an outlier   No

The by-hand approach (above) identifies 135 as an outlier for the Liberal group: the top whisker
terminates at Q3 + 1.5×IQR = 126.25 and 135 is plotted as a point beyond the whisker.

On a handwritten exam, students should 1) use the by hand approach to identify outliers if
creating the boxplots by hand, or 2) identify the outliers as determined by a provided boxplot. On
a lab exam, students should use the information provided by the software they are using.

SCATTERPLOTS
Sometimes, for each respondent in an experiment, the values of two numerical measurements
are taken. We can be interested in the relationship between the two numerical variables.

Example 1.15: The ages and the number of push-ups that can be done in a minute is recorded for
n=12 randomly chosen individuals in your extended family, as shown below.

Individual         Age   Push-ups
Younger Brother 19 34
Elder Sister 28 26
Younger Sister 18 30
Mom 58 12
Middle Sister 23 28
Dad 64 12
Great Grandma 78 6
Uncle 43 21
Aunt 48 22
Elder Brother 36 27
Middle Brother 21 14
Grandpa 67 29

We created a scatterplot of the twelve values of (age, push-up) for the individuals. We think of
age as something that would naturally predict (explain) the number of push-ups that a person
could do, and it is convention, although not necessary, to put the variable that we think of as a
predictor variable on the x axis of a scatterplot, and to put the other variable (the response
variable) on the y axis. There appears to be a linear relationship between age and pushups where
as age increases, the number of pushups decreases. In general, older people in the dataset can do
less pushups. We note the fit 67 year old (what an outlier!) who can do 29 pushups in a minute,
though.
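
A minimal matplotlib sketch of this scatterplot, with the predictor on the x axis as described:

    import matplotlib.pyplot as plt

    ages    = [19, 28, 18, 58, 23, 64, 78, 43, 48, 36, 21, 67]
    pushups = [34, 26, 30, 12, 28, 12, 6, 21, 22, 27, 14, 29]

    plt.scatter(ages, pushups)
    plt.xlabel("Age (predictor)")        # convention: explanatory variable on x
    plt.ylabel("Push-ups in a minute")   # response variable on y
    plt.title("Age versus push-ups for n = 12 family members")
    plt.show()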

PAIRED VARIABLES

Sometimes two numerical variables of interest have a natural pairing for the respondents in an
experiment. For example, the pulse rate of students entering a classroom and the pulse rate of the
same students halfway through the class have a natural pairing. We would hypothesize that the
pulse rate of students after classroom activity would tend to be higher than their pulse rate on
entering a Statistics class (unless a student is super Zen about Statistics and finds it relaxing).

For two numerical variables that are naturally paired, we can examine scatterplots, side-by-side
boxplots and side-by-side histograms, as above. However, we are often only interested in the
differences between the variables! Has activity increased the pulse rate in my Statistics students?

Example 1.16: Below is data for a random sample of 9 students taken from a group of my 54
Statistics 161 students in Fall 2017. It records their pulse rate before and after an activity in the
middle of the class. The difference column can be readily calculated for this small set of data.

BEFORE   AFTER   DIFFERENCE   (differences, sorted)
79       94      -15          -15
92       85      7            -6
74       72      2            0
76       70      6            0
54       60      -6           2
82       82      0            2
104      104     0            6
110      96      14           7
76       74      2            14

A boxplot and histogram of the DIFFERENCE column can readily be produced with the
techniques already learned. With only 9 values, these summary graphs may be of limited use, but
are still of interest. Sometimes we are limited to a small sample of values in real life situations,
as mentioned previously. It does not appear from our small sample that pulse rate is higher, on
average, after the activity halfway through the statistics class.

Chapter 2: Descriptive Statistics - Measuring Location And
Dispersion
Statistical Measures of location that describe a set of data include:
-mean (average value)
-mode (most common value)
-first quartile Q1 (the value such that 1/4 of the data lies below it)
-median (the value such that 1/2 the data values lie below and above it)
-third quartile Q3 (the value such that 3/4 of the data lies below it).

Statistical Measures of dispersion that describe a set of data include:
-range (maximum - minimum values)
-interquartile range (Q3 - Q1)
-variance (“modified average” of squared distances (deviations) of data values from their mean)
-"standard" deviation (a “standardized” measure that takes the square root of the variance so
units match the units of the other statistical measures considered)

In real life, we often take a sample from a larger finite population (for cost and efficiency
reasons), and then we describe the findings in the sample data set, and use that information to
make inferences about the larger population. Ideally, we want a “large sample” of more than 30
units. For the remainder of the chapter, we will assume we are working with sample data. We
introduce a little notation.

Sample statistic (Roman symbols)               Population parameter (Greek symbols)
Mode (most common value)        estimates      Mode (most common value)
Median (middle value)           estimates      Median (middle value)
Q1 (first quartile)             estimates      Q1 (first quartile)
Q3 (third quartile)             estimates      Q3 (third quartile)
IQR (interquartile range)       estimates      IQR (interquartile range)
Range (maximum – minimum)       estimates      Range (maximum – minimum)
x̄ (mean)                        estimates      µ (mean)
s² (variance)                   estimates      σ² (variance)
s (standard deviation)          estimates      σ (standard deviation)
shape (sometimes)               gives gist of  shape

Example 2.1:

n = 7 randomly chosen students were asked to indicate how many hours of sleep (to the nearest
hour) they had obtained the previous night.

A dotplot of the data gives us the gist of the data shape. Students do not need to know how to
produce this. Pretty thing, isn’t it?

Measures of location (including measures of centrality) and measures of dispersion (with the
exception of variance and standard deviation) are intuitively understandable and readily
calculated.

Example 2.2

Data and the formulas necessary for calculating several intuitive summary statistical measures
are given below, and by hand calculations will be made for the data. It is of interest to fill in the
calculation columns here, before proceeding. We will use the sums from this table in our
calculations, explaining why we need them as we go along.
Data              Calculation Columns used in formulas below
Rank   xi         xi²          (xi − x̄) = (xi − 7)   (xi − x̄)² = (xi − 7)²
1      4          16           4−7 = −3              (−3)² = 9
2      6          36           6−7 = −1              (−1)² = 1
3      6          36           −1                    1
4      7          49           0                     0
5      8          64           1                     1
6      8          64           1                     1
7      10         100          10−7 = 3              (3)² = 9
Sum Σ  4+6+…+10   Σxi² = 365   0                     9+1+…+9 =
(n=7)  = Σxi = 49                                    Σ(xi − x̄)² = 22

The mean (average) value of the xi here is 𝑥̅ = ∑xi /n = 49/7 = 7.

We draw a diagram to show how the xi values deviate from the mean x̄ = 7 below.

(Diagram omitted: a number line from 4 to 10 with each xi marked and its deviation xi − x̄
from the mean x̄ = 7 drawn as a bracket.)

We want to find a way to measure an “average” of those deviations to obtain a measure of
statistical spread (variation). If we try to average them directly, we run into trouble, as they sum
to 0 (zero). We could try taking the average of the absolute values of the deviations, but this
approach is difficult to handle mathematically. We therefore square the deviations instead, add
them up, and then weight them by dividing by n − 1. We further discuss why we divide by
n − 1 rather than n below.

s² = Σ(xi − x̄)² / (n − 1), the sample variance, is a “good” estimator of the population variance.

To bring the unit scale “back in line”, we take the square root of the sample variance s² to get
the sample “Standard” Deviation, which is denoted s.

We now proceed to calculate a complete table of the most commonly used intuitive sample
statistical measures for our problem.

Mean – Average of the data values: add up all data values and divide by the total number of
observations, n. Formula: x̄ = Σxi/n. Value: 49/7 = 7 (from table).
Mode – The data value that appears most commonly in the data values. Value: no single mode;
the data are bimodal, with equal height peaks at 6 and 8.
Median – The middle data value; 1/2 of the data values lie below the median and 1/2 lie above
it. Formula: value of rank (n + 1)/2. Value: value of rank 8/2 = value of rank 4 = 7 (from table).
1st Quartile Q1 – 1/4 of the data values lie below the value of Q1, and 3/4 lie above it.
Formula: value of rank (n + 1)/4. Value: value of rank 8/4 = value of rank 2 = 6 (from table).
3rd Quartile Q3 – 3/4 of the data values lie below the value of Q3, and 1/4 lie above it.
Formula: value of rank 3(n + 1)/4. Value: value of rank 3(8)/4 = value of rank 6 = 8 (from table).
Range – The distance between the Maximum and Minimum data values. Formula: Max − Min.
Value: 10 − 4 = 6 (from table).
Interquartile Range (IQR) – The distance between the Q1 and Q3 data values. Formula:
Q3 − Q1. Value: 8 − 6 = 2 (from above).
Variance – A “weighted” average: add all squared deviations (of data values from the mean)
and divide by n − 1, where n is the number of observations. Formula: s² = Σ(xi − x̄)²/(n − 1).
Value: 22/6 = 3.6667 (from table).
Standard Deviation – The square root of the variance. This number will have the same units as
the data values. Formula: s = √(Σ(xi − x̄)²/(n − 1)). Value: √3.6667 = 1.9149 (from above).

Note: A shortcut formula for variance is
s² = (1/(n−1)) (Σxi² − (Σxi)²/n) = (1/6)(365 − 49²/7) = (1/6)(365 − 2401/7)
   = (1/6)(365 − 343) = 22/6 = 3.6667
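
As a check on the by-hand work, a minimal Python sketch that reproduces the table above,
including the shortcut formula:

    data = [4, 6, 6, 7, 8, 8, 10]
    n = len(data)

    mean = sum(data) / n                             # 49/7 = 7.0
    squared_devs = [(x - mean) ** 2 for x in data]   # the (xi - xbar)^2 column
    variance = sum(squared_devs) / (n - 1)           # 22/6 = 3.6667
    std_dev = variance ** 0.5                        # 1.9149

    # Shortcut formula: (sum of xi^2 - (sum of xi)^2 / n) / (n - 1)
    shortcut = (sum(x * x for x in data) - sum(data) ** 2 / n) / (n - 1)

    print(mean, variance, std_dev, shortcut)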

Further discussion about why we divide by n – 1 when calculating variance follows. A deeper
discussion is beyond the scope of the course, but students may find this useful. When we
estimate µ with x̄, we add some uncertainty. Because of this uncertainty, it turns out that if we
divide by n when we estimate σ² with s², we will consistently underestimate σ². To correct this
we divide by n – 1 instead. (This ensures that if we average all possible s² estimates that are
created from all possible samples from our population, they average to σ². Then s² is called an
“unbiased” estimator. An interesting reading to illustrate the underestimation of σ² by
s² with divisor n over all possible samples from a population is included at the end of the chapter.)

An easy way to remember is that we lose one “degree of freedom” because we estimated µ with
x̄, so we take one away in the denominator of the formula for s². Some note, too, that if we know
the value of x̄ and the values of n − 1 of our x values, the remaining x value is determined, and
cannot vary. So we lose a “degree of freedom” that way.

Example 2.3:

7 randomly chosen students were asked to indicate their level of energy (on a scale of 1 to 10).
Data is below. Use the dataset information below to fill in the table of statistical measures that
follows.

Rank   Energy xi   xi²          (xi − x̄) = (xi − 7)   (xi − x̄)² = (xi − 7)²
1      4           16           4−7 = −3              (−3)² = 9
2      5           25           5−7 = −2              (−2)² = 4
3      6           36           −1                    1
4      7           49           0                     0
5      8           64           1                     1
6      9           81           2                     4
7      10          100          3                     9
Sum Σ  4+5+…+10    Σxi² = 371   0                     9+4+…+9 =
(n=7)  = Σxi = 49                                     Σ(xi − x̄)² = 28
Mean – Average of the data values: add up all data values and divide by the total number of
observations, n. Formula: x̄ = Σxi/n. Value: 49/7 = 7 (from table).
Mode – The data value that appears most commonly in the data values. Value: no mode.
Median – The middle data value; 1/2 of the data values lie below the median and 1/2 lie above
it. Formula: value of rank (n + 1)/2. Value: value of rank 8/2 = value of rank 4 = 7 (from table).
1st Quartile Q1 – 1/4 of the data values lie below the value of Q1, and 3/4 lie above it.
Formula: value of rank (n + 1)/4. Value: value of rank 8/4 = value of rank 2 = 5 (from table).
3rd Quartile Q3 – 3/4 of the data values lie below the value of Q3, and 1/4 lie above it.
Formula: value of rank 3(n + 1)/4. Value: value of rank 3(8)/4 = value of rank 6 = 9 (from table).
Range – The distance between the Maximum and Minimum data values. Formula: Max − Min.
Value: 10 − 4 = 6 (from table).
Interquartile Range (IQR) – The distance between the Q1 and Q3 data values. Formula:
Q3 − Q1. Value: 9 − 5 = 4 (from above).
Variance – A “weighted” average: add all squared deviations (of data values from the mean)
and divide by n − 1, where n is the number of observations. Formula: s² = Σ(xi − x̄)²/(n − 1).
Value: 28/6 = 4.6667 (from table).
Standard Deviation – The square root of the variance. This number will have the same units as
the data values. Formula: s = √(Σ(xi − x̄)²/(n − 1)). Value: √4.6667 = 2.1603 (from above).

Note: A shortcut formula for variance is
s² = (1/(n−1)) (Σxi² − (Σxi)²/n) = (1/6)(371 − 49²/7) = (1/6)(371 − 2401/7)
   = (1/6)(371 − 343) = 28/6 = 4.6667
Example 2.4
Mean – Advantages: takes all numbers into account; easy to calculate; useful mathematical
properties. Disadvantages: affected by outliers (extreme values); no use with categorical/ordinal
data*; only useful for interval/ratio data.
Median – Advantages: essentially unaffected by outliers (extreme values). Disadvantages: no
use with categorical/ordinal data*; only useful for interval/ratio data; tedious to rank units (if by
hand); lacks useful mathematical properties.
Mode – Advantages: essentially unaffected by outliers (extreme values); useful for categorical,
ordinal, interval, and ratio data. Disadvantages: some data sets have no mode (uniform data
where the distribution has the same height for all bars); some data sets are bimodal, or
multimodal; lacks useful mathematical properties.

Example 2.5: Note: For a symmetric BELL SHAPED data set, the mean, mode and
median are all equal.

Symmetric means that the left half of the distribution is a mirror image of the right half of the
distribution. Watch out, though - we can have a data set that is symmetric, where mean equals
median, but they are not equal to the mode.

Example 2.6 : When is standard deviation useful for the following data types?
Categorical Never – you cannot calculate it for non-numerical data
(non-ordinal)
Ordinal Never – you cannot calculate it for non-numerical data*
Interval Always as data is numerical and “equidistant” between values
Ratio Always as data is numerical and “equidistant” between values

* Some may argue that mean, median and standard deviation can give information for ordinal
data where the rankings can be viewed as “equidistant”. An example would be if you were
asked to rate service in a restaurant on a scale of 1 to 5, with 1 being very unsatisfied and 5
being very satisfied. However, we will not consider mean and median to be viable statistics
from which information can be gleaned in this course.

Skewed Distributions
Skewed Distributions have the bulk of the data of interest at one side of the distribution and a tail
on the other side of the distribution with somewhat fewer outlying values.

Example 2.7 : Age of Retirement


A few years ago, your instructor was in line at the grocery store when the cashier congratulated
the young man in line behind her on winning 3.2 million dollars in a lottery. When asked what
he planned to do, he said "retire - at least for a while!" Hence this example. Assume that 45
people are randomly sampled and asked at what age they plan to retire. Note that one individual
(no doubt the young man in line behind me) plans to retire at 20! He is clearly, intuitively, an
"outlier". For this data, the mean = 62.64, the median = 64, Q1 = 62.5 (in the 11.5 position), Q3
= 65.5 (in the 34.5 position), the mode = 65. IQR = 3, and outliers lie above 65.5+1.5(3) = 70 or
below 62.5 – 1.5(3) = 58. There are 4 left outliers and 2 right outliers.

People often think of the mean as the balancing point.
Here the individual who plans to retire at 20 is skewing the results to the left.

The mean is below the median, and the data set is skewed to the left, or negatively skewed.
(MEAN IS PULLED LEFT OF MEDIAN BY “OUTLYING” LEFT TAIL VALUES)

Example 2.8: Suppose I give a test to a class of 6 students.
Five get a mark of 50 and one gets a mark of 100.

(Dotplot omitted: five X’s stacked at 50 and one X at 100; the median and mode sit at 50 while
the mean µ = 58.3 lies to their right.)

Was my exam too hard?

Mean µ = (50+50+50+50+50+100)/6 = 350/6 = 58.3
Mode = 50
Median = 50

Note the individual who got 100 is skewing the results to the right.
The mean is above the median, and the data set is skewed to the right, or positively skewed.

(MEAN IS PULLED RIGHT OF MEDIAN BY “OUTLYING” RIGHT TAIL VALUE)

Example 2.9: Choosing statistical summary measures of centrality and spread to describe
distributions

When data is approximately bell shaped the appropriate statistical summary measure to describe
the centrality of the data is the mean and the appropriate statistical summary measure to describe
the spread of the data is the standard deviation.

When data is skewed, the appropriate statistical measure to describe the centrality of the data is
the median and the appropriate statistical summary measure to describe the spread of the data is
the IQR.

For the breath data in Example 1.11, we describe the data with the mean of 13.844 and the
standard deviation of 4.204. You can verify them for practice.

For the retirement data in Example 2.7, we describe the data with the median of 64 and the IQR
of 3.

This is why housing statistics often report both mean and median for household prices in a
neighbourhood. Even just one wealthy home in the neighbourhood can skew the data
distribution of the prices so that the mean is much larger than the median. If only the average
(mean) home price is reported for that neighbourhood, people who might otherwise be able to
afford a home there might not even look there. But if people have more information and know
the median price is within their range of affordability, they will likely look in that
neighbourhood.

Example 2.10: Reading that shows how s² with divisor n consistently underestimates σ².
Chapter 3: Descriptive Statistics: Relating Two Numerical Variables
An association between two numerical variables (measured on the same respondent) occurs if
knowing something about one variable suggests the values of the other variable. One of the
variables is an explanatory variable that explains changes in the other, the response variable.

Example 3.1:

Age (explanatory) and number of push-ups done in a minute (response)
Hours spent studying (explanatory) and test score (response)
Hours spent exercising (explanatory) and resting heart rate (response)

Recall that a scatterplot graphically represents the relationship between two numerical variables.
When we see a relationship between two variables on a scatterplot we are interested in
identifying its form, direction and strength.

Form refers to a shape we might see in the data pattern. Mostly, we will be investigating
relationships that have a linear form, but we may run into situations where another form (such as
quadratic) shows on the scatterplot.

The direction of a relationship is positive if x and y increase together and negative if y decreases
as x increases.

A strong relationship will have points close to a simple form (such as a line).

Example 3.2:

Recall our example about ages and number of push-ups that can be done in a minute for n=12
randomly chosen individuals in an extended family. We put age (our predictor variable) on the x
axis and number of push-ups (our response variable) on the y axis. There appears to be a linear
relationship between age and pushups; as age increases, the number of pushups decreases. We
note the fit 67 year old (what an outlier!) who can do 29 pushups in a minute, though.
Individual Age Push-ups
Younger Brother 19 34
Elder Sister 28 26
Younger Sister 18 30
Mom 58 12
Middle Sister 23 28
Dad 64 12
Great Grandma 78 6
Uncle 43 21
Aunt 48 22
Elder Brother 36 27
Middle Brother 21 14
Grandpa 67 29

The relationship between age and number of pushups appears to be linear in form, with negative
direction, and strong (points appear relatively close to a line).

Watch out:

1. Researchers should always watch out for outlier (unusual) values that are outside the pattern
followed by most of the data in a scatterplot under investigation. They can have an outsize
influence on the measures we use to explain associations and patterns in the data.

2. A linear relationship between 2 variables does not mean that an increase (or decrease) in one
variable is the cause of an increase (or decrease) in the other. It simply means that a linear
relationship exists between the 2 variables.

3. Predicting Y values for X values that are outside the range of X values on the graph is known
as extrapolation and should be avoided as one cannot be sure that the relationship pattern will
hold outside of the range of X values used to create the scatterplot.

CORRELATION

Correlation is a measure that investigates the strength of a linear relationship between two
numerical variables as they vary together for a set of paired (xi,yi) values (i = 1, …,n).

As usual, sample statistics are used to estimate population parameters. In this case r, the sample
correlation, estimates ρ, the population correlation. For this part of the course, we will
concentrate on understanding r as a descriptive measure, and consider any data set under
investigation to be sample data from a larger population.

The formula for r is

r = Σ(xi − x̄)(yi − ȳ) / ( √(Σ(xi − x̄)²) · √(Σ(yi − ȳ)²) )

We take a moment to provide an intuitive explanation of the reasoning behind the formula.

Covariance, sxy = (1/(n−1)) Σ(xi − x̄)(yi − ȳ), is a measure that takes an “average” of the
products of deviations from the means. As with standard deviation, division by n − 1 allows the
sample covariance measure to provide a consistent “unbiased” estimator of the population
covariance measure.

A shortcut formula for covariance is sxy = (1/(n−1)) (Σxiyi − (Σxi)(Σyi)/n)

Points close to the “line of best fit” will have small values for the product (xi − x̄)(yi − ȳ)
relative to its maximum possible size; a line that fits well will have many (xi, yi) with small
values for this product.

Covariance is, however, dependent on the magnitude of the numbers in the data. A covariance
of 4 years between the ages of children in a sample of families is somewhat large, relatively,
while a covariance of 4 dollars in the prices of cars in a sample of families is somewhat small,
relatively.

We can “standardize” covariance by dividing by sx = √(Σ(xi − x̄)²/(n−1)) and
sy = √(Σ(yi − ȳ)²/(n−1)):

r = sxy / (sx · sy)
  = Σ(xi − x̄)(yi − ȳ) / ( √(Σ(xi − x̄)²) · √(Σ(yi − ȳ)²) )
  = (Σxiyi − (Σxi)(Σyi)/n) / ( √(Σxi² − (Σxi)²/n) · √(Σyi² − (Σyi)²/n) )

The last formula is a computational formula that eases calculation by hand. Examples will follow.

Some important properties of r are:

1. r is unitless.

2. r can only measure the strength of a linear relationship between the two variables of interest. r
does not measure the strength of a relationship between two variables that are related in a non-
linear way.

3. r can only take on values between -1 and +1.

4. The closer r is to 1 in absolute value the stronger the linear relationship. We consider an
absolute value above 0.8 to indicate a strong linear relationship.

5. Like the mean and the standard deviation, r is strongly affected by outliers (values that do not
conform to the linear relationship under investigation).

(Scatterplot panels omitted: perfect positive linear relationship, r = 1, points on a line with
positive slope; perfect negative linear relationship, r = −1, points on a line with negative slope;
strong positive linear relationship, r ≈ 0.95, points close to a line with positive slope; weak
negative linear relationship, r ≈ −0.45, points far from a line with negative slope; no discernible
linear relationship, r ≈ 0.)

• r = −1: a perfect downhill (negative) linear relationship
• r = −0.70: a strong downhill (negative) linear relationship
• r = −0.50: a moderate downhill (negative) relationship
• r = −0.30: a weak downhill (negative) linear relationship
• r = 0: no linear relationship
• r = +0.30: a weak uphill (positive) linear relationship
• r = +0.50: a moderate uphill (positive) relationship
• r = +0.70: a strong uphill (positive) linear relationship
• r = +1: a perfect uphill (positive) linear relationship

Example 3.3:

Individual        Age (xi)    Push-ups (yi)   xi²    yi²    xi·yi
Younger Brother   19          34              361    1156   646
Elder Sister      28          26              784    676    728
Younger Sister    18          30              324    900    540
Mom               58          12              3364   144    696
Middle Sister     23          28              529    784    644
Dad               64          12              4096   144    768
Great Grandma     78          6               6084   36     468
Uncle             43          21              1849   441    903
Aunt              48          22              2304   484    1056
Elder Brother     36          27              1296   729    972
Middle Brother    21          14              441    196    294
Grandpa           67          29              4489   841    1943
Totals Σ          Σxi = 503   Σyi = 261       Σxi² = 25921   Σyi² = 6531   Σxi·yi = 9658

a. Calculate r using the formula
r = (Σxiyi − (Σxi)(Σyi)/n) / ( √(Σxi² − (Σxi)²/n) · √(Σyi² − (Σyi)²/n) ).
Indicate the strength and direction of the linear relationship using information gleaned from the
correlation.

r = (9658 − (503)(261)/12) / ( √(25921 − (503)²/12) · √(6531 − (261)²/12) )
  = (−1282.25) / ((69.5479)(29.2276)) = −0.6308

The linear relationship is negative and it is moderate to strong.


b. Calculate sxy = (1/(n−1))(Σxiyi − (Σxi)(Σyi)/n), sx² = (1/(n−1))(Σxi² − (Σxi)²/n), and
sy² = (1/(n−1))(Σyi² − (Σyi)²/n).

sxy = (1/11)(9658 − (503)(261)/12) = −116.5682
sx² = (1/11)(25921 − (503)²/12) = 439.7197
sy² = (1/11)(6531 − (261)²/12) = 77.6591

c. Verify that r = sxy/(sx·sy) gives the same answer for r as in part a).

First calculate sx = √439.7197 = 20.9695 and sy = √77.6591 = 8.8124.

Now r = (−116.5682)/((20.9695)(8.8124)) = −0.6308
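
A minimal numpy sketch that verifies these by-hand results; ddof=1 selects the n − 1 divisor
used throughout this chapter:

    import numpy as np

    ages    = np.array([19, 28, 18, 58, 23, 64, 78, 43, 48, 36, 21, 67])
    pushups = np.array([34, 26, 30, 12, 28, 12, 6, 21, 22, 27, 14, 29])

    sxy = np.cov(ages, pushups, ddof=1)[0, 1]     # sample covariance: -116.5682
    sx = ages.std(ddof=1)                         # 20.9695
    sy = pushups.std(ddof=1)                      # 8.8124

    print(sxy / (sx * sy))                        # -0.6308
    print(np.corrcoef(ages, pushups)[0, 1])       # same value computed directly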

Example 3.4

A random sample of 10 people is taken and their salary per year (x) and savings per year (y) are
recorded. Draw a scatterplot of the data and indicate the form of the relationship you see. If the
form is linear, calculate the correlation between the two variables, and indicate its strength and
direction. You will need to fill in the table below to get the sums needed to calculate r.

Respondent   Salary (xi)   Savings (yi)   xi²    yi²   xi·yi
Dobby        4             1              16     1     4
Severus      12            0              144    0     0
Rubeus       16            1              256    1     16
Ginny        28            2              784    4     56
Albus        36            4              1296   16    144
Draco        42            3              1764   9     126
Ron          55            4              3025   16    220
Voldemort    59            6              3481   36    354
Harry        63            10             3969   100   630
Hermione     78            11             6084   121   858
SUMS Σ       Σxi = 393     Σyi = 42       Σxi² = 20819   Σyi² = 304   Σxi·yi = 2408

r = (Σxiyi − (Σxi)(Σyi)/n) / ( √(Σxi² − (Σxi)²/n) · √(Σyi² − (Σyi)²/n) )
  = (2408 − (393)(42)/10) / ( √(20819 − (393)²/10) · √(304 − (42)²/10) )
  = (2408 − 1650.6) / ( √(20819 − 15444.9) · √(304 − 176.4) )
  = 757.4 / ( √5374.1 · √127.6 )
  = 757.4 / ((73.3083)(11.296)) = 0.9146

The form of the data appears linear. The correlation indicates that the linear relationship is
strong and positive.

Chapter 4: Probability
We have mentioned before that we describe sample data with statistics and use those
descriptions to make inferences about the characteristics (parameters) of the population from
which the sample is taken.

Probability theory gives us a tool that enables us to do this inference when we only have sample
information. It allows us to investigate the suitability of hypothesized models for a
population from which we have taken a sample, by letting us calculate the probability of an
observed result and so determine whether that result is likely if a hypothesized model is
true. This is how we will use it later in the course. For now, we will get to work on gathering
some basic definitions needed to work with probability theory, and doing some straightforward
(and some more complex) work that uses the language of those definitions. We will sometimes
be able to describe sample results in probability terms when we summarize count data from a
number of respondents who each answer one or two categorical questions (each question having
several potential outcomes). You’ll see!

If we ask anyone to tell us what they think the probability is that it snows in Edmonton in July,
we will get different answers, no doubt. These answers are subjective probabilities, and are
based on the experience of the guesser. On the other hand, probabilities that are based on
experiments that are performed under controlled conditions are called objective probabilities.
A very simple example of this would be a coin toss. If the coin is fair, the chance of it turning up
heads is 1/2.

Example 4.1: A group of 55 students are asked to answer the following question.

“Henry Ford, the founder of the Ford Motor company, also invented the windshield wiper. Is
this correct or incorrect?”

The answer to this question is unknown to most of the students, so they guess the answer. There
are two possible outcomes, correct or incorrect.

This is a random phenomenon as the individual outcomes are uncertain, but nevertheless, with
a large number of repetitions, a regular distribution of outcomes will occur.

Over time, if we could ask millions and millions of people to answer this question, we would
expect that about 50% would get it right and 50% would get it wrong, as almost everyone would
guess the answer.
Outcome of sample of 55 people (relative frequency distribution), beside the probability
distribution we would get if we could ask everyone, everywhere (over time, relative frequency
settles to probability):

Outcome     Count   Relative Frequency   Probability
Correct     30      30/55 = 0.55         0.50
Incorrect   25      25/55 = 0.45         0.50
Total       55      1.00                 1.00

Over time, the proportion (relative frequencies) of times the outcomes of the random
phenomenon will occur in a long series of repetitions would settle down to 0.5 for correct and
0.5 for incorrect, in decimal form – and we refer to these “settled down” or “limiting” relative
frequencies as probabilities.

The relative frequency distribution of a categorical random variable is a table giving all
possible categories the random variable can assume and their associated relative frequencies.

The probability distribution of a categorical random variable is a table giving all
possible categories the random variable can assume and their associated probabilities.

We can graph the relative frequency distribution (as per the table above) for our experiment as
follows. Note that the bars below do not touch because the random phenomena (guesses) have
nominal (categorical) outcomes (correct and incorrect).

Similarly, we graph the probability distribution (as per the table above) for our experiment, if
we were to repeat it ad-infinitum. Note, again, that the probability distribution bars do not touch
because the random phenomena (guesses) have categorical outcomes (correct and incorrect).

A sample space is the set of all the unique possible outcomes of an experiment, where the
outcomes are disjoint (mutually exclusive) (that is, only one outcome can occur when the
experiment is performed) and collectively exhaustive (that is, all possible outcomes are
contained in the sample space).

The sample space for the above experiment is {incorrect, correct}.

Each outcome has a corresponding probability, which we denote as P(outcome).

Above, P(Correct) = ½ and P(Incorrect) = ½.
A particular outcome is known as a simple event.

An event is a set of outcomes that make a subset of a sample space.

OUTCOMES are MUTUALLY EXCLUSIVE (Occurrence of one precludes other occurring)

OUTCOMES are COLLECTIVELY EXHAUSTIVE (all are contained in the sample space)

We note the following axioms (properties) of a probability distribution.

Point 1: 0 ≤ P(outcome) ≤ 1 (similarly, 0 ≤ P(Event) ≤ 1)

Point 2: The sum of the P(outcome)s is 1.

The probability of an event A is equal to the sum of the probabilities of the simple events
(outcomes) contained in A.

Probability of equally likely outcomes (f/N Rule)

For an experiment with N possible outcomes, all of which are equally likely, an event that can
occur in f ways has probability f/N of occurring. That is,

f/N = (number of ways the event can occur) / (total number of possible outcomes)

Example 4.2 A bag contains 2 balls: 1 red, and 1 yellow. Two balls are drawn from the bag
with replacement (that is, after drawing the first ball and observing its color, it is returned to the
bag, and then the second ball is drawn). Let R be the event of drawing a red ball in a given
draw, and Y be the event of drawing a yellow ball in a given draw.

List all possible outcomes. What is the probability of each outcome?

S = Sample Space of Outcomes = {RR, RY, YR, YY}.

There are 4 possible outcomes. Each equally likely outcome has probability 1/4.

Some Events                            Outcomes and probability
A = {Exactly 1 yellow ball is drawn}   A = {RY, YR}. P(A) = 2/4 = 0.5
B = {The first ball is yellow}         B = {YY, YR}. P(B) = 2/4 = 0.5
C = {No less than one ball is red}     C = {RR, RY, YR}. P(C) = 3/4 = 0.75
D = {Both balls are the same color}    D = {RR, YY}. P(D) = 2/4 = 0.5

(In each case, we used the f/N rule; f = number of ways the event can occur
and N = total number of possible outcomes = 4.)

List all possible outcomes and find the probabilities of the following events.

Not C    not C = {outcomes in S but not in C} = {YY}.
         P(not C) = 1 − P(C) = 1 − 3/4 = 1/4 = 0.25
A & C    A & C = {elements common to both A and C} = {RY, YR}.
         P(A&C) = 2/4 = 0.5
A & D    A & D = {elements common to both A and D} = ∅.
         P(A&D) = 0
B & D    B & D = {elements common to both B and D} = {YY}.
         P(B&D) = 1/4 = 0.25
C or D   C or D = {elements in either C or D, or both C and D} = {RR, RY, YR, YY}.
         P(C or D) = P(C) + P(D) − P(C&D) = 3/4 + 2/4 − 1/4 = 1,
         where P(C&D) = P({RR}) = 1/4

Which of the following pairs of events are mutually exclusive?

Events   Outcomes in common   Mutually Exclusive?
A & B    {YR}                 No
A & D    ∅ (empty)            Yes (no outcomes in common)
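
A minimal Python sketch that enumerates this sample space and applies the f/N rule to the
events above (the helper prob is our own):

    from itertools import product
    from fractions import Fraction

    # The 4 equally likely outcomes of two draws with replacement.
    sample_space = [a + b for a, b in product("RY", repeat=2)]  # RR, RY, YR, YY
    N = len(sample_space)

    def prob(event):
        # f/N rule: count the outcomes in the event, divide by N.
        return Fraction(sum(1 for o in sample_space if o in event), N)

    A = {"RY", "YR"}           # exactly one yellow
    C = {"RR", "RY", "YR"}     # no less than one ball is red
    D = {"RR", "YY"}           # both balls the same color
    print(prob(A), prob(C), prob(A & C), prob(C | D))   # 1/2 3/4 1/2 1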

Example 4.3: Contingency Table (or Cross-tab Table): 460 randomly chosen students are
classified according to their choice of Bachelor (Arts or Science) and whether they intend to be
self-employed or work for an employer. The contingency table below records frequencies
(counts) for each (degree, employment) combination. We can calculate relative frequencies of
interest from the data in the table, and we will refer to such relative frequencies as probabilities
throughout our work below. This is, albeit, a bit of a stretch, but it is a very nice way to introduce
probability. If our sample is large enough, and the number of cells recording frequencies for
(Degree, Employment Status) is small enough that we have a good number of observations in
all cells (we like more than 5 per cell), then we can generally assume that the “relative
frequencies” are close to the “probabilities” we would obtain if we could sample everyone in a
population.

2 Classifications, 2 Outcomes per Classification

Probability Events:  BA – Bachelor of Arts
                     BS – Bachelor of Science
                     SE – Self-Employed
                     WE – Work for an Employer

        BA                        BS                        Total
SE      50                        30                        80 (marginal total of SE)
WE      280                       100                       380 (marginal total of WE)
Total   330 (marginal of BA)      130 (marginal of BS)      460 (grand total, the same
                                                            across and down)
Marginal Probability: total in cell of interest/total overall
Let A – event A – P(A) probability of event A Let B – event B – P(B) – probability of event B

P(WE) = 380/460 P(BA) = 330/460

Complementary Rule: Let "not A" be the event that A does not occur: P(not A) = 1 – P(A)

P(Not WE) = 1 – P(WE) = 1 – 380/460 = 80/460
P(Not BA) = 1 – P(BA) = 1 – 330/460 = 130/460

Joint Probability: the probability that two events, say A and B, occur together.

P(SE and BS) = P(SE ∩ BS) = (probability in the cell where SE and BS intersect) = 30/460

Union Probability: the probability that A or B or both occur (all enclosed parts of a Venn diagram).

Addition Rule: P(A or B) = P(A∪B) = P(A) + P(B) – P(A∩B)

P(SE or BS) = P(SE∪BS) = P(SE) + P(BS) – P(SE∩BS)
= (80 + 130 – 30)/460 = 180/460

(The middle number of 30 in the SE∩BS cell is counted twice, once in the 80 and
once in the 130, so we take one 30 out.)

Example 4.4: Students in a class are asked if they plan to complete a BA or a BS and if they
plan to be self-employed (SE) or to work for an employer (WE). Answers are below. Fill in the
tables below and calculate the fractions and probabilities.

Totals Table BA BS Total


SE 50 30 80
WE 280 100 380
Total 330 130 460

Marginal Probabilities:
Probability student is BA P(BA) 330/460 0.7174
Probability student is BS P(BS) 130/460 0.2826
Probability student plans SE P(SE) 80/460 0.1739
Probability student plans WE P(WE) 380/460 0.8260

Complementary Probabilities:
Prob student is not BA          P(not BA) = 1 – P(BA) = 1 – 330/460 = 130/460    0.2826
Prob student is not BS          P(not BS) = 1 – P(BS) = 1 – 130/460 = 330/460    0.7174
Prob student does not plan SE   P(not SE) = 1 – P(SE) = 1 – 80/460 = 380/460     0.8260
Prob student does not plan WE   P(not WE) = 1 – P(WE) = 1 – 380/460 = 80/460     0.1739
Joint Probabilities:
Probability student is BA and SE   P(BA ∩ SE) = 50/460     0.1087
Probability student is BA and WE   P(BA ∩ WE) = 280/460    0.6087
Probability student is BS and SE   P(BS ∩ SE) = 30/460     0.0652
Probability student is BS and WE   P(BS ∩ WE) = 100/460    0.2174

Union Probabilities:
Probability student is BA or SE   P(BA∪SE) = P(BA)+P(SE)–P(BA∩SE) = (330+80–50)/460 = 360/460     0.7826
Probability student is BA or WE   P(BA∪WE) = P(BA)+P(WE)–P(BA∩WE) = (330+380–280)/460 = 430/460   0.9348
Probability student is BS or SE   P(BS∪SE) = P(BS)+P(SE)–P(BS∩SE) = (130+80–30)/460 = 180/460     0.3913
Probability student is BS or WE   P(BS∪WE) = P(BS)+P(WE)–P(BS∩WE) = (130+380–100)/460 = 410/460   0.8913

Challenge:
Prob student is neither BA nor SE   1 – P(BA∪SE) = 1 – 360/460 = 100/460    0.2174
Prob student is neither BS nor WE   1 – P(BS∪WE) = 1 – 410/460 = 50/460     0.1087

Example 4.5: The majors of the 55 students in a Summer Introductory Statistics class are
recorded. 30 are Science majors, 10 are Arts majors, 3 are Business majors, and 7 have other
majors.

Outcomes: S (SCIENCE), A (ARTS), B (BUSINESS), O (OTHER)

P(S) = 30/55, P(A) = 10/55, P(B) = 3/55, P(O) = 7/55

We note that this is a sample space, as the outcomes are disjoint and collectively exhaustive, the
probabilities are all between 0 and 1 (inclusive), and the probabilities sum to 1.

S and A are disjoint (mutually exclusive), so P(S ∩ A) = 0

P(S∪A) = P(S) + P(A) – P(S ∩ A) = 30/55 + 10/55 – 0 = 40/55

(Special Addition Rule)

If A and B are disjoint events, then P(A∪B) = P(A) + P(B).

Conditional Probability:

Probability of event A occurring given that event B has occurred is denoted as P(A|B)

Conditional Probability Formula:

P(A|B) = P(A ∩ B)/P(B)

(General Rule of Multiplication for Conditional Probability)

We can rewrite the formula above as P(A ∩ B) = P(A|B) P(B)

Example 4.6: Students in a class are asked if they plan to complete a BA or a BS and if they
plan to be self-employed (SE) or to work for an employer (WE).
Totals Table BA BS Total
SE 50 30 80
WE 280 100 380
Total 330 130 460

Probability Table BA BS Probability


SE 50/460=0.1087 30/460=0.0652 80/460=0.1739
WE 280/460=0.6087 100/460=0.2174 380/460=0.8260
Total 330/460=0.7174 130/460=0.2826 1.0000

Find P(a person is taking a BA given that the person plans WE) (formal mathematical wording)
(Subgroup of interest at end of sentence. Look within it.)
P(BA|WE) = P(BA ∩ WE)/P(WE) = 280/380 = 0.6087/0.8260 = 0.7368

Find P(a person is taking a BA given that (we know) the person plans SE) (formal mathematical
wording) (Subgroup of interest at end of sentence. Look within it.)
P(BA|SE) = P(BA ∩ SE)/P(SE) = 50/80 = 0.1087/0.1739 = 0.6250

Find P(a WE plan person is taking a BS) (we KNOW we are only looking within the WE subgroup)
USUAL ENGLISH WORDING (Subgroup of interest at BEGINNING of sentence.)
P(BS|WE) = P(BS ∩ WE)/P(WE) = 100/380 = 0.2174/0.8260 = 0.2632

Finish the following questions. USUAL ENGLISH WORDING for all remaining questions!!!
Find P(an SE plan person is taking a BS)
P(BS|SE) = P(BS ∩ SE)/P(SE) = 30/80 = 0.0652/0.1739 = 0.3750

Find P(a person taking a BA plans WE)
P(WE|BA) = P(WE ∩ BA)/P(BA) = 280/330 = 0.6087/0.7174 = 0.8485

Find P(a person taking a BS plans WE)
P(WE|BS) = P(WE ∩ BS)/P(BS) = 100/130 = 0.2174/0.2826 = 0.7692

Find P(a person taking a BA plans SE)
P(SE|BA) = P(SE ∩ BA)/P(BA) = 50/330 = 0.1087/0.7174 = 0.1515

Find P(a person taking a BS plans SE)
P(SE|BS) = P(SE ∩ BS)/P(BS) = 30/130 = 0.0652/0.2826 = 0.2308

Another alternative wording for a conditional probability problem is as follows.

If we randomly select a self-employed person, what is the probability they are taking a BS?
P(BS|SE) = P(BS ∩ SE)/P(SE) = 30/80 = 0.0652/0.1739 = 0.3750

If we randomly select a BS student, what is the probability they are self-employed?
P(SE|BS) = P(SE ∩ BS)/P(BS) = 30/130 = 0.0652/0.2826 = 0.2308
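These table lookups are easy to automate. Below is a minimal sketch in Python (ours, not part of the notes; the counts dictionary and helper names are illustrative) that reproduces the conditional probabilities from the totals table.

# Minimal sketch: marginal, joint, and conditional probabilities from the
# (degree, plan) counts table. Names here are illustrative, not from the notes.
counts = {
    ("BA", "SE"): 50,  ("BS", "SE"): 30,
    ("BA", "WE"): 280, ("BS", "WE"): 100,
}
grand_total = sum(counts.values())  # 460

def p_joint(degree, plan):
    # Joint probability, e.g. P(BA and WE)
    return counts[(degree, plan)] / grand_total

def p_plan(plan):
    # Marginal probability of an employment plan, e.g. P(WE)
    return sum(v for (_, pl), v in counts.items() if pl == plan) / grand_total

def p_degree_given_plan(degree, plan):
    # Conditional probability P(degree | plan) = P(degree and plan) / P(plan)
    return p_joint(degree, plan) / p_plan(plan)

print(round(p_degree_given_plan("BA", "WE"), 4))  # 0.7368, matching P(BA|WE)
print(round(p_degree_given_plan("BS", "SE"), 4))  # 0.375, matching P(BS|SE)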

INDEPENDENCE: Sometimes knowing something doesn’t affect the probabilities. For


example, if you toss a coin, knowing that you got a head on the first toss doesn’t affect the
probability that you get a head on the second toss.

Coin tosses are independent (obtaining a head on one toss of a coin does not affect the
probability of obtaining a head on another toss of the coin)

DEFINITION: Two events A and B are independent if the occurrence (or non-occurrence) of
either event does not affect the probability of the occurrence (or non-occurrence) of the other.

If A and B are independent events:

P(A|B) = P(A)
P(B|A) = P(B)
P(A ∩ B) = P(A)P(B|A) = P(A)P(B)
(Note this also means that P(A ∩ B) = P(B)P(A|B) = P(B)P(A).)
THIS IS ONLY TRUE FOR INDEPENDENT EVENTS

Special Multiplication Rule for Independent Events


If A, B, C, … are independent events, then
P(A and B and C and …) = P(A)P(B)P(C)…

Example 4.7: Suppose you have two events D and E. P(D) = 0.2, P(E) = 0.6, D and E are
independent events.

a) Find P(D ∩ E): P(D ∩ E) = P(D)P(E) = (0.2)(0.6) = 0.12

b) Find P(D ∪ E):
P(D ∪ E) = P(D) + P(E) – P(D ∩ E) = 0.2 + 0.6 – 0.12 = 0.68

Example 4.8: Suppose you have two events A and B. P(A) = 0.3, P(B) = 0.6, and A and B are
independent events.

a) Find P(A ∩ B): P(A ∩ B) = P(A)P(B) = (0.3)(0.6) = 0.18

b) Find P(A ∪ B):
P(A ∪ B) = P(A) + P(B) – P(A ∩ B) = 0.3 + 0.6 – 0.18 = 0.72

PROPER USE OF FORMULAS
Unless we know that A and B are independent events, always use

P(A|B) = P(A ∩ B)/P(B)   instead of   P(A|B) = P(A)

P(A ∩ B) = P(A|B) P(B)   instead of   P(A ∩ B) = P(A) P(B)

Example 4.9: A true/false test consists of 4 questions. If you independently guess the answer to
each question, what is the probability of getting all 4 questions correct?

Event C = the event that you get a question correct.

On each independent trial (question), P(get a question correct) = P(C) = ½ = 0.5

By the special multiplication rule,

P(get all 4 questions correct) = P(C)P(C)P(C)P(C) = (0.5)⁴ = 0.0625

Example 4.10: A bag contains 2 balls: 1 red, and 1 yellow. Two balls are drawn from the bag
with replacement. Let R be the event of drawing a red ball in a given draw, and Y be the event
of drawing a yellow ball. Recall:

Event B = {The first ball is yellow}, which had P(B) = 2/4


Event C ={No less than one ball is red}, which had P(C)= 0.75
Event D ={Both balls are the same color}, which had P(D) = 0.5

Verify mathematically that B and D are independent events, while C and D are not.
We need: P(B) = 2/4, P(C) = 3/4, P(D) = 2/4, P(B&D) = 1/4, and P(C&D) = 1/4.

P(B) × P(D) = 2/4 × 2/4 = 1/4 = P(B&D), so B and D are independent events.

P(C) × P(D) = 3/4 × 2/4 = 6/16 ≠ 1/4 = P(C&D), so C and D are dependent events.

Example 4.11:

Consider our table from above. Are BA and SE independent events?


Totals Table BA BS Total
SE 50 30 80
WE 280 100 380
Total 330 130 460

If we can show that P(BA ∩ SE) ≠ P(BA) P(SE), then we can say that BA and SE are not
independent (that is, they are dependent).

P(BA ∩ SE) = 50/460 = 0.1087
P(BA) = 330/460 = 0.7174
P(SE) = 80/460 = 0.1739
P(BA)P(SE) = (330/460)(80/460) = 0.1248

Since P(BA ∩ SE) ≠ P(BA) P(SE) (0.1087 ≠ 0.1248),
BA and SE are not independent. They are dependent.

Example 4.12:
Students often confuse independent and mutually exclusive events. They are not the same thing.
Here is an example to illustrate this.

Consider an experiment where two coins are tossed. There are 4 possible outcomes in the sample
space.

Let H1 be the event of a head on toss 1
Let T1 be the event of a tail on toss 1
Let H2 be the event of a head on toss 2
Let T2 be the event of a tail on toss 2

Outcome Prob(outcome)
H1, H2 ¼
H1, T2 ¼
T1, H2 ¼
T1, T2 ¼
Total 1

P(H2) = ½ (add together the mutually exclusive outcomes (H1, H2) and (T1, H2) )
P(H2|H1) = ½ (knowing what happened on 1st toss doesn’t influence probability of getting a
head on the second toss)
So H1 and H2 are independent

P(H1 and H2) = ¼ .


P(H1 and H2) ≠ 0
H1 and H2 are not mutually exclusive.

COUNTING RULES

Example 4.12: 2 candidates in a group of 3 applicants will be randomly chosen to meet for ½ an
hour each for a job interview with an important CEO of an international company. The CEO has
a flight to Hong Kong later today, and she is likely to be more attentive to the first person
she meets. The order in which the candidates meet her is therefore of some importance to consider.

Candidates are Alice (A), Brandon (B) and Candace (C). Each hopes to be interviewed first.

A PERMUTATION is a selection of x objects from n objects, where the order (arrangement) of
the selected objects is IMPORTANT.

Example 4.13: Draw 2 people from 3. List all possible permutations of the 3 people A, B and C:
AB, BA, AC, CA, BC, CB (6 POSSIBLE PERMUTATIONS)

nPx = n!/(n − x)!   (the number of permutations of x objects chosen from n objects)

n! = n(n − 1)(n − 2)…(2)(1)

0! is defined to be 1.

3P2 = 3!/(3 − 2)! = 3!/1! = (3·2·1)/1 = 3·2 = 6   (the number of permutations of 2 objects from 3)

3 possible people for first choice, 2 possible people for second choice

The CEO decides to instead meet both the candidates together at lunch. Order no longer matters.

A COMBINATION is a selection of x objects from n objects, where the order (arrangement) of
the selected objects is NOT IMPORTANT.

AB and BA are two permutations of the same combination.

Example 4.14: List all possible combinations of the three objects A, B and C
AB, AC, BC 3 POSSIBLE COMBINATIONS

3C2 denotes the number of combinations of size 2 in a group of 3 objects.

3 possible combinations = (6 possible permutations)/(2 possible orderings per permutation)

Formula: nCx = the number of combinations of size x we can choose from a group of n objects

nCx = nPx/x! = [n!/(n − x)!]/x! = n!/[x!(n − x)!]   (the number of combinations when we choose x objects from n)

3C2 = 3!/[2!(3 − 2)!] = 3!/(2!1!) = (3·2·1)/((2·1)(1)) = 6/2 = 3

(Note: you may see the binomial coefficient notation, n over x in large parentheses, rather than
nCx in other courses.)
Example 4.15: Find 5C2 and 5C5

5C2 = 5!/[2!(5 − 2)!] = 5!/(2!3!) = (5·4·3!)/((2·1)(3!)) = 20/2 = 10

Note S = ({1,2}, {1,3}, {1,4}, {1,5}, {2,3}, {2,4}, {2,5}, {3,4}, {3,5}, {4,5})
– there are 10 possible combinations

5C5 = 5!/[5!(5 − 5)!] = 5!/(5!0!) = 5!/(5!(1)) = 1

Note S = ({1,2,3,4,5}) – there is one outcome combination (Remember: 0! = 1, by definition)

Example 4.16: We select a random sample of 2 numbers between 1 and 100, sampling without
replacement.

a) How many samples of size 2 are possible? (2 marks)

100C2 = n!/[x!(n − x)!] = 100!/(2!(100 − 2)!) = 100!/(2!98!) = (100)(99)(98!)/((2)(1)(98!)) = (100)(99)/2 = 4950

b) What is the probability that the sum of the values in our sample is less than 7? (3 marks)

Each individual random sample is selected with probability 1/(100C2) = 1/4950

There are 6 samples whose elements sum to less than 7


({1,2},{1,3}, {1,4}, {1,5}, {2,3}, and {2,4})

The probability of the sum being less than 7 is f/N = 6/(100C2)= 6/4950 = 0.001212
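For checking counts like these, Python's math module (version 3.8 on) provides math.perm and math.comb. The sketch below (ours, not part of the notes) reproduces the Example 4.16 answers, including a brute-force count of the favourable samples.

import math

print(math.perm(3, 2))    # 3P2 = 6 permutations
print(math.comb(3, 2))    # 3C2 = 3 combinations
print(math.comb(100, 2))  # 100C2 = 4950 possible samples of size 2

# Count the samples {a, b} drawn from 1..100 whose sum is less than 7:
f = sum(1 for a in range(1, 101) for b in range(a + 1, 101) if a + b < 7)
print(f, f / math.comb(100, 2))  # 6 outcomes, probability 6/4950 = 0.001212...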

Chapter 5: RANDOM VARIABLES

A variable X is a random variable if the value that it assumes, corresponding to the outcome of
an experiment, is a chance or random event.

Random variables can be categorical or quantitative (discrete or continuous).

CATEGORICAL RANDOM VARIABLES

PROBABILITY DISTRIBUTION

For a categorical random variable, the probability distribution is a table giving all possible
categories the random variable can assume along with their associated probabilities. We
anticipated this definition in the Henry Ford example where we guessed the answer! Note again
the probability bars do not touch because this is a categorical random variable.

Outcome     Probability
Correct     0.50
Incorrect   0.50
Total       1.00

[Bar chart: Probability Distribution of Guessed Answer to True/False Question, with bars of
height 0.5 over "Correct" and "Incorrect" on the Answer Given axis; the bars do not touch.]

NUMERICAL RANDOM VARIABLES

Numerical random variables can be discrete or continuous.

DISCRETE RANDOM VARIABLES

PROBABILITY DISTRIBUTION

For a discrete random variable, the probability distribution is a table giving all possible discrete
values that the random variable X can assume along with their associated probabilities p(X).

Value of X 𝑥1 𝑥2 𝑥3 … 𝑥𝑛
Probability 𝑝1 𝑝2 𝑝3 … 𝑝𝑛

1. Each 𝑝𝑖 is a number between 0 and 1 for all i = 1, …, n,


2. ∑ 𝑝𝑖 = 1.

Example 5.1: You get 60% on a test worth 20% of your grade and 90% on a test worth 80% of
your grade. Fill in the probability distribution. Graph the probability distribution. Note that the
bars do not touch because this is a discrete probability distribution.

Value of X 60 = 𝑥1 90 = 𝑥2
Probability 0.20= 𝑝1 0.80 = 𝑝2

MEAN AND STANDARD DEVIATION FOR DISCRETE PROBABILITY DISTRIBUTION

The mean (or expected value) of a discrete random variable is given as:

µ = ∑ 𝑥𝑖 𝑝𝑖

The variance of a discrete random variable is given as:

σ2 = ∑(𝑥𝑖 − 𝜇)2 𝑝𝑖

The standard deviation σ of a discrete random variable is √𝜎 2

Example 5.2: You get 60% on a test worth 20% of your grade and 90% on a test worth 80% of
your grade. Let X be your final grade. What is your final expected grade?

µ = ∑ 𝑥𝑖 𝑝𝑖 = 𝑥1 𝑝1 + 𝑥2 𝑝2 = 60(0.2) + 90 (0.8) = 12 + 72 = 84

(weight each test mark by corresponding percentage)

What is the standard deviation for your final grade?

σ² = ∑(𝑥𝑖 − 𝜇)²𝑝𝑖 = (60 – 84)²(0.2) + (90 – 84)²(0.8) = (–24)²(0.2) + (6)²(0.8) = 576(0.2) + 36(0.8)
= 115.2 + 28.8 = 144

σ = √σ² = √144 = 12
Example 5.3: The probabilities and the number of customers lined up at Starbucks at noon are
below. What is the expected number of customers and the standard deviation of the number of
customers lined up at noon?

Number of Customers Probability


1 0.1
2 0.2
3 0.5
4 0.2

Let X = number of customers:


X P(X) XP(X) X - μ (X - μ)2 (X - μ)2P(X)
1 0.1 0.1 -1.8 3.24 0.324
2 0.2 0.4 -0.8 0.64 0.128
3 0.5 1.5 0.2 0.04 0.02
4 0.2 0.8 1.2 1.44 0.288
2.8 0.76

μ = ∑𝑥𝑖𝑝𝑖 = 2.8      σ² = ∑(𝑥𝑖 − 𝜇)²𝑝𝑖 = 0.76      σ = √0.76 ≈ 0.87
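The column arithmetic above is easy to automate. A minimal Python sketch (ours, not from the notes) for the Starbucks distribution:

x = [1, 2, 3, 4]
p = [0.1, 0.2, 0.5, 0.2]

mu = sum(xi * pi for xi, pi in zip(x, p))               # mean: 2.8
var = sum((xi - mu) ** 2 * pi for xi, pi in zip(x, p))  # variance: 0.76
sigma = var ** 0.5                                      # standard deviation: ~0.87
print(mu, var, round(sigma, 2))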

Example 5.4: A charity sells 10000 tickets for $1 each. Let X denote your winnings upon
purchasing 1 ticket, and suppose X has the following probability distribution

x        P(X = x)
$100     10/10,000
$5       30/10,000
$0       9,960/10,000

(a) What is the expected value of X?

The expected value (mean) of X is given by


μ = ∑ x P(X = x) = 100 · (10/10000) + 5 · (30/10000) + 0 · (9960/10000)
= (1000 + 150 + 0)/10000 = 1150/10000 = 0.1150

If you purchase a single $1 ticket, you are expected to receive a return of about 11.5¢.

(b) Since each ticket costs $1, define Y = X − 1 to be your profit from purchasing 1 ticket.
Compute and interpret the expected value of Y.

Since the expected value of X is 0.115, it follows that the expected value of Y is
0.115 − 1 = −0.885.

You are expected to lose about 88.5¢ for each ticket you purchase.

(c) Suppose you purchase 10 tickets. What is the probability that you win any amount of money
among all 10 tickets? For simplicity, assume independence. (3 marks)

Let T be the total winnings among all 10 tickets. Since the probability of winning $0 on 1 ticket
is
P(X = 0) = 9960/10,000 = 0.996,

it follows from the special multiplication rule (for independent events) that the probability of
10 losses is

P(T = 0) = P(X = 0)¹⁰ = (9960/10,000)¹⁰ = (0.996)¹⁰ ≈ 0.9607.

Therefore, by the complement rule,

P(T > 0) = 1 − P(T = 0) ≃ 1 − 0.9607 = 0.0393

BINOMIAL DISTRIBUTION

Properties of a Binomial Experiment, BIN(n, p)


1.n IDENTICAL TRIALS
2. each trial has two possible outcomes, success or failure
3. a) prob (success) = p is the same from trial to trial
b) prob(failure) = 1- p is the same from trial to trial
4. independent trials

We are interested in x = number of successes in n trials


X is a random variable, X ={0,1,2,3,….., n}

Example 5.5:

1. plan 5 children, n=5


2. 2 possible outcomes, boy (failure) or girl (success)
3. a) prob of success = 0.5,same from trial to trial
b) prob of failure = 0.5, same from trial to trial
4. trials are independent, knowing what sex one baby is doesn’t affect probability of next one
being a girl
X = {0,1,2,3,4,5}……..This is a BIN(5, 0.5) experiment.

Example 5.6:

Family legend mentions an ancestor who travelled the Silk Road. His legacy to you is a biased
coin that has been in the family for generations. For this coin, the probability of a head on each
independent toss is 0.9. You will toss it 4 times.
X = { 0,1,2,3,4 }…………This is a BIN( 4,0.9 ) experiment.

n n n!
General Formula for a BIN(n,p): P( x) =   p x (1 − p) n − x for x=0,1,2,…n where   =
 x  x  x!(n − x)!
and n!= n(n − 1)(n − 2)....(2)(1)

Calculate P(0) for a BIN(4,0.9) using above formula:


Hint: Recall 0!=1 and p0 = 1 for any number p

 4
P(0) =  (0.9) 0 (1 − 0.9) 4−0
0

 4 4! 4! 4.3.2.1
Where   = = = =1
 0  0!(4 − 0)! 0!4! (1)(4.3.2.1)

So P(0)=(1)(.9)0(.1)4 = 1(1)(.0001) = .0001

Example 5.7:

Follow the format above to calculate P(2) for a BIN(4, 0.9)

P(2) = (4C2)(0.9)²(1 − 0.9)⁴⁻²

where 4C2 = 4!/[2!(4 − 2)!] = 4!/(2!2!) = (4·3·2·1)/((2·1)(2·1)) = 6

So P(2) = 6(0.9)²(0.1)² = 6(0.81)(0.01) = 0.0486

Follow the format above to calculate P(4) for a BIN(6, 0.3)

P(4) = (6C4)(0.3)⁴(1 − 0.3)⁶⁻⁴

where 6C4 = 6!/[4!(6 − 4)!] = 6!/(4!2!) = (6·5·4·3·2·1)/((4·3·2·1)(2·1)) = 15

So P(4) = 15(0.3)⁴(0.7)² = 15(0.0081)(0.49) = 0.059535
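The general formula translates directly into code. Here is a minimal sketch (ours; the function name binom_pmf is our own) using math.comb to reproduce the three probabilities just computed by hand.

import math

def binom_pmf(x, n, p):
    # P(X = x) for a BIN(n, p) random variable
    return math.comb(n, x) * p**x * (1 - p) ** (n - x)

print(round(binom_pmf(0, 4, 0.9), 4))  # 0.0001
print(round(binom_pmf(2, 4, 0.9), 4))  # 0.0486
print(round(binom_pmf(4, 6, 0.3), 4))  # 0.0595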
TABLES:
On the next two pages, please find tables for BIN(n,p) for n = 1, 2, 3, …, 7 and p = 0.1, 0.2, 0.3, …, 0.9
1. Look down the right side to find n
2. Look across the top to find p
3. Block your n, p combination
4. Locate the x you are interested in for your block on the left side of the page beside your
n.

Verify that P(4) = 0.0595 for a BIN(6,0.3) in the tables.

Example 5.8: For the BIN(6,0.5) distribution, fill in the chart probabilities below and calculate
the probability statements shown. Follow the model shown.

X P (X)
0 0.0156
1 0.0938
2 0.2344
3 0.3125
4 0.2344
5 0.0938
6 0.0156

P(X<=2) = P(0) + P(1) + P(2)


= .0156 + .0938 +.2344
=.3438

P(X>=3) = P(3) + P(4) + P(5) + P(6)


= .3125 + .2344 + .0938 + .0156
= .6563

P(3<=X<=6)=P(3) + P(4) + P(5) + P(6)


= .3125 + .2344 + .0938 + .0156)
= .6563

P(3<X<=6) =
P(4)+P(5)+P(6)
= .2344 + .0938 + .0156
= .3438

P(3<X<6) =
P(4)+P(5)
= .2344 + .0938
= .3282
Example 5.9:

MEAN OF A BIN(n,p)

Toss a fair coin Six Times - BIN(6,0.5)


How many heads do we expect? 3!!! (instinct)

Number of heads we expect = Mean

μ = np = 6(0.5) = 3

STANDARD DEVIATION OF A BIN(n,p)

σ² = np(1 − p) = 6(0.5)(1 − 0.5) = 1.5

σ = √σ² = √1.5 ≈ 1.22

Example 5.10:
Calculate the mean and standard deviation for a BIN(20, 0.2)

μ = np = 20(0.2) = 4

σ² = np(1 − p) = 20(0.2)(0.8) = 4(0.8) = 3.2

σ = √σ² = √3.2 ≈ 1.7889

SHAPES:
For all n, for small p, the BIN distribution is skewed right.

For all n, for large p, the BIN distribution is skewed left.

For p = 0.5, the BIN distribution is bell shaped (for all n >= 2).*

*For a BIN(1, 0.5), P(X) = 0.5 for both X = 0 and X = 1, and the binomial distribution is a uniform distribution.

Interesting Note: When p is small, if n is large enough, it turns out that we end up with a bell
shape, because the outlier probabilities of our skew are so small they are negligible. This happens
when np >= 5 and n > 20. Below is a case where n = 900 and p = 0.02. Here np = 900(0.02) = 18,
which is >= 5.
[Histogram: P(X) for a BIN(900, 0.02), with X from 5 to 35 on the horizontal axis and P(X)
from 0.00 to 0.10 on the vertical axis; the shape is a bell centered near np = 18.]

Mirror Probabilities:

X    P(X) for BIN(10, 0.1)        X    P(X) for BIN(10, 0.9)
0 0.3487 0 0.0000
1 0.3874 1 0.0000
2 0.1937 2 0.0000
3 0.0574 3 0.0000
4 0.0112 4 0.0001
5 0.0015 5 0.0015
6 0.0001 6 0.0112
7 0.0000 7 0.0574
8 0.0000 8 0.1937
9 0.0000 9 0.3874
10 0.0000 10 0.3487

THE PROBABILITY COLUMN FOR BIN(10, 0.1)
IS THE EXACT REVERSE
OF THE PROBABILITY COLUMN FOR BIN(10, 0.9)

In general, BIN(n,p) and BIN(n,1−p) have probability columns that are the exact reverse of each other:

P(X=x) for BIN(n,p) = P(X=n−x) for BIN(n,1−p)


EXAMPLES
P(X=3) for BIN(10,.1) = P(X=7) for BIN(10,.9)

P(X=2) for BIN(10, .2) = P(X=8) for BIN(10,.8)


X    P(X) for BIN(10, 0.2)        X    P(X) for BIN(10, 0.8)
0 0.1074 0 0.0000
1 0.2684 1 0.0000
2 0.3020 2 0.0001
3 0.2013 3 0.0008
4 0.0881 4 0.0055
5 0.0264 5 0.0264
6 0.0055 6 0.0881
7 0.0008 7 0.2013
8 0.0001 8 0.3020
9 0.0000 9 0.2684
10 0.0000 10 0.1074
P(2<=X<=4) for BIN(10,0.2) = P(6<=X<=8) for BIN(10,0.8)
= P(2)+P(3)+P(4) for BIN(10,0.2) = P(8)+P(7)+P(6) for BIN(10,0.8)

Example 5.11: For a BIN(7, 0.9), find, using mirror properties:


P(3<=X<=5) = P(3) + P(4) + P(5) for a BIN (7, 0.9)
= P(4) + P(3) + P(2) for a BIN (7, 0.1)
= 0.0026 + 0.0230 + 0.1240 = 0.1496
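A quick computational check of the mirror property (a sketch of ours, reusing the binom_pmf helper defined earlier):

import math

def binom_pmf(x, n, p):
    return math.comb(n, x) * p**x * (1 - p) ** (n - x)

# P(X = x) for BIN(n, p) should equal P(X = n - x) for BIN(n, 1 - p)
for x in range(8):
    assert abs(binom_pmf(x, 7, 0.9) - binom_pmf(7 - x, 7, 0.1)) < 1e-12
print("mirror property holds for BIN(7, 0.9) and BIN(7, 0.1)")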

Example 5.12:
A healthy elephant starts to become fertile around the age of 15, and can have calves up to the
age of about 50. Females have a 22 month pregnancy, and, in the wild, may have as many as 11-
12 calves (single births) in their lifetime, each about 4 -5 years apart. A zookeeper is using
natural methods to increase the probability that her elephant has female offspring. According to
research, the method she had chosen will increase the probability of having a female to 60% for
each independent conception (that is, the probability of having a male will be 40% for each
independent conception). The zookeeper plans to breed her elephant 6 times, once every 4 ½
years. Calculate the following for this BIN(6,0.4) distribution with a success = male.
http://elephant.elehost.com/About_Elephants/about_elephants.htm
     ENGLISH                             TRANSLATION   CALCULATION                                                     ANSWER
EG   EXACTLY 4 CALVES MALE              P(X=4)        P(4)                                                             0.1382
a    EXACTLY 5 CALVES MALE              P(X=5)        P(5)                                                             0.0369
b    LESS THAN 4 CALVES MALE            P(X<=3)       P(0)+P(1)+P(2)+P(3) = 0.0467+0.1866+0.3110+0.2765                0.8208
c    AT MOST 3 CALVES MALE              P(X<=3)       P(0)+P(1)+P(2)+P(3) = 0.0467+0.1866+0.3110+0.2765                0.8208
d    NO MORE THAN 4 CALVES MALE         P(X<=4)       P(0)+P(1)+P(2)+P(3)+P(4) = 0.0467+0.1866+0.3110+0.2765+0.1382    0.9590
e    3 OR LESS CALVES MALE              P(X<=3)       P(0)+P(1)+P(2)+P(3) = 0.0467+0.1866+0.3110+0.2765                0.8208
f    MORE THAN 5 CALVES MALE            P(X>=6)       P(6)                                                             0.0041
g    AT LEAST 2 CALVES MALE             P(X>=2)       P(2)+P(3)+P(4)+P(5)+P(6) = 0.3110+0.2765+0.1382+0.0369+0.0041    0.7667
h    NO LESS THAN 5 CALVES MALE         P(X>=5)       P(5)+P(6) = 0.0369+0.0041                                        0.0410
i    1 OR MORE CALVES MALE              P(X>=1)       P(1)+P(2)+P(3)+P(4)+P(5)+P(6) = 0.1866+0.3110+0.2765+0.1382+0.0369+0.0041   0.9533
j    BETWEEN 2 AND 5 INCLUSIVE MALE     P(2<=X<=5)    P(2)+P(3)+P(4)+P(5) = 0.3110+0.2765+0.1382+0.0369                0.7626
k    MORE THAN 2 BUT LESS THAN 5 MALE   P(3<=X<=4)    P(3)+P(4) = 0.2765+0.1382                                        0.4147
l    AT LEAST 5 CALVES FEMALE (5 or 6 female = 1 or 0 male)                  P(X<=1)   P(1)+P(0) = 0.1866+0.0467       0.2333
m    NO MORE THAN 3 CALVES FEMALE (0, 1, 2, or 3 female = 6, 5, 4, or 3 male)   P(X>=3)   P(6)+P(5)+P(4)+P(3) = 0.0041+0.0369+0.1382+0.2765   0.4557

Helpful Translations from English to Probability Statements


Less than 8 P(X<=7)
At most 7 P(X<=7)
No more than 7 P(X<=7)
7 or less P(X<=7)

More than 6 P(X >=7)


At least 7 P(X >=7)
No less than 7 P(X >=7)
7 or more P(X >=7)

Between 5 and 9 inclusive P(5<=X<=9)


More than 5 but less than 9 P(6<=X<=8)

Supplemental Reading:
We want a formula that calculates the Prob (X) = P(X) for a BIN(n,p) experiment.
Family legend mentions an ancestor who travelled the Silk Road. His legacy to you is a biased
coin that has been in the family for generations. For this coin, the probability of a head on each
independent toss is 0.9. Let us toss it 4 times. Let X = # of heads, so this is a BIN(4,.9)
experiment. There are 16 possible outcomes when we toss the coin 4 times.
Outcome (marker)
HHHH (o)    HHHT (~)    HHTH (~)    HHTT (*)
HTHH (~)    HTHT (*)    HTTH (*)    HTTT (#)
THHH (~)    THHT (*)    THTH (*)    THTT (#)
TTHH (*)    TTHT (#)    TTTH (#)    TTTT (x)

Legend: x = 0 heads, # = 1 head, * = 2 heads, ~ = 3 heads, o = 4 heads

We have _1_ permutation of the combination of 0 heads


We have _4_ permutations of the combination of 1 head and 3 tails
We have _6_ permutations of the combination of two heads and two tails
We have _4_ permutations of the combination of 3 heads and 1 tail
We have _1_ permutation of the combination of 4 heads

For each combination


Consider the probability of one particular permutation of a combination
We multiply this by the number of permutations of the combination
THIS GIVES US THE P(x) FOR THE COMBINATION OF INTEREST
P(0) = 1(.9)⁰(.1)⁴ = 1(.1)(.1)(.1)(.1) = 0.0001
P(1) = 4(.9)¹(.1)³ = 4(.9)(.1)(.1)(.1) = 0.0036
P(2) = 6(.9)²(.1)² = 6(.9)(.9)(.1)(.1) = 0.0486
P(3) = 4(.9)³(.1)¹ = 4(.9)(.9)(.9)(.1) = 0.2916
P(4) = 1(.9)⁴(.1)⁰ = 1(.9)(.9)(.9)(.9) = 0.6561

It would be nice to make a formula!!!


General Formula for a BIN(n,p):

P(x) = (nCx) pˣ(1 − p)ⁿ⁻ˣ for x = 0, 1, 2, …, n,
where nCx = n!/[x!(n − x)!] and n! = n(n − 1)(n − 2)…(2)(1)

CONTINUOUS NORMAL DISTRIBUTION

http://www.umass.edu/wsp/statistics/tales/demoivre.html

Abraham de Moivre
26 May 1667 - 27 Nov 1754

“De Moivre was born in France, but went to England to escape the persecutions to which French
Protestants (he was a Huguenot – a type of Calvinist) were then subject. He thus took his place at the
northwest corner of the deeply interconnected world in which the science of statistics was emerging, one
insight at a time. He supported himself by teaching, but most famously as the resident statistician of
Slaughter's Coffee House in London, where the gamblers would pay him to calculate odds for them.
de Moivre noted that when the number of events (coin flips) increased, the shape of the binomial
distribution approached a very smooth (bell shaped) curve. de Moivre reasoned that if he could
find a mathematical expression for this curve, he would be able to solve problems such as
finding the probability of 60 or more heads out of 100 coin flips much more easily. This is
exactly what he did, and the curve he discovered is now called the normal curve.”

Much real life data follows a bell shape (normal, unimodal, mound shaped, symmetric).
Examples include blood pressure, heights of men, heights of women, and observation error when
taking measurements (this was first observed and defined by Carl Friedrich Gauss when he was
taking astronomical observations), and perhaps, sometimes, grades in a course (depending on the
discipline).

Thus, there is a “family” of distributions that take on the normal shape. The mean µ (center and
highest point) and the standard deviation σ (spread) are the parameters that define the location
and shape of a normal distribution. The curve extends from -∞ at the left of the measurement
scale to +∞ at the right of the measurement scale, although almost all of the data is within 3
standard deviations of the mean.

Calculus can be used to establish what is known as the Empirical Rule for the Normal
Distribution. Please note that this rule ONLY holds for the normal distribution.

EMPIRICAL RULE: No matter how narrow or wide the bell is,

68.26% of the data lies within one standard deviation of the mean, μ − σ to μ + σ
95.44% of the data lies within two standard deviations of the mean, μ − 2σ to μ + 2σ
99.74% of the data lies within three standard deviations of the mean, μ − 3σ to μ + 3σ

NOTE: MEAN = MEDIAN = MODE FOR BELL SHAPED DATA
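The Empirical Rule percentages can be recovered from the standard normal CDF; the sketch below (ours, using statistics.NormalDist, available in Python 3.8+) prints the exact areas, which differ from the figures above only in the last rounded digit.

from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1
for k in (1, 2, 3):
    area = Z.cdf(k) - Z.cdf(-k)  # area within k standard deviations of the mean
    print(f"within {k} sd: {area:.2%}")  # 68.27%, 95.45%, 99.73%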

Example 5.3:
Bell Shaped Population with μ = 70, σ = 10

interval (60, 80) = (μ − σ, μ + σ) contains ~68.26% of the data
interval (50, 90) = (μ − 2σ, μ + 2σ) contains ~95.44% of the values
interval (40, 100) = (μ − 3σ, μ + 3σ) contains ~99.74% of the data

Descriptive Methods of Assessing Normality

Much statistical inference involves sample(s) that are taken from a background normal
population. But, sometimes, the shape of a background population is unknown when a sample is
taken (although we can sometimes make an educated guess). Because of this, a statistician
performing inference that requires a normal population will ALWAYS examine sample data to
see if it appears that the sample data is normal. If the sample data looks normal, then it suggests
the population data is normal also. If the sample data does not look normal, then inference that
does not assume a normal background population may instead be of interest. Additionally, the
statistician will bear in mind that one particular sample may or may not provide a good idea of
the population shape, particularly if the sample size is relatively small. In real life,
experimenters and researchers often replicate studies in order to be very sure that their work has
validity and stands the test of time.

Ways to examine the sample data include:


1. Constructing a histogram to see if the sample data has a normal shape
2. Calculating the intervals 𝑥̅ ± s, 𝑥̅ ± 2s, 𝑥̅ ± 3s and checking to see how much of the sample
distribution data is in the intervals
3. Making a normal quantile plot (a scatterplot where points close to a line indicate normality of
the sample data)

Example 5.4: (optional)

In 2009, Guy Laliberté, the Canadian CEO of Cirque du Soleil, who began his rise to fame
busking on the streets of Quebec City, and continues to be a well known entrepreneur, highly
ranked poker player, and philanthropist (he runs the One Drop Foundation, dedicated to ensuring
access to water to everyone worldwide) was the 7th space tourist to go up with the Russian Space
Agency to the International Space Station. http://en.wikipedia.org/wiki/Guy_Lalibert%C3%A9 .
Space tourism is currently suspended, but many young people today would like to someday go
into space. Assume a group of candidates were given a test that determined, on a scale of 0 to
70, their readiness to go into space. Raw data and a histogram of the results are provided below.
Past studies have shown the test scale to have a normal distribution with a mean µ of about 35
for a much larger population of candidates. We will examine this one set of sample data to see if
it appears normal with an 𝑥̅ about 35.

Raw data is as follows.
5 11 15 20 22 25 28 29 29 29
29 31 32 32 33 33 34 34 34 35
35 35 35 36 36 36 37 37 38 40
40 41 44 45 45 47 49 52 59 67

For this data, 𝑥̅ = 34.85 and s = 11.573.

1. Histogram: A histogram is made from the raw data using the relative frequency distribution
shown below. (Your instructor created a histogram from the raw data and then changed the intervals.)
Relative
Frequency,
x Count ~P(X=x)
0 - under 10 1 0.025
10 - under 20 2 0.050
20 - under 30 8 0.200
30 – under 40 18 0.450
40 – under 50 8 0.200
50 – under 60 2 0.050
60 - 70 1 0.025
Sum 40 1.000

Although this distribution is symmetrical, with the bulk of its data in the middle of the dataset and
a tapering of tails in a bell-like way, the height of the middle is a bit higher than we might expect
with a true normal distribution. We would wonder if the population had the same configuration
and how that might impact the inference we would choose to do with the test.

2. Empirical rule

You compare the percent of observations within 1, 2, and 3 standard deviations of the mean for
your sample data to the expected percent of observations if the sample data was normal.

Interval                                        Observations in sample in interval   Relative frequency in sample   Normal Distribution Empirical Rule percent
𝑥̅ ± s:  34.85 ± 11.573    = (23.277, 46.423)       30                                30/40 = 75%                    68.26%
𝑥̅ ± 2s: 34.85 ± 2(11.573) = (11.704, 57.996)       36                                36/40 = 90%                    95.44%
𝑥̅ ± 3s: 34.85 ± 3(11.573) = (0.131, 69.569)        40                                40/40 = 100%                   99.74%
When you compare what actually happens in your sample data to what you would expect in a
normal distribution with the same mean and standard deviation as your sample, your sample data
has about 7% more observations than you would expect within one standard deviation of the
mean, about 5% fewer observations than you would expect within two standard deviations of the
mean, and close to the percent of observations you would expect within three standard deviations
of the mean. Again, it is heavier in the middle than we would expect for a normal distribution.

3. Normal Quantile Plot

Your instructor created the normal quantile plot using software. It compares the actual sample
data values to the values you would expect to obtain if the sample data was normal with its own
standard deviation and variance. If the sample data is normal, then the points should be on a line.
To construct the plot from scratch is a bit of work, and we will not worry about this at this time.
The normal quantile plot has a slight S shape in the middle (indicating the high percent of data in
the middle) and some values that are outlying farther from the line than we might expect at the
ends of the distribution (indicating the tightness of the standard deviation about the mean in the
data). Again, the sample distribution is not normal, but it is not astonishingly different, either.

Calculating Normal Probabilities (continue here)


The density function of a normal distribution is:

𝑓(𝑥) = (1/(𝜎√(2𝜋))) 𝑒^(−(𝑥−𝜇)²/(2𝜎²))

No simple formula exists with which to calculate partial areas under the curve of a normal
distribution.

So we will use the tried and true normal tables, painstakingly first created by hand by the French
physicist and mathematician Christian Kramp in 1799! Using these will also give you an
appreciation of exactly what is going on later on when we begin to calculate tail probabilities in
order to investigate how unusual a particular sample of data is if a background population holds
true!

There are also online calculators that can calculate these areas; website URLs and photos from
two good ones are provided in the by-hand examples below. It is a good idea to check by-hand
work with the online calculators and vice-versa.
NORMAL TABLES:
X ~ N(μ, σ) – X has a normal distribution with a mean of µ and a standard deviation of σ
Z ~ N(0, 1) – Z is the standard normal distribution with a mean of 0 and a standard deviation of 1

Entries in this table give the area
under the curve to the left of the
z value. For example, for z = -.85,
the cumulative probability is .1977.

0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
-3.0 0.0013 0.0013 0.0013 0.0012 0.0012 0.0011 0.0011 0.0011 0.0010 0.0010
-2.9 0.0019 0.0018 0.0018 0.0017 0.0016 0.0016 0.0015 0.0015 0.0014 0.0014
-2.8 0.0026 0.0025 0.0024 0.0023 0.0023 0.0022 0.0021 0.0021 0.0020 0.0019
-2.7 0.0035 0.0034 0.0033 0.0032 0.0031 0.0030 0.0029 0.0028 0.0027 0.0026
-2.6 0.0047 0.0045 0.0044 0.0043 0.0041 0.0040 0.0039 0.0038 0.0037 0.0036
-2.5 0.0062 0.0060 0.0059 0.0057 0.0055 0.0054 0.0052 0.0051 0.0049 0.0048

-2.4 0.0082 0.0080 0.0078 0.0075 0.0073 0.0071 0.0069 0.0068 0.0066 0.0064
-2.3 0.0107 0.0104 0.0102 0.0099 0.0096 0.0094 0.0091 0.0089 0.0087 0.0084
-2.2 0.0139 0.0136 0.0132 0.0129 0.0125 0.0122 0.0119 0.0116 0.0113 0.0110
-2.1 0.0179 0.0174 0.0170 0.0166 0.0162 0.0158 0.0154 0.0150 0.0146 0.0143
-2.0 0.0228 0.0222 0.0217 0.0212 0.0207 0.0202 0.0197 0.0192 0.0188 0.0183

-1.9 0.0287 0.0281 0.0274 0.0268 0.0262 0.0256 0.0250 0.0244 0.0239 0.0233
-1.8 0.0359 0.0351 0.0344 0.0336 0.0329 0.0322 0.0314 0.0307 0.0301 0.0294
-1.7 0.0446 0.0436 0.0427 0.0418 0.0409 0.0401 0.0392 0.0384 0.0375 0.0367
-1.6 0.0548 0.0537 0.0526 0.0516 0.0505 0.0495 0.0485 0.0475 0.0465 0.0455
-1.5 0.0668 0.0655 0.0643 0.0630 0.0618 0.0606 0.0594 0.0582 0.0571 0.0559

-1.4 0.0808 0.0793 0.0778 0.0764 0.0749 0.0735 0.0721 0.0708 0.0694 0.0681
-1.3 0.0968 0.0951 0.0934 0.0918 0.0901 0.0885 0.0869 0.0853 0.0838 0.0823
-1.2 0.1151 0.1131 0.1112 0.1093 0.1075 0.1056 0.1038 0.1020 0.1003 0.0985
-1.1 0.1357 0.1335 0.1314 0.1292 0.1271 0.1251 0.1230 0.1210 0.1190 0.1170
-1.0 0.1587 0.1562 0.1539 0.1515 0.1492 0.1469 0.1446 0.1423 0.1401 0.1379

-0.9 0.1841 0.1814 0.1788 0.1762 0.1736 0.1711 0.1685 0.1660 0.1635 0.1611
-0.8 0.2119 0.2090 0.2061 0.2033 0.2005 0.1977 0.1949 0.1922 0.1894 0.1867
-0.7 0.2420 0.2389 0.2358 0.2327 0.2296 0.2266 0.2236 0.2206 0.2177 0.2148
-0.6 0.2743 0.2709 0.2676 0.2643 0.2611 0.2578 0.2546 0.2514 0.2483 0.2451
-0.5 0.3085 0.3050 0.3015 0.2981 0.2946 0.2912 0.2877 0.2843 0.2810 0.2776

-0.4 0.3446 0.3409 0.3372 0.3336 0.3300 0.3264 0.3228 0.3192 0.3156 0.3121
-0.3 0.3821 0.3783 0.3745 0.3707 0.3669 0.3632 0.3594 0.3557 0.3520 0.3483
-0.2 0.4207 0.4168 0.4129 0.4090 0.4052 0.4013 0.3974 0.3936 0.3897 0.3859
-0.1 0.4602 0.4562 0.4522 0.4483 0.4443 0.4404 0.4364 0.4325 0.4286 0.4247
0.0 0.5000 0.4960 0.4920 0.4880 0.4840 0.4801 0.4761 0.4721 0.4681 0.4641

Entries in this table give the area
under the curve to the left of the
z value. For example, for z = 1.25,
the cumulative probability is .8944.

0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.0 0.5000 0.5040 0.5080 0.5120 0.5160 0.5199 0.5239 0.5279 0.5319 0.5359
0.1 0.5398 0.5438 0.5478 0.5517 0.5557 0.5596 0.5636 0.5675 0.5714 0.5753
0.2 0.5793 0.5832 0.5871 0.5910 0.5948 0.5987 0.6026 0.6064 0.6103 0.6141
0.3 0.6179 0.6217 0.6255 0.6293 0.6331 0.6368 0.6406 0.6443 0.6480 0.6517
0.4 0.6554 0.6591 0.6628 0.6664 0.6700 0.6736 0.6772 0.6808 0.6844 0.6879
0.5 0.6915 0.6950 0.6985 0.7019 0.7054 0.7088 0.7123 0.7157 0.7190 0.7224

0.6 0.7257 0.7291 0.7324 0.7357 0.7389 0.7422 0.7454 0.7486 0.7517 0.7549
0.7 0.7580 0.7611 0.7642 0.7673 0.7704 0.7734 0.7764 0.7794 0.7823 0.7852
0.8 0.7881 0.7910 0.7939 0.7967 0.7995 0.8023 0.8051 0.8078 0.8106 0.8133
0.9 0.8159 0.8186 0.8212 0.8238 0.8264 0.8289 0.8315 0.8340 0.8365 0.8389
1.0 0.8413 0.8438 0.8461 0.8485 0.8508 0.8531 0.8554 0.8577 0.8599 0.8621

1.1 0.8643 0.8665 0.8686 0.8708 0.8729 0.8749 0.8770 0.8790 0.8810 0.8830
1.2 0.8849 0.8869 0.8888 0.8907 0.8925 0.8944 0.8962 0.8980 0.8997 0.9015
1.3 0.9032 0.9049 0.9066 0.9082 0.9099 0.9115 0.9131 0.9147 0.9162 0.9177
1.4 0.9192 0.9207 0.9222 0.9236 0.9251 0.9265 0.9279 0.9292 0.9306 0.9319
1.5 0.9332 0.9345 0.9357 0.9370 0.9382 0.9394 0.9406 0.9418 0.9429 0.9441

1.6 0.9452 0.9463 0.9474 0.9484 0.9495 0.9505 0.9515 0.9525 0.9535 0.9545
1.7 0.9554 0.9564 0.9573 0.9582 0.9591 0.9599 0.9608 0.9616 0.9625 0.9633
1.8 0.9641 0.9649 0.9656 0.9664 0.9671 0.9678 0.9686 0.9693 0.9699 0.9706
1.9 0.9713 0.9719 0.9726 0.9732 0.9738 0.9744 0.9750 0.9756 0.9761 0.9767
2.0 0.9772 0.9778 0.9783 0.9788 0.9793 0.9798 0.9803 0.9808 0.9812 0.9817

2.1 0.9821 0.9826 0.9830 0.9834 0.9838 0.9842 0.9846 0.9850 0.9854 0.9857
2.2 0.9861 0.9864 0.9868 0.9871 0.9875 0.9878 0.9881 0.9884 0.9887 0.9890
2.3 0.9893 0.9896 0.9898 0.9901 0.9904 0.9906 0.9909 0.9911 0.9913 0.9916
2.4 0.9918 0.9920 0.9922 0.9925 0.9927 0.9929 0.9931 0.9932 0.9934 0.9936
2.5 0.9938 0.9940 0.9941 0.9943 0.9945 0.9946 0.9948 0.9949 0.9951 0.9952

2.6 0.9953 0.9955 0.9956 0.9957 0.9959 0.9960 0.9961 0.9962 0.9963 0.9964
2.7 0.9965 0.9966 0.9967 0.9968 0.9969 0.9970 0.9971 0.9972 0.9973 0.9974
2.8 0.9974 0.9975 0.9976 0.9977 0.9977 0.9978 0.9979 0.9979 0.9980 0.9981
2.9 0.9981 0.9982 0.9982 0.9983 0.9984 0.9984 0.9985 0.9985 0.9986 0.9986
3.0 0.9987 0.9987 0.9987 0.9988 0.9988 0.9989 0.9989 0.9989 0.9990 0.9990

Example 5.5: CASE A: Given a value(s) on the Z axis, find an area. The table can only
calculate left areas!
Type 1: P(Z < a), e.g., P(Z < -1.96). The shaded area is a left area and can be read directly from
the table: P(Z < -1.96) = 0.025.

Type 2: P(Z > b), e.g., P(Z > 1.96). A right area cannot be read directly from the table, so the
left area is subtracted from the total area of 1: P(Z > 1.96) = 1 – P(Z < 1.96) = 1 – 0.975 = 0.025.

Type 3: P(a < Z < b), e.g., P(-1.96 < Z < 1.96). The shaded area splits so that one can subtract
the smaller left area from the bigger left area:
P(-1.96 < Z < 1.96) = P(Z < 1.96) – P(Z < -1.96) = 0.975 – 0.025 = 0.95.

Partial snapshots of the table: going across at -1.9 and down the 0.06 column yields 0.0250, the
area left of -1.96; going across at 1.9 and down the 0.06 column yields 0.9750, the area left of 1.96.

Above three photos and solutions at http://onlinestatbook.com/2/calculators/normal_dist.html.

Example 5.6: CASE B: Given an area, find a value on the Z axis. The table only finds left areas.

Left tail: Find ? such that P(Z < ?) = 0.02. Find the value closest to 0.02 in the body of the table:
this is 0.0202. The values at the left and the top together give ? = -2.05, so P(Z < -2.05) = 0.02.

Right tail: Find ? such that P(Z > ?) = 0.03. This is the SAME as finding ? such that
P(Z < ?) = 0.97. Find the value closest to 0.97 in the body of the table: this is 0.9699. The values
at the left and the top together give ? = 1.88, so P(Z > 1.88) = 0.03.

Photos and solutions from http://onlinestatbook.com/2/calculators/inverse_normal_dist.html

Most normal distributions are not the standard normal Z ~ N(0, 1).

But any normal problem for a N(µ, σ) can be solved with the Z distribution table.
This is because every X value can be translated to a Z value in a one-to-one manner.

Consider a set of student grades where X ~ N(60,12) (Here 60 = μ and 12 = ). The one-to-one Z
values are shown below.

FORMULA: Z = (X – μ)/σ
standardizes (translates) X to Z.
Each X has ONLY ONE Z.

X = 60 translates to Z = (60 – 60)/12 = 0
X = 72 translates to Z = (72 – 60)/12 = 1
X = 24 translates to Z = (24 – 60)/12 = -3

For an observed value of a variable x, the corresponding value of the standardized variable
z is called the z score of the observation.
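As a sketch (ours, not from the notes), here is the standardization formula applied to the X ~ N(60, 12) grades, with statistics.NormalDist supplying the left-tail areas that the printed table gives:

from statistics import NormalDist

mu, sigma = 60, 12

def z_score(x):
    # standardize: each X has only one Z
    return (x - mu) / sigma

print(z_score(72))  # 1.0
print(z_score(24))  # -3.0

# Left-tail area for X = 50 (compare with the table work later in this section;
# the table answer 0.2033 uses z rounded to -0.83, so it differs slightly):
print(round(NormalDist().cdf(z_score(50)), 4))  # 0.2023
print(round(NormalDist(mu, sigma).cdf(50), 4))  # same area, without standardizing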

Example 5.7 : There are two major tests of readiness for college, the ACT and the SAT. ACT
scores are reported on a scale from 1 to 36. The distribution of ACT scores is approximately
Normal with mean μ = 21.5 and standard deviation σ = 5.4. SAT scores are reported on a scale
from 600 to 2400. The distribution of SAT scores is approximately Normal with mean μ = 1498
and standard deviation σ = 316.

Emily scores 1832 on the SAT. Liam scores 27.5 on the ACT. Assuming that both tests measure
the same thing, who has the higher score (in a relative sense)?

Student   Z-Score                             Conclusion (full statement)
Liam      ACT ~ N(21.5, 5.4):                 In a relative sense, Liam has done slightly better than
          Z = (27.5 – 21.5)/5.4 = 1.111       Emily, because his z test score is slightly higher than hers.
Emily     SAT ~ N(1498, 316):
          Z = (1832 – 1498)/316 = 1.057

Example 5.8a): A Z standard normal curve has an area of 0.975 to the left of the value 1.96.

Because the area under a normal curve sums to 1, the area to the right of the value 1.96 is 0.025

The symbol zα is used to denote a z score that has an area of α to its right under the
standard normal curve.

We write z0.025 = 1.96

Example 5.8 b): Find the z score of z0.01

This is the value of z such that the area to the right of it is 0.01.

The area to the left of this z will have area 0.99 = 1 - 0.01 (because the area under a density curve
totals 1)

0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
2.3 0.9893 0.9896 0.9898 0.9901 0.9904 0.9906 0.9909 0.9911 0.9913 0.9916

Since 0.9901 is the area closest to 0.99, we find that 𝑧0.01 = 2.33

Example 5.8c): Find the z score of z0.10

This is the value of z such that the area to the right of it is 0.10.

The area to the left of this z will have area 0.90 = 1 - 0.10 (because the area under a density curve
totals 1)

0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
1.2 0.8849 0.8869 0.8888 0.8907 0.8925 0.8944 0.8962 0.8980 0.8997 0.9015

Since 0.8997 is the area closest to 0.90, we find that z0.10 = 1.28.
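These reverse lookups are one-liners with an inverse CDF. A minimal sketch (ours) using statistics.NormalDist.inv_cdf:

from statistics import NormalDist

Z = NormalDist()
# z_alpha has area alpha to its RIGHT, so look up area 1 - alpha on the LEFT.
for alpha in (0.025, 0.01, 0.10):
    print(alpha, round(Z.inv_cdf(1 - alpha), 2))  # 1.96, 2.33, 1.28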

Example 5.9: CASE A: Given a value(s) on the X axis, find an area. (Grades X ~ N(60, 12).)

Type 1: Probability of a grade below 50.
Z = (X – μ)/σ = (50 – 60)/12 = -0.83
P(X < 50) = P(Z < -0.83) = 0.2033 (a left area, read directly from the table)

Type 2: Probability of a grade above 90.
Z = (X – μ)/σ = (90 – 60)/12 = 2.5
P(X > 90) = P(Z > 2.5) = 1 – P(Z < 2.5) = 1 – 0.9938 = 0.0062
(area right = total area (= 1) – area left)

Type 3: Probability of a grade between 50 and 90.
P(50 < X < 90) = P(-0.83 < Z < 2.5) = P(Z < 2.5) – P(Z < -0.83) = 0.9938 – 0.2033 = 0.7905
(area between = area left of the larger X or Z – area left of the smaller X or Z)

Online results from http://onlinestatbook.com/2/calculators/normal_dist.html. (“by hand” and


online calculator results here differ due to rounding (the “by hand” calculation of the Z = -0.83
is rounded and that leads to a less accurate final result of 0.7905. The online 0.7915 is more
accurate. The online calculator https://captaincalculator.com/math/statistics/normal-
distribution-calculator/ which carries 8 decimals will allow you to see this using another
source. Student answers on assignments may vary slightly due to rounding.)

Example 5.10: CASE B: Given an area, find a value on the X axis.

What grade is the grade below which a person is in the bottom 2% of grades?
Find ? such that P(X < ?) = 0.02, that is, find ? such that P(Z < ?) = 0.02
(the X value such that the area left of it is 0.02).
Find the value closest to 0.02 in the tables: this is 0.0202.
The values at the left and the top together give ? = -2.05, so P(Z < -2.05) = 0.02.
So -2.05 = (x – 60)/12. SOLVE for x:
12(-2.05) = x – 60
x = 60 + 12(-2.05) = 35.4
A mark below 35.4 will put a person in the bottom 2% of grades.

What grade does a person need to exceed in order to be in the top 3% of grades?
Find ? such that P(X > ?) = 0.03, that is, find ? such that P(Z > ?) = 0.03.
SAME: find ? such that P(Z < ?) = 0.97 (the X value with area 0.97 to its left is the
same X value with area 0.03 to its right).
Find the value closest to 0.97 in the tables: this is 0.9699.
The values at the left and the top together give ? = 1.88, so P(Z > 1.88) = 0.03.
So 1.88 = (x – 60)/12. SOLVE for x:
12(1.88) = x – 60
x = 60 + 12(1.88) = 82.56
A mark above 82.56 is needed to be in the top 3% of grades.

Photos and solutions at http://onlinestatbook.com/2/calculators/inverse_normal_dist.html

Example 5.11 :
1. In March 2004, the average time a Canadian spent on the internet per week was 16 hours
(http://www.crtc.gc.ca/eng/NEWS/RELEASES/2004/r041214.htm ). Assuming this time is
normally distributed with a standard deviation of 1.5 hours, find

a) the probability that a person spent less than 15 hours per week on the internet.

P(X < 15) = P(Z < -0.67) = 0.2514

since z = (15 – 16)/1.5 = -0.67

b) the probability that a person spent more than 19 hours a week on the internet.

P(X > 19) = P(Z > 2.0) = 1 – P(Z < 2.0) = 1 – 0.9772 = 0.0228

since z = (19 – 16)/1.5 = 2

c) A person is considered to be a “light user” of the internet if their number of hours on the
internet per week is in the bottom 10% of hours used. Find the number of hours per week of
internet usage below which they would be termed a “light user”.

P(X < x) = P(Z < z) = 0.10

From the tables, the z value corresponding to an area of 0.10 in the left (bottom) tail is -1.28.
-1.28 = (x – 16)/1.5
x = -1.28(1.5) + 16 = 14.08
A person online for 14.08 hours or less is termed a “light user”.

d) A person is considered to be a “heavy user” of the internet if their number of hours on the
internet per week is in the top 5% of hours used. Find the number of hours per week of internet
usage above which they would be termed a “heavy user”.

P(X > x) = P(Z > z) = 0.05, which is the same as P(X < x) = P(Z < z) = 0.95.

From the tables, the z value corresponding to an area of 0.95 in the left (bottom) tail is 1.645.
1.645 = (x – 16)/1.5
x = 1.645(1.5) + 16 = 18.4675
A person online for 18.4675 hours or more is termed a “heavy user”.

Example 5.12: Targets produced by a company are normally distributed with an average
diameter of 20 cms and a standard deviation of 2 cms in diameter.

a) The pth percentile of a data set is the number that divides the bottom
p% of the data from the top (100 − p)% of the data. What is the 20th percentile for the targets?

Want x such that P( X < x) = 0.20


In tables, we find that the closest area to 0.20 is 0.2005
It corresponds to a z value of -0.84
0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
-0.8 0.2119 0.2090 0.2061 0.2033 0.2005 0.1977 0.1949 0.1922 0.1894 0.1867

z = (x – 20)/2 = -0.84

x – 20 = (-0.84)(2)
x – 20 = -1.68
x = 18.32 cm

b) A target is rejected by the quality control department if it is smaller than 15.9 cm. What
percent of targets are rejected?

P(X < 15.9) = P(Z < (15.9-20)/2 ) = P(Z < -4.1/2) = P(Z <-2.05) = 0.0202
0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
-2.0 0.0228 0.0222 0.0217 0.0212 0.0207 0.0202 0.0197 0.0192 0.0188 0.0183

c) A target is rejected by the quality control department if it is smaller than 15.9 cm. Compute
the probability of at least one rejection among a random sample of 5 targets. (5 marks)
The probability of rejecting a randomly picked target is p = P(X < 15.9) = 0.0202.
Randomly pick five targets.
Let Y be the number of rejections.
We can assume Y follows a binomial distribution with n = 5 and p = 0.0202.
P(Y ≥ 1) = 1 − P(Y = 0) = 1 − (5C0) × 0.0202⁰ × (1 − 0.0202)⁵⁻⁰
= 1 − (1)(1)(0.9798)⁵ ≈ 1 − 0.903 = 0.097
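A sketch (ours) chaining the normal and binomial steps of this example:

from statistics import NormalDist

X = NormalDist(mu=20, sigma=2)             # target diameters
p_reject = X.cdf(15.9)                     # P(X < 15.9), about 0.0202
p_none = (1 - p_reject) ** 5               # P(Y = 0) for Y ~ BIN(5, p_reject)
print(round(p_reject, 4), round(1 - p_none, 3))  # 0.0202 and 0.097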

Normal Probability Plots

A normal probability plot is another way of checking for normality in a data set. It compares the
actual sample data values to the normal score values you would expect to obtain from a normally
distributed random variable with your determined mean and standard deviation. If the sample
data is normal, then the points should fall roughly on a straight line.

Example 5.13 a)

Recall Example 1.1 where a set of n = 32 data points taken from a 1st year Statistics class
measured X = the number of breaths taken in a minute by the students

4 7 8 8 9 9 9 10 11 12 12 13 13 13 14 14
14 15 15 15 15 15 16 17 17 17 18 18 19 20 21 25

We created the following histogram. Notice that the bars at the ends occur because of the tail
values 4 and 25.

Below are normal quantile plots (normal probability plots) from two software programs. The first,
from SPSS, shows the observed values on the x axis and the expected normal values on the y
axis. The second, from R, shows the expected standardized normal values on the x axis and the
observed values on the y axis. Both show that the data values fall close to the line in general and
that the data is fairly normal. The observed values of 4 and 25 are seen to be tail values in the
SPSS plot. The R plot identifies these two points as occurring at position 1 and position 32. The
R plot also shows that most of the values lie within 2 standard deviations of the mean of the
breath data, provides boundaries within which we would like the points to fall, and shows that we
have multiple values at distinct points.

Example 5.13 b)

Traditional Vows versus Your Own Vows: 471 students were randomly selected at a University.
The ages for 246 students who plan to have (or have had) traditional wedding vows (group 1)
and 225 students who plan to write (or had written) their own wedding vows (group 2) were
recorded. Percent histograms and normal probability plots for the two groups are shown below.

Ages are right skewed for both vow groups on the histograms. The right skewness of the data
can also be readily identified on the probability plots, as the data forms a curve over the straight
line of the plot, with a shorter left end (below the line), a hump over the middle of the line, and a
longer right tail (below the line).

Example 5.13 c)

We revisit the data from Examples 2.2 and 2.3 below. Neither set of data is identified as having
skewness or outliers in the histograms (with a uniform pattern for example 2.2 and a four peaked
pattern with right and left gaps for example 2.3). The probability plot can suggest the pattern of
uniform data rather than a bell for Example 2.2 (as points go above the line for the left side of the
observed value scale and below the line for the right side of the observed values scale) and can
suggest the pattern of the two gaps in Example 2.3 (as the left most point and the right most
points on the scale are spread away from the center of the observed values and just noticeably
above and below the line). This is all a bit nebulous as there are only 7 data values in each data
set. But neither of these normal probability plots suggests outliers or skewness, which would be
concerning.

Data Histogram Normal Probability Plot
Example 2.2
4.00
5.00
6.00
7.00
8.00
9.00
10.00

Example 2.3
4.00
6.00
6.00
7.00
8.00
8.00
10.00

It is always a good idea to look at both your histogram and your probability plot. Many other
examples of probability plots can be found online and in textbooks. It is a fine art to learn all that
probability plots can tell you. Basically, if the points do not adhere closely to the line, your data
is not normal. Be sure that you can use normal probability plots to identify outliers and blatant
non-normality and skewness.

Chapter 6: Sampling Distributions

For the rest of the course, any sample used will be assumed to be a simple random sample.

In Statistical Inference, we take a sample from a parent population and use the sample results to
infer something about the population.

The population distribution of a variable is defined as the distribution of its values for all
members of the population.

The population distribution is also defined as the probability distribution of the variable
when we choose one individual at random from the population.

A sample distribution is a distribution of all values of one sample taken from a population.

Recall that a sample statistic is an estimate of a population parameter.

The sampling distribution of a statistic is the distribution of values taken by the statistic in
all possible samples of size n from the same population.

Often, an investigator has information about a distribution shape to use as a model for the
population, but may be uncertain about other parameters that define the population distribution.
Learning about a sampling distribution for a statistic that estimates a population parameter can
give us a fair idea about the general accuracy of our estimate.

Example 6.1: The sample mean 𝑥̅ (a statistic) estimates the population mean μ (a parameter).
𝑋̅ itself is a variable, and it varies depending upon the chosen sample. Sometimes when we
take a sample, our sample mean 𝑥̅ is a fairly accurate estimate of μ, and sometimes it is not. We
would like to know how close the 𝑥̅ estimates are to μ, in general. Hence, we are interested in the
distribution of the sample means when we sample from a parent population of any shape, or, as it
is more commonly known, the sampling distribution of 𝑋̅.

Example 6.2: Although all our subsequent work will be with continuous data, a simple example
with discrete data provides the gist of how a “sampling” distribution of 𝑥̅ estimates can vary
about . Earlier, a reading looked at a population of 10 units with the values 1, 2, … , 10. The
average of the population was µ = (1+2+…+10)/10 = 5.5. 45 possible samples of size 2 can be
taken from this population of size 10. Each sample of size 2 yields its own 𝑥̅ , some fairly close
to µ = 5.5 and some further away from 5.5 (see the dotplot). Note the average of the 𝑥̅ s can be
calculated to be 5.5 (the same value as µ). Hence, 𝑋̅ is an unbiased estimator of µ.

IMPORTANT CONDITIONS AND RESULTS: Suppose a SRS of size n is taken from a
population where a variable X that is to be measured has mean  and standard deviation 𝝈.

We can consider X1, X2, … , Xn (the n observations) to be independent random variables,


each having the same distribution (and of course, the same mean  and standard deviation
𝝈). Then the following are true, no matter what the shape of the population is.

The mean of the sampling distribution of 𝑋̅ is 𝝁𝑿̅ = 𝝁.

The standard deviation of the sampling distribution of 𝑋̅ is 𝝈𝑿̅ = 𝝈/√𝒏.

The sampling distribution of 𝑋̅ is centered at 𝝁, and its spread decreases as n increases.

Note: As long as the population is large relative to the sample size (as long as the population is
more than 20 times larger than the sample size), we can relax the independence requirement
above. (This is important when sampling without replacement, as we do with finite populations.
Usually, most populations sampled from are large.)
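These two facts are easy to see by simulation. The sketch below (ours; the exponential population is just an arbitrary skewed choice) draws many samples and compares the standard deviation of the sample means with σ/√n:

import random, statistics

random.seed(1)
population = [random.expovariate(1.0) for _ in range(100_000)]  # a skewed population
sigma = statistics.pstdev(population)

for n in (5, 25):
    means = [statistics.mean(random.sample(population, n)) for _ in range(10_000)]
    # observed spread of the x-bars vs the theoretical sigma / sqrt(n)
    print(n, round(statistics.stdev(means), 3), round(sigma / n ** 0.5, 3))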

But what more can we learn about the shape of the sampling distribution of X for
populations of different shapes?

There are many! possible samples of size n that can be taken from a population!

Recall that relative frequencies settle down to probabilities when the number of repetitions is
large. Hence a relative frequency distribution of many 𝑥̅ values calculated from many simple
random samples taken from a parent population will give us an excellent indication of the
sampling distribution of the random variable 𝑋̅.

We look online at an applet that can generate 100,000 samples from a continuous parent
population, calculate the 100,000 𝑥̅s, and then graph a relative frequency distribution
that indicates the shape of the relevant sampling distribution for us. Your instructor has taken
the liberty of smoothing the relative frequency distributions provided by the applet below.
http://onlinestatbook.com/stat_sim/sampling_dist/index.html

EXAMPLE: start with a normal parent distribution.

*Take a sample of size 5. Calculate 𝑥̅. Do 100,000 repetitions, calculating 𝑥̅ each time.
Graph the distribution of the 𝑥̅s (from the 100,000 samples of size 5).

Note 1: the 𝑥̅s are all fairly close to µ = 16.

*Take a sample of size 25. Calculate 𝑥̅. Do 100,000 repetitions, calculating 𝑥̅ each time.
Graph the distribution of the 𝑥̅s (from the 100,000 samples of size 25).

Note 2: the 𝑥̅s are all fairly close to µ = 16.

Note 3: the 𝑥̅s are, in general, closer to µ = 16 when n = 25 than when n = 5.

*Note: the website uses the notation N for sample size. We will use n for sample size.

Note 4: For n=5 and n=25, the distribution graph of the sample means for the 100,000 repetitions
is a good representation of the sampling distribution that occurs if we take all possible samples
and graph the distributions of all the sample means. Those distributions are both normal!

Note 5: As n increases (from 5 to 25), the distribution of sample means becomes less and less
spread out (its variance is smaller).

Note 6: The mean of all the sample means is the same as the mean of the population.

RESULT: SAMPLING DISTRIBUTION FOR A NORMAL PARENT POPULATION
(FOR ALL n):

If a population is N(µ, σ), then the sample mean 𝑋̅ of n independent observations is
N(µ, σ/√n). This result holds for all sample sizes, n.

That is: the distribution of the sample means (the sampling distribution) is also normal
shaped with the same mean µ, but a smaller standard deviation of σ/√n.

*As n increases, the distribution of sample means becomes less and less spread out (its variance
is smaller).

*The mean of all the sample means is the same as the mean of the population.

EXAMPLE: start with a parent distribution of any shape.

*Take a sample of size 5. Calculate 𝑥̅. Do 100,000 repetitions, calculating 𝑥̅ each time.
Graph the distribution of the 𝑥̅s (from the 100,000 samples of size 5).

Note 1: the 𝑥̅s are fairly spread out, not all that close to µ = 16.

*Take a sample of size 25. Calculate 𝑥̅. Do 100,000 repetitions, calculating 𝑥̅ each time.
Graph the distribution of the 𝑥̅s (from the 100,000 samples of size 25).

Note 2: the 𝑥̅s are all fairly close to µ = 16.

Note 3: the 𝑥̅s are, in general, closer to µ = 16 when n = 25 than when n = 5.

Note 4: For n = 5 and for n = 25, the distribution graph of the sample means for the 100,000
repetitions is a good representation of the sampling distributions that would occur if we took all
possible samples and graphed the distributions of all the sample means. Although the
distribution of means is a little tail heavy when n = 5, it is quite normal looking for n = 25.

Note 5: As n increases (from 5 to 25), the distribution of sample means becomes less and less
spread out (its variance is smaller).

Note 6: The mean of all the sample means is the same as the mean of the population.

CENTRAL LIMIT THEOREM:

If a sample is taken from a population (of any shape) with mean µ and finite standard
deviation σ, then for large n, the sampling distribution of 𝑋̅ is approximately N(µ, σ/√n).

That is, the distribution of the sample means (the sampling distribution) is approximately
normal shaped with the same mean µ, but a smaller standard deviation of σ/√n, as long as n
is large enough.

*As n increases, the distribution of sample means is less spread out (its variance is smaller).
*The mean of all possible sample means is the same as the mean of the population.

How large n needs to be depends on the shape of the original population. In some cases (a
very highly skewed population, for example) an n as high as 60 might be needed. In many
cases, n >= about 30 is considered large enough.
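The CLT is easy to check by simulation. Below is a rough Python sketch (our own illustration, mirroring the applet exercise above); the shifted exponential parent and the values µ = 16 and σ = 5 are just assumptions chosen to give a skewed population:

```python
# Simulate 100,000 sample means from a right-skewed parent population
# and check the CLT predictions for the mean and spread of the x-bars.
import random
from statistics import mean, stdev

random.seed(1)
mu, sigma = 16.0, 5.0  # assumed parent mean and standard deviation

def xbar(n):
    # one sample mean; the parent is exponential, shifted to have mean mu
    return mean(mu - sigma + random.expovariate(1 / sigma) for _ in range(n))

for n in (5, 25):
    means = [xbar(n) for _ in range(100_000)]
    # mean of the x-bars is near mu; their spread is near sigma / sqrt(n)
    print(n, round(mean(means), 2), round(stdev(means), 3))
```

A histogram of the n = 25 means would look quite normal even though the parent is strongly skewed.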

Inference using Sampling Distributions (µ is unknown, σ is known)

We use the above information to make inferences about the particular population from which we
draw a random sample in two types of problems:

1) problems where the population shape is normal and n is any size.

2) problems where the population shape is unknown and n is large (generally above 30).

For each type of problem, we will learn how to:

1) build an interval about a sample mean to get a range of estimates for the population mean, and
2) calculate probabilities (areas under the sampling distribution curve) that are of interest in a
hypothesis testing situation.

Point And Interval Estimation

The estimator x̅ estimates the parameter µ.

The estimator s estimates the parameter σ.

point (single) estimates

*point estimates will be different depending on the selected simple random sample
*some estimates will be close to our parameter of interest and some will not

A statistic used to estimate a parameter is an unbiased estimator if the mean of its sampling
distribution is equal to the true value of the parameter being estimated.

It is possible that more than one unbiased estimator of a parameter might exist.

The best choice of unbiased estimator is one where its variability is as small as possible.

We will work with the estimator x̅ in this course. One individual point estimate x̅ from one
sample might or might not be close to an unknown true population mean µ.

We decide to build an interval estimate for µ to capture what seem to be likely values for µ.

Confidence Intervals

Recall that the mean of all possible sample means taken from a population (that is, the mean of
the sampling distribution of 𝑋̅) is 𝝁𝑿̅ = µ and that the standard deviation of the sampling
distribution of 𝑋̅ is 𝝈𝑿̅ = σ/√n (becoming smaller as n increases).

(We mention again that it is assumed that the important necessary conditions mentioned above hold true (a SRS of
size n is taken from a population where a variable X that is to be measured has mean µ and standard deviation σ,
and we can consider X1, X2, … , Xn (the n observations) to be independent random variables, each having the same
distribution (and of course, the same mean µ and standard deviation σ). Again, the independence condition can be
relaxed as long as the population is more than 20 times bigger than the sample.)

Recall that 95% of all data falls within 1.96 standard deviations of the mean μ for a normal
distribution. So we add and subtract 1.96 standard deviations (that is, 1.96 σ/√n) to our 𝐱̅ to
obtain endpoints for the interval we are making.

We call this interval a 95% confidence interval for µ.

The 95% confidence interval for unknown true µ when our population is normal is:

𝑥̅ ± 1.96 σ/√n

When we take our simple random sample from a normal population, and make an interval, 95%
of the time, the interval will capture the true population μ, and 5% of the time it will not.

Example: Below are 20 sample means that could have been obtained when different simple
random samples were drawn from a parent population (it could either be 1) normal for any n, or
2) of any shape if n>=30). For each 𝑥̅, an interval of the same length is built using the above
formula, where we add and subtract 1.96 σ/√n to 𝑥̅.

We expect that 95% of all intervals made this way will capture the true mean μ.

We are 95% certain an interval created this way will contain the unknown true population
mean μ. We cannot know if one particular interval created in this way will contain μ.
Statistics Canada always says: 19 times out of 20 (or 95 times out of 100) (i.e. 95% of the
time), this (calculated) interval will contain the true population mean.

For n>=30, the sampling distribution used to create an interval for the true mean µ is
approximately normal regardless of the original population distribution shape.

For n <30, we must have a normal population in order to ensure that the sampling distribution
used to create the interval for the true mean µ is normal.

If we have unlimited money or are doing sensitive research, we might wish to be more confident
that our interval captures µ and want a 99% confidence interval. If precise accuracy is not
important with our experiment, and money is limited, we might be satisfied with a 90% interval.

The general formula for a (1 - α)% confidence interval for µ for a normal population is

𝒙̅ ± zα/2 σ/√n

where P(Z < -zα/2) = α/2 and P(Z > zα/2) = α/2.

Below are intervals for commonly used levels of confidence.

(1 - α)%:   80%      90%      95%      98%      99%
α:          0.200    0.100    0.050    0.020    0.010
α/2:        0.100    0.050    0.025    0.010    0.005
zα/2:       1.282    1.645    1.960    2.326    2.576
Interval:   x̅ ± zα/2 σ/√n, with the zα/2 value above

Ex 6.3: x̅ = 10, σ = 120, n = 36, so σ/√n = 120/√36 = 20
80%: 10 ± 1.282(20) = 10 ± 25.64 = (-15.64, 35.64)
90%: 10 ± 1.645(20) = 10 ± 32.90 = (-22.90, 42.90)
95%: 10 ± 1.960(20) = 10 ± 39.20 = (-29.20, 49.20)
98%: 10 ± 2.326(20) = 10 ± 46.52 = (-36.52, 56.52)
99%: 10 ± 2.576(20) = 10 ± 51.52 = (-41.52, 61.52)

Ex 6.4: x̅ = 10, σ = 120, n = 64, so σ/√n = 120/√64 = 15
80%: 10 ± 1.282(15) = 10 ± 19.23 = (-9.23, 29.23)
90%: 10 ± 1.645(15) = 10 ± 24.675 = (-14.675, 34.675)
95%: 10 ± 1.960(15) = 10 ± 29.40 = (-19.40, 39.40)
98%: 10 ± 2.326(15) = 10 ± 34.89 = (-24.89, 44.89)
99%: 10 ± 2.576(15) = 10 ± 38.64 = (-28.64, 48.64)

Note the interval width increases as the confidence level increases when σ/√n is constant. Note
the interval width decreases as n increases when the confidence level remains constant.
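A small Python sketch (ours, using scipy for the z values) reproduces any row of the table above; the numbers shown check the 95% row of Ex 6.4:

```python
# z confidence interval: x-bar +/- z_(alpha/2) * sigma / sqrt(n)
from math import sqrt
from scipy.stats import norm

def z_interval(xbar, sigma, n, conf=0.95):
    z = norm.ppf(1 - (1 - conf) / 2)  # z_(alpha/2)
    e = z * sigma / sqrt(n)           # margin of error
    return (xbar - e, xbar + e)

print(z_interval(10, 120, 64, 0.95))  # approximately (-19.4, 39.4)
```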

Example 6.3: (a) A group of Statistics students survey their 36 classmates and ask them at what age
they plan to marry. Assume (for educational purposes only) that the sample is a simple random sample
from a larger parent population of all Statistics students at their university this year that has a standard
deviation of 2. The “sample” average age is reported to be 28. Construct a 90% confidence interval for
the mean age at which statistics students at your university this year plan to marry.
I: Popn: NOT normal, σ = 2 (known)
Sample: x̅ = 28, n = 36
1 - α = .90, α = .10, α/2 = 0.05

II: Formula (we can proceed because the sample size is large (above 30))
x̅ ± zα/2 σ/√n

III: zα/2 = z0.05 = 1.645 (from the table above or by looking in the z tables from scratch)

IV: 28 ± (1.645)(2/6) = 28 ± 0.5483

V: A 90% confidence interval for µ, the mean age at which statistics students at your university
this year plan to marry, is
(28 – 0.5483, 28 + 0.5483)
= (27.4517, 28.5483) years
(b) Could we do this problem if n was 20?
No, because the shape of the population is unknown, and n would be too small for the sampling
distribution to be normal.

Word Problems: Calculating Probabilities (areas under a sampling distribution curve) to
investigate a supposition (hypothesis, belief, suspicion, hope)

Example 6.4: Application of Sampling Distribution for a Normal Population: The age at
which young people in your long-term economically depressed town traditionally leave their
parent’s “nest” to live on their own has been distributed as normal with mean 25.3 years and
standard deviation of 3 years. That is, the population is N(µ,σ) = N(25.3, 3). Recently, there has
been an upswing in the economy, and an upsurge in new jobs in the area. Yay! We hypothesize
that this new economic security is leading young people to establish their own homes at an
earlier age. A random sample of n=36 young people is taken and their ages are recorded. The
sample average age is x̅ = 24 years. We wonder if this sample average age of 24 years might
provide “evidence” that young people are leaving home earlier. We examine the probability that
the average age would be less than 24 years if the town was still experiencing an economic
depression. If this probability indicates that it is “likely” that we would observe an average value
of 24 years or less if nothing had changed (the town still had an economic depression), we will
decide that there is no evidence that the economic upsurge is causing young people to leave their
parent’s “nest” earlier. If the probability indicates that it is not “likely” we would observe an
average value of 24 years or less if nothing had changed (and the town still had an economic
depression), we will decide to reject the idea that nothing has changed, and believe there is
evidence that the economic upsurge is in fact causing people to leave home earlier.

Find P(𝑋̅ < 24)

Population: Normal, µ = 25.3, σ known = 3
Sample: x̅ = 24, n = 36

X ~ N(µ, σ) ~ N(25.3, 3), so 𝑋̅ ~ N(µ, σ/√n) ~ N(25.3, 3/√36) ~ N(25.3, 0.5)

P(𝑋̅ < 24) = P(Z < -2.6) = 0.0047

using z = (x̅ - µ)/(σ/√n) = (24 - 25.3)/(3/√36) = -1.3/0.5 = -2.6

(the x̅ value of 24 standardizes to a Z value of -2.6 and we can use the Z tables (or an online
calculator) to find our probability)

There is a probability of 0.0047 of observing an average age of 24 years or less for leaving home
if the town was still in an economic depression (nothing had changed). This is a fairly low
probability (quite unlikely), in the scheme of things. We have evidence that the economic
upsurge is leading young people to establish their own homes at an earlier age.
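As a cross-check, the same probability can be computed directly (a sketch of ours using scipy; the inputs are exactly those of the example):

```python
# P(X-bar < 24) when X-bar ~ N(25.3, 3/sqrt(36)) = N(25.3, 0.5)
from math import sqrt
from scipy.stats import norm

mu, sigma, n, xbar = 25.3, 3, 36, 24
z = (xbar - mu) / (sigma / sqrt(n))  # -2.6
print(norm.cdf(z))                   # approximately 0.0047
```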

Example 6.5: Application of Central Limit Theorem: Pundits are saying that the weather is
good for moth habitat in your area this year, and they hypothesize that moths are living longer
this year. A random sample of n=64 moths has been taken and it is found that their average
lifetime is x̅ = 19 days. We wonder if this sample average lifetime of 19 days provides
“evidence” that the current habitat is indeed giving moths the nutrition and shelter they need to
live longer. We examine the probability that the average lifetime would be more than 19 days if
it was a regular year for habitat in your area. If the probability indicates that it is “likely” that we
would observe an average value of 19 or more days in a year of regular habitat, we will decide
that there is no evidence that the weather is bringing a habitat that increased the moth’s average
lifetime. If the probability indicates that it is not “likely” we would observe an average value of
19 days or more in a year of regular habitat, we will decide that there is evidence that the current
weather is bringing a habitat that has increased the moth’s average lifetime. The shape of the
population of lifetime days for a moth that lives in the regular habitat of your area is not normal,
but it is known that the mean is µ = 20 days on average with a standard deviation σ = 5 days.

Find P(𝑋̅ > 19)

Population: Unknown shape, µ = 20, σ known = 5
Sample: x̅ = 19, n = 64

Because n is large (larger than 30 is considered large enough, in general), we can apply the
Central Limit Theorem. (We don't need to have a normal population.)

By CLT: 𝑋̅ ~ N(µ, σ/√n) ~ N(20, 5/√64) ~ N(20, 0.625)

P(𝑋̅ > 19) = P(Z > -1.6) = 1 – P(Z < -1.6) = 1 – 0.0548 = 0.9452

using z = (x̅ - µ)/(σ/√n) = (19 - 20)/(5/√64) = -1/0.625 = -1.6

(the x̅ value of 19 standardizes to a Z value of -1.6 and we can use the Z tables (or an online
calculator) to find our probability)

There is a probability of 0.9452 of observing an average lifetime of 19 days or longer in a regular
habitat. This is a fairly high (not unlikely) probability, in the scheme of things. We do
not have evidence that there is something “magical” going on with the weather this year that is
leading moth lifetimes to be longer, on average, than normal.

Sample Size Determination:

Recall that all confidence intervals follow a certain pattern: (point estimate) ± (margin of error).

We call the part after the ± the MARGIN OF ERROR (E).

SAMPLE SIZE DETERMINATION

For a confidence interval for some unknown µ, for a given E and level of confidence, we can
calculate a sample size n as follows.


E = zα/2 σ/√n

√n E = zα/2 σ

√n = zα/2 σ / E

n = (zα/2)² σ² / E²

Note: This assumes you have a “best” guess for σ². This is often based on prior surveys or
experiments of a similar nature.

*for a fixed n, to be more confident (have a higher level of confidence), your interval will have to
be larger. That is, for a fixed n, your interval will be larger for a larger confidence level.
Your z or t statistic would be bigger, leading to a larger confidence interval. Note that to be
100% confident, your interval would have to be infinite!

*for a fixed level of confidence, to make an interval smaller, increase your sample size n (a
higher n means √n is higher, and σ/√n is smaller)
For our examples:
“estimate with 95% accuracy” – this means 1 - α = .95
“within xx units” – this means E = xx units
“within xx%” – this means E = xx%

Example 6.6: How many MacEwan students identifying as female would we need to sample,
with 95% accuracy, in order to estimate the true average height of female MacEwan students to
within 0.75 inches? Assume the population standard deviation is 3 inches.
1 - α = .95, α = .05, α/2 = .025
E = .75 inches
σ = 3 inches

zα/2 = z.025 = 1.96 (from tables)

n = (zα/2)² σ² / E² = (1.96)²(3)² / (0.75)² = 61.46

Take a sample of 62 women (always round up).*

*Round up to 62. It’s a bit hard to sample a fraction of a person!

So we would need to sample 62 students identifying as female in order to estimate the true
average height of MacEwan females to within 0.75 inches.

Always round up; we want to be conservative. Using 62 gives us a slightly smaller margin of
error than requested. This is something the client would likely prefer, anyway.

For n = 61, E = zα/2 σ/√n = (1.96)(3/√61) = 0.7529

For n = 62, E = zα/2 σ/√n = (1.96)(3/√62) = 0.7468
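The whole calculation, including the rounding-up step, fits in a few lines of Python (our sketch; scipy supplies zα/2):

```python
# n = (z_(alpha/2) * sigma / E)^2, rounded UP to the next whole unit
from math import ceil
from scipy.stats import norm

def sample_size(sigma, e, conf=0.95):
    z = norm.ppf(1 - (1 - conf) / 2)  # z_(alpha/2)
    return ceil((z * sigma / e) ** 2)

print(sample_size(sigma=3, e=0.75, conf=0.95))  # 62
```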

Chapter 7: Inference: (t Distribution Problems)

T – DISTRIBUTION

The whole point of sampling is to make an educated inference about the population. Has anyone
noticed something a little strange? We have been assuming that we know the population mean µ
and standard deviation σ! That’s not so likely to happen in real life, is it? Let’s look at what we
can do when we don’t know σ and must use s to estimate it (a more realistic situation!).

We assume that we are sampling from a normal population (for any n)
OR
that our sample size is an n > 30 if our population shape is unknown or non-normal.

We estimate σ with s, and use t = (𝑿̅ − µ)/(S/√n) instead of Z = (𝑿̅ − µ)/(σ/√n).

The t distribution is similar to the Z distribution. It looks like a flattened bell curve (has longer
flatter tails and a mound of less height), and it is centred on 0.

Convention: We use the term “degrees of freedom” when we talk about a t distribution.
Degrees of freedom = n – 1 (you can think of the 1 as being subtracted because we are estimating
σ with s). The higher the degrees of freedom, the more the shape of the t distribution
approaches the shape of a standard normal distribution.

*When we estimate σ with s, the sampling distribution is a t distribution, not a Z distribution.

*t distribution tables allow us to approximate areas under the t curve, and to bracket probabilities
to the right of given t numbers. They can be calculated exactly with online calculators.

*When n >= 30, we can use the z distribution to approximate the t distribution.

The t distribution was discovered by William Sealy Gosset (1876-1937). He read chemistry and
mathematics at New College, Oxford before joining the Dublin brewery of Arthur Guinness &
Son. He continued to pursue his mathematical interests (with the mentorship of Karl Pearson,
another famous early mathematician) and his knowledge served him well at the brewery, where
he was working with quality control with small batches of beer. He published several papers on
small sample theory while working at Guinness. However, because company policy forbade
employees from publishing papers, he was unable to publish the works under his own name, but,
with the company’s permission, he was allowed to publish under the name “Student”. Thus, for
many years, the t distribution was referred to as the Student’s t distribution.
http://en.wikipedia.org/wiki/William_Sealy_Gosset

t Confidence Intervals

Recall from our important result above that the mean of all possible sample means taken from a
population (or the mean of the sampling distribution of 𝑋̅) is 𝝁𝑿̅ = µ and that the standard
deviation of the sampling distribution of 𝑋̅ is 𝝈𝑿̅ = σ/√n (becoming smaller as n increases).

(It is assumed that the important necessary conditions mentioned above hold true (a SRS of size n is taken from a
population where a variable X that is to be measured has mean µ and standard deviation σ, and we can consider X1,
X2, … , Xn (the n observations) to be independent random variables, each having the same distribution (and of
course, the same mean µ and standard deviation σ). The independence condition can be relaxed as long as the
population is more than 20 times bigger than the sample.)

We assume that we are sampling from a normal population (for any n)
OR
that our sample size is an n > 30 if our population shape is unknown or non-normal.

The general formula for a (1 - α)% confidence interval for unknown true µ is now:

𝑥̅ ± tα/2, n-1 s/√n

where P(t < -tα/2, n-1) = α/2 and P(t > tα/2, n-1) = α/2 for n-1 degrees of freedom.

For reference, a table that lists appropriate confidence intervals under various assumptions
follows. Note that if the population shape is not normal, we cannot build a confidence interval
when our sample size is small (n < 30).

Summary of Confidence Interval formulas for µ

n < 30:
- Known σ, normal shape: (𝑋̅ − µ)/(σ/√n) is N(0,1); use x̅ ± zα/2 σ/√n
- Known σ, not normal (or unknown) shape: cannot use z or t
- σ unknown (s known), normal shape: (𝑋̅ − µ)/(S/√n) is tn-1; use x̅ ± tα/2, n-1 s/√n
- σ unknown (s known), not normal (or unknown) shape: cannot use z or t

n ≥ 30:
- Known σ, normal shape: (𝑋̅ − µ)/(σ/√n) is N(0,1); use x̅ ± zα/2 σ/√n
- Known σ, not normal (or unknown) shape: (𝑋̅ − µ)/(σ/√n) is ≈ N(0,1) by CLT; use x̅ ± zα/2 σ/√n
- σ unknown (s known), normal shape: (𝑋̅ − µ)/(S/√n) is tn-1; use x̅ ± tα/2, n-1 s/√n (exact), or
  x̅ ± zα/2 s/√n (approx; use only if the df are very high, otherwise choose the tα/2, n-1 for the
  next lowest df in the table)
- σ unknown (s known), not normal (or unknown) shape: (𝑋̅ − µ)/(s/√n) is ≈ N(0,1) by CLT
  (σ unknown, use s); use x̅ ± zα/2 s/√n (approx; use only if the df are very high, otherwise
  choose the tα/2, n-1 for the next lowest df in the table)

t-tables – values in the table match upper tail areas for given degrees of freedom

Area in Upper Tail


df 0.100 0.050 0.025 0.010 0.005
1 3.078 6.314 12.706 31.821 63.656
2 1.886 2.920 4.303 6.965 9.925
3 1.638 2.353 3.182 4.541 5.841
4 1.533 2.132 2.776 3.747 4.604
5 1.476 2.015 2.571 3.365 4.032
6 1.440 1.943 2.447 3.143 3.707
7 1.415 1.895 2.365 2.998 3.499
8 1.397 1.860 2.306 2.896 3.355
9 1.383 1.833 2.262 2.821 3.250
10 1.372 1.812 2.228 2.764 3.169
11 1.363 1.796 2.201 2.718 3.106
12 1.356 1.782 2.179 2.681 3.055
13 1.350 1.771 2.160 2.650 3.012
14 1.345 1.761 2.145 2.624 2.977
15 1.341 1.753 2.131 2.602 2.947
16 1.337 1.746 2.120 2.583 2.921
17 1.333 1.740 2.110 2.567 2.898
18 1.330 1.734 2.101 2.552 2.878
19 1.328 1.729 2.093 2.539 2.861
20 1.325 1.725 2.086 2.528 2.845
21 1.323 1.721 2.080 2.518 2.831
22 1.321 1.717 2.074 2.508 2.819
23 1.319 1.714 2.069 2.500 2.807
24 1.318 1.711 2.064 2.492 2.797
25 1.316 1.708 2.060 2.485 2.787
26 1.315 1.706 2.056 2.479 2.779
27 1.314 1.703 2.052 2.473 2.771
28 1.313 1.701 2.048 2.467 2.763
29 1.311 1.699 2.045 2.462 2.756
30 1.310 1.697 2.042 2.457 2.750
31 1.309 1.696 2.040 2.453 2.744
32 1.309 1.694 2.037 2.449 2.738
33 1.308 1.692 2.035 2.445 2.733
34 1.307 1.691 2.032 2.441 2.728
35 1.306 1.690 2.030 2.438 2.724
36 1.306 1.688 2.028 2.434 2.719
37 1.305 1.687 2.026 2.431 2.715
38 1.304 1.686 2.024 2.429 2.712
39 1.304 1.685 2.023 2.426 2.708
40 1.303 1.684 2.021 2.423 2.704
60 1.296 1.671 2.000 2.390 2.660
120 1.289 1.658 1.980 2.358 2.617
 1.282 1.645 1.960 2.326 2.576

For degrees of freedom (n – 1) and upper tail areas not in the table, your instructor recommends
the following online calculator: http://www.statdistributions.com/t/ . On tests, you will have to
use the by-hand tables, and if the degrees of freedom are not in the table (sporadic above
40), please use the last row of the t-tables, which gives the zα/2 values that can be used as
approximations of tα/2, n-1 values in that case.

Example 7.1: practice reading tα/2, n-1 (the t value with area α/2 to its right for n-1 degrees of
freedom) from the tables and using that tα/2, n-1 in building confidence intervals. For all
examples below, 𝑥̅ = 10 and s = 25, and the interval is 𝑥̅ ± tα/2, n-1 s/√n.

90% confidence level (α = 0.10, α/2 = 0.05, n = 10):
10 ± t0.05, 9 (25/√10) = 10 ± (1.833)(25/√10) = 10 ± 14.491
90% CI is (-4.491, 24.491)

95% confidence level (α = 0.05, α/2 = 0.025, n = 17):
10 ± t0.025, 16 (25/√17) = 10 ± (2.120)(25/√17) = 10 ± 12.854
95% CI is (-2.854, 22.854)

99% confidence level (α = 0.01, α/2 = 0.005, n = 10):
10 ± t0.005, 9 (25/√10) = 10 ± (3.250)(25/√10) = 10 ± 25.694
99% CI is (-15.694, 35.694)

Example 7.2: A new drug to combat AIDS is appearing very hopeful. A random sample of
46 patients with AIDS reveals that taking this drug extended their average lifetime by 3 years
with a standard deviation of .5 years. Find a 95% confidence interval for the true average
increase in lifetime offered to AIDS patients by this new drug. Assume the population of
lifetimes is normal.

Popn: shape normal, σ unknown
Sample: n = 46, x̅ = 3, s = 0.5
1 - α = .95, α = .05, α/2 = .025

Formula (check in the CI table above): x̅ ± tα/2, n-1 s/√n for n – 1 = 46 – 1 = 45 degrees of freedom

tα/2, n-1 = t0.025, 45 = 2.014 (the t value with .025 in the right tail of the t distribution with 45 df is 2.014)

45 degrees of freedom is not in the t-table. Our t-value is from the online calculator at
http://www.statdistributions.com/t/ . Enter your p-value (0.025) and your degrees of freedom
(45) and tell it to calculate the t-value. It will return 2.014.

x̅ ± tα/2, n-1 s/√n = 3 ± 2.014 (0.5/√46) = 3 ± 0.1485

Therefore, a 95% confidence interval for the true average increase in years of lifetime is
(2.8515, 3.1485) (in years).

Note that df = 45 is not in the t-tables. On tests, choose the tα/2, n-1 for the next lowest df in the
printed t-table provided.
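The interval in Example 7.2 can be reproduced with a short Python sketch of ours; scipy.stats.t plays the role of the online calculator and returns the exact t value for 45 df:

```python
# t confidence interval: x-bar +/- t_(alpha/2, n-1) * s / sqrt(n)
from math import sqrt
from scipy.stats import t

xbar, s, n, conf = 3, 0.5, 46, 0.95
tval = t.ppf(1 - (1 - conf) / 2, df=n - 1)  # approximately 2.014
e = tval * s / sqrt(n)                      # approximately 0.1485
print((xbar - e, xbar + e))                 # approximately (2.8515, 3.1485)
```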

Example 7.3 (working with raw data)


A student group collecting data on student career plans among their peers collected data about
the age at which their peers knew what they “wanted to be when they grew up”. They assume
(quite loosely and without any rigour) that the data collected from 55 classmates consists of a
simple random sample of all students at their university this year. They wish to form a 95%
confidence interval for the average age at which students at the university knew “what they
wanted to be when they grew up”.

Data (the ages) is as follows.


3 5 6 7 7 8 8 8 9 9 10 10 10 10 10
10 10 10 11 11 11 11 11 11 11 11 11 11 11 12
12 12 12 12 12 12 13 13 13 14 14 15 16 16 17
17 18 19 19 20 20 21 23 25 31

The students examined a bar chart of their collected data. It appeared fairly bell shaped below
the value of 19, but has a bit of a right tail extending up to 31. However, their sample size, at
55, is considered fairly large. We proceed to create a confidence interval.
[Bar chart: “The Age at which Students Knew ‘What they Wanted to be when they Grew Up’”;
x-axis: Ages (3 to 31), y-axis: Frequency]

We might find an acceptable confidence interval to use in this case is x̅ ± zα/2 s/√n.

The students calculate their 𝑥̅ and s to be 12.71 and 5.08 years, respectively. Then they
calculate their needed zα/2 = z0.025 = 1.96. Finally, their confidence interval is calculated to be

12.71 ± 1.96 (5.08/√55) = 12.71 ± 1.34

So a 95% confidence interval for the average age at which students at the university this year
knew what they wanted to be when they grew up is (11.37, 14.05) (in years).

To be conservative, if your printed t-tables only went up to 40 df, you would prefer the
confidence interval x̅ ± tα/2, n-1 s/√n = 12.71 ± (2.021)(5.08/√55) = 12.71 ± 1.38, as this uses
t0.025, 40 = 2.021, and returns a confidence interval of (11.33, 14.09) (in years), which is slightly
wider.
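Starting from the raw ages, the whole calculation looks like this in Python (our sketch; the data are the 55 values listed above):

```python
# Compute x-bar and s from the raw data, then build the 95% z interval.
from math import sqrt
from statistics import mean, stdev
from scipy.stats import norm

ages = [3, 5, 6, 7, 7, 8, 8, 8, 9, 9, 10, 10, 10, 10, 10,
        10, 10, 10, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 12,
        12, 12, 12, 12, 12, 12, 13, 13, 13, 14, 14, 15, 16, 16, 17,
        17, 18, 19, 19, 20, 20, 21, 23, 25, 31]

xbar, s, n = mean(ages), stdev(ages), len(ages)  # about 12.71, 5.08, 55
e = norm.ppf(0.975) * s / sqrt(n)                # about 1.34
print((xbar - e, xbar + e))                      # about (11.37, 14.05)
```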

Word Problems: Calculating Probabilities (areas under a t sampling distribution curve) to
investigate a supposition (hypothesis, belief, suspicion, hope)

Example 7.4: A general health score that measures nutritional intake and physical wellbeing
has been administered in a certain geographical area for several years, and it is normally
distributed with mean 100. Recently, with a general upswing in development in the area, there is
a belief that the general health score of the population in the geographical area has increased. A
random sample of n = 16 people is taken, and general health scores are recorded. The sample
average health score is x̅ = 110, and the sample standard deviation is s = 20. We wonder if this
sample average health score of 110 might provide “evidence” that the average general health
score has increased in the population. We examine the probability that the average health score
would be greater than 110 if the town had not experienced this general upswing (that is, if
nothing has changed). If this probability indicates that it is “likely” that we would observe an
average health score of 110 or more if nothing had changed (the geographical area was still
underdeveloped), we will decide that there is no evidence that the development upsurge is
causing an increase in health scores. If the probability indicates that it is not “likely” we would
observe an average health score of 110 or more if nothing had changed (the geographical area
was still underdeveloped), we will decide that there is evidence that development is causing an
increase in average health score.

Find P(𝑋̅ > 110)

Population: Normal, µ = 100, σ unknown (s known)
Sample: x̅ = 110, s = 20, n = 16

(𝑋̅ − µ)/(s/√n) = (𝑋̅ − 100)/(20/√16) is a t distribution with n - 1 = 15 degrees of freedom

P(𝑋̅ > 110) = P(t15 > 2) --- the area to the right of the value 2 for a t distribution with 15 df,

since (𝒙̅ − µ)/(s/√n) = (110 − 100)/(20/√16) = 2

The online calculator at http://www.statdistributions.com/t/ is used here. Enter your test statistic
value (here 2) and your degrees of freedom (here 15) and tell it you want it to calculate the
probability in the right tail (i.e. P(t15 > 2)). It will return P(t15 > 2) = 0.032.

However, the t-tables used when the problem is done by hand can only “bracket”
probabilities in the right tail of a t distribution.

Go to the t-table and place your ruler under the row at df = 15.

Areas in the upper tail:  0.05 ------------- 0.025
t-values at df = 15:      1.753 ---- 2 ---- 2.131

The area to the right of 1.753 is 0.05, and the area to the right of 2.131 is 0.025.
By hand, all we can say is that the area to the right of 2 is between 0.025 and 0.05.

We write 0.025 < P(t15 > 2) < 0.05

Conclusion: There is a probability of 0.032 of observing an average health score of 110 or
more if the geographical area was still underdeveloped (nothing had changed). This is a fairly
low probability (quite unlikely), in the scheme of things. We have evidence that the development
of the geographical area is leading to an upsurge in health scores.
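The exact tail area the online calculator reports can also be obtained in Python (a sketch of ours); t.sf is the upper tail probability:

```python
# P(t_15 > 2): the exact area the hand tables can only bracket
from scipy.stats import t

print(t.sf(2, df=15))  # approximately 0.032, inside the bracket (0.025, 0.05)
```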

Example 7.5: A group of statistics students collect some pulse rate data (ratio data) from a
simple random sample of 25 statistics students at MacEwan in the 10 minutes prior to the
beginning of the final exam. Assume the simple random sample can be considered to be taken
from a much larger normal shaped parent population of pulse rates of all statistics students who
have ever sat a final statistics exam at MacEwan. It is suspected that the average pulse rate of
the class will be elevated above the regular pulse rate of healthy young adults, which is 72 beats
per minute. They find that the average pulse rate for this sample is 85 and the standard
deviation is 5 beats per minute. If nothing was unusual, what is the probability that you would
find a sample average pulse rate greater than 85 beats per minute when you took a simple
random sample?

Population: Normal, µ = 72 (healthy unstressed young adult), σ unknown
Sample: x̅ = 85, s = 5, n = 25. We can use the t distribution to do this problem.

P(𝑋̅ > 85) = P(t24 > (85 − 72)/(5/√25)) = P(t24 > 13)

From the t tables, for 24 degrees of freedom, this probability is smaller than 0.005:

P(t24 > 13) < 0.005

From the website, P(t24 > 13) ≈ 0.

So, if nothing had changed, the probability of observing an average pulse rate of above 85 beats
per minute (which we did observe on the test day) is very small. We are therefore likely to
believe that the test is increasing student anxiety and their average pulse rate.

Example 7.6: Extra practice bracketing P(tn-1 > some number) or P(tn-1 < some number)

Find P(t16 < -2.0) = P(t16 > +2.0) (by symmetry). Our website returns P(t16 < -2) = 0.031.
d.f. = 16, place ruler on that row.
Find 2.0, which is between 1.746 and 2.120.
P(t16 > 1.746) = 0.05 and P(t16 > 2.120) = 0.025
Bracket the probability: 0.025 < P(t16 > 2.0) < 0.05
By symmetry: 0.025 < P(t16 < -2.0) < 0.05

Find P(t25 > 1.2). Our website returns P(t25 > 1.2) = 0.121.
d.f. = 25, place ruler on that row.
Find 1.2; it is off the left end of the row, before 1.316.
P(t25 > 1.316) = 0.10
Bracket the probability: 0.10 < P(t25 > 1.2) < 0.50
(The area to the right of 1.2 is bigger than 0.10, but smaller than 0.50 because 1.2 > 0.)

Chapter 8: Hypothesis Testing

All our word problems so far involve testing a null hypothesis about µ (the population mean)
versus an alternative hypothesis (worry, supposition, hope, belief) that µ has moved (left or
right) (causing a population shift). A sample is taken from the population. We calculate a sample
statistic x̅ that is compared to the population µ in the null hypothesis. We recognize that 𝑋̅ is a
random variable that changes according to the sample taken from the population.

We can look at the sample histogram of the one sample we take to get an idea of the shape of
the population. (We bear in mind there are many possible samples that can be taken from the
population and that another sample would yield another x̅ value and a sample histogram that is
somewhat different. However, we hope that our one sample is a representative one (and note
that taking a fairly large sample helps!).)

We check how likely our x̅ is if nothing had changed (so the µ of the null hypothesis stayed as
it was, and no population shift occurred) by using our theoretical knowledge about the sampling
distribution of 𝑋̅ to calculate what the probability of observing an x̅ as extreme as we did (in a
right or left tail) would be if the null hypothesis still held. We use either the Z distribution or the
t distribution to help us find those probabilities.

Sometimes, the chance of observing an x̅ value as extreme as we did if the null hypothesis held
is small (say smaller than about 1% to 10%) and sometimes it is large (say larger than about
10%). If that chance (which is called a p-value, by the way) is small, we decide that it is indeed
possible that the null hypothesis does not hold and the population from which we sample has
shifted. In that case, we say we reject the null hypothesis, and have evidence to support the
alternative hypothesis.

A cut-off percent value at which we would change our decision about whether to reject or not
reject the null hypothesis is called the level of significance. Common levels of significance are
1%, 5%, and 10%. A level of significance is chosen before an experiment is performed.

A “small” p-value (chance of observing an x̅ as extreme as we did if the null hypothesis still held)
smaller than the level of significance will lead us to conclude that we should reject the null
hypothesis and conclude that there has been a shift in the position of the population, whereas a
“large” p-value larger than the level of significance will lead us to not reject the null hypothesis,
and to conclude that there is no significant evidence that there has been a shift in the position of
the population.

Many statistical tests exist for situations where we take a sample and use the sample results to
make inferences about the population. You will learn several more tests as we move through the
course. We now move to formalize a procedure for hypothesis testing.

The table below informally summarizes the names and types of data you will encounter with the
hypothesis tests we will do during the remainder of the course. This will help you when it comes
time to set up word problems.

Test: One sample t or z test
Experiment (Graph/Plot): One numerical random variable (Histogram)
Survey: One numerical column
Example/Question: Column: Age. “Does average age differ from a particular age of interest?”
Alternative Hypothesis: Population mean has changed from the null mean.

Test: Two independent samples t test
Experiment (Graph/Plot): Two independent numerical random variables (Histograms)
Survey: A categorical column with two choices and a numerical column
Example/Question: Column 1: Education (Undergrad & Grad); Column 2: Salary. “Does average
salary differ for the two education groups?”
Alternative Hypothesis: The two population means differ.

Test: ANOVA
Experiment (Graph/Plot): Three independent numerical random variables (Histograms)
Survey: A categorical column with three or more choices and a numerical column
Example/Question: Column 1: Political Party (Conservative, Green, Liberal, NDP); Column 2:
weekly news minute consumption. “Does average weekly news consumption differ for the 4
parties?”
Alternative Hypothesis: At least one of the means differs from the others.

Test: Paired t test
Experiment (Graph/Plot): Two random variables with a “natural pairing” between them
(Histogram of Differences)
Survey: Two numerical columns with a “natural pairing” between them
Example/Question: Pulse rate before exercise, pulse rate after exercise (measure pulse rate before
and after for each respondent). “Does pulse rate change after exercise?”
Alternative Hypothesis: The mean of the differences between pairs is nonzero.

Test: Goodness of Fit
Experiment (Graph/Plot): A categorical random variable (Bar Chart)
Survey: One categorical column
Example/Question: Political Party. “Do the proportions of people belonging to political parties
differ from some particular provided proportions?”
Alternative Hypothesis: The distribution of proportions differs from a particular base distribution
of proportions.

Test: Test of Independence
Experiment (Graph/Plot): Two categorical random variables (Cluster Bar Chart)
Survey: Two categorical columns (each column can have 2 or more choices)
Example/Question: Degree, Political Party (measure degree and political party on each
respondent). “Is there a relationship between degree and political party? That is, does having a
particular degree coincide with being more or less likely to belong to a particular party?”
Alternative Hypothesis: The two variables are related to each other (we see more or fewer
observed values than expected in variable choice combinations).

Test: Simple Linear Regression
Experiment (Graph/Plot): Two numerical random variables that influence each other (Scatterplot)
Survey: Two numerical columns where one of the columns (the dependent one) appears naturally
dependent on the other column (the independent one)
Example: Savings, Salary (measure salary and savings on each respondent). Savings –
Dependent; Salary – Independent.
Alternative Hypothesis: The linear model is such that the independent variable explains the
dependent variable.

Test: Multiple Linear Regression
Experiment (Graph/Plot): Three numerical random variables that influence each other (Multiple
Scatterplots)
Survey: Three or more numerical columns where one variable (the dependent one) appears
“naturally” dependent on the other two (independent) variables
Example: Savings, Salary, Debt (measure all three on each respondent). Savings – Dependent;
Salary – Independent; Debt – Independent.
Alternative Hypothesis: The multiple linear model is such that at least one of the independent
variables explains the dependent variable.

All hypothesis tests have the following format.

Step: (Hypotheses) State hypotheses (null and alternative) about a population and state a level of
significance

Step: (Assumptions) State assumptions about the population. Investigate validity of population
assumptions with graphs/plots of the sample data, bearing in mind that we are looking only at
one sample here. Establishment of the validity of assumptions should be done before doing a
problem, but validity of assumptions will sometimes be assumed without investigation for
instructional purposes.

Step: (Test statistic) Calculate a “raw” test statistic, and “standardize” it in some way

Step: (P-Value) Calculate the chance (p-value) of observing a value as extreme as the
“standardized” test statistic if the null hypothesis was true (nothing had changed). To do this
requires knowledge of the sampling distribution of the “standardized” test statistic. This value
will be calculated by using online calculators or “by hand” tables.

Step: (Decision) If the p-value is “small” (below level of significance), reject Ho and conclude
there is significant evidence there has been a shift in the population to support the Ha. If the p-
value is “large” (above level of significance), do not reject Ho and conclude there is insufficient
significant evidence to indicate a shift in the population to support the Ha.

We never say we “accept Ho” but instead say that we “fail to reject Ho”. This recognizes that
even though there might only be a small chance of obtaining an extreme test statistic value if Ho
is true, it is possible we might randomly obtain that “rare” test statistic when Ho is true.

ONE SAMPLE MEAN: HYPOTHESIS TESTING

Example 8.0: A random sample of n=1280 students is taken from a larger population and it
reveals that the sample mean (average) amount for their student loans/scholarships is $18,900.00.
Suppose it is known that the standard deviation σ of the population is $40,222.00

a) What do you know about the mean, standard deviation, and shape of the sampling distribution
of 𝑋̅? Why do we know the sampling distribution has the indicated shape?
Mean: µ = Unknown
Standard Deviation: σ/√𝑛 = $40,222.00/√1280 = $1,124.24
Approximate Shape of Sampling Distribution: Normal
Why: Sample Size is large (Central Limit Theorem)

b) Perform a hypothesis test to determine if there is evidence that the population mean for student
loans/scholarships lies below $21,000.00. Use a level of significance of 5%.

Hypotheses
Ho: µ = 21000 Ha: µ < 21000
(we will test if there is evidence that a Ho population (which is normal and traditionally
centered about 21000 with standard deviation 1124.24) has shifted left (and is normal, but
centered at a number below 21000 with standard deviation 1124.24))
The university has been economizing on loans/scholarships, so we think the average
scholarship is lower.

Test stat/df
Recall x̅ = 18900, σ = 40222, σ/√n = 40222/√1280 = 1124.24
𝑋̅ ~ N(21000, 1124.24) for Ho

z = (x̅ − µo)/(σ/√n) = (18900 − 21000)/(40222/√1280) = -1.86793008

P-value
Pvalue Statement and Value
P(𝑋̅ < 18900) = P(Z < -1.86793008) ≈ P(Z < -1.87) = 0.0307
(from tables or statdistributions.com)
The probability of observing a sample mean of 18900 or smaller if Ho is true = 0.0307
(this is relatively small and smaller than 0.05, our level of significance.)

Decision and Reason
Since our p-value of 0.0307 <= our level of significance of 0.05, we reject Ho.

Decision in “English”
At the 5% significance level, there is significant evidence that the average population student
loan/scholarship mean lies below $21,000.
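For reference, part (b) condenses to a few lines of Python (our sketch; note the p-value is about 0.0309 before z is rounded to -1.87):

```python
# Left-tailed one sample z test for Ho: mu = 21000 vs Ha: mu < 21000
from math import sqrt
from scipy.stats import norm

mu0, xbar, sigma, n, alpha = 21000, 18900, 40222, 1280, 0.05
z = (xbar - mu0) / (sigma / sqrt(n))  # approximately -1.868
p = norm.cdf(z)                       # lower tail p-value, about 0.031
print(z, p, p <= alpha)               # p <= 0.05, so reject Ho
```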

c) Perform a hypothesis test to determine if there is evidence that the population mean for student
loans/scholarships exceeds $17,000.00. Use a level of significance of 1%.
Hypotheses
Ho: µ = 17000 Ha: µ > 17000
(we will test if there is evidence that a Ho population (which is normal and traditionally centered
about 17000 with standard deviation 1124.24) has shifted right (and is normal, but centered at a
number above 17000 with standard deviation 1124.24))
The University has received a large donation to support scholarships, so we think the average
loan/scholarship is higher.

Test stat/df
Recall x̅ = 18900, σ = 40222, σ/√n = 40222/√1280 = 1124.24
𝑋̅ ~ N(17000, 1124.24) for Ho

z = (x̅ − µo)/(σ/√n) = (18900 − 17000)/(40222/√1280) = 1.69003199

P-value
Pvalue Statement and Value
P(𝑋̅ > 18900) = P(Z > 1.69003199) = 1 – P(Z < 1.69) ≈ 1 – 0.9545 = 0.0455
(from tables or statdistributions.com)
The probability of observing a sample mean of 18900 or larger if Ho is true = 0.0455
(this is relatively small, but it is not smaller than 0.01, our level of significance.)

Decision and Reason
Since our p-value of 0.0455 > our level of significance of 0.01, we fail to reject Ho.

Decision in “English”
At the 1% significance level, we do not have significant evidence that the average population student
loan/scholarship mean exceeds $17,000. (note that we would have significance if α = 0.05)

d) Calculate a 90% confidence interval for the population mean.

The 90% confidence interval for µ is: 𝑥̅ ± zα/2 · σ/√n
90% confidence implies α = 0.1, so α/2 = 0.05.
So zα/2 = z0.05 = the z value that captures the top 5% and bottom 95% of the Z distribution.
From tables: z0.05 = 1.645.
Hence the CI is:
𝑥̅ ± z0.05 · σ/√n = 18900 ± 1.645 · 1124.24 = 18900 ± 1849.37 = ($17050.63, $20749.37)

We are 90% confident that the true population mean falls within the interval
($17050.63, $20749.37)

We do not know if our particular interval contains the true population mean.

e) For a (1-α)% confidence level, we define a significance level that is equal to α%. The
confidence level and the significance level add to 100%. You can think of the significance level
as our willingness to be wrong when we seek to determine if there is statistical evidence that a
population mean differs from a given amount. What is our significance level for this problem?
(1-α)% = 90%, so α% = 10%

f) A friend wonders if the population mean differs from $20,000. What do you tell your friend?
At the 10% significance level, since $20,000 falls inside the 90% confidence interval, we
do not have significant evidence that the population mean differs from $20,000.

g) A friend wonders if the population mean differs from $30,000. What do you tell your friend?
At the 10% significance level, since $30,000 falls outside the 90% confidence interval, we
do have significant evidence that the population mean differs from $30,000.

h) Perform a hypothesis test to determine if there is evidence that the population mean for student
loans/scholarships differs from $20,000.00. Use a level of significance of 10%.
Hypotheses
Ho: µ = 20000 Ha: µ ≠ 20000
(we will test if there is evidence that a Ho population (which is normal and traditionally centered
about 20000 with standard deviation 1124.24) has shifted left or right (and is normal with
standard deviation 1124.24, but centered at a number either below 20000 or above 20000))
The university has been economizing on loans/scholarships and has received a large donation to
help with loans/scholarships, and we think things might have changed, but we don’t know how.

Test stat/df: Recall x̅ = 18900, σ = 40222, σ/√n = 40222/√1280 = 1124.24

𝑋̅ ~ N(20000, 1124.24) for Ho

z = (x̅ − µo)/(σ/√n) = (18900 − 20000)/(40222/√1280) = -0.97843957

P-value
Pvalue Statement and Value
2P(𝑋̅ < 18900) = 2P(Z < -0.97843957) ≈ 2P(Z < -0.98) = 2(0.1635) = 0.3270
(from tables or statdistributions.com)
The probability of observing a sample mean as extreme as 18900 (on the left or right side of
the Ho distribution) if Ho is true = 0.3270
(this is relatively large and larger than 0.10, our level of significance.)

Decision and Reason
Since our p-value of 0.3270 > our level of significance of 0.10, we fail to reject Ho.

Decision in “English”
At the 10% significance level, we do not have significant evidence that the average
population student loan/scholarship mean differs from $20,000.

Example 8.1: You are interested in the average cost of an airbnb in Canada. Your
company’s “status quo” belief is that it is no more than $70.00 a night. Your company
therefore re-imburses accommodation expense claims for up to that amount. Recently, your
employees have noticed that airbnb bills appear to have been climbing, and suspect that this is
due to increasing inflation. Your accountant examines a sample of 25 recent airbnb bills
submitted by your employees, and finds a sample mean of $71.20 and a standard deviation of
$2.50. From experience, your accountant knows that the distribution of the cost of airbnb
bills is normal. Is there evidence to support the belief of your employees that the average
cost of an airbnb room per night is rising in Canada?

Step 1: Hypotheses

H0 Popn: µo = 70 (our null population mean is 70), shape normal, σ unknown
Sample: n = 25, 𝑥̅ = 71.20, s = 2.5

H0: µ <= 70
Ha: µ > 70 (our belief is that the mean cost of airbnb nights has increased)

We prechose a level of significance, which we denote α, of 5% (or 0.05). If the chance of


observing our sample if nothing has changed is below 5%, we will reject our null hypothesis, and
conclude a change has occurred (and Airbnb rates have increased).

We are asking ourselves if 𝑥̅ = 71.20 is unusual if Ho is true.

Step 2: Test Statistic

𝑋̅ is N(µo, σ/√n), but we do not know σ and must use s to estimate it.

t = (𝑋̅ − µo)/(S/√n) is a t distribution with n - 1 = 24 degrees of freedom (df).

We calculate our test statistic t* = (𝑥̅ − µ0)/(s/√n) = (71.20 − 70)/(2.5/√25) = 2.4

This t* = 2.4 corresponds in a one-to-one relationship with 𝑥̅ = 71.20

Step 3: P-value: If µo = 70 were true, how likely is it we would get a sample result as unusual
as 71.20 (i.e., a result of 71.20 or higher)? That is, what is P(𝑋̅ > 71.20) if H0 is true?

P(𝑋̅ > 71.20) = P(t24 > 2.4) = 0.012 (from the online calculator)

(or from the t tables with 24 df, we bracket 0.01 < P(t24 > 2.4) < 0.025)

Note that the direction of the Ha > sign matches the direction of the > sign in the p-value.

Step 4: Decision
We reject H0 if the p-value <= α.

The small p-value of 0.012 <= 0.05 tells us that this result is very unlikely (unusual) if Ho is
true. Our $71.20 would be very unusual if µ0 = $70.00 were still true. We reject H0 and
conclude that there is significant evidence that the mean cost of an airbnb night
in Canada now exceeds $70.00.
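A Python sketch of the same test (ours; t.sf gives the exact upper tail p-value the calculator returned):

```python
# Right-tailed one sample t test for Ho: mu <= 70 vs Ha: mu > 70
from math import sqrt
from scipy.stats import t

mu0, xbar, s, n, alpha = 70, 71.20, 2.5, 25, 0.05
tstar = (xbar - mu0) / (s / sqrt(n))  # 2.4
p = t.sf(tstar, df=n - 1)             # about 0.012 with 24 df
print(tstar, p, p <= alpha)           # p <= 0.05, so reject Ho
```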

Relationship between p-value and the level of significance

What if our pre-chosen level of significance had been 0.02 (or 2%)?
Reject Ho.

What if our pre-chosen level of significance had been 0.01 (or 1%)?
Do not reject Ho.

This illustrates the importance of choosing a level of significance carefully, and the importance
of always presenting a p-value with your decision. NEVER simply report a decision and a
significance level. ALWAYS include your p-value.

The costs behind your experiment will influence your choice of level of significance. You don’t
want to make the wrong decision and misinform people. Perhaps cost is important. Perhaps
health is involved. You might choose a smaller level of significance if your decision involves
millions of dollars or if you are testing a new drug that extends people’s lifetimes but has serious
side effects. We will talk more about this later.

The following summary tables will be useful for your work. Look at them briefly now as you
read, and then, in the following examples, you can use them as a reference.

P Value Approach to Hypothesis Testing for Tests of One Sample Mean

Step One (Hypotheses):
Right tail test: Ho: μ ≤ μo versus Ha: μ > μo
Left tail test: Ho: μ ≥ μo versus Ha: μ < μo
Two tailed test: Ho: μ = μo versus Ha: μ ≠ μo

Step Two (Test Statistic): z* or t* for all three tests {see the table below}.

Step Three (p-value):
Right tail test: P(Z > z*) or P(t > t*)
Left tail test: P(Z < z*) or P(t < t*)
Two tailed test: 2P(Z > |z*|) or 2P(t > |t*|)

Step Four (Decision): For all three tests, reject Ho if the p-value ≤ α; fail to reject Ho if the
p-value > α.

**For a one-sided test, the direction of the symbol in Ha tells you which side of the distribution
to look for to find your p-value. Note that when Ha is two sided, you have no pre-conceived idea
of what constitutes an extreme result, so your p-value reflects the fact that a value as extreme as
that observed could have occurred on either side of the sampling distribution.

Test Statistic Table for Tests of One Sample Mean

Known σ:
- Normal shape, any n: (𝑋̅ − µ)/(σ/√n) is N(0,1); use z* = (𝑥̅ − µ0)/(σ/√n)
- Not normal shape, n < 30: cannot use z or t
- Not normal shape, n ≥ 30: (𝑋̅ − µ)/(σ/√n) is N(0,1) by CLT; use z* = (𝑥̅ − µ0)/(σ/√n)

σ unknown, s known:
- Normal shape, any n: (𝑋̅ − µ)/(s/√n) is tn-1; use t* = (𝑥̅ − µ0)/(s/√n) (exact). For n ≥ 30,
  z* ≅ (𝑥̅ − µ0)/(s/√n) is an approximation if the df are not in the t tables (you may prefer to use
  a t* with the df from the next lowest line in the t-tables to be conservative)
- Not normal shape, n < 30: cannot use z or t
- Not normal shape, n ≥ 30: (𝑋̅ − µ)/(s/√n) is N(0,1) by CLT (use s to approximate σ);
  z* ≅ (𝑥̅ − µ0)/(s/√n) (you may prefer to use a t* with the df from the next lowest line in the
  t-tables to be conservative)

Example 8.2: Five years ago a consumer watch agency published evidence to support the claim
of a vitamin company that the average amount of powdered calcium/magnesium in a tablespoon
of their product was at least 1200 mg of calcium/magnesium. Three years ago, the company was
taken over by another much larger company, and many users of the product have recently been
expressing dissatisfaction with the product. The consumer watch agency recently examined a
sample of 36 tablespoons of the product, and found a sample mean of 1150 mg of
calcium/magnesium and a standard deviation of 120 mg of calcium/magnesium. Does the
evidence support the concerns of consumers that the product is now inferior? The consumer
watch agency routinely uses a significance level of 1% in their studies.

Step 1: Hypotheses
H0 Popn: µo = 1200, shape unknown, σ unknown
Sample: x̅ = 1150, s = 120 (known), n = 36
Pre-chosen α = 0.01

H0: µ >= 1200
Ha: µ < 1200

Step 2: Test Statistic (you can reference the table above to be sure you have the right one)
z* ≈ (x̅ − µo)/(s/√n) = (1150 − 1200)/(120/√36) = -50/20 = -2.5

Step 3: P-value

P(Z < -2.5) = 0.0062 (from z tables) suffices.
P(t < -2.5) for 35 degrees of freedom would also be acceptable.

Step 4: Decision
Conclusion: Since 0.0062 < 0.01, we reject Ho, and conclude that there is significant evidence to
support Ha (that the mean amount of calcium/magnesium per tablespoon is less than 1200 mg).

TYPE 1 AND TYPE 2 ERROR:
α = P(Reject Ho when Ho is true) – Type 1 error
β = P(Do not reject Ho when Ho is false) – Type 2 error

α is the % chance that an unusual sample result occurred when H0 was actually true and a
mistake was made.

UNDERLYING SITUATION vs DECISION (probabilities):
Ho true: Reject Ho → P(Reject Ho when Ho true) = α; Do not reject Ho → P(Do not reject Ho
when Ho true) = 1 - α
Ho false (Ha has merit): Reject Ho → P(Reject Ho when Ho false) = 1 - β; Do not reject Ho →
P(Do not reject Ho when Ho false) = β

UNDERLYING SITUATION vs DECISION (correctness):
Ho true: Reject Ho → INCORRECT = TYPE I ERROR; Do not reject Ho → CORRECT
Ho false (Ha has merit): Reject Ho → CORRECT; Do not reject Ho → INCORRECT = TYPE II
ERROR

For fixed n, if we set a smaller α, our β increases. (Just move the cut line above to see this in
the picture.)

Example 8.3: Consider the consequences of Type I and Type II error for testing a new drug to
treat cancer.

Ho true (new drug no improvement on old drug), Reject Ho (conclude new drug works):
P(Reject Ho when Ho true) = α. We think the drug is effective when it isn’t.
Company: waste $$ marketing, lose face (impacts on other sales).
Client: false hope, very cruel.
TYPE 1 ERROR IS SERIOUS HERE (for company and for client).

Ho true, Do not reject Ho (conclude new drug no help): P(Do not reject Ho when Ho true) =
1 - α. Correct decision.

Ho false (new drug increases life span of cancer patient), Reject Ho: P(Reject Ho when Ho
false) = 1 - β. Correct decision.

Ho false, Do not reject Ho: P(Do not reject Ho when Ho false) = β. We think the drug doesn’t
work when it does.
Company: don’t market it, lose potential profit.
Client: doesn’t know about the drug, no opportunity to try it (but no hopes raised).

All our examples so far have assumed that we have an idea (suspicion, hope, belief) about the
direction in which the population mean may have shifted (greater or smaller) because of some
experimental or experiential (socio/economic/environmental/political/medical etc.) change. We
now look at the *perhaps* more common situation where we suspect a change has occurred, but
have no idea in what direction a population may have shifted.

Example 8.4: A census done of students at your university in 1880 revealed that they spent, on
average, 7 hours a day doing homework. Over time, the electronic age has presented many
challenges to those wishing to be attentive to school work (telephone, radio, television, the
internet, texting), but it has also offered many timesaving devices that might help those who wish
to have more time for homework (electric light, computers, faster transit) and you are curious
whether the mean number of hours a day that students today spend doing homework has
changed. You take a simple random sample of 64 students and find a mean of 6.7 hours a day
and a standard deviation of 2. Assuming a level of significance of 0.05, test whether the average
daily hours of homework engaged in by students at your university has changed.

Step 1: Ho Popn: μo = 7, shape unknown, σ unknown
Sample: x̅ = 6.7, s = 2 (known), n = 64
Choose α = 0.05

Ho: μ = 7 versus Ha: μ ≠ 7 (nothing there to indicate what our thoughts are ahead of time on how the μ might differ from 7)

Step 2: With a population of unknown shape, since n ≥ 30, and we must use s to approximate the unknown σ, our test statistic is

z* ≈ (x̅ - μo)/(s/√n) = (6.7 - 7)/(2/√64) = -0.3/0.25 = -1.2

Step 3: Calculate the p-value: (the chance of observing a value as extreme as -1.2 must
consider both tails when we calculate the p-value because we had no idea ahead of time
how daily homework hours might have changed from the status quo)

2P(Z<-1.2) = 2(0.1151) = 0.2302 (from tables)

You may prefer to use 2P(t < -1.2) with 63 df to be more conservative.

Step 4: Conclusion: Do not reject Ho since 0.2302 is not ≤ 0.05. At a level of significance of α = 0.05, there is not enough evidence to conclude that the average daily hours of homework differs from the historical "norm" of 7 hours a day.

95% Confidence Interval. For our population of unknown shape, since n ≥ 30, and we must use s to approximate the unknown σ, our confidence interval is

x̅ ± zα/2 (s/√n) = x̅ ± z0.025 (s/√n) = 6.7 ± 1.96 (2/√64) = 6.7 ± 0.49 = (6.21, 7.19) hours a day of homework

Note that 7, our null hypothesized mean, is in the 95% confidence interval. There is no significant evidence at the 5% significance level that there has been a change from the historic average "norm" of 7 hours a day of homework. You could use t0.025,63 to be more conservative.

Notice that the two sided hypothesis test with a significance level of 5% and the 95% confidence
interval both lead to the same decision.

In general, any two sided hypothesis test with a significance level of α% will lead to the same decision as a (1-α)% confidence interval.
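As an illustration of this equivalence, here is a small sketch (assuming Python with scipy; the naming is our own, not part of the course materials) that computes both the two sided p-value and the 95% confidence interval for Example 8.4:

    from math import sqrt
    from scipy import stats

    x_bar, mu_0, s, n, alpha = 6.7, 7, 2, 64, 0.05
    z_star = (x_bar - mu_0) / (s / sqrt(n))        # -1.2
    p_value = 2 * stats.norm.cdf(-abs(z_star))     # 0.2301 (0.2302 from tables)
    z_crit = stats.norm.ppf(1 - alpha / 2)         # 1.96
    half_width = z_crit * s / sqrt(n)
    ci = (x_bar - half_width, x_bar + half_width)  # (6.21, 7.19)
    print(round(p_value, 4), ci)
    # p-value > 0.05 and the interval contains 7: both fail to reject Ho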

Example 8.5: Hypothesis Test of µ, population shape unknown, σ known, sample size large

The Parks and Recreation Board is trying to decide whether to build a new tennis court in their
city. A random sample of 100 persons in this city revealed that they play tennis, on average, 1.2
hours per week during the summer. A previous pilot study indicated that the population standard
deviation is 1.0 hours. Test whether this sample indicates that the number of hours that tennis is
played in this city differs from the national average of 1.1 hours. Use a significance level of 5%.

Population: unknown shape, σ = 1, µo= 1.1


Sample: n = 100, 𝑥̅ = 1.2, s unknown
α = 0.05
Hypotheses:
Ho: µ = 1.1
Ha: µ ≠ 1.1

Test Statistic:
z* = (x̅ - μo)/(σ/√n) = (1.2 - 1.1)/(1/√100) = 1

Pvalue: The p-value is the probability of observing a value as unusual as 1. Since this is a 2
sided test, we look at both tails as unusual, and we have:

Pvalue = 2P(Z > z*) = 2P(Z > 1) = 2(0.1587) = 0.3174

Decision: Do not reject Ho. Our p-value of 0.3174 is > 0.05 (our pre-chosen α), and we conclude the average number of hours of tennis played in this city does not differ from the national average.

CLASSICAL APPROACH (CRITICAL REGION APPROACH) ONE SAMPLE MEAN

Sometimes, people simply ask about the bottom line, and what decision should be made.
However, really, if you are using an α = 0.05, and your p-value turns out to be 0.051, you may
decide that your result is close enough to “significant” that you believe your alternative
hypothesis to have merit. Nuances and precision are important in statistics, as elsewhere in life.

That said, it is in your interest to now become aware of and conversant in the use of the classical
(critical region) approach to hypothesis testing. Not least because it is the method you will use
for discerning significance in future tests you will learn in this course.

Recall that the level of significance α is, in essence, a cut-off point that we compare to the p-
value when we make our decision. Any z* test statistic result that yields a p-value smaller than
or equal to α leads to rejection of Ho.

Let us look specifically at what happens with a 2-sided test when α = 0.05. We draw the
following diagram to elucidate understanding.

Now, P(Z < -1.96) = 0.025 and P(Z > 1.96) = 0.025.

We call the region where Z < -1.96 or Z > 1.96 the critical region (or the rejection region).

Recall that we use the notation z0.025 = 1.96 to denote the z value corresponding to an area of
0.025 to the right of 1.96, and the notation - z0.025 = -1.96 to denote the z value corresponding to
an area of 0.025 to the left of -1.96.

Now, a z* result yielding a p-value smaller than 0.05 leads to a rejection of Ho.

This happens when z* falls in the critical region.


If z * is below -1.96 or above 1.96, the corresponding p-value will be below 0.05.

We write:

Critical Region: Reject Ho if z* < -1.96 or z* > 1.96


(or, since α = 0.05) Reject Ho if z* < -z0.025 or z* > z0.025
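If you have software handy, critical values need not come from printed tables. A one-line sketch (assuming Python with scipy; not part of the course materials):

    from scipy import stats

    print(stats.norm.ppf(0.975))      # z0.025 = 1.96 (two sided, alpha = 0.05)
    print(stats.t.ppf(0.975, df=15))  # t0.025,15 = 2.131, for comparison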

Returning to Example 8.5, the decision can be rewritten using the classical approach, as follows.

Decision: Since z* = 1 is not in the rejection region, we do not reject Ho, and conclude that the
average number of hours of tennis played in this city does not differ from the national average.

The following diagram illustrates both the critical region and the p-value for Example 8.5.

Summary tables for the Classical Approach follow; students may find them a useful reference.

Classical Approach to Hypothesis Testing for tests of the population mean


Step One (Hypotheses):
Right tail test: Ho: μ ≤ μo versus Ha: μ > μo
Left tail test: Ho: μ ≥ μo versus Ha: μ < μo
Two tailed test: Ho: μ = μo versus Ha: μ ≠ μo

Step Two (Test Statistic): z* or t* {see table below}, for all three tests

Step Three (Critical Value, from z or t table):
Right tail test: zα or tα,n-1
Left tail test: -zα or -tα,n-1
Two tailed test: zα/2 or tα/2,n-1

Step Four (Decision):
Right tail test: Reject Ho if z* > zα (or t* > tα,n-1)
Left tail test: Reject Ho if z* < -zα (or t* < -tα,n-1)
Two tailed test: Reject Ho if z* > zα/2 or z* < -zα/2 (or t* > tα/2,n-1 or t* < -tα/2,n-1)

Test Statistic Table

Known σ:
- n < 30, normal population: (X̅ - μ)/(σ/√n) is N(0,1); use z* = (x̅ - μ0)/(σ/√n)
- n < 30, not normal population: cannot use z or t
- n ≥ 30, normal population: (X̅ - μ)/(σ/√n) is N(0,1); use z* = (x̅ - μ0)/(σ/√n)
- n ≥ 30, not normal population: (X̅ - μ)/(σ/√n) is N(0,1) by the CLT; use z* = (x̅ - μ0)/(σ/√n)

σ unknown, s known:
- n < 30, normal population: (X̅ - μ)/(S/√n) is tn-1; use t* = (x̅ - μ0)/(s/√n)
- n < 30, not normal population: cannot use z or t
- n ≥ 30, normal population: (X̅ - μ)/(S/√n) is tn-1; use t* = (x̅ - μ0)/(s/√n) (exact), or z* ≅ (x̅ - μ0)/(s/√n) (approximate). You may prefer to use a t distribution with n - 1 df to be more conservative.
- n ≥ 30, not normal population: (X̅ - μ)/(σ/√n) is N(0,1) by the CLT; use z* ≅ (x̅ - μ0)/(s/√n) (approximate). You may prefer to use a t distribution with n - 1 df to be more conservative.

Example 8.6: Hypothesis Test of µ, population shape normal, σ unknown, sample size small
A firm that packages deluxe ornamental matches for fireplace use designed a process to place, on average, 18 or more matches in each box. Due to a recent machinery breakdown and hasty repair, you are worried that the process is yielding fewer matches than required, on average. A random sample of 16 boxes is drawn. On the basis of this sample, the average number of matches per box is 17, while the standard deviation calculated was 2. Test, using both the p-value and the critical value approach, whether the process is meeting its requirements. Assume that the population is normal, and our pre-chosen α = 0.05.

Population: Normal, µ0 = 18
Sample: n = 16, 𝑥̅ = 17, s = 2
α = 0.05

Hypotheses:
Ho: µ >= 18 versus Ha: µ < 18

Test Statistic:
t* = (x̅ - μ0)/(s/√n) = (17 - 18)/(2/√16) = -2

P-Value Approach:
p-value = P(t < -2) with degrees of freedom = 15.
From the t tables, t0.05,15 = 1.753 < 2 < 2.131 = t0.025,15, so 0.025 < p-value < 0.05.

Decision: Reject Ho since p-value is <= 0.05, and conclude process is not meeting requirements.

Critical Region Approach:


Reject Ho if t* < -t0.05
Reject Ho if t* < -1.753

Decision: Since -2 < -1.753, reject Ho and conclude that process is not meeting its requirements.
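A quick software check of this small sample t test (a sketch assuming Python with scipy; the variable names are our own):

    from math import sqrt
    from scipy import stats

    x_bar, mu_0, s, n, alpha = 17, 18, 2, 16, 0.05
    t_star = (x_bar - mu_0) / (s / sqrt(n))      # -2.0
    p_value = stats.t.cdf(t_star, df=n - 1)      # P(t15 < -2) = 0.032
    t_crit = -stats.t.ppf(1 - alpha, df=n - 1)   # -1.753
    print(round(p_value, 3), round(t_crit, 3))   # both approaches reject Ho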

The picture illustrates the information needed to complete the test when taking both the p-value and the critical value approaches for this example.

Chapter 9: TWO SAMPLE INFERENCE: COMPARATIVE TEST PROCEDURES

Thus far, we have looked at statistical procedures to be used when we take a random sample
from one population and wish to make inferences about a population parameter of interest (such
as a population mean or population proportion). In what follows, we will look at two further
experimental designs where we wish to do comparative inference using tests that are designed to
appropriately investigate new parameters of interest.

First we will examine the situation where our data will consist of a random sample of naturally
matched pairs, where the observations for each pair are dependent on each other. Matched pair
examples might be twins, married couples, or measurements made before and after some
“treatment” on the same subject. We might look at the paired differences in the birth weights of
twins, the age difference between married couples, or blood pressure measurements before and
after a subject takes a new medicine for a certain time period. Our inference test will consider
the sample of paired differences to be a test on the population of paired differences that we
observe for each matched pair. It is equivalent to a one sample t test because we test on the
random variable of the paired differences.

Secondly, we will examine the situation where we have two independent populations, and we
take two independent random samples, one from each population. In this case the sample
selected from each population has no bearing on the sample selected from the other population.
We will use test statistics derived from these samples to determine whether or not the two
population means (or proportions) are alike or different. We will consider two situations. In
one situation, we will assume that our 2 independent samples come from two normal populations
with equal variances, and in the second situation we will assume that our two independent
samples come from normal populations with unequal variances. We will examine the
robustness of our test statistics of interest to variations from these assumptions.

MATCHED PAIRS:
Example 9.1: A high school teacher selects a random sample of twelve of her students who are
planning to apply to University and records their averages in January and June, along with the
differences in their averages between June and January, as below. The teacher thinks it
reasonable to assume that student averages increase in the last half of the year, as students begin
to think about college. She wishes to test her supposition at a 5% level of significance.
Student         January   June   Difference = June - January
Amber              85       91       6
Beryl              78       82       4
Crystal            79       82       3
Diamond            77       80       3
Emerald            75       77       2
Fuchsite           76       78       2
Garnet             82       84       2
Hiddenite          91       93       2
Ironstone          88       88       0
Jade               89       89       0
Kyanite            83       81      -2
Lapus Lazuli       84       80      -4

If we assume that the distribution of the population of paired differences is normal, then the random variable td = (d̅ - μd)/(Sd/√nd) will follow a t distribution with nd - 1 degrees of freedom. Here d̅ is the random variable of the sample mean paired difference, μd is the mean of the population of paired differences, Sd is the random variable of the sample standard deviation of the paired differences, and nd is the number of paired differences in the sample. **

**Note: It can be a little confusing to see the same symbol d̅ used for both the random variable and also for the sample mean for a specific sample, but this is the convention, and your author has followed it. The context should make it clear, and as long as one remembers that the d̅ in the formula for the test statistic td* refers to the sample mean for a specific sample, one should be fine.

We next take a quick moment to look at a bar chart of the sample of paired differences. It does not seem unreasonable to believe that the population of differences follows a normal distribution. We can therefore proceed to create confidence intervals and do hypothesis tests in the same manner as if we were doing a one sample procedure.

We note that if the teacher's hypothesis is correct, we would expect to see a clearly positive d̅, whereas if she is incorrect, and students' grades do not increase significantly in the last 6 months of high school, we would expect to see a d̅ close to 0.

Step 1 (Hypotheses): Ho: μd ≤ 0 versus Ha: μd > 0

Ho Popn of Differences: μd0 = 0, shape normal, unknown σd
Sample: d̅ = 1.5, sd = 2.68, nd = 12

Step 2 (Test statistic): td* = (d̅ - 0)/(sd/√nd) = (1.5 - 0)/(2.68/√12) = 1.94
Step 3 (P-value): p-value = P(t11 > 1.94) = 0.039 (from online)
From the tables, for 11 degrees of freedom, 0.025 < p-value < 0.05

Step 4 (Decision): Since our p-value is less than our level of significance of 0.05, we conclude
there is significant evidence to support the teacher’s supposition that student grades at her school
increase in the second half of the school year.
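Students who wish to verify this paired test with software can use a sketch along the following lines (assuming Python with scipy; scipy.stats.ttest_rel performs exactly this matched pairs t test):

    import numpy as np
    from scipy import stats

    january = np.array([85, 78, 79, 77, 75, 76, 82, 91, 88, 89, 83, 84])
    june    = np.array([91, 82, 82, 80, 77, 78, 84, 93, 88, 89, 81, 80])

    # One-sided paired t test of Ho: mu_d <= 0 versus Ha: mu_d > 0,
    # using the differences d = june - january
    t_star, p_value = stats.ttest_rel(june, january, alternative='greater')
    print(round(t_star, 2), round(p_value, 3))   # 1.94 0.039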

Confidence intervals for a random sample of paired differences are created in an analogous
manner to confidence intervals for a single random sample. The formula for the confidence
interval for μd is as follows.

d̅ ± tα/2 (sd/√nd), where tα/2 is such that P(t < -tα/2) = P(t > tα/2) = α/2 for nd - 1 df

A 95% confidence interval for μd is:

1.5 ± 2.201 (2.68/√12) = 1.5 ± 1.71 (as t0.025 = 2.201 for 11 df)

Our confidence interval for the paired differences is (-0.21, 3.21). We are 95% confident that the mean difference between average grades in June and average grades in January is between -0.21 and 3.21 grade points. As this two sided 95% confidence interval does contain 0, it would not, on its own, show a significant difference at the 5% level in a two sided test. Our one sided test above did reject Ho; the interval matching that one sided test at α = 0.05 is the 90% confidence interval 1.5 ± 1.796(2.68/√12) = (0.11, 2.89), which does not contain 0 and so agrees with the conclusion that student grades at her school increase in the second half of the school year.

TWO INDEPENDENT SAMPLE TESTS

TWO INDEPENDENT SAMPLES Z TEST

Consider two independent samples drawn randomly from two populations. We consider the
samples to be independent when sample data drawn from one population is completely unrelated
to sample data drawn from the other population.

This next test enables us to determine whether or not the means of these two populations (subject
to various constraints indicated below) differ.

We will label the first population the X1 population and the second population the X2 population,
and denote the mean and standard deviation of the first population as 𝜇1 and 𝜎1 , and the mean
and standard deviation of the second population as 𝜇2 and 𝜎2 .

We let 𝑥̅1 be the mean of the sample taken from the X1 population, and 𝑥̅2 be the mean of the
sample taken from the X2 population. Now suppose we repeat this experiment of taking two
independent samples over and over, each time calculating 𝑥̅1 and 𝑥̅2 . Eventually, we can
generate a sampling distribution for the difference 𝑋̅1 – 𝑋̅2. What will this sampling distribution
look like?

RESULT: Let X1 and X2 be populations with means μ1 and μ2 and standard deviations σ1 and σ2 respectively. If we take independent random samples of sizes n1 and n2 from each of the X1 and X2 populations, the sampling distribution of X̅1 - X̅2
a) has mean μ1 - μ2 and
b) has standard deviation √(σ1²/n1 + σ2²/n2).
This result holds for any n1 and n2, regardless of size.

FURTHER RESULT: Let X1 and X2 be normal populations with means μ1 and μ2 and standard deviations σ1 and σ2 respectively. If we take independent random samples of sizes n1 and n2 from each of the X1 and X2 populations, the sampling distribution of X̅1 - X̅2
a) has mean μ1 - μ2,
b) has standard deviation √(σ1²/n1 + σ2²/n2), and
c) is N(μ1 - μ2, √(σ1²/n1 + σ2²/n2)).
This result also holds for any n1 and n2, regardless of size.

(We indicated above that the X̅1 - X̅2 distribution was normal. This follows as X̅1 is normal because the X1 population is normal, X̅2 is normal as the X2 population is normal, and X̅1 - X̅2 is normal because a linear combination of two or more normal distributions is also normal.)

In this case, it thus follows that, for all n1 and n2, the standardized test statistic

Z = ((X̅1 - X̅2) - (μ1 - μ2)) / √(σ1²/n1 + σ2²/n2) has a N(0,1) distribution.

Important Note: The Central Limit Theorem tells us that we can relax the normality assumption for our original populations if both n1 ≥ 30 and n2 ≥ 30.

An analogous discussion to the discussion about one sample confidence intervals would further lead us to the development of a (1 - α)% confidence interval for μ1 - μ2 of the following form

(x̅1 - x̅2) ± zα/2 √(σ1²/n1 + σ2²/n2) where zα/2 is such that P(Z < -zα/2) = P(Z > zα/2) = α/2

DEMO PROBLEM:
a) Consider an X1 distribution with mean 8 and standard deviation 2 and an X2 distribution with mean 9 and standard deviation 3. Suppose independent random samples of sizes 8 and 9 are taken, respectively, from the X1 and X2 distributions. Find the mean and standard deviation of X̅1 - X̅2.

For n1 = 8, n2 = 9, μ1 = 8, σ1 = 2, μ2 = 9, σ2 = 3, the mean and standard deviation of X̅1 - X̅2 are -1 and 1.225 respectively:

μX̅1-X̅2 = μ1 - μ2 = 8 - 9 = -1, σX̅1-X̅2 = √(σ1²/n1 + σ2²/n2) = √(2²/8 + 3²/9) = √1.5 = 1.225

Here, as long as the two samples are independently drawn, the above result is true no matter what the distribution shapes of the X1 and/or X2 populations are.

b) Suppose the two independent samples are randomly obtained from normal populations X1 and X2, with sample sizes n1 = 8 and n2 = 9. Compute the probability that either sample average is at least 2 units greater than the other.

Recall X̅1 and X̅2 are normally distributed for any sample size as they are drawn from normal populations X1 and X2, and X̅1 - X̅2 is also normally distributed, as any linear combination of normal distributions is also normally distributed.

Here X̅1 - X̅2 is normal with mean μX̅1-X̅2 = -1 and standard deviation σX̅1-X̅2 = 1.225.

We seek P(|X̅1 - X̅2| > 2), where X̅1 - X̅2 ~ N(-1, 1.225).


P(|X̅1 - X̅2| > 2) = 1 - P(|X̅1 - X̅2| < 2) = 1 - P(-2 < X̅1 - X̅2 < 2)

We standardize the -2 and 2 in the equation using Z = ((X̅1 - X̅2) - μX̅1-X̅2)/σX̅1-X̅2.

Hence, P(|X̅1 - X̅2| > 2) = 1 - P((-2 - (-1))/1.225 < Z < (2 - (-1))/1.225) = 1 - P(-0.82 < Z < 2.45)
= 1 - (P(Z < 2.45) - P(Z < -0.82))
= 1 - (0.9929 - 0.2061) = 1 - 0.7868 = 0.2132

21.32% of all pairs of independent samples of sizes 8 and 9 are such that the magnitude of the difference between the sample means is at least 2.
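This probability is easy to confirm with software; a minimal sketch (assuming Python with scipy; the naming is our own):

    from scipy import stats

    # Sampling distribution of X1bar - X2bar from part (a): N(-1, 1.225)
    diff = stats.norm(loc=-1, scale=1.225)
    p = diff.cdf(-2) + diff.sf(2)   # P(diff < -2) + P(diff > 2)
    print(round(p, 4))              # 0.2143 with exact z values
                                    # (0.2132 with the rounded table values above)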

Population variances are generally unknown; we will not do further examples that assume known population variances, but will look at the situation where we use sample variances to estimate population variances.

TWO INDEPENDENT T TEST – POOLED VARIANCES

In this section, we look at an inferential method that is used when we know that the populations
from which we are sampling have equal variances, although this equal value is unknown. As
always, such knowledge would only be likely if we were very familiar with data from previous
similar experiments or surveys.

We assume we have two independent normal populations from which we take two simple random samples, and that the two independent normal populations have equal unknown variances, σ² = σ1² = σ2².

Hence, Z = ((X̅1 - X̅2) - (μ1 - μ2)) / √(σ1²/n1 + σ2²/n2) can be rewritten as

Z = ((X̅1 - X̅2) - (μ1 - μ2)) / (σ√(1/n1 + 1/n2))

What we need is an estimate for this unknown σ². The most useful one we can make is a "pooled" estimate of the known sample variances s1² and s2², both of which are estimators of σ² themselves.

sp² weights each of s1² and s2² appropriately for the theory about the shape of the sampling distribution that follows to be valid. Note that if s1² and s2² were quite different, we might question our decision to pool the variances.

sp² = ((n1 - 1)s1² + (n2 - 1)s2²)/(n1 + n2 - 2), and therefore sp = √(((n1 - 1)s1² + (n2 - 1)s2²)/(n1 + n2 - 2)), and

t = ((X̅1 - X̅2) - (μ1 - μ2)) / (Sp√(1/n1 + 1/n2)) is a t random variable with n1 + n2 - 2 degrees of freedom.

Therefore, as long as the following assumptions are made,

1. Simple random samples


2. Independent samples
3. Normal populations (or large samples (both n1 and n2 >=30))
4. Equal population standard deviations

the following chart indicates the appropriate hypothesis testing procedure.


Step 1 (Hypotheses):
Right tail: Ho: μ1 - μ2 ≤ 0 (i.e. Ho: μ1 ≤ μ2) versus Ha: μ1 - μ2 > 0 (i.e. Ha: μ1 > μ2)
Left tail: Ho: μ1 - μ2 ≥ 0 (i.e. Ho: μ1 ≥ μ2) versus Ha: μ1 - μ2 < 0 (i.e. Ha: μ1 < μ2)
Two tailed: Ho: μ1 - μ2 = 0 (i.e. Ho: μ1 = μ2) versus Ha: μ1 - μ2 ≠ 0 (i.e. Ha: μ1 ≠ μ2)

Step 2 (Test Statistic):
t* = ((x̅1 - x̅2) - (μ1 - μ2)) / (sp√(1/n1 + 1/n2)), df = n1 + n2 - 2

Step 3 (p-value):
Right tail: P(t > t*)
Left tail: P(t < t*)
Two tailed: 2P(t < t*) if t* negative; 2P(t > t*) if t* positive

Step 4 (Decision): Reject Ho if p-value ≤ α

Example 9.2:
At a certain University, the number of hours of homework done per course per week by 1st term
(winter) students and the number of hours of homework done per course per week by 2nd term
(fall) students are known to be normally distributed. Two independent simple random samples
are taken, one of 25 Fall term students and one of 9 Winter term students, and the number of
hours of homework that each student does per course per week is calculated. The sample
average number of hours for Fall termers is 3.1 and the sample standard deviation of hours for
Fall termers is 0.22 hours. The sample average number of hours for Winter termers is 2.2 and
the sample standard deviation of hours for Winter termers is 0.18 hours. Previous studies have
indicated that the population standard deviations of hours of homework done per course per
week for both the Fall and Winter term populations of students are equal, so we will assume this
for our study. Perform, with a significance level of 1%, a two sided two sample test to determine if there is evidence to support a supposition that the hours of homework per course for Fall term students differ from the hours of homework per course for Winter term students.

Step 1 (Hypotheses):
Ho: μ1 - μ2 = 0 versus Ha: μ1 - μ2 ≠ 0
OR Ho: μ1 = μ2 versus Ha: μ1 ≠ μ2
Popns: both normal, unknown equal variances, population 1 = Fall termers, population 2 = Winter termers
Samples: Fall: n1 = 25, x̅1 = 3.1, s1 = 0.22; Winter: n2 = 9, x̅2 = 2.2, s2 = 0.18
Pre-chosen α = 0.01

Step 2 (Test Statistic): It is necessary to calculate sp² prior to calculating our test statistic.

sp² = ((n1 - 1)s1² + (n2 - 1)s2²)/(n1 + n2 - 2) = ((24)(0.22)² + (8)(0.18)²)/(25 + 9 - 2) = 0.0444, sp = 0.2107

t* = ((x̅1 - x̅2) - (μ1 - μ2)) / (sp√(1/n1 + 1/n2)) = ((3.1 - 2.2) - 0) / ((0.2107)√(1/25 + 1/9)) = 10.9886

Step 3 (P-value):

P-value = P(t32 > 10.9886) for a t distribution with 25+9-2 = 32 degrees of freedom. This p-value
is essentially zero. Students can verify this by checking statdistributions.com.

Step 4 (Decision):
Since the p-value is clearly below the significance level of 0.01, we reject Ho and conclude that
there is significant evidence to indicate that the hours of homework per course for Fall term
students differs from the hours of homework per course for Winter term students.
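For reference, scipy can run this pooled test directly from the summary statistics (a sketch; ttest_ind_from_stats with equal_var=True is the pooled-variance test):

    from scipy import stats

    t_star, p_value = stats.ttest_ind_from_stats(
        mean1=3.1, std1=0.22, nobs1=25,
        mean2=2.2, std2=0.18, nobs2=9,
        equal_var=True)               # pooled variances
    print(round(t_star, 2), p_value)  # t* = 10.99, two sided p-value near 0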

A (1 – α)% Confidence Interval for µ1 - µ2 for two independent samples is:

(x̅1 - x̅2) ± tα/2 sp√(1/n1 + 1/n2)

where tα/2 is the t value such that P(t < -tα/2) = P(t > tα/2) = α/2 for n1 + n2 - 2 degrees of freedom

Calculate a 99% confidence interval for µ1 - µ2, the difference in the average hours of weekly
homework per course performed by Fall termers and Winter termers at the University.

(3.1 - 2.2) ± t0.005 (0.2107)√(1/25 + 1/9) = (3.1 - 2.2) ± (2.738)(0.2107)√(1/25 + 1/9)

= 0.9 ± 0.2243 for 32 (= 25 + 9 - 2) df

We can obtain t0.005 = 2.738 for 32 df from the t-tables.

Hence our 99% confidence interval for µ1 - µ2, the difference in the average hours of weekly
homework per course performed by Fall termers and Winter termers at the University, is
(0.6757, 1.1243) hours.

Since the confidence interval does not contain 0, we have significant evidence, at the 1%
significance level, that there is a difference in the average hours of weekly homework per course
performed by Fall termers and Winter termers at the University.

For the same set of sample data, a two sided hypothesis test of μ1 - μ2 with level of significance α% will always give the same conclusion regarding the significance of the difference between the two population means as a confidence interval for μ1 - μ2 with level of confidence (1-α)%.

Example 9.3:

Past studies have indicated that the standard deviations of the two populations of the minutes
spent texting per week by undergraduate and graduate students in your city are approximately
equal. Two independent random samples of students in your city are taken, one of
undergraduates and one of graduates. They are asked how many minutes they spend texting
during a regular week. Results are below. Test, with α= 5%, whether there is evidence that
undergraduates text more than graduates in your city.

Undergraduates Graduates
𝑥̅1 = 354.2 𝑥̅2 = 344.5
𝑠12 = 50.9 𝑠22 = 49.1
𝑛1 = 50 𝑛2 = 50

Hypothesis: (set Population 1 = undergraduates, Population 2 = graduates)


Ho: 𝜇1 - 𝜇2 ≤ 0 or Ho: 𝜇1 ≤ 𝜇2
Ha: 𝜇1 - 𝜇2 > 0 Ha: 𝜇1 > 𝜇2

Assumptions:
Independent random samples, two normal populations, unknown equal variances

Test Statistic: It is necessary to calculate sp² prior to calculating our test statistic.

sp² = ((n1 - 1)s1² + (n2 - 1)s2²)/(n1 + n2 - 2) = ((49)(50.9) + (49)(49.1))/(50 + 50 - 2) = 50, sp = 7.071

t* = ((x̅1 - x̅2) - (μ1 - μ2)) / (sp√(1/n1 + 1/n2)) = ((354.2 - 344.5) - 0) / ((7.071)√(1/50 + 1/50)) = 6.86

P-value:
P(t > 6.86) for a t distribution with 50+50 -2 = 98 degrees of freedom
This p-value is essentially zero. Students can verify this by checking statdistributions.com or if
students wish to use an approximation from printed t-tables, they should use the df line closest to
98 that is below 98 in the printed tables they use.

Decision:
Since the p-value is clearly below the significance level of 0.05, we reject Ho and conclude that
there is significant evidence to indicate that undergraduates do more texting than graduates.

Confidence Interval:

Calculate a 95% confidence interval for µ1 - µ2, the difference in the average number of weekly
minutes texted by undergraduates and graduates in your city.

(354.2 - 344.5) ± t0.025,98 (7.071)√(1/50 + 1/50) = (354.2 - 344.5) ± (1.984)(7.071)√(1/50 + 1/50) for 98 (= 50 + 50 - 2) df

= 9.7 ± 2.806

Note: we obtained t0.025,98 = 1.984 for 98 df online.


If students wish to use an approximation from printed t-tables, they should use t0.025,df, for the df
closest to 98 that is below 98 in the printed tables they use.

Our 95% confidence interval for µ1 - µ2, the difference in the weekly number of minutes of
texting performed by undergraduates and graduates in the city is (6.894,12.506) minutes.

This confidence interval for the differences does not contain 0 (it contains positive lower and
upper values), so we have significant evidence, at the 5% significance level, that undergraduates
and graduates text a different weekly number of minutes, on average.

TWO INDEPENDENT T SAMPLES WELCH’S TEST: (NON-POOLED VARIANCE)

What happens when we cannot make the assumption that the population variances of our two
independent populations are equal?

Let us assume again that we have two independent normal populations from which we take
two simple random samples.

In this case, we will use the test statistic t* = ((x̅1 - x̅2) - (μ1 - μ2)) / √(s1²/n1 + s2²/n2)

Welch's test statistic has approximately a t distribution and its degrees of freedom can be calculated using Satterthwaite's approximation, shown below.

df = (s1²/n1 + s2²/n2)² / [ (s1²/n1)²/(n1 - 1) + (s2²/n2)²/(n2 - 1) ], rounded down to the nearest integer

Because these degrees of freedom can only be readily calculated by a computer, this test was not widely used a generation ago. However, statistical software today has revolutionized its use.

A conservative choice for the degrees of freedom when doing a two independent samples t test by hand is to use degrees of freedom = min(n1 - 1, n2 - 1). This allows you to avoid having to calculate the df from the unwieldy formula above.
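A small helper function makes the arithmetic painless; this is a sketch (assuming Python; the function name and arguments are our own) of Welch's t* and the Satterthwaite df:

    def welch(x1_bar, s1_sq, n1, x2_bar, s2_sq, n2):
        """Welch's t* and Satterthwaite df from summary statistics."""
        v1, v2 = s1_sq / n1, s2_sq / n2
        t_star = (x1_bar - x2_bar) / (v1 + v2) ** 0.5
        df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
        return t_star, int(df)   # df is rounded down to the nearest integer

    # Example 9.4's summary data (below) gives t* = 1.022 with df = 21:
    print(welch(18, 6, 12, 17, 5.5, 12))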

As long as the following assumptions are made


1. Simple random samples
2. Independent samples
3. Normal populations (or large samples (both n1 and n2 >=30))
the following chart indicates the appropriate 4 step hypothesis testing procedure.

Step 1 (Hypotheses):
Right tail: Ho: μ1 - μ2 ≤ 0 (i.e. Ho: μ1 ≤ μ2) versus Ha: μ1 - μ2 > 0 (i.e. Ha: μ1 > μ2)
Left tail: Ho: μ1 - μ2 ≥ 0 (i.e. Ho: μ1 ≥ μ2) versus Ha: μ1 - μ2 < 0 (i.e. Ha: μ1 < μ2)
Two tailed: Ho: μ1 - μ2 = 0 (i.e. Ho: μ1 = μ2) versus Ha: μ1 - μ2 ≠ 0 (i.e. Ha: μ1 ≠ μ2)

Step 2 (Test Statistic):
t* = ((x̅1 - x̅2) - (μ1 - μ2)) / √(s1²/n1 + s2²/n2),
df = (s1²/n1 + s2²/n2)² / [ (s1²/n1)²/(n1 - 1) + (s2²/n2)²/(n2 - 1) ] rounded down to the nearest integer,
or, conservatively, df approximately min(n1 - 1, n2 - 1)

Step 3 (p-value):
Right tail: P(t > t*)
Left tail: P(t < t*)
Two tailed: 2P(t < t*) if t* negative; 2P(t > t*) if t* positive

Step 4 (Decision): Reject Ho if p-value ≤ α

A (1 - α)% Confidence Interval for µ1 - µ2 is

(x̅1 - x̅2) ± tα/2 √(s1²/n1 + s2²/n2), where P(t < -tα/2) = P(t > tα/2) = α/2 for the df above.

Robustness of the pooled t test and Welch’s test
The pooled t test is robust to modest departures from normality, but even for large samples, it
can be affected by outliers. The pooled t test is also robust to moderate violations of the equal
population variances assumption, as long as the sample sizes are equal. Welch’s test is somewhat
forgiving of the normality assumption also, but if sample sizes are less than 30, the distributions
should be mound shaped and approximately symmetric. If both sample sizes are ≥ 30, then the
normality assumption can be relaxed.

Example 9.4:
A consumer group is testing iPads. Two independent samples are taken, with 12 people being randomly selected to beta test an iPad made by StarTrek Enterprises and another 12 people being randomly selected to beta test an iPad made by DeltaQuadrant Inc. Study participants are asked to record the number of times they had to contact customer service during a month. An examination of the sample distributions of the data indicates that we can safely presume that the populations of monthly numbers of customer service contacts are normally distributed. Test, using the summary data presented below, and a significance level of 10%, whether there is any evidence of a difference (either way) in the monthly number of customer service contacts for StarTrek Enterprises and DeltaQuadrant Inc customers.

StarTrek Enterprises DeltaQuadrant Inc


𝑥̅1 = 18 𝑥̅2 = 17
𝑠12 = 6 𝑠22 = 5.5
𝑛1 = 12 𝑛2 = 12

Hypothesis: Let 1 – Star Trek Enterprises, and 2 – DeltaQuadrant Inc


Ho: 𝜇1 - 𝜇2 = 0 or Ho: 𝜇1 = 𝜇2
Ha: 𝜇1 - 𝜇2 ≠ 0 Ha: 𝜇1 ≠ 𝜇2

Assumptions: Two independent normal populations, unknown unequal variances,

Test Statistic:
t* = ((x̅1 - x̅2) - (μ1 - μ2)) / √(s1²/n1 + s2²/n2) = ((18 - 17) - 0) / √(6/12 + 5.5/12) = 1/√0.9583 = 1/0.9789 = 1.022

df = (s1²/n1 + s2²/n2)² / [ (s1²/n1)²/(n1 - 1) + (s2²/n2)²/(n2 - 1) ]
   = (6/12 + 5.5/12)² / [ (0.5)²/11 + (0.4583)²/11 ]
   = (0.9583)² / (0.0227 + 0.0191)
   = 0.9183/0.0418 = 21.97, which we round down to the nearest integer, and use df = 21

Note: We could use min(n1 - 1, n2 - 1) = min(11, 11) = 11 in order to be conservative, if we wish.

P-value:

2P(t21 > 1.022) for a t distribution with 21 degrees of freedom = 2(0.159) = 0.318 (from online)

Decision: Since the p-value is above our significance level of 10%, we do not reject Ho and conclude that there is no significant evidence of a difference in the monthly number of customer service contacts made by the beta testers of the two companies' products.

A 90% Confidence Interval for µ1 - µ2 is

(x̅1 - x̅2) ± tα/2 √(s1²/n1 + s2²/n2) = (18 - 17) ± t0.05 √(6/12 + 5.5/12)
= (18 - 17) ± 1.72√(0.5 + 0.4583) = 1 ± 1.72√0.9583 = 1 ± 1.72(0.9789) = 1 ± 1.68

since 1.72 = t0.05 at 21 df.

Therefore a 90% CI for the difference in the monthly number of customer service contacts made to the two companies by the beta testers is (-0.68, 2.68) contacts.

Note that this confidence interval covers 0, indicating that, at the 10% level of significance, there is no significant evidence of a difference in the monthly number of customer service contacts made to the two companies by the beta testers.
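scipy can reproduce this whole Welch test from the summary statistics (a sketch; equal_var=False requests the unpooled test, and the standard deviations are the square roots of the sample variances above):

    from scipy import stats

    t_star, p_value = stats.ttest_ind_from_stats(
        mean1=18, std1=6 ** 0.5, nobs1=12,
        mean2=17, std2=5.5 ** 0.5, nobs2=12,
        equal_var=False)                         # Welch / Satterthwaite
    print(round(t_star, 3), round(p_value, 3))   # 1.022 and about 0.318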

Example 9.5
Independent random samples of single parent (one wage earner) households in a certain city are taken; for each household, whether the parent finished high school is recorded (yes or no), along with the yearly household income. Data are given below.
Yearly Income:High School Graduates(Y) Yearly Income:Non-High School Graduates(N)
𝑥̅1 = $42,000 𝑥̅2 = $31,000
𝑠1 =$5,000 𝑠2 = $2,000
𝑛1 = 100 𝑛2 = 121

We suspect high school graduate single parent sole wage earner houses will have more income,
on average.

Test, at the 1% level of significance, whether there is evidence to support this supposition. Note that because both sample sizes are ≥ 30, we can perform a two independent samples t test without assuming normal populations. Since the ratio of the sample standard deviations, 5000/2000 = 2.5, is above 2, we perform an unpooled (Welch) test.

Hypotheses:
Ho: μ1 - μ2 ≤ 0 versus Ha: μ1 - μ2 > 0 or Ho: μ1 ≤ μ2 versus Ha: μ1 > μ2
(here Popn 1 = High School Graduates, Popn 2 = Non High School Graduates)
Assumptions: independent simple random samples, unequal variances, populations of unknown shape, but since both n1 and n2 are ≥ 30, it is ok to proceed

Test statistic:

t* = ((x̅1 - x̅2) - (μ1 - μ2)) / √(s1²/n1 + s2²/n2) = ((42000 - 31000) - 0) / √((5000)²/100 + (2000)²/121)
   = 11000/√(250000 + 33057.8512) = 11000/√283057.8512 = 11000/532.0318 = 20.6755

df = (s1²/n1 + s2²/n2)² / [ (s1²/n1)²/(n1 - 1) + (s2²/n2)²/(n2 - 1) ]
   = (5000²/100 + 2000²/121)² / [ (5000²/100)²/99 + (2000²/121)²/120 ]
   = (250000 + 33057.8512)² / (631313131.3131 + 9090324.6241)
   = 80121747125.9613/640403455.9372 = 125.1113, which rounds down to 125

P-value: P(t125 > 20.6755) ≈ 0.000 for a t distribution with 125 degrees of freedom. Students can verify this by checking statdistributions.com, or if students wish to use an approximation from printed t-tables, they should use the df line closest to 125 that is below 125 in the printed tables they use. Students familiar with SPSS can also check this result in SPSS.

Decision: Since the p-value is below our significance level of 1%, we reject Ho and conclude that there is significant evidence that the difference in income (High School Graduate household income minus Non High School Graduate household income) for single parent sole wage earners is positive. High School Graduate single parent sole wage earners earn more than Non-High School Graduate single parent sole wage earners.

Confidence Interval: A 99% Confidence Interval for µ1 - µ2 is

(42000 - 31000) ± t0.005 √((5000)²/100 + (2000)²/121) = (42000 - 31000) ± (2.616)√((5000)²/100 + (2000)²/121)
= 11000 ± (2.616)(532.0318) = 11000 ± 1391.7952

since 2.616 = t0.005 for 125 df (from online).

Decision: A 99% CI for the difference in yearly income between High School and Non High School Graduate single parent sole wage earners is ($9608.2048, $12391.7952). This confidence interval does not include 0, thus indicating that High School Graduate single parent wage earners earn a significantly different amount than Non High School Graduate single parent wage earners, on average.

Chapter 10: ANOVA: ANALYSIS OF VARIANCE
The ANOVA test of inference is one test that can be used (subject to various conditions) with
random sample data where a continuous interval (or ratio) numerical variable is classified
according to a nominal (categorical) treatment variable. The hypothesis of interest is whether
the assumed same normal shaped distributions (with the same variances) of the treatment populations differ in location. We explain this in more detail below.

Example 10.1:
Treatment (Categorical) Variable: Favourite form of summer exercise (cycle, water, walk)
Numerical Variable: Age of Seniors
Hypothesis of interest: Do the same shaped normal treatment population locations differ (i.e. do
the mean ages of seniors in the cycle, water and walk populations differ)

Example 10.2:
Treatment (Categorical) Variable: Week (Week before exams, Exam Week, Week after Exams)
Numerical Variable: Number of hours of exercise per week
Hypothesis of Interest: Do the same shaped normal treatment population locations differ (i.e.
does the mean number of hours of exercise per week undertaken by students before, during, and
after exams differ)

Example 10.3:
Treatment (Categorical) Variable: Parental Status of Students (Yes/No)
Numerical Variable: Number of hours of homework per week
Hypothesis of interest: Do the same shaped normal treatment population locations differ (i.e.
does the mean number of hours of homework per week undertaken by students in the parent and
non-parent populations differ)

Assumptions:
1. Random samples of observations (numerical values) are drawn from treatment populations
2. These samples are independent of each other
3. The treatment populations are normally distributed
4. The treatment populations all have the same variance, σ2

Note: On the one hand, the first two assumptions are typically satisfied when a simple random
sample of continuous numerical observations is taken from one population (i.e. age) and then
classified according to some treatment variable (i.e. favourite form of exercise). Experimenters,
on the other hand, randomly assign people from a population to treatment groups (i.e. people
might be assigned to four treatment (categorical) groups (drug A, drug B, herb, placebo) and then
have the length of time to recovery from an illness measured).

The picture below illustrates the assumptions for the summer exercise/age of seniors example.
We are assuming that we have three independently selected random samples of seniors,
and that the three treatment populations of ages of these seniors all have the same normal
shape and variance. What we are interested in is whether they differ in location (or, more
simply, whether their means differ in location). When Ho is true, they will not differ in location,
but when Ha is true (as in the picture), they will differ in location.

(Figure: three identical normal curves centred at μ1, μ2, and μ3.)

We will perform a test known as the ANOVA (Analysis of Variance) test to discern this.

Ho: all the means of the treatment populations are the same
Ha: at least one population (treatment) mean differs from the others

Here:
Ho: the mean ages of seniors preferring cycling, swimming, and walking are all the same
Ha: at least one of the mean ages of seniors preferring cycling, swimming, and walking differs from the others

Ho: µ1 = µ2 = µ3
Ha: at least one µi differs (i = 1, 2, 3)

In general:
Ho: µ1 = µ2 = .... = µI
Ha: at least one µi differs (i = 1, 2, ..., I)

We test the above hypothesis for a set of data values which we arrange in I treatment columns.

Each treatment will have ni observations.

We let xi,j = the jth observation in the ith treatment


We let 𝑥̅𝑖 = the mean of the ith treatment
We let 𝑥̅ = the grand (total) mean of all observations

Consider SRSs from each of the I populations with the sample from the ith population having ni
observations denoted xi1, xi2, …xi,ni.

The One Way ANOVA Model is:


xij = µi + εij for i = 1, …I and j = 1, …., ni

where we assume that the εij are normally and independently distributed, with mean 0 and
variance σ2.

The parameters of the model are the population means µ1, µ2, …. µI and the common standard
deviation σ.

(When I = 2, the One Way ANOVA test is equivalent to a two sample t test.)

Example 10.4: Consider two sets of data:

Treatment Group: Club (Ratio Variable: Age; I = 3 treatments; n1 = 4, n2 = 4, n3 = 4; nT = 12)

Club 1   Club 2   Club 3
   6       12       35
   8       13       37
  10       15       39
  12       18       40
Total:    36       58      151
Mean:  x̅1 = 9   x̅2 = 14.5   x̅3 = 37.75;  grand mean x̅ = 20.4

Numbers vary from 6 to 40.
Little variability of age within each club (here comparing the xi,j's to the x̅i's).
Much variability of age between clubs (here comparing the x̅i's to x̅).
Means appear unequal across columns (treatment populations).
An Ho of equal means would be rejected.

Treatment Group: Family (Ratio Variable: Age; I = 3 treatments; n1 = 4, n2 = 4, n3 = 4; nT = 12)

Family 1   Family 2   Family 3
   40         39         39
   39         37         35
   12         10          8
    9          8          6
Total:    100        94        88
Mean:  x̅1 = 25   x̅2 = 23.5   x̅3 = 22;  grand mean x̅ = 23.5

Numbers vary from 6 to 40.
Much variability of age within each family (here comparing the xi,j's to the x̅i's).
Little variability of age between families (here comparing the x̅i's to x̅).
Means appear fairly equal across columns (treatment populations).
An Ho of equal means would not be rejected.
Computational Notation: (preliminary to forming our test statistic)

SST = Sum of Squares Total = sum of squared differences between each observation and the
overall mean = ∑ ∑(𝑥𝑖𝑗 − 𝑥̅ )2

SSG = Sum of Squares Between treatment groups = sum of squared differences between each
treatment mean and overall mean = ∑ 𝑛𝑖 ( 𝑥̅𝑖 − 𝑥̅ )2

SSE = Sum of Squares Within treatments = sum of squared differences between each
observation and its respective treatment group mean = ∑ ∑(𝑥𝑖𝑗 − 𝑥̅𝑖 )2

Note: SST = SSG + SSE (Although intuitive, this actually takes about 5 minutes to show
algebraically, and students are not responsible for knowing how to do this.)

MST = SST/(nT - 1) = Total Mean Variation

MSG = SSG/(I - 1) = Mean Variation between Treatments

MSE = SSE/(nT - I) = Mean Variation within Treatments

MSE always estimates the population variance σ²; when Ho is true (all treatment means equal), MSG and MST also estimate σ².

The ratio MSG/MSE follows an F distribution with:
I - 1 degrees of freedom in the numerator (DFG) and
nT – I degrees of freedom in the denominator (DFE)
where I = number of treatments and nT = total number of observations

The F distribution family consists of right skewed distributions that are always positive. Their
shapes vary according to their numerator and denominator degrees of freedom.

(F distribution pictures: http://www.mathcaptain.com/statistics/f-test.html and https://upload.wikimedia.org/wikipedia/commons/7/74/F-distribution_pdf.svg )

“Relatively” high values of F* for a particular F distribution (with particular numerator and
denominator degrees of freedom) will lead to rejection of a Ho of equal means. So our ANOVA
test will test a one sided F test with a rejection region in the upper tail only!

Let F* be our observed test statistic value of MSG/MSE.

If F* is large, most variation is between treatments (MSG considerably larger than MSE).
If F* is small, most variation is within treatments (MSG close to MSE).
If F* is large, reject Ho (all means equal).
If F* is small, do not reject Ho (all means equal).

Analysis of Variance output is often presented in an ANOVA table, thusly:

                  Sum of Squares   Df    Mean Squares     F statistic    P-value
Between Groups    SSG              DFG   MSG = SSG/DFG    F* = MSG/MSE   P(F > F*)
Within Groups     SSE              DFE   MSE = SSE/DFE
Total             SST

Example: An ANOVA test with 4 treatments is performed to determine if there is
significant evidence that at least one of the treatment means differs from the others. You
are provided with the following ANOVA results.

Source of Variation    SS    Df    MS    F
Between Groups        126     3    42   21
Within Groups          26    13     2
Total                 152    16

a) Pretend the italicized numbers are invisible, and make sure you can obtain them using the other information.
WORK SPACE: SSG = 126 (from 152 - 26); DFE = 13 (from 16 - 3)
MSG = 126/3 = 42
MSE = 26/13 = 2
F* = 42/2 = 21

b) Determine if there is significant evidence that at least one of the means differs from the others. Use a significance level of 1%.
Degrees of Freedom: (3, 13) (3 is the numerator df, and 13 is the denominator df)

Rejection Region: Reject Ho if F > 5.74 (look at the 1% F-tables with (3, 13) df to get this number: find 3 at the top of a column, and then look down that column until you find the 13 row)

Decision and why: Since F* = 21 is > 5.74 (the critical value), it is in the rejection region.

Reject Ho and conclude that there is significant evidence that at least one of the means differs from the others.

Example 10.5

(created when your instructor was determining which fertilizer to use on her tomato plants.)
A student randomly assigns 12 just sprouted potted tomato plants to 3 treatment groups. All
plants are planted in a healthy soil and subjected to a healthy watering regime. The first group
receives no additional treatment, the second group receives a careful calibration of N-P-K
(nitrogen, phosphorus, potassium) in the ratio 8-32-16, and the third group receives a careful
calibration of N-P-K (nitrogen, phosphorus, potassium) in the ratio 6-24-24. She measures their
growth in height in mm over the next week to determine if the mean growth differs in at least
one of the groups. Measurements follow.

NONE R8-32-16 R6-24-24


40 80 85
50 94 68
45 75 102
43 82 94

We have I = 3 treatments (NONE, R8-32-16, R6-24-24) and we have 4 observations per treatment (n1 = n2 = n3 = 4). Assuming all necessary assumptions are met, perform, with a significance level of 1%, an ANOVA test to determine whether there is a difference in the means of the treatment populations (that is, whether the mean height of plant growth differs in at least one treatment).

Step 1: Hypothesis
Ho: µNONE= µR8-32-16= µR6-24-24
Ha: at least one of the µi s differs

Step 2: Test statistic

Students are expected to be able to read, understand, interpret and make decisions when given
summary ANOVA output.

Our test statistic value is F* = MSG/MSE = (SSG/(I - 1))/(SSE/(nT - I)) = (4414.5/2)/(886.5/9) = 2207.25/98.5 = 22.4086
Step 3: Critical region and P-value

F critical value: for α = 1%, df numerator = I - 1 = 2 and df denominator = nT - I = 9, Fcrit = 8.02 (from the F0.01 tables from class; also available at http://www.statdistributions.com/f/).
P-value (from ANOVA output) is approximately 0.

Step 4: Decision.
At the 1% level of significance, we reject Ho since F* = 22.4086 > 8.02.
(Equivalently, we reject Ho since the p-value ≤ α (0.000 ≤ 0.01).)
We conclude that at least one of the treatment population mean heights of plant growth in a week differs from the others.
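The entire ANOVA can be verified with software; a minimal sketch (assuming Python with scipy; scipy.stats.f_oneway performs the one way ANOVA F test):

    from scipy import stats

    none     = [40, 50, 45, 43]
    r8_32_16 = [80, 94, 75, 82]
    r6_24_24 = [85, 68, 102, 94]

    f_star, p_value = stats.f_oneway(none, r8_32_16, r6_24_24)
    print(round(f_star, 4), p_value)   # F* = 22.4086, p-value approximately 0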

Example 10.6: A random sample of nT = 12 individual students is chosen, and the students are classified according to treatments (statistical software used: Minitab (M), R (R), or SPSS (S)), and the length of time each needs to solve a given statistics problem is recorded. We have I = 3 treatments (M, R, and S) and we have n1 = 4, n2 = 3, and n3 = 5. Assuming all necessary assumptions are met, perform, with a significance level of 5%, an ANOVA test to determine whether there is a difference in the means of the treatment populations (whether the mean length of time needed to solve a given statistics problem differs across statistical software packages).

Hypotheses: Ho: µM = µR = µS
Ha: at least one of the µi s differs

Calculation of Test Statistic


M: Minitab   R: R   S: SPSS
     7         9       2
     4         5       3
     4         7       5
     3                 3
                       8

Software generated the following ANOVA information.

F* = MSG/MSE = (SSG/(I - 1))/(SSE/(nT - I)) = 8.1/4.42 = 1.83

Critical Value: F2,9 = 4.26 for α = 0.05 (from the F0.05 tables from class; also available at the online tables at http://www.statdistributions.com/f/).

P-value = 0.215 (from the ANOVA table).

Decision:
We reject Ho if F* > 4.26
We reject Ho if pvalue <= 0.05

Since F*=1.83 is not in the rejection region (it is below 4.26), we fail to reject Ho
Since the pvalue = 0.215 > 0.05 , we fail to reject Ho.

At the 5% significance level, we do not have significant evidence to conclude that at least one of
the problem solving mean times differs from the others.

ANOVA ASSUMPTIONS IN MORE DETAIL:
A proper sample design ensures that independent random samples are drawn from each
treatment population.

Let xij be the jth observation from the ith population where i = 1, 2, … I and j = 1, 2, … ni

xij = µi + εij

where the εij are independent and from a N(0,σ) distribution.

Parameters of the model are population means µ1, µ2, …. µI and standard deviation σ.

NORMALITY OF SHAPE
The ANOVA test is somewhat robust to departures of the treatment populations from normality in shape. It will usually suffice to examine dot plots (or histograms) of the sample unit values for each treatment to check the shape of the sample distributions: if the sample distributions are normal in shape, this suggests that the population distributions are also, and if the sample distributions are blatantly non-normal, this suggests that the population distributions are also non-normal (as long as the sample sizes are fairly "large").

EQUAL VARIANCES
The assumption of equal variances in the treatment populations (homogeneity of variances) is
not as robust an assumption. When this assumption does not hold, an ANOVA test becomes
an invalid test and a non-parametric test should be used instead.

The assumption of equal variances can be tested using a test known as Levene’s test.

Levene’s Test tests the hypothesis Ho that all I treatment population variances σi2 (i = 1, 2,...I)
are equal against a Ha that at least one population variance differs from the other population
variances. When this test is run, we want not to reject the null hypothesis, as this tells us there
is no significant evidence that the population variances are different.

We will not learn the calculation of this test statistic, but rely on SPSS to provide it for us. You
can ask for the Levene Statistic to check Homogeneity of Variance in the Output popup from the
One Way Anova window. Levene Statistic output for our Tomato example above is provided
here. In all cases, Ho is not rejected, and we conclude there is no significant evidence that the
population variances differ.
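For students without SPSS, a sketch of the same check in Python (scipy.stats.levene is Levene's test; shown here on the tomato data):

    from scipy import stats

    none     = [40, 50, 45, 43]
    r8_32_16 = [80, 94, 75, 82]
    r6_24_24 = [85, 68, 102, 94]

    # Ho: the treatment population variances are all equal
    stat, p_value = stats.levene(none, r8_32_16, r6_24_24)
    print(round(stat, 3), round(p_value, 3))
    # a large p-value here means we do not reject Ho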

ANOVA: CONFIDENCE INTERVALS

When we reject an ANOVA Ho and conclude that the treatment means differ, it is of interest to
determine which mean(s) are different.

We will consider two approaches to finding confidence intervals about the differences between means. It will be of interest to see if these confidence intervals contain 0 for pre-chosen significance levels; if they contain 0, the means will not be considered significantly different. If they do not contain 0, the means will be considered significantly different.

CONTRASTS (APPROACH 1)

Contrasts are linear combinations of means where the sum of the coefficients of the means add
to 0. A difference in means is one example of a contrast.

When contrasts are formulated before seeing the data, inference about contrasts is valid whether or not the ANOVA test of means rejected the Ho of equality of treatment means.

Some definitions and results needed to do the inference follow.

Linear combinations of population means have the form, where i = 1, 2, …,I

𝜓 = ∑ 𝑎𝑖 𝜇𝑖

A population contrast is a linear combination where the 𝑎𝑖 coefficients add to zero (∑ 𝑎𝑖 = 0).

Example: Suppose we have 3 population means, 𝜇1 , 𝜇2 , and 𝜇3 .

For the population contrast 𝜇1 - 𝜇2 : 𝑎1 = 1, 𝑎2 = −1 𝑎3 = 0

For the population contrast 𝜇1 - 𝜇3 : 𝑎1 = 1, 𝑎2 = 0 𝑎3 = −1

For the population contrast 𝜇2 - 𝜇3 : 𝑎1 = 0, 𝑎2 = 1 𝑎3 = −1

The sample contrast for a sample of size nT = ∑ 𝑛𝑖 is:

c = ∑ 𝑎𝑖 𝑥̅𝑖

The distribution of C has μC = ψ and standard error SEC = sp √(∑ ai²/ni), where sp = √MSE.

T = (C - ψ)/SEC has a t distribution with degrees of freedom DFE (associated with sp).
The test statistic for testing H0: 𝜓 = 𝜓o is:

t0 = ((a1x̅1 + … + aIx̅I) - ψo) / (sp√(a1²/n1 + … + aI²/nI))

A (1 – α)% confidence interval for 𝜓 is:


∑ ai x̅i ± tα/2 sp √(∑ ai²/ni)

where 𝑠𝑝 = √𝑀𝑆𝐸 is the pooled estimate for σ,

𝑡𝛼/2 is the value of the t distribution with area 𝛼/2 to the right of it,

and df = n1 + ... + nI – I.

CONTRAST INTERVAL FOR A PAIR OF MEANS

We use (𝑥̅𝑖 − 𝑥̅𝑗 ) to estimate the contrast (𝜇𝑖 − 𝜇𝑗 ).

A (1 - α)% Confidence interval for (μi - μj) is (x̅i - x̅j) ± tα/2 sp √(1/ni + 1/nj)

where 𝑠𝑝 = √𝑀𝑆𝐸 is the pooled estimate for σ, and 𝑡𝛼/2 is the value of the t distribution with area
𝛼/2 to the right of it for d.f. = n1 + ... + nI – I (where I is the number of treatments)

A test statistic for testing H0: (𝜇𝑖 − 𝜇𝑗 ) = 0

t0 = (x̅i - x̅j) / (sp √(1/ni + 1/nj))

We will calculate individual contrast confidence interval formulas and test statistics for testing
(𝜇1 − 𝜇2 ), (𝜇1 − 𝜇3 ) and (𝜇2 − 𝜇3 ) at the 5% significance level for the tomato example. We
can pretend that we were prescient enough to think of these contrasts before performing our
ANOVA tests above.

Tomatoes Example (cont)

SETUP

The t distribution has nT - I = 12 - 3 = 9 degrees of freedom.

Use α = 0.05, and t0.025 = 2.262

sp = √MSE = √(886.5/9) = √98.5 = 9.92

Total Treatment 1 = 178, Total Treatment 2 = 331, Total Treatment 3 = 349 (calculated by
hand)

x̅1 = 178/4 = 44.5, x̅2 = 331/4 = 82.75, x̅3 = 349/4 = 87.25, since n1 = 4, n2 = 4, n3 = 4

CONTRAST INTERVAL CALCULATIONS, CONFIDENCE INTERVALS, AND T TESTS

NONE versus R8-32-16 (Ho: μ1 - μ2 = 0 versus Ha: μ1 - μ2 ≠ 0):
CI: (x̅1 - x̅2) ± tα/2 sp √(1/n1 + 1/n2) = (44.5 - 82.75) ± 2.262(9.92)√(1/4 + 1/4) = -38.25 ± 15.87
95% CI: (-54.12, -22.38) mm. Interval does not capture 0. Can reject Ho.
Test: t0 = (x̅1 - x̅2)/(sp√(1/n1 + 1/n2)) = (44.5 - 82.75)/(9.92√(1/4 + 1/4)) = -38.25/7.0145 = -5.45
Reject Ho if t0 < -2.262 or t0 > 2.262. Can reject Ho.

NONE versus R6-24-24 (Ho: μ1 - μ3 = 0 versus Ha: μ1 - μ3 ≠ 0):
CI: (44.5 - 87.25) ± 2.262(9.92)√(1/4 + 1/4) = -42.75 ± 15.87
95% CI: (-58.62, -26.88) mm. Interval does not capture 0. Can reject Ho.
Test: t0 = (44.5 - 87.25)/(9.92√(1/4 + 1/4)) = -42.75/7.0145 = -6.09
Reject Ho if t0 < -2.262 or t0 > 2.262. Can reject Ho.

R8-32-16 versus R6-24-24 (Ho: μ2 - μ3 = 0 versus Ha: μ2 - μ3 ≠ 0):
CI: (82.75 - 87.25) ± 2.262(9.92)√(1/4 + 1/4) = -4.5 ± 15.87
95% CI: (-20.37, 11.37) mm. Interval does capture 0. Cannot reject Ho.
Test: t0 = (82.75 - 87.25)/(9.92√(1/4 + 1/4)) = -4.5/7.0145 = -0.64
Reject Ho if t0 < -2.262 or t0 > 2.262. Cannot reject Ho.

With 95% confidence, there is significant evidence 𝜇1 (NONE) differs from 𝜇2 (R8-32-16).
With 95% confidence, there is significant evidence 𝜇1 (NONE) differs from 𝜇3 (R6-24-24).
With 95% confidence, there is no significant evidence 𝜇2 (R8-32-16) differs from 𝜇3 (R6-24-24)

PAIRWISE MULTIPLE COMPARISONS OF MEANS: APPROACH 2


Often, specific questions (contrasts) cannot be formulated before the ANOVA analysis.
However, once the ANOVA analysis leads us to reject Ho, turning up significance, we
become very interested in which pair of means differs from the others! In this case we
turn to Multiple Comparison Methods.

The two approaches below (Tukey-Kramer and Bonferroni) are based on simultaneously creating confidence intervals for ALL differences of pairs of treatment means.

When a confidence interval for a pair of means contains 0 (i.e.: the lower bound is < 0 and the
upper bound is > 0), we conclude there is no significant difference between these particular
treatment means.

The error rate used is called an experiment wide error rate.

In general, an individual confidence interval of interest examines the difference between


the corresponding population means. If several confidence intervals are of interest, the
family wise (also called experiment wise or simultaneous or overall) confidence level is
the confidence we have that all the confidence intervals of simultaneous interest contain
the differences between the corresponding population means.

The more comparisons we do, the more likely it becomes that we make a wrong decision (that is, that we have at least one wrong conclusion in our results).
*Suppose we made three comparisons using three independent tests, each at the 5% level. Then the joint probability of not making a type 1 error on all three tests is (0.95)³ = 0.8574, and the probability of making at least one type 1 error is then 1 - 0.8574 = 0.1426. Now, in our case, our tests are not independent, because MSE is used in each test and the same x̅i's appear in various tests. It can be shown (this is beyond the scope of our course) that the error involved is even greater in this case.

Suppose we have a 5% experiment wide error rate. This means that if we repeat an experiment
(of say, growth of tomato seedlings given different diets of feed) several times, and each time,
we constructed simultaneous confidence intervals, in 5% of the several times that we created
these simultaneous confidence intervals, we would make at least one mistake in our
conclusions about which populations have different means.

What we are doing is setting things up to guarantee that the probability of false rejection among
all the comparisons is no greater than 0.05. This is a much stronger protection than controlling
the probability of a false rejection at 0.05 for each separate comparison.

The decision we must make is whether to control for the family wise error rate or for the
individual error rates. Several strategies exist to handle this situation. Remember that multiple
comparisons should only be made if they are of interest (i.e. we would not make multiple
comparisons if the overall F model was not significant), and that an acceptable individual error
rate or an acceptable experiment wise (family wise) error rate, α, must be decided.

BONFERRONI CONFIDENCE INTERVALS:

This multiple comparison method can be used to test for all possible contrasts of interest in very general situations when we want to control the family wise error rate. Contrasts of interest are decided ahead of time. It uses individual significance levels of α* = α/m, where m is the number of contrasts of interest and α is the overall error rate.

When used with pairwise comparisons, Bonferroni intervals have the following formula:

x̄ᵢ − x̄ⱼ ± t(α*/2, nT − I) · √MSE · √(1/nᵢ + 1/nⱼ)

where α* = α/m, m is the number of unique pairwise comparisons, and nT − I is the d.f. (I is
the number of treatments). For a balanced design, all Bonferroni intervals have the same width.
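
For students who want to check Bonferroni intervals outside SPSS, here is a minimal Python sketch (Python with numpy/scipy is an assumption of this aside, not part of the course software; the function name is ours). It is used below with the values from the tomato-height example.

import numpy as np
from scipy import stats

def bonferroni_pairwise_ci(means, ns, mse, df_error, alpha=0.05):
    # All pairwise intervals: (xbar_i - xbar_j) +/- t(alpha*/2, df) * sqrt(MSE*(1/ni + 1/nj)),
    # where alpha* = alpha/m and m is the number of unique pairs.
    m = len(means) * (len(means) - 1) // 2
    t_crit = stats.t.ppf(1 - alpha / (2 * m), df_error)
    intervals = {}
    for i in range(len(means)):
        for j in range(i + 1, len(means)):
            half = t_crit * np.sqrt(mse * (1 / ns[i] + 1 / ns[j]))
            diff = means[i] - means[j]
            intervals[(i + 1, j + 1)] = (diff - half, diff + half)
    return intervals

# Tomato-height example below: treatment means 44.5, 82.75, 87.25, MSE = 98.5 on 9 error df
print(bonferroni_pairwise_ci([44.5, 82.75, 87.25], [4, 4, 4], 98.5, 9))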

TUKEY-KRAMER CONFIDENCE INTERVALS:

The formula for Tukey-Kramer confidence intervals is

(x̄ᵢ − x̄ⱼ) ± qα · √((MSE/2)(1/nᵢ + 1/nⱼ))

where qα is from the studentized range table with D1 = I and D2 = nT − I, for a prechosen
experiment-wide error rate α.

Students can calculate 𝑞𝛼 using the table below (also provided in your Formulas and Tables
handout and also on Blackboard) or by using the online Tukey-Kramer critical value calculator at
http://faculty.vassar.edu/lowry/ch14pt2.html . Also, a table of 𝑞𝛼 values for α=0.01and 0.05 is at
http://web.mst.edu/~psyworld/virtualstat/tukeys/criticaltable.html.

For a balanced design where all treatments have the same sample size, the Tukey approach
results in narrower intervals than the Bonferroni approach. The Bonferroni approach is more
conservative than the Tukey-Kramer approach.
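
If you have SciPy 1.7 or newer, the studentized range distribution can stand in for the printed q table; a small sketch (again, Python is our assumption, not course software), using the values from the tomato example that follows:

from scipy import stats

# q critical value for I = 3 treatments and nT - I = 9 error df, alpha = 0.05
q = stats.studentized_range.ppf(0.95, 3, 9)            # ~3.95, matching the table value
half_width = q * (98.5 / 2 * (1 / 4 + 1 / 4)) ** 0.5   # Tukey-Kramer half-width, ~19.6 mm
print(q, half_width)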

Tomatoes Example (continued): For the plant height example, let us calculate the Bonferroni
and the Tukey-Kramer intervals with an experiment-wide error rate of 5%.
SETUP
Total Treatment 1 = 178, Total Treatment 2 = 331, Total Treatment 3 = 349
n1 = 4, n2 = 4, n3 = 4
x̄1 = 178/4 = 44.5, x̄2 = 331/4 = 82.75, x̄3 = 349/4 = 87.25
MSE = SSE/(nT − I) = 886.5/9 = 98.5

Bonferroni
I = 3 and nT − I = 12 − 3 = 9. With α = 0.05 and m = 3 unique comparison pairs, α* = α/m = 0.05/3 = 0.0167, so
α*/2 = (α/m)/2 = (0.05/3)/2 = 0.0083
t(α*/2, nT − I) = t(0.0083, 9) = 2.936 (use the online calculator at Stattrek.com)

1&2: NONE, R8-32-16

(44.5 − 82.75) ± 2.936·√98.5·√(1/4 + 1/4) = −38.25 ± 2.936(7.018) = −38.25 ± 20.61
= (−58.86, −17.64) mm. Interval does not contain 0.

1&3: NONE, R6-24-24

(44.5 − 87.25) ± 2.936·√98.5·√(1/4 + 1/4) = −42.75 ± 2.936(7.018) = −42.75 ± 20.61
= (−63.36, −22.14) mm. Interval does not contain 0.

2&3: R8-32-16, R6-24-24

(82.75 − 87.25) ± 2.936·√98.5·√(1/4 + 1/4) = −4.5 ± 2.936(7.018) = −4.5 ± 20.61
= (−25.11, 16.11) mm. Interval does contain 0.
Tukey
q0.05 = 3.95 for D1 = 3 and D2 = 12 − 3 = 9 (use the online calculator at vassarstats.edu)

1&2: NONE, R8-32-16

(178/4 − 331/4) ± 3.95·√((98.5/2)(1/4 + 1/4)) = (44.5 − 82.75) ± 3.95(4.9624) = −38.25 ± 19.6015
= (−57.85, −18.65) mm. Interval does not contain 0.

1&3: NONE, R6-24-24

(178/4 − 349/4) ± 3.95·√((98.5/2)(1/4 + 1/4)) = (44.5 − 87.25) ± 3.95(4.9624) = −42.75 ± 19.6015
= (−62.35, −23.15) mm. Interval does not contain 0.

2&3: R8-32-16, R6-24-24

(331/4 − 349/4) ± 3.95·√((98.5/2)(1/4 + 1/4)) = (82.75 − 87.25) ± 3.95(4.9624) = −4.5 ± 19.6015
= (−24.10, 15.10) mm. Interval does contain 0.
Conclusion: Note that the Bonferroni intervals are wider than the Tukey intervals.
We reach the same conclusion with the Bonferroni and the Tukey intervals. For an experiment-wide
error rate of 5%, we have significant differences between the NONE and R8-32-16
treatment means and between the NONE and R6-24-24 treatment means. We do not have a
significant difference between the R8-32-16 and R6-24-24 treatment means.

Chapter 11: CHI SQUARE TESTS

GOODNESS OF FIT

The goodness-of-fit test considers data that comes from one observed random categorical
variable, and determines whether evidence indicates that it has shifted away from a hypothesized
discrete distribution.

Example 11.1: Voting data collected from n = 446 MacEwan Statistics 151 students in the
Canadian Fall 2015 Federal election are summarized below, along with corresponding data from
all people who voted in Alberta and Canada. Only data from the Conservative, Green, Liberal,
and NDP parties are considered. We are interested in how our Student data compares to the
Canadian data and to Albertan data.

Party          MacEwan Students Count   MacEwan Students Relative Frequency (in Percent)
Conservative   130                      29.15%
Green          23                       5.16%
Liberal        220                      49.33%
NDP            73                       16.37%
Totals         446                      100.00%

Party          Albertans Count   Actual Albertan Percent*
Conservative   1150101           60.63%
Green          48742             2.57%
Liberal        473416            24.96%
NDP            224800            11.85%
Totals         1897059           100.00%

Party          Canadians Count   Actual Canadian Percent*
Conservative   5613633           33.76%
Green          602933            3.63%
Liberal        6942937           41.75%
NDP            3469368           20.86%
Totals         16628871          100.00%

*The Alberta and Canada data were calculated by your instructor using data online at Elections
Canada. Interested students can google and take a look at the statistics provided there. These
include ONLY Canadians and Albertans who voted for the 4 parties above. This will allow us to
make our comparisons sensibly. (Your instructor notes that out of all votes cast, 98.26% of
Albertans and 94.53% of Canadians voted for those 4 parties. The Canadian percent is lower
because of those voting BQ. No BQ candidates ran in Alberta).

A Chi-Square Goodness of Fit test compares the relative frequency distribution from a sample to
a given (hypothesized) probability distribution for a categorical variable.

We will introduce some notation and assumptions and then perform a statistical test.

Consider a categorical random variable with k possible outcomes.


Let p1, p2, …, pk be the true probabilities that observations fall into outcomes 1, 2, …, k.

Suppose a random sample of n observations is taken.


Let O1, O2, …, Ok be the OBSERVED count of observations in each of the k outcomes.
(some textbooks use n1, n2, …., nk, but we will use O1, O2, …, Ok as it leads into future notation)

Suppose a claim (null hypothesis) assumes a certain probability distribution about the proportion
of observations falling into each outcome in a population.

Let p01, p02, …., p0k be the claimed proportions.

Then E1 = np01, E2 = np02, … Ek = np0k are the EXPECTED cell counts in each outcome.

We compare the observed frequencies to the expected frequencies. Sometimes Ei and Oi can
differ considerably, relative to their size. (Note that a difference of 10 between an Ei and an Oi
would be considerably more notable if Ei = 10 and Oi = 20 than if Ei = 1000000 and
Oi = 1000010.)

We need a test statistic to test how “close” the Ei and Oi are.

-We calculate the squared differences, and weight each one by the expected count of the cell it is in.

The test statistic is χ²* = Σ (Oᵢ − Eᵢ)²/Eᵢ

A relatively large value of (Oᵢ − Eᵢ)²/Eᵢ for an outcome (cell) in the data under consideration means a big
relative difference between what was observed and what we expected for that cell.

Note that the test statistic will always be positive, and that high values of the test statistic
correspond to the situation where at least some of the weighted squared differences between the
observed and expected values are considerably large.

If p01, p02, …., p0k is the true distribution, then the test statistic is distributed as a χ2 (Chi-Square)
distribution with k – 1 degrees of freedom. We will look at how to determine p-values and
critical values from a Chi-Square table below.

Example (continued): We consider, somewhat artificially, the Statistics 151 student data to be a
random sample from a much larger hypothetical population of Statistics 151 students. We make
a null hypothesis that the background proportions of the population of Alberta introductory
Statistics students match the Albertan proportions. We prechoose α = 0.05 as our level of significance.

Ho: p01 = 0.6063, p02 = 0.0257, p03 = 0.2496, p04 = 0.1185
Ha: Ho is not true

Assumptions: A random sample, and all Ei >= 5. This ensures a reasonable sample size and
avoids inflated cell contributions to the test statistic.

Test Statistic:

PARTY          p0i      Observed Oi   Expected Ei = 446·p0i   (Oi − Ei)²/Ei
Conservative   0.6063   130           270.4098                72.9075
Green          0.0257   23            11.4622                 11.6139
Liberal        0.2496   220           111.3216                106.0980
NDP            0.1185   73            52.8510                 7.6816
TOTALS         1.0001   446           446.0446                198.301

χ²* = Σ (Oᵢ − Eᵢ)²/Eᵢ = 198.301 with k − 1 = 4 − 1 = 3 degrees of freedom

The Chi-Square Distribution is a right skewed distribution, and our rejection region is right
sided (as it always is with Goodness of Fit tests). It is large values of χ2*that indicate
unusual cell results, and indicate rejection of Ho. A full table is provided below for student
reference, but we will use a snip of it here.

Critical Value: χ²0.05 = 7.815 for α = 0.05 (from tables for 3 degrees of freedom)

P-value: P(χ² > χ²*) = P(χ² > 198.301) < 0.005 (from tables), as we can see by looking across the
row for 3 df and down the column headed 0.005. The area to the right of 12.838 is 0.005, and 198.301 is far
larger than 12.838. The area to the right of 198.301 is extremely small, and it is very unlikely we
would have observed a test statistic value that high if Ho were true.

[χ² table snips for 3 degrees of freedom, showing χ²0.05 = 7.815 = χ²crit; from the class tables
and from the online tables at http://www.statdistributions.com/chisquare/]


Decision: Since our p-value is < 0.005, which is less than 0.05 (our prechosen α), we reject Ho,
and conclude we have sufficient evidence that at least one of the proportions for the statistics
student survey differs from the hypothesized proportions of the population (of Alberta voters).
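
As a check on the hand calculation, here is a minimal Python sketch (numpy/scipy are an assumption of this aside; the course software is SPSS). We compute the statistic directly and take the right-tail area from the chi-square distribution with 3 df.

import numpy as np
from scipy import stats

observed = np.array([130, 23, 220, 73])            # Conservative, Green, Liberal, NDP
p0 = np.array([0.6063, 0.0257, 0.2496, 0.1185])    # hypothesized Albertan proportions
expected = observed.sum() * p0                     # Ei = 446 * p0i

chi2_star = ((observed - expected) ** 2 / expected).sum()   # ~198.3
p_value = stats.chi2.sf(chi2_star, df=len(observed) - 1)    # area to the right, 3 df
print(chi2_star, p_value)                          # p-value essentially 0: reject Ho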

2 X 2 TABLES

Example 11.2: Return to the Degree and Software Preference example from earlier in the course.

          betaTester (T)   buGfree (G)   Totals
BA (A)    26               18            44
BS (S)    21               5             26
Totals    47               23            70

In this case, a random sample of individuals was taken from a population of students, and two
categorical variables, Preference and Degree, were measured on each student.

TEST OF INDEPENDENCE
We can perform a Test of Independence to determine whether the two categorical variables are
associated with each other.

Ho: No relationship (association) between degree and preference


Ha: A relationship (association) exists between degree and preference
We prechoose α = 0.10.

Assumptions: We require a random sample, and that all individual expected counts are >= 5. We
discuss this further after performing the test (we can be a bit less rigorous about this requirement
for cross-tab tables with more than 4 cells). This ensures a large sample size and that cell
contributions to the test statistic are not artificially inflated.

Earlier, we learned how to check if two events, say A and T, were independent.

Check: P(A ∩ T) = P(A)·P(T)

Check: (total in cell A∩T)/(overall total) = (total in row A)/(overall total) × (total in column T)/(overall total)

Check: total in cell A∩T = (total in row A) × (total in column T) / (overall total)

Now for all (i,j) cells, we want to see if Oij = Eij, where Oij = the observed cell count in row i
and column j and Eij = the expected cell count in row i and column j

We use a test statistic that considers how Oij compares to Eij for all the cells.

Test Statistic: χ²* = Σᵢ Σⱼ (Oij − Eij)²/Eij

The sampling distribution of this test statistic has (r-1)(c-1) degrees of freedom where there are r
rows and c columns in the table. (In our case, with a 2x2 table, we have 1 degree of freedom).

Note that the test statistic is a sum of “cell contributions” from each cell. Each ‘cell
contribution” is “standardized” by dividing by Eij to be sure it contributed a “relative”
contribution to the test statistic total.

A large value of the test statistic indicates that at least one cell contribution was “relatively”
large, and that Oij and Eij were not “relatively “close for such cells.

Oij:
          betaTester (T)   buGfree (G)   Totals
BA (A)    26               18            44
BS (S)    21               5             26
Totals    47               23            70

Eij:
          betaTester (T)          buGfree (G)   Totals
BA (A)    (44)(47)/70 = 29.54     14.46         44
BS (S)    17.46                   8.54          26
Totals    47                      23            70

Cell contributions (Oij − Eij)²/Eij to the test statistic:
          betaTester (T)                buGfree (G)
BA (A)    (26 − 29.54)²/29.54 = 0.42    0.87
BS (S)    0.72                          1.47

χ²* = Σ (Oij − Eij)²/Eij = 3.4814 with 1 degree of freedom

Critical Value: χ²0.10 = 2.706 for α = 0.10 (from tables for 1 degree of freedom)

P-value: P(χ² > χ²*) is such that 0.05 < P(χ² > 3.481) < 0.10 (from tables for 1 degree of freedom).
P(χ² > 3.481) = 0.062 from online tables.

[χ² table snips for 1 degree of freedom, showing χ²0.10 = 2.706 = χ²crit; from the class tables
and from the online tables at http://www.statdistributions.com/chisquare/]

Decision: At the 10% significance level, we reject the null hypothesis, and have significant
evidence that the variables are dependent (associated).

As cell (2,2) makes the highest contribution to the test statistic, we note that it appears that those
pursuing BS degrees are less likely to have a buGfree preference than expected.
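
SciPy can reproduce this test in one call; a minimal sketch (Python is our assumption here, not course software). Note that correction=False turns off the Yates continuity correction that scipy applies to 2x2 tables by default, so the statistic matches the hand calculation above.

import numpy as np
from scipy import stats

table = np.array([[26, 18],   # BA row: betaTester, buGfree
                  [21, 5]])   # BS row
chi2_star, p_value, df, expected = stats.chi2_contingency(table, correction=False)
print(chi2_star, df, p_value)   # ~3.48 on 1 df, p ~0.062
print(expected)                 # the Eij table above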

Cell Size Checking

Here are the rules that we shall use in our course for Tests of Independence and Homogeneity
(see below for the distinction in how these identical test procedures are set up).

Convention, tradition and practicality have all conspired to create an oft-quoted “rule of thumb”:
to appropriately use a Chi-square test statistic when categorical count data is in a table with
r rows and c columns (where at least one of r or c is greater than 2), at least 80%
of the expected counts should be >= 5, and all individual expected counts should be >= 1.

For a 2x2 table, we shall be a little stricter, and require that all expected counts should be >=5.

(For a Goodness of Fit test, we will require that all expected counts are >= 5.)

There is much in the literature that indicates that these rules came from a time when hand
computation made small expected counts difficult to deal with in division, and there is evidence
that this rule can be relaxed some in our day of enhanced computation.

Please note that we do need to be sure we have a large enough sample size. Although the theory
allows us to perform Chi-square tests with small observed counts (a cell count of 0 could occur
from the sampling), and in some cases this is not untoward, it should be noted that small
observed counts can bring unique problems of their own. If an entire row or entire column of
observed 0s appeared, an expected count would be 0 and we couldn’t calculate the test statistic!

TEST OF HOMOGENEITY
A "test of independence" is a way of determining whether two categorical variables are
associated with one another in a population, like gender and software preference above. It
provides a method for deciding whether the observed P(A|B) are "too far" from the observed
P(A) to conclude independence.

A “test of homogeneity” is performed in the same way as a test of independence. It is a way of
determining if two or more sub-groups of a population share the same distribution of a single
categorical variable. For example, do people of different gender identity have different
proportions of Conservative, Liberal, Green and NDP preference? The null hypothesis is that
each sub-group (female and male) shares the same distribution of the other categorical variable
(say, Conservative, Liberal, Green, NDP preferences).

The difference between when to do a test of independence and when to do a test of homogeneity
is a matter of design. In the test of independence, observational units are collected at random
from a population and two categorical variables are observed for each unit (often in a survey). In
the test of homogeneity, the data are collected by randomly sampling from each sub-group
separately (often with experimental design).

Example 11.3: Consider the following cross tab/contingency table, in which a set of 180
randomly chosen babies are categorized according to how they are delivered and the time of day
at which they arrive. Perform a test of independence to see if there is evidence to support the
hypothesis that delivery method and time of day are associated (related) (dependent). Use a level
of significance of 0.05. (PS: Students can read more about this at
http://blogs.scientificamerican.com/sa-visual/why-are-so-many-babies-born-about-8-00-a-m/ )

Observed Frequencies in cell i,j are Oijs (Row i, Column j):

              Natural   C-Section   Induced   Totals
6 am – 6 pm   60        30          35        125
6 pm – 6 am   40        5           10        55
Totals        100       35          45        180

Hypothesis:
Ho: delivery method and time of birth are not related (independent) , α = 0.05
Ha: delivery method and time of birth are related (dependent)

Assumptions: We require a random sample, a large sample size, and that all observed counts
exceed 5. In addition, we require, at least 80% of the expected counts are >=5, and all individual
expected counts are >=1. Everything looks good based on our discussion and observation for
this problem!

Test Statistic:
Expected Frequencies – Eijs:

              Natural   C-Section   Induced   Totals
6 am – 6 pm   69.4444   24.3056     31.25     125
6 pm – 6 am   30.5556   10.6944     13.75     55
Totals        100       35          45        180

Cell Contributions to Test Statistic:

Values of (Oij − Eij)²/Eij are:

              Natural   C-Section   Induced
6 am – 6 pm   1.2844    1.3341      0.4500
6 pm – 6 am   2.9191    3.0321      1.0227

χ²* = 10.0426 with (r − 1)(c − 1) = (2 − 1)(3 − 1) = 2 degrees of freedom

Critical Value: χ²0.05 = 5.991 for α = 0.05 (from tables for 2 degrees of freedom)

P-value: P(χ² > χ²*) is such that 0.005 < P(χ² > 10.0426) < 0.01 (from tables for 2 df).
P(χ² > 10.0426) = 0.00659595 from Daniel Soper’s online tables.

[χ² table snip for 2 degrees of freedom, showing χ²0.05 = 5.991 = χ²crit and
0.005 < P(χ² > 10.0426) < 0.01; from the class tables]

Our rejection region is for χ2* > 5.991. Since 10.0426 is > 5.991, we reject Ho.

Our p-value satisfies 0.005 < P(χ² > 10.0426) < 0.01. Since the p-value is below 0.01, and hence
below 0.05 (our level of significance), we reject Ho. There is significant evidence that there is a
relationship between delivery method and time of birth.

Conclude: Reject Ho and conclude there is evidence of a significant relationship between


delivery method and time of birth.

It appears that babies born naturally are more likely to be born between 6 pm and 6 am
than expected if delivery method and time of birth were independent. It also appears that
babies born by Caesarian and induction are less likely to be born between 6 pm and 6 am
than expected if delivery method and time of birth were independent. This makes sense
because inductions and caesarians (as much as you can schedule caesarians!) are likely to
be scheduled in the daytime when more medical staff is available.
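
The same one-call check works here (a Python sketch under the same assumptions as the earlier ones; for tables larger than 2x2 no continuity correction is applied):

import numpy as np
from scipy import stats

births = np.array([[60, 30, 35],   # 6 am - 6 pm: Natural, C-Section, Induced
                   [40, 5, 10]])   # 6 pm - 6 am
chi2_star, p_value, df, expected = stats.chi2_contingency(births)
print(chi2_star, df, p_value)      # ~10.04 on 2 df, p ~0.0066: reject Ho at 0.05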

Example 11.4: The following cross tab/contingency table categorizes a set of 600 randomly
chosen respondents according to the city in which they live and whether they strongly disagree,
disagree, agree, or strongly agree with the statement “crime is becoming more of a problem in
our area”. Perform a test of independence to see if there is evidence to support the hypothesis
that opinion and location are associated (related) (dependent). Use a level of significance of 0.05.

Observed Frequencies – Oijs:

          Edmonton   Calgary   Red Deer   Totals
St Dis    10         15        20         45.00
Dis       20         15        25         60.00
Neutral   30         20        45         95.00
Ag        100        115       90         305.00
St Ag     40         35        20         95.00
Totals    200.00     200.00    200.00     600.00

Ho: city and view about problem of crime are not related (independent)
Ha: city and view about problem of crime are related (dependent)

Assumptions: We require a random sample, a large sample size, and that all observed counts
exceed 5. In addition, we require, at least 80% of the expected counts are >=5, and all individual
expected counts are >=1. Everything looks good based on our discussion and observation for
this problem!

Test Statistic:

Expected Frequencies – Eijs:

          Edmonton   Calgary   Red Deer   Totals
St Dis    15.00      15.00     15.00      45.00
Dis       20.00      20.00     20.00      60.00
Neutral   31.67      31.67     31.67      95.00
Ag        101.67     101.67    101.67     305.00
St Ag     31.67      31.67     31.67      95.00
Totals    200.00     200.00    200.00     600.00

Cell Contributions:

          Edmonton   Calgary   Red Deer
St Dis    1.67       0.00      1.67
Dis       0.00       1.25      1.25
Neutral   0.09       4.30      5.61
Ag        0.03       1.75      1.34
St Ag     2.19       0.35      4.30

χ²* = Σ Σ (Oij − Eij)²/Eij = (10 − 15)²/15 + (15 − 15)²/15 + … + (20 − 31.67)²/31.67
= 1.67 + 0.00 + … + 4.30 ≈ 25.8

Critical Value: χ²0.05 = 15.51 for α = 0.05 (from tables for (r − 1)(c − 1) = (5 − 1)(3 − 1) = (4)(2) = 8 df)

P-value: P(χ² > χ²*) is such that P(χ² > 25.8) < 0.005 (from tables for 8 df).

P(χ² > 25.8) = 0.001 from online tables.


[χ² table snip for 8 degrees of freedom, showing χ²0.05 = 15.51 = χ²crit and
P(χ² > 25.8) < 0.005; from the class tables]

(from http://www.statdistributions.com/chisquare/)

Our rejection region is for χ2* > 15.51. Since 25.8 is > 15.51, we reject Ho.

Our p-value is P(χ2 > 25.8) < 0.005. Since our p-value is <0.05 (our level of significance), we
reject Ho. There is significant evidence of a relationship between opinion and location.

Conclude: Reject Ho and conclude that there is a relationship between city and opinion (views
about crime). Of those that are neutral, fewer people than expected are neutral in Calgary and
more people than expected are neutral in Red Deer (see the large contributions of cells (3,2) and
(3,3) to the test statistic). Of those that strongly agree, more people than expected strongly agree in
Edmonton and fewer people than expected strongly agree in Red Deer (see the large contributions of
cells (5,1) and (5,3) to the test statistic).

Chapter 12: SIMPLE LINEAR REGRESSION

POPULATION REGRESSION MODEL AND LINE:

Example 12.1

Consider a population of houses.


Let X – size of house in square feet
Let Y – price of house in dollars

Consider the following graph of (x, y) points from the population. We will connect the average
selling prices for the various x values with a line.

Note that house selling prices vary for a given house size, x.
Let E(Y|x) represent the mean selling price of houses of a given size x in square feet.

E(Y|x) = µY = β0 + β1x is the equation of the population regression line, and we fit it to the
graph of house prices.

β0, the y-intercept, is the value of E(Y|x) when x = 0

β1, the slope, is a parameter such that for every increase of one unit in X, there is a change of
β1 units in the mean of Y.

For a particular house, at point (xi, yi), the value yi = β0 + β1xi + ε i ,


where ε i is the error for point (xi, yi)

We write Y = β0 + β1X + ε , as our equation for our population regression model

Y is the dependent variable (its value depends on the value of X)


X is the independent variable

Now, for our data, software calculations with the raw data yielded E(Y|x) = 30,000 + 40x as the
population regression line.

β1, our slope, is $40. This means that for every increase of 1 square foot, there is a
corresponding mean price increase of $40.00.

β0, our intercept, is $30,000. This means that the mean price of a house of 0 square feet is $30,000!?

Watch out here! Extrapolation beyond the values of x that are sensibly of interest is not useful.
In this case, the x values of interest are roughly 800 to 3000 square feet, and our equation is only
of use within that range of values.

Example 12.2:

What is the expected selling price for a 1500 square foot home?
E(Y|x=1500) = 30,000 + 40(1500) = $90000

Example 12.3:

If a 1500 square feet home sells for $100,000, what is the error?
Recall: For a particular house, at point (xi, yi), the value yi = β0 + β1xi + ε i ,
where ε i is the error for point (xi, yi).
Here 100000 = 30000 + 40(1500) + ε i = 90000 + ε i
And ε i= 100000 – 90000 =$10000

SAMPLE REGRESSION MODEL AND LINE:

Usually, we only know sample information, not population information. Here we would have
data for a sample of n houses, that is, for n pairs (xi, yi) of square footage and selling prices.

We write ŷ= b0 + b1x as the equation of the sample regression line, also called the estimated or
fitted line, or line of best fit, and we fit this line to the graph of our sample of n pairs.

b0, the y-intercept, is the value of ŷ when x = 0

b1, the slope, is a value such that for every increase of one unit in x, there is a change of b1
units in ŷ.

ŷ gives an estimate for y for a given value of x

For a particular xi value, at point (xi, yi), yi = b0 + b1xi + e i ,

where e i is the error, or residual, for point (xi, yi)

We write y = b0 + b1x + e, as our equation for our sample regression model

The “ordinary least squares” method for finding the best sample regression line to go through a
scatter of points minimizes the sum of the squared errors with respect to b0 and b1. The
mathematics behind this calculation is calculus, and for those of you who are curious, the
derivation can be found in many calculus textbooks.

The ordinary least squares solution for the “line of best fit” for sample data is the line

ŷ = b0 + b1x, where

b1 = (Σxᵢyᵢ − (Σxᵢ)(Σyᵢ)/n) / (Σxᵢ² − (Σxᵢ)²/n)

b0 = ȳ − b1·x̄

Note that 𝑦̅ = b0 + b1𝑥̅ , and that therefore the sample regression line always passes through the
point (𝑥̅ , 𝑦̅) and the point (0, b0 ).
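
For students who want to check the least squares formulas numerically, here is a minimal Python sketch using the (age, push-ups) data of Example 12.4 below (numpy is an assumption of this aside, not part of the course software):

import numpy as np

age = np.array([19, 28, 18, 58, 23, 64, 78, 43, 48, 36, 21, 67])
pushups = np.array([34, 26, 30, 12, 28, 12, 6, 21, 22, 27, 14, 29])

n = len(age)
b1 = (np.sum(age * pushups) - age.sum() * pushups.sum() / n) / \
     (np.sum(age ** 2) - age.sum() ** 2 / n)    # ~ -0.2651
b0 = pushups.mean() - b1 * age.mean()           # ~ 32.8621
print(b0, b1)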

Example 12.4: We retrieve the sample (age, pushup) data from earlier in the course, and will use
the information in the following table to calculate the least squares line of best fit by hand.
Recall:

X – age of individual (explanatory variable)


Y – number of push-ups individual can do in a minute (response variable)

Individual        Age (xi)   Push-ups (yi)   xi²    yi²    xi·yi
Younger Brother   19         34              361    1156   646
Elder Sister      28         26              784    676    728
Younger Sister    18         30              324    900    540
Mom               58         12              3364   144    696
Middle Sister     23         28              529    784    644
Dad               64         12              4096   144    768
Great Grandma     78         6               6084   36     468
Uncle             43         21              1849   441    903
Aunt              48         22              2304   484    1056
Elder Brother     36         27              1296   729    972
Middle Brother    21         14              441    196    294
Grandpa           67         29              4489   841    1943
Totals:           Σxi = 503  Σyi = 261       Σxi² = 25921  Σyi² = 6531  Σxi·yi = 9658

a. Calculate 𝑥̅ and 𝑦̅.

𝑥̅ = 503/12 = 41.9167 and 𝑦̅ = 261/12 = 21.75

b. Calculate b1 using the formula b1 = (Σxᵢyᵢ − (Σxᵢ)(Σyᵢ)/n) / (Σxᵢ² − (Σxᵢ)²/n)

b1 = (9658 − (503)(261)/12) / (25921 − (503)²/12) = −0.2651

c. Calculate bo using the formula bo = 𝑦̅ − 𝑏1 𝑥̅

b0 = 𝑦̅ − 𝑏1 𝑥̅ = (261/12) – (-0.2651)(503/12) = 32.8621

d. Write the least squares equation, ŷ= b0 + b1x . What direction does the least squares line have?

ŷ = b0 + b1x = 32.8621 – 0.2651x is the least squares equation.


The least square line has a negative direction.

e. Verify that the formula b1 = r·(sy/sx) gives the same answer for b1 as in part b. Recall that we
found that r = −0.6308 for this data earlier in the course.

b1 = (−0.6308)·(8.8124/20.9695) = −0.2651

f. Use the given commands to make a scatterplot that shows the least squares line on the
scatterplot of the points (xi, yi). Note that it goes through the points (0, b0) and (x̄, ȳ). Note also
that the graph contains a value r² of 0.398, which is the square of the correlation r = −0.6308 that
we calculated for this data earlier in the course. We will discuss the meaning of r² later in this
chapter.

Graphs>Legacy Dialogs>Simple Scatter


Define and put PUSHUPS on the Y axis and AGE on the X axis. Say OK to produce the scatter.
Double click to bring up the chart editor.
Right click on the graph, and select “Add Fit Line at Total”.

(0, b0 ) = (0, 32.8621)
and (𝑥̅ , 𝑦̅ ) = ( 503/12, 261/12) = ( 41.9167, 21.75)

g. Follow the Analysis>Regression>Linear path and select PUSHUP as your dependent variable
and AGE as your independent variable. Paste the coefficients box from the output and note that
b0 = 32.862 and b1 = -0.265 can also be read from it.

h. How many push-ups do you predict a 60 year old relative can do?

ŷ = 32.8621 – 0.2651(60) = 16.9561, so we predict a 60 year old could do 16.9561 push-ups.

i. Are there any outliers in the data? If so, how do they affect the sample regression line?
We can view the 21 year old who can only do 14 push-ups per minute and the 67 year old who
can do 29 push-ups per minute as “outliers” who lie outside the bulk of the data. This is an
interesting example, as one sees that the 21 year old pulls the regression line down, while the
67 year old pulls it up. So two unexpected results “balance” each other out. This is certainly
not always the case.

ANOVA: REGRESSION SUMS OF SQUARES

Y values vary about their mean. Some of the variation can be explained by the regression model,
but some of the variation cannot be explained by the regression model.

Note: yi - 𝑦̅ = (yi – ŷi) + (ŷi - 𝑦̅)

where the difference (yi – ŷi) cannot be “explained”


and the difference (ŷi - 𝑦̅) can be “explained” by the regression model

In addition, it turns out (by calculus) that

Σ(yi - 𝑦̅)2 = Σ(yi – ŷi)2 + Σ(ŷi - 𝑦̅)2


or
SST = SSE + SSM

SST = Sum of Squares Total = Σ(yi - 𝑦̅)2


SSE = Sum of Squares Error = Σ(yi – ŷi)2
SSM = Sum of Squares Model = Σ(ŷi - 𝑦̅)2

That is, the total variation in Y is equal to the sum of the total variation due to error (that cannot
be explained) and the total variation due to the regression model (which can be explained).

Coefficient of Determination, r2

r² = SSM/SST is the proportion of the total variability in Y that can be explained by X


Note that 0 <= r2 <= 1

Computational Formulas for SST, SSM and SSE

SST = Σyᵢ² − (Σyᵢ)²/n

SSM = b1·(Σxᵢyᵢ − (Σxᵢ)(Σyᵢ)/n)

SSE = SST − SSM

It is common to present the test results for simple linear regression in an ANOVA table, as
follows.
Source           Sum of Squares   Degrees of Freedom   Mean Sum of Squares   Test Statistic   P-value
Between Groups   SSM              DFM = 1              MSM = SSM/1           F = MSM/MSE
Within Groups    SSE              DFE = n − 2          MSE = SSE/(n − 2)
Total            SST              DFT = n − 1

Example 12.5: Calculate SST, SSM, SSE and r² for the Push-ups example.

SST = Σy² − (Σy)²/n = 6531 − (261)²/12 = 854.25

SSM = b1·(Σxy − (Σx)(Σy)/n) = (−0.2651)·(9658 − (503)(261)/12) = 339.9245

SSE = SST − SSM = 854.25 − 339.9245 = 514.3255

r² = SSM/SST = 339.9245/854.25 = 0.3979

Note that the correlation coefficient, r, is related to the coefficient of determination, r2, as
follows:

r = (sign of b1)√𝑟 2 = -0.6308
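
The sums of squares can be checked the same way (a Python sketch under the same assumptions as the earlier one; the slope value is carried over from Example 12.4):

import numpy as np

age = np.array([19, 28, 18, 58, 23, 64, 78, 43, 48, 36, 21, 67])
pushups = np.array([34, 26, 30, 12, 28, 12, 6, 21, 22, 27, 14, 29])
n, b1 = len(age), -0.2651

sst = np.sum(pushups ** 2) - pushups.sum() ** 2 / n                  # 854.25
ssm = b1 * (np.sum(age * pushups) - age.sum() * pushups.sum() / n)   # ~339.92
sse = sst - ssm                                                      # ~514.33
print(sst, ssm, sse, ssm / sst)                                      # r^2 ~ 0.398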

SPSS:

Data should be entered into two numerical scale columns with 0 decimals, PUSHUP and AGE.

Follow the Analysis>Regression>Linear path and select PUSHUP as your dependent variable
and age as your independent variable. Saying OK will produce output, including this ANOVA
table.

STANDARD ERROR OF THE ESTIMATE

Recall that SSE measures the variability of the actual observations about the estimated regression
line. Every sum of squares is associated with a number called the degrees of freedom. SSE has
n − 2 degrees of freedom because two parameters (β0 and β1) must be estimated to compute SSE.

MSE = SSE/(n − 2) is an estimator of σ².

se = √(SSE/(n − 2)) is an estimator of σ, and is known as the Standard Error of the Estimate.

Example 12.6: Calculate se for the push-ups data.

se = √MSE = √(SSE/(n − 2)) = √(514.3255/10) = 7.1716

This can be readily calculated from the ANOVA output for our example, also.

t TEST FOR β1

Hypothesis
Ho: β1 = β1,0
Ha: β1 ≠ β1,0

Test Statistic

t* = (b1 − β1,0)/SE_b1, where SE_b1 = se·√(1/Σ(xᵢ − x̄)²) = se·√(1/(Σxᵢ² − (Σxᵢ)²/n))

SE_b1 is the standard error of the slope.

Rejection Region

t* < - tα/2, n-2 or t* > t α/2, n-2

Decision

Reject Ho if t* < - tα/2, n-2 or t* > t α/2, n-2

If X and Y are linearly related, then β1 ≠ 0.

Note: An alternative formula is t* = r·√(n − 2)/√(1 − r²).

Example 12.7:

Using a prechosen level of significance of 5%, test the hypothesis below for the push-ups data.

Hypothesis
Ho: β1= 0
Ha: β1 ≠ 0

Test Statistic

t* = (b1 − β1,0)/SE_b1, where SE_b1 = se·√(1/(Σxᵢ² − (Σxᵢ)²/n))

SE_b1 = (7.1716)·√(1/(25921 − (503)²/12)) = 0.1031

(recall se = 7.1716 and b1 = −0.2651 from the examples above)

so t* = (−0.2651 − 0)/0.1031 = −2.5713

Rejection Region
t* < - tα/2, n-2 or t* > t α/2, n-2
t* < - t0.025, 10 or t* > t 0.025, 10
t* < - 2.228 or t* > 2.228

[t-distribution figure with rejection regions beyond ±2.228; each tail has area 0.025. From the
class tables and http://www.statdistributions.com/t/?p=0.1&df=12&tail=2]

Decision: Since t* = -2.5713 falls in the rejection region, we reject Ho at the 5 % level of
significance. We conclude that the slope is not equal to 0 and that there is a linear relationship
between age and push-ups.

Note: t* = r·√(n − 2)/√(1 − r²) = (−0.6308)·√(12 − 2)/√(1 − (−0.6308)²) = −2.5708 (slight difference
because of rounding)

CONFIDENCE INTERVAL FOR β1

b1 ± tα/2, n-2 𝑆𝐸𝑏1

Example 12.8:
Find and interpret a 95% confidence interval for β1 for the push-ups data.

b1 ± tα/2, n-2 𝑆𝐸𝑏1 = -0.2651 ± 2.228 (0.1031) = -0.2651 ± 0.2297

A 95% CI is (−0.4948, −0.0354).

Since 0 is not in the confidence interval, we reject Ho and conclude that, at the 5% significance
level, we have significant evidence that the slope differs from 0, and that there is a linear
relationship between age and pushups.

Notice, as always, that the two-sided hypothesis test with level of significance α gives the same
result as when we use a (1 − α)·100% confidence interval to test our hypothesis.
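
A minimal scipy sketch verifying the slope test of Example 12.7 and the interval of Example 12.8 (Python is an assumption of this aside, not part of the course software):

import numpy as np
from scipy import stats

se, b1, n = 7.1716, -0.2651, 12                  # from Examples 12.4 and 12.6
sxx = 25921 - 503 ** 2 / n                       # sum of (xi - xbar)^2, ~4836.92
se_b1 = se * np.sqrt(1 / sxx)                    # ~0.1031
t_star = (b1 - 0) / se_b1                        # ~ -2.571
p_value = 2 * stats.t.sf(abs(t_star), df=n - 2)  # two-sided p, ~0.028
t_crit = stats.t.ppf(0.975, df=n - 2)            # 2.228
print(t_star, p_value, (b1 - t_crit * se_b1, b1 + t_crit * se_b1))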

SPSS: Data should be entered in 2 numerical scale columns with 0 decimals, PUSHUP and AGE.

Follow the Analysis>Regression>Linear path and select PUSHUP as your dependent variable
and AGE as your independent variable. Popup the statistics box and choose estimates, model
fit, confidence intervals at a level of 95%, descriptives, and covariance matrix. Say OK. In the
output you will find a box of coefficients that contains the intercept, the predictor coefficient,
and the test statistic, two-sided p-value, and 95% confidence interval for the predictor coefficient.

OTHER INTERVALS OF INTEREST

There are other intervals that might interest us when we perform a regression analysis on a set of
data points. They are as follows.

1) Confidence interval for E(Y|X=x*)

ŷ ± t(α/2, n−2)·se·√(1/n + (x* − x̄)²/Σ(xᵢ − x̄)²)

This interval is for the mean value of Y for a given value x* of X (that is, for the estimated mean
response for the subpopulation corresponding to the value x* of the explanatory variable).
2) Prediction interval for a particular yi , given x*

ŷ ± t(α/2, n−2)·se·√(1 + 1/n + (x* − x̄)²/Σ(xᵢ − x̄)²)

This interval is for the range of values of Y for a particular value of X (that is, for a future
observation from the subpopulation corresponding to the value x* of the explanatory variable).

To further understand the distinction above, consider the following example. Suppose we wish
to estimate the salary of employees based on their years of experience. We can make a
confidence interval for the mean salary of all employees with 10 years of experience. On the
other hand, if we wished to estimate the salary of one particular employee with 10 years of
experience, we would calculate a prediction interval.

SPSS
1. To calculate a confidence interval for the mean of y for a given value of X= x* with SPSS
choose Analyze>Regression >Linear, select Save and check Mean in the Prediction intervals
box. If you also check Individual you get the individual prediction intervals, too. (In the
predicted values box, check Unstandardized. This will return the point estimates for your
confidence interval and prediction interval.) Type 95 (or another desired confidence level) in
the Confidence Interval text box.

2. Return to DataView. New columns will appear there.

3. The lower bounds of the confidence intervals for E(Y|X = x*) are in the column LMCI_1 and
the upper bounds are in the column UMCI_1.

4. The lower bounds of the individual prediction intervals are in the column LICI_1 and the
upper bounds are in the column UICI_1.

Note: If you wish to obtain a confidence interval for E(Y|X = x*) or an individual prediction
interval for a particular individual with a value of xi that does not appear in the data values, type
that number in the cell below the bottom entry in the x column in the data file. Then run the
commands above again. The upper and lower bounds of an individual prediction interval for the
x value of interest will appear in the data file in the same row where you typed your x*.

Example 12.9: Find a 95% confidence interval for the average number of pushups that can be
done by all individuals of age 60. You are given that Σ(xᵢ − x̄)² = 4836.916667 for this data.

ŷ ± t(0.025, 10)·se·√(1/n + (x* − x̄)²/Σ(xᵢ − x̄)²)
= 32.8621 − 0.2651(60) ± (2.228)(7.1716)·√(1/12 + (60 − 41.92)²/4836.916667)
= 16.96 ± 6.20

A 95% CI is (10.76, 23.16) pushups.

Find a 95% prediction interval for the number of pushups that can be done by a particular
individual of age 60. You are given that Σ(xᵢ − x̄)² = 4836.916667 for this data.

ŷ ± t(0.025, 10)·se·√(1 + 1/n + (x* − x̄)²/Σ(xᵢ − x̄)²)
= 32.8621 − 0.2651(60) ± (2.228)(7.1716)·√(1 + 1/12 + (60 − 41.92)²/4836.916667)
= 16.96 ± 17.14

A 95% prediction interval is (−0.18, 34.10) pushups.
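
Both intervals of Example 12.9 can be verified with a short Python sketch (same assumptions as the earlier sketches):

import numpy as np
from scipy import stats

b0, b1, se, n = 32.8621, -0.2651, 7.1716, 12
xbar, sxx, x_star = 503 / 12, 4836.916667, 60
y_hat = b0 + b1 * x_star                          # ~16.96
t_crit = stats.t.ppf(0.975, df=n - 2)             # 2.228

ci_half = t_crit * se * np.sqrt(1 / n + (x_star - xbar) ** 2 / sxx)      # ~6.20
pi_half = t_crit * se * np.sqrt(1 + 1 / n + (x_star - xbar) ** 2 / sxx)  # ~17.14
print((y_hat - ci_half, y_hat + ci_half))   # CI for the mean response
print((y_hat - pi_half, y_hat + pi_half))   # prediction interval for one individual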

SPSS output:

Example 12.10: SPSS Regression output provides an F test statistic with a p-value. This F test is
testing for the significance of the regression model; in particular, it is testing whether the
parameter β1 has a non-zero value.

We re-paste the ANOVA table for the pushup data here.

For the push-ups data, our hypothesis test would be performed as follows, assuming a pre-chosen
α = 0.05.

Step 1: Hypotheses:
Ho: β1 = 0
Ha: β1 ≠ 0

Step 2: Test statistic, F* = (SSM/1)/(SSE/(n − 2)) = MSM/MSE = 339.9245/51.4330 = 6.6090

Step 3: Critical Region:

For α = 0.05, Fcrit = F(0.05; 1, n−2) = F(0.05; 1, 10) = 4.96
Critical Region is F > 4.96.

Step 4: Decision:
Since F* = 6.6090 falls in the critical region for α = 0.05 (6.61 > 4.96), we reject Ho and
conclude that the regression is significant. Note that the software calculates an actual p-value
for us, and that P(F > 6.6090) = 0.028 for an F distribution with 1 degree of freedom in the
numerator and 10 degrees of freedom in the denominator.
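
The F statistic and its p-value can be checked directly (a scipy sketch under the same assumptions as the earlier ones):

from scipy import stats

ssm, sse, n = 339.9245, 514.3255, 12
f_star = (ssm / 1) / (sse / (n - 2))      # ~6.61
p_value = stats.f.sf(f_star, 1, n - 2)    # ~0.028, matching the SPSS output
f_crit = stats.f.ppf(0.95, 1, n - 2)      # ~4.96
print(f_star, p_value, f_crit)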
ASSUMPTIONS

Assumptions about the error term in the Regression Model Y = β0 + β1X + ε

The work we have done above assumes that we have an underlying population model
Y= β0 + β1X + ε where the distribution of errors ε meet the following criteria.

1. ε is a random variable with a mean of 0


2. the variance of ε is the same for all x
3. the values of ε are independent
4. ε is a normally distributed random variable

Implications:

1. Since E(ε) = 0, this means E(Y) = E(β0 + β1X + ε) = β0 + β1X


2. The variance of Y about the regression model equals σ2 and is the same for all values of X
3. The value of ε for a particular value of x is not related to the value of ε for any other value of
x; thus the value of Y for a particular value of x is not related to the value of Y for any other
value of x (in the push-up example, we could see this assumption violated if we had sampled a
group of people who all worked out at the same place, for example)
4. Because Y is a linear function of ε, Y is also a normally distributed random variable.

The graph below illustrates these assumptions and their implications. The value of E(Y|X)
changes according to its associated x. But regardless of the x value, the probability distribution
of ε, and hence the probability distribution of y, are normally distributed with the same variance,
σ2. The specific value of any error ε depends on whether the actual value of y is larger or smaller
than E(Y|X).

Assumptions about the error term in the Regression Model

Assumption 1, that the expected error is 0, is satisfied by taking the least squares approach.

Assumption 2, that we have equal variances for the error populations, can be checked by making
a scatterplot of the predicted y’s (on the x axis) and the residuals (on the y axis). We are looking
for a horizontal band of residuals, indicating that we have homoskedastic errors, as illustrated on
the far left below. This tells us that our assumption of equal variances is valid. On the other
hand, if the scatterplot of residuals looks like one of the other pictures below, we have
heteroscedastic errors and the assumption of equal variances is not true.

Residual Patterns

[Figure: three residual-versus-predicted scatterplots; the far left shows a horizontal band
(homoskedastic errors), while the other two show non-constant spread (heteroscedastic errors).
Axes: Predicted Ys (horizontal) versus Residuals (vertical).]

Assumption 3, independence of errors, is satisfied as long as we have taken a true simple
random sample and have independent observations. It does become an issue for time series
data, where errors can be correlated. This assumption is tested by plotting residuals against the
order of the data (if the data has a natural order). A pattern will show possible autocorrelation.
Examples of patterns showing autocorrelation are shown below.

[Figure: examples of correlated time series data, showing autocorrelation patterns in residuals
plotted against the order of the data.]

Also, a statistical test known as the Durbin–Watson test will check for autocorrelation. A value
of the Durbin–Watson test statistic that is below 1.5 or above 2.5 indicates autocorrelation
problems. Various statistical software packages will readily calculate the Durbin–Watson test
statistic for a given set of data.

Assumption 4 (normality of the ε distributions) can be checked by looking at a histogram of the
residuals (as long as the sample size is not too small) or by checking a normal probability plot
that compares observed residuals with residuals that would be expected if the data were normal
(a linear plot indicates that we have normality). An example of a normal probability plot is shown
below.
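
For students working outside SPSS, the two residual checks can be sketched with matplotlib and scipy (an assumption of this aside; the fitted line is the one from Example 12.4):

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

age = np.array([19, 28, 18, 58, 23, 64, 78, 43, 48, 36, 21, 67])
pushups = np.array([34, 26, 30, 12, 28, 12, 6, 21, 22, 27, 14, 29])
fitted = 32.8621 - 0.2651 * age
residuals = pushups - fitted

fig, axes = plt.subplots(1, 2, figsize=(9, 4))
axes[0].scatter(fitted, residuals)                    # look for a horizontal band
axes[0].axhline(0, linestyle="--")
axes[0].set(xlabel="Predicted Ys", ylabel="Residuals")
stats.probplot(residuals, dist="norm", plot=axes[1])  # normal probability (QQ) plot
plt.tight_layout()
plt.show()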

Push-ups Example (continued):

Assumption 1 (the expected values of the ε's is zero) is satisfied for our age/push-ups example.
We used the method of least squares to generate our regression line.

Assumption 2 (equal variances of the ε) can be examined by looking at a plot of the predicted y’s
(on the x axis) and the residuals (on the y axis), as below.

While we do seem to have a band of errors here, we do note the two outlying residuals.
(Remember, our sample is small, for teaching purposes).

The above was found using SPSS, as follows.

Data was entered into two numerical scale columns with 0 decimals, PUSHUP and AGE.

Then the Analysis>Regression>Linear path was followed, with PUSHUP selected as the
dependent variable and AGE as the independent variable. Then the plots box was popped up,
and ZRESID put in the Y box and ZPRED in the X box for the scatter.
Assumption 3 (independence of the ε values) is satisfied in this case, as long as we have taken
an appropriate simple random sample. There really is no natural order to our data, and our data
is not time series data.

Assumption 4 (normality of the ε distribution) can be examined by looking at a histogram of the
residuals or by checking a normal probability plot. For illustrative purposes, we will do both
below. Note again that our sample size, chosen for illustrative purposes, is only 12.

SPSS is used to produce the histogram and normal probability plot as follows.

Data should be entered into two numerical scale columns with 0 decimals, PUSHUP and AGE.

Follow the Analysis>Regression>Linear path and select PUSHUP as your dependent variable
and AGE as your independent variable. Popup the plots box, and check histogram. Popup the
Save box, and check standardized in the Residual box and standardized in the Predicted Values
box. Say continue and OK.

Follow the Analyze>Descriptive Statistics>QQ plot path. Move ZRE_1 into the Variables box.
Make sure your Test Distribution dropdown is set to normal. Say ok.

Here the assumption of normality might be considered to be not too badly violated, but there are
more positive residuals than negative ones.

Note that the histogram appears to have a normal flavor. However, we only used 4 classes (an
attempt to mitigate the fact that there are only 12 data values), and we really cannot consider the
histogram of the sample data an adequate representation of the population distribution.

The QQplot appears to have a linear flavor, indicating support for the assumption of normality
for the errors. Two outlying residuals should be noted and checked out.

Chapter 13: MULTIPLE REGRESSION MODEL:

A multiple regression model expresses a dependent variable Y in terms of p independent
(explanatory) variables X1, X2, …, Xp.

The multiple regression model is:

Y = β0 + β1X1 + β2X2 + … + βiXi + … + βpXp + ε

where p is the number of explanatory variables in the model (so the model has p + 1
parameters), and i = 1, 2, …, p.

Assumptions about the error term are as before.

1. The error term ε is a random variable with a mean or expected value of zero.
2. The variance of ε is the same for all values of X1, X2, …, Xp.
3. The values of ε are independent.
4. The error term ε is a normally distributed random variable.

It is of interest to fit a line of “best” fit to n observed data points (𝑦𝑖 , 𝑥1𝑖 , 𝑥2𝑖 , … 𝑥𝑝𝑖 ) i= 1, …, n.

Some calculus will provide us with 𝑏0 , 𝑏1 , …, 𝑏𝑝 , the least squares estimates of 0, 1 , … p .

The sample multiple regression line (based on the observed values) can be written as:

𝑦̂ = 𝑏𝑜 + 𝑏1 𝑥1 + ⋯ + 𝑏𝑝 𝑥𝑝

where the 𝑦̂ is the predicted value for given values of the explanatory variables.

The ANOVA table for multiple regression is presented as follows.

Source           Sum of Squares   Degrees of Freedom    Mean Sum of Squares     Test Statistic   P-value
Between Groups   SSM              DFM = p               MSM = SSM/p             F = MSM/MSE
Within Groups    SSE              DFE = n − p − 1       MSE = SSE/(n − p − 1)
Total            SST              DFT = n − 1

The formula SST = SSM + SSE still holds, where SST is the total variation in the model, SSE is
the unexplained variation in the model and SSM is the explained variation in the model.

𝜎̂ = s = √𝑀𝑆𝐸 . The degrees of freedom for the error is n – (p+1) = n – p - 1.

From this point on, we shall rely on SPSS output for the calculation of statistics of interest
needed in our analysis.

R², the multiple coefficient of determination, represents the proportion of the total Y variability
that is explained by the p explanatory x variables.

R2 = (Total variability Explained by the Regression)/(Total Variability in Y) = SSM/SST

Finally, we can perform an F test to test for significance of the regression model, as follows.

Step 1:
Ho: all βi = 0 (for i = 1, …, p)
Ha: at least one of the βi ≠ 0

Step 2: Test statistic

F* = (SSM/p)/(SSE/(n − p − 1)) = MSM/MSE
This test statistic compares the amount of variation in Y caused by the regression line to the
amount of variation caused by other (error) factors.

A large value of F* corresponds to MSM being much larger than MSE. In this case, rejection of
Ho means that a significant amount of the variability in Y is caused by the explanatory variables
taken together.

Step 3: Critical Region: F > F(α; p, n−p−1) for your pre-chosen α

Step 4: Decision: Reject Ho if F* > F(α; p, n−p−1)
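
Since the gosling data below lives in an SPSS file, the following Python sketch just shows the mechanics of the fit and the model F test on generic arrays (the function and variable names are ours, not from any package):

import numpy as np
from scipy import stats

def fit_multiple_regression(X, y):
    # Least squares fit of y = b0 + b1*x1 + ... + bp*xp, returning the
    # coefficients, R^2, and the model F test. X is (n, p); y has length n.
    n, p = X.shape
    design = np.column_stack([np.ones(n), X])        # prepend the intercept column
    b, *_ = np.linalg.lstsq(design, y, rcond=None)
    sse = np.sum((y - design @ b) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    f_star = ((sst - sse) / p) / (sse / (n - p - 1))
    return b, (sst - sse) / sst, f_star, stats.f.sf(f_star, p, n - p - 1)

# For the gosling data one would call, e.g.:
# fit_multiple_regression(np.column_stack([aciddetfib, digeff]), wgtcha)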

Example 13.1: Botanists at the University of Toronto wished to investigate how the weight
change (y) in snow goslings is a function of other variables such as digestion efficiency (x1) and
acid-detergent fibre (least digestible portion) amount in feed (x2). For each of 42 feeding trials,
the change in the weight of the gosling after 2.5 hours was recorded as a percentage of initial
weight. Your instructor has created an SPSS file of the data. (The original data can be found
online at
http://oregonstate.edu/instruct/st352/kollath/handouts/multiplereg/categorical/snowgeese.htm. )

We wish to investigate whether it is reasonable to use the data to fit a model of the form

Weight Change =0 + 1Digestive Efficiency + 2Acid-detergent fibre + 

To begin, we create a matrix plot (multiple scatter plots) and a correlation matrix for the data set
to examine the pairwise relationships among the three variables to investigate how strongly the
explanatory variables are related to the response variable.

SPSS: Open the file Ducksinarow.sav that can be found on Blackboard. Note that needed data is
in 3 numerical scale columns, WGTCHA, DIGEFF, and ACIDDETFIB. Follow the
Graphs>Legacy Dialogs>Scatter/Dot path and select the Matrix icon. In the Matrix variables
box, select WGTCHA, DIGEFF and ACIDDETFIB for the box, and say OK. Follow the
Analyze>Correlate>Bivariate path, and move WGTCHA, DIGEFF and ACIDDETFIB to the
variables box, and say OK. The following output is produced.

Here we can see that weight change and digestive efficiency have a significant positive
correlation of 0.612, so they increase together. Weight gain and acid-detergent fibre has a
significant negative correlation of -0.725, so weight gain decreases as acid-detergent fibre
increases (and vice-versa). Finally, digestive efficiency and acid-detergent fibre have a
significant negative correlation of -0.880, so digestive efficiency decreases as acid-detergent
fibre increases (and vice versa).

We specially note the indication that the variables digestive efficiency and acid-detergent fibre are
associated, and keep this in mind for investigation with a model that considers this interaction
later on.

We run SPSS to get the estimates of the coefficients and of σ , and to get 95% confidence intervals
pertinent to this set of data. Our equation can be read from the table of parameter estimates to be

ŷ (predicted WeightGain) = 12.180 − 0.458·Acid-detergent fibre − 0.027·Digestive efficiency

SPSS:

Open the file Ducksinarow.sav that can be found on Blackboard. Note that needed data is in 3
numerical scale columns, WGTCHA, DIGEFF, and ACIDDETFIB.

Follow the Analysis>Regression>Linear path and select WGTCHA as your dependent variable
and DIGEFF and ACIDDETFIB as your independent variables. Saying OK will produce output,
including a parameter estimates table and an ANOVA table.
Parameter Estimates (Dependent Variable: WGTCHA)

Parameter    B        Std. Error   t        Sig.   95% CI Lower Bound   95% CI Upper Bound
Intercept    12.180   4.402        2.767    .009   3.276                21.085
ACIDDETFIB   −.458    .128         −3.569   .001   −.717                −.198
DIGEFF       −.027    .053         −.496    .623   −.135                .082

The intercept (12.180%) represents the weight change when digestive efficiency is 0% and acid-
detergent fibre equals 0%. This is of no use to us at all, as the independent variable values are far
outside the range of interest.

Each slope represents the mean change in weight associated with a one percent increase in the
corresponding explanatory variable, if the other explanatory variable remains unchanged.

For example, if acid detergent fiber were to increase by 1%, and digestive efficiency were to
remain constant, then weight change would decrease by 0.458 percent.

And if digestive efficiency were to increase by 1%, and acid-detergent fibre were to remain
constant, then weight change would decrease by 0.027 percent.

Notice that a drop in weight change is estimated here (with the multiple regression formula)
when digestive efficiency increases (acid-detergent fibre being held constant). This contradicts
what we observed in the scatterplot above, that weight change increases as digestive efficiency
increases. The association between the two independent variables leads us to question our
choice of model here. We will come back to this.

In the model summary output from SPSS, we can find MSE = 12.387, and calculate s = √𝑀𝑆𝐸 =
3.5195.

ANOVAa
Model Sum of Squares df Mean Square F Sig.

1 Regression 542.035 2 271.017 21.880 .000b

Residual 483.084 39 12.387

Total 1025.119 41

a. Dependent Variable: WGTCHA

b. Predictors: (Constant), ACIDDETFIB, DIGEFF

Note that R2 = SSM/SST = 542.035/1025.119 = 0.529. 52.9% of the variation in weight change
can be explained by the regression fit (from the variables acid-detergent fibre and digestion
efficiency.)

An R² close to 1 indicates a good fit, while an R² close to 0 indicates a lack of fit.

We note, as an aside, that an R² equal to 1 occurs if the model fits the data perfectly, which happens
if we use as many parameters as the sample size in our model. This is something we wish to avoid.
We should take sample sizes that are much larger than the number of parameters when fitting a
regression model.

A test of model utility allows us to test the hypothesis that all slope parameters are 0 against the
hypothesis that at least one is not. We use a pre-chosen alpha of 1%.

Hypothesis:
Ho: β1 = β2 = 0
Ha: at least one of the βi ≠ 0

Assumptions: We assume that the 4 required assumptions are all met, viz:
1. The error term ε is a random variable with a mean or expected value of zero.
2. The variance of ε is the same for all values of X1 and X2.
3. The values of ε are independent.
4. The error term ε is a normally distributed random variable.
We shall use residuals from the sample data to check for normality of the errors and equality of
variances below.

Test statistic

F* = (SSM/p)/(SSE/(n − p − 1)) = MSM/MSE = 21.880
𝑀𝑆𝐸

This test statistic compares the amount of variation in Y caused by the regression line to the
amount of variation caused by other (error) factors.

A large value of F* corresponds to MSM being much larger than MSE. In this case, rejection of
Ho means that a significant amount of the variability in Y is caused by the explanatory variables
taken together.

Critical Region Approach: Reject Ho if F* > F(0.01; 2, 39) (if F* > 5.195)

P-value Approach: Reject Ho if p-value < 0.01

From the class tables, we can infer that Fcrit is between 5.18 and 5.39 (closer to 5.18);
Fcrit = 5.195 can be read from http://www.statdistributions.com/f/.

Since 21.880 > 5.195 (or since 0.000 (the p-value in the table above) < 0.01), we reject Ho. At a
1% significance, the model is useful for predicting the mean weight change of snow goslings.

Having found this result, it then makes sense, for this model, to test the significance of the
individual slope parameters, in particular that for acid-detergent fibre.

We can test for the significance of the individual parameters with a confidence interval or
hypothesis test approach.

Confidence Interval:

A 100(1-α)% confidence interval for a slope parameter βi is

bi ± t(α/2, n−p−1)·SE_bi


Due to the complexity of the formulas, we rely on the software output to provide the intervals.

We have:

Confidence interval for the slope β1 (for acid-detergent fibre) is (-0.717, -0.198)%
Confidence interval of the slope β2 is (for digestive efficiency) (-0.135, 0.082)%

Since 0 does not fall within the 95% confidence interval for β1, we reject the possibility that β1
might be 0. There is significant evidence that the slope β1 differs from 0.

Since 0 falls within the 95% confidence interval for β2, we cannot reject the possibility that β2
might be 0.

Hypothesis Test:

Only the test of β1 is of interest, as we failed to find significance in the test of β2 with the
confidence level approach.

H0 : β1 >= 0 versus Ha: β1 < 0.

The t-statistic = -3.569 with df = 39 and the two-sided test P-value = 0.001 can be read in the
second row of the coefficients table. We would reject the null hypothesis for any reasonable level
of significance. (With our one-sided test, we would have to divide the p-value by 2, as SPSS is
doing a two sided test, but a P-value of 0.0005 is still small.) We conclude that there is
significant evidence that as acid detergent fibre intake increases, weight change decreases. (We
would proceed, at this point, to move to a model that only included acid-detergent fibre as a
predictor of gosling weight change. (We would, of course, investigate how well that model fit,
too, by looking at correlation, test of fit, and residuals.) )

Residuals

In multiple regression, we again can use the residuals, ei = yi − ŷi, as estimates of the error terms.

Below, we investigate how residuals look for our duckling example with our multiple regression
model for the dependent variable WGTCHA with the independent variables DIGEFF and
ACIDDETFIB.

SPSS:
Open the file Ducksinarow.sav that can be found on Blackboard. Note that needed data is in 3
numerical scale columns, WGTCHA, DIGEFF, and ACIDDETFIB.

Follow the Analysis>Regression>Linear path and select WGTCHA as your dependent variable
and DIGEFF and ACIDDETFIB as your independent variables.

Popup the plots box, and put ZRESID in the Y box and ZPRED in the X box for the scatter. Also
check histogram.

Popup the Save box, and check standardized in the Residual box and standardized in the
Predicted Values box. Say continue and OK.

Follow the Analyze>Descriptive Statistics>QQ plot path. Move ZRE_1 into the Variables box.
Make sure your Test Distribution dropdown is set to normal. Say ok.

Here the assumption of normality of the errors might be considered to not be untoward, based on
the histogram of the residuals. However, we note that the right tail is a bit too short for a normal
distribution.

The scatterplot shows an even band of scatter around 0, but the two stragglers in the upper left
corner are of note. The sample size is small here, but we may begin to wonder if the assumption
of equal variances of the errors is violated.

Still, our work so far appears to have validity.

In the above model, if we fix X1 at some fixed value, say x1*, then

E(Y | X1 = x1*) = (β0 + β1·x1*) + β2·X2. This is a line with response variable Y and explanatory
variable X2 that has intercept (β0 + β1·x1*) and slope β2.

So for a model such as this, where we consider only main effects, we imply that the
relationship between each independent variable and the response is not affected by the
value of the other independent variables.

In practice, this is not always the case, and two independent variables can interact in their effect
upon the response variable.

We can further investigate the interaction between two numerical variables with the following
multiple regression model:

Y = β0 + β1X1 + β2X2 + β3 X1X2 + ε

where Y is the dependent (response) variable.

X1 and X2 are the independent (predictor/explanatory) variables

𝜇𝑌 = β0 + β1X1 + β2X2 + β3 X1X2 is the deterministic part of the model

β1 + β3X2 represents the change in the mean of Y for a one-unit increase in X1 (so the slope on
X1 depends on the value of X2)

β2 + β3X1 represents the change in the mean of Y for a one-unit increase in X2 (so the slope on
X2 depends on the value of X1)

ε is the random error, which is assumed to be N(0, σ)

For the gosling example, the interaction model is:

Y = β0 + β1X1 + β2X2 + β3 X1X2 + ε

where Y is weight change, X1 is acid detergent fibre, and X2 is digestive efficiency.

X1 and X2 are the independent (predictor/explanatory) variables

We run SPSS to get the estimates of the coefficients and of σ, and to get 95% confidence intervals
for this set of data.

SPSS: Follow the Analyze>General Linear Model>Univariate path. Place WGTCHA in the
Dependent Variable box, and DIGEFF and ACIDDETFIB in the Covariate(s) box. Bring up the
Model box, and select Custom. Select ACIDDETFIB and DIGEFF (one at a time), set the Build
Term(s) dropdown to Main Effects, and use the arrow to bring each variable into the Model box.
Then select ACIDDETFIB and DIGEFF simultaneously, set the Build Term(s) dropdown to
Interaction, and click the arrow to bring DIGEFF*ACIDDETFIB into the Model box. Say
Continue and OK.
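
For comparison, here is a sketch of the same interaction fit in Python, again assuming the hypothetical goslings.csv export. In a statsmodels formula, A * B expands to the main effects plus the interaction term.

    import pandas as pd
    import statsmodels.formula.api as smf

    ducks = pd.read_csv("goslings.csv")
    # ACIDDETFIB * DIGEFF expands to ACIDDETFIB + DIGEFF + ACIDDETFIB:DIGEFF
    inter = smf.ols("WGTCHA ~ ACIDDETFIB * DIGEFF", data=ducks).fit()
    print(inter.params)     # b0, b1, b2 and the interaction coefficient b3
    print(inter.summary())  # also reports R-squared, adjusted R-squared, F test, and t tests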

Our sample regression equation can be read from the table of parameter estimates to be

ŷ = 9.561 − 0.357x1 + 0.024x2 − 0.002x1x2

= 9.561 − 0.357 ACIDDETFIB + 0.024 DIGEFF − 0.002 DIGEFF*ACIDDETFIB
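
As an illustration of how, in an interaction model, the slope for one variable depends on the
other: at a digestive efficiency of, say, 60 (a hypothetical value), the estimated change in weight
change per one-unit increase in acid-detergent fibre is −0.357 − 0.002(60) = −0.477, whereas at a
digestive efficiency of 70 it would be −0.357 − 0.002(70) = −0.497.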

In the model summary output, we can find MSE = 12.550, and calculate s = √𝑀𝑆𝐸 = 3.543.

Note that R² = 0.535 and R²a = 0.498. R²a is an adjusted R² that takes into account the number of
parameters in a model; it does not automatically increase when parameters are added to the model.

There is not a lot of difference between the two, and both indicate that about 50% of the variation
in weight change can be explained by the regression fit (the differences in acid-detergent fibre and
digestion efficiency).
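
As a check, the adjusted R² can be recomputed from R² with the formula
R²a = 1 − (1 − R²)(n − 1)/(n − p − 1); here n = 42 and p = 3, since the error degrees of freedom in
the F test below are n − p − 1 = 38. In Python:

    r2, n, p = 0.535, 42, 3   # n - p - 1 = 38 matches the error df in the F test
    r2_adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    print(round(r2_adj, 3))   # 0.498, matching the reported adjusted R-squared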

A test of model utility allows us to test the hypothesis that all slope parameters are 0 against the
hypothesis that at least one is not. We use a pre-chosen alpha of 1%.

Hypotheses:
Ho: 1 = 2 = 3 = 0
Ha: at least one of the is  0

Test statistic:

F* = (SSM/p) / (SSE/(n−p−1)) = MSM/MSE = 14.561 with (3, 38) degrees of freedom

Critical Region: F* > F(α; 3, 38) for your pre-chosen α.

From F tables, we can infer that the critical value F(0.01; 3, 38) lies between 4.31 and 4.51
(closer to 4.31). A more precise value, F(0.01; 3, 38) = 4.343, can be read from an online
calculator such as http://www.statdistributions.com/f/.
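
Rather than interpolating in tables, the critical value and the P-value could be computed directly (a scipy sketch, using only the numbers already reported):

    from scipy import stats

    f_star, dfn, dfd = 14.561, 3, 38
    crit = stats.f.ppf(0.99, dfn, dfd)      # F(0.01; 3, 38), about 4.343
    p_value = stats.f.sf(f_star, dfn, dfd)  # far below 0.01 (SPSS rounds it to 0.000)
    print(crit, p_value)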

Decision:
Critical Region Approach: Reject Ho if F* > F(0.01; 3, 38) = 4.343
P-value Approach: Reject Ho if P-value < 0.01

Since 14.561 > 4.343 (or since 0.000, the P-value in the output table, is less than 0.01), we reject
Ho. At the 1% significance level, the model is useful for predicting the mean weight change of
snow goslings.
Having found significance in the test of model utility, it then makes sense, for this model, to
test the significance of each of the slope parameters.

Hypothesis Tests:

H0 : β1 = 0 versus Ha: β1 ≠ 0.
H0 : β2 = 0 versus Ha: β2 ≠ 0.
H0 : β3 = 0 versus Ha: β3 ≠ 0.

Assumptions: The errors are independent and normally distributed with mean 0 and constant
standard deviation σ (assessed via the residual plots above).

Test Statistic:
For ACIDDETFIB: β1: test statistic = -1.845, degrees of freedom = 38
For DIGEFF: β2: test statistic = 0.270, degrees of freedom = 38
For ACIDDETFIB*DIGEFF: β3: test statistic = -0.702, degrees of freedom = 38

P-value:
For β1: P-value = 0.073
For β2: P-value = 0.788
For β3: P-value = 0.487
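
These P-values are consistent with the t-statistics and df = 38; as a sketch, they could be recomputed with scipy:

    from scipy import stats

    for name, t in [("ACIDDETFIB (b1)", -1.845), ("DIGEFF (b2)", 0.270), ("interaction (b3)", -0.702)]:
        p = 2 * stats.t.sf(abs(t), 38)   # two-sided P-value with df = 38
        print(name, round(p, 3))         # 0.073, 0.788, 0.487, matching the output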

Decision:

None of the P-values are small (smaller than 0.05, for example), and we cannot reject any of the
null hypotheses, including the hypothesis of no interaction. (We do note that the P-value for β1 is
0.073, so our decision about rejection of the null hypothesis in that case is "borderline".)

Because the test for interaction is not significant, and the test for significance of the digestion
efficiency coefficient does not lead to rejection, while the test for significance for the acid
detergent fibre is borderline, the decision would be made to go back to a model that just included
acid-detergent fibre as a predictor of weight change.

Students are directed to the following readings, at their leisure, to get a taste for what else might
be done in future courses!

http://www.medicine.mcgill.ca/epidemiology/Joseph/courses/EPIB-621/interaction.pdf

http://www.psychwiki.com/wiki/Interaction_between_two_continuous_variables

https://www3.nd.edu/~rwilliam/stats2/l55.pdf
