
Jamie DeCoster

Department of Psychology

University of Alabama

348 Gordon Palmer Hall

Box 870348

Tuscaloosa, AL 35487-0348

Phone: (205) 348-4431

Fax: (205) 348-8648

August 1, 1998

These were compiled from Jamie DeCoster's introductory statistics class at Purdue University. Textbook references refer to Moore's The Active Practice of Statistics. CD-ROM references refer to Velleman's ActivStats.

If you wish to cite the contents of this document, the APA reference for them would be

DeCoster, J. (1998). Introductory Statistics Notes. Retrieved <month, day, and year you downloaded this file> from http://www.stat-help.com/notes.html

For help with data analysis visit

http://www.stat-help.com

ALL RIGHTS TO THIS DOCUMENT ARE RESERVED.

Contents

I Understanding Data

1 Introduction

2 Data and Measurement

3 The Distribution of One Variable

4 Measuring Center and Spread

5 Normal Distributions

II Understanding Relationships

6 Comparing Groups

7 Scatterplots

8 Correlation

9 Least-Squares Regression

9 Association vs. Causation

III Generating Data

10 Sample Surveys

11 Designed Experiments

IV Experience with Random Behavior

12 Randomness

13 Intuitive Probability

14 Conditional Probability

15 Random Variables

16 Sampling Distributions

V Statistical Inference

17 Estimating With Confidence

18 Confidence Intervals for a Mean

19 Testing Hypotheses

20 Tests for a Mean

VI Topics in Inference

21 Comparing Two Means

22 Inference for Proportions

23 Two-Way Tables

24 Inference for Regression

25 One-Way Analysis of Variance


Part I

Understanding Data


Chapter 1

Introduction

• It is important to know how to understand statistics so that we can make the proper judgments when

a person or a company presents us with an argument backed by data.

• Data are numbers with a context. To properly perform statistics we must always keep the meaning of

our data in mind.

• You will spend several hours every day working on this course. You are responsible for material covered

in lecture, as well as the contents of the textbook and the CD-ROM. You will have homework, CD-ROM, and reading assignments every day. It is important not to get behind in this course. A good

work schedule would be:

◦ Review the notes from the previous day’s lecture, and take care of any unﬁnished assignments.

◦ Attend the lecture.

◦ Attend the lab section.

◦ Do your homework. You will want to plan on staying on campus for this, as your homework will

often require using the CD-ROM.

◦ Do the CD-ROM assignments.

◦ Do the Reading assignments.

This probably seems like a lot of work, and it is. This is because we need to cover 15 weeks of material

in 4 weeks during Maymester. Completing the course will not be easy, but I will try to make it as good

an experience as I can.


Chapter 2

Data and Measurement

• Statistics is primarily concerned with how to summarize and interpret variables. A variable is any

characteristic of an object that can be represented as a number. The values that the variable takes

will vary when measurements are made on diﬀerent objects or at diﬀerent times.

• Each time that we record information about an object we observe a case. We might include several

diﬀerent variables in the same case. For example, we might measure the height, weight, and hair color

of a group of people in an experiment. We would have one case for each person, and that case would

contain that person’s height, weight, and hair color values. All of our cases put together is called our

data set.

• Variables can be broken down into two types:

◦ Quantitative variables are those for which the value has numerical meaning. The value refers

to a speciﬁc amount of some quantity. You can do mathematical operations on the values of

quantitative variables (like taking an average). A good example would be a person’s height.

◦ Categorical variables are those for which the value indicates diﬀerent groupings. Objects that

have the same value on the variable are the same with regard to some characteristic, but you

can’t say that one group has “more” or “less” of some feature. It doesn’t really make sense to do

math on categorical variables. A good example would be a person’s gender.

• Whenever you are doing statistics it is very important to make sure that you have a practical understanding of the variables you are using. You should make sure that the information you have truly addresses the question that you want to answer.

Speciﬁcally, for each variable you want to think about who is being measured, what about them is

being measured, and why the researcher is conducting the experiment. If the variable is quantitative

you should additionally make sure that you know what units are being used in the measurements.


Chapter 3

The Distribution of One Variable

• The pattern of variation of a variable is called its distribution. If you examined a large number of

diﬀerent objects and graphed how often you observed each of the diﬀerent values of the variable you

would get a picture of the variable’s distribution.

• Bar charts are used to display distributions of categorical variables. In a bar chart diﬀerent groups

are represented on the horizontal axis. Over each group a bar is drawn such that the height of the bar

represents the number of cases falling in that group.

• A good way to describe the distribution of a quantitative variable is to take the following three steps:

1. Report the center of the distribution.

2. Report the general shape of the distribution.

3. Report any signiﬁcant deviations from the general shape.

• A distribution can have many diﬀerent shapes. One important distinction is between symmetric and

skewed distributions. A distribution is symmetric if the parts above and below its center are mirror

images. A distribution is skewed to the right if the right side is longer, while it is skewed to the left if

the left side is longer.

• Local peaks in a distribution are called modes. If your distribution has more than one mode it often

indicates that your overall distribution is actually a combination of several smaller ones.

• Sometimes a distribution has a small number of points that don’t seem to ﬁt its general shape. These

points are called outliers. It’s important to try to explain outliers. Sometimes they are caused by data

entry errors or equipment failures, but other times they come from situations that are diﬀerent in some

important way.

• Whenever you collect a set of data it is useful to plot its distribution. There are several ways of doing

this.

◦ For relatively small data sets you can construct a stemplot. To make a stemplot you

1. Separate each case into a stem and a leaf. The stem will contain the ﬁrst digits and the leaf

will contain a single digit. You ignore any digits after the one you pick for your leaf. Exactly

where you draw the break will depend on your distribution: Generally you want at least ﬁve

stems but not more than twenty.

2. List the stems in increasing order from top to bottom. Draw a vertical line to the right of

the stems.

3. Place the leaves belonging to each stem to the right of the line, arranged in ascending numerical order.


◦ To compare two distributions you can construct back-to-back stemplots. The basic procedure is

the same as for stemplots, except that you place lines on the left and the right side of the stems.

You then list out the leaves from one distribution on the right, and the leaves from the other

distribution on the left.

◦ For larger distributions you can build a histogram. To make a histogram you

1. Divide the range of your data into classes of equal width. Sometimes there are natural

divisions, but other times you need to make them yourself. Just like in a stemplot, you

generally want at least ﬁve but less than twenty classes.

2. Count the number of cases in each class. These counts are called the frequencies.

3. Draw a plot with the classes on the horizontal axis and the frequencies on the vertical axis.
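The counting step can be sketched in a few lines of Python, assuming equal-width classes over a known range (the function name `histogram_counts` is illustrative):

```python
def histogram_counts(values, low, high, n_classes):
    """Steps 1-2 above: split [low, high] into equal-width classes
    and count the frequency of cases in each."""
    width = (high - low) / n_classes
    freqs = [0] * n_classes
    for v in values:
        k = int((v - low) / width)     # index of the class containing v
        k = min(k, n_classes - 1)      # place v == high in the last class
        freqs[k] += 1
    return freqs
```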

• Time plots are used to see how a variable changes over time. You will often observe cycles, where the

variable regularly rises and falls within a speciﬁc time period.


Chapter 4

Measuring Center and Spread

• In this section we will discuss some ways of generating mathematical summaries of distributions. In these equations we will make use of some statistical notation.

◦ We will always use n to refer to the total number of cases in our data set.

◦ When referring to the distribution of a variable we will use a single letter (other than n). For example we might use h to refer to height. If we want to talk about the value of the variable in a specific case we will put a subscript after the letter. For example, if we wanted to talk about the height of person 5 in our data set we would call it h_5.

◦ Several of our formulas involve summations, represented by the Σ symbol. This is a shorthand for the sum of a set of values. The basic form of a summation equation would be

x = Σ_{i=a}^{b} f(i).

To determine x you calculate the function f(i) for each value from i = a to i = b and add them together. For example,

Σ_{i=2}^{4} i² = 2² + 3² + 4² = 29.
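In Python the built-in `sum` plays the role of the summation sign; here is a small check of the example above (the helper name `summation` is not standard, just a sketch):

```python
def summation(a, b, f):
    """Sum of f(i) for i = a, ..., b (range's stop is exclusive, hence b + 1)."""
    return sum(f(i) for i in range(a, b + 1))

example = summation(2, 4, lambda i: i ** 2)   # 2^2 + 3^2 + 4^2 = 29
```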

• Some important properties of summation:

Σ_{i=1}^{n} k = nk, where k is a constant (4.1)

Σ_{i=1}^{n} (Y_i + Z_i) = Σ_{i=1}^{n} Y_i + Σ_{i=1}^{n} Z_i (4.2)

Σ_{i=1}^{n} (Y_i + k) = Σ_{i=1}^{n} Y_i + nk, where k is a constant (4.3)

Σ_{i=1}^{n} (kY_i) = k Σ_{i=1}^{n} Y_i, where k is a constant (4.4)

• When people want to give a simple description of a distribution they typically report measures of its

central tendency and spread.

• People most often report the mean and the standard deviation. The main reason why they are used is

because of their relationship to the normal distribution, an important distribution that will be discussed

in the next lesson.


◦ The mean represents the average value of the variable, and can be calculated as

x̄ = (1/n) Σ_{i=1}^{n} x_i. (4.5)

◦ In statistics our sums are almost always over all of the cases, so we will typically not bother to write out the limits of the summation and just assume that it is performed over all the cases. So we might write the formula for the mean as (1/n) Σ x_i.

◦ The standard deviation is a measure of the average difference of each case from the mean, and can be calculated as

s = √[ Σ (x_i − x̄)² / (n − 1) ]. (4.6)

◦ The term x_i − x̄ is the deviation of a case from the mean. We square it so that all the deviation scores will come out positive (otherwise the sum would always be zero). We then take the square root to put it back on the proper scale.

◦ You might expect that we would divide the sum of our squared deviations by n instead of n − 1, but this is not the case. The reason has to do with the fact that we use x̄, an approximation of the mean, instead of the true mean of the distribution. Our values will be somewhat closer to this approximation than they will be to the true mean because we calculated our approximation from the same data. To correct for this underestimate of the variability we divide by n − 1 instead of n. Statisticians have proven that this correction gives us an unbiased estimate of the true variance.
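The two formulas above can be written directly in Python; this is just a sketch using the standard library (in practice Python's own `statistics.mean` and `statistics.stdev` compute the same quantities):

```python
import math

def mean(xs):
    """Equation (4.5): x-bar = (1/n) * the sum of the x_i."""
    return sum(xs) / len(xs)

def std_dev(xs):
    """Equation (4.6): square root of the summed squared deviations
    from the mean, divided by n - 1."""
    xbar = mean(xs)
    return math.sqrt(sum((x - xbar) ** 2 for x in xs) / (len(xs) - 1))
```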

• While the mean and standard deviation provide some important information about a distribution, they

may not be meaningful if the distribution has outliers or is strongly skewed. People will therefore often

report the median and the interquartile range of a variable.

◦ These are called resistant measures of central tendency and spread because they are not strongly

aﬀected by skewness or outliers.

◦ The formal procedure to calculate the median is:

1. Arrange the cases in a list ordered from lowest to highest.

2. If the number of cases is odd, the median is the center observation in the list. It is located (n + 1)/2 places down the list.

3. If the number of cases is even, the median is the average of the two cases in the center of the list. These are located n/2 and (n + 2)/2 places down the list.

◦ The median is the 50th percentile of the list. This means that it is greater than 50% of the cases

in the list. The interquartile range is the diﬀerence between the cases in the 25th and the 75th

percentiles. To calculate the interquartile range you

1. Calculate the value of the 25th percentile (also called the ﬁrst quartile). This is the median

of the cases below the location of the 50th percentile.

2. Calculate the value of the 75th percentile (also called the third quartile). This is the median

of the cases above the location of the 50th percentile.

3. Subtract the value of the 25th percentile from the value of the 75th percentile. This is the

interquartile range.

◦ To get a more accurate description of a distribution, people will often provide its five-number summary. This consists of the minimum, first quartile, median, third quartile, and the maximum.
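The procedures above, sketched in Python (the quartile convention follows the rule given here, medians of the lower and upper halves; textbooks differ slightly on this point):

```python
def median(xs):
    """Median by the ordered-list rule above."""
    s = sorted(xs)
    n = len(s)
    if n % 2 == 1:
        return s[n // 2]                    # the (n + 1)/2-th case
    return (s[n // 2 - 1] + s[n // 2]) / 2  # average the two center cases

def five_number_summary(xs):
    """(minimum, Q1, median, Q3, maximum)."""
    s = sorted(xs)
    n = len(s)
    lower = s[: n // 2]          # cases below the median's location
    upper = s[(n + 1) // 2 :]    # cases above the median's location
    return (s[0], median(lower), median(s), median(upper), s[-1])
```

The interquartile range is then Q3 − Q1.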

• Oftentimes the same theoretical construct (such as temperature) might be measured in different units (such as Fahrenheit and Celsius). These variables are often linear transformations of each other, in that you can calculate one of the variables from another with an equation of the form

x* = a + bx. (4.7)

If you have the summary statistics (mean, standard deviation, median, interquartile range) of the original variable x it is easy to calculate the summary statistics of the transformed variable x*.


◦ The mean, median, and quartiles of the transformed variable can be obtained by multiplying the

originals by b and then adding a.

◦ The standard deviation and interquartile range of the transformed variable can be obtained by multiplying the originals by the absolute value of b.

Linear transformations do not change the overall shape of the distribution, although they will change

the values over which the distribution runs.
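These rules can be checked numerically with the Celsius-to-Fahrenheit transformation x* = 32 + 1.8x (a sketch, reusing simple mean and standard-deviation helpers):

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def std_dev(xs):
    xbar = mean(xs)
    return math.sqrt(sum((x - xbar) ** 2 for x in xs) / (len(xs) - 1))

celsius = [0.0, 10.0, 20.0, 30.0]
fahrenheit = [32 + 1.8 * c for c in celsius]   # x* = a + b*x with a = 32, b = 1.8

mean_rule = 32 + 1.8 * mean(celsius)   # mean transforms as a + b * mean
sd_rule = 1.8 * std_dev(celsius)       # spread picks up only the factor b
```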


Chapter 5

Normal Distributions

• A density curve is a mathematical function used to represent the distribution of a variable. These

curves are designed so that the area under the curve gives the relative frequencies for the distribution.

This means that the area under the curve within a portion of the range is equal to the probability of

getting a case in that portion of the range.

• Normal distributions are a specific type of distribution that have bell-shaped, symmetric density curves. There are many naturally occurring variables that tend to follow normal distributions. Some examples are biological characteristics, IQ tests, and repeated measurements taken by physical devices.

• A normal distribution can be completely described by its mean and standard deviation. We therefore

often use the notation N(µ, σ) to refer to a normal distribution with mean µ and standard deviation

σ.

We use µ and σ instead of ¯ x and s when we describe a normal distribution because we are talking

about a population instead of a sample. For now you can just think about them in the same way. We

will talk more about the difference between populations and samples in Part III.

• For all normal distributions, 68% of the data will fall within one standard deviation of the mean, 95% will

fall within two standard deviations of the mean, and 99.7% will fall within three standard deviations

of the mean. This is called the 68-95-99.7 rule.

• The standard normal curve is a normal distribution with mean 0 and standard deviation 1. Most

statistics books will provide a table with the probabilities associated with this curve.

• People often use standardized cases when working with variables that have normal distributions. If your variable X has a mean µ and standard deviation σ then the standardized cases Z are equal to

Z = (X − µ)/σ. (5.1)

These standardized cases have a mean of 0 and a standard deviation of 1. This means that they will follow a standard normal curve.

• Sometimes you will be asked to calculate normal relative frequencies. In these problems you are

asked to determine the probability of observing certain types of cases in a variable that has a normal

distribution. To calculate normal relative frequencies you:

1. Formulate the problem in terms of the original variable X.

“I want to know the probability that X > 5.”

2. Restate the problem in terms of a standardized normal variable Z.

“Because the mean is equal to 2 and the standard deviation is equal to 3, this corresponds to the

probability that Z > 1.”


3. Look up the probabilities associated with the cases you are interested in. Since standardized variables have a standard normal curve you can use that table. You may also need to use the fact that the total area underneath the curve is equal to 1.

“According to the table, the probability that Z < 1 is equal to .8413. This means that the

probability that Z > 1 is equal to .1587.”
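The three steps can also be carried out in code; here the standard normal table is replaced by the error function from Python's math module (a sketch, not how the course expects you to work the problems):

```python
import math

def normal_cdf(z):
    """P(Z < z) for the standard normal curve, via the error function;
    this plays the role of the printed table."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# The worked example above: X has mean 2 and standard deviation 3.
z = (5 - 2) / 3                  # step 2: standardize, so P(X > 5) = P(Z > 1)
p_greater = 1 - normal_cdf(z)    # step 3: total area under the curve is 1
```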

• We can only use the standard normal curve to calculate probabilities when we are pretty sure that

our variable has a normal distribution. Statisticians use normal probability plots to check normality.

Variables with normal distributions have normal probability plots that look approximately like straight

lines. It doesn’t matter what angle the line has.


Part II

Understanding Relationships


Chapter 6

Comparing Groups

• In the ﬁrst part we looked at how to describe the distribution of a single variable. In this part we

examine how the distribution of two variables relate to each other.

• Boxplots are used to provide a simple graph of a quantitative variable’s distribution. To produce a

boxplot you:

1. Identify values of the variable on the y-axis.

2. Draw a line at the median.

3. Draw a box around the middle quartiles.

4. Draw dots for any suspected outliers. Outliers are cases that appear to be outside the general distribution. A case is considered an outlier if it is more than 1.5 × IQR outside of the box.

5. Draw lines extending from the box to the greatest and least cases that are not outliers.
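The outlier rule in step 4 is easy to express directly (a sketch; `boxplot_outliers` is an illustrative name, and the quartiles are passed in rather than computed):

```python
def boxplot_outliers(xs, q1, q3):
    """Return the cases lying more than 1.5 * IQR outside the box [q1, q3]."""
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [x for x in xs if x < low_fence or x > high_fence]
```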

• People will often present side-by-side boxplots to compare the distribution of a quantitative variable

across several diﬀerent groups.


Chapter 7

Scatterplots

• Response variables measure an outcome of a study. We try to predict them using explanatory variables.

Sometimes we want to show that changes in the explanatory variables cause changes in the response

variables, but other times we just want to show a relationship between them.

• Response variables are sometimes called dependent variables, and explanatory variables are sometimes

called independent variables or predictor variables.

• You can use scatterplots to graphically display the relationship between two quantitative variables

measured on the same cases. One variable is placed on the horizontal (X) axis and one is placed on the vertical (Y) axis. Each case is represented by a point on the graph, placed according to its values on

each of the variables. Cases with larger values of the X variable are represented by points further to

the right, while those with larger values of the Y variable are represented by points that are closer to

the top.

• If we think that one of the variables causes or explains the other we put the explanatory variable on

the x-axis and the response variable on the y-axis.

• When you want to interpret a scatterplot think about:

1. The overall form of the relationship. For example, the relationship could resemble a straight line

or a curve.

2. The direction of the relationship (positive or negative).

3. The strength of the relationship.

• When the points on a scatterplot appear to fall along a straight line, we say that the two variables have

a linear relationship. If the line rises from left to right we say that the two variables have a positive

relationship, while if it decreases from left to right we say that they have a negative relationship.

• Sometimes people use a modiﬁed version of the scatterplot to plot the relationship between categorical

variables. In these graphs, diﬀerent points on the axis correspond to diﬀerent categories, but their

speciﬁc placement on the axis is more or less arbitrary.


Chapter 8

Correlation

• The correlation coeﬃcient is designed to measure the linear association between two variables without

distinguishing between response and explanatory variables. The correlation between the variables x

and y is deﬁned as

r = (1/(n − 1)) Σ [(x_i − x̄)/s_x] [(y_i − ȳ)/s_y]. (8.1)

In this equation, s_x refers to the standard deviation of x and s_y refers to the standard deviation of y.

• The correlation between two variables will be positive when they tend to be both high and low at the

same time, will be negative when one tends to be high when the other is low, and will be near zero

when the value on one variable is unrelated to the value on the second.
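Equation (8.1) translated into Python (a sketch with simple helper functions; in practice r would come from statistical software):

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def std_dev(xs):
    xbar = mean(xs)
    return math.sqrt(sum((x - xbar) ** 2 for x in xs) / (len(xs) - 1))

def correlation(xs, ys):
    """r: the products of the standardized x and y scores, summed
    and divided by n - 1."""
    xbar, ybar = mean(xs), mean(ys)
    sx, sy = std_dev(xs), std_dev(ys)
    products = ((x - xbar) / sx * ((y - ybar) / sy) for x, y in zip(xs, ys))
    return sum(products) / (len(xs) - 1)
```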

• The correlation formula can also be written as

r = [Σ(x_i y_i) − (1/n)(Σx_i)(Σy_i)] / [(n − 1) s_x s_y]. (8.2)

This is called the calculation formula for r. This formula is easier to use when calculating correlations

by hand, but is not as useful when trying to understand the meaning of correlations.

• Correlations have several important characteristics.

◦ The value of r always falls between -1 and 1. Positive values indicate a positive association

between the variables, while negative values indicate a negative association between the variables.

◦ If r = 1 or r = −1 then all of the cases fall on a straight line. This means that the one variable is

actually a perfect linear function of the other. In general, when the correlation is closer to either

1 or -1 then the relationship between the variables is closer to a straight line.

◦ The value of r will not change if you change the unit of measurement of either x or y.

◦ Correlations only measure the degree of linear association between two variables. If two variables

have a zero correlation, they might still have a strong nonlinear relationship.


Chapter 9

Least-Squares Regression

• A regression line is an equation that describes the relationship between a response variable y and an

explanatory variable x. When talking about it you say that you “regress y on x.”

• The general form of a regression line is

y = a +bx, (9.1)

where y is the value of the response variable, x is the value of the explanatory variable, b is the slope of

the line, and a is the intercept of the line. The slope is the amount that y changes when you increase

x by 1, and the intercept is the value of y when x = 0.

• The values on the line are called the predicted values, represented by ŷ. The difference between the observed and the predicted value (y_i − ŷ_i) for each case is called the residual.

• Least-squares regression is a speciﬁc way of determining the slope and intercept of a regression line.

The line produced by this method minimizes the sum of the squared residuals. We use the squared

residuals because the residuals can be both positive or negative.

• The least-squares regression line of y on x calculated from n cases is

ŷ = a + bx, (9.2)

where

b = [Σxy − (1/n)(Σx)(Σy)] / [Σx² − (1/n)(Σx)²]

and

a = ȳ − b x̄.
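The slope and intercept formulas above in Python (a sketch; `least_squares` is an illustrative name):

```python
def least_squares(xs, ys):
    """Slope b and intercept a of the least-squares line y-hat = a + b*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
    a = sum_y / n - b * sum_x / n          # a = y-bar - b * x-bar
    return a, b
```

For the cases (1, 3), (2, 5), (3, 7), which fall exactly on a line, this gives a = 1 and b = 2.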

• To see if it is appropriate to use a regression line to represent our data we often examine residual plots. A residual plot is a graph with the residual on one axis and either the value of the explanatory variable or the predicted value on the other, with one point from each case. In general you want your residual plot to look like a random scattering of points. Any sort of pattern would indicate that there might be something wrong with using a linear model.

• A lurking variable is a variable that has an important eﬀect on your response variable but which is not

examined as an explanatory variable. Often a plot of the residuals against time can be used to detect

the presence of a lurking variable.

• In linear regression, an outlier is a case whose response lies far from the ﬁtted regression line. An

inﬂuential observation is a case that has a strong eﬀect on the position of the regression line. Points

that have unusually high or low values on the explanatory variable are often inﬂuential observations.


It is important to carefully examine both outliers and inﬂuential observations. Sometimes these points

are not actually from the population you want to study, in which case you should drop them from your

data set.

• It is important to notice that you get diﬀerent least-squares regression equations if you regress y on x

than if you regress x on y. This is because in the ﬁrst case the equation is built to minimize errors in

the vertical direction, while in the second case the equation is built to minimize errors in the horizontal

direction.

• Like correlations, the slope of the regression line of y on x is a measure of the linear association between the two variables. These two are related by the following formula:

b = r (s_y / s_x). (9.3)

• The square of the correlation, r², is the proportion of the variation in the variable y that can be explained by the least-squares regression of y on x. It is the same as the proportion of variation in the variable x that can be explained by the least-squares regression of x on y.

• You should notice that the equations for regression lines and correlations are based on means and

standard deviations. This means that they will both be strongly aﬀected by outliers or inﬂuential

observations. Additionally, lurking variables can create false relationships or obscure true relationships.

• Regression lines are often used to predict one variable from the other. However, you must be careful

not to extrapolate beyond the range of your data. You should only use the regression line to make

predictions within the range of the data you used to make the line. Additionally, you should always

make sure that new case you want to predict is drawn from the same underlying population.


Chapter 9

Association vs. Causation

• Once you determine that you can use an explanatory variable to predict a response variable you can

try to determine whether changes in the explanatory variable cause changes in the response variable.

• There are three situations that will produce a relationship between an explanatory and a response

variable.

1. Changes in the explanatory variable actually cause changes in the response variable. For example,

studying more for a test will cause you to get a better grade.

2. Changes in both the explanatory and response variables are both related to some third variable.

For example, oversleeping class can cause you to both have uncombed hair and perform poorly on

a test. Whether someone has combed hair or not could then be used to predict their performance

on the test, but combing your hair will not cause you to perform better.

3. Changes in the explanatory variable are confounded with a third variable that causes changes in

the response. Two variables are confounded when they tend to occur together, but neither causes

the other. For example, a recently published book stated that African-Americans perform worse

on intelligence tests than Caucasians. However, we can’t conclude that there is a racial diﬀerence

in intelligence because a number of other important factors, such as income and educational

quality, are confounded with race.

• The best way to establish that an explanatory variable causes changes in a response variable is to

conduct an experiment. In an experiment, you actively change the level of the explanatory variable and

observe the eﬀect on the response variable while holding everything else about the situation constant.

We will talk more about experiments in Chapter 11.


Part III

Generating Data


Chapter 10

Sample Surveys

• The population is the group that you want to ﬁnd out about from your data.

◦ If your data represents only a portion of the population you have a sample.

◦ If your data represents the entire population you have a census.

• Most data from surveys are recorded in terms of numbers. However, these numbers do not always

mean the same thing. It is important to understand the scale of measurement used by a given question

so you know which analyses are appropriate. The four basic scales of measurement are:

1. Nominal scales, where the numbers just serve as labels for categories. Example: coding 1=male,

2=female.

2. Ordinal scales, where the numbers indicate the rank order of a category. Example: coding grade

level in high school as 1=freshman, 2=sophomore, 3=junior, 4=senior.

3. Interval scales, where equal differences between numbers reflect equal differences on some meaningful quantity. Example: day of the year.

4. Ratio scales, which are just like interval scales but a “0” value indicates that something possesses

none of the measured characteristic. Example: height.

For our purposes you should treat nominal and ordinal questions as categorical variables, and interval

and ratio questions as quantitative variables.

• By selecting a design you choose a systematic way of producing data in your study. This includes

things like what people or objects you will examine and the situations under which you will examine

them. A poorly chosen design will make it diﬃcult to draw any conclusions from the data.

• The basic goal of sampling design is to make sure that the sample you select for your study properly

represents the population that you are interested in. Your sample is biased if it favors certain parts of

the population over others.

• One example of a bad sampling design is a voluntary response sample. In this design you ask a question

to a large number of people and collect data from those who choose to answer. This sample does not

represent the general population because the people who choose to respond to your question are

probably diﬀerent from the people who do not.

• A good way to get unbiased data is with a simple random sample. In a simple random sample every

member of the population has an equal chance of being in the sample. Also, every combination of

members is as likely as every other combination.
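A simple random sample is easy to draw with Python's random module (a sketch; the seed argument just makes the draw reproducible):

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n members so that every member, and every combination of
    members, has an equal chance of being selected."""
    rng = random.Random(seed)
    return rng.sample(list(population), n)
```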

• You should also be careful about introducing bias into your data with the way you write your survey.

The wording of each item should be clear and should not inﬂuence the way people answer the question.


Chapter 11

Designed Experiments

• In observational studies you try to examine and measure things as they naturally occur. In experimental studies you impose some manipulation and measure its effects. It is easier to generalize from

observational studies, but it is easier to determine causality from experimental studies.

• The objects on which an experiment is performed are called the experimental units. Human experimental units used to be called “subjects,” but we now call them “research participants.”

• The explanatory variables in an experiment are referred to as factors, while the values each factor

can have are called its levels. A speciﬁc experimental condition applied to a set of units is called a

treatment.

• The eﬀectiveness of a treatment is generally measured by comparing the results of units that underwent

the treatment to the results of units in a control condition. Units in a control condition go through the

experiment but never receive the treatment.

• It is generally not a good idea to determine the eﬀect of a treatment by comparing the units before

the treatment to the same units after the treatment. Just being in an experiment can sometimes cause

changes, even if the treatment itself doesn’t really do anything. This is called the placebo eﬀect.

• The basic idea of an experiment is that if you control for everything in the environment except for

the factor that you manipulate, then any diﬀerences you observe between diﬀerent groups in your

experiment must be caused by the factor. It is therefore very important that the groups of experimental

units you have in each level of your factor are basically the same. If there is some diﬀerence between

the types of units in your groups, then the results might be caused by those diﬀerences instead of your

treatment.

• Sometimes people use matching to make their groups similar. In this case you assign people to the

diﬀerent levels of your factor in such a way that for every person in one factor level you have a similar

person in each of the other levels.

• The preferred way to make your groups similar is by randomization. In this case you just randomly

pick people to be in each of your groups, relying on the fact that the differences between the

groups will average out. Not only does this require less effort, but you don't have to know ahead

of time what the important things are that you need to match on.

• There are many diﬀerent ways to randomly assign your experimental units to conditions. You can use

a table of random numbers, a mathematical calculator that produces random digits, or you can use

statistical software.
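The software approach can be sketched in Python; the 20 unit IDs and the seed below are arbitrary choices for illustration:

```python
import random

# Randomly assign 20 hypothetical experimental units (IDs 1..20) to a
# treatment group and a control group: shuffle the IDs, then split them.
random.seed(42)   # fixed seed so the assignment can be reproduced
units = list(range(1, 21))
random.shuffle(units)

treatment = sorted(units[:10])
control = sorted(units[10:])
print("treatment:", treatment)
print("control:  ", control)
```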

• Randomization is one of the basic ideas behind statistics. When we randomly assign units to be in

either the treatment or control conditions, there will be some diﬀerences between the groups. However,

this is not a problem because mathematicians know a lot about how random variables work. When we


see a diﬀerence between the groups we can ﬁgure out the probability that the diﬀerence is just due to

the way that we assigned the units to the groups. If the probability is too low for us to believe that the

results were just due to chance we say that the diﬀerence is statistically signiﬁcant, and we conclude

that our treatment had an eﬀect.

• In conclusion, when designing an experiment you want to:

1. Control the eﬀects of irrelevant variables.

2. Randomize the assignment of experimental units to treatment conditions.

3. Replicate your experiment across a number of experimental units.


Part IV

Experience with Random Behavior


Chapter 12

Randomness

• In Chapter 11 we talked about how and why statisticians use randomness in experimental designs. Once

we run an experiment we want to determine the probability that any relationships we see are just due

to chance. In the next chapters we will learn more about probability and randomness so we can make

sense of our results.

• A random phenomenon is one where we can’t exactly predict what the end result will be, but where we

can describe how often we would expect to see each possible result if we repeated the event a number

of times. The diﬀerent possible end results are referred to as outcomes.

• The probability of an outcome of a random phenomenon is the proportion of times we would expect to

see the outcome if we observed the event a large number of times.

• We can determine the probabilities of outcomes by observing a random phenomenon a number of times,

but most often we either assign them using common sense or else calculate them using mathematics.
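The long-run-proportion idea can be illustrated with a short simulation sketch; the coin and the number of flips below are arbitrary:

```python
import random

# Flip a fair coin many times; the proportion of heads approaches the
# probability 0.5 as the number of repetitions grows.
random.seed(3)
flips = 100_000
heads = sum(random.random() < 0.5 for _ in range(flips))
proportion = heads / flips
print(proportion)   # close to 0.5
```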


Chapter 13

Intuitive Probability

• An event is a collection of outcomes of a random phenomenon.

◦ An event can be only a single outcome, or it can be many diﬀerent outcomes.

◦ In statistics we often use set notation to describe events. The event itself is a set, and the outcomes

included in the event are elements of the set.

◦ When we talk about the probability of an event we are referring to the probability that the

outcome of a random phenomenon is included in the event.

◦ Two events are called disjoint if they have no outcomes in common. This means that they can

never both occur at the same time.

◦ Two events are called independent if knowing that one occurs does not change the probability

that the other occurs. This is similar to saying that there isn’t anything that is simultaneously

influencing both events. Note that this is different from being disjoint – in fact, two events

(each with nonzero probability) cannot be both disjoint and independent.

• Probabilities of events follow a number of rules:

◦ The probability that event A occurs = P(A).

◦ The probability that an event A does not occur = 1 − P(A).

◦ Probabilities are always between 0 and 1.

◦ If the event contains all possible outcomes then its probability = 1.

◦ If events A and B are disjoint then P(A or B) = P(A) + P(B).

◦ If events A and B are independent then P(A and B) = P(A)P(B).
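These rules can be checked numerically on a small example. The sketch below assumes a fair six-sided die, with event A = "roll is even" and event B = "roll is a 1 or a 3" (both invented for illustration):

```python
from fractions import Fraction

# One roll of a fair six-sided die: each outcome 1..6 has probability 1/6.
p = {outcome: Fraction(1, 6) for outcome in range(1, 7)}

A = {2, 4, 6}   # event A: the roll is even
B = {1, 3}      # event B: the roll is a 1 or a 3

def prob(event):
    """P(event) = sum of the probabilities of its outcomes."""
    return sum(p[o] for o in event)

# Complement rule: P(not A) = 1 - P(A).
assert prob(set(p) - A) == 1 - prob(A)

# A and B are disjoint, so P(A or B) = P(A) + P(B).
assert A & B == set()
assert prob(A | B) == prob(A) + prob(B)

# The event containing all possible outcomes has probability 1.
assert prob(set(p)) == 1
```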

• When thinking about a random phenomenon one of the ﬁrst things we typically do is consider the

diﬀerent results we might observe.

◦ A sample space is a list of all of the possible outcomes from a specific random phenomenon.

◦ Just like events, we typically use a set to describe the sample space of a random phenomenon.

This set is typically named S, and includes each possible outcome as an element.

• A sample space is called discrete if we can list out all of the possible outcomes. The probability of an

event in a discrete sample space is the sum of the probabilities of the individual outcomes making up

the event.

• A sample space is called continuous if we cannot list out all of the possible outcomes.

◦ In a continuous sample space we use a density curve to represent the probabilities of outcomes.

The area under any section of the density curve is equal to the probability of the result falling

into that section.

◦ To calculate the probability of an event you add up the area underneath the density curve for

each range of outcomes included in the event.


Chapter 14

Conditional Probability

• Many times the probability of an event will depend on external factors. Think about the probability

that a high school student will go to college. We know that many factors, such as socio-economic status

(SES) can impact this probability. Conditional probability lets us think about the probability of an

event given that something else is true.

• We use the notation P(A|B) to refer to the conditional probability that event A occurs given that B

is true. For example the probability that someone who is low in SES goes to college would be written

as P(going to college | low SES).

• Conditional probabilities can also be thought of as limiting the type of people you want to include in

your probability assessment. One good way to see conditional probabilities is in a 2-way table.

• Last chapter we mentioned that two events are independent if knowing that one occurs does not change

the probability that the other occurs. Using conditional probability we can write this mathematically:

Two events A and B are independent if P(A|B) = P(A). (14.1)

• If we know the probabilities of two events and the probability of their intersection we can calculate

either conditional probability:

P(A|B) = P(A and B) / P(B),    P(B|A) = P(A and B) / P(A).    (14.2)

• Similarly, if we know the probabilities of two events and the corresponding conditional probabilities we

can calculate the probability of the intersection:

P(A and B) = P(A)P(B|A) = P(B)P(A|B). (14.3)

This is actually an extension of the intersection formula we learned in chapter 13. While that formula

would only work if A and B are independent, this one will always work.
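As a rough numerical illustration of formulas (14.2) and (14.3), using made-up probabilities for the college/SES example:

```python
# Made-up probabilities for the college/SES example:
p_low_ses = 0.30              # P(B): a student is low in SES
p_low_ses_and_college = 0.12  # P(A and B): low SES and goes to college

# Conditional probability: P(A|B) = P(A and B) / P(B).
p_college_given_low_ses = p_low_ses_and_college / p_low_ses   # about 0.4

# The intersection formula recovers the joint probability:
# P(A and B) = P(B) * P(A|B).
assert abs(p_low_ses * p_college_given_low_ses - p_low_ses_and_college) < 1e-12
print(p_college_given_low_ses)
```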

• We can also extend the union formula we learned in chapter 13:

P(A or B) = P(A) + P(B) − P(A and B).    (14.4)

This formula will always be accurate, whether events A and B are disjoint or not.

• People often think that there are patterns in sequences that are actually random. There are many

diﬀerent types of sequences that can be reasonable results of independent processes.


Chapter 15

Random Variables

• A random variable is a variable whose value is determined by the outcome of a random phenomenon.

The statistical techniques we’ve learned so far deal with variables, not events, so we need to deﬁne a

variable in order to analyze a random phenomenon.

• A discrete random variable has a finite number of possible values, while a continuous random variable

can take all values in a range of numbers.

• The probability distribution of a random variable tells us the possible values of the variable and how

probabilities are assigned to those values.

◦ The probability distribution of a discrete random variable is typically described by a list of the

values and their probabilities. Each probability is a number between 0 and 1, and the sum of the

probabilities must be equal to 1.

◦ The probability distribution of a continuous random variable is typically described by a density

curve. The curve is deﬁned so that the probability of any event is equal to the area under the

curve for the values that make up the event, and the total area under the curve is equal to 1. One

example of a continuous probability distribution is the normal distribution.

• We use the term parameter to refer to a number that describes some characteristic of a population. We

rarely know the true parameters of a population, and instead estimate them with statistics. Statistics

are numbers that we can calculate purely from a sample. The value of a statistic is random, and will

depend on the speciﬁc observations included in the sample.

• The law of large numbers states that if we draw a set of numbers from a population with mean µ, we can expect the mean ȳ of those numbers to be closer to µ as we increase the number of observations we draw. This means that we can estimate the mean of a population by taking the average of a set of observations drawn from that population.
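A quick simulation sketch of the law of large numbers, using die rolls (true mean µ = 3.5) as the population:

```python
import random
import statistics

# Draw die rolls from a population whose true mean is mu = 3.5 and
# watch the sample mean settle down toward mu as the sample grows.
random.seed(1)
mu = 3.5

def sample_mean(n):
    return statistics.mean(random.randint(1, 6) for _ in range(n))

print(sample_mean(10))        # small samples bounce around a lot
print(sample_mean(100_000))   # large samples land very close to 3.5
```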


Chapter 16

Sampling Distributions

• As discussed in the last chapter, statistics are numbers we can calculate from sample data, usually as

estimates of population parameters. However, a large number of diﬀerent samples can be taken from

the same population, and each sample will produce slightly diﬀerent statistics. To be able to understand

how a sample statistic relates to an underlying population parameter we need to understand how the

statistic can vary.

• The sampling distribution of a statistic is the distribution of values taken by the statistic in all possible

samples of the same size.

• The distribution of the sample mean ȳ of n observations drawn from a population with mean µ and standard deviation σ has mean

µ(ȳ) = µ    (16.1)

and standard deviation

σ(ȳ) = σ/√n.    (16.2)

If the original population has a normal distribution, then the distribution of the sample mean has a normal

distribution.

The distribution of the sample mean will also have a normal distribution (approximately) if the sample

size n is fairly large (at least 30), even if the original population does not have a normal distribution.

The larger the sample size, the closer the distribution of the sample mean will be to a normal

distribution. This is referred to as the central limit theorem.
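Formulas (16.1) and (16.2) can be checked by simulation. The sketch below uses die rolls as the population (σ ≈ 1.708) with an arbitrary sample size of n = 25:

```python
import random
import statistics

# Repeatedly draw samples of size n = 25 from a die-roll population
# and look at the spread of the resulting sample means.
random.seed(2)
n = 25
means = [statistics.mean(random.randint(1, 6) for _ in range(n))
         for _ in range(4000)]

sigma = statistics.pstdev(range(1, 7))   # population sd, about 1.708
predicted = sigma / n ** 0.5             # sigma / sqrt(n), about 0.342
observed = statistics.stdev(means)
print(predicted, observed)               # the two should be close
```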

• It is important not to confuse the distribution of the population and the distribution of the sample

mean. This is sometimes easy to do because both distributions are centered on the same point, although

the population distribution has a larger spread.


Part V

Statistical Inference


Chapter 17

Estimating With Conﬁdence

• We have already learned about using statistics to estimate population parameters. Now that we have

an understanding of probability we can also state how much trust we have in our estimates.

• One important thing to keep in mind when thinking about statistical inference is that we always assume

that our sample data are a random sample of the population whose parameter we want to estimate. If

this is not true then the inferences we make might be invalid.

• Let’s think about estimating the mean of a population µ. We already know that one way to calculate this estimate is to take the mean ȳ of a sample drawn from the population. We also know that ȳ is likely to be closer to µ if we use a larger sample. We want a way to report how close we think our statistic ȳ is likely to be to the parameter µ.

• We know that ȳ follows a normal distribution with mean µ and standard deviation σ/√n. The 68-95-99.7 rule says that the probability that an observation from a normal distribution is within two standard deviations of the mean is .95. This means that there is a 95% chance that our sample mean ȳ is within 2σ/√n of the population mean µ. We can therefore say that we are 95% confident that the population parameter µ lies within 2σ/√n of our estimate ȳ.

• Reporting the probability that a population parameter lies within a certain distance of a sample statistic

is called presenting a conﬁdence interval. The basic steps to get a conﬁdence interval are:

1. Calculate your estimate. For a confidence interval around the mean you use ȳ.

2. Determine your conﬁdence level C. This is basically the probability that the parameter will fall

within the interval you calculate.

3. Determine the deviation of your interval. This is the amount above and below your estimate that

will be contained in your interval. For a conﬁdence interval around the mean this is equal to

z*(σ/√n),    (17.1)

where z* is how far from the center of the normal distribution you would have to go so that the area under the curve between −z* and z* would be equal to C. This is the same thing as the number of standard deviations from the center you would need to include a proportion of the observations equal to C.

4. Calculate your conﬁdence interval. Your interval will be between (estimate - deviation) and

(estimate + deviation).

5. Present your conﬁdence interval. You should report both the actual interval and your conﬁdence

level.
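The five steps can be sketched in Python for a 95% interval; the sample mean, σ, and n below are invented for illustration:

```python
from statistics import NormalDist

# Hypothetical numbers: sample mean 120 from n = 36 observations of a
# population with known sigma = 15; confidence level C = .95.
ybar, sigma, n, C = 120.0, 15.0, 36, 0.95

# z* cuts off a central area C, so it is the (1 + C) / 2 quantile.
z_star = NormalDist().inv_cdf((1 + C) / 2)   # about 1.96

deviation = z_star * sigma / n ** 0.5        # about 4.9
interval = (ybar - deviation, ybar + deviation)
print(interval)                              # roughly (115.1, 124.9)
```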

• Notice that there is a tradeoﬀ between your conﬁdence level and the size of your interval. The higher

the conﬁdence level you want the larger the conﬁdence interval will be.


Chapter 18

Conﬁdence Intervals for a Mean

• In the last section we introduced the idea of a conﬁdence interval around the estimated mean. However,

to calculate our interval we needed to know σ, the standard deviation of the population. Like most

parameters, however, we generally don’t know what this is. In this section we learn a way to calculate

conﬁdence intervals without having to know any population parameters.

• Remember that the validity of our conﬁdence interval relies on the assumption that our sample mean

has a normal distribution centered on the population mean. This is true as long as either the original

population had a normal distribution or you have a sample size of at least 30.

• When we don’t have access to the population standard deviation we can estimate it with the standard

deviation of our sample, s.

• Estimating the standard deviation, however, adds some additional error into our measurements. We

have to change our conﬁdence intervals to take this into account. Speciﬁcally, our conﬁdence intervals

will be slightly wider when we estimate the standard deviation than when we know the parameter

value.

• To calculate a conﬁdence interval for the mean from a sample of size n when you do not know the

population standard deviation you:

1. Calculate the sample mean ȳ and standard deviation s.

2. Determine your conﬁdence level C.

3. Determine the deviation of your interval. This is equal to

t*(s/√n),    (18.1)

where t* is how far from the center of the t distribution with n − 1 degrees of freedom you would have to go so that the area under the curve between −t* and t* would be equal to C.

4. Calculate your conﬁdence interval. As before, your interval will be between (estimate - deviation)

and (estimate + deviation).

5. Present your conﬁdence interval. You should report both the actual interval and your conﬁdence

level.
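The same steps with an estimated standard deviation might look like this; the data are made up, and since Python's standard library has no t distribution the t* value is copied from a t table:

```python
import statistics

# Hypothetical sample of n = 10 scores; sigma is unknown, so we use s.
data = [98, 104, 87, 95, 101, 106, 90, 99, 93, 102]
n = len(data)
ybar = statistics.mean(data)      # 97.5
s = statistics.stdev(data)        # about 6.17

# t* for C = .95 with n - 1 = 9 degrees of freedom, entered by hand
# from a t table (the standard library has no t distribution).
t_star = 2.262

deviation = t_star * s / n ** 0.5
interval = (ybar - deviation, ybar + deviation)
print(interval)                   # roughly (93.1, 101.9)
```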

• Notice that this is basically the same as when you do know the population standard deviation except

that you are drawing values from a t distribution instead of the normal distribution. Statisticians have

proven that the values of the t distribution make the correction we need when we estimate the standard

deviation from our sample. A t distribution looks very much like the normal distribution except that

it is lower in the center and higher in the tails.

• There are actually a number of different t distributions, and the one you should use will depend on how many people you have in your sample. All t distributions have the same shape, but those with more degrees of freedom look more like the normal distribution.


Chapter 19

Testing Hypotheses

• When we calculate a conﬁdence interval we are interested in estimating a population parameter. Some-

times, however, instead of estimating a parameter we want to test a speciﬁc theory about the popula-

tion. For example, you might want to test whether the mean of the population is equal to a speciﬁc

value. In this section we will talk about using sample statistics to test speciﬁc hypotheses about the

population.

• The ﬁrst thing you do is to write a mathematical statement of the theory you want to test. Since this

is a theory about the population, your statement will be written in terms of population parameters.

• You write your statement in terms of a null hypothesis H0 and an alternative hypothesis Ha. Usually H0 is a statement that a parameter is equal to a constant (typically zero) while Ha states that the parameter is different from the constant. You typically write these so that your theory says that Ha is true.

• The test we will use attempts to find evidence for Ha over H0, so the two hypotheses must be disjoint. Tests of this sort (pitting a null hypothesis against an alternative hypothesis) are very common in statistics.

• Many times we are interested in determining if the population mean is equal to zero. One set of hypotheses that would test this is

H0: µ = 0
Ha: µ > 0.    (19.1)

Notice that the hypotheses can’t both be true at the same time. The alternative hypothesis µ > 0 is called one-sided because it only considers possibilities on one side of the null hypothesis. It is also possible to have two-sided alternative hypotheses, such as µ ≠ 0.

• Once we have decided on our hypotheses we examine our sample data to determine how unlikely the results would be if H0 is true, and if the observed findings are in the direction suggested by Ha.

• We quantify the result of our test by calculating a p-value. This is the probability, assuming that H0 is true, that the observed outcome would be at least as extreme as that found in the data. The lower the p-value, the more confident we are in rejecting the null hypothesis.

• Exactly how you calculate the p-value for your test depends on what your null and alternative hypotheses are. For example, think about hypotheses about the value of the population mean. We know that the sample mean follows a normal distribution centered on the population mean. If we determine how many standard deviations there are between the sample mean and the value specified under H0, we can look up the probability that a population with a mean equal to the H0 value would produce a sample with a mean at least as extreme as that observed in the sample. This method is called a one-sample z test.

• The specific steps to perform a one-sample z test are:

1. Calculate the sample mean ȳ.

2. Calculate the one-sample z statistic

z = (ȳ − µ0) / (σ/√n),    (19.2)

where µ0 is the mean assumed in the null hypothesis, σ is the known population standard deviation, and n is the sample size.

3. Look up the p-value associated with z in a normal distribution table. If you are performing a two-sided test you need to multiply this value by 2 (because you have to consider the probability of getting extreme results of the opposite sign).
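A sketch of these steps in Python, with invented numbers:

```python
from statistics import NormalDist

# Hypothetical example: test H0: mu = 100 against Ha: mu > 100 with
# ybar = 104, known sigma = 15, and n = 36.
ybar, mu0, sigma, n = 104.0, 100.0, 15.0, 36

z = (ybar - mu0) / (sigma / n ** 0.5)    # 1.6
p_one_sided = 1 - NormalDist().cdf(z)    # about .055
p_two_sided = 2 * p_one_sided            # about .110, if Ha were two-sided
print(z, p_one_sided, p_two_sided)
```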

• Once we calculate a p-value for a test we have to decide whether the probability is low enough to

conclude that the null hypothesis is unlikely. Before analyzing the data you should choose a value that

you would accept as indicating that the null is probably false. This value is called your signiﬁcance

level, and you say that your results are statistically signiﬁcant if the p-value is less than your signiﬁcance

level.

• The exact signiﬁcance level you should use will depend on the ﬁeld that you are in. In psychology we

typically use either .05 or .01.

• If the results are statistically significant we say that we reject H0 in favor of Ha. Otherwise we say that we fail to reject H0.


Chapter 20

Tests for a Mean

• In the last chapter we learned how to perform one-sample z tests to test hypotheses about the sample

mean. However, this test assumes that we know the population standard deviation σ. In this chapter

we will learn how to test hypotheses about the value of a mean when we do not know σ. We will also

learn how to test for a difference between measurements taken on matched observations.

• Just as when calculating conﬁdence intervals, when we don’t have access to the population standard

deviation we can estimate it with the standard deviation of our sample, s.

• As previously discussed, estimating the standard deviation will add some additional error into our measurements. To take this into account we will use a t distribution instead of a normal distribution.

• To examine hypotheses about the mean when we don’t know the population standard deviation we use

a one-sample t test. To perform this test you

1. Calculate the sample mean ȳ and standard deviation s.

2. Calculate the one-sample t statistic

t = (ȳ − µ0) / (s/√n),    (20.1)

where µ0 is the mean assumed in the null hypothesis and n is the sample size.

3. Look up the p-value associated with t in a t distribution table using n −1 degrees of freedom. If

you are performing a two-sided test you need to multiply this value by 2.
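A sketch of the one-sample t test with made-up data (the p-value itself would still come from a t table or software):

```python
import statistics

# Hypothetical data for testing H0: mu = 100 with sigma unknown.
data = [104, 98, 110, 101, 96, 108, 103, 99, 107, 104]
n = len(data)
ybar = statistics.mean(data)     # 103.0
s = statistics.stdev(data)       # about 4.55
mu0 = 100.0

t = (ybar - mu0) / (s / n ** 0.5)    # about 2.09
df = n - 1                           # look t up in a table with 9 df
print(t, df)
```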

• We will often find ourselves calculating t statistics. A t statistic has basically the same interpretation as a z statistic: in this case it is the number of standard deviations between the sample mean ȳ and the value we are testing under the null hypothesis µ0. Whenever we calculate a t statistic it will have the form

t = (estimate − value assumed in H0) / SE(estimate),    (20.2)

where “SE” stands for “standard error.”

• You can also use the one-sample t test to test for differences between two groups of observations if you have a matched pairs design. In this design, for every observation in one group you have a matching observation in the other. They could either be different observations from the same participant, or they could be from different participants that are similar on important characteristics.

To test whether the two groups are the same you first create a new variable that is the difference between the matched participants. You will have a total of n/2 of these. You then perform a means test on this new variable just as explained above. When calculating your t statistic and degrees of freedom you substitute n/2 for the sample size.
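A sketch of the matched pairs procedure with invented before/after scores (here n counts the pairs):

```python
import statistics

# Hypothetical matched pairs: the same 8 participants measured before
# and after a treatment; pairs are matched by list position.
before = [12, 15, 11, 14, 13, 16, 10, 12]
after = [14, 15, 13, 17, 15, 18, 11, 14]

# Create the difference variable and run a one-sample t test on it.
diffs = [a - b for a, b in zip(after, before)]
n = len(diffs)                      # the number of pairs
dbar = statistics.mean(diffs)       # 1.75
sd = statistics.stdev(diffs)

t = (dbar - 0) / (sd / n ** 0.5)    # H0: mean difference = 0
df = n - 1
print(t, df)
```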

• A statistical inference is called robust if the conﬁdence level or p-value does not change very much

when the assumptions of the procedure are violated.


• The one-sample t test is fairly robust to violations of normality, but it is not very robust to violations

of the assumption that the observed data are a random sample of the population.


Part VI

Topics in Inference


Chapter 21

Comparing Two Means

• At the end of the last chapter we learned how to test for a diﬀerence between two groups on a variable

when we have a matched pairs design. In this chapter we will learn how to test for a diﬀerence

between groups when the two groups have independent observations. This is one of the most-often

used procedures in statistics.

• To compare the means of two populations on a single variable you can use a two-sample t test. To use

this test

1. You must take independent samples from both populations.

2. The sample means from both populations must be normally distributed. This will automatically

be satisﬁed if you take large samples.

3. The size of your two samples can be diﬀerent.

4. Your experimental hypotheses should be of the form

H0: µ1 − µ2 = 0
Ha: µ1 − µ2 > 0.    (21.1)

The alternative hypothesis can be either one-sided (as in the example) or two-sided (such as when Ha is µ1 − µ2 ≠ 0).

• The steps to perform a two-sample t test are:

1. Calculate the sample means ȳ1 and ȳ2, and the sample standard deviations s1 and s2. The statistics with the same subscript are calculated from the same sample, although it doesn’t matter which subscript you give to each sample.

2. Calculate the standard error of the difference:

SE(ȳ1 − ȳ2) = √( s1²/n1 + s2²/n2 ),    (21.2)

where n1 and n2 are your sample sizes.

3. Calculate the two-sample t statistic

t = (ȳ1 − ȳ2) / SE(ȳ1 − ȳ2).    (21.3)

4. Unfortunately this statistic does not have a t distribution. The t distribution corrects for the error introduced by estimating one standard deviation, but to calculate this t we have to estimate two. We therefore typically rely on computers to calculate the p-value for this statistic.

5. If you do need to calculate a p-value for a two-sample t statistic by hand you can look it up in a t distribution table with k degrees of freedom, where k is equal to the smaller of n1 − 1 and n2 − 1. The actual p-value is guaranteed to be no larger than that produced by this method, so this is a conservative approximation.

Remember to multiply your obtained p-value by 2 if you are performing a two-sided test.
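Steps 1-3 and the conservative degrees of freedom can be sketched as follows, with two invented samples:

```python
import statistics

# Two hypothetical independent samples.
group1 = [23, 27, 21, 25, 26, 22, 28, 24]
group2 = [19, 22, 18, 21, 20, 23, 17]

n1, n2 = len(group1), len(group2)
y1, y2 = statistics.mean(group1), statistics.mean(group2)
s1, s2 = statistics.stdev(group1), statistics.stdev(group2)

# Standard error of the difference: sqrt(s1^2/n1 + s2^2/n2).
se = (s1 ** 2 / n1 + s2 ** 2 / n2) ** 0.5
t = (y1 - y2) / se                  # about 3.78

# Conservative degrees of freedom: the smaller of n1 - 1 and n2 - 1.
df = min(n1 - 1, n2 - 1)            # 6
print(t, df)
```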

• The boundaries for a level C confidence interval for µ1 − µ2 are

(ȳ1 − ȳ2) ± t* SE(ȳ1 − ȳ2),    (21.4)

where t* is the critical value from a t distribution with k degrees of freedom such that the area between −t* and t* is equal to C. k can either be obtained from the method above or (more accurately) from a statistics program.

• There is a special version of the two-sample t test that assumes that the two populations have the same

standard deviation. The advantage of this is that the resulting test statistic only estimates a single

standard deviation so p-values can be directly calculated from the t distribution. The procedure is

called the pooled t test.

• The steps to perform a pooled t test are:

1. Calculate the sample means ȳ1 and ȳ2, and the sample standard deviations s1 and s2.

2. Calculate the pooled standard deviation

sp = √( [(n1 − 1)s1² + (n2 − 1)s2²] / (n1 + n2 − 2) ).    (21.5)

This is an estimate of the common standard deviation using a weighted average of the sample standard deviations.

3. Calculate the pooled t statistic

t = (ȳ1 − ȳ2) / ( sp √(1/n1 + 1/n2) ).    (21.6)

4. Look up the p-value associated with t in a t distribution table using (n1 + n2 − 2) degrees of freedom. If you are performing a two-sided test you need to multiply this value by 2.
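The pooled procedure, sketched with the same kind of invented data:

```python
import statistics

# Hypothetical data analyzed with the pooled t test, which assumes
# both populations share one standard deviation.
group1 = [23, 27, 21, 25, 26, 22, 28, 24]
group2 = [19, 22, 18, 21, 20, 23, 17]

n1, n2 = len(group1), len(group2)
y1, y2 = statistics.mean(group1), statistics.mean(group2)
s1, s2 = statistics.stdev(group1), statistics.stdev(group2)

# Pooled sd: a weighted average of the two sample variances.
sp = (((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)) ** 0.5

t = (y1 - y2) / (sp * (1 / n1 + 1 / n2) ** 0.5)   # about 3.75
df = n1 + n2 - 2                                  # 13
print(t, df)
```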


Chapter 22

Inference for Proportions

• Up until now we’ve only thought about problems where the response variable was continuous. Sometimes, however, we might want to consider a variable that only has the values “true” or “false,” such as whether a woman becomes pregnant or not. In this case we can test hypotheses about the proportion of a specific outcome in the population.

• We can estimate the proportion of people who have a specific outcome in the population with the proportion of people who have that outcome in a random sample taken from the population. The population proportion is the parameter p, while the estimate taken from the sample is the statistic p̂.

• Assuming that the population is much larger than our sample, the estimate p̂ has an approximately normal distribution with mean p and standard deviation

σ(p̂) = √( p(1 − p)/n ),    (22.1)

where n is the sample size.

• To test whether a population proportion is equal to a specific value you can perform a one-sample z test for a proportion. Your experimental hypotheses should be of the form

H0: p = p0
Ha: p < p0.    (22.2)

The alternative hypothesis can be either one-sided (as in the example) or two-sided (such as Ha: p ≠ p0).

• The steps to perform a one-sample z test for a proportion are:

1. Calculate your estimate of the proportion p̂ from your sample.

2. Calculate the one-sample z statistic for a proportion

z = (p̂ − p0) / √( p0(1 − p0)/n ),    (22.3)

where p0 is the proportion assumed in your null hypothesis, and n is your sample size.

3. Look up the p-value associated with z in a normal distribution table. If you are performing a two-sided test you need to multiply this value by 2.
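A sketch of the test with invented counts:

```python
from statistics import NormalDist

# Hypothetical example: 52 successes in n = 400 observations, testing
# H0: p = 0.15 against Ha: p < 0.15.
n = 400
p_hat = 52 / n       # 0.13
p0 = 0.15

z = (p_hat - p0) / (p0 * (1 - p0) / n) ** 0.5   # about -1.12
p_value = NormalDist().cdf(z)                   # one-sided, about .13
print(z, p_value)
```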

• The boundaries for a level C confidence interval for p are

p̂ ± z* √( p̂(1 − p̂)/n ),    (22.4)

where z* is the critical value from a normal distribution that will include an area C between −z* and z*.


• We can also test whether the proportions of two different populations are equal. For this we would use a two-sample z test for proportions. Your experimental hypotheses should be of the form

H0: p1 − p2 = 0
Ha: p1 − p2 > 0.    (22.5)

The null hypothesis could also be written as H0: p1 = p2. The alternative hypothesis can be either one-sided (as in the example) or two-sided (such as Ha: p1 − p2 ≠ 0).

• The steps for this test are:

1. Calculate your two proportion estimates p̂1 and p̂2 from your samples.

2. Calculate the pooled sample proportion. This is the overall proportion combining both of your samples:

p̂ = (count of successes in both samples combined) / (total number of observations in both samples combined).    (22.6)

3. Calculate the two-sample z statistic for proportions

z = (p̂1 − p̂2) / √( p̂(1 − p̂)(1/n1 + 1/n2) ),    (22.7)

where p̂ is the pooled sample proportion from above, and n1 and n2 are your sample sizes.

4. Look up the p-value associated with z in a normal distribution table. If you are performing a two-sided test you need to multiply this value by 2.
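A sketch with invented counts from two samples:

```python
from statistics import NormalDist

# Hypothetical counts: 45 of 200 in sample 1 and 30 of 180 in sample 2
# show the outcome; test H0: p1 - p2 = 0 with a two-sided Ha.
x1, n1 = 45, 200
x2, n2 = 30, 180
p1_hat, p2_hat = x1 / n1, x2 / n2

# Pooled proportion: all successes over all observations combined.
p_hat = (x1 + x2) / (n1 + n2)

se = (p_hat * (1 - p_hat) * (1 / n1 + 1 / n2)) ** 0.5
z = (p1_hat - p2_hat) / se                     # about 1.43
p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # two-sided, about .15
print(z, p_value)
```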

• To calculate a confidence interval around p1 − p2 you

1. Determine your confidence level C.

2. Calculate the standard error of the difference in proportions. This is

SE(p̂1 − p̂2) = √( p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2 ).    (22.8)

3. The boundaries for a level C confidence interval for p1 − p2 are

(p̂1 − p̂2) ± z* SE(p̂1 − p̂2),    (22.9)

where z* is the critical value from a normal distribution that will include an area C between −z* and z*.


Chapter 23

Two-Way Tables

• In the last chapter we learned how to perform a two-sample z test for proportions, which allows us to

determine if the proportions of an outcome are the same in two diﬀerent populations. In this chapter

we extend this idea and learn how to test whether a proportion is the same across three or more groups.

• If you have a large number of categories you might think about just using the two-sample z test several times. For example, if you have three categories you could calculate three different z’s: one comparing the first and second group, one comparing the first and third group, and one comparing the second and third group. If you discover any significant differences then you could claim that the proportions are not all equal.

Unfortunately, doing this will give you invalid results. The problem is that the more groups you have the bigger the probability that any two of them would happen to be different just due to chance. We not only need a statistic that can test for differences between groups, we also need one that won’t be more likely to find differences when we have more groups.

• A good way to look at your data when both your predictor and response variables are categorical is

with a two-way table. A two-way table is organized so that you have a diﬀerent predictor group in each

column and a different outcome in each row. The inside of the table contains counts of how many

people in each group had each outcome.

• We can use the information in a two-way table to calculate a chi-square statistic that will test whether

the proportion of outcomes is the same for each of our groups. In this case our hypotheses are

H_0: p_1 = p_2 = p_3        H_a: at least one pair is not equal.   (23.1)

• The steps to perform a chi-square test are:

1. Arrange your data in a two-way table.

2. Find the counts that would be expected in each cell of the table if H_0 were true. These may be
computed using the formula

expected count = (row total) × (column total) / n.   (23.2)

3. Calculate the chi-square statistic

χ² = Σ (observed count − expected count)² / (expected count).   (23.3)

4. Look up the p-value associated with χ² in a χ² distribution table using (r − 1)(c − 1) degrees of
freedom, where r is the number of rows and c is the number of columns in your two-way table.

• You should not use the χ² test if more than 20% of your expected cell counts are less than 5. As long
as the expected counts are high it is ok if the observed cell counts are less than 5.
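As a sketch of steps 1 through 3, the expected counts and χ² can be computed from raw counts like this. This is my own illustration with made-up data; the p-value in step 4 still comes from a χ² table.

```python
def chi_square_stat(table):
    """Chi-square statistic and df for a two-way table given as rows of counts,
    following equations (23.2) and (23.3)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n   # (23.2)
            chi2 += (observed - expected) ** 2 / expected  # (23.3)
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df

# Hypothetical 2x3 table: two outcomes (rows) across three groups (columns).
chi2, df = chi_square_stat([[30, 20, 10],
                            [20, 30, 40]])
```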


Chapter 24

Inference for Regression

• In earlier chapters we learned how we can use correlation and least-squares regression to describe the

strength of a linear relationship between two quantitative variables. In this chapter we will learn how

to test hypotheses about linear relationships, and we will learn how to use a regression line to predict

new values.

• You should always check a scatterplot of your data before trying to make inferences about the regression

line. You should make sure

1. That there are no outliers in your data. As we’ve seen before outliers can have a strong inﬂuence

on the least-squares regression line.

2. That the relationship between the two variables is approximately linear. Your inferences will be

invalid if the relationship between the variables is seriously curved.

• Remember that a least-squares regression line has the form

ŷ = a + bx,   (24.1)

and that the values of a and b are given by the following equations:

b = [ Σ(x_i y_i) − (1/n) Σx_i Σy_i ] / [ Σx_i² − (1/n)(Σx_i)² ]

a = ȳ − b x̄.   (24.2)

Notice that a and b are statistics since they are computed strictly from sample data.
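For illustration, the calculation formulas in (24.2) can be sketched directly in Python (the function name and data are mine):

```python
def least_squares(x, y):
    """Intercept a and slope b from the calculation formulas in (24.2)."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    b = (sxy - sx * sy / n) / (sxx - sx * sx / n)
    a = sy / n - b * sx / n    # a = ybar - b*xbar
    return a, b

# Points that fall exactly on the line y = 2x.
a, b = least_squares([1, 2, 3, 4], [2, 4, 6, 8])
```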

• Imagine that you can measure x and y for every person in the population. You could then calculate a

population regression line which would have the form

ŷ_i = α + βx_i.   (24.3)

• In order for us to draw inferences from our sample regression line (such as testing hypotheses or building

conﬁdence intervals) we have to make several assumptions. If we violate these assumptions then our

inferences might not be correct. We assume

1. If we examined a large number of responses at a single value of x the values we observe in y would

have a normal distribution.

2. All the observations are independent of each other except for the inﬂuence of x.

3. The standard deviation of y (which we will call σ) around the regression line is the same for all

values of x.

42

• If we took a number of samples and calculated the least-squares regression line for each, we would find
that we get different values for a and b each time. In fact, these values would have
a normal distribution. Statisticians have proven that the mean of a's distribution is α and the mean of
b's distribution is β. This means that we can use a to estimate α and b to estimate β.

• To test hypotheses about the parameters α and β we need estimates of these distributions. The ﬁrst

step to this is to calculate an estimate of the standard deviation of y around the regression line:

s = sqrt( (1/(n − 2)) Σ(y_i − ŷ_i)² ).   (24.4)

This is something like an average deviation from the regression line. We divide by n − 2 instead of
n because we had to estimate two parameters (a and b, which are used to calculate ŷ_i) to make this
calculation. Notice that this is not the same as the standard deviation of y. For this standard deviation
we don't consider the variability in y that is explained by the regression line.

• Once we have this we can estimate the standard deviation of these distributions with the following

equations:

SE_a = s · sqrt( 1/n + x̄² / Σ(x_i − x̄)² )   (24.5)

SE_b = s / sqrt( Σ(x_i − x̄)² ).   (24.6)

• Now that we have the estimates a and b, and the standard errors of these estimates SE_a and SE_b, we
can test theories about the parameters α and β.

• Most often we want to test theories about the relationship between x and y. We can do this by

proposing hypotheses about the parameter β, which is the slope of the regression line. Our hypotheses

will be of the form

H_0: β = b_0        H_a: β ≠ b_0.   (24.7)

We can also examine one-sided hypotheses, such as H_a: β > b_0.

• To test these hypotheses we calculate a one-sample t statistic

t = (b − b_0) / SE_b,   (24.8)

and determine the p-value from a t distribution with n −2 degrees of freedom. We must remember to

multiply the p-value by 2 if we perform a two-sided test.

• We are most often interested in testing whether or not there is a signiﬁcant relationship between x and

y. In this case b_0 = 0.
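The whole test of H_0: β = b_0 can be sketched as follows, combining equations (24.2), (24.4), (24.6), and (24.8). This is my own illustration; the p-value still comes from a t table with n − 2 degrees of freedom.

```python
import math

def slope_t_stat(x, y, b0=0.0):
    """t statistic for H0: beta = b0 and its degrees of freedom."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    a = ybar - b * xbar
    ss_resid = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(ss_resid / (n - 2))   # (24.4)
    se_b = s / math.sqrt(sxx)           # (24.6)
    return (b - b0) / se_b, n - 2       # (24.8)

# Hypothetical data with a strong linear trend.
t, df = slope_t_stat([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])
```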

• The boundaries for a level C confidence interval for β are

b ± t* (SE_b),   (24.9)

where t* is the critical value from a t distribution with n − 2 degrees of freedom such that the area
between −t* and t* is equal to C.

• We can test hypotheses and build conﬁdence intervals around α in the same way, but this is not

typically done. The parameter α is just the mean value of y when x = 0, which usually is not very important.

• We’ve just learned how to use a regression model to test the strength of the linear relationship between

two variables. Sometimes, however, we are more interested in using the regression line to predict new

observations.


• As you might suspect, if we want to predict the value of y at a speciﬁc value of x (which we will refer

to as x*), we just put that value of x into the regression equation. We talked about this back in Lesson

9. What we will learn in this section is how to build a conﬁdence interval around our predicted value.

• There are actually two diﬀerent conﬁdence intervals we might want. First, we might want to build a

confidence interval around the mean of ŷ at x*. For example, we might want to make an interval such
that we are 95% confident that it contains the mean value of ŷ at x*. The variability in our interval is

caused by our imprecision in estimating the population mean.

The limits for a level C confidence interval around the mean are

ŷ ± t* (SE_μ̂),   (24.10)

where t* is the critical value from a t distribution with n − 2 degrees of freedom such that the area
between −t* and t* is equal to C, and SE_μ̂ is the standard error for predicting the mean response,
calculated from the formula

SE_μ̂ = s · sqrt( 1/n + (x* − x̄)² / Σ(x_i − x̄)² ).   (24.11)

• We also might want to build a prediction interval for a single observation of y at x*. For example, we
might want to make an interval such that we are 95% confident that it contains a single new observation
of y at x*.

The limits for a level C prediction interval for a single observation are

ŷ ± t* (SE_ŷ),   (24.12)

where t* is the critical value from a t distribution with n − 2 degrees of freedom such that the area
between −t* and t* is equal to C, and SE_ŷ is the standard error for predicting a single observation,
calculated from the formula

SE_ŷ = s · sqrt( 1 + 1/n + (x* − x̄)² / Σ(x_i − x̄)² ).   (24.13)

We can examine the distribution of the residuals from fitting a regression line to determine how well
the data satisfy the assumptions required for valid regression inference.
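Both intervals can be sketched together in Python. This illustration is my own; it assumes t* has already been looked up by hand (3.182 is the two-sided 95% critical value for 3 degrees of freedom).

```python
import math

def regression_intervals(x, y, x_star, t_star):
    """Confidence interval for the mean response and prediction interval
    for a single observation at x_star, per equations (24.10)-(24.13)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    a = ybar - b * xbar
    s = math.sqrt(sum((yi - (a + b * xi)) ** 2
                      for xi, yi in zip(x, y)) / (n - 2))
    y_hat = a + b * x_star
    se_mu = s * math.sqrt(1 / n + (x_star - xbar) ** 2 / sxx)      # (24.11)
    se_y = s * math.sqrt(1 + 1 / n + (x_star - xbar) ** 2 / sxx)   # (24.13)
    mean_ci = (y_hat - t_star * se_mu, y_hat + t_star * se_mu)
    pred_int = (y_hat - t_star * se_y, y_hat + t_star * se_y)
    return mean_ci, pred_int

mean_ci, pred_int = regression_intervals([1, 2, 3, 4, 5],
                                         [2.1, 3.9, 6.2, 7.8, 10.1],
                                         x_star=3.5, t_star=3.182)
```

Notice that the prediction interval is always wider than the confidence interval for the mean, because of the extra 1 under the square root in (24.13).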


Chapter 25

One-Way Analysis of Variance

• In Lesson 21 we learned how to compare the means of two populations. In this chapter we will learn

how to perform an Analysis of Variance (usually just referred to as ANOVA) which will allow us to

simultaneously compare the means of three or more groups.

• We already know how to use a two-sample t test to compare the means of two groups. To compare

more than two groups we might just calculate several t statistics, one for each pair of groups.

However, this procedure would be invalid. Think back to when we talked about using several z tests

for proportions to compare the likelihood of an outcome in three or more groups. We would have

the same problem: the more groups you have the bigger the probability that any two of them would

happen to be diﬀerent just due to chance.

• Just like we use the chi-square statistic to simultaneously compare the proportions of an outcome in

three or more populations, we can use Analysis of Variance (ANOVA) to simultaneously compare the

means of three or more populations.

• What the ANOVA does is compare the between-group variance (which is how diﬀerent your groups are

from each other) to the within-group variance (which is how diﬀerent members of the same group are

from each other). If the between-group variance is much larger than the within-group variance then

you conclude that the means are signiﬁcantly diﬀerent.

• In order to draw inferences from an ANOVA we have to make several assumptions. Violating these

assumptions makes our inferences less valid. These assumptions are:

1. We have independent random samples from each population. The number we take from each

population can be diﬀerent.

2. Each individual population has a normal distribution.

3. All of the individual distributions have the same standard deviation. This is similar to the

assumption we make when performing linear regression. The distributions can (and we expect

them to) have diﬀerent means.

• An ANOVA will test the hypotheses

H_0: μ_1 = μ_2 = . . . = μ_I        H_a: at least two means are different,   (25.1)

where I is the number of diﬀerent groups we want to compare.

• To measure the within- and between-group variability we calculate sums of squares. The between-

groups sum of squares is calculated as

SSG = Σ_groups n_i (x̄_i − x̄)²,   (25.2)


while the within-groups sum of squares is calculated as

SSE = Σ_groups (n_i − 1) s_i²,   (25.3)

where n_i is the number of observations in group i, x̄_i is the mean of group i, and s_i is the standard
deviation of group i.

The sum of SSG and SSE gives us the total sum of squares SST. This is a measure of the total variability

found in x.

• Each one of our sums of squares has degrees of freedom associated with it. This is related to the

number of diﬀerent sources that contribute to the sum. We typically divide the sums of squares by the

degrees of freedom to get a measure of the average variation. Working with these mean squares lets us

compare many groups simultaneously without inﬂating the probability of ﬁnding signiﬁcant results.

• The relationships between these quantities are summarized in the ANOVA table below:

Source of variation    SS     df       MS
Between Group          SSG    I − 1    SSG/(I − 1)
Within Group           SSE    N − I    SSE/(N − I)
Total                  SST    N − 1    SST/(N − 1)

In the table, I is the number of groups and N is the total number of subjects. Notice that the total df

is equal to the sum of the between-group and within-group df.

• We can perform an F test to determine if we have a significant amount of between-group variance. This

is the same thing as testing whether any of our means are signiﬁcantly diﬀerent. To perform an F test

you

1. Collect the information needed to build an ANOVA table.

2. Calculate the F statistic

F = MSG / MSE.   (25.4)

3. Look up the corresponding p-value in an F distribution table with I − 1 numerator degrees of

freedom and N −I denominator degrees of freedom.
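The whole procedure can be sketched in Python. This is my own illustration: SSE is computed directly as the sum of squared within-group deviations, which equals the Σ(n_i − 1)s_i² form in (25.3), and the p-value in step 3 still comes from an F table.

```python
def one_way_anova(groups):
    """F statistic plus numerator and denominator df for a one-way ANOVA."""
    N = sum(len(g) for g in groups)
    I = len(groups)
    grand_mean = sum(sum(g) for g in groups) / N
    # Between-group sum of squares, equation (25.2).
    ssg = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares, equivalent to equation (25.3).
    sse = sum(sum((xi - sum(g) / len(g)) ** 2 for xi in g) for g in groups)
    msg = ssg / (I - 1)   # mean squares, as in the ANOVA table
    mse = sse / (N - I)
    return msg / mse, I - 1, N - I

# Three hypothetical groups of three observations each.
F, df1, df2 = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
```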


Contents

I Understanding Data 2

3 4 5 7 10 1 Introduction 2 Data and Measurement 3 The Distribution of One Variable 4 Measuring Center and Spread 5 Normal Distributions

II

Understanding Relationships

12

13 14 15 16 18

6 Comparing Groups 7 Scatterplots 8 Correlation 9 Least-Squares Regression 9 Association vs. Causation

III

Generating Data

19

20 21

10 Sample Surveys 11 Designed Experiments

IV

Experience with Random Behavior

23

24 25 26 27 28

12 Randomness 13 Intuitive Probability 14 Conditional Probability 15 Random Variables 16 Sampling Distributions

i

V

Statistical Inference

29

30 31 32 34

17 Estimating With Conﬁdence 18 Conﬁdence Intervals for a Mean 19 Testing Hypotheses 20 Tests for a Mean

VI

Topics in Inference

36

37 39 41 42 45

21 Comparing Two Means 22 Inference for Proportions 23 Two-Way Tables 24 Inference for Regression 25 One-Way Analysis of Variance

1

Part I

Understanding Data

2

To properly perform statistics we must always keep the meaning of our data in mind. ◦ Do the Reading assignments. and reading assignments every day. • You will spend several hours every day working on this course. This is because we need to cover 15 weeks of material in 4 weeks during Maymester. ◦ Attend the lab section. You are responsible for material covered in lecture. Completing the course will not be easy. A good work schedule would be: ◦ Review the notes from the previous day’s lecture. You will have homework. and it is. 3 . as your homework will often require using the CD-ROM. as well as the contents of the textbook and the CD-ROM. ◦ Attend the lecture.Chapter 1 Introduction • It is important to know how to understand statistics so that we can make the proper judgments when a person or a company presents us with an argument backed by data. ◦ Do the CD-ROM assignments. This probably seems like a lot of work. CDROM. • Data are numbers with a context. and take care of any unﬁnished assignments. ◦ Do your homework. but I will try to make it as good an experience as I can. It is important not to get behind in this course. You will want to plan on staying on campus for this.

A good example would be a person’s gender. and why the researcher is conducting the experiment. weight. You can do mathematical operations on the values of quantitative variables (like taking an average). Speciﬁcally. • Variables can be broken down into two types: ◦ Quantitative variables are those for which the value has numerical meaning. It doesn’t really make sense to do math on categorical variables. You should make sure that the information you have truly addresses the question that you want to answer. for each variable you want to think about who is being measured. If the variable is quantitative you should additionally make sure that you know what units are being used in the measurements. but you can’t say that one group has “more” or “less” of some feature. A variable is any characteristic of an object that can be represented as a number. 4 . weight. The values that the variable takes will vary when measurements are made on diﬀerent objects or at diﬀerent times. and hair color values. and hair color of a group of people in an experiment. For example. The value refers to a speciﬁc amount of some quantity. • Whenever you are doing statistics it is very important to make sure that you have a practical understanding of the variables you are using. • Each time that we record information about an object we observe a case. we might measure the height. We might include several diﬀerent variables in the same case. ◦ Categorical variables are those for which the value indicates diﬀerent groupings. We would have one case for each person. Objects that have the same value on the variable are the same with regard to some characteristic. A good example would be a person’s height. what about them is being measured. and that case would contain that person’s height.Chapter 2 Data and Measurement • Statistics is primarily concerned with how to summarize and interpret variables. All of our cases put together is called our data set.

There are several ways of doing this. One important distinction is between symmetric and skewed distributions. If you examined a large number of diﬀerent objects and graphed how often you observed each of the diﬀerent values of the variable you would get a picture of the variable’s distribution. • A distribution can have many diﬀerent shapes. The stem will contain the ﬁrst digits and the leaf will contain a single digit. Sometimes they are caused by data entry errors or equipment failures. ◦ For relatively small data sets you can construct a stemplot. 3. Exactly where you draw the break will depend on your distribution: Generally you want at least ﬁve stems but not more than twenty. A distribution is skewed to the right if the right side is longer. • Local peaks in a distribution are called modes. You ignore any digits after the one you pick for your leaf. • A good way to describe the distribution of a quantitative variable is to take the following three steps: 1. If your distribution has more than one mode it often indicates that your overall distribution is actually a combination of several smaller ones. Draw a vertical line to the right of the stems. Report the general shape of the distribution. In a bar chart diﬀerent groups are represented on the horizontal axis. 2. List the stems in increasing order from top to bottom. Over each group a bar is drawn such that the height of the bar represents the number of cases falling in that group. To make a stemplot you 1. 5 . These points are called outliers. arranged in ascending numerical order. • Sometimes a distribution has a small number of points that don’t seem to ﬁt its general shape. Report any signiﬁcant deviations from the general shape. but other times they come from situations that are diﬀerent in some important way. • Bar charts are used to display distributions of categorical variables. 2. while it is skewed to the left if the left side is longer. 
Place the leaves belonging to each stem to the right of the line. A distribution is symmetric if the parts above and below its center are mirror images. • Whenever you collect a set of data it is useful to plot its distribution.Chapter 3 The Distribution of One Variable • The pattern of variation of a variable is called its distribution. 3. Separate each case into a stem and a leaf. It’s important to try to explain outliers. Report the center of the distribution.

Divide the range of your data into classes of equal width. Sometimes there are natural divisions. You will often observe cycles. 6 . and the leaves from the other distribution on the left. ◦ For larger distributions you can build a histogram. To make a histogram you 1. The basic procedure is the same as for stemplots. These counts are called the frequencies. Draw a plot with the classes on the horizontal axis and the frequences on the vertical axis. You then list out the leaves from one distribution on the right.◦ To compare two distributions you can construct back-to-back stemplots. Count the number of cases in each class. where the variable regularly rises and falls within a speciﬁc time period. except that you place lines on the left and the right side of the stems. you generally want at least ﬁve but less than twenty classes. • Time plots are used to see how a variable changes over time. but other times you need to make them yourself. 3. Just like in a stemplot. 2.

• People most often report the mean and the standard deviation. where k is a constant i=1 n (kYi ) = k i=1 Yi . 4 i2 = 22 + 32 + 42 = 29. The basic form of a summation equation would be b x= i=a f (i). where k is a constant • When people want to give a simple description of a distribution they typically report measures of its central tendency and spread. For example. If we want to talk about the value of the variable in a speciﬁc case we will put a subscript after the letter. For example we might use h to refer to height. i=2 • Some important properties of summation: n k i=1 n = nk. The main reason why they are used is because of their relationship to the normal distribution. an important distribution that will be discussed in the next lesson.Chapter 4 Measuring Center and Spread • In this section we will discuss some ways of generating mathematical summaries of distributions. For example. 7 . ◦ We will always use n to refer to the total number of cases in our data set. This is a shorthand for the sum of a set of values. In these equations we will make use of some stastical notation. To determine x you calculate the function f (i) for each value from i = a to i = b and add them together.2) (4. represented by the symbol.3) (4.1) (4.4) (Yi + Zi ) = i=1 n i=1 n Yi + i=1 Zi (Yi + k) = i=1 n i=1 Yi + nk. where k is a constant n n (4. ◦ When referring to the distribution of a variable we will use a single letter (other than n). if we wanted to talk about the height of person 5 in our data set we would call it h5 . ◦ Several of our formulas involve summations.

◦ The formal procedure to calculate the median is: 1.6) s= n−1 ◦ The term xi − x is the deviation of an case from the mean. people will often provide its ﬁve-number summary. These variables are often linear transformations of each other. an approximation of the ¯ mean. This is the median of the cases above the location of the 50th percentile. ﬁrst quartile. This means that it is greater than 50% of the cases in the list.7) If you have the summary statistics (mean. they may not be meaningful if the distribution has outliers or is strongly skewed. interquartile range) of the original variable x it is easy to calculate the summary statistics of the transformed variable x∗ . the median is the center obsertation in the list.5) ◦ In statistics our sums are almost always over all of the cases. median. The reason has to do with the fact that we use x. Subtract the value of the 25th percentile from the value of the 75th percentile. instead of the true mean of the distribution. To correct for this underestimate of the variability we only divide by n − 1 instead of n. We then take the square root to put it back on the proper scale. It is located n+1 2 places down the list. 2. People will therefore often report the median and the interquartile range of a variable. Statisticians have proven that this correction gives us an unbiased estimate of the true standard deviation. but this is not the case. ◦ You might expect that we would divide the sum of our deviations by n instead of n − 1.◦ The mean represents the average value of the variable. the median is the average of the two cases in the center of the list. third quartile. i=1 (4. 2. and the maximum. 8 . 3. median. • Often times the same theoretical construct (such as temperature) might be measured in diﬀerent units (such as Farenheit and Celcius). and can be calculated as x= ¯ 1 n n xi . If the number of cases is odd. ◦ To get a more accurate description of a distribtion. 
This is the median of the cases below the location of the 50th percentile. in that you can calculate one of the variables from another with an equation of the form x∗ = a + bx. (4. 2 2 ◦ The median is the 50th percentile of the list. If the number of cases is even. ◦ These are called resistant measures of central tendency and spread because they are not strongly aﬀected by skewness or outliers. 3. ◦ The standard deviation is a measure of the average diﬀerence of each case from the mean. so we will typically not bother to write out the limits of the summation and just assume that it is performed over all the cases. This is the interquartile range. Calculate the value of the 75th percentile (also called the third quartile). (4. • While the mean and standard deviation provide some important information about a distribution. Our values will be somewhat closer to this approximation than they will be to the true mean because we calculated our approximation from the same data. Calculate the value of the 25th percentile (also called the ﬁrst quartile). To calculate the interquartile range you 1. The interquartile range is the diﬀerence between the cases in the 25th and the 75th percentiles. So 1 we might write the formula for the mean as n xi . This consists of the mininum. Arrange the cases in a list ordered from lowest to highest. and can be calculated as (xi − x)2 ¯ . These are located n and n+2 places down the list. We square it so that all the deviation ¯ scores will come out positive (otherwise the sum would always be zero). standard deviation.

Linear transformations do not change the overall shape of the distribution.◦ The mean. and quartiles of the transformed variable can be obtained by multiplying the originals by b and then adding a. 9 . ◦ The standard deviation and interquartile range of the transformed variable can be obtained by multiplying the originals by b. although they will change the values over which the distribution runs. median.

Formulate the problem in terms of the original variable X. • For all normal distributions. We therefore often use the notation N (µ. and 99. Most statistics books will provide a table with the probabilities associated with this curve.” 2. IQ tests. 68% of the data will fall witin one standard deviation of the mean.7% will fall within three standard deviations of the mean.1) These standardized cases have a mean of 0 and a standard deviation of 1. and repeated measurements taken by physcial devices. This means that the area under the curve within a portion of the range is equal to the probability of getting an case in that portion of the range. Some examples are biological characteristics. σ) to refer to a normal distribution with mean µ and standard deviation σ. this corresponds to the probability that Z > 1. • Sometimes you will be asked to calculate normal relative frequencies.7 rule. 95% will fall within two standard deviations of the mean. We use µ and σ instead of x and s when we describe a normal distribution because we are talking ¯ about a population instead of a sample. To calculate normal relative frequencies you: 1. • People often use standardized cases when working with variables that have normal distributions.” 10 . These curves are designed so that the area under the curve gives the relative frequencies for the distribution. If your variable X has a mean µ and standard deviation σ then the standardized cases Z are equal to Z= X −µ . In these problems you are asked to determine the probability of observing certain types of cases in a variable that has a normal distribution. Restate the problem in terms of a standardized normal variable Z. • The standard normal curve is a normal distribution with mean 0 and standard deviation 1. There are many naturally occuring variables that tend to follow normal distributions. This is called the 68-95-99. We will talk more about about the diﬀerence between populations and samples in Part 3. 
• A normal distribution can be completely described by its mean and standard deviation. “Because the mean is equal to 2 and the standard deviation is equal to 3.Chapter 5 Normal Distributions • A density curve is a mathematical function used to represent the distribution of a variable. For now you can just think about them in the same way. “I want to know the probability that X > 5. symmetric density curves. • Normal distributions are a speciﬁc type of distribution that have by bell-shaped. σ (5. This means that they will follow a standard normal curve.

Variables with normal distributions have normal probability plots that look approximately like straight lines. Look up the probabilites associatied with the cases you are interested in. the probability that Z < 1 is equal to .” • We can only use the standard normal curve to calculate probabilities when we are pretty sure that our variable has a normal distribution. This means that the probability that Z > 1 is equal to . Statisticians use normal probability plots to check normality. 11 . You may also need to use the fact that the total area underneath the curve is equal to 1. Since standardized variables have a standard normal curve you can use the that table.3.8413. It doesn’t matter what angle the line has.1587. “According to the table.

Part II Understanding Relationships 12 .

5 x IQR outside of the box. Outliers are cases that appear to be outside the general distribution. In this part we examine how the distribution of two variables relate to each other. A case is considered an outlier if it is more than 1. 3. 5. 13 . Draw a line at the median. Draw lines extending from the box to the greatest and least cases that are not outliers. • Boxplots are used to provide a simple graph of a quantitative variable’s distribution. To produce a boxplot you: 1. 2.Chapter 6 Comparing Groups • In the ﬁrst part we looked at how to describe the distribution of a single variable. Draw a box around the middle quartiles. Identify values of the variable on the y-axis. 4. Draw dots for any suspected outliers. • People will often present side-by-side boxplots to compare the distribution of a quantitative variable across several diﬀerent groups.

and explanatory variables are sometimes called independent variables or predictor variables. • If we think that one of the variables causes or explains the other we put the explanatory variable on the x-axis and the response variable on the y-axis. while those with larger values of the Y variable are represented by points that are closer to the top. We try to predict them using explanatory variables. while if it decreases from left to right we say that they have a negative relationship. One variable is placed on the horizontal (X) and one is placed on the vertical (Y) axis . The overall form of the relationship. If the line rises from left to right we say that the two variables have a positive relationship. • When the points on a scatterplot appear to fall along a straight line. 14 . but their speciﬁc placement on the axis is more or less arbitrary. In these graphs.Chapter 7 Scatterplots • Response variables measure an outcome of a study. the relationship could resemble a straight line or a curve. Sometimes we want to show that changes in the explanatory variables cause changes in the response variables. Cases with larger values of the X variable are represented by points further to the right. but other times we just want to show a relationship between them. 2. • Sometimes people use a modiﬁed version of the scatterplot to plot the relationship between categorical variables. • When you want to interpret a scatterplot think about: 1. The strength of the relationship. For example. we say that the two variables have a linear relationship. • You can use scatterplots to graphically display the relationship between two quantitative variables measured on the same cases. Each case is represented by a point on the graph. diﬀerent points on the axis correspond to diﬀerent categories. • Response variables are sometimes called dependent variables. The direction of the relationship (positive or negative). placed according to its values on each of the variables. 3.

when the correlation is closer to either 1 or -1 then the relationship between the variables is closer to a straight line.2) This is called the calculation formula for r. while negative values indicate a negative association between the variables.Chapter 8 Correlation • The correlation coeﬃcient is designed to measure the linear association between two variables without distinguishing between response and explanatory variables. sx refers to the standard deviation of x and sy refers to the standard deviation of y. 15 . they might still have a strong nonlinear relationship. This means that the one variable is actually a perfect linear function of the other. Positive values indicate a positive association between the variables. and will be near zero when the value on one variable is unrelated to the value on the second. but is not as useful when trying to understand the meaning of correlations. The correlation between the variables x and y is deﬁned as 1 xi − x ¯ yi − y ¯ r= . • The correlation between two variables will be positive when they tend to be both high and low at the same time. This formula is easier to use when calculating correlations by hand. ◦ The value of r always falls between -1 and 1. ◦ If r = 1 or r = −1 then all of the cases fall on a straight line. In general.1) n−1 sx sy In this equation. ◦ The value of r will not change if you change the unit of measurement of either x or y. (8. • The correlation formula can also be written as r= 1 (xi yi ) − n xi (n − 1)sx sy yi . • Correlations have several important characteristics. ◦ Correlations only measure the degree of linear association between two variables. If two variables have a zero correlation. (8. will be negative when one tends to be high when the other is low.

Chapter 9 Least-Squares Regression
• A regression line is an equation that describes the relationship between a response variable y and an explanatory variable x. When talking about it you say that you "regress y on x."
• The general form of a regression line is

    ŷ = a + bx,    (9.1)

where ŷ is the predicted value of the response variable, x is the value of the explanatory variable, b is the slope of the line, and a is the intercept of the line. The slope is the amount that y changes when you increase x by 1, and the intercept is the value of y when x = 0.
• The values on the line are called the predicted values, represented by ŷ. The difference between the observed and the predicted value (yi − ŷi) for each case is called the residual.
• Least-squares regression is a specific way of determining the slope and intercept of a regression line. The line produced by this method minimizes the sum of the squared residuals. We use the squared residuals because the residuals can be both positive and negative.
• The least-squares regression line of y on x calculated from n cases is ŷ = a + bx, where

    b = [Σ xy − (1/n)(Σ x)(Σ y)] / [Σ x² − (1/n)(Σ x)²]    (9.2)

and a = ȳ − b x̄.
• To see if it is appropriate to use a regression line to represent our data we often examine residual plots. A residual plot is a graph with the residual on one axis and either the value of the explanatory variable or the predicted value on the other, with one point from each case. In general you want your residual plot to look like a random scattering of points. Any sort of pattern would indicate that there might be something wrong with using a linear model.
• A lurking variable is a variable that has an important effect on your response variable but which is not examined as an explanatory variable. Often a plot of the residuals against time can be used to detect the presence of a lurking variable.
• In linear regression, an outlier is a case whose response lies far from the fitted regression line. An influential observation is a case that has a strong effect on the position of the regression line. Points that have unusually high or low values on the explanatory variable are often influential observations.

It is important to carefully examine both outliers and influential observations. Sometimes these points are not actually from the population you want to study, in which case you should drop them from your data set.
• You should notice that the equations for regression lines and correlations are based on means and standard deviations. This means that they will both be strongly affected by outliers or influential observations.
• Like the correlation, the slope of the regression line of y on x is a measure of the linear association between the two variables. The two are related by the following formula:

    b = r (sy / sx).    (9.3)

• It is important to notice that you get different least-squares regression equations if you regress y on x than if you regress x on y. This is because in the first case the equation is built to minimize errors in the vertical direction, while in the second case the equation is built to minimize errors in the horizontal direction.
• The square of the correlation, r², is the proportion of the variation in the variable y that can be explained by the least-squares regression of y on x. It is the same as the proportion of variation in the variable x that can be explained by the least-squares regression of x on y.
• Regression lines are often used to predict one variable from the other. However, you must be careful not to extrapolate beyond the range of your data. You should only use the regression line to make predictions within the range of the data you used to make the line. Additionally, you should always make sure that any new case you want to predict is drawn from the same underlying population.
• As with correlations, lurking variables can create false relationships or obscure true relationships.
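A minimal sketch (hypothetical data, standard library only) of computing the least-squares slope and intercept from formula (9.2), then checking the link between the slope and the correlation, b = r·sy/sx, from (9.3):

```python
from statistics import mean, stdev

def least_squares(xs, ys):
    # Slope b (Eq. 9.2) and intercept a = ybar - b*xbar of yhat = a + b*x.
    n = len(xs)
    b = (sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n) / \
        (sum(x * x for x in xs) - sum(xs) ** 2 / n)
    a = mean(ys) - b * mean(xs)
    return a, b

def corr(xs, ys):
    # Correlation r computed from standardized scores (Eq. 8.1).
    n, mx, my = len(xs), mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        ((n - 1) * stdev(xs) * stdev(ys))

x = [1, 2, 3, 4, 5]          # made-up data for illustration
y = [2, 1, 4, 3, 5]
a, b = least_squares(x, y)
r = corr(x, y)
print(round(a, 6), round(b, 6))            # 0.6 0.8
print(round(r * stdev(y) / stdev(x), 6))   # 0.8 -- same as the slope b
print(round(r ** 2, 6))                    # 0.64: share of variation explained
```

Regressing x on y instead gives a different line, as the notes warn; swapping the arguments to `least_squares` demonstrates that.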

Chapter 9 Association vs. Causation
• Once you determine that you can use an explanatory variable to predict a response variable you can try to determine whether changes in the explanatory variable cause changes in the response variable.
• There are three situations that will produce a relationship between an explanatory and a response variable.
1. Changes in the explanatory variable actually cause changes in the response variable. For example, studying more for a test will cause you to get a better grade.
2. Changes in both the explanatory and response variables are related to some third variable, but neither causes the other. For example, oversleeping class can cause you to both have uncombed hair and perform poorly on a test. Whether someone has combed hair or not could then be used to predict their performance on the test, but combing your hair will not cause you to perform better.
3. Changes in the explanatory variable are confounded with a third variable that causes changes in the response. Two variables are confounded when they tend to occur together. For example, a recently published book stated that African-Americans perform worse on intelligence tests than Caucasians. However, we can't conclude that there is a racial difference in intelligence because a number of other important factors, such as income and educational quality, are confounded with race.
• The best way to establish that an explanatory variable causes changes in a response variable is to conduct an experiment. In an experiment, you actively change the level of the explanatory variable and observe the effect on the response variable while holding everything else about the situation constant. We will talk more about experiments in Chapter 11.

Part III Generating Data

Chapter 10 Sample Surveys
• The population is the group that you want to find out about from your data.
◦ If your data represents the entire population you have a census.
◦ If your data represents only a portion of the population you have a sample.
• By selecting a design you choose a systematic way of producing data in your study. This includes things like what people or objects you will examine and the situations under which you will examine them. A poorly chosen design will make it difficult to draw any conclusions from the data.
• The basic goal of sampling design is to make sure that the sample you select for your study properly represents the population that you are interested in. Your sample is biased if it favors certain parts of the population over others.
• One example of a bad sampling design is a voluntary response sample. In this design you ask a question to a large number of people and collect data from those who choose to answer. This sample does not represent the general population because the people who choose to respond to your question are probably different from the people who do not.
• A good way to get unbiased data is with a simple random sample. In a simple random sample every member of the population has an equal chance of being in the sample. Also, every combination of members is as likely as every other combination.
• You should also be careful about introducing bias into your data with the way you write your survey. The wording of each item should be clear and should not influence the way people answer the question.
• Most data from surveys are recorded in terms of numbers. However, these numbers do not always mean the same thing. It is important to understand the scale of measurement used by a given question so you know what the appropriate analyses are. The four basic scales of measurement are:
1. Nominal scales, where the numbers just serve as labels for categories. Example: coding 1=male, 2=female.
2. Ordinal scales, where the numbers indicate the rank order of a category. Example: coding grade level in high school as 1=freshman, 2=sophomore, 3=junior, 4=senior.
3. Interval scales, where equal differences between numbers reflect equal differences on some meaningful quantity. Example: day of the year.
4. Ratio scales, which are just like interval scales but a "0" value indicates that something possesses none of the measured characteristic. Example: height.
• For our purposes you should treat nominal and ordinal questions as categorical variables, and interval and ratio questions as quantitative variables.
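Drawing a simple random sample is easy to sketch in code. The population and sample size below are hypothetical; `random.sample` draws without replacement, so every member, and every combination of members, is equally likely to be chosen:

```python
import random

population = list(range(1000))   # hypothetical population of 1000 numbered members

random.seed(1)                   # fixed seed only so the example repeats
sample = random.sample(population, 50)   # a simple random sample of size 50

print(len(sample))               # 50
print(len(set(sample)))          # 50 -- no member appears twice
```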

Chapter 11 Designed Experiments
• In observational studies you try to examine and measure things as they naturally occur. In experimental studies you impose some manipulation and measure its effects. It is easier to generalize from observational studies, but it is easier to determine causality from experimental studies.
• The objects on which an experiment is performed are called the experimental units. Human experimental units used to be called "subjects," but we now call them "research participants."
• The explanatory variables in an experiment are referred to as factors, while the values each factor can have are called its levels. A specific experimental condition applied to a set of units is called a treatment.
• The basic idea of an experiment is that if you control for everything in the environment except for the factor that you manipulate, then any differences you observe between different groups in your experiment must be caused by the factor.
• The effectiveness of a treatment is generally measured by comparing the results of units that underwent the treatment to the results of units in a control condition. Units in a control condition go through the experiment but never receive the treatment.
• It is generally not a good idea to determine the effect of a treatment by comparing the units before the treatment to the same units after the treatment. Just being in an experiment can sometimes cause changes, even if the treatment itself doesn't really do anything. This is called the placebo effect.
• It is therefore very important that the groups of experimental units you have in each level of your factor are basically the same. If there is some difference between the types of units in your groups, then the results might be caused by those differences instead of your treatment.
• Sometimes people use matching to make their groups similar. In this case you assign people to the different levels of your factor in such a way that for every person in one factor level you have a similar person in each of the other levels.
• The preferred way to make your groups similar is by randomization. In this case you just randomly pick people to be in each of your groups, relying on the fact that on average the differences between the groups will average out. Not only does this require less effort, but you don't have to know ahead of time what are the important things that you need to match on.
• There are many different ways to randomly assign your experimental units to conditions. You can use a table of random numbers, a mathematical calculator that produces random digits, or you can use statistical software.
• Randomization is one of the basic ideas behind statistics. When we randomly assign units to be in either the treatment or control conditions, there will be some differences between the groups. However, this is not a problem because mathematicians know a lot about how random variables work. When we

see a difference between the groups we can figure out the probability that the difference is just due to the way that we assigned the units to the groups. If the probability is too low for us to believe that the results were just due to chance we say that the difference is statistically significant, and we conclude that our treatment had an effect.
• In conclusion, when designing an experiment you want to:
1. Control the effects of irrelevant variables.
2. Randomize the assignment of experimental units to treatment conditions.
3. Replicate your experiment across a number of experimental units.
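Random assignment with statistical software can be sketched in a few lines. The participant labels and group sizes are invented; shuffling and splitting the list gives every participant the same chance of landing in each condition:

```python
import random

# 20 hypothetical participant labels.
participants = [f"P{i:02d}" for i in range(1, 21)]

random.seed(7)                  # fixed seed only so the example repeats
random.shuffle(participants)    # a random ordering = a random assignment
treatment = participants[:10]   # first half -> treatment condition
control = participants[10:]     # second half -> control condition

print(len(treatment), len(control))   # 10 10
```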

Part IV Experience with Random Behavior

Chapter 12 Randomness
• In Chapter 11 we talked about how and why statisticians use randomness in experimental designs. Once we run an experiment we want to determine the probability that any relationships we see are just due to chance. In the next chapters we will learn more about probability and randomness so we can make sense of our results.
• A random phenomenon is one where we can't exactly predict what the end result will be, but where we can describe how often we would expect to see each possible result if we repeated the event a number of times. The different possible end results are referred to as outcomes.
• The probability of an outcome of a random phenomenon is the proportion of times we would expect to see the outcome if we observed the event a large number of times.
• We can determine the probabilities of outcomes by observing a random phenomenon a number of times, but most often we either assign them using common sense or else calculate them using mathematics.

Chapter 13 Intuitive Probability
• An event is a collection of outcomes of a random phenomenon.
◦ An event can be only a single outcome, or it can be many different outcomes.
◦ When we talk about the probability of an event we are referring to the probability that the outcome of a random phenomenon is included in the event.
◦ In statistics we often use set notation to describe events. The event itself is a set, and the outcomes included in the event are elements of the set.
• When thinking about a random phenomenon one of the first things we typically do is consider the different results we might observe.
◦ A sample space is a list of all of the possible outcomes from a specific random phenomenon.
◦ Just like events, we typically use a set to describe the sample space of a random phenomenon. This set is typically named S, and includes each possible outcome as an element.
• A sample space is called discrete if we can list out all of the possible outcomes.
• A sample space is called continuous if we cannot list out all of the possible outcomes.
◦ In a continuous sample space we use a density curve to represent the probabilities of outcomes. The area under any section of the density curve is equal to the probability of the result falling into that section.
◦ To calculate the probability of an event you add up the area underneath the density curve for each range of outcomes included in the event.
• Probabilities of events follow a number of rules:
◦ The probability that event A occurs = P(A).
◦ Probabilities are always between 0 and 1.
◦ If the event contains all possible outcomes then its probability = 1.
◦ The probability of an event in a discrete sample space is the sum of the probabilities of the individual outcomes making up the event.
◦ The probability that an event A does not occur = 1 − P(A).
◦ Two events are called disjoint if they have no outcomes in common. This means that they can never both occur at the same time. If events A and B are disjoint then P(A or B) = P(A) + P(B).
◦ Two events are called independent if knowing that one occurs does not change the probability that the other occurs. This is similar to saying that there isn't anything that is simultaneously influencing both events. If events A and B are independent then P(A and B) = P(A)P(B). Note that this is different from being disjoint – in fact, two events that each have a probability greater than zero cannot be both disjoint and independent.
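These rules can be verified by direct enumeration for a small discrete sample space. The die events below are made up for illustration; exact fractions avoid any rounding:

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}          # sample space: one roll of a fair die

def P(event):
    # Probability of an event = (outcomes in the event) / (outcomes in S).
    return Fraction(len(event & S), len(S))

A = {2, 4, 6}                    # "roll is even"
B = {1, 3}                       # disjoint from A (no shared outcomes)

print(P(A | B) == P(A) + P(B))   # True: addition rule for disjoint events
print(P(S - A) == 1 - P(A))      # True: complement rule
print(P(S))                      # 1 -- the whole sample space
```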

Chapter 14 Conditional Probability
• Many times the probability of an event will depend on external factors. Think about the probability that a high school student will go to college. We know that many factors, such as socio-economic status (SES), can impact this probability. Conditional probability lets us think about the probability of an event given that something else is true.
• We use the notation P(A|B) to refer to the conditional probability that event A occurs given that B is true. For example, the probability that someone who is low in SES goes to college would be written as P(going to college | low SES).
• Conditional probabilities can also be thought of as limiting the type of people you want to include in your probability assessment. One good way to see conditional probabilities is in a 2-way table.
• Last chapter we mentioned that two events are independent if knowing that one occurs does not change the probability that the other occurs. Using conditional probability we can write this mathematically: Two events A and B are independent if

    P(A|B) = P(A).    (14.1)

• If we know the probabilities of two events and the probability of their intersection we can calculate either conditional probability:

    P(A|B) = P(A and B) / P(B),    P(B|A) = P(A and B) / P(A).    (14.2)

• Similarly, if we know the probabilities of two events and the corresponding conditional probabilities we can calculate the probability of the intersection:

    P(A and B) = P(A)P(B|A) = P(B)P(A|B).    (14.3)

This is actually an extension of the intersection formula we learned in chapter 13. While that formula would only work if A and B are independent, this one will always work.
• We can also extend the union formula we learned in chapter 13:

    P(A or B) = P(A) + P(B) − P(A and B).    (14.4)

This formula will always be accurate, whether events A and B are disjoint or not.
• People often think that there are patterns in sequences that are actually random. There are many different types of sequences that can be reasonable results of independent processes.
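A 2-way table of counts makes the conditional-probability formula concrete. The counts below are invented for illustration; P(college | low SES) comes straight from P(A|B) = P(A and B)/P(B):

```python
from fractions import Fraction

# Hypothetical 2-way table of counts: SES by college attendance.
counts = {("low", "college"): 30, ("low", "none"): 70,
          ("high", "college"): 60, ("high", "none"): 40}
total = sum(counts.values())                     # 200 students in all

p_low = Fraction(sum(c for (ses, _), c in counts.items() if ses == "low"), total)
p_low_and_college = Fraction(counts[("low", "college")], total)

# P(college | low SES) = P(college and low SES) / P(low SES)
p_college_given_low = p_low_and_college / p_low
print(p_college_given_low)                       # 3/10
```

Reading the same answer off the table directly — 30 college-goers out of the 100 low-SES students — shows the "limiting who you include" interpretation in the notes.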

Chapter 15 Random Variables
• A random variable is a variable whose value is determined by the outcome of a random phenomenon. The statistical techniques we've learned so far deal with variables, not events, so we need to define a variable in order to analyze a random phenomenon.
• A discrete random variable has a finite number of possible values, while a continuous random variable can take all values in a range of numbers.
• The probability distribution of a random variable tells us the possible values of the variable and how probabilities are assigned to those values.
◦ The probability distribution of a discrete random variable is typically described by a list of the values and their probabilities. Each probability is a number between 0 and 1, and the sum of the probabilities must be equal to 1.
◦ The probability distribution of a continuous random variable is typically described by a density curve. The curve is defined so that the probability of any event is equal to the area under the curve for the values that make up the event, and the total area under the curve is equal to 1. One example of a continuous probability distribution is the normal distribution.
• We use the term parameter to refer to a number that describes some characteristic of a population. We rarely know the true parameters of a population, and instead estimate them with statistics. Statistics are numbers that we can calculate purely from a sample. The value of a statistic is random, and will depend on the specific observations included in the sample.
• The law of large numbers states that if we draw a bunch of numbers from a population with mean µ, we can expect the mean of the numbers ȳ to be closer to µ as we increase the number we draw. This means that we can estimate the mean of a population by taking the average of a set of observations drawn from that population.
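The law of large numbers is easy to see in a small simulation. The population below is hypothetical (normal with µ = 5 and σ = 2); larger samples tend to give means closer to µ:

```python
import random
from statistics import mean

random.seed(3)                  # fixed seed only so the example repeats
mu = 5.0                        # population mean, chosen arbitrarily

estimates = {}
for n in (10, 1000, 100000):
    estimates[n] = mean(random.gauss(mu, 2.0) for _ in range(n))

for n, m in estimates.items():
    print(n, round(m, 3))       # the estimates settle near mu as n grows
```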

Chapter 16 Sampling Distributions
• As discussed in the last chapter, statistics are numbers we can calculate from sample data, usually as estimates of population parameters. However, a large number of different samples can be taken from the same population, and each sample will produce slightly different statistics. To be able to understand how a sample statistic relates to an underlying population parameter we need to understand how the statistic can vary.
• The sampling distribution of a statistic is the distribution of values taken by the statistic in all possible samples of the same size.
• The distribution of the sample mean ȳ of n observations drawn from a population with mean µ and standard deviation σ has mean

    µ_ȳ = µ    (16.1)

and standard deviation

    σ_ȳ = σ/√n.    (16.2)

• If the original population has a normal distribution, then the distribution of the sample mean has a normal distribution. The distribution of the sample mean will also have a normal distribution (approximately) if the sample size n is fairly large (at least 30), even if the original population does not have a normal distribution. This is referred to as the central limit theorem. The larger the sample size, the closer the distribution of the sample mean will be to a normal distribution.
• It is important not to confuse the distribution of the population and the distribution of the sample mean. This is sometimes easy to do because both distributions are centered on the same point, although the population distribution has a larger spread.
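A quick simulation shows formulas (16.1) and (16.2) at work. The population here is made up (uniform on [0, 100], which is not normal): the sample means still center on µ and have spread close to σ/√n:

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(11)                # fixed seed only so the example repeats
n, reps = 25, 2000
sigma = 100 / sqrt(12)         # sd of a uniform(0, 100) population; its mean is 50

# Draw many samples of size n and record each sample mean.
sample_means = [mean(random.uniform(0, 100) for _ in range(n))
                for _ in range(reps)]

print(round(mean(sample_means), 1))        # near 50, the population mean
print(round(stdev(sample_means), 2),
      round(sigma / sqrt(n), 2))           # near each other: sigma / sqrt(n)
```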

Part V Statistical Inference

Chapter 17 Estimating With Confidence
• We have already learned about using statistics to estimate population parameters. Now that we have an understanding of probability we can also state how much trust we have in our estimates.
• One important thing to keep in mind when thinking about statistical inference is that we always assume that our sample data are a random sample of the population whose parameter we want to estimate. If this is not true then the inferences we make might be invalid.
• Let's think about estimating the mean of a population µ. We already know that one way to calculate this estimate is to take the mean of a sample drawn from the population ȳ. We also know that ȳ is more likely to be closer to µ if we use a larger sample. We want a way to report how close we think our statistic ȳ is likely to be to the parameter µ.
• We know that ȳ follows a normal distribution with mean µ and standard deviation σ/√n. The 68-95-99.7 rule says that the probability that an observation from a normal distribution is within two standard deviations of the mean is .95. This means that there is a 95% chance that our sample mean ȳ is within 2σ/√n of the population mean µ. We can therefore say that we are 95% confident that the population parameter µ lies within 2σ/√n of our estimate ȳ.
• Reporting the probability that a population parameter lies within a certain distance of a sample statistic is called presenting a confidence interval. The basic steps to get a confidence interval are:
1. Calculate your estimate. For a confidence interval around the mean you use ȳ.
2. Determine your confidence level C. This is basically the probability that the parameter will fall within the interval you calculate.
3. Determine the deviation of your interval. This is the amount above and below your estimate that will be contained in your interval. For a confidence interval around the mean this is equal to

    z∗ σ/√n,    (17.1)

where z∗ is how far from the center of the normal distribution you would have to go so that the area under the curve between −z∗ and z∗ would be equal to C. This is the same thing as the number of standard deviations from the center you would need to include a proportion of the observations equal to C.
4. Calculate your confidence interval. Your interval will be between (estimate − deviation) and (estimate + deviation).
5. Present your confidence interval. You should report both the actual interval and your confidence level.
• Notice that there is a tradeoff between your confidence level and the size of your interval. The higher the confidence level you want the larger the confidence interval will be.
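The five steps can be sketched as a small function. The sample numbers are hypothetical; `NormalDist` supplies z∗ for any confidence level C in place of a normal table:

```python
from math import sqrt
from statistics import NormalDist

def z_interval(ybar, sigma, n, C=0.95):
    # Step 3: deviation = z* * sigma / sqrt(n); step 4: estimate +/- deviation.
    z_star = NormalDist().inv_cdf((1 + C) / 2)   # about 1.96 when C = 0.95
    dev = z_star * sigma / sqrt(n)
    return ybar - dev, ybar + dev

# Hypothetical sample: ybar = 103.2 from n = 36, known sigma = 15.
lo, hi = z_interval(ybar=103.2, sigma=15.0, n=36)
print(round(lo, 2), round(hi, 2))   # 98.3 108.1
```

Raising C to 0.99 widens the interval, which is the tradeoff described above.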

Chapter 18 Confidence Intervals for a Mean
• In the last section we introduced the idea of a confidence interval around the estimated mean. However, to calculate our interval we needed to know σ, the standard deviation of the population. Like most parameters, we generally don't know what this is. In this section we learn a way to calculate confidence intervals without having to know any population parameters.
• When we don't have access to the population standard deviation we can estimate it with the standard deviation of our sample, s. However, estimating the standard deviation adds some additional error into our measurements. We have to change our confidence intervals to take this into account. Specifically, our confidence intervals will be slightly wider when we estimate the standard deviation than when we know the parameter value.
• To make this correction we draw values from a t distribution instead of the normal distribution. A t distribution looks very much like the normal distribution except that it is lower in the center and higher in the tails. Statisticians have proven that the values of the t distribution make the correction we need when we estimate the standard deviation from our sample.
• There are actually a number of different t distributions, and the one you should use will depend on how many people you have in your sample. All t distributions have the same basic shape, but those with more degrees of freedom look more like the normal distribution.
• To calculate a confidence interval for the mean from a sample of size n when you do not know the population standard deviation you:
1. Calculate the sample mean ȳ and standard deviation s.
2. Determine your confidence level C.
3. Determine the deviation of your interval. This is equal to

    t∗ s/√n,    (18.1)

where t∗ is how far from the center of the t distribution with n − 1 degrees of freedom you would have to go so that the area under the curve between −t∗ and t∗ would be equal to C.
4. Calculate your confidence interval. As before, your interval will be between (estimate − deviation) and (estimate + deviation).
5. Present your confidence interval. You should report both the actual interval and your confidence level.
• Notice that this is basically the same as when you do know the population standard deviation except that you are drawing values from a t distribution instead of the normal distribution.
• Remember that the validity of our confidence interval relies on the assumption that our sample mean has a normal distribution centered on the population mean. This is true as long as either the original population had a normal distribution or you have a sample size of at least 30.
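A sketch of the procedure, using a few two-sided 95% critical values t∗ copied from a standard t table (the sample numbers are hypothetical; a statistics package would supply t∗ for any df and C):

```python
from math import sqrt

# Two-sided 95% critical values t* from a standard t table, keyed by df.
T_STAR_95 = {4: 2.776, 9: 2.262, 14: 2.145, 19: 2.093, 29: 2.045}

def t_interval_95(ybar, s, n):
    # Step 3: deviation = t* * s / sqrt(n), with df = n - 1 (Eq. 18.1).
    t_star = T_STAR_95[n - 1]
    dev = t_star * s / sqrt(n)
    return ybar - dev, ybar + dev

# Hypothetical sample: ybar = 103.2, s = 15, n = 20 (so df = 19).
lo, hi = t_interval_95(ybar=103.2, s=15.0, n=20)
print(round(lo, 2), round(hi, 2))   # 96.18 110.22
```

Note that t∗ = 2.093 here is larger than the normal z∗ ≈ 1.96, so this interval is slightly wider than the known-σ interval — exactly the correction described above.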

Chapter 19 Testing Hypotheses
• When we calculate a confidence interval we are interested in estimating a population parameter. Sometimes, however, instead of estimating a parameter we want to test a specific theory about the population. For example, you might want to test whether the mean of the population is equal to a specific value. In this section we will talk about using sample statistics to test specific hypotheses about the population.
• The first thing you do is to write a mathematical statement of the theory you want to test. Since this is a theory about the population, your statement will be written in terms of population parameters.
• You write your statement in terms of a null hypothesis H0 and an alternative hypothesis Ha. Usually H0 is a statement that a parameter is equal to a constant (typically zero) while Ha states that the parameter is different from the constant. You typically write these so that your theory says that Ha is true. Notice that the hypotheses can't both be true at the same time, so the two hypotheses must be disjoint.
• For example, think about hypotheses about the value of the population mean. One set of hypotheses that would test this is

    H0: µ = 0
    Ha: µ > 0.    (19.1)

The alternative hypothesis µ > 0 is called one-sided because it only considers possibilities on one side of the null hypothesis. It is also possible to have two-sided alternative hypotheses, such as µ ≠ 0.
• The test we will use attempts to find evidence for Ha over H0. Tests of this sort (pitting a null hypothesis against an alternative hypothesis) are very common in statistics.
• Once we have decided on our hypotheses we examine our sample data to determine how unlikely the results would be if H0 is true, and whether the observed findings are in the direction suggested by Ha.
• We quantify the result of our test by calculating a p-value. This is the probability, assuming that H0 is true, that the observed outcome would be at least as extreme as that found in the data. The lower the p-value, the more confident we are in rejecting the null hypothesis.
• Exactly how you calculate the p-value for your test depends on what your null and alternative hypotheses are. Many times we are interested in determining if the population mean is equal to zero. We know that the sample mean follows a normal distribution centered on the population mean. If we determine how many standard deviations there are between the sample mean and the value specified under H0, we can look up the probability that a population with a mean equal to the H0 value would produce a sample with a mean at least as extreme as that observed in the sample. This method is called a one-sample z test.
• The specific steps to perform a one-sample z test are:

1. Calculate the sample mean ȳ.
2. Calculate the one-sample z statistic

    z = (ȳ − µ0) / (σ/√n),    (19.2)

where µ0 is the mean assumed in the null hypothesis, σ is the known population standard deviation, and n is the sample size.
3. Look up the p-value associated with z in a normal distribution table. If you are performing a two-sided test you need to multiply this value by 2 (because you have to consider the probability of getting extreme results of the opposite sign).
• Once we calculate a p-value for a test we have to decide whether the probability is low enough to conclude that the null hypothesis is unlikely. Before analyzing the data you should choose a value that you would accept as indicating that the null is probably false. This value is called your significance level, and you say that your results are statistically significant if the p-value is less than your significance level.
• The exact significance level you should use will depend on the field that you are in. In psychology we typically use either .05 or .01.
• If the results are statistically significant we say that we reject H0 in favor of Ha. Otherwise we say that we fail to reject H0.
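The steps above can be sketched as a short function. The sample numbers are hypothetical, and `NormalDist` stands in for the normal table:

```python
from math import sqrt
from statistics import NormalDist

def one_sample_z(ybar, mu0, sigma, n, two_sided=False):
    z = (ybar - mu0) / (sigma / sqrt(n))          # Eq. (19.2)
    if two_sided:
        p = 2 * (1 - NormalDist().cdf(abs(z)))    # count both tails
    else:
        p = 1 - NormalDist().cdf(z)               # one-sided test of Ha: mu > mu0
    return z, p

# Hypothetical data: ybar = 52.5 from n = 64, testing mu0 = 50 with sigma = 10.
z, p = one_sample_z(ybar=52.5, mu0=50.0, sigma=10.0, n=64, two_sided=True)
print(round(z, 2), round(p, 4))   # 2.0 0.0455
```

With a .05 significance level this result would be called statistically significant; with .01 it would not.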

Chapter 20 Tests for a Mean
• In the last chapter we learned how to perform one-sample z tests to test hypotheses about the population mean. However, this test assumes that we know the population standard deviation σ. In this chapter we will learn how to test hypotheses about the value of a mean when we do not know σ. We will also learn how to test for a difference between measurements taken on matched observations.
• As previously discussed, when we don't have access to the population standard deviation we can estimate it with the standard deviation of our sample, s. Just as when calculating confidence intervals, estimating the standard deviation will add some additional error into our measurements. To take this into account we will use a t distribution instead of a normal distribution.
• To examine hypotheses about the mean when we don't know the population standard deviation we use a one-sample t test. To perform this test you
1. Calculate the sample mean ȳ and standard deviation s.
2. Calculate the one-sample t statistic

    t = (ȳ − µ0) / (s/√n),    (20.1)

where µ0 is the mean assumed in the null hypothesis and n is the sample size.
3. Look up the p-value associated with t in a t distribution table using n − 1 degrees of freedom. If you are performing a two-sided test you need to multiply this value by 2.
• We will often find ourselves calculating t statistics. Whenever we calculate a t statistic it will have the form

    t = (estimate − value assumed in H0) / SE_estimate,    (20.2)

where "SE" stands for "standard error." It has basically the same interpretation as a z statistic: in this case it is the number of standard deviations between the sample mean ȳ and the value we are testing under the null hypothesis µ0.
• You can also use the one-sample t test to test for differences between two groups of observations if you have a matched pairs design. In this design, for every observation in one group you have a matching observation in the other. They could either be different observations from the same participant, or they could be from different participants that are similar on important characteristics. To test whether the two groups are the same you first create a new variable that is the difference between the matched participants. You will have a total of n/2 of these. You then perform a means test on this new variable just as explained above. When calculating your t statistic and degrees of freedom you substitute n/2 for the sample size.
• A statistical inference is called robust if the confidence level or p-value does not change very much when the assumptions of the procedure are violated.
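The matched-pairs construction can be sketched with invented before/after scores: build the difference variable, then compute the one-sample t statistic from formula (20.1) with µ0 = 0. The p-value would then be looked up with (number of pairs − 1) degrees of freedom:

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical before/after scores for 8 matched pairs.
before = [12, 15, 11, 14, 13, 16, 10, 12]
after  = [14, 17, 12, 13, 15, 18, 12, 13]

diffs = [a - b for a, b in zip(after, before)]        # the new difference variable
pairs = len(diffs)
t = (mean(diffs) - 0) / (stdev(diffs) / sqrt(pairs))  # one-sample t, Eq. (20.1)

print(pairs - 1, round(t, 2))   # df = 7, t = 3.67; look up p in a t table
```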

• The one-sample t test is fairly robust to violations of normality, but it is not very robust to violations of the assumption that the observed data are a random sample of the population.
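The arithmetic of equation 20.1 is simple enough to sketch in a few lines of code. This is an illustrative sketch, not part of the original notes: the function name and the sample scores are made up for the example.

```python
import math
from statistics import mean, stdev

def one_sample_t(sample, mu0):
    """One-sample t statistic (20.1): t = (ybar - mu0) / (s / sqrt(n))."""
    n = len(sample)
    s = stdev(sample)                      # sample SD (n - 1 in the denominator)
    t = (mean(sample) - mu0) / (s / math.sqrt(n))
    return t, n - 1                        # the statistic and its degrees of freedom

# Hypothetical scores, testing H0: mu = 100
scores = [104, 98, 110, 102, 97, 105, 101, 99]
t, df = one_sample_t(scores, 100)
```

You would then compare t against a t table with df degrees of freedom (here t is about 1.32 with 7 df), doubling the tabled p-value for a two-sided test.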

Part VI

Topics in Inference

Chapter 21

Comparing Two Means

• At the end of the last chapter we learned how to test for a difference between two groups on a variable when we have a matched pairs design. In this chapter we will learn how to test for a difference between groups when the two groups have independent observations.

• To compare the means of two populations on a single variable you can use a two-sample t test. This is one of the most-often used procedures in statistics. To use this test

1. You must take independent samples from both populations.

2. The sample means from both populations must be normally distributed. This will automatically be satisfied if you take large samples.

3. The size of your two samples can be different.

• The steps to perform a two-sample t test are:

1. Your experimental hypotheses should be of the form

   H0: µ1 − µ2 = 0
   Ha: µ1 − µ2 > 0.   (21.1)

   The alternative hypothesis can be either one-sided (as in the example) or two-sided (such as when Ha is µ1 − µ2 ≠ 0).

2. Calculate the sample means ȳ1 and ȳ2, and the sample standard deviations s1 and s2. The statistics with the same subscript are calculated from the same sample, although it doesn't matter which subscript you give to each sample.

3. Calculate the standard error of the difference:

   SE(ȳ1 − ȳ2) = √(s1²/n1 + s2²/n2),   (21.2)

   where n1 and n2 are your sample sizes.

4. Calculate the two-sample t statistic

   t = (ȳ1 − ȳ2) / SE(ȳ1 − ȳ2).   (21.3)

• Unfortunately this statistic does not have a t distribution. The t distribution corrects for the error introduced by estimating one standard deviation, but to calculate this t we have to estimate two. We therefore typically rely on computers to calculate the p-value for this statistic.

• If you do need to calculate a p-value for a two-sample t statistic by hand you can look it up in a t distribution table with k degrees of freedom, where k is equal to the smaller of n1 − 1 and n2 − 1. The actual p-value is guaranteed to be no larger than that produced by this method, so this is a conservative approximation. Remember to multiply your obtained p-value by 2 if you are performing a two-sided test.

• The boundaries for a level C confidence interval for µ1 − µ2 are

   (ȳ1 − ȳ2) ± t* SE(ȳ1 − ȳ2),   (21.4)

   where t* is the critical value from a t distribution with k degrees of freedom such that the area between −t* and t* is equal to C. k can either be obtained from the method above or (more accurately) from a statistics program.

• There is a special version of the two-sample t test that assumes that the two populations have the same standard deviation. The procedure is called the pooled t test. The advantage of this is that the resulting test statistic only estimates a single standard deviation, so p-values can be directly calculated from the t distribution.

• The steps to perform a pooled t test are:

1. Calculate the sample means ȳ1 and ȳ2, and the sample standard deviations s1 and s2.

2. Calculate the pooled standard deviation

   sp = √[((n1 − 1)s1² + (n2 − 1)s2²) / (n1 + n2 − 2)].   (21.5)

   This is an estimate of the common standard deviation using a weighted average of the sample standard deviations.

3. Calculate the pooled t statistic

   t = (ȳ1 − ȳ2) / (sp √(1/n1 + 1/n2)).   (21.6)

4. Look up the p-value associated with t in a t distribution table using n1 + n2 − 2 degrees of freedom. If you are performing a two-sided test you need to multiply this value by 2.
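The two-sample t statistic with its conservative degrees of freedom can be sketched as follows. The data and the function name are hypothetical, used only to show the arithmetic of equations 21.2 and 21.3.

```python
import math
from statistics import mean, stdev

def two_sample_t(y1, y2):
    """Two-sample t (21.2-21.3) with conservative df k = min(n1, n2) - 1."""
    n1, n2 = len(y1), len(y2)
    se = math.sqrt(stdev(y1) ** 2 / n1 + stdev(y2) ** 2 / n2)  # eq. 21.2
    t = (mean(y1) - mean(y2)) / se                             # eq. 21.3
    return t, min(n1, n2) - 1

# Hypothetical independent samples
group1 = [23, 25, 28, 22, 26]
group2 = [19, 21, 24, 20, 18, 22]
t, k = two_sample_t(group1, group2)
```

A statistics package would instead compute the more accurate (generally fractional) degrees of freedom for this statistic rather than the conservative k used here.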

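The pooled version can be sketched the same way. Again the samples and the helper name are made up for illustration; the sketch assumes the two populations share a common standard deviation, as equations 21.5 and 21.6 require.

```python
import math
from statistics import mean, stdev

def pooled_t(y1, y2):
    """Pooled t statistic (21.5-21.6), assuming equal population SDs."""
    n1, n2 = len(y1), len(y2)
    sp = math.sqrt(((n1 - 1) * stdev(y1) ** 2 + (n2 - 1) * stdev(y2) ** 2)
                   / (n1 + n2 - 2))                            # eq. 21.5
    t = (mean(y1) - mean(y2)) / (sp * math.sqrt(1 / n1 + 1 / n2))
    return t, n1 + n2 - 2                                      # df = n1 + n2 - 2

# Hypothetical independent samples
group1 = [12, 15, 14, 16, 13]
group2 = [10, 11, 13, 9, 12, 11]
t, df = pooled_t(group1, group2)
```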
Chapter 22

Inference for Proportions

• Up until now we've only thought about problems where the response variable was continuous. Sometimes, however, we might want to consider a variable that only has the values "true" or "false," such as whether a woman becomes pregnant or not. In this case we can test hypotheses about the proportion of a specific outcome in the population.

• We can estimate the proportion of people who have a specific outcome in the population with the proportion of people who have that outcome in a random sample taken from the population. The population proportion is the parameter p, while the estimate taken from the sample is the statistic p̂.

• To test whether a population proportion is equal to a specific value you can perform a one-sample z test for a proportion. Your experimental hypotheses should be of the form

   H0: p = p0
   Ha: p < p0.   (22.1)

   The alternative hypothesis can be either one-sided (as in the example) or two-sided (such as Ha: p ≠ p0).

• Assuming that the population is much larger than our sample, the estimate p̂ has an approximately normal distribution with mean p and standard deviation

   σp̂ = √(p(1 − p)/n),   (22.2)

   where n is the sample size.

• The steps to perform a one-sample z test for a proportion are:

1. Calculate your estimate of the proportion p̂ from your sample.

2. Calculate the one-sample z statistic for a proportion

   z = (p̂ − p0) / √(p0(1 − p0)/n),   (22.3)

   where p0 is the proportion assumed in your null hypothesis and n is your sample size.

3. Look up the p-value associated with z in a normal distribution table. If you are performing a two-sided test you need to multiply this value by 2.

• The boundaries for a level C confidence interval for p are

   p̂ ± z* √(p̂(1 − p̂)/n),   (22.4)

   where z* is the critical value from a normal distribution that will include an area C between −z* and z*.
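A quick sketch of the one-sample z test for a proportion (equation 22.3). The counts and the function name here are hypothetical, chosen only to illustrate the computation.

```python
import math

def one_sample_z_prop(successes, n, p0):
    """One-sample z for a proportion (22.3): (phat - p0) / sqrt(p0(1-p0)/n)."""
    p_hat = successes / n
    return (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)

# Hypothetical sample: 36 successes out of 50, testing H0: p = 0.5
z = one_sample_z_prop(36, 50, 0.5)
```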

• We can also test whether the proportions of two different populations are equal. For this we would use a two-sample z test for proportions.

• The steps for this test are:

1. Your experimental hypotheses should be of the form

   H0: p1 − p2 = 0
   Ha: p1 − p2 > 0.   (22.5)

   The null hypothesis could also be written as H0: p1 = p2. The alternative hypothesis can be either one-sided (as in the example) or two-sided (such as Ha: p1 − p2 ≠ 0).

2. Calculate your two proportion estimates p̂1 and p̂2 from your samples.

3. Calculate the pooled sample proportion. This is the overall proportion combining both of your samples:

   p̂ = (count of successes in both samples combined) / (total number of observations in both samples combined).   (22.6)

4. Calculate the two-sample z statistic for proportions

   z = (p̂1 − p̂2) / √(p̂(1 − p̂)(1/n1 + 1/n2)),   (22.7)

   where p̂ is the pooled sample proportion from above, and n1 and n2 are your sample sizes.

5. Look up the p-value associated with z in a normal distribution table. If you are performing a two-sided test you need to multiply this value by 2.

• To calculate a confidence interval around p1 − p2 you

1. Determine your confidence level C.

2. Calculate the standard error of the difference in proportions. This is

   SE(p̂1 − p̂2) = √(p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2).   (22.8)

3. The boundaries for a level C confidence interval for p1 − p2 are

   (p̂1 − p̂2) ± z* SE(p̂1 − p̂2),   (22.9)

   where z* is the critical value from a normal distribution that will include an area C between −z* and z*.
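The two-sample z test for proportions can be sketched the same way, using the pooled estimate of equation 22.6 inside the standard error. The counts below are invented for the example.

```python
import math

def two_sample_z_prop(success1, n1, success2, n2):
    """Two-sample z for proportions (22.6-22.7) using the pooled estimate."""
    p1, p2 = success1 / n1, success2 / n2
    p_pool = (success1 + success2) / (n1 + n2)               # eq. 22.6
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se                                    # eq. 22.7

# Hypothetical counts: 45/100 successes in one group vs 30/100 in the other
z = two_sample_z_prop(45, 100, 30, 100)
```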

Chapter 23

Two-Way Tables

• In the last chapter we learned how to perform a two-sample z test for proportions, which allows us to determine if the proportions of an outcome are the same in two different populations. In this chapter we extend this idea and learn how to test whether a proportion is the same across three or more groups.

• If you have a large number of categories you might think about just using the two-sample z test several times. For example, if you have three categories you could calculate three different z's: one comparing the first and second group, one comparing the first and third group, and one comparing the second and third group. If you discover any significant differences then you could claim that the proportions are not equal. Unfortunately, doing this will give you invalid results. The problem is that the more groups you have, the bigger the probability that any two of them would happen to be different just due to chance. We not only need a statistic that can test for differences between groups, we also need one that won't be more likely to find differences when we have more groups.

• A good way to look at your data when both your predictor and response variables are categorical is with a two-way table. A two-way table is organized so that you have a different predictor group in each column and a different outcome in each row. The inside of the table contains counts of how many people in each group had each outcome.

• We can use the information in a two-way table to calculate a chi-square statistic that will test whether the proportion of outcomes is the same for each of our groups. In this case our hypotheses are

   H0: p1 = p2 = p3
   Ha: at least one pair is not equal.   (23.1)

• The steps to perform a chi-square test are:

1. Arrange your data in a two-way table.

2. Find the counts that would be expected in each cell of the table if H0 were true. These may be computed using the formula

   expected count = (row total) × (column total) / n.   (23.2)

3. Calculate the chi-square statistic

   χ² = Σ (observed count − expected count)² / expected count.   (23.3)

4. Look up the p-value associated with χ² in a χ² distribution table using (r − 1)(c − 1) degrees of freedom, where r is the number of rows and c is the number of columns in your two-way table.

• You should not use the χ² test if more than 20% of your expected cell counts are less than 5. As long as the expected counts are high it is ok if the observed cell counts are less than 5.
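The expected counts and the chi-square statistic of equations 23.2 and 23.3 can be computed directly from the table of observed counts. This sketch and its 2 × 3 table of counts are made up for illustration.

```python
def chi_square(table):
    """Chi-square statistic and df for a two-way table of counts (23.2-23.3)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n    # eq. 23.2
            chi2 += (observed - expected) ** 2 / expected   # eq. 23.3
    df = (len(table) - 1) * (len(table[0]) - 1)             # (r - 1)(c - 1)
    return chi2, df

# Hypothetical table: two outcomes (rows) by three groups (columns)
chi2, df = chi_square([[20, 30, 25],
                       [30, 20, 25]])
```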

Chapter 24

Inference for Regression

• In earlier chapters we learned how we can use correlation and least-squares regression to describe the strength of a linear relationship between two quantitative variables. In this chapter we will learn how to test hypotheses about linear relationships, and we will learn how to use a regression line to predict new values.

• Remember that a least-squares regression line has the form ŷ = a + bx, and that the values of a and b are given by the following equations:

   b = [Σ(xi yi) − (1/n)(Σxi)(Σyi)] / [Σxi² − (1/n)(Σxi)²]   (24.1)

   a = ȳ − b x̄.   (24.2)

   Notice that a and b are statistics since they are computed strictly from sample data.

• Imagine that you can measure x and y for every person in the population. You could then calculate a population regression line, which would have the form

   yi = α + β xi.   (24.3)

• In order for us to draw inferences from our sample regression line (such as testing hypotheses or building confidence intervals) we have to make several assumptions. We assume

1. All the observations are independent of each other except for the influence of x.

2. If we examined a large number of responses at a single value of x, the values we observe in y would have a normal distribution.

3. The standard deviation of y (which we will call σ) around the regression line is the same for all values of x.

If we violate these assumptions then our inferences might not be correct.

• You should always check a scatterplot of your data before trying to make inferences about the regression line. You should make sure

1. That the relationship between the two variables is approximately linear. Your inferences will be invalid if the relationship between the variables is seriously curved.

2. That there are no outliers in your data. As we've seen before, outliers can have a strong influence on the least-squares regression line.

• If we took a number of samples and calculated the least-squares regression line for each, you would find that we would get different values for a and b each time. In fact, you would find that these would have a normal distribution. Statisticians have proven that the mean of a's distribution is α and the mean of b's distribution is β. This means that we can use a to estimate α and b to estimate β.

• To test hypotheses about the parameters α and β we need estimates of these distributions. The first step to this is to calculate an estimate of the standard deviation of y around the regression line:

   s = √[(1/(n − 2)) Σ(yi − ŷi)²].   (24.4)

   This is something like an average deviation from the regression line. Notice that this is not the same as the standard deviation of y. For this standard deviation we don't consider the variability in y that is explained by the regression line. We divide by n − 2 instead of n because we had to estimate two parameters (a and b, which are used to calculate ŷi) to make this calculation.

• Once we have this we can estimate the standard deviations of these distributions with the following equations:

   SEa = s √(1/n + x̄² / Σ(xi − x̄)²)   (24.5)

   SEb = s / √(Σ(xi − x̄)²).   (24.6)

• Now that we have the estimates a and b, and the standard errors of these estimates SEa and SEb, we can test theories about the parameters α and β.

• Most often we want to test theories about the relationship between x and y. We can do this by proposing hypotheses about the parameter β, which is the slope of the regression line. Our hypotheses will be of the form

   H0: β = b0
   Ha: β ≠ b0.   (24.7)

   We can also examine one-sided hypotheses, such as Ha: β > b0.

• We are most often interested in testing whether or not there is a significant relationship between x and y. In this case b0 = 0.

• To test these hypotheses we calculate a one-sample t statistic

   t = (b − b0) / SEb   (24.8)

   and determine the p-value from a t distribution with n − 2 degrees of freedom. We must remember to multiply the p-value by 2 if we perform a two-sided test.

• The boundaries for a level C confidence interval for β are

   b ± t*(SEb),   (24.9)

   where t* is the critical value from a t distribution with n − 2 degrees of freedom such that the area between −t* and t* is equal to C.

• We can test hypotheses and build confidence intervals around α in the same way, but this is not typically done. All that α is is the mean value of y when x = 0, which usually is not very important.

• We've just learned how to use a regression model to test the strength of the linear relationship between two variables. Sometimes, however, we are more interested in using the regression line to predict new observations. We talked about this back in Lesson 9.
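The whole chain from the least-squares fit to the slope test can be sketched in one short function. The data and the function name are invented for the example; the t statistic below tests H0: β = 0 (equations 24.1 through 24.8).

```python
import math

def regression_inference(x, y):
    """Least-squares fit (24.1-24.2) plus the slope test of 24.4-24.8 (b0 = 0)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    a = ybar - b * xbar
    resid_ss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(resid_ss / (n - 2))   # SD of y around the line, eq. 24.4
    se_b = s / math.sqrt(sxx)           # eq. 24.6
    return a, b, b / se_b               # t has n - 2 degrees of freedom

# Hypothetical data with a strong linear trend
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.0, 8.1, 9.9]
a, b, t = regression_inference(x, y)
```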

What we will learn in this section is how to build a confidence interval around our predicted value.

• As you might suspect, if we want to predict the value of y at a specific value of x (which we will refer to as x*), we just put that value of x into the regression equation.

• There are actually two different confidence intervals we might want. First, we might want to build a confidence interval around the mean of y at x*. For example, we might want to make an interval such that we are 95% confident that it contains the mean value of y at x*. The variability in our interval is caused by our imprecision in estimating the population mean. The limits for a level C confidence interval around the mean are

   ŷ ± t*(SEµ̂),   (24.10)

   where t* is the critical value from a t distribution with n − 2 degrees of freedom such that the area between −t* and t* is equal to C, and SEµ̂ is the standard error for predicting the mean response, calculated from the formula

   SEµ̂ = s √(1/n + (x* − x̄)² / Σ(xi − x̄)²).   (24.11)

• We also might want to build a prediction interval for a single observation of y at x*. For example, we might want to make an interval that will contain 95% of the observations of y at x*. The limits for a level C prediction interval for a single observation are

   ŷ ± t*(SEŷ),   (24.12)

   where t* is the critical value from a t distribution with n − 2 degrees of freedom such that the area between −t* and t* is equal to C, and SEŷ is the standard error for predicting a single response, calculated from the formula

   SEŷ = s √(1 + 1/n + (x* − x̄)² / Σ(xi − x̄)²).   (24.13)

• We can examine the distribution of the residuals from fitting a regression line to determine how well the problem satisfies the assumptions required for valid regression inference.
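The two standard errors differ only by the extra "1 +" term under the square root, which is what makes a prediction interval for a single observation wider than a confidence interval for the mean. A minimal sketch, with made-up x values and an assumed s, to show equations 24.11 and 24.13 side by side:

```python
import math

def prediction_se(x, s, x_star, single):
    """SE for predicting y at x*: eq. 24.13 if single, eq. 24.11 otherwise."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    extra = 1.0 if single else 0.0   # the extra term for a single observation
    return s * math.sqrt(extra + 1 / n + (x_star - xbar) ** 2 / sxx)

x = [1, 2, 3, 4, 5]
se_mean = prediction_se(x, s=0.5, x_star=3, single=False)   # SE for the mean response
se_single = prediction_se(x, s=0.5, x_star=3, single=True)  # SE for one new observation
```

Both intervals are narrowest when x* equals x̄ and widen as x* moves away from the center of the data.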

Chapter 25

One-Way Analysis of Variance

• In Lesson 21 we learned how to compare the means of two populations. In this chapter we will learn how to perform an Analysis of Variance (usually just referred to as ANOVA), which will allow us to simultaneously compare the means of three or more groups.

• We already know how to use a two-sample t test to compare the means of two groups. To compare more than two groups we might just calculate several t statistics, one comparing each pair of our groups. However, this procedure would be invalid. Think back to when we talked about using several z tests for proportions to compare the likelihood of an outcome in three or more groups. We would have the same problem: the more groups you have, the bigger the probability that any two of them would happen to be different just due to chance.

• Just like we use the chi-square statistic to simultaneously compare the proportions of an outcome in three or more populations, we can use Analysis of Variance (ANOVA) to simultaneously compare the means of three or more populations.

• An ANOVA will test the hypotheses

   H0: µ1 = µ2 = . . . = µI
   Ha: at least two means are different,   (25.1)

   where I is the number of different groups we want to compare.

• In order to draw inferences from an ANOVA we have to make several assumptions. These assumptions are:

1. We have independent random samples from each population. The number we take from each population can be different.

2. Each individual population has a normal distribution. The distributions can (and we expect them to) have different means.

3. All of the individual distributions have the same standard deviation. This is similar to the assumption we make when performing linear regression.

Violating these assumptions makes our inferences less valid.

• What the ANOVA does is compare the between-group variance (which is how different your groups are from each other) to the within-group variance (which is how different members of the same group are from each other). If the between-group variance is much larger than the within-group variance then you conclude that the means are significantly different.

• To measure the within- and between-group variability we calculate sums of squares. The between-groups sum of squares is calculated as

   SSG = Σgroups ni (x̄i − x̄)².   (25.2)

The within-groups sum of squares is calculated as

   SSE = Σgroups (ni − 1) si²,   (25.3)

where ni is the number of observations in group i, x̄i is the mean of group i, and si is the standard deviation of group i. The sum of SSG and SSE gives us the total sum of squares SST. This is a measure of the total variability found in x.

• Each one of our sums of squares has degrees of freedom associated with it. This is related to the number of different sources that contribute to the sum. We typically divide the sums of squares by the degrees of freedom to get a measure of the average variation, called a mean square. Working with these mean squares lets us compare many groups simultaneously without inflating the probability of finding significant results.

• The relationships between these quantities are summarized in the ANOVA table below:

   Source of variation    SS     df       MS
   Between Group          SSG    I − 1    SSG/(I − 1)
   Within Group           SSE    N − I    SSE/(N − I)
   Total                  SST    N − 1    SST/(N − 1)

   In the table, I is the number of groups and N is the total number of subjects. Notice that the total df is equal to the sum of the between-group and within-group df. The between- and within-group mean squares are called MSG and MSE.

• We can perform an F test to determine if we have a significant amount of between-group variance. This is the same thing as testing whether any of our means are significantly different. To perform an F test you

1. Collect the information needed to build an ANOVA table.

2. Calculate the F statistic

   F = MSG / MSE.   (25.4)

3. Look up the corresponding p-value in an F distribution table with I − 1 numerator degrees of freedom and N − I denominator degrees of freedom.
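The sums of squares, mean squares, and F statistic of equations 25.2 through 25.4 can be sketched directly. The three groups of scores and the function name below are hypothetical.

```python
from statistics import mean, stdev

def one_way_anova(groups):
    """F statistic and its df from the sums of squares in 25.2-25.4."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    I, N = len(groups), len(all_values)
    ssg = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)  # eq. 25.2
    sse = sum((len(g) - 1) * stdev(g) ** 2 for g in groups)          # eq. 25.3
    msg, mse = ssg / (I - 1), sse / (N - I)                          # mean squares
    return msg / mse, I - 1, N - I      # F with numerator and denominator df

# Hypothetical scores from three independent groups
f, df1, df2 = one_way_anova([[4, 5, 6],
                             [7, 8, 9],
                             [10, 11, 12]])
```

The large F here reflects group means that differ far more than the members within each group do.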
