
Chapter 10: Comparing Two Groups

Bivariate Analysis: Methods for comparing two groups are special cases of bivariate
statistical methods – Two variables exist:
Response variable – outcome variable on which comparisons are made
Explanatory variable – binary variable that specifies the groups
Statistical methods analyze how the outcome on the response variable depends on or is
explained by the value of the explanatory variable
Independent Samples: Most comparisons of groups use independent samples from the
groups: the observations in one sample are independent of those in the other sample
Example: Randomized experiments that randomly allocate subjects to two treatments
Example: An observational study that separates subjects into groups according to their
value for an explanatory variable
Dependent samples: Dependent samples result when the data are matched pairs – each
subject in one sample is matched with a subject in the other sample

Example: A set of married couples, the men being in one sample and the women in the other
Example: Each subject is observed at two times, so the two samples have the same subjects
Categorical response variable:
For a categorical response variable
- Inferences compare groups in terms of their population proportions in a particular category
- We can compare the groups by the difference in their population proportions: $(p_1 - p_2)$
Example: Experiment: Subjects were 22,071 male physicians
Every other day for five years, study participants took either an aspirin or a placebo
The physicians were randomly assigned to the aspirin or to the placebo group
The study was double-blind: the physicians did not know which pill they were taking, nor did
those who evaluated the results
What is the response variable?
The response variable is whether the subject had a heart attack, with categories ‘yes’ or ‘no’.
What are the groups to compare?
The groups to compare are:
Group 1: Physicians who took a placebo
Group 2: Physicians who took aspirin

Estimate the difference between the two population parameters of interest
$p_1$: the proportion of the population who would have a heart attack if they participated in
this experiment and took the placebo
$p_2$: the proportion of the population who would have a heart attack if they participated in
this experiment and took the aspirin
Sample statistics:


To make an inference about the difference of population proportions, $(p_1 - p_2)$, we need to
learn about the variability of the sampling distribution of $(\hat{p}_1 - \hat{p}_2)$
Standard error for comparing two proportions:
The difference, $(\hat{p}_1 - \hat{p}_2)$, is obtained from sample data
It will vary from sample to sample
This variation is measured by the standard error of the sampling distribution of $(\hat{p}_1 - \hat{p}_2)$

Confidence Interval for the Difference
Between Two Population Proportions
The z-score depends on the confidence level
This method requires:
Categorical response variable for two groups
Independent random samples for the two groups
Large enough sample sizes so that there are at least 10 “successes” and at least 10 “failures”
in each group
Confidence Interval Comparing Heart Attack Rates for Aspirin and Placebo
95% CI:

Since both endpoints of the confidence interval (0.005, 0.011) for $(p_1 - p_2)$ are positive,
we infer that $(p_1 - p_2)$ is positive
Conclusion: The population proportion of heart
attacks is larger when subjects take the placebo than
when they take aspirin
The estimated population difference, between 0.005 and 0.011, is small
Even though it is a small difference, it may be important in public health terms
For example, a decrease of 0.01 over a 5 year period in the proportion of people suffering
heart attacks would mean 2 million fewer people having heart attacks
The study used male doctors in the U.S., so the inference applies to the U.S. population of
male doctors. Before concluding that aspirin benefits a larger population, we’d want to see
results of studies with more diverse groups
Sample statistics for the aspirin study:
$\hat{p}_1 = 189/11034 = 0.017$
$\hat{p}_2 = 104/11037 = 0.009$
$(\hat{p}_1 - \hat{p}_2) = 0.017 - 0.009 = 0.008$
Standard error:
$se = \sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}$
95% CI:
$(0.017 - 0.009) \pm 1.96\sqrt{\frac{0.017(1 - 0.017)}{11034} + \frac{0.009(1 - 0.009)}{11037}} = 0.008 \pm 0.003$, or $(0.005, 0.011)$
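The interval for the aspirin study can be reproduced with a few lines of Python. This is a minimal sketch, not part of the original notes: the function name diff_proportion_ci is hypothetical, and the counts are the ones quoted for the placebo group (189 of 11034) and the aspirin group (104 of 11037).

```python
import math

def diff_proportion_ci(x1, n1, x2, n2, z=1.96):
    """Large-sample CI for p1 - p2 using the unpooled standard error."""
    p1, p2 = x1 / n1, x2 / n2
    # se of the sampling distribution of (p1_hat - p2_hat)
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff, se, (diff - z * se, diff + z * se)

# Group 1 = placebo, group 2 = aspirin
diff, se, (lower, upper) = diff_proportion_ci(189, 11034, 104, 11037)
print(f"difference = {diff:.3f}, se = {se:.4f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

The default z = 1.96 corresponds to 95% confidence; another confidence level would need a different z-score.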
Interpreting a confidence interval for a difference of proportions
Check whether 0 falls in the CI
If so, it is plausible that the population proportions are equal
If all values in the CI for $(p_1 - p_2)$ are positive, you can infer that $(p_1 - p_2) > 0$
If all values in the CI for $(p_1 - p_2)$ are negative, you can infer that $(p_1 - p_2) < 0$
Which group is labeled ‘1’ and which is labeled ‘2’ is arbitrary
The magnitude of values in the confidence interval tells you how large any true difference is.
If all values in the confidence interval are near 0, the true difference may be relatively small
in practical terms
Significance tests comparing population proportions:
1. Assumptions:
- Categorical response variable for two groups
- Independent random samples
- Significance tests comparing proportions use the sample size guideline from confidence
intervals: Each sample should have at least about 10 “successes” and 10 “failures”
- Two-sided tests are robust against violations of this condition: at least 5 “successes”
and 5 “failures” is adequate
2. Hypotheses:
The null hypothesis is the hypothesis of no difference or no effect: $H_0: p_1 = p_2$
The alternative hypothesis is the hypothesis of interest to the investigator
$H_a: p_1 \neq p_2$ (two-sided test)
$H_a: p_1 < p_2$ (one-sided test)
$H_a: p_1 > p_2$ (one-sided test)
Pooled Estimate
Under the presumption that $p_1 = p_2$, we estimate the common value of $p_1$ and $p_2$ by the
proportion of the total sample in the category of interest
This pooled estimate is calculated by combining the number of successes in the two groups
and dividing by the combined sample size $(n_1 + n_2)$
3. The test statistic is: $z = \frac{(\hat{p}_1 - \hat{p}_2) - 0}{\sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$, where $\hat{p}$ is the pooled estimate
4. P-value: Probability obtained from the standard normal table of values even more
extreme than observed z test statistic
5. Conclusion: Smaller P-values give stronger evidence against $H_0$ and supporting $H_a$




Example: TV violence and aggressive behavior, 707 families, observations over 17 years
Define Group 1 as those who watched less than 1 hour of TV per day, on the average, as
teenagers
Define Group 2 as those who averaged at least 1 hour of TV per day, as teenagers
$p_1$ = population proportion committing aggressive acts for the lower level of TV watching
$p_2$ = population proportion committing aggressive acts for the higher level of TV watching

Test the Hypotheses: $H_0: (p_1 - p_2) = 0$, $H_a: (p_1 - p_2) \neq 0$, using a significance level of 0.05
Test statistic:
$\hat{p} = \frac{5 + 154}{88 + 619} = 0.225$
$se_0 = \sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} = \sqrt{0.225(0.775)\left(\frac{1}{88} + \frac{1}{619}\right)} = 0.0476$
$z = \frac{\hat{p}_1 - \hat{p}_2}{se_0} = \frac{0.057 - 0.249}{0.0476} = \frac{-0.192}{0.0476} = -4.04$

Conclusion: Since the P-value is less than 0.05, we reject $H_0$

We conclude that the population proportions of aggressive acts differ for the two groups
The sample values suggest that the population proportion is higher for the higher level of TV
watching
Example: Two Proportions Summer Jobs Example
Is there evidence that the proportion of male
students who had summer jobs differs from the
proportion of female students who had summer
jobs?
Null: The proportion of male students who had summer jobs is the same as the proportion of
female students who had summer jobs, [$H_0: p_1 = p_2$]
Alt: The proportion of male students who had summer jobs differs from the proportion of
female students who had summer jobs, [$H_a: p_1 \neq p_2$]
Test statistic: $n_1 = 797$ and $n_2 = 732$ (both large, so the test statistic follows a Normal
distribution)
Pooled sample proportion: $\hat{p} = \frac{718 + 593}{797 + 732} = \frac{1311}{1529}$
Test statistic: $z = \frac{\frac{718}{797} - \frac{593}{732}}{\sqrt{\frac{1311}{1529}\left(1 - \frac{1311}{1529}\right)\left(\frac{1}{797} + \frac{1}{732}\right)}} = 5.07$
Summer Status   Men   Women
Employed        718   593
Not Employed     79   139
Total           797   732
Hypotheses: $H_0: p_1 = p_2$, $H_a: p_1 \neq p_2$

Test Statistic: z = 5.07
P-value: P-value = 2P(Z > 5.07) = 0.000000396 (using a computer)
Conclusion: Since the P-value is quite small, there is very strong evidence that the proportion
of male students who had summer jobs differs from that of female students.
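The five steps of the significance test can be sketched in Python. This is an illustration under stated assumptions rather than the notes' own procedure: the function name two_proportion_z_test is hypothetical, and the two-sided P-value is computed from the standard normal distribution via math.erfc instead of a printed table. It is applied to the summer jobs counts and to the TV watching counts (5 of 88 and 154 of 619).

```python
import math

def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z test of H0: p1 = p2 using the pooled estimate."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)  # pooled estimate of the common proportion
    se0 = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se0
    # 2 * P(Z > |z|) for a standard normal Z equals erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

z_jobs, p_jobs = two_proportion_z_test(718, 797, 593, 732)  # men vs. women
z_tv, p_tv = two_proportion_z_test(5, 88, 154, 619)         # low vs. high TV
print(f"summer jobs: z = {z_jobs:.2f}, P-value = {p_jobs:.2e}")
print(f"TV watching: z = {z_tv:.2f}, P-value = {p_tv:.2e}")
```

A one-sided P-value would be half of this value when the sample difference falls in the direction stated by the alternative hypothesis.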

Comparing Means: We can compare two groups on a quantitative response variable by
comparing their means
Example: A 30-month study evaluated the degree of addiction that teenagers form to
nicotine. 332 students who had used nicotine were evaluated. The response variable was
constructed using a questionnaire called the Hooked on Nicotine Checklist (HONC)
The HONC score is the total number of questions to which a student answered “yes” during
the study. The higher the score, the more hooked on nicotine a student is judged to be
The study considered explanatory variables, such as gender, that might be associated with
the HONC score




How can we compare the sample HONC scores for females and males?
We estimate $(\mu_1 - \mu_2)$ by $(\bar{x}_1 - \bar{x}_2)$: 2.8 – 1.6 = 1.2
On average, females answered “yes” to about one more question on the HONC scale than
males did
To make an inference about the difference between population means, $(\mu_1 - \mu_2)$, we need to
learn about the variability of the sampling distribution of $(\bar{x}_1 - \bar{x}_2)$
Standard error for comparing two means:
The difference, $(\bar{x}_1 - \bar{x}_2)$, is obtained from sample data. It will vary from sample to sample.
This variation is measured by the standard error of the sampling distribution of $(\bar{x}_1 - \bar{x}_2)$:
$se = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$
Confidence interval for the difference between two population means:
A confidence interval for $\mu_1 - \mu_2$ is: $(\bar{x}_1 - \bar{x}_2) \pm t_{.025}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$
$t_{.025}$ is the critical value for a 95% confidence level from the t distribution
The degrees of freedom are calculated using software. If you are not using software, you
can take df to be the smaller of $(n_1 - 1)$ and $(n_2 - 1)$ as a “safe” estimate

This method assumes:
Independent random samples from the two groups
An approximately normal population distribution for each group
This is mainly important for small sample sizes, and even then the method is robust to
violations of this assumption
Example: Data as summarized by HONC scores for the two groups:
Smokers: $\bar{x}_1 = 5.9$, $s_1 = 3.3$, $n_1 = 75$
Ex-smokers: $\bar{x}_2 = 1.0$, $s_2 = 2.3$, $n_2 = 257$
Were the sample data for the two groups approximately normal?
Most likely not for Group 2 (based on the sample statistics: $\bar{x}_2 = 1.0$, $s_2 = 2.3$)
Since the sample sizes are large, this lack of normality is not a problem
95% CI for $(\mu_1 - \mu_2)$:
$(5.9 - 1.0) \pm 1.985\sqrt{\frac{3.3^2}{75} + \frac{2.3^2}{257}} = 4.9 \pm 0.8$, or $(4.1, 5.7)$

We can infer that the population mean for the smokers is between 4.1 higher and 5.7 higher
than for the ex-smokers
Example: Exercise and pulse rate
A study is performed to compare the mean resting pulse rate of adult subjects who exercise
regularly to the mean resting pulse rate of those who do not exercise regularly.
This is an example of when to use the two-sample t procedures.



Find a 95% confidence interval for the difference in population means (non-exercisers minus
exercisers).


Note: we use the “safe” estimate of 29-1=28 for our degrees of freedom in this calculation
“We are 95% confident that the difference in mean resting pulse rates (non-exercisers minus
exercisers) is between 4.35 and 13.65 beats per minute.”
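Both confidence intervals for a difference of means (the HONC scores and the pulse rates) follow the same recipe, which can be sketched as below. The function names mean_diff_ci and conservative_df are hypothetical. Because the Python standard library has no t quantile function, the critical values 1.985 and 2.048 are taken from the notes rather than computed.

```python
import math

def mean_diff_ci(mean1, s1, n1, mean2, s2, n2, t_crit):
    """CI for mu1 - mu2 with the unpooled standard error."""
    se = math.sqrt(s1**2 / n1 + s2**2 / n2)
    diff = mean1 - mean2
    return diff, se, (diff - t_crit * se, diff + t_crit * se)

def conservative_df(n1, n2):
    """The 'safe' df: the smaller of n1 - 1 and n2 - 1."""
    return min(n1 - 1, n2 - 1)

# HONC scores: smokers (group 1) vs. ex-smokers (group 2)
_, _, honc_ci = mean_diff_ci(5.9, 3.3, 75, 1.0, 2.3, 257, t_crit=1.985)

# Resting pulse: non-exercisers (group 1) minus exercisers (group 2)
pulse_df = conservative_df(31, 29)
_, _, pulse_ci = mean_diff_ci(75, 9.0, 31, 66, 8.6, 29, t_crit=2.048)

print(f"HONC 95% CI: ({honc_ci[0]:.1f}, {honc_ci[1]:.1f})")
print(f"Pulse 95% CI (df = {pulse_df}): ({pulse_ci[0]:.2f}, {pulse_ci[1]:.2f})")
```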
How can we interpret a confidence interval for a difference of means?
Check whether 0 falls in the interval
When it does, 0 is a plausible value for $(\mu_1 - \mu_2)$, meaning that it is possible that $\mu_1 = \mu_2$
A confidence interval for $(\mu_1 - \mu_2)$ that contains only positive numbers suggests that
$(\mu_1 - \mu_2)$ is positive. We then infer that $\mu_1$ is larger than $\mu_2$

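Data and computation for the exercise and pulse rate example:
                 n    mean   std. dev.
Exercisers       29   66     8.6
Non-exercisers   31   75     9.0
$(\bar{x}_1 - \bar{x}_2) \pm t_{.025}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} = (75 - 66) \pm 2.048\sqrt{\frac{(9.0)^2}{31} + \frac{(8.6)^2}{29}} = 9 \pm 4.65$, or 4.35 to 13.65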
A confidence interval for $(\mu_1 - \mu_2)$ that contains only negative numbers suggests that
$(\mu_1 - \mu_2)$ is negative
We then infer that $\mu_1$ is smaller than $\mu_2$

Which group is labeled ‘1’ and which is labeled ‘2’ is arbitrary
Significance tests comparing population means:
1. Assumptions:
Quantitative response variable for two groups
Independent random samples
Approximately normal population distributions for each group
This is mainly important for small sample sizes, and even then the two-sided t test is robust
to violations of this assumption
2. Hypotheses:
The null hypothesis is the hypothesis of no difference or no effect: $H_0: (\mu_1 - \mu_2) = 0$
The alternative hypothesis:
$H_a: (\mu_1 - \mu_2) \neq 0$ (two-sided test)
$H_a: (\mu_1 - \mu_2) < 0$ (one-sided test)
$H_a: (\mu_1 - \mu_2) > 0$ (one-sided test)
3. The test statistic is a t statistic (note the change from “z” to “t” in the formula)
4. P-value: Probability, obtained from the t distribution, of values even more extreme than
the observed t test statistic
5. Conclusion: Smaller P-values give stronger evidence against $H_0$ and supporting $H_a$


Example: Does cell phone use while driving impair reaction times? Experiment:
64 college students, 32 randomly assigned to the cell phone group, 32 to control group
Students used a machine that simulated driving situations
At irregular periods a target flashed red or green
Participants were instructed to press a “brake button” as soon as possible when they
detected a red light
For each subject, the experiment analyzed their
mean response time over all the trials
Averaged over all trials and subjects, the mean
response time for the cell-phone group was 585.2
milliseconds
The mean response time for the control group was
533.7 milliseconds
Boxplot of data:

Test statistic: $t = \frac{(\bar{x}_1 - \bar{x}_2) - 0}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$
Test the hypotheses:
$H_0: (\mu_1 - \mu_2) = 0$ vs $H_a: (\mu_1 - \mu_2) \neq 0$
using a significance level of 0.05
Conclusion:
The P-value is less than 0.05, so we can reject $H_0$

There is enough evidence to conclude that the population mean response times differ
between the cell phone and control groups
The sample means suggest that the population mean is higher for the cell phone group
What do the box plots tell us?
There is an extreme outlier for the cell phone group
It is a good idea to make sure the results of the analysis aren’t affected too strongly by that
single observation
Delete the extreme outlier and redo the analysis
In this example, the t-statistic changes only slightly
Insight:
In practice, you should not delete outliers from a data set without sufficient cause (i.e., if it
seems the observation was incorrectly recorded)
It is, however, a good idea to check for sensitivity of an analysis to an outlier
If the results change much, it means that the inference including the outlier is on shaky
ground

Alternative method for Comparing Means:
An alternative t method can be used when, under the null hypothesis, it is reasonable to
expect the variability as well as the mean to be the same
This method requires the assumption that the population standard deviations are equal






The Pooled Standard Deviation:
This alternative method estimates the common value σ of $\sigma_1$ and $\sigma_2$ by:
$s = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$
Comparing population means, assuming equal population standard deviations
Using the pooled standard deviation estimate, a 95% CI for $(\mu_1 - \mu_2)$ is:
$(\bar{x}_1 - \bar{x}_2) \pm t_{.025}\, s\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$
This method has $df = n_1 + n_2 - 2$
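As a hedged illustration, the pooled standard deviation and its standard error can be computed as follows. Applying the pooled method to the exercise and pulse rate data is an assumption made here for demonstration (the notes analyze that example with the unpooled method only), and the function names pooled_sd and pooled_se are hypothetical.

```python
import math

def pooled_sd(s1, n1, s2, n2):
    """Pooled estimate of the common population standard deviation."""
    return math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))

def pooled_se(s1, n1, s2, n2):
    """Standard error of (x1_bar - x2_bar) under equal standard deviations."""
    return pooled_sd(s1, n1, s2, n2) * math.sqrt(1 / n1 + 1 / n2)

# Exercisers: s = 8.6, n = 29; non-exercisers: s = 9.0, n = 31
s = pooled_sd(8.6, 29, 9.0, 31)
se = pooled_se(8.6, 29, 9.0, 31)
df = 29 + 31 - 2
print(f"pooled s = {s:.2f}, se = {se:.2f}, df = {df}")
```

The pooled s always lies between the two sample standard deviations, which gives a quick sanity check on the arithmetic.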
These methods assume:
- Independent random samples from the two groups
- An approximately normal population distribution for each group
- This is mainly important for small sample sizes, and even then, the CI and the two-sided
test are usually robust to violations of this assumption
- $\sigma_1 = \sigma_2$

The ratio of proportions: The relative risk
The ratio of proportions for two groups is: $\hat{p}_1 / \hat{p}_2$
In medical applications for which the proportion refers to a category that is an undesirable
outcome, such as death or having a heart attack, this ratio is called the relative risk
The ratio describes the sizes of the proportions relative to each other
Recall Physician’s Health Study:


The proportion of the placebo group who had a heart attack was 1.82 times the proportion
of the aspirin group who had a heart attack.
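The sample relative risk is a one-line computation. A minimal sketch, with the hypothetical function name relative_risk, using the Physicians' Health Study counts (189 heart attacks among 11034 on placebo, 104 among 11037 on aspirin):

```python
def relative_risk(x1, n1, x2, n2):
    """Ratio of the two sample proportions, p1_hat / p2_hat."""
    return (x1 / n1) / (x2 / n2)

# Group 1 = placebo, group 2 = aspirin
rr = relative_risk(189, 11034, 104, 11037)
print(f"sample relative risk = {rr:.2f}")
```

Swapping the group labels gives the reciprocal, so the labeling should be stated whenever a relative risk is reported.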
Dependent samples:
Each observation in one sample has a matched observation in the other sample
The observations are called matched pairs
Benefits of using dependent samples (matched pairs):
Many sources of potential bias are controlled so we can make a more accurate comparison
Using matched pairs keeps many other factors fixed that could affect the analysis
Often this results in the benefit of smaller standard errors
To compare means with matched pairs, use paired differences:
For each matched pair, construct a difference score
d = (reaction time using cell phone) – (reaction time without cell phone)
Calculate the sample mean of these differences: $\bar{x}_d$
Relative risk computation for the Physicians' Health Study:
$\hat{p}_1 = 189/11034 = 0.0171$
$\hat{p}_2 = 104/11037 = 0.0094$
sample relative risk = $\hat{p}_1 / \hat{p}_2 = 0.0171/0.0094 = 1.82$
The difference $(\bar{x}_1 - \bar{x}_2)$ between the means of the two samples equals the mean $\bar{x}_d$ of the
difference scores for the matched pairs
The difference $(\mu_1 - \mu_2)$ between the population means is identical to the parameter $\mu_d$ that
is the population mean of the difference scores
Confidence interval for dependent samples:
Let n denote the number of observations in each sample
This equals the number of difference scores
The 95 % CI for the population mean difference is:

Paired difference inferences: These paired-difference inferences are special cases of
single-sample inferences about a population mean, so they make the same assumptions
To test the hypothesis $H_0: \mu_1 = \mu_2$ of equal means, we can conduct the single-sample test of
$H_0: \mu_d = 0$ with the difference scores
The test statistic is: $t = \frac{\bar{x}_d - 0}{s_d / \sqrt{n}}$ with $df = n - 1$

Assumptions:
The sample of difference scores is a random sample from a population of such difference
scores
The difference scores have a population distribution that is approximately normal
This is mainly important for small samples (less than about 30) and for one-sided inferences
Confidence intervals and two-sided tests are robust: They work quite well even if the
normality assumption is violated
One-sided tests do not work well when the sample size is small and the distribution of
differences is highly skewed

Example: Cell phones and driving study
The box plot shows skew to the right for the
difference scores
Two-sided inference is robust to violations of the
assumption of normality
The box plot does not show any severe outliers

Significance test:
$H_0: \mu_d = 0$ (and hence equal population means for the two conditions)
$H_a: \mu_d \neq 0$
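Test statistic: $t = \frac{50.6}{52.5/\sqrt{32}} = 5.46$
The 95% CI for the population mean difference is $\bar{x}_d \pm t_{.025}\frac{s_d}{\sqrt{n}}$, where $\bar{x}_d$ is the
sample mean of the differences and $s_d$ is their standard deviation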
Comparing proportions with dependent samples: A recent GSS asked subjects whether they
believed in Heaven and whether they believed in Hell:
                   Belief in Hell
Belief in Heaven   Yes    No     Total
Yes                833    125    958
No                 2      160    162
Total              835    285    1120
We can estimate $p_1 - p_2$ as: $\hat{p}_1 - \hat{p}_2 = 958/1120 - 835/1120 = 0.11$
Note that the data consist of matched pairs.
Recode the data so that for belief in heaven or hell, 1=yes and 0=no
Heaven  Hell  Interpretation                     Difference, d  Frequency
1       1     believe in Heaven and Hell         1-1=0          833
1       0     believe in Heaven, not Hell        1-0=1          125
0       1     believe in Hell, not Heaven        0-1=-1         2
0       0     do not believe in Heaven or Hell   0-0=0          160
Sample mean of the 1120 difference scores is [0(833)+1(125)-1(2)+0(160)]/1120=0.11
Note that this equals the difference in proportions $\hat{p}_1 - \hat{p}_2$
We have converted the two samples of binary observations into a single sample of 1120
difference scores. We can now use single-sample methods with the differences as we did
for the matched-pairs analysis of means.
Confidence interval comparing proportions with matched-pairs data
Use the fact that the sample difference $\hat{p}_1 - \hat{p}_2$ is the mean of difference scores of the
recoded data
We can then find a confidence interval for the population mean of difference scores using
single sample methods






$n = 1120$
$\bar{x}_d = 0.1098$
$s_d = 0.3185$
95% CI = $0.1098 \pm 1.96\left(0.3185/\sqrt{1120}\right) = 0.1098 \pm 0.0187$, or $(0.091, 0.128)$
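The summary values for the recoded Heaven and Hell data can be recovered directly from the frequency table of difference scores. The sketch below is illustrative only; the variable names are hypothetical, and the sample standard deviation uses the usual n - 1 divisor.

```python
import math

# (difference score d, frequency) from the recoded Heaven/Hell table
score_counts = [(0, 833), (1, 125), (-1, 2), (0, 160)]

n = sum(count for _, count in score_counts)
mean_d = sum(d * count for d, count in score_counts) / n
# Sample variance of the difference scores (n - 1 divisor)
var_d = sum(count * (d - mean_d) ** 2 for d, count in score_counts) / (n - 1)
sd_d = math.sqrt(var_d)
se = sd_d / math.sqrt(n)
lower, upper = mean_d - 1.96 * se, mean_d + 1.96 * se

print(f"n = {n}, mean = {mean_d:.4f}, sd = {sd_d:.4f}")
print(f"95% CI = ({lower:.3f}, {upper:.3f})")
```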
McNemar Test for Comparing Proportions with Matched-Pairs Data
Hypotheses: $H_0: p_1 = p_2$, $H_a$ can be one or two sided
Test Statistic: For the two counts for the frequency of “yes” on one response and “no” on
the other, the z test statistic equals their difference divided by the square root of their sum.
P-value: The probability of observing a sample even more extreme than the observed
sample
Assumptions: The sum of the counts used in the test should be at least 30, but in practice,
the two-sided test works well even if this is not true.
Recall GSS example about belief in Heaven and Hell:
McNemar’s Test: $z = \frac{125 - 2}{\sqrt{125 + 2}} = 10.9$. The P-value is approximately 0.
Note that this result agrees with the confidence interval for $p_1 - p_2$ calculated earlier
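McNemar's statistic uses only the two "disagreement" counts. A minimal sketch, assuming a two-sided alternative and a normal-based P-value computed with math.erfc; the function name mcnemar_z is hypothetical.

```python
import math

def mcnemar_z(yes_no, no_yes):
    """z = difference of the two discordant counts over the root of their sum."""
    return (yes_no - no_yes) / math.sqrt(yes_no + no_yes)

# 125 believe in Heaven but not Hell; 2 believe in Hell but not Heaven
z = mcnemar_z(125, 2)
# Two-sided P-value: 2 * P(Z > |z|) = erfc(|z| / sqrt(2))
p_value = math.erfc(abs(z) / math.sqrt(2))
print(f"z = {z:.1f}, two-sided P-value = {p_value:.1e}")
```

The 833 and 160 subjects who gave the same answer to both questions do not enter the statistic at all.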
A practically significant difference:
When we find a practically significant difference between two groups, can we identify a
reason for the difference?
Warning: An association may be due to a lurking variable not measured in the study
Control variable:
In a previous example, we saw that teenagers who watch more TV have a tendency later in
life to commit more aggressive acts
Could there be a lurking variable that influences this association?
Perhaps teenagers who watch more TV tend to attain lower educational levels and perhaps
lower education tends to be associated with higher levels of aggression
-We need to measure potential lurking variables and use them in the statistical analysis
- If we thought that education was a potential lurking variable we would want to measure it
- Including a potential lurking variable in the study changes it from a bivariate study to a
multivariate study
- A variable that is held constant in a multivariate analysis is called a control variable
This analysis uses three variables:
- Response variable: Whether the subject
has committed aggressive acts
- Explanatory variable: Level of TV watching
- Control variable: Educational level
Can An Association Be Explained by a Third Variable?
- Treat the third variable as a control variable
- Conduct the ordinary bivariate analysis while holding that control variable constant at fixed
values (multivariate analysis)
- Whatever association occurs cannot be due to the effect of the control variable
-At each educational level, the percentage committing an aggressive act is higher for those
who watched more TV
- For this hypothetical data, the association observed between TV watching and aggressive
acts was not because of education









Chapter 11: Analyzing the Association Between Categorical Variables






Association between categorical variables:
The chi-squared test and measures of association such as $(p_1 - p_2)$ and $p_1/p_2$ are fundamental
methods for analyzing contingency tables
The P-value for $\chi^2$ summarizes the strength of evidence against $H_0$: independence
If the P-value is small, then we conclude that somewhere in the contingency table the
population cell proportions differ from independence
The chi-squared test does not indicate whether all cells deviate greatly from independence
or perhaps only some of them do so


Residual analysis
A cell-by-cell comparison of the observed counts with the counts that are expected when $H_0$
is true reveals the nature of the evidence against $H_0$

The difference between an observed and expected count in a particular cell is called a
residual

The residual is negative when fewer subjects are in the cell than expected under H0
The residual is positive when more subjects are in the cell than expected under H0
To determine whether a residual is large enough to indicate strong evidence of a deviation
from independence in that cell we use an adjusted form of the residual: the standardized
residual
The standardized residual for a cell= (observed count – expected count)/se
A standardized residual reports the number of standard errors that an observed count falls
from its expected count
The se describes how much the difference would tend to vary in repeated sampling if the
variables were independent
Its formula is complex
Software can be used to find its value
A large standardized residual value provides evidence against independence in that cell
Example: to what extent do you consider yourself a religious person?
- Interpret the standardized residuals in the table
- The table exhibits large positive residuals for the cells for females who are very religious
and for males who are not at all religious.
- In these cells, the observed count is much larger than the expected count
- There is strong evidence that the population has more subjects in these cells than if the
variables were independent
The table exhibits large negative residuals for the cells for females who are not at all
religious and for males who are very religious
In these cells, the observed count is much smaller than the expected count
There is strong evidence that the population has fewer subjects in these cells than if the
variables were independent
Fisher’s exact test:
The chi-squared test of independence is a large-sample test
When the expected frequencies are small, any of them being less than about 5, small-sample
tests are more appropriate
Fisher’s exact test is a small-sample test of independence
The calculations for Fisher’s exact test are complex
Statistical software can be used to obtain the P-value for the test that the two variables are
independent
The smaller the P-value, the stronger the evidence that the variables are associated
Example: An experiment conducted by Sir Ronald Fisher
His colleague, Dr. Muriel Bristol, claimed that when drinking tea she could tell whether the
milk or the tea had been added to the cup first
Experiment: Fisher asked her to taste eight cups of tea:
Four had the milk added first
Four had the tea added first
She was asked to indicate which four had the milk added first
The order of presenting the cups was randomized
Results:




Analysis:








The one-sided version of the test pertains to the alternative that her predictions are better
than random guessing
Does the P-value suggest that she had the ability to predict better than random guessing?
The P-value of 0.243 does not give much evidence against the null hypothesis
The data did not support Dr. Bristol’s claim that she could tell whether the milk or the tea
had been added to the cup first
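The one-sided P-value can be reproduced by counting outcomes under random guessing. The results table did not survive in these notes, so the sketch below ASSUMES she correctly identified 3 of the 4 milk-first cups, which is the outcome consistent with the stated P-value of 0.243. The function name fisher_one_sided is hypothetical.

```python
from math import comb

def fisher_one_sided(k_observed, n_success=4, n_draws=4, n_total=8):
    """P(X >= k_observed) when n_draws cups are picked at random.

    X counts how many of the n_success milk-first cups are among the picks
    (a hypergeometric count under the null hypothesis of pure guessing).
    """
    total = comb(n_total, n_draws)
    upper = min(n_success, n_draws)
    favorable = sum(
        comb(n_success, k) * comb(n_total - n_success, n_draws - k)
        for k in range(k_observed, upper + 1)
    )
    return favorable / total

# Assumed outcome: 3 of the 4 milk-first cups identified correctly
p_value = fisher_one_sided(3)
print(f"one-sided P-value = {p_value:.3f}")
```

Under this setup only a perfect score would give a small P-value: fisher_one_sided(4) is 1/70, about 0.014.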
Assumptions:
Two binary categorical variables, and the data are random
Hypotheses: $H_0$: the two variables are independent ($p_1 = p_2$),
$H_a$: the two variables are associated ($p_1 \neq p_2$ or $p_1 > p_2$ or $p_1 < p_2$)
Test Statistic: First cell count (this determines the others given the margin totals)
P-value: Probability that the first cell count equals the observed value or a value even more
extreme as predicted by $H_a$

Conclusion: Report the P-value and interpret in context. If a decision is required, reject $H_0$
when P-value ≤ significance level













Chapter 12: Analyzing Association between Quantitative variables: Regression analysis
The first step of a regression analysis is to identify the response and explanatory variables
We use y to denote the response variable and x to denote the explanatory variable
The scatterplot: First step in answering the question of association; to look at the data
A scatterplot is a graphical display of the relationship between the response variable (y-axis)
and the explanatory variable (x-axis)
Ex: What do we learn from scatterplot in strength study?
The MINITAB output shows the following regression equation:
BP = 63.5 + 1.49 (BP_60)
The y-intercept is 63.5 and the slope is 1.49
The slope of 1.49 tells us that predicted maximum bench press increases by about 1.5
pounds for every additional 60-pound bench press an athlete can do

The regression line equation:
When the scatterplot shows a linear trend, a straight line
can be fitted through the data points to describe that trend
The regression line is: $\hat{y} = a + bx$
$\hat{y}$ is the predicted value of the response variable y
a is the y-intercept and b is the slope
Check for outliers by plotting the data, The regression line can
be pulled toward an outlier and away from the general trend
of points
Influential points: An observation can be influential in affecting the regression line when
two things happen:
-Its x-value is low or high compared to the rest of the data
- It does not fall in the straight-line pattern that the rest of the data have
Residuals are prediction errors: The regression equation is
often called a prediction equation. The difference
between an observed outcome and its predicted value is
the prediction error, called a residual.
Each observation has a residual; A residual is the vertical
distance between the data point and the regression line
We can summarize how near the regression line the data points fall by the sum of squared residuals: $\sum (y - \hat{y})^2$
The regression line has the smallest sum of squared
residuals and is called the least squares line
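The least squares idea can be sketched in a few lines. The notes leave the computation to software, so the closed-form slope and intercept used here (slope = Sxy/Sxx, intercept = mean of y minus slope times mean of x) are the standard textbook formulas rather than something stated in the notes, and the data values are made up purely for illustration.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the line y_hat = a + b*x minimizing squared residuals."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

def sum_squared_residuals(xs, ys, a, b):
    """Sum of (y - y_hat)^2 over all observations."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, not from the notes
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = least_squares_line(xs, ys)
sse = sum_squared_residuals(xs, ys, a, b)
print(f"y_hat = {a:.2f} + {b:.2f}x, sum of squared residuals = {sse:.3f}")
```

Nudging either a or b away from the fitted values and recomputing sum_squared_residuals gives a larger sum, which is what "least squares" means.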
Regression model: A line describes how the mean of y depends on x
At a given value of x, the equation $\hat{y} = a + bx$
predicts a single value of the response variable
But we should not expect all subjects at that value of x to have the same value of y
Variability occurs in the y values
The regression line connects the estimated means of y at the various x values
In summary, $\hat{y} = a + bx$ describes the relationship between x and the estimated means of y at
the various values of x
The population regression equation, $\mu_y = \alpha + \beta x$,
describes the relationship in the population between x and the means of y
In the population regression equation, α is a population y-intercept and β is a population
slope. These are parameters
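Summary of the formulas used above:
Regression line (prediction equation): $\hat{y} = a + bx$, with y-intercept a and slope b
Residual: $y - \hat{y}$
Sum of squared residuals: $\sum (\text{residuals})^2 = \sum (y - \hat{y})^2$
Population regression equation: $\mu_y = \alpha + \beta x$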
In practice we estimate the population regression equation using the prediction equation for
the sample data
The population regression equation merely approximates the actual relationship between x
and the population means of y, It is a model
A model is a simple approximation for how variables relate in the population



The regression model








If the true relationship is far from a straight
line, this regression model may be a poor
one


























Chapter 13: Multiple Regression
Regression models:
The model that contains only two variables, x and y, is called a bivariate model:
$\mu_y = \alpha + \beta x$
Suppose there are two predictors, denoted by $x_1$ and $x_2$
This is called a multiple regression model:
$\mu_y = \alpha + \beta_1 x_1 + \beta_2 x_2$
The multiple regression model relates the mean $\mu_y$ of a quantitative response variable y to a
set of explanatory variables $x_1, x_2, \ldots$
Example: For three explanatory variables, the multiple regression equation is:
$\mu_y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3$
Example: The sample prediction equation with three explanatory variables is:



Example: Predicting selling price using house and lot size
The data set “house selling prices” contains observations on 100 home sales in Florida in
November 2003
A multiple regression analysis was done with selling price as the response variable and with
house size and lot size as the explanatory variables
Output from the analysis:


Prediction Equation: $\hat{y} = -10{,}536 + 53.8x_1 + 2.84x_2$
where y = selling price, $x_1$ = house size and $x_2$ = lot size
One house listed in the data set had house size = 1240 square feet, lot size = 18,000 square
feet and selling price = $145,000
Find its predicted selling price:

Find its residual: $y - \hat{y} = 145{,}000 - 107{,}276 = 37{,}724$
The residual tells us that the actual selling price was $37,724 higher than predicted
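The prediction and residual can be checked with a short function. This is a sketch using the rounded coefficients printed in the prediction equation; the function name predict_price is hypothetical. With those rounded coefficients the prediction comes out at about 107,296 rather than the 107,276 quoted in the notes. The small gap is presumably because the software output used unrounded coefficients, which is an assumption, not something the notes state.

```python
def predict_price(house_size, lot_size, a=-10536, b1=53.8, b2=2.84):
    """Predicted selling price from the rounded prediction equation."""
    return a + b1 * house_size + b2 * lot_size

# House with 1240 sq ft, lot of 18,000 sq ft, actual price $145,000
predicted = predict_price(1240, 18000)
residual = 145000 - predicted
print(f"predicted = {predicted:,.0f}, residual = {residual:,.0f}")

# Holding house size fixed, one extra square foot of lot adds b2 dollars
lot_effect = predict_price(2000, 30001) - predict_price(2000, 30000)
print(f"effect of one more sq ft of lot size = {lot_effect:.2f}")
```

The second computation shows that the effect of lot size is the same whatever value house size is fixed at, because the equation has no interaction term.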
The number of explanatory variables
You should not use many explanatory variables in a multiple regression model unless you
have lots of data
A rough guideline is that the sample size n should be at least 10 times the number of
explanatory variables
3 3 2 2 1 1
ˆ
x b x b x b a y + + + =
276 , 107
) 000 , 18 ( 84 . 2 ) 1240 ( 8 . 53 536 , 10 ˆ
=
+ + ÷ = y
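The prediction, the residual, and the sample-size guideline for the house example can be sketched in Python. This is a minimal illustration, assuming the rounded coefficients shown in the prediction equation; the function name `predict_price` is hypothetical. With rounded coefficients the arithmetic gives 107,296 rather than the reported 107,276, which suggests the reported value came from unrounded software coefficients.

```python
# Rounded coefficients from the prediction equation (assumed exact here).
A, B1, B2 = -10_536, 53.8, 2.84

def predict_price(house_size, lot_size):
    """Predicted selling price from house size and lot size (square feet)."""
    return A + B1 * house_size + B2 * lot_size

observed = 145_000
predicted = predict_price(1240, 18_000)
residual = observed - predicted  # positive: the house sold for more than predicted

# Rough guideline from the notes: n should be at least 10 per explanatory variable.
n, num_predictors = 100, 2
enough_data = n >= 10 * num_predictors

print(round(predicted), round(residual), enough_data)
```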
Plotting relationships
Always look at the data before doing a multiple regression
Most software can construct scatterplots for each pair of variables on a single graph.
This is called a scatterplot matrix










Interpretation of multiple regression coefficients:
The simplest way to interpret a multiple regression equation looks at it in two dimensions as
a function of a single explanatory variable
We can look at it this way by fixing values for the other explanatory variable(s)
Example using the housing data: Suppose we fix \(x_1\) = house size at 2000 square feet
The prediction equation becomes:
\( \hat{y} = -10{,}536 + 53.8(2000) + 2.84 x_2 = 97{,}022 + 2.84 x_2 \)

Since the slope coefficient of \(x_2\) is 2.84, the predicted selling price increases by $2.84 for
every square foot increase in lot size when the house size is 2000 square feet
For a 1000 square-foot increase in lot size, the predicted selling price increases by
1000(2.84) = $2840 when the house size is 2000 square feet
Example using the housing data:
- Suppose we fix \(x_2\) = lot size at 30,000 square feet
- The prediction equation becomes:
\( \hat{y} = -10{,}536 + 53.8 x_1 + 2.84(30{,}000) = 74{,}676 + 53.8 x_1 \)

Since the slope coefficient of \(x_1\) is 53.8, for houses with a lot size of 30,000 square feet, the
predicted selling price increases by $53.80 for every square foot increase in house size
In summary, an increase of a square foot in house size has a larger impact on the selling
price ($53.80) than an increase of a square foot in lot size ($2.84)
We can compare slopes for these explanatory variables because their units of measurement
are the same (square feet)
Slopes cannot be compared when the units differ
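Fixing one explanatory variable turns the prediction equation into a straight line in the other. The sketch below illustrates this, assuming the rounded coefficients from the prediction equation; the helper names are hypothetical. The intercepts come out as 97,064 and 74,664 rather than the 97,022 and 74,676 in the notes, presumably because the notes used unrounded coefficients.

```python
# Rounded coefficients from the prediction equation (assumed exact here).
A, B1, B2 = -10_536, 53.8, 2.84

def fix_house_size(house_size):
    """Return (intercept, slope) of predicted price as a function of lot size."""
    return A + B1 * house_size, B2

def fix_lot_size(lot_size):
    """Return (intercept, slope) of predicted price as a function of house size."""
    return A + B2 * lot_size, B1

line_lot = fix_house_size(2000)      # line in lot size, house size held at 2000
line_house = fix_lot_size(30_000)    # line in house size, lot size held at 30,000

# The slope for house size is the same whatever lot size is held fixed.
slopes = [fix_lot_size(x2)[1] for x2 in (10_000, 30_000, 50_000)]

print(line_lot, line_house, slopes)
```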
Summarizing the effect while controlling for a variable:
The multiple regression model assumes that the slope for a particular explanatory variable is
identical for all fixed values of the other explanatory variables
For example, the coefficient of \(x_1\) in the prediction equation:
\( \hat{y} = -10{,}536 + 53.8 x_1 + 2.84 x_2 \)
is 53.8 regardless of whether we plug in \(x_2\) = 10,000 or \(x_2\) = 30,000 or \(x_2\) = 50,000







Slopes in multiple regression and in bivariate regression:
In multiple regression, a slope describes the effect of an explanatory variable while
controlling for the effects of the other explanatory variables in the model
Bivariate regression has only a single explanatory variable
A slope in bivariate regression describes the effect of that variable while ignoring all other
possible explanatory variables
Importance of multiple regression:
One of the main uses of multiple regression is to identify potential lurking variables and
control for them by including them as explanatory variables in the model
Multiple correlation:
To summarize how well a multiple regression model predicts y, we analyze how well the
observed y values correlate with the predicted \(\hat{y}\) values
The multiple correlation is the correlation between the observed y values and the predicted
\(\hat{y}\) values
- It is denoted by R
For each subject, the regression equation provides a predicted value
Each subject has an observed y-value and a predicted y-value




The correlation computed between all pairs of observed y-values and predicted y-values is
the multiple correlation, R
The larger the multiple correlation, the better are the predictions of y by the set of
explanatory variables
The R-value always falls between 0 and 1
In this way, the multiple correlation ‘R’ differs from the bivariate correlation ‘r’ between y
and a single variable x, which falls between -1 and +1
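The contrast between R and r can be checked numerically. The sketch below is an illustration only: the data are hypothetical, and it uses a bivariate least-squares fit as the simplest case. When y decreases with x, r is negative, but the correlation between observed and predicted y values is still positive.

```python
from math import sqrt

def correlation(u, v):
    """Pearson correlation between two equal-length lists."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / sqrt(suu * svv)

# Hypothetical data in which y decreases as x increases.
x = [1, 2, 3, 4, 5]
y = [10, 8, 7, 5, 1]

# Least-squares slope and intercept for the bivariate prediction equation.
x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum((xi - x_bar) ** 2 for xi in x)
a = y_bar - b * x_bar
y_hat = [a + b * xi for xi in x]

r = correlation(x, y)       # bivariate correlation, negative here
R = correlation(y, y_hat)   # correlation of observed with predicted values

print(round(r, 3), round(R, 3))
```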
R-squared
For predicting y, the square of R describes the relative improvement from using the
prediction equation instead of using the sample mean, \(\bar{y}\)
The error in using the prediction equation to predict y is summarized by the residual sum of
squares: \( \sum (y - \hat{y})^2 \)
The error in using \(\bar{y}\) to predict y is summarized by the total sum of squares: \( \sum (y - \bar{y})^2 \)
The proportional reduction in error is:
\( R^2 = \dfrac{\sum (y - \bar{y})^2 - \sum (y - \hat{y})^2}{\sum (y - \bar{y})^2} \)
The better the predictions are using the regression equation, the larger \(R^2\) is
For multiple regression, \(R^2\) is the square of the multiple correlation, R
Example: How well can we predict house selling prices:
For the 100 observations on y = selling price, \(x_1\) = house size, and \(x_2\) = lot size, a
table, called the ANOVA (analysis of variance) table, was created
The table displays the sums of squares in the SS column
The \(R^2\) value can be computed from the sums of squares in the table:
\( R^2 = \dfrac{\sum (y - \bar{y})^2 - \sum (y - \hat{y})^2}{\sum (y - \bar{y})^2} = \dfrac{314{,}433 - 90{,}756}{314{,}433} = 0.711 \)
Using house size and lot size together to predict selling price reduces the prediction error by 71%,
relative to using \(\bar{y}\) alone to predict selling price

Find and interpret the multiple correlation
\( R = \sqrt{R^2} = \sqrt{0.711} = 0.84 \)

There is a strong association between the observed and the predicted selling prices
House size and lot size are very helpful in predicting selling prices
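The arithmetic for the house example can be reproduced directly from the two sums of squares quoted in the notes. A minimal sketch (variable names are illustrative):

```python
from math import sqrt

total_ss = 314_433     # total sum of squares, error when predicting with the sample mean
residual_ss = 90_756   # residual sum of squares, error when using the prediction equation

r_squared = (total_ss - residual_ss) / total_ss   # proportional reduction in error
multiple_r = sqrt(r_squared)                      # multiple correlation R

print(round(r_squared, 3), round(multiple_r, 2))  # 0.711 0.84
```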
If we used a bivariate regression model to predict selling price with house size as the
predictor, the \(r^2\) value would be 0.58
If we used a bivariate regression model to predict selling price with lot size as the predictor,
the \(r^2\) value would be 0.51
The multiple regression model has \(R^2 = 0.71\), so it provides better predictions than either
bivariate model






The single predictor in the data set that is most strongly associated with y is the house’s real
estate tax assessment (\(r^2 = 0.679\))
When we add house size as a second predictor, \(R^2\) goes up from 0.679 to 0.730
As other predictors are added, \(R^2\) continues to go up, but not by much
\(R^2\) does not increase much after a few predictors are in the model
When there are many explanatory variables but the correlations among them are strong,
once you have included a few of them in the model, \(R^2\) usually doesn’t increase much more
when you add additional ones
This does not mean that the additional variables are uncorrelated with the response variable
It merely means that they don’t add much new power for predicting y, given the values of
the predictors already in the model

Properties of \(R^2\)
The previous example showed that \(R^2\) for the multiple regression model was larger than \(r^2\)
for a bivariate model using only one of the explanatory variables
A key property of \(R^2\) is that it cannot decrease when predictors are added to a model
\(R^2\) falls between 0 and 1
The larger the value, the better the explanatory variables collectively predict y
\(R^2 = 1\) only when all residuals are 0, that is, when all regression predictions are perfect
\(R^2 = 0\) when the correlation between y and each explanatory variable equals 0
\(R^2\) gets larger, or at worst stays the same, whenever an explanatory variable is added to the
multiple regression model
The value of \(R^2\) does not depend on the units of measurement
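Two of these properties can be checked on a small example: adding a predictor does not lower \(R^2\), and changing units leaves it unchanged. The sketch below uses hypothetical data (not the Florida data set) and the closed-form least-squares solution for two predictors; all function names are illustrative.

```python
def s(u, v):
    """Sum of cross-products of deviations from the means."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))

def r2_one(x, y):
    """r-squared for a bivariate least-squares fit of y on x."""
    return s(x, y) ** 2 / (s(x, x) * s(y, y))

def r2_two(x1, x2, y):
    """R-squared for a least-squares fit of y on x1 and x2 (with intercept)."""
    det = s(x1, x1) * s(x2, x2) - s(x1, x2) ** 2
    b1 = (s(x2, x2) * s(x1, y) - s(x1, x2) * s(x2, y)) / det
    b2 = (s(x1, x1) * s(x2, y) - s(x1, x2) * s(x1, y)) / det
    # Regression sum of squares divided by total sum of squares.
    return (b1 * s(x1, y) + b2 * s(x2, y)) / s(y, y)

# Hypothetical houses: size and lot in square feet, price in thousands of dollars.
house = [1200, 1500, 1800, 2100, 2400, 2700]
lot = [9000, 15000, 12000, 20000, 16000, 30000]
price = [110, 140, 150, 185, 190, 250]

r2_house = r2_one(house, price)
r2_lot = r2_one(lot, price)
r2_both = r2_two(house, lot, price)

# Same fit with house size converted to square meters.
house_m2 = [0.0929 * h for h in house]
r2_both_metric = r2_two(house_m2, lot, price)

print(round(r2_house, 3), round(r2_lot, 3), round(r2_both, 3))
```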


























Chapter 14: Comparing Groups: Analysis of Variance Methods
How can we compare several means? ANOVA
The analysis of variance method: Compares means of several groups
- Let g denote the number of groups
- each group has a corresponding population of subjects
- means of response variable for the g populations are denoted: \(\mu_1, \mu_2, \ldots, \mu_g\)

Hypotheses and Assumptions for the ANOVA test
- The analysis of variance is a significance test of the null hypothesis of equal population
means: \(H_0: \mu_1 = \mu_2 = \dots = \mu_g\)
- Alternative hypothesis: \(H_a\): At least two of the population means are unequal
The assumptions for the ANOVA test comparing population means are as follows:
1. The population distributions of the response variable for the g groups are normal
with the same standard deviation for each group
2. Randomization:
- In a survey sample, independent random samples are selected from the g
populations
- In an experiment, subjects are randomly assigned separately to the g groups
Example: How long will you tolerate being put on hold? Callers on hold hear one of three
recordings: an advertisement, Muzak, or classical music.
Denote the holding time means for the populations that these three random samples
represent by:
\(\mu_1\) = mean for the advertisement
\(\mu_2\) = mean for the Muzak
\(\mu_3\) = mean for the classical music
The hypotheses for the ANOVA test are:
\(H_0: \mu_1 = \mu_2 = \mu_3\)
\(H_a\): at least two of the population means are different
Since the sample means are quite different, can we conclude that the population means
differ?
This alone is not sufficient evidence to enable us to reject \(H_0\)
Variability between groups and within groups
The ANOVA method – used to compare population means.
It is called ANALYSIS OF VARIANCE because it uses evidence about two types of variability.
Example: Two data sets with equal means but unequal variability.
Which case do you think gives stronger evidence against \(H_0: \mu_1 = \mu_2 = \mu_3\)?
What is the difference between the data in these two cases?
In both cases the variability between pairs of means is the same
In ‘Case b’ the variability within each sample is much smaller than in ‘Case a.’
The fact that ‘Case b’ has less variability within each sample gives stronger evidence against \(H_0\)
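The two kinds of variability can be illustrated with made-up numbers. In the sketch below both cases have sample means 5, 8 and 11, so the variability between the means is identical, but the values within each sample are far more spread out in case a. The data and the function name are hypothetical, not the data behind the notes' figure.

```python
from statistics import mean, variance

def between_and_within(groups):
    """Return (variance of the sample means, average within-sample variance)."""
    means = [mean(g) for g in groups]
    within = [variance(g) for g in groups]
    return variance(means), mean(within)

case_a = [[1, 5, 9], [4, 8, 12], [7, 11, 15]]   # widely spread within each sample
case_b = [[4, 5, 6], [7, 8, 9], [10, 11, 12]]   # same means, tightly clustered

between_a, within_a = between_and_within(case_a)
between_b, within_b = between_and_within(case_b)

print(between_a, within_a)
print(between_b, within_b)
```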



ANOVA F-test statistic
The analysis of variance (ANOVA) F-test statistic is:
\( F = \dfrac{\text{Between-groups variability}}{\text{Within-groups variability}} \)
The larger the variability between groups relative
to the variability within groups, the larger
the F test statistic tends to be
When \(H_0\) is true, the test statistic for comparing means has the F distribution
The larger the F-test statistic value, the stronger the evidence against \(H_0\)
ANOVA F-test for comparing population means of several groups
1. Assumptions: - Independent random samples
- Normal population distributions with equal standard deviations

2. Hypotheses: \(H_0: \mu_1 = \mu_2 = \dots = \mu_g\)
\(H_a\): at least two of the population means are different

3. Test statistic: \( F = \dfrac{\text{Between-groups variability}}{\text{Within-groups variability}} \)
The F sampling distribution has \(df_1 = g - 1\), \(df_2 = N - g\) (total sample size minus number of groups)

4. P-value: Right-tail probability above the observed F-value
5. Conclusion: If a decision is needed, reject \(H_0\) if P-value ≤ significance level (such as 0.05)
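Step 3 can be sketched in Python. This is an illustration under stated assumptions: the holding times are hypothetical, the function name `anova_f` is made up, and the between-groups and within-groups variability are computed as mean squares (sum of squares divided by degrees of freedom), which is the usual ANOVA-table convention.

```python
def anova_f(groups):
    """Return (F, df1, df2) for a one-way ANOVA on a list of samples."""
    g = len(groups)
    n_total = sum(len(s) for s in groups)
    grand_mean = sum(sum(s) for s in groups) / n_total
    means = [sum(s) / len(s) for s in groups]

    # Between-groups sum of squares: sample means around the overall mean.
    between_ss = sum(len(s) * (m - grand_mean) ** 2 for s, m in zip(groups, means))
    # Within-groups sum of squares: observations around their own sample mean.
    within_ss = sum((v - m) ** 2 for s, m in zip(groups, means) for v in s)

    df1, df2 = g - 1, n_total - g
    return (between_ss / df1) / (within_ss / df2), df1, df2

# Hypothetical holding times in minutes for three recordings.
holding_times = [[1, 2, 3], [2, 3, 4], [6, 7, 8]]
F, df1, df2 = anova_f(holding_times)
print(F, df1, df2)  # 21.0 2 6
```

The P-value in step 4 is the right-tail probability of the F distribution with these degrees of freedom; the Python standard library has no F distribution, so in practice it would come from statistical software.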
The variance estimates and the ANOVA table
- Let σ denote the standard deviation for each of the g population distributions
- One assumption for the ANOVA F-test is that each population has the same standard
deviation, σ
The F-test statistic is the ratio of two estimates of \(\sigma^2\), the population variance for each group
- The estimate of \(\sigma^2\) in the denominator of the F-test statistic uses the variability within each
group
- The estimate of \(\sigma^2\) in the numerator of the F-test statistic uses the variability between each
sample mean and the overall mean for all the data
- Computer software displays the two estimates of \(\sigma^2\) in the ANOVA table
- The MS column contains the two estimates, which are called mean squares
- The ratio of the two mean squares is the F- test statistic
- This F- statistic has a P-value
Example: customers’ telephone holding times again:
Since P-value < 0.05, there is sufficient evidence to reject \(H_0: \mu_1 = \mu_2 = \mu_3\)
We conclude that a difference exists among the three types of messages in the population
mean amount of time that customers are willing to remain on hold


Estimate the difference between the two population parameters of interest
\(p_1\): the proportion of the population who would have a heart attack if they participated in this experiment and took the placebo
\(p_2\): the proportion of the population who would have a heart attack if they participated in this experiment and took the aspirin
Sample statistics:
\( \hat{p}_1 = 189/11034 = 0.017 \)
\( \hat{p}_2 = 104/11037 = 0.009 \)
\( \hat{p}_1 - \hat{p}_2 = 0.017 - 0.009 = 0.008 \)
To make an inference about the difference of population proportions, \((p_1 - p_2)\), we need to learn about the variability of the sampling distribution of \((\hat{p}_1 - \hat{p}_2)\)
Standard error for comparing two proportions:
The difference, \((\hat{p}_1 - \hat{p}_2)\), is obtained from sample data
It will vary from sample to sample
This variation is the standard error of the sampling distribution of \((\hat{p}_1 - \hat{p}_2)\):
\( se = \sqrt{\dfrac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \dfrac{\hat{p}_2(1 - \hat{p}_2)}{n_2}} \)
Confidence Interval for the Difference Between Two Population Proportions
\( (\hat{p}_1 - \hat{p}_2) \pm z(se) \)
The z-score depends on the confidence level
This method requires:
Categorical response variable for two groups
Independent random samples for the two groups
Large enough sample sizes so that there are at least 10 “successes” and at least 10 “failures” in each group
Confidence Interval Comparing Heart Attack Rates for Aspirin and Placebo
95% CI: \( (0.017 - 0.009) \pm 1.96\sqrt{\dfrac{0.017(1 - 0.017)}{11034} + \dfrac{0.009(1 - 0.009)}{11037}} = 0.008 \pm 0.003 \), or (0.005, 0.011)
Since both endpoints of the confidence interval (0.005, 0.011) for \((p_1 - p_2)\) are positive, we infer that \((p_1 - p_2)\) is positive
Conclusion: The population proportion of heart attacks is larger when subjects take the placebo than when they take aspirin
The population difference (0.005, 0.011) is small
Even though it is a small difference, it may be important in public health terms
For example, a decrease of 0.01 over a 5 year period in the proportion of people suffering heart attacks would mean 2 million fewer people having heart attacks
The study used male doctors in the U.S.
- The inference applies to the U.S. population of male doctors
Before concluding that aspirin benefits a larger population, we’d want to see results of studies with more diverse groups
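The interval for the aspirin study can be reproduced from the counts given in the notes. A minimal sketch; the function name `diff_proportions_ci` is hypothetical, and 1.96 is the z-score for 95% confidence.

```python
from math import sqrt

def diff_proportions_ci(x1, n1, x2, n2, z=1.96):
    """Confidence interval for p1 - p2 from independent samples."""
    p1, p2 = x1 / n1, x2 / n2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - z * se, diff + z * se

# Placebo: 189 heart attacks out of 11034; aspirin: 104 out of 11037.
low, high = diff_proportions_ci(189, 11034, 104, 11037)
print(round(low, 3), round(high, 3))  # 0.005 0.011
```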

p2) >0 If all values in the CI for (p1. we estimate the common value of p1 and p2 by the proportion of the total sample in the category of interest This pooled estimate is calculated by combining the number of successes in the two groups and dividing by the combined sample size (n1+n2) 3.Interpreting a confidence interval for a difference of proportions Check whether 0 falls in the CI If so. you can infer that (p1.p2) are negative. Conclusion: Smaller P-values give stronger evidence against H0 and supporting Ha .Independent random samples -Significance tests comparing proportions use the sample size guideline from confidence intervals: Each sample should have at least about 10 “successes” and 10 “failures” . the true difference may be relatively small in practical terms Significance tests comparing population proportions: 1. Assumptions: .Categorical response variable for two groups . If all values in the confidence interval are near 0.p2) <0 Which group is labeled ‘1’ and which is labeled ‘2’ is arbitrary The magnitude of values in the confidence interval tells you how large any true difference is.p2) are positive. P-value: Probability obtained from the standard normal table of values even more extreme than observed z test statistic 5. you can infer that (p1.Two–sided tests are robust against violations of this condition … At least 5 “successes” and 5 “failures” is adequate 2. The test statistic is: z  ˆ ˆ ( p1  p2 )  0 1 1 ˆ ˆ  p(1  p )     n1 n2  ˆ where p is the pooled estimate 4. Hypotheses: The null hypothesis is the hypothesis of no difference or no effect: H0: p1=p2 The alternative hypothesis is the hypothesis of interest to the investigator Ha: p1≠p2 (two-sided test) Ha: p1<p2 (one-sided test) Ha: p1>p2 (one-sided test) Pooled Estimate Under the presumption that p1= p2. it is plausible that the population proportions are equal If all values in the CI for (p1.

192 z 1    4. 5  154 ˆ p  0.jobs differs from the proportion of female students who had summer jobs. we reject H0 We conclude that the population proportions of aggressive acts differ for the two groups The sample values suggest that the population proportion is higher for the higher level of TV watching Example: Two Proportions Summer Jobs Example Summer Status Men Is there evidence that the proportion of male students who had summer jobs differs from the Employed 718 proportion of female students who had summer jobs? Not Employed 79 Null: The proportion of male students who had summer jobs is the same as the proportion of Total 797 female students who had summer jobs.225(0.p2) ≠ 0. 707 families. [H0: p1 = p2] Alt: The proportion of male students who had s.057  0.0476 0.775)    0.05. so test statistic follows a Normal 718  593 1311 ˆ distribution) Pooled sample proportion: p   797  732 1529 718 593  797 732 Test statistic: z   5.05.0476   88 619   n1 n2  ˆ ˆ p  p2 0.249  0.p2) = 0.04 se0 0. [Ha: p1 ≠ p2] Test statistic: n1 = 797 and n2 = 732 (both large.07 1311  1311  1 1   1    1529  1529  797 732  Women 593 139 732 .0476 Conclusion: Since the P-value is less than 0.225 88  619 Test statistic: se0  1 1 1   1 ˆ ˆ p1  p     0.Example: Tv violence – aggressive behavior. on the average. as teenagers p1 = population proportion committing aggressive acts for the lower level of TV watching p2 = population proportion committing aggressive acts for the higher level of TV watching Test the Hypotheses: H0: (p1. using a significance level of 0. observations ove 17 years Define Group 1 as those who watched less than 1 hour of TV per day. as teenagers Define Group 2 as those who averaged at least 1 hour of TV per day. Ha: (p1.

(x1  x 2 ) . The response variable was constructed using a questionnaire called the Hooked on Nicotine Checklist (HONC) The HONC score is the total number of questions to which a student answered “yes” during the study. It will vary from sample to sample. Comparing Means: We can compare two groups on a quantitative response variable by comparing their means Example: A 30-month study. females answered “yes” to about one more question on the HONC scale than males did To make an inference about the difference between population means. that might be associated with the HONC score How can we compare the sample HONC scores for females and males? We estimate (µ1 .07 P-value: P-value = 2P(Z > 5. such as gender. is obtained from sample data. we need to learn about the variability of the sampling distribution of: ( x1  x2 ) Standard error for comparing two means: The difference. (µ 1 – µ2). there is very strong evidence that the proportion of male students who had summer jobs differs from that of female students.025 is the critical value for a 95% confidence level from the t distribution The degrees of freedom are calculated using software. Ha: p1 ≠ p2 Test Statistic: z = 5.000000396 (using a computer) Conclusion: Since the P-value is quite small. The higher the score. If you are not using software. 332 students who had used nicotine were evaluated.07) = 0. This variation is the standard error of the sampling distribution of (x1  x 2 ) : se  Confidence interval for the difference between two population means: A confidence interval for m1 – m2 is: x1  x2   t.025 2 s12 s2  n1 n2 s1 s  2 n1 n2 2 2 t.2 On average.6 = 1. Evaluated the degree of addiction that teenagers form to nicotine. you can take df to be the smaller of (n1-1) and (n2-1) as a “safe” estimate .8 – 1.Hypotheses: H0: p1 = p2.µ2) by ( x1  x2 ): 2. the more hooked on nicotine a student is judged to be The study considered explanatory variables.

0. and even then the method is robust to violations of this assumption Example: Data as summarized by HONC scores for the two groups: Smokers: = 5.65 Note: we use the “safe” estimate of 29-1=28 for our degrees of freedom in this calculation “We are 95% confident that the difference in mean resting pulse rates (non-exercisers minus exercisers) is between 4. s2 = 2. 0 is a plausible value for (µ1 – µ2).0)2 31  (8.35 and 13. s2 s2 x1  x 2  t  1  2 n1 n2  75  66  2.0.048 (9. this lack of normality is not a problem 3.9. 8.3. s2 = 2.1 higher and 5. n2 = 257 Were the sample data for the two groups approximately normal? Most likely not for Group 2 (based on the sample statistics: x 2 = 1. dev.” How can we interpret a confidence interval for a difference of means? Check whether 0 falls in the interval When it does.6 9.32   95% CI for (µ1.32 2.65 beats per minute. This is an example of when to use the twosample t procedures.65  4.985 We can infer that the population mean for the smokers is between 4. Exercisers Non-exercisers n 29 31 mean 66 75 std.35 to 13.8. n1 = 75 Ex-smokers: = 1. meaning that it is possible that µ1 = µ2 A confidence interval for (µ1 – µ2) that contains only positive numbers suggests that (µ1 – µ2) is positive.3) Since the sample sizes are large.9  0.1.6)2 29  9  4.7 higher than for the ex-smokers Example: Exercise and pulse rate A study is performed to compare the mean resting pulse rate of adult subjects who exercise regularly to the mean resting pulse rate of those who do not exercise regularly.This method assumes: Independent random samples from the two groups An approximately normal population distribution for each group This is mainly important for small sample sizes. 5.3. s1 = 3. We then infer that µ1 is larger than µ2 .0 Find a 95% confidence interval for the difference in population means (non-exercisers minus exercisers).9  1)  1. or (4.7) (5.µ2): 75 257 4.

The test statistic is: Note change from “z” to “t” in formula 4. P-value: Probability obtained from the standard normal table 5. and even then the two-sided t test is robust to violations of this assumption 2.µ2) > 0 (one-sided test) t ( x1  x2 )  0 2 s12 s2  n1 n2 3.2 milliseconds The mean response time for the control group was 533. Assumptions: Quantitative response variable for two groups Independent random samples Approximately normal population distributions for each group This is mainly important for small sample sizes. 32 randomly assigned to the cell phone group. the experiment analyzed their mean response time over all the trials Averaged over all trials and subjects. the mean response time for the cell-phone group was 585. Hypotheses: The null hypothesis is the hypothesis of no difference or no effect: H0: (µ1.µ2) < 0 (one-sided test) Ha: (µ1.µ2) =0 The alternative hypothesis: Ha: (µ1. Conclusion: Smaller P-values give stronger evidence against H0 and supporting Ha Example: Does cell phone use while driving impair reaction times? Experiment: 64 college students.µ2) ≠ 0 (two-sided test) Ha: (µ1.7 milliseconds Boxplot of data: . 32 to control group Students used a machine that simulated driving situations At irregular periods a target flashed red or green Participants were instructed to press a “brake button” as soon as possible when they detected a red light For each subject.A confidence interval for (µ1 – µ2) that contains only negative numbers suggests that (µ1 – µ2) is negative We then infer that µ1 is smaller than µ2 Which group is labeled ‘1’ and which is labeled ‘2’ is arbitrary Significance tests comparing population means: 1.

if it seems the observation was incorrectly recorded) It is however. it means that the inference including the outlier is on shaky ground Alternative method for Comparing Means: An alternative t. under the null hypothesis. you should not delete outliers from a data set without sufficient cause (i. it is reasonable to expect the variability as well as the mean to be the same This method requires the assumption that the population standard deviations be equal The Pooled Standard Deviation: This alternative method estimates the common value σ of σ1 and σ1 by: 2 (n1  1) s12  (n2  1) s2 s n1  n2  2 . a good idea to check for sensitivity of an analysis to an outlier If the results change much. the t-statistic changes only slightly ¨ Insight: In practice. so we can reject H0 There is enough evidence to conclude that the population mean response times differ between the cell phone and control groups The sample means suggest that the population mean is higher for the cell phone group What do the box plots tell us? There is an extreme outlier for the cell phone group It is a good idea to make sure the results of the analysis aren’t affected too strongly by that single observation Delete the extreme outlier and redo the analysis In this example.Test the hypotheses: H0: (µ1.05.method can be used when.e.05 Conclusion: The P-value is less than 0.µ2) =0 vs Ha: (µ1..µ2) ≠ 0 using a significance level of 0.

0094 ˆ ˆ sample relative risk = p1 p2  0.0094  1. use paired differences: For each matched pair.0171 0. this ratio is called the relative risk The ratio describes the sizes of the proportions relative to each other Recall Physician’s Health Study: ˆ p1  189 / 11034  0.82 The proportion of the placebo group who had a heart attack was 1.025 s 1 1  n1 n2 This method has df =n1+ n2.82 times the proportion of the aspirin group who had a heart attack.σ1=σ2 The ratio of proportions: The relative risk ˆ p1 The ratio of proportions for two groups is: ˆ p2 In medical applications for which the proportion refers to a category that is an undesirable outcome.Comparing population means.0171 ˆ p2  104 / 11037  0.Independent random samples from the two groups . Dependent samples: Each observation in one sample has a matched observation in the other sample The observations are called matched pairs Benefits of using dependent samples (matched pairs): Many sources of potential bias are controlled so we can make a more accurate comparison Using matched pairs keeps many other factors fixed that could affect the analysis Often this results in the benefit of smaller standard errors To compare means with matched pairs.2 These methods assume: . and even then. such as death or having a heart attack. assuming equal population standard deviations Using the pooled standard deviation estimate.An approximately normal population distribution for each group . construct a difference score d = (reaction time using cell phone) – (reaction time without cell phone) Calculate the sample mean of these differences: x d .µ2) is: ( x1  x2 )  t. the CI and the two-sided test are usually robust to violations of this assumption . a 95% CI for (µ1 .This is mainly important for small sample sizes.

46 52 . we can conduct the single-sample test of H0: µd = 0 with the difference scores The test statistic is: t  s xd  0 with df  n  1 sd n Assumptions: The sample of difference scores is a random sample from a population of such difference scores The difference scores have a population distribution that is approximately normal This is mainly important for small samples (less than about 30) and for one-sided inferences Confidence intervals and two-sided tests are robust: They work quite well even if the normality assumption is violated One-sided tests do not work well when the sample size is small and the distribution of differences is highly skewed Example: Cell phones and driving study The box plot shows skew to the right for the difference scores Two-sided inference is robust to violations of the assumption of normality The box plot does not show any severe outliers Significance test: H0: µd = 0 (and hence equal population means for the two conditions) 50 .025 d n Let n denote the number of observations in each sample xd is the sample mean of the differences This equals the number of difference scores The 95 % CI for the population mean difference is: s d is their standard deviation Paired difference inferences: These paired-difference inferences are special cases of singlesample inferences about a population mean so they make the same assumptions To test the hypothesis H0: µ1 = µ2 of equal means.5 Ha: µd ≠ 0 32 Test statistic: .6 t  5.The difference ( x1 – x 2 ) between the means of the two samples equals the mean x d of the difference scores for the matched pairs The difference (µ1 – µ2) between the population means is identical to the parameter µ d that is the population mean of the difference scores Confidence interval for dependent samples: xd  t.

091.3185  0.11 Note that the data consist of matched pairs. d 1-1=0 1-0=1 0-1=-1 0-0=0 Frequency 833 125 2 160 Sample mean of the 1120 difference scores is [0(833)+1(125)-1(2)+0(160)]/1120=0. We can now use single-sample methods with the differences as we did for the matched-pairs analysis of means. Confidence interval comparing proportions with matched-pairs data ˆ ˆ Use the fact that the sample difference p1  p 2 is the mean of difference scores of the recoded data We can then find a confidence interval for the population mean of difference scores using single sample methods n  1120 xd  0.0.11 ˆ ˆ Note that this equals the difference in proportions p1  p 2 We have converted the two samples of binary observations into a single sample of 1120 difference scores.0187  (0.Comparing proportions with dependent samples: A recent GSS asked subjects whether they believed in Heaven and whether they believed in Hell: Belief in Hell Belief in Heaven Yes No Total Yes 833 2 835 No 125 160 285 Total 958 162 1120 ˆ ˆ We can estimate p1 .p2 as: p1  p2  958 1120  835 1120  0.3185 1120   .1098 sd  0.96 0.128) 95% CI = 0. not Heaven do not believe in Heaven or Hell Difference.1098  0. 1=yes and 0=no Heaven 1 1 0 0 Hell 1 0 1 0 Interpretation believe in Heaven and Hell believe in Heaven. Recode the data so that for belief in heaven or hell. not Hell believe in Hell.1098  1.

Response variable: Whether the subject has committed aggressive acts . the percentage committing an aggressive act is higher for those who watched more TV .Whatever association occurs cannot be due to the effect of the control variable -At each educational level. Recall GSS example about belief in Heaven and Hell: 125  2  10 . the two-sided test works well even if this is not true. Ha can be one or two sided Test Statistic: For the two counts for the frequency of “yes” on one response and “no” on the other.McNemar Test for Comparing Proportions with Matched-Pairs Data Hypotheses: H0: p1=p2.Explanatory variable: Level of TV watching .For this hypothetical data.Treat the third variable as a control variable . can we identify a reason for the difference? Warning: An association may be due to a lurking variable not measured in the study Control variable: In a previous example.Control variable: Educational level Can An Association Be Explained by a Third Variable? .9 P-value is approximately 0. we saw that teenagers who watch more TV have a tendency later in life to commit more aggressive acts Could there be a lurking variable that influences this association? Perhaps teenagers who watch more TV tend to attain lower educational levels and perhaps lower education tends to be associated with higher levels of aggression -We need to measure potential lurking variables and use them in the statistical analysis . but in practice.Including a potential lurking variable in the study changes it from a bivariate study to a multivariate study .A variable that is held constant in a multivariate analysis is called a control variable This analysis uses three variables: . the z test statistic equals their difference divided by the square root of their sum. 125  2 Note that this result agrees with the confidence interval for p1-p2 calculated earlier McNemar’s Test: z  A practically significant difference: When we find a practically significant difference between two groups. 
P-value: The probability of observing a sample even more extreme than the observed sample Assumptions: The sum of the counts used in the test should be at least 30. the association observed between TV watching and aggressive acts was not because of education .If we thought that education was a potential lurking variable we would want to measure it .Conduct the ordinary bivariate analysis while holding that control variable constant at fixed values (multivariate analysis) .
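The McNemar statistic for the Heaven and Hell example can be checked with a few lines of Python. This is a sketch; the function name `mcnemar_z` is hypothetical, and the two arguments are the counts of subjects who answered “yes” on one question and “no” on the other (125 and 2).

```python
from math import sqrt

def mcnemar_z(yes_no, no_yes):
    """z statistic: difference of the two discordant counts over the root of their sum."""
    return (yes_no - no_yes) / sqrt(yes_no + no_yes)

z = mcnemar_z(125, 2)
large_enough = (125 + 2) >= 30   # guideline from the notes: the sum should be at least 30

print(round(z, 1), large_enough)  # 10.9 True
```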

Chapter 11: Analyzing the Association Between Categorical Variables
Association between categorical variables:
The chi-squared test and measures of association such as \((p_1 - p_2)\) and \(p_1/p_2\) are fundamental methods for analyzing contingency tables
The P-value for \(\chi^2\) summarizes the strength of evidence against \(H_0\): independence
If the P-value is small, then we conclude that somewhere in the contingency table the population cell proportions differ from independence
The chi-squared test does not indicate whether all cells deviate greatly from independence or perhaps only some of them do so
Residual analysis
A cell-by-cell comparison of the observed counts with the counts that are expected when \(H_0\) is true reveals the nature of the evidence against \(H_0\)
The difference between an observed and expected count in a particular cell is called a residual
The residual is negative when fewer subjects are in the cell than expected under \(H_0\)
The residual is positive when more subjects are in the cell than expected under \(H_0\)

To determine whether a residual is large enough to indicate strong evidence of a deviation from independence in that cell, we use an adjusted form of the residual: the standardized residual
The standardized residual for a cell = (observed count - expected count)/se
A standardized residual reports the number of standard errors that an observed count falls from its expected count
The se describes how much the difference would tend to vary in repeated sampling if the variables were independent
Its formula is complex; software can be used to find its value
A large standardized residual value provides evidence against independence in that cell
Example: To what extent do you consider yourself a religious person?
- Interpret the standardized residuals in the table
- The table exhibits large positive residuals for the cells for females who are very religious and for males who are not at all religious
- In these cells, the observed count is much larger than the expected count
- There is strong evidence that the population has more subjects in these cells than if the variables were independent
The table exhibits large negative residuals for the cells for females who are not at all religious and for males who are very religious
In these cells, the observed count is much smaller than the expected count
There is strong evidence that the population has fewer subjects in these cells than if the variables were independent

Fisher's exact test: The chi-squared test of independence is a large-sample test
When the expected frequencies are small, any of them being less than about 5, small-sample tests are more appropriate
Fisher's exact test is a small-sample test of independence
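The standardized residuals described above can be sketched in Python. The notes leave the se formula to software; the version below uses the commonly cited adjusted-residual form, se = sqrt(expected x (1 - row proportion) x (1 - column proportion)), which is an assumption about what that software computes. The counts are invented for illustration and are not the religiosity data.

```python
import math

# Hypothetical counts (invented for illustration): rows are two groups,
# columns are three response categories.
observed = [
    [30, 20, 10],
    [15, 25, 20],
]

n = sum(sum(row) for row in observed)
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]

standardized = []
for i, row in enumerate(observed):
    z_row = []
    for j, obs in enumerate(row):
        # Expected count under independence.
        expected = row_totals[i] * col_totals[j] / n
        # Standard error of (observed - expected) under independence
        # (assumed adjusted-residual formula, see the lead-in).
        se = math.sqrt(
            expected * (1 - row_totals[i] / n) * (1 - col_totals[j] / n)
        )
        z_row.append((obs - expected) / se)
    standardized.append(z_row)

for z_row in standardized:
    print([round(z, 2) for z in z_row])
```

With only two rows, each column's two standardized residuals are equal in size and opposite in sign, matching the pattern in the example: a large positive residual for one group sits beside a large negative residual for the other.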

The calculations for Fisher's exact test are complex
Statistical software can be used to obtain the P-value for the test that the two variables are independent
The smaller the P-value, the stronger the evidence that the variables are associated
Example: This is an experiment conducted by Sir Ronald Fisher
His colleague, Dr. Muriel Bristol, claimed that when drinking tea she could tell whether the milk or the tea had been added to the cup first
Experiment: Fisher asked her to taste eight cups of tea:
- Four had the milk added first
- Four had the tea added first
She was asked to indicate which four had the milk added first
The order of presenting the cups was randomized
Results: (table not shown)
Analysis: The one-sided version of the test pertains to the alternative that her predictions are better than random guessing
Does the P-value suggest that she had the ability to predict better than random guessing?
The P-value of 0.243 does not give much evidence against the null hypothesis
The data did not support Dr. Bristol's claim that she could tell whether the milk or the tea had been added to the cup first
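The one-sided P-value for the tea-tasting experiment can be reproduced by counting. The results table is not shown here, so the sketch below assumes she correctly identified three of the four milk-first cups; that assumption is used because it reproduces the reported P-value of 0.243.

```python
from math import comb

cups_per_type = 4  # four milk-first cups and four tea-first cups
correct = 3        # ASSUMED number of milk-first cups identified correctly

# Number of ways to choose which four of the eight cups are called milk-first.
total_ways = comb(2 * cups_per_type, cups_per_type)

# One-sided P-value: chance of `correct` or more right by guessing alone.
p_value = sum(
    comb(cups_per_type, k) * comb(cups_per_type, cups_per_type - k)
    for k in range(correct, cups_per_type + 1)
) / total_ways

print(f"one-sided P-value = {p_value:.3f}")
```

Under this assumption the P-value is 17/70, which rounds to 0.243.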

Fisher's exact test of independence (summary):
Assumptions: Two binary categorical variables; data are random
Hypotheses: H0: the two variables are independent (p1 = p2); Ha: the two variables are associated (p1 ≠ p2 or p1 > p2 or p1 < p2)
Test statistic: First cell count (this determines the others given the margin totals)
P-value: Probability that the first cell count equals the observed value or a value even more extreme as predicted by Ha
Conclusion: Report the P-value and interpret in context. If a decision is required, reject H0 when P-value ≤ significance level

Chapter 12: Analyzing Association between Quantitative Variables: Regression Analysis

The first step of a regression analysis is to identify the response and explanatory variables
We use y to denote the response variable and x to denote the explanatory variable
The scatterplot: The first step in answering the question of association is to look at the data
A scatterplot is a graphical display of the relationship between the response variable (y-axis) and the explanatory variable (x-axis)
Ex: What do we learn from the scatterplot in the strength study?
The MINITAB output shows the following regression equation: BP = 63.5 + 1.49 (BP_60)
The y-intercept is 63.5 and the slope is 1.49
The slope of 1.49 tells us that predicted maximum bench press increases by about 1.5 pounds for every additional 60-pound bench press an athlete can do

The regression line equation: When the scatterplot shows a linear trend, a straight line can be fitted through the data points to describe that trend
The regression line is: \( \hat{y} = a + bx \)
\( \hat{y} \) is the predicted value of the response variable y; a is the y-intercept and b is the slope
Check for outliers by plotting the data
The regression line can be pulled toward an outlier and away from the general trend of points
Influential points: An observation can be influential in affecting the regression line when two things happen:
- Its x-value is low or high compared to the rest of the data
- It does not fall in the straight-line pattern that the rest of the data have

Residuals are prediction errors: The regression equation is often called a prediction equation
The difference \( y - \hat{y} \) between an observed outcome and its predicted value is the prediction error, called a residual
Each observation has a residual
A residual is the vertical distance between the data point and the regression line
We can summarize how near the regression line the data points fall by:
sum of squared residuals \( = \sum (\text{residuals})^2 = \sum (y - \hat{y})^2 \)
The regression line has the smallest sum of squared residuals and is called the least squares line

Regression model: A line describes how the mean of y depends on x
At a given value of x, the equation \( \hat{y} = a + bx \) predicts a single value of the response variable
But we should not expect all subjects at that value of x to have the same value of y
Variability occurs in the y values
The regression line connects the estimated means of y at the various x values
In summary, \( \hat{y} = a + bx \) describes the relationship between x and the estimated means of y at the various values of x
The population regression equation describes the relationship in the population between x and the means of y:
\( \mu_y = \alpha + \beta x \)
In the population regression equation, α is a population y-intercept and β is a population slope
These are parameters
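The least squares line and its residuals, described above, can be illustrated with a small Python sketch. The data are invented for illustration, and the slope is computed with the standard least squares formulas, which the notes do not state explicitly.

```python
# Hypothetical (x, y) data, invented for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least squares slope b and intercept a for the line y_hat = a + b*x.
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
a = y_bar - b * x_bar


def sum_squared_residuals(intercept, slope):
    """Sum of squared vertical distances from the points to a given line."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


# Residuals are observed minus predicted values.
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
sse = sum_squared_residuals(a, b)

print(f"y_hat = {a:.1f} + {b:.1f}x")
print(f"sum of squared residuals = {sse:.2f}")
# Any other line, such as one with a slightly different slope, does worse.
print(sum_squared_residuals(a, b + 0.1) > sse)
```

For these data the fitted line is y_hat = 2.2 + 0.6x, and the residuals sum to zero, as they always do for a least squares line with an intercept.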

In practice we estimate the population regression equation using the prediction equation for the sample data
The population regression equation merely approximates the actual relationship between x and the population means of y
It is a model
A model is a simple approximation for how variables relate in the population
The regression model: If the true relationship is far from a straight line, this regression model may be a poor one

Chapter 13: Multiple Regression

Regression models: The model that contains only two variables, x and y, is called a bivariate model:
\( \mu_y = \alpha + \beta x \)
Suppose there are two predictors, denoted by x1 and x2
This is called a multiple regression model:
\( \mu_y = \alpha + \beta_1 x_1 + \beta_2 x_2 \)
The multiple regression model relates the mean \( \mu_y \) of a quantitative response variable y to a set of explanatory variables x1, x2, …
Example: For three explanatory variables, the multiple regression equation is:
\( \mu_y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 \)

Example: The sample prediction equation with three explanatory variables is:
\( \hat{y} = a + b_1 x_1 + b_2 x_2 + b_3 x_3 \)
Example: Predicting selling price using house and lot size
The data set "house selling prices" contains observations on 100 home sales in Florida in November 2003
A multiple regression analysis was done with selling price as the response variable and with house size and lot size as the explanatory variables
Output from the analysis:
Prediction equation: \( \hat{y} = -10{,}536 + 53.8 x_1 + 2.84 x_2 \)
where y = selling price, x1 = house size and x2 = lot size
One house listed in the data set had house size = 1240 square feet, lot size = 18,000 square feet and selling price = $145,000
Find its predicted selling price:
\( \hat{y} = -10{,}536 + 53.8(1240) + 2.84(18{,}000) = 107{,}276 \)
Find its residual:
\( y - \hat{y} = 145{,}000 - 107{,}276 = 37{,}724 \)
The residual tells us that the actual selling price was $37,724 higher than predicted

The number of explanatory variables: You should not use many explanatory variables in a multiple regression model unless you have lots of data
A rough guideline is that the sample size n should be at least 10 times the number of explanatory variables
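The predicted price and residual for this house can be recomputed from the printed coefficients. In this sketch the intercept is taken as -10,536, a sign inferred from the reported predicted price, and the function name is hypothetical. Evaluating the rounded coefficients as printed gives 107,296 rather than the 107,276 quoted above. A gap of this size is what one would expect if the software carried more decimal places in the coefficients, though the notes do not say so.

```python
# Coefficients as printed in the prediction equation (rounded values).
# The negative intercept is inferred from the reported prediction.
a, b1, b2 = -10_536, 53.8, 2.84


def predict_price(house_size, lot_size):
    """Predicted selling price (dollars) from house and lot size (square feet)."""
    return a + b1 * house_size + b2 * lot_size


observed_price = 145_000
predicted_price = predict_price(1240, 18_000)

# Residual = observed value minus predicted value.
residual = observed_price - predicted_price

print(f"predicted = {predicted_price:,.0f}")
print(f"residual  = {residual:,.0f}")
```

Either way the residual is positive and close to $37,700, so the interpretation is unchanged: this house sold for well above its predicted price.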

Plotting relationships: Always look at the data before doing a multiple regression
Most software has the option of constructing scatterplots on a single graph for each pair of variables
This is called a scatterplot matrix

Interpretation of multiple regression coefficients: The simplest way to interpret a multiple regression equation looks at it in two dimensions as a function of a single explanatory variable
We can look at it this way by fixing values for the other explanatory variable(s)
Example using the housing data:
- Suppose we fix x1 = house size at 2000 square feet
- The prediction equation becomes:
\( \hat{y} = -10{,}536 + 53.8(2000) + 2.84 x_2 = 97{,}022 + 2.84 x_2 \)
Since the slope coefficient of x2 is 2.84, the predicted selling price increases by $2.84 for every square foot increase in lot size when the house size is 2000 square feet
For a 1000 square-foot increase in lot size, the predicted selling price increases by 1000(2.84) = $2840 when the house size is 2000 square feet
Example using the housing data:
- Suppose we fix x2 = lot size at 30,000 square feet
- The prediction equation becomes:
\( \hat{y} = -10{,}536 + 53.8 x_1 + 2.84(30{,}000) = 74{,}676 + 53.8 x_1 \)
Since the slope coefficient of x1 is 53.8, for houses with a lot size of 30,000 square feet, the predicted selling price increases by $53.80 for every square foot increase in house size
In summary, an increase of a square foot in house size has a larger impact on the selling price ($53.80) than an increase of a square foot in lot size ($2.84)
We can compare slopes for these explanatory variables because their units of measurement are the same (square feet)
Slopes cannot be compared when the units differ

Summarizing the effect while controlling for a variable: The multiple regression model assumes that the slope for a particular explanatory variable is identical for all fixed values of the other explanatory variables
For example, the coefficient of x1 in the prediction equation \( \hat{y} = -10{,}536 + 53.8 x_1 + 2.84 x_2 \) is 53.8 regardless of whether we plug in x2 = 10,000 or x2 = 30,000 or x2 = 50,000

Slopes in multiple regression and in bivariate regression: In multiple regression, a slope describes the effect of an explanatory variable while controlling effects of the other explanatory variables in the model
Bivariate regression has only a single explanatory variable
A slope in bivariate regression describes the effect of that variable while ignoring all other possible explanatory variables
Importance of multiple regression: One of the main uses of multiple regression is to identify potential lurking variables and control for them by including them as explanatory variables in the model

Multiple correlation: To summarize how well a multiple regression model predicts y, we analyze how well the observed y values correlate with the predicted \( \hat{y} \) values
The multiple correlation is the correlation between the observed y values and the predicted \( \hat{y} \) values
- It is denoted by R
For each subject, the regression equation provides a predicted value
Each subject has an observed y-value and a predicted y-value

The correlation computed between all pairs of observed y-values and predicted y-values is the multiple correlation, R
The larger the multiple correlation, the better are the predictions of y by the set of explanatory variables
The R-value always falls between 0 and 1
In this way, the multiple correlation R differs from the bivariate correlation r between y and a single variable x, which falls between -1 and +1

R-squared: For predicting y, the square of R describes the relative improvement from using the prediction equation instead of using the sample mean, \( \bar{y} \)
The error in using the prediction equation to predict y is summarized by the residual sum of squares: \( \sum (y - \hat{y})^2 \)
The error in using \( \bar{y} \) to predict y is summarized by the total sum of squares: \( \sum (y - \bar{y})^2 \)
The proportional reduction in error is:
\( R^2 = \frac{\sum (y - \bar{y})^2 - \sum (y - \hat{y})^2}{\sum (y - \bar{y})^2} \)
The better the predictions are using the regression equation, the larger \( R^2 \) is
For multiple regression, \( R^2 \) is the square of the multiple correlation, R

Example: How well can we predict house selling prices?
For the 100 observations on y = selling price, x1 = house size, and x2 = lot size, a table called the ANOVA (analysis of variance) table was created
The table displays the sums of squares in the SS column
The \( R^2 \) value can be created from the sums of squares in the table:
\( R^2 = \frac{\sum (y - \bar{y})^2 - \sum (y - \hat{y})^2}{\sum (y - \bar{y})^2} = \frac{314{,}433 - 90{,}756}{314{,}433} = 0.711 \)
Using house size and lot size together to predict selling price reduces the prediction error by 71%, relative to using \( \bar{y} \) alone to predict selling price
Find and interpret the multiple correlation:
\( R = \sqrt{R^2} = \sqrt{0.711} = 0.84 \)
There is a strong association between the observed and the predicted selling prices
House size and lot size are very helpful in predicting selling prices
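The arithmetic behind R-squared and the multiple correlation in this example can be checked directly from the two sums of squares quoted in the notes. The variable names below are illustrative.

```python
import math

total_ss = 314_433     # total sum of squares, sum of (y - y_bar)^2
residual_ss = 90_756   # residual sum of squares, sum of (y - y_hat)^2

# Proportional reduction in prediction error.
r_squared = (total_ss - residual_ss) / total_ss

# The multiple correlation is the positive square root of R-squared.
multiple_r = math.sqrt(r_squared)

print(f"R-squared = {r_squared:.3f}")
print(f"R = {multiple_r:.2f}")
```

This reproduces R-squared = 0.711 and R = 0.84.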

If we used a bivariate regression model to predict selling price with house size as the predictor, the r2 value would be 0.58
If we used a bivariate regression model to predict selling price with lot size as the predictor, the r2 value would be 0.51
The multiple regression model has R2 = 0.71, so it provides better predictions than either bivariate model
The single predictor in the data set that is most strongly associated with y is the house's real estate tax assessment (r2 = 0.679)
When we add house size as a second predictor, R2 goes up from 0.679 to 0.730
As other predictors are added, R2 continues to go up, but not by much
R2 does not increase much after a few predictors are in the model
When there are many explanatory variables but the correlations among them are strong, once you have included a few of them in the model, R2 usually doesn't increase much more when you add additional ones
This does not mean that the additional variables are uncorrelated with the response variable
It merely means that they don't add much new power for predicting y, given the values of the predictors already in the model

Properties of R2: The previous example showed that R2 for the multiple regression model was larger than r2 for a bivariate model using only one of the explanatory variables
A key property of R2 is that it cannot decrease when predictors are added to a model
R2 falls between 0 and 1
The larger the value, the better the explanatory variables collectively predict y
R2 = 1 only when all residuals are 0, that is, when all regression predictions are perfect
R2 = 0 when the correlation between y and each explanatory variable equals 0
R2 gets larger, or at worst stays the same, whenever an explanatory variable is added to the multiple regression model

The value of R2 does not depend on the units of measurement

Chapter 14: Comparing Groups: Analysis of Variance Methods

How can we compare several means? ANOVA
The analysis of variance method: Compares means of several groups
- Let g denote the number of groups
- Each group has a corresponding population of subjects
- The means of the response variable for the g populations are denoted: µ1, µ2, … µg
Hypotheses and Assumptions for the ANOVA test
- The analysis of variance is a significance test of the null hypothesis of equal population means: H0: µ1 = µ2 = … = µg
- Alternative hypothesis: Ha: At least two of the population means are unequal
The assumptions for the ANOVA test comparing population means are as follows:
1. The population distributions of the response variable for the g groups are normal with the same standard deviation for each group
2. Randomization:
- In a survey sample, independent random samples are selected from the g populations
- In an experiment, subjects are randomly assigned separately to the g groups
Ex: How long will you tolerate being put on hold? The recorded message is an advertisement, Muzak, or classical music
Denote the holding time means for the populations that these three random samples represent by:
µ1 = mean for the advertisement
µ2 = mean for the Muzak
µ3 = mean for the classical music
The hypotheses for the ANOVA test are:
H0: µ1 = µ2 = µ3
Ha: at least two of the population means are different
Since the sample means are quite different, can we conclude that the population means differ?
This alone is not sufficient evidence to enable us to reject H0

Variability between groups and within groups
The ANOVA method is used to compare population means
It is called ANALYSIS OF VARIANCE because it uses evidence about two types of variability
Ex: Two data sets with equal means but unequal variability
Which case do you think gives stronger evidence against H0: µ1 = µ2 = µ3?
What is the difference between the data in these two cases?
In both cases the variability between pairs of means is the same
In 'Case b' the variability within each sample is much smaller than in 'Case a'
The fact that 'Case b' has less variability within each sample gives stronger evidence against H0

ANOVA F-test statistic: The analysis of variance (ANOVA) F-test statistic is:
\( F = \frac{\text{Between-groups variability}}{\text{Within-groups variability}} \)
The larger the variability between groups relative to the variability within groups, the larger the F test statistic tends to be
The test statistic for comparing means has the F-distribution
The larger the F-test statistic value, the stronger the evidence against H0

ANOVA F-test for comparing population means of several groups
1. Assumptions:
- Independent random samples
- Normal population distributions with equal standard deviations
2. Hypotheses: H0: µ1 = µ2 = … = µg; Ha: at least two of the population means are different

3. Test statistic: \( F = \frac{\text{Between-groups variability}}{\text{Within-groups variability}} \)
The F sampling distribution has df1 = g - 1, df2 = N - g = (total sample size - no. of groups)
4. P-value: Right-tail probability above the observed F-value
5. Conclusion: If a decision is needed, reject H0 if P-value ≤ significance level (such as 0.05)

The variance estimates and the ANOVA table
- Let σ denote the standard deviation for each of the g population distributions
- One assumption for the ANOVA F-test is that each population has the same standard deviation, σ
The F-test statistic is the ratio of two estimates of σ2, the population variance for each group
- The estimate of σ2 in the denominator of the F-test statistic uses the variability within each group
- The estimate of σ2 in the numerator of the F-test statistic uses the variability between each sample mean and the overall mean for all the data
- Computer software displays the two estimates of σ2 in the ANOVA table
- The MS column contains the two estimates, which are called mean squares
- The ratio of the two mean squares is the F-test statistic
- This F-statistic has a P-value
Example, customers' telephone holding time again: Since P-value < 0.05, there is sufficient evidence to reject H0: µ1 = µ2 = µ3
We conclude that a difference exists among the three types of messages in the population mean amount of time that customers are willing to remain on hold
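The two mean squares and their ratio can be computed by hand for a small data set. The sketch below uses invented numbers, not the holding-time data, and stops at the F statistic and its degrees of freedom. The P-value is the right-tail probability of the F distribution and would normally come from statistical software.

```python
# Hypothetical samples for g = 3 groups (invented for illustration).
groups = [
    [1, 2, 3],
    [2, 3, 4],
    [5, 6, 7],
]

g = len(groups)
N = sum(len(sample) for sample in groups)
grand_mean = sum(sum(sample) for sample in groups) / N
group_means = [sum(sample) / len(sample) for sample in groups]

# Between-groups sum of squares: each sample mean against the overall mean.
between_ss = sum(
    len(sample) * (mean - grand_mean) ** 2
    for sample, mean in zip(groups, group_means)
)

# Within-groups sum of squares: each observation against its own group mean.
within_ss = sum(
    (value - mean) ** 2
    for sample, mean in zip(groups, group_means)
    for value in sample
)

df1 = g - 1
df2 = N - g

# Mean squares: the two estimates of the common variance.
ms_between = between_ss / df1
ms_within = within_ss / df2

f_stat = ms_between / ms_within

print(f"df1 = {df1}, df2 = {df2}")
print(f"F = {f_stat:.2f}")
```

Here the sample means (2, 3 and 6) are spread out while the values within each group are tightly clustered, so the between-groups mean square (13) is much larger than the within-groups mean square (1) and F is large.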