Statistics Chapter 4

STATISTICS
INFORMED DECISIONS USING DATA

Fifth Edition
Chapter 4
Describing the Relation between Two Variables
Copyright © 2018, 2014, 2011 Pearson Education, Inc. All Rights Reserved
4.1 Scatter Diagrams and Correlation
Learning Objectives
1. Draw and interpret scatter diagrams
2. Describe the properties of the linear correlation coefficient
3. Compute and interpret the linear correlation coefficient
4. Determine whether a linear relation exists between two
variables
5. Explain the difference between correlation and causation
4.1.1 Draw and Interpret Scatter Diagrams
The response variable is the variable whose value can be
explained by the value of the explanatory or predictor
variable.
A scatter diagram is a graph that shows the relationship
between two quantitative variables measured on the same
individual. Each individual in the data set is represented by a
point in the scatter diagram. The explanatory variable is
plotted on the horizontal axis, and the response variable is
plotted on the vertical axis.
EXAMPLE Drawing and Interpreting a Scatter Diagram
The data shown below are based on a study for drilling rock. The
researchers wanted to determine whether the time it takes to dry
drill a distance of 5 feet in rock increases with the depth at which
the drilling begins. So, depth at which drilling begins is the
explanatory variable, x, and time (in minutes) to drill five feet is
the response variable, y. Draw a scatter diagram of the data.
Depth at Which Drilling Begins, x (in feet)    Time to Drill 5 Feet, y (in minutes)
35     5.88
50     5.99
75     6.74
95     6.1
120    7.47
130    6.93
145    6.42
155    7.97
160    7.92
175    7.62
185    6.89
190    7.9
Source: Penner, R., and Watts, D.G. “Mining Information.”
The American Statistician, Vol. 45, No. 1, Feb. 1991, p. 6.
Various Types of Relations in a Scatter Diagram
Two variables that are linearly related are positively
associated when above-average values of one variable are
associated with above-average values of the other variable
and below-average values of one variable are associated
with below-average values of the other variable. That is, two
variables are positively associated if, whenever the value of
one variable increases, the value of the other variable also
increases.
Two variables that are linearly related are negatively
associated when above-average values of one variable are
associated with below-average values of the other variable.
That is, two variables are negatively associated if, whenever
the value of one variable increases, the value of the other
variable decreases.
4.1.2 Describe the Properties of the Linear Correlation Coefficient
The linear correlation coefficient or Pearson product moment
correlation coefficient is a measure of the
strength and direction of the linear relation between two
quantitative variables. The Greek letter ρ (rho) represents
the population correlation coefficient, and r represents the
sample correlation coefficient. We present only the formula
for the sample correlation coefficient.
Sample Linear Correlation Coefficient
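The formula itself did not survive in this copy of the slides. A standard statement of the sample linear correlation coefficient is sketched below. The notation is an assumption: x̄ and ȳ are the sample means, s_x and s_y the sample standard deviations, and n the number of individuals.

```latex
r = \frac{\displaystyle\sum_{i=1}^{n}
      \left(\frac{x_i - \bar{x}}{s_x}\right)
      \left(\frac{y_i - \bar{y}}{s_y}\right)}{n - 1}
```

Each factor is a z-score, which is why r has no units.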
Properties of the Linear Correlation Coefficient
1. The linear correlation coefficient is always between −1 and 1,
inclusive. That is, −1 ≤ r ≤ 1.
2. If r = + 1, then a perfect positive linear relation exists between
the two variables.
3. If r = −1, then a perfect negative linear relation exists between
the two variables.
4. The closer r is to +1, the stronger the evidence is of a positive
association between the two variables.
5. The closer r is to −1, the stronger the evidence is of a negative
association between the two variables.
6. If r is close to 0, then little or no evidence exists of a linear
relation between the two variables. So r close to 0 does not
imply no relation, just no linear relation.
7. The linear correlation coefficient is a unitless measure of
association. So the unit of measure for x and y plays no role in
the interpretation of r.
8. The correlation coefficient is not resistant. Therefore, an
observation that does not follow the overall pattern of the data
could affect the value of the linear correlation coefficient.
4.1.3 Compute and Interpret the Linear Correlation Coefficient
EXAMPLE Determining the Linear Correlation Coefficient
Determine the linear correlation coefficient of the drilling data.
Depth at Which Drilling Begins, x (in feet)    Time to Drill 5 Feet, y (in minutes)
35     5.88
50     5.99
75     6.74
95     6.1
120    7.47
130    6.93
145    6.42
155    7.97
160    7.92
175    7.62
185    6.89
190    7.9
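One way to carry out this computation is sketched below in plain Python, assuming the usual z-score form of the sample correlation coefficient. The function and variable names are illustrative and do not come from the slides.

```python
from math import sqrt

# Drilling data from the example: depth x (feet), time y (minutes).
depth = [35, 50, 75, 95, 120, 130, 145, 155, 160, 175, 185, 190]
minutes = [5.88, 5.99, 6.74, 6.1, 7.47, 6.93, 6.42, 7.97, 7.92, 7.62, 6.89, 7.9]


def mean(values):
    return sum(values) / len(values)


def sample_sd(values):
    # Sample standard deviation (divides by n - 1).
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def correlation(x, y):
    # Sum of the products of z-scores, divided by n - 1.
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = sample_sd(x), sample_sd(y)
    products = [((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
                for xi, yi in zip(x, y)]
    return sum(products) / (len(x) - 1)


r = correlation(depth, minutes)
print(round(r, 3))  # 0.773
```

The result agrees with the correlation of 0.773 quoted for the drilling data in the later slides.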
IN CLASS ACTIVITY
Correlation
Randomly select six students from the class and have them determine their at-
rest pulse rates and then discuss the following:
1. When determining each at-rest pulse rate, would it be better to count beats for
30 seconds and multiply by 2 or count beats for 1 full minute?
Explain. What are some other ways to find the at-rest pulse rate?
Do any of these methods have an advantage?
2. What effect will physical activity have on pulse rate?
3. Do you think the at-rest pulse rate will have any effect on the pulse rate after
physical activity? If so, how? If not, why not?
Have the same six students jog in place for 3 minutes and then immediately
determine their pulse rates using the same technique as for the at-rest pulse
rates.
4. Draw a scatter diagram for the pulse data using the at-rest data as the
explanatory variable.
5. Comment on the relationship, if any, between the two variables. Is this
consistent with your expectations?
6. Based on the graph, estimate the linear correlation coefficient for the data.
Then compute the correlation coefficient and compare it to your estimate.
4.1.4 Determine whether a Linear Relation Exists between Two Variables
Testing for a Linear Relation

Step 1 Determine the absolute value of the correlation coefficient.
Step 2 Find the critical value in Table II for the given sample size.
Step 3 If the absolute value of the correlation coefficient is greater
than the critical value, we say a linear relation exists
between the two variables. Otherwise, no linear relation
exists.
EXAMPLE Does a Linear Relation Exist?
Determine whether a linear relation exists between time to drill
five feet and depth at which drilling begins. Comment on the type of
relation that appears to exist between time to drill five feet and
depth at which drilling begins.
The correlation between drilling depth and time to drill is 0.773.
The critical value for n = 12 observations is 0.576. Since
0.773 > 0.576, there is a positive linear relation between time to
drill five feet and depth at which drilling begins.
Table II Critical Values for Correlation Coefficient
n     Critical Value
3     0.997
4     0.950
5     0.878
6     0.811
7     0.754
8     0.707
9     0.666
10    0.632
11    0.602
12    0.576
13    0.553
14    0.532
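The three-step test can be sketched in Python as follows. The dictionary holds only the portion of Table II shown here (n = 3 through 14), and the function name is illustrative rather than taken from the slides.

```python
# Critical values copied from the portion of Table II shown in the slides.
CRITICAL_VALUES = {
    3: 0.997, 4: 0.950, 5: 0.878, 6: 0.811, 7: 0.754, 8: 0.707,
    9: 0.666, 10: 0.632, 11: 0.602, 12: 0.576, 13: 0.553, 14: 0.532,
}


def linear_relation_exists(r, n):
    # Step 1: absolute value of r. Step 2: look up the critical value.
    # Step 3: a linear relation exists if |r| exceeds the critical value.
    return abs(r) > CRITICAL_VALUES[n]


# Drilling data: r = 0.773 with n = 12 observations.
print(linear_relation_exists(0.773, 12))  # True, since 0.773 > 0.576
```

The sign of r then gives the direction. Here r is positive, so the relation is positive.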
4.1.5 Explain the Difference between Correlation and Causation
According to data obtained from the Statistical Abstract of
the United States, the correlation between the percentage of
the female population with a bachelor’s degree and the
percentage of births to unmarried mothers since 1990 is
0.940.
Does this mean that a higher percentage of females with
bachelor’s degrees causes a higher percentage of births to
unmarried mothers?
Certainly not! The correlation exists only because both
percentages have been increasing since 1990. It is this
relation that causes the high correlation. In general, time
series data (data collected over time) may have high
correlations because each variable is moving in a specific
direction over time (both going up or down over time; one
increasing, while the other is decreasing over time).
When data are observational, we cannot claim a causal
relation exists between two variables. We can only claim
causality when the data are collected through a designed
experiment.
Another way that two variables can be related even though
there is not a causal relation is through a lurking variable.
A lurking variable is related to both the explanatory and
response variable.
For example, ice cream sales and crime rates have a very
high correlation. Does this mean that local governments
should shut down all ice cream shops? No! The lurking
variable is temperature. As air temperatures rise, both ice
cream sales and crime rates rise.
EXAMPLE Lurking Variables in a Bone Mineral Density Study
Because colas tend to replace healthier beverages and colas contain
caffeine and phosphoric acid, researchers Katherine L. Tucker and
associates wanted to know whether cola consumption is associated
with lower bone mineral density in women. The table lists the
typical number of cans of cola consumed in a week and the femoral
neck bone mineral density for a sample of 15 women. The data were
collected through a prospective cohort study.
Table 4
Number of Colas per Week    Bone Mineral Density (g/cm²)
0    0.893
0    0.882
1    0.891
1    0.881
2    0.888
2    0.871
3    0.868
3    0.876
4    0.873
5    0.875
5    0.871
6    0.867
7    0.862
7    0.872
8    0.865
The figure on the next slide shows the scatter diagram of the data.
The correlation between number of colas per week and bone mineral
density is −0.806. The critical value for correlation with n = 15 from
Table II in Appendix A is 0.514. Because |−0.806| > 0.514, we
conclude a negative linear relation exists between number of colas
consumed and bone mineral density. Can the authors conclude that
an increase in the number of colas consumed causes a decrease in
bone mineral density? Identify some lurking variables in the study.
[Figure: scatter diagram of bone mineral density versus number of colas per week]
In prospective cohort studies, data are collected on a group of subjects
through questionnaires and surveys over time. Therefore, the data are
observational. So the researchers cannot claim that increased cola
consumption causes a decrease in bone mineral density.
Some lurking variables in the study that could confound the results are:
• body mass index
• height
• smoking
• alcohol consumption
• calcium intake
• physical activity
The authors were careful to say that increased cola consumption
is associated with lower bone mineral density because of potential
lurking variables. They never stated that increased cola
consumption causes lower bone mineral density.
4.2 Least-squares Regression
Learning Objectives
1. Find the least-squares regression line and use the line to
make predictions
2. Interpret the slope and the y-intercept of the least-squares
regression line
3. Compute the sum of squared residuals
EXAMPLE Finding an Equation that Describes Linearly Related Data
Using the following sample data:
x    0     2     3     5     6     6
y    5.8   5.7   5.2   2.8   1.9   2.2
(b) Graph the equation on the scatter diagram.
4.2.1 Find the Least-Squares Regression Line and Use the Line to Make Predictions
The difference between the observed value of y and the predicted value of y is
the error, or residual.
Using the line from the last example, and the predicted value at x = 3:
residual = observed y − predicted y
= 5.2 − 4.75
= 0.45
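The residual calculation can be reproduced with a short Python sketch. The slides do not show the line from the earlier example, so the code assumes it is the line through the data points (2, 5.7) and (6, 1.9). That choice is an assumption, made because it reproduces the predicted value 4.75 at x = 3.

```python
# Sample data from the example.
x = [0, 2, 3, 5, 6, 6]
y = [5.8, 5.7, 5.2, 2.8, 1.9, 2.2]

# ASSUMPTION: the line is taken to pass through (2, 5.7) and (6, 1.9);
# the slides do not state the equation.
slope = (1.9 - 5.7) / (6 - 2)   # about -0.95
intercept = 5.7 - slope * 2     # about 7.6


def predict(x_value):
    return slope * x_value + intercept


# residual = observed y - predicted y, for every data point
residuals = [yi - predict(xi) for xi, yi in zip(x, y)]

print(round(predict(3), 2))    # 4.75
print(round(residuals[2], 2))  # 0.45
```

A positive residual means the observed value lies above the line, and a negative residual means it lies below.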
Least-Squares Regression Criterion
The Least-Squares Regression Line

The equation of the least-squares regression line is given by
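The criterion and the equation did not survive in this copy of the slides. A standard statement of both is sketched below. The notation is an assumption: r is the sample correlation coefficient, s_x and s_y are the sample standard deviations, and x̄ and ȳ are the sample means.

```latex
% Criterion: the least-squares line minimizes the sum of squared residuals
\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2

% Equation of the least-squares regression line
\hat{y} = b_1 x + b_0,
\qquad b_1 = r \cdot \frac{s_y}{s_x},
\qquad b_0 = \bar{y} - b_1 \bar{x}
```

For the drilling data (r = 0.773, s_x = 52.2, s_y = 0.781), this gives b_1 = 0.773(0.781/52.2) ≈ 0.0116, which agrees with the slope reported later in the slides.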
EXAMPLE Finding the Least-squares Regression Line
Using the drilling data:
(a) Find the least-squares regression line.
(b) Predict the drilling time if drilling starts at 130 feet.
(c) Is the observed drilling time at 130 feet above or below average?
(d) Draw the least-squares regression line on the scatter diagram of
the data.
Depth at Which Drilling Begins, x (in feet)    Time to Drill 5 Feet, y (in minutes)
35     5.88
50     5.99
75     6.74
95     6.1
120    7.47
130    6.93
145    6.42
155    7.97
160    7.92
175    7.62
185    6.89
190    7.9
(a) The least-squares regression line is ŷ = 0.0116x + 5.5273.
(b) The predicted drilling time is ŷ = 0.0116(130) + 5.5273 = 7.035
minutes.
(c) The observed drilling time is 6.93 minutes. The predicted
drilling time is 7.035 minutes. The drilling time of 6.93
minutes is below average.
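The fit can be checked with a minimal Python sketch. It computes the slope as S_xy / S_xx, which is algebraically the same as r · s_y / s_x. The function name is illustrative, not from the slides.

```python
# Drilling data: depth x (feet), time y (minutes).
depth = [35, 50, 75, 95, 120, 130, 145, 155, 160, 175, 185, 190]
minutes = [5.88, 5.99, 6.74, 6.1, 7.47, 6.93, 6.42, 7.97, 7.92, 7.62, 6.89, 7.9]


def least_squares(x, y):
    # Returns (slope, intercept) of the least-squares regression line.
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return slope, intercept


slope, intercept = least_squares(depth, minutes)
predicted_130 = slope * 130 + intercept

print(round(slope, 4), round(intercept, 4))  # 0.0116 5.5273
print(round(predicted_130, 2))               # 7.03
```

The unrounded slope gives a prediction of about 7.03 minutes at 130 feet. The slides' 7.035 comes from using the slope rounded to 0.0116. Either way, the observed 6.93 minutes is below the predicted value.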
4.2.2 Interpret the Slope and the y-Intercept of the Least-Squares Regression Line
Interpretation of Slope:
The slope of the regression line is 0.0116. For each additional foot
of depth we start drilling, the time to drill five feet increases by
0.0116 minutes, on average.
Interpretation of the y-Intercept:

The y-intercept of the regression line is 5.5273. To interpret the y-
intercept, we must first ask two questions:
1. Is 0 a reasonable value for the explanatory variable?
2. Do any observations near x = 0 exist in the data set?
A value of 0 is reasonable for the drilling data (this indicates that
drilling begins at the surface of Earth). The smallest observation in
the data set is x = 35 feet, which is reasonably close to 0. So,
interpretation of the y-intercept is reasonable.
The time to drill five feet when we begin drilling at the surface of
Earth is 5.5273 minutes.
If the least-squares regression line is used to make predictions
based on values of the explanatory variable that are much larger
or much smaller than the observed values, we say the researcher
is working outside the scope of the model. Never use a least-
squares regression line to make predictions outside the scope of
the model because we can’t be sure the linear relation continues
to exist.
4.2.3 Compute the Sum of Squared Residuals
To illustrate the fact that the sum of squared residuals for a
least-squares regression line is less than the sum of squared
residuals for any other line, use the “regression by eye”
applet.
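The same comparison can be made numerically without the applet. The sketch below uses the six-point sample data from Section 4.2 and compares the least-squares line with one hypothetical alternative, the line through (2, 5.7) and (6, 1.9). The alternative line is an assumption chosen for illustration.

```python
# Six-point sample data from the Section 4.2 example.
x = [0, 2, 3, 5, 6, 6]
y = [5.8, 5.7, 5.2, 2.8, 1.9, 2.2]


def least_squares(x, y):
    # Returns (slope, intercept) of the least-squares regression line.
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    return slope, y_bar - slope * x_bar


def sum_squared_residuals(slope, intercept):
    # Sum over all points of (observed y - predicted y) squared.
    return sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))


ls_slope, ls_intercept = least_squares(x, y)
ssr_least_squares = sum_squared_residuals(ls_slope, ls_intercept)

# Hypothetical alternative: the line through (2, 5.7) and (6, 1.9).
ssr_other = sum_squared_residuals(-0.95, 7.6)

print(ssr_least_squares < ssr_other)  # True
print(round(ssr_other, 3))            # 3.535
```

Any other choice of slope and intercept can be passed to `sum_squared_residuals`, and the result will never fall below the least-squares value.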
4.3 Diagnostics on the Least-squares Regression Line
Learning Objectives
1. Compute and interpret the coefficient of determination
4.3.1 Compute and Interpret the Coefficient of Determination
The coefficient of determination, R², measures the
proportion of total variation in the response variable that is
explained by the least-squares regression line.
The coefficient of determination is a number between 0 and
1, inclusive. That is, 0 ≤ R² ≤ 1.
If R² = 0, the line has no explanatory value.
If R² = 1, the line explains 100% of the variation in the
response variable.
The data below are based on a study for drilling rock. The
researchers wanted to determine whether the time it takes to dry
drill a distance of 5 feet in rock increases with the depth at
which the drilling begins. So, depth at which drilling begins is
the predictor variable, x, and time (in minutes) to drill five feet
is the response variable, y.
Depth at Which Drilling Begins, x (in feet)    Time to Drill 5 Feet, y (in minutes)
35     5.88
50     5.99
75     6.74
95     6.1
120    7.47
130    6.93
145    6.42
155    7.97
160    7.92
175    7.62
185    6.89
190    7.9
Source: Penner, R., and Watts, D.G. “Mining Information.”
The American Statistician, Vol. 45, No. 1, Feb. 1991, p. 6.
Sample Statistics
         Mean     Standard Deviation
Depth    126.2    52.2
Time     6.99     0.781
Correlation Between Depth and Time: 0.773
Regression Analysis
The regression equation is
Time = 5.53 + 0.0116 Depth
Suppose we were asked to predict the time to drill an
additional 5 feet, but we did not know the current depth of
the drill. What would be our best “guess”?
ANSWER:
The mean time to drill an additional 5 feet: 6.99 minutes
Now suppose that we are asked to predict the time to drill an
additional 5 feet if the current depth of the drill is 160 feet.
ANSWER:
Using the regression equation, Time = 5.53 + 0.0116(160) = 7.39
minutes.
The difference between the observed value of the response
variable and the mean value of the response variable is
called the total deviation and is equal to y − ȳ.
The difference between the predicted value of the response
variable and the mean value of the response variable is
called the explained deviation and is equal to ŷ − ȳ.
The difference between the observed value of the response
variable and the predicted value of the response variable is
called the unexplained deviation and is equal to y − ŷ.
Total Deviation = Unexplained Deviation + Explained Deviation
y − ȳ = (y − ŷ) + (ŷ − ȳ)
Total Variation = Unexplained Variation + Explained Variation
Σ(y − ȳ)² = Σ(y − ŷ)² + Σ(ŷ − ȳ)²
To determine R² for the linear regression model, simply
square the value of the linear correlation coefficient.
EXAMPLE Determining the Coefficient of Determination
Find and interpret the coefficient of determination for the drilling
data.
Because the linear correlation coefficient, r, is 0.773, we have that
R² = 0.773² = 0.5975 = 59.75%.
So, 59.75% of the variability in drilling time is explained by the
least-squares regression line.
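As a cross-check, R² can also be computed directly from its definition as explained variation divided by total variation. The Python sketch below does this for the drilling data. The variable names are illustrative, not from the slides.

```python
# Drilling data: depth x (feet), time y (minutes).
depth = [35, 50, 75, 95, 120, 130, 145, 155, 160, 175, 185, 190]
minutes = [5.88, 5.99, 6.74, 6.1, 7.47, 6.93, 6.42, 7.97, 7.92, 7.62, 6.89, 7.9]

n = len(depth)
x_bar = sum(depth) / n
y_bar = sum(minutes) / n

# Least-squares slope and intercept.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(depth, minutes))
s_xx = sum((x - x_bar) ** 2 for x in depth)
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

predicted = [slope * x + intercept for x in depth]

# Variation = sum of squared deviations.
total = sum((y - y_bar) ** 2 for y in minutes)
explained = sum((y_hat - y_bar) ** 2 for y_hat in predicted)
unexplained = sum((y - y_hat) ** 2 for y, y_hat in zip(minutes, predicted))

r_squared = explained / total
print(round(r_squared, 3))  # 0.597
```

The value matches the square of the correlation coefficient up to rounding. The slides' 0.5975 comes from squaring r after rounding it to 0.773.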
DATA SET A      DATA SET B      DATA SET C
X     Y         X     Y         X     Y
3.6   8.9       3.1   8.9       2.8   8.9
8.3   15.0      9.4   15.0      8.1   15.0
0.5   4.8       1.2   4.8       3.0   4.8
1.4   6.0       1.0   6.0       8.3   6.0
8.2   14.9      9.0   14.9      8.2   14.9
5.9   11.9      5.0   11.9      1.4   11.9
4.3   9.8       3.4   9.8       1.0   9.8
8.3   15.0      7.4   15.0      7.9   15.0
0.3   4.7       0.1   4.7       5.9   4.7
6.8   13.0      7.5   13.0      5.0   13.0
Draw a scatter diagram for each of these data sets. For each data
set, the variance of y is 17.49.
Data Set A: 99.99% of the variability in y is explained by the
least-squares regression line.
Data Set B: 94.7% of the variability in y is explained by the
least-squares regression line.
Data Set C: 9.4% of the variability in y is explained by the
least-squares regression line.
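These three percentages can be reproduced with a short Python sketch that computes R² as the square of the correlation coefficient. All three data sets share the same y values, so only the x values differ. The function name is illustrative.

```python
# The y column is identical in all three data sets.
y = [8.9, 15.0, 4.8, 6.0, 14.9, 11.9, 9.8, 15.0, 4.7, 13.0]
x_sets = {
    "A": [3.6, 8.3, 0.5, 1.4, 8.2, 5.9, 4.3, 8.3, 0.3, 6.8],
    "B": [3.1, 9.4, 1.2, 1.0, 9.0, 5.0, 3.4, 7.4, 0.1, 7.5],
    "C": [2.8, 8.1, 3.0, 8.3, 8.2, 1.4, 1.0, 7.9, 5.9, 5.0],
}


def r_squared(x, y):
    # R^2 = S_xy^2 / (S_xx * S_yy), the square of the correlation coefficient.
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy ** 2 / (s_xx * s_yy)


y_bar = sum(y) / len(y)
variance_y = sum((yi - y_bar) ** 2 for yi in y) / (len(y) - 1)
results = {name: r_squared(x, y) for name, x in x_sets.items()}

print(round(variance_y, 2))  # 17.49
for name, value in results.items():
    print(name, f"{value:.1%}")
```

The variance of y is the same in every case, so the differences in R² come entirely from how closely each set of x values tracks y.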
4.4 Contingency Tables and Association
Learning Objectives
1. Compute the marginal distribution of a variable
2. Use the conditional distribution to identify association
among categorical data
3. Explain Simpson’s Paradox
Example: Data Information
A professor at a community college in New Mexico conducted a study to assess
the effectiveness of delivering an introductory statistics course via a traditional
lecture-based method, online delivery (no classroom instruction), and hybrid
instruction (online course with weekly meetings). The grades students
received in each of the courses were tallied.
Grade    Traditional    Online    Hybrid
A        36             39        24
B        52             55        66
C        57             68        90
D        46             38        41
F        46             54        31
4.4.1 Compute the Marginal Distribution of a Variable
A marginal distribution of a variable is a frequency or
relative frequency distribution of either the row or column
variable in the contingency table.
EXAMPLE Determining Frequency Marginal Distributions
A professor at a community college in New Mexico conducted a study to
assess the effectiveness of delivering an introductory statistics course via
a traditional lecture-based method, online delivery (no classroom
instruction), and hybrid instruction (online course with weekly meetings).
The grades students received in each of the courses were
tallied. Find the frequency marginal distributions for course grade and
delivery method.
Grade    Traditional    Online    Hybrid    Total
A        36             39        24        99
B        52             55        66        173
C        57             68        90        215
D        46             38        41        125
F        46             54        31        131
Total    237            254       252       743
EXAMPLE Determining Relative Frequency Marginal Distributions
Determine the relative frequency marginal distribution for course
grade and delivery method.
Grade                 Traditional    Online    Hybrid    Relative Frequency
A                     36             39        24        0.133
B                     52             55        66        0.233
C                     57             68        90        0.289
D                     46             38        41        0.168
F                     46             54        31        0.176
Relative Frequency    0.319          0.342     0.339     1.000
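Both marginal distributions can be computed from the cell counts with a few lines of Python. This is a sketch in which the contingency table is stored as a dictionary of rows. The layout and names are illustrative, not from the slides.

```python
methods = ["Traditional", "Online", "Hybrid"]

# Cell counts from the contingency table: grade -> counts by delivery method.
counts = {
    "A": [36, 39, 24],
    "B": [52, 55, 66],
    "C": [57, 68, 90],
    "D": [46, 38, 41],
    "F": [46, 54, 31],
}

# Frequency marginal distributions: row totals and column totals.
grade_totals = {grade: sum(row) for grade, row in counts.items()}
method_totals = {
    method: sum(row[j] for row in counts.values())
    for j, method in enumerate(methods)
}
grand_total = sum(grade_totals.values())

# Relative frequency marginal distributions: each total over the grand total.
grade_marginal = {g: round(t / grand_total, 3) for g, t in grade_totals.items()}
method_marginal = {m: round(t / grand_total, 3) for m, t in method_totals.items()}

print(grade_totals)     # {'A': 99, 'B': 173, 'C': 215, 'D': 125, 'F': 131}
print(method_totals)    # {'Traditional': 237, 'Online': 254, 'Hybrid': 252}
print(grade_marginal)   # {'A': 0.133, 'B': 0.233, 'C': 0.289, 'D': 0.168, 'F': 0.176}
print(method_marginal)  # {'Traditional': 0.319, 'Online': 0.342, 'Hybrid': 0.339}
```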
4.4.2 Use the Conditional Distribution to Identify Association among Categorical Data
A conditional distribution lists the relative frequency of
each category of the response variable, given a specific
value of the explanatory variable in the contingency table.
EXAMPLE Determining a Conditional Distribution
Construct a conditional distribution of course grade by method of
delivery. Comment on any type of association that may exist between
course grade and delivery method.
Grade    Traditional    Online    Hybrid
A        36             39        24
B        52             55        66
C        57             68        90
D        46             38        41
F        46             54        31
Conditional distribution of course grade by delivery method:
Grade    Traditional    Online    Hybrid
A        0.152          0.154     0.095
B        0.219          0.217     0.262
C        0.241          0.268     0.357
D        0.194          0.150     0.163
F        0.194          0.213     0.123
It appears that students in the hybrid course are more likely to
pass (A, B, or C) than students in the other two delivery methods.
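The conditional distribution, and the pass rates behind the comment above, can be sketched in Python as follows. Each cell count is divided by its column total, because delivery method is the explanatory variable. The names are illustrative, not from the slides.

```python
grades = ["A", "B", "C", "D", "F"]

# Cell counts by delivery method, listed in the grade order above.
counts = {
    "Traditional": [36, 52, 57, 46, 46],
    "Online": [39, 55, 68, 38, 54],
    "Hybrid": [24, 66, 90, 41, 31],
}

# Conditional distribution of grade given method: count / column total.
conditional = {
    method: {g: round(c / sum(column), 3) for g, c in zip(grades, column)}
    for method, column in counts.items()
}

# Proportion passing (A, B, or C) within each delivery method.
pass_rate = {
    method: round(sum(column[:3]) / sum(column), 3)
    for method, column in counts.items()
}

print(conditional["Hybrid"])  # {'A': 0.095, 'B': 0.262, 'C': 0.357, 'D': 0.163, 'F': 0.123}
print(pass_rate)              # {'Traditional': 0.612, 'Online': 0.638, 'Hybrid': 0.714}
```

The hybrid pass rate of about 71% against roughly 61% and 64% for the other two methods is what the comment in the example refers to.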
EXAMPLE Drawing a Bar Graph of a Conditional Distribution
Using the results of the previous example, draw a bar graph that
represents the conditional distribution of method of delivery by
grade earned.
The following contingency table shows the survival status and
demographics of passengers on the ill-fated Titanic.
            Men     Women    Boys    Girls
Survived    334     318      29      27
Died        1360    104      35      18
Draw a conditional bar graph of survival status by demographic
characteristic.
4.4.3 Explain Simpson’s Paradox
EXAMPLE Illustrating Simpson’s Paradox
Insulin dependent (or Type 1) diabetes is a disease that results in
the permanent destruction of insulin-producing beta cells of the
pancreas. Type 1 diabetes is lethal unless treatment with insulin
injections replaces the missing hormone. Individuals with insulin
independent (or Type 2) diabetes can produce insulin internally.
The data shown in the table below represent the survival status of
902 patients with diabetes by type over a 5-year period.
            Type 1    Type 2    Total
Survived    253       326       579
Died        105       218       323
Total       358       544       902
From the table, 253/358 = 70.7% of the Type 1 patients survived,
compared with 326/544 = 59.9% of the Type 2 patients.
However, Type 2 diabetes is usually contracted after the age of
40. If we account for the variable age and divide our patients into
two groups (those 40 or younger and those over 40), we obtain
the data in the table below.

            Type 1            Type 2
            ≤ 40     > 40     ≤ 40     > 40     Total
Survived    129      124      15       311      579
Died        1        104      0        218      323
Total       130      228      15       529      902
Simpson’s Paradox describes a situation in which an
association between two variables inverts or goes away
when a third variable is introduced to the analysis.
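The reversal in the diabetes data can be verified with a short Python sketch that compares survival rates before and after splitting by age. The dictionary layout and names are illustrative, not from the slides.

```python
# (survived, died) counts from the tables above.
overall = {"Type 1": (253, 105), "Type 2": (326, 218)}
by_age = {
    "40 or younger": {"Type 1": (129, 1), "Type 2": (15, 0)},
    "over 40": {"Type 1": (124, 104), "Type 2": (311, 218)},
}


def survival_rate(survived, died):
    return survived / (survived + died)


overall_rates = {t: survival_rate(*c) for t, c in overall.items()}
age_rates = {
    age: {t: survival_rate(*c) for t, c in groups.items()}
    for age, groups in by_age.items()
}

print({t: round(v, 3) for t, v in overall_rates.items()})
# {'Type 1': 0.707, 'Type 2': 0.599}
for age, rates in age_rates.items():
    print(age, {t: round(v, 3) for t, v in rates.items()})
# 40 or younger {'Type 1': 0.992, 'Type 2': 1.0}
# over 40 {'Type 1': 0.544, 'Type 2': 0.588}
```

Overall, Type 1 patients have the higher survival rate, yet within each age group Type 2 patients do at least as well. Age is the third variable that inverts the association. Most Type 2 patients are over 40, the group with lower survival for both types.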
