
STATISTICS

INFORMED DECISIONS USING DATA


Fifth Edition

Chapter 4
Describing the
Relation between
Two Variables

Copyright © 2018, 2014, 2011 Pearson Education, Inc. All Rights Reserved
4.1 Scatter Diagrams and Correlation
Learning Objectives
1. Draw and interpret scatter diagrams
2. Describe the properties of the linear correlation coefficient
3. Compute and interpret the linear correlation coefficient
4. Determine whether a linear relation exists between two
variables
5. Explain the difference between correlation and causation

4.1 Scatter Diagrams and Correlation
4.1.1 Draw and Interpret Scatter Diagrams (1 of 6)
The response variable is the variable whose value can be
explained by the value of the explanatory or predictor
variable.
A scatter diagram is a graph that shows the relationship
between two quantitative variables measured on the same
individual. Each individual in the data set is represented by a
point in the scatter diagram. The explanatory variable is
plotted on the horizontal axis, and the response variable is
plotted on the vertical axis.

4.1 Scatter Diagrams and Correlation
4.1.1 Draw and Interpret Scatter Diagrams (2 of 6)
EXAMPLE Drawing and Interpreting a Scatter Diagram
The data shown below are based on a study for drilling rock. The
researchers wanted to determine whether the time it takes to dry drill a
distance of 5 feet in rock increases with the depth at which the drilling
begins. So, depth at which drilling begins is the explanatory variable, x,
and time (in minutes) to drill five feet is the response variable, y. Draw
a scatter diagram of the data.

Depth at Which Drilling Begins, x (in feet)    Time to Drill 5 Feet, y (in minutes)
35     5.88
50     5.99
75     6.74
95     6.1
120    7.47
130    6.93
145    6.42
155    7.97
160    7.92
175    7.62
185    6.89
190    7.9

Source: Penner, R., and Watts, D.G. “Mining Information.”
The American Statistician, Vol. 45, No. 1, Feb. 1991, p. 6.

4.1 Scatter Diagrams and Correlation
4.1.1 Draw and Interpret Scatter Diagrams (3 of 6)

4.1 Scatter Diagrams and Correlation
4.1.1 Draw and Interpret Scatter Diagrams (4 of 6)
Various Types of Relations in a Scatter Diagram

4.1 Scatter Diagrams and Correlation
4.1.1 Draw and Interpret Scatter Diagrams (5 of 6)
Two variables that are linearly related are positively
associated when above-average values of one variable are
associated with above-average values of the other variable
and below-average values of one variable are associated
with below-average values of the other variable. That is, two
variables are positively associated if, whenever the value of
one variable increases, the value of the other variable also
increases.

4.1 Scatter Diagrams and Correlation
4.1.1 Draw and Interpret Scatter Diagrams (6 of 6)
Two variables that are linearly related are negatively
associated when above-average values of one variable are
associated with below-average values of the other variable.
That is, two variables are negatively associated if, whenever
the value of one variable increases, the value of the other
variable decreases.

4.1 Scatter Diagrams and Correlation
4.1.2 Describe the Properties of the Linear Correlation
Coefficient (1 of 6)

The linear correlation coefficient or Pearson product
moment correlation coefficient is a measure of the
strength and direction of the linear relation between two
quantitative variables. The Greek letter ρ (rho) represents
the population correlation coefficient, and r represents the
sample correlation coefficient. We present only the formula
for the sample correlation coefficient.

4.1 Scatter Diagrams and Correlation
4.1.2 Describe the Properties of the Linear Correlation
Coefficient (2 of 6)

Sample Linear Correlation Coefficient
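A standard way to write the sample correlation coefficient is sketched below. The notation is assumed rather than taken from the slide: x̄ and ȳ are the sample means, s_x and s_y are the sample standard deviations, and n is the number of individuals.

```latex
r = \frac{\displaystyle\sum \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)}{n - 1}
```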

4.1 Scatter Diagrams and Correlation
4.1.2 Describe the Properties of the Linear Correlation
Coefficient (3 of 6)
Properties of the Linear Correlation Coefficient
1. The linear correlation coefficient is always between −1 and 1,
inclusive. That is, −1 ≤ r ≤ 1.
2. If r = + 1, then a perfect positive linear relation exists between
the two variables.
3. If r = −1, then a perfect negative linear relation exists between
the two variables.
4. The closer r is to +1, the stronger the evidence is of a positive
association between the two variables.
5. The closer r is to −1, the stronger the evidence is of a negative
association between the two variables.

4.1 Scatter Diagrams and Correlation
4.1.2 Describe the Properties of the Linear Correlation
Coefficient (4 of 6)

6. If r is close to 0, then little or no evidence exists of a linear
relation between the two variables. So r close to 0 does not
imply no relation, just no linear relation.
7. The linear correlation coefficient is a unitless measure of
association. So the unit of measure for x and y plays no role in
the interpretation of r.
8. The correlation coefficient is not resistant. Therefore, an
observation that does not follow the overall pattern of the data
could affect the value of the linear correlation coefficient.
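Properties 7 and 8 can be checked numerically. The sketch below reuses the drilling data from the earlier example; the helper name `pearson_r`, the feet-to-meters conversion, and the extra observation (200, 2.0) are illustrative assumptions, not part of the original study.

```python
import math

def pearson_r(xs, ys):
    """Sample linear correlation coefficient of paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Drilling data from the running example.
depth_ft = [35, 50, 75, 95, 120, 130, 145, 155, 160, 175, 185, 190]
time_min = [5.88, 5.99, 6.74, 6.1, 7.47, 6.93, 6.42, 7.97, 7.92, 7.62, 6.89, 7.9]

r_feet = pearson_r(depth_ft, time_min)

# Property 7: converting depth from feet to meters leaves r unchanged.
depth_m = [0.3048 * d for d in depth_ft]
r_meters = pearson_r(depth_m, time_min)

# Property 8: one hypothetical observation that breaks the pattern
# (made up for illustration) changes r substantially.
r_with_outlier = pearson_r(depth_ft + [200], time_min + [2.0])

print(round(r_feet, 3), round(r_meters, 3), round(r_with_outlier, 3))
```

The first two printed values should agree, while the third should be much smaller, which is what "not resistant" means in practice.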

4.1 Scatter Diagrams and Correlation
4.1.2 Describe the Properties of the Linear Correlation
Coefficient (5 of 6)

4.1 Scatter Diagrams and Correlation
4.1.2 Describe the Properties of the Linear Correlation
Coefficient (6 of 6)

4.1 Scatter Diagrams and Correlation
4.1.3 Compute and Interpret the Linear Correlation
Coefficient (1 of 5)
EXAMPLE Determining the Linear Correlation Coefficient
Determine the linear correlation coefficient of the drilling data.

Depth at Which Drilling Begins, x (in feet)    Time to Drill 5 Feet, y (in minutes)
35     5.88
50     5.99
75     6.74
95     6.1
120    7.47
130    6.93
145    6.42
155    7.97
160    7.92
175    7.62
185    6.89
190    7.9
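One way to carry out this computation is sketched below. It assumes the usual z-score form of the sample correlation coefficient (the average product of standardized values, dividing by n − 1); the function name `correlation` is illustrative.

```python
from statistics import mean, stdev

depth = [35, 50, 75, 95, 120, 130, 145, 155, 160, 175, 185, 190]
minutes = [5.88, 5.99, 6.74, 6.1, 7.47, 6.93, 6.42, 7.97, 7.92, 7.62, 6.89, 7.9]

def correlation(xs, ys):
    """Sum of z-score products divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys)]
    return sum(products) / (n - 1)

r = correlation(depth, minutes)
print(f"r = {r:.3f}")
```

The result should agree, to three decimal places, with the value 0.773 used for the drilling data elsewhere in this chapter.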
4.1 Scatter Diagrams and Correlation
4.1.3 Compute and Interpret the Linear Correlation
Coefficient (2 of 5)

4.1 Scatter Diagrams and Correlation
4.1.3 Compute and Interpret the Linear Correlation
Coefficient (3 of 5)

4.1 Scatter Diagrams and Correlation
4.1.3 Compute and Interpret the Linear Correlation
Coefficient (4 of 5)

IN CLASS ACTIVITY
Correlation
Randomly select six students from the class and have them determine their at-
rest pulse rates and then discuss the following:
1. When determining each at-rest pulse rate, would it be better to count beats for
30 seconds and multiply by 2 or count beats for 1 full minute?
Explain. What are some other ways to find the at-rest pulse rate?
Do any of these methods have an advantage?
2. What effect will physical activity have on pulse rate?
3. Do you think the at-rest pulse rate will have any effect on the pulse rate after
physical activity? If so, how? If not, why not?
Have the same six students jog in place for 3 minutes and then immediately
determine their pulse rates using the same technique as for the at-rest pulse
rates.
4.1 Scatter Diagrams and Correlation
4.1.3 Compute and Interpret the Linear Correlation
Coefficient (5 of 5)
4. Draw a scatter diagram for the pulse data using the at-rest data as the
explanatory variable.
5. Comment on the relationship, if any, between the two variables. Is this
consistent with your expectations?
6. Based on the graph, estimate the linear correlation coefficient for the data.
Then compute the correlation coefficient and compare it to your estimate.

4.1 Scatter Diagrams and Correlation
4.1.4 Determine whether a Linear Relation Exists between
Two Variables (1 of 2)

Testing for a Linear Relation


Step 1 Determine the absolute value of the correlation coefficient.
Step 2 Find the critical value in Table II for the given sample size.
Step 3 If the absolute value of the correlation coefficient is greater
than the critical value, we say a linear relation exists
between the two variables. Otherwise, no linear relation
exists.

4.1 Scatter Diagrams and Correlation
4.1.4 Determine whether a Linear Relation Exists between
Two Variables (2 of 2)
EXAMPLE Does a Linear Relation Exist?
Determine whether a linear relation exists between time to drill five
feet and depth at which drilling begins. Comment on the type of relation
that appears to exist between time to drill five feet and depth at which
drilling begins.
The correlation between drilling depth and time to drill is 0.773. The
critical value for n = 12 observations is 0.576. Since 0.773 > 0.576,
there is a positive linear relation between time to drill five feet and
depth at which drilling begins.

Table II
Critical Values for Correlation Coefficient
n     Critical Value
3     0.997
4     0.950
5     0.878
6     0.811
7     0.754
8     0.707
9     0.666
10    0.632
11    0.602
12    0.576
13    0.553
14    0.532
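The three-step test can be sketched as a small lookup. The dictionary below holds only the rows of Table II shown in this example; the function names `linear_relation_exists` and `describe` are hypothetical.

```python
# Critical values for the sample sizes listed in the Table II excerpt.
CRITICAL_VALUES = {
    3: 0.997, 4: 0.950, 5: 0.878, 6: 0.811, 7: 0.754, 8: 0.707,
    9: 0.666, 10: 0.632, 11: 0.602, 12: 0.576, 13: 0.553, 14: 0.532,
}

def linear_relation_exists(r, n):
    """Steps 1-3: compare |r| with the critical value for sample size n."""
    if n not in CRITICAL_VALUES:
        raise ValueError(f"no critical value listed for n = {n}")
    return abs(r) > CRITICAL_VALUES[n]

def describe(r, n):
    """Report the direction of the relation when one exists."""
    if not linear_relation_exists(r, n):
        return "no linear relation"
    return "positive linear relation" if r > 0 else "negative linear relation"

print(describe(0.773, 12))  # drilling data: r = 0.773, n = 12
```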

4.1 Scatter Diagrams and Correlation
4.1.5 Explain the Difference between Correlation and
Causation (1 of 8)

According to data obtained from the Statistical Abstract of
the United States, the correlation between the percentage of
the female population with a bachelor’s degree and the
percentage of births to unmarried mothers since 1990 is
0.940.
Does this mean that a higher percentage of females with
bachelor’s degrees causes a higher percentage of births to
unmarried mothers?

4.1 Scatter Diagrams and Correlation
4.1.5 Explain the Difference between Correlation and
Causation (2 of 8)

Certainly not! The correlation exists only because both
percentages have been increasing since 1990. It is this
relation that causes the high correlation. In general, time
series data (data collected over time) may have high
correlations because each variable is moving in a specific
direction over time (both going up or down over time; one
increasing, while the other is decreasing over time).
When data are observational, we cannot claim a causal
relation exists between two variables. We can only claim
causality when the data are collected through a designed
experiment.

4.1 Scatter Diagrams and Correlation
4.1.5 Explain the Difference between Correlation and
Causation (3 of 8)

Another way that two variables can be related even though
there is not a causal relation is through a lurking variable.
A lurking variable is related to both the explanatory and
response variable.
For example, ice cream sales and crime rates have a very
high correlation. Does this mean that local governments
should shut down all ice cream shops? No! The lurking
variable is temperature. As air temperatures rise, both ice
cream sales and crime rates rise.

4.1 Scatter Diagrams and Correlation
4.1.5 Explain the Difference between Correlation and
Causation (4 of 8)
EXAMPLE Lurking Variables in a Bone Mineral Density Study
Because colas tend to replace healthier beverages and colas contain
caffeine and phosphoric acid, researchers Katherine L. Tucker and
associates wanted to know whether cola consumption is associated with
lower bone mineral density in women. The table lists the typical number
of cans of cola consumed in a week and the femoral neck bone mineral
density for a sample of 15 women. The data were collected through a
prospective cohort study.

Table 4
Number of Colas per Week    Bone Mineral Density (g/cm²)
0    0.893
0    0.882
1    0.891
1    0.881
2    0.888
2    0.871
3    0.868
3    0.876
4    0.873
5    0.875
5    0.871
6    0.867
7    0.862
7    0.872
8    0.865

4.1 Scatter Diagrams and Correlation
4.1.5 Explain the Difference between Correlation and
Causation (5 of 8)

EXAMPLE Lurking Variables in a Bone Mineral Density Study
The figure on the next slide shows the scatter diagram of the data.
The correlation between number of colas per week and bone mineral
density is −0.806. The critical value for correlation with n = 15 from
Table II in Appendix A is 0.514. Because |−0.806| > 0.514, we
conclude a negative linear relation exists between number of colas
consumed and bone mineral density. Can the authors conclude that
an increase in the number of colas consumed causes a decrease in
bone mineral density? Identify some lurking variables in the study.
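The correlation and the comparison with the critical value can be reproduced with a short sketch. The data are the 15 pairs from Table 4 and the critical value 0.514 is the one quoted above; the variable names are illustrative.

```python
import math

colas = [0, 0, 1, 1, 2, 2, 3, 3, 4, 5, 5, 6, 7, 7, 8]
density = [0.893, 0.882, 0.891, 0.881, 0.888, 0.871, 0.868, 0.876,
           0.873, 0.875, 0.871, 0.867, 0.862, 0.872, 0.865]

n = len(colas)
x_bar = sum(colas) / n
y_bar = sum(density) / n
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(colas, density))
sxx = sum((x - x_bar) ** 2 for x in colas)
syy = sum((y - y_bar) ** 2 for y in density)
r = sxy / math.sqrt(sxx * syy)

critical_value = 0.514  # quoted in the text for n = 15
relation_exists = abs(r) > critical_value
print(f"r = {r:.3f}, linear relation: {relation_exists}")
```

A negative r with |r| above the critical value supports a negative linear relation, but it says nothing about causation.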

4.1 Scatter Diagrams and Correlation
4.1.5 Explain the Difference between Correlation and
Causation (6 of 8)

4.1 Scatter Diagrams and Correlation
4.1.5 Explain the Difference between Correlation and
Causation (7 of 8)

EXAMPLE Lurking Variables in a Bone Mineral Density Study


In prospective cohort studies, data are collected on a group of subjects
through questionnaires and surveys over time. Therefore, the data are
observational. So the researchers cannot claim that increased cola
consumption causes a decrease in bone mineral density.
Some lurking variables in the study that could confound the results are:
• body mass index
• height
• smoking
• alcohol consumption
• calcium intake
• physical activity
4.1 Scatter Diagrams and Correlation
4.1.5 Explain the Difference between Correlation and
Causation (8 of 8)

EXAMPLE Lurking Variables in a Bone Mineral Density Study


The authors were careful to say that increased cola consumption
is associated with lower bone mineral density because of potential
lurking variables. They never stated that increased cola
consumption causes lower bone mineral density.

4.2 Least-squares Regression
Learning Objectives
1. Find the least-squares regression line and use the line to
make predictions
2. Interpret the slope and the y-intercept of the least-squares
regression line
3. Compute the sum of squared residuals

4.2 Least-squares Regression
EXAMPLE Finding an Equation that Describes Linearly
Related Data (1 of 2)
Using the following sample data:
x 0 2 3 5 6 6
y 5.8 5.7 5.2 2.8 1.9 2.2

4.2 Least-squares Regression
EXAMPLE Finding an Equation that Describes Linearly
Related Data (2 of 2)
(b) Graph the equation on the scatter diagram.

4.2 Least-squares Regression
4.2.1 Find the Least-Squares Regression Line and Use the
Line to Make Predictions (1 of 7)
The difference between the observed value of y and the predicted value of y is
the error, or residual.

Using the line from the last example, and the predicted value at x = 3:

residual = observed y − predicted y

= 5.2 − 4.75

= 0.45
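The residual above can be reproduced with a short sketch. The line from the last example is not reproduced here, so the code assumes it is the line through the sample points (2, 5.7) and (6, 1.9); that assumption yields the predicted value 4.75 at x = 3 used above.

```python
xs = [0, 2, 3, 5, 6, 6]
ys = [5.8, 5.7, 5.2, 2.8, 1.9, 2.2]

# Assumed line: the one through the sample points (2, 5.7) and (6, 1.9).
x1, y1 = 2, 5.7
x2, y2 = 6, 1.9
slope = (y2 - y1) / (x2 - x1)
intercept = y1 - slope * x1

def predict(x):
    """Predicted y on the assumed line."""
    return slope * x + intercept

observed = ys[xs.index(3)]       # observed y at x = 3
predicted = predict(3)
residual = observed - predicted  # residual = observed y - predicted y
print(round(predicted, 2), round(residual, 2))
```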

4.2 Least-squares Regression
4.2.1 Find the Least-Squares Regression Line and Use the
Line to Make Predictions (2 of 7)

Least-Squares Regression Criterion

4.2 Least-squares Regression
4.2.1 Find the Least-Squares Regression Line and Use the
Line to Make Predictions (3 of 7)

The Least-Squares Regression Line


The equation of the least-squares regression line is given by
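In the usual notation, assumed here rather than taken from the slide (b_1 for the slope, b_0 for the y-intercept, r for the correlation coefficient, and s_x, s_y for the sample standard deviations), the line that minimizes the sum of squared residuals can be written as:

```latex
\hat{y} = b_1 x + b_0,
\qquad
b_1 = r \cdot \frac{s_y}{s_x},
\qquad
b_0 = \bar{y} - b_1 \bar{x}
```

As a check against the drilling data summary (r = 0.773, s_x = 52.2, s_y = 0.781), this gives b_1 ≈ 0.773(0.781)/52.2 ≈ 0.0116, matching the slope reported for that example.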

4.2 Least-squares Regression
4.2.1 Find the Least-Squares Regression Line and Use the
Line to Make Predictions (4 of 7)

The Least-Squares Regression Line

4.2 Least-squares Regression
4.2.1 Find the Least-Squares Regression Line and Use the
Line to Make Predictions (5 of 7)
EXAMPLE Finding the Least-squares Regression Line
Using the drilling data:
(a) Find the least-squares regression line.
(b) Predict the drilling time if drilling starts at 130 feet.
(c) Is the observed drilling time at 130 feet above or below average?
(d) Draw the least-squares regression line on the scatter diagram of the
data.

Depth at Which Drilling Begins, x (in feet)    Time to Drill 5 Feet, y (in minutes)
35     5.88
50     5.99
75     6.74
95     6.1
120    7.47
130    6.93
145    6.42
155    7.97
160    7.92
175    7.62
185    6.89
190    7.9
4.2 Least-squares Regression
4.2.1 Find the Least-Squares Regression Line and Use the
Line to Make Predictions (6 of 7)

(c) The observed drilling time is 6.93 minutes. The predicted
drilling time is 7.035 minutes. The drilling time of 6.93
minutes is below average.
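Parts (a) through (c) can be sketched in a few lines. The code assumes the standard least-squares formulas (slope = Sxy/Sxx, intercept = ȳ − slope·x̄); the function name `least_squares` is illustrative.

```python
from statistics import mean

depth = [35, 50, 75, 95, 120, 130, 145, 155, 160, 175, 185, 190]
minutes = [5.88, 5.99, 6.74, 6.1, 7.47, 6.93, 6.42, 7.97, 7.92, 7.62, 6.89, 7.9]

def least_squares(xs, ys):
    """Return (slope, intercept) of the least-squares regression line."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

slope, intercept = least_squares(depth, minutes)     # part (a)
predicted_130 = slope * 130 + intercept              # part (b)
observed_130 = minutes[depth.index(130)]
residual_130 = observed_130 - predicted_130          # part (c)

print(f"slope = {slope:.4f}, intercept = {intercept:.4f}")
print("below average" if residual_130 < 0 else "above average")
```

Because the code keeps unrounded coefficients, its prediction at 130 feet is close to, but not exactly, the 7.035 minutes obtained from the rounded slope 0.0116.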

4.2 Least-squares Regression
4.2.1 Find the Least-Squares Regression Line and Use the
Line to Make Predictions (7 of 7)

4.2 Least-squares Regression
4.2.2 Interpret the Slope and the y-Intercept of the Least-
Squares Regression Line (1 of 3)

Interpretation of Slope:
The slope of the regression line is 0.0116. For each additional foot
of depth we start drilling, the time to drill five feet increases by
0.0116 minutes, on average.

4.2 Least-squares Regression
4.2.2 Interpret the Slope and the y-Intercept of the Least-
Squares Regression Line (2 of 3)

Interpretation of the y-Intercept:


The y-intercept of the regression line is 5.5273. To interpret the y-
intercept, we must first ask two questions:
1. Is 0 a reasonable value for the explanatory variable?
2. Do any observations near x = 0 exist in the data set?
A value of 0 is reasonable for the drilling data (this indicates that
drilling begins at the surface of Earth). The smallest observation in
the data set is x = 35 feet, which is reasonably close to 0. So,
interpretation of the y-intercept is reasonable.
The time to drill five feet when we begin drilling at the surface of
Earth is 5.5273 minutes.
4.2 Least-squares Regression
4.2.2 Interpret the Slope and the y-Intercept of the Least-
Squares Regression Line (3 of 3)

If the least-squares regression line is used to make predictions
based on values of the explanatory variable that are much larger
or much smaller than the observed values, we say the researcher
is working outside the scope of the model. Never use a least-
squares regression line to make predictions outside the scope of
the model because we can’t be sure the linear relation continues
to exist.
4.2 Least-squares Regression
4.2.3 Compute the Sum of Squared Residuals
To illustrate the fact that the sum of squared residuals for a
least-squares regression line is less than the sum of squared
residuals for any other line, use the “regression by eye”
applet.
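The same comparison can be made numerically. The sketch below uses the six-point sample data from the start of this section and compares the least-squares line with one other candidate, the line through (2, 5.7) and (6, 1.9); that candidate is an assumption chosen for illustration.

```python
xs = [0, 2, 3, 5, 6, 6]
ys = [5.8, 5.7, 5.2, 2.8, 1.9, 2.2]

def sum_squared_residuals(slope, intercept):
    """Sum of (observed y - predicted y)^2 over all data points."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

# Least-squares line from the standard formulas.
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
ls_slope = sxy / sxx
ls_intercept = y_bar - ls_slope * x_bar

# Assumed competitor: the line through (2, 5.7) and (6, 1.9).
other_slope = (1.9 - 5.7) / (6 - 2)
other_intercept = 5.7 - other_slope * 2

ssr_least_squares = sum_squared_residuals(ls_slope, ls_intercept)
ssr_other = sum_squared_residuals(other_slope, other_intercept)
print(round(ssr_least_squares, 4), round(ssr_other, 4))
```

Trying other slopes and intercepts in `sum_squared_residuals` plays the role of the applet: none should give a smaller sum than the least-squares line.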

4.3 Diagnostics on the Least-squares Regression Line
Learning Objectives

1. Compute and interpret the coefficient of determination

4.3 Diagnostics on the Least-squares Regression Line
4.3.1 Compute and Interpret the Coefficient of Determination (1 of 18)

The coefficient of determination, R², measures the
proportion of total variation in the response variable that is
explained by the least-squares regression line.
The coefficient of determination is a number between 0 and
1, inclusive. That is, 0 ≤ R² ≤ 1.
If R² = 0, the line has no explanatory value.
If R² = 1, the line explains 100% of the variation in the
response variable.

4.3 Diagnostics on the Least-squares Regression Line
4.3.1 Compute and Interpret the Coefficient of Determination (2 of 18)

The data below are based on a study for drilling rock. The researchers
wanted to determine whether the time it takes to dry drill a distance of
5 feet in rock increases with the depth at which the drilling begins. So,
depth at which drilling begins is the predictor variable, x, and time (in
minutes) to drill five feet is the response variable, y.

Depth at Which Drilling Begins, x (in feet)    Time to Drill 5 Feet, y (in minutes)
35     5.88
50     5.99
75     6.74
95     6.1
120    7.47
130    6.93
145    6.42
155    7.97
160    7.92
175    7.62
185    6.89
190    7.9

Source: Penner, R., and Watts, D.G. “Mining Information.” The American
Statistician, Vol. 45, No. 1, Feb. 1991, p. 6.

4.3 Diagnostics on the Least-squares Regression Line
4.3.1 Compute and Interpret the Coefficient of Determination (3 of 18)

4.3 Diagnostics on the Least-squares Regression Line
4.3.1 Compute and Interpret the Coefficient of Determination (4 of 18)

Sample Statistics
         Mean     Standard Deviation
Depth    126.2    52.2
Time     6.99     0.781
Correlation Between Depth and Time: 0.773
Regression Analysis
The regression equation is
Time = 5.53 + 0.0116 Depth

4.3 Diagnostics on the Least-squares Regression Line
4.3.1 Compute and Interpret the Coefficient of Determination (5 of 18)

Suppose we were asked to predict the time to drill an
additional 5 feet, but we did not know the current depth of
the drill. What would be our best “guess”?

4.3 Diagnostics on the Least-squares Regression Line
4.3.1 Compute and Interpret the Coefficient of Determination (6 of 18)

Suppose we were asked to predict the time to drill an
additional 5 feet, but we did not know the current depth of
the drill. What would be our best “guess”?
ANSWER:
The mean time to drill an additional 5 feet: 6.99 minutes

4.3 Diagnostics on the Least-squares Regression Line
4.3.1 Compute and Interpret the Coefficient of Determination (7 of 18)

Now suppose that we are asked to predict the time to drill an
additional 5 feet if the current depth of the drill is 160 feet.
ANSWER:
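Using the regression equation reported in the sample output (Time = 5.53 + 0.0116 Depth), the prediction can be sketched as follows; the rounding to two decimal places is an assumption.

```latex
\widehat{\text{Time}} = 5.53 + 0.0116(160) = 5.53 + 1.856 \approx 7.39 \text{ minutes}
```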

4.3 Diagnostics on the Least-squares Regression Line
4.3.1 Compute and Interpret the Coefficient of Determination (8 of 18)

4.3 Diagnostics on the Least-squares Regression Line
4.3.1 Compute and Interpret the Coefficient of Determination (9 of 18)

The difference between the observed value of the response
variable and the mean value of the response variable is
called the total deviation and is equal to y − ȳ.

4.3 Diagnostics on the Least-squares Regression Line
4.3.1 Compute and Interpret the Coefficient of Determination (10 of 18)

The difference between the predicted value of the response
variable and the mean value of the response variable is
called the explained deviation and is equal to ŷ − ȳ.

4.3 Diagnostics on the Least-squares Regression Line
4.3.1 Compute and Interpret the Coefficient of Determination (11 of 18)

The difference between the observed value of the response
variable and the predicted value of the response variable is
called the unexplained deviation and is equal to y − ŷ.

4.3 Diagnostics on the Least-squares Regression Line
4.3.1 Compute and Interpret the Coefficient of Determination (12 of 18)

Total Deviation = Unexplained Deviation + Explained Deviation

4.3 Diagnostics on the Least-squares Regression Line
4.3.1 Compute and Interpret the Coefficient of Determination (13 of 18)

Total Deviation = Unexplained Deviation + Explained Deviation

4.3 Diagnostics on the Least-squares Regression Line
4.3.1 Compute and Interpret the Coefficient of Determination (14 of 18)

Total Variation = Unexplained Variation + Explained Variation
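Written out in symbols, under the assumption that each variation is the sum of the corresponding squared deviations over all observations, the decomposition and the resulting definition of R² are:

```latex
\sum (y_i - \bar{y})^2 = \sum (y_i - \hat{y}_i)^2 + \sum (\hat{y}_i - \bar{y})^2,
\qquad
R^2 = \frac{\text{explained variation}}{\text{total variation}}
    = 1 - \frac{\text{unexplained variation}}{\text{total variation}}
```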

4.3 Diagnostics on the Least-squares Regression Line
4.3.1 Compute and Interpret the Coefficient of Determination (15 of 18)

To determine R² for the linear regression model, simply
square the value of the linear correlation coefficient.

4.3 Diagnostics on the Least-squares Regression Line
4.3.1 Compute and Interpret the Coefficient of Determination (16 of 18)

EXAMPLE Determining the Coefficient of Determination


Find and interpret the coefficient of determination for the drilling
data.
Because the linear correlation coefficient, r, is 0.773, we have that
R² = 0.773² = 0.5975 = 59.75%.
So, 59.75% of the variability in drilling time is explained by the
least-squares regression line.
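The same quantity can be obtained from the variation decomposition instead of squaring r. The sketch below assumes the standard least-squares formulas and computes the total, explained, and unexplained variation directly; the variable names are illustrative.

```python
from statistics import mean

depth = [35, 50, 75, 95, 120, 130, 145, 155, 160, 175, 185, 190]
minutes = [5.88, 5.99, 6.74, 6.1, 7.47, 6.93, 6.42, 7.97, 7.92, 7.62, 6.89, 7.9]

x_bar, y_bar = mean(depth), mean(minutes)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(depth, minutes))
sxx = sum((x - x_bar) ** 2 for x in depth)
slope = sxy / sxx
intercept = y_bar - slope * x_bar
predicted = [slope * x + intercept for x in depth]

total = sum((y - y_bar) ** 2 for y in minutes)
explained = sum((p - y_bar) ** 2 for p in predicted)
unexplained = sum((y - p) ** 2 for y, p in zip(minutes, predicted))

r_squared = explained / total
print(f"R^2 = {r_squared:.3f}")
```

Because r is not rounded before squaring here, the result is close to, but not identical to, the 0.5975 obtained from r = 0.773.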

4.3 Diagnostics on the Least-squares Regression Line
4.3.1 Compute and Interpret the Coefficient of Determination (17 of 18)
DATA SET A DATA SET B DATA SET C
X Y X Y X Y
3.6 8.9 3.1 8.9 2.8 8.9
8.3 15.0 9.4 15.0 8.1 15.0
0.5 4.8 1.2 4.8 3.0 4.8
1.4 6.0 1.0 6.0 8.3 6.0
8.2 14.9 9.0 14.9 8.2 14.9
5.9 11.9 5.0 11.9 1.4 11.9
4.3 9.8 3.4 9.8 1.0 9.8
8.3 15.0 7.4 15.0 7.9 15.0
0.3 4.7 0.1 4.7 5.9 4.7
6.8 13.0 7.5 13.0 5.0 13.0

Draw a scatter diagram for each of these data sets. For each data
set, the variance of y is 17.49.
4.3 Diagnostics on the Least-squares Regression Line
4.3.1 Compute and Interpret the Coefficient of Determination (18 of 18)

Data Set A: 99.99% of the variability in y is explained by the least-
squares regression line.
Data Set B: 94.7% of the variability in y is explained by the least-
squares regression line.
Data Set C: 9.4% of the variability in y is explained by the least-
squares regression line.
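These percentages can be checked with a short sketch. It assumes R² is computed as the squared correlation between x and y; the three data sets share the same y values and differ only in x.

```python
from statistics import mean, variance

y = [8.9, 15.0, 4.8, 6.0, 14.9, 11.9, 9.8, 15.0, 4.7, 13.0]
x_sets = {
    "A": [3.6, 8.3, 0.5, 1.4, 8.2, 5.9, 4.3, 8.3, 0.3, 6.8],
    "B": [3.1, 9.4, 1.2, 1.0, 9.0, 5.0, 3.4, 7.4, 0.1, 7.5],
    "C": [2.8, 8.1, 3.0, 8.3, 8.2, 1.4, 1.0, 7.9, 5.9, 5.0],
}

def r_squared(xs, ys):
    """Squared sample correlation between xs and ys."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (v - y_bar) for x, v in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((v - y_bar) ** 2 for v in ys)
    return sxy ** 2 / (sxx * syy)

results = {name: r_squared(xs, y) for name, xs in x_sets.items()}
print(f"variance of y = {variance(y):.2f}")
for name, value in results.items():
    print(f"Data Set {name}: R^2 = {value:.1%}")
```

The variance of y is the same in every case, so the differences in R² come entirely from how well x lines up with y.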
4.4 Contingency Tables and Association
Learning Objectives
1. Compute the marginal distribution of a variable
2. Use the conditional distribution to identify association
among categorical data
3. Explain Simpson’s Paradox

4.4 Contingency Tables and Association
Example: Data Information
A professor at a community college in New Mexico conducted a study to assess
the effectiveness of delivering an introductory statistics course via traditional
lecture-based method, online delivery (no classroom instruction), and hybrid
instruction (online course with weekly meetings) methods. The grades students
received in each of the courses were tallied.

Grade    Traditional    Online    Hybrid
A        36             39        24
B        52             55        66
C        57             68        90
D        46             38        41
F        46             54        31

4.4 Contingency Tables and Association
4.4.1 Compute the Marginal Distribution of a Variable (1 of 3)

A marginal distribution of a variable is a frequency or
relative frequency distribution of either the row or column
variable in the contingency table.

4.4 Contingency Tables and Association
4.4.1 Compute the Marginal Distribution of a Variable (2 of 3)
EXAMPLE Determining Frequency Marginal Distributions

A professor at a community college in New Mexico conducted a study to
assess the effectiveness of delivering an introductory statistics course via
traditional lecture-based method, online delivery (no classroom
instruction), and hybrid instruction (online course with weekly meetings)
methods. The grades students received in each of the courses were
tallied. Find the frequency marginal distributions for course grade and
delivery method.
Grade Traditional Online Hybrid Total
A 36 39 24 99
B 52 55 66 173
C 57 68 90 215
D 46 38 41 125
F 46 54 31 131
Total 237 254 252 743
4.4 Contingency Tables and Association
4.4.1 Compute the Marginal Distribution of a Variable (3 of 3)

EXAMPLE Determining Relative Frequency Marginal Distributions
Determine the relative frequency marginal distribution for course
grade and delivery method.

Grade                 Traditional    Online    Hybrid    Relative Frequency
A                     36             39        24        0.133
B                     52             55        66        0.233
C                     57             68        90        0.289
D                     46             38        41        0.168
F                     46             54        31        0.176
Relative Frequency    0.319          0.342     0.339     1.000
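Both kinds of marginal distribution can be computed with a short sketch: row and column totals give the frequency marginals, and dividing by the grand total gives the relative frequency marginals. The variable names are illustrative.

```python
grades = ["A", "B", "C", "D", "F"]
methods = ["Traditional", "Online", "Hybrid"]
counts = [
    [36, 39, 24],  # A
    [52, 55, 66],  # B
    [57, 68, 90],  # C
    [46, 38, 41],  # D
    [46, 54, 31],  # F
]

# Frequency marginal distributions: row totals and column totals.
grade_totals = [sum(row) for row in counts]
method_totals = [sum(col) for col in zip(*counts)]
grand_total = sum(grade_totals)

# Relative frequency marginal distributions: totals / grand total.
grade_relative = [round(t / grand_total, 3) for t in grade_totals]
method_relative = [round(t / grand_total, 3) for t in method_totals]

for grade, total, rel in zip(grades, grade_totals, grade_relative):
    print(grade, total, rel)
for method, total, rel in zip(methods, method_totals, method_relative):
    print(method, total, rel)
```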

4.4 Contingency Tables and Association
4.4.2 Use the Conditional Distribution to Identify Association
among Categorical Data (1 of 4)

A conditional distribution lists the relative frequency of
each category of the response variable, given a specific
value of the explanatory variable in the contingency table.

4.4 Contingency Tables and Association
4.4.2 Use the Conditional Distribution to Identify Association
among Categorical Data (2 of 4)

EXAMPLE Determining a Conditional Distribution
Construct a conditional distribution of course grade by method of
delivery. Comment on any type of association that may exist between
course grade and delivery method.

Grade    Traditional    Online    Hybrid
A        36             39        24
B        52             55        66
C        57             68        90
D        46             38        41
F        46             54        31

Grade    Traditional    Online    Hybrid
A        0.152          0.154     0.095
B        0.219          0.217     0.262
C        0.241          0.268     0.357
D        0.194          0.150     0.163
F        0.194          0.213     0.123

It appears that students in the hybrid course are more likely to pass
(A, B, or C) than students in the other two delivery methods.
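The conditional distribution divides each count by its column (delivery method) total. A minimal sketch, with illustrative variable names, also adds up the A, B, and C proportions to compare pass rates:

```python
grades = ["A", "B", "C", "D", "F"]
counts = {
    "Traditional": [36, 52, 57, 46, 46],
    "Online": [39, 55, 68, 38, 54],
    "Hybrid": [24, 66, 90, 41, 31],
}

# Relative frequency of each grade, given the delivery method.
conditional = {
    method: [round(c / sum(column), 3) for c in column]
    for method, column in counts.items()
}

# Proportion passing (A, B, or C) within each delivery method.
pass_rate = {
    method: sum(column[:3]) / sum(column)
    for method, column in counts.items()
}

for method in counts:
    print(method, conditional[method], f"pass rate = {pass_rate[method]:.3f}")
```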

4.4 Contingency Tables and Association
4.4.2 Use the Conditional Distribution to Identify Association
among Categorical Data (3 of 4)

EXAMPLE Drawing a Bar Graph of a Conditional Distribution
Using the results of the previous example, draw a bar graph that
represents the conditional distribution of method of delivery by
grade earned.

4.4 Contingency Tables and Association
4.4.2 Use the Conditional Distribution to Identify Association
among Categorical Data (4 of 4)

The following contingency table shows the survival status and
demographics of passengers on the ill-fated Titanic.

            Men     Women    Boys    Girls
Survived    334     318      29      27
Died        1360    104      35      18

Draw a conditional bar graph of survival status by demographic
characteristic.
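The heights of the bars come from the conditional distribution of survival status within each demographic group. The sketch below computes those proportions and prints a rough text version of the bar graph; the text rendering (one `#` per 2 percentage points) is an illustrative stand-in for a real chart.

```python
groups = ["Men", "Women", "Boys", "Girls"]
survived = [334, 318, 29, 27]
died = [1360, 104, 35, 18]

# Proportion surviving within each demographic group.
survival_rate = {
    group: s / (s + d) for group, s, d in zip(groups, survived, died)
}

for group, rate in survival_rate.items():
    bar = "#" * round(rate * 50)  # one '#' per 2 percentage points
    print(f"{group:<6} survived {rate:.3f} {bar}")
    print(f"{'':<6} died     {1 - rate:.3f}")
```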

4.4 Contingency Tables and Association
4.4.3 Explain Simpson’s Paradox (1 of 6)
EXAMPLE Illustrating Simpson’s Paradox
Insulin-dependent (or Type 1) diabetes is a disease that results in
the permanent destruction of insulin-producing beta cells of the
pancreas. Type 1 diabetes is lethal unless treatment with insulin
injections replaces the missing hormone. Individuals with insulin-
independent (or Type 2) diabetes can produce insulin internally.
The data shown in the table below represent the survival status of
902 patients with diabetes by type over a 5-year period.

            Type 1    Type 2    Total
Survived    253       326       579
Died        105       218       323
Total       358       544       902

4.4 Contingency Tables and Association
4.4.3 Explain Simpson’s Paradox (2 of 6)
EXAMPLE Illustrating Simpson’s Paradox

            Type 1    Type 2    Total
Survived    253       326       579
Died        105       218       323
Total       358       544       902

4.4 Contingency Tables and Association
4.4.3 Explain Simpson’s Paradox (3 of 6)
However, Type 2 diabetes is usually contracted after the age of
40. If we account for the variable age and divide our patients into
two groups (those 40 or younger and those over 40), we obtain
the data in the table below.

            Type 1            Type 2            Total
            ≤ 40    > 40      ≤ 40    > 40
Survived    129     124       15      311       579
Died        1       104       0       218       323
Total       130     228       15      529       902

4.4 Contingency Tables and Association
4.4.3 Explain Simpson’s Paradox (4 of 6)

            Type 1            Type 2            Total
            ≤ 40    > 40      ≤ 40    > 40
Survived    129     124       15      311       579
Died        1       104       0       218       323
Total       130     228       15      529       902

4.4 Contingency Tables and Association
4.4.3 Explain Simpson’s Paradox (5 of 6)

            Type 1            Type 2            Total
            ≤ 40    > 40      ≤ 40    > 40
Survived    129     124       15      311       579
Died        1       104       0       218       323
Total       130     228       15      529       902

4.4 Contingency Tables and Association
4.4.3 Explain Simpson’s Paradox (6 of 6)
Simpson’s Paradox describes a situation in which an
association between two variables inverts or goes away
when a third variable is introduced to the analysis.
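The diabetes data illustrate this reversal. The sketch below computes death rates from the age-stratified table, first ignoring age and then within each age group; the dictionary layout and names are illustrative.

```python
# (survived, died) counts from the age-stratified diabetes table.
counts = {
    ("Type 1", "40 or younger"): (129, 1),
    ("Type 1", "over 40"): (124, 104),
    ("Type 2", "40 or younger"): (15, 0),
    ("Type 2", "over 40"): (311, 218),
}

def death_rate(survived, died):
    """Proportion of patients who died."""
    return died / (survived + died)

# Death rate within each (type, age group) cell.
by_age = {key: death_rate(*value) for key, value in counts.items()}

# Death rate for each type with the age groups combined.
overall = {}
for diabetes_type in ("Type 1", "Type 2"):
    survived = sum(s for (t, _), (s, _d) in counts.items() if t == diabetes_type)
    died = sum(d for (t, _), (_s, d) in counts.items() if t == diabetes_type)
    overall[diabetes_type] = death_rate(survived, died)

print({k: round(v, 3) for k, v in overall.items()})
print({k: round(v, 3) for k, v in by_age.items()})
```

Ignoring age, Type 2 patients have the higher death rate; within each age group, Type 1 patients do. Age is the third variable that inverts the association.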

