Chapter 6
Scatterplots, Association, and Correlation
6.
Think
The variables are year and U.S. population, in millions of people. Both variables are
quantitative. The association between year and population is strong, positive, and curved.
Population has been increasing over the last 200 years. Furthermore, the rate of population
growth has been increasing. The U.S. population has been growing faster in more recent
years. We will attempt to straighten the scatterplot using a logarithmic re-expression and a
square root re-expression.
[Figure: U.S. Population. Scatterplot of population (millions of people) vs. year]
Show
[Figures: scatterplots of log(population) vs. year and sqrt(population) vs. year]
Tell
The scatterplot of log(population) and year shows a strong, positive association, but it is still
curved. This is not a good re-expression. The scatterplot of square root of population and
year shows a strong, positive, more linear association. The square root re-expression
straightens the scatterplot well.
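The comparison of re-expressions can be sketched in Python. The data below are hypothetical, not the census figures: they are generated so that population is quadratic in time, which is an assumption made only to illustrate why a square root can straighten a plot that a logarithm leaves curved. The helper `pearson_r` is also introduced here for illustration. A correlation close to 1 does not by itself show that a plot is straight, so the scatterplot and residuals remain the real check.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data (NOT the census figures): population in millions grows as a
# quadratic in time, so its square root is exactly linear in year.
years = list(range(1800, 2000, 20))
population = [(2.0 + 0.07 * (year - 1800)) ** 2 for year in years]

# Correlation with year for the raw values and for each re-expression.
r_raw = pearson_r(years, population)
r_log = pearson_r(years, [math.log10(p) for p in population])
r_sqrt = pearson_r(years, [math.sqrt(p) for p in population])

print(f"raw:  r = {r_raw:.4f}")
print(f"log:  r = {r_log:.4f}")
print(f"sqrt: r = {r_sqrt:.4f}")
```

With these made-up values the square root re-expression is perfectly linear, while the raw and log versions still bend, which mirrors the conclusion reached from the plots.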
Supplemental Resources
On the following page is a sample of a worksheet that examines the relationship between
height and weight of some students, using association, correlation, regression and prediction.
It is designed to be used with the TI-83 graphing calculator, but you can easily adapt it to
your own data analysis package. You may also choose to collect and use your own data.
1. Below are height and weight data for male and female Stat students. Create TI lists MHT,
MWT, FHT, FWT, and enter the data. You will need to keep these data for several days.
(Remember, to create a list: (1) put the cursor atop the names (L1,L2,...), then (2) space to the
right, and (3) type the new name in the blank space. You can then access these names via the
LIST NAMES menu.)
2. Is it reasonable to assume these data are drawn from populations that are normally
distributed? Check summary statistics and histograms for each variable.
3. a) Make a scatterplot for each gender (STAT PLOT, first plot type). Which is the explanatory
variable?
b) Describe the relationship (form, strength, direction, outliers, etc.) for each gender.
5. a) Write an equation of the least squares line of best fit for each gender.
b) Check the residuals plots. Do you think that a line is a good model? Explain.
c) Explain what the slope of each line means in the context of this relationship.
d) Predict weights for: a 60” male; a 60” female; a 70” male; a 70” female.
e) Predict the weight of a 7’2” male; of a 20” newborn baby girl. Comment on these results.
MALES                                FEMALES
HT(in)  WT(lb)   HT(in)  WT(lb)      HT(in)  WT(lb)   HT(in)  WT(lb)
67      140      71      132         63      117      64      110
71      165      70      140         62      107      63      123
73      168      71      140         75      170      64      110
71      142      70      140         61      91       71      134
74      200      69      130         62      118      64      129
74      175      70      150         63      130      62      129
68      135      74      170         66      135      65      123
73      145      71      175         63      120      64.5    115
71      150      74      180         67      125      68.5    122
72      155      72      150         67      117      65.5    120
69      168      70      150         64      135      64      111
66      106      73      190         61      88       64      115
70      144
[Figure: histograms of MHT, MWT, FHT, and FWT]
          MHT        MWT         FHT        FWT
Mean      70.96 in   153.6 lbs   64.73 in   120.6 lbs
Median    71 in      150 lbs     64 in      120 lbs
All of the distributions are at least roughly unimodal and symmetric. There are several tall females, one of
whom is also heavier than average. Since the means and medians are roughly the same, it seems reasonable that
these data are drawn from populations that are normally distributed.
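For readers adapting the worksheet to a software package instead of the TI-83, the summary statistics above can be reproduced with a short Python sketch. The list names follow the worksheet (MHT, MWT, FHT, FWT). The values are transcribed from the data table, with the unpaired final row (70, 144) treated as a male, an assumption that is consistent with the reported male means.

```python
from statistics import mean, median

# Heights (in) and weights (lb) transcribed from the worksheet's data table.
# Entries are kept in matching order so MHT[i] pairs with MWT[i], and so on.
MHT = [67, 71, 73, 71, 74, 74, 68, 73, 71, 72, 69, 66, 70,
       71, 70, 71, 70, 69, 70, 74, 71, 74, 72, 70, 73]
MWT = [140, 165, 168, 142, 200, 175, 135, 145, 150, 155, 168, 106, 144,
       132, 140, 140, 140, 130, 150, 170, 175, 180, 150, 150, 190]
FHT = [63, 62, 75, 61, 62, 63, 66, 63, 67, 67, 64, 61,
       64, 63, 64, 71, 64, 62, 65, 64.5, 68.5, 65.5, 64, 64]
FWT = [117, 107, 170, 91, 118, 130, 135, 120, 125, 117, 135, 88,
       110, 123, 110, 134, 129, 129, 123, 115, 122, 120, 111, 115]

# Mean and median for each variable, as asked for in question 2.
summary = {
    name: (mean(values), median(values))
    for name, values in [("MHT", MHT), ("MWT", MWT), ("FHT", FHT), ("FWT", FWT)]
}

for name, (avg, med) in summary.items():
    print(f"{name}: mean = {avg:.2f}, median = {med}")
```

Comparing each mean with its median is only a rough symmetry check; the histograms still need to be examined for shape and outliers.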
3a. The explanatory variable is height. We use height to predict weight.
3b. There is a moderate, positive linear relationship between male height and male weight. Taller males are
generally [...]
[Figures: scatterplots of weight vs. height; residual plots of Residual vs. predicted(M/M) and Residual vs.
predicted(F/F)]
[...] although removing the heaviest and two lightest females may yield a better overall model.
5c. According to the linear models, each additional inch in male height is associated with about 7.3 additional
pounds in weight, and each additional inch in female height is associated with about 3.7 additional pounds in
weight.
5d. 60” male: 73.6 lbs 60” female: 103.3 lbs 70” male: 146.6 lbs 70” female: 139.9 lbs.
5e. 7’2” male: 263.4 lbs 20” newborn baby girl: –43.3 lbs
The weight of the male seems low. A male this tall would weigh much more. The weight of the newborn baby
girl is impossible. These models were designed to predict the weights of typical high school students, not the
very tall or babies. We wouldn’t expect these predictions to be accurate.
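The male predictions above can be checked with a minimal least-squares sketch in Python. The function names `least_squares` and `predict_male_weight` are introduced here for illustration, and the data are the male heights and weights transcribed from the worksheet table (with the unpaired row 70, 144 treated as a male, which is an assumption). The female line can be fitted the same way.

```python
def least_squares(xs, ys):
    """Return (slope, intercept) of the least-squares line for y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x  # the line passes through (mean_x, mean_y)
    return slope, intercept

# Male heights (in) and weights (lb) transcribed from the worksheet table.
MHT = [67, 71, 73, 71, 74, 74, 68, 73, 71, 72, 69, 66, 70,
       71, 70, 71, 70, 69, 70, 74, 71, 74, 72, 70, 73]
MWT = [140, 165, 168, 142, 200, 175, 135, 145, 150, 155, 168, 106, 144,
       132, 140, 140, 140, 130, 150, 170, 175, 180, 150, 150, 190]

slope, intercept = least_squares(MHT, MWT)

def predict_male_weight(height_in):
    """Predicted weight (lb) from the fitted male line."""
    return intercept + slope * height_in

print(f"slope = {slope:.2f} lb per inch, intercept = {intercept:.1f} lb")
for height in (60, 70, 86):  # 86 in is 7 ft 2 in
    print(f"{height} in -> {predict_male_weight(height):.1f} lb")
```

The heights in the data run from 66 to 74 inches, so the predictions at 60 and 86 inches are extrapolations, which is why they deserve the skepticism expressed in 5e.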
1. After conducting a survey of his students, a professor reported that “There appears to be a
strong correlation between grade point average and whether or not a student works.”
Comment on this observation.
2. The following scatterplot shows a relationship between x and y that results in a correlation
coefficient of r = 0. Explain why r = 0 in this situation even though there appears to be a
strong relationship between the x and y variables.
[Figure: Scatterplot of y vs x]
3. The following scatterplot shows the relationship between the time (in seconds) it took men to
run the 1500m race for the gold medal and the year of the Olympics that the race was run in:
[Figure: Scatterplot of Time vs Year]
b. The correlation between Olympic gold medal times for the 100m dash and year is –1.37.
c. Since the correlation between Olympic gold medal times for the 800m hurdles and 100m
dash is –0.41, the correlation between times for the 100m dash and the 800m hurdles is
+0.41.
d. If we were to measure Olympic gold medal times for the 800m hurdles in minutes instead
of seconds, the correlation would be –0.66/60 = –0.011.
1. After conducting a survey of his students, a professor reported that “There appears to be a
strong correlation between grade point average and whether or not a student works.”
Comment on this observation.
Correlation measures the strength of a linear association between two quantitative variables. Whether or not
a student works is a categorical variable, so correlation cannot be calculated between GPA and whether or
not a student works.
2. The following scatterplot shows a relationship between x and y that results in a correlation
coefficient of r = 0. Explain why r = 0 in this situation even though there appears to be a
strong relationship between the x and y variables.
[Figure: Scatterplot of y vs x]
The correlation coefficient only measures the strength of linear associations. The relationship between x
and y that we see here is far from linear (in fact, it is a parabolic relationship).
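A small Python sketch makes the point concrete. The points below are hypothetical (they are not read off the scatterplot): they lie exactly on a parabola that is symmetric about x = 0, and the helper `pearson_r` is introduced only for this illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical points lying exactly on the parabola y = 10 + 0.5 * x^2.
x = list(range(-4, 5))
y = [10 + 0.5 * xi ** 2 for xi in x]

r = pearson_r(x, y)
print(f"r = {r:.6f}")
```

Here y is completely determined by x, yet the positive and negative products in the numerator cancel, so r is zero up to floating-point rounding.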
3. The following scatterplot shows the relationship between the time (in seconds) it took men to
run the 1500m race for the gold medal and the year of the Olympics that the race was run in:
[Figure: Scatterplot of Time vs Year]
b. The correlation between Olympic gold medal times for the 100m dash and year is –1.37.
Correlation has to be between –1 and +1.
c. Since the correlation between Olympic gold medal times for the 800m hurdles and 100m
dash is –0.41, the correlation between times for the 100m dash and the 800m hurdles is
+0.41.
Correlation does not change if we reverse the role of the x and y variables.
d. If we were to measure Olympic gold medal times for the 800m hurdles in minutes instead
of seconds, the correlation would be –0.66/60 = –0.011.
Correlation does not change when we change units.
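The three properties used in these answers (r lies between –1 and +1, r is unchanged when x and y swap roles, and r is unchanged by a change of units) can be checked numerically. The values below are hypothetical and are not Olympic results; `pearson_r` is a helper written for this sketch.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical paired measurements (NOT real race results), both in seconds.
event_a_sec = [11.0, 10.8, 10.6, 10.3, 10.4, 10.1, 9.9]
event_b_sec = [131.0, 126.5, 128.0, 122.0, 124.5, 119.0, 120.5]

r_ab = pearson_r(event_a_sec, event_b_sec)

# Swapping the roles of x and y leaves r unchanged.
r_ba = pearson_r(event_b_sec, event_a_sec)

# Re-expressing one variable in minutes instead of seconds leaves r unchanged.
event_b_min = [t / 60 for t in event_b_sec]
r_minutes = pearson_r(event_a_sec, event_b_min)

print(f"r(a, b)            = {r_ab:.4f}")
print(f"r(b, a)            = {r_ba:.4f}")
print(f"r(a, b in minutes) = {r_minutes:.4f}")
```

Dividing every time by 60 rescales the deviations and the standard deviation by the same factor, so the ratio that defines r does not move.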
1. After conducting a survey at a pet store to see what effect having a pet had on the condition
of the yard, a news reporter stated “There appears to be a strong correlation between
owning a pet and the condition of the yard.” Comment on this observation.
4. A common objective for many school administrators is to increase the number of students
taking SAT and ACT tests from their school. The data from each state from 2003 are
reflected in the scatterplot at the right.
c. If the point in the top left corner (4, 1215) were removed, would the correlation become
stronger, weaker, or remain about the same? Explain briefly.
d. If the point in the very middle (38, 1049) were removed, would the correlation become
stronger, weaker, or remain about the same? Explain briefly.
1. After conducting a survey at a pet store to see what effect having a pet had on the condition
of the yard, a news reporter stated “There appears to be a strong correlation between
owning a pet and the condition of the yard.” Comment on this observation.
The variables – owning a pet and condition of the yard – are both categorical variables. Correlation
cannot be calculated with categorical variables.
A positive association means that, in general, people who had more sleep were able to memorize more
information.
The child psychologist is attributing association to cause and effect. There is an implication that more
sleep will cause better memorization, therefore causing an increase in assessment scores. Perhaps
people who had memorized more were able to sleep more restfully, or perhaps differences in brain
chemistry allowed some people to memorize more and to sleep more easily.
4. A common objective for many school administrators is to increase the number of students
taking SAT and ACT tests from their school. The data from each state from 2003 are reflected
in the scatterplot at the right.
c. If the point in the top left corner (4, 1215) were removed, would the correlation become
stronger, weaker, or remain about the same? Explain briefly.
If the point in the top left corner (4, 1215) were removed, the correlation would become stronger
because the remaining points show a pattern with slightly less scatter.
d. If the point in the very middle (38, 1049) were removed, would the correlation become
stronger, weaker, or remain about the same? Explain briefly.
If the point in the very middle (38, 1049) were removed, the correlation would remain about the same;
this point does not contribute much to the scatter.
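The effect described in part c can be illustrated with a short Python sketch. The points are hypothetical and are not the state SAT data: six points follow a tight negative trend and a seventh sits well off that pattern. The helper `pearson_r` is introduced only for this example.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical points (NOT the SAT data): a tight negative trend...
pattern = [(1, 12), (2, 10), (3, 9), (4, 7), (5, 6), (6, 4)]
# ...plus one point that sits far from that trend.
off_pattern_point = (6, 12)

def correlation_of(points):
    """Correlation for a list of (x, y) pairs."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return pearson_r(xs, ys)

r_with = correlation_of(pattern + [off_pattern_point])
r_without = correlation_of(pattern)

print(f"with the off-pattern point:    r = {r_with:.3f}")
print(f"without the off-pattern point: r = {r_without:.3f}")
```

Removing the point that adds scatter moves r closer to –1. How much a single point matters depends on how far it sits from the rest of the pattern, so a point near the middle of the cloud changes r very little.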
1. After conducting a marketing study to see what consumers thought about a new tinted
contact lens they were developing, an eyewear company reported, “Consumer satisfaction is
strongly correlated with eye color.” Comment on this observation.
3. A school board study found a moderately strong negative association between the number of
hours high school seniors worked at part-time jobs after school hours and the students’ grade
point averages.
a. Explain in this context what “negative association” means.
b. Hoping to improve student performance, the school board passed a resolution urging
parents to limit the number of hours students be allowed to work. Do you agree or
disagree with the school board’s reasoning? Explain.
[Figure: scatterplot; horizontal axis: Circum]
a. Write a few sentences describing
the association.
c. If the point in the lower right corner (at about 14” and 38#) were removed, would
the correlation become stronger, weaker, or remain about the same?
d. If the point in the upper right corner (at about 15” and 75#) were removed, would the
correlation become stronger, weaker, or remain about the same?
1. After conducting a marketing study to see what consumers thought about a new tinted
contact lens they were developing, an eyewear company reported, “Consumer satisfaction is
strongly correlated with eye color.” Comment on this observation.
There may be an association between customer satisfaction and eye color, but these are both categorical
variables so they cannot be “correlated.”
3. A school board study found a moderately strong negative association between the number of
hours high school seniors worked at part-time jobs after school hours and the students’ grade
point averages.
a. Explain in this context what “negative association” means.
b. Hoping to improve student performance, the school board passed a resolution urging
parents to limit the number of hours students be allowed to work. Do you agree or
disagree with the school board’s reasoning? Explain.
They are mistakenly attributing the association to cause and effect. Maybe students with low grades are
more likely to seek jobs, or maybe there is some other factor in their home life that leads both to lower
grades and to the desire or need to work.
[Figure: scatterplot; horizontal axis: Circum]
a. Write a few sentences describing
the association.
There is a moderate, positive, linear association between forearm circumference and grip strength
among these boys. In general, the larger their forearms, the stronger their grip. One boy in particular
had very large forearms and a very strong grip. There was one outlier—the boy with the second largest
forearms had one of the weakest grips.
c. If the point in the lower right corner (at about 14” and 38#) were removed, would
the correlation become stronger, weaker, or remain about the same?
d. If the point in the upper right corner (at about 15” and 75#) were removed, would the
correlation become stronger, weaker, or remain about the same?