Scatterplots may show a relationship or an association between two quantitative variables. These variables are
often called the explanatory variable (x) and the response variable (y).
We are looking for a LINEAR relationship between our two variables.
Example #1: Leonardo Da Vinci used measurements of bones and body parts to predict height and body type of a
person. He represented his findings in his famous drawing, The Proportions of the Human Figure (1492). Below
is a graph of a sample of 55 adults (male and female).
[Scatterplot: LeftArm (y-axis) vs. Height (x-axis)]
Looking at Scatterplots:
• Look at the Direction of the Association: On average, are changes in X associated with changes in Y?
Do you see
• Look for the Strength of the Association: Do the points follow a single stream that is tight to the line or
is there considerable spread (or variability) around the line?
Do you see
Do you see
• Look for Unusual Features: Are there any outliers, influential observations, or subgroups?
Example #2:
A data set was created from the rosters from teams in the National Basketball Association (NBA). This dataset
includes several variables such as the height and weight of the players. When creating a scatterplot of the weight
vs. the height of the players we see that there is an outlier in the scatterplot. When we identify this individual we
find it is Earl Boykins of the Denver Nuggets.
Attributes of the Correlation Coefficient
1. The correlation coefficient is a unit-less measurement, denoted by the letter r, and can take on values
between -1 and 1: −1 ≤ r ≤ 1
2. Both variables (X and Y) must be numerical (not categorical). It measures the strength and direction of
a linear relationship.
3. r=1 means all the data points lie on a straight line with a positive slope and a perfect association.
4. r =-1 means all the data points lie on a straight line with a negative slope and a perfect association.
5. Values of r close to 0 mean that the linear relationship is weak: there may be a general linear trend, but
there is a lot of variability around that trend.
6. r = 0 means that there is no linear relationship between the variables. In other words, the best-fitting
line has a slope of zero. (There may be a relationship other than linear.)
7. Correlation is sensitive to outliers. One very large or very small value can dramatically change the
correlation coefficient.
8. Because we use z-scores, the correlation coefficient does not change when converting to different units. If
you change all of your Y-values from inches to centimeters, the correlation with the X-variable will remain
the same.
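Attributes 7 and 8 can be checked numerically. The sketch below is only an illustration: the heights and arm lengths are made-up values, and `correlation` is a hypothetical helper written for this example, not a calculator or library function.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation r from sums of squared and cross deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

# Hypothetical heights and left-arm lengths, both in inches.
height_in = [60, 63, 65, 68, 70, 72, 75]
arm_in = [22.5, 23.5, 24.0, 25.5, 26.0, 27.5, 28.5]

r_inches = correlation(height_in, arm_in)

# Attribute 8: converting Y from inches to centimeters leaves r unchanged.
arm_cm = [a * 2.54 for a in arm_in]
r_cm = correlation(height_in, arm_cm)

# Attribute 7: a single unusual point (short person, very long arm) changes r a lot.
r_outlier = correlation(height_in + [62], arm_in + [31.0])

print(round(r_inches, 3), round(r_cm, 3), round(r_outlier, 3))
```

The first two printed values agree, while the third is much smaller, which is why it is worth looking at the scatterplot before trusting r.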
Example #3: Estimate a correlation coefficient for each and describe the relationship.
1. [Scatterplot: Mean January Air Temperatures for 30 New Zealand Locations; Temperature (°C) vs. Latitude (°S)]
______________________________________
2. [Scatterplot: Distances of Planets from the Sun; Distance (million miles) vs. Position Number]
______________________________________
3. [Scatterplot: horizontal axis is GDP per capita (thousands of dollars)]
______________________________________
Calculating the Correlation Coefficient:
Remember how to calculate the Z-score? We used this calculation to determine how many standard deviations
our observation was from the mean.
RECALL: z-score: $z = \dfrac{x - \mu}{\sigma}$
In this case, we were only concerned with one variable.
Now, we are considering two variables and each must be standardized (using z-scores).
FORMULA: $r = \dfrac{\sum \left( \dfrac{X_i - \bar{X}}{S_x} \right) \left( \dfrac{Y_i - \bar{Y}}{S_y} \right)}{n - 1}$
***You will not have to calculate the correlation coefficient by hand. For homework, you may need to use
the statistical functions on your calculator to find the value of r and interpret its meaning as it relates to
the explanatory and response variables. On quizzes and exams the value will not have to be calculated.***
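For readers who want to see the formula at work, here is a minimal sketch that follows it step by step: standardize each variable, multiply the paired z-scores, add them up, and divide by n − 1. The five data pairs are hypothetical, and the sample standard deviation (n − 1 divisor) is assumed for both S_x and S_y.

```python
from math import sqrt
from statistics import mean, stdev  # stdev uses the n - 1 divisor

# Hypothetical paired data (x = explanatory, y = response).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(xs)

# Step 1: standardize each variable separately.
z_x = [(x - mean(xs)) / stdev(xs) for x in xs]
z_y = [(y - mean(ys)) / stdev(ys) for y in ys]

# Step 2: multiply the paired z-scores, add them up, divide by n - 1.
r = sum(zx * zy for zx, zy in zip(z_x, z_y)) / (n - 1)

# Cross-check with the algebraically equivalent sums-of-squares form.
s_xy = sum((x - mean(xs)) * (y - mean(ys)) for x, y in zip(xs, ys))
s_xx = sum((x - mean(xs)) ** 2 for x in xs)
s_yy = sum((y - mean(ys)) ** 2 for y in ys)
r_check = s_xy / sqrt(s_xx * s_yy)

print(round(r, 4), round(r_check, 4))
```

Both routes give the same value (about 0.77 for these made-up numbers), which is what a calculator's correlation function should also report for the same data.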
Example #4: The data below represent the number of deaths and the magnitude for six earthquakes.
A) Graph the data set B) Calculate the correlation coefficient with your calculator
C) Based on your graph, do you think this is an accurate statistic? Explain
[Blank grid for graphing the data; vertical axis marked from 100 to 500]
______________________________________
What can go wrong when using correlation?
1) Correlation does not imply causation. (The correlation may be a coincidence.)
2) Both variables might be directly influenced by some common underlying cause (a lurking
variable).
Practice Problems:
1. Below is a scatterplot of data from the World Bank. All of the world’s nations for which data are
available are represented. The explanatory variable is a measure of how rich a country is, the gross
domestic product (GDP) per person. GDP is the total value of the goods and services produced in a
country, converted into dollars. The response variable is life expectancy at birth. We expect people in
richer countries to live longer. (Correlation Coefficient=0.718)
a. Describe the correlation coefficient given above (without considering the graph).
b. Would you use a linear model for this data set after seeing the graph? Why or why not?
[Scatterplot: Life expectancy (y-axis) vs. GDP per person (x-axis)]
Does this mean that taller children are generally better readers? What might explain the correlation?
3. Which of the following is true of the correlation coefficient?
A. It is a resistant measure of association
B. r is a measure of strength and direction between a categorical response variable and a quantitative
explanatory variable.
C. −1.0 ≤ r ≤ 1.0
D. If r is the correlation between X and Y, then –r is the correlation between Y and X.
Answer_________
4. A correlation between college entrance exam grades and scholastic achievement was found to be -1.08.
On the basis of this you would tell the university that:
A. the entrance exam is a good predictor of success.
B. they should hire a new statistician.
C. the exam is a poor predictor of success.
D. students who do best on this exam will make the worst students.
E. students at this school are underachieving.
Answer_________
5. A study found a correlation of r = 0.89 between ethnicity and the frequency of coronary heart disease.
You may correctly conclude:
A. This is incorrect because r does not make sense here.
B. Caucasians have a higher frequency of coronary heart disease compared to other ethnic groups.
C. Hispanics have a higher frequency of coronary heart disease compared to other ethnic groups.
D. An arithmetic mistake was made because the correlation should be negative.
Answer_________
6. A reviewer rated a sample of fifteen wines on a score from 1 (very poor) to 7 (excellent). A correlation of .92
was obtained between these ratings and the cost of the wines at a local store. In plain English, this means that
A. Wines with low ratings are likely to be more expensive (probably because fewer will be sold).
B. Having to pay more caused the reviewer to give a higher rating.
C. In general, as the cost went up so did the rating.
D. In general, the reviewer liked the cheaper wines better.
Answer_________
Linear Regression
A regression line is a straight line that models the linear relationship between an
explanatory variable and a response variable. Therefore, it is only useful when one of the
Residuals:
Since we are dealing with a “model,” the best-fitting line will not typically go through all of the data points.
Some of the data points might be above the line, and some might be below the line. For a given value of x,
we have a true data value for y. For this same value of x, we also have a predicted value of y from our linear model.
The residual is the difference between the true y value and the predicted y value (called y-hat) for the
same value of x: residual = y − ŷ. It is positive for points above the line and negative for points below it.
Example #1: Scatterplot of Systolic Blood Pressure versus Weight (Sample of 12 American Adults).
[Scatterplot: SBP (y-axis) vs. Weight (x-axis)]
Pearson correlation of SBP and WEIGHT = 0.971
The regression equation is: ŷ = 1.1 + 0.764(x)
1) Suppose we know that one person in the sample weighs 188 pounds and has a systolic blood pressure of 136.
2) Predict the SBP for a person weighing 188 pounds using your regression model.
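The two parts above combine into a single residual. The sketch below plugs the stated regression equation into a small helper (`predict_sbp` is a hypothetical name chosen for this illustration) and uses the usual sign convention, residual = observed − predicted.

```python
def predict_sbp(weight_lb):
    """Predicted systolic blood pressure from the fitted line y-hat = 1.1 + 0.764x."""
    return 1.1 + 0.764 * weight_lb

observed_sbp = 136                  # the person described in part 1
predicted_sbp = predict_sbp(188)    # part 2: prediction at 188 pounds

# Residual = observed y - predicted y (sign convention assumed here).
residual = observed_sbp - predicted_sbp

print(f"predicted = {predicted_sbp:.3f}, residual = {residual:.3f}")
```

A negative residual here means the point lies below the line: the model predicts a higher SBP than this person actually has.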
Slope-Intercept formula for a line:
Notation: ŷ = b0 + b1 x, where
b1 = __________
b0 = __________
Note: The Least Squares Regression Line always passes through the point (x̄, ȳ).
Interpretation:
• Slope ( b1 ): For each 1-unit increase in x, y increases (decreases) by the amount of the slope.
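As a rough illustration of the notation, the sketch below fits a least squares line to five hypothetical data pairs using the standard textbook formulas b1 = r·(s_y / s_x) and b0 = ȳ − b1·x̄ (these formulas are an assumption brought in for the example; they are not written out in this handout). It then confirms that the fitted line passes through (x̄, ȳ).

```python
from statistics import mean, stdev  # stdev uses the n - 1 divisor

# Hypothetical paired data (x = explanatory, y = response).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

x_bar, y_bar = mean(xs), mean(ys)
s_x, s_y = stdev(xs), stdev(ys)

# Correlation as the sum of z-score products divided by n - 1.
r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((len(xs) - 1) * s_x * s_y)

# Standard least squares formulas (assumed, see the lead-in).
b1 = r * s_y / s_x          # slope
b0 = y_bar - b1 * x_bar     # intercept

# The prediction at x-bar should equal y-bar.
y_hat_at_mean = b0 + b1 * x_bar

print(f"y-hat = {b0:.2f} + {b1:.2f} x; at x-bar the prediction is {y_hat_at_mean:.2f}")
```

For these made-up numbers the slope is 0.6, so each 1-unit increase in x goes with a predicted increase of 0.6 in y.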
Example #2:
A researcher was interested in studying the relationship between a city's latitude (in degrees) and its average
April temperature (in degrees Fahrenheit). A regression analysis was performed, and the Minitab output is
displayed below.
[Fitted line plot: AprTemp (y-axis) vs. latitude (x-axis); fitted line AprTemp = 118.8 - 1.644 latitude]
a) The regression equation is reported to be AprTemp = 118.8 – 1.644 Latitude. Interpret the slope in terms of
this particular problem.
b) San Diego has a latitude of 33°. Use the regression equation to predict the average April temperature for San
Diego.
c) A meteorologist would like you to predict the average April temperature for a city with latitude 15. Would
you feel comfortable using your regression line in this capacity? Why or why not?
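Parts b) and c) can be sketched in a few lines. The equation comes from the Minitab output; the latitude range of 25 to 50 is an assumption read off the axis of the fitted line plot, and both function names are hypothetical.

```python
# Assumed data range, read off the latitude axis of the fitted line plot.
LAT_MIN, LAT_MAX = 25.0, 50.0

def predict_april_temp(latitude):
    """Predicted average April temperature (°F) from AprTemp = 118.8 - 1.644 latitude."""
    return 118.8 - 1.644 * latitude

def is_extrapolation(latitude):
    """True when the latitude lies outside the range the line was fitted on."""
    return not (LAT_MIN <= latitude <= LAT_MAX)

san_diego = predict_april_temp(33)
print(f"San Diego (33°): {san_diego:.1f} °F, extrapolation: {is_extrapolation(33)}")
print(f"Latitude 15°: extrapolation: {is_extrapolation(15)}")
```

The equation will happily return a number for latitude 15, but the check flags it as outside the observed range, so that prediction should not be trusted.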
Example #3:
An international distance triathlon consists of a 1.5 km swim, a 40 km bike ride and a 10 km run. Triathletes are
ranked based on their overall finishing times, and some people suggest that an athlete’s time for the swim has the
largest influence on his overall performance. Data from 10 male triathletes who competed in the 2004 Camp
Pendleton International Triathlon was analyzed to produce the results below:
[Scatterplot of Overall Finishing Time vs. Swim Time (Minutes)]
The regression equation is: Overall Finishing Time = 122 + 1.56 (Swim Time)
b. Find the athlete’s actual finish time if the value of the residual for this swim time is 11 minutes.
Example #4:
Classified ads in the Ithaca Journal offered several used Toyota Corollas for sale. Using a sample of 17 cars, the
following computer output was obtained relating Car Age (yr) and Price Advertised ($)
a) Use the computer output provided to identify the equation for the regression line.
Quantitative data condition: Correlation and linear models only make sense with quantitative data.
Linearity condition: The regression model assumes that the relationship between the variables is linear.
Outlier condition: Unusual observations can distort the correlation and dramatically change a regression model.
Independence assumption: When fitting a linear regression model, the residuals should be independent of one
another.
____________________________
____________________________
____________________________
2) __________________________
____________________________
____________________________
____________________________
3) __________________________
____________________________
____________________________
____________________________
Note: A single residual is not sufficient to describe predictive power. You would have to calculate all of the residuals
and plot them on a graph.
Example: Baseball players and salaries: Residuals against the number of years the player has been in the major
leagues.
b. Will the model overestimate or underestimate the salaries of players who are new to the majors?
The closer r² is to 100%, the better the regression model describes the connection between x and y - in particular,
predictions made with the equation will be more accurate.
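One way to see what r² measures is to compare the leftover variation around the line with the total variation in y. The sketch below uses five hypothetical data pairs and a least squares fit computed from sums of deviations; for a simple linear regression, 1 − SSE/SST should match the square of the correlation.

```python
from math import sqrt
from statistics import mean

# Hypothetical paired data (x = explanatory, y = response).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
x_bar, y_bar = mean(xs), mean(ys)

s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_yy = sum((y - y_bar) ** 2 for y in ys)

# Least squares line and correlation.
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
r = s_xy / sqrt(s_xx * s_yy)

# SSE: variation left over around the line. SST: total variation in y.
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
sst = s_yy
r_squared = 1 - sse / sst

print(f"r^2 from residuals = {r_squared:.2%}, r squared directly = {r * r:.2%}")
```

Here the line accounts for 60% of the variation in y; the remaining 40% shows up as scatter in the residuals.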
b) Describe the strength of the predictive power based on the guidelines above.
Practice Problems:
Use the following information to answer problems 1-5
Hurricanes develop low pressure at their centers. A sample of 163 hurricanes (with central pressure ranging from
910 to 1000 mb) was studied to investigate the relationship between central pressure (mb) and maximum wind
speed (knots). The data was analyzed to produce the results below:
Regression equation: Max wind speed = 955.27 – 0.897 (Central Pressure)
R-squared = 77.24%
1. What is the correlation between the central pressure and maximum wind speed? Use this value to assess
the strength of the linear association.
Correlation coefficient:
Interpretation
Slope:
Interpretation
3. Hurricane Katrina had a central pressure of 920 mb and maximum wind speed of 110 knots. Calculate
the residual for this model value.
Answer:
4. A hurricane has a calculated residual that is negative. Which of the following is a correct statement?
A. Since the residual is negative the prediction must be accurate.
B. The residual value indicates that the prediction is too low.
C. The residual value indicates that the prediction is too high
D. We cannot determine whether the prediction is too low or too high from a residual value.
Answer:
5. Would it be appropriate to use the above model to predict maximum wind speed for a hurricane with
central pressure 1100mb? Explain why or why not in a full sentence.
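The calculations behind problems 1, 3, and 5 can be sketched as follows, using only the numbers given in the problem statement. The one fact brought in from outside is that r carries the same sign as the slope; the names below are hypothetical.

```python
from math import sqrt

R_SQUARED = 0.7724
INTERCEPT, SLOPE = 955.27, -0.897
PRESSURE_MIN, PRESSURE_MAX = 910, 1000   # range stated in the problem (mb)

# r has the same sign as the slope, and its square is R-squared.
r = -sqrt(R_SQUARED) if SLOPE < 0 else sqrt(R_SQUARED)

def predict_wind(pressure_mb):
    """Predicted maximum wind speed (knots) from the regression equation."""
    return INTERCEPT + SLOPE * pressure_mb

# Hurricane Katrina: residual = observed - predicted.
katrina_predicted = predict_wind(920)
katrina_residual = 110 - katrina_predicted

# 1100 mb lies outside the observed pressure range, so it would be extrapolation.
outside_range = not (PRESSURE_MIN <= 1100 <= PRESSURE_MAX)

print(f"r = {r:.3f}")
print(f"Katrina: predicted = {katrina_predicted:.2f}, residual = {katrina_residual:.2f}")
print(f"1100 mb outside observed range: {outside_range}")
```

The negative residual for Katrina means the model's prediction was higher than the observed wind speed.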
Use the following information to answer problems 6-10
The U.S. Food and Drug Administration (FDA) requires nutrition labeling on most foods. The nutritional content
for one cup of each of 21 different breakfast cereals was recorded. Information recorded included calories, sugar
(in grams) and carbohydrates (in grams). The data was analyzed to produce the results below:
7. (4 points) What percent of the variation in sugar content can be explained by the regression of Sugar
content on carbohydrate content? Use this value to assess the predictive power of the model.
Percent of variation:
Interpretation
8. Raisin Bran has 29 grams of carbohydrates per cup. What is the predicted amount of sugar per cup for
Raisin Bran?
Answer:
9. The residual for Raisin Bran is 8.556. How many grams of sugar does Raisin Bran actually have per cup?
Answer:
10. The relationship between sugar content and calories was also analyzed with the following results:
Which is a better predictor of sugar content in cereal, carbohydrates or calories? Justify your answer.
Regression Wisdom
1. If the relationship is not linear and the correlation is not strong, predictions will not be accurate.
2. Extrapolation: Do not make predictions outside of the range for which you have data.
3. Correlation simply does not imply causation. The correlation may be a coincidence or both correlation
variables might be directly influenced by some common underlying cause like a lurking variable.
____________________________
____________________________
____________________________
[Scatterplot: Variable 2 (y-axis) vs. Variable 1 (x-axis)]