Chapter 4: Correlation and Linear Regression

Scatterplots may show a relationship or an association between two quantitative variables. These variables are
often called the explanatory variable (x) and the response variable (y).
We are looking for a LINEAR relationship between our two variables.

Example #1: Leonardo Da Vinci used measurements of bones and body parts to predict height and body type of a
person. He represented his findings in his famous drawing, The Proportions of the Human Figure (1492). Below
is a graph of a sample of 55 adults (male and female).

[Figure: scatterplot titled "Relationship Between Arm Length and Height"; x-axis: Height, y-axis: LeftArm]

Does there appear to be a linear relationship?

Looking at Scatterplots:

• Look at the Direction of the Association: On average, are changes in X associated with changes in Y?

Do you see

a positive association… or a negative association?

• Look for the Strength of the Association: Do the points follow a single stream that is tight to the line or
is there considerable spread (or variability) around the line?

Do you see

a strong relationship… or a weak relationship?


little scatter lots of scatter

• Look for Form: Is it straight, or curved or some other pattern, or no pattern?

Do you see

a linear trend… or a non-linear trend?

• Look for Unusual Features: Are there any outliers, influential observations, or subgroups?

Do you see any outliers? Do you see any groupings?

Example #2:

A data set was created from the rosters from teams in the National Basketball Association (NBA). This dataset
includes several variables such as the height and weight of the players. When creating a scatterplot of the weight
vs. the height of the players we see that there is an outlier in the scatterplot. When we identify this individual we
find it is Earl Boykins of the Denver Nuggets.

Attributes of the Correlation Coefficient

1. The correlation coefficient is a unitless measurement, denoted with the letter r, and can take on values
between -1 and 1: −1 ≤ r ≤ 1

2. Both variables (X and Y) must be numerical (not categorical). It measures the strength and direction of
a linear relationship.

3. r = 1 means all the data points lie on a straight line with a positive slope (a perfect positive association).

4. r = -1 means all the data points lie on a straight line with a negative slope (a perfect negative association).

5. Values of r close to 0 mean that the linear relationship is weak: there may be a general linear trend, but there is
a lot of variability around that trend.

6. r = 0 means there is no linear relationship between the variables. In other words, the best-fitting
line has a slope of zero. (There may still be a relationship other than linear.)

7. Correlation is sensitive to outliers. One very large or very small value can dramatically change the
correlation coefficient.

8. Because we use z-scores, the correlation coefficient does not change when converting to different units. If
you change all of your Y-values from inches to centimeters, the correlation with the X-variable will remain
the same.

Guidelines: How strong is the linear relationship?

r from 0 to 0.35    => weak positive        r from −0.35 to 0     => weak negative
r from 0.35 to 0.75 => moderate positive    r from −0.75 to −0.35 => moderate negative
r from 0.75 to 1    => strong positive      r from −1 to −0.75    => strong negative

[Figure: example scatterplots for r = -1, r = -0.7, r = -0.4, r = 0, r = 0.3, r = 0.8, and r = 1. At r = -1 and r = 1 the points fall exactly on a straight line; at r = 0 there is no linear relationship (uncorrelated).]

Example #3: Estimate a correlation coefficient for each and describe the relationship.
1. [Figure: scatterplot titled "Mean January Air Temperatures for 30 New Zealand Locations"; x-axis: Latitude (°S), y-axis: Temperature (°C)]

______________________________________
______________________________________
______________________________________
______________________________________

2. [Figure: scatterplot titled "Distances of Planets from the Sun"; x-axis: Position Number, y-axis: Distance (million miles)]

______________________________________
______________________________________
______________________________________
______________________________________

3. [Figure: scatterplot; x-axis: GDP per capita (thousands of dollars), y-axis: unlabeled, 0 to 80]

______________________________________
______________________________________
______________________________________
______________________________________

Calculating the Correlation Coefficient:

Remember how to calculate the Z-score? We used this calculation to determine how many standard deviations
our observation was from the mean.

RECALL: z-score:

$$ z = \frac{x - \mu}{\sigma} $$

In this case, we were only concerned with one variable.

Now, we are considering two variables and each must be standardized (using z-scores).

FORMULA:

$$ r = \frac{\sum \left( \frac{X_i - \bar{X}}{S_x} \right) \left( \frac{Y_i - \bar{Y}}{S_y} \right)}{n - 1} $$
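
As a rough illustration of the formula above, the sketch below standardizes each variable and divides the sum of the z-score products by n − 1. The data values and the function names `z_scores` and `correlation` are made up for illustration and are not part of the course material. The last lines check that converting one variable from inches to centimeters leaves r unchanged.

```python
from statistics import mean, stdev

def z_scores(values):
    """Standardize each value: (value - mean) / sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def correlation(xs, ys):
    """r = sum of z_x * z_y over all points, divided by n - 1."""
    zx, zy = z_scores(xs), z_scores(ys)
    return sum(a * b for a, b in zip(zx, zy)) / (len(xs) - 1)

# Hypothetical data (made-up values): heights and arm lengths in inches.
height_in = [62, 65, 67, 70, 72, 75]
arm_in = [24.0, 25.5, 25.0, 27.0, 28.5, 29.0]

r_inches = correlation(height_in, arm_in)

# Converting arm length to centimeters changes the units but not the z-scores.
arm_cm = [a * 2.54 for a in arm_in]
r_cm = correlation(height_in, arm_cm)

print(f"r (inches) = {r_inches:.4f}")
print(f"r (cm)     = {r_cm:.4f}")
```

Because each z-score is already unit-free, the two printed values should agree up to rounding.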

***You will not have to calculate the correlation coefficient by hand. For homework, you may need to use
your statistical function on your calculator to find the value and interpret the meaning of r as it relates to
the explanatory and response variables. On quizzes and exams the value will not have to be calculated.

Example #4: The data below represent the number of deaths and the magnitude for six earthquakes.
A) Graph the data set B) Calculate the correlation coefficient with your calculator
C) Based on your graph, do you think this is an accurate statistic? Explain

Magnitude (X)    6.7    7.8    6.4    6.6    6.9    7.3
Deaths (Y)        60    498    115     65     63      3

[Blank grid for graphing: x-axis from 6.0 to 8.0, y-axis from 0 to 500]

______________________________________
______________________________________
______________________________________
______________________________________
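
If a calculator is not at hand, part B can be checked with a short script. The sketch below uses the six data pairs from the table and also recomputes r after dropping the magnitude 7.8 earthquake. Dropping that point is our own choice for illustration (it ties in with attribute 7, sensitivity to outliers); the helper name `correlation` is hypothetical.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Pearson r: sum of z-score products divided by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)

# Data from Example #4.
magnitude = [6.7, 7.8, 6.4, 6.6, 6.9, 7.3]
deaths = [60, 498, 115, 65, 63, 3]

r_all = correlation(magnitude, deaths)

# Drop the magnitude 7.8 earthquake (498 deaths) to see its effect on r.
kept = [(m, d) for m, d in zip(magnitude, deaths) if m != 7.8]
r_without = correlation([m for m, _ in kept], [d for _, d in kept])

print(f"r with all six earthquakes:     {r_all:.3f}")
print(f"r without the magnitude 7.8 one: {r_without:.3f}")
```

Comparing the two values with your graph is one way to approach part C.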

What can go wrong when using correlation?
1) Correlation simply does not imply causation (The correlation may be a coincidence)
2) Both correlation variables might be directly influenced by some common underlying cause (lurking
variable).

Practice Problems:
1. Below is a scatterplot of data from the World Bank. All of the world’s nations for which data are
available are represented. The explanatory variable is a measure of how rich a country is, the gross
domestic product (GDP) per person. GDP is the total value of the goods and services produced in a
country, converted into dollars. The response variable is life expectancy at birth. We expect people in
richer countries to live longer. (Correlation Coefficient=0.718)

a. Describe the correlation coefficient given above (without considering the graph).

b. Would you use a linear model for this data set after seeing the graph? Why or why not?

[Figure: scatterplot titled "Scatterplot of Life expectancy vs Gross domestic product"; x-axis: Gross domestic product, y-axis: Life expectancy]

2. Height and reading


A researcher studies children in elementary school and finds a strong positive linear association between
height and reading scores.

Does this mean that taller children are generally better readers? What might explain the correlation?

3. Which of the following is true of the correlation coefficient?
A. It is a resistant measure of association
B. r is a measure of strength and direction between a categorical response variable and a quantitative
explanatory variable.
C. −1.0 ≤ r ≤ 1.0
D. If r is the correlation between X and Y, then –r is the correlation between Y and X.

Answer_________

4. A correlation between college entrance exam grades and scholastic achievement was found to be -1.08.
On the basis of this you would tell the university that:
A. the entrance exam is a good predictor of success.
B. they should hire a new statistician.
C. the exam is a poor predictor of success.
D. students who do best on this exam will make the worst students.
E. students at this school are underachieving.

Answer_________

5. A study found a correlation of r = 0.89 between ethnicity and the frequency of coronary heart disease.
You may correctly conclude:
A. This is incorrect because r does not make sense here.
B. Caucasians have a higher frequency of coronary heart disease compared to other ethnic groups.
C. Hispanics have a higher frequency of coronary heart disease compared to other ethnic groups.
D. An arithmetic mistake was made because the correlation should be negative.

Answer_________

6. A reviewer rated a sample of fifteen wines on a score from 1 (very poor) to 7 (excellent). A correlation of .92
was obtained between these ratings and the cost of the wines at a local store. In plain English, this means that
A. Wines with low ratings are likely to be more expensive (probably because fewer will be sold).
B. Having to pay more caused the reviewer to give a higher rating.
C. In general, as the cost went up so did the rating.
D. In general, the reviewer liked the cheaper wines better.

Answer_________

Linear Regression

A regression line is a straight line that models the linear relationship between an explanatory variable and a
response variable. Therefore, it is only useful when one of the variables helps explain or predict the other.

Least-squares regression line or Sample regression line:

• This is the best-fitting line to the data.


• This line makes the sum of the squares of the vertical distances of the data points from the line as small as
possible.
• The straight line describes how a response variable y changes as an explanatory variable x changes.
• We can use this line to predict a response, ŷ , from a given explanatory variable, x.

Residuals:
Since we are dealing with a “model,” the best-fitting line will not typically go through all of the data points.
Some of the data points might be above the line, and some might be below the line. For a given value of x,
we have a true data value for y. For this same value of x, we also have a predicted value of y from our linear model.
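The residual is the difference between the true y value and the predicted y value (called y-hat) for the
same value of x. It is positive when the point lies above the line and negative when the point lies below it.

Residual = Real value of y - Predicted y, that is, e = y − ŷ (e stands for error)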

Example #1: Scatterplot of Systolic Blood Pressure versus Weight (Sample of 12 American Adults).

[Figure: scatterplot titled "Scatterplot of SBP vs Weight"; x-axis: Weight, y-axis: SBP]

Pearson correlation of SBP and WEIGHT = 0.971
The regression equation is: ŷ = 1.1 + 0.764(x)

1) Suppose we know that one person in the sample weighs 188 pounds and has a systolic blood pressure of 136.

2) Predict the SBP for a person weighing 188 pounds using your regression model.

3) Find the residual.
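
The three steps above can be sketched in a few lines. The equation, weight, and observed SBP come from this example; the function name `predict_sbp` is hypothetical.

```python
def predict_sbp(weight_lb):
    """Predicted systolic blood pressure from the example's regression line."""
    return 1.1 + 0.764 * weight_lb

observed_sbp = 136                      # the person's actual SBP
predicted_sbp = predict_sbp(188)        # y-hat for a weight of 188 pounds
residual = observed_sbp - predicted_sbp # e = y - y-hat

print(f"predicted SBP = {predicted_sbp:.3f}")
print(f"residual      = {residual:.3f}")
```

A negative residual here means the observed point lies below the regression line, so the model overpredicts for this person.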

Slope-Intercept formula for a line:

Notation: ŷ = b0 + b1 x, where b1 = slope and b0 = y-intercept

Note: The least-squares regression line always passes through the point (x̄, ȳ).

Interpretation:

• Slope ( b1 ): For each 1-unit increase in x, the predicted value of y increases (or decreases, if the slope is negative) by the amount of the slope.

• Y-intercept ( b0 ): The y-intercept is the predicted value of y when x = 0.
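
For readers curious where the slope and intercept come from, here is a minimal sketch. It uses the standard least-squares formulas b1 = r · s_y / s_x and b0 = ȳ − b1 · x̄, which are not written out in these notes, and made-up data. It then checks that the fitted line passes through the point of means.

```python
from statistics import mean, stdev

def least_squares_line(xs, ys):
    """Return (b0, b1) using b1 = r * s_y / s_x and b0 = y_bar - b1 * x_bar."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    n = len(xs)
    # Correlation as the sum of z-score products divided by n - 1.
    r = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)) / (n - 1)
    b1 = r * s_y / s_x
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical data (made-up values).
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = least_squares_line(xs, ys)
at_mean = b0 + b1 * mean(xs)  # prediction at x-bar

print(f"y-hat = {b0:.2f} + {b1:.2f} x")
print(f"prediction at x-bar = {at_mean:.2f}, y-bar = {mean(ys):.2f}")
```

The last line prints the same number twice, which is the "passes through the point of means" property in action.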

Example #2:
A researcher was interested in studying the relationship between a city's latitude (in degrees) and its average
April temperature (in degrees Fahrenheit). A regression analysis was performed, and the Minitab output is
displayed below.
[Figure: fitted line plot, AprTemp = 118.8 - 1.644 latitude; x-axis: latitude, y-axis: AprTemp]

a) The regression equation is reported to be AprTemp = 118.8 – 1.644 Latitude. Interpret the slope in terms of
this particular problem.

b) San Diego has a latitude of 33°. Use the regression equation to predict the average April temperature for San
Diego.

c) A meteorologist would like you to predict the average April temperature for a city with a latitude of 15°. Would
you feel comfortable using your regression line in this capacity? Why or why not?
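
Parts b and c can be sketched together: make the prediction, but flag any input that falls outside the range of the data. The latitude range of 25 to 50 is an assumption read from the x-axis of the fitted line plot; the function names are hypothetical.

```python
# Latitude range covered by the fitted line plot. This is an assumption read
# from the plot's x-axis (25 to 50); the actual data range may be narrower.
LAT_MIN, LAT_MAX = 25, 50

def predict_april_temp(latitude):
    """Predicted average April temperature (°F) from the reported equation."""
    return 118.8 - 1.644 * latitude

def is_extrapolation(latitude):
    """True when the latitude lies outside the range the line was fitted on."""
    return latitude < LAT_MIN or latitude > LAT_MAX

for lat in (33, 15):
    note = "extrapolation, use with caution" if is_extrapolation(lat) else "within the data range"
    print(f"latitude {lat}: predicted {predict_april_temp(lat):.1f} °F ({note})")
```

The equation will happily return a number for any latitude, so the range check has to be a separate, deliberate step.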

Example #3:
An international distance triathlon consists of a 1.5 km swim, a 40 km bike ride and a 10 km run. Triathletes are
ranked based on their overall finishing times, and some people suggest that an athlete’s time for the swim has the
largest influence on his overall performance. Data from 10 male triathletes who competed in the 2004 Camp
Pendleton International Triathlon was analyzed to produce the results below:
[Figure: scatterplot titled "Scatterplot of Overall Finishing Time vs Swim Time"; x-axis: Swim Time (Minutes), y-axis: Overall Finishing Time (Minutes)]

The regression equation is: Overall Finishing Time = 122 + 1.56 (Swim Time)

1. The correct interpretation of the slope is:


A. For every one minute increase in overall finishing time, there is a 122 minute increase in swim time
B. For every one minute increase in swim time, there is a 1.56 minute increase in overall finishing time
C. For every one minute increase in swim time, there is a 122 minute increase in overall finishing time
D. For every one minute increase in overall finishing time, there is a 1.56 minute increase in swim time

2. One athlete completed the swim in 34 minutes.


a. Calculate the predicted finishing time.

b. Find the athlete’s actual finish time if the value of the residual for this swim time is 11 minutes.

Example #4:
Classified ads in the Ithaca Journal offered several used Toyota Corollas for sale. Using a sample of 17 cars, the
following computer output was obtained relating Car Age (yr) and Price Advertised ($):

Predictor       Coef    SE Coef        T        P
Constant     12319.6      575.7    21.40    0.000
Age (yr)     -924.00      82.29   -11.23    0.000

S = 1220.55    R-Sq = 89.4%    R-Sq(adj) = 88.7%

a) Use the computer output provided to identify the equation for the regression line.

b) Interpret the slope.

c) What is the value of the correlation coefficient?
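
One way to read this kind of output is sketched below: the "Coef" column gives the intercept and slope, and the correlation is the square root of R-Sq with the same sign as the slope. The variable and function names are hypothetical, and the age of 5 years is only an illustrative input.

```python
import math

intercept = 12319.6   # "Constant" row, Coef column
slope = -924.00       # "Age (yr)" row, Coef column
r_squared = 0.894     # R-Sq = 89.4%

def predicted_price(age_years):
    """Predicted advertised price ($) for a car of the given age."""
    return intercept + slope * age_years

# The square root alone is always positive, so copy the sign from the slope.
r = math.copysign(math.sqrt(r_squared), slope)

print(f"Price-hat = {intercept} + ({slope}) * Age")
print(f"r = {r:.4f}")
print(f"predicted price at age 5: ${predicted_price(5):.2f}")
```

Forgetting the sign is a common slip: a negative slope always goes with a negative correlation.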


Checking the Model
The same conditions should be checked for both correlation and regression:

Quantitative data condition: correlation and linear models only make sense with quantitative data

Linearity condition: the regression model assumes that the relationship between the variables is linear.

Outlier condition: unusual observations can distort the correlation and dramatically change a regression model.

Independence assumption: when fitting a linear regression model the residuals should be independent of one
another

Measures of Predictive Power: Two types

1) Residual Plot: A residual plot is a scatterplot of the (x, residual) pairs.

***Note: The sum of your residuals should equal ZERO.

Features to look for:


1. Unusually large values for your residuals
2. Non-linear patterns (Curvature)
3. Uneven variation (Fanning)
4. Influential observations (Individual points whose removal would cause a
substantial change in the regression line)
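
To make the residual plot concrete, the sketch below fits a least-squares line to made-up data, builds the (x, residual) pairs that would be plotted, and checks that the residuals sum to zero. The slope is computed as Sxy / Sxx, an equivalent form of the usual least-squares slope; `fit_line` is a hypothetical helper name.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares intercept and slope: b1 = Sxy / Sxx, b0 = y_bar - b1 * x_bar."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical data (made-up values).
xs = [1, 2, 3, 4, 5, 6]
ys = [3.0, 4.5, 4.0, 6.5, 7.0, 9.5]

b0, b1 = fit_line(xs, ys)

# Residual = observed y minus predicted y, one per data point.
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
pairs = list(zip(xs, residuals))  # these are the points of the residual plot

for x, e in pairs:
    print(f"x = {x}: residual = {e:+.3f}")
print(f"sum of residuals = {sum(residuals):.10f}")
```

The sum may print as a tiny nonzero number because of floating-point rounding; it is zero in exact arithmetic.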

Examples of Residual Plots:


1) __________________________

____________________________

____________________________

____________________________

2) __________________________

____________________________

____________________________

____________________________

3) __________________________

____________________________

____________________________

____________________________
Note: A single residual is not sufficient to describe predictive power. You would have to calculate all of the residuals
and plot them on a graph.

Example: Baseball players and salaries: Residuals against the number of years the player has been in the major
leagues.

a. Describe the pattern you see.

b. Will the model overestimate or underestimate the salaries of players who are new to the majors?

2) Coefficient of Determination: r², the correlation coefficient squared (usually written as R²)

• It is the percent of the variation in the values of y that is explained by the least-squares regression of y on
x.
• How much of the variation is accounted for by the linear relationship?
• It is a measure of how successful the regression line was in explaining the response.

The closer r² is to 100%, the better the regression model describes the connection between x and y - in particular,
predictions made with the equation will be more accurate.

GUIDELINES: PREDICTIVE POWER:

r² > 80% = Excellent
50% < r² < 80% = Good
25% < r² < 50% = Fair
r² < 25% = Weak

Template to use for the definition of r²:

_________% of the variability in __________________(y variable) can be explained by the least-squares


regression of ________________________(y variable) on ______________________(x variable).

Example: Does more education result in more crime?


Education was measured as the percentage of residents aged at least 25 in the county who had at least a high
school degree. Crime rate was measured as the number of crimes in Florida County in the past year per 1000
residents. The correlation coefficient between these variables is 0.67.
a) What is the coefficient of determination?

b) Describe the strength of the predictive power based on the guidelines above.
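
The guidelines can be sketched as a small helper. The function names are hypothetical, and the guidelines do not say which category the exact cut-offs (25%, 50%, 80%) belong to; this sketch assigns them to the lower category, which is an assumption.

```python
def coefficient_of_determination(r):
    """r squared, as a proportion between 0 and 1."""
    return r ** 2

def predictive_power(r_squared):
    """Label from the guidelines; exact cut-off values go to the lower category (assumption)."""
    if r_squared > 0.80:
        return "Excellent"
    if r_squared > 0.50:
        return "Good"
    if r_squared > 0.25:
        return "Fair"
    return "Weak"

r = 0.67  # correlation from the education and crime example
r2 = coefficient_of_determination(r)

print(f"r^2 = {r2:.4f} ({r2:.2%})")
print(f"predictive power: {predictive_power(r2)}")
```

Note that a correlation that sounds moderately large can still leave more than half of the variation unexplained.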
Practice Problems:
Use the following information to answer problems 1-5
Hurricanes develop low pressure at their centers. A sample of 163 hurricanes (with central pressure ranging from
910 to 1000 mb) was studied to investigate the relationship between central pressure (mb) and maximum wind
speed (knots). The data was analyzed to produce the results below:

Regression equation: Max wind speed = 955.27 – 0.897 (Central Pressure)
R-squared = 77.24%

1. What is the correlation between the central pressure and maximum wind speed? Use this value to assess
the strength of the linear association.

Correlation coefficient:

Interpretation

2. Identify and interpret the slope for the above equation.

Slope:

Interpretation

3. Hurricane Katrina had a central pressure of 920 mb and maximum wind speed of 110 knots. Calculate
the residual for this observation.

Answer:

4. A hurricane has a calculated residual that is negative. Which of the following is a correct statement?
A. Since the residual is negative the prediction must be accurate.
B. The residual value indicates that the prediction is too low.
C. The residual value indicates that the prediction is too high
D. We cannot determine whether the prediction is too low or too high from a residual value.
Answer:

5. Would it be appropriate to use the above model to predict maximum wind speed for a hurricane with
central pressure 1100mb? Explain why or why not in a full sentence.

Use the following information to answer problems 6-10
The U.S. Food and Drug Administration (FDA) requires nutrition labeling on most foods. The nutritional content
for one cup of each of 21 different breakfast cereals was recorded. Information recorded included calories, sugar
(in grams) and carbohydrates (in grams). The data was analyzed to produce the results below:

Regression equation: Sugar = -5.10 + 0.536 Carbohydrate
Correlation = 0.544

6. Interpret the slope for the above equation.

7. (4 points) What percent of the variation in sugar content can be explained by the regression of Sugar
content on carbohydrate content? Use this value to assess the predictive power of the model.

Percent of variation:

Interpretation

8. Raisin Bran has 29 grams of carbohydrates per cup. What is the predicted amount of sugar per cup for
Raisin Bran?

Answer:

9. The residual for Raisin Bran is 8.556. How many grams of sugar does Raisin Bran actually have per cup?

Answer:

10. The relationship between sugar content and calories was also analyzed with the following results:

Regression equation: Sugar = -8.82 + 0.152 Calories
Correlation = 0.576

Which is a better predictor of sugar content in cereal, carbohydrates or calories? Justify your answer.

Regression Wisdom

Conditions and cautions when using Regression Lines:

1. If the relationship is not linear or the correlation is not strong, predictions will not be accurate.

2. Extrapolation: Do not make predictions outside of the range for which you have data.

3. Correlation simply does not imply causation. The correlation may be a coincidence or both correlation
variables might be directly influenced by some common underlying cause like a lurking variable.

4. Look for unusual points:


• Outliers: Any data point that stands away from the others. They may result in a large
residual or have high leverage.
• Leverage: Data points with x-values far from the average of x are said to exert leverage
on a linear model. Their residuals can appear to be small.
• Influential points: A data point is influential if omitting it from the analysis gives a very
different regression model.
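
A small made-up data set can show how these ideas fit together. In the sketch below, five points follow a clear upward trend and a sixth point has an x-value far from the rest. The code fits the line with and without that point and looks at its residual. All values and names are hypothetical.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares intercept and slope: b1 = Sxy / Sxx, b0 = y_bar - b1 * x_bar."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

# Hypothetical data: five points on a clear upward trend...
xs = [1, 2, 3, 4, 5]
ys = [2, 3, 4, 5, 6]
# ...plus one point whose x-value is far from the others (high leverage).
x_far, y_far = 20, 3

b0_without, b1_without = fit_line(xs, ys)
b0_with, b1_with = fit_line(xs + [x_far], ys + [y_far])

# Residuals from the line fitted to all six points.
all_x, all_y = xs + [x_far], ys + [y_far]
residuals = [y - (b0_with + b1_with * x) for x, y in zip(all_x, all_y)]
residual_far = residuals[-1]

print(f"slope without the far point: {b1_without:.3f}")
print(f"slope with the far point:    {b1_with:.3f}")
print(f"residual of the far point:   {residual_far:.3f}")
print(f"largest other residual:      {max(abs(e) for e in residuals[:-1]):.3f}")
```

The far point pulls the line toward itself, so the slope changes a great deal (it is influential) while its own residual stays smaller than those of the ordinary points.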

Example 1: Oil production.


The correlation between oil production and year is r = 0.117.
Is there a relationship between year and oil production? Is a linear regression appropriate?

____________________________

____________________________

____________________________

Example 2: Which statement can be correctly applied to the point X?


[Figure: scatterplot with one point labeled X; x-axis: Variable 1, y-axis: Variable 2]

A. The point X is an outlier
B. The point X is an influential point
C. The point X has high leverage
D. A & B are true
E. A, B & C are true
