
Chapter 9: Correlation and Regression: Solutions

9.1 Correlation

In this section, we aim to answer the question: Is there a relationship between A and B?
Is there a relationship between the number of employee training hours and the number of
on-the-job accidents? Is there a relationship between the number of hours a person sleeps
and their reaction time? Is there a relationship between the number of hours a student
spends studying for a calculus test and the student’s score on that calculus test?
Definition: a correlation is a relationship between two variables.
Typically, we take x to be the independent variable. We take y to be the dependent
variable. Data is represented by a collection of ordered pairs (x, y).
Mathematically, the strength and direction of a linear relationship between two variables are
represented by the correlation coefficient. Suppose that there are n ordered pairs (x, y)
that make up a sample from a population. The correlation coefficient r is given by:
\[
r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\;\sqrt{n\sum y^2 - \left(\sum y\right)^2}}
\]

This will always be a number between -1 and 1 (inclusive).

• If r is close to 1, we say that the variables are positively correlated. This means there
is likely a strong linear relationship between the two variables, with a positive slope.

• If r is close to -1, we say that the variables are negatively correlated. This means there
is likely a strong linear relationship between the two variables, with a negative slope.

• If r is close to 0, we say that the variables are not correlated. This means that there
is likely no linear relationship between the two variables; however, the variables may
still be related in some other way.
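
The three cases above can be tried out numerically. The following is a minimal Python sketch, not part of the original notes: the function name `correlation_coefficient` and the three small data sets are assumptions invented for illustration. It applies the sums-based formula for r directly.

```python
import math

def correlation_coefficient(xs, ys):
    """Return r for paired data using the sums-based formula."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Made-up data on a line with positive slope: r is 1 (up to rounding).
r_up = correlation_coefficient([1, 2, 3, 4], [2, 4, 6, 8])
# Made-up data on a line with negative slope: r is -1 (up to rounding).
r_down = correlation_coefficient([1, 2, 3, 4], [8, 6, 4, 2])
# Made-up data on a symmetric parabola: y depends on x, yet r is 0.
r_curve = correlation_coefficient([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])

print(r_up, r_down, r_curve)
```

The last case illustrates the caveat in the third bullet: a value of r near 0 rules out a linear pattern only, not every kind of relationship.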

(Image from Laerd Statistics)

The correlation coefficient of the population is denoted by ρ and is usually unknown.
Example 1: The time x in years that an employee spent at a company and the employee’s
hourly pay, y, for 5 employees are listed in the table below. Calculate and interpret the
correlation coefficient r. Include a plot of the data in your discussion.

  x     y     x²     y²     xy
  5    25     25    625    125
  3    20      9    400     60
  4    21     16    441     84
 10    35    100   1225    350
 15    38    225   1444    570
Σx = 37   Σy = 139   Σx² = 375   Σy² = 4135   Σxy = 1189

Hint: Calculate the numerator:

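\[
n\sum xy - \left(\sum x\right)\left(\sum y\right) = 5 \cdot 1189 - 37 \cdot 139 = 802
\]

Then calculate the denominator:

\[
\sqrt{n\sum x^2 - \left(\sum x\right)^2}\;\sqrt{n\sum y^2 - \left(\sum y\right)^2}
= \sqrt{5 \cdot 375 - (37)^2}\;\sqrt{5 \cdot 4135 - (139)^2}
= \sqrt{506}\,\sqrt{1354} \approx 827.72
\]

Now, divide to get

\[
r \approx \frac{802}{827.72} \approx 0.97.
\]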
Interpret this result: There is a strong positive correlation between the number of years an
employee has worked and the employee’s hourly pay, since r is very close to 1.
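
The table columns and the arithmetic for Example 1 can be double-checked with a few lines of Python. This is only a sketch added for verification; the variable names are illustrative assumptions, while the data are the five (x, y) pairs from the table.

```python
import math

# Example 1 data: years at the company (x) and hourly pay (y).
x = [5, 3, 4, 10, 15]
y = [25, 20, 21, 35, 38]
n = len(x)

# Build the x^2, y^2 and xy columns of the table.
x_sq = [a * a for a in x]
y_sq = [b * b for b in y]
xy = [a * b for a, b in zip(x, y)]

# Column totals.
sum_x, sum_y = sum(x), sum(y)
sum_x2, sum_y2, sum_xy = sum(x_sq), sum(y_sq), sum(xy)

numerator = n * sum_xy - sum_x * sum_y
denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
               * math.sqrt(n * sum_y2 - sum_y ** 2))
r = numerator / denominator

print(sum_x, sum_y, sum_x2, sum_y2, sum_xy)           # 37 139 375 4135 1189
print(numerator, round(denominator, 2), round(r, 2))  # 802 827.72 0.97
```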

Example 2: The table below shows the number of absences, x, in a Calculus course and the
final exam grade, y, for 7 students. Find the correlation coefficient and interpret your result.

x 1 0 2 6 4 3 3
y 95 90 90 55 70 80 85

You may use the facts that (double check this for practice)

\[
\sum x = 19, \quad \sum y = 565, \quad \sum x^2 = 75, \quad \sum y^2 = 46{,}775, \quad \sum xy = 1{,}380.
\]

Calculate the numerator:

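\[
n\sum xy - \left(\sum x\right)\left(\sum y\right) = 7 \cdot 1380 - 19 \cdot 565 = -1075
\]

Then calculate the denominator:

\[
\sqrt{n\sum x^2 - \left(\sum x\right)^2}\;\sqrt{n\sum y^2 - \left(\sum y\right)^2}
= \sqrt{7 \cdot 75 - (19)^2}\;\sqrt{7 \cdot 46775 - (565)^2}
= \sqrt{164}\,\sqrt{8200} \approx 1159.66
\]

Now, divide to get

\[
r \approx \frac{-1075}{1159.66} \approx -0.93.
\]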
Interpret this result: There is a strong negative correlation between the number of absences
and the final exam grade, since r is very close to −1. Thus, as the number of absences
increases, the final exam grade tends to decrease.
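
For the suggested double check of the five sums, a short Python sketch such as the following could be used. The helper name `column_sums` is a hypothetical choice for illustration; the data are the seven pairs from the Example 2 table.

```python
import math

def column_sums(xs, ys):
    """Return (sum x, sum y, sum x^2, sum y^2, sum xy) for paired data."""
    return (
        sum(xs),
        sum(ys),
        sum(a * a for a in xs),
        sum(b * b for b in ys),
        sum(a * b for a, b in zip(xs, ys)),
    )

absences = [1, 0, 2, 6, 4, 3, 3]
grades = [95, 90, 90, 55, 70, 80, 85]
n = len(absences)

sx, sy, sxx, syy, sxy = column_sums(absences, grades)
print(sx, sy, sxx, syy, sxy)  # 19 565 75 46775 1380

r = (n * sxy - sx * sy) / (
    math.sqrt(n * sxx - sx ** 2) * math.sqrt(n * syy - sy ** 2)
)
print(round(r, 2))  # -0.93
```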

Example 3: The table below shows the height, x, in inches and the pulse rate, y, per minute,
for 9 people. Find the correlation coefficient and interpret your result.

x 68 72 65 70 62 75 78 64 68
y 90 85 88 100 105 98 70 65 72

You may use the facts that (double check this for practice)

\[
\sum x = 622, \quad \sum y = 773, \quad \sum x^2 = 43{,}206, \quad \sum y^2 = 68{,}007, \quad \sum xy = 53{,}336.
\]

Calculate the numerator:

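\[
n\sum xy - \left(\sum x\right)\left(\sum y\right) = 9 \cdot 53336 - 622 \cdot 773 = -782
\]

Then calculate the denominator:

\[
\sqrt{n\sum x^2 - \left(\sum x\right)^2}\;\sqrt{n\sum y^2 - \left(\sum y\right)^2}
= \sqrt{9 \cdot 43206 - (622)^2}\;\sqrt{9 \cdot 68007 - (773)^2}
= \sqrt{1970}\,\sqrt{14534} \approx 5350.89
\]

Now, divide to get

\[
r \approx \frac{-782}{5350.89} \approx -0.15.
\]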
Interpret this result: There appears to be an extremely weak, if any, correlation between
height and pulse rate, since r is close to 0.
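
As a cross-check on Example 3, r can also be computed from deviations about the means, which is an algebraic rearrangement of the sums-based formula. The notes do not use this form; the sketch below, with illustrative variable names, is only meant to show that both routes agree on the same data.

```python
import math

heights = [68, 72, 65, 70, 62, 75, 78, 64, 68]
pulses = [90, 85, 88, 100, 105, 98, 70, 65, 72]
n = len(heights)

# Sums-based formula, as in the worked example.
sx, sy = sum(heights), sum(pulses)
sxx = sum(a * a for a in heights)
syy = sum(b * b for b in pulses)
sxy = sum(a * b for a, b in zip(heights, pulses))
r_sums = (n * sxy - sx * sy) / (
    math.sqrt(n * sxx - sx ** 2) * math.sqrt(n * syy - sy ** 2)
)

# Deviation-from-the-mean form (an equivalent rearrangement, not used in the notes).
mean_x, mean_y = sx / n, sy / n
dev_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(heights, pulses))
dev_xx = sum((a - mean_x) ** 2 for a in heights)
dev_yy = sum((b - mean_y) ** 2 for b in pulses)
r_dev = dev_xy / math.sqrt(dev_xx * dev_yy)

print(round(r_sums, 2), round(r_dev, 2))  # -0.15 -0.15
```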
Example 4: The table below shows the number of absences, x, in a Calculus course and the
final exam grade, y, for 7 students. Find the correlation coefficient and interpret your result.

x 1 0 2 6 4 3 3
y 85 80 70 55 90 90 95

There are 7 ordered pairs (x, y), so n = 7. Calculate the needed sums:

  x     y     x²     y²     xy
  1    85      1   7225     85
  0    80      0   6400      0
  2    70      4   4900    140
  6    55     36   3025    330
  4    90     16   8100    360
  3    90      9   8100    270
  3    95      9   9025    285
Σx = 19   Σy = 565   Σx² = 75   Σy² = 46775   Σxy = 1470

Calculate the numerator:

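\[
n\sum xy - \left(\sum x\right)\left(\sum y\right) = 7 \cdot 1470 - 19 \cdot 565 = -445
\]

Then calculate the denominator:

\[
\sqrt{n\sum x^2 - \left(\sum x\right)^2}\;\sqrt{n\sum y^2 - \left(\sum y\right)^2}
= \sqrt{7 \cdot 75 - (19)^2}\;\sqrt{7 \cdot 46775 - (565)^2}
= \sqrt{164}\,\sqrt{8200} \approx 1159.66
\]

Now, divide to get

\[
r \approx \frac{-445}{1159.66} \approx -0.38.
\]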
Interpret this result: There is a weak negative correlation between the number of absences and the
final exam grade, since r is closer to 0 than it is to −1. (Compare this problem with Example 2.)
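
The comparison with Example 2 can be made concrete. Both examples use the same x-values and the same seven grades, only paired differently, so Σx, Σy, Σx² and Σy² are identical and only Σxy changes. The Python sketch below (the function name `pearson_r` is an illustrative assumption) shows how much r depends on the pairing.

```python
import math

def pearson_r(xs, ys):
    """Correlation coefficient r from the sums-based formula."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(a * a for a in xs)
    syy = sum(b * b for b in ys)
    sxy = sum(a * b for a, b in zip(xs, ys))
    return (n * sxy - sx * sy) / (
        math.sqrt(n * sxx - sx ** 2) * math.sqrt(n * syy - sy ** 2)
    )

absences = [1, 0, 2, 6, 4, 3, 3]
grades_ex2 = [95, 90, 90, 55, 70, 80, 85]
grades_ex4 = [85, 80, 70, 55, 90, 90, 95]

# The two grade lists contain the same values in a different order.
same_values = sorted(grades_ex2) == sorted(grades_ex4)

r_ex2 = pearson_r(absences, grades_ex2)
r_ex4 = pearson_r(absences, grades_ex4)
print(same_values, round(r_ex2, 2), round(r_ex4, 2))  # True -0.93 -0.38
```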
Interpreting the Correlation Between Two Variables:
Suppose that you find a strong positive or negative correlation between two variables. Is
there a cause-and-effect relationship between these variables?

• There could be a direct cause-and-effect relationship: that is, x causes y.

• There could be a reverse cause-and-effect relationship: that is, y causes x.

• There could be a third (or fourth? or more?) variable that leads to the relationship
between x and y.

• The “relationship” between x and y may just be a coincidence.

9.2 Linear Regression

If there is a “significant” linear correlation between two variables, the next step is to find the
equation of a line that “best” fits the data. Such an equation can be used for prediction: given
a new x-value, this equation can predict the y-value that is consistent with the information
known about the data. This predicted y-value will be denoted by ŷ. The line represented
by such an equation is called the linear regression line.
The equation for a line is

ŷ = mx + b,

where m is the slope of the line and b is the y-intercept (the y-value for which x is 0).
In general, the regression line will not pass through each data point. For each data point,
there is an error: the difference between the y-value from the data and the y-value on the
line, ŷ. By definition, this linear regression line is such that the sum of the squares of the
errors is the least possible. It turns out, given a set of data, there is only one such line. The
slope m and y-intercept b are given by

\[
m = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{n\sum x^2 - \left(\sum x\right)^2},
\qquad
b = \frac{\sum y}{n} - m\,\frac{\sum x}{n}
\]
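
To see the least-squares property in action, the formulas for m and b can be coded and the resulting sum of squared errors compared with that of slightly different lines. The following Python sketch is illustrative only: the function names `regression_line` and `sum_squared_errors` and the four data points are assumptions made up for this demonstration.

```python
def regression_line(xs, ys):
    """Return (m, b) for the least-squares line y-hat = m*x + b."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b = sum_y / n - m * sum_x / n
    return m, b

def sum_squared_errors(xs, ys, m, b):
    """Sum of squared differences between each y and the line's y-hat."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))

# Made-up data that do not lie exactly on a line.
xs = [0, 1, 2, 3]
ys = [1, 3, 4, 8]
m, b = regression_line(xs, ys)
best = sum_squared_errors(xs, ys, m, b)

# Nudging the slope or the intercept away from (m, b) increases the error.
worse_slope = sum_squared_errors(xs, ys, m + 0.1, b)
worse_intercept = sum_squared_errors(xs, ys, m, b + 0.1)
print(m, b, best, worse_slope, worse_intercept)
```

For these points the formulas give a slope of about 2.2 and an intercept of about 0.7, and both perturbed lines have a larger sum of squared errors than the fitted one.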

Examples: Find the equation of the regression line for each of the two examples and two
practice problems in section 9.1.
Example 1:
First, find the slope m. Start by determining the numerator:

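\[
n\sum xy - \left(\sum x\right)\left(\sum y\right) = 5 \cdot 1189 - 37 \cdot 139 = 802
\]

Next, find the denominator:

\[
n\sum x^2 - \left(\sum x\right)^2 = 5 \cdot 375 - (37)^2 = 506
\]

Divide to obtain

\[
m = \frac{802}{506} \approx 1.58
\]

Now, find the y-intercept:

\[
b = \frac{\sum y}{n} - m\,\frac{\sum x}{n} \approx \frac{139}{5} - 1.58 \cdot \frac{37}{5} \approx 16.11
\]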
Therefore, the equation of the regression line is ŷ = 1.58x + 16.11
Additional Questions: Use the equations to (Ex 1) predict the hourly pay rate of an employee
who has worked for 20 years, and (Ex 2) predict the test score for a student with 5 absences.
For an employee who has worked 20 years, x = 20. Plug this into the equation for the
regression line: ŷ = 1.58 · 20 + 16.11 = 47.71 is the predicted hourly pay, based on the data.
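
The worked solution rounds m to two decimal places before computing b, which shifts the final prediction slightly. The Python sketch below (variable names are illustrative assumptions) reproduces the rounded calculation and, for comparison, the same calculation carried at full precision.

```python
# Sums from the Example 1 table.
n, sum_x, sum_y, sum_x2, sum_xy = 5, 37, 139, 375, 1189

m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)

# Path 1: round m to two decimals before computing b, as in the worked solution.
m_rounded = round(m, 2)
b_rounded = round(sum_y / n - m_rounded * sum_x / n, 2)
pay_rounded = m_rounded * 20 + b_rounded

# Path 2: keep full precision until the end.
b = sum_y / n - m * sum_x / n
pay_full = m * 20 + b

print(m_rounded, b_rounded, round(pay_rounded, 2))  # 1.58 16.11 47.71
print(round(pay_full, 2))                           # 47.77
```

The two answers differ by a few cents, which is purely a rounding effect and not a different model.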
Example 2:
First, find the slope m. Start by determining the numerator:

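\[
n\sum xy - \left(\sum x\right)\left(\sum y\right) = 7 \cdot 1380 - 19 \cdot 565 = -1075
\]

Next, find the denominator:

\[
n\sum x^2 - \left(\sum x\right)^2 = 7 \cdot 75 - (19)^2 = 164
\]

Divide to obtain

\[
m = \frac{-1075}{164} \approx -6.55
\]

Now, find the y-intercept:

\[
b = \frac{\sum y}{n} - m\,\frac{\sum x}{n} \approx \frac{565}{7} - (-6.55) \cdot \frac{19}{7} \approx 98.49
\]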
Therefore, the equation of the regression line is ŷ = −6.55x + 98.49
Additional Question: Use the equation to predict the test score for a student with 5 absences.
For a student with 5 absences, x = 5. Plug this into the equation for the regression line:
ŷ = −6.55 · 5 + 98.49 = 65.74 is the predicted score, based on the data.
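
Because b is defined as Σy/n − m · Σx/n, the regression line always passes through the point of means (x̄, ȳ). A short Python sketch for Example 2 illustrates this and repeats the prediction for 5 absences without intermediate rounding; the helper name `predict` is an assumption made for illustration.

```python
# Sums from Example 2 (absences x, final exam grade y).
n, sum_x, sum_y, sum_x2, sum_xy = 7, 19, 565, 75, 1380

m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b = sum_y / n - m * sum_x / n

def predict(x):
    """Predicted grade y-hat for x absences, using unrounded m and b."""
    return m * x + b

mean_x, mean_y = sum_x / n, sum_y / n

# Because b = mean_y - m * mean_x, the line passes through (mean_x, mean_y).
print(round(predict(mean_x), 2), round(mean_y, 2))  # 80.71 80.71
print(round(predict(5), 2))                         # 65.73
```

The unrounded prediction of about 65.73 differs from 65.74 only because the worked solution rounds m and b first.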
Example 3:
First, find the slope m. Start by determining the numerator:

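\[
n\sum xy - \left(\sum x\right)\left(\sum y\right) = 9 \cdot 53336 - 622 \cdot 773 = -782
\]

Next, find the denominator:

\[
n\sum x^2 - \left(\sum x\right)^2 = 9 \cdot 43206 - (622)^2 = 1970
\]

Divide to obtain

\[
m = \frac{-782}{1970} \approx -0.40
\]

Now, find the y-intercept:

\[
b = \frac{\sum y}{n} - m\,\frac{\sum x}{n} \approx \frac{773}{9} - (-0.40) \cdot \frac{622}{9} \approx 113.53
\]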
Therefore, the equation of the regression line is ŷ = −0.40x + 113.53. Even though we found
an equation, recall that the correlation between x and y in this example was weak. Thus,
this regression line may not work very well for the data.
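
One way to see how little this line helps is to compare its sum of squared errors with that of a flat line that ignores height and always predicts the mean pulse rate. The Python sketch below is an illustration added here, with assumed variable names; it uses the Example 3 data and unrounded values of m and b.

```python
heights = [68, 72, 65, 70, 62, 75, 78, 64, 68]
pulses = [90, 85, 88, 100, 105, 98, 70, 65, 72]
n = len(heights)

sum_x, sum_y = sum(heights), sum(pulses)
sum_x2 = sum(x * x for x in heights)
sum_xy = sum(x * y for x, y in zip(heights, pulses))

m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b = sum_y / n - m * sum_x / n
mean_y = sum_y / n

# Sum of squared errors for the regression line.
sse_line = sum((y - (m * x + b)) ** 2 for x, y in zip(heights, pulses))
# Sum of squared errors if we ignore height and always predict the mean pulse.
sse_mean = sum((y - mean_y) ** 2 for y in pulses)

print(round(sse_line, 1), round(sse_mean, 1))  # 1580.4 1614.9
```

The regression line reduces the squared error by only about 2 percent, consistent with the weak correlation found for this example.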

Example 4:
First, find the slope m. Start by determining the numerator:

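\[
n\sum xy - \left(\sum x\right)\left(\sum y\right) = 7 \cdot 1470 - 19 \cdot 565 = -445
\]

Next, find the denominator:

\[
n\sum x^2 - \left(\sum x\right)^2 = 7 \cdot 75 - (19)^2 = 164
\]

Divide to obtain

\[
m = \frac{-445}{164} \approx -2.71
\]

Now, find the y-intercept:

\[
b = \frac{\sum y}{n} - m\,\frac{\sum x}{n} \approx \frac{565}{7} - (-2.71) \cdot \frac{19}{7} \approx 88.07
\]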
Therefore, the equation of the regression line is ŷ = −2.71x + 88.07. Even though we found
an equation, recall that the correlation between x and y in this example was weak. Thus,
this regression line may not work very well for the data. For example, for a student with
x = 0 absences, plugging in, we find that the grade predicted by the regression line is 88.
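
To make the quality of the fit concrete, each student's actual grade can be set beside the grade the line predicts. The following Python sketch is an illustration with assumed variable names; it uses the Example 4 data together with the rounded slope and intercept from the worked solution.

```python
absences = [1, 0, 2, 6, 4, 3, 3]
grades = [85, 80, 70, 55, 90, 90, 95]

# Rounded slope and intercept from the worked solution.
m, b = -2.71, 88.07

rows = []
for x, y in zip(absences, grades):
    y_hat = m * x + b
    # Error = actual grade minus predicted grade.
    rows.append((x, y, round(y_hat, 2), round(y - y_hat, 2)))

for x, y, y_hat, error in rows:
    print(f"x={x}  y={y}  y_hat={y_hat}  error={error}")

largest_miss = max(abs(error) for _, _, _, error in rows)
print(largest_miss)
```

The student with 0 absences actually scored 80, about 8 points below the predicted 88, and the largest miss in the table is nearly 17 points.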
