You are on page 1of 6

Regression: Finding the equation of the line of best fit

Objectives: To find the equation of the least squares regression line of y on


x.
Background and general principle

The aim of regression is to calculate


equation
the of the line of best fit on a 85
80
graph.
scatter
75
Consider the scatter graph on the right. Weight 70
(kg)
possible
One line of best fit has been drawn 65

the diagram. Some of the points lie


on 60

the line and some lie below it.


above 55
50
The vertical distance each point is above
below the line has been added to the
or 1·4 1·5 1·6
Height (m)
1·7 1·8

diagram. These distances are called


deviations or errors – they are symbolised
as
d 1 , d 2 ,... d n .
,
When drawing in a regression line, the aim is to make the line fit the points as closely as possible.
do this by making the total of the squares of the deviations as small as possible, i.e. we
We
∑ di2 .
minimise

If a line of best fit is found using this principle, it is called the least-squares regression
line.
Example 1:
A patient is given a drip feed containing a particular chemical and its concentration in his blood
measured,
is in suitable units, at one hour intervals. The doctors believe that a linear relationship
exist between the variables.
will

Time, x (hours) 0 1 2 3 4 5 6
Concentration, y 2.4 4.3 5.0 6.9 9.1 11.4 13.5

We can plot these data on a scatter graph


time
– would be plotted on the horizontal 12
(as it is the independent variable). Time
axis
10
here referred to as a controlled
is
since the experimenter fixed the value of
variable, 8
Concentration
variable
this in advance (measurements 6
taken every hour).
were 4
Concentration is the dependent variable as 2
concentration in the blood is likely to
the
according to time.
vary 1 2 3 4 5 6
Time (hours)
The doctor may wish to estimate the
concentration of the chemical in the blood after 3.5
hours.
She could do this by finding the equation of the line of best
fit.
There is a formula which gives the equation of the line of best
fit.
The equation of the line is y = a + bx

Sxy
where b = and a = y − bx.
Sxx

=∑x −
∑ x∑ y and =∑x −
2
(∑ x)2

Note: Sxy Sxx .


y n n
Note 2: x and y are the mean values of xand yrespectively.

This line is called the (least-squares) regression line yon x(because the equation has been given
with ythe subject).
of
b is sometimes called the regression coefficient yon x.
of
We can work out the equation for our example as
follows:
2
∑ x = 0 + 1 + ... + 6 = 2 s x= =3
17
1 o
52.6 These could all be
∑ y = 2.4 + 4.3 + ... + 13.5 = 52.6 so y=
7
= 7.51 ... found on a
4 calculator (if you
∑ xy = (0 × 2.4) + (1× 4.3)+ ... + (6 ×13.5)= 20 .4 enter the data into a
92 calculator).
∑ x2 = 0 2 + 12 + ... + 62 = 9 s x= =3
17
1 o

Sxy = ∑ x −
∑ x∑ y = 20 .4 −
2 × 5 .6
= 5 .6
n 1 72
y 9 1
(∑ x) 2
(2 ) =
2

Sxx =∑x − 2
=9 − 2
n 17
1 8
Sxy 5 .6
So, b = = = 1.84 and a = y − b x = 7.51 −1.84 × 3 = 1.98 .
Sxx 12 3 4 3 5
So the equation8of the regression line y = 1.985 + 1.843x.
is
To work out the concentration after 3.5 hours: y = 1.985 + 1.843 × 3.5 = 8.44
(3sf)
If you want to find how long it would be before the concentration reaches 8 units, we substitute y =
8into the regression equation:
8 = 1.985 + 1.843x
Solving this we get: x = 3.26 hours

Note: It would not be sensible to predict the concentration after 8 hours from this equation – we
know whether the relationship will continue to be linear. The process of trying to predict a value
don’t
outside the range of your data is called
from
extrapolation.
Example
The heights and weights of a sample of 11 students
2:
are:
Height 1.36 1.47 1.54 1.56 1.59 1.63 1.66 1.67 1.69 1.74 1.81
(m) h
Weight 52 50 67 62 69 74 59 87 77 73 67
(kg) w

[ n = 11 ∑ h = 17.72 ∑ h 2
= 28.70 ∑ w = 73 ∑ w2 = 5057 ∑ h = 119 .1 ]
5 7 1 w 6
a) Calculate the regression line of w on
b) Use the regression line to estimate the weight of someone whose height is
h.
1.6m.
Note: Both height and weight are referred to rando variables – their values could not have
predicted before the data were collected. If the sampling
as m were repeated again, different values
been
be obtained for the heights and
would
weights.
Solutio :
na) We begin by finding the mean of each
variable: ∑ h 1 .7
h= = = 1.610 ...
n 7 12
9
w=
∑ w 73
=
1
=6
n 71
7
Next we find the sums 1 of
squares: (∑ h )2 1 .7 2
Shh = ∑ h −
2
= 2 .70 − = 0.159
n 7 12
8 5 7
( )2
∑ = 5057 − 73 = 119
w 12
Sww = ∑ w2 −
n 71
1 2
Shw = ∑ h −
∑ h∑ w
= 119 .1 −
11 .7 × 73
= 8.8
n 7 21 7
w 6 6
The equation of the regression line 1
is: w = a + bh
where
S 8.8
b = hw = = 5 .5
Shh 0.159 6
5
and 7
a = w − bh = 67 − 55.5 ×1.610 = −22.4
So the equation of the regression line of w on h 9
is: w = -22.4 + 55.5h

b) To find the weight for someone that is 1.6m


high: w = -22.4 + 55.5×1.6 = 66.4
kg
Two possible regression lines

When x and y are bothrandom , there are two possible regression lines that can
calculated: variables be
* the regression line of y on
* the regression line of x on y.
x;

The regression line of y on x is the line that has already been met. Its equation
is y = a + bx
and it is used to find a value of y when we are given a value of x. This line minimises the
distances
vertical of each point from the
line.
The regression line of x on y minimises the horizontal distances of each point from the line. It is
if you wish to work out a value of x when you are given a value of y. The equation of this
used
line is
regression
x = a´ + b´y
where
S xy
b′ =
S yy
and
a ′ = x − b′ y .

The regression lines will not in general be the same (unless the points lie on a perfect straight
line).
Both regression lines pass through the mean ( x, y) .
point
Not : If xis a controlled variable, you always use the regression line of yon x(since the regression
line
e of xon ydoesn’t have any statistical meaning in this case).

Example:
A psychologist wants to investigate the relationship between the IQ of a child and the IQ of their
mother. She measures the IQ of a sample of 8 children and mothers:

Child’s IQ, x 87 91 94 98 103 108 111 123


Mother’s IQ, y 94 96 89 102 98 94 116 117

[ ∑ x = 81 ∑ x 2 = 8401 ∑ y = 80 ∑ y 2 = 8196 ∑ x = 8278 ]


5 3 6 2 y 9
a) Work out the product moment correlation coefficient between x and
b) Calculate the regression line of y on x and the regression line of x on y. Write down a point
y.
both lines pass through.
that
c) Use the appropriate regression line to estimate the IQ of a child born to a mother with an IQ
100.
of Using your answer to part (a), explain how accurate you think this estimate is likely to
be.
Solution:
a) Using the sums given in the question we find
that: 81 × 80 81 2
S xy = 8278 − = 67 .7 S xx = 8401 − = 98 .87
5 86 58
9 7 5 3 4 5
80 2
S yy = 8196 − = 75 .5
68
2 7
So,
67 .75
r= = 0.78 (to 3 s.f.)
98 7.87 × 75 .5 5
4 5 7
b) For the regression line of y on
x: 67 .7
b= = 0.68 (to 3 s.f.)
7 .87
98 5
8
481 5 80
x= = 10 .87 an y = = 10 .75
58 68
1 5 d 0
a = y − b x = 10 .7 − 0.68 × 10 .87 = 3 .6
So the equation of the0 regression
5 8 line1 of y5 on x 0 6
is: y = 30.66 + 0.688x.

For the regression line of x on


y: 67 .7
b′ = = 0.89 (to 3 s.f.)
775 5.5
5
a ′ = x7− b ′ y = 10 .87 − 0.89 × 10 .7 = 1 .7
So the equation of the1regression
5 5 of0 x on
line 5 y 1 0
is x = 11.70 + 0.895y.

Both lines must pass through the mean point (101.875,


100.75).
c) Both variables are random
To find x when y = 100, we use the regression line of x on
variables.
y. x = 11.70 + 0.895 ×100 =
101.2
This accurate should be reasonably accurate since the product moment correlation coefficient
fairly strong correlation.
shows

Example 2:
The scores that 9 students obtained in their C1 and M1 mathematics examinations are as
follows:
C1 mark, c% 82 51 68 45 30 55 64 77 28
M1 mark, m% 75 46 84 47 42 59 52 69 41

[ ∑ c = 50 , ∑ c 2 = 3070 , ∑ m = 51 , ∑ m2 = 3139 , ∑ c = 3061


0 8 5 7 m 7
a) Calculate the equation of the appropriate regression line in order to estimate the mark that a
scoring
student 53% in M1 might expect to obtain in
b) Explain why it would not be sensible to use the same regression line to estimate the C1 mark
C1.
might be expected if a student scored 20% in
that
M1.
Solutio :
na) Both c and m are random variables. So we need to find the regression line of c on
m:
50 × 51 51 2
=
Scm 3061 − = 200 .89 =
Smm 3139 − = 192 .56
0 95 59
7 5 7 7
200 .8
So, b = = 1.0
5 .59
192 4
7 6
50 51
c= = 5 .5 an m = = 5 .2
09 59
5 6 d 7 2
Therefore, a = 55.56 − 1.04 × 57.22 = −3.95

So, the regression line of c on m is:


c = -3.95 + 1.04m

When m = 53, c = -3.95 + 1.04×53 = 51.17%

b) The value m = 20 lies outside the range of M1 marks seen in the table. The regression line
calculated in part (a) cannot be assumed to still be valid outside the range of values given in the table.

Example 3:
A particular greenhouse plant is suspect able to a particular disease. An agricultural scientist wishes
to see how the temperature of the greenhouse affects the prevalence of the disease. She designs an
experiment in which she monitors the percentage of diseased leaves occurring at different
temperatures:

Temperature, t °F 70 72 74 76 78 80
Percentage of 12.3 9.5 7.7 6.1 4.3 2.3
diseased leaves, p

[ ∑ t = 45 , ∑ t 2 = 3382 , ∑ p = 42.2, ∑ p 2 = 36 .82, ∑ tp = 309 .8 ]


0 0 1 7
The scientist wishes to estimate the temperature that she should set the greenhouse if she is aims for
5% of leaves being diseased. Calculate an appropriate regression line and use it to find the required
temperature (giving your answer to the nearest whole number). Give a reason for your choice of
regression line.

Solution:
In this situation, temperature is a controlled variable. Only the regression line of p on t makes sense
here.

45 × 42.2 42.2 2
Stp = 309 .8 − = −67.2 S pp = 36 .82 − = 65.01
0 6 6
7 1 3
− 67.2
So, b = = −1.0336
65.013
45 42.2
t= = 75 an p = = 7.03
06 6
d 3
Therefore, a = 7.03 − (−1.033 ) × 7 = 8 .5
3 6 5 4
So, the regression line of t on p
is: p = 84.5 - 1.03t

Therefore to find t, we solve


5 = 84.5 – 1.03t
i.e. t = 77°C

You might also like