584 TWO VARIABLE STATISTICS (Chapter 18)

C LEAST SQUARES REGRESSION


Let us revisit the Opening Problem. We know that there is quite a strong positive correlation between the height and the weight of the players.

Consequently, we should be able to find a linear equation which ‘best fits’ the data.

This line of best fit could be found by eye. However, different people will use different lines. So, how do we find the line of best fit mathematically?

[Figure: scatterplot of weight (kg) against height (cm) for the players.]

RESIDUALS
A residual is the vertical distance between a data point and the possible line of best fit.

That is: a residual is a value of $y - \hat{y}$, where $y$ is an observed value and $\hat{y}$ is the value on the possible line of best fit directly above or below $y$.

[Figure: a possible line of best fit, with a positive residual for a point above the line and a negative residual for a point below it.]

So, there are positive and negative residuals.

LEAST SQUARES REGRESSION LINE FOR y ON x


Statisticians developed a method which produces the best line. The process is to minimise the sum of the squares of the residuals.

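Suppose the line of best fit is $y = mx + c$, and that $\hat{y}_i$ is the value on this line at $x = x_i$.

$r_1^2 = (y_1 - \hat{y}_1)^2$
$r_2^2 = (y_2 - \hat{y}_2)^2$
$r_3^2 = (y_3 - \hat{y}_3)^2$
etc.

[Figure: the line $y = mx + c$ with data points $(x_1, y_1)$ to $(x_4, y_4)$, the corresponding points $(x_1, \hat{y}_1)$ to $(x_4, \hat{y}_4)$ on the line, and the residual $r_1$ marked.]

So, we need to minimise $S = r_1^2 + r_2^2 + r_3^2 + \dots + r_n^2$.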
DEMO

Click on the icon to experiment with finding the ‘line of best fit’
by minimising the sum of the squares of the residuals.
Write down the function that you find which minimises the sum of the squares of the residuals.
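
As a rough illustration of what the demonstration does, the sketch below computes the sum of the squares of the residuals, S, for two candidate lines y = mx + c. It is only a sketch: the function name `sum_squared_residuals` and the line chosen ‘by eye’ are made up for illustration, and the three data points are borrowed from Example 4 of this section.

```python
def sum_squared_residuals(points, m, c):
    """Return S, the sum of squared residuals for the line y = m*x + c."""
    total = 0.0
    for x, y in points:
        y_hat = m * x + c        # value on the candidate line at x
        residual = y - y_hat     # positive if the point lies above the line
        total += residual ** 2
    return total


# Illustrative data: the three points used in Example 4.
points = [(1, 3), (3, 5), (5, 6)]

# A line chosen "by eye" (hypothetical) and the least squares line for these points.
s_by_eye = sum_squared_residuals(points, 1.0, 2.0)
s_least_squares = sum_squared_residuals(points, 0.75, 29 / 12)

print(f"by eye:        S = {s_by_eye:.4f}")
print(f"least squares: S = {s_least_squares:.4f}")
```

The line y = x + 2 passes exactly through two of the three points, yet its value of S is larger than that of the least squares line, because the one remaining residual is large and squaring emphasises it.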

LEAST SQUARES FORMULAE


You can probably imagine the time-consuming work needed to find m and c, especially with
20 or so points (not just three as in the above working).
In fact a formula exists for finding the least squares regression line for y on x. It is:

$$y - \bar{y} = \frac{s_{xy}}{s_x^2}(x - \bar{x})$$

Example 4
Use the formulae to calculate m and c for the line of best fit through (1, 3), (3, 5) and (5, 6).

         x     y    xy   $x^2$
         1     3     3     1
         3     5    15     9
         5     6    30    25
Total    9    14    48    35

So, $\sum x = 9$, $\sum y = 14$, $\sum xy = 48$, $\sum x^2 = 35$, $n = 3$.

$s_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n} = 48 - \frac{9 \times 14}{3} = 6$

$s_x^2 = \sum x^2 - \frac{(\sum x)^2}{n} = 35 - \frac{9^2}{3} = 8$

$\bar{x} = \frac{\sum x}{n} = \frac{9}{3} = 3$ and $\bar{y} = \frac{\sum y}{n} = \frac{14}{3}$

So, using $y - \bar{y} = \frac{s_{xy}}{s_x^2}(x - \bar{x})$ we get $y - \frac{14}{3} = \frac{6}{8}(x - 3)$
$y - 4.67 \approx 0.75x - 2.25$
$y \approx 0.75x - 2.25 + 4.67$
$y \approx 0.75x + 2.42$
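
The arithmetic of Example 4 can be sketched in a few lines of Python. This is an illustrative sketch rather than a prescribed method: the function name `least_squares_line` is an assumption, and the quantities `s_xy` and `s_x2` are computed from the sums exactly as in the working above.

```python
def least_squares_line(xs, ys):
    """Return (m, c) for the least squares line y = m*x + c of y on x."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)

    s_xy = sum_xy - sum_x * sum_y / n   # as computed in Example 4
    s_x2 = sum_x2 - sum_x ** 2 / n

    m = s_xy / s_x2
    c = sum_y / n - m * sum_x / n       # the line passes through (x-bar, y-bar)
    return m, c


m, c = least_squares_line([1, 3, 5], [3, 5, 6])
print(f"y = {m:.2f}x + {c:.2f}")   # y = 0.75x + 2.42
```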

From this point onwards we will use technology to find the least squares regression line.
We can find the least squares regression line using:
• a computer package
• a graphics calculator
• a computer spreadsheet

To do this consider the tabled data:

x    1    2    3    4    5    6    7
y    5    8   10   13   16   18   20

USING A STATISTICS PACKAGE


The package is very easy to use. Click on the icon. Enter the data. Examine all features that the package produces.

USING A GRAPHICS CALCULATOR


Enter the data in two lists and use your calculator to find the equation of the regression line.

USING A COMPUTER SPREADSHEET


Plotting the points, fitting the line of best fit and finding its equation can all be done easily using a spreadsheet such as Microsoft Excel®. The following is a step-by-step procedure for determining the line of best fit.

Step 1: Enter the data into the cells.

Step 2: Highlight (blacken) the cells containing the data by clicking the LH mouse button on A1 and dragging through to B7.

Step 3: Click on the chart wizard icon. Click on XY (Scatter), then NEXT. Click on FINISH. You should now have a graph showing the 7 points.

Step 4: Place the arrow on one of the points and click the RH mouse button once. Select Add Trendline and click on this with the LH mouse button. Select Linear, then OK, and the line of best fit should be added to your graph.

Step 5: Place the arrow on the trend line and click with the RH mouse button. Select Format Trendline, Options, then select Display equation on chart and Display R-squared value on chart, then OK.

You should now have $y = 2.5357x + 2.7143$ and $R^2 = 0.9955$ added to your graph.
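
The spreadsheet output can be cross-checked with a short Python sketch. It uses only the tabled data; the variable names are illustrative, and $R^2$ is computed here as the square of the correlation coefficient, which is what it equals for a straight-line fit of y on x.

```python
xs = [1, 2, 3, 4, 5, 6, 7]
ys = [5, 8, 10, 13, 16, 18, 20]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Sums of squares and products about the means.
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_yy = sum((y - y_bar) ** 2 for y in ys)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

slope = s_xy / s_xx
intercept = y_bar - slope * x_bar
r_squared = s_xy ** 2 / (s_xx * s_yy)

print(f"y = {slope:.4f}x + {intercept:.4f}")   # y = 2.5357x + 2.7143
print(f"R^2 = {r_squared:.4f}")                # R^2 = 0.9955
```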

INTERPOLATION / EXTRAPOLATION
The two variables in the following scatterplot are the mass of a platypus (independent variable
plotted on the x-axis) and the length of the same platypus (dependent variable plotted on the
y-axis) for 14 different animals.
The data was collected in an experiment to discover if there was a relationship between the
length and mass of these animals.

Even though the correlation in this case is only moderate, a line of best fit has been drawn to enable predictions to be made.

[Figure: scatterplot of length against mass with the line of best fit. The lower pole and upper pole mark the smallest and largest mass values; interpolation lies between the poles and extrapolation lies outside them.]

If we use the equation of the least squares line to predict length values for mass values in between the smallest and largest mass values that were supplied from the experiment, we say we are interpolating (in between the poles).

If we predict length values for mass values outside the smallest and largest mass values that were supplied from the experiment, we say we are extrapolating (outside the poles).
The accuracy of an interpolation depends on how linear the original data was. This can
be gauged by determining the correlation coefficient and ensuring that the data is randomly
scattered around the line of best fit.
The accuracy of an extrapolation depends not only on ‘how linear’ the original data was, but
also on the assumption that the linear trend will continue past the poles.
The validity of this assumption depends greatly on the situation under investigation.

CARE MUST BE TAKEN WHEN EXTRAPOLATING


The performance of a light spring is under consideration. An attempt is being made to find the connection between the extension (y cm) of the spring and the mass in the basket (x grams).

[Figure: a spring hanging beside a ruler, with a pointer and a wire basket to which weights are added.]

A typical graph for this experiment is:

[Figure: extension (y cm) plotted against mass (g), showing the points (100, 55), (200, 63), (300, 72), (400, 82) and (500, 89).]

There is a very high positive correlation between the variables, and the line of best fit is determined to be $y \approx 0.087x + 46.1$ cm.

However, it would be dangerous to predict that for a mass of 800 grams the extension would be $0.087 \times 800 + 46.1 = 115.7$ cm because we may have exceeded the elastic limit of the spring somewhere between x = 500 grams and x = 800 grams, meaning that the spring becomes permanently stretched more than predicted by the graph.
A further example could be the world record for the long jump prior to the Mexico City Olympic Games of 1968. A steady, regular increase in the world record over the previous 30 years had been recorded. However, due to the high altitude and a perfect jump, the USA competitor Bob Beamon shattered the record by a huge amount, not in keeping with previous increases.
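
Returning to the spring experiment, the sketch below refits the five plotted points and labels each prediction as an interpolation or an extrapolation according to whether the mass lies between the poles. The function name `predict` and the test masses of 240 g and 800 g are illustrative choices, not part of the experiment.

```python
# Spring data read from the graph: (mass in g, extension in cm).
data = [(100, 55), (200, 63), (300, 72), (400, 82), (500, 89)]

xs = [x for x, _ in data]
ys = [y for _, y in data]
n = len(data)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

slope = (sum((x - x_bar) * (y - y_bar) for x, y in data)
         / sum((x - x_bar) ** 2 for x in xs))
intercept = y_bar - slope * x_bar


def predict(x):
    """Return the predicted extension and whether x lies between the poles."""
    kind = "interpolation" if min(xs) <= x <= max(xs) else "extrapolation"
    return slope * x + intercept, kind


for mass in (240, 800):
    extension, kind = predict(mass)
    print(f"{mass} g: {extension:.1f} cm ({kind})")
```

The label is only a reminder. Nothing in the arithmetic can tell whether the spring has passed its elastic limit, so the judgement about an extrapolated value still has to come from knowledge of the situation.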

Example 5
The table below shows the sales for Hancock’s Electronics established in late 1998.

Year                  1999   2000   2001   2002   2003   2004
Sales ($ × 10 000)       5      9     14     18     21     27

a Draw a graph to illustrate this data.
b Find $r^2$.
c Find the equation of the line of best fit using the linear regression formula.
d Predict the sales figures for year 2006, giving your answer to the nearest $10 000. Comment on the reasonableness of this prediction.

a Let t be the time in years from 1998 and S be the sales in $10 000’s, i.e.,

t    1    2    3    4    5    6
S    5    9   14   18   21   27

[Figure: scatterplot of S against t.]

b Using technology, $r^2 = 0.9941$.

c The line of best fit is $S = 4.286t + 0.667$.

d In 2006, $t = 8$, so $S \approx 4.286 \times 8 + 0.667 \approx 35$
i.e., predicted year 2006 sales would be $350 000.
The r and $r^2$ values suggest that the linear relationship between sales and year is very strong and positive. However, since this prediction is an extrapolation, it will only be reasonable if the trend evident from 1999 to 2004 continues to the year 2006, and this may or may not occur.
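
For readers who want to check parts b and c without technology, the working below applies the sums approach of Example 4 to the (t, S) data. The symbols $s_{tS}$, $s_t^2$ and $s_{SS}$ are introduced here only for this check, and the expression used for $r^2$ is the standard identity for a straight-line fit rather than a formula quoted in this section.

```latex
\begin{align*}
\sum t &= 21, \quad \sum S = 94, \quad \sum tS = 404, \quad \sum t^2 = 91, \quad \sum S^2 = 1796, \quad n = 6 \\
s_{tS} &= \sum tS - \frac{(\sum t)(\sum S)}{n} = 404 - \frac{21 \times 94}{6} = 75 \\
s_t^2 &= \sum t^2 - \frac{(\sum t)^2}{n} = 91 - \frac{21^2}{6} = 17.5 \\
s_{SS} &= \sum S^2 - \frac{(\sum S)^2}{n} = 1796 - \frac{94^2}{6} \approx 323.33 \\
m &= \frac{s_{tS}}{s_t^2} = \frac{75}{17.5} \approx 4.286, \qquad
c = \bar{S} - m\,\bar{t} = \frac{94}{6} - \frac{75}{17.5} \times 3.5 \approx 0.667 \\
r^2 &= \frac{s_{tS}^2}{s_t^2 \, s_{SS}} \approx \frac{5625}{17.5 \times 323.33} \approx 0.9941
\end{align*}
```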

EXERCISE 18C
1 Recall the tread depth data of car tyres after travelling thousands of kilometres:

kilometres (X thousand)    14    17    24    34    35    37    38    39
tread depth (Y mm)        5.7   6.5   4.0   3.0   1.9   2.7   1.9   2.3

[Figure: tyre cross-section showing the depth of tread.]

a Which is the dependent variable?
b Find the equation of the least squares regression line.
c On a scatterplot graph the least squares regression line.
d Use the equation of the line of best fit to estimate the tread depth of a new tyre.
e If a tread depth of 2 mm or more is considered to be essential for safe driving, estimate the distance a tyre of this brand should last.

2 Recall the restaurateur’s data for the number of diners in March and the temperature at noon.

Temperature (X °C)       23   25   28   30   30   27   25   28   32   31   33   29   27
Number of diners (Y)     57   64   62   75   69   58   61   78   80   67   84   73   76

a What is the independent variable?
b Find the covariance of X and Y, denoted Cov(X, Y).
c Find the equation of the least squares regression line.
d On a scatterplot graph the least squares regression line.
e How accurate would the interpolation using the regression line be? Why?

3 Recall the spray on tomatoes data:

Spray concentration (X mL/L)        3    5    6    8    9   11
Yield of tomatoes per bush (Y)     67   90  103  120  124  150

a Define the role of each variable and produce an appropriate scatterplot.
b Use the method of least squares to determine the equation of the line of best fit.
c Give an interpretation for the slope and vertical intercept of this line.
d Use the equation of the least squares line to predict the yield if the spray concentration was 7 mL/L. Comment on the reasonableness of this prediction.
e If a 50 mL/L spray concentration was used, would this ensure a large tomato yield? Explain.

4 Recall the frost on cherry data:

Number of frosts (X)        27    23     7    37    32    14    16
Cherry yield (Y tonnes)    5.6   4.8   3.1   7.2   6.1   3.7   3.8

a Define the role of each variable and produce an appropriate scatterplot.
b Use the method of least squares to determine the equation of the line of best fit.
c Give an interpretation for the slope and vertical intercept of this line.
d Use the equation of the least squares line to predict the cherry yield if 29 frosts were recorded. Comment on the reasonableness of this prediction.
e Use the equation of the least squares line to predict the cherry yield if 1 frost was recorded. Comment on the reasonableness of this prediction.

5 The rate of a chemical reaction in a certain plant depends on the number of frost-free days experienced by the plant over a year which, in turn, depends on altitude. The higher the altitude, the greater the chance of frost. The following table shows the rate of the chemical reaction R, as a function of the number of frost-free days, n.

Frost-free days (n)        75    100    125    150    175    200
Rate of reaction (R)     44.6   42.1   39.4   37.0   34.1   31.2

a Produce a scatterplot for the data of R against n.
b Find a linear model which best fits the data. State the value of $r^2$.
c Estimate the rate of the chemical reaction when the number of frost-free days is:
  i 90   ii 215.
d Complete: “The higher the altitude, the ...... the rate of reaction.”

6 The yield (Y kg) of pumpkins on a farm depends on the quantity of fertiliser (X g/m²). The following table shows corresponding X and Y values.

X      4    13    20    26    30    35    50
Y    1.8   2.9   3.8   4.2   4.7   5.7   4.4

a Draw a scatterplot of the data and identify an outlier.
b Calculate the correlation coefficient:
  i with the outlier included   ii without the outlier.
c Calculate the equation of the least squares regression line:
  i with the outlier included   ii without the outlier.
d If you wish to estimate the yield when 15 g/m² are used, which regression line from c should be used?
e Can you explain what may have caused the outlier?
7 Find the least squares regression line for y on x if:
a $\bar{x} = 6.12$, $\bar{y} = 5.94$, $s_{xy} = -4.28$, $s_x = 2.32$
b $\bar{x} = 21.6$, $\bar{y} = 45.9$, $s_{xy} = 12.28$, $s_x = 8.77$

8 $n = 6$, $\sum x = 61$, $\sum y = 89$, $\sum xy = 1108$, $\sum (x - \bar{x})^2 = 138$ and $\sum (y - \bar{y})^2 = 284$
a Find   i the mean of X   ii the mean of Y.
b Find   i the standard deviation of X   ii the standard deviation of Y.
c Find the covariance of X and Y.
d Find the least squares regression model for y on x.

INVESTIGATION SPEARMAN’S RANK ORDER CORRELATION COEFFICIENT

Suppose we wish to test the degree of agreement between two wine tasting
judges at a vintage festival, or between two judges at a diving competition.
Spearman’s rank order correlation coefficient can be used for this purpose.
His formula is:

$$t = 1 - \frac{6 \sum d^2}{n(n^2 - 1)}$$

where t is Spearman’s rank order correlation coefficient
      d is the difference in the ranking
      n is the number of rankings.

Note: $\sum d^2$ is the sum of the squares of the differences.
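
A minimal Python sketch of the formula is given below. The function name `spearman_rank_coefficient` and the two rankings are hypothetical (they are not the wine data of this investigation), and the sketch assumes that neither judge gives tied ranks.

```python
def spearman_rank_coefficient(ranks_a, ranks_b):
    """Return t = 1 - 6*sum(d^2) / (n*(n^2 - 1)) for two rankings without ties."""
    if len(ranks_a) != len(ranks_b):
        raise ValueError("both judges must rank the same number of items")
    n = len(ranks_a)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# Hypothetical rankings of five items by two judges (not data from the text).
judge_1 = [1, 2, 3, 4, 5]
judge_2 = [2, 1, 3, 5, 4]

t = spearman_rank_coefficient(judge_1, judge_2)
print(f"t = {t:.2f}")   # t = 0.80
```

Identical rankings give t = 1 and exactly reversed rankings give t = −1, which is a useful check on any implementation.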

What to do:
1 Amy and Lee are two wine judges. They are considering six red wines: A, B, C, D, E and F. They taste each wine and put them in order of enjoyment from 1 (best) to 6 (worst), and the results of their judging are shown in the table which follows:

Wine            A   B   C   D   E   F
Amy’s order     3   1   6   2   4   5
Lee’s order     6   5   2   1   3   4

Notice that for wine A, d = 6 − 3 = 3 and for wine C, d = 6 − 2 = 4.

a Find Spearman’s rank order correlation coefficient for the wine tasting data.
b Comment on the degree of agreement between their rankings of the wine.
c What is the significance of the sign of t?