
CORRELATION AND REGRESSION C 5606 / 5/ 1

UNIT 5

CORRELATION AND REGRESSION

OBJECTIVES

General Objective

• To understand and apply the concept of correlation and regression

Specific Objectives

At the end of the unit, you should be able to:

• Draw a scatterplot for a set of ordered pairs
• Compute the correlation coefficient
• Compute the equation of the regression line

INPUT

5.0 CORRELATION

So far we have considered the statistics of one variable. Of course, we sometimes get
data involving two variables. For example, look at the marks obtained on two
Mathematics papers by a group of students below.

Student A B C D E F G H I J
Paper 1 42 84 50 42 33 50 69 81 50 35
Paper 2 31 83 42 60 28 63 59 92 73 40

So what can we find out from the data? Students B and H have done very well on
both papers, E has done very badly on both papers, and student I has done much better
on Paper 2 than on Paper 1.

A graph might help us to make more sense of the data, as would the average (mean)
mark for papers 1 and 2. The most useful type of graph is a scatter diagram.

5.1 CORRELATION – SCATTER DIAGRAM

If we plot the data as points, with marks for Paper 1 on the x-axis and for Paper 2 on
the y-axis, we obtain a graph like the one shown here. Note that we do not need to
start the scales at zero.

We see that the points go roughly from bottom left to top right (this is made clearer by
enclosing the points as shown below).

From the data, the mean value for Paper 1 is x̄ = 53.6

and for Paper 2 is ȳ = 57.1

We now plot the lines x = 53.6 and y = 57.1 on the scatter diagram:

The lines divide the graph into four quadrants:

Top right – All points have both x values and y values greater than their respective
means, i.e. (x – x̄) > 0, (y – ȳ) > 0. The product (x – x̄)(y – ȳ) is positive.

Bottom left – All points have both x values and y values less than their respective
means, i.e. (x – x̄) < 0, (y – ȳ) < 0. The product is positive.

Top left – x values less than x̄, y values greater than ȳ. The product is negative.

Bottom right – x values greater than x̄, y values less than ȳ. The product is negative.
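
As a rough numerical check of the quadrant rule, the sketch below sorts the ten students' marks from the table above into quadrants about the means and adds up the products of the deviations. The function and variable names are illustrative assumptions, not part of the unit.

```python
# Marks from the Paper 1 / Paper 2 table (students A to J).
paper1 = [42, 84, 50, 42, 33, 50, 69, 81, 50, 35]
paper2 = [31, 83, 42, 60, 28, 63, 59, 92, 73, 40]


def quadrant_summary(xs, ys):
    """Count points per quadrant about (mean x, mean y) and sum the products."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    counts = {"top right": 0, "bottom left": 0, "top left": 0, "bottom right": 0}
    product_sum = 0.0
    for x, y in zip(xs, ys):
        dx, dy = x - mean_x, y - mean_y
        product_sum += dx * dy
        # A point lying exactly on a mean line is not counted in any quadrant.
        if dx > 0 and dy > 0:
            counts["top right"] += 1
        elif dx < 0 and dy < 0:
            counts["bottom left"] += 1
        elif dx < 0 and dy > 0:
            counts["top left"] += 1
        elif dx > 0 and dy < 0:
            counts["bottom right"] += 1
    return mean_x, mean_y, counts, product_sum


mean_x, mean_y, counts, product_sum = quadrant_summary(paper1, paper2)
print(mean_x, mean_y)
print(counts)
print(round(product_sum, 1))
```

Most of the points fall in the top-right and bottom-left quadrants, so the positive products outweigh the negative ones and the total is positive.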

Look at the scattergrams (scatter diagrams) below. The patterns seem to be very
different.

Roughly speaking:

Positive correlation – “the higher the value of x, the higher the value of y.”
Negative correlation – “the higher value of x, the lower value of y.”
Zero correlation – “no fixed relationship between x and y.”

Again, this is made clearer by drawing the lines y = ȳ and x = x̄.

You have met scatter diagrams in earlier work, where you may have drawn a “line of
best fit” on the graph in order to estimate a value of y given a value of x. The line was
drawn “by eye”, but you would know that the line passes through the point of mean values
(x̄, ȳ), as shown below.

The lines on the first two diagrams are relatively easy to draw, but where do we draw
a line on the third and having drawn it, would it be of any practical use?

Notice that we have been looking for a special type of relationship between the x and
y values – a straight line or linear relationship. The fact that we can’t find such a
relationship does not mean that there is no relationship at all.
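
To illustrate that last point, here is a small sketch using made-up data (an assumption for illustration only). The points lie exactly on the curve y = x², so y is completely determined by x, yet the products of the deviations from the means cancel out and there is no linear trend to find.

```python
# Hypothetical data lying exactly on the curve y = x**2.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)

# Sum of the products of the deviations from the two means.
product_sum = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
print(mean_x, mean_y, product_sum)
```

The positive products on the right of the diagram are cancelled by the negative products on the left, even though the relationship between x and y is exact.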

The product-moment formula for determining the linear correlation coefficient

The convention of dealing with data

Horizontal (x) axis – The independent variable

Vertical (y) axis – The dependent variable

Let us look at some data on the height of students and the distance they can throw a
cricket ball.

Height (x) cm 122 124 133 138 144 156 158 161 164 168
Distance (y) m 41 38 52 56 29 54 59 61 63 67

Just looking at the data, a general response might be “the taller a person, the further
they can throw a cricket ball.” (apart from the odd person!)

Does a scatter diagram support that hypothesis?

The example below shows one drawback: SCALE



One of the measures of the degree of linear correlation between two variables is
called the coefficient of correlation, denoted by the symbol ‘r’. The coefficient of
correlation for two variables, say X and Y, is given by:

$$ r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \, \sum (Y - \bar{Y})^2}} $$

or simply

$$ r = \frac{\sum xy}{\sqrt{\left(\sum x^2\right)\left(\sum y^2\right)}} $$

where x = X – X̄ and y = Y – Ȳ.

The value of the correlation coefficient ranges from +1 for a perfect positive correlation
to -1 for a perfect negative correlation.
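
The formula can be sketched in a few lines of Python. This is a minimal illustration, assuming plain lists of numbers; the function name `correlation_coefficient` and the two straight-line data sets are hypothetical and chosen only to show the two extreme values of r.

```python
import math


def correlation_coefficient(xs, ys):
    """Product-moment r computed from deviations about the means.

    Assumes both lists have the same length and that neither is constant
    (a constant list would give a zero denominator).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_x2 = sum((x - mean_x) ** 2 for x in xs)
    sum_y2 = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / math.sqrt(sum_x2 * sum_y2)


# Hypothetical points lying exactly on straight lines.
xs = [1, 2, 3, 4, 5]
rising = [2 * x + 1 for x in xs]    # y = 2x + 1
falling = [10 - 3 * x for x in xs]  # y = 10 - 3x

r_rising = correlation_coefficient(xs, rising)
r_falling = correlation_coefficient(xs, falling)
print(r_rising, r_falling)
```

Points on a rising straight line give r = +1 and points on a falling straight line give r = -1; real data fall somewhere in between.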

Example 5.1

a) Determine the coefficient of correlation between X and Y based on the data below.

X 4 5 6 9
Y 12 10 8 6

b) The data given below gives the experimental values obtained for the torque output
from an electric motor, X, against the current taken from the supply, Y. Determine
the value, degree and nature of the coefficient of linear correlation between the
variables X and Y (if there is one).

X 0 1 2 3 4 5 6 7 8 9
Y 4 6 6 6 8 10 10 10 14 12

Solution to Example 5.1

a) Construct a table from the given data.

X    Y    x = X – X̄    y = Y – Ȳ    xy    x²    y²
4    12   -2           3            -6    4     9
5    10   -1           1            -1    1     1
6    8    0            -1           0     0     1
9    6    3            -3           -9    9     9

ΣX = 24, ΣY = 36, Σxy = -16, Σx² = 14, Σy² = 20

X̄ = 24/4 = 6, Ȳ = 36/4 = 9

$$ r = \frac{\sum xy}{\sqrt{\left(\sum x^2\right)\left(\sum y^2\right)}} = \frac{-16}{\sqrt{(14)(20)}} = \frac{-16}{\sqrt{280}} = -0.9562 $$

b)

X    Y    x = X – X̄    y = Y – Ȳ    xy     x²      y²
0    4    -4.5         -4.6         20.7   20.25   21.16
1    6    -3.5         -2.6         9.1    12.25   6.76
2    6    -2.5         -2.6         6.5    6.25    6.76
3    6    -1.5         -2.6         3.9    2.25    6.76
4    8    -0.5         -0.6         0.3    0.25    0.36
5    10   0.5          1.4          0.7    0.25    1.96
6    10   1.5          1.4          2.1    2.25    1.96
7    10   2.5          1.4          3.5    6.25    1.96
8    14   3.5          5.4          18.9   12.25   29.16
9    12   4.5          3.4          15.3   20.25   11.56

ΣX = 45, ΣY = 86, Σxy = 81.0, Σx² = 82.5, Σy² = 88.4

X̄ = 45/10 = 4.5, Ȳ = 86/10 = 8.6

$$ r = \frac{\sum xy}{\sqrt{\left(\sum x^2\right)\left(\sum y^2\right)}} = \frac{81}{\sqrt{(82.5)(88.4)}} = 0.95 $$

A good direct correlation exists between the values of X and Y.
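
The two tables in Example 5.1 can be reproduced with a short script. This is only a sketch: the helper names `deviation_sums` and `r_from_sums` are assumptions introduced here, and the columns are rebuilt from the X and Y values given in the example.

```python
import math


def deviation_sums(X, Y):
    """Return (sum of xy, sum of x^2, sum of y^2) with x = X - mean, y = Y - mean."""
    mean_X = sum(X) / len(X)
    mean_Y = sum(Y) / len(Y)
    x = [value - mean_X for value in X]
    y = [value - mean_Y for value in Y]
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    return sum_xy, sum_x2, sum_y2


def r_from_sums(sum_xy, sum_x2, sum_y2):
    """Coefficient of correlation from the three column totals."""
    return sum_xy / math.sqrt(sum_x2 * sum_y2)


# Example 5.1 (a)
sums_a = deviation_sums([4, 5, 6, 9], [12, 10, 8, 6])
r_a = r_from_sums(*sums_a)

# Example 5.1 (b)
sums_b = deviation_sums(list(range(10)), [4, 6, 6, 6, 8, 10, 10, 10, 14, 12])
r_b = r_from_sums(*sums_b)

print(sums_a, round(r_a, 4))
print([round(s, 2) for s in sums_b], round(r_b, 4))
```

The column totals agree with the tables, and the two values of r round to -0.9562 and 0.95 as in the worked solution.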

ACTIVITY 5A

TEST YOUR UNDERSTANDING BEFORE PROCEEDING TO THE NEXT INPUT...!

1. Determine the coefficient of correlation, correct to 4 decimal places, between X and Y
based on the data below.

X 122 124 133 138 144 156 158 161 164 168
Y 41 38 52 56 29 54 59 61 63 67

2. The co-ordinates given below refer to an experiment to verify Newton’s law of
cooling over a limited range of values. Determine the value, degree and nature of
the coefficient of correlation.

Time (min) 4 8 10 12 16 22
Temperature (°C) 46 34 30 26 24 20

3. The following results were obtained experimentally when verifying Hooke’s law:

Load (N) 2 5 8 11 15
Extension (mm) 2 23 62 119 223

Determine the value, degree and nature of the coefficient of correlation.

4. The thickness of case-hardening achieved varies with temperature, and some
co-ordinates obtained by experiment are as shown.

Temperature (°C) 400 420 350 320 400 480 440 370
Thickness (µm) 3.7 3.4 3.7 3.8 3.6 3.3 3.4 3.7

Determine the coefficient of correlation based on these values.

FEEDBACK TO ACTIVITY 5A

1. r = 0.7289
2. r = -0.92, good, inverse
3. 0.97, good, direct
4. -0.93

INPUT

5.2 LEAST SQUARES REGRESSION LINE

Scatter Diagrams – Line of Best Fit

We have already referred to the drawing of a line of best fit by eye.

The only calculation involved was determining x̄ and ȳ, since the line of best fit
passes through the point (x̄, ȳ).

From the line you might be expected to estimate a y-value given an x-value. Of
course, line fitting “by eye” is a subjective matter of trying to minimise the distances
between the points and the line.

A mathematical method is available to produce two lines, known as ‘y on x’
(to estimate values of y) and ‘x on y’ (to estimate values of x).

These are known as (Linear) Regression Lines or Least-Squares Regression Lines.



Scatter Diagrams – The ‘y on x’ Regression Line

Since the line must pass through (x̄, ȳ), the parameters that can vary are the
gradient of the line and the point where the line cuts the y-axis.

The equation of the ‘y on x’ line will be of the form y = a + bx (some syllabuses use
the Greek letters α and β instead of a and b).

The y on x line minimises the sum of the squares of the vertical distances from the
points to the regression line ( the square of the distance is used to ensure a positive
result).
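
The minimising property can be checked numerically. The sketch below reuses the data of Example 5.1 (b) and assumes the standard least-squares result that the gradient equals Σxy / Σx² in the deviation notation used for correlation (x = X – X̄, y = Y – Ȳ), with the line passing through the point of means. The function name is hypothetical.

```python
# Data from Example 5.1 (b): X = torque output, Y = current.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [4, 6, 6, 6, 8, 10, 10, 10, 14, 12]


def sum_squared_vertical_distances(a, b, xs, ys):
    """Sum of squared vertical distances from the points to y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Least-squares gradient (deviation form) and intercept through the means.
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
a = mean_y - b * mean_x

best = sum_squared_vertical_distances(a, b, xs, ys)

# Nudging the intercept or the gradient should only make the total larger.
nudged = [
    sum_squared_vertical_distances(a + da, b + db, xs, ys)
    for da, db in [(0.5, 0.0), (-0.5, 0.0), (0.0, 0.1), (0.0, -0.1)]
]
print(round(best, 3), [round(v, 3) for v in nudged])
```

Every nudged line has a larger sum of squared vertical distances than the fitted line, which is what "least squares" means.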

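As with correlation, there is a formula derived from a proof and a corresponding
‘computational’ method. (The proof is not required at A/AS Level.)

For y = a + bx,

$$ b = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sum x^2 - \dfrac{\left(\sum x\right)^2}{n}}, \qquad a = \bar{y} - b\bar{x} $$

where ȳ and x̄ are the mean values of y and x. Here x and y denote the data values
themselves, not deviations from the means.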
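
The computational formula needs only the totals Σx, Σy, Σxy, Σx² and n. A minimal sketch follows; the function name `regression_from_sums` and the three check points are assumptions made for illustration.

```python
def regression_from_sums(n, sum_x, sum_y, sum_xy, sum_x2):
    """Return (a, b) for the y-on-x line y = a + b*x from summary totals."""
    b = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
    a = sum_y / n - b * (sum_x / n)
    return a, b


# Hypothetical check: three points lying exactly on y = 1 + 2x.
xs = [1, 2, 3]
ys = [3, 5, 7]
a, b = regression_from_sums(
    n=len(xs),
    sum_x=sum(xs),
    sum_y=sum(ys),
    sum_xy=sum(x * y for x, y in zip(xs, ys)),
    sum_x2=sum(x * x for x in xs),
)
print(a, b)
```

Because the check points lie exactly on a line, the formula recovers its intercept 1 and gradient 2.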



Example 5.2

a) y on x Regression Line (Least Squares Regression Line)

x 2.5 4 8 5 7 9.5 8.5 12.5 12.5 14.5
y 3.5 3 6.5 7 8 11 9 10.5 13 13

Σx = 84, Σy = 84.5, Σxy = 827, Σx² = 845.5, n = 10
x̄ = 8.4, ȳ = 8.45

Calculate the regression line y on x.

b) Based on the data already calculated, find the regression line y on x and estimate
the value of y when x = 160.

Σx = 1468, Σy = 520, Σxy = 77689, Σx² = 218070, n = 10
x̄ = 146.8, ȳ = 52

Solution to Example 5.2

a) To calculate the regression line y on x:

$$ b = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sum x^2 - \dfrac{\left(\sum x\right)^2}{n}} = \frac{827 - \dfrac{84 \times 84.5}{10}}{845.5 - \dfrac{84^2}{10}} = 0.8377 $$

a = ȳ – b x̄ = 8.45 – (0.8377 × 8.4) = 1.4133

So the least squares regression line y on x is y = 1.4133 + 0.8377x

Least Squares Regression Line – y on x

From the solution to part a), the least squares regression line y on x is:



y = 1.4133 + 0.8377x

We can now use this equation to calculate (estimate) a value of y for a given value of
x.

For example, find a value for y given x = 10.

Substituting: y = 1.4133 + (0.8377 × 10) = 9.79

Finding a value from within the range of x is called interpolation.

Warning: estimating a value from outside the data range (say x = 20) is called
extrapolation and should be avoided (at all costs), since you do not know that the
relationship between x and y will hold for larger and smaller values than those
recorded.
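
The whole of part a), including the warning about extrapolation, can be sketched as follows. The function name `estimate_y` and the choice to raise an error outside the data range are assumptions made for illustration; the unit itself only advises against extrapolating.

```python
# Data from Example 5.2 (a).
xs = [2.5, 4, 8, 5, 7, 9.5, 8.5, 12.5, 12.5, 14.5]
ys = [3.5, 3, 6.5, 7, 8, 11, 9, 10.5, 13, 13]

n = len(xs)
sum_x, sum_y = sum(xs), sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_x2 = sum(x * x for x in xs)

# Gradient and intercept of the y-on-x line from the computational formula.
b = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
a = sum_y / n - b * sum_x / n


def estimate_y(x):
    """Estimate y from the fitted line, refusing to extrapolate."""
    if not min(xs) <= x <= max(xs):
        raise ValueError(f"x = {x} is outside the data range; extrapolation is unreliable")
    return a + b * x


y_at_10 = estimate_y(10)  # interpolation: 10 lies within 2.5 to 14.5
print(round(a, 4), round(b, 4), round(y_at_10, 2))

try:
    estimate_y(20)  # extrapolation: 20 lies outside the data range
    refused = False
except ValueError:
    refused = True
print(refused)
```

Working with the unrounded gradient gives an intercept of about 1.4130 rather than 1.4133; the small difference arises because the worked solution rounds b to four decimal places before calculating a.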

b) For the regression line y on x:

$$ b = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sum x^2 - \dfrac{\left(\sum x\right)^2}{n}} = \frac{77689 - \dfrac{1468 \times 520}{10}}{218070 - \dfrac{1468^2}{10}} = 0.5270 $$

a = ȳ – b x̄ = 52 – (0.5270 × 146.8) = -25.3636

So the regression line is y = -25.3636 + 0.5270x

When x = 160, y = -25.3636 + (0.5270 × 160) = 58.96
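
Part b) works from the totals alone. The sketch below repeats the calculation and also shows, as an aside not in the original solution, how much the intercept depends on whether b is rounded before a is computed. Variable names are illustrative assumptions.

```python
# Totals quoted in Example 5.2 (b).
n = 10
sum_x, sum_y = 1468, 520
sum_xy, sum_x2 = 77689, 218070

mean_x = sum_x / n
mean_y = sum_y / n

b = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)

# Intercept from the unrounded gradient, and from b rounded to 4 d.p.
a_full = mean_y - b * mean_x
a_rounded = mean_y - round(b, 4) * mean_x

# Estimates of y at x = 160 from each version of the line.
y_full = a_full + b * 160
y_rounded = a_rounded + round(b, 4) * 160

print(round(b, 4))
print(round(a_full, 4), round(a_rounded, 4))
print(round(y_full, 2), round(y_rounded, 2))
```

The intercept shifts in the second decimal place because x̄ = 146.8 is large, but the estimate at x = 160 agrees to two decimal places either way.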



ACTIVITY 5B

TEST YOUR UNDERSTANDING BEFORE PROCEEDING TO THE NEXT INPUT...!

a. The table shows the results for a number of athletes. x represents long
jump (metres).

Σx = 19, Σy = 66, Σxy = 126.22, Σx² = 36.44, n = 10

x     y     x²     y²     xy
1.8   6.7   3.24   44.89  12.06
2.1   7.6   4.41   57.76  15.96
1.9   6.3   3.61   39.69  11.97
2.0   6.8   4.00   46.24  13.60
1.8   5.9   3.24   34.81  10.62
1.8   7.9   3.24   62.41  14.22
1.6   5.5   2.56   30.25  8.80
1.8   5.6   3.24   31.36  10.08
1.9   6.5   3.61   42.25  12.35
2.3   7.2   5.29   51.84  16.56
Σ 19  66    36.44  441.5  126.22

Calculate the value of b for the regression line y = a + bx.

b. The length y metres of a cable subjected to a load of x kilograms is given by
y = α + βx. In an experiment to estimate α and β for a particular cable, the value
of y was measured for each of 15 values of x. The following quantities were calculated from
the 15 pairs of values.

Σx = 225, Σy = 238.5, Σxy = 3581, Σx² = 3625

Calculate the least squares estimates of α and β.



c. A set of bivariate data can be summarised as follows:

Σx = 21, Σy = 43, Σxy = 171, Σx² = 91, Σy² = 335, n = 6

i) Calculate the equation of the regression line of y on x. Give your answer in
the form y = a + bx, where the values of a and b should be stated to 3
significant figures.
ii) It is required to estimate the value of y for a given value of x. State
circumstances under which the regression line of x on y should be used,
rather than the regression line of y on x.

FEEDBACK TO ACTIVITY 5B

a. b = 2.4118

b. α = 15.69, β = 0.014, so y = 15.69 + 0.014x

c. i) a = 3.0667, b = 1.1714; the regression line is y = 3.07 + 1.17x (3 significant figures)

ii) Use the regression line of x on y to estimate the value of x when y is the
independent variable.

SELF ASSESSMENT 5

You are approaching success. Try all the questions in this self-assessment section
and check your answers given on the next page. If you encounter any problems,
consult your instructor. Good luck.

1. The data given below refer to the relationship between man-hours worked
and production achieved in a factory. Determine the coefficient of
correlation.

Index of production, man-hour basis 100 97 100 101 93 103 91 89 110 86
Index of production, actual basis 94 91 100 105 84 112 83 80 123 78

2. The number of man-days lost per week due to sickness in two similar
departments of a factory are shown for a 12-week period.

Department A 20 18 19 21 17 18 12 16 14 17 13 15
Department B 18 21 18 20 17 19 16 15 15 18 16 18

Determine the coefficient of correlation and comment on its degree and
nature.

3. The masses and heights of ten people were measured and the results are
as shown.

Mass (kg) 38 38 38 44 44 51 32 51 77 32
Height (cm) 135 140 137 141 147 145 132 149 164 130

Calculate the coefficient of correlation for this data.

4. The relationship between the pressure and volume of a gas was measured
and the following results were obtained:

Pressure (kPa) 58 62 67 73 81 81 86 92 104
Volume (m³) 0.36 0.97 0.43 0.52 0.48 0.29 0.31 0.75 0.27

Determine the coefficient of correlation and comment on the result
obtained.

5. The caloric intake of rats varies with body mass as shown below.

Body mass (g) 2.0 3.1 3.6 4.6 5.0 6.0 7.0 8.0 8.5 9.0 10.0
Caloric intake (cal h⁻¹) 1.5 2.1 3.2 3.6 3.6 3.9 4.1 4.2 4.5 4.6 5.9

Is there a linear correlation between these results?



6. Determine the coefficient of correlation for the data given below and test
the null hypothesis that ρ = 0 at a level of significance of 0.1. The
data given relate the number of hours of sunshine per week to the hours
lost due to sickness.

Hours of sunshine/week 10 13 15 17 18 20 22 23 24
Hours lost due to sickness 90 75 75 65 55 45 55 45 35

7. The length y metres of a cable subjected to a load of x kilograms is given
by y = α + βx. In an experiment to estimate α and β for a particular cable, the
value of y was measured for each of 15 values of x. The following
quantities were calculated from the pairs of values.

Σx = 225, Σy = 238.5, Σxy = 3581, Σx² = 3625

a) Calculate the least squares estimates of α and β.

8. A set of bivariate data can be summarised as follows:

Σx = 21, Σy = 43, Σxy = 171, Σx² = 91, Σy² = 335, n = 6

i) Calculate the equation of the regression line of y on x. Give your
answer in the form y = a + bx, where the values of a and b should
be stated to 3 significant figures.
ii) It is required to estimate the value of y for a given value of x. State
circumstances under which the regression line of x on y should
be used, rather than the regression line of y on x.

9. The data given below show the relationship between the heights and masses of ten
people.

Height, X (cm) 175 180 193 165 187 171 198 168 184 177
Mass, Y (kg) 82 78 86 72 91 80 95 72 89 74

Determine the equation of the regression line of mass on height,
expressing the regression coefficients correct to two decimal places.

10. The power needed to drive a lathe increases as the cutting angle of the tool
increases when cutting at a constant speed and depth of cut. The relationship for
mild steel is:

Cutting angle, X (degrees) 50 55 60 65 70 75 80 85 90
Power, Y (kW) 6.2 6.8 7.6 8.2 8.1 8.8 9.7 10.0 10.4

Determine a) the equation of the regression line of power on cutting angle and
b) the equation of the regression line of cutting angle on power,
expressing the regression coefficients correct to three significant
figures in each case.

FEEDBACK TO SELF ASSESSMENT 5

Have you tried all the questions?? If “YES”, check your answers now.

1. 0.97
2. 0.70 , fair direct

3. 0.97
4. -0.31. It is probable that the measurements were made at different
temperatures.

5. r = 0.94, hence there is a good, direct correlation.

6. r = -0.95; t.99 (ν = 7) = 1.42; |t| = 8.05; the null hypothesis is rejected

7. α= 15.69 β= 0.014 y= 15.69 + 0.014x

8. i) y = 3.07 + 1.17x
ii) Use the regression line of x on y to estimate the value of x when y is the
independent variable.

9. y = -36.83 + 0.66x

10. a) Y = 1.14 + 0.104 X


b) X = -9.27 + 9.41Y
