UNIT 5
OBJECTIVES
General Objective
Specific Objectives
INPUT
5.0 CORRELATION
So far we have considered the statistics of one variable. Of course we sometimes get
data involving two variables. For example, look at the marks obtained on two
Mathematics papers by the group of students below.
Student A B C D E F G H I J
Paper 1 42 84 50 42 33 50 69 81 50 35
Paper 2 31 83 42 60 28 63 59 92 73 40
So what can we find out from the data? Students B and H have done very well on
both papers, E has done very badly on both papers, and student I has done much better
on Paper 2 than on Paper 1.
A graph might help us to make more sense of the data, as would the average (mean)
mark for papers 1 and 2. The most useful type of graph is a scatter diagram.
CORRELATION AND REGRESSION C 5606 / 5/ 3
If we plot the data as points, with marks for Paper 1 on the x-axis and for Paper 2 on
the y-axis, we obtain a graph like the one shown here. Note that we do not need to
start the scales at zero.
We see that the points go roughly from bottom left to top right (this is made clearer by
enclosing the points as shown below).
We now plot the lines x = 53.6 and y = 57.1 (the mean marks $\bar{x}$ and $\bar{y}$ for
Papers 1 and 2) on the scatter diagram. These lines divide the diagram into four regions:
Top right – All points have both x values and y values greater than their respective
means, i.e. $(x - \bar{x}) > 0$, $(y - \bar{y}) > 0$. The product $(x - \bar{x})(y - \bar{y})$ would be positive.
Bottom left – All points have both x values and y values less than their respective
means, i.e. $(x - \bar{x}) < 0$, $(y - \bar{y}) < 0$. The product would be positive.
Top left – x values less than $\bar{x}$, y values greater than $\bar{y}$. Product negative.
Bottom right – x values greater than $\bar{x}$, y values less than $\bar{y}$. Product negative.
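The quadrant argument can be checked numerically. The short Python sketch below is an illustration only (the function names are our own, not part of the module): it takes the Paper 1 and Paper 2 marks from the table above, computes the two means, and counts how many students give a positive or a negative product $(x - \bar{x})(y - \bar{y})$.

```python
# Marks for students A to J, copied from the table above
paper1 = [42, 84, 50, 42, 33, 50, 69, 81, 50, 35]
paper2 = [31, 83, 42, 60, 28, 63, 59, 92, 73, 40]


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def deviation_products(xs, ys):
    """Return the product (x - x_bar)(y - y_bar) for every point."""
    x_bar, y_bar = mean(xs), mean(ys)
    return [(x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)]


products = deviation_products(paper1, paper2)
positive = sum(1 for p in products if p > 0)  # top-right or bottom-left points
negative = sum(1 for p in products if p < 0)  # top-left or bottom-right points

print("means:", mean(paper1), mean(paper2))
print("positive products:", positive, "negative products:", negative)
print("sum of products:", round(sum(products), 2))
```

Most of the points fall in the top-right and bottom-left regions, so the sum of the products is positive, which agrees with the bottom-left to top-right pattern seen on the scatter diagram.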
Look at the scattergrams (scatter diagrams) below. The patterns seem to be very
different.
Roughly speaking:
Positive correlation – “the higher the value of x, the higher the value of y.”
Negative correlation – “the higher the value of x, the lower the value of y.”
Zero correlation – “no fixed relationship between x and y.”
You have met scatter diagrams in your earlier work, on which you may have drawn a “line of
best fit” in order to estimate a value of y given a value of x. The line was
drawn “by eye”, but you would know that it passes through the point of mean values
($\bar{x}$, $\bar{y}$), as shown below.
The lines on the first two diagrams are relatively easy to draw, but where do we draw
a line on the third and having drawn it, would it be of any practical use?
Notice that we have been looking for a special type of relationship between the x and
y values – a straight line or linear relationship. The fact that we can’t find such a
relationship does not mean that there is no relationship at all.
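A small made-up example may help here. In the Python sketch below (the data are invented for illustration and do not come from the module), every y value is exactly the square of its x value, so the relationship is perfect, yet the products $(x - \bar{x})(y - \bar{y})$ cancel out and no straight-line trend shows up.

```python
# Invented data: y is determined exactly by x, but not by a straight line
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)

# Sum of the products (x - x_bar)(y - y_bar) over all points
product_sum = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

print("y values:", ys)
print("sum of products:", product_sum)
```

The positive products on the right-hand side are balanced by the negative products on the left, so a search for a linear relationship finds nothing even though y depends completely on x.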
Let us look at some data on the height of students and the distance they can throw a
cricket ball.
Height (x) cm 122 124 133 138 144 156 158 161 164 168
Distance (y) m 41 38 52 56 29 54 59 61 63 67
Just looking at the data, a general response might be “the taller a person, the further
they can throw a cricket ball.” (apart from the odd person!)
One of the measures of the degree of linear correlation between two variables is
called the coefficient of correlation, denoted by the symbol ‘r’. The coefficient of
correlation for two variables, say X and Y, is given by:
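$$ r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \sum (Y - \bar{Y})^2}} \quad \text{or simply} \quad r = \frac{\sum xy}{\sqrt{(\sum x^2)(\sum y^2)}} $$
where $x = X - \bar{X}$ and $y = Y - \bar{Y}$.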
The value of the correlation coefficient ranges from +1 for a perfect positive correlation
to -1 for a perfect negative correlation.
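The formula for r translates directly into a few lines of Python. The sketch below is one possible implementation, not a prescribed method; the function name `correlation_coefficient` and the two small data sets are our own, chosen so that the points lie exactly on a rising line and on a falling line.

```python
from math import sqrt


def correlation_coefficient(X, Y):
    """Return r = sum(xy) / sqrt(sum(x^2) * sum(y^2)),
    where x = X - mean(X) and y = Y - mean(Y)."""
    n = len(X)
    x_bar = sum(X) / n
    y_bar = sum(Y) / n
    x = [xi - x_bar for xi in X]  # deviations of X from its mean
    y = [yi - y_bar for yi in Y]  # deviations of Y from its mean
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    return sum_xy / sqrt(sum_x2 * sum_y2)


# Invented data lying exactly on a rising line and on a falling line
r_up = correlation_coefficient([1, 2, 3, 4], [2, 4, 6, 8])
r_down = correlation_coefficient([1, 2, 3, 4], [8, 6, 4, 2])

print(round(r_up, 4), round(r_down, 4))
```

The two results should sit at the extremes of the range, +1 and -1. Real data will normally give a value somewhere in between.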
Example 5.1
X 4 5 6 9
Y 12 10 8 6
b) The data given below gives the experimental values obtained for the torque output
from an electric motor, X, against the current taken from the supply, Y. Determine
the value, degree and nature of the coefficient of linear correlation between the
variables X and Y (if there is one).
X 0 1 2 3 4 5 6 7 8 9
Y 4 6 6 6 8 10 10 10 14 12
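a)
1     2     3               4               5     6     7
X     Y     x = X - X̄      y = Y - Ȳ      xy    x²    y²
4     12    -2              3               -6    4     9
5     10    -1              1               -1    1     1
6     8     0               -1              0     0     1
9     6     3               -3              -9    9     9

$\sum X = 24$, $\sum Y = 36$, $\sum xy = -16$, $\sum x^2 = 14$, $\sum y^2 = 20$

$\bar{X} = \frac{24}{4} = 6$, $\bar{Y} = \frac{36}{4} = 9$

$$ r = \frac{\sum xy}{\sqrt{(\sum x^2)(\sum y^2)}} = \frac{-16}{\sqrt{(14)(20)}} = \frac{-16}{\sqrt{280}} = -0.9562 $$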
b)
X     Y     x = X - X̄      y = Y - Ȳ      xy      x²      y²
0     4     -4.5            -4.6            20.7    20.25   21.16
1     6     -3.5            -2.6            9.1     12.25   6.76
2     6     -2.5            -2.6            6.5     6.25    6.76
3     6     -1.5            -2.6            3.9     2.25    6.76
4     8     -0.5            -0.6            0.3     0.25    0.36
5     10    0.5             1.4             0.7     0.25    1.96
6     10    1.5             1.4             2.1     2.25    1.96
7     10    2.5             1.4             3.5     6.25    1.96
8     14    3.5             5.4             18.9    12.25   29.16
9     12    4.5             3.4             15.3    20.25   11.56

$\sum X = 45$, $\sum Y = 86$, $\sum xy = 81$, $\sum x^2 = 82.5$, $\sum y^2 = 88.4$

$\bar{X} = \frac{45}{10} = 4.5$, $\bar{Y} = \frac{86}{10} = 8.6$

$$ r = \frac{\sum xy}{\sqrt{(\sum x^2)(\sum y^2)}} = \frac{81}{\sqrt{(82.5)(88.4)}} = 0.95 $$
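The column totals in both parts of Example 5.1 can be reproduced with a short Python sketch. This is only a checking aid under our own naming (`deviation_sums` is not part of the module); it builds the deviation columns x and y, then forms the three sums that go into the formula for r.

```python
from math import sqrt


def deviation_sums(X, Y):
    """Return (sum of xy, sum of x^2, sum of y^2)
    for x = X - mean(X) and y = Y - mean(Y)."""
    x_bar = sum(X) / len(X)
    y_bar = sum(Y) / len(Y)
    rows = [(xi - x_bar, yi - y_bar) for xi, yi in zip(X, Y)]
    sum_xy = sum(x * y for x, y in rows)
    sum_x2 = sum(x * x for x, _ in rows)
    sum_y2 = sum(y * y for _, y in rows)
    return sum_xy, sum_x2, sum_y2


# Part a)
sums_a = deviation_sums([4, 5, 6, 9], [12, 10, 8, 6])
r_a = sums_a[0] / sqrt(sums_a[1] * sums_a[2])

# Part b): X runs from 0 to 9
sums_b = deviation_sums(list(range(10)), [4, 6, 6, 6, 8, 10, 10, 10, 14, 12])
r_b = sums_b[0] / sqrt(sums_b[1] * sums_b[2])

print("a)", sums_a, round(r_a, 4))
print("b)", [round(s, 2) for s in sums_b], round(r_b, 2))
```

Note that the sum of the xy column in part a) is negative, so r is negative: Y falls as X rises. In part b) the sum is positive and r is close to +1.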
ACTIVITY 5A
X 122 124 133 138 144 156 158 161 164 168
Y 41 38 52 56 29 54 59 61 63 67
Time (min) 4 8 10 12 16 22
Temperature (°C) 46 34 30 26 24 20
3. The following results were obtained experimentally when verifying Hooke’s law:
Load (N) 2 5 8 11 15
Extension (mm) 2 23 62 119 223
4. The thickness of case-hardening achieved varies with temperature, and some
co-ordinates obtained by experiment are as shown.
Temperature (°C) 400 420 350 320 400 480 440 370
Thickness (µm) 3.7 3.4 3.7 3.8 3.6 3.3 3.4 3.7
FEEDBACK TO ACTIVITY 5A
1. r = 0.7289
2. r = -0.92, good, inverse
3. 0.97, good, direct
4. 0.93
INPUT
The only calculation involved was determining $\bar{x}$ and $\bar{y}$, since the line of best fit
passes through the point ($\bar{x}$, $\bar{y}$).
From the line you might be expected to estimate a y-value given an x-value. Of
course, fitting a line “by eye” is a subjective matter of trying to minimise the distances
between the points and the line.
Since the line must pass through ($\bar{x}$, $\bar{y}$), the parameters that can vary are the
gradient of the line and the point where the line cuts the y-axis.
The equation of the line will be of the form y = a + bx, the regression line of “y on x” (some
syllabuses use the Greek letters α and β instead of a and b).
The y on x line minimises the sum of the squares of the vertical distances from the
points to the regression line (the square of the distance is used to ensure a positive
result).
For y = a + bx:
$$ b = \frac{\sum xy - \frac{(\sum x)(\sum y)}{n}}{\sum x^2 - \frac{(\sum x)^2}{n}}, \qquad a = \bar{y} - b\bar{x} $$
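As an illustration, the formulae for a and b can be written as a short Python function. The sketch below is a hedged example rather than a standard routine: the function names are our own, and the data are the torque and current readings from Example 5.1 b), reused here only to have some numbers to work with. It also compares the sum of squared vertical distances for the fitted line with two slightly different lines.

```python
def regression_y_on_x(xs, ys):
    """Return (a, b) for the line y = a + bx, using the summation formulae."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
    a = sum_y / n - b * sum_x / n  # a = y_bar - b * x_bar
    return a, b


def sum_squared_vertical_distances(a, b, xs, ys):
    """Sum of squared vertical distances from the points to y = a + bx."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Data reused from Example 5.1 b)
X = list(range(10))
Y = [4, 6, 6, 6, 8, 10, 10, 10, 14, 12]

a, b = regression_y_on_x(X, Y)
best = sum_squared_vertical_distances(a, b, X, Y)
steeper = sum_squared_vertical_distances(a, b + 0.1, X, Y)  # gradient changed
shifted = sum_squared_vertical_distances(a + 0.5, b, X, Y)  # intercept changed

print("a =", round(a, 4), "b =", round(b, 4))
print(best < steeper, best < shifted)
```

Changing either the gradient or the intercept makes the sum of squares larger, which is what "line of best fit" means for the y on x regression line.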
Example 5.2
b) Based on the data already calculated, find the regression line y on x and estimate
the value of y when x = 160.
$\sum x = 1468$, $\sum y = 520$, $\sum xy = 77689$, $\sum x^2 = 218070$, $n = 10$
$\bar{x} = 8.4$
$$ b = \frac{\sum xy - \frac{(\sum x)(\sum y)}{n}}{\sum x^2 - \frac{(\sum x)^2}{n}} = \frac{827 - \frac{(84)(84.5)}{10}}{845.5 - \frac{(84)^2}{10}} = 0.8377 $$
y = 1.4133 + 0.8377x
We can now use this equation to calculate ( estimate) a value of y for a given value of
x.
Warning. Estimating a value from outside the data range (say x = 20) is called
extrapolation and should be avoided (at all costs), since you do not know that the
relationship between x and y will hold for larger and smaller values than those
recorded.
$$ b = \frac{\sum xy - \frac{(\sum x)(\sum y)}{n}}{\sum x^2 - \frac{(\sum x)^2}{n}} = \frac{77689 - \frac{(1468)(520)}{10}}{218070 - \frac{(1468)^2}{10}} = 0.5270 $$
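The remaining steps for part b) can be sketched in Python. This is an illustrative check, not the module's own working: it starts from the summary values quoted for part b), finds b and then a from $a = \bar{y} - b\bar{x}$, and estimates y at x = 160. The helper `estimate_y` is our own addition; it refuses to extrapolate by rejecting any x outside the recorded heights (122 cm to 168 cm), in line with the warning above.

```python
# Summary values quoted for part b) (height x in cm, distance y in m)
sum_x, sum_y, sum_xy, sum_x2, n = 1468, 520, 77689, 218070, 10

b = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
a = sum_y / n - b * sum_x / n  # a = y_bar - b * x_bar

# Recorded heights, used only to find the range covered by the data
heights = [122, 124, 133, 138, 144, 156, 158, 161, 164, 168]


def estimate_y(x):
    """Estimate y from y = a + bx, but only inside the recorded x range."""
    if not min(heights) <= x <= max(heights):
        raise ValueError("x is outside the data range: this would be extrapolation")
    return a + b * x


y_160 = estimate_y(160)
print("b =", round(b, 4), "a =", round(a, 2), "estimate at x = 160:", round(y_160, 1))
```

On these figures a comes out at roughly -25.4, and the estimated throw for a student 160 cm tall is roughly 59 m. Since 160 lies between the shortest and tallest recorded heights, the estimate is an interpolation.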
ACTIVITY 5B
a. The table shows the results for a number of athletes. X represents long
jump (metres )
$\sum x = 19$, $\sum y = 66$, $\sum xy = 126.22$, $\sum x^2 = 36.44$, $n = 10$
x      y      x²      y²      xy
1.8    6.7    3.24    44.89   12.06
2.1    7.6    4.41    57.76   15.96
1.9    6.3    3.61    39.69   11.97
2.0    6.8    4.00    46.24   13.6
1.8    5.9    3.24    34.81   10.62
1.8    7.9    3.24    62.41   14.22
1.6    5.5    2.56    30.25   8.8
1.8    5.6    3.24    31.36   10.08
1.9    6.5    3.61    42.25   12.35
2.3    7.2    5.29    51.84   16.56
19     66     36.44   441.5   126.22   (totals)
$\sum x = 21$, $\sum y = 43$, $\sum xy = 171$, $\sum x^2 = 91$, $\sum y^2 = 335$, $n = 6$
FEEDBACK TO ACTIVITY 5B
a. b = 2.4118
b. y = α + βx y = 15.69 + 0.014x
SELF ASSESSMENT 5
You are approaching success. Try all the questions in this self-assessment section
and check your answers given on the next page. If you encounter any problems,
consult your instructor. Good luck.
1. The data given below refers to the relationship between man-hours worked
and production achieved in a factory. Determine the coefficient of
correlation.
Index of production, man-hour basis   100 97 100 101 93 103 91 89 110 86
Index of production, actual basis     94 91 100 105 84 112 83 80 123 78
2. The number of man-days lost per week due to sickness in two similar
departments of a factory is shown for a 12-week period.
Department A 20 18 19 21 17 18 12 16 14 17 13 15
Department B 18 21 18 20 17 19 16 15 15 18 16 18
3. The masses and height for ten people were measured and the results are
as shown.
Mass (kg)     38 38 38 44 44 51 32 51 77 32
Height (cm)   135 140 137 141 147 145 132 149 164 130
4. The relationship between the pressure and volume of a gas was measured
and the following results were obtained:
Pressure (kPa)   58 62 67 73 81 81 86 92 104
Volume (m³)      0.36 0.97 0.43 0.52 0.48 0.29 0.31 0.75 0.27
5. The caloric intake of rats varies with body mass as shown below.
Body mass (g)              2.0 3.1 3.6 4.6 5.0 6.0 7.0 8.0 8.5 9.0 10.0
Caloric intake (cal h⁻¹)   1.5 2.1 3.2 3.6 3.6 3.9 4.1 4.2 4.5 4.6 5.9
6. Determine the coefficient of correlation for the data given below and test
the null hypothesis that ρ = 0 at a level of significance of 0.1. The
data given relates the number of hours of sunshine per week to the hours
lost due to sickness.
Hours of sunshine/week        10 13 15 17 18 20 22 23 24
Hours lost due to sickness    90 75 75 65 55 45 55 45 35
$\sum x = 21$, $\sum y = 43$, $\sum xy = 171$, $\sum x^2 = 91$, $\sum y^2 = 335$, $n = 6$
9. The data given below shows the relationship between the heights and masses of ten
people.
Height, X (cm)   175 180 193 165 187 171 198 168 184 177
Mass, Y (kg)     82 78 86 72 91 80 95 72 89 74
10. The power needed to drive a lathe increases as the cutting angle of the tool
increases when cutting at a constant speed and depth of cut. The relationship for
mild steel is:
Cutting angle (degrees), X   50 55 60 65 70 75 80 85 90
Power (kW), Y                6.2 6.8 7.6 8.2 8.1 8.8 9.7 10.0 10.4
Determine a) the equation of the regression line of power on cutting angle and
b) the equation of the regression line of cutting angle on power,
expressing the regression coefficients correct to three significant
figures in each case.
Have you tried all the questions? If “YES”, check your answers now.
1. 0.97
2. 0.70, fair, direct
3. 0.97
4. -0.31. It is probable that the measurements were made at different
temperatures.
8. i) y = 3.07 + 1.17x
ii) use the regression line of x on y to estimate the value of x when y is the
independent variable.
9. y = -36.83 + 0.66x