
CORRELATION AND REGRESSION

C 5606 / 5/ 1

UNIT 5
CORRELATION AND REGRESSION

OBJECTIVES

General Objective
To understand and apply the concepts of correlation and regression.

Specific Objectives
At the end of the unit, you should be able to:
- draw a scatterplot for a set of ordered pairs
- compute the correlation coefficient
- compute the equation of the regression line


INPUT

5.0 CORRELATION

So far we have considered the statistics of one variable. Of course, we sometimes get data involving two variables. For example, look at the marks obtained on two Mathematics papers by a group of students below.

Student   Paper 1   Paper 2
A         42        31
B         84        83
C         50        42
D         42        60
E         33        28
F         50        63
G         69        59
H         81        92
I         50        73
J         35        40

So what can we find out from the data? Students B and H have done very well on both papers, E has done very badly on both papers, and student I has done much better on Paper 2 than on Paper 1. A graph might help us to make more sense of the data, as would the average (mean) mark for each paper. The most useful type of graph is a scatter diagram.


5.1 CORRELATION - SCATTER DIAGRAM

If we plot the data as points, with marks for Paper 1 on the x-axis and marks for Paper 2 on the y-axis, we obtain a graph like the one shown here. Note that we do not need to start the scales at zero.

We see that the points go roughly from bottom left to top right (this is made clearer by enclosing the points as shown below).


From the data, the mean value for Paper 1 is $\bar{x} = 53.6$ and the mean value for Paper 2 is $\bar{y} = 57.1$.

We now plot the lines $x = 53.6$ and $y = 57.1$ on the scatter diagram:

The lines divide the graph into four quadrants:

Top right: all points have both x values and y values greater than their respective means, i.e. $(x - \bar{x}) > 0$ and $(y - \bar{y}) > 0$. The product $(x - \bar{x})(y - \bar{y})$ is positive.
Bottom left: all points have both x values and y values less than their respective means, i.e. $(x - \bar{x}) < 0$ and $(y - \bar{y}) < 0$. The product is again positive.
Top left: x values less than $\bar{x}$, y values greater than $\bar{y}$. The product is negative.
Bottom right: x values greater than $\bar{x}$, y values less than $\bar{y}$. The product is negative.

Look at the scattergrams (scatter diagrams) below. The patterns seem to be very different.


Roughly speaking:

Positive correlation: the higher the value of x, the higher the value of y.
Negative correlation: the higher the value of x, the lower the value of y.
Zero correlation: no fixed relationship between x and y.

Again, this is made clearer by drawing the lines $y = \bar{y}$ and $x = \bar{x}$.
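The quadrant idea can be checked numerically on the marks data from section 5.0. The sketch below is only an illustration (the names `quadrant`, `counts` and `product_sum` are chosen here and do not come from the unit): it counts how many points fall in each quadrant about the mean lines and adds up the products of the deviations.

```python
# Marks from the table in section 5.0 (Paper 1 = x, Paper 2 = y).
paper1 = [42, 84, 50, 42, 33, 50, 69, 81, 50, 35]
paper2 = [31, 83, 42, 60, 28, 63, 59, 92, 73, 40]

x_bar = sum(paper1) / len(paper1)  # mean mark for Paper 1
y_bar = sum(paper2) / len(paper2)  # mean mark for Paper 2


def quadrant(x, y):
    """Name the quadrant of (x, y) relative to the lines x = x_bar, y = y_bar."""
    vertical = "top" if y > y_bar else "bottom"
    horizontal = "right" if x > x_bar else "left"
    return f"{vertical} {horizontal}"


counts = {}          # quadrant name -> number of points
product_sum = 0.0    # running total of (x - x_bar) * (y - y_bar)
for x, y in zip(paper1, paper2):
    name = quadrant(x, y)
    counts[name] = counts.get(name, 0) + 1
    product_sum += (x - x_bar) * (y - y_bar)

print(counts)
print(round(product_sum, 1))
```

Most of the points lie in the top right and bottom left quadrants and none in the bottom right, so the sum of the products is positive, which is what a positive correlation looks like in numbers.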

You have met scatter diagrams in your earlier work, where you may have drawn a line of best fit on the graph in order to estimate a value of y given a value of x. The line was drawn by eye, but you would know that it passes through the point of mean values $(\bar{x}, \bar{y})$, as shown below.


The lines on the first two diagrams are relatively easy to draw, but where do we draw a line on the third and, having drawn it, would it be of any practical use? Notice that we have been looking for a special type of relationship between the x and y values: a straight-line, or linear, relationship. The fact that we cannot find such a relationship does not mean that there is no relationship at all.

The product-moment formula for determining the linear correlation coefficient

The convention for dealing with data:
Horizontal (x) axis: the independent variable
Vertical (y) axis: the dependent variable

Let us look at some data on the height of students and the distance they can throw a cricket ball.

Height (x) cm   Distance (y) m
122             41
124             38
133             52
138             56
144             29
156             54
158             59
161             61
164             63
168             67

Just looking at the data, a general response might be: the taller a person, the further they can throw a cricket ball (apart from the odd person!).


Does a scatter diagram support that hypothesis?

The example below shows one drawback: SCALE


One of the measures of the degree of linear correlation between two variables is called the coefficient of correlation, denoted by the symbol r. The coefficient of correlation for two variables, say X and Y, is given by:
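$$r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\left[\sum (X - \bar{X})^2\right]\left[\sum (Y - \bar{Y})^2\right]}}$$

or simply

$$r = \frac{\sum xy}{\sqrt{\left(\sum x^2\right)\left(\sum y^2\right)}}$$

where $x = X - \bar{X}$ and $y = Y - \bar{Y}$ are the deviations from the means.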

The value of the correlation coefficient ranges from +1 for a perfect positive correlation to -1 for a perfect negative correlation.
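When working from a table of raw totals it can be convenient to avoid computing the deviations first. The block below gives an algebraically equivalent form of the coefficient; the symbol $n$ (the number of data pairs) and the two expansion identities are standard results added here for illustration, not taken from the unit.

```latex
% Expansions of the deviation sums in terms of raw totals (n = number of pairs)
\sum (X - \bar{X})(Y - \bar{Y}) = \sum XY - \frac{\sum X \sum Y}{n}, \qquad
\sum (X - \bar{X})^2 = \sum X^2 - \frac{\left(\sum X\right)^2}{n}, \qquad
\sum (Y - \bar{Y})^2 = \sum Y^2 - \frac{\left(\sum Y\right)^2}{n}

% Substituting and multiplying numerator and denominator by n
r = \frac{n \sum XY - \sum X \sum Y}
         {\sqrt{\left[\, n \sum X^2 - \left(\sum X\right)^2 \right]
                \left[\, n \sum Y^2 - \left(\sum Y\right)^2 \right]}}
```

Both forms give the same value of r; the deviation form is usually easier to follow by hand, while the raw-total form suits a calculator.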

Example 5.1

a) Determine the coefficient of correlation between X and Y based on the data below.

X   4    5    6    9
Y   12   10   8    6

b) The data below gives the experimental values obtained for the torque output from an electric motor, X, against the current taken from the supply, Y. Determine the value, degree and nature of the coefficient of linear correlation between the variables X and Y (if there is one).

X   0   1   2   3   4   5    6    7    8    9
Y   4   6   6   6   8   10   10   10   14   12


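Solution to Example 5.1

a) Construct a table from the given data, with $x = X - \bar{X}$ and $y = Y - \bar{Y}$.

X     Y     x     y     xy    x^2   y^2
4     12    -2    3     -6    4     9
5     10    -1    1     -1    1     1
6     8     0     -1    0     0     1
9     6     3     -3    -9    9     9

$\sum X = 24$, $\sum Y = 36$, $\sum xy = -16$, $\sum x^2 = 14$, $\sum y^2 = 20$

$\bar{X} = \dfrac{24}{4} = 6$, $\bar{Y} = \dfrac{36}{4} = 9$

$$r = \frac{\sum xy}{\sqrt{\left(\sum x^2\right)\left(\sum y^2\right)}} = \frac{-16}{\sqrt{(14)(20)}} = \frac{-16}{\sqrt{280}} = -0.9562$$

The negative sign shows that Y falls as X rises.

b) Construct a table in the same way.

X     Y     x      y      xy     x^2     y^2
0     4     -4.5   -4.6   20.7   20.25   21.16
1     6     -3.5   -2.6   9.1    12.25   6.76
2     6     -2.5   -2.6   6.5    6.25    6.76
3     6     -1.5   -2.6   3.9    2.25    6.76
4     8     -0.5   -0.6   0.3    0.25    0.36
5     10    0.5    1.4    0.7    0.25    1.96
6     10    1.5    1.4    2.1    2.25    1.96
7     10    2.5    1.4    3.5    6.25    1.96
8     14    3.5    5.4    18.9   12.25   29.16
9     12    4.5    3.4    15.3   20.25   11.56

$\sum X = 45$, $\sum Y = 86$, $\sum xy = 81.0$, $\sum x^2 = 82.5$, $\sum y^2 = 88.4$

$\bar{X} = \dfrac{45}{10} = 4.5$, $\bar{Y} = \dfrac{86}{10} = 8.6$

$$r = \frac{\sum xy}{\sqrt{\left(\sum x^2\right)\left(\sum y^2\right)}} = \frac{81}{\sqrt{(82.5)(88.4)}} = 0.95$$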


A good direct correlation exists between the values of X and Y.
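The hand calculations in Example 5.1 can be cross-checked with a few lines of Python. This is a minimal sketch of the deviation form of the product-moment formula; the function name `correlation` is chosen here for illustration. Note that in part a) the products of the deviations are all zero or negative, so the coefficient comes out negative.

```python
from math import sqrt


def correlation(xs, ys):
    """Product-moment correlation coefficient from deviations about the means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    dx = [x - x_bar for x in xs]              # x = X - mean of X
    dy = [y - y_bar for y in ys]              # y = Y - mean of Y
    sum_xy = sum(a * b for a, b in zip(dx, dy))
    sum_x2 = sum(a * a for a in dx)
    sum_y2 = sum(b * b for b in dy)
    return sum_xy / sqrt(sum_x2 * sum_y2)


# Example 5.1 a)
r_a = correlation([4, 5, 6, 9], [12, 10, 8, 6])
# Example 5.1 b): torque X against current Y
r_b = correlation(list(range(10)), [4, 6, 6, 6, 8, 10, 10, 10, 14, 12])

print(round(r_a, 4), round(r_b, 2))
```

Rounded as in the worked solution, the results are -0.9562 for part a) and 0.95 for part b).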

ACTIVITY 5A

TEST YOUR UNDERSTANDING BEFORE PROCEEDING TO THE NEXT INPUT...!

1. Determine the coefficient of correlation, to 4 decimal places, between X and Y based on the data below.

X     Y
122   41
124   38
133   52
138   56
144   29
156   54
158   59
161   61
164   63
168   67

2. The co-ordinates given below refer to an experiment to verify Newton's law of cooling over a limited range of values. Determine the value, degree and nature of the coefficient of correlation.

Time (min)   Temperature (°C)
4            46
8            34
10           30
12           26
16           24
22           20

3. The following results were obtained experimentally when verifying Hooke's law:

Load (N)   Extension (mm)
2          2
5          23
8          62
11         119
15         223

Determine the value, degree and nature of the coefficient of correlation.

4. The thickness of case-hardening achieved varies with temperature, and some co-ordinates obtained by experiment are as shown.

Temperature (°C)   Thickness (m)
400                3.7
420                3.4
350                3.7
320                3.8
400                3.6
480                3.3
440                3.4
370                3.7


Determine the coefficient of correlation based on these values.

FEEDBACK TO ACTIVITY 5A

1. r = 0.7289
2. r = -0.92, good, inverse
3. r = 0.97, good, direct
4. r = -0.93


INPUT

5.2 LEAST SQUARES REGRESSION LINE

Scatter Diagrams: Line of Best Fit

We have already referred to the drawing of a line of best fit by eye.

The only calculation involved was determining $\bar{x}$ and $\bar{y}$, since the line of best fit passes through the point $(\bar{x}, \bar{y})$. From the line you might be expected to estimate a y value given an x value. Of course, fitting a line by eye is a subjective matter: you try to minimise the distances between the points and the line. A mathematical method is available that produces two lines, known as y on x (to estimate values of y) and x on y (to estimate values of x). These are known as (Linear) Regression Lines or Least-Squares Regression Lines.
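To see that the two regression lines really are different lines, the sketch below fits both to the marks data from section 5.0. It assumes the standard result that the x on y line is obtained by exchanging the roles of x and y in the least-squares calculation; the helper name `fit_line` is chosen here for illustration.

```python
def fit_line(us, vs):
    """Least-squares line v = a + b*u, i.e. the regression of v on u."""
    n = len(us)
    u_bar = sum(us) / n
    v_bar = sum(vs) / n
    s_uv = sum((u - u_bar) * (v - v_bar) for u, v in zip(us, vs))
    s_uu = sum((u - u_bar) ** 2 for u in us)
    b = s_uv / s_uu          # gradient
    a = v_bar - b * u_bar    # intercept, so the line passes through the means
    return a, b


# Marks from section 5.0: Paper 1 = x, Paper 2 = y
paper1 = [42, 84, 50, 42, 33, 50, 69, 81, 50, 35]
paper2 = [31, 83, 42, 60, 28, 63, 59, 92, 73, 40]

a_yx, b_yx = fit_line(paper1, paper2)   # y on x: estimates y from x
a_xy, b_xy = fit_line(paper2, paper1)   # x on y: estimates x from y

print(f"y on x: y = {a_yx:.2f} + {b_yx:.2f}x")
print(f"x on y: x = {a_xy:.2f} + {b_xy:.2f}y")
```

Both lines pass through the point of means (53.6, 57.1), but their gradients differ (about 0.98 for y on x and about 0.69 for x on y), so the two lines coincide only when the correlation is perfect.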


Scatter Diagrams: The y on x Regression Line

Since the line must pass through $(\bar{x}, \bar{y})$, the parameters that can vary are the gradient of the line and the point where the line cuts the y-axis. The equation of the y on x line will be of the form

$$y = a + bx$$

(some syllabuses use the Greek letters $\alpha$ and $\beta$ instead of a and b).

The y on x line minimises the sum of the squares of the vertical distances from the points to the regression line (the square of the distance is used to ensure a positive result). As with correlation, there is a formula derived from a proof and a corresponding computational method. (The proof is not required at A/AS Level.)

For $y = a + bx$:

$$b = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sum x^2 - \dfrac{\left(\sum x\right)^2}{n}}, \qquad a = \bar{y} - b\bar{x}$$

where $\bar{y}$ and $\bar{x}$ are the mean values of y and x. Here x and y are the raw data values, not deviations from the means, and n is the number of data pairs.
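The formulas for b and a need only the totals from a table, so they translate directly into code. The sketch below is an illustration rather than a prescribed method: the function name `regression_from_sums` is chosen here, and the totals are those of the torque and current readings from Example 5.1 b).

```python
def regression_from_sums(n, sum_x, sum_y, sum_xy, sum_x2):
    """Return (a, b) for the y on x line y = a + b*x from raw totals."""
    b = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
    a = sum_y / n - b * (sum_x / n)   # a = mean of y - b * mean of x
    return a, b


# Torque X and current Y from Example 5.1 b)
xs = list(range(10))
ys = [4, 6, 6, 6, 8, 10, 10, 10, 14, 12]

n = len(xs)
sum_x = sum(xs)
sum_y = sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_x2 = sum(x * x for x in xs)

a, b = regression_from_sums(n, sum_x, sum_y, sum_xy, sum_x2)
print(sum_x, sum_y, sum_xy, sum_x2)
print(f"y = {a:.4f} + {b:.4f}x")
```

The numerator of b is the same quantity as $\sum (x - \bar{x})(y - \bar{y})$ in the correlation formula (81 for this data) and the denominator is $\sum (x - \bar{x})^2$ (82.5), so much of the table built for r can be reused when fitting the line.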


Example 5.2

a) y on x Regression Line (Least Squares Regression Line)

x      y
2.5    3.5
4      3
8      6.5
5      7
7      8
9.5    11
8.5    9
12.5   10.5
12.5   13
14.5   13

$\sum x = 84$, $\sum y = 84.5$, $\sum xy = 827$, $\sum x^2 = 845.5$, $n = 10$, $\bar{x} = 8.4$, $\bar{y} = 8.45$

Calculate the regression line y on x.

b) Based on the data already calculated, find the regression line y on x and estimate the value of y when x = 160.

$\sum x = 1468$, $\sum y = 520$, $\sum xy = 77689$, $\sum x^2 = 218070$, $n = 10$

Solution to Example 5.2

a) To calculate the regression line y on x:

$$b = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sum x^2 - \dfrac{\left(\sum x\right)^2}{n}} = \frac{827 - \dfrac{84 \times 84.5}{10}}{845.5 - \dfrac{84^2}{10}} = 0.8377$$

$$a = \bar{y} - b\bar{x} = 8.45 - (0.8377 \times 8.4) = 1.4133$$

So the least squares regression line y on x is:


y = 1.4133 + 0.8377x

We can now use this equation to calculate (estimate) a value of y for a given value of x. For example, find a value for y given x = 10. Substituting:

$$y = 1.4133 + (0.8377 \times 10) = 9.7903$$

Finding a value from within the range of x is called interpolation.

Warning: estimating a value from outside the data range (say x = 20) is called extrapolation and should be avoided (at all costs), since you do not know that the relationship between x and y will hold for larger and smaller values than those recorded.

b) For the regression line y on x:

$$b = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sum x^2 - \dfrac{\left(\sum x\right)^2}{n}} = \frac{77689 - \dfrac{1468 \times 520}{10}}{218070 - \dfrac{1468^2}{10}} = 0.5270$$

$$a = \bar{y} - b\bar{x} = 52 - (0.5270 \times 146.8) = -25.3636$$

So the regression line is y = -25.3636 + 0.5270x.

When x = 160, y = -25.3636 + (0.5270 x 160) = 58.96
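The warning about extrapolation can be built into an estimating routine. The sketch below uses the line from part a) and flags whether a requested x lies inside the recorded range; the function name `estimate_y` and the idea of returning a flag alongside the estimate are illustrative choices, not part of the unit.

```python
# x values recorded in Example 5.2 a) and the fitted coefficients
XS = [2.5, 4, 8, 5, 7, 9.5, 8.5, 12.5, 12.5, 14.5]
A, B = 1.4133, 0.8377   # y = A + B*x


def estimate_y(x, x_values=XS, a=A, b=B):
    """Return (estimate, is_interpolation) for the line y = a + b*x.

    is_interpolation is True only when x lies within the recorded data range.
    """
    inside = min(x_values) <= x <= max(x_values)
    return a + b * x, inside


y10, ok10 = estimate_y(10)   # inside the range 2.5 to 14.5
y20, ok20 = estimate_y(20)   # outside the range: extrapolation

print(round(y10, 4), ok10)
print(round(y20, 4), ok20)
```

The arithmetic works for any x, which is exactly why the range check matters: the estimate at x = 20 is computed just as easily as the one at x = 10, but only the latter is supported by the data.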


ACTIVITY 5B

TEST YOUR UNDERSTANDING BEFORE PROCEEDING TO THE NEXT INPUT...!

a. The table shows the results for a number of athletes. X represents long jump (metres).

x       y       x^2     y^2      xy
1.8     6.7     3.24    44.89    12.06
2.1     7.6     4.41    57.76    15.96
1.9     6.3     3.61    39.69    11.97
2.0     6.8     4.00    46.24    13.6
1.8     5.9     3.24    34.81    10.62
1.8     7.9     3.24    62.41    14.22
1.6     5.5     2.56    30.25    8.8
1.8     5.6     3.24    31.36    10.08
1.9     6.5     3.61    42.25    12.35
2.3     7.2     5.29    51.84    16.56
Total
19      66      36.44   441.5    126.22

$\sum x = 19$, $\sum y = 66$, $\sum xy = 126.22$, $\sum x^2 = 36.44$, $n = 10$

Calculate the value of b for the regression line y = a + bx.

b. The length y metres of a cable subjected to a load of x kilograms is given by $y = \alpha + \beta x$. In an experiment to estimate $\alpha$ and $\beta$ for a particular cable, the value of y was measured for each of 15 values of x. The following quantities were calculated from the 15 pairs of values.

$\sum x = 225$, $\sum y = 238.5$, $\sum xy = 3581$, $\sum x^2 = 3625$

Calculate the least squares estimates of $\alpha$ and $\beta$.


c. A set of bivariate data can be summarised as follows:

$\sum x = 21$, $\sum y = 43$, $\sum xy = 171$, $\sum x^2 = 91$, $\sum y^2 = 335$, $n = 6$

i) Calculate the equation of the regression line of y on x. Give your answer in the form y = a + bx, where the values of a and b should be stated to 3 significant figures.
ii) It is required to estimate the value of y for a given value of x. State the circumstances under which the regression line of x on y should be used, rather than the regression line of y on x.


FEEDBACK TO ACTIVITY 5B

a. b = 2.4118
b. $y = \alpha + \beta x$: y = 15.69 + 0.014x
c. i) a = 3.0667, so the regression line is y = 3.07 + 1.17x (3 significant figures)
   ii) Use the regression line of x on y to estimate the value of x when y is the independent variable.


SELF ASSESSMENT 5

You are approaching success. Try all the questions in this self-assessment section and check your answers given on the next page. If you encounter any problems, consult your instructor. Good luck.

1. The data given below refers to the relationship between man-hours worked and production achieved in a factory. Determine the coefficient of correlation.

Index of production,   Index of production,
man-hour basis         actual basis
100                    94
97                     91
100                    100
101                    105
93                     84
103                    112
91                     83
89                     80
110                    123
86                     78

2. The numbers of man-days lost per week due to sickness in two similar departments of a factory are shown for a 12-week period.

Department A Department B 2 0 1 8 1 8 2 1 19 18 21 20 17 17 18 19 12 16 16 15 14 15 17 18 13 16 15 18

Determine the coefficient of correlation and comment on its degree and nature.


3. The masses and heights of ten people were measured and the results are as shown.

Mass (kg)   Height (cm)
38          135
38          140
38          137
44          141
44          147
51          145
32          132
51          149
77          164
32          130

Calculate the coefficient of correlation for this data.

4. The relationship between the pressure and volume of a gas was measured and the following results were obtained:

Pressure (kPa)   Volume (m^3)
58               0.36
62               0.97
67               0.43
73               0.52
81               0.48
81               0.29
86               0.31
92               0.75
104              0.27

Determine the coefficient of correlation and comment on the result obtained.

5. The caloric intake of rats varies with body mass as shown below. Is there a linear correlation between these results?

Body mass (g), Caloric intake (cal h^-1):
2.0 3.1 2.1 1.5 3.6 3.2 4.6 3.6 5.0 3.6 6.0 3.9 7.0 4.1 8.0 4.2 8.5 4.5 9.0 4.6 10.0 5.9


6. Determine the coefficient of correlation for the data given below and test the null hypothesis that $\rho = 0$ at a level of significance of 0.1. The data relates the number of hours of sunshine per week to the hours lost due to sickness.

Hours of sunshine/week   Hours lost due to sickness
10                       90
13                       75
15                       75
17                       65
18                       55
20                       45
22                       55
23                       45
24                       35

7. The length y metres of a cable subjected to a load of x kilograms is given by $y = \alpha + \beta x$. In an experiment to estimate $\alpha$ and $\beta$ for a particular cable, the value of y was measured for each of 15 values of x. The following quantities were calculated from the pairs of values.

$\sum x = 225$, $\sum y = 238.5$, $\sum xy = 3581$, $\sum x^2 = 3625$

a) Calculate the least squares estimates of $\alpha$ and $\beta$.

8. A set of bivariate data can be summarised as follows:

$\sum x = 21$, $\sum y = 43$, $\sum xy = 171$, $\sum x^2 = 91$, $\sum y^2 = 335$, $n = 6$

i) Calculate the equation of the regression line of y on x. Give your answer in the form y = a + bx, where the values of a and b should be stated to 3 significant figures.
ii) It is required to estimate the value of y for a given value of x. State the circumstances under which the regression line of x on y should be used, rather than the regression line of y on x.

9. The data given below shows the relationship between the heights and masses of ten people.

Height, X (cm)   Mass, Y (kg)
175              82
180              78
193              86
165              72
187              91
171              80
198              95
168              72
184              89
177              74

Determine the equation of the regression line of mass on height, expressing the regression coefficients correct to two decimal places.


10. The power needed to drive a lathe increases as the cutting angle of the tool increases when cutting at a constant speed and depth of cut. The relationship for mild steel is:

Cutting angle, X (degrees)   Power, Y (kW)
50                           6.2
55                           6.8
60                           7.6
65                           8.2
70                           8.1
75                           8.8
80                           9.7
85                           10.0
90                           10.4

Determine a) the equation of the regression line of power on cutting angle and b) the equation of the regression line of cutting angle on power, expressing the regression coefficients correct to three significant figures in each case.


FEEDBACK TO SELF ASSESSMENT 5

Have you tried all the questions? If YES, check your answers now.

1. 0.97
2. 0.70, fair, direct
3. 0.97
4. -0.31. It is probable that the measurements were made at different temperatures.
5. r = 0.94, hence there is a good, direct correlation.
6. r = -0.95, $t_{0.90} = 1.42$, |t| = 8.05, so the hypothesis is rejected.
7. $\alpha = 15.69$, $\beta = 0.014$, so y = 15.69 + 0.014x
8. i) y = 3.07 + 1.17x
   ii) Use the regression line of x on y to estimate the value of x when y is the independent variable.
9. y = -36.83 + 0.66x
10. a) Y = 1.14 + 0.104X
    b) X = -9.27 + 9.41Y