
Module 3- Correlation and Regression

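If (X, Y) is a two-dimensional random variable uniformly distributed over the triangular region R bounded
by $y = 0$, $x = 3$ and $y = \frac{4x}{3}$, find $f(x)$, $f(y)$, $E(X)$, $\mathrm{Var}(X)$, $E(Y)$ and $\rho_{XY}$.

Solution:
Given that X and Y are uniformly distributed, $f(x, y) = k$ (a constant) on R.
We know that $\iint_R f(x, y)\,dx\,dy = 1$. That is,
$$\int_0^4 \int_{3y/4}^{3} k\,dx\,dy = 1$$
$$\Rightarrow k \int_0^4 [x]_{3y/4}^{3}\,dy = 1$$
$$\Rightarrow k \int_0^4 \left(3 - \frac{3y}{4}\right) dy = 1$$
$$\Rightarrow 6k = 1 \Rightarrow k = \frac{1}{6}$$
The marginal densities are
$$f(y) = \int_{3y/4}^{3} f(x, y)\,dx = \int_{3y/4}^{3} \frac{1}{6}\,dx = \frac{1}{8}(4 - y), \quad 0 \le y \le 4$$
$$f(x) = \int_0^{4x/3} \frac{1}{6}\,dy = \frac{2}{9}x, \quad 0 \le x \le 3$$
Hence
$$E(X) = \int_{-\infty}^{\infty} x f(x)\,dx = \int_0^3 \frac{2}{9}x^2\,dx = 2$$
$$E(Y) = \int_{-\infty}^{\infty} y f(y)\,dy = \int_0^4 \frac{y}{8}(4 - y)\,dy = \frac{4}{3}$$
$$E(X^2) = \int_{-\infty}^{\infty} x^2 f(x)\,dx = \frac{9}{2}$$
$$E(Y^2) = \int_{-\infty}^{\infty} y^2 f(y)\,dy = \frac{8}{3}$$
$$\mathrm{Var}(X) = E(X^2) - [E(X)]^2 = \frac{1}{2}$$
$$\mathrm{Var}(Y) = E(Y^2) - [E(Y)]^2 = \frac{8}{9}$$
$$E(XY) = \frac{1}{6}\int_0^4 \int_{3y/4}^{3} xy\,dx\,dy = 3$$
Now,
$$\rho_{XY} = \frac{E(XY) - E(X)E(Y)}{\sigma_X \sigma_Y} = \frac{1}{2}$$
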
Correlation:

If the change in one variable affects a change in the other variable, the variables are said to be
correlated.
Correlation between variables gives the degree of relationship between them.

Examples: 1. The correlation between the heights and weights of a group of persons.
2. The correlation between income and expenditure, and so on.

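Karl Pearson's coefficient of correlation:

The correlation coefficient between two random variables X and Y, denoted by $r(X, Y)$, is a
numerical measure of the linear relationship between them:
$$r(X, Y) = \rho(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}$$
where
$$\mathrm{Cov}(X, Y) = \frac{1}{n}\sum XY - \bar{X}\,\bar{Y}, \quad \sigma_X = \sqrt{\frac{1}{n}\sum X^2 - \bar{X}^2}, \quad \sigma_Y = \sqrt{\frac{1}{n}\sum Y^2 - \bar{Y}^2}$$
and n is the number of items in the given data.

Note: X and Y are said to be uncorrelated when Cov(X, Y) = 0. Two independent variables are always
uncorrelated, but uncorrelated variables need not be independent.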

Problems:

1. Calculate the correlation coefficient for the following heights (in inches) of fathers X
and their sons Y.

X 65 66 67 67 68 69 70 72
Y 67 68 65 68 72 72 69 71

Solution:

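X       Y       XY       X^2      Y^2
65      67      4355     4225     4489
66      68      4488     4356     4624
67      65      4355     4489     4225
67      68      4556     4489     4624
68      72      4896     4624     5184
69      72      4968     4761     5184
70      69      4830     4900     4761
72      71      5112     5184     5041
Total:
544     552     37560    37028    38132

Now,
$$\bar{X} = \frac{544}{8} = 68, \quad \bar{Y} = \frac{552}{8} = 69, \quad \bar{X}\,\bar{Y} = (68)(69) = 4692$$
$$\sigma_X = \sqrt{\frac{1}{n}\sum X^2 - \bar{X}^2} = \sqrt{\frac{37028}{8} - 4624} = 2.121$$
$$\sigma_Y = \sqrt{\frac{1}{n}\sum Y^2 - \bar{Y}^2} = \sqrt{\frac{38132}{8} - 4761} = 2.345$$
$$r(X, Y) = \frac{\frac{1}{n}\sum XY - \bar{X}\,\bar{Y}}{\sigma_X \sigma_Y} = \frac{\frac{1}{8}(37560) - 4692}{2.121 \times 2.345} = 0.6030$$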

2. Find the correlation coefficient for the following data

X 10 14 18 22 26 30
Y 18 12 24 6 30 36

Solution : r(X,Y) = 0.6
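
The hand computations in Problems 1 and 2 can be checked with a short script. The following is a minimal sketch, not part of the original notes; the function name `pearson_r` is an illustrative choice, and it uses the same 1/n moment formulas as the definition above.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation using the 1/n moment formulas from the notes."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Cov(X, Y) = (1/n) * sum(XY) - mean(X) * mean(Y)
    cov = sum(x * y for x, y in zip(xs, ys)) / n - mean_x * mean_y
    sd_x = math.sqrt(sum(x * x for x in xs) / n - mean_x ** 2)
    sd_y = math.sqrt(sum(y * y for y in ys) / n - mean_y ** 2)
    return cov / (sd_x * sd_y)

# Problem 1: heights of fathers and sons
r_heights = pearson_r([65, 66, 67, 67, 68, 69, 70, 72],
                      [67, 68, 65, 68, 72, 72, 69, 71])
# Problem 2
r_second = pearson_r([10, 14, 18, 22, 26, 30],
                     [18, 12, 24, 6, 30, 36])
print(round(r_heights, 4), round(r_second, 4))
```

Because the same n divides the covariance and both variances, using 1/(n - 1) throughout instead would give the same r.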

3. Let X, Y and Z be uncorrelated random variables with zero means and standard
deviations 5, 12 and 9 respectively. If U = X + Y and V = Y + Z, find the correlation
coefficient between U and V.

Solution:
Given that all three random variables have zero mean, E(X) = E(Y) = E(Z) = 0.
Now, $\mathrm{Var}(X) = E(X^2) - [E(X)]^2$
$\Rightarrow E(X^2) = \mathrm{Var}(X) = 5^2 = 25$ (since E(X) = 0).
Similarly, $E(Y^2) = 12^2 = 144$ and $E(Z^2) = 9^2 = 81$.

Since X and Y are uncorrelated, Cov(X, Y) = 0
$\Rightarrow$ E(XY) = E(X) E(Y) = 0.
Similarly, E(YZ) = 0 and E(ZX) = 0.

To find $\rho(U, V)$:
$$\rho(U, V) = \frac{E(UV) - E(U)E(V)}{\sigma_U \sigma_V}$$
E(U) = E[X + Y] = E[X] + E[Y] = 0
E(V) = E[Y + Z] = E[Y] + E[Z] = 0
$$E(U^2) = E[(X + Y)^2] = E[X^2] + E[Y^2] + 2E[XY] = 25 + 144 + 0 = 169$$
Similarly, $E(V^2) = E[Y^2] + E[Z^2] + 2E[YZ] = 144 + 81 + 0 = 225$.

Now, $\mathrm{Var}(U) = E(U^2) - [E(U)]^2 = 169 \Rightarrow \sigma_U = \sqrt{169} = 13$.
Similarly, $\mathrm{Var}(V) = E(V^2) - [E(V)]^2 = 225 \Rightarrow \sigma_V = \sqrt{225} = 15$.

$$E(UV) = E[(X + Y)(Y + Z)] = E(XY) + E(Y^2) + E(XZ) + E(YZ) = 144$$
Therefore,
$$\rho(U, V) = \frac{E(UV) - E(U)E(V)}{\sigma_U \sigma_V} = \frac{144 - 0}{13 \times 15} = \frac{144}{195} = \frac{48}{65}$$

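4. If the joint pdf of (X, Y) is given by $f(x, y) = x + y$, $0 \le x, y \le 1$, find $\rho_{XY}$.

Solution:
We know that
$$\rho(X, Y) = \frac{E(XY) - E(X)E(Y)}{\sigma_X \sigma_Y}$$
Now,
$$E(XY) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} xy f(x, y)\,dx\,dy = \int_0^1 \int_0^1 xy(x + y)\,dx\,dy$$
$$= \int_0^1 \left[\frac{x^3 y}{3} + \frac{x^2 y^2}{2}\right]_{x=0}^{1} dy = \int_0^1 \left(\frac{y}{3} + \frac{y^2}{2}\right) dy = \left[\frac{y^2}{6} + \frac{y^3}{6}\right]_0^1 = \frac{1}{3}$$
The marginal pdfs of X and Y are given by
$$f(x) = \int_0^1 f(x, y)\,dy = \int_0^1 (x + y)\,dy = \left[xy + \frac{y^2}{2}\right]_0^1 = x + \frac{1}{2}$$
$$f(y) = \int_0^1 f(x, y)\,dx = \int_0^1 (x + y)\,dx = \left[\frac{x^2}{2} + xy\right]_0^1 = y + \frac{1}{2}$$
$$E(X) = \int_0^1 x f(x)\,dx = \int_0^1 x\left(x + \frac{1}{2}\right) dx = \left[\frac{x^3}{3} + \frac{x^2}{4}\right]_0^1 = \frac{1}{3} + \frac{1}{4} = \frac{7}{12}$$
$$E(Y) = \int_0^1 y f(y)\,dy = \int_0^1 y\left(y + \frac{1}{2}\right) dy = \left[\frac{y^3}{3} + \frac{y^2}{4}\right]_0^1 = \frac{1}{3} + \frac{1}{4} = \frac{7}{12}$$
$$E(X^2) = \int_0^1 x^2 f(x)\,dx = \int_0^1 x^2\left(x + \frac{1}{2}\right) dx = \left[\frac{x^4}{4} + \frac{x^3}{6}\right]_0^1 = \frac{1}{4} + \frac{1}{6} = \frac{5}{12}$$
$$E(Y^2) = \int_0^1 y^2 f(y)\,dy = \int_0^1 y^2\left(y + \frac{1}{2}\right) dy = \left[\frac{y^4}{4} + \frac{y^3}{6}\right]_0^1 = \frac{1}{4} + \frac{1}{6} = \frac{5}{12}$$
$$\mathrm{Var}(X) = E(X^2) - [E(X)]^2 = \frac{5}{12} - \left(\frac{7}{12}\right)^2 = \frac{11}{144} \Rightarrow \sigma_X = \frac{\sqrt{11}}{12}$$
$$\mathrm{Var}(Y) = E(Y^2) - [E(Y)]^2 = \frac{5}{12} - \left(\frac{7}{12}\right)^2 = \frac{11}{144} \Rightarrow \sigma_Y = \frac{\sqrt{11}}{12}$$
Therefore,
$$\rho(X, Y) = \frac{E(XY) - E(X)E(Y)}{\sigma_X \sigma_Y} = \frac{\frac{1}{3} - \frac{7}{12}\cdot\frac{7}{12}}{\frac{\sqrt{11}}{12}\cdot\frac{\sqrt{11}}{12}} = -\frac{1}{11}$$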
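
A result like the one in Problem 4 can be sanity-checked numerically. The sketch below is not from the original notes: it approximates the moments of a joint pdf on the unit square with a midpoint rule (the helper names `joint_moment` and `correlation` and the grid size are illustrative assumptions) and compares the resulting correlation with the exact value -1/11 for f(x, y) = x + y.

```python
def joint_moment(pdf, power_x, power_y, steps=200):
    """Midpoint-rule estimate of E[X^a * Y^b] for a pdf on the unit square."""
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        for j in range(steps):
            y = (j + 0.5) * h
            total += (x ** power_x) * (y ** power_y) * pdf(x, y)
    return total * h * h

def correlation(pdf):
    """Correlation coefficient built from the numerically estimated moments."""
    ex = joint_moment(pdf, 1, 0)
    ey = joint_moment(pdf, 0, 1)
    exy = joint_moment(pdf, 1, 1)
    var_x = joint_moment(pdf, 2, 0) - ex ** 2
    var_y = joint_moment(pdf, 0, 2) - ey ** 2
    return (exy - ex * ey) / (var_x * var_y) ** 0.5

rho = correlation(lambda x, y: x + y)
print(rho, -1 / 11)  # the two values should agree to several decimal places
```

The midpoint rule is only approximate, so the check is meaningful up to a small discretisation error; a finer grid tightens the agreement.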

5. The independent random variables X and Y have the pdfs given by
$$f_X(x) = \begin{cases} 4ax, & 0 \le x \le 1 \\ 0, & \text{otherwise} \end{cases} \qquad f_Y(y) = \begin{cases} 4by, & 0 \le y \le 1 \\ 0, & \text{otherwise} \end{cases}$$
Find the correlation coefficient.

Solution:
$$E(X) = \int_{-\infty}^{\infty} x f(x)\,dx = \int_0^1 x \cdot 4ax\,dx = 4a\int_0^1 x^2\,dx = 4a\left[\frac{x^3}{3}\right]_0^1 = \frac{4a}{3}$$
$$E(Y) = \int_{-\infty}^{\infty} y f(y)\,dy = \int_0^1 y \cdot 4by\,dy = 4b\int_0^1 y^2\,dy = 4b\left[\frac{y^3}{3}\right]_0^1 = \frac{4b}{3}$$
Since X and Y are independent, the joint pdf of X and Y is given by
$$f(x, y) = f_X(x)\,f_Y(y) = (4ax)(4by) = 16abxy, \quad 0 \le x \le 1,\ 0 \le y \le 1$$
Now,
$$E(XY) = \int_0^1\int_0^1 xy(16abxy)\,dx\,dy = 16ab\int_0^1\int_0^1 x^2 y^2\,dx\,dy = 16ab\int_0^1 \left[\frac{x^3}{3}\right]_0^1 y^2\,dy = \frac{16ab}{3}\int_0^1 y^2\,dy = \frac{16ab}{9}$$
Therefore,
$$\mathrm{Cov}(X, Y) = E(XY) - E(X)E(Y) = \frac{16ab}{9} - \left(\frac{4a}{3}\right)\left(\frac{4b}{3}\right) = 0$$
$$\Rightarrow \rho(X, Y) = 0$$

Rank Correlation:

Let $(x_i, y_i)$, $i = 1, 2, \ldots, n$, be the ranks of the i-th individual in two characteristics A and
B respectively. The Pearson coefficient of correlation between the ranks $x_i$ and $y_i$ is called
the rank correlation coefficient between the characteristics A and B for that group of
individuals, and is given by
$$r(X, Y) = 1 - \frac{6\sum d_i^2}{n(n^2 - 1)}, \quad \text{where } d_i = x_i - y_i$$

1. Find the rank correlation coefficient from the following data:

Rank in X    1  2  3  4  5  6  7
Rank in Y    4  3  1  2  6  5  7

Solution:

Rank in X (x_i)       1   2   3   4   5   6   7   Total
Rank in Y (y_i)       4   3   1   2   6   5   7
d_i = x_i - y_i      -3  -1   2   2  -1   1   0     0
d_i^2                 9   1   4   4   1   1   0    20

Now, the rank correlation coefficient is
$$r(X, Y) = 1 - \frac{6\sum d_i^2}{n(n^2 - 1)} = 1 - \frac{6 \times 20}{7(49 - 1)} = 1 - \frac{120}{336} = 0.6429$$
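
The formula above translates directly into code. The following is a minimal sketch, assuming both inputs are already ranks with no ties; the function name `spearman_from_ranks` is illustrative and not part of the original notes.

```python
def spearman_from_ranks(rank_x, rank_y):
    """Spearman's rank correlation 1 - 6*sum(d^2) / (n*(n^2 - 1)), no ties."""
    n = len(rank_x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))

# Ranks from the worked problem above
r_rank = spearman_from_ranks([1, 2, 3, 4, 5, 6, 7],
                             [4, 3, 1, 2, 6, 5, 7])
print(round(r_rank, 4))
```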
Repeated Ranks:

If two or more individuals are tied in either characteristic A or B, that is, if a value occurs
more than once in a series, Spearman's formula for the rank correlation coefficient breaks down.
In this case a common rank is given to the repeated items. This common rank is the average of
the ranks these items would have received had they differed slightly from each other, and the
next item gets the rank following those already assigned.
As a result, a correction is made to the rank correlation formula: for each group of repeated
items, the factor $\frac{m(m^2 - 1)}{12}$ is added to $\sum d^2$, where m is the number of times
the item is repeated.

1. Obtain the rank correlation coefficient for the following data :

X 68 64 75 50 64 80 75 40 55 64
Y 62 58 68 45 81 60 68 48 50 70

Solution:

X 68 64 75 50 64 80 75 40 55 64
Y 62 58 68 45 81 60 68 48 50 70
Rank X ( xi ) 4 6 2.5 9 6 1 2.5 10 8 6
Rank Y ( yi ) 5 7 3.5 10 1 6 3.5 9 8 2
d i  ( xi  yi ) -1 -1 -1 -1 5 -5 -1 1 0 4
d i2 1 1 1 1 25 25 1 1 0 16

Here $\sum d_i^2 = 72$.

Correction factors:

In the X series, 75 is repeated twice: $C.F. = \frac{2(2^2 - 1)}{12} = \frac{1}{2}$
In the X series, 64 is repeated thrice: $C.F. = \frac{3(3^2 - 1)}{12} = 2$
In the Y series, 68 is repeated twice: $C.F. = \frac{2(2^2 - 1)}{12} = \frac{1}{2}$

Therefore, the rank correlation is
$$r = 1 - \frac{6\left(\sum d^2 + \frac{1}{2} + 2 + \frac{1}{2}\right)}{10(10^2 - 1)} = 1 - \frac{6[72 + 0.5 + 2 + 0.5]}{10[99]} = 1 - \frac{450}{990} = 0.5454$$
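
The tie-handling procedure (average ranks plus a correction factor per group of repeated values) can be sketched as follows. This is an illustrative implementation, not the notes' own; the names `average_ranks` and `spearman_with_ties` are assumptions, and values are ranked in descending order (largest value gets rank 1) as in the worked problem.

```python
from collections import Counter

def average_ranks(values):
    """Rank values in descending order, giving tied values their average rank."""
    ordered = sorted(values, reverse=True)
    first_pos = {}
    for pos, v in enumerate(ordered, start=1):
        first_pos.setdefault(v, pos)  # first position occupied by each value
    counts = Counter(values)
    # m tied items occupy positions first .. first + m - 1; their mean is first + (m - 1)/2
    return [first_pos[v] + (counts[v] - 1) / 2 for v in values]

def spearman_with_ties(xs, ys):
    """Rank correlation with m(m^2 - 1)/12 added to sum(d^2) for every tie group."""
    n = len(xs)
    rx, ry = average_ranks(xs), average_ranks(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    correction = sum(m * (m * m - 1) / 12
                     for series in (xs, ys)
                     for m in Counter(series).values())  # m = 1 contributes 0
    return 1 - 6 * (sum_d2 + correction) / (n * (n * n - 1))

x_marks = [68, 64, 75, 50, 64, 80, 75, 40, 55, 64]
y_marks = [62, 58, 68, 45, 81, 60, 68, 48, 50, 70]
r_tied = spearman_with_ties(x_marks, y_marks)
print(average_ranks(x_marks), round(r_tied, 4))
```

Ranking both series in ascending order instead would leave every difference d_i unchanged in magnitude, so the coefficient is the same either way as long as both series use the same direction.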

Partial and Multiple Correlation:


Let us consider the example of the yield of rice on a farm. It may be affected by the type of
soil, temperature, amount of rainfall, usage of fertilizers, etc. It is useful to determine
how the yield of rice is influenced by one factor, or how it is affected by several factors
taken together. This is done with the help of partial and multiple correlation analysis.

The basic distinction between multiple and partial correlation analysis is that in the
former, the degree of relationship between the variable Y and all the other variables
$X_1, X_2, \ldots, X_n$ taken together is measured, whereas in the latter, the degree of relationship
between Y and one of the variables $X_1, X_2, \ldots, X_n$ is measured after removing the effect of all
the other variables.

Partial correlation:

The partial correlation coefficient provides a measure of the relationship between the
dependent variable and one other variable, with the effect of the rest of the variables eliminated.

If there are three variables $X_1$, $X_2$ and $X_3$, there will be three coefficients of partial
correlation, each studying the relationship between two variables when the third is held
constant. If $r_{12.3}$ denotes the coefficient of partial correlation between $X_1$ and $X_2$
keeping $X_3$ constant, it is calculated as
$$r_{12.3} = \frac{r_{12} - r_{13} r_{23}}{\sqrt{1 - r_{13}^2}\,\sqrt{1 - r_{23}^2}}$$
Similarly,
$$r_{13.2} = \frac{r_{13} - r_{12} r_{23}}{\sqrt{1 - r_{12}^2}\,\sqrt{1 - r_{23}^2}}, \qquad r_{23.1} = \frac{r_{23} - r_{12} r_{13}}{\sqrt{1 - r_{12}^2}\,\sqrt{1 - r_{13}^2}}$$

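1. In a trivariate distribution, it is found that $r_{12} = 0.7$, $r_{13} = 0.61$ and $r_{23} = 0.4$.
Find the partial correlation coefficients.

Solution:

$$r_{12.3} = \frac{r_{12} - r_{13} r_{23}}{\sqrt{1 - r_{13}^2}\,\sqrt{1 - r_{23}^2}} = \frac{0.7 - (0.61 \times 0.4)}{\sqrt{1 - (0.61)^2}\,\sqrt{1 - (0.4)^2}} = 0.628$$

$$r_{13.2} = \frac{r_{13} - r_{12} r_{23}}{\sqrt{1 - r_{12}^2}\,\sqrt{1 - r_{23}^2}} = \frac{0.61 - (0.7 \times 0.4)}{\sqrt{1 - (0.7)^2}\,\sqrt{1 - (0.4)^2}} = 0.504$$

$$r_{23.1} = \frac{r_{23} - r_{12} r_{13}}{\sqrt{1 - r_{12}^2}\,\sqrt{1 - r_{13}^2}} = \frac{0.4 - (0.7 \times 0.61)}{\sqrt{1 - (0.7)^2}\,\sqrt{1 - (0.61)^2}} = -0.048$$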
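
All three partial correlation coefficients have the same shape, so one helper covers them. The sketch below is illustrative (the function name `partial_corr` and its argument order are assumptions): it takes the correlation of the two variables of interest first, then the correlation of each with the variable held constant.

```python
import math

def partial_corr(r_ab, r_ac, r_bc):
    """r_ab.c: correlation between a and b with the effect of c removed."""
    return (r_ab - r_ac * r_bc) / math.sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))

r12, r13, r23 = 0.7, 0.61, 0.4
r12_3 = partial_corr(r12, r13, r23)  # X3 held constant
r13_2 = partial_corr(r13, r12, r23)  # X2 held constant
r23_1 = partial_corr(r23, r12, r13)  # X1 held constant
print(round(r12_3, 3), round(r13_2, 3), round(r23_1, 3))
```

Note that the numerator of r23.1 is 0.4 - 0.427, which is negative, so that coefficient carries a minus sign.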

Multiple corelation:

In multiple correlation, we estimate the value of one of the variables
based on the values of all the others. The variable whose value we are trying to estimate is
called the dependent variable, and the other variables on which the estimates are based are
known as independent variables.
The coefficients of multiple correlation with three variables $X_1$, $X_2$ and $X_3$ are
$R_{1.23}$, $R_{2.13}$ and $R_{3.12}$. $R_{1.23}$ is the coefficient of multiple correlation with $X_1$ as the
dependent variable and $X_2$, $X_3$ as the two independent variables. These can be expressed in terms
of $r_{12}$, $r_{23}$ and $r_{13}$ as
$$R_{1.23} = \sqrt{\frac{r_{12}^2 + r_{13}^2 - 2 r_{12} r_{23} r_{13}}{1 - r_{23}^2}}, \qquad R_{2.13} = \sqrt{\frac{r_{12}^2 + r_{23}^2 - 2 r_{12} r_{23} r_{13}}{1 - r_{13}^2}},$$
$$R_{3.12} = \sqrt{\frac{r_{13}^2 + r_{23}^2 - 2 r_{12} r_{23} r_{13}}{1 - r_{12}^2}}$$

1. The following zero-order correlation coefficients are given: $r_{12} = 0.98$,
$r_{13} = 0.44$ and $r_{23} = 0.54$. Calculate the multiple correlation coefficient treating the first
variable as dependent and the second and third variables as independent.

Solution:
$$R_{1.23} = \sqrt{\frac{r_{12}^2 + r_{13}^2 - 2 r_{12} r_{23} r_{13}}{1 - r_{23}^2}} = \sqrt{\frac{(0.98)^2 + (0.44)^2 - 2(0.98)(0.54)(0.44)}{1 - (0.54)^2}} = 0.986$$
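
The multiple correlation formula can be evaluated the same way. This is a small sketch with an illustrative function name (`multiple_corr`); its arguments are the correlations of the dependent variable with each independent variable, followed by the correlation between the two independent variables.

```python
import math

def multiple_corr(r_dep_a, r_dep_b, r_ab):
    """R for one dependent variable on two independent variables a and b."""
    r_squared = (r_dep_a ** 2 + r_dep_b ** 2
                 - 2 * r_dep_a * r_dep_b * r_ab) / (1 - r_ab ** 2)
    return math.sqrt(r_squared)

# R_1.23 with r12 = 0.98, r13 = 0.44, r23 = 0.54
R1_23 = multiple_corr(0.98, 0.44, 0.54)
print(round(R1_23, 3))
```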

Regression:

Regression is a mathematical measure of the average relationship between two or more
variables in terms of the original units of the data.

Lines of regression:

1. The line of regression of Y on X is given by $y - \bar{y} = r\,\frac{\sigma_Y}{\sigma_X}(x - \bar{x})$

2. The line of regression of X on Y is given by $x - \bar{x} = r\,\frac{\sigma_X}{\sigma_Y}(y - \bar{y})$

Regression Coefficients:

1. Regression coefficient of Y on X: $r\,\frac{\sigma_Y}{\sigma_X} = b_{YX}$,
where $b_{YX} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$

2. Regression coefficient of X on Y: $r\,\frac{\sigma_X}{\sigma_Y} = b_{XY}$,
where $b_{XY} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (y - \bar{y})^2}$

3. Correlation coefficient: $r = \pm\sqrt{b_{XY} \times b_{YX}}$, where r takes the common sign of the two
regression coefficients.

1. From the following data find (i) two regression equations (ii) the coefficient of
correlation between the marks in Economics and Statistics (iii) the most likely marks in
Statistics when marks in Economics are 30.

Marks in Economics 25 28 35 32 31 36 29 38 34 32
Marks in Statistics 43 46 49 41 36 32 31 30 33 39

Solution:

X    Y    X - 32   Y - 38   (X - 32)^2   (Y - 38)^2   (X - 32)(Y - 38)
25 43 -7 5 49 25 -35
28 46 -4 8 16 64 -32
35 49 3 11 9 121 33
32 41 0 3 0 9 0
31 36 -1 -2 1 4 2
36 32 4 -6 16 36 -24
29 31 -3 -7 9 49 21
38 30 6 -8 36 64 -48
34 33 2 -5 4 25 -10
32 39 0 1 0 1 0
Total:
320 380 0 0 140 398 -93

Here, $\bar{X} = \frac{\sum X}{n} = \frac{320}{10} = 32$ and $\bar{Y} = \frac{\sum Y}{n} = \frac{380}{10} = 38$.

The line of regression of X on Y is given by $x - \bar{x} = b_{XY}(y - \bar{y})$, where
$$b_{XY} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (y - \bar{y})^2} = \frac{-93}{398} = -0.2337$$
$$\Rightarrow x - 32 = -0.2337(y - 38) = -0.2337y + 0.2337 \times 38$$
$$\Rightarrow x = -0.2337y + 40.8806$$

The line of regression of Y on X is given by $y - \bar{y} = b_{YX}(x - \bar{x})$, where
$$b_{YX} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} = \frac{-93}{140} = -0.6643$$
$$\Rightarrow y - 38 = -0.6643(x - 32) = -0.6643x + 0.6643 \times 32$$
$$\Rightarrow y = -0.6643x + 59.2576$$

Coefficient of correlation: $r^2 = b_{YX} \times b_{XY} = (-0.6643)(-0.2337) = 0.1552$.
Since both regression coefficients are negative, $r = -\sqrt{0.1552} = -0.394$.

Now, we have to find the most likely marks in Statistics (Y) when the marks in Economics
(X) are 30. We use the line of regression of Y on X, i.e. $y = -0.6643x + 59.2576$.

Putting $x = 30$, we get $y = -0.6643(30) + 59.2576 = 39.33 \approx 39$.
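
The whole procedure (deviations from the means, the two regression coefficients, r, and a prediction from the line of Y on X) can be reproduced with a short script. This is a minimal sketch, not the notes' own method of presentation; the function name `regression_summary` and the dictionary keys are illustrative.

```python
def regression_summary(xs, ys):
    """Regression coefficients, correlation and means from deviation sums."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    b_yx = sxy / sxx          # slope of the line of Y on X
    b_xy = sxy / syy          # slope of the line of X on Y
    r = (b_yx * b_xy) ** 0.5  # magnitude of r
    if sxy < 0:               # r takes the sign of the regression coefficients
        r = -r
    return {"mean_x": mean_x, "mean_y": mean_y, "b_yx": b_yx, "b_xy": b_xy, "r": r}

economics = [25, 28, 35, 32, 31, 36, 29, 38, 34, 32]
statistics = [43, 46, 49, 41, 36, 32, 31, 30, 33, 39]
fit = regression_summary(economics, statistics)

# Most likely Statistics mark when the Economics mark is 30 (line of Y on X)
predicted = fit["mean_y"] + fit["b_yx"] * (30 - fit["mean_x"])
print(fit, round(predicted, 2))
```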

2. The two lines of regression are $8x - 10y + 66 = 0$ and $40x - 18y - 214 = 0$. The variance
of X is 9. Find the mean values of X and Y.

Solution:

Since both lines of regression pass through the point of means $(\bar{x}, \bar{y})$, this point
must satisfy the two given regression equations:
$$8\bar{x} - 10\bar{y} = -66 \quad \ldots (1)$$
$$40\bar{x} - 18\bar{y} = 214 \quad \ldots (2)$$

Solving (1) and (2), we get $\bar{x} = 13$, $\bar{y} = 17$.


Ten students got the following percentages of marks in statistics at a degree
examination and at a competitive examination.
Student 1 2 3 4 5 6 7 8 9 10
Marks in Deg. Exam 78 40 94 22 76 84 90 62 65 39
Marks in comp. Exam 84 51 91 60 68 62 86 58 53 47
Calculate the correlation coefficient
Solution: Method 1:
X       Y       XY       X^2      Y^2
78      84      6552     6084     7056
40      51      2040     1600     2601
94      91      8554     8836     8281
22      60      1320     484      3600
76      68      5168     5776     4624
84      62      5208     7056     3844
90      86      7740     8100     7396
62      58      3596     3844     3364
65      53      3445     4225     2809
39      47      1833     1521     2209
Total:
650     660     45456    47526    45784

Now,
$$\bar{X} = \frac{650}{10} = 65, \quad \bar{Y} = \frac{660}{10} = 66, \quad \bar{X}\,\bar{Y} = (65)(66) = 4290$$
$$\sigma_X = \sqrt{\frac{1}{n}\sum X^2 - \bar{X}^2} = \sqrt{\frac{47526}{10} - 4225} = 22.97$$
$$\sigma_Y = \sqrt{\frac{1}{n}\sum Y^2 - \bar{Y}^2} = \sqrt{\frac{45784}{10} - 4356} = 14.91$$
$$r(X, Y) = \frac{\frac{1}{n}\sum XY - \bar{X}\,\bar{Y}}{\sigma_X \sigma_Y} = \frac{\frac{1}{10}(45456) - 4290}{22.97 \times 14.91} = 0.746$$

There is a positive correlation between X and Y.


Method 2:
X     Y     X - 65   Y - 66   (X - 65)^2   (Y - 66)^2   (X - 65)(Y - 66)
78    84      13       18        169          324            234
40    51     -25      -15        625          225            375
94    91      29       25        841          625            725
22    60     -43       -6       1849           36            258
76    68      11        2        121            4             22
84    62      19       -4        361           16            -76
90    86      25       20        625          400            500
62    58      -3       -8          9           64             24
65    53       0      -13          0          169              0
39    47     -26      -19        676          361            494
Total:
650   660      0        0       5276         2224           2556

$$\mathrm{Cov}(X, Y) = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{n} = \frac{2556}{10} = 255.6$$
$$\sigma_X = \sqrt{\frac{1}{n}\sum (X - \bar{X})^2} = \sqrt{\frac{5276}{10}} = 22.97$$
$$\sigma_Y = \sqrt{\frac{1}{n}\sum (Y - \bar{Y})^2} = \sqrt{\frac{2224}{10}} = 14.91$$
$$r(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{255.6}{(22.97)(14.91)} = 0.746$$
6. If X and Y are independent random variables with means 5 and 10 and standard
deviations 2 and 3 respectively, obtain $r(U, V)$ where $U = 3X + 4Y$ and $V = 3X - Y$.

Solution: Given that $E(X) = 5$, $E(Y) = 10$, $S.D.(X) = 2$, $S.D.(Y) = 3$.

This implies that $\mathrm{Var}(X) = 4$, $\mathrm{Var}(Y) = 9$.

Since X and Y are independent, $E(XY) = E(X)E(Y)$.

This implies that $E(XY) = (5)(10) = 50$ ------------------------------ (1)

Also, given that $U = 3X + 4Y$ and $V = 3X - Y$,

$E[U] = E[3X + 4Y] = 3E(X) + 4E(Y) = 3(5) + 4(10) = 55$ ------------ (2)

Similarly, $E[V] = E[3X - Y] = 3E(X) - E(Y) = 3(5) - 10 = 5$ ---------------------- (3)

Now, $\mathrm{Var}(X) = 4 = E[X^2] - [E(X)]^2 = E[X^2] - (5)^2$

This implies that $E[X^2] = 29$ ---------------------------------------------------------- (4)

Now, $\mathrm{Var}(Y) = 9 = E[Y^2] - [E(Y)]^2 = E[Y^2] - (10)^2$

This implies that $E[Y^2] = 109$ ----------------------------------------------------------- (5)

Consider
$$\mathrm{Cov}(U, V) = E(UV) - E(U)E(V)$$
$$= E[(3X + 4Y)(3X - Y)] - (55)(5)$$
$$= E[9X^2 + 9XY - 4Y^2] - (55)(5)$$
$$= 9E[X^2] + 9E[XY] - 4E[Y^2] - (55)(5)$$
$$= 9(29) + 9(50) - 4(109) - (55)(5) \quad \text{\{by substituting (1), (2), (3), (4) and (5)\}}$$
$$= 0$$

This implies that $r(U, V) = 0$.

Therefore, U and V are uncorrelated.
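
As a cross-check, the same covariance can be obtained directly from the bilinearity of covariance, without computing the raw moments. This short derivation is an addition for verification; it uses only $\mathrm{Var}(X) = 4$, $\mathrm{Var}(Y) = 9$ and $\mathrm{Cov}(X, Y) = 0$ (from independence), with $U = 3X + 4Y$ and $V = 3X - Y$:

```latex
\begin{aligned}
\mathrm{Cov}(U, V) &= \mathrm{Cov}(3X + 4Y,\; 3X - Y) \\
&= 9\,\mathrm{Var}(X) - 3\,\mathrm{Cov}(X, Y) + 12\,\mathrm{Cov}(Y, X) - 4\,\mathrm{Var}(Y) \\
&= 9(4) + 9(0) - 4(9) \\
&= 0
\end{aligned}
```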


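7. Two components of a minicomputer have the following joint probability density
function for their useful lifetimes X and Y:
$$f(x, y) = \begin{cases} 2 - x - y, & 0 \le x \le 1,\ 0 \le y \le 1 \\ 0, & \text{otherwise} \end{cases}$$
Show that $r(X, Y) = -\frac{1}{11}$.

Solution: We know that
$$r(X, Y) = \frac{E(XY) - E(X)E(Y)}{\sigma_X \sigma_Y}$$
Now,
$$E(XY) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} xy f(x, y)\,dx\,dy = \int_0^1\int_0^1 xy(2 - x - y)\,dx\,dy$$
$$= \int_0^1 \left[x^2 y - \frac{x^3 y}{3} - \frac{x^2 y^2}{2}\right]_{x=0}^{1} dy = \int_0^1 \left(y - \frac{y}{3} - \frac{y^2}{2}\right) dy = \frac{1}{6}$$
$$E(X) = \int_0^1\int_0^1 x f(x, y)\,dx\,dy = \int_0^1\int_0^1 x(2 - x - y)\,dx\,dy = \int_0^1 \left[x^2 - \frac{x^3}{3} - \frac{x^2 y}{2}\right]_{x=0}^{1} dy$$
$$= \int_0^1 \left(1 - \frac{1}{3} - \frac{y}{2}\right) dy = \frac{5}{12}$$
$$E(Y) = \int_0^1\int_0^1 y f(x, y)\,dx\,dy = \int_0^1\int_0^1 y(2 - x - y)\,dx\,dy = \frac{5}{12}$$
$$E(X^2) = \int_0^1\int_0^1 x^2 f(x, y)\,dx\,dy = \int_0^1\int_0^1 x^2(2 - x - y)\,dx\,dy = \int_0^1 \left[\frac{2x^3}{3} - \frac{x^4}{4} - \frac{x^3 y}{3}\right]_{x=0}^{1} dy$$
$$= \int_0^1 \left(\frac{2}{3} - \frac{1}{4} - \frac{y}{3}\right) dy = \frac{1}{4}$$
$$E(Y^2) = \int_0^1\int_0^1 y^2 f(x, y)\,dx\,dy = \int_0^1\int_0^1 y^2(2 - x - y)\,dx\,dy = \frac{1}{4}$$
$$\mathrm{Var}(X) = E(X^2) - [E(X)]^2 = \frac{1}{4} - \left(\frac{5}{12}\right)^2 = \frac{11}{144}$$
and, by symmetry, $\mathrm{Var}(Y) = \frac{11}{144}$.
Therefore,
$$r(X, Y) = \frac{E(XY) - E(X)E(Y)}{\sigma_X \sigma_Y} = \frac{\frac{1}{6} - \frac{5}{12}\cdot\frac{5}{12}}{\frac{\sqrt{11}}{12}\cdot\frac{\sqrt{11}}{12}} = -\frac{1}{11}$$
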
Rank Correlation Coefficient:
In real life, there are situations when we get data in the form of ranks or otherwise the
original data are rated with different grades. For example, if two inspectors are asked to grade
the units produced by a machine, then we may have two different sets of grades (ranks). If two
sets of observations of a quality characteristic are given to an inspector to rank them, we may
get a pair of ranks for each pair of observations based on their performance. Under these
circumstances, we may have to obtain the correlation between the two sets of ranks instead of
using the observations as it is.

If 1, 2, ..., n are the ranks given to the n values $(x_1, x_2, \ldots, x_n)$ of the random variable X,
and 1, 2, ..., n are likewise the ranks given to the n values $(y_1, y_2, \ldots, y_n)$ of the random
variable Y, then the correlation coefficient between the two sets of ranks, known as Spearman's
rank correlation coefficient, is given by
$$r(X, Y) = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$$

where $d_i$ = (rank of the i-th value of X) - (rank of the i-th value of Y).

That is, $d_i = x_i - y_i$, with $x_i$ and $y_i$ denoting the ranks.

Note: If one or more of the ranks are repeated within a variable, then the following formula is
suggested:
$$r(X, Y) = 1 - \frac{6\left[\sum_{i=1}^{n} d_i^2 + \frac{1}{12}\sum_x m_x(m_x^2 - 1) + \frac{1}{12}\sum_y m_y(m_y^2 - 1)\right]}{n(n^2 - 1)}$$

where $m_x$ is the number of times a value is repeated in variable X and $m_y$ is the number of
times a value is repeated in variable Y.

1. The rankings of ten students in two subjects A and B are as follows:

A 3 5 8 4 7 10 2 1 6 9
B 6 4 9 8 1 2 3 10 5 7

Find the rank correlation coefficient.

Solution:
A (x_i)   B (y_i)   d_i = x_i - y_i   d_i^2
3         6              -3              9
5         4               1              1
8         9              -1              1
4         8              -4             16
7         1               6             36
10        2               8             64
2         3              -1              1
1         10             -9             81
6         5               1              1
9         7               2              4
Total:                    0            214

The rank correlation coefficient is
$$r(X, Y) = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} = 1 - \frac{6(214)}{10(10^2 - 1)} = -0.297$$
2. The marks secured by recruits in the selection test X and in the proficiency test Y
are given below:

Serial No. 1 2 3 4 5 6 7 8 9
X 10 15 12 17 13 16 24 14 22
Y 30 42 45 46 33 34 40 35 39

Calculate the rank correlation coefficient.

Solution:

X     Y     Rank in X (x_i)   Rank in Y (y_i)   d_i = x_i - y_i   d_i^2
10    30         9                 9                  0              0
15    42         5                 3                  2              4
12    45         8                 2                  6             36
17    46         3                 1                  2              4
13    33         7                 8                 -1              1
16    34         4                 7                 -3              9
24    40         1                 4                 -3              9
14    35         6                 6                  0              0
22    39         2                 5                 -3              9
Total:                                                               72

The rank correlation coefficient is
$$r(X, Y) = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} = 1 - \frac{6(72)}{9(9^2 - 1)} = 0.4$$

3. Ten competitors in a beauty contest are ranked by three judges as follows:

Competitors
1 2 3 4 5 6 7 8 9 10
X 6 5 3 10 2 4 9 7 8 1
Judges Y 5 8 4 7 10 2 1 6 9 3
Z 4 9 8 1 2 3 10 5 7 6

Discuss which pair of judges has the nearest approach to common tastes of beauty.

Solution:
X    Y    Z    d1 = x - y   d1^2   d2 = x - z   d2^2   d3 = y - z   d3^2
6    5    4        1          1        2          4        1          1
5    8    9       -3          9       -4         16       -1          1
3    4    8       -1          1       -5         25       -4         16
10   7    1        3          9        9         81        6         36
2    10   2       -8         64        0          0        8         64
4    2    3        2          4        1          1       -1          1
9    1    10       8         64       -1          1       -9         81
7    6    5        1          1        2          4        1          1
8    9    7       -1          1        1          1        2          4
1    3    6       -2          4       -5         25       -3          9
Total:                      158                 158                 214

With n = 10 competitors,
$$r(X, Y) = 1 - \frac{6\sum d_1^2}{n(n^2 - 1)} = 1 - \frac{6(158)}{10(10^2 - 1)} = 0.042$$
$$r(X, Z) = 1 - \frac{6\sum d_2^2}{n(n^2 - 1)} = 1 - \frac{6(158)}{10(10^2 - 1)} = 0.042$$
$$r(Y, Z) = 1 - \frac{6\sum d_3^2}{n(n^2 - 1)} = 1 - \frac{6(214)}{10(10^2 - 1)} = -0.297$$
Since $r(X, Y) = r(X, Z)$ is the largest value, the pairs (X, Y) and (X, Z) have the nearest
approach to common tastes of beauty.

4. The following table gives the number of units rejected by two operators X and Y in 8
inspections:
X 15 20 28 12 40 60 20 80
Y 40 30 50 30 20 10 30 60
Obtain rank correlation coefficient between X and Y with respect to the quality of
the product.
Solution:
X     Y     Rank in X (x_i)   Rank in Y (y_i)   d_i = x_i - y_i   d_i^2
15    40         2                 6                 -4             16
20    30         3.5               4                 -0.5            0.25
28    50         5                 7                 -2              4
12    30         1                 4                 -3              9
40    20         6                 2                  4             16
60    10         7                 1                  6             36
20    30         3.5               4                 -0.5            0.25
80    60         8                 8                  0              0
Total:                                                              81.5

In the X series, 20 is repeated twice: correction factor $= \frac{m(m^2 - 1)}{12} = \frac{2(2^2 - 1)}{12} = \frac{1}{2}$

In the Y series, 30 is repeated thrice: correction factor $= \frac{m(m^2 - 1)}{12} = \frac{3(3^2 - 1)}{12} = 2$

Therefore,
$$r(X, Y) = 1 - \frac{6\left[81.5 + \frac{1}{2} + 2\right]}{8(8^2 - 1)} = 1 - \frac{504}{504} = 0$$
Here, since the rank correlation coefficient is 0, we conclude that there is no relationship
between the product quality recorded by operator X and that recorded by operator Y.

Exercise

1. Find the correlation coefficient for the following data:

X 10 14 18 22 26 30
Y 18 12 24 6 30 36
Solution: r = 0.6

2. The age in years of 14 young couples is given below:

X 21 25 26 24 22 30 19 24 28 32 31 29 21 18
Y 19 20 24 21 21 24 18 22 19 30 27 26 19 18

To determine the extent of the relationship between the ages of husbands (X) and wives (Y),
calculate the coefficient of correlation.


Solution: r = 0.85

3. Let the joint p.d.f. of X and Y be given by
$$f(x, y) = \begin{cases} e^{-x}, & 0 < y < x < \infty \\ 0, & \text{otherwise} \end{cases}$$
Find the correlation coefficient between X and Y.
Solution: $r(X, Y) = \frac{1}{\sqrt{2}}$

4. Let the random variables X and Y have the joint probability density function
$$f(x, y) = \begin{cases} x + y, & 0 \le x \le 1,\ 0 \le y \le 1 \\ 0, & \text{otherwise} \end{cases}$$
Compute the correlation coefficient between X and Y.
Solution: $r(X, Y) = -\frac{1}{11}$

5. In a marketing survey, the prices of tea and coffee in a town, based on quality, were
found to be as shown below. Is there any relation between the prices of tea and coffee?

Price of tea 88 90 95 70 60 75 50
Price of coffee 120 134 150 115 110 140 100
Solution: r = 0.8929. The relation between price of tea and coffee is positive.

6. Find the rank correlation for tied observations. Following are the marks obtained by
10 students in a class in two tests.

Students A B C D E F G H I J
Test 1 70 68 67 55 60 60 75 63 60 72
Test 2 65 65 80 60 68 58 75 63 60 70
Solution: r = 0.68.

Regression
Regression is a mathematical measure of the average relationship between two or more variables
in terms of the original units of the data.
Regression Equations
A regression line can be represented by an algebraic expression which gives the relationship
between the two variables. There are two regression equations:
1. The equation which gives the best mean values of X corresponding to given values of
Y, that is, the regression equation of X on Y, is $X - \bar{X} = r\,\frac{\sigma_x}{\sigma_y}(Y - \bar{Y})$.
2. The equation which gives the best mean values of Y corresponding to given values of
X, that is, the regression equation of Y on X, is $Y - \bar{Y} = r\,\frac{\sigma_y}{\sigma_x}(X - \bar{X})$,
where $\bar{X}$ and $\bar{Y}$ are the means of X and Y; $\sigma_x$ and $\sigma_y$ are the standard deviations of
X and Y; and r is the coefficient of correlation.

Regression Coefficients
1. Regression coefficient of Y on X: $r\,\frac{\sigma_y}{\sigma_x} = b_{yx}$,
where $b_{yx} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2}$

2. Regression coefficient of X on Y: $r\,\frac{\sigma_x}{\sigma_y} = b_{xy}$,
where $b_{xy} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (Y - \bar{Y})^2}$

3. Relationship between the correlation coefficient and the regression coefficients:
$r = \pm\sqrt{b_{xy} \times b_{yx}}$

Angle between two lines of regression

If the equations of the lines of regression of Y on X and X on Y are
$Y - \bar{Y} = r\,\frac{\sigma_y}{\sigma_x}(X - \bar{X})$ and $X - \bar{X} = r\,\frac{\sigma_x}{\sigma_y}(Y - \bar{Y})$,
then the angle $\theta$ between the two lines of regression is given by
$$\tan\theta = \frac{1 - r^2}{r}\left(\frac{\sigma_x \sigma_y}{\sigma_x^2 + \sigma_y^2}\right)$$

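The angle formula can be checked against the slopes of the two lines in the X-Y plane, which are r·σy/σx (Y on X) and σy/(r·σx) (X on Y). The sketch below is illustrative: the function names and the sample values r = 0.6, σx = 2, σy = 3 are hypothetical, and |r| is used so that the acute angle is returned.

```python
import math

def angle_from_formula(r, sd_x, sd_y):
    """Acute angle from tan(theta) = ((1 - r^2)/|r|) * sd_x*sd_y / (sd_x^2 + sd_y^2)."""
    tan_theta = (1 - r * r) / abs(r) * (sd_x * sd_y) / (sd_x ** 2 + sd_y ** 2)
    return math.atan(tan_theta)

def angle_from_slopes(r, sd_x, sd_y):
    """Acute angle between the two regression lines, computed from their slopes."""
    slope_y_on_x = r * sd_y / sd_x
    slope_x_on_y = sd_y / (r * sd_x)  # the X-on-Y line rewritten as y against x
    tan_theta = abs((slope_x_on_y - slope_y_on_x) / (1 + slope_x_on_y * slope_y_on_x))
    return math.atan(tan_theta)

theta_formula = angle_from_formula(0.6, 2.0, 3.0)  # hypothetical inputs
theta_slopes = angle_from_slopes(0.6, 2.0, 3.0)
print(math.degrees(theta_formula), math.degrees(theta_slopes))
```

When r = ±1 the formula gives tan θ = 0, so the two regression lines coincide.
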
1. A departmental store gives in-service training to salesmen followed by a test. It is


experienced that the performance regarding sales of any salesman is linearly related to
the scores secured by him. The following data give test scores and sales made by nine
salesmen during fixed period.

Test Scores X 16 22 28 24 29 25 16 23 24
Sales (’00 Rs) Y 35 42 57 40 54 51 34 47 45
The sales Y of any salesman are considered to depend on his ability, as judged by his
test score X.

Solution: The regression line of Y on X can be fitted to the data in the following
manner.
$$\bar{X} = \frac{\sum X}{n} = \frac{207}{9} = 23, \quad \bar{Y} = \frac{\sum Y}{n} = \frac{405}{9} = 45$$

X    Y    X - 23   Y - 45   (X - 23)^2   (Y - 45)^2   (X - 23)(Y - 45)

16 35 -7 -10 49 100 70
22 42 -1 -3 1 9 3
28 57 5 12 25 144 60
24 40 1 -5 1 25 -5
29 54 6 9 36 81 54
25 51 2 6 4 36 12
16 34 -7 -11 49 121 77
23 47 0 2 0 4 0
24 45 1 0 1 0 0
Total:
207 405 0 0 166 520 271

The coefficient of regression of Y on X is
$$b_{yx} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2} = \frac{271}{166} = 1.63$$

Hence, the regression equation of Y on X, $Y - \bar{Y} = b_{yx}(X - \bar{X})$, becomes
$$Y - 45 = 1.63(X - 23)$$
$$\Rightarrow Y = 7.51 + 1.63X$$

2. Calculate the coefficient of correlation from the following data:

X 1 2 3 4 5 6 7 8 9
Y 9 8 10 12 11 13 14 16 15
(i) Obtain the regression equations and the coefficient of correlation.
(ii) Determine an estimate of Y which should correspond on the average to X  6.2 .

Solution: The regression lines can be fitted to the data in the following
manner.
$$\bar{X} = \frac{\sum X}{n} = \frac{45}{9} = 5, \quad \bar{Y} = \frac{\sum Y}{n} = \frac{108}{9} = 12$$

X    Y    X - 5   Y - 12   (X - 5)^2   (Y - 12)^2   (X - 5)(Y - 12)

1 9 -4 -3 16 9 12
2 8 -3 -4 9 16 12
3 10 -2 -2 4 4 4
4 12 -1 0 1 0 0
5 11 0 -1 0 1 0
6 13 1 1 1 1 1
7 14 2 2 4 4 4
8 16 3 4 9 16 12
9 15 4 3 16 9 12
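Total:
45 108 0 0 60 60 57

(i) The coefficient of regression of X on Y is
$$r\,\frac{\sigma_x}{\sigma_y} = b_{xy} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (Y - \bar{Y})^2} = \frac{57}{60} = 0.95$$
Hence, the regression equation of X on Y, $X - \bar{X} = b_{xy}(Y - \bar{Y})$, becomes
$$X - 5 = 0.95(Y - 12)$$
$$\Rightarrow X = 0.95Y - 6.4$$

The coefficient of regression of Y on X is
$$r\,\frac{\sigma_y}{\sigma_x} = b_{yx} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2} = \frac{57}{60} = 0.95$$
Hence, the regression equation of Y on X, $Y - \bar{Y} = b_{yx}(X - \bar{X})$, becomes
$$Y - 12 = 0.95(X - 5)$$
$$\Rightarrow Y = 0.95X + 7.25$$

Coefficient of correlation: $r = \pm\sqrt{b_{xy} \times b_{yx}} = \sqrt{0.95 \times 0.95} = 0.95$
(positive, since both regression coefficients are positive).

(ii) The estimate of Y corresponding to $X = 6.2$ is $Y = 0.95(6.2) + 7.25 = 13.14 \approx 13$.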

3. A statistical investigator obtains the following regression equations in a survey:
$X - Y - 6 = 0$ and $0.64X - 4.08 = 0$. Here X = age of husband and Y = age of wife.
Find the means of X and Y.

Solution: Since both lines of regression pass through $(\bar{X}, \bar{Y})$, we get
$$\bar{X} - \bar{Y} - 6 = 0 \quad \ldots (1)$$
$$0.64\bar{X} - 4.08 = 0 \quad \ldots (2)$$
From (2), $\bar{X} = \frac{4.08}{0.64} = 6.375$
From (1), $\bar{X} - \bar{Y} = 6$
$\Rightarrow 6.375 - \bar{Y} = 6$
$\Rightarrow \bar{Y} = 0.375$

Therefore, the means are $\bar{X} = 6.375$, $\bar{Y} = 0.375$.


Multiple Regression

If the number of independent variables in a regression model is more than one, then the
model is called a multiple regression model. Many real-world applications demand the
use of multiple regression models.
A sample application is as stated below:

$$Y = b_0 + b_1 X_1 + b_2 X_2 + b_3 X_3 + b_4 X_4$$

where Y represents the economic growth rate of a country, $X_1$ represents the time period, $X_2$
represents the size of the population of the country, $X_3$ represents the level of employment
in percentage, $X_4$ represents the percentage of literacy, $b_0$ is the intercept and $b_1, b_2, b_3$ and $b_4$
are the slopes of the variables $X_1, X_2, X_3$ and $X_4$ respectively. In this regression model,
$X_1, X_2, X_3$ and $X_4$ are the independent variables and Y is the dependent variable.

Regression Model with Two Independent Variables using Normal Equations:

Suppose the number of independent variables is two; then $Y = b_0 + b_1 X_1 + b_2 X_2$.

The normal equations are
$$\sum Y = n b_0 + b_1 \sum X_1 + b_2 \sum X_2$$
$$\sum Y X_1 = b_0 \sum X_1 + b_1 \sum X_1^2 + b_2 \sum X_1 X_2$$
$$\sum Y X_2 = b_0 \sum X_2 + b_1 \sum X_1 X_2 + b_2 \sum X_2^2$$

where n is the number of observations. The solution of the above set of
simultaneous equations gives the coefficients $b_0$, $b_1$ and $b_2$ of the regression
model.

Example 1: The annual sales revenue (in crores of rupees) of a product as a function of sales
force (number of salesmen) and annual advertising expenditure (in lakhs of rupees) for the past
10 years are summarized in the following table.

Annual sales revenue Y                20  23  25  27  21  29  22  24  27  35
Sales force X1                         8  13   8  18  23  16  10  12  14  20
Annual advertising expenditure X2     28  23  38  16  20  28  23  30  26  32

Solution: Let the regression model be Y  b0  b1 X1  b2 X 2

where Y is the annual sales revenue; X1 is the sales force; X 2 is the annual advertising
expenditures.
Y      X1     X2     X1^2    X2^2    X1X2    YX1     YX2
20     8      28     64      784     224     160     560
23     13     23     169     529     299     299     529
25     8      38     64      1444    304     200     950
27     18     16     324     256     288     486     432
21     23     20     529     400     460     483     420
29     16     28     256     784     448     464     812
22     10     23     100     529     230     220     506
24     12     30     144     900     360     288     720
27     14     26     196     676     364     378     702
35     20     32     400     1024    640     700     1120
Total:
253    142    264    2246    7326    3617    3678    6751

Substituting the required values in the normal equations, we get the following
simultaneous equations:

$$10 b_0 + 142 b_1 + 264 b_2 = 253$$
$$142 b_0 + 2246 b_1 + 3617 b_2 = 3678$$
$$264 b_0 + 3617 b_1 + 7326 b_2 = 6751$$

The solution of the above set of simultaneous equations is $b_0 = 5.1483$, $b_1 = 0.6190$ and
$b_2 = 0.4304$.

Therefore, the regression model is $Y = 5.1483 + 0.6190 X_1 + 0.4304 X_2$.
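
The normal equations above form a 3 x 3 linear system, so they can be built from the raw data and solved mechanically. The following is a minimal sketch in plain Python using Cramer's rule; the function names `det3` and `fit_two_predictors` are illustrative, and any linear solver could be substituted.

```python
def det3(m):
    """Determinant of a 3 x 3 matrix given as a list of rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def fit_two_predictors(ys, x1s, x2s):
    """Solve the normal equations for Y = b0 + b1*X1 + b2*X2 by Cramer's rule."""
    n = len(ys)
    s1, s2, sy = sum(x1s), sum(x2s), sum(ys)
    s11 = sum(a * a for a in x1s)
    s22 = sum(b * b for b in x2s)
    s12 = sum(a * b for a, b in zip(x1s, x2s))
    s1y = sum(a * y for a, y in zip(x1s, ys))
    s2y = sum(b * y for b, y in zip(x2s, ys))
    matrix = [[n, s1, s2], [s1, s11, s12], [s2, s12, s22]]
    rhs = [sy, s1y, s2y]
    d = det3(matrix)
    coeffs = []
    for col in range(3):
        replaced = [row[:] for row in matrix]
        for row_index in range(3):
            replaced[row_index][col] = rhs[row_index]  # swap in the right-hand side
        coeffs.append(det3(replaced) / d)
    return tuple(coeffs)

revenue = [20, 23, 25, 27, 21, 29, 22, 24, 27, 35]
sales_force = [8, 13, 8, 18, 23, 16, 10, 12, 14, 20]
advertising = [28, 23, 38, 16, 20, 28, 23, 30, 26, 32]
b0, b1, b2 = fit_two_predictors(revenue, sales_force, advertising)
print(round(b0, 4), round(b1, 4), round(b2, 4))
```

Cramer's rule is convenient for a system this small; for more predictors, Gaussian elimination or a library least-squares routine would be the usual choice.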

Exercise:

1. Following table gives the data on rainfall and discharge in a certain river. Obtain the line
of regression of Y on X .
Rainfall (inches) X 1.53 1.78 2.60 2.95 3.42
Discharge (1000 c.c) Y 33.5 36.3 40.0 45.8 53.5

Solution: Y  9.7992 X  17.714


2. From the following data find (i) two regression equations (ii) the coefficient of correlation
between the marks in Economics and Statistics (iii) the most likely marks in Statistics when
marks in Economics are 30.

Marks in Economics 25 28 35 32 31 36 29 38 34 32
Marks in Statistics 43 46 49 41 36 32 31 30 33 39

Solution: (i) $x = -0.2337y + 40.8806$, $y = -0.6643x + 59.2576$

(ii) $r = -0.394$ (iii) $\approx 39$

3. In a partially destroyed record of an analysis of correlation data, the following results are
legible. The two lines of regression are $8X - 10Y + 66 = 0$ and $40X - 18Y - 214 = 0$. Find the
mean values of X and Y.

Solution: $\bar{X} = 13$, $\bar{Y} = 17$.

1 3 2
4. If r12  ; r23  ; r31  then find the value of R1.23 .
2 4 3
Solution: 0.5

5. For a trivariate distribution, the following correlation coefficients were obtained: $r_{12} = 0.77$,
$r_{13} = 0.72$ and $r_{23} = 0.52$. Find the partial correlation coefficient $r_{12.3}$ and the multiple
correlation coefficient $R_{1.23}$.
Solution: $r_{12.3} = 0.6673$, $R_{1.23} = 0.8561$

6. The following are data on the number of twists required to break a certain kind of forged
alloy bar and the percentage of two alloying elements present in the metal.

No. of twists (Y)             41  49  69  65  40  50  58  57  31  36  44  57  19  31  33  43
Percent of element A (X1)      1   2   3   4   1   2   3   4   1   2   3   4   1   2   3   4
Percent of element B (X2)      5   5   5   5  10  10  10  10  15  15  15  15  20  20  20  20

Fit a least squares regression model.

Solution: $Y = 46.4 + 7.78 X_1 - 1.65 X_2$


Unit-8
REGRESSION ANALYSIS
INTRODUCTION
So far we have studied correlation analysis, which measures the direction and strength of the relationship
between two variables. Once a correlation between two variables has been established, one may be interested
in estimating the value of one variable with the help of the value of the other. The statistical method by
which we estimate or predict the unknown value of one variable from the known value
of another variable is called Regression.
Regression follows correlation: once the relationship between the two variables is
established, regression analysis proceeds with the estimation of probable values.
Sir Francis Galton, a British biometrician, introduced the concept of regression for the first time in
1877 while studying the correlation between the heights of sons and their fathers. He concluded in his studies,
“Tall fathers tend to have tall sons and short fathers short sons. The average height of the sons of a group of tall
fathers is less than that of the fathers, while the average height of the sons of a group of short fathers is greater
than that of the fathers.”
It means that the coming generations of tall or short parents tend to step back to the average height of the population.
Nowadays statisticians prefer to use the term Regression in the sense of estimation, which is an
important statistical tool in economics and business.

Meaning
Regression means returning or stepping back to the average value. In statistics, the term
Regression simply means the average relationship. We can predict or estimate the value of the dependent variable
from the given related values of the independent variable with the help of a regression technique.
Regression studies the nature of the relationship in order to estimate the most probable values. It
establishes a functional relationship between the independent and dependent variables.

Definition
According to Blair, “Regression is the measure of the average relationship between two or more variables
in terms of the original units of the data.”
According to Taro Yamane, “One of the most frequently used techniques in economics and business
research, to find a relation between two or more variables that are related causally, is regression analysis.”
According to Wallis and Roberts, “It is often more important to find out what the relation actually is, in
order to estimate or predict one variable, and the statistical technique appropriate in such a case is called regression
analysis.”

USES OF REGRESSION ANALYSIS


Regression analysis is of great practical use, even more than correlation analysis. The following are
some of its uses:
1. Regression analysis helps in establishing a functional relationship between two or more
variables. Once this is established, it can be used for various advanced analytic purposes.
2. With the use of electronic machines and computers, the tedium of calculating regression equations,
particularly those expressing multiple and non-linear relationships, has been reduced a great deal.
3. Since most of the problems of economic analysis are based on cause and effect relationships,
regression analysis is a highly valuable tool in economic and business research.
4. Regression analysis is very useful for prediction purposes. Once a functional relationship is
known, the value of the dependent variable can be predicted from the given value of the
independent variable.

CORRELATION AND REGRESSION

These two techniques are directed towards a common purpose of establishing the degree and the direction
of relationship between two or more variables, but the methods of doing so are different. The choice of one or the
other will depend on the purpose. In spite of certain similarities between the two, there are some basic
differences in the two approaches, which are summarized below:

CORRELATION
1. Correlation literally means related or sympathetic movements between variables.
2. There is a sort of interdependence, which is mutual.
3. There is no cause and effect relationship. It only shows the existence of some association in
the movement of variables.
4. It may be spurious correlation if the sympathetic movement is on account of the
influence of an outside variable which has no relevance.
5. It is a relative measure showing association between variables.
6. It is used only for testing and verification of the relationship. It tenders only limited
information.
7. It is not very useful for further mathematical treatment.

REGRESSION
1. Regression literally means return to the normal, which is true on account of the average of
relationship.
2. It establishes a functional relationship, which is mathematical, showing dependence of one
variable on the other.
3. It may have a cause and effect relationship.
4. It is a mathematical relationship, which should be interpreted suitably.
5. It is an absolute measure of relationship.
6. Besides verification it can also be used for estimation and prediction. It tenders more
comprehensive information.
7. It is very useful for further mathematical treatment.

METHODS OF REGRESSION ANALYSIS
There are two methods:
1. Graphic methods (Not included in the syllabus)
2. Algebraic method.
The algebraic methods for simple linear regression can be broadly divided into the following:
A. Regression lines.
B. Regression Equations.
C. Regression coefficient.

A. REGRESSION LINES:
In graphical terms, a regression line is a straight line fitted to the data by the method of least squares.
It indicates the most probable mean value of one variable corresponding to a given value of the other. Since a
regression line is the line of best fit for estimating one particular variable, it cannot be used conversely.
Therefore there are always two regression lines for the relationship between two variables x and y: one
regression line shows the regression of x upon y and the other shows the regression of y upon x.
When two variables are related, we can draw a regression line. The regression line of x on y
gives the most probable values of x for any given value of y. In the same manner, the regression line of y on x
gives the most probable values of y for any given value of x. Thus there will be two regression lines in the case
of two variables.
REGRESSION EQUATIONS
A regression equation is the algebraic expression of a regression line. The algebraic method
comprises regression equations and regression coefficients.
As there are two regression lines, there are two regression equations. For the two variables x and y,
they are the regression equation of x on y and the regression equation of y on x.
I. Regression equation of X on Y

$(X - \bar{X}) = r \frac{\sigma_X}{\sigma_Y} (Y - \bar{Y})$

II. Regression equation of Y on X

$(Y - \bar{Y}) = r \frac{\sigma_Y}{\sigma_X} (X - \bar{X})$
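
When only the means, standard deviations and the coefficient of correlation are given, both equations reduce to a slope and an intercept. The following is a minimal sketch of that calculation; the helper names are illustrative, and the figures are those of Illustration 01 (yield Y and rainfall X).

```python
def line_y_on_x(mean_x, mean_y, sd_x, sd_y, r):
    """Slope and intercept of Y - mean_y = r*(sd_y/sd_x)*(X - mean_x)."""
    slope = r * sd_y / sd_x
    return slope, mean_y - slope * mean_x

def line_x_on_y(mean_x, mean_y, sd_x, sd_y, r):
    """Slope and intercept of X - mean_x = r*(sd_x/sd_y)*(Y - mean_y)."""
    slope = r * sd_x / sd_y
    return slope, mean_x - slope * mean_y

# Rainfall X: mean 26.7, S.D. 4.6; yield Y: mean 508.4, S.D. 36.8; r = 0.52
byx, a_yx = line_y_on_x(26.7, 508.4, 4.6, 36.8, 0.52)
bxy, a_xy = line_x_on_y(26.7, 508.4, 4.6, 36.8, 0.52)
yield_at_29 = byx * 29 + a_yx       # estimated yield for 29 cm rainfall
rainfall_at_600 = bxy * 600 + a_xy  # estimated rainfall for 600 kg yield
print(round(yield_at_29, 3), round(rainfall_at_600, 3))
```

Note that the two lines are used for different purposes: the Y on X line estimates Y from X, and the X on Y line estimates X from Y.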

Application of Regression Equations when all required values are given

ILLUSTRATION = 01
From the following results, obtain the two regression equations and estimate the yield of crops when the
rainfall is 29 cm and the rainfall when the yield is 600 kg.
              Y: Yield (in kg)    X: Rainfall (in cm)
Mean          508.4               26.7
S.D.          36.8                4.6
Coefficient of correlation between yield and rainfall = 0.52
Solution:
To estimate the yield of crops, we have to use the Y on X regression equation.

$(Y - \bar{Y}) = r \frac{\sigma_Y}{\sigma_X} (X - \bar{X})$
$Y - 508.4 = 0.52 \times \frac{36.8}{4.6} (X - 26.7)$
$Y - 508.4 = 4.16 (X - 26.7)$
$Y - 508.4 = 4.16X - 111.072$
$Y = 4.16X - 111.072 + 508.4$
$Y = 4.16X + 397.328$ (regression line)
When $X = 29$:
$Y = 4.16 \times 29 + 397.328$
$= 120.64 + 397.328$
$= 517.968$ kg

Similarly, to estimate rainfall, we have to use the X on Y regression equation.

$(X - \bar{X}) = r \frac{\sigma_X}{\sigma_Y} (Y - \bar{Y})$
$X - 26.7 = 0.52 \times \frac{4.6}{36.8} (Y - 508.4)$
$X - 26.7 = 0.065 (Y - 508.4)$
$X - 26.7 = 0.065Y - 33.046$
$X = 0.065Y - 33.046 + 26.7$
$X = 0.065Y - 6.346$ (regression line)
When $Y = 600$ kg:
$X = 0.065 \times 600 - 6.346$
$= 39 - 6.346$
$X = 32.654$ cm

ILLUSTRATION = 02
Find out the regression equation, showing the regression of capacity utilization on production from the
following data.
                                        Average    Standard Deviation
Production (in lakh units)              35.6       10.5
Capacity utilization (in percentage)    84.8       8.5
Coefficient of correlation = 0.62
Estimate the production when the capacity utilization is 70%.
SOLUTION: Let the production and capacity utilization be denoted by X and Y respectively. Then we are given:

$\bar{X} = 35.6$, $\bar{Y} = 84.8$, $\sigma_X = 10.5$, $\sigma_Y = 8.5$, $r = 0.62$

To estimate production we have to use the X on Y regression equation.

$(X - \bar{X}) = r \frac{\sigma_X}{\sigma_Y} (Y - \bar{Y})$
$(X - 35.6) = 0.62 \times \frac{10.5}{8.5} (Y - 84.8)$
$X - 35.6 = 0.7658 (Y - 84.8)$
$X - 35.6 = 0.7658Y - 64.94$
$X = 0.7658Y - 64.94 + 35.6$
$X = 0.7658Y - 29.34$ (regression line)
When $Y = 70\%$:
$X = 0.7658 \times 70 - 29.34$
$= 53.606 - 29.34$
$X = 24.266$ lakh units

ILLUSTRATION = 03

Karl Pearson’s coefficient of correlation between the ages of brothers and sisters in a community was
found to be 0.8.
The average of the brothers’ ages was 25 years and that of the sisters’ was 22 years. Their standard deviations were 4
and 5 respectively.
Find a. The expected age of brother when the sister’s age is 12 years.
     b. The expected age of sister when the brother’s age is 33 years.
Solution:
                      Brother (X)    Sister (Y)
Mean age              25 years       22 years
Standard deviation    4              5
Coefficient of correlation = 0.8

To estimate the brother’s age, we have to use the X on Y regression equation. X = ? when Y = 12.

$(X - \bar{X}) = r \frac{\sigma_X}{\sigma_Y} (Y - \bar{Y})$
$X - 25 = 0.8 \times \frac{4}{5} (Y - 22)$
$X - 25 = 0.64 (Y - 22)$
$X - 25 = 0.64Y - 14.08$
$X = 0.64Y - 14.08 + 25$
$X = 0.64Y + 10.92$ (regression line)
When $Y = 12$:
$X = 0.64 \times 12 + 10.92$
$X = 18.6$ years, brother’s age

To estimate the sister’s age, we have to use the Y on X regression equation. Y = ? when X = 33 years.

$(Y - \bar{Y}) = r \frac{\sigma_Y}{\sigma_X} (X - \bar{X})$
$(Y - 22) = 0.8 \times \frac{5}{4} (X - 25)$
$Y - 22 = 1.0 (X - 25)$
$Y - 22 = X - 25$
$Y = X - 25 + 22$
$Y = X - 3$ (regression line)
When $X = 33$:
$Y = 33 - 3$
$Y = 30$ years, sister’s age

ILLUSTRATION = 04
Given the following data, estimate
1. The value of Y when X = 70
2. The value of X when Y = 90
                              X-Series    Y-Series
Mean                          18          100
Standard deviation            14          20
Coefficient of correlation = 0.8
SOLUTION
I. Y = ? when X = 70. Use the Y on X regression equation.

$(Y - \bar{Y}) = r \frac{\sigma_Y}{\sigma_X} (X - \bar{X})$
$Y - 100 = 0.8 \times \frac{20}{14} (X - 18)$
$Y - 100 = 1.143 (X - 18)$
$Y - 100 = 1.143X - 20.574$
$Y = 1.143X - 20.574 + 100$
$Y = 1.143X + 79.426$ (regression line)
When $X = 70$:
$Y = 1.143 \times 70 + 79.426$
$Y = 80.01 + 79.426$
$Y = 159.436$

II. X = ? when Y = 90. Use the X on Y regression equation.

$(X - \bar{X}) = r \frac{\sigma_X}{\sigma_Y} (Y - \bar{Y})$
$X - 18 = 0.8 \times \frac{14}{20} (Y - 100)$
$X - 18 = 0.56 (Y - 100)$
$X - 18 = 0.56Y - 56$
$X = 0.56Y - 56 + 18$
$X = 0.56Y - 38$ (regression line)
When $Y = 90$:
$X = 0.56 \times 90 - 38$
$= 50.4 - 38$
$X = 12.4$
ILLUSTRATION = 05
To study the relationship between expenditure on accommodation (X) and expenditure on food (Y), an
enquiry into 50 families gave the following result:

$\sum X = 8500$, $\sum Y = 9600$, $\sigma_X = 60$, $\sigma_Y = 20$, $r = 0.60$

Estimate the expenditure on food when expenditure on accommodation is Rs. 200.

SOLUTION
To estimate expenditure on food, we should use the Y on X regression equation.

$\bar{X} = \frac{\sum X}{n} = \frac{8500}{50} = 170$, $\bar{Y} = \frac{\sum Y}{n} = \frac{9600}{50} = 192$

$(Y - \bar{Y}) = r \frac{\sigma_Y}{\sigma_X} (X - \bar{X})$
$(Y - 192) = 0.6 \times \frac{20}{60} (X - 170)$
$Y - 192 = 0.2 (X - 170)$
$Y - 192 = 0.2X - 34$
$Y = 0.2X + 158$ (regression line)
When $X = 200$:
$Y = 0.2 \times 200 + 158$
$= 40 + 158$
$Y = $ Rs. 198
Rs. 198 is required to be spent on food.

ILLUSTRATION = 06

Obtain the two regression equations from the following:

                X-Series    Y-Series
Mean            20          25
Variance        4           9
Coefficient of correlation = 0.75
SOLUTION
Obtaining the two regression lines

$\sigma_X = \sqrt{\text{Variance}} = \sqrt{4} = 2$, $\sigma_Y = \sqrt{\text{Variance}} = \sqrt{9} = 3$

X on Y regression equation ($b_{xy}$ = regression coefficient of X on Y):
$b_{xy} = r \frac{\sigma_X}{\sigma_Y}$
$(X - \bar{X}) = b_{xy} (Y - \bar{Y})$
$X - 20 = 0.75 \times \frac{2}{3} (Y - 25)$
$X - 20 = 0.5 (Y - 25)$
$X - 20 = 0.5Y - 12.5$
$X = 0.5Y - 12.5 + 20$
$X = 0.5Y + 7.5$ (regression line)

Y on X regression equation ($b_{yx}$ = regression coefficient of Y on X):
$b_{yx} = r \frac{\sigma_Y}{\sigma_X}$
$(Y - \bar{Y}) = b_{yx} (X - \bar{X})$
$Y - 25 = 0.75 \times \frac{3}{2} (X - 20)$
$Y - 25 = 1.125 (X - 20)$
$Y - 25 = 1.125X - 22.5$
$Y = 1.125X - 22.5 + 25$
$Y = 1.125X + 2.5$ (regression line)

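ILLUSTRATION = 07
You are given the following data.

                X-Series    Y-Series
Mean            47          96
Variance        64          81

Coefficient of correlation = 0.36

Calculate Y when X is 50, and X when Y is 88.

SOLUTION

$\sigma_X = \sqrt{\text{Variance}} = \sqrt{64} = 8$, $\sigma_Y = \sqrt{\text{Variance}} = \sqrt{81} = 9$

X on Y regression equation:
$b_{xy} = r \frac{\sigma_X}{\sigma_Y}$
$(X - \bar{X}) = b_{xy} (Y - \bar{Y})$
$X - 47 = 0.36 \times \frac{8}{9} (Y - 96)$
$X - 47 = 0.32 (Y - 96)$
$X - 47 = 0.32Y - 30.72$
$X = 0.32Y - 30.72 + 47$
$X = 0.32Y + 16.28$ (regression line)
When $Y = 88$:
$X = 0.32 \times 88 + 16.28$
$X = 28.16 + 16.28$
$X = 44.44$

Y on X regression equation:
$b_{yx} = r \frac{\sigma_Y}{\sigma_X}$
$(Y - \bar{Y}) = b_{yx} (X - \bar{X})$
$Y - 96 = 0.36 \times \frac{9}{8} (X - 47)$
$Y - 96 = 0.405 (X - 47)$
$Y - 96 = 0.405X - 19.035$
$Y = 0.405X - 19.035 + 96$
$Y = 0.405X + 76.965$ (regression line)
When $X = 50$:
$Y = 0.405 \times 50 + 76.965$
$= 20.25 + 76.965$
$Y = 97.215$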

ILLUSTRATION = 08
The following results for heights and weights of 100 men were calculated.
            Mean       Standard Deviation
Weights     150 lbs    20 lbs
Heights     68″        2.5″
Coefficient of correlation = 0.60
Find an estimate of
1. The weight of a man whose height is 5′ (5′ = 60″)
2. The height of a man whose weight is 200 lbs

SOLUTION
Let X = weight and Y = height.

X on Y regression equation (to estimate weight):
$(X - \bar{X}) = b_{xy} (Y - \bar{Y})$
$(X - 150) = 0.6 \times \frac{20}{2.5} (Y - 68)$
$X - 150 = 4.8 (Y - 68)$
$X - 150 = 4.8Y - 326.4$
$X = 4.8Y - 326.4 + 150$
$X = 4.8Y - 176.4$ (regression line)
When $Y = 60″$:
$X = 4.8 \times 60 - 176.4$
$X = 111.6$ lbs

Y on X regression equation (to estimate height):
$(Y - \bar{Y}) = b_{yx} (X - \bar{X})$
$(Y - 68) = 0.6 \times \frac{2.5}{20} (X - 150)$
$Y - 68 = 0.075 (X - 150)$
$Y - 68 = 0.075X - 11.25$
$Y = 0.075X - 11.25 + 68$
$Y = 0.075X + 56.75$ (regression line)
When $X = 200$ lbs:
$Y = 0.075 \times 200 + 56.75$
$Y = 71.75″$

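REGRESSION COEFFICIENTS
The regression coefficient is denoted by ‘b’. There are two regression equations and therefore
there are two regression coefficients also. A regression coefficient measures the change in one series corresponding
to a unit change in the other series.
The regression coefficient of X on Y, i.e.

$b_{xy} = r \frac{\sigma_X}{\sigma_Y}$

gives us the value by which the X-variable changes for a unit change in the value of the Y-variable.

$\therefore b_{xy} = \frac{n \sum d_x d_y - (\sum d_x \times \sum d_y)}{n \sum d_y^2 - (\sum d_y)^2}$

Similarly, the regression coefficient of Y on X, i.e.

$b_{yx} = r \frac{\sigma_Y}{\sigma_X}$

refers to the value by which the Y-variable changes for a unit change in the X-variable.

$\therefore b_{yx} = \frac{n \sum d_x d_y - (\sum d_x \times \sum d_y)}{n \sum d_x^2 - (\sum d_x)^2}$

Here $d_x$ and $d_y$ are the deviations of X and Y from their assumed means, and n is the number of pairs.
These two coefficients measure the change in the dependent variable corresponding to a unit
change in the independent variable. They also help in direct calculation of the coefficient of correlation.
The square root of the product of the two regression coefficients gives us the value of the coefficient of
correlation, as under:

$b_{xy} \times b_{yx} = r \frac{\sigma_X}{\sigma_Y} \times r \frac{\sigma_Y}{\sigma_X}$

$b_{xy} \times b_{yx} = r^2$

$\therefore r = \pm\sqrt{b_{xy} \times b_{yx}}$, where r takes the sign common to the two regression coefficients.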
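
The assumed-mean formulas above translate directly into a few sums. The following is a minimal sketch (the function name `regression_coefficients` is illustrative) applied to the age and blood-pressure figures of Illustration 09, with assumed means 47 and 128.

```python
import math

def regression_coefficients(xs, ys, assumed_x, assumed_y):
    """Return (bxy, byx, r) using deviations taken from assumed means."""
    n = len(xs)
    dx = [x - assumed_x for x in xs]
    dy = [y - assumed_y for y in ys]
    sum_dx, sum_dy = sum(dx), sum(dy)
    sum_dxdy = sum(a * b for a, b in zip(dx, dy))
    sum_dx2 = sum(a * a for a in dx)
    sum_dy2 = sum(b * b for b in dy)
    numerator = n * sum_dxdy - sum_dx * sum_dy
    bxy = numerator / (n * sum_dy2 - sum_dy ** 2)
    byx = numerator / (n * sum_dx2 - sum_dx ** 2)
    # r carries the sign shared by both regression coefficients
    r = math.copysign(math.sqrt(bxy * byx), numerator)
    return bxy, byx, r

ages = [56, 42, 72, 36, 63, 47, 55, 49, 38, 42, 68, 60]
bp = [147, 125, 160, 118, 149, 128, 150, 145, 115, 140, 152, 155]
bxy, byx, r = regression_coefficients(ages, bp, 47, 128)
print(round(bxy, 4), round(byx, 3), round(r, 3))
```

Because the formulas are invariant to the choice of assumed mean, any convenient value may be passed for `assumed_x` and `assumed_y`.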

CALCULATION OF REGRESSION COEFFICIENTS AND MAKING ESTIMATION OF UNKNOWN VALUE

INDIVIDUAL SERIES: When actual data is given and deviations are taken from assumed mean

ILLUSTRATION = 09
From the data given below find out:
a. Regression coefficients
b. Regression equations
c. Estimate the age when B.P. is 130
d. Estimate the B.P. when age is 50 years
e. Find the coefficient of correlation through regression coefficients.

Age    56  42  72  36  63  47  55  49  38  42  68  60
B.P.  147 125 160 118 149 128 150 145 115 140 152 155

SOLUTION
Age X   dx = X-47   dx²    B.P. Y   dy = Y-128   dy²    dxdy
56        9          81    147       19          361    171
42       -5          25    125       -3            9     15
72       25         625    160       32         1024    800
36      -11         121    118      -10          100    110
63       16         256    149       21          441    336
47        0           0    128        0            0      0
55        8          64    150       22          484    176
49        2           4    145       17          289     34
38       -9          81    115      -13          169    117
42       -5          25    140       12          144    -60
68       21         441    152       24          576    504
60       13         169    155       27          729    351
N = 12  ∑dx = 64  ∑dx² = 1892       ∑dy = 148  ∑dy² = 4326  ∑dxdy = 2554

$\bar{X} = A + \frac{\sum d_x}{N} \times C = 47 + \frac{64}{12} \times 1 = 52.33$
$\bar{Y} = A + \frac{\sum d_y}{N} \times C = 128 + \frac{148}{12} \times 1 = 128 + 12.33 = 140.33$

Regression coefficient of X on Y:
$b_{xy} = r \frac{\sigma_X}{\sigma_Y} = \frac{n \sum d_x d_y - (\sum d_x \times \sum d_y)}{n \sum d_y^2 - (\sum d_y)^2}$
$= \frac{2554 \times 12 - 64 \times 148}{4326 \times 12 - (148)^2} = \frac{30648 - 9472}{51912 - 21904} = \frac{21176}{30008} = 0.7057$

Regression coefficient of Y on X:
$b_{yx} = r \frac{\sigma_Y}{\sigma_X} = \frac{n \sum d_x d_y - (\sum d_x \times \sum d_y)}{n \sum d_x^2 - (\sum d_x)^2}$
$= \frac{2554 \times 12 - 64 \times 148}{1892 \times 12 - (64)^2} = \frac{21176}{22704 - 4096} = \frac{21176}{18608} = 1.138$

X on Y regression equation:
$(X - \bar{X}) = b_{xy} (Y - \bar{Y})$
$(X - 52.33) = 0.7057 (Y - 140.33)$
$X - 52.33 = 0.7057Y - 99.031$
$X = 0.7057Y - 99.031 + 52.33$
$X = 0.7057Y - 46.701$
Estimation of age (X) when B.P. (Y) is 130:
$X = 0.7057 \times 130 - 46.701$
$= 91.741 - 46.701$
$X = 45.04$ years

Y on X regression equation:
$(Y - \bar{Y}) = b_{yx} (X - \bar{X})$
$Y - 140.33 = 1.138 (X - 52.33)$
$Y - 140.33 = 1.138X - 59.55$
$Y = 1.138X - 59.55 + 140.33$
$Y = 1.138X + 80.78$
Estimation of B.P. (Y) when age (X) is 50 years:
$Y = 1.138 \times 50 + 80.78$
$= 56.9 + 80.78$
$Y = 137.68$

Coefficient of correlation $= \sqrt{b_{xy} \times b_{yx}} = \sqrt{0.7057 \times 1.138}$

$r = 0.896$

ILLUSTRATION = 10
From the following data, obtain the two regression equations. Also calculate the coefficient of
correlation based on the regression coefficients.
Sales: X        91  97 108 121  67 124  51  73 111  57
Purchases: Y    71  75  69  97  70  91  39  61  80  47
SOLUTION

X     dx = X-67   dx²     Y    dy = Y-70   dy²    dxdy
91      24         576    71      1          1      24
97      30         900    75      5         25     150
108     41        1681    69     -1          1     -41
121     54        2916    97     27        729    1458
67       0           0    70      0          0       0
124     57        3249    91     21        441    1197
51     -16         256    39    -31        961     496
73       6          36    61     -9         81     -54
111     44        1936    80     10        100     440
57     -10         100    47    -23        529     230
      ∑dx = 230  ∑dx² = 11650  ∑dy = 0  ∑dy² = 2868  ∑dxdy = 3900

$\bar{X} = A + \frac{\sum d_x}{N} \times C = 67 + \frac{230}{10} \times 1 = 90$
$\bar{Y} = A + \frac{\sum d_y}{N} \times C = 70 + \frac{0}{10} \times 1 = 70$

X on Y regression coefficient:
$b_{xy} = r \frac{\sigma_X}{\sigma_Y} = \frac{n \sum d_x d_y - (\sum d_x \times \sum d_y)}{n \sum d_y^2 - (\sum d_y)^2}$
$= \frac{3900 \times 10 - (230 \times 0)}{2868 \times 10 - (0)^2} = \frac{39000}{28680} = 1.360$

Y on X regression coefficient:
$b_{yx} = r \frac{\sigma_Y}{\sigma_X} = \frac{n \sum d_x d_y - (\sum d_x \times \sum d_y)}{n \sum d_x^2 - (\sum d_x)^2}$
$= \frac{3900 \times 10 - (230 \times 0)}{11650 \times 10 - (230)^2} = \frac{39000}{116500 - 52900} = \frac{39000}{63600} = 0.613$

X on Y regression equation:
$(X - \bar{X}) = b_{xy} (Y - \bar{Y})$
$X - 90 = 1.360 (Y - 70)$
$X - 90 = 1.360Y - 95.2$
$X = 1.360Y - 95.2 + 90$
$X = 1.360Y - 5.2$ (regression line)

Y on X regression equation:
$(Y - \bar{Y}) = b_{yx} (X - \bar{X})$
$(Y - 70) = 0.613 (X - 90)$
$Y - 70 = 0.613X - 55.17$
$Y = 0.613X - 55.17 + 70$
$Y = 0.613X + 14.83$ (regression line)

Coefficient of correlation $= \sqrt{b_{xy} \times b_{yx}}$
$= \sqrt{1.360 \times 0.613} = 0.913$

ILLUSTRATION = 11
The following data relate to the ages of husbands and wives. Obtain the two regression
equations and estimate the most likely age of husband when the age of wife is 25 years.
Ages of husbands   25 28 30 32 35 36 38 39 42 55
Ages of wives      20 26 29 30 25 18 26 35 35 46

SOLUTION
X     dx = X-36   dx²    Y    dy = Y-29   dy²    dxdy
25     -11        121    20     -9         81     99
28      -8         64    26     -3          9     24
30      -6         36    29      0          0      0
32      -4         16    30      1          1     -4
35      -1          1    25     -4         16      4
36       0          0    18    -11        121      0
38       2          4    26     -3          9     -6
39       3          9    35      6         36     18
42       6         36    35      6         36     36
55      19        361    46     17        289    323
N = 10  ∑dx = 0  ∑dx² = 648   ∑dy = 0  ∑dy² = 598  ∑dxdy = 494

$\bar{X} = A + \frac{\sum d_x}{N} \times C = 36 + \frac{0}{10} \times 1 = 36$
$\bar{Y} = A + \frac{\sum d_y}{N} \times C = 29 + \frac{0}{10} \times 1 = 29$

Regression coefficient of X on Y:
$b_{xy} = r \frac{\sigma_X}{\sigma_Y} = \frac{n \sum d_x d_y - (\sum d_x \times \sum d_y)}{n \sum d_y^2 - (\sum d_y)^2} = \frac{494 \times 10 - 0 \times 0}{598 \times 10 - (0)^2} = \frac{4940}{5980} = 0.8261$

Regression coefficient of Y on X:
$b_{yx} = r \frac{\sigma_Y}{\sigma_X} = \frac{n \sum d_x d_y - (\sum d_x \times \sum d_y)}{n \sum d_x^2 - (\sum d_x)^2} = \frac{494 \times 10 - (0 \times 0)}{648 \times 10 - (0)^2} = \frac{4940}{6480} = 0.7623$

X on Y regression equation:
$X - \bar{X} = b_{xy} (Y - \bar{Y})$
$X - 36 = 0.8261 (Y - 29)$
$X - 36 = 0.8261Y - 23.9569$
$X = 0.8261Y - 23.9569 + 36$
$X = 0.8261Y + 12.0431$ (regression line)
If the wife’s age (Y) is 25:
$X = 0.8261 \times 25 + 12.0431$
$= 20.6525 + 12.0431$
$X = 32.6956$
The husband’s age is 32.6956 years.

Y on X regression equation:
$Y - \bar{Y} = b_{yx} (X - \bar{X})$
$Y - 29 = 0.7623 (X - 36)$
$Y - 29 = 0.7623X - 27.4428$
$Y = 0.7623X - 27.4428 + 29$
$Y = 0.7623X + 1.5572$ (regression line)

Coefficient of correlation:
$r = \sqrt{b_{xy} \times b_{yx}}$
$= \sqrt{0.8261 \times 0.7623}$
$r = 0.7935$

ILLUSTRATION = 12
A panel of two judges P and Q graded dramatic performances by independently awarding marks as
follows.
Performance       1   2   3   4   5   6   7
Marks by ‘P’     46  42  44  40  43  41  45
Marks by ‘Q’     40  38  36  35  39  37  41
The eighth performance, which judge Q could not attend, was awarded 37 marks by judge P. If
judge Q had also been present, how many marks could be expected to have been awarded by him to the eighth
performance?
SOLUTION
Let the marks awarded by judge P be represented by X and those awarded by judge Q be Y. We
have to find out the value of Y when X = 37. This can be done by finding out the regression equation of Y on X.
Computation of regression equation of Y on X
X     dx = X-43   dx²    Y    dy = Y-38   dy²    dxdy
46      3          9     40     2          4      6
42     -1          1     38     0          0      0
44      1          1     36    -2          4     -2
40     -3          9     35    -3          9      9
43      0          0     39     1          1      0
41     -2          4     37    -1          1      2
45      2          4     41     3          9      6
      ∑dx = 0  ∑dx² = 28     ∑dy = 0  ∑dy² = 28  ∑dxdy = 21

$\bar{X} = A + \frac{\sum d_x}{N} \times C = 43 + \frac{0}{7} \times 1 = 43$
$\bar{Y} = A + \frac{\sum d_y}{N} \times C = 38 + \frac{0}{7} \times 1 = 38$

Regression equation of Y on X:
$Y - \bar{Y} = b_{yx} (X - \bar{X})$
$Y - 38 = b_{yx} (X - 43)$
$\therefore b_{yx} = \frac{n \sum d_x d_y - (\sum d_x \times \sum d_y)}{n \sum d_x^2 - (\sum d_x)^2} = \frac{21 \times 7 - 0}{28 \times 7 - 0} = \frac{147}{196} = 0.75$
$Y - 38 = 0.75 (X - 43)$
$Y - 38 = 0.75X - 32.25$
$Y = 0.75X + 38 - 32.25$
$Y = 0.75X + 5.75$ (regression line)
When $X = 37$:
$Y = 0.75 \times 37 + 5.75 = 33.5$

If judge Q had been present, he would have awarded 33.5 marks.

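REGRESSION EQUATION IN A BIVARIATE GROUPED FREQUENCY DISTRIBUTION

The procedure is the same as we have followed in the case of individual series.
The modified formula is as under:
Regression coefficient of X on Y, i.e. $b_{xy} = r \frac{\sigma_X}{\sigma_Y}$:

$b_{xy} = \frac{N \sum f d_x d_y - (\sum f d_x \times \sum f d_y)}{N \sum f d_y^2 - (\sum f d_y)^2} \times \frac{c \text{ of } x}{c \text{ of } y}$

Regression coefficient of Y on X, i.e. $b_{yx} = r \frac{\sigma_Y}{\sigma_X}$:

$b_{yx} = \frac{N \sum f d_x d_y - (\sum f d_x \times \sum f d_y)}{N \sum f d_x^2 - (\sum f d_x)^2} \times \frac{c \text{ of } y}{c \text{ of } x}$

Here “c of x” and “c of y” are the class intervals of X and Y.

Coefficient of correlation $= \sqrt{b_{xy} \times b_{yx}}$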
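
The grouped formulas differ from the individual-series ones only in the frequency weights and the class-interval factor. The following is a minimal sketch under that reading; the function name and the argument layout (rows indexed by the Y classes, columns by the X classes) are assumptions made for illustration, and the frequencies are those of the husbands-and-wives table in the illustration that follows.

```python
def grouped_regression_coefficients(freq, dx, dy, width_x, width_y):
    """Return (bxy, byx) for a bivariate frequency table.

    freq[i][j] is the frequency of Y class i and X class j;
    dx[j] and dy[i] are the step deviations of the class mid-values.
    """
    n = sum(sum(row) for row in freq)
    sum_fdx = sum_fdy = sum_fdx2 = sum_fdy2 = sum_fdxdy = 0
    for i, row in enumerate(freq):
        for j, f in enumerate(row):
            sum_fdx += f * dx[j]
            sum_fdy += f * dy[i]
            sum_fdx2 += f * dx[j] ** 2
            sum_fdy2 += f * dy[i] ** 2
            sum_fdxdy += f * dx[j] * dy[i]
    numerator = n * sum_fdxdy - sum_fdx * sum_fdy
    bxy = numerator / (n * sum_fdy2 - sum_fdy ** 2) * width_x / width_y
    byx = numerator / (n * sum_fdx2 - sum_fdx ** 2) * width_y / width_x
    return bxy, byx

# Rows: wives 16-20, 20-24, 24-28; columns: husbands 20-25, 25-30, 30-35
freq = [[9, 14, 0], [6, 11, 3], [0, 0, 7]]
bxy, byx = grouped_regression_coefficients(freq, [-1, 0, 1], [-1, 0, 1], 5, 4)
print(round(bxy, 4), round(byx, 4))
```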

ILLUSTRATION = 13

The following table gives the ages of husbands and wives for 50 newly married couples. Find the two regression
lines. Also estimate A) the age of husband when wife is 20 and B) the age of wife when husband is 30.
Age of wives     Age of husbands
                 20-25   25-30   30-35   Total
16-20             9       14      -       23
20-24             6       11      3       20
24-28             -       -       7        7
Total            15       25     10       50
SOLUTION
Class interval for age of husband (X) is 5
Class interval for age of wife (Y) is 4
$d_x = \frac{X - 27.5}{5}$, $d_y = \frac{Y - 22}{4}$

(For X: A = 27.5, C = 5. For Y: A = 22, C = 4.)
X class               20-25   25-30   30-35
Mid-value             22.5    27.5    32.5
Y class  M.V.  dy\dx   -1       0       1      f     fdy   fdy²   fdxdy
16-20    18    -1       9      14       -     23    -23    23      9
20-24    22     0       6      11       3     20      0     0      0
24-28    26     1       -       -       7      7      7     7      7
Total f                15      25      10     50    -16    30     16
fdx                   -15       0      10     -5
fdx²                   15       0      10     25
fdxdy                   9       0       7     16

$N = 50$, $\sum f d_x = -5$, $\sum f d_x^2 = 25$, $\sum f d_y = -16$, $\sum f d_y^2 = 30$, $\sum f d_x d_y = 16$

$\bar{X} = A + \frac{\sum f d_x}{N} \times C = 27.5 + \frac{-5}{50} \times 5 = 27$
$\bar{Y} = A + \frac{\sum f d_y}{N} \times C = 22 + \frac{-16}{50} \times 4 = 22 - \frac{64}{50} = 22 - 1.28 = 20.72$

Regression coefficient of X on Y:
$b_{xy} = \frac{N \sum f d_x d_y - (\sum f d_x \times \sum f d_y)}{N \sum f d_y^2 - (\sum f d_y)^2} \times \frac{c \text{ of } x}{c \text{ of } y}$
$= \frac{16 \times 50 - (-5 \times -16)}{30 \times 50 - (-16)^2} \times \frac{5}{4} = \frac{800 - 80}{1500 - 256} \times \frac{5}{4} = \frac{720}{1244} \times \frac{5}{4} = \frac{3600}{4976} = 0.723$

Regression coefficient of Y on X:
$b_{yx} = \frac{N \sum f d_x d_y - (\sum f d_x \times \sum f d_y)}{N \sum f d_x^2 - (\sum f d_x)^2} \times \frac{c \text{ of } y}{c \text{ of } x}$
$= \frac{16 \times 50 - (-5 \times -16)}{25 \times 50 - (-5)^2} \times \frac{4}{5} = \frac{800 - 80}{1250 - 25} \times \frac{4}{5} = \frac{720}{1225} \times \frac{4}{5} = \frac{2880}{6125} = 0.47$

X on Y regression equation:
$(X - \bar{X}) = b_{xy} (Y - \bar{Y})$
$X - 27 = 0.723 (Y - 20.72)$
$X - 27 = 0.723Y - 14.98$
$X = 0.723Y - 14.98 + 27$
$X = 0.723Y + 12.02$ (regression line)
Estimate of husband’s age when Y = 20:
$X = 0.723 \times 20 + 12.02$
$X = 26.48$ years

Y on X regression equation:
$(Y - \bar{Y}) = b_{yx} (X - \bar{X})$
$(Y - 20.72) = 0.47 (X - 27)$
$Y - 20.72 = 0.47X - 12.69$
$Y = 0.47X - 12.69 + 20.72$
$Y = 0.47X + 8.03$ (regression line)
Estimate of wife’s age when X = 30:
$Y = 0.47 \times 30 + 8.03$
$= 14.10 + 8.03$
$= 22.13$ years

$r = \sqrt{b_{xy} \times b_{yx}}$
$= \sqrt{0.723 \times 0.47} = 0.5829$

ILLUSTRATION = 14
The following are the marks obtained by 132 students in Test X and Test Y. Calculate
a) The regression coefficients
b) Two regression equations
c) Coefficient of correlation
Y \ X    30-40   40-50   50-60   60-70   70-80   Total
20-30      2       5       3       -       -      10
30-40      1       8      12       6       -      27
40-50      -       5      22      14       1      42
50-60      -       2      16       9       2      29
60-70      -       1       8       6       1      16
70-80      -       -       2       4       2       8
Total      3      21      63      39       6     132
SOLUTION
(For X: A = 55, C = 10. For Y: A = 45, C = 10.)
X class               30-40   40-50   50-60   60-70   70-80
Mid-value              35      45      55      65      75
Y class  M.V.  dy\dx   -2      -1       0       1       2      f    fdy   fdy²   fdxdy
20-30    25    -2       2       5       3       -       -     10    -20    40     18
30-40    35    -1       1       8      12       6       -     27    -27    27      4
40-50    45     0       -       5      22      14       1     42      0     0      0
50-60    55     1       -       2      16       9       2     29     29    29     11
60-70    65     2       -       1       8       6       1     16     32    64     14
70-80    75     3       -       -       2       4       2      8     24    72     24
Total f                 3      21      63      39       6    132     38   232     71
fdx                    -6     -21       0      39      12     24
fdx²                   12      21       0      39      24     96
fdxdy                  10      14       0      27      20     71

$N = 132$, $\sum f d_x = 24$, $\sum f d_x^2 = 96$, $\sum f d_y = 38$, $\sum f d_y^2 = 232$, $\sum f d_x d_y = 71$

$\bar{X} = A + \frac{\sum f d_x}{N} \times C = 55 + \frac{24}{132} \times 10 = 55 + \frac{240}{132} = 55 + 1.82 = 56.82$
$\bar{Y} = A + \frac{\sum f d_y}{N} \times C = 45 + \frac{38}{132} \times 10 = 45 + \frac{380}{132} = 45 + 2.878 = 47.878$

Regression coefficient of X on Y:
$b_{xy} = \frac{N \sum f d_x d_y - (\sum f d_x \times \sum f d_y)}{N \sum f d_y^2 - (\sum f d_y)^2} \times \frac{C \text{ of } X}{C \text{ of } Y}$
$= \frac{71 \times 132 - (24 \times 38)}{232 \times 132 - (38)^2} \times \frac{10}{10} = \frac{9372 - 912}{30624 - 1444} = \frac{8460}{29180} = 0.289$

Regression coefficient of Y on X:
$b_{yx} = \frac{N \sum f d_x d_y - (\sum f d_x \times \sum f d_y)}{N \sum f d_x^2 - (\sum f d_x)^2} \times \frac{C \text{ of } Y}{C \text{ of } X}$
$= \frac{71 \times 132 - (24 \times 38)}{96 \times 132 - (24)^2} \times \frac{10}{10} = \frac{8460}{12672 - 576} = \frac{8460}{12096} = 0.699$

X on Y regression equation (taking $b_{xy} \approx 0.29$):
$X - \bar{X} = b_{xy} (Y - \bar{Y})$
$X - 56.82 = 0.29 (Y - 47.88)$
$X - 56.82 = 0.29Y - 13.8852$
$X = 0.29Y - 13.8852 + 56.82$
$X = 0.29Y + 42.93$ (regression line)

Y on X regression equation (taking $b_{yx} \approx 0.7$):
$Y - \bar{Y} = b_{yx} (X - \bar{X})$
$Y - 47.88 = 0.7 (X - 56.82)$
$Y - 47.88 = 0.7X - 39.774$
$Y = 0.7X - 39.774 + 47.88$
$Y = 0.7X + 8.11$ (regression line)

Coefficient of correlation $= \sqrt{b_{xy} \times b_{yx}}$

$= \sqrt{0.29 \times 0.7} = 0.450$

ILLUSTRATION = 15

Following is the distribution of students according to their height and weight.

Height in      Weight in lbs (Y)
inches (X)     90-100   100-110   110-120   120-130   TOTAL
50-55            4         7         5         2        18
55-60            6        10         7         4        27
60-65            6        12        10         7        35
65-70            3         8         6         3        20
TOTAL           19        37        28        16       100

From the above,
a) Estimate the weight when height is 63 inches
b) Estimate the height when weight is 115 lbs
c) Calculate coefficient of correlation

SOLUTION: Let X be height in inches and Y be weight in lbs.

Y class               90-100   100-110   110-120   120-130
Mid-value               95       105       115       125
X class  M.V.  dx\dy    -2        -1         0         1       f    fdx   fdx²   fdxdy
50-55    52.5  -2        4         7         5         2      18    -36    72     26
55-60    57.5  -1        6        10         7         4      27    -27    27     18
60-65    62.5   0        6        12        10         7      35      0     0      0
65-70    67.5   1        3         8         6         3      20     20    20    -11
Total f                 19        37        28        16     100    -43   119     33
fdy                    -38       -37         0        16     -59
fdy²                    76        37         0        16     129
fdxdy                   22        16         0        -5      33

$N = 100$, $\sum f d_x = -43$, $\sum f d_x^2 = 119$, $\sum f d_y = -59$, $\sum f d_y^2 = 129$, $\sum f d_x d_y = 33$

$\bar{X} = A + \frac{\sum f d_x}{N} \times C = 62.5 + \frac{-43}{100} \times 5 = 62.5 - \frac{215}{100} = 60.35$
$\bar{Y} = A + \frac{\sum f d_y}{N} \times C = 115 + \frac{-59}{100} \times 10 = 115 - \frac{590}{100} = 109.1$

X on Y regression equation:
$b_{xy} = r \frac{\sigma_X}{\sigma_Y}$
$\therefore b_{xy} = \frac{N \sum f d_x d_y - (\sum f d_x \times \sum f d_y)}{N \sum f d_y^2 - (\sum f d_y)^2} \times \frac{C \text{ of } x}{C \text{ of } y}$
$= \frac{33 \times 100 - (-43 \times -59)}{129 \times 100 - (-59)^2} \times \frac{5}{10} = \frac{3300 - 2537}{12900 - 3481} \times 0.5 = \frac{763}{9419} \times 0.5 = \frac{381.5}{9419} = 0.0405$
$(X - \bar{X}) = b_{xy} (Y - \bar{Y})$
$X - 60.35 = 0.0405 (Y - 109.1)$
$X - 60.35 = 0.0405Y - 4.41855$
$X = 0.0405Y - 4.41855 + 60.35$
$X = 0.0405Y + 55.93145$ (regression line)
Estimation of height (X) when weight (Y) is 115 lbs:
$X = 0.0405 \times 115 + 55.93145$
$X = 4.6575 + 55.93145$
$X = 60.6$ inches height

Y on X regression equation:
$b_{yx} = r \frac{\sigma_Y}{\sigma_X}$
$\therefore b_{yx} = \frac{N \sum f d_x d_y - (\sum f d_x \times \sum f d_y)}{N \sum f d_x^2 - (\sum f d_x)^2} \times \frac{C \text{ of } y}{C \text{ of } x}$
$= \frac{33 \times 100 - (-43 \times -59)}{119 \times 100 - (-43)^2} \times \frac{10}{5} = \frac{3300 - 2537}{11900 - 1849} \times 2 = \frac{763}{10051} \times 2 = 0.1518$
$(Y - \bar{Y}) = b_{yx} (X - \bar{X})$
$Y - 109.1 = 0.1518 (X - 60.35)$
$Y - 109.1 = 0.1518X - 9.16113$
$Y = 0.1518X - 9.16113 + 109.1$
$Y = 0.1518X + 99.93887$ (regression line)
Estimation of weight (Y) when height (X) is 63 inches:
$Y = 0.1518 \times 63 + 99.93887$
$= 9.5634 + 99.93887$
$Y = 109.5$ lbs

$r = \sqrt{b_{xy} \times b_{yx}} = \sqrt{0.0405 \times 0.1518} = 0.0784$

ILLUSTRATION = 16
From the following data find:
a) The most probable value of Y, when X is 60
b) The most probable value of X, when Y is 40
c) The coefficient of correlation
$\bar{X} = 53.2$, $\bar{Y} = 27.9$, $b_{yx} = -1.5$ and $b_{xy} = -0.2$
SOLUTION
X on Y regression equation:
$(X - \bar{X}) = b_{xy} (Y - \bar{Y})$
$(X - 53.2) = -0.2 (Y - 27.9)$
$X - 53.2 = -0.2Y + 5.58$
$X = -0.2Y + 5.58 + 53.2$
$X = -0.2Y + 58.78$ (regression line)
If Y is 40:
$X = -0.2 \times 40 + 58.78$
$X = 50.78$

Y on X regression equation:
$(Y - \bar{Y}) = b_{yx} (X - \bar{X})$
$Y - 27.9 = -1.5 (X - 53.2)$
$Y - 27.9 = -1.5X + 79.8$
$Y = -1.5X + 79.8 + 27.9$
$Y = -1.5X + 107.7$ (regression line)
If X is 60:
$Y = -1.5 \times 60 + 107.7$
$= -90 + 107.7$
$Y = 17.7$

The coefficient of correlation takes the sign of the regression coefficients, which are both negative here:
$r = -\sqrt{b_{xy} \times b_{yx}} = -\sqrt{(-0.2) \times (-1.5)}$

$= -0.5477$

THEORETICAL QUESTIONS (5 , 10 & 15 Marks)


1. What is meant by Regression? How is this concept useful to business forecasting?
2. Distinguish clearly between correlation and regression analysis.
3. What is Regression analysis? State its uses.
4. Define Regression and explain its importance
5. Briefly explain:
a. Regression line
b. Regression Equation
c. Regression Coefficient

PRACTICAL PROBLEMS
6. Given the following data, calculate,
a. The expected value of Y when X=60
b. The expected value of X when Y=120
X Y
Mean 65 120
SD 5 10

Coefficient of correlation = 0.6 [Answers, X=65 Y=114]

PROBLEM = 07
Given the following data estimate the marks in Mathematics for a student who has secured 60 marks in English.
Arithmetic Average of Marks in Maths = 80
Arithmetic Average of Marks in English = 50
SD of Marks in Mathematics _ _ _ _ _ _ _ 15
SD of Marks in English _ _ _ _ _ _ _ _ _ _ 10
Coefficient of Correlation _ _ _ _ _ _ _ _ _ _ 0.4
[Answer : 86]

PROBLEM = 08
Find the most likely price in Bangalore corresponding to the price of Rs.70 at Mysore from the following
data
Average price at Mysore = Rs.65
Average price at Bangalore = Rs.67
SD of Price at Mysore = Rs.2.5
SD of Price at Bangalore = Rs.3.5
Coefficient of correlation between the two prices of the commodity in the two cities is 0.8.
Also estimate the price at Mysore Corresponding to the price Rs.50 at Bangalore.
[Answer: 72.6 and 55.3]
PROBLEM = 09
You are given the following data.
X Y
Mean 36 85
S. D. 11 8

Coefficient of correlation = 0.66


1.Find the two regression equations
2.Estimate the Value of X when Y = 75
[Answer X75= 26.92]
PROBLEM = 10
The following are the marks in Statistics (X) and Mathematics (Y) of ten students
X 56 55 58 58 57 56 60 64 69 57
Y 68 67 67 70 65 68 70 66 68 66
Calculate the coefficient of correlation based on bxy and byx also estimate the marks in Mathematics of a
student who scores 62 marks in Statistics.
[Answer: r = 0.78,bxy= 0.0294, Y = 67.59]
PROBLEM NO: 11
From the following data, obtain both the regression equations and estimate the demand (Y) if the price (X) is
75.
Price (X) 60 63 66 69 72 78 81 90 96 99
Demand(Y) 85 87 84 80 82 79 78 73 70 72

PROBLEM NO: 12
From the data given below, find
a. The two regression equations
b. The Coefficient of Correlation between the marks in Economics and Statistics.
c. The most likely marks in Statistics when marks in Economics are 30.
Marks in Economics X 25 28 35 32 31 36 39 38 34 32
Marks in Statistics Y 43 46 49 41 36 32 31 30 33 39

[Ans: X = 40.892 –1.234Y, Y = 59.248 –0.664X, r =0.394, Y= 39]


PROBLEM =13
The following data relate to price and demand of a commodity
a) Estimate demand when price is Rs.30
b) Estimate price when demand is 65 units
c) Coefficient of correlation.
Demand in units 20 22 25 23 18 16 14 17 21 19
Price in Rs 50 45 38 42 55 58 59 54 49 57
[Answer a) 29.6 b) 13.21 c) r = - 0.94]

PROBLEM = 14
The following table shows the frequency distribution of couples classified according to the ages.
Calculate,
a) Obtain two Regression coefficients.
b) Estimate the age of husband when wife’s age is 28 years.

c) Calculate coefficient of correlation.
Wife’s age Husbands age in years X
In years Y 20-25 25-30 30-35 35-40 TOTAL
15-20 20 10 3 2 35
20-25 4 18 6 4 32
25-30 - 5 11 - 16
30-35 - - 2 - 2
35-40 - - - 5 5
TOTAL 24 33 22 11 90
[ Answers, r = 0.612, X = 22.5, Y = 28.6, b = 31.7 , box = 0.558 ]
PROBLEM = 15
From the following data,
a) Estimate X when Y = 30 and also b) Estimate Y when X = 20
X
5-15 15-25 25-35 35-45 TOTAL
Y
0-10 1 1 - - 2
10-20 3 6 5 1 15
20-30 1 8 9 2 20
30-40 - 3 9 3 15
40-50 - - 4 4 8
TOTAL 5 18 27 10 60
[Answer a) 28.7 b)22.31]
PROBLEM NO =16
From the following data, calculate
a) Regression coefficients b) Coefficient of correlation based on bxy and box.
Y
30-35 35-40 40-45 45-50 TOTAL
X
25-30 20 10 3 2 35
30-35 4 28 6 4 42
35-40 - 5 11 - 16
40-45 - - 2 - 2
45-50 - - - 5 5
TOTAL 24 43 22 11 100
[Answer: X = 32.5, Y = 38.5 bxy = 0.6744 box = 0.5576, ς = 0.6132]
PROBLEM = 17
Calculate two Regression Coefficients. Estimate the value of X when Y = 49 also calculate
coefficient of correlation based on bxy and box.
X 43 44 46 40 44 42 45 42 38 40 42 57
Y 29 31 19 18 19 27 27 29 41 30 26 10
[Answer X = 64.8, Y = ? , bxy = -0.44, byx = -1.2198, ς= -0.732]
PROBLEM = 18
From the following bivariate table calculate the following
a) Two Regression coefficients
b) Coefficient of correlation based on bxy and box
X
59.9 79.5 99.5 119.5 139.5 159.5 179.5 TOTAL
Y
2.25 3 4 3 6 2 1 1 20
7.25 2 3 5 10 3 1 1 25
12.25 5 4 6 11 5 3 3 37
17.25 10 11 12 15 12 15 10 85
22.25 4 2 3 10 7 5 6 37
27.25 1 1 2 8 8 5 4 29
32.25 1 1 1 10 5 4 5 27
TOTAL 26 26 32 70 42 34 30 260

[Answer: X = 17.80, Y = 122.42, bxy = 0.05, box = 1.06, r = 0.230]

School of Distance Education
Here the normal equations are,

$47.14 = 90B + 20A$ --- (1)

$11.59 = 20B + 5A$ --- (2)

$(1) - 4 \times (2) \Rightarrow 10B = 0.78 \Rightarrow B = 0.078$.

Solving (2) using $B = 0.078$, we get $A = 2.006$.

Then, $a = \text{Antilog}(2.006) = 101.3$, and $b = \text{Antilog}(0.078) = 1.196$

Hence the required curve is,

$y = 101.3 \times (1.196)^x$
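
The elimination above can be checked numerically. The sketch below assumes the fitted model is $y = a b^x$ with $A = \log_{10} a$ and $B = \log_{10} b$, and reads the sums $n = 5$, $\sum x = 20$, $\sum x^2 = 90$, $\sum \log y = 11.59$ and $\sum x \log y = 47.14$ off the coefficients of the two normal equations; the function name is illustrative.

```python
def solve_log_linear(n, sum_x, sum_x2, sum_logy, sum_x_logy):
    """Solve sum_x_logy = B*sum_x2 + A*sum_x and sum_logy = B*sum_x + n*A."""
    det = n * sum_x2 - sum_x ** 2
    slope_b = (n * sum_x_logy - sum_x * sum_logy) / det
    intercept_a = (sum_logy - slope_b * sum_x) / n
    return intercept_a, slope_b

A, B = solve_log_linear(5, 20, 90, 11.59, 47.14)
a, b = 10 ** A, 10 ** B  # antilogarithms to base 10
print(round(A, 3), round(B, 3), round(a, 1), round(b, 3))
```

Small differences in the last digit of a and b against the values quoted above come from rounding of the antilogarithms.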

2.4. Regression lines:

Let $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ be the given set of observations on two variables X and
Y. A scatter plot of these points gives an idea of the linear relation between X and Y. If
a linear relation exists between X and Y, the line about which the points in the scatter
diagram cluster is called the regression line and the equation representing this line is
called the regression equation. There are two approaches for finding the regression line.
One is fitting a straight line of the form $y = ax + b$ to the given data $(x_1, y_1), (x_2, y_2), \ldots,
(x_n, y_n)$, by minimizing the sum of squares of the errors in the y values. The other is
fitting a straight line of the form $x = cy + d$ to the data, by minimizing the sum of squares
of the errors in the x values. If all the given $(x_i, y_i)$ values perfectly obey a linear
relation, then the straight lines fitted by the above two approaches will be the same. But in
general the $(x_i, y_i)$ values may not perfectly obey a linear relation, and hence the above
approaches may give two different straight lines for the given data. The straight line
fitted to the data in the form $y = ax + b$ by minimizing the sum of squares of the
errors in the y values is known as the regression line y on x, and the straight line fitted to the
data in the form $x = cy + d$ by minimizing the sum of squares of the errors in the x values
is known as the regression line x on y.

To obtain the regression line Y on X of the form $y = ax + b$ for the given data $(x_1, y_1),
(x_2, y_2), \ldots, (x_n, y_n)$, the following normal equations for fitting $y = ax + b$ are to be solved.

$\sum_{i=1}^{n} x_i y_i = a \sum_{i=1}^{n} x_i^2 + b \sum_{i=1}^{n} x_i$ --- (1) and

$\sum_{i=1}^{n} y_i = a \sum_{i=1}^{n} x_i + n b$ --- (2)

Let us transform x and y to X and Y as $X = x - \bar{x}$ and $Y = y - \bar{y}$, where $\bar{x}$ and $\bar{y}$ are the
means of x and y respectively. Now the normal equations for fitting a straight line
connecting X and Y in the form $Y = aX + b$ are:

Applied Statistics Page 29


$$\sum_{i=1}^{n} X_i Y_i = a \sum_{i=1}^{n} X_i^2 + b \sum_{i=1}^{n} X_i \qquad (3)$$

$$\sum_{i=1}^{n} Y_i = a \sum_{i=1}^{n} X_i + n b \qquad (4)$$

But here, $\sum_{i=1}^{n} X_i = \sum_{i=1}^{n} (x_i - \bar{x}) = 0$ and $\sum_{i=1}^{n} Y_i = \sum_{i=1}^{n} (y_i - \bar{y}) = 0$.

Hence, (3) $\Rightarrow \sum_{i=1}^{n} X_i Y_i = a \sum_{i=1}^{n} X_i^2 + b \cdot 0$

$$\Rightarrow a = \frac{\sum_{i=1}^{n} X_i Y_i}{\sum_{i=1}^{n} X_i^2} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

That is, $a = \dfrac{\mathrm{Cov}(x, y)}{\mathrm{var}(x)}$.

(4) $\Rightarrow 0 = a \cdot 0 + n b \Rightarrow b = 0$.

Then the straight line is $Y = \dfrac{\mathrm{Cov}(x, y)}{\mathrm{var}(x)} X + 0$.

Hence the regression line of y on x is

$$y - \bar{y} = \frac{\mathrm{Cov}(x, y)}{\mathrm{var}(x)} (x - \bar{x}).$$

In a similar way, the regression line of x on y is derived as

$$x - \bar{x} = \frac{\mathrm{Cov}(x, y)}{\mathrm{var}(y)} (y - \bar{y}).$$

In the regression line of y on x, the coefficient of x, $\dfrac{\mathrm{Cov}(x, y)}{\mathrm{var}(x)} = \dfrac{P_{xy}}{\sigma_x^2}$, is known as the regression coefficient of y on x, denoted by $b_{yx}$; and in the regression line of x on y, the coefficient of y, $\dfrac{\mathrm{Cov}(x, y)}{\mathrm{var}(y)} = \dfrac{P_{xy}}{\sigma_y^2}$, is known as the regression coefficient of x on y, denoted by $b_{xy}$.

The regression line of y on x helps us to predict the value of y for a given value of x, and the regression line of x on y helps us to predict the value of x for a given value of y.


Problem: Obtain the line of regression of ‘y on x’ for the following data.

Age x : 66  38  56  42  72  36  63  47  55  45
BP y  : 145 124 147 125 160 118 149 128 150 124

Estimate the blood pressure of a man whose age is 55.

Solution:

The regression line of y on x is defined as

$$y - \bar{y} = \frac{P_{xy}}{\sigma_x^2} (x - \bar{x}),$$

where $P_{xy} = \mathrm{Cov}(X, Y)$ and $\sigma_x^2 = V(X)$.

We use the given data to find the mean of x, the mean of y, Cov(X, Y) and V(X).

The calculations are as follows:

         x       y      x²      xy
        66     145    4356    9570
        38     124    1444    4712
        56     147    3136    8232
        42     125    1764    5250
        72     160    5184   11520
        36     118    1296    4248
        63     149    3969    9387
        47     128    2209    6016
        55     150    3025    8250
        45     124    2025    5580
Total  520    1370   28408   72765

Mean of X $= \dfrac{520}{10} = 52$, Mean of Y $= \dfrac{1370}{10} = 137$

$$\mathrm{Cov}(X, Y) = \frac{1}{n} \sum xy - \bar{x}\,\bar{y} = \frac{72765}{10} - 52 \times 137 = 152.5$$

$$V(X) = \frac{1}{n} \sum x^2 - \bar{x}^2 = \frac{28408}{10} - 52^2 = 136.8$$

Hence the regression line of y on x is

$$(y - 137) = \frac{152.5}{136.8}(x - 52) \Rightarrow y = 1.1148\,x + 79.03$$

The blood pressure of a man whose age is x = 55 is obtained by substituting x = 55 in the derived regression equation of y on x. This gives the blood pressure

$$y = 1.1148 \times 55 + 79.03 = 140.34.$$
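The same calculation can be reproduced in a few lines of code. The sketch below is illustrative only: the function name `regression_y_on_x` is hypothetical, and it uses the divide-by-n definitions of covariance and variance used in this section.

```python
def regression_y_on_x(xs, ys):
    """Return (slope, intercept) of the regression line of y on x.

    slope = Cov(x, y) / Var(x), with covariance and variance divided by n.
    """
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("xs and ys must be non-empty and of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum(x * y for x, y in zip(xs, ys)) / n - mean_x * mean_y
    var_x = sum(x * x for x in xs) / n - mean_x ** 2
    if var_x == 0:
        raise ValueError("all x values are equal; the slope is undefined")
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Age and blood-pressure data from the problem above.
age = [66, 38, 56, 42, 72, 36, 63, 47, 55, 45]
bp = [145, 124, 147, 125, 160, 118, 149, 128, 150, 124]

slope, intercept = regression_y_on_x(age, bp)
predicted_bp = slope * 55 + intercept  # estimate at age 55
print(round(slope, 4), round(intercept, 2), round(predicted_bp, 2))
```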

Problem: For 10 observations on X and Y, the following data were observed:

$\sum x = 130$, $\sum y = 200$, $\sum x^2 = 2288$, $\sum y^2 = 5506$, $\sum xy = 3467$

Obtain the regression line of Y on X. Find y when x = 16.

Solution:

The regression line of y on x is $(y - \bar{y}) = \dfrac{P_{xy}}{\sigma_x^2}(x - \bar{x})$, where $P_{xy} = \mathrm{Cov}(X, Y)$ and $\sigma_x^2 = V(X)$.

$$\mathrm{Cov}(X, Y) = \frac{1}{n}\sum xy - \bar{x}\,\bar{y} = \frac{1}{10}(3467) - \left(\frac{130}{10}\right)\left(\frac{200}{10}\right) = 86.7$$

$$V(X) = \frac{1}{n}\sum x^2 - \bar{x}^2 = \frac{1}{10}(2288) - \left(\frac{130}{10}\right)^2 = 59.8$$

The regression line of Y on X is

$$\left(y - \frac{200}{10}\right) = \frac{86.7}{59.8}\left(x - \frac{130}{10}\right)$$

$$\Rightarrow y = 1.4498\,x + 1.1526.$$

When x = 16, we get

$$y = 1.4498 \times 16 + 1.1526 = 24.3494.$$
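When only the summary totals are available, the regression line can be computed directly from them. The following is a small sketch under that assumption; `regression_from_sums` is a hypothetical helper name.

```python
def regression_from_sums(n, sum_x, sum_y, sum_x2, sum_xy):
    """Regression line of y on x from n, sum x, sum y, sum x^2 and sum xy."""
    mean_x = sum_x / n
    mean_y = sum_y / n
    cov_xy = sum_xy / n - mean_x * mean_y   # Cov(X, Y), divided by n
    var_x = sum_x2 / n - mean_x ** 2        # V(X), divided by n
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Totals given in the problem above (sum of y^2 is not needed for y on x).
slope, intercept = regression_from_sums(10, 130, 200, 2288, 3467)
y_at_16 = slope * 16 + intercept
print(round(slope, 4), round(y_at_16, 2))
```

Because the code does not round the slope before computing the intercept, the intercept differs from the hand calculation in the fourth decimal place; the prediction at x = 16 agrees to two decimals.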

2.5. Pearson’s Coefficient of correlation:

If there is a linear relation between the variables x and y, the degree of linear relation is measured by the coefficient of correlation. If all the given $(x_i, y_i)$ points almost satisfy a linear relation, we say that there is a high degree of linear relation between the variables. If the fitted linear relation is such that an increase in one variable results in an increase in the other, there is a direct (or positive) correlation between the variables. On the other hand, if an increase in one variable results in a decrease in the other, there is an inverse (or negative) correlation between the variables. If there is no linear relation between the variables, the correlation is zero.

The British statistician Karl Pearson suggested a measure of the degree of correlation between two variables x and y. It is known as Pearson's coefficient of correlation and is denoted by $r_{xy}$, where

$$r_{xy} = \frac{P_{xy}}{\sigma_x \sigma_y} = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^2}} = \frac{\frac{1}{n}\sum_{i=1}^{n} x_i y_i - \bar{x}\,\bar{y}}{\sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^2 - (\bar{x})^2}\,\sqrt{\frac{1}{n}\sum_{i=1}^{n} y_i^2 - (\bar{y})^2}}$$

Theorem: For two variables x and y, $-1 \le r_{xy} \le 1$, where $r_{xy}$ is Pearson's coefficient of correlation.

Proof:

Let $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$ be the observations on x and y. Consider $\dfrac{x_i - \bar{x}}{\sigma_x}$ and $\dfrac{y_i - \bar{y}}{\sigma_y}$, where $\bar{x}$ and $\bar{y}$ are the means and $\sigma_x$ and $\sigma_y$ are the standard deviations of x and y respectively.

We have $\left(\dfrac{x_i - \bar{x}}{\sigma_x} \pm \dfrac{y_i - \bar{y}}{\sigma_y}\right)^2 \ge 0$, because it is the square of a real number.

Adding all such terms for i = 1, 2, …, n and dividing by n,

$$\frac{1}{n}\sum_{i}\left(\frac{x_i - \bar{x}}{\sigma_x} \pm \frac{y_i - \bar{y}}{\sigma_y}\right)^2 \ge 0$$

On expansion,

$$\frac{1}{n}\sum_{i}\frac{(x_i - \bar{x})^2}{\sigma_x^2} + \frac{1}{n}\sum_{i}\frac{(y_i - \bar{y})^2}{\sigma_y^2} \pm 2\,\frac{1}{n}\sum_{i}\frac{(x_i - \bar{x})(y_i - \bar{y})}{\sigma_x \sigma_y} \ge 0$$

$$\Rightarrow \frac{1}{\sigma_x^2}\cdot\frac{1}{n}\sum_{i}(x_i - \bar{x})^2 + \frac{1}{\sigma_y^2}\cdot\frac{1}{n}\sum_{i}(y_i - \bar{y})^2 \pm \frac{2}{\sigma_x \sigma_y}\cdot\frac{1}{n}\sum_{i}(x_i - \bar{x})(y_i - \bar{y}) \ge 0$$

$$\Rightarrow \frac{\sigma_x^2}{\sigma_x^2} + \frac{\sigma_y^2}{\sigma_y^2} \pm 2\,\frac{\mathrm{Cov}(x, y)}{\sigma_x \sigma_y} \ge 0. \quad \text{That is, } 1 + 1 \pm 2\,\frac{P_{xy}}{\sigma_x \sigma_y} \ge 0$$

$\Rightarrow 2 \pm 2 r_{xy} \ge 0$. That is, $1 \pm r_{xy} \ge 0$.

This gives $1 + r_{xy} \ge 0$ and $1 - r_{xy} \ge 0$.

That is, $r_{xy} \ge -1$ and $r_{xy} \le 1$.

$\therefore\ -1 \le r_{xy} \le 1$

Remark: We have the regression coefficient of y on x, $b_{yx} = \dfrac{P_{xy}}{\sigma_x^2}$, and the regression coefficient of x on y, $b_{xy} = \dfrac{P_{xy}}{\sigma_y^2}$. The geometric mean of these regression coefficients, $\sqrt{b_{yx}\, b_{xy}}$, gives the magnitude of the coefficient of correlation $r_{xy}$. The sign of the correlation is determined by the sign of the covariance between x and y, $P_{xy}$. If $P_{xy}$ is positive, $r_{xy}$ is positive, and if $P_{xy}$ is negative, $r_{xy}$ is negative.

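Theorem (Invariance of the correlation coefficient under linear transformation): A transformation of the variables x and y to u and v of the form $u = \dfrac{x - A}{c}$ and $v = \dfrac{y - B}{d}$, where c and d are positive constants, makes no change in the coefficient of correlation between the variables. That is, $r_{xy} = r_{uv}$.

Proof:

Let $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$ be the observations on x and y.

Then

$$r_{xy} = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^2}}$$

Let $u = \dfrac{x - A}{c}$ and $v = \dfrac{y - B}{d}$, so that $\bar{u} = \dfrac{\bar{x} - A}{c}$ and $\bar{v} = \dfrac{\bar{y} - B}{d}$.

Then Pearson's coefficient of correlation between u and v is

$$r_{uv} = \frac{\frac{1}{n}\sum_{i=1}^{n}(u_i - \bar{u})(v_i - \bar{v})}{\sqrt{\frac{1}{n}\sum_{i=1}^{n}(u_i - \bar{u})^2}\,\sqrt{\frac{1}{n}\sum_{i=1}^{n}(v_i - \bar{v})^2}}$$

$$\Rightarrow r_{uv} = \frac{\frac{1}{n}\sum_{i=1}^{n}\left(\frac{x_i - A}{c} - \frac{\bar{x} - A}{c}\right)\left(\frac{y_i - B}{d} - \frac{\bar{y} - B}{d}\right)}{\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\frac{x_i - A}{c} - \frac{\bar{x} - A}{c}\right)^2}\,\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\frac{y_i - B}{d} - \frac{\bar{y} - B}{d}\right)^2}}$$

$$\Rightarrow r_{uv} = \frac{\frac{1}{n}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{c}\right)\left(\frac{y_i - \bar{y}}{d}\right)}{\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{c}\right)^2}\,\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\frac{y_i - \bar{y}}{d}\right)^2}}$$

$$\Rightarrow r_{uv} = \frac{\frac{1}{cd}\cdot\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\frac{1}{cd}\sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^2}}$$

$$\Rightarrow r_{uv} = \frac{\frac{1}{cd}\,P_{xy}}{\frac{1}{cd}\,\sigma_x \sigma_y} = \frac{P_{xy}}{\sigma_x \sigma_y}$$

$\Rightarrow r_{uv} = r_{xy}$.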

Problem: Find the coefficient of correlation for the following data on X and Y.

X: 65 66 67 67 68 69 70 72

Y: 67 68 65 68 72 72 69 71

Solution:

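Coefficient of correlation, $r_{xy} = \dfrac{P_{xy}}{\sigma_x \sigma_y}$

We need to find $\bar{x}$, $\bar{y}$, $P_{xy}$, $\sigma_x^2$ and $\sigma_y^2$, where

$$P_{xy} = \frac{1}{n}\sum_{i=1}^{n} x_i y_i - \bar{x}\,\bar{y}; \quad \sigma_x^2 = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - (\bar{x})^2 \quad \text{and} \quad \sigma_y^2 = \frac{1}{n}\sum_{i=1}^{n} y_i^2 - (\bar{y})^2$$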

The calculations are as follows:

         x      y      x²      y²      xy
        65     67    4225    4489    4355
        66     68    4356    4624    4488
        67     65    4489    4225    4355
        67     68    4489    4624    4556
        68     72    4624    5184    4896
        69     72    4761    5184    4968
        70     69    4900    4761    4830
        72     71    5184    5041    5112
Total  544    552   37028   38132   37560

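$$\bar{x} = \frac{1}{n}\sum x_i = \frac{1}{8}(544) = 68; \quad \bar{y} = \frac{1}{n}\sum y_i = \frac{1}{8}(552) = 69$$

$$P_{xy} = \frac{1}{n}\sum_{i=1}^{n} x_i y_i - \bar{x}\,\bar{y} = \frac{1}{8}(37560) - 68 \times 69 = 3$$

$$\sigma_x^2 = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - (\bar{x})^2 = \frac{1}{8}(37028) - (68)^2 = 4.5$$

$$\sigma_y^2 = \frac{1}{n}\sum_{i=1}^{n} y_i^2 - (\bar{y})^2 = \frac{1}{8}(38132) - (69)^2 = 5.5$$

Coefficient of correlation, $r_{xy} = \dfrac{P_{xy}}{\sigma_x \sigma_y} = \dfrac{3}{\sqrt{4.5}\,\sqrt{5.5}} = 0.603$.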

Problem: Calculate Karl Pearson’s coefficient of correlation for the following data;
x: 10 12 13 16 17 20 25
y: 19 22 26 27 29 33 37
Solution:

Coefficient of correlation, $r = \dfrac{\mathrm{Cov}(X, Y)}{\mathrm{S.D.}(X) \times \mathrm{S.D.}(Y)}$

The problem can be solved by following the steps shown in the above example. For computational ease, however, it can also be solved as in the following illustration.

We have the result that the correlation coefficient is independent of change of origin and scale. Hence we can calculate the correlation between X and Y after altering X and Y by some linear transformation. Here, consider U = X − 16 and V = Y − 27.

The correlation between U and V is the same as the correlation between X and Y.

Correlation between U and V, $r = \dfrac{\mathrm{Cov}(U, V)}{\mathrm{S.D.}(U) \times \mathrm{S.D.}(V)}$

The calculations are:

         x     y   U = X − 16   V = Y − 27    U²     V²     UV
        10    19       -6           -8        36     64     48
        12    22       -4           -5        16     25     20
        13    26       -3           -1         9      1      3
        16    27        0            0         0      0      0
        17    29        1            2         1      4      2
        20    33        4            6        16     36     24
        25    37        9           10        81    100     90
Total                   1            4       159    230    187

$$\mathrm{Cov}(U, V) = \frac{1}{n}\sum uv - \bar{u}\,\bar{v} = \frac{1}{7}(187) - \left(\frac{1}{7}\right)\left(\frac{4}{7}\right) = 26.71 - 0.082 = 26.628$$

$$V(U) = \frac{1}{n}\sum u^2 - \bar{u}^2 = \frac{1}{7}(159) - \left(\frac{1}{7}\right)^2 = 22.71 - 0.02 = 22.69$$

$$V(V) = \frac{1}{n}\sum v^2 - \bar{v}^2 = \frac{1}{7}(230) - \left(\frac{4}{7}\right)^2 = 32.86 - 0.327 = 32.533$$

Now, the correlation between U and V is $r = \dfrac{26.628}{\sqrt{22.69 \times 32.533}} = 0.98$.

That is, the correlation coefficient of X and Y is 0.98.
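The two worked problems, and the invariance of r under a change of origin, can be checked with a short script. This is a minimal sketch; `pearson_r` is a hypothetical helper that follows the divide-by-n definitions of $P_{xy}$, $\sigma_x^2$ and $\sigma_y^2$ used in this section.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r = P_xy / (sigma_x * sigma_y), using divide-by-n moments."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("xs and ys must be non-empty and of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    p_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    return p_xy / sqrt(var_x * var_y)

# Second problem: r is unchanged by the shifts U = X - 16 and V = Y - 27.
x = [10, 12, 13, 16, 17, 20, 25]
y = [19, 22, 26, 27, 29, 33, 37]
u = [xi - 16 for xi in x]
v = [yi - 27 for yi in y]
r_xy = pearson_r(x, y)
r_uv = pearson_r(u, v)
print(round(r_xy, 2), round(r_uv, 2))

# First problem: the X and Y data with means 68 and 69.
x1 = [65, 66, 67, 67, 68, 69, 70, 72]
y1 = [67, 68, 65, 68, 72, 72, 69, 71]
r_first = pearson_r(x1, y1)
print(round(r_first, 3))
```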

2.6. Angle between the regression lines:

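The regression equations are

$$y - \bar{y} = \frac{P_{xy}}{\sigma_x^2}(x - \bar{x}) \quad \text{and} \quad x - \bar{x} = \frac{P_{xy}}{\sigma_y^2}(y - \bar{y})$$

Since $r_{xy} = \dfrac{P_{xy}}{\sigma_x \sigma_y}$, the regression coefficient of y on x is $\dfrac{P_{xy}}{\sigma_x^2} = r_{xy}\dfrac{\sigma_y}{\sigma_x}$, and the regression coefficient of x on y is $\dfrac{P_{xy}}{\sigma_y^2} = r_{xy}\dfrac{\sigma_x}{\sigma_y}$.

Hence the regression equations are

$$y - \bar{y} = r_{xy}\frac{\sigma_y}{\sigma_x}(x - \bar{x}) \qquad (1)$$

$$x - \bar{x} = r_{xy}\frac{\sigma_x}{\sigma_y}(y - \bar{y}) \qquad (2)$$

The regression equation of x on y can be rewritten as

$$y - \bar{y} = \frac{\sigma_y}{r_{xy}\,\sigma_x}(x - \bar{x}) \qquad (3)$$

Now the regression equation of y on x [equation (1)] and that of x on y [equation (3)] can be written in the form y = mx + c as follows:

$$y = r_{xy}\frac{\sigma_y}{\sigma_x}\,x - r_{xy}\frac{\sigma_y}{\sigma_x}\,\bar{x} + \bar{y} \qquad (1)$$

$$y = \frac{\sigma_y}{r_{xy}\,\sigma_x}\,x - \frac{\sigma_y}{r_{xy}\,\sigma_x}\,\bar{x} + \bar{y} \qquad (3)$$

From here, we get the slopes of these two regression lines as $m_1 = r_{xy}\dfrac{\sigma_y}{\sigma_x}$ and $m_2 = \dfrac{\sigma_y}{r_{xy}\,\sigma_x}$.

Let $\theta$ be the angle between the regression lines. Then,

$$\tan\theta = \frac{m_1 - m_2}{1 + m_1 m_2} = \frac{r_{xy}\frac{\sigma_y}{\sigma_x} - \frac{\sigma_y}{r_{xy}\,\sigma_x}}{1 + \left(r_{xy}\frac{\sigma_y}{\sigma_x}\right)\left(\frac{\sigma_y}{r_{xy}\,\sigma_x}\right)} = \frac{\dfrac{r_{xy}^2\,\sigma_y - \sigma_y}{r_{xy}\,\sigma_x}}{1 + \dfrac{\sigma_y^2}{\sigma_x^2}} = \frac{r_{xy}^2\,\sigma_y - \sigma_y}{r_{xy}\,\sigma_x} \times \frac{\sigma_x^2}{\sigma_x^2 + \sigma_y^2}$$

$$\Rightarrow \tan\theta = \frac{r_{xy}^2 - 1}{r_{xy}} \cdot \frac{\sigma_y}{\sigma_x} \cdot \frac{\sigma_x^2}{\sigma_x^2 + \sigma_y^2}$$

$$\Rightarrow \tan\theta = \frac{r_{xy}^2 - 1}{r_{xy}} \cdot \frac{\sigma_x \sigma_y}{\sigma_x^2 + \sigma_y^2}$$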

Remarks:

(i) For two variables x and y, if $r_{xy} = \pm 1$, we get $\tan\theta = 0$. This implies that the angle between the regression lines is $\theta = \tan^{-1} 0 = 0$. That is, if a perfect linear relation exists between x and y (whether direct or inverse), the angle between the regression lines is zero. In other words, the two regression lines coincide.

(ii) If $r_{xy} = 0$, we get $\tan\theta = \infty$. This implies that the angle between the regression lines is $\theta = \tan^{-1}\infty = 90^\circ$. That is, if no linear relation exists between x and y, the two regression lines are perpendicular.
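The formula for $\tan\theta$ and the two remarks can be illustrated numerically. The sketch below is an assumption-laden illustration: the function name is hypothetical, and it takes the absolute value of $\tan\theta$ so that the acute angle is always returned, a convention the text does not state explicitly.

```python
from math import atan, degrees

def angle_between_regression_lines(r, sigma_x, sigma_y):
    """Acute angle in degrees between the two regression lines.

    Uses tan(theta) = |(1 - r^2) / r| * sigma_x * sigma_y / (sigma_x^2 + sigma_y^2).
    """
    if not -1 <= r <= 1:
        raise ValueError("r must lie between -1 and 1")
    if sigma_x <= 0 or sigma_y <= 0:
        raise ValueError("standard deviations must be positive")
    if r == 0:
        return 90.0  # tan(theta) is infinite: the lines are perpendicular
    tan_theta = abs((1 - r * r) / r) * sigma_x * sigma_y / (sigma_x ** 2 + sigma_y ** 2)
    return degrees(atan(tan_theta))

angle_perfect = angle_between_regression_lines(1.0, 2.0, 3.0)    # remark (i)
angle_none = angle_between_regression_lines(0.0, 2.0, 3.0)       # remark (ii)
angle_half = angle_between_regression_lines(0.5, 1.0, 1.0)       # tan(theta) = 0.75
print(angle_perfect, angle_none, round(angle_half, 2))
```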

If there are two distinct regression lines, they intersect at a point. The point of intersection of the regression lines can be obtained by solving the regression equations for x and y, as follows.

We have the regression equation of y on x, $(y - \bar{y}) = r_{xy}\dfrac{\sigma_y}{\sigma_x}(x - \bar{x})$ --- (1), and the regression equation of x on y, $(x - \bar{x}) = r_{xy}\dfrac{\sigma_x}{\sigma_y}(y - \bar{y})$ --- (2).

Putting (2) in (1) gives $(y - \bar{y}) = r_{xy}\dfrac{\sigma_y}{\sigma_x}\, r_{xy}\dfrac{\sigma_x}{\sigma_y}(y - \bar{y})$

$\Rightarrow (y - \bar{y}) = r_{xy}^2\,(y - \bar{y})$

$\Rightarrow (1 - r_{xy}^2)\,y = (1 - r_{xy}^2)\,\bar{y}$

$\Rightarrow y = \bar{y}$ (when $r_{xy}^2 \ne 1$).

Putting $y = \bar{y}$ in (2) $\Rightarrow (x - \bar{x}) = 0 \Rightarrow x = \bar{x}$.

Hence the point of intersection of the regression lines is $(\bar{x}, \bar{y})$.



2.7. Identification of regression lines and determination of correlation coefficient

If we are given $a_1 x + b_1 y + c_1 = 0$ and $a_2 x + b_2 y + c_2 = 0$ as the two regression lines, we have to identify which of them represents the regression line of y on x and which represents the regression line of x on y. To do this, we first assume that the first line, $a_1 x + b_1 y + c_1 = 0$, is one of the two, say the regression line of y on x. Then we express the line in terms of y as $y = -\dfrac{a_1}{b_1}x - \dfrac{c_1}{b_1}$, so that the regression coefficient of y on x is $b_{yx} = -\dfrac{a_1}{b_1}$. If the first line is assumed to be the regression line of y on x, the second is the regression line of x on y. It is written in terms of x as $x = -\dfrac{b_2}{a_2}y - \dfrac{c_2}{a_2}$, so that the regression coefficient of x on y is $b_{xy} = -\dfrac{b_2}{a_2}$.

We know that the geometric mean of the regression coefficients is the magnitude of the coefficient of correlation $r_{xy}$, that is, $b_{yx}\,b_{xy} = r_{xy}^2$, and that $-1 \le r_{xy} \le 1$.

Hence, if $b_{yx}\,b_{xy} = \dfrac{a_1 b_2}{b_1 a_2} \le 1$, we can confirm that our assumption regarding the regression lines is correct. Otherwise the first line is the regression line of x on y and the second is the regression line of y on x. Then the regression coefficients are $b_{xy} = -\dfrac{b_1}{a_1}$ and $b_{yx} = -\dfrac{a_2}{b_2}$, and $r_{xy}^2 = \dfrac{a_2 b_1}{b_2 a_1}$, which is the reciprocal of the value obtained under the previous assumption.

Problem: The two regression lines are

$5x - 6y + 90 = 0$
$15x - 8y - 130 = 0$

Find (i) $\bar{x}$, $\bar{y}$ (ii) the regression coefficients of y on x and x on y (iii) the correlation coefficient.

Solution:
Solving the given two regression lines, $5x - 6y + 90 = 0$ --- (1) and $15x - 8y - 130 = 0$ --- (2), we get $\bar{x}$, $\bar{y}$.
(2) $-\ 3 \times$ (1) $\Rightarrow 10y = 400 \Rightarrow y = 40$.
Putting y = 40 in (1) $\Rightarrow 5x - 6 \times 40 + 90 = 0 \Rightarrow x = 30$.

$\therefore (\bar{x}, \bar{y}) = (30, 40)$.

Assume the first line is the regression line of Y on X. Then the line can be expressed as $y = \dfrac{5}{6}x + \dfrac{90}{6}$. This implies that the regression coefficient of Y on X is $\dfrac{5}{6}$ $\left(= -\dfrac{a_1}{b_1}\right)$. The second line, X on Y, can be expressed as $x = \dfrac{8}{15}y + \dfrac{130}{15}$. Hence the regression coefficient of X on Y is $\dfrac{8}{15}$ $\left(= -\dfrac{b_2}{a_2}\right)$.

Then, $\dfrac{a_1 b_2}{a_2 b_1} = \left(-\dfrac{a_1}{b_1}\right)\left(-\dfrac{b_2}{a_2}\right) = \dfrac{5}{6} \times \dfrac{8}{15} = 0.444 < 1$.

Hence our assumption is true. That is, $5x - 6y + 90 = 0$ is the regression line of Y on X and $15x - 8y - 130 = 0$ is the regression line of X on Y. The regression coefficient of Y on X is $\dfrac{5}{6} = 0.833$, and the regression coefficient of X on Y is $\dfrac{8}{15} = 0.533$. Their product is $r^2 = 0.444$, so the correlation coefficient is $r = \sqrt{0.444} = 0.667$ (positive, because the regression coefficients are positive).

Problem: Given that $14x + 12y + 3 = 0$ and $12x + 21y + 10 = 0$ are the regression lines for X and Y, identify the regression lines and find the correlation coefficient.

Solution:

Assume that $14x + 12y + 3 = 0$ is the regression line of Y on X. Then $y = -\dfrac{14}{12}x - \dfrac{3}{12}$. This implies that the regression coefficient of Y on X is $-\dfrac{14}{12}$ $\left(= -\dfrac{a_1}{b_1}\right)$. The line $12x + 21y + 10 = 0$ is then assumed to be the regression line of X on Y, so $x = -\dfrac{21}{12}y - \dfrac{10}{12}$. Then the regression coefficient of X on Y is $-\dfrac{21}{12}$ $\left(= -\dfrac{b_2}{a_2}\right)$.

Then, $\dfrac{a_1 b_2}{a_2 b_1} = \left(-\dfrac{a_1}{b_1}\right)\left(-\dfrac{b_2}{a_2}\right) = \left(-\dfrac{14}{12}\right)\left(-\dfrac{21}{12}\right) = 2.04 > 1$. Hence our assumption about the regression lines is NOT true.

Now, $12x + 21y + 10 = 0$ is the regression line of Y on X and $14x + 12y + 3 = 0$ is the regression line of X on Y.

Then $y = -\dfrac{12}{21}x - \dfrac{10}{21}$, and the regression coefficient of Y on X is $-\dfrac{12}{21}$ $\left(= -\dfrac{a_1}{b_1}\right)$.

And $x = -\dfrac{12}{14}y - \dfrac{3}{14}$, so the regression coefficient of X on Y is $-\dfrac{12}{14}$ $\left(= -\dfrac{b_2}{a_2}\right)$.

Then, $\dfrac{a_1 b_2}{a_2 b_1} = \left(-\dfrac{12}{21}\right)\left(-\dfrac{12}{14}\right) = 0.4898 = r^2$.

Since the regression coefficients are negative, the correlation coefficient is $r = -\sqrt{0.4898} = -0.70$.
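The identification procedure can be expressed as a short routine. This is a sketch under stated assumptions: the function name and the (a, b, c) tuple representation are hypothetical, the signs of the coefficients in the two examples are read from the worked solutions, and the constant terms play no role in the result.

```python
from math import sqrt

def identify_regression_lines(line1, line2):
    """Each line is a tuple (a, b, c) standing for a*x + b*y + c = 0.

    Returns (y_on_x_line, x_on_y_line, b_yx, b_xy, r). An assignment is
    accepted when the product of the regression coefficients, which equals
    r^2, lies between 0 and 1.
    """
    for first, second in ((line1, line2), (line2, line1)):
        a1, b1, _ = first
        a2, b2, _ = second
        if b1 == 0 or a2 == 0:
            continue
        b_yx = -a1 / b1          # slope when `first` is solved for y
        b_xy = -b2 / a2          # slope when `second` is solved for x
        product = b_yx * b_xy    # r^2 under this assignment
        if 0 <= product <= 1:
            r = sqrt(product)
            if b_yx < 0:         # r carries the sign of the regression coefficients
                r = -r
            return first, second, b_yx, b_xy, r
    raise ValueError("no consistent assignment of the two lines")

# First problem: 5x - 6y + 90 = 0 and 15x - 8y - 130 = 0.
result1 = identify_regression_lines((5, -6, 90), (15, -8, -130))
# Second problem: coefficients 14, 12 and 12, 21 (constants do not matter here).
result2 = identify_regression_lines((14, 12, 3), (12, 21, 10))
print(round(result1[4], 3), round(result2[4], 3))
```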

Problem: The regression lines are $y = ax + b$ and $x = cy + d$. If the two variables have the same mean, show that $d(1 - a) = b(1 - c)$.

Solution:

The means of x and y are obtained by solving the regression lines for x and y.

Here the first line is $y = ax + b$ --- (1) and the second is $x = cy + d$ --- (2), that is, $y = \dfrac{1}{c}x - \dfrac{d}{c}$ --- (3).

(3) and (1) $\Rightarrow ax + b = \dfrac{1}{c}x - \dfrac{d}{c}$

$$\Rightarrow x = \left(b + \frac{d}{c}\right) \Big/ \left(\frac{1}{c} - a\right) = \frac{bc + d}{1 - ac}$$

(1) $\Rightarrow y = a\,\dfrac{bc + d}{1 - ac} + b = \dfrac{ad + b}{1 - ac}$

This implies $\bar{x} = \dfrac{bc + d}{1 - ac}$ and $\bar{y} = \dfrac{ad + b}{1 - ac}$.

If the means of the variables are equal, we can write $\dfrac{bc + d}{1 - ac} = \dfrac{ad + b}{1 - ac}$.

This gives $bc + d = ad + b \Rightarrow d - ad = b - bc$

$\Rightarrow (1 - a)\,d = b\,(1 - c)$.

Problem: The variables x and y satisfy the relation $ax + by + c = 0$. Show that the correlation between x and y is -1 or +1, according as a and b are of the same sign or not.

Solution:
Since the variables satisfy the relation $ax + by + c = 0$, we can write this relation as a line of the form y on x, $y = -\dfrac{a}{b}x - \dfrac{c}{b}$, and as a line of the form x on y, $x = -\dfrac{b}{a}y - \dfrac{c}{a}$. The regression coefficients of y on x and x on y are then identified as $-\dfrac{a}{b}$ and $-\dfrac{b}{a}$ respectively. The magnitude of the coefficient of correlation is obtained as the geometric mean of the regression coefficients, $\sqrt{\left(-\dfrac{a}{b}\right)\left(-\dfrac{b}{a}\right)} = 1$. The correlation coefficient is therefore +1 or -1 according as the regression coefficients are positive or negative.

The regression coefficients $-\dfrac{a}{b}$ and $-\dfrac{b}{a}$ are positive when a and b have different signs, and negative when a and b have the same sign. Hence the coefficient of correlation is -1 or +1, according as a and b are of the same sign or not.

2.8. Rank correlation coefficient

When we consider two characteristics that are qualitative in nature, it is not possible to measure them numerically. For example, consider ability in drawing (let it be X) and ability in music (let it be Y). It is not possible to measure the values of X and Y numerically for an individual. But if there are n individuals, it is possible to rank them according to their ability in drawing (X) and according to their ability in music (Y). If the two characteristics have a high positive correlation, the ranks obtained by the individuals on X and on Y will be in nearly the same order. If the two characteristics have a high negative correlation, the ranks on X and on Y will be in nearly the reverse order. Using the ranks obtained by the n individuals on the characteristics X and Y, a method of finding the coefficient of correlation was derived by C. Spearman in 1904. The coefficient of correlation between two characteristics calculated from the ranks is known as Spearman's Rank Correlation Coefficient.

Let there be n individuals ranked according to the two qualitative characteristics considered. Let $(x_i, y_i)$ denote the ranks of the i-th individual according to the two characteristics. So the $x_i$ and $y_i$ values are the numbers from 1 to n.

Since the $x_i$ values are the numbers from 1 to n, the mean of the x values is

$$\bar{x} = \frac{\text{sum of first } n \text{ natural numbers}}{n} = \frac{1}{n}\cdot\frac{n(n+1)}{2} = \frac{n+1}{2}$$

Similarly,

$$\bar{y} = \frac{\text{sum of first } n \text{ natural numbers}}{n} = \frac{1}{n}\cdot\frac{n(n+1)}{2} = \frac{n+1}{2}$$

Variance of the $x_i$ values,

$$\sigma_x^2 = \frac{\text{sum of squares of first } n \text{ natural numbers}}{n} - \left(\frac{n+1}{2}\right)^2 = \frac{1}{n}\cdot\frac{n(n+1)(2n+1)}{6} - \left(\frac{n+1}{2}\right)^2 = \frac{n^2 - 1}{12}$$

Similarly, $\sigma_y^2 = \dfrac{n^2 - 1}{12}$.

Let $d_i = x_i - y_i$. This gives $\bar{d} = \bar{x} - \bar{y} = 0$.

Variance of the d values,

$$\sigma_d^2 = \frac{1}{n}\sum_{i=1}^{n} d_i^2 - (\bar{d})^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - y_i)^2 - 0 = \frac{1}{n}\sum_{i=1}^{n} d_i^2$$

Since $\bar{x} = \bar{y}$, we can rewrite $\dfrac{1}{n}\sum d_i^2$ as

$$\frac{1}{n}\sum_{i=1}^{n} d_i^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x} + \bar{y} - y_i)^2 = \frac{1}{n}\sum_{i=1}^{n}\left[(x_i - \bar{x}) - (y_i - \bar{y})\right]^2$$

$$\Rightarrow \frac{1}{n}\sum_{i=1}^{n} d_i^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2 + \frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^2 - 2\,\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$$

$$\Rightarrow \frac{1}{n}\sum_{i=1}^{n} d_i^2 = \sigma_x^2 + \sigma_y^2 - 2\,\mathrm{cov}(x, y)$$

But we have $\mathrm{cov}(x, y) = r\,\sigma_x \sigma_y$, where r is the coefficient of correlation. Hence,

$$\frac{1}{n}\sum_{i=1}^{n} d_i^2 = \sigma_x^2 + \sigma_y^2 - 2 r\,\sigma_x \sigma_y$$

Since $\sigma_x^2 = \sigma_y^2 = \dfrac{n^2 - 1}{12}$, we get

$$\frac{1}{n}\sum_{i=1}^{n} d_i^2 = \frac{n^2 - 1}{12} + \frac{n^2 - 1}{12} - 2r\sqrt{\frac{n^2 - 1}{12}}\sqrt{\frac{n^2 - 1}{12}}$$

$$\Rightarrow \frac{1}{n}\sum_{i=1}^{n} d_i^2 = 2\cdot\frac{n^2 - 1}{12} - 2r\cdot\frac{n^2 - 1}{12}$$

$$\Rightarrow \frac{1}{n}\sum_{i=1}^{n} d_i^2 = \frac{n^2 - 1}{6}(1 - r)$$

$$\Rightarrow 1 - r = \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$$

or the coefficient of correlation is

$$r = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}.$$

Problem: The following are the ranks obtained by 10 students in Statistics and Mathematics.
Statistics:  1 2 3 4 5 6 7 8 9 10
Mathematics: 1 4 2 5 3 9 7 10 6 8
To what extent is the knowledge of the students in the two subjects related?
Solution:
We have to find the rank correlation coefficient of the ranks in Statistics and Mathematics. The rank correlation coefficient is defined as

$$r = 1 - \frac{6\sum_i d_i^2}{n(n^2 - 1)},$$

where $d_i$ is the difference in ranks.
The calculations are:

Rank in Stat. x_i   Rank in Maths y_i   d_i = x_i − y_i   d_i²
        1                   1                  0            0
        2                   4                 -2            4
        3                   2                  1            1
        4                   5                 -1            1
        5                   3                  2            4
        6                   9                 -3            9
        7                   7                  0            0
        8                  10                 -2            4
        9                   6                  3            9
       10                   8                  2            4
Total                                                      36

Hence, $r = 1 - \dfrac{6\sum_i d_i^2}{n(n^2 - 1)} = 1 - \dfrac{6 \times 36}{10(10^2 - 1)} = 1 - 0.2182 = 0.7818$
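The rank correlation formula translates directly into code. The following is a minimal sketch for the case without tied ranks; `spearman_from_ranks` is a hypothetical helper name.

```python
def spearman_from_ranks(rank_x, rank_y):
    """Spearman's r = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)), assuming no ties."""
    n = len(rank_x)
    if n < 2 or n != len(rank_y):
        raise ValueError("need two rank lists of equal length, n >= 2")
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))

# Ranks of the 10 students in Statistics and Mathematics.
stat_rank = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
maths_rank = [1, 4, 2, 5, 3, 9, 7, 10, 6, 8]
r = spearman_from_ranks(stat_rank, maths_rank)
print(round(r, 4))
```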

Problem: 10 competitors in a music test were ranked by three judges A, B and C in the following order.
Ranks by A: 1 6 5 10 3 2 4 9 7 8
Ranks by B: 3 5 8 4 7 10 2 1 6 9
Ranks by C: 6 4 9 8 1 2 3 10 5 7
Discuss which pair of judges has the nearest approach to common likings in music.

Solution:

We have to find the rank correlation coefficient between each pair of judges from the ranks they have given, and identify the pair of judges with the highest correlation coefficient. That pair is considered to have the nearest approach to common likings in music.

The calculations follow:

Ranks by A   Ranks by B   Ranks by C
   x_i          y_i          z_i      x_i − y_i   x_i − z_i   y_i − z_i   (x_i − y_i)²   (x_i − z_i)²   (y_i − z_i)²
    1            3            6          -2          -5          -3            4             25              9
    6            5            4           1           2           1            1              4              1
    5            8            9          -3          -4          -1            9             16              1
   10            4            8           6           2          -4           36              4             16
    3            7            1          -4           2           6           16              4             36
    2           10            2          -8           0           8           64              0             64
    4            2            3           2           1          -1            4              1              1
    9            1           10           8          -1          -9           64              1             81
    7            6            5           1           2           1            1              4              1
    8            9            7          -1           1           2            1              1              4
Total                                                                        200             60            214

Rank correlation between A and B, $r = 1 - \dfrac{6\sum_i d_i^2}{n(n^2 - 1)} = 1 - \dfrac{6 \times 200}{10(10^2 - 1)} = -0.212$

Rank correlation between A and C, $r = 1 - \dfrac{6\sum_i d_i^2}{n(n^2 - 1)} = 1 - \dfrac{6 \times 60}{10(10^2 - 1)} = 0.6364$

Rank correlation between B and C, $r = 1 - \dfrac{6\sum_i d_i^2}{n(n^2 - 1)} = 1 - \dfrac{6 \times 214}{10(10^2 - 1)} = -0.297$

It can be observed that the judges A and C have the nearest approach to common likings in music.
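The pairwise comparison of the three judges can be automated. The sketch below is illustrative; the dictionary layout and the helper name `spearman_from_ranks` are assumptions, and the formula used is the no-ties rank correlation formula of this section.

```python
from itertools import combinations

def spearman_from_ranks(rank_a, rank_b):
    """Spearman's r = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)), assuming no ties."""
    n = len(rank_a)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))

# Ranks given by the three judges to the 10 competitors.
ranks = {
    "A": [1, 6, 5, 10, 3, 2, 4, 9, 7, 8],
    "B": [3, 5, 8, 4, 7, 10, 2, 1, 6, 9],
    "C": [6, 4, 9, 8, 1, 2, 3, 10, 5, 7],
}

# Rank correlation for every pair of judges.
pairwise = {
    (p, q): spearman_from_ranks(ranks[p], ranks[q])
    for p, q in combinations(ranks, 2)
}
closest_pair = max(pairwise, key=pairwise.get)  # pair with the highest r
for pair, value in pairwise.items():
    print(pair, round(value, 4))
print("closest:", closest_pair)
```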
Problem: Find the rank correlation coefficient for the following data:
X: 92 89 87 86 84 77 71 63 53 50
Y: 86 83 91 77 68 85 52 82 37 57
Solution:

First, the given values of X and Y should be ranked. If an observation repeats, the sum of the ranks is divided equally among the tied observations. (For example, if a number a occupies the 6th and 7th positions when the observations are arranged in order, both occurrences of a are assigned the rank 6.5.)

Here the observations are ranked in descending order. Then we find the rank correlation coefficient.

         x     y   Rank of X, x_i   Rank of Y, y_i   x_i − y_i   (x_i − y_i)²
        92    86         1                2             -1            1
        89    83         2                4             -2            4
        87    91         3                1              2            4
        86    77         4                6             -2            4
        84    68         5                7             -2            4
        77    85         6                3              3            9
        71    52         7                9             -2            4
        63    82         8                5              3            9
        53    37         9               10             -1            1
        50    57        10                8              2            4
Total                                                                44

Rank correlation coefficient, $r = 1 - \dfrac{6\sum_i d_i^2}{n(n^2 - 1)} = 1 - \dfrac{6 \times 44}{10(10^2 - 1)} = 0.733$

Rank correlation coefficient when equal ranks (Tied ranks):

It may be noted that Spearman's rank correlation formula is derived on the assumption that all the ranks are different. In practice, however, there are many situations where more than one individual gets the same rank. Consider a competition in which three individuals receive the 3rd rank. They would have been given the 3rd, 4th and 5th ranks if there had been slight differences in the evaluation. So we add 3, 4 and 5, which gives 12, and divide 12 equally among these three individuals. Hence we assign the rank 4 to each of them. In such situations it is more accurate to calculate Pearson's coefficient of correlation between the ranks directly, after assigning the average rank to those with the same rank. But there is also a modified formula for Spearman's rank correlation coefficient, which is as follows:

$$r = 1 - \frac{6\left[\sum_{i=1}^{n} d_i^2 + \frac{1}{12}\sum_{i} m_i(m_i^2 - 1) + \frac{1}{12}\sum_{j} m_j(m_j^2 - 1)\right]}{n(n^2 - 1)},$$

where $m_i$ stands for the number of times the i-th rank repeats in the x series of ranks and $m_j$ is the number of times the j-th rank repeats in the y series of ranks when the average ranks are assigned. The method is illustrated below:

Obtain the rank correlation coefficient for the following data:

X: 15 20 28 12 40 60 20 80
Y: 40 30 50 30 20 10 30 60

Illustration:

First we assign ranks to the X and Y values. Here we have 8 pairs of observations, that is, n = 8.

The ranks are:

X: 7 5.5 4 8 3 2 5.5 1
Y: 3 5 2 5 7 8 5 1

In the X values, 20 occurs twice, with possible ranks 5 and 6. Hence their average, 5.5, is assigned to both occurrences. Similarly, in the Y values, 30 occurs three times, with possible ranks 4, 5 and 6. Hence their average, 5, is assigned as the rank of each 30. Now the differences in ranks, $d_i = X_i - Y_i$, are:

d_i : 4 0.5 2 3 -4 -6 0.5 0
d_i²: 16 0.25 4 9 16 36 0.25 0

This gives $\sum_i d_i^2 = 81.50$.

$m_i = 2$ (because in the X values only the value 20 repeats, twice) and $m_j = 3$ (because in the Y values only the value 30 repeats, three times).

Hence,

$$r = 1 - \frac{6\left[\sum_{i=1}^{n} d_i^2 + \frac{1}{12}\sum_{i} m_i(m_i^2 - 1) + \frac{1}{12}\sum_{j} m_j(m_j^2 - 1)\right]}{n(n^2 - 1)}$$

$$= 1 - \frac{6\left[81.50 + \frac{1}{12}\,2(2^2 - 1) + \frac{1}{12}\,3(3^2 - 1)\right]}{8(8^2 - 1)}$$

$$= 1 - \frac{6\,(81.50 + 0.5 + 2)}{8 \times 63} = 0.$$
8  63

2.9. Partial and Multiple Correlations:

In a statistical study involving many variables, whenever we are interested in the joint effect of a group of variables upon a variable not included in that group, our study is one of multiple correlation and multiple regression.

For example, in a study of the yield of a crop per acre (let it be $X_1$), the value of the variable $X_1$ is a joint effect of the variables quality of seed ($X_2$), fertility of soil ($X_3$), fertilizer used ($X_4$), irrigation facilities ($X_5$), weather conditions ($X_6$) and so on.

If we are considering the relation between two variables only, there are two alternatives:

(i) We consider only those members of the observed data in which the other variables have specified values. Or,

(ii) We may eliminate mathematically the effect of the other variables on the two variables under consideration.

[The first method has the disadvantage that it limits the size of the data, and it is applicable only to data in which the other variables have assigned values.]

In the second method it may not be possible to eliminate the entire influence of the other variables, but the linear effect can easily be eliminated. The correlation and regression between two variables after eliminating the linear effects of the other variables are called partial correlation and partial regression.

Let us limit our discussion to three variables $X_1$, $X_2$ and $X_3$.

The equation of the plane of regression of $X_1$ on $X_2$ and $X_3$ is

$$X_1 = a + b_{12.3} X_2 + b_{13.2} X_3 \qquad (1)$$

Let the observations on $X_1$, $X_2$ and $X_3$ be measured from their respective means, i.e., $X_1 = x_{1i} - \bar{x}_1$, $X_2 = x_{2i} - \bar{x}_2$ and $X_3 = x_{3i} - \bar{x}_3$.

Then $\sum (x_{1i} - \bar{x}_1) = \sum (x_{2i} - \bar{x}_2) = \sum (x_{3i} - \bar{x}_3) = 0$. That is, $\sum X_1 = \sum X_2 = \sum X_3 = 0$.

Taking summation on (1), we get a = 0.

Then (1) implies

$$X_1 = b_{12.3} X_2 + b_{13.2} X_3 \qquad (2)$$

The coefficients $b_{12.3}$ and $b_{13.2}$ are the partial regression coefficients of $X_1$ on $X_2$ and of $X_1$ on $X_3$ respectively.

$e_{1.23} = b_{12.3} X_2 + b_{13.2} X_3$ is called the estimate of $X_1$ as given by the equation of the plane of regression (2).

The quantity $X_{1.23} = X_1 - b_{12.3} X_2 - b_{13.2} X_3$ is called the error of estimate or residual. In the subscript of the residual $X_{1.23}$, the subscript before the '.', i.e. 1, is known as the primary subscript, and those after the '.', i.e. 2 and 3, are called the secondary subscripts.

The order of a regression coefficient is determined by the number of secondary subscripts. For example, $b_{12.3}$ is a regression coefficient of order 1. In $b_{12.3}$, $X_2$ is independent and $X_1$ is dependent. In $b_{21.3}$, $X_1$ is independent and $X_2$ is dependent.

In the equation of the plane of regression given in (2), the constants b are determined by the principle of least squares.

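Sum of squares of residuals,

$$S = \sum X_{1.23}^2 = \sum \left(X_1 - b_{12.3} X_2 - b_{13.2} X_3\right)^2$$

$$\frac{\partial S}{\partial b_{12.3}} = 0 \Rightarrow -2\sum X_2\left(X_1 - b_{12.3} X_2 - b_{13.2} X_3\right) = 0$$

$$\frac{\partial S}{\partial b_{13.2}} = 0 \Rightarrow -2\sum X_3\left(X_1 - b_{12.3} X_2 - b_{13.2} X_3\right) = 0$$

$\Rightarrow \sum X_2 X_{1.23} = 0$ and $\sum X_3 X_{1.23} = 0$

$$\Rightarrow \left.\begin{aligned} \sum X_1 X_2 - b_{12.3}\sum X_2^2 - b_{13.2}\sum X_2 X_3 &= 0 \\ \sum X_1 X_3 - b_{12.3}\sum X_2 X_3 - b_{13.2}\sum X_3^2 &= 0 \end{aligned}\right\} \qquad (3)$$

Since the $X_i$'s are measured from their respective means, we have $\sigma_i^2 = \dfrac{1}{N}\sum X_i^2$, $\mathrm{cov}(X_i, X_j) = \dfrac{1}{N}\sum X_i X_j$ and $r_{ij} = \dfrac{\mathrm{cov}(X_i, X_j)}{\sigma_i \sigma_j} = \dfrac{\sum X_i X_j}{N \sigma_i \sigma_j}$.

Hence the equations given in (3) give

$$\left.\begin{aligned} r_{12}\sigma_1\sigma_2 - b_{12.3}\sigma_2^2 - b_{13.2}\, r_{23}\sigma_2\sigma_3 &= 0 \\ r_{13}\sigma_1\sigma_3 - b_{12.3}\, r_{23}\sigma_2\sigma_3 - b_{13.2}\sigma_3^2 &= 0 \end{aligned}\right\} \qquad (4)$$

(4) $\Rightarrow$

$$r_{12}\sigma_1 - b_{12.3}\sigma_2 - b_{13.2}\, r_{23}\sigma_3 = 0$$

$$r_{13}\sigma_1 - b_{12.3}\, r_{23}\sigma_2 - b_{13.2}\sigma_3 = 0$$

Solving these equations, we get

$$b_{12.3} = \frac{\begin{vmatrix} r_{12}\sigma_1 & r_{23}\sigma_3 \\ r_{13}\sigma_1 & \sigma_3 \end{vmatrix}}{\begin{vmatrix} \sigma_2 & r_{23}\sigma_3 \\ r_{23}\sigma_2 & \sigma_3 \end{vmatrix}} = \frac{\sigma_1}{\sigma_2}\cdot\frac{\begin{vmatrix} r_{12} & r_{23} \\ r_{13} & 1 \end{vmatrix}}{\begin{vmatrix} 1 & r_{23} \\ r_{23} & 1 \end{vmatrix}}$$

and

$$b_{13.2} = \frac{\begin{vmatrix} \sigma_2 & r_{12}\sigma_1 \\ r_{23}\sigma_2 & r_{13}\sigma_1 \end{vmatrix}}{\begin{vmatrix} \sigma_2 & r_{23}\sigma_3 \\ r_{23}\sigma_2 & \sigma_3 \end{vmatrix}} = \frac{\sigma_1}{\sigma_3}\cdot\frac{\begin{vmatrix} 1 & r_{12} \\ r_{23} & r_{13} \end{vmatrix}}{\begin{vmatrix} 1 & r_{23} \\ r_{23} & 1 \end{vmatrix}}$$

If we write $\omega = \begin{vmatrix} 1 & r_{12} & r_{13} \\ r_{21} & 1 & r_{23} \\ r_{31} & r_{32} & 1 \end{vmatrix}$, and $\omega_{ij}$ is the cofactor of the $(i, j)$-th element of $\omega$, then

$$b_{12.3} = -\frac{\sigma_1}{\sigma_2}\cdot\frac{\omega_{12}}{\omega_{11}} \quad \text{and} \quad b_{13.2} = -\frac{\sigma_1}{\sigma_3}\cdot\frac{\omega_{13}}{\omega_{11}}.$$

Now we get

$$X_1 = -\frac{\sigma_1}{\sigma_2}\cdot\frac{\omega_{12}}{\omega_{11}} X_2 - \frac{\sigma_1}{\sigma_3}\cdot\frac{\omega_{13}}{\omega_{11}} X_3$$

$$\Rightarrow \frac{X_1}{\sigma_1}\,\omega_{11} + \frac{X_2}{\sigma_2}\,\omega_{12} + \frac{X_3}{\sigma_3}\,\omega_{13} = 0.$$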

2.10. Properties of residuals

(i) The sum of the products of any residual of order zero with any other residual of higher order is zero, provided the subscript of the former occurs among the secondary subscripts of the latter.

(ii) $\sum X_{1.2} X_{1.23} = \sum X_1 X_{1.23} = \sum X_{1.23}^2$

(iii) The sum of the products of two residuals is zero if all the subscripts (primary as well as secondary) of the one occur among the secondary subscripts of the other. E.g., $\sum X_{1.2} X_{3.12} = 0$, $\sum X_{2.3} X_{1.23} = 0$.

2.11. Coefficient of multiple correlations

Consider N observations on the variables $X_1$, $X_2$ and $X_3$. The multiple correlation of $X_1$ on $X_2$ and $X_3$, usually denoted by $R_{1.23}$, is the simple correlation coefficient between $X_1$ and the joint effect of $X_2$ and $X_3$ on $X_1$. In other words, $R_{1.23}$ is the correlation coefficient between $X_1$ and its estimated value as given by the plane of regression of $X_1$ on $X_2$ and $X_3$.

That is, $R_{1.23} = \dfrac{\mathrm{cov}(X_1, e_{1.23})}{\sqrt{V(X_1)\,V(e_{1.23})}}$, which is derived as

$$R_{1.23}^2 = \frac{r_{12}^2 + r_{13}^2 - 2 r_{12} r_{13} r_{23}}{1 - r_{23}^2}$$

The multiple correlation coefficient measures the closeness of the association between the observed values and the expected values of a variable obtained from the multiple linear regression of that variable on the other variables. It can be proved that $0 \le R_{1.23} \le 1$.
If $R_{1.23} = 1$, the association is perfect and all the predicted values of $X_1$ coincide with the observed values of $X_1$.
If $R_{1.23} = 0$, then $X_1$ is completely uncorrelated with the predicted values of $X_1$. That is, the regression equation fails to throw any light on the value of $X_1$ when $X_2$ and $X_3$ are known.

2.12. Coefficient of partial correlation

The correlation coefficient between $X_1$ and $X_2$ after the linear effect of $X_3$ on each of them has been eliminated is called the partial correlation coefficient of $X_1$ and $X_2$.

$X_{1.3} = X_1 - b_{13} X_3$ may be regarded as the part of the variable $X_1$ which remains after the linear effect of $X_3$ has been eliminated.

Similarly, $X_{2.3} = X_2 - b_{23} X_3$ is the part of $X_2$ obtained after eliminating the linear effect of $X_3$.

The partial correlation between $X_1$ and $X_2$, denoted by $r_{12.3}$, is given by

$$r_{12.3} = \frac{\mathrm{cov}(X_{1.3}, X_{2.3})}{\sqrt{V(X_{1.3})\,V(X_{2.3})}}.$$

This is derived as

$$r_{12.3} = \frac{r_{12} - r_{13} r_{23}}{\sqrt{(1 - r_{13}^2)(1 - r_{23}^2)}}.$$

In a similar way the expressions for $r_{13.2}$ and $r_{23.1}$ can be obtained.

Problem: For the variables $X_1$, $X_2$ and $X_3$, it is given that $\sigma_1^2 = 2$, $\sigma_2^2 = \sigma_3^2 = 3$, $r_{12} = 0.7$, $r_{23} = r_{31} = 0.5$. Find (i) $r_{23.1}$ (ii) $R_{1.23}$ and (iii) $b_{13.2}$.

Solution:

(i) We have $r_{23.1} = \dfrac{r_{23} - r_{21} r_{31}}{\sqrt{(1 - r_{21}^2)(1 - r_{31}^2)}}$

Hence, $r_{23.1} = \dfrac{0.5 - 0.7 \times 0.5}{\sqrt{(1 - 0.7^2)(1 - 0.5^2)}} = 0.2425$.

(ii) $R_{1.23}^2 = \dfrac{r_{12}^2 + r_{13}^2 - 2 r_{12} r_{13} r_{23}}{1 - r_{23}^2}$

Hence, $R_{1.23}^2 = \dfrac{0.7^2 + 0.5^2 - 2 \times 0.7 \times 0.5 \times 0.5}{1 - 0.5^2} = 0.52 \Rightarrow R_{1.23} = 0.721$.

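(iii) $b_{13.2} = \dfrac{\sigma_1}{\sigma_3}\cdot\dfrac{\begin{vmatrix} 1 & r_{12} \\ r_{23} & r_{13} \end{vmatrix}}{\begin{vmatrix} 1 & r_{23} \\ r_{23} & 1 \end{vmatrix}}$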
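The three quantities asked for in this problem can be evaluated with a few helper functions. This is a hedged sketch: the function names are hypothetical, the given data are read as variances ($\sigma_1^2 = 2$, $\sigma_3^2 = 3$), and `b13_2` simply expands the two determinants in the formula for $b_{13.2}$.

```python
from math import sqrt

def partial_r(r_ab, r_ac, r_bc):
    """Partial correlation r_ab.c = (r_ab - r_ac*r_bc) / sqrt((1 - r_ac^2)(1 - r_bc^2))."""
    return (r_ab - r_ac * r_bc) / sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))

def multiple_r(r12, r13, r23):
    """Multiple correlation R_1.23 from the three simple correlations."""
    r_squared = (r12 ** 2 + r13 ** 2 - 2 * r12 * r13 * r23) / (1 - r23 ** 2)
    return sqrt(r_squared)

def b13_2(sigma1, sigma3, r12, r13, r23):
    """b_13.2 = (sigma1/sigma3) * (r13 - r12*r23) / (1 - r23^2), from the determinants."""
    return (sigma1 / sigma3) * (r13 - r12 * r23) / (1 - r23 ** 2)

r12, r23, r13 = 0.7, 0.5, 0.5
# Assumption: the problem gives variances, so sigma is the square root.
sigma1, sigma3 = sqrt(2), sqrt(3)

r23_1 = partial_r(r23, r12, r13)   # (i): a = 2, b = 3, c = 1, so r_ac = r21, r_bc = r31
R1_23 = multiple_r(r12, r13, r23)  # (ii)
b = b13_2(sigma1, sigma3, r12, r13, r23)  # (iii)
print(round(r23_1, 4), round(R1_23, 3), round(b, 4))
```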
