MAT2001 Statistics for Engineers

Module 3: Correlation and Regression

Syllabus

Correlation and Regression - Rank Correlation - Partial and Multiple Correlation - Multiple Regression.

Correlation

Correlation Coefficient

As the variance $E\{X - E(X)\}^2$ measures the variation of the R.V. X from its mean value E(X), the quantity $E\{[X - E(X)][Y - E(Y)]\}$ measures the simultaneous variation of two R.V.'s X and Y from their respective means, and hence it is called the covariance of X and Y, denoted Cov(X, Y).

$$\mathrm{Cov}(X, Y) = E\{[X - E(X)][Y - E(Y)]\}$$

is also called the product moment of X and Y and is also denoted as $p(X, Y)$.

The ratio $\dfrac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}$ is a measure of the intensity of the linear relationship between X and Y and is called Karl Pearson's Product Moment Correlation Coefficient, or simply the correlation coefficient between X and Y. It is denoted by $r(X, Y)$ or $r_{XY}$ or simply $r$.

Thus

$$r_{XY} = \frac{E\{[X - E(X)][Y - E(Y)]\}}{\sqrt{E\{X - E(X)\}^2 \, E\{Y - E(Y)\}^2}} \tag{1}$$

since $\sigma_X$, the standard deviation of X, is the positive square root of the variance of X. Expanding the expectations,

$$r_{XY} = \frac{E(XY) - E(X)\,E(Y)}{\sqrt{\{E(X^2) - E^2(X)\}\{E(Y^2) - E^2(Y)\}}}$$

For n observed pairs of values (x, y), this becomes

$$r_{XY} = \frac{\dfrac{\sum xy}{n} - \dfrac{\sum x}{n}\cdot\dfrac{\sum y}{n}}{\sqrt{\left\{\dfrac{\sum x^2}{n} - \left(\dfrac{\sum x}{n}\right)^2\right\}\left\{\dfrac{\sum y^2}{n} - \left(\dfrac{\sum y}{n}\right)^2\right\}}} = \frac{n\sum xy - \sum x \sum y}{\sqrt{\{n\sum x^2 - (\sum x)^2\}\{n\sum y^2 - (\sum y)^2\}}}$$

Properties of Correlation Coefficient

1. $-1 \le r_{XY} \le 1$, or equivalently $|\mathrm{Cov}(X, Y)| \le \sigma_X \sigma_Y$.

Note: When $0 < r_{XY} \le 1$, the correlation between X and Y is said to be positive or direct. When $-1 \le r_{XY} < 0$, the correlation is said to be negative or inverse. When $-1 \le r_{XY} \le -0.5$ or $0.5 \le r_{XY} \le 1$, the correlation is assumed to be high; otherwise the correlation is assumed to be poor.

2. The correlation coefficient is independent of change of origin and scale.

Example: Compute the coefficient of correlation between X and Y using the following data, and comment on the nature of the correlation.

X: 65 67 66 71 67 70 68 69
Y: 67 68 68 70 64 67 72 70

Solution: We effect a change of origin in respect of both X and Y. The new origins are chosen at or near the average of the extreme values. Thus we take $\frac{65 + 71}{2} = 68$ as the new origin for X and $\frac{64 + 72}{2} = 68$ as the new origin for Y, viz., we put $u_i = x_i - 68$ and $v_i = y_i - 68$.

u: -3 -1 -2 3 -1 2 0 1
v: -1 0 0 2 -4 -1 4 2

so that $\sum u = -1$, $\sum v = 2$, $\sum u^2 = 29$, $\sum v^2 = 42$ and $\sum uv = 13$. Since r is unaffected by the change of origin,

$$r_{XY} = r_{UV} = \frac{n\sum uv - \sum u \sum v}{\sqrt{\{n\sum u^2 - (\sum u)^2\}\{n\sum v^2 - (\sum v)^2\}}} = \frac{8 \times 13 - (-1) \times 2}{\sqrt{(8 \times 29 - 1)(8 \times 42 - 4)}} = \frac{106}{\sqrt{231 \times 332}} \approx 0.383$$

Since $0 < r_{XY} < 0.5$, the correlation between X and Y is positive but poor.

Exercise: Find the coefficient of correlation between X and Y using the following data.

X: [values not legible in the source]
Y: 16 19 23 26 30

Rank Correlation Coefficient

Sometimes the actual numerical values of X and Y may not be available, and only the positions of the values arranged in order of merit (ranks) are known. The ranks of X and Y will, in general, be different and hence may be considered as random variables. Let them be denoted by U and V. The correlation coefficient between U and V is called the rank correlation coefficient between (the ranks of) X and Y and is denoted by $\rho_{XY}$.

Since U represents the ranks of n values of X, U takes the values 1, 2, 3, ..., n. Similarly, V takes the same values 1, 2, 3, ..., n in a different order. Writing $D = U - V$ for the difference between the ranks of a pair, the formula for $\rho_{XY}$ (or $r_{UV}$) is

$$\rho_{XY} = 1 - \frac{6\sum d^2}{n(n^2 - 1)}$$

Note: This formula for the rank correlation coefficient is known as Spearman's formula. The values of $r_{XY}$ and $\rho_{XY}$ (or $r_{UV}$) will, in general, be different.

Example: Ten students got the following percentage of marks in Mathematics and Physical Sciences:

Student: 1 2 3 4 5 6 7 8 9 10
Marks in Mathematics: 78 36 98 25 75 82 90 62 65 39
Marks in Phy. Sciences: 84 51 91 60 68 62 86 58 63 47

Calculate the rank correlation coefficient.

Solution: Denoting the ranks in Mathematics and in Phy. Sciences by U and V (rank 1 for the highest mark), we have the following values of U and V:

Student: 1 2 3 4 5 6 7 8 9 10
U: 4 9 1 10 5 3 2 7 6 8
V: 3 9 1 7 4 6 2 8 5 10
D = U - V: 1 0 0 3 1 -3 0 -1 1 -2
D²: 1 0 0 9 1 9 0 1 1 4

Hence $\sum d^2 = 26$ and

$$\rho_{XY} = 1 - \frac{6 \times 26}{10 \times 99} = 1 - \frac{156}{990} \approx 0.8424$$

Exercise: Ten competitors in a beauty contest were ranked by three judges as follows:

Competitor: 1 2 3 4 5 6 7 8 9 10
Judge A: 1 6 5 10 3 2 4 9 7 8
Judge B: 3 5 8 4 7 10 2 1 6 9
Judge C: 6 4 9 8 1 2 3 10 5 7

Discuss which pair of judges has the nearest approach to common taste of beauty.
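The two worked examples above (product-moment correlation with a change of origin, and Spearman's rank correlation) can be checked with a short script. This is a minimal sketch, not part of the original notes: the function names `pearson_r`, `ranks` and `spearman_rho` are illustrative, the data lists are the example values as read from the tables above (an assumption where the printed tables are hard to read), and the ranking helper assumes there are no tied values.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's r from raw sums:
    (n*Sxy - Sx*Sy) / sqrt({n*Sxx - Sx^2} * {n*Syy - Sy^2})."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    return (n * sxy - sx * sy) / sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))


def ranks(values):
    """Rank 1 goes to the largest value. Assumes no ties."""
    order = sorted(values, reverse=True)
    return [order.index(v) + 1 for v in values]


def spearman_rho(xs, ys):
    """Spearman's formula: 1 - 6*sum(d^2) / (n*(n^2 - 1)), d = rank difference."""
    n = len(xs)
    d2 = sum((u - v) ** 2 for u, v in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Data of the first worked example (assumed readings of the table).
x = [65, 67, 66, 71, 67, 70, 68, 69]
y = [67, 68, 68, 70, 64, 67, 72, 70]

r_raw = pearson_r(x, y)
# Property 2: shifting both origins to 68 must leave r unchanged.
r_shifted = pearson_r([xi - 68 for xi in x], [yi - 68 for yi in y])

# Data of the rank-correlation example (assumed readings of the table).
maths = [78, 36, 98, 25, 75, 82, 90, 62, 65, 39]
physics = [84, 51, 91, 60, 68, 62, 86, 58, 63, 47]
rho = spearman_rho(maths, physics)

print(round(r_raw, 4), round(r_shifted, 4), round(rho, 4))
```

Because the ranks here contain no ties, applying `pearson_r` to the two rank lists gives the same value as Spearman's formula, which is a quick way to see that the rank correlation coefficient is just the product-moment coefficient of the ranks.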
Regression

When the random variables X and Y are linearly correlated, the points plotted on the scatter diagram, corresponding to n pairs of observed values of X and Y, will have a tendency to cluster round a straight line. This straight line is called the regression line. The regression line can be taken as the best-fitting straight line for the observed pairs of values of X and Y in the least-squares sense, with which the students are familiar.

When two R.V.'s X and Y are linearly correlated, we may not know which variable takes independent values. If we treat X as the independent variable and hence assume that the values of Y depend on those of X, the regression line is called the regression line of Y on X. If we assume that the values of X depend on those of the independent variable Y, the regression line of X on Y is obtained. Thus, in situations where the distinction cannot be made between the R.V.'s X and Y as to which is the independent variable and which is the dependent variable, there will be two regression lines. However, when the value of Y (X) is to be predicted corresponding to a specified value of X (Y), we should make use of the regression line of Y (X) on X (Y).

Equation of the Regression Line of Y on X:

Let the regression line of Y on X be

$$y = ax + b \tag{1}$$

By the principle of least squares, the normal equations which give the values of a and b are

$$\sum y_i = a\sum x_i + nb \tag{2}$$

and

$$\sum x_i y_i = a\sum x_i^2 + b\sum x_i \tag{3}$$

Dividing equation (2) by n, we get

$$\bar{y} = a\bar{x} + b \tag{4}$$

Eliminating b between (2) and (3) gives $a = \dfrac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^2 - (\sum x_i)^2} = \dfrac{p_{XY}}{\sigma_X^2}$, where $p_{XY} = \mathrm{Cov}(X, Y)$. Subtracting (4) from (1) then gives the equation of the regression line of Y on X as

$$y - \bar{y} = \frac{p_{XY}}{\sigma_X^2}(x - \bar{x}) \quad\text{or}\quad y - \bar{y} = \frac{r\,\sigma_Y}{\sigma_X}(x - \bar{x})$$

Equation of the Regression Line of X on Y:

In a similar manner, assuming the equation of the regression line of X on Y as $x = ay + b$ and using the corresponding normal equations, we get the equation of the regression line of X on Y as

$$x - \bar{x} = \frac{p_{XY}}{\sigma_Y^2}(y - \bar{y}) \quad\text{or}\quad x - \bar{x} = \frac{r\,\sigma_X}{\sigma_Y}(y - \bar{y})$$

Note: $\dfrac{p_{XY}}{\sigma_X^2}$ or $\dfrac{r\,\sigma_Y}{\sigma_X}$ is called the regression coefficient of Y on X and is denoted by $b_1$ or $b_{YX}$. $\dfrac{p_{XY}}{\sigma_Y^2}$ or $\dfrac{r\,\sigma_X}{\sigma_Y}$ is called the regression coefficient of X on Y and is denoted by $b_2$ or $b_{XY}$. Clearly $b_1 b_2 = r_{XY}^2$, i.e., $r_{XY}$ is the geometric mean of $b_1$ and $b_2$.
Hence $r_{XY} = \pm\sqrt{b_1 b_2}$. The sign of $r_{XY}$ is the same as that of $b_1$ or $b_2$, as $b_1 = r_{XY}\dfrac{\sigma_Y}{\sigma_X}$ and $b_2 = r_{XY}\dfrac{\sigma_X}{\sigma_Y}$ have the same sign as $r_{XY}$ (since $\sigma_X$ and $\sigma_Y$ are positive). Also, $\dfrac{b_1}{b_2} = \dfrac{\sigma_Y^2}{\sigma_X^2}$.

When there is perfect linear correlation between X and Y, viz., when $r_{XY} = \pm 1$, the two regression lines coincide.

The point of intersection of the two regression lines is clearly the point whose co-ordinates are $(\bar{x}, \bar{y})$.

When there is no linear correlation between X and Y, viz., when $r_{XY} = 0$, the equations of the regression lines become $y = \bar{y}$ and $x = \bar{x}$, which are at right angles.

Example: Obtain the equations of the lines of regression from the following data:

X: 1 2 3 4 5 6 7
Y: 9 8 10 12 11 13 14

Solution: Here n = 7, $\sum x = 28$, $\sum y = 77$, $\sum x^2 = 140$, $\sum y^2 = 875$ and $\sum xy = 334$, so that $\bar{x} = 4$ and $\bar{y} = 11$.

$$b_{YX} = \frac{n\sum xy - \sum x\sum y}{n\sum x^2 - (\sum x)^2} = \frac{7 \times 334 - 28 \times 77}{7 \times 140 - 28^2} = \frac{182}{196} = \frac{13}{14} \approx 0.93$$

$$b_{XY} = \frac{n\sum xy - \sum x\sum y}{n\sum y^2 - (\sum y)^2} = \frac{182}{7 \times 875 - 77^2} = \frac{182}{196} = \frac{13}{14} \approx 0.93$$

The regression line of Y on X is $y - 11 = \frac{13}{14}(x - 4)$, i.e., $13x - 14y + 102 = 0$.

The regression line of X on Y is $x - 4 = \frac{13}{14}(y - 11)$, i.e., $14x - 13y + 87 = 0$.

Example: In a partially destroyed laboratory record of an analysis of correlation data, only the following results are legible: Variance of X = 1. The regression equations are 3x + 2y = 26 and 6x + y = 31. What were (i) the mean values of X and Y? (ii) the standard deviation of Y? and (iii) the correlation coefficient between X and Y?

Solution: (i) Since the lines of regression intersect at $(\bar{x}, \bar{y})$, we have $3\bar{x} + 2\bar{y} = 26$ and $6\bar{x} + \bar{y} = 31$. Solving these equations, we get $\bar{x} = 4$ and $\bar{y} = 7$.

(ii) Which of the two equations is the regression equation of Y on X and which one is the regression equation of X on Y is not known. Let us tentatively assume that the first equation is the regression line of X on Y and the second equation is the regression line of Y on X. Based on this assumption, the first equation can be re-written as

$$x = -\frac{2}{3}y + \frac{26}{3} \tag{1}$$

and the other as

$$y = -6x + 31 \tag{2}$$

Then $b_{XY} = -\frac{2}{3}$ and $b_{YX} = -6$, so that $r_{XY}^2 = b_{XY} \times b_{YX} = 4$ and $r_{XY} = -2$, which is absurd. Hence our tentative assumption is wrong.

Solution (continued):
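The first equation is therefore the regression line of Y on X and is re-written as

$$y = -\frac{3}{2}x + 13 \tag{3}$$

The second equation is the regression line of X on Y and is re-written as

$$x = -\frac{1}{6}y + \frac{31}{6} \tag{4}$$

Hence $b_{YX} = -\frac{3}{2}$ and $b_{XY} = -\frac{1}{6}$, so that $r_{XY}^2 = b_{YX} \times b_{XY} = \frac{1}{4}$ and $r_{XY} = -\frac{1}{2}$ (negative, since both regression coefficients are negative).

(iii) Now $\dfrac{\sigma_Y^2}{\sigma_X^2} = \dfrac{b_{YX}}{b_{XY}} = \dfrac{-3/2}{-1/6} = 9$. Since $\sigma_X^2 = 1$, we get $\sigma_Y^2 = 9$ and $\sigma_Y = 3$.

Exercise: Find the equations of the regression lines from the following data. Also estimate the value of Y when X = 71 and the value of X when Y = 70.

X: 65 66 67 67 68 69 70 72
Y: 67 68 65 68 72 72 69 71

Exercise: Obtain the equations of the regression lines from the following data, using the method of least squares. Also estimate the value of (i) Y, when X = 38 and (ii) X, when Y = 18.

X: 22 26 29 ... 31 34 35 [remaining values not legible in the source]
Y: [values not legible in the source]

Exercise: In a partially destroyed laboratory record of an analysis of correlation data, only the following results are legible. Variance of X = 9. Regression equations are 8x - 10y + 66 = 0 and 40x - 18y = 214. What were (i) the mean values of X and Y? (ii) the correlation coefficient between X and Y? and (iii) the standard deviation of Y?

Multiple and Partial Correlation

Consider a trivariate distribution of three variables $X_1, X_2, X_3$, with simple correlation coefficients $r(X_1, X_2) = r_{12} = r_{21}$, $r(X_1, X_3) = r_{13} = r_{31}$ and $r(X_2, X_3) = r_{23} = r_{32}$.

Multiple Correlation: The multiple correlation coefficient (R) of $X_1$ on $X_2$ and $X_3$ is given by

$$R_{1.23}^2 = \frac{r_{12}^2 + r_{13}^2 - 2 r_{12} r_{13} r_{23}}{1 - r_{23}^2}, \qquad 0 \le R_{1.23} \le 1$$

Partial Correlation: The partial correlation coefficient between $X_1$ and $X_2$, eliminating the linear effect of $X_3$, is

$$r_{12.3} = \frac{r_{12} - r_{13} r_{23}}{\sqrt{(1 - r_{13}^2)(1 - r_{23}^2)}}$$

The other coefficients follow by permuting the subscripts, e.g. $r_{23.1} = \dfrac{r_{23} - r_{12} r_{13}}{\sqrt{(1 - r_{12}^2)(1 - r_{13}^2)}}$.

Example: In a trivariate distribution, $r_{12} = 0.77$, $r_{13} = 0.72$ and $r_{23} = 0.52$. Find the partial correlation coefficient $r_{12.3}$ and the multiple correlation coefficient $R_{1.23}$.

Solution:

$$r_{12.3} = \frac{r_{12} - r_{13} r_{23}}{\sqrt{(1 - r_{13}^2)(1 - r_{23}^2)}} = \frac{0.77 - 0.72 \times 0.52}{\sqrt{[1 - (0.72)^2][1 - (0.52)^2]}} = \frac{0.3956}{\sqrt{0.4816 \times 0.7296}} \approx 0.67$$

$$R_{1.23}^2 = \frac{r_{12}^2 + r_{13}^2 - 2 r_{12} r_{13} r_{23}}{1 - r_{23}^2} = \frac{(0.77)^2 + (0.72)^2 - 2 \times 0.77 \times 0.72 \times 0.52}{1 - (0.52)^2} = \frac{0.5347}{0.7296} \approx 0.7329$$

Hence $R_{1.23} = +\sqrt{0.7329} \approx 0.8561$.

Exercise: In a trivariate distribution, $r_{12} = 0.7$ and $r_{23} = r_{31} = 0.5$. Find (i) $r_{23.1}$ and (ii) $R_{1.23}$.

Multiple Regression

Method of Least Squares

For the multiple linear regression equation

$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k$$

fitted to n observations, the least-squares estimates $b_0, b_1, \ldots, b_k$ are obtained by solving the normal equations

$$n b_0 + b_1 \sum x_{1i} + b_2 \sum x_{2i} + \cdots + b_k \sum x_{ki} = \sum y_i$$

$$b_0 \sum x_{1i} + b_1 \sum x_{1i}^2 + b_2 \sum x_{1i} x_{2i} + \cdots + b_k \sum x_{1i} x_{ki} = \sum x_{1i} y_i$$

$$\vdots$$

$$b_0 \sum x_{ki} + b_1 \sum x_{ki} x_{1i} + b_2 \sum x_{ki} x_{2i} + \cdots + b_k \sum x_{ki}^2 = \sum x_{ki} y_i$$

For two independent variables (k = 2), these reduce to

$$n b_0 + b_1 \sum x_{1i} + b_2 \sum x_{2i} = \sum y_i$$

$$b_0 \sum x_{1i} + b_1 \sum x_{1i}^2 + b_2 \sum x_{1i} x_{2i} = \sum x_{1i} y_i$$

$$b_0 \sum x_{2i} + b_1 \sum x_{1i} x_{2i} + b_2 \sum x_{2i}^2 = \sum x_{2i} y_i$$

Example: The following data represent the chemistry grades for a random sample of 12 freshmen at a certain college along with their scores on an intelligence test administered while they were still seniors in high school. The number of class periods missed is also given.

Student   Chemistry grade, y   Test score, x1   Classes missed, x2
1         85                   65               1
2         74                   50               7
3         76                   55               5
4         90                   65               2
5         85                   55               6
6         87                   70               3
7         94                   65               2
8         98                   70               5
9         81                   55               4
10        91                   70               3
11        76                   50               1
12        74                   55               4

(a) Fit a multiple linear regression equation of the form $\hat{y} = b_0 + b_1 x_1 + b_2 x_2$.
(b) Estimate the chemistry grade for a student who has an intelligence test score of 60 and missed 4 classes.

Multiple Regression in Terms of Correlation Coefficients

The regression equation of $X_1$ on $X_2$ and $X_3$ is given by

$$(X_1 - \bar{X}_1)\frac{\omega_{11}}{\sigma_1} + (X_2 - \bar{X}_2)\frac{\omega_{12}}{\sigma_2} + (X_3 - \bar{X}_3)\frac{\omega_{13}}{\sigma_3} = 0$$

where

$$\omega = \begin{vmatrix} 1 & r_{12} & r_{13} \\ r_{21} & 1 & r_{23} \\ r_{31} & r_{32} & 1 \end{vmatrix}$$

and $\omega_{11}, \omega_{12}, \omega_{13}$ are the cofactors of the elements of its first row:

$$\omega_{11} = \begin{vmatrix} 1 & r_{23} \\ r_{32} & 1 \end{vmatrix} = 1 - r_{23}^2, \qquad \omega_{12} = -\begin{vmatrix} r_{21} & r_{23} \\ r_{31} & 1 \end{vmatrix} = r_{13} r_{23} - r_{12}, \qquad \omega_{13} = \begin{vmatrix} r_{21} & 1 \\ r_{31} & r_{32} \end{vmatrix} = r_{12} r_{23} - r_{13}$$

Example: Find the regression equation of $X_1$ on $X_2$ and $X_3$ given the following results:

Trait   Mean    Standard deviation   r12     r23     r31
X1      28.02   4.42                 +0.80   -       -
X2      4.91    1.10                 -       -0.56   -
X3      594     85                   -       -       -0.40

where $X_1$ = Seed per acre; $X_2$ = Rainfall in inches; $X_3$ = Accumulated temperature above 42°F.

Solution: The regression equation of $X_1$ on $X_2$ and $X_3$ is given by

$$(X_1 - \bar{X}_1)\frac{\omega_{11}}{\sigma_1} + (X_2 - \bar{X}_2)\frac{\omega_{12}}{\sigma_2} + (X_3 - \bar{X}_3)\frac{\omega_{13}}{\sigma_3} = 0$$

where

$$\omega_{11} = 1 - r_{23}^2 = 1 - (-0.56)^2 = 0.686$$

$$\omega_{12} = r_{13} r_{23} - r_{12} = (-0.40)(-0.56) - 0.80 = -0.576$$

$$\omega_{13} = r_{23} r_{12} - r_{13} = (-0.56)(0.80) - (-0.40) = -0.048$$

Hence the required equation of the plane of regression of $X_1$ on $X_2$ and $X_3$ is

$$\frac{0.686}{4.42}(X_1 - 28.02) + \frac{(-0.576)}{1.10}(X_2 - 4.91) + \frac{(-0.048)}{85}(X_3 - 594) = 0$$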
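The partial and multiple correlation formulas and the least-squares normal equations for two independent variables can be checked numerically. The following is a minimal sketch under stated assumptions rather than a prescribed method: the function names (`partial_corr`, `multiple_corr`, `solve`, `fit_two_predictors`) are illustrative, the chemistry-grade rows are the twelve (y, x1, x2) triples as read from the example table (an assumption where the printed table is hard to read), and the 3 x 3 system is solved with a small hand-written Gauss-Jordan elimination so that no external library is needed.

```python
from math import sqrt


def partial_corr(r12, r13, r23):
    """r_{12.3}: correlation of X1 and X2 with the linear effect of X3 removed."""
    return (r12 - r13 * r23) / sqrt((1 - r13 ** 2) * (1 - r23 ** 2))


def multiple_corr(r12, r13, r23):
    """R_{1.23}: multiple correlation coefficient of X1 on X2 and X3."""
    return sqrt((r12 ** 2 + r13 ** 2 - 2 * r12 * r13 * r23) / (1 - r23 ** 2))


def solve(a, b):
    """Solve the square system a.x = b by Gauss-Jordan elimination
    with partial pivoting (illustrative helper, not a library call)."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [p - f * q for p, q in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_two_predictors(x1, x2, y):
    """Build and solve the three normal equations for y = b0 + b1*x1 + b2*x2."""
    n = len(y)
    dot = lambda u, v: sum(p * q for p, q in zip(u, v))
    a = [[n,       sum(x1),      sum(x2)],
         [sum(x1), dot(x1, x1),  dot(x1, x2)],
         [sum(x2), dot(x1, x2),  dot(x2, x2)]]
    b = [sum(y), dot(x1, y), dot(x2, y)]
    return solve(a, b)


# Trivariate example: r12 = 0.77, r13 = 0.72, r23 = 0.52.
r_12_3 = partial_corr(0.77, 0.72, 0.52)
R_1_23 = multiple_corr(0.77, 0.72, 0.52)

# Chemistry-grade example (assumed readings of the table).
grade = [85, 74, 76, 90, 85, 87, 94, 98, 81, 91, 76, 74]
score = [65, 50, 55, 65, 55, 70, 65, 70, 55, 70, 50, 55]
missed = [1, 7, 5, 2, 6, 3, 2, 5, 4, 3, 1, 4]

b0, b1, b2 = fit_two_predictors(score, missed, grade)
estimate = b0 + b1 * 60 + b2 * 4  # part (b): test score 60, 4 classes missed

print(round(r_12_3, 4), round(R_1_23, 4))
print(round(b0, 3), round(b1, 3), round(b2, 3), round(estimate, 1))
```

Solving the normal equations directly is adequate for a hand-sized problem like this one; for larger or nearly collinear data, a library least-squares routine would usually be the safer choice.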
