CHAPTER 9
Linear Regression Analysis

9-1. INTRODUCTION
The literal or dictionary meaning of the word 'regression' is 'stepping back or returning to the average value'. The term was first used by the British biometrician Sir Francis Galton in the latter part of the 19th century in connection with some studies he made on estimating the extent to which the stature of the sons of tall parents reverts or regresses back to the mean stature of the population. He studied the relationship between the heights of about one thousand fathers and sons, and published the results in a paper, 'Regression towards Mediocrity in Hereditary Stature'. The interesting features of his study were:

(i) The tall fathers have tall sons and short fathers have short sons.

(ii) The average height of the sons of a group of tall fathers is less than that of the fathers, and the average height of the sons of a group of short fathers is more than that of the fathers.
In other words, Galton's studies revealed that the offspring of abnormally tall or short parents tend to revert or step back to the average height of the population, a phenomenon which Galton described as 'Regression to Mediocrity'.
He concluded that if the average height of a certain group of fathers is 'a' cms. above (below) the general average height, then the average height of their sons will be (a × r) cms. above (below) the general average height, where r is the correlation coefficient between the heights of the given group of fathers and their sons. In this case the correlation is positive and, since | r | ≤ 1, we have a × r ≤ a. This supports the result in (ii) above.
Definition. Line of regression of y on x is the line which gives the best estimate for the value of y for any specified value of x.
Similarly, line of regression of x on y is the line which gives the best estimate for the value of x for any specified value of y.
The term best fit is interpreted in accordance with the Principle of Least Squares, which consists in minimising the sum of the squares of the residuals or the errors of estimates, i.e., the deviations between the given observed values of the variable and their corresponding estimated values as given by the line of best fit. We may minimise the sum of the squares of the errors parallel to the y-axis or parallel to the x-axis; the former (i.e., minimising the sum of squares of errors parallel to the y-axis) gives the equation of the line of regression of y on x, and the latter, viz., minimising the sum of squares of the errors parallel to the x-axis, gives the equation of the line of regression of x on y.
We shall explain below the technique of deriving the equation of the line of regression of y on x.
9-3-1. Derivation of Line of Regression of y on x. Let $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ be n pairs of observations on the two variables x and y under study. Let

$y = a + bx$    ...(9-1)

be the equation of the line of regression of y on x. Then the error of estimate or residual for the i-th point, i.e., the deviation of the point $P_i(x_i, y_i)$ from the line (9-1) measured parallel to the y-axis (Fig. 9-1), is given by

$P_i H_i = P_i M - H_i M = y_i - (a + b x_i)$    ...(9-2)
Thus the sum of the squares of the errors (parallel to the y-axis) is

$E = \sum_{i=1}^{n} \left[ y_i - (a + b x_i) \right]^2$    ...(9-3)

According to the principle of least squares, we have to determine the constants a and b so that E is minimum. The conditions for E to be minimum are:

$\frac{\partial E}{\partial a} = 0$ and $\frac{\partial E}{\partial b} = 0$    ...(9-4)

which, on simplification, give:

$\sum y = na + b \sum x$    ...(9-5)

and

$\sum xy = a \sum x + b \sum x^2$    ...(9-6)

These equations are known as the normal equations for estimating a and b. The quantities $\sum x$, $\sum y$, $\sum x^2$, $\sum xy$ can be obtained from the given set of n points $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, and we can solve the equations (9-5) and (9-6) simultaneously for a and b, i.e.,

$a = \frac{(\sum y)(\sum x^2) - (\sum x)(\sum xy)}{n \sum x^2 - (\sum x)^2}$    ...(9-7)

and

$b = \frac{n \sum xy - (\sum x)(\sum y)}{n \sum x^2 - (\sum x)^2}$    ...(9-8)
Substituting these values of a and b from (9-7) and (9-8) in (9-1), we get the required equation of the line of regression of y on x.
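For readers who wish to verify such computations, here is a short Python sketch (with made-up data, not from the text) that fits y = a + bx by evaluating (9-7) and (9-8) directly:

```python
# Illustrative sketch: fit y = a + bx by solving the normal
# equations (9-5) and (9-6) in closed form.  Data are made up.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(xs)

Sx, Sy = sum(xs), sum(ys)
Sxx = sum(x * x for x in xs)
Sxy = sum(x * y for x, y in zip(xs, ys))

b = (n * Sxy - Sx * Sy) / (n * Sxx - Sx ** 2)    # equation (9-8)
a = (Sy * Sxx - Sx * Sxy) / (n * Sxx - Sx ** 2)  # equation (9-7)
print(f"line of regression of y on x:  y = {a:.4f} + {b:.4f} x")
```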
The equation of the line of regression of y on x can be obtained in a much more systematic and simplified form in terms of $\bar{x}$, $\bar{y}$, $\sigma_x$, $\sigma_y$ and $r = r_{xy}$, as explained below.

Dividing both sides of (9-5) by n, the total number of pairs, we get

$\bar{y} = a + b\bar{x}$    ...(9-9)

This implies that the line of best fit, i.e., the line of regression of y on x, passes through the point $(\bar{x}, \bar{y})$. In other words, the point $(\bar{x}, \bar{y})$ lies on the line of regression of y on x.
We have, on dividing both sides of (9-6) by n:

$\frac{1}{n}\sum xy = a\bar{x} + b \cdot \frac{1}{n}\sum x^2$    ...(9-10)

and, multiplying both sides of (9-9) by $\bar{x}$:

$\bar{x}\bar{y} = a\bar{x} + b\bar{x}^2$    ...(9-11)

Subtracting (9-11) from (9-10), we get

$\frac{1}{n}\sum xy - \bar{x}\bar{y} = b\left( \frac{1}{n}\sum x^2 - \bar{x}^2 \right)$, i.e., $\mathrm{Cov}(x, y) = b\,\sigma_x^2$    ...(9-12)

$\Rightarrow \quad b = \frac{\mathrm{Cov}(x, y)}{\sigma_x^2}$
We find that the equation (9-1) is in the slope-intercept form, viz., y = mx + c. Hence b represents the slope of the line of regression of y on x. Further, we have proved in (9-9) that this line (i.e., the line of regression of y on x) passes through the point $(\bar{x}, \bar{y})$. Hence, using the point-slope form of the equation of a line, the required equation of the line of regression of y on x becomes

$y - \bar{y} = b\,(x - \bar{x})$    ...(9-13)
or

$y - \bar{y} = \frac{\mathrm{Cov}(x, y)}{\sigma_x^2}\,(x - \bar{x})$    ...(9-14)

But $\mathrm{Cov}(x, y) = r\sigma_x\sigma_y$. Substituting in (9-14), we may also write the equation of the line of regression of y on x as:

$y - \bar{y} = \frac{r\,\sigma_y}{\sigma_x}\,(x - \bar{x})$    ...(9-15)
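The equivalence of the two forms of the slope, $b = \mathrm{Cov}(x, y)/\sigma_x^2 = r\sigma_y/\sigma_x$, is easy to check numerically. A minimal Python sketch, again with assumed data:

```python
# Sketch with assumed data: the least-squares slope b equals
# Cov(x, y)/sigma_x^2 = r * sigma_y / sigma_x, as in (9-12) and (9-15).
xs = [2.0, 4.0, 5.0, 7.0, 9.0]
ys = [3.0, 7.0, 8.0, 12.0, 15.0]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n

cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
sx = (sum((x - mx) ** 2 for x in xs) / n) ** 0.5
sy = (sum((y - my) ** 2 for y in ys) / n) ** 0.5
r = cov / (sx * sy)

b = cov / sx ** 2
assert abs(b - r * sy / sx) < 1e-12   # the two forms of the slope agree
print(f"y - {my} = {b:.4f} (x - {mx})")
```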
Note. If f = f(x, y, z) is a function of three variables x, y and z, then the partial derivative $\partial f/\partial x$ is obtained on differentiating f w.r.t. x, treating the other two variables, viz., y and z, as constants. Similarly, $\partial f/\partial y$ is obtained on differentiating f w.r.t. y, regarding x and z as constants.
The condition $\partial E/\partial a = 0$ and $\partial E/\partial b = 0$ is only a requirement for an extremum (maximum or minimum) of E. The necessary and sufficient conditions for a minimum of E for variations in a and b are:

(i) $\frac{\partial E}{\partial a} = 0, \quad \frac{\partial E}{\partial b} = 0$    ...(*)

and (ii) $\frac{\partial^2 E}{\partial a^2} > 0$ and $\frac{\partial^2 E}{\partial a^2}\cdot\frac{\partial^2 E}{\partial b^2} - \left(\frac{\partial^2 E}{\partial a\,\partial b}\right)^2 > 0$    ...(**)
Theorem. The solution of the least square equations (9-5) and (9-6) provides a minimum of E defined in (9-3).

Proof. The normal equations (9-5) and (9-6) already satisfy the equations in (*). We have

$\frac{\partial E}{\partial a} = -2\sum (y - a - bx) \quad\Rightarrow\quad \frac{\partial^2 E}{\partial a^2} = -2\sum (-1) = 2n > 0$

$\frac{\partial E}{\partial b} = -2\sum x\,(y - a - bx) \quad\Rightarrow\quad \frac{\partial^2 E}{\partial b^2} = -2\sum x\,(-x) = 2\sum x^2$

and

$\frac{\partial^2 E}{\partial a\,\partial b} = \frac{\partial^2 E}{\partial b\,\partial a} = 2\sum x$

Hence

$\frac{\partial^2 E}{\partial a^2}\cdot\frac{\partial^2 E}{\partial b^2} - \left(\frac{\partial^2 E}{\partial a\,\partial b}\right)^2 = 4n\sum x^2 - 4\left(\sum x\right)^2 = 4n^2\left[\frac{1}{n}\sum x^2 - \bar{x}^2\right] = 4n^2\,\mathrm{Var}(x) > 0$

so that both the conditions in (**) are satisfied, and the solution of the normal equations provides a minimum of E.

Remark. From (9-5), we have

$\sum (y - a - bx) = 0, \quad \text{i.e.,} \quad \sum (y - \hat{y}) = 0$    ...(9-16)

where $\hat{y} = a + bx$ is the estimated value of y. In other words, the sum of the residuals is zero; and, on dividing (9-5) by n, the line of regression of y on x passes through the point $(\bar{x}, \bar{y})$.
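Both assertions in the Remark are easy to verify numerically. A minimal Python sketch with made-up data:

```python
# Quick check of the Remark: the least-squares residuals sum to zero,
# and the fitted line passes through (xbar, ybar).  Data are made up.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.3, 2.8, 4.2]
n = len(xs)
Sx, Sy = sum(xs), sum(ys)
Sxx = sum(x * x for x in xs)
Sxy = sum(x * y for x, y in zip(xs, ys))

b = (n * Sxy - Sx * Sy) / (n * Sxx - Sx ** 2)
a = (Sy - b * Sx) / n                          # from (9-5)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
assert abs(sum(residuals)) < 1e-12             # (9-16)
assert abs(Sy / n - (a + b * Sx / n)) < 1e-12  # line through the means
```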
9-3-2. Line of Regression of x on y. The line of regression of x on y is the line which gives the best estimate of x for any specified value of y. Proceeding exactly as in § 9-3-1, let

$x = A + By$    ...(9-17)

be the equation of the line of regression of x on y. The principle of least squares now requires the minimisation of the sum of the squares of the errors parallel to the x-axis (Fig. 9-2), viz.,

$E' = \sum (x - A - By)^2$    ...(9-18)

For estimating A and B, we obtain the normal equations

$\sum x = nA + B\sum y \quad \text{and} \quad \sum xy = A\sum y + B\sum y^2$    ...(9-19)

Solving (9-19) simultaneously, we get

$A = \frac{(\sum y^2)(\sum x) - (\sum y)(\sum xy)}{n\sum y^2 - (\sum y)^2}$    ...(9-20)

and

$B = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum y^2 - (\sum y)^2}$    ...(9-21)

Substituting these values of A and B in (9-17), we get the required equation of the line of regression of x on y.
Remark. The values of A and B obtained in (9-20) and (9-21) are the same as in equations (9-7) and (9-8) with x and y interchanged. Proceeding as in § 9-3-1, we get the following results:

(i) $\bar{x} = A + B\bar{y}$    ...(9-22)

i.e., the line of regression of x on y also passes through the point $(\bar{x}, \bar{y})$.

(ii) $B = \frac{\mathrm{Cov}(x, y)}{\sigma_y^2} = \frac{r\,\sigma_x}{\sigma_y}$    ...(9-23)

The equation of the line of regression of x on y is

$x - \bar{x} = B\,(y - \bar{y})$    ...(9-24)

i.e., $x - \bar{x} = \frac{\mathrm{Cov}(x, y)}{\sigma_y^2}\,(y - \bar{y})$    ...(9-25)

or $x - \bar{x} = \frac{r\,\sigma_x}{\sigma_y}\,(y - \bar{y})$    ...(9-26)

In case of perfect correlation, $r = \pm 1$, the equation of the line of regression of y on x becomes

$\frac{y - \bar{y}}{\sigma_y} = \pm\,\frac{x - \bar{x}}{\sigma_x}$    ...(*)

and the equation of the line of regression of x on y becomes

$x - \bar{x} = \pm\,\frac{\sigma_x}{\sigma_y}\,(y - \bar{y})$, i.e., $\frac{x - \bar{x}}{\sigma_x} = \pm\,\frac{y - \bar{y}}{\sigma_y}$    ...(**)

which is the same as (*). Hence, in general, we always have two lines of regression, except in the particular case of perfect correlation ($r = \pm 1$), when both the lines coincide and we get only one line.
9-3-3. Angle Between the Regression Lines. The equations of the lines of regression of y on x and of x on y are respectively:

$y - \bar{y} = \frac{r\,\sigma_y}{\sigma_x}\,(x - \bar{x})$ and $y - \bar{y} = \frac{\sigma_y}{r\,\sigma_x}\,(x - \bar{x})$

so that their slopes are $m_1 = r\sigma_y/\sigma_x$ and $m_2 = \sigma_y/(r\sigma_x)$ respectively. If θ is the angle between the two lines of regression, then

$\tan\theta = \frac{m_2 - m_1}{1 + m_1 m_2} = \left( \frac{\sigma_y}{r\sigma_x} - \frac{r\sigma_y}{\sigma_x} \right) \Big/ \left( 1 + \frac{\sigma_y^2}{\sigma_x^2} \right) = \frac{1 - r^2}{r}\cdot\frac{\sigma_x\sigma_y}{\sigma_x^2 + \sigma_y^2}$    ...(9-27)

$\Rightarrow \quad \theta = \tan^{-1}\left[ \frac{1 - r^2}{r}\cdot\frac{\sigma_x\sigma_y}{\sigma_x^2 + \sigma_y^2} \right]$    ...(9-28)
In particular, if $r = \pm 1$, then $\theta = \tan^{-1}(0) = 0$ or π, i.e., the two lines are either coincident (θ = 0) or they are parallel (θ = π). But since both the lines of regression intersect at the point $(\bar{x}, \bar{y})$, they cannot be parallel. Hence, in case of perfect correlation, positive or negative, the two lines of regression coincide.

If r = 0, then from (9-28), $\theta = \tan^{-1}(\infty) = \pi/2$. Hence, if the variables are uncorrelated, the two lines of regression become perpendicular to each other.
Remarks 1. Whenever two lines intersect, there are two angles between them, one acute angle and the other obtuse angle. Further, tan θ > 0 if 0 < θ < π/2, i.e., θ is an acute angle, and tan θ < 0 if π/2 < θ < π, i.e., θ is an obtuse angle. Since $0 \le r^2 \le 1$, the acute angle ($\theta_1$) and the obtuse angle ($\theta_2$) between the two lines of regression are given by

$\theta_1 = \text{Acute angle} = \tan^{-1}\left[ \frac{1 - r^2}{|r|}\cdot\frac{\sigma_x\sigma_y}{\sigma_x^2 + \sigma_y^2} \right]$ and $\theta_2 = \pi - \theta_1$    ...(9-29)
2. When r = 0, i.e., when x and y are uncorrelated, the lines of regression of y on x and of x on y are perpendicular to each other.

3. We have seen above that if r = 0 (variables uncorrelated), the two lines of regression are perpendicular to each other, and if $r = \pm 1$, θ = 0, i.e., the two lines coincide. This leads us to the conclusion that for a higher degree of correlation between the variables, the angle between the lines is smaller, i.e., the two lines of regression are nearer to each other. On the other hand, the angle between the lines increases, i.e., the lines of regression move apart, as the value of the correlation coefficient decreases. In other words, if the lines of regression make a larger angle, they indicate a poor degree of correlation between the variables.
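A small Python sketch (with assumed values of σx and σy) illustrating the acute-angle formula (9-29) and Remark 3: the angle shrinks as | r | grows, and is 90° at r = 0:

```python
import math

# Sketch of (9-28)/(9-29) with assumed sigmas: the acute angle between
# the two regression lines shrinks as |r| grows, and is 90 deg at r = 0.
def acute_angle(r, sx, sy):
    if r == 0:
        return math.pi / 2                    # lines perpendicular
    t = (1 - r * r) / abs(r) * (sx * sy) / (sx * sx + sy * sy)
    return math.atan(t)

for r in (0.0, 0.3, 0.7, 0.99):
    theta = math.degrees(acute_angle(r, 3.0, 4.0))
    print(f"r = {r:4.2f}  ->  acute angle = {theta:5.2f} deg")
```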
Fig. 9-3 (a)-(e). Two lines coincide (r = −1); two lines coincide (r = +1); two lines perpendicular (r = 0); two lines apart (low degree of correlation); two lines closer (high degree of correlation).
9-3-4. Using Regression Lines for Prediction. The equation of the regression line is commonly used to predict the value of the dependent variable Y for a given value of the independent variable X. For example, the predicted value of Y, written as $\hat{Y}$, when $X = X_0$, is given by:

$\hat{Y} = a + bX_0$

It makes sense to use the lines of regression for prediction only if the linear model is a good 'fit'; we should therefore test for 'goodness of fit' before using the lines of regression for prediction. The following points should be kept in mind while using regression lines for estimation and prediction.

1. Test the significance of the observed sample correlation coefficient r = r(X, Y), as discussed in § 8-5 and § 9-8. If the value of r is significant, we can use the lines of regression for estimation and prediction.

2. If r is not significant, then the linear model is not a good fit and hence the line of regression should not be used for prediction.

3. The line of regression, say $\hat{Y} = b_0 + b_1 X$, should be used to predict Y only for those values of X which lie within the range of the available sample data on the independent variable X. For example, suppose the line of regression of Y (blood pressure) on X (age) of women, say $Y = b_0 + b_1 X$, is a good fit to the given data. It will be futile (rather absurd) to estimate the blood pressure for an age lying far outside the range of ages covered by the sample.
4. We should not use the linear regression model to make predictions for Y corresponding to far distant values of X. At a distant value of X, there may be a drastic change in the pattern of the relationship between the variables, which is not exhibited in the currently available data. Hence, the predicted values of Y for far distant values of X may be extremely unreliable.

5. It will be worthwhile to make predictions from the linear regression model only for the population from which the sample data are drawn. It should not be used for a different population. For example, the line of regression fitted between the blood pressure (Y) and age (X) of a group of women cannot be used to estimate the blood pressure of a group of men.
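The following Python sketch encodes cautions 1 to 4 above as guard clauses; the fitted coefficients, the sample range and the significance flag are all hypothetical inputs:

```python
# A sketch of the cautions above.  All inputs are hypothetical: the
# fitted line, the sample range of X and the r-significance flag.
def predict_y(b0, b1, x0, x_min, x_max, r_is_significant):
    if not r_is_significant:
        raise ValueError("r not significant: linear model is a poor fit")
    if not (x_min <= x0 <= x_max):
        raise ValueError("x0 outside the sample range: unreliable estimate")
    return b0 + b1 * x0

# e.g. blood pressure (Y) on age (X), fitted on ages 25-60 (assumed):
print(predict_y(b0=80.0, b1=1.1, x0=40.0, x_min=25.0, x_max=60.0,
                r_is_significant=True))      # OK: 124.0
```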
Regression Coefficients. In the regression equation of y on x, viz.,

y = a + bx

the coefficient 'b', which is the slope of the line of regression of y on x, is called the coefficient of regression of y on x. It represents the increment in the value of the dependent variable y for a unit change in the value of the independent variable x. In other words, it represents the rate of change of y w.r.t. x. For convenience of notation, the slope b, i.e., the coefficient of regression of y on x, is written as $b_{yx}$.
Similarly, in the regression equation of x on y, viz.,

x = A + By

B represents the change in the value of the dependent variable x for a unit change in the value of the independent variable y, and is written as $b_{xy}$.

Notations: $b_{yx}$ = Coefficient of regression of y on x; $b_{xy}$ = Coefficient of regression of x on y.

$b_{yx} = \frac{\mathrm{Cov}(x, y)}{\sigma_x^2} = \frac{r\,\sigma_y}{\sigma_x}$    ...(9-30)

and

$b_{xy} = \frac{\mathrm{Cov}(x, y)}{\sigma_y^2} = \frac{r\,\sigma_x}{\sigma_y}$  [∵ $\mathrm{Cov}(x, y) = r\sigma_x\sigma_y$]    ...(9-31)
Remarks 1. For numerical computations of the equations of the lines of regression of y on x and of x on y, the following formulae for the regression coefficients $b_{yx}$ and $b_{xy}$ are very convenient to use. We have, by definition [see (8-4) and (8-4a)]:

$\mathrm{Cov}(x, y) = \frac{1}{n}\sum (x - \bar{x})(y - \bar{y})$ and $\sigma_x^2 = \frac{1}{n}\sum (x - \bar{x})^2$

Hence

$b_{yx} = \frac{\mathrm{Cov}(x, y)}{\sigma_x^2} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2}$    ...(9-36)

a value which was also obtained for b in equation (9-8) on solving the normal equations (9-5) and (9-6) simultaneously. Similarly,

$b_{xy} = \frac{\mathrm{Cov}(x, y)}{\sigma_y^2} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (y - \bar{y})^2} = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum y^2 - (\sum y)^2}$    ...(9-37)

The formulae (9-36) and (9-37) are very useful for computing the values of the regression coefficients from a given set of n points $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$.

Other convenient formulae for finding the regression coefficients in numerical problems are, in terms of the deviations $dx = x - \bar{x}$ and $dy = y - \bar{y}$:

$b_{yx} = \frac{\sum dx\,dy}{\sum dx^2}$ and $b_{xy} = \frac{\sum dx\,dy}{\sum dy^2}$    ...(9-38)
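As a numerical check of the deviation formulae (9-38), the following Python sketch computes $b_{yx}$ and $b_{xy}$ for the marks data of the worked example appearing later in this chapter:

```python
# Regression coefficients from the deviation formulae (9-38),
# using the marks data of the worked example below.
xs = [25, 28, 35, 32, 31, 36, 29, 38, 34, 32]
ys = [43, 46, 49, 41, 36, 32, 31, 30, 33, 39]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n            # 32, 38

Sdxdy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))   # -93
Sdx2 = sum((x - mx) ** 2 for x in xs)                      # 140
Sdy2 = sum((y - my) ** 2 for y in ys)                      # 398

byx = Sdxdy / Sdx2
bxy = Sdxdy / Sdy2
print(round(byx, 4), round(bxy, 4))          # -0.6643 -0.2337
```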
2. The correlation coefficient between the two variables x and y is a symmetric function of x and y, i.e., $r_{xy} = r_{yx}$. However, the regression coefficients are not symmetric functions of x and y, i.e., in general $b_{xy} \ne b_{yx}$.
3. We have:

$b_{yx} = \frac{\mathrm{Cov}(x, y)}{\sigma_x^2}$    ...(*), $b_{xy} = \frac{\mathrm{Cov}(x, y)}{\sigma_y^2}$    ...(**), $r = \frac{\mathrm{Cov}(x, y)}{\sigma_x\sigma_y}$    ...(***)

From (*) and (**), we observe that the sign of each regression coefficient $b_{yx}$ and $b_{xy}$ depends on the covariance term, since $\sigma_x > 0$ and $\sigma_y > 0$. If Cov(x, y) is positive, both the regression coefficients are positive, and if Cov(x, y) is negative, both the regression coefficients are negative.
4. From (*), (**) and (***), since $\sigma_x > 0$ and $\sigma_y > 0$, the signs of r, $b_{yx}$ and $b_{xy}$ are all the same as the sign of Cov(x, y). If Cov(x, y) is positive, all the three are positive, and if Cov(x, y) is negative, all the three are negative. This means that if the regression coefficients are positive, r is positive, and if the regression coefficients are negative, r is negative.
Theorem 9-1. The correlation coefficient is the geometric mean between the regression coefficients, i.e.,

$r^2 = b_{yx} \cdot b_{xy}$    ...(9-39)

Proof. We have

$b_{yx} = \frac{\mathrm{Cov}(x, y)}{\sigma_x^2}$    ...(9-40) and $b_{xy} = \frac{\mathrm{Cov}(x, y)}{\sigma_y^2}$    ...(9-41)

Multiplying (9-40) and (9-41), we get $b_{yx}\cdot b_{xy} = \dfrac{[\mathrm{Cov}(x, y)]^2}{\sigma_x^2\,\sigma_y^2} = r^2$, i.e.,

$r = \pm\sqrt{b_{yx}\cdot b_{xy}}$    ...(9-42)

Remark. The sign to be taken before the square root is the same as that of the regression coefficients. If the regression coefficients are positive, we take the positive sign in (9-42), and if the regression coefficients are negative, we take the negative sign in (9-42).
Theorem 9-2. If one of the regression coefficients is greater than unity (one), the other must be less than unity.

Proof. If one of the regression coefficients is greater than 1, then the other must be less than one because otherwise, on using (9-39), we shall get:

$r^2 = b_{yx}\cdot b_{xy} > 1$

which is impossible, since $0 \le r^2 \le 1$.

Aliter. Let $b_{yx} > 1$    ...(*)

Then $1/b_{yx} < 1$. Since $r^2 = b_{yx}\cdot b_{xy} \le 1$, we have

$b_{xy} \le \frac{1}{b_{yx}} < 1$    [From (*)]

Hence, if one of the regression coefficients is greater than one, the other must be less than one.
Theorem 9-3. The arithmetic mean of the modulus values of the regression coefficients is greater than the modulus value of the correlation coefficient.

Proof. We know that for any two real, distinct, positive numbers a and b:

Arithmetic Mean > Geometric Mean, i.e., $\frac{a + b}{2} > \sqrt{ab}$    ...(*)

Taking $a = |b_{yx}|$ and $b = |b_{xy}|$ in (*), we get

$\frac{|b_{yx}| + |b_{xy}|}{2} > \sqrt{|b_{yx}|\cdot|b_{xy}|} = \sqrt{r^2} = |r|$

as required.

Theorem 9-4. Regression coefficients are independent of change of origin but not of scale, i.e., if $u = \dfrac{x - a}{h}$ and $v = \dfrac{y - b}{k}$, (h > 0, k > 0), then

$b_{yx} = \frac{k}{h}\,b_{vu}$    ...(9-43) and $b_{xy} = \frac{h}{k}\,b_{uv}$    ...(9-44)
Proof. Since the correlation coefficient is independent of change of origin and scale, we have

$r_{xy} = r_{uv}$    ...(9-45)

Since the standard deviation is independent of change of origin but not of scale, the transformation gives

$\sigma_x = h\,\sigma_u$ and $\sigma_y = k\,\sigma_v$    ...(9-45a)

Hence (9-43) gives:

$b_{yx} = \frac{r_{xy}\,\sigma_y}{\sigma_x} = \frac{r_{uv}\cdot k\,\sigma_v}{h\,\sigma_u} = \frac{k}{h}\,b_{vu}$    ...(9-46)

and

$b_{xy} = \frac{r_{xy}\,\sigma_x}{\sigma_y} = \frac{r_{uv}\cdot h\,\sigma_u}{k\,\sigma_v} = \frac{h}{k}\,b_{uv}$    ...(9-46a)

From (9-46) and (9-46a), it is obvious that the regression coefficients are independent of change of origin but not of scale.
In particular, if we transform the variables x and y to u and v by the relations u = x − a and v = y − b, i.e., if we take h = k = 1 (change of origin only), then from (9-46) and (9-46a) we get

$b_{yx} = b_{vu}$ and $b_{xy} = b_{uv}$    ...(9-47)

i.e.,

$b_{yx} = \frac{n\sum uv - (\sum u)(\sum v)}{n\sum u^2 - (\sum u)^2}$    ...(9-47a)

and

$b_{xy} = \frac{n\sum uv - (\sum u)(\sum v)}{n\sum v^2 - (\sum v)^2}$    ...(9-47b)

These formulae are very useful for obtaining the equations of the lines of regression when the values of x and y are large.
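A short Python sketch (made-up data) confirming (9-47): a pure change of origin leaves the regression coefficient unchanged:

```python
# Sketch: with a pure change of origin (h = k = 1), the regression
# coefficient b_yx is unchanged, as in (9-47).  Data are made up.
def byx(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return num / sum((x - mx) ** 2 for x in xs)

xs = [970.0, 980.0, 990.0, 1000.0, 1010.0]
ys = [712.0, 708.0, 721.0, 724.0, 730.0]
us = [x - 1000.0 for x in xs]     # u = x - a, origin shift only
vs = [y - 700.0 for y in ys]      # v = y - b
assert abs(byx(xs, ys) - byx(us, vs)) < 1e-9
print(byx(us, vs))
```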
Example. For a set of n = 10 pairs of observations with $\sum x = 900$ and $\sum y = 700$, we have

$\bar{x} = \frac{\sum x}{n} = \frac{900}{10} = 90$ and $\bar{y} = \frac{\sum y}{n} = \frac{700}{10} = 70$

and

$b_{xy} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (y - \bar{y})^2} = \frac{3900}{2868} = 1.361$

Regression Equations. The equation of the line of regression of y on x is $y - \bar{y} = b_{yx}(x - \bar{x})$, and the equation of the line of regression of x on y is $x - \bar{x} = b_{xy}(y - \bar{y})$.
Example. Obtain the equations of the lines of regression for the following data of marks in Economics (x) and marks in Statistics (y) of 10 students, and estimate the most likely marks in Statistics when the marks in Economics are 30.

Solution. Calculations for Regression Equations

  x     y    dx = x − 32   dy = y − 38   dx²   dy²   dx dy
 25    43        −7             5         49    25    −35
 28    46        −4             8         16    64    −32
 35    49         3            11          9   121     33
 32    41         0             3          0     9      0
 31    36        −1            −2          1     4      2
 36    32         4            −6         16    36    −24
 29    31        −3            −7          9    49     21
 38    30         6            −8         36    64    −48
 34    33         2            −5          4    25    −10
 32    39         0             1          0     1      0

 Σx = 320   Σy = 380   Σdx = 0   Σdy = 0   Σdx² = 140   Σdy² = 398   Σdx dy = −93

Here

$\bar{x} = \frac{\sum x}{n} = \frac{320}{10} = 32$ and $\bar{y} = \frac{\sum y}{n} = \frac{380}{10} = 38$
Regression Coefficients.

Coefficient of regression of y on x:

$b_{yx} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} = \frac{\sum dx\,dy}{\sum dx^2} = \frac{-93}{140} = -0.6643$

Coefficient of regression of x on y:

$b_{xy} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (y - \bar{y})^2} = \frac{\sum dx\,dy}{\sum dy^2} = \frac{-93}{398} = -0.2337$
Regression Equations
ofthe line of regression of x on y is : Equation of the line of regression of y on x is
:
Fquation
r=-0-2337y + 40-8806
Coefficient. We have
Correlation
the most marks in Statistics (v) when marks in Economics (x) are 30, we
LA In order to estimate likely
gien by
y=-06643 x 30 + 59-2576 =-19-929 + 59-2576 = 39-3286
when marks Economics are 30, are 39:3286 39.
Hence, the most likely marks in Statistics in
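The arithmetic of this example is easily reproduced in Python (same data and coefficients as above):

```python
# Reproducing the arithmetic of the example above (same data):
byx = -93 / 140                   # -0.6643
bxy = -93 / 398                   # -0.2337
mx, my = 32, 38

y_at_30 = my + byx * (30 - mx)    # line of y on x, evaluated at x = 30
r = -((byx * bxy) ** 0.5)         # negative: both coefficients negative
print(round(y_at_30, 4), round(r, 3))   # 39.3286 -0.394
```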
Example. Two judges A and B awarded the following marks to seven debators:

Debator:      1   2   3   4   5   6   7
Marks by A:  40  34  28  30  44  38  31
Marks by B:  32  39  26  30  38  34  28

An eighth debator was awarded 36 marks by Judge A, while Judge B was not present. If Judge B had also been present, how many marks would you expect him to award to the eighth debator, assuming the same degree of relationship exists in their judgement?

[Delhi Univ. B.Com. (Hons.), 1993; Himachal Pradesh Univ. M.A. (Econ.), June 1999; Allahabad Univ. M.Com., 2002]
Solution. Let the marks awarded by Judge A be denoted by the variable x and the marks awarded by Judge B by the variable y.

  x     y    u = x − 35   v = y − 30    u²    v²    uv
 40    32        5            2         25     4    10
 34    39       −1            9          1    81    −9
 28    26       −7           −4         49    16    28
 30    30       −5            0         25     0     0
 44    38        9            8         81    64    72
 38    34        3            4          9    16    12
 31    28       −4           −2         16     4     8

 Total:   Σu = 0   Σv = 17   Σu² = 206   Σv² = 185   Σuv = 121

The marks awarded by Judge A to the eighth debator are given to be 36, i.e., we are given x = 36. We have to find the marks which would have been given to the eighth debator by Judge B, if he were present, i.e., we have to estimate y when x = 36, using the line of regression of y on x.
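A Python sketch completing the computation set up above, using the totals from the table:

```python
# Completing the computation the text sets up: estimate the marks
# Judge B (y) would award when Judge A (x) gives 36, from the totals.
n, Su, Sv, Suu, Svv, Suv = 7, 0, 17, 206, 185, 121

byx = (n * Suv - Su * Sv) / (n * Suu - Su ** 2)   # = b_vu, by (9-47a)
x_bar = 35 + Su / n               # u = x - 35
y_bar = 30 + Sv / n               # v = y - 30

y_hat = y_bar + byx * (36 - x_bar)
print(round(byx, 4), round(y_hat, 2))    # 0.5874  33.02 -> about 33 marks
```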