
Linear Regression Analysis

Correlation
Simple Linear Regression
The Multiple Linear Regression Model
Least Squares Estimates
R² and Adjusted R²
Overall Validity of the Model (F test)
Testing for Individual Regressors (t test)
Problem of Multicollinearity

Smoking and Lung Capacity
Suppose, for example, we want to investigate the relationship between cigarette smoking and lung capacity.
We might ask a group of people about their smoking habits, and measure their lung capacities.

Cigarettes (X)   Lung Capacity (Y)
      0                 45
      5                 42
     10                 33
     15                 31
     20                 29

Scatter plot of the data
[Scatter plot: Cigarettes (X) on the horizontal axis, Lung Capacity (Y) on the vertical axis]
We can see that as smoking goes up, lung capacity tends to go down.
The two variables change their values in opposite directions.

Height and Weight

Consider the following data of heights and weights of 5 women swimmers:
Height (inch):     62   64   65   66   68
Weight (pounds):  102  108  115  128  132
We can observe that weight is also increasing with height.
[Scatter plot: Height on the horizontal axis, Weight on the vertical axis]

Sometimes two variables are related to each other.
The values of both variables are paired.
A change in the value of one affects the value of the other.
Usually these two variables are two attributes of each member of the population.
For example:
Height and Weight
Advertising Expenditure and Sales Volume
Unemployment and Crime Rate
Rainfall and Food Production
Expenditure and Savings


We have already studied one measure of relationship between two variables: covariance.
Covariance between two random variables X and Y is given by
$\mathrm{Cov}(X, Y) = \sigma_{XY} = E(XY) - E(X)E(Y)$
For paired observations on variables X and Y,
$\mathrm{Cov}(X, Y) = s_{XY} = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$


Correlation

Properties of Covariance:
Cov(X + a, Y + b) = Cov(X, Y)   [not affected by change in location]
Cov(aX, bY) = ab Cov(X, Y)   [affected by change in scale]
Covariance can take any value from -∞ to +∞.
Cov(X, Y) > 0 means X and Y change in the same direction.
Cov(X, Y) < 0 means X and Y change in the opposite direction.
If X and Y are independent, Cov(X, Y) = 0. [The other way may not be true.]
Covariance is not unit free.
So it is not a good measure of relationship between two variables.
A better measure is the correlation coefficient.
It is unit free and takes values in [-1, +1].

Karl Pearson's correlation coefficient is given by
$r_{XY} = \mathrm{Corr}(X, Y) = \dfrac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}$
When the joint distribution of X and Y is known,
$\mathrm{Cov}(X, Y) = E(XY) - E(X)E(Y), \quad \mathrm{Var}(X) = E(X^2) - [E(X)]^2, \quad \mathrm{Var}(Y) = E(Y^2) - [E(Y)]^2$
When observations on X and Y are available,
$\mathrm{Cov}(X, Y) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}), \quad \mathrm{Var}(X) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2, \quad \mathrm{Var}(Y) = \frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^2$
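As a quick check of these formulas, here is a short Python sketch (not part of the original slides; NumPy assumed) that computes the covariance and the correlation coefficient for the smoking and lung-capacity data introduced earlier.

```python
# Minimal sketch: Pearson correlation for the smoking / lung-capacity data.
import numpy as np

x = np.array([0, 5, 10, 15, 20])        # cigarettes (X)
y = np.array([45, 42, 33, 31, 29])      # lung capacity (Y)

cov_xy = np.mean((x - x.mean()) * (y - y.mean()))   # 1/n form of the covariance
var_x = np.mean((x - x.mean()) ** 2)
var_y = np.mean((y - y.mean()) ** 2)
r = cov_xy / np.sqrt(var_x * var_y)

print(round(r, 4))   # about -0.9615, matching the value computed later in these notes
```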

Properties of Correlation Coefficient
Corr(aX + b, cY + d) = Corr(X, Y), provided a and c have the same sign.
It is unit free.
It measures the strength of relationship on a scale of -1 to +1.
So, it can be used to compare the relationships of various pairs of variables.
Values close to 0 indicate little or no correlation.
Values close to +1 indicate very strong positive correlation.
Values close to -1 indicate very strong negative correlation.

Scatter Diagram
[Scatter plots of Y against X illustrating positively correlated, negatively correlated, weakly correlated, strongly correlated, and uncorrelated data]

The correlation coefficient measures the strength of the linear relationship.
r = 0 does not necessarily imply that there is no relationship.
A relationship may be there, but not a linear one.

Example: consider the following data, for which x̄ = 2.15 and ȳ = 80.

   x     y    x - x̄   y - ȳ   (x - x̄)²   (y - ȳ)²   (x - x̄)(y - ȳ)
 1.25   125   -0.90     45     0.8100      2025        -40.50
 1.75   105   -0.40     25     0.1600       625        -10.00
 2.25    65    0.10    -15     0.0100       225         -1.50
 2.00    85   -0.15      5     0.0225        25         -0.75
 2.50    75    0.35     -5     0.1225        25         -1.75
 2.25    80    0.10      0     0.0100         0          0.00
 2.70    50    0.55    -30     0.3025       900        -16.50
 2.50    55    0.35    -25     0.1225       625         -8.75
17.20   640                    1.560       4450        -79.75
                                SSX         SSY          SSXY

$\mathrm{SSX} = \sum(x - \bar{x})^2, \quad \mathrm{SSY} = \sum(y - \bar{y})^2, \quad \mathrm{SSXY} = \sum(x - \bar{x})(y - \bar{y})$

$r_{XY} = \dfrac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}} = \dfrac{\mathrm{SSXY}}{\sqrt{\mathrm{SSX}\cdot\mathrm{SSY}}} = \dfrac{-79.75}{\sqrt{1.56 \times 4450}} = -0.957$


Alternative Formulas for Sum of Squares

$\mathrm{SSX} = \sum x^2 - \dfrac{(\sum x)^2}{n}, \quad \mathrm{SSY} = \sum y^2 - \dfrac{(\sum y)^2}{n}, \quad \mathrm{SSXY} = \sum xy - \dfrac{\sum x \sum y}{n}$

   x     y      x²        y²        x·y
 1.25   125    1.5625    15625     156.25
 1.75   105    3.0625    11025     183.75
 2.25    65    5.0625     4225     146.25
 2.00    85    4.0000     7225     170.00
 2.50    75    6.2500     5625     187.50
 2.25    80    5.0625     6400     180.00
 2.70    50    7.2900     2500     135.00
 2.50    55    6.2500     3025     137.50
17.20   640   38.54      55650    1296.25
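As an illustration (not from the original slides), the following sketch verifies the shortcut formulas against the column totals of the table above; NumPy is assumed.

```python
# Minimal sketch: verify the shortcut sum-of-squares formulas on the table above.
import numpy as np

x = np.array([1.25, 1.75, 2.25, 2.00, 2.50, 2.25, 2.70, 2.50])
y = np.array([125, 105, 65, 85, 75, 80, 50, 55])
n = len(x)

ssx = np.sum(x**2) - np.sum(x)**2 / n            # 38.54 - 17.2^2/8  = 1.56
ssy = np.sum(y**2) - np.sum(y)**2 / n            # 55650 - 640^2/8   = 4450
ssxy = np.sum(x*y) - np.sum(x) * np.sum(y) / n   # 1296.25 - 1376    = -79.75

print(ssx, ssy, ssxy)
```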

Smoking and Lung Capacity Example

Cigarettes (X)   Lung Capacity (Y)     X²      Y²      XY
      0                 45              0    2025       0
      5                 42             25    1764     210
     10                 33            100    1089     330
     15                 31            225     961     465
     20                 29            400     841     580
     50                180            750    6680    1585

$r_{XY} = \dfrac{n\sum XY - \sum X \sum Y}{\sqrt{\left[n\sum X^2 - (\sum X)^2\right]\left[n\sum Y^2 - (\sum Y)^2\right]}} = \dfrac{(5)(1585) - (50)(180)}{\sqrt{\left[(5)(750) - 50^2\right]\left[(5)(6680) - 180^2\right]}} = \dfrac{7925 - 9000}{\sqrt{(3750 - 2500)(33400 - 32400)}} = \dfrac{-1075}{\sqrt{1250 \times 1000}} = -0.9615$

Regression Analysis
Having determined the correlation between X and Y, we wish to determine a mathematical relationship between them.
Dependent variable: the variable you wish to explain.
Independent variables: the variables used to explain the dependent variable.
Regression analysis is used to:
Predict the value of the dependent variable based on the value of the independent variable(s).
Explain the impact of changes in an independent variable on the dependent variable.

Types of Relationships
[Scatter plots illustrating linear and curvilinear relationships, and strong and weak relationships between X and Y]


Types of Relationships
[Scatter plot illustrating no relationship between X and Y]

Simple Linear Regression Analysis
The simplest mathematical relationship is
Y = a + bX + error   (linear)
Changes in Y are related to the changes in X.
What are the most suitable values of a (intercept) and b (slope)?
[Diagram: the fitted line y = a + bx, with intercept a and slope b]

Method of Least Squares
We want to fit a line for which all the errors are minimum.
That is, we want to obtain the values of a and b in Y = a + bX + error for which all the errors are minimum.
To minimize all the errors together, we minimize the sum of squares of errors (SSE):
$SSE = \sum_{i=1}^{n} (Y_i - a - bX_i)^2$
[Diagram: for each observation (x_i, y_i), the error is the vertical distance between y_i and the fitted value a + bx_i]
The best fitted line is the one for which all the errors are minimum.

To get the values of a and b which minimize SSE, we proceed as follows:

$\dfrac{\partial\, SSE}{\partial a} = 0 \;\Rightarrow\; -2\sum_{i=1}^{n}(Y_i - a - bX_i) = 0 \;\Rightarrow\; \sum_{i=1}^{n} Y_i = na + b\sum_{i=1}^{n} X_i \qquad (1)$

$\dfrac{\partial\, SSE}{\partial b} = 0 \;\Rightarrow\; -2\sum_{i=1}^{n}(Y_i - a - bX_i)X_i = 0 \;\Rightarrow\; \sum_{i=1}^{n} Y_i X_i = a\sum_{i=1}^{n} X_i + b\sum_{i=1}^{n} X_i^2 \qquad (2)$

Equations (1) and (2) are called the normal equations.
Solving the normal equations, we get

$b = \dfrac{n\sum_{i=1}^{n} X_i Y_i - \sum_{i=1}^{n} X_i \sum_{i=1}^{n} Y_i}{n\sum_{i=1}^{n} X_i^2 - \left(\sum_{i=1}^{n} X_i\right)^2} = \dfrac{\mathrm{SSXY}}{\mathrm{SSX}}, \qquad a = \bar{Y} - b\bar{X}$


The values of a and b obtained using the least squares method are called the least squares estimates (LSE) of a and b.
Thus, the LSE of a and b are given by
$b = \dfrac{\mathrm{SSXY}}{\mathrm{SSX}}, \qquad a = \bar{Y} - b\bar{X}$
Also, the correlation coefficient between X and Y is
$r_{XY} = \dfrac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}} = \dfrac{\mathrm{SSXY}}{\sqrt{\mathrm{SSX}\cdot\mathrm{SSY}}} = b\,\sqrt{\dfrac{\mathrm{SSX}}{\mathrm{SSY}}}$
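A minimal Python sketch (not from the original slides; NumPy assumed) applying these formulas to the example data used in these notes:

```python
# Least squares estimates via b = SSXY/SSX and a = Ybar - b*Xbar.
import numpy as np

x = np.array([1.25, 1.75, 2.25, 2.00, 2.50, 2.25, 2.70, 2.50])
y = np.array([125, 105, 65, 85, 75, 80, 50, 55])

ssx = np.sum((x - x.mean()) ** 2)
ssxy = np.sum((x - x.mean()) * (y - y.mean()))

b = ssxy / ssx                 # slope, about -51.12
a = y.mean() - b * x.mean()    # intercept, about 189.91
print(a, b)
```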

Example (continued): for the same data,

   x     y    x - x̄   y - ȳ   (x - x̄)²   (y - ȳ)²   (x - x̄)(y - ȳ)
 1.25   125   -0.90     45     0.8100      2025        -40.50
 1.75   105   -0.40     25     0.1600       625        -10.00
 2.25    65    0.10    -15     0.0100       225         -1.50
 2.00    85   -0.15      5     0.0225        25         -0.75
 2.50    75    0.35     -5     0.1225        25         -1.75
 2.25    80    0.10      0     0.0100         0          0.00
 2.70    50    0.55    -30     0.3025       900        -16.50
 2.50    55    0.35    -25     0.1225       625         -8.75
17.20   640                    1.560       4450        -79.75
                                SSX         SSY          SSXY

$\bar{X} = 2.15, \quad \bar{Y} = 80$

$r_{XY} = \dfrac{\mathrm{SSXY}}{\sqrt{\mathrm{SSX}\cdot\mathrm{SSY}}} = -0.957, \qquad b = \dfrac{\mathrm{SSXY}}{\mathrm{SSX}} = -51.12, \qquad a = \bar{Y} - b\bar{X} = 189.91$

Fitted line: Ŷ = 189.91 - 51.12 X
[Scatter plot of the data with the fitted line, X from 0.25 to 2.75, Y from 40 to 140]

189.91 is the estimated mean value of Y when the value of X is zero.
-51.12 is the change in the average value of Y as a result of a one-unit change in X.
We can predict the value of Y for some given value of X.
For example, at X = 2.15, the predicted value of Y is 189.91 - 51.12 × 2.15 = 80.002.

Residuals: $e_i = Y_i - \hat{Y}_i$
The residual is the unexplained part of Y.
The smaller the residuals, the better the utility of the regression.
The sum of residuals is always zero; the least squares procedure ensures that.
Residuals play an important role in investigating the adequacy of the fitted model.
We obtain the coefficient of determination (R²) using the residuals.
R² is used to examine the adequacy of the fitted linear model to the given data.
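A short sketch (not in the original slides; NumPy assumed) illustrating the residuals for the fitted line above; their sum is essentially zero, up to the rounding of a and b.

```python
# Compute fitted values and residuals, and check that the residuals sum to ~0.
import numpy as np

x = np.array([1.25, 1.75, 2.25, 2.00, 2.50, 2.25, 2.70, 2.50])
y = np.array([125, 105, 65, 85, 75, 80, 50, 55])

a, b = 189.91, -51.12          # least squares estimates from the notes (rounded)
y_hat = a + b * x              # fitted values
residuals = y - y_hat

print(residuals.round(2))
print(round(residuals.sum(), 2))   # close to 0 (not exact, since a and b are rounded)
```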

Coefficient of Determination
[Diagram: for an observation, the total deviation Y - Ȳ splits into the explained part Ŷ - Ȳ and the unexplained part Y - Ŷ]

Total Sum of Squares:       $SST = \sum_{i=1}^{n}(Y_i - \bar{Y})^2$
Regression Sum of Squares:  $SSR = \sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2$
Error Sum of Squares:       $SSE = \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2$

Also, SST = SSR + SSE


The fraction of SST explained by the regression is given by R².
R² = SSR / SST = 1 - (SSE / SST)
Clearly, 0 ≤ R² ≤ 1.
When SSR is close to SST, R² will be close to 1.
This means that the regression explains most of the variability in Y. (Fit is good.)
When SSE is close to SST, R² will be close to 0.
This means that the regression does not explain much variability in Y. (Fit is not good.)
R² is the square of the correlation coefficient between X and Y. (Proof omitted.)

0 < R² < 1: weak linear relationship; some but not all of the variation in Y is explained by X.
R² = 1 (r = +1 or r = -1): perfect linear relationship; 100% of the variation in Y is explained by X.
R² = 0: no linear relationship; none of the variation in Y is explained by X.

   x     y     Ŷ     Y - Ȳ   Y - Ŷ   Ŷ - Ȳ   (Y - Ȳ)²   (Y - Ŷ)²   (Ŷ - Ȳ)²
 1.25   125   126.0    45    -1.0     46.0     2025       1.00      2116.00
 1.75   105   100.5    25     4.5     20.5      625      20.25       420.25
 2.25    65    74.9   -15    -9.9     -5.1      225      98.00        26.01
 2.00    85    87.7     5    -2.2      7.7       25       4.84        59.29
 2.50    75    62.1    -5    12.9    -17.7       25     166.41       313.29
 2.25    80    74.9     0     5.1     -5.1        0      26.01        26.01
 2.70    50    51.9   -30    -1.9    -28.1      900       3.61       789.61
 2.50    55    62.1   -25    -7.1    -17.9      625      50.41       320.41
17.20   640                                    4450     370.54      4079.46

Coefficient of Determination: R² = (4450 - 370.5) / 4450 = 0.916
Correlation Coefficient: r = -0.957
Coefficient of Determination = (Correlation Coefficient)²
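As a check (not part of the original slides; NumPy assumed), the sketch below recomputes R² for this fit from SST and SSE and compares it with the squared correlation coefficient.

```python
# R^2 for the fitted line above, from SST and SSE, versus the squared correlation.
import numpy as np

x = np.array([1.25, 1.75, 2.25, 2.00, 2.50, 2.25, 2.70, 2.50])
y = np.array([125, 105, 65, 85, 75, 80, 50, 55])

b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
y_hat = a + b * x

sst = np.sum((y - y.mean()) ** 2)
sse = np.sum((y - y_hat) ** 2)
r2 = 1 - sse / sst

print(round(r2, 3))                              # about 0.916
print(round(np.corrcoef(x, y)[0, 1] ** 2, 3))    # same value, r squared
```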

Example:
Watching television also reduces the amount of physical exercise, causing weight gains.
A sample of fifteen 10-year-old children was taken.
The number of pounds each child was overweight was recorded (a negative number indicates the child is underweight).
Additionally, the number of hours of television viewing per week was also recorded.
These data are listed here:

TV:         42 34 25 35 37 38 31 33 19 29 38 28 29 36 18
Overweight: 18  6  0  1 13 14  7  7  9  8  8  5  3 14  7

Calculate the sample regression line and describe what the coefficients tell you about the relationship between the two variables.

Ŷ = -24.709 + 0.967 X and R² = 0.768

[Plot of observed Y and predicted Y for the 15 children]


Standard Error
Consider a data set.
All the observations cannot be exactly the same as the arithmetic mean (AM).
Variability of the observations around the AM is measured by the standard deviation.
Similarly in regression, all Y values cannot be the same as the predicted Y values.
Variability of the Y values around the prediction line is measured by the STANDARD ERROR OF THE ESTIMATE.
It is given by
$S_{YX} = \sqrt{\dfrac{SSE}{n - 2}} = \sqrt{\dfrac{\sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2}{n - 2}}$

Assumptions
The relationship between X and Y is linear.
Error values are statistically independent.
All the errors have a common variance (homoscedasticity):
$\mathrm{Var}(e_i) = \sigma^2$, where $e_i = Y_i - \hat{Y}_i$, and $E(e_i) = 0$.
No distributional assumption about the errors is required for the least squares method.
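A minimal sketch (not from the original slides; NumPy assumed) computing the standard error of the estimate for the simple regression example used in these notes:

```python
# Standard error of the estimate, S_YX = sqrt(SSE / (n - 2)).
import numpy as np

x = np.array([1.25, 1.75, 2.25, 2.00, 2.50, 2.25, 2.70, 2.50])
y = np.array([125, 105, 65, 85, 75, 80, 50, 55])

b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
sse = np.sum((y - (a + b * x)) ** 2)

s_yx = np.sqrt(sse / (len(x) - 2))
print(round(s_yx, 2))   # roughly 7.9 for this data set
```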

Linearity
[Residual plots against X: no pattern when the relationship is linear, a systematic pattern when it is not linear]

Independence
[Residual plots over observation order: independent errors show no pattern, non-independent errors show a systematic pattern]

Equal Variance
[Residual plots: equal variance (homoscedastic) residuals have a constant spread, unequal variance (heteroscedastic) residuals fan out]

TV Watching and Weight Gain Example
[Scatter plot of X and Y, and scatter plot of X and the residuals]


The Multiple Linear Regression Model
In simple linear regression analysis, we fit a linear relation between one independent variable (X) and one dependent variable (Y).
We assume that Y is regressed on only one regressor variable X.
In some situations, the variable Y is regressed on more than one regressor variable (X1, X2, X3, ...).
For example:
Cost: labor cost, electricity cost, raw material cost
Salary: education, experience
Sales: cost, advertising expenditure

Example:
A distributor of frozen dessert pies wants to evaluate factors which influence the demand.
Dependent variable:
Y: pie sales (units per week)
Independent variables:
X1: price (in $)
X2: advertising expenditure ($100s)
Data are collected for 15 weeks.

Week   Pie Sales   Price ($)   Advertising ($100s)
  1       350        5.50            3.3
  2       460        7.50            3.3
  3       350        8.00            3.0
  4       430        8.00            4.5
  5       350        6.80            3.0
  6       380        7.50            4.0
  7       430        4.50            3.0
  8       470        6.40            3.7
  9       450        7.00            3.5
 10       490        5.00            4.0
 11       340        7.20            3.5
 12       300        7.90            3.2
 13       440        5.90            4.0
 14       450        5.00            3.5
 15       300        7.00            2.7
Using the given data, we wish to fit a linear function of the form
$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i, \qquad i = 1, 2, \ldots, 15$
where
Y: pie sales (units per week)
X1: price (in $)
X2: advertising expenditure ($100s)
Fitting means we want to get the values of the regression coefficients denoted by β.
The original values of the βs are not known.
We estimate them using the given data.

The Multiple Linear Regression Model
Examine the linear relationship between one dependent variable (Y) and two or more independent variables (X1, X2, ..., Xk).

Multiple linear regression model with k independent variables (intercept β0, slopes β1, ..., βk, random error εi):
$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki} + \varepsilon_i, \qquad i = 1, 2, \ldots, n$

Multiple Linear Regression Equation
The intercept and the slopes are estimated using the observed data.
The multiple linear regression equation with k independent variables gives the estimated value
$\hat{Y}_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + \cdots + b_k X_{ki}, \qquad i = 1, 2, \ldots, n$
where b0 is the estimate of the intercept and b1, ..., bk are the estimates of the slopes.


Multiple Regression Equation
Example with two independent variables:
$\hat{Y} = b_0 + b_1 X_1 + b_2 X_2$
[Diagram: the fitted regression plane over the (X1, X2) plane]

Estimating Regression Coefficients
The multiple linear regression model
$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki} + \varepsilon_i, \qquad i = 1, 2, \ldots, n$
in matrix notation is
$\begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix} = \begin{pmatrix} 1 & X_{11} & X_{12} & \cdots & X_{1k} \\ 1 & X_{21} & X_{22} & \cdots & X_{2k} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & X_{n1} & X_{n2} & \cdots & X_{nk} \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}$
or
$\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$

Assumptions
The number of observations (n) is greater than the number of regressors (k), i.e., n > k.
Random errors are independent.
Random errors have the same variance (homoscedasticity): $\mathrm{Var}(\varepsilon_i) = \sigma^2$.
In the long run, the mean effect of the random errors is zero: $E(\varepsilon_i) = 0$.
No assumption on the distribution of the random errors is required for the least squares method.

In order to find the estimate of β, we minimize
$S(\boldsymbol{\beta}) = \sum_{i=1}^{n} \varepsilon_i^2 = (\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}) = \mathbf{Y}'\mathbf{Y} - 2\boldsymbol{\beta}'\mathbf{X}'\mathbf{Y} + \boldsymbol{\beta}'\mathbf{X}'\mathbf{X}\boldsymbol{\beta}$
We differentiate S(β) with respect to β and equate to zero, i.e., $\dfrac{\partial S}{\partial \boldsymbol{\beta}} = 0$.
This gives
$\mathbf{b} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}$
b is called the least squares estimator of β.
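To make the matrix formula concrete, here is a small sketch (not part of the original slides; NumPy assumed) that builds X with a column of ones for the pie-sales data and evaluates b = (X'X)⁻¹X'Y.

```python
# Least squares estimator b = (X'X)^(-1) X'Y for the pie-sales data.
import numpy as np

sales = np.array([350, 460, 350, 430, 350, 380, 430, 470, 450, 490, 340, 300, 440, 450, 300])
price = np.array([5.5, 7.5, 8.0, 8.0, 6.8, 7.5, 4.5, 6.4, 7.0, 5.0, 7.2, 7.9, 5.9, 5.0, 7.0])
adv   = np.array([3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7, 3.5, 4.0, 3.5, 3.2, 4.0, 3.5, 2.7])

X = np.column_stack([np.ones(len(sales)), price, adv])   # first column of ones for the intercept
Y = sales

b = np.linalg.inv(X.T @ X) @ X.T @ Y   # (X'X)^(-1) X'Y; np.linalg.lstsq is the numerically safer choice
print(b.round(2))                      # approximately [306.53, -24.98, 74.13]
```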

Example: Consider the pie example.
We want to fit the model $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i$.
The variables are
Y: pie sales (units per week)
X1: price (in $)
X2: advertising expenditure ($100s)

Using the matrix formula, the least squares estimates (LSE) of the βs are obtained as below:
LSE of intercept β0: b0 = 306.53
LSE of slope β1 (Price): b1 = -24.98
LSE of slope β2 (Advertising): b2 = 74.13

Fitted equation: Sales = 306.53 - 24.98 (X1) + 74.13 (X2)
Pie Sales = 306.53 - 24.98 Price + 74.13 Adv. Expend.

b1 = -24.98: sales will decrease, on average, by 24.98 pies per week for each $1 increase in selling price, while advertising expenses are kept fixed.
b2 = 74.13: sales will increase, on average, by 74.13 pies per week for each $100 increase in advertising, while selling price is kept fixed.


$\hat{Y} = 306.52619 - 24.97509\, X_1 + 74.13096\, X_2$

Prediction:
Predict sales for a week in which the selling price is $5.50 and the advertising expenditure is $350:
Sales = 306.53 - 24.98 X1 + 74.13 X2
      = 306.53 - 24.98 (5.50) + 74.13 (3.5)
      = 428.62
Predicted sales is 428.62 pies.
Note that advertising is in $100s, so X2 = 3.5.

  Y     X1    X2   Predicted Y   Residual
 350   5.5   3.3     413.77       -63.80
 460   7.5   3.3     363.81        96.15
 350   8.0   3.0     329.08        20.88
 430   8.0   4.5     440.28       -10.31
 350   6.8   3.0     359.06        -9.09
 380   7.5   4.0     415.70       -35.74
 430   4.5   3.0     416.51        13.47
 470   6.4   3.7     420.94        49.03
 450   7.0   3.5     391.13        58.84
 490   5.0   4.0     478.15        11.83
 340   7.2   3.5     386.13       -46.16
 300   7.9   3.2     346.40       -46.44
 440   5.9   4.0     455.67       -15.70
 450   5.0   3.5     441.09         8.89
 300   7.0   2.7     331.82       -31.85

Coefficient of Determination
The coefficient of determination (R²) is obtained using the same formula as in simple linear regression.
Total Sum of Squares:       $SST = \sum_{i=1}^{n}(Y_i - \bar{Y})^2$
Regression Sum of Squares:  $SSR = \sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2$
Error Sum of Squares:       $SSE = \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2$
Also, SST = SSR + SSE
R² = SSR / SST = 1 - (SSE / SST)
[Plot of observed Y and predicted Y for the 15 weeks]

Since SST = SSR + SSE and all three quantities are non-negative,
0 ≤ SSR ≤ SST
So 0 ≤ SSR/SST ≤ 1, or 0 ≤ R² ≤ 1.
When R² is close to 0, the linear fit is not good, and the X variables do not contribute in explaining the variability in Y.
When R² is close to 1, the linear fit is good.
In the previously discussed example, R² = 0.5215.
If we consider Y and X1 only, R² = 0.1965.
If we consider Y and X2 only, R² = 0.3095.
R² is the proportion of variation in Y explained by regression.

Adjusted R²
If one more regressor is added to the model, the value of R² will increase.
This increase is regardless of the contribution of the newly added regressor.
So, an adjusted value of R² is defined, which is called adjusted R²:
$R^2_{Adj} = 1 - \dfrac{SSE/(n - k - 1)}{SST/(n - 1)}$
This adjusted R² will only increase if the additional variable contributes in explaining the variation in Y.
For our example, adjusted R² = 0.4417.
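As a quick worked check (using SSE = 27033.31 and SST = 56493.33 from the ANOVA table reported later in these notes, with n = 15 and k = 2):

$R^2_{Adj} = 1 - \dfrac{27033.31/12}{56493.33/14} = 1 - \dfrac{2252.78}{4035.24} \approx 0.4417$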


F Test for Overall Significance
We check if there is a linear relationship between all the regressors (X1, X2, ..., Xk) and the response (Y).
We use the F test statistic to test:
H0: β1 = β2 = ... = βk = 0 (no regressor is significant)
H1: at least one βi ≠ 0 (at least one regressor affects Y)
The technique of Analysis of Variance is used.
Assumptions:
n > k, Var(εi) = σ², E(εi) = 0.
The εi's are independent. This implies that Corr(εi, εj) = 0 for i ≠ j.
The εi's have a Normal distribution: εi ~ N(0, σ²). [NEW ASSUMPTION]

The Total Sum of Squares (SST) is partitioned into the Sum of Squares due to Regression (SSR) and the Sum of Squares due to Residuals (SSE), where
$SST = \sum_{i=1}^{n}(Y_i - \bar{Y})^2, \qquad SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2, \qquad SSR = SST - SSE$
The ei's are called the residuals.

Analysis of Variance Table

Source               df        SS     MS       Fc
Regression            k       SSR    MSR    MSR/MSE
Residual or Error   n-k-1     SSE    MSE
Total                n-1      SST

Test statistic: Fc = MSR / MSE ~ F(k, n-k-1)

For the previous example, we wish to test
H0: β1 = β2 = 0 against H1: at least one βi ≠ 0.
ANOVA table:

Source               df        SS         MS        Fc       F(2,12)(0.05)
Regression            2     29460.03   14730.01   6.5386         3.89
Residual or Error    12     27033.31    2252.78
Total                14     56493.33

Thus H0 is rejected at the 5% level of significance.

Individual Variables: Tests of Hypothesis
We test if there is a linear relationship between a particular regressor Xj and Y.
Hypotheses:
H0: βj = 0 (no linear relationship)
H1: βj ≠ 0 (linear relationship exists between Xj and Y)
We use a two-tailed t test.
If H0: βj = 0 is accepted, this indicates that the variable Xj can be deleted from the model.

Test statistic:
$T_c = \dfrac{b_j}{\sqrt{\hat{\sigma}^2\, C_{jj}}}$
Tc ~ Student's t with (n - k - 1) degrees of freedom.
bj is the least squares estimate of βj.
Cjj is the (j, j)th element of the matrix (X'X)⁻¹.
$\hat{\sigma}^2 = MSE$ (MSE is obtained in the ANOVA table).

In our example, $\hat{\sigma}^2 = 2252.7755$ and
$(\mathbf{X}'\mathbf{X})^{-1} = \begin{pmatrix} 5.7946 & -0.3312 & -1.0165 \\ -0.3312 & 0.0521 & -0.0038 \\ -1.0165 & -0.0038 & 0.2993 \end{pmatrix}$
To test H0: β1 = 0 against H1: β1 ≠ 0, Tc = -2.3057.
To test H0: β2 = 0 against H1: β2 ≠ 0, Tc = 2.8548.
Two-tailed critical values of t at 12 d.f. are
3.0545 for 1% level of significance
2.6810 for 2% level of significance
2.1788 for 5% level of significance
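A minimal sketch (not from the original slides; NumPy assumed) reproducing the overall F statistic and the individual t statistics for the pie-sales model:

```python
# ANOVA F statistic and individual t statistics for the fitted pie-sales model.
import numpy as np

sales = np.array([350, 460, 350, 430, 350, 380, 430, 470, 450, 490, 340, 300, 440, 450, 300])
price = np.array([5.5, 7.5, 8.0, 8.0, 6.8, 7.5, 4.5, 6.4, 7.0, 5.0, 7.2, 7.9, 5.9, 5.0, 7.0])
adv   = np.array([3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7, 3.5, 4.0, 3.5, 3.2, 4.0, 3.5, 2.7])

X = np.column_stack([np.ones(len(sales)), price, adv])
Y = sales
n, p = X.shape                      # p = k + 1 (intercept plus k slopes)
k = p - 1

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ Y
resid = Y - X @ b

sse = np.sum(resid ** 2)            # about 27033
sst = np.sum((Y - Y.mean()) ** 2)   # about 56493
msr = (sst - sse) / k
mse = sse / (n - k - 1)

F = msr / mse                                   # about 6.54
t = b / np.sqrt(mse * np.diag(XtX_inv))         # slope t values about -2.31 and 2.85
print(round(F, 3), t.round(3))
```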


Standard Error
Consider a data set.
All the observations cannot be exactly the same as the arithmetic mean (AM).
Variability of the observations around the AM is measured by the standard deviation.
Similarly in regression, all Y values cannot be the same as the predicted Y values.
Variability of the Y values around the prediction line is measured by the STANDARD ERROR OF THE ESTIMATE.
It is given by
$S_{YX} = \sqrt{\dfrac{SSE}{n - k - 1}} = \sqrt{\dfrac{\sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2}{n - k - 1}}$

Assumption of Linearity
[Residual plots against X: no pattern when the relationship is linear, a systematic pattern when it is not linear]

Residual Analysis for Equal Variance
Assumption of equal variance: we assume that Var(εi) = σ², i.e., the variance is constant for all observations.
This assumption is examined by looking at the plot of the predicted values Ŷi against the residuals ei = Yi - Ŷi.
[Residual plots: equal variance (homoscedastic) shows a constant spread, unequal variance (heteroscedastic) shows a fanning pattern]

Residual Analysis for Independence (Uncorrelated Errors)
Assumption of uncorrelated residuals.
[Residual plots over observation order: independent errors show no pattern, non-independent errors show a systematic pattern]

The Durbin-Watson statistic is a test statistic used to detect the presence of autocorrelation.
It is given by
$d = \dfrac{\sum_{i=2}^{n}(e_i - e_{i-1})^2}{\sum_{i=1}^{n} e_i^2}$
The value of d always lies between 0 and 4.
d = 2 indicates no autocorrelation.
Small values of d (< 2) indicate that successive error terms are positively correlated.
If d > 2, successive error terms are negatively correlated.
Values of d more than 3 or less than 1 are alarming.
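A small sketch (not from the original slides; NumPy assumed) of the Durbin-Watson statistic; the residual sequence used here is made up purely for illustration.

```python
# Durbin-Watson statistic for a sequence of residuals.
import numpy as np

def durbin_watson(residuals):
    """d = sum_{i=2..n} (e_i - e_{i-1})^2 / sum_{i=1..n} e_i^2"""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

# Hypothetical residual sequence; values of d near 2 suggest no autocorrelation.
e = np.array([5.0, -3.0, 2.0, -1.0, 4.0, -2.0, 1.0, -4.0])
print(round(durbin_watson(e), 2))
```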


Assumption of Normality
When we use the F test or the t test, we assume that ε1, ε2, ..., εn are normally distributed.
This assumption can be examined by a histogram of the residuals.
Normality can also be examined using a Q-Q plot or a normal probability plot.
[Histograms and normal probability plots illustrating normal and non-normal residuals]

Standardized Regression Coefficients
In a multiple linear regression, we may like to know which regressor contributes more.
We obtain standardized estimates of the regression coefficients.
For that, first we standardize the observations:
$\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i, \quad s_Y = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(Y_i - \bar{Y})^2}$
$\bar{X}_1 = \frac{1}{n}\sum_{i=1}^{n} X_{1i}, \quad s_{X_1} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(X_{1i} - \bar{X}_1)^2}$
$\bar{X}_2 = \frac{1}{n}\sum_{i=1}^{n} X_{2i}, \quad s_{X_2} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(X_{2i} - \bar{X}_2)^2}$

Standardize all Y, X1 and X2 values as follows:
$\text{Standardized } Y_i = \dfrac{Y_i - \bar{Y}}{s_Y}, \quad \text{Standardized } X_{1i} = \dfrac{X_{1i} - \bar{X}_1}{s_{X_1}}, \quad \text{Standardized } X_{2i} = \dfrac{X_{2i} - \bar{X}_2}{s_{X_2}}$

Fit the regression on the standardized data and obtain the least squares estimates of the regression coefficients.
These coefficients are dimensionless (unit free) and can be compared.
Look for the regression coefficient having the highest magnitude.
The corresponding regressor contributes the most.
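A minimal sketch (not from the original slides; NumPy assumed) obtaining the standardized coefficients for the pie-sales model so the two slopes can be compared:

```python
# Standardized regression coefficients for the pie-sales model.
import numpy as np

sales = np.array([350, 460, 350, 430, 350, 380, 430, 470, 450, 490, 340, 300, 440, 450, 300])
price = np.array([5.5, 7.5, 8.0, 8.0, 6.8, 7.5, 4.5, 6.4, 7.0, 5.0, 7.2, 7.9, 5.9, 5.0, 7.0])
adv   = np.array([3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7, 3.5, 4.0, 3.5, 3.2, 4.0, 3.5, 2.7])

def standardize(v):
    return (v - v.mean()) / v.std(ddof=1)   # sample standard deviation (divisor n - 1)

Zy, Z1, Z2 = standardize(sales), standardize(price), standardize(adv)
Z = np.column_stack([np.ones(len(Zy)), Z1, Z2])

beta = np.linalg.inv(Z.T @ Z) @ Z.T @ Zy
print(beta.round(3))   # roughly [0, -0.46, 0.57]: advertising has the larger magnitude
```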

Standardized Data
[Table of standardized Pie Sales, Price, and Advertising values for the 15 weeks]

Note that
Ŷ = 0 - 0.461 X1 + 0.570 X2
Since |-0.461| < 0.570, X2 contributes the most.

Also note:
$R^2_{Adj} = 1 - \dfrac{(1 - R^2)(n - 1)}{n - k - 1}, \qquad F_c = \dfrac{(n - k - 1)\, R^2}{k\,(1 - R^2)}$
Adjusted R² can be negative.
Adjusted R² is always less than or equal to R².
Inclusion of the intercept term is not necessary.
It depends on the problem; the analyst may decide on this.


Example: The following data were collected on sales, the number of advertisements published and the advertising expenditure for 12 weeks. Fit a regression model to predict the sales.

Sales (0,000 Rs)   Ads (Nos.)   AdvEx (000 Rs)
     43.6              12            13.9
     38.0              11            12
     30.1                             9.3
     35.3                             9.7
     46.4              12            12.3
     34.2                            11.4
     30.2                             9.3
     40.7              13            14.3
     38.5                            10.2
     22.6                             8.4
     37.6                            11.2
     35.2              10            11.1

ANOVA (b)
Model 1      Sum of Squares   df   Mean Square      F      Sig.
Regression       309.986       2     154.993      9.741   .006a
Residual         143.201       9      15.911
Total            453.187      11
a. Predictors: (Constant), Ex_Adv, No_Adv
b. Dependent Variable: Sales

p-value < 0.05; H0 is rejected; all βs are not zero.

Coefficients (a)
Model 1       Unstandardized Coefficients   Standardized Coefficients      t      Sig.
                   B       Std. Error               Beta
(Constant)       6.584       8.542                                       .771    .461
No_Adv            .625       1.120                  .234                 .558    .591
Ex_Adv           2.139       1.470                  .611                1.455    .180
a. Dependent Variable: Sales

All p-values > 0.05; no H0 is rejected: β0 = 0, β1 = 0, β2 = 0.
CONTRADICTION

Multicollinearity
We assume that the regressors are independent variables.
When we regress Y on regressors X1, X2, ..., Xk, we assume that all regressors X1, X2, ..., Xk are statistically independent of each other.
All the regressors affect the values of Y.
One regressor does not affect the values of another regressor.
Sometimes, in practice, this assumption is not met, and we face the problem of multicollinearity.
The correlated variables contribute redundant information to the model.
Including two highly correlated independent variables can adversely affect the regression results and can lead to unstable coefficients.

Some indications of strong multicollinearity:
Coefficient signs may not match prior expectations.
Large change in the value of a previous coefficient when a new variable is added to the model.
A previously significant variable becomes insignificant when a new independent variable is added.
The F test says at least one variable is significant, but none of the t tests indicates a useful variable.
The standard error is large and the corresponding regressor is still significant.
MSE is very high and/or R² is very small.

Examples in which this might happen:
Miles per gallon vs. horsepower and engine size
Income vs. age and experience
Sales vs. number of advertisements and advertising expenditure

Variance Inflationary Factor:
VIFj is used to measure the multicollinearity generated by variable Xj.
It is given by
$VIF_j = \dfrac{1}{1 - R_j^2}$
where R²j is the coefficient of determination of a regression model that uses Xj as the dependent variable and all other X variables as the independent variables.
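A minimal sketch (not from the original slides; NumPy assumed) of this definition: each VIF comes from regressing one regressor on all the others. The two regressor columns below are made up purely for illustration.

```python
# VIF for each regressor, computed by regressing it on all the other regressors.
import numpy as np

def r_squared(y, X):
    """R^2 of an ordinary least squares fit of y on X (X already has a column of ones)."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ b
    return 1 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)

def vif(regressors):
    """regressors: 2-D array, one column per X variable (no intercept column)."""
    n, k = regressors.shape
    out = []
    for j in range(k):
        others = np.delete(regressors, j, axis=1)
        X = np.column_stack([np.ones(n), others])
        out.append(1.0 / (1.0 - r_squared(regressors[:, j], X)))
    return out

# Hypothetical two-regressor example: highly correlated columns give VIF well above 5.
x1 = np.array([12.0, 11, 10, 9, 12, 11, 9, 13, 10, 8, 11, 10])
x2 = x1 * 1.1 + np.array([0.5, -0.3, 0.2, 0.1, -0.4, 0.3, -0.2, 0.4, -0.1, 0.2, -0.3, 0.1])
print([round(v, 2) for v in vif(np.column_stack([x1, x2]))])
```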

If VIFj > 5, Xj is highly correlated with the other independent variables.
Mathematically, the problem of multicollinearity occurs when the columns of the matrix X have near linear dependence.
The LSE b cannot be obtained when the matrix X'X is singular.
The matrix X'X becomes singular when the columns of the matrix X have exact linear dependence, i.e., if any of the eigenvalues of the matrix X'X is zero.
Thus, a near-zero eigenvalue is also an indication of multicollinearity.
The methods of dealing with multicollinearity:
Collecting additional data
Variable elimination


Coefficients (a)
Model 1       Unstandardized Coefficients   Standardized Coefficients      t      Sig.   Tolerance    VIF
                   B       Std. Error               Beta
(Constant)       6.584       8.542                                       .771    .461
No_Adv            .625       1.120                  .234                 .558    .591      .199      5.022
Ex_Adv           2.139       1.470                  .611                1.455    .180      .199      5.022
a. Dependent Variable: Sales

Tolerance = 1/VIF; here the VIF is greater than 5.

Collinearity Diagnostics (a)
Model 1   Dimension   Eigenvalue   Condition Index   Variance Proportions (Constant, No_Adv, Ex_Adv)
              1          2.966          1.000              .00    .00    .00
              2           .030          9.882              .33    .17    .00
              3           .003         30.417              .67    .83   1.00
a. Dependent Variable: Sales

The third eigenvalue is negligible and the corresponding condition index is large.

We may use the method of variable elimination.
In practice, if Corr(X1, X2) is more than 0.7 or less than -0.7, we eliminate one of them.
Techniques:
Stepwise (based on ANOVA)
Forward Inclusion (based on Correlation)
Backward Elimination (based on Correlation)

Stepwise Regression
Y = β0 + β1 X1 + β2 X2 + β3 X3 + β4 X4 + β5 X5 + ε

Step 1: Run 5 simple linear regressions:
Y = β0 + β1 X1
Y = β0 + β2 X2
Y = β0 + β3 X3
Y = β0 + β4 X4   <== has lowest p-value (ANOVA) < 0.05
Y = β0 + β5 X5

Step 2: Run 4 two-variable linear regressions:
Y = β0 + β4 X4 + β1 X1
Y = β0 + β4 X4 + β2 X2
Y = β0 + β4 X4 + β3 X3   <== has lowest p-value (ANOVA) < 0.05
Y = β0 + β4 X4 + β5 X5

Step 3: Run 3 three-variable linear regressions:
Y = β0 + β3 X3 + β4 X4 + β1 X1
Y = β0 + β3 X3 + β4 X4 + β2 X2
Y = β0 + β3 X3 + β4 X4 + β5 X5

Suppose none of these models has a p-value < 0.05.
STOP: the best model is the one with X3 and X4 only.
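A rough sketch (not from the original slides; NumPy and SciPy assumed) of the forward/stepwise idea described above: at each step add the candidate regressor whose model has the lowest overall-F p-value, stopping when no candidate achieves p < 0.05.

```python
# Forward selection driven by the overall ANOVA F p-value, as in the steps above.
import numpy as np
from scipy import stats

def overall_f_pvalue(y, X_cols):
    n = len(y)
    X = np.column_stack([np.ones(n)] + X_cols)
    k = X.shape[1] - 1
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    sse = np.sum((y - X @ b) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    F = ((sst - sse) / k) / (sse / (n - k - 1))
    return stats.f.sf(F, k, n - k - 1)

def stepwise(y, candidates, alpha=0.05):
    chosen, remaining = [], list(range(len(candidates)))
    while remaining:
        pvals = {j: overall_f_pvalue(y, [candidates[i] for i in chosen] + [candidates[j]])
                 for j in remaining}
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha:
            break                  # no candidate improves the model at level alpha: STOP
        chosen.append(best)
        remaining.remove(best)
    return chosen                  # indices of the selected regressors
```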

Example: The following data were collected on sales, the number of advertisements published and the advertising expenditure for 12 months. Fit a regression model to predict the sales.

Sales (0,000 Rs)   Ads (Nos.)   AdvEx (000 Rs)
     43.6              12            13.9
     38.0              11            12
     30.1                             9.3
     35.3                             9.7
     46.4              12            12.3
     34.2                            11.4
     30.2                             9.3
     40.7              13            14.3
     38.5                            10.2
     22.6                             8.4
     37.6                            11.2
     35.2              10            11.1

Summary Output 1: Sales vs. No_Adv

Model Summary
Model   R      R Square   Adjusted R Square   Std. Error of the Estimate
  1    .781a     .610           .571                  4.20570
a. Predictors: (Constant), No_Adv

ANOVA (b)
Model 1      Sum of Squares   df   Mean Square      F       Sig.
Regression       276.308       1     276.308     15.621    .003a
Residual         176.879      10      17.688
Total            453.187      11
a. Predictors: (Constant), No_Adv
b. Dependent Variable: Sales

Coefficients (a)
Model 1       Unstandardized Coefficients   Standardized Coefficients      t      Sig.
                   B       Std. Error               Beta
(Constant)      16.937       4.982                                      3.400    .007
No_Adv           2.083        .527                  .781                3.952    .003
a. Dependent Variable: Sales


Summary Output 2: Sales vs. Ex_Adv

Model Summary
Model   R      R Square   Adjusted R Square   Std. Error of the Estimate
  1    .820a     .673           .640                  3.84900
a. Predictors: (Constant), Ex_Adv

ANOVA (b)
Model 1      Sum of Squares   df   Mean Square      F       Sig.
Regression       305.039       1     305.039     20.590    .001a
Residual         148.148      10      14.815
Total            453.187      11
a. Predictors: (Constant), Ex_Adv
b. Dependent Variable: Sales

Coefficients (a)
Model 1       Unstandardized Coefficients   Standardized Coefficients      t      Sig.
                   B       Std. Error               Beta
(Constant)       4.173       7.109                                       .587    .570
Ex_Adv           2.872        .633                  .820                4.538    .001
a. Dependent Variable: Sales

Summary Output 3: Sales vs. No_Adv & Ex_Adv

Model Summary
Model   R      R Square   Adjusted R Square   Std. Error of the Estimate
  1    .827a     .684           .614                  3.98888
a. Predictors: (Constant), Ex_Adv, No_Adv

ANOVA (b)
Model 1      Sum of Squares   df   Mean Square      F       Sig.
Regression       309.986       2     154.993      9.741    .006a
Residual         143.201       9      15.911
Total            453.187      11
a. Predictors: (Constant), Ex_Adv, No_Adv
b. Dependent Variable: Sales

Coefficients (a)
Model 1       Unstandardized Coefficients   Standardized Coefficients      t      Sig.
                   B       Std. Error               Beta
(Constant)       6.584       8.542                                       .771    .461
No_Adv            .625       1.120                  .234                 .558    .591
Ex_Adv           2.139       1.470                  .611                1.455    .180
a. Dependent Variable: Sales

Qualitative Independent Variables
Johnson Filtration, Inc., provides maintenance service for water filtration systems throughout southern Florida.
To estimate the service time and the service cost, the managers want to predict the repair time necessary for each maintenance request.
Repair time is believed to be related to two factors:
Number of months since the last maintenance service
Type of repair problem (mechanical or electrical)

Data for a sample of 10 service calls are given:

Service Call   Months Since Last Service   Type of Repair   Repair Time in Hours
     1                    2                  electrical             2.9
     2                    6                  mechanical             3.0
     3                    8                  electrical             4.8
     4                    3                  mechanical             1.8
     5                    2                  electrical             2.9
     6                    7                  electrical             4.9
     7                    9                  mechanical             4.2
     8                    8                  mechanical             4.8
     9                    4                  electrical             4.4
    10                    6                  electrical             4.5

Let Y denote the repair time and X1 denote the number of months since the last maintenance service.
The regression model that uses X1 only to regress Y is
Y = β0 + β1 X1 + ε
Using the least squares method, we fitted the model as
Ŷ = 2.1473 + 0.3041 X1,   R² = 0.534
At the 5% level of significance, we reject
H0: β0 = 0 (using the t test)
H0: β1 = 0 (using the t and F tests)
X1 alone explains 53.4% of the variability in repair time.
To introduce the type of repair into the model, we define a dummy variable given as
X2 = 0 if the type of repair is mechanical, and X2 = 1 if the type of repair is electrical.
The regression model that uses X1 and X2 to regress Y is
Y = β0 + β1 X1 + β2 X2 + ε
Is the new model improved?
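A minimal sketch (not from the original slides; NumPy assumed) showing how the repair type can be encoded as a 0/1 dummy variable and the two-regressor model fitted for the Johnson Filtration data:

```python
# Encode the type of repair as a dummy variable and fit Y on X1 and X2.
import numpy as np

months = np.array([2, 6, 8, 3, 2, 7, 9, 8, 4, 6])
repair_type = ["electrical", "mechanical", "electrical", "mechanical", "electrical",
               "electrical", "mechanical", "mechanical", "electrical", "electrical"]
hours = np.array([2.9, 3.0, 4.8, 1.8, 2.9, 4.9, 4.2, 4.8, 4.4, 4.5])

x2 = np.array([1.0 if t == "electrical" else 0.0 for t in repair_type])   # dummy variable

X = np.column_stack([np.ones(len(hours)), months, x2])
b, *_ = np.linalg.lstsq(X, hours, rcond=None)
print(b.round(4))   # b0, b1 (months), b2 (shift in mean repair time for electrical repairs)
```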

Summary
Multiple linear regression model: Y = Xβ + ε
The least squares estimate of β is given by b = (X'X)⁻¹X'Y
R² and adjusted R²
Using ANOVA (F test), we examine whether all βs are zero or not.
A t test is conducted for each regressor separately.
Using the t test, we examine whether the β corresponding to that regressor is zero or not.
Problem of multicollinearity: VIF, eigenvalues
Dummy variables
Examining the assumptions: common variance, independence, normality
