7 Regression
Linear Regression Analysis
  Correlation
  Simple Linear Regression
  The Multiple Linear Regression Model
  Least Squares Estimates
  R2 and Adjusted R2
  Overall Validity of the Model (F test)
  Testing for individual regressors (t test)
  Problem of Multicollinearity
Smoking and Lung Capacity
Suppose, for example, we want to investigate the relationship between cigarette smoking and lung capacity.
We might ask a group of people about their smoking habits, and measure their lung capacities.

  Cigarettes (X):      0   5  10  15  20
  Lung Capacity (Y):  45  42  33  31  29
Gaurav Garg (IIM Lucknow)
[Scatter plot of the data: lung capacity (0-60) against cigarettes smoked (0-30)]
We can see that as smoking goes up, lung capacity tends to go down.
The two variables change their values in opposite directions.

Height and Weight
Consider the following data of heights and weights of 5 women swimmers:

  Height (inch):    62   64   65   66   68
  Weight (pounds): 102  108  115  128  132

We can observe that weight is also increasing with height.
[Scatter plot of weight (0-150) against height (60-70)]
Sometimes two variables are related to each other.
The values of both variables are paired.
A change in the value of one affects the value of the other.
Usually these two variables are two attributes of each member of the population.
For example:
  Height - Weight
  Advertising Expenditure - Sales Volume
  Unemployment - Crime Rate
  Rainfall - Food Production
  Expenditure - Savings
We have already studied one measure of relationship between two variables: covariance.
Covariance between two random variables X and Y is given by
  Cov(X, Y) = E(XY) - E(X)E(Y) = σ_XY
For paired observations on variables X and Y,
  Cov(X, Y) = S_XY = (1/n) Σ_{i=1}^{n} (x_i - x̄)(y_i - ȳ)
12/11/2012
Correlation
Properties of covariance:
  It is not unit free.
  So it is not a good measure of relationship between two variables.
A better measure is the correlation coefficient.
  It is unit free and takes values in [-1, +1].
Karl Pearson's correlation coefficient is given by
  r_XY = Cov(X, Y) / √(Var(X) Var(Y))
where Cov(X, Y) = E(XY) - E(X)E(Y),
  Var(X) = E(X²) - [E(X)]²,  Var(Y) = E(Y²) - [E(Y)]²
Properties of Correlation Coefficient
[Scatter diagrams illustrating: positively correlated, negatively correlated, weakly correlated, strongly correlated, and not correlated data]
The correlation coefficient measures the strength of a linear relationship.
r = 0 does not necessarily imply that there is no relationship.
A relationship may be there, but not a linear one.
For sample data these are computed as
  Cov(X, Y) = (1/n) Σ_{i=1}^{n} (x_i - x̄)(y_i - ȳ)
  Var(X) = (1/n) Σ_{i=1}^{n} (x_i - x̄)²,  Var(Y) = (1/n) Σ_{i=1}^{n} (y_i - ȳ)²
  r_XY = Corr(X, Y) = Cov(X, Y) / √(Var(X) Var(Y))
Example (n = 8 paired observations, x̄ = 2.15, ȳ = 80):

  x      y    x-x̄    y-ȳ   (x-x̄)²  (y-ȳ)²  (x-x̄)(y-ȳ)
  1.25  125  -0.90    45   0.8100   2025    -40.50
  1.75  105  -0.40    25   0.1600    625    -10.00
  2.25   65   0.10   -15   0.0100    225     -1.50
  2.00   85  -0.15     5   0.0225     25     -0.75
  2.50   75   0.35    -5   0.1225     25     -1.75
  2.25   80   0.10     0   0.0100      0      0.00
  2.70   50   0.55   -30   0.3025    900    -16.50
  2.50   55   0.35   -25   0.1225    625     -8.75
 17.20  640              SSX=1.560 SSY=4450 SSXY=-79.75

  r_XY = Cov(X, Y)/√(Var(X)Var(Y)) = SSXY/√(SSX · SSY) = -79.75/√(1.56 × 4450) = -0.957
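The deviation-based computation above can be sketched in plain Python (a minimal sketch; the data are the eight pairs tabulated above, and the variable names are mine):

```python
# Pearson correlation from deviation sums of squares,
# for the 8 paired observations tabulated above.
x = [1.25, 1.75, 2.25, 2.00, 2.50, 2.25, 2.70, 2.50]
y = [125, 105, 65, 85, 75, 80, 50, 55]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n            # 2.15 and 80.0
ssx = sum((xi - xbar) ** 2 for xi in x)        # SSX  = 1.56
ssy = sum((yi - ybar) ** 2 for yi in y)        # SSY  = 4450
ssxy = sum((xi - xbar) * (yi - ybar)
           for xi, yi in zip(x, y))            # SSXY = -79.75
r = ssxy / (ssx * ssy) ** 0.5                  # r is about -0.957
```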
Alternative Formulas for Sum of Squares

  SSX = Σx² - (Σx)²/n,  SSY = Σy² - (Σy)²/n,  SSXY = Σxy - (Σx)(Σy)/n

  x      y     x²       y²      x·y
  1.25  125   1.5625   15625   156.25
  1.75  105   3.0625   11025   183.75
  2.25   65   5.0625    4225   146.25
  2.00   85   4.0000    7225   170.00
  2.50   75   6.2500    5625   187.50
  2.25   80   5.0625    6400   180.00
  2.70   50   7.2900    2500   135.00
  2.50   55   6.2500    3025   137.50
 17.20  640  38.54     55650  1296.25
Smoking and Lung Capacity Example

  Cigarettes (X)  Lung Capacity (Y)   X²     Y²     XY
        0               45              0   2025      0
        5               42             25   1764    210
       10               33            100   1089    330
       15               31            225    961    465
       20               29            400    841    580
 Sum:  50              180            750   6680   1585
Regression Analysis
Having determined the correlation between X and Y, we wish to determine a mathematical relationship between them.
Dependent variable: the variable you wish to explain.
Independent variables: the variables used to explain the dependent variable.
Regression analysis is used to:
  predict the value of the dependent variable based on the value of the independent variable(s);
  explain the impact of changes in an independent variable on the dependent variable.
For the smoking example,
  r_XY = [nΣXY - (ΣX)(ΣY)] / √{[nΣX² - (ΣX)²][nΣY² - (ΣY)²]}
       = [(5)(1585) - (50)(180)] / √{[(5)(750) - 50²][(5)(6680) - 180²]}
       = (7925 - 9000) / √(1250 × 1000)
       = -0.9615
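The same raw-sum formula can be sketched directly in Python (a minimal sketch using the smoking data above; the variable names are mine):

```python
# r via the raw-sum formula, using the smoking / lung capacity data.
x = [0, 5, 10, 15, 20]        # cigarettes
y = [45, 42, 33, 31, 29]      # lung capacity
n = len(x)
sx, sy = sum(x), sum(y)                        # 50, 180
sxx = sum(v * v for v in x)                    # 750
syy = sum(v * v for v in y)                    # 6680
sxy = sum(a * b for a, b in zip(x, y))         # 1585
r = (n * sxy - sx * sy) / ((n * sxx - sx ** 2) * (n * syy - sy ** 2)) ** 0.5
# r = -1075 / sqrt(1250 * 1000), about -0.9615
```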
Types of Relationships
[Scatter plots illustrating linear and curvilinear relationships, strong and weak relationships, and no relationship]

Simple Linear Regression Analysis
The simplest mathematical relationship is
  Y = a + bX + error   (linear)
Changes in Y are related to the changes in X.
What are the most suitable values of a (intercept) and b (slope)?
[Diagram of the line y = a + b·x, showing the intercept a and slope b]
Method of Least Squares
We want to fit a line for which all the errors are minimum.
We want to obtain such values of a and b in Y = a + bX + error for which all the errors are minimum.
To minimize all the errors together, we minimize the sum of squares of errors (SSE):
  SSE = Σ_{i=1}^{n} (Y_i - a - bX_i)²
[Diagram: observed point (x_i, y_i), fitted value a + bx_i, and the error between them]
Setting the partial derivatives of SSE to zero:
  ∂SSE/∂a = 0  ⇒  -2 Σ_{i=1}^{n} (Y_i - a - bX_i) = 0    ⇒  ΣY_i = na + b ΣX_i            ...(1)
  ∂SSE/∂b = 0  ⇒  -2 Σ_{i=1}^{n} (Y_i - a - bX_i)X_i = 0  ⇒  ΣY_iX_i = a ΣX_i + b ΣX_i²    ...(2)
Eq. (1) and (2) are called normal equations.
Solving the normal equations, we get a and b:
  b = [n ΣY_iX_i - (ΣY_i)(ΣX_i)] / [n ΣX_i² - (ΣX_i)²] = SSXY/SSX
  a = Ȳ - bX̄
Thus b = SSXY/SSX and a = Ȳ - bX̄.
Also, the correlation coefficient between X and Y is
  r_XY = Cov(X, Y)/√(Var(X)Var(Y)) = SSXY/√(SSX · SSY) = (SSXY/SSX) √(SSX/SSY) = b √(SSX/SSY)
For the earlier example (X̄ = 2.15, Ȳ = 80):

  x      y    x-x̄    y-ȳ   (x-x̄)²  (y-ȳ)²  (x-x̄)(y-ȳ)
  1.25  125  -0.90    45   0.8100   2025    -40.50
  1.75  105  -0.40    25   0.1600    625    -10.00
  2.25   65   0.10   -15   0.0100    225     -1.50
  2.00   85  -0.15     5   0.0225     25     -0.75
  2.50   75   0.35    -5   0.1225     25     -1.75
  2.25   80   0.10     0   0.0100      0      0.00
  2.70   50   0.55   -30   0.3025    900    -16.50
  2.50   55   0.35   -25   0.1225    625     -8.75
 17.20  640              SSX=1.560 SSY=4450 SSXY=-79.75

  r_XY = SSXY/√(SSX · SSY) = -0.957
  b = SSXY/SSX = -79.75/1.56 = -51.12
  a = Ȳ - bX̄ = 80 - (-51.12)(2.15) = 189.91
Fitted line: Ŷ = 189.91 - 51.12 X
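The least squares fit above can be sketched in plain Python (a minimal sketch on the same eight pairs; the variable names are mine):

```python
# Least squares slope b = SSXY/SSX and intercept a = ybar - b*xbar,
# for the 8-point example fitted above.
x = [1.25, 1.75, 2.25, 2.00, 2.50, 2.25, 2.70, 2.50]
y = [125, 105, 65, 85, 75, 80, 50, 55]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
ssx = sum((xi - xbar) ** 2 for xi in x)
ssxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b = ssxy / ssx           # about -51.12
a = ybar - b * xbar      # about 189.91
```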
[Scatter plot of the data with the fitted line Ŷ = 189.91 - 51.12 X]
Residuals: e_i = Y_i - Ŷ_i
A residual is the unexplained part of Y.
The smaller the residuals, the better the utility of the regression.
The sum of residuals is always zero; the least squares procedure ensures that.
Residuals play an important role in investigating the adequacy of the fitted model.
We obtain the coefficient of determination (R²) using the residuals.
R² is used to examine the adequacy of the fitted linear model to the given data.
Coefficient of Determination
For each observation, the total deviation splits as
  (Y - Ȳ) = (Ŷ - Ȳ) + (Y - Ŷ)
[Diagram showing Y - Ȳ decomposed into Ŷ - Ȳ and Y - Ŷ about the fitted line]
  R² = 1: perfect linear relationship; 100% of the variation in Y is explained by X (r = -1 or r = +1).
  0 < R² < 1: weak linear relationship; some but not all of the variation in Y is explained by X.
  R² = 0: no linear relationship; none of the variation in Y is explained by X.
For the example, with Ŷ = 189.91 - 51.12 x:

  x      y     Ŷ      Y-Ȳ   Y-Ŷ    Ŷ-Ȳ   (Y-Ȳ)²  (Y-Ŷ)²  (Ŷ-Ȳ)²
  1.25  125  126.0    45   -1.0    46.0   2025     1.00   2116.00
  1.75  105  100.5    25    4.5    20.5    625    20.25    420.25
  2.25   65   74.9   -15   -9.9    -5.1    225    98.01     26.01
  2.00   85   87.7     5   -2.2     7.7     25     4.84     59.29
  2.50   75   62.1    -5   12.9   -17.7     25   166.41    313.29
  2.25   80   74.9     0    5.1    -5.1      0    26.01     26.01
  2.70   50   51.9   -30   -1.9   -28.1    900     3.61    789.61
  2.50   55   62.1   -25   -7.1   -17.9    625    50.41    320.41
 17.20  640                             SST=4450 SSE=370.54 SSR=4079.46
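With the fitted line, R² = 1 - SSE/SST can be sketched directly (a minimal sketch; the rounded coefficients 189.91 and -51.12 come from the fit above, so the result differs slightly from the tabled sums):

```python
# R^2 = 1 - SSE/SST for the fitted line Yhat = 189.91 - 51.12*x.
x = [1.25, 1.75, 2.25, 2.00, 2.50, 2.25, 2.70, 2.50]
y = [125, 105, 65, 85, 75, 80, 50, 55]
a, b = 189.91, -51.12
ybar = sum(y) / len(y)
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
sst = sum((yi - ybar) ** 2 for yi in y)     # 4450
r2 = 1 - sse / sst                          # about 0.92
```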
Example:
Watching television also reduces the amount of physical exercise, causing weight gains.
A sample of fifteen 10-year-old children was taken.
The number of pounds each child was overweight was recorded (a negative number indicates the child is underweight).
Additionally, the number of hours of television viewing per week was also recorded. These data are listed here.

  TV:         42 34 25 35 37 38 31 33 19 29 38 28 29 36 18
  Overweight: 18  6  0  1 13 14  7  7  9  8  8  5  3 14  7

Calculate the sample regression line and describe what the coefficients tell you about the relationship between the two variables.
  Ŷ = -24.709 + 0.967X and R² = 0.768
[Line chart of observed Y and predicted Y for the 15 children]
Standard Error
Consider a data set.
All the observations cannot be exactly the same as the arithmetic mean (AM).
Variability of the observations around the AM is measured by the standard deviation.
Similarly, in regression, all Y values cannot be the same as the predicted Y values.
Variability of the Y values around the prediction line is measured by the STANDARD ERROR OF THE ESTIMATE.
It is given by
  S_YX = √(SSE/(n - 2)) = √( Σ_{i=1}^{n} (Y_i - Ŷ_i)² / (n - 2) )

Assumptions
  The relationship between X and Y is linear.
  Error values are statistically independent.
  All the errors have a common variance (homoscedasticity): Var(e_i) = σ², where e_i = Y_i - Ŷ_i.
  E(e_i) = 0.
No distributional assumption about the errors is required for the least squares method.
Residual plots are used to check the assumptions:
  Linearity: residuals plotted against X should show no pattern; a curved pattern indicates a non-linear relation.
  Independence: residuals should show no systematic trend over observations.
  Equal variance: the spread of the residuals should be constant (homoscedastic); a changing spread indicates unequal variance (heteroscedastic).
[Residual plots illustrating linear vs. not linear, independent vs. not independent, equal vs. unequal variance]
TV Watching - Weight Gain Example
[Scatter plot of X and Y, and scatter plot of X and the residuals, for the TV example]
The Multiple Linear Regression Model
In simple linear regression analysis, we fit a linear relation between one independent variable (X) and one dependent variable (Y).
We assume that Y is regressed on only one regressor variable X.
In some situations, the variable Y is regressed on more than one regressor variable (X1, X2, X3, ...).
For example:
  Cost -> labor cost, electricity cost, raw material cost
  Salary -> education, experience
  Sales -> cost, advertising expenditure

Example:
A distributor of frozen dessert pies wants to evaluate factors which influence the demand.
Dependent variable:
  Y: pie sales (units per week)
Independent variables:
  X1: price (in $)
  X2: advertising expenditure ($100s)
Data are collected for 15 weeks.
  Week  Pie Sales  Price ($)  Advertising ($100s)
   1      350        5.50        3.3
   2      460        7.50        3.3
   3      350        8.00        3.0
   4      430        8.00        4.5
   5      350        6.80        3.0
   6      380        7.50        4.0
   7      430        4.50        3.0
   8      470        6.40        3.7
   9      450        7.00        3.5
  10      490        5.00        4.0
  11      340        7.20        3.5
  12      300        7.90        3.2
  13      440        5.90        4.0
  14      450        5.00        3.5
  15      300        7.00        2.7

Using the given data, we wish to fit a linear function of the form
  Y_i = β0 + β1 X_1i + β2 X_2i + ε_i,  i = 1, 2, ..., 15,
where
  Y: pie sales (units per week)
  X1: price (in $)
  X2: advertising expenditure ($100s)
Fitting means we want to get the values of the regression coefficients denoted by β.
The original values of the βs are not known.
We estimate them using the given data.
The Multiple Linear Regression Model
Examine the linear relationship between one dependent variable (Y) and two or more independent variables (X1, X2, ..., Xk).
Multiple linear regression model with k independent variables:
  Y_i = β0 + β1 X_1i + β2 X_2i + ... + βk X_ki + ε_i,  i = 1, 2, ..., n,
with intercept β0, slopes β1, ..., βk, and random error ε_i.
The intercept and slopes are estimated using observed data.
The multiple linear regression equation with k independent variables gives the estimated value
  Ŷ_i = b0 + b1 X_1i + b2 X_2i + ... + bk X_ki,  i = 1, 2, ..., n,
where b0 is the estimate of the intercept and b1, ..., bk are the estimates of the slopes.
Multiple Regression Equation
Estimating Regression Coefficients
Example with two independent variables: the fitted equation Ŷ = b0 + b1 X1 + b2 X2 describes a plane over the (X1, X2) space.
[3-D plot of the fitted plane over the (X1, X2) plane]
The multiple linear regression model
  Y_i = β0 + β1 X_1i + β2 X_2i + ... + βk X_ki + ε_i,  i = 1, 2, ..., n
in matrix notation is
  Y = Xβ + ε,
where
  Y = (Y_1, Y_2, ..., Y_n)' is n × 1,
  X is the n × (k+1) matrix whose i-th row is (1, X_1i, X_2i, ..., X_ki),
  β = (β0, β1, ..., βk)' and ε = (ε_1, ε_2, ..., ε_n)'.
Assumptions
  The number of observations (n) is greater than the number of regressors (k), i.e., n > k.
  Random errors are independent.
  Random errors have the same variance (homoscedasticity): Var(ε_i) = σ².
  In the long run, the mean effect of the random errors is zero: E(ε_i) = 0.
  No assumption on the distribution of the random errors is required for the least squares method.

In order to find the estimate of β, we minimize
  S(β) = Σ_{i=1}^{n} ε_i² = (Y - Xβ)'(Y - Xβ) = Y'Y - 2β'X'Y + β'X'Xβ
We differentiate S(β) with respect to β and equate to zero, i.e., ∂S/∂β = 0.
This gives
  b = (X'X)⁻¹ X'Y
b is called the least squares estimator of β.
Example: Consider the pie example.
We want to fit the model Y_i = β0 + β1 X_1i + β2 X_2i + ε_i.
The variables are
  Y: pie sales (units per week)
  X1: price (in $)
  X2: advertising expenditure ($100s)
Using the matrix formula, the least squares estimates (LSE) of the βs are obtained as below:
  LSE of intercept β0:  Intercept (b0) = 306.53
  LSE of slope β1:      Price (b1) = -24.98
  LSE of slope β2:      Advertising (b2) = 74.13
Pie Sales = 306.53 - 24.98 Price + 74.13 Adv. Expend.
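The matrix formula b = (X'X)⁻¹X'Y can be sketched in plain Python by solving the normal equations (X'X)b = X'Y with Gauss-Jordan elimination (a minimal stdlib-only sketch, not the software used for the slides; the data are the 15 pie-sales weeks above):

```python
# Least squares via the normal equations (X'X) b = X'Y for the pie data.
sales = [350, 460, 350, 430, 350, 380, 430, 470, 450, 490, 340, 300, 440, 450, 300]
price = [5.50, 7.50, 8.00, 8.00, 6.80, 7.50, 4.50, 6.40, 7.00, 5.00, 7.20, 7.90, 5.90, 5.00, 7.00]
adv   = [3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7, 3.5, 4.0, 3.5, 3.2, 4.0, 3.5, 2.7]

X = [[1.0, p, a] for p, a in zip(price, adv)]   # design matrix with intercept column
XtX = [[sum(r[i] * r[j] for r in X) for j in range(3)] for i in range(3)]
XtY = [sum(r[i] * v for r, v in zip(X, sales)) for i in range(3)]

# Gauss-Jordan elimination with partial pivoting on the augmented system.
A = [row[:] + [v] for row, v in zip(XtX, XtY)]
for c in range(3):
    piv = max(range(c, 3), key=lambda r: abs(A[r][c]))
    A[c], A[piv] = A[piv], A[c]
    for r in range(3):
        if r != c:
            f = A[r][c] / A[c][c]
            A[r] = [u - f * v for u, v in zip(A[r], A[c])]
b = [A[i][3] / A[i][i] for i in range(3)]   # about [306.53, -24.98, 74.13]
```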
Prediction:
Predict sales for a week in which the selling price is $5.50 and the advertising expenditure is $350.
Note that advertising is in $100s, so X2 = 3.5.
  Ŷ = 306.53 - 24.98(5.50) + 74.13(3.5) = 428.62
Predicted sales is 428.62 pies.
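The prediction step is just the fitted equation evaluated at the new X values (a minimal sketch; with the rounded coefficients the result is about 428.6, versus 428.62 with full precision):

```python
# Prediction from the fitted pie-sales equation; advertising is in $100s, so $350 -> 3.5.
b0, b1, b2 = 306.53, -24.98, 74.13
price, adv = 5.50, 3.5
y_hat = b0 + b1 * price + b2 * adv   # about 428.6
```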
  Y     X1    X2    Predicted Y   Residual (Y - Ŷ)
  350   5.5   3.3     413.77        -63.80
  460   7.5   3.3     363.81         96.15
  350   8.0   3.0     329.08         20.88
  430   8.0   4.5     440.28        -10.31
  350   6.8   3.0     359.06         -9.09
  380   7.5   4.0     415.70        -35.74
  430   4.5   3.0     416.51         13.47
  470   6.4   3.7     420.94         49.03
  450   7.0   3.5     391.13         58.84
  490   5.0   4.0     478.15         11.83
  340   7.2   3.5     386.13        -46.16
  300   7.9   3.2     346.40        -46.44
  440   5.9   4.0     455.67        -15.70
  450   5.0   3.5     441.09          8.89
  300   7.0   2.7     331.82        -31.85
Coefficient of Determination
The coefficient of determination (R²) is obtained using the same formula as in simple linear regression:
  R² = SSR/SST = 1 - (SSE/SST)
[Line chart of observed Y and predicted Y across the 15 weeks]
Since SST = SSR + SSE and all three quantities are non-negative,
  0 ≤ SSR ≤ SST
so
  0 ≤ SSR/SST ≤ 1, i.e., 0 ≤ R² ≤ 1.
When R² is close to 0, the linear fit is not good and the X variables do not contribute in explaining the variability in Y.
When R² is close to 1, the linear fit is good.
R² is the proportion of variation in Y explained by the regression.
In the previously discussed example, R² = 0.5215.
If we consider Y and X1 only, R² = 0.1965.
If we consider Y and X2 only, R² = 0.3095.
Adjusted R2
If one more regressor is added to the model, the value of R² will increase.
This increase is regardless of the contribution of the newly added regressor.
So an adjusted value of R² is defined, which is called adjusted R²:
  R²_Adj = 1 - [SSE/(n - k - 1)] / [SST/(n - 1)]
This adjusted R² will only increase if the additional variable contributes in explaining the variation in Y.
For our example, adjusted R² = 0.4417.
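The adjusted R² for the pie example checks out numerically (a minimal sketch using the equivalent form 1 - (1 - R²)(n - 1)/(n - k - 1)):

```python
# Adjusted R^2 for the pie example: R^2 = 0.5215, n = 15 weeks, k = 2 regressors.
r2, n, k = 0.5215, 15, 2
r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)   # about 0.4417
```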
F Test for Overall Significance
We check if there is a linear relationship between all the regressors (X1, X2, ..., Xk) and the response (Y).
We use the F test statistic; the technique of Analysis of Variance is used.
The total sum of squares (SST) is partitioned into the sum of squares due to regression (SSR) and the sum of squares due to residuals (SSE), where
  SST = Σ_{i=1}^{n} (Y_i - Ȳ)²,  SSE = Σ e_i² = Σ (Y_i - Ŷ_i)²,  SSR = SST - SSE
The e_i's are called the residuals.
Assumptions:
  n > k, Var(ε_i) = σ², E(ε_i) = 0.
  The ε_i's are independent. This implies that Corr(ε_i, ε_j) = 0 for i ≠ j.
  The ε_i's have a normal distribution: ε_i ~ N(0, σ²). [NEW ASSUMPTION]
Analysis of Variance Table

  Source              df      SS    MS    Fc
  Regression          k       SSR   MSR   MSR/MSE
  Residual or Error   n-k-1   SSE   MSE
  Total               n-1     SST

To test:
  H0: β1 = β2 = ... = βk = 0 (no linear relationship)
  H1: at least one βj ≠ 0 (a linear relationship exists between some Xj and Y)
Test statistic: Fc = MSR/MSE ~ F(k, n-k-1)

For the previous example, we wish to test H0: β1 = β2 = 0 against H1: at least one βi ≠ 0.
ANOVA table:

  Source              df   SS         MS         Fc
  Regression           2   29460.03   14730.01   6.5386
  Residual or Error   12   27033.31    2252.78
  Total               14   56493.33

Since Fc = 6.5386 > F(2,12)(0.05) = 3.89, H0 is rejected at the 5% level of significance.
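The ANOVA arithmetic above can be sketched directly (a minimal sketch from the tabled sums of squares):

```python
# F statistic from the ANOVA table: Fc = MSR/MSE, MSR = SSR/k, MSE = SSE/(n-k-1).
ssr, sse = 29460.03, 27033.31
n, k = 15, 2
msr = ssr / k                 # 14730.01(5)
mse = sse / (n - k - 1)       # about 2252.78
fc = msr / mse                # about 6.5386
reject_h0 = fc > 3.89         # compare with the F(2,12) 5% critical value
```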
Individual Variables: Tests of Hypothesis
We test if there is a linear relationship between a particular regressor Xj and Y.
Hypotheses: H0: βj = 0 against H1: βj ≠ 0.
We use a two-tailed t test with test statistic
  Tc = bj / √(σ̂² Cjj)
where σ̂² = MSE and Cjj is the j-th diagonal element of (X'X)⁻¹.
If H0: βj = 0 is accepted, this indicates that the variable Xj can be deleted from the model.

In our example, σ̂² = 2252.7755 (MSE is obtained in the ANOVA table).
To test H0: β1 = 0 against H1: β1 ≠ 0: Tc = -2.3057
To test H0: β2 = 0 against H1: β2 ≠ 0: Tc = 2.8548
Two-tailed critical values of t at 12 d.f. are
  3.0545 for 1% level of significance
  2.6810 for 2% level of significance
  2.1788 for 5% level of significance
Thus H0 is rejected at the 5% level of significance for both regressors.
Standard Error
Consider a data set.
All the observations cannot be exactly the same as the arithmetic mean (AM).
Variability of the observations around the AM is measured by the standard deviation.
Similarly, in regression, all Y values cannot be the same as the predicted Y values.
Variability of the Y values around the prediction line is measured by the STANDARD ERROR OF THE ESTIMATE.
For multiple regression it is given by
  S_YX = √(SSE/(n - k - 1)) = √( Σ_{i=1}^{n} (Y_i - Ŷ_i)² / (n - k - 1) )

Residual Analysis
Assumption of linearity: plot the residuals against X; no pattern indicates a linear relation, a curved pattern indicates a non-linear one.
[Residual plots illustrating linear vs. not linear]
Assumption of equal variance: we assume that Var(ε_i) = σ², i.e., the variance is constant for all observations.
This assumption is examined by looking at the plot of the predicted values Ŷ_i against the residuals e_i = Y_i - Ŷ_i.
[Residual plots illustrating equal vs. unequal variance]
Assumption of Uncorrelated Residuals
Residual Analysis for Independence (Uncorrelated Errors)
The Durbin-Watson statistic is a test statistic used to detect the presence of autocorrelation.
It is given by
  d = Σ_{i=2}^{n} (e_i - e_{i-1})² / Σ_{i=1}^{n} e_i²
[Residual plots illustrating independent vs. not independent errors]
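The statistic is a one-liner over the residual sequence (a minimal sketch; the function name is mine):

```python
# Durbin-Watson statistic: d = sum_{i=2..n} (e_i - e_{i-1})^2 / sum_i e_i^2.
# d near 2 suggests uncorrelated errors; d near 0, positive autocorrelation.
def durbin_watson(e):
    num = sum((e[i] - e[i - 1]) ** 2 for i in range(1, len(e)))
    den = sum(v * v for v in e)
    return num / den
```

For example, a constant residual sequence (perfect positive autocorrelation) gives d = 0.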
Assumption of Normality
When we use the F test or t test, we assume that ε1, ε2, ..., εn are normally distributed.
This assumption can be examined by a histogram of the residuals.
Normality can also be examined using a Q-Q plot or normal probability plot.
[Histograms and normal probability plots illustrating normal vs. not normal residuals]
Standardized Regression Coefficient
In a multiple linear regression, we may like to know which regressor contributes more.
We obtain standardized estimates of the regression coefficients.
For that, first we standardize the observations:
  Ȳ = (1/n) Σ Y_i,    s_Y = √( Σ_{i=1}^{n} (Y_i - Ȳ)² / (n - 1) )
  X̄1 = (1/n) Σ X_1i,  s_X1 = √( Σ_{i=1}^{n} (X_1i - X̄1)² / (n - 1) )
  X̄2 = (1/n) Σ X_2i,  s_X2 = √( Σ_{i=1}^{n} (X_2i - X̄2)² / (n - 1) )
  Standardized Y_i = (Y_i - Ȳ)/s_Y,  Standardized X_1i = (X_1i - X̄1)/s_X1,  Standardized X_2i = (X_2i - X̄2)/s_X2
Fit the regression on the standardized data and obtain the least squares estimates of the regression coefficients.
These coefficients are dimensionless, or unit free, and can be compared.
Look for the regression coefficient having the highest magnitude; the corresponding regressor contributes the most.
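The standardization step can be sketched as a small helper (a minimal sketch; the function name is mine, and s is the sample standard deviation with divisor n - 1, as on the slide):

```python
# Standardize a variable: (value - mean) / s, with s computed using n - 1.
def standardize(v):
    n = len(v)
    m = sum(v) / n
    s = (sum((x - m) ** 2 for x in v) / (n - 1)) ** 0.5
    return [(x - m) / s for x in v]
```

For example, standardize([1.0, 2.0, 3.0]) gives [-1.0, 0.0, 1.0].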
Standardized Data

  Week  Pie Sales   Price   Advertising
   1     -0.78      -0.95    -0.37
   2      0.96       0.76    -0.37
   3     -0.78       1.18    -0.98
   4      0.48       1.18     2.09
   5     -0.78       0.16    -0.98
   6     -0.30       0.76     1.06
   7      0.48      -1.80    -0.98
   8      1.11      -0.18     0.45
   9      0.80       0.33     0.04
  10      1.43      -1.38     1.06
  11     -0.93       0.50     0.04
  12     -1.56       1.10    -0.57
  13      0.64      -0.61     1.06
  14      0.80      -1.38     0.04
  15     -1.56       0.33    -1.60
The fitted standardized regression is
  Ŷ* = 0 - 0.461 X1* + 0.570 X2*
Since |-0.461| < 0.570, X2 contributes the most.

Note that:
  R²_Adj = 1 - (1 - R²)(n - 1)/(n - k - 1)
  Fc = (n - k - 1) R² / [k (1 - R²)]
  Adjusted R² can be negative.
  Adjusted R² is always less than or equal to R².
  Inclusion of the intercept term is not necessary; it depends on the problem. The analyst may decide on this.
Example: The following data were collected for the sales, the number of advertisements published, and the advertising expenditure for 12 weeks. Fit a regression model to predict the sales.

  Sales (0,000 Rs)   Ads (Nos.)   AdvEx (000 Rs)
  43.6               12           13.9
  38.0               11           12
  30.1                             9.3
  35.3                             9.7
  46.4               12           12.3
  34.2                            11.4
  30.2                             9.3
  40.7               13           14.3
  38.5                            10.2
  22.6                             8.4
  37.6                            11.2
  35.2               10           11.1

ANOVA(b)
  Model 1      Sum of Squares   df   Mean Square   F       Sig.
  Regression   309.986           2   154.993       9.741   .006a
  Residual     143.201           9    15.911
  Total        453.187          11

p-value < 0.05; H0 is rejected; all βs are not zero.

Coefficients(a) (Dependent Variable: Sales)
  Model 1      Unstandardized B   Std. Error   Standardized Beta   t       Sig.
  (Constant)   6.584              8.542                             .771   .461
  No_Adv        .625              1.120         .234                .558   .591
  Ex_Adv       2.139              1.470         .611               1.455   .180

All p-values > 0.05; no H0 rejected: β0 = 0, β1 = 0, β2 = 0.
CONTRADICTION

Multicollinearity
We assume that regressors are independent variables.
That is, all the regressors affect the values of Y, but one regressor does not affect the values of another regressor.
Sometimes, in practice, this assumption is not met; we face the problem of multicollinearity.
The correlated variables contribute redundant information to the model.
Including two highly correlated independent variables can adversely affect the regression results and can lead to unstable coefficients.
Some indications of strong multicollinearity:
  Coefficient signs may not match prior expectations.
  Large change in the value of a previous coefficient when a new variable is added to the model.
  A previously significant variable becomes insignificant when a new independent variable is added.
  F says at least one variable is significant, but none of the t's indicates a useful variable.
  Large standard error, yet the corresponding regressor is still significant.
  MSE is very high and/or R² is very small.

Examples in which this might happen:
  Miles per gallon vs. horsepower and engine size
  Income vs. age and experience
  Sales vs. no. of advertisements and advertising expenditure

Variance Inflationary Factor:
VIFj is used to measure the multicollinearity generated by variable Xj.
It is given by
  VIFj = 1/(1 - R²j)
where R²j is the coefficient of determination of a regression model that uses Xj as the dependent variable and all other X variables as the independent variables.

A near-zero eigenvalue is also an indication of multicollinearity.
The methods of dealing with multicollinearity:
  Collecting additional data
  Variable elimination
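The VIF formula is simple enough to sketch directly (a minimal sketch; the function name is mine, and tolerance is the reciprocal of VIF):

```python
# Variance inflationary factor: VIF_j = 1 / (1 - R_j^2).
def vif(r2_j):
    return 1.0 / (1.0 - r2_j)

# In the ads example the reported tolerance is .199, i.e. R_j^2 = 0.801,
# giving a VIF of about 5.03 (the SPSS output shows 5.022), which exceeds 5.
vif_ads = vif(0.801)
```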
Coefficients(a) (Dependent Variable: Sales)
  Model 1      Unstandardized B   Std. Error   Standardized Beta   t       Sig.   Tolerance   VIF
  (Constant)   6.584              8.542                             .771   .461
  No_Adv        .625              1.120         .234                .558   .591   .199        5.022
  Ex_Adv       2.139              1.470         .611               1.455   .180   .199        5.022

Tolerance = 1/VIF; here the VIF is greater than 5.

Collinearity Diagnostics(a) (Dependent Variable: Sales)
  Model 1, Dimension   Eigenvalue   Condition Index   Variance Proportions (Constant, No_Adv, Ex_Adv)
  1                     2.966        1.000            .00, .00, .00
  2                      .030        9.882            .33, .17, .00
  3                      .003       30.417            .67, .83, 1.00

The negligible (near-zero) eigenvalue and the large condition index indicate multicollinearity.

We may use the method of variable elimination.
In practice, if Corr(X1, X2) is more than 0.7 or less than -0.7, we eliminate one of them.
Techniques:
  Stepwise (based on ANOVA)
  Forward Inclusion (based on correlation)
  Backward Elimination (based on correlation)
Stepwise Regression
  Y = β0 + β1 X1 + β2 X2 + β3 X3 + β4 X4 + β5 X5 + ε
Step 1: Run 5 simple linear regressions:
  Y = β0 + β1 X1
  Y = β0 + β2 X2
  Y = β0 + β3 X3
  Y = β0 + β4 X4   <== has lowest p-value (ANOVA) < 0.05
  Y = β0 + β5 X5
Step 2: Run 4 two-variable linear regressions:
  Y = β0 + β4 X4 + β1 X1
  Y = β0 + β4 X4 + β2 X2
  Y = β0 + β4 X4 + β3 X3   <== has lowest p-value (ANOVA) < 0.05
  Y = β0 + β4 X4 + β5 X5
Step 3: Run 3 three-variable linear regressions:
  Y = β0 + β3 X3 + β4 X4 + β1 X1
  Y = β0 + β3 X3 + β4 X4 + β2 X2
  Y = β0 + β3 X3 + β4 X4 + β5 X5
Suppose none of these models have p-values < 0.05.
STOP. The best model is the one with X3 and X4 only.
Example: The following data were collected for the sales, the number of advertisements published, and the advertising expenditure for 12 months. Fit a regression model to predict the sales.

  Sales (0,000 Rs)   Ads (Nos.)   AdvEx (000 Rs)
  43.6               12           13.9
  38.0               11           12
  30.1                             9.3
  35.3                             9.7
  46.4               12           12.3
  34.2                            11.4
  30.2                             9.3
  40.7               13           14.3
  38.5                            10.2
  22.6                             8.4
  37.6                            11.2
  35.2               10           11.1
Summary Output 1: Sales vs. No_Adv

Model Summary (Predictors: (Constant), No_Adv)
  Model   R      R Square   Adjusted R Square
  1       .781a  .610       .571

ANOVA(b)
  Model 1      Sum of Squares   df   Mean Square   F        Sig.
  Regression   276.308           1   276.308       15.621   .003a
  Residual     176.879          10    17.688
  Total        453.187          11

Coefficients(a) (Dependent Variable: Sales)
  Model 1      Unstandardized B   Std. Error   Standardized Beta   t       Sig.
  (Constant)   16.937             4.982                            3.400   .007
  No_Adv        2.083              .527         .781               3.952   .003
Summary Output 2: Sales vs. Ex_Adv

Model Summary (Predictors: (Constant), Ex_Adv)
  Model   R      R Square   Adjusted R Square
  1       .820a  .673       .640

ANOVA(b)
  Model 1      Sum of Squares   df   Mean Square   F        Sig.
  Regression   305.039           1   305.039       20.590   .001a
  Residual     148.148          10    14.815
  Total        453.187          11

Coefficients(a) (Dependent Variable: Sales)
  Model 1      Unstandardized B   Std. Error   Standardized Beta   t       Sig.
  (Constant)   4.173              7.109                             .587   .570
  Ex_Adv       2.872               .633         .820               4.538   .001

Summary Output 3: Sales vs. No_Adv & Ex_Adv

ANOVA(b) (Predictors: (Constant), Ex_Adv, No_Adv)
  Model 1      Sum of Squares   df   Mean Square   F       Sig.
  Regression   309.986           2   154.993       9.741   .006a
  Residual     143.201           9    15.911
  Total        453.187          11

Coefficients(a) (Dependent Variable: Sales)
  Model 1      Unstandardized B   Std. Error   Standardized Beta   t       Sig.
  (Constant)   6.584              8.542                             .771   .461
  No_Adv        .625              1.120         .234                .558   .591
  Ex_Adv       2.139              1.470         .611               1.455   .180
Qualitative Independent Variables
Johnson Filtration, Inc., provides maintenance service for water filtration systems throughout southern Florida.
To estimate the service time and the service cost, the managers want to predict the repair time necessary for each maintenance request.
Repair time is believed to be related to two factors:
  the number of months since the last maintenance service, and
  the type of repair problem (mechanical or electrical).
Using the least squares method, we fitted the model
  Ŷ = 2.1473 + 0.3041 X1,  R² = 0.534
At the 5% level of significance, we reject
  H0: β0 = 0 (using the t test)
  H0: β1 = 0 (using the t and F tests)
X1 alone explains 53.4% of the variability in repair time.
To introduce the type of repair into the model, we define a dummy variable given as
  X2 = 0, if type of repair is mechanical
       1, if type of repair is electrical
Is the new model improved?
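The dummy coding for the ten service calls can be sketched as follows (a minimal sketch; the function name is mine, and the repair types are those listed in the data below):

```python
# Dummy coding for the type of repair: 0 = mechanical, 1 = electrical.
def repair_dummy(kind):
    return 1 if kind == "electrical" else 0

types = ["electrical", "mechanical", "electrical", "mechanical", "electrical",
         "electrical", "mechanical", "mechanical", "electrical", "electrical"]
x2 = [repair_dummy(t) for t in types]   # [1, 0, 1, 0, 1, 1, 0, 0, 1, 1]
```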
Data for a sample of 10 service calls are given:

  Service Call   Months Since Last Service   Type of Repair   Repair Time in Hours
   1              2                          electrical        2.9
   2              6                          mechanical        3.0
   3              8                          electrical        4.8
   4              3                          mechanical        1.8
   5              2                          electrical        2.9
   6              7                          electrical        4.9
   7              9                          mechanical        4.2
   8              8                          mechanical        4.8
   9              4                          electrical        4.4
  10              6                          electrical        4.5
Summary Output 3, Model Summary (Predictors: (Constant), Ex_Adv, No_Adv)
  Model   R      R Square   Adjusted R Square
  1       .827a  .684       .614
Summary
  Multiple linear regression model: Y = Xβ + ε
  Least squares estimate of β is given by b = (X'X)⁻¹X'Y
  R² and adjusted R²
  Using ANOVA (F test), we examine whether all the βs are zero or not.
  A t test is conducted for each regressor separately; using it, we examine whether the β corresponding to that regressor is zero or not.
  Problem of multicollinearity: VIF, eigenvalue
  Dummy variable
  Examining the assumptions: common variance, independence, normality