
Linear Regression Analysis

Correlation
Simple Linear Regression
The Multiple Linear Regression Model
Least Squares Estimates
R² and Adjusted R²
Overall Validity of the Model (F test)
Testing for Individual Regressors (t test)
Problem of Multicollinearity

Smoking and Lung Capacity
Suppose, for example, we want to investigate the relationship between cigarette smoking and lung capacity.
We might ask a group of people about their smoking habits, and measure their lung capacities.

Cigarettes (X)   Lung Capacity (Y)
      0                 45
      5                 42
     10                 33
     15                 31
     20                 29

Scatter plot of the data
[Scatter plot: Cigarettes (X) on the horizontal axis, Lung Capacity (Y) on the vertical axis]
We can see that as smoking goes up, lung capacity tends to go down.
The two variables change their values in opposite directions.

Height and Weight

Consider the following data of heights and weights of 5 women swimmers:
Height (inch):     62   64   65   66   68
Weight (pounds):  102  108  115  128  132
We can observe that weight is also increasing with height.
[Scatter plot: Height on the horizontal axis, Weight on the vertical axis]

Sometimes two variables are related to each other.
The values of both variables are paired.
A change in the value of one affects the value of the other.
Usually these two variables are two attributes of each member of the population.
For example:
Height and Weight
Advertising Expenditure and Sales Volume
Unemployment and Crime Rate
Rainfall and Food Production
Expenditure and Savings


We have already studied one measure of relationship between two variables: covariance.
Covariance between two random variables X and Y is given by
$\mathrm{Cov}(X, Y) = \sigma_{XY} = E(XY) - E(X)E(Y)$
For paired observations on variables X and Y,
$\mathrm{Cov}(X, Y) = s_{XY} = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$


Correlation

Properties of Covariance:
Cov(X + a, Y + b) = Cov(X, Y)   [not affected by change in location]
Cov(aX, bY) = ab Cov(X, Y)   [affected by change in scale]
Covariance can take any value from -∞ to +∞.
Cov(X, Y) > 0 means X and Y change in the same direction.
Cov(X, Y) < 0 means X and Y change in the opposite direction.
If X and Y are independent, Cov(X, Y) = 0. [The other way may not be true.]
Covariance is not unit free.
So it is not a good measure of relationship between two variables.
A better measure is the correlation coefficient.
It is unit free and takes values in [-1, +1].

Karl Pearson's correlation coefficient is given by
$r_{XY} = \mathrm{Corr}(X, Y) = \dfrac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}$
When the joint distribution of X and Y is known,
$\mathrm{Cov}(X, Y) = E(XY) - E(X)E(Y), \quad \mathrm{Var}(X) = E(X^2) - [E(X)]^2, \quad \mathrm{Var}(Y) = E(Y^2) - [E(Y)]^2$
When observations on X and Y are available,
$\mathrm{Cov}(X, Y) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}), \quad \mathrm{Var}(X) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2, \quad \mathrm{Var}(Y) = \frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^2$
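As a quick check of these formulas, here is a short Python sketch (not part of the original slides; NumPy assumed) that computes the covariance and the correlation coefficient for the smoking and lung-capacity data introduced earlier.

```python
# Minimal sketch: Pearson correlation for the smoking / lung-capacity data.
import numpy as np

x = np.array([0, 5, 10, 15, 20])        # cigarettes (X)
y = np.array([45, 42, 33, 31, 29])      # lung capacity (Y)

cov_xy = np.mean((x - x.mean()) * (y - y.mean()))   # 1/n form of the covariance
var_x = np.mean((x - x.mean()) ** 2)
var_y = np.mean((y - y.mean()) ** 2)
r = cov_xy / np.sqrt(var_x * var_y)

print(round(r, 4))   # about -0.9615, matching the value computed later in these notes
```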

Properties of Correlation Coefficient
Corr(aX + b, cY + d) = Corr(X, Y), provided a and c have the same sign.
It is unit free.
It measures the strength of relationship on a scale of -1 to +1.
So, it can be used to compare the relationships of various pairs of variables.
Values close to 0 indicate little or no correlation.
Values close to +1 indicate very strong positive correlation.
Values close to -1 indicate very strong negative correlation.

Scatter Diagram
[Scatter plots of Y against X illustrating positively correlated, negatively correlated, weakly correlated, strongly correlated, and uncorrelated data]

The correlation coefficient measures the strength of the linear relationship.
r = 0 does not necessarily imply that there is no relationship.
A relationship may be there, but not a linear one.

Example: consider the following data, for which x̄ = 2.15 and ȳ = 80.

   x     y    x - x̄   y - ȳ   (x - x̄)²   (y - ȳ)²   (x - x̄)(y - ȳ)
 1.25   125   -0.90     45     0.8100      2025        -40.50
 1.75   105   -0.40     25     0.1600       625        -10.00
 2.25    65    0.10    -15     0.0100       225         -1.50
 2.00    85   -0.15      5     0.0225        25         -0.75
 2.50    75    0.35     -5     0.1225        25         -1.75
 2.25    80    0.10      0     0.0100         0          0.00
 2.70    50    0.55    -30     0.3025       900        -16.50
 2.50    55    0.35    -25     0.1225       625         -8.75
17.20   640                    1.560       4450        -79.75
                                SSX         SSY          SSXY

$\mathrm{SSX} = \sum(x - \bar{x})^2, \quad \mathrm{SSY} = \sum(y - \bar{y})^2, \quad \mathrm{SSXY} = \sum(x - \bar{x})(y - \bar{y})$

$r_{XY} = \dfrac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}} = \dfrac{\mathrm{SSXY}}{\sqrt{\mathrm{SSX}\cdot\mathrm{SSY}}} = \dfrac{-79.75}{\sqrt{1.56 \times 4450}} = -0.957$


Alternative Formulas for Sum of Squares

$\mathrm{SSX} = \sum x^2 - \dfrac{(\sum x)^2}{n}, \quad \mathrm{SSY} = \sum y^2 - \dfrac{(\sum y)^2}{n}, \quad \mathrm{SSXY} = \sum xy - \dfrac{\sum x \sum y}{n}$

   x     y      x²        y²        x·y
 1.25   125    1.5625    15625     156.25
 1.75   105    3.0625    11025     183.75
 2.25    65    5.0625     4225     146.25
 2.00    85    4.0000     7225     170.00
 2.50    75    6.2500     5625     187.50
 2.25    80    5.0625     6400     180.00
 2.70    50    7.2900     2500     135.00
 2.50    55    6.2500     3025     137.50
17.20   640   38.54      55650    1296.25
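As an illustration (not from the original slides), the following sketch verifies the shortcut formulas against the column totals of the table above; NumPy is assumed.

```python
# Minimal sketch: verify the shortcut sum-of-squares formulas on the table above.
import numpy as np

x = np.array([1.25, 1.75, 2.25, 2.00, 2.50, 2.25, 2.70, 2.50])
y = np.array([125, 105, 65, 85, 75, 80, 50, 55])
n = len(x)

ssx = np.sum(x**2) - np.sum(x)**2 / n            # 38.54 - 17.2^2/8  = 1.56
ssy = np.sum(y**2) - np.sum(y)**2 / n            # 55650 - 640^2/8   = 4450
ssxy = np.sum(x*y) - np.sum(x) * np.sum(y) / n   # 1296.25 - 1376    = -79.75

print(ssx, ssy, ssxy)
```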

Smoking and Lung Capacity Example

Cigarettes (X)   Lung Capacity (Y)     X²      Y²      XY
      0                 45              0    2025       0
      5                 42             25    1764     210
     10                 33            100    1089     330
     15                 31            225     961     465
     20                 29            400     841     580
     50                180            750    6680    1585

$r_{XY} = \dfrac{n\sum XY - \sum X \sum Y}{\sqrt{\left[n\sum X^2 - (\sum X)^2\right]\left[n\sum Y^2 - (\sum Y)^2\right]}} = \dfrac{(5)(1585) - (50)(180)}{\sqrt{\left[(5)(750) - 50^2\right]\left[(5)(6680) - 180^2\right]}} = \dfrac{7925 - 9000}{\sqrt{(3750 - 2500)(33400 - 32400)}} = \dfrac{-1075}{\sqrt{1250 \times 1000}} = -0.9615$

Regression Analysis
Having determined the correlation between X and Y, we wish to determine a mathematical relationship between them.
Dependent variable: the variable you wish to explain.
Independent variables: the variables used to explain the dependent variable.
Regression analysis is used to:
Predict the value of the dependent variable based on the value of the independent variable(s).
Explain the impact of changes in an independent variable on the dependent variable.

Types of Relationships
[Scatter plots illustrating linear and curvilinear relationships, and strong and weak relationships between X and Y]


Types of Relationships
[Scatter plot illustrating no relationship between X and Y]

Simple Linear Regression Analysis
The simplest mathematical relationship is
Y = a + bX + error   (linear)
Changes in Y are related to the changes in X.
What are the most suitable values of a (intercept) and b (slope)?
[Diagram: the fitted line y = a + bx, with intercept a and slope b]

Method of Least Squares
We want to fit a line for which all the errors are minimum.
That is, we want to obtain the values of a and b in Y = a + bX + error for which all the errors are minimum.
To minimize all the errors together, we minimize the sum of squares of errors (SSE):
$SSE = \sum_{i=1}^{n} (Y_i - a - bX_i)^2$
[Diagram: for each observation (x_i, y_i), the error is the vertical distance between y_i and the fitted value a + bx_i]
The best fitted line is the one for which all the errors are minimum.

To get the values of a and b which minimize SSE, we proceed as follows:

$\dfrac{\partial\, SSE}{\partial a} = 0 \;\Rightarrow\; -2\sum_{i=1}^{n}(Y_i - a - bX_i) = 0 \;\Rightarrow\; \sum_{i=1}^{n} Y_i = na + b\sum_{i=1}^{n} X_i \qquad (1)$

$\dfrac{\partial\, SSE}{\partial b} = 0 \;\Rightarrow\; -2\sum_{i=1}^{n}(Y_i - a - bX_i)X_i = 0 \;\Rightarrow\; \sum_{i=1}^{n} Y_i X_i = a\sum_{i=1}^{n} X_i + b\sum_{i=1}^{n} X_i^2 \qquad (2)$

Equations (1) and (2) are called the normal equations.
Solving the normal equations, we get

$b = \dfrac{n\sum_{i=1}^{n} X_i Y_i - \sum_{i=1}^{n} X_i \sum_{i=1}^{n} Y_i}{n\sum_{i=1}^{n} X_i^2 - \left(\sum_{i=1}^{n} X_i\right)^2} = \dfrac{\mathrm{SSXY}}{\mathrm{SSX}}, \qquad a = \bar{Y} - b\bar{X}$


The values of a and b obtained using the least squares method are called the least squares estimates (LSE) of a and b.
Thus, the LSE of a and b are given by
$b = \dfrac{\mathrm{SSXY}}{\mathrm{SSX}}, \qquad a = \bar{Y} - b\bar{X}$
Also, the correlation coefficient between X and Y is
$r_{XY} = \dfrac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}} = \dfrac{\mathrm{SSXY}}{\sqrt{\mathrm{SSX}\cdot\mathrm{SSY}}} = b\,\sqrt{\dfrac{\mathrm{SSX}}{\mathrm{SSY}}}$
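A minimal Python sketch (not from the original slides; NumPy assumed) applying these formulas to the example data used in these notes:

```python
# Least squares estimates via b = SSXY/SSX and a = Ybar - b*Xbar.
import numpy as np

x = np.array([1.25, 1.75, 2.25, 2.00, 2.50, 2.25, 2.70, 2.50])
y = np.array([125, 105, 65, 85, 75, 80, 50, 55])

ssx = np.sum((x - x.mean()) ** 2)
ssxy = np.sum((x - x.mean()) * (y - y.mean()))

b = ssxy / ssx                 # slope, about -51.12
a = y.mean() - b * x.mean()    # intercept, about 189.91
print(a, b)
```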

Example (continued): for the same data,

   x     y    x - x̄   y - ȳ   (x - x̄)²   (y - ȳ)²   (x - x̄)(y - ȳ)
 1.25   125   -0.90     45     0.8100      2025        -40.50
 1.75   105   -0.40     25     0.1600       625        -10.00
 2.25    65    0.10    -15     0.0100       225         -1.50
 2.00    85   -0.15      5     0.0225        25         -0.75
 2.50    75    0.35     -5     0.1225        25         -1.75
 2.25    80    0.10      0     0.0100         0          0.00
 2.70    50    0.55    -30     0.3025       900        -16.50
 2.50    55    0.35    -25     0.1225       625         -8.75
17.20   640                    1.560       4450        -79.75
                                SSX         SSY          SSXY

$\bar{X} = 2.15, \quad \bar{Y} = 80$

$r_{XY} = \dfrac{\mathrm{SSXY}}{\sqrt{\mathrm{SSX}\cdot\mathrm{SSY}}} = -0.957, \qquad b = \dfrac{\mathrm{SSXY}}{\mathrm{SSX}} = -51.12, \qquad a = \bar{Y} - b\bar{X} = 189.91$

Fitted line: Ŷ = 189.91 - 51.12 X
[Scatter plot of the data with the fitted line, X from 0.25 to 2.75, Y from 40 to 140]

189.91 is the estimated mean value of Y when the value of X is zero.
-51.12 is the change in the average value of Y as a result of a one-unit change in X.
We can predict the value of Y for some given value of X.
For example, at X = 2.15, the predicted value of Y is 189.91 - 51.12 × 2.15 = 80.002.

Residuals: $e_i = Y_i - \hat{Y}_i$
The residual is the unexplained part of Y.
The smaller the residuals, the better the utility of the regression.
The sum of residuals is always zero; the least squares procedure ensures that.
Residuals play an important role in investigating the adequacy of the fitted model.
We obtain the coefficient of determination (R²) using the residuals.
R² is used to examine the adequacy of the fitted linear model to the given data.
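A short sketch (not in the original slides; NumPy assumed) illustrating the residuals for the fitted line above; their sum is essentially zero, up to the rounding of a and b.

```python
# Compute fitted values and residuals, and check that the residuals sum to ~0.
import numpy as np

x = np.array([1.25, 1.75, 2.25, 2.00, 2.50, 2.25, 2.70, 2.50])
y = np.array([125, 105, 65, 85, 75, 80, 50, 55])

a, b = 189.91, -51.12          # least squares estimates from the notes (rounded)
y_hat = a + b * x              # fitted values
residuals = y - y_hat

print(residuals.round(2))
print(round(residuals.sum(), 2))   # close to 0 (not exact, since a and b are rounded)
```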

Coefficient of Determination
[Diagram: for an observation, the total deviation Y - Ȳ splits into the explained part Ŷ - Ȳ and the unexplained part Y - Ŷ]

Total Sum of Squares:       $SST = \sum_{i=1}^{n}(Y_i - \bar{Y})^2$
Regression Sum of Squares:  $SSR = \sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2$
Error Sum of Squares:       $SSE = \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2$

Also, SST = SSR + SSE


The fraction of SST explained by the regression is given by R².
R² = SSR / SST = 1 - (SSE / SST)
Clearly, 0 ≤ R² ≤ 1.
When SSR is close to SST, R² will be close to 1.
This means that the regression explains most of the variability in Y. (Fit is good.)
When SSE is close to SST, R² will be close to 0.
This means that the regression does not explain much variability in Y. (Fit is not good.)
R² is the square of the correlation coefficient between X and Y. (Proof omitted.)

0 < R² < 1: weak linear relationship; some but not all of the variation in Y is explained by X.
R² = 1 (r = +1 or r = -1): perfect linear relationship; 100% of the variation in Y is explained by X.
R² = 0: no linear relationship; none of the variation in Y is explained by X.

   x     y     Ŷ     Y - Ȳ   Y - Ŷ   Ŷ - Ȳ   (Y - Ȳ)²   (Y - Ŷ)²   (Ŷ - Ȳ)²
 1.25   125   126.0    45    -1.0     46.0     2025       1.00      2116.00
 1.75   105   100.5    25     4.5     20.5      625      20.25       420.25
 2.25    65    74.9   -15    -9.9     -5.1      225      98.00        26.01
 2.00    85    87.7     5    -2.2      7.7       25       4.84        59.29
 2.50    75    62.1    -5    12.9    -17.7       25     166.41       313.29
 2.25    80    74.9     0     5.1     -5.1        0      26.01        26.01
 2.70    50    51.9   -30    -1.9    -28.1      900       3.61       789.61
 2.50    55    62.1   -25    -7.1    -17.9      625      50.41       320.41
17.20   640                                    4450     370.54      4079.46

Coefficient of Determination: R² = (4450 - 370.5) / 4450 = 0.916
Correlation Coefficient: r = -0.957
Coefficient of Determination = (Correlation Coefficient)²
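As a check (not part of the original slides; NumPy assumed), the sketch below recomputes R² for this fit from SST and SSE and compares it with the squared correlation coefficient.

```python
# R^2 for the fitted line above, from SST and SSE, versus the squared correlation.
import numpy as np

x = np.array([1.25, 1.75, 2.25, 2.00, 2.50, 2.25, 2.70, 2.50])
y = np.array([125, 105, 65, 85, 75, 80, 50, 55])

b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
y_hat = a + b * x

sst = np.sum((y - y.mean()) ** 2)
sse = np.sum((y - y_hat) ** 2)
r2 = 1 - sse / sst

print(round(r2, 3))                              # about 0.916
print(round(np.corrcoef(x, y)[0, 1] ** 2, 3))    # same value, r squared
```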

Example:
Watching television also reduces the amount of physical exercise, causing weight gains.
A sample of fifteen 10-year-old children was taken.
The number of pounds each child was overweight was recorded (a negative number indicates the child is underweight).
Additionally, the number of hours of television viewing per week was also recorded.
These data are listed here:

TV:         42 34 25 35 37 38 31 33 19 29 38 28 29 36 18
Overweight: 18  6  0  1 13 14  7  7  9  8  8  5  3 14  7

Calculate the sample regression line and describe what the coefficients tell you about the relationship between the two variables.

Ŷ = -24.709 + 0.967 X and R² = 0.768

[Plot of observed Y and predicted Y for the 15 children]


Standard Error
Consider a data set.
All the observations cannot be exactly the same as the arithmetic mean (AM).
Variability of the observations around the AM is measured by the standard deviation.
Similarly in regression, all Y values cannot be the same as the predicted Y values.
Variability of the Y values around the prediction line is measured by the STANDARD ERROR OF THE ESTIMATE.
It is given by
$S_{YX} = \sqrt{\dfrac{SSE}{n - 2}} = \sqrt{\dfrac{\sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2}{n - 2}}$

Assumptions
The relationship between X and Y is linear.
Error values are statistically independent.
All the errors have a common variance (homoscedasticity):
$\mathrm{Var}(e_i) = \sigma^2$, where $e_i = Y_i - \hat{Y}_i$, and $E(e_i) = 0$.
No distributional assumption about the errors is required for the least squares method.
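A minimal sketch (not from the original slides; NumPy assumed) computing the standard error of the estimate for the simple regression example used in these notes:

```python
# Standard error of the estimate, S_YX = sqrt(SSE / (n - 2)).
import numpy as np

x = np.array([1.25, 1.75, 2.25, 2.00, 2.50, 2.25, 2.70, 2.50])
y = np.array([125, 105, 65, 85, 75, 80, 50, 55])

b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
sse = np.sum((y - (a + b * x)) ** 2)

s_yx = np.sqrt(sse / (len(x) - 2))
print(round(s_yx, 2))   # roughly 7.9 for this data set
```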

Linearity
[Residual plots against X: no pattern when the relationship is linear, a systematic pattern when it is not linear]

Independence
[Residual plots over observation order: independent errors show no pattern, non-independent errors show a systematic pattern]

Equal Variance
[Residual plots: equal variance (homoscedastic) residuals have a constant spread, unequal variance (heteroscedastic) residuals fan out]

TV Watching and Weight Gain Example
[Scatter plot of X and Y, and scatter plot of X and the residuals]


The Multiple Linear Regression Model
In simple linear regression analysis, we fit a linear relation between one independent variable (X) and one dependent variable (Y).
We assume that Y is regressed on only one regressor variable X.
In some situations, the variable Y is regressed on more than one regressor variable (X1, X2, X3, ...).
For example:
Cost: labor cost, electricity cost, raw material cost
Salary: education, experience
Sales: cost, advertising expenditure

Example:
A distributor of frozen dessert pies wants to evaluate factors which influence the demand.
Dependent variable:
Y: pie sales (units per week)
Independent variables:
X1: price (in $)
X2: advertising expenditure ($100s)
Data are collected for 15 weeks.

Week   Pie Sales   Price ($)   Advertising ($100s)
  1       350        5.50            3.3
  2       460        7.50            3.3
  3       350        8.00            3.0
  4       430        8.00            4.5
  5       350        6.80            3.0
  6       380        7.50            4.0
  7       430        4.50            3.0
  8       470        6.40            3.7
  9       450        7.00            3.5
 10       490        5.00            4.0
 11       340        7.20            3.5
 12       300        7.90            3.2
 13       440        5.90            4.0
 14       450        5.00            3.5
 15       300        7.00            2.7
Using the given data, we wish to fit a linear function of the form
$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i, \qquad i = 1, 2, \ldots, 15$
where
Y: pie sales (units per week)
X1: price (in $)
X2: advertising expenditure ($100s)
Fitting means we want to get the values of the regression coefficients denoted by β.
The original values of the βs are not known.
We estimate them using the given data.

The Multiple Linear Regression Model
Examine the linear relationship between one dependent variable (Y) and two or more independent variables (X1, X2, ..., Xk).

Multiple linear regression model with k independent variables (intercept β0, slopes β1, ..., βk, random error εi):
$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki} + \varepsilon_i, \qquad i = 1, 2, \ldots, n$

Multiple Linear Regression Equation
The intercept and the slopes are estimated using the observed data.
The multiple linear regression equation with k independent variables gives the estimated value
$\hat{Y}_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + \cdots + b_k X_{ki}, \qquad i = 1, 2, \ldots, n$
where b0 is the estimate of the intercept and b1, ..., bk are the estimates of the slopes.


Multiple Regression Equation
Example with two independent variables:
$\hat{Y} = b_0 + b_1 X_1 + b_2 X_2$
[Diagram: the fitted regression plane over the (X1, X2) plane]

Estimating Regression Coefficients
The multiple linear regression model
$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki} + \varepsilon_i, \qquad i = 1, 2, \ldots, n$
in matrix notation is
$\begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix} = \begin{pmatrix} 1 & X_{11} & X_{12} & \cdots & X_{1k} \\ 1 & X_{21} & X_{22} & \cdots & X_{2k} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & X_{n1} & X_{n2} & \cdots & X_{nk} \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}$
or
$\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$

Assumptions
The number of observations (n) is greater than the number of regressors (k), i.e., n > k.
Random errors are independent.
Random errors have the same variance (homoscedasticity): $\mathrm{Var}(\varepsilon_i) = \sigma^2$.
In the long run, the mean effect of the random errors is zero: $E(\varepsilon_i) = 0$.
No assumption on the distribution of the random errors is required for the least squares method.

In order to find the estimate of β, we minimize
$S(\boldsymbol{\beta}) = \sum_{i=1}^{n} \varepsilon_i^2 = (\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}) = \mathbf{Y}'\mathbf{Y} - 2\boldsymbol{\beta}'\mathbf{X}'\mathbf{Y} + \boldsymbol{\beta}'\mathbf{X}'\mathbf{X}\boldsymbol{\beta}$
We differentiate S(β) with respect to β and equate to zero, i.e., $\dfrac{\partial S}{\partial \boldsymbol{\beta}} = 0$.
This gives
$\mathbf{b} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}$
b is called the least squares estimator of β.
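To make the matrix formula concrete, here is a small sketch (not part of the original slides; NumPy assumed) that builds X with a column of ones for the pie-sales data and evaluates b = (X'X)⁻¹X'Y.

```python
# Least squares estimator b = (X'X)^(-1) X'Y for the pie-sales data.
import numpy as np

sales = np.array([350, 460, 350, 430, 350, 380, 430, 470, 450, 490, 340, 300, 440, 450, 300])
price = np.array([5.5, 7.5, 8.0, 8.0, 6.8, 7.5, 4.5, 6.4, 7.0, 5.0, 7.2, 7.9, 5.9, 5.0, 7.0])
adv   = np.array([3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7, 3.5, 4.0, 3.5, 3.2, 4.0, 3.5, 2.7])

X = np.column_stack([np.ones(len(sales)), price, adv])   # first column of ones for the intercept
Y = sales

b = np.linalg.inv(X.T @ X) @ X.T @ Y   # (X'X)^(-1) X'Y; np.linalg.lstsq is the numerically safer choice
print(b.round(2))                      # approximately [306.53, -24.98, 74.13]
```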

Example: Consider the pie example.
We want to fit the model $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i$.
The variables are
Y: pie sales (units per week)
X1: price (in $)
X2: advertising expenditure ($100s)

Using the matrix formula, the least squares estimates (LSE) of the βs are obtained as below:
LSE of intercept β0: b0 = 306.53
LSE of slope β1 (Price): b1 = -24.98
LSE of slope β2 (Advertising): b2 = 74.13

Fitted equation: Sales = 306.53 - 24.98 (X1) + 74.13 (X2)
Pie Sales = 306.53 - 24.98 Price + 74.13 Adv. Expend.

b1 = -24.98: sales will decrease, on average, by 24.98 pies per week for each $1 increase in selling price, while advertising expenses are kept fixed.
b2 = 74.13: sales will increase, on average, by 74.13 pies per week for each $100 increase in advertising, while selling price is kept fixed.


$\hat{Y} = 306.52619 - 24.97509\, X_1 + 74.13096\, X_2$

Prediction:
Predict sales for a week in which the selling price is $5.50 and the advertising expenditure is $350:
Sales = 306.53 - 24.98 X1 + 74.13 X2
      = 306.53 - 24.98 (5.50) + 74.13 (3.5)
      = 428.62
Predicted sales is 428.62 pies.
Note that advertising is in $100s, so X2 = 3.5.

  Y     X1    X2   Predicted Y   Residual
 350   5.5   3.3     413.77       -63.80
 460   7.5   3.3     363.81        96.15
 350   8.0   3.0     329.08        20.88
 430   8.0   4.5     440.28       -10.31
 350   6.8   3.0     359.06        -9.09
 380   7.5   4.0     415.70       -35.74
 430   4.5   3.0     416.51        13.47
 470   6.4   3.7     420.94        49.03
 450   7.0   3.5     391.13        58.84
 490   5.0   4.0     478.15        11.83
 340   7.2   3.5     386.13       -46.16
 300   7.9   3.2     346.40       -46.44
 440   5.9   4.0     455.67       -15.70
 450   5.0   3.5     441.09         8.89
 300   7.0   2.7     331.82       -31.85

Coefficient of Determination
The coefficient of determination (R²) is obtained using the same formula as in simple linear regression.
Total Sum of Squares:       $SST = \sum_{i=1}^{n}(Y_i - \bar{Y})^2$
Regression Sum of Squares:  $SSR = \sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2$
Error Sum of Squares:       $SSE = \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2$
Also, SST = SSR + SSE
R² = SSR / SST = 1 - (SSE / SST)
[Plot of observed Y and predicted Y for the 15 weeks]

Since SST = SSR + SSE and all three quantities are non-negative,
0 ≤ SSR ≤ SST
So 0 ≤ SSR/SST ≤ 1, or 0 ≤ R² ≤ 1.
When R² is close to 0, the linear fit is not good, and the X variables do not contribute in explaining the variability in Y.
When R² is close to 1, the linear fit is good.
In the previously discussed example, R² = 0.5215.
If we consider Y and X1 only, R² = 0.1965.
If we consider Y and X2 only, R² = 0.3095.
R² is the proportion of variation in Y explained by regression.

Adjusted R²
If one more regressor is added to the model, the value of R² will increase.
This increase is regardless of the contribution of the newly added regressor.
So, an adjusted value of R² is defined, which is called adjusted R²:
$R^2_{Adj} = 1 - \dfrac{SSE/(n - k - 1)}{SST/(n - 1)}$
This adjusted R² will only increase if the additional variable contributes in explaining the variation in Y.
For our example, adjusted R² = 0.4417.
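As a quick worked check (using SSE = 27033.31 and SST = 56493.33 from the ANOVA table reported later in these notes, with n = 15 and k = 2):

$R^2_{Adj} = 1 - \dfrac{27033.31/12}{56493.33/14} = 1 - \dfrac{2252.78}{4035.24} \approx 0.4417$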


F Test for Overall Significance
We check if there is a linear relationship between all the regressors (X1, X2, ..., Xk) and the response (Y).
We use the F test statistic to test:
H0: β1 = β2 = ... = βk = 0 (no regressor is significant)
H1: at least one βi ≠ 0 (at least one regressor affects Y)
The technique of Analysis of Variance is used.
Assumptions:
n > k, Var(εi) = σ², E(εi) = 0.
The εi's are independent. This implies that Corr(εi, εj) = 0 for i ≠ j.
The εi's have a Normal distribution: εi ~ N(0, σ²). [NEW ASSUMPTION]

The Total Sum of Squares (SST) is partitioned into the Sum of Squares due to Regression (SSR) and the Sum of Squares due to Residuals (SSE), where
$SST = \sum_{i=1}^{n}(Y_i - \bar{Y})^2, \qquad SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2, \qquad SSR = SST - SSE$
The ei's are called the residuals.

Analysis of Variance Table

Source               df        SS     MS       Fc
Regression            k       SSR    MSR    MSR/MSE
Residual or Error   n-k-1     SSE    MSE
Total                n-1      SST

Test statistic: Fc = MSR / MSE ~ F(k, n-k-1)

For the previous example, we wish to test
H0: β1 = β2 = 0 against H1: at least one βi ≠ 0.
ANOVA table:

Source               df        SS         MS        Fc       F(2,12)(0.05)
Regression            2     29460.03   14730.01   6.5386         3.89
Residual or Error    12     27033.31    2252.78
Total                14     56493.33

Thus H0 is rejected at the 5% level of significance.

Individual Variables: Tests of Hypothesis
We test if there is a linear relationship between a particular regressor Xj and Y.
Hypotheses:
H0: βj = 0 (no linear relationship)
H1: βj ≠ 0 (linear relationship exists between Xj and Y)
We use a two-tailed t test.
If H0: βj = 0 is accepted, this indicates that the variable Xj can be deleted from the model.

Test statistic:
$T_c = \dfrac{b_j}{\sqrt{\hat{\sigma}^2\, C_{jj}}}$
Tc ~ Student's t with (n - k - 1) degrees of freedom.
bj is the least squares estimate of βj.
Cjj is the (j, j)th element of the matrix (X'X)⁻¹.
$\hat{\sigma}^2 = MSE$ (MSE is obtained in the ANOVA table).

In our example, $\hat{\sigma}^2 = 2252.7755$ and
$(\mathbf{X}'\mathbf{X})^{-1} = \begin{pmatrix} 5.7946 & -0.3312 & -1.0165 \\ -0.3312 & 0.0521 & -0.0038 \\ -1.0165 & -0.0038 & 0.2993 \end{pmatrix}$
To test H0: β1 = 0 against H1: β1 ≠ 0, Tc = -2.3057.
To test H0: β2 = 0 against H1: β2 ≠ 0, Tc = 2.8548.
Two-tailed critical values of t at 12 d.f. are
3.0545 for 1% level of significance
2.6810 for 2% level of significance
2.1788 for 5% level of significance
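A minimal sketch (not from the original slides; NumPy assumed) reproducing the overall F statistic and the individual t statistics for the pie-sales model:

```python
# ANOVA F statistic and individual t statistics for the fitted pie-sales model.
import numpy as np

sales = np.array([350, 460, 350, 430, 350, 380, 430, 470, 450, 490, 340, 300, 440, 450, 300])
price = np.array([5.5, 7.5, 8.0, 8.0, 6.8, 7.5, 4.5, 6.4, 7.0, 5.0, 7.2, 7.9, 5.9, 5.0, 7.0])
adv   = np.array([3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7, 3.5, 4.0, 3.5, 3.2, 4.0, 3.5, 2.7])

X = np.column_stack([np.ones(len(sales)), price, adv])
Y = sales
n, p = X.shape                      # p = k + 1 (intercept plus k slopes)
k = p - 1

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ Y
resid = Y - X @ b

sse = np.sum(resid ** 2)            # about 27033
sst = np.sum((Y - Y.mean()) ** 2)   # about 56493
msr = (sst - sse) / k
mse = sse / (n - k - 1)

F = msr / mse                                   # about 6.54
t = b / np.sqrt(mse * np.diag(XtX_inv))         # slope t values about -2.31 and 2.85
print(round(F, 3), t.round(3))
```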


Standard Error
Consider a data set.
All the observations cannot be exactly the same as the arithmetic mean (AM).
Variability of the observations around the AM is measured by the standard deviation.
Similarly in regression, all Y values cannot be the same as the predicted Y values.
Variability of the Y values around the prediction line is measured by the STANDARD ERROR OF THE ESTIMATE.
It is given by
$S_{YX} = \sqrt{\dfrac{SSE}{n - k - 1}} = \sqrt{\dfrac{\sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2}{n - k - 1}}$

Assumption of Linearity
[Residual plots against X: no pattern when the relationship is linear, a systematic pattern when it is not linear]

Residual Analysis for Equal Variance
Assumption of equal variance: we assume that Var(εi) = σ², i.e., the variance is constant for all observations.
This assumption is examined by looking at the plot of the predicted values Ŷi against the residuals ei = Yi - Ŷi.
[Residual plots: equal variance (homoscedastic) shows a constant spread, unequal variance (heteroscedastic) shows a fanning pattern]

Residual Analysis for Independence (Uncorrelated Errors)
Assumption of uncorrelated residuals.
[Residual plots over observation order: independent errors show no pattern, non-independent errors show a systematic pattern]

The Durbin-Watson statistic is a test statistic used to detect the presence of autocorrelation.
It is given by
$d = \dfrac{\sum_{i=2}^{n}(e_i - e_{i-1})^2}{\sum_{i=1}^{n} e_i^2}$
The value of d always lies between 0 and 4.
d = 2 indicates no autocorrelation.
Small values of d (< 2) indicate that successive error terms are positively correlated.
If d > 2, successive error terms are negatively correlated.
Values of d more than 3 or less than 1 are alarming.
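A small sketch (not from the original slides; NumPy assumed) of the Durbin-Watson statistic; the residual sequence used here is made up purely for illustration.

```python
# Durbin-Watson statistic for a sequence of residuals.
import numpy as np

def durbin_watson(residuals):
    """d = sum_{i=2..n} (e_i - e_{i-1})^2 / sum_{i=1..n} e_i^2"""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

# Hypothetical residual sequence; values of d near 2 suggest no autocorrelation.
e = np.array([5.0, -3.0, 2.0, -1.0, 4.0, -2.0, 1.0, -4.0])
print(round(durbin_watson(e), 2))
```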


Assumption of Normality
When we use the F test or the t test, we assume that ε1, ε2, ..., εn are normally distributed.
This assumption can be examined by a histogram of the residuals.
Normality can also be examined using a Q-Q plot or a normal probability plot.
[Histograms and normal probability plots illustrating normal and non-normal residuals]

Standardized Regression Coefficients
In a multiple linear regression, we may like to know which regressor contributes more.
We obtain standardized estimates of the regression coefficients.
For that, first we standardize the observations:
$\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i, \quad s_Y = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(Y_i - \bar{Y})^2}$
$\bar{X}_1 = \frac{1}{n}\sum_{i=1}^{n} X_{1i}, \quad s_{X_1} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(X_{1i} - \bar{X}_1)^2}$
$\bar{X}_2 = \frac{1}{n}\sum_{i=1}^{n} X_{2i}, \quad s_{X_2} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(X_{2i} - \bar{X}_2)^2}$

Standardize all Y, X1 and X2 values as follows:
$\text{Standardized } Y_i = \dfrac{Y_i - \bar{Y}}{s_Y}, \quad \text{Standardized } X_{1i} = \dfrac{X_{1i} - \bar{X}_1}{s_{X_1}}, \quad \text{Standardized } X_{2i} = \dfrac{X_{2i} - \bar{X}_2}{s_{X_2}}$

Fit the regression on the standardized data and obtain the least squares estimates of the regression coefficients.
These coefficients are dimensionless (unit free) and can be compared.
Look for the regression coefficient having the highest magnitude.
The corresponding regressor contributes the most.
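A minimal sketch (not from the original slides; NumPy assumed) obtaining the standardized coefficients for the pie-sales model so the two slopes can be compared:

```python
# Standardized regression coefficients for the pie-sales model.
import numpy as np

sales = np.array([350, 460, 350, 430, 350, 380, 430, 470, 450, 490, 340, 300, 440, 450, 300])
price = np.array([5.5, 7.5, 8.0, 8.0, 6.8, 7.5, 4.5, 6.4, 7.0, 5.0, 7.2, 7.9, 5.9, 5.0, 7.0])
adv   = np.array([3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7, 3.5, 4.0, 3.5, 3.2, 4.0, 3.5, 2.7])

def standardize(v):
    return (v - v.mean()) / v.std(ddof=1)   # sample standard deviation (divisor n - 1)

Zy, Z1, Z2 = standardize(sales), standardize(price), standardize(adv)
Z = np.column_stack([np.ones(len(Zy)), Z1, Z2])

beta = np.linalg.inv(Z.T @ Z) @ Z.T @ Zy
print(beta.round(3))   # roughly [0, -0.46, 0.57]: advertising has the larger magnitude
```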

Standardized Data
[Table of standardized Pie Sales, Price, and Advertising values for the 15 weeks]

Note that
Ŷ = 0 - 0.461 X1 + 0.570 X2
Since |-0.461| < 0.570, X2 contributes the most.

Also note:
$R^2_{Adj} = 1 - \dfrac{(1 - R^2)(n - 1)}{n - k - 1}, \qquad F_c = \dfrac{(n - k - 1)\, R^2}{k\,(1 - R^2)}$
Adjusted R² can be negative.
Adjusted R² is always less than or equal to R².
Inclusion of the intercept term is not necessary.
It depends on the problem; the analyst may decide on this.


Example: The following data were collected on sales, the number of advertisements published and the advertising expenditure for 12 weeks. Fit a regression model to predict the sales.

Sales (0,000 Rs)   Ads (Nos.)   AdvEx (000 Rs)
     43.6              12            13.9
     38.0              11            12
     30.1                             9.3
     35.3                             9.7
     46.4              12            12.3
     34.2                            11.4
     30.2                             9.3
     40.7              13            14.3
     38.5                            10.2
     22.6                             8.4
     37.6                            11.2
     35.2              10            11.1

ANOVA (b)
Model 1      Sum of Squares   df   Mean Square      F      Sig.
Regression       309.986       2     154.993      9.741   .006a
Residual         143.201       9      15.911
Total            453.187      11
a. Predictors: (Constant), Ex_Adv, No_Adv
b. Dependent Variable: Sales

p-value < 0.05; H0 is rejected; all βs are not zero.

Coefficients (a)
Model 1       Unstandardized Coefficients   Standardized Coefficients      t      Sig.
                   B       Std. Error               Beta
(Constant)       6.584       8.542                                       .771    .461
No_Adv            .625       1.120                  .234                 .558    .591
Ex_Adv           2.139       1.470                  .611                1.455    .180
a. Dependent Variable: Sales

All p-values > 0.05; no H0 is rejected: β0 = 0, β1 = 0, β2 = 0.
CONTRADICTION

Multicollinearity
We assume that the regressors are independent variables.
When we regress Y on regressors X1, X2, ..., Xk, we assume that all regressors X1, X2, ..., Xk are statistically independent of each other.
All the regressors affect the values of Y.
One regressor does not affect the values of another regressor.
Sometimes, in practice, this assumption is not met, and we face the problem of multicollinearity.
The correlated variables contribute redundant information to the model.
Including two highly correlated independent variables can adversely affect the regression results and can lead to unstable coefficients.

Some indications of strong multicollinearity:
Coefficient signs may not match prior expectations.
Large change in the value of a previous coefficient when a new variable is added to the model.
A previously significant variable becomes insignificant when a new independent variable is added.
The F test says at least one variable is significant, but none of the t tests indicates a useful variable.
The standard error is large and the corresponding regressor is still significant.
MSE is very high and/or R² is very small.

Examples in which this might happen:
Miles per gallon vs. horsepower and engine size
Income vs. age and experience
Sales vs. number of advertisements and advertising expenditure

Variance Inflationary Factor:
VIFj is used to measure the multicollinearity generated by variable Xj.
It is given by
$VIF_j = \dfrac{1}{1 - R_j^2}$
where R²j is the coefficient of determination of a regression model that uses Xj as the dependent variable and all other X variables as the independent variables.
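A minimal sketch (not from the original slides; NumPy assumed) of this definition: each VIF comes from regressing one regressor on all the others. The two regressor columns below are made up purely for illustration.

```python
# VIF for each regressor, computed by regressing it on all the other regressors.
import numpy as np

def r_squared(y, X):
    """R^2 of an ordinary least squares fit of y on X (X already has a column of ones)."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ b
    return 1 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)

def vif(regressors):
    """regressors: 2-D array, one column per X variable (no intercept column)."""
    n, k = regressors.shape
    out = []
    for j in range(k):
        others = np.delete(regressors, j, axis=1)
        X = np.column_stack([np.ones(n), others])
        out.append(1.0 / (1.0 - r_squared(regressors[:, j], X)))
    return out

# Hypothetical two-regressor example: highly correlated columns give VIF well above 5.
x1 = np.array([12.0, 11, 10, 9, 12, 11, 9, 13, 10, 8, 11, 10])
x2 = x1 * 1.1 + np.array([0.5, -0.3, 0.2, 0.1, -0.4, 0.3, -0.2, 0.4, -0.1, 0.2, -0.3, 0.1])
print([round(v, 2) for v in vif(np.column_stack([x1, x2]))])
```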

If VIFj > 5, Xj is highly correlated with the other independent variables.
Mathematically, the problem of multicollinearity occurs when the columns of the matrix X have near linear dependence.
The LSE b cannot be obtained when the matrix X'X is singular.
The matrix X'X becomes singular when the columns of the matrix X have exact linear dependence, i.e., if any of the eigenvalues of the matrix X'X is zero.
Thus, a near-zero eigenvalue is also an indication of multicollinearity.
The methods of dealing with multicollinearity:
Collecting additional data
Variable elimination


Coefficients (a)
Model 1       Unstandardized Coefficients   Standardized Coefficients      t      Sig.   Tolerance    VIF
                   B       Std. Error               Beta
(Constant)       6.584       8.542                                       .771    .461
No_Adv            .625       1.120                  .234                 .558    .591      .199      5.022
Ex_Adv           2.139       1.470                  .611                1.455    .180      .199      5.022
a. Dependent Variable: Sales

Tolerance = 1/VIF; here the VIF is greater than 5.

Collinearity Diagnostics (a)
Model 1   Dimension   Eigenvalue   Condition Index   Variance Proportions (Constant, No_Adv, Ex_Adv)
              1          2.966          1.000              .00    .00    .00
              2           .030          9.882              .33    .17    .00
              3           .003         30.417              .67    .83   1.00
a. Dependent Variable: Sales

The third eigenvalue is negligible and the corresponding condition index is large.

We may use the method of variable elimination.
In practice, if Corr(X1, X2) is more than 0.7 or less than -0.7, we eliminate one of them.
Techniques:
Stepwise (based on ANOVA)
Forward Inclusion (based on Correlation)
Backward Elimination (based on Correlation)

Stepwise Regression
Y = β0 + β1 X1 + β2 X2 + β3 X3 + β4 X4 + β5 X5 + ε

Step 1: Run 5 simple linear regressions:
Y = β0 + β1 X1
Y = β0 + β2 X2
Y = β0 + β3 X3
Y = β0 + β4 X4   <== has lowest p-value (ANOVA) < 0.05
Y = β0 + β5 X5

Step 2: Run 4 two-variable linear regressions:
Y = β0 + β4 X4 + β1 X1
Y = β0 + β4 X4 + β2 X2
Y = β0 + β4 X4 + β3 X3   <== has lowest p-value (ANOVA) < 0.05
Y = β0 + β4 X4 + β5 X5

Step 3: Run 3 three-variable linear regressions:
Y = β0 + β3 X3 + β4 X4 + β1 X1
Y = β0 + β3 X3 + β4 X4 + β2 X2
Y = β0 + β3 X3 + β4 X4 + β5 X5

Suppose none of these models has a p-value < 0.05.
STOP: the best model is the one with X3 and X4 only.
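A rough sketch (not from the original slides; NumPy and SciPy assumed) of the forward/stepwise idea described above: at each step add the candidate regressor whose model has the lowest overall-F p-value, stopping when no candidate achieves p < 0.05.

```python
# Forward selection driven by the overall ANOVA F p-value, as in the steps above.
import numpy as np
from scipy import stats

def overall_f_pvalue(y, X_cols):
    n = len(y)
    X = np.column_stack([np.ones(n)] + X_cols)
    k = X.shape[1] - 1
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    sse = np.sum((y - X @ b) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    F = ((sst - sse) / k) / (sse / (n - k - 1))
    return stats.f.sf(F, k, n - k - 1)

def stepwise(y, candidates, alpha=0.05):
    chosen, remaining = [], list(range(len(candidates)))
    while remaining:
        pvals = {j: overall_f_pvalue(y, [candidates[i] for i in chosen] + [candidates[j]])
                 for j in remaining}
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha:
            break                  # no candidate improves the model at level alpha: STOP
        chosen.append(best)
        remaining.remove(best)
    return chosen                  # indices of the selected regressors
```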

Example: The following data were collected on sales, the number of advertisements published and the advertising expenditure for 12 months. Fit a regression model to predict the sales.

Sales (0,000 Rs)   Ads (Nos.)   AdvEx (000 Rs)
     43.6              12            13.9
     38.0              11            12
     30.1                             9.3
     35.3                             9.7
     46.4              12            12.3
     34.2                            11.4
     30.2                             9.3
     40.7              13            14.3
     38.5                            10.2
     22.6                             8.4
     37.6                            11.2
     35.2              10            11.1

Summary Output 1: Sales vs. No_Adv

Model Summary
Model   R      R Square   Adjusted R Square   Std. Error of the Estimate
  1    .781a     .610           .571                  4.20570
a. Predictors: (Constant), No_Adv

ANOVA (b)
Model 1      Sum of Squares   df   Mean Square      F       Sig.
Regression       276.308       1     276.308     15.621    .003a
Residual         176.879      10      17.688
Total            453.187      11
a. Predictors: (Constant), No_Adv
b. Dependent Variable: Sales

Coefficients (a)
Model 1       Unstandardized Coefficients   Standardized Coefficients      t      Sig.
                   B       Std. Error               Beta
(Constant)      16.937       4.982                                      3.400    .007
No_Adv           2.083        .527                  .781                3.952    .003
a. Dependent Variable: Sales


Summary Output 2: Sales vs. Ex_Adv

Model Summary
Model   R      R Square   Adjusted R Square   Std. Error of the Estimate
  1    .820a     .673           .640                  3.84900
a. Predictors: (Constant), Ex_Adv

ANOVA (b)
Model 1      Sum of Squares   df   Mean Square      F       Sig.
Regression       305.039       1     305.039     20.590    .001a
Residual         148.148      10      14.815
Total            453.187      11
a. Predictors: (Constant), Ex_Adv
b. Dependent Variable: Sales

Coefficients (a)
Model 1       Unstandardized Coefficients   Standardized Coefficients      t      Sig.
                   B       Std. Error               Beta
(Constant)       4.173       7.109                                       .587    .570
Ex_Adv           2.872        .633                  .820                4.538    .001
a. Dependent Variable: Sales

Summary Output 3: Sales vs. No_Adv & Ex_Adv

Model Summary
Model   R      R Square   Adjusted R Square   Std. Error of the Estimate
  1    .827a     .684           .614                  3.98888
a. Predictors: (Constant), Ex_Adv, No_Adv

ANOVA (b)
Model 1      Sum of Squares   df   Mean Square      F       Sig.
Regression       309.986       2     154.993      9.741    .006a
Residual         143.201       9      15.911
Total            453.187      11
a. Predictors: (Constant), Ex_Adv, No_Adv
b. Dependent Variable: Sales

Coefficients (a)
Model 1       Unstandardized Coefficients   Standardized Coefficients      t      Sig.
                   B       Std. Error               Beta
(Constant)       6.584       8.542                                       .771    .461
No_Adv            .625       1.120                  .234                 .558    .591
Ex_Adv           2.139       1.470                  .611                1.455    .180
a. Dependent Variable: Sales

Qualitative Independent Variables
Johnson Filtration, Inc., provides maintenance service for water filtration systems throughout southern Florida.
To estimate the service time and the service cost, the managers want to predict the repair time necessary for each maintenance request.
Repair time is believed to be related to two factors:
Number of months since the last maintenance service
Type of repair problem (mechanical or electrical)

Data for a sample of 10 service calls are given:

Service Call   Months Since Last Service   Type of Repair   Repair Time in Hours
     1                    2                  electrical             2.9
     2                    6                  mechanical             3.0
     3                    8                  electrical             4.8
     4                    3                  mechanical             1.8
     5                    2                  electrical             2.9
     6                    7                  electrical             4.9
     7                    9                  mechanical             4.2
     8                    8                  mechanical             4.8
     9                    4                  electrical             4.4
    10                    6                  electrical             4.5

Let Y denote the repair time and X1 denote the number of months since the last maintenance service.
The regression model that uses X1 only to regress Y is
Y = β0 + β1 X1 + ε
Using the least squares method, we fitted the model as
Ŷ = 2.1473 + 0.3041 X1,   R² = 0.534
At the 5% level of significance, we reject
H0: β0 = 0 (using the t test)
H0: β1 = 0 (using the t and F tests)
X1 alone explains 53.4% of the variability in repair time.
To introduce the type of repair into the model, we define a dummy variable given as
X2 = 0 if the type of repair is mechanical, and X2 = 1 if the type of repair is electrical.
The regression model that uses X1 and X2 to regress Y is
Y = β0 + β1 X1 + β2 X2 + ε
Is the new model improved?
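A minimal sketch (not from the original slides; NumPy assumed) showing how the repair type can be encoded as a 0/1 dummy variable and the two-regressor model fitted for the Johnson Filtration data:

```python
# Encode the type of repair as a dummy variable and fit Y on X1 and X2.
import numpy as np

months = np.array([2, 6, 8, 3, 2, 7, 9, 8, 4, 6])
repair_type = ["electrical", "mechanical", "electrical", "mechanical", "electrical",
               "electrical", "mechanical", "mechanical", "electrical", "electrical"]
hours = np.array([2.9, 3.0, 4.8, 1.8, 2.9, 4.9, 4.2, 4.8, 4.4, 4.5])

x2 = np.array([1.0 if t == "electrical" else 0.0 for t in repair_type])   # dummy variable

X = np.column_stack([np.ones(len(hours)), months, x2])
b, *_ = np.linalg.lstsq(X, hours, rcond=None)
print(b.round(4))   # b0, b1 (months), b2 (shift in mean repair time for electrical repairs)
```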

Summary
Multiple linear regression model: Y = Xβ + ε
The least squares estimate of β is given by b = (X'X)⁻¹X'Y
R² and adjusted R²
Using ANOVA (F test), we examine whether all βs are zero or not.
A t test is conducted for each regressor separately.
Using the t test, we examine whether the β corresponding to that regressor is zero or not.
Problem of multicollinearity: VIF, eigenvalues
Dummy variables
Examining the assumptions: common variance, independence, normality
