
# 19/05/2016

Linear regression - Wikipedia, the free encyclopedia

Linear regression
From Wikipedia, the free encyclopedia

In statistics, linear regression is an approach for modeling the relationship between a scalar dependent variable y and one or more explanatory variables (or independent variables) denoted X. The case of one explanatory variable is called simple linear regression. For more than one explanatory variable, the process is called multiple linear regression.[1] (This term should be distinguished from multivariate linear regression, where multiple correlated dependent variables are predicted, rather than a single scalar variable.)[2]

In linear regression, the relationships are modeled using linear predictor functions whose unknown model parameters are estimated from the data. Such models are called linear models.[3] Most commonly, the conditional mean of y given the value of X is assumed to be an affine function of X; less commonly, the median or some other quantile of the conditional distribution of y given X is expressed as a linear function of X. Like all forms of regression analysis, linear regression focuses on the conditional probability distribution of y given X, rather than on the joint probability distribution of y and X, which is the domain of multivariate analysis.

Linear regression was the first type of regression analysis to be studied rigorously, and to be used extensively in practical applications.[4] This is because models which depend linearly on their unknown parameters are easier to fit than models which are non-linearly related to their parameters and because the statistical properties of the resulting estimators are easier to determine.
Linear regression has many practical uses. Most applications fall into one of the following two broad categories:

If the goal is prediction, or forecasting, or error reduction, linear regression can be used to fit a predictive model to an observed data set of y and X values. After developing such a model, if an additional value of X is then given without its accompanying value of y, the fitted model can be used to make a prediction of the value of y.

Given a variable y and a number of variables X1, ..., Xp that may be related to y, linear regression analysis can be applied to quantify the strength of the relationship between y and the Xj, to assess which Xj may have no relationship with y at all, and to identify which subsets of the Xj contain redundant information about y.

Linear regression models are often fitted using the least squares approach, but they may also be fitted in other ways, such as by minimizing the "lack of fit" in some other norm (as with least absolute deviations regression), or by minimizing a penalized version of the least squares loss function as in ridge regression (L2-norm penalty) and lasso (L1-norm penalty). Conversely, the least squares approach can be used to fit models that are not linear models. Thus, although the terms "least squares" and "linear model" are closely linked, they are not synonymous.

Contents

1 Introduction to linear regression
1.1 Assumptions
1.2 Interpretation
2 Extensions
2.1 Simple and multiple regression
2.2 General linear models
2.3 Heteroscedastic models
2.4 Generalized linear models
2.5 Hierarchical linear models
2.6 Errors-in-variables
2.7 Others
3 Estimation methods
3.1 Least squares estimation and related techniques
3.2 Maximum likelihood estimation and related techniques
3.3 Other estimation techniques
3.4 Further discussion
3.5 Using linear algebra
4 Applications of linear regression
4.1 Trend line
4.2 Epidemiology
4.3 Finance
4.4 Economics
4.5 Environmental science
5 See also
6 Notes
7 References

https://en.wikipedia.org/wiki/Linear_regression

Introduction to linear regression

Given a data set {y_i, x_i1, ..., x_ip}, i = 1, ..., n, of n statistical units, a linear regression model assumes that the relationship between the dependent variable y_i and the p-vector of regressors x_i is linear. This relationship is modeled through a disturbance term or error variable ε_i, an unobserved random variable that adds "noise" to the linear relationship between the dependent variable and regressors. Thus the model takes the form

$$ y_i = \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \varepsilon_i = x_i^T \beta + \varepsilon_i, \qquad i = 1, \ldots, n, $$

[Figure: Example of simple linear regression, which has one independent variable.]

where T denotes the transpose, so that x_i^T β is the inner product between vectors x_i and β.

Often these n equations are stacked together and written in vector form as

$$ y = X\beta + \varepsilon, $$

where

$$ y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad
X = \begin{pmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_n^T \end{pmatrix}
  = \begin{pmatrix} x_{11} & \cdots & x_{1p} \\ x_{21} & \cdots & x_{2p} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{np} \end{pmatrix}, \quad
\beta = \begin{pmatrix} \beta_1 \\ \vdots \\ \beta_p \end{pmatrix}, \quad
\varepsilon = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}. $$

[Figure: Example of a cubic polynomial regression, which is a type of linear regression.]

Some remarks on terminology and general use:

y_i is called the regressand, endogenous variable, response variable, measured variable, criterion variable, or dependent variable (see dependent and independent variables). The decision as to which variable in a data set is modeled as the dependent variable and which are modeled as the independent variables may be based on a presumption that the value of one of the variables is caused by, or directly influenced by the other variables. Alternatively, there may be an operational reason to model one of the variables in terms of the others, in which case there need be no presumption of causality.

x_i1, ..., x_ip are called regressors, exogenous variables, explanatory variables, covariates, input variables, predictor variables, or independent variables (see dependent and independent variables, but not to be confused with independent random variables). The matrix X is sometimes called the design matrix.

Usually a constant is included as one of the regressors. For example, we can take x_i1 = 1 for i = 1, ..., n. The corresponding element of β is called the intercept. Many statistical inference procedures for linear models require an intercept to be present, so it is often included even if theoretical considerations suggest that its value should be zero.

Sometimes one of the regressors can be a non-linear function of another regressor or of the data, as in polynomial regression and segmented regression. The model remains linear as long as it is linear in the parameter vector β.

The regressors x_ij may be viewed either as random variables, which we simply observe, or they can be considered as predetermined fixed values which we can choose. Both interpretations may be appropriate in different cases, and they generally lead to the same estimation procedures; however different approaches to asymptotic analysis are used in these two situations.

β is a p-dimensional parameter vector. Its elements are also called effects, or regression coefficients. Statistical estimation and inference in linear regression focuses on β. The elements of this parameter vector are interpreted as the partial derivatives of the dependent variable with respect to the various independent variables.

ε_i is called the error term, disturbance term, or noise. This variable captures all other factors which influence the dependent variable y_i other than the regressors x_i. The relationship between the error term and the regressors, for example whether they are correlated, is a crucial step in formulating a linear regression model, as it will determine the method to use for estimation.

Example. Consider a situation where a small ball is being tossed up in the air and then we measure its heights of ascent h_i at various moments in time t_i. Physics tells us that, ignoring the drag, the relationship can be modeled as

$$ h_i = \beta_1 t_i + \beta_2 t_i^2 + \varepsilon_i, $$

where β1 determines the initial velocity of the ball, β2 is proportional to the standard gravity, and ε_i is due to measurement errors. Linear regression can be used to estimate the values of β1 and β2 from the measured data. This model is non-linear in the time variable, but it is linear in the parameters β1 and β2; if we take regressors x_i = (x_i1, x_i2) = (t_i, t_i^2), the model takes on the standard form

$$ h_i = x_i^T \beta + \varepsilon_i. $$
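The ball-toss model above can be fitted with ordinary least squares despite being quadratic in time, precisely because it is linear in β1 and β2. A minimal sketch in Python with NumPy, using invented measurement data (the times, the true coefficients 10.0 and -4.9, and the noise level are all hypothetical, chosen for illustration only):

```python
import numpy as np

# Hypothetical measurements of a tossed ball: heights h_i at times t_i.
# True relationship (ignoring drag): h = v0*t - (g/2)*t^2.
t = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8])
rng = np.random.default_rng(0)
h = 10.0 * t - 4.9 * t**2 + rng.normal(0.0, 0.01, size=t.shape)

# The model is nonlinear in t but linear in (beta1, beta2):
# use regressors x_i = (t_i, t_i^2) and solve the least squares problem.
X = np.column_stack([t, t**2])
beta, *_ = np.linalg.lstsq(X, h, rcond=None)

print(beta)  # beta[0] estimates the initial velocity, beta[1] estimates -g/2
```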

Assumptions

Standard linear regression models with standard estimation techniques make a number of assumptions about the predictor variables, the response variables and their relationship. Numerous extensions have been developed that allow each of these assumptions to be relaxed (i.e. reduced to a weaker form), and in some cases eliminated entirely. Some methods are general enough that they can relax multiple assumptions at once, and in other cases this can be achieved by combining different extensions. Generally these extensions make the estimation procedure more complex and time-consuming, and may also require more data in order to produce an equally precise model.

The following are the major assumptions made by standard linear regression models with standard estimation techniques (e.g. ordinary least squares):

Weak exogeneity. This essentially means that the predictor variables x can be treated as fixed values, rather than random variables. This means, for example, that the predictor variables are assumed to be error-free, that is, not contaminated with measurement errors. Although this assumption is not realistic in many settings, dropping it leads to significantly more difficult errors-in-variables models.
Linearity. This means that the mean of the response variable is a linear combination of the parameters (regression coefficients) and the predictor variables. Note that this assumption is much less restrictive than it may at first seem. Because the predictor variables are treated as fixed values (see above), linearity is really only a restriction on the parameters. The predictor variables themselves can be arbitrarily transformed, and in fact multiple copies of the same underlying predictor variable can be added, each one transformed differently. This trick is used, for example, in polynomial regression, which uses linear regression to fit the response variable as an arbitrary polynomial function (up to a given rank) of a predictor variable. This makes linear regression an extremely powerful inference method. In fact, models such as polynomial regression are often "too powerful", in that they tend to overfit the data. As a result, some kind of regularization must typically be used to prevent unreasonable solutions coming out of the estimation process. Common examples are ridge regression and lasso regression. Bayesian linear regression can also be used, which by its nature is more or less immune to the problem of overfitting. (In fact, ridge regression and lasso regression can both be viewed as special cases of Bayesian linear regression, with particular types of prior distributions placed on the regression coefficients.)
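A small sketch of the regularization idea mentioned above, on synthetic data with two nearly collinear predictors (all numbers are invented): ridge regression adds an L2 penalty λ‖β‖² to the least squares loss, which stabilizes the otherwise poorly determined split between the two coefficients.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)   # nearly a copy of x1
y = x1 + x2 + rng.normal(scale=0.1, size=n)
X = np.column_stack([x1, x2])

# OLS: solves X^T X beta = X^T y; a near-singular X^T X makes the split
# between the two coefficients very unstable, though their sum is stable.
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Ridge: solves (X^T X + lam*I) beta = X^T y; the penalty shrinks the
# ill-determined direction, pulling both coefficients toward (1, 1).
lam = 1.0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)

print(beta_ols, beta_ridge)
```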
Constant variance (a.k.a. homoscedasticity). This means that different response variables have the same variance in their errors, regardless of the values of the predictor variables. In practice this assumption is invalid (i.e. the errors are heteroscedastic) if the response variables can vary over a wide scale. In order to check for heterogeneous error variance, or when a pattern of residuals violates model assumptions of homoscedasticity (error is equally variable around the 'best-fitting line' for all points of x), it is prudent to look for a "fanning effect" between residual error and predicted values. That is to say there will be a systematic change in the absolute or squared residuals when plotted against the predicted values. Error will not be evenly distributed across the regression line. Heteroscedasticity will result in the averaging over of distinguishable variances around the points to get a single variance that is inaccurately representing all the variances of the line. In effect, residuals appear clustered and spread apart on their predicted plots for larger and smaller values for points along the linear regression line, and the mean squared error for the model will be wrong. Typically, for example, a response variable whose mean is large will have a greater variance than one whose mean is small. For example, a given person whose income is predicted to be \$100,000 may easily have an actual income of \$80,000 or \$120,000 (a standard deviation of around \$20,000), while another person with a predicted income of \$10,000 is unlikely to have the same \$20,000 standard deviation, which would imply their actual income would vary anywhere between -\$10,000 and \$30,000. (In fact, as this shows, in many cases, often the same cases where the assumption of normally distributed errors fails, the variance or standard deviation should be predicted to be proportional to the mean, rather than constant.) Simple linear regression estimation methods give less precise parameter estimates and misleading inferential quantities such as standard errors when substantial heteroscedasticity is present. However, various estimation techniques (e.g. weighted least squares and heteroscedasticity-consistent standard errors) can handle heteroscedasticity in a quite general way. Bayesian linear regression techniques can also be used when the variance is assumed to be a function of the mean. It is also possible in some cases to fix the problem by applying a transformation to the response variable (e.g. fit the logarithm of the response variable using a linear regression model, which implies that the response variable has a log-normal distribution rather than a normal distribution).
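The log-transform remedy mentioned at the end can be sketched on synthetic data with multiplicative (hence heteroscedastic) error; the coefficients 0.5 and 0.3 and the noise level here are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(1.0, 10.0, 200)
# Multiplicative log-normal error: the spread of y grows with its mean.
y = np.exp(0.5 + 0.3 * x) * rng.lognormal(mean=0.0, sigma=0.2, size=x.size)

# Fitting log(y) by OLS turns the multiplicative error into additive
# error with constant variance, matching the homoscedasticity assumption.
A = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(A, np.log(y), rcond=None)
print(coef)  # close to the true (0.5, 0.3) on the log scale
```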
Independence of errors. This assumes that the errors of the response variables are uncorrelated with each other. (Actual statistical independence is a stronger condition than mere lack of correlation and is often not needed, although it can be exploited if it is known to hold.) Some methods (e.g. generalized least squares) are capable of handling correlated errors, although they typically require significantly more data unless some sort of regularization is used to bias the model towards assuming uncorrelated errors. Bayesian linear regression is a general way of handling this issue.
Lack of multicollinearity in the predictors. For standard least squares estimation methods, the design matrix X must have full column rank p; otherwise, we have a condition known as multicollinearity in the predictor variables. This can be triggered by having two or more perfectly correlated predictor variables (e.g. if the same predictor variable is mistakenly given twice, either without transforming one of the copies or by transforming one of the copies linearly). It can also happen if there is too little data available compared to the number of parameters to be estimated (e.g. fewer data points than regression coefficients). In the case of multicollinearity, the parameter vector β will be non-identifiable: it has no unique solution. At most we will be able to identify some of the parameters, i.e. narrow down its value to some linear subspace of R^p. See partial least squares regression. Methods for fitting linear models with multicollinearity have been developed; some require additional assumptions such as "effect sparsity", that a large fraction of the effects are exactly zero. Note that the more computationally expensive iterated algorithms for parameter estimation, such as those used in generalized linear models, do not suffer from this problem, and in fact it's quite normal, when handling categorically valued predictors, to introduce a separate indicator-variable predictor for each possible category, which inevitably introduces multicollinearity.
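The perfectly correlated predictor case above can be demonstrated directly (synthetic data; the factor-of-2 rescaling is arbitrary): the design matrix loses full column rank, and the coefficients are identified only up to a linear subspace.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 30
x1 = rng.normal(size=n)
x2 = 2.0 * x1                         # the same predictor entered twice, rescaled
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(scale=0.1, size=n)

# The design matrix does not have full column rank p = 2:
rank = np.linalg.matrix_rank(X)

# X^T X is singular, so the normal equations have no unique solution.
# Every beta with beta1 + 2*beta2 = 3 fits equally well; lstsq returns
# the minimum-norm member of that solution subspace.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(rank, beta, beta[0] + 2.0 * beta[1])
```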
Beyond these assumptions, several other statistical properties of the data strongly influence the performance of different estimation methods:

The statistical relationship between the error terms and the regressors plays an important role in determining whether an estimation procedure has desirable sampling properties such as being unbiased and consistent.

The arrangement, or probability distribution, of the predictor variables x has a major influence on the precision of estimates of β. Sampling and design of experiments are highly developed subfields of statistics that provide guidance for collecting data in such a way to achieve a precise estimate of β.

Interpretation

[Figure: The data sets in Anscombe's quartet have the same linear regression line but are themselves very different.]

A fitted linear regression model can be used to identify the relationship between a single predictor variable x_j and the response variable y when all the other predictor variables in the model are "held fixed". Specifically, the interpretation of β_j is the expected change in y for a one-unit change in x_j when the other covariates are held fixed; that is, the expected value of the partial derivative of y with respect to x_j. This is sometimes called the unique effect of x_j on y. In contrast, the marginal effect of x_j on y can be assessed using a correlation coefficient or simple linear regression model relating x_j to y; this effect is the total derivative of y with respect to x_j.

Care must be taken when interpreting regression results, as some of the regressors may not allow for marginal changes (such as dummy variables, or the intercept term), while others cannot be held fixed (recall the example from the introduction: it would be impossible to "hold t_i fixed" and at the same time change the value of t_i^2).


It is possible that the unique effect can be nearly zero even when the marginal effect is large. This may imply that some other covariate captures all the information in x_j, so that once that variable is in the model, there is no contribution of x_j to the variation in y. Conversely, the unique effect of x_j can be large while its marginal effect is nearly zero. This would happen if the other covariates explained a great deal of the variation of y, but they mainly explain variation in a way that is complementary to what is captured by x_j. In this case, including the other variables in the model reduces the part of the variability of y that is unrelated to x_j, thereby strengthening the apparent relationship with x_j.
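The first situation, a near-zero unique effect alongside a large marginal effect, can be simulated directly (all numbers invented): x2 carries essentially all the information in x1, so the multiple regression coefficient of x1 collapses while its simple regression slope stays large.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)   # x2 captures the information in x1
y = 2.0 * x2 + rng.normal(scale=0.1, size=n)

# Marginal effect of x1: simple regression of y on x1 alone.
marginal = np.polyfit(x1, y, 1)[0]

# Unique effect of x1: its coefficient in the multiple regression with x2 included.
X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
unique = beta[1]

print(marginal, unique)  # marginal is near 2, unique is near 0
```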
The meaning of the expression "held fixed" may depend on how the values of the predictor variables arise. If the experimenter directly sets the values of the predictor variables according to a study design, the comparisons of interest may literally correspond to comparisons among units whose predictor variables have been "held fixed" by the experimenter. Alternatively, the expression "held fixed" can refer to a selection that takes place in the context of data analysis. In this case, we "hold a variable fixed" by restricting our attention to the subsets of the data that happen to have a common value for the given predictor variable. This is the only interpretation of "held fixed" that can be used in an observational study.

The notion of a "unique effect" is appealing when studying a complex system where multiple interrelated components influence the response variable. In some cases, it can literally be interpreted as the causal effect of an intervention that is linked to the value of a predictor variable. However, it has been argued that in many cases multiple regression analysis fails to clarify the relationships between the predictor variables and the response variable when the predictors are correlated with each other and are not assigned following a study design. Commonality analysis may be helpful in disentangling the shared and unique impacts of correlated independent variables.[10]

Extensions

Numerous extensions of linear regression have been developed, which allow some or all of the assumptions underlying the basic model to be relaxed.

Simple and multiple regression

The very simplest case of a single scalar predictor variable x and a single scalar response variable y is known as simple linear regression. The extension to multiple and/or vector-valued predictor variables (denoted with a capital X) is known as multiple linear regression, also known as multivariable linear regression. Nearly all real-world regression models involve multiple predictors, and basic descriptions of linear regression are often phrased in terms of the multiple regression model. Note, however, that in these cases the response variable y is still a scalar. Another term, multivariate linear regression, refers to cases where y is a vector, i.e., the same as general linear regression. The difference between multivariate linear regression and multivariable linear regression should be emphasized as it causes much confusion and misunderstanding in the literature.

General linear models

The general linear model considers the situation when the response variable Y is not a scalar but a vector. Conditional linearity of E(y|x) = Bx is still assumed, with a matrix B replacing the vector β of the classical linear regression model. Multivariate analogues of Ordinary Least Squares (OLS) and Generalized Least Squares (GLS) have been developed. The term "general linear models" is equivalent to "multivariate linear models". Note the difference between "multivariate linear models" and "multivariable linear models", where the former is the same as "general linear models" and the latter is the same as "multiple linear models".

Heteroscedastic models

Various models have been created that allow for heteroscedasticity, i.e. the errors for different response variables may have different variances. For example, weighted least squares is a method for estimating linear regression models when the response variables may have different error variances, possibly with correlated errors. (See also weighted linear least squares, and generalized least squares.) Heteroscedasticity-consistent standard errors is an improved method for use with uncorrelated but potentially heteroscedastic errors.

Generalized linear models

Generalized linear models (GLMs) are a framework for modeling a response variable y that is bounded or discrete. This is used, for example:

when modeling positive quantities (e.g. prices or populations) that vary over a large scale, which are better described using a skewed distribution such as the log-normal distribution or Poisson distribution (although GLMs are not used for log-normal data; instead the response variable is simply transformed using the logarithm function);

when modeling categorical data, such as the choice of a given candidate in an election (which is better described using a Bernoulli distribution/binomial distribution for binary choices, or a categorical distribution/multinomial distribution for multi-way choices), where there are a fixed number of choices that cannot be meaningfully ordered;

when modeling ordinal data, e.g. ratings on a scale from 0 to 5, where the different outcomes can be ordered but where the quantity itself may not have any absolute meaning (e.g. a rating of 4 may not be "twice as good" in any objective sense as a rating of 2, but simply indicates that it is better than 2 or 3 but not as good as 5).

Generalized linear models allow for an arbitrary link function g that relates the mean of the response variable to the predictors. The link function is often related to the distribution of the response, and in particular it typically has the effect of transforming between the (-∞, ∞) range of the linear predictor and the range of the response variable.

Some common examples of GLMs are:

Poisson regression for count data.
Logistic regression and probit regression for binary data.
Multinomial logistic regression and multinomial probit regression for categorical data.
Ordered probit regression for ordinal data.
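As one illustration of how GLM fitting reduces to repeated linear regressions, here is a sketch of logistic regression fitted by iteratively reweighted least squares on simulated binary data (the true coefficients -0.5 and 1.5 are invented):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 400
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
p_true = 1.0 / (1.0 + np.exp(-(X @ np.array([-0.5, 1.5]))))
yb = (rng.uniform(size=n) < p_true).astype(float)

# Newton's method for the logistic GLM: each step is a weighted
# linear least squares fit to a "working response" z.
beta = np.zeros(2)
for _ in range(25):
    mu = 1.0 / (1.0 + np.exp(-(X @ beta)))   # inverse of the logit link
    W = mu * (1.0 - mu)                      # working weights
    z = X @ beta + (yb - mu) / W             # working response
    beta = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))

print(beta)  # roughly (-0.5, 1.5)
```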
Single-index models allow some degree of non-linearity in the relationship between x and y, while preserving the central role of the linear predictor β'x as in the classical linear regression model. Under certain conditions, simply applying OLS to data from a single-index model will consistently estimate β up to a proportionality constant.[11]

Hierarchical linear models

Hierarchical linear models (or multilevel regression) organizes the data into a hierarchy of regressions, for example where A is regressed on B, and B is regressed on C. It is often used where the variables of interest have a natural hierarchical structure such as in educational statistics, where students are nested in classrooms, classrooms are nested in schools, and schools are nested in some administrative grouping, such as a school district. The response variable might be a measure of student achievement such as a test score, and different covariates would be collected at the classroom, school, and school district levels.

Errors-in-variables

Errors-in-variables models (or "measurement error models") extend the traditional linear regression model to allow the predictor variables X to be observed with error. This error causes standard estimators of β to become biased. Generally, the form of bias is an attenuation, meaning that the effects are biased toward zero.

Others

In Dempster–Shafer theory, or a linear belief function in particular, a linear regression model may be represented as a partially swept matrix, which can be combined with similar matrices representing observations and other assumed normal distributions and state equations. The combination of swept or unswept matrices provides an alternative method for estimating linear regression models.

Estimation methods

A large number of procedures have been developed for parameter estimation and inference in linear regression. These methods differ in computational simplicity of algorithms, presence of a closed-form solution, robustness with respect to heavy-tailed distributions, and theoretical assumptions needed to validate desirable statistical properties such as consistency and asymptotic efficiency.

Some of the more common estimation techniques for linear regression are summarized below.

Least squares estimation and related techniques

[Figure: Comparison of the Theil–Sen estimator (black) and simple linear regression (blue) for a set of points with outliers.]

Ordinary least squares (OLS) is the simplest and thus most common estimator. It is conceptually simple and computationally straightforward. OLS estimates are commonly used to analyze both experimental and observational data.

The OLS method minimizes the sum of squared residuals, and leads to a closed-form expression for the estimated value of the unknown parameter β:

$$ \hat\beta = (X^T X)^{-1} X^T y. $$

The estimator is unbiased and consistent if the errors have finite variance and are uncorrelated with the regressors.[12]

It is also efficient under the assumption that the errors have finite variance and are homoscedastic, meaning that E[ε_i^2 | x_i] does not depend on i. The condition that the errors are uncorrelated with the regressors will generally be satisfied in an experiment, but in the case of observational data, it is difficult to exclude the possibility of an omitted covariate z that is related to both the observed covariates and the response variable. The existence of such a covariate will generally lead to a correlation between the regressors and the response variable, and hence to an inconsistent estimator of β. The condition of homoscedasticity can fail with either experimental or observational data. If the goal is either inference or predictive modeling, the performance of OLS estimates can be poor if multicollinearity is present, unless the sample size is large.

In simple linear regression, where there is only one regressor (with a constant), the OLS coefficient estimates have a simple form that is closely related to the correlation coefficient between the covariate and the response.
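Both facts can be checked numerically on simulated data (all numbers invented): the closed-form estimator (X^T X)^{-1} X^T y, and, in the simple case, the identity slope = r · s_y / s_x relating the fitted slope to the correlation coefficient r.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 100
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)

# Closed-form OLS with a constant regressor (intercept).
X = np.column_stack([np.ones(n), x])
beta = np.linalg.solve(X.T @ X, X.T @ y)

# In simple regression, the slope equals the correlation coefficient
# rescaled by the ratio of sample standard deviations.
r = np.corrcoef(x, y)[0, 1]
slope_from_r = r * y.std() / x.std()

print(beta, slope_from_r)  # beta[1] and slope_from_r agree
```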
Generalized least squares (GLS) is an extension of the OLS method that allows efficient estimation of β when either heteroscedasticity, or correlations, or both are present among the error terms of the model, as long as the form of heteroscedasticity and correlation is known independently of the data. To handle heteroscedasticity when the error terms are uncorrelated with each other, GLS minimizes a weighted analogue to the sum of squared residuals from OLS regression, where the weight for the i-th case is inversely proportional to var(ε_i). This special case of GLS is called "weighted least squares". The GLS solution to the estimation problem is

$$ \hat\beta = (X^T \Omega^{-1} X)^{-1} X^T \Omega^{-1} y, $$

where Ω is the covariance matrix of the errors. GLS can be viewed as applying a linear transformation to the data so that the assumptions of OLS are met for the transformed data. For GLS to be applied, the covariance structure of the errors must be known up to a multiplicative constant.
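A weighted least squares instance of the GLS formula above, on synthetic data whose error variance grows with x (the variance function is taken as known, as GLS requires; all numbers are invented):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200
x = np.linspace(0.0, 10.0, n)
X = np.column_stack([np.ones(n), x])

# Heteroscedastic, uncorrelated errors with a KNOWN variance function.
sigma2 = 0.1 + 0.5 * x
y = 1.0 + 2.0 * x + rng.normal(size=n) * np.sqrt(sigma2)

# Diagonal Omega reduces GLS to weighted least squares:
# beta = (X^T Omega^{-1} X)^{-1} X^T Omega^{-1} y with weights 1/sigma_i^2.
w = 1.0 / sigma2
beta_gls = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
print(beta_gls)  # near the true (1.0, 2.0)
```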
Percentage least squares focuses on reducing percentage errors, which is useful in the field of forecasting or time series analysis. It is also useful in situations where the dependent variable has a wide range without constant variance, as here the larger residuals at the upper end of the range would dominate if OLS were used. When the percentage or relative error is normally distributed, least squares percentage regression provides maximum likelihood estimates. Percentage regression is linked to a multiplicative error model, whereas OLS is linked to models containing an additive error term.[13]
Iteratively reweighted least squares (IRLS) is used when heteroscedasticity, or correlations, or both are present among the error terms of the model, but where little is known about the covariance structure of the errors independently of the data.[14] In the first iteration, OLS, or GLS with a provisional covariance structure, is carried out, and the residuals are obtained from the fit. Based on the residuals, an improved estimate of the covariance structure of the errors can usually be obtained. A subsequent GLS iteration is then performed using this estimate of the error structure to define the weights. The process can be iterated to convergence, but in many cases, only one iteration is sufficient to achieve an efficient estimate of β.[15][16]
Instrumental variables regression (IV) can be performed when the regressors are correlated with the errors. In this case, we need the existence of some auxiliary instrumental variables z_i such that E[z_i ε_i] = 0. If Z is the matrix of instruments, then the estimator can be given in closed form as

$$ \hat\beta = (Z^T X)^{-1} Z^T y. $$
Optimal instruments regression is an extension of classical IV regression to the situation where E[ε_i | z_i] = 0.

Total least squares (TLS)[17] is an approach to least squares estimation of the linear regression model that treats the covariates and response variable in a more geometrically symmetric manner than OLS. It is one approach to handling the "errors in variables" problem, and is also sometimes used even when the covariates are assumed to be error-free.

Maximum likelihood estimation and related techniques

Maximum likelihood estimation can be performed when the distribution of the error terms is known to belong to a certain parametric family of probability distributions.[18] When the errors follow a normal distribution with zero mean and constant variance, the resulting estimate is identical to the OLS estimate. GLS estimates are maximum likelihood estimates when ε follows a multivariate normal distribution with a known covariance matrix.

Ridge regression[19][20][21] and other forms of penalized estimation such as lasso regression[5] deliberately introduce bias into the estimation of β in order to reduce the variability of the estimate. The resulting estimators generally have lower mean squared error than the OLS estimates, particularly when multicollinearity is present or when overfitting is a problem. They are generally used when the goal is to predict the value of the response variable y for values of the predictors x that have not yet been observed. These methods are not as commonly used when the goal is inference, since it is difficult to account for the bias.

Least absolute deviation (LAD) regression is a robust estimation technique in that it is less sensitive to the presence of outliers than OLS (but is less efficient than OLS when no outliers are present). It is equivalent to maximum likelihood estimation under a Laplace distribution model for ε.[22]

Adaptive estimation. If we assume that the error terms are independent of the regressors, the optimal estimator is the 2-step MLE, where the first step is used to non-parametrically estimate the distribution of the error term.[23]

Other estimation techniques

Bayesian linear regression applies the framework of Bayesian statistics to linear regression. (See also Bayesian multivariate linear regression.) In particular, the regression coefficients are assumed to be random variables with a specified prior distribution. The prior distribution can bias the solutions for the regression coefficients, in a way similar to (but more general than) ridge regression. In addition, the Bayesian estimation process produces not a single point estimate for the "best" values of the regression coefficients but an entire posterior distribution, completely describing the uncertainty surrounding the quantity. This can be used to estimate the "best" coefficients using the mean, mode, median, any quantile (see quantile regression), or any other function of the posterior distribution.

Quantile regression focuses on the conditional quantiles of y given X rather than the conditional mean of y given X. Linear quantile regression models a particular conditional quantile, for example the conditional median, as a linear function β^T x of the predictors.

Mixed models are widely used to analyze linear regression relationships involving dependent data when the dependencies have a known structure. Common applications of mixed models include analysis of data involving repeated measurements, such as longitudinal data, or data obtained from cluster sampling. They are generally fit as parametric models, using maximum likelihood or Bayesian estimation. In the case where the errors are modeled as normal random variables, there is a close connection between mixed models and generalized least squares.[24] Fixed effects estimation is an alternative approach to analyzing this type of data.

Principal component regression (PCR)[7][8] is used when the number of predictor variables is large, or when strong correlations exist among the predictor variables. This two-stage procedure first reduces the predictor variables using principal component analysis, then uses the reduced variables in an OLS regression fit. While it often works well in practice, there is no general theoretical reason that the most informative linear function of the predictor variables should lie among the dominant principal components of the multivariate distribution of the predictor variables. The partial least squares regression is the extension of the PCR method which does not suffer from the mentioned deficiency.

Least-angle regression[6] is an estimation procedure for linear regression models that was developed to handle high-dimensional covariate vectors, potentially with more covariates than observations.

The Theil–Sen estimator is a simple robust estimation technique that chooses the slope of the fit line to be the median of the slopes of the lines through pairs of sample points. It has similar statistical efficiency properties to simple linear regression but is much less sensitive to outliers.[25]

Other robust estimation techniques, including the trimmed mean approach, and L-, M-, S-, and R-estimators have been introduced.
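The Theil–Sen recipe, the median of all pairwise slopes, is short enough to sketch directly. On invented data, a single gross outlier barely moves it while pulling the OLS slope noticeably:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(8)
x = np.arange(20, dtype=float)
y = 1.0 + 0.5 * x + rng.normal(scale=0.1, size=x.size)
y[18] += 50.0   # one gross outlier

# Theil-Sen slope: median over the slopes of all pairs of sample points.
slopes = [(y[j] - y[i]) / (x[j] - x[i]) for i, j in combinations(range(x.size), 2)]
ts_slope = np.median(slopes)

# OLS slope for comparison: dragged away from the true 0.5 by the outlier.
ols_slope = np.polyfit(x, y, 1)[0]
print(ts_slope, ols_slope)
```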

Further discussion

In statistics and numerical analysis, the problem of numerical methods for linear least squares is an important one because linear regression models are one of the most important types of model, both as formal statistical models and for exploration of data sets. The majority of statistical computer packages contain facilities for regression analysis that make use of linear least squares computations. Hence it is appropriate that considerable effort has been devoted to the task of ensuring that these computations are undertaken efficiently and with due regard to numerical precision.

Individual statistical analyses are seldom undertaken in isolation, but rather are part of a sequence of investigatory steps. Some of the topics involved in considering numerical methods for linear least squares relate to this point. Thus important topics can be:

Computations where a number of similar, and often nested, models are considered for the same data set. That is, where models with the same dependent variable but different sets of independent variables are to be considered, for essentially the same set of data points.

Computations for analyses that occur in a sequence, as the number of data points increases.

Special considerations for very extensive data sets.

Fitting of linear models by least squares often, but not always, arises in the context of statistical analysis. It can therefore be important that considerations of computational efficiency for such problems extend to all of the auxiliary quantities required for such analyses, and are not restricted to the formal solution of the linear least squares problem.

Matrix calculations, like any others, are affected by rounding errors. An early summary of these effects, regarding the choice of computational methods for matrix inversion, was provided by Wilkinson.[26]

Using linear algebra

It follows that one can find a "best" approximation of another function by minimizing the area between two functions, a continuous function f on [a, b] and a function g ∈ W, where W is a subspace of C[a, b]:

$$ \text{Area} = \int_a^b |f(x) - g(x)| \, dx, $$

all within the subspace W. Due to the frequent difficulty of evaluating integrands involving absolute value, one can instead define

$$ \int_a^b (f(x) - g(x))^2 \, dx $$

as an adequate criterion for obtaining the least squares approximation, function g, of f with respect to the inner product space W.

As such, ‖f − g‖² or, equivalently, ‖f − g‖, can thus be written in vector form:

$$ \int_a^b (f(x) - g(x))^2 \, dx = \langle f - g, f - g \rangle = \| f - g \|^2. $$

In other words, the least squares approximation of f is the function g ∈ W closest to f in terms of the inner product ⟨f, g⟩. Furthermore, this can be applied with a theorem:

Let f be continuous on [a, b], and let W be a finite-dimensional subspace of C[a, b]. The least squares approximating function of f with respect to W is given by

$$ g = \langle f, w_1 \rangle w_1 + \langle f, w_2 \rangle w_2 + \cdots + \langle f, w_n \rangle w_n, $$

where B = {w_1, w_2, ..., w_n} is an orthonormal basis for W.
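The theorem can be exercised numerically: approximating f(x) = e^x on [-1, 1] in the subspace W of degree-1 polynomials, with the orthonormal (normalized Legendre) basis w0(x) = 1/√2, w1(x) = √(3/2)·x, and the inner products computed by numerical integration (the choice of f and the grid size are arbitrary):

```python
import numpy as np

# Least squares approximation of f(x) = e^x on [-1, 1] by g in W,
# the degree-1 polynomials, via g = <f,w0>*w0 + <f,w1>*w1.
xs = np.linspace(-1.0, 1.0, 20001)
dx = xs[1] - xs[0]

def inner(u, v):
    # Trapezoidal approximation of the inner product integral over [-1, 1].
    prod = u * v
    return (prod[:-1] + prod[1:]).sum() * 0.5 * dx

f = np.exp(xs)
w0 = np.full_like(xs, 1.0 / np.sqrt(2.0))   # orthonormal basis of W
w1 = np.sqrt(1.5) * xs

g = inner(f, w0) * w0 + inner(f, w1) * w1

# g minimizes the integral of (f - g)^2 over all of W.
err = inner(f - g, f - g)
print(err)
```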

Applications of linear regression

Linear regression is widely used in biological, behavioral and social sciences to describe possible relationships between variables. It ranks as one of the most important tools used in these disciplines.

Trend line

A trend line represents a trend, the long-term movement in time series data after other components have been accounted for. It tells whether a particular data set (say GDP, oil prices or stock prices) has increased or decreased over the period of time. A trend line could simply be drawn by eye through a set of data points, but more properly its position and slope are calculated using statistical techniques like linear regression. Trend lines typically are straight lines, although some variations use higher degree polynomials depending on the degree of curvature desired in the line.

Trend lines are sometimes used in business analytics to show changes in data over time. This has the advantage of being simple. Trend lines are often used to argue that a particular action or event (such as training, or an advertising campaign) caused observed changes at a point in time. This is a simple technique, and does not require a control group, experimental design, or a sophisticated analysis technique. However, it suffers from a lack of scientific validity in cases where other potential changes can affect the data.
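Fitting a trend line by ordinary least squares can be sketched as follows (Python with NumPy assumed; the yearly series is made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical yearly series: an underlying upward trend of 3 units per
# year plus irregular noise that the fitted line should average out.
years = np.arange(2000, 2016)
values = 50 + 3.0 * (years - 2000) + rng.normal(0, 2, years.size)

# A degree-1 polynomial fit is simple linear regression of values on years.
slope, intercept = np.polyfit(years, values, 1)
trend = intercept + slope * years  # the trend line itself

print(f"estimated trend: {slope:.2f} units per year")
```

Swapping `1` for a higher degree in `np.polyfit` gives the curved (polynomial) trend variants mentioned above.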

Epidemiology

Early evidence relating tobacco smoking to mortality and morbidity came from observational studies employing regression analysis. In order to reduce spurious correlations when analyzing observational data, researchers usually include several variables in their regression models in addition to the variable of primary interest. For example, suppose we have a regression model in which cigarette smoking is the independent variable of interest, and the dependent variable is lifespan measured in years. Researchers might include socio-economic status as an additional independent variable, to ensure that any observed effect of smoking on lifespan is not due to some effect of education or income. However, it is never possible to include all possible confounding variables in an empirical analysis. For example, a hypothetical gene might increase mortality and also cause people to smoke more. For this reason, randomized controlled trials are often able to generate more compelling evidence of causal relationships than can be obtained using regression analyses of observational data. When controlled experiments are not feasible, variants of regression analysis such as instrumental variables regression may be used to attempt to estimate causal relationships from observational data.
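The confounding logic above can be illustrated with simulated data (a hypothetical sketch in Python with NumPy; the coefficients and variables are invented, not estimates from any real study):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# Invented data-generating process: education lowers smoking and
# independently raises lifespan, so it confounds the smoking effect.
education = rng.normal(12, 3, n)                      # years of schooling
smoking = 20 - 0.8 * education + rng.normal(0, 4, n)  # cigarettes per day
lifespan = 75 - 0.3 * smoking + 0.5 * education + rng.normal(0, 5, n)

# Simple regression of lifespan on smoking alone: the slope also absorbs
# the education effect, overstating the harm of smoking.
slope_simple = np.polyfit(smoking, lifespan, 1)[0]

# Multiple regression with the confounder included recovers roughly -0.3,
# the effect actually built into the simulation.
X = np.column_stack([np.ones(n), smoking, education])
slope_multi = np.linalg.lstsq(X, lifespan, rcond=None)[0][1]

print(slope_simple, slope_multi)
```

The gap between the two slopes is the omitted-variable bias; a confounder like the hypothetical gene, being unobserved, could not be added to X this way.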

Finance

The capital asset pricing model uses linear regression as well as the concept of beta for analyzing and quantifying the systematic risk of an investment. This comes directly from the beta coefficient of the linear regression model that relates the return on the investment to the return on all risky assets.
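As a sketch, beta is simply the slope of a simple linear regression of the asset's returns on the market's returns (Python with NumPy; the return series are simulated, not real market data):

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated monthly excess returns: the asset is constructed to move 1.5x
# with the market plus idiosyncratic noise, so its estimated beta should
# come out near 1.5.
market = rng.normal(0.01, 0.04, 120)  # 10 years of monthly market returns
asset = 1.5 * market + rng.normal(0.0, 0.02, 120)

# Slope of the regression line = beta; intercept = alpha.
beta, alpha = np.polyfit(market, asset, 1)
print(f"beta = {beta:.2f}  (beta > 1: more volatile than the market)")
```

Only the co-movement with the market (the slope) counts as systematic risk; the idiosyncratic noise term averages out in a diversified portfolio.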

Economics

Linear regression is the predominant empirical tool in economics. For example, it is used to predict consumption spending,[27] fixed investment spending, inventory investment, purchases of a country's exports,[28] spending on imports,[28] the demand to hold liquid assets,[29] labor demand,[30] and labor supply.[30]

Environmental science

In Canada, the Environmental Effects Monitoring Program uses statistical analyses on fish and benthic surveys to measure the effects of pulp mill or metal mine effluent on the aquatic ecosystem.[31]

See also

Analysis of variance
Censored regression model
Cross-sectional regression
Curve fitting
Empirical Bayes methods
Errors and residuals
Lack-of-fit sum of squares
Line fitting
Linear classifier
Linear equation
Logistic regression
M-estimator
MLPACK contains a C++ implementation of linear regression
Nonlinear regression
Nonparametric regression
Normal equations
Projection pursuit regression
Segmented linear regression
Stepwise regression
Support vector machine
Truncated regression model

Notes

1. David A. Freedman (2009). Statistical Models: Theory and Practice. Cambridge University Press. p. 26. "A simple regression equation has on the right hand side an intercept and an explanatory variable with a slope coefficient. A multiple regression equation has two or more explanatory variables on the right hand side, each with its own slope coefficient."
2. Rencher, Alvin C.; Christensen, William F. (2012), "Chapter 10, Multivariate regression – Section 10.1, Introduction", Methods of Multivariate Analysis, Wiley Series in Probability and Statistics 709 (3rd ed.), John Wiley & Sons, p. 19, ISBN 9781118391679.
3. Hilary L. Seal (1967). "The historical development of the Gauss linear model". Biometrika 54 (1/2): 1–24. doi:10.1093/biomet/54.1-2.1.
4. Yan, Xin (2009), Linear Regression Analysis: Theory and Computing, World Scientific, pp. 1–2, ISBN 9789812834119, "Regression analysis ... is probably one of the oldest topics in mathematical statistics dating back to about two hundred years ago. The earliest form of the linear regression was the least squares method, which was published by Legendre in 1805, and by Gauss in 1809 ... Legendre and Gauss both applied the method to the problem of determining, from astronomical observations, the orbits of bodies about the sun."
5. Tibshirani, Robert (1996). "Regression Shrinkage and Selection via the Lasso". Journal of the Royal Statistical Society, Series B 58 (1): 267–288. JSTOR 2346178.
6. Efron, Bradley; Hastie, Trevor; Johnstone, Iain; Tibshirani, Robert (2004). "Least Angle Regression". Annals of Statistics 32 (2): 407–451. doi:10.1214/009053604000000067. JSTOR 3448465.
7. Hawkins, Douglas M. (1973). "On the Investigation of Alternative Regressions by Principal Component Analysis". Journal of the Royal Statistical Society, Series C 22 (3): 275–286. JSTOR 2346776.
8. Jolliffe, Ian T. (1982). "A Note on the Use of Principal Components in Regression". Journal of the Royal Statistical Society, Series C 31 (3): 300–303. JSTOR 2348005.
9. Berk, Richard A. Regression Analysis: A Constructive Critique. Sage. doi:10.1177/0734016807304871.
10. Warne, R. T. (2011). Beyond multiple regression: Using commonality analysis to better understand R2
11. Brillinger, David R. (1977). "The Identification of a Particular Nonlinear Time Series System". Biometrika 64 (3): 509–515. doi:10.1093/biomet/64.3.509. JSTOR 2345326.
12. Lai, T. L.; Robbins, H.; Wei, C. Z. (1978). "Strong consistency of least squares estimates in multiple regression". PNAS 75 (7): 3034–3036. Bibcode:1978PNAS...75.3034L. doi:10.1073/pnas.75.7.3034. JSTOR 68164.
13. Tofallis, C (2009). "Least Squares Percentage Regression". Journal of Modern Applied Statistical Methods 7: 526–534. doi:10.2139/ssrn.1406472.
14. del Pino, Guido (1989). "The Unifying Role of Iterative Generalized Least Squares in Statistical Algorithms". Statistical Science 4 (4): 394–403. doi:10.1214/ss/1177012408. JSTOR 2245853.
15. (4): 1224–1233. doi:10.1214/aos/1176345987. JSTOR 2240725.
16. Cohen, Michael; Dalal, Siddhartha R.; Tukey, John W. (1993). "Robust, Smoothly Heterogeneous Variance Regression". Journal of the Royal Statistical Society, Series C 42 (2): 339–353. JSTOR 2986237.
17. Nievergelt, Yves (1994). "Total Least Squares: State-of-the-Art Regression in Numerical Analysis". SIAM Review 36 (2): 258–264. doi:10.1137/1036055. JSTOR 2132463.
18. Lange, Kenneth L.; Little, Roderick J. A.; Taylor, Jeremy M. G. (1989). "Robust Statistical Modeling Using the t Distribution". Journal of the American Statistical Association 84 (408): 881–896. doi:10.2307/2290063. JSTOR 2290063.
19. Swindel, Benee F. (1981). "Geometry of Ridge Regression Illustrated". The American Statistician 35 (1): 12–15. doi:10.2307/2683577. JSTOR 2683577.
20. Draper, Norman R.; van Nostrand, R. Craig (1979). "Ridge Regression and James–Stein Estimation: Review

21. Hoerl, Arthur E.; Kennard, Robert W.; Hoerl, Roger W. (1985). "Practical Use of Ridge Regression: A Challenge Met". Journal of the Royal Statistical Society, Series C 34 (2): 114–120. JSTOR 2347363.
22. Narula, Subhash C.; Wellington, John F. (1982). "The Minimum Sum of Absolute Errors Regression: A State of the Art Survey". International Statistical Review 50 (3): 317–326. doi:10.2307/1402501. JSTOR 1402501.
23. Statistics 3 (2): 267–284. doi:10.1214/aos/1176343056. JSTOR 2958945.
24. Goldstein, H. (1986). "Multilevel Mixed Linear Model Analysis Using Iterative Generalized Least Squares". Biometrika 73 (1): 43–56. doi:10.1093/biomet/73.1.43. JSTOR 2336270.
25. Theil, H. (1950). "A rank-invariant method of linear and polynomial regression analysis. I, II, III". Nederl. Akad. Wetensch., Proc. 53; Sen, Pranab Kumar (1968). "Estimates of the regression coefficient based on Kendall's tau". Journal of the American Statistical Association 63 (324): 1379–1389. doi:10.2307/2285891. JSTOR 2285891. MR 0258201.
26. Wilkinson, J. H. (1963). "Chapter 3: Matrix Computations". Rounding Errors in Algebraic Processes. London: Her Majesty's Stationery Office (National Physical Laboratory, Notes in Applied Science, No. 32).
27. Deaton, Angus (1992). Understanding Consumption. Oxford University Press. ISBN 0198288247.
28. Krugman, Paul R.; Obstfeld, M.; Melitz, Marc J. (2012). International Economics: Theory and Policy (9th global ed.). Harlow: Pearson. ISBN 9780273754091.
29. Laidler, David E. W. (1993). The Demand for Money: Theories, Evidence, and Problems (4th ed.). New York: HarperCollins. ISBN 0065010981.
30. ISBN 9780321538963.
31. EEMP webpage (http://www.ec.gc.ca/eseeeem/default.asp?lang=En&n=453D78FC1)

References

Cohen, J., Cohen P., West, S. G., & Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences. (2nd ed.) Hillsdale, NJ: Lawrence Erlbaum Associates
Charles Darwin. The Variation of Animals and Plants under Domestication. (1868) (Chapter XIII)
Draper, N. R.; Smith, H. (1998). Applied Regression Analysis (3rd ed.). John Wiley. ISBN 0471170828.
Francis Galton. "Regression Towards Mediocrity in Hereditary Stature," Journal of the Anthropological Institute, 15: 246–263 (1886). (Facsimile at: [1](http://www.mugu.com/galton/essays/18801889/galton1886jaigiregressionstature.pdf))
Robert S. Pindyck and Daniel L. Rubinfeld (1998, 4th ed.). Econometric Models and Economic Forecasts, ch. 1 (Intro, incl. appendices on operators & derivation of parameter est.) & Appendix 4.3 (mult. regression in matrix form).
Barlow, Jesse L. (1993). "Chapter 9: Numerical aspects of Solving Linear Least Squares Problems". In Rao, C. R. Computational Statistics. Handbook of Statistics 9. North-Holland. ISBN 0444880968
Björck, Åke (1996). Numerical methods for least squares
Goodall, Colin R. (1993). "Chapter 13: Computation using the QR decomposition". In Rao, C. R. Computational Statistics. Handbook of Statistics 9. North-Holland. ISBN 0444880968
Pedhazur, Elazar J (1982). "Multiple regression in behavioral research: Explanation and prediction" (2nd ed.). New York: Holt, Rinehart and Winston. ISBN 0030417600.
National Physical Laboratory (1961). "Chapter 1: Linear Equations and Matrices: Direct Methods". Modern Computing Methods. Notes on Applied Science 16 (2nd ed.). Her Majesty's Stationery Office
National Physical Laboratory (1961). "Chapter 2: Linear Equations and Matrices: Direct Methods on Automatic Computers". Modern Computing Methods. Notes on Applied Science 16 (2nd ed.). Her Majesty's Stationery Office
Retrieved from "https://en.wikipedia.org/w/index.php?title=Linear_regression&oldid=720569695"
Categories: Regression analysis, Estimation theory, Parametric statistics, Econometrics
This page was last modified on 16 May 2016, at 18:04.