# OLS Data Analysis in R

Dino Christenson & Scott Powell
Ohio State University
November 20, 2007

## Introduction to R: Outline

I. Data Description
II. Data Analysis
  i. Command functions
  ii. Hand rolling
III. OLS Diagnostics & Graphing
IV. Functions and loops
V. Moving forward

## Data Analysis: Descriptive Stats

R has several built-in commands for describing data.
The list() command can output all elements of an object.

## Data Analysis: Descriptive Stats

The summary() command can be used to describe all variables contained in a data set.
The summary() command can also be used with individual variables.
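
As a minimal sketch of the two uses of summary() described above, the snippet below uses R's built-in mtcars data frame as a stand-in. The slides' own data set is not reproduced here, so the data and the object names `dat`, `df.summary` and `mpg.summary` are assumptions made for illustration.

```r
# mtcars ships with base R; it stands in for the slides' data (an assumption)
dat <- mtcars[, c("mpg", "wt", "hp")]

# Describe every variable in the data frame at once
df.summary <- summary(dat)
print(df.summary)

# Describe a single variable
mpg.summary <- summary(dat$mpg)
print(mpg.summary)
```

For a numeric variable the result holds the minimum, quartiles, median, mean and maximum.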

## Data Analysis: Descriptive Stats

Simple plots can also provide familiarity with the data.
The hist() command produces a histogram for any given data values.

## Data Analysis: Descriptive Stats

Simple plots can also provide familiarity with the data.
The plot() command can produce both univariate and bivariate plots for any given objects.
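
A short sketch of hist() and plot() as described above, again assuming the built-in mtcars data as a stand-in for the slides' data; the variable names are hypothetical.

```r
# Built-in mtcars data as a stand-in (an assumption)
mpg <- mtcars$mpg
wt  <- mtcars$wt

# Histogram of one variable; hist() also returns the bin counts invisibly
h <- hist(mpg, main = "Histogram of mpg")

# Univariate plot: values plotted against their index
plot(mpg)

# Bivariate plot: scatterplot of mpg against wt
plot(wt, mpg)
```

When run non-interactively, R writes these plots to a file rather than opening a window.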

## Data Analysis: Descriptive Stats

Other Useful Commands

sum
mean
var
sd
range

min
max
median
di
cor
summary
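
To see what several of these commands return, here is a small sketch on two made-up vectors; `v` and `w` are hypothetical and carry no meaning beyond illustration.

```r
# Two short illustrative vectors (hypothetical data)
v <- c(2, 4, 4, 5, 7, 8)
w <- c(1, 3, 2, 5, 6, 9)

v.sum    <- sum(v)      # total of all elements
v.mean   <- mean(v)     # arithmetic mean
v.median <- median(v)   # middle value
v.var    <- var(v)      # sample variance (n - 1 denominator)
v.sd     <- sd(v)       # sample standard deviation
v.range  <- range(v)    # smallest and largest value, same as c(min(v), max(v))
vw.cor   <- cor(v, w)   # Pearson correlation between v and w

print(c(sum = v.sum, mean = v.mean, median = v.median, sd = v.sd, cor = vw.cor))
```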

## Data Analysis: Regression

As mentioned above, one of the big perks of using R is flexibility.
R comes with its own canned linear regression command:

    lm(y ~ x)

However, we're going to use R to make our own OLS estimator. Then we will compare with the canned procedure, as well as Stata.

## Data Analysis: Regression

First, let's take a look at our code for the hand-rolled OLS estimator.

The Holy Grail: $(X'X)^{-1}X'Y$

We need a single matrix of independent variables.
The cbind() command takes the individual variable vectors and combines them into one x variable matrix.
A 1 is included as the first element to account for the constant.
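
A minimal sketch of building the x matrix with cbind(), assuming the built-in mtcars data in place of the slides' variables; the column choices and names are hypothetical.

```r
# Dependent variable and regressors from mtcars (stand-ins, an assumption)
y <- mtcars$mpg

# The leading 1 is recycled into a full column and becomes the constant
x <- cbind(1, mtcars$wt, mtcars$hp)
colnames(x) <- c("const", "wt", "hp")

dim(x)    # rows = observations, columns = constant plus regressors
head(x)
```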

## Data Analysis: Regression

With the x and y matrices complete, we can now manipulate them to produce coefficients.
After performing the divine multiplication, we can observe the estimates by entering the object name (in this case b).
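
The multiplication described above can be sketched as follows. The data are the built-in mtcars variables used as stand-ins (an assumption); only the object name b follows the slides.

```r
# Stand-in data (an assumption): y vector and x matrix with a constant
y <- mtcars$mpg
x <- cbind(1, mtcars$wt, mtcars$hp)

# b = (X'X)^{-1} X'Y
# t() transposes, %*% is matrix multiplication, solve() inverts
b <- solve(t(x) %*% x) %*% t(x) %*% y

# Entering the object name prints the estimates
b
```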

## Data Analysis: Regression

To find the standard errors, we need to compute both the variance of the residuals and the cov matrix of the xs.
The sqrt of the diagonal elements of this var-cov matrix will give us the standard errors.
Other test statistics can be easily computed.
View the standard errors.
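
A sketch of the standard error steps listed above, under the same assumption that mtcars variables stand in for the slides' data. The object names other than b are hypothetical.

```r
# Stand-in data (an assumption)
y <- mtcars$mpg
x <- cbind(1, mtcars$wt, mtcars$hp)
b <- solve(t(x) %*% x) %*% t(x) %*% y

# Residuals and their variance, using n - (k + 1) degrees of freedom
e  <- y - x %*% b
df <- nrow(x) - ncol(x)
s2 <- sum(e^2) / df

# Var-cov matrix of the coefficients and its diagonal
varcov <- s2 * solve(t(x) %*% x)
se <- sqrt(diag(varcov))

# Other test statistics follow directly
t.stat <- as.vector(b) / se
p.val  <- 2 * pt(-abs(t.stat), df = df)

# View the standard errors
se
```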

## Data Analysis: Regression

Time to Compare

Use the lm() command to estimate the model using R's canned procedure.
As we can see, the estimates are very similar.
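
One way to line up the two sets of estimates is sketched below. The mtcars variables and the name `ols.fit` are assumptions used in place of the slides' model.

```r
# Hand-rolled estimates on stand-in data (an assumption)
y <- mtcars$mpg
x <- cbind(1, mtcars$wt, mtcars$hp)
b <- solve(t(x) %*% x) %*% t(x) %*% y

# Canned procedure on the same variables
ols.fit <- lm(mpg ~ wt + hp, data = mtcars)

# Side-by-side comparison of the coefficients
comparison <- cbind(hand.rolled = as.vector(b), canned = coef(ols.fit))
print(comparison)

# Full canned output, including standard errors and t statistics
summary(ols.fit)
```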

## Data Analysis: Regression

Time to Compare

We can also see how both the hand-rolled and canned OLS procedures stack up to Stata.
Use the reg command to estimate the model.
As we can see, the estimates are once again very similar.

## Data Analysis: Regression

Other Useful Commands

- lm: Linear Model
- lme: Mixed Effects
- glm: Generalized lm
- multinom: Multinomial Logit
- anova
- optim: General Optimizer

## OLS Diagnostics in R

Post-estimation diagnostics are key to data analysis.
We want to make sure we estimated the proper model.
Besides, Irfan will hurt you if you neglect to do this.

Furthermore, diagnostics allow us the opportunity to show off some of R's graphs.
R's real strength is that it has virtually unlimited graphing capabilities.
Of course, such strength on R's part is dependent on your knowledge of both R and statistics.
Still, with just some basics we can do some cool graphs.

## OLS Diagnostics in R

What could be unjustifiably driving our data?

- Outlier: unusual observation
- Leverage: ability to change the slope of the regression line
- Influence: the combined impact of strong leverage and outlier status

According to John Fox, influence = leverage * outliers

## OLS Diagnostics: Leverage

Recall our OLS model:

    ols.model1 <- lm(formula = repvshr ~ income + presvote + pressup)

Our measure of leverage is the $h_i$ or hat value.
It's just the predicted values written in terms of $h_i$, where $h_{ij}$ is the contribution of observation $Y_i$ to the fitted value $\hat{Y}_j$.
If $h_{ij}$ is large, then the $i$th observation has a significant impact on the $j$th fitted value.
So, skipping the formulas, we know that the larger the hat value, the greater the leverage of that observation.

## OLS Diagnostics: Leverage

Find the hat values:

    hatvalues(ols.model1)

Calculate the average hat value:

    avg.mod1 <- ncol(x)/nrow(x)
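
For readers who want to see where the hat values come from, here is a hedged sketch that builds the hat matrix by hand and checks it against hatvalues(). The mtcars model and the names `ols.fit`, `H`, `h`, `avg.hat` and `high.leverage` are assumptions, not the slides' objects.

```r
# Stand-in model (an assumption): mtcars replaces the slides' data
ols.fit <- lm(mpg ~ wt + hp, data = mtcars)

# model.matrix() returns the x matrix, constant column included
x <- model.matrix(ols.fit)

# Hat matrix H = X (X'X)^{-1} X'; its diagonal holds the hat values
H <- x %*% solve(t(x) %*% x) %*% t(x)
h <- diag(H)

# The hand-rolled values should match the canned ones
all.equal(unname(h), unname(hatvalues(ols.fit)))

# The hat values sum to the number of columns of x,
# so their average is ncol(x) / nrow(x)
avg.hat <- ncol(x) / nrow(x)

# Observations with more than twice the average leverage
high.leverage <- which(h > 2 * avg.hat)
high.leverage
```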

## OLS Diagnostics: Leverage

But a picture is worth a hundred numbers?
Graph the hat values with lines for the average, twice the avg (large samples) and three times the avg (small samples) hat values:

    plot(hatvalues(ols.model1))
    abline(h=1*(ncol(x))/nrow(x))
    abline(h=2*(ncol(x))/nrow(x))
    abline(h=3*(ncol(x))/nrow(x))
    identify(hatvalues(ols.model1))

identify lets us select the data points in the new graph.
State #2 is over twice the avg.
Nothing above three times.

[Figure: index plot of hatvalues(ols.model1) against Index, with horizontal lines at one, two and three times the average hat value.]

## OLS Diagnostics: Outliers

Can we find any data points that are unusual for Y given the Xs?
Use studentized residuals:

$$u_i^* = \frac{u_i}{\hat{\sigma}_{u(-i)} \sqrt{1 - h_i}}$$

where $\hat{\sigma}_{u(-i)}$ is the residual standard error estimated without observation $i$.
We can see whether there is a significant change in the model.
If their absolute values are larger than 2, then the corresponding observations are likely to be outliers.

    rstudent(ols.model1)
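
The studentized residuals can also be hand-rolled. The sketch below assumes the deleted-observation form of the formula, which is what rstudent() computes, and uses the built-in mtcars data with hypothetical object names.

```r
# Stand-in model (an assumption)
ols.fit <- lm(mpg ~ wt + hp, data = mtcars)

e <- resid(ols.fit)          # raw residuals u_i
h <- hatvalues(ols.fit)      # hat values h_i
n <- length(e)
p <- length(coef(ols.fit))   # number of coefficients, constant included

# Residual variance from the full model
s2 <- sum(e^2) / (n - p)

# Residual variance with observation i deleted, without refitting n times
s2.del <- (s2 * (n - p) - e^2 / (1 - h)) / (n - p - 1)

# Studentized residuals
u.star <- e / (sqrt(s2.del) * sqrt(1 - h))

# Flag observations beyond the rule-of-thumb cutoff of 2
outliers <- which(abs(u.star) > 2)
outliers
```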

## OLS Diagnostics: Outliers

Again, let's plot them with lines for 2 & -2.
States 2 and 3 appear to be outliers, or darn close.
We should definitely take a look at what makes these states unusual.

- Perhaps there is a mistake in data entry.
- Perhaps the model is misspecified in terms of functional form (forthcoming) or omitted vars.
- Maybe you can throw out the observation, or try robust regression.

[Figure: index plot of rstudent(ols.model1) against Index, with horizontal lines at 2 and -2.]

## OLS Diagnostics: Influence

Cook's D gives a kind of summary for each observation's influence:

$$D_i = \frac{u_i^{\prime 2}}{k + 1} \times \frac{h_i}{1 - h_i}$$

If Cook's D is greater than 4/(n-k-1), then the observation is said to exert undue influence.
Let's just plot it:

    plot(cookd(ols.model1))
    abline(h=4/(nrow(x)-ncol(x)))
    identify(cookd(ols.model1))

States 2 and (maybe) 3 are in the trouble zone.

[Figure: index plot of cookd(ols.model1) against Index, with a horizontal line at the cutoff.]
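
A hedged sketch of the Cook's D calculation. It treats $u_i'$ as the standardized residual and uses base R's cooks.distance() for the check, on the assumption that it computes the same quantity as the cookd() command named on the slide. The mtcars model and all object names are stand-ins.

```r
# Stand-in model (an assumption)
ols.fit <- lm(mpg ~ wt + hp, data = mtcars)

u.std <- rstandard(ols.fit)        # standardized residuals
h     <- hatvalues(ols.fit)        # hat values
n     <- length(h)
k     <- length(coef(ols.fit)) - 1 # regressors, not counting the constant

# Cook's D by hand: outlyingness times leverage
cook.hand <- (u.std^2 / (k + 1)) * (h / (1 - h))

# Rule-of-thumb cutoff from the slide
cutoff <- 4 / (n - k - 1)

# Observations exerting undue influence under that cutoff
influential <- which(cook.hand > cutoff)
influential
```

The two-factor form makes the "influence = leverage * outliers" idea concrete: an observation needs both a large residual and a large hat value to score highly.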

## OLS Diagnostics: Influence

For a host of measures of influence, including dfbetas and dffits:

    influence.measures(ols.model1)

dfbeta gives the influence of an observation on the coefficients, or the change in an iv's coefficient caused by deleting a single observation.

Simple commands for partial regression plots can be found on Fox's website.
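
To make the "deleting a single observation" idea concrete, the sketch below drops one row, refits, and compares the size of the coefficient change with what dfbeta() reports. The mtcars model and the choice of observation 17 are arbitrary assumptions; only magnitudes are compared, so no sign convention is asserted.

```r
# Stand-in model (an assumption)
ols.fit <- lm(mpg ~ wt + hp, data = mtcars)

# Refit without one observation (row 17 is an arbitrary choice)
i <- 17
fit.drop <- lm(mpg ~ wt + hp, data = mtcars[-i, ])

# Size of the change in each coefficient caused by the deletion
change.hand <- abs(coef(ols.fit) - coef(fit.drop))

# The same quantity from the canned command, for that observation
change.canned <- abs(dfbeta(ols.fit)[i, ])

print(rbind(change.hand, change.canned))

# The full set of influence measures, one row per observation
infl <- influence.measures(ols.fit)
head(infl$infmat)
```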

## OLS Diagnostics: Normality

Was it correct to use a linear model?
Use a quantile plot (qq plot) to check:

    qq.plot(ols.model1, distribution="norm")

It plots the empirical quantiles of the studentized residuals against the theoretical normal quantiles.
We are looking for obs on a straight line.
In R it is simple to plot the error bands as well.
Deviation requires us to transform our variables.
The problems are again 2 and 13, with 3, 22 and 14 bordering on trouble this time around.

[Figure: qq plot of Studentized Residuals(ols.model1) against norm Quantiles.]

## OLS Diagnostics: Normality

A simple density plot of the studentized residuals helps to determine the nature of our data.
The apparent deviation from the normal curve is not severe, but there certainly seems to be a slight negative skew.

[Figure: density.default(x = rstudent(ols.model1)), with Density on the vertical axis; N = 22, Bandwidth = 0.4217.]

## OLS Diagnostics: Error Variance

We can also easily look for heteroskedasticity.
Plotting the residuals against the fitted values and the continuous independent variables lets us examine our statistical model for the presence of unbalanced error variance.

    par(mfrow=c(2,2))
    plot(resid(ols.model1)~fitted.values(ols.model1))
    plot(resid(ols.model1)~income)
    plot(resid(ols.model1)~presvote)
    plot(resid(ols.model1)~pressup)

[Figure: four panels plotting resid(ols.model1) against fitted.values(ols.model1), income, presvote and pressup.]

## OLS Diagnostics: Error Variance

Formal tests for heteroskedasticity are available from the lmtest library:

    library(lmtest)

- bptest(ols.model1) will give you the Breusch-Pagan test stat
- gqtest(ols.model1) will give you the Goldfeld-Quandt test stat
- hmctest(ols.model1) will give you the Harrison-McCabe test stat
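
In the hand-rolling spirit of the earlier slides, here is a sketch of one version of the Breusch-Pagan statistic in base R. It assumes the studentized (Koenker) form, n times the R-squared of a regression of the squared residuals on the regressors; whether this matches bptest()'s default output exactly is not verified here. Data and object names are stand-ins.

```r
# Stand-in model (an assumption)
ols.fit <- lm(mpg ~ wt + hp, data = mtcars)

# Auxiliary regression: squared residuals on the same regressors
aux.data <- data.frame(e2 = resid(ols.fit)^2,
                       wt = mtcars$wt,
                       hp = mtcars$hp)
aux <- lm(e2 ~ wt + hp, data = aux.data)

n <- nrow(aux.data)
k <- length(coef(aux)) - 1   # regressors in the auxiliary model

# LM statistic and its chi-squared p-value with k degrees of freedom
bp.stat <- n * summary(aux)$r.squared
bp.pval <- pchisq(bp.stat, df = k, lower.tail = FALSE)

c(statistic = bp.stat, df = k, p.value = bp.pval)
```

A small p-value would point toward unbalanced error variance.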

## OLS Diagnostics: Collinearity

Finally, let's look out for collinearity.
To get the variance inflation factors:

    vif(ols.model1)

Let's look at the condition index from the perturb library:

    library(perturb)
    colldiag(ols.model1)

The issue here is the largest condition index.
If it is larger than 30, Houston, we have a problem.
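
The variance inflation factors can be hand-rolled as well. This sketch uses the textbook definition, 1/(1 - R²) from regressing each regressor on the others, with three mtcars variables as hypothetical stand-ins for the slides' independent variables.

```r
# Stand-in regressors (an assumption)
vars <- c("wt", "hp", "qsec")

# For each regressor, regress it on the others and convert R-squared to a VIF
vif.hand <- sapply(vars, function(v) {
  others <- setdiff(vars, v)
  aux <- lm(reformulate(others, response = v), data = mtcars)
  1 / (1 - summary(aux)$r.squared)
})

vif.hand
```

A VIF of 1 means a regressor is uncorrelated with the others; larger values mean its coefficient variance is inflated by collinearity.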
## OLS Diagnostics: Shortcut

My favorite shortcut command to get you four essential diagnostic plots after you run your model:

    plot(ols.model1, which=1:4)

Now you have no excuse not to run some diagnostics!
Btw, look at the high residuals in the rvf plot for 14, 13 and 3, suggesting outliers.

[Figure: four diagnostic panels: Residuals vs Fitted, Normal Q-Q, Scale-Location and Cook's distance.]

## The Final Act: Loops and Functions

As was mentioned above, R's biggest asset is its flexibility.
Loops and functions directly utilize this asset.
Loops can be implemented for a number of purposes, essentially when repeated actions are needed (e.g. simulations).
Functions allow us to create our own commands. This is especially useful when a canned procedure does not exist.
We will create our own OLS function with the hand-rolled code used earlier.

## Loops

for loops are the most common and the only type of loop we will look at today.
The first loop command at the right shows simple loop iteration.
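
Since the slide's own loop code is not reproduced in the text, the following is only an illustrative sketch: a simple for loop, then a loop that repeats an action many times in a small simulation. All names, the seed and the simulated model are hypothetical.

```r
# Simple loop iteration: store the square of each index
squares <- numeric(5)
for (i in 1:5) {
  squares[i] <- i^2
}
squares

# Repeated action: simulate data and re-estimate a slope many times
set.seed(123)                      # arbitrary seed for reproducibility
n.sims <- 200
slopes <- numeric(n.sims)
for (s in 1:n.sims) {
  x.sim <- rnorm(50)
  y.sim <- 1 + 2 * x.sim + rnorm(50)   # true slope is 2 by construction
  slopes[s] <- coef(lm(y.sim ~ x.sim))[2]
}

# The average estimate should sit close to the true slope
mean(slopes)
```

Pre-allocating the result vectors with numeric() before the loop avoids growing them one element at a time.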


## Functions

Now we will make our own linear regression function using our hand-rolled OLS code.
Functions require inputs (which are the objects to be utilized) and arguments (which are the commands that the function performs).
The actual estimation procedure does not change.
However, some changes are

OLS: Hand-rolled vs Function

Implementing our new function ols, we get precisely the output that we

We can check this against the results produced by the standard lm function.
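
The slides' own function is not shown in the text, so the following is a sketch of what such an ols function might look like, wrapping the hand-rolled steps from earlier. The argument names, the returned columns and the mtcars stand-in data are all assumptions.

```r
# A hypothetical ols() wrapping the hand-rolled estimator.
# y: vector of the dependent variable
# x: matrix of independent variables, constant column included
ols <- function(y, x) {
  xtx.inv <- solve(t(x) %*% x)
  b  <- xtx.inv %*% t(x) %*% y                # coefficients
  e  <- y - x %*% b                           # residuals
  s2 <- sum(e^2) / (nrow(x) - ncol(x))        # residual variance
  se <- sqrt(diag(s2 * xtx.inv))              # standard errors
  cbind(estimate  = as.vector(b),
        std.error = se,
        t.value   = as.vector(b) / se)
}

# Use it on stand-in data (an assumption)
x <- cbind(1, mtcars$wt, mtcars$hp)
ols.out <- ols(mtcars$mpg, x)
ols.out

# Check against the standard lm function
lm.out <- summary(lm(mpg ~ wt + hp, data = mtcars))$coefficients
lm.out
```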

## Favorite Resources

Invaluable Resources online

- The R manuals: http://cran.rproject.org/manuals.html
- Fox's slides: http://socserv.mcmaster.ca/jfox/Courses/Rcourse/index.html
- Faraway's book: http://cran.rproject.org/doc/contrib/FarawayPRA.pdf
- Andersen's ICPSR lectures using R: http://socserv.mcmaster.ca/andersen/icpsr.html
- Arai's guide: http://people.su.se/~ma/R_intro/
- UCLA notes: http://www.ats.ucla.edu/stat/SPLUS/default.htm
- Keele's intro guide: http://www.polisci.ohiostate.edu/faculty/lkeele/RIntro.pdf

Great R books

- Verzani's book: http://www.amazon.com/UsingIntroductoryStatisticsJohnVerzani/dp/1584884509
- Maindonald and Braun's book: http://www.amazon.com/DataAnalysisGraphicsUsingR/dp/0521813360

