
Data analysis and regression in Stata

This handout shows how the weekly beer sales series might be analyzed with Stata (the software
package now used for teaching stats at Kellogg), for purposes of comparing its modeling tools and ease
of use to those of FSBForecast. To analyze the weekly beer sales series, the first step is to import the
data from the Excel file. Any statistical software package can import Excel files easily.

The dialog box for importing the Excel file offers the option of reading the variable names from the first
row, which is also standard. So, the same data file that worked for FSBForecast will work here.

In order to be able to use time transformation options later on, it is necessary to declare the variables to
be time series. To do this, don't go to the Data menu, which is where most data definition operations
are performed. Instead, go to the Statistics menu and look under the time series options there.

For simplicity, let's just use the week number as the generic time index:

This executes the following command, which is shown in the output window:
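The screenshot of that command is not reproduced in this copy. As a rough sketch, declaring a weekly index from the command line might look like the following. The variable names week, price, and cases and the simulated values are stand-ins invented for illustration, not the handout's data.

```stata
* Simulated stand-in for the imported spreadsheet (hypothetical names and values)
clear
set obs 52
set seed 12345
generate week  = _n
generate price = 14 + runiform()
generate cases = exp(20 - 5.3*ln(price) + rnormal(0, 0.2))

* Declare week as the time index so that time-series operators become available
tsset week
```

Once the index is declared, operators such as L. (lag) and D. (difference) can be applied to the other variables.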

Under the hood this is a command-language program, as are SPSS and SAS. Choosing options from the
menu causes the appropriate code to be generated and executed. Most serious users of programs like
Stata write their code directly rather than letting a menu system do it for them.

The numerical results of your analysis will be written to the output window along with the code that
created them, in the form of a single scrolling log file. Before proceeding with your analysis, you need
to open and assign a name to the log file for your session, so that you can save your results later:

The log file is a plain text file that contains the output that you see scrolling by while doing your analysis.
It contains only the text output, not the graphs. It can be opened and edited later with Microsoft Word
or other text editing software, and you can use the "append" option to reopen the log file and add
more analysis to it later. If you open it in Microsoft Word, you should change the font size to 8 points to
avoid wrapping the lines and use a fixed-width font such as Courier so that table contents will line up.
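For readers who prefer typing commands, the log workflow described above can be sketched as follows. The file name beer_analysis.log is a hypothetical placeholder, and the display lines only stand in for real analysis output.

```stata
* Close any log that is already open, then start a plain-text log
capture log close
log using "beer_analysis.log", text replace

display "output produced here is copied into the log"

log close

* Later: reopen the same file and add more output to the end of it
log using "beer_analysis.log", text append
display "output from a second session"
log close
```

The text option requests a plain-text log rather than Stata's formatted SMCL log, which is what makes the file readable in Word or a text editor.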

Here is Kellogg's custom menu for their core statistics class, which can be loaded by typing the "do"
statement shown in the command window at the very bottom of the screen:

The "univariate statistics" command provides summary statistics of some or all variables:

If you choose the standard summary statistics report, which shows preselected statistics for all
variables in the file, you must click through this screen:

You then get the following output:

Apparently there is no way around the default abbreviation of variable names on some reports. Best to
use short names (8 characters or less)!

In a custom univariate statistics report, you can choose a subset of variables to analyze, and you can
choose up to 8 stats by a sequence of steps in which you use separate pull-down menus:

Here is the output of this particular custom analysis, which shows 4 stats for 6 variables:
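The output screenshots are not reproduced here. As an illustration only, built-in commands that produce comparable reports are sketched below; whether the Kellogg menu calls these same commands is an assumption, and the variables are simulated stand-ins.

```stata
* Simulated stand-in data (hypothetical names and values)
clear
set obs 52
set seed 12345
generate week  = _n
generate price = 14 + runiform()
generate cases = exp(20 - 5.3*ln(price) + rnormal(0, 0.2))

* Standard report: preselected statistics for every variable in memory
summarize

* Custom report: chosen statistics for a chosen subset of variables
tabstat cases price, statistics(mean sd min max) columns(statistics)
```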

How to generate a correlation matrix:

This command opens a dialog box in which you can choose a list of variables by clicking on them. As
each one is clicked, it is added to the list in the window, which is typical of all procedures in Stata that
operate on multiple variables.

This 6-variable correlation matrix is small enough to fit in the Stata output window. A larger matrix
would be broken into pieces when it was displayed, and I do not think it would be possible to copy it to a
spreadsheet or other document where you could see it as a single triangular array. Again, it is best to
use short variable names.
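A command-line version of the correlation step might look like the sketch below, again on simulated stand-in variables with hypothetical names. The correlate command leaves the matrix in r(C), so it can be copied into a named matrix and listed with a chosen format.

```stata
* Simulated stand-in data (hypothetical names and values)
clear
set obs 52
set seed 12345
generate week  = _n
generate price = 14 + runiform()
generate cases = exp(20 - 5.3*ln(price) + rnormal(0, 0.2))

* Correlation matrix for a list of variables
correlate cases price week

* Keep the matrix for later use and list it with three decimals
matrix C = r(C)
matrix list C, format(%6.3f)
```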

How to get a single scatterplot:

The graph is displayed in a separate graph window that opens up:

The main graphics menu has other plot options, including a scatterplot matrix:

The new graph replaces the old one in the graph window. You can't get multiple open graph windows
(i.e., more than one graph visible at a time) using only menu commands. You can do it by writing code,
though. The graph that is currently in the window can be directly copied and pasted to other
documents like this Word file. This is a nicely formatted plot, although (as with a correlation matrix) you
have to read across and down to determine which variables are shown in a given plot, as well as their
axis scale numbers. Axis scale numbers are provided on each separate scatterplot in the matrix in
FSBForecast, along with the correlations and their squared values or regression slope coefficients.
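One way to keep more than one graph open by writing code is to give each graph its own name, as in this minimal sketch. The graph names g_scatter and g_matrix and the simulated variables are hypothetical.

```stata
* Simulated stand-in data (hypothetical names and values)
clear
set obs 52
set seed 12345
generate week  = _n
generate price = 14 + runiform()
generate cases = exp(20 - 5.3*ln(price) + rnormal(0, 0.2))

* A single scatterplot, stored under its own name
scatter cases price, name(g_scatter, replace)

* A scatterplot matrix (lower triangle only), stored under a second name
graph matrix cases price week, half name(g_matrix, replace)
```

Because the two graphs have different names, the second does not overwrite the first, and both remain available in memory.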

How to plot 2 time series on the same chart by using the "time series graphs" option on the main
graphics menu:

Here you can try to judge how the peaks and valleys in the two series line up:

[Figure: time series plot of CASES CANS_30PK and PRICE CANS_30PK against Week, each series on its
own vertical axis.]
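A two-axis time series plot of this kind can be sketched in code as follows. This is not necessarily the command the menu generates, and the variables are simulated stand-ins with hypothetical names.

```stata
* Simulated stand-in data (hypothetical names and values)
clear
set obs 52
set seed 12345
generate week  = _n
generate price = 14 + runiform()
generate cases = exp(20 - 5.3*ln(price) + rnormal(0, 0.2))
tsset week

* Each series gets its own vertical axis so that both are readable
twoway (tsline cases, yaxis(1)) (tsline price, yaxis(2))
```

Putting the series on separate axes matters here because sales and price are on very different scales.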

The multiple regression procedure:

Select the dependent and independent variables from pull-down lists (best to use short variable names
if you want to see the full name of the dependent variable here):

Here is the regression output that you get by default:


You need to run some separate procedures to get additional model stats and charts:

[Figure: residual-versus-fitted plot. Vertical axis: Residuals; horizontal axis: Fitted values.]

The residual-vs.-predicted plot is the only residual plot that is included on this menu. Other types of
residual plots can be generated from the Graphics menu. (More about this later.)
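In command form, the regression and its residual-versus-fitted plot can be sketched with two built-in commands. The model below is fitted to simulated stand-in data with hypothetical variable names, so its coefficients have nothing to do with the beer sales results.

```stata
* Simulated stand-in data (hypothetical names and values)
clear
set obs 52
set seed 12345
generate week  = _n
generate price = 14 + runiform()
generate cases = exp(20 - 5.3*ln(price) + rnormal(0, 0.2))

* Regress the dependent variable on the independent variable(s)
regress cases price

* Residual-versus-fitted plot for the most recent regression
rvfplot, yline(0)
```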

The "Residuals, outliers, and influential observations" command stores the residuals and standardized
residuals and a few other stats (leverage, Cook's D) on the data worksheet. Before clicking through its
dialog box you might want to edit the new variable names if you are saving the results of different
models in the same file, e.g., Model_1_residuals.
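The same quantities can be saved from the command line with predict after a regression, as sketched below. The m1_ prefix is a hypothetical naming convention for keeping different models apart, and the data are simulated stand-ins.

```stata
* Simulated stand-in data (hypothetical names and values)
clear
set obs 52
set seed 12345
generate week  = _n
generate price = 14 + runiform()
generate cases = exp(20 - 5.3*ln(price) + rnormal(0, 0.2))

quietly regress cases price

* Save diagnostics under model-specific names
predict m1_resid,  residuals
predict m1_rstd,   rstandard
predict m1_lev,    leverage
predict m1_cooksd, cooksd

list week m1_resid m1_rstd m1_lev m1_cooksd in 1/5
```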

The Jarque-Bera normality test is a test that is based only on the skewness and kurtosis coefficients of
the residuals, unlike the Anderson-Darling or Kolmogorov-Smirnov tests, which are based on the entire
cumulative distribution.

The result of this test is written back to the output window and looks like this:

In this case the p-value of 1.1e-05 indicates that the normality hypothesis is strongly rejected.
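For reference, the Jarque-Bera statistic is usually defined as JB = (n/6) * (S^2 + (K - 3)^2 / 4), where S is the skewness and K the kurtosis of the residuals, and it is compared with a chi-square distribution with 2 degrees of freedom. How the Kellogg menu computes it is not shown here, so the following is only a sketch of that textbook formula on simulated stand-in data with hypothetical names.

```stata
* Simulated stand-in data (hypothetical names and values)
clear
set obs 52
set seed 12345
generate week  = _n
generate price = 14 + runiform()
generate cases = exp(20 - 5.3*ln(price) + rnormal(0, 0.2))

quietly regress cases price
predict m1_resid, residuals

* Skewness and kurtosis of the residuals
quietly summarize m1_resid, detail

* Textbook Jarque-Bera statistic and its chi-square(2) p-value
scalar jb   = r(N)/6 * (r(skewness)^2 + (r(kurtosis) - 3)^2/4)
scalar jb_p = chi2tail(2, jb)
display "Jarque-Bera = " jb "   p-value = " jb_p
```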

For a closer look at the error distribution, a residual histogram chart can be generated by going to the
main graphics menu and using the histogram procedure with the newly created residuals variable as
the input:

Here is the dialog box for the histogram procedure:

The default bin specifications are pretty coarse; here is what you get in this case. You might want to
fiddle with the bin settings in order to show more fine detail.
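In code, the number of bins can be set directly with the bin() option, as in this minimal sketch. The choice of 15 bins is arbitrary, and the residuals come from a regression on simulated stand-in data with hypothetical names.

```stata
* Simulated stand-in data (hypothetical names and values)
clear
set obs 52
set seed 12345
generate week  = _n
generate price = 14 + runiform()
generate cases = exp(20 - 5.3*ln(price) + rnormal(0, 0.2))

quietly regress cases price
predict m1_resid, residuals

* Histogram with a chosen number of bins and a normal curve overlaid
histogram m1_resid, bin(15) frequency normal
```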


The Durbin-Watson statistic (which requires executing another separate menu command in order to be
reported) is a test for autocorrelation at lag 1 in the residuals. Click through this dialog box:

The resulting report of the DW stat looks like this. The user needs to know whether a value of 1.4 is
significant, because no p-value is reported for it.

In general the DW statistic is approximately equal to 2(1 - r1), where r1 is the lag-1 residual autocorrelation.
Its range is from 0 to 4 and it approaches 2 when the lag-1 autocorrelation approaches 0. The lag-1
residual autocorrelation for this model is 0.281, which is the test statistic that is shown in FSBForecast
output. FSBForecast also reports the residual autocorrelations for lags 2 through 7 and lag 12, along
with their 95% significance limits. The 95% significance limit for testing the lag-k autocorrelation is
2/SQRT(n - k), where n is the sample size, which works out to be 0.280 for lag 1 in this model, so this is a
borderline significant value.
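The relationships in this paragraph can be checked with a short sketch. It runs on simulated stand-in data with hypothetical names, so its DW value will not match the 1.4 reported above; only the last two lines use the figures quoted in the text, and the sample size of 52 used there is inferred from the 0.280 limit rather than stated in the handout.

```stata
* Simulated stand-in data (hypothetical names and values)
clear
set obs 52
set seed 12345
generate week  = _n
generate price = 14 + runiform()
generate cases = exp(20 - 5.3*ln(price) + rnormal(0, 0.2))
tsset week

quietly regress cases price

* Built-in Durbin-Watson report for the most recent regression
estat dwatson

* Residual autocorrelations at lags 1 through 7
predict m1_resid, residuals
corrgram m1_resid, lags(7)
scalar r1 = r(ac1)

* Approximation to DW and the rough 95% limit for lag 1
display "2(1 - r1)       = " 2*(1 - r1)
display "2/sqrt(n - 1)   = " 2/sqrt(e(N) - 1)

* Arithmetic with the figures quoted in the text (n = 52 is an inference)
display 2*(1 - 0.281)
display 2/sqrt(52 - 1)
```

With r1 = 0.281 the approximation gives 1.438, which is consistent with the reported DW value of 1.4.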

Forecasting from a regression model:

Stata generates forecasts in a manner similar to FSBForecast. If values for the independent variable(s)
are typed or already exist in additional rows at the bottom of the data set, the "Prediction, using most
recent regression, using the data sheet" command will cause the corresponding forecasts to be
computed. Here some additional price values were typed in at the bottom of the data worksheet:


The "prediction using data sheet" option was chosen next:

You then have to click through this box in which you can edit the names of the forecast statistics that
will be shown. Here too you might want to edit the names of the variables to be created, if you
are fitting more than 1 model using the same file:

The forecasts and their standard errors and confidence limits are written back to the data spreadsheet,
but not plotted:
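The spreadsheet screenshot is not reproduced here. A command-line sketch of the same idea is given below: add rows that contain only values of the independent variable, fit the model, and use predict to fill in forecasts and forecast standard errors. The price values 14.25 and 14.75, the variable names, and the data are all invented for illustration, and the menu command may compute its limits differently.

```stata
* Simulated stand-in data (hypothetical names and values)
clear
set obs 52
set seed 12345
generate week  = _n
generate price = 14 + runiform()
generate cases = exp(20 - 5.3*ln(price) + rnormal(0, 0.2))
tsset week

* Add two future weeks and type in assumed prices for them
tsappend, add(2)
replace price = 14.25 in 53
replace price = 14.75 in 54

* Rows with missing cases are left out of the fit automatically
quietly regress cases price

* Point forecasts, forecast standard errors, and 95% limits
predict m1_fcst, xb
predict m1_fcst_se, stdf
generate m1_fcst_lo = m1_fcst - invttail(e(df_r), 0.025)*m1_fcst_se
generate m1_fcst_hi = m1_fcst + invttail(e(df_r), 0.025)*m1_fcst_se

list week price m1_fcst m1_fcst_se m1_fcst_lo m1_fcst_hi in 53/54
```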


Mathematical transformations can be applied with the "create new variable" option on the Data menu:

For example, here you can apply the natural log transformation. You need to type a name for the new
variable and then you need to type the formula to compute it. There is also an expression builder
dialog box that you can bring up by hitting the "create" button. It shows the list of available
transformations and can type their names for you if you click on them.

However, there are no time transformations on the function list: no lag or difference or difference of
natural log or percentage difference relative to previous observations.

You can get a "difference of natural log from one period ago" transformation by going back to the top
level of the "create new variable" procedure and then applying the difference operator, whose syntax is
"D.", to the logged variable. Here again you must type a name for the new variable as well as the
mathematical expression that creates it.

The code that was executed by the "create variable" procedure in the process of applying the log and
difference transformations is shown below, along with the results of fitting a regression model to the
logged and differenced variables. This is the same as Model 3 that was fitted to the same data set with
FSBForecast.
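The original code listing and output are not reproduced in this copy. The steps described can be sketched as follows, using hypothetical variable names and simulated stand-in data, so the fitted coefficients are not those of Model 3.

```stata
* Simulated stand-in data (hypothetical names and values)
clear
set obs 52
set seed 12345
generate week  = _n
generate price = 14 + runiform()
generate cases = exp(20 - 5.3*ln(price) + rnormal(0, 0.2))

* The D. operator requires a declared time index
tsset week

* Natural log transformation
generate ln_cases = ln(cases)
generate ln_price = ln(price)

* Difference of the logged variables from one period ago
generate d_ln_cases = D.ln_cases
generate d_ln_price = D.ln_price

* Regression on the logged and differenced variables
regress d_ln_cases d_ln_price
```

The first observation of each differenced variable is missing because there is no earlier week to subtract, so the regression uses one fewer observation than the data set contains.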

From here you can go on and construct the rest of the regression output one piece at a time as shown
earlier.