21/6/2016

Complete guide to create a Time Series Forecast (with Codes in Python)

Introduction
Time Series (referred to as TS from now on) is considered to be one of the lesser-known skills in the analytics space (even I had little clue about it a couple of days back). But as you know, our inaugural Mini Hackathon is based on it, so I set myself on a journey to learn the basic steps for solving a Time Series problem, and here I am sharing the same with you. These will definitely help you get a decent model in our hackathon today.

Before going through this article, I highly recommend reading A Complete Tutorial on Time Series Modeling in R, which is like a prequel to this article. It focuses on fundamental concepts and is based on R, whereas I will focus on using these concepts to solve a problem end-to-end, along with code in Python. Many resources exist for TS in R but very few for Python, so I'll be using Python in this article.

Our journey will go through the following steps:
1. What makes Time Series Special?
2. Loading and Handling Time Series in Pandas
3. How to Check Stationarity of a Time Series?
4. How to make a Time Series Stationary?
5. Forecasting a Time Series

http://www.analyticsvidhya.com/blog/2016/02/timeseriesforecastingcodespython/

1. What makes Time Series Special?
As the name suggests, TS is a collection of data points collected at constant time intervals. These are analyzed to determine the long-term trend so as to forecast the future or perform some other form of analysis. But what makes a TS different from, say, a regular regression problem? There are 2 things:
1. It is time dependent. So the basic assumption of a linear regression model that the observations are independent doesn't hold in this case.
2. Along with an increasing or decreasing trend, most TS have some form of seasonality, i.e. variations specific to a particular time frame. For example, if you look at the sales of a woolen jacket over time, you will invariably find higher sales in winter seasons.
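
To make these two points concrete, here is a minimal sketch on synthetic data. The series, its parameters and the variable names are made up for illustration and are not taken from the AirPassengers data. It builds a monthly series with a trend and a 12-month cycle, then measures how strongly each observation is correlated with earlier ones, which is exactly what the independence assumption rules out.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series (hypothetical): linear trend + yearly cycle + noise
rng = np.random.default_rng(0)
t = np.arange(120)
index = pd.date_range('2000-01-01', periods=120, freq='MS')
values = 0.5 * t + 10 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 1, size=120)
synthetic = pd.Series(values, index=index)

# Correlation of the series with itself shifted by 1 and by 12 months
lag1_corr = synthetic.autocorr(lag=1)
lag12_corr = synthetic.autocorr(lag=12)
print('lag-1 autocorrelation:', round(lag1_corr, 3))
print('lag-12 autocorrelation:', round(lag12_corr, 3))
```

Both correlations come out close to 1 for a series like this, so neighbouring observations are far from independent.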

Because of the inherent properties of a TS, there are various steps involved in analyzing it. These are discussed in detail below. Let's start by loading a TS object in Python. We'll be using the popular AirPassengers dataset which can be downloaded here.

Please note that the aim of this article is to familiarize you with the various techniques used for TS in general. The example considered here is just for illustration, and I will focus on covering a breadth of topics rather than on making a very accurate forecast.

2. Loading and Handling Time Series in Pandas
Pandas has dedicated libraries for handling TS objects, particularly the datetime64[ns] class, which stores time information and allows us to perform some operations really fast. Let's start by firing up the required libraries:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import rcParams

# IPython/Jupyter magic: render plots inline (omit when running as a plain script)
%matplotlib inline
rcParams['figure.figsize'] = 15, 6

Now, we can load the data set and look at some initial rows and data types of the columns:

data = pd.read_csv('AirPassengers.csv')
print(data.head())
print('\nData Types:')
print(data.dtypes)

The data contains a particular month and the number of passengers travelling in that month. But this is still not read as a TS object, as the data types are object and int. In order to read the data as a time series, we have to pass special arguments to the read_csv command:

from datetime import datetime

# Parse 'YYYY-MM' strings into datetime objects
dateparse = lambda dates: datetime.strptime(dates, '%Y-%m')
# Note: date_parser is deprecated since pandas 2.0, where date_format='%Y-%m' can be passed instead
data = pd.read_csv('AirPassengers.csv', parse_dates=['Month'], index_col='Month', date_parser=dateparse)
print(data.head())

Let's understand the arguments one by one:
1. parse_dates: This specifies the column which contains the date-time information. As we saw above, the column name is Month.
2. index_col: A key idea behind using Pandas for TS data is that the index has to be the variable depicting date-time information. So this argument tells pandas to use the Month column as index.
3. date_parser: This specifies a function which converts an input string into a datetime variable. By default Pandas reads data in the format YYYY-MM-DD HH:MM:SS. If the data is not in this format, the format has to be defined manually. Something similar to the dateparse function defined here can be used for this purpose.
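
As a self-contained illustration of what these arguments achieve, the sketch below reads a small inline sample standing in for AirPassengers.csv (assumed here to have a Month column in YYYY-MM format and a #Passengers column) and converts the index with pd.to_datetime and an explicit format. This is an alternative route to the same result rather than the call used above; sample_csv and sample are names introduced only for this example.

```python
import io
import pandas as pd

# Inline stand-in for the first rows of the file (assumed layout)
sample_csv = io.StringIO(
    "Month,#Passengers\n"
    "1949-01,112\n"
    "1949-02,118\n"
    "1949-03,132\n"
)

# index_col makes Month the index; to_datetime parses it with an explicit format
sample = pd.read_csv(sample_csv, index_col='Month')
sample.index = pd.to_datetime(sample.index, format='%Y-%m')

print(sample.index.dtype)   # a datetime64 dtype
print(sample.head())
```

Converting the index after loading keeps the parsing step explicit, which can be handy when the date format in a file is unusual.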

Now we can see that the data has a time object as index and #Passengers as the column. We can cross-check the data type of the index with the following command:

data.index

Notice the dtype='datetime64[ns]', which confirms that it is a datetime object. As a personal preference, I would convert the column into a Series object to avoid referring to column names every time I use the TS. Please feel free to use it as a dataframe if that works better for you.

ts = data['#Passengers']
ts.head(10)

Before going further, I'll discuss some indexing techniques for TS data. Let's start by selecting a particular value in the Series object. This can be done in the following 2 ways:

# 1. Specify the index as a string constant:
ts['1949-01-01']

# 2. Import the datetime library and use the 'datetime' function:
from datetime import datetime
ts[datetime(1949, 1, 1)]

Both would return the value 112, which can also be confirmed from the previous output. Suppose we want all the data up to May 1949. This can be done in 2 ways:

# 1. Specify the entire range:
ts['1949-01-01':'1949-05-01']

# 2. Use ':' if one of the indices is at the ends:
ts[:'1949-05-01']

Both would yield the following output:

There are 2 things to note here:
1. Unlike numeric indexing, the end index is included here. For instance, if we index a list as a[:5] then it would return the values at indices [0,1,2,3,4]. But here the index '1949-05-01' was included in the output.
2. The indices have to be sorted for ranges to work. If you randomly shuffle the index, this won't work.
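
The difference between label-based and position-based slicing can be checked on a small made-up series. The sketch below uses hypothetical data and names (demo, by_label, by_position) purely to show the behaviour.

```python
import pandas as pd

# Hypothetical 8-month series, used only to show the slicing behaviour
idx = pd.date_range('1949-01-01', periods=8, freq='MS')
demo = pd.Series(range(8), index=idx)

# Label-based slice: both end points are included, so January to May gives 5 rows
by_label = demo['1949-01-01':'1949-05-01']

# Position-based slice: the end position is excluded, so 0:4 gives 4 rows
by_position = demo.iloc[0:4]

print(len(by_label), by_label.index[-1])
print(len(by_position), by_position.index[-1])
```

If an index is not sorted, calling sort_index() on the Series first restores the range behaviour.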

Consider another instance where you need all the values of the year 1949. This can be done as:

ts['1949']

The month part was omitted. Similarly, if you want all days of a particular month, the day part can be omitted.

Now, let's move on to analyzing the TS.

3. How to Check Stationarity of a Time Series?
A TS is said to be stationary if its statistical properties such as mean and variance remain constant over time. But why is it important? Most TS models work on the assumption that the TS is stationary. Intuitively, we can say that if a TS has a particular behaviour over time, there is a very high probability that it will follow the same in the future. Also, the theories related to stationary series are more mature and easier to implement as compared to non-stationary series.

Stationarity is defined using very strict criteria. However, for practical purposes we can assume the series to be stationary if it has constant statistical properties over time, i.e. the following:
1. constant mean
2. constant variance
3. an autocovariance that does not depend on time.
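
A rough way to get a feel for these criteria is to compare the mean and variance of the first and second half of a series. The sketch below does this for two made-up series, one pure noise and one with a rising trend. It is an informal check on synthetic data, not a formal test, and half_stats is a helper written only for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 400

# Hypothetical examples: white noise (stationary) and noise around a rising line
noise = pd.Series(rng.normal(0, 1, size=n))
trending = pd.Series(0.1 * np.arange(n) + rng.normal(0, 1, size=n))

def half_stats(series):
    """Return (mean, variance) for the first and second half of a series."""
    half = len(series) // 2
    first, second = series.iloc[:half], series.iloc[half:]
    return (first.mean(), first.var()), (second.mean(), second.var())

noise_first, noise_second = half_stats(noise)
trend_first, trend_second = half_stats(trending)
print('white noise  :', noise_first, noise_second)
print('with a trend :', trend_first, trend_second)
```

For the noise series the two halves have similar means, while for the trending series the mean of the second half is clearly higher, which violates the constant-mean criterion.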

I'll skip the details as it is very clearly defined in this article. Let's move on to the ways of testing stationarity. First and foremost is to simply plot the data and analyze it visually. The data can be plotted using the following command:

plt.plot(ts)

It is clearly evident that there is an overall increasing trend in the data along with some seasonal variations. However, it might not always be possible to make such visual inferences (we'll see such cases later). So, more formally, we can check stationarity using the following:
1. Plotting Rolling Statistics: We can plot the moving average or moving variance and see if it varies with time. By moving average/variance I mean that at any instant t, we'll take the average/variance of the last year, i.e. the last 12 months. But again, this is more of a visual technique.
2. Dickey-Fuller Test: This is one of the statistical tests for checking stationarity. Here the null hypothesis is that the TS is non-stationary. The test results comprise a Test Statistic and some Critical Values for different confidence levels. If the Test Statistic is less than the Critical Value, we can reject the null hypothesis and say that the series is stationary. Refer to this article for details.
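
The decision rule of the Dickey-Fuller test can be written down as a tiny helper. In the sketch below, rejected_levels is a hypothetical function written for this example, and the numbers are illustrative only; they are not output from the article's data. The dictionary is shaped like the critical values returned by adfuller, with keys such as '1%', '5%' and '10%'.

```python
def rejected_levels(test_statistic, critical_values):
    """Return the significance levels at which non-stationarity is rejected.

    The comparison is on signed values: the null hypothesis is rejected
    only when the test statistic lies below (is more negative than) the
    critical value.
    """
    return [level for level, critical in critical_values.items()
            if test_statistic < critical]

# Illustrative numbers only (not output from the article's data)
critical_values = {'1%': -3.48, '5%': -2.88, '10%': -2.58}

print(rejected_levels(0.82, critical_values))
print(rejected_levels(-3.16, critical_values))
```

A positive statistic such as 0.82 rejects nothing, while -3.16 falls below the 5% and 10% critical values but not the 1% one.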

These concepts might not sound very intuitive at this point. I recommend going through the prequel article. If you're interested in some theoretical statistics, you can refer to Introduction to Time Series and Forecasting by Brockwell and Davis. The book is a bit stats-heavy, but if you have the skill to read between the lines, you can understand the concepts and tangentially touch the statistics.

Back to checking stationarity: we'll be using the rolling statistics plots along with the Dickey-Fuller test results a lot, so I have defined a function which takes a TS as input and generates them for us. Please note that I've plotted standard deviation instead of variance to keep the unit similar to the mean.

from statsmodels.tsa.stattools import adfuller

def test_stationarity(timeseries):
    # Determine rolling statistics over a 12-month window
    # (pd.rolling_mean / pd.rolling_std were removed from pandas; use .rolling())
    rolmean = timeseries.rolling(window=12).mean()
    rolstd = timeseries.rolling(window=12).std()

    # Plot rolling statistics:
    orig = plt.plot(timeseries, color='blue', label='Original')
    mean = plt.plot(rolmean, color='red', label='Rolling Mean')
    std = plt.plot(rolstd, color='black', label='Rolling Std')
    plt.legend(loc='best')
    plt.title('Rolling Mean & Standard Deviation')
    plt.show(block=False)

    # Perform Dickey-Fuller test:
    print('Results of Dickey-Fuller Test:')
    dftest = adfuller(timeseries, autolag='AIC')
    dfoutput = pd.Series(dftest[0:4], index=['Test Statistic', 'p-value', '#Lags Used', 'Number of Observations Used'])
    for key, value in dftest[4].items():
        dfoutput['Critical Value (%s)' % key] = value
    print(dfoutput)

The code is pretty straightforward. Please feel free to discuss the code in the comments if you face challenges in grasping it.

Let's run it for our input series:

test_stationarity(ts)

Though the variation in standard deviation is small, the mean is clearly increasing with time, so this is not a stationary series. Also, the test statistic is way more than the critical values. Note that the signed values should be compared and not the absolute values.

Next, we'll discuss the techniques that can be used to take this TS towards stationarity.

4. How to make a Time Series Stationary?
Though the stationarity assumption is made in many TS models, almost no practical time series is stationary. So statisticians have figured out ways to make series stationary, which we'll discuss now. Actually, it's almost impossible to make a series perfectly stationary, but we try to take it as close as possible.

Let's understand what makes a TS non-stationary. There are 2 major reasons behind the non-stationarity of a TS:
1. Trend: varying mean over time. For example, in this case we saw that on average, the number of passengers was growing over time.
2. Seasonality: variations at specific time frames, e.g. people might have a tendency to buy cars in a particular month because of pay increments or festivals.

The underlying principle is to model or estimate the trend and seasonality in the series and remove those from the series to get a stationary series. Then statistical forecasting techniques can be implemented on this series. The final step would be to convert the forecasted values into the original scale by applying the trend and seasonality constraints back.
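
The "transform, model, convert back" idea can be sketched on a tiny made-up series. The example assumes a log transform followed by first-order differencing as the transformation (one possible choice among those discussed in this article) and shows only the inversion step; no forecasting model is fitted here, and all values and names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical values, used only to show that the transformation is invertible
original = pd.Series([100.0, 110.0, 125.0, 150.0, 160.0])

# Forward: log transform, then first-order differencing
logged = np.log(original)
differenced = logged.diff().dropna()

# Backward: cumulative sum of the differences, add back the first log value,
# then undo the log with exp
restored_log = differenced.cumsum() + logged.iloc[0]
restored = np.exp(restored_log)

print(restored.round(2).tolist())
```

The restored values match the original series from its second point onwards, since the first point is consumed by the differencing step and serves as the starting value.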
Note: I'll be discussing a number of methods. Some might work well in this case and others might not. But the idea is to get the hang of all the methods and not focus on just the problem at hand.

Let's start by working on the trend part.

Estimating & Eliminating Trend
One of the first tricks to reduce trend can be transformation. For example, in this case we can clearly see that there is a significant positive trend. So we can apply a transformation which penalizes higher values more than smaller values. This can be taking a log, square root, cube root, etc. Let's take a log transform here for simplicity:

ts_log=np.log(ts)
plt.plot(ts_log)
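
Why a log transform penalizes higher values more can be seen on a made-up series that doubles at every step. The numbers below are hypothetical and chosen only to make the arithmetic obvious.

```python
import numpy as np

# Hypothetical series that doubles at every step
values = np.array([100.0, 200.0, 400.0, 800.0])

raw_steps = np.diff(values)          # steps grow: 100, 200, 400
log_steps = np.diff(np.log(values))  # every step equals log(2)

print(raw_steps)
print(log_steps)
```

On the original scale the steps keep growing, whereas on the log scale every step has the same size, so multiplicative growth turns into a straight line.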

In this simpler case, it is easy to see a forward trend in the data. But it's not very intuitive in the presence of noise. So we can use some techniques to estimate or model this trend and then remove it from the series. There can be many ways of doing it, and some of the most commonly used are:
1. Aggregation: taking the average for a time period, like monthly/weekly averages
2. Smoothing: taking rolling averages
3. Polynomial Fitting: fitting a regression model
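
Aggregation and polynomial fitting are not covered further in this article, so here is a minimal sketch of both on a synthetic series. The data, the choice of yearly averages and the choice of a straight-line fit via np.polyfit are assumptions made for this example only.

```python
import numpy as np
import pandas as pd

# Hypothetical 4-year monthly series: linear trend plus a yearly cycle
t = np.arange(48)
index = pd.date_range('2000-01-01', periods=48, freq='MS')
series = pd.Series(2.0 + 0.05 * t + 0.1 * np.sin(2 * np.pi * t / 12), index=index)

# Aggregation: one average per calendar year
yearly_mean = series.groupby(series.index.year).mean()

# Polynomial fitting: a degree-1 polynomial (straight line) as trend estimate
slope, intercept = np.polyfit(t, series.values, deg=1)
fitted_trend = pd.Series(intercept + slope * t, index=index)
detrended = series - fitted_trend

print(yearly_mean)
print('estimated slope per month:', round(slope, 4))
```

The yearly averages rise steadily and the fitted slope lands close to the 0.05 used to build the series; subtracting the fitted line leaves mostly the seasonal cycle.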

I will discuss smoothing here, and you should try other techniques as well, which might work out for other problems. Smoothing refers to taking rolling estimates, i.e. considering the past few instances. There can be various ways, but I will discuss two of those here.

Moving average
In this approach, we take the average of k consecutive values depending on the frequency of the time series. Here we can take the average over the past 1 year, i.e. the last 12 values. Pandas has specific functions defined for determining rolling statistics.

# pd.rolling_mean was removed from pandas; the .rolling() method replaces it
moving_avg = ts_log.rolling(12).mean()
plt.plot(ts_log)
plt.plot(moving_avg, color='red')

The red line shows the rolling mean. Let's subtract this from the original series. Note that since we are taking the average of the last 12 values, the rolling mean is not defined for the first 11 values. This can be observed as:

ts_log_moving_avg_diff = ts_log - moving_avg
ts_log_moving_avg_diff.head(12)

Notice the first 11 being NaN. Let's drop these NaN values and check the plots to test stationarity.

ts_log_moving_avg_diff.dropna(inplace=True)
test_stationarity(ts_log_moving_avg_diff)

This looks like a much better series. The rolling values appear to be varying slightly but there is no specific trend. Also, the test statistic is smaller than the 5% critical value, so we can say with 95% confidence that this is a stationary series.

However, a drawback of this particular approach is that the time period has to be strictly defined. In this case we can take yearly averages, but in complex situations like forecasting a stock price, it's difficult to come up with a number. So we take a weighted moving average where more recent values are given a higher weight. There can be many techniques for assigning weights. A popular one is the exponentially weighted moving average, where weights are assigned to all the previous values with a decay factor. Find details here. This can be implemented in Pandas as:

# pd.ewma was removed from pandas; the .ewm() method replaces it
expwighted_avg = ts_log.ewm(halflife=12).mean()
plt.plot(ts_log)
plt.plot(expwighted_avg, color='red')

Note that here the parameter halflife is used to define the amount of exponential decay. This is just an assumption here and would depend largely on the business domain. Other parameters like span and center of mass can also be used to define decay, which are discussed in the link shared above.
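
The decay parameters are different ways of specifying the same smoothing factor. The sketch below relies on the relations given in the pandas ewm documentation (alpha = 1 - exp(-ln(2) / halflife) and alpha = 1 / (1 + com)) and checks them on a short made-up series; the data and variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical short series, used only to compare the parameterisations
series = pd.Series([1.0, 3.0, 2.0, 5.0, 4.0, 6.0])

halflife = 12
# Smoothing factor implied by the half-life, per the pandas ewm documentation
alpha = 1 - np.exp(-np.log(2) / halflife)

by_halflife = series.ewm(halflife=halflife).mean()
by_alpha = series.ewm(alpha=alpha).mean()

# The same alpha expressed as center of mass: alpha = 1 / (1 + com)
com = 1 / alpha - 1
by_com = series.ewm(com=com).mean()

print(by_halflife.tolist())
print(by_alpha.tolist())
```

All three calls give the same smoothed series, and the result has a value from the very first observation onwards, unlike a fixed-window rolling mean.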
Now, let's remove this from the series and check stationarity:

ts_log_ewma_diff = ts_log - expwighted_avg
test_stationarity(ts_log_ewma_diff)

This TS has even smaller variations in mean and standard deviation. Also, the test statistic is smaller than the 1% critical value, which is better than the previous case. Note that in this case there will be no missing values, as all values from the start are given weights. So it'll work even with no previous values.

Eliminating Trend and Seasonality
The simple trend reduction techniques discussed before don't work in all cases, particularly the ones with high seasonality. Let's discuss two ways of removing trend and seasonality:
1. Differencing: taking the difference with a particular time lag
2. Decomposition: modeling both trend and seasonality and removing them from the model.

Differencing

One of the most common methods of dealing with both trend and seasonality is differencing. In this technique, we take the difference of the observation at a particular instant with that at the previous instant. This mostly works well in improving stationarity. First order differencing can be done in Pandas as:

ts_log_diff = ts_log - ts_log.shift()
plt.plot(ts_log_diff)

This appears to have reduced the trend considerably. Let's verify using our plots:

ts_log_diff.dropna(inplace=True)
test_stationarity(ts_log_diff)

We can see that the mean and std have small variations with time. Also, the Dickey-Fuller test statistic is less than the 10% critical value, thus the TS is stationary with 90% confidence. We can also take second or third order differences, which might get even better results in certain applications. I leave it to you to try them out.
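
As a starting point for trying higher-order differences, the sketch below applies first and second order differencing to two made-up series using the pandas diff() method, which computes the same thing as subtracting the shifted series. The data is synthetic and noise-free so that the effect is easy to see.

```python
import numpy as np
import pandas as pd

t = np.arange(10, dtype=float)

# Hypothetical series: one with a linear trend, one with a quadratic trend
linear = pd.Series(3.0 + 2.0 * t)
quadratic = pd.Series(1.0 + 0.5 * t ** 2)

# First-order difference, equivalent to series - series.shift()
first_diff = linear.diff().dropna()

# Second-order difference: difference the differenced series again
second_diff = quadratic.diff().diff().dropna()

print(first_diff.tolist())
print(second_diff.tolist())
```

One round of differencing flattens a linear trend to a constant, while a quadratic trend needs two rounds.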

Decomposing
In this approach, both trend and seasonality are modeled separately and the remaining part of the series is returned. I'll skip the statistics and come to the results:

from statsmodels.tsa.seasonal import seasonal_decompose
decomposition=seasonal_decompose(ts_log)

trend=decomposition.trend
seasonal=decomposition.seasonal
