
ARIMA Model Python Example — Time Series Forecasting

Cory Maklin
May 26 · 8 min read

The ability to make predictions based upon historical observations creates a competitive advantage. For example, an organization that can better forecast the sales quantities of a product will be in a more favourable position to optimize inventory levels. This can result in increased liquidity of the organization's cash reserves, a decrease in working capital and improved customer satisfaction by reducing the backlog of orders.

In the domain of machine learning, there's a specific collection of methods and techniques particularly well suited for predicting the value of a dependent variable over time. In this article, we'll cover one of them: the AutoRegressive Integrated Moving Average (ARIMA) model.

We refer to a series of data points indexed (or graphed) in time order as a time series. A time series can be broken down into 3 components.

Trend: Upward & downward movement of the data over a large period of time (e.g. house price appreciation)

Seasonality: Seasonal variance (e.g. an increase in demand for ice cream during summer)

Noise: Spikes & troughs at random intervals


Before applying any statistical model to a time series, we want to ensure it's stationary.

If a time series is stationary and has exhibited a particular behaviour over a given time interval, then it is safe to assume that it will exhibit the same behaviour at some later point in time. Most statistical modelling methods assume or require the time series to be stationary.

Code
The statsmodels library provides a suite of functions for working with time
series data.

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.arima_model import ARIMA
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()

We'll be working with a dataset that contains the number of airline passengers in each month.

df = pd.read_csv('air_passengers.csv', parse_dates = ['Month'], index_col = ['Month'])

df.head()

plt.xlabel('Date')
plt.ylabel('Number of air passengers')
plt.plot(df)

As mentioned previously, before we can build a model, we must ensure that the time series is stationary. There are two primary ways to determine whether a given time series is stationary.

Rolling Statistics: Plot the rolling mean and rolling standard deviation. The time series is stationary if they remain constant with time (with the naked eye, look to see if the lines are flat and parallel to the x-axis).

Augmented Dickey-Fuller Test: The time series is considered stationary if the p-value is low (so we can reject the null hypothesis that a unit root is present) and the ADF statistic is below the critical values at the 1%, 5% and 10% significance levels.

For those who don’t understand the difference between average and rolling
average, a 10-day rolling average would average out the closing prices for
the first 10 days as the first data point. The next data point would drop the
earliest price, add the price on day 11 and take the average, and so on as
shown below.
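As a minimal illustration of the rolling mean (a toy series, separate from the dataset above), a window-3 rolling average over a short pandas Series looks like this:

```python
import pandas as pd

# a 3-point rolling mean: each value averages the current and two previous points
s = pd.Series([1, 2, 3, 4, 5], dtype=float)
rolling = s.rolling(window=3).mean()

print(rolling.tolist())  # [nan, nan, 2.0, 3.0, 4.0]
```

The first two entries are NaN because a full window of 3 observations is not yet available.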

rolling_mean = df.rolling(window = 12).mean()
rolling_std = df.rolling(window = 12).std()

plt.plot(df, color = 'blue', label = 'Original')
plt.plot(rolling_mean, color = 'red', label = 'Rolling Mean')
plt.plot(rolling_std, color = 'black', label = 'Rolling Std')
plt.legend(loc = 'best')
plt.title('Rolling Mean & Rolling Standard Deviation')
plt.show()

As you can see, the rolling mean and rolling standard deviation increase with time. Therefore, we can conclude that the time series is not stationary.

result = adfuller(df['Passengers'])
print('ADF Statistic: {}'.format(result[0]))
print('p-value: {}'.format(result[1]))
print('Critical Values:')
for key, value in result[4].items():
    print('\t{}: {}'.format(key, value))

The ADF statistic is well above the critical values and the p-value is greater than the threshold (0.05). Thus, we can conclude that the time series is not stationary.

Taking the log of the dependent variable is a simple way of lowering the rate at which the rolling mean increases.

df_log = np.log(df)
plt.plot(df_log)
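To see why the log helps (a toy example, separate from the dataset), applying `np.log` to a purely geometric series turns exponential growth into a straight line with constant increments:

```python
import numpy as np

# a series that doubles every step: exponential growth
series = 2.0 ** np.arange(10)

# after the log transform, consecutive differences are constant (log 2)
log_diffs = np.diff(np.log(series))
print(np.allclose(log_diffs, np.log(2)))  # True
```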

Let’s create a function to run the two tests which determine whether a given
time series is stationary.

def get_stationarity(timeseries):

    # rolling statistics
    rolling_mean = timeseries.rolling(window=12).mean()
    rolling_std = timeseries.rolling(window=12).std()

    # rolling statistics plot
    original = plt.plot(timeseries, color='blue', label='Original')
    mean = plt.plot(rolling_mean, color='red', label='Rolling Mean')
    std = plt.plot(rolling_std, color='black', label='Rolling Std')
    plt.legend(loc='best')
    plt.title('Rolling Mean & Standard Deviation')
    plt.show(block=False)

    # Dickey–Fuller test
    result = adfuller(timeseries['Passengers'])
    print('ADF Statistic: {}'.format(result[0]))
    print('p-value: {}'.format(result[1]))
    print('Critical Values:')
    for key, value in result[4].items():
        print('\t{}: {}'.format(key, value))

There are multiple transformations that we can apply to a time series to render it stationary. For instance, we can subtract the rolling mean.

rolling_mean = df_log.rolling(window=12).mean()
df_log_minus_mean = df_log - rolling_mean
df_log_minus_mean.dropna(inplace=True)

get_stationarity(df_log_minus_mean)

As we can see, after subtracting the mean, the rolling mean and standard
deviation are approximately horizontal. The p-value is below the threshold
of 0.05 and the ADF Statistic is close to the critical values. Therefore, the
time series is stationary.

Applying exponential decay is another way of transforming a time series such that it is stationary.

rolling_mean_exp_decay = df_log.ewm(halflife=12, min_periods=0, adjust=True).mean()
df_log_exp_decay = df_log - rolling_mean_exp_decay
df_log_exp_decay.dropna(inplace=True)

get_stationarity(df_log_exp_decay)

Exponential decay performed worse than subtracting the rolling mean. However, it is still more stationary than the original.

Let's try one more method to determine whether an even better solution exists. When applying time shifting, we subtract from every point the point that preceded it:

null, (x1−x0), (x2−x1), (x3−x2), (x4−x3), …, (xn−xn−1)
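The subtraction above can be sketched with `Series.shift` on a short toy series, and is equivalent to pandas' built-in first difference:

```python
import pandas as pd

s = pd.Series([10.0, 12.0, 15.0, 11.0])

# subtract from each point the point that preceded it
shifted_diff = s - s.shift()   # first element is NaN (no predecessor)
print(shifted_diff.tolist())   # [nan, 2.0, 3.0, -4.0]

# identical to the built-in first difference
print(shifted_diff.equals(s.diff()))  # True
```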

df_log_shift = df_log - df_log.shift()
df_log_shift.dropna(inplace=True)

get_stationarity(df_log_shift)

Time shifting performed worse than subtracting the rolling mean. However,
it is still more stationary than the original.

AutoRegressive Model (AR)
Autoregressive models operate under the premise that past values have an effect on current values. AR models are commonly used in analyzing nature, economics, and other time-varying processes. As long as the assumption holds, we can build a linear regression model that attempts to predict the value of the dependent variable today, given the values it had on previous days.

The order of the AR model corresponds to the number of days incorporated in the formula.
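Since an AR model is just a linear regression on lagged values, the idea can be sketched with plain NumPy (a toy simulation with made-up coefficients, not part of the original walkthrough): simulate an AR(2) process and recover its coefficients by least squares.

```python
import numpy as np

rng = np.random.default_rng(42)

# simulate x_t = 0.6 * x_{t-1} - 0.2 * x_{t-2} + noise
phi1, phi2 = 0.6, -0.2
x = np.zeros(2000)
for t in range(2, len(x)):
    x[t] = phi1 * x[t - 1] + phi2 * x[t - 2] + rng.standard_normal()

# AR(2) as linear regression: regress x_t on [x_{t-1}, x_{t-2}]
X = np.column_stack([x[1:-1], x[:-2]])
y = x[2:]
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)  # close to [0.6, -0.2]
```

With enough data, the least-squares estimates land near the true coefficients, which is exactly what fitting an AR model does under the hood.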

Moving Average Model (MA)
The moving average model assumes that the value of the dependent variable on the current day depends on the previous days' error terms. The formula can be expressed as:

Xt = μ + εt + θ1 εt−1 + θ2 εt−2 + … + θq εt−q

where μ is the mean of the series, θ1, …, θq are the parameters of the model and εt, εt−1, …, εt−q are white noise error terms. The value of q is called the order of the MA model.
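A minimal MA(1) simulation in NumPy (a toy sketch with assumed values μ = 10 and θ = 0.7, not taken from the article) makes the structure concrete: each observation is the mean plus today's shock plus a fraction of yesterday's shock.

```python
import numpy as np

rng = np.random.default_rng(0)

mu, theta = 10.0, 0.7
eps = rng.standard_normal(5000)  # white noise error terms

# MA(1): x_t = mu + eps_t + theta * eps_{t-1}
x = mu + eps[1:] + theta * eps[:-1]

print(round(x.mean(), 2))  # sample mean is close to mu
```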

Auto Regressive Moving Average (ARMA)
The ARMA model is simply the combination of the AR and MA models.

AutoRegressive Integrated Moving Average Model (ARIMA)
The ARIMA (aka Box-Jenkins) model adds differencing to an ARMA model. Differencing subtracts the previous value from the current one and can be used to transform a time series into one that's stationary. For example, first-order differencing addresses linear trends and employs the transformation zi = yi − yi−1. Second-order differencing addresses quadratic trends and employs a first-order difference of a first-order difference, namely zi = (yi − yi−1) − (yi−1 − yi−2), and so on.
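The claim about differencing orders can be checked with `np.diff` on synthetic trends (a standalone sketch): a first difference flattens a linear trend to a constant, and a second difference does the same for a quadratic trend.

```python
import numpy as np

t = np.arange(10, dtype=float)

linear = 3 * t + 5    # linear trend
quadratic = t ** 2    # quadratic trend

print(np.diff(linear))          # first difference: all 3s
print(np.diff(quadratic, n=2))  # second difference: all 2s
```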

Three integers (p, d, q) are typically used to parametrize ARIMA models.

p: number of autoregressive terms (AR order)

d: number of nonseasonal differences (differencing order)

q: number of moving-average terms (MA order)

Auto Correlation Function (ACF)
The ACF measures the correlation between the observations at the current point in time and the observations at all previous points in time. We can use the ACF to determine the optimal number of MA terms. The number of terms determines the order of the model.

Partial Auto Correlation Function (PACF)
As the name implies, PACF is a subset of ACF. PACF expresses the correlation between observations made at two points in time while accounting for any influence from the data points in between. We can use the PACF to determine the optimal number of terms to use in the AR model. The number of terms determines the order of the model.

Let's take a look at an example. Recall that the PACF can be used to figure out the best order of the AR model. The horizontal blue dashed lines represent the significance thresholds. The vertical lines represent the ACF and PACF values at each lag. Only the vertical lines that exceed the horizontal lines are considered significant.

Thus, we'd use the preceding two days in the autoregression equation.

Recall that the ACF can be used to figure out the best order of the MA model.

Thus, we'd only use yesterday in the moving average equation.


Going back to our example, we can create and fit an ARIMA model with AR
of order 2, differencing of order 1 and MA of order 2.

decomposition = seasonal_decompose(df_log)
model = ARIMA(df_log, order=(2,1,2))
results = model.fit(disp=-1)
plt.plot(df_log_shift)
plt.plot(results.fittedvalues, color='red')

Then, we can see how the model compares to the original time series.

predictions_ARIMA_diff = pd.Series(results.fittedvalues, copy=True)
predictions_ARIMA_diff_cumsum = predictions_ARIMA_diff.cumsum()
predictions_ARIMA_log = pd.Series(df_log['Passengers'].iloc[0], index=df_log.index)
predictions_ARIMA_log = predictions_ARIMA_log.add(predictions_ARIMA_diff_cumsum, fill_value=0)
predictions_ARIMA = np.exp(predictions_ARIMA_log)
plt.plot(df)
plt.plot(predictions_ARIMA)

Given that we have data for every month going back 12 years and want to forecast the number of passengers for the next 10 years, we use (12 × 12) + (12 × 10) = 264.

results.plot_predict(1, 264)

Final Thoughts
In the domain of machine learning, there is a collection of techniques for manipulating and interpreting variables that depend on time. These include ARIMA, which can remove the trend component in order to accurately predict future values.
