
EXPLORATORY DATA ANALYSIS FOR TIME SERIES DATA IN PYTHON AND R

For Module 2.3, we will be putting into practice the theories discussed in Module 2.1 EDA Part 1 Time Series Data using
quarterly sales data of a French retail company from Prof. Rob J. Hyndman and Prof. George Athanasopoulos’s book
Forecasting: Principles and Practice (3rd ed).

The objectives for this lecture are:


1. Perform exploratory data analysis (EDA) of a time series
2. Explain time series behavior in qualitative and quantitative terms to build intuition for model selection
3. Identify candidate models and possible model parameters that can be used based on the findings in the EDA

PART 1: Overview of EDA for Time Series and Modeling Implications

These are some of the questions I ask at various stages of model building, more so during EDA.

1. Are there any null values? How many? How do we impute null data?
▪ If NaNs/#NAs are present, first identify why these data points are missing and whether they mean anything. Missing
values can be filled by interpolation, forward-fill, or backward-fill depending on the data and context (a minimal
pandas sketch follows this item). Also make sure null doesn’t mean zero; a true zero is acceptable but has modeling
implications.
▪ It is important to understand how the data was generated (manual entry, ERP system) and what transformations
or assumptions were made before imputing the missing data.
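
A minimal pandas sketch of these imputation options (the frame, dates, and values here are hypothetical, not the course data):

import pandas as pd
import numpy as np

idx = pd.date_range("2020-03-31", periods=6, freq="Q")
df = pd.DataFrame({"Sales": [10.0, np.nan, 14.0, np.nan, 18.0, 20.0]}, index=idx)

#Option 1: linear interpolation between neighboring observations
filled_interp = df["Sales"].interpolate(method="linear")

#Option 2: carry the last known value forward (forward-fill)
filled_ffill = df["Sales"].ffill()

#Option 3: fill from the next known value backward (backward-fill)
filled_bfill = df["Sales"].bfill()

Which option is appropriate depends on why the values are missing: interpolation assumes the series moves smoothly between observations, while forward-fill assumes the last known value persists.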

2. Are there any duplicate dates?


▪ Determine why there are duplicate dates. If applicable, remove the duplicates by aggregating the data (e.g.,
average or sum); a short sketch follows this item.
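
A minimal pandas sketch of de-duplicating by aggregation (the values are hypothetical; choose sum or mean based on what a duplicate row represents):

import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime(["2020-03-31", "2020-03-31", "2020-06-30"]),
    "Sales": [100.0, 120.0, 150.0],
})

#One row per date; sum() if duplicates are partial amounts, mean() if they are repeated readings
deduped = df.groupby("Date")["Sales"].sum()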

3. Visually, is there a trend, seasonality, or both?


▪ This will help us properly tune model parameters. Classic statistical or regression models like the ARIMA and
GARCH families of models rely on correctly setting the parameters for trend (autoregression lag) and
seasonality.
▪ If multiple seasonalities are present, ARIMA cannot be used. TBATS, harmonic regression, or supervised
machine learning models are more appropriate in that case.
▪ Frequency of seasonality is important. ARIMA is not appropriate for high-frequency data such as sub-daily, hourly,
daily, or even weekly series. Consider using TBATS or deep learning models instead.

4. If seasonality is present, how does the data change from season to season for each period? What does the
seasonality look like if a trend is also present?
▪ Does it increase/decrease with the trend? Changes slowly, rapidly, or remains constant? These are important
observations to be made, especially for regression and gradient boosted models. This is also key if any data
preprocessing will be needed.
▪ Decompose the series into level, trend, seasonality, and residual error. Observe the patterns in the
decomposed series.
▪ Is the trend constant, or is it growing/slowing linearly, exponentially, or following some other non-linear function?
▪ Is the seasonal pattern repetitive?
▪ How is the seasonal pattern changing relative to the level? If it is constant relative to the level, the seasonality is
additive; if it is growing relative to the level, it is multiplicative. See the screenshot from
https://anomaly.io/seasonal-trend-decomposition-in-r/index.html for a more visual explanation, and the short
decomposition sketch after Figure 1.
Figure 1. Visual comparison of additive and multiplicative seasonality
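
Below is a small sketch of decomposing a toy quarterly series both ways with statsmodels. The series is synthetic (its seasonal swing grows with the level, so the multiplicative model describes it better); it is not the French retail data:

import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

idx = pd.date_range("2012-03-31", periods=24, freq="Q")
level = np.linspace(100, 300, 24)
season = np.tile([0.9, 1.0, 1.2, 0.95], 6)
y = pd.Series(level * season, index=idx)

#Additive: seasonal component assumed constant relative to the level
add_dec = seasonal_decompose(y, model="additive")
add_dec.plot();

#Multiplicative: seasonal component assumed to scale with the level
mult_dec = seasonal_decompose(y, model="multiplicative")
mult_dec.plot();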

5. Are there any potential outliers in the dataset?


▪ Outliers are defined as observations that differ significantly from the rest of the dataset. Identify if the dataset
is susceptible to outliers/spikes.
▪ In time series forecasting we want to treat outliers before the data is used for fitting the model. Classic
regression models like the ARIMA class are not robust to outliers and can produce erroneous forecasts. Data
should be analyzed while keeping seasonality in mind (e.g., a sudden spike could be due to seasonal behavior
and not an outlier).
▪ Here are a few ways to treat outliers (a short sketch follows this item):
o Winsorizing: Use a box-and-whiskers view and clip the values that fall below the 5th percentile or above
the 95th percentile of the entire dataset.
o Use the residual standard deviation and compare against observed values (preferred, but it cannot be done a priori).
o Use a moving average to smooth spikes/troughs (iterative and not entirely robust).
▪ Supervised machine learning models can handle outliers better, but too many outliers can affect the stability
or forecasting power of the model (i.e., instead of producing reliable forecasts h time steps ahead, the model
can reliably predict only three to five steps ahead).
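
A short sketch of the first and third options on a hypothetical sales series (percentile clipping, i.e., winsorizing, and moving-average smoothing):

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
sales = pd.Series(rng.normal(100, 10, 60))
sales.iloc[10] = 300  #inject an artificial spike

#Winsorizing: clip values below the 5th and above the 95th percentile
lower, upper = sales.quantile(0.05), sales.quantile(0.95)
sales_winsorized = sales.clip(lower=lower, upper=upper)

#Moving average: smooth spikes/troughs (iterative and not entirely robust)
sales_smoothed = sales.rolling(window=5, center=True, min_periods=1).mean()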

6. Is the data stationary?


▪ In the most intuitive sense, stationarity means that the statistical properties of a process generating a time
series do not change over time. It does not mean that the series does not change over time, just that the way
it changes does not itself change over time.
▪ As a loose mathematical analogy, a stationary time series is like a linear function, not a constant one: the value
of a linear function changes as x grows, but the way it changes remains constant; it has a constant slope, one
value that captures that rate of change.
▪ Stationarity is integral for ARIMA and GARCH classes of regression models.
▪ With supervised machine learning models, stationarity is not necessarily an issue in theory as more powerful
algorithms can, and should be able to, comb through non-stationary data. The issue arises with more complex
feature engineering and parameter (or even hyperparameter) tuning. NOTE: We will discuss this further in
Module 3.
▪ During EDA, there are two ways to check for stationarity. A quick, visual way is to plot the rolling average of the
series on top of the time series line chart. If the average line slopes upward/downward, then the
series is not stationary.
▪ A more robust way to check for stationarity is to run the Augmented Dickey-Fuller (ADF) test in Python or R (a
short sketch follows this item). If the test statistic is less than the critical value and the p-value < 0.05, reject the
null hypothesis (H0) that the series has a unit root, meaning the series is stationary.
▪ If the series is not stationary, perform differencing in Python or R. Differencing can help stabilize the mean of
a time series by removing changes in the level of a time series, and therefore eliminating (or reducing) trend
and seasonality.
▪ Occasionally the differenced data will not appear to be stationary and it may be necessary to difference the
data a second time to obtain a stationary series. In Python or R, set degree of differencing in the code equal
to two (you’ll see it in practice in the Python and R code blocks later). In practice, it is almost never necessary
to go beyond second-order differences.
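
A compact sketch of the ADF check and differencing on a synthetic trending series (the full worked example on the French retail data appears in the Python and R code blocks later):

import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller

idx = pd.date_range("2012-03-31", periods=24, freq="Q")
y = pd.Series(np.linspace(10, 50, 24) + np.random.default_rng(0).normal(0, 1, 24), index=idx)

p_value = adfuller(y)[1]
print("stationary" if p_value < 0.05 else "non-stationary")

#First-order differencing; check again, and only difference a second time if still non-stationary
y_diff1 = y.diff(1).dropna()
if adfuller(y_diff1)[1] >= 0.05:
    y_diff2 = y_diff1.diff(1).dropna()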

7. What is the distribution of the data? Will we need to perform any transformations?
▪ While normally distributed data is not a requirement for forecasting and does not necessarily improve point
forecast accuracy, it can help stabilize the variance and narrow the prediction interval.
▪ Plot the histogram of the entire dataset and for each time period (i.e., each year) to gauge kurtosis/peakedness
and skewness of the data. It can also help compare different periods and track trends over time.
▪ If the data is severely skewed, consider normalizing the data before training the model.
▪ Common transformations for positively-skewed data include square root, cube root, and logarithm.
▪ Common transformations for negatively-skewed data include power transformations (e.g., squaring the values).
▪ A more powerful data transformation is the Box-Cox transformation, which includes both log and power
transformations. It depends on the parameter λ and is defined as follows (a short sketch follows this item):

w_t = \begin{cases} \log y_t, & \text{if } \lambda = 0; \\ (y_t^{\lambda} - 1)/\lambda, & \text{otherwise.} \end{cases}

▪ NOTE: We will discuss Box-Cox transformation in Module 3.
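
A short sketch of the Box-Cox transformation with scipy on a hypothetical positively-skewed (strictly positive) series; passing no lmbda lets scipy estimate λ by maximum likelihood:

import numpy as np
from scipy.stats import boxcox

y = np.random.default_rng(1).lognormal(mean=3, sigma=0.5, size=100)

w, lam = boxcox(y)          #transformed values and the fitted λ
w_log = boxcox(y, lmbda=0)  #λ = 0 reduces to the natural log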

8. What does the ACF and PACF plot tell us?


▪ The autocorrelation analysis helps detect patterns and check for randomness. It is especially important when
you intend to use classic regression models for forecasting because it helps to determine its parameters.
▪ Recall that the number of intervals between the two observations is the lag. For example, the lag between the
current and previous observation is one. If you go back one more interval, the lag is two, and so on.
▪ In mathematical terms, the observations y_t and y_(t−k) are separated by k time units, where k is the lag. This lag
can be days, quarters, or years depending on the nature of the data. When k = 1, you’re assessing adjacent
observations. For each lag, there is a correlation.
▪ IMPORTANT: Run the plotting codes for ACF and PACF for already differenced data.
▪ Are any lags significant (i.e., lines that go outside the confidence interval bands)?
▪ Here’s a quick guide to interpreting ACF plots; we’ll discuss them in more detail in Module 3:
o The autocorrelation function declines to near zero rapidly for a stationary time series. In contrast, the
ACF drops slowly for a non-stationary time series. In this chart for a stationary time series, notice how
the autocorrelations decline to non-significant levels quickly.

Figure 2. ACF plot for stationary time series at 5% level of significance

o When trends are present in a time series, shorter lags typically have large positive correlations because
observations closer in time tend to have similar values. The correlations taper off slowly as the lags
increase. In this ACF plot, the autocorrelations decline slowly. The first five lags are significant. In
practice we usually set k = 1 for a time series that exhibits a strong trend like this one. A sinusoidal
(wave) pattern that converges to 0, possibly alternating between negative and positive signs, also signifies a strong
trend.

Figure 3. ACF plot of time series that exhibits strong trend at 5% level of significance

o When seasonal patterns are present, the autocorrelations are larger for lags at multiples of the
seasonal frequency than for other lags. When a time series has both a trend and seasonality, the ACF
plot displays a mixture of both effects. Notice how you can see the wavy correlations for the seasonal
pattern and the slowly diminishing lags of a trend.

Figure 4. ACF plot of time series that exhibits BOTH trend and seasonality at 5% level of significance

▪ A PACF plot is only appropriate if you will develop a classic autoregressive model like ARIMA. Typically, you will
use the ACF to determine whether an autoregressive model is appropriate. If it is, you then use the PACF to
help you choose the model terms. We will discuss this in more detail in Module 3. A short sketch for inspecting
the ACF/PACF values numerically follows this item.
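
If you want to look at the autocorrelation values numerically rather than only on the plots, here is a small sketch (on white-noise data, so no lag should come out significant):

import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import acf, pacf

y = pd.Series(np.random.default_rng(7).normal(size=60))

acf_vals = acf(y, nlags=8)
pacf_vals = pacf(y, nlags=8)

#Approximate 95% significance bound for a white-noise series
bound = 1.96 / np.sqrt(len(y))
significant_lags = [k for k, r in enumerate(acf_vals) if k > 0 and abs(r) > bound]
print(significant_lags)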

9. Are there any structural breaks in the time series?


▪ Structural breaks are abrupt changes in the trend. Gather more information about the sudden changes. If the
breaks are valid, classic regression models WILL NOT WORK. Dynamic regression, supervised machine learning
models, or deep learning models will be more suitable.
▪ Identify the possible reasons for the structural break (e.g., change in macroeconomic environment, price
change, change in customer preferences, regulatory intervention, business restructuring, etc.). Note that a structural
change persists for some time, while outliers do not.
Figure 5. Visual comparison of an outlier and structural break in time series data

PART 2: Time Series EDA in Python

We will need the following libraries for this exercise. These are the basic and most commonly used packages and libraries for a
data analytics project in Python.

a. Pandas
b. Numpy
c. Matplotlib
d. Seaborn
e. Altair
f. Statsmodels
g. Scipy

CODE BLOCK #1: Importing libraries

import pandas as pd
import numpy as np

#Plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns
import altair as alt
plt.style.use('seaborn-white')
%matplotlib inline

#Statistics libraries
import statsmodels.api as sm
from scipy import stats
from scipy.stats import anderson
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import month_plot, seasonal_plot, plot_acf, plot_pacf, quarter_plot
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.stats.diagnostic import acorr_ljungbox as ljung

#From nimbusml.timeseries import SsaForecaster


from statsmodels.tsa.statespace.tools import diff as diff
from scipy import signal
from scipy.stats import shapiro
from scipy.stats import boxcox
from sklearn.preprocessing import StandardScaler
CODE BLOCK #2: Importing the Data

path = 'https://raw.githubusercontent.com/pawarbi/datasets/master/timeseries/ts_frenchretail.csv'

#Sales numbers are in thousands, so I am dividing by 1000 to make it easier to work with
#numbers, especially squared errors
data = pd.read_csv(path, parse_dates=True, index_col="Date").div(1_000)

data.index.freq='Q'

data.head()

The data.head() line in the code is a sanity check to ensure that you imported the correct file. It shows the first five
rows of your dataset.

In Python, it is important to set the frequency of the time series data. You can set data.index.freq to any of the following:
- Quarterly = ‘Q’
- Monthly = ‘M’
- Weekly = ‘W’

CODE BLOCK #3: Train-Test Split

#Split into train and test


train = data.iloc[:-6]
test = data.iloc[-6:]

#forecast horizon
h = 6
train_length = len(train)

print('train_length:',train_length, '\n test_length:', len(test) )

Before analyzing the data, first split it into train and test (hold-out) sets for model evaluation. All EDA and model fitting/selection
should be done using the train data only. Never look at the test set until later to avoid any bias. Typically, we want at least 3-4
full seasonal cycles for training, and the test set should be no shorter than the forecast horizon.

In this example, we have 24 observations of quarterly data, which means 6 full cycles (24/4). Our forecast horizon is 6
quarters, so the test set must hold at least the last 6 observations. We will use the first 18 observations (4.5 seasonal cycles)
for training and keep the last 6 for validation. Unlike a typical train-test split, we cannot shuffle the data before splitting, so
that the temporal structure is retained.
CODE BLOCK #4: Data integrity check

#Any missing data?


print("missing_data:", train.isna().sum())
print("unique dates:", train.index.nunique())

Observations:
a. No null values
b. Length of the train set is 18 and we have 18 unique dates/quarters, so there are no duplicate dates
c. Each quarter has 1 observation, so no duplicates and time series is continuous

CODE BLOCK #5: Plotting the time series

#Create line chart for Training data. Index is reset to use Date column.
train_chart = alt.Chart(train.reset_index()).mark_line(point=True).encode(
    x='Date',
    y='Sales',
    tooltip=['Date', 'Sales'])

#Create rolling mean
rolling_mean = alt.Chart(train.reset_index()).mark_trail(
    color='orange',
    size=1
).transform_window(
    rolling_mean='mean(Sales)',
    frame=[-4, 4]
).encode(
    x='Date:T',
    y='rolling_mean:Q',
    size='Sales'
)

#Add data labels
text = train_chart.mark_text(
    align='left',
    baseline='top',
    dx=5  #Moves text to right so it doesn't appear on top of the bar
).encode(
    text='Sales:Q'
)

#Add zoom-in/out
scales = alt.selection_interval(bind='scales')

#Combine everything
(train_chart + rolling_mean + text).properties(
    width=600,
    title="French Retail Sales & 4Q Rolling mean ( in '000)").add_selection(
    scales
)
Matplotlib and Seaborn create static charts, whereas plots created with Altair are interactive. You can hover over the data
points to read tooltips. The most useful feature is the ability to zoom in and out. Time series data can be dense, and it’s
important to check each time period to get insights. With zoom-in/out, this can be done interactively without slicing the time
series.

NOTE: You can choose to perform your time series plotting in Excel instead if you find writing the code too confusing.

Observations:
a. Sales have gone up each year from 2012-2016. A positive trend is present.
b. Typically, sales go up from Q1 to Q3, peak in Q3, then drop in Q4. This is a seasonal pattern. The model should capture
seasonality and trend.
c. The series is not stationary, as seen from the upward-sloping rolling mean line.

CODE BLOCK #6: Creating box plots

#Box plot to see distribution of sales in each year


fig, ax = plt.subplots(figsize = (12,8))
sns.boxplot(data=train, x=train.index.year, y = 'Sales', ax = ax, boxprops=dict(alpha=.3));
sns.swarmplot(data=train, x=train.index.year, y = 'Sales');
Observations:
a. Overall the data looks clean; no observations fall outside the box-plot whiskers.
b. No structural breaks.
c. Notice that the length of the box in the box plot increases from 2012-2015. This shows that the mean and variance
are increasing, and we might need to transform the data to stabilize the variance. HINT: We will difference the
series later.

CODE BLOCK #7: Density plot of time series and each year

#Distribution plot of each year compared with overall distribution


sns.distplot(train, label='Train', hist=False, kde_kws={"color": "g", "lw": 3, "label":
"Train", "shade": True})
sns.distplot(train.loc['2012'], label='2012', hist=False)
sns.distplot(train.loc['2013'], label='2013', hist=False)
sns.distplot(train.loc['2014'], label='2014', hist=False)
sns.distplot(train.loc['2015'], label='2015', hist=False);

Observations:
a. Density plot shows data looks normally distributed. Bimodal distribution in quarters is because of small sample
size. Peaks shift right from 2012 to 2015 indicating increase in average. No structural breaks found.
b. Distribution becomes fatter as the years progress, indicating higher spread/variation (as seen in boxplot too).

CODE BLOCK #8: Decomposing time series components

#Decompose time series components


decompose = seasonal_decompose(train["Sales"])
decompose.plot();
plt.rcParams['figure.figsize'] = (12, 8);

Always use a semicolon (;) after plotting results from statsmodels; otherwise the plot may be printed twice in a notebook.
Also, by default the statsmodels plots are small and do not take a figsize argument. Use plt.rcParams to define
the plot size.
Observations:
a. The trend is more than linear; notice a small upward take-off after 2013-07. Also notice that the trend is projecting upward.
b. The seasonal pattern is consistent.
c. Residuals are whatever is left after fitting the trend and seasonal components to the observed data. It's the
component we cannot explain. We want the residuals to be i.i.d. (i.e., uncorrelated). If the residuals have a pattern,
it means there is still some structural information left to be captured. For example, here the residuals show a
wavy pattern, which is not good. We need to perform the Ljung-Box test to confirm whether they are i.i.d. as a group.
d. We do not want to see any recognizable patterns in the residuals (e.g., waves, upward/downward slope, funnel
pattern, etc.)

CODE BLOCK #9: Performing an initial Ljung-Box Test on residuals of decomposed time series

The Ljung-Box test is a statistical test that checks if autocorrelation exists in a time series. It uses the following hypotheses:

H0: The residuals are independently distributed.


Ha: The residuals are not independently distributed; they exhibit serial correlation.

Ideally, we would like to fail to reject the null hypothesis. That is, we would like to see the p-value of the test be greater
than 0.05 because this means the residuals for our time series model are independent, which is often an assumption we
make when creating a model.

#Perform initial Ljung-Box test of residuals of decomposed time series


sm.stats.acorr_ljungbox(decompose.resid.dropna(), lags=[1], return_df=True)

The first argument in this code is the series that we want to run the test on. decompose.resid.dropna()
refers to the residuals of the decompose result we created in Code Block #8, with the missing end points dropped.
The initial test run above shows that the residuals are uncorrelated. If the residuals were correlated, we could perform
transformations to see if that stabilizes the variance.

CODE BLOCK #10: Test for stationarity and difference the time series if needed

#Calculate ADF statistic


adf = adfuller(train["Sales"])[1]
print(f"p value:{adf.round(4)}", ", Series is Stationary" if adf <0.05 else ", Series is Non-Stationary")

#Differencing
de_trended = train["Sales"].diff(1).dropna()
adf2 = adfuller(de_trended)[1]
print(f"p value:{adf2}", ", Series is Stationary" if adf2 <0.05 else ", Series is Non-Stationary")

As suspected from our time series plots, the series is not stationary. The first p-value printed refers to the first ADF test
in the code block above. The second refers to the ADF test after we performed first-order differencing.

CODE BLOCK #11: ACF and PACF plots of the de-trended time series

#Plot ACF and PACF using statsmodels


plot_acf(de_trended);
plot_pacf(de_trended, lags=5);

As explained earlier, for EDA purposes we care more about the ACF plot than the PACF plot. The PACF plot allows us
to tune the parameters of a regression-based model, but is unnecessary for supervised machine learning models.
Note that for the PACF line in the code, we set lags equal to 5 because of the small sample size. The PACF plot is more
sensitive to sample size and will return an error if we do not specify a small enough number of lags. This means the PACF
will only test partial autocorrelations of up to 5 lags. For a sufficiently large sample, there is no need to specify the lag, as a
reasonable default is used.
Observations:
a. ACF plot shows autocorrelation coefficient is insignificant at all lag values (within the blue 95% CI band), except lag
1.
b. The ACF plot is sinusoidal which means the time series exhibits a strong trend.

CODE BLOCK #12: Check if dataset has a normal distribution

As mentioned above, a time series does not have to be Gaussian for accurate forecasting, but if the data is highly skewed
it can affect model selection and forecast uncertainty. In general, if the series is non-Gaussian, it should be normalized
through transformations before modeling. Checking normality will also help us decide whether to use regression-based
models, tree-based models, or neural network models later.

To visually check for normality, plotting a histogram against a density plot is ideal. You can also create a Q-Q plot to check
for normality. Points on the Normal Q-Q plot provide an indication of univariate normality of the dataset. If the data is
normally distributed, the points will fall on the 45-degree reference line. If the data is not normally distributed, the points
will deviate from the reference line.
#Distribution Plot
sns.distplot(train["Sales"]);

#Q-Q Plot
sm.qqplot(train["Sales"], fit=True, line='45');

#Perform Jarque-Bera Test


stats.jarque_bera(train["Sales"])

SignificanceResult(statistic=1.0509757294798883, pvalue=0.5912668357500077)

Observations:
a. Q-Q plot shows the data follows the 45-degree line very closely, deviates slightly in the left tail.
b. Density plot shows a distribution close to a perfectly normal curve.
c. Jarque-Bera Test shows the data is from a normal distribution. The Jarque-Bera test is a goodness-of-fit test that
determines whether sample data have skewness and kurtosis that matches a normal distribution. The null
hypothesis is a joint hypothesis of the skewness being zero and the excess kurtosis being zero. Since the p-value
above is not less than .05, we fail to reject the null hypothesis.
d. If the p-value is “small” – that is, if there is a low probability of sampling data from a normally distributed
population that produces such an extreme value of the statistic – this may be taken as evidence against the null
hypothesis in favor of the alternative: the data were not drawn from a normal distribution.
e. Note that the inverse is not true; that is, the test is not used to provide evidence for the null hypothesis.

PART 3: Time Series EDA in R or RStudio

While Python is more ideal for doing EDA, R has its pros, especially when it comes to statistical analysis. For most
time series analysis projects, we will need the following packages:

1. tidyverse
2. forecast
3. FinTS
4. tseries
5. urca

CODE BLOCK #1: Install and load packages

Similar to Python, we only install packages once on our personal devices. We need to load the packages, however,
every time we perform EDA.


When installing packages for the first time, you will be asked to select a CRAN mirror. You can select any of the mirror sites
listed. I normally choose Taiwan or Philippines.

#install packages
install.packages("tidyverse")
install.packages("forecast")
install.packages("FinTS")
install.packages("tseries")
install.packages("urca")

#load libraries
library(tidyverse)
library(forecast)
library(FinTS)
library(tseries)
library(zoo)
library(lubridate)
library(urca)

NOTE: There is no need to install the zoo and lubridate packages separately; they typically come along as dependencies of the packages above (tidyverse installs lubridate, for example), so we only need to load them.

CODE BLOCK #2: Importing the Data

#importing the data

path = 'https://raw.githubusercontent.com/pawarbi/datasets/master/timeseries/ts_frenchretail.csv'
data = read.csv(path)
head(data)

CODE BLOCK #3: Train-Test Split

In practice, a train-test split is usually done at around 70% train, 30% test. To mirror the Python code, we used a
75-25 split in R.

#splitting the dataset to train and test

train = head(data, 0.75*nrow(data))


test = tail(data, 0.25*nrow(data))

CODE BLOCK #4: Data integrity check


#check for missing values in the data and for duplicate dates

#Count of missing values in the Sales column


sum(is.na(data$Sales))

#Count of duplicate dates


sum(duplicated(data$Date))

CODE BLOCK #5: Plotting the time series

#Plotting the time series


train.new = train %>% mutate(Sales = Sales/1000)
ts_train = ts(train.new$Sales, start = c(2012,01), frequency = 4)

ts_train %>% autoplot()+


ggtitle("French Retail Sales") + xlab("Year-Quarter") + ylab("Retail Sales")

The first line of the code scales the Sales data to thousands ('000s); the second converts the dataframe into a ts object.

Here, you’ll notice that unlike in Python, where it’s easy to scale and interact with the plot, the default R chart is less
flexible to work with. Still, we can see in the R plot that the time series is trending upward and has seasonality.

CODE BLOCK #6: Decomposing time series components

#Decomposing time series components


decompose_train = decompose(ts_train, "multiplicative")

plot(as.ts(decompose_train$seasonal))
plot(as.ts(decompose_train$trend))
plot(as.ts(decompose_train$random))
plot(decompose_train)

Like the Python decomposition visuals, we can capture both the trend and seasonality of the retail sales time series using
R. Here, however, we need to specify whether the seasonality is additive or multiplicative. Refer back to the overview
above on how to identify additive and multiplicative seasonality.

CODE BLOCK #7: Performing an initial Ljung-Box Test on residuals of decomposed time series

#Initial Ljung-Box test of residuals of decomposed time series


Box.test(decompose_train$random, type = "Ljung-Box")

While we got a different p-value compared to when the test was done in Python, the conclusion is the same: we fail to
reject the null hypothesis, so the residuals of the decomposed time series are not autocorrelated.

CODE BLOCK #8: Test for stationarity and difference the time series if needed

#Initial ADF test


adf_initial = ur.df(ts_train, type = "none", selectlags = "AIC")
summary(adf_initial)

#Differencing the time series


ts_train_diff = diff(ts_train, differences = 1)

#Re-try of ADF test


adf_2 = ur.df(ts_train_diff, type ="none", selectlags = "AIC")
summary(adf_2)
In the initial ADF test, we fail to reject the null hypothesis that the series has a unit root or is not stationary because the
test statistic is greater than the critical value at 5% level of significance. This is also in agreement with the initial ADF test
we did in Python.

The second part of the code performs the differencing of the time series to transform it into a stationary series. As
in Python, we set the order of differencing equal to one. After re-running the ADF test, we can say that the
differenced time series is stationary.

The selectlags argument controls how the test selects the number of lags that minimizes the information
criterion. AIC stands for Akaike Information Criterion. We will discuss this further in Module 3.
CODE BLOCK #9: ACF and PACF plots of the de-trended time series

#ACF and PACF plots


acf(ts_train_diff)
pacf(ts_train_diff)

We got the same results in our ACF plot in R as we did in Python. There is a difference in our PACF plots, but for
the purposes of EDA, let us focus first on the ACF plots.

CODE BLOCK #10: Check if dataset has a normal distribution

#Plots for checking normality


par(mfrow = c(1,2))
qqnorm(ts_train)
qqline(ts_train, col = "steelblue", lwd = 2)
hist = hist(ts_train, col = "gray", xlab = "Sales", prob = TRUE)
lines(density(ts_train), col = "steelblue", lwd = 2)

#Initial Jarque-Bera test


jarque.bera.test(ts_train)

As in Python, using the R visuals we can see that the French retail sales time series looks normal. We also ran the Jarque-Bera
test, and since the p-value is greater than .05, we fail to reject the null hypothesis that the sample comes from a normal
distribution.
