07.06.2019
─
Janani Prakash
PGPBABI-Online
GreatLearning, Great Lakes Institute of Management
Project Objective
1. The data should be read as time series objects in R and plotted. The major features
of each series should be identified and the differences between the two series
described.
2. Any seasonal changes in the two series should be examined and then a formal
extraction of time series components should be done. The two series have to be
compared to check whether there is more variability in one season compared to the
others, whether seasonal variations are changing across years, etc.
3. Each series should be decomposed to extract trend and seasonality, if any, and the
best seasonality type has to be identified – additive or multiplicative. The seasonal
indices should be explained. The months in which sales are higher or lower should be
identified, along with any difference in the nature of demand for the two items.
4. The residuals from the two decomposition exercises have to be extracted and
checked to see whether they form a stationary series. A formal test for stationarity has
to be done, writing down the null and alternative hypotheses, and a conclusion has to
be drawn for each case.
5. Before the final forecast is undertaken, a few models have to be compared. The last
21 months should be used as a hold-out sample to fit a suitable exponential smoothing
model to the rest of the data, and the MAPE has to be calculated. The values of α, β
and γ have to be calculated. For the same hold-out period the forecast by
decomposition has to be compared and its MAPE computed.
6. The ‘best’ model obtained from above should be used to forecast demand for the
period Oct 2017 to December 2018 for both items. The forecasted values as well as
their upper and lower confidence limits should be provided. If you were the store
manager, what decisions would you make after looking at the demand for the two
items over the years?
2. Exploratory Data Analysis – Step by step
approach
2.1.1. Install necessary Packages and Invoke Libraries
The necessary packages were installed and the associated libraries invoked.
Loading all the packages in one place increases code readability.
Please refer to Appendix A for the source code.
From the above plots, we can see that Item A has increasing demand, whereas Item B
shows falling demand. There is some seasonality and trend in both demand series.
Neither Item A nor Item B appears to be cyclic in nature. The variation in Item A
increases with time, whereas the variation in Item B decreases.
3.3. Decomposition of the time series dataset
The original time series is commonly decomposed into three component series:
1. Seasonal: patterns that repeat with a fixed period.
2. Trend: the underlying long-term movement of the series.
3. Random (also called “noise”, “irregular” or “remainder”): the residuals of the
time series after the seasonal and trend components have been removed.
Besides these three components there can also be a cyclic component, which recurs
over longer, non-fixed periods of time.
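The three components above can be illustrated with a minimal additive decomposition in pure Python (an illustrative sketch, not the R `stl` call used in this report; the toy series and period are made up): a centered-moving-average trend, per-season averages of the detrended values, and the remainder.

```python
def decompose_additive(y, m):
    """Additive decomposition: y[t] = trend[t] + seasonal[t % m] + remainder[t]."""
    n, half = len(y), m // 2
    # Centered moving-average trend (endpoints half-weighted for even m).
    trend = [None] * n
    for t in range(half, n - half):
        if m % 2 == 0:
            w = y[t - half:t + half + 1]
            trend[t] = (0.5 * w[0] + sum(w[1:-1]) + 0.5 * w[-1]) / m
        else:
            trend[t] = sum(y[t - half:t + half + 1]) / m
    # Seasonal indices: average detrended value per season, centered to sum to 0.
    buckets = [[] for _ in range(m)]
    for t in range(n):
        if trend[t] is not None:
            buckets[t % m].append(y[t] - trend[t])
    seasonal = [sum(b) / len(b) for b in buckets]
    mean_s = sum(seasonal) / m
    seasonal = [s - mean_s for s in seasonal]
    remainder = [y[t] - trend[t] - seasonal[t % m] if trend[t] is not None else None
                 for t in range(n)]
    return trend, seasonal, remainder

# Toy series with known trend 10 + 2t and seasonal pattern [3, -1, -2, 0]:
y = [10 + 2 * t + [3, -1, -2, 0][t % 4] for t in range(16)]
trend, seasonal, remainder = decompose_additive(y, 4)
```

On this noise-free series the trend and seasonal indices are recovered exactly and the remainder is zero wherever the trend is defined.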
> monthplot(ItemAdemand)
> monthplot(ItemBdemand)
The seasonal variation looked to be about the same magnitude across time, so an
additive decomposition might give good results.
> ItemAseasonality
Call:
stl(x = ItemAdemand[, 1], s.window = "p")
Components
seasonal trend remainder
Jan 2002 -970.47187 2838.594 85.8774947
Feb 2002 -454.22689 2841.712 -85.4854465
Mar 2002 -124.79419 2844.830 333.9638914
Apr 2002 -364.03321 2850.118 -72.0843134
May 2002 -314.83443 2855.405 -314.5703082
Jun 2002 -343.58304 2861.823 206.7602840
> ItemBseasonality <- stl(ItemBdemand[,1], s.window="p") #constant seasonality
> plot(ItemBseasonality)
> ItemBseasonality
Call:
stl(x = ItemBdemand[, 1], s.window = "p")
Components
seasonal trend remainder
Jan 2002 -1222.0193 3777.014 30.0049071
Feb 2002 -860.5255 3773.067 455.4584776
Mar 2002 -438.6564 3769.120 -120.4632005
Apr 2002 -145.5874 3763.765 -507.1773823
May 2002 354.4193 3758.410 -356.8292167
Jun 2002 451.4649 3749.684 14.8509274
From the decomposed details above, we can see a continuous increase in demand for
Item A; on the contrary, a steady decline is observed for Item B.
Decompose the time series and plot the deseasoned series
If the focus is on figuring out whether the general trend of demand is up, we
deseasonalize the series and can largely set aside the seasonal component. However,
to forecast the demand for the next month, both the secular trend and the seasonality
need to be taken into account.
The above plot shows demand in red and de-seasoned demand in blue; we can see
that there is a decreasing trend in demand.
Divide data into test and train
The data is divided into Training and test data for both Item A and B. The dataset is
divided in such a way that the older data is used as the training data whereas the
most recent data is used as the testing data.
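The chronological split described above (the project brief specifies the last 21 months as the hold-out sample) can be sketched in Python; the 60-observation toy series is an assumption for illustration.

```python
def split_train_test(series, holdout):
    """Chronological split: oldest observations for training, most recent held out."""
    return series[:-holdout], series[-holdout:]

# A toy monthly series of 60 observations with the last 21 months held out:
series = list(range(1, 61))
train, test = split_train_test(series, 21)
```

Unlike a random split, this preserves temporal order, so the model is always evaluated on data from after its training period.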
The training data for both items is decomposed into seasonal, trend and irregular
components using STL.
The ts plot function plots the test data and the forecasted data to check how well the
forecasted data matches the actual test data.
> ts.plot(VecA, col=c("blue", "red"),xlab="year", ylab="demand", main="Quarterly
Demand A: Actual vs Forecast")
Calculate the mean absolute percentage error (MAPE) between the forecast and the
eventual outcomes.
> MAPEA <- mean(abs(VecA[,1]-VecA[,2])/VecA[,1])
> MAPEA
[1] 0.1408798
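The MAPE computed above (`mean(abs(actual - forecast)/actual)`) can be expressed as a small Python function; the sample values below are made up for illustration.

```python
def mape(actual, forecast):
    """Mean absolute percentage error: mean of |actual - forecast| / |actual|."""
    return sum(abs(a - f) / abs(a) for a, f in zip(actual, forecast)) / len(actual)

# 10% error on the first point, 5% on the second -> MAPE of 7.5%:
err = mape([100.0, 200.0], [110.0, 190.0])
```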
Box-Ljung Test
To check whether the residuals are independent:
H0: Residuals are independent
Ha: Residuals are not independent
>Box.test(stlForecastA$residuals, lag=10, type="Ljung-Box")
Box-Ljung test
data: stlForecastA$residuals
X-squared = 72.292, df = 10, p-value = 1.597e-11
Conclusion: Reject H0; the residuals are not independent.
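The Box-Ljung statistic that R reports is Q = n(n+2) Σ r_k²/(n−k) over lags k = 1..h, where r_k is the lag-k autocorrelation of the residuals. A minimal Python sketch (illustrative only; the alternating "residual" series below is made up):

```python
def ljung_box_q(res, max_lag):
    """Box-Ljung statistic Q = n(n+2) * sum_k r_k^2 / (n - k) over lags 1..max_lag."""
    n = len(res)
    mean = sum(res) / n
    c0 = sum((x - mean) ** 2 for x in res)
    q = 0.0
    for k in range(1, max_lag + 1):
        # lag-k autocovariance divided by variance gives autocorrelation r_k
        ck = sum((res[t] - mean) * (res[t + k] - mean) for t in range(n - k))
        q += (ck / c0) ** 2 / (n - k)
    return n * (n + 2) * q

# A perfectly alternating "residual" series is strongly autocorrelated at lag 1,
# so Q is large and H0 (independence) would be rejected:
q = ljung_box_q([1.0, -1.0] * 50, 1)
```

Here r_1 = −0.99, so Q = 100·102·0.9801/99 ≈ 100.98, far in the rejection region for a chi-squared distribution with 1 degree of freedom.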
The ts plot function plots the test data and the forecasted data to check how well the
forecasted data matches the actual test data.
>ts.plot(VecB, col=c("blue", "red"),xlab="year", ylab="demand", main="Quarterly
Demand B: Actual vs Forecast")
Calculate the mean absolute percentage error (MAPE) between the forecast and the
eventual outcomes.
>MAPEB <- mean(abs(VecB[,1]-VecB[,2])/VecB[,1])
>MAPEB
[1] 0.1082608
Box-Ljung Test
To check whether the residuals are independent:
H0: Residuals are independent
Ha: Residuals are not independent
Box-Ljung test
data: stlForecastb$residuals
X-squared = 63.798, df = 10, p-value = 6.878e-10
Conclusion: Reject H0; the residuals are not independent.
From the MAPE results above, this model shows forecast errors of about 14.1% for
Item A and 10.8% for Item B.
3.5. Forecasting using Holt Winters
The Holt-Winters seasonal method comprises the forecast equation and three
smoothing equations — one for the level, one for trend, and one for the seasonal
component, with smoothing parameters alpha(α), beta(β) and gamma(γ).
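The three smoothing equations above can be sketched in pure Python (an illustrative implementation, not R's `HoltWinters`; the initial states and toy series are assumptions). With exact initial level, trend and seasonal indices, a perfectly regular series is forecast exactly:

```python
def holt_winters_additive(y, m, alpha, beta, gamma, h, level, trend, seasonal):
    """Additive Holt-Winters: level, trend and seasonal updates, then h-step forecast."""
    s = list(seasonal)  # seasonal indices, length m
    for t, yt in enumerate(y):
        i = t % m
        prev_level = level
        level = alpha * (yt - s[i]) + (1 - alpha) * (level + trend)  # level update
        trend = beta * (level - prev_level) + (1 - beta) * trend     # trend update
        s[i] = gamma * (yt - level) + (1 - gamma) * s[i]             # seasonal update
    n = len(y)
    return [level + (k + 1) * trend + s[(n + k) % m] for k in range(h)]

# Toy series: level 10, slope 2, seasonal pattern [1, -1, 0, 0] with period 4.
y = [10 + 2 * t + [1, -1, 0, 0][t % 4] for t in range(12)]
# Initial state chosen so level + trend equals the deseasonalized first value.
fc = holt_winters_additive(y, 4, 0.2, 0.1, 0.3, 4,
                           level=8.0, trend=2.0, seasonal=[1.0, -1.0, 0.0, 0.0])
```

In practice R chooses α, β and γ (and the initial states) by minimizing the one-step-ahead squared error, which is how the fitted values below were obtained.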
Call:
HoltWinters(x = as.ts(ATrain), seasonal = "additive")
Smoothing parameters:
alpha: 0.1241357
beta : 0.03174654
gamma: 0.3636975
Coefficients:
[,1]
a 3753.348040
b 7.663395
s1 -1250.098605
s2 -438.592232
s3 -224.017731
s4 -407.395313
s5 -507.668223
s6 -667.267246
s7 63.659702
s8 197.909330
s9 -301.525945
s10 25.272325
s11 712.529546
s12 1545.291998
>plot(hwItemA)
Box-Ljung test
data: hwAForecast$residuals
X-squared = 14.227, df = 20, p-value = 0.8188
Conclusion: Do not reject H0; the residuals are independent.
>MAPE(VecA1[,1],VecA1[,2])
[1] 0.1160528
Forecasting for Item B
Call:
HoltWinters(x = as.ts(BTrain), seasonal = "additive")
Smoothing parameters:
alpha: 0.0166627
beta : 0.4878834
gamma: 0.5000132
Coefficients:
[,1]
a 2297.12724
b -15.29024
s1 -1222.01821
s2 -1012.34884
s3 -442.56913
s4 -307.95973
s5 79.56065
s6 258.33260
s7 697.64492
s8 241.68337
s9 -246.12729
s10 -465.09216
s11 120.77708
s12 412.50043
>plot(hwItemB)
Box-Ljung test
data: hwBForecast$residuals
X-squared = 13.101, df = 20, p-value = 0.873
Conclusion: Do not reject H0; the residuals are independent.
>MAPE(VecB1[,1],VecB1[,2])
[1] 0.1867152
Statistical tests make strong assumptions about your data. They can only be used to
inform the degree to which a null hypothesis can be accepted or rejected. The result
must be interpreted for a given problem to be meaningful. Nevertheless, they can
provide a quick check and confirmatory evidence that your time series is stationary or
non-stationary.
Null Hypothesis (H0): the time series has a unit root, meaning it is non-stationary;
it has some time-dependent structure.
Alternate Hypothesis (H1): the time series does not have a unit root, meaning it is
stationary; it does not have time-dependent structure.
p-value > 0.05: fail to reject the null hypothesis (H0); the data has a unit root and is
non-stationary.
p-value <= 0.05: reject the null hypothesis (H0); the data does not have a unit root
and is stationary.
Item A
>adf.test(ItemAdemand)
Augmented Dickey-Fuller Test
data: ItemAdemand
Dickey-Fuller = -7.8632, Lag order = 5, p-value = 0.01
alternative hypothesis: stationary
>adf.test(diff(ItemAdemand))
Augmented Dickey-Fuller Test
data: diff(ItemAdemand)
Dickey-Fuller = -8.0907, Lag order = 5, p-value = 0.01
alternative hypothesis: stationary
Item B
>adf.test(ItemBdemand)
Augmented Dickey-Fuller Test
data: ItemBdemand
Dickey-Fuller = -12.967, Lag order = 5, p-value = 0.01
alternative hypothesis: stationary
> adf.test(diff(ItemBdemand))
data: diff(ItemBdemand)
Dickey-Fuller = -9.8701, Lag order = 5, p-value = 0.01
alternative hypothesis: stationary
Since the p-value (0.01) is below 0.05 in every case, we reject H0: both the original
and the differenced series are stationary for each item.
ACF and PACF
>acf(ItemAdemand,lag=15)
>acf(ItemAdiff, lag=15)
>acf(ItemBdemand,lag=15)
>acf(ItemBdiff, lag=15)
>acf(ItemAdemand,lag=50)
>acf(ItemAdiff, lag=50)
>pacf(ItemAdemand)
>pacf(ItemAdiff)
>acf(ItemBdemand,lag=50)
>acf(ItemBdiff, lag=50)
>pacf(ItemBdemand)
>pacf(ItemBdiff)
ARIMA model
ARMA models are commonly used in time series modeling. In an ARMA model, AR
stands for auto-regression and MA stands for moving average.
From the ACF and PACF plots above, the alternating positive and negative values
confirm that the data is stationary; there is no sharp cut-off that would point to an
AR(2) series, and no gradual decay in the PACF that would point to an MA(2)
component.
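To make the AR idea concrete, here is a minimal pure-Python sketch (not part of the report's R workflow; the coefficient 0.6 and sample size are made up): simulate an AR(1) process y_t = φ·y_{t−1} + ε_t and recover φ by least squares.

```python
import random

def fit_ar1(y):
    """Least-squares estimate of phi in y[t] = phi * y[t-1] + noise (zero-mean y)."""
    num = sum(y[t] * y[t - 1] for t in range(1, len(y)))
    den = sum(y[t - 1] ** 2 for t in range(1, len(y)))
    return num / den

# Simulate 5000 observations of an AR(1) process with phi = 0.6:
random.seed(42)
y = [0.0]
for _ in range(4999):
    y.append(0.6 * y[-1] + random.gauss(0.0, 1.0))

phi_hat = fit_ar1(y)
```

With 5000 observations the estimate lands close to the true φ = 0.6; `auto.arima` generalizes this idea, searching over AR, MA and seasonal orders automatically.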
Item A
Coefficients:
sma1 drift
-0.6581 3.9132
s.e. 0.0798 0.9188
> plot(ItemA_ArimaTrain$x,col="blue")
> lines(ItemA_ArimaTrain$fitted,col="red",main="Demand A: Actual vs Forecast")
> MAPE(ItemA_ArimaTrain$fitted,ItemA_ArimaTrain$x)
[1] 0.0733376
The MAPE is now reduced to 7.3% for the ARIMA model.
>acf(ItemA_ArimaTrain$residuals)
>pacf(ItemA_ArimaTrain$residuals)
Box-Ljung Test
To check whether the residuals are independent:
H0: Residuals are independent
Ha: Residuals are not independent
>Box.test(ItemA_ArimaTrain$residuals, lag = 10, type = c("Ljung-Box"), fitdf = 0)
Box-Ljung test
data: ItemA_ArimaTrain$residuals
X-squared = 16.716, df = 10, p-value = 0.0809
Conclusion: Do not reject H0 at the 5% level; the residuals are independent.
From the plot and data, we can see that the forecast follows the actual values
closely, with points of intersection at Jan 2016, May 2016, Dec 2016 and Jan 2017.
Item B
Coefficients:
sar1 sar2 sma1 drift
0.2141 0.0379 -0.7536 -10.1449
s.e. 0.1958 0.1382 0.1773 0.9005
>plot(ItemB_ArimaTrain$residuals)
>plot(ItemB_ArimaTrain$x,col="blue")
>acf(ItemB_ArimaTrain$residuals)
>pacf(ItemB_ArimaTrain$residuals)
Box-Ljung Test
To check whether the residuals are independent:
H0: Residuals are independent
Ha: Residuals are not independent
data: ItemB_ArimaTrain$residuals
X-squared = 18.298, df = 10, p-value = 0.05014
Conclusion: Do not reject H0 at the 5% level (the p-value is marginally above 0.05);
the residuals are independent.
From the plot and data, we can see that the forecast follows the actual values to
some extent.
3.7. Model Comparison
For this time series forecasting problem, we observed both trend and seasonality in
the data.
Item A has an increasing trend, whereas the trend for Item B is declining.
We also observed that for both items a few months show high variation in
seasonality, and Item A contains a few outliers.
As the seasonal variation did not scale with the trend, we used “additive”
seasonality. We fitted three models:
1. Random Model (random walk with drift),
2. Holt-Winters model,
3. ARIMA model.
MAPE comparison from the results above:
Random Model – Item A: 14.1%, Item B: 10.8%
Holt-Winters – Item A: 11.6%, Item B: 18.7%
ARIMA – Item A: 7.3%, Item B: (not reported above)
Among the MAPE values observed, the ARIMA model provided the lowest, so we
selected it for forecasting.
3.8. Further forecasting using ARIMA model and conclusion
Forecast A
>ItemA_arima <- auto.arima(ItemAdemand, seasonal=TRUE)
>ItemAforecast <- forecast(ItemA_arima, h=17)
>plot(ItemAforecast)
>ItemAforecast
Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
Aug 2017 4320.211 3879.991 4760.431 3646.953 4993.469
Sep 2017 4169.513 3725.551 4613.476 3490.531 4848.495
Oct 2017 4428.791 3981.385 4876.197 3744.542 5113.040
Nov 2017 5102.669 4652.091 5553.246 4413.570 5791.767
Dec 2017 5879.220 5425.721 6332.719 5185.653 6572.787
Jan 2018 2819.535 2363.343 3275.727 2121.849 3517.221
Feb 2018 3990.984 3532.307 4449.660 3289.498 4692.469
Mar 2018 4181.449 3720.480 4642.419 3476.458 4886.441
Apr 2018 4081.089 3618.003 4544.174 3372.860 4789.317
May 2018 3888.336 3423.296 4353.376 3177.118 4599.554
Jun 2018 4029.525 3562.679 4496.370 3315.545 4743.504
Jul 2018 4390.292 3921.777 4858.807 3673.760 5106.823
Aug 2018 4407.590 3900.778 4914.402 3632.487 5182.693
Sep 2018 4257.019 3747.019 4767.019 3477.041 5036.997
Oct 2018 4516.419 4003.480 5029.358 3731.946 5300.892
Nov 2018 5190.414 4674.763 5706.065 4401.794 5979.034
Dec 2018 5967.079 5448.925 6485.232 5174.632 6759.526
Forecast B
>ItemB_arima <- auto.arima(ItemBdemand, seasonal=TRUE)
>ItemBforecast <- forecast(ItemB_arima, h=17)
>plot(ItemBforecast)
>ItemBforecast
Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
Aug 2017 2356.360 1945.1156 2767.605 1727.4157 2985.305
Sep 2017 2082.947 1671.7024 2494.192 1454.0025 2711.892
Oct 2017 1784.795 1373.5500 2196.040 1155.8501 2413.740
Nov 2017 2436.402 2025.1570 2847.647 1807.4571 3065.347
Dec 2017 2429.861 2018.6162 2841.106 1800.9163 3058.806
Jan 2018 965.227 553.9834 1376.471 336.2842 1594.170
Feb 2018 1278.230 866.9865 1689.474 649.2873 1907.173
Mar 2018 1693.373 1282.1294 2104.617 1064.4302 2322.316
Apr 2018 2088.924 1677.6805 2500.168 1459.9813 2717.867
May 2018 2342.673 1931.4290 2753.916 1713.7298 2971.615
Jun 2018 2587.570 2176.3268 2998.814 1958.6276 3216.513
Jul 2018 2903.952 2492.7084 3315.196 2275.0092 3532.895
Aug 2018 2267.612 1814.7093 2720.515 1574.9570 2960.267
Sep 2018 1945.385 1492.4816 2398.287 1252.7293 2638.040
Oct 2018 1663.827 1210.9242 2116.730 971.1719 2356.482
Nov 2018 2293.259 1840.3561 2746.162 1600.6038 2985.914
Dec 2018 2325.067 1872.1643 2777.970 1632.4120 3017.722
4. Appendix A – Source Code
4.1. The source code for the time series analysis has been written below
demand <- read_excel("Demand.xlsx")
names(demand)[1:4] <- c("Year","Month","ItemA","ItemB")
dim(demand)
head(demand)
summary(demand[,3:4])
monthplot(ItemAdemand)
monthplot(ItemBdemand)
library(forecast)
stlForecastA <- forecast(ItemATrain, method="rwdrift", h=19)
stlForecastb <- forecast(ItemBTrain, method="rwdrift", h=19)
VecA<- cbind(ATest,stlForecastA$mean)
VecB<- cbind(BTest,stlForecastb$mean)
install.packages("MLmetrics")
library(MLmetrics)
MAPE(VecA1[,1],VecA1[,2])
MAPEA1 <- mean(abs(VecA1[,1]-VecA1[,2])/VecA1[,1])
MAPEA1
#B
#install.packages("MLmetrics")
library(MLmetrics)
MAPE(VecB1[,1],VecB1[,2])
#ARIMA
library(tseries)
adf.test(ItemAdemand)
ItemAdiff <- diff(ItemAdemand)
plot(ItemAdiff)
adf.test(diff(ItemAdemand))
#B
adf.test(ItemBdemand)
adf.test(diff(ItemBdemand))
# ACF PACF
acf(ItemAdemand,lag=15)
acf(ItemAdiff, lag=15)
acf(ItemBdemand,lag=15)
acf(ItemBdiff, lag=15)
acf(ItemAdemand,lag=50)
acf(ItemAdiff, lag=50)
pacf(ItemAdemand)
pacf(ItemAdiff)
acf(ItemBdemand,lag=50)
acf(ItemBdiff, lag=50)
pacf(ItemBdemand)
pacf(ItemBdiff)
plot(ItemA_ArimaTrain$residuals)
plot(ItemA_ArimaTrain$x,col="blue")
lines(ItemA_ArimaTrain$fitted,col="red",main="Demand A: Actual vs Forecast")
MAPE(ItemA_ArimaTrain$fitted,ItemA_ArimaTrain$x)
acf(ItemA_ArimaTrain$residuals)
pacf(ItemA_ArimaTrain$residuals)
#Forecast on holdout
#B
ItemB_ArimaTrain <- auto.arima(BTrain, seasonal=TRUE)
ItemB_ArimaTrain
plot(ItemB_ArimaTrain$residuals)
plot(ItemB_ArimaTrain$x,col="blue")
lines(ItemB_ArimaTrain$fitted,col="red",main="Demand B: Actual vs Forecast")
MAPE(ItemB_ArimaTrain$fitted,ItemB_ArimaTrain$x)
acf(ItemB_ArimaTrain$residuals)
pacf(ItemB_ArimaTrain$residuals)
#Forecast on holdout
#Forecast A
ItemA_arima <- auto.arima(ItemAdemand, seasonal=TRUE)
ItemAforecast <- forecast(ItemA_arima, h=17)
plot(ItemAforecast)
ItemAforecast
#Forecast B
ItemB_arima <- auto.arima(ItemBdemand, seasonal=TRUE)
ItemBforecast <- forecast(ItemB_arima, h=17)
plot(ItemBforecast)
ItemBforecast