1.
The series was imported in two ways. Each row has an associated date,
which is in fact not a column but a time index for the values. With the
date as the index, we see a single value of wine sales at each point in
time. When loading df_1 we passed the arguments parse_dates=True and
squeeze=True: the first indicates that the first column holds dates that
need to be parsed, and the second indicates that there is only one data
column, so a series is returned.
df_2 was loaded without any arguments; we then added the time stamp
manually to the series and dropped the YearMonth column present in the
data set. This is mainly done when the dates are not parsed while loading
the series.
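The two loading routes can be sketched as follows. This is a minimal illustration, not the original notebook code: the CSV content, its values and the Time_Stamp column name are made-up placeholders, and because read_csv(squeeze=True) was removed in pandas 2.0 the sketch squeezes the one-column frame after loading instead.

```python
import io
import pandas as pd

# Placeholder CSV with made-up values; the real file is not shown in the text.
csv_text = """YearMonth,Sparkling
1980-01,100
1980-02,110
1980-03,120
"""

# Way 1: parse the first column as dates and use it as the index,
# then squeeze the single data column into a Series.
df_1 = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)
df_1 = df_1.squeeze("columns")

# Way 2: load as-is, build the monthly time stamps by hand,
# then drop the original YearMonth column.
df_2 = pd.read_csv(io.StringIO(csv_text))
df_2["Time_Stamp"] = pd.date_range(start="1980-01-01", periods=len(df_2), freq="MS")
df_2 = df_2.drop(columns="YearMonth").set_index("Time_Stamp")
```

Both routes end with the sales values indexed by date; the first yields a Series and the second a one-column data frame.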
Plotting the time series - Sparkling wine sales are growing, while the
sales of Rose wine show a downward trend. We can also see some element of
seasonality in the graph; a detailed account of trend and seasonality is
given where the series is decomposed.
2.
For the Sparkling time series there are no missing values; for Rose there
are two. I used the interpolate function with the linear method and
forward direction to impute them.
The data appear to be skewed for both time series. The minimum and
maximum are far apart, hence the high standard deviation and the gap
between the mean and the 50th percentile. The total count is 187 records
for the data set.
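A minimal sketch of the linear, forward-direction imputation described above, assuming pandas was used. The values are made up for illustration and are not the Rose data.

```python
import numpy as np
import pandas as pd

# Made-up series with two gaps, standing in for the Rose sales.
rose = pd.Series([10.0, np.nan, 14.0, np.nan, 20.0])

# Linear interpolation fills each gap from its neighbours, working forward.
filled = rose.interpolate(method="linear", limit_direction="forward")
```

Each missing value is replaced by the point on the straight line between its two neighbours, so the gaps here become 12.0 and 17.0.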
The above yearly box plots reflect the year-on-year sales.
For Sparkling, sales vary from year to year, with some consistency in
1985 and 1986. The lowest sales were in 1982 and the highest in 1994. The
1995 data cover only 7 months, so no conclusion can be drawn about sales
performance for that year. Sales began with a dip in 1980 and started
showing an upward trend from 1983. The variation in sales also increases
from 1983 to 1987, although the highest variation is visible in 1994.
Each year's sales are skewed, except for 1981. There are outliers in the
yearly sales data; however, for time series analysis we can ignore them.
For Rose, sales trend downward, with the highest sales in 1981 and the
lowest in 1994. A steep decline in sales is observed from 1990 until
1994. Although the 1995 data run only to July, sales seem to have picked
up in that year. Monthly sales also appear to vary most in 1981 and least
in 1994.
Looking at the monthly box plots, we can observe seasonality in both
data sets, more strongly for Sparkling wine. Sales pick up in Q4 for both
wines, with the larger increase observed for Sparkling. Sales of
Sparkling wine are comparatively low in Q1 and Q2 and slowly start an
upward trend from Q3, whereas monthly sales of Rose wine pick up from
January, follow a fairly consistent path until Q3 and start to rise again
from Q4. Monthly sales for both wines are skewed, with very few
exceptions.
The above plots clearly show December as the highest sales-generating
month, followed by November, for Sparkling. December is also the highest
sales month for Rose; the other months show mixed trends, with August
showing high sales in a few years, and so does July. There is a hint of
seasonality in both graphs.
The above resampling of the data to annual frequency shows annual sales
for Rose trending downward year on year. Annual sales of Sparkling wine
dipped initially, then picked up from 1982 to a steady peak until 1988
and dipped again. We observe a steep decline after 1994, which is due to
sales data being available only until July 1995.
In the above case we resampled the records to the mean of the yearly
observations, which tells a very similar story to the annual sum of
sales for both wines. The average sales for Sparkling wine are much
higher than for Rose.
Another view was created at quarterly frequency for both data sets. While
quarterly sales show an increasing trend for Sparkling, they go downward
for Rose wine. There is a hint of seasonality in the time series.
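The annual and quarterly views discussed above amount to aggregating the monthly series to a coarser frequency. A minimal sketch, assuming pandas resample was used and using a made-up monthly series rather than the wine data:

```python
import pandas as pd

# Made-up monthly series covering two full years.
idx = pd.date_range("1980-01-01", periods=24, freq="MS")
sales = pd.Series(range(1, 25), index=idx, dtype=float)

# Aggregate the monthly values to one value per year or per quarter.
annual_sum = sales.resample("YS").sum()
annual_mean = sales.resample("YS").mean()
quarterly_sum = sales.resample("QS").sum()
```

A partial final year, such as the 7 months of 1995, gives a smaller annual sum than a full year, which is why the sum drops at the end of the annual plot.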
The above image reflects the decomposition of the time series for both
data sets. I decomposed each series both additively and multiplicatively
and compared the residuals to determine whether the series is additive or
multiplicative. On that basis we can say that the series is
multiplicative and has a seasonal component. We can clearly see an
increasing trend for Sparkling wine and a declining trend for Rose wine.
The plot above also shows that wine sales are unstable, along with their
obvious seasonality.
The seasonal variation is on the higher side for Sparkling sales, while
the variation in sales is higher for Rose wine. We can see this in the
residual plot as well.
3.
The data frame has been split into train and test. As per the question,
the test data should start at 1991, hence roughly 71% of the records were
used for train and the rest for test. The head and tail of the train and
test data confirm that the test set begins in January 1991 and the train
set ends in December 1990. The train set has 132 records and the test set
has 55 for both data frames, Sparkling and Rose.
The above graph is a visual representation of the train and test sets.
Orange reflects the train set and blue reflects the test set.
Train set – 01/1980 to 12/1990
Test set – 01/1991 to 07/1995
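A date-based split matching the ranges above can be sketched as follows. The series values are placeholders; only the date range (01/1980 to 07/1995, monthly) is taken from the text.

```python
import pandas as pd

# Monthly index spanning the period described in the text; values are dummies.
idx = pd.date_range("1980-01-01", "1995-07-01", freq="MS")
series = pd.Series(range(len(idx)), index=idx, dtype=float)

# Everything up to December 1990 is train; January 1991 onward is test.
train = series.loc[:"1990-12-31"]
test = series.loc["1991-01-01":]
```

Splitting by date rather than by random sampling keeps the test set strictly after the train set, which is what a forecasting evaluation needs.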
4.
Linear Regression
Several models were built on both data sets, starting with Linear
Regression. The above images show the code and the visual prediction on
the test data in comparison to the actuals for Sparkling and Rose.
For Sparkling, an upward prediction is observed for the test set, while a
downward trend is observed for the Rose test set. The red line is the
regression on train and the green line is the regression on test.
The RMSE for the Sparkling test set is 1389.35 and MAPE is 40.05.
The RMSE for the Rose test set is 15.26 and MAPE is 22.82.
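The text does not say which regressor was used, so the sketch below assumes the common approach of regressing sales on a running time index (1, 2, 3, ...). It also shows how RMSE and MAPE, the two metrics reported throughout, can be computed. All numbers are made up.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actuals must be non-zero)."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return float(np.mean(np.abs((actual - predicted) / actual)) * 100)

# Made-up train and test values.
train = np.array([10.0, 12.0, 14.0, 16.0])
test = np.array([18.0, 21.0])

# Assumed regressor: a running time index that continues into the test set.
t_train = np.arange(1, len(train) + 1)
t_test = np.arange(len(train) + 1, len(train) + len(test) + 1)

# Fit a straight line on train, then extrapolate it over the test period.
slope, intercept = np.polyfit(t_train, train, deg=1)
pred = intercept + slope * t_test

test_rmse = rmse(test, pred)
test_mape = mape(test, pred)
```

RMSE is in the units of the series, so it is not comparable between Sparkling and Rose; MAPE is a percentage and is.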
Naïve
In the Naïve model, the forecast for every horizon equals the last
observed value. Looking at the RMSE for the Naïve model, we can clearly
say that it is not suitable for either data set. For the Sparkling test
set, RMSE is 3864.27 and MAPE is 152.87; for Rose, RMSE is 79.71 and
MAPE is 145.10.
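As a minimal sketch with made-up numbers, the Naïve forecast simply repeats the last training observation across the whole test horizon:

```python
# Made-up numbers for illustration only.
train = [10.0, 12.0, 15.0]
test = [14.0, 16.0, 18.0]

# Every forecast horizon repeats the last observed training value.
naive_forecast = [train[-1]] * len(test)
```

Because the forecast is flat, any trend or seasonal swing in the test period turns directly into error.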
Simple Average
Simple average forecasts the expected value as the average of all
previous observations: take the average of all previously known values
and use it as the next value.
For the Sparkling test set, RMSE is 1275.08 and MAPE is 38.90, which is
so far the best among the three models run on the data set. For the Rose
test set, RMSE is 53.46 and MAPE is 96.93, which is worse than the linear
regression model.
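A minimal sketch of the simple average forecast, assuming the mean of the whole train set is used for every point in the test period (the numbers are made up):

```python
from statistics import mean

# Made-up numbers for illustration only.
train = [10.0, 12.0, 14.0]
test = [16.0, 18.0]

# The forecast for every test point is the mean of all training values.
simple_avg_forecast = [mean(train)] * len(test)
```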
Moving Average
A moving average (rolling average or running average) is a calculation to
analyze data points by creating a series of averages of different subsets of
the full data set. Given a series of numbers and a fixed subset size, the
first element of the moving average is obtained by taking the average of
the initial fixed subset of the number series. Then the subset is modified
by "shifting forward"; that is, excluding the first number of the series and
including the next value in the subset.
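The shifting-window calculation described above can be sketched with a rolling mean. This assumes pandas and uses made-up values; the window sizes are illustrative.

```python
import pandas as pd

# Made-up series for illustration only.
sales = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])

# Trailing averages: each value is the mean of the current point and the
# preceding (window - 1) points, so the first (window - 1) entries are NaN.
trailing_2 = sales.rolling(window=2).mean()
trailing_4 = sales.rolling(window=4).mean()
```

A wider window smooths more but lags further behind the actual values.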
In our case, we have taken 2-, 4-, 6- and 9-point trailing averages on
both data sets. In the plot of all 4 trailing averages, all of them
predict below the actual train and test data; the 9-point trailing
average predicts lowest of all, and the closest to the actuals is the
2-point trailing moving average.
This is also evident from the RMSE score for each moving average.
Sparkling
2-point MA: RMSE 831.40, MAPE 19.70
4-point MA: RMSE 1156.59, MAPE 35.96
6-point MA: RMSE 1283.92, MAPE 43.86
9-point MA: RMSE 1346.27, MAPE 46.86
Rose
2-point MA: RMSE 11.52, MAPE 13.54
4-point MA: RMSE 14.45, MAPE 19.49
6-point MA: RMSE 14.56, MAPE 20.82
9-point MA: RMSE 14.72, MAPE 21.01
So far, among all the models run, the 2-point moving average has been the
best for both the Sparkling and Rose data sets.
Double Exponential
Double exponential smoothing is used for data sets with level and trend
but no seasonality. We began with a grid search and found that alpha = 0.3
and beta = 0.3 gave the lowest RMSE and MAPE on the train and test sets
for Sparkling and Rose.
Sparkling: RMSE 18259, MAPE 675.28
Rose: RMSE 265.56, MAPE 442.50
Double exponential smoothing has been the worst-performing model on both
data sets so far.
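To make the alpha and beta grid search concrete, here is a hand-written sketch of double exponential (Holt) smoothing and a search over a small grid. It is not the library routine used for the report: the initialisation (first value as level, first difference as trend), the grid values and the data are all assumptions for illustration.

```python
import itertools
import math

def holt_forecast(series, alpha, beta, horizon):
    """Holt's linear-trend smoothing; returns `horizon` forecasts."""
    # Assumed initialisation: first value as level, first difference as trend.
    level = series[0]
    trend = series[1] - series[0]
    for y in series[1:]:
        prev_level = level
        level = alpha * y + (1 - alpha) * (prev_level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
    # The h-step-ahead forecast extends the last level along the last trend.
    return [level + h * trend for h in range(1, horizon + 1)]

def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def grid_search(train, test, grid):
    """Return (rmse, alpha, beta) for the pair with the lowest test RMSE."""
    scored = (
        (rmse(test, holt_forecast(train, a, b, len(test))), a, b)
        for a, b in itertools.product(grid, repeat=2)
    )
    return min(scored)

# Made-up, perfectly linear data; hypothetical grid values.
train = [2.0, 4.0, 6.0, 8.0]
test = [10.0, 12.0]
grid = [0.1, 0.3, 0.5, 0.7, 0.9]
best_rmse, best_alpha, best_beta = grid_search(train, test, grid)
```

The sketch selects parameters by test RMSE because that is what the text describes; a validation slice held out from the train set would usually be preferred, so that the test set stays untouched.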
Triple Exponential
We started the Triple exponential smoothing model with automatically
fitted parameters, followed by a grid search for the alpha, beta and
gamma values with the least RMSE and MAPE.
Sparkling – With auto parameters we got Alpha=0.154, Beta=0.371,
Gamma=7.413, and through grid search we got Alpha=0.3, Beta=0.3,
Gamma=0.3. The test RMSE for the Alpha=0.154, Beta=0.371, Gamma=7.413
model was 17.36 and MAPE 28.8, whereas for the Alpha=0.3, Beta=0.3,
Gamma=0.3 test model we got RMSE 462.28 and MAPE 499.52. The
auto-parameter Triple exponential model gave the least RMSE for the
Sparkling data.
Rose – With auto parameters we got Alpha=0.106, Beta=0.0, Gamma=0.048,
and through grid search we got Alpha=0.1, Beta=0.2, Gamma=0.2. The test
RMSE for the Alpha=0.106, Beta=0.0, Gamma=0.048 model was 17.36 and MAPE
28.8, whereas for the Alpha=0.106, Beta=0.0, Gamma=0.048 test model we
got RMSE 462.28 and MAPE 499.52. The auto-parameter Triple exponential
model gave the highest RMSE for the Rose data.
5.
6.
The above screenshot is for the automated ARIMA model. An ARIMA model is
characterized by 3 terms, p, d and q, where p is the order of the AR
term, q is the order of the MA term, and d is the number of differences
required to make the time series stationary. A ‘non-seasonal’ time series
that exhibits patterns and is not random white noise can be modeled with
ARIMA models. We used the Akaike information criterion (AIC) as our
measure of model performance. The model summary reveals a lot of
information. The table in the middle is the coefficients table, where the
values under ‘coef’ are the weights of the respective terms.
The approach is no different for a SARIMA model. We are still trying to
get the series to behave in a stationary way, so that our model is
estimated correctly. Seasonality can come in two basic varieties,
multiplicative and additive. By default, statsmodels works with a
multiplicative seasonal component. For our model it really won’t matter.
A SARIMA model has 7 parameters. The first 3 are the same as for an ARIMA
model. The last 4 define the seasonal process: the seasonal
autoregressive component, the seasonal difference, the seasonal moving
average component and the length of the season.
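The 7 parameters can be written as an order (p, d, q) plus a seasonal order (P, D, Q, s). A candidate grid for an AIC search might be built as in the sketch below; the ranges 0 to 2 and the season length of 12 are assumptions for illustration, not the ranges used in the report.

```python
import itertools

# Hypothetical search ranges for each of the non-seasonal and seasonal terms.
p = d = q = range(0, 3)
season_length = 12  # monthly data with a yearly season (assumed)

# Every (p, d, q) order and every (P, D, Q, s) seasonal order.
orders = list(itertools.product(p, d, q))
seasonal_orders = [(P, D, Q, season_length) for P, D, Q in itertools.product(p, d, q)]

# Each candidate pairs one order with one seasonal order.
candidates = list(itertools.product(orders, seasonal_orders))
```

Each candidate pair would then be fitted (for example with statsmodels' SARIMAX, which takes order and seasonal_order arguments) and the results sorted by AIC; the fitting step is omitted here.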
Sparkling - We started by iterating over candidate orders and seasonal
orders, fitting each combination to find the one with the lowest AIC.
After sorting by AIC, we arrived at p = 1, d = 1 and q = 2 with a
seasonal order of (1, 0, 2, 12); the lowest AIC recorded was 1555.61.
The same order was fitted to check the final outcome of the model on the
train and test data sets. The fitted terms were ar.L1, ar.S.L12, ma.L1,
ma.L2, ma.S.L12 and ma.S.L24. The p-values of the AR term at lag 1, the
seasonal AR term at lag 12, the MA term at lag 2 and the seasonal MA term
at lag 12 are 0 or close to 0, less than 0.05, which makes these
coefficients highly significant. The MA term at lag 1 and the seasonal MA
term at lag 24 are not significant, since their p-values are above 0.05.
The AIC for this model is 1555.58. The same parameters were then used to
forecast on the test set; the test RMSE is 528.59.
Rose - We started by iterating over candidate orders and seasonal orders,
fitting each combination to find the one with the lowest AIC. After
sorting by AIC, we arrived at p = 0, d = 1 and q = 2 with a seasonal
order of (2, 0, 2, 12); the lowest AIC recorded was 887.93. The same
order was fitted to check the final outcome of the model on the train and
test data sets. The fitted terms were ar.S.L12, ar.S.L24, ma.L1, ma.L2,
ma.S.L12 and ma.S.L24. The p-values of the seasonal AR terms at lags 12
and 24 are 0, less than 0.05, which makes these coefficients highly
significant. The MA terms at lags 1 and 2 and the seasonal MA terms at
lags 12 and 24 are not significant, since their p-values are above 0.05.
The AIC for this model is 887.93. The same parameters were then used to
forecast on the test set; the test RMSE is 26.92.
7.
9.
Final Model – Sparkling
The above plot shows the prediction on the data set against the actual
data. The blue line represents the data frame and the orange line is the
2-point moving average. The average is slightly lower than the actual
peaks in the data frame, as expected from averaging.
The RMSE of the 2-point moving average is 11.52 and MAPE is 13.54, the
same as on the test data set.
The above plot shows the forecast for the next 12 months, from 1995-08-31
to 1996-07-31, along with a 95% confidence interval. We manually created
the upper limit as 2.5% above the forecast and the lower limit as 2.5%
below the forecast.
The plot reflects the forecast variance with the 95% confidence interval.
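The manual limits described above can be sketched as a fixed band of plus and minus 2.5% around each forecast value. The forecast numbers below are made up.

```python
# Made-up forecast values for illustration only.
forecast = [100.0, 200.0]

# Limits as described in the text: 2.5% below and 2.5% above each forecast.
lower = [f * 0.975 for f in forecast]
upper = [f * 1.025 for f in forecast]
```

A band built this way has the same relative width at every horizon. An interval derived from the model's forecast errors would instead typically widen the further ahead it looks.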
10.
Comments - Sparkling
In the first iteration, we created 12 models to see which one has the
lowest RMSE and MAPE, and in the second iteration we created another 4
models using the ARIMA and SARIMA approach. Looking at the performance,
it is clear that the integrated approach of autoregression and moving
average was not able to forecast with the lowest RMSE. Between ARIMA and
SARIMA, SARIMA performed better. Of all the models, the Triple
Exponential model gave the lowest RMSE and was one of the best models,
which draws our attention to the seasonal aspect of Sparkling wine sales;
this can be seen in the ACF and PACF interpretation as well. We cannot
ignore that sales go up towards the end of the year, and that has to be
accounted for in our forecasting. Although we have forecast using the TES
model, the error is still on the higher side, which could be due to
variation in the data and the seasonal components. We considered all the
data from 1980 to 1995, but the 1995 data consist of only 7 months. In a
further iteration we could omit data from the beginning and/or the end
to see whether the error can be reduced to make better forecasts.
Comments - Rose
As with the 12 models built for Sparkling, we created similar models for
the Rose sales data set. Although we can see a hint of a seasonal aspect,
it is not as strong as for Sparkling wine sales. The variation is high in
the Rose data set. We can see sales going up in Q4 of the year, which
does hint at a seasonal aspect in the data. The RMSE is on the lower side
for all the models except the Triple exponential model. The moving
average performed the best among all. Even though I used the 2-point
moving average to forecast 12 months ahead, my recommendation is that we
should only rely on it for 2 months into the future rather than 12. This
is mainly because the 2-point moving average considers only the last two
values; hence, on the basis of actual values, we would be able to
forecast two months into the future.