
Forecasting Analysis Individual Assignment

Mohammad Mujtaba

16/10/2021
# Importing the relevant libraries
library(forecast)

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

library(tseries)

#Retrieving the working directory


getwd()

## [1] "C:/Users/user/Desktop"

# Reading the file


data<-read.csv("C:/Users/user/Documents/SouvenirSales.csv",header=TRUE)
head(data)

## Date Sales
## 1 Jan-95 1664.81
## 2 Feb-95 2397.53
## 3 Mar-95 2840.71
## 4 Apr-95 3547.29
## 5 May-95 3752.96
## 6 Jun-95 3714.74

# Summarizing the data


summary(data)

## Date Sales
## Length:84 Min. : 1665
## Class :character 1st Qu.: 5884
## Mode :character Median : 8772
## Mean : 14316
## 3rd Qu.: 16889
## Max. :104661

# Displaying the internal structure of the R object


str(data)

## 'data.frame': 84 obs. of 2 variables:


## $ Date : chr "Jan-95" "Feb-95" "Mar-95" "Apr-95" ...
## $ Sales: num 1665 2398 2841 3547 3753 ...
#1(a) Plot the time series of the original data. Which time series components
# appear from the plot?
# Answer 1(a)
# Creating time series
?ts

## starting httpd help server ... done

tseriesdata <- ts(data$Sales,start=c(1995,1),frequency=12)


tseriesdata

## Jan Feb Mar Apr May Jun Jul


## 1995 1664.81 2397.53 2840.71 3547.29 3752.96 3714.74 4349.61
## 1996 2499.81 5198.24 7225.14 4806.03 5900.88 4951.34 6179.12
## 1997 4717.02 5702.63 9957.58 5304.78 6492.43 6630.80 7349.62
## 1998 5921.10 5814.58 12421.25 6369.77 7609.12 7224.75 8121.22
## 1999 4826.64 6470.23 9638.77 8821.17 8722.37 10209.48 11276.55
## 2000 7615.03 9849.69 14558.40 11587.33 9332.56 13082.09 16732.78
## 2001 10243.24 11266.88 21826.84 17357.33 15997.79 18601.53 26155.15
## Aug Sep Oct Nov Dec
## 1995 3566.34 5021.82 6423.48 7600.60 19756.21
## 1996 4752.15 5496.43 5835.10 12600.08 28541.72
## 1997 8176.62 8573.17 9690.50 15151.84 34061.01
## 1998 7979.25 8093.06 8476.70 17914.66 30114.41
## 1999 12552.22 11637.39 13606.89 21822.11 45060.69
## 2000 19888.61 23933.38 25391.35 36024.80 80721.71
## 2001 28586.52 30505.41 30821.33 46634.38 104660.67

# Answer 1(a) continued


# Plotting the time series
plot(tseriesdata, bty = "l",
     type = 'l',
     xlab = "Year",
     ylab = "Souvenir sales",
     main = "Souvenir Sales")
sm <- ma(tseriesdata, order = 12) # 12-month centered moving average
lines(sm, col = "blue")
# The plot shows an upward trend, strong annual seasonality (December
# peaks), and seasonal swings that grow with the level of the series,
# suggesting multiplicative seasonality.
#1(b) Fit a linear trend model with additive seasonality (Model A) and an
# exponential trend model with multiplicative seasonality (Model B). Consider
# January as the reference group for each model. Produce the regression
# coefficients and the validation set errors. Remember to fit only the
# training period.
# Answer (b)
# Creating training and test sets

train <- window(tseriesdata,end=c(2000,12), frequency=12)


test <- window(tseriesdata,start=c(2001,1), frequency=12)

# model A, the linear additive model

train.linear.trend <- tslm(train ~ trend + season) ## building the linear trend model
train.linear.pred <- forecast(train.linear.trend, h = length(test), level = 0)

# Note: the next call overwrites the regression forecast with an ETS forecast
# of the 12-month moving average; the ETS(M,A,N) summary, warning and test
# errors reported below therefore come from this call, not from Model A.
train.linear.pred <- forecast(sm, h = length(test), level = 0)

## Warning in ets(object, lambda = lambda, biasadj = biasadj,
## allow.multiplicative.trend = allow.multiplicative.trend, : Missing values
## encountered. Using longest contiguous portion of time series

# Publishing the details of additive linear model


summary(train.linear.trend)
##
## Call:
## tslm(formula = train ~ trend + season)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12592 -2359 -411 1940 33651
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3065.55 2640.26 -1.161 0.25029
## trend 245.36 34.08 7.199 1.24e-09 ***
## season2 1119.38 3422.06 0.327 0.74474
## season3 4408.84 3422.56 1.288 0.20272
## season4 1462.57 3423.41 0.427 0.67077
## season5 1446.19 3424.60 0.422 0.67434
## season6 1867.98 3426.13 0.545 0.58766
## season7 2988.56 3427.99 0.872 0.38684
## season8 3227.58 3430.19 0.941 0.35058
## season9 3955.56 3432.73 1.152 0.25384
## season10 4821.66 3435.61 1.403 0.16573
## season11 11524.64 3438.82 3.351 0.00141 **
## season12 32469.55 3442.36 9.432 2.19e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5927 on 59 degrees of freedom
## Multiple R-squared: 0.7903, Adjusted R-squared: 0.7476
## F-statistic: 18.53 on 12 and 59 DF, p-value: 9.435e-16

summary(train.linear.pred )

##
## Forecast method: ETS(M,A,N)
##
## Model Information:
## ETS(M,A,N)
##
## Call:
## ets(y = object, lambda = lambda, biasadj = biasadj,
##     allow.multiplicative.trend = allow.multiplicative.trend)
##
## Smoothing parameters:
## alpha = 0.9999
## beta = 0.9999
##
## Initial states:
## l = 5392.1595
## b = 168.239
##
## sigma: 0.0175
##
## AIC AICc BIC
## 1080.361 1081.270 1091.745
##
## Error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 17.65765 271.1743 150.7931 0.1055216 1.135913 0.03994644
## ACF1
## Training set 0.05838826
##
## Forecasts:
## Point Forecast Lo 0 Hi 0
## Jul 2001 30663.35 30663.35 30663.35
## Aug 2001 32102.81 32102.81 32102.81
## Sep 2001 33542.27 33542.27 33542.27
## Oct 2001 34981.74 34981.74 34981.74
## Nov 2001 36421.20 36421.20 36421.20
## Dec 2001 37860.66 37860.66 37860.66
## Jan 2002 39300.12 39300.12 39300.12
## Feb 2002 40739.58 40739.58 40739.58
## Mar 2002 42179.05 42179.05 42179.05
## Apr 2002 43618.51 43618.51 43618.51
## May 2002 45057.97 45057.97 45057.97
## Jun 2002 46497.43 46497.43 46497.43

accuracy(train.linear.pred,test)

##                        ME       RMSE        MAE       MPE      MAPE       MASE
## Training set     17.65765   271.1743   150.7931 0.1055216  1.135913 0.03994644
## Test set      10298.57132 27766.1968 15372.4926 5.4558880 23.119400 4.07231107
##                    ACF1 Theil's U
## Training set 0.05838826        NA
## Test set     0.14488559  1.100419

# model B, the multiplicative exponential model

train.exponential.trend <- tslm(train ~ trend + season, lambda = 0) ## building the exponential trend model
train.exponential.pred <- forecast(train.exponential.trend, h = length(test),
                                   level = 0) ## forecasting on the validation set

# Creating the plot

plot(train.exponential.pred, xlab = "Time", ylab = "Sales",
     main = "Souvenir sales", flty = 2, bty = "l", ylim = c(0, 100000))
lines(train.exponential.pred$fitted)
lines(test, col = "blue")
# Publishing the details of the multiplicative forecasting model
summary(train.exponential.trend) ## regression and forecasting summary

##
## Call:
## tslm(formula = train ~ trend + season, lambda = 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.4529 -0.1163 0.0001 0.1005 0.3438
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.646363 0.084120 90.898 < 2e-16 ***
## trend 0.021120 0.001086 19.449 < 2e-16 ***
## season2 0.282015 0.109028 2.587 0.012178 *
## season3 0.694998 0.109044 6.374 3.08e-08 ***
## season4 0.373873 0.109071 3.428 0.001115 **
## season5 0.421710 0.109109 3.865 0.000279 ***
## season6 0.447046 0.109158 4.095 0.000130 ***
## season7 0.583380 0.109217 5.341 1.55e-06 ***
## season8 0.546897 0.109287 5.004 5.37e-06 ***
## season9 0.635565 0.109368 5.811 2.65e-07 ***
## season10 0.729490 0.109460 6.664 9.98e-09 ***
## season11 1.200954 0.109562 10.961 7.38e-16 ***
## season12 1.952202 0.109675 17.800 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1888 on 59 degrees of freedom
## Multiple R-squared: 0.9424, Adjusted R-squared: 0.9306
## F-statistic: 80.4 on 12 and 59 DF, p-value: < 2.2e-16

summary(train.exponential.pred)

##
## Forecast method: Linear regression model
##
## Model Information:
##
## Call:
## tslm(formula = train ~ trend + season, lambda = 0)
##
## Coefficients:
## (Intercept)        trend      season2      season3      season4      season5
##     7.64636      0.02112      0.28201      0.69500      0.37387      0.42171
##     season6      season7      season8      season9     season10     season11
##     0.44705      0.58338      0.54690      0.63557      0.72949      1.20095
##    season12
##     1.95220
##
##
## Error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set Inf Inf Inf NaN NaN NaN NA
##
## Forecasts:
## Point Forecast Lo 0 Hi 0
## Jan 2001 9780.022 9780.022 9780.022
## Feb 2001 13243.095 13243.095 13243.095
## Mar 2001 20441.749 20441.749 20441.749
## Apr 2001 15143.541 15143.541 15143.541
## May 2001 16224.628 16224.628 16224.628
## Jun 2001 16996.137 16996.137 16996.137
## Jul 2001 19894.424 19894.424 19894.424
## Aug 2001 19591.112 19591.112 19591.112
## Sep 2001 21864.492 21864.492 21864.492
## Oct 2001 24530.299 24530.299 24530.299
## Nov 2001 40144.775 40144.775 40144.775
## Dec 2001 86908.868 86908.868 86908.868

accuracy(train.exponential.pred,test)
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set Inf Inf Inf NaN NaN NaN NA
## Test set 4824.494 7101.444 5191.669 12.35943 15.5191 NaN 0.4245018
## Theil's U
## Training set NA
## Test set 0.4610253

#1(c) Which model is the best model considering RMSE as the metric? Could you
# have understood this from the line chart? Explain. Produce the plot showing
# the forecasts from both models along with actual data. In a separate plot,
# present the residuals from both models (consider only the validation set
# residuals).
# Answer 1(c)
# Model B, the multiplicative exponential model, is the better model: its
# validation RMSE of 7101.44 is far lower than the linear model's 27766.20.
# The line chart also hints at this, since the seasonal swings grow with the
# level of the series, which favours a multiplicative model.

# Root Mean Square Error (RMSE) is one of the most popular metrics for
# assessing the accuracy of a forecasting model's predicted values against
# the actual, observed values. The model with the lower RMSE is the better
# model; a manual check of both validation RMSEs is sketched below.
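# A minimal sketch of the RMSE comparison. Model A is re-forecast here from
# the regression object, because train.linear.pred was overwritten with the
# ETS forecast in part (b); rmse() and pred.A are illustrative names.
rmse <- function(actual, predicted) {
  sqrt(mean((as.numeric(actual) - as.numeric(predicted))^2))
}
pred.A <- forecast(train.linear.trend, h = length(test), level = 0)
rmse(test, pred.A$mean)                  # Model A validation RMSE
rmse(test, train.exponential.pred$mean)  # Model B validation RMSE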

plot(train.linear.pred, xlab = "Time", ylab = "Sales",
     main = "Souvenir sales", flty = 2, bty = "l", ylim = c(0, 50000))
lines(train.linear.pred$fitted)
lines(test, col = "blue")
plot(train.exponential.pred, xlab = "Time", ylab = "Sales",
     main = "Souvenir sales", flty = 2, bty = "l", ylim = c(0, 50000))
lines(train.exponential.pred$fitted)
lines(test, col = "blue")

# Answer 1(c) continued


# Plotting the residuals of both the exponential and the linear model
plot(train.exponential.trend$residuals)
plot(train.linear.trend$residuals)
#1(d) Examine the additive model. Which month has the highest average sales
# during the year? What does the estimated trend coefficient in model A mean?
# Answer 1(d)
# The trend is significant given its low p-value.
# The November and December dummies are also significant, with low or
# negligible p-values.
# December has the highest average sales during the year.
# The estimated trend coefficient in model A means that, seasonality held
# constant, average sales increase by about 245 units per month. This is
# significant, as the p-value for the trend term is negligible; the lower
# the p-value, the greater the significance.
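# A quick cross-check of the highest-average-sales month, computed directly
# from the training window (cycle() returns the calendar-month index 1-12);
# this is an illustrative check, not part of the original assignment code.
tapply(train, cycle(train), mean)                         # mean sales per month
month.abb[which.max(tapply(train, cycle(train), mean))]   # expected: "Dec"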

# Publishing the summary of additive model


summary(train.linear.trend)

##
## Call:
## tslm(formula = train ~ trend + season)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12592 -2359 -411 1940 33651
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3065.55 2640.26 -1.161 0.25029
## trend 245.36 34.08 7.199 1.24e-09 ***
## season2 1119.38 3422.06 0.327 0.74474
## season3 4408.84 3422.56 1.288 0.20272
## season4 1462.57 3423.41 0.427 0.67077
## season5 1446.19 3424.60 0.422 0.67434
## season6 1867.98 3426.13 0.545 0.58766
## season7 2988.56 3427.99 0.872 0.38684
## season8 3227.58 3430.19 0.941 0.35058
## season9 3955.56 3432.73 1.152 0.25384
## season10 4821.66 3435.61 1.403 0.16573
## season11 11524.64 3438.82 3.351 0.00141 **
## season12 32469.55 3442.36 9.432 2.19e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5927 on 59 degrees of freedom
## Multiple R-squared: 0.7903, Adjusted R-squared: 0.7476
## F-statistic: 18.53 on 12 and 59 DF, p-value: 9.435e-16

#1(e) Examine the multiplicative model. What does the coefficient of October
# mean? What does the estimated trend coefficient in model B mean?
# Answer 1(e)
# The entire model is significant: the intercept, the trend and all monthly
# dummies have negligible p-values.
# The coefficient for October is positive. Because the model is fitted on
# log(sales) (lambda = 0), each seasonal coefficient measures a
# multiplicative effect relative to January, the reference month: October's
# coefficient of 0.729490 means October sales are exp(0.729490), roughly
# 2.07 times January sales at the same point on the trend.
# Likewise, the estimated trend coefficient in model B means that average
# sales grow by a factor of exp(0.021120), about 2.1%, per month. This is
# significant, as the p-value for the trend term is negligible; the lower
# the p-value, the greater the significance.
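# A short sketch of these back-transformed interpretations, assuming the
# fitted model from part (b) is still in the workspace.
exp(coef(train.exponential.trend)["season10"])  # ~2.07: October vs January
exp(coef(train.exponential.trend)["trend"])     # ~1.021: ~2.1% growth per month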

summary(train.exponential.trend)

##
## Call:
## tslm(formula = train ~ trend + season, lambda = 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.4529 -0.1163 0.0001 0.1005 0.3438
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.646363 0.084120 90.898 < 2e-16 ***
## trend 0.021120 0.001086 19.449 < 2e-16 ***
## season2 0.282015 0.109028 2.587 0.012178 *
## season3 0.694998 0.109044 6.374 3.08e-08 ***
## season4 0.373873 0.109071 3.428 0.001115 **
## season5 0.421710 0.109109 3.865 0.000279 ***
## season6 0.447046 0.109158 4.095 0.000130 ***
## season7 0.583380 0.109217 5.341 1.55e-06 ***
## season8 0.546897 0.109287 5.004 5.37e-06 ***
## season9 0.635565 0.109368 5.811 2.65e-07 ***
## season10 0.729490 0.109460 6.664 9.98e-09 ***
## season11 1.200954 0.109562 10.961 7.38e-16 ***
## season12 1.952202 0.109675 17.800 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1888 on 59 degrees of freedom
## Multiple R-squared: 0.9424, Adjusted R-squared: 0.9306
## F-statistic: 80.4 on 12 and 59 DF, p-value: < 2.2e-16
#1(f) Use the best model type from part (c) to forecast the sales in
# January 2002. Think carefully which data to use for model fitting in this
# case.
# Answer 1(f)
# We use the multiplicative model (Model B), as its lower RMSE makes it the
# better model, and forecast 13 steps ahead from the model fitted on the
# training data to reach January 2002.
# Reasons:
# (a) It has the lower validation RMSE.
# (b) Its R-squared and adjusted R-squared are considerably better, at
#     0.9424 and 0.9306.

# Using Model B, the multiplicative model, with h = 13 to reach January 2002


forecast(train.exponential.trend,h=13)

## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Jan 2001 9780.022 7459.322 12822.72 6437.463 14858.16
## Feb 2001 13243.095 10100.643 17363.21 8716.947 20119.38
## Mar 2001 20441.749 15591.130 26801.46 13455.287 31055.83
## Apr 2001 15143.541 11550.133 19854.91 9967.870 23006.60
## May 2001 16224.628 12374.689 21272.34 10679.469 24649.03
## Jun 2001 16996.137 12963.127 22283.87 11187.297 25821.13
## Jul 2001 19894.424 15173.679 26083.86 13095.024 30224.31
## Aug 2001 19591.112 14942.340 25686.18 12895.376 29763.51
## Sep 2001 21864.492 16676.271 28666.84 14391.774 33217.31
## Oct 2001 24530.299 18709.509 32162.02 16146.477 37267.30
## Nov 2001 40144.775 30618.828 52634.38 26424.329 60989.36
## Dec 2001 86908.868 66286.278 113947.44 57205.664 132035.03
## Jan 2002 12601.017 9570.837 16590.57 8240.964 19267.84

#1(g) Plot the ACF and PACF plots until lag 20 of the residuals obtained
# from the training set of the best model chosen. Comment on these plots and
# think what AR(p) model could be a good choice.
# Answer 1(g)
# In the ACF plot, the first three bars lie outside the significance
# threshold, which means the training residuals are still autocorrelated.
# In the PACF plot, two bars lie outside the threshold.
# This means the residuals are not yet white noise.
# AR(3) could be a good choice, as three bars are outside the threshold.
# Plotting the ACF and PACF with a maximum lag of 20

acf(train.exponential.trend$residuals,lag.max = 20)
pacf(train.exponential.trend$residuals,lag.max = 20)
#1(h) Fit an AR(p) model as you think appropriate from part (g) to the
# training set residuals and produce the regression coefficients. Was your
# intuition at part (g) correct?
# Answer 1(h)
# AR(3) is fitted, as three bars were outside the threshold in the ACF plot.
# When we re-plot the ACF of the AR(3) residuals, all bars fall within the
# threshold, which means the AR(p) model has been fitted well.
# The intuition from the ACF plot in part (g) was therefore correct.
# Fitting the AR(3) model to the training set residuals
manual_ARIMA <- arima(train.exponential.trend$residuals, order = c(3, 0, 0))

# Publishing the summary


summary(manual_ARIMA)

##
## Call:
## arima(x = train.exponential.trend$residuals, order = c(3, 0, 0))
##
## Coefficients:
## ar1 ar2 ar3 intercept
## 0.3469 0.3996 -0.1208 -0.0013
## s.e. 0.1155 0.1138 0.1236 0.0426
##
## sigma^2 estimated as 0.01939: log likelihood = 39.5, aic = -69
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.003079309 0.1392322 0.1138666 2205.758 2506.715 0.839027
## ACF1
## Training set -0.02678944

# Re-plotting the ACF to validate the AR(p) model


Acf(manual_ARIMA$residuals,lag.max=12)
#1(i) Now, using the best regression model and AR(p) model, forecast the
# sales in January 2002. Think carefully which data to use for model fitting
# in this case.
# Answer 1(i)
# The 13th-step AR(3) forecast adjusts the regression forecast from 1(f).
# Because the AR model was fitted to log-scale residuals (Model B uses
# lambda = 0), the adjustment is multiplicative on the sales scale:
# 12601.017 * exp(-0.000368) gives a combined forecast of about 12596.4 for
# January 2002 (see the sketch after the forecast output below).

# Forecasting the residuals 13 steps ahead, out to January 2002

forecast.manual <- forecast(manual_ARIMA,h=13)


forecast.manual

## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Jan 2001 0.0850434040 -0.09338981 0.2634766 -0.1878466 0.3579335
## Feb 2001 0.0832594572 -0.10560507 0.2721240 -0.2055839 0.3721028
## Mar 2001 0.0406981469 -0.16972614 0.2511224 -0.2811180 0.3625143
## Apr 2001 0.0366430171 -0.17673242 0.2500185 -0.2896866 0.3629726
## May 2001 0.0184424186 -0.19900215 0.2358870 -0.3141104 0.3509952
## Jun 2001 0.0156502894 -0.20249319 0.2337938 -0.3179714 0.3492720
## Jul 2001 0.0078978649 -0.21102976 0.2268255 -0.3269230 0.3427188
## Aug 2001 0.0062916903 -0.21279460 0.2253780 -0.3287719 0.3413553
## Sep 2001 0.0029736371 -0.21626539 0.2222127 -0.3323235 0.3382708
## Oct 2001 0.0021173551 -0.21715696 0.2213917 -0.3332338 0.3374685
## Nov 2001 0.0006883262 -0.21861599 0.2199926 -0.3347087 0.3360853
## Dec 2001 0.0002512756 -0.21906077 0.2195633 -0.3351576 0.3356601
## Jan 2002 -0.0003679852 -0.21968597 0.2189500 -0.3357859 0.3350499
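
# A minimal sketch of the combination, assuming the models fitted above are
# still in the workspace; reg.pred and ar.pred are illustrative names. The
# AR(3) forecast lives on the log scale, so it multiplies the level forecast
# through exp().
reg.pred <- forecast(train.exponential.trend, h = 13)  # Model B, as in 1(f)
ar.pred  <- forecast(manual_ARIMA, h = 13)             # AR(3) on residuals
as.numeric(tail(reg.pred$mean, 1)) *
  exp(as.numeric(tail(ar.pred$mean, 1)))               # ~12596.4 for Jan 2002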

Short answer type questions

2(a) Explain the key difference between cross sectional and time
series data.
# Answer 2(a)
# The key difference is that time series data follow the same variable over
# a period of time, while cross sectional data capture multiple variables or
# units at a single point in time. Time series data consist of observations
# of a variable at multiple time intervals, whereas cross sectional data
# consist of observations of many units at the same point in time. For
# example, the profit of an organization over a period of 10 years is time
# series data, while the maximum temperature of multiple cities on a given
# day is cross sectional data.

2(b) Explain the difference between seasonality and cyclicality.


# Answer 2(b)
# Seasonality occurs when a series is influenced by seasonal factors such as
# the quarter of the year, the month, or the day of the week; it has a fixed
# and known period. Seasonal time series are also known as periodic time
# series. Cyclicality refers to rises and falls that are not of a fixed
# period, typically lasting at least two years. If the fluctuations are not
# of a fixed period they are cyclic; if the period is constant, the pattern
# is seasonal. As a rule of thumb, the average length of cycles is longer
# than the length of a seasonal pattern, and the magnitude of cycles tends
# to be more variable than that of seasonal patterns.

2(c) Explain why a centered moving average is not considered suitable
for forecasting.
# Answer 2(c)
# The disadvantages of the simple (centered) moving average for forecasting
# are as follows (see the sketch after this list):
# (i) It cannot handle trend properly.
# (ii) It cannot handle data with seasonal or cyclical variations.
# (iii) It requires keeping records of several past periods for each
# forecasted period, making it cumbersome to implement.
# (iv) It gives equal weight to every period in the window, which makes the
# forecasts lag behind the general trend.
# (v) It does not respond well to changes that happened for a reason, such
# as a sharp increase in online sales during the pandemic period.
# (vi) It cannot handle multiple and complex relationship patterns in the
# data. In addition, being centered, it needs future observations to compute
# the current value, so it produces no values at the ends of the series.
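
# A small illustration of the last point, reusing ma() from the forecast
# package on the souvenir series: the centered average is NA at both ends,
# so it offers nothing for periods beyond the observed sample.
cma <- ma(tseriesdata, order = 12, centre = TRUE)
head(cma, 8)   # leading NAs
tail(cma, 8)   # trailing NAs: no values to extend into the future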

2(d) Explain stationarity and why it is important for some time series
forecasting methods.
# Answer 2(d)
# Stationarity is an important concept in time series analysis. A stationary
# series has a constant mean and a constant variance over time. If the data
# are non-stationary, a transformation can be applied to make them
# stationary; the most common transforms are differencing and the logarithm.
# The ADF test, also known as the "unit root test", is a statistical test of
# the null hypothesis that the data are non-stationary: a p-value below a
# threshold (1% or 5%) suggests we reject that null. Many forecasting
# methods assume that the process generating the series does not change
# over time, and the best indication of this is a stationary history. This
# does not mean that every data point must have the same value, but that
# the overall statistical behavior of the data remains constant.
# To borrow from Wikipedia, "a stationary process is a process whose
# unconditional joint probability distribution does not change when shifted
# in time."
# Reference: https://towardsdatascience.com/why-does-stationarity-matter-in-time-series-analysis-e2fb7be74454
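
# A quick sketch of the ADF test on the souvenir series, using adf.test()
# from the tseries package loaded at the top; the printed p-values are what
# decide whether the unit-root null can be rejected.
adf.test(tseriesdata)             # unit-root test on the raw series
adf.test(diff(log(tseriesdata)))  # same test after a log transform and differencing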

2(e) How does an ACF plot help to identify whether a time series is
stationary or not?
# Answer 2(e)
# If bars extend beyond the significance threshold in the ACF plot across
# many lags, the series has trend and/or seasonality and is non-stationary;
# if all bars are within the threshold, the data can be considered
# stationary. Autocorrelation is the correlation of a series with a lagged
# copy of itself. When the ACF is plotted for increasing lags (a
# correlogram), the values decay to zero quickly for a stationary series,
# while for non-stationary data the decay is much more gradual. In our
# souvenir sales dataset, the bars at the first lags were well outside the
# threshold, indicating trend and seasonality, which means the data are not
# stationary.

2(f) Why is partitioning time series data into training, validation, and
test sets not recommended? Describe briefly two considerations for choosing
the width of the validation period.
# Answer 2(f)
# We need to simulate the production situation, where after training a model
# we evaluate it on data that arrives after the model was created. Randomly
# sampling observations into training and validation sets is therefore not a
# robust course of action. The validation set is used to choose between
# models, and a random sample from a time series is easy to draw but not
# representative of the business objective. Instead, the validation set
# should be a continuous section of the latest dates: the earlier data form
# the training set and the later data form the validation set. The width of
# the validation period depends on the forecast horizon (it should be at
# least as long as the horizon we care about) and on the number of
# hyperparameters: the larger the number of hyperparameters, the larger the
# validation set should be.

2(g) Both smoothing and ARIMA methods of forecasting can handle
time series data with missing values. True/False. Explain.
# Answer 2(g)
# This is False. Of the two, only the ARIMA method of forecasting handles
# missing values in time series data. Exponential smoothing is a way to
# smooth data for presentations or to make forecasts, used widely in finance
# and economics; even for a series with an unclear pattern it can produce a
# forecast, but it requires a complete series. ARIMA models can be fitted
# with missing values because all ARIMA models are state space models, and
# the Kalman filter used to fit state space models handles missing values by
# simply skipping the update step. So forecasting with missing data is
# possible with ARIMA.
# Reference: https://stats.stackexchange.com/questions/346225/fitting-arima-to-time-series-with-missing-values
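
# A minimal demonstration, knocking two observations out of the souvenir
# series: arima() still fits, via the Kalman filter (method = "ML" uses the
# exact likelihood, which simply skips the missing update steps).
y <- tseriesdata
y[c(10, 25)] <- NA                           # introduce two gaps
arima(y, order = c(1, 0, 0), method = "ML")  # fits despite the NAs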

2(h) Additive and multiplicative decomposition differ in the way the
trend is computed. True/False. Explain.
# Answer 2(h)
# This is true in the sense that the two decompositions remove the trend
# differently. In an additive decomposition, the detrended series is
# obtained by subtracting the trend estimate from the series; in a
# multiplicative decomposition, it is obtained by dividing the series by the
# trend estimate. The simplest method for estimating the seasonal component
# is then to average the detrended values for each season. The same
# difference shows up in the random component:
# For the additive model, random = series - trend - seasonal.
# For the multiplicative model, random = series / (trend * seasonal).
# Reference: https://online.stat.psu.edu/stat510/lesson/5/5.1
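
# A short sketch with base R's decompose(), which supports both forms; the
# comments mark how the trend estimate is removed in each case.
dec.add  <- decompose(tseriesdata, type = "additive")        # series - trend
dec.mult <- decompose(tseriesdata, type = "multiplicative")  # series / trend
plot(dec.mult)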
2(i) After accounting for trend and seasonality in a time series, the
analyst observes that there is still correlation left amongst the residuals
of the time series. Is that good or bad news for the analyst? Explain.
# Answer 2(i)
# This is bad news for the analyst: it means there is information left in
# the residuals that could be used to make forecasts, so the model can still
# be improved. The "residuals" in a time series model are what is left over
# after fitting the model. A good forecasting method will yield residuals
# with the following properties:
# (i) The residuals are uncorrelated. If there are correlations between
# residuals, then there is information left in the residuals which should be
# used in computing forecasts.
# (ii) The residuals have zero mean. If the residuals have a mean other than
# zero, then the forecasts are biased.
# Any forecasting method that does not meet these properties can be
# improved, forcing the analyst to make further refinements. This does not
# mean that forecasts meeting these properties cannot be improved further;
# there are different forecasting methods an analyst can try on the same
# dataset to satisfy these properties. A Ljung-Box test (sketched below) is
# a standard check for leftover residual autocorrelation.
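
# An illustrative check for leftover autocorrelation in Model B's training
# residuals: a small Ljung-Box p-value would confirm that correlation
# remains and the model can still be improved.
Box.test(train.exponential.trend$residuals, lag = 20, type = "Ljung-Box")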
