
FORECASTING OF A TIME SERIES DATA

SUBMITTED BY : KURTIKA SINHA

ROLL: DST-17/18-013
Dataset: Monthly Airline Passenger Numbers 1949-1960

Description of the Dataset:


No. of observations: 144 observations
Start date: Jan 1949
End date: Dec 1960

The dataset contains the variable:


 Monthly airline passenger numbers (in thousands)

Data source: https://vincentarelbundock.github.io/Rdatasets/datasets.html

Sample of the dataset:


Month Passengers (in thousands)
1949-01 112
1949-02 118
1949-03 132
1949-04 129
1949-05 121
1949-06 135
1949-07 148
1949-08 148
1949-09 136
1949-10 119
1949-11 104
1949-12 118
1950-01 115
1950-02 126

Objective: Perform a complete time series analysis to forecast future airline passenger
numbers
TIME SERIES

A time series is defined as a set of observations on a variable generated sequentially in time. The
measurement of the variable may be made continuously or at discrete (usually equally
spaced) intervals.

There are four components to a time series: the trend, the cyclical variation, the seasonal variation, and
the irregular variation.

SECULAR TREND: The smooth long-term direction of a time series is called trend.

SEASONAL VARIATION: It is a periodic or oscillatory behavior of time series. Its period is fixed and
generally not longer than a year.

CYCLICAL VARIATION: It is the rise and fall of a time series over periods longer than one year. The period
of oscillation is not fixed.

IRREGULAR VARIATION: Many analysts prefer to subdivide the irregular variation into episodic and
residual variations. Episodic fluctuations are unpredictable, but they can be identified. After the episodic
fluctuations have been removed, the remaining variation is called the residual variation.
STEPS OF TIME SERIES ANALYSIS
Step 1:
We have to deseasonalize the raw data. To do this we need to find the seasonal index and subtract
it from the original data.
Seasonal Index: It measures the variations that occur at specific regular intervals of less than a year,
such as weekly, monthly or quarterly.
Deseasonalization: It is the process of removing the seasonal index from the original data. The steps of
deseasonalization are as follows:
 Estimate the trend by the moving average method.
 Subtract the trend value from the actual series; the resulting values are known as
errors.
 Collect the errors for every month (for quarterly data, collect them for every quarter).
 Find the average error of each month over all the years.
 Take the grand average of these monthly averages.
 Subtract the grand average from each monthly average; the subtracted value
is called the seasonal index.
 Finally, subtract the corresponding seasonal index from the original series to get
the deseasonalized series.
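The steps above can be sketched in R with the built-in AirPassengers series (the same data as in this report); the log transform and the simple 12-term moving average are illustrative choices here:

```r
# Deseasonalization sketch: trend by moving average, then monthly seasonal index
y <- log(AirPassengers)                                # log stabilises the variance
trend <- stats::filter(y, rep(1/12, 12), sides = 2)    # 12-month moving average
err <- y - trend                                       # actual minus trend = error
# average error for each calendar month, centred on the grand average
monthly_avg <- tapply(err, cycle(y), mean, na.rm = TRUE)
seasonal_index <- monthly_avg - mean(monthly_avg)
# subtract each month's seasonal index from the original series
deseasonalized <- y - seasonal_index[cycle(y)]
```

For monthly data a centred 2x12 moving average is more conventional; the plain 12-term average is kept here only to match the steps listed above.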

Step 2:
After getting the seasonally adjusted data, we now have to check whether the series is stationary or not.
The Dickey-Fuller test can be used to check whether the series is stationary.
Let us consider the model
y_t = δ_0 + δ_1 * t + u_t
u_t = φ * u_{t-1} + e_t
The Dickey-Fuller test equation is
Δy_t = α + β * t + γ * y_{t-1} + e_t
where α is a constant and β is the coefficient on the time trend.
Null Hypothesis,
𝐻0 : 𝛾 = 0 (Non Stationary)
Alternative Hypothesis,
𝐻1 : 𝛾 < 0 (Stationary)
The test statistic is denoted by τ and is given by
τ = γ̂ / S.E.(γ̂)
Conclusion: comparing τ with the Dickey-Fuller critical value:
Case 1: If H0 cannot be rejected, then the series is non-stationary. In that case we have to
difference the series and then check again whether it is stationary or not.
Case 2: If H0 is rejected, then we conclude that the series is stationary.
The Dickey-Fuller test assumes that u_t depends only on u_{t-1}. In the Augmented Dickey-Fuller (ADF) test we
allow u_t to depend on further lags as well. The ADF test equation is
Δy_t = α + β * t + γ * y_{t-1} + φ * Δy_{t-1} + e_t
where α is a constant and β is the coefficient on the time trend.

Then follow the steps as in the DF test.
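The ADF regression above can be sketched directly with lm(); the series and the single lagged difference used here are illustrative choices (the appendix uses urca::ur.df for the real analysis):

```r
# ADF regression with one lagged difference, fitted by ordinary least squares:
# diff(y)_t = alpha + beta*t + gamma*y_{t-1} + phi*diff(y)_{t-1} + e_t
y  <- as.numeric(log(AirPassengers))
n  <- length(y)
dy <- diff(y)                 # dy[t-1] corresponds to y[t] - y[t-1]
d_y    <- dy[2:(n - 1)]       # response: change at time t, t = 3..n
y_lag1 <- y[2:(n - 1)]        # y_{t-1}
tt     <- 3:n                 # deterministic time trend
d_lag1 <- dy[1:(n - 2)]       # lagged difference term
reg <- lm(d_y ~ y_lag1 + tt + d_lag1)
# tau = gamma_hat / S.E.(gamma_hat), compared with Dickey-Fuller critical
# values rather than the usual t table
tau <- coef(reg)["y_lag1"] / coef(summary(reg))["y_lag1", "Std. Error"]
```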


Step3:
Correlogram Analysis:
In time series analysis, the correlogram is used to identify the model and its order. It comprises the
autocorrelation function (ACF) and the partial autocorrelation function (PACF).

Case 1: If the ACF is diminishing and the PACF is zero from lag (p+1) onwards, then the
model is an AR model of order p.
Case 2: If the ACF is zero from lag (q+1) onwards and the PACF is diminishing, then the
model is an MA model of order q.
Case 3: If both the ACF and the PACF are diminishing, then the model is
ARMA of order (p, q). To identify the order we estimate ARMA (1,1), ARMA (1,2),
ARMA (2,1), ARMA (2,2) and so on, note the AIC (Akaike Information Criterion) or SIC
(Schwarz Information Criterion) of each, and choose the ARMA model with the
minimum AIC or SIC.
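The AIC comparison described in Case 3 can be sketched as follows; the differenced log AirPassengers series and the small (p, q) grid are illustrative assumptions:

```r
# Compare candidate ARMA(p, q) fits by AIC and pick the minimum
y <- diff(log(AirPassengers))          # a roughly stationary series
grid <- expand.grid(p = 1:2, q = 1:2)  # candidate orders
grid$AIC <- apply(grid, 1, function(o)
  AIC(arima(y, order = c(o["p"], 0, o["q"]))))
grid[which.min(grid$AIC), ]            # the candidate with the smallest AIC
```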
Test for ACF (Ljung-Box Test):
Null Hypothesis,
𝐻0 : 𝜌1 = 𝜌2 = 𝜌3 = 𝜌4 = ⋯ = 𝜌𝑘 = 0
Alternative Hypothesis;
𝐻1 : At least one inequality holds.

The test statistic is denoted by Q_k and is given by

Q_k = n(n+2) * Σ_{j=1}^{k} ρ̂_j² / (n−j) ~ χ²(k)

where n is the no. of observations and ρ̂_j is the estimated value of the j-th autocorrelation,
ρ̂_j = Cov(y_t, y_{t−j}) / (σ_{y_t} * σ_{y_{t−j}})
Conclusion: comparing Q_k with the χ² critical value:

Case 1: If H0 cannot be rejected, then the y_t series is white noise. In that case there is no need for a model.
Case 2: If H0 is rejected, then the sample ACF is diminishing. Therefore the model may be AR(p) or
ARMA(p, q).
Case 3: If H0 is rejected up to lag q but accepted from lag (q+1) onwards, then the model may be an
MA(q) model.
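The Q_k statistic above is implemented in base R as Box.test(); the differenced log series and the lag of 12 are illustrative choices:

```r
# Ljung-Box test for autocorrelation up to lag 12
y  <- diff(log(AirPassengers))
bt <- Box.test(y, lag = 12, type = "Ljung-Box")
bt$statistic   # Q_12
bt$p.value     # a small p-value means we reject H0: the series is not white noise
```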

Test for PACF:


Null Hypothesis;
𝐻0 : 𝜙1 = 0
Alternative Hypothesis;
𝐻1 : 𝜙1 ≠ 0
The test statistic is denoted by τ_1 and is given by
τ_1 = √n * φ̂_1 ~ N(0, 1)
where n is the sample size and φ̂_1 is the estimated value of φ_1.

Conclusion: Comparing with critical value of standard normal distribution:

Case1: If null hypothesis is rejected, then we formulate next hypothesis to see whether 𝜙2 = 0
or not.
Case2: Suppose 𝐻0 cannot be rejected, i.e. PACF cuts after one lag.
Case3: If all 𝐻0 are rejected then PACF is diminishing.
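The √n * φ̂ test can be applied to each partial autocorrelation in turn; the lag.max of 20 here is an arbitrary illustrative choice:

```r
# Test H0: phi_k = 0 for each partial autocorrelation using tau_1 = sqrt(n)*phi_hat
y   <- diff(log(AirPassengers))
n   <- length(y)
phi <- as.numeric(pacf(y, lag.max = 20, plot = FALSE)$acf)
sig <- which(abs(sqrt(n) * phi) > 1.96)   # lags where H0 is rejected at 5%
sig
```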
Step4:
Coefficient Estimation and Testing:
Coefficients of a model are estimated by one of the following two methods:
 Ordinary least squares (OLS),
 Maximum likelihood estimation (MLE).
After finding the coefficients of the model we have to test whether individual coefficients of our
model are significant or not.

Null Hypothesis;
𝐻0 : 𝜙1 = 0
Alternative Hypothesis;
𝐻1 : 𝜙1 ≠ 0

The test statistic is denoted by τ_2 and is given by

τ_2 = φ̂_1 / S.E.(φ̂_1)
Conclusion:
If the p-value is less than the chosen level of significance α (e.g. 0.05), then we reject the null hypothesis.
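The τ_2 ratios can be computed from a fitted model's estimates and standard errors; the AR(8) order on the differenced log series is an illustrative assumption here:

```r
# Coefficient significance: each estimate divided by its standard error
y    <- diff(log(AirPassengers))
fit  <- arima(y, order = c(8, 0, 0))        # AR(8) fit (maximum likelihood)
tau2 <- fit$coef / sqrt(diag(fit$var.coef)) # tau_2 for each coefficient
round(tau2, 2)  # |tau2| > 1.96 suggests significance at the 5% level
```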

Step5:
Residual Diagnostic Test:

Residual analysis is used to test whether the residuals follow white noise or not.

Null Hypothesis;
𝐻0 : The correlation between residuals is zero.
Alternative Hypothesis;
𝐻1 : At least one inequality holds.

The test statistic is denoted by Q_k and is given by

Q_k = n(n+2) * Σ_{j=1}^{k} ρ̂_j² / (n−j) ~ χ²(k)

where n is the no. of observations and ρ̂_j is the estimated value of the j-th autocorrelation,
ρ̂_j = Cov(e_t, e_{t−j}) / (σ_{e_t} * σ_{e_{t−j}})
Conclusion:
Case 1: If H0 cannot be rejected, then we have correctly specified the model.
Case 2: If H0 is rejected, then we have to add another lag and repeat the same procedure.
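The residual white-noise check can be sketched with Box.test() applied to the residuals; the AR(8) order is illustrative, and fitdf discounts the estimated AR coefficients:

```r
# Ljung-Box diagnostic on the residuals of a fitted AR model
y   <- diff(log(AirPassengers))
fit <- arima(y, order = c(8, 0, 0))
bt  <- Box.test(residuals(fit), lag = 20, type = "Ljung-Box", fitdf = 8)
bt  # a large p-value suggests the residuals are white noise
```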
Step6:
Forecasting:
In time series analysis, a forecast is the prediction of short-term behaviour from previous patterns. To
find the likely values at future time points based on a given series of n observations,
we use the following procedure.

Formula:
f_{n,h} = E(X_{n+h} | X_n, X_{n−1}, …, X_1),
where h is the number of periods ahead,
and n is the no. of observations.
Rules for forecasting:
E(e_{n+j} | X_n, X_{n−1}, …, X_1) = 0, when j > 0
E(e_{n+j} | X_n, X_{n−1}, …, X_1) = ê_{n+j}, when j ≤ 0
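The h-step-ahead forecast f_{n,h} can be sketched with a fitted model and predict(); the SARIMA order here mirrors the one used in the appendix:

```r
# Forecast 15 periods ahead from a seasonal ARIMA fit on the log series
fit <- arima(log(AirPassengers), order = c(8, 1, 0),
             seasonal = list(order = c(1, 1, 0), period = 12))
fc  <- predict(fit, n.ahead = 15)
exp(fc$pred)   # back-transform the log-scale forecasts to passenger counts
```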
ANALYSIS:
Here are my observations:
1. There is a trend component: the number of passengers grows year by year.
2. There appears to be a seasonal component with a cycle of 12 months or less.
3. The variance of the data keeps increasing with time, so
we need to stabilise the unequal variance. We do this by taking the log of the series.
Step1: First we have to deseasonalize the data (if seasonality is present) in order to get a stationary series for
further analysis. After deseasonalizing the series we get the seasonally adjusted data.

Step2: Now using Augmented Dickey Fuller (ADF) test, we have to check whether the deseasonalized
series is stationary or not. By using ADF test we get the following result:

###############################################
# Augmented Dickey-Fuller Test Unit Root Test #
###############################################
Test regression trend

Call:
lm(formula = z.diff ~ z.lag.1 + 1 + tt + z.diff.lag)
Residuals:
Min 1Q Median 3Q Max
-0.102892 -0.018405 0.002371 0.018950 0.081424
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.6316699 0.3268251 1.933 0.05568 .
z.lag.1 -0.1245162 0.0682526 -1.824 0.07065 .
tt 0.0011203 0.0006861 1.633 0.10520
z.diff.lag1 -0.2144397 0.1008692 -2.126 0.03561 *
z.diff.lag2 -0.0026793 0.1000602 -0.027 0.97868
z.diff.lag3 -0.1068487 0.0968075 -1.104 0.27198
z.diff.lag4 -0.2765146 0.0962991 -2.871 0.00485 **
z.diff.lag5 -0.1039818 0.0936348 -1.111 0.26906
z.diff.lag6 0.0241358 0.0920719 0.262 0.79367
z.diff.lag7 -0.1610850 0.0911941 -1.766 0.07994 .
z.diff.lag8 -0.2595426 0.0865078 -3.000 0.00330 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.03319 on 117 degrees of freedom


Multiple R-squared: 0.2289, Adjusted R-squared: 0.163
F-statistic: 3.473 on 10 and 117 DF, p-value: 0.0004999

Value of test-statistic is: -1.8243 10.6565 2.8181


Critical values for test statistics:
1pct 5pct 10pct
tau3 -3.99 -3.43 -3.13
phi2 6.22 4.75 4.07
phi3 8.43 6.49 5.47

Here, the value of the test statistic is -1.8243 and at the 1% level of significance the critical value is -3.99, so the test
statistic is greater than the critical value. Hence we fail to reject the null hypothesis, i.e. the series is non-
stationary.
So, we have to do first differencing to remove the trend.
Step 3: After differencing the series, we use the ADF test and we get the following result:

###############################################
# Augmented Dickey-Fuller Test Unit Root Test #
###############################################
Test regression trend

Call:
lm(formula = z.diff ~ z.lag.1 + 1 + tt + z.diff.lag)
Residuals:
Min 1Q Median 3Q Max
-0.094974 -0.014845 0.000679 0.022692 0.087338
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.871e-02 8.401e-03 4.607 1.05e-05 ***
z.lag.1 -2.534e+00 3.706e-01 -6.836 3.91e-10 ***
tt -1.574e-04 7.961e-05 -1.977 0.050434 .
z.diff.lag1 1.207e+00 3.400e-01 3.550 0.000555 ***
z.diff.lag2 1.112e+00 3.061e-01 3.631 0.000420 ***
z.diff.lag3 9.895e-01 2.658e-01 3.722 0.000305 ***
z.diff.lag4 6.323e-01 2.209e-01 2.862 0.004982 **
z.diff.lag5 4.709e-01 1.790e-01 2.630 0.009676 **
z.diff.lag6 4.519e-01 1.364e-01 3.313 0.001230 **
z.diff.lag7 2.643e-01 8.357e-02 3.163 0.001991 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.03238 on 117 degrees of freedom
Multiple R-squared: 0.6966, Adjusted R-squared: 0.6732
F-statistic: 29.84 on 9 and 117 DF, p-value: < 2.2e-16

Value of test-statistic is: -6.8363 15.7947 23.6865


Critical values for test statistics:
1pct 5pct 10pct
tau3 -3.99 -3.43 -3.13
phi2 6.22 4.75 4.07
phi3 8.43 6.49 5.47

Here, the calculated value is -6.8363, which is less than the tabulated value -3.99 at the 1% level of
significance. So we reject the null hypothesis at the 1% level of significance and conclude that the
observed series is stationary.
Now, we plot log of the original data, deseasonalized data and stationary data.

Step 4: After getting the stationary series, we have to do the correlogram analysis for predicting
the model. And therefore we have to find the ACF and the PACF. After finding the ACF and the
PACF the following results are obtained:
From the plots, we see that the ACF is diminishing and the PACF cuts off after lag 8. Since the
ACF is diminishing and the PACF cuts off after some lag, the suggested model is AR,
specifically AR(8) (an AR model of order 8).

Step 5: Now, we have to find the estimate of the coefficients of the lag values including the
intercept. And by OLS method we find the estimates of the coefficients, which are given as:
Lags Estimates
Intercept 0.02439525
Y(lag1) -0.31990573
Y(lag2) -0.09136468
Y(lag3) -0.16368423
Y(lag4) -0.33385189
Y(lag5) -0.11598302
Y(lag6) 0.01107708
Y(lag7) -0.17680286
Y(lag8) -0.25164359

After finding the estimated values, we have to find the residuals by subtracting the estimated values (ŷ) from the
original values (y).

Step 6: In this step we have to test the residuals; this procedure is known as residual
analysis. Here we check whether the residuals are white noise or not. Using the
Ljung-Box test, we conclude that the residual series is white noise, with no persistent
behaviour.

Hence, we conclude that the final model is AR(8).

Step 7: Now the final step is forecasting. After forecasting, the forecasted data are plotted in the
following graph:
And the forecasted values are:

Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
Jan 1961 448.0623 433.8106 462.3140 426.2661 469.8585
Feb 1961 420.8662 403.6775 438.0549 394.5784 447.1541
Mar 1961 463.7743 443.6484 483.9002 432.9944 494.5542
Apr 1961 499.7590 477.9085 521.6096 466.3415 533.1765
May 1961 510.7177 488.0791 533.3563 476.0950 545.3405
Jun 1961 575.3002 551.6361 598.9642 539.1091 611.4912
Jul 1961 660.4971 635.9330 685.0612 622.9296 698.0646
Aug 1961 646.9421 621.8731 672.0111 608.6024 685.2818
Sep 1961 549.7747 524.3503 575.1991 510.8914 588.6579
Oct 1961 499.4111 473.4258 525.3964 459.6700 539.1522
Nov 1961 431.3447 404.7778 457.9116 390.7141 471.9753
Dec 1961 473.7939 446.3706 501.2172 431.8536 515.7342
Jan 1962 489.3953 456.3452 522.4454 438.8495 539.9411
Feb 1962 462.7039 426.7138 498.6939 407.6619 517.7458
Mar 1962 504.7662 465.7141 543.8183 445.0411 564.4912
APPENDIX:
library(readxl)
library(stats)
library(lawstat)
library(urca)
library(forecast)

AirPass=read.csv("C:\\Users\\DELL\\Desktop\\airpassengers.csv")
AirPass1=ts(AirPass,start = c(1949,1),end = c(1960,12),frequency = 12)
ts.plot(AirPass1,ylab="No. of Air Passengers")
air_pass=log(AirPass)
air_pass=ts(air_pass,start = c(1949,1),end = c(1960,12),frequency = 12)
ts.plot(air_pass)
View(AirPass)
decom_air_pass=decompose(air_pass)

seas_adj=air_pass-decom_air_pass$seasonal
ts.plot(seas_adj)

summary(ur.df(seas_adj,type = "trend",lags = 15,selectlags = "AIC"))

ndiffs(seas_adj)
diff_seas_adj=diff(seas_adj)
ts.plot(diff_seas_adj,ylim=c(0,6.5),col="blue")
lines(seas_adj,col="green")
lines(air_pass,col="red")
legend(1951,4,c("Air Passenger","Seasonally Adjusted","Difference of seasonally adjusted"),col=c("red","green","blue"),lty=c(1,1,1))
summary(ur.df(diff_seas_adj,lags = 15,type = "trend",selectlags = "AIC"))

## Ljung Box test#####

acf=sarima::autocorrelations(diff_seas_adj,maxlag=50)
typeof(acf)
acf1=acf^2
n=length(diff_seas_adj)

for(i in 1:50){
Q=n*(n+2)*sum(acf1[1:i]/((n-c(1:i))))
if(Q<=qchisq(0.95,df=i)){
print("accept")
}else{
print("reject")
}
}
Acf(diff_seas_adj)

######### pacf test #########

pacf=sarima::partialAutocorrelations(diff_seas_adj,maxlag=50)
T=c()
for(i in 1:50){
T[i]=abs(pacf[i]*sqrt(n))
if(T[i]<=1.96){
print("accept")
}else{print("reject")}
}

Pacf(diff_seas_adj)

order=max(which(T>=1.96))

###

library(dplyr)
diff_seas_adj=as.vector(diff_seas_adj)
m1=matrix( ,length(diff_seas_adj),8)
for (i in 1:8) {
m1[,i]= lag(diff_seas_adj,i)
}

View(m1)

m1=as.data.frame(m1)

m2=as.matrix(m1[-c(1:12),])
View(m2)

m3=as.matrix(cbind( 1,m2))
View(m3)

m4=as.matrix(cbind(diff_seas_adj[-c(1:12)],m3))
View(m4)

library(matlib)
cf=inv(t(m3)%*%m3)%*%(t(m3)%*%m4[,1])
cf

res=m4[,1]-m3%*%cf
View(res)
res_acf=sarima::autocorrelations(res,maxlag=50)
typeof(res_acf)
res_acf1=res_acf^2
n1=length(res)

for(i in 1:50){
Q1=n1*(n1+2)*sum(res_acf1[1:i]/((n1-c(1:i))))
if(Q1<=qchisq(0.95,df=i)){
print("accept")
}else{
print("reject")
}
}

fit <- arima(air_pass, c(8, 1, 0),seasonal = list(order = c(1, 1, 0), period = 12))
pred= forecast(fit,h=15)

plot(pred)
