
Predictive Analytics

Project Report

Title: Building a Stock Index Returns Forecasting Model Using R

Submitted to:

Prof. Bhargab Chattopadhyay

Submitted by:

Group-11
Tarun Chintam (1810018)
Joseph Thomas Kurisunkal (1810025)
MVL Sravan Kumar (1810028)
Saurabh Yadav (1810099)

Contents

Introduction
Problem statement
Literature Review
Data analysis
Procedure
Building the ARIMA model using R programming
Inference
References

Introduction
We live in an information age in which nearly every interaction with the world leaves a
digital footprint, and these footprints generate data. Analysing this data yields insights for
the business world. This has been made possible by the powerful computers and technological
innovations of the past decades, and the analysis of such data is causing new businesses to emerge.

The stock market was well established before these innovations, but they have opened up many
new possibilities in it. One such possibility is predicting stock market movements from past
data. Although an efficient market is supposed to be unpredictable, real markets are at times
far from efficient, which creates opportunities for data analysts. This project therefore aims
to develop a very basic model that can reasonably predict stock market returns. To do so, we
apply the concepts of data analysis to build a model in R. We use R because it is open source,
offers the libraries needed for data analysis, and is versatile and robust.

Problem statement
Movements in the Indian stock market are mostly represented by movements of the NIFTY 50 and
SENSEX indexes. For this project we use NIFTY 50 index data. We obtain three years of daily
NIFTY returns, build a time series forecasting model on a training set, predict the test set,
and measure the accuracy of the model. Ninety percent of the data is used for training and the
remaining ten percent for testing. Forecasting means either predicting the values of a variable
from its own historical data points or predicting the change in a dependent variable in
response to changes in independent variables. We use time series forecasting, a quantitative
forecasting approach based on statistical principles and concepts.
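
The 90/10 split described above is chronological. A minimal sketch in base R, assuming simulated returns in place of the NIFTY 50 series (the names returns, train and test are illustrative and not part of the project code):

```r
# Simulated daily log returns stand in for the NIFTY 50 series (assumption).
set.seed(1)
returns <- rnorm(750, mean = 0, sd = 0.01)

# Chronological split: the first 90% of observations train the model and
# the most recent 10% are held out. The data is not shuffled, because the
# order of the observations is what a time series model learns from.
n <- length(returns)
n_train <- floor(0.9 * n)
train <- returns[1:n_train]
test <- returns[(n_train + 1):n]

cat("training observations:", length(train), "\n")
cat("test observations:", length(test), "\n")
```

Holding out the most recent observations mimics how the model would be used in practice, where only past data is available at forecast time.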

Literature Review
In the paper “Forecasting stock market returns”, published in the Journal of Financial
Economics, Miguel A. Ferreira and Pedro Santa-Clara propose forecasting separately the three
components of stock market returns (the dividend-price ratio, earnings growth, and
price-earnings ratio growth), an approach they call the sum-of-the-parts method. Their method
exploits the different time series persistence of the components and obtains out-of-sample
R-squares of more than 1.3% with monthly data and 13.4% with yearly data. This compares with
the typically negative R-squares obtained in a similar experiment with predictive regressions.

In the paper “Stock index forecasting based on a hybrid model”, published by Elsevier, Ju-Jie
Wang, Jian-Zhou Wang, Zhe-George Zhang, and Shu-Po Guo note that forecasting a stock market
price index is a challenging task. The exponential smoothing model (ESM), the autoregressive
integrated moving average model (ARIMA), and the back propagation neural network (BPNN) can
all be used to make forecasts based on time series. The paper proposes a hybrid approach
combining ESM, ARIMA, and BPNN as more advantageous than any of the three models alone. The
weights of the proposed hybrid model (PHM) are determined by a genetic algorithm (GA). The
closing of the Shenzhen Integrated Index (SZII) and the opening of the Dow Jones Industrial
Average Index (DJIAI) are used as illustrative examples to evaluate the performance of the
PHM. Numerical results show that the proposed model outperforms all the traditional models,
including ESM, ARIMA, BPNN, the equal weight hybrid model (EWH), and the random walk model
(RWM).

We use the second paper as our reference. From it we take the autoregressive integrated
moving average (ARIMA) model, as it is within the scope of our project and can be implemented
in R.

Data analysis
There are four popular time series forecasting techniques: (1) autoregressive (AR) models,
(2) moving average (MA) models, (3) seasonal regression models, and (4) distributed lag
models. We combine the AR and MA models into an ARIMA model. ARIMA stands for Autoregressive
Integrated Moving Average and is also known as the Box-Jenkins approach.

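$$Y_t = \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + \dots + \phi_p Y_{t-p} + \epsilon_t + \theta_1 \epsilon_{t-1} + \theta_2 \epsilon_{t-2} + \dots + \theta_q \epsilon_{t-q}$$

Where $Y_t$ is the differenced time series value, $\phi_i$ and $\theta_j$ are unknown
parameters, and the $\epsilon_t$ are independent, identically distributed error terms with
zero mean. Here, $Y_t$ is expressed in terms of its own past values and the current and past
values of the error terms.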
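
To make the equation concrete, the sketch below generates a series directly from the recursion for the simplest mixed case, p = 1 and q = 1, and then recovers the parameters with the arima function from base R. The coefficient values 0.6 and 0.3 are illustrative assumptions, not estimates from the NIFTY data.

```r
set.seed(42)
n <- 2000
phi <- 0.6     # AR(1) coefficient (illustrative value)
theta <- 0.3   # MA(1) coefficient (illustrative value)

eps <- rnorm(n)   # independent zero-mean error terms
y <- numeric(n)
y[1] <- eps[1]
for (t in 2:n) {
  # Current value = AR part + current error + MA part
  y[t] <- phi * y[t - 1] + eps[t] + theta * eps[t - 1]
}

# Fit an ARMA(1,1) without a mean term and inspect the estimates
fit <- arima(y, order = c(1, 0, 1), include.mean = FALSE)
est <- coef(fit)
print(round(est, 3))
```

With a long simulated series the estimates ar1 and ma1 should land close to the values used in the recursion. R's arima writes the MA terms with plus signs, which matches the sign convention of the equation above.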

Procedure
In time series regression the data must first be made stationary, meaning that its mean,
variance, and autocorrelation are constant over time. We use the Augmented Dickey-Fuller (ADF)
unit root test to check the stationarity of the data. If the series is not stationary, we
apply differencing. The differenced values form a new time series, which is again tested for
stationarity with the ADF test. We repeat the differencing until the series is stationary.
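
The test-and-difference loop can be sketched as follows. To keep the sketch runnable with base R alone, it swaps in the Phillips-Perron unit root test (PP.test from the stats package) for the ADF test; both test the null hypothesis of a unit root. The simulated random walk, the 5% threshold and the cap of two differences are assumptions made for illustration.

```r
set.seed(7)
n <- 500
prices <- cumsum(rnorm(n))   # random walk: non-stationary by construction

y <- prices
d <- 0
max_d <- 2                   # safety cap on the number of differences
# Keep differencing while the unit-root null is not rejected at the 5% level
while (d < max_d && PP.test(y)$p.value > 0.05) {
  y <- diff(y)               # first difference of the current series
  d <- d + 1
}

cat("differences applied:", d, "\n")
cat("unit-root test p-value after differencing:", PP.test(y)$p.value, "\n")
```

The final value of d is the "I" order of the ARIMA model. With tseries loaded, adf.test(y)$p.value could replace PP.test(y)$p.value in the loop condition.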

Then we use the autocorrelation function (ACF) and the partial autocorrelation function (PACF)
to identify the order of the autoregressive (AR) and moving average (MA) processes. For AR
models, the ACF decays exponentially and the PACF is used to identify the order (p) of the AR
model. If there is one significant spike at lag 1 on the PACF, we have an AR model of order 1,
i.e. AR(1). If there are significant spikes at lags 1, 2, and 3 on the PACF, we have an AR
model of order 3, i.e. AR(3). For MA models, the PACF decays exponentially and the ACF plot is
used to identify the order (q) of the MA process. If there is one significant spike at lag 1
on the ACF, we have an MA model of order 1, i.e. MA(1). If there are significant spikes at
lags 1, 2, and 3 on the ACF, we have an MA model of order 3, i.e. MA(3).
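
The spike-counting rule can be checked numerically rather than by eye. The sketch below simulates an AR(1) series (an assumption, with coefficient 0.7) and lists the lags at which the PACF and ACF exceed the approximate 95% significance bound of 1.96 divided by the square root of the sample size.

```r
set.seed(123)
n <- 1000
y <- arima.sim(model = list(ar = 0.7), n = n)   # simulated AR(1) series

# Sample PACF (starts at lag 1) and ACF (lag 0 dropped)
pacf_vals <- pacf(y, lag.max = 20, plot = FALSE)$acf[, 1, 1]
acf_vals <- acf(y, lag.max = 20, plot = FALSE)$acf[-1, 1, 1]

bound <- qnorm(0.975) / sqrt(n)   # approximate 95% significance bound
sig_pacf <- which(abs(pacf_vals) > bound)
sig_acf <- which(abs(acf_vals) > bound)

cat("significant PACF lags:", sig_pacf, "\n")
cat("significant ACF lags:", sig_acf, "\n")
```

For an AR(1) process the PACF should show a dominant spike at lag 1 while the ACF stays significant over several lags as it decays. A few isolated spikes at higher lags can appear by chance, since about 5% of lags cross the bound even for white noise.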

Once we have determined the parameters (p, d, q), we estimate the ARIMA model on a training
data set and then use the fitted model to forecast the values of the test data set using a
forecasting function. Finally, we check whether the forecasted values are in line with the
actual values.

Building the ARIMA model using R programming

# Load the necessary libraries
library(quantmod); library(tseries)
library(timeSeries); library(forecast); library(xts)

# Pull the NIFTY 50 daily price data from Yahoo Finance
getSymbols('^NSEI', from='2016-08-01', to='2019-08-01')

# Select the relevant close price series (column 4 is the close)
stock_prices = NSEI[,4]

# Compute the log returns for the stock
stock = diff(log(stock_prices), lag=1)
stock = stock[!is.na(stock)]

# Plot log returns
plot(stock, type='l', main='log returns plot')

# Conduct ADF test on log returns series
print(adf.test(stock))

# Split the dataset in two parts - training and testing
# (2.9/3 keeps about 96.7% of the observations for training;
# use 0.9 here for a 90/10 split)
breakpoint = floor(nrow(stock)*(2.9/3))

# Apply the ACF and PACF functions to the training part
par(mfrow = c(1,1))
acf.stock = acf(stock[c(1:breakpoint),], main='ACF Plot', lag.max=100)
pacf.stock = pacf(stock[c(1:breakpoint),], main='PACF Plot', lag.max=100)

For AR models, the ACF decays exponentially and the PACF plot is used to identify the order
(p) of the AR model. For MA models, the PACF decays exponentially and the ACF plot is used to
identify the order (q) of the MA model. From these plots we select AR order = 2 and MA
order = 2, with no further differencing (d = 0). Thus, our ARIMA parameters are (2,0,2).

Our objective is to forecast the entire returns series from the breakpoint onwards. We use a
for loop in R, and within this loop we forecast the return for each data point in the test
dataset.

We call the arima function on the training dataset with the order (2, 0, 2). We use this
fitted model to forecast the next data point with the forecast function. The level argument
sets the confidence level of the prediction interval, here 99%; it changes the width of the
interval, not the point forecast. We use the forecasted point estimate from the model. The “h”
argument in the forecast function indicates the number of values that we want to forecast, in
this case one, the next day's return.
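
The roles of the horizon and the confidence level can be illustrated with base R's predict function, where n.ahead plays the part of “h”. This is a sketch under stated assumptions: the series is simulated, a simple AR(1) is fitted instead of the (2, 0, 2) model, and the interval is built by hand as a normal-theory interval from the standard error.

```r
set.seed(11)
y <- arima.sim(model = list(ar = 0.5), n = 500) * 0.01   # stand-in returns

fit <- arima(y, order = c(1, 0, 0), include.mean = FALSE)
pred <- predict(fit, n.ahead = 1)   # one step ahead, like h = 1

point <- as.numeric(pred$pred)      # point forecast
se <- as.numeric(pred$se)           # standard error of the forecast

# Two-sided normal intervals at two different confidence levels
z99 <- qnorm(0.995)
z80 <- qnorm(0.90)
interval99 <- c(point - z99 * se, point + z99 * se)
interval80 <- c(point - z80 * se, point + z80 * se)

cat("point forecast:", point, "\n")
cat("99% interval:", interval99, "\n")
cat("80% interval:", interval80, "\n")
```

Both intervals are centred on the same point forecast; raising the level only widens the band around it.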

We can use the summary function to check that the results of the ARIMA model are within
acceptable limits. In the last part of the loop, we append each forecasted return and the
corresponding actual return to the forecasted returns series and the actual returns series,
respectively.

# Initializing an xts object for actual log returns
# (the 2014 date is a placeholder row that is dropped after the loop)
Actual_series = xts(0,as.Date("2014-11-25","%Y-%m-%d"))

# Initializing a dataframe for the forecasted return series
forecasted_series = data.frame(Forecasted = numeric())

for (b in breakpoint:(nrow(stock)-1)) {

  # Expanding training window: all returns up to and including row b
  stock_train = stock[1:b, ]
  stock_test = stock[(b+1):nrow(stock), ]

  # Summary of the ARIMA model using the determined (p,d,q) parameters
  fit = arima(stock_train, order = c(2, 0, 2),include.mean=FALSE)
  summary(fit)

  # plotting an acf plot of the residuals
  acf(fit$residuals,main="Residuals plot")

  # Forecasting the log returns one day ahead
  arima.forecast = forecast(fit, h = 1,level=99)
  summary(arima.forecast)

  # plotting the forecast
  par(mfrow=c(1,1))
  plot(arima.forecast, main = "ARIMA Forecast")

  # Creating a series of forecasted returns for the forecasted period
  forecasted_series = rbind(forecasted_series,arima.forecast$mean[1])
  colnames(forecasted_series) = c("Forecasted")

  # Creating a series of actual returns for the forecasted period
  Actual_return = stock[(b+1),]
  Actual_series = c(Actual_series,xts(Actual_return))
  rm(Actual_return)

  print(stock_prices[(b+1),])
  print(stock_prices[(b+2),])
}

From the coefficients obtained, the return equation can be written as:
$$Y_t = 0.216\,Y_{t-1} - 0.881\,Y_{t-2} + \epsilon_t - 0.1949\,\epsilon_{t-1} + 0.8867\,\epsilon_{t-2}$$
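
As a worked example of how this equation yields a one-step forecast, the sketch below plugs in the reported coefficients together with hypothetical values for the last two returns and residuals. Those four input values are made up for illustration and are not taken from the NIFTY data. The future error term has zero expected value, so it drops out of the point forecast.

```r
# Coefficients reported in the text
phi <- c(0.216, -0.881)       # AR coefficients for Y(t-1), Y(t-2)
theta <- c(-0.1949, 0.8867)   # MA coefficients for eps(t-1), eps(t-2)

# Hypothetical inputs (not from the report)
y_prev <- c(0.004, -0.002)    # Y(t-1), Y(t-2)
e_prev <- c(0.001, -0.003)    # eps(t-1), eps(t-2)

# Point forecast: AR part plus MA part; the expected future error is zero
forecast_t <- sum(phi * y_prev) + sum(theta * e_prev)
print(forecast_t)
```

With these inputs the four terms are 0.000864, 0.001762, -0.0001949 and -0.0026601, which sum to -0.000229.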

The output also gives the standard error of each coefficient, which should be within
acceptable limits. The Akaike information criterion (AIC) is a useful measure for comparing
candidate ARIMA models: the lower the AIC, the better the model. We can also inspect the ACF
plot of the residuals; in a good ARIMA model the residual autocorrelations stay within the
significance bounds. The forecasted point return is -0.0002003189, which is given in the last
row of the output.
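
One way to apply both checks is sketched below with base R only: compare the AIC across a few candidate orders, then run a Ljung-Box test (Box.test) on the residuals of the best candidate as a numerical companion to the residual ACF plot. The simulated AR(1) series and the list of candidate orders are assumptions for illustration.

```r
set.seed(99)
y <- arima.sim(model = list(ar = 0.6), n = 800)   # simulated series

# Candidate (p, d, q) orders to compare
orders <- list(c(0, 0, 0), c(1, 0, 0), c(0, 0, 1), c(1, 0, 1))
aic <- sapply(orders, function(o) arima(y, order = o)$aic)
names(aic) <- sapply(orders, paste, collapse = ",")
print(round(aic, 1))

best <- names(aic)[which.min(aic)]
cat("lowest AIC at order:", best, "\n")

# Ljung-Box test on the residuals of the best model:
# a large p-value means no evidence of leftover autocorrelation
best_fit <- arima(y, order = orders[[which.min(aic)]])
lb <- Box.test(best_fit$residuals, lag = 10, type = "Ljung-Box")
cat("Ljung-Box p-value:", lb$p.value, "\n")
```

AIC values are only comparable between models fitted to the same series, so the comparison should be made on a fixed training set.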

Next, we check the accuracy of the model by comparing the forecasted returns with the actual
returns.
# Adjust the length of the Actual return series (drop the placeholder first row)
Actual_series = Actual_series[-1]

# Create a time series object of the forecasted series
forecasted_series = xts(forecasted_series,index(Actual_series))

# Create a plot of the two return series - Actual versus Forecasted
plot(Actual_series,type='l',main='Actual Returns Vs Forecasted Returns')
lines(forecasted_series,lwd=1.5,col='red')
legend('bottomright',c("Actual","Forecasted"),lty=c(1,1),lwd=c(1.5,1.5),
       col=c('black','red'))

# Create a table for the accuracy of the forecast
comparison = merge(Actual_series,forecasted_series)
comparison$Accuracy = sign(comparison$Actual_series)==sign(comparison$Forecasted)
print(comparison)

# Compute the accuracy percentage metric
Accuracy_percentage = sum(comparison$Accuracy == 1)*100/length(comparison$Accuracy)
print(Accuracy_percentage)

If the sign of a forecasted return equals the sign of the actual return, we count the forecast
as correct. The accuracy of our model is 60%, which is decent for a basic first model.
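
The sign-match metric can be reproduced on a small hand-made example. The five actual and predicted returns below are hypothetical values chosen for illustration, not output from the model.

```r
# Hypothetical actual and predicted daily returns
actual <- c(0.010, -0.020, 0.005, -0.001, 0.003)
predicted <- c(0.002, -0.001, -0.003, -0.0005, 0.001)

# A forecast is a hit when it gets the direction (sign) right
hit <- sign(actual) == sign(predicted)
accuracy_pct <- 100 * mean(hit)

print(hit)
print(accuracy_pct)
```

Four of the five predictions have the right direction, so the accuracy is 80%. The metric ignores the size of the error: a forecast of 0.002 against an actual return of 0.010 counts as fully correct.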

Inference

In real situations, the dynamics of a stock index time series are complex and unknown, and a
single classical model cannot be expected to produce accurate forecasts for stock price
indexes. Instead of using only the autoregressive (AR) model or the moving average (MA) model,
using the integrated ARIMA model gives more accurate results. The Akaike information criterion
(AIC) is a useful indicator when comparing ARIMA models. With the ARIMA model the directional
accuracy comes to around 60%; including more independent variables could give a more accurate
model.
Another direction to explore is combining ARIMA with other forecasting tools, such as support
vector regression (SVR) and multivariate adaptive regression splines (MARS), to further
improve time series forecasting.

References

• https://www.r-bloggers.com/forecasting-stock-returns-using-arima-model/
• https://subscription.packtpub.com/book/big_data_and_business_intelligence/9781784390815/10/ch10lvl1sec114/predicting-stock-prices-with-an-arima-model
• https://www.sciencedirect.com/science/article/pii/B9780444536839000062
• https://www.monash.edu/business/econometrics-and-business-statistics/research/publications/ebs/wp18-17.pdf
• https://www.researchgate.net/publication/321439926_Which_Variables_Predict_and_Forecast_Stock_Market_Returns
• https://www.semanticscholar.org/paper/Modelling-and-Forecasting-Stock-Returns%3A-Exploiting-Trombini-Valente/5fa0dc2a79f115ff14188f7ea1708bde20a113b7
