Project Report
Submitted to:
Submitted by:
Group-11
Tarun Chintam (1810018)
Joseph Thomas Kurisunkal (1810025)
MVL Sravan Kumar (1810028)
Saurabh Yadav (1810099)
Contents
Introduction
Literature Review
Procedure
Inference
References
Introduction
We live in an information age in which every interaction with the world leaves a digital
footprint and thereby generates data. Analysing these data yields valuable insights for the
business world. This has been made possible by the powerful computers and technological
innovations of the past decades, and the analysis of such data is causing new businesses to emerge.
The stock market was well established before these technological innovations, but they have
opened up many new possibilities. One such possibility is predicting stock market movements
from past data. Even though an efficient market is supposed to be unpredictable, real markets
are at times far from efficient, which creates opportunities for data analysts. This project
therefore aims to develop a very basic model that can reasonably predict stock market returns.
To do so, we apply data analysis concepts to build a model in the R programming language. R is
used because it is open source, offers the libraries needed for data analysis, and is versatile
and robust.
Problem statement
Movements in the Indian stock market are mostly represented by the NIFTY 50 and SENSEX
indices. For this project we use NIFTY 50 index data. We take three years of daily NIFTY
returns, build a time series forecasting model on a training set, predict the test set, and
measure the accuracy of the model. Ninety percent of the data is used for training and the
remaining ten percent for testing. Forecasting means predicting the values of a variable from
its own historical data points, or predicting the change in a dependent variable in response to
changes in independent variables. We use time series forecasting, a quantitative forecasting
approach based on statistical principles.
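As a rough illustration of the data preparation described above, the following R sketch computes daily log returns and splits them 90/10 into training and test sets. The price series is simulated as a stand-in for NIFTY 50 closing prices, and the names prices, returns, train and test are assumptions made for this example, not the project's actual objects.

```r
# Hypothetical stand-in for three years of NIFTY 50 closing prices (simulated, not real data)
set.seed(11)
n_days <- 750
prices <- 10000 * exp(cumsum(rnorm(n_days, mean = 0.0003, sd = 0.01)))

# Daily log returns: r_t = log(P_t) - log(P_{t-1})
returns <- diff(log(prices))

# First 90% of the observations for training, the remaining 10% for testing
breakpoint <- floor(length(returns) * 0.9)
train <- returns[1:breakpoint]
test  <- returns[(breakpoint + 1):length(returns)]

cat("training points:", length(train), " test points:", length(test), "\n")
```

The split is chronological rather than random, so the model is only ever evaluated on returns that come after the data it was fitted on.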
Literature Review
In their Journal of Financial Economics article on “Forecasting stock market returns”, Miguel A.
Ferreira and Pedro Santa-Clara propose forecasting separately the three components of stock
market returns (the dividend-price ratio, earnings growth, and price-earnings ratio growth),
which they call the sum-of-the-parts method. Their method exploits the different time series
persistence of the components and obtains out-of-sample R-squares of more than 1.3% with
monthly data and 13.4% with yearly data. This compares with typically negative R-squares
obtained in a similar experiment with predictive regressions.
In the article “Stock index forecasting based on a hybrid model”, published by Elsevier, Ju-Jie
Wang, Jian-Zhou Wang, Zhe-George Zhang and Shu-Po Guo note that forecasting the stock market
price index is a challenging task. The exponential smoothing model (ESM), the autoregressive
integrated moving average model (ARIMA), and the back propagation neural network (BPNN) can
each be used to make forecasts based on time series. The paper proposes a hybrid approach
combining ESM, ARIMA, and BPNN in order to capture the advantages of all three models. The
weights of the proposed hybrid model (PHM) are determined by a genetic algorithm (GA). The
closing of the Shenzhen Integrated Index (SZII) and the opening of the Dow Jones Industrial
Average Index (DJIAI) are used as illustrative examples to evaluate the performance of the PHM.
Numerical results show that the proposed model outperforms all traditional models, including
ESM, ARIMA, BPNN, the equal weight hybrid model (EWH), and the random walk model (RWM).
We use the second paper as our reference. Specifically, we use its autoregressive integrated
moving average (ARIMA) component, as it is within the scope of our project and can be
implemented in R.
Data analysis
There are four popular time series forecasting techniques:
1. Autoregressive (AR) models
2. Moving average (MA) models
3. Seasonal regression models
4. Distributed lag models
We combine the AR and MA models in an ARIMA model. ARIMA stands for Autoregressive Integrated
Moving Average and is also known as the Box-Jenkins approach. The model can be written as:
$$Y_t = \phi_1 Y_{t-1} + \dots + \phi_p Y_{t-p} + \epsilon_t + \theta_1 \epsilon_{t-1} + \dots + \theta_q \epsilon_{t-q}$$
where Yt is the differenced time series value, ϕ and θ are unknown parameters, and the ϵ are
independent, identically distributed error terms with zero mean. Here, Yt is expressed in terms
of its past values and the current and past values of the error terms.
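To make the relationship described above concrete, the sketch below generates a series in which each value depends on its previous value and on the current and previous error terms, then recovers the parameters with R's arima function. The parameter values 0.5 and 0.3 and the ARMA(1,1) order are arbitrary choices for illustration, not values from this project.

```r
set.seed(1)
n     <- 2000
phi   <- 0.5   # assumed AR coefficient (illustrative only)
theta <- 0.3   # assumed MA coefficient (illustrative only)

eps <- rnorm(n)      # i.i.d. zero-mean error terms
y   <- numeric(n)    # y[1] starts at 0
for (t in 2:n) {
  # Y_t = phi * Y_{t-1} + eps_t + theta * eps_{t-1}
  y[t] <- phi * y[t - 1] + eps[t] + theta * eps[t - 1]
}

# Fit an ARMA(1,1) model without an intercept; estimates should be close to phi and theta
fit <- arima(y, order = c(1, 0, 1), include.mean = FALSE)
print(coef(fit))
```

R's arima writes the moving average terms with plus signs, which matches the form of the equation used in this report.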
Procedure
In time series regression the data must first be made stationary, meaning that its mean,
variance and autocorrelation are constant over time. We use the Augmented Dickey-Fuller (ADF)
unit root test to check the stationarity of the data. If the series is not stationary, we apply
differencing. The differenced values form a new time series, which is again tested for
stationarity with the ADF test. We repeat the differencing until the series is stationary.
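The effect of differencing can be sketched in R as follows. The example uses a simulated random walk rather than the project data, and compares the lag-1 autocorrelation before and after differencing as an informal check. The ADF test itself is run only if the tseries package happens to be installed; the helper lag1 is a name introduced for this example.

```r
set.seed(42)
x  <- cumsum(rnorm(500))   # random walk: non-stationary in levels
dx <- diff(x)              # first difference of the random walk

# Lag-1 sample autocorrelation: close to 1 in levels, close to 0 after differencing
lag1 <- function(s) acf(s, lag.max = 1, plot = FALSE)$acf[2]
cat("lag-1 ACF, levels:", round(lag1(x), 3),
    " differenced:", round(lag1(dx), 3), "\n")

# Optional formal check, assuming the tseries package is available
if (requireNamespace("tseries", quietly = TRUE)) {
  print(tseries::adf.test(dx))
}
```

A small ADF p-value for the differenced series would indicate that one round of differencing was enough.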
Then we use the autocorrelation function (ACF) and the partial autocorrelation function (PACF)
to identify the order of the autoregressive (AR) and moving average (MA) processes. For AR
models, the ACF decays exponentially and the PACF is used to identify the order (p) of the AR
model. If we have one significant spike at lag 1 on the PACF, then we have an AR model of
order 1, i.e. AR(1). If we have significant spikes at lags 1, 2, and 3 on the PACF, then we
have an AR model of order 3, i.e. AR(3). For MA models, the PACF decays exponentially and the
ACF plot is used to identify the order (q) of the MA process. If we have one significant spike
at lag 1 on the ACF, then we have an MA model of order 1, i.e. MA(1). If we have significant
spikes at lags 1, 2, and 3 on the ACF, then we have an MA model of order 3, i.e. MA(3).
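The identification rule above can be tried out on a simulated AR(1) series, as in the sketch below. The AR coefficient 0.6 is an arbitrary illustrative value, and the bound 1.96/sqrt(n) is the approximate 95% significance limit drawn on R's correlogram plots.

```r
set.seed(7)
n <- 1000
y <- arima.sim(model = list(ar = 0.6), n = n)   # simulated AR(1) series

# Approximate 95% significance bound for sample (partial) autocorrelations
bound <- 1.96 / sqrt(n)

# PACF values for lags 1..10; for an AR(1) process only lag 1 should stand out
pacf_vals <- pacf(y, lag.max = 10, plot = FALSE)$acf[, 1, 1]
significant_lags <- which(abs(pacf_vals) > bound)
print(significant_lags)

# ACF values for lags 1..3; these should shrink gradually rather than cut off
acf_vals <- acf(y, lag.max = 3, plot = FALSE)$acf[-1, 1, 1]
print(round(acf_vals, 3))
```

Because each lag has roughly a 5% chance of crossing the bound by chance, an isolated spike at a higher lag is not by itself evidence of a higher order.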
Once we have determined the parameters (p,d,q), we fit the ARIMA model on a training data
set, assess its fit, and then use the fitted model to forecast the values of the test data set
with a forecasting function. Finally, we check whether the forecasted values are in line with
the actual values.
# Conduct ADF test on log returns series
library(tseries)  # provides adf.test()
print(adf.test(stock))
As noted above, for AR models the ACF decays exponentially and the PACF plot identifies the
order (p), while for MA models the PACF decays exponentially and the ACF plot identifies the
order (q). From the ACF and PACF plots of our series we select AR order = 2 and MA order = 2.
With no differencing (d = 0), our ARIMA parameters are (2,0,2).
Our objective is to forecast the entire returns series from the breakpoint onwards. We use a
for loop in R, and within this loop we forecast the return for each data point in the test
dataset.
We call the arima function on the training dataset with the order (2, 0, 2) and use the fitted
model to forecast the next data point with the forecast function. The forecast function is
called with a 99% confidence level; the level argument controls the width of the prediction
interval and does not change the point forecast. We use the forecasted point estimate from the
model. The “h” argument of the forecast function gives the number of values to forecast, in
this case one, for the next day's return.
We can use the summary function to check that the results of the ARIMA model are within
acceptable limits. Finally, we append every forecasted return to the forecasted returns series
and every actual return to the actual returns series.
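The rolling forecast described above can be sketched as follows. This is not the project's code: it uses a simulated return series, base R's predict() in place of the forecast package's forecast function, and an ARMA(1,1) order that suits the simulated data (the report itself uses (2, 0, 2)). The names returns, forecasted and actual are assumptions for this example.

```r
# Simulated stand-in for the daily return series (not real NIFTY data)
set.seed(3)
returns <- as.numeric(arima.sim(model = list(ar = 0.5, ma = 0.3), n = 300, sd = 0.01))
breakpoint <- floor(length(returns) * 0.9)

# Order chosen for this simulated series; the report itself uses c(2, 0, 2)
arima_order <- c(1, 0, 1)

forecasted <- numeric(0)
actual     <- numeric(0)
for (b in breakpoint:(length(returns) - 1)) {
  train <- returns[1:b]                               # all data up to day b
  fit <- arima(train, order = arima_order, method = "ML")
  next_pred <- predict(fit, n.ahead = 1)$pred[1]      # one-step-ahead point forecast
  forecasted <- c(forecasted, as.numeric(next_pred))
  actual <- c(actual, returns[b + 1])                 # realised return for day b + 1
}

print(head(data.frame(forecasted, actual)))
```

Refitting inside the loop means each forecast uses every observation available up to that day, at the cost of one model fit per test point.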
for (b in breakpoint:(nrow(stock)-1)) {
  stock_train = stock[1:b, ]
  stock_test = stock[(b+1):nrow(stock), ]
  # Fit the ARIMA(2,0,2) model on stock_train and forecast the next return here (steps not shown)
  print(stock_prices[(b+1),])
  print(stock_prices[(b+2),])
}
From the coefficients obtained, the return equation can be written as:
$$Y_t = 0.216\,Y_{t-1} - 0.881\,Y_{t-2} + \epsilon_t - 0.1949\,\epsilon_{t-1} + 0.8867\,\epsilon_{t-2}$$
Standard errors are reported for the coefficients, and these need to be within acceptable
limits. The Akaike information criterion (AIC) is a useful measure for comparing ARIMA models
fitted to the same data: the lower the AIC, the better the model. We can also view the ACF plot
of the residuals; a good ARIMA model has residual autocorrelations within the significance
bounds. The forecasted point return is -0.0002003189, which is given in the last row of the
output.
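The two diagnostics mentioned above can be illustrated with the sketch below, which compares the AIC of two candidate models on a simulated series and then tests the residuals of the preferred model for leftover autocorrelation. The simulated data, the candidate orders, and the use of the Ljung-Box test (as a numerical companion to the residual ACF plot) are assumptions for this example.

```r
set.seed(5)
y <- arima.sim(model = list(ar = 0.5, ma = 0.3), n = 1000)   # simulated ARMA(1,1) series

# Two candidate models fitted to the same series
fit_ar1    <- arima(y, order = c(1, 0, 0), method = "ML")
fit_arma11 <- arima(y, order = c(1, 0, 1), method = "ML")

# Lower AIC indicates the preferred model among candidates fitted to the same data
print(c(AR1 = AIC(fit_ar1), ARMA11 = AIC(fit_arma11)))

# Ljung-Box test on the residuals; fitdf = 2 accounts for the two ARMA coefficients.
# A large p-value gives no evidence of remaining autocorrelation.
lb <- Box.test(residuals(fit_arma11), lag = 10, type = "Ljung-Box", fitdf = 2)
print(lb$p.value)
```

AIC values are only comparable between models fitted to the same series with the same amount of differencing.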
Next, we check the accuracy of the model by comparing the forecasted returns with the actual
returns.
# Adjust the length of the Actual return series
Actual_series = Actual_series[-1]
If the sign of a forecasted return equals the sign of the actual return, we count the forecast
as accurate. The accuracy of our model is 60%, which is decent for a basic first model.
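The sign-based accuracy score described above can be computed as in this short sketch. The two vectors hold made-up values chosen only to illustrate the calculation; they are not the project's forecasts or returns.

```r
# Hypothetical forecasted and actual returns (illustrative values only)
forecasted <- c(0.002, -0.001, 0.003, -0.004,  0.001)
actual     <- c(0.001,  0.002, 0.004, -0.002, -0.003)

# A forecast counts as a hit when its sign matches the sign of the actual return
hit <- sign(forecasted) == sign(actual)

# Share of hits, as a percentage: 3 of the 5 signs agree here
accuracy_pct <- 100 * mean(hit)
print(accuracy_pct)
```

This measure only rewards getting the direction right and ignores the size of the forecast error.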
Inference
In real situations, the dynamics of a stock index time series are complex and unknown, and a
single classical model cannot produce accurate forecasts for stock price indices. Compared with
using only the autoregressive (AR) model or only the moving average (MA) model, the integrated
model gives a more accurate result. The Akaike information criterion (AIC) is a useful measure
for comparing candidate ARIMA models. With the ARIMA model the accuracy is around 60%;
including more independent variables could yield a more accurate model.
Another direction to explore is combining other forecasting tools, such as support vector
regression (SVR) and multivariate adaptive regression splines (MARS), to further improve
time series forecasting.
References
https://www.r-bloggers.com/forecasting-stock-returns-using-arima-model/
https://subscription.packtpub.com/book/big_data_and_business_intelligence/9781784390815/10/ch10lvl1sec114/predicting-stock-prices-with-an-arima-model
https://www.sciencedirect.com/science/article/pii/B9780444536839000062
https://www.monash.edu/business/econometrics-and-business-statistics/research/publications/ebs/wp18-17.pdf
https://www.researchgate.net/publication/321439926_Which_Variables_Predict_and_Forecast_Stock_Market_Returns
https://www.semanticscholar.org/paper/Modelling-and-Forecasting-Stock-Returns%3A-Exploiting-Trombini-Valente/5fa0dc2a79f115ff14188f7ea1708bde20a113b7