
A Project Report on

Predictive Modelling of Stock Market Returns using Data Mining Tools

Submitted by
SHREYAN HOTA 08SI2034

Master of Science Statistics and Informatics Department of Mathematics Indian Institute of Technology Kharagpur

Under the Guidance of

Prof. A. Goswami
Department of Mathematics Indian Institute of Technology Kharagpur


CERTIFICATE
This is to certify that the present thesis entitled "Predictive Modelling of Stock Market Returns using Data Mining Tools", submitted by Shreyan Hota (Roll No. 08SI2034) to the Indian Institute of Technology, Kharagpur in partial fulfilment of the requirements for the Degree of Integrated Master of Science in Statistics and Informatics, is a bona fide record of the work carried out by him during July to November 2012 under my supervision and guidance. To the best of my knowledge, this report has not been submitted to any other Institute/University for any degree or diploma.

Date: 29th November 2012
Prof. A. Goswami
Dept. of Mathematics
INDIAN INSTITUTE OF TECHNOLOGY
Kharagpur, 721302.


Table of Contents

Sl. No.   Title
1         Abstract
2         Introduction
2.1       Functions of Data Mining and correlation with our case
2.2       Knowledge Discovery Process and implementation in our case
2.3       Previous Research
2.4       Scope
3         Stock-Prediction Using Data Mining
3.1       Traditional Time Series Forecasting
3.2       Neural Networks for Level Estimation
3.2.1     Results
3.3       Neural Network for Classification of sign of change of stock returns
3.3.1     Results
3.4       Conclusion
3.4.1     Future Work
3.5       References


Abstract
A widely accepted hypothesis in the relatively new research field of data mining is that techniques such as neural networks, decision trees and genetic algorithms can uncover relationships in very large databases, revealing how individual factors affect outcomes and how data can be grouped into categories. The non-linearity and unpredictability of financial markets are well known. Current technical analysis tools, including regression, time series forecasting techniques, the RSI indicator and Bollinger Bands, fail to consider the relevance of input variables and the relationships between market trends and various external factors. Addressing this requires an in-depth analysis that correlates results with recognizable factors and, if need be, uncovers factors that are not yet recognized. This can only be done using machine learning tools and data mining. A definite knowledge-discovery process must be followed, a suitable technique must be chosen, and the resulting information must be properly cross-validated to give a tangible and verifiable predictive model for market trends or other unpredictable financial fluctuations, for example credit risk. This project focuses mainly on returns on various portfolios in the stock market. Various neural network methods are explored, using a range of input variables, leading to information gain and to the fitting of hidden layers. The aim of the study is to arrive at a tangible predictive model for future stock returns, cross-validated to improve its generalization ability. If possible, this model will incorporate realistic trading rules to generate buy, sell or hold decisions for maximum profitability.


Introduction
Forecasting stock returns or a stock index has attracted researchers' attention for many years. Developments in financial markets and in the application of computing technology over the past two decades have made the stock market a promising domain for data mining techniques. Increased storage and enhanced communications technology have led to huge databases of historical data on stock market trends and variables. Such variables include interest rates and exchange rates, growth rates of industrial production and consumer prices, and company-specific information such as income statements, balance sheets, profit/loss accounts and dividend yields. The most basic obstacle to prediction using computing or technical tools is the efficient market hypothesis (Jensen, 1978). In effect, it means that markets adjust prices so rapidly that there is no room to obtain profits in a consistent way. It states that all available information affecting current stock market values is incorporated into prices by the market before the general public can make trades based on it. In other words, technical analysis does not work. But this is still considered an empirical issue, and the hypothesis has been successively replaced by more relaxed versions that allow for trading opportunities. Recently, many authors have provided evidence that time series data on various variables allow stock market returns to be predicted. Interest rates, monetary growth rates and inflation rates are statistically important for predicting returns. Most of the reported relationships between the available information and stock market returns are based on linear regression models, but there is no evidence to support the conjecture that market returns vary linearly with the independent factors.
Thus data mining might help in uncovering the reasons for the residual error or variance, and produce more reliable predictions by discarding the linear regression assumptions. In search of a more systematic approach to uncovering important input variables, this project aims to find a sound methodology for data selection and then introduce a knowledge-discovery process for variable relevance analysis using various neural network or genetic algorithm approaches, the final step being cross-validation. Classification of the direction of change in stock returns and estimation of its level are the important predictions in this model. If possible, we will try to incorporate profit-maximizing trading rules for decision making, using one or more stocks, based on the model used in the previous study.

Functions of data mining and correlation with our case

A problem is suitable for data-mining tools if it fulfils the following requirements:
1. requires knowledge-based decisions
2. has a changing environment
3. has sub-optimal current methods
4. has accessible, sufficient and relevant data
5. provides high payoff for right decisions

Our case study of stock market prediction suits data-mining application well: the market is a changing environment, the current linear regression methods are sub-optimal and cannot predict returns reliably, and the right trading decisions bring a high payoff in terms of returns. Major data mining tasks include:


1. Classification: This involves predicting the class of an item. It can be applied to predicting the direction of stock market change (positive or negative), as we shall see later. Decision Trees and Neural Networks are most often used for classification.
2. Clustering and Deviation Detection: Finding clusters in data is an important application. Finding which external factors most affect our returns is an important step. How the variance of predicted values from actual values can be reduced is an important part of creating a data mining model.
3. Link Analysis and Estimation: Finding relationships between the input variables and the output layer is useful for level estimation of changes in stock market returns. Neural Networks are used for link analysis and estimation.
4. Knowledge Discovery: Perhaps the most pertinent function of data mining, knowledge discovery means gaining information from huge databases. Making sense of data is the key function of this process. Suppose we have a database of all the stock returns and the open, close, high and low prices of the stock market for 10 or more years of trading days. Data mining can be used to draw conclusions about how stock returns are affected by various external factors, and predictions can be made from these results.

Knowledge Discovery Process and implementation in our case

The Knowledge Discovery Process using data mining can be summarized in the following steps:
1. Database Creation
2. Data Cleaning: Data Warehousing
3. Data selection to get Task-Relevant Data
4. Data Mining to Evaluate Patterns
5. Insight Prediction, leading to Knowledge
6. Cross-Checking, Model Validation
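The first three steps of this process can be sketched in R on a tiny table. This is only an illustration: the stock labels, column names and prices below are made up, and the cleaning rule (dropping duplicate rows and rows with missing prices) is an assumption, not the report's actual procedure.

```r
# Steps 1-3 of the process on a tiny, made-up quotes table.
# All values and column names below are illustrative assumptions.

# Step 1: database creation (synthetic day-end quotes for two stocks)
quotes <- data.frame(
  Date  = as.Date(c("2012-11-05", "2012-11-06", "2012-11-06",
                    "2012-11-07", "2012-11-05", "2012-11-06")),
  Stock = c("A", "A", "A", "A", "B", "B"),
  Open  = c(100, 101, 101, 103, 50, 51),
  Close = c(101, 103, 103, NA, 51, 52)
)

# Step 2: data cleaning (drop exact duplicate rows and rows with missing prices)
clean <- quotes[!duplicated(quotes) & complete.cases(quotes), ]

# Step 3: data selection (keep only the task-relevant columns for one stock)
relevant <- clean[clean$Stock == "A", c("Date", "Close")]
print(relevant)
```

The resulting table of dates and closing prices is the kind of task-relevant subset that the data mining step would then work on.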

A simple example would be a database consisting of the dates and the open, close, high and low prices of six stocks over the last 10 years. Data Cleaning would take care of irrelevant information such as intra-day quotes. Data selection would result in a subset database consisting of the close prices and dates of six or fewer stocks, for relevant time periods. Data Mining would perhaps consist of time-series forecasting and creation of neural networks. From the neural network we would obtain results such as the linking weights of the hidden layers, which would be used to produce predictions. Cross-checking and updating the model to reduce the error between predicted and actual values from our database is an important final step, repeated as needed.

Previous Research

The use of data mining in finance is a relatively new field. Earlier, simple regression and time series forecasting techniques were used to predict stock variations and to make investment decisions. Owing to the inefficiency of such techniques and to the growth in computing power and storage space, data mining and machine learning tools have come to the forefront of financial research since 1990. Studies that apply neural networks to predicting future stock behaviour include Chenoweth and Obradovic (1996), Desai and Bharati (1998), Gencay (1998), Leung, Daouk, and Chen (2000), Motiwalla and Wahab (2000), Pantazopoulos et al. (1998), Qi and Maddala (1999), and Wood and Dasgupta (1996).


Scope

This project aims to broaden the scope of the previous research by including other data mining techniques, such as genetic algorithms, fuzzy logic and decision trees, in conjunction with neural networks as well as traditional statistical predictive models, to come up with an optimal model for predicting stock market returns, minimizing the error as much as possible. Every time such a model is identified as satisfactorily predicting stock returns, it is implemented in a simple trading simulation using buy, sell or hold decisions to maximize profitability.


Simple Stock-Prediction Using Data Mining


To illustrate the process we will follow, we take up a simple case study concentrating on a particular stock, say RIL, from the NSE (National Stock Exchange), India. Data on stock quotes (open, close, high, low) are readily available on the internet (Yahoo! Finance and Moneycontrol). In this case we consider the day-end stock quotes to predict future trends. We begin with simple data warehousing. After data cleaning and selection of task-relevant data, we get a database with the following information:
1. Date of Stock Exchange
2. Open Price (First stock quote of the day)
3. High Price (Highest price during day)
4. Low Price (Lowest price during day)
5. Close Price (Last transaction quote of the day)
6. Volume of Transactions

We created a CSV (Comma Separated Values) file for use in R, containing stock quotes from January 3, 2000 to November 25, 2012. We load the .csv file into R; the data frame ril then contains the RIL stock quotes from 2000 to 2012. Plotting the Close prices of RIL stock, we get the following plot.
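The loading and plotting commands are not reproduced in the text. A minimal sketch of what they might look like is given below. The file name, the column layout, the day-month-year date format and the three rows of prices are all assumptions made so that the example is self-contained; they are not the report's data.

```r
# Write a tiny stand-in CSV so the sketch is self-contained.
# The file name, column layout and values are assumptions, not the report's data.
csv_path <- file.path(tempdir(), "ril_example.csv")
writeLines(c(
  "Date,Open,High,Low,Close,Volume",
  "09-11-2012,790,799,785,796,2500000",
  "08-11-2012,801,805,788,791,2100000",
  "07-11-2012,805,812,798,803,1900000"
), csv_path)

# Load the quotes; keep text columns as character rather than factor
ril <- read.csv(csv_path, stringsAsFactors = FALSE)

# Parse the day-month-year dates so they can be used on a time axis
ril$Date <- as.Date(ril$Date, format = "%d-%m-%Y")

# Plot the closing prices in chronological order
o <- order(ril$Date)
pdf(NULL)  # null graphics device so the sketch also runs non-interactively
plot(ril$Date[o], ril$Close[o], type = "l", xlab = "Date", ylab = "Close price")
invisible(dev.off())
```

When working interactively, the pdf(NULL) and dev.off() lines can be omitted so that the plot appears on screen.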

Traditional Time Series Forecasting

Let us start with simple time series predictions. We take a new variable, the h-day returns on closing prices:

$$R_h(t) = \frac{\mathrm{Close}_t - \mathrm{Close}_{t-h}}{\mathrm{Close}_{t-h}}$$

We define the returns function to compute this for the Close column of our database. Next we use this function to create a data frame with all the necessary R_h(t) information for our Close column.

    dataset <- function(data, quote="Close", hday, off) {
      # one row per time t: the target return followed by its lagged values
      ds <- data.frame(embed(returns(data[,quote], h=hday), off+hday))
      # keep the target column and the off most recent past returns
      ds <- ds[, c(1, (1+hday):(hday+off))]
      names(ds) <- c(paste("r", hday, ".f", hday, sep=""),
                     paste("r", hday, ".t", 0:(off-1), sep=""))
      # attach the date corresponding to each row
      ds$Date <- data[(hday+off):(nrow(data)-hday), "Date"]
      ds
    }

This function receives a data frame, which must have a Date column; by default it works on the column named Close. Let us see the result when our database is passed to this function.

    ril.data <- dataset(ril, hday=1, off=10)
    ril.data[1:4,]

         r1.f1     r1.t0     r1.t1        r1.t2     r1.t3     r1.t4     r1.t5     r1.t6
    1  0.010145  0.002097  0         0            0.002867  0.011929  0.00155   0.01282
    2  0.004833  0.010145  0.002097  0            0         0.002867  0.011929  0.00155
    3  0.005872  0.004833  0.010145  0.002096702  0         0         0.002867  0.011929
    4 -0.00012   0.005872  0.004833  0.01014456   0.002097  0         0         0.002867

         r1.t7     r1.t8     r1.t9     Date
    1 -0.00753   0.00522  -0.00886   09-11-2012
    2 -0.01282   0.00753   0.00522   08-11-2012
    3  0.00155   0.01282  -0.00753   07-11-2012
    4  0.011929  0.00155   0.01282   06-11-2012
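The definition of the returns function is not listed in the text. A minimal sketch consistent with the formula for R_h(t) is given below; the implementation via diff() is an assumption, and the closing prices are made up. The second part shows how the base R function embed() arranges a series so that each row holds a value together with its predecessors, which is the layout the dataset function relies on.

```r
# h-day returns: (x[t] - x[t-h]) / x[t-h]; the result has length(x) - h entries
returns <- function(x, h = 1) {
  diff(x, lag = h) / x[1:(length(x) - h)]
}

close_prices <- c(100, 110, 121, 133.1)   # made-up closing prices
r <- returns(close_prices, h = 1)
print(r)

# embed() lays a series out so that each row holds a value and its predecessors:
# column 1 is the most recent value, the following columns are successive lags
e <- embed(1:5, 3)
print(e)
```

Each made-up price is 10% above the previous one, so every 1-day return in this example is 0.1.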

Here hday is the lag h, and off is the number of previous returns kept as predictors, and hence the number of rows to offset to reach the first usable time t. In our case hday=1 and off=10, implying a time lag of 1 day and an offset of 10 rows to reach t. The output shows R_h(t) and its lagged values for the first four dates. The name r1.f1 indicates that hday=1 has been used and that this is the target variable.

r1.f1 for the 1st row = R_1(11) (11 refers to the 11th row of the data, that is, Date = 09-11-2012)
r1.t0 for the 1st row = R_1(10) (.t0 implies no further lag within the embedding)

Previous values of the h-day returns are thus obtained by our dataset function. These values can be used to create a model to predict the next value of this time series. An important input for this prediction is the autocorrelation. This gives us an idea of the correlation of h-day returns with the h-day returns of previous days. The function acf() is used to calculate the autocorrelation of our target column.

    acf(ril.data$r1.f1, main="", ylim=c(-0.1,0.1))

We get the following autocorrelation plot.
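To make the significance band of the autocorrelation plot explicit, the sketch below computes the autocorrelations without plotting and compares them with the approximate 95% band that acf() draws, which is qnorm(0.975)/sqrt(n) under the hypothesis of uncorrelated data. The series used here is simulated noise standing in for the 1-day returns, so the lags it flags say nothing about the RIL data.

```r
set.seed(1)
x <- rnorm(500, sd = 0.01)           # stand-in for a series of 1-day returns

ac  <- acf(x, lag.max = 20, plot = FALSE)
rho <- as.numeric(ac$acf)[-1]        # drop lag 0, which is always 1

# Approximate 95% band under the hypothesis of no autocorrelation
band <- qnorm(0.975) / sqrt(length(x))

outside <- which(abs(rho) > band)    # lags whose autocorrelation is significant
cat("band: +/-", round(band, 4), "\n")
cat("lags outside the band:", outside, "\n")
```

For a series of a few thousand trading days the band is narrow, so even small autocorrelations can fall outside it.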

The dashed lines mark the 95% confidence band for the significance of the autocorrelation values; the ylim=c(-0.1,0.1) argument only restricts the vertical range of the plot. As we can see, several autocorrelation values fall outside this band. Autocorrelation, however, captures only linear correlation between the returns and their past values. As we have seen before, linear models are not suitable for predicting financial market trends. We will now move on to non-linear models.

Neural Networks for Level Estimation

The first non-linear model we shall use to predict h-day returns of closing prices is the neural network. Neural networks are possibly the most common models used in prediction experiments. We use the package nnet for R to meet our data-mining needs. It implements feed-forward neural networks with a single hidden layer, the most common type of neural network. A neural network consists of a network of neurons, which are computing units linked to each other. Each link has an associated weight, which has to be found by a fitting algorithm. Neurons are organized in layers, the first layer being the input neurons, one per input variable. The final layer gives the predicted values. The layers between these two are called hidden layers. The weight-updating algorithm obtains the weights of the linkages based on certain criteria. These criteria are formed from the constraints present in our problem.
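As a rough illustration of how such a network turns inputs into a prediction, the sketch below computes one forward pass through a small network with two inputs, three hidden neurons and one output. The weights are hypothetical, and the choice of a logistic activation in the hidden layer with a linear output unit is an assumption made to mirror a network fitted with linear output; it is not taken from the report.

```r
# Logistic activation used by the hidden neurons
sigmoid <- function(z) 1 / (1 + exp(-z))

# Forward pass: inputs -> hidden layer (logistic) -> single linear output
forward <- function(x, W_hidden, b_hidden, w_out, b_out) {
  hidden <- sigmoid(as.vector(W_hidden %*% x) + b_hidden)
  sum(w_out * hidden) + b_out
}

# Hypothetical weights for a 2-3-1 network (one row of W_hidden per hidden neuron)
W_hidden <- matrix(c( 0.5, -0.2,
                      0.1,  0.4,
                     -0.3,  0.8), nrow = 3, byrow = TRUE)
b_hidden <- c(0, 0.1, -0.1)
w_out    <- c(1.0, -0.5, 0.25)
b_out    <- 0.05

y <- forward(c(0.01, -0.02), W_hidden, b_hidden, w_out, b_out)
print(y)
```

Fitting the network amounts to choosing the entries of W_hidden, b_hidden, w_out and b_out so that outputs like y are as close as possible to the observed targets.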


Here 1 and 2 are input neurons; 3, 4 and 5 are neurons of the hidden layer; and 6 is the output neuron. The weights associated with each linkage, for example W13 or W56, are unknown parameters that are to be found. The network is a multilayer feed-forward neural network: for each training sample, the input variables are fed simultaneously into the input layer. The weighted outputs are in turn fed simultaneously into the first hidden layer. The outputs of this layer are fed into the next hidden layer, if a second hidden layer exists, and so on until all the hidden layers are covered. Finally, the weighted outputs of the last hidden layer are fed into the output layer, which produces the prediction of the output or target variables.

Since we need two datasets, one to fit the neural network and the other to evaluate it, we split the database of 1-day returns (ril.data) into two, one from 2000 to 2006 (ril.form) and the other from 2006 to 2012 (ril.check). First, we use the nnet function from the nnet package. This function builds a neural network with a single hidden layer with, say, 10 hidden neurons. The decay argument, which sets the weight decay (a penalty on large weights), is also given.

    library(nnet)
    nn <- nnet(r1.f1 ~ ., data=ril.form[,-ncol(ril.form)], linout=T, size=10, decay=0.01, maxit=1000)

Here linout=T requests a linear output unit, which is appropriate because the target variable (r1.f1) is continuous. The maxit argument gives the maximum number of iterations of the algorithm (the stopping criterion). We have removed the Date column from our ril.form data frame (using ril.form[,-ncol(ril.form)], which drops the last column), so that it is not used as an input to the neural network.

Results

The summary(nn) function can be used to check the final linkage weights obtained from the algorithm.
    summary(nn)
    a 10-10-1 network with 121 weights
    options were - linear output units  decay=0.01

Weights into the hidden neurons (b is the bias, i1 to i10 are the inputs):

               b     i1     i2     i3     i4     i5     i6     i7     i8     i9    i10
    ->h1       0   0.05  -0.03   0.01  -0.08   0.09  -0.05  -0.02  -0.02  -0.01  -0.02
    ->h2       0  -0.07   0.04  -0.01   0.12  -0.14   0.07   0.04   0.03   0.02   0.04
    ->h3       0  -0.01   0      0      0.01  -0.02   0.01   0      0      0      0
    ->h4       0   0.07  -0.04   0.01  -0.11   0.13  -0.07  -0.03  -0.03  -0.02  -0.03
    ->h5       0  -0.05   0.03  -0.01   0.08  -0.09   0.05   0.02   0.02   0.01   0.02
    ->h6       0   0.04  -0.02   0.01  -0.07   0.08  -0.04  -0.02  -0.02  -0.01  -0.02
    ->h7       0   0      0      0      0      0      0      0      0      0      0
    ->h8       0  -0.07   0.04  -0.01   0.11  -0.13   0.07   0.04   0.03   0.02   0.04
    ->h9       0  -0.05   0.03  -0.01   0.08  -0.09   0.05   0.02   0.02   0.01   0.02
    ->h10      0   0.09  -0.05   0.02  -0.15   0.17  -0.09  -0.05  -0.03  -0.02  -0.05

Weights into the output neuron (b is the bias, h1 to h10 are the hidden neurons):

               b     h1     h2     h3     h4     h5     h6     h7     h8     h9    h10
    ->o        0  -0.14   0.22   0.03  -0.2    0.15  -0.12   0      0.21   0.15  -0.28

A 10-10-1 network has 10 input variables (off=10), one hidden layer with 10 neurons (size=10) and one output (target) variable (r1.f1). Counting one bias per hidden and output neuron, this gives (10+1)×10 + (10+1)×1 = 121 weights. The neural network with the given weights can now be used to make predictions for our RIL stock using the predict function.

    nn.prediction <- predict(nn, ril.check)

We plot the nn.prediction results in the following graph.


Ideally, all the predictions of the target variable made by the neural network would lie on the dotted line, which indicates zero error. This is not the case; hence we must address the issue of evaluating time series models.
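The whole level-estimation experiment (split, fit, predict, compare with the zero-error line) can be sketched end to end as follows. The prices are simulated, the cut-off date is arbitrary, the data frame names form and check are illustrative, and maxit is reduced to keep the run short, so the numbers produced here are not comparable with the report's results.

```r
library(nnet)

set.seed(42)
n <- 400
dates <- seq(as.Date("2011-01-03"), by = "day", length.out = n)
close_prices <- 100 * cumprod(1 + rnorm(n, sd = 0.01))   # synthetic prices

# 1-day returns, then a table with the target (r1.f1) and 10 lagged returns
r <- diff(close_prices) / close_prices[-n]
d <- data.frame(embed(r, 11))
names(d) <- c("r1.f1", paste0("r1.t", 0:9))
d$Date <- dates[11:(n - 1)]

# Chronological split: earlier part to fit the network, later part to check it
cutoff <- as.Date("2011-09-01")          # arbitrary cut-off for this sketch
form  <- d[d$Date <  cutoff, ]
check <- d[d$Date >= cutoff, ]

# Single hidden layer with 10 neurons, linear output, weight decay 0.01
nn <- nnet(r1.f1 ~ ., data = form[, -ncol(form)], linout = TRUE,
           size = 10, decay = 0.01, maxit = 200, trace = FALSE)

pred <- predict(nn, check[, -ncol(check)])

# Predicted against actual returns, with the zero-error line as reference
pdf(NULL)  # null graphics device so the sketch also runs non-interactively
plot(check$r1.f1, pred,
     xlab = "Actual 1-day return", ylab = "Predicted 1-day return")
abline(0, 1, lty = 2)   # points on this line would be predicted without error
invisible(dev.off())
```

Splitting by date rather than at random keeps all checking observations later in time than the fitting observations, which is the situation a forecasting model faces in use.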

Neural Network for Classification of sign of change of stock returns

We will now attempt to predict the sign of the change in the stock from one trading day to the next. This is a classification problem in data mining. There are three classes of sign change, {-1, 0, 1}; these are the values the output neuron is trained to reproduce. First we create a user-defined function in R, signchange, which records the change in returns for lag h. To get the sign of the change we use the built-in R function sign. Once again we subdivide the vector containing the sign of change of stock returns into two groups, one from 2000 to 2006 for training the neural network, and another from 2006 to 2012 for evaluating the network. Now we create a neural network with r1.f1 as the target, which is now the label for the sign of change in the stock over a 1-day lag. The input variables are the same as last time, that is, the time series prediction variables (r1.t0, r1.t1 and so on).

    library(nnet)
    nn <- nnet(r1.f1 ~ ., data=c.form1[,-ncol(c.form1)], linout=T, size=5, decay=0.01, maxit=1000)

Results

We can see a summary of the weights obtained by our weight-updating algorithm.

    summary(nn)
    a 10-5-1 network with 61 weights
    options were - linear output units  decay=0.01

Weights into the hidden neurons (b is the bias, i1 to i10 are the inputs):

               b     i1     i2      i3     i4      i5     i6     i7     i8     i9    i10
    ->h1    0.13  34.61   8.15   -3.42  -1.94   -2.06  -2.69   0.09   3.42  -2.78   2.26
    ->h2    2.18  -0.31  -3.39  -10.84  -2.45  -10.48   6.16  -5.32   2.86   5.7   -3.98
    ->h3    1.96   6.79   3.33   -7.88  -1.76   -2.84  -2.8   -1.94  10.59   3.43   2.69
    ->h4    2.76  -2.91  -3.06    1.09  -2.57    3.58   5.18   1.3   -9.16   7.7    2.88
    ->h5    1.92   8.39   1.45   -3.34  -3.06   -5.82   7.06  -0.48  -8.29  -0.75  -1.83

Weights into the output neuron (b is the bias, h1 to h5 are the hidden neurons):

               b     h1     h2     h3     h4     h5
    ->o     1.06  -6.59  -9.82  11.43  -9.42  11.57
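The signchange helper used to build the class labels is not listed in the text. A minimal sketch is given below, under the assumption that it is simply the difference of a series over a lag of h observations; the closing prices are made up, and the same function could equally be applied to a series of returns.

```r
# Change of a series over a lag of h observations: x[t] - x[t-h]
signchange <- function(x, h = 1) {
  diff(x, lag = h)
}

close_prices <- c(100, 102, 102, 99, 101)   # made-up closing prices
labels <- sign(signchange(close_prices, h = 1))
print(labels)

# Number of observations in each of the three classes
print(table(factor(labels, levels = c(-1, 0, 1))))
```

The class 0 arises only when the series is exactly unchanged over the lag, so in real quote data it is usually much rarer than the other two classes.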


Note that since the neural network has a linear output unit, it does not return values from the set {-1, 0, 1} but continuous values, typically in the range (-1, 1). It may return values outside this range because of fitting error or the limit on iterations. Thus we need a way to compare our predicted values with the actual values. We compare the two sets of values using the Root Mean Square Error (RMSE):

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - t_i)^2}$$

where y_i is the predicted change in sign, t_i is the actual change in sign and n is the number of predictions. We create a simple function rmse in R. A low RMSE implies an accurate neural network prediction. We subdivide the checking database (2006-01-03 onwards) into 7 parts, and cross-check the neural network with different numbers of hidden neurons and decay values to see whether changing either of these parameters can increase the accuracy of our network. Our findings are summarized in the following table:

    Year           ps=5,dc=0.01   ps=10,dc=0.05   ps=15,dc=0.1   ps=15,dc=0.01   ps=25,dc=0.3
    2006-2007      1.072651       1.079826        1.058155       1.10107         1.058155
    2007-2008      1.02308        0.9683714       0.9763746      1.007752        0.9843128
    2008-2009      1.008584       1.008584        1              1.106408        1
    2009-2010      1.066422       1.059041        1.044125       1.088262        1.051609
    2010-2011      1.03331        1.009911        0.9696592      1.03331         0.9696592
    2011-2012      0.9774792      0.9857281       0.9774792      1.018056        0.9523038
    2012-2012      1.009009       0.9909089       0.9536783      1.009009        0.9724718
    Average RMSE   1.027219314    1.014624343     0.997067329    1.051981        0.9983588

This table shows the RMSE for all the subgroups for different parameters. For example, the first column of results shows the RMSE for each subgroup with hidden layer size (ps) 5 and decay (dc) 0.01. This has been repeated for other, mostly increasing, values of ps and dc. No general statement can be made about how the RMSE varies with increasing ps and dc. But we can clearly see that ps=15 and dc=0.1 gives the lowest average RMSE (0.997), and hence predicted values closest to the actual values of the stock return change sign from 2006 to 2012 (checking subgroup).
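The rmse function mentioned above is not listed in the text. A minimal sketch following the definition of RMSE is given below, together with a per-period evaluation in the spirit of the yearly subdivision and a check of the average reported for ps=15, dc=0.1. The predictions, labels and period names in the first two parts are made up; only the seven yearly values in the last part are taken from the table.

```r
# Root mean square error between predictions y and targets target
rmse <- function(y, target) sqrt(mean((y - target)^2))

# Made-up example: continuous network outputs against actual sign labels
pred   <- c(0.4, -0.2, 0.1, -0.7)
actual <- c(1, -1, 0, -1)
print(rmse(pred, actual))

# Per-period RMSE, as in the yearly subdivision of the checking data
period <- c("A", "A", "B", "B")             # hypothetical period labels
by_period <- sapply(split(seq_along(pred), period),
                    function(i) rmse(pred[i], actual[i]))
print(by_period)

# Average of the yearly values reported for ps=15, dc=0.1
yearly <- c(1.058155, 0.9763746, 1, 1.044125, 0.9696592, 0.9774792, 0.9536783)
avg <- mean(yearly)
print(avg)

# Reference point: always predicting 0 against labels of -1 and 1
baseline <- rmse(c(0, 0), c(1, -1))
print(baseline)
```

The mean of the seven yearly values reproduces the average of about 0.997 given in the table. As a point of reference when reading the table, a constant prediction of 0 against labels of -1 and 1 has an RMSE of exactly 1.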


Conclusions
Since our input variables for both the level estimation of changes in stock market returns and the sign change classification were mainly time series variables (lagged returns), we depended on the assumption of a linear market hypothesis for our results. As we have seen before, this assumption is flawed, and this was evident from the large deviation of our predicted returns from the actual values. Input variables must include varied data such as income rates, company balance sheet information and GDP growth rates of the time for superior classification, link analysis and information gain. In our second experiment (sign change classification), we noted that RMSE is a good estimate of the error in predicting the sign of change of stock market returns. Changing the parameters of the neural network, such as the number of hidden layer neurons and the decay, helped decrease the RMSE.

Future Work

To improve on our experiments, we must add more variables to our database, such as income statements, profit-loss accounts and dividend yields. But there must be a trade-off between the addition of new input variables and the number of hidden layer neurons, so as not to increase the complexity of the neural networks unduly. Performance analysis of the neural networks will be carried out to benchmark the system. In addition, genetic algorithm and decision tree approaches to predictive modelling of stock market returns will be discussed and experiments designed.


References
Balvers, R. J., Cosimano, T. F., & McDonald, B. (1990). Predicting stock returns in an efficient market. Journal of Finance, 55, 1109-1128.
Chenoweth, T., & Obradovic, Z. (1996). A multi-component nonlinear prediction system for the S&P 500 Index. Neurocomputing, 10, 275-290.
Desai, V. S., & Bharati, R. (1998). The efficiency of neural networks in predicting returns on stock and bond indices. Decision Sciences, 29, 405-425.
Gencay, R. (1998). Optimization of technical trading strategies and the profitability in securities markets. Economics Letters, 59, 249-254.
Leung, M. T., Daouk, H., & Chen, A. S. (2000). Forecasting stock indices: a comparison of classification and level estimation models. International Journal of Forecasting, 16, 173-190.
Motiwalla, L., & Wahab, M. (2000). Predictable variation and profitable trading of US equities: a trading simulation using neural networks. Computers & Operations Research, 27, 1111-1129.
Qi, M., & Maddala, G. S. (1999). Economic factors and the stock market: a new perspective. Journal of Forecasting, 18, 151-166.
Wood, D., & Dasgupta, B. (1996). Classifying trend movements in the MSCI U.S.A. capital market index - a comparison of regression, ARIMA, and neural network methods. Computers & Operations Research, 23, 611-622.
