A report submitted to
Predictive Analytics
By
Group A1
On
08-08-2019
OBJECTIVE
The objective of the study is to classify whether the market moves up or down based on
several predictor variables, using several models. We have tried to keep the report as
simple as possible so that it can be understood by anyone, even without an
analytical background.
We were provided with the Smarket file, which has 1250 observations and 10 variables.
We checked the structure of the data to see whether continuous and
categorical variables are distinguished correctly or whether they need conversion.
We found that all the variables are in their correct format.
We then checked for null entries and found none.
Our response variable is Direction.
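The structure and null checks described above can be sketched as follows. The report does not include its code, so this is only a minimal illustration in Python using the standard library. The sample rows are made up (one blank entry is included on purpose to show how it would be detected), and the column name Year is an assumption.

```python
import csv
import io

# Made-up sample in an assumed layout of the Smarket file (illustration only).
SAMPLE_CSV = """Year,Lag1,Lag2,Direction
2001,0.381,-0.192,Up
2001,0.959,,Up
2005,-0.623,1.032,Down
"""

NUMERIC_COLUMNS = ["Year", "Lag1", "Lag2"]  # expected to be continuous


def is_blank(value):
    """True for a missing or empty entry."""
    return value is None or value.strip() == ""


def count_missing(rows):
    """Return {column: number of blank entries} for a list of dict rows."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            counts.setdefault(column, 0)
            if is_blank(value):
                counts[column] += 1
    return counts


def is_numeric(value):
    """True if the text can be parsed as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def non_numeric_columns(rows):
    """Return the numeric columns that contain a non-blank, non-numeric entry."""
    bad = []
    for column in NUMERIC_COLUMNS:
        values = [row[column] for row in rows if not is_blank(row[column])]
        if not all(is_numeric(v) for v in values):
            bad.append(column)
    return bad


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
missing = count_missing(rows)
wrong_type = non_numeric_columns(rows)
direction_levels = sorted({row["Direction"] for row in rows})

print(missing)           # blank entries per column
print(wrong_type)        # numeric columns holding non-numeric text
print(direction_levels)  # distinct values of the categorical response
```

On the real file, a result with zero blanks in every column and no wrongly typed columns would correspond to the findings stated above.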
PARTITIONING OF DATA
We partitioned the data into a training set and a validation set. The training data consists of all
the observations prior to the year 2005, i.e. 2001-2004. The validation data consists of the
observations from the year 2005.
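The year-based partition can be sketched as below. This is an illustrative Python sketch, not the code used for the report; the rows are made up and the field name Year is an assumption.

```python
def split_by_year(rows, validation_year=2005):
    """Rows before validation_year form the training set;
    rows from validation_year form the validation set."""
    training = [row for row in rows if row["Year"] < validation_year]
    validation = [row for row in rows if row["Year"] == validation_year]
    return training, validation


# Made-up rows for illustration only.
rows = [
    {"Year": 2001, "Lag1": 0.381, "Direction": "Up"},
    {"Year": 2003, "Lag1": -0.562, "Direction": "Down"},
    {"Year": 2004, "Lag1": 0.213, "Direction": "Up"},
    {"Year": 2005, "Lag1": -0.134, "Direction": "Down"},
    {"Year": 2005, "Lag1": 0.422, "Direction": "Up"},
]

training, validation = split_by_year(rows)
print(len(training), len(validation))
```

Splitting by time rather than at random keeps the validation year strictly after the training years, so the model is always evaluated on data it could not have seen.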
DIFFERENT MODELS
1. Logistic Regression
This is the first model used to classify the 2005 data. It is fitted on
the training set with Direction as our response variable.
First, we built the model using all the variables (except Today, as we are predicting
for today itself). In the model summary we found that the variables are statistically
insignificant (they have fairly large p-values), so not all of them are needed. We then
chose the 3 variables with the lowest p-values: Lag1, Lag2 and Lag3. By
trial and error, we found that these are the variables that maximize the
accuracy of the model.
We then ran the model on the validation set (data of the year 2005) to predict the Direction
(response variable) of the market. We set the predicted direction to "Up" in the validation data
where the predicted probability is greater than 0.5, and to "Down" otherwise.
We then compared the predicted Direction to the actual Direction of the
observations in the year 2005 using a confusion matrix. The output is shown below.
We can see that our model classified 59.12% of records correctly, which is
modestly better than the 50% expected from random guessing.
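The scoring steps for this model (predicted probability, 0.5 cut-off, confusion matrix, share of correct records) can be sketched as follows. This is a minimal Python illustration, not the code used for the report. The intercept and coefficients are hypothetical placeholders rather than the fitted values, and the validation rows are made up.

```python
import math


def sigmoid(z):
    """Logistic function mapping any real number to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(intercept, coefficients, features):
    """Probability of "Up" from a logistic model with the given parameters."""
    z = intercept + sum(c * x for c, x in zip(coefficients, features))
    return sigmoid(z)


def classify(probabilities, threshold=0.5):
    """Label a record "Up" when its probability exceeds the threshold."""
    return ["Up" if p > threshold else "Down" for p in probabilities]


def confusion_matrix(actual, predicted, labels=("Down", "Up")):
    """Counts keyed by (predicted label, actual label)."""
    matrix = {(p, a): 0 for p in labels for a in labels}
    for a, p in zip(actual, predicted):
        matrix[(p, a)] += 1
    return matrix


def accuracy(actual, predicted):
    """Share of records whose predicted label matches the actual label."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


# Hypothetical parameters for Lag1, Lag2, Lag3 (NOT the report's fitted values).
INTERCEPT = 0.03
COEFFICIENTS = [-0.05, -0.04, 0.01]

# Made-up validation rows: [Lag1, Lag2, Lag3] and the actual Direction.
validation_features = [
    [0.5, -0.2, 0.1],
    [-1.2, 0.4, -0.3],
    [2.0, 1.5, 0.0],
    [-0.1, -0.6, 0.8],
]
validation_actual = ["Up", "Up", "Down", "Down"]

probabilities = [
    predict_probability(INTERCEPT, COEFFICIENTS, f) for f in validation_features
]
predicted = classify(probabilities)
matrix = confusion_matrix(validation_actual, predicted)

print(predicted)
print(matrix)
print(accuracy(validation_actual, predicted))
```

The diagonal entries of the confusion matrix, ("Up", "Up") and ("Down", "Down"), are the correctly classified records; dividing their sum by the number of validation records gives the percentage quoted above.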
2. Linear Discriminant Analysis (LDA)
Next, we built our LDA model on the training data, with Direction as the
response and Lag1, Lag2 and Lag3 as predictors.
We then ran the model on the validation set (data of the year 2005) to predict the Direction
(response variable) of the market.
We then compared the predicted Direction to the actual Direction of the
observations in the year 2005 using a confusion matrix. The output is shown below.
We can see that our model classified 58.7% of records correctly, which is
also above the 50% expected from random guessing.
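How LDA assigns a class can be sketched as below. To stay in plain Python, the sketch is simplified to a single predictor (Lag1), whereas the model in this report used Lag1, Lag2 and Lag3. The training values are made up. Each class gets a mean and a prior, all classes share one pooled variance, and a record is assigned to the class with the largest linear discriminant score.

```python
import math


def fit_lda_1d(xs, labels):
    """Estimate class means, class priors and the pooled variance."""
    classes = sorted(set(labels))
    n = len(xs)
    means, priors = {}, {}
    pooled_sum_of_squares = 0.0
    for c in classes:
        values = [x for x, y in zip(xs, labels) if y == c]
        means[c] = sum(values) / len(values)
        priors[c] = len(values) / n
        pooled_sum_of_squares += sum((v - means[c]) ** 2 for v in values)
    variance = pooled_sum_of_squares / (n - len(classes))
    return {"means": means, "priors": priors, "variance": variance}


def predict_lda_1d(model, x):
    """Return the class with the largest discriminant score for x."""
    variance = model["variance"]

    def score(c):
        mu = model["means"][c]
        return x * mu / variance - mu ** 2 / (2 * variance) + math.log(model["priors"][c])

    return max(model["means"], key=score)


# Made-up training data: one predictor value and a Direction per record.
train_lag1 = [-1.0, -0.5, -0.8, 0.6, 0.9, 1.2]
train_direction = ["Up", "Up", "Up", "Down", "Down", "Down"]

model = fit_lda_1d(train_lag1, train_direction)
print(predict_lda_1d(model, -0.3))
print(predict_lda_1d(model, 0.5))
```

With equal priors, as in this toy data, the decision boundary sits halfway between the two class means.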
3. K-Nearest Neighbor
For this model, we used only Lag1 and Lag2, as they give the highest
accuracy for the model (found by trial and error).
We then ran the model with K=1 (using only the single nearest neighbor) on the
validation set (data of the year 2005) to predict the Direction (response variable) of
the market.
We then compared the predicted Direction to the actual Direction of the
observations in the year 2005 using a confusion matrix. The output is shown below.
K = 1
We can see that our model classified 50% of records correctly with K=1.
We then ran the model with K=3 (using the 3 nearest neighbors)
on the validation set (data of the year 2005) to predict the Direction (response variable)
of the market.
We then compared the predicted Direction to the actual Direction of the
observations in the year 2005 using a confusion matrix. The output is shown below.
K = 3
We can see that our model classified 53.5% of records correctly with K=3.
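The K-nearest-neighbor rule used above can be sketched as follows: find the K training records closest to the query in (Lag1, Lag2) space and take a majority vote of their Direction. This is an illustrative Python sketch with made-up points, not the code used for the report; Euclidean distance is an assumption.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k):
    """Majority vote among the k training points nearest to the query."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Made-up training records: (Lag1, Lag2) and the Direction of each.
train_points = [(0.2, 0.1), (0.4, -0.3), (0.0, 0.5), (-0.7, -0.2), (0.3, 0.6)]
train_labels = ["Up", "Up", "Down", "Down", "Down"]

query = (0.25, 0.3)
print(knn_predict(train_points, train_labels, query, k=1))
print(knn_predict(train_points, train_labels, query, k=3))
```

In this toy example the prediction changes between K=1 and K=3, which illustrates why the choice of K affects the share of records classified correctly.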
4. Classification Trees
First, we built the tree model on the training set using Lag1 and Lag2 as predictors.
The tree, which has 7 leaf nodes, is shown below.
We then ran the model on the validation set (data of the year 2005) to predict the
Direction (response variable) of the market.
We then compared the predicted Direction to the actual Direction of
the observations in the year 2005 using a confusion matrix. The output is shown
below.
We can see that our model classified 58.33% of records correctly,
which is also above the 50% expected from random guessing.
We do not show the output after pruning here, because pruning the tree did not
change the accuracy.
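How a classification tree chooses a single split can be sketched as below. A full tree repeats this search inside each resulting group until a stopping rule is met. The sketch uses Gini impurity as the split criterion, which is an assumption, since the report does not state which criterion was used. The rows are made up.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure group)."""
    n = len(labels)
    if n == 0:
        return 0.0
    impurity = 1.0
    for c in set(labels):
        p = labels.count(c) / n
        impurity -= p * p
    return impurity


def best_split(rows, labels, feature_names):
    """Return (weighted impurity, feature, threshold) of the best single split."""
    best = None
    n = len(rows)
    for feature in feature_names:
        values = sorted({row[feature] for row in rows})
        # Candidate thresholds are midpoints between consecutive distinct values.
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
            weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or weighted < best[0]:
                best = (weighted, feature, threshold)
    return best


# Made-up training rows for illustration only.
rows = [
    {"Lag1": -0.8, "Lag2": 0.3},
    {"Lag1": -0.4, "Lag2": -0.5},
    {"Lag1": 0.1, "Lag2": 0.7},
    {"Lag1": 0.6, "Lag2": -0.2},
    {"Lag1": 1.1, "Lag2": 0.4},
]
labels = ["Up", "Up", "Up", "Down", "Down"]

impurity, feature, threshold = best_split(rows, labels, ["Lag1", "Lag2"])
print(feature, threshold, impurity)
```

Pruning removes splits that add leaves without improving validation performance, so a pruned tree that predicts the same labels gives the same accuracy.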
CONCLUSION
Among the 4 models, logistic regression and LDA show higher accuracy than the other two
models and perform at a similar level, so either could be used. We suggest
logistic regression, as it gives the highest accuracy (59.12%), and we will use it
to classify whether the direction of the market will be up or down.
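The comparison behind this conclusion can be reproduced from the percentages reported in the sections above. The short Python sketch below ranks them; the dictionary keys are labels chosen for illustration.

```python
# Validation accuracy (%) as reported in this document.
validation_accuracy = {
    "Logistic Regression": 59.12,
    "LDA": 58.7,
    "KNN (K=1)": 50.0,
    "KNN (K=3)": 53.5,
    "Classification Tree": 58.33,
}

# Sort from highest to lowest accuracy.
ranked = sorted(validation_accuracy.items(), key=lambda item: item[1], reverse=True)
best_model, best_accuracy = ranked[0]

for name, value in ranked:
    print(f"{name}: {value:.2f}%")
print("Selected:", best_model)
```

The gap between the top three models is under one percentage point, so the ranking among them should be read with caution.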