
Predictive Analytics: Assignment 2

A report submitted to

Prof. Debanjan Mitra

In partial fulfilment of requirements of the course

Predictive Analytics

By

Group A1

Abhishek Yadav – 188007

Amit Kumar – 188028

Sameer Sarnaik – 188214

On

08-08-2019

OBJECTIVE

The objective of this study is to classify whether the market moves up or down, based on a
set of predictor variables and using several classification models. We have tried to keep
the report simple so that it can be followed by readers without an analytical background.

DATA AND ITS CLEANING

We were provided with the Smarket file, which has 1250 observations of 10 variables.
• We checked the structure of the data to see whether continuous and categorical
variables are distinguished correctly or whether any of them need conversion. We found
that all the variables are in their correct format.
• We then checked for null entries and found none.
• Our response variable is Direction.

PARTITIONING OF DATA

We partitioned the data into a training set and a validation set. The training data consist
of all observations prior to the year 2005, i.e. 2001-2004. The validation data consist of
the observations of the year 2005.
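
As an illustration only, the year-based split described above can be sketched in Python as follows. The rows are made-up stand-ins for the Smarket data (only the Year column matters here), and `split_by_year` is a hypothetical helper, not the code used for this report.

```python
# Hypothetical rows standing in for the Smarket data; the values are made up
# and only the Year column drives the split.
rows = [
    {"Year": 2001, "Lag1": 0.38, "Lag2": -0.19, "Direction": "Up"},
    {"Year": 2003, "Lag1": -0.62, "Lag2": 1.03, "Direction": "Down"},
    {"Year": 2004, "Lag1": 0.21, "Lag2": 0.05, "Direction": "Up"},
    {"Year": 2005, "Lag1": -0.13, "Lag2": 0.48, "Direction": "Down"},
    {"Year": 2005, "Lag1": 0.55, "Lag2": -0.27, "Direction": "Up"},
]


def split_by_year(rows, cutoff_year=2005):
    """Return (training, validation): rows before cutoff_year, rows in cutoff_year."""
    training = [r for r in rows if r["Year"] < cutoff_year]
    validation = [r for r in rows if r["Year"] == cutoff_year]
    return training, validation


training, validation = split_by_year(rows)
print(len(training), "training rows,", len(validation), "validation rows")
```

Splitting by time rather than at random keeps every validation observation later than every training observation, which matches how the model would be used in practice.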

DIFFERENT MODELS

1. Logistic Regression Model

• This is the first model used to classify the 2005 data. It was fitted on the training
set with Direction as the response variable.
• First, we fitted the model using all the variables (except Today, since that is the
same-day value whose direction we are predicting). The model summary showed that not
all the variables are needed, as they are statistically insignificant (they have fairly
large p-values). We then chose the 3 variables with the lowest p-values: Lag1, Lag2
and Lag3. By trial and error, we found that this set of variables maximizes the
accuracy of the model.
• We then ran the model on the validation set (data of year 2005) to predict the
Direction (response variable) of the market. An observation in the validation data is
predicted as “Up” when its predicted probability is greater than 0.5.
• We then compared the predicted Direction with the actual Direction of the
observations in year 2005 using a confusion matrix. The output is shown below.

• The model classified 59.12% of records correctly, which is a fairly good result.
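
The thresholding and evaluation steps above can be sketched as follows. This is a minimal illustration, not the code behind the reported figures: the probabilities and labels are made up, and `classify`, `confusion_matrix` and `accuracy` are hypothetical helper names.

```python
from collections import Counter


def classify(probabilities, threshold=0.5):
    """Label an observation "Up" when its predicted probability exceeds the threshold."""
    return ["Up" if p > threshold else "Down" for p in probabilities]


def confusion_matrix(predicted, actual):
    """Count how often each (predicted, actual) label pair occurs."""
    return Counter(zip(predicted, actual))


def accuracy(predicted, actual):
    """Proportion of records whose predicted label equals the actual label."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Made-up predicted probabilities of "Up" and made-up true labels.
probs = [0.62, 0.48, 0.51, 0.30, 0.50, 0.71]
actual = ["Up", "Up", "Down", "Down", "Down", "Up"]

predicted = classify(probs)
matrix = confusion_matrix(predicted, actual)
for (pred_label, true_label), count in sorted(matrix.items()):
    print(f"predicted={pred_label:4s} actual={true_label:4s} count={count}")
print(f"accuracy = {accuracy(predicted, actual):.2%}")
```

Note that a probability of exactly 0.5 is labelled “Down” here, because the rule is a strict "greater than 0.5". The same comparison applies unchanged to the predictions of the other models.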

2. Linear Discriminant Analysis

• Again, we fitted the LDA model on the training data, with Direction as the response
and Lag1, Lag2 and Lag3 as predictors.
• We then ran the model on the validation set (data of year 2005) to predict the
Direction (response variable) of the market.
• We then compared the predicted Direction with the actual Direction of the
observations in year 2005 using a confusion matrix. The output is shown below.
• The model classified 58.7% of records correctly, which is also a fairly good
proportion.

3. K-Nearest Neighbor

• For this model, we used only Lag1 and Lag2, as this pair gave the highest accuracy
(found by trial and error).
• We then ran the model with K=1 (i.e. using only the single nearest neighbor) on the
validation set (data of year 2005) to predict the Direction (response variable) of
the market.
• We then compared the predicted Direction with the actual Direction of the
observations in year 2005 using a confusion matrix. The output is shown below.

K = 1

• With K=1, the model classified 50% of records correctly.

• We then ran the model with K=3 (i.e. using the 3 nearest neighbors) on the
validation set (data of year 2005) to predict the Direction (response variable)
of the market.
• We then compared the predicted Direction with the actual Direction of the
observations in year 2005 using a confusion matrix. The output is shown below.

K = 3

• With K=3, the model classified 53.5% of records correctly.
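
To show why the choice of K can change the result, here is a minimal sketch of K-nearest-neighbor classification by majority vote. The (Lag1, Lag2) points and labels are made up, and `knn_predict` is a hypothetical helper using Euclidean distance, which is an assumption since the report does not state the distance measure.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Sort training points by Euclidean distance to the query (assumed metric).
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Made-up (Lag1, Lag2) training points and their Direction labels.
train_points = [(0.4, -0.2), (-0.6, 1.0), (0.2, 0.1), (-0.1, 0.5), (0.9, -0.8)]
train_labels = ["Up", "Down", "Down", "Up", "Up"]
query = (0.25, 0.0)

print("K=1:", knn_predict(train_points, train_labels, query, k=1))
print("K=3:", knn_predict(train_points, train_labels, query, k=3))
```

In this toy example the single nearest neighbor is labelled “Down”, but two of the three nearest neighbors are labelled “Up”, so the prediction flips between K=1 and K=3. An odd K avoids tied votes when there are two classes.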

4. Classification Trees

• First, we fitted the tree model on the training set using Lag1 and Lag2 as predictors.
• The resulting tree, shown below, has 7 leaf nodes.

• We then ran the model on the validation set (data of year 2005) to predict the
Direction (response variable) of the market.
• We then compared the predicted Direction with the actual Direction of the
observations in year 2005 using a confusion matrix. The output is shown below.

• The model classified 58.33% of records correctly, which is also a fairly good
proportion.
• We do not show the output after pruning here, because pruning the tree did not
improve the accuracy; it remained the same.
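
For readers unfamiliar with how a classification tree chooses its splits, the sketch below scores one candidate split with Gini impurity. This is an assumption for illustration: the report does not state which splitting criterion was used, the data are made up, and `gini` and `split_impurity` are hypothetical helpers.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_impurity(values, labels, threshold):
    """Weighted Gini impurity of the two groups produced by `value < threshold`."""
    left = [lab for v, lab in zip(values, labels) if v < threshold]
    right = [lab for v, lab in zip(values, labels) if v >= threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


# Made-up Lag1 values and Direction labels.
lag1 = [-0.9, -0.4, -0.1, 0.2, 0.6, 1.1]
direction = ["Up", "Up", "Down", "Up", "Down", "Down"]

print("impurity before split:", gini(direction))
print("impurity after Lag1 < 0.0:", split_impurity(lag1, direction, 0.0))
```

A split is useful when the weighted impurity after it is lower than the impurity before it. Each leaf node of the final tree predicts the majority class of the training records that fall into it.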

CONCLUSION

Among the 4 models, either the logistic regression or the LDA model is a reasonable choice,
as both perform at a similar level and show higher accuracy than the other two models. We
suggest logistic regression, since it gives the highest accuracy (59.12%), and we will use
it to classify whether the market direction will be up or down.
