Sakchi Saraf Project 5

Predictive Modeling
Sakchi Saraf
PGP-DSBA Online
May’22
Date: 16/10/22
Proprietary content. ©Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited
Table of Contents
Problem 1
  Executive summary
  Introduction
  Data description
    Sample of the dataset
  Exploratory data analysis
    Data type
    Missing value
    Describe
    Boxplot
    Histogram
    Correlation
    Pairplot
  Model analysis
    Train test data
    Linear regression model
    Linear regression using statsmodels
    Comparison of both the models
  Insights
Problem 2
  Executive summary
  Introduction
  Data description
    Sample of the dataset
  Exploratory data analysis
    Data type
    Missing value
    Describe
    Boxplot
    Histogram
    Pairplot
    Correlation
  Model analysis
    Train test data
    Logistic Regression Model
    Linear Discriminant Analysis
  Comparison of the models and inferences
THE END!
List of tables
1.1 Sample of the dataset
1.2 Data type
1.3 Missing value
1.4 Describe
1.5 Histogram
1.6.1 R-square
1.6.2 RMSE
1.7.1 Summary
1.7.2 RMSE
1.8 Comparison Table
2.1 Sample of the dataset
2.2 Data type
2.3 Missing value
2.4 Describe
2.5 Histogram
2.6 Train test split
2.7.1 Probability sample
2.7.2 Confusion matrix
2.8.1 Probability sample
2.8.2 Confusion matrix
2.9 Confusion matrix
2.10 Comparison Table

List of graphs or pictures
1.1 Boxplot
1.2 Histogram
1.3 Correlation
1.4 Pairplot
1.5 Scatter plot
1.6 Scatter plot
2.1 Boxplot
2.2 Histogram
2.3 Pairplot
2.4 Correlation
2.5 Output grid search
2.5.1 Confusion matrix
2.5.2 Classification report
2.5.3 ROC AUC
2.6.1 Confusion matrix
2.6.2 Classification report
2.6.3 ROC AUC
2.7.1 Confusion matrix
2.7.2 Classification report
Problem 1
You are hired by Gem Stones Co. Ltd., a cubic zirconia manufacturer. You are provided with a dataset
containing the prices and other attributes of almost 27,000 cubic zirconia stones (cubic zirconia is an
inexpensive diamond alternative with many of the same qualities as a diamond). The company earns
different profits in different price slots. You have to help the company predict the price of a stone
on the basis of the details given in the dataset, so that it can distinguish between more profitable and
less profitable stones and improve its profit share. Also, provide the 5 attributes that are most
important.
Executive Summary
The primary objective of any organization is to earn profit. The dataset provides the attributes of the gem
stones which determine the price of a gem stone. Using the dataset, we can predict the price of a gem
stone from its attributes. So, in this problem statement we will create the price prediction
equation using different linear regression models.
Introduction
The purpose of this whole exercise is to draw inferences from the dataset. We will use linear regression
models on the dataset. The data consist of the price of a gem stone and different attributes determining the
price of the gem stone.
Data Description
1. Carat - Carat weight of the cubic zirconia.
2. Cut - Describes the cut quality of the cubic zirconia. Quality in increasing order: Fair, Good, Very
Good, Premium, Ideal.
3. Color - Colour of the cubic zirconia, with D being the best and J the worst.
4. Clarity - Clarity refers to the absence of inclusions and blemishes. (In order from best to worst
in terms of average price) IF, VVS1, VVS2, VS1, VS2, SI1, SI2, I1
5. Depth - The Height of cubic zirconia, measured from the Culet to the table, divided by its average
Girdle Diameter.
6. Table - The Width of the cubic zirconia's Table expressed as a Percentage of its Average Diameter.
7. Price - the Price of the cubic zirconia.
8. X - Length of the cubic zirconia in mm.
9. Y - Width of the cubic zirconia in mm.
10. Z - Height of the cubic zirconia in mm.
Sample of the dataset
The dataset has 1 dependent variable – “Price” and 9 independent variables. Each variable has a different
set of attributes. Based on these variables the price of the gem will be predicted.
Exploratory Data Analysis

Data type in the data frame:
Range Index: 26967 entries, 0 to 26966

Column    Non-null count   Data type
carat     26967            float64
cut       26967            object
color     26967            object
clarity   26967            object
depth     26270            float64
table     26967            float64
x         26967            float64
y         26967            float64
z         26967            float64
price     26967            int64

• There are 26967 rows and 10 columns in the dataset.
• The dataset has 7 numerical variables and the remaining 3 are categorical variables.
• A predictive model requires numerical input, so the variables of object data type were converted
to numerical form.
Missing values in the dataset:
From the results we can see that “Depth” is the only variable with missing
values (697 of the 26967 records); the other variables have none. We treated
the missing records by replacing them with the median of the variable.
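The median replacement described above can be sketched in plain Python as follows. This is a minimal illustration, not the code used in the report: the function name and the sample depth readings are hypothetical, and in a pandas workflow the equivalent step would typically be `df['depth'].fillna(df['depth'].median())`.

```python
from statistics import median


def fill_missing_with_median(values):
    """Return a copy of `values` where None entries are replaced by the
    median of the observed (non-missing) entries."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


# Hypothetical depth readings; None marks a missing record.
depth = [61.8, None, 62.4, 60.9, None, 63.1]
depth_filled = fill_missing_with_median(depth)
print(depth_filled)
```

The median is preferred over the mean here because it is not pulled towards the outliers that the boxplots show for this variable.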
Describe
From the above table we can see that one variable has missing values but there is no bad data. The ranges
of the numerical variables are very different from each other. The table also shows the most frequent
attribute of each categorical variable.
While checking for zero values we found 3 variables that contain them: X, Y & Z.
A zero dimension is not physically possible, so it is equivalent to missing data.
We dropped these rows, because replacing the zeros with the mean or median could
assign wrong attributes to the gem stone.
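A minimal sketch of this filtering step, assuming each record is a dictionary keyed by column name. The helper name and the four sample stones are hypothetical and only illustrate the rule "drop a row if X, Y or Z is zero".

```python
def drop_zero_dimension_rows(rows, columns=("x", "y", "z")):
    """Keep only the rows in which every listed dimension is non-zero."""
    return [row for row in rows if all(row[c] != 0 for c in columns)]


# Hypothetical records; the second and third have impossible zero dimensions.
stones = [
    {"carat": 0.30, "x": 4.27, "y": 4.29, "z": 2.66},
    {"carat": 1.01, "x": 6.50, "y": 6.45, "z": 0.00},
    {"carat": 0.71, "x": 0.00, "y": 0.00, "z": 0.00},
    {"carat": 0.90, "x": 6.04, "y": 6.24, "z": 3.73},
]
cleaned = drop_zero_dimension_rows(stones)
print(len(stones) - len(cleaned), "rows dropped")
```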
The data has 34 duplicate records. We are not removing them, as they make up only
0.13% (34/26958) of the dataset and there is no unique identifier associated with
the records that would confirm they are true duplicates.
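The duplicate share quoted above can be checked with a few lines of Python. The helper below is an illustrative sketch (its name and the three toy rows are assumptions); only the figures 34 and 26958 come from the report.

```python
from collections import Counter


def duplicate_share(rows):
    """Count rows that repeat an earlier identical row and return
    (number of duplicates, share of the dataset in percent)."""
    counts = Counter(tuple(sorted(row.items())) for row in rows)
    duplicates = sum(n - 1 for n in counts.values())
    return duplicates, 100 * duplicates / len(rows)


# Toy example: the second row repeats the first one.
toy_rows = [{"carat": 0.3, "cut": "Ideal"},
            {"carat": 0.3, "cut": "Ideal"},
            {"carat": 0.9, "cut": "Good"}]
toy_duplicates, toy_share = duplicate_share(toy_rows)

# Share reported in the text: 34 duplicates among 26958 rows.
share_reported = round(100 * 34 / 26958, 2)
print(share_reported)
```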
Boxplot
From the above boxplots we can see that all the 7 numerical variables have outliers. All the outliers are
reasonable values. Here, we will not treat the outliers: treating them may make the model perform better
during modelling, but on real data it may not perform as well, because it loses generalization.
Histogram
Bimodal histograms    Right skewed variables    Nearly normally distributed variables (with outliers)
1. Cut                1. Carat                  1. Depth
2. Clarity            2. X – Length             2. Table
3. Color              3. Y – Width
                      4. Z – Height
                      5. Price
Correlation Plot
From the correlation plot, we can see that various attributes are highly correlated to each other. Correlation
values near 1 or -1 indicate a strong positive or a strong negative correlation respectively.
Correlation values near 0 indicate little or no linear correlation.
From the above plot we can see that Carat, X, Y, Z and Price are highly correlated to each other. Before
building the model, the X, Y & Z variables will be dropped from the dataset, as their contribution is
captured by the Carat variable; Price is kept as the dependent variable.
Pairplot
Pairplot shows the relationship between the variables in the form of scatterplot and the distribution of the
variable in the form of histogram. From the scatterplot, we can see that there is no clear linear relationship
between any variables.
Model Analysis
Train and Test data
The dataset has been divided into two parts – X and Y, X being the set of independent variables and Y being
the dependent variable. The data was then split into train and test sets in a 70%:30% ratio. Before the
split, dummy variables were created for the categorical variables, and the dummy column of the first
attribute was dropped for each categorical variable to avoid a redundant column.
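The two preparation steps above (dummy variables with the first level dropped, then a 70%:30% split) can be sketched in plain Python. This is an illustration under stated assumptions: the helper names, the seed and the ten sample stones are hypothetical. With pandas and scikit-learn the same steps are commonly written with `pd.get_dummies(..., drop_first=True)` and `train_test_split(..., test_size=0.30)`; whether the report used exactly these calls is an assumption.

```python
import random


def one_hot_drop_first(rows, column):
    """Replace `column` by 0/1 dummy columns, dropping the first
    (alphabetically smallest) level so it becomes the baseline."""
    levels = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for level in levels[1:]:
            new_row[f"{column}_{level}"] = int(row[column] == level)
        encoded.append(new_row)
    return encoded


def train_test_split_70_30(rows, seed=1):
    """Shuffle the rows reproducibly and cut them 70% / 30%."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(0.7 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical sample with the five cut levels from the data description.
cuts = ["Fair", "Good", "Ideal", "Premium", "Very Good"]
stones = [{"carat": 0.3 + 0.1 * i, "cut": cuts[i % 5]} for i in range(10)]

encoded = one_hot_drop_first(stones, "cut")
train, test = train_test_split_70_30(encoded)
print(sorted(encoded[0]))
```

Dropping "Fair" as the baseline yields the columns cut_Good, cut_Ideal, cut_Premium and cut_Very Good, which are the cut terms that appear in the regression equation.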
Linear Regression Model
The first step was to fit the Linear Regression model to the train set, followed by determining the
coefficients of the variables and the intercept.
• Carat has the largest positive coefficient, 8885.65 units.
• All color dummies have negative coefficients, of which color J has the largest negative coefficient,
-2300.13 units.
• Table has the smallest coefficient magnitude, -24.75 units.
The linear regression equation:
Price = 8885.65 * carat – 28.66 * depth – 24.75 * table + 560.38 * cut_Good + 841.40 * cut_Ideal + 732.20 *
cut_Premium + 724.86 * cut_Very Good – 192.70 * color_E – 296.95 * color_F – 474.10 * color_G – 957.23 *
color_H – 1431.20 * color_I – 2300.13 * color_J + 5267.73 * clarity_IF + 3485.52 * clarity_SI1 + 2514.09 *
clarity_SI2 + 4428.67 * clarity_VS1 + 4134.25 * clarity_VS2 + 4967.10 * clarity_VVS1 + 4838.63 * clarity_VVS2
– 3951.25
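As a worked example, the equation above can be applied to a single stone. The coefficients below are copied from the equation; the stone itself (1.0 carat, depth 61.8, table 57, cut Ideal, color E, clarity VS1) is a hypothetical input chosen only for illustration.

```python
# Coefficients taken from the linear regression equation above.
INTERCEPT = -3951.25
COEFFICIENTS = {
    "carat": 8885.65, "depth": -28.66, "table": -24.75,
    "cut_Good": 560.38, "cut_Ideal": 841.40, "cut_Premium": 732.20,
    "cut_Very Good": 724.86,
    "color_E": -192.70, "color_F": -296.95, "color_G": -474.10,
    "color_H": -957.23, "color_I": -1431.20, "color_J": -2300.13,
    "clarity_IF": 5267.73, "clarity_SI1": 3485.52, "clarity_SI2": 2514.09,
    "clarity_VS1": 4428.67, "clarity_VS2": 4134.25,
    "clarity_VVS1": 4967.10, "clarity_VVS2": 4838.63,
}


def predict_price(features):
    """Evaluate the fitted equation; dummy columns that are absent
    from `features` are treated as 0 (the baseline level)."""
    return INTERCEPT + sum(COEFFICIENTS[name] * value
                           for name, value in features.items())


# Hypothetical stone, not a row from the dataset.
stone = {"carat": 1.0, "depth": 61.8, "table": 57.0,
         "cut_Ideal": 1, "color_E": 1, "clarity_VS1": 1}
price = predict_price(stone)
print(round(price, 2))
```

Levels that have no dummy column in the equation act as the baseline, so a stone with such a level simply contributes nothing for that category.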
The coefficients are both positive and negative: for instance, carat, cut_Good & cut_Ideal have
positive values, whereas depth, table & color have negative values. For example:
• When carat increases by 1 unit, price increases by 8885.65 units, keeping all other predictors
constant.
• When table increases by 1 unit, price decreases by 24.75 units, keeping all other predictors constant.
R-square:
Train data Test data
91.92% 91.29%
91.92% of the variation in the Price is explained by the predictors in the model for the train set and
91.29% for the test set. So, based on the R-square, we can say that our model has performed well and
that the prediction on the test data is in line with the train data.
RMSE:
Train data Test data
1133.37 1209.30
The above table shows the RMSE of the train & test data. The RMSE of the test data is a little higher
than that of the train data. The lower the RMSE the better, and for our model the value is high: a typical
prediction error is more than 1,100 price units. So, the error of our model is considerable, but the test
data performs close to the train data.
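The two metrics used in this section can be written out directly from their definitions. The sketch below is illustrative: the four actual and predicted prices are made-up numbers, not values from the dataset.

```python
from math import sqrt


def r_squared(actual, predicted):
    """Share of the variance in `actual` explained by `predicted`."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def rmse(actual, predicted):
    """Root mean squared error, in the same units as the target."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared) / len(squared))


# Hypothetical prices for illustration only.
actual = [500, 1500, 3000, 7000]
predicted = [700, 1300, 3400, 6600]
print(round(r_squared(actual, predicted), 4), round(rmse(actual, predicted), 2))
```

The toy numbers show how both statements in the text can hold at once: R-square can be high while the RMSE, measured in price units, is still large in absolute terms.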
[Scatter plots: Train Data, Test Data]
From the above graphs we can say that there is a linear trend.
Linear Regression using statsmodels (OLS)
The first step was to fit the OLS model to the train set, followed by determining the coefficients of the
variables and the intercept.
We can observe the following from the above table:
1. Both the R-square & Adjusted R-square are 91.9%, so 91.9% of the variation in the
price is explained by the predictors.
2. The p-value for all the variables is less than 0.05, so we can say that the variables have a
statistically significant relationship with the dependent variable, price.
3. Carat has the largest positive coefficient, 8885.65 units.
4. All color dummies have negative coefficients, of which color J has the largest negative coefficient,
-2300.13 units.
5. Table has the smallest coefficient magnitude, -24.75 units.
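The first observation, that R-square and Adjusted R-square coincide, follows from the standard adjustment formula when the number of rows is far larger than the number of predictors. The sketch below assumes roughly 18,870 training rows (70% of 26,958, an approximation not stated in the report) and the 20 predictors that appear in the equation.

```python
def adjusted_r_squared(r2, n_rows, n_predictors):
    """Adjusted R-square: penalises R-square for the number of predictors."""
    return 1 - (1 - r2) * (n_rows - 1) / (n_rows - n_predictors - 1)


# n_rows is an assumed approximation of the train set size.
adj = adjusted_r_squared(0.9192, n_rows=18870, n_predictors=20)
print(round(adj, 4))
```

With this many rows the penalty is about one hundredth of a percentage point, so both figures round to 91.9%.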
The linear regression equation:
Price = (-3951.25) * const + (8885.65) * carat + (-28.66) * depth + (-24.75) * table + (560.38) * cut_Good +
(841.4) * cut_Ideal + (732.2) * cut_Premium + (724.85) * cut_Very Good + (-192.7) * color_E + (-296.95) *
color_F + (-474.1) * color_G + (-957.23) * color_H + (-1431.2) * color_I + (-2300.13) * color_J + (5267.73) *
clarity_IF + (3485.52) * clarity_SI1 + (2514.09) * clarity_SI2 + (4428.67) * clarity_VS1 + (4134.25) *
clarity_VS2 + (4967.1) * clarity_VVS1 + (4838.62) * clarity_VVS2
The coefficients are both positive and negative: for instance, carat, cut_Good & cut_Ideal have
positive values, whereas depth, table & color have negative values. For example:
• When carat increases by 1 unit, price increases by 8885.65 units, keeping all other predictors
constant.
• When table increases by 1 unit, price decreases by 24.75 units, keeping all other predictors constant.
RMSE:
Train data Test data
1133.37 1212.54
The above table shows the RMSE of the train & test data. The RMSE of the test data is a little higher
than that of the train data. The lower the RMSE the better, and for our model the value is high. So, the
error of our model is considerable, but the test data performs close to the train data.
From the above graph we can say that there is a linear trend.
Comparison of both the models

Linear equation, Linear Regression Model:
Price = 8885.65 * carat – 28.66 * depth – 24.75 * table + 560.38 * cut_Good + 841.40 * cut_Ideal + 732.20 *
cut_Premium + 724.86 * cut_Very Good – 192.70 * color_E – 296.95 * color_F – 474.10 * color_G – 957.23 *
color_H – 1431.20 * color_I – 2300.13 * color_J + 5267.73 * clarity_IF + 3485.52 * clarity_SI1 + 2514.09 *
clarity_SI2 + 4428.67 * clarity_VS1 + 4134.25 * clarity_VS2 + 4967.10 * clarity_VVS1 + 4838.63 * clarity_VVS2
– 3951.25

Linear equation, Linear Regression using statsmodels:
Price = (8885.65) * carat + (-28.66) * depth + (-24.75) * table + (560.38) * cut_Good + (841.4) * cut_Ideal +
(732.2) * cut_Premium + (724.85) * cut_Very Good + (-192.7) * color_E + (-296.95) * color_F + (-474.1) *
color_G + (-957.23) * color_H + (-1431.2) * color_I + (-2300.13) * color_J + (5267.73) * clarity_IF +
(3485.52) * clarity_SI1 + (2514.09) * clarity_SI2 + (4428.67) * clarity_VS1 + (4134.25) * clarity_VS2 +
(4967.1) * clarity_VVS1 + (4838.62) * clarity_VVS2 + (-3951.25) * const

                  Linear Regression Model   Linear Regression using statsmodels
R-square          91.92%                    91.9%
RMSE for train    1133.37                   1133.37
RMSE for test     1209.30                   1212.54
Insights
• We have predicted the price of the cubic zirconia using 2 models. Both models gave nearly the same
output. The R-square is good at 91.9%, but the RMSE is also high.
• There is a strong correlation between carat, height, width and length. If one of them increases,
the others also increase. This strong relationship (multicollinearity) makes the coefficients of a
linear equation unreliable, so we removed 3 of the 4 strongly related variables to obtain a better
linear equation.
• The top variables contributing the most to the price of the cubic zirconia are:
o Carat
o Clarity
o Color
o X- length, Y- width & Z- height
• The variables contributing the least to the price of the cubic zirconia are:
o Depth
o Table
• The cubic zirconia manufacturer should focus on stones with the features contributing the most to
the price (Carat, Clarity, Length, Height & Width).
Problem 2
You are hired by a tour and travel agency which deals in selling holiday packages. You are provided details of
872 employees of a company. Among these employees, some opted for the package and some didn't. You
have to help the company in predicting whether an employee will opt for the package or not on the basis of
the information given in the data set. Also, find out the important factors on the basis of which the company
will focus on particular employees to sell their packages.
Executive Summary
The primary objective of the organization is to sell more holiday packages. One of the ways
to increase sales of the package is to sell it to employees as well. The available dataset provides the
details of the employees. Using these details, we can build a predictive model which predicts whether
an employee will opt for the holiday package or not. So, in this problem statement
we will perform Logistic Regression and Linear Discriminant Analysis to look for the best model for the
problem.
Introduction
The purpose of this whole exercise is to draw inferences from the dataset. We will perform Logistic
Regression and Linear Discriminant Analysis on the dataset. The dataset consists of the details of 872
employees of the company. We will try different models and prepare the predictive model that best
suits the situation.
Data Description
1. Holiday Package - Opted for Holiday Package yes/no
2. Salary - Employee salary
3. Age - Age in years
4. educ - Years of formal education
5. no_young_children - The number of young children (younger than 7 years)
6. no_older_children - Number of older children
7. foreign - foreigner Yes/No
Sample of the dataset
The dataset has 1 dependent variable – “Holliday Package” and 6 independent variables. Each variable has
a different set of attributes. Based on these variables, we will predict whether the holiday package has
been taken or not.
Exploratory Data Analysis
Data type in the data frame:
Range Index: 872 entries, 0 to 871

Column              Non-null count   Data type
Holliday_Package    872 non-null     object
Salary              872 non-null     int64
age                 872 non-null     int64
educ                872 non-null     int64
no_young_children   872 non-null     int64
no_older_children   872 non-null     int64
foreign             872 non-null     object

• There are 872 rows and 7 columns in the dataset.
• The dataset has 5 numerical variables and the remaining 2 are categorical variables.
• A predictive model requires numerical input, so the variables of object data type were converted
to numerical form.
Missing values in the dataset:
From the above results we can see that there is no missing value present in
the dataset. So, we do not need to treat any missing records and continue
with the model.
Describe
From the above table we can see that there are no missing values and no bad data. The ranges of the
numerical variables are very different from each other. The table also shows the most frequent attribute
of each categorical variable.
Boxplot
From the above boxplots we can see that all the numerical variables have outliers. All the outliers are
reasonable values. Here, we will not treat the outliers: treating them may make the model perform better
during modelling, but on real data it may not perform as well, because it loses generalization.
Histogram
Bimodal histograms            Right skewed variables    Nearly normally distributed variables (with outliers)
• Education                   • Salary                  • Age
• Number of young children
• Number of older children
Pairplot
Pairplot shows the relationship between the variables in the form of scatterplot and the distribution of the
variable in the form of histogram.
From the scatter plots we can do multivariate analysis. In the graph the orange dots are the employees who
did not choose the holiday package and the blue dots are the ones who did. From the scatterplots, we can
see that there is no clear linear relationship between any of the variables.
Correlation Plot
From the correlation plot, we can see that most of the variables are not highly correlated to each other.
Correlation values near 1 or -1 indicate a strong positive or a strong negative correlation
respectively. Correlation values near 0 indicate little or no linear correlation.
Overall, the correlation plot shows no strong relationship between any pair of variables.
Model Analysis
Train and Test data
The dataset has been divided into two parts – X and Y, X being the set of independent variables and Y being
the dependent variable. The data was then split into train and test sets in a 70%:30% ratio. Before the
split, the categorical variables of object data type were converted to numeric codes.
The table below shows the class breakdown of the Holiday Package variable:

Holiday Package   Original dataset   Train data   Test data
No or 0           54.01%             53.93%       54.2%
Yes or 1          45.99%             46.07%       45.8%

We can see that the split is sound, as the class proportions in the train and test data closely match
those of the original dataset.
Logistic Regression Model
In creating the Logistic Regression model, we searched over the following parameters:

1. penalty: L2, none, L1: the type of regularization applied to the coefficients; regularization
shrinks the coefficient values towards zero
2. solver: newton-cg, sag, saga: the algorithm used to solve the optimization problem; ‘sag’ & ‘saga’
are faster for larger datasets
3. tol: 0.0001, 0.00001: the tolerance for the stopping criterion, i.e. how small the improvement
must be before the solver stops iterating
4. multi_class: multinomial, auto, ovr: how classification tasks with more than two class labels
are handled
Following is the attribute output of the Grid search:
1. penalty: none
2. solver: newton-cg
3. tol: 0.0001
4. multi_class: multinomial
The table shows, for each employee, the predicted probability of not opting for the
holiday package (‘0’) and of opting for it (‘1’). The cut-off is a probability above 50%.
We can see that 4 out of 5 employees would not opt for the package, as their
probability of not opting is above 50%, and employee number 3 would opt for the
package.
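How such a probability table turns into class predictions can be sketched as follows. A logistic regression maps a linear score to a probability through the sigmoid function; the five linear scores below are hypothetical values picked so that the outcome mirrors the sample above (only employee number 3 is above the 50% cut-off), they are not taken from the fitted model.

```python
from math import exp


def sigmoid(z):
    """Logistic function: maps any real score to a value in (0, 1)."""
    return 1 / (1 + exp(-z))


def class_probabilities(linear_score):
    """Return (P(class 0), P(class 1)) for one employee."""
    p_yes = sigmoid(linear_score)
    return 1 - p_yes, p_yes


# Hypothetical linear scores for employees 0 to 4.
scores = [-0.8, -0.3, -1.2, 0.4, -0.1]
probabilities = [class_probabilities(s) for s in scores]

# Predict 1 (opts for the package) only when P(class 1) is above 50%.
predicted = [1 if p_yes > 0.5 else 0 for _, p_yes in probabilities]
print(predicted)
```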
Confusion matrix:

Confusion Matrix   Train Data   Test Data   Meaning
True Negative      245          109         The employee did not opt for the package and the model predicted the same
False Negative     119          58          The employee did opt for the package but the model predicted did not opt
False Positive     84           33          The employee did not opt for the package but the model predicted did opt
True Positive      162          62          The employee did opt for the package and the model predicted the same
Classification report:
For this case, a False Negative is the most harmful: if an employee is a potential buyer but the model
predicts that they are not, the company loses one of its customers, which results in a loss of business.
True Positives come next in importance, as knowing the potential customers is important for any organization.
So, if we compare the recall, we can infer that:

1. Recall for employees not opting for the package is better, with 74% and 77% for train and test data
respectively. The model has performed better on the test data.
2. Recall for employees opting for the package is not good, with 58% and 52% for train and test data
respectively. The model is not that good here, as the difference between train and test is more than
5%.
3. The accuracy on the train and test data is also not good, with 67% and 65% respectively. The model
is stable, as the difference between train and test is within 2%, but the output of the model
is not adequate.
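The recall and accuracy figures above can be reproduced from the four confusion matrix counts. The function below is an illustrative sketch (its name and dictionary keys are assumptions); the counts are the ones reported in the confusion matrix table for the Logistic Regression model.

```python
def classification_metrics(tn, fn, fp, tp):
    """Derive the usual metrics from the four confusion matrix cells."""
    precision_yes = tp / (tp + fp)
    recall_yes = tp / (tp + fn)
    return {
        "accuracy": (tn + tp) / (tn + fn + fp + tp),
        "recall_no": tn / (tn + fp),    # recall for class 0 (did not opt)
        "recall_yes": recall_yes,       # recall for class 1 (opted)
        "precision_yes": precision_yes,
        "f1_yes": 2 * precision_yes * recall_yes / (precision_yes + recall_yes),
    }


# Counts from the confusion matrix table.
train = classification_metrics(tn=245, fn=119, fp=84, tp=162)
test = classification_metrics(tn=109, fn=58, fp=33, tp=62)

for name in train:
    print(f"{name}: train {train[name]:.2f}, test {test[name]:.2f}")
```

Recall for class 1 uses the False Negatives in its denominator, which is why it is the metric to watch when missed buyers are the costliest error.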
ROC_AUC score and ROC curve:
The ROC AUC scores of 0.735 and 0.717 for the train and test data respectively (as listed in the
comparison table) show that the model performs consistently, as the two scores are within 5% of each
other. The ROC curves of the train and test data also look similar.
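For reference, the ROC AUC can be read as the probability that a randomly chosen employee who opted for the package receives a higher predicted probability than a randomly chosen employee who did not. The sketch below computes it by comparing all such pairs; the labels and scores are hypothetical and the pairwise approach is chosen for clarity, not speed.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison: the share of
    (positive, negative) pairs ranked correctly, ties counting half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Hypothetical labels (1 = opted) and predicted probabilities of opting.
labels = [0, 0, 1, 1, 0, 1]
scores = [0.20, 0.55, 0.60, 0.45, 0.30, 0.80]
auc = roc_auc(labels, scores)
print(round(auc, 3))
```

A score of 0.5 corresponds to random ranking and 1.0 to perfect separation, so values around 0.72 to 0.73 indicate a moderate ability to rank employees.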
Linear Discriminant Analysis
The first step was to fit the Linear Discriminant Analysis model to the train set, followed by predicting
on the test data.
The table shows, for each employee, the predicted probability of not opting for the
holiday package (‘0’) and of opting for it (‘1’). The cut-off is a probability above 50%.
We can see that all 5 employees would not opt for the package, as their
probability of not opting is above 50%.
Confusion matrix:
Classification report:
For this case, a False Negative is the most harmful: if an employee is a potential buyer but the model
predicts that they are not, the company loses one of its customers, which results in a loss of business.
True Positives come next in importance, as knowing the potential customers is important for any organization.
So, if we compare the recall, we can infer that:

1. Recall for employees not opting for the package is better, with 74% and 77% for train and test data
respectively. The model has performed better on the test data.
2. Recall for employees opting for the package is not good, with 58% and 49% for train and test data
respectively. The model is not that good here, as the difference between train and test is more than
5%.
3. The accuracy on the train and test data is also not good, with 66% and 64% respectively. The model
is stable, as the difference between train and test is within 2%, but the output of the model
is not adequate.
ROC_AUC score and ROC curve:
The ROC AUC scores of 0.733 and 0.714 for the train and test data respectively show that the model
performs consistently, as the two scores are within 5% of each other. The ROC curves of the train and
test data also look similar.
Changing the cut-off manually:
The default cut-off for the probability that an employee opts for the holiday package is 50%. Here, ‘0’
is for not opting for the package whereas ‘1’ is for opting for it. We will now vary the cut-off to find
the value at which the model prediction improves.
We tried cut-offs from 10% to 90% and got the best Accuracy and F1 score at the cut-off of 40%. Below
is the confusion matrix at the 40% cut-off.
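The cut-off search described above can be sketched as a simple loop over candidate thresholds. Everything in the example is illustrative: the helper names are assumptions, and the eight labels and probabilities are made-up values, so the best cut-off found here says nothing about the report's data.

```python
def predict_with_cutoff(probabilities, cutoff):
    """Predict class 1 when the probability of opting exceeds the cut-off."""
    return [1 if p > cutoff else 0 for p in probabilities]


def accuracy_and_f1(labels, predictions):
    """Return (accuracy, F1 score for class 1)."""
    pairs = list(zip(labels, predictions))
    tp = sum(y == 1 and p == 1 for y, p in pairs)
    fp = sum(y == 0 and p == 1 for y, p in pairs)
    fn = sum(y == 1 and p == 0 for y, p in pairs)
    tn = sum(y == 0 and p == 0 for y, p in pairs)
    accuracy = (tp + tn) / len(pairs)
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return accuracy, f1


# Hypothetical true labels and predicted probabilities of opting.
labels = [1, 0, 1, 1, 0, 0, 1, 0]
probs = [0.45, 0.30, 0.62, 0.41, 0.48, 0.20, 0.75, 0.35]

# Try cut-offs from 10% to 90% in steps of 10 percentage points.
cutoffs = [c / 10 for c in range(1, 10)]
results = {c: accuracy_and_f1(labels, predict_with_cutoff(probs, c))
           for c in cutoffs}
best_cutoff = max(results, key=lambda c: results[c][1])
print(best_cutoff, results[best_cutoff])
```

Lowering the cut-off below 50% trades precision for recall, which suits a problem where False Negatives are the most costly error.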
Confusion matrix:

Metric      Train Data   Test Data   Remark
Accuracy    0.67         0.65        It remained similar to the accuracy of the previous models
Recall      0.76         0.73        The recall has improved significantly in this model
Precision   0.61         0.59        It has degraded compared to the previous models

The report shows the classification report of the test data at the cut-off of 40%.
Comparison of the models and Inferences
            Logistic regression   Logistic regression   LDA          LDA
            Train Data            Test Data             Train Data   Test Data
Accuracy    0.67                  0.65                  0.66         0.64
AUC         0.735                 0.717                 0.733        0.714
Recall      0.58                  0.52                  0.58         0.49
Precision   0.66                  0.65                  0.65         0.64
F1 Score    0.61                  0.58                  0.61         0.56

Comparing the metrics of both models, we can say that the Logistic Regression model did a slightly better
job on the test data.
Below are the observations:
1. We have predicted whether an employee will opt for the holiday package using 2 models. Both models
gave similar output. The models were built without outlier treatment to preserve generality.
2. A 67% accuracy score on the train data is obtained for the Logistic Regression model, which uses
solver = “newton-cg”. This is the best accuracy among the models.
3. A 66% accuracy score on the train data is obtained for the LDA model.
4. A 52% recall on the test data for employees opting for the package is obtained for the Logistic
Regression model (solver = “newton-cg”).
5. The model still needs to improve, as it predicts incorrectly about 33% of the time and the recall is
also low.
6. The company can provide employee discounts to motivate the employees to opt for the package, as it is
then available to them at a lower price compared to others.
7. The company can provide different packages to different age groups of employees, for example:
religious places for older employees, adventure places for young employees and activity places for
employees with young kids.
8. They can also provide international packages at a discounted price to maximize interest in
international trips.
THE END!