
PREDICTIVE MODELING
PADMA
PGP-DSBA ONLINE
MAY'22
DATE: 08/05/2022

Table of Contents
Problem 1: Linear Regression
Introduction............................................................................................................................6

Sample of data set...................................................................................................................7


Data Description......................................................................................................................9
Exploratory Data Analysis……………………………………………………………………………………………………10
Let us check the types of variables in the data frame. ……………………………………………………..10
1.1. Read the data and do exploratory data analysis. Describe the data briefly. (Check the
null values, Data types, shape, EDA, duplicate values). Perform Univariate and Bivariate
Analysis………………………………………………………………………………………………………………………….7-22
1.2 Impute null values if present, also check for the values which are equal to zero. Do they
have any meaning or do we need to change them or drop them? Check for the possibility of
combining the sub levels of a ordinal variables and take actions accordingly. Explain why you
are combining these sub levels with appropriate reasoning………………………………………….22-27
1.3 Encode the data (having string values) for Modelling. Split the data into train and test
(70:30). Apply Linear regression using scikit learn. Perform checks for significant variables
using appropriate method from statsmodel. Create multiple models and check the
performance of Predictions on Train and Test sets using Rsquare, RMSE & Adj Rsquare.
Compare these models and select the best one with appropriate reasoning………………27-43
1.4 Inference: Basis on these predictions, what are the business insights and
recommendations………………………………………………………………………………………………………43-45

Problem 2: Logistic Regression and LDA


Introduction...........................................................................................................................45
Sample of data set..................................................................................................................46
Data Description.....................................................................................................................47

Exploratory Data Analysis……………………………………………………………………………………………………49


Let us check the types of variables in the data frame. ……………………………………………………....48
2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do null value
condition check, write an inference on it. Perform Univariate and Bivariate Analysis. Do
exploratory data analysis……………………………………………………………………………………………….46-64
2.2 Do not scale the data. Encode the data (having string values) for Modelling. Data Split:
Split the data into train and test (70:30). Apply Logistic Regression and LDA (linear
discriminant analysis)………………………………………………………………………………………………..64-80

2.3 Performance Metrics: Check the performance of Predictions on Train and Test sets using
Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model Final
Model: Compare Both the models and write inference which model is
best/optimized………………………………………………………………………………………………………………65-80
2.4 Inference: Basis on these predictions, what are the insights and
recommendations………………………………………………………………………………………………………..80-82

List of Figures:
1.1 Boxplot for numerical variables………………………………………………………………………………………11
1.1.2 Boxplot after capping outliers…………………………………………………………………………………….12
1.1.3 Hist plot for numerical variables…………………………………………………………………………………13
1.1.4 Heat map for numerical variables………………………………………………………………………………15
1.1.5 Pair plot for numerical variables…………………………………………………………………………………16
1.1.6 Count plot cut…………………………………………………………………………………………………………...17
1.1.7 Boxplot cut vs price…………………………………………………………………………………………………….17
1.1.8 Count plot for color…………………………………………………………………………………………………...18
1.1.9 Boxplot color vs price………………………………………………………………………………………………….18
1.1.10 Count plot clarity………………………………………………………………………………………………………19
1.1.11 Boxplot clarity vs price……………………………………………………………………………………………...19
1.2.1 Heat map for numerical and categorical data……………………………………………………………..32
1.3.1 Scatter plot on test data……………………………………………………………………………………………..35
1.3.2 Heat map after encoding…………………………………………………………………………………………….36
1.3.3 Scatter plot after scaling on test data…………………………………………………………………………41
2.1.1 Hist plots for numerical data……………………………………………………………………………………...52
2.1.2 Boxplots for numeric data…………………………………………………………………………………………..52
2.1.3 Boxplots after capping outliers……………………………………………………………………………………53
2.1.4 Boxplot Holliday package vs salary……………………………………………………………………………..54
2.1.5 Boxplot Holliday package vs age………………………………………………………………………………….55
2.1.6 Count plot for Holliday package vs age……………………………………………………………………….55
2.1.7 Boxplot for Holliday package vs educ………………………………………………………………………….56
2.1.8 Count plot for Holliday package and educ…………………………………………………………………..56
2.1.9 Boxplot for Holliday package and no young children…………………………………………………..57
2.1.10 Count plot for Holliday package and no young children…………………………………………….57
2.1.11 Stacked bar chart foreign vs no young children…………………………………………………………58
2.1.12 Boxplot for Holliday package and no older children………………………………………………….59
2.1.13 Count plot for Holliday package and no older children……………………………………………..60
2.1.14 Stacked bar chart foreign vs no older children………………………………………………………….60
2.1.15 Heat map for numerical and categorical data……………………………………………………………61
2.1.16 Pair plot for numerical and categorical data……………………………………………………………..62
2.1.17 Count plot for Holliday package………………………………………………………………………………..63
2.1.18 Count plot for foreign……………………………………………………………………………………………….63
2.3.1 AUC and ROC on train data - Logistic regression…………………………………………………………70
2.3.2 Confusion matrix on train data - Logistic regression……………………………………………………70
2.3.3 AUC and ROC on test data - Logistic regression…………………………………………………………..71
2.3.4 Confusion matrix on test data - Logistic regression……………………………………………………..71
2.3.5 Confusion matrix on train data - Logistic regression grid search………………………………….74
2.3.6 AUC and ROC on train data - Logistic regression grid search……………………………………….74
2.3.7 Confusion matrix on test data - Logistic regression grid search……………………………………75
2.3.8 AUC and ROC on test data - Logistic regression grid search…………………………………………75
2.3.9 Confusion matrix on train data - LDA………………………………………………………………………….76
2.3.10 Confusion matrix on test data - LDA………………………………………………………………………….77
2.3.11 AUC and ROC on train data - LDA………………………………………………………………………………78
2.3.12 AUC and ROC on test data - LDA……………………………………………………………………………….78

List of Tables:
1.1 Sample of data……………………………………………………………………………………………………………….7
1.1.2 Description of data……………………………………………………………………………………………………..9
1.1.3 Checking correlation of the data…………………………………………………………………………………14
1.2.1 Descriptive summary of zero values in the x, y, z columns…………………………………………..24
1.2.3 Categorical variables with encoded values………………………………………………………………….37
1.2.4 OLS Regression Results………………………………………………………………………………………………39
2.1.1 Sample of the data…………………………………………………………………………………………………….47
2.1.2 Descriptive summary of the data………………………………………………………………………………..47
2.1.3 Checking correlation of the data…………………………………………………………………………………61
2.1.4 Categorical variables with encoded values………………………………………………………………….68

Linear Regression
Introduction:
Regression analysis may be one of the most widely used statistical techniques for studying
relationships between variables. We use simple linear regression to analyze the impact of a
numeric variable (i.e., the predictor) on another numeric variable (i.e., the response
variable). For example, managers at a call center may want to know how the number of orders
that resulted from calls relates to the number of phone calls received. Many software
packages support regression analysis, such as Microsoft Excel, R, and Python, and anyone who
knows the basics of Microsoft Excel can build a simple linear regression model by following a
tutorial. However, to interpret the regression outputs and assess the model's usefulness, we
need to understand regression analysis: "all models are wrong, but some are useful." To be
capable of discovering useful models and making plausible predictions, we should have an
appreciation of both regression analysis theory and domain-specific knowledge. IT
professionals are experts at learning and using software packages, but when business users
suddenly ask them to perform a regression analysis, those whose mathematical background does
not include regression analysis need to gain the skills to discover plausible models,
interpret regression outputs, and make inferences.

Problem 1:
You are hired by a company Gem Stones co ltd, which is a cubic zirconia manufacturer. You
are provided with the dataset containing the prices and other attributes of almost 27,000
cubic zirconia (which is an inexpensive diamond alternative with many of the same qualities
as a diamond). The company is earning different profits on different price slots. You have to
help the company in predicting the price of a stone on the basis of the details given in
the dataset so it can distinguish between higher profitable stones and lower profitable
stones so as to have better profit share. Also, provide them with the best 5 attributes that
are most important.

Data Dictionary:
Variable Name: Description
Carat: Carat weight of the cubic zirconia.
Cut : Describe the cut quality of the cubic zirconia. Quality is increasing order Fair, Good,
Very Good, Premium, Ideal.
Color: Colour of the cubic zirconia. With D being the worst and J the best.

Clarity: Clarity refers to the absence of the Inclusions and Blemishes. (In order from Worst to
Best in terms of avg price) IF, VVS1, VVS2, VS1, VS2, SI1, SI2, I1

Depth: The Height of cubic zirconia, measured from the Culet to the table, divided by its
average Girdle Diameter.
Table: The Width of the cubic zirconia's Table expressed as a Percentage of its Average
Diameter.
Price: The Price of the cubic zirconia.
X: Length of the cubic zirconia in mm.

1.1. Read the data and do exploratory data analysis. Describe the data
briefly. (Check the null values, Data types, shape, EDA, duplicate
values). Perform Univariate and Bivariate Analysis.

Sample of the dataset :

Unnamed: 0 carat cut color clarity depth table x y z price


0 1 0.30 Ideal E SI1 62.1 58.0 4.27 4.29 2.66 499
1 2 0.33 Premium G IF 60.8 58.0 4.42 4.46 2.70 984

2 3 0.90 Very Good E VVS2 62.2 60.0 6.04 6.12 3.78 6289
3 4 0.42 Ideal F VS1 61.6 56.0 4.82 4.80 2.96 1082
4 5 0.31 Ideal F VVS1 60.4 59.0 4.35 4.43 2.65 779
Table no 1.1Sample of Data

The first five rows of the dataset are given by the head function.

Data dimensions:
The data set contains 26,967 rows and 11 columns.
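A minimal sketch of how the data can be loaded and inspected with pandas; the file name 'cubic_zirconia.csv' is an assumption for illustration:

import pandas as pd

# Load the dataset (file name assumed) and take a first look.
df = pd.read_csv("cubic_zirconia.csv")
print(df.head())    # first five rows, as shown above
print(df.shape)     # (26967, 11)
df.info()           # data types and non-null counts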

Structure of the Dataset:


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26967 entries, 0 to 26966

Data columns (total 11 columns):


# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 26967 non-null int64
1 carat 26967 non-null float64
2 cut 26967 non-null object

3 color 26967 non-null object
4 clarity 26967 non-null object
5 depth 26270 non-null float64

6 table 26967 non-null float64


7 x 26967 non-null float64
8 y 26967 non-null float64
9 z 26967 non-null float64
10 price 26967 non-null int64

dtypes: float64(6), int64(2), object(3)


memory usage: 2.3+ MB
Observations:
1. In the given data set there are 2 integer-type features, 6 float-type features and 3 object-type features, where 'price' is the target variable and all others are predictor variables.
2. Carat, depth, table, x, y and z are numerical or continuous variables; cut, clarity and color are categorical variables.
3. The first column ("Unnamed: 0") is only a serial-number index, so we can remove it.

Descriptive Statistics Summary:


       carat         depth         table         x             y             z             price
count  26967.000000  26270.000000  26967.000000  26967.000000  26967.000000  26967.000000  26967.000000
mean   0.798375      61.745147     57.456080     5.729854      5.733569      3.538057      3939.518115
std    0.477745      1.412860      2.232068      1.128516      1.166058      0.720624      4024.864666
min    0.200000      50.800000     49.000000     0.000000      0.000000      0.000000      326.000000
25%    0.400000      61.000000     56.000000     4.710000      4.710000      2.900000      945.000000
50%    0.700000      61.800000     57.000000     5.690000      5.710000      3.520000      2375.000000
75%    1.050000      62.500000     59.000000     6.550000      6.540000      4.040000      5360.000000
max    4.500000      73.600000     79.000000     10.230000     58.900000     31.800000     18818.000000

Table 1.1.2 Description of the data

Observations:

1. Carat: This is an independent variable, ranging from 0.2 to 4.5. The mean value is around 0.8 and 75% of the stones are of 1.05 carat or less. The standard deviation is around 0.478, and the mean lying above the median indicates a right-skewed distribution: the majority of the stones are of lower carat, with very few stones above 1.05 carat.

2. Depth: The percentage height of the cubic zirconia stones is in the range of 50.80 to 73.60. The median height is 61.80, 25% of the stones are below 61 and 75% are below 62.5. The standard deviation of the height is 1.4, indicating a roughly normal distribution.

3. Table: The percentage width of the cubic zirconia is in the range of 49 to 79. The average is around 57; 25% of the stones are below 56 and 75% have a width of less than 59. The standard deviation is 2.24. Thus the data does not show a normal distribution and, similar to carat, most of the stones have a lower width; this also indicates outliers in the variable.

4. Price: Price is the predicted variable. Prices are in the range of 326 to 18818. The median price is 2375, while 25% of the stones are priced below 945 and 75% are priced below 5360. The standard deviation of price is about 4025, and the distribution is right-skewed, indicating that prices of the majority of the stones are in the lower range.

5. The variables x, y and z seem to follow a normal distribution with a few outliers.

Descriptive summary of categorical variable:


cut color clarity
count 26967 26967 26967

unique 5 7 8
top Ideal G SI1
freq 10816 5661 6571

Checking for the values which are equal to zero:


Number of rows with x == 0: 3
Number of rows with y == 0: 3

Number of rows with z == 0: 9
Number of rows with depth == 0: 0
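A minimal sketch of how these zero-value counts can be computed, assuming the dataframe is named df:

# Count rows where the dimensions or depth are recorded as zero.
for col in ['x', 'y', 'z', 'depth']:
    print(f"Number of rows with {col} == 0: {(df[col] == 0).sum()}")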
Observations:

1. In the given data set the mean and median values do not differ much. The minimum values of "x", "y" and "z" are zero, which indicates faulty values: dimensionless or 2-dimensional diamonds are not possible, so we need to filter these rows out as clearly faulty data entries. There are three object data types: 'cut', 'color' and 'clarity'.
2. After dropping these rows, the shape of the data set is 26958 rows and 10 columns.

Checking for duplicate records in the data:


Number of duplicate rows = 33
(26958, 10)
Observations:
There are a total of 33 duplicate rows, which we will drop.
After dropping the duplicates, the shape of the data set is 26925 rows and 10 columns.
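A minimal sketch of the duplicate check and removal, assuming the index column has already been dropped:

# Count exact duplicate rows, then drop them.
print("Number of duplicate rows =", df.duplicated().sum())
df = df.drop_duplicates().reset_index(drop=True)
print(df.shape)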

Check for Missing Values:


carat 0
cut 0
color 0
clarity 0
depth 697
table 0
x 0
y 0
z 0
price 0
dtype: int64

Only 'depth' has missing values (697), which we will impute with its median value.
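A minimal sketch of the median imputation for 'depth', the only column with missing values:

# Fill the 697 missing 'depth' values with the column median.
df['depth'] = df['depth'].fillna(df['depth'].median())
print(df.isnull().sum())   # every count should now be zero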

EDA
Univariate Analysis:
Distribution of Numeric Features:

Fig 1.1 Boxplot for numerical variables

Observations:
1. A significant number of outliers are present in the 'carat', 'depth', 'table', 'x', 'y', 'z' and 'price' variables.
2. We can see that the distributions of variables like 'carat' and the target feature 'price' are heavily right-skewed.
3. The variables 'depth', 'x', 'y' and 'z' seem to follow a normal distribution with a few outliers.
4. A large number of outliers are present in all the variables (carat, depth, table, x, y, z); price is right-skewed with a large range of outliers.

Treating the outliers:

1.1.2 Boxplot after capping outliers

After capping the outliers, 'depth' still has a few outliers.
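The report does not state the exact capping rule used; a common IQR-based approach, shown here as an assumption, clips each numeric column to the 1.5*IQR whiskers:

# Cap outliers at Q1 - 1.5*IQR and Q3 + 1.5*IQR (assumed capping rule).
def cap_outliers(series):
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return series.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

num_cols = ['carat', 'depth', 'table', 'x', 'y', 'z', 'price']
df[num_cols] = df[num_cols].apply(cap_outliers)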

Histograms showing the distributions of the numerical variables:

1.1.3 Hist plots for numerical variables

Observations:
1. From the above histograms it is seen that, except for the carat and price variables, all other variables have mean and median values very close to each other, suggesting there is little skewness in these variables.
2. For carat and price we see some difference between the mean and median values, which indicates some skewness in the data.
3. Depth is the only variable which can be considered normally distributed; carat, table, x, y and z have multiple modes across the spread of the data.
4. A large number of outliers are present in all the variables (carat, depth, table, x, y, z); price is right-skewed with a large range of outliers.

Skewness of the numerical variable :


carat 0.917214
depth -0.025042

table 0.480476
x 0.397696
y 0.394060
z 0.394819
price 1.157121

dtype: float64
Observations:
1. A significant number of outliers are present in the 'carat', 'depth', 'table', 'x', 'y', 'z' and 'price' variables.
2. We can see that the distributions of some quantitative features like "carat" and the target feature "price" are heavily right-skewed.
3. The variables x, y and z seem to follow a normal distribution with a few outliers.

Bivariate Analysis:
Numeric Features - Checking for Correlations:
carat depth table x y z price

carat 1.00 0.03 0.19 0.98 0.98 0.98 0.94


depth 0.03 1.00 -0.29 -0.02 -0.02 0.10 0.00
table 0.19 -0.29 1.00 0.20 0.19 0.16 0.14
x 0.98 -0.02 0.20 1.00 1.00 0.99 0.91
y 0.98 -0.02 0.19 1.00 1.00 0.99 0.91

z 0.98 0.10 0.16 0.99 0.99 1.00 0.91


price 0.94 0.00 0.14 0.91 0.91 0.91 1.00
Table 1.1.3 Checking for correlations of the data

Heatmap for Numerical variable:

1.1.4 Heat map for numerical data

Observations:
1. There is high correlation between features like carat, x, y, z and price.
2. Table shows weak correlation with the other features.
3. Depth is negatively correlated with most of the other features, except for carat.
4. Carat, x, y and z demonstrate strong correlation with each other, i.e. multicollinearity (a sketch of how this correlation matrix and heatmap can be produced follows below).
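A minimal sketch of how the correlation matrix and heatmap can be produced, assuming seaborn and matplotlib are available:

import matplotlib.pyplot as plt
import seaborn as sns

# Correlation matrix of the numeric columns, drawn as an annotated heatmap.
corr = df[['carat', 'depth', 'table', 'x', 'y', 'z', 'price']].corr()
plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.show()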

Pair plots for Numerical variable:

1.1.5 Pair plot for numerical data

Pair plot allows us to see both distribution of single variable and relationships between two
variables.

EDA for Categorical variable:
Count plot (Cut):

1.1.6 Count plot for cut

Boxplot(Cut vs Price):

1.1.7 Boxplot cut vs price

Observations:

1. For the cut variable, the most sold is the Ideal cut type and the least sold is the Fair cut type.
2. All cut types have outliers with respect to price.
3. Ideal cut stones seem to be priced slightly lower, while Premium cut stones seem to be slightly more expensive.

Count plot(Color):

1.1.8 Count plot for color

Boxplot (Color vs Price):

1.1.9 Boxplot color vs price

Observations:
1. For the color variable, the most sold is G colored gems and the least sold is J colored gems.
2. All color types have outliers with respect to price.
3. However, the least priced seem to be E type, while J and I colored gems seem to be more expensive.

Count plot(Clarity):

1.1.10 Count plot for clarity

Box plot(Clarity vs Price):

1.1.11 Boxplot clarity vs price

Observations:

1. For the clarity variable, the most sold is SI1 clarity gems and the least sold is I1 clarity gems.
2. All clarity types have outliers with respect to price.
3. SI1 type seems to be priced slightly lower, while VS2 and SI2 clarity stones seem to be more expensive.

Getting unique counts of the numeric and categorical variables:


carat
_____
0.30 1328
0.31 1118
1.01 1109
0.70 959
0.32 949
...
1.96 1
1.88 1
1.94 1
1.89 1
0.22 1
Name: carat, Length: 183, dtype: int64

cut
_____
Ideal 10805
Premium 6880
Very Good 6027
Good 2434
Fair 779
Name: cut, dtype: int64

color
_____
G 5650
E 4916
F 4722
H 4091
D 3341
I 2765
J 1440
Name: color, dtype: int64

clarity
_____
SI1 6564
VS2 6092
SI2 4561
VS1 4086
VVS2 2530
VVS1 1839
IF 891
I1 362
Name: clarity, dtype: int64

depth
_____
62.0 1128
61.9 1090
62.1 1014
61.8 1012
62.2 976
...
50.8 1
71.0 1
52.7 1
71.3 1
70.8 1
Name: depth, Length: 169, dtype: int64

table
_____
56.0 4983
57.0 4771
58.0 4252
59.0 3296
55.0 3133
...
51.6 1
61.6 1
60.9 1
58.6 1
58.7 1
Name: table, Length: 99, dtype: int64

x
_____
4.38 233
4.37 229
4.32 227
4.33 225
4.34 224
...
9.14 1
9.03 1
3.74 1
9.30 1
3.83 1
Name: x, Length: 520, dtype: int64

y
_____
4.35 236
4.38 234
4.34 223
4.37 223
4.31 212
...

3.75 1
9.14 1
8.81 1
8.98 1
3.87 1
Name: y, Length: 516, dtype: int64

z
_____
2.70 393
2.69 393
2.68 373
2.71 368
2.72 352
...
5.65 1
1.53 1
5.48 1
2.30 1
2.06 1
Name: z, Length: 344, dtype: int64

price
_____
11965.0 1778
544.0 74
625.0 67
828.0 66
776.0 66
...
9678.0 1
690.0 1
10416.0 1
7898.0 1
6751.0 1
Name: price, Length: 7274, dtype: int64

1.2 Impute null values if present, also check for the values which
are equal to zero. Do they have any meaning or do we need to
change them or drop them? Check for the possibility of combining
the sub levels of a ordinal variables and take actions accordingly.
Explain why you are combining these sub levels with appropriate
reasoning.
Check for Missing Values:
carat      0
cut        0
color      0
clarity    0
depth    697
table      0
x          0
y          0
z          0
price      0
dtype: int64

Median values:

carat 0.70
depth 61.80
table 57.00
x 5.69
y 5.70
z 3.52
price 2373.00
dtype: float64

observations:

1.There are missing values in the column “depth” – 697 cells or 2.6% of the total data set.
2. We can choose to impute these values using the mean or the median; we checked both, and the results are almost identical, so the median is used.

Imputing missing values:


carat 0
cut 0
color 0
clarity 0
depth 0
table 0
x 0
y 0
z 0
price 0
dtype: int64

After imputing the missing values with the median, the 697 missing cells in 'depth' are filled and the missing-value count drops to zero.
Checking for the values which are equal to zero:
Number of rows with x == 0: 3
Number of rows with y == 0: 3
Number of rows with z == 0: 9
Number of rows with depth == 0: 0

Dropping dimensionless diamonds:
Shape: (26925, 10)
Observations:

1. There are three object data types: 'cut', 'color' and 'clarity'. After dropping the dimensionless diamonds, the shape of the data set is 26925 rows and 10 columns.

2. We have already checked for 'zero' values, and there are some zero values present in the variables 'x', 'y' and 'z'. This indicates that they are faulty values.

3. As we know, dimensionless or 2-dimensional diamonds are not possible, so we have filtered those out as they are clearly faulty data entries.

Describe Function showing presence of 0 values in x, y, and z columns:


       carat         depth         table         x             y             z             price
count  26967.000000  26270.000000  26967.000000  26967.000000  26967.000000  26967.000000  26967.000000
mean   0.798375      61.745147     57.456080     5.729854      5.733569      3.538057      3939.518115
std    0.477745      1.412860      2.232068      1.128516      1.166058      0.720624      4024.864666
min    0.200000      50.800000     49.000000     0.000000      0.000000      0.000000      326.000000
25%    0.400000      61.000000     56.000000     4.710000      4.710000      2.900000      945.000000
50%    0.700000      61.800000     57.000000     5.690000      5.710000      3.520000      2375.000000
75%    1.050000      62.500000     59.000000     6.550000      6.540000      4.040000      5360.000000
max    4.500000      73.600000     79.000000     10.230000     58.900000     31.800000     18818.000000

Table 1.2.1 Describe summary showing 0 values in the x, y, z columns

Observations:
1. While there are no missing values in the numerical columns, there are a few 0s in the columns x (3), y (3) and z (9) in the database. A single row was dealt with while checking for duplicates, and the other eight rows were taken care of here.
2. Since the total number of rows with a 0 value in them was only 8, it accounts for a negligible number, and for this case study we could simply drop them.
3. Also, the correlation values show strong multicollinearity between all three columns, so there is a good chance I won't even use them in creating my linear regression model.
4. I have chosen to drop those rows, as they represent an insignificant number compared to the overall dataset and would not add much value to the analysis here.

Describe Function confirming there are no 0 values in x, y, and z column:


       carat         depth         table         x             y             z             price
count  26925.000000  26925.000000  26925.000000  26925.000000  26925.000000  26925.000000  26925.000000
mean   0.793119      61.746982     57.435023     5.729217      5.731159      3.537625      3734.453965
std    0.461998      1.393457      2.156704      1.125500      1.117494      0.695681      3466.394724
min    0.200000      50.800000     51.500000     3.730000      3.710000      1.190000      326.000000
25%    0.400000      61.100000     56.000000     4.710000      4.710000      2.900000      945.000000
50%    0.700000      61.800000     57.000000     5.690000      5.700000      3.520000      2373.000000
75%    1.050000      62.500000     59.000000     6.550000      6.540000      4.040000      5353.000000
max    2.025000      73.600000     63.500000     9.310000      9.285000      5.750000      11965.000000
Observations:
1. The dataset contains features varying widely in magnitude, units and range. Many machine learning algorithms use the Euclidean distance between data points in their computations, and this can be a potential problem.
2. Scaling helps to standardize the independent features in the data to a fixed range. If feature scaling is not done, a machine learning algorithm tends to weigh larger values more heavily and treat smaller values as less important, regardless of the unit of the values.
3. Features with high magnitudes will weigh in a lot more in distance calculations than features with low magnitudes. To suppress this effect, we would need to bring all features to the same level of magnitude. For this case study, however, let us look at the data more closely to identify whether there is a need to scale the data.
4. The describe output shared above indicates that the means and standard deviations of the original numeric variables do not vary dramatically, so even if we don't scale the numbers, the model performance will not vary much, or the impact will be insignificant.
5. Some immediate effects we would see if we scale the data and run the linear model are: faster execution (the fit converges faster); the intercept shrinks to an almost negligible value; and the coefficients can then be interpreted in standard deviation units instead of a pure unit increment as in normal linear models. There is, however, no difference in the interpretation of the model scores or in the graphical (scatterplot) representation of the linear model.

Correlation between variables of the dataset:


        carat      depth      table      x          y          z          price
carat   1.000000   0.033242   0.187134   0.982880   0.981960   0.980882   0.936765
depth   0.033242   1.000000  -0.289972  -0.018307  -0.021944   0.100040   0.000323
table   0.187134  -0.289972   1.000000   0.199653   0.194015   0.160519   0.137915
x       0.982880  -0.018307   0.199653   1.000000   0.998489   0.990898   0.913409
y       0.981960  -0.021944   0.194015   0.998489   1.000000   0.990533   0.914838
z       0.980882   0.100040   0.160519   0.990898   0.990533   1.000000   0.908599
price   0.936765   0.000323   0.137915   0.913409   0.914838   0.908599   1.000000
Observations:
1. We can identify that there is strong correlation between independent variables like carat, x, y and z. All these variables are also strongly correlated with the target variable, price. This indicates a strong case of our dataset struggling with multicollinearity.
2. Depth does not show any strong relation with the price variable. I would drop the x, y and z variables before creating the linear regression model. Similarly, depth does not seem to influence price, so at some point I will be dropping this variable from my model building process as well. Keeping the above points in mind, for this dataset I don't think scaling the data will make much sense.

Combining the sub-levels of an ordinal variable:

df['cut_s'] = df['cut'].map({'Ideal':1,'Good':2,'Very Good':3,'Premium':4,'Fair':5})

In the 'cut' variable, Ideal, Good, Very Good, Premium and Fair are ordinal sub-levels, so they are combined into a single numeric variable by creating a new column 'cut_s' and mapping the 'cut' sub-levels to integer codes: 'Ideal':1, 'Good':2, 'Very Good':3, 'Premium':4, 'Fair':5.

1.3 Encode the data (having string values) for Modelling. Split the
data into train and test (70:30). Apply Linear regression using scikit
learn. Perform checks for significant variables using appropriate
method from statsmodel. Create multiple models and check the
performance of Predictions on Train and Test sets using Rsquare,
RMSE & Adj Rsquare. Compare these models and select the best
one with appropriate reasoning.
We have three object columns with string values: cut, color, and clarity. Let us examine the data briefly before deciding which encoding technique to use. First, a quick look at the summary statistics of our target variable, price:

mean 3734.453965
std 3466.394724
median 2373.000000
Name: price, dtype: float64
Using the groupby function, we check the relationship of cut with price, with aggregator functions such as mean, median, and standard deviation.

            std          max      mean         median  min
cut
Fair        3193.788413  11965.0  4364.383825  3337.0  369.0
Good        3149.222492  11965.0  3770.679540  3092.5  335.0
Ideal       3353.661103  11965.0  3282.618788  1762.0  326.0
Premium     3679.608382  11965.0  4276.784593  3108.0  326.0
Very Good   3461.492170  11965.0  3829.352912  2633.0  336.0
Observations:
1. Let us now check the first object column, cut. Using the groupby function, the relationship of cut with price is summarized with aggregator functions such as mean, median, and standard deviation, as shown in the table above.
2.We can establish that there is an order in ranking like mean price is increasing from ideal
then good to very good, premium and fair. Fair segment has the highest median value as
well. Since, I can see the ordered ranking here, I have classified them using scale – Label
encoding - and won’t use one-hot encoding here. However, it is absolutely fine for us to go
ahead and treat this variable using one-hot encoding as well.
3.We can certainly try that approach if we would like to see the impact of varied kind of cut
variables with price target variable. Overall, with the grid above, I don’t think we will find
cut as a strong predictor and hence I have stayed with using label encoding.
Using the groupby function, we check the relationship of color with price, with the same aggregator functions (mean, median, and standard deviation).
std max mean median min
color
D 3022.221311 11965.0 3067.771027 1799.0 357.0

E 2993.071116 11965.0 2956.374288 1698.0 326.0


F 3326.115989 11965.0 3537.406184 2282. 357.0
G 3534.772080 11965.0 3810.162301 2273.5 361.0
H 3596.514528 11965.0 4215.174529 3394.0 337.0
I 3881.685718 11965.0 4730.496926 3733.0 336.0

J 3812.362655 11965.0 5008.376389 4234.5 335.0


Observations:
1.In one-hot encoding, the integer encoded variable is removed and a new binary variable is
added for each unique integer value. A one hot encoding allows the representation of
categorical data to be more expressive.
2. Many machine learning algorithms cannot work with categorical data directly and hence,
the categories must be converted into numbers. This is required for both input and output
variables that are categorical.
Using the groupby function, we check the relationship of clarity with price, with the same aggregator functions (mean, median, and standard deviation).
std max mean median min

clarity
I1 2542.831763 11965.0 3850.662983 3494.0 345.0
IF 3252.252509 11965.0 2592.427609 1063.0 369.0

SI1 3292.539153 11965.0 3812.165143 2795.0 326.0


SI2 3448.488611 11965.0 4738.905722 4077.0 326.0
VS1 3542.705105 11965.0 3652.068527 1949.0 338.0
VS2 3537.589627 11965.0 3746.075837 2066.0 357.0
VVS1 3063.527052 11965.0 2424.065797 1066.0 336.0

VVS2 3547.754089 11965.0 3165.168379 1253.0 336.0

Encode the data (having string values):


CUT : 5
Fair 779
Good 2434
Very Good 6027
Premium 6880
Ideal 10805
Name: cut, dtype: int64

COLOR : 7
J 1440
I 2765
D 3341
H 4091
F 4722
E 4916
G 5650
Name: color, dtype: int64

CLARITY : 8
I1 362
IF 891
VVS1 1839
VVS2 2530
VS1 4086
SI2 4561
VS2 6092
SI1 6564
Name: clarity, dtype: int64

Unique value counts for all the categorical variables are shown above.

Converting categorical to dummy variables:


[First five rows of the dataframe after dummy encoding: the numeric columns carat, depth, table, x, y, z, price and cut_s, followed by the cut_*, color_* and clarity_* dummy columns.]

Table 1.2.2 Converting categorical to dummy variables

Observations:
1. One-hot encoding has been used for the color and clarity independent variables with the pandas option drop_first=True, also referred to as dummy encoding, which produces k-1 columns for k categories.
2. One-hot encoding creates k binary variables per categorical column, while dummy encoding drops one and keeps k-1. This also helps us avoid the dummy-variable trap and keeps the number of dimensions down. (A sketch of this encoding step follows below.)
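A minimal sketch of the encoding step described above, assuming pandas is imported as pd: an ordinal label mapping for 'cut' plus drop_first (k-1) dummies for the object columns:

# Ordinal label encoding for cut, and k-1 dummy encoding for cut, color and clarity.
df['cut_s'] = df['cut'].map({'Ideal': 1, 'Good': 2, 'Very Good': 3, 'Premium': 4, 'Fair': 5})
df = pd.get_dummies(df, columns=['cut', 'color', 'clarity'], drop_first=True)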

Data types after encoding:


carat float64
depth float64
table float64

x float64
y float64
z float64
price float64
cut_s int64
cut_Good uint8
cut_Ideal uint8
cut_Premium uint8
cut_Very Good uint8
color_E uint8
color_F uint8
color_G uint8
color_H uint8
color_I uint8
color_J uint8
clarity_IF uint8
clarity_SI1 uint8
clarity_SI2 uint8
clarity_VS1 uint8
clarity_VS2 uint8
clarity_VVS1 uint8
clarity_VVS2 uint8
dtype: object

All categorical object columns have been converted into int64 and uint8 after encoding.
Heatmap:

1.2.1 Heat map for numerical and categorical variables

Observations:
1. We have now established that several variables (carat, x, y, and z) demonstrate strong correlation or multicollinearity, which needs to be addressed before proceeding with the linear regression model.
2. I have decided to drop x, y, and z from the linear regression model creation step and to keep carat, since it has the strongest relationship with the target variable, price, of the four columns. The depth column also has little impact on price, so I have chosen not to use it either.

Train-Test Split:

X = df.drop('price', axis=1)
X = X.drop({'x', 'y', 'z', 'depth', 'cut_Good', 'cut_Premium', 'cut_Ideal', 'cut_Very Good'}, axis=1)
# Copy target into the y dataframe.
y = df[['price']]

# Split X and y into training and test sets in a 70:30 ratio.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

Linear Regression Model:

Fit the LinearRegression function to find the best-fit model on the training data:

from sklearn.linear_model import LinearRegression

regression_model = LinearRegression()
regression_model.fit(X_train, y_train)
Regression model coefficients are:
The coefficient for carat is 8023.99354211987

The coefficient for table is -24.96160128370218


The coefficient for cut_s is -43.85585730631698
The coefficient for color_E is -190.8815448137632
The coefficient for color_F is -248.6881846110516
The coefficient for color_G is -421.99293726043936

The coefficient for color_H is -832.8304011282429


The coefficient for color_I is -1321.6400725185504
The coefficient for color_J is -1859.634521237429
The coefficient for clarity_IF is 4245.01349072724
The coefficient for clarity_SI1 is 2695.2774591875573

The coefficient for clarity_SI2 is 1861.4107061452762


The coefficient for clarity_VS1 is 3546.3157809970157
The coefficient for clarity_VS2 is 3263.5726867690855
The coefficient for clarity_VVS1 is 4018.6599866240103
The coefficient for clarity_VVS2 is 3984.97801468667.
The intercept for our model is [-3600.04330541]

Observations:
1. The intercept (often labelled the constant) is the expected mean value of Y when all X = 0. If X never equals 0, the intercept has no intrinsic meaning.
2. The intercept for our model is -3600.04. In the present case, when all the predictor variables (carat, cut, color, clarity, etc.) are zero, the model Y = m1X1 + m2X2 + ... + mnXn + C + e gives C = -3600, i.e. a predicted price of -3600. We can apply a Z-score transformation (scaling) to the data to bring the intercept close to zero.

Model Evaluation - R-square values:
R-square on training data: 0.9384124327864911
R-square on testing data: 0.9394154299395036

RMSE values:
RMSE on training data: 858.2270432559263
RMSE on testing data: 857.818165001115
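A minimal sketch of how these scores (plus the adjusted R-square asked for in question 1.3) can be computed, assuming regression_model and the train/test splits defined above:

import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

for name, X_part, y_part in [("Training", X_train, y_train), ("Testing", X_test, y_test)]:
    y_pred = regression_model.predict(X_part)
    r2 = r2_score(y_part, y_pred)
    rmse = np.sqrt(mean_squared_error(y_part, y_pred))
    n, k = X_part.shape
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)   # standard adjusted R-square formula
    print(f"{name}: R-square={r2:.4f}, Adj R-square={adj_r2:.4f}, RMSE={rmse:.2f}")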
observations:
1.R-square is the percentage of the response variable variation that is explained by a linear
model. R-square = (Explained variation / Total variation)
2.R-squared is always between 0 and 100%: 0% indicates that the model explains none of
the variability of the response data around its mean.100% indicates that the model explains
all the variability of the response data around its mean.
3.In this regression model we can see the R-square value on Training and Test data
respectively 0.9384124327864911 and 0.9394154299395036.
4.In this regression model we can see the RMSE value on Training and Test data respectively
858.2270432559263 and 857.818165001115.

Scatterplot on test data between the dependent variable (price) and the independent variables:

1.3.1 Scatter plot on test data between the dependent variable price and the independent variables

Observation:
1. The plot is linear, showing a very strong correlation between the predicted y and the actual y. However, there is a noticeable spread, which indicates some noise in the data set, i.e. unexplained variance in the output.

Check Multi-collinearity using VIF:


carat ---> 5.219230125784284
table ---> 92.02140709539887
cut_s ---> 5.096413143490567

color_E ---> 2.4747624189962067


color_F ---> 2.4341698268457574
color_G ---> 2.774161531143149
color_H ---> 2.2877893449603337
color_I ---> 1.9217515170977741
color_J ---> 1.509454503820044
clarity_IF ---> 3.3631034501372756
clarity_SI1 ---> 17.89611729585019
clarity_SI2 ---> 12.680824943298578
clarity_VS1 ---> 11.60063052200214

clarity_VS2 ---> 16.740214589468305


clarity_VVS1 ---> 5.8723419563685235
clarity_VVS2 ---> 7.630278698237391
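A minimal sketch of how these VIF values can be computed with statsmodels, assuming X_train holds the retained predictors:

from statsmodels.stats.outliers_influence import variance_inflation_factor

# VIF for each predictor column of the training set.
for i, col in enumerate(X_train.columns):
    print(col, '--->', variance_inflation_factor(X_train.values, i))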

After dropping these variables: 'depth', 'x', 'y', 'z', 'cut_Good', 'cut_Ideal', 'cut_Premium', 'cut_Very Good'.

Heat Map:

1.3.2 Heat map after encoding

The strongest correlations with the target variable 'price' are from 'carat' and 'cut_s', while 'clarity_SI2' shows a low correlation.

Linear Regression using statsmodels:
We concatenate X and y into a single dataframe, since the statsmodels formula API requires the data to be passed as a single dataframe, unlike sklearn, which takes X and y as separate variables.
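A minimal sketch of this statsmodels workflow, assuming the encoded train split from above; the formula lists the predictors retained in the report:

import pandas as pd
import statsmodels.formula.api as smf

# The formula API needs one dataframe, so concatenate the predictors and the target.
data_train = pd.concat([X_train, y_train], axis=1)

expr = ('price ~ carat + table + cut_s + color_E + color_F + color_G + color_H + color_I '
        '+ color_J + clarity_IF + clarity_SI1 + clarity_SI2 + clarity_VS1 + clarity_VS2 '
        '+ clarity_VVS1 + clarity_VVS2')

lm1 = smf.ols(formula=expr, data=data_train).fit()
print(lm1.params)
print(lm1.summary())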

[First five rows of the concatenated train dataframe: carat, table, cut_s, the color_* and clarity_* dummy columns, and price.]

Table 1.2.3 Categorical variables with encoded values

Lm1.params:
Intercept -3600.043305
carat 8023.993542
table -24.961601
cut_s -43.855857
color_E -190.881545
color_F -248.688185
color_G -421.992937
color_H -832.830401
color_I -1321.640073
color_J -1859.634521
clarity_IF 4245.013491
clarity_SI1 2695.277459
clarity_SI2 1861.410706
clarity_VS1 3546.315781
clarity_VS2 3263.572687
clarity_VVS1 4018.659987
clarity_VVS2 3984.978015

dtype: float64

OLS Regression Results
==============================================================================
Dep. Variable:                  price   R-squared:                       0.938
Model:                            OLS   Adj. R-squared:                  0.938
Method:                 Least Squares   F-statistic:                 1.793e+04
Date:                Sat, 07 May 2022   Prob (F-statistic):               0.00
Time:                        23:12:51   Log-Likelihood:            -1.5405e+05
No. Observations:               18847   AIC:                         3.081e+05
Df Residuals:                   18830   BIC:                         3.083e+05
Df Model:                          16
Covariance Type:            nonrobust
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept    -3600.0433    198.516    -18.135      0.000   -3989.153   -3210.934
carat         8023.9935     15.622    513.630      0.000    7993.373    8054.614
table          -24.9616      3.424     -7.290      0.000     -31.673     -18.250
cut_s          -43.8559      5.645     -7.770      0.000     -54.920     -32.792
color_E       -190.8815     23.085     -8.269      0.000    -236.129    -145.634
color_F       -248.6882     23.393    -10.631      0.000    -294.540    -202.836
color_G       -421.9929     22.869    -18.453      0.000    -466.818    -377.168
color_H       -832.8304     24.388    -34.150      0.000    -880.633    -785.028
color_I      -1321.6401     27.167    -48.649      0.000   -1374.890   -1268.391
color_J      -1859.6345     33.280    -55.879      0.000   -1924.865   -1794.404
clarity_IF    4245.0135     65.081     65.227      0.000    4117.450    4372.577
clarity_SI1   2695.2775     55.759     48.338      0.000    2585.985    2804.569
clarity_SI2   1861.4107     56.193     33.125      0.000    1751.268    1971.554
clarity_VS1   3546.3158     56.943     62.278      0.000    3434.702    3657.930
clarity_VS2   3263.5727     56.081     58.194      0.000    3153.649    3373.496
clarity_VVS1  4018.6600     60.279     66.668      0.000    3900.508    4136.812
clarity_VVS2  3984.9780     58.614     67.986      0.000    3870.089    4099.867
==============================================================================
Omnibus:                     3995.393   Durbin-Watson:                   1.999
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            12037.975
Skew:                           1.098   Prob(JB):                         0.00
Kurtosis:                       6.241   Cond. No.                     1.90e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.9e+03. This might indicate that there are strong multicollinearity or other numerical problems.

Table 1.2.4 OLS regression results

Observations:

1. R-square is the percentage of the response variable variation that is explained by the linear model: R-square = (Explained variation / Total variation).

2. R-squared is always between 0 and 100%: 0% indicates that the model explains none of the variability of the response data around its mean, and 100% indicates that the model explains all of it.

3. In this regression model the R-square values on the training and test data are 0.9384124327864911 and 0.9394154299395036 respectively.

4. The RMSE values on the training and test data are 858.2270432559263 and 857.818165001115 respectively.

5. As the training and test scores are almost in line, we can conclude this is a right-fit model.

Applying Z-Score:
As stated earlier, with this specific dataset, I don’t think we need to scale the data, however,
to see its impact, let's quickly view the results after scaling the data. I have used the Z-score to
scale the data. Z-Scores become comparable by measuring the observations in multiples of
the standard deviation of that sample. The mean of a z-transformed sample is always zero.

from scipy.stats import zscore

X_train_scaled = X_train.apply(zscore)
X_test_scaled = X_test.apply(zscore)
y_train_scaled = y_train.apply(zscore)
y_test_scaled = y_test.apply(zscore)

Coefficients of the variables after applying the Z-score:


The coefficient for carat is 1.0670248582356114

The coefficient for table is -0.015565195961756262


The coefficient for cut_s is -0.016640831439161168
The coefficient for color_E is -0.02134727627521308
The coefficient for color_F is -0.02736569280699151
The coefficient for color_G is -0.04968863185196373

The coefficient for color_H is -0.08650138091996604


The coefficient for color_I is -0.1159721894148063
The coefficient for color_J is -0.12076205437047896
The coefficient for clarity_IF is 0.2221583928656033
The coefficient for clarity_SI1 is 0.3358471963086741

The coefficient for clarity_SI2 is 0.20218576827147022


The coefficient for clarity_VS1 is 0.3678080811305068
The coefficient for clarity_VS2 is 0.39322943954265416
The coefficient for clarity_VVS1 is 0.2909920367235056

The coefficient for clarity_VVS2 is 0.3358847114654482
This can be interpreted as: a one standard deviation increase in carat changes the dependent variable 'price' by about 1.07 standard deviations, holding the other variables constant, and similarly for the other coefficients.
The intercept for our model is -2.459574118873397e-16
Observations after scaling:
1. The new intercept is -2.459574118873397e-16, which is almost equal to 0 - a by-product of the scaling exercise.
2. In the scaled linear regression equation, interpretability is in terms of a unit increase in standard deviation rather than a unit increase in the original units as in the normal linear model.

Model score - R2: 0.9392503405712005
RMSE on the scaled data: 0.246474459992915

Observations:
1. The intercept value is -2.459574118873397e-16, or almost 0. The regression model score on train and test data is the same as with the normal linear regression model, at 93.9% and 93.9% respectively.
2. This indicates the model is a right-fit model and has avoided being underfit or overfit. The RMSE on the scaled data is about 0.2465, i.e. the residual standard deviation is about 25% of the target's standard deviation, leaving roughly 6% of the variance unexplained, consistent with the R-square of about 0.939.

Scatterplot on test data between price and the independent variables (after scaling):

1.3.3 Scatter plot on test data after scaling, between price and the independent variables

Checking multicollinearity using VIF (after scaling):

carat ---> 5.219230125784284


table ---> 92.02140709539887

cut_s ---> 5.096413143490567
color_E ---> 2.4747624189962067
color_F ---> 2.4341698268457574
color_G ---> 2.774161531143149
color_H ---> 2.2877893449603337
color_I ---> 1.9217515170977741
color_J ---> 1.509454503820044
clarity_IF ---> 3.3631034501372756
clarity_SI1 ---> 17.89611729585019
clarity_SI2 ---> 12.680824943298578
clarity_VS1 ---> 11.60063052200214
clarity_VS2 ---> 16.740214589468305
clarity_VVS1 ---> 5.8723419563685235
clarity_VVS2 ---> 7.630278698237391

The variance inflation factors show that 'table', 'clarity_SI1', 'clarity_SI2', 'clarity_VS1' and 'clarity_VS2' display severe collinearity.

Final Linear Equation:

price = (-3600.04) * Intercept + (8023.99) * carat + (-24.96) * table + (-43.86) * cut_s + (-190.88) * color_E + (-248.69) * color_F + (-421.99) * color_G + (-832.83) * color_H + (-1321.64) * color_I + (-1859.63) * color_J + (4245.01) * clarity_IF + (2695.28) * clarity_SI1 + (1861.41) * clarity_SI2 + (3546.32) * clarity_VS1 + (3263.57) * clarity_VS2 + (4018.66) * clarity_VVS1 + (3984.98) * clarity_VVS2
Observations:
1. When carat increases by 1 unit, diamond price increases by 8023.99 units, keeping all other predictors constant.
2. When cut_s increases by 1 unit, diamond price decreases by 43.86 units, keeping all other predictors constant.
3. The color dummies change diamond price by their respective coefficients ((-190.88) for color_E, (-248.69) for color_F, (-421.99) for color_G, (-832.83) for color_H, (-1321.64) for color_I and (-1859.63) for color_J), keeping all other predictors constant.
4. The clarity dummies change diamond price by their respective coefficients ((4245.01) for clarity_IF, (2695.28) for clarity_SI1, (1861.41) for clarity_SI2, (3546.32) for clarity_VS1, (3263.57) for clarity_VS2, (4018.66) for clarity_VVS1 and (3984.98) for clarity_VVS2), keeping all other predictors constant.
5. Some coefficients are negative, for instance those for 'table', 'cut_s' and the color dummies, which implies these variables are inversely related to diamond price.
6. R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination, or the coefficient of multiple determination for multiple regression. It varies from 0 to 1, and 100% indicates that the model explains all the variability of the response data around its mean.
7. Comparing the performance of the predictions on the train and test sets using R-square, RMSE and adjusted R-square, the statsmodels linear regression model is the best model. Any value close to 1 can be considered a well-fitted regression line, and our R-square score of 0.939 signifies good performance.

1.4 Inference: Basis on these predictions, what are the business insights and recommendations.
Predictions and Insights:
1. 'Price' is the target variable while all others are predictors. (2) The data set contains 26,967 rows and 11 columns. (3) In the given data set there are 2 integer-type features, 6 float-type features and 3 object-type features, where 'price' is the target variable and all others are predictor variables. (4) The first column ("Unnamed: 0") is only a serial number, so we removed it.
2. In the given data set the mean and median values do not differ much. (2) The minimum values of "x", "y" and "z" are zero, which indicates faulty values: dimensionless or 2-dimensional diamonds are not possible, so we filtered those out as clearly faulty data entries. (3) There are three object data types: 'cut', 'color' and 'clarity'.
3. We can observe 697 missing values in the depth column. There are some duplicate rows present (33 duplicate rows out of 26,958), which is nearly 0.12% of the total data, so we dropped the duplicated rows.
4. A significant number of outliers are present in some variables; data points far from the rest of the dataset would affect the outcome of our regression model, so we treated the outliers. The distributions of some quantitative features like "carat" and the target feature "price" are heavily right-skewed.
5. It looks like most features correlate with the price of the diamond. The notable exception is "depth", which has a negligible correlation. Observation on 'cut': the Premium cut diamonds are the most expensive, followed by the Very Good cut.
6. We can identify strong correlation between independent variables like carat, x, y and z. All these variables are also strongly correlated with the target variable, price. This indicates a strong case of our dataset struggling with multicollinearity.
7. Depth does not show any strong relation with the price variable. I dropped the x, y and z variables before creating the linear regression model. Similarly, depth does not seem to influence price, so I dropped this variable from the model building process as well. Keeping the above points in mind, for this dataset I don't think scaling the data makes much sense.
8. However, for the sake of checking how it impacts the overall model coefficients, I have still carried it out for this study. Please note that centering/scaling does not affect statistical inference in regression models: the estimates are adjusted appropriately and the p-values remain the same. The brief results from running the model on scaled data are shared above.
9. We have a dataset with strong correlation between independent variables, and hence we need to tackle the issue of multicollinearity, which can hinder the interpretation of the model. Multicollinearity makes it difficult to understand how one variable influences the target variable; however, it does not affect the accuracy of the model. As a result, while creating the model I dropped several independent variables displaying multicollinearity or having no direct relation with the target variable.
10. After the linear regression model was created, we can see the assumption coming true, as the carat variable emerged as the single biggest factor impacting the target variable, followed by a few of the clarity variables. Carat has the highest coefficient value compared with the other studied variables for this test case.
11. Final linear equation: price = (-3600.04) * Intercept + (8023.99) * carat + (-24.96) * table + (-43.86) * cut_s + (-190.88) * color_E + (-248.69) * color_F + (-421.99) * color_G + (-832.83) * color_H + (-1321.64) * color_I + (-1859.63) * color_J + (4245.01) * clarity_IF + (2695.28) * clarity_SI1 + (1861.41) * clarity_SI2 + (3546.32) * clarity_VS1 + (3263.57) * clarity_VS2 + (4018.66) * clarity_VVS1 + (3984.98) * clarity_VVS2
12. Even after scaling, our claim about carat being an important driver in the linear equation is reaffirmed. We can then look at the Variance Inflation Factor (VIF) scores to check multicollinearity. As per industry convention, at least for this case study, any variable with a VIF score greater than 10 is taken to indicate severe collinearity.
13. Table and clarity_SI1, clarity_SI2, clarity_VS1 and clarity_VS2 display severe collinearity. VIF measures the intercorrelation among independent variables in a multiple regression model: mathematically, the variance inflation factor for a regression model variable is the ratio of the overall model variance to the variance of a model with that single independent variable.
14. As an example, the VIF value for carat in the table above reflects its intercorrelation with the other independent variables in the dataset, and so on for the other variables. If we are looking to fine-tune the model, we can simply drop these columns from our linear regression model, re-run it and check the model performance.
15. As an alternate approach, we can use the "cut" variable with one-hot (dummy) encoding and run the model again to check the overall model score or tackle the issue of multicollinearity. This would allow us to read the impact of the cut levels on the target variable, price, if the company intends to study that as well. However, for this case study and for the reasons mentioned above, I have not used one-hot encoding on the cut variable.
16. For the business, based on the model created for this test case, the key variables that are likely to positively drive price (top 5 in descending order) are:
1. Carat
2. Clarity_IF
3. Clarity_VVS1
4. Clarity_VVS2
5. Clarity_VS1

Recommendations:
1.As expected Carat is a strong predictor of the overall price of the stone. Clarity refers to
the absence of the Inclusions and Blemishes and has emerged as a strong predictor of price
as well. Clarity of stone types IF, VVS_1, VVS_2 and vs1 are helping the firm put an
expensive price cap on the stones.
2. Stones of colors such as H, I and J won't help the firm put a premium price on them. The company should instead focus on stones of color D, E and F to command relatively higher price points and support sales.
3.This also can indicate that company should be looking to come up with new color stones
like clear stones or a different color/unique color that helps impact the price positively.
4.The company should focus on the stone’s carat and clarity so as to increase their prices.
Ideal customers will also contribute to more profits. The marketing efforts can make use of
educating customers about the importance of a better carat score and importance of clarity
index. Post this, the company can make segments, and target the customer based on their
income/paying capacity etc.

Logistic Regression:
Introduction:
Logistic regression is a process of modeling the probability of a discrete outcome given an
input variable. The most common logistic regression models a binary outcome; something
that can take two values such as true/false, yes/no, and so on. Multinomial logistic
regression can model scenarios where there are more than two possible discrete outcomes.
Logistic regression is a useful analysis method for classification problems, where you are
trying to determine if a new sample fits best into a category. Since many aspects of cyber security, such as attack detection, are classification problems, logistic regression is a useful analytic technique.

Linear Discriminant Analysis:


Introduction:

LDA or Normal Discriminant Analysis or Discriminant Function Analysis is a dimensionality
reduction technique that is commonly used for supervised classification problems. It is used
for modelling differences in groups i.e. separating two or more classes. It is used to project
the features in higher dimension space into a lower dimension space.
For example, we have two classes and we need to separate them efficiently. Classes can
have multiple features. Using only a single feature to classify them may result in some
overlapping as shown in the below figure. So, we will keep on increasing the number of
features for proper classification.
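As an illustration of how the two classifiers are applied later in this report, a minimal sketch with sklearn, assuming an already encoded feature matrix X and binary target y (names assumed):

from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

# X: encoded predictors, y: binary target (e.g. opted for the holiday package) - assumed names.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

log_model = LogisticRegression(max_iter=10000).fit(X_train, y_train)
lda_model = LinearDiscriminantAnalysis().fit(X_train, y_train)

print("Logistic regression test accuracy:", log_model.score(X_test, y_test))
print("LDA test accuracy:", lda_model.score(X_test, y_test))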

Problem 2: Logistic Regression and LDA


You are hired by a tour and travel agency which deals in selling holiday packages. You are
provided details of 872 employees of a company. Among these employees, some opted for
the package and some didn't. You have to help the company in predicting whether an
employee will opt for the package or not on the basis of the information given in the data
set. Also, find out the important factors on the basis of which the company will focus on
particular employees to sell their packages.

Data Dictionary:
Variable Name: Description

Holliday_Package: Opted for Holiday Package yes/no?
Salary: Employee salary
age: Age in years
educ: Years of formal education
no_young_children: Number of young children (younger than 7 years)
no_older_children: Number of older children
foreign: Foreigner yes/no

2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do null value
condition check, write an inference on it. Perform Univariate and Bivariate Analysis.
Do exploratory data analysis.

Sample of the dataset:

   Unnamed: 0 Holliday_Package  Salary  age  educ  no_young_children  no_older_children foreign
0           1               no   48412   30     8                  1                  1      no
1           2              yes   37207   45     8                  0                  1      no
2           3               no   58022   46     9                  0                  0      no
3           4               no   66503   31    11                  2                  0      no
4           5               no   66734   44    12                  0                  2      no

Table 2.1.1 Sample of the Data

The first five rows of the dataset are given by the head() function.

Data dimensions:
(872, 8)

The dataset has 872 rows and 8 columns.

We have dropped the first column ('Unnamed: 0') as it is not important for our study.
Unnamed: 0 holds serial numbers, so it is not required and can be dropped for further
analysis. The resulting shape is 872 rows and 7 columns.
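A minimal sketch of this loading step is shown below; the CSV file name used here is an
assumption, not taken from the original notebook:

import pandas as pd

df = pd.read_csv("Holiday_Package.csv")   # hypothetical file name
df = df.drop(columns=["Unnamed: 0"])      # drop the serial-number column
print(df.shape)                           # expected: (872, 7)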

Descriptive Statistics Summary:


Salary age educ no_young_children no_older_children

count 872.000000 872.000000 872.000000 872.000000 872.000000

mean 47729.172018 39.955275 9.307339 0.311927 0.982798

std 23418.668531 10.551675 3.036259 0.612870 1.086786

min 1322.000000 20.000000 1.000000 0.000000 0.000000

25% 35324.000000 32.000000 8.000000 0.000000 0.000000

50% 41903.500000 39.000000 9.000000 0.000000 1.000000

75% 53469.500000 48.000000 12.000000 0.000000 2.000000

max 236961.000000 62.000000 21.000000 3.000000 6.000000


Table 2.1.2 Descriptive summary of the data

observations:

47
1.Holiday_Package: This variable is a categorical variable and the target variable.
2.Salary, age, educ, no_young_children and no_older_children are numerical or
continuous variables. Salary ranges from 1322 to 236961.
3.The average salary of employees is around 47729 with a standard deviation of 23418.
The large standard deviation indicates that the data is not normally distributed. A skew of
0.71 indicates that the data is right skewed and a few employees earn much more than the
average of 47729. 75% of the employees earn below 53469, while 25% of the employees earn
below 35324.
4.Age of the employees ranges from 20 to 62. The median is around 39. 25% of the employees
are below 32 and 25% of the employees are above 48. The standard deviation is around 10,
indicating an almost normal distribution.
5.Years of formal education ranges from 1 to 21 years. 25% of the population has formal
education of 8 years or less, while the median is around 9 years. 75% of the employees have
formal education of 12 years or less. The standard deviation of education is around 3. This
variable also shows some skewness in the data.

6.Foreign is a categorical variable.

Structure of the Dataset:


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 872 entries, 0 to 871
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Holliday_Package 872 non-null object
1 Salary 872 non-null int64
2 age 872 non-null int64
3 educ 872 non-null int64
4 no_young_children 872 non-null int64
5 no_older_children 872 non-null int64
6 foreign 872 non-null object
dtypes: int64(5), object(2)
memory usage: 47.8+ KB

Data types – integer/object:

The dataset has 2 object and 5 int64 columns.

Checking for Missing Values:


Holliday_Package 0
Salary 0
age 0
educ 0
no_young_children 0

no_older_children 0
foreign 0
dtype: int64

There are no missing values in the dataset.

Checking for Duplicated Values:


Number of duplicate rows = 0

(872, 7)

There are no duplicate rows in the dataset.
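The two checks above can be reproduced along these lines (a sketch, assuming df holds the
872 x 7 dataset):

print(df.isnull().sum())                                    # per-column missing-value counts (all zero here)
print("Number of duplicate rows =", df.duplicated().sum())  # 0
print(df.shape)                                             # (872, 7)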

Univariate Analysis:
Categorical Feature Levels Frequencies:
Holliday_Package Number of Levels 2
no 471

yes 401
Name: Holliday_Package, dtype: int64
foreign Number of Levels 2
no 656
yes 216

Name: foreign, dtype: int64

Unique counts of numerical and categorical variable:


Holliday_Package
_____
no 471
yes 401
Name: Holliday_Package, dtype: int64

Salary
_____
46195 2
33357 2
39460 2
36976 2
40270 2
..
38352 1
119644 1
96072 1

115431 1
74659 1
Name: Salary, Length: 864, dtype: int64

age
_____
44 35
31 32
34 32
35 31
33 30
28 29
40 29
36 28
38 28
32 27
47 26
41 26
39 25
26 24
42 24
46 24
49 23
45 23
51 22
50 21
37 21
43 21
48 20
27 19
29 19
30 19
57 18
56 18
55 17
25 17
58 16
24 16
59 14
54 14
52 13
21 12
23 11
53 10
60 10
22 9
61 8
20 8
62 3
Name: age, dtype: int64

educ
_____
8 157
12 124
9 114

11 100
10 90
5 67
4 50
13 43
7 31
14 25
6 21
15 15
3 11
16 10
2 6
17 3
19 2
21 1
18 1
1 1
Name: educ, dtype: int64

no_young_children
_____
0 665
1 147
2 55
3 5
Name: no_young_children, dtype: int64

no_older_children
_____
0 393
2 208
1 198
3 55
4 14
5 2
6 2
Name: no_older_children, dtype: int64
foreign
_____
no 656
yes 216
Name: foreign, dtype: int64

Distribution of Numeric Features:

2.1.1Hist plots for numerical variable

observations:

1.The Salary variable is right skewed.
2.The age variable is almost normally distributed.
3.The education variable also shows some skewness.

Boxplots for Numerical Variable:

2.1.2Box plots for numerical variable

observations:
1.There are significant outliers in the variable 'Salary', and minimal outliers in other
variables such as 'educ', 'no_young_children' and 'no_older_children'.
2.There are no outliers in the variable 'age'. For interpretation purposes we need to study
variables such as no_young_children and no_older_children before any outlier treatment.
3.For this case study we have done outlier treatment only for Salary and educ.

Treating the outliers:

2.1.3 Boxplot after capping outliers
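A minimal sketch of IQR-based capping for the two treated columns is shown below; the exact
treatment used in the original notebook may differ slightly:

def cap_outliers_iqr(frame, column):
    # Cap values outside Q1 - 1.5*IQR and Q3 + 1.5*IQR to those limits
    q1, q3 = frame[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    frame[column] = frame[column].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)
    return frame

for col in ["Salary", "educ"]:
    df = cap_outliers_iqr(df, col)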

Skewness of the variable:


Salary 0.710966

age 0.146412
educ -0.095087
no_young_children 1.946515
no_older_children 0.953951
dtype: float64.

Boxplot(Holliday Package VS Salary) :

2.1.4 Boxplot(Holliday Package VS Salary)

While performing the bivariate analysis we observe that Salary for employees opting for
holiday package and for not opting for holiday package is similar in nature. However, the
distribution is fairly spread out for people not opting for holiday packages.

Boxplot(Holliday Package VS Age) :

2.1.5Boxplot(Holliday Package VS Age)

The distribution of the age variable with respect to the holiday package is also similar in
nature. We can see that employees in the middle age range (34 to 45 years) opt for the holiday
package more than older and younger employees.

Countplot(HollidayPackage VS Age)

2.1.6 Countplot(HollidayPackage VS Age)

The countplot confirms the same pattern: employees in the middle age range (34 to 45 years)
opt for the holiday package more than older and younger employees.

Boxplot(Holliday Package VS Educ):

2.1.7 Boxplot(Holliday Package VS Educ)

This variable also shows a similar pattern, which means education is unlikely to be a strong
factor influencing whether employees opt for holiday packages.

Countplot(Holliday Package Vs Educ):

2.1.8 Countplot(Holliday Package Vs Educ)

We observe that employees with few years of formal education (1 to 7 years) and those with
very high education opt for the holiday package less often than employees with formal
education of 8 to 12 years.

Boxplot(Holliday_Package Vs no young children):

2.1.9 Boxplot(Holliday_Package Vs no young children)

observations:

1.There is a significant difference between employees with young children who opt for the
holiday package and those who do not.
2.We can clearly see that people with young children tend not to opt for holiday packages.

Countplot(Holliday_Package Vs no young children):

2.1.10 Countplot(Holliday_Package Vs no young children)

The countplot again shows that people with young children tend not to opt for holiday packages.

Stacked barchart(Foreign Vs no young children):

2.1.11 Stacked barchart(Foreign Vs no young children)

Boxplot(Holliday Package Vs no older children):

2.1.12 Boxplot(Holliday Package Vs no older children)

observations:
1.The distribution for opting or not opting for the holiday package looks similar for employees
with older children. Very few employees have 5 or 6 older children, so little can be concluded
for those groups.
2.The distribution is almost the same for both scenarios when dealing with employees with
older children.

countplot(Holliday Package Vs no older children):

2.1.13 countplot(Holliday Package Vs no older children)

The bar plot shows that relatively few employees with a larger number of older children opt for
the holiday package, and the counts for 5 or 6 older children are too small to draw conclusions.

Stacked barchart(Foreign Vs no older children):

2.1.14 Stacked barchart(Foreign Vs no older children)


Bivariate Analysis:
Numeric Features - Checking for Correlations:
Salary age educ no_young_children no_older_children

Salary 1.00 0.05 0.35 -0.03 0.12

age 0.05 1.00 -0.15 -0.52 -0.12

educ 0.35 -0.15 1.00 0.10 -0.04

no_young_children -0.03 -0.52 0.10 1.00 -0.24

no_older_children 0.12 -0.12 -0.04 -0.24 1.00


2.1.3Checking correlation of the data

Heat Map:

2.1.15Heat map numerical variable

observations:
1.We can see that there isn't any strong correlation between the variables.
2.Salary and education display a moderate correlation, and no_older_children is somewhat
correlated with Salary. However, there are no strong correlations in the dataset.
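A sketch of how the correlation table and heat map above are typically produced (numeric
columns only; library choice assumed):

import matplotlib.pyplot as plt
import seaborn as sns

num_cols = ["Salary", "age", "educ", "no_young_children", "no_older_children"]
corr = df[num_cols].corr()
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")  # annotated correlation heat map
plt.show()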

Pairplot:

2.1.16 Pair plot of the numerical variables

The pairplot shows the relationships between the independent variables; the distributions of
the Holliday_Package classes (yes/no) overlap heavily across all variables.

EDA for Categorical variable:


Countplot for Holliday Package:

2.1.8 Countplot for Holliday Package

Holliday_Package: The distribution seems to be fine, with 54% for no and 46% for yes.

Means values for Holliday Package and numerical variable:


Salary age educ no_young_children no_older_children

Holliday_Package

no 48217.455945 40.853503 9.583864 0.409766 0.902335

yes 42543.760599 38.900249 8.972569 0.197007 1.077307

On average, employees who opted for the holiday package have fewer young children and a
lower salary than those who did not opt for it.

Countplot for Foreign:

2.1.18 Countplot for Foreign

Foreign: The data is imbalanced, skewed towards 'no', with a relatively smaller share of 'yes'.

Mean values of Foreign and numerical variable:

Salary age educ no_young_children no_older_children

foreign

no 47763.692073 40.603659 10.030488 0.282012 0.969512

yes 39062.443287 37.986111 7.092593 0.402778 1.023148

Mean values of no young children and numerical variable:

Salary age educ no_older_children

no_young_children

0 45975.233083 43.296241 9.141353 1.114286

1 44231.690476 29.265306 9.761905 0.707483

2 44894.422727 29.072727 9.854545 0.200000

3 45137.600000 29.600000 11.200000 0.200000

Mean values of no older children and numerical variable:


Salary age educ no_young_children

no_older_children

0 43710.353053 41.615776 9.536896 0.447837

1 45589.256313 39.161616 8.792929 0.343434

2 48291.643029 37.798077 9.456731 0.110577

3 46788.000000 38.800000 8.709091 0.090909

4 53717.535714 40.285714 10.285714 0.000000

5 59185.500000 43.500000 8.000000 0.000000

6 38605.000000 42.500000 8.500000 0.000000

2.2 Do not scale the data. Encode the data (having string values) for
Modelling. Data Split: Split the data into train and test (70:30).
Apply Logistic Regression and LDA (linear discriminant analysis).

2.3 Performance Metrics: Check the performance of Predictions on
Train and Test sets using Accuracy, Confusion Matrix, Plot ROC
curve and get ROC_AUC score for each model. Final Model:
Compare both the models and write inference on which model is
best/optimized. (Q2.2 & Q2.3 are clubbed together.)
Unique counts of categorical variables:
Column Name: Holliday_Package
['no', 'yes']
Categories (2, object): ['no', 'yes']
[0 1]
Column Name: foreign
['no', 'yes']
Categories (2, object): ['no', 'yes']
[0 1]

Value counts of target variable:


0 471
1 401
Name: Holliday_Package, dtype: int64

Value counts of no young children:


0 665
1 147
2 55
3 5
Name: no_young_children, dtype: int64

Value counts of no_older_children:

0 393
2 208
1 198
3 55
4 14
5 2
6 2
Name: no_older_children, dtype: int64

Size of Holliday Package and no young children:
Holliday_Package no_young_children
0 0 326
1 100
2 42
3 3
1 0 339
1 47
2 13
3 2
dtype: int64

observations:
1.Although no_young_children is numeric, it shows a varied distribution between the number of
children being 1 and 2 when a bivariate analysis is done with the dependent variable.
2.It is therefore advisable to treat this variable as categorical and encode it.

Size of Holliday Package and older children:


Holliday_Package no_older_children
0 0 231
1 102
2 102
3 27
4 7
5 2
1 0 162
1 96
2 106
3 28
4 7
6 2
dtype: int64

observations:
There does not seem to be much variation in the distribution of the data for children counts
greater than 0. The counts are close enough across the Holliday_Package classes, with an
almost identical distribution.

Datatypes:
Holliday_Package int8
Salary float64
age int64
educ float64
no_young_children int64
no_older_children int64

foreign int8
dtype: object

Encoding the Data:


After dummy-encoding no_young_children, the first five rows look as follows:

   Holliday_Package   Salary  age  educ  no_older_children  foreign  no_young_children_1  no_young_children_2  no_young_children_3
0                 0  48412.0   30   8.0                  1        0                    1                    0                    0
1                 1  37207.0   45   8.0                  1        0                    0                    0                    0
2                 0  58022.0   46   9.0                  0        0                    0                    0                    0
3                 0  66503.0   31  11.0                  0        0                    0                    1                    0
4                 0  66734.0   44  12.0                  2        0                    0                    0                    0

After also dummy-encoding no_older_children, the encoded data looks as follows:

   Holliday_Package   Salary  age  educ  foreign  no_young_children_1  no_young_children_2  no_young_children_3  no_older_children_1  no_older_children_2  no_older_children_3  no_older_children_4  no_older_children_5  no_older_children_6
0                 0  48412.0   30   8.0        0                    1                    0                    0                    1                    0                    0                    0                    0                    0
1                 1  37207.0   45   8.0        0                    0                    0                    0                    1                    0                    0                    0                    0                    0
2                 0  58022.0   46   9.0        0                    0                    0                    0                    0                    0                    0                    0                    0                    0
3                 0  66503.0   31  11.0        0                    0                    1                    0                    0                    0                    0                    0                    0                    0
4                 0  66734.0   44  12.0        0                    0                    0                    0                    0                    1                    0                    0                    0                    0

2.1.4 Encoded categorical variables

observations:
1.I have used one-hot (dummy) encoding for the independent variables no_young_children and
no_older_children with drop_first=True, which produces k-1 dummy variables for a feature
with k levels.
2.One-hot encoding produces k variables for a feature with k levels, while dummy encoding
produces k-1 variables. Dropping the first level also helps us avoid the dummy variable trap
and keeps the number of dimensions down.
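A sketch of the encoding described above, assuming pandas' get_dummies with drop_first=True
was used (the exact call in the original notebook may differ):

# Label-encode the two binary object columns: no -> 0, yes -> 1
df["Holliday_Package"] = df["Holliday_Package"].astype("category").cat.codes
df["foreign"] = df["foreign"].astype("category").cat.codes

# Dummy-encode the children counts, keeping k-1 dummies per feature
df = pd.get_dummies(df,
                    columns=["no_young_children", "no_older_children"],
                    drop_first=True)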

Train Test Split for Logistic regression model:


# Copy all the predictor variables into the X dataframe
X = df.drop(['Holliday_Package'], axis=1)

# Copy the target into the y dataframe
y = df['Holliday_Package']

Check the target variable class proportion:
0    0.540138
1    0.459862
Name: Holliday_Package, dtype: float64

# Split X and y into training and test sets in a 70:30 ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30,
                                                    random_state=1,
                                                    stratify=df['Holliday_Package'])  # stratify = y

Fit the Logistic Regression model :


model = LogisticRegression(solver='newton-cg',
                           max_iter=10000,
                           penalty='none',
                           verbose=True,
                           n_jobs=2,
                           random_state=123)
model.fit(X_train, y_train)

LogisticRegression(max_iter=10000, n_jobs=2, penalty='none', random_state=123,
                   solver='newton-cg', verbose=True)

Prediction on Train & Test Dataset:


Class Label Predictions:
y_test predictions:
array([0, 0, 0, 1, 0, 1, 1, 1, 1, 0], dtype=int64)

y_train predictions:
array([1, 0, 0, 1, 0, 0, 0, 0, 1, 1], dtype=int64)

Class Probability Prediction:


0 1

0 0.690002 0.309998

1 0.616723 0.383277

2 0.704176 0.295824

3 0.470343 0.529657

4 0.548567 0.451433
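The evaluation figures reported below are typically produced along these lines (a sketch,
assuming the fitted model and the train/test split above):

from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, roc_auc_score, roc_curve)
import matplotlib.pyplot as plt

ytrain_pred = model.predict(X_train)
ytrain_prob = model.predict_proba(X_train)[:, 1]   # probability of class 1

print("Accuracy:", accuracy_score(y_train, ytrain_pred))
print(confusion_matrix(y_train, ytrain_pred))
print(classification_report(y_train, ytrain_pred))
print("AUC:", roc_auc_score(y_train, ytrain_prob))

fpr, tpr, _ = roc_curve(y_train, ytrain_prob)       # points for the ROC plot
plt.plot(fpr, tpr)
plt.show()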

Model Evaluation:
Training Data:
Accuracy: 0.6770491803278689

AUC and ROC for the training data:

2.3.1 AUC and ROC for the training data

Confusion Matrix & Classification Report Metrics:

2.3.2 Confusion Matrix on Train Data

precision recall f1-score support

0 0.69 0.74 0.71 329


1 0.67 0.60 0.63 281

accuracy 0.68 610


macro avg 0.68 0.67 0.67 610
weighted avg 0.68 0.68 0.68 610

Logistic regression model:


Train data results:

AUC: 74%
Accuracy: 68%

Precision: 67%
Recall:60%
f1-Score: 63%

Test Data:
Accuracy: 0.6793893129770993

AUC and ROC:

2.3.3 AUC and ROC on Test Data

Confusion Matrix & Classification Report Metrics:

2.3.4 Confusion Matrix on Test Data

precision recall f1-score support

0 0.67 0.79 0.73 142


1 0.69 0.55 0.61 120

accuracy 0.68 262

macro avg 0.68 0.67 0.67 262
weighted avg 0.68 0.68 0.67 262

Logistic regression model:


Test data results:
AUC: 72%
Accuracy: 68%
Precision: 69%

Recall:55%
f1-Score: 61%

GridSearchCV in Logistic model:


grid = {'penalty': ['l2', 'none', 'l1', 'elasticnet'],
        'solver': ['sag', 'lbfgs', 'saga', 'newton-cg', 'liblinear'],
        'tol': [0.0001, 0.00001],        # 0.1, 0.01, 0.001
        'l1_ratio': [0.25, 0.5, 0.75]}   # 'max_iter': [100, 1000, 10000]

grid_search = GridSearchCV(cv=3,
                           estimator=LogisticRegression(max_iter=10000, n_jobs=2,
                                                        random_state=1),
                           n_jobs=-1,
                           param_grid=grid,
                           scoring='accuracy')
grid_search.fit(X_train, y_train)

print(grid_search.best_params_, '\n')
print(grid_search.best_estimator_)

{'l1_ratio': 0.25, 'penalty': 'l1', 'solver': 'liblinear', 'tol': 1e-05}

LogisticRegression(l1_ratio=0.25, max_iter=10000, n_jobs=2, penalty='l1',
                   random_state=1, solver='liblinear', tol=1e-05)

Getting the probabilities on the test set:

0 1

0 0.679105 0.320895

1 0.620776 0.379224

2 0.693508 0.306492

3 0.470989 0.529011

4 0.547406 0.452594

Prediction on the training set:


Train Data Accuracy: 0.6655737704918033

Confusion matrix on the training data and Classification report :


precision recall f1-score support

0 0.68 0.73 0.70 329


1 0.65 0.59 0.62 281

accuracy 0.67 610


macro avg 0.66 0.66 0.66 610
weighted avg 0.66 0.67 0.66 610

2.3.5 Confusion matrix on Train Data

Train Model ROC_AUC :

2.3.6 Train Model ROC_AUC

Logistic regression model in Gridsearchcv:


Train data results:
AUC: 74%
Accuracy: 67%

Precision: 65%
Recall:59%
f1-Score: 62%

Test Data :
Accuracy: 0.6755725190839694

Confusion matrix on the test data Classification report:

precision recall f1-score support

0 0.67 0.80 0.73 142


1 0.69 0.53 0.60 120

accuracy 0.68 262


macro avg 0.68 0.66 0.66 262
weighted avg 0.68 0.68 0.67 262

2.3.7 Confusion matrix on Test Data

Test Model ROC_AUC :

2.3.8 Test Model ROC_AUC

Logistic regression model in Gridsearchcv:


Test data results:

AUC: 72%
Accuracy: 68%
Precision: 69%
Recall:53%
f1-Score: 60%

Linear Discriminant Analysis:


Build LDA Model & Train:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

clf = LinearDiscriminantAnalysis()
model = clf.fit(X_train, y_train)

Prediction:
Confusion Matrix Comparison:
Train Data:

2.3.9 Confusion Matrix on Train data

Test Data:

2.3.10 Confusion Matrix on Test Data

Classification Report Comparison:


Train Data:
precision recall f1-score support

0 0.68 0.74 0.71 329


1 0.66 0.59 0.63 281

accuracy 0.67 610


macro avg 0.67 0.67 0.67 610
weighted avg 0.67 0.67 0.67 610

Test Data:
precision recall f1-score support

0 0.67 0.80 0.73 142


1 0.70 0.54 0.61 120

accuracy 0.68 262


macro avg 0.69 0.67 0.67 262
weighted avg 0.69 0.68 0.68 262

Probability prediction for the training and test data:


Predictions prob Train Data:
array([[0.28182528, 0.71817472],
[0.69600878, 0.30399122],
[0.71897602, 0.28102398],
...,
[0.43701203, 0.56298797],
[0.22725492, 0.77274508],
[0.50183664, 0.49816336]])

Auc Score: 0.7430907851896722

Train Model ROC_AUC:

2.3.11 Train Model ROC_AUC

Test AUC Score 0.7250000000000001

Test Model ROC_AUC :

2.3.12 Test Model ROC_AUC

Linear Discriminant Analysis:


Train data results:
AUC: 74%
Accuracy: 67%

Precision: 66%
Recall:59%
f1-Score: 63%

Test data results:


AUC: 72%
Accuracy: 68%
Precision: 70%
Recall:54%

f1-Score: 61%
observations:
1.The logistic regression model's test recall score is 55% and its precision score is 69%.
2.The GridSearchCV-tuned logistic regression model's test recall score is 53% and its precision
score is 69%.
3.The Linear Discriminant Analysis model's test recall score is 54% and its precision score is 70%.
4.Comparing the recall scores of all the models, logistic regression has the highest recall at
55%, so it is a good model to choose here.

A function was written to evaluate accuracy, f1 and recall for each threshold probability and
return a data frame with all these metrics. It takes the predicted probabilities on the train
data as input (a sketch of such a function is shown after the table below):

threshold_prob Acc f1 recall prec

0 0.1 0.479 0.639 1.000 0.469

1 0.2 0.534 0.655 0.961 0.497

2 0.3 0.628 0.685 0.879 0.561

3 0.4 0.685 0.687 0.751 0.634

4 0.5 0.672 0.625 0.594 0.660

5 0.6 0.659 0.544 0.441 0.709

6 0.7 0.643 0.441 0.306 0.789


7 0.8 0.605 0.267 0.157 0.917

8 0.9 0.548 0.035 0.018 1.000
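A minimal sketch of such a threshold-evaluation function (assuming probs are the predicted
probabilities of class 1 and y_true the corresponding labels; the names are illustrative):

import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def threshold_metrics(y_true, probs, thresholds=np.arange(0.1, 1.0, 0.1)):
    rows = []
    for t in thresholds:
        preds = (probs >= t).astype(int)   # classify as 1 when probability >= cut-off
        rows.append({"threshold_prob": round(t, 1),
                     "Acc": accuracy_score(y_true, preds),
                     "f1": f1_score(y_true, preds),
                     "recall": recall_score(y_true, preds),
                     "prec": precision_score(y_true, preds, zero_division=0)})
    return pd.DataFrame(rows)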

Classification report with the default cut-off (0.5) on the test data:

precision recall f1-score support

0 0.67 0.80 0.73 142


1 0.70 0.54 0.61 120

accuracy 0.68 262


macro avg 0.69 0.67 0.67 262
weighted avg 0.69 0.68 0.68 262

Classification report with the custom cut-off on the test data:

precision recall f1-score support

0 1.00 0.01 0.03 142


1 0.46 1.00 0.63 120

accuracy 0.47 262


macro avg 0.73 0.51 0.33 262
weighted avg 0.75 0.47 0.30 262

observations:
1.By using a custom threshold we can improve the recall score. We can try threshold probability
values from 0.1 to 0.9 and pick the one that gives the desired recall. Note that increasing
recall generally decreases precision.
2.Here a threshold probability of 0.1 gives a recall of 1.00 and a precision of 0.46 on the
test results.
3.The earlier recall score was 0.54; after using the custom threshold, recall improves to 1.00,
but at the cost of a sharp drop in precision and overall accuracy. The chosen cut-off should
depend on whether the business values catching every potential buyer (recall) more than
precision.

2.4 Inference: Basis on these predictions, what are the insights and
recommendations.

Predictions and Insights:
1.We started this test case by looking at the data correlations to identify early trends and
patterns. At one stage, Salary and education seemed to be important parameters which might
have played out as important predictors.
2.While performing the bivariate analysis we observed that salary for employees opting for the
holiday package and for those not opting for it is similar in nature. However, the
distribution is fairly spread out for people not opting for holiday packages.
3.The distribution of the age variable with respect to the holiday package is also similar in
nature. The range of age for people not opting for the holiday package is more spread out when
compared with people opting for it.
4.We can clearly see that employees in the middle age range (34 to 45 years) go for the holiday
package more than older and younger employees. However, the almost similar distributions
for salary and age indicate that they might not come out as strong predictors after the model
is created. Let's carry on with more data exploration and check.
5.There is a significant difference in employees with younger children who are opting for
holiday package and employees who are not opting for holiday package.We can clearly see
that people with younger children are not opting for holiday packages;

6.We identify that employees with number of younger children has a varied distribution and
might end up playing an important role in our model building process. Employees with older
children has almost similar distribution for opting and not opting for holiday packages across
the number of children levels.
7.For this test case, I have chosen Logistic Regression as the better model for interpretation
and analytical purposes. Keeping that in mind, I would quickly like to refer to the coefficient
values:
Coefficient value for Salary: -2.10948637e-05, or almost 0
Coefficient value for age: -6.36093014e-02, or almost 0
Coefficient value for education: 5.97505341e-02, or almost 0
Coefficient value for foreign: 1.21216610e+00
Coefficient value for no_young_children_1: -1.84510523e+00
Coefficient value for no_young_children_2: -2.54622113e+00
Coefficient value for no_young_children_3: -1.81252011e+00


8.A coefficient for a predictor variable shows the effect of a one unit change in that predictor.
In logistic regression the coefficients act on the log-odds of the outcome, e.g.
log(p / (1 - p)) = b0 + b1*x1 + b2*x2 + ... + bn*xn
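A short sketch of how these coefficients can be read as odds ratios by exponentiating them
(the coefficient values are the ones listed above; the interpretation comments are approximate):

import numpy as np

coefs = {"foreign": 1.21216610,
         "no_young_children_1": -1.84510523,
         "no_young_children_2": -2.54622113,
         "no_young_children_3": -1.81252011}

for name, b in coefs.items():
    print(name, "odds ratio ~", round(float(np.exp(b)), 2))
# foreign: ~3.36 -> being a foreigner multiplies the odds of opting in by roughly 3.4
# no_young_children_2: ~0.08 -> two young children cut the odds to about 8% of the baseline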

9.Interestingly, and as expected, Salary and age did not turn out to be important predictors
in my model. Also, the number of young children has emerged as a strong predictor (in terms of
likelihood) of not opting for holiday packages.
10.There is no plausible effect of salary, age and education on the prediction for
Holliday_Package. These variables do not seem to impact the decision to opt for holiday
packages, as we could not establish a strong relation of these variables with the target
variable.
11.Foreign has emerged as a strong predictor with a positive coefficient value. The likelihood
of a foreigner opting for a holiday package is high.
12.The no_young_children variable reduces the probability of opting for holiday packages,
especially for couples with two young children. The company can try to bin salary ranges to
see if they can derive some more meaningful interpretations from that variable, or club salary
or age into different buckets and see if there is a plausible impact on the target variable.
Otherwise, the business can use different model techniques to do a deeper dive.

Recommendation:
1.The company should really focus on foreigners to drive the sales of their holiday packages,
as that is where the majority of conversions are likely to come from.

2.The company can direct their marketing efforts or offers toward foreigners for a better
conversion rate on holiday packages.

3.The company should also avoid targeting parents with young children. The chances of selling
to parents with two young children are probably the lowest. This also fits with the fact that
parents tend to avoid travelling with young children.

4.If the firm wants to target parents with older children, that might still end up giving a
more favorable return on their marketing efforts than spending on couples with young children.
