MODELING
PADMA
PGP-DSBA ONLINE, MAY '22
Date: 08/05/2022
Table of Contents
Problem 1: Linear Regression
Introduction ... 6
2.3 Performance Metrics: Check the performance of predictions on train and test sets using accuracy, confusion matrix, ROC curve, and ROC_AUC score for each model. Final Model: compare both models and state which model is best/optimized ... 65-80
2.4 Inference: Based on these predictions, what are the insights and recommendations ... 80-82
List of Figures:
1.1 Boxplot for numerical variables ... 11
1.1.2 Boxplot after capping outliers ... 12
1.1.3 Histogram plots for numerical variables ... 13
2.1.9 Boxplot for Holliday Package and no. of young children ... 57
2.1.10 Countplot for Holliday Package and no. of young children ... 57
2.1.11 Stacked bar chart: foreign vs no. of young children ... 58
List of Tables:
1.1 Sample of data ... 7
1.1.2 Description of data ... 9
1.1.3 Correlation of the data ... 14
1.2.1 Descriptive summary of zero values in the x, y, z columns ... 24
1.2.3 Categorical variables with encoded values ... 37
1.2.4 OLS regression results ... 39
2.1.1 Sample of the data ... 47
2.1.2 Descriptive summary of the data ... 47
2.1.3 Correlation of the data ... 61
Linear Regression
Introduction:
Regression analysis is one of the most widely used statistical techniques for studying relationships between variables. We use simple linear regression to analyze the impact of a numeric variable (the predictor) on another numeric variable (the response). For example, managers at a call center may want to know how the number of orders that result from calls relates to the number of phone calls received. Many software packages support regression analysis, such as Microsoft Excel, R, and Python, and anyone who knows the basics of such a tool can build a simple linear regression model. However, to interpret the regression output and assess a model's usefulness, we need to understand regression analysis itself. As the saying goes, "all models are wrong, but some are useful." To be capable of discovering useful models and making plausible predictions, we need an appreciation of both regression analysis theory and domain-specific knowledge. IT professionals are experts at learning and using software packages, but when business users suddenly ask them to perform a regression analysis, those whose mathematical background does not include regression will want to gain the skills to discover plausible models, interpret regression outputs, and make inferences.
Problem 1:
You are hired by Gem Stones Co Ltd, a cubic zirconia manufacturer. You are provided with a dataset containing the prices and other attributes of almost 27,000 cubic zirconia stones (an inexpensive diamond alternative with many of the same qualities as a diamond). The company earns different profits in different price slots. You have to help the company predict the price of a stone on the basis of the details given in the dataset, so that it can distinguish between higher-profit and lower-profit stones and thereby improve its profit share. Also, provide the 5 attributes that are most important.
Data Dictionary:
Variable Name: Description
Carat: Carat weight of the cubic zirconia.
Cut: Cut quality of the cubic zirconia, in increasing order of quality: Fair, Good, Very Good, Premium, Ideal.
Color: Colour of the cubic zirconia, with D being the worst and J the best.
Clarity: The absence of inclusions and blemishes, in order from worst to best in terms of average price: IF, VVS1, VVS2, VS1, VS2, SI1, SI2, I1.
Depth: The height of the cubic zirconia, measured from the culet to the table, divided by its average girdle diameter.
Table: The width of the cubic zirconia's table expressed as a percentage of its average diameter.
Price: The price of the cubic zirconia.
X: Length of the cubic zirconia in mm.
1.1. Read the data and do exploratory data analysis. Describe the data
briefly. (Check the null values, Data types, shape, EDA, duplicate
values). Perform Univariate and Bivariate Analysis.
   Unnamed: 0  carat  cut        color  clarity  depth  table  x     y     z     price
2  3           0.90   Very Good  E      VVS2     62.2   60.0   6.04  6.12  3.78  6289
3  4           0.42   Ideal      F      VS1      61.6   56.0   4.82  4.80  2.96  1082
4  5           0.31   Ideal      F      VVS1     60.4   59.0   4.35  4.43  2.65  779
Table 1.1 Sample of data
Data dimensions:
The data set contains 26967 rows and 11 columns.
3 color 26967 non-null object
4 clarity 26967 non-null object
5 depth 26270 non-null float64
3. The first column ("Unnamed: 0") is just a serial-number index, so we can remove it.
     carat     depth      table      x          y          z          price
75%  1.050000  62.500000  59.000000  6.550000   6.540000   4.040000   5360.000000
max  4.500000  73.600000  79.000000  10.230000  58.900000  31.800000  18818.000000
Table 1.1.2 Description of the data
Observations:
1. Carat: An independent variable ranging from 0.2 to 4.5. The mean is around 0.8, and 75% of the stones are at or below 1.05 carat. The standard deviation is around 0.477, which shows that the data is skewed with a right tail: the majority of the stones are of lower carat, and very few are above 1.05 carat.
2. Depth: The percentage height of the cubic zirconia stones ranges from 50.80 to 73.60. The average height is 61.80; 25% of the stones are below 61 and 75% are below 62.5. The standard deviation of 1.4 indicates an approximately normal distribution.
3. Table: The percentage width of the cubic zirconia ranges from 49 to 79, with an average around 57. 25% of the stones are below 56 and 75% have a width of less than 59. The standard deviation is 2.24, so the data is not normally distributed; like carat, most stones have a smaller width, and outliers are present in this variable.
4. Price: Price is the predicted (target) variable, with a maximum of 18818. The median price is 2375, 25% of the stones are priced below 945, and 75% are priced below 5356. The standard deviation of the price is 4022, indicating that the prices of the majority of the stones are in the lower range, as the distribution is right-skewed.
5. The variables x, y, and z seem to follow a normal distribution with a few outliers.
        cut    color  clarity
unique  5      7      8
top     Ideal  G      SI1
freq    10816  5661   6571
Number of rows with z == 0: 9
Number of rows with depth == 0: 0
Observations:
1. In the given data set, the mean and median values do not differ much. We can observe that the minimum values of "x", "y", and "z" are zero, which indicates faulty entries: dimensionless or two-dimensional diamonds are not possible, so we need to filter those rows out as clearly faulty data. There are three object data types: 'cut', 'color', and 'clarity'.
2. After dropping duplicates, the shape of the data set is 26958 rows and 10 columns. Only 'depth' has missing values (697), which we will impute with its median.
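A minimal pandas sketch of these cleanup steps (the frame below is a toy stand-in for the real CSV, which the report loads separately with pd.read_csv):

```python
import pandas as pd
import numpy as np

# Toy frame standing in for the cubic zirconia data; the report's version
# would be read from CSV with the "Unnamed: 0" serial column dropped.
df = pd.DataFrame({
    "carat": [0.90, 0.42, 0.42, 0.31],
    "depth": [62.2, np.nan, np.nan, 60.4],
    "price": [6289, 1082, 1082, 779],
})

df = df.drop_duplicates()                               # remove exact duplicate rows
df["depth"] = df["depth"].fillna(df["depth"].median())  # median imputation for depth
print(df.shape, df["depth"].isna().sum())
```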
EDA
Univariate Analysis:
Distribution of Numeric Features:
Fig 1.1 Boxplot for numerical variables
Observations:
1. There is a significant number of outliers present in the 'carat', 'depth', 'table', 'x', 'y', 'z', and 'price' variables.
2. The distributions of 'carat' and the target feature 'price' are heavily right-skewed.
3. The variables 'depth', 'x', 'y', and 'z' seem to follow a normal distribution with a few outliers.
4. A large number of outliers is present in all the variables (carat, depth, table, x, y, z); price is right-skewed with a large range of outliers.
1.1.2 Boxplot after capping outliers
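Outlier capping of the kind shown in the boxplot can be sketched with the usual IQR rule; this is a common convention, and the report does not state its exact bounds, so treat the 1.5-IQR fences as an assumption:

```python
import pandas as pd

def cap_outliers(s: pd.Series) -> pd.Series:
    """Clip values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

s = pd.Series([1, 2, 2, 3, 3, 4, 100])  # 100 is an obvious outlier
capped = cap_outliers(s)
print(capped.max())
```

Applied column by column, this preserves every row while pulling extreme values back to the whisker limits.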
1.1.3 Histogram plots
Observations:
1. From the histograms above it is seen that, except for the carat and price variables, all variables have mean and median values very close to each other, suggesting little skewness in these variables.
2. For carat and price there is some difference between mean and median, which indicates some skewness in the data.
3. Depth is the only variable that can be considered normally distributed; carat, table, x, y, and z have multiple modes across the spread of the data.
Skewness of the numeric variables:
table    0.480476
x        0.397696
y        0.394060
z        0.394819
price    1.157121
dtype: float64
Observations:
1. There is a significant number of outliers present in the 'carat', 'depth', 'table', 'x', 'y', 'z', and 'price' variables.
2. The distributions of some quantitative features like "carat" and the target feature "price" are heavily right-skewed.
3. The variables x, y, and z seem to follow a normal distribution with a few outliers.
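The skew figures above come from pandas' sample skewness; a minimal check on toy data (a positive value indicates the right tail seen for price):

```python
import pandas as pd

s = pd.Series([1, 1, 2, 2, 3, 10])  # toy right-skewed data
print(s.skew())                      # positive => right-skewed
```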
Bivariate Analysis:
Numeric Features - Checking for Correlations:
[Table 1.1.3: correlation matrix of carat, depth, table, x, y, z, and price]
1.1.4 Heat map for numerical data
Observations:
1. There is high correlation between features like carat, x, y, z, and price.
2. Table has low correlation with the other features.
3. Depth is negatively correlated with most of the other features, except carat.
4. Carat, x, y, and z show strong correlation with one another, i.e. multicollinearity.
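The correlation check behind the heat map can be sketched as follows; the frame is a toy stand-in, and the seaborn call (commented) is the presumed plotting step:

```python
import pandas as pd

# Toy frame standing in for the numeric columns of the dataset.
df = pd.DataFrame({
    "carat": [0.3, 0.9, 1.2, 2.0],
    "x":     [4.3, 6.1, 6.8, 8.1],
    "price": [500, 6000, 9000, 17000],
})

corr = df.corr()          # Pearson correlation matrix
print(corr.round(2))

# The heat map itself is presumably drawn with seaborn:
# import seaborn as sns; sns.heatmap(corr, annot=True)
```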
1.1.5 Pair plot of numeric data
A pair plot allows us to see both the distribution of a single variable and the relationships between two variables.
EDA for Categorical variable:
Count plot (Cut):
Boxplot(Cut vs Price):
Observations:
1. For the cut variable, the most sold are Ideal cut gems and the least sold are Fair cut gems.
2. All cut types have outliers with respect to price.
3. Ideal cut gems seem to be slightly lower priced, while Premium cut gems seem to be slightly more expensive.
Count plot(Color):
1.1.9 Boxplot: Color vs Price
Observations:
1. For the color variable, the most sold are G colored gems and the least sold are J colored gems.
2. All color types have outliers with respect to price.
3. However, the least priced seem to be E colored gems; J and I colored gems seem to be more expensive.
Count plot(Clarity):
Observations:
1. For the clarity variable, the most sold are SI1 clarity gems and the least sold are I1 clarity gems.
2. All clarity types have outliers with respect to price.
3. SI1 gems seem to be slightly lower priced; VS2 and SI2 clarity stones seem to be more expensive.
cut
_____
Ideal 10805
Premium 6880
Very Good 6027
Good 2434
Fair 779
Name: cut, dtype: int64
color
_____
G 5650
E 4916
F 4722
H 4091
D 3341
I 2765
J 1440
Name: color, dtype: int64
clarity
_____
SI1 6564
VS2 6092
SI2 4561
VS1 4086
VVS2 2530
VVS1 1839
IF 891
I1 362
Name: clarity, dtype: int64
depth
_____
62.0 1128
61.9 1090
62.1 1014
61.8 1012
62.2 976
...
50.8 1
71.0 1
52.7 1
71.3 1
70.8 1
Name: depth, Length: 169, dtype: int64
table
_____
56.0 4983
57.0 4771
58.0 4252
59.0 3296
55.0 3133
...
51.6 1
61.6 1
60.9 1
58.6 1
58.7 1
Name: table, Length: 99, dtype: int64
x
_____
4.38 233
4.37 229
4.32 227
4.33 225
4.34 224
...
9.14 1
9.03 1
3.74 1
9.30 1
3.83 1
Name: x, Length: 520, dtype: int64
y
_____
4.35 236
4.38 234
4.34 223
4.37 223
4.31 212
...
3.75 1
9.14 1
8.81 1
8.98 1
3.87 1
Name: y, Length: 516, dtype: int64
z
_____
2.70 393
2.69 393
2.68 373
2.71 368
2.72 352
...
5.65 1
1.53 1
5.48 1
2.30 1
2.06 1
Name: z, Length: 344, dtype: int64
price
_____
11965.0 1778
544.0 74
625.0 67
828.0 66
776.0 66
...
9678.0 1
690.0 1
10416.0 1
7898.0 1
6751.0 1
Name: price, Length: 7274, dtype: int64
1.2 Impute null values if present, also check for the values which
are equal to zero. Do they have any meaning or do we need to
change them or drop them? Check for the possibility of combining
the sub levels of a ordinal variables and take actions accordingly.
Explain why you are combining these sub levels with appropriate
reasoning.
Check for Missing Values:
carat 0
cut 0
color 0
clarity 0
depth 697
table 0
x 0
y 0
z 0
price 0
dtype: int64
Median values:
carat 0.70
depth 61.80
table 57.00
x 5.69
y 5.70
z 3.52
price 2373.00
dtype: float64
Observations:
1. There are missing values in the "depth" column: 697 cells, or 2.6% of the total data set.
2. We can impute these values using the mean or the median. We checked both, and the results are almost identical.
After imputing the missing cells with the median, the depth column has zero missing values.
Checking for the values which are equal to zero:
Number of rows with x == 0: 3
Number of rows with y == 0: 3
Number of rows with z == 0: 9
Number of rows with depth == 0: 0
Dropping dimensionless diamonds:
Shape: (26925, 10)
Observations:
1. There are three object data types: 'cut', 'color', and 'clarity'. After dropping dimensionless diamonds, the shape of the data set is 26925 rows and 10 columns.
2. We have already checked for zero values, and some are present in the variables 'x', 'y', and 'z', which indicates faulty entries.
3. As dimensionless or two-dimensional diamonds are not possible, we filtered those rows out as clearly faulty data entries.
Observations:
1. While there are no missing values in the numerical columns, there are a few zeros in columns x (3 rows), y (3 rows), and z (9 rows). A single such row was removed while checking for duplicates, and the other eight rows were handled here.
2. Since only 8 rows contained zero values, they account for a negligible share of the data, and for this case study we can simply drop them.
3. Also, the correlation values show strong multicollinearity between all three columns, so it is likely these columns will not even be used in creating the Linear Regression model.
4. I have chosen to drop those rows, as they represent an insignificant number compared to the overall dataset and would not add much value to the analysis here.
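The zero-dimension filter described above can be sketched like this (toy data; the real frame has all ten columns):

```python
import pandas as pd

df = pd.DataFrame({
    "x": [4.3, 0.0, 6.8],
    "y": [4.4, 0.0, 6.9],
    "z": [2.7, 0.0, 4.2],
    "price": [500, 600, 9000],
})

# Keep only rows where all three physical dimensions are non-zero.
df = df[(df["x"] != 0) & (df["y"] != 0) & (df["z"] != 0)]
print(df.shape)
```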
[...] the same level of magnitude. For this case study, however, let us look at the data more closely to identify whether there is a need to scale it.
4. The describe output shared above indicates that the means and standard deviations of the original numeric variables do not vary significantly, with low standard deviations; hence even if we do not scale the numbers, the model's performance will not vary much, and the impact will be insignificant.
5. The immediate effects we would see if we scale the data and run the linear model are: faster execution, as convergence is quicker; an intercept minimized almost to a negligible value; and coefficients that can be interpreted in standard-deviation units instead of the pure unit increments of an unscaled linear model. There is, however, no difference in the interpretation of the model scores or in the graphical representation of the linear model as a scatterplot.
[...] influencing the price variable, and hence at some point I will drop this variable from the model-building process as well. Keeping the above points in mind, for this dataset I do not think scaling the data will make much sense.
1.3 Encode the data (having string values) for Modelling. Split the
data into train and test (70:30). Apply Linear regression using scikit
learn. Perform checks for significant variables using appropriate
method from statsmodel. Create multiple models and check the
performance of Predictions on Train and Test sets using Rsquare,
RMSE & Adj Rsquare. Compare these models and select the best
one with appropriate reasoning.
We have three object columns with string values: cut, color, and clarity. Let us look at the data briefly before deciding what kind of encoding technique to use. First, a quick statistical summary of the target variable, price:
mean      3734.453965
std       3466.394724
median    2373.000000
Name: price, dtype: float64
We use the groupby function to check the relationship of cut with price, with some aggregator functions: mean, median, and standard deviation.
           std         max      mean         median  min
cut
Very Good  3461.49217  11965.0  3829.352912  2633.0  336.0
Observations:
1. Let us check the first object column, cut, using the groupby function to examine its relationship with price via aggregator functions (mean, median, and standard deviation), as shown in the table above.
2. We can establish that there is an order in the ranking: the mean price increases from Ideal through Good, Very Good, and Premium to Fair, and the Fair segment has the highest median value as well. Since an ordered ranking is visible, I have encoded this column on a scale (label encoding) rather than one-hot encoding. It would, however, be perfectly fine to treat this variable with one-hot encoding as well.
3. We could certainly try that approach if we wanted to see the impact of each kind of cut on the price target variable. Overall, given the table above, I do not think cut will be a strong predictor, and hence I have stayed with label encoding.
We use the groupby function similarly to check the relationship of color with price, using mean, median, and standard deviation.
std max mean median min
color
D 3022.221311 11965.0 3067.771027 1799.0 357.0
         std          max      mean         median  min
clarity
I1       2542.831763  11965.0  3850.662983  3494.0  345.0
IF       3252.252509  11965.0  2592.427609  1063.0  369.0
COLOR : 7
J 1440
I 2765
D 3341
H 4091
F 4722
E 4916
G 5650
Name: color, dtype: int64
CLARITY : 8
I1 362
IF 891
VVS1 1839
VVS2 2530
VS1 4086
SI2 4561
VS2 6092
SI1 6564
Name: clarity, dtype: int64
Unique values for all the categorical variables.
[Garbled extraction of the first five rows of the dataframe after encoding: the numeric columns plus the label-encoded cut_s column and the one-hot color_* and clarity_* dummy columns.]
Table 1.2.2 Converting categorical to dummy variables
Observations:
1. I have used one-hot encoding for the color and clarity independent variables with the argument drop_first=True, which can also be referred to as dummy encoding, as it keeps k-1 categories out of k.
2. Dummy encoding converts a categorical variable with k levels into k-1 indicator variables, whereas plain one-hot encoding produces k. This also helps us deal with the issue of dimensionality and the multicollinearity of a redundant dummy column.
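The dummy encoding described above maps directly onto pandas' get_dummies; a small sketch with toy category values:

```python
import pandas as pd

df = pd.DataFrame({"color": ["E", "G", "J", "E"],
                   "clarity": ["SI1", "IF", "SI1", "VS2"]})

# drop_first=True keeps k-1 dummies per column (dummy encoding),
# dropping the alphabetically first level of each variable.
encoded = pd.get_dummies(df, columns=["color", "clarity"], drop_first=True)
print(list(encoded.columns))
```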
x float64
y float64
z float64
price float64
cut_s int64
cut_Good uint8
cut_Ideal uint8
cut_Premium uint8
cut_Very Good uint8
color_E uint8
color_F uint8
color_G uint8
color_H uint8
color_I uint8
color_J uint8
clarity_IF uint8
clarity_SI1 uint8
clarity_SI2 uint8
clarity_VS1 uint8
clarity_VS2 uint8
clarity_VVS1 uint8
clarity_VVS2 uint8
dtype: object
After encoding, all categorical object columns have been converted into int64 and uint8 types.
Heatmap:
1.2.1 Heat map for numerical and categorical variables
Observations:
1. We have now established that several variables (carat, x, y, and z) demonstrate strong correlation, i.e. multicollinearity, which must be addressed before proceeding with the Linear Regression model.
2. I have decided to drop x, y, and z from the Linear Regression model creation step. I have kept the carat column, as it has the strongest relation with the target variable price out of the four. The depth column also does not have much impact on price, so I have chosen not to use it either.
Train-Test Split:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

X = df.drop('price', axis=1)
X = X.drop(['x', 'y', 'z', 'depth', 'cut_Good', 'cut_Premium', 'cut_Ideal', 'cut_Very Good'], axis=1)
# Copy target into the y dataframe.
y = df[['price']]

# Split X and y into training and test sets in a 70:30 ratio.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

regression_model = LinearRegression()
regression_model.fit(X_train, y_train)
Regression model coefficients are:
The coefficient for carat is 8023.99354211987
Observations:
1. The intercept (often labelled the constant) is the expected mean value of Y when all X = 0. If X never equals 0, the intercept has no intrinsic meaning.
2. The intercept for our model is -3600.04. In the present case, when all the predictor variables (carat, cut, color, clarity) are zero, then C = -3600 in Y = m1X1 + m2X2 + ... + mnXn + C + e, i.e. the predicted price is -3600. We can apply a Z-score transform (scaling) to the data to bring the intercept to nearly zero.
Model Evaluation: R-squared values
R-square on training data: 0.9384124327864911
R-square on testing data: 0.9394154299395036
RMSE values:
RMSE on training data: 858.2270432559263
RMSE on testing data: 857.818165001115
Observations:
1. R-square is the percentage of the response-variable variation that is explained by a linear model: R-square = explained variation / total variation.
2. R-squared is always between 0 and 100%: 0% indicates that the model explains none of the variability of the response data around its mean, while 100% indicates that it explains all of it.
3. For this regression model, the R-square values on the training and test data are 0.938 and 0.939 respectively.
4. The RMSE values on the training and test data are 858.23 and 857.82 respectively.
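The evaluation step above can be sketched end to end on synthetic data (the feature and coefficients below are made up to mimic the carat-price relation):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic stand-in for the train split: one carat-like feature.
rng = np.random.default_rng(1)
X = rng.uniform(0.2, 2.0, size=(200, 1))
y = 8000 * X[:, 0] - 3600 + rng.normal(0, 100, 200)

model = LinearRegression().fit(X, y)
pred = model.predict(X)
r2 = r2_score(y, pred)                        # explained / total variation
rmse = np.sqrt(mean_squared_error(y, pred))   # typical prediction error
print(round(r2, 3), round(rmse, 1))
```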
1.3.1 Scatter plot on test data between the dependent variable price and the independent variables
Observation:
1. We can see a linear pattern with a very strong correlation between the predicted y and the actual y. But there is a lot of spread, which indicates some noise in the data set, i.e. unexplained variance in the output.
After dropping the variables "depth", "x", "y", "z", "cut_Good", "cut_Ideal", "cut_Premium", and "cut_Very Good":
Heat Map:
The strongest correlations with the target variable 'price' are 'carat' and 'cut_s'; the lowest is 'clarity_SI2'.
Linear Regression using statsmodels:
We concatenate X and y into a single dataframe, since the statsmodels formula API requires the data to be passed as a single dataframe, unlike sklearn, which wants X and y as separate variables.
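The statsmodels fit can be sketched with the formula API on toy data (column names mirror the report's; the data itself is synthetic):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Toy frame standing in for the concatenated training dataframe.
rng = np.random.default_rng(1)
data = pd.DataFrame({"carat": rng.uniform(0.2, 2.0, 200)})
data["price"] = 8000 * data["carat"] - 3600 + rng.normal(0, 100, 200)

# Formula-style OLS: one dataframe, R-style formula.
lm1 = smf.ols("price ~ carat", data=data).fit()
print(lm1.params)      # Intercept and carat coefficient
print(lm1.rsquared)
```

In the report the formula would list all the encoded predictors on the right-hand side; `lm1.summary()` then produces the OLS results table shown below.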
[Garbled extraction of the first five rows of the modelling dataframe: carat, table, cut_s, the color_* and clarity_* dummy columns, and price.]
Table 1.2.3 Categorical variables with encoded values
lm1.params:
Intercept -3600.043305
carat 8023.993542
table -24.961601
cut_s -43.855857
color_E -190.881545
color_F -248.688185
color_G -421.992937
color_H -832.830401
color_I -1321.640073
color_J -1859.634521
clarity_IF 4245.013491
clarity_SI1 2695.277459
clarity_SI2 1861.410706
clarity_VS1 3546.315781
clarity_VS2 3263.572687
clarity_VVS1 4018.659987
clarity_VVS2 3984.978015
dtype: float64
clarity_IF      4245.0135   65.081   65.227   0.000   4117.450   4372.577
clarity_SI1     2695.2775   55.759   48.338   0.000   2585.985   2804.569
clarity_SI2     1861.4107   56.193   33.125   0.000   1751.268   1971.554
clarity_VS1     3546.3158   56.943   62.278   0.000   3434.702   3657.930
clarity_VS2     3263.5727   56.081   58.194   0.000   3153.649   3373.496
clarity_VVS1    4018.6600   60.279   66.668   0.000   3900.508   4136.812
clarity_VVS2    3984.9780   58.614   67.986   0.000   3870.089   4099.867
==============================================================================
Omnibus:            3995.393   Durbin-Watson:           1.999
Prob(Omnibus):         0.000   Jarque-Bera (JB):    12037.975
Skew:                  1.098   Prob(JB):                 0.00
Kurtosis:              6.241   Cond. No.             1.90e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.9e+03. This might indicate that there are strong multicollinearity or other numerical problems.
Observations:
1. R-square is the percentage of the response-variable variation that is explained by a linear model: R-square = explained variation / total variation.
2. R-squared is always between 0 and 100%: 0% indicates that the model explains none of the variability of the response data around its mean, while 100% indicates that it explains all of it.
3. For this regression model, the R-square values on the training and test data are 0.938 and 0.939 respectively.
4. The RMSE values on the training and test data are 858.23 and 857.82 respectively.
5. As the training and testing scores are almost in line, we can conclude this model is a right-fit model.
Applying Z-Score:
As stated earlier, with this specific dataset I do not think we need to scale the data; however, to see its impact, let us quickly view the results after scaling. I have used the Z-score to scale the data. Z-scores become comparable by measuring the observations in multiples of the standard deviation of the sample; the mean of a z-transformed sample is always zero.
from scipy.stats import zscore

x_train_scaled = x_train.apply(zscore)
x_test_scaled = x_test.apply(zscore)
y_train_scaled = y_train.apply(zscore)
y_test_scaled = y_test.apply(zscore)
The coefficient for clarity_VVS2 is 0.3358847114654482.
We can interpret this as: for a one standard deviation increment in carat, the carat variable impacts the dependent variable 'price' by 1.06 standard deviations, and so on for the other variables.
The intercept for our model is -2.459574118873397e-16, i.e. effectively zero.
Observations:
1.1.3 Scatter plot on test data after scaling, between price and the independent variables:
cut_s ---> 5.096413143490567
color_E ---> 2.4747624189962067
color_F ---> 2.4341698268457574
color_G ---> 2.774161531143149
color_H ---> 2.2877893449603337
color_I ---> 1.9217515170977741
color_J ---> 1.509454503820044
clarity_IF ---> 3.3631034501372756
clarity_SI1 ---> 17.89611729585019
clarity_SI2 ---> 12.680824943298578
clarity_VS1 ---> 11.60063052200214
clarity_VS2 ---> 16.740214589468305
clarity_VVS1 ---> 5.8723419563685235
clarity_VVS2 ---> 7.630278698237391
The variance inflation factors show that 'table', 'clarity_SI1', 'clarity_SI2', 'clarity_VS1', and 'clarity_VS2' display severe collinearity.
6. R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination, or the coefficient of multiple determination for multiple regression. Its value varies from 0 to 1, and 100% indicates that the model explains all the variability of the response data around its mean.
7. Comparing the performance of predictions on the train and test sets using R-square, RMSE, and adjusted R-square, the statsmodels linear regression is the best model. Any value approaching 1 can be considered a well-fitted regression line, and our R-squared score of 0.939 signifies good performance.
8. However, for the sake of checking how it impacts the overall model coefficients, I have still carried it out for this study. Please note that centering/scaling does not affect our statistical inference in regression models: the estimates are adjusted appropriately, and the p-values stay the same. The brief results from running the model on scaled data are shared above.
9. We have a dataset with strong correlation between independent variables, and hence we need to tackle the issue of multicollinearity, which can hinder the interpretation of model performance. Multicollinearity makes it difficult to understand how an individual variable influences the target variable; however, it does not affect the accuracy of the model. As a result, while creating the model I dropped several independent variables displaying multicollinearity or lacking a direct relation with the target variable.
10. After the Linear Regression model was created, we can see the assumption coming true: the carat variable emerged as the single biggest factor impacting the target variable, followed by a few of the clarity variables. Carat has the highest coefficient value of all the studied variables for this test case.
11. Final linear equation: price = (-3600.04) + (8023.99) * carat + (-24.96) * table + (-43.86) * cut_s + (-190.88) * color_E + (-248.69) * color_F + (-421.99) * color_G + (-832.83) * color_H + (-1321.64) * color_I + (-1859.63) * color_J + (4245.01) * clarity_IF + (2695.28) * clarity_SI1 + (1861.41) * clarity_SI2 + (3546.32) * clarity_VS1 + (3263.57) * clarity_VS2 + (4018.66) * clarity_VVS1 + (3984.98) * clarity_VVS2
12. Even after scaling, our claim about carat being an important driver in the linear equation is reaffirmed. We can then look at the Variance Inflation Factor (VIF) scores to check multicollinearity. By industry convention, at least for this case study, any variable with a VIF score greater than 10 is taken to indicate severe collinearity.
13. The table, clarity_SI1, clarity_SI2, clarity_VS1, and clarity_VS2 variables display severe collinearity. VIF measures the intercorrelation among independent variables in a multiple regression model. In mathematical terms, the variance inflation factor for a variable is the ratio of the overall model variance to the variance of a model with that single independent variable.
14. As an example, the VIF value for carat in the table above reflects its intercorrelation with the other independent variables in the dataset, and so on for the other variables. To fine-tune the model, we can simply drop these columns from the Linear Regression model, see how the results pan out, and check model performance.
15. As an alternative approach, we could encode the "cut" variable with one-hot (dummy) encoding and run the model again to check the overall model score or tackle the issue of multicollinearity. This would allow us to read the impact of the cut categories on the target variable price, if the company intends to study that as well; however, for this case study and for the reasons mentioned above, I have not used one-hot encoding on the cut variable.
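The final linear equation in point 11 can be turned into a quick prediction sketch; the stone's feature values below are made up for illustration, and only the coefficients come from the report:

```python
# Coefficients from the report's fitted model (point 11).
coefs = {
    "carat": 8023.99, "table": -24.96, "cut_s": -43.86,
    "color_E": -190.88, "color_F": -248.69, "color_G": -421.99,
    "color_H": -832.83, "color_I": -1321.64, "color_J": -1859.63,
    "clarity_IF": 4245.01, "clarity_SI1": 2695.28, "clarity_SI2": 1861.41,
    "clarity_VS1": 3546.32, "clarity_VS2": 3263.57,
    "clarity_VVS1": 4018.66, "clarity_VVS2": 3984.98,
}
intercept = -3600.04

# A hypothetical stone: 1.0 carat, table 57, cut_s 3, color E, clarity VS1.
stone = {"carat": 1.0, "table": 57.0, "cut_s": 3,
         "color_E": 1, "clarity_VS1": 1}

# Dummies absent from `stone` default to 0, as in the encoded dataframe.
price = intercept + sum(coefs[k] * stone.get(k, 0) for k in coefs)
print(round(price, 2))
```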
16. For the business, based on the model we created for this test case, the key variables likely to positively drive price are (top 5 in descending order):
1. carat
2. clarity_IF
3. clarity_VVS1
4. clarity_VVS2
5. clarity_VS1
Recommendations:
1. As expected, carat is a strong predictor of the overall price of the stone. Clarity, which refers to the absence of inclusions and blemishes, has emerged as a strong predictor of price as well: stones of clarity IF, VVS1, VVS2, and VS1 are helping the firm command higher prices.
2. Stones of colors H, I, and J will not help the firm command higher prices. The company should instead focus on stones of colors D, E, and F to command relatively higher price points and support sales.
3. This also indicates that the company could look to introduce new colors of stones, such as clear stones or a different/unique color that positively impacts price.
4. The company should focus on the stones' carat and clarity in order to increase prices. Ideal customers will also contribute to more profit. Marketing efforts can educate customers about the importance of a better carat score and of the clarity index. After this, the company can build segments and target customers based on their income/paying capacity, etc.
Logistic Regression:
Introduction:
Logistic regression is a process of modeling the probability of a discrete outcome given an
input variable. The most common logistic regression models a binary outcome; something
that can take two values such as true/false, yes/no, and so on. Multinomial logistic
regression can model scenarios where there are more than two possible discrete outcomes.
Logistic regression is a useful analysis method for classification problems, where you are
trying to determine which category a new sample fits best into; attack detection in cyber
security is one example of such a classification problem.
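A minimal, self-contained sketch of fitting a binary logistic regression with scikit-learn (synthetic data, not this report's dataset):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary-outcome data purely for illustration
X, y = make_classification(n_samples=200, n_features=4, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X, y)
proba = clf.predict_proba(X[:1])   # [P(class 0), P(class 1)] for one sample
accuracy = clf.score(X, y)
print(proba.shape, accuracy)
```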
LDA (Linear Discriminant Analysis), also called Normal Discriminant Analysis or Discriminant
Function Analysis, is a dimensionality reduction technique that is commonly used for
supervised classification problems. It is used for modelling differences between groups, i.e.
separating two or more classes, and for projecting features from a higher-dimensional space
into a lower-dimensional space.
For example, we have two classes and we need to separate them efficiently. Classes can
have multiple features. Using only a single feature to classify them may result in some
overlapping as shown in the below figure. So, we will keep on increasing the number of
features for proper classification.
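This projection idea can be sketched with scikit-learn's LDA on a standard dataset (the iris data, used here only as an illustration):

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)  # 150 samples, 4 features, 3 classes

# Project the 4 features onto 2 discriminant axes that best separate the classes
lda = LinearDiscriminantAnalysis(n_components=2)
X_2d = lda.fit_transform(X, y)
print(X_2d.shape)  # (150, 2)
```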
Data Dictionary:
Variable Name: Description
  Unnamed: 0 Holliday_Package Salary age educ no_young_children no_older_children foreign
0          1               no  48412  30    8                 1                 1      no
1          2              yes  37207  45    8                 0                 1      no
2          3               no  58022  46    9                 0                 0      no
3          4               no  66503  31   11                 2                 0      no
4          5               no  66734  44   12                 0                 2      no
Data dimensions:
(872, 8)
observations:
1.Holliday_Package: this variable is categorical and is the target variable.
2.Salary, age, educ, no_young_children and no_older_children are numerical or
continuous variables. Salary ranges from 1322 to 236961.
3.The average salary of employees is around 47729 with a standard deviation of 23418.
The large standard deviation relative to the mean and a skew of 0.71 indicate that the data
is not normally distributed but right-skewed, with a few employees earning much more than
the average of 47729. 75% of the employees earn below 53469, while 25% of the employees
earn below 35324.
4.The age of the employees ranges from 20 to 62. The median is around 39; 25% of the
employees are below 32 and 25% are above 48. The standard deviation is around 10, and
the distribution is almost normal.
5.Years of formal education range from 1 to 21. 25% of the population has 8 years of formal
education, the median is around 9 years, and 75% of the employees have 12 years or less.
The standard deviation of education is around 3. This variable also shows some skewness
in the data.
no_older_children 0
foreign 0
dtype: int64
(872, 7)
Univariate Analysis:
Categorical Feature Levels Frequencies:
Holliday_Package Number of Levels 2
no 471
yes 401
Name: Holliday_Package, dtype: int64
foreign Number of Levels 2
no 656
yes 216
Salary
_____
46195 2
33357 2
39460 2
36976 2
40270 2
..
38352 1
119644 1
96072 1
115431 1
74659 1
Name: Salary, Length: 864, dtype: int64
age
_____
44 35
31 32
34 32
35 31
33 30
28 29
40 29
36 28
38 28
32 27
47 26
41 26
39 25
26 24
42 24
46 24
49 23
45 23
51 22
50 21
37 21
43 21
48 20
27 19
29 19
30 19
57 18
56 18
55 17
25 17
58 16
24 16
59 14
54 14
52 13
21 12
23 11
53 10
60 10
22 9
61 8
20 8
62 3
Name: age, dtype: int64
educ
_____
8 157
12 124
9 114
11 100
10 90
5 67
4 50
13 43
7 31
14 25
6 21
15 15
3 11
16 10
2 6
17 3
19 2
21 1
18 1
1 1
Name: educ, dtype: int64
no_young_children
_____
0 665
1 147
2 55
3 5
Name: no_young_children, dtype: int64
no_older_children
_____
0 393
2 208
1 198
3 55
4 14
5 2
6 2
Name: no_older_children, dtype: int64
foreign
_____
no 656
yes 216
Name: foreign, dtype: int64
2.1.1 Hist plots for numerical variables
observations:
2.1.2 Box plots for numerical variables
observations:
1.There are significant outliers in the variable "Salary"; there are only minimal outliers in
other variables like "educ", "no_young_children" and "no_older_children".
2.There are no outliers in the variable "age". For interpretation purposes we need to study
variables such as no_young_children and no_older_children before outlier treatment.
3.For this case study we have done outlier treatment only for Salary and educ.
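A minimal sketch of the capping step, assuming the standard 1.5×IQR boxplot rule (the sample values below are made up):

```python
import pandas as pd

def cap_outliers(s: pd.Series) -> pd.Series:
    """Clip values outside the 1.5*IQR whiskers (the usual boxplot rule)."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

# Toy salary series with one extreme value at each end
salary = pd.Series([1322, 30000, 40000, 47000, 53000, 236961])
capped = cap_outliers(salary)
print(capped.min(), capped.max())
```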
2.1.3 Boxplot after capping outliers
age 0.146412
educ -0.095087
no_young_children 1.946515
no_older_children 0.953951
dtype: float64
2.1.4 Boxplot(Holliday Package VS Salary)
While performing the bivariate analysis we observe that Salary for employees opting for
holiday package and for not opting for holiday package is similar in nature. However, the
distribution is fairly spread out for people not opting for holiday packages.
2.1.6 Countplot(HollidayPackage VS Age)
The distribution of the age variable across holiday package classes is also similar in nature.
We can clearly see that employees in the middle range (34 to 45 years) go for the holiday
package more than older and younger employees.
The educ variable also shows a similar pattern. This means education is likely not a
variable influencing holiday packages for employees.
2.1.8 Countplot(Holliday Package Vs Educ)
We observe that employees with fewer years of formal education (1 to 7 years) and those
with higher education are not opting for the holiday package, as compared to employees
with 8 to 12 years of formal education.
observations:
1.There is a significant difference between employees with young children who opt for the
holiday package and those who do not.
2.We can clearly see that people with young children are not opting for holiday packages.
2.1.12 Boxplot(Holliday Package Vs no older children)
observations:
1.The distribution for opting or not opting for the holiday package looks the same for
employees with older children. Fewer employees with older children opt for the holiday
package, and those with 5 or 6 older children do not opt for it at all.
2.The distribution is almost the same for both scenarios when dealing with employees with
older children.
2.1.13 Countplot(Holliday Package Vs no older children)
The bar plot shows that fewer employees with older children opt for the holiday package;
employees with 5 or 6 older children do not opt for it at all.
Bivariate Analysis:
Numeric Features - Checking for Correlations:
(Correlation matrix of Salary, age, educ, no_young_children and no_older_children; values
shown in the heat map below.)
Heat Map:
observations:
1.We can see that there isn't any strong correlation between the variables.
2.Salary and education display moderate correlation, and no_older_children is somewhat
correlated with Salary. However, there are no strong correlations in the dataset.
Pairplot:
The pair plot shows the correlations between the independent variables. For all variables,
the distributions of the Holliday_Package classes (yes/no) overlap.
2.1.8 Countplot for Holliday Package
Holliday_Package: the distribution seems fine, with 54% for no and 46% for yes.
The mean values of no_young_children and no_older_children are very low for those
opting for the holiday package, so employees with children tend not to opt for it.
Foreign: the data is imbalanced, skewed more towards no with a relatively smaller share
for yes.
2.2 Do not scale the data. Encode the data (having string values) for
Modelling. Data Split: Split the data into train and test (70:30).
Apply Logistic Regression and LDA (linear discriminant analysis).
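The encoding and split described here can be sketched as follows (the frame below is synthetic; only the column names follow the report):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame mimicking the report's columns
df = pd.DataFrame({
    "Holliday_Package": ["no", "yes", "no", "yes"] * 25,
    "foreign":          ["no", "no", "yes", "yes"] * 25,
    "Salary":           range(100),
})

# Encode string-valued columns as integer codes (no -> 0, yes -> 1); no scaling
for col in ["Holliday_Package", "foreign"]:
    df[col] = df[col].astype("category").cat.codes

X = df.drop(columns="Holliday_Package")
y = df["Holliday_Package"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=123, stratify=y  # 70:30 split
)
print(len(X_train), len(X_test))
```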
2.3 Performance Metrics: Check the performance of Predictions on
Train and Test sets using Accuracy, Confusion Matrix, Plot ROC
curve and get ROC_AUC score for each model. Final Model:
Compare both the models and write an inference on which model is
best/optimized. (Q2.2 & Q2.3 clubbed together)
Unique count of categorical variables:
Column Name: Holliday_Package
['no', 'yes']
Categories (2, object): ['no', 'yes']
[0 1]
Column Name: foreign
['no', 'yes']
Categories (2, object): ['no', 'yes']
[0 1]
0 393
2 208
1 198
3 55
4 14
5 2
6 2
Name: no_older_children, dtype: int64
Size of Holliday Package and no young children:
Holliday_Package no_young_children
0 0 326
1 100
2 42
3 3
1 0 339
1 47
2 13
3 2
dtype: int64
observations:
1.Although this variable is numeric, it shows a varied distribution between the number of
children being 1 and 2 when a bivariate analysis is done with the dependent variable.
2.It is therefore advisable to treat this variable as categorical and encode it.
observations:
There does not seem to be much variation between the distributions for children counts
above 0. They are close enough across the Holliday_Package classes, with an almost alike
distribution.
Datatypes:
Holliday_Package int8
Salary float64
age int64
educ float64
no_young_children int64
no_older_children int64
foreign int8
dtype: object
  Holliday_Package  Salary age educ no_young_children no_older_children foreign
0                0 48412.0  30  8.0                 1                 1       0
1                1 37207.0  45  8.0                 0                 1       0
2                0 58022.0  46  9.0                 0                 0       0
3                0 66503.0  31 11.0                 2                 0       0
4                0 66734.0  44 12.0                 0                 2       0
(Head of the data after dummy encoding; nyc_k = no_young_children_k, noc_k = no_older_children_k)
  Holliday_Package  Salary age educ foreign nyc_1 nyc_2 nyc_3 noc_1 noc_2 noc_3 noc_4 noc_5 noc_6
0                0 48412.0  30  8.0       0     1     0     0     1     0     0     0     0     0
1                1 37207.0  45  8.0       0     0     0     0     1     0     0     0     0     0
2                0 58022.0  46  9.0       0     0     0     0     0     0     0     0     0     0
3                0 66503.0  31 11.0       0     0     1     0     0     0     0     0     0     0
4                0 66734.0  44 12.0       0     0     0     0     0     1     0     0     0     0
2.1.4 Encoded Categorical variable
observations:
1.I have used one-hot encoding for the "no_young_children" and "no_older_children"
independent variables with the argument drop_first=True, also referred to as dummy
encoding, which produces (k-1) columns per variable.
2.One-hot encoding of a variable with k levels produces k columns, while dummy encoding
produces k-1 columns. This also helps us deal with multicollinearity and keeps the
dimensionality down.
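A short sketch of this encoding step (toy values; the column name follows the report):

```python
import pandas as pd

# Dummy (k-1) encoding of a discrete count column, as done for the children variables
df = pd.DataFrame({"no_young_children": [0, 1, 2, 3, 1, 0]})
dummies = pd.get_dummies(df["no_young_children"], prefix="no_young_children",
                         drop_first=True)  # drops the column for level 0
print(dummies.columns.tolist())
```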
y = df['Holliday_Package']
check target variable class proportion:
0 0.540138
1 0.459862
Name: Holliday_Package, dtype: float64
n_jobs=2,
random_state=123)
model.fit(X_train, y_train)
0 0.690002 0.309998
1 0.616723 0.383277
2 0.704176 0.295824
3 0.470343 0.529657
4 0.548567 0.451433
Model Evaluation:
Training Data:
Accuracy: 0.6770491803278689
2.3.1 AUC and ROC for the training data
AUC: 74%
Accuracy: 68%
Precision: 67%
Recall: 60%
f1-Score: 63%
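Scores like these come from the standard scikit-learn metric calls; a self-contained sketch with made-up labels and probabilities:

```python
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

# Hypothetical predictions, purely to show the metric calls used in this section
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]
y_prob = [0.2, 0.6, 0.8, 0.9, 0.4, 0.1, 0.7, 0.3]

acc = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)   # rows = actual, columns = predicted
auc = roc_auc_score(y_true, y_prob)     # computed from probabilities, not labels
print(acc, auc)
print(cm)
```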
Test Data:
Accuracy: 0.6793893129770993
macro avg 0.68 0.67 0.67 262
weighted avg 0.68 0.68 0.67 262
Recall: 55%
f1-Score: 61%
random_state=1),
n_jobs=-1,
param_grid={'l1_ratio': [0.25, 0.5, 0.75],
'penalty': ['l2', 'none', 'l1', 'elasticnet'],
'solver': ['sag', 'lbfgs', 'saga', 'newton-cg',
'liblinear'],
'tol': [0.0001, 1e-05]},
scoring='accuracy')
print(grid_search.best_params_,'\n')
print(grid_search.best_estimator_)
LogisticRegression(l1_ratio=0.25, max_iter=10000, n_jobs=2, penalty='l1',
random_state=1, solver='liblinear', tol=1e-05)
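A tuned estimator like the one printed above can be obtained with a grid search along these lines (a sketch on synthetic data; the grid is trimmed to combinations that the liblinear solver actually supports):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=5, random_state=1)

grid_search = GridSearchCV(
    LogisticRegression(max_iter=10000, random_state=1),
    param_grid={
        "penalty": ["l1", "l2"],
        "solver": ["liblinear"],   # supports both l1 and l2 penalties
        "tol": [1e-4, 1e-5],
    },
    scoring="accuracy",
    cv=5,
    n_jobs=-1,
)
grid_search.fit(X, y)
print(grid_search.best_params_)
print(grid_search.best_estimator_)
```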
0 1
0 0.679105 0.320895
1 0.620776 0.379224
2 0.693508 0.306492
3 0.470989 0.529011
4 0.547406 0.452594
2.3.5 Confusion matrix on Train Data
Precision: 65%
Recall: 59%
f1-Score: 62%
Test Data :
Accuracy: 0.6755725190839694
2.3.8 Test Model ROC_AUC
AUC: 72%
Accuracy: 68%
Precision: 69%
Recall: 53%
f1-Score: 60%
Prediction:
Confusion Matrix Comparison:
Train Data:
Test Data:
2.3.10 Confusion Matrix on Test Data
Test Data:
precision recall f1-score support
AUC Score: 0.7430907851896722
Precision: 66%
Recall: 59%
f1-Score: 63%
f1-Score: 61%
observations:
1.The logistic regression test recall score is 55% and the precision score is 69%.
2.The logistic regression with GridSearchCV test recall score is 53% and the precision score
is 69%.
3.The Linear Discriminant Analysis test recall score is 54% and the precision score is 70%.
4.Comparing the recall scores of all the models, logistic regression has the highest recall
score at 55%, so it is the best model to fit here.
threshold_prob Acc f1 recall prec
observations:
1.By using a custom threshold we can improve the recall score. We can try threshold
probability values from 0.1 to 0.9 and pick the one that gives a good recall score. Note that
as recall increases, precision decreases.
2.Here a threshold probability of 0.1 gives a recall of 1 and a precision of 0.46 for the train
and test results.
3.The previous recall score was 0.54; after using a custom threshold, recall improves to 1.
Since a higher recall is what we want here, this makes the model a better fit.
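The threshold sweep can be sketched like this (the probabilities below are made up for illustration, not taken from the fitted model):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels and predicted probabilities
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_prob = np.array([0.2, 0.6, 0.45, 0.9, 0.35, 0.1, 0.7, 0.3])

def scores(threshold):
    """Recall and precision when classifying at the given probability cut-off."""
    y_pred = (y_prob >= threshold).astype(int)
    return recall_score(y_true, y_pred), precision_score(y_true, y_pred)

print(scores(0.5))  # default cut-off
print(scores(0.3))  # lower cut-off: recall rises
```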
2.4 Inference: Basis on these predictions, what are the insights and
recommendations.
Predictions and Insights:
1.We started this case study by looking at the data correlations to identify early trends and
patterns. At one stage, Salary and education seemed to be important parameters that might
have played out as important predictors.
2.While performing the bivariate analysis we observe that Salary for employees opting for
holiday package and for not opting for holiday package is similar in nature. However, the
distribution is fairly spread out for people not opting for holiday packages.
3.The distribution of the age variable across holiday package classes is also similar in
nature. The range of ages for people not opting for the holiday package is more spread out
compared with people opting for it.
4.We can clearly see that employees in the middle range (34 to 45 years) go for the holiday
package more than older and younger employees. However, the almost similar distributions
here for salary and age indicate that they might not come out as strong predictors after
the model is created. Let's carry on with more data exploration and check.
5.There is a significant difference between employees with young children who opt for the
holiday package and those who do not. We can clearly see that people with young children
are not opting for holiday packages.
6.We identify that the number of young children has a varied distribution for employees and
might end up playing an important role in our model building process. Employees with older
children have an almost similar distribution for opting and not opting for the holiday
package across the number-of-children levels.
7.For this case study, I have chosen logistic regression as the better model for interpretation
and analytical purposes. Keeping that in mind, I would like to quickly refer to the coefficient
values:
The coefficient value for Salary is -2.10948637e-05, or almost 0.
9.Interestingly, and as expected, Salary and age did not turn out to be important predictors
in my model. Also, the number of young children has emerged as a strong predictor
(likelihood) of not opting for holiday packages.
10.There is no plausible effect of salary, age, and education on the prediction for
Holliday_packages. These variables don’t seem to impact the decision to opt for holiday
packages as we couldn’t establish a strong relation of these variables with the target
variable.
11.Foreign has emerged as a strong predictor with a positive coefficient value. The log
likelihood or likelihood of a foreigner opting for a holiday package is high.
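For interpretation, a positive logistic coefficient maps to an odds ratio via the exponential; a tiny illustration (the coefficient value is assumed for the sketch, not taken from the fitted model):

```python
import math

coef_foreign = 1.2                    # hypothetical positive coefficient for `foreign`
odds_ratio = math.exp(coef_foreign)
# A foreigner's odds of opting for the package are odds_ratio times a local's,
# holding the other variables fixed
print(round(odds_ratio, 2))
```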
12.The no_young_children variable reduces the probability of opting for holiday packages,
especially for couples with two young children. The company can try binning salary ranges
to see if it can derive more meaningful interpretations from that variable, or club salary or
age into different buckets and check for a plausible impact on the target variable.
Alternatively, the business can use different modelling techniques to do a deeper dive.
Recommendation:
1.The company should really focus on foreigners to drive the sales of its holiday packages,
as that is where the majority of conversions are going to come from.
2.The company can direct its marketing efforts or offers towards foreigners for a better
conversion rate on holiday packages.
3.The company should also stay away from targeting parents with young children. The
chances of selling to parents with two young children are probably the lowest. This also gels
with the fact that parents tend to avoid travelling with young children.
4.If the firm wants to target parents with older children, that might still give a more
favorable return on marketing spend than targeting couples with young children.