Predictive Modeling Project
Dharitree Mahakul
PGP DSBA APRIL B 21
Contents
1.1. Read the data and do exploratory data analysis. Describe the data briefly (check the null values, data types, shape, EDA, duplicate values). Perform univariate and bivariate analysis.
1.2. Impute null values if present; also check for values which are equal to zero. Do they have any meaning, or do we need to change or drop them? Check for the possibility of combining the sub-levels of ordinal variables and take action accordingly, explaining the reasoning.
1.3. Encode the data (having string values) for modelling. Data split: split the data into train and test (70:30). Apply linear regression. Performance metrics: check the performance of predictions on the train and test sets using R-square and RMSE.
1.4. Inference: based on these predictions, what are the business insights and recommendations?
2.1. Data ingestion: read the dataset, produce descriptive statistics, do the null value condition check, and write an inference on it. Perform univariate and bivariate analysis. Do exploratory data analysis.
2.2. Do not scale the data. Encode the data (having string values) for modelling. Data split: split the data into train and test (70:30). Apply logistic regression and LDA (linear discriminant analysis).
2.3. Performance metrics: check the performance of predictions on the train and test sets using accuracy and the confusion matrix; plot the ROC curve and get the ROC_AUC score for each model. Final model: compare both models and write an inference on which model is best/optimized.
2.4. Inference: based on these predictions, what are the insights and recommendations?
Problem 1: Linear Regression
You are hired by a company, Gem Stones Co Ltd, which is a cubic zirconia manufacturer. You are provided with a dataset containing the prices and other attributes of almost 27,000 cubic zirconia stones (an inexpensive diamond alternative with many of the same qualities as a diamond). The company earns different profits on different price slots. You have to help the company predict the price of a stone on the basis of the details given in the dataset, so that it can distinguish between higher-profit and lower-profit stones and so improve its profit share. Also, provide the 5 attributes that are most important.
1.1. Read the data and do exploratory data analysis. Describe the data briefly. (Check the null
values, Data types, shape, EDA, duplicate values). Perform Univariate and Bivariate Analysis.
1.2. Impute null values if present; also check for values which are equal to zero. Do they have any meaning, or do we need to change or drop them? Check for the possibility of combining the sub-levels of ordinal variables and take action accordingly. Explain why you are combining these sub-levels, with appropriate reasoning.
1.3. Encode the data (having string values) for Modelling. Split the data into train and test
(70:30). Apply Linear regression using scikit learn. Perform checks for significant variables
using appropriate method from statsmodel. Create multiple models and check the
performance of Predictions on Train and Test sets using Rsquare, RMSE & Adj Rsquare.
Compare these models and select the best one with appropriate reasoning.
1.4. Inference: Based on these predictions, what are the business insights and recommendations?
Please explain and summarise the various steps performed in this project. There should be
proper business interpretation and actionable insights present.
Data Dictionary:

Variable Name | Description
Carat   | Carat weight of the cubic zirconia.
Cut     | Cut quality of the cubic zirconia, in increasing order of quality: Fair, Good, Very Good, Premium, Ideal.
Colour  | Colour of the cubic zirconia, with D being the best and J the worst.
Clarity | Clarity refers to the absence of inclusions and blemishes. In order from best to worst: IF, VVS1, VVS2, VS1, VS2, SI1, SI2, I1.
Depth   | The height of the cubic zirconia, measured from the culet to the table, divided by its average girdle diameter.
Table   | The width of the cubic zirconia's table expressed as a percentage of its average diameter.
Price   | The price of the cubic zirconia.
X       | Length of the cubic zirconia in mm.
Y       | Width of the cubic zirconia in mm.
Z       | Height of the cubic zirconia in mm.

Table 1.1: Data Dictionary
1.1. Read the data and do exploratory data analysis. Describe the data briefly. (Check the null
values, Data types, shape, EDA, duplicate values). Perform Univariate and Bivariate Analysis.
Importing the dataset: the dataset is imported using the pd.read_csv() function, and the data is stored in a data frame named df (in this project, PM_PROJECT_QUESTION_1.ipynb). The top 5 rows of the dataset are shown below:
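As a sketch of this step, the snippet below reads a CSV with the same columns from an in-memory string; the sample rows are illustrative, and in the project the real file would be read from disk with pd.read_csv("<file name>"):

```python
import io
import pandas as pd

# Hypothetical two-row sample mirroring the cubic zirconia dataset's
# columns; the values here are made up for illustration only.
csv_text = """Unnamed: 0,carat,cut,color,clarity,depth,table,x,y,z,price
1,0.30,Ideal,E,SI1,62.1,58.0,4.27,4.29,2.66,499
2,0.33,Premium,G,IF,60.8,58.0,4.42,4.46,2.70,984
"""
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)   # (2, 11)
print(df.head())  # top rows, as shown in the report
```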
DATASET STRUCTURE:
INFERENCE:
1. The dataset has 26,967 rows and 11 variables.
2. The column indicating the row number (Unnamed: 0) cannot be used for analysis and needs to be deleted.
3. Excluding the row number, the dataset has 3 categorical variables and 7 numerical variables, i.e. 10 variables available for analysis.
4. Price is the dependent variable; the other 9 are independent (predictor) variables.
5. There are 697 null values in the variable 'depth'.
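The two clean-up observations above can be sketched on a toy frame (the values are illustrative, not from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame reproducing the two issues noted above: a row-number
# column (Unnamed: 0) and nulls in `depth` (697 in the full dataset).
df = pd.DataFrame({
    "Unnamed: 0": [1, 2, 3],
    "carat": [0.30, 0.41, 0.52],
    "depth": [61.0, np.nan, 62.5],
})
df = df.drop(columns=["Unnamed: 0"])  # the row number carries no information
null_counts = df.isnull().sum()
print(null_counts)  # depth: 1 here (697 in the full data)
```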
DATA DESCRIPTION
There are several visualization methods and other ways to perform univariate analysis (analysis of the variables of the dataset at an individual level), e.g., histograms, description (summary statistics), the range of variation of a variable, box plots, etc.
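A minimal sketch of the "description and range" part of this, on a synthetic stand-in column (the gamma-distributed sample is an assumption chosen only to mimic a right-skewed variable like carat):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed stand-in for one numeric column of the dataset.
df = pd.DataFrame({"carat": rng.gamma(2.0, 0.3, size=1000)})

print(df["carat"].describe())                           # summary statistics
print("range:", df["carat"].max() - df["carat"].min())  # range of variation
# For the plots shown below, df["carat"].plot.hist() and
# df["carat"].plot.box() produce the histogram and box plot.
```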
Fair 781
Good 2441
Very Good 6030
Premium 6899
Ideal 10816
Table 1.2: Cut
COLOR: 7
J 1443
I 2771
D 3344
H 4102
F 4729
E 4917
G 5661
Table 1.3: Colour
CLARITY: 8
I1 365
IF 894
VVS1 1839
VVS2 2531
VS1 4093
SI2 4575
VS2 6099
SI1 6571
Table 1.4: Clarity
Fig.1.6: Carat
The distribution of carat appears positively skewed; since there are multiple peaks, the distribution could be multimodal. The box plot of carat shows a large number of outliers. The majority of the data lies in the range 0 to 1.
DEPTH:
Fig.1.7: Depth
The distribution of depth appears approximately normal, ranging from about 55 to 65. The box plot of depth shows many outliers.
TABLE:
Fig.1.8: Table
The distribution of table is also positively skewed, and its box plot shows outliers. Most of the data lies between 55 and 65.
X:
Fig.1.9: X
The distribution of x (length of the cubic zirconia in mm) is positively skewed. The box plot shows many outliers, and the distribution ranges from about 4 to 8.
Y:
Fig.1.10: Y
The distribution of y (width of the cubic zirconia in mm) is strongly positively skewed, and its box plot also shows outliers. The skewness may be because the stones are always cut in specific shapes, so there may not be many distinct sizes in the market.
Z:
Fig.1.11: Z
The distribution of z (height of the cubic zirconia in mm) is strongly positively skewed, and its box plot also shows outliers. As with y, the skewness may be because the stones are always cut in specific shapes.
PRICE:
Fig.1.12: Price
The price distribution is positively skewed and contains outliers. Prices range from about Rs 100 to 8,000.
PRICE – HIST:
SKEW:
Unnamed: 0 0.000000
carat 1.116481
depth -0.028618
table 0.765758
x 0.387986
y 3.850189
z 2.568257
price 1.618550
dtype: float64
Table 1.5: Skew
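The skew values above come from pandas' DataFrame.skew(). A small self-contained check of how it behaves on a symmetric versus a right-skewed column (the synthetic data is an assumption for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "symmetric": rng.normal(size=5000),          # skew near 0
    "right_skewed": rng.exponential(size=5000),  # skew well above 0
})
skews = df.skew()
print(skews)
```

In the project's table the same pattern appears: depth is nearly symmetric (-0.03) while y, z, and price are strongly right-skewed.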
BIVARIATE ANALYSIS
CUT:
Quality in increasing order: Fair, Good, Very Good, Premium, Ideal.
COLOR:
D being the best and J the worst.
CLARITY:
(Best to worst; FL = flawless, I3 = level 3 inclusions): FL, IF, VVS1, VVS2, VS1, VS2, SI1, SI2, I1, I2, I3.
Fig.1.18: Clarity Count Plot
The clarity grades SI1 and VS2 are the most common.
More relationship between categorical variables
CUT AND COLOUR
CORRELATION
CARAT VS PRICE DEPTH VS PRICE
X VS PRICE Y VS PRICE
Z VS PRICE
Fig.1.25: Z vs Price
DATA DISTRIBUTION
CORRELATION MATRIX
1.2. Impute null values if present; also check for values which are equal to zero. Do they have any meaning, or do we need to change or drop them? Check for the possibility of combining the sub-levels of ordinal variables and take action accordingly. Explain why you are combining these sub-levels, with appropriate reasoning.
The percentage of null values is less than 5%, so these records could also simply be dropped. After median imputation, there are no null values left in the dataset.
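A minimal sketch of the median imputation applied to `depth` (the five values are illustrative):

```python
import numpy as np
import pandas as pd

# Fill the nulls in `depth` with the column median.
df = pd.DataFrame({"depth": [61.0, np.nan, 62.0, np.nan, 63.0]})
median_depth = df["depth"].median()           # 62.0 for this toy column
df["depth"] = df["depth"].fillna(median_depth)
print(df["depth"].isnull().sum())             # 0 after imputation
```

The median is preferred over the mean here because the skewed, outlier-heavy columns would pull the mean away from the typical value.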
Fig.1.30: Dropped Null Value
Certain rows have values of zero in x, y, or z; since these are the physical dimensions of a stone, such rows cannot be taken into the model. As there are very few such rows, we can drop them, since they have no meaning for model building.
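The zero-dimension filter can be sketched as follows (toy values for illustration):

```python
import pandas as pd

# Rows where any physical dimension (x, y, z) is zero are physically
# impossible, so they are dropped before modelling.
df = pd.DataFrame({
    "x": [4.3, 0.0, 4.4],
    "y": [4.3, 4.1, 0.0],
    "z": [2.7, 2.6, 2.7],
})
df = df[(df[["x", "y", "z"]] != 0).all(axis=1)]
print(len(df))   # only the fully non-zero row survives
```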
SCALING
Scaling can be useful to reduce or check multicollinearity in the data: when scaling is not applied, the VIF (variance inflation factor) values are very high, which indicates the presence of multicollinearity. These values are calculated after building the linear regression model, in order to understand the multicollinearity in it. Scaling had no impact on the model score, the coefficients of the attributes, or the intercept.
Fig.1.31: Scaling
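One way to sketch the VIF computation without any extra libraries uses the identity that, for standardized predictors, VIF_j is the j-th diagonal entry of the inverse correlation matrix (equivalent to 1/(1 - R_j^2)); the synthetic collinear data below is an assumption for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
x = rng.normal(size=500)
df = pd.DataFrame({
    "x": x,
    "y": x + rng.normal(scale=0.1, size=500),  # nearly a copy of x -> collinear
    "z": rng.normal(size=500),                 # independent predictor
})

# Diagonal of the inverse correlation matrix gives each predictor's VIF.
vif = pd.Series(np.diag(np.linalg.inv(df.corr().values)), index=df.columns)
print(vif)   # x and y very large; z close to 1
```

In the project the same quantity is computed with statsmodels' variance_inflation_factor; the pattern matches the report's output, where x, y, and z have extreme VIFs.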
CHECKING THE OUTLIERS IN THE DATA
BEFORE TREATING OUTLIERS
Fig.1.37: Price Box Plot
1.3 Encode the data (having string values) for Modelling. Data Split: Split the
data into test and train (70:30). Apply Linear regression. Performance Metrics:
Check the performance of Predictions on Train and Test sets using Rsquare,
RMSE.
ENCODING THE STRING VALUES
GET DUMMIES
Fig.1.38: Dummies
The categorical variables have been dummy-encoded. A linear regression model does not accept categorical values directly, so we encode the categorical values as integers for better results.
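The dummy encoding step can be sketched with pd.get_dummies (the three rows are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "carat": [0.30, 0.41, 0.52],
    "cut": ["Fair", "Good", "Ideal"],
})
# drop_first=True avoids the dummy-variable trap: keeping all levels
# would make the dummies perfectly collinear with the intercept.
encoded = pd.get_dummies(df, columns=["cut"], drop_first=True)
print(list(encoded.columns))   # ['carat', 'cut_Good', 'cut_Ideal']
```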
DROPPING UNWANTED COLUMNS
Encoding, splitting the data, and applying linear regression are provided in the code file. The intercept and the coefficients associated with the variables are as under.
VIF VALUES
carat ---> 33.35086119845924
depth ---> 4.573918951598579
table ---> 1.7728852812618994
x ---> 463.5542785436457
y ---> 462.769821646584
z ---> 238.65819968687333
cut_Good ---> 3.609618194943713
cut_Ideal ---> 14.34812508118844
cut_Premium ---> 8.623414379121153
cut_Very Good ---> 7.848451571723688
color_E ---> 2.371070464762613
We still find multicollinearity in the dataset. To bring these values down, we can drop columns after fitting a statsmodels model: the statsmodels summary identifies the features that do not contribute to the model, and after removing those features the VIF values are reduced. An ideal VIF value is less than 5.
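The split-fit-score workflow of this section can be sketched end to end; the synthetic regression data below stands in for the encoded features and is an assumption for illustration only:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the encoded cubic zirconia features.
rng = np.random.default_rng(3)
X = rng.normal(size=(500, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1   # the 70:30 split used in the report
)
model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)
r2 = r2_score(y_test, pred)
rmse = np.sqrt(mean_squared_error(y_test, pred))
print("R-square:", round(r2, 3), "RMSE:", round(rmse, 3))
```

The same R-square/RMSE comparison between the train and test sets is what the report uses to judge each model iteration.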
STATSMODELS
BEST PARAMS SUMMARY
After dropping the depth variable
1.4 Inference: Based on these predictions, what are the business insights and recommendations?
We had a business problem to predict the price of the stone and provide insights for the company on the profits at different price slots. From the EDA we could understand that, among the cuts, the ideal cut brought the most profit to the company. The colours H, I, and J have brought profits for the company. In clarity, there were no flawless stones, and no profits came from I1, I2, and I3 stones. The ideal, premium, and very good cut types were bringing profits, whereas fair and good were not.

The predictions were able to capture 95% of the variation in price, as explained by the predictors in the training set.

Using statsmodels we can run the model again to obtain p-values and coefficients, which give a better understanding of the relationships; variables with p-values above 0.05 can be dropped and the model re-run for better results. For better accuracy, the depth column was dropped in a later iteration.
The equation:
price = (-0.76) * Intercept + (1.1) * carat + (-0.01) * table + (-0.32) * x + (0.28) * y + (-0.11) * z + (0.1) * cut_Good + (0.15) * cut_Ideal + (0.15) * cut_Premium + (0.13) * cut_Very_Good + (-0.05) * color_E + (-0.06) * color_F + (-0.1) * color_G + (-0.21) * color_H + (-0.32) * color_I + (-0.47) * color_J + (1.0) * clarity_IF + (0.64) * clarity_SI1 + (0.43) * clarity_SI2 + (0.84) * clarity_VS1 + (0.77) * clarity_VS2 + (0.94) * clarity_VVS1 + (0.93) * clarity_VVS2
RECOMMENDATIONS
1. The ideal, premium, and very good cut types are the ones bringing profits, so we could focus marketing on these to bring in more profit.
2. The clarity of the stone is the next most important attribute: the clearer the stone, the higher the profit.
The best attributes are:
carat,
y (the width of the stone),
clarity_IF,
clarity_SI1,
clarity_SI2,
clarity_VS1,
clarity_VS2,
clarity_VVS1,
clarity_VVS2
PROBLEM 2: LOGISTIC REGRESSION AND LDA
You are hired by a tour and travel agency which deals in selling holiday packages. You are provided details of 872 employees of a company. Among these employees, some opted for the package and some didn't. You have to help the company predict whether an employee will opt for the package or not on the basis of the information given in the dataset. Also, find out the important factors on the basis of which the company should focus on particular employees to sell its packages.
2.1 Data ingestion: read the dataset, produce descriptive statistics, do the null value condition check, and write an inference on it. Perform univariate and bivariate analysis. Do exploratory data analysis.
2.2 Do not scale the data. Encode the data (having string values) for modelling. Data split: split the data into train and test (70:30). Apply logistic regression and LDA (linear discriminant analysis).
2.3 Performance metrics: check the performance of predictions on the train and test sets using accuracy and the confusion matrix; plot the ROC curve and get the ROC_AUC score for each model. Final model: compare both models and write an inference on which model is best/optimized.
2.4 Inference: based on these predictions, what are the insights and recommendations?
DATA DICTIONARY:
2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do null value
condition check, write an inference on it? Perform Univariate and Bivariate Analysis. Do
exploratory data analysis.
Loading all the necessary libraries for model building.
Now reading the head and tail of the dataset to check whether the data has been properly loaded. The dataset is imported using the pd.read_csv() function, and the data is stored in a data frame named df. The top 5 rows of the dataset are shown below:
HEAD OF THE DATA
INFO
Fig.2.2: INFO
DATA DESCRIBE
Salary, age, educ, the number of young children, the number of older children, and whether the employee has been foreign: these are the attributes we have to cross-examine to help the company predict whether a person will opt for the holiday package or not.
• Unique values in the categorical data
HOLLIDAY_PACKAGE: 2
• Yes 401
• No 471
• Name: Holliday Package, dtype: int64
FOREIGN: 2
• Yes 216
• No 656
• Name: foreign, dtype: int64
Percentage of target:
• no 0.540138
• yes 0.459862
• Name: Holliday_Package, dtype: float64
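The class-balance check above comes from value_counts with normalize=True; a minimal sketch on a toy target (the five labels are illustrative):

```python
import pandas as pd

# Share of each class in the target, as proportions.
target = pd.Series(["no", "yes", "no", "no", "yes"], name="Holliday_Package")
share = target.value_counts(normalize=True)
print(share)   # no: 0.6, yes: 0.4
```

With roughly 54% "no" versus 46% "yes", the report's target is close to balanced, so accuracy remains a reasonable headline metric.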
CATEGORICAL UNIVARIATE ANALYSIS
FOREIGN
HOLIDAY PACKAGE VS SALARY
HOLIDAY PACKAGE VS EDUC
HOLIDAY PACKAGE VS OLDER CHILDREN
Fig.2.13: Age vs Salary vs Holiday Package LMP Plot
Employees aged 50 to 60 seem not to take the holiday package, whereas employees aged 30 to 50 with a salary of less than 50,000 have opted for the holiday package more often.
Fig.2.15: EDUC vs Salary vs Holiday Package LMP Plot
YOUNG CHILDREN VS AGE VS HOLIDAY PACKAGE
Fig.2.17: Young Children vs Age vs Holiday Package LMP Plot
OLDER CHILDREN VS AGE VS HOLIDAY_PACKAGE
BIVARIATE ANALYSIS
DATA DISTRIBUTION
Fig.2.20: Heat map
There is no multicollinearity in the data.
TREATING OUTLIERS
BEFORE OUTLIER TREATMENT
We have outliers in the dataset; as LDA works on numerical computation, treating the outliers will help the model perform better.
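A common way to treat outliers, consistent with capping rather than dropping them, is IQR-based clipping; the toy series below is an assumption for illustration:

```python
import pandas as pd

# IQR-based capping: values beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR are
# clipped to those fences instead of being dropped.
s = pd.Series([10, 12, 11, 13, 200])      # 200 is an obvious outlier
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
capped = s.clip(lower, upper)
print(capped.max())   # 16.0, the upper fence
```

Capping preserves the row count, which matters here because only 872 records are available.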
AFTER OUTLIER TREATMENT
2.2 Do not scale the data. Encode the data (having string values) for
Modelling. Data Split: Split the data into train and test (70:30). Apply Logistic
Regression and LDA (linear discriminant analysis).
ENCODING CATEGORICAL VARIABLE
The encoding helps the logistic regression model predict better results.
Fig.2.23: Split
Fig.2.24: Grid
The grid search method selects the liblinear solver, which is suitable for small datasets. The tolerance and penalty were also found using the grid search method.
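The grid search over solver, penalty, and tolerance can be sketched as follows; the synthetic classification data and the exact parameter values are assumptions standing in for the project's encoded features:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary data standing in for the holiday-package features.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

param_grid = {
    "solver": ["liblinear"],      # well suited to small datasets
    "penalty": ["l1", "l2"],      # liblinear supports both
    "tol": [1e-4, 1e-3],
}
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
grid.fit(X, y)
print(grid.best_params_)
```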
Fig.2.25: Test Probability
CONFUSION MATRIX FOR TEST DATA
ACCURACY
Fig.2.27: Accuracy
AUC, ROC CURVE FOR TRAIN DATA
LDA
Fig.2.30: LDA
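The LDA fit and its train/test accuracies can be sketched like this (synthetic data standing in for the encoded employee features):

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=6, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1   # the same 70:30 split
)
lda = LinearDiscriminantAnalysis().fit(X_train, y_train)
print("train accuracy:", lda.score(X_train, y_train))
print("test accuracy:", lda.score(X_test, y_test))
```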
PREDICTING THE VARIABLE
MODEL SCORE
Fig.2.33: Classification
MODEL SCORE
CLASSIFICATION REPORT TEST DATA
CHANGING THE CUT-OFF VALUE TO FIND THE OPTIMAL VALUE THAT GIVES BETTER ACCURACY AND F1 SCORE
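The cut-off sweep behind the F1 figures below can be sketched as follows; the synthetic data is an assumption, and in the project the probabilities come from the fitted logistic regression:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

X, y = make_classification(n_samples=400, n_features=5, random_state=2)
model = LogisticRegression(max_iter=1000).fit(X, y)
proba = model.predict_proba(X)[:, 1]   # probability of the positive class

# Re-threshold the probabilities at each candidate cut-off and score.
scores = {}
for cutoff in (0.2, 0.5, 0.8):
    pred = (proba >= cutoff).astype(int)
    scores[cutoff] = f1_score(y, pred)
print(scores)
```

The cut-off with the best F1 (and acceptable accuracy) is then kept as the decision threshold.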
Fig.2.37: F1 Score 0.2
Fig.2.40: F1 Score 0.5
Fig.2.43: F1 Score 0.8
Comparing both these models, we find the results are the same, but LDA works better when the target variable is categorical.
2.4. Inference: Based on these predictions, what are the insights and recommendations.
Please explain and summarise the various steps performed in this project.
There should be proper business interpretation and actionable insights
present.
We had a business problem where we needed to predict whether an employee would opt for a holiday package or not; for this problem we made predictions with both logistic regression and linear discriminant analysis, and both gave the same results.
The EDA clearly indicates that people aged above 50 are not much interested in holiday packages; this is one segment where we find older people not opting for the packages. People in the age range 30 to 50 generally opt for holiday packages. Employees aged 50 to 60 seem not to take the holiday package, whereas employees aged 30 to 50 with a salary of less than 50,000 have opted for it more often.
The important factors deciding the predictions are salary, age and educ.
Recommendations
1. To improve holiday-package uptake among people above the age of 50, we can offer packages to religious destinations.
2. For people earning more than 150,000, we can offer premium vacation holiday packages.
3. For employees with a larger number of older children, we can offer packages to family-friendly holiday destinations.