
PREDICTIVE MODELLING PROJECT

PROBLEM 1
Introduction:
You are hired by Gem Stones Co. Ltd., a manufacturer of cubic zirconia (an inexpensive
diamond alternative with many of the same qualities as a diamond). You are provided with a
dataset containing the prices and other attributes of almost 27,000 cubic zirconia stones. The
company earns different profits in different price slots. You have to help the company predict
the price of a stone on the basis of the details given in the dataset, so that it can distinguish
between higher-profit and lower-profit stones and improve its profit share. Also, provide the
five attributes that are most important.

Data Dictionary:
Variable Name   Description
Carat           Carat weight of the cubic zirconia.
Cut             Quality of the cut of the cubic zirconia, in increasing order: Fair, Good, Very Good, Premium, Ideal.
Color           Colour of the cubic zirconia, with D being the best and J the worst.
Clarity         Absence of inclusions and blemishes, in order from best to worst (FL = flawless, I3 = level 3 inclusions): FL, IF, VVS1, VVS2, VS1, VS2, SI1, SI2, I1, I2, I3.
Depth           Height of the cubic zirconia, measured from the culet to the table, divided by its average girdle diameter.
Table           Width of the cubic zirconia's table expressed as a percentage of its average diameter.
Price           Price of the cubic zirconia.
X               Length of the cubic zirconia in mm.
Y               Width of the cubic zirconia in mm.
Z               Height of the cubic zirconia in mm.
1.1. Read the data and do exploratory data analysis. Describe the
data briefly. (Check the null values, Data types, shape, EDA).
Perform Univariate and Bivariate Analysis.
First, we import the necessary libraries and then load the CSV file into the jupyter notebook.
After that we use the head function to view the first 5 rows of the dataset.

The dataset has 26967 rows and 11 columns. The first column, named 'Unnamed: 0', is just a
serial number and is not useful for evaluating the dataset, so it is removed.
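A minimal sketch of these first steps, assuming the file is named cubic_zirconia.csv (the actual filename is in the jupyter notebook):

import pandas as pd

# Load the dataset; the filename is an assumption.
df = pd.read_csv('cubic_zirconia.csv')

# Drop the serial-number column, which carries no predictive information.
df = df.drop(columns=['Unnamed: 0'])

print(df.shape)  # (26967, 10) after dropping the index column
df.head()        # first 5 rows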

Checking for missing values:

RangeIndex: 26967 entries, 0 to 26966


Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 carat 26967 non-null float64
1 cut 26967 non-null object
2 color 26967 non-null object
3 clarity 26967 non-null object
4 depth 26270 non-null float64
5 table 26967 non-null float64
6 x 26967 non-null float64
7 y 26967 non-null float64
8 z 26967 non-null float64
9 price 26967 non-null int64
dtypes: float64(6), int64(1), object(3)

As we can see, only 'depth' has missing values. There are 6 float, 1 int and 3 object data
types.

Descriptive statistics help describe and understand the features of a specific data set by giving short
summaries about the sample and measures of the data. Kindly refer to the jupyter notebook to see this
table.
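As a sketch, the descriptive-statistics table can be regenerated with:

# Summary statistics; include='all' also covers the object columns
# (cut, color, clarity).
df.describe(include='all').T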

Next we list the unique values and frequencies of each object (categorical) column; a sketch of this check follows the counts below.

CUT: 5
Fair 781
Good 2441
Very Good 6030
Premium 6899
Ideal 10816
Name: cut

COLOR: 7
J 1443
I 2771
D 3344
H 4102
F 4729
E 4917
G 5661
Name: color

CLARITY: 8
I1 365
IF 894
VVS1 1839
VVS2 2531
VS1 4093
SI2 4575
VS2 6099
SI1 6571
Name: clarity
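A sketch of how these counts can be produced:

# Number of unique levels and their frequencies for each object column.
for col in df.select_dtypes(include='object').columns:
    print(col.upper(), ':', df[col].nunique())
    print(df[col].value_counts(ascending=True), '\n')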
Now we convert these object columns into integer codes by assigning a number to each unique
value. This lets the model work with the data, since it can only process numbers. The codes
assigned are:

Cut
['Ideal', 'Premium', 'Very Good', 'Good', 'Fair']
Categories (5, object): ['Fair', 'Good', 'Ideal', 'Premium', 'Very Good']
[2 3 4 1 0]

Color
['E', 'G', 'F', 'D', 'H', 'J', 'I']
Categories (7, object): ['D', 'E', 'F', 'G', 'H', 'I', 'J']
[1 3 2 0 4 6 5]

Clarity
['SI1', 'IF', 'VVS2', 'VS1', 'VVS1', 'VS2', 'SI2', 'I1']
Categories (8, object): ['I1', 'IF', 'SI1', 'SI2', 'VS1', 'VS2', 'VVS1', 'VVS2']
[2 1 7 4 6 5 3 0]
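A minimal sketch of this encoding; pandas assigns the integer codes in alphabetical category order, which matches the mappings shown above:

# Convert each object column to integer codes.
for col in ['cut', 'color', 'clarity']:
    df[col] = df[col].astype('category')
    print(col, df[col].cat.categories.tolist())
    df[col] = df[col].cat.codes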

We use the head function again to confirm the changes.

After that we check for duplicate rows; there are 34, and we remove them from the dataset.
Looking for outliers, we see that only price has outliers; no other column does. Hence we treat
the outliers in price (a sketch of the treatment follows).
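A sketch of the duplicate removal and of a common IQR-based capping treatment (the exact treatment used is in the jupyter notebook):

# Remove the 34 duplicate rows.
df = df.drop_duplicates()

def cap_outliers(series):
    # Cap values outside the 1.5 * IQR whiskers of the boxplot.
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return series.clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

df['price'] = cap_outliers(df['price'])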

The post-treatment boxplot (in the jupyter notebook) confirms that the outliers have been capped.

As noted above, 'depth' had 697 missing values, which we need to fill. Since 'depth' is a
numeric column, we fill the missing values with the median, which is robust to outliers (as
done in the jupyter notebook).
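A one-line sketch of the imputation:

# Fill the 697 missing 'depth' values with the column median.
df['depth'] = df['depth'].fillna(df['depth'].median())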

As we can see from the heatmap (in the jupyter notebook), only carat and the dimensions x, y, z
(length, width and height of the cubic zirconia in mm) show strong correlation, with values close to 1.

PAIRPLOT:
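Both the heatmap and the pairplot shown in the jupyter notebook can be reproduced with a short sketch:

import matplotlib.pyplot as plt
import seaborn as sns

# Correlation heatmap of the numeric attributes.
plt.figure(figsize=(10, 7))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.show()

# Pairwise scatter plots for bivariate analysis.
sns.pairplot(df)
plt.show()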
1.2. Impute null values if present, also check for the values which
are equal to zero. Do they have any meaning or do we need to
change them or drop them? Do you think scaling is necessary
in this case?

All null values were imputed during the EDA process. We checked for values equal to zero and
found none, so there is nothing to change or drop. In any case, a stone becomes a gem because
of the attributes above: if any dimension were zero, it would be no different from an ordinary
stone and would not be a profitable gem at all.
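A sketch of the zero-value check:

# Count zero entries per column; a zero dimension would make a stone
# physically meaningless.
print((df == 0).sum())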

Scaling is not necessary in this case: for linear regression the best parameter values can be
found with a closed-form solution, the normal equation, so there is no stepwise (gradient-based)
optimization whose convergence would depend on the scales of the features.
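The closed-form normal equation referred to above, in matrix form:

$$\hat{\beta} = (X^{\top} X)^{-1} X^{\top} y$$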

1.3. Encode the data (having string values) for Modelling. Data Split:
Split the data into train and test (70:30). Apply Linear
regression. Performance Metrics: Check the performance of
Predictions on Train and Test sets using Rsquare, RMSE.

The data is split in a 70:30 ratio and the linear regression model is built in the jupyter notebook.
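A minimal sketch of the split and fit (the random_state seed is an assumption; the notebook's may differ):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# 70:30 split of predictors and target.
X = df.drop(columns=['price'])
y = df['price']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

model = LinearRegression().fit(X_train, y_train)

# R-square and RMSE on both sets.
for name, Xs, ys in [('train', X_train, y_train), ('test', X_test, y_test)]:
    pred = model.predict(Xs)
    print(name, 'R-square =', r2_score(ys, pred),
          'RMSE =', np.sqrt(mean_squared_error(ys, pred)))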

Let us explore the coefficients of the independent attributes.

The coefficient for carat is 11335.082013391022


The coefficient for cut is 61.12961932497335
The coefficient for color is -282.31521549830694
The coefficient for clarity is 290.8017943110328
The coefficient for depth is -153.66719939263987
The coefficient for table is -93.02072221840359
The coefficient for x is -1257.9598960298047
The coefficient for y is 4.383265574378822
The coefficient for z is -30.489719013863507

The intercept of the model is 16481.5948344

Rsquare for testing data= 0.8891394521159985


Rsquare for training data= 0.8866474574664729

About 88% of the variation in price is explained by the predictors in the model for the train set.

RMSE for test data=1349.36971081


RMSE for train data=1349.89893091

Given below is the summary of the linear regression model:


OLS Regression Results
==============================================================================
Dep. Variable: price R-squared: 1.000
Model: OLS Adj. R-squared: 1.000
Method: Least Squares F-statistic: 1.682e+33
Date: Sat, 26 Jun 2021 Prob (F-statistic): 0.00
Time: 00:46:31 Log-Likelihood: 4.6693e+05
No. Observations: 18853 AIC: -9.338e+05
Df Residuals: 18842 BIC: -9.337e+05
Df Model: 10
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept -8.185e-12 2.19e-12 -3.741 0.000 -1.25e-11 -3.9e-12
carat -3.524e-12 4.12e-13 -8.559 0.000 -4.33e-12 -2.72e-12
cut 1.75e-13 3.1e-14 5.642 0.000 1.14e-13 2.36e-13
color 3.837e-13 2.01e-14 19.081 0.000 3.44e-13 4.23e-13
clarity 9.326e-14 1.96e-14 4.747 0.000 5.48e-14 1.32e-13
depth 8.171e-14 2.61e-14 3.136 0.002 3.06e-14 1.33e-13
table -1.614e-13 1.5e-14 -10.735 0.000 -1.91e-13 -1.32e-13
x 1.421e-12 1.75e-13 8.101 0.000 1.08e-12 1.76e-12
y 1.037e-12 8.32e-14 12.472 0.000 8.74e-13 1.2e-12
z 1.421e-13 1.39e-13 1.026 0.305 -1.29e-13 4.14e-13
price 1.0000 2.29e-17 4.37e+16 0.000 1.000 1.000
==============================================================================
Omnibus: 5105.729 Durbin-Watson: 1.064
Prob(Omnibus): 0.000 Jarque-Bera (JB): 41223.642
Skew: -1.077 Prob(JB): 0.00
Kurtosis: 9.917 Cond. No. 3.99e+05
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly
specified.
[2] The condition number is large, 3.99e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

The overall P value is less than alpha, so we reject H0 and accept Ha: at least one regression
coefficient is not 0. Here all regression coefficients are non-zero. Note, however, that 'price'
itself appears among the regressors in this summary, with a coefficient of exactly 1.0000 and
R-squared of 1.000; the target leaked into the predictor set for this particular output, so the
sklearn metrics reported above are the ones to rely on.
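A sketch of how such a summary is generated with statsmodels' formula API, keeping price on the left-hand side only:

import statsmodels.formula.api as smf

# Rebuild the training frame (X_train, y_train from the split above);
# the target must not appear among the regressors.
train = X_train.copy()
train['price'] = y_train
formula = 'price ~ ' + ' + '.join(X_train.columns)
print(smf.ols(formula, data=train).fit().summary())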

1.4. Inference: Based on these predictions, what are the business
insights and recommendations?

From the EDA we could see that the ideal cut brought the most profit to the company, and the
colors H, I and J also brought profits. There was less profit coming from clarity I1, I2 and I3
stones. The ideal, premium and very good cuts bring good profit, whereas the fair and good cuts
do not.

FINAL LINEAR EQUATION:

Price = (-1846.12) * intercept
      + (9126.94) * carat + (-15.01) * depth + (-18.59) * table
      + (-1190.28) * x + (837.36) * y + (-163.64) * z
      + (481.81) * cut_Good + (714.65) * cut_Ideal + (674.77) * cut_Premium + (606.9) * cut_Very_Good
      + (-181.91) * color_E + (-256.81) * color_F + (-429.38) * color_G + (-855.99) * color_H + (-1323.93) * color_I + (-1928.05) * color_J
      + (4004.01) * clarity_IF + (2519.92) * clarity_SI1 + (1684.46) * clarity_SI2 + (3342.57) * clarity_VS1 + (3039.93) * clarity_VS2 + (3772.3) * clarity_VVS1 + (3757.78) * clarity_VVS2
PROBLEM 2
Introduction:
You are hired by a tour and travel agency which deals in selling holiday packages. You
are provided details of 872 employees of a company. Among these employees, some
opted for the package and some didn't. You have to help the company in predicting
whether an employee will opt for the package or not on the basis of the information
given in the data set. Also, find out the important factors on the basis of which the
company will focus on particular employees to sell their packages.

Data Dictionary:
Variable Name       Description
Holliday_Package    Opted for holiday package: yes/no
Salary              Employee salary
age                 Age in years
educ                Years of formal education
no_young_children   Number of young children (younger than 7 years)
no_older_children   Number of older children
foreign             Foreigner: yes/no

2.1. Data Ingestion: Read the dataset. Do the descriptive statistics
and the null value condition check, write an inference on it. Perform
Univariate and Bivariate Analysis. Do exploratory data analysis.

First, we import the necessary libraries and then load the CSV file into the jupyter notebook.
After that we use the head function to view the first 5 rows of the dataset.

The dataset has 872 rows and 8 columns. The first column, named 'Unnamed: 0', is not useful
for evaluating the dataset, so it is removed.

Checking for missing values:

RangeIndex: 872 entries, 0 to 871


Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Holliday_Package 872 non-null object
1 Salary 872 non-null int64
2 age 872 non-null int64
3 educ 872 non-null int64
4 no_young_children 872 non-null int64
5 no_older_children 872 non-null int64
6 foreign 872 non-null object
dtypes: int64(5), object(2)

As we can see from the table above, there are no missing values. There are 5 int64 columns
and 2 object columns.

Descriptive statistics help describe and understand the features of a specific data set by giving short
summaries about the sample and measures of the data. Kindly refer to jupyter notebook to see this
table.

After that we check for duplicate rows and find none.

CHECK FOR OUTLIERS:

As we can see from the boxplot (in the jupyter notebook), only Salary has outliers, so we treat it first with the same IQR capping used in Problem 1.
CORRELATION PLOT:

As we can see from the heatmap (in the jupyter notebook), there is no strong correlation among the variables.

PAIRPLOT:
2.2. & 2.3. Do not scale the data. Encode the data (having string
values) for Modelling. Data Split: Split the data into train and test
(70:30). Apply Logistic Regression and LDA (linear discriminant
analysis).

Performance Metrics: Check the performance of Predictions on Train
and Test sets using Accuracy and Confusion Matrix. Plot the ROC curve and
get the ROC_AUC score for each model. Final Model: Compare both
models and write an inference on which model is best/optimized.

We convert these object columns into int codes by assigning a number to each unique value.
This lets the model work with the data, since it can only process numbers. The values assigned are:

Holliday_Package
['no', 'yes']
Categories (2, object): ['no', 'yes']
[0 1]
foreign
['no', 'yes']
Categories (2, object): ['no', 'yes']
[0 1]

The data is split in a 70:30 ratio, and the logistic regression and linear discriminant analysis
models are built in the jupyter notebook.
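A minimal sketch of both models (the random_state seed is an assumption):

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             classification_report, roc_auc_score)

# 70:30 split on the encoded data.
X = df.drop(columns=['Holliday_Package'])
y = df['Holliday_Package']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

models = [('Logistic Regression', LogisticRegression(max_iter=1000)),
          ('LDA', LinearDiscriminantAnalysis())]
for name, clf in models:
    clf.fit(X_train, y_train)
    for split, Xs, ys in [('train', X_train, y_train),
                          ('test', X_test, y_test)]:
        pred = clf.predict(Xs)
        proba = clf.predict_proba(Xs)[:, 1]
        print(name, split,
              'accuracy =', round(accuracy_score(ys, pred), 3),
              'AUC =', round(roc_auc_score(ys, proba), 3))
        print(confusion_matrix(ys, pred))
        print(classification_report(ys, pred))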

LOGISTIC REGRESSION MODEL


Accuracy of train dataset= 0.675409

AUC of train data=0.742

Accuracy of test data = 0.637404

AUC of test data = 0.704


CONFUSION MATRIX FOR TRAINING DATA

CLASSIFICATION REPORT FOR TRAIN DATA


precision recall f1-score support

0 0.67 0.77 0.72 326


1 0.68 0.56 0.62 284

accuracy 0.68 610


macro avg 0.68 0.67 0.67 610
weighted avg 0.68 0.68 0.67 610

CONFUSION MATRIX FOR TEST DATA


CLASSIFICATION REPORT FOR TEST DATA
precision recall f1-score support

0 0.66 0.70 0.68 145


1 0.60 0.56 0.58 117

accuracy 0.64 262


macro avg 0.63 0.63 0.63 262
weighted avg 0.64 0.64 0.64 262

LINEAR DISCRIMINANT ANALYSIS


Accuracy, AUC and ROC curve of train data

Accuracy of train dataset= 0.68

AUC of train data=0.739

Accuracy, AUC and ROC curve of test data


Accuracy of test data =0.61

AUC of test data = 0.703

CONFUSION MATRIX FOR TRAIN AND TEST DATA


CLASSIFICATION REPORT OF TRAIN DATA

precision recall f1-score support

0 0.67 0.78 0.72 326


1 0.69 0.56 0.61 284

accuracy 0.68 610


macro avg 0.68 0.67 0.67 610
weighted avg 0.68 0.68 0.67 610

CLASSIFICATION REPORT OF TEST DATA

precision recall f1-score support

0 0.67 0.78 0.72 326


1 0.69 0.56 0.61 284

accuracy 0.68 610


macro avg 0.68 0.67 0.67 610
weighted avg 0.68 0.68 0.67 610

Based on the classification reports, the linear discriminant analysis model is slightly better
optimized than the logistic regression model. (The test-data report above repeats the train-data
figures, support 610; the actual test-set report, with 262 observations, is in the jupyter notebook.)

2.4. Inference: Based on these predictions, what are the insights and
recommendations?

Based on the EDA, we understand that salary plays an important role in determining whether a
person takes the holiday package or not: people with higher salaries generally tend to take it.
People with young children opt for the package less, as their priority shifts from holidays to
saving money for their children's future. Employees with only older children take up the package
more readily, as they worry less about this. Most foreigners tend to take the holiday package.

Based on the analysis and the predictive models created, the linear discriminant analysis model
is slightly better optimized than the logistic regression model.
