
Table of Contents

Problem 1: Linear Regression

1.1 Read the data and do exploratory data analysis. Describe the data briefly (check the null values, data types, shape, EDA). Perform univariate and bivariate analysis.
1.2 Impute null values if present; also check for values which are equal to zero. Do they have any meaning, or do we need to change or drop them? Do you think scaling is necessary in this case?
1.3 Encode the data (having string values) for modelling. Data split: split the data into train and test (70:30). Apply linear regression. Performance metrics: check the performance of predictions on train and test sets using R-square and RMSE.
1.4 Inference: based on these predictions, what are the business insights and recommendations?

Problem 2: Logistic Regression and LDA

2.1 Data ingestion: read the dataset. Do the descriptive statistics and the null value condition check, and write an inference on it. Perform univariate and bivariate analysis. Do exploratory data analysis.
2.2 Do not scale the data. Encode the data (having string values) for modelling. Data split: split the data into train and test (70:30). Apply logistic regression and LDA (linear discriminant analysis).
2.3 Performance metrics: check the performance of predictions on train and test sets using accuracy and confusion matrix; plot the ROC curve and get the ROC_AUC score for each model. Final model: compare both models and write an inference on which is best/optimized.
2.4 Inference: based on these predictions, what are the insights and recommendations?

Problem 1: Linear Regression

You are hired by Gem Stones Co Ltd, a cubic zirconia manufacturer. You are
provided with a dataset containing the prices and other attributes of almost
27,000 cubic zirconia stones (an inexpensive diamond alternative with many of
the same qualities as a diamond). The company earns different profits on
different price slots. You have to help the company predict the price of a
stone on the basis of the details given in the dataset, so that it can
distinguish between higher-profit and lower-profit stones and achieve a
better profit share. Also, provide them with the five most important
attributes.

Data Dictionary:

Variable Name   Description

Carat           Carat weight of the cubic zirconia.
Cut             Describes the cut quality of the cubic zirconia. Quality in
                increasing order: Fair, Good, Very Good, Premium, Ideal.
Color           Colour of the cubic zirconia, with D being the worst and J
                the best.
Clarity         Clarity refers to the absence of inclusions and blemishes.
                In order from best to worst (IF = internally flawless,
                I1 = level 1 inclusions): IF, VVS1, VVS2, VS1, VS2, SI1,
                SI2, I1.
Depth           The height of the cubic zirconia, measured from the culet to
                the table, divided by its average girdle diameter.
Table           The width of the cubic zirconia's table expressed as a
                percentage of its average diameter.
Price           The price of the cubic zirconia.
X               Length of the cubic zirconia in mm.
Y               Width of the cubic zirconia in mm.
Z               Height of the cubic zirconia in mm.

1.1. Read the data and do exploratory data analysis.
Describe the data briefly. (Check the null values, Data
types, shape, EDA). Perform Univariate and Bivariate
Analysis.

Solution:
Loading all the necessary libraries required for model building, then loading
the complete dataset.
Now, reading the head and tail of the data to check whether the data has been
properly fed.
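A minimal sketch of this loading step. The real analysis reads the full ~27,000-row CSV (the actual file name is not given in the report); here a tiny inline sample with the columns from the data dictionary stands in for it:

```python
import io
import pandas as pd

# Tiny inline stand-in for the ~27,000-row cubic zirconia file; in practice
# this would be pd.read_csv("<path to the dataset>").
csv_data = io.StringIO(
    "Unnamed: 0,carat,cut,color,clarity,depth,table,x,y,z,price\n"
    "1,0.30,Ideal,E,SI1,62.1,58.0,4.27,4.29,2.66,499\n"
    "2,0.33,Premium,G,IF,60.8,58.0,4.42,4.46,2.70,984\n"
    "3,0.90,Very Good,E,VVS2,62.2,60.0,6.04,6.12,3.78,6289\n"
)
df = pd.read_csv(csv_data)

print(df.head(2))   # first rows, to confirm the data loaded correctly
print(df.tail(2))   # last rows
print(df.shape)     # (rows, columns) -- the real file gives (26967, 11)
df.info()           # dtypes: float, int and object, as noted below
```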

Head:

Tail:

Data has been properly loaded.

▪ Checking the shape of the data-set.

There are a total of 26,967 rows and 11 columns.

▪ Checking the Info of the data-set.

Data types present are float, int and object.

▪ Checking the description of the data-set.

We have both categorical and continuous data: cut, color and clarity are
categorical in nature, while carat, depth, table, x, y, z and price are
continuous.
Price will be the target variable.
From this, we can say that the 'Unnamed: 0' column is not useful for us.

▪ Checking duplicate values.

There are no duplicate values present.

▪ Checking whether null values are present.

The depth column has 697 null values.


▪ Checking unique values present in it for categorical data.

As discussed earlier, the 'Unnamed: 0' column doesn't seem to be of use
as of now, so we drop it.

Univariate / Bivariate Analysis


Now we have a total of 7 continuous columns and 3 categorical columns.
Starting with the continuous data:

The distribution of carat seems to be positively skewed, with multiple
peaks. From the box plot, a large number of outliers are present.
Most of the data lies between 0 and 1.

The distribution of depth seems to be normal. From the box plot it is clear
that it holds a large number of outliers.

The distribution of table also seems to be positively skewed.
The data ranges between 55 and 65.
The box plot also shows the presence of outliers.

The distribution of x (length of the cubic zirconia in mm) seems to be
positively skewed. Here also we can see the presence of outliers.

The distribution of y (width of the cubic zirconia in mm) seems to be
positively skewed. The box plot also shows outliers.

The distribution of z (height of the cubic zirconia in mm) seems to be
positively skewed. The box plot also shows outliers.

The distribution of price also seems positively skewed, and the box plot
shows outliers.

▪ Checking skewness.

Bivariate Analysis (categorical data)


CUT:

From the above graph, we can say that quality is in increasing order from
Fair to Ideal. The most preferred cut seems to be the Ideal cut.
From the second graph, i.e. cut vs price, we can say that the Ideal cut is
the most preferred because of its price: the Ideal-cut price is lower than
that of the other cuts.

COLOR:

We have seven colors in the data, out of which G is the most preferred
colour and J the least.
From the second graph, i.e. color vs price, we can say that G is the most
and J the least preferred color because of price: G sits in the middle
price range, whereas J has a high price range.
CLARITY:

Clarity runs from best to worst (FL = flawless, I3 = level 3 inclusions).

The sequence is: FL, IF, VVS1, VVS2, VS1, VS2, SI1, SI2, I1, I2, I3.
The most preferred clarities seem to be SI1 and VS2. The data doesn't
have any FL diamonds, hence we can say they are not bringing any
profit.
The SI2 clarity has the highest price.

▪ Checking for some more relationships.
Colour and Cut:

Among the different cuts, Ideal is the most preferred, and within the Ideal
cut, color G is mostly preferred, followed by E and F.
The same sequence of colors holds for all the other cuts.
CUT and CLARITY:

Among the different cuts, Ideal is the most preferred, and within the Ideal
cut, VVS2 clarity is mostly preferred, followed by SI1 and VS1.
Almost the same pattern of clarity holds for all the other cuts.

▪ Correlations:

Carat VS Price: Depth VS Price:

X VS Price: Y VS Price:

▪ Checking correlations using heat map, table, pair plot:

From the graphs and table below, we can say that there is multicollinearity
in the data set.

There is a strong correlation between:
carat and x, y, z, price;
x and y, z, price;
y and z, price;
z and price.
All five columns are strongly correlated with one another.
1.2 Impute null values if present, also check for the values
which are equal to zero. Do they have any meaning or do we
need to change them or drop them? Do you think scaling is
necessary in this case?
Solution:

We have null values present: the depth column has 697 nulls. It is a
numeric (continuous) variable, so we can impute it using the mean or the
median. As we have also seen, there are outliers present, so the median is
the better option, since the mean gets affected by outliers. Below are the
median values.
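A minimal sketch of the median imputation, on a hypothetical 'depth' column (the values here are illustrative, with 100.0 playing the role of an outlier):

```python
import numpy as np
import pandas as pd

# Hypothetical 'depth' column mirroring the 697 nulls noted above.
df = pd.DataFrame({"depth": [61.0, np.nan, 62.5, np.nan, 100.0]})

# The median is robust to outliers, so impute with it rather than the mean.
median_depth = df["depth"].median()
df["depth"] = df["depth"].fillna(median_depth)
print(median_depth)                 # 62.5 for this toy column
print(df["depth"].isnull().sum())   # 0 -- no nulls remain
```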

▪ Checking if there is ‘0’ as value present:

We have some rows with '0' values in x, y and z, which are the dimensions
of the diamond. A zero dimension has no meaning, and there are only a very
small number of such rows, so rather than take them into the model we can
simply drop them.
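A sketch of this filtering step, on toy rows mimicking the zero-dimension records:

```python
import pandas as pd

# Toy rows; x, y, z are the stone's dimensions, so 0 is physically meaningless.
df = pd.DataFrame({
    "x": [4.3, 0.0, 6.0],
    "y": [4.3, 4.1, 0.0],
    "z": [2.7, 2.5, 3.7],
    "price": [500, 600, 5000],
})

# Keep only rows where every dimension is non-zero.
df = df[(df["x"] != 0) & (df["y"] != 0) & (df["z"] != 0)]
print(len(df))  # 1 row survives in this toy example
```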
After dropping:

Scaling:
Scaling is needed because the variables have very different ranges. Scaling
brings all values into the same range, which is more convenient for further
analysis and leaves the data in a cleaner, more consistent form.
Scaling, or standardizing the features around the centre to mean 0 with a
standard deviation of 1, is important when we compare measurements that have
different units. Variables measured on different scales do not contribute
equally to the analysis and might end up creating a bias.
Scaling is also useful when checking multicollinearity in the data: without
it, we find very high VIF (variance inflation factor) values.

These indicate the presence of multicollinearity; the values are calculated
after building the linear regression model, to understand the
multicollinearity in it.
Scaling has no impact on the model score, though it changes the coefficients
of the attributes and the intercept. Strictly speaking there is little need
to scale the data, but I am still carrying forward with it.
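A sketch of the standardization step, assuming scikit-learn's `StandardScaler` was the tool used (the two toy columns imitate features on very different scales, e.g. carat vs price):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two columns on very different scales, e.g. carat (~0-2) and price (~500-15000).
X = np.array([[0.3,   500.0],
              [0.9,  6000.0],
              [1.5, 15000.0]])

scaler = StandardScaler()            # centre to mean 0, scale to std 1
X_scaled = scaler.fit_transform(X)

# After scaling, every column has mean ~0 and unit standard deviation.
print(X_scaled.mean(axis=0))         # ~[0, 0]
print(X_scaled.std(axis=0))          # ~[1, 1]
```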

More or less, the data looks similar before and after scaling.

Treating Outliers:
As we have seen in the above box plots, outliers are present in all the
continuous variables, so before moving further, treating those outliers is
very important.
We treat the outliers with the lower-limit and upper-limit method:
lower_range = Q1 - (1.5 * IQR)
upper_range = Q3 + (1.5 * IQR)
After treating the outliers:
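The limits above can be sketched on a toy series; values beyond the whiskers are capped (winsorised) rather than dropped:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 50.0])  # 50.0 is an outlier

Q1, Q3 = s.quantile(0.25), s.quantile(0.75)
IQR = Q3 - Q1
lower_range = Q1 - 1.5 * IQR
upper_range = Q3 + 1.5 * IQR

# Cap values outside [lower_range, upper_range] instead of dropping rows.
s_treated = s.clip(lower=lower_range, upper=upper_range)
print(upper_range)       # 6.0 for this toy series
print(s_treated.max())   # the 50.0 outlier is capped at 6.0
```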

1.3 Encode the data (having string values) for Modelling. Data
Split: Split the data into train and test (70:30). Apply Linear
regression. Performance Metrics: Check the performance of
Predictions on Train and Test sets using Rsquare, RMSE.
Solution:
Encoding the string values:
Creating dummies.

We have created dummy variables, since the linear regression model cannot
take categorical values; encoding them as integer columns gives better
results.
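A sketch of the dummy encoding, assuming pandas' `get_dummies` was the tool used; `drop_first=True` avoids the dummy-variable trap (perfect collinearity among the dummies of one category), and matches the dummy names seen later in the report (cut_Ideal, color_G, ...):

```python
import pandas as pd

df = pd.DataFrame({
    "cut":   ["Ideal", "Premium", "Fair"],
    "color": ["E", "G", "J"],
    "carat": [0.3, 0.5, 1.0],
})

# The first category of each column (alphabetically) becomes the baseline
# and is dropped, e.g. cut_Fair and color_E here.
df_enc = pd.get_dummies(df, columns=["cut", "color"], drop_first=True)
print(df_enc.columns.tolist())
```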

Train/ Test split:

Linear Regression Model:

Checking the coefficients of the independent attributes:

Checking the intercept of the model

R square on training and testing data:

We can say that the R square is about the same, around 94%, for both
training and testing data.
Below is the RMSE on training and testing data:

We can say that the RMSE is also almost the same for both.
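The whole fit-and-score pipeline can be sketched on synthetic stand-in data (the real run uses the encoded zirconia features; the 70:30 split follows the report):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded features and the price target.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 4))
y = X @ np.array([1.1, 0.3, -0.3, 0.1]) + rng.normal(scale=0.2, size=500)

# 70:30 train/test split, as in the report.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

lr = LinearRegression().fit(X_train, y_train)

r2_train = r2_score(y_train, lr.predict(X_train))
r2_test = r2_score(y_test, lr.predict(X_test))
rmse_train = np.sqrt(mean_squared_error(y_train, lr.predict(X_train)))
rmse_test = np.sqrt(mean_squared_error(y_test, lr.predict(X_test)))
print(r2_train, r2_test)      # similar scores indicate a right-fit model
print(rmse_train, rmse_test)
```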

VIF-Values:

We can still see multicollinearity in the dataset. To bring these values
down to a lower level, we can drop columns after running the stats model.
From the stats model we can identify the features that do not contribute to
the model; after removing those, the VIF values can be reduced. The ideal
value for VIF is less than 5.
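The VIF check can be sketched without the real dataset. The helper below implements the textbook definition, VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing column j on the others; the synthetic column x2 is deliberately made nearly collinear with x1. (statsmodels' `variance_inflation_factor` is the usual ready-made alternative.)

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def vif(df: pd.DataFrame) -> pd.Series:
    """VIF_j = 1 / (1 - R^2_j), regressing column j on all other columns."""
    out = {}
    for col in df.columns:
        X = df.drop(columns=col)
        r2 = LinearRegression().fit(X, df[col]).score(X, df[col])
        out[col] = 1.0 / (1.0 - r2)
    return pd.Series(out)

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
df = pd.DataFrame({
    "x1": x1,
    "x2": x1 + rng.normal(scale=0.2, size=200),  # nearly collinear with x1
    "x3": rng.normal(size=200),                  # independent
})
print(vif(df))  # x1 and x2 come out well above 5; x3 stays close to 1
```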

Best Params Summary Report:

After Dropping the depth column:

To bring the VIF values down to lower levels, we can drop one of the
variables that is highly correlated; dropping it brings the
multicollinearity level down.
As we see here, the overall p-value is less than alpha, so we reject H0 and
accept Ha: at least one regression coefficient is not 0. Here all the
regression coefficients are non-zero.
We can see that x, the length of the cubic zirconia in mm, has a negative
coefficient with a p-value less than 0.05, so we can conclude that the
greater the length of the stone, the lower the profitability.
Similarly, z, the height of the cubic zirconia in mm, also has a negative
coefficient (-0.1088) with a p-value less than 0.05, so the higher the z
value, the lower the profitability.
Also, y, the width of the cubic zirconia in mm, has a positive coefficient
(0.2834), so the higher the y value, the higher the profitability.

Linear regression Performance Metrics:


▪ Intercept for the model: -0.7567627863049374

▪ R square on training data: 0.9419557931252712


▪ R square on testing data: 0.9381643998102491

▪ RMSE on Training data: 0.20690072466418796


▪ RMSE on Testing data: 0.21647817772382874

As the training and testing scores are almost in line, we can conclude that
this model is a right-fit model.

For better accuracy, we drop the depth column in a further iteration.
The Final Linear Regression Equation:

Price = (-0.76) * Intercept + (1.1) * carat + (-0.01) * table
        + (-0.32) * x + (0.28) * y + (-0.11) * z
        + (0.1) * cut_Good + (0.15) * cut_Ideal + (0.15) * cut_Premium
        + (0.13) * cut_Very_Good
        + (-0.05) * color_E + (-0.06) * color_F + (-0.1) * color_G
        + (-0.21) * color_H + (-0.32) * color_I + (-0.47) * color_J
        + (1.0) * clarity_IF + (0.64) * clarity_SI1 + (0.43) * clarity_SI2
        + (0.84) * clarity_VS1 + (0.77) * clarity_VS2
        + (0.94) * clarity_VVS1 + (0.93) * clarity_VVS2

When carat increases by 1 unit, diamond price increases by 1.1


units, keeping all the other predictors constant.
When y increases by 1 unit, diamond price increases by 0.28 units,
keeping all the other predictors constant.
When cut_Good increases by 1 unit, diamond price increases by 0.1
units, keeping all the other predictors constant.
When cut_Ideal increases by 1 unit, diamond price increases by 0.15
units, keeping all the other predictors constant.
When cut_Premium increases by 1 unit, diamond price increases by
0.15 units, keeping all the other predictors constant.
When cut_Very_Good increases by 1 unit, diamond price increases
by 0.13 units, keeping all the other predictors constant.
When clarity_IF increases by 1 unit, diamond price increases by 1.0
units, keeping all the other predictors constant.
When clarity_SI1 increases by 1 unit, diamond price increases by
0.64 units, keeping all the other predictors constant.
When clarity_SI2 increases by 1 unit, diamond price increases by
0.43 units, keeping all the other predictors constant.
When clarity_VS1 increases by 1 unit, diamond price increases by
0.84 units, keeping all the other predictors constant.

When clarity_VS2 increases by 1 unit, diamond price increases by
0.77 units, keeping all the other predictors constant.
When clarity_VVS1 increases by 1 unit, diamond price increases by
0.94 units, keeping all the other predictors constant.
When clarity_VVS2 increases by 1 unit, diamond price increases by
0.93 units, keeping all the other predictors constant.
Some of the coefficient values are negative, which implies those variables
are inversely related to the diamond price.

1.4 Inference: Based on these predictions, what are the
business insights and recommendations?
Solution:
Solution:
Here we can see that there is strong multicollinearity present in the
data set.
When we scaled, the intercept and coefficients changed and the bias became
nearly zero, but the overall accuracy remained the same.
Using the stats model we can run the model again to obtain p-values and
coefficients, which give a better understanding of the relationships;
variables with p-values greater than 0.05 can be dropped and the model
re-run.
From the stats model we can see the R-squared value is 0.942 and the
adjusted R-squared value is also 0.942; both are the same. The p-value is
also less than alpha, i.e. 0.05.
From the EDA we understand that the Ideal cut brings large profits to the
company; Premium and Very Good cuts were also bringing profit. In clarity,
we saw there were no flawless stones, and no profits were coming from I1,
I2, I3. Stone colors such as H, I and J won't help the firm put an
expensive price cap on such stones.

Recommendation:
The Ideal, Premium and Very Good cut types are the ones which bring
profit to the company, so we can use these in marketing to bring
more profit.
Clarity is also an important attribute: the more clarity the stone
has, the more the profit.
The company should focus on the stone's carat and clarity so as to
increase its price.
Marketing efforts can educate customers about the importance of a
better carat score and of the clarity index.
The best 5 attributes are:
▪ Carat
▪ Y- Width of the cubic zirconia in mm.
▪ Clarity
(Clarity_IF, clarity_SI1, clarity_SI2, clarity_VS1, clarity_VS2, clarity_VVS1,
clarity_VVS2)
▪ Cut
▪ Also, we can consider colour.

Problem 2: Logistic Regression and LDA

You are hired by a tour and travel agency which deals in selling holiday
packages. You are provided details of 872 employees of a company.
Among these employees, some opted for the package and some didn't.
You have to help the company in predicting whether an employee will opt
for the package or not on the basis of the information given in the data set.
Also, find out the important factors on the basis of which the company will
focus on particular employees to sell their packages.
Data Dictionary:

Variable Name       Description

Holiday_Package     Opted for holiday package: yes/no
Salary              Employee salary
age                 Age in years
edu                 Years of formal education
no_young_children   Number of young children (younger than 7 years)
no_older_children   Number of older children
foreign             Foreigner: yes/no

2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do
null value condition check, write an inference on it. Perform Univariate
and Bivariate Analysis. Do exploratory data analysis.
Solution:
Started by loading all the necessary libraries for model building.
Reading the head and tail of the data set to check whether the data has
been properly fed.
Head of the data set:

Tail of the data set:

Shape of the data set:

Info and null value check of the data set:

• There are a total of 872 rows and 8 columns.

• Holiday_Package and foreign are of object data type; the rest are of
integer data type.
• There are no null values present.
• As of now, the 'Unnamed: 0' column seems of no use.
Check for duplicates:

• No duplicate values present.


Description of the data set:

We have both categorical and numeric data.

Holiday_Package is our target variable; on the basis of salary, age,
education, number of young children, number of older children, and whether
the employee is a foreigner, we have to examine and help the company
predict whether they will opt for the package or not.
The describe output gives complete details of the mean, std, 25%, 50%,
75%, min and max for the numeric data, and unique, top and freq for the
categorical data.
Checking the unique values and their counts for the categorical columns.

Both columns have just two values, yes and no; their counts are as shown
in the image. From the image, most of the employees are not foreigners.

Percentage of target variable:

• As mentioned earlier, 'Unnamed: 0' seems to be of no use, so we drop
it before starting with further analysis.

Univariate/Bivariate Analysis:
Starting with numerical data:

Age seems to be normally distributed; age is the only column with no
outliers, while the rest (salary, edu, no_young_children,
no_older_children) all have outliers.

Checking skewness:

All seem to be positively skewed; only education seems slightly
negatively skewed.

Categorical Variable:
Foreign:

Most of the employees are not foreigners.

Holiday_Package:

From the above graph we can see that there is little difference between
those who have taken the package and those who have not, but the count of
those who have not taken the package is higher.
Comparing it with the other attributes:
Holiday_package VS Salary:

Mostly, people with salaries below 50,000 have opted for the package.

Holiday_package VS Age:

Most of the packages are taken by people below 40 years of age.

Holiday_package VS Education:

Most of the packages are taken by the people having education level
below 10.

Holiday_package VS no_young_children:

Most of the packages are taken by people who have no young children or up
to 2 of them; those with 0-2 young children have taken the most packages.

Holiday_package VS no_older_children:

Most of the people who opted for the package is between (0-3) that is
mostly people having 0 or 3 older children have opted for maximum
packages. Also, people having 4 and 6 children opted for package but
the counts are less as compare with others.

Checking for Foreign vs Holiday_Package:

From this we can say that, among foreigners, more employees opted for the
holiday package than did not.

Checking for Age vs Salary vs Holiday_Package:

Employees aged between 25 and 50 with salaries below 50,000 have opted for
packages, whereas employees aged 50-60 seem to have opted very little or
not at all.
Checking for Edu vs Salary vs Holiday_Package:

Checking for Number of Young Children vs Age vs Holiday_Package:

As seen earlier, people aged between 20 and 50 with no young children
mostly opted for packages; people with up to 2 young children also opted.
Checking for Number of Older Children vs Age vs Holiday_Package:

Checking Data Distribution and Correlation:

• Data seems to be normally distributed. There doesn't seem to be a huge
difference in the data distribution across the holiday-package groups.

• From the table and heatmap, the variables do not seem highly correlated
with each other; no multicollinearity is present.

Treating Outlier:

As we have seen, only age has no outliers. Treating the rest, where
outliers are present: after the treatment, most of the outliers have been
handled.

2.2 Do not scale the data. Encode the data (having string values) for
Modelling. Data Split: Split the data into train and test (70:30). Apply
Logistic Regression and LDA (linear discriminant analysis).
Solution:
Encoding categorical variables:
Converting the categorical variables to dummy variables.

The encoding helps the logistic regression model predict better results.
Train / Test Split:

As per the instruction, splitting the data in a 70:30 ratio.


Grid Search Method:

Here the grid search method is used; the liblinear solver is chosen, which
is suitable for small datasets. The tolerance and penalty have been found
using the grid search.
From the output we got: penalty: l2, solver: liblinear, tolerance: 1e-06.
Predicting on training data:

Getting the probabilities on test data:

2.3 Performance Metrics: Check the performance of Predictions on
Train and Test sets using Accuracy, Confusion Matrix, Plot ROC curve
and get ROC_AUC score for each model Final Model: Compare Both the
models and write inference which model is best/optimized.
Solution:
Confusion matrix for training data:

Confusion matrix for testing data:

Accuracy, AUC, ROC curve for training data:

Accuracy, AUC, ROC curve for testing data:

The precision is the ratio tp / (tp + fp) where tp is the number of true
positives and fp the number of false positives. The precision is intuitively
the ability of the classifier not to label as positive a sample that is negative.
The recall is the ratio tp / (tp + fn) where tp is the number of true positives
and fn the number of false negatives. The recall is intuitively the ability of
the classifier to find all the positive samples.

The F-beta score can be interpreted as a weighted harmonic mean of the


precision and recall, where an F-beta score reaches its best value at 1 and
worst score at 0.

The support is the number of occurrences of each class.

For Train Data:

For Test Data:

Starting with LDA (Linear Discriminant Analysis):

Analysis on Train Data:

Confusion matrix of train data:

Analysis on Test Data:

Confusion matrix of test data:

Changing the cut-off value to find the optimal value that gives better
accuracy and F1 score.
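The cut-off search can be sketched as a sweep over candidate thresholds applied to the predicted probabilities (the 0.1-0.9 grid below is an assumption; the data is a synthetic stand-in):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=6, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

lda = LinearDiscriminantAnalysis().fit(X_train, y_train)
proba = lda.predict_proba(X_test)[:, 1]

# Sweep candidate cut-offs and keep the one with the best F1 score.
results = {}
for cutoff in np.arange(0.1, 1.0, 0.1):
    pred = (proba >= cutoff).astype(int)
    results[round(cutoff, 1)] = (accuracy_score(y_test, pred),
                                 f1_score(y_test, pred))
best_cutoff = max(results, key=lambda c: results[c][1])
print(best_cutoff, results[best_cutoff])
```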

AUC AND ROC CURVE FOR BOTH TRAINING AND TESTING DATA:

Comparing both models, we find the results are more or less equal; LDA
remains a natural choice here, since the target variable is categorical.
2.4 Inference: Based on these predictions, what are the insights
and recommendations?
Solution:
We had a business problem where we needed to predict whether an employee
would opt for a holiday package or not; for this we made predictions with
both logistic regression and linear discriminant analysis. Both give about
the same results.
The EDA clearly indicates certain criteria: if the employee is a foreigner
and has no young children, the chances of opting for the holiday package
are good.
Many high-salary employees are not opting for the holiday package. People
with salaries less than 50,000 have opted more for the holiday package.
Employees aged 50 to 60 seem not to take the holiday package, whereas
those aged 30 to 50 with salaries less than 50,000 have opted more.
Employees having older children are not opting for the holiday package.
The holiday package seems to have some correlation with the number of
young children.

Recommendation:
• To improve holiday-package uptake among those over the age of 50, we can
offer religious destination packages.
• Holiday packages can be made infant- and young-child-friendly to attract
more employees with young children.
• The company can focus on high-salary employees to sell the holiday
package.
• Special offers can be designed for domestic employees to opt for the
holiday package.
• For people earning more than 150,000 we can provide vacation holiday
packages.