Problem 1: Linear Regression
You are hired by a company, Gem Stones Co Ltd, which is a cubic zirconia
manufacturer. You are provided with a dataset containing the prices and
other attributes of almost 27,000 cubic zirconia stones (an inexpensive
diamond alternative with many of the same qualities as a diamond). The
company earns different profits on different price slots. You have to
help the company predict the price of a stone on the basis of the
details given in the dataset, so that it can distinguish between more
profitable and less profitable stones and improve its profit share. Also,
provide the best 5 attributes that are most important.
Data Dictionary:
1.1. Read the data and do exploratory data analysis.
Describe the data briefly. (Check the null values, Data
types, shape, EDA). Perform Univariate and Bivariate
Analysis.
Solution:
Loading all the necessary libraries required for model building, and then
loading the complete dataset.
Now, reading the head and tail of the data to check whether the data
has been read in properly.
Head:
Tail:
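The loading and sanity checks above can be sketched as follows. The real file and data are not shown in the report, so this uses a small hypothetical frame with the same column layout:

```python
import pandas as pd

# Hypothetical sample mirroring the dataset's columns (the real file name
# and full data are not shown in the report).
df = pd.DataFrame({
    "Unnamed: 0": [1, 2, 3, 4],
    "carat": [0.30, 0.33, 0.90, 0.42],
    "cut": ["Ideal", "Premium", "Very Good", "Ideal"],
    "color": ["E", "G", "E", "F"],
    "clarity": ["SI1", "IF", "VVS2", "VS1"],
    "depth": [62.1, 60.8, 63.0, 61.5],
    "table": [58.0, 58.0, 57.0, 56.0],
    "x": [4.27, 4.33, 6.12, 4.82],
    "y": [4.29, 4.35, 6.09, 4.80],
    "z": [2.66, 2.63, 3.85, 2.96],
    "price": [499, 984, 6289, 1082],
})

print(df.head(2))   # first rows: sanity-check that the data loaded correctly
print(df.tail(2))   # last rows
df.info()           # dtypes and non-null counts per column
```

In practice the frame would come from `pd.read_csv(...)` on the provided file.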
▪ Checking the Info of the data-set.
We have both categorical and continuous data: cut, color, and clarity are
categorical in nature, while carat, depth, table, x, y, z, and price are
continuous.
Price will be the target variable.
From this, we can also see that the Unnamed: 0 column is not useful for us.
▪ Checking whether null values are present.
As discussed earlier, the Unnamed: 0 column is of no use to us, so we
drop it.
The distribution of depth seems to be normal, but from the box plot it is
clear that it holds a large number of outliers.
The distribution of x (length of the cubic zirconia in mm) seems to be
positively skewed. Here also we can see the presence of outliers.
The distribution of price also seems positively skewed, and its box plot
likewise shows outliers.
▪ Checking skewness.
From the above graph, we can say that quality is in increasing order from
Fair to Ideal, and Ideal seems to be the most preferred cut for a diamond.
From the second graph, i.e. cut vs price, we can say that the Ideal cut is
the most preferred because of its price: Ideal-cut stones are priced lower
than the other cuts.
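The skewness check can be sketched as below; the values here are hypothetical, but the report's skew table would be computed the same way:

```python
import pandas as pd

# Hypothetical numeric columns with a long right tail, like carat and price.
df = pd.DataFrame({
    "carat": [0.3, 0.4, 0.5, 2.0, 2.5],
    "price": [400, 600, 900, 15000, 18000],
})

skew = df.skew(numeric_only=True)  # > 0 means a right (positive) tail
print(skew)
```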
COLOR:
We have seven colors in the data, out of which G is the most preferred
color and J the least preferred.
From the second graph, i.e. color vs price, we can say that G is the most
and J the least preferred color because of price: G sits in the middle
price range, whereas J has a high price range.
CLARITY:
▪ Checking for some more relationships.
Color and Cut:
Among the different cuts, Ideal is the most preferred, and within the
Ideal cut, G is the most preferred color, followed by E and F.
The same ordering of colors holds for all the other cuts.
CUT and CLARITY:
Among the different cuts, Ideal is the most preferred, and within the
Ideal cut, VVS2 is the most preferred clarity, followed by SI1 and VS1.
Almost the same pattern of clarity holds for all the other cuts.
▪ Correlations:
Scatter plots: x vs price and y vs price.
From the graphs and table below, we can say that there is
multicollinearity in the data set.
There is a strong correlation between:
carat and x, y, z, price;
x and y, z, price;
y and z, price;
z and price.
All five columns are strongly correlated with one another.
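The correlation check underlying the heatmap can be sketched as below, with a hypothetical numeric slice in place of the real data:

```python
import pandas as pd

# Hypothetical numeric slice; the report inspects the same kind of matrix.
df = pd.DataFrame({
    "carat": [0.3, 0.5, 0.9, 1.2],
    "x": [4.3, 5.1, 6.2, 6.8],
    "y": [4.3, 5.1, 6.1, 6.9],
    "z": [2.7, 3.2, 3.8, 4.2],
    "price": [500, 1500, 4000, 7000],
})

corr = df.corr(numeric_only=True)  # Pearson correlations between numeric columns
print(corr["price"].sort_values(ascending=False))
```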
1.2 Impute null values if present, also check for the values
which are equal to zero. Do they have any meaning or do we
need to change them or drop them? Do you think scaling is
necessary in this case?
Solution:
We have null values present: the depth column has 697 null values. Since
depth is a numeric (continuous) variable, we can impute it using the mean
or the median. As we have also seen, outliers are present, and the mean
gets affected by outliers, so the median is the better option. Below are
the median values.
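The median imputation can be sketched as follows, with hypothetical depth values standing in for the real column:

```python
import numpy as np
import pandas as pd

# Hypothetical column with missing entries, imputed as the report describes.
df = pd.DataFrame({"depth": [61.0, 62.5, np.nan, 60.8, np.nan, 63.1]})

median_depth = df["depth"].median()           # robust to outliers, unlike the mean
df["depth"] = df["depth"].fillna(median_depth)
print(df["depth"].isnull().sum())             # no nulls remain
```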
▪ Checking whether any '0' values are present:
We have some rows with '0' values in x, y, and z, which are the
dimensions of the diamond. A stone cannot have a zero dimension, and
only a very small number of rows are affected, so we can drop these
rows; there is no meaning in taking them into the model.
After dropping:
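A minimal sketch of the drop, on a hypothetical frame where two rows have a zero dimension:

```python
import pandas as pd

# Hypothetical rows; x, y, z are the stone's dimensions in mm.
df = pd.DataFrame({
    "x": [4.3, 0.0, 6.2],
    "y": [4.3, 4.1, 6.1],
    "z": [2.7, 2.5, 0.0],
    "price": [500, 700, 4000],
})

df = df[(df[["x", "y", "z"]] != 0).all(axis=1)]  # keep rows with no zero dimension
print(df.shape)
```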
Scaling:
Scaling is needed because the variables take values on very different
ranges. Scaling brings all values into the same range, which makes the
data more convenient to work with for further analysis.
Scaling or standardizing the features around the center, 0, with a
standard deviation of 1 is important when we compare measurements that
have different units. Variables measured at different scales do not
contribute equally to the analysis and might end up creating a bias.
Scaling is also useful for checking and reducing multicollinearity in the
data; if scaling is not applied, we find the VIF (variance inflation
factor) values to be very high.
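The VIF check can be sketched without extra libraries using the standard identity that, for standardized predictors, VIF_i is the i-th diagonal entry of the inverse correlation matrix (statsmodels' `variance_inflation_factor` gives the same numbers). The data here is synthetic:

```python
import numpy as np
import pandas as pd

# Synthetic predictors: y is nearly collinear with x, table is independent.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({
    "x": x,
    "y": x + rng.normal(scale=0.1, size=200),  # near-duplicate -> high VIF
    "table": rng.normal(size=200),             # independent -> VIF near 1
})

corr = df.corr().to_numpy()
vif = pd.Series(np.diag(np.linalg.inv(corr)), index=df.columns)  # VIF_i = [R^-1]_ii
print(vif.round(2))
```

A common rule of thumb treats VIF above 5 (or 10) as a sign of problematic multicollinearity.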
High VIF values indicate the presence of multicollinearity; these values
are calculated after building the linear regression model, to understand
the multicollinearity in the model.
Scaling has no impact on the model's R-squared score, although it does
rescale the coefficients and the intercept. Strictly speaking, scaling is
not required here, but I am still carrying it forward.
Treating Outliers:
As we have seen in the box plots above, outliers are present in all the
continuous variables, so treating those outliers before moving further is
very important.
We treat the outliers with the lower-limit and upper-limit (IQR) method:
lower_range = Q1 - (1.5 * IQR)
upper_range = Q3 + (1.5 * IQR)
After treating the outliers:
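The IQR capping above can be sketched on a hypothetical series with one extreme value:

```python
import pandas as pd

# Hypothetical series with one extreme value, capped as the report describes.
s = pd.Series([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 50.0])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
s = s.clip(lower=lower, upper=upper)  # cap values outside the whisker limits
print(s.max())
```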
1.3 Encode the data (having string values) for Modelling. Data
Split: Split the data into train and test (70:30). Apply Linear
regression. Performance Metrics: Check the performance of
Predictions on Train and Test sets using Rsquare, RMSE.
Solution:
Encoding the string values:
Creating Dummies
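The dummy encoding can be sketched as below; `drop_first=True` drops one level per category to avoid the dummy-variable trap. The rows are hypothetical:

```python
import pandas as pd

# Hypothetical categorical slice of the stones data.
df = pd.DataFrame({"cut": ["Ideal", "Premium", "Ideal"],
                   "color": ["E", "G", "F"],
                   "price": [499, 984, 1082]})

df = pd.get_dummies(df, columns=["cut", "color"], drop_first=True)
print(df.columns.tolist())
```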
Linear Regression Model:
Checking the intercept of the model.
We can say that R-squared is the same, about 94%, for both the training
and the testing data.
Below is the RMSE on the training and testing data:
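The split-fit-score pipeline can be sketched end to end as below. Synthetic data stands in for the prepared feature matrix, which is not shown in the report:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 3 features with a known linear relationship plus noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
y = 3.0 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(scale=0.5, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)          # 70:30 split, as in the report

model = LinearRegression().fit(X_train, y_train)
print("intercept:", model.intercept_)
print("train R2:", r2_score(y_train, model.predict(X_train)))
print("test  R2:", r2_score(y_test, model.predict(X_test)))
print("test RMSE:", np.sqrt(mean_squared_error(y_test, model.predict(X_test))))
```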
VIF-Values:
After Dropping the depth column:
To bring the VIF values down to lower levels, we can drop one of the
highly correlated variables; dropping it brings the multicollinearity
level down.
As we see here, the overall p-value is less than alpha, so we reject H0
and accept Ha: at least one regression coefficient is not 0. Here, in
fact, none of the regression coefficients is 0.
We can see that x, the length of the cubic zirconia in mm, has a negative
coefficient with a p-value below 0.05, so we can conclude that the
greater the length of the stone, the less profitable it is.
Similarly, z, the height of the cubic zirconia in mm, also has a negative
coefficient, -0.1088, with a p-value below 0.05, so the higher the z
value, the less profitable the stone.
Also, y, the width of the cubic zirconia in mm, has a positive
coefficient, 0.2834, so the higher the y value, the more profitable the
stone.
As the training and testing scores are almost in line, we can conclude
that this model is a right fit.
For better accuracy, the depth column is dropped in a further iteration.
The Final Linear Regression Equation:
When clarity_VS2 increases by 1 unit, diamond price increases by
0.77 units, keeping all the other predictors constant.
When clarity_VVS1 increases by 1 unit, diamond price increases by
0.94 units, keeping all the other predictors constant.
When clarity_VVS2 increases by 1 unit, diamond price increases by
0.93 units, keeping all the other predictors constant.
Some of the coefficients are negative, which implies that those variables
are inversely related to the diamond price.
Recommendation:
The Ideal, Premium, and Very Good cut types are the ones that bring
profit to the company, so marketing can focus on these to bring in more
profit.
Clarity is also an important attribute: the higher the clarity of the
stones, the higher the profit.
The company should focus on the stones' carat and clarity in order to
increase their price.
Marketing efforts can educate customers about the importance of a better
carat score and of the clarity index.
The best 5 attributes are:
▪ Carat
▪ Y- Width of the cubic zirconia in mm.
▪ Clarity
(Clarity_IF, clarity_SI1, clarity_SI2, clarity_VS1, clarity_VS2, clarity_VVS1,
clarity_VVS2)
▪ Cut
▪ Also, we can consider colour.
Problem 2: Logistic Regression and LDA
You are hired by a tour and travel agency that sells holiday packages.
You are provided details of 872 employees of a company. Among these
employees, some opted for the package and some didn't. You have to help
the company predict whether an employee will opt for the package or not
on the basis of the information given in the data set. Also, find the
important factors on the basis of which the company should focus on
particular employees to sell its packages.
Data Dictionary:
2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do
null value condition check, write an inference on it. Perform Univariate
and Bivariate Analysis. Do exploratory data analysis.
Solution:
Started by loading all the necessary libraries for model building.
Reading the head and tail of the data set to check whether the data has
been read in properly.
Head of the data set:
Tail of the data set:
Description of the data set:
Both categorical columns have just two values, yes and no, and their
total counts are as shown in the image.
From the image, most of the employees have not been abroad.
Percentage of target variable:
Univariate/Bivariate Analysis:
Starting with numerical data:
Age seems to be normally distributed; age is also the only column with no
outliers, while the rest (salary, edu, no_young_children,
no_older_children) all have outliers.
Checking skewness:
Categorical Variable:
Foreign:
Holiday_Package:
From the above graph we can see that there is little difference between
the number of employees who took the package and those who did not,
although the count of those who did not take the package is somewhat
higher.
Comparing it with the other attributes:
Holiday_package VS Salary:
Mostly, people with salaries below 50,000 have opted for the package.
Holiday_package VS Age:
Holiday_package VS Education:
Most of the packages are taken by people with an education level
below 10.
Holiday_package VS no_young_children:
Most of the packages are taken by people who have no young children or up
to 2; that is, those with 0-2 young children have taken the most
packages.
Holiday_package VS no_older_children:
Most of the people who opted for the package have between 0 and 3 older
children; people with 0 to 3 older children opted for the most packages.
People with 4 or 6 children also opted for the package, but their counts
are low compared with the others.
Checking Foreign vs Holiday_Package:
From this we can say that, among foreigners, those who opted for the
holiday package outnumber those who did not.
Checking Age vs Salary vs Holiday_Package:
Employees aged between 25 and 50 with salaries below 50,000 have opted
for packages, whereas employees aged between 50 and 60 have opted for
packages rarely or not at all.
Checking Edu vs Salary vs Holiday_Package:
As we have seen earlier, people aged between 20 and 50 with no young
children, or with up to 2 young children, opted for packages the most.
Checking for Number of Older Children vs Age vs Holiday_Package:
• From the table and heatmap, the variables do not seem to be highly
correlated with each other; no multicollinearity is present.
Treating Outliers:
As we have seen, age is the only variable with no outliers. Treating the
rest, where outliers are present: after the treatment, most of the
outliers have been handled.
2.2 Do not scale the data. Encode the data (having string values) for
Modelling. Data Split: Split the data into train and test (70:30). Apply
Logistic Regression and LDA (linear discriminant analysis).
Solution:
Encoding categorical variables:
After converting the categorical variables to dummy variables. The
encoding is needed because the logistic regression model requires numeric
inputs.
Train / Test Split:
Here grid search is used; the liblinear solver is chosen, which is
suitable for small datasets. The tolerance and penalty have also been
tuned using grid search.
From the above output, we get penalty: l2, solver: liblinear,
tolerance: 1e-06.
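The grid search described above can be sketched as below. Synthetic binary data stands in for the encoded employee features, which are not shown:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary-classification stand-in for the encoded employee data.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=300) > 0).astype(int)

# liblinear supports both l1 and l2 penalties; CV picks the best combination.
param_grid = {"penalty": ["l1", "l2"], "tol": [1e-4, 1e-6]}
grid = GridSearchCV(LogisticRegression(solver="liblinear"), param_grid, cv=5)
grid.fit(X, y)
print(grid.best_params_)
```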
Predicting on training data:
2.3 Performance Metrics: Check the performance of Predictions on
Train and Test sets using Accuracy, Confusion Matrix, Plot ROC curve
and get ROC_AUC score for each model Final Model: Compare Both the
models and write inference which model is best/optimized.
Solution:
Confusion matrix for training data:
Accuracy, AUC, ROC curve for training data:
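These metrics can be computed as sketched below; the fitted model and data here are synthetic stand-ins:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve

# Synthetic stand-in for the training data and fitted classifier.
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 3))
y = (X[:, 0] + rng.normal(scale=0.7, size=300) > 0).astype(int)

clf = LogisticRegression(solver="liblinear").fit(X, y)
probs = clf.predict_proba(X)[:, 1]       # scores for the positive class
print("accuracy:", accuracy_score(y, clf.predict(X)))
print("AUC:", roc_auc_score(y, probs))
fpr, tpr, _ = roc_curve(y, probs)        # the points used to plot the ROC curve
```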
The precision is the ratio tp / (tp + fp) where tp is the number of true
positives and fp the number of false positives. The precision is intuitively
the ability of the classifier not to label as positive a sample that is negative.
The recall is the ratio tp / (tp + fn) where tp is the number of true positives
and fn the number of false negatives. The recall is intuitively the ability of
the classifier to find all the positive samples.
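The definitions quoted above can be verified directly from a confusion matrix on hypothetical labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels illustrating the tp/fp/fn definitions quoted above.
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("precision:", tp / (tp + fp), precision_score(y_true, y_pred))
print("recall:   ", tp / (tp + fn), recall_score(y_true, y_pred))
```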
Starting with LDA (Linear Discriminant Analysis):
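The LDA fit and its confusion matrices can be sketched as below, again on synthetic stand-in data:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded employee features.
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 4))
y = (X[:, 0] - X[:, 2] + rng.normal(scale=0.5, size=400) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

lda = LinearDiscriminantAnalysis().fit(X_train, y_train)
print(confusion_matrix(y_train, lda.predict(X_train)))  # train confusion matrix
print(confusion_matrix(y_test, lda.predict(X_test)))    # test confusion matrix
print("test accuracy:", accuracy_score(y_test, lda.predict(X_test)))
```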
Confusion matrix of train data:
Confusion matrix of test data:
AUC AND ROC CURVE FOR BOTH TRAINING AND TESTING DATA:
Comparing both the models, we find the results are more or less equal.
Both handle a categorical target variable; LDA can perform slightly
better when its assumption of normally distributed predictors holds.
2.4 Inference: Basis on these predictions, what are the insights
and recommendations.
Solution:
We had a business problem where we needed to predict whether an employee
would opt for a holiday package. For this problem, we made predictions
with both logistic regression and linear discriminant analysis, and both
give essentially the same results.
The EDA clearly indicates certain criteria: if an employee is a foreigner
and does not have young children, the chances of opting for the holiday
package are good.
Many high-salary employees are not opting for the holiday package; people
with salaries below 50,000 have opted for it more. Employees aged 50 to
60 seem not to take the holiday package, whereas people aged 30 to 50
with salaries below 50,000 have opted for it more.
Employees having older children are not opting for the holiday package,
and package uptake seems to have some correlation with the number of
young children.
Recommendation:
• To improve holiday package uptake among employees above age 50, we can
offer religious destinations.
• Holiday packages can be made infant- and young-child-friendly to
attract more employees with young children.
• The company can focus on high-salary employees to sell the holiday
package.
• Special offers can be designed for domestic employees to opt for the
holiday package.
• For people earning more than 150,000, we can provide vacation holiday
packages.