PROBLEM 1
Introduction:
You are hired by Gem Stones Co. Ltd., a cubic zirconia manufacturer. You are
provided with a dataset containing the prices and other attributes of almost 27,000 cubic
zirconia stones (cubic zirconia is an inexpensive diamond alternative with many of the same
qualities as a diamond). The company earns different profits in different price slots. You have
to help the company predict the price of a stone on the basis of the details given in the dataset,
so that it can distinguish between more profitable and less profitable stones and improve its
profit share. Also, provide them with the 5 most important attributes.
Data Dictionary:
Variable Name    Description
Carat            Carat weight of the cubic zirconia.
Cut              Describes the cut quality of the cubic zirconia. Quality in increasing order: Fair, Good, Very Good, Premium, Ideal.
Color            Colour of the cubic zirconia, with D being the best and J the worst.
Clarity          Refers to the absence of inclusions and blemishes. In order from best to worst (FL = flawless, I3 = level 3 inclusions): FL, IF, VVS1, VVS2, VS1, VS2, SI1, SI2, I1, I2, I3.
Depth            The height of a cubic zirconia, measured from the culet to the table, divided by its average girdle diameter.
Table            The width of the cubic zirconia's table expressed as a percentage of its average diameter.
Price            The price of the cubic zirconia.
X                Length of the cubic zirconia in mm.
Y                Width of the cubic zirconia in mm.
Z                Height of the cubic zirconia in mm.
1.1. Read the data and do exploratory data analysis. Describe the
data briefly. (Check the null values, Data types, shape, EDA).
Perform Univariate and Bivariate Analysis.
First, we import the necessary libraries and then load the CSV file in the Jupyter notebook. After
that, we use the head function to view the first 5 rows of the dataset.
The dataset has 26967 rows and 11 columns. The first column named ‘Unnamed:0’ is not
useful in evaluating the dataset. Hence, the column is removed from the dataset.
As we can see, only ‘depth’ has missing values. There are 6 float, 1 int and 3 object
columns.
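The inspection steps above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the tiny DataFrame is made up and stands in for the real CSV, and the index column is assumed to be named "Unnamed: 0" as pandas usually writes it.

```python
import pandas as pd

# Tiny made-up sample standing in for the real CSV (values are illustrative only).
raw = pd.DataFrame({
    "Unnamed: 0": [1, 2, 3, 4],
    "carat": [0.30, 0.33, 0.90, 0.42],
    "cut": ["Ideal", "Premium", "Very Good", "Ideal"],
    "depth": [62.1, 60.8, None, 61.6],
    "price": [499, 984, 6289, 1082],
})

# Drop the leftover index column, then inspect shape, null counts and dtypes.
df = raw.drop(columns=["Unnamed: 0"])
n_rows, n_cols = df.shape
null_counts = df.isnull().sum()
dtype_counts = df.dtypes.astype(str).value_counts()

print(n_rows, n_cols)
print(null_counts)
print(dtype_counts)
```

On the real data, the same three calls give the row and column counts, the missing values per column and the number of columns of each data type.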
Descriptive statistics help describe and understand the features of a specific data set by giving short
summaries about the sample and measures of the data. Kindly refer to jupyter notebook to see this
table.
CUT: 5
Fair 781
Good 2441
Very Good 6030
Premium 6899
Ideal 10816
Name: cut
COLOR: 7
J 1443
I 2771
D 3344
H 4102
F 4729
E 4917
G 5661
Name: color
CLARITY: 8
I1 365
IF 894
VVS1 1839
VVS2 2531
VS1 4093
SI2 4575
VS2 6099
SI1 6571
Name: clarity
Now we convert these object columns to integer codes by assigning a number to each unique value, since
the model can only process numeric inputs. The values assigned are:
Cut
['Ideal', 'Premium', 'Very Good', 'Good', 'Fair']
Categories (5, object): ['Fair', 'Good', 'Ideal', 'Premium', 'Very Good']
[2 3 4 1 0]
Color
['E', 'G', 'F', 'D', 'H', 'J', 'I']
Categories (7, object): ['D', 'E', 'F', 'G', 'H', 'I', 'J']
[1 3 2 0 4 6 5]
Clarity
['SI1', 'IF', 'VVS2', 'VS1', 'VVS1', 'VS2', 'SI2', 'I1']
Categories (8, object): ['I1', 'IF', 'SI1', 'SI2', 'VS1', 'VS2', 'VVS1', 'VVS2']
[2 1 7 4 6 5 3 0]
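The codes listed above follow the alphabetical order of the categories, which is the default behaviour of a pandas categorical. The sketch below reproduces that mapping for cut and, as an alternative that the notebook did not necessarily use, shows an ordered encoding that follows the quality order from the data dictionary. The variable names are illustrative.

```python
import pandas as pd

cut = pd.Series(["Ideal", "Premium", "Very Good", "Good", "Fair"])

# Default categorical: categories are sorted alphabetically, so the codes
# match the mapping listed above (Fair=0, Good=1, Ideal=2, Premium=3, Very Good=4).
alpha_codes = pd.Categorical(cut).codes

# Alternative (an assumption, not what the notebook did): keep the quality
# order from the data dictionary so that a larger code means a better cut.
quality_order = ["Fair", "Good", "Very Good", "Premium", "Ideal"]
ordinal_codes = pd.Categorical(cut, categories=quality_order, ordered=True).codes

print(alpha_codes.tolist())    # [2, 3, 4, 1, 0]
print(ordinal_codes.tolist())  # [4, 3, 2, 1, 0]
```

With alphabetical codes, the numeric distance between two categories carries no quality meaning, which matters if the codes are fed to a linear model as a single numeric column.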
We again use the head function to see whether the changes are done or not.
After that we check for duplicate rows and we have 34 duplicate rows. We remove them from our dataset.
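A minimal sketch of the duplicate check, using a made-up three-column frame rather than the real dataset:

```python
import pandas as pd

# Made-up rows; the second row repeats the first.
df = pd.DataFrame({
    "carat": [0.30, 0.30, 0.90, 0.42],
    "cut": ["Ideal", "Ideal", "Good", "Premium"],
    "price": [499, 499, 6289, 1082],
})

# duplicated() flags rows that are identical to an earlier row.
n_duplicates = int(df.duplicated().sum())

# Keep the first occurrence of each row and rebuild the index.
deduped = df.drop_duplicates().reset_index(drop=True)

print(n_duplicates, deduped.shape)
```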
Looking for outliers, we can see that only price has outliers; no other column has any. Hence we treat
the outliers in price.
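The report does not state which treatment was applied. One common choice, shown here purely as an assumption, is to cap values at 1.5 times the interquartile range beyond the quartiles. The helper name `cap_outliers_iqr` and the sample prices are hypothetical.

```python
import pandas as pd

def cap_outliers_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values lying more than k * IQR outside the quartiles."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series.clip(lower=lower, upper=upper)

# Made-up prices with one extreme value.
price = pd.Series([500, 700, 900, 1100, 1300, 18000])
capped = cap_outliers_iqr(price)

print(capped.tolist())
```

Capping keeps every row in the dataset, whereas dropping outlier rows would shrink it; which is preferable depends on whether the extreme prices are genuine.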
As we can see from the above heatmap, only carat, x, y and z (the length, width and height of the
cubic zirconia in mm) are strongly correlated with one another, with correlation values close to 1.
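The numbers behind such a heatmap come from a correlation matrix. The sketch below uses synthetic data, generated under the made-up assumption that length grows with the cube root of carat, to show how strongly correlated pairs can be listed.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
carat = rng.uniform(0.2, 2.0, size=200)

# Synthetic data: x tracks carat closely, depth is unrelated noise.
df = pd.DataFrame({
    "carat": carat,
    "x": 6.5 * np.cbrt(carat) + rng.normal(0, 0.05, size=200),
    "depth": rng.normal(61.7, 1.4, size=200),
})

corr = df.corr()  # Pearson correlation by default

# List each pair of columns once if its absolute correlation exceeds 0.9.
cols = list(corr.columns)
high = [
    (a, b, round(float(corr.loc[a, b]), 2))
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
    if abs(corr.loc[a, b]) > 0.9
]
print(high)
```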
PAIRPLOT:
1.2. Impute null values if present, also check for the values which
are equal to zero. Do they have any meaning or do we need to
change them or drop them? Do you think scaling is necessary
in this case?
All null values were imputed during the EDA process. We checked for values equal to zero but found
none, so nothing needs to be changed or dropped. A zero in any of these attributes would not describe
a profitable stone: a stone is a gem because it has some value on each of the criteria above, and if
any of them were zero it would be no different from an ordinary stone.
Scaling is not necessary in this case. For a linear regression model, the best parameter values can be
found with a closed-form solution called the normal equation, so no stepwise optimization process is
involved and feature scaling is not needed.
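The scaling argument can be checked numerically. The sketch below solves the normal equation on synthetic data (the coefficients 3000 and -20 are made up) and compares the fitted values with and without standardizing the features.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic features on very different scales, plus a noisy linear target.
X = np.column_stack([rng.uniform(0.2, 2.0, 50), rng.uniform(55, 65, 50)])
y = 3000 * X[:, 0] - 20 * X[:, 1] + rng.normal(0, 50, 50)

def fit_predict(features, target):
    """Solve the normal equation (A^T A) beta = A^T y and return fitted values."""
    A = np.column_stack([np.ones(len(features)), features])  # add intercept column
    beta = np.linalg.solve(A.T @ A, A.T @ target)
    return A @ beta

pred_raw = fit_predict(X, y)

# Standardize each feature to zero mean and unit variance, then refit.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
pred_scaled = fit_predict(X_scaled, y)

same_predictions = np.allclose(pred_raw, pred_scaled)
print(same_predictions)
```

Scaling changes the size of the individual coefficients, but the fitted values, and therefore R-square and RMSE, stay the same.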
1.3. Encode the data (having string values) for Modelling. Data Split:
Split the data into train and test (70:30). Apply Linear
regression. Performance Metrics: Check the performance of
Predictions on Train and Test sets using Rsquare, RMSE.
The data was split in a 70:30 ratio and a linear regression model was built in the Jupyter notebook.
For the train set, 88% of the variation in price is explained by the predictors in the model.
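A minimal sketch of the split, fit and evaluation steps with scikit-learn. The data is synthetic and the `random_state` value is an arbitrary assumption, so the printed numbers will not match the report.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Synthetic stand-in: price depends linearly on carat, plus noise.
X = rng.uniform(0.2, 2.0, size=(100, 1))
y = 4000 * X[:, 0] + rng.normal(0, 400, size=100)

# 70:30 train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

model = LinearRegression().fit(X_train, y_train)

def report(X_part, y_part):
    """Return (R-square, RMSE) of the fitted model on one data partition."""
    pred = model.predict(X_part)
    rmse = float(np.sqrt(mean_squared_error(y_part, pred)))
    return r2_score(y_part, pred), rmse

train_r2, train_rmse = report(X_train, y_train)
test_r2, test_rmse = report(X_test, y_test)

print(f"train: R2={train_r2:.3f} RMSE={train_rmse:.1f}")
print(f"test:  R2={test_r2:.3f} RMSE={test_rmse:.1f}")
```

Comparing the train and test figures side by side shows whether the model generalizes: a test R-square far below the train value would point to overfitting.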
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly
specified.
[2] The condition number is large, 3.99e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
The overall p-value is less than alpha, so we reject H0 and accept Ha that at least one regression
coefficient is not 0. Here, all regression coefficients are non-zero.
From the EDA we could understand that the Ideal cut brought profits to the company. The colors H, I
and J have brought profits to the company. We could see there was less profit coming from I1, I2 and
I3 stones. The Ideal, Premium and Very Good cuts were bringing good profit, whereas Fair and Good
were not bringing profits.
price = (-1846.12) + (9126.94)*carat + (-15.01)*depth + (*18.59)*table
  + (-1190.28)*x + (837.36)*y + (-163.64)*z
  + (481.81)*cut_Good + (714.65)*cut_Ideal + (674.77)*cut_Premium + (606.9)*cut_Very Good
  + (-181.91)*color_E + (-256.81)*color_F + (-429.38)*color_G + (-855.99)*color_H
  + (-1323.93)*color_I + (-1928.05)*color_J
  + (4004.01)*clarity_IF + (2519.92)*clarity_SI1 + (1684.46)*clarity_SI2
  + (3342.57)*clarity_VS1 + (3039.93)*clarity_VS2 + (3772.3)*clarity_VVS1 + (3757.78)*clarity_VVS2
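The equation has no term for cut Fair, color D or clarity I1. That pattern is consistent with dummy (one-hot) encoding in which the first level of each category is dropped and serves as the baseline, although the report does not say so explicitly. A minimal sketch of that encoding, on a made-up frame:

```python
import pandas as pd

# Made-up rows covering all five cut levels.
df = pd.DataFrame({
    "carat": [0.3, 0.9, 0.4, 1.1, 0.5],
    "cut": ["Fair", "Good", "Ideal", "Premium", "Very Good"],
})

# drop_first=True removes the first (alphabetical) level, here "Fair",
# which becomes the baseline the other cut coefficients are measured against.
encoded = pd.get_dummies(df, columns=["cut"], drop_first=True)

print(list(encoded.columns))
# ['carat', 'cut_Good', 'cut_Ideal', 'cut_Premium', 'cut_Very Good']
```

Under this reading, each dummy coefficient is the expected price difference from the baseline level with all other variables held fixed. For example, the cut_Ideal coefficient of 714.65 would be the difference between an Ideal and a Fair stone that are otherwise identical.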
PROBLEM 2
Introduction:
You are hired by a tour and travel agency which deals in selling holiday packages. You
are provided details of 872 employees of a company. Among these employees, some
opted for the package and some didn't. You have to help the company in predicting
whether an employee will opt for the package or not on the basis of the information
given in the data set. Also, find out the important factors on the basis of which the
company will focus on particular employees to sell their packages.
Data Dictionary:
Variable Name        Description
Holiday_Package      Opted for holiday package (yes/no)?
Salary               Employee salary
age                  Age in years
edu                  Years of formal education
no_young_children    The number of young children (younger than 7 years)
no_older_children    Number of older children
foreign              Foreigner (yes/no)
First, we import the necessary libraries and then load the CSV file in the Jupyter notebook. After
that, we use the head function to view the first 5 rows of the dataset.
The dataset has 872 rows and 8 columns. The first column named ‘Unnamed:0’ is not useful in
evaluating the dataset. Hence, the column is removed from the dataset.
As we can see from the above table, we don’t have any missing values. There are 5 int64 columns and 2
object columns.
Descriptive statistics help describe and understand the features of a specific data set by giving short
summaries about the sample and measures of the data. Kindly refer to jupyter notebook to see this
table.
After that we check for duplicate rows and we have 0 duplicated rows.
As we can see from the above boxplot only salary has outliers. Hence we treat it first.
CORRELATION PLOT:
As we can see from the above heatmap, there is no strong correlation among the variables.
PAIRPLOT:
2.2. & 2.3. Do not scale the data. Encode the data (having string
values) for Modelling. Data Split: Split the data into train and test
(70:30). Apply Logistic Regression and LDA (linear discriminant
analysis).
We convert these object columns to integer codes by assigning a number to each unique value, since
the model can only process numeric inputs. The values assigned are:
Holliday_Package
['no', 'yes']
Categories (2, object): ['no', 'yes']
[0 1]
foreign
['no', 'yes']
Categories (2, object): ['no', 'yes']
[0 1]
The data was split in a 70:30 ratio, and a logistic regression model and a linear discriminant analysis
model were built in the Jupyter notebook.
Based on the classification report, the linear discriminant analysis model performs slightly better than
the logistic regression model.
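A minimal sketch of fitting and comparing the two classifiers with scikit-learn. The data is synthetic: the rule that generates the labels, the two features and the `random_state` value are all made-up assumptions, so the scores say nothing about the real dataset.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
n = 300

# Synthetic features: salary (in thousands) and number of young children.
salary = rng.normal(50, 15, n)
young = rng.integers(0, 3, n)

# Made-up rule for generating labels (1 = opted for the package).
logit = 0.08 * (salary - 50) - 1.0 * young + 0.8
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)
X = np.column_stack([salary, young])

# 70:30 train/test split, features left unscaled.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "LDA": LinearDiscriminantAnalysis(),
}

# Fit each model on the train set and score it on the held-out test set.
scores = {}
for name, clf in models.items():
    clf.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, clf.predict(X_test))

print(scores)
```

Accuracy is used here only to keep the sketch short. `sklearn.metrics.classification_report` gives the per-class precision, recall and F1 figures that the comparison above refers to.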
2.4. Inference: Based on these predictions, what are the insights and
recommendations?
Based on the EDA, we understand that salary plays an important role in determining whether a
person takes a holiday package or not. People with a higher salary generally tend to take a holiday
package. Also, people with young children take fewer holiday packages, as their priority shifts from
holidays to saving money for their children’s future. Employees with older children take up holiday
packages, as they don’t have to worry about their children’s future. Most foreigners tend to take a
holiday package.
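Group-level statements like these can be checked by computing the share of employees who opted for the package within each group. The sketch below uses eight made-up rows, so the rates it prints are illustrative only. The column name Holliday_Package follows the spelling in the encoding output earlier in this report.

```python
import pandas as pd

# Made-up rows for illustration.
df = pd.DataFrame({
    "Holliday_Package": ["yes", "no", "yes", "no", "no", "yes", "no", "yes"],
    "foreign": ["yes", "no", "yes", "no", "no", "no", "no", "yes"],
    "no_young_children": [0, 2, 0, 1, 1, 0, 2, 0],
})

# Share of "yes" within each value of foreign (each row of the crosstab sums to 1).
opt_rate_by_foreign = pd.crosstab(
    df["foreign"], df["Holliday_Package"], normalize="index"
)["yes"]

# Share of "yes" for each number of young children.
opt_rate_by_young = (
    df["Holliday_Package"].eq("yes").groupby(df["no_young_children"]).mean()
)

print(opt_rate_by_foreign)
print(opt_rate_by_young)
```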
Based on the analysis and the predictive models created, the linear discriminant analysis model performs
slightly better than the logistic regression model.