
Predictive Modeling

FEBRUARY 2022

Great Learning
Authored by: Umesh Kumar Hasija

Problem Statement 1

You are hired by a company, Gem Stones Co. Ltd., which is a cubic zirconia manufacturer. You are provided with a
dataset containing the prices and other attributes of almost 27,000 cubic zirconia stones (cubic zirconia is an
inexpensive diamond alternative with many of the same qualities as a diamond). The company earns different
profits on different price slots. You have to help the company predict the price of a stone on the basis of the
details given in the dataset, so that it can distinguish between higher-profit and lower-profit stones and improve
its profit share. Also, provide the five attributes that are most important. The dataset is cubic_zirconia.csv.

Data Dictionary:

Variable Name    Description

carat            Carat weight of the cubic zirconia stone
cut              Quality of the cut (Fair, Good, Very Good, Premium, Ideal)
color            Colour of the stone (D to J)
clarity          Clarity of the stone (absence of inclusions and blemishes)
depth            Height of the stone, expressed as a percentage
table            Width of the stone's top, expressed as a percentage
x                Dimension of the stone
y                Dimension of the stone
z                Dimension of the stone
price            Price of the stone (target variable)
Question: 1.1
Read the data and do exploratory data analysis. Describe the data briefly. (Check the null values, Data types,
shape, EDA, duplicate values). Perform Univariate and Bivariate Analysis

Step 1:
• Importing the dataset - “cubic_zirconia.csv”

• Data Summary and Exploratory Data Analysis:

• Checking if the data is being imported properly

Displaying the first few records using the head() function

• Dimension of the Dataset: The dimension or shape of the dataset can be shown using the .shape attribute. It
shows that the dataset given to us has 26967 rows and 11 columns or variables

• Variable Data types and records: Computed using .info() function

Step 2: Dropping the first column, 'Unnamed: 0', using the drop() function, as this column adds no value to the
overall model (it is just the serial number/order of the rows).

The new dataset now has 10 columns instead of 11; displaying the head of the new dataset.
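
A minimal sketch of these loading and cleanup steps, assuming the dataframe is named df (file name as given in the problem statement):

import pandas as pd

# Load the dataset
df = pd.read_csv("cubic_zirconia.csv")

print(df.head())    # check that the data has been imported properly
print(df.shape)     # expected: (26967, 11)
df.info()           # variable data types and non-null counts

# Drop the serial-number column, which adds no value to the model
df = df.drop(columns=["Unnamed: 0"])

# Descriptive statistics for the dataset (Step 3 below)
print(df.describe().T)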

Step 3: Descriptive Statistics for the dataset

Carat: This is an independent variable and ranges from 0.2 to 4.5. The mean value is around 0.8 and 75% of the
stones are at or below 1.05 carat, with a standard deviation of around 0.477. The distribution is right skewed:
the majority of the stones are of lower carat, and there are very few stones above 1.05 carat.

Depth: The percentage height of the cubic zirconia stones is in the range of 50.80 to 73.60. The average height of
the stones is 61.80; 25% of the stones are at or below 61 and 75% are at or below 62.5. The standard deviation of
the height is 1.4, indicating a roughly normal distribution.

Table: The percentage width of the cubic zirconia stones is in the range of 49 to 79. The average is around 57; 25%
of the stones are below 56 and 75% of the stones have a width of less than 59. The standard deviation is 2.24.
Hence, the data does not show a normal distribution and, similar to carat, most of the stones have a smaller
width; outliers are also present in this variable.

Price: Price is the dependent variable. The mean price is around 3938 and the maximum is 18818. The median
price of the stones is 2375, 25% of the stones are priced below 945, and 75% are priced below 5356. The standard
deviation of the price is 4022. The distribution is right skewed, indicating that the prices of the majority of the
stones are in the lower range.

Variables x, y, z: These variables are approximately normally distributed, with outliers present.

Duplicate values: There are a total of 34 duplicated rows, which will be dropped. Post dropping the duplicates, the
new shape of the dataset is:

Rows: 26933
Columns: 10
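
A short sketch of the duplicate check and removal, continuing with the same dataframe df:

# Count fully duplicated rows (expected: 34)
print(df.duplicated().sum())

# Drop the duplicates and confirm the new shape (expected: (26933, 10))
df = df.drop_duplicates()
print(df.shape)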

Step 4: Checking correlation in the data using a heatmap and correlation matrix

• High correlation between features like carat, x, y, z and price, indicating multicollinearity in the data. Some
of these variables might be dropped while building the model.

• Table shows low correlation with the other features.

• Depth is negatively correlated with most of the other features, except for carat.
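
A minimal sketch of the correlation check behind the heatmap (seaborn and matplotlib are assumed to be available):

import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise correlations of the numeric variables
corr = df.corr()
print(corr["price"].sort_values(ascending=False))

# Heatmap of the correlation matrix
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.show()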

Summary/Conclusion:
• Price – This variable gives the continuous output with the price of the cubic zirconia stones. This will be
our target variable.
• Carat, depth, table, x, y, z are numerical or continuous variables.
• Cut, clarity and color are categorical variables.
• We drop the first column 'Unnamed: 0' as it is not important for our study, which leaves the dataset with
26967 rows and 10 columns.
• Only the 'depth' column has missing values (697 of them), which we will impute with its median value.
• There are a total of 34 duplicate rows, as computed using the .duplicated() function. We will drop the
duplicates.

• Upon dropping the duplicates, the shape of the dataset is 26933 rows and 10 columns.

Question 1.2:
Impute null values if present, also check for the values which are equal to zero. Do they have any meaning,
or do we need to change them or drop them? Do you think scaling is necessary in this case?

Step 1: Checking for missing values or garbage data


There are null values in the variable "depth", amounting to 2.6% of the total records for this variable; these will
be imputed.

Imputing the null values with the median for this numeric variable:
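
A minimal sketch of the median imputation, again using the dataframe df:

# Missing values per column (expected: 697 in 'depth')
print(df.isnull().sum())

# Impute the missing depth values with the column median
df["depth"] = df["depth"].fillna(df["depth"].median())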

Post imputation:

Scaling:
The describe() output shared above indicates that the means and standard deviations of the original numeric
variables do not vary drastically, so even if we do not scale the data, the model performance will not change
much; the impact will be insignificant.

From the correlation matrix it is identified that there is a strong correlation between independent variables
such as carat, x, y, and z. All these variables are also strongly correlated with the target variable price, which
indicates that the dataset suffers from multicollinearity. Depth does not show any strong relation with the
price variable. For this case study, the variables x, y, and z will be dropped before creating the linear regression
model.

Keeping the above points in mind, for this dataset, I don't think scaling the data will make much sense.

Question: 1.3.
Encode the data (having string values) for Modelling. Data Split: Split the data into test and train (70:30).
Apply Linear regression. Performance Metrics: Check the performance of Predictions on Train and Test sets
using Rsquare, RMSE.

There are three object columns, which have string values - cut, color, and clarity.

• Check the relationship of cut with price

We can establish that there is an order in the ranking: the mean price increases from Ideal, then Good, to Very
Good, Premium, and Fair. I have therefore encoded cut with an ordinal scale (label encoding) rather than
one-hot encoding:
Ideal = 4; Premium = 3; Very Good = 2; Good = 1; Fair = 0

• Relationship between color and price
Using an ordinal scale (label encoding). The color variable seems to have an impact on the price variable
and should be used further in our linear regression model.
G = 6; E = 5; F = 4; H = 3; D = 2; I = 1; J = 0

• Relationship between clarity and price


Using an ordinal scale (label encoding). The clarity variable also seems to have an impact on the price
variable and should be used further in our linear regression model.
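
A minimal sketch of the ordinal encoding. The cut and color mappings follow the orderings listed above; the clarity ordering is not spelled out in this report, so the mapping shown for it is a hypothetical placeholder:

# Label (ordinal) encoding of the three categorical variables
cut_map = {"Fair": 0, "Good": 1, "Very Good": 2, "Premium": 3, "Ideal": 4}
color_map = {"J": 0, "I": 1, "D": 2, "H": 3, "F": 4, "E": 5, "G": 6}
# Hypothetical clarity ordering, from most to least included
clarity_map = {"I1": 0, "SI2": 1, "SI1": 2, "VS2": 3, "VS1": 4,
               "VVS2": 5, "VVS1": 6, "IF": 7}

df["cut"] = df["cut"].map(cut_map)
df["color"] = df["color"].map(color_map)
df["clarity"] = df["clarity"].map(clarity_map)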

There are several variables (carat, x, y, and z) that demonstrate strong correlation, i.e. multicollinearity.
So, before proceeding with the linear regression model creation, we need to remove some of them from the
model creation exercise. I have decided to drop x, y, and z from my linear regression model and to keep the
carat column, as it has the strongest relation with the target variable price out of the four columns.

Without scaling the data

• Creating two variables, X and Y

• X = df.drop(['price','x','y','z'],axis=1)
Variable X carries the independent variables; the dependent variable and the variables with high
multicollinearity are dropped.

• Y = df.price
Variable Y carries the dependent or target variable.

• Data Split: Dividing the data between training and test sets, with 30% of the data in the test set.

• Building regression model: Apply Linear regression using scikit learn


regression_model = LinearRegression()
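
A minimal sketch consolidating the split and the scikit-learn fit, together with the R-square and RMSE metrics reported in the table below (the random_state value is an assumption):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X = df.drop(["price", "x", "y", "z"], axis=1)
Y = df["price"]

# 70:30 split between training and test sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30, random_state=1)

regression_model = LinearRegression()
regression_model.fit(X_train, Y_train)

# R-square and RMSE on the training and test sets
for name, Xs, Ys in [("Train", X_train, Y_train), ("Test", X_test, Y_test)]:
    pred = regression_model.predict(Xs)
    print(name, "R^2:", round(r2_score(Ys, pred), 2),
          "RMSE:", round(np.sqrt(mean_squared_error(Ys, pred)), 2))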

• Results of the model

                                    Training Data Set    Testing Data Set
Coefficient of Determination R^2    0.87                 0.86
Root Mean Square Error (RMSE)       1447.39              1483.83

• Building regression model: Apply linear regression using statsmodels


model = sm.OLS(Y_trainc,X_trainc).fit()
model.summary()
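
A sketch of the statsmodels fit. Adding an explicit intercept via add_constant is assumed here, since sm.OLS does not add one by default; the X_trainc/Y_trainc names mirror the snippet above:

import statsmodels.api as sm

# Add an explicit intercept column to the predictors
X_trainc = sm.add_constant(X_train)
Y_trainc = Y_train

model = sm.OLS(Y_trainc, X_trainc).fit()
print(model.summary())   # coefficients, R-squared, p-values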

• Results of the model

                                    Training Data Set    Testing Data Set
Coefficient of Determination R^2    0.87                 0.86
Root Mean Square Error (RMSE)       1447.39              1483.83

R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the
coefficient of determination, or the coefficient of multiple determination for multiple regression. A value of 100%
indicates that the model explains all the variability of the response data around its mean. R-squared varies from
0 to 1; a value close to 1 indicates a well-fitted regression line, and our R-squared score of 0.869 signifies good
performance.

Root mean square error (RMSE) on the test set is 1483.83.


RMSE is the standard deviation of the residuals or prediction errors. Residuals are a measure of how far from
the regression line data points are; RMSE is a measure of how spread out these residuals are. In other words, it
tells us how concentrated the data is around the line of best fit.
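
For reference, with n observations, actual prices y_i and predicted prices ŷ_i:

RMSE = sqrt( (1/n) * Σ (y_i − ŷ_i)² )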

• Scatter Plot

The model explains the price predictions very well for lower prices; however, as the prices increase, we see a
wider deviation between the predicted and actual prices.

Question 1.4
Inference: Basis on these predictions, what are the business insights and recommendations.

As expected, Carat is a strong predictor of the overall price of the stone.


Clarity refers to the absence of inclusions and blemishes and has emerged as a strong predictor of price as
well. Stones of clarity types IF, VVS1, VVS2 and VS1 help the firm command a higher price.
Stones of colours H, I and J will not help the firm command a higher price. The company should instead focus
on stones of colours D, E and F to command relatively higher price points and support sales.
This can also indicate that the company should look to come up with new colours of stones, such as clear
stones or a different/unique colour, that impact the price positively.
The company should focus on the stones' carat and clarity in order to increase their prices. Ideal customers will
also contribute to more profits. The marketing efforts can be used to educate customers about the importance
of a better carat score and of the clarity index. Post this, the company can build segments and target customers
based on their income/paying capacity etc., which can be studied further.

Problem 2: Logistic Regression and LDA

You are hired by a tour and travel agency which deals in selling holiday packages. You are provided details of
872 employees of a company. Among these employees, some opted for the package and some didn't. You
have to help the company in predicting whether an employee will opt for the package or not on the basis of
the information given in the data set. Also, find out the important factors on the basis of which the company
will focus on particular employees to sell their packages.

Data Dictionary:

Variable Name        Description

Holiday_Package      Opted for holiday package (yes/no)
Salary               Employee salary
age                  Age in years
edu                  Years of formal education
no_young_children    Number of young children (younger than 7 years)
no_older_children    Number of older children
foreign              Foreigner (yes/no)

Question 2.1
Data Ingestion: Read the dataset. Do the descriptive statistics and do null value condition check, write an
inference on it. Perform Univariate and Bivariate Analysis. Do exploratory data analysis.

Step 1:

• Importing the dataset - “Holiday_Package.csv”

• Data Summary and Exploratory Data Analysis:

• Checking if the data is being imported properly

• Dimension of the Dataset: The dimension or shape of the dataset can be shown using the .shape attribute.
It shows that the dataset given to us has 872 rows and 8 columns or variables.

• Dropping the 'Unnamed: 0' column and displaying the structure of the dataset: the structure can be
computed using the .info() function.

• Checking for duplicates: there are no duplicate rows in the dataset, as computed using the .duplicated()
function.
• We have dropped the first column 'Unnamed: 0' as it is not important for our study. The shape is now
872 rows and 7 columns.

Step 2: Univariate Analysis

Summary of the Dataset

• Holiday Package – This is a categorical variable and will be our target variable.
• Salary, age, educ, no_young_children and no_older_children are numerical or continuous variables.
• Salary ranges from 1322 to 236961. The average salary of employees is around 47729 with a standard
deviation of 23418, indicating that the data is not normally distributed. A skew of 0.71 indicates that the
data is right skewed and a few employees earn well above the average of 47729. 75% of the employees
earn below 53469, while 25% of the employees earn below 35324.
• Age of the employees ranges from 20 to 62. The median is around 39; 25% of the employees are below 32
and 25% are above 48. The standard deviation is around 10, indicating an almost normal distribution.
• Years of formal education ranges from 1 to 21 years. 25% of the population has 8 years of formal
education, the median is around 9 years, and 75% of the employees have 12 years of formal education.
The standard deviation of education is around 3. This variable also shows skewness in the data.
• Foreign is a categorical variable.
• We have dropped the first column 'Unnamed: 0' as it is not important for our study; it only contains
serial numbers and is not required for further analysis. The shape is now 872 rows and 7 columns.
• There are no null values.
• There are no duplicates.

Analyzing the categorical variable

• Nearly 54% of employees have not opted for holiday package

• We observe that only 25% of employees are of foreign Origin

We can observe that there are significant outliers present in the variable 'Salary'; however, there are minimal
outliers in other variables like 'educ', 'no_young_children' and 'no_older_children'. There are no outliers in the
variable 'age'.
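
A minimal sketch of the univariate checks behind these observations; df_holiday is the assumed dataframe name, matching the snippet used later for modelling:

import matplotlib.pyplot as plt
import seaborn as sns

# Proportions of the two categorical variables
print(df_holiday["Holliday_Package"].value_counts(normalize=True))
print(df_holiday["foreign"].value_counts(normalize=True))

# Boxplots of the numeric variables to check for outliers
for col in ["Salary", "age", "educ", "no_young_children", "no_older_children"]:
    sns.boxplot(x=df_holiday[col])
    plt.title(col)
    plt.show()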

Step 3: Bivariate Analysis:

Holiday & Salary

There is a difference in the salary structure, as well as in its spread, between people opting and not opting for the holiday package.

Holiday Package and Age

People in the lower age bracket seem more likely to opt for the holiday package than those in the higher age bracket.

Holiday Package & educ

We observe that employees with fewer years of formal education (1 to 7 years) and those with many years of
higher education opt for the holiday package less than employees with 8 to 12 years of formal education.

Holiday Package & No of Young Children

There is a significant difference between employees with young children who opt for the holiday package and
those who do not: people with young children tend not to opt for holiday packages.

Holiday package & Number of Old Children

No major significant difference, hence probably not a good predictor.

Checking for correlation amongst the variable:

• We can see that there isn't any strong correlation between the variables. Salary and education display
moderate correlation, and no_older_children is somewhat correlated with the Salary variable. However,
there are no strong correlations in the dataset.

Question: 2.2
Do not scale the data. Encode the data (having string values) for Modelling. Data Split: Split the data into
train and test (70:30). Apply Logistic Regression and LDA (linear discriminant analysis).

There are two object-type variables in the dataset, "foreign" and "Holliday_Package"; both are encoded, as sketched below.
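
A minimal sketch of the encoding step; converting the two object columns to pandas category codes is an assumption, as the report does not state the exact method used:

import pandas as pd

# Encode the two object columns as numeric category codes (e.g. no = 0, yes = 1)
for col in ["Holliday_Package", "foreign"]:
    df_holiday[col] = pd.Categorical(df_holiday[col]).codes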

• Data Split: Dividing the data between training and test sets, with 30% of the data in the test set.

• Building regression model:

Segregating the dependent and independent variables


X2 = df_holiday.drop(['Holliday_Package'],axis=1)
Y2 = df_holiday.Holliday_Package
Model:
Logistic regression: model = LogisticRegression(solver='liblinear')
Linear discriminant analysis: clf = LinearDiscriminantAnalysis()

                                      Training Data    Test Data
Accuracy Score Logistic Regression    64.91%           66.93%
Accuracy Score LDA                    65.08%           68.57%

Question 2.3
Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy,
Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model Final Model: Compare Both the
models and write inference which model is best/optimized.

Logistic regression:
Looking at the metrics for both the training and the test data, the accuracy scores are comparable at
around 65-66%, so the model can be treated as a right-fit model; it is not struggling with overfitting or
underfitting.
The AUC score is the same for the training and test data, at 71%.
The model also performs reasonably on the F1 score, with both train and test sets giving a score of 63%.
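
A sketch of how these metrics can be computed for the logistic regression model; the same calls apply to the LDA model clf:

import matplotlib.pyplot as plt
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve

for name, Xs, Ys in [("Training Data Set", X2_train, Y2_train),
                     ("Test Data Set", X2_test, Y2_test)]:
    pred = model.predict(Xs)
    prob = model.predict_proba(Xs)[:, 1]
    print(name)
    print(confusion_matrix(Ys, pred))
    print(classification_report(Ys, pred))        # accuracy, precision, recall, F1
    print("AUC:", round(roc_auc_score(Ys, prob), 2))
    fpr, tpr, _ = roc_curve(Ys, prob)
    plt.plot(fpr, tpr, label=name)                # ROC curve

plt.plot([0, 1], [0, 1], linestyle="--")          # chance line
plt.legend()
plt.show()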

Test Data set:

Training Data Set:

To summarize, our model scores are not very good, but it appears to be a right-fit model, avoiding the underfit
and overfit scenarios. The training data performs slightly better than the test data, though the difference is
not large. The point to note is that the F1 score performs better on the training data and dips slightly on the
test data. We can study the data further and do additional data engineering to improve the scores, as needed
for the business case.

Linear Discriminant Analysis (LDA)

Looking at the metrics for both the training and the test data, the accuracy scores are comparable at
around 65-68%, so the model can be treated as a right-fit model; it is not struggling with overfitting or
underfitting.
The AUC scores for the training and test data are comparable, at 71% and 72% respectively.
The model also performs reasonably on the F1 score, with both train and test sets giving a score of 63%.
The model gives almost the same results as logistic regression.

Logistic Regression vs LDA

Question 2.4
Inference: Basis on these predictions, what are the insights and recommendations

As interpretation:
1) There is no plausible effect of salary, age, and education on the prediction for Holliday_Package.
These variables do not seem to impact the decision to opt for holiday packages, as we could not
establish a strong relation between these variables and the target variable.

2) Foreign has emerged as a strong predictor with a positive coefficient value. The likelihood of a
foreigner opting for a holiday package is high.

3) The no_young_children variable reduces the probability of opting for holiday packages, especially
for couples with two young children.

The company can try to bin salary ranges to see if they can derive more meaningful interpretations out
of that variable, or club salary or age into different buckets and check whether there is a plausible
impact on the prediction. Otherwise, the business can use different modelling techniques to do a
deeper dive.

Recommendations:
1) The company should focus on foreigners to drive the sales of its holiday packages, as that is where
the majority of conversions are going to come from.

2) The company should stay away from targeting parents with young children; the chance of selling to
parents with two young children is probably the lowest. This also gels with the fact that parents tend
to avoid travelling with young children.

3) If the firm wants to target parents with older children, that might still give a more favourable return
on its marketing efforts than spending on couples with young children.

