
Problem 1: Linear Regression

You are hired by Gemstone Co. Ltd., a cubic zirconia manufacturer. You are provided
with a dataset containing the prices and other attributes of almost 27,000 cubic zirconia
stones (an inexpensive diamond alternative with many of the same qualities as a diamond).
The company earns different profits on different price slots. You must help the company
predict the price of a stone on the basis of the details given in the dataset, so that it can
distinguish between higher-profit and lower-profit stones and improve its profit share.
Also, provide the five attributes that are most important.
Dataset for Problem 1: cubic_zirconia.csv

Data Dictionary:
Variable Name  Description
Carat  Carat weight of the cubic zirconia.
Cut  Cut quality of the cubic zirconia, in increasing order of quality: Fair, Good, Very Good, Premium, Ideal.
Color  Colour of the cubic zirconia, with D being the best and J the worst.
Clarity  Absence of inclusions and blemishes, in order from best to worst in terms of average price: IF, VVS1, VVS2, VS1, VS2, SI1, SI2, I1.
Depth  Height of the cubic zirconia, measured from the culet to the table, divided by its average girdle diameter.
Table  Width of the cubic zirconia's table expressed as a percentage of its average diameter.
Price  Price of the cubic zirconia.
X  Length of the cubic zirconia in mm.
Y  Width of the cubic zirconia in mm.
Z  Height of the cubic zirconia in mm.

1) Read the data and do exploratory data analysis. Describe the data briefly. (Check the null
values, Data types, shape, EDA, duplicate values). Perform Univariate and Bivariate
Analysis.
2) Impute null values if present, and also check for values that are equal to zero. Do they have
any meaning, or do we need to change or drop them? Check for the possibility of
combining the sub-levels of the ordinal variables and take action accordingly. Explain why
you are combining these sub-levels with appropriate reasoning.
3) Encode the data (having string values) for modelling. Split the data into train and test
(70:30). Apply linear regression using scikit-learn. Perform checks for significant variables
using an appropriate method from statsmodels. Create multiple models and check the
performance of predictions on the train and test sets using R-squared, RMSE & adjusted
R-squared. Compare these models and select the best one with appropriate reasoning.
4) Inference: Based on these predictions, what are the business insights and recommendations?
5) Please explain and summarise the various steps performed in this project. There should be
proper business interpretation and actionable insights present.
Read the data and do exploratory data analysis. Describe the data briefly. (Check the null
values, Data types, shape, EDA, duplicate values). Perform Univariate and Bivariate Analysis
 First, we load the file using pd.read_csv(). Once the file is loaded, we use
df.head() to see how many columns there are and what type of data we are working with.

 We see that we have 11 columns in total, of which 'Unnamed: 0' is a serial-number
column that does not influence the price of the stone, so we drop it. This is the first piece of
data analysis that we do: deciding which columns are and are not relevant to our problem.
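
 Below is a minimal sketch of these first steps, assuming the file is in the working directory and that the serial-number column carries pandas' default name 'Unnamed: 0':

import pandas as pd

# Load the dataset and peek at the first few rows
df = pd.read_csv('cubic_zirconia.csv')
print(df.head())

# Drop the serial-number column, which carries no information about price
df = df.drop(columns=['Unnamed: 0'])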

 Now that we have dropped the unnecessary column, we use the function df.info()
to see how the data is captured: the data type of each column, how many rows and
columns there are, and how many rows contain null values.
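The check itself is a single call:

# Data types, non-null counts and shape in one call
df.info()
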
 Observations:
o There are 26,967 entries in total, with indexes ranging from 0 to 26966.
o There are ten columns in total.
o The columns cut, color and clarity are of the object data type, the columns carat,
depth, table, x, y and z are of the float64 data type, and price is of the int64 data type.
o The depth column has some missing values.
 Next we use the df.describe() function to get a basic summary of the entire data.
This gives us the count, mean, standard deviation, minimum, maximum, and the 25th, 50th
and 75th percentiles of the numerical columns.

 Now we check whether there are any duplicate values in the table, using the function
df.duplicated(). We find 34 duplicate rows in total, so we remove them from our
data. Once they are removed, we have a total of 26,933 rows and 10 columns.
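
 A sketch of the summary and duplicate checks described above:

# Count, mean, std, min, max and quartiles of the numerical columns
print(df.describe())

# Count exact duplicate rows, then drop them
print(df.duplicated().sum())   # 34 in our run
df = df.drop_duplicates()
print(df.shape)                # expected: (26933, 10)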
Impute null values if present, and also check for values that are equal to zero. Do they have
any meaning, or do we need to change or drop them? Check for the possibility of
combining the sub-levels of the ordinal variables and take action accordingly. Explain why
you are combining these sub-levels with appropriate reasoning.
 As we saw in the basic analysis, some columns have null values. To make our
further analysis easier and more accurate, we need to replace those null values with proper
values. This can be done using the mean or median of the column, or by removing entirely
the rows that contain nulls, which is not advisable in most cases. In our case we use the
column mean as the replacement value for all columns that have null values.

 Once we do that, we see the results below, where no field has any null values left.
This makes the data cleaner for further analysis.
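
 A minimal sketch of the mean imputation, assuming depth is the only column with nulls (as observed above):

# Replace missing depth values with the column mean
df['depth'] = df['depth'].fillna(df['depth'].mean())

# Confirm that no nulls remain in any column
print(df.isnull().sum())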

 Here we see that the columns cut, color and clarity all hold ordinal string values.
We cannot feed these strings to the model directly, because arithmetic on category labels is
meaningless. For example, cut takes the values Fair, Good, Very Good, Premium and Ideal,
and color takes the values D, E, F, G, H, I and J. An average of such a variable does not
make sense, because there is no intrinsic numeric spacing between its levels: even if we
assigned numbers to the levels, the spacing between them would be uneven, so the meaning
of the average would be very questionable. In short, an average requires a variable to be
numerical. To achieve this, we convert each of these columns into dummy columns, one per
level, containing the Boolean values 0 and 1. So, for example, if a particular stone has a
Fair cut, the cut_Fair column holds 1 for that row; if it has a Very Good cut, the
cut_Very_Good column holds 1, and so on. Once we do that, we get the new table below,
and the information of the whole table is as shown.
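
 A sketch of the encoding step using pandas' get_dummies; drop_first=True drops one redundant level per variable, which matches the 24-column result below, and the renaming line is an assumption to make level names such as 'Very Good' usable in a statsmodels formula later:

# One-hot encode the three object columns into 0/1 dummy columns
df = pd.get_dummies(df, columns=['cut', 'color', 'clarity'], drop_first=True)

# Replace spaces in the generated names (e.g. 'cut_Very Good' -> 'cut_Very_Good')
df.columns = [c.replace(' ', '_') for c in df.columns]

df.info()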

 Notice that we now have 24 columns, because we have converted all the object
columns into integer dummy columns, giving us proper quantitative data for our
analysis. The new data consists of the fields below.
 Using this new data, we can now carry out further analysis.
 Heatmap
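
 The heatmap here is presumably a pairwise-correlation plot; a minimal sketch, assuming seaborn and matplotlib are available:

import matplotlib.pyplot as plt
import seaborn as sns

# Correlation heatmap of the (now fully numerical) columns
plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.show()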

 Checking for outliers
 All the numerical variables have outliers. Treating them may alter the characteristics
of the data set and of the model itself, so the outliers are left untreated.
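
 A sketch of the outlier check via boxplots (the column list is assumed from the data dictionary):

import matplotlib.pyplot as plt

# One boxplot per numerical column to eyeball the outliers
num_cols = ['carat', 'depth', 'table', 'x', 'y', 'z', 'price']
df[num_cols].plot(kind='box', subplots=True, layout=(2, 4), figsize=(14, 6))
plt.tight_layout()
plt.show()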
Encode the data (having string values) for modelling. Split the data into train and
test (70:30). Apply linear regression using scikit-learn. Perform checks for
significant variables using an appropriate method from statsmodels. Create multiple
models and check the performance of predictions on the train and test sets using
R-squared, RMSE & adjusted R-squared. Compare these models and select the best
one with appropriate reasoning.
 For our case we make a train-test split of 70:30, i.e. 70% of the data is used for
training and 30% for testing.
 To do this we import the train_test_split function from the
sklearn.model_selection package, which splits the data for us based on the criteria we
give; in our case, 30% for the test data.
 Once the split is done, we invoke the LinearRegression estimator and fit the best
model on the training data.
 After that we can see how each individual variable is associated with
the target variable, price. The output below shows this.
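
 A sketch of the split and fit; random_state is fixed only so the split is reproducible, and the seed value is an assumption:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Separate predictors from the target
X = df.drop(columns=['price'])
y = df['price']

# 70:30 train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

# Fit ordinary least squares on the training data
regression_model = LinearRegression()
regression_model.fit(X_train, y_train)

# Coefficient of each predictor with respect to price
for col, coef in zip(X_train.columns, regression_model.coef_):
    print(col, round(coef, 2))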

 Now we check the intercept. The intercept (often labelled the constant) is the point
where the regression function crosses the y-axis. To find it we use
regression_model.intercept_; in our case the intercept is -3816.77.
 R Squared
o R-Squared is a statistical measure of fit that indicates how much variation of a
dependent variable is explained by the independent variable(s) in a regression.
o On training data we get the R Squared score as 0.940415 or 94.04%
o On Test Data we get the R Squared score as 0.941205 or 94.12%
o R-squared can vary from 0 to 1, and values close to 1 indicate a good fit; at roughly
0.94 our model fits the data well.
o About 94% of the variance of the dependent variable is explained by the
variance of the independent variables.
 RMSE
o Root Mean Square Error (RMSE) is a standard way to measure the error of a model
in predicting quantitative data
o RMSE on Training data is 847.2198130286624
o RMSE on Testing data is 841.8096631562591
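
 A sketch of how these scores can be computed with scikit-learn:

import numpy as np
from sklearn.metrics import mean_squared_error

# R-squared on the train and test sets
print('Train R^2:', regression_model.score(X_train, y_train))
print('Test  R^2:', regression_model.score(X_test, y_test))

# RMSE is the square root of the mean squared error
print('Train RMSE:', np.sqrt(mean_squared_error(y_train, regression_model.predict(X_train))))
print('Test  RMSE:', np.sqrt(mean_squared_error(y_test, regression_model.predict(X_test))))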
 Linear regression using statsmodels
o Here we first concatenate the X and y that we used previously for our train and test
data back into a single data frame.
o Once that is done, we rename the columns as needed.
o Next we define the expression for our analysis. The expression we used is as
follows.
expr = 'price ~ carat + depth + x + y + z + cut_Good + cut_Ideal + cut_Premium +
cut_Very_Good + color_E + color_F + color_G + color_H + color_I + color_J +
clarity_IF + clarity_SI1 + clarity_SI2 + clarity_VS1 + clarity_VS2 + clarity_VVS1 +
clarity_VVS2'
o statsmodels is a Python module that provides classes and functions for estimating
many different statistical models, as well as for conducting statistical tests and
statistical data exploration. An extensive list of result statistics is available for each
estimator. It is canonically imported using import statsmodels.formula.api as smf.
o We fit an OLS model on the training data and name the result lm1. The call we use
is as below.
lm1 = smf.ols(formula=expr, data=data_train).fit()
o We now use lm1.params to get all the fitted parameters (the outputs for the train fit
and the test fit are shown here).

o We can then use lm1.summary() to get the summary of the fit. Note that the summary
shows every detail of the analysis, including the significance (p-value) of each variable.
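
 Putting the statsmodels steps together, assuming the expr string and the train split defined earlier:

import pandas as pd
import statsmodels.formula.api as smf

# Recombine predictors and target so the formula interface sees both
data_train = pd.concat([X_train, y_train], axis=1)

# Fit OLS and inspect the results
lm1 = smf.ols(formula=expr, data=data_train).fit()
print(lm1.params)     # fitted coefficients
print(lm1.summary())  # R-squared, adj. R-squared, p-value per variable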
Inference: Based on these predictions, what are the business insights and
recommendations?
 The final linear regression equation is:

price = b0 + b1*carat + b2*depth + b3*x + b4*y + b5*z + b6*cut_Fair + b7*cut_Good +
b8*cut_Ideal + b9*cut_Premium + b10*cut_Very_Good + b11*color_D + b12*color_E +
b13*color_F + b14*color_G + b15*color_H + b16*color_I + b17*color_J +
b18*clarity_I1 + b19*clarity_IF + b20*clarity_SI1 + b21*clarity_SI2 + b22*clarity_VS1 +
b23*clarity_VS2 + b24*clarity_VVS1 + b25*clarity_VVS2

With the fitted coefficients substituted in:

price = (-872.29) + (9110.65)*carat + (0.06)*depth + (-1177.09)*x + (884.95)*y +
(-275.52)*z + (-694.39)*cut_Fair + (-222.61)*cut_Good + (81.6)*cut_Ideal +
(18.47)*cut_Premium + (-55.37)*cut_Very_Good + (592.17)*color_D + (368.1)*color_E +
(328.3)*color_F + (177.53)*color_G + (-267.33)*color_H + (-719.2)*color_I +
(-1351.85)*color_J + (-3061.65)*clarity_I1 + (1175.33)*clarity_IF +
(-312.37)*clarity_SI1 + (-1149.45)*clarity_SI2 + (478.37)*clarity_VS1 +
(202.37)*clarity_VS2 + (919.36)*clarity_VVS1 + (875.75)*clarity_VVS2

 When carat increases by 1 unit, the price increases by 9110.65 units, keeping all other
predictors constant. When depth increases by 1 unit, the price increases by only 0.06
units, keeping all other predictors constant.
 Relative to the baseline, each cut level shifts the price as below, keeping all
other predictors constant:
o Fair cut decreases the price by 694.39 units.
o Good cut decreases the price by 222.61 units.
o Very Good cut decreases the price by 55.37 units.
o Ideal cut increases the price by 81.6 units.
o Premium cut increases the price by 18.47 units.
 Relative to the baseline, each colour level shifts the price as below, keeping all other
predictors constant:
o D increases the price by 592.17 units.
o E increases the price by 368.1 units.
o F increases the price by 328.3 units.
o G increases the price by 177.53 units.
o H decreases the price by 267.33 units.
o I decreases the price by 719.2 units.
o J decreases the price by 1351.85 units.
Problem 2: Logistic Regression and LDA
You are hired by a tour and travel agency that deals in selling holiday packages. You are
provided details of 872 employees of a company. Among these employees, some opted for
the package and some didn't. You must help the company predict whether an employee
will opt for the package or not, based on the information given in the data set. Also, find
the important factors on the basis of which the company should target employees to sell
its packages.
Dataset for Problem 2: Holiday_Package.csv

Data Dictionary:
Variable Name  Description
Holiday_Package  Opted for the holiday package (yes/no)
Salary  Employee salary
age  Age in years
edu  Years of formal education
no_young_children  Number of young children (younger than 7 years)
no_older_children  Number of older children
foreign  Foreigner (yes/no)
1. Data Ingestion: Read the dataset. Perform the descriptive statistics and the null-value
check, and write an inference on them. Perform univariate and bivariate analysis. Do
exploratory data analysis.
2. Do not scale the data. Encode the data (having string values) for Modelling. Data Split:
Split the data into train and test (70:30). Apply Logistic Regression and LDA (linear
discriminant analysis).
3. Performance Metrics: Check the performance of predictions on the train and test sets using
accuracy and the confusion matrix; plot the ROC curve and get the ROC_AUC score for
each model.
Final Model: Compare both models and write an inference on which model is
best/optimized.
4. Inference: Based on these predictions, what are the insights and recommendations?
5. Please explain and summarise the various steps performed in this project. There should be
proper business interpretation and actionable insights present.
