
GREAT LEARNING 2021

Project - Predictive Modeling


Linear Regression, Logistic Regression and LDA
Karthikeyan M
6/27/2021
Problem 1: Linear Regression

You are hired by a company, Gem Stones co ltd, which is a cubic zirconia manufacturer. You are provided with a
dataset containing the prices and other attributes of almost 27,000 cubic zirconia stones (an inexpensive
diamond alternative with many of the same qualities as a diamond). The company earns different profits in
different price slots. You have to help the company predict the price of a stone on the basis of the
details given in the dataset, so it can distinguish between higher-profit and lower-profit stones and
thereby improve its profit share. Also, provide them with the 5 attributes that are most important.

Data Dictionary:

Variable Name Description

Carat: Carat weight of the cubic zirconia.

Cut: Cut quality of the cubic zirconia, in increasing order of quality: Fair, Good, Very Good, Premium, Ideal.

Color: Colour of the cubic zirconia, with D being the best and J the worst.

Clarity: Refers to the absence of inclusions and blemishes. In order from best to worst: FL, IF, VVS1, VVS2, VS1, VS2, SI1, SI2, I1, I2, I3 (FL = flawless, I3 = level 3 inclusions).

Depth: Height of the cubic zirconia, measured from the culet to the table, divided by its average girdle diameter.

Table: Width of the cubic zirconia's table expressed as a percentage of its average diameter.

Price: Price of the cubic zirconia.

X: Length of the cubic zirconia in mm.

Y: Width of the cubic zirconia in mm.

Z: Height of the cubic zirconia in mm.


1.1. Read the data and do exploratory data analysis. Describe the data briefly. (Check the null values, Data
types, shape, EDA). Perform Univariate and Bivariate Analysis.

Loading all the necessary libraries and checking that the data has loaded correctly, along with its basic information.
The target variable is price.
Among the other variables, cut, color and clarity are categorical, whereas carat, depth, table, x, y and z are
continuous.
Checking for Null Values:

There are 697 null values in depth, which is less than 3% of the total rows.

Checking for Duplicate Values:

There are no duplicate rows in the data.
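The load-and-check steps above can be sketched as follows. The frame here is a small illustrative stand-in for the real ~27,000-row dataset, and the commented-out filename is an assumption, not the report's actual path:

```python
import numpy as np
import pandas as pd

# Hypothetical loading step -- substitute the actual dataset path:
# df = pd.read_csv("cubic_zirconia.csv")

# Tiny illustrative frame standing in for the real data
df = pd.DataFrame({
    "carat": [0.3, 0.7, 1.1, 0.4],
    "cut":   ["Ideal", "Premium", "Ideal", "Good"],
    "depth": [61.5, np.nan, 62.0, 61.5],
    "price": [500, 1500, 4000, 700],
})

print(df.shape)               # (rows, columns)
print(df.dtypes)              # data type of each column
print(df.isnull().sum())      # null count per column (depth has nulls)
print(df.duplicated().sum())  # count of fully duplicated rows
```

On the real data these same calls give the 697 depth nulls and zero duplicate rows reported above.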


Checking the unique values of the categorical variables:

• In cut, there are five unique values: Fair, Good, Very Good, Premium and Ideal. The Ideal cut appears to be
the most preferred cut.
• There are 7 different colors in the data set.
• There are 8 different values for clarity.
Univariate / Bivariate analysis

• The data for carat is positively skewed, and there may be multiple modes, as the distribution shows
several peaks.
• The data for depth is normally distributed with a single peak, spread between 55 and 70.
• The data for price is positively skewed.
• The data for table is positively skewed, again with the possibility of multiple modes given the several
peaks in the distribution.
• The data for x, y and z is positively skewed, with x possibly being multimodal.

• The data for carat, depth, table, price, x, y and z all contain outliers.
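The skewness claims above can be checked numerically with pandas' skew(); positive values indicate a right (positive) skew. The small frame below is illustrative, not the report's data:

```python
import pandas as pd

# Illustrative values: "carat" has a long right tail, "depth" is symmetric
df = pd.DataFrame({
    "carat": [0.2, 0.3, 0.3, 0.4, 0.5, 0.7, 1.0, 2.0],
    "depth": [60.0, 61.0, 61.5, 62.0, 62.0, 62.5, 63.0, 64.0],
})

skew = df.skew()
print(skew)  # carat skews right (positive); depth is near zero
```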
Count Plots:

The count plot clearly shows that the count increases with the quality of the cut, and the Ideal cut appears to be
the most preferred.
The plot between cut and price shows that the Ideal cut tends to be cheaper, which is likely why it is also the most
preferred cut.

The plot on color shows that G appears to be the most preferred color.

The color G, which is the most preferred, has a median price point.


SI1 is the most common clarity grade.
Data Distribution:
The correlation matrix clearly shows the presence of multicollinearity in the data.
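A correlation matrix like the one behind this observation can be produced with DataFrame.corr(); the synthetic data below mimics the pattern in the real set, where x, y, z and carat track each other closely:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
carat = rng.uniform(0.2, 2.0, 200)
# x and y track carat closely -- this is what drives the multicollinearity
x = carat * 3.0 + rng.normal(0, 0.05, 200)
y = x + rng.normal(0, 0.05, 200)
price = carat * 4000 + rng.normal(0, 300, 200)

df = pd.DataFrame({"carat": carat, "x": x, "y": y, "price": price})
corr = df.corr()
print(corr.round(2))  # off-diagonal values near 1 flag multicollinearity
```

In practice the matrix is usually rendered as a seaborn heatmap, which is what the report's figure shows.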
1.2 Impute null values if present, also check for the values which are equal to zero. Do they have any meaning
or do we need to change them or drop them? Do you think scaling is necessary in this case?

Checking for Null Values:

As noted above, there are 697 null values in depth, which is less than 3% of the total rows.

After performing median imputation on the depth column, there are no null values.
Some rows have 0 in x, y or z. Since x, y and z denote the dimensions of the stone, a zero value is meaningless,
so these rows can be dropped.
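The imputation and zero-row handling can be sketched as below; the frame is illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "depth": [61.0, np.nan, 62.0, 61.5],
    "x": [3.9, 4.0, 0.0, 4.2],
    "y": [3.9, 4.1, 0.0, 4.2],
    "z": [2.4, 2.5, 0.0, 2.6],
})

# Median imputation for the null depth values
df["depth"] = df["depth"].fillna(df["depth"].median())

# A zero dimension is physically meaningless, so drop those rows
df = df[(df[["x", "y", "z"]] != 0).all(axis=1)]
print(df.shape)  # one row with zero dimensions removed
```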

SCALING

From the correlation matrix presented above, we can clearly see the presence of multicollinearity in
the data. Scaling helps to mitigate the multicollinearity, and it does not affect the coefficients or
intercept of the model.

The variance inflation factor (VIF) computed after scaling shows that the multicollinearity has been taken care of.
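One way to sketch the VIF check: for standardized predictors, the VIFs are the diagonal of the inverse correlation matrix (the report likely used statsmodels' variance_inflation_factor instead; this identity gives the same numbers). Values above roughly 5 flag strong multicollinearity. The data below is synthetic:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
carat = rng.uniform(0.2, 2.0, 300)
x = carat * 3 + rng.normal(0, 0.05, 300)   # nearly collinear with carat
table = rng.uniform(52, 62, 300)           # independent predictor
X = pd.DataFrame({"carat": carat, "x": x, "table": table})

# VIF = diagonal of the inverse correlation matrix
corr_inv = np.linalg.inv(X.corr().to_numpy())
vif = pd.Series(np.diag(corr_inv), index=X.columns)
print(vif.round(1))  # carat and x show high VIF; table stays near 1
```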
Treating outliers:
1.3 Encode the data (having string values) for Modelling. Data Split: Split the data into train and test (70:30).
Apply Linear regression. Performance Metrics: Check the performance of Predictions on Train and Test sets
using Rsquare, RMSE.

A linear regression model cannot take categorical data directly, hence the categorical columns are encoded
with dummies.

Dropping the unnamed index column as it does not carry any meaning, and separating the target variable from
the predictors.

Splitting the data into training and test sets.


Linear Regression Model:

The R-square values of the train and test data are as follows.

The root mean square errors of the training and test data are as follows.
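The encode / split / fit / score pipeline above can be sketched end to end; the frame below is a synthetic stand-in (the real data has cut, color and clarity one-hot encoded the same way):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
n = 400
df = pd.DataFrame({
    "carat": rng.uniform(0.2, 2.0, n),
    "cut": rng.choice(["Fair", "Good", "Ideal"], n),
})
df["price"] = 4000 * df["carat"] + rng.normal(0, 200, n)

# One-hot encode the string column, dropping the first level
X = pd.get_dummies(df.drop(columns="price"), drop_first=True)
y = df["price"]

# 70:30 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

model = LinearRegression().fit(X_train, y_train)
r2_train = model.score(X_train, y_train)
r2_test = model.score(X_test, y_test)
rmse_test = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
print(r2_train, r2_test, rmse_test)
```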


Stats Model:
After dropping the depth variable, which has little impact on the model, the summary looks like:
1.4 Inference: Basis on these predictions, what are the business insights and recommendations.

From the initial data analysis we understood that the Ideal cut has better price points than the other cuts, hence
providing better profit to the company. The colors H, I and J have better price points, and G has the median price
point. In clarity there are no flawless (FL) stones, hence flawless has no relation with profit.

The stats model shows which variables have little effect; only depth appears to be insignificant, so the
depth variable was dropped and the equation formed.

The depth column was dropped for better accuracy.

The equation:

price = (-0.83) * Intercept + (1.24) * carat + (-0.02) * table + (-0.37) * x + (0.32) * y +
(-0.12) * z + (0.11) * cut_Good + (0.18) * cut_Ideal + (0.17) * cut_Premium + (0.15) *
cut_Very_Good + (-0.05) * color_E + (-0.07) * color_F + (-0.12) * color_G + (-0.24) * color_H +
(-0.38) * color_I + (-0.54) * color_J + (1.16) * clarity_IF + (0.74) * clarity_SI1 + (0.5) *
clarity_SI2 + (0.97) * clarity_VS1 + (0.89) * clarity_VS2 + (1.09) * clarity_VVS1 + (1.08) *
clarity_VVS2

The best attributes are:

• The clarity dummies (clarity_IF, clarity_SI1, clarity_SI2, clarity_VS1, clarity_VS2, clarity_VVS1,
clarity_VVS2)
• Carat
• Y, the diameter

Recommendations

• The cut types Ideal, Premium and Very Good bring in more profit, hence more marketing can be done
around them to bring in more profit.
• Clarity has more importance: the clearer the stone, the more the profit.
• The diameter is the next important attribute, and its median is 5.71, hence stones can be cut around
this size to make more profit.
Problem 2: Logistic Regression and LDA

You are hired by a tour and travel agency which deals in selling holiday packages. You are provided details of 872
employees of a company. Among these employees, some opted for the package and some didn't. You have to
help the company in predicting whether an employee will opt for the package or not on the basis of the
information given in the data set. Also, find out the important factors on the basis of which the company will
focus on particular employees to sell their packages.

Data Dictionary:

Variable Name Description

Holiday_Package: Opted for holiday package (yes/no)

Salary: Employee salary

age: Age in years

edu: Years of formal education

no_young_children: Number of young children (younger than 7 years)

no_older_children: Number of older children

foreign: Foreigner (yes/no)


2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do null value condition check, write an
inference on it. Perform Univariate and Bivariate Analysis. Do exploratory data analysis.

Loading all necessary libraries.

Checking whether the data has been loaded properly.

Checking the data types: there are two categorical variables, Holiday_Package and foreign. The data set has
872 rows and 8 columns.
Checking for Null Values

There are no null values in the data set

Data Describe:

Holiday package is our target variable.

Salary, age, educ, number of young children, number of older children and whether the employee is a foreigner
are the attributes to be checked to help the company predict whether a person will opt for the holiday
package or not.

Check for duplicate rows:

There are no duplicate values present in the data


Unique values present in the categorical variables as follows,

Univariate & Bivariate Analysis:

• Salary is positively skewed, age is normally distributed, educ has multiple peaks, and the numbers of
young children and older children are positively skewed with more than one peak.

• The salary data has a lot of outliers, whereas the other variables have fewer.
Data Distribution

There is no clear separation between the two classes, as the data distributions do not differ much between
the holiday-package groups.
There is no multicollinearity present in the data.

Holiday Package vs salary

There is a clear indication that people with a salary of more than 1,50,000 have always opted for holiday packages.
Holiday package vs Educ

Holiday Package vs age


Treating the outlier data:

Holiday package vs no. of young children

Holiday package vs no. of older children


Outlier Treatment:
2.2 Do not scale the data. Encode the data (having string values) for Modelling. Data Split: Split the data into
train and test (70:30). Apply Logistic Regression and LDA (linear discriminant analysis).

Encoding the string variables will help the logistic regression model produce better predictions.

Splitting the data into training and test sets.

The stratified split preserves the class proportions in both splits.
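The encode-and-split step can be sketched as below. The frame is a synthetic stand-in for the 872-row holiday data, reusing the column names from the data dictionary:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
n = 200
df = pd.DataFrame({
    "Salary": rng.uniform(20000, 200000, n),
    "foreign": rng.choice(["yes", "no"], n),
    "Holiday_Package": rng.choice(["yes", "no"], n, p=[0.45, 0.55]),
})

# Encode the string columns to numeric codes
for col in ["foreign", "Holiday_Package"]:
    df[col] = df[col].map({"no": 0, "yes": 1})

X = df.drop(columns="Holiday_Package")
y = df["Holiday_Package"]

# Stratified 70:30 split keeps the class proportions in both halves
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y)
print(y_train.mean().round(2), y_test.mean().round(2))  # nearly equal
```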


Grid Search Method:

Performed logistic regression, applying the grid search method, which helps to find the optimal solver and
parameters for logistic regression.

The liblinear solver is suggested, and the penalty and tolerance level have been found.
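A minimal sketch of that grid search over solver, penalty and tolerance; the grid values and the synthetic data are assumptions, not the report's exact setup:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary-classification data standing in for the holiday frame
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

grid = {
    "solver": ["liblinear", "lbfgs"],
    "penalty": ["l2"],          # l2 is valid for both solvers
    "tol": [1e-4, 1e-2],
}
search = GridSearchCV(LogisticRegression(max_iter=1000), grid, cv=3)
search.fit(X, y)
print(search.best_params_)      # chosen solver, penalty and tolerance
```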

LDA

LDA also has been performed
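The LDA fit itself is a one-liner in scikit-learn; again on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = make_classification(n_samples=300, n_features=6, random_state=0)
lda = LinearDiscriminantAnalysis().fit(X, y)
print(lda.score(X, y))  # training accuracy
```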


2.3 Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy,
Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model Final Model: Compare Both the
models and write inference which model is best/optimized.

Logistic Regression

Getting the probabilities on the test data,

Confusion matrix on training data and test data


Accuracy of the model: 63%.

AUC,ROC Curve for train data,

AUC,ROC Curve for Test data,
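The metric computations above (probabilities, confusion matrix, accuracy, ROC AUC) can be sketched together; the data is synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]   # class-1 probabilities
pred = clf.predict(X_test)                # default 0.5 cut-off

print(confusion_matrix(y_test, pred))
print(accuracy_score(y_test, pred))
print(roc_auc_score(y_test, proba))       # AUC behind the ROC plot
```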


LDA

Classification report on Training data,

Classification report on Test data,

LDA model score: 64%.

Confusion matrix,
Changing the cut off value to check optimal F1 score and accuracy,

When the cut-off is 0.4, the accuracy and F1 score are highest.
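The cut-off search can be sketched as a sweep over thresholds, keeping the one with the best F1 score; the data is synthetic, so the best cut-off here need not match the report's:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import f1_score

X, y = make_classification(n_samples=400, n_features=6, random_state=0)
lda = LinearDiscriminantAnalysis().fit(X, y)
proba = lda.predict_proba(X)[:, 1]

# Evaluate F1 at each candidate cut-off and keep the best
best = max(
    ((cut, f1_score(y, (proba >= cut).astype(int)))
     for cut in np.arange(0.1, 0.9, 0.1)),
    key=lambda t: t[1])
print("best cut-off %.1f, F1 %.3f" % best)
```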

AUC,ROC curve on test and train data,


Comparison of the models,

LDA looks slightly better than Logistic regression.

2.4 Inference: Basis on these predictions, what are the insights and recommendations.

From the given data set we have to predict whether a particular person would opt for a holiday package or not.

To understand this we built both logistic regression and LDA models; LDA seems to be slightly better than
logistic regression.

The exploratory data analysis shows that salary, age, educ are important parameters and gives insights like,

• People with a salary of more than 1,50,000 are opting for the package.
• People above 50 years of age are not opting for the package much.

People in the age range 30 to 50 generally opt for holiday packages, depending on their salary.

Recommendations

• As employees earning more than 1,50,000 are opting for the package, there should be more options and
more lucrative packages; this will allow the company to earn more, as these employees will be ready to
spend if the packages are good.
• As older people are not taking packages, options that attract them, such as pilgrimage packages, could
be introduced.
