
PREDICTIVE MODELLING
PROJECT REPORT

OCTOBER 6

PGPDSBA Online April_D 2021


Authored by: Nandakumar Chandrasekharan

TABLE OF CONTENTS
PROBLEM 1: LINEAR REGRESSION
DATA DICTIONARY
1.1 Read the data and do exploratory data analysis. Describe the data briefly. (Check the
null values, Data types, shape, EDA, duplicate values).
Perform Univariate and Bivariate Analysis.
1.2 Impute null values if present, also check for the values which are equal to zero.
Do they have any meaning or do we need to change them or drop them? Check for the
possibility of combining the sub levels of ordinal variables and take actions accordingly.
Explain why you are combining these sub levels with appropriate reasoning.
1.3 Encode the data (having string values) for Modelling. Split the data into train and test
(70:30). Apply Linear regression using scikit-learn. Perform checks for significant variables
using appropriate method from statsmodels. Create multiple models and check the
performance of Predictions on Train and Test sets using Rsquare, RMSE & Adj Rsquare.
Compare these models and select the best one with appropriate reasoning.
1.4 Inference: Based on these predictions, what are the business insights and
recommendations.

PROBLEM 2: LOGISTIC REGRESSION AND LDA
DATA DICTIONARY
2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do null value
condition check, write an inference on it. Perform Univariate and Bivariate Analysis.
Do exploratory data analysis.
2.2 Do not scale the data. Encode the data (having string values) for Modelling. Data Split:
Split the data into train and test (70:30). Apply Logistic Regression and
LDA (linear discriminant analysis).
2.3 Performance Metrics: Check the performance of Predictions on Train and Test sets
using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each
model. Final Model: Compare both the models and write inference which model is
best/optimized.
2.4 Inference: Based on these predictions, what are the insights and recommendations.

Problem 1: Linear Regression
You are hired by a company, Gem Stones co ltd, which is a cubic zirconia
manufacturer. You are provided with a dataset containing the prices and other
attributes of almost 27,000 cubic zirconia (which is an inexpensive diamond
alternative with many of the same qualities as a diamond). The company is earning
different profits on different price slots. You have to help the company in predicting
the price of the stone on the basis of the details given in the dataset, so it can
distinguish between more profitable stones and less profitable stones and so
have a better profit share. Also, provide them with the best 5 attributes that are most
important.

Data Dictionary:
Variable Name   Description
Carat           Carat weight of the cubic zirconia.
Cut             Describes the cut quality of the cubic zirconia. Quality in increasing order: Fair, Good, Very Good, Premium, Ideal.
Color           Colour of the cubic zirconia, with D being the best and J the worst.
Clarity         Refers to the absence of Inclusions and Blemishes. In order from Best to Worst: IF, VVS1, VVS2, VS1, VS2, SI1, SI2, I1.
Depth           The Height of the cubic zirconia, measured from the Culet to the Table, divided by its average Girdle Diameter.
Table           The Width of the cubic zirconia's Table expressed as a Percentage of its Average Diameter.
Price           The Price of the cubic zirconia.
X               Length of the cubic zirconia in mm.
Y               Width of the cubic zirconia in mm.
Z               Height of the cubic zirconia in mm.

1.1 Read the data and do exploratory data analysis. Describe the data briefly.
(Check the null values, Data types, shape, EDA, duplicate values). Perform
Univariate and Bivariate Analysis.

Ans: Checking that the data has loaded properly:

Head of data:

Tail of data:

Shape of data: (26967, 11)

Description of data:

Data Info: Dataset has int, float and object data types
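
The checks listed above (head, tail, shape, description, info, null values, duplicates) can be sketched with pandas roughly as follows. This is a minimal illustration, not the notebook used for this report: the small DataFrame and its values are made up, and in practice the data would be read from the project file, whose name is not given here.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real dataset; in practice this would come
# from pd.read_csv(...) on the project file (file name not given in the report).
df = pd.DataFrame({
    "carat": [0.30, 0.33, 0.90, 0.42, 0.30],
    "cut": ["Ideal", "Premium", "Very Good", "Ideal", "Ideal"],
    "depth": [62.1, 60.8, np.nan, 61.6, 62.1],
    "price": [499, 984, 6289, 1082, 499],
})

print(df.head())      # first rows
print(df.tail())      # last rows
print(df.shape)       # (rows, columns)
print(df.describe())  # summary statistics of the numeric columns
df.info()             # data types and non-null counts

null_counts = df.isnull().sum()        # null values per column
n_duplicates = df.duplicated().sum()   # rows that repeat an earlier row
print(null_counts)
print("Duplicate rows:", n_duplicates)
```

In this toy frame the first and last rows are identical, so one duplicate is reported, and only `depth` has a null value.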

Univariate and Bivariate Analysis

Skewness of Data:

According to the graphs below, Ideal is the most preferred cut.

Count plot based on color

Plot based on color and price

Multivariate Analysis

1.2 Impute null values if present, also check for the values which are equal to zero.
Do they have any meaning or do we need to change them or drop them? Check
for the possibility of combining the sub levels of ordinal variables and take
actions accordingly. Explain why you are combining these sub levels with
appropriate reasoning.
Ans: Based on the output below, only the depth column has null values; all other columns have none.

Since the depth column is continuous, either mean or median imputation can be carried
out.
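
A minimal sketch of the imputation step, assuming pandas. The report does not state whether the mean or the median was used; the median is chosen here only as an example, since it is less sensitive to extreme values. The numbers are made up.

```python
import numpy as np
import pandas as pd

# Made-up values; only the column name `depth` comes from the report.
df = pd.DataFrame({"depth": [61.0, 62.0, np.nan, 63.0, 70.0]})

depth_median = df["depth"].median()              # NaN is skipped by default
df["depth"] = df["depth"].fillna(depth_median)   # fill the missing entries

print("Median used for imputation:", depth_median)
print("Nulls remaining:", df["depth"].isnull().sum())
```

Replacing `median()` with `mean()` gives mean imputation instead.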

After imputation is done, we see that there are no null values present.

Checking for outliers

These outliers are then removed from the dataset.
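
The report does not say which rule was used to identify the outliers. One common choice, shown here purely as an assumption, is the 1.5 x IQR rule behind box-plot whiskers. The helper name `remove_outliers_iqr` and the data are hypothetical.

```python
import pandas as pd

def remove_outliers_iqr(data, columns, k=1.5):
    """Keep rows whose values lie within [Q1 - k*IQR, Q3 + k*IQR] for every listed column."""
    mask = pd.Series(True, index=data.index)
    for col in columns:
        q1, q3 = data[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        mask &= data[col].between(q1 - k * iqr, q3 + k * iqr)
    return data[mask]

# Made-up carat values with one extreme observation.
df = pd.DataFrame({"carat": [0.3, 0.4, 0.5, 0.6, 0.7, 5.0]})
cleaned = remove_outliers_iqr(df, ["carat"])
print(len(df), "rows before,", len(cleaned), "rows after")
```

Dropping rows reduces the sample size, so it is worth checking how many rows are lost before modelling.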

1.3 Encode the data (having string values) for Modelling. Split the data into train
and test (70:30). Apply Linear regression using scikit-learn. Perform checks for
significant variables using appropriate method from statsmodels. Create
multiple models and check the performance of Predictions on Train and Test
sets using Rsquare, RMSE & Adj Rsquare. Compare these models and select the
best one with appropriate reasoning.

Ans: The categorical variables have to be dummy-encoded, since linear regression
models cannot take string-valued categorical variables directly.
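
A minimal sketch of dummy encoding with pandas. The coefficient names later in this section (for example `cut_Good` but no `cut_Fair`) suggest that one level per variable was dropped, so `drop_first=True` is assumed here. The rows are made up.

```python
import pandas as pd

# Made-up rows; column names follow the data dictionary.
df = pd.DataFrame({
    "carat": [0.3, 0.5, 0.7, 1.0],
    "cut": ["Fair", "Good", "Ideal", "Premium"],
    "color": ["D", "E", "F", "D"],
})

# drop_first=True drops the first level of each variable (Fair, D),
# which then acts as the baseline for the remaining dummy columns.
encoded = pd.get_dummies(df, columns=["cut", "color"], drop_first=True)
print(list(encoded.columns))
```

With a dropped baseline level, each dummy coefficient is read as the difference from that baseline.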

Now we will have to remove the unwanted columns as below:

Splitting of the data: X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.3 , random_state=1)

Linear Regression Model:

The coefficient for carat is 1.3672709359491833
The coefficient for depth is -0.02715729778195777
The coefficient for table is -0.015129062503321425
The coefficient for x is -0.3109893370891057
The coefficient for y is -0.0008718302715287778
The coefficient for z is -0.009459310770528272
The coefficient for cut_Good is 0.13136322591216373
The coefficient for cut_Ideal is 0.194050821929168
The coefficient for cut_Premium is 0.1695361974418707
The coefficient for cut_Very Good is 0.1637510414681528
The coefficient for color_E is -0.04582992110650405
The coefficient for color_F is -0.06423152006658835
The coefficient for color_G is -0.1093432236463441
The coefficient for color_H is -0.2373503481063302
The coefficient for color_I is -0.36122694997710136
The coefficient for color_J is -0.5838191499347705
The coefficient for clarity_IF is 1.2899471399673714
The coefficient for clarity_SI1 is 0.8895287879225799
The coefficient for clarity_SI2 is 0.6446204697130623
The coefficient for clarity_VS1 is 1.111858158570466
The coefficient for clarity_VS2 is 1.0384035090938615
The coefficient for clarity_VVS1 is 1.2151510670753518
The coefficient for clarity_VVS2 is 1.1977150915884154
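
The split and the coefficient listing above can be reproduced along these lines with scikit-learn. This is a sketch on synthetic data with a known relationship, so the printed coefficients are not those of the report; only `test_size=0.3` and `random_state=1` are taken from the text.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data with a known linear relationship, for illustration only.
rng = np.random.default_rng(1)
X = pd.DataFrame({
    "carat": rng.uniform(0.2, 2.0, 200),
    "depth": rng.uniform(58.0, 65.0, 200),
})
y = 3.0 * X["carat"] - 0.5 * X["depth"] + rng.normal(0.0, 0.01, 200)

# 70:30 split, as stated in the report.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

model = LinearRegression()
model.fit(X_train, y_train)

# One line per feature, in the same style as the listing above.
for name, coef in zip(X_train.columns, model.coef_):
    print(f"The coefficient for {name} is {coef}")
print("Intercept:", model.intercept_)
```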

R-square and RMSE values for training and testing data are as below:
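
For reference, the three metrics named in the question can be computed as sketched below. Adjusted R-square uses the standard formula 1 - (1 - R2)(n - 1)/(n - p - 1), with n observations and p predictors. The helper name `regression_scores` and the numbers are illustrative only.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

def regression_scores(y_true, y_pred, n_features):
    """Return R-square, adjusted R-square and RMSE for one set of predictions."""
    n = len(y_true)
    r2 = r2_score(y_true, y_pred)
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    return r2, adj_r2, rmse

# Made-up actual and predicted values.
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = np.array([1.0, 2.0, 3.0, 4.0, 6.0])

r2, adj_r2, rmse = regression_scores(y_true, y_pred, n_features=1)
print(r2, adj_r2, rmse)
```

Calling the same helper on train and test predictions gives the pair of values to compare for each model.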

VIF Values
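
As a reminder of what the VIF measures, the sketch below computes it from its definition: regress each predictor on all the others and take 1 / (1 - R2). statsmodels also provides `variance_inflation_factor` for this purpose; the hand-written helper `vif_table` here is hypothetical and only makes the definition explicit. The data are synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def vif_table(X):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j on the other columns."""
    vifs = {}
    for col in X.columns:
        others = X.drop(columns=col)
        r2 = LinearRegression().fit(others, X[col]).score(others, X[col])
        vifs[col] = 1.0 / (1.0 - r2)
    return pd.Series(vifs, name="VIF")

# Synthetic predictors: `y` is nearly a copy of `x`, `table` is unrelated.
rng = np.random.default_rng(0)
x = rng.normal(size=300)
X = pd.DataFrame({
    "x": x,
    "y": x + rng.normal(scale=0.05, size=300),
    "table": rng.normal(size=300),
})

vif = vif_table(X)
print(vif)
```

Strongly related predictors (such as the physical dimensions of a stone) show large VIF values, while an unrelated predictor stays close to 1.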

Best params summary:

After dropping the depth variable, the results are as below:
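
A sketch of how significance can be checked with statsmodels and how a model with and without one variable can be compared. The formula, the data and the conclusion are synthetic: here `depth` is constructed to have no effect, which is an assumption of the example and not a finding of this report.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame({
    "carat": rng.uniform(0.2, 2.0, n),
    "depth": rng.uniform(58.0, 65.0, n),
})
# In this synthetic example price depends on carat only; depth is pure noise.
df["price"] = 5.0 * df["carat"] + rng.normal(0.0, 0.5, n)

full = smf.ols("price ~ carat + depth", data=df).fit()
reduced = smf.ols("price ~ carat", data=df).fit()   # same model without depth

print(full.summary())   # coefficient table with p-values
print(full.pvalues)     # p-value per term
print("Adj. R-square, full vs reduced:", full.rsquared_adj, reduced.rsquared_adj)
```

If adjusted R-square barely changes after a variable is dropped, the simpler model is usually preferred.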

1.4 Inference: Based on these predictions, what are the business insights and
recommendations.
Ans: Business Insights: Based on the EDA, the ideal cut brings in the maximum
profit for the company, and the colors H, I and J bring in profit whereas
the other colors do not. Additionally, the fair and good cuts are not bringing any profit
to the company.

Recommendations: The company should focus on the carat and clarity of the stone to increase
pricing and thereby profit. A good customer base and marketing strategy need to
be developed to attract customers to buy the stones which give more profit.

Problem 2: Logistic Regression and LDA
You are hired by a tour and travel agency which deals in selling holiday packages.
You are provided details of 872 employees of a company. Among these employees,
some opted for the package and some didn't. You have to help the company in
predicting whether an employee will opt for the package or not on the basis of the
information given in the data set. Also, find out the important factors on the basis of
which the company will focus on particular employees to sell their packages.

Data Dictionary:
Variable Name       Description
Holiday_Package     Opted for Holiday Package (yes/no)
Salary              Employee salary
age                 Age in years
edu                 Years of formal education
no_young_children   Number of young children (younger than 7 years)
no_older_children   Number of older children
foreign             Foreigner (yes/no)

2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do null value
condition check, write an inference on it. Perform Univariate and Bivariate
Analysis. Do exploratory data analysis.
Ans: The data was read in, and sample rows can be viewed below:

The dimensions of the dataset are (872, 8), with no null values.
The summary of the dataset is given below:

As per the below table, it can be understood that there are no missing values:

Also we can see that there are no duplicates in the dataset.

Univariate Analysis

Per the above graph, all 4 variables have outliers.

Inferences:

• Employees over the age of 50 seem to take holiday packages less often than
younger employees
• Employees with a salary below 150000 seem to be taking holiday packages
• Based on the analysis, only 45% of employees are interested in holiday
packages

Bivariate Analysis

There is not much difference in the data distributions between employees who opted
for the holiday package and those who did not.

No multicollinearity seen in the data.
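
A quick way to back up such a statement is to inspect the pairwise correlations of the numeric columns. The sketch below uses synthetic, independently generated columns, so the low correlations are a property of the example and not a result from the holiday package data.

```python
import numpy as np
import pandas as pd

# Synthetic, independent columns named after the data dictionary.
rng = np.random.default_rng(7)
df = pd.DataFrame({
    "Salary": rng.normal(50000, 10000, 400),
    "age": rng.normal(40, 10, 400),
    "edu": rng.normal(9, 3, 400),
})

corr = df.corr()   # Pearson correlation matrix

# Blank out the diagonal, then take the largest absolute correlation
# between two different columns.
off_diag = corr.where(~np.eye(len(corr), dtype=bool))
max_abs_corr = off_diag.abs().max().max()

print(corr.round(2))
print("Largest absolute pairwise correlation:", round(max_abs_corr, 3))
```

A heatmap of `corr` shows the same information graphically.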

Additionally, the graphs below show that the data has outliers.

After the outlier treatment, the data looks as below:
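
The report does not describe how the outliers were treated. One option that keeps all 872 rows, shown here as an assumption, is to cap values at the box-plot whiskers (1.5 x IQR beyond the quartiles). The helper name `cap_outliers_iqr` and the salaries are made up.

```python
import pandas as pd

def cap_outliers_iqr(series, k=1.5):
    """Clip values to the range [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Made-up salaries with one extreme value.
salary = pd.Series(
    [30000.0, 35000.0, 40000.0, 45000.0, 50000.0, 250000.0], name="Salary"
)
capped = cap_outliers_iqr(salary)
print(capped.tolist())
```

Unlike row removal, capping leaves the number of observations unchanged, which matters for a dataset of this size.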

2.2 Do not scale the data. Encode the data (having string values) for Modelling. Data
Split: Split the data into train and test (70:30). Apply Logistic Regression and LDA
(linear discriminant analysis).
Ans: We have split the data into train and test sets in the ratio 70:30, and the data has
been encoded as below:
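
A minimal sketch of the encoding, the 70:30 split and a logistic regression fit with scikit-learn. The data and the rule generating the target are synthetic, and `random_state=1` and `max_iter=1000` are assumptions, since the report does not state the settings used for this problem.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
n = 400
df = pd.DataFrame({
    "age": rng.integers(20, 60, n),
    "foreign": rng.choice(["yes", "no"], n),
})
# Synthetic target: younger employees opt in (illustrative rule only).
df["Holiday_Package"] = np.where(df["age"] < 40, "yes", "no")

# Encode the yes/no string columns as 1/0.
for col in ["foreign", "Holiday_Package"]:
    df[col] = (df[col] == "yes").astype(int)

X = df.drop(columns="Holiday_Package")
y = df["Holiday_Package"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

log_model = LogisticRegression(max_iter=1000)
log_model.fit(X_train, y_train)

train_acc = log_model.score(X_train, y_train)
test_acc = log_model.score(X_test, y_test)
print("Train accuracy:", train_acc)
print("Test accuracy:", test_acc)
```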

Logistic Regression Model

Accuracy Scores:

Creating the LDA Model
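
Fitting an LDA model in scikit-learn follows the same pattern as logistic regression. The sketch below uses two synthetic classes rather than the holiday package data, and default settings are assumed.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
# Two synthetic classes whose feature means differ.
X = np.vstack([rng.normal(0.0, 1.0, (200, 2)), rng.normal(3.0, 1.0, (200, 2))])
y = np.array([0] * 200 + [1] * 200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)

lda_test_acc = lda.score(X_test, y_test)
lda_test_prob = lda.predict_proba(X_test)   # one column of probabilities per class

print("Train accuracy:", lda.score(X_train, y_train))
print("Test accuracy:", lda_test_acc)
print(classification_report(y_test, lda.predict(X_test)))
```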

Model scores and classification report:

2.3 Performance Metrics: Check the performance of Predictions on Train and Test
sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for
each model. Final Model: Compare both the models and write inference which
model is best/optimized.
Ans: Accuracy of the data sets:

Confusion Matrix:
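
For reference, accuracy and the confusion matrix can be obtained from scikit-learn as sketched below. The labels are made up; in scikit-learn's layout, rows are actual classes and columns are predicted classes.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up actual and predicted labels (1 = opted for the package).
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()   # order for a binary problem with labels 0 and 1
acc = accuracy_score(y_true, y_pred)

print(cm)
print("TN, FP, FN, TP:", tn, fp, fn, tp)
print("Accuracy:", acc)
```

Accuracy equals (TN + TP) divided by the total number of observations, here 5 out of 8.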

AUC and ROC Curve
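
The ROC curve and its AUC are computed from predicted probabilities of the positive class, not from predicted labels. A minimal sketch with made-up values:

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up labels and predicted probabilities of class 1,
# e.g. the second column of model.predict_proba(X_test).
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc_score(y_true, y_prob)
fpr, tpr, thresholds = roc_curve(y_true, y_prob)

print("AUC:", auc)
print("False positive rates:", fpr)
print("True positive rates:", tpr)
# Plotting tpr against fpr (for example with matplotlib) draws the ROC curve.
```

Here three of the four (positive, negative) pairs are ranked correctly, so the AUC is 0.75.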

Checking for the optimal value that gives better accuracy:
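
The report does not say which value is being tuned here. If it is the probability cut-off used to turn predicted probabilities into class labels (an assumption), a scan over candidate cut-offs could look like the sketch below. The helper name `scan_cutoffs` and the data are hypothetical.

```python
import numpy as np
from sklearn.metrics import accuracy_score

def scan_cutoffs(y_true, y_prob, cutoffs):
    """Return (cutoff, accuracy) pairs for turning probabilities into class labels."""
    y_prob = np.asarray(y_prob)
    return [
        (float(c), accuracy_score(y_true, (y_prob > c).astype(int)))
        for c in cutoffs
    ]

# Made-up labels and predicted probabilities of class 1.
y_true = [0, 0, 0, 1, 1, 1]
y_prob = [0.12, 0.22, 0.45, 0.35, 0.62, 0.91]

results = scan_cutoffs(y_true, y_prob, np.linspace(0.1, 0.9, 9))
for cutoff, acc in results:
    print(f"cut-off {cutoff:.1f}: accuracy {acc:.3f}")

# max() keeps the first cut-off that reaches the highest accuracy.
best_cutoff, best_acc = max(results, key=lambda pair: pair[1])
print("Best cut-off:", round(best_cutoff, 1), "accuracy:", round(best_acc, 3))
```

A cut-off chosen this way should be selected on the training data and then confirmed on the test data, otherwise the reported test accuracy is optimistic.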

AUC and ROC Curve:

LDA works better when a categorical target variable is present; otherwise the results
of the two models are pretty much the same.

2.4 Inference: Based on these predictions, what are the insights and
recommendations.
Ans: Insights from the analysis are as below:
• Important factors predicting people’s interest in holiday packages are age, salary
and education.
• People above the age of 50 generally don’t prefer the holiday package, whereas people
with a salary of less than 50k have opted for it

Recommendations:
• A survey to understand good destinations for people above 50 may help attract
them to take holiday packages
• Targeting parents with younger children should be avoided, as the conversion rate
seems to be low
