
GRADED PROJECT-

PREDICTIVE MODELLING

DSBA-PM Module

Faizan Ali Sayyed


Table of Contents
Home
Index
List of Figures
List of Equations
Problem 1
Problem statement
Data dictionary
1.1
1.2
1.3
1.4
Problem 2
Problem statement
Data dictionary
2.1
2.2
2.3
2.4


List of Figures
Figure 1 Compactive Data Shape
Figure 2 Compactive Data info
Figure 3 Compactive Data head
Figure 4 Compactive Data tail
Figure 5 Compactive Data describe
Figure 6 Duplicates
Figure 7 Boxplot
Figure 8 Bivariate
Figure 9 Bivariate analysis 2
Figure 10 Null values
Figure 11 Null values imputation
Figure 12 Description after impute
Figure 13 Zero counts in Data
Figure 14 Duplicates
Figure 15 Boxplot
Figure 16 After outlier treatment Boxplot
Figure 17 Distribution of Object type variable runqsz
Figure 18 Data info after encoding data
Figure 19 X head
Figure 20 X train head data
Figure 21 X test head data
Figure 22 Model fit
Figure 23 R sq without lread
Figure 24 Rsq without lwrite
Figure 25 Model 2 without ppgout
Figure 26 VIF values 2
Figure 27 Model 3 without fork
Figure 28 VIF values without fork
Figure 29 Model 4 without pgin
Figure 30 VIF values without pgin
Figure 31 Model 5 without vflt
Figure 32 VIF values without vflt
Figure 33 Model 6 without pgout
Figure 34 VIF values without pgout
Figure 35 Model 7 without sread
Figure 36 VIF values without sread
Figure 37 Model 8 without lread
Figure 38 VIF values without lread
Figure 39 Model 9 without swrite
Figure 40 VIF values without swrite
Figure 41 Model 10 without pflt
Figure 42 VIF values without pflt
Figure 43 Final Model
Figure 44 Actuals vs residuals
Figure 45 best fit
Figure 46 pair plot to see distribution between y and all variables
Figure 47 Normality of residuals
Figure 48 QQ plot of residuals
Figure 49 Shapiro test
Figure 50 Homoscedasticity test


Figure 51 Final Model
Figure 52 Model parameters
Figure 53 Data 2 head
Figure 54 Data2 shape
Figure 55 Data2 Tail
Figure 56 Data2 info
Figure 57 Data 2 descriptive summary
Figure 58 Null values
Figure 59 Imputing null values with median
Figure 60 No of duplicate rows
Figure 61 removing duplicate rows from Data2
Figure 62 Revised description after treating duplicates and nulls
Figure 63 Categorical variable value count
Figure 64 Boxplot Data2
Figure 65 Outlier treatment
Figure 66 Replacing int with category
Figure 67 Heat map Data2
Figure 68 Pair plot Data2
Figure 69 Multivariate analysis
Figure 70 Revised Data 2 head after encoding
Figure 71 All data types changed to Numeric
Figure 72 Data2 Updated heatmap after encoding
Figure 73 Pair plot Data2 after encoding
Figure 74 Y train values
Figure 75 Y test values
Figure 76 fit check
Figure 77 revised fit
Figure 78 Decision tree feature importance
Figure 79 Predicted classes
Figure 80 Data2 Model accuracy
Figure 81 ROC curve X train
Figure 82 ROC curve test data
Figure 83 Classification report Train data
Figure 84 Classification report test data
Figure 85 Confusion matrix training data
Figure 86 Confusion matrix testing data
Figure 87 ROC curve train CART
Figure 88 ROC test CART
Figure 89 Confusion matrix train CART
Figure 90 Confusion matrix test CART
Figure 91 Model Accuracy CART
Figure 92 Summary comparison

List of Equations
Equation 1 Final Model equation for usr
Equation 2 RMSE train vs test
Equation 3 MAE train vs test


Problem 1: Linear Regression

The comp-activ database is a collection of computer system activity measures. The data was collected from a Sun SPARCstation 20/712 with 128 Mbytes of memory running in a multi-user university department. Users would typically be doing a large variety of tasks, ranging from accessing the internet and editing files to running very CPU-bound programs.

As a budding data scientist, you are asked to find a linear equation, i.e. build a model to predict 'usr' (the portion of time (%) that CPUs run in user mode), and to find out how each attribute affects the time the system spends in 'usr' mode, using a list of system attributes.


DATA DICTIONARY:
-----------------------
System measures used:
lread - Reads (transfers per second) between system memory and user memory
lwrite - Writes (transfers per second) between system memory and user memory
scall - Number of system calls of all types per second
sread - Number of system read calls per second
swrite - Number of system write calls per second
fork - Number of system fork calls per second
exec - Number of system exec calls per second
rchar - Number of characters transferred per second by system read calls
wchar - Number of characters transferred per second by system write calls
pgout - Number of page-out requests per second
ppgout - Number of pages paged out per second
pgfree - Number of pages per second placed on the free list
pgscan - Number of pages checked per second to see if they can be freed
atch - Number of page attaches (satisfying a page fault by reclaiming a page in memory) per second
pgin - Number of page-in requests per second
ppgin - Number of pages paged in per second
pflt - Number of page faults caused by protection errors (copy-on-writes)
vflt - Number of page faults caused by address translation
runqsz - Process run queue size (the number of kernel threads in memory that are waiting for a CPU to run; typically this value should be less than 2, and consistently higher values mean the system might be CPU-bound)
freemem - Number of memory pages available to user processes
freeswap - Number of disk blocks available for page swapping


1.1 Read the data and do exploratory data analysis. Describe the data
briefly. (Check the Data types, shape, EDA, 5-point summary). Perform
Univariate, Bivariate Analysis, Multivariate Analysis.
1.2 Impute null values if present, also check for the values which are equal
to zero. Do they have any meaning or do we need to change them or
drop them? Check for the possibility of creating new features if
required. Also check for outliers and duplicates, if any.
1.3 Encode the data (having string values) for Modelling. Split the data into
train and test (70:30). Apply Linear regression using scikit learn.
Perform checks for significant variables using appropriate method from
statsmodel. Create multiple models and check the performance of
Predictions on Train and Test sets using Rsquare, RMSE & Adj
Rsquare. Compare these models and select the best one with
appropriate reasoning.
1.4 Inference: Based on these predictions, what are the business insights
and recommendations. Please explain and summarise the various
steps performed in this project. There should be proper business
interpretation and actionable insights present.


1.1 Read the data and do exploratory data analysis. Describe the data briefly.
(Check the Data types, shape, EDA, 5-point summary). Perform
Univariate, Bivariate Analysis, Multivariate Analysis.

Data Shape

Figure 1 Compactive Data Shape

Data info

Figure 2 Compactive Data info

Data head and tail


Figure 3 Compactive Data head

Figure 4 Compactive Data tail

Statistical description of the data

Figure 5 Compactive Data describe
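The EDA steps shown in Figures 1-6 can be reproduced with a short pandas sketch like the one below (the file name 'compactiv.csv' is an assumption; the calls themselves are standard pandas).

import pandas as pd

# Load the comp-activ data (file name assumed)
df = pd.read_csv("compactiv.csv")

print(df.shape)                 # number of rows and columns
df.info()                       # data types and non-null counts
print(df.head())                # first five rows
print(df.tail())                # last five rows
print(df.describe().T)          # 5-point summary plus mean and std
print(df.duplicated().sum())    # number of duplicate rows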


Duplicated rows

Figure 6 Duplicates

The data has no duplicate rows.

'usr' is the dependent variable for building the model, while all the other columns are independent variables.

Figure 7 Boxplot


Figure 8 Bivariate


Figure 9 Bivariate analysis 2

Summary:

There are a total of 8192 rows and 22 columns in the dataset.

Out of the 22 columns, 13 are of float type, 8 are of integer type, and 1 is an object type variable.


1.2 Impute null values if present, also check for the values which are equal to
zero. Do they have any meaning or do we need to change them or drop
them? Check for the possibility of creating new features if required. Also
check for outliers and duplicates, if any.
We know that there are missing values in the data.

We can see that 'rchar' and 'wchar' have null values. As both are continuous variables, the mean value can be imputed.

Figure 10 Null values

Imputing null values with mean

Figure 11 Null values imputation

Figure 12 Description after impute
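A minimal sketch of the mean imputation described above, assuming the DataFrame is named df as in the earlier sketch:

# Check null counts, then fill 'rchar' and 'wchar' with their column means
print(df.isnull().sum())
df["rchar"] = df["rchar"].fillna(df["rchar"].mean())
df["wchar"] = df["wchar"].fillna(df["wchar"].mean())
print(df.isnull().sum().sum())   # should now be 0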


We can keep the zeros in our dataset for further analysis, as several features can legitimately be 0 when the system is idle.

Figure 13 Zero counts in Data

Duplicates have been checked already and we know there are no duplicate rows

Figure 14 Duplicates

Outlier treatment


Figure 15 Boxplot


Figure 16 After outlier treatment Boxplot
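The report does not state the exact outlier rule used; a common choice, sketched below under that assumption, is to cap each numeric column at the 1.5*IQR whiskers of its boxplot:

import numpy as np

def cap_outliers_iqr(frame, column):
    # Cap values outside Q1 - 1.5*IQR and Q3 + 1.5*IQR at those limits
    q1, q3 = frame[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    frame[column] = frame[column].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

for col in df.select_dtypes(include=np.number).columns:
    if col != "usr":             # leaving the target untouched is an assumption
        cap_outliers_iqr(df, col)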


1.3 Encode the data (having string values) for Modelling. Split the data into
train and test (70:30). Apply Linear regression using scikit learn. Perform
checks for significant variables using appropriate method from statsmodel.
Create multiple models and check the performance of Predictions on Train
and Test sets using Rsquare, RMSE & Adj Rsquare. Compare these
models and select the best one with appropriate reasoning.

Data values for categorical variable

Figure 17 Distribution of Object type variable runqsz

In this data, the object variable runqsz is encoded with a dummy column.

Figure 18 Data info after encoding data
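A sketch of the encoding, assuming the two runqsz categories are 'CPU_Bound' and 'Not_CPU_Bound' (inferred from the dummy column name used later in the final equation):

# One-hot encode runqsz; drop_first=True keeps a single dummy column,
# runqsz_Not_CPU_Bound, as used in the final model equation
df = pd.get_dummies(df, columns=["runqsz"], drop_first=True, dtype=int)
df.info()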


First, the "usr" column is taken as the output y, and the rest of the data as X.

Figure 19 X head

Splitting the data into train and test sets in a 70:30 ratio and running the model to see the current fit.

Figure 20 X train head data

Figure 21 X test head data


Figure 22 Model fit
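A sketch of the 70:30 split and the initial statsmodels OLS fit; the random seed is an assumption, since the report does not state the one used:

import statsmodels.api as sm
from sklearn.model_selection import train_test_split

X = df.drop("usr", axis=1).astype(float)
y = df["usr"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

# statsmodels needs an explicit intercept term
ols_model = sm.OLS(y_train, sm.add_constant(X_train)).fit()
print(ols_model.summary())   # coefficients, p-values, R-squared, adj. R-squared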

For each predictor variable there is a null hypothesis and alternate hypothesis.

 Null hypothesis : Predictor variable is not significant


 Alternate hypothesis : Predictor variable is significant

The table above shows that, based on p-values, sread, fork, ppgout, pgfree and pgin have high p-values (> 0.05); these variables are therefore not significant and can be removed.


Checking Multicollinearity using Variance Inflation Factor (VIF)

Variance Inflation Factor (VIF) is one of the methods to check whether the independent variables are correlated with each other. If they are correlated, that is not ideal for a linear regression model, as correlated predictors inflate the standard errors, which in turn affects the regression coefficients. As a result, the regression model becomes unreliable and lacks interpretability.

General rule of thumb:

If a VIF value is equal to 1, there is no multicollinearity.
If a VIF value is 5 or more, there is moderate to high multicollinearity.
If a VIF value is 10 or more, there is high multicollinearity.

From the above information we can conclude that there is some multicollinearity present in the data.
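A VIF table for the training predictors can be produced with statsmodels, as in the sketch below (X_train is the training set created earlier):

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(exog_df):
    # One VIF value per column of a numeric DataFrame
    return pd.DataFrame({
        "feature": exog_df.columns,
        "VIF": [variance_inflation_factor(exog_df.values, i)
                for i in range(exog_df.shape[1])],
    }).sort_values("VIF", ascending=False)

print(vif_table(X_train))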


We will focus on predictors having VIF values > 2, removing them one at a time and checking the impact on R-squared (a code sketch of this check follows the list below).

 Removing predictor “lread”


o No significant drop in R-squared, hence can remove the column from model

Figure 23 R sq without lread

 Removing predictor “lwrite”


o No significant drop in R-squared, hence can remove the column from model


Figure 24 Rsq without lwrite

 Removing predictor “scall”


o No significant drop in R-squared, hence can remove the column from model


Figure 25 Rsq without scall

 Removing predictor “sread”


o No significant drop in R-squared, hence can remove the column from model


Figure 26 Rsq without sread

 Removing predictor “swrite”


o No significant drop in R-squared, hence can remove the column from model


Figure 27 Rsq without swrite


 Removing predictor “fork”


o No significant drop in R-squared, hence can remove the column from model


Figure 28 Rsq without fork

 Removing predictor “exec”


o No significant drop in R-squared, hence can remove the column from model


Figure 29 Rsq without exec

 Removing predictor “rchar”


o No significant drop in R-squared, hence can remove the column from model


Figure 30 Rsq without rchar

 Removing predictor “pgout”


o No significant drop in R-squared, hence can remove the column from model


Figure 31 Rsq without pgout

 Removing predictor “ppgout”


o No significant drop in R-squared, hence can remove the column from model


Figure 32 Rsq without ppgout

 Removing predictor “pgfree”


o No significant drop in R-squared, hence can remove the column from model


Figure 33 Rsq without pgfree


 Removing predictor “pgin”


o No significant drop in R-squared, hence can remove the column from model


Figure 34 Rsq without pgin

 Removing predictor “ppgin”


o No significant drop in R-squared, hence can remove the column from model


Figure 35 Rsq without ppgin

 Removing predictor “pflt”


o No significant drop in R-squared, hence can remove the column from model


Figure 36 Rsq without pflt

 Removing predictor “vflt”


o No significant drop in R-squared, hence can remove the column from model


Figure 37 Rsq without vflt
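A minimal sketch of the drop-one check used in the list above: refit the OLS model without each candidate predictor and compare its R-squared with that of the full model.

import statsmodels.api as sm

full_r2 = sm.OLS(y_train, sm.add_constant(X_train)).fit().rsquared

candidates = ["lread", "lwrite", "scall", "sread", "swrite", "fork", "exec",
              "rchar", "pgout", "ppgout", "pgfree", "pgin", "ppgin", "pflt", "vflt"]

for col in candidates:
    # Refit without this column and report the change in R-squared
    reduced = sm.add_constant(X_train.drop(columns=[col]))
    r2 = sm.OLS(y_train, reduced).fit().rsquared
    print(f"R-squared without {col}: {r2:.4f} (full model: {full_r2:.4f})")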


Dropping the variable ppgout with little to no impact on the model

Figure 25 Model 2 without ppgout

Figure 26 VIF values 2


Removing fork from X train

Figure 27 Model 3 without fork

Figure 28 VIF values without fork


Removing pgin from X train

Figure 29 Model 4 without pgin

Figure 30 VIF values without pgin


Removing vflt from X train

Figure 31 Model 5 without vflt

Figure 32 VIF values without vflt


Removing pgout from X train

Figure 33 Model 6 without pgout

Figure 34 VIF values without pgout


Removing sread from X train

Figure 35 Model 7 without sread

Figure 36 VIF values without sread


Removing lread from X train

Figure 37 Model 8 without lread

Figure 38 VIF values without lread


Removing swrite from X train

Figure 39 Model 9 without swrite

Figure 40 VIF values without swrite


Removing pflt from X train

Figure 41 Model 10 without pflt

Figure 42 VIF values without pflt


Now that there are no VIF values above 2, it is safe to say that multicollinearity has been removed from the dataset.

In this model, "wchar" has a p-value > 0.05, hence we can remove it as well.

Figure 43 Final Model

Final equation for usr:

usr = 84.1328 - 0.0353*(lwrite) - 0.0015*(scall) - 1.7255*(exec) + 9.51e-06*(rchar) - 0.1516*(pgfree) + 0.5082*(atch) - 0.0625*(pgin) - 0.0006*(freemem) + 8.658e-06*(freeswap) + 1.6073*(runqsz_Not_CPU_Bound)


Testing the Assumptions of Linear Regression

For linear regression, we need to check whether the following assumptions hold:

 Linearity
 Independence
 Homoscedasticity
 Normality of error terms
 No strong Multicollinearity

Figure 44 Actuals vs residuals

Figure 45 best fit


Figure 46 pair plot to see distribution between y and all variables


Figure 47 Normality of residuals

Figure 48 QQ plot of residuals

Most points lie on the line, but there is some deviation, so this is not a perfect fit; the residuals are only approximately normal.


The Shapiro-Wilk test can also be used for checking the normality. The null and alternate
hypotheses of the test are as follows:

 Null hypothesis - Data is normally distributed.


 Alternate hypothesis - Data is not normally distributed.

Figure 49 Shapiro test

 Since the p-value < 0.05, the residuals are not normal as per the Shapiro test.
 Strictly speaking, the residuals are not normal. However, as an approximation, we might be willing to accept this distribution as close to being normal.
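A sketch of the Shapiro-Wilk test on the residuals; final_model stands for the fitted statsmodels result of the final model (the name is an assumption):

from scipy import stats

stat, p_value = stats.shapiro(final_model.resid)
print(f"Shapiro-Wilk statistic = {stat:.4f}, p-value = {p_value:.4f}")
# p-value < 0.05 -> reject the null hypothesis that the residuals are normal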

Test for homoscedasticity:

Figure 50 Homoscedasticity test

Since p-value > 0.05 we can say that the residuals are homoscedastic.
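The report does not name the homoscedasticity test used; the sketch below assumes a Goldfeld-Quandt test, with X_train_final standing for the predictors retained in the final model (assumed name):

import statsmodels.api as sm
import statsmodels.stats.api as sms

f_stat, p_value, _ = sms.het_goldfeldquandt(
    final_model.resid, sm.add_constant(X_train_final)
)
print(f"Goldfeld-Quandt p-value = {p_value:.4f}")
# p-value > 0.05 -> fail to reject the null hypothesis of equal error variance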

All the assumptions of linear regression are now reasonably satisfied.


1.4 Inference: Based on these predictions, what are the business insights and
recommendations. Please explain and summarise the various steps
performed in this project. There should be proper business interpretation
and actionable insights present.

Figure 51 Final Model

Summary:

 R-squared of the model is 0.711 and adjusted R-squared is 0.711, which shows that the model is able to explain ~71% of the variance in the data. This is quite good.


 Process run queue size, the number of system exec calls per second, and the number of page attaches per second have a significant impact on the portion of time (%) that CPUs run in user mode.

Figure 52 Model parameters

Equation 1 Final Model equation for usr

Now predicting on the X test data,

Calculating Root mean square error and Mean Absolute error for train and test
data
Equation 2 RMSE train vs test

Equation 3 MAE train vs test
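A sketch of how these train and test errors can be computed; X_train_final and X_test_final stand for the columns kept in the final model (assumed names):

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error
import statsmodels.api as sm

pred_train = final_model.predict(sm.add_constant(X_train_final))
pred_test = final_model.predict(sm.add_constant(X_test_final))

print("RMSE train:", np.sqrt(mean_squared_error(y_train, pred_train)))
print("RMSE test :", np.sqrt(mean_squared_error(y_test, pred_test)))
print("MAE train :", mean_absolute_error(y_train, pred_train))
print("MAE test  :", mean_absolute_error(y_test, pred_test))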

We can see that the RMSE values on the train and test sets are comparable, so our model is not suffering from overfitting.
The MAE indicates that our current model is able to predict 'usr' within a mean error of about 2.2 units on the test data.
Hence, we can conclude that the model is good for prediction as well as inference purposes, with a ~71% fit.


Problem 2: Logistic Regression, LDA and CART

You are a statistician at the Republic of Indonesia Ministry of Health, and you are provided with data on 1473 females collected from a Contraceptive Prevalence Survey. The samples are married women who were either not pregnant or did not know if they were pregnant at the time of the survey.

The problem is to predict whether or not they use a contraceptive method of choice based on their demographic and socio-economic characteristics.

2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do null
value condition check, check for duplicates and outliers and write an inference
on it. Perform Univariate and Bivariate Analysis and Multivariate Analysis.
2.2 Do not scale the data. Encode the data (having string values) for Modelling.
Data Split: Split the data into train and test (70:30). Apply Logistic Regression
and LDA (linear discriminant analysis) and CART.

2.3 Performance Metrics: Check the performance of predictions on the train and test sets using Accuracy, Confusion Matrix, ROC curve plots and ROC_AUC scores for each model. Final Model: Compare the models and write an inference on which model is best/optimized.
2.4 Inference: Based on these predictions, what are the insights and
recommendations.


Data Dictionary:

1. Wife's age (numerical)


2. Wife's education (categorical) 1=uneducated, 2, 3, 4=tertiary
3. Husband's education (categorical) 1=uneducated, 2, 3, 4=tertiary
4. Number of children ever born (numerical)
5. Wife's religion (binary) Non-Scientology, Scientology
6. Wife's now working? (binary) Yes, No
7. Husband's occupation (categorical) 1, 2, 3, 4 (random)
8. Standard-of-living index (categorical) 1=very low, 2, 3, 4=high
9. Media exposure (binary) Good, Not good
10. Contraceptive method used (class attribute) No, Yes


2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do
null value condition check, check for duplicates and outliers and write an
inference on it. Perform Univariate and Bivariate Analysis and Multivariate
Analysis.

Head

Figure 53 Data 2 head

Figure 54 Data2 shape

Figure 55 Data2 Tail


Figure 56 Data2 info

Figure 57 Data 2 descriptive summary

Figure 58 Null values

We can see that Wife's age and Number of children ever born have null values. We replace these with the median values of the two numeric variables.


Figure 59 Imputing null values with median

Figure 60 No of duplicate rows

There are 80 duplicate rows; let us remove them.

Figure 61 removing duplicate rows from Data2

Figure 62 Revised description after treating duplicates and nulls
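A minimal sketch of these cleaning steps, assuming the survey data is loaded as df2 and that the two affected columns are named Wife_age and No_of_children_born (assumed spellings):

# Median imputation for the two numeric columns with nulls
df2["Wife_age"] = df2["Wife_age"].fillna(df2["Wife_age"].median())
df2["No_of_children_born"] = df2["No_of_children_born"].fillna(
    df2["No_of_children_born"].median()
)

print(df2.duplicated().sum())    # 80 duplicate rows reported above
df2 = df2.drop_duplicates()      # remove them
print(df2.describe().T)          # revised description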


Univariate Analysis

Figure 63 Categorical variable value count


Figure 64 Boxplot Data2


Figure 65 Outlier treatment


Since husband's occupation is a categorical variable, the numbers 1, 2, 3 and 4 are replaced with string categories.

Figure 66 Replacing int with category

Bivariate analysis

Figure 67 Heat map Data2


Figure 68 Pair plot Data2

Multivariate Analysis


Figure 69 Multivariate analysis


2.2 Do not scale the data. Encode the data (having string values) for Modelling.
Data Split: Split the data into train and test (70:30). Apply Logistic Regression
and LDA (linear discriminant analysis) and CART.
The categorical variables Wife_education, Husband_education, Wife_religion, Standard_of_living_index, Media_exposure and Contraceptive_method_used are encoded in ascending order from worst to best, since LDA does not take string variables as inputs for model building.

The encoding for ordinal values:

Wife_education: Uneducated = 1, Primary = 2, Secondary = 3, Tertiary = 4
Husband_education: Uneducated = 1, Primary = 2, Secondary = 3, Tertiary = 4
Wife_religion: Scientology = 0 and Non-Scientology = 1
Wife_Working: Yes = 1 and No = 0
Standard_of_living_index: Very Low = 1, Low = 2, High = 3, Very High = 4
Media_exposure: Exposed = 1 and Not-Exposed = 0
Contraceptive_method_used: Yes = 1 and No = 0

Figure 70 Revised Data 2 head after encoding
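A sketch of this ordinal encoding; the column names and category labels below are assumptions based on the data dictionary and the mapping listed above:

encoding_maps = {
    "Wife_education":            {"Uneducated": 1, "Primary": 2, "Secondary": 3, "Tertiary": 4},
    "Husband_education":         {"Uneducated": 1, "Primary": 2, "Secondary": 3, "Tertiary": 4},
    "Wife_religion":             {"Scientology": 0, "Non-Scientology": 1},
    "Wife_Working":              {"Yes": 1, "No": 0},
    "Standard_of_living_index":  {"Very Low": 1, "Low": 2, "High": 3, "Very High": 4},
    "Media_exposure":            {"Exposed": 1, "Not-Exposed": 0},
    "Contraceptive_method_used": {"Yes": 1, "No": 0},
}

for column, mapping in encoding_maps.items():
    # Map each string category to its ordinal/binary code
    df2[column] = df2[column].map(mapping)

df2.info()   # all mapped columns should now be numeric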


Figure 71 All data types changed to Numeric

Bivariate analysis

Figure 72 Data2 Updated heatmap after encoding


Figure 73 Pair plot Data2 after encoding


Train test split

Figure 74 Y train values

Figure 75 Y test values

Figure 76 fit check

There is an overfit of the model with the decision tree classifier, hence we need to do a grid search.

Let us take max_depth as 10, min_samples_leaf as 15 and min_samples_split as 15.

Figure 77 revised fit
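A sketch of the three models as they could be fitted with scikit-learn; the split ratio and the CART hyperparameters follow the text above, while the random seed is an assumption:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.tree import DecisionTreeClassifier

X2 = df2.drop("Contraceptive_method_used", axis=1)
y2 = df2["Contraceptive_method_used"]

X_train2, X_test2, y_train2, y_test2 = train_test_split(
    X2, y2, test_size=0.30, random_state=1
)

log_reg = LogisticRegression(max_iter=1000).fit(X_train2, y_train2)
lda = LinearDiscriminantAnalysis().fit(X_train2, y_train2)
cart = DecisionTreeClassifier(
    max_depth=10, min_samples_leaf=15, min_samples_split=15, random_state=1
).fit(X_train2, y_train2)

# Compare train vs test accuracy to check for overfitting
for name, model in [("Logistic Regression", log_reg), ("LDA", lda), ("CART", cart)]:
    print(name, "train:", model.score(X_train2, y_train2),
          "test:", model.score(X_test2, y_test2))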


Figure 78 Decision tree feature importance

From the above chart, we can establish that media exposure plays the most important role in the use of a contraceptive method. The other important factors are standard of living, husband's occupation, working status, and wife's religion.

Here is a comparison of accuracy for the models created using the Decision Tree Classifier, Logistic Regression and LDA.

Logistic Regression model accuracy on training data before applying GridSearchCV: 0.674872
Logistic Regression model accuracy on test data before applying GridSearchCV: 0.676923


Thus we can conclude that the Decision Tree Classifier results in the best accuracy.


2.3 Performance Metrics: Check the performance of predictions on the train and test sets using Accuracy, Confusion Matrix, ROC curve plots and ROC_AUC scores for each model. Final Model: Compare the models and write an inference on which model is best/optimized.

Predicting the model classes

Figure 79 Predicted classes

Checking the accuracy of the model:

Figure 80 Data2 Model accuracy


 An AUC value closer to 1 indicates good separability between the predicted classes, and thus the model is good for prediction.
 The ROC curve visually represents this concept; the curve should be as far as possible from the diagonal.
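A sketch of how these metrics can be computed for any one of the fitted classifiers (model stands for log_reg, lda or cart from the earlier sketch):

import matplotlib.pyplot as plt
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, roc_auc_score, roc_curve)

probs_test = model.predict_proba(X_test2)[:, 1]   # probability of class 1
preds_test = model.predict(X_test2)

print("Accuracy:", accuracy_score(y_test2, preds_test))
print("AUC:", roc_auc_score(y_test2, probs_test))
print(confusion_matrix(y_test2, preds_test))
print(classification_report(y_test2, preds_test))

# ROC curve with the chance diagonal for reference
fpr, tpr, _ = roc_curve(y_test2, probs_test)
plt.plot(fpr, tpr, label="model")
plt.plot([0, 1], [0, 1], linestyle="--", label="chance")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()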

Figure 81 ROC curve X train

AUC value is 72.1

Figure 82 ROC curve test data


Figure 83 Classification report Train data

Figure 84 Classification report test data

Figure 85 Confusion matrix training data

Figure 86 Confusion matrix testing data


Figure 87 ROC curve train CART

Figure 88 ROC test CART


Figure 89 Confusion matrix train CART

Figure 90 Confusion matrix test CART

Model accuracy CART

Figure 91 Model Accuracy CART

Figure 92 Summary comparison


2.4 Inference: Based on these predictions, what are the insights and recommendations?

 The EDA clearly indicates that women with a tertiary education and a very high standard of living used contraceptive methods.

 Media exposure plays a vital role in the use of contraception.

 Women aged roughly 21 to 38 generally use contraceptive methods more.

 The usage of contraceptive methods does not necessarily depend on demographic or socio-economic background, since the use of contraceptive methods was almost the same for both working and non-working women.

 The use of contraceptive methods was high for both Scientology and non-Scientology women.
