
Predictive Modeling Project

1.1 EDA

Data Description: Top 5 Rows

Bottom 5 Rows:
Information about the Dataset:

Insights:

 The data set contains 8192 observations and 22 columns: 21 predictor variables plus the target variable ‘usr’.

 There are 13 float, 8 int, and 1 object datatype columns.

 There are some missing values in a few columns.

 Before analysing the histograms, we should check whether all the columns are relevant.

 The shape of this dataset is (8192, 22).


Summary about the Data

Insights:

 Both numeric and categorical columns are considered. Let's understand each variable and the statistics obtained above.

 For each variable, the mean, std, min, 25%, 50%, 75%, and max are shown.

 We can also see certain columns where NaN is shown; let's understand more about this data.
Univariate analysis for numerical data (Histplot & Boxplot):

Insights:

We have outliers in our columns, which we need to treat before further analysis. From the histplots, and consistent with the positive skewness values reported below, almost all the features have a right-skewed distribution, while the ‘usr’ feature is left-skewed.
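As a minimal sketch (not the report's own code), the univariate plots above can be produced as follows; the DataFrame name df and the file name compactiv.csv are assumptions:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("compactiv.csv")  # assumed file name

# One histogram (with KDE) and one boxplot per numeric column.
for col in df.select_dtypes(include="number").columns:
    fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(10, 3))
    sns.histplot(df[col], kde=True, ax=ax_hist)
    sns.boxplot(x=df[col], ax=ax_box)
    fig.suptitle(col)
    plt.show()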
Univariate analysis for categorical data:

Insights:

From the analysis of the categorical variable ‘runqsz’ (process run queue size), 3861 observations are CPU_Bound and 4331 are Not_CPU_Bound.
Bivariate analysis: Pairplot (pairwise relationships between variables):
Bivariate analysis: Heatmap (Check for presence of correlations):

Insights:

 From the analysis we can see strong correlations between several of the variables.

 No further multivariate analysis is performed for this dataset.
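A sketch of the correlation heatmap step, under the same assumed DataFrame and file name:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("compactiv.csv")  # assumed file name

# Pairwise correlations over the numeric columns only.
corr = df.corr(numeric_only=True)
plt.figure(figsize=(14, 10))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.show()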


1.2 Checking the null values:

Insights:
There are 104 null values in the variable ‘rchar’ and 15 null values in the variable ‘wchar’.
Treating the null values:
Insights:
 The null values in the variables ‘rchar’ and ‘wchar’ are treated with their respective mean values (see the sketch after this list).
 There are no duplicate values in the dataset

 We can keep the zeros in our dataset for further analysis, as some features can legitimately be 0 when the system stays idle.
 There is no scope for creating new features here.
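A minimal sketch of the mean imputation and the duplicate check described above, again under the assumed df / compactiv.csv names:

import pandas as pd

df = pd.read_csv("compactiv.csv")  # assumed file name

# Impute the missing values in 'rchar' and 'wchar' with the column means.
for col in ["rchar", "wchar"]:
    df[col] = df[col].fillna(df[col].mean())

# Verify: no nulls remain in those columns, and no duplicate rows exist.
print(df[["rchar", "wchar"]].isnull().sum())
print("duplicate rows:", df.duplicated().sum())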

From the univariate analysis we can clearly see that outliers are present in the dataset, so we need to treat those outliers:
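The report does not state which outlier treatment was applied; one common choice is capping at the 1.5*IQR whiskers, sketched here under the same assumed names:

import pandas as pd

df = pd.read_csv("compactiv.csv")  # assumed file name

# Cap every numeric column at the 1.5*IQR whiskers. This is one common
# treatment; the report does not state which method was actually used.
for col in df.select_dtypes(include="number").columns:
    q1, q3 = df[col].quantile(0.25), df[col].quantile(0.75)
    iqr = q3 - q1
    df[col] = df[col].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)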
Insights:

Skewness of Variables:
lread: 13.9, lwrite: 5.28, scall: 0.9, sread: 5.46, swrite: 9.61, fork: 2.25, exec: 4.07, rchar: 2.85, wchar: 3.85, pgout: 5.07, ppgout: 4.68, pgfree: 4.77, pgscan: 5.81, atch: 21.54, pgin: 3.24, ppgin: 3.9, pflt: 1.72, vflt: 1.74 (‘runqsz’ is categorical, so no skewness is computed for it).

There are no duplicate rows in the dataset.
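The skewness values and the duplicate check above can be reproduced along these lines:

import pandas as pd

df = pd.read_csv("compactiv.csv")  # assumed file name

# Skewness of every numeric column ('runqsz' is categorical and is
# excluded automatically), plus a duplicate-row check.
print(df.skew(numeric_only=True).round(2))
print("duplicate rows:", df.duplicated().sum())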

1.3 Encode the data (having string values) for Modelling

Unique values for Categorical variable:

Converting categorical variables to Dummy variables


Sample data set after data-encoding:

Insights:
Here we can observe that the categorical variable ‘runqsz’ has been converted into a dummy variable, so the dataset is now entirely numerical.
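A sketch of the dummy encoding with pandas; drop_first=True drops the CPU_Bound level and keeps the single runqsz_Not_CPU_Bound column seen in the regression equation later in this report:

import pandas as pd

df = pd.read_csv("compactiv.csv")  # assumed file name

# One-hot encode 'runqsz', keeping one dummy column.
df = pd.get_dummies(df, columns=["runqsz"], drop_first=True)
print(df.filter(like="runqsz").head())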
Checking Multicollinearity using Variance Inflation Factor (VIF):
Insights:

 Variance Inflation Factor (VIF) is one of the methods to check whether the independent variables are correlated with one another. If they are, that is not ideal for linear regression models, as multicollinearity inflates the standard errors, which in turn affects the regression parameters. As a result, the regression model becomes unreliable and lacks interpretability.

 General rule of thumb: a VIF of 1 means there is no multicollinearity; a VIF of around 5 or more indicates moderate multicollinearity; a VIF of 10 or more indicates high multicollinearity.

 From the above, I conclude that the variables have moderate correlation.
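A sketch of the VIF computation using statsmodels' variance_inflation_factor, assuming the nulls have already been imputed (VIF cannot handle NaNs):

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("compactiv.csv")  # assumed file name
df = df.fillna(df.mean(numeric_only=True))  # mean-impute remaining nulls

# Predictors only: drop the target 'usr' and the categorical 'runqsz'.
X = sm.add_constant(df.drop(columns=["usr", "runqsz"]))
for i, col in enumerate(X.columns):
    if col != "const":
        print(f"VIF for {col} ---> {variance_inflation_factor(X.values, i)}")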

Linear Regression using scikit-learn: we fit the training data with LinearRegression().

The coefficients for each of the independent attributes:

Insights:

 The intercept: 82.80717337506023
 Mean Squared Error (MSE) on training data: 20.65
 Mean Squared Error (MSE) on test data: 18.97
 Root Mean Squared Error (RMSE) on training data: 4.5441663714860345
 Root Mean Squared Error (RMSE) on test data: 4.355572383064068
 Coefficient of Determination (R-square) on the train data: 0.7858835725266415
 Coefficient of Determination (R-square) on the test data: 0.7931214712225988
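These metrics can be computed along the following lines; the split ratio and random seed are assumptions (the report does not state them), so the exact numbers will differ:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("compactiv.csv")  # assumed file name
df = df.fillna(df.mean(numeric_only=True))
df = pd.get_dummies(df, columns=["runqsz"], drop_first=True)

X, y = df.drop(columns=["usr"]), df["usr"]
# 70/30 split with random_state=1 is an assumption.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

lr = LinearRegression().fit(X_train, y_train)
for name, Xs, ys in [("training", X_train, y_train), ("test", X_test, y_test)]:
    mse = mean_squared_error(ys, lr.predict(Xs))
    print(f"MSE on {name} data: {mse:.2f}")
    print(f"RMSE on {name} data: {np.sqrt(mse)}")
    print(f"r-square on {name} data: {r2_score(ys, lr.predict(Xs))}")
print("intercept:", lr.intercept_)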

Insights:

 The model's R-square on the training data is 0.7858835725266415, so about 78% of the variation in ‘usr’ is explained by the predictors for the train set.

 The model's R-square on the test data is 0.7931214712225988, so about 79% of the variation in ‘usr’ is explained by the predictors for the test set.

Linear Regression using statsmodels (OLS):

Insights:

 Mean Squared Error (MSE) on training data: 20.65
 Mean Squared Error (MSE) on test data: 18.97
 Root Mean Squared Error (RMSE) on training data: 4.5441663714860345
 Root Mean Squared Error (RMSE) on test data: 4.355572383063754
 Coefficient of Determination (R-square) on the train data: 0.7858835725266415
 Coefficient of Determination (R-square) on the test data: 0.7931214712226287
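A sketch of the statsmodels OLS fit, including a one-liner that reconstructs the fitted equation shown below from the parameter series; the split settings are the same assumed ones as before:

import pandas as pd
import statsmodels.api as sm
from sklearn.model_selection import train_test_split

df = pd.read_csv("compactiv.csv")  # assumed file name
df = df.fillna(df.mean(numeric_only=True))
df = pd.get_dummies(df, columns=["runqsz"], drop_first=True).astype(float)

X, y = df.drop(columns=["usr"]), df["usr"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

# Fit OLS on the training data (add_constant supplies the intercept term).
ols = sm.OLS(y_train, sm.add_constant(X_train)).fit()
print(ols.summary())

# Reconstruct the fitted equation in the same form as shown below.
print(" + ".join(f"({coef:.4f}) * {name}" for name, coef in ols.params.items()))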

Insights:
We can write the linear regression equation as:

usr = 82.8072 + (-0.0539) * lread + (0.0446) * lwrite + (-0.0008) * scall + (0.0024) * sread + (-0.0051) * swrite + (-0.1464) * fork + (-0.2312) * exec + (-0.0) * rchar + (-0.0) * wchar + (-0.4547) * pgout + (0.0316) * ppgout + (0.0411) * pgfree + (-0.0) * pgscan + (0.5216) * atch + (0.0059) * pgin + (-0.058) * ppgin + (-0.0314) * pflt + (-0.0064) * vflt + (-0.0005) * freemem + (0.0) * freeswap + (1.8468) * runqsz_Not_CPU_Bound
1.4 Inference: Based on the above analysis,

 The dataset contains 8192 observations and 22 columns (21 predictors plus the target ‘usr’), with 13 float, 8 int, and 1 object datatype columns.
 The null values in ‘rchar’ (104) and ‘wchar’ (15) were imputed with the respective column means, and there are no duplicate rows in the dataset.
 Zeros were retained, since several features can legitimately be 0 when the system stays idle, and there was no scope for creating new features.
 Most features are strongly right-skewed (e.g. lread: 13.9, atch: 21.54) and contain outliers, which were treated before modelling.
 The categorical variable ‘runqsz’ was converted into the dummy variable runqsz_Not_CPU_Bound, and the VIF check indicated moderate multicollinearity among the predictors.
 The scikit-learn and statsmodels (OLS) linear regressions give essentially identical results: MSE of 20.65 (train) and 18.97 (test), RMSE of about 4.54 (train) and 4.36 (test), and R-square of about 0.786 (train) and 0.793 (test). In other words, roughly 78-79% of the variation in ‘usr’ is explained by the predictors.
 In the fitted equation shown above, the largest coefficient (1.8468 for runqsz_Not_CPU_Bound) indicates that ‘usr’ is noticeably higher when the process run queue is not CPU bound.

2.1 EDA

Top Five Rows:

Bottom five rows:


Information about the data:

Insights:

 The data set contains 1473 observations and 10 columns.

 There are 2 float, 1 int, and 7 object datatype columns.

 There are some missing values in the columns ‘Wife_age’ and ‘No_of_children_born’.

 The shape of this dataset is (1473, 10).
Summary About the data:

 Both numeric and categorical columns are considered. Let's understand each variable and the statistics obtained above.

 For each variable, the mean, std, min, 25%, 50%, 75%, and max are shown.

 We can also see certain columns where NaN is shown; let's understand more about this data.

 There are 80 duplicate rows in the dataset.

Checking the null values in the dataset:

Insights:
 There are 71 null values in the column ‘Wife_age’ and 21 null values in the column ‘No_of_children_born’.
Treating the null values:

Here the null values are imputed with their respective column's mean.

Checking the data information after treating the Missing values:

Insights: The above table shows that all the missing (null) values have been treated successfully, and every column now has an equal number of entries.
Checking the data type of each variable:

Univariate Analysis for numerical variables using hist plot:


Univariate Analysis for categorical variables using count plot:
Bivariate Analysis:

Heatmap (Check for presence of correlations):

‘Wife_age’ and ‘No_of_children_born’ are slightly correlated.


Multivariable analysis:

Catplot for categorical and numerical variables:

After observing the above plots, as a data analyst I decided not to treat the outliers.

2.2 Encoding the data:

Insights:

Wife_education: Uneducated = 1, Primary = 2, Secondary = 3, Tertiary = 4.
Husband_education: Uneducated = 1, Primary = 2, Secondary = 3, Tertiary = 4.
Wife_religion: Scientology = 1 and Non-Scientology = 2.
Wife_Working: Yes = 1 and No = 2.
Standard_of_living_index: Very Low = 1, Low = 2, High = 3, Very High = 4.
Media_exposure: Exposed = 1 and Not-Exposed = 2.
Contraceptive_method_used: Yes = 1 and No = 0.
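A sketch of this encoding with pandas; the file name CMC.csv is an assumption, and the raw label strings are assumed to match the category names listed above:

import pandas as pd

df = pd.read_csv("CMC.csv")  # assumed file name

# Apply the codes listed above.
edu_map = {"Uneducated": 1, "Primary": 2, "Secondary": 3, "Tertiary": 4}
df["Wife_education"] = df["Wife_education"].map(edu_map)
df["Husband_education"] = df["Husband_education"].map(edu_map)
df["Wife_religion"] = df["Wife_religion"].map({"Scientology": 1, "Non-Scientology": 2})
df["Wife_Working"] = df["Wife_Working"].map({"Yes": 1, "No": 2})
df["Standard_of_living_index"] = df["Standard_of_living_index"].map(
    {"Very Low": 1, "Low": 2, "High": 3, "Very High": 4})
df["Media_exposure"] = df["Media_exposure"].map({"Exposed": 1, "Not-Exposed": 2})
df["Contraceptive_method_used"] = df["Contraceptive_method_used"].map({"Yes": 1, "No": 0})
print(df.dtypes)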

Checking the data information after encoding the data:

Insights:

The above table shows that after encoding there are no object-type columns left.

Checking Multicollinearity using Variance Inflation Factor (VIF):

VIF for Wife_age ---> 23.986752368551485
VIF for Wife_education ---> 16.16229659265143
VIF for Husband_education ---> 27.364195172617837
VIF for No_of_children_born ---> 4.376622562511411
VIF for Wife_religion ---> 12.615598921821046
VIF for Wife_Working ---> 14.783600009367635
VIF for Husband_Occupation ---> 6.921852898686767
VIF for Standard_of_living_index ---> 13.36777566587339
VIF for Media_exposure ---> 14.976189325600197

 Variance Inflation Factor (VIF) is one of the methods to check whether the independent variables are correlated with one another. If they are, that is not ideal for linear models, as multicollinearity inflates the standard errors, which in turn affects the regression parameters; the model becomes unreliable and lacks interpretability.
 General rule of thumb: a VIF of 1 means there is no multicollinearity; a VIF of around 5 or more indicates moderate multicollinearity; a VIF of 10 or more indicates high multicollinearity.
 From the above values, most of which exceed 10, I conclude that the variables show high multicollinearity.

2.3 Logistic Regression Model:


Getting the probabilities on the training data set:

Getting the probabilities on the test data set:

Logistic regression model accuracy on training data before applying GridSearchCV: 0.6646153846153846
Logistic regression model accuracy on test data before applying GridSearchCV: 0.645933014354067
Applying GridSearchCV for logistic regression:
Getting the probabilities on the training set:

Getting the probabilities on the test set:

Classification Report and Confusion Matrix:


AUC and ROC for the training and test data:

AUC for the Training Data: 0.7074, AUC for the Test Data: 0.6922
Logistic regression model accuracy on training data after applying GridSearchCV: 0.6646153846153846
Logistic regression model accuracy on test data after applying GridSearchCV: 0.645933014354067
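A sketch of the logistic regression with GridSearchCV, reusing the encoded DataFrame df from the encoding sketch above; the split, the parameter grid, and the seed are all assumptions, since the report lists none of them:

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import roc_auc_score

df = df.fillna(df.mean(numeric_only=True))  # mean-impute remaining nulls
X = df.drop(columns=["Contraceptive_method_used"])
y = df["Contraceptive_method_used"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y)

# A small, assumed parameter grid.
grid = {"C": [0.01, 0.1, 1, 10], "solver": ["liblinear", "lbfgs"]}
gs = GridSearchCV(LogisticRegression(max_iter=10000), grid, cv=5, scoring="accuracy")
gs.fit(X_train, y_train)
best = gs.best_estimator_

print("Training accuracy:", best.score(X_train, y_train))
print("Test accuracy:", best.score(X_test, y_test))
print("Training AUC:", roc_auc_score(y_train, best.predict_proba(X_train)[:, 1]))
print("Test AUC:", roc_auc_score(y_test, best.predict_proba(X_test)[:, 1]))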

LDA (Linear Discriminant Analysis) Model:


Training Data probability prediction:

Test Data probability prediction:

LDA Classification Report on Training Dataset:

LDA Confusion Matrix on Training Dataset:


LDA Classification Report on Test Dataset:

LDA Confusion Matrix on Test Dataset:

AUC and ROC for the training and test data:

AUC for the Training Data: 0.7074, AUC for the Test Data: 0.6922
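A sketch of the LDA model, reusing the assumed train/test split from the logistic regression sketch above:

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

# X_train/X_test/y_train/y_test come from the assumed split above.
lda = LinearDiscriminantAnalysis().fit(X_train, y_train)

for name, Xs, ys in [("Training", X_train, y_train), ("Test", X_test, y_test)]:
    pred = lda.predict(Xs)
    print(f"--- LDA on {name} dataset ---")
    print(classification_report(ys, pred))
    print(confusion_matrix(ys, pred))
    print("AUC:", roc_auc_score(ys, lda.predict_proba(Xs)[:, 1]))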
CART Model: Decision tree using graphviz
Pruning the Decision Tree
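A sketch of the CART model with a graphviz export and a simple pruned variant, again reusing the assumed split; the pruning parameters are assumptions, since the report does not state the ones it used:

from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Unpruned CART model on the assumed train/test split.
cart = DecisionTreeClassifier(criterion="gini", random_state=1).fit(X_train, y_train)

# Write the full tree to a .dot file for rendering with graphviz
# (e.g. `dot -Tpng cart.dot -o cart.png` on the command line).
export_graphviz(cart, out_file="cart.dot",
                feature_names=list(X_train.columns),
                class_names=["No", "Yes"], filled=True)

# A pruned tree, limiting depth and leaf size to reduce overfitting.
pruned = DecisionTreeClassifier(max_depth=5, min_samples_leaf=30,
                                random_state=1).fit(X_train, y_train)
print("Training accuracy:", pruned.score(X_train, y_train))
print("Test accuracy:", pruned.score(X_test, y_test))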

2.4 Inference:

 The EDA clearly indicates that women with a tertiary education and a very high standard of living used contraceptive methods the most; women aged roughly 21 to 38 generally use contraceptive methods more.
 The usage of contraceptive methods need not depend on demographic or socioeconomic background, since usage was almost the same for both working and non-working women.
 The use of contraceptive methods was high for both Scientology and non-Scientology women.
