PM Project - Sobac
1.1 EDA
Bottom 5 Rows:
Information about the Dataset:
Insights:
Before analysing the histogram, we should make sure that all the columns are relevant.
Insights:
To understand the dataset, each variable's summary statistics (mean, std, min, 25%, 50%, 75%, max) are shown here.
Also, we can see certain columns contain NaN values; let's understand this missing data further.
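These summary statistics come from pandas' describe(); a minimal sketch on made-up stand-in data (the real report uses the full dataset):

```python
import pandas as pd

# Illustrative subset of the system-activity data; the real data is read from file.
df = pd.DataFrame({
    "lread": [1, 0, 15, 2, 20],
    "usr": [95, 97, 87, 90, 96],
})

# describe() reports count, mean, std, min, 25%, 50%, 75% and max
# for every numeric column in one table.
stats = df.describe()
print(stats)

# Columns with missing entries show a lower `count`; isna().sum()
# lists the NaN count per column explicitly.
print(df.isna().sum())
```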
Univariate analysis for numerical data (Histplot & Boxplot):
Insights:
We have outliers in our columns, and we need to treat them before further analysis. From the Histplot we can say the features have right-skewed distributions (consistent with their positive skewness values), while the 'usr' feature is left-skewed.
Univariate analysis for categorical data:
Insights:
From the analysis of 'Process run queue size' (runqsz) we have 3861 records as CPU_Bound and 4331 as Not_CPU_Bound.
Bivariate analysis: Pairplot (pairwise relationships between variables):
Bivariate analysis: Heatmap (Check for presence of correlations):
Insights:
From the analysis we can see strong correlations between several of the variables.
Insights:
There are 104 null values in the variable 'rchar' and 15 null values in the variable 'wchar'.
Treating null values:
Insights:
The null values in the variables 'rchar' and 'wchar' are treated with their respective column means.
There are no duplicate values in the dataset
We can keep the zeros in our dataset for further analysis, as some features can legitimately be 0 when the system stays idle.
There is no possibility of creating new features here
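The mean imputation described above can be sketched as follows (column names from the dataset, values made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "rchar": [200.0, np.nan, 350.0, 250.0],
    "wchar": [90.0, 110.0, np.nan, 100.0],
})

# Replace the NaNs in each column with that column's mean.
for col in ["rchar", "wchar"]:
    df[col] = df[col].fillna(df[col].mean())

print(df.isna().sum())  # no nulls remain after imputation
```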
From the univariate analysis we can clearly see that outliers are present in the dataset, so we need to treat them:
Insights:
Skewness of variables:
lread: 13.9, lwrite: 5.28, scall: 0.9, sread: 5.46, swrite: 9.61, fork: 2.25, exec: 4.07, rchar: 2.85, wchar: 3.85, pgout: 5.07, ppgout: 4.68, pgfree: 4.77, pgscan: 5.81, atch: 21.54, pgin: 3.24, ppgin: 3.9, pflt: 1.72, vflt: 1.74 (runqsz is categorical, so no skewness value applies).
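One common way to treat such outliers is IQR capping; this is a sketch of that option (not necessarily the exact method used in the report), together with a skewness check:

```python
import pandas as pd

df = pd.DataFrame({"lread": [1, 2, 2, 3, 3, 4, 500]})  # 500 is an outlier


def cap_iqr(s: pd.Series) -> pd.Series:
    """Clip values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)


df["lread"] = cap_iqr(df["lread"])
print(df["lread"].skew())  # skewness shrinks once the outlier is capped
```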
Insights:
Here we can observe that the categorical variable 'runqsz' is converted into a dummy (numerical) variable.
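The conversion can be sketched with pandas' get_dummies:

```python
import pandas as pd

df = pd.DataFrame({"runqsz": ["CPU_Bound", "Not_CPU_Bound", "CPU_Bound"]})

# drop_first=True keeps a single 0/1 column, avoiding the redundant
# (perfectly collinear) second dummy.
df = pd.get_dummies(df, columns=["runqsz"], drop_first=True)
print(df.head())
```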
Checking Multicollinearity using Variance Inflation Factor (VIF):
Insights:
General rule of thumb: a VIF of 1 means no multicollinearity; a VIF of 5 or more indicates moderate multicollinearity; a VIF of 10 or more indicates high multicollinearity.
From the above I can conclude that the variables have moderate correlation.
Linear Regression using scikit-learn: We fit the dataset to a LinearRegression() model.
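A sketch of the fit and of how the metrics below are computed, using synthetic stand-in data rather than the actual dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the predictors and the 'usr' target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

model = LinearRegression().fit(X_train, y_train)

mse_test = mean_squared_error(y_test, model.predict(X_test))
rmse_test = np.sqrt(mse_test)
r2_test = r2_score(y_test, model.predict(X_test))
print(model.intercept_, rmse_test, r2_test)
```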
Insights:
Mean Squared Error (MSE) on training data: 20.65
Mean Squared Error (MSE) on test data: 18.97
Root Mean Squared Error (RMSE) on training data: 4.5441663714860345
Root Mean Squared Error (RMSE) on test data: 4.355572383063754
Coefficient of determination (R²) on the train data: 0.7858835725266415
Coefficient of determination (R²) on the test data: 0.7931214712226287
Insights:
We can write the linear regression equation as:
Both the numeric and the categorical columns are considered. Let's review each variable and the statistics we obtained above.
For this dataset we cannot perform multivariate analysis.
Variance Inflation Factor (VIF) is one method to check whether the independent variables are correlated with one another. Correlated predictors are not ideal for linear regression models, since multicollinearity inflates the standard errors, which in turn affects the regression parameters; as a result the regression model becomes unreliable and hard to interpret.
The intercept: 82.80717337506023
Linear regression model accuracy (R²) on the training data: 0.7858835725266415, so about 78% of the variation in usr is explained by the predictors in the model for the train set.
2.1 EDA
Insights:
There are some missing values in the 'Wife_age' column and in the 'No_of_children_born' column.
Shape of this Dataset is (1473, 10)
Summary About the data:
To understand the dataset, each variable's summary statistics (mean, std, min, 25%, 50%, 75%, max) are shown here.
Also, we can see certain columns contain NaN values; let's understand this missing data further.
Insights:
There are 71 null values in the 'Wife_age' column and 21 null values in the 'No_of_children_born' column.
Treating the null values:
Here the null values are imputed with their respective column's mean.
Insights: The table above shows that all the missing (null) values were treated successfully, and each column now has an equal number of entries.
Checking the data type of each variable:
After observing the above plots, as a data analyst I decided not to treat the outliers.
2.2.
Encoding the data:
Insights:
The table above shows that after encoding there is no object-type data remaining.
Insights:
Logistic regression model accuracy on the training data before applying Grid Search CV: 0.6646153846153846
Logistic regression model accuracy on the test data before applying Grid Search CV: 0.645933014354067
Applying Grid Search CV for Logistic Regression:
Getting the probabilities on the training set:
AUC for the Training Data: 0.7074, AUC for the Test Data: 0.6922
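A sketch of the grid search and the AUC computation (the parameter grid here is an illustrative guess, not necessarily the one used in the report):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the encoded contraceptive dataset.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Illustrative grid; the report's exact grid is not shown.
grid = {"C": [0.01, 0.1, 1.0, 10.0], "solver": ["lbfgs", "liblinear"]}
search = GridSearchCV(
    LogisticRegression(max_iter=1000), grid, cv=5, scoring="roc_auc"
)
search.fit(X_train, y_train)

# predict_proba gives class probabilities; column 1 is P(class = 1).
proba_test = search.predict_proba(X_test)[:, 1]
print(search.best_params_)
print(roc_auc_score(y_test, proba_test))
```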
CART Model: Decision tree using graphviz
Pruning the decision tree
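Pruning can be sketched with scikit-learn's cost-complexity pruning (ccp_alpha), and the tree rendered via export_graphviz as the heading mentions; the data and the alpha value here are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Synthetic stand-in data for the classification task.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X, y)
# Cost-complexity pruning: a larger ccp_alpha removes weaker splits.
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01).fit(X, y)

print(full.get_n_leaves(), pruned.get_n_leaves())  # pruned tree has fewer leaves

dot = export_graphviz(pruned, filled=True)  # DOT source for graphviz rendering
```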
2.4
The EDA clearly indicates that women with a tertiary education and a very high standard of living used contraceptive methods. Women aged 21 to 38 generally use contraceptive methods more.
The usage of contraceptive methods need not depend on demographic or socioeconomic background, since usage was almost the same for both working and non-working women.
The use of contraceptive methods was high for both Scientology and Non-Scientology women.