PREDICTIVE MODELLING
DSBA-PM Module
Table of Contents
Home
Index
List of Figures
List of Equations
Problem 1
Problem statement
Data dictionary
1.1
1.2
1.3
1.4
Problem 2
Problem statement
Data dictionary
2.1
2.2
2.3
2.4
List of Figures
Figure 1 Compactive Data Shape
Figure 2 Compactive Data info
Figure 3 Compactive Data head
Figure 4 Compactive Data tail
Figure 5 Compactive Data describe
Figure 6 Duplicates
Figure 7 Boxplot
Figure 8 Bivariate
Figure 9 Bivariate analysis 2
Figure 10 Null values
Figure 11 Null values imputation
Figure 12 Description after impute
Figure 13 Zero counts in Data
Figure 14 Duplicates
Figure 15 Boxplot
Figure 16 After outlier treatment Boxplot
Figure 17 Distribution of Object type variable runqsz
Figure 18 Data info after encoding data
Figure 19 X head
Figure 20 X train head data
Figure 21 X test head data
Figure 22 Model fit
Figure 23 R sq without read
Figure 24 Rsq without lwrite
Figure 25 Model 2 without ppgout
Figure 26 VIF values 2
Figure 27 Model 3 without fork
Figure 28 VIF values without fork
Figure 29 Model 4 without pgin
Figure 30 VIF values without pgin
Figure 31 Model 5 without vflt
Figure 32 VIF values without vflt
Figure 33 Model 6 without pgout
Figure 34 VIF values without pgout
Figure 35 Model 7 without sread
Figure 36 VIF values without sread
Figure 37 Model 8 without lread
Figure 38 VIF values without lread
Figure 39 Model 9 without swrite
Figure 40 VIF values without swrite
Figure 41 Model 10 without pflt
Figure 42 VIF values without pflt
Figure 43 Final Model
Figure 44 Actuals vs residuals
Figure 45 Best fit
Figure 46 Pair plot to see distribution between y and all variables
Figure 47 Normality of residuals
Figure 48 QQ plot of residuals
Figure 49 Shapiro test
Figure 50 Homoscedasticity test
List of Equations
Equation 1 Final Model equation for usr
Equation 2 RMSE train vs test
Equation 3 MAE train vs test
As a budding data scientist, you want to fit a linear equation that predicts
'usr' (the portion of time (%) that CPUs run in user mode) from a list of
system attributes, and to find out how each attribute affects the time the
system spends in 'usr' mode.
DATA DICTIONARY:
-----------------------
System measures used:
lread - Reads (transfers per second) between system memory and user memory
lwrite - Writes (transfers per second) between system memory and user memory
scall - Number of system calls of all types per second
sread - Number of system read calls per second
swrite - Number of system write calls per second
fork - Number of system fork calls per second
exec - Number of system exec calls per second
rchar - Number of characters transferred per second by system read calls
wchar - Number of characters transferred per second by system write calls
pgout - Number of page-out requests per second
ppgout - Number of pages paged out per second
pgfree - Number of pages per second placed on the free list
pgscan - Number of pages checked if they can be freed per second
atch - Number of page attaches (satisfying a page fault by reclaiming a page in
memory) per second
pgin - Number of page-in requests per second
ppgin - Number of pages paged in per second
pflt - Number of page faults caused by protection errors (copy-on-writes)
vflt - Number of page faults caused by address translation
runqsz - Process run queue size (the number of kernel threads in memory that
are waiting for a CPU to run). Typically, this value should be less than 2.
Consistently higher values mean that the system might be CPU-bound.
freemem - Number of memory pages available to user processes
freeswap - Number of disk blocks available for page swapping
1.1 Read the data and do exploratory data analysis. Describe the data
briefly. (Check the Data types, shape, EDA, 5-point summary). Perform
Univariate, Bivariate Analysis, Multivariate Analysis.
1.2 Impute null values if present, also check for the values which are equal
to zero. Do they have any meaning or do we need to change them or
drop them? Check for the possibility of creating new features if
required. Also check for outliers and duplicates, if any.
1.3 Encode the data (having string values) for Modelling. Split the data into
train and test (70:30). Apply Linear regression using scikit-learn.
Perform checks for significant variables using an appropriate method from
statsmodels. Create multiple models and check the performance of
Predictions on Train and Test sets using Rsquare, RMSE & Adj
Rsquare. Compare these models and select the best one with
appropriate reasoning.
1.4 Inference: Based on these predictions, what are the business insights
and recommendations. Please explain and summarise the various
steps performed in this project. There should be proper business
interpretation and actionable insights present.
1.1 Read the data and do exploratory data analysis. Describe the data briefly.
(Check the Data types, shape, EDA, 5-point summary). Perform
Univariate, Bivariate Analysis, Multivariate Analysis.
Data Shape
Data info
Duplicated rows
Figure 6 Duplicates
'usr' is the dependent variable for the model; all other variables are independent variables.
Figure 7 Boxplot
Figure 8 Bivariate
Summary:
Of the 22 variables, 13 are of float type, 8 are of integer type, and 1 is of object type.
1.2 Impute null values if present, also check for the values which are equal to
zero. Do they have any meaning or do we need to change them or drop
them? Check for the possibility of creating new features if required. Also
check for outliers and duplicates, if any.
The data contains missing values.
The null values are in 'rchar' and 'wchar'. As both are continuous variables, the missing
values can be imputed with the column mean.
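The mean imputation described above can be sketched as follows. This is a minimal illustration on a made-up toy frame, assuming the data is held in a pandas DataFrame named `df` (a hypothetical name); it is not the report's original notebook code.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; the values are made up.
df = pd.DataFrame({
    "rchar": [100.0, np.nan, 300.0, 200.0],
    "wchar": [10.0, 30.0, np.nan, 20.0],
})

# Count missing values per column before imputing.
null_counts_before = df.isnull().sum()

# Replace missing values in each column with that column's mean.
for col in ["rchar", "wchar"]:
    df[col] = df[col].fillna(df[col].mean())

# Count missing values again to confirm nothing is left.
null_counts_after = df.isnull().sum()
```

The mean is computed over the non-missing values only, so the imputed rows do not shift the column average.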
We keep the zeros in the dataset for further analysis, since several features can
legitimately be 0 when the system is idle.
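A quick way to see how many zeros each column holds is sketched below. The frame and its values are illustrative only, and `df` is a hypothetical name; the column names are taken from the data dictionary.

```python
import pandas as pd

# Toy frame; values are made up for illustration.
df = pd.DataFrame({
    "fork": [0.0, 1.2, 0.0, 0.4],
    "pgout": [0.0, 0.0, 0.0, 2.0],
    "usr": [90, 85, 97, 70],
})

# Number of exact zeros in each column.
zero_counts = (df == 0).sum()

# Share of rows that are zero, as a percentage of all rows.
zero_pct = (zero_counts / len(df) * 100).round(1)
```

A high zero share in a rate-type column such as 'pgout' is plausible for an idle system, which is why the zeros are kept rather than treated as missing.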
Duplicates were checked earlier, and there are no duplicate rows.
Figure 14 Duplicates
Outlier treatment
Figure 15 Boxplot
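The report does not state which outlier treatment was applied. One common option, shown here purely as an assumption, is to cap values at the boxplot whiskers (1.5 times the interquartile range beyond the quartiles). The helper name `cap_outliers_iqr` is hypothetical.

```python
import pandas as pd


def cap_outliers_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Cap values outside [Q1 - k*IQR, Q3 + k*IQR] at those limits."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series.clip(lower=lower, upper=upper)


# Toy series with one extreme value; the numbers are made up.
s = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])
capped = cap_outliers_iqr(s)
```

Capping keeps every row in the data, which matters when the test set must retain the same size as before treatment.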
1.3 Encode the data (having string values) for Modelling. Split the data into
train and test (70:30). Apply Linear regression using scikit-learn. Perform
checks for significant variables using an appropriate method from statsmodels.
Create multiple models and check the performance of Predictions on Train
and Test sets using Rsquare, RMSE & Adj Rsquare. Compare these
models and select the best one with appropriate reasoning.
Figure 19 X head
The data is split into train and test sets in a 70:30 ratio, and a model is fitted to see the initial fit.
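A minimal sketch of the 70:30 split and a first scikit-learn fit is given below. The data is synthetic, and the coefficients, the `random_state` value and the choice of two predictors are assumptions made only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic predictors and target; not the report's data.
rng = np.random.default_rng(0)
X = pd.DataFrame({"exec": rng.normal(size=200), "atch": rng.normal(size=200)})
y = 80 - 3 * X["exec"] - 2 * X["atch"] + rng.normal(scale=0.5, size=200)

# Hold out 30% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

# Fit ordinary least squares and report R-squared on both sets.
model = LinearRegression().fit(X_train, y_train)
r2_train = model.score(X_train, y_train)
r2_test = model.score(X_test, y_test)
```

Comparing `r2_train` with `r2_test` gives an early signal of overfitting before any variables are dropped.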
For each predictor variable there is a null hypothesis (its coefficient is zero, so the
variable has no effect on 'usr') and an alternate hypothesis (its coefficient is not zero).
The table above shows that sread, fork, ppgout, pgfree and pgin have high p-values, so
these variables are not significant and can be removed.
Variance Inflation Factor (VIF) is one way to check whether the independent variables are
correlated with one another.
Correlated predictors are not ideal for linear regression models because they inflate the
standard errors of the coefficient estimates, which makes the estimated coefficients unstable.
As a result, the regression model becomes less reliable and harder to interpret.
Figure 24 Rsq without lwrite
Figure 25 Rsq without scall
Figure 26 Rsq without sread
Figure 27 Rsq without swrite
Figure 28 Rsq without fork
Figure 29 Rsq without exec
Figure 30 Rsq without rchar
Figure 31 Rsq without pgout
Figure 32 Rsq without ppgout
Figure 33 Rsq without pgfree
Figure 34 Rsq without pgin
Figure 35 Rsq without ppgin
Figure 36 Rsq without pflt
Figure 37 Rsq without vflt
Figure 28 VIF values without fork
Now that no VIF value is above 2, it is safe to say that multicollinearity
is no longer a concern among the remaining predictors.
In this model, 'wchar' has a p-value > 0.05, hence we can remove it.
Linearity
Independence
Homoscedasticity
Normality of error terms
No strong Multicollinearity
Most points lie close to the line, but residuals remain, so this is not a perfect
fit.
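The Shapiro-Wilk test can also be used to check normality. The null and alternate
hypotheses of the test are as follows:
Null hypothesis: the residuals are normally distributed.
Alternate hypothesis: the residuals are not normally distributed.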
Since the p-value < 0.05, the residuals are not normal as per the Shapiro-Wilk test.
Strictly speaking, the residuals are not normal. However, as an approximation, we
might be willing to accept this distribution as close to normal.
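The normality check can be sketched with `scipy.stats.shapiro`. The residuals below are synthetic and deliberately skewed, so the test rejects normality; they are not the report's residuals.

```python
import numpy as np
from scipy import stats

# Synthetic, right-skewed "residuals" centred near zero.
rng = np.random.default_rng(0)
skewed_residuals = rng.exponential(scale=1.0, size=500) - 1.0

# shapiro returns the W statistic and the p-value.
stat, p_value = stats.shapiro(skewed_residuals)

# Reject the null hypothesis of normality at the 5% level.
reject_normality = p_value < 0.05
```

With large samples the test flags even small departures from normality, which is one reason to read it alongside the histogram and the QQ plot.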
Since the p-value > 0.05, we fail to reject the null hypothesis of constant variance, so the residuals can be treated as homoscedastic.
1.4 Inference: Based on these predictions, what are the business insights and
recommendations. Please explain and summarise the various steps
performed in this project. There should be proper business interpretation
and actionable insights present.
Summary:
Process run queue size, the number of system exec calls per second, and the
number of page attaches per second have a significant impact on the portion
of time (%) that CPUs run in user mode.
The root mean square error (RMSE) and mean absolute error (MAE) are calculated
for the train and test data.
Equation 2 RMSE train vs test
The RMSE values on the train and test sets are comparable, so the
model is not suffering from overfitting.
The MAE indicates that the current model predicts 'usr' within a mean error
of 2.2 units on the test data.
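Both error measures can be computed directly from actual and predicted values, as in the sketch below. The helper names `rmse` and `mae` and the sample numbers are illustrative and are not the report's results.

```python
import numpy as np


def rmse(y_true, y_pred) -> float:
    """Root mean square error: square root of the mean squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def mae(y_true, y_pred) -> float:
    """Mean absolute error: mean of the absolute residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))


# Made-up actual and predicted 'usr' values.
y_true = [90.0, 85.0, 70.0, 95.0]
y_pred = [88.0, 86.0, 74.0, 95.0]

rmse_value = rmse(y_true, y_pred)
mae_value = mae(y_true, y_pred)
```

RMSE is never smaller than MAE on the same data because squaring weights large errors more heavily, so a wide gap between the two points to a few large misses.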
Hence, we can conclude that the model, with a fit of about 71%, is reasonable
for prediction as well as inference purposes.
2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do null
value condition check, check for duplicates and outliers and write an inference
on it. Perform Univariate and Bivariate Analysis and Multivariate Analysis.
2.2 Do not scale the data. Encode the data (having string values) for Modelling.
Data Split: Split the data into train and test (70:30). Apply Logistic Regression
and LDA (linear discriminant analysis) and CART.
Data Dictionary:
2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do
null value condition check, check for duplicates and outliers and write an
inference on it. Perform Univariate and Bivariate Analysis and Multivariate
Analysis.
Head
Wife age and number of children born have null values.
Both are numeric variables, so the missing values are replaced with the median.
Univariate Analysis
Bivariate analysis
Multivariate Analysis
2.2 Do not scale the data. Encode the data (having string values) for Modelling.
Data Split: Split the data into train and test (70:30). Apply Logistic Regression
and LDA (linear discriminant analysis) and CART.
The categorical variables Wife_education, Husband_education,
Wife_religion, Standard_of_living_index, Media_exposure and
Contraceptive_method_used were encoded as integers in ascending order from
worst to best, since LDA cannot take string variables as model inputs.
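An ordered encoding of this kind can be sketched with a pandas ordered categorical. The category labels and their order below are assumptions made for illustration; the actual labels should be read from the data.

```python
import pandas as pd

# Toy column; the labels are assumed, not taken from the dataset.
df = pd.DataFrame({
    "Wife_education": ["Primary", "Tertiary", "Uneducated", "Secondary"],
})

# Order from worst to best, so a larger code means a higher level.
education_order = ["Uneducated", "Primary", "Secondary", "Tertiary"]

# Replace each label with its position in the ordered list.
df["Wife_education"] = pd.Categorical(
    df["Wife_education"], categories=education_order, ordered=True
).codes
```

Any label missing from the ordered list is encoded as -1, which makes typos in the category names easy to spot.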
Bivariate analysis
From the chart above, media exposure appears to play the most
important role in the use of a contraceptive method. The other important factors are
standard of living, husband occupation, working status, and wife religion.
An AUC value closer to 1 indicates good separability between the
predicted classes, and thus a model that is good for prediction.
The ROC curve represents this concept visually: the curve should
be as far as possible from the diagonal.
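AUC and the points of the ROC curve can be obtained from scikit-learn as sketched below. The labels and scores are a small made-up example, not output from the models in this report.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up true classes and predicted probabilities of class 1.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# Area under the ROC curve: 0.5 is chance level, 1.0 is perfect.
auc = roc_auc_score(y_true, y_score)

# False positive and true positive rates at each score threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
```

Plotting `tpr` against `fpr` gives the ROC curve; the diagonal from (0, 0) to (1, 1) corresponds to a model with no discriminating power.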
2.4 Inference: Based on these predictions, what are the insights and
recommendations.
The EDA indicates that women with a tertiary education
and a very high standard of living used contraceptive methods.
The use of contraceptive methods was high for both Scientology and
Non-Scientology women.