PM Project FaizanAliSayyed 20oct

GRADED PROJECT-
PREDICTIVE MODELLING
DSBA-PM Module
Faizan Ali Sayyed
Table of Contents
Home
Index
List of Figures
List of Equations
Problem 1
Problem statement
Data dictionary
1.1
1.2
1.3
1.4
Problem 2
Problem statement
Data dictionary
2.1
2.2
2.3
2.4
Great Learning- DSBA- AS Project- Faizan Ali Sayyed

List of Figures
Figure 1 Compactive Data Shape
Figure 2 Compactive Data info
Figure 3 Compactive Data head
Figure 4 Compactive Data tail
Figure 5 Compactive Data describe
Figure 6 Duplicates
Figure 7 Boxplot
Figure 8 Bivariate
Figure 9 Bivariate analysis 2
Figure 10 Null values
Figure 11 Null values imputation
Figure 12 Description after impute
Figure 13 Zero counts in Data
Figure 14 Duplicates
Figure 15 Boxplot
Figure 16 After outlier treatment Boxplot
Figure 17 Distribution of Object type variable runqsz
Figure 18 Data info after encoding data
Figure 19 X head
Figure 20 X train head data
Figure 21 X test head data
Figure 22 Model fit
Figure 23 Rsq without lread
Figure 24 Rsq without lwrite
Figure 25 Model 2 without ppgout
Figure 26 VIF values 2
Figure 27 Model 3 without fork
Figure 28 VIF values without fork
Figure 29 Model 4 without pgin
Figure 30 VIF values without pgin
Figure 31 Model 5 without vflt
Figure 32 VIF values without vflt
Figure 33 Model 6 without pgout
Figure 34 VIF values without pgout
Figure 35 Model 7 without sread
Figure 36 VIF values without sread
Figure 37 Model 8 without lread
Figure 38 VIF values without lread
Figure 39 Model 9 without swrite
Figure 40 VIF values without swrite
Figure 41 Model 10 without pflt
Figure 42 VIF values without pflt
Figure 43 Final Model
Figure 44 Actuals vs residuals
Figure 45 best fit
Figure 46 pair plot to see distribution between y and all variables
Figure 47 Normality of residuals
Figure 48 QQ plot of residuals
Figure 49 shapiro test
Figure 50 Homoscedasticity test
Figure 51 Final Model

Figure 52 Model parameters
Figure 53 Data 2 head
Figure 54 Data2 shape
Figure 55 Data2 Tail
Figure 56 Data2 info
Figure 57 Data 2 descriptive summary
Figure 58 Null values
Figure 59 Imputing null values with median
Figure 60 No of duplicate rows
Figure 61 removing duplicate rows from Data2
Figure 62 Revised description after treating duplicates and nulls
Figure 63 Categorical variable value count
Figure 64 Boxplot Data2
Figure 65 Outlier treatment
Figure 66 Replacing int with category
Figure 67 Heat map Data2
Figure 68 Pair plot Data2
Figure 69 Multivariate analysis
Figure 70 Revised Data 2 head after encoding
Figure 71 All data types changed to Numeric
Figure 72 Data2 Updated heatmap after encoding
Figure 73 Pair plot Data2 after encoding
Figure 74 Y train values
Figure 75 Y test values
Figure 76 fit check
Figure 77 revised fit
Figure 78 Decision tree feature importance
Figure 79 Predicted classes
Figure 80 Data2 Model accuracy
Figure 81 ROC curve X train
Figure 82 ROC curve test data
Figure 83 Classification report Train data
Figure 84 Classification report test data
Figure 85 Confusion matrix training data
Figure 86 Confusion matrix testing data
Figure 87 ROC curve train CART
Figure 88 ROC test CART
Figure 89 Confusion matrix train CART
Figure 90 Confusion matrix test CART
Figure 91 Model Accuracy CART
Figure 92 Summary comparison
List of Equations
Equation 1 Final Model equation for usr
Equation 2 RMSE train vs test
Equation 3 MAE train vs test

Problem 1 Linear Regression:
The comp-activ database is a collection of computer system activity measures.
The data was collected from a Sun Sparcstation 20/712 with 128 Mbytes of
memory running in a multi-user university department. Users would typically be
doing a large variety of tasks ranging from accessing the internet, editing files or
running very cpu-bound programs.
As you are a budding data scientist you thought to find out a linear equation to
build a model to predict 'usr' (Portion of time (%) that cpus run in user mode) and
to find out how each attribute affects the system to be in 'usr' mode using a list of
system attributes.

DATA DICTIONARY:
-----------------------
System measures used:
lread - Reads (transfers per second) between system memory and user memory
lwrite - Writes (transfers per second) between system memory and user memory
scall - Number of system calls of all types per second
sread - Number of system read calls per second
swrite - Number of system write calls per second
fork - Number of system fork calls per second
exec - Number of system exec calls per second
rchar - Number of characters transferred per second by system read calls
wchar - Number of characters transferred per second by system write calls
pgout - Number of page out requests per second
ppgout - Number of pages paged out per second
pgfree - Number of pages per second placed on the free list
pgscan - Number of pages checked if they can be freed per second
atch - Number of page attaches (satisfying a page fault by reclaiming a page in memory) per second
pgin - Number of page-in requests per second
ppgin - Number of pages paged in per second
pflt - Number of page faults caused by protection errors (copy-on-writes)
vflt - Number of page faults caused by address translation
runqsz - Process run queue size (the number of kernel threads in memory that are waiting for a CPU to run). Typically, this value should be less than 2. Consistently higher values mean that the system might be CPU-bound.
freemem - Number of memory pages available to user processes
freeswap - Number of disk blocks available for page swapping

1.1 Read the data and do exploratory data analysis. Describe the data
briefly. (Check the Data types, shape, EDA, 5-point summary). Perform
Univariate, Bivariate Analysis, Multivariate Analysis.
1.2 Impute null values if present, also check for the values which are equal
to zero. Do they have any meaning or do we need to change them or
drop them? Check for the possibility of creating new features if
required. Also check for outliers and duplicates if there.
1.3 Encode the data (having string values) for Modelling. Split the data into
train and test (70:30). Apply Linear regression using scikit learn.
Perform checks for significant variables using appropriate method from
statsmodel. Create multiple models and check the performance of
Predictions on Train and Test sets using Rsquare, RMSE & Adj
Rsquare. Compare these models and select the best one with
appropriate reasoning.
1.4 Inference: Basis on these predictions, what are the business insights
and recommendations. Please explain and summarise the various
steps performed in this project. There should be proper business
interpretation and actionable insights present.

1.1 Read the data and do exploratory data analysis. Describe the data briefly.
(Check the Data types, shape, EDA, 5-point summary). Perform
Univariate, Bivariate Analysis, Multivariate Analysis.
Data Shape
Figure 1 Compactive Data Shape
Data info
Figure 2 Compactive Data info
Data head and tail

Figure 3 Compactive Data head
Figure 4 Compactive Data tail
Statistical description of the data
Figure 5 Compactive Data describe

Duplicated rows
Figure 6 Duplicates
The data has no duplicate rows.
usr is the dependent variable for building the model, while all other variables are independent variables.
Figure 7 Boxplot

Figure 8 Bivariate

Figure 9 Bivariate analysis 2
Summary:
There are a total of 8192 rows and 22 columns in the dataset.
Of the 22 columns, 13 are float, 8 are integer and 1 is of object type.

1.2 Impute null values if present, also check for the values which are equal to
zero. Do they have any meaning or do we need to change them or drop
them? Check for the possibility of creating new features if required. Also
check for outliers and duplicates if there.
The dataset contains missing values: ‘rchar’ and ‘wchar’ have nulls. As both are continuous
variables, the nulls can be imputed with the mean value.
Figure 10 Null values
Imputing null values with mean
Figure 11 Null values imputation
Figure 12 Description after impute
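As a minimal sketch of the mean imputation described above, the snippet below fills missing entries with the mean of the observed ones. It uses plain Python rather than the pandas calls the report's notebook would normally rely on (for example `fillna` with the column mean), and the `rchar` readings are hypothetical values chosen only for illustration.

```python
from statistics import fmean


def impute_mean(values):
    """Replace None entries with the mean of the observed (non-None) entries."""
    observed = [v for v in values if v is not None]
    fill = fmean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical rchar readings with two missing entries (not taken from the dataset)
rchar = [1200.0, None, 800.0, 1000.0, None]
rchar_filled = impute_mean(rchar)
print(rchar_filled)
```

Mean imputation leaves the column mean unchanged but shrinks its variance slightly, which is worth keeping in mind when reading the description after imputation.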

We can keep the zero values in our dataset for further analysis, as several features in the
dataset can legitimately be 0 if the system stays idle.
Figure 13 Zero counts in Data
Duplicates have been checked already and we know there are no duplicate rows
Figure 14 Duplicates
Outlier treatment

Figure 15 Boxplot

Figure 16 After outlier treatment Boxplot

1.3 Encode the data (having string values) for Modelling. Split the data into
train and test (70:30). Apply Linear regression using scikit learn. Perform
checks for significant variables using appropriate method from statsmodel.
Create multiple models and check the performance of Predictions on Train
and Test sets using Rsquare, RMSE & Adj Rsquare. Compare these
models and select the best one with appropriate reasoning.
Data values for categorical variable
Figure 17 Distribution of Object type variable runqsz
In this data, the object-type variable runqsz is encoded with a dummy column
Figure 18 Data info after encoding data

First, the “usr” column is set aside as the output y, and the rest of the data is used as X
Figure 19 X head
Splitting the data into train and test sets in a 70:30 ratio and fitting the model to see the current fit.
Figure 20 X train head data
Figure 21 X test head data
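The 70:30 split can be sketched as a random hold-out over row indices. This is a plain-Python stand-in for scikit-learn's `train_test_split`, not the report's actual code; the function name `split_rows` and the seed value are hypothetical, and the test size is rounded up here by assumption.

```python
import math
import random


def split_rows(rows, test_fraction=0.30, seed=1):
    """Shuffle row indices and hold out `test_fraction` of the rows for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    # Assumption: the number of test rows is rounded up
    n_test = math.ceil(test_fraction * len(rows))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]


# Row ids standing in for the 8192 observations of the dataset
train, test = split_rows(list(range(8192)))
print(len(train), len(test))
```

Fixing the seed keeps the split reproducible, so that the R-squared and RMSE figures reported for train and test refer to the same rows on every run.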

Figure 22 Model fit
For each predictor variable there is a null hypothesis and an alternate hypothesis.
• Null hypothesis: the predictor variable is not significant
• Alternate hypothesis: the predictor variable is significant
Based on the p-values in the table above, sread, fork, ppgout, pgfree and pgin have high
p-values; these variables are therefore not significant and can be removed.

Checking Multicollinearity using Variance Inflation Factor (VIF)
Variance Inflation Factor (VIF) is one of the methods used to check whether the independent
variables are correlated with each other.
Correlated predictors are not ideal for linear regression models because they inflate the
standard errors, which in turn affects the regression parameters.
As a result, the regression model becomes unreliable and loses interpretability.
General rule of thumb:

If VIF is equal to 1, there is no multicollinearity.
If VIF is 5 or more, there is moderate multicollinearity.
If VIF is 10 or more, there is high multicollinearity.
From the above information we can conclude that there is some multicollinearity present in the
data.
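To make the rule of thumb concrete, here is a minimal sketch of how a VIF can be computed: each predictor is regressed on the remaining predictors and VIF = 1 / (1 - R²) of that auxiliary regression. It uses NumPy only and always adds an intercept to the auxiliary regression, so the numbers may differ from those of a library routine that expects the constant column to be supplied by the caller. The function name `vif` and the toy data are illustrative assumptions.

```python
import numpy as np


def vif(X):
    """Return the VIF of each column of X (rows are observations)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    factors = []
    for j in range(k):
        y = X[:, j]
        # Regress column j on an intercept plus all other columns
        A = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - float(resid @ resid) / float(((y - y.mean()) ** 2).sum())
        factors.append(1.0 / (1.0 - r2))
    return factors


# Toy example: two uncorrelated predictors, then two nearly collinear ones
uncorrelated = [[-1, -1], [-1, 1], [1, -1], [1, 1]]
collinear = list(zip([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1]))
print(vif(uncorrelated))
print(vif(collinear))
```

The uncorrelated pair gives VIF values of about 1, while the nearly collinear pair gives values far above 10, matching the two ends of the rule of thumb.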

We will focus on predictors having VIF values > 2
• Removing predictor “lread”
  o No significant drop in R-squared, hence the column can be removed from the model
Figure 23 Rsq without lread
• Removing predictor “lwrite”
Figure 24 Rsq without lwrite
• Removing predictor “scall”
Figure 25 Rsq without scall
• Removing predictor “sread”
Figure 26 Rsq without sread
• Removing predictor “swrite”
Figure 27 Rsq without swrite
• Removing predictor “fork”
Figure 28 Rsq without fork
• Removing predictor “exec”
Figure 29 Rsq without exec
• Removing predictor “rchar”
Figure 30 Rsq without rchar
• Removing predictor “pgout”
Figure 31 Rsq without pgout
• Removing predictor “ppgout”
Figure 32 Rsq without ppgout
• Removing predictor “pgfree”
Figure 33 Rsq without pgfree
• Removing predictor “pgin”
Figure 34 Rsq without pgin
• Removing predictor “ppgin”
Figure 35 Rsq without ppgin
• Removing predictor “pflt”
Figure 36 Rsq without pflt
• Removing predictor “vflt”
Figure 37 Rsq without vflt

Dropping the variable ppgout with little to no impact on the model
Figure 25 Model 2 without ppgout
Figure 26 VIF values 2

Removing fork from X train
Figure 27 Model 3 without fork
Figure 28 VIF values without fork
Removing pgin from X train
Figure 29 Model 4 without pgin
Figure 30 VIF values without pgin

Removing vflt from X train
Figure 31 Model 5 without vflt
Figure 32 VIF values without vflt
Removing pgout from X train
Figure 33 Model 6 without pgout
Figure 34 VIF values without pgout
Removing sread from X train
Figure 35 Model 7 without sread
Figure 36 VIF values without sread

Removing lread from X train
Figure 37 Model 8 without lread
Figure 38 VIF values without lread
Removing swrite from X train
Figure 39 Model 9 without swrite
Figure 40 VIF values without swrite

Removing pflt from X train
Figure 41 Model 10 without pflt
Figure 42 VIF values without pflt

Now that there are no VIF values above 2, it is safe to say that multicollinearity
has been removed from the dataset.
In this model, “wchar” has a p-value > 0.05, hence we can remove it.
Figure 43 Final Model
Final equation for usr
usr = 84.1328 - 0.0353*lwrite - 0.0015*scall - 1.7255*exec + 9.51e-06*rchar
      - 0.1516*pgfree + 0.5082*atch - 0.0625*pgin - 0.0006*freemem
      + 8.658e-06*freeswap + 1.6073*runqsz_Not_CPU_Bound
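As a worked illustration, the final equation can be evaluated with a few lines of Python. The coefficients are copied from the equation above; reading the rchar coefficient as 9.51e-06 and the pgin coefficient as 0.0625 is an assumption about how the printed equation should be interpreted. The function name `predict_usr` and the sample input are hypothetical.

```python
# Coefficients copied from the final equation for usr.
# Assumption: the rchar term is 9.51e-06 and the pgin term is 0.0625.
INTERCEPT = 84.1328
COEFFICIENTS = {
    "lwrite": -0.0353,
    "scall": -0.0015,
    "exec": -1.7255,
    "rchar": 9.51e-06,
    "pgfree": -0.1516,
    "atch": 0.5082,
    "pgin": -0.0625,
    "freemem": -0.0006,
    "freeswap": 8.658e-06,
    "runqsz_Not_CPU_Bound": 1.6073,
}


def predict_usr(features):
    """Evaluate the linear equation; features missing from the dict count as 0."""
    return INTERCEPT + sum(
        coef * features.get(name, 0.0) for name, coef in COEFFICIENTS.items()
    )


# Hypothetical observation: one exec call per second, everything else zero
print(predict_usr({"exec": 1.0}))
```

Read this way, each additional exec call per second lowers the predicted usr by about 1.73 percentage points when all other measures are held fixed.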

Testing the Assumptions of Linear Regression
For linear regression, we need to check whether the following assumptions hold:
• Linearity
• Independence
• Homoscedasticity
• Normality of error terms
• No strong multicollinearity
Figure 44 Actuals vs residuals
Figure 45 best fit

Figure 46 pair plot to see distribution between y and all variables

Figure 47 Normality of residuals
Figure 48 QQ plot of residuals
Most points lie on the line, but some residuals deviate from it, so the fit to a normal
distribution is not perfect.

The Shapiro-Wilk test can also be used to check normality. The null and alternate
hypotheses of the test are as follows:
• Null hypothesis - Data is normally distributed.
• Alternate hypothesis - Data is not normally distributed.
Figure 49 shapiro test
• Since p-value < 0.05, the residuals are not normal as per the Shapiro-Wilk test.
• Strictly speaking, the residuals are not normal. However, as an approximation, we
might be willing to accept this distribution as close to being normal.
Test for homoscedasticity:
Figure 50 Homoscedasticity test
Since p-value > 0.05, we can say that the residuals are homoscedastic.
With normality accepted as an approximation, the assumptions of linear regression are reasonably satisfied.

1.4 Inference: Basis on these predictions, what are the business insights and
recommendations. Please explain and summarise the various steps
performed in this project. There should be proper business interpretation
and actionable insights present.
Figure 51 Final Model
Summary:
• R-squared of the model is 0.711 and adjusted R-squared is 0.711, which
shows that the model is able to explain ~71% of the variance in the data. This is
quite good.
• Process run queue size, number of system exec calls per second and
number of page attaches have a significant impact on the portion of time (%)
that CPUs run in user mode.
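The near-identical R-squared and adjusted R-squared can be checked with the usual adjustment formula, 1 - (1 - R²)(n - 1)/(n - p - 1). The sketch below assumes n = 5734 training rows (70% of 8192) and p = 10 predictors, as counted from the final equation; both numbers are assumptions rather than values read from the model output.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared for a model with an intercept and n_predictors regressors."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Assumed sizes: 5734 training rows and 10 predictors
adj = adjusted_r2(0.711, 5734, 10)
print(round(adj, 4))
```

With thousands of rows and only ten predictors the penalty term is tiny, which is why the two figures agree to about three decimal places.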
Figure 52 Model parameters
Equation 1 Final Model equation for usr
Now predicting on the X test data,
Calculating Root mean square error and Mean Absolute error for train and test
data
Equation 2 RMSE train vs test
Equation 3 MAE train vs test
We can see that the RMSE on the train and test sets is comparable, so our
model is not suffering from overfitting.
MAE indicates that our current model is able to predict usr within a mean error
of 2.2 units on the test data.
Hence, we can conclude that the model, with a ~71% fit, is good for prediction as
well as inference purposes.
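For reference, the two error metrics compared above can be written out directly. This is a small plain-Python sketch of RMSE and MAE; the sample values are made up for illustration and are not the report's predictions.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equally long sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(actual, predicted):
    """Mean absolute error between two equally long sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up usr values and predictions, for illustration only
y_true = [90.0, 85.0, 70.0]
y_pred = [88.0, 85.0, 74.0]
print(rmse(y_true, y_pred), mae(y_true, y_pred))
```

RMSE penalises large misses more heavily than MAE, so a gap between the two on the same data points to a few large residuals rather than uniformly small ones.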

Problem 2 Logistic Regression, LDA and CART:

You are a statistician at the Republic of Indonesia Ministry of Health and you are
provided with data on 1473 women collected from a Contraceptive Prevalence
Survey. The samples are married women who were either not pregnant or did not
know if they were at the time of the survey.
The problem is to predict whether or not they use a contraceptive method of choice
based on their demographic and socio-economic characteristics.
2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do null
value condition check, check for duplicates and outliers and write an inference
on it. Perform Univariate and Bivariate Analysis and Multivariate Analysis.
2.2 Do not scale the data. Encode the data (having string values) for Modelling.
Data Split: Split the data into train and test (70:30). Apply Logistic Regression
and LDA (linear discriminant analysis) and CART.
2.3 Performance Metrics: Check the performance of Predictions on Train and
Test sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC
score for each model. Final Model: Compare both the models and write
inference which model is best/optimized.
2.4 Inference: Basis on these predictions, what are the insights and
recommendations.

Data Dictionary:
1. Wife's age (numerical)
2. Wife's education (categorical) 1=uneducated, 2, 3, 4=tertiary
3. Husband's education (categorical) 1=uneducated, 2, 3, 4=tertiary
4. Number of children ever born (numerical)
5. Wife's religion (binary) Non-Scientology, Scientology
6. Wife's now working? (binary) Yes, No
7. Husband's occupation (categorical) 1, 2, 3, 4(random)
8. Standard-of-living index (categorical) 1=very low, 2, 3, 4=high
9. Media exposure (binary) Good, Not good
10. Contraceptive method used (class attribute) No,Yes

2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do
null value condition check, check for duplicates and outliers and write an
inference on it. Perform Univariate and Bivariate Analysis and Multivariate
Analysis.
Head
Figure 53 Data 2 head
Figure 54 Data2 shape
Figure 55 Data2 Tail

Figure 56 Data2 info
Figure 57 Data 2 descriptive summary
Figure 58 Null values
We can see that Wife Age and No of children born have null values.
Both are numeric variables, so the nulls are replaced with the respective median values.
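The median imputation described above can be sketched as follows. This is a minimal illustration on a toy frame, not the report's actual code; the column names `Wife_age` and `No_of_children_born` are assumptions.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for Data2; the column names are assumed.
df = pd.DataFrame({
    "Wife_age": [24.0, np.nan, 45.0, 33.0, np.nan],
    "No_of_children_born": [3.0, 1.0, np.nan, 2.0, 0.0],
})

numeric_cols = ["Wife_age", "No_of_children_born"]

# Fill each column's nulls with that column's own median.
# The median is less sensitive to extreme values than the mean.
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

print(df.isnull().sum())
```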

Figure 59 Imputing null values with median
Figure 60 No of duplicate rows
There are 80 duplicate rows. Let us remove them.
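A minimal sketch of counting and dropping duplicate rows with pandas, shown on toy data with assumed column names (the report's own code is only available as a figure):

```python
import pandas as pd

# Toy frame with two repeated rows; the column names are assumed.
df = pd.DataFrame({
    "Wife_age": [24, 24, 45, 33, 33],
    "No_of_children_born": [3, 3, 1, 2, 2],
})

# duplicated() flags every repeat of an earlier row.
n_duplicates = int(df.duplicated().sum())
print(f"Number of duplicate rows: {n_duplicates}")

# Keep the first occurrence of each row and rebuild the index.
df = df.drop_duplicates().reset_index(drop=True)
print(f"Shape after removing duplicates: {df.shape}")
```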
Figure 61 removing duplicate rows from Data2
Figure 62 Revised description after treating duplicates and nulls

Univariate Analysis
Figure 63 Categorical variable value count

Figure 64 Boxplot Data2

Figure 65 Outlier treatment

Since husband occupation is a categorical variable, the numbers 1, 2, 3 and 4 are
replaced with strings.
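One way to replace the integer codes with category strings is a dictionary mapping. The sketch below is illustrative only: the column name `Husband_Occupation` and the labels `occ_1` to `occ_4` are hypothetical, since the report does not list the strings it used.

```python
import pandas as pd

# Toy column standing in for the husband occupation variable (name assumed).
df = pd.DataFrame({"Husband_Occupation": [1, 3, 2, 4, 1]})

# Hypothetical labels; the report does not state the actual strings.
occupation_labels = {1: "occ_1", 2: "occ_2", 3: "occ_3", 4: "occ_4"}

# Map the codes to strings and store the result as a pandas categorical.
df["Husband_Occupation"] = (
    df["Husband_Occupation"].map(occupation_labels).astype("category")
)

print(df["Husband_Occupation"].value_counts())
```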
Figure 66 Replacing int with category
Bivariate analysis
Figure 67 Heat map Data2

Figure 68 Pair plot Data2
Multivariate Analysis

Figure 69 Multivariate analysis

2.2 Do not scale the data. Encode the data (having string values) for Modelling.
Data Split: Split the data into train and test (70:30). Apply Logistic Regression
and LDA (linear discriminant analysis) and CART.
Encoded the categorical variables Wife_education, Husband_education,
Wife_religion, Wife_Working, Standard_of_living_index, Media_exposure and
Contraceptive_method_used as integers, with the ordinal variables in ascending
order from worst to best, since LDA does not accept string variables as inputs
for model building.
The encoding used:
Wife_education: Uneducated = 1, Primary = 2, Secondary = 3, Tertiary = 4
Husband_education: Uneducated = 1, Primary = 2, Secondary = 3, Tertiary = 4
Wife_religion: Scientology = 0 and Non-Scientology = 1
Wife_Working: Yes = 1 and No = 0
Standard_of_living_index: Very Low = 1, Low = 2, High = 3, Very High = 4
Media_exposure: Exposed = 1 and Not-Exposed = 0
Contraceptive_method_used: Yes = 1 and No = 0
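The encoding listed above can be expressed as one mapping dictionary per column. The sketch below runs on a toy frame; the exact string labels in the raw data (for example "Non-Scientology" or "Not-Exposed") are assumptions taken from the list above.

```python
import pandas as pd

# Toy rows standing in for Data2; the raw string labels are assumed.
df = pd.DataFrame({
    "Wife_education": ["Uneducated", "Tertiary", "Secondary"],
    "Husband_education": ["Primary", "Tertiary", "Tertiary"],
    "Wife_religion": ["Scientology", "Non-Scientology", "Scientology"],
    "Wife_Working": ["No", "Yes", "No"],
    "Standard_of_living_index": ["Very Low", "High", "Very High"],
    "Media_exposure": ["Exposed", "Not-Exposed", "Exposed"],
    "Contraceptive_method_used": ["Yes", "No", "Yes"],
})

education_map = {"Uneducated": 1, "Primary": 2, "Secondary": 3, "Tertiary": 4}
encoding = {
    "Wife_education": education_map,
    "Husband_education": education_map,
    "Wife_religion": {"Scientology": 0, "Non-Scientology": 1},
    "Wife_Working": {"Yes": 1, "No": 0},
    "Standard_of_living_index": {"Very Low": 1, "Low": 2, "High": 3, "Very High": 4},
    "Media_exposure": {"Exposed": 1, "Not-Exposed": 0},
    "Contraceptive_method_used": {"Yes": 1, "No": 0},
}

# Replace every string label with its integer code, column by column.
for col, mapping in encoding.items():
    df[col] = df[col].map(mapping)

print(df.dtypes)
```

An explicit dictionary keeps the ordinal order under the analyst's control, whereas an automatic label encoder would order the categories alphabetically.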
Figure 70 Revised Data 2 head after encoding

Figure 71 All data types changed to Numeric
Bivariate analysis
Figure 72 Data2 Updated heatmap after encoding

Figure 73 Pair plot Data2 after encoding

Train test split
Figure 74 Y train values
Figure 75 Y test values
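A minimal sketch of the 70:30 split followed by fitting the three models named in this section. The data here is synthetic and the column names are assumed, so the printed accuracies will not match the report; `random_state=1` is likewise an arbitrary choice.

```python
import numpy as np
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded Data2 (column names are assumed).
rng = np.random.default_rng(1)
n = 400
X = pd.DataFrame({
    "Wife_age": rng.integers(16, 50, n),
    "No_of_children_born": rng.integers(0, 9, n),
    "Media_exposure": rng.integers(0, 2, n),
})
noise = rng.normal(0, 2, n)
y = (X["No_of_children_born"] + noise > 3).astype(int)
y = y.rename("Contraceptive_method_used")

# 70:30 train/test split; no scaling is applied.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "LDA": LinearDiscriminantAnalysis(),
    "CART": DecisionTreeClassifier(random_state=1),
}

# Fit each model and record (train accuracy, test accuracy).
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"{name}: train={scores[name][0]:.3f}, test={scores[name][1]:.3f}")
```

A large gap between train and test accuracy for the unrestricted tree is the usual sign of overfitting.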
Figure 76 Fit check
The decision tree model overfits the training data, hence we need to do a
grid search.
Let us take max_depth as 10, min_samples_leaf as 15 and min_samples_split as 15.
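A grid search over the tree's pruning parameters could look like the sketch below. The grid values around 10, 15 and 15, the 3-fold cross-validation and the synthetic data are all assumptions made for illustration; the report does not state the grid it searched.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the encoded Data2.
X, y = make_classification(
    n_samples=600, n_features=9, n_informative=4, random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

# Assumed grid; it contains the values quoted in the text (10, 15, 15).
param_grid = {
    "max_depth": [5, 10, 15],
    "min_samples_leaf": [5, 15, 30],
    "min_samples_split": [15, 30, 45],
}

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_grid,
    cv=3,
    scoring="accuracy",
)
grid.fit(X_train, y_train)

# The best estimator is refit on the full training set by default.
best_tree = grid.best_estimator_
train_acc = best_tree.score(X_train, y_train)
test_acc = best_tree.score(X_test, y_test)
print(grid.best_params_)
print(f"train={train_acc:.3f}, test={test_acc:.3f}")
```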
Figure 77 revised fit

Figure 78 Decision tree feature importance
From the above chart, we can establish that media exposure plays the most
important role in the use of a contraceptive method. The other important factors are
standard of living, husband's occupation, working status and wife's religion.
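Feature importances of a fitted tree can be read from its `feature_importances_` attribute. The sketch below uses synthetic data with assumed column names, so the resulting ranking is illustrative and does not reproduce the chart discussed above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Assumed column names for the nine predictors.
feature_names = [
    "Wife_age", "Wife_education", "Husband_education",
    "No_of_children_born", "Wife_religion", "Wife_Working",
    "Husband_Occupation", "Standard_of_living_index", "Media_exposure",
]

# Synthetic data; the importances below say nothing about the real dataset.
X, y = make_classification(
    n_samples=500, n_features=len(feature_names), n_informative=4, random_state=1
)
X = pd.DataFrame(X, columns=feature_names)

tree = DecisionTreeClassifier(
    max_depth=10, min_samples_leaf=15, min_samples_split=15, random_state=1
).fit(X, y)

# Importances are normalised to sum to 1; sort from most to least important.
importances = pd.Series(
    tree.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances)
```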
Here is a comparison of accuracy for the models created using Decision Tree
Classifier, Logistic Regression and LDA:
Logistic Regression model accuracy on training data before applying
GridSearchCV: 0.674872
Logistic Regression model accuracy on test data before applying
GridSearchCV: 0.676923

Thus we can conclude that the Decision Tree Classifier gives the best accuracy.

2.3 Performance Metrics: Check the performance of Predictions on Train and
Test sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC
score for each model. Final Model: Compare both the models and write an
inference on which model is best/optimized.
Predicting the model classes
Figure 79 Predicted classes
Checking the accuracy of the model:
Figure 80 Data2 Model accuracy

• An AUC value closer to 1 indicates good separability between the
classes, and thus that the model is good for prediction.
• The ROC curve represents the same concept visually: the plot should
be as far as possible from the diagonal.
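The AUC score and the points of the ROC curve can be computed from predicted probabilities, as in this minimal sketch. The data is synthetic and Logistic Regression is used only as an example model, so the score will differ from the values reported here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded Data2.
X, y = make_classification(
    n_samples=500, n_features=9, n_informative=4, random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Use the probability of the positive class, not the hard 0/1 predictions.
probs = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, probs)

# fpr and tpr are the x and y coordinates of the ROC curve;
# the diagonal from (0, 0) to (1, 1) corresponds to random guessing.
fpr, tpr, thresholds = roc_curve(y_test, probs)
print(f"Test AUC: {auc:.3f}")
```

Plotting `fpr` against `tpr` (for example with matplotlib) gives the ROC curve; the same calls on `X_train` and `y_train` give the training curve.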
Figure 81 ROC curve X train
AUC value is 72.1%
Figure 82 ROC curve test data

Figure 83 Classification report Train data
Figure 84 Classification report test data
Figure 85 Confusion matrix training data
Figure 86 Confusion matrix testing data
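The classification reports and confusion matrices in the figures above can be produced with scikit-learn's metrics functions. The sketch below is an illustration on synthetic data with Logistic Regression as an example model, not the report's own code.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded Data2.
X, y = make_classification(
    n_samples=500, n_features=9, n_informative=4, random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

pred_test = model.predict(X_test)

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_test, pred_test)
acc = accuracy_score(y_test, pred_test)

print(cm)
print(f"Accuracy: {acc:.3f}")
print(classification_report(y_test, pred_test))
```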

Figure 87 ROC curve train CART
Figure 88 ROC test CART

Figure 89 Confusion matrix train CART
Figure 90 Confusion matrix test CART
Model accuracy CART
Figure 91 Model Accuracy CART
Figure 92 Summary comparison

2.4 Inference: Based on these predictions, what are the insights and
recommendations?
• The EDA clearly indicates that women with a tertiary education
and a very high standard of living used contraceptive methods.
• Media exposure plays a vital role as an external influence on the use of
contraception.
• Women aged 21 to 38 generally use contraceptive methods more.
• The usage of contraceptive methods need not depend on demographic or
socio-economic background, since the use of contraceptive methods was
almost the same for both working and non-working women.
• The use of contraceptive methods was high for both Scientology and
non-Scientology women.

