
Great Learning

Predictive Modelling Business Report

DSBA April 2023

By

Donkada Vindhya Mounika Patnaik


Contents
Problem 1: Linear Regression

1.1 Read the data and do exploratory data analysis. Describe the data briefly (check the data types, shape, EDA, 5-point summary). Perform univariate, bivariate, and multivariate analysis.
1.2 Impute null values if present; also check for values equal to zero. Do they have any meaning, or do we need to change or drop them? Check for the possibility of creating new features if required. Also check for outliers and duplicates.
1.3 Encode the data (having string values) for modelling. Split the data into train and test (70:30). Apply linear regression using scikit-learn. Check for significant variables using an appropriate method from statsmodels. Create multiple models and check the performance of predictions on the train and test sets using R-square, RMSE, and adjusted R-square. Compare these models and select the best one with appropriate reasoning.
1.4 Inference: based on these predictions, what are the business insights and recommendations? Explain and summarise the various steps performed in this project, with proper business interpretation and actionable insights.

Problem 2: Logistic Regression, LDA and CART

2.1 Data Ingestion: Read the dataset. Do the descriptive statistics, check for null values, duplicates, and outliers, and write an inference. Perform univariate, bivariate, and multivariate analysis.
2.2 Do not scale the data. Encode the data (having string values) for modelling. Data split: split the data into train and test (70:30). Apply Logistic Regression, LDA (linear discriminant analysis), and CART.
2.3 Performance Metrics: check the performance of predictions on the train and test sets using accuracy and the confusion matrix; plot the ROC curve and get the ROC_AUC score for each model. Final Model: compare the models and write an inference on which model is best/optimized.
2.4 Inference: based on these predictions, what are the insights and recommendations? Explain and summarise the various steps performed in this project, with proper business interpretation and actionable insights.

2|Page
List of Figures
Figure 1: Univariate Analysis – Numerical Data
Figure 2: Univariate Analysis – Categorical Data
Figure 3: Bivariate Analysis – Boxplots
Figure 4: Pair Plot
Figure 5: Correlation Heatmap
Figure 6: Outlier Treatment
Figure 7: Coefficient Bar Chart
Figure 8: Actual vs Predicted Values
Figure 9: Outliers
Figure 10: After Outlier Treatment
Figure 11: Categorical Data
Figure 12: Univariate Analysis – Numerical Data
Figure 13: Bivariate Analysis
Figure 14: Correlation Heatmap
Figure 15: Bivariate Analysis
Figure 16: Bivariate Analysis – Heatmap
Figure 17: Confusion Matrix – Training Data
Figure 18: Confusion Matrix – Test Data
Figure 19: ROC
Figure 20: ROC Curve

List of Tables
Table 1: Head
Table 2: Tail
Table 3: Data Type
Table 4: Description
Table 5: Null Values
Table 6: Percentage of Zero Values
Table 7: Coefficients of Independent Attributes
Table 8: OLS Model
Table 9: Head
Table 10: Tail
Table 11: Data Type
Table 12: Description
Table 13: Percentage of Null Values
Table 14: Duplicate Values
Table 15: Encoded Data
Table 16: Encoded Data Type
Table 17: Variance Inflation Factor
Table 18: Logistic Regression Accuracy
Table 19: Linear Discriminant Analysis
Table 20: Coefficients

Problem 1: Linear Regression

The comp-activ database is a collection of computer systems activity measures.


The data was collected from a Sun SPARCstation 20/712 with 128 MB of memory running in a
multi-user university department. Users would typically be doing a wide variety of tasks, ranging
from accessing the internet and editing files to running very CPU-bound programs.

As a budding data scientist, you set out to build a linear model to predict 'usr' (the portion of
time (%) that CPUs run in user mode) and to find out how each attribute affects the system's time
in 'usr' mode, using a list of system attributes.

Dataset for Problem 1: compactiv.xlsx

DATA DICTIONARY:
-----------------------
System measures used:

lread - Reads (transfers per second) between system memory and user memory
lwrite - Writes (transfers per second) between system memory and user memory
scall - Number of system calls of all types per second
sread - Number of system read calls per second
swrite - Number of system write calls per second
fork - Number of system fork calls per second
exec - Number of system exec calls per second
rchar - Number of characters transferred per second by system read calls
wchar - Number of characters transferred per second by system write calls
pgout - Number of page-out requests per second
ppgout - Number of pages paged out per second
pgfree - Number of pages per second placed on the free list
pgscan - Number of pages checked per second for whether they can be freed
atch - Number of page attaches (satisfying a page fault by reclaiming a page in memory) per second
pgin - Number of page-in requests per second
ppgin - Number of pages paged in per second
pflt - Number of page faults caused by protection errors (copy-on-writes)
vflt - Number of page faults caused by address translation
runqsz - Process run queue size (the number of kernel threads in memory that are waiting for a CPU
to run; typically this value should be less than 2, and consistently higher values mean that the
system might be CPU-bound)
freemem - Number of memory pages available to user processes
freeswap - Number of disk blocks available for page swapping
------------------------
usr - Portion of time (%) that CPUs run in user mode

1.1 Read the data and do exploratory data analysis. Describe the data briefly. (Check the Data
types, shape, EDA, 5 point summary). Perform Univariate, Bivariate Analysis, Multivariate
Analysis.

On the basis of the above data, the following exploratory analysis is performed to examine the
head, tail, shape, information, and description of the dataset.

Table 1: Head

Table 2: Tail

Table 3: Data Type

There are 8192 rows and 22 columns, of which 13 are float, 8 are integer, and 1 is an object.
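These checks can be reproduced with pandas. The frame below is a hypothetical miniature stand-in for compactiv.xlsx (column names from the data dictionary), not the real data:

```python
import pandas as pd

# Hypothetical stand-in for compactiv.xlsx; the real dataset has 8192 rows
# and 22 columns, with runqsz as the only object (string) column.
df = pd.DataFrame({
    "lread": [5, 3, 7, 2],
    "freemem": [1200.0, 900.0, 1500.0, 800.0],
    "runqsz": ["CPU_Bound", "Not_CPU_Bound", "Not_CPU_Bound", "CPU_Bound"],
})

print(df.shape)                  # (rows, columns)
print(df.dtypes.value_counts())  # tally of float/int/object columns
print(df.describe().T)           # 5-point summary of the numeric columns
```

On the real file, `pd.read_excel("compactiv.xlsx")` would replace the hand-built frame.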

Table 4: Description

Business Insights:

 The data indicates significant variability in local read and write counts, system calls, and
system read and write operations. This suggests that resource utilization varies widely over
time.
 Monitoring system call counts and system read/write operations is crucial for maintaining
system performance. Anomalies or spikes in these metrics may indicate performance
bottlenecks or issues that need immediate attention.
 The metrics related to fork and exec operations (process creation and execution) show
variability. Understanding the patterns of process creation and execution can help in
optimizing system resources and improving efficiency.
 Local read and write counts, as well as system read and write operations, provide insights into
I/O (input/output) operations. Monitoring these metrics can help businesses ensure that data
transfer and storage operations are running smoothly.
 The metrics related to memory (freemem) and swap space (freeswap) indicate available
memory and swap space on the system. Monitoring these values can be critical for avoiding
system crashes due to resource exhaustion.
 The 'usr' metric, with a mean of approximately 83.97, suggests that users are active on the
system. Businesses should track user activity to ensure that user experience and system
responsiveness are maintained.

Figure 1: Univariate Analysis – Numerical Data

Insights:

The above analysis is performed on the numerical data. lread, lwrite, scall, swrite, and freemem
are right-skewed, while the others are left-skewed. Boxplots show outliers in each numerical
column, which are treated later.

Figure 2: Univariate Analysis – Categorical Data

Insights:

runqsz is the process run queue size (the number of kernel threads in memory that are waiting
for a CPU to run). As per the above figure, Not_CPU_Bound observations outnumber CPU_Bound ones.

Figure 3: Bivariate Analysis – Boxplots

Figure 4: Pair plot

Insights:

 pflt and fork have a high correlation.
 ppgout and pgfree have a high correlation of 0.79.
 pgfree and pgscan have a high correlation of 0.92.

Figure 5: Correlation – Heatmap

Insight:

The above heatmap supports the insights from the bivariate analysis. It establishes strong
correlations among pflt, vflt, and fork, followed by correlations among other variables. Strong
relationships are highlighted in lighter colours; as the colour darkens, the correlation weakens.

1.2 Impute null values if present, also check for the values which are equal to zero. Do they have
any meaning or do we need to change them or drop them? Check for the possibility of
creating new features if required. Also check for outliers and duplicates if there.

Null Values:

Table 5: Null Values

Insights:

Null values are identified in rchar and wchar, and they account for less than 5% of the rows.
Thus, they are treated by imputing the mean.
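A minimal sketch of the mean imputation described above, using a hypothetical toy frame in place of the real rchar/wchar columns:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the rchar/wchar columns with a few missing values.
df = pd.DataFrame({"rchar": [100.0, np.nan, 300.0],
                   "wchar": [50.0, 60.0, np.nan]})

# Mean imputation, as applied to the < 5% of nulls in rchar and wchar.
for col in ["rchar", "wchar"]:
    df[col] = df[col].fillna(df[col].mean())

print(df.isnull().sum().sum())  # 0 nulls remain
```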

Table 6: Percentage of Zero values

Insights:

If the system stays idle, the dataset records 0 for these measures; such zeros are meaningful,
so they are not treated further.

Outlier Treatment:

Figure 6: Outlier Treatment

Insight: As outliers are observed in the univariate analysis, they are treated using the IQR
method. The above figures show no outliers after treatment. No duplicate rows are found.
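The IQR treatment can be sketched as follows. The series below is toy data, and the conventional 1.5×IQR fences are an assumption (the report does not state its exact multiplier):

```python
import pandas as pd

def cap_outliers_iqr(s: pd.Series) -> pd.Series:
    """Clip values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

s = pd.Series([1, 2, 2, 3, 3, 4, 100])  # 100 is an obvious outlier
capped = cap_outliers_iqr(s)
print(capped.max())  # the outlier is pulled down to the upper fence
```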

1.3 Encode the data (having string values) for Modelling. Split the data into train and test
(70:30). Apply Linear regression using scikit learn. Perform checks for significant variables
using appropriate method from statsmodel. Create multiple models and check the
performance of Predictions on Train and Test sets using Rsquare, RMSE & Adj Rsquare.
Compare these models and select the best one with appropriate reasoning.

As identified in Table 4, the variable with string values, runqsz, is an object and needs to be
encoded.

As a first step, we split the data into train and test sets (70:30). The training data has 7372
observations, on which we fit the model. To validate the model, we evaluate it against the 820
test observations.
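The split can be done with scikit-learn's train_test_split; the arrays below are hypothetical placeholders for the feature matrix and the usr target:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # hypothetical feature matrix
y = np.arange(10)                  # hypothetical target 'usr'

# 70:30 split; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

print(len(X_train), len(X_test))   # 7 and 3 for this toy example
```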

Linear Regression Model – Scikit Learn:

Table 7: Coefficients of Independent Attributes

Insight:

After we fit the model, the coefficients for each of the independent attributes are shown above.
They provide insight into the relationships between the independent variables and usr, are used
to make predictions, and show which variables have a significant impact on the target variable.
The signs (positive or negative) of the coefficients indicate whether a variable has a positive
or negative effect on the target variable.

Figure 7: Coefficient Bar Chart

Insights:

From the above bar chart, we can see that the process run queue size (runqsz, the number of
kernel threads in memory that are waiting for a CPU to run) and the number of system fork calls
per second (fork) have the highest impact on the portion of time (%) that CPUs run in user mode (usr).

Findings after further analysis are:

 The intercept of the above model is 43.78.
 R-square (coefficient of determination) on training data is 0.637.
 R-square (coefficient of determination) on testing data is 0.655.
 Root Mean Square Error (RMSE) on training data is 10.957.
 Root Mean Square Error (RMSE) on testing data is 11.77.
 Mean Squared Error (MSE) on training data is approximately 120.06 (the square of the RMSE).
 Mean Squared Error (MSE) on testing data is approximately 138.53 (the square of the RMSE).

Insights:

 R-squared on the training data (0.637) suggests that about 63.7% of the variance in the
dependent variable is explained by the model. R-squared on the testing data (0.655) suggests
that about 65.5% of the variance in the dependent variable is explained by the model. The R-
squared values for both training and testing datasets are quite close, indicating that the model
generalizes well to unseen data. This is a positive sign as it suggests that the model is not
overfitting.

 RMSE on the training data (10.957) represents the average error of the model's predictions on
the data it was trained on. RMSE on the testing data (11.77) represents the average error of
the model's predictions on new, unseen data. The RMSE values are relatively low, indicating
that the model's predictions are, on average, close to the actual values.
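The R-square, RMSE, and adjusted R-square figures above can be computed as sketched below, shown on toy actual/predicted values rather than the project's data (n and p are illustrative):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Toy actual vs. predicted values, stand-ins for the model's outputs.
y_true = np.array([80.0, 85.0, 90.0, 95.0])
y_pred = np.array([82.0, 84.0, 88.0, 96.0])

r2 = r2_score(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))

# Adjusted R-square penalizes R-square for the number of predictors p.
n, p = len(y_true), 2
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(r2, 3), round(rmse, 3), round(adj_r2, 3))
```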

OLS Model – Linear Regression:

Table 8: OLS Model

Insights:
The R-squared value is 0.640, which means that approximately 64% of the variance in the dependent
variable "usr" is explained by the independent variables in the model. The adjusted R-squared is very
close, indicating that the model's explanatory power is robust. Several independent variables
have low p-values (P>|t|), indicating that they are statistically significant in explaining the
variation in "usr."
These variables include "lread," "scall," "fork," "rchar," "wchar," "pgout," "ppgout," "pgfree,"
"pgscan," "pgin," "pflt," "vflt," "runqsz," "freemem," "freeswap," and the intercept. These are the
variables that likely have a meaningful impact on the "usr" variable.

Figure 8: Actual Vs Predicted Values

Comparing Scikit Learn and OLS Model:

Both scikit-learn and the OLS model yield similar R-squared values, indicating that they explain
a similar amount of variance in the data. Scikit-learn has slightly lower RMSE values on the
training and testing data, but the differences are small; overall, the two approaches perform
comparably.

1.4 Inference: Based on these predictions, what are the business insights and recommendations?
Please explain and summarise the various steps performed in this project. There should be
proper business interpretation and actionable insights present.

Business Insights and Recommendations:

 The dataset contains 8192 rows and 22 columns, with a mix of numerical and categorical
data. Significant variability is observed in local read and write counts, system calls, and
system read and write operations over time.
 Local read and write counts, as well as system read and write operations, provide insights into
I/O (input/output) operations. Monitoring these metrics can help businesses ensure that data
transfer and storage operations are running smoothly.

 Metrics related to memory (freemem) and swap space (freeswap) indicate available memory
and swap space on the system. Monitoring these values can be critical for avoiding system
crashes due to resource exhaustion. The presence of non-zero values in these columns
suggests that the system has not exhausted its resources in most cases.
 The 'usr' metric, with a mean of approximately 83.97, suggests that users are active on the
system. Tracking user activity is essential to maintain user experience and system
responsiveness.
 Linear regression models were built to predict the 'usr' metric. The most influential variables
on 'usr' were found to be 'runqsz' (process run queue size) and 'fork' (number of system fork
calls per second). Both Scikit-learn and Ordinary Least Squares (OLS) models provided
similar predictive performance.
 Monitor system call counts, system read/write operations, and I/O operations closely to
identify and address performance bottlenecks promptly. Optimize resource allocation based
on patterns of process creation and execution. Keep a close watch on memory and swap space
usage to prevent system crashes.

Project Summary:

 The project involved exploratory data analysis (EDA), including univariate, bivariate, and
multivariate analyses.
 Data preprocessing steps included handling null values, outliers, and encoding categorical
variables.
 Linear regression models were used to predict system behavior, and model performance
was evaluated using R-squared and RMSE.
 The project provided valuable insights into system performance and resource utilization,
enabling data-driven decision-making for system optimization and maintenance. The
analysis demonstrated the importance of monitoring key system metrics to ensure
reliability and efficiency.

Problem 2: Logistic Regression, LDA and CART

You are a statistician at the Republic of Indonesia Ministry of Health, and you are provided with
data on 1473 females collected from a Contraceptive Prevalence Survey. The samples are married
women who were either not pregnant or did not know if they were at the time of the survey.
The problem is to predict whether or not they use a contraceptive method of choice based on
their demographic and socio-economic characteristics.

Data Dictionary:

1. Wife's age (numerical)
2. Wife's education (categorical) 1=uneducated, 2, 3, 4=tertiary
3. Husband's education (categorical) 1=uneducated, 2, 3, 4=tertiary
4. Number of children ever born (numerical)
5. Wife's religion (binary) Non-Scientology, Scientology
6. Wife now working? (binary) Yes, No
7. Husband's occupation (categorical) 1, 2, 3, 4 (random)
8. Standard-of-living index (categorical) 1=very low, 2, 3, 4=high
9. Media exposure (binary) Good, Not good
10. Contraceptive method used (class attribute) No, Yes

2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do null value condition
check, check for duplicates and outliers and write an inference on it. Perform Univariate and
Bivariate Analysis and Multivariate Analysis.

Table 9: Head

Table 10: Tail

Table 11: Data Type

Table 12: Description

Insights:

As observed in the above tables, there are 1473 rows and 10 columns, comprising 3 numerical and
7 categorical variables. Table 12 summarizes the central tendency, variability, and distribution
of each numerical variable in the dataset.

Null Values:

Table 13: Percentage of Null Values

Insights: The above table summarizes the null values, which are imputed with the mean.

Duplicate Values:

Table 14: Duplicate Values

Insights: As shown above, there are 80 duplicate rows, which are dropped, leaving 1393 rows and
10 columns.
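The duplicate check and drop can be sketched as follows on a toy frame:

```python
import pandas as pd

# Toy frame with one exact duplicate row (rows 0 and 1).
df = pd.DataFrame({"a": [1, 1, 2, 3], "b": ["x", "x", "y", "z"]})

n_dupes = int(df.duplicated().sum())        # count of repeated rows
df = df.drop_duplicates().reset_index(drop=True)

print(n_dupes, df.shape)
```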

Outliers:

Figure 9: Outliers

Insights: The above figure illustrates outliers in No_of_children_born, which are treated using a
log transformation.
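A sketch of the log treatment; np.log1p (log of 1 + x) is assumed here so that zero child counts remain defined, though the report's exact transform may differ:

```python
import numpy as np
import pandas as pd

children = pd.Series([0, 1, 2, 3, 16])  # 16 is the extreme value in the data

# log1p compresses the right tail while keeping log(1 + 0) = 0 valid.
children_log = np.log1p(children)
print(children_log.round(3).tolist())
```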

Figure 10: After outlier treatment

Univariate Analysis:

Figure 11: Categorical Data

Insights:

 Tertiary education is the most common among wives, followed by secondary education.
 Tertiary education is the most common among husbands, followed by secondary education.
 Scientology is more common than non-Scientology among the wives' religions.
 Fewer wives are working than not working.
 The population largely has a high standard of living.

Figure 12: Univariate Analysis – Numerical Data

Insights:

 The maximum wife's age observed in the dataset is 49 years.
 The maximum number of children born is 16.
 The maximum husband's occupation category is 4.

Figure 13: Bivariate Analysis

Multivariate Analysis:

Figure 14: Correlation Heatmap

Insights: Wife's age and No_of_children_born have a high correlation.

2.2 Do not scale the data. Encode the data (having string values) for Modelling. Data Split:
Split the data into train and test (70:30). Apply Logistic Regression and LDA (linear
discriminant analysis) and CART.

Encoding Data:

Table 15: Encoded Data

Table 16: Encoded Data Type

Insight: The above table shows the change in data types from object to float and int after encoding.
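One possible encoding scheme for the binary string columns, sketched on hypothetical column names and toy rows (the report's exact mapping may differ):

```python
import pandas as pd

df = pd.DataFrame({
    "Wife_religion": ["Scientology", "Non-Scientology", "Scientology"],
    "Media_exposure": ["Good", "Not good", "Good"],
})

# Map the binary string categories to integer codes for modelling.
df["Wife_religion"] = df["Wife_religion"].map({"Non-Scientology": 0, "Scientology": 1})
df["Media_exposure"] = df["Media_exposure"].map({"Not good": 0, "Good": 1})

print(df.dtypes.tolist())  # both columns are now integer-typed
```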

Figure 15: Bivariate Analysis

Figure 16: Bivariate Analysis – Heatmap

Insights: It confirms the correlation insights provided in Figure 14.

Multicollinearity Verification:

Table 17: Variance Inflation Factor

Insight:
Based on the VIF values provided, multicollinearity is not severe in the regression model. Most
of the variables have VIF values below 2, which is acceptable, indicating they are at most
moderately correlated.

Table 18: Logistic Regression Accuracy

Insights:

The Logistic Regression model achieved an accuracy of approximately 67.03%, meaning it correctly
classified about 67.03% of the total instances.

True Positives (TP): 124 instances were correctly classified as class 1.
True Negatives (TN): 63 instances were correctly classified as class 0.
False Positives (FP): 70 instances were incorrectly classified as class 1.
False Negatives (FN): 22 instances were incorrectly classified as class 0.
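A minimal scikit-learn sketch of the logistic-regression fit, accuracy, and confusion matrix; make_classification generates synthetic data standing in for the survey features:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data as a stand-in for the survey features.
X, y = make_classification(n_samples=300, n_features=5, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=1)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
acc = accuracy_score(y_te, clf.predict(X_te))
cm = confusion_matrix(y_te, clf.predict(X_te))  # rows: actual; cols: predicted

print(acc)  # test-set accuracy
print(cm)   # [[TN, FP], [FN, TP]]
```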

Table 19: Linear Discriminant Analysis

Insight: The LDA model achieved an accuracy of approximately 66.31%, slightly lower than the
Logistic Regression model.

Table 20: Coefficient

Figure 17: Confusion Matrix Training Data

Figure 18: Confusion Matrix Test Data

Figure 19: ROC

Insight:

An accuracy of 0.70 on the training data is relatively decent, indicating that the model is learning
from the training data. An accuracy of 0.66 on the test data is lower than the training accuracy. This
suggests that the model may be overfitting to the training data.

2.3 Performance Metrics: Check the performance of predictions on the train and test sets using
accuracy and the confusion matrix; plot the ROC curve and get the ROC_AUC score for each model.
Final Model: Compare the models and write an inference on which model is best/optimized.

Figure 20: ROC Curve

Insight:

Based on the evaluation metrics, the Logistic Regression model outperforms the CART model for
this classification task. It has a higher test accuracy (67.03% vs. 64.16%) and a better
ROC AUC score (0.719 vs. 0.688). In the confusion matrices, the Logistic Regression model also has
fewer false positives and false negatives, indicating better predictive performance.
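The ROC_AUC score is computed from predicted probabilities of the positive class; a toy sketch with hand-picked labels and scores:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])  # predicted probabilities of class 1

# AUC is the probability a random positive is scored above a random negative.
auc = roc_auc_score(y_true, y_score)
print(auc)  # 0.75 for this toy example
```

In the project, `y_score` would come from `clf.predict_proba(X_te)[:, 1]` for each fitted model.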

2.4 Inference: Based on these predictions, what are the insights and recommendations?
Please explain and summarise the various steps performed in this project. There should be
proper business interpretation and actionable insights present.

Project Summary:

We analyzed data from a Contraceptive Prevalence Survey to predict contraceptive method usage
among married women. We performed data preprocessing, handling null values, duplicates, and
outliers. Univariate and bivariate analyses revealed insights about the demographic and socio-
economic factors influencing contraceptive usage. We applied Logistic Regression and Linear
Discriminant Analysis (LDA) models for prediction. Logistic Regression outperformed LDA with a
higher accuracy (67.03% vs. 66.31%) and ROC AUC score (0.719 vs. 0.688). The key insight is that
demographic and socio-economic factors can help predict contraceptive use.

Business Interpretation and Recommendations:

The analysis suggests that demographic and socio-economic factors significantly influence
contraceptive usage among married women. Implement targeted family planning campaigns
considering these factors to increase contraceptive adoption rates, ensuring better reproductive health
outcomes and informed family planning choices for women.

