1.1 Read the data and do exploratory data analysis. Describe the data briefly (check the data types, shape, EDA, five-point summary). Perform Univariate, Bivariate and Multivariate Analysis.
1.2 Impute null values if present; also check for values equal to zero. Do they have any meaning, or do we need to change or drop them? Check for the possibility of creating new features if required. Also check for outliers and duplicates.
1.3 Encode the data (having string values) for modelling. Split the data into train and test (70:30). Apply linear regression using scikit-learn. Check for significant variables using an appropriate method from statsmodels. Create multiple models and check the performance of predictions on the train and test sets using R-square, RMSE and adjusted R-square. Compare these models and select the best one with appropriate reasoning.
1.4 Inference: Based on these predictions, what are the business insights and recommendations? Explain and summarise the various steps performed in this project. There should be proper business interpretation and actionable insights present.
2.1 Data Ingestion: Read the dataset. Perform descriptive statistics, check for null values, duplicates and outliers, and write an inference. Perform Univariate, Bivariate and Multivariate Analysis.
2.2 Do not scale the data. Encode the data (having string values) for modelling. Data Split: Split the data into train and test (70:30). Apply Logistic Regression, LDA (linear discriminant analysis) and CART.
2.3 Performance Metrics: Check the performance of predictions on the train and test sets using accuracy and confusion matrices; plot the ROC curve and get the ROC_AUC score for each model. Final Model: Compare the models and write an inference on which model is best/optimized.
2.4 Inference: Based on these predictions, what are the insights and recommendations? Explain and summarise the various steps performed in this project. There should be proper business interpretation and actionable insights present.
2|Page
List of Figures
Figure 1: Univariate Analysis – Numerical Data
Figure 2: Univariate Analysis – Categorical Data
Figure 3: Bivariate Analysis – Boxplots
Figure 4: Pair plot
Figure 5: Correlation – Heatmap
Figure 6: Outlier Treatment
Figure 7: Coefficient Bar Chart
Figure 8: Actual vs Predicted Values
Figure 9: Outliers
Figure 10: After Outlier Treatment
Figure 11: Categorical Data
Figure 12: Univariate Analysis – Numerical Data
Figure 13: Bivariate Analysis
Figure 14: Correlation Heatmap
Figure 15: Bivariate Analysis
Figure 16: Bivariate Analysis – Heatmap
Figure 17: Confusion Matrix – Training Data
Figure 18: Confusion Matrix – Test Data
Figure 19: ROC
Figure 20: ROC Curve
List of Tables
Table 1: Head
Table 2: Tail
Table 3: Data Type
Table 4: Description
Table 5: Null Values
Table 6: Percentage of Zero Values
Table 7: Coefficient of Independent Attributes
Table 8: OLS Model
Table 9: Head
Table 10: Tail
Table 11: Data Type
Table 12: Description
Table 13: Percentage of Null Values
Table 14: Duplicate Values
Table 15: Encoded Data
Table 16: Encoded Data Type
Table 17: Variance Inflation Factor
Table 18: Logistic Regression Accuracy
Table 19: Linear Discriminant Analysis
Table 20: Coefficient
Problem 1: Linear Regression
As a budding data scientist, you set out to build a linear model to predict 'usr' (the portion of time (%) that CPUs run in user mode) and to determine how each attribute in a list of system measures affects the time the system spends in 'usr' mode.
DATA DICTIONARY:
-----------------------
System measures used:
lread - Reads (transfers per second) between system memory and user memory
lwrite - Writes (transfers per second) between system memory and user memory
scall - Number of system calls of all types per second
sread - Number of system read calls per second
swrite - Number of system write calls per second
fork - Number of system fork calls per second
exec - Number of system exec calls per second
rchar - Number of characters transferred per second by system read calls
wchar - Number of characters transferred per second by system write calls
pgout - Number of page-out requests per second
ppgout - Number of pages paged out per second
pgfree - Number of pages per second placed on the free list
pgscan - Number of pages checked per second for whether they can be freed
atch - Number of page attaches (satisfying a page fault by reclaiming a page in memory) per second
pgin - Number of page-in requests per second
ppgin - Number of pages paged in per second
pflt - Number of page faults caused by protection errors (copy-on-writes)
vflt - Number of page faults caused by address translation
runqsz - Process run queue size (the number of kernel threads in memory that are waiting for a CPU to run; typically this value should be less than 2, and consistently higher values mean the system might be CPU-bound)
freemem - Number of memory pages available to user processes
freeswap - Number of disk blocks available for page swapping
------------------------
usr - Portion of time (%) that cpus run in user mode.
1.1 Read the data and do exploratory data analysis. Describe the data briefly (check the data types, shape, EDA, five-point summary). Perform Univariate, Bivariate and Multivariate Analysis.
Based on the above data, the following exploratory analysis is performed to understand the head, tail, shape, information and description of the dataset.
Table 1: Head
Table 2: Tail
There are 8192 rows and 22 columns, of which 13 are float, 8 are integer and 1 is an object.
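These first checks can be sketched in pandas; the tiny frame below is a hypothetical stand-in for the real dataset, since the actual CSV filename is not stated in the report. On the real data, `df.shape` would return (8192, 22).

```python
import pandas as pd

# Toy stand-in for the machine-data CSV (real filename and full columns not
# reproduced here); three columns mimic the data dictionary above.
df = pd.DataFrame({
    "lread": [5, 1, 0, 3],
    "usr": [90, 95, 87, 83],
    "runqsz": ["CPU_Bound", "Not_CPU_Bound", "Not_CPU_Bound", "CPU_Bound"],
})

head = df.head()          # first rows (Table 1)
tail = df.tail()          # last rows (Table 2)
shape = df.shape          # (rows, columns)
dtypes = df.dtypes        # data types (Table 3)
summary = df.describe()   # five-point summary plus mean/std (Table 4)
```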
Table 4: Description
Business Insights:
The data indicates significant variability in local read and write counts, system calls, and
system read and write operations. This suggests that resource utilization varies widely over
time.
Monitoring system call counts and system read/write operations is crucial for maintaining
system performance. Anomalies or spikes in these metrics may indicate performance
bottlenecks or issues that need immediate attention.
The metrics related to fork and exec operations (process creation and execution) show
variability. Understanding the patterns of process creation and execution can help in
optimizing system resources and improving efficiency.
Local read and write counts, as well as system read and write operations, provide insights into
I/O (input/output) operations. Monitoring these metrics can help businesses ensure that data
transfer and storage operations are running smoothly.
The metrics related to memory (freemem) and swap space (freeswap) indicate available
memory and swap space on the system. Monitoring these values can be critical for avoiding
system crashes due to resource exhaustion.
The 'usr' metric, with a mean of approximately 83.97, suggests that users are active on the
system. Businesses should track user activity to ensure that user experience and system
responsiveness are maintained.
Figure 1: Univariate Analysis – Numerical Data
Insights:
The above analysis is performed on the numerical data. lread, lwrite, scall, swrite and freemem are right-skewed, while the others are left-skewed. The boxplots show outliers in each numerical column, which are treated later.
Figure 2: Univariate Analysis – Categorical Data
Insights:
runqsz denotes the process run queue size (the number of kernel threads in memory that are waiting for a CPU to run). As per the above figure, Not_CPU_Bound observations outnumber CPU_Bound ones.
Figure 4: Pair plot
Insights:
Figure 5: Correlation – Heatmap
Insight:
The above heatmap supports the insights from the bivariate analysis. It establishes strong correlations among pflt, vflt and fork, followed by weaker correlations among the other variables; strong relations are highlighted in lighter colours, and as the colour grows darker, weaker correlation is observed.
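The correlation matrix behind such a heatmap can be computed directly. The frame below is synthetic: pflt is constructed to track fork, mimicking the strong correlation reported above, while freemem is independent noise.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
fork = rng.normal(size=200)
df = pd.DataFrame({
    "fork": fork,
    "pflt": 2 * fork + rng.normal(scale=0.1, size=200),  # strongly tied to fork
    "freemem": rng.normal(size=200),                     # unrelated column
})

# Pearson correlation matrix; this is the table a heatmap visualizes.
corr = df.corr()
```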
1.2 Impute null values if present; also check for values equal to zero. Do they have any meaning, or do we need to change or drop them? Check for the possibility of creating new features if required. Also check for outliers and duplicates.
Null Values:
Insights:
Null values are identified in rchar and wchar and account for less than 5% of the data, so they are treated by imputing the mean.
Insights:
Zero values are recorded when the system stays idle, so they carry meaning and are not treated further.
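A minimal sketch of the null and zero checks with mean imputation; the toy values below are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the dataset; rchar/wchar carry a few NaNs.
df = pd.DataFrame({
    "rchar": [100.0, np.nan, 300.0, 200.0],
    "wchar": [50.0, 60.0, np.nan, 70.0],
    "lwrite": [0, 4, 0, 2],
})

null_pct = df.isnull().mean() * 100   # % null per column (Table 5)
zero_pct = (df == 0).mean() * 100     # % zero per column (Table 6)

# rchar/wchar nulls are under 5% in the project, so mean imputation is used.
for col in ["rchar", "wchar"]:
    df[col] = df[col].fillna(df[col].mean())
```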
Outlier Treatment:
Insight: As outliers were observed in the univariate analysis, they are treated using the IQR method. The figures above show no outliers remaining after treatment. No duplicates were found.
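The IQR treatment described above can be sketched as a small helper that clips values outside Q1 - 1.5*IQR and Q3 + 1.5*IQR:

```python
import pandas as pd

def cap_outliers_iqr(s: pd.Series) -> pd.Series:
    """Clip values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

s = pd.Series([1, 2, 2, 3, 3, 4, 100])  # 100 is an obvious outlier
capped = cap_outliers_iqr(s)
```

Capping (rather than dropping) keeps all 8192 rows, which matters when outliers are numerous.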
1.3 Encode the data (having string values) for modelling. Split the data into train and test (70:30). Apply linear regression using scikit-learn. Check for significant variables using an appropriate method from statsmodels. Create multiple models and check the performance of predictions on the train and test sets using R-square, RMSE and adjusted R-square. Compare these models and select the best one with appropriate reasoning.
As a first step, we split the data into train and test sets. The training data has 7372 observations, on which the model is fitted; the model is then validated against the 820 held-out test observations.
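The split and the scikit-learn fit can be sketched as follows, on synthetic data since the real frame is not reproduced here; the column names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["lread", "fork", "freemem"])
# Synthetic target: fork matters most, lread less, freemem not at all.
y = 2.0 * X["fork"] - 1.0 * X["lread"] + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

lr = LinearRegression().fit(X_train, y_train)
coefs = pd.Series(lr.coef_, index=X.columns)  # basis for a Figure 7-style chart
```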
Insight:
After fitting the model, the table above shows the coefficient for each of the independent attributes. The coefficients describe the relationships between the independent variables and usr; they are used to make predictions and to understand which variables have a significant impact on the target variable. Additionally, the sign (positive or negative) of each coefficient indicates whether the variable has a positive or negative effect on the target variable.
Figure 7: Coefficient Bar Chart
Insights:
From the above bar chart, we can see that the process run queue size (runqsz, the number of kernel threads in memory that are waiting for a CPU to run) and the number of system fork calls per second (fork) have the highest impact on the portion of time (%) that CPUs run in user mode (usr).
Insights:
R-squared on the training data (0.637) suggests that about 63.7% of the variance in the
dependent variable is explained by the model. R-squared on the testing data (0.655) suggests
that about 65.5% of the variance in the dependent variable is explained by the model. The R-
squared values for both training and testing datasets are quite close, indicating that the model
generalizes well to unseen data. This is a positive sign as it suggests that the model is not
overfitting.
RMSE on the training data (10.957) represents the average error of the model's predictions on
the data it was trained on. RMSE on the testing data (11.77) represents the average error of
the model's predictions on new, unseen data. The RMSE values are relatively low, indicating
that the model's predictions are, on average, close to the actual values.
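The R-square, RMSE and adjusted R-square checks can be sketched as follows on synthetic data; adjusted R-square applies the standard correction for sample size n and number of predictors k.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))
y = X @ np.array([1.5, -0.5]) + rng.normal(scale=0.2, size=100)

model = LinearRegression().fit(X, y)
pred = model.predict(X)

r2 = r2_score(y, pred)
rmse = np.sqrt(mean_squared_error(y, pred))

# Adjusted R-square penalizes for the number of predictors k.
n, k = X.shape
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
```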
Insights:
The R-squared value is 0.640, which means that approximately 64% of the variance in the dependent variable "usr" is explained by the independent variables in the model. The adjusted R-squared is very close, indicating that the model's explanatory power is robust. Several independent variables have low p-values (P>|t|), indicating that they are statistically significant in explaining the variation in "usr." These include "lread," "scall," "fork," "rchar," "wchar," "pgout," "ppgout," "pgfree," "pgscan," "pgin," "pflt," "vflt," "runqsz," "freemem," "freeswap," and the intercept; these variables likely have a meaningful impact on "usr."
Both scikit-learn and the OLS model have similar R-squared values, indicating that they explain a similar amount of variance in the data. Scikit-learn has slightly lower RMSE values on the training and testing data, giving it a slight edge in terms of RMSE, but the differences are small and the overall performance of the two approaches is quite comparable.
1.4 Inference: Based on these predictions, what are the business insights and recommendations? Explain and summarise the various steps performed in this project. There should be proper business interpretation and actionable insights present.
The dataset contains 8192 rows and 22 columns, with a mix of numerical and categorical
data. Significant variability is observed in local read and write counts, system calls, and
system read and write operations over time.
Local read and write counts, as well as system read and write operations, provide insights into
I/O (input/output) operations. Monitoring these metrics can help businesses ensure that data
transfer and storage operations are running smoothly.
Metrics related to memory (freemem) and swap space (freeswap) indicate available memory
and swap space on the system. Monitoring these values can be critical for avoiding system
crashes due to resource exhaustion. The presence of non-zero values in these columns
suggests that the system has not exhausted its resources in most cases.
The 'usr' metric, with a mean of approximately 83.97, suggests that users are active on the
system. Tracking user activity is essential to maintain user experience and system
responsiveness.
Linear regression models were built to predict the 'usr' metric. The most influential variables
on 'usr' were found to be 'runqsz' (process run queue size) and 'fork' (number of system fork
calls per second). Both Scikit-learn and Ordinary Least Squares (OLS) models provided
similar predictive performance.
Monitor system call counts, system read/write operations, and I/O operations closely to
identify and address performance bottlenecks promptly. Optimize resource allocation based
on patterns of process creation and execution. Keep a close watch on memory and swap space
usage to prevent system crashes.
Project Summary:
The project involved exploratory data analysis (EDA), including univariate, bivariate, and
multivariate analyses.
Data preprocessing steps included handling null values, outliers, and encoding categorical
variables.
Linear regression models were used to predict system behavior, and model performance
was evaluated using R-squared and RMSE.
The project provided valuable insights into system performance and resource utilization,
enabling data-driven decision-making for system optimization and maintenance. The
analysis demonstrated the importance of monitoring key system metrics to ensure
reliability and efficiency.
Problem 2: Logistic Regression, LDA and CART
You are a statistician at the Republic of Indonesia Ministry of Health, and you are provided with data on 1473 females collected from a Contraceptive Prevalence Survey. The samples are married women who were either not pregnant or did not know if they were at the time of the survey.
The problem is to predict whether or not they use a contraceptive method of choice based on their demographic and socio-economic characteristics.
Data Dictionary:
2.1 Data Ingestion: Read the dataset. Perform descriptive statistics, check for null values, duplicates and outliers, and write an inference. Perform Univariate, Bivariate and Multivariate Analysis.
Table 9: Head
Table 10: Tail
Insights:
As observed in the above tables, there are 1473 rows and 10 columns, including 3 numerical and 7 categorical variables. Table 12 summarizes the central tendency, variability, and distribution of each numerical variable in the dataset.
Null Values:
Insights: The above table summarizes the null values, which are imputed with the mean.
Duplicate Values:
Insights: As per the above, there are 80 duplicate rows, which are dropped. This results in 1393 rows and 10 columns.
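The duplicate handling described above can be sketched as:

```python
import pandas as pd

# Toy frame: the first two rows are exact duplicates of each other.
df = pd.DataFrame({"age": [24, 24, 30], "children": [1, 1, 2]})

n_dupes = df.duplicated().sum()   # rows that repeat an earlier row
df = df.drop_duplicates().reset_index(drop=True)
```

On the survey data this count would be 80, leaving 1393 rows.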
Outliers:
Figure 9: Outliers
Insights: The above figure illustrates outliers in No_of_children_born, which are treated using a log transformation.
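A sketch of the log treatment; `log1p` is used here so that zero counts of children do not break the transform (the report does not state which log variant was applied, so this is an assumption).

```python
import numpy as np
import pandas as pd

# Toy right-skewed count column; 16 plays the role of an outlier.
s = pd.Series([0, 1, 2, 3, 16])

# log1p(x) = log(1 + x) handles the zeros that plain log cannot,
# and compresses the long right tail.
s_log = np.log1p(s)
```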
Univariate Analysis:
Insights:
Figure 12: Univariate Analysis – Numerical Data
Insights:
Multivariate Analysis:
2.2 Do not scale the data. Encode the data (having string values) for modelling. Data Split: Split the data into train and test (70:30). Apply Logistic Regression, LDA (linear discriminant analysis) and CART.
Encoding Data:
Table 16: Encoded Data Type
Insight: The above table shows the change in data types from object to float and int after encoding.
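One way to produce such an encoding is via pandas category codes; the column names below are illustrative, not the survey's exact headers.

```python
import pandas as pd

# Hypothetical categorical columns standing in for the survey data.
df = pd.DataFrame({
    "Wife_education": ["Primary", "Tertiary", "Secondary"],
    "Media_exposure": ["Exposed", "Not-Exposed", "Exposed"],
})

# Convert every object column to integer category codes for modelling.
for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].astype("category").cat.codes

all_numeric = df.dtypes.apply(pd.api.types.is_numeric_dtype).all()
```

Category codes assign integers in alphabetical order of the labels; for ordinal variables such as education level, an explicit mapping may be preferable.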
Figure 16: Bivariate Analysis – Heatmap
Multicollinearity Verification:
Insight:
Based on the VIF values, multicollinearity is not severe in the regression model. Most of the variables have VIF values below 2, which is acceptable, as it indicates only moderate correlation among the predictors.
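The VIF check can be sketched with statsmodels; with independent synthetic predictors, every VIF comes out near 1.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(3)
# Three independent predictors, so no multicollinearity is present.
X = pd.DataFrame(rng.normal(size=(500, 3)), columns=["a", "b", "c"])

# VIF for column i = 1 / (1 - R^2) from regressing column i on the rest.
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
```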
Table 18: Logistic Regression Accuracy
Insights:
The Logistic Regression model achieved an accuracy of approximately 67.03%, meaning it correctly
classified about 67.03% of the total instances.
Insight: The LDA model achieved an accuracy of approximately 66.31%, slightly lower than the
Logistic Regression model.
Figure 17: Confusion Matrix Training Data
Insight:
An accuracy of 0.70 on the training data is reasonable, indicating that the model is learning from the training data. The accuracy of 0.66 on the test data is lower than the training accuracy, which suggests that the model may be overfitting to the training data.
2.3 Performance Metrics: Check the performance of predictions on the train and test sets using accuracy and confusion matrices; plot the ROC curve and get the ROC_AUC score for each model. Final Model: Compare the models and write an inference on which model is best/optimized.
Insight:
Based on the evaluation metrics, the Logistic Regression model outperforms the CART model for this classification task. It has a higher test accuracy (67.03% vs. 64.16%) and a better ROC AUC score (0.719 vs. 0.688). In the confusion matrices, the Logistic Regression model also has fewer false positives and false negatives, indicating better predictive performance.
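The confusion matrix and ROC AUC for one model can be sketched as follows; note that `roc_auc_score` needs the positive-class probabilities, not the hard labels.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score

# Synthetic binary data standing in for the survey frame.
X, y = make_classification(n_samples=400, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=2)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_te, clf.predict(X_te))
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
```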
2.4 Inference: Based on these predictions, what are the insights and recommendations? Explain and summarise the various steps performed in this project. There should be proper business interpretation and actionable insights present.
Project Summary:
We analyzed data from a Contraceptive Prevalence Survey to predict contraceptive method usage
among married women. We performed data preprocessing, handling null values, duplicates, and
outliers. Univariate and bivariate analyses revealed insights about the demographic and socio-
economic factors influencing contraceptive usage. We applied Logistic Regression and Linear
Discriminant Analysis (LDA) models for prediction. Logistic Regression outperformed LDA with a
higher accuracy (67.03% vs. 66.31%) and ROC AUC score (0.719 vs. 0.688). The key insight is that
demographic and socio-economic factors can help predict contraceptive use.
The analysis suggests that demographic and socio-economic factors significantly influence
contraceptive usage among married women. Implement targeted family planning campaigns
considering these factors to increase contraceptive adoption rates, ensuring better reproductive health
outcomes and informed family planning choices for women.