Arpita - Sarkar - Business - Report - 17th December, 2023

Business Report of
Predictive Modelling
July Batch 23.A

Arpita_Sarkar
17/12/2023
Problem 1 - Define the problem and perform exploratory Data Analysis
- Problem definition - Check shape, Data types, statistical summary - Univariate analysis - Multivariate
analysis - Use appropriate visualizations to identify the patterns and insights - Key meaningful
observations on individual variables and the relationship between variables
Problem 1 - Data Pre-processing

Prepare the data for modelling: - Missing Value Treatment (if needed) - Outlier Detection (treat, if needed)
- Feature Engineering - Encode the data - Train-test split
Problem 1- Model Building - Linear regression

- Apply linear Regression using Sklearn - Using Statsmodels Perform checks for significant variables
using the appropriate method - Create multiple models and check the performance of Predictions on Train
and Test sets using Rsquare, RMSE & Adj Rsquare.
Problem 1 - Business Insights & Recommendations

- Comment on the Linear Regression equation from the final model and impact of relevant variables
(at least 2) as per the equation - Conclude with the key takeaways (actionable insights and
recommendations) for the business
Problem 2 - Define the problem and perform exploratory Data Analysis

- Problem definition - Check shape, Data types, statistical summary - Univariate analysis - Multivariate
analysis - Use appropriate visualizations to identify the patterns and insights - Key meaningful
observations on individual variables and the relationship between variables
Problem 2 - Data Pre-processing

Prepare the data for modelling: - Missing value Treatment (if needed) - Outlier Detection (treat, if needed) -
Feature Engineering (if needed) - Encode the data - Train-test split
Problem 2 - Model Building and Compare the Performance of the Models

- Build a Logistic Regression model - Build a Linear Discriminant Analysis model - Build a CART model
- Prune the CART model by finding the best hyperparameters using GridSearch - Check the performance
of the models across train and test set using different metrics - Compare the performance of all the models
built and choose the best one with proper rationale
Problem 2 - Business Insights & Recommendations

- Comment on the importance of features based on the best model - Conclude with the key takeaways
(actionable insights and recommendations) for the business
Business Report Quality

- Adhere to the business report checklist
Problem 1 - Define the problem and perform exploratory Data Analysis - Problem definition -
Check shape, Data types, statistical summary - Univariate analysis - Multivariate analysis - Use
appropriate visualizations to identify the patterns and insights - Key meaningful observations on
individual variables and the relationship between variables
First few rows & columns from the dataset
Last few rows & columns from the dataset
The dataset has 8192 rows and 22 columns.
Data types: float64(13), int64(8), object(1)
Statistical summary
Check the types of the data in each column
From the dataset we can see that missing values are present in two columns, rchar and
wchar. We need to remove these missing values from the data.
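The missing-value check and removal described above can be sketched as follows. This is a minimal illustration on a made-up four-row frame, not the actual dataset; only the column names rchar, wchar and usr come from the report, and dropping rows is assumed to be the removal method.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for the dataset; the values are made up.
df = pd.DataFrame({
    "rchar": [100.0, np.nan, 250.0, 300.0],
    "wchar": [50.0, 60.0, np.nan, 80.0],
    "usr": [90, 85, 88, 95],
})

# Count the missing values in each column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Drop every row that has at least one missing value (assumed approach).
df_clean = df.dropna().reset_index(drop=True)
print(df_clean.shape)
```

Dropping rows is the simplest option; imputing with the median is a common alternative when the share of missing rows is large.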
Univariate Analysis
The histograms above present the univariate analysis of the data. The variables scall,
usr, vflt, freeswap and pflt take high values compared to the other variables.
Bivariate analysis
All of the graphs above show a negative correlation between the plotted variables.
From the bar graph above we can observe that usr values 1 and 2 are the lowest, usr 99 is
the highest, and usr 89 is the median in this data.
From the bar plot above we can see many fluctuations between usr and freemem, but we can
conclude that usr 99 is the highest and usr 1 is the lowest in this dataset.
From the graph above we can see a positive relationship between iwrite and iread.
From the graph above we can see a positive relationship between iread and iwrite.
We can see a positive correlation in the graph above.
From the graph above, we can see a positive correlation between vflt and pflt.
Multivariate Analysis
We can see a high positive correlation between several pairs of variables:
1. pflt with fork: 0.93
2. vflt with fork: 0.94
3. vflt with pflt: 0.94
There are also negative correlations between some variables:
1. usr with scall: -0.67
2. usr with fork: -0.75
3. usr with vflt: -0.81
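Strongly correlated pairs like the ones listed above can also be pulled out of the correlation matrix programmatically instead of being read off the heatmap. The sketch below uses synthetic data; the column names fork, vflt and usr are borrowed from the report, while the values, the extra noise column and the 0.6 threshold are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data: vflt follows fork closely, usr moves against it.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
df = pd.DataFrame({
    "fork": base,
    "vflt": base * 2.0 + rng.normal(scale=0.1, size=200),
    "usr": -base + rng.normal(scale=0.5, size=200),
    "noise": rng.normal(size=200),
})

corr = df.corr()

# Keep only the upper triangle so each pair is reported once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Keep pairs whose absolute correlation meets the (assumed) threshold,
# ordered from strongest to weakest.
strong = pairs[pairs.abs() >= 0.6]
strong = strong.reindex(strong.abs().sort_values(ascending=False).index)
print(strong)
```

Pairs of predictors that are this strongly correlated with each other are candidates for a multicollinearity check before fitting a linear regression.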
Problem 1 - Data Pre-processing
Prepare the data for modelling: - Missing Value Treatment (if needed) - Outlier Detection
(treat, if needed) - Feature Engineering - Encode the data - Train-test split
The missing values have now been removed successfully, and no duplicated rows or columns
are present.
We can observe that in five columns more than 50% of the values are zero, so we need to
drop these five columns: pgout, ppgout, pgfree, pgscan and atch.
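The zero-value rule above can be expressed as a short check: compute the share of zeros per column and drop the columns where it exceeds 50%. This is a sketch on a made-up frame; the column names are taken from the report and the values are hypothetical.

```python
import pandas as pd

# Hypothetical data; only the column names come from the report.
df = pd.DataFrame({
    "pgout": [0, 0, 0, 5],
    "atch": [0, 0, 1, 0],
    "fork": [1, 2, 0, 4],
    "usr": [90, 85, 88, 95],
})

# Share of zero values in each column.
zero_share = (df == 0).mean()
print(zero_share)

# Drop the columns in which more than half of the values are zero.
cols_to_drop = zero_share[zero_share > 0.5].index.tolist()
df_reduced = df.drop(columns=cols_to_drop)
print(df_reduced.columns.tolist())
```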
All the inappropriate columns have been dropped successfully.
Outliers are present in this dataset. We need to remove all the outliers and clean the
data.
All the outliers have been removed successfully.
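The report does not state which outlier method was used. One common choice is the IQR rule, where values beyond 1.5 times the interquartile range from the quartiles are capped at those limits. The sketch below assumes that approach; the function name cap_outliers_iqr and the sample values are hypothetical.

```python
import pandas as pd

def cap_outliers_iqr(df, columns, k=1.5):
    """Return a copy of df with the given columns clipped to the IQR fences."""
    capped = df.copy()
    for col in columns:
        q1 = capped[col].quantile(0.25)
        q3 = capped[col].quantile(0.75)
        iqr = q3 - q1
        lower = q1 - k * iqr
        upper = q3 + k * iqr
        # Values outside [lower, upper] are replaced by the nearest fence.
        capped[col] = capped[col].clip(lower=lower, upper=upper)
    return capped

# Made-up values with one obvious outlier (100).
df = pd.DataFrame({"scall": [10, 12, 11, 13, 12, 11, 100]})
df_capped = cap_outliers_iqr(df, ["scall"])
print(df_capped["scall"].tolist())
```

Capping keeps every row, which matters when many columns each flag different rows; dropping the flagged rows instead is the alternative if losing observations is acceptable.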
Train test split
Import train_test_split from Sklearn and split X and y into training and test sets in a 70:30 ratio.
We drop the usr column and show the first few rows of the data.
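A minimal sketch of the 70:30 split follows, assuming usr is the target variable y and the remaining columns form X. The data and the random_state value are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data with 100 rows; only the column names come from the report.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "fork": rng.uniform(0, 10, size=100),
    "vflt": rng.uniform(0, 50, size=100),
    "usr": rng.uniform(0, 99, size=100),
})

# Assumed setup: usr is the target, everything else is a predictor.
X = df.drop(columns="usr")
y = df["usr"]

# test_size=0.30 gives the 70:30 ratio; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)
print(X_train.shape, X_test.shape)
print(X_train.head())
```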
Problem 1- Model Building - Linear regression
- Apply linear Regression using Sklearn - Using Statsmodels Perform checks for significant
variables using the appropriate method - Create multiple models and check the
performance of Predictions on Train and Test sets using Rsquare, RMSE & Adj Rsquare.
Linear regression
Coefficients of each independent attribute
The intercept for our model is 97.01594314241657
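Reading the coefficients and intercept from a fitted Sklearn model can be sketched as below. The data is synthetic and built from a known linear relationship, so the feature names feature_a and feature_b and all numbers are assumptions, not the report's results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic predictors and a made-up exact linear relationship.
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 2))
y = 97.0 - 2.0 * X[:, 0] + 0.5 * X[:, 1]

model = LinearRegression()
model.fit(X, y)

# coef_ holds one coefficient per predictor, in column order.
for name, coef in zip(["feature_a", "feature_b"], model.coef_):
    print(f"The coefficient for {name} is {coef:.3f}")
print(f"The intercept for the model is {model.intercept_:.3f}")
```

Each coefficient is the expected change in the target for a one-unit increase in that predictor with the others held fixed, which is how the regression equation is interpreted in the insights section.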
R Square for training data
R Square for testing data
RMSE for training data
RMSE for testing data
Adjusted R square
We can see that the adjusted R-square is 0.776.
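For reference, the three metrics used here can be computed directly from predictions. The sketch below uses the standard definitions, with adjusted R-square = 1 - (1 - R-square)(n - 1)/(n - p - 1) for n samples and p predictors. The sample values are made up and are not the report's results.

```python
import math

def r_squared(y_true, y_pred):
    """Share of the variance in y_true explained by the predictions."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def rmse(y_true, y_pred):
    """Root mean squared error, in the units of the target."""
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(ss_res / len(y_true))

def adjusted_r_squared(r2, n_samples, n_features):
    """R-square penalised for the number of predictors in the model."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Made-up values for illustration.
y_true = [3, 5, 7, 9]
y_pred = [2, 5, 8, 9]

r2 = r_squared(y_true, y_pred)
print(r2, rmse(y_true, y_pred), adjusted_r_squared(r2, n_samples=4, n_features=1))
```

Adjusted R-square only rises when an added predictor improves the fit by more than chance would, so it is the fairer measure when comparing models with different numbers of variables.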

Problem 1 - Business Insights & Recommendations

- Comment on the Linear Regression equation from the final model and impact of relevant
variables (at least 2) as per the equation - Conclude with the key takeaways (actionable
insights and recommendations) for the business
Problem 2 - Define the problem and perform exploratory Data Analysis
- Problem definition - Check shape, Data types, statistical summary - Univariate analysis -
Multivariate analysis - Use appropriate visualizations to identify the patterns and insights -
Key meaningful observations on individual variables and the relationship between
variables
Data types from the dataset
Few random rows and columns from the dataset
We can observe that missing values are present in two columns of this dataset.
Univariate analysis
We can observe from the histogram above that most wives are between 25 and 30 years old,
and 15 is the lowest age.
From the histogram above we can conclude that approximately 590 women have tertiary
education, which is the highest count in the data. Approximately 190 women are
uneducated, 320 have primary education and 400 have secondary education.
From the graph above we can see the education level of the husbands. Here too, tertiary
is the most common level, at approximately 980 in the dataset, and uneducated is the
least common, at approximately 20.
From the graph above we can see that 2 and 4 are the most and second most frequent
numbers of children born, respectively, and 16 is the least frequent.
From the above we can observe that the religion of most of the wives is Scientology.
Most of the wives are not working.
Husband occupation category 3.0 is the most frequent and 4.0 is the least frequent.
The standard of living of most people is very high.
From the graph above we can see that more women are exposed to media than not exposed.
From the graph above we can conclude that the number of women using a contraceptive method is high.
Multivariate analysis
From the heatmap above, the correlation between the number of children born and the
wife's age is comparatively high, at 0.54.