
Business Report of

Predictive Modelling

July Batch 23.A


Arpita_Sarkar
17/12/2023
Problem 1 - Define the problem and perform exploratory Data Analysis
- Problem definition - Check shape, Data types, statistical summary - Univariate analysis - Multivariate
analysis - Use appropriate visualizations to identify the patterns and insights - Key meaningful
observations on individual variables and the relationship between variables

Problem 1 - Data Pre-processing


Prepare the data for modelling: - Missing Value Treatment (if needed) - Outlier Detection (treat, if needed)
- Feature Engineering - Encode the data - Train-test split

Problem 1 - Model Building - Linear regression


- Apply linear Regression using Sklearn - Using Statsmodels Perform checks for significant variables
using the appropriate method - Create multiple models and check the performance of Predictions on Train
and Test sets using Rsquare, RMSE & Adj Rsquare.

Problem 1 - Business Insights & Recommendations


- Comment on the Linear Regression equation from the final model and impact of relevant variables
(at least 2) as per the equation - Conclude with the key takeaways (actionable insights and
recommendations) for the business

Problem 2 - Define the problem and perform exploratory Data Analysis


- Problem definition - Check shape, Data types, statistical summary - Univariate analysis - Multivariate
analysis - Use appropriate visualizations to identify the patterns and insights - Key meaningful
observations on individual variables and the relationship between variables

Problem 2 - Data Pre-processing


Prepare the data for modelling: - Missing Value Treatment (if needed) - Outlier Detection (treat, if needed) -
Feature Engineering (if needed) - Encode the data - Train-test split

Problem 2 - Model Building and Compare the Performance of the Models


- Build a Logistic Regression model - Build a Linear Discriminant Analysis model - Build a CART model
- Prune the CART model by finding the best hyperparameters using GridSearch - Check the performance
of the models across train and test set using different metrics - Compare the performance of all the models
built and choose the best one with proper rationale

Problem 2 - Business Insights & Recommendations


- Comment on the importance of features based on the best model - Conclude with the key takeaways
(actionable insights and recommendations) for the business

Business Report Quality


- Adhere to the business report checklist

Problem 1 - Define the problem and perform exploratory Data Analysis - Problem definition -
Check shape, Data types, statistical summary - Univariate analysis - Multivariate analysis - Use
appropriate visualizations to identify the patterns and insights - Key meaningful observations on
individual variables and the relationship between variables

First few rows & columns from the dataset

Last few rows & columns from the dataset

The dataset contains 8192 rows and 22 columns.

Data types: float64(13), int64(8), object(1)

Statistical summary
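The shape, data-type and summary checks above can be sketched with pandas as follows. This is a minimal illustration, not the report's own code: the small DataFrame is a hypothetical stand-in for the real data, and the column name label is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real dataset (which has 8192 rows and 22 columns).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "scall": rng.integers(100, 5000, size=10),   # integer column
    "freemem": rng.uniform(50, 9000, size=10),   # float column
    "label": ["a", "b"] * 5,                     # non-numeric column (assumed name)
})

shape = df.shape                          # (number of rows, number of columns)
dtype_counts = df.dtypes.value_counts()   # how many columns there are of each dtype
summary = df.describe()                   # count, mean, std, min, quartiles, max (numeric columns only)

print(shape)
print(dtype_counts)
print(summary)
```

By default describe() summarises only the numeric columns, so the non-numeric column does not appear in the summary table.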

Check the types of the data in each column

The dataset has missing values in two columns, rchar and wchar. These missing values
need to be removed from the data.
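A minimal sketch of how the missing values could be counted and the affected rows removed with pandas. The toy values are made up for illustration, and dropping rows (rather than imputing) is assumed from the wording above.

```python
import numpy as np
import pandas as pd

# Toy data: one missing value in each of rchar and wchar (values are illustrative).
df = pd.DataFrame({
    "rchar": [1200.0, np.nan, 830.0, 455.0],
    "wchar": [300.0, 510.0, np.nan, 120.0],
    "usr": [90, 88, 95, 79],
})

missing_per_column = df.isnull().sum()   # number of missing values in each column
print(missing_per_column)

# Drop every row that has a missing value in either column.
df_clean = df.dropna(subset=["rchar", "wchar"]).reset_index(drop=True)
print(df_clean.shape)
```

If only a small share of rows is affected, dropping them loses little information; otherwise imputation (for example with the column median) may be the safer choice.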

Univariate Analysis

The histograms above present the univariate analysis of the data. The variables scall,
usr, vflt, freeswap and pflt take high values compared with the other variables.

Bivariate analysis

All of the graphs above show a negative correlation between the plotted variables.

From the bar graph above, we can observe that usr values of 1 and 2 are the lowest, 99 is
the highest, and 89 is the median in this data.

The bar plot above shows many fluctuations between usr and freemem, but we can
conclude that usr 99 is the highest and usr 1 is the lowest in this dataset.

The graph above shows a positive relationship between iwrite and iread.

The graph above shows a positive relationship between iread and iwrite.


The graph above shows a positive correlation.

From the graph above, we can see a positive correlation between vflt and pflt.

Multivariate Analysis

We can see a high positive correlation between several pairs of variables:

1. pflt with fork: 0.93

2. vflt with fork: 0.94

3. vflt with pflt: 0.94

There are also strong negative correlations between several pairs of variables:

1. usr with scall: -0.67

2. usr with fork: -0.75

3. usr with vflt: -0.81
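Pairs like the ones listed above can be read off a heatmap, or extracted programmatically from the correlation matrix. The sketch below is one possible way to do it: the data is synthetic, the helper strong_pairs and the 0.6 threshold are hypothetical choices, and the printed values are not the report's results.

```python
import numpy as np
import pandas as pd

# Synthetic data built so that vflt rises with fork and usr falls with fork.
rng = np.random.default_rng(1)
fork = rng.uniform(0, 20, size=200)
df = pd.DataFrame({
    "fork": fork,
    "vflt": 15 * fork + rng.normal(0, 5, size=200),
    "usr": 99 - 2 * fork + rng.normal(0, 5, size=200),
    "noise": rng.normal(0, 1, size=200),   # unrelated column
})

corr = df.corr()   # pairwise Pearson correlations

def strong_pairs(corr, threshold=0.6):
    """Return (col_a, col_b, r) for each pair with |r| >= threshold, strongest first."""
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):   # upper triangle only, so each pair appears once
            r = float(corr.iloc[i, j])
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], round(r, 2)))
    return sorted(pairs, key=lambda p: -abs(p[2]))

pairs = strong_pairs(corr)
for a, b, r in pairs:
    print(f"{a} with {b}: {r}")
```

Strongly correlated predictors such as these are worth noting before fitting a linear regression, because multicollinearity makes individual coefficients harder to interpret.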

Problem 1 - Data Pre-processing

Prepare the data for modelling: - Missing Value Treatment (if needed) - Outlier Detection
(treat, if needed) - Feature Engineering - Encode the data - Train-test split

The missing values have now been removed successfully. No duplicate rows or columns are
present.

We can observe that five columns consist of more than 50% zero values, so we
drop these five columns: pgout, ppgout, pgfree, pgscan and atch.
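The zero-value check can be sketched as follows, assuming the share of zeros is computed per column and compared against the 50% cutoff mentioned above. The toy values are illustrative only.

```python
import pandas as pd

# Toy data: pgout and atch are mostly zeros, scall is not.
df = pd.DataFrame({
    "pgout": [0, 0, 0, 4.0],
    "atch": [0, 0, 1.2, 0],
    "scall": [1500, 0, 2100, 980],
})

zero_share = (df == 0).mean()                        # fraction of zeros in each column
to_drop = zero_share[zero_share > 0.5].index.tolist()  # columns above the 50% cutoff
df_reduced = df.drop(columns=to_drop)

print(zero_share)
print(to_drop)
```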

The columns with mostly zero values were dropped successfully.

Outliers are present in this dataset. We need to remove them to clean the
data.

All outliers were removed successfully.
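The report does not state which outlier rule was used. A common choice is the 1.5 x IQR rule, sketched below under that assumption. The helper iqr_bounds and the toy values are hypothetical.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Lower and upper fences of the k * IQR rule for one column."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy data: the value 5000 in freemem is an obvious outlier.
df = pd.DataFrame({
    "freemem": [120, 135, 128, 140, 131, 5000],
    "scall": [900, 950, 1010, 980, 940, 960],
})

# Keep only the rows that lie inside the fences for every column.
mask = pd.Series(True, index=df.index)
for col in df.columns:
    low, high = iqr_bounds(df[col])
    mask &= df[col].between(low, high)

df_no_outliers = df[mask]
print(df_no_outliers)
```

An alternative that keeps all rows is to cap values at the fences with df[col].clip(low, high) instead of filtering.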

Train test split

Import the train_test_split function and split X and y into training and test sets in a 70:30 ratio.

We drop the usr column to form the predictors and show the first few rows of the data.
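A minimal sketch of the 70:30 split with scikit-learn, assuming usr is the target and the remaining columns are the predictors. The data is synthetic, and random_state=1 is an arbitrary choice made only for reproducibility.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned dataset.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "scall": rng.uniform(100, 5000, size=100),
    "freemem": rng.uniform(50, 9000, size=100),
    "usr": rng.uniform(0, 99, size=100),
})

X = df.drop(columns="usr")   # predictors
y = df["usr"]                # target (assumed)

# test_size=0.30 gives the 70:30 train-test ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

print(X_train.shape, X_test.shape)
print(X_train.head())
```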

Problem 1 - Model Building - Linear regression

- Apply linear Regression using Sklearn - Using Statsmodels Perform checks for significant
variables using the appropriate method - Create multiple models and check the
performance of Predictions on Train and Test sets using Rsquare, RMSE & Adj Rsquare.

Linear regression

Coefficients of each independent attribute

The intercept for our model is 97.01594314241657

R Square for training data

R Square for testing data

RMSE training data

RMSE for testing data

Adjusted R square

The adjusted R-squared is 0.776.
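The metrics listed above (R-squared, RMSE and adjusted R-squared on train and test sets) can be computed as in the sketch below. It uses synthetic data with made-up coefficients, so its numbers are not the report's results. The helper adjusted_r2 is a hypothetical name implementing the standard formula 1 - (1 - R^2)(n - 1)/(n - p - 1), where n is the number of rows and p the number of predictors.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data: three predictors, a linear signal plus noise (all values made up).
rng = np.random.default_rng(7)
X = rng.uniform(0, 10, size=(300, 3))
y = 50.0 - 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(0, 1.0, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)
model = LinearRegression().fit(X_train, y_train)

def adjusted_r2(r2, n_rows, n_features):
    """Adjusted R-squared: penalises R-squared for the number of predictors."""
    return 1 - (1 - r2) * (n_rows - 1) / (n_rows - n_features - 1)

r2_train = model.score(X_train, y_train)   # R-squared on training data
r2_test = model.score(X_test, y_test)      # R-squared on test data
rmse_train = np.sqrt(mean_squared_error(y_train, model.predict(X_train)))
rmse_test = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
adj_r2_train = adjusted_r2(r2_train, *X_train.shape)

print("intercept:", model.intercept_)
print("coefficients:", model.coef_)
print("R2 train/test:", r2_train, r2_test)
print("RMSE train/test:", rmse_train, rmse_test)
print("Adjusted R2 (train):", adj_r2_train)
```

Train and test scores that sit close together suggest the model is not overfitting; a large gap would call for fewer or different predictors.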

Problem 1 - Business Insights & Recommendations


- Comment on the Linear Regression equation from the final model and impact of relevant
variables (at least 2) as per the equation - Conclude with the key takeaways (actionable
insights and recommendations) for the business

Problem 2 - Define the problem and perform exploratory Data Analysis

- Problem definition - Check shape, Data types, statistical summary - Univariate analysis -
Multivariate analysis - Use appropriate visualizations to identify the patterns and insights -
Key meaningful observations on individual variables and the relationship between
variables

Data types from the dataset

Few random rows and columns from the dataset

We can observe that missing values are present in two columns of this dataset.

Univariate analysis

From the histogram above, we can observe that most wives are between 25 and 30 years old, and
15 is the lowest age.

From the histogram above, we can conclude that about 590 women have a tertiary
education, which is the largest group in the data. About 190 women are
uneducated, about 320 have a primary education and about 400 a secondary education.
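Category counts like these can also be read directly from the data rather than estimated from a chart. The sketch below assumes a column named Wife_education (a hypothetical name) and uses a handful of made-up rows.

```python
import pandas as pd

# Made-up sample of education levels; the column name is an assumption.
wife_education = pd.Series(
    ["Tertiary", "Secondary", "Tertiary", "Primary",
     "Uneducated", "Tertiary", "Secondary"],
    name="Wife_education",
)

counts = wife_education.value_counts()                          # rows per category, largest first
shares = wife_education.value_counts(normalize=True).round(2)   # same, as proportions

print(counts)
print(shares)
```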

The graph above shows the education level of the husbands. Here too, the most common
education level is tertiary, at about 980 in the dataset, and the least common is
uneducated, at about 20.

From the graph above, we can see that 2 is the most common number of children born and 4 the
second most common, while 16 is the least common.

From the above, we can observe that the religion of most wives is Scientology.

Most of the wives are not working.

Husband occupation category 3.0 is the most frequent and 4.0 is the least frequent.

Most people have a very high standard of living.

From the graph above, we can see that more respondents are exposed to media than not exposed.

From the graph above, we can conclude that the number of respondents using a contraceptive method is high.

Multivariate analysis
From the heatmap above, the correlation between the number of children born and the wife's age
is comparatively high, at 0.54.
