Predictive Modelling
Problem 1 - Define the problem and perform exploratory Data Analysis - Problem definition -
Check shape, Data types, statistical summary - Univariate analysis - Multivariate analysis - Use
appropriate visualizations to identify the patterns and insights - Key meaningful observations on
individual variables and the relationship between variables
Data types: float64(13), int64(8), object(1)
Statistical summary
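The shape, data type and summary checks listed above can be sketched as follows. This is a minimal illustration on a small made-up table, not the report's actual data: the values and the `state` column are hypothetical, and only `usr` and `freemem` are names taken from the report.

```python
import pandas as pd

# Hypothetical stand-in for the real data set (values are invented).
df = pd.DataFrame({
    "usr": [90, 88, 95, 97],
    "freemem": [4670.0, 7278.0, 702.0, 7248.0],
    "state": ["a", "b", "b", "a"],  # hypothetical text column
})

shape = df.shape                         # (number of rows, number of columns)
dtype_counts = df.dtypes.value_counts()  # how many columns share each data type
summary = df.describe()                  # count, mean, std, min, quartiles, max (numeric columns only)

print(shape)
print(dtype_counts)
print(summary)
```

By default `describe()` summarises only the numeric columns, so the text column does not appear in the summary table.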
Check the types of the data in each column
From the dataset we can see that missing values are present in two columns, rchar and
wchar. We need to remove these missing values from the data.
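One way to locate the columns with missing values is to count the nulls per column. The sketch below assumes a pandas DataFrame and uses invented values; only the column names rchar, wchar and usr come from the report.

```python
import numpy as np
import pandas as pd

# Hypothetical rows for illustration; NaN marks a missing value.
df = pd.DataFrame({
    "rchar": [40671.0, np.nan, 448.0, np.nan],
    "wchar": [53995.0, 270.0, np.nan, 8670.0],
    "usr": [95, 97, 87, 98],
})

missing = df.isnull().sum()                              # missing values per column
cols_with_missing = missing[missing > 0].index.tolist()  # columns that need treatment

print(missing)
print(cols_with_missing)
```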
Univariate Analysis
The above histograms present the univariate analysis of the data. The values of scall,
usr, vflt, freeswap and pflt are high compared to the other variables.
Bivariate analysis
All of the above graphs show a negative correlation between the variables plotted.
From the above bar graph we can observe that usr values of 1 and 2 are the lowest, 99 is
the highest and 89 is the median in this data.
From the above bar plot we can see many fluctuations between usr and freemem, but we can
conclude that usr 99 is the highest and usr 1 is the lowest in this dataset.
From the above graph we can see a positive relationship between iwrite and iread.
From the above graph we can see a positive relationship between iread and iwrite.
We can see a positive correlation in the above graph.
From the above graph, we can see a positive correlation between vflt and pflt.
Multivariate Analysis
We can see a high correlation between many of the variables.
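To turn this visual impression into a list, the highly correlated pairs can be read off the correlation matrix. The sketch below is an illustration under assumptions: the data values are invented and the 0.8 cut-off is an arbitrary choice, not one stated in the report.

```python
import numpy as np
import pandas as pd

# Hypothetical values; only the column names come from the report.
df = pd.DataFrame({
    "pflt": [10.0, 20.0, 30.0, 40.0, 50.0],
    "vflt": [12.0, 25.0, 33.0, 41.0, 58.0],
    "freemem": [5.0, 1.0, 4.0, 2.0, 3.0],
})

corr = df.corr()  # Pearson correlation matrix

# Keep only the upper triangle so each pair is listed once and the diagonal is skipped.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna()

threshold = 0.8  # assumed cut-off for "high" correlation
high_pairs = pairs[pairs.abs() > threshold]

print(high_pairs)
```

Using the absolute value means strong negative correlations are reported as well as strong positive ones.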
Problem 1 - Data Pre-processing
Prepare the data for modelling: - Missing Value Treatment (if needed) - Outlier Detection
(treat, if needed) - Feature Engineering - Encode the data - Train-test split
The missing values have now been removed successfully. No duplicated rows or columns are
present.
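A minimal sketch of this step, assuming the missing values are handled by dropping the affected rows (the report says the values were removed but does not show the call). The data values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the second row has a missing rchar value.
df = pd.DataFrame({
    "rchar": [40671.0, np.nan, 448.0, 1201.0],
    "wchar": [53995.0, 270.0, 8385.0, 3750.0],
    "usr": [95, 97, 87, 98],
})

clean = df.dropna()  # drop every row that contains a missing value

n_missing = int(clean.isnull().sum().sum())       # remaining missing values
n_dup_rows = int(clean.duplicated().sum())        # fully duplicated rows
n_dup_cols = int(clean.columns.duplicated().sum())  # repeated column names

print(n_missing, n_dup_rows, n_dup_cols)
```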
We can observe that five columns have more than 50% zero values, so we need to
drop these five columns: pgout, ppgout, pgfree, pgscan and atch.
All the unsuitable columns were dropped successfully.
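The zero-value check described above can be sketched by computing the share of zeros in each column and dropping those above 50%. The values below are invented, and only two of the five dropped columns are shown to keep the example short.

```python
import pandas as pd

# Hypothetical values; column names come from the report.
df = pd.DataFrame({
    "pgout": [0.0, 0.0, 0.0, 1.2],
    "atch": [0.0, 0.0, 0.6, 0.0],
    "pflt": [16.0, 0.0, 70.4, 52.4],
    "usr": [95, 97, 87, 98],
})

zero_share = (df == 0).mean()                         # fraction of zeros per column
to_drop = zero_share[zero_share > 0.5].index.tolist()  # columns that are mostly zero
reduced = df.drop(columns=to_drop)

print(zero_share)
print(to_drop)
```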
Outliers are present in this data set. We need to remove all the outliers and clean the
data.
All the outliers were removed successfully.
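The report does not state which outlier rule was used. A common choice is the interquartile range (IQR) rule, sketched below as an assumption: rows outside Q1 - 1.5*IQR to Q3 + 1.5*IQR are dropped. The helper name `remove_outliers_iqr` and the data values are hypothetical.

```python
import pandas as pd

def remove_outliers_iqr(data, columns, k=1.5):
    """Keep rows whose values lie within [Q1 - k*IQR, Q3 + k*IQR] for every listed column."""
    keep = pd.Series(True, index=data.index)
    for col in columns:
        q1 = data[col].quantile(0.25)
        q3 = data[col].quantile(0.75)
        iqr = q3 - q1
        keep &= data[col].between(q1 - k * iqr, q3 + k * iqr)
    return data[keep]

# Hypothetical values; 5000.0 is an obvious outlier in freemem.
df = pd.DataFrame({
    "freemem": [100.0, 110.0, 120.0, 130.0, 140.0, 5000.0],
    "usr": [90, 91, 92, 93, 94, 95],
})

cleaned = remove_outliers_iqr(df, ["freemem"])
print(cleaned)
```

Dropping rows reduces the sample size; capping values at the two bounds is an alternative when losing rows is not acceptable.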
Import train_test_split from scikit-learn and split X and y into training and test sets in a 70:30 ratio.
We drop the usr column to form X and show the first few rows of the data.
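The split can be sketched as below. It assumes usr is the target y (the report drops it from the predictors but does not say so explicitly); the data values and the `random_state` are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data with ten rows; column names come from the report.
df = pd.DataFrame({
    "freemem": [float(v) for v in range(100, 110)],
    "pflt": [float(v) for v in range(10, 20)],
    "usr": list(range(90, 100)),
})

X = df.drop(columns=["usr"])  # predictors: every column except usr
y = df["usr"]                 # assumption: usr is the target variable

# 70:30 split; random_state is fixed only to make the example repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

print(X.head())
print(len(X_train), len(X_test))
```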
- Apply Linear Regression using sklearn - Using statsmodels, perform checks for significant
variables using the appropriate method - Create multiple models and check the
performance of predictions on train and test sets using R-squared, RMSE & Adjusted R-squared.
Linear regression
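A minimal sketch of fitting a linear regression with scikit-learn and checking R-squared and RMSE on both sets. The data here is synthetic (generated with NumPy) rather than the report's data, so the printed scores will not match the report's results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: three predictors with known coefficients plus noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = 80 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

model = LinearRegression().fit(X_train, y_train)

# Compare performance on the training and test sets.
for name, X_part, y_part in [("train", X_train, y_train), ("test", X_test, y_test)]:
    pred = model.predict(X_part)
    r2 = r2_score(y_part, pred)
    rmse = np.sqrt(mean_squared_error(y_part, pred))
    print(f"{name}: R-squared={r2:.3f}, RMSE={rmse:.3f}")
```

Train and test scores that are close to each other suggest the model is not overfitting.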
Adjusted R-squared
We can see that the adjusted R-squared is 0.776.
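For reference, adjusted R-squared corrects R-squared for the number of predictors: 1 - (1 - R^2) * (n - 1) / (n - p - 1), where n is the number of rows and p the number of predictors. The sketch below uses a hypothetical helper name and made-up inputs, not the values behind the 0.776 reported above.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    denom = n_samples - n_features - 1
    if denom <= 0:
        raise ValueError("n_samples must be greater than n_features + 1")
    return 1.0 - (1.0 - r2) * (n_samples - 1) / denom

# Hypothetical inputs: R-squared of 0.80 from 101 rows and 5 predictors.
example = adjusted_r2(0.80, n_samples=101, n_features=5)
print(round(example, 4))
```

Adding predictors that do not improve the fit lowers the adjusted value, which is why it is preferred when comparing models with different numbers of variables.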
- Problem definition - Check shape, Data types, statistical summary - Univariate analysis -
Multivariate analysis - Use appropriate visualizations to identify the patterns and insights -
Key meaningful observations on individual variables and the relationship between
variables
We can observe that missing values are present in two columns of this data set.
Univariate analysis
We can observe from the above histogram that most wives are aged between 25 and 30, and
15 is the lowest age.
From the above histogram we can conclude that about 590 women have a tertiary education,
which is the highest count in the data. About 190 women are uneducated,
320 have a primary education and 400 have a secondary education.
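Counts like these can be read directly from the data rather than estimated from the plot. The sketch below assumes a pandas DataFrame; the column name `Wife_education` and the seven rows are hypothetical and do not reproduce the counts above.

```python
import pandas as pd

# Hypothetical sample of an education column.
df = pd.DataFrame({
    "Wife_education": [
        "Tertiary", "Secondary", "Tertiary", "Primary",
        "Uneducated", "Tertiary", "Secondary",
    ]
})

counts = df["Wife_education"].value_counts()                         # rows per category
shares = df["Wife_education"].value_counts(normalize=True).round(2)  # share per category

print(counts)
print(shares)
```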
From the above graph we can see the education level of the husbands. Here too the most
common education level is tertiary, at about 980 in the data set, and the least common is
uneducated, at about 20.
From the above graph we can see that 2 and 4 are the most common and second most common
numbers of children born, respectively, and 16 is the least common.
From the above we can observe that the religion of most wives is Scientology.
Husband occupation category 3.0 has the highest count and 4.0 has the lowest.
From the above graph we can see that the number of women exposed to media is higher than the number not exposed.
From the above graph we can conclude that the number of women using a contraceptive method is high.
Multivariate analysis
From the above heatmap, the correlation between the number of children born and the wife's
age is relatively high, at 0.54.