Dresthi Paliwal12003622 Ca02

PROJECT REPORT ON- MACHINE LEARNING
CA-02
SUBMITTED BY – DRESTHI PALIWAL
REGISTRATION NO- 12003622
SECTION- Q2050
SUBMITTED TO- MRS. KRITI BEDI MA’AM
COURSE- BUSINESS ANALYTICS

ACKNOWLEDGEMENT
“It is not possible to prepare a project report without the assistance and encouragement of
other people. This one is certainly no exception.”
This report would not have been possible without the essential and gracious support of Mrs.
Kriti Bedi Ma’am, faculty of the School of Business Management at Lovely Professional
University. Her willingness to motivate me contributed tremendously to my report. I would also
like to thank her for showing me some examples related to the topic of this report.
Besides, I would like to thank Lovely Professional University for providing me with a good
environment and facilities to complete the report. It gave me an opportunity to participate and
learn about Machine Learning and its importance in the field of Business Analytics.
Finally, I would like to thank my family for their understanding and support towards me in
completing this report.
CONTENTS
SUPERVISED LEARNING
BACKGROUND OF STUDY
ABOUT DATA SET
URL
OBJECTIVE OF DATASET
DATA DESCRIPTION
DATA PROCESSING
CODE
OUTPUT
ANALYSIS
VARIABLE RESPONSE
OUTCOME
VARIABLE PREDICTOR
PREGNANCIES
GLUCOSE
BLOOD PRESSURE
SKIN THICKNESS
INSULIN
BMI
DIABETES PEDIGREE FUNCTION
AGE
BIVARIATE ASSOCIATIONS
INTERPRETATION CAN BE DONE
CORRELATIONS BETWEEN PREDICTOR VARIABLES
INTERPRETATION
TRAINING DATA
TESTING DATA
APPLYING MULTIPLE REGRESSION
INTERPRETATION
INTERPRETATION
SIMPLE LINEAR REGRESSION
INTERPRETATION
APPLYING LOGISTIC REGRESSION
INTERPRETATION OF OUTPUT
BUILD CONFUSION MATRIX
ROC CURVES (RECEIVER OPERATING CHARACTERISTIC CURVE)
INTERPRETATION OF ROC
CONCLUSION
MACHINE LEARNING
Machine learning is an application of artificial intelligence (AI) that gives a system the
ability to learn automatically and improve from past experience without being explicitly
programmed. It mainly focuses on developing computer programs that can access data and use it
to learn for themselves.
SUPERVISED LEARNING - This kind of algorithm has a target (outcome or dependent) variable
that is to be predicted from a given set of independent variables. This is done by splitting
the data into training and testing sets, training a model, and then predicting and evaluating.
Examples- Regression, Decision Tree, Random Forest, Logistic Regression.
1. Linear Regression- It is used to estimate real (continuous) values, for example the cost
of houses, the number of calls or total sales.
2. Logistic Regression- Despite its name, it is a classification method, not a regression
algorithm. It is used to predict discrete values such as 0/1, yes/no or true/false.
3. Decision Tree- It is mostly used for classification problems. It works with both
categorical and continuous dependent variables. The data is split into two or more
populations.
4. SVM- In a support vector machine, each data item is plotted as a point in n-dimensional
space, with the value of each feature being the value of a particular coordinate.
PREDICTION OF DIABETES IN
PIMA WOMEN
BACKGROUND OF STUDY
Diabetes mellitus is a group of metabolic disorders in which blood sugar levels stay higher
than normal over a long period of time. It is caused by insufficient production of insulin in
the body or by an improper response of the body's cells to insulin. It is categorised into
3 types.
1. Type 1 or insulin-dependent diabetes
2. Type 2 or non-insulin-dependent diabetes
3. Gestational diabetes
Gestational diabetes mainly occurs in women during pregnancy.
ABOUT DATA SET

This dataset originates from the National Institute of Diabetes and Digestive and Kidney
Diseases.
URL- https://www.kaggle.com/uciml/pima-indians-diabetes-database
OBJECTIVE OF DATASET
The objective is to diagnostically predict whether or not a patient has diabetes, based on
certain diagnostic measurements included in the dataset. All patients here are females at least
21 years old and of Pima Indian heritage.
DATA DESCRIPTION
Variable Name             Data Type  Variable Description
Pregnancies               integer    Number of times pregnant
Glucose                   integer    Plasma glucose concentration at 2 hours in an oral glucose tolerance test
Blood Pressure            integer    Diastolic blood pressure (mm Hg)
Skin Thickness            integer    Triceps skin fold thickness (mm)
Insulin                   integer    2-hour serum insulin (µU/ml)
BMI                       numeric    Body Mass Index (weight in kg / (height in m)^2)
DiabetesPedigreeFunction  numeric    Synthesis of the history of diabetes mellitus in relatives (genetic influence)
Age                       integer    Age of the individual (years)
Outcome                   integer    Occurrence of diabetes (1 = yes, 0 = no)
DATA PROCESSING-
At first glance the data appeared to be clean, but more detailed analysis revealed many
abnormal values for the biological measures. Skin Thickness and Insulin, for example, had 227
and 374 zero values respectively, roughly 30% and 49% of the 768 observations. These zeros are
stored as real values rather than nulls, so they are not reported as missing. Deleting the
affected rows would lose a lot of important information, so only the implausible values in the
dataset were imputed.
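The zero-value check and the imputation step can be sketched as follows. This is a minimal illustration on made-up rows, not the code from the figures. The report does not say which imputation rule was used, so replacing zeros with the median of the non-zero values is an assumption, and the helper names `count_zeros` and `impute_zeros_with_median` are hypothetical.

```python
from statistics import median

# Toy rows with made-up values (not taken from the Pima dataset).
rows = [
    {"Glucose": 148, "SkinThickness": 35, "Insulin": 0},
    {"Glucose": 85, "SkinThickness": 29, "Insulin": 94},
    {"Glucose": 183, "SkinThickness": 0, "Insulin": 0},
    {"Glucose": 89, "SkinThickness": 23, "Insulin": 168},
]

def count_zeros(rows, column):
    """Number of rows whose value in `column` is exactly zero."""
    return sum(1 for row in rows if row[column] == 0)

def impute_zeros_with_median(rows, column):
    """Return copies of the rows with zeros in `column` replaced by the
    median of the non-zero values (an assumed imputation rule)."""
    non_zero = [row[column] for row in rows if row[column] != 0]
    fill = median(non_zero)
    return [{**row, column: fill if row[column] == 0 else row[column]}
            for row in rows]

zero_counts = {column: count_zeros(rows, column) for column in rows[0]}
cleaned = impute_zeros_with_median(rows, "Insulin")
```

The original rows are left untouched, so the raw and imputed versions can be compared side by side.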
CODE
Fig 1- Introductory code
OUTPUT
Fig 1.1- Output Missing value
Fig-1.2 Structure of data

Fig- 1.2.1 Dataset
Fig 1.3 Head, Tail and summary of the data
Fig- 1.4 Outcome

Fig- 1.5 Zero value analysis
ANALYSIS
VARIABLE RESPONSE
➢ OUTCOME- 268 women were diagnosed with diabetes and 500 women did not have
diabetes, so about 35% of the sample (268 of 768) has a positive record of diabetes.
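As a quick check, the share of positive records follows directly from the two class counts quoted above. The variable names here are illustrative only.

```python
# Class counts as reported in the text above.
diabetic = 268
non_diabetic = 500

total = diabetic + non_diabetic
positive_share = diabetic / total  # fraction of positive records

print(f"{total} women, {positive_share:.1%} diagnosed with diabetes")
```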
Fig 2.0 input
Fig 2.1 output

Fig-2.3 Distribution of outcomes
VARIABLE PREDICTOR
PREGNANCIES
The graph below suggests that women who have been diagnosed with diabetes had more
pregnancies than the other women. The histogram, however, shows no significant
relationship between the number of pregnancies and the occurrence of diabetes.
Fig -2.4 Pregnancies and Glucose

Fig- 2.5 Pregnancies Output
Fig- 2.6 Histogram of Pregnancies vs Outcome
GLUCOSE
The figure below shows a clear difference in glucose levels between the women who
have been diagnosed with diabetes and those who have not. The density graphs show
how the glucose level differs between the two groups.
Fig- 2.7 Glucose variation in women
Fig- 2.8 Output Glucose
BLOOD PRESSURE
No clear difference in blood pressure is seen between women who have diabetes and
women who don't. This indicates that blood pressure might not be a good indicator to
predict diabetes.
Fig-2.9 Blood pressure output

Fig-3.0 Relationship between Blood pressure and Diabetes
SKIN THICKNESS
No clear difference between the two groups. Skin thickness is not a good predictor for
the response variable.
Fig- 3.1 Skin Thickness

Fig- 3.2 Skin thickness and diabetes
INSULIN
No clear difference is observed. Therefore, insulin is not a good predictor.

Fig- 3.3 Insulin, BMI, DPF, AGE INPUT
Fig-3.4- Insulin vs Diabetes
BMI
Women with diabetes mostly have a BMI greater than 25, which is above the normal
range, while the BMI of women without diabetes ranges from 18 to 60.
Fig- 3.5 Diabetes vs BMI
DIABETES PEDIGREE FUNCTION

No clear difference, therefore it cannot be used to predict diabetes.
Fig- 3.6 DPF Vs Diabetes
AGE
No clear difference between the two.

Fig- 3.7 Age vs Diabetes
BIVARIATE ASSOCIATIONS

INTERPRETATION CAN BE DONE
1. No significant difference can be marked between the non-diabetic and diabetic
women based on number of Pregnancies vs Age.
2. Non-diabetic women tend to show lower levels of glucose and insulin, as opposed
to diabetic women, who show high levels of both.
3. On the basis of BMI and BP values, diabetic women can be distinguished from
non-diabetic women.
4. Women with low BMI and BP values show no symptoms of diabetes.
Fig- 3.8 Code relationship between BMI with BP and BMI with skin thickness
Fig- 3.9 AGE vs Diabetes & Glucose vs Diabetes
Fig-4.0 BMI vs Diabetes & Skin Thickness vs Diabetes
CORRELATIONS BETWEEN PREDICTOR VARIABLES
A correlation plot shows the linear association between each pair of variables. As mentioned
in the bivariate associations, Insulin and Glucose, and BMI and Skin Thickness, have a
moderate-to-high linear correlation.
Fig- 4.1 Matrix of correlation among different variables
INTERPRETATION
The correlation matrix shows how strongly each pair of variables, independent as well as
dependent, changes together along a straight line.
Fig- 4.1.1 and 4.2.1 Correlation Plot
Fig- 4.3 Correlation Input
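Each cell of a correlation matrix is a Pearson correlation coefficient between two variables. The sketch below computes one such coefficient in plain Python. The `pearson` helper and the sample values are illustrative assumptions, not figures from the dataset.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson linear correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up illustrative values, not taken from the dataset.
bmi = [22.0, 27.5, 31.0, 35.5, 40.0]
skin_thickness = [15.0, 22.0, 27.0, 33.0, 41.0]
r = pearson(bmi, skin_thickness)
```

A value near +1 or -1 indicates a strong linear association, and a value near 0 indicates little or none.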

TRAINING DATA
Fig- 4.4 Training data
TESTING DATA
Fig- 4.5 Testing data
Fig- 4.6 Output of test and trained data

APPLYING MULTIPLE REGRESSION
Fig-4.7 Multiple regression
INTERPRETATION
The output shows a summary of all the variables and the outcome; it is divided into quartiles,
followed by the mean, mode and median. Generally, this kind of model is used to explain the
relationship between one dependent variable and multiple independent or predictor variables.
The equation for multiple regression is as follows-
y = b1*x1 + b2*x2 + … + bn*xn + c
where x1 to xn are the predictor variables, b1 to bn are their coefficients and c is the
intercept. Here, it depicts the relationship between BMI, Glucose and Age. It shows a strong
correlation among the variables. The graphs below are the scatterplots for the multiple
regression.
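To make the multiple regression equation concrete, the sketch below evaluates it for one observation. The coefficients and intercept are hypothetical placeholders chosen for illustration, not the fitted values from the report's output, and `predict` is an assumed helper name.

```python
def predict(coefficients, intercept, features):
    """Evaluate y = b1*x1 + b2*x2 + ... + bn*xn + c for one observation."""
    return sum(b * x for b, x in zip(coefficients, features)) + intercept

# Hypothetical coefficients for (BMI, Glucose, Age); not fitted values.
coefficients = [0.01, 0.005, 0.002]
intercept = -0.9

# One made-up observation: BMI 30, glucose 120, age 40.
y_hat = predict(coefficients, intercept, [30.0, 120.0, 40.0])
```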
Fig-4.8 & 4.9 relationship between the level of Glucose and BMI, relationship between age
and BMI
Fig- 5 Histogram BMI
INTERPRETATION
The histogram above shows the BMI values and how frequently they occur among the women. It
suggests that the higher the BMI, the greater the chance of a woman having diabetes; BMI
values between 25 and 40 fall in the overweight-to-obese range. Women with a lower BMI
are less likely to have diabetes.
Fig-5.1 scatter plot showing the 4 different variable analysis
SIMPLE LINEAR REGRESSION

It has a single explanatory variable. It concerns two-dimensional sample points in which
one variable is dependent and the other is independent. The model is best suited to data
that does not have a lot of variables.
The simple linear regression formula is expressed as
Y = a + bX,
where X is the explanatory variable, Y is the dependent variable, a is the intercept and b is
the slope.
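The intercept and slope of a simple linear regression can be estimated by ordinary least squares. The sketch below does this in plain Python on made-up points. The helper names `fit_line` and `r_squared` are illustrative, and the data is not from the report.

```python
def fit_line(xs, ys):
    """Least-squares estimates (a, b) for the line Y = a + b*X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def r_squared(xs, ys, a, b):
    """Share of the variance in ys explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Made-up points lying close to Y = 1 + 2X.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]
a, b = fit_line(xs, ys)
r2 = r_squared(xs, ys, a, b)
```

For a least-squares line with an intercept, this R square lies between 0 and 1, with values near 1 indicating a close fit.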

Fig-5.2 simple linear regression
INTERPRETATION
From the output we can see that the R square value is 1.30, and the correlation coefficients
show that the p-value is greater than R. The regression is run to see the relationship between
age and BMI. The intercept is 61.92, which falls in the third quartile of the dataset. However,
simple linear regression is not the best fit for this data. There are a lot of outliers in the
data, so it becomes difficult to judge the accuracy of the model.
APPLYING LOGISTIC REGRESSION
A training data set containing a random sample of 70% of the observations is used to fit a
logistic regression with “Diabetes” as the response, and the remaining 30% are predicted.
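A random 70/30 split of this kind can be sketched as below. Integers stand in for the 768 observations, and the seed value and the `train_test_split` helper are assumptions made for reproducibility, not details taken from the report.

```python
import random

def train_test_split(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

# Stand-in for the 768 observations; integers play the role of row ids.
rows = list(range(768))
train, test = train_test_split(rows)
```

Because the split is random, accuracy figures will vary slightly from one seed to another.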
Fig- 5.3. Logistic Regression

The result shows that the variables Skin Thickness, Insulin and Age are not statistically
significant for building the model. Their p-values are greater than 0.01, hence they are
removed. We use logistic regression rather than linear regression because the response is a
binary outcome, not a continuous variable. The model gives the probability of diabetes as
y = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))
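The logistic formula above maps any value of b0 + b1*x to a probability between 0 and 1. The sketch below evaluates it for a single predictor. The coefficients are hypothetical placeholders rather than fitted values, and the 0.5 cut-off is an assumed default threshold.

```python
from math import exp

def logistic(x, b0, b1):
    """Probability y = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))."""
    z = b0 + b1 * x
    return exp(z) / (1 + exp(z))

# Hypothetical coefficients for a single predictor such as glucose.
b0, b1 = -6.0, 0.05

p_low = logistic(90, b0, b1)    # z = -1.5, probability below 0.5
p_mid = logistic(120, b0, b1)   # z = 0, probability 0.5
p_high = logistic(170, b0, b1)  # z = 2.5, probability above 0.5

# Turn the probability into a class label with an assumed 0.5 threshold.
predicted_class = 1 if p_high >= 0.5 else 0
```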
Fig-5.4. Accuracy of the model
INTERPRETATION OF OUTPUT
The logistic regression model is about 74% accurate, depending on the exact training-test
split. It appears to be a good choice for prediction. The misclassification rate of the model
(1 - accuracy) is therefore about 26%.
BUILD CONFUSION MATRIX
➢ CONFUSION MATRIX- It compares the actual outcomes with the predicted ones.
            Predicted=0            Predicted=1
ACTUAL=0    True negatives (TN)    False positives (FP)
ACTUAL=1    False negatives (FN)   True positives (TP)
Sensitivity = TP / (TP + FN) (True Positive Rate)
Specificity = TN / (TN + FP) (True Negative Rate)
A model with a higher threshold has lower sensitivity but higher specificity.
A model with a lower threshold has higher sensitivity but lower specificity.
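The trade-off between sensitivity and specificity can be seen by counting the four cells of the confusion matrix at several thresholds. The labels and predicted probabilities below are made up for illustration, and the helper names `confusion_counts` and `classify` are hypothetical.

```python
def confusion_counts(actual, predicted):
    """Return (TN, FP, FN, TP) for binary labels coded 0/1."""
    pairs = list(zip(actual, predicted))
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    return tn, fp, fn, tp

def classify(probabilities, threshold):
    """Label an observation 1 when its probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]

# Made-up labels and predicted probabilities.
actual = [0, 0, 0, 0, 1, 1, 1, 1]
probabilities = [0.1, 0.3, 0.4, 0.6, 0.35, 0.55, 0.7, 0.9]

for threshold in (0.3, 0.5, 0.7):
    tn, fp, fn, tp = confusion_counts(actual, classify(probabilities, threshold))
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / len(actual)
    print(threshold, sensitivity, specificity, accuracy)
```

On this toy data, raising the threshold from 0.3 to 0.7 lowers sensitivity from 1.0 to 0.5 and raises specificity from 0.25 to 1.0.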
Fig- 5.5. Confusion Matrix
ROC CURVES (RECEIVER OPERATING CHARACTERISTIC CURVE)
The ROC curve helps us decide which threshold is best.
High threshold:
• High specificity
• Low sensitivity
Low threshold:
• Low specificity
• High sensitivity
Fig- 5.6. ROC curve
INTERPRETATION OF ROC
Each point on the ROC curve represents a sensitivity/specificity pair corresponding
to a particular decision threshold. The curve lies near the upper left corner, which
shows that the model is moderately accurate. As we move upward along the curve the
sensitivity rises, while the specificity tends to drop off.
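An ROC curve can be traced by sweeping the decision threshold and recording the true positive rate (sensitivity) against the false positive rate (1 - specificity). The sketch below does this on made-up labels and probabilities and summarises the curve by its area. The helper names `roc_points` and `auc` are illustrative, and the area measure is an addition not reported in the text.

```python
def roc_points(actual, probabilities):
    """(false positive rate, true positive rate) pairs, one per threshold."""
    positives = sum(actual)
    negatives = len(actual) - positives
    points = [(0.0, 0.0)]
    # Lowering the threshold step by step moves the curve up and to the right.
    for threshold in sorted(set(probabilities), reverse=True):
        tp = sum(1 for a, p in zip(actual, probabilities)
                 if a == 1 and p >= threshold)
        fp = sum(1 for a, p in zip(actual, probabilities)
                 if a == 0 and p >= threshold)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Made-up labels and predicted probabilities.
actual = [0, 0, 0, 0, 1, 1, 1, 1]
probabilities = [0.1, 0.3, 0.4, 0.6, 0.35, 0.55, 0.7, 0.9]

curve = roc_points(actual, probabilities)
area = auc(curve)
```

An area of 0.5 corresponds to random guessing and 1.0 to a perfect classifier, so a curve hugging the upper left corner gives an area close to 1.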
CONCLUSION
The logistic regression model was chosen as the best model for predicting the occurrence of
diabetes in PIMA women, as it reported the highest cross-validated sensitivity. The patterns
are identified using data exploration methods and validated using modelling techniques;
models such as logistic regression, classification trees, random forest and SVM are used to
analyse the same data. From the ROC curve we can determine that logistic regression is well
suited to this data.
Applying the classification machine learning model to the data gives a brief view of how well
the logistic regression model fits the data.
