
PROJECT REPORT ON- MACHINE LEARNING

CA-02

SUBMITTED BY – DRESTHI PALIWAL

REGISTRATION NO- 12003622

SECTION- Q2050

SUBMITTED TO- MRS. KRITI BEDI MA’AM

COURSE- BUSINESS ANALYTICS


ACKNOWLEDGEMENT

“It is not possible to prepare a project report without the assistance and encouragement of
other people. This one is certainly no exception”.

This report would not have been possible without the essential and gracious support of Mrs.
Kriti Bedi Ma’am, faculty member of the School of Business Management at Lovely Professional
University. Her willingness to motivate me contributed tremendously to my report. I would also
like to thank her for showing me some examples related to the topic of this report.

Besides, I would like to thank Lovely Professional University for providing me with a good
environment and facilities to complete the report. It gave me an opportunity to participate and learn
about Machine Learning and its importance in the field of Business Analytics.

Finally, I would like to thank my family for their understanding and support towards me in
completing this report.
CONTENTS

SUPERVISED LEARNING
BACKGROUND OF STUDY
ABOUT DATA SET
URL
OBJECTIVE OF DATASET
DATA DESCRIPTION
DATA PROCESSING
CODE
OUTPUT
ANALYSIS
VARIABLE RESPONSE
OUTCOME
VARIABLE PREDICTOR
PREGNANCIES
GLUCOSE
BLOOD PRESSURE
SKIN THICKNESS
INSULIN
BMI
DIABETES PEDIGREE FUNCTION
AGE
BIVARIATE ASSOCIATIONS
INTERPRETATION CAN BE DONE
CORRELATIONS BETWEEN PREDICTOR VARIABLES
INTERPRETATION
TRAINING DATA
TESTING DATA
APPLYING MULTIPLE REGRESSION
INTERPRETATION
INTERPRETATION
SIMPLE LINEAR REGRESSION
INTERPRETATION
APPLYING LOGISTIC REGRESSION
INTERPRETATION OF OUTPUT
BUILD CONFUSION MATRIX
ROC CURVES (RECEIVER OPERATING CHARACTERISTIC CURVE)
INTERPRETATION OF ROC
CONCLUSION
MACHINE LEARNING
It is an application of artificial intelligence (AI) that gives a system the ability to learn
automatically and improve from past experience without being explicitly programmed. It
mainly focuses on the development of computer programs that can access data and use it to
learn for themselves.

SUPERVISED LEARNING - This kind of algorithm has a target (the outcome or dependent
variable) which needs to be predicted from a given set of independent variables (predictors).
This is done by classifying the data, training on one part of the data, testing on the other
part, predicting and evaluating.

Examples- Regression, Decision Tree, Random Forest, Logistic Regression.

1. Linear Regression- It is used to estimate real (continuous) values, for example the
cost of houses, the number of calls, total sales etc.
2. Logistic Regression- It is a classification method, not a regression algorithm. It is used
to predict discrete values such as binary 0/1, yes/no, true/false.
3. Decision Tree- It is used mostly for classification problems. It works with both
categorical and continuous dependent variables. The data is split into two or more
groups.
4. SVM- In a support vector machine each observation is plotted as a point in
n-dimensional space, with the value of each feature being the value of a particular
coordinate.
PREDICTION OF DIABETES IN
PIMA WOMEN

BACKGROUND OF STUDY

Diabetes mellitus is a group of metabolic disorders in which blood sugar levels are
higher than normal over a prolonged period. It is caused by insufficient
production of insulin in the body or by an improper response of the body’s cells to
insulin. It is categorised into 3 types.

1. Type 1 or insulin-dependent diabetes
2. Type 2 or non-insulin-dependent diabetes
3. Gestational diabetes

Gestational diabetes mainly occurs in women during pregnancy.

ABOUT DATA SET


This dataset originates from the National Institute of Diabetes and Digestive and Kidney
Diseases.

URL- https://www.kaggle.com/uciml/pima-indians-diabetes-database

OBJECTIVE OF DATASET

The objective is to diagnostically predict whether or not a patient has diabetes, based on certain
diagnostic measurements included in the dataset. All patients here are females at least 21 years
old and of Pima Indian heritage.

DATA DESCRIPTION
Variable Name             Data Type  Variable Description

Pregnancies               integer    Number of times pregnant
Glucose                   integer    Plasma glucose concentration at 2 hours in an oral glucose tolerance test
Blood Pressure            integer    Diastolic blood pressure
Skin Thickness            integer    Triceps skin fold thickness
Insulin                   integer    2-hour serum insulin (µU/ml)
BMI                       numeric    Body Mass Index
DiabetesPedigreeFunction  numeric    Synthesis of the history of Diabetes Mellitus in relatives, genetic
Age                       integer    Age of the individual
Outcome                   integer    Occurrence of Diabetes

DATA PROCESSING-

At first glance the data appeared to be clean, but further detailed analysis revealed many
abnormal values for biological measures. Variables such as Skin Thickness and Glucose had
277 and 374 zero values respectively. These zero values constitute about 20% of the
observations in the dataset. However, these zero values are not null values; they are real values
recorded as zero. Deleting them would therefore lose a lot of important information,
so only the wrong values in the dataset were imputed.
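
The zero-value check described above can be sketched roughly as follows. This is a minimal Python illustration and not the code shown in Fig 1: the toy rows, the choice of Glucose as a column in which a zero counts as a wrong value, and the use of the median of the non-zero values as the replacement are all assumptions made for the example.

```python
from statistics import median

# Toy rows for illustration only (not taken from the real dataset).
rows = [
    {"Glucose": 148, "SkinThickness": 35, "BMI": 33.6},
    {"Glucose": 85,  "SkinThickness": 0,  "BMI": 26.6},
    {"Glucose": 0,   "SkinThickness": 23, "BMI": 28.1},
    {"Glucose": 137, "SkinThickness": 0,  "BMI": 43.1},
]

def count_zeros(rows, column):
    """Count how many rows hold a zero in the given column."""
    return sum(1 for row in rows if row[column] == 0)

def impute_zeros(rows, column):
    """Return a copy of rows where zeros in `column` are replaced by the
    median of the non-zero values (an assumed imputation rule)."""
    non_zero = [row[column] for row in rows if row[column] != 0]
    fill = median(non_zero)
    return [{**row, column: fill if row[column] == 0 else row[column]}
            for row in rows]

# Zero count per column, then imputation of the one column assumed to be wrong.
zero_counts = {column: count_zeros(rows, column) for column in rows[0]}
cleaned = impute_zeros(rows, "Glucose")
print(zero_counts)
```

The original rows are left untouched, so the zero values that are treated as real measurements stay available for the later analysis.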
CODE

Fig 1- Introductory code

OUTPUT
Fig 1.1- Output Missing value

Fig-1.2 Structure of data


Fig- 1.2.1 Dataset
Fig 1.3 Head, Tail and summary of the data

Fig- 1.4 Outcome


Fig- 1.5 Zero value analysis

ANALYSIS

VARIABLE RESPONSE
➢ OUTCOME- 268 women were diagnosed with diabetes and 500 women did not
have diabetes. About 35% of the sample (268 of 768) shows positive records of diabetes.
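
The share of positive records follows directly from the two counts. A small Python check of the arithmetic (the variable names are chosen only for this sketch):

```python
diabetic = 268
non_diabetic = 500

total = diabetic + non_diabetic            # number of women in the sample
share_diabetic = diabetic / total * 100    # percentage with a positive outcome

print(f"{total} women, {share_diabetic:.1f}% diabetic")
```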

Fig 2.0 input

Fig 2.1 output


Fig-2.3 Distribution of outcomes

VARIABLE PREDICTOR

PREGNANCIES

From the graph given below, women who have been diagnosed with diabetes had more
pregnancies than the other women. The histogram below, however, shows no significant
relationship between the number of pregnancies and the occurrence of diabetes.

Fig -2.4 Pregnancies and Glucose


Fig- 2.5 Pregnancies Output

Fig- 2.6 Histogram of Pregnancies vs Outcome

GLUCOSE

The figure below shows a clear difference in the amount of glucose present in
women who have been diagnosed with diabetes and those who have not. The density
graphs show the difference in the level of glucose and its effect.
Fig- 2.7 Glucose variation in women

Fig- 2.8 Output Glucose

BLOOD PRESSURE

No clear difference in blood pressure is seen between women who have diabetes and those who
don’t. This indicates that blood pressure might not be a good indicator to predict diabetes.

Fig-2.9 Blood pressure output


Fig-3.0 Relationship between Blood pressure and Diabetes

SKIN THICKNESS
No clear difference is seen between the two groups. Skin thickness is not a good predictor
for the response variable.

Fig- 3.1 Skin Thickness


Fig- 3.2 Skin thickness and diabetes

INSULIN

No clear difference is observed. Therefore, not a good predictor.


Fig- 3.3 Insulin, BMI, DPF, AGE INPUT
Fig-3.4- Insulin vs Diabetes

BMI

Women who have diabetes mostly have a BMI greater than 25, which is above the normal
level, whereas the BMI of women who don’t have diabetes ranges from 18 to 60.

Fig- 3.5 Diabetes vs BMI

DIABETES PEDIGREE FUNCTION


No clear difference is seen, therefore it can’t be used to predict diabetes.

Fig- 3.6 DPF Vs Diabetes

AGE

No clear difference between the two.


Fig- 3.7 Age vs Diabetes

BIVARIATE ASSOCIATIONS


INTERPRETATION CAN BE DONE
1. No significant difference can be seen between non-diabetic and diabetic
women based on number of Pregnancies vs Age.
2. Non-diabetic women show lower levels of glucose and insulin, as opposed to
diabetic women, who show high levels of both.
3. On the basis of BMI and BP values, diabetic women can be distinguished from
non-diabetic women.
4. Women with low values of BMI and BP show no symptoms of diabetes.
Fig- 3.8 Code relationship between BMI with BP and BMI with skin thickness
Fig- 3.9 AGE vs Diabetes & Glucose vs Diabetes

Fig-4.0 BMI vs Diabetes & Skin Thickness vs Diabetes

CORRELATIONS BETWEEN PREDICTOR VARIABLES

A correlation plot shows the linear association between the variables. As mentioned in the
bivariate associations, Insulin and Glucose, and BMI and Skin Thickness, have a moderate to
high linear correlation.
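
The linear association that such a plot summarises is the Pearson correlation coefficient. A minimal Python sketch of the computation is given here; the function name and the toy BMI and skin thickness values are assumptions for illustration and are not taken from the dataset or from Fig 4.3.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson linear correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Toy values for illustration only (not taken from the dataset).
bmi = [22.0, 27.5, 31.0, 35.5, 40.0]
skin_thickness = [15, 22, 29, 33, 41]

r = pearson(bmi, skin_thickness)
print(round(r, 3))
```

The coefficient always lies between -1 and 1; values near 1 or -1 indicate a strong linear association, and values near 0 indicate a weak one.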
Fig- 4.1 Matrix of correlation among different variables

INTERPRETATION

The correlation matrix shows how each variable changes with a change in another variable,
for both the independent and the dependent variables.
Fig- 4.1.1 and 4.2.1 Correlation Plot

Fig- 4.3 Correlation Input


TRAINING DATA

Fig- 4.4 Training data

TESTING DATA

Fig- 4.5 Testing data

Fig- 4.6 Output of test and trained data


APPLYING MULTIPLE REGRESSION

Fig-4.7 Multiple regression

INTERPRETATION
The output shows a summary of all the variables and the outcome; it is divided into 3 quartiles
followed by the mean, mode and median. Generally, this kind of model is used to explain the
relationship between a dependent variable and multiple independent or predictor variables.

The equation for multiple regression is as follows-

y = b1*x1 + b2*x2 + … + bn*xn + c

Here, it depicts the relationship between BMI, Glucose and Age. It shows a strong
correlation among the variables. Figs 4.8 and 4.9 show the scatterplots for
multiple regression.
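
To make the multiple regression equation concrete, the following Python sketch evaluates it for one observation. The coefficients, the intercept and the feature values are hypothetical numbers chosen only to show the arithmetic; they are not the fitted values from Fig 4.7.

```python
def predict(features, coefficients, intercept):
    """Evaluate y = b1*x1 + b2*x2 + ... + bn*xn + c for one observation."""
    return sum(b * x for b, x in zip(coefficients, features)) + intercept

# Hypothetical coefficients for BMI, Glucose and Age (not fitted values).
coefficients = [0.5, 0.25, 0.125]
intercept = -1.0

# BMI, Glucose and Age of one hypothetical woman.
features = [32.0, 120.0, 40.0]

y = predict(features, coefficients, intercept)
print(y)  # 0.5*32 + 0.25*120 + 0.125*40 - 1 = 50.0
```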
Fig-4.8 & 4.9 relationship between the level of Glucose and BMI, relationship between age
and BMI
Fig- 5 Histogram BMI

INTERPRETATION

The above-mentioned histogram shows the BMI values and their frequency among the women. It
shows that the higher the BMI, the greater the chance of a woman getting diabetes. BMI values
greater than 25 and up to 40 fall in the overweight and obese ranges, whereas women with a lower
BMI are less likely to get diabetes.
Fig-5.1 scatter plot showing the 4 different variable analysis

SIMPLE LINEAR REGRESSION


It has a single explanatory variable, and it concerns two-dimensional sample points where
one variable is dependent and the other is independent. The model is best suited for data where
we don’t have a lot of variables.

The simple linear regression formula is expressed as,

Y = a + bX,

where X is the explanatory variable and Y is the dependent variable
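
The coefficients a and b in this formula are usually estimated by least squares. The Python sketch below shows the closed-form estimates on toy points; the function name and the data are assumptions for illustration and do not reproduce the output in Fig 5.2.

```python
def fit_simple_linear(xs, ys):
    """Least-squares estimates (a, b) for Y = a + bX."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx             # slope
    a = mean_y - b * mean_x   # intercept
    return a, b

# Toy points lying exactly on Y = 2 + 3X (illustration only).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [5.0, 8.0, 11.0, 14.0]

a, b = fit_simple_linear(xs, ys)
print(a, b)
```

Because the toy points lie exactly on a line, the estimates recover an intercept of 2 and a slope of 3.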


Fig-5.2 simple linear regression

INTERPRETATION

From the output we can see that the reported R-squared value is 1.30 and that the p-value is
greater than the correlation coefficient. The regression is run to see the relationship between age
and BMI. The fitted line has an intercept of 61.92, which falls in the third quartile of the data.
However, simple linear regression is not the best fit for this machine learning task. There are
a lot of outliers in the data, thus it becomes difficult to judge the accuracy of the model.
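
For reference, R-squared is computed as 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the total sum of squares around the mean. For a least-squares line with an intercept, evaluated on the data it was fitted to, this quantity lies between 0 and 1. A minimal Python sketch with toy values (names and data are assumptions for illustration):

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    ss_tot = sum((y - mean_actual) ** 2 for y in actual)
    return 1 - ss_res / ss_tot

# Toy values for illustration only.
actual = [3.0, 5.0, 7.0, 9.0]
perfect = [3.0, 5.0, 7.0, 9.0]      # predictions that match exactly
mean_only = [6.0, 6.0, 6.0, 6.0]    # always predicting the mean of actual

r2_perfect = r_squared(actual, perfect)
r2_mean_only = r_squared(actual, mean_only)
print(r2_perfect, r2_mean_only)
```

A perfect fit gives 1, and a model that only predicts the mean gives 0.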
APPLYING LOGISTIC REGRESSION
A training data set containing a random sample of 70% of the observations is used to fit a
logistic regression with “Diabetes” as the response, and the remaining 30% are used for prediction.
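
A random 70/30 split of this kind can be sketched in Python as follows. The function name and the fixed seed are assumptions for the example, and row indices stand in for the 768 observations; this is not the code shown in Fig 4.4 and Fig 4.5.

```python
import random

def train_test_split(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of rows and split it into training and testing parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)   # seeded so the split is repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Row indices stand in for the 768 observations of the dataset.
observations = list(range(768))
train, test = train_test_split(observations)
print(len(train), len(test))
```

With 768 observations this gives 538 training rows and 230 testing rows. A different seed gives a different split, which is why the reported accuracy can vary slightly from run to run.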

Fig- 5.3. Logistic Regression


The result shows that the variables Skin Thickness, Insulin and Age are not statistically
significant for building the model: their p-values are greater than 0.01, hence they are removed.
We use logistic regression rather than linear regression because the response is a binary
variable rather than a continuous one.

y = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))
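
The formula maps any value of b0 + b1*x to a probability between 0 and 1. The Python sketch below evaluates it for a few glucose-like inputs; the coefficients b0 and b1 are hypothetical values chosen for the example and are not the fitted values from Fig 5.3.

```python
from math import exp

def logistic(x, b0, b1):
    """Probability y = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))."""
    z = b0 + b1 * x
    return exp(z) / (1 + exp(z))

# Hypothetical coefficients, not the fitted values from the report.
b0, b1 = -6.0, 0.0625

p_low = logistic(64, b0, b1)    # b0 + b1*x = -2
p_mid = logistic(96, b0, b1)    # b0 + b1*x = 0, so the probability is 0.5
p_high = logistic(160, b0, b1)  # b0 + b1*x = 4
print(round(p_low, 3), p_mid, round(p_high, 3))
```

The predicted probability is then compared with a decision threshold to assign class 0 or 1.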

Fig-5.4. Accuracy of the model

INTERPRETATION OF OUTPUT
The logistic regression model is about 74% accurate, depending on the exact training-test split.
It appears to be a good choice for prediction. The misclassification rate of the model is about 25%.

BUILD CONFUSION MATRIX

➢ CONFUSION MATRIX- It compares the actual outcomes with the predicted ones.

             Predicted=0            Predicted=1
ACTUAL=0     True negatives (TN)    False positives (FP)
ACTUAL=1     False negatives (FN)   True positives (TP)

Sensitivity = TP / (TP + FN) (True Positive Rate)

Specificity = TN / (TN + FP) (True Negative Rate)
The model with a higher threshold has lower Sensitivity but higher Specificity.
The model with a lower threshold has higher Sensitivity but lower Specificity.
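
The effect of the threshold on sensitivity and specificity can be checked with a short Python sketch. The labels, the predicted probabilities and the two thresholds are toy values chosen for illustration; they are not the results shown in Fig 5.5.

```python
def confusion_counts(actual, probabilities, threshold):
    """Return (TN, FP, FN, TP); probabilities >= threshold are predicted as 1."""
    tn = fp = fn = tp = 0
    for y, p in zip(actual, probabilities):
        predicted = 1 if p >= threshold else 0
        if y == 1 and predicted == 1:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp

def sensitivity(tn, fp, fn, tp):
    return tp / (tp + fn)   # true positive rate

def specificity(tn, fp, fn, tp):
    return tn / (tn + fp)   # true negative rate

def accuracy(tn, fp, fn, tp):
    return (tp + tn) / (tn + fp + fn + tp)

# Toy labels and predicted probabilities (illustration only).
actual        = [0,   0,   0,   0,   1,    1,    1,   1]
probabilities = [0.1, 0.3, 0.4, 0.6, 0.35, 0.55, 0.7, 0.9]

low = confusion_counts(actual, probabilities, threshold=0.3)
high = confusion_counts(actual, probabilities, threshold=0.65)
print("low threshold: ", low, sensitivity(*low), specificity(*low))
print("high threshold:", high, sensitivity(*high), specificity(*high))
```

On these toy values the low threshold catches every positive case but misclassifies most negatives, while the high threshold does the opposite. The misclassification rate is 1 minus the accuracy.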

Fig- 5.5. Confusion Matrix

ROC CURVES (RECEIVER OPERATING CHARACTERISTIC CURVE)

The ROC curve helps us decide which threshold is best.

High threshold:

• High specificity
• Low sensitivity
Low threshold:

• Low specificity
• High sensitivity
Fig- 5.6. ROC curve

INTERPRETATION OF ROC
Each point on the ROC curve represents a sensitivity/specificity pair
corresponding to a particular decision threshold. The curve is near the upper left
corner, which shows the model is moderately accurate. As we move in the upward direction
along the curve, the sensitivity rises and the specificity tends to drop off.
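
The points of an ROC curve can be obtained by sweeping the decision threshold over the predicted probabilities. The Python sketch below does this on toy values and summarises the curve by its area using the trapezoidal rule; the function names and the data are assumptions for illustration and do not reproduce Fig 5.6. The horizontal axis is the false positive rate (1 - specificity) and the vertical axis is the sensitivity.

```python
def roc_points(actual, probabilities):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    positives = sum(actual)
    negatives = len(actual) - positives
    # Start above every score (nothing predicted positive), then lower the
    # threshold through each distinct score.
    thresholds = [float("inf")] + sorted(set(probabilities), reverse=True)
    points = []
    for t in thresholds:
        tp = sum(1 for y, p in zip(actual, probabilities) if p >= t and y == 1)
        fp = sum(1 for y, p in zip(actual, probabilities) if p >= t and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Toy labels and predicted probabilities (illustration only).
actual        = [0,   0,   0,   0,   1,    1,    1,   1]
probabilities = [0.1, 0.3, 0.4, 0.6, 0.35, 0.55, 0.7, 0.9]

points = roc_points(actual, probabilities)
area = auc(points)
print(points)
print(area)
```

An area of 1 corresponds to a curve that passes through the upper left corner, and an area of 0.5 corresponds to random guessing.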

CONCLUSION

The logistic regression model was chosen as the best model for predicting the occurrence of
diabetes in PIMA women, as it reported the highest cross-validated sensitivity. The patterns
are identified using data exploration methods and validated using modelling techniques;
models such as logistic regression, classification trees, random forest and SVM are used to
analyse the same. From the ROC curve we can determine that logistic regression is best suited
to the data.

Applying a classification machine learning model to the data shows that the Logistic
Regression model gives the best fit for the data.
